<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PyImageSearch</title>
	<atom:link href="https://pyimagesearch.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://pyimagesearch.com/</link>
	<description>You can master Computer Vision, Deep Learning, and OpenCV - PyImageSearch</description>
	<lastBuildDate>Sun, 12 Apr 2026 07:20:19 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.5</generator>
	<item>
		<title>FastAPI for MLOps: Python Project Structure and API Best Practices</title>
		<link>https://pyimagesearch.com/2026/04/13/fastapi-for-mlops-python-project-structure-and-api-best-practices/</link>
		
		<dc:creator><![CDATA[Vikram Singh]]></dc:creator>
		<pubDate>Mon, 13 Apr 2026 12:45:00 +0000</pubDate>
				<category><![CDATA[FastAPI]]></category>
		<category><![CDATA[MLOps]]></category>
		<category><![CDATA[Python Development]]></category>
		<category><![CDATA[Software Engineering]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[backend development]]></category>
		<category><![CDATA[fastapi]]></category>
		<category><![CDATA[fastapi mlops]]></category>
		<category><![CDATA[ml api]]></category>
		<category><![CDATA[mlops]]></category>
		<category><![CDATA[python poetry]]></category>
		<category><![CDATA[python project structure]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[tutorial]]></category>
		<guid isPermaLink="false">https://pyimagesearch.com/?p=53431</guid>

					<description><![CDATA[<p>Table of Contents FastAPI for MLOps: Python Project Structure and API Best Practices Introduction What You Will Build and Learn Why Software Engineering Comes First in MLOps Best Practices Where This Fits in the Overall Curriculum Python Project Structure Best&#8230;</p>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/04/13/fastapi-for-mlops-python-project-structure-and-api-best-practices/">FastAPI for MLOps: Python Project Structure and API Best Practices</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity" id="TOC"/>


<div class="yoast-breadcrumbs"><span><span><a href="https://pyimagesearch.com/">Home</a></span></div>


<div class="toc">
<hr class="TOC"/>
<p class="has-large-font-size"><strong>Table of Contents</strong></p>
<ul>
    <li id="TOC-h1-FastAPI-MLOps-Python-Project-Structure-API-Best-Practices"><a rel="noopener" target="_blank" href="#h1-FastAPI-MLOps-Python-Project-Structure-API-Best-Practices">FastAPI for MLOps: Python Project Structure and API Best Practices</a></li>

    <li id="TOC-h2-Introduction"><a rel="noopener" target="_blank" href="#h2-Introduction">Introduction</a></li>
    <ul>
        <li id="TOC-h3-What-You-Will-Build-Learn"><a rel="noopener" target="_blank" href="#h3-What-You-Will-Build-Learn">What You Will Build and Learn</a></li>
        <li id="TOC-h3-Why-Software-Engineering-Comes-First-MLOps-Best-Practices"><a rel="noopener" target="_blank" href="#h3-Why-Software-Engineering-Comes-First-MLOps-Best-Practices">Why Software Engineering Comes First in MLOps Best Practices</a></li>
        <li id="TOC-h3-Where-This-Fits-Overall-Curriculum"><a rel="noopener" target="_blank" href="#h3-Where-This-Fits-Overall-Curriculum">Where This Fits in the Overall Curriculum</a></li>
    </ul>

    <li id="TOC-h2-Python-Project-Structure-Best-Practices-MLOps"><a rel="noopener" target="_blank" href="#h2-Python-Project-Structure-Best-Practices-MLOps">Python Project Structure Best Practices for MLOps</a></li>
    <ul>
        <li id="TOC-h3-How-Structure-Python-Project-src-Layout"><a rel="noopener" target="_blank" href="#h3-How-Structure-Python-Project-src-Layout">How to Structure a Python Project with src/ Layout</a></li>
        <li id="TOC-h3-Python-Project-Structure-Explained-Repository-Walkthrough"><a rel="noopener" target="_blank" href="#h3-Python-Project-Structure-Explained-Repository-Walkthrough">Python Project Structure Explained: Repository Walkthrough</a></li>
        <li id="TOC-h3-Python-Project-Structure-Best-Practices-Directory-Breakdown"><a rel="noopener" target="_blank" href="#h3-Python-Project-Structure-Best-Practices-Directory-Breakdown">Python Project Structure Best Practices: Directory Breakdown</a></li>
        <li id="TOC-h3-How-This-Structure-Scales-Larger-ML-Systems"><a rel="noopener" target="_blank" href="#h3-How-This-Structure-Scales-Larger-ML-Systems">How This Structure Scales to Larger ML Systems</a></li>
    </ul>

    <li id="TOC-h2-Managing-Python-Dependencies-Poetry-ML-Projects"><a rel="noopener" target="_blank" href="#h2-Managing-Python-Dependencies-Poetry-ML-Projects">Managing Python Dependencies with Poetry for ML Projects</a></li>
    <ul>
        <li id="TOC-h3-Python-Poetry-vs-PDM-vs-UV-Choosing-Package-Manager-MLOps"><a rel="noopener" target="_blank" href="#h3-Python-Poetry-vs-PDM-vs-UV-Choosing-Package-Manager-MLOps">Python Poetry vs PDM vs UV: Choosing a Package Manager for MLOps</a></li>
        <li id="TOC-h3-Understanding-pyproject-toml-Python-Project-Configuration"><a rel="noopener" target="_blank" href="#h3-Understanding-pyproject-toml-Python-Project-Configuration">Understanding pyproject.toml for Python Project Configuration</a></li>
        <li id="TOC-h3-Installing-Dependencies-Poetry-PDM-UV"><a rel="noopener" target="_blank" href="#h3-Installing-Dependencies-Poetry-PDM-UV">Installing Dependencies (Poetry, PDM, UV)</a></li>
        <li id="TOC-h3-Managing-Python-Virtual-Environments-Reproducible-MLOps"><a rel="noopener" target="_blank" href="#h3-Managing-Python-Virtual-Environments-Reproducible-MLOps">Managing Python Virtual Environments for Reproducible MLOps</a></li>
        <li id="TOC-h3-Automating-MLOps-Setup-Python-Environment-Scripts"><a rel="noopener" target="_blank" href="#h3-Automating-MLOps-Setup-Python-Environment-Scripts">Automating MLOps Setup with Python Environment Scripts</a></li>
    </ul>

    <li id="TOC-h2-Configuration-Management-MLOps-YAML-env-Pydantic"><a rel="noopener" target="_blank" href="#h2-Configuration-Management-MLOps-YAML-env-Pydantic">Configuration Management in MLOps: YAML, .env, and Pydantic</a></li>
    <ul>
        <li id="TOC-h3-Using-Pydantic-Settings-MLOps-Configuration-Management"><a rel="noopener" target="_blank" href="#h3-Using-Pydantic-Settings-MLOps-Configuration-Management">Using Pydantic Settings for MLOps Configuration Management</a></li>
        <li id="TOC-h3-What-This-Means-MLOps-Configuration-System-Design"><a rel="noopener" target="_blank" href="#h3-What-This-Means-MLOps-Configuration-System-Design">What This Means for MLOps Configuration and System Design</a></li>
        <li id="TOC-h3-Loading-YAML-Merging-Layers"><a rel="noopener" target="_blank" href="#h3-Loading-YAML-Merging-Layers">Loading YAML and Merging Layers</a></li>
        <li id="TOC-h3-Designing-YAML-Configs-Scalable-MLOps-Pipelines"><a rel="noopener" target="_blank" href="#h3-Designing-YAML-Configs-Scalable-MLOps-Pipelines">Designing YAML Configs for Scalable MLOps Pipelines</a></li>
        <li id="TOC-h3-Using-env-Files-Secure-MLOps-Configuration"><a rel="noopener" target="_blank" href="#h3-Using-env-Files-Secure-MLOps-Configuration">Using .env Files for Secure MLOps Configuration</a></li>
        <li id="TOC-h3-Why-Configuration-Management-Matters-MLOps-Systems"><a rel="noopener" target="_blank" href="#h3-Why-Configuration-Management-Matters-MLOps-Systems">Why Configuration Management Matters in MLOps Systems</a></li>
        <li id="TOC-h3-How-App-Uses-Configuration-src-main-py"><a rel="noopener" target="_blank" href="#h3-How-App-Uses-Configuration-src-main-py">How the App Uses Configuration (src/main.py)</a></li>
        <li id="TOC-h3-How-FastAPI-Uses-Configuration-Production-MLOps-Systems"><a rel="noopener" target="_blank" href="#h3-How-FastAPI-Uses-Configuration-Production-MLOps-Systems">How FastAPI Uses Configuration in Production MLOps Systems</a></li>
        <li id="TOC-h3-Extending-MLOps-Configuration-Safely-Python-Projects"><a rel="noopener" target="_blank" href="#h3-Extending-MLOps-Configuration-Safely-Python-Projects">Extending MLOps Configuration Safely in Python Projects</a></li>
    </ul>

    <li id="TOC-h2-Logging-Best-Practices-MLOps-FastAPI-Applications"><a rel="noopener" target="_blank" href="#h2-Logging-Best-Practices-MLOps-FastAPI-Applications">Logging Best Practices for MLOps and FastAPI Applications</a></li>
    <ul>
        <li id="TOC-h3-Why-Logging-Critical-ML-Systems"><a rel="noopener" target="_blank" href="#h3-Why-Logging-Critical-ML-Systems">Why Logging Is Critical for ML Systems</a></li>
        <li id="TOC-h3-Logger-Initialization"><a rel="noopener" target="_blank" href="#h3-Logger-Initialization">Logger Initialization</a></li>
        <li id="TOC-h3-Log-Formatting-Levels"><a rel="noopener" target="_blank" href="#h3-Log-Formatting-Levels">Log Formatting and Levels</a></li>
        <li id="TOC-h3-Logging-Across-App"><a rel="noopener" target="_blank" href="#h3-Logging-Across-App">Logging Across the App</a></li>
        <li id="TOC-h3-Structured-Traceable-Behavior-Across-App"><a rel="noopener" target="_blank" href="#h3-Structured-Traceable-Behavior-Across-App">Together, This Gives Us Structured, Traceable Behavior Across the App</a></li>
    </ul>

    <li id="TOC-h2-FastAPI-MLOps-Building-Production-ML-API"><a rel="noopener" target="_blank" href="#h2-FastAPI-MLOps-Building-Production-ML-API">FastAPI for MLOps: Building a Production ML API</a></li>
    <ul>
        <li id="TOC-h3-Why-FastAPI-Ideal-MLOps-API-Development"><a rel="noopener" target="_blank" href="#h3-Why-FastAPI-Ideal-MLOps-API-Development">Why FastAPI Is Ideal for MLOps API Development</a></li>
        <li id="TOC-h3-Creating-FastAPI-Application-Machine-Learning-APIs"><a rel="noopener" target="_blank" href="#h3-Creating-FastAPI-Application-Machine-Learning-APIs">Creating a FastAPI Application for Machine Learning APIs</a></li>
        <li id="TOC-h3-Implementing-Health-Check-Endpoints-FastAPI-MLOps"><a rel="noopener" target="_blank" href="#h3-Implementing-Health-Check-Endpoints-FastAPI-MLOps">Implementing Health Check Endpoints in FastAPI (MLOps)</a></li>
        <li id="TOC-h3-Building-FastAPI-Prediction-Endpoint-ML-Models"><a rel="noopener" target="_blank" href="#h3-Building-FastAPI-Prediction-Endpoint-ML-Models">Building a FastAPI Prediction Endpoint for ML Models</a></li>
        <li id="TOC-h3-Behind-This-Endpoint-Prediction-Engine"><a rel="noopener" target="_blank" href="#h3-Behind-This-Endpoint-Prediction-Engine">Behind This Endpoint Is Your Prediction Engine</a></li>
        <li id="TOC-h3-Deploying-FastAPI-Uvicorn-MLOps-Applications"><a rel="noopener" target="_blank" href="#h3-Deploying-FastAPI-Uvicorn-MLOps-Applications">Deploying FastAPI with Uvicorn for MLOps Applications</a></li>
        <li id="TOC-h3-Auto-Generated-API-Docs-Swagger-ReDoc"><a rel="noopener" target="_blank" href="#h3-Auto-Generated-API-Docs-Swagger-ReDoc">Auto-Generated API Docs (Swagger, ReDoc)</a></li>
    </ul>

    <li id="TOC-h2-MLOps-Architecture-Service-Layer-Design-Patterns"><a rel="noopener" target="_blank" href="#h2-MLOps-Architecture-Service-Layer-Design-Patterns">MLOps Architecture: Service Layer Design Patterns</a></li>
    <ul>
        <li id="TOC-h3-Why-Separate-Services-Routes"><a rel="noopener" target="_blank" href="#h3-Why-Separate-Services-Routes">Why We Separate Services from Routes</a></li>
        <li id="TOC-h3-Designing-ML-Inference-Service"><a rel="noopener" target="_blank" href="#h3-Designing-ML-Inference-Service">Designing an ML Inference Service</a></li>
        <li id="TOC-h3-Scaling-MLOps-Systems-Modular-Service-Architecture"><a rel="noopener" target="_blank" href="#h3-Scaling-MLOps-Systems-Modular-Service-Architecture">Scaling MLOps Systems with Modular Service Architecture</a></li>
    </ul>

    <li id="TOC-h2-Model-Abstraction-MLOps-Decoupling-ML-APIs"><a rel="noopener" target="_blank" href="#h2-Model-Abstraction-MLOps-Decoupling-ML-APIs">Model Abstraction in MLOps: Decoupling ML from APIs</a></li>
    <ul>
        <li id="TOC-h3-Designing-Python-ML-Model-Class-MLOps"><a rel="noopener" target="_blank" href="#h3-Designing-Python-ML-Model-Class-MLOps">Designing a Python ML Model Class for MLOps</a></li>
        <li id="TOC-h3-Replace-Dummy-Models-Production-ML-Models"><a rel="noopener" target="_blank" href="#h3-Replace-Dummy-Models-Production-ML-Models">How to Replace Dummy Models with Production ML Models</a></li>
        <li id="TOC-h3-Versioning-Model-Class"><a rel="noopener" target="_blank" href="#h3-Versioning-Model-Class">Versioning the Model Class</a></li>
    </ul>

    <li id="TOC-h2-Building-Reusable-Utilities-Python-MLOps-Projects"><a rel="noopener" target="_blank" href="#h2-Building-Reusable-Utilities-Python-MLOps-Projects">Building Reusable Utilities in Python MLOps Projects</a></li>
    <ul>
        <li id="TOC-h3-Loading-YAML-Configs"><a rel="noopener" target="_blank" href="#h3-Loading-YAML-Configs">Loading YAML Configs</a></li>
        <li id="TOC-h3-Adding-New-Helper-Functions"><a rel="noopener" target="_blank" href="#h3-Adding-New-Helper-Functions">Adding New Helper Functions</a></li>
    </ul>

    <li id="TOC-h2-Running-FastAPI-MLOps-Application-Locally"><a rel="noopener" target="_blank" href="#h2-Running-FastAPI-MLOps-Application-Locally">Running a FastAPI MLOps Application Locally</a></li>
    <ul>
        <li id="TOC-h3-Running-via-Poetry"><a rel="noopener" target="_blank" href="#h3-Running-via-Poetry">Running via Poetry</a></li>
        <li id="TOC-h3-Running-via-UV"><a rel="noopener" target="_blank" href="#h3-Running-via-UV">Running via UV</a></li>
        <li id="TOC-h3-Running-Python-MLOps-Projects-PDM"><a rel="noopener" target="_blank" href="#h3-Running-Python-MLOps-Projects-PDM">Running Python MLOps Projects with PDM</a></li>
        <li id="TOC-h3-Testing-FastAPI-Endpoints-Health-Check-Prediction-API"><a rel="noopener" target="_blank" href="#h3-Testing-FastAPI-Endpoints-Health-Check-Prediction-API">Testing FastAPI Endpoints: Health Check and Prediction API</a></li>
    </ul>

    <li id="TOC-h2-Summary"><a rel="noopener" target="_blank" href="#h2-Summary">Summary</a></li>
    <ul>
        <li id="TOC-h3-Citation-Information"><a rel="noopener" target="_blank" href="#h3-Citation-Information">Citation Information</a></li>
    </ul>
</ul>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h1-FastAPI-MLOps-Python-Project-Structure-API-Best-Practices"/>



<h2 class="wp-block-heading"><a href="#TOC-h1-FastAPI-MLOps-Python-Project-Structure-API-Best-Practices">FastAPI for MLOps: Python Project Structure and API Best Practices</a></h2>



<p>In this lesson, you will learn how to structure a Machine Learning (ML) project like a real production system, complete with a <code data-enlighter-language="python" class="EnlighterJSRAW">src</code> directory layout, layered configuration, environment management, logging, and a FastAPI service that exposes your model through clean Application Programming Interface (API) routes.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/04/fastapi-for-mlops-python-project-structure-featured.png" target="_blank" rel=" noreferrer noopener"><img fetchpriority="high" decoding="async" width="940" height="780" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/fastapi-for-mlops-python-project-structure-featured.png?lossy=2&strip=1&webp=1" alt="fastapi-for-mlops-python-project-structure-featured.png" class="wp-image-53444" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/fastapi-for-mlops-python-project-structure-featured.png?size=126x105&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/fastapi-for-mlops-python-project-structure-featured-300x249.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/fastapi-for-mlops-python-project-structure-featured.png?size=378x314&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/fastapi-for-mlops-python-project-structure-featured.png?size=504x418&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/fastapi-for-mlops-python-project-structure-featured.png?size=630x523&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/fastapi-for-mlops-python-project-structure-featured-768x637.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/fastapi-for-mlops-python-project-structure-featured.png?lossy=2&amp;strip=1&amp;webp=1 940w" sizes="(max-width: 630px) 100vw, 630px" /></a></figure></div>


<p>This lesson is the 1st of a 2-part series on Software Engineering for Machine Learning Operations (MLOps):</p>



<ol class="wp-block-list">
<li><em><strong><a href="https://pyimg.co/yn8a5" target="_blank" rel="noreferrer noopener">FastAPI for MLOps: Python Project Structure and API Best Practices</a></strong></em><strong> (this tutorial)</strong></li>



<li><em>Lesson 2</em></li>
</ol>



<p><strong>To learn how to build reliable, scalable ML software the right way,</strong><em><strong> just keep reading.</strong></em></p>



<div id="pyi-source-code-block" class="source-code-wrap"><div class="gpd-source-code">
    <div class="gpd-source-code-content">
        <img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/source-code-icon.png?lossy=2&strip=1&webp=1" alt="">
        <h4>Looking for the source code to this post?</h4>
                    <a href="#download-the-code" class="pyis-cta-modal-open-modal">Jump Right To The Downloads Section <svg class="svg-icon arrow-right" width="12" height="12" aria-hidden="true" role="img" focusable="false" viewBox="0 0 14 14" fill="none" xmlns="http://www.w3.org/2000/svg"><path d="M6.8125 0.1875C6.875 0.125 6.96875 0.09375 7.09375 0.09375C7.1875 0.09375 7.28125 0.125 7.34375 0.1875L13.875 6.75C13.9375 6.8125 14 6.90625 14 7C14 7.125 13.9375 7.1875 13.875 7.25L7.34375 13.8125C7.28125 13.875 7.1875 13.9062 7.09375 13.9062C6.96875 13.9062 6.875 13.875 6.8125 13.8125L6.1875 13.1875C6.125 13.125 6.09375 13.0625 6.09375 12.9375C6.09375 12.8438 6.125 12.75 6.1875 12.6562L11.0312 7.8125H0.375C0.25 7.8125 0.15625 7.78125 0.09375 7.71875C0.03125 7.65625 0 7.5625 0 7.4375V6.5625C0 6.46875 0.03125 6.375 0.09375 6.3125C0.15625 6.25 0.25 6.1875 0.375 6.1875H11.0312L6.1875 1.34375C6.125 1.28125 6.09375 1.1875 6.09375 1.0625C6.09375 0.96875 6.125 0.875 6.1875 0.8125L6.8125 0.1875Z" fill="#169FE6"></path></svg></a>
            </div>
</div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Introduction"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Introduction">Introduction</a></h2>



<p>Modern ML systems do not succeed because of models alone — they succeed because of the <em>software engineering</em> wrapped around them. Most real-world failures in MLOps come from poor structure, missing configuration, messy environments, unclear APIs, or nonexistent logging, not from bad ML.</p>



<p>This lesson gives you the engineering foundation you need to build ML systems that are stable, testable, and production-ready. You’ll learn how to structure your project, manage environments, load configurations, build APIs, and prepare your system for future modules like testing, deployment, and automation.</p>



<p>To learn how solid software engineering underpins every ML workflow, just keep reading.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-What-You-Will-Build-Learn"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-What-You-Will-Build-Learn">What You Will Build and Learn</a></h3>



<p>In this lesson, you’ll build the backbone of a real ML application: a clean repository layout, environment management with modern tooling, configuration loading via Pydantic, structured logging, a FastAPI interface, and a simple service layer to power prediction.</p>



<p>These concepts form the “foundation layer” every MLOps system relies on — regardless of the model you eventually plug in.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Why-Software-Engineering-Comes-First-MLOps-Best-Practices"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Why-Software-Engineering-Comes-First-MLOps-Best-Practices">Why Software Engineering Comes First in MLOps Best Practices</a></h3>



<p>ML projects fail not because the model is wrong, but because the <em>plumbing</em> around the model collapses. Scripts turn into spaghetti, notebooks become unmaintainable, configs get scattered, and environments drift until the system becomes impossible to debug.</p>



<p>Good software engineering fixes this by introducing structure, consistency, and predictable behavior. When your API, config, logs, and model code work together cleanly, everything built on top (e.g., testing, serving, scaling, monitoring) suddenly becomes reliable.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Where-This-Fits-Overall-Curriculum"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Where-This-Fits-Overall-Curriculum">Where This Fits in the Overall Curriculum</a></h3>



<p>This lesson is the foundation of the entire MLOps series. Everything that comes next — testing, model integration, deployment workflows, Continuous Integration/Continuous Delivery (CI/CD) automation, monitoring, and scaling — builds on the engineering habits you establish here.</p>



<p>Think of this as your “software engineering base layer.” Once you master this structure, adding real models, adding load testing, or plugging the system into cloud infrastructure becomes far easier.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Python-Project-Structure-Best-Practices-MLOps"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Python-Project-Structure-Best-Practices-MLOps">Python Project Structure Best Practices for MLOps</a></h2>



<p>A well-structured repository is the first sign of a healthy ML system. Before we write any API code or load a model, we need a layout that cleanly separates configuration, services, models, and utilities. This not only prevents chaos — it makes testing, scaling, and future modules dramatically easier.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-How-Structure-Python-Project-src-Layout"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-How-Structure-Python-Project-src-Layout">How to Structure a Python Project with src/ Layout</a></h3>



<p>ML projects quickly become messy if everything sits at the root level. The <code data-enlighter-language="python" class="EnlighterJSRAW">src/</code> layout prevents naming collisions, enforces imports that match production structure, and makes it clear where application code actually lives.</p>



<p>This is the same structure used in mature Python services deployed in production environments.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Python-Project-Structure-Explained-Repository-Walkthrough"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Python-Project-Structure-Explained-Repository-Walkthrough">Python Project Structure Explained: Repository Walkthrough</a></h3>



<p>Here’s the repository layout we’re working with in this module:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="1">sw-eng-mlops/
│
├── src/
│   ├── core/
│   ├── models/
│   ├── services/
│   ├── api/
│   ├── utils/
│   └── config/
│
├── tests/
│   ├── unit/
│   ├── integration/
│   └── performance/
│
├── pyproject.toml
├── README.md
├── setup_env.sh
└── .env.example
</pre>



<p>This structure is intentionally clean: <code data-enlighter-language="python" class="EnlighterJSRAW">core/</code> contains primitives, <code data-enlighter-language="python" class="EnlighterJSRAW">models/</code> stores your ML logic, <code data-enlighter-language="python" class="EnlighterJSRAW">services/</code> contains business logic, and <code data-enlighter-language="python" class="EnlighterJSRAW">api/</code> exposes everything through FastAPI routes.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Python-Project-Structure-Best-Practices-Directory-Breakdown"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Python-Project-Structure-Best-Practices-Directory-Breakdown">Python Project Structure Best Practices: Directory Breakdown</a></h3>



<h4 class="wp-block-heading">core/ — The Application Base Layer</h4>



<p>This folder contains shared components such as logging setup, base classes, or utility abstractions. Everything here is meant to be reusable across the whole system.</p>
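<p>To make this concrete, here is a minimal sketch of the kind of logger factory that could live in <code data-enlighter-language="python" class="EnlighterJSRAW">core/</code>. The module path and function name are illustrative assumptions, not the project’s actual code:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-linenumbers="true">
# src/core/logging.py (illustrative path, not the project's actual module)
import logging
import sys


def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger with one shared, app-wide format."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s | %(levelname)-8s | %(name)s | %(message)s"
        ))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


log = get_logger("app.core")
log.info("logger initialized")
</pre>

<p>Because every module asks the same factory for its logger, log output stays uniform across the entire application.</p>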



<h4 class="wp-block-heading">models/ — ML or Dummy Model Code</h4>



<p>Even if you’re starting with a dummy model, isolating model code here makes it easy to swap in real models later.</p>
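<p>Even a dummy model benefits from a fixed interface. The sketch below (class name and placeholder logic are assumptions for illustration) shows the contract a real model would later satisfy:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-linenumbers="true">
# src/models/dummy_model.py (illustrative; the real class may differ)
from typing import List


class DummyModel:
    """Stand-in model exposing the same interface a real model would."""
    version = "0.1.0"

    def predict(self, features: List[float]) -> float:
        # Trivial placeholder logic: the mean of the input features.
        return sum(features) / len(features) if features else 0.0


model = DummyModel()
print(model.predict([2.0, 4.0]))  # prints 3.0
</pre>

<p>Swapping in a real model later means replacing the body of <code data-enlighter-language="python" class="EnlighterJSRAW">predict()</code>, while the service and API layers stay untouched.</p>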



<h4 class="wp-block-heading">services/ — The Business Logic Layer</h4>



<p>This is where you place the logic that actually powers <code data-enlighter-language="python" class="EnlighterJSRAW">/predict</code>, not inside the API route. This separation keeps production-grade APIs maintainable.</p>



<h4 class="wp-block-heading">api/ — FastAPI Endpoints</h4>



<p>Routes live here. Each endpoint calls a service, which calls a model.</p>



<p>Tight, clean, and testable.</p>
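<p>As a sketch of that chain (assuming FastAPI and Pydantic are installed; every name here is illustrative, not the post’s exact code):</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-linenumbers="true">
# Illustrative route -> service -> model wiring; names are assumptions.
from fastapi import APIRouter
from pydantic import BaseModel


# models/ layer: placeholder model with a stable interface
class DummyModel:
    def predict(self, features):
        return sum(features) / len(features) if features else 0.0


# services/ layer: business logic, no HTTP concerns
_model = DummyModel()


def run_prediction(features):
    return {"prediction": _model.predict(features)}


# api/ layer: a thin route that only translates HTTP to and from the service
router = APIRouter()


class PredictRequest(BaseModel):
    features: list[float]


@router.post("/predict")
def predict(request: PredictRequest):
    return run_prediction(request.features)
</pre>

<p>Because <code data-enlighter-language="python" class="EnlighterJSRAW">run_prediction()</code> knows nothing about HTTP, it can be unit-tested directly, which is exactly what the testing lesson will rely on.</p>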



<h4 class="wp-block-heading">utils/ — Shared Helpers</h4>



<p>Config loaders, YAML readers, and general-purpose helper functions sit here.</p>



<p>If it isn’t domain logic or a model, it goes here.</p>



<h4 class="wp-block-heading">config/ — Configuration Files</h4>



<p>YAML configs, <code data-enlighter-language="python" class="EnlighterJSRAW">BaseSettings</code> classes, validation logic, and environment overrides.</p>



<p>Centralizing config makes behavior predictable and testable.</p>
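<p>The merging behavior behind layered configuration can be sketched with plain dictionaries. In the real project, the layers come from YAML files and are validated by Pydantic; the helper below is a pure-stdlib illustration, not the project’s actual loader:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-linenumbers="true">
# Illustrative layered config merge (stdlib only; the real project layers
# YAML files and validates the result with Pydantic settings classes).
import os


def deep_merge(base: dict, override: dict) -> dict:
    """Return base updated with override, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


defaults = {"api": {"host": "0.0.0.0", "port": 8000}, "log_level": "INFO"}
env_overrides = {"api": {"port": 9000}}  # e.g. loaded from an environment YAML

settings = deep_merge(defaults, env_overrides)

# Environment variables win last (e.g. values from a .env file).
if "PORT" in os.environ:
    settings["api"]["port"] = int(os.environ["PORT"])
</pre>

<p>The precedence order (defaults, then environment YAML, then environment variables) is what lets the same codebase run unchanged in development and production.</p>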



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-How-This-Structure-Scales-Larger-ML-Systems"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-How-This-Structure-Scales-Larger-ML-Systems">How This Structure Scales to Larger ML Systems</a></h3>



<p>This layout scales easily as your ML workload grows:</p>



<ul class="wp-block-list">
<li>Add a new model → create a folder inside <code data-enlighter-language="python" class="EnlighterJSRAW">models/</code>.</li>



<li>Add a new prediction workflow → add a service in <code data-enlighter-language="python" class="EnlighterJSRAW">services/</code>.</li>



<li>Add new API functionality → add a route in <code data-enlighter-language="python" class="EnlighterJSRAW">api/</code>.</li>



<li>Add data pipelines or vector DB logic → expand <code data-enlighter-language="python" class="EnlighterJSRAW">core/</code> or <code data-enlighter-language="python" class="EnlighterJSRAW">services/</code>.</li>
</ul>



<p>This way, the project grows <strong>horizontally</strong>, not chaotically.</p>






<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Managing-Python-Dependencies-Poetry-ML-Projects"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Managing-Python-Dependencies-Poetry-ML-Projects">Managing Python Dependencies with Poetry for ML Projects</a></h2>



<p>Modern MLOps projects rely on predictable, repeatable environments — and this section teaches you how to create exactly that. Before we build APIs or load models, we need a clean, isolated workspace where dependencies are installed, versions are pinned, and tools behave consistently across machines.</p>



<p>To learn how to manage dependencies, virtual environments, and setup scripts in real-world ML projects, just keep reading.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Python-Poetry-vs-PDM-vs-UV-Choosing-Package-Manager-MLOps"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Python-Poetry-vs-PDM-vs-UV-Choosing-Package-Manager-MLOps">Python Poetry vs PDM vs UV: Choosing a Package Manager for MLOps</a></h3>



<p>There are 3 modern Python toolchains worth knowing:</p>



<ul class="wp-block-list">
<li><strong>Poetry:</strong> full-featured dependency + environment + packaging manager.</li>



<li><strong>PDM (Python Dependency Manager):</strong> a simpler, faster alternative to Poetry (its early PEP 582 support was removed after that PEP was rejected).</li>



<li><strong><a href="https://docs.astral.sh/uv/" target="_blank" rel="noreferrer noopener">UV</a>:</strong> an extremely fast Rust-based package manager from Astral.</li>
</ul>



<p>All 3 support <code data-enlighter-language="python" class="EnlighterJSRAW">pyproject.toml</code>, the modern Python standard for dependencies and metadata.</p>



<p>Teams often standardize on a single tool, but this project supports <em>all three</em>, so you can use whichever you prefer.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Understanding-pyproject-toml-Python-Project-Configuration"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Understanding-pyproject-toml-Python-Project-Configuration">Understanding pyproject.toml for Python Project Configuration</a></h3>



<p>Your <code data-enlighter-language="python" class="EnlighterJSRAW">pyproject.toml</code> defines:</p>



<ul class="wp-block-list">
<li>project <code data-enlighter-language="python" class="EnlighterJSRAW">name</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">version</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">description</code></li>



<li>dependencies like <code data-enlighter-language="python" class="EnlighterJSRAW">fastapi</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">pydantic</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">pyyaml</code></li>



<li>dev tools like <code data-enlighter-language="python" class="EnlighterJSRAW">pytest</code> (Lesson 2)</li>



<li>optional entrypoints (<code data-enlighter-language="python" class="EnlighterJSRAW">start-server = "src.main:main"</code>)</li>
</ul>



<p>In other words, it is the <strong>single source of truth</strong> for installation and build metadata.</p>



<p>Any tool (Poetry, PDM, UV, pip) reads this file to install exactly what the project needs.</p>



<p>This is how professional ML systems avoid “works on my machine” issues.</p>
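<p>A minimal <code data-enlighter-language="python" class="EnlighterJSRAW">pyproject.toml</code> for this kind of project might look like the sketch below. Versions and the exact dependency list are illustrative; the <code data-enlighter-language="python" class="EnlighterJSRAW">start-server</code> entrypoint mirrors the one mentioned above:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="ini" data-enlighter-linenumbers="true">
[project]
name = "sw-eng-mlops"
version = "0.1.0"
description = "FastAPI MLOps demo service"
requires-python = ">=3.10"
dependencies = [
    "fastapi",
    "pydantic",
    "pyyaml",
]

[project.optional-dependencies]
dev = ["pytest"]

[project.scripts]
start-server = "src.main:main"
</pre>

<p>Poetry, PDM, UV, and pip can all install from this one file, which is what makes the workflows in the next section interchangeable.</p>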



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Installing-Dependencies-Poetry-PDM-UV"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Installing-Dependencies-Poetry-PDM-UV">Installing Dependencies (Poetry, PDM, UV)</a></h3>



<h4 class="wp-block-heading">Using Poetry (recommended)</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="2">poetry install
poetry shell
poetry run python src/main.py
</pre>



<p>Poetry creates an isolated virtual environment and resolves all versions deterministically.</p>



<h4 class="wp-block-heading">Using UV (lightweight + blazing fast)</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="3">uv venv
source .venv/bin/activate
uv pip install -e .
python src/main.py
</pre>



<p>UV is perfect for fast installs and CI systems where speed matters.</p>



<h4 class="wp-block-heading">Using PDM (simple + modern)</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="4">pdm install
pdm run python src/main.py
</pre>



<p>PDM feels like <code data-enlighter-language="python" class="EnlighterJSRAW">npm</code> — no <code data-enlighter-language="python" class="EnlighterJSRAW">venv</code> folder by default; lightweight and straightforward.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Managing-Python-Virtual-Environments-Reproducible-MLOps"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Managing-Python-Virtual-Environments-Reproducible-MLOps">Managing Python Virtual Environments for Reproducible MLOps</a></h3>



<p>Regardless of what tool you choose, the goal is the same: isolate project dependencies from the system Python installation.</p>



<ul class="wp-block-list">
<li>Poetry creates its own environment automatically.</li>



<li>UV uses <code data-enlighter-language="python" class="EnlighterJSRAW">.venv/</code> inside your project.</li>



<li>PDM can create or avoid virtual environments depending on the configuration.</li>
</ul>



<p>The important principle:</p>



<p><strong>Never install ML dependencies globally.</strong></p>



<p>Environments keep your project reproducible and safe.</p>
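<p>If you are ever unsure whether a shell is actually using an isolated environment, a quick stdlib check (works on Python 3.3+) is:</p>

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base interpreter.
    return sys.prefix != sys.base_prefix

print("virtualenv active:", in_virtualenv())
```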



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Automating-MLOps-Setup-Python-Environment-Scripts"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Automating-MLOps-Setup-Python-Environment-Scripts">Automating MLOps Setup with Python Environment Scripts</a></h3>



<p>Your project includes a helper script:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="5">./scripts/setup_env.sh
</pre>



<p>This script:</p>



<ul class="wp-block-list">
<li>Detects whether <strong>Poetry</strong>, <strong>UV</strong>, or plain <strong>pip</strong> is available</li>



<li>Installs dependencies using the detected tool</li>



<li>Creates or activates the <code data-enlighter-language="python" class="EnlighterJSRAW">.env</code> file</li>



<li>Shows the next steps to start the API</li>
</ul>



<p>This is extremely helpful for teams because it removes all “setup guessing” and gives new developers a consistent starting point.</p>
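<p>The detection logic at the heart of such a script can be sketched in a few lines of shell (a simplified stand-in for the real <code data-enlighter-language="python" class="EnlighterJSRAW">setup_env.sh</code>, which does more):</p>

```shell
# Pick whichever installer is on PATH, preferring Poetry, then UV,
# and falling back to plain pip.
if command -v poetry >/dev/null 2>&1; then
    tool=poetry
elif command -v uv >/dev/null 2>&1; then
    tool=uv
else
    tool=pip
fi
echo "Installing dependencies with $tool"
```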



<p>You now know how environments, dependency managers, and <code data-enlighter-language="python" class="EnlighterJSRAW">pyproject.toml</code> work together to create a stable foundation for ML systems. With everything installed and configured, you’re ready to build and serve a real API.</p>



<p>Up next, we’ll create your first ML service with FastAPI and connect it to your project’s service layer.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<!-- wp:paragraph -->
<h3>Need Help Configuring Your Development Environment?</h3>
<!-- /wp:paragraph -->

<!-- wp:image {"align":"center","id":18137,"sizeSlug":"large","linkDestination":"custom"} -->
<figure class="wp-block-image aligncenter size-large"><a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank" rel="noreferrer noopener"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-18137" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?lossy=2&strip=1&webp=1 500w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=126x84&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=252x168&lossy=2&strip=1&webp=1 252w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=378x253&lossy=2&strip=1&webp=1 378w" sizes="(max-width: 500px) 100vw, 500px" /></a><figcaption>Having trouble configuring your development environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? Be sure to join <a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">PyImageSearch University</a> — you will be up and running with this tutorial in a matter of minutes. </figcaption></figure>
<!-- /wp:image -->

<!-- wp:paragraph -->
<p>All that said, are you:</p>
<!-- /wp:paragraph -->

<!-- wp:list -->
<ul><li>Short on time?</li><li>Learning on your employer’s administratively locked system?</li><li>Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?</li><li><strong>Ready to run the code immediately on your Windows, macOS, or Linux system?</strong></li></ul>
<!-- /wp:list -->

<!-- wp:paragraph -->
<p>Then join <a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank">PyImageSearch University</a> today!</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p><strong>Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides pre-configured to run on Google Colab’s ecosystem right in your web browser!</strong> No installation required.</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!</p>
<!-- /wp:paragraph -->



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Configuration-Management-MLOps-YAML-env-Pydantic"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Configuration-Management-MLOps-YAML-env-Pydantic">Configuration Management in MLOps: YAML, .env, and Pydantic</a></h2>



<p><em>How the entire ML system loads, merges, and applies configuration at runtime.</em></p>



<p>Configuration is one of the most important engineering foundations in any ML system. In Lesson 1, we want students to walk away understanding not only <strong>why</strong> configuration matters but <strong>exactly how this project loads and merges config values</strong>. That means stepping through the real code inside <code data-enlighter-language="python" class="EnlighterJSRAW">src/core/config.py</code>, the <code data-enlighter-language="python" class="EnlighterJSRAW">.env.example</code>, and <code data-enlighter-language="python" class="EnlighterJSRAW">configs/config.yaml</code>.</p>



<p>We also want to show how the API, model, and services consume configuration. So when students replace the dummy model with a real one, the pattern already scales.</p>



<p>Let’s walk through it piece by piece.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Using-Pydantic-Settings-MLOps-Configuration-Management"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Using-Pydantic-Settings-MLOps-Configuration-Management">Using Pydantic Settings for MLOps Configuration Management</a></h3>



<p>Your configuration system starts with a <code data-enlighter-language="python" class="EnlighterJSRAW">Settings</code> class:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="6">class Settings(BaseSettings):
    api_host: str = "0.0.0.0"
    api_port: int = 8000
    debug: bool = False
    environment: str = "development"
    log_level: str = "INFO"

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-What-This-Means-MLOps-Configuration-System-Design"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-What-This-Means-MLOps-Configuration-System-Design">What This Means for MLOps Configuration and System Design</a></h3>



<ul class="wp-block-list">
<li>Pydantic’s <code data-enlighter-language="python" class="EnlighterJSRAW">BaseSettings</code> automatically reads:
<ul class="wp-block-list">
<li>environment variables</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">.env</code> file</li>



<li>any overrides you pass at runtime</li>
</ul>
</li>



<li>Defaults are provided <em>in code</em> so the system always works, even if <code data-enlighter-language="python" class="EnlighterJSRAW">.env</code> is missing.</li>



<li>Type safety ensures that if someone writes <code data-enlighter-language="python" class="EnlighterJSRAW">API_PORT=hello</code>, the app will fail fast.</li>
</ul>



<p>This is the right pattern for ML systems where dozens of environment variables must be synchronized across dev, test, staging, and production.</p>
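<p>You can see the fail-fast idea without pydantic in a few lines of plain Python (a hand-rolled stand-in for what <code data-enlighter-language="python" class="EnlighterJSRAW">BaseSettings</code> does for you automatically):</p>

```python
import os

def read_api_port(default: int = 8000) -> int:
    # Fail fast: a non-numeric API_PORT should stop startup immediately,
    # not surface later as a confusing runtime error.
    raw = os.environ.get("API_PORT", str(default))
    try:
        return int(raw)
    except ValueError:
        raise RuntimeError(f"API_PORT must be an integer, got {raw!r}")

os.environ["API_PORT"] = "9000"
print(read_api_port())  # 9000
```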



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Loading-YAML-Merging-Layers"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Loading-YAML-Merging-Layers">Loading YAML and Merging Layers</a></h3>



<p>Next comes one of the most important parts of your system:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="7">def load_config() -> Settings:
    settings = Settings()

    config_path = "configs/config.yaml"
    if os.path.exists(config_path):
        yaml_config = load_yaml_config(config_path)

        for key, value in yaml_config.items():
            if hasattr(settings, key):
                setattr(settings, key, value)

    return settings
</pre>



<p><strong>Why This Is Powerful</strong></p>



<p>You now have <strong>layered configuration</strong>, which production ML systems use everywhere:</p>



<p><strong>Layer 1: Code defaults</strong></p>



<p>Ensures the app always runs.</p>



<p><strong>Layer 2: YAML</strong> (<code data-enlighter-language="python" class="EnlighterJSRAW">configs/config.yaml</code>)</p>



<p>Great for team-shared configs, model settings, cache sizes, service parameters.</p>



<p><strong>Layer 3:</strong> <code data-enlighter-language="python" class="EnlighterJSRAW">.env</code> <strong>file</strong></p>



<p>Local overrides (ports, debug mode, secrets).</p>



<p><strong>Layer 4: Runtime environment variables</strong></p>



<p>Final source of truth in cloud deployments.</p>



<p>This layered system prevents the &#8220;hard-coded value&#8221; trap and keeps ML infra consistent across environments. One caveat: in <code data-enlighter-language="python" class="EnlighterJSRAW">load_config</code> as written, YAML values are applied after <code data-enlighter-language="python" class="EnlighterJSRAW">Settings()</code> has already read <code data-enlighter-language="python" class="EnlighterJSRAW">.env</code>, so a key defined in both takes its YAML value; swap the merge order if you want environment values to win.</p>
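<p>The layering can be illustrated with plain dictionaries, where later layers override earlier ones for any shared key:</p>

```python
code_defaults = {"api_port": 8000, "debug": False, "log_level": "INFO"}
yaml_config   = {"debug": True}      # from configs/config.yaml
env_overrides = {"api_port": 9000}   # from .env / environment variables

# Dict unpacking applies left to right, so the rightmost layer wins.
settings = {**code_defaults, **yaml_config, **env_overrides}
print(settings)  # {'api_port': 9000, 'debug': True, 'log_level': 'INFO'}
```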



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Designing-YAML-Configs-Scalable-MLOps-Pipelines"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Designing-YAML-Configs-Scalable-MLOps-Pipelines">Designing YAML Configs for Scalable MLOps Pipelines</a></h3>



<p>Your YAML file contains deeper structural config:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="yaml" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="8">api_host: "0.0.0.0"
api_port: 8000
debug: true
environment: "development"

log_level: "INFO"

model:
  name: "dummy_classifier"
  version: "1.0.0"
  cache_size: 100

service:
  timeout: 30
  max_retries: 3
</pre>



<p>Even though <code data-enlighter-language="python" class="EnlighterJSRAW">Settings</code> does not yet support nested objects for models or services, YAML allows you to introduce new structured configuration later. This is how real ML teams configure:</p>



<ul class="wp-block-list">
<li>model version</li>



<li>tokenizer version</li>



<li>max batch size</li>



<li>timeouts</li>



<li>cache settings</li>



<li>experiment IDs</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Using-env-Files-Secure-MLOps-Configuration"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Using-env-Files-Secure-MLOps-Configuration">Using .env Files for Secure MLOps Configuration</a></h3>



<p>You also provide <code data-enlighter-language="python" class="EnlighterJSRAW">.env.example</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="9">API_PORT=8000
API_HOST=0.0.0.0
DEBUG=true
ENVIRONMENT=development
LOG_LEVEL=INFO
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Why-Configuration-Management-Matters-MLOps-Systems"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Why-Configuration-Management-Matters-MLOps-Systems">Why Configuration Management Matters in MLOps Systems</a></h3>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">.env.example</code> acts as documentation and a template.</li>



<li>You copy it to <code data-enlighter-language="python" class="EnlighterJSRAW">.env</code>, fill values, and the system boots.</li>



<li>This is a best practice in every production ML repo.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-How-App-Uses-Configuration-src-main-py"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-How-App-Uses-Configuration-src-main-py">How the App Uses Configuration (src/main.py)</a></h3>



<p>Your FastAPI entrypoint reads config like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="10">logger.info(f"Starting server on {settings.api_host}:{settings.api_port}")

uvicorn.run(
    "main:app",
    host=settings.api_host,
    port=settings.api_port,
    reload=settings.debug
)
</pre>



<p>Meaning:</p>



<ul class="wp-block-list">
<li>Change <code data-enlighter-language="python" class="EnlighterJSRAW">.env</code> to <code data-enlighter-language="python" class="EnlighterJSRAW">API_PORT=9000</code>: Your app automatically runs on port 9000.</li>



<li>Change YAML to <code data-enlighter-language="python" class="EnlighterJSRAW">debug: false</code>: Hot reload turns off.</li>
</ul>



<p>This is the <strong>practical benefit</strong> of structured configuration: no hard-coded values are buried inside the code.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-How-FastAPI-Uses-Configuration-Production-MLOps-Systems"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-How-FastAPI-Uses-Configuration-Production-MLOps-Systems">How FastAPI Uses Configuration in Production MLOps Systems</a></h3>



<p>Today, your inference service is simple, but in real projects, you might use:</p>



<ul class="wp-block-list">
<li>model name</li>



<li>version</li>



<li>batch size</li>



<li>latency budget</li>



<li>max retries</li>



<li>cache settings</li>



<li>rate limits</li>
</ul>



<p>All of these come from settings, not hardcoded logic.</p>



<p>In this lesson, we teach the <em>pattern</em>, so when the dummy model is eventually replaced with an Open Neural Network Exchange (ONNX) model, a Hugging Face model, or a custom PyTorch model, the service already has the right structure.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Extending-MLOps-Configuration-Safely-Python-Projects"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Extending-MLOps-Configuration-Safely-Python-Projects">Extending MLOps Configuration Safely in Python Projects</a></h3>



<p>Suppose tomorrow you want:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="11">MODEL_PATH=models/checkpoint.pt
ENABLE_CACHE=true
CACHE_TTL=300
</pre>



<p>You add the matching typed fields (with safe defaults) to <code data-enlighter-language="python" class="EnlighterJSRAW">Settings</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="12">model_path: str = "models/dummy.pt"
enable_cache: bool = False
cache_ttl: int = 120
</pre>



<p>Then update <code data-enlighter-language="python" class="EnlighterJSRAW">.env.example</code>. Then, optionally override in YAML.</p>



<p>The app instantly supports new behavior — no rewrites, no refactoring, no confusion.</p>



<p>This is the level of <strong>software engineering maturity</strong> we want students to learn.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Logging-Best-Practices-MLOps-FastAPI-Applications"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Logging-Best-Practices-MLOps-FastAPI-Applications">Logging Best Practices for MLOps and FastAPI Applications</a></h2>



<p>Logging is one of the most underappreciated parts of an ML system. A model prediction might take milliseconds, but diagnosing a production issue without proper logs can take hours. Good logs reduce that time to minutes. In this section, we’ll look at how our lesson’s project initializes a logger, formats log messages, and uses logs consistently across the entire API.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Why-Logging-Critical-ML-Systems"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Why-Logging-Critical-ML-Systems">Why Logging Is Critical for ML Systems</a></h3>



<p>ML systems fail in ways traditional software does not.</p>



<p>A model might produce an unexpected prediction, a dependency might break silently, or the environment might load the wrong configuration. Logging gives you the breadcrumbs needed to understand:</p>



<ul class="wp-block-list">
<li>What inputs reached the API</li>



<li>What model version was used</li>



<li>What the service did before failing</li>



<li>How often errors occur</li>



<li>Whether latency is increasing</li>
</ul>



<p>Logs are your “black box recorder” when something goes wrong, and they’re equally important when everything seems to be working — because they tell you <em>why</em> things are working.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Logger-Initialization"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Logger-Initialization">Logger Initialization</a></h3>



<p>The project defines a single shared logger in <code data-enlighter-language="python" class="EnlighterJSRAW">src/core/logger.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="13">import logging
import sys

logger = logging.getLogger("mlops-lesson1")
logger.setLevel(logging.INFO)

handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

if not logger.handlers:
    logger.addHandler(handler)
</pre>



<p>Here’s what this setup accomplishes:</p>



<ul class="wp-block-list">
<li><strong>A named logger</strong> (<code data-enlighter-language="python" class="EnlighterJSRAW">mlops-lesson1</code>) groups logs for later aggregation (e.g., in Datadog, ELK (Elasticsearch, Logstash, Kibana), OpenTelemetry).</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">INFO</code> <strong>as the default level</strong> ensures we capture meaningful operational details without spamming output.</li>



<li><strong>A <code data-enlighter-language="python" class="EnlighterJSRAW">StreamHandler</code></strong> writes logs to <code data-enlighter-language="python" class="EnlighterJSRAW">stdout</code> — the standard for containerized deployments (Docker, Kubernetes).</li>



<li><strong>A simple timestamped formatter</strong> makes logs human-readable while remaining machine-parseable.</li>



<li>The <code data-enlighter-language="python" class="EnlighterJSRAW">if not logger.handlers:</code> guard prevents duplicate logs if modules are reloaded.</li>
</ul>



<p>This small file gives us a production-friendly logger with minimal overhead.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Log-Formatting-Levels"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Log-Formatting-Levels">Log Formatting and Levels</a></h3>



<p>The logger uses this format:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="14">2025-01-01 12:34:56 - INFO - Prediction result: positive
</pre>



<p>Each part of the log line matters:</p>



<ul class="wp-block-list">
<li><strong>Timestamp:</strong> crucial for correlating logs with events or latency spikes.</li>



<li><strong>Log level:</strong> signals severity (<code data-enlighter-language="python" class="EnlighterJSRAW">INFO</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">WARNING</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">ERROR</code>).</li>



<li><strong>Message:</strong> the human-readable explanation.</li>
</ul>



<p>In MLOps systems, you’ll most commonly use:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">INFO</code> for model loading, API calls, predictions</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">WARNING</code> for slow responses, unexpected patterns</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">ERROR</code> when something fails</li>
</ul>



<p>Because FastAPI reloads modules during development, you may see log duplication without safeguards — which is why we include the <code data-enlighter-language="python" class="EnlighterJSRAW">if not logger.handlers:</code> check.</p>



<p>If you later want structured JSON logs (for cloud log ingestion), this same module is the place to upgrade.</p>
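<p>As a sketch of that upgrade, a minimal JSON formatter can be dropped into the same module (the field names here are illustrative):</p>

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# Swap it in for the plain formatter:
# handler.setFormatter(JsonFormatter())
```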



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Logging-Across-App"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Logging-Across-App">Logging Across the App</a></h3>



<p>The logger is used in multiple places, showing a consistent logging strategy.</p>



<h4 class="wp-block-heading">Health endpoint (src/main.py)</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="15">@app.get("/health")
async def health_check():
    logger.info("Health check requested")
    return {"status": "ok"}
</pre>



<p>This gives visibility into uptime checks — important when a load balancer or Kubernetes performs probes.</p>



<h4 class="wp-block-heading">Prediction endpoint (src/services/inference_service.py)</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="16">logger.info(f"Making prediction for input: {input_text[:50]}...")
prediction = model.predict(input_text)
logger.info(f"Prediction result: {prediction}")
</pre>



<p>Here we log:</p>



<ul class="wp-block-list">
<li>The incoming input (truncated to avoid leaking full user data)</li>



<li>The model’s output</li>



<li>Any errors</li>
</ul>



<p>If something goes wrong:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="17">except Exception as e:
    logger.error(f"Error during prediction: {str(e)}")
    raise
</pre>



<p>This ensures errors appear in the logs <strong>before</strong> FastAPI converts them into HTTP exceptions.</p>



<h4 class="wp-block-heading">Server startup (main.py)</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="18">logger.info(f"Starting server on {settings.api_host}:{settings.api_port}")
</pre>



<p>This is important for:</p>



<ul class="wp-block-list">
<li>verifying the config loaded correctly</li>



<li>ensuring the correct port is used</li>



<li>debugging environments with conflicting overrides</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Structured-Traceable-Behavior-Across-App"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Structured-Traceable-Behavior-Across-App">Together, This Gives Us Structured, Traceable Behavior Across the App</a></h3>



<p>If a user reports:</p>



<p>“The API feels slow today.”</p>



<p>You can immediately look at:</p>



<ul class="wp-block-list">
<li>prediction request timestamps</li>



<li>whether model loading was triggered again</li>



<li>whether latency warnings appear</li>



<li>whether certain inputs correlate with errors</li>
</ul>



<p>Without logs, you’re flying blind.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-FastAPI-MLOps-Building-Production-ML-API"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-FastAPI-MLOps-Building-Production-ML-API">FastAPI for MLOps: Building a Production ML API</a></h2>



<p>APIs are the interface between your ML system and the outside world. Whether the consumer is a mobile app, a batch job, another microservice, or a human developer testing in Postman, every interaction eventually flows through an API. In MLOps, your API becomes the stable contract that hides internal details (model type, version, preprocessing, logging) — allowing you to upgrade models without breaking clients.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Why-FastAPI-Ideal-MLOps-API-Development"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Why-FastAPI-Ideal-MLOps-API-Development">Why FastAPI Is Ideal for MLOps API Development</a></h3>



<p>FastAPI gives you a fast, typed, and production-ready way to expose ML predictions.</p>



<p>It handles validation, serialization, documentation, and error responses, so your ML logic stays clean and modular.</p>



<p>The goal is simple: <strong>your API should stay stable even when everything behind it changes </strong>— models, configs, logging, monitoring, infrastructure.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Creating-FastAPI-Application-Machine-Learning-APIs"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Creating-FastAPI-Application-Machine-Learning-APIs">Creating a FastAPI Application for Machine Learning APIs</a></h3>



<p>Your project defines the API inside <code data-enlighter-language="python" class="EnlighterJSRAW">src/main.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="19">from fastapi import FastAPI
app = FastAPI(
    title="ML Service API",
    description="Code Foundations &amp; API Engineering for MLOps",
    version="0.1.0"
)
</pre>



<p>This initializes a fully documented ML service with:</p>



<ul class="wp-block-list">
<li>A <code data-enlighter-language="python" class="EnlighterJSRAW">title</code> for the UI</li>



<li>A <code data-enlighter-language="python" class="EnlighterJSRAW">description</code> that shows up in Swagger</li>



<li>A semantic <code data-enlighter-language="python" class="EnlighterJSRAW">version</code></li>



<li>Automatically generated schemas</li>
</ul>



<p>FastAPI instantly gives you API docs and a clean, declarative way to add endpoints.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Implementing-Health-Check-Endpoints-FastAPI-MLOps"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Implementing-Health-Check-Endpoints-FastAPI-MLOps">Implementing Health Check Endpoints in FastAPI (MLOps)</a></h3>



<p>A health endpoint is the first thing any production system needs.</p>



<p>Kubernetes, AWS Application Load Balancer (ALB), Docker Compose, Jenkins, and uptime monitors all rely on it.</p>



<p>Your implementation:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="20">@app.get("/health")
async def health_check():
    logger.info("Health check requested")
    return {"status": "ok"}
</pre>



<p>This performs two critical functions:</p>



<ul class="wp-block-list">
<li><strong>Confirms the API server is alive</strong></li>



<li><strong>Confirms logs are working</strong></li>
</ul>



<p>It also gives you a simple smoke test to verify the environment.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Building-FastAPI-Prediction-Endpoint-ML-Models"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Building-FastAPI-Prediction-Endpoint-ML-Models">Building a FastAPI Prediction Endpoint for ML Models</a></h3>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">/predict</code> endpoint is where real ML work happens.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="22">from services.inference_service import predict as predict_service

@app.post("/predict")
async def predict_route(input_text: str):
    return {"prediction": predict_service(input_text)}
</pre>



<p>This endpoint:</p>



<ul class="wp-block-list">
<li>Accepts a simple string input</li>



<li>Passes it into the inference service</li>



<li>Returns a structured JSON prediction</li>
</ul>



<p>Because prediction logic is isolated in <code data-enlighter-language="python" class="EnlighterJSRAW">services/inference_service.py</code>, the API stays lightweight and focused on HTTP behavior — not business logic.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Behind-This-Endpoint-Prediction-Engine"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Behind-This-Endpoint-Prediction-Engine">Behind This Endpoint Is Your Prediction Engine</a></h3>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="23">from models.dummy_model import DummyModel

model = DummyModel()

def predict(input_text: str) -> str:
    logger.info(f"Making prediction for input: {input_text[:50]}...")
    prediction = model.predict(input_text)
    logger.info(f"Prediction result: {prediction}")
    return prediction
</pre>



<p>Even though this is a dummy model, the structure mirrors real production design:</p>



<ul class="wp-block-list">
<li>The service layer owns the prediction logic</li>



<li>The model is instantiated once</li>



<li>Logging wraps the input and output</li>
</ul>



<p>When you upgrade to a real transformer or classifier, the API <strong>does not need to change</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Deploying-FastAPI-Uvicorn-MLOps-Applications"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Deploying-FastAPI-Uvicorn-MLOps-Applications">Deploying FastAPI with Uvicorn for MLOps Applications</a></h3>



<p>The server entrypoint lives at the bottom of <code data-enlighter-language="python" class="EnlighterJSRAW">main.py</code>:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="24">def main():
    logger.info(f"Starting server on {settings.api_host}:{settings.api_port}")
    uvicorn.run(
        "main:app",
        host=settings.api_host,
        port=settings.api_port,
        reload=settings.debug
    )
</pre>



<p>A few details matter:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">reload=True</code> reloads on code changes → perfect for development</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">host</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">port</code> come from config → ideal for containers/cloud</li>



<li><strong>logging is integrated</strong> → so you can trace server start behavior</li>
</ul>



<p>You can run the server with:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="25">poetry run start-server
</pre>



<p>or</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="26">uvicorn src.main:app --reload
</pre>



<p>Both give you a live API with hot reload.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Auto-Generated-API-Docs-Swagger-ReDoc"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Auto-Generated-API-Docs-Swagger-ReDoc">Auto-Generated API Docs (Swagger, ReDoc)</a></h3>



<p>FastAPI automatically exposes:</p>



<ul class="wp-block-list">
<li><strong>Swagger UI:</strong> <code data-enlighter-language="python" class="EnlighterJSRAW">http://localhost:8000/docs</code></li>



<li><strong>ReDoc:</strong> <code data-enlighter-language="python" class="EnlighterJSRAW">http://localhost:8000/redoc</code></li>



<li><strong>OpenAPI schema:</strong> <code data-enlighter-language="python" class="EnlighterJSRAW">http://localhost:8000/openapi.json</code></li>
</ul>



<p>These docs are invaluable in ML workflows because:</p>



<ul class="wp-block-list">
<li>You can test predictions interactively</li>



<li>Product, QA, and frontend engineers can explore endpoints</li>



<li>Payload schemas are always up to date</li>



<li>No one needs to ask “What does this endpoint expect?”</li>
</ul>



<p>FastAPI generates this from your Python type hints, which makes documentation essentially free.</p>
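<p>The underlying mechanism is ordinary Python introspection: type hints are runtime-readable metadata. As a greatly simplified, stdlib-only sketch of the idea (FastAPI's actual machinery also uses Pydantic and <code data-enlighter-language="python" class="EnlighterJSRAW">inspect</code>), the annotations on a handler can be read directly:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="">from typing import get_type_hints

def predict_route(input: str) -> dict:
    # Toy handler with the same shape as the /predict endpoint
    return {"prediction": "positive"}

# The same annotations you write for the reader are available at runtime
hints = get_type_hints(predict_route)
print(hints)  # {'input': <class 'str'>, 'return': <class 'dict'>}
</pre>

<p>Because the schema is derived from the code itself, it can never drift out of sync with the implementation.</p>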



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-MLOps-Architecture-Service-Layer-Design-Patterns"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-MLOps-Architecture-Service-Layer-Design-Patterns">MLOps Architecture: Service Layer Design Patterns</a></h2>



<p>The service layer is where your application’s real business logic lives. In an ML system, this includes preprocessing, model selection, inference, error handling, postprocessing, and logging. By keeping this logic out of your API routes, you ensure that your codebase remains modular, testable, and ready for future model upgrades.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Why-Separate-Services-Routes"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Why-Separate-Services-Routes">Why We Separate Services from Routes</a></h3>



<p>FastAPI routes should only handle <strong>HTTP concerns</strong>: input validation, request parsing, and response formatting.</p>



<p>They should not know how your model works internally.</p>



<p>Separating logic into a <code data-enlighter-language="python" class="EnlighterJSRAW">services/</code> folder gives you:</p>



<ul class="wp-block-list">
<li><strong>Cleaner API routes:</strong> easier to read and maintain</li>



<li><strong>Better testability:</strong> you can unit test the inference logic without starting a server</li>



<li><strong>Loose coupling:</strong> upgrading models doesn’t require rewriting routes</li>



<li><strong>Clear ownership:</strong> one layer handles HTTP, the other handles ML logic</li>
</ul>



<p>This separation is one of the most critical software engineering patterns in MLOps — you want your system flexible enough that models can change, scale, or switch frameworks without touching your API.</p>
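<p>As a minimal illustration of the split (hypothetical names, independent of the project code), the route can stay a thin wrapper around a plain function:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="">def predict_service(text: str) -> str:
    # Business logic lives here (stand-in for a real model call)
    return "positive" if "good" in text.lower() else "negative"

def predict_route(text: str) -> dict:
    # HTTP-facing concern only: wrap the service result as a JSON-able dict
    return {"prediction": predict_service(text)}
</pre>

<p>Because <code data-enlighter-language="python" class="EnlighterJSRAW">predict_service</code> is plain Python, it can be unit tested without any HTTP machinery at all.</p>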



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Designing-ML-Inference-Service"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Designing-ML-Inference-Service">Designing an ML Inference Service</a></h3>



<p>Your inference logic lives in:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="27">src/services/inference_service.py
</pre>



<p>Let’s look at how it’s structured:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="28">from models.dummy_model import DummyModel
from core.logger import logger

# Initialize model
model = DummyModel()
logger.info(f"Loaded model: {model.model_name}")
</pre>



<p>This loads the model once at startup. In a real ML system, this is where:</p>



<ul class="wp-block-list">
<li>You load a transformer model</li>



<li>You warm up a GPU</li>



<li>You hydrate a vector store</li>



<li>You initialize the tokenizer/preprocessor state</li>
</ul>
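<p>If that startup cost ever becomes a problem (e.g., slow imports in test suites), a common alternative to module-level instantiation is a lazily cached factory. A sketch with hypothetical names, not part of this project's code:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="">from functools import lru_cache

class DummyModel:
    # Stand-in for an expensive-to-construct model object
    def __init__(self) -> None:
        self.model_name = "dummy_classifier"

@lru_cache(maxsize=1)
def get_model() -> DummyModel:
    # The body runs once; every later call returns the cached instance
    return DummyModel()
</pre>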



<p>Then comes the prediction function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="29">def predict(input_text: str) -> str:
    logger.info(f"Making prediction for input: {input_text[:50]}...")
   
    try:
        prediction = model.predict(input_text)
        logger.info(f"Prediction result: {prediction}")
        return prediction
    except Exception as e:
        logger.error(f"Error during prediction: {str(e)}")
        raise
</pre>



<p>This function represents the <em>business logic</em> of your ML service:</p>



<ul class="wp-block-list">
<li>It trims the input for logging</li>



<li>Calls the model’s <code data-enlighter-language="python" class="EnlighterJSRAW">predict()</code></li>



<li>Logs errors and output cleanly</li>



<li>Returns only the result — not HTTP details</li>
</ul>



<p>This is exactly why we keep services separate: <strong>inference is not an HTTP concern</strong>, so it does not belong in a route.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Scaling-MLOps-Systems-Modular-Service-Architecture"/>



<h3 class="wp-block-heading"><a href="#TOC-h2-Model-Abstraction-MLOps-Decoupling-ML-APIs">Scaling MLOps Systems with Modular Service Architecture</a></h3>



<p>A great design scales. Tomorrow, your system might need:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">SentimentService</code>: for NLP</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">RecommendationService</code>: for personalization</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">VisionService</code>: that loads YOLO or CLIP</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">BatchService</code>: for async workflows</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">RetrievalService</code>: for Retrieval-Augmented Generation (RAG) pipelines</li>
</ul>



<p>You don’t modify <code data-enlighter-language="python" class="EnlighterJSRAW">main.py</code> or existing endpoints.</p>



<p>You simply add more files under:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="30">src/services/
├── inference_service.py  
├── recommendation_service.py  
├── vision_service.py  
└── retrieval_service.py  
</pre>



<p>Each service becomes independent, testable, and reusable.</p>
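<p>To sketch how such services could be dispatched without touching existing endpoints (hypothetical names, assuming each service exposes a plain callable), a simple registry works:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="">def inference_service(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"

def echo_service(text: str) -> str:
    return text

# Adding a new service means adding one entry here -- no route changes
SERVICES = {
    "inference": inference_service,
    "echo": echo_service,
}

def dispatch(service_name: str, payload: str) -> str:
    # Unknown names fail loudly rather than silently
    if service_name not in SERVICES:
        raise KeyError(f"Unknown service: {service_name}")
    return SERVICES[service_name](payload)
</pre>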



<p>Later in Lesson 2, this design becomes even more powerful because:</p>



<ul class="wp-block-list">
<li><strong>Unit tests:</strong> target individual services</li>



<li><strong>Integration tests:</strong> validate routes and services working together</li>



<li><strong>Load tests:</strong> measure the throughput of the <code data-enlighter-language="python" class="EnlighterJSRAW">/predict</code> pipeline</li>
</ul>



<p>By the time you add real ML models, this service layer becomes the heart of your system.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Model-Abstraction-MLOps-Decoupling-ML-APIs"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Model-Abstraction-MLOps-Decoupling-ML-APIs">Model Abstraction in MLOps: Decoupling ML from APIs</a></h2>



<p>Models change constantly in MLOps. Today you may be serving a dummy classifier; tomorrow it might be a 7B LLM or a YOLOv12 object detector. A good software engineering foundation treats the model as a <em>pluggable, versioned component</em> that can be replaced with minimal friction.</p>



<p>Your current <code data-enlighter-language="python" class="EnlighterJSRAW">models/</code> directory demonstrates exactly how this abstraction works.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Designing-Python-ML-Model-Class-MLOps"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Designing-Python-ML-Model-Class-MLOps">Designing a Python ML Model Class for MLOps</a></h3>



<p>Your lesson uses a simple placeholder model located at:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="31">src/models/dummy_model.py
</pre>



<p>The goal of this class isn’t to perform “real” ML — it’s to give you a clean structure that mimics how production model classes are written.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="32">class DummyModel:
    def __init__(self) -> None:
        self.model_name = "dummy_classifier"
        self.version = "1.0.0"
   
    def predict(self, input_data: Any) -> str:
        text = str(input_data).lower()
        if "good" in text or "great" in text:
            return "positive"
        return "negative"
</pre>



<p>Even in this tiny model, you already see foundational patterns:</p>



<ul class="wp-block-list">
<li>A <strong>constructor</strong> to load or initialize model state</li>



<li>A <code data-enlighter-language="python" class="EnlighterJSRAW">predict()</code> method that defines the inference interface</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">model_name</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">version</code> fields for introspection and tracking</li>
</ul>



<p>This interface is intentionally minimal: it forces your service and API layers to depend on an abstraction, not on implementation details.</p>
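<p>One way to make that abstraction explicit is a structural type. This is a sketch using <code data-enlighter-language="python" class="EnlighterJSRAW">typing.Protocol</code> — the project itself relies on convention rather than a formal interface:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="">from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class SupportsPredict(Protocol):
    # Any class with these members satisfies the protocol -- no inheritance needed
    model_name: str
    version: str

    def predict(self, input_data: Any) -> str: ...

class DummyModel:
    def __init__(self) -> None:
        self.model_name = "dummy_classifier"
        self.version = "1.0.0"

    def predict(self, input_data: Any) -> str:
        text = str(input_data).lower()
        return "positive" if "good" in text or "great" in text else "negative"
</pre>

<p>Services can then type-hint <code data-enlighter-language="python" class="EnlighterJSRAW">SupportsPredict</code> instead of a concrete model class, which is the abstraction boundary in code form.</p>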



<p>In real MLOps systems, this exact pattern makes it easy to introduce new models without breaking your API.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Replace-Dummy-Models-Production-ML-Models"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Replace-Dummy-Models-Production-ML-Models">How to Replace Dummy Models with Production ML Models</a></h3>



<p>Here’s where the abstraction shines.</p>



<p>If tomorrow you decide to replace the dummy model with:</p>



<ul class="wp-block-list">
<li>A Hugging Face transformer</li>



<li>A PyTorch Lightning checkpoint</li>



<li>A TensorRT engine</li>



<li>An ONNX Runtime session</li>



<li>A vLLM text-generation server</li>



<li>A YOLO detection model</li>
</ul>



<p>…all you need to do is drop a new file into:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="33">src/models/
</pre>



<p>For example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="34">src/models/
├── dummy_model.py
├── sentiment_model.py
├── llm_generation_model.py
└── object_detector.py
</pre>



<p>And update your service:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="35">from models.sentiment_model import SentimentModel
model = SentimentModel()
</pre>



<p>Nothing else changes.</p>



<p>Your FastAPI routes stay the same.</p>



<p>Your service interface stays the same.</p>



<p>Your tests stay the same (except for new model-specific tests).</p>



<p>This is <em>model decoupling</em>.</p>



<p>This is how ML systems avoid turning into tangled spaghetti when models evolve.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Versioning-Model-Class"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Versioning-Model-Class">Versioning the Model Class</a></h3>



<p>Model versioning is a real production concern, and your dummy model subtly teaches the pattern.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="36">self.version = "1.0.0"
</pre>



<p>Model versioning matters because:</p>



<ul class="wp-block-list">
<li>You may deploy multiple models at once</li>



<li>Clients might depend on specific behaviors</li>



<li>A/B testing needs separate versions</li>



<li>Rollbacks require deterministic reproducibility</li>



<li>Monitoring tools (e.g., Prometheus or Langfuse) track model changes</li>
</ul>



<p>In production, versioning happens in several places:</p>



<ul class="wp-block-list">
<li><strong>version field in the class</strong></li>



<li><strong>model registry tag</strong> (MLflow, SageMaker, Hugging Face Hub)</li>



<li><strong>Docker image tag</strong></li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">config.yaml</code><strong> entry</strong></li>



<li><strong>model card metadata</strong></li>
</ul>



<p>Your project follows the simplest, clearest entrypoint: a version attribute that propagates everywhere the model is used.</p>
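<p>One subtlety once code starts comparing such version strings: compare them numerically, not lexically. A small stdlib-only sketch (not part of the project code):</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="">def parse_version(version: str) -> tuple:
    # "1.10.0" -> (1, 10, 0): comparison is numeric, component by component
    return tuple(int(part) for part in version.split("."))

# Raw string comparison would wrongly rank "1.9.3" above "1.10.0";
# integer tuples compare correctly.
newer = parse_version("1.10.0") > parse_version("1.9.3")
</pre>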



<p>Later in Lesson 2, test cases and load tests will automatically pick up this version, mimicking real-world CI/CD systems that validate each model release.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Building-Reusable-Utilities-Python-MLOps-Projects"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Building-Reusable-Utilities-Python-MLOps-Projects">Building Reusable Utilities in Python MLOps Projects</a></h2>



<p>A well-designed ML system always contains a dedicated utilities layer — small, reusable functions that solve cross-cutting problems without polluting your core logic, service layer, or API routes.</p>



<p>In this project, the <code data-enlighter-language="python" class="EnlighterJSRAW">src/utils/</code> folder gives you a clean space to organize those helpers, starting with configuration loading, and is ready to grow as your system becomes more complex.</p>



<p>This layer keeps your codebase maintainable, testable, and extensible.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Loading-YAML-Configs"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Loading-YAML-Configs">Loading YAML Configs</a></h3>



<p>Your primary helper is <code data-enlighter-language="python" class="EnlighterJSRAW">load_yaml_config()</code> found in:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="37">src/utils/helpers.py
</pre>



<p>Here’s the implementation:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="38">def load_yaml_config(path: str) -> Dict[str, Any]:
    config_path = Path(path)
   
    if not config_path.exists():
        return {}
   
    try:
        with open(config_path, 'r', encoding='utf-8') as file:
            config = yaml.safe_load(file)
            return config if config is not None else {}
    except yaml.YAMLError as e:
        print(f"Error loading YAML config from {path}: {e}")
        return {}
    except Exception as e:
        print(f"Unexpected error loading config from {path}: {e}")
        return {}
</pre>



<p>This function may look simple, but it embodies 3 production-level lessons:</p>



<h4 class="wp-block-heading">Separation of concerns</h4>



<p>Your application logic (FastAPI, inference services) should not know <em>how</em> a YAML file is parsed. They should only receive clean configuration objects.</p>



<h4 class="wp-block-heading">Fault tolerance</h4>



<p>In real deployments:</p>



<ul class="wp-block-list">
<li>configs may be missing</li>



<li>YAML indentation may break</li>



<li>a misconfigured CI pipeline may pass an empty file</li>
</ul>



<p>Returning <code data-enlighter-language="python" class="EnlighterJSRAW">{}</code> instead of crashing gives you graceful degradation.</p>
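<p>The practical payoff is that callers can merge the result over sane defaults without special-casing failure. A sketch with a hypothetical stand-in loader (JSON instead of YAML, to stay stdlib-only):</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="">import json
from pathlib import Path

DEFAULTS = {"api_host": "0.0.0.0", "api_port": 8000, "debug": False}

def load_config_or_empty(path: str) -> dict:
    # Stand-in for load_yaml_config(): a missing or broken file degrades to {}
    config_path = Path(path)
    if not config_path.exists():
        return {}
    try:
        loaded = json.loads(config_path.read_text(encoding="utf-8"))
        return loaded if isinstance(loaded, dict) else {}
    except (OSError, json.JSONDecodeError):
        return {}

def effective_config(path: str) -> dict:
    # File values override defaults; an empty result leaves defaults intact
    return {**DEFAULTS, **load_config_or_empty(path)}
</pre>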



<h4 class="wp-block-heading">Extensibility</h4>



<p>Tomorrow you may add:</p>



<ul class="wp-block-list">
<li>JSON config support</li>



<li>remote config loading (S3, Google Cloud Storage (GCS), Azure Blob)</li>



<li>encrypted secrets</li>



<li>multiple config layers</li>
</ul>



<p>This helper becomes the foundation.</p>



<p>Inside <code data-enlighter-language="python" class="EnlighterJSRAW">core/config.py</code>, you saw how <code data-enlighter-language="python" class="EnlighterJSRAW">load_yaml_config()</code> merges YAML values into your Pydantic settings. This is a real-world pattern used in production MLOps stacks like Airflow, FastAPI microservices, Ray Serve, and MLflow.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Adding-New-Helper-Functions"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Adding-New-Helper-Functions">Adding New Helper Functions</a></h3>



<p>The utilities layer is designed to grow organically as your system grows.</p>



<p>Common helpers you may introduce later include:</p>



<h4 class="wp-block-heading">String helpers</h4>



<ul class="wp-block-list">
<li>text normalization</li>



<li>input cleaning</li>



<li>token counting</li>
</ul>



<h4 class="wp-block-heading">File helpers</h4>



<ul class="wp-block-list">
<li>safe file writes</li>



<li>temporary directory management</li>



<li>checksum calculation for model files</li>
</ul>



<h4 class="wp-block-heading">Model helpers</h4>



<ul class="wp-block-list">
<li>downloading artifacts from cloud storage</li>



<li>caching models on disk</li>



<li>validating model signatures</li>
</ul>



<h4 class="wp-block-heading">API helpers</h4>



<ul class="wp-block-list">
<li>request validation</li>



<li>standardized error responses</li>



<li>retry/backoff wrappers around external calls</li>
</ul>



<h4 class="wp-block-heading">Monitoring helpers</h4>



<ul class="wp-block-list">
<li>timing decorators</li>



<li>metrics emitters (Prometheus, StatsD, OpenTelemetry)</li>



<li>latency buckets</li>
</ul>



<p>All of these belong in one place:</p>



<p><code data-enlighter-language="python" class="EnlighterJSRAW">src/utils/</code></p>



<p>This prevents your service layer or route handlers from becoming cluttered and ensures that common functionality is implemented once and reused everywhere.</p>
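<p>As one concrete example of such a helper, a timing decorator (a stdlib-only sketch; any real version would emit to your logger or metrics backend instead of <code data-enlighter-language="python" class="EnlighterJSRAW">print</code>) can wrap any service function:</p>

<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="">import time
from functools import wraps

def timed(fn):
    # Reusable helper: reports wall-clock duration of any call it wraps
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"{fn.__name__} took {elapsed_ms:.2f} ms")
    return wrapper

@timed
def slow_add(a: int, b: int) -> int:
    time.sleep(0.01)
    return a + b
</pre>

<p>Written once in <code data-enlighter-language="python" class="EnlighterJSRAW">src/utils/</code>, the same decorator can instrument every service without duplicating timing logic.</p>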



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Running-FastAPI-MLOps-Application-Locally"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Running-FastAPI-MLOps-Application-Locally">Running a FastAPI MLOps Application Locally</a></h2>



<p>At this point, you have a fully structured ML application: configuration, logging, models, service layer, and a clean FastAPI interface. Now it’s time to actually <em>run</em> the system locally.</p>



<p>This section walks you through running the API with <strong>Poetry</strong>, <strong>UV</strong>, or <strong>PDM</strong>, depending on your setup. We’ll conclude with a quick validation test to ensure everything works end-to-end.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Running-via-Poetry"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Running-via-Poetry">Running via Poetry</a></h3>



<p>If you’re using Poetry (recommended for most workflows), your steps are:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="39"># Install dependencies
poetry install

# Activate the environment
poetry shell

# Start the API server
poetry run python src/main.py
</pre>



<p>You should see log lines like:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="40">INFO - Starting server on 0.0.0.0:8000
INFO - Loaded model: dummy_classifier
</pre>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/04/image-7-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="273" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-7-1024x273.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-53447" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-7.png?size=126x34&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-7-300x80.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-7.png?size=378x101&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-7.png?size=504x134&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-7.png?size=630x168&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-7-768x205.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-7-1024x273.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-7-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-7-1536x410.png?lossy=2&amp;strip=1&amp;webp=1 1536w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 1:</strong> Running ML API using Poetry</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Running-via-UV"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Running-via-UV">Running via UV</a></h3>



<p>If you prefer <strong>UV</strong> (super-fast installer by Astral), run:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="41"># Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install project in editable mode
uv pip install -e .

# Start the API
python src/main.py
</pre>



<p>This path is great for users who want lightweight dependency management without Poetry’s abstraction.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Running-Python-MLOps-Projects-PDM"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Running-Python-MLOps-Projects-PDM">Running Python MLOps Projects with PDM</a></h3>



<p>If your workflow uses <strong>PDM</strong>, run:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="43"># Install dependencies
pdm install

# Start the server
pdm run python src/main.py
</pre>



<p>PDM offers a cleaner pyproject-first workflow and works well for CI/CD pipelines that prefer explicit environment setup.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/04/image-8-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="282" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-8-1024x282.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-53450" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-8.png?size=126x35&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-8-300x83.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-8.png?size=378x104&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-8.png?size=504x139&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-8.png?size=630x173&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-8-768x211.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-8-1024x282.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-8-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-8-1536x422.png?lossy=2&amp;strip=1&amp;webp=1 1536w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 2:</strong> Terminal showing a successful server started via PDM dependency resolution.</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Testing-FastAPI-Endpoints-Health-Check-Prediction-API"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Testing-FastAPI-Endpoints-Health-Check-Prediction-API">Testing FastAPI Endpoints: Health Check and Prediction API</a></h3>



<p>Once the server is running, validate the system with 2 quick API calls.</p>



<h4 class="wp-block-heading">Health Check</h4>



<p>Open:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="44">http://localhost:8000/health
</pre>



<p>Expected response:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="45">{"status": "ok"}
</pre>



<p>This confirms:</p>



<ul class="wp-block-list">
<li>the API is reachable</li>



<li>config and logger initialized</li>



<li>FastAPI routes are registered</li>
</ul>



<h4 class="wp-block-heading">Prediction Test</h4>



<p>Send a prediction request:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="46">curl -X POST "http://localhost:8000/predict?input=This+is+good"
</pre>



<p>Expected response:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="47">{"prediction": "positive"}
</pre>



<p>Under the hood:</p>



<ul class="wp-block-list">
<li>the service layer logs the request</li>



<li>the dummy model classifies sentiment</li>



<li>the API returns structured JSON</li>
</ul>
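<p>The flow above can be sketched as plain Python with no web framework. This is a rough illustration with made-up names (<code>predict_sentiment</code>, <code>handle_prediction</code>), not the tutorial's actual code:</p>

```python
import logging

logger = logging.getLogger("ml_service")

# Hypothetical dummy model: keyword matching stands in for a real classifier.
def predict_sentiment(text: str) -> str:
    positive_words = {"good", "great", "excellent", "love"}
    tokens = {token.strip(".,!?").lower() for token in text.split()}
    return "positive" if tokens & positive_words else "negative"

def handle_prediction(text: str) -> dict:
    """Service-layer entry point: log the request, predict, return JSON-able data."""
    logger.info("prediction requested: %r", text)
    return {"prediction": predict_sentiment(text)}

print(handle_prediction("This is good"))  # {'prediction': 'positive'}
```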


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/04/image-9-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="431" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-9-1024x431.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-53452" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-9.png?size=126x53&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-9-300x126.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-9.png?size=378x159&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-9.png?size=504x212&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-9.png?size=630x265&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-9-768x323.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-9-1024x431.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-9-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-9-1536x646.png?lossy=2&amp;strip=1&amp;webp=1 1536w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 3:</strong> Auto-generated documentation for the ML API.</figcaption></figure></div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/04/image-10-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="392" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-10-1024x392.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-53453" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-10.png?size=126x48&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-10-300x115.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-10.png?size=378x145&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-10.png?size=504x193&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-10.png?size=630x241&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-10-768x294.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-10-1024x392.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-10-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-10-1536x588.png?lossy=2&amp;strip=1&amp;webp=1 1536w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 4:</strong> Real terminal output from running the <code>/predict</code> endpoint, validating the end-to-end workflow of the ML API.</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity"/>



<div id="pitch" style="padding: 40px; width: 100%; background-color: #F4F6FA;">
	<h3>What's next? We recommend <a target="_blank" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend">PyImageSearch University</a>.</h3>

	<script src="https://fast.wistia.com/embed/medias/kno0cmko2z.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_kno0cmko2z videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img decoding="async" src="https://fast.wistia.com/embed/medias/kno0cmko2z/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>

	<div style="margin-top: 32px; margin-bottom: 32px; ">
		<strong>Course information:</strong><br/>
		86+ total classes &#8226; 115+ hours of on-demand code walkthrough videos &#8226; Last updated: April 2026<br/>
		<span style="color: #169FE6;">★★★★★</span> 4.84 (128 Ratings) • 16,000+ Students Enrolled
	</div>

	<p><strong>I strongly believe that if you had the right teacher you could <em>master</em> computer vision and deep learning.</strong></p>

	<p>Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?</p>

	<p>That’s <em>not</em> the case.</p>

	<p>All you need to master computer vision and deep learning is for someone to explain things to you in <em>simple, intuitive</em> terms. <em>And that’s exactly what I do</em>. My mission is to change education and how complex Artificial Intelligence topics are taught.</p>

	<p>If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to <em>successfully</em> and <em>confidently</em> apply computer vision to your work, research, and projects. Join me in computer vision mastery.</p>

	<p><strong>Inside PyImageSearch University you'll find:</strong></p>

	<ul style="margin-left: 0px;">
		<li style="list-style: none;">&check; <strong>86+ courses</strong> on essential computer vision, deep learning, and OpenCV topics</li>
		<li style="list-style: none;">&check; <strong>86 Certificates</strong> of Completion</li>
		<li style="list-style: none;">&check; <strong>115+ hours hours</strong> of on-demand video</li>
		<li style="list-style: none;">&check; <strong>Brand new courses released <em>regularly</em></strong>, ensuring you can keep up with state-of-the-art techniques</li>
		<li style="list-style: none;">&check; <strong>Pre-configured Jupyter Notebooks in Google Colab</strong></li>
		<li style="list-style: none;">&check; Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)</li>
		<li style="list-style: none;">&check; Access to <strong>centralized code repos for <em>all</em> 540+ tutorials</strong> on PyImageSearch</li>
		<li style="list-style: none;">&check; <strong> Easy one-click downloads</strong> for code, datasets, pre-trained models, etc.</li>
		<li style="list-style: none;">&check; <strong>Access</strong> on mobile, laptop, desktop, etc.</li>
	</ul>

	<p style="text-align: center;">
		<a target="_blank" class="button link" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend" style="background-color: #6DC713; border-bottom: none;">Click here to join PyImageSearch University</a>
	</p>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Summary"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Summary">Summary</a></h2>



<p>In this lesson, you learned how to build a clean, scalable foundation for ML systems using real software-engineering practices. You now understand why ML projects must be structured like production services — not experiments — if they are ever going to ship reliably.</p>



<p>We began by exploring the <em>why</em>: ML code becomes maintainable only when you enforce clear boundaries between configuration, logic, services, and I/O. That idea naturally led to the <code data-enlighter-language="python" class="EnlighterJSRAW">src/</code> layout, which gave our project a predictable and extensible shape.</p>



<p>You then learned how to manage dependencies using Poetry, UV, or PDM — ensuring that every ML environment is reproducible, isolated, and easy to rebuild. This solved the classic “it works on my machine” trap that haunts ML teams.</p>



<p>Next, we built a robust configuration system using Pydantic <code data-enlighter-language="python" class="EnlighterJSRAW">BaseSettings</code>, merging defaults, YAML files, and <code data-enlighter-language="python" class="EnlighterJSRAW">.env</code> variables into a single typed interface. You now have a configuration pattern used by real-world production ML systems.</p>
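<p>The merging order is the important part: defaults are overridden by file values, which are overridden by environment variables. Stripped of Pydantic, the layering can be sketched with the standard library alone (the keys here are invented for illustration):</p>

```python
import os

DEFAULTS = {"app_name": "ml-service", "log_level": "INFO", "port": 8000}

def load_settings(file_values: dict) -> dict:
    """Merge defaults <- file values <- environment, later sources winning."""
    settings = {**DEFAULTS, **file_values}
    for key in settings:
        env_value = os.environ.get(f"APP_{key.upper()}")
        if env_value is not None:
            # Coerce the env string to the type of the default value.
            settings[key] = type(settings[key])(env_value)
    return settings

os.environ["APP_PORT"] = "9000"
print(load_settings({"log_level": "DEBUG"}))
# {'app_name': 'ml-service', 'log_level': 'DEBUG', 'port': 9000}
```

Pydantic's <code>BaseSettings</code> does this merging (plus validation) for you; the sketch just makes the precedence visible.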



<p>We also implemented structured <strong>logging</strong>, enabling the application to communicate what it’s doing internally — a prerequisite for debugging, observability, and monitoring.</p>
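<p>One common stdlib pattern for structured logs is a JSON formatter: each record becomes one machine-parseable line. A minimal sketch (the lesson's actual logging setup may differ):</p>

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object, easy for log aggregators to parse."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("ml_service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("model loaded")  # emits {"level": "INFO", "logger": "ml_service", "message": "model loaded"}
```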



<p>From there, you built your first production-style ML API with <strong>FastAPI</strong>, complete with <code data-enlighter-language="python" class="EnlighterJSRAW">/health</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">/predict</code>, and auto-generated documentation. You learned how to expose ML logic cleanly, and why APIs are the interface between ML systems and the real world.</p>



<p>We introduced the <strong>Service Layer</strong>, showing how routes should delegate to independent business logic so APIs stay thin and models stay swappable. This design decision is what makes the system testable and future-proof.</p>
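<p>That split can be illustrated in a few lines. The function names here are hypothetical; in FastAPI, the route function would carry <code>@app.post("/predict")</code>, but it is kept plain here so the layering itself is runnable without a server:</p>

```python
def prediction_service(text: str, model) -> dict:
    """Business logic: knows nothing about HTTP, requests, or responses."""
    return {"prediction": model(text)}

def predict_route(text: str) -> dict:
    """Thin route: validate input, delegate to the service, shape the response."""
    if not text:
        raise ValueError("input must be non-empty")
    dummy = lambda t: "positive" if "good" in t.lower() else "negative"
    return prediction_service(text, dummy)

print(predict_route("This is good"))  # {'prediction': 'positive'}
```

Because the service takes the model as an argument, a unit test can pass a stub (e.g., <code>lambda t: "positive"</code>) and exercise the logic without ever starting Uvicorn.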



<p>You then explored <strong>model abstraction</strong>, using a simple dummy model to illustrate how real models (PyTorch, TensorFlow, ONNX, vLLM, Transformers) can be slotted in without changing the API layer.</p>
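<p>That swap-in idea is usually expressed as a small interface. A minimal sketch (class names are ours, not the lesson's):</p>

```python
from abc import ABC, abstractmethod

class SentimentModel(ABC):
    """Interface the API layer depends on; backends are interchangeable."""
    @abstractmethod
    def predict(self, text: str) -> str: ...

class DummyModel(SentimentModel):
    def predict(self, text: str) -> str:
        return "positive" if "good" in text.lower() else "negative"

# A PyTorch-, ONNX-, or Transformers-backed class would subclass SentimentModel
# the same way; the API layer only ever calls `predict`.
model: SentimentModel = DummyModel()
print(model.predict("This is good"))  # positive
```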



<p>Finally, you saw how helper utilities make the system cleaner, and how to run the full application with Poetry, UV, or PDM. The result is a working ML service that looks, behaves, and organizes itself like production-grade software.</p>



<p>By completing this lesson, you’ve built the foundation required for every advanced MLOps practice: testing, performance monitoring, CI/CD, orchestration, and deployment.</p>



<p>You’re now ready for <strong>Lesson 2</strong>, where we transform this service into a fully tested, validated, and performance-monitored ML system.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Citation-Information"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Citation-Information">Citation Information</a></h3>



<p><strong>Singh, V.</strong> “FastAPI for MLOps: Python Project Structure and API Best Practices,” <em>PyImageSearch</em>, S. Huot, A. Sharma, and P. Thakur, eds., 2026, <a href="https://pyimg.co/yn8a5" target="_blank" rel="noreferrer noopener">https://pyimg.co/yn8a5</a></p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="classic" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="FastAPI for MLOps: Python Project Structure and API Best Practices" data-enlighter-group="48">@incollection{Singh_2026_fastapi-for-mlops-python-project-structure,
  author = {Vikram Singh},
  title = {{FastAPI for MLOps: Python Project Structure and API Best Practices}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/yn8a5},
}
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), </strong><em><strong>simply enter your email address in the form below!</strong></em></p>



<div id="download-the-code" class="post-cta-wrap">
<div class="gpd-post-cta">
	<div class="gpd-post-cta-content">
		

			<div class="gpd-post-cta-top">
				<div class="gpd-post-cta-top-image"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1" alt="" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1 410w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=126x174&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=252x348&lossy=2&strip=1&webp=1 252w" sizes="(max-width: 410px) 100vw, 410px" /></div>
				
				<div class="gpd-post-cta-top-title"><h4>Download the Source Code and FREE 17-page Resource Guide</h4></div>
				<div class="gpd-post-cta-top-desc"><p>Enter your email address below to get a .zip of the code and a <strong>FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning.</strong> Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!</p></div>


			</div>

			<div class="gpd-post-cta-bottom">
				<form id="footer-cta-code" class="footer-cta" action="https://www.getdrip.com/forms/4130035/submissions" method="post" target="blank" data-drip-embedded-form="4130035">
					<input name="fields[email]" type="email" value="" placeholder="Your email address" class="form-control" />

					<button type="submit">Download the code!</button>

					<div style="display: none;" aria-hidden="true"><label for="website">Website</label><br /><input type="text" id="website" name="website" tabindex="-1" autocomplete="false" value="" /></div>
				</form>
			</div>


		
	</div>

</div>
</div>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/04/13/fastapi-for-mlops-python-project-structure-and-api-best-practices/">FastAPI for MLOps: Python Project Structure and API Best Practices</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen</title>
		<link>https://pyimagesearch.com/2026/04/06/agentic-ai-vision-system-object-segmentation-with-sam-3-and-qwen/</link>
		
		<dc:creator><![CDATA[Piyush Thakur]]></dc:creator>
		<pubDate>Mon, 06 Apr 2026 13:03:56 +0000</pubDate>
				<category><![CDATA[Agentic AI]]></category>
		<category><![CDATA[Computer Vision]]></category>
		<category><![CDATA[Multimodal AI]]></category>
		<category><![CDATA[Qwen]]></category>
		<category><![CDATA[SAM]]></category>
		<category><![CDATA[Segmentation]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[agentic ai]]></category>
		<category><![CDATA[ai agents]]></category>
		<category><![CDATA[computer vision]]></category>
		<category><![CDATA[deep learning]]></category>
		<category><![CDATA[image segmentation]]></category>
		<category><![CDATA[multimodal ai]]></category>
		<category><![CDATA[open vocabulary segmentation]]></category>
		<category><![CDATA[qwen vl]]></category>
		<category><![CDATA[sam 3]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[vision language model]]></category>
		<guid isPermaLink="false">https://pyimagesearch.com/?p=53357</guid>

					<description><![CDATA[<p>Table of Contents Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen Why Agentic AI Outperforms Traditional Vision Pipelines Why Agentic AI Improves Computer Vision and Segmentation Tasks What We Will Build: An Agentic AI Vision and Segmentation&#8230;</p>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/04/06/agentic-ai-vision-system-object-segmentation-with-sam-3-and-qwen/">Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity" id="TOC"/>




<script src="https://fast.wistia.com/player.js" async></script><script src="https://fast.wistia.com/embed/bj2sx8eu3j.js" async type="module"></script><style>wistia-player[media-id='bj2sx8eu3j']:not(:defined) { background: center / contain no-repeat url('https://fast.wistia.com/embed/medias/bj2sx8eu3j/swatch'); display: block; filter: blur(5px); padding-top:56.25%; }</style> <wistia-player media-id="bj2sx8eu3j" aspect="1.7777777777777777"></wistia-player>



<div class="toc">
<hr class="TOC"/>
<p class="has-large-font-size"><strong>Table of Contents</strong></p>
<ul>
    <li id="TOC-h1-Agentic-AI-Vision-System-Object-Segmentation-with-SAM-3-and-Qwen"><a rel="noopener" target="_blank" href="#h1-Agentic-AI-Vision-System-Object-Segmentation-with-SAM-3-and-Qwen">Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen</a></li>
    <li id="TOC-h2-Why-Agentic-AI-Outperforms-Traditional-Vision-Pipelines"><a rel="noopener" target="_blank" href="#h2-Why-Agentic-AI-Outperforms-Traditional-Vision-Pipelines">Why Agentic AI Outperforms Traditional Vision Pipelines</a></li>
    <li id="TOC-h2-Why-Agentic-AI-Improves-Computer-Vision-and-Segmentation-Tasks"><a rel="noopener" target="_blank" href="#h2-Why-Agentic-AI-Improves-Computer-Vision-and-Segmentation-Tasks">Why Agentic AI Improves Computer Vision and Segmentation Tasks</a></li>
    <li id="TOC-h2-What-We-Will-Build-An-Agentic-AI-Vision-and-Segmentation-System"><a rel="noopener" target="_blank" href="#h2-What-We-Will-Build-An-Agentic-AI-Vision-and-Segmentation-System">What We Will Build: An Agentic AI Vision and Segmentation System</a></li>
    <li id="TOC-h2-Agentic-AI-Workflow-Vision-Language-Reasoning-and-Segmentation-Loop"><a rel="noopener" target="_blank" href="#h2-Agentic-AI-Workflow-Vision-Language-Reasoning-and-Segmentation-Loop">Agentic AI Workflow: Vision-Language Reasoning and Segmentation Loop</a></li>
    <li id="TOC-h2-Agentic-AI-Architecture-Combining-VLMs-and-SAM-3-for-Vision"><a rel="noopener" target="_blank" href="#h2-Agentic-AI-Architecture-Combining-VLMs-and-SAM-3-for-Vision">Agentic AI Architecture: Combining VLMs and SAM 3 for Vision</a></li>
    <ul>
        <li id="TOC-h3-Vision-Language-Model-VLM-The-Reasoning-Component"><a rel="noopener" target="_blank" href="#h3-Vision-Language-Model-VLM-The-Reasoning-Component">Vision-Language Model (VLM): The Reasoning Component</a></li>
        <li id="TOC-h3-SAM-3-Segmentation-Model-Open-Vocabulary-Object-Segmentation"><a rel="noopener" target="_blank" href="#h3-SAM-3-Segmentation-Model-Open-Vocabulary-Object-Segmentation">SAM 3: Open-Vocabulary Object Segmentation</a></li>
        <li id="TOC-h3-The-Agentic-Feedback-Loop-Reasoning-Verification-and-Refinement"><a rel="noopener" target="_blank" href="#h3-The-Agentic-Feedback-Loop-Reasoning-Verification-and-Refinement">The Agentic Feedback Loop: Reasoning, Verification, and Refinement</a></li>
        <li id="TOC-h3-Why-Agentic-Segmentation-Outperforms-One-Shot-Models"><a rel="noopener" target="_blank" href="#h3-Why-Agentic-Segmentation-Outperforms-One-Shot-Models">Why Agentic Segmentation Outperforms One-Shot Models</a></li>
    </ul>
    <li id="TOC-h2-Final-Output-Agentic-Vision-System-with-Segmentation-and-Reasoning"><a rel="noopener" target="_blank" href="#h2-Final-Output-Agentic-Vision-System-with-Segmentation-and-Reasoning">Final Output: Agentic Vision System with Segmentation and Reasoning</a></li>
    <li id="TOC-h2-Key-Takeaway-VLM-SAM-3-Intelligent-Vision-Agent"><a rel="noopener" target="_blank" href="#h2-Key-Takeaway-VLM-SAM-3-Intelligent-Vision-Agent">Key Takeaway: VLM + SAM 3 = Intelligent Vision Agent</a></li>
    <li id="TOC-h2-Configuring-Your-Development-Environment"><a rel="noopener" target="_blank" href="#h2-Configuring-Your-Development-Environment">Configuring Your Development Environment</a></li>
    <li id="TOC-h2-Python-Setup-and-Imports-for-Agentic-AI-Vision-System"><a rel="noopener" target="_blank" href="#h2-Python-Setup-and-Imports-for-Agentic-AI-Vision-System">Python Setup and Imports for Agentic AI Vision System</a></li>
    <li id="TOC-h2-Loading-SAM-3-and-Qwen-Vision-Language-Models-in-Transformers"><a rel="noopener" target="_blank" href="#h2-Loading-SAM-3-and-Qwen-Vision-Language-Models-in-Transformers">Loading SAM 3 and Qwen Vision-Language Models in Transformers</a></li>
    <li id="TOC-h2-Implementing-VLM-Inference-for-Agentic-Vision-Reasoning-with-Qwen25-VL"><a rel="noopener" target="_blank" href="#h2-Implementing-VLM-Inference-for-Agentic-Vision-Reasoning-with-Qwen25-VL">Implementing VLM Inference for Agentic Vision Reasoning with Qwen2.5-VL</a></li>
    <li id="TOC-h2-Implementing-the-SAM-3-Text-Prompted-Segmentation-Function"><a rel="noopener" target="_blank" href="#h2-Implementing-the-SAM-3-Text-Prompted-Segmentation-Function">Implementing the SAM 3 Text-Prompted Segmentation Function</a></li>
    <li id="TOC-h2-Implementing-the-Agentic-AI-Segmentation-Pipeline-with-Iterative-Refinement"><a rel="noopener" target="_blank" href="#h2-Implementing-the-Agentic-AI-Segmentation-Pipeline-with-Iterative-Refinement">Implementing the Agentic AI Segmentation Pipeline with Iterative Refinement</a></li>
    <li id="TOC-h2-Visualizing-and-Saving-the-Segmentation-Results"><a rel="noopener" target="_blank" href="#h2-Visualizing-and-Saving-the-Segmentation-Results">Visualizing and Saving the Segmentation Results</a></li>
    <li id="TOC-h2-Running-the-Agentic-AI-Vision-System-on-Real-Images"><a rel="noopener" target="_blank" href="#h2-Running-the-Agentic-AI-Vision-System-on-Real-Images">Running the Agentic AI Vision System on Real Images</a></li>
    <li id="TOC-h2-Agentic-Segmentation-Output-Iterative-Prompt-Refinement-in-Action"><a rel="noopener" target="_blank" href="#h2-Agentic-Segmentation-Output-Iterative-Prompt-Refinement-in-Action">Agentic Segmentation Output: Iterative Prompt Refinement in Action</a></li>
    <li id="TOC-h2-Summary"><a rel="noopener" target="_blank" href="#h2-Summary">Summary</a></li>
    <ul>
        <li id="TOC-h3-Citation-Information"><a rel="noopener" target="_blank" href="#h3-Citation-Information">Citation Information</a></li>
    </ul>
</ul>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h1-Agentic-AI-Vision-System-Object-Segmentation-with-SAM-3-and-Qwen"/>



<h2 class="wp-block-heading"><a href="#TOC-h1-Agentic-AI-Vision-System-Object-Segmentation-with-SAM-3-and-Qwen">Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen</a></h2>



<p>This lesson is the <strong>4th and final part</strong> of our series on <strong>SAM 3</strong>. In the previous parts, we built a strong foundation for concept-aware segmentation.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/04/building-an-agentic-ai-vision-system-with-sam-3-and-qwen-featured.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="940" height="780" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/building-an-agentic-ai-vision-system-with-sam-3-and-qwen-featured.png?lossy=2&strip=1&webp=1" alt="building-an-agentic-ai-vision-system-with-sam-3-and-qwen-featured.png" class="wp-image-53381" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/building-an-agentic-ai-vision-system-with-sam-3-and-qwen-featured.png?size=126x105&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/building-an-agentic-ai-vision-system-with-sam-3-and-qwen-featured-300x249.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/building-an-agentic-ai-vision-system-with-sam-3-and-qwen-featured.png?size=378x314&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/building-an-agentic-ai-vision-system-with-sam-3-and-qwen-featured.png?size=504x418&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/building-an-agentic-ai-vision-system-with-sam-3-and-qwen-featured.png?size=630x523&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/building-an-agentic-ai-vision-system-with-sam-3-and-qwen-featured-768x637.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/building-an-agentic-ai-vision-system-with-sam-3-and-qwen-featured.png?lossy=2&amp;strip=1&amp;webp=1 940w" sizes="(max-width: 630px) 100vw, 630px" /></a></figure></div>


<p>In <strong><a href="https://pyimg.co/uming" target="_blank" rel="noreferrer noopener">Part 1</a></strong>, we introduced the fundamentals of SAM 3 and explored how it enables <strong>concept-based visual understanding and segmentation</strong>. We moved beyond fixed labels and used natural language to describe objects.</p>



<p>In <strong><a href="https://pyimg.co/5c4ag" target="_blank" rel="noreferrer noopener">Part 2</a></strong>, we extended this idea by introducing <strong>multi-modal prompting and interactive segmentation</strong>. We combined text, points, and bounding boxes to gain more precise control over segmentation.</p>



<p>In <strong><a href="https://pyimg.co/luxfd" target="_blank" rel="noreferrer noopener">Part 3</a></strong>, we extended this into the temporal domain. We applied SAM 3 to videos and built systems for <strong>concept-aware segmentation and object tracking across frames</strong>.</p>



<p>In this final part, we take a major step forward. Instead of treating segmentation as a single-step prediction, we introduce an <strong>agentic AI system</strong> that can reason, verify, and iteratively refine its outputs.</p>



<p>This lesson is the last of a 4-part series on <strong>SAM 3</strong>:</p>



<ol class="wp-block-list">
<li><em><a href="https://pyimg.co/uming" target="_blank" rel="noreferrer noopener">SAM 3: Concept-Based Visual Understanding and Segmentation</a></em></li>



<li><em><a href="https://pyimg.co/5c4ag" target="_blank" rel="noreferrer noopener">Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation</a></em></li>



<li><em><a href="https://pyimg.co/luxfd" target="_blank" rel="noreferrer noopener">SAM 3 for Video: Concept-Aware Segmentation and Object Tracking</a></em></li>



<li><em><strong><a href="https://pyimg.co/ohlwd" target="_blank" rel="noreferrer noopener">Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen</a></strong></em> <strong>(this tutorial)</strong></li>
</ol>



<p><strong>To learn how to build an Agentic AI Vision System with SAM 3 and Qwen,</strong> <em><strong>just keep reading.</strong></em></p>



<div id="pyi-source-code-block" class="source-code-wrap"><div class="gpd-source-code">
    <div class="gpd-source-code-content">
        <img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/source-code-icon.png?lossy=2&strip=1&webp=1" alt="">
        <h4>Looking for the source code to this post?</h4>
                    <a href="#download-the-code" class="pyis-cta-modal-open-modal">Jump Right To The Downloads Section <svg class="svg-icon arrow-right" width="12" height="12" aria-hidden="true" role="img" focusable="false" viewBox="0 0 14 14" fill="none" xmlns="http://www.w3.org/2000/svg"><path d="M6.8125 0.1875C6.875 0.125 6.96875 0.09375 7.09375 0.09375C7.1875 0.09375 7.28125 0.125 7.34375 0.1875L13.875 6.75C13.9375 6.8125 14 6.90625 14 7C14 7.125 13.9375 7.1875 13.875 7.25L7.34375 13.8125C7.28125 13.875 7.1875 13.9062 7.09375 13.9062C6.96875 13.9062 6.875 13.875 6.8125 13.8125L6.1875 13.1875C6.125 13.125 6.09375 13.0625 6.09375 12.9375C6.09375 12.8438 6.125 12.75 6.1875 12.6562L11.0312 7.8125H0.375C0.25 7.8125 0.15625 7.78125 0.09375 7.71875C0.03125 7.65625 0 7.5625 0 7.4375V6.5625C0 6.46875 0.03125 6.375 0.09375 6.3125C0.15625 6.25 0.25 6.1875 0.375 6.1875H11.0312L6.1875 1.34375C6.125 1.28125 6.09375 1.1875 6.09375 1.0625C6.09375 0.96875 6.125 0.875 6.1875 0.8125L6.8125 0.1875Z" fill="#169FE6"></path></svg></a>
            </div>
</div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Why-Agentic-AI-Outperforms-Traditional-Vision-Pipelines"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Why-Agentic-AI-Outperforms-Traditional-Vision-Pipelines">Why Agentic AI Outperforms Traditional Vision Pipelines </a></h2>



<p>Modern computer vision systems are evolving beyond traditional pipelines.</p>



<p>Traditionally, we designed systems where:</p>



<ul class="wp-block-list">
<li>an image is passed to a vision model</li>



<li>the model produces a prediction</li>



<li>the pipeline ends there</li>
</ul>



<p>This approach works well for clearly defined tasks. However, it struggles when tasks require <strong>understanding intent, handling ambiguity, or refining outputs</strong>.</p>



<p>To address this, we now transition toward <strong>agentic AI systems</strong>.</p>



<p>Agentic systems are not limited to a single prediction. Instead, they behave more like an iterative reasoning loop.</p>



<p>They can:</p>



<ul class="wp-block-list">
<li>interpret a user request</li>



<li>select the appropriate models or tools</li>



<li>evaluate intermediate outputs</li>



<li>refine their decisions over multiple steps</li>
</ul>



<p>This shift allows us to build systems that are <strong>adaptive, iterative, and self-correcting</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Why-Agentic-AI-Improves-Computer-Vision-and-Segmentation-Tasks"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Why-Agentic-AI-Improves-Computer-Vision-and-Segmentation-Tasks">Why Agentic AI Improves Computer Vision and Segmentation Tasks </a></h2>



<p>Vision tasks are often ambiguous.</p>



<p>For example, consider the instruction:</p>



<ul class="wp-block-list">
<li><em>“the bag on the leftmost side”</em></li>
</ul>



<p>A traditional segmentation model cannot directly handle this:</p>



<ul class="wp-block-list">
<li>it expects fixed labels like <em>“bag”</em></li>



<li>it does not understand spatial reasoning like <em>“leftmost”</em></li>
</ul>



<p>This is where agentic design becomes powerful.</p>



<p>We introduce a <strong>Vision-Language Model (VLM)</strong> to:</p>



<ul class="wp-block-list">
<li>understand the instruction</li>



<li>extract the correct intent</li>



<li>translate it into a form usable by a segmentation model</li>
</ul>



<p>Then, instead of trusting the output blindly, we:</p>



<ul class="wp-block-list">
<li>verify the result</li>



<li>refine the input if needed</li>



<li>retry the process</li>
</ul>



<p>This creates a loop where the system continuously improves.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-What-We-Will-Build-An-Agentic-AI-Vision-and-Segmentation-System"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-What-We-Will-Build-An-Agentic-AI-Vision-and-Segmentation-System">What We Will Build: An Agentic AI Vision and Segmentation System</a></h2>



<p>In this lesson, we build an <strong>agentic segmentation system</strong> that combines reasoning with perception.</p>



<p>The system takes:</p>



<ul class="wp-block-list">
<li>an image</li>



<li>a natural language instruction</li>
</ul>



<p>and produces:</p>



<ul class="wp-block-list">
<li>segmentation masks</li>



<li>bounding boxes</li>



<li>confidence scores</li>



<li>a final overlay visualization</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Agentic-AI-Workflow-Vision-Language-Reasoning-and-Segmentation-Loop"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Agentic-AI-Workflow-Vision-Language-Reasoning-and-Segmentation-Loop">Agentic AI Workflow: Vision-Language Reasoning and Segmentation Loop</a></h2>



<p>The pipeline follows these steps:</p>



<ul class="wp-block-list">
<li><strong>User Input: </strong>First, we provide an image along with a natural language instruction.</li>



<li><strong>Instruction Understanding (VLM): </strong>Next, the VLM processes both the image and the text. It extracts the core intent and converts it into a short concept.</li>



<li><strong>Concept Simplification: </strong>The system converts complex instructions into concise phrases. For example:
<ul class="wp-block-list">
<li><em>“the bag on the leftmost side” → “leftmost bag”</em></li>
</ul>
</li>



<li><strong>Segmentation </strong><strong>(SAM3): </strong>Then, SAM3 uses this concept to generate:
<ul class="wp-block-list">
<li>segmentation masks</li>



<li>bounding boxes</li>



<li>confidence scores</li>
</ul>
</li>



<li><strong>Verification (VLM): </strong>After segmentation, the VLM evaluates whether the output matches the instruction.</li>



<li><strong>Refinement Loop: </strong>If the result is incorrect:
<ul class="wp-block-list">
<li>the VLM refines the concept</li>



<li>SAM3 runs again</li>



<li>the process repeats</li>
</ul>
</li>



</ul>



<p>This loop continues until the result aligns with the user’s intent.</p>
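<p>The steps above can be condensed into a small control loop. The sketch below uses placeholder callables for the VLM and SAM3 stages (hypothetical names, not the tutorial’s actual functions):</p>

```python
def agentic_loop(image, instruction, extract_concept, segment, verify, refine, max_iters=3):
    """Sketch of the agentic feedback loop.

    The callables stand in for the VLM (extract_concept, verify, refine)
    and SAM3 (segment) stages; max_iters bounds the number of retries.
    """
    concept = extract_concept(image, instruction)
    result = None
    for _ in range(max_iters):
        result = segment(image, concept)
        ok, feedback = verify(image, instruction, result)
        if ok:
            break
        concept = refine(concept, feedback)
    return concept, result
```

<p>Note the bounded retry count: in practice, we cap the loop so a hard instruction fails gracefully instead of iterating forever.</p>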



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Agentic-AI-Architecture-Combining-VLMs-and-SAM-3-for-Vision"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Agentic-AI-Architecture-Combining-VLMs-and-SAM-3-for-Vision">Agentic AI Architecture: Combining VLMs and SAM 3 for Vision</a></h2>



<p>Before implementing the code, we break down the system into its core components.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Vision-Language-Model-VLM-The-Reasoning-Component"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Vision-Language-Model-VLM-The-Reasoning-Component">Vision-Language Model (VLM): The Reasoning Component</a></h3>



<p>The VLM is the <strong>reasoning component</strong> of our system. It performs three key roles:</p>



<p><strong>Instruction Understanding.</strong> It interprets the natural language input in the context of the image.</p>



<p><strong>Concept Generation.</strong> It converts long instructions into short, structured phrases. For example:</p>



<ul class="wp-block-list">
<li><em>“the person wearing a red shirt” → “person red shirt”</em></li>



<li><em>“the car in the background” → “background car”</em></li>
</ul>



<p>This step is critical because segmentation models perform better with:</p>



<ul class="wp-block-list">
<li>short</li>



<li>object-centric</li>



<li>unambiguous phrases</li>
</ul>
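<p>One way to elicit such phrases is a chat prompt that forces a short answer. The wording below is our own sketch (the message structure loosely follows common multimodal chat-template conventions, and is not the tutorial’s exact prompt):</p>

```python
def build_concept_messages(instruction):
    """Build a chat prompt asking the VLM for a short segmentation concept.

    The system message constrains the reply to a short, object-centric
    noun phrase so the downstream segmentation model gets a clean prompt.
    """
    system = (
        "Convert the user's request into a short, object-centric "
        "segmentation concept of 2-4 words. Reply with the phrase only."
    )
    return [
        {"role": "system", "content": system},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": f"Instruction: {instruction}"},
            ],
        },
    ]

messages = build_concept_messages("the bag on the leftmost side")
```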



<p><strong>Result Verification.</strong> After segmentation, the VLM checks:</p>



<ul class="wp-block-list">
<li>whether the correct object was segmented</li>



<li>whether spatial or contextual constraints are satisfied</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-SAM-3-Segmentation-Model-Open-Vocabulary-Object-Segmentation"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-SAM-3-Segmentation-Model-Open-Vocabulary-Object-Segmentation">SAM 3: Open-Vocabulary Object Segmentation</a></h3>



<p>SAM3 acts as the <strong>perception component</strong>.</p>



<p>Unlike traditional segmentation models, SAM3 supports:</p>



<ul class="wp-block-list">
<li>flexible prompts</li>



<li>open-vocabulary segmentation</li>
</ul>



<p>This means we are not restricted to predefined classes.</p>



<p>Given a concept phrase, SAM3 produces:</p>



<ul class="wp-block-list">
<li>pixel-level segmentation masks</li>



<li>bounding boxes</li>



<li>confidence scores</li>
</ul>



<p>This makes SAM3 ideal for integration with a language-based reasoning system.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-The-Agentic-Feedback-Loop-Reasoning-Verification-and-Refinement"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-The-Agentic-Feedback-Loop-Reasoning-Verification-and-Refinement">The Agentic Feedback Loop: Reasoning, Verification, and Refinement</a></h3>



<p>The most important part of this system is the <strong>agentic loop</strong>.</p>



<p>Instead of a linear pipeline, we build a <strong>feedback-driven process</strong>.</p>



<p><strong>Step-by-step:</strong></p>



<ul class="wp-block-list">
<li>Generate a segmentation concept</li>



<li>Run segmentation using SAM3</li>



<li>Evaluate the output using the VLM</li>
</ul>



<p>If the output is incorrect:</p>



<ul class="wp-block-list">
<li>identify what went wrong</li>



<li>refine the concept</li>



<li>retry segmentation</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Why-Agentic-Segmentation-Outperforms-One-Shot-Models"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Why-Agentic-Segmentation-Outperforms-One-Shot-Models">Why Agentic Segmentation Outperforms One-Shot Models</a></h3>



<p>This loop introduces several important capabilities:</p>



<ul class="wp-block-list">
<li><strong>Self-correction: </strong>The system can recover from incorrect predictions</li>



<li><strong>Robustness: </strong>It handles ambiguous or complex instructions better</li>



<li><strong>Generalization: </strong>It works with open-ended language instead of fixed labels</li>



<li><strong>Improved alignment: </strong>Outputs better match user intent over iterations</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Final-Output-Agentic-Vision-System-with-Segmentation-and-Reasoning"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Final-Output-Agentic-Vision-System-with-Segmentation-and-Reasoning">Final Output: Agentic Vision System with Segmentation and Reasoning</a></h2>



<p>By the end of this tutorial, we build a system that:</p>



<ul class="wp-block-list">
<li>understands natural language instructions</li>



<li>converts them into structured segmentation concepts</li>



<li>performs open-vocabulary segmentation</li>



<li>verifies its own outputs</li>



<li>improves results through iterative refinement</li>
</ul>



<p>This represents a shift from static, one-shot predictions to <strong>dynamic, reasoning-driven vision systems</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Key-Takeaway-VLM-SAM-3-Intelligent-Vision-Agent"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Key-Takeaway-VLM-SAM-3-Intelligent-Vision-Agent">Key Takeaway: VLM + SAM 3 = Intelligent Vision Agent</a></h2>



<p>The real power of this system is not just segmentation.</p>



<p>It is the <strong>collaboration between models</strong>:</p>



<ul class="wp-block-list">
<li>the VLM provides reasoning</li>



<li>SAM3 provides perception</li>



<li>the loop provides intelligence</li>
</ul>



<p>Together, they form an <strong>agentic vision system</strong> that can think, act, and improve.</p>






<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Configuring-Your-Development-Environment"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Configuring-Your-Development-Environment">Configuring Your Development Environment</a></h2>



<p>To follow this guide, you need to have the following libraries installed on your system.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="1">!pip install -q transformers accelerate pillow torch torchvision bitsandbytes
</pre>



<p>First, we install the <code data-enlighter-language="python" class="EnlighterJSRAW">transformers</code> library. This library provides access to a wide range of pretrained models, including the Vision-Language Model we will use in this project.</p>



<p>Next, we install <code data-enlighter-language="python" class="EnlighterJSRAW">accelerate</code>, which helps efficiently run large models across GPUs and manage device placement automatically.</p>



<p>After that, we install <code data-enlighter-language="python" class="EnlighterJSRAW">pillow</code>, a lightweight Python library used for image loading and processing. We will use this library to read images and prepare them for model inference.</p>



<p>We also install <code data-enlighter-language="python" class="EnlighterJSRAW">torch</code>, which serves as the core deep learning framework for this project. Both the Vision-Language Model and the segmentation model rely on <code data-enlighter-language="python" class="EnlighterJSRAW">torch</code> for tensor computations and GPU acceleration.</p>



<p>Along with <code data-enlighter-language="python" class="EnlighterJSRAW">torch</code>, we install <code data-enlighter-language="python" class="EnlighterJSRAW">torchvision</code>, which provides datasets, transforms, and model utilities for computer vision tasks.</p>



<p>Finally, we install <code data-enlighter-language="python" class="EnlighterJSRAW">bitsandbytes</code>. This library enables efficient memory usage when working with large models by supporting quantization and optimized GPU kernels.</p>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">-q</code> flag runs the installation in quiet mode, reducing unnecessary output in the notebook.</p>
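<p>Before moving on, an optional sanity check can confirm that every dependency is importable. This helper is our own addition, not part of the tutorial’s pipeline:</p>

```python
import importlib.util

def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Note: "PIL" is the import name for the pillow package.
missing = missing_packages(["transformers", "accelerate", "PIL", "torch", "torchvision"])
print("All dependencies found." if not missing else f"Missing: {missing}")
```

<p>Using <code data-enlighter-language="python" class="EnlighterJSRAW">find_spec</code> avoids actually importing the heavy libraries just to check that they exist.</p>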



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<!-- wp:paragraph -->
<h3>Need Help Configuring Your Development Environment?</h3>
<!-- /wp:paragraph -->

<!-- wp:image {"align":"center","id":18137,"sizeSlug":"large","linkDestination":"custom"} -->
<figure class="wp-block-image aligncenter size-large"><a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank" rel="noreferrer noopener"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-18137" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?lossy=2&strip=1&webp=1 500w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=126x84&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=252x168&lossy=2&strip=1&webp=1 252w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=378x253&lossy=2&strip=1&webp=1 378w" sizes="(max-width: 500px) 100vw, 500px" /></a><figcaption>Having trouble configuring your development environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? Be sure to join <a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">PyImageSearch University</a> — you will be up and running with this tutorial in a matter of minutes. </figcaption></figure>
<!-- /wp:image -->

<!-- wp:paragraph -->
<p>All that said, are you:</p>
<!-- /wp:paragraph -->

<!-- wp:list -->
<ul><li>Short on time?</li><li>Learning on your employer’s administratively locked system?</li><li>Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?</li><li><strong>Ready to run the code immediately on your Windows, macOS, or Linux system?</strong></li></ul>
<!-- /wp:list -->

<!-- wp:paragraph -->
<p>Then join <a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank">PyImageSearch University</a> today!</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p><strong>Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides pre-configured to run on Google Colab’s ecosystem right in your web browser!</strong> No installation required.</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!</p>
<!-- /wp:paragraph -->



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Python-Setup-and-Imports-for-Agentic-AI-Vision-System"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Python-Setup-and-Imports-for-Agentic-AI-Vision-System">Python Setup and Imports for Agentic AI Vision System</a></h2>



<p>Now that our environment is ready, we import the libraries required to build our agentic vision system. These libraries will help us perform deep learning inference, process images, visualize segmentation outputs, and load the models.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="2">import torch
import numpy as np
import os
import json
from PIL import Image, ImageDraw
import matplotlib
import matplotlib.pyplot as plt
from transformers import (
    AutoProcessor,
    Qwen2_5_VLForConditionalGeneration,
    Sam3Model,
    Sam3Processor,
)
</pre>



<p>First, we import <code data-enlighter-language="python" class="EnlighterJSRAW">torch</code>. This is the primary deep learning framework used to run both the Vision-Language Model and the segmentation model. PyTorch handles tensor computations and GPU acceleration during inference.</p>



<p>Next, we import <code data-enlighter-language="python" class="EnlighterJSRAW">numpy</code>, a popular library for numerical computing in Python. We will use NumPy when working with arrays such as segmentation masks and bounding boxes returned by the segmentation model.</p>



<p>After that, we import the <code data-enlighter-language="python" class="EnlighterJSRAW">os</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">json</code> libraries. The <code data-enlighter-language="python" class="EnlighterJSRAW">os</code> module helps us manage file paths and directories, while the <code data-enlighter-language="python" class="EnlighterJSRAW">json</code> module allows us to parse structured responses generated by the Vision-Language Model.</p>



<p>Next, we import <code data-enlighter-language="python" class="EnlighterJSRAW">Image</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">ImageDraw</code> from the <strong>Pillow</strong> library. Pillow is a lightweight image processing library that allows us to load, manipulate, and display images. In this project, we will use it to read input images and create segmentation overlays.</p>



<p>Then, we import <code data-enlighter-language="python" class="EnlighterJSRAW">matplotlib</code>, which we will use to visualize the results. Specifically, we use <code data-enlighter-language="python" class="EnlighterJSRAW">matplotlib.pyplot</code> to create figures that display the original image, bounding boxes, and segmentation masks.</p>



<p>Finally, we import several classes from the <code data-enlighter-language="python" class="EnlighterJSRAW">transformers</code> library. These classes allow us to load and run the models used in our system.</p>



<ul class="wp-block-list">
<li>The <code data-enlighter-language="python" class="EnlighterJSRAW">AutoProcessor</code> class automatically prepares inputs for multimodal models by handling both text and image preprocessing.</li>



<li>The <code data-enlighter-language="python" class="EnlighterJSRAW">Qwen2_5_VLForConditionalGeneration</code> class loads the <strong>Qwen2.5-VL Vision-Language Model</strong>, which will interpret user instructions and generate segmentation prompts.</li>



<li>The <code data-enlighter-language="python" class="EnlighterJSRAW">Sam3Model</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">Sam3Processor</code> classes load the <strong>SAM3 segmentation model</strong> and prepare its inputs.</li>
</ul>



<p>Before loading the models, we configure PyTorch to use optimized GPU settings. These settings help improve inference performance, especially when running large multimodal models.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="3">torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype  = torch.bfloat16 if device == "cuda" else torch.float32
print(f"Using device: {device}, dtype: {dtype}")
</pre>



<p>First, we enable <strong>TensorFloat-32 (TF32)</strong> support in PyTorch. TF32 is a numerical format supported by modern NVIDIA GPUs. It allows faster matrix multiplications during deep learning inference while maintaining good numerical stability. Since large models perform many matrix operations, enabling TF32 can significantly improve performance.</p>



<p>Next, we determine which device will be used for inference. Here, we check whether a CUDA-enabled GPU is available. If a GPU is detected, the system runs on <code data-enlighter-language="python" class="EnlighterJSRAW">"cuda"</code>. Otherwise, it falls back to the CPU.</p>



<p>After that, we configure the <strong>tensor precision</strong>. When running on a GPU, we use <strong>bfloat16 precision</strong>. This reduces memory usage and speeds up computation while preserving enough numerical accuracy for inference tasks.</p>



<p>If the system runs on a CPU, we instead use the standard <strong>float32 precision</strong>, which ensures compatibility with CPU computations.</p>



<p>Finally, we print the device configuration. This helps confirm whether the system is using the GPU and which precision mode is active. This information is useful when debugging performance or memory issues during model inference.</p>
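<p>To see why the precision choice matters, consider a rough back-of-the-envelope estimate of the weight memory for a 7B-parameter model. This is illustrative arithmetic only; activations, the KV cache, and framework overhead add to the real footprint:</p>

```python
# Back-of-the-envelope memory for the weights of a ~7B-parameter model.
# (Illustrative only: activations, KV cache, and CUDA overhead add more.)
params = 7e9
bf16_gb = params * 2 / 1024**3   # bfloat16 stores 2 bytes per parameter
fp32_gb = params * 4 / 1024**3   # float32 stores 4 bytes per parameter
print(f"bfloat16: ~{bf16_gb:.1f} GB   float32: ~{fp32_gb:.1f} GB")
# → bfloat16: ~13.0 GB   float32: ~26.1 GB
```

<p>Halving the bytes per parameter is often the difference between a model fitting on a single consumer GPU or not.</p>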



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Loading-SAM-3-and-Qwen-Vision-Language-Models-in-Transformers"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Loading-SAM-3-and-Qwen-Vision-Language-Models-in-Transformers">Loading SAM 3 and Qwen Vision-Language Models in Transformers</a></h2>



<p>Now that the environment is configured, we load the two core models used in our agentic vision system: a <strong>Vision-Language Model (VLM)</strong> and a <strong>segmentation model</strong>.</p>



<p>The VLM will interpret the user’s instruction and generate a clean segmentation concept. The segmentation model will then use that concept to detect and segment objects in the image.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="4">VLM_MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # swap for Qwen/Qwen3-VL-8B once released in transformers
SAM_MODEL_ID = "facebook/sam3"

print("Loading VLM...")
vlm_processor = AutoProcessor.from_pretrained(VLM_MODEL_ID, trust_remote_code=True)
vlm_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
   VLM_MODEL_ID,
   device_map="auto",
   torch_dtype=dtype,
   trust_remote_code=True,
)
vlm_model.eval()
print("VLM loaded.")

print("Loading SAM3...")
sam_processor = Sam3Processor.from_pretrained(SAM_MODEL_ID)
sam_model = Sam3Model.from_pretrained(SAM_MODEL_ID, torch_dtype=dtype).to(device)
sam_model.eval()
print("SAM3 loaded.")
</pre>



<p>First, we define the model identifiers. These identifiers correspond to the pretrained models hosted on the Hugging Face model hub.</p>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">Qwen2.5-VL-7B-Instruct</code> model is a <strong>Vision-Language Model</strong> capable of understanding both images and text instructions. We will use this model to interpret the user’s request and generate segmentation prompts.</p>



<p>The second model, <strong>SAM3</strong>, is an open-vocabulary segmentation model that can segment objects based on text prompts.</p>



<p>Next, we load the Vision-Language Model. We first load the <strong>processor</strong> associated with the model. The processor prepares the inputs required by the VLM, including tokenizing text prompts and preprocessing images.</p>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">trust_remote_code=True</code> argument allows the Transformers library to load custom processing code provided by the model repository.</p>



<p>Next, we load the model itself. The <code data-enlighter-language="python" class="EnlighterJSRAW">from_pretrained()</code> method downloads the pretrained model weights and initializes the model architecture.</p>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">device_map="auto"</code> argument automatically distributes the model across available devices, which is useful when working with large models that require GPU memory.</p>



<p>We also specify <code data-enlighter-language="python" class="EnlighterJSRAW">torch_dtype=dtype</code>, which ensures the model runs using the precision we configured earlier: <strong>bfloat16 on GPU</strong> or <strong>float32 on CPU</strong>.</p>



<p>After loading the model, we switch it to evaluation mode. Evaluation mode disables training-specific behaviors such as dropout, ensuring consistent inference results.</p>



<p>Next, we load the segmentation model. Similar to the VLM, we first load the <code data-enlighter-language="python" class="EnlighterJSRAW">Sam3Processor</code>. This processor handles preprocessing tasks such as preparing the input image and formatting segmentation prompts.</p>



<p>Next, we load the SAM3 model. The <code data-enlighter-language="python" class="EnlighterJSRAW">from_pretrained()</code> function loads the segmentation model weights, and we move the model to the appropriate device using <code data-enlighter-language="python" class="EnlighterJSRAW">.to(device)</code>.</p>



<p>Finally, we set the model to evaluation mode. At this point, both models are fully initialized. The Vision-Language Model will interpret user instructions, while SAM3 will perform open-vocabulary segmentation based on those instructions.</p>
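<p>As a quick sanity check after loading, you can count parameters to confirm which model ended up in memory. The helper below is our own addition; it relies only on the standard <code data-enlighter-language="python" class="EnlighterJSRAW">parameters()</code>/<code data-enlighter-language="python" class="EnlighterJSRAW">numel()</code> interface that any torch module exposes:</p>

```python
def count_params(model):
    """Total parameter count of a torch-style module.

    Works with any object exposing parameters() yielding tensors
    (or tensor-like objects) with a numel() method.
    """
    return sum(p.numel() for p in model.parameters())

# Hypothetical usage once the models above are loaded:
# print(f"VLM:  {count_params(vlm_model) / 1e9:.2f}B parameters")
# print(f"SAM3: {count_params(sam_model) / 1e9:.2f}B parameters")
```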



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Implementing-VLM-Inference-for-Agentic-Vision-Reasoning-with-Qwen25-VL"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Implementing-VLM-Inference-for-Agentic-Vision-Reasoning-with-Qwen25-VL">Implementing VLM Inference for Agentic Vision Reasoning with Qwen2.5-VL</a></h2>



<p>Now that our models are loaded, we implement a helper function that allows us to run inference using the Vision-Language Model. This function will take an image and a list of chat messages as input and return the model’s response.</p>



<p>In our agentic pipeline, this function plays a very important role. We will use it to:</p>



<ul class="wp-block-list">
<li>extract a clean segmentation prompt from the user instruction</li>



<li>refine prompts if segmentation fails</li>



<li>verify whether the segmentation results match the user intent</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="5">def vlm_generate(image: Image.Image, messages: list, max_new_tokens: int = 512) -> str:
   """
   Mirrors: send_generate_request()
   Runs VLM inference given a list of chat messages and returns the reply string.
   """
   text_input = vlm_processor.apply_chat_template(
       messages, tokenize=False, add_generation_prompt=True
   )
   inputs = vlm_processor(
       text=[text_input],
       images=[image],
       return_tensors="pt",
   )
   inputs = {k: v.to(vlm_model.device) for k, v in inputs.items()}
   input_len = inputs["input_ids"].shape[1]

   with torch.no_grad():
       generated_ids = vlm_model.generate(
           **inputs,
           max_new_tokens=max_new_tokens,
           do_sample=False,
       )

   new_tokens = generated_ids[0][input_len:]
   return vlm_processor.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
</pre>



<p>First, we define the function <code data-enlighter-language="python" class="EnlighterJSRAW">vlm_generate</code>. This function takes three inputs:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">image</code>: the input image that the model will analyze</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">messages</code>: a list of chat-style prompts used to guide the model</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">max_new_tokens</code>: the maximum number of tokens the model can generate</li>
</ul>



<p>The function returns a <strong>string response produced by the Vision-Language Model</strong>.</p>



<p>Next, we convert the chat messages into the format expected by the model. Many modern Vision-Language Models use a <strong>chat-style interface</strong> similar to conversational AI systems. The <code data-enlighter-language="python" class="EnlighterJSRAW">apply_chat_template()</code> method converts the list of messages into a properly formatted text prompt that the model understands.</p>



<p>The argument <code data-enlighter-language="python" class="EnlighterJSRAW">add_generation_prompt=True</code> tells the processor that the model should generate a response after the provided messages.</p>



<p>Next, we prepare the inputs for the model. Here, we pass both the text prompt and the image to the processor. The processor converts these inputs into tensors that can be processed by the model. The argument <code data-enlighter-language="python" class="EnlighterJSRAW">return_tensors="pt"</code> ensures the outputs are returned as <strong>PyTorch tensors</strong>.</p>



<p>Next, we move the tensors to the same device as the model. This step ensures that both the model and the input tensors reside on the same device, either the GPU or CPU.</p>



<p>After that, we store the length of the input tokens. This value helps us determine which tokens belong to the <strong>model&#8217;s generated response</strong>, rather than the original prompt.</p>



<p>Next, we perform inference using the model. We use <code data-enlighter-language="python" class="EnlighterJSRAW">torch.no_grad()</code> to disable gradient computations. Since we are only performing inference, this reduces memory usage and improves performance.</p>



<p>Inside this block, we generate the model’s output. The <code data-enlighter-language="python" class="EnlighterJSRAW">generate()</code> function performs autoregressive text generation. The parameter <code data-enlighter-language="python" class="EnlighterJSRAW">max_new_tokens</code> limits the length of the generated response. We also set <code data-enlighter-language="python" class="EnlighterJSRAW">do_sample=False</code>, which ensures deterministic outputs instead of random sampling.</p>



<p>Next, we extract only the tokens generated by the model. This removes the original prompt tokens, leaving only the newly generated tokens.</p>



<p>Finally, we convert the generated tokens into readable text. The <code data-enlighter-language="python" class="EnlighterJSRAW">decode()</code> method converts token IDs back into text. We also remove special tokens and strip unnecessary whitespace.</p>



<p>At this point, the function returns the <strong>final response generated by the Vision-Language Model</strong>.</p>



<p>This function will serve as the core interface between our agentic system and the Vision-Language Model. In the next sections, we will use it to extract segmentation prompts and evaluate the outputs produced by the segmentation model.</p>
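<p>Because we will often ask the VLM for structured output, a defensive JSON parser is handy: models frequently wrap JSON in Markdown code fences or surrounding prose. This helper is our own sketch (it is why we imported <code data-enlighter-language="python" class="EnlighterJSRAW">json</code> earlier), not a function defined by the models:</p>

```python
import json

def parse_vlm_json(reply):
    """Best-effort extraction of a JSON object from a VLM reply.

    Models often wrap JSON in Markdown code fences or add surrounding
    prose, so we slice from the first "{" to the last "}" before parsing.
    """
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(reply[start : end + 1])
    except json.JSONDecodeError:
        return None
```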



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Implementing-the-SAM-3-Text-Prompted-Segmentation-Function"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Implementing-the-SAM-3-Text-Prompted-Segmentation-Function">Implementing the SAM 3 Text-Prompted Segmentation Function</a></h2>



<p>Now, we implement a helper function that runs segmentation using the SAM3 model. This function will take an input image and optional prompts, run the SAM3 model, and return the segmentation results.</p>



<p>In our agentic pipeline, this function serves as the <strong>tool used by the agent</strong> to perform segmentation.</p>



<p>Specifically, it returns three important outputs:</p>



<ul class="wp-block-list">
<li>segmentation masks</li>



<li>bounding boxes</li>



<li>confidence scores</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="6">def call_sam(
   image: Image.Image,
   text_prompt: str   = None,
   input_boxes        = None,   # list of [x1,y1,x2,y2]
   input_boxes_labels = None,   # list of 0/1 labels per box
   threshold: float   = 0.5,
) -> dict:
   """
   Mirrors: call_sam_service()
   Returns dict with keys: masks, boxes, scores (all as numpy arrays).
   """
   kwargs = dict(images=image, return_tensors="pt")
   if text_prompt:
       kwargs["text"] = text_prompt
   if input_boxes is not None:
       kwargs["input_boxes"] = [input_boxes]
       kwargs["input_boxes_labels"] = [input_boxes_labels or [1] * len(input_boxes)]

   inputs = sam_processor(**kwargs).to(device)

   with torch.no_grad():
       outputs = sam_model(**inputs)

   results = sam_processor.post_process_instance_segmentation(
       outputs,
       threshold=threshold,
       mask_threshold=0.5,
       target_sizes=inputs.get("original_sizes").tolist(),
   )[0]

   return {
       "masks":  results["masks"].cpu().numpy(),                          # [N, H, W] bool
       "boxes":  results["boxes"].cpu().to(torch.float32).numpy(),        # [N, 4]    xyxy
       "scores": results["scores"].cpu().to(torch.float32).numpy(),       # [N]
   }
</pre>



<p>First, we define the function <code data-enlighter-language="python" class="EnlighterJSRAW">call_sam</code>. This function accepts several inputs:</p>



<ul class="wp-block-list">
<li>The <code data-enlighter-language="python" class="EnlighterJSRAW">image</code> parameter is the input image that we want to segment.</li>



<li>The <code data-enlighter-language="python" class="EnlighterJSRAW">text_prompt</code> parameter allows us to perform <strong>concept-based segmentation</strong>. SAM3 can segment objects using natural language prompts such as <code data-enlighter-language="python" class="EnlighterJSRAW">"bag"</code> or <code data-enlighter-language="python" class="EnlighterJSRAW">"leftmost bag"</code>.</li>



<li>The <code data-enlighter-language="python" class="EnlighterJSRAW">input_boxes</code> parameter allows us to guide the segmentation model using bounding boxes. Each box is defined by four coordinates: [x1, y1, x2, y2].</li>



<li>Similarly, <code data-enlighter-language="python" class="EnlighterJSRAW">input_boxes_labels</code> specifies whether each box corresponds to a <strong>positive or negative prompt</strong>.</li>



<li>Finally, the <code data-enlighter-language="python" class="EnlighterJSRAW">threshold</code> parameter determines the confidence threshold used when filtering segmentation results.</li>
</ul>



<p>Next, we prepare the inputs required by the SAM3 processor.</p>



<p>Here, we create a dictionary containing the image input. The <code data-enlighter-language="python" class="EnlighterJSRAW">return_tensors="pt"</code> argument ensures that the processed outputs are returned as <strong>PyTorch tensors</strong>.</p>



<p>If a text prompt is provided, we include it in the input dictionary. This allows SAM3 to perform <strong>text-guided segmentation</strong>.</p>



<p>Next, we check whether bounding boxes are provided. If bounding boxes exist, we pass them to the processor along with their labels. If no labels are specified, we automatically assign <strong>positive labels (1)</strong> to all boxes.</p>
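<p>To make the label-defaulting behavior concrete, here is a tiny standalone sketch (the helper name <code data-enlighter-language="python" class="EnlighterJSRAW">build_box_kwargs</code> is ours, not part of the tutorial code) that mirrors the kwargs-building logic described above:</p>

```python
def build_box_kwargs(input_boxes, input_boxes_labels=None):
    """Mirror of the kwargs-building step in call_sam():
    if no labels are given, every box becomes a positive prompt (1)."""
    if input_boxes is None:
        return {}
    labels = input_boxes_labels or [1] * len(input_boxes)
    # The processor expects one list per image, hence the extra nesting.
    return {
        "input_boxes": [input_boxes],
        "input_boxes_labels": [labels],
    }

boxes = [[10, 20, 110, 220], [30, 40, 130, 240]]
print(build_box_kwargs(boxes)["input_boxes_labels"])        # [[1, 1]]
print(build_box_kwargs(boxes, [1, 0])["input_boxes_labels"])  # [[1, 0]]
```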



<p>Next, we preprocess the inputs using the SAM3 processor. The processor converts the image, prompts, and bounding boxes into tensors that the model can understand. We also move these tensors to the selected device (GPU or CPU).</p>



<p>Now we perform inference using SAM3. We wrap the inference step inside <code data-enlighter-language="python" class="EnlighterJSRAW">torch.no_grad()</code> to disable gradient calculations. Since we are performing inference only, this improves performance and reduces memory usage. The model returns raw segmentation outputs.</p>



<p>Next, we convert the raw model outputs into usable segmentation results. The <code data-enlighter-language="python" class="EnlighterJSRAW">post_process_instance_segmentation()</code> function performs several important tasks:</p>



<ul class="wp-block-list">
<li>filters predictions using the confidence threshold</li>



<li>converts predicted masks to the correct image resolution</li>



<li>extracts bounding boxes and scores</li>
</ul>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">[0]</code> index retrieves the results corresponding to the input image.</p>



<p>Finally, we return the segmentation results. The function returns a dictionary containing three elements.</p>



<ul class="wp-block-list">
<li>The <code data-enlighter-language="python" class="EnlighterJSRAW">masks</code> array contains the segmentation masks with shape [N, H, W], where <strong>N</strong> is the number of detected objects.</li>



<li>The <code data-enlighter-language="python" class="EnlighterJSRAW">boxes</code> array contains the bounding box coordinates in the format: [x1, y1, x2, y2].</li>



<li>Finally, the <code data-enlighter-language="python" class="EnlighterJSRAW">scores</code> array contains the confidence score for each detected object.</li>
</ul>



<p>We also move the tensors to the CPU and convert them into <strong>NumPy arrays</strong>. This makes them easier to process and visualize in later steps.</p>
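<p>Once the masks are plain NumPy arrays, downstream steps become simple array operations. As an illustration (our own example, not part of the tutorial code), here is how a tight [x1, y1, x2, y2] box and a pixel area can be recovered from a single [H, W] boolean mask:</p>

```python
import numpy as np

def mask_to_box(mask: np.ndarray) -> list[int]:
    """Compute a tight xyxy bounding box from a [H, W] boolean mask."""
    ys, xs = np.where(mask)
    # x2/y2 are exclusive here; subtract 1 if you want inclusive coords.
    return [int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1]

mask = np.zeros((6, 8), dtype=bool)
mask[2:5, 3:7] = True              # a 3x4 rectangle of True pixels
print(mask_to_box(mask))           # [3, 2, 7, 5]
print(int(mask.sum()))             # 12 pixels inside the mask
```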



<p>At this point, the <code data-enlighter-language="python" class="EnlighterJSRAW">call_sam()</code> function provides a simple interface for running <strong>SAM3 segmentation</strong> within our agentic vision pipeline.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Implementing-the-Agentic-AI-Segmentation-Pipeline-with-Iterative-Refinement"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Implementing-the-Agentic-AI-Segmentation-Pipeline-with-Iterative-Refinement">Implementing the Agentic AI Segmentation Pipeline with Iterative Refinement</a></h2>



<p>Now we implement the <strong>core function of our system</strong>. This function orchestrates the entire agentic workflow by combining the Vision-Language Model and the segmentation model.</p>



<p>Instead of running segmentation only once, the system follows an <strong>agentic loop</strong> where the Vision-Language Model interprets the user request, runs segmentation, verifies the result, and refines the prompt if needed.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="7">def run_single_image_inference(
   image_path: str,
   user_prompt: str,
   max_agent_rounds: int = 3,
   seg_threshold: float  = 0.5,
   output_dir: str       = "agent_output",
   debug: bool           = True,
) -> str | None:
   """
   Mirrors: run_single_image_inference() from sam3.agent.inference

   Agentic loop:
     Round 1 — VLM reads image + user prompt → produces a concise SAM3 concept phrase
     Round 2 — SAM3 segments with that phrase → VLM verifies / refines if needed
     Round N — repeat until VLM is satisfied or max_agent_rounds reached
   Returns path to the saved output image (or None on failure).
   """
   os.makedirs(output_dir, exist_ok=True)
   image = Image.open(image_path).convert("RGB")

   # ── Round 1: VLM extracts a clean SAM3 text prompt ──────────────────────
   extraction_messages = [
       {
           "role": "system",
           "content": (
               "You are a precise vision assistant. "
               "Your job is to convert a user's free-form description into a SHORT, "
               "clean object concept phrase suitable for an open-vocabulary segmentation model. "
               "Reply with ONLY a JSON object: {\"sam_prompt\": \"&lt;phrase>\"}. "
               "No explanation, no markdown, just the JSON."
           ),
       },
       {
           "role": "user",
           "content": [
               {"type": "image", "image": image},
               {"type": "text",  "text": f"User description: \"{user_prompt}\""},
           ],
       },
   ]

</pre>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">run_single_image_inference</code> function serves as the <strong>main entry point of our agentic vision system</strong>. It accepts several inputs:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">image_path</code>: the path to the image we want to analyze</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">user_prompt</code>: the natural language description of the object to segment</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">max_agent_rounds</code>: the maximum number of refinement iterations</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">seg_threshold</code>: the confidence threshold for segmentation</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">output_dir</code>: the directory where the output image will be saved</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">debug</code>: a flag that enables detailed logging</li>
</ul>



<p>The function returns the <strong>path of the saved output image</strong> or <code data-enlighter-language="python" class="EnlighterJSRAW">None</code> if segmentation fails.</p>



<p>First, we create the output directory and load the image. The <code data-enlighter-language="python" class="EnlighterJSRAW">os.makedirs()</code> function ensures that the output directory exists. If the directory already exists, the <code data-enlighter-language="python" class="EnlighterJSRAW">exist_ok=True</code> argument prevents an error. Next, we open the input image using Pillow and convert it to RGB format.</p>



<p>Here, we define a <strong>system message</strong> that instructs the Vision-Language Model to convert the user description into a short concept phrase. The SAM3 model performs better with <strong>short noun-style prompts</strong> such as: </p>



<ul class="wp-block-list">
<li>leftmost bag</li>



<li>red apple</li>



<li>wooden chair</li>
</ul>



<p>rather than long sentences.</p>



<p>We also include the user input. This message contains both the image and the user instruction. </p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="42" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="8">if debug:
       print(f"\n[Agent] Round 1 — extracting SAM3 prompt from: '{user_prompt}'")

   vlm_reply = vlm_generate(image, extraction_messages)
   if debug:
       print(f"[Agent] VLM raw reply: {vlm_reply}")

   # Parse the JSON; fall back to raw reply if needed
   try:
       clean = vlm_reply.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
       sam_prompt = json.loads(clean)["sam_prompt"]
   except Exception:
       sam_prompt = user_prompt  # graceful fallback
   if debug:
       print(f"[Agent] SAM3 prompt → '{sam_prompt}'")
</pre>



<p>Next, we call the VLM inference function. The Vision-Language Model analyzes the image and generates a <strong>clean segmentation prompt</strong>.</p>



<p>For example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="9">User prompt: "the bag on the leftmost side"
Model output: {"sam_prompt": "leftmost bag"}
</pre>



<p>Next, we extract the segmentation prompt from the JSON response. This step removes formatting artifacts and converts the JSON string into a Python dictionary.</p>



<p>If the response cannot be parsed, we fall back to the original user prompt.</p>
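<p>This parsing step can be factored into a small helper. The sketch below (our own helper, not from the tutorial code) uses <code data-enlighter-language="python" class="EnlighterJSRAW">str.removeprefix</code>/<code data-enlighter-language="python" class="EnlighterJSRAW">removesuffix</code> (Python 3.9+) to peel off Markdown code fences before parsing; note that <code data-enlighter-language="python" class="EnlighterJSRAW">lstrip</code>/<code data-enlighter-language="python" class="EnlighterJSRAW">rstrip</code> would instead strip a <em>character set</em>, which is why prefix/suffix removal is the safer choice here:</p>

```python
import json

def extract_sam_prompt(vlm_reply: str, fallback: str) -> str:
    """Strip optional ```json fences, parse the JSON, and fall back
    to the original user prompt when the reply is not parseable."""
    clean = (
        vlm_reply.strip()
        .removeprefix("```json")
        .removeprefix("```")
        .removesuffix("```")
        .strip()
    )
    try:
        return json.loads(clean)["sam_prompt"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return fallback

print(extract_sam_prompt('```json\n{"sam_prompt": "leftmost bag"}\n```', "bag"))  # leftmost bag
print(extract_sam_prompt("not json at all", "bag"))                               # bag
```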



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="58" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="10"># ── Agentic segmentation loop ────────────────────────────────────────────
   sam_result = None
   final_prompt = sam_prompt

   for round_idx in range(max_agent_rounds):
       if debug:
           print(f"\n[Agent] Round {round_idx + 2} — calling SAM3 with '{final_prompt}'")

       sam_result = call_sam(image, text_prompt=final_prompt, threshold=seg_threshold)
       n_masks = len(sam_result["masks"])
       if debug:
           print(f"[Agent] SAM3 found {n_masks} instance(s)")

</pre>



<p>Now we begin the <strong>agentic segmentation loop</strong>. Here, we initialize two variables:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">sam_result</code>: stores the segmentation output</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">final_prompt</code>: stores the prompt used for segmentation</li>
</ul>



<p>Next, we enter the iterative loop. This loop allows the system to refine segmentation prompts up to a maximum number of rounds. </p>



<p>Inside the loop, we call the SAM3 segmentation function. This function returns segmentation results including masks, bounding boxes, and confidence scores.</p>



<p>Next, we count the number of detected objects. This value helps determine whether the segmentation succeeded.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="71" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="11">       # ── Verification: ask VLM if the result looks right ─────────────────
       if n_masks == 0:
           # No masks found — ask VLM to rephrase
           refine_messages = [
               {
                   "role": "system",
                   "content": (
                       "You are a vision assistant helping refine segmentation prompts. "
                       "The segmentation model found NO objects. "
                       "Suggest a simpler or broader alternative concept phrase. "
                       "Reply ONLY with JSON: {\"sam_prompt\": \"&lt;phrase>\"}."
                   ),
               },
               {
                   "role": "user",
                   "content": [
                       {"type": "image", "image": image},
                       {"type": "text",  "text": (
                           f"Original user intent: \"{user_prompt}\". "
                           f"Failed prompt: \"{final_prompt}\". "
                           "Suggest a better phrase."
                       )},
                   ],
               },
           ]
           vlm_reply = vlm_generate(image, refine_messages)
           if debug:
               print(f"[Agent] VLM refine reply: {vlm_reply}")
           try:
               clean = vlm_reply.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
               final_prompt = json.loads(clean)["sam_prompt"]
           except Exception:
               break  # give up if we can't parse
       </pre>



<p>If SAM3 fails to detect any objects, we ask the Vision-Language Model to refine the segmentation prompt. We construct a new prompt asking the model to generate a <strong>simpler or broader concept phrase</strong>.</p>



<p>For example:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="12">Original prompt: "leftmost brown grocery bag"
Suggested prompt: "bag"
</pre>



<p>The VLM then generates a new segmentation prompt, and the loop repeats.</p>
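<p>The control flow of this "no masks → rephrase → retry" branch can be sketched in isolation. In the snippet below, <code data-enlighter-language="python" class="EnlighterJSRAW">segment</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">rephrase</code> are stand-ins for the SAM3 call and the VLM refine call (our own stubs, assumed for illustration only):</p>

```python
def refine_until_found(prompt, segment, rephrase, max_rounds=3):
    """Skeleton of the zero-mask branch: if segmentation finds nothing,
    ask for a broader phrase and try again, up to max_rounds times."""
    for _ in range(max_rounds):
        n_masks = segment(prompt)
        if n_masks > 0:
            return prompt, n_masks
        prompt = rephrase(prompt)
    return prompt, 0

# Stub models: only the broad phrase "bag" yields a detection,
# and rephrasing drops qualifiers one word at a time.
segment  = lambda p: 1 if p == "bag" else 0
rephrase = lambda p: p.split()[-1]

print(refine_until_found("leftmost brown bag", segment, rephrase))  # ('bag', 1)
```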



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="105" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="13">else:
           # We have masks — ask VLM to verify they match the user intent
           verify_messages = [
               {
                   "role": "system",
                   "content": (
                       "You are a vision QA assistant. "
                       "Given the original user intent and the segmentation result metadata, "
                       "decide if the segmentation is correct. "
                       "Reply ONLY with JSON: {\"ok\": true/false, \"reason\": \"...\", \"sam_prompt\": \"&lt;refined phrase if not ok>\"}."
                   ),
               },
               {
                   "role": "user",
                   "content": [
                       {"type": "image", "image": image},
                       {"type": "text",  "text": (
                           f"User intent: \"{user_prompt}\".\n"
                           f"SAM3 was given prompt: \"{final_prompt}\".\n"
                           f"Result: {n_masks} mask(s) found, "
                           f"scores: {sam_result['scores'].tolist()}, "
                           f"boxes: {sam_result['boxes'].tolist()}.\n"
                           "Is this correct? If yes, ok=true. If not, provide a better sam_prompt."
                       )},
                   ],
               },
           ]
           vlm_reply = vlm_generate(image, verify_messages, max_new_tokens=256)
           if debug:
               print(f"[Agent] VLM verify reply: {vlm_reply}")
           try:
               clean = vlm_reply.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()
               verdict = json.loads(clean)
               if verdict.get("ok", True):
                   if debug:
                       print("[Agent] VLM verified result ✓ — stopping.")
                   break
               else:
                   final_prompt = verdict.get("sam_prompt", final_prompt)
                   if debug:
                       print(f"[Agent] VLM says not ok → retrying with '{final_prompt}'")
           except Exception:
               break  # can't parse verdict, accept current result</pre>



<p>If SAM3 successfully detects objects, we verify whether the result matches the user intent.</p>



<p>In this step, we ask the Vision-Language Model to evaluate the segmentation results.</p>



<p>The model receives:</p>



<ul class="wp-block-list">
<li>the original user instruction</li>



<li>the segmentation prompt used</li>



<li>the number of detected masks</li>



<li>the confidence scores</li>



<li>the bounding boxes</li>
</ul>



<p>Based on this information, the model decides whether the segmentation result is correct.</p>



<p>The model returns a JSON response such as:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="14">{
"ok": true,
"reason": "correct object detected"
}
</pre>



<p>or</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="15">{
"ok": false,
"sam_prompt": "bag"
}
</pre>



<p>If the segmentation is incorrect, the system updates the segmentation prompt. The loop then repeats using the new prompt. If the segmentation result is correct, the loop stops. This verification step allows the system to <strong>self-correct its segmentation decisions</strong>.</p>
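<p>The decision logic for handling the verdict can be isolated in a small helper. This is a sketch of our own (the name <code data-enlighter-language="python" class="EnlighterJSRAW">apply_verdict</code> is not from the tutorial code) that mirrors the break/retry behavior, including accepting the current result when the reply cannot be parsed:</p>

```python
import json

def apply_verdict(verdict_json: str, current_prompt: str):
    """Decide the next action from a VLM verification reply.
    Returns (done, next_prompt)."""
    try:
        verdict = json.loads(verdict_json)
    except json.JSONDecodeError:
        return True, current_prompt          # can't parse → accept as-is
    if verdict.get("ok", True):
        return True, current_prompt          # verified → stop the loop
    return False, verdict.get("sam_prompt", current_prompt)

print(apply_verdict('{"ok": true, "reason": "correct object"}', "leftmost bag"))
# (True, 'leftmost bag')
print(apply_verdict('{"ok": false, "sam_prompt": "bag"}', "leftmost bag"))
# (False, 'bag')
```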



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="149" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="16">   # ── Render and save output ───────────────────────────────────────────────
   if sam_result is None or len(sam_result["masks"]) == 0:
       print("[Agent] No masks produced — check your prompt or image.")
       return None

   output_path = os.path.join(
       output_dir,
       os.path.splitext(os.path.basename(image_path))[0] + "_segmented.png"
   )
   _save_overlay(image, sam_result, output_path, title=f'"{user_prompt}"')
   print(f"\n[Agent] Output saved → {output_path}")
   return output_path
</pre>



<p>After the agentic loop finishes, we check whether segmentation succeeded. If no objects were detected, the function returns <code data-enlighter-language="python" class="EnlighterJSRAW">None</code>. Otherwise, we generate the output image path.</p>
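<p>The output path construction follows a simple stem-plus-suffix pattern. As a standalone sketch (the helper name <code data-enlighter-language="python" class="EnlighterJSRAW">output_path_for</code> is ours), this is the same <code data-enlighter-language="python" class="EnlighterJSRAW">os.path</code> recipe used in the code above:</p>

```python
import os

def output_path_for(image_path: str, output_dir: str = "agent_output") -> str:
    """Derive '<stem>_segmented.png' inside output_dir."""
    stem = os.path.splitext(os.path.basename(image_path))[0]
    return os.path.join(output_dir, stem + "_segmented.png")

print(output_path_for("/content/groceries.jpg"))
# agent_output/groceries_segmented.png (with the OS-native separator)
```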



<p>Finally, we visualize the segmentation results. This function creates an image containing the segmentation masks and bounding boxes. The result is saved to disk.</p>



<p>This function implements the <strong>agentic reasoning loop</strong> that makes our system powerful.</p>



<p>Instead of relying on a single segmentation attempt, the system:</p>



<ul class="wp-block-list">
<li>interprets the user request</li>



<li>generates a segmentation prompt</li>



<li>runs segmentation</li>



<li>evaluates the results</li>



<li>refines the prompt if necessary</li>
</ul>



<p>This iterative process allows the system to produce more accurate results and demonstrates how multiple AI models can collaborate within an <strong>agentic vision pipeline</strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Visualizing-and-Saving-the-Segmentation-Results"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Visualizing-and-Saving-the-Segmentation-Results">Visualizing and Saving the Segmentation Results</a></h2>



<p>After running the agentic segmentation pipeline, we want to visualize the results in a clear and interpretable way. For this purpose, we implement a helper function that overlays the segmentation masks and bounding boxes on top of the original image.</p>



<p>This function generates a side-by-side visualization showing both the detected bounding boxes and the segmentation masks.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="17">def _save_overlay(image: Image.Image, sam_result: dict, output_path: str, title: str = ""):
   masks  = sam_result["masks"]
   boxes  = sam_result["boxes"]
   scores = sam_result["scores"]

   fig, axes = plt.subplots(1, 2, figsize=(16, 8))

   # Left: original + boxes
   axes[0].imshow(image)
   axes[0].set_title(f"Detected boxes  |  {title}", fontsize=11)
   axes[0].axis("off")
   cmap = matplotlib.colormaps.get_cmap("rainbow").resampled(max(len(masks), 1))
   for i, (box, score) in enumerate(zip(boxes, scores)):
       x1, y1, x2, y2 = box
       color = cmap(i)[:3]
       rect = plt.Rectangle(
           (x1, y1), x2 - x1, y2 - y1,
           linewidth=2, edgecolor=color, facecolor="none"
       )
       axes[0].add_patch(rect)
       axes[0].text(x1, y1 - 4, f"{score:.2f}", color=color, fontsize=9, fontweight="bold")

   # Right: mask overlay
   composite = image.convert("RGBA")
   for i, mask in enumerate(masks):
       color = tuple(int(c * 255) for c in cmap(i)[:3])
       mask_img = Image.fromarray((mask * 255).astype(np.uint8))
       overlay  = Image.new("RGBA", composite.size, color + (0,))
       overlay.putalpha(mask_img.point(lambda v: int(v * 0.5)))
       composite = Image.alpha_composite(composite, overlay)

   axes[1].imshow(composite)
   axes[1].set_title(f"SAM3 masks  ({len(masks)} instance(s))", fontsize=11)
   axes[1].axis("off")

   plt.tight_layout()
   plt.savefig(output_path, dpi=150, bbox_inches="tight")
   plt.close()
</pre>



<p>We begin by defining the <code data-enlighter-language="python" class="EnlighterJSRAW">_save_overlay</code> function, which takes the original image, the segmentation output from SAM3, the output path, and an optional title. From the segmentation results, we extract the masks, bounding boxes, and confidence scores. The masks represent pixel-level regions for each detected object, the boxes define object boundaries, and the scores indicate how confident the model is for each detection.</p>



<p>To visualize these results, we create a figure with two side-by-side panels. The left panel displays the original image along with bounding boxes, while the right panel shows the segmentation masks overlaid on the image.</p>



<p>The process starts by rendering the original image and assigning a distinct color to each detected object using a colormap. For every detection, we draw a rectangle corresponding to its bounding box and place the confidence score near it. This provides a quick overview of what the model has detected and how reliable those detections are.</p>



<p>For the mask visualization, the image is first converted to RGBA format so that transparent overlays can be applied. Each segmentation mask is then assigned a color, converted into an image, and used to create a semi-transparent overlay. These overlays are composited onto the original image, allowing the segmented regions to stand out while still preserving the underlying content.</p>



<p>The final composite is displayed in the second panel, along with the number of detected instances. The visualization is then saved to disk using a resolution of 150 DPI for clarity, with <code data-enlighter-language="python" class="EnlighterJSRAW">tight_layout()</code> ensuring proper spacing and <code data-enlighter-language="python" class="EnlighterJSRAW">bbox_inches="tight"</code> removing unnecessary margins. The figure is closed afterward to free up memory.</p>



<p>This results in a clean and intuitive visualization that combines bounding boxes, confidence scores, and segmentation masks, making it easy to verify the model’s predictions.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Running-the-Agentic-AI-Vision-System-on-Real-Images"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Running-the-Agentic-AI-Vision-System-on-Real-Images">Running the Agentic AI Vision System on Real Images</a></h2>



<p>Now that we have implemented all the components of our pipeline, we can run the complete agentic vision system on an example image.</p>



<p>In this step, we provide an image along with a natural language instruction and let the system handle the rest.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="18">output_image_path = run_single_image_inference(
   image_path  = "/content/groceries.jpg",
   user_prompt = "the bag on the leftmost side",
   max_agent_rounds = 3,
   seg_threshold    = 0.5,
   output_dir       = "agent_output",
   debug            = True,
)

if output_image_path:
   img = Image.open(output_image_path)
   img.show()
</pre>



<p>We begin by calling the <code data-enlighter-language="python" class="EnlighterJSRAW">run_single_image_inference()</code> function, which executes the complete agentic pipeline. The input image is provided through the <code data-enlighter-language="python" class="EnlighterJSRAW">image_path</code> parameter, and in this example, we use <code data-enlighter-language="python" class="EnlighterJSRAW">groceries.jpg</code>. Along with the image, we pass a natural language instruction — <em>&#8220;the bag on the leftmost side&#8221;</em>. This instruction is intentionally written in free-form language to demonstrate how the system can interpret human-like queries.</p>



<p>The pipeline is configured to allow up to three refinement iterations using <code data-enlighter-language="python" class="EnlighterJSRAW">max_agent_rounds=3</code>. A confidence threshold of <code data-enlighter-language="python" class="EnlighterJSRAW">0.5</code> is used to filter segmentation results, and the final output is saved to the <code data-enlighter-language="python" class="EnlighterJSRAW">agent_output</code> directory. Debugging is enabled to log intermediate steps such as prompt generation, segmentation outputs, and verification decisions.</p>



<p>Once the pipeline runs, it returns the path to the output image if segmentation is successful. We then load this image using Pillow and display it. The final visualization includes bounding boxes around detected objects, segmentation masks overlaid on the image, and confidence scores for each detection.</p>



<p>Under the hood, the system follows an iterative process. The Vision-Language Model first analyzes the image and converts the user’s instruction into a concise segmentation prompt. This prompt is passed to SAM3, which generates segmentation masks. The result is then evaluated by the Vision-Language Model to determine whether it matches the user’s intent. If the output is not satisfactory, the prompt is refined and the process repeats. Once the result is verified, the system produces the final visualization and saves it to disk.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Agentic-Segmentation-Output-Iterative-Prompt-Refinement-in-Action"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Agentic-Segmentation-Output-Iterative-Prompt-Refinement-in-Action">Agentic Segmentation Output: Iterative Prompt Refinement in Action</a></h2>



<p>The input image <strong>(Figure 1)</strong> shows multiple grocery bags placed inside the trunk of a car.</p>



<p>We provide the following natural language instruction:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="19">"the bag on the leftmost side"
</pre>



<p>This instruction is <strong>not a fixed label</strong>. Instead, it includes <strong>spatial reasoning</strong>, which makes the task more challenging for standard segmentation models.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/04/image-2.jpeg" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="800" height="534" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-2.jpeg?lossy=2&strip=1&webp=1" alt="" class="wp-image-53398" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-2.jpeg?size=126x84&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-2-300x200.jpeg?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-2.jpeg?size=378x252&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-2.jpeg?size=504x336&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-2.jpeg?size=630x421&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-2-768x513.jpeg?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-2.jpeg?lossy=2&amp;strip=1&amp;webp=1 800w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 1:</strong> Input Image (source: <a href="https://github.com/facebookresearch/sam3/blob/main/assets/images/groceries.jpg" target="_blank" rel="noreferrer noopener">Sam3 Official Repo assets</a>)</figcaption></figure></div>


<p>Now let’s examine how the system processes this instruction.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="20">[Agent] Round 1 — extracting SAM3 prompt from: 'the bag on the leftmost side'
[Agent] VLM raw reply: {"sam_prompt": "leftmost paper bag"}
</pre>



<p>First, the Vision-Language Model interprets the instruction and generates an initial segmentation prompt:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="21">[Agent] SAM3 prompt -> 'leftmost paper bag'

[Agent] Round 2 — calling SAM3 with 'leftmost paper bag'
[Agent] SAM3 found 0 instance(s)
</pre>



<p>Next, SAM3 attempts segmentation using this prompt.</p>



<p>However, <strong>no objects are detected</strong>.</p>



<p>This shows an important limitation: <strong>SAM3 is sensitive to how the prompt is phrased.</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="22">[Agent] VLM refine reply: {"sam_prompt": "leftmost brown paper bag"}
</pre>



<p>The system does not stop here.</p>



<p>Instead, the Vision-Language Model <strong>refines the prompt</strong> by adding more descriptive information.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="23">[Agent] Round 3 — calling SAM3 with 'leftmost brown paper bag'
[Agent] SAM3 found 0 instance(s)
</pre>



<p>Again, SAM3 fails to detect any objects.</p>



<p>At this point, we observe something important: <strong>More detailed prompts do not always improve segmentation.</strong></p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="24">[Agent] VLM refine reply: {"sam_prompt": "leftmost bag"}
</pre>



<p>Now, the model simplifies the prompt.</p>



<p>This step is critical. Instead of making the prompt more complex, the system makes it <strong>more general</strong>.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="25">[Agent] Round 4 — calling SAM3 with 'leftmost bag'
[Agent] SAM3 found 1 instance(s)
</pre>



<p>This time, SAM3 successfully detects the object.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="26">[Agent] VLM verify reply: {
 "ok": true,
 "reason": "The segmentation correctly identifies the leftmost bag as per the user's intent."
 "sam_prompt": ""
}
</pre>



<p>Finally, the Vision-Language Model verifies the result and confirms that the segmentation is correct.</p>



<p>The agentic loop stops here, and the system saves the final output image with a bounding box and segmentation mask overlaid on the input image.</p>
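<p>The iterative behavior traced in the logs above can be condensed into a short sketch. This is a hypothetical illustration, not the tutorial's actual implementation: <code>query_vlm</code> and <code>run_sam3</code> are stub functions that mimic the VLM and SAM 3 responses seen in the logs, so the round numbering is simplified.</p>

```python
# Hypothetical sketch of the agentic refinement loop from the logs above.
# query_vlm and run_sam3 are stand-ins for the real Qwen and SAM 3 calls.

def query_vlm(instruction, attempt):
    """Stub VLM: emits the sequence of prompts observed in the logs."""
    candidates = ["leftmost paper bag", "leftmost brown paper bag", "leftmost bag"]
    return candidates[min(attempt, len(candidates) - 1)]

def run_sam3(prompt):
    """Stub segmenter: only the simplified phrasing matches, as in the logs."""
    return 1 if prompt == "leftmost bag" else 0

def agentic_segment(instruction, max_rounds=5):
    """Refine the prompt until SAM 3 finds at least one instance."""
    for attempt in range(max_rounds):
        prompt = query_vlm(instruction, attempt)
        n_instances = run_sam3(prompt)
        print(f"[Agent] Round {attempt + 1} - '{prompt}' -> {n_instances} instance(s)")
        if n_instances > 0:
            return prompt, n_instances  # VLM verification would run here
    return None, 0

best_prompt, found = agentic_segment("the bag on the leftmost side")
```

<p>The key design point the sketch captures is that the loop terminates on success (followed by verification) rather than running a fixed number of rounds.</p>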


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/04/image-5-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="488" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-5-1024x488.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-53403" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-5.png?size=126x60&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-5-300x143.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-5.png?size=378x180&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-5.png?size=504x240&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-5.png?size=630x300&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-5-768x366.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-5-1024x488.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-5-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-5-1536x732.png?lossy=2&amp;strip=1&amp;webp=1 1536w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 2:</strong> Agentic AI Iterative Refinement Output (source: image by the author)</figcaption></figure></div>


<p>The output image <strong>(Figure 3)</strong> shows:</p>



<ul class="wp-block-list">
<li>the detected bounding box around the leftmost bag</li>



<li>the segmentation mask highlighted in color</li>



<li>the correct object selected based on the user’s instruction</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/04/image-6-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="371" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-6-1024x371.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-53406" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-6.png?size=126x46&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-6-300x109.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-6.png?size=378x137&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-6.png?size=504x183&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-6.png?size=630x228&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-6-768x278.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-6-1024x371.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/04/image-6-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 3:</strong> Generated Output with bounding box, mask, confidence score (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity"/>



<div id="pitch" style="padding: 40px; width: 100%; background-color: #F4F6FA;">
	<h3>What's next? We recommend <a target="_blank" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend">PyImageSearch University</a>.</h3>

	<script src="https://fast.wistia.com/embed/medias/kno0cmko2z.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_kno0cmko2z videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img decoding="async" src="https://fast.wistia.com/embed/medias/kno0cmko2z/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>

	<div style="margin-top: 32px; margin-bottom: 32px; ">
		<strong>Course information:</strong><br/>
		86+ total classes &#8226; 115+ hours of on-demand code walkthrough videos &#8226; Last updated: April 2026<br/>
		<span style="color: #169FE6;">★★★★★</span> 4.84 (128 Ratings) • 16,000+ Students Enrolled
	</div>

	<p><strong>I strongly believe that if you had the right teacher you could <em>master</em> computer vision and deep learning.</strong></p>

	<p>Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?</p>

	<p>That’s <em>not</em> the case.</p>

	<p>All you need to master computer vision and deep learning is for someone to explain things to you in <em>simple, intuitive</em> terms. <em>And that’s exactly what I do</em>. My mission is to change education and how complex Artificial Intelligence topics are taught.</p>

	<p>If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to <em>successfully</em> and <em>confidently</em> apply computer vision to your work, research, and projects. Join me in computer vision mastery.</p>

	<p><strong>Inside PyImageSearch University you'll find:</strong></p>

	<ul style="margin-left: 0px;">
		<li style="list-style: none;">&check; <strong>86+ courses</strong> on essential computer vision, deep learning, and OpenCV topics</li>
		<li style="list-style: none;">&check; <strong>86 Certificates</strong> of Completion</li>
		<li style="list-style: none;">&check; <strong>115+ hours hours</strong> of on-demand video</li>
		<li style="list-style: none;">&check; <strong>Brand new courses released <em>regularly</em></strong>, ensuring you can keep up with state-of-the-art techniques</li>
		<li style="list-style: none;">&check; <strong>Pre-configured Jupyter Notebooks in Google Colab</strong></li>
		<li style="list-style: none;">&check; Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)</li>
		<li style="list-style: none;">&check; Access to <strong>centralized code repos for <em>all</em> 540+ tutorials</strong> on PyImageSearch</li>
		<li style="list-style: none;">&check; <strong> Easy one-click downloads</strong> for code, datasets, pre-trained models, etc.</li>
		<li style="list-style: none;">&check; <strong>Access</strong> on mobile, laptop, desktop, etc.</li>
	</ul>

	<p style="text-align: center;">
		<a target="_blank" class="button link" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend" style="background-color: #6DC713; border-bottom: none;">Click here to join PyImageSearch University</a>
	</p>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Summary"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Summary">Summary</a></h2>



<p>In this lesson, we built an <strong>agentic AI vision system</strong> that combines a Vision-Language Model with a segmentation model to solve a real-world problem.</p>



<p>Instead of relying on a single model, we designed a pipeline where multiple components work together in a loop. This allows the system to not only perform segmentation, but also <strong>understand instructions, evaluate results, and improve itself automatically</strong>.</p>



<p>First, we used a Vision-Language Model to interpret the user’s natural language query and convert it into a clean segmentation prompt.</p>



<p>Next, we used SAM3 to perform <strong>open-vocabulary segmentation</strong> using that prompt.</p>



<p>Then, we introduced an agentic loop where the Vision-Language Model verifies the segmentation output and refines the prompt if necessary.</p>



<p>Finally, we visualized the results by overlaying bounding boxes and segmentation masks on the original image.</p>



<p>This approach highlights an important shift in computer vision. Instead of building static pipelines, we are now moving toward <strong>interactive and self-correcting systems</strong> that can adapt to user intent.</p>



<p>Such systems can be extended to a wide range of applications, including:</p>



<ul class="wp-block-list">
<li>interactive image editing</li>



<li>robotics and autonomous perception</li>



<li>visual assistants</li>



<li>multimodal search systems</li>
</ul>



<p>In the future, we can further improve this system by:</p>



<ul class="wp-block-list">
<li>adding support for multiple images or video inputs</li>



<li>integrating more tools into the agent loop</li>



<li>introducing memory for long-term reasoning</li>



<li>optimizing inference for real-time applications</li>
</ul>



<p>By combining Vision-Language Models with powerful segmentation models, we take a step closer to building <strong>intelligent visual systems that can understand and act on human instructions</strong>.</p>



<p>This represents the foundation of next-generation AI systems.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Citation-Information"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Citation-Information">Citation Information</a></h3>



<p><strong>Thakur, P. </strong>“Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen,” <em>PyImageSearch</em>, S. Huot, G. Kudriavtsev, and A. Sharma, eds., 2026, <a href="https://pyimg.co/ohlwd" target="_blank" rel="noreferrer noopener">https://pyimg.co/ohlwd</a> </p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="classic" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen" data-enlighter-group="27">@incollection{Thakur_2026_building-an-agentic-ai-vision-system-with-sam-3-and-qwen,
  author = {Piyush Thakur},
  title = {{Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Georgii Kudriavtsev and Aditya Sharma},
  year = {2026},
  url = {https://pyimg.co/ohlwd},
}
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), </strong><em><strong>simply enter your email address in the form below!</strong></em></p>



<div id="download-the-code" class="post-cta-wrap">
<div class="gpd-post-cta">
	<div class="gpd-post-cta-content">
		

			<div class="gpd-post-cta-top">
				<div class="gpd-post-cta-top-image"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1" alt="" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1 410w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=126x174&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=252x348&lossy=2&strip=1&webp=1 252w" sizes="(max-width: 410px) 100vw, 410px" /></div>
				
				<div class="gpd-post-cta-top-title"><h4>Download the Source Code and FREE 17-page Resource Guide</h4></div>
				<div class="gpd-post-cta-top-desc"><p>Enter your email address below to get a .zip of the code and a <strong>FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning.</strong> Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!</p></div>


			</div>

			<div class="gpd-post-cta-bottom">
				<form id="footer-cta-code" class="footer-cta" action="https://www.getdrip.com/forms/4130035/submissions" method="post" target="blank" data-drip-embedded-form="4130035">
					<input name="fields[email]" type="email" value="" placeholder="Your email address" class="form-control" />

					<button type="submit">Download the code!</button>

					<div style="display: none;" aria-hidden="true"><label for="website">Website</label><br /><input type="text" id="website" name="website" tabindex="-1" autocomplete="false" value="" /></div>
				</form>
			</div>


		
	</div>

</div>
</div>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/04/06/agentic-ai-vision-system-object-segmentation-with-sam-3-and-qwen/">Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3</title>
		<link>https://pyimagesearch.com/2026/03/30/autoregressive-model-limits-and-multi-token-prediction-in-deepseek-v3/</link>
		
		<dc:creator><![CDATA[Puneet Mangla]]></dc:creator>
		<pubDate>Mon, 30 Mar 2026 12:45:00 +0000</pubDate>
				<category><![CDATA[AI Engineering]]></category>
		<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[LLMs]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[autoregressive models]]></category>
		<category><![CDATA[deepseek v3]]></category>
		<category><![CDATA[language modeling]]></category>
		<category><![CDATA[llm training]]></category>
		<category><![CDATA[mla]]></category>
		<category><![CDATA[moe]]></category>
		<category><![CDATA[multi-token prediction]]></category>
		<category><![CDATA[transformer models]]></category>
		<category><![CDATA[tutorial]]></category>
		<guid isPermaLink="false">https://pyimagesearch.com/?p=53306</guid>

					<description><![CDATA[<p>Table of Contents Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3 Why Next-Token Prediction Limits DeepSeek-V3 Multi-Token Prediction in DeepSeek-V3: Predicting Multiple Tokens Ahead DeepSeek-V3 Architecture: Multi-Token Prediction Heads Explained Gradient Insights for Multi-Token Prediction in DeepSeek-V3 DeepSeek-V3 Training vs.&#8230;</p>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/03/30/autoregressive-model-limits-and-multi-token-prediction-in-deepseek-v3/">Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity" id="TOC"/>


<div class="yoast-breadcrumbs"><span><span><a href="https://pyimagesearch.com/">Home</a></span></div>


<div class="toc">
<hr class="TOC"/>
<p class="has-large-font-size"><strong>Table of Contents</strong></p>
<ul>
    <li id="TOC-h1-Autoregressive-Model-Limits-Multi-Token-Prediction-DeepSeek-V3"><a rel="noopener" target="_blank" href="#h1-Autoregressive-Model-Limits-Multi-Token-Prediction-DeepSeek-V3">Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3</a></li>
    <li id="TOC-h2-Why-Next-Token-Prediction-Limits-DeepSeek-V3"><a rel="noopener" target="_blank" href="#h2-Why-Next-Token-Prediction-Limits-DeepSeek-V3">Why Next-Token Prediction Limits DeepSeek-V3</a></li>
    <li id="TOC-h2-Multi-Token-Prediction-DeepSeek-V3-Predicting-Multiple-Tokens-Ahead"><a rel="noopener" target="_blank" href="#h2-Multi-Token-Prediction-DeepSeek-V3-Predicting-Multiple-Tokens-Ahead">Multi-Token Prediction in DeepSeek-V3: Predicting Multiple Tokens Ahead</a></li>
    <li id="TOC-h2-DeepSeek-V3-Architecture-Multi-Token-Prediction-Heads-Explained"><a rel="noopener" target="_blank" href="#h2-DeepSeek-V3-Architecture-Multi-Token-Prediction-Heads-Explained">DeepSeek-V3 Architecture: Multi-Token Prediction Heads Explained</a></li>
    <li id="TOC-h2-Gradient-Insights-Multi-Token-Prediction-DeepSeek-V3"><a rel="noopener" target="_blank" href="#h2-Gradient-Insights-Multi-Token-Prediction-DeepSeek-V3">Gradient Insights for Multi-Token Prediction in DeepSeek-V3</a></li>
    <li id="TOC-h2-DeepSeek-V3-Training-vs-Inference-How-MTP-Changes-Both"><a rel="noopener" target="_blank" href="#h2-DeepSeek-V3-Training-vs-Inference-How-MTP-Changes-Both">DeepSeek-V3 Training vs. Inference: How MTP Changes Both</a></li>
    <li id="TOC-h2-Multi-Token-Prediction-Loss-Weighting-Decay-DeepSeek-V3"><a rel="noopener" target="_blank" href="#h2-Multi-Token-Prediction-Loss-Weighting-Decay-DeepSeek-V3">Multi-Token Prediction Loss Weighting and Decay for DeepSeek-V3</a></li>
    <li id="TOC-h2-Step-by-Step-Implementation-Multi-Token-Prediction-Heads-DeepSeek-V3"><a rel="noopener" target="_blank" href="#h2-Step-by-Step-Implementation-Multi-Token-Prediction-Heads-DeepSeek-V3">Step-by-Step Implementation of Multi-Token Prediction Heads in DeepSeek-V3</a></li>
    <li id="TOC-h2-Integrating-Multi-Token-Prediction-DeepSeek-V3-Core-Transformer"><a rel="noopener" target="_blank" href="#h2-Integrating-Multi-Token-Prediction-DeepSeek-V3-Core-Transformer">Integrating Multi-Token Prediction with DeepSeek-V3’s Core Transformer</a></li>
    <li id="TOC-h2-Theoretical-Foundations-MTP-Curriculum-Learning-Auxiliary-Tasks"><a rel="noopener" target="_blank" href="#h2-Theoretical-Foundations-MTP-Curriculum-Learning-Auxiliary-Tasks">Theoretical Foundations: MTP, Curriculum Learning, and Auxiliary Tasks</a></li>
    <li id="TOC-h2-Multi-Token-Prediction-Benefits-Coherence-Planning-Faster-Convergence"><a rel="noopener" target="_blank" href="#h2-Multi-Token-Prediction-Benefits-Coherence-Planning-Faster-Convergence">Multi-Token Prediction Benefits: Coherence, Planning, and Faster Convergence</a></li>
    <li id="TOC-h2-Summary"><a rel="noopener" target="_blank" href="#h2-Summary">Summary</a></li>
    <ul>
        <li id="TOC-h3-Citation-Information"><a rel="noopener" target="_blank" href="#h3-Citation-Information">Citation Information</a></li>
    </ul>
</ul>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h1-Autoregressive-Model-Limits-Multi-Token-Prediction-DeepSeek-V3"/>



<h2 class="wp-block-heading"><a href="#TOC-h1-Autoregressive-Model-Limits-Multi-Token-Prediction-DeepSeek-V3">Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3</a></h2>



<p>In the first three parts of this series, we built the foundation of DeepSeek-V3 by implementing its configuration and <strong>Rotary Positional Embeddings (RoPE)</strong>, exploring the efficiency gains of <strong>Multi-Head Latent Attention (MLA)</strong>, and scaling capacity through the <strong>Mixture of Experts (MoE)</strong>. Each of these components adds a crucial piece to the puzzle, progressively shaping a model that balances performance, scalability, and efficiency. With these building blocks in place, we are now ready to tackle another defining innovation: <strong>Multi-Token Prediction (MTP)</strong>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/03/autoregressive-model-limits-and-mTP-in-deepseek-v3-featured.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="940" height="780" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/autoregressive-model-limits-and-mTP-in-deepseek-v3-featured.png?lossy=2&strip=1&webp=1" alt="autoregressive-model-limits-and-mTP-in-deepseek-v3-featured.png" class="wp-image-53328" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/autoregressive-model-limits-and-mTP-in-deepseek-v3-featured.png?size=126x105&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/autoregressive-model-limits-and-mTP-in-deepseek-v3-featured-300x249.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/autoregressive-model-limits-and-mTP-in-deepseek-v3-featured.png?size=378x314&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/autoregressive-model-limits-and-mTP-in-deepseek-v3-featured.png?size=504x418&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/autoregressive-model-limits-and-mTP-in-deepseek-v3-featured.png?size=630x523&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/autoregressive-model-limits-and-mTP-in-deepseek-v3-featured-768x637.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/autoregressive-model-limits-and-mTP-in-deepseek-v3-featured.png?lossy=2&amp;strip=1&amp;webp=1 940w" sizes="(max-width: 630px) 100vw, 630px" /></a></figure></div>


<p>Unlike traditional autoregressive models that predict one token at a time, MTP enables DeepSeek-V3 to forecast multiple tokens simultaneously, significantly accelerating training and inference. This approach not only reduces computational overhead but also improves the model’s ability to capture richer contextual patterns across sequences. </p>



<p>In this lesson, we will explore the theory behind MTP, examine why it represents a leap forward in language modeling, and implement it step by step. As with the earlier lessons, this installment continues our broader mission to reconstruct DeepSeek-V3 from scratch, showing how innovations including RoPE, MLA, MoE, and now MTP fit together into a cohesive architecture that will culminate in the assembly and training of the full model.</p>



<p>This lesson is the 4th in a 6-part series on <strong>Building DeepSeek-V3 from Scratch</strong>:</p>



<ol class="wp-block-list">
<li><em><a href="https://pyimg.co/1atre" target="_blank" rel="noreferrer noopener">DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings</a></em> </li>



<li><em><a href="https://pyimg.co/scgjl" target="_blank" rel="noreferrer noopener">Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture</a></em></li>



<li><em><a href="https://pyimg.co/a1w0g" target="_blank" rel="noreferrer noopener">DeepSeek-V3 from Scratch: Mixture of Experts (MoE)</a></em></li>



<li><em><strong><a href="https://pyimg.co/alrep" target="_blank" rel="noreferrer noopener">Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3</a></strong></em> <strong>(this tutorial)</strong></li>



<li><em>Lesson 5</em></li>



<li><em>Lesson 6</em></li>
</ol>



<p><strong>To learn about DeepSeek-V3 and build it from scratch, </strong><em><strong>just keep reading.</strong></em></p>



<div id="pyi-source-code-block" class="source-code-wrap"><div class="gpd-source-code">
    <div class="gpd-source-code-content">
        <img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/source-code-icon.png?lossy=2&strip=1&webp=1" alt="">
        <h4>Looking for the source code to this post?</h4>
                    <a href="#download-the-code" class="pyis-cta-modal-open-modal">Jump Right To The Downloads Section <svg class="svg-icon arrow-right" width="12" height="12" aria-hidden="true" role="img" focusable="false" viewBox="0 0 14 14" fill="none" xmlns="http://www.w3.org/2000/svg"><path d="M6.8125 0.1875C6.875 0.125 6.96875 0.09375 7.09375 0.09375C7.1875 0.09375 7.28125 0.125 7.34375 0.1875L13.875 6.75C13.9375 6.8125 14 6.90625 14 7C14 7.125 13.9375 7.1875 13.875 7.25L7.34375 13.8125C7.28125 13.875 7.1875 13.9062 7.09375 13.9062C6.96875 13.9062 6.875 13.875 6.8125 13.8125L6.1875 13.1875C6.125 13.125 6.09375 13.0625 6.09375 12.9375C6.09375 12.8438 6.125 12.75 6.1875 12.6562L11.0312 7.8125H0.375C0.25 7.8125 0.15625 7.78125 0.09375 7.71875C0.03125 7.65625 0 7.5625 0 7.4375V6.5625C0 6.46875 0.03125 6.375 0.09375 6.3125C0.15625 6.25 0.25 6.1875 0.375 6.1875H11.0312L6.1875 1.34375C6.125 1.28125 6.09375 1.1875 6.09375 1.0625C6.09375 0.96875 6.125 0.875 6.1875 0.8125L6.8125 0.1875Z" fill="#169FE6"></path></svg></a>
            </div>
</div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Why-Next-Token-Prediction-Limits-DeepSeek-V3"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Why-Next-Token-Prediction-Limits-DeepSeek-V3">Why Next-Token Prediction Limits DeepSeek-V3</a></h2>



<p>Traditional language models are trained with a simple objective: given tokens <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8e2/8e2c54736b997eb3d14bcff0dc19966a-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_1, x_2, \ldots, x_t' title='x_1, x_2, \ldots, x_t' class='latex' />, predict the next token <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/940/940d6748ef869ab4c373721ae0be26c6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_{t+1}' title='x_{t+1}' class='latex' />. Mathematically, we maximize:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/7cd/7cde956adb228476aaa85c87e237c052-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\mathcal{L}_\text{standard} = \sum\limits_{t=1}^{T-1} \log P(x_{t+1} \mid x_1, \ldots, x_t)' title='\mathcal{L}_\text{standard} = \sum\limits_{t=1}^{T-1} \log P(x_{t+1} \mid x_1, \ldots, x_t)' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/7cd/7cde956adb228476aaa85c87e237c052-ffffff-000000-0.png?lossy=2&strip=1&webp=1 263w,https://b2633864.smushcdn.com/2633864/wp-content/latex/7cd/7cde956adb228476aaa85c87e237c052-ffffff-000000-0.png?size=126x19&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 263px) 100vw, 263px' />.</p>



<p>This autoregressive factorization is elegant and has proven remarkably effective. However, it has a fundamental limitation: the model only receives a training signal for immediate next-token prediction. It never explicitly learns to plan multiple steps ahead.</p>



<p>Consider generating the sentence: &#8220;The cat sat on the mat because it was comfortable.&#8221; When predicting &#8220;because,&#8221; the model should already be considering how the sentence will complete — including the subordinate clause, the pronoun reference, and the conclusion. But with next-token prediction alone, there&#8217;s no explicit gradient signal encouraging this forward planning. The model might learn it implicitly through exposure to many examples, but we&#8217;re not directly optimizing for it.</p>



<p>This limitation becomes especially apparent in tasks requiring long-term coherence (e.g., story generation, multi-paragraph reasoning, or code generation), where later statements must be consistent with earlier declarations. The model can easily generate locally fluent text that globally contradicts itself because its training objective only looks one token ahead.</p>
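<p>The standard objective above can be made concrete with a toy example. The model and its probabilities below are invented purely for illustration; a real LM conditions on the full prefix, while this stub collapses the context to the previous token.</p>

```python
import math

# Toy next-token model: P(next | previous token). Probabilities are invented.
bigram = {
    ("the", "cat"): 0.5,
    ("cat", "sat"): 0.6,
    ("sat", "on"): 0.7,
}

def standard_log_likelihood(tokens):
    """Sum over t of log P(x_{t+1} | context), the standard objective.

    Here the context is truncated to a single token for illustration.
    """
    return sum(
        math.log(bigram[(tokens[t], tokens[t + 1])])
        for t in range(len(tokens) - 1)
    )

ll = standard_log_likelihood(["the", "cat", "sat", "on"])
```

<p>Note that every term in the sum supervises exactly one step ahead, which is precisely the limitation discussed above: no term rewards the model for anticipating <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/940/940d6748ef869ab4c373721ae0be26c6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_{t+1}' title='x_{t+1}' class='latex' /> beyond the immediate next position.</p>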



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Multi-Token-Prediction-DeepSeek-V3-Predicting-Multiple-Tokens-Ahead"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Multi-Token-Prediction-DeepSeek-V3-Predicting-Multiple-Tokens-Ahead">Multi-Token Prediction in DeepSeek-V3: Predicting Multiple Tokens Ahead</a></h2>



<p>Multi-Token Prediction (<strong>Figure 1</strong>) addresses this by adding auxiliary prediction heads that forecast multiple tokens into the future. Alongside the standard prediction <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/1eb/1eb451ef892fd5af61c38049b2703449-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='P(x_{t+1} \mid x_1, \ldots, x_t)' title='P(x_{t+1} \mid x_1, \ldots, x_t)' class='latex' />, we also predict:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e22/e22febc1baab4bca2fab97a12782d594-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='P(x_{t+2} \mid x_1, \ldots, x_t, x_{t+1})' title='P(x_{t+2} \mid x_1, \ldots, x_t, x_{t+1})' class='latex' />
</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e6d/e6d199b3d4b439a61b5292ee8bdb7435-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='P(x_{t+3} \mid x_1, \ldots, x_t, x_{t+1}, x_{t+2})' title='P(x_{t+3} \mid x_1, \ldots, x_t, x_{t+1}, x_{t+2})' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/e6d/e6d199b3d4b439a61b5292ee8bdb7435-ffffff-000000-0.png?lossy=2&strip=1&webp=1 206w,https://b2633864.smushcdn.com/2633864/wp-content/latex/e6d/e6d199b3d4b439a61b5292ee8bdb7435-ffffff-000000-0.png?size=126x12&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 206px) 100vw, 206px' /></p>



<p>and so on for <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/7b8/7b8b965ad4bca0e41ab51de7b31363a1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='n' title='n' class='latex' /> tokens ahead. Critically, these predictions are computed in parallel during training (not autoregressively) — we know all ground truth tokens, so we can supervise all predictions simultaneously.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/03/image-10.jpeg" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="978" height="452" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-10.jpeg?lossy=2&strip=1&webp=1" alt="" class="wp-image-53333" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-10.jpeg?size=126x58&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-10-300x139.jpeg?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-10.jpeg?size=378x175&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-10.jpeg?size=504x233&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-10.jpeg?size=630x291&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-10-768x355.jpeg?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-10.jpeg?lossy=2&amp;strip=1&amp;webp=1 978w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 1:</strong> Multi-Token Prediction Head (source: <a href="https://arxiv.org/pdf/2401.06066" target="_blank" rel="noreferrer noopener">Dai et al., 2024</a>).</figcaption></figure></div>


<p>The complete training objective becomes:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/153/15346bd71930d8fac9e71641ec046424-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\mathcal{L}_\text{MTP} = \sum\limits_{t=1}^{T-1} \log P(x_{t+1} \mid x_{1:t}) + \sum\limits_{d=1}^{n} \lambda_d \sum\limits_{t=1}^{T-d-1} \log P(x_{t+d+1} \mid x_{1:t}, x_{t+1:t+d})' title='\mathcal{L}_\text{MTP} = \sum\limits_{t=1}^{T-1} \log P(x_{t+1} \mid x_{1:t}) + \sum\limits_{d=1}^{n} \lambda_d \sum\limits_{t=1}^{T-d-1} \log P(x_{t+d+1} \mid x_{1:t}, x_{t+1:t+d})' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/153/15346bd71930d8fac9e71641ec046424-ffffff-000000-0.png?lossy=2&strip=1&webp=1 496w,https://b2633864.smushcdn.com/2633864/wp-content/latex/153/15346bd71930d8fac9e71641ec046424-ffffff-000000-0.png?size=126x10&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/153/15346bd71930d8fac9e71641ec046424-ffffff-000000-0.png?size=252x20&lossy=2&strip=1&webp=1 252w,https://b2633864.smushcdn.com/2633864/wp-content/latex/153/15346bd71930d8fac9e71641ec046424-ffffff-000000-0.png?size=378x30&lossy=2&strip=1&webp=1 378w' sizes='(max-width: 496px) 100vw, 496px' />,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/7b8/7b8b965ad4bca0e41ab51de7b31363a1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='n' title='n' class='latex' /> is the number of future tokens we predict, <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/a5f/a5faa41fc217dda8dfbe1d81c2c19f42-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\lambda_d' title='\lambda_d' class='latex' /> are weighting coefficients (typically decreasing with distance: <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/3d3/3d3e2e10d63baaf2f7176f5bd82586ea-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\lambda_1 &gt; \lambda_2 &gt; \ldots' title='\lambda_1 &gt; \lambda_2 &gt; \ldots' class='latex' />), and we&#8217;ve explicitly shown that predictions at depth <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/827/8277e0910d750195b448797616e091ad-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d' title='d' class='latex' /> condition on both the context up to position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' /> and the intermediate tokens up to <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/37b/37bdc4a278b3d8dd4f843794c789a033-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t+d' title='t+d' class='latex' />.</p>
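<p>As a concrete sketch of this weighted objective, the snippet below fakes the per-head logits with random tensors (a real model would produce them via the MTP heads described below) and accumulates the weighted losses. The sizes and the <code data-enlighter-language="python" class="EnlighterJSRAW">lambdas</code> values are illustrative, not DeepSeek-V3&#8217;s actual hyperparameters.</p>

```python
import torch
import torch.nn.functional as F

B, T, V, n = 2, 16, 100, 3              # toy sizes; n future depths beyond the main head
tokens = torch.randint(0, V, (B, T))
lambdas = [0.5 ** d for d in range(n)]  # illustrative decreasing weights for depths 1..n

# Random placeholders for what the main head and each MTP head would output.
main_logits = torch.randn(B, T - 1, V)                                # predicts x_{t+1}
mtp_logits = [torch.randn(B, T - 1 - d, V) for d in range(1, n + 1)]  # predicts x_{t+d+1}

loss = F.cross_entropy(main_logits.reshape(-1, V), tokens[:, 1:].reshape(-1))
for d, (lam, logits) in enumerate(zip(lambdas, mtp_logits), start=1):
    targets = tokens[:, d + 1:].reshape(-1)  # ground-truth tokens x_{t+d+1}
    loss = loss + lam * F.cross_entropy(logits.reshape(-1, V), targets)
```

<p>Note how each additional depth loses one valid position at the end of the sequence, matching the <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b0c/b0c453d8de3950e1c5097f75ea6c5502-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T-1' title='T-1' class='latex' /> upper limit shrinking to <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/a63/a632a6a07d149a53c3c98882c179fe7c-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T-2' title='T-2' class='latex' /> and beyond in the sums above.</p>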



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-DeepSeek-V3-Architecture-Multi-Token-Prediction-Heads-Explained"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-DeepSeek-V3-Architecture-Multi-Token-Prediction-Heads-Explained">DeepSeek-V3 Architecture: Multi-Token Prediction Heads Explained</a></h2>



<p>Implementing MTP requires architectural additions. We can&#8217;t just reuse the main language modeling head for future predictions — we need to condition on the intermediate tokens. DeepSeek-V3 implements this through a hierarchy of prediction heads, each specialized for a particular future depth.</p>



<p><strong>Head Architecture:</strong> For predicting <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/827/8277e0910d750195b448797616e091ad-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d' title='d' class='latex' /> tokens ahead, we have a head <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/d56/d563484809a1a3a3748792b97f5bcbc7-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='H_d' title='H_d' class='latex' /> that combines:</p>



<ul class="wp-block-list">
<li>The hidden representation from the Transformer at position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' />: <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6c4/6c4ff69dbcc329835a33b80fe3a145c7-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='h_t' title='h_t' class='latex' /></li>



<li>The embedding of the token at position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/437/43726c0aa6585148ea3eb449a7410096-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t + d' title='t + d' class='latex' />: <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/c4b/c4b0e62323abea033bad10af0c0403d6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='e_{t+d}' title='e_{t+d}' class='latex' /></li>
</ul>



<p>The combination follows:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/873/87311a912d887a21eef148ef0f02d713-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='h_t^{(d)} = \text{Combine}(h_t, e_{t+d})' title='h_t^{(d)} = \text{Combine}(h_t, e_{t+d})' class='latex' /></p>



<p>This combined representation is then processed through a mini-Transformer (lightweight attention and feedforward layers) before projecting to the vocabulary:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/1c0/1c030207bbcc7ce4138ddd74e7aadff5-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='h_t^{(d)} = h_t^{(d)} + \text{Attention}(h_t^{(d)})' title='h_t^{(d)} = h_t^{(d)} + \text{Attention}(h_t^{(d)})' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/1c0/1c030207bbcc7ce4138ddd74e7aadff5-ffffff-000000-0.png?lossy=2&strip=1&webp=1 197w,https://b2633864.smushcdn.com/2633864/wp-content/latex/1c0/1c030207bbcc7ce4138ddd74e7aadff5-ffffff-000000-0.png?size=126x13&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 197px) 100vw, 197px' /></p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/52b/52b54a3a5d95bb8fe66dd801cf8ab21e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='h_t^{(d)} = h_t^{(d)} + \text{MoE}(h_t^{(d)})' title='h_t^{(d)} = h_t^{(d)} + \text{MoE}(h_t^{(d)})' class='latex' /></p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/423/423718972e71df9fe72144b471ea64f2-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{logits}_{t+d+1} = h_t^{(d)} W_\text{vocab}' title='\text{logits}_{t+d+1} = h_t^{(d)} W_\text{vocab}' class='latex' /></p>



<p>The intuition is powerful: to predict token <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/613/613cc12f2c214aa5aba3fd31daf6930e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t+d+1' title='t+d+1' class='latex' />, we start with the representation at position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' /> (encoding all context), incorporate the embedding of token <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/37b/37bdc4a278b3d8dd4f843794c789a033-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t+d' title='t+d' class='latex' /> (telling us what word we&#8217;ve just generated), process through a small Transformer (allowing the model to refine this combination), and project to vocabulary (producing logits over the vocabulary). This architecture naturally encourages forward planning — the model must learn representations at position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' /> that are useful for predictions multiple steps ahead.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Gradient-Insights-Multi-Token-Prediction-DeepSeek-V3"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Gradient-Insights-Multi-Token-Prediction-DeepSeek-V3">Gradient Insights for Multi-Token Prediction in DeepSeek-V3</a></h2>



<p>From an optimization perspective, MTP provides richer gradient signals. In standard training, only the hidden representation <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6c4/6c4ff69dbcc329835a33b80fe3a145c7-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='h_t' title='h_t' class='latex' /> receives gradients from predicting <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/940/940d6748ef869ab4c373721ae0be26c6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_{t+1}' title='x_{t+1}' class='latex' />. With MTP, <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6c4/6c4ff69dbcc329835a33b80fe3a145c7-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='h_t' title='h_t' class='latex' /> also receives gradients from predicting <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/771/771ac46505e058c79416f172638bd9fd-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_{t+2}, x_{t+3}, \ldots ' title='x_{t+2}, x_{t+3}, \ldots ' class='latex' />. These additional gradients encourage <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6c4/6c4ff69dbcc329835a33b80fe3a145c7-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='h_t' title='h_t' class='latex' /> to encode information relevant not just for the immediate next token, but for multiple future tokens.</p>
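<p>This extra gradient flow is easy to verify in isolation. In the toy check below, a single shared hidden state feeds two linear heads (crude stand-ins for the main head and one MTP head, not the article&#8217;s modules); adding the second head&#8217;s loss changes the gradient that reaches the hidden state.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, V = 32, 50
h_t = torch.randn(1, D, requires_grad=True)  # stand-in for a shared hidden state

main_head = nn.Linear(D, V)  # predicts x_{t+1}
mtp_head = nn.Linear(D, V)   # crude stand-in for a depth-1 MTP head

# Gradient on h_t from the main next-token loss alone.
loss_main = F.cross_entropy(main_head(h_t), torch.tensor([3]))
grad_main = torch.autograd.grad(loss_main, h_t, retain_graph=True)[0]

# Adding the MTP loss sends extra gradient signal into the same h_t.
loss_total = loss_main + 0.5 * F.cross_entropy(mtp_head(h_t), torch.tensor([7]))
grad_total = torch.autograd.grad(loss_total, h_t)[0]

assert not torch.allclose(grad_main, grad_total)  # the future loss reshapes h_t's gradient
```
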



<p>Moreover, the gradients from future predictions flow through different pathways — through the MTP heads&#8217; mini-Transformers. This creates a form of multi-task learning in which different prediction depths impose distinct consistency constraints on the learned representations. A representation that works well for predicting 1 token ahead might not be good for predicting 5 tokens ahead; MTP encourages learning representations that support both.</p>



<p>We can think of this as adding an implicit regularizer. The additional prediction objectives constrain the learned representations to be more structured, more forward-looking, and more globally coherent. It&#8217;s similar in spirit to multi-task learning, where auxiliary tasks improve representation quality even if we care primarily about one main task.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-DeepSeek-V3-Training-vs-Inference-How-MTP-Changes-Both"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-DeepSeek-V3-Training-vs-Inference-How-MTP-Changes-Both">DeepSeek-V3 Training vs. Inference: How MTP Changes Both</a></h2>



<p><strong>During Training</strong>: We compute all predictions in parallel. For a sequence of length <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b9e/b9ece18c950afbfa6b0fdbfa4ff731d3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T' title='T' class='latex' />, we predict:</p>



<ul class="wp-block-list">
<li><strong>Main head:</strong> positions 1 through <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b0c/b0c453d8de3950e1c5097f75ea6c5502-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T-1' title='T-1' class='latex' /> predict positions 2 through <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b9e/b9ece18c950afbfa6b0fdbfa4ff731d3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T' title='T' class='latex' /></li>



<li><strong>Depth-1 head:</strong> positions 1 through <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/a63/a632a6a07d149a53c3c98882c179fe7c-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T-2' title='T-2' class='latex' /> predict positions 3 through <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b9e/b9ece18c950afbfa6b0fdbfa4ff731d3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T' title='T' class='latex' /></li>



<li><strong>Depth-2 head:</strong> positions 1 through <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e89/e89bf3da0eaa846fce835629bdc861c6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T-3' title='T-3' class='latex' /> predict positions 4 through <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b9e/b9ece18c950afbfa6b0fdbfa4ff731d3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T' title='T' class='latex' /></li>
</ul>



<p>Each prediction uses the ground truth intermediate tokens (available during training), so there&#8217;s no error accumulation. The losses are computed independently and summed with appropriate weights.</p>
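<p>The bookkeeping above boils down to shifted slices of the same token batch. A small helper (hypothetical, with a toy sequence) makes the target construction for each depth explicit:</p>

```python
import torch

def mtp_targets(tokens, depth):
    """Targets for a head predicting depth+1 tokens ahead.

    depth=0 is the main head (predicts x_{t+1}); depth=d predicts x_{t+d+1}.
    Each extra depth loses one valid position at the end of the sequence.
    """
    return tokens[:, depth + 1:]

tokens = torch.arange(10).unsqueeze(0)  # one toy sequence: tokens 0..9
print(mtp_targets(tokens, 0))  # main head: targets start at token 1
print(mtp_targets(tokens, 1))  # depth-1 head: targets start at token 2
print(mtp_targets(tokens, 2))  # depth-2 head: targets start at token 3
```
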



<p><strong>During Inference:</strong> Interestingly, MTP heads are typically not used during autoregressive generation. Once training is complete, we generate text using only the main prediction head in the standard autoregressive manner. The MTP heads have served their purpose by improving the learned representations; we don&#8217;t need their multi-step predictions at inference time.</p>



<p>This is computationally appealing: we get the benefits of MTP (better representations, improved coherence) during training, but inference remains as efficient as a standard language model. There&#8217;s no additional computational cost at deployment.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Multi-Token-Prediction-Loss-Weighting-Decay-DeepSeek-V3"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Multi-Token-Prediction-Loss-Weighting-Decay-DeepSeek-V3">Multi-Token Prediction Loss Weighting and Decay for DeepSeek-V3</a></h2>



<p>The weighting coefficients <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/a5f/a5faa41fc217dda8dfbe1d81c2c19f42-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\lambda_d' title='\lambda_d' class='latex' /> are important hyperparameters. Intuitively, predictions further in the future are harder and less reliable, so we should weight them less heavily. A common scheme is exponential decay:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/1a3/1a3a2e60da0426591ab6be4156be2572-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\lambda_d = \beta^{d-1}' title='\lambda_d = \beta^{d-1}' class='latex' /></p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/f12/f1202bbb73858018622ad4c94aa0ff8e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='0 {\ &lt;\ } \beta {\ &lt;\ } 1' title='0 {\ &lt;\ } \beta {\ &lt;\ } 1' class='latex' />. For example, with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e3e/e3e2fcba65c4e7a857af2c743759b0ba-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\beta = 0.5' title='\beta = 0.5' class='latex' />:</p>



<ul class="wp-block-list">
<li><strong>Depth 1</strong> (predicting <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/44c/44c03d73c7504ed0cfc0dba08a961d04-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t+2' title='t+2' class='latex' /> from <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' />): weight 1.0</li>



<li><strong>Depth 2</strong> (predicting <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/17f/17fb101194dfcbe03db6a5341642cdad-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t+3' title='t+3' class='latex' /> from <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' />): weight 0.5</li>



<li><strong>Depth 3</strong> (predicting <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/05d/05d886190c10a680ff24f16ac2a6071e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t+4' title='t+4' class='latex' /> from <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' />): weight 0.25</li>
</ul>



<p>In our implementation, we use a simpler approach: uniform weighting of 0.3 for all MTP losses relative to the main loss. This is less sophisticated but easier to tune and still provides the core benefits.</p>
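<p>Both schemes are one-liners. The sketch below contrasts the exponential decay described above with the flat weighting we use (the function names are ours, for illustration):</p>

```python
def exponential_weights(n, beta=0.5):
    """lambda_d = beta ** (d - 1) for depths d = 1..n (decays with distance)."""
    return [beta ** (d - 1) for d in range(1, n + 1)]

def uniform_weights(n, weight=0.3):
    """Flat weighting relative to the main loss, as in our implementation."""
    return [weight] * n

print(exponential_weights(3))  # [1.0, 0.5, 0.25]
print(uniform_weights(3))      # [0.3, 0.3, 0.3]
```
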



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Step-by-Step-Implementation-Multi-Token-Prediction-Heads-DeepSeek-V3"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Step-by-Step-Implementation-Multi-Token-Prediction-Heads-DeepSeek-V3">Step-by-Step Implementation of Multi-Token Prediction Heads in DeepSeek-V3</a></h2>



<p>Let&#8217;s implement the complete MTP system:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3" data-enlighter-group="1">class MultiTokenPredictionHead(nn.Module):
    """
    Multi-Token Prediction Head

    Each head predicts a token at a specific future position.
    Combines previous hidden state with future token embedding.
    """
    def __init__(self, config: DeepSeekConfig, depth: int):
        super().__init__()
        self.depth = depth
        self.n_embd = config.n_embd

        # Combine previous hidden state with future token embedding
        self.combine_proj = nn.Linear(2 * config.n_embd, config.n_embd, bias=config.bias)

        # Normalization
        self.norm1 = RMSNorm(config.n_embd)
        self.norm2 = RMSNorm(config.n_embd)

        # Transformer components (mini-transformer for each head)
        self.attn = MultiheadLatentAttention(config)
        self.mlp = MixtureOfExperts(config)
        self.attn_norm = RMSNorm(config.n_embd)
        self.mlp_norm = RMSNorm(config.n_embd)

</pre>



<p><strong>Lines 1-24: Prediction Head Structure.</strong> Each <code data-enlighter-language="python" class="EnlighterJSRAW">MultiTokenPredictionHead</code> is specialized for a particular depth — head 1 predicts 1 token ahead, head 2 predicts 2 tokens ahead, etc. We store the depth for potential depth-conditional processing (though we don&#8217;t use it in this simple implementation).</p>



<p>The architecture has 3 main components: a combination projection that merges the hidden state and future token embeddings, normalization layers for stabilization, and a mini-Transformer consisting of an attention module and an MoE. This mini-Transformer is complete but lightweight — it has the same architecture as our main model blocks but serves a specialized purpose.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="26" data-enlighter-title="Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3" data-enlighter-group="2">    def forward(self, prev_hidden, future_token_embed):
        """
        Args:
            prev_hidden: [B, T, D] - Hidden states from previous layer
            future_token_embed: [B, T, D] - Embeddings of future tokens

        Returns:
            hidden: [B, T, D] - Processed hidden states
        """
        # Normalize inputs
        prev_norm = self.norm1(prev_hidden)
        future_norm = self.norm2(future_token_embed)

        # Combine representations
        combined = torch.cat([prev_norm, future_norm], dim=-1)
        hidden = self.combine_proj(combined)

        # Process through mini-transformer
        hidden = hidden + self.attn(self.attn_norm(hidden))
        moe_out, _ = self.mlp(self.mlp_norm(hidden))
        hidden = hidden + moe_out

        return hidden
</pre>



<p><strong>Lines 26-41: The Combination Strategy.</strong> The forward method takes two inputs: <code data-enlighter-language="python" class="EnlighterJSRAW">prev_hidden</code> (the hidden representation at position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' />, encoding all context up to that point) and <code data-enlighter-language="python" class="EnlighterJSRAW">future_token_embed</code> (the embedding of the token at position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/37b/37bdc4a278b3d8dd4f843794c789a033-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t+d' title='t+d' class='latex' />, providing information about what&#8217;s been generated). We normalize both inputs independently — this prevents scale mismatches between the hidden representations (which may have grown or shrunk through many Transformer layers) and the embeddings (which come fresh from the embedding layer). We concatenate along the feature dimension, doubling the dimensionality, then project back to <code data-enlighter-language="python" class="EnlighterJSRAW">n_embd</code> dimensions. This projection learns how to merge content from these two different sources.</p>



<p><strong>Lines 44-46: Mini-Transformer Processing.</strong> The combined representation flows through a lightweight Transformer. First, attention with a residual connection: the model can attend across the sequence, allowing position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' /> to gather information from other positions when predicting <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/613/613cc12f2c214aa5aba3fd31daf6930e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t+d+1' title='t+d+1' class='latex' />. This is crucial because the prediction might depend on context earlier in the sequence. Then, MoE with a residual connection: the expert networks can apply non-linear transformations, refining the combined representation. The use of the same MLA attention and MoE that we&#8217;ve already implemented is elegant — we&#8217;re reusing well-tested components. The pre-norm architecture (normalizing before attention and MoE rather than after) has become standard in modern Transformers for training stability.</p>



<p><strong>Line 48: Returning Refined Hidden State.</strong> The output hidden state has the same dimensionality as the input (<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/646/6469a03ebce607f5e9fc3cca520cc84a-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{model}' title='d_\text{model}' class='latex' />), so it can be projected through the vocabulary matrix to get logits for predicting <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/094/094488e54e4c20547f97672a13d6f249-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_{t+d+1}' title='x_{t+d+1}' class='latex' />. This hidden state has been enriched with information from both the context (via <code data-enlighter-language="python" class="EnlighterJSRAW">prev_hidden</code>) and the intermediate token (via <code data-enlighter-language="python" class="EnlighterJSRAW">future_token_embed</code>), and has been refined through attention and expert processing. It represents the model&#8217;s best understanding of what should come next-next, not just next.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Integrating-Multi-Token-Prediction-DeepSeek-V3-Core-Transformer"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Integrating-Multi-Token-Prediction-DeepSeek-V3-Core-Transformer">Integrating Multi-Token Prediction with DeepSeek-V3’s Core Transformer</a></h2>



<p>The MTP heads integrate into the main model during training. After computing the final hidden states <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/568/5682e80f7c49a85c2dcce39e8233c18f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='h_1, h_2, \ldots, h_T' title='h_1, h_2, \ldots, h_T' class='latex' /> from the main Transformer, we apply the following operations:</p>



<ul class="wp-block-list">
<li><strong>Main prediction:</strong> Project <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6c4/6c4ff69dbcc329835a33b80fe3a145c7-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='h_t' title='h_t' class='latex' /> to vocabulary to predict <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/940/940d6748ef869ab4c373721ae0be26c6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_{t+1}' title='x_{t+1}' class='latex' />, compute cross-entropy loss</li>



<li><strong>Depth-1 prediction:</strong> For each position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' />, get embedding of <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/940/940d6748ef869ab4c373721ae0be26c6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_{t+1}' title='x_{t+1}' class='latex' /> (ground truth), combine with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6c4/6c4ff69dbcc329835a33b80fe3a145c7-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='h_t' title='h_t' class='latex' /> through head 1, project to vocabulary to predict <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/248/248947317fb471f4124642cc0848175f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_{t+2}' title='x_{t+2}' class='latex' />, compute cross-entropy loss</li>



<li><strong>Depth-2 prediction:</strong> For each position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' />, get embedding of <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/248/248947317fb471f4124642cc0848175f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_{t+2}' title='x_{t+2}' class='latex' /> (ground truth), combine with head-1 output, project to vocabulary to predict <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/97c/97ca68b679b3640aa4c517e1ef952bb7-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_{t+3}' title='x_{t+3}' class='latex' />, compute cross-entropy loss</li>
</ul>



<p>The key insight is that we chain the heads: head 2’s input includes head 1’s output. This creates a hierarchical structure in which each head builds on the previous one, progressively looking further into the future.</p>
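<p>A sketch of this chained, training-time computation is below. The <code data-enlighter-language="python" class="EnlighterJSRAW">ToyMTPHead</code> stands in for the full <code data-enlighter-language="python" class="EnlighterJSRAW">MultiTokenPredictionHead</code> (combine step only, no mini-Transformer), and the loose <code data-enlighter-language="python" class="EnlighterJSRAW">embed</code>/<code data-enlighter-language="python" class="EnlighterJSRAW">lm_head</code> arguments are assumptions about how the surrounding model exposes its embedding table and vocabulary projection.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMTPHead(nn.Module):
    """Stand-in for MultiTokenPredictionHead: combine step only, no mini-transformer."""
    def __init__(self, d_model):
        super().__init__()
        self.combine_proj = nn.Linear(2 * d_model, d_model)

    def forward(self, prev_hidden, future_token_embed):
        return self.combine_proj(torch.cat([prev_hidden, future_token_embed], dim=-1))

def mtp_loss(hidden, tokens, embed, lm_head, mtp_heads, mtp_weight=0.3):
    """hidden: [B, T, D] final states from the main transformer; tokens: [B, T]."""
    B, T, D = hidden.shape

    # Main head: position t predicts x_{t+1}.
    logits = lm_head(hidden[:, :-1, :])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

    # Chain the heads: head d consumes head d-1's output, truncated to valid positions.
    prev = hidden
    for d, head in enumerate(mtp_heads, start=1):
        L = T - d - 1                                            # positions with a valid target
        prev = head(prev[:, :L, :], embed(tokens[:, d:d + L]))   # condition on x_{t+d}
        logits = lm_head(prev)                                   # predict x_{t+d+1}
        loss = loss + mtp_weight * F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            tokens[:, d + 1:d + 1 + L].reshape(-1))
    return loss

# Toy usage: two chained heads over random hidden states.
B, T, D, V = 2, 12, 16, 50
embed, lm_head = nn.Embedding(V, D), nn.Linear(D, V)
heads = nn.ModuleList([ToyMTPHead(D), ToyMTPHead(D)])
loss = mtp_loss(torch.randn(B, T, D), torch.randint(0, V, (B, T)), embed, lm_head, heads)
```

<p>Because <code data-enlighter-language="python" class="EnlighterJSRAW">prev</code> is reused across iterations, depth-2 predictions genuinely pass through head 1 first — the hierarchical structure described above, not independent parallel heads.</p>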



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Theoretical-Foundations-MTP-Curriculum-Learning-Auxiliary-Tasks"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Theoretical-Foundations-MTP-Curriculum-Learning-Auxiliary-Tasks">Theoretical Foundations: MTP, Curriculum Learning, and Auxiliary Tasks</a></h2>



<p>MTP has interesting theoretical connections to other areas of machine learning:</p>



<p><strong>Temporal Difference Learning:</strong> In reinforcement learning, temporal difference learning propagates value information backward from future states. MTP does something analogous — it propagates gradient information backward from future predictions, encouraging current representations to encode future-relevant information.</p>



<p><strong>Auxiliary Tasks:</strong> MTP can be viewed as an auxiliary task framework in which the auxiliary tasks are future token predictions. Research in multi-task learning shows that auxiliary tasks improve representation quality when they are related but distinct from the main task. Future token prediction is perfectly related (it is the same task at different time steps) but distinct (it requires different information).</p>



<p><strong>Curriculum Learning:</strong> The depth-weighted loss structure implements a form of curriculum — we emphasize near-future predictions (easier, more reliable) more than far-future predictions (harder, noisier). This gradually increasing difficulty may help training by first learning short-term dependencies before tackling long-term structure.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Multi-Token-Prediction-Benefits-Coherence-Planning-Faster-Convergence"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Multi-Token-Prediction-Benefits-Coherence-Planning-Faster-Convergence">Multi-Token Prediction Benefits: Coherence, Planning, and Faster Convergence</a></h2>



<p>Research on Multi-Token Prediction shows several empirical benefits:</p>



<ul class="wp-block-list">
<li><strong>Improved Coherence:</strong> Models trained with MTP generate more globally coherent text, with fewer contradictions or topic drift over long generations</li>



<li><strong>Better Planning:</strong> For tasks like story writing or code generation, where early decisions constrain later possibilities, MTP helps the model make forward-compatible choices</li>



<li><strong>Faster Convergence:</strong> The additional training signals can accelerate learning, reaching target performance with fewer training steps</li>



<li><strong>Regularization:</strong> MTP acts as a regularizer, preventing overfitting by encouraging representations that support multiple related objectives</li>
</ul>



<p>However, MTP also has costs. Training becomes more complex — we must manage multiple prediction heads and carefully weight their losses. Training is slower — computing multiple predictions per position increases computation by a factor of roughly <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/34c/34c666dcd14f84cdeb371f25688bebb8-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='1 + n/2' title='1 + n/2' class='latex' /> for <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/7b8/7b8b965ad4bca0e41ab51de7b31363a1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='n' title='n' class='latex' /> future tokens (the factor is not linear because not all positions can predict <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/7b8/7b8b965ad4bca0e41ab51de7b31363a1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='n' title='n' class='latex' /> tokens ahead). Memory usage increases due to the additional heads&#8217; parameters.</p>



<p>The tradeoff is typically favorable for larger models and longer-form generation tasks. For small models or short-sequence tasks, the overhead may outweigh the benefits. In our children&#8217;s story generation task, MTP should help with maintaining narrative consistency across a story.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<div id="pitch" style="padding: 40px; width: 100%; background-color: #F4F6FA;">
	<h3>What's next? We recommend <a target="_blank" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend">PyImageSearch University</a>.</h3>

	<script src="https://fast.wistia.com/embed/medias/kno0cmko2z.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_kno0cmko2z videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img decoding="async" src="https://fast.wistia.com/embed/medias/kno0cmko2z/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>

	<div style="margin-top: 32px; margin-bottom: 32px; ">
		<strong>Course information:</strong><br/>
		86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: April 2026<br/>
		<span style="color: #169FE6;">★★★★★</span> 4.84 (128 Ratings) • 16,000+ Students Enrolled
	</div>

	<p><strong>I strongly believe that if you had the right teacher you could <em>master</em> computer vision and deep learning.</strong></p>

	<p>Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?</p>

	<p>That’s <em>not</em> the case.</p>

	<p>All you need to master computer vision and deep learning is for someone to explain things to you in <em>simple, intuitive</em> terms. <em>And that’s exactly what I do</em>. My mission is to change education and how complex Artificial Intelligence topics are taught.</p>

	<p>If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to <em>successfully</em> and <em>confidently</em> apply computer vision to your work, research, and projects. Join me in computer vision mastery.</p>

	<p><strong>Inside PyImageSearch University you'll find:</strong></p>

	<ul style="margin-left: 0px;">
		<li style="list-style: none;">&check; <strong>86+ courses</strong> on essential computer vision, deep learning, and OpenCV topics</li>
		<li style="list-style: none;">&check; <strong>86 Certificates</strong> of Completion</li>
		<li style="list-style: none;">&check; <strong>115+ hours hours</strong> of on-demand video</li>
		<li style="list-style: none;">&check; <strong>Brand new courses released <em>regularly</em></strong>, ensuring you can keep up with state-of-the-art techniques</li>
		<li style="list-style: none;">&check; <strong>Pre-configured Jupyter Notebooks in Google Colab</strong></li>
		<li style="list-style: none;">&check; Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)</li>
		<li style="list-style: none;">&check; Access to <strong>centralized code repos for <em>all</em> 540+ tutorials</strong> on PyImageSearch</li>
		<li style="list-style: none;">&check; <strong> Easy one-click downloads</strong> for code, datasets, pre-trained models, etc.</li>
		<li style="list-style: none;">&check; <strong>Access</strong> on mobile, laptop, desktop, etc.</li>
	</ul>

	<p style="text-align: center;">
		<a target="_blank" class="button link" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend" style="background-color: #6DC713; border-bottom: none;">Click here to join PyImageSearch University</a>
	</p>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Summary"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Summary">Summary</a></h2>



<p>In the first three lessons of this series, we progressively assembled the foundations of DeepSeek-V3: starting with its configuration and <strong>Rotary Positional Embeddings (RoPE)</strong>, then advancing to the efficiency of <strong>Multi-Head Latent Attention (MLA)</strong>, and scaling capacity through the <strong>Mixture of Experts (MoE)</strong>. Each of these innovations has added a crucial piece to the architecture, balancing efficiency, scalability, and representational power. With those components in place, we turn to another breakthrough that redefines how language models learn and generate text: <strong>Multi-Token Prediction (MTP)</strong>.</p>



<p>Traditional autoregressive models rely on next-token prediction, a strategy that, while effective, can be shortsighted — focusing only on immediate context rather than broader sequence-level patterns. MTP addresses this limitation by enabling the model to predict multiple tokens ahead, accelerating training and inference while enriching contextual understanding. In this lesson, we explore the shortcomings of next-token prediction, introduce the architecture of specialized prediction heads, and examine why MTP works from a gradient perspective.</p>



<p>We then dive into practical considerations (e.g., weighted loss, decay strategies, and implementation details), before integrating MTP into the main model. By the end, we see how this innovation not only improves efficiency but also strengthens the theoretical and empirical foundations of DeepSeek-V3, bringing us closer to assembling the complete architecture.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Citation-Information"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Citation-Information">Citation Information</a></h3>



<p><strong>Mangla, P.</strong> “Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3,” <em>PyImageSearch</em>, S. Huot, A. Sharma, and P. Thakur, eds., 2026, <a href="https://pyimg.co/alrep" target="_blank" rel="noreferrer noopener">https://pyimg.co/alrep</a></p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="classic" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3" data-enlighter-group="3">@incollection{Mangla_2026_autoregressive-model-limits-and-mTP-in-deepseek-v3,
  author = {Puneet Mangla},
  title = {{Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/alrep},
}
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), </strong><em><strong>simply enter your email address in the form below!</strong></em></p>



<div id="download-the-code" class="post-cta-wrap">
<div class="gpd-post-cta">
	<div class="gpd-post-cta-content">
		

			<div class="gpd-post-cta-top">
				<div class="gpd-post-cta-top-image"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1" alt="" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1 410w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=126x174&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=252x348&lossy=2&strip=1&webp=1 252w" sizes="(max-width: 410px) 100vw, 410px" /></div>
				
				<div class="gpd-post-cta-top-title"><h4>Download the Source Code and FREE 17-page Resource Guide</h4></div>
				<div class="gpd-post-cta-top-desc"><p>Enter your email address below to get a .zip of the code and a <strong>FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning.</strong> Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!</p></div>


			</div>

			<div class="gpd-post-cta-bottom">
				<form id="footer-cta-code" class="footer-cta" action="https://www.getdrip.com/forms/4130035/submissions" method="post" target="blank" data-drip-embedded-form="4130035">
					<input name="fields[email]" type="email" value="" placeholder="Your email address" class="form-control" />

					<button type="submit">Download the code!</button>

					<div style="display: none;" aria-hidden="true"><label for="website">Website</label><br /><input type="text" id="website" name="website" tabindex="-1" autocomplete="false" value="" /></div>
				</form>
			</div>


		
	</div>

</div>
</div>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/03/30/autoregressive-model-limits-and-multi-token-prediction-in-deepseek-v3/">Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>DeepSeek-V3 from Scratch: Mixture of Experts (MoE)</title>
		<link>https://pyimagesearch.com/2026/03/23/deepseek-v3-from-scratch-mixture-of-experts-moe/</link>
		
		<dc:creator><![CDATA[Puneet Mangla]]></dc:creator>
		<pubDate>Mon, 23 Mar 2026 12:45:00 +0000</pubDate>
				<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[DeepSeek]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Neural Networks]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[deepseek-v3]]></category>
		<category><![CDATA[expert routing]]></category>
		<category><![CDATA[expert specialization]]></category>
		<category><![CDATA[load balancing]]></category>
		<category><![CDATA[mixture of experts]]></category>
		<category><![CDATA[moe]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[pytorch]]></category>
		<category><![CDATA[swiglu]]></category>
		<category><![CDATA[transformer]]></category>
		<category><![CDATA[tutorial]]></category>
		<guid isPermaLink="false">https://pyimagesearch.com/?p=53251</guid>

					<description><![CDATA[<p>Table of Contents DeepSeek-V3 from Scratch: Mixture of Experts (MoE) The Scaling Challenge in Neural Networks Mixture of Experts (MoE): Mathematical Foundation and Routing Mechanism SwiGLU Activation in DeepSeek-V3: Improving MoE Non-Linearity Shared Expert in DeepSeek-V3: Universal Processing in MoE&#8230;</p>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/03/23/deepseek-v3-from-scratch-mixture-of-experts-moe/">DeepSeek-V3 from Scratch: Mixture of Experts (MoE)</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity" id="TOC"/>




<div class="toc">
<hr class="TOC"/>
<p class="has-large-font-size"><strong>Table of Contents</strong></p>
<ul>
    <li id="TOC-h1-DeepSeek-V3-from-Scratch-Mixture-of-Experts-MoE"><a rel="noopener" target="_blank" href="#h1-DeepSeek-V3-from-Scratch-Mixture-of-Experts-MoE">DeepSeek-V3 from Scratch: Mixture of Experts (MoE)</a></li>
    <li id="TOC-h2-The-Scaling-Challenge-in-Neural-Networks"><a rel="noopener" target="_blank" href="#h2-The-Scaling-Challenge-in-Neural-Networks">The Scaling Challenge in Neural Networks</a></li>
    <li id="TOC-h2-Mixture-of-Experts-MoE-Mathematical-Foundation-and-Routing-Mechanism"><a rel="noopener" target="_blank" href="#h2-Mixture-of-Experts-MoE-Mathematical-Foundation-and-Routing-Mechanism">Mixture of Experts (MoE): Mathematical Foundation and Routing Mechanism</a></li>
    <li id="TOC-h2-SwiGLU-Activation-in-DeepSeek-V3-Improving-MoE-Non-Linearity"><a rel="noopener" target="_blank" href="#h2-SwiGLU-Activation-in-DeepSeek-V3-Improving-MoE-Non-Linearity">SwiGLU Activation in DeepSeek-V3: Improving MoE Non-Linearity</a></li>
    <li id="TOC-h2-Shared-Expert-in-DeepSeek-V3-Universal-Processing-in-MoE-Layers"><a rel="noopener" target="_blank" href="#h2-Shared-Expert-in-DeepSeek-V3-Universal-Processing-in-MoE-Layers">Shared Expert in DeepSeek-V3: Universal Processing in MoE Layers</a></li>
    <li id="TOC-h2-Auxiliary-Loss-Free-Load-Balancing-in-DeepSeek-V3-MoE"><a rel="noopener" target="_blank" href="#h2-Auxiliary-Loss-Free-Load-Balancing-in-DeepSeek-V3-MoE">Auxiliary-Loss-Free Load Balancing in DeepSeek-V3 MoE</a></li>
    <li id="TOC-h2-Sequence-Wise-Load-Balancing-for-Mixture-of-Experts-Models"><a rel="noopener" target="_blank" href="#h2-Sequence-Wise-Load-Balancing-for-Mixture-of-Experts-Models">Sequence-Wise Load Balancing for Mixture of Experts Models</a></li>
    <li id="TOC-h2-Expert-Specialization-in-MoE-Emergent-Behavior-in-DeepSeek-V3"><a rel="noopener" target="_blank" href="#h2-Expert-Specialization-in-MoE-Emergent-Behavior-in-DeepSeek-V3">Expert Specialization in MoE: Emergent Behavior in DeepSeek-V3</a></li>
    <li id="TOC-h2-Implementation-Building-the-DeepSeek-V3-MoE-Layer-from-Scratch"><a rel="noopener" target="_blank" href="#h2-Implementation-Building-the-DeepSeek-V3-MoE-Layer-from-Scratch">Implementation: Building the DeepSeek-V3 MoE Layer from Scratch</a></li>
    <li id="TOC-h2-MoE-Design-Decisions-in-DeepSeek-V3-SwiGLU-Shared-Experts-and-Routing"><a rel="noopener" target="_blank" href="#h2-MoE-Design-Decisions-in-DeepSeek-V3-SwiGLU-Shared-Experts-and-Routing">MoE Design Decisions in DeepSeek-V3: SwiGLU, Shared Experts, and Routing</a></li>
    <li id="TOC-h2-MoE-Computational-and-Memory-Analysis-in-DeepSeek-V3"><a rel="noopener" target="_blank" href="#h2-MoE-Computational-and-Memory-Analysis-in-DeepSeek-V3">MoE Computational and Memory Analysis in DeepSeek-V3</a></li>
    <li id="TOC-h2-MoE-Expert-Specialization-in-Practice-Real-World-Behavior"><a rel="noopener" target="_blank" href="#h2-MoE-Expert-Specialization-in-Practice-Real-World-Behavior">MoE Expert Specialization in Practice: Real-World Behavior</a></li>
    <li id="TOC-h2-Training-Dynamics-of-MoE-Load-Balancing-and-Expert-Utilization"><a rel="noopener" target="_blank" href="#h2-Training-Dynamics-of-MoE-Load-Balancing-and-Expert-Utilization">Training Dynamics of MoE: Load Balancing and Expert Utilization</a></li>
    <li id="TOC-h2-Mixture-of-Experts-vs-Related-Techniques-Switch-Transformers-and-Sparse-Models"><a rel="noopener" target="_blank" href="#h2-Mixture-of-Experts-vs-Related-Techniques-Switch-Transformers-and-Sparse-Models">Mixture of Experts vs Related Techniques: Switch Transformers and Sparse Models</a></li>
    <li id="TOC-h2-Summary"><a rel="noopener" target="_blank" href="#h2-Summary">Summary</a>
        <ul>
            <li id="TOC-h3-Citation-Information"><a rel="noopener" target="_blank" href="#h3-Citation-Information">Citation Information</a></li>
        </ul>
    </li>
</ul>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h1-DeepSeek-V3-from-Scratch-Mixture-of-Experts-MoE"/>



<h2 class="wp-block-heading"><a href="#TOC-h1-DeepSeek-V3-from-Scratch-Mixture-of-Experts-MoE">DeepSeek-V3 from Scratch: Mixture of Experts (MoE)</a></h2>



<p>In the first two parts of this series, we established the foundations of DeepSeek-V3 by implementing its core configuration and positional encoding, followed by a deep dive into <strong>Multi-Head Latent Attention (MLA)</strong>. Together, these components set the stage for a model that is both efficient and capable of handling long-range dependencies. With those building blocks in place, we now explore another key innovation in DeepSeek-V3: the <strong>Mixture of Experts (MoE)</strong>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/03/deepseek-v3-from-scratch-moe-featured.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="940" height="780" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-from-scratch-moe-featured.png?lossy=2&strip=1&webp=1" alt="deepseek-v3-from-scratch-moe-featured.png" class="wp-image-53267" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-from-scratch-moe-featured.png?size=126x105&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-from-scratch-moe-featured-300x249.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-from-scratch-moe-featured.png?size=378x314&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-from-scratch-moe-featured.png?size=504x418&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-from-scratch-moe-featured.png?size=630x523&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-from-scratch-moe-featured-768x637.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-from-scratch-moe-featured.png?lossy=2&amp;strip=1&amp;webp=1 940w" sizes="(max-width: 630px) 100vw, 630px" /></a></figure></div>


<p>MoE introduces a dynamic way of scaling model capacity without proportionally increasing computational cost. Instead of activating every parameter for every input, the model selectively routes tokens through specialized “expert” networks, allowing it to expand representational power while keeping inference efficient. In this lesson, we’ll unpack the theory behind MoE, explain how expert routing works, and then implement it step by step. This installment continues our broader goal of reconstructing DeepSeek-V3 from scratch — showing how each innovation, from RoPE to MLA to MoE, fits together into a cohesive architecture that balances scale, efficiency, and performance.</p>



<p>This lesson is the 3rd in a 6-part series on <strong>Building DeepSeek-V3 from Scratch</strong>:</p>



<ol class="wp-block-list">
<li><em><a href="https://pyimg.co/1atre" target="_blank" rel="noreferrer noopener">DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings</a></em> </li>



<li><em><a href="https://pyimg.co/scgjl" target="_blank" rel="noreferrer noopener">Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture</a></em></li>



<li><em><strong><a href="https://pyimg.co/a1w0g" target="_blank" rel="noreferrer noopener">DeepSeek-V3 from Scratch: Mixture of Experts (MoE)</a></strong></em> <strong>(this tutorial)</strong></li>



<li><em>Lesson 4</em></li>



<li><em>Lesson 5</em></li>



<li><em>Lesson 6</em></li>
</ol>



<p><strong>To learn about DeepSeek-V3 and build it from scratch, </strong><em><strong>just keep reading.</strong></em></p>



<div id="pyi-source-code-block" class="source-code-wrap"><div class="gpd-source-code">
    <div class="gpd-source-code-content">
        <img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/source-code-icon.png?lossy=2&strip=1&webp=1" alt="">
        <h4>Looking for the source code to this post?</h4>
                    <a href="#download-the-code" class="pyis-cta-modal-open-modal">Jump Right To The Downloads Section <svg class="svg-icon arrow-right" width="12" height="12" aria-hidden="true" role="img" focusable="false" viewBox="0 0 14 14" fill="none" xmlns="http://www.w3.org/2000/svg"><path d="M6.8125 0.1875C6.875 0.125 6.96875 0.09375 7.09375 0.09375C7.1875 0.09375 7.28125 0.125 7.34375 0.1875L13.875 6.75C13.9375 6.8125 14 6.90625 14 7C14 7.125 13.9375 7.1875 13.875 7.25L7.34375 13.8125C7.28125 13.875 7.1875 13.9062 7.09375 13.9062C6.96875 13.9062 6.875 13.875 6.8125 13.8125L6.1875 13.1875C6.125 13.125 6.09375 13.0625 6.09375 12.9375C6.09375 12.8438 6.125 12.75 6.1875 12.6562L11.0312 7.8125H0.375C0.25 7.8125 0.15625 7.78125 0.09375 7.71875C0.03125 7.65625 0 7.5625 0 7.4375V6.5625C0 6.46875 0.03125 6.375 0.09375 6.3125C0.15625 6.25 0.25 6.1875 0.375 6.1875H11.0312L6.1875 1.34375C6.125 1.28125 6.09375 1.1875 6.09375 1.0625C6.09375 0.96875 6.125 0.875 6.1875 0.8125L6.8125 0.1875Z" fill="#169FE6"></path></svg></a>
            </div>
</div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-The-Scaling-Challenge-in-Neural-Networks"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-The-Scaling-Challenge-in-Neural-Networks">The Scaling Challenge in Neural Networks</a></h2>



<p>As we scale neural networks, we face a fundamental tradeoff: larger models have greater capacity to learn complex patterns, but they&#8217;re more expensive to train and deploy. A standard Transformer feedforward layer applies the same computation to every token:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/bf2/bf2201bc2baf63ca1f6c4d234c0149e9-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2' title='\text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/bf2/bf2201bc2baf63ca1f6c4d234c0149e9-ffffff-000000-0.png?lossy=2&strip=1&webp=1 256w,https://b2633864.smushcdn.com/2633864/wp-content/latex/bf2/bf2201bc2baf63ca1f6c4d234c0149e9-ffffff-000000-0.png?size=126x9&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 256px) 100vw, 256px' /> ,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/17b/17bb42fa7c2c263ea68f39dcacbae39c-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='W_1 \in \mathbb{R}^{d_\text{model} \times d_{ff}}' title='W_1 \in \mathbb{R}^{d_\text{model} \times d_{ff}}' class='latex' /> and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/f77/f77d7abe628bb9eb9d94b9fd6744507c-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='W_2 \in \mathbb{R}^{d_{ff} \times d_\text{model}}' title='W_2 \in \mathbb{R}^{d_{ff} \times d_\text{model}}' class='latex' /> are weight matrices, typically with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/928/92859ae4b73cd7d045ab1f38a8d696d5-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_{ff} = 4 \times d_\text{model}' title='d_{ff} = 4 \times d_\text{model}' class='latex' />. For our model with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/79d/79d0b8290e3c7cc6a6c914fcecd14969-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{model} = 256' title='d_\text{model} = 256' class='latex' />, this means <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/182/182959cd9cb0edd5a1151ed6c9779b9d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_{ff} = 1024' title='d_{ff} = 1024' class='latex' />, giving us approximately 512K parameters per FFN (FeedForward Network) per layer (two weight matrices of 256 &#215; 1024 entries each, plus biases).</p>
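<p>We can verify the parameter count directly from the formula: the FFN has the two weight matrices and the two bias vectors, nothing else.</p>

```python
d_model = 256
d_ff = 4 * d_model  # 1024

# W1: (d_model x d_ff), b1: (d_ff,), W2: (d_ff x d_model), b2: (d_model,)
ffn_params = d_model * d_ff + d_ff + d_ff * d_model + d_model
print(ffn_params)  # 525568 parameters, i.e., roughly half a million per FFN layer
```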



<p>To increase model capacity, we could simply make <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/0c5/0c5420389eb3e2e4d227251e42fe3199-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_{ff}' title='d_{ff}' class='latex' /> larger — say, <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/eab/eabddbf289d74b732510d32ed8521c8b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='8 \times d_\text{model}' title='8 \times d_\text{model}' class='latex' /> instead of <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/7ed/7ed7796a644c184a711ba6371620a806-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='4 \times' title='4 \times' class='latex' />. This doubles the FFN parameters and theoretically doubles capacity. But it also doubles the computation for every token, even if most don&#8217;t need that extra capacity.</p>



<p>Mixture of Experts (<strong>Figure 1</strong>) offers a more efficient scaling paradigm: instead of a single large FFN, we create multiple smaller expert FFNs and route each token to a subset of these experts. This gives us the capacity of a much larger model while maintaining computational efficiency.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/03/image-9-scaled.jpeg" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="507" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-9-1024x507.jpeg?lossy=2&strip=1&webp=1" alt="" class="wp-image-53270" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-9.jpeg?size=126x62&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-9-300x149.jpeg?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-9.jpeg?size=378x187&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-9.jpeg?size=504x250&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-9.jpeg?size=630x312&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-9-768x381.jpeg?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-9-1024x507.jpeg?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-9-scaled.jpeg?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 1:</strong> Types of Mixture of Experts Models (source: <a href="https://arxiv.org/pdf/2401.06066" target="_blank" rel="noreferrer noopener">Dai et al., 2024</a>).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Mixture-of-Experts-MoE-Mathematical-Foundation-and-Routing-Mechanism"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Mixture-of-Experts-MoE-Mathematical-Foundation-and-Routing-Mechanism">Mixture of Experts (MoE): Mathematical Foundation and Routing Mechanism</a></h2>



<p>Consider <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8d9/8d9c307cb7f3c4a32822a51922d1ceaa-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='N' title='N' class='latex' /> expert networks, each with the same architecture as a standard FFN:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/52a/52a505f95caf675e22535da8f910f9fc-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='E_i(x) = \text{SwiGLU}(x)' title='E_i(x) = \text{SwiGLU}(x)' class='latex' /></p>



<p>for <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/49b/49bfb2130de5717f9054b41bfc628ec6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='i = 1, \ldots, N' title='i = 1, \ldots, N' class='latex' />. Instead of using all experts for every token, we select the top-k experts. The selection is determined by a learned routing function:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/097/09740afe02c7859063f0a0ca5b41a84c-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r(x) = \text{softmax}(x W_r + b) \in \mathbb{R}^N' title='r(x) = \text{softmax}(x W_r + b) \in \mathbb{R}^N' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/097/09740afe02c7859063f0a0ca5b41a84c-ffffff-000000-0.png?lossy=2&strip=1&webp=1 221w,https://b2633864.smushcdn.com/2633864/wp-content/latex/097/09740afe02c7859063f0a0ca5b41a84c-ffffff-000000-0.png?size=126x10&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 221px) 100vw, 221px' /></p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/0b1/0b1d1c3ca3a7f71719f2e764a5421423-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='W_r \in \mathbb{R}^{d_\text{model} \times N}' title='W_r \in \mathbb{R}^{d_\text{model} \times N}' class='latex' /> is the router weight matrix and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/2da/2dad4704d5645062cbf7099281734bc0-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='b \in \mathbb{R}^N' title='b \in \mathbb{R}^N' class='latex' /> is a learnable bias vector. This gives us a probability distribution over experts for each token.</p>



<p><strong>Top-k Routing:</strong> We select the top-k experts based on router probabilities:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/075/075946977174988b15344db74edb18d1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\mathcal{T}_k(x) = {i \mid r_i(x) \text{ is in the top-k values of } r(x)}' title='\mathcal{T}_k(x) = {i \mid r_i(x) \text{ is in the top-k values of } r(x)}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/075/075946977174988b15344db74edb18d1-ffffff-000000-0.png?lossy=2&strip=1&webp=1 320w,https://b2633864.smushcdn.com/2633864/wp-content/latex/075/075946977174988b15344db74edb18d1-ffffff-000000-0.png?size=126x7&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/075/075946977174988b15344db74edb18d1-ffffff-000000-0.png?size=252x14&lossy=2&strip=1&webp=1 252w' sizes='(max-width: 320px) 100vw, 320px' /></p>



<p>The final output combines the selected experts, weighted by their normalized routing probabilities:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/016/0169862ead696d181f0f08316abfb1a9-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{MoE}(x) = \sum_{i \in \mathcal{T}_k(x)} \dfrac{r_i(x)}{\sum_{j \in \mathcal{T}_k(x)} r_j(x)} E_i(x)' title='\text{MoE}(x) = \sum_{i \in \mathcal{T}_k(x)} \dfrac{r_i(x)}{\sum_{j \in \mathcal{T}_k(x)} r_j(x)} E_i(x)' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/016/0169862ead696d181f0f08316abfb1a9-ffffff-000000-0.png?lossy=2&strip=1&webp=1 277w,https://b2633864.smushcdn.com/2633864/wp-content/latex/016/0169862ead696d181f0f08316abfb1a9-ffffff-000000-0.png?size=126x20&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 277px) 100vw, 277px' /></p>



<p>The renormalization <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/635/635b3207ac632da397bf178badefdb3f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\frac{r_i(x)}{\sum_{j \in \mathcal{T}_k(x)} r_j(x)}' title='\frac{r_i(x)}{\sum_{j \in \mathcal{T}_k(x)} r_j(x)}' class='latex' /> ensures the selected experts&#8217; weights sum to 1.</p>



<p><strong>Capacity and Computation</strong>: With <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/c9c/c9c0455233a24a05b9fae35beb3b6bd1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='N = 4' title='N = 4' class='latex' /> experts and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/2d4/2d4dcf10084570378af72846cd24eee5-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='k = 2' title='k = 2' class='latex' /> (our configuration), each token activates 2 out of 4 experts. If each expert has the same size as a standard FFN, we have <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/a09/a09d7b2cdb03c4894dbf1ed0c9efaa8d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='4\times' title='4\times' class='latex' /> the parameters but only <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/94d/94d33465cb423e98be5087e0b60fb662-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='2\times' title='2\times' class='latex' /> the computation per token. This is the MoE efficiency advantage: parameter count scales with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8d9/8d9c307cb7f3c4a32822a51922d1ceaa-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='N' title='N' class='latex' />, but computation scales with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8ce/8ce4b16b22b58894aa86c421e8759df3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='k' title='k' class='latex' />.</p>
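<p>To make the routing arithmetic concrete, here is a minimal standalone sketch of top-k selection and renormalization for a single token, using made-up router probabilities (illustration only, not the layer implemented later in this post):</p>

```python
import torch

# Made-up router probabilities for one token over N = 4 experts
r = torch.tensor([0.10, 0.40, 0.15, 0.35])
k = 2

# Top-k selection: keep the k largest routing probabilities
top_vals, top_idx = torch.topk(r, k)

# Renormalize so the selected experts' weights sum to 1
weights = top_vals / top_vals.sum()
```

<p>Here experts 1 and 3 are selected, and their probabilities (0.40 and 0.35) are rescaled to sum to 1; the final output would be the weighted sum of those two experts&#8217; outputs.</p>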



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-SwiGLU-Activation-in-DeepSeek-V3-Improving-MoE-Non-Linearity"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-SwiGLU-Activation-in-DeepSeek-V3-Improving-MoE-Non-Linearity">SwiGLU Activation in DeepSeek-V3: Improving MoE Non-Linearity</a></h2>



<p>DeepSeek uses SwiGLU (Swish-Gated Linear Unit) instead of the traditional GELU (Gaussian Error Linear Units) activation. SwiGLU is a gated activation function that has shown superior performance in language models:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e32/e322ccdd109b1dcc2a8cfe53607ce7c7-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{SwiGLU}(x) = \text{SiLU}(\text{gate}(x)) \odot \text{up}(x)' title='\text{SwiGLU}(x) = \text{SiLU}(\text{gate}(x)) \odot \text{up}(x)' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/e32/e322ccdd109b1dcc2a8cfe53607ce7c7-ffffff-000000-0.png?lossy=2&strip=1&webp=1 263w,https://b2633864.smushcdn.com/2633864/wp-content/latex/e32/e322ccdd109b1dcc2a8cfe53607ce7c7-ffffff-000000-0.png?size=126x9&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 263px) 100vw, 263px' /></p>



<p>where:</p>



<ul class="wp-block-list">
<li><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/eb5/eb50de2bc852d4828e568e01f0aa9063-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{gate}(x) = x W_\text{gate}' title='\text{gate}(x) = x W_\text{gate}' class='latex' />: projects the input to the hidden dimension</li>



<li><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/0b8/0b84b337fe57983379de7c5358dea928-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{up}(x) = x W_\text{up}' title='\text{up}(x) = x W_\text{up}' class='latex' />: a second, parallel projection to the hidden dimension (the value branch)</li>



<li><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b7c/b7c589ccf99675194c9922bebe2b2371-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{SiLU}(x) = x \cdot \sigma(x)' title='\text{SiLU}(x) = x \cdot \sigma(x)' class='latex' />: is the Swish activation (smooth version of ReLU)</li>



<li><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/319/319d584a4a5166ee6c51f4b8348856ea-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\odot' title='\odot' class='latex' />: denotes element-wise multiplication</li>



<li>The result is then projected back: <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/179/1794e1a3552889b63345c679b085f6e2-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{down}(\text{SwiGLU}(x))' title='\text{down}(\text{SwiGLU}(x))' class='latex' /></li>
</ul>



<p>The gating mechanism allows the network to control information flow more precisely than simple activation functions. The <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/f5e/f5e0441bb0e5b247071eb3e14ea4c20d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{SiLU}' title='\text{SiLU}' class='latex' /> activation provides smooth gradients everywhere, improving training dynamics compared to ReLU&#8217;s hard threshold.</p>
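<p>A quick numerical check of these definitions (illustrative values only): SiLU matches its closed form <em>x</em> · σ(<em>x</em>), and a strongly negative gate drives the corresponding value element toward zero:</p>

```python
import torch
import torch.nn.functional as F

# SiLU(x) = x * sigmoid(x): smooth everywhere, nonzero for x < 0 (unlike ReLU)
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
silu_manual = x * torch.sigmoid(x)

# Gating: the SiLU-activated gate scales the value branch element-wise
gate = torch.tensor([-10.0, 0.0, 10.0])
value = torch.ones(3)
gated = F.silu(gate) * value  # first element is almost fully suppressed
```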



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Shared-Expert-in-DeepSeek-V3-Universal-Processing-in-MoE-Layers"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Shared-Expert-in-DeepSeek-V3-Universal-Processing-in-MoE-Layers">Shared Expert in DeepSeek-V3: Universal Processing in MoE Layers</a></h2>



<p>DeepSeek introduces a <strong>shared expert</strong> that processes all tokens in addition to the routed experts. This design addresses a key limitation of pure MoE: some computations are beneficial for all tokens regardless of their content.</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/7ab/7ab72343be53d1ae37ac35b5c2e1b5ac-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{MoE}_\text{total}(x) = \text{SharedExpert}(x) + \sum_{i \in \mathcal{T}_k(x)} w_i E_i(x)' title='\text{MoE}_\text{total}(x) = \text{SharedExpert}(x) + \sum_{i \in \mathcal{T}_k(x)} w_i E_i(x)' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/7ab/7ab72343be53d1ae37ac35b5c2e1b5ac-ffffff-000000-0.png?lossy=2&strip=1&webp=1 357w,https://b2633864.smushcdn.com/2633864/wp-content/latex/7ab/7ab72343be53d1ae37ac35b5c2e1b5ac-ffffff-000000-0.png?size=126x8&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/7ab/7ab72343be53d1ae37ac35b5c2e1b5ac-ffffff-000000-0.png?size=252x16&lossy=2&strip=1&webp=1 252w' sizes='(max-width: 357px) 100vw, 357px' /></p>



<p>The shared expert has a larger hidden dimension (768 in our configuration vs 512 for individual experts) and processes every token. This ensures that:</p>



<ul class="wp-block-list">
<li>Common patterns are efficiently handled by dedicated capacity</li>



<li>Specialized experts can focus on token-specific features</li>



<li>Training is more stable with guaranteed gradient flow</li>
</ul>



<p>The shared expert serves as a &#8220;base&#8221; computation that&#8217;s always present, while routed experts add specialized processing on top of it.</p>
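<p>A rough back-of-the-envelope count shows the effect on capacity. The hidden sizes (768 shared, 512 routed) come from this post&#8217;s configuration; the model dimension below is an assumed placeholder, and biases and the router are ignored:</p>

```python
# Assumed model dimension for illustration only
d_model = 384
n_experts, top_k = 4, 2
d_routed, d_shared = 512, 768

# Each SwiGLU block has three weight matrices: gate and up (d_model x d_hidden)
# plus down (d_hidden x d_model), so roughly 3 * d_model * d_hidden parameters
def swiglu_params(d_in, d_hidden):
    return 3 * d_in * d_hidden

total = n_experts * swiglu_params(d_model, d_routed) + swiglu_params(d_model, d_shared)
active = top_k * swiglu_params(d_model, d_routed) + swiglu_params(d_model, d_shared)
```

<p>Every token always pays for the shared expert plus its top-k routed experts, while the parameter count grows with all N experts.</p>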



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Auxiliary-Loss-Free-Load-Balancing-in-DeepSeek-V3-MoE"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Auxiliary-Loss-Free-Load-Balancing-in-DeepSeek-V3-MoE">Auxiliary-Loss-Free Load Balancing in DeepSeek-V3 MoE</a></h2>



<p>A critical challenge in MoE is load balancing. If the router learns to always send tokens to the same one or two experts, we lose the benefits of having multiple experts — the unused experts contribute nothing, and the overused ones become bottlenecks.</p>



<p>Traditional MoE models use an <strong>auxiliary loss</strong> that penalizes uneven expert usage:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/755/75588a746edfba27742971371b272d04-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\mathcal{L}_\text{aux} = \alpha \displaystyle\sum\limits_{i=1}^N \left( \dfrac{L_i}{|\mathcal{B}|} - \dfrac{k}{N} \right)^2' title='\mathcal{L}_\text{aux} = \alpha \displaystyle\sum\limits_{i=1}^N \left( \dfrac{L_i}{|\mathcal{B}|} - \dfrac{k}{N} \right)^2' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/755/75588a746edfba27742971371b272d04-ffffff-000000-0.png?lossy=2&strip=1&webp=1 185w,https://b2633864.smushcdn.com/2633864/wp-content/latex/755/75588a746edfba27742971371b272d04-ffffff-000000-0.png?size=126x33&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 185px) 100vw, 185px' /></p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6b6/6b623bcf78099c519f69e9dbba46fbf2-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='L_i' title='L_i' class='latex' /> is the number of tokens routed to expert <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/865/865c0c0b4ab0e063e5caa3387c1a8741-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='i' title='i' class='latex' />, <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/33e/33e12eb1c3c4ab3e7380b6556798b8ae-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='|\mathcal{B}|' title='|\mathcal{B}|' class='latex' /> is batch size, and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/7b7/7b7f9dbfea05c83784f8b85149852f08-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\alpha' title='\alpha' class='latex' /> is a coefficient. However, auxiliary losses add complexity and require careful tuning.</p>



<p><strong>DeepSeek&#8217;s Innovation:</strong> Auxiliary-loss-free load balancing through <strong>dynamic bias updates</strong>. Instead of penalizing imbalance during training, we adjust the router biases to encourage balanced usage:</p>



<p>During training, we monitor how many tokens are routed to each expert. This gives us an <code data-enlighter-language="python" class="EnlighterJSRAW">expert_usage</code> vector, where each entry counts the number of tokens assigned to a particular expert. We then compute the average usage across all experts. </p>



<p>To maintain a balanced load, we adjust the router biases: if an expert is used more than the average, its bias is decreased to make it less likely to be chosen in the future; if it is used less than the average, its bias is increased to make it more likely to be selected. This dynamic bias update encourages fair distribution of tokens across experts without requiring an explicit auxiliary loss.</p>



<p>Let <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/eb0/eb00a04135562ae6f74786f084f54327-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='u_i' title='u_i' class='latex' /> denote the usage (number of tokens) of expert <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/865/865c0c0b4ab0e063e5caa3387c1a8741-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='i' title='i' class='latex' />, and let</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/9cd/9cdbac664b10cbab5ac822d1be8f4a14-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\bar{u} = \dfrac{1}{N} \displaystyle\sum\limits_{j=1}^{N} u_j' title='\bar{u} = \dfrac{1}{N} \displaystyle\sum\limits_{j=1}^{N} u_j' class='latex' /></p>



<p>be the average usage across all <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8d9/8d9c307cb7f3c4a32822a51922d1ceaa-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='N' title='N' class='latex' /> experts. The router bias for expert <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/865/865c0c0b4ab0e063e5caa3387c1a8741-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='i' title='i' class='latex' />, denoted <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/fe3/fe3e01a305f27284ff5115f4c5ea0fa4-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='b_i' title='b_i' class='latex' />, is updated as:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/ab3/ab311f534726b554bd5d6f1b554a872f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='b_i \leftarrow \left\{\begin{array}{ll} b_i - \eta, &amp; \text{if } u_i &gt; \bar{u} \\ \\ b_i + \eta, &amp; \text{if } u_i \leq \bar{u} \end{array}\right.' title='b_i \leftarrow \left\{\begin{array}{ll} b_i - \eta, &amp; \text{if } u_i &gt; \bar{u} \\ \\ b_i + \eta, &amp; \text{if } u_i \leq \bar{u} \end{array}\right.' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/ab3/ab311f534726b554bd5d6f1b554a872f-ffffff-000000-0.png?lossy=2&strip=1&webp=1 178w,https://b2633864.smushcdn.com/2633864/wp-content/latex/ab3/ab311f534726b554bd5d6f1b554a872f-ffffff-000000-0.png?size=126x42&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 178px) 100vw, 178px' /> ,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/ffe/ffe9f913124f345732e9f00fa258552e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\eta' title='\eta' class='latex' /> is the learning rate controlling the magnitude of the bias adjustment.</p>



<p>This approach:</p>



<ul class="wp-block-list">
<li>Eliminates the need for auxiliary loss hyperparameter tuning</li>



<li>Provides smoother load balancing over time</li>



<li>Doesn&#8217;t interfere with the primary task loss</li>



<li>Automatically adapts to data distribution changes</li>
</ul>



<p>The bias updates are performed with a small learning rate (0.001 in our implementation) to ensure gradual adjustment without disrupting training.</p>
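<p>The update rule is simple enough to sketch in a few lines. With made-up usage counts for one training step, the overused expert is pushed down while the others are pushed up:</p>

```python
import torch

n_experts = 4
eta = 0.001                        # bias update rate, as in this implementation
bias = torch.zeros(n_experts)

# Made-up token counts routed to each expert in one step
usage = torch.tensor([120.0, 40.0, 30.0, 10.0])
avg = usage.mean()                 # 50.0

# b_i <- b_i - eta if overused, b_i + eta otherwise
bias = torch.where(usage > avg, bias - eta, bias + eta)
```

<p>Expert 0 is now slightly discouraged in future routing decisions, and experts 1-3 are slightly encouraged.</p>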



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Sequence-Wise-Load-Balancing-for-Mixture-of-Experts-Models"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Sequence-Wise-Load-Balancing-for-Mixture-of-Experts-Models">Sequence-Wise Load Balancing for Mixture of Experts Models</a></h2>



<p>For even better load balancing, DeepSeek can use a <strong>complementary sequence-wise auxiliary loss</strong>. This encourages different sequences in a batch to use different experts:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/d88/d88c20adec2903c6f9fe5fd32613cc1e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\mathcal{L}_\text{comp} = \dfrac{1}{B^2}\displaystyle\sum\limits_{i=1}^B \displaystyle\sum\limits_{j \neq i}^B \text{sim}(u_i, u_j)' title='\mathcal{L}_\text{comp} = \dfrac{1}{B^2}\displaystyle\sum\limits_{i=1}^B \displaystyle\sum\limits_{j \neq i}^B \text{sim}(u_i, u_j)' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/d88/d88c20adec2903c6f9fe5fd32613cc1e-ffffff-000000-0.png?lossy=2&strip=1&webp=1 213w,https://b2633864.smushcdn.com/2633864/wp-content/latex/d88/d88c20adec2903c6f9fe5fd32613cc1e-ffffff-000000-0.png?size=126x30&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 213px) 100vw, 213px' />,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/eb0/eb00a04135562ae6f74786f084f54327-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='u_i' title='u_i' class='latex' /> is the expert usage vector for sequence <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/865/865c0c0b4ab0e063e5caa3387c1a8741-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='i' title='i' class='latex' /> (i.e., which experts were used), and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/083/08387073a5d9b07b40c8f9ccb56c578b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{sim}' title='\text{sim}' class='latex' /> measures similarity. By minimizing this loss, we encourage sequences to be complementary — if sequence A uses experts 1 and 2 heavily, sequence B should use experts 3 and 4.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Expert-Specialization-in-MoE-Emergent-Behavior-in-DeepSeek-V3"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Expert-Specialization-in-MoE-Emergent-Behavior-in-DeepSeek-V3">Expert Specialization in MoE: Emergent Behavior in DeepSeek-V3</a></h2>



<p>A fascinating property of MoE is expert specialization. Even though we don&#8217;t explicitly tell experts what to specialize in, they often learn to handle different types of patterns. In language models, researchers have observed:</p>



<ul class="wp-block-list">
<li><strong>Syntactic experts:</strong> Handle grammatical structures, verb conjugations</li>



<li><strong>Semantic experts:</strong> Process meaning, synonyms, and conceptual relationships</li>



<li><strong>Domain experts:</strong> Specialize in specific topics (e.g., scientific text, dialogue)</li>



<li><strong>Numerical experts:</strong> Handle arithmetic, dates, quantities</li>
</ul>



<p>This specialization emerges naturally as the routing function learns which experts are most effective for different inputs. Gradient flow during training reinforces this — when an expert performs well on certain patterns, the router learns to send similar patterns to that expert.</p>



<p>Mathematically, we can think of each expert as learning a local model <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/2a0/2a043c059262d6d490dd7c417cd171d8-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='E_i(x)' title='E_i(x)' class='latex' /> that&#8217;s particularly good in some region of the input space. The router function <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/7f0/7f0562b7361b94feb27ee472a1cbc253-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r(x)' title='r(x)' class='latex' /> implicitly partitions the input space, assigning different regions to different experts. This is similar to a mixture of experts in classical machine learning, but learned end-to-end through backpropagation.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Implementation-Building-the-DeepSeek-V3-MoE-Layer-from-Scratch"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Implementation-Building-the-DeepSeek-V3-MoE-Layer-from-Scratch">Implementation: Building the DeepSeek-V3 MoE Layer from Scratch</a></h2>



<p>Let&#8217;s implement the complete MoE layer with expert networks, routing, and load balancing:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="DeepSeek-V3 from Scratch: Mixture of Experts (MoE)" data-enlighter-group="1">class SwiGLU(nn.Module):
    """SwiGLU activation function used in DeepSeek experts"""
   
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, bias: bool = True):
        super().__init__()
        self.gate_proj = nn.Linear(input_dim, hidden_dim, bias=bias)
        self.up_proj = nn.Linear(input_dim, hidden_dim, bias=bias)
        self.down_proj = nn.Linear(hidden_dim, output_dim, bias=bias)
       
    def forward(self, x: torch.Tensor):
        gate = F.silu(self.gate_proj(x))  # SiLU activation
        up = self.up_proj(x)
        return self.down_proj(gate * up)
</pre>



<p><strong>Lines 1-13: SwiGLU Activation</strong>: The <code data-enlighter-language="python" class="EnlighterJSRAW">SwiGLU</code> class implements a gated activation mechanism. We have 3 linear projections: </p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">gate_proj</code>: for the gating signal</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">up_proj</code>: for the value branch</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">down_proj</code>: for the output projection</li>
</ul>



<p>The forward pass applies SiLU (Sigmoid Linear Unit) to the gate projection, multiplies it element-wise with the up-projection, and projects back down. This creates a more expressive activation than simple GELU, with the gating mechanism allowing fine-grained control over information flow.</p>
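<p>As a quick sanity check, the class can be exercised directly. The snippet below repeats the definition so it runs on its own, then verifies that input and output shapes match:</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Same structure as the class above, repeated so this snippet is standalone"""

    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, bias: bool = True):
        super().__init__()
        self.gate_proj = nn.Linear(input_dim, hidden_dim, bias=bias)
        self.up_proj = nn.Linear(input_dim, hidden_dim, bias=bias)
        self.down_proj = nn.Linear(hidden_dim, output_dim, bias=bias)

    def forward(self, x: torch.Tensor):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

mlp = SwiGLU(input_dim=64, hidden_dim=128, output_dim=64)
x = torch.randn(2, 10, 64)   # (batch, sequence, model dim)
y = mlp(x)                   # same shape as x
```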



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="14" data-enlighter-title="DeepSeek-V3 from Scratch: Mixture of Experts (MoE)" data-enlighter-group="2">class MoEExpert(nn.Module):
    """Expert network for Mixture of Experts using SwiGLU"""

    def __init__(self, config: DeepSeekConfig):
        super().__init__()
        self.expert_mlp = SwiGLU(
            config.n_embd,
            config.expert_intermediate_size,
            config.n_embd,
            config.bias
        )

    def forward(self, x: torch.Tensor):
        return self.expert_mlp(x)
</pre>



<p><strong>Lines 14-27: Expert with SwiGLU:</strong> Each <code data-enlighter-language="python" class="EnlighterJSRAW">MoEExpert</code> is now a SwiGLU network instead of a simple FFN. The intermediate size (<code data-enlighter-language="python" class="EnlighterJSRAW">expert_intermediate_size</code>) controls capacity — we use 512 in our configuration, which is smaller than the shared expert&#8217;s 768. This asymmetry reflects the fact that routed experts handle specialized patterns, while the shared expert handles common operations.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="28" data-enlighter-title="DeepSeek-V3 from Scratch: Mixture of Experts (MoE)" data-enlighter-group="3">class MixtureOfExperts(nn.Module):
    """
    DeepSeek MoE layer with shared expert and auxiliary-loss-free load balancing
   
    Key features:
    - Shared expert that processes all tokens
    - Auxiliary-loss-free load balancing via bias updates
    - Top-k routing to selected experts
    """

    def __init__(self, config: DeepSeekConfig):
        super().__init__()
        self.config = config
        self.n_experts = config.n_experts
        self.top_k = config.n_experts_per_token
        self.n_embd = config.n_embd

        # Router: learns which experts to use for each token
        self.router = nn.Linear(config.n_embd, config.n_experts, bias=False)

        # Expert networks
        self.experts = nn.ModuleList([
            MoEExpert(config) for _ in range(config.n_experts)
        ])

        # Shared expert (processes all tokens)
        if config.use_shared_expert:
            self.shared_expert = SwiGLU(
                config.n_embd,
                config.shared_expert_intermediate_size,
                config.n_embd,
                config.bias
            )
        else:
            self.shared_expert = None

        # Auxiliary-loss-free load balancing
        self.register_buffer('expert_bias', torch.zeros(config.n_experts))
        self.bias_update_rate = 0.001

        self.dropout = nn.Dropout(config.dropout)

</pre>



<p><strong>Lines 28-68: MoE Layer Structure:</strong> The <code data-enlighter-language="python" class="EnlighterJSRAW">MixtureOfExperts</code> class orchestrates routing and expert execution. The 3 key additions: </p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">shared_expert</code>: full-capacity expert that processes all tokens</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">expert_bias</code>: buffer for auxiliary-loss-free balancing</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">bias_update_rate</code>: controls how quickly biases adapt</li>
</ul>



<p>The dropout provides regularization across the entire MoE output.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="70" data-enlighter-title="DeepSeek-V3 from Scratch: Mixture of Experts (MoE)" data-enlighter-group="4">    def forward(self, x: torch.Tensor):
        batch_size, seq_len, hidden_dim = x.shape
        x_flat = x.view(-1, hidden_dim)

        # Routing phase with bias for load balancing
        router_logits = self.router(x_flat) + self.expert_bias

        # Top-k routing
        top_k_logits, top_k_indices = torch.topk(router_logits, self.top_k, dim=-1)
        routing_weights = torch.zeros_like(router_logits)
        routing_weights.scatter_(-1, top_k_indices, F.softmax(top_k_logits, dim=-1))

        # Expert computation
        output = torch.zeros_like(x_flat)
        expert_usage = torch.zeros(self.n_experts, device=x.device)

</pre>



<p><strong>Lines 70-84: Routing with Learnable Bias.</strong> The forward pass begins by flattening the input for efficient processing. We compute router logits and <strong>add the expert bias </strong>— this is the key to auxiliary-loss-free balancing. Overused experts have negative bias (making them less likely to be selected), while underused experts have positive bias (encouraging them to be selected). We then perform top-k selection and softmax normalization across the selected experts.</p>
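<p>The top-k-plus-scatter pattern produces a sparse weight matrix: each row has exactly k nonzero entries that sum to 1. A small standalone check with random logits:</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_tokens, n_experts, top_k = 5, 4, 2
router_logits = torch.randn(n_tokens, n_experts)

# Same routing pattern as the forward pass above
top_k_logits, top_k_indices = torch.topk(router_logits, top_k, dim=-1)
routing_weights = torch.zeros_like(router_logits)
routing_weights.scatter_(-1, top_k_indices, F.softmax(top_k_logits, dim=-1))
```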



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="86" data-enlighter-title="DeepSeek-V3 from Scratch: Mixture of Experts (MoE)" data-enlighter-group="5">        # Process through selected experts
        for expert_idx in range(self.n_experts):
            expert_mask = (top_k_indices == expert_idx).any(dim=-1)
            expert_usage[expert_idx] = expert_mask.sum().float()

            if expert_mask.any():
                expert_input = x_flat[expert_mask]
                expert_output = self.experts[expert_idx](expert_input)

                # Weight by routing probability
                weights = routing_weights[expert_mask, expert_idx].unsqueeze(-1)
                output[expert_mask] += expert_output * weights

        # Add shared expert output (processes all tokens)
        if self.shared_expert is not None:
            shared_output = self.shared_expert(x_flat)
            output += shared_output

        # Auxiliary-loss-free load balancing (update biases during training)
        if self.training:
            with torch.no_grad():
                avg_usage = expert_usage.mean()
                for i in range(self.n_experts):
                    if expert_usage[i] > avg_usage:
                        self.expert_bias[i] -= self.bias_update_rate
                    else:
                        self.expert_bias[i] += self.bias_update_rate

        output = self.dropout(output)
        return output.view(batch_size, seq_len, hidden_dim), router_logits.view(batch_size, seq_len, -1)

</pre>



<p><strong>Lines 86-97: Expert Processing.</strong> We iterate over all experts, identifying which tokens route to each one via the <code data-enlighter-language="python" class="EnlighterJSRAW">expert_mask</code>. For each expert with assigned tokens, we extract those tokens, process them through the expert network, weight them by routing probability, and accumulate them into the output. This selective execution is what makes MoE efficient — we don&#8217;t compute all experts for all tokens.</p>



<p><strong>Lines 100-102: Shared Expert.</strong> The shared expert processes <strong>all</strong> tokens unconditionally and adds its output to the routed experts&#8217; output. This ensures every token receives some baseline processing, improving training stability and providing capacity for universal patterns. The shared expert&#8217;s larger hidden dimension (768 vs 512) reflects its broader responsibility.</p>



<p><strong>Lines 105-112: Auxiliary-Loss-Free Balancing.</strong> During training, we update expert biases based on usage. We compute average usage across experts, then adjust biases: overused experts receive negative adjustments (discouraging future selection), while underused experts receive positive adjustments (encouraging future selection). Using the <code data-enlighter-language="python" class="EnlighterJSRAW">torch.no_grad()</code> context ensures these bias updates don&#8217;t interfere with gradient computation. The small update rate (0.001) provides smooth, stable balancing over time.</p>



<p><strong>Lines 114-115: Output and Return.</strong> We apply dropout to the combined output (routed + shared experts) and reshape back to the original dimensions. We return both the output and router logits — the latter can be used for optional auxiliary loss computation.</p>
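<p>Running this layer requires a config object exposing the fields used above. The actual <code data-enlighter-language="python" class="EnlighterJSRAW">DeepSeekConfig</code> is defined elsewhere in this series; a minimal stand-in might look like the following (the field names come from the code above, but the values for <code data-enlighter-language="python" class="EnlighterJSRAW">n_embd</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">dropout</code> are assumptions):</p>

```python
from dataclasses import dataclass

@dataclass
class DeepSeekConfig:
    """Minimal stand-in config; field names match the MoE code above,
    n_embd and dropout values are illustrative assumptions"""
    n_embd: int = 256
    n_experts: int = 4
    n_experts_per_token: int = 2
    expert_intermediate_size: int = 512
    shared_expert_intermediate_size: int = 768
    use_shared_expert: bool = True
    bias: bool = True
    dropout: float = 0.1

config = DeepSeekConfig()
```

<p>With such a config, <code data-enlighter-language="python" class="EnlighterJSRAW">MixtureOfExperts(config)</code> can be called on a tensor of shape <code data-enlighter-language="python" class="EnlighterJSRAW">[batch, seq_len, n_embd]</code> and returns the MoE output together with the router logits.</p>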



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="117" data-enlighter-title="DeepSeek-V3 from Scratch: Mixture of Experts (MoE)" data-enlighter-group="6">    def _complementary_sequence_aux_loss(self, router_logits, seq_mask=None):
      """
      router_logits: [batch_size, seq_len, num_experts]
          Raw logits from the router before softmax.
      seq_mask: optional mask for padding tokens.
      """

      # Convert to probabilities
      probs = F.softmax(router_logits, dim=-1)  # [B, T, E]

      # Aggregate per-sequence expert usage
      if seq_mask is not None:
          probs = probs * seq_mask.unsqueeze(-1)  # mask padding
      seq_usage = probs.sum(dim=1)  # [B, E]

      # Normalize per sequence (clamp guards against fully masked sequences)
      seq_usage = seq_usage / seq_usage.sum(dim=-1, keepdim=True).clamp_min(1e-9)

      # Compute pairwise similarity between sequences
      sim_matrix = torch.matmul(seq_usage, seq_usage.transpose(0, 1))  # [B, B]

      # Encourage complementarity: average similarity over the off-diagonal pairs.
      # The diagonal holds each sequence's self-similarity (not 1), so we mask it
      # out rather than subtracting the identity matrix.
      batch_size = seq_usage.size(0)
      diag = torch.eye(batch_size, dtype=torch.bool, device=sim_matrix.device)
      loss = sim_matrix.masked_fill(diag, 0.0).sum() / (batch_size * batch_size)

      return loss
</pre>



<p><strong>Lines 117-143: Complementary Sequence-Wise Loss.</strong> This method implements an alternative load-balancing approach. It converts router logits to probabilities, aggregates expert usage for each sequence, and computes pairwise similarity between sequences&#8217; expert usage patterns. By minimizing off-diagonal similarity, we encourage different sequences to use different experts, promoting diversity in expert utilization. This can be added to the training loss with a small weight (e.g., 0.01).</p>
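<p>A standalone sketch of the same computation (no padding mask, with the diagonal masked out exactly as in the formula) shows the loss is a single scalar that is positive whenever sequences share expert usage:</p>

```python
import torch
import torch.nn.functional as F

def complementary_sequence_aux_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Standalone sketch of the sequence-wise loss (no padding mask)"""
    probs = F.softmax(router_logits, dim=-1)               # [B, T, E]
    seq_usage = probs.sum(dim=1)                           # [B, E]
    seq_usage = seq_usage / seq_usage.sum(dim=-1, keepdim=True)
    sim_matrix = seq_usage @ seq_usage.t()                 # [B, B]
    B = seq_usage.size(0)
    diag = torch.eye(B, dtype=torch.bool)
    # Average similarity over the B*(B-1) off-diagonal pairs, scaled by 1/B^2
    return sim_matrix.masked_fill(diag, 0.0).sum() / (B * B)

torch.manual_seed(0)
logits = torch.randn(3, 16, 4)   # 3 sequences, 16 tokens, 4 experts
loss = complementary_sequence_aux_loss(logits)
```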



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-MoE-Design-Decisions-in-DeepSeek-V3-SwiGLU-Shared-Experts-and-Routing"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-MoE-Design-Decisions-in-DeepSeek-V3-SwiGLU-Shared-Experts-and-Routing">MoE Design Decisions in DeepSeek-V3: SwiGLU, Shared Experts, and Routing</a></h2>



<p>Several implementation choices merit discussion:</p>



<p><strong>SwiGLU vs GELU:</strong> We use SwiGLU instead of traditional GELU because empirical research shows it consistently outperforms GELU in language models. The gating mechanism provides more expressive power, and SiLU&#8217;s smoothness improves gradient flow. The computational cost is slightly higher (three projections instead of two), but the quality improvement justifies it.</p>



<p><strong>Shared Expert Design:</strong> The shared expert is a DeepSeek innovation that addresses a key limitation of pure MoE: some computations benefit all tokens. By providing dedicated capacity for universal processing, we free routed experts to specialize more aggressively. The larger hidden dimension (768 vs 512) for the shared expert reflects empirical findings that shared capacity requires more parameters than individual experts.</p>



<p><strong>Auxiliary-Loss-Free Balancing:</strong> Traditional MoE uses auxiliary losses, such as:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/a53/a5371c46c63433ffc9ae0e666de20fc2-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\mathcal{L}_\text{aux} = \alpha \cdot N \displaystyle\sum\limits_{i=1}^N f_i \cdot P_i' title='\mathcal{L}_\text{aux} = \alpha \cdot N \displaystyle\sum\limits_{i=1}^N f_i \cdot P_i' class='latex' /></p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/59b/59bdf0ba696e13164c5a926386f23cb0-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='f_i' title='f_i' class='latex' /> is the fraction of tokens routed to expert <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/865/865c0c0b4ab0e063e5caa3387c1a8741-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='i' title='i' class='latex' /> and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/08b/08b0104e514f16d489cc743b6f66d906-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='P_i' title='P_i' class='latex' /> is the average routing probability. This requires tuning <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/7b7/7b7f9dbfea05c83784f8b85149852f08-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\alpha' title='\alpha' class='latex' /> (typically 0.01-0.1). Our bias-based approach eliminates the need for this hyperparameter, simplifying training. The tradeoff is that bias updates are less direct than gradient-based learning, but in practice, the smoother adaptation works well.</p>
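<p>Plugging illustrative numbers into this loss shows why it penalizes imbalance. With α = 0.01 and N = 4, skewed routing statistics produce a larger penalty than uniform ones (here f and P are treated as fractions summing to 1):</p>

```python
import torch

N, alpha = 4, 0.01

# Made-up skewed routing statistics: expert 0 dominates
f = torch.tensor([0.40, 0.30, 0.20, 0.10])  # fraction of tokens per expert
P = torch.tensor([0.35, 0.30, 0.20, 0.15])  # mean routing probability per expert
loss_skewed = alpha * N * (f * P).sum()

# Uniform statistics for comparison
f_uniform = P_uniform = torch.full((N,), 1.0 / N)
loss_uniform = alpha * N * (f_uniform * P_uniform).sum()
```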



<p><strong>Complementary Sequence-Wise Loss:</strong> This alternative balancing approach is useful when batch diversity is high. By encouraging different sequences to use different experts, we naturally achieve balance. However, if the batch contains very similar sequences (e.g., all from the same domain), this loss may not be effective. It&#8217;s best used in combination with bias-based balancing or as an optional auxiliary objective.</p>



<p><strong>Expert Capacity:</strong> Production MoE systems often implement <strong>expert capacity constraints</strong> — if too many tokens route to one expert, excess tokens are dropped or routed to a second choice. We don&#8217;t implement this in our educational model, but the formula would be:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/326/326b1dea7e6b3e87bb95e57d99e9ff8a-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{capacity}_i = \dfrac{|\mathcal{B}| \cdot k}{N} \cdot \text{factor}' title='\text{capacity}_i = \dfrac{|\mathcal{B}| \cdot k}{N} \cdot \text{factor}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/326/326b1dea7e6b3e87bb95e57d99e9ff8a-ffffff-000000-0.png?lossy=2&strip=1&webp=1 182w,https://b2633864.smushcdn.com/2633864/wp-content/latex/326/326b1dea7e6b3e87bb95e57d99e9ff8a-ffffff-000000-0.png?size=126x25&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 182px) 100vw, 182px' /></p>



<p>where the capacity factor is typically 1.25-1.5. Tokens beyond this capacity are handled via overflow strategies (e.g., dropping the token from the expert computation and letting the residual connection carry its representation forward, as in Switch Transformers).</p>
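<p>The capacity formula is a one-liner; here is a hedged sketch (the function name <code>expert_capacity</code> is our own, not from any library):</p>

```python
import math

def expert_capacity(num_tokens, k, num_experts, factor=1.25):
    # capacity_i = (|B| * k / N) * factor, rounded up so each expert
    # has an integer number of token slots.
    return math.ceil(num_tokens * k / num_experts * factor)

# Example: a batch of 1,024 tokens, top-2 routing, 4 experts, factor 1.25:
# each expert accepts at most (1024 * 2 / 4) * 1.25 = 640 tokens.
print(expert_capacity(1024, 2, 4))  # 640
```

<p>A larger factor wastes compute on padding but drops fewer tokens; a smaller factor is cheaper but more aggressive about overflow.</p>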



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-MoE-Computational-and-Memory-Analysis-in-DeepSeek-V3"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-MoE-Computational-and-Memory-Analysis-in-DeepSeek-V3">MoE Computational and Memory Analysis in DeepSeek-V3</a></h2>



<p>Let&#8217;s analyze the computational cost. For a standard FFN with hidden dimension <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/5fd/5fd5c23b588d00b68f294891cdc0b4e9-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_{ff} ' title='d_{ff} ' class='latex' />:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/273/273ac3c26f1fcb71666fff5194a44b04-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{FLOPs}_\text{standard} = 2 \cdot d_\text{model} \cdot d_{ff} + 2 \cdot d_{ff} \cdot d_\text{model} = 4 \cdot d_\text{model} \cdot d_{ff}' title='\text{FLOPs}_\text{standard} = 2 \cdot d_\text{model} \cdot d_{ff} + 2 \cdot d_{ff} \cdot d_\text{model} = 4 \cdot d_\text{model} \cdot d_{ff}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/273/273ac3c26f1fcb71666fff5194a44b04-ffffff-000000-0.png?lossy=2&strip=1&webp=1 445w,https://b2633864.smushcdn.com/2633864/wp-content/latex/273/273ac3c26f1fcb71666fff5194a44b04-ffffff-000000-0.png?size=126x5&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/273/273ac3c26f1fcb71666fff5194a44b04-ffffff-000000-0.png?size=252x10&lossy=2&strip=1&webp=1 252w,https://b2633864.smushcdn.com/2633864/wp-content/latex/273/273ac3c26f1fcb71666fff5194a44b04-ffffff-000000-0.png?size=378x15&lossy=2&strip=1&webp=1 378w' sizes='(max-width: 445px) 100vw, 445px' /></p>



<p>For our MoE with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/c9c/c9c0455233a24a05b9fae35beb3b6bd1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='N = 4' title='N = 4' class='latex' /> routed experts (each with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/30e/30e330755ec178d9bd48d4b39fb846a4-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{expert} = 512' title='d_\text{expert} = 512' class='latex' />), <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/2d4/2d4dcf10084570378af72846cd24eee5-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='k = 2' title='k = 2' class='latex' /> selected, and shared expert (<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/231/231dab74be4fb72d267969ac0af4bdd4-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{shared} = 768' title='d_\text{shared} = 768' class='latex' />):</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b33/b333b879cc3a92d780a1b5d731abd13d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{FLOPs}_\text{MoE} = d_\text{model} \cdot N + k \cdot \text{SwiGLU}_\text{expert} + \text{SwiGLU}_\text{shared}' title='\text{FLOPs}_\text{MoE} = d_\text{model} \cdot N + k \cdot \text{SwiGLU}_\text{expert} + \text{SwiGLU}_\text{shared}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/b33/b333b879cc3a92d780a1b5d731abd13d-ffffff-000000-0.png?lossy=2&strip=1&webp=1 415w,https://b2633864.smushcdn.com/2633864/wp-content/latex/b33/b333b879cc3a92d780a1b5d731abd13d-ffffff-000000-0.png?size=126x5&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/b33/b333b879cc3a92d780a1b5d731abd13d-ffffff-000000-0.png?size=252x11&lossy=2&strip=1&webp=1 252w' sizes='(max-width: 415px) 100vw, 415px' /></p>



<p>The SwiGLU computation involves three projections:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/ce7/ce7479fa9bd98838e36402f7a1c52c2d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{SwiGLU}_\text{expert} = 3 \cdot d_\text{model} \cdot d_\text{expert} + 3 \cdot d_\text{expert} \cdot d_\text{model} = 6 \cdot d_\text{model} \cdot d_\text{expert}' title='\text{SwiGLU}_\text{expert} = 3 \cdot d_\text{model} \cdot d_\text{expert} + 3 \cdot d_\text{expert} \cdot d_\text{model} = 6 \cdot d_\text{model} \cdot d_\text{expert}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/ce7/ce7479fa9bd98838e36402f7a1c52c2d-ffffff-000000-0.png?lossy=2&strip=1&webp=1 499w,https://b2633864.smushcdn.com/2633864/wp-content/latex/ce7/ce7479fa9bd98838e36402f7a1c52c2d-ffffff-000000-0.png?size=126x5&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/ce7/ce7479fa9bd98838e36402f7a1c52c2d-ffffff-000000-0.png?size=252x9&lossy=2&strip=1&webp=1 252w,https://b2633864.smushcdn.com/2633864/wp-content/latex/ce7/ce7479fa9bd98838e36402f7a1c52c2d-ffffff-000000-0.png?size=378x14&lossy=2&strip=1&webp=1 378w' sizes='(max-width: 499px) 100vw, 499px' /></p>



<p>For our configuration:</p>



<ul class="wp-block-list">
<li><strong>Routing:</strong> <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/3ce/3ce156710f553b9177cdf245e2af0392-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='256 \cdot 4' title='256 \cdot 4' class='latex' /> (negligible)</li>



<li><strong>Routed experts:</strong> <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/f90/f90572bace88a6df8e23a777b255d793-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='2 \cdot 6 \cdot 256 \cdot 512 = 1\text{,}572\text{,}864' title='2 \cdot 6 \cdot 256 \cdot 512 = 1\text{,}572\text{,}864' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/f90/f90572bace88a6df8e23a777b255d793-ffffff-000000-0.png?lossy=2&strip=1&webp=1 188w,https://b2633864.smushcdn.com/2633864/wp-content/latex/f90/f90572bace88a6df8e23a777b255d793-ffffff-000000-0.png?size=126x11&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 188px) 100vw, 188px' /></li>



<li><strong>Shared expert:</strong> <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/110/11027ab94784bc01dbb25f8730d4b1a4-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='6 \cdot 256 \cdot 768 = 1\text{,}179\text{,}648' title='6 \cdot 256 \cdot 768 = 1\text{,}179\text{,}648' class='latex' /></li>



<li><strong>Total:</strong> <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/658/6588c95074f2609674f5fe10ab63f88f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\sim' title='\sim' class='latex' />2.75M FLOPs per token</li>
</ul>



<p>Compare to a standard FFN with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/182/182959cd9cb0edd5a1151ed6c9779b9d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_{ff} = 1024' title='d_{ff} = 1024' class='latex' />: <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/be2/be2ee2698e20700424bdd771982c67f6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='4 \cdot 256 \cdot 1024 = 1\text{,}048\text{,}576' title='4 \cdot 256 \cdot 1024 = 1\text{,}048\text{,}576' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/be2/be2ee2698e20700424bdd771982c67f6-ffffff-000000-0.png?lossy=2&strip=1&webp=1 176w,https://b2633864.smushcdn.com/2633864/wp-content/latex/be2/be2ee2698e20700424bdd771982c67f6-ffffff-000000-0.png?size=126x11&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 176px) 100vw, 176px' /> FLOPs. Our MoE uses <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/658/6588c95074f2609674f5fe10ab63f88f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\sim' title='\sim' class='latex' />2.6× the computation but has much higher capacity (4 experts × 512 + 1 shared × 768 = 2,816 vs 1,024). We get <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/658/6588c95074f2609674f5fe10ab63f88f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\sim' title='\sim' class='latex' />2.75× the capacity for <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/658/6588c95074f2609674f5fe10ab63f88f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\sim' title='\sim' class='latex' />2.6× the computation — roughly linear scaling, which is the goal.</p>
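<p>The arithmetic above is easy to verify in a few lines, using the configuration from the text (<code>d_model = 256</code>, four routed experts of width 512, top-2 routing, a shared expert of width 768, and a baseline FFN of width 1,024):</p>

```python
d_model, N, k = 256, 4, 2
d_expert, d_shared, d_ff = 512, 768, 1024

def swiglu_flops(d_in, d_hidden):
    # SwiGLU has three projections: gate and up (d_in -> d_hidden)
    # plus down (d_hidden -> d_in), giving 6 * d_in * d_hidden mult-adds.
    return 6 * d_in * d_hidden

routing = d_model * N  # 1,024 FLOPs for the router — negligible
moe = routing + k * swiglu_flops(d_model, d_expert) + swiglu_flops(d_model, d_shared)
standard = 4 * d_model * d_ff  # standard 2-projection FFN

print(moe)        # 2753536  (~2.75M FLOPs per token)
print(standard)   # 1048576

capacity_ratio = (N * d_expert + d_shared) / d_ff
print(capacity_ratio)  # 2.75
```

<p>Running this reproduces the ~2.6× compute ratio (2,753,536 / 1,048,576 ≈ 2.63) against a 2.75× capacity ratio.</p>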



<p>Memory usage during the forward pass is limited to activations for the experts each token actually selects. During backpropagation, gradients flow through the selected experts and the routing weights, so memory remains manageable. The bias vector is tiny (4 floats for 4 experts).</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-MoE-Expert-Specialization-in-Practice-Real-World-Behavior"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-MoE-Expert-Specialization-in-Practice-Real-World-Behavior">MoE Expert Specialization in Practice: Real-World Behavior</a></h2>



<p>While we can&#8217;t demonstrate this in our small toy model, in larger-scale MoE models, expert specialization is observable through analysis of routing patterns. Researchers have visualized which experts activate for different types of inputs, revealing clear specialization. For example:</p>



<ul class="wp-block-list">
<li><strong>Multilingual models:</strong> Different experts handle different languages</li>



<li><strong>Code models:</strong> Some experts handle syntax, others semantics, others API patterns</li>



<li><strong>Reasoning models:</strong> Numerical experts for math, logical experts for inference, retrieval experts for factual recall</li>
</ul>



<p>This specialization isn&#8217;t programmed — it emerges from optimization. The routing function learns to partition the input space, and experts learn to excel in their assigned partitions. It&#8217;s a beautiful example of how end-to-end learning can discover structured solutions.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Training-Dynamics-of-MoE-Load-Balancing-and-Expert-Utilization"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Training-Dynamics-of-MoE-Load-Balancing-and-Expert-Utilization">Training Dynamics of MoE: Load Balancing and Expert Utilization</a></h2>



<p>In practice, MoE training exhibits interesting dynamics:</p>



<p><strong>Early Training:</strong> Routing is initially random or near-uniform. All experts receive a similar load. The shared expert learns basic patterns that benefit all tokens.</p>



<p><strong>Mid Training:</strong> Routing starts specializing. Some experts become preferred for certain patterns. Load imbalance can emerge without careful management. Bias-based balancing begins correcting the imbalance.</p>



<p><strong>Late Training:</strong> Experts are clearly specialized. Routing is confident (high softmax probabilities for selected experts). Load is balanced through continuous bias adjustment. The shared expert handles universal operations while routed experts focus on specialized patterns.</p>



<p>Monitoring expert usage during training is valuable. We can log:</p>



<ul class="wp-block-list">
<li>Per-expert selection frequency</li>



<li>Routing entropy (higher means more uniform)</li>



<li>Expert bias magnitudes (large values indicate strong correction needed)</li>
</ul>
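<p>These statistics are cheap to compute from the router outputs at each logging step. A minimal pure-Python sketch (the <code>selections</code> and <code>mean_probs</code> values below are invented for illustration):</p>

```python
import math
from collections import Counter

# Hypothetical batch of top-k expert selections (expert indices) and the
# batch-mean routing distribution over N = 4 experts.
selections = [0, 1, 0, 2, 1, 1, 3, 0, 1, 2]
mean_probs = [0.30, 0.35, 0.20, 0.15]

# Per-expert selection frequency: how often each expert was chosen.
counts = Counter(selections)
freq = {e: counts.get(e, 0) / len(selections) for e in range(4)}
print(freq)

# Routing entropy of the mean distribution; the maximum log(N) ~ 1.386
# for N = 4 indicates perfectly uniform routing, values near 0 indicate
# collapse onto a single expert.
entropy = -sum(p * math.log(p) for p in mean_probs if p > 0)
print(round(entropy, 4))  # 1.3351
```

<p>In practice, these values would be logged per layer and per step (e.g., to TensorBoard) so imbalance is caught before experts starve.</p>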



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Mixture-of-Experts-vs-Related-Techniques-Switch-Transformers-and-Sparse-Models"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Mixture-of-Experts-vs-Related-Techniques-Switch-Transformers-and-Sparse-Models">Mixture of Experts vs Related Techniques: Switch Transformers and Sparse Models</a></h2>



<p>MoE shares ideas with several other architectural patterns:</p>



<p><strong>Switch Transformers:</strong> Use top-1 routing (only one expert per token) for maximum efficiency. Simpler but less expressive than top-k.</p>
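<p>The difference between Switch-style top-1 routing and the top-k routing we use can be seen in a toy selection step (the logits here are made up for illustration):</p>

```python
# Toy router logits for one token over 4 experts.
logits = [1.2, 0.4, 2.1, -0.3]

# Switch Transformer: top-1 — the token goes to its single best expert.
top1 = max(range(len(logits)), key=lambda i: logits[i])
print(top1)  # 2

# Top-k (k = 2): the token goes to its two best experts, whose outputs
# are then mixed by the (normalized) routing weights.
top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
print(top2)  # [2, 0]
```

<p>Top-1 halves the expert compute relative to top-2 but gives each token no fallback expert, which is part of why Switch-style models lean harder on capacity factors and overflow handling.</p>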



<p><strong>Expert Choice:</strong> Instead of tokens choosing experts, experts choose tokens. Helps with load balancing but changes the computational pattern.</p>



<p><strong>Sparse Attention:</strong> Like MoE, selectively activates parts of the network. Can be combined with MoE for extreme efficiency.</p>



<p><strong>Dynamic Networks:</strong> Adapt network structure based on input. MoE is a specific form of dynamic computation.</p>



<p>With our MoE implementation complete, we&#8217;ve added efficient scaling to our model — the capacity grows superlinearly with computation cost. Combined with MLA&#8217;s memory efficiency and the upcoming MTP&#8217;s improved training signal, we&#8217;re building a model that&#8217;s efficient in training, efficient in inference, and capable of strong performance. Next, we&#8217;ll tackle Multi-Token Prediction, which improves the training signal itself by having the model look further ahead.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<div id="pitch" style="padding: 40px; width: 100%; background-color: #F4F6FA;">
	<h3>What's next? We recommend <a target="_blank" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend">PyImageSearch University</a>.</h3>

	<script src="https://fast.wistia.com/embed/medias/kno0cmko2z.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_kno0cmko2z videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img decoding="async" src="https://fast.wistia.com/embed/medias/kno0cmko2z/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>

	<div style="margin-top: 32px; margin-bottom: 32px; ">
		<strong>Course information:</strong><br/>
		86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: April 2026<br/>
		<span style="color: #169FE6;">★★★★★</span> 4.84 (128 Ratings) • 16,000+ Students Enrolled
	</div>

	<p><strong>I strongly believe that if you had the right teacher you could <em>master</em> computer vision and deep learning.</strong></p>

	<p>Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?</p>

	<p>That’s <em>not</em> the case.</p>

	<p>All you need to master computer vision and deep learning is for someone to explain things to you in <em>simple, intuitive</em> terms. <em>And that’s exactly what I do</em>. My mission is to change education and how complex Artificial Intelligence topics are taught.</p>

	<p>If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to <em>successfully</em> and <em>confidently</em> apply computer vision to your work, research, and projects. Join me in computer vision mastery.</p>

	<p><strong>Inside PyImageSearch University you'll find:</strong></p>

	<ul style="margin-left: 0px;">
		<li style="list-style: none;">&check; <strong>86+ courses</strong> on essential computer vision, deep learning, and OpenCV topics</li>
		<li style="list-style: none;">&check; <strong>86 Certificates</strong> of Completion</li>
		<li style="list-style: none;">&check; <strong>115+ hours hours</strong> of on-demand video</li>
		<li style="list-style: none;">&check; <strong>Brand new courses released <em>regularly</em></strong>, ensuring you can keep up with state-of-the-art techniques</li>
		<li style="list-style: none;">&check; <strong>Pre-configured Jupyter Notebooks in Google Colab</strong></li>
		<li style="list-style: none;">&check; Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)</li>
		<li style="list-style: none;">&check; Access to <strong>centralized code repos for <em>all</em> 540+ tutorials</strong> on PyImageSearch</li>
		<li style="list-style: none;">&check; <strong> Easy one-click downloads</strong> for code, datasets, pre-trained models, etc.</li>
		<li style="list-style: none;">&check; <strong>Access</strong> on mobile, laptop, desktop, etc.</li>
	</ul>

	<p style="text-align: center;">
		<a target="_blank" class="button link" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend" style="background-color: #6DC713; border-bottom: none;">Click here to join PyImageSearch University</a>
	</p>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Summary"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Summary">Summary</a></h2>



<p>In the third installment of our <strong>DeepSeek-V3 from Scratch</strong> series, we turn our attention to the <strong>Mixture of Experts (MoE)</strong> framework, a powerful approach to scaling neural networks efficiently. We begin by unpacking the scaling challenge in modern architectures and how MoE addresses it through selective expert activation. From its mathematical foundation to the introduction of <strong>SwiGLU activation</strong>, we explore how enhanced non-linearity and universal shared experts contribute to more flexible and expressive models.</p>



<p>We then examine the mechanics of <strong>load balancing</strong>, highlighting innovations (e.g., auxiliary-loss-free balancing and complementary sequence-wise strategies). These techniques ensure that experts are used effectively without introducing unnecessary complexity. We also explore how expert specialization emerges naturally during training, leading to diverse behaviors across experts that improve overall performance. This emergent specialization is not just theoretical — it becomes visible in practice, shaping how the model processes different types of input.</p>



<p>Finally, we walk through the <strong>implementation of MoE</strong>, discussing design decisions, computational trade-offs, and memory analysis. We connect these insights to related techniques, showing how MoE integrates into the broader landscape of efficient deep learning. By the end, we not only understand the theory but also gain practical knowledge of how to implement and optimize MoE within DeepSeek-V3. This part of the series equips us with the tools to harness expert specialization while keeping training dynamics balanced and efficient.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Citation-Information"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Citation-Information">Citation Information</a></h3>



<p><strong>Mangla, P.</strong> “DeepSeek-V3 from Scratch: Mixture of Experts (MoE),” <em>PyImageSearch</em>, S. Huot, A. Sharma, and P. Thakur, eds., 2026, <a href="https://pyimg.co/a1w0g" target="_blank" rel="noreferrer noopener">https://pyimg.co/a1w0g</a></p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="classic" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="DeepSeek-V3 from Scratch: Mixture of Experts (MoE)" data-enlighter-group="7">@incollection{Mangla_2026_deepseek-v3-from-scratch-moe,
  author = {Puneet Mangla},
  title = {{DeepSeek-V3 from Scratch: Mixture of Experts (MoE)}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/a1w0g},
}
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), </strong><em><strong>simply enter your email address in the form below!</strong></em></p>



<div id="download-the-code" class="post-cta-wrap">
<div class="gpd-post-cta">
	<div class="gpd-post-cta-content">
		

			<div class="gpd-post-cta-top">
				<div class="gpd-post-cta-top-image"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1" alt="" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1 410w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=126x174&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=252x348&lossy=2&strip=1&webp=1 252w" sizes="(max-width: 410px) 100vw, 410px" /></div>
				
				<div class="gpd-post-cta-top-title"><h4>Download the Source Code and FREE 17-page Resource Guide</h4></div>
				<div class="gpd-post-cta-top-desc"><p>Enter your email address below to get a .zip of the code and a <strong>FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning.</strong> Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!</p></div>


			</div>

			<div class="gpd-post-cta-bottom">
				<form id="footer-cta-code" class="footer-cta" action="https://www.getdrip.com/forms/4130035/submissions" method="post" target="blank" data-drip-embedded-form="4130035">
					<input name="fields[email]" type="email" value="" placeholder="Your email address" class="form-control" />

					<button type="submit">Download the code!</button>

					<div style="display: none;" aria-hidden="true"><label for="website">Website</label><br /><input type="text" id="website" name="website" tabindex="-1" autocomplete="false" value="" /></div>
				</form>
			</div>


		
	</div>

</div>
</div>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/03/23/deepseek-v3-from-scratch-mixture-of-experts-moe/">DeepSeek-V3 from Scratch: Mixture of Experts (MoE)</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture</title>
		<link>https://pyimagesearch.com/2026/03/16/build-deepseek-v3-multi-head-latent-attention-mla-architecture/</link>
		
		<dc:creator><![CDATA[Puneet Mangla]]></dc:creator>
		<pubDate>Mon, 16 Mar 2026 12:45:00 +0000</pubDate>
				<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Large Language Models]]></category>
		<category><![CDATA[PyTorch]]></category>
		<category><![CDATA[Transformers]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[attention mechanisms]]></category>
		<category><![CDATA[deepseek-v3]]></category>
		<category><![CDATA[kv cache optimization]]></category>
		<category><![CDATA[large language models]]></category>
		<category><![CDATA[mla]]></category>
		<category><![CDATA[multi-head latent attention]]></category>
		<category><![CDATA[pytorch tutorial]]></category>
		<category><![CDATA[RoPE]]></category>
		<category><![CDATA[rotary positional embeddings]]></category>
		<category><![CDATA[transformer architecture]]></category>
		<category><![CDATA[tutorial]]></category>
		<guid isPermaLink="false">https://pyimagesearch.com/?p=53170</guid>

					<description><![CDATA[<p>Table of Contents Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture The KV Cache Memory Problem in DeepSeek-V3 Multi-Head Latent Attention (MLA): KV Cache Compression with Low-Rank Projections Query Compression and Rotary Positional Embeddings (RoPE) Integration Attention Computation with Multi-Head Latent&#8230;</p>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/03/16/build-deepseek-v3-multi-head-latent-attention-mla-architecture/">Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity" id="TOC"/>




<div class="toc">
<hr class="TOC"/>
<p class="has-large-font-size"><strong>Table of Contents</strong></p>
<ul>
    <li id="TOC-h1-Build-DeepSeek-V3-Multi-Head-Latent-Attention-MLA-Architecture"><a rel="noopener" target="_blank" href="#h1-Build-DeepSeek-V3-Multi-Head-Latent-Attention-MLA-Architecture">Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture</a></li>
    <li id="TOC-h2-The-KV-Cache-Memory-Problem-in-DeepSeek-V3"><a rel="noopener" target="_blank" href="#h2-The-KV-Cache-Memory-Problem-in-DeepSeek-V3">The KV Cache Memory Problem in DeepSeek-V3</a></li>
    <li id="TOC-h2-Multi-Head-Latent-Attention-MLA-KV-Cache-Compression-with-Low-Rank-Projections"><a rel="noopener" target="_blank" href="#h2-Multi-Head-Latent-Attention-MLA-KV-Cache-Compression-with-Low-Rank-Projections">Multi-Head Latent Attention (MLA): KV Cache Compression with Low-Rank Projections</a></li>
    <li id="TOC-h2-Query-Compression-and-Rotary-Positional-Embeddings-RoPE-Integration"><a rel="noopener" target="_blank" href="#h2-Query-Compression-and-Rotary-Positional-Embeddings-RoPE-Integration">Query Compression and Rotary Positional Embeddings (RoPE) Integration</a></li>
    <li id="TOC-h2-Attention-Computation-with-Multi-Head-Latent-Attention-MLA"><a rel="noopener" target="_blank" href="#h2-Attention-Computation-with-Multi-Head-Latent-Attention-MLA">Attention Computation with Multi-Head Latent Attention (MLA)</a></li>
    <li id="TOC-h2-Implementation-Multi-Head-Latent-Attention-MLA"><a rel="noopener" target="_blank" href="#h2-Implementation-Multi-Head-Latent-Attention-MLA">Implementation: Multi-Head Latent Attention (MLA)</a></li>
    <li id="TOC-h2-Multi-Head-Latent-Attention-and-KV-Cache-Optimization"><a rel="noopener" target="_blank" href="#h2-Multi-Head-Latent-Attention-and-KV-Cache-Optimization">Multi-Head Latent Attention and KV Cache Optimization</a></li>
    <li id="TOC-h2-Summary"><a rel="noopener" target="_blank" href="#h2-Summary">Summary</a></li>
    <ul>
        <li id="TOC-h3-Citation-Information"><a rel="noopener" target="_blank" href="#h3-Citation-Information">Citation Information</a></li>
    </ul>
</ul>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h1-Build-DeepSeek-V3-Multi-Head-Latent-Attention-MLA-Architecture"/>



<h2 class="wp-block-heading"><a href="#TOC-h1-Build-DeepSeek-V3-Multi-Head-Latent-Attention-MLA-Architecture">Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture</a></h2>



<p>In the first part of this series, we laid the foundation by exploring the <strong>theoretical underpinnings of DeepSeek-V3</strong> and implementing key configuration elements such as <strong>Rotary Positional Embeddings (RoPE)</strong>. That tutorial established how DeepSeek-V3 manages long-range dependencies and sets up its architecture for efficient scaling. By grounding theory in working code, we ensured that readers not only understood the concepts but also saw how they translate into practical implementation.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/03/build-deepseek-v3-mla-architecture-v2-featured.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="940" height="780" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/build-deepseek-v3-mla-architecture-v2-featured.png?lossy=2&strip=1&webp=1" alt="build-deepseek-v3-mla-architecture-v2-featured.png" class="wp-image-53245" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/build-deepseek-v3-mla-architecture-v2-featured.png?size=126x105&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/build-deepseek-v3-mla-architecture-v2-featured-300x249.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/build-deepseek-v3-mla-architecture-v2-featured.png?size=378x314&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/build-deepseek-v3-mla-architecture-v2-featured.png?size=504x418&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/build-deepseek-v3-mla-architecture-v2-featured.png?size=630x523&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/build-deepseek-v3-mla-architecture-v2-featured-768x637.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/build-deepseek-v3-mla-architecture-v2-featured.png?lossy=2&amp;strip=1&amp;webp=1 940w" sizes="(max-width: 630px) 100vw, 630px" /></a></figure></div>


<p>With that groundwork in place, we now turn to one of DeepSeek-V3’s most distinctive innovations: <strong>Multi-Head Latent Attention (MLA)</strong>. While traditional attention mechanisms have proven remarkably effective, they often come with steep computational and memory costs. MLA reimagines this core operation by introducing a latent representation space that dramatically reduces overhead while preserving the model’s ability to capture rich contextual relationships.</p>



<p>In this lesson, we’ll break down the theory behind MLA, explore why it matters, and then implement it step by step. This installment continues our hands-on approach — moving beyond abstract concepts to practical code — while advancing the broader goal of the series: to reconstruct DeepSeek-V3 from scratch, piece by piece, until we assemble and train the full architecture.</p>



<p>This lesson is the 2nd of the 6-part series on <strong>Building DeepSeek-V3 from Scratch</strong>:</p>



<ol class="wp-block-list">
<li><em><a href="https://pyimg.co/1atre" target="_blank" rel="noreferrer noopener">DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings</a></em> </li>



<li><em><strong><a href="https://pyimg.co/scgjl" target="_blank" rel="noreferrer noopener">Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture</a></strong></em> <strong>(this tutorial)</strong></li>



<li><em>Lesson 3</em></li>



<li><em>Lesson 4</em></li>



<li><em>Lesson 5</em></li>



<li><em>Lesson 6</em></li>
</ol>



<p><strong>To learn about DeepSeek-V3 and build it from scratch, </strong><em><strong>just keep reading.</strong></em></p>



<div id="pyi-source-code-block" class="source-code-wrap"><div class="gpd-source-code">
    <div class="gpd-source-code-content">
        <img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/source-code-icon.png?lossy=2&strip=1&webp=1" alt="">
        <h4>Looking for the source code to this post?</h4>
                    <a href="#download-the-code" class="pyis-cta-modal-open-modal">Jump Right To The Downloads Section <svg class="svg-icon arrow-right" width="12" height="12" aria-hidden="true" role="img" focusable="false" viewBox="0 0 14 14" fill="none" xmlns="http://www.w3.org/2000/svg"><path d="M6.8125 0.1875C6.875 0.125 6.96875 0.09375 7.09375 0.09375C7.1875 0.09375 7.28125 0.125 7.34375 0.1875L13.875 6.75C13.9375 6.8125 14 6.90625 14 7C14 7.125 13.9375 7.1875 13.875 7.25L7.34375 13.8125C7.28125 13.875 7.1875 13.9062 7.09375 13.9062C6.96875 13.9062 6.875 13.875 6.8125 13.8125L6.1875 13.1875C6.125 13.125 6.09375 13.0625 6.09375 12.9375C6.09375 12.8438 6.125 12.75 6.1875 12.6562L11.0312 7.8125H0.375C0.25 7.8125 0.15625 7.78125 0.09375 7.71875C0.03125 7.65625 0 7.5625 0 7.4375V6.5625C0 6.46875 0.03125 6.375 0.09375 6.3125C0.15625 6.25 0.25 6.1875 0.375 6.1875H11.0312L6.1875 1.34375C6.125 1.28125 6.09375 1.1875 6.09375 1.0625C6.09375 0.96875 6.125 0.875 6.1875 0.8125L6.8125 0.1875Z" fill="#169FE6"></path></svg></a>
            </div>
</div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-The-KV-Cache-Memory-Problem-in-DeepSeek-V3"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-The-KV-Cache-Memory-Problem-in-DeepSeek-V3">The KV Cache Memory Problem in DeepSeek-V3</a></h2>



<p>To understand why MLA is revolutionary, we must first understand the memory bottleneck in Transformer inference. Standard multi-head attention computes:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/c32/c32a2af114ff840b52cb30380e43d9fa-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{Attention}(Q, K, V) = \text{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V' title='\text{Attention}(Q, K, V) = \text{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/c32/c32a2af114ff840b52cb30380e43d9fa-ffffff-000000-0.png?lossy=2&strip=1&webp=1 297w,https://b2633864.smushcdn.com/2633864/wp-content/latex/c32/c32a2af114ff840b52cb30380e43d9fa-ffffff-000000-0.png?size=126x18&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 297px) 100vw, 297px' />,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/0ea/0eadc3c29bcbf4eb7630c12a115fb446-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='Q, K, V \in \mathbb{R}^{T \times d_\text{model}}' title='Q, K, V \in \mathbb{R}^{T \times d_\text{model}}' class='latex' /> are query, key, and value matrices for sequence length <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b9e/b9ece18c950afbfa6b0fdbfa4ff731d3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T' title='T' class='latex' />. In autoregressive generation (producing one token at a time), we cannot recompute attention over all previous tokens from scratch at each step — that would be <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/436/43633ac99f2ab4a26a21922c2a32bd0d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='O(T^2)' title='O(T^2)' class='latex' /> computation per token generated.</p>



<p>Instead, we cache the key and value matrices. When generating token <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e35/e358efa489f58062f10dd7316b65649e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t' title='t' class='latex' />, we only compute <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e42/e4202876915eb091a491b87652ec941f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='Q_t' title='Q_t' class='latex' /> (the query for the new token), then compute attention using <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e42/e4202876915eb091a491b87652ec941f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='Q_t' title='Q_t' class='latex' /> and the cached <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/361/3612b1e4d907a79611e95d5b25925ba9-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='K_{1:t-1}, V_{1:t-1}' title='K_{1:t-1}, V_{1:t-1}' class='latex' />. This reduces computation from <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/436/43633ac99f2ab4a26a21922c2a32bd0d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='O(T^2)' title='O(T^2)' class='latex' /> to <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/439/43995c439a3df1ae219e6814777e8ec7-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='O(T)' title='O(T)' class='latex' /> per generated token — a dramatic speedup.</p>
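<p>The caching idea can be sketched in a few lines of NumPy. This is purely illustrative (random untrained weights, a single head, no masking): the point is that each decoding step projects only the new token and appends one row to each cache, so per-token work grows linearly with context length.</p>

```python
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, cache_k, cache_v):
    """One autoregressive step: project only the new token, reuse cached K/V."""
    q_t = x_t @ W_q                       # (1, d) query for the new token only
    cache_k.append(x_t @ W_k)             # the cache grows by one row per step
    cache_v.append(x_t @ W_v)
    K = np.vstack(cache_k)                # (t, d) all keys seen so far
    V = np.vstack(cache_v)
    scores = (q_t @ K.T) / np.sqrt(K.shape[-1])   # O(t) work per generated token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over past positions
    return weights @ V                    # (1, d) attention output

d = 16
rng = np.random.default_rng(0)
W_q, W_k, W_v = rng.standard_normal((3, d, d)) * 0.1
cache_k, cache_v = [], []
for _ in range(5):                        # generate 5 tokens
    out = decode_step(rng.standard_normal((1, d)), W_q, W_k, W_v, cache_k, cache_v)

assert len(cache_k) == len(cache_v) == 5
assert out.shape == (1, d)
```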



<p>However, this cache comes at a steep memory cost. For a model with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/d20/d20caec3b48a1eef164cb4ca81ba2587-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='L' title='L' class='latex' /> layers, <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/c1d/c1d9f50f86825a1a2302ec2449c17196-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='H' title='H' class='latex' /> attention heads, and head dimension <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/5ec/5ec55cbd8a0eb01750844da3e072cf4c-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{head} = d_\text{model}/H' title='d_\text{head} = d_\text{model}/H' class='latex' />, the KV cache requires:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/d76/d7641ee846d3c398408474619f009a5b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{Memory}_\text{KV} = 2 \times L \times H \times d_\text{head} \times T \times \text{sizeof}(\text{float})' title='\text{Memory}_\text{KV} = 2 \times L \times H \times d_\text{head} \times T \times \text{sizeof}(\text{float})' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/d76/d7641ee846d3c398408474619f009a5b-ffffff-000000-0.png?lossy=2&strip=1&webp=1 361w,https://b2633864.smushcdn.com/2633864/wp-content/latex/d76/d7641ee846d3c398408474619f009a5b-ffffff-000000-0.png?size=126x7&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/d76/d7641ee846d3c398408474619f009a5b-ffffff-000000-0.png?size=252x13&lossy=2&strip=1&webp=1 252w' sizes='(max-width: 361px) 100vw, 361px' />.</p>



<p>For a model like GPT-3, with 96 layers, 96 heads, a head dimension of 128, and a sequence length of 2048, this is:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/cdf/cdf02ec8f99fbec2dab32658aa8d1b2a-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='2 \times 96 \times 96 \times 128 \times 2048 \times 2 \text{ bytes} = 9.6 \text{ GB per sequence}' title='2 \times 96 \times 96 \times 128 \times 2048 \times 2 \text{ bytes} = 9.6 \text{ GB per sequence}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/cdf/cdf02ec8f99fbec2dab32658aa8d1b2a-ffffff-000000-0.png?lossy=2&strip=1&webp=1 417w,https://b2633864.smushcdn.com/2633864/wp-content/latex/cdf/cdf02ec8f99fbec2dab32658aa8d1b2a-ffffff-000000-0.png?size=126x5&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/cdf/cdf02ec8f99fbec2dab32658aa8d1b2a-ffffff-000000-0.png?size=252x10&lossy=2&strip=1&webp=1 252w' sizes='(max-width: 417px) 100vw, 417px' />.</p>



<p>This means you can only serve a handful of users concurrently on even high-end GPUs. The memory bottleneck is often the limiting factor in deployment, not computation.</p>
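<p>As a quick sanity check, the arithmetic above can be reproduced in a few lines of Python (the 2-byte element size assumes fp16 storage):</p>

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
    # Leading factor of 2 accounts for both keys and values;
    # bytes_per_elem=2 assumes fp16 storage.
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

gpt3_like = kv_cache_bytes(layers=96, heads=96, head_dim=128, seq_len=2048)
print(f"{gpt3_like / 1e9:.2f} GB per sequence")  # 9.66 GB (~9.6 GB as above)
```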



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Multi-Head-Latent-Attention-MLA-KV-Cache-Compression-with-Low-Rank-Projections"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Multi-Head-Latent-Attention-MLA-KV-Cache-Compression-with-Low-Rank-Projections">Multi-Head Latent Attention (MLA): KV Cache Compression with Low-Rank Projections</a></h2>



<p>MLA (<strong>Figure 1</strong>) solves this through a compress-decompress strategy inspired by Low-Rank Adaptation (LoRA). The key insight: we do not need to store full <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/646/6469a03ebce607f5e9fc3cca520cc84a-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{model}' title='d_\text{model}' class='latex' />-dimensional representations. We can compress them into a lower-dimensional latent space for storage, then decompress when needed for computation.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/03/image-8-scaled.jpeg" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="717" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-8-1024x717.jpeg?lossy=2&strip=1&webp=1" alt="" class="wp-image-53211" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-8.jpeg?size=126x88&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-8-300x210.jpeg?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-8.jpeg?size=378x265&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-8.jpeg?size=504x353&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-8.jpeg?size=630x441&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-8-768x538.jpeg?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-8-1024x717.jpeg?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-8-scaled.jpeg?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 1:</strong> Multi-Head Latent Attention architecture (source: <a href="https://arxiv.org/pdf/2412.19437" target="_blank" rel="noreferrer noopener">DeepSeek-AI, 2025</a>).</figcaption></figure></div>


<p><strong>Step 1. Key-Value Compression:</strong> Instead of storing <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/5fb/5fb3f59770692c808ec0b864b2351e7b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='K, V \in \mathbb{R}^{T \times d_\text{model}}' title='K, V \in \mathbb{R}^{T \times d_\text{model}}' class='latex' /> directly, we project them through a low-rank bottleneck:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/029/029beacdd9c292345e480950b3c1ac78-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='C_{kv} = \text{RMSNorm}(X W_\text{down}) \in \mathbb{R}^{T \times r_{kv}}' title='C_{kv} = \text{RMSNorm}(X W_\text{down}) \in \mathbb{R}^{T \times r_{kv}}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/029/029beacdd9c292345e480950b3c1ac78-ffffff-000000-0.png?lossy=2&strip=1&webp=1 259w,https://b2633864.smushcdn.com/2633864/wp-content/latex/029/029beacdd9c292345e480950b3c1ac78-ffffff-000000-0.png?size=126x9&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 259px) 100vw, 259px' />,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/fa0/fa08bbe27a422c10b661998eb4c430bf-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='X \in \mathbb{R}^{T \times d_\text{model}}' title='X \in \mathbb{R}^{T \times d_\text{model}}' class='latex' /> is the input, <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/dbb/dbb13011d04b4ee23ed5945d1dd9fcb6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='W_\text{down} \in \mathbb{R}^{d_\text{model} \times r_{kv}}' title='W_\text{down} \in \mathbb{R}^{d_\text{model} \times r_{kv}}' class='latex' /> is the down-projection, and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/3d4/3d474992c45dd91fa455f1c2994b8a1b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_{kv} \le d_\text{model}' title='r_{kv} \le d_\text{model}' class='latex' /> is the low-rank dimension. We only cache <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/dd6/dd6158096ed0b0416c54f7ec5cc08a41-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='C_{kv}' title='C_{kv}' class='latex' /> rather than the full <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/a5f/a5f3c6a11b03839d46af9fb43c97c188-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='K' title='K' class='latex' /> and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/520/5206560a306a2e085a437fd258eb57ce-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='V' title='V' class='latex' />.</p>



<p><strong>Step 2. Key-Value Decompression:</strong> When we need the actual key and value matrices for attention computation, we decompress:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/4f0/4f04c9d30d41a5b61e3f98597aa25295-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='K_\text{content} = C_{kv} W_K \in \mathbb{R}^{T \times d_\text{model}}' title='K_\text{content} = C_{kv} W_K \in \mathbb{R}^{T \times d_\text{model}}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/4f0/4f04c9d30d41a5b61e3f98597aa25295-ffffff-000000-0.png?lossy=2&strip=1&webp=1 209w,https://b2633864.smushcdn.com/2633864/wp-content/latex/4f0/4f04c9d30d41a5b61e3f98597aa25295-ffffff-000000-0.png?size=126x10&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 209px) 100vw, 209px' /></p>



<p class="has-text-align-center">
<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/5f0/5f0a343a43393109efcb182011489ba0-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='V = C_{kv} W_V \in \mathbb{R}^{T \times d_\text{model}}' title='V = C_{kv} W_V \in \mathbb{R}^{T \times d_\text{model}}' class='latex' />,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6cf/6cfd28954e506e11141cd3f8160f72d2-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='W_K, W_V \in \mathbb{R}^{r_{kv} \times d_\text{model}}' title='W_K, W_V \in \mathbb{R}^{r_{kv} \times d_\text{model}}' class='latex' /> are up-projection matrices. This decomposition approximates the full key and value matrices through a low-rank factorization: <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/52d/52d5e8ba4023e4f3a7904deae066bc38-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='K \approx X W_\text{down} W_K' title='K \approx X W_\text{down} W_K' class='latex' /> and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/024/0241098f02eeaa2b0c7ee954147a3d4a-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='V \approx X W_\text{down} W_V' title='V \approx X W_\text{down} W_V' class='latex' />.</p>
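<p>The two steps can be sketched with plain NumPy matrices. The weights below are random and untrained, and the RMSNorm on the latent is omitted, so this only illustrates the shapes and the cache savings, not learned behavior:</p>

```python
import numpy as np

T, d_model, r_kv = 10, 256, 128
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d_model))                  # input hidden states
W_down = rng.standard_normal((d_model, r_kv)) * 0.05   # shared down-projection
W_K = rng.standard_normal((r_kv, d_model)) * 0.05      # key up-projection
W_V = rng.standard_normal((r_kv, d_model)) * 0.05      # value up-projection

C_kv = X @ W_down        # (T, r_kv): the only tensor the KV cache stores
K = C_kv @ W_K           # (T, d_model): keys decompressed on demand
V = C_kv @ W_V           # (T, d_model): values decompressed on demand

assert C_kv.shape == (T, r_kv) and K.shape == V.shape == (T, d_model)
# Cached floats drop from 2*T*d_model (K and V) to T*r_kv (C_kv alone)
assert (2 * T * d_model) / (T * r_kv) == 4.0
```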



<p><strong>Memory Savings:</strong> Instead of caching <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/288/2886ddbde3ba8cd160cccf49060810bb-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='2 \times T \times d_\text{model}' title='2 \times T \times d_\text{model}' class='latex' />, we cache <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/044/04458fd3516ca3504f06d1d6b0899434-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T \times r_{kv}' title='T \times r_{kv}' class='latex' />. The reduction factor is <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/557/557ea5ba0e56ff1e9c962d1f3fd066f2-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\frac{2 \times d_\text{model}}{r_{kv}}' title='\frac{2 \times d_\text{model}}{r_{kv}}' class='latex' />. For our configuration with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/79d/79d0b8290e3c7cc6a6c914fcecd14969-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{model} = 256' title='d_\text{model} = 256' class='latex' /> and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/22a/22a4c847a1bb7331479b1cd47f9c51f4-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_{kv} = 128' title='r_{kv} = 128' class='latex' />, this is a 4× reduction. For larger models with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b8e/b8e7da42339ae3d81d9a5f1db166c6d3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{model} = 4096' title='d_\text{model} = 4096' class='latex' /> and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/f80/f80344af61324b6974c2a0f355dc9a58-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_{kv} = 512' title='r_{kv} = 512' class='latex' />, it&#8217;s a 16× reduction — transformative for deployment.</p>
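<p>A two-line check of the reduction factors quoted above:</p>

```python
def kv_reduction_factor(d_model, r_kv):
    # Full cache stores K and V (2 * d_model floats per token, per layer);
    # MLA stores only the latent C_kv (r_kv floats per token, per layer).
    return 2 * d_model / r_kv

assert kv_reduction_factor(256, 128) == 4.0      # this tutorial's configuration
assert kv_reduction_factor(4096, 512) == 16.0    # a larger-model configuration
```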



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Query-Compression-and-Rotary-Positional-Embeddings-RoPE-Integration"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Query-Compression-and-Rotary-Positional-Embeddings-RoPE-Integration">Query Compression and Rotary Positional Embeddings (RoPE) Integration</a></h2>



<p>MLA extends compression to queries, though less aggressively since queries are not cached:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/289/289f2512ee9337a49decc448010ff68b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='C_q = X W_q \in \mathbb{R}^{T \times r_q}' title='C_q = X W_q \in \mathbb{R}^{T \times r_q}' class='latex' /></p>



<p class="has-text-align-center">
<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b89/b89e9eae3f352fafa3acbe6771f844c2-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='Q_\text{content} = C_q W_{Q} \in \mathbb{R}^{T \times d_\text{model}}' title='Q_\text{content} = C_q W_{Q} \in \mathbb{R}^{T \times d_\text{model}}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/b89/b89e9eae3f352fafa3acbe6771f844c2-ffffff-000000-0.png?lossy=2&strip=1&webp=1 200w,https://b2633864.smushcdn.com/2633864/wp-content/latex/b89/b89e9eae3f352fafa3acbe6771f844c2-ffffff-000000-0.png?size=126x12&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 200px) 100vw, 200px' />,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/698/698eda0f93c2b24773206a15cf460703-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_q' title='r_q' class='latex' /> can be different from <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/fdc/fdc6a99c1f6e297720c7a8fb9c66bfcc-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_{kv}' title='r_{kv}' class='latex' />. In our configuration, <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/157/157303f0c7f82826d0cc5be2bee6125c-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_q = 192' title='r_q = 192' class='latex' /> versus <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/22a/22a4c847a1bb7331479b1cd47f9c51f4-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_{kv} = 128' title='r_{kv} = 128' class='latex' /> — we give queries slightly more capacity.</p>



<p>Now comes the clever part: integrating RoPE. We split both queries and keys into content and positional components:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/bb6/bb6ab893acac0d32c66ea670c4da0ab3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='Q = [Q_\text{content} \parallel Q_\text{rope}]' title='Q = [Q_\text{content} \parallel Q_\text{rope}]' class='latex' /></p>



<p class="has-text-align-center">
<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/78a/78a6aefa47951fcb5f56191065b985b4-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='K = [K_\text{content} \parallel K_\text{rope}]' title='K = [K_\text{content} \parallel K_\text{rope}]' class='latex' />,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/d13/d137aba004822e3783f694305e05a6ab-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\parallel' title='\parallel' class='latex' /> denotes concatenation. The content components come from the compression-decompression process described above. The positional components are separate projections that we apply RoPE to:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b76/b7625f3a174bf6007146cb1e24ad7573-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='Q_\text{rope} = \text{RoPE}_m(C_q W{Q_\text{rope}})' title='Q_\text{rope} = \text{RoPE}_m(C_q W{Q_\text{rope}})' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/b76/b7625f3a174bf6007146cb1e24ad7573-ffffff-000000-0.png?lossy=2&strip=1&webp=1 194w,https://b2633864.smushcdn.com/2633864/wp-content/latex/b76/b7625f3a174bf6007146cb1e24ad7573-ffffff-000000-0.png?size=126x12&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 194px) 100vw, 194px' /></p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/92a/92ae045f79977a231c47948a5523a250-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='K_\text{rope} = \text{RoPE}_n(X W{K_\text{rope}})' title='K_\text{rope} = \text{RoPE}_n(X W{K_\text{rope}})' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/92a/92ae045f79977a231c47948a5523a250-ffffff-000000-0.png?lossy=2&strip=1&webp=1 190w,https://b2633864.smushcdn.com/2633864/wp-content/latex/92a/92ae045f79977a231c47948a5523a250-ffffff-000000-0.png?size=126x13&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 190px) 100vw, 190px' />,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/198/198ed2fa37240dac80a2a5f780d1ceb4-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{RoPE}_m' title='\text{RoPE}_m' class='latex' /> denotes applying rotary embedding at position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6f8/6f8f57715090da2632453988d9a1501b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='m' title='m' class='latex' />. This separation is crucial: content and position are independently represented and combined only in the attention scores.</p>
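<p>Because the content and RoPE parts occupy disjoint slices of the feature dimension, the dot product of the concatenated vectors is exactly the sum of the two partial dot products. This is easy to verify numerically (random matrices stand in for the actual projections):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_content, d_rope = 6, 32, 16
Q_c, K_c = rng.standard_normal((2, T, d_content))   # content components
Q_r, K_r = rng.standard_normal((2, T, d_rope))      # RoPE-rotated components

# Concatenate content and positional parts along the feature axis
Q = np.concatenate([Q_c, Q_r], axis=-1)
K = np.concatenate([K_c, K_r], axis=-1)

# The raw attention scores decompose exactly into content + positional terms
assert np.allclose(Q @ K.T, Q_c @ K_c.T + Q_r @ K_r.T)
```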



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Attention-Computation-with-Multi-Head-Latent-Attention-MLA"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Attention-Computation-with-Multi-Head-Latent-Attention-MLA">Attention Computation with Multi-Head Latent Attention (MLA)</a></h2>



<p>The complete attention computation becomes:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/ba6/ba69f565f2185af859563e3059da9e47-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='Q = [Q_\text{content} \parallel Q_\text{rope}] = [C_q W_Q \parallel \text{RoPE}(C_q W_{Q_\text{rope}})]' title='Q = [Q_\text{content} \parallel Q_\text{rope}] = [C_q W_Q \parallel \text{RoPE}(C_q W_{Q_\text{rope}})]' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/ba6/ba69f565f2185af859563e3059da9e47-ffffff-000000-0.png?lossy=2&strip=1&webp=1 357w,https://b2633864.smushcdn.com/2633864/wp-content/latex/ba6/ba69f565f2185af859563e3059da9e47-ffffff-000000-0.png?size=126x7&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/ba6/ba69f565f2185af859563e3059da9e47-ffffff-000000-0.png?size=252x15&lossy=2&strip=1&webp=1 252w' sizes='(max-width: 357px) 100vw, 357px' /></p>



<p class="has-text-align-center">
<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/df8/df82e0d3c9e02692d9042e92a9d4cc79-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='K = [K_\text{content} \parallel K_\text{rope}] = [C_{kv} W_K \parallel \text{RoPE}(X W_{K_\text{rope}})]' title='K = [K_\text{content} \parallel K_\text{rope}] = [C_{kv} W_K \parallel \text{RoPE}(X W_{K_\text{rope}})]' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/df8/df82e0d3c9e02692d9042e92a9d4cc79-ffffff-000000-0.png?lossy=2&strip=1&webp=1 367w,https://b2633864.smushcdn.com/2633864/wp-content/latex/df8/df82e0d3c9e02692d9042e92a9d4cc79-ffffff-000000-0.png?size=126x7&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/df8/df82e0d3c9e02692d9042e92a9d4cc79-ffffff-000000-0.png?size=252x14&lossy=2&strip=1&webp=1 252w' sizes='(max-width: 367px) 100vw, 367px' /></p>



<p class="has-text-align-center">
<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b74/b7447e766d39eff0dc4a15c7ae50bd09-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='V = C_{kv} W_V' title='V = C_{kv} W_V' class='latex' />.</p>



<p>Then standard multi-head attention:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/d0f/d0f1534592cc884a5865cbe753d0a05f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)' title='\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/d0f/d0f1534592cc884a5865cbe753d0a05f-ffffff-000000-0.png?lossy=2&strip=1&webp=1 279w,https://b2633864.smushcdn.com/2633864/wp-content/latex/d0f/d0f1534592cc884a5865cbe753d0a05f-ffffff-000000-0.png?size=126x9&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 279px) 100vw, 279px' />,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/2d3/2d3f3693da98b88b64f0f2d7b131cb42-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='W_i^Q, W_i^K, W_i^V' title='W_i^Q, W_i^K, W_i^V' class='latex' /> are per-head projections. The attention scores <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/044/0440e9f540210a57a7e2f2681a87fabf-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='QK^T' title='QK^T' class='latex' /> naturally incorporate both content similarity (through <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/961/961b8a474d604b9810b5ac8e33db3b56-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='Q_\text{content} K_\text{content}^T' title='Q_\text{content} K_\text{content}^T' class='latex' />) and positional information (through <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e7a/e7afd98856cf9df8e7d03d2ab567f448-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='Q_\text{rope} K_\text{rope}^T' title='Q_\text{rope} K_\text{rope}^T' class='latex' />).</p>



<p><strong>Causal Masking:</strong> For autoregressive language modeling, we must prevent tokens from attending to future positions. We apply a causal mask:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/482/482c96988c0bf2101c5f21a2f8c4e4cf-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{mask}_{ij} = \begin{cases} 0 &amp; \text{if } i \geq j \\ -\infty &amp; \text{if } i &lt; j \end{cases} \ ' title='\text{mask}_{ij} = \begin{cases} 0 &amp; \text{if } i \geq j \\ -\infty &amp; \text{if } i &lt; j \end{cases} \ ' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/482/482c96988c0bf2101c5f21a2f8c4e4cf-ffffff-000000-0.png?lossy=2&strip=1&webp=1 177w,https://b2633864.smushcdn.com/2633864/wp-content/latex/482/482c96988c0bf2101c5f21a2f8c4e4cf-ffffff-000000-0.png?size=126x36&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 177px) 100vw, 177px' /> .</p>



<p>This ensures position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/865/865c0c0b4ab0e063e5caa3387c1a8741-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='i' title='i' class='latex' /> can only attend to positions <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/bd9/bd95f3f46cd1f363501c8f62cccf5de1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='0, 1, \ldots, i' title='0, 1, \ldots, i' class='latex' />, maintaining the autoregressive property.</p>



<p><strong>Attention Weights and Output:</strong> After computing scores with the causal mask applied:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/738/738a0be6a3c9276b311ca66ff035228a-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='A = \text{softmax}\left(\dfrac{QK^T + \text{mask}}{\sqrt{d_k}}\right) \in \mathbb{R}^{T \times T}' title='A = \text{softmax}\left(\dfrac{QK^T + \text{mask}}{\sqrt{d_k}}\right) \in \mathbb{R}^{T \times T}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/738/738a0be6a3c9276b311ca66ff035228a-ffffff-000000-0.png?lossy=2&strip=1&webp=1 273w,https://b2633864.smushcdn.com/2633864/wp-content/latex/738/738a0be6a3c9276b311ca66ff035228a-ffffff-000000-0.png?size=126x19&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 273px) 100vw, 273px' />,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/751/7516b96678349ed002f1931a294f577c-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_k' title='d_k' class='latex' /> is the effective key dimension (content plus RoPE dimensions). We apply attention to values:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e15/e152aac582dc808fe8dc7721bddb6d7f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='O = A V W_O' title='O = A V W_O' class='latex' />,</p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/f85/f8546364d53cb9ff46ab53434bc42a22-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='W_O' title='W_O' class='latex' /> is the output projection. Finally, dropout is applied for regularization, and the result is added to the residual connection.</p>
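<p>Putting the mask, softmax, and value aggregation together, here is a minimal single-head NumPy sketch of causal attention. The per-head projections, the output projection <code data-enlighter-language="python" class="EnlighterJSRAW">W_O</code>, and dropout are omitted for brevity:</p>

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                 # (T, T) raw scores
    mask = np.triu(np.full((T, T), -np.inf), k=1)   # -inf strictly above diagonal
    scores = scores + mask
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)              # row-wise softmax
    return A, A @ V

rng = np.random.default_rng(0)
T, d = 5, 8
A, out = causal_attention(*rng.standard_normal((3, T, d)))

assert np.allclose(A.sum(axis=-1), 1.0)     # each row is a valid distribution
assert np.allclose(np.triu(A, k=1), 0.0)    # zero weight on future positions
assert out.shape == (T, d)
```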



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Implementation-Multi-Head-Latent-Attention-MLA"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Implementation-Multi-Head-Latent-Attention-MLA">Implementation: Multi-Head Latent Attention (MLA)</a></h2>



<p>Here is the complete implementation of MLA:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture" data-enlighter-group="1">class MultiheadLatentAttention(nn.Module):
    """
    Multihead Latent Attention (MLA) - DeepSeek's efficient attention mechanism

    Key innovations:
    - Compression/decompression of queries and key-values
    - LoRA-style low-rank projections for efficiency
    - RoPE with separate content and positional components
    """

    def __init__(self, config: DeepSeekConfig):
        super().__init__()
        self.config = config
        self.n_embd = config.n_embd
        self.n_head = config.n_head
        self.head_dim = config.n_embd // config.n_head

        # Compression dimensions
        self.kv_lora_rank = config.kv_lora_rank
        self.q_lora_rank = config.q_lora_rank
        self.rope_dim = config.rope_dim

</pre>



<p><strong>Lines 11-21: Configuration and Dimensions.</strong> We extract key parameters from the configuration object, computing the head dimension as <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8a7/8a79ca4aebf2f271ccea6b1e8424a0e1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{head} = d_\text{model} / H' title='d_\text{head} = d_\text{model} / H' class='latex' />. We store compression ranks (<code data-enlighter-language="python" class="EnlighterJSRAW">kv_lora_rank</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">q_lora_rank</code>) and the RoPE dimension. These define the memory-accuracy tradeoff: lower ranks mean more compression but potentially lower quality. Our choices balance efficiency with model capacity.</p>
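<p>For reference, the attributes read from <code data-enlighter-language="python" class="EnlighterJSRAW">config</code> imply a configuration object along these lines. The first four values are the ones quoted in this tutorial; <code data-enlighter-language="python" class="EnlighterJSRAW">rope_dim</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">block_size</code>, and <code data-enlighter-language="python" class="EnlighterJSRAW">dropout</code> are illustrative placeholders, not the tutorial's actual settings:</p>

```python
from dataclasses import dataclass

@dataclass
class DeepSeekConfig:
    # Dimensions quoted in this tutorial
    n_embd: int = 256        # d_model
    n_head: int = 8          # head_dim = 256 // 8 = 32
    kv_lora_rank: int = 128  # r_kv
    q_lora_rank: int = 192   # r_q
    # Illustrative placeholders only (assumed values, not from the tutorial)
    rope_dim: int = 16
    block_size: int = 512
    dropout: float = 0.1
    bias: bool = False

cfg = DeepSeekConfig()
assert cfg.n_embd // cfg.n_head == 32   # per-head dimension
```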



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="23" data-enlighter-title="Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture" data-enlighter-group="2">        # KV compression: shared down-projection into the latent space
        self.kv_proj = nn.Linear(self.n_embd, self.kv_lora_rank, bias=False)
        self.kv_norm = nn.RMSNorm(self.kv_lora_rank)  # stabilizes the cached latent (PyTorch >= 2.4)

        # KV decompression
        self.k_decompress = nn.Linear(self.kv_lora_rank, self.n_head * self.head_dim, bias=False)
        self.v_decompress = nn.Linear(self.kv_lora_rank, self.n_head * self.head_dim, bias=False)

        # Query compression
        self.q_proj = nn.Linear(self.n_embd, self.q_lora_rank, bias=False)
        self.q_decompress = nn.Linear(self.q_lora_rank, self.n_head * self.head_dim, bias=False)

        # RoPE projections
        self.k_rope_proj = nn.Linear(self.n_embd, self.n_head * self.rope_dim, bias=False)
        self.q_rope_proj = nn.Linear(self.q_lora_rank, self.n_head * self.rope_dim, bias=False)

        # Output projection
        self.o_proj = nn.Linear(self.n_head * self.head_dim, self.n_embd, bias=config.bias)

        # Dropout
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)

        # RoPE
        self.rope = RotaryEmbedding(self.rope_dim, config.block_size)

        # Causal mask
        self.register_buffer(
            "causal_mask",
            torch.tril(torch.ones(config.block_size, config.block_size)).view(
                1, 1, config.block_size, config.block_size
            )
        )
</pre>



<p><strong>Lines 23-29: KV Compression Pipeline</strong><strong>.</strong> The compression-decompression architecture follows the low-rank factorization principle. The <code data-enlighter-language="python" class="EnlighterJSRAW">kv_proj</code> layer performs the down-projection from <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/79d/79d0b8290e3c7cc6a6c914fcecd14969-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{model} = 256' title='d_\text{model} = 256' class='latex' /> to <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/22a/22a4c847a1bb7331479b1cd47f9c51f4-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_{kv} = 128' title='r_{kv} = 128' class='latex' />, cutting the dimensionality in half. We apply RMSNorm to the compressed representation for stability — this normalization helps prevent the compressed representation from drifting to extreme values during training. The decompression layers <code data-enlighter-language="python" class="EnlighterJSRAW">k_decompress</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">v_decompress</code> then expand back to <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/a6f/a6f30f860651ff6e705192f3f91de06e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='H \times d_\text{head} = 8 \times 32 = 256' title='H \times d_\text{head} = 8 \times 32 = 256' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/a6f/a6f30f860651ff6e705192f3f91de06e-ffffff-000000-0.png?lossy=2&strip=1&webp=1 181w,https://b2633864.smushcdn.com/2633864/wp-content/latex/a6f/a6f30f860651ff6e705192f3f91de06e-ffffff-000000-0.png?size=126x11&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 181px) 100vw, 181px' /> dimensions. Note that we use <code data-enlighter-language="python" class="EnlighterJSRAW">bias=False</code> for these projections — empirical research shows that biases in attention projections do not significantly help and add unnecessary parameters.</p>
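<p>A standalone shape walk-through makes the compression pipeline concrete. The snippet below uses this lesson's toy sizes (<code data-enlighter-language="python" class="EnlighterJSRAW">d_model=256</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">r_kv=128</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">H=8</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">d_head=32</code>); the layers are freshly initialized stand-ins named after the module's projections.</p>

```python
import torch
import torch.nn as nn

# Toy dimensions from this lesson
B, T, d_model, r_kv, H, d_head = 2, 16, 256, 128, 8, 32

kv_proj = nn.Linear(d_model, r_kv, bias=False)          # down-projection
k_decompress = nn.Linear(r_kv, H * d_head, bias=False)  # expand latent -> K
v_decompress = nn.Linear(r_kv, H * d_head, bias=False)  # expand latent -> V

x = torch.randn(B, T, d_model)
latent = kv_proj(x)        # [B, T, 128]: the only KV tensor that needs caching
k = k_decompress(latent)   # [B, T, 256]
v = v_decompress(latent)   # [B, T, 256]
print(latent.shape, k.shape, v.shape)
```

<p>Note how one 128-dimensional latent regenerates both keys and values: that single tensor is what inference would cache in place of full K and V.</p>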



<p><strong>Lines 31-33: Query Processing and RoPE Projections</strong><strong>.</strong> Query handling follows a similar compression pattern but with a slightly higher rank (<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/157/157303f0c7f82826d0cc5be2bee6125c-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_q = 192' title='r_q = 192' class='latex' />). The asymmetry makes sense: we do not cache queries, so memory pressure is lower, and we can afford more capacity. The RoPE projections are separate pathways — <code data-enlighter-language="python" class="EnlighterJSRAW">k_rope_proj</code> projects directly from the input <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/021/02129bb861061d1a052c592e2dc6b383-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='X' title='X' class='latex' />, while <code data-enlighter-language="python" class="EnlighterJSRAW">q_rope_proj</code> projects from the compressed query representation. Both target the RoPE dimension of 64. This separation of content and position is architecturally elegant: the model learns different transformations for &#8220;what&#8221; (content) versus &#8220;where&#8221; (position).</p>



<p><strong>Lines 36-51: Infrastructure Components</strong><strong>.</strong> The output projection <code data-enlighter-language="python" class="EnlighterJSRAW">o_proj</code> combines multi-head outputs back to the model dimension. We include 2 dropout layers:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">attn_dropout</code>: applied to attention weights (reducing overfitting on attention patterns)</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">resid_dropout</code>: applied to the final output (regularizing the residual connection)</li>
</ul>



<p>The RoPE module is instantiated with our chosen dimension and maximum sequence length. Finally, we create and register a causal mask as a buffer — by using <code data-enlighter-language="python" class="EnlighterJSRAW">register_buffer</code>, this tensor moves with the model to GPU/CPU and is included in the state dict, but is not treated as a learnable parameter.</p>
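<p>The buffer semantics are easy to verify in isolation. This minimal module (a stand-in, not the lesson's class) registers a causal mask the same way and checks that it appears in the state dict but not among the trainable parameters.</p>

```python
import torch
import torch.nn as nn

class Masked(nn.Module):
    def __init__(self, block_size: int = 4):
        super().__init__()
        self.proj = nn.Linear(8, 8)
        # saved with weights and moved by .to(device), but never trained
        self.register_buffer(
            "causal_mask",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

m = Masked()
print("causal_mask" in m.state_dict())                           # saved alongside weights
print(any(n == "causal_mask" for n, _ in m.named_parameters()))  # not a parameter
```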



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="52" data-enlighter-title="Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture" data-enlighter-group="3">    def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
        B, T, C = x.size()

        # Compression phase
        kv_compressed = self.kv_norm(self.kv_proj(x))
        q_compressed = self.q_proj(x)

        # Decompression phase
        k_content = self.k_decompress(kv_compressed)
        v = self.v_decompress(kv_compressed)
        q_content = self.q_decompress(q_compressed)

        # RoPE components
        k_rope = self.k_rope_proj(x)
        q_rope = self.q_rope_proj(q_compressed)

        # Reshape [B, H, T, d_head] for multi-head attention
        k_content = k_content.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        q_content = q_content.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k_rope = k_rope.view(B, T, self.n_head, self.rope_dim).transpose(1, 2)
        q_rope = q_rope.view(B, T, self.n_head, self.rope_dim).transpose(1, 2)

        # Apply RoPE
        cos, sin = self.rope(x, T)
        q_rope = apply_rope(q_rope, cos, sin)
        k_rope = apply_rope(k_rope, cos, sin)

        # Concatenate content and rope parts
        q = torch.cat([q_content, q_rope], dim=-1)
        k = torch.cat([k_content, k_rope], dim=-1)

</pre>



<p><strong>Lines 52-57: Compression Phase</strong><strong>.</strong> The forward pass begins by compressing the input. We project the input onto the KV latent space and apply normalization, and separately project it onto the query latent space. These operations are lightweight — just matrix multiplications. The compressed representations are what we would cache during inference. Notice that <code data-enlighter-language="python" class="EnlighterJSRAW">kv_compressed</code> has shape <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/adc/adc7537e80565e8e66aadd0c2e4d8d9b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='[B, T, 128]' title='[B, T, 128]' class='latex' /> versus the original <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/164/164ef205ce8f83b5b35003a75459d10b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='[B, T, 256]' title='[B, T, 256]' class='latex' /> — we&#8217;ve already halved the memory footprint.</p>



<p><strong>Lines 60-73: Decompression and RoPE</strong><strong>.</strong> We decompress to get content components and compute separate RoPE projections. Then comes a crucial reshaping step: we convert from <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/0c4/0c4a6bc039a37a204979e51949c8d0bf-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='[B, T, H \times d_\text{head}]' title='[B, T, H \times d_\text{head}]' class='latex' /> to <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/ca2/ca2c0152d1bb4eac8662d1600c713cc0-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='[B, H, T, d_\text{head}]' title='[B, H, T, d_\text{head}]' class='latex' />, moving the head dimension before the sequence dimension. This layout is required for multi-head attention — each head operates independently, and we want to batch those operations. The <code data-enlighter-language="python" class="EnlighterJSRAW">.transpose(1, 2)</code> operation efficiently swaps dimensions without copying data.</p>
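<p>The reshape is worth checking end to end. The snippet below performs the same split-and-transpose on a toy tensor and confirms two claims from the text: the result is a view over the same storage (no copy), and head <code data-enlighter-language="python" class="EnlighterJSRAW">h</code> of token <code data-enlighter-language="python" class="EnlighterJSRAW">t</code> is exactly the input's slice <code data-enlighter-language="python" class="EnlighterJSRAW">[t, h*d:(h+1)*d]</code>.</p>

```python
import torch

B, T, H, d = 2, 5, 8, 32
x = torch.randn(B, T, H * d)

heads = x.view(B, T, H, d).transpose(1, 2)  # [B, H, T, d]
print(heads.shape)

# transpose returns a view over the same storage -- no data is copied
assert heads.data_ptr() == x.data_ptr()
# head 3 of token 1 is x's slice [1, 3*d:4*d]
assert torch.equal(heads[0, 3, 1], x[0, 1, 3 * d:4 * d])
```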



<p><strong>Lines 76-82: RoPE Application and Concatenation</strong><strong>.</strong> We fetch cosine and sine tensors from our RoPE module and apply the rotation to both queries and keys. Critically, we only rotate the RoPE components, not the content components. This maintains the separation between &#8220;what&#8221; and &#8220;where&#8221; information. We then concatenate along the feature dimension, creating final query and key tensors of shape <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/d64/d64bff4c35da78fe1c1b2f1a5be71be1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='[B, H, T, d_\text{head} + d_\text{rope}] = [B, 8, T, 96]' title='[B, H, T, d_\text{head} + d_\text{rope}] = [B, 8, T, 96]' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/d64/d64bff4c35da78fe1c1b2f1a5be71be1-ffffff-000000-0.png?lossy=2&strip=1&webp=1 252w,https://b2633864.smushcdn.com/2633864/wp-content/latex/d64/d64bff4c35da78fe1c1b2f1a5be71be1-ffffff-000000-0.png?size=126x9&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 252px) 100vw, 252px' />. The attention scores will capture both content similarity and relative position.</p>
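<p>Since <code data-enlighter-language="python" class="EnlighterJSRAW">apply_rope</code> is defined elsewhere in this series, here is a self-contained sketch of a rotate-half style RoPE application. The pairing convention and frequency schedule below are assumptions for illustration; the series' implementation may differ in detail. The key property to verify is geometric: rotation changes direction, not length.</p>

```python
import torch

def rotate_half(x):
    # split last dim in two halves and rotate pairs: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope_sketch(x, cos, sin):
    # rotate each (x_i, x_{i+d/2}) pair by a position-dependent angle
    return x * cos + rotate_half(x) * sin

B, H, T, d = 1, 2, 6, 8
x = torch.randn(B, H, T, d)
freqs = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))  # [d/2]
angles = torch.arange(T).float()[:, None] * freqs[None, :]    # [T, d/2]
cos = angles.cos().repeat(1, 2)[None, None]                   # [1, 1, T, d]
sin = angles.sin().repeat(1, 2)[None, None]

y = apply_rope_sketch(x, cos, sin)
# per-position feature norms are preserved by the rotation
print(torch.allclose(x.norm(dim=-1), y.norm(dim=-1), atol=1e-5))
```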



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="84" data-enlighter-title="Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture" data-enlighter-group="4">        # Attention computation
        scale = 1.0 / math.sqrt(q.size(-1))
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        # Apply causal mask
        scores = scores.masked_fill(self.causal_mask[:, :, :T, :T] == 0, float('-inf'))

        # Apply padding mask if provided
        if attention_mask is not None:
            # use the dtype's most negative finite value: multiplying by
            # float('-inf') would produce 0 * -inf = NaN at unmasked positions
            padding_mask_additive = (1.0 - attention_mask.to(scores.dtype)).unsqueeze(1).unsqueeze(2) * torch.finfo(scores.dtype).min
            scores = scores + padding_mask_additive

        # Softmax and dropout
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)

        # Apply attention to values
        out = torch.matmul(attn_weights, v)

        # Reshape and project
        out = out.transpose(1, 2).contiguous().view(B, T, self.n_head * self.head_dim)
        out = self.resid_dropout(self.o_proj(out))

        return out
</pre>



<p><strong>Lines 84-94: Attention Score Computation and Masking</strong><strong>.</strong> We compute scaled dot-product attention: <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6a2/6a28139693df21eb3ddb72dc9969849b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='QK^T / \sqrt{d_k}' title='QK^T / \sqrt{d_k}' class='latex' />. The scaling factor is critical for training stability — without it, attention logits would grow large as dimensions increase, leading to vanishing gradients in the softmax. We apply the causal mask using <code data-enlighter-language="python" class="EnlighterJSRAW">masked_fill</code>, setting future positions to negative infinity so they contribute zero probability after softmax. If an attention mask is provided (for handling padding), we convert it to an additive mask and add it to scores. This handles variable-length sequences in a batch.</p>
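<p>The effect of the causal mask can be verified in a few lines: positions set to negative infinity receive exactly zero probability after softmax, while each row still sums to one over the visible positions.</p>

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(1, 1, T, T)
causal = torch.tril(torch.ones(T, T)).view(1, 1, T, T)

masked = scores.masked_fill(causal == 0, float('-inf'))
probs = F.softmax(masked, dim=-1)
print(probs[0, 0])

# each row is a valid distribution over visible positions only
assert torch.allclose(probs.sum(dim=-1), torch.ones(1, 1, T))
assert torch.all(probs[0, 0, 0, 1:] == 0)  # token 0 attends only to itself
```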



<p><strong>Lines 97-107: Attention Weights and Output</strong><strong>.</strong> We apply softmax to convert scores to probabilities, ensuring they sum to 1 over the sequence dimension. Dropout is applied to attention weights — this has been shown to help with generalization, perhaps by preventing the model from becoming overly dependent on specific attention patterns. We multiply attention weights by values to get our output. The final transpose and reshape convert from the multi-head layout <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/ca2/ca2c0152d1bb4eac8662d1600c713cc0-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='[B, H, T, d_\text{head}]' title='[B, H, T, d_\text{head}]' class='latex' /> back to <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/0c4/0c4a6bc039a37a204979e51949c8d0bf-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='[B, T, H \times d_\text{head}]' title='[B, T, H \times d_\text{head}]' class='latex' />, concatenating all heads. The output projection and residual dropout complete the attention module.</p>
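<p>The inverse reshape deserves one note: <code data-enlighter-language="python" class="EnlighterJSRAW">.contiguous()</code> is required because <code data-enlighter-language="python" class="EnlighterJSRAW">transpose</code> yields a non-contiguous view, and <code data-enlighter-language="python" class="EnlighterJSRAW">.view()</code> can only reinterpret contiguous memory. The toy check below also confirms that each token's output vector is the concatenation of its per-head outputs.</p>

```python
import torch

B, H, T, d = 2, 8, 5, 32
out = torch.randn(B, H, T, d)

# transpose -> contiguous copy -> flatten heads into the feature dimension
merged = out.transpose(1, 2).contiguous().view(B, T, H * d)
print(merged.shape)

# token t's vector is the concatenation of its H head outputs
assert torch.equal(merged[0, 1, 0:d], out[0, 0, 1])
assert torch.equal(merged[0, 1, d:2 * d], out[0, 1, 1])
```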



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Multi-Head-Latent-Attention-and-KV-Cache-Optimization"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Multi-Head-Latent-Attention-and-KV-Cache-Optimization">Multi-Head Latent Attention and KV Cache Optimization</a></h2>



<p>Multi-Head Latent Attention (MLA) is one approach to KV cache optimization — compression through low-rank projections. Other approaches include the following: </p>



<ul class="wp-block-list">
<li>Multi-Query Attention (MQA), where all heads share a single key and value</li>



<li>Grouped-Query Attention (GQA), where heads are grouped to share KV pairs</li>



<li>KV Cache Quantization, which stores keys and values at lower precision (INT8 or INT4)</li>



<li>Cache Eviction Strategies, which discard less important past tokens</li>
</ul>



<p>Each approach comes with trade-offs: </p>



<ul class="wp-block-list">
<li>MQA and GQA reduce quality more than MLA but are simpler</li>



<li>Quantization can degrade accuracy </li>



<li>Cache eviction strategies discard historical context</li>
</ul>



<p>DeepSeek-V3’s MLA offers an appealing middle ground — significant memory savings with minimal quality loss through a principled compression approach.</p>
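<p>A back-of-the-envelope comparison makes the middle ground concrete. The arithmetic below uses this lesson's toy sizes; the GQA group count (4) and the assumption that MLA caches only the compressed latent plus a shared RoPE key are illustrative choices, not measurements.</p>

```python
# Per-token, per-layer KV cache cost in fp16 bytes, under the stated assumptions
BYTES = 2                               # fp16
H, d_head, r_kv, d_rope, groups = 8, 32, 128, 64, 4

cache = {
    "MHA": 2 * H * d_head * BYTES,      # full K and V for every head
    "MQA": 2 * 1 * d_head * BYTES,      # one K/V head shared by all heads
    "GQA": 2 * groups * d_head * BYTES, # one K/V head per group
    "MLA": (r_kv + d_rope) * BYTES,     # compressed latent + RoPE key
}
for name, b in cache.items():
    print(f"{name}: {b} bytes/token/layer")
```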



<p>For readers interested in diving deeper into KV cache optimization, we recommend exploring the “KV Cache Optimization” series, which covers these techniques in detail, including implementation strategies, benchmarking results, and guidance on choosing the right approach for a given use case.</p>



<p>With MLA implemented, we have addressed one of the primary memory bottlenecks in Transformer inference — the KV cache. Our attention mechanism can now serve longer contexts and more concurrent users within the same hardware budget. In the next lesson, we will address another critical challenge: scaling model capacity efficiently through Mixture of Experts (MoE).</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<div id="pitch" style="padding: 40px; width: 100%; background-color: #F4F6FA;">
	<h3>What's next? We recommend <a target="_blank" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend">PyImageSearch University</a>.</h3>

	<script src="https://fast.wistia.com/embed/medias/kno0cmko2z.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_kno0cmko2z videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img decoding="async" src="https://fast.wistia.com/embed/medias/kno0cmko2z/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>

	<div style="margin-top: 32px; margin-bottom: 32px; ">
		<strong>Course information:</strong><br/>
		86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: April 2026<br/>
		<span style="color: #169FE6;">★★★★★</span> 4.84 (128 Ratings) • 16,000+ Students Enrolled
	</div>

	<p><strong>I strongly believe that if you had the right teacher you could <em>master</em> computer vision and deep learning.</strong></p>

	<p>Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?</p>

	<p>That’s <em>not</em> the case.</p>

	<p>All you need to master computer vision and deep learning is for someone to explain things to you in <em>simple, intuitive</em> terms. <em>And that’s exactly what I do</em>. My mission is to change education and how complex Artificial Intelligence topics are taught.</p>

	<p>If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to <em>successfully</em> and <em>confidently</em> apply computer vision to your work, research, and projects. Join me in computer vision mastery.</p>

	<p><strong>Inside PyImageSearch University you'll find:</strong></p>

	<ul style="margin-left: 0px;">
		<li style="list-style: none;">&check; <strong>86+ courses</strong> on essential computer vision, deep learning, and OpenCV topics</li>
		<li style="list-style: none;">&check; <strong>86 Certificates</strong> of Completion</li>
		<li style="list-style: none;">&check; <strong>115+ hours hours</strong> of on-demand video</li>
		<li style="list-style: none;">&check; <strong>Brand new courses released <em>regularly</em></strong>, ensuring you can keep up with state-of-the-art techniques</li>
		<li style="list-style: none;">&check; <strong>Pre-configured Jupyter Notebooks in Google Colab</strong></li>
		<li style="list-style: none;">&check; Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)</li>
		<li style="list-style: none;">&check; Access to <strong>centralized code repos for <em>all</em> 540+ tutorials</strong> on PyImageSearch</li>
		<li style="list-style: none;">&check; <strong> Easy one-click downloads</strong> for code, datasets, pre-trained models, etc.</li>
		<li style="list-style: none;">&check; <strong>Access</strong> on mobile, laptop, desktop, etc.</li>
	</ul>

	<p style="text-align: center;">
		<a target="_blank" class="button link" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend" style="background-color: #6DC713; border-bottom: none;">Click here to join PyImageSearch University</a>
	</p>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Summary"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Summary">Summary</a></h2>



<p>In this 2nd lesson of our <strong>DeepSeek-V3 from Scratch</strong> series, we dive into the mechanics of <strong>Multi</strong><strong>-H</strong><strong>ead Latent Attention (MLA)</strong> and why it is a crucial innovation for scaling large language models.</p>



<p>We begin by introducing MLA and framing it against the <strong>KV cache memory problem</strong>, a common bottleneck in Transformer architectures. By understanding this challenge, we set the stage for how MLA provides a more efficient solution through compression and smarter attention computation.</p>



<p>We then explore how <strong>low-rank projections</strong> enable MLA to compress key-value representations without losing essential information. This compression is paired with <strong>query compression and RoPE integration</strong>, ensuring that positional encoding remains geometrically consistent while reducing computational overhead.</p>



<p>Together, these techniques rethink the attention mechanism, balancing efficiency and accuracy and making MLA a powerful tool for modern architectures.</p>



<p>Finally, we walk through the <strong>implementation of MLA</strong>, showing how it connects directly to KV cache optimization.</p>



<p>By the end of this lesson, we not only understand the theory but also gain hands-on experience implementing MLA and integrating it into DeepSeek-V3. This practical approach shows how MLA reshapes attention computation, paving the way for more memory-efficient and scalable models.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Citation-Information"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Citation-Information">Citation Information</a></h3>



<p><strong>Mangla, P</strong><strong>. </strong>“Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture,” <em>PyImageSearch</em>, S. Huot, A. Sharma, and P. Thakur, eds., 2026, <a href="https://pyimg.co/scgjl" target="_blank" rel="noreferrer noopener">https://pyimg.co/scgjl</a></p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="classic" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture" data-enlighter-group="5">@incollection{Mangla_2026_build-deepseek-v3-mla-architecture,
  author = {Puneet Mangla},
  title = {{Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/scgjl},
}
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), </strong><em><strong>simply enter your email address in the form below!</strong></em></p>



<div id="download-the-code" class="post-cta-wrap">
<div class="gpd-post-cta">
	<div class="gpd-post-cta-content">
		

			<div class="gpd-post-cta-top">
				<div class="gpd-post-cta-top-image"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1" alt="" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1 410w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=126x174&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=252x348&lossy=2&strip=1&webp=1 252w" sizes="(max-width: 410px) 100vw, 410px" /></div>
				
				<div class="gpd-post-cta-top-title"><h4>Download the Source Code and FREE 17-page Resource Guide</h4></div>
				<div class="gpd-post-cta-top-desc"><p>Enter your email address below to get a .zip of the code and a <strong>FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning.</strong> Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!</p></div>


			</div>

			<div class="gpd-post-cta-bottom">
				<form id="footer-cta-code" class="footer-cta" action="https://www.getdrip.com/forms/4130035/submissions" method="post" target="blank" data-drip-embedded-form="4130035">
					<input name="fields[email]" type="email" value="" placeholder="Your email address" class="form-control" />

					<button type="submit">Download the code!</button>

					<div style="display: none;" aria-hidden="true"><label for="website">Website</label><br /><input type="text" id="website" name="website" tabindex="-1" autocomplete="false" value="" /></div>
				</form>
			</div>


		
	</div>

</div>
</div>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/03/16/build-deepseek-v3-multi-head-latent-attention-mla-architecture/">Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings</title>
		<link>https://pyimagesearch.com/2026/03/09/deepseek-v3-model-theory-config-and-rotary-positional-embeddings/</link>
		
		<dc:creator><![CDATA[Puneet Mangla]]></dc:creator>
		<pubDate>Mon, 09 Mar 2026 12:45:00 +0000</pubDate>
				<category><![CDATA[DeepSeek-V3]]></category>
		<category><![CDATA[KV Cache]]></category>
		<category><![CDATA[MultiHead Latent Attention]]></category>
		<category><![CDATA[RoPE]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[deepseekv3]]></category>
		<category><![CDATA[kv cache]]></category>
		<category><![CDATA[multihead latent attention]]></category>
		<category><![CDATA[tutorial]]></category>
		<guid isPermaLink="false">https://pyimagesearch.com/?p=53125</guid>

					<description><![CDATA[<p>Table of Contents DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings Introduction to the DeepSeek-V3 Model The Four Pillars of DeepSeek-V3 What You Will Build Prerequisites and Setup for Building the DeepSeek-V3 Model Implementing DeepSeek-V3 Model Configuration and RoPE DeepSeek-V3&#8230;</p>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/03/09/deepseek-v3-model-theory-config-and-rotary-positional-embeddings/">DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity" id="h1-DeepSeek-V3-Model-Theory-Config-and-Rotary-Positional-Embeddings"/>




<div class="toc">
<hr class="TOC"/>
<p class="has-large-font-size"><strong>Table of Contents</strong></p>
<ul>
    <li id="TOC-h1-DeepSeek-V3-Model-Theory-Config-and-Rotary-Positional-Embeddings">
        <a rel="noopener" target="_blank" href="#h1-DeepSeek-V3-Model-Theory-Config-and-Rotary-Positional-Embeddings">
            DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings
        </a>
        <ul>
            <li id="TOC-h2-Introduction-to-the-DeepSeek-V3-Model">
                <a rel="noopener" target="_blank" href="#h2-Introduction-to-the-DeepSeek-V3-Model">
                    Introduction to the DeepSeek-V3 Model
                </a>
                <ul>
                    <li id="TOC-h3-The-Four-Pillars-of-DeepSeek-V3">
                        <a rel="noopener" target="_blank" href="#h3-The-Four-Pillars-of-DeepSeek-V3">
                            The Four Pillars of DeepSeek-V3
                        </a>
                    </li>
                    <li id="TOC-h3-What-You-Will-Build">
                        <a rel="noopener" target="_blank" href="#h3-What-You-Will-Build">
                            What You Will Build
                        </a>
                    </li>
                    <li id="TOC-h3-Prerequisites-and-Setup-for-Building-the-DeepSeek-V3-Model">
                        <a rel="noopener" target="_blank" href="#h3-Prerequisites-and-Setup-for-Building-the-DeepSeek-V3-Model">
                            Prerequisites and Setup for Building the DeepSeek-V3 Model
                        </a>
                    </li>
                </ul>
            </li>
            <li id="TOC-h2-Implementing-DeepSeek-V3-Model-Configuration-and-RoPE">
                <a rel="noopener" target="_blank" href="#h2-Implementing-DeepSeek-V3-Model-Configuration-and-RoPE">
                    Implementing DeepSeek-V3 Model Configuration and RoPE
                </a>
                <ul>
                    <li id="TOC-h3-DeepSeek-V3-Model-Parameters-and-Configuration">
                        <a rel="noopener" target="_blank" href="#h3-DeepSeek-V3-Model-Parameters-and-Configuration">
                            DeepSeek-V3 Model Parameters and Configuration
                        </a>
                    </li>
                    <li id="TOC-h3-Rotary-Positional-Embeddings-Geometric-Position-Encoding">
                        <a rel="noopener" target="_blank" href="#h3-Rotary-Positional-Embeddings-Geometric-Position-Encoding">
                            Rotary Positional Embeddings: Geometric Position Encoding
                        </a>
                    </li>
                    <li id="TOC-h3-Implementation-Configuration-and-Rotary-Positional-Embeddings">
                        <a rel="noopener" target="_blank" href="#h3-Implementation-Configuration-and-Rotary-Positional-Embeddings">
                            Implementation: Configuration and Rotary Positional Embeddings
                        </a>
                    </li>
                </ul>
            </li>
            <li id="TOC-h2-Summary">
                <a rel="noopener" target="_blank" href="#h2-Summary">
                    Summary
                </a>
                <ul>
                    <li id="TOC-h3-Citation-Information">
                        <a rel="noopener" target="_blank" href="#h3-Citation-Information">
                            Citation Information
                        </a>
                    </li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Introduction-to-the-DeepSeek-V3-Model"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Introduction-to-the-DeepSeek-V3-Model">Introduction to the DeepSeek-V3 Model</a></h2>



<p>The landscape of large language models has been rapidly evolving, with innovations in architecture, training efficiency, and inference optimization pushing the boundaries of what is possible in natural language processing. The <strong>DeepSeek-V3 model </strong>represents a significant milestone in this evolution, introducing a suite of cutting-edge techniques that address some of the most pressing challenges in modern language model development: </p>



<ul class="wp-block-list">
<li>memory efficiency during inference</li>



<li>computational cost during training</li>



<li>effective capture of long-range dependencies </li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/03/deepseek-v3-model-theory-config-and-rope-featured.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="940" height="780" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-model-theory-config-and-rope-featured.png?lossy=2&strip=1&webp=1" alt="deepseek-v3-model-theory-config-and-rope-featured.png" class="wp-image-53178" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-model-theory-config-and-rope-featured.png?size=126x105&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-model-theory-config-and-rope-featured-300x249.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-model-theory-config-and-rope-featured.png?size=378x314&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-model-theory-config-and-rope-featured.png?size=504x418&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-model-theory-config-and-rope-featured.png?size=630x523&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-model-theory-config-and-rope-featured-768x637.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/deepseek-v3-model-theory-config-and-rope-featured.png?lossy=2&amp;strip=1&amp;webp=1 940w" sizes="(max-width: 630px) 100vw, 630px" /></a></figure></div>


<p>In this comprehensive lesson, we embark on an ambitious journey to build DeepSeek-V3 from scratch, implementing every component from first principles. This isn&#8217;t just another theoretical overview. We will write actual, working code that you can run, modify, and experiment with. By the end of this series, you will have a deep understanding of 4 revolutionary architectural innovations and how they synergistically combine to create a powerful language model.</p>



<p>This lesson is the 1st in a 6-part series on <strong>Building DeepSeek-V3 from Scratch</strong>:</p>



<ol class="wp-block-list">
<li><em><strong><a href="https://pyimg.co/1atre" target="_blank" rel="noreferrer noopener">DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings</a></strong></em> <strong>(this tutorial)</strong></li>



<li><em>Lesson 2</em></li>



<li><em>Lesson 3</em></li>



<li><em>Lesson 4</em></li>



<li><em>Lesson 5</em></li>



<li><em>Lesson 6</em></li>
</ol>



<p><strong>To learn about DeepSeek-V3 and build it from scratch, </strong><em><strong>just keep reading.</strong></em></p>



<div id="pyi-source-code-block" class="source-code-wrap"><div class="gpd-source-code">
    <div class="gpd-source-code-content">
        <img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/source-code-icon.png?lossy=2&strip=1&webp=1" alt="">
        <h4>Looking for the source code to this post?</h4>
                    <a href="#download-the-code" class="pyis-cta-modal-open-modal">Jump Right To The Downloads Section <svg class="svg-icon arrow-right" width="12" height="12" aria-hidden="true" role="img" focusable="false" viewBox="0 0 14 14" fill="none" xmlns="http://www.w3.org/2000/svg"><path d="M6.8125 0.1875C6.875 0.125 6.96875 0.09375 7.09375 0.09375C7.1875 0.09375 7.28125 0.125 7.34375 0.1875L13.875 6.75C13.9375 6.8125 14 6.90625 14 7C14 7.125 13.9375 7.1875 13.875 7.25L7.34375 13.8125C7.28125 13.875 7.1875 13.9062 7.09375 13.9062C6.96875 13.9062 6.875 13.875 6.8125 13.8125L6.1875 13.1875C6.125 13.125 6.09375 13.0625 6.09375 12.9375C6.09375 12.8438 6.125 12.75 6.1875 12.6562L11.0312 7.8125H0.375C0.25 7.8125 0.15625 7.78125 0.09375 7.71875C0.03125 7.65625 0 7.5625 0 7.4375V6.5625C0 6.46875 0.03125 6.375 0.09375 6.3125C0.15625 6.25 0.25 6.1875 0.375 6.1875H11.0312L6.1875 1.34375C6.125 1.28125 6.09375 1.1875 6.09375 1.0625C6.09375 0.96875 6.125 0.875 6.1875 0.8125L6.8125 0.1875Z" fill="#169FE6"></path></svg></a>
            </div>
</div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-The-Four-Pillars-of-DeepSeek-V3"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-The-Four-Pillars-of-DeepSeek-V3">The Four Pillars of DeepSeek-V3</a></h3>



<p><strong>Multihead Latent Attention (MLA):</strong> Traditional Transformer models face a critical bottleneck during inference: the key-value (KV) cache grows linearly with sequence length, consuming massive amounts of memory. For a model with 32 attention heads and a hidden dimension of 4096, storing keys and values across all of its layers for a single 2048-token sequence can exceed 1GB of memory. DeepSeek&#8217;s MLA addresses this by introducing a clever compression-decompression mechanism inspired by Low-Rank Adaptation (LoRA). Instead of storing full key and value matrices, MLA compresses them into a low-rank latent space, achieving up to a 75% reduction in KV cache memory while maintaining model quality. This isn&#8217;t just a theoretical improvement; it translates directly to the ability to serve more concurrent users or process longer contexts with the same hardware (<strong>Figure 1</strong>).</p>
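<p>To make the memory figure concrete, here is a quick back-of-the-envelope calculation. The layer count (32) and fp16 storage (2 bytes per value) are assumptions we add for illustration; they are not specified above:</p>

```python
def kv_cache_bytes(n_layers, seq_len, hidden_dim, bytes_per_value=2):
    """Memory for an uncompressed KV cache: keys + values stored
    for every layer, position, and hidden dimension."""
    return 2 * n_layers * seq_len * hidden_dim * bytes_per_value

# Figures from the text: hidden dimension 4096, sequence of 2048 tokens.
# Assumptions: a 32-layer model stored in fp16 (2 bytes per value).
full = kv_cache_bytes(n_layers=32, seq_len=2048, hidden_dim=4096)
print(f"Full KV cache: {full / 1024**3:.2f} GiB")  # 1.00 GiB
```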



<p><strong>Mixture of Experts (MoE):</strong> The challenge in scaling language models is balancing capacity with computational cost. Simply making models wider and deeper becomes prohibitively expensive. MoE offers an elegant solution: instead of every token passing through the same feedforward network, we create multiple “expert” networks and route each token to only a subset of them. DeepSeek-V3 implements this with a learned routing mechanism that dynamically selects the most relevant experts for each token. With 4 experts and top-2 routing, we effectively quadruple the model&#8217;s capacity while only doubling the computation per token. The routing function learns to specialize different experts for different types of patterns — perhaps one expert becomes good at handling numerical reasoning, another at processing dialogue, and so on.</p>
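<p>A minimal sketch of top-2 routing in PyTorch may help make the mechanism concrete. This is an illustrative toy; the layer names, sizes, and gating details are our assumptions, not DeepSeek-V3&#8217;s actual routing module:</p>

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n_tokens, d_model, n_experts, top_k = 5, 16, 4, 2
x = torch.randn(n_tokens, d_model)             # token representations
router = torch.nn.Linear(d_model, n_experts)   # learned routing function

logits = router(x)                              # (n_tokens, n_experts)
top_vals, expert_idx = logits.topk(top_k, dim=-1)
weights = F.softmax(top_vals, dim=-1)           # renormalize over chosen experts

# Each token is processed only by its top-2 experts; the experts'
# outputs would then be combined using these routing weights.
print(expert_idx)           # which 2 of the 4 experts each token uses
print(weights.sum(dim=-1))  # routing weights sum to 1 per token
```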



<p><strong>Multi-Token Prediction (MTP):</strong> Traditional language models predict one token at a time, receiving a training signal only for the immediate next token. This is somewhat myopic — humans don&#8217;t just think about the very next word; we plan ahead, considering how sentences and paragraphs will unfold. MTP addresses this by training the model to predict multiple future tokens simultaneously. If we are at position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/865/865c0c0b4ab0e063e5caa3387c1a8741-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='i' title='i' class='latex' /> in the sequence, standard training predicts token <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/15a/15ab2d2b0b92c13f328635e5c4bdbe64-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='i+1' title='i+1' class='latex' />. MTP adds auxiliary prediction heads that predict tokens <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/726/726087b8901423d7ce6b5004e1eb1511-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='i+2' title='i+2' class='latex' />, <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6d7/6d74b25929611d29fc89054bd1679d9f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='i+3' title='i+3' class='latex' />, and so on. This provides a richer training signal, encouraging the model to learn better long-range planning and coherence. It is particularly valuable for tasks requiring forward-looking reasoning.</p>
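<p>The training-target construction can be sketched with plain Python lists. The token values and the 2-token-ahead horizon here are illustrative:</p>

```python
# Toy token sequence; at each position i we build targets for the
# standard next-token head (i+1) and one auxiliary MTP head (i+2).
tokens = [10, 11, 12, 13, 14, 15]

inputs    = tokens[:-2]   # positions i
targets_1 = tokens[1:-1]  # next token, i+1
targets_2 = tokens[2:]    # token after that, i+2

print(list(zip(inputs, targets_1, targets_2)))
# [(10, 11, 12), (11, 12, 13), (12, 13, 14), (13, 14, 15)]
```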



<p><strong>Rotary Positional Embeddings (RoPE):</strong> Transformers don&#8217;t inherently understand position — they need explicit positional information. Early approaches used absolute position embeddings, but these struggle with sequences longer than those seen during training. RoPE takes a geometric approach: it rotates query and key vectors in a high-dimensional space, with the rotation angle proportional to the position. This naturally encodes relative position information and exhibits remarkable extrapolation properties. A model trained on 512-token sequences can often handle 2048-token sequences at inference time without degradation.</p>



<p>The combination of these 4 techniques is more than the sum of its parts. MLA reduces memory pressure, allowing us to handle longer contexts or larger batch sizes. MoE increases model capacity without proportional compute increases, making training more efficient. MTP provides richer gradients, accelerating learning and improving model quality. RoPE enables better position understanding and length generalization. Together, they create a model that is efficient to train, efficient to serve, and capable of producing high-quality outputs.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/03/image-6.jpeg" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="975" height="780" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-6.jpeg?lossy=2&strip=1&webp=1" alt="" class="wp-image-53180" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-6.jpeg?size=126x101&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-6-300x240.jpeg?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-6.jpeg?size=378x302&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-6.jpeg?size=504x403&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-6.jpeg?size=630x504&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-6-768x614.jpeg?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-6.jpeg?lossy=2&amp;strip=1&amp;webp=1 975w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 1:</strong> DeepSeek-V3 (source: <a href="https://arxiv.org/pdf/2412.19437" target="_blank" rel="noreferrer noopener">DeepSeek-AI, 2025</a>).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-What-You-Will-Build"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-What-You-Will-Build">What You Will Build</a></h3>



<p>By the end of this series, you will have implemented a working DeepSeek-V3 model trained on the TinyStories dataset — a curated collection of simple children&#8217;s stories. The dataset is ideal for demonstrating core language modeling concepts without requiring massive computational resources. Your model will be able to generate coherent, creative stories in the style of children&#8217;s literature. More importantly, you will understand every line of code, every architectural decision, and every mathematical principle behind the model.</p>



<p>The DeepSeek-V3 model we build uses carefully chosen hyperparameters for educational purposes:</p>



<ul class="wp-block-list">
<li>6 Transformer layers</li>



<li>256-dimensional token embeddings</li>



<li>8 attention heads</li>



<li>4 MoE experts with top-2 routing</li>



<li>2-token-ahead prediction training objective (MTP)</li>
</ul>



<p>These choices balance pedagogical clarity with practical performance: the model is small enough to train on a single GPU in a reasonable time, yet large enough to generate meaningful outputs and demonstrate the key architectural innovations.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Prerequisites-and-Setup-for-Building-the-DeepSeek-V3-Model"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Prerequisites-and-Setup-for-Building-the-DeepSeek-V3-Model">Prerequisites and Setup for Building the DeepSeek-V3 Model</a></h3>



<p>Before we dive in, ensure you have a working Python environment with PyTorch 2.0+, the <code data-enlighter-language="python" class="EnlighterJSRAW">transformers</code> library, and standard scientific computing packages (e.g., <code data-enlighter-language="python" class="EnlighterJSRAW">numpy</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">datasets</code>). A GPU is highly recommended but not required — you can train on a CPU, though it will be slower. The complete code is available as a Jupyter notebook, allowing you to experiment interactively.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings" data-enlighter-group="1"># Install required packages
!pip install -q transformers datasets torch accelerate tensorboard

# Import core libraries
import os
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
from typing import Optional, Tuple, List, Dict
import logging
import json

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device: {torch.device('cuda' if torch.cuda.is_available() else 'cpu')}")
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Implementing-DeepSeek-V3-Model-Configuration-and-RoPE"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Implementing-DeepSeek-V3-Model-Configuration-and-RoPE">Implementing DeepSeek-V3 Model Configuration and RoPE</a></h2>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-DeepSeek-V3-Model-Parameters-and-Configuration"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-DeepSeek-V3-Model-Parameters-and-Configuration">DeepSeek-V3 Model Parameters and Configuration</a></h3>



<p>Before we can build any neural network, we need a systematic way to manage its hyperparameters — the architectural decisions that define the model. In modern deep learning, the configuration pattern has become essential: we encapsulate all hyperparameters in a single, serializable object that can be saved, loaded, and modified independently of the model code. This is not just good software engineering — it is crucial for reproducibility, experimentation, and deployment.</p>



<p>DeepSeek-V3&#8217;s configuration must capture parameters across multiple dimensions. First, there are the standard Transformer parameters:</p>



<ul class="wp-block-list">
<li>vocabulary size <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/520/5206560a306a2e085a437fd258eb57ce-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='V' title='V' class='latex' /></li>



<li>number of Transformer layers <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/d20/d20caec3b48a1eef164cb4ca81ba2587-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='L' title='L' class='latex' /></li>



<li>hidden dimension <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/646/6469a03ebce607f5e9fc3cca520cc84a-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{model}' title='d_\text{model}' class='latex' /></li>



<li>number of attention heads <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/c1d/c1d9f50f86825a1a2302ec2449c17196-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='H' title='H' class='latex' /></li>



<li>maximum context length <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/de6/de69efff479ba0b7962f8f1bddce0e00-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T_\text{max}' title='T_\text{max}' class='latex' /></li>
</ul>



<p>These follow from the canonical Transformer architecture, where the model transforms input sequences through <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/d20/d20caec3b48a1eef164cb4ca81ba2587-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='L' title='L' class='latex' /> layers of self-attention and feedforward processing.</p>



<p>Beyond these basics, we need parameters specific to the DeepSeek-V3 innovations. For MLA, we require the LoRA ranks for key-value compression (<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/fdc/fdc6a99c1f6e297720c7a8fb9c66bfcc-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_{kv}' title='r_{kv}' class='latex' />) and query compression (<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/698/698eda0f93c2b24773206a15cf460703-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_q' title='r_q' class='latex' />), as well as the RoPE dimension (<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e01/e01d64a1065b28df8a4a91cc41e1207e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{rope}' title='d_\text{rope}' class='latex' />). For MoE, we specify the number of experts (<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/4cb/4cb7245d0446256c32b54a119d2c1e64-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='N_\text{experts}' title='N_\text{experts}' class='latex' />), how many to activate per token (<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8ce/8ce4b16b22b58894aa86c421e8759df3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='k' title='k' class='latex' />), and coefficients for auxiliary losses. For MTP, we define how many tokens ahead to predict (<img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/995/99501c9f72b6752d908e52a5add59668-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='n_\text{predict}' title='n_\text{predict}' class='latex' />).</p>



<p>The mathematical relationship between these parameters determines the model&#8217;s computational and memory characteristics. The standard Transformer attention complexity scales as <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/da3/da34d4f396e1acf9baaccfd5a0f031ca-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='O(T^2 \cdot d_\text{model})' title='O(T^2 \cdot d_\text{model})' class='latex' /> for sequence length <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b9e/b9ece18c950afbfa6b0fdbfa4ff731d3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='T' title='T' class='latex' />. With MLA&#8217;s compression, we reduce the KV cache from <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/19c/19c3bafdb12fe38720a68d23257c7e72-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='2 \cdot L \cdot H \cdot d_\text{head} \cdot T' title='2 \cdot L \cdot H \cdot d_\text{head} \cdot T' class='latex' /> to approximately <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/bda/bda9c79a55ba449f41e2b1f49882bc08-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='2 \cdot L \cdot r_{kv} \cdot T' title='2 \cdot L \cdot r_{kv} \cdot T' class='latex' />, where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8a7/8a79ca4aebf2f271ccea6b1e8424a0e1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{head} = d_\text{model} / H' title='d_\text{head} = d_\text{model} / H' class='latex' />. For our chosen parameters with <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/22a/22a4c847a1bb7331479b1cd47f9c51f4-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='r_{kv} = 128' title='r_{kv} = 128' class='latex' /> and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/79d/79d0b8290e3c7cc6a6c914fcecd14969-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d_\text{model} = 256' title='d_\text{model} = 256' class='latex' />, this represents approximately a 50% reduction in KV cache size.</p>
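<p>We can verify that 50% figure directly from the formulas above, using the lesson&#8217;s hyperparameters:</p>

```python
# Per token, per layer: full KV cache vs. MLA's compressed latents.
d_model, n_heads, r_kv = 256, 8, 128
d_head = d_model // n_heads

full_per_token       = 2 * n_heads * d_head  # keys + values
compressed_per_token = 2 * r_kv              # compressed latents

reduction = 1 - compressed_per_token / full_per_token
print(f"KV cache reduction: {reduction:.0%}")  # 50%
```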



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Rotary-Positional-Embeddings-Geometric-Position-Encoding"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Rotary-Positional-Embeddings-Geometric-Position-Encoding">Rotary Positional Embeddings: Geometric Position Encoding</a></h3>



<p>RoPE (<strong>Figure 2</strong>) represents one of the most elegant ideas in modern Transformer research. To understand it, we must first examine why position matters and where earlier approaches had limitations.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/03/image-7.jpeg" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="798" height="731" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-7.jpeg?lossy=2&strip=1&webp=1" alt="" class="wp-image-53182" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-7.jpeg?size=126x115&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-7-300x275.jpeg?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-7.jpeg?size=378x346&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-7.jpeg?size=504x462&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-7.jpeg?size=630x577&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-7-768x704.jpeg?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/image-7.jpeg?lossy=2&amp;strip=1&amp;webp=1 798w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 2:</strong> Rotary Positional Embeddings (source: <a href="https://krasserm.github.io/2022/12/13/rotary-position-embedding/" target="_blank" rel="noreferrer noopener">Krasser, 2022</a>).</figcaption></figure></div>


<p><strong>The Position Problem:</strong> Self-attention mechanisms are permutation-invariant — if we shuffle the input tokens, we get the same output (modulo the shuffling). But language is sequential; &#8220;The cat chased the mouse&#8221; means something very different from &#8220;The mouse chased the cat.&#8221; We need to inject positional information.</p>



<p><strong>Absolute Positional Embeddings:</strong> The original Transformer used sinusoidal positional embeddings: <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/135/1356c73c51db3abb4d73c3bc0cfd4892-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{PE}_{(\text{pos}, 2i)} = \sin(\text{pos} / 10000^{2i/d_\text{model}})' title='\text{PE}_{(\text{pos}, 2i)} = \sin(\text{pos} / 10000^{2i/d_\text{model}})' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/135/1356c73c51db3abb4d73c3bc0cfd4892-ffffff-000000-0.png?lossy=2&strip=1&webp=1 238w,https://b2633864.smushcdn.com/2633864/wp-content/latex/135/1356c73c51db3abb4d73c3bc0cfd4892-ffffff-000000-0.png?size=126x11&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 238px) 100vw, 238px' /> and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/d41/d410a9b5f171fc28f85d3c03e6fd1a33-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{PE}_{(\text{pos}, 2i+1)} = \cos(\text{pos} / 10000^{2i/d_\text{model}})' title='\text{PE}_{(\text{pos}, 2i+1)} = \cos(\text{pos} / 10000^{2i/d_\text{model}})' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/d41/d410a9b5f171fc28f85d3c03e6fd1a33-ffffff-000000-0.png?lossy=2&strip=1&webp=1 255w,https://b2633864.smushcdn.com/2633864/wp-content/latex/d41/d410a9b5f171fc28f85d3c03e6fd1a33-ffffff-000000-0.png?size=126x10&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 255px) 100vw, 255px' />. These are added to input embeddings. Learned absolute positional embeddings are another option. But both struggle with extrapolation — a model trained on sequences up to length 512 often fails when applied to sequences of length 1024.</p>
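<p>A toy implementation of the sinusoidal formula may make its structure easier to see (this helper is for illustration only, not the implementation we use later):</p>

```python
import math

def sinusoidal_pe(pos, d_model):
    """Original Transformer sinusoidal embedding for a single position:
    sin on even indices, cos on odd indices, with decreasing frequency."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        pe[i + 1] = math.cos(angle)
    return pe

# Position 0 gives sin(0) = 0 in even slots and cos(0) = 1 in odd slots.
print(sinusoidal_pe(0, 8))  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```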



<p><strong>Relative Position Approaches:</strong> Some models (e.g., Transformer-XL) use relative positional encodings, explicitly modeling the distance between tokens. This helps with extrapolation but adds computational overhead.</p>



<p><strong>RoPE&#8217;s Geometric Insight:</strong> RoPE takes a different approach, encoding position through rotation in complex space. Consider the attention score between query <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/769/7694f4a66316e53c8cdd9d9954bd611d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='q' title='q' class='latex' /> at position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/6f8/6f8f57715090da2632453988d9a1501b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='m' title='m' class='latex' /> and key <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8ce/8ce4b16b22b58894aa86c421e8759df3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='k' title='k' class='latex' /> at position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/7b8/7b8b965ad4bca0e41ab51de7b31363a1-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='n' title='n' class='latex' />:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/a48/a48d12cb8dd79bed620ffc8c62582193-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{score} = q^T k' title='\text{score} = q^T k' class='latex' /></p>



<p>RoPE modifies this by rotating both <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/769/7694f4a66316e53c8cdd9d9954bd611d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='q' title='q' class='latex' /> and <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8ce/8ce4b16b22b58894aa86c421e8759df3-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='k' title='k' class='latex' /> by angles proportional to their positions:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/75c/75cc35cc9d821c487d8c88c274f90e21-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{score}_\text{rope} = (R_{\theta, m} q)^T (R_{\theta, n} k) = q^T R_{\theta, m}^T R_{\theta, n} k = q^T R_{\theta, n-m} k' title='\text{score}_\text{rope} = (R_{\theta, m} q)^T (R_{\theta, n} k) = q^T R_{\theta, m}^T R_{\theta, n} k = q^T R_{\theta, n-m} k' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/75c/75cc35cc9d821c487d8c88c274f90e21-ffffff-000000-0.png?lossy=2&strip=1&webp=1 400w,https://b2633864.smushcdn.com/2633864/wp-content/latex/75c/75cc35cc9d821c487d8c88c274f90e21-ffffff-000000-0.png?size=126x7&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/75c/75cc35cc9d821c487d8c88c274f90e21-ffffff-000000-0.png?size=252x13&lossy=2&strip=1&webp=1 252w' sizes='(max-width: 400px) 100vw, 400px' /></p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/ba5/ba5384ece070aee57f9d796c0a385c7f-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='R_{\theta, p}' title='R_{\theta, p}' class='latex' /> is the rotation matrix corresponding to position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/838/83878c91171338902e0fe0fb97a8c47a-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='p' title='p' class='latex' />. The key insight: rotation matrices satisfy <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/ef7/ef7777314bbf909b9d46779569a1185d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='R_{\theta, m}^T R_{\theta, n} = R_{\theta, n-m}' title='R_{\theta, m}^T R_{\theta, n} = R_{\theta, n-m}' class='latex' />, so the attention score naturally depends on the relative position <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/88a/88a21e6a3e2ebbd7deb5212b0baa4058-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='n - m' title='n - m' class='latex' /> rather than absolute positions.</p>
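<p>We can check this relative-position property numerically with a single 2D rotation pair (a toy verification, not the full RoPE implementation):</p>

```python
import math

def rotate_2d(v, angle):
    """Rotate a 2-D vector by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.1
q, k = (1.0, 2.0), (3.0, -1.0)

# Rotate q by its position m and k by its position n. Shifting both
# positions by the same offset leaves the score unchanged, because it
# depends only on the relative position n - m.
s1 = dot(rotate_2d(q, theta * 3),  rotate_2d(k, theta * 7))   # m=3,  n=7
s2 = dot(rotate_2d(q, theta * 10), rotate_2d(k, theta * 14))  # m=10, n=14
print(abs(s1 - s2) < 1e-9)  # True
```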



<p>In practice, we implement this with 2D rotations applied to pairs of dimensions. For a <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/827/8277e0910d750195b448797616e091ad-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d' title='d' class='latex' />-dimensional vector, we split it into <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/13d/13dbc000a38a396b099ee29212fa519b-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d/2' title='d/2' class='latex' /> pairs and rotate each pair:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/93b/93b65ff86e15b56b032ecbf3f995b6b6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\begin{bmatrix} q_i \ q_{i+1} \end{bmatrix}&#039; = \begin{bmatrix} \cos(m\theta_i) &amp; -\sin(m\theta_i) \ \sin(m\theta_i) &amp; \cos(m\theta_i) \end{bmatrix} \begin{bmatrix} q_i \ q_{i+1} \end{bmatrix}' title='\begin{bmatrix} q_i \ q_{i+1} \end{bmatrix}&#039; = \begin{bmatrix} \cos(m\theta_i) &amp; -\sin(m\theta_i) \ \sin(m\theta_i) &amp; \cos(m\theta_i) \end{bmatrix} \begin{bmatrix} q_i \ q_{i+1} \end{bmatrix}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/93b/93b65ff86e15b56b032ecbf3f995b6b6-ffffff-000000-0.png?lossy=2&strip=1&webp=1 444w,https://b2633864.smushcdn.com/2633864/wp-content/latex/93b/93b65ff86e15b56b032ecbf3f995b6b6-ffffff-000000-0.png?size=126x6&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/93b/93b65ff86e15b56b032ecbf3f995b6b6-ffffff-000000-0.png?size=252x12&lossy=2&strip=1&webp=1 252w,https://b2633864.smushcdn.com/2633864/wp-content/latex/93b/93b65ff86e15b56b032ecbf3f995b6b6-ffffff-000000-0.png?size=378x19&lossy=2&strip=1&webp=1 378w' sizes='(max-width: 444px) 100vw, 444px' /></p>



<p>where <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/04b/04be01893d412a613fbeaae2fd031953-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\theta_i = 10000^{-2i/d_\text{model}}' title='\theta_i = 10000^{-2i/d_\text{model}}' class='latex' /> follows the same frequency pattern as sinusoidal embeddings. This gives us multiple rotation frequencies, allowing the model to capture both fine-grained and coarse-grained positional relationships.</p>



<p><strong>Why RoPE Extrapolates Well:</strong> The rotation formulation naturally extends to positions beyond training data. If the model learns that a relative position of +5 corresponds to a certain rotation angle, it can apply the same principle to positions beyond its training range. The continuous nature of trigonometric functions means there are no discrete position embeddings that &#8220;run out.&#8221;</p>



<p><strong>RMSNorm: A Modern Normalization Choice:</strong> Before diving into code, we should mention RMSNorm (Root Mean Square Normalization), which DeepSeek uses instead of LayerNorm. While LayerNorm computes:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/e7e/e7e8456a7544b1a81898a1ed0f688db0-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{LayerNorm}(x) = \gamma \dfrac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta' title='\text{LayerNorm}(x) = \gamma \dfrac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/e7e/e7e8456a7544b1a81898a1ed0f688db0-ffffff-000000-0.png?lossy=2&strip=1&webp=1 223w,https://b2633864.smushcdn.com/2633864/wp-content/latex/e7e/e7e8456a7544b1a81898a1ed0f688db0-ffffff-000000-0.png?size=126x19&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 223px) 100vw, 223px' /></p>



<p>RMSNorm simplifies by removing the mean-centering and bias:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/aed/aedd01dfc879dd2ebf5acf00cd7b9872-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{RMSNorm}(x) = \gamma \dfrac{x}{\sqrt{\dfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}' title='\text{RMSNorm}(x) = \gamma \dfrac{x}{\sqrt{\dfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/aed/aedd01dfc879dd2ebf5acf00cd7b9872-ffffff-000000-0.png?lossy=2&strip=1&webp=1 245w,https://b2633864.smushcdn.com/2633864/wp-content/latex/aed/aedd01dfc879dd2ebf5acf00cd7b9872-ffffff-000000-0.png?size=126x30&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 245px) 100vw, 245px' /></p>



<p>This is computationally cheaper and empirically performs just as well for language models. The key insight is that the mean-centering term in LayerNorm may not be necessary for Transformers, where the activations are already roughly centered.</p>
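<p>A minimal RMSNorm module, matching the formula above, looks like this in PyTorch (an illustrative sketch; the version we build later in the series may differ in details):</p>

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale by the root mean square; no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gamma

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 5, 16)
y = RMSNorm(16)(x)
print(y.shape)  # torch.Size([2, 5, 16])
```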



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Implementation-Configuration-and-Rotary-Positional-Embeddings"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Implementation-Configuration-and-Rotary-Positional-Embeddings">Implementation: Configuration and Rotary Positional Embeddings</a></h3>



<p>Now let&#8217;s implement these concepts. We&#8217;ll start with the configuration class:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings" data-enlighter-group="2">import json


@dataclass
class DeepSeekConfig:
    """Configuration for DeepSeek model optimized for children's stories"""
    vocab_size: int = 50259  # GPT-2 vocabulary size + &lt;|story|> + &lt;/|story|> tokens
    n_layer: int = 6         # Number of transformer blocks
    n_head: int = 8          # Number of attention heads
    n_embd: int = 256        # Embedding dimension
    block_size: int = 1024   # Maximum context window
    dropout: float = 0.1     # Dropout rate
    bias: bool = True        # Use bias in linear layers

    # MLA (Multihead Latent Attention) config
    kv_lora_rank: int = 128  # LoRA rank for key-value projection
    q_lora_rank: int = 192   # LoRA rank for query projection
    rope_dim: int = 64       # RoPE dimension

    # MoE (Mixture of Experts) config
    n_experts: int = 4       # Number of experts
    n_experts_per_token: int = 2  # Number of experts per token (top-k)
    expert_intermediate_size: int = 512  # Expert hidden size
    shared_expert_intermediate_size: int = 768  # Shared expert hidden size
    use_shared_expert: bool = True  # Enable shared expert
    aux_loss_weight: float = 0.0  # Auxiliary loss weight (0.0 for aux-free)

    # Multi-token prediction
    multi_token_predict: int = 2  # Predict next 2 tokens

</pre>



<p><strong>Lines 1-5: Configuration Class Structure:</strong> We use Python&#8217;s <code data-enlighter-language="python" class="EnlighterJSRAW">@dataclass</code> decorator to define our <code data-enlighter-language="python" class="EnlighterJSRAW">DeepSeekConfig</code> class, which automatically generates initialization and representation methods. This is more than syntactic sugar: each field carries an explicit type annotation that documents its expected type (Python does not enforce these hints at runtime), and the generated methods include built-in equality comparisons. The configuration serves as a single source of truth for model hyperparameters, making it easy to experiment with different architectures by simply modifying this object.</p>



<p><strong>Lines 7-13: Standard Transformer Parameters:</strong> We define the core Transformer dimensions. The vocabulary size of 50,259 comes from the GPT-2 tokenizer, with two additional custom tokens for story boundaries. We choose 6 layers and a 256-dimensional embedding size as a balance between model capacity and computational cost — this is small enough to train on a single consumer GPU but large enough to demonstrate the key DeepSeek innovations. The block size of 1024 determines the model’s maximum context length, sufficient for coherent short stories. The dropout rate of 0.1 provides regularization without being overly aggressive.</p>
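<p>As a hypothetical back-of-the-envelope check on scale, the token-embedding table alone accounts for roughly 12.9M parameters under this configuration:</p>

```python
vocab_size, n_embd = 50259, 256

# One n_embd-dimensional vector per vocabulary entry
embedding_params = vocab_size * n_embd
print(f"{embedding_params:,}")  # 12,866,304
```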



<p><strong>Lines 16-18: MLA Configuration:</strong> These parameters control our Multihead Latent Attention mechanism. The <code data-enlighter-language="python" class="EnlighterJSRAW">kv_lora_rank</code> of 128 means we compress key-value representations from 256 dimensions down to 128 — a 50% reduction that translates directly to KV cache memory savings. The <code data-enlighter-language="python" class="EnlighterJSRAW">q_lora_rank</code> of 192 provides slightly more capacity for query compression since queries don&#8217;t need to be cached during inference. The <code data-enlighter-language="python" class="EnlighterJSRAW">rope_dim</code> of 64 specifies how many dimensions use RoPE — we don&#8217;t apply RoPE to all dimensions, only to a subset, allowing some dimensions to focus purely on content rather than position.</p>
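<p>A quick sketch of the cache arithmetic (a simplification: it considers only the compressed latent, ignoring the extra RoPE key dimensions that are also kept per token):</p>

```python
n_embd, kv_lora_rank = 256, 128

# The 256-dim hidden state is compressed into a 128-dim latent before caching;
# keys and values are re-derived from that latent at attention time.
reduction = 1 - kv_lora_rank / n_embd
print(f"cached latent is {reduction:.0%} smaller per token")
```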



<p><strong>Lines 21-29: MoE and MTP Configuration:</strong> We configure 4 expert networks with top-2 routing, meaning each token will be processed by exactly 2 out of 4 experts. This gives us 2× more parameters than a standard feedforward layer while maintaining the same computational cost. The <code data-enlighter-language="python" class="EnlighterJSRAW">aux_loss_weight</code> of 0.0 selects the auxiliary-loss-free balancing strategy; setting it to a small positive value (e.g., 0.01) would instead add an explicit penalty on uneven expert usage, the classic mechanism for preventing all tokens from routing to just one or two experts. The <code data-enlighter-language="python" class="EnlighterJSRAW">multi_token_predict</code> parameter determines how many future tokens the model is trained to predict at each step.</p>
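<p>The routing step itself can be sketched with a generic softmax top-k gate (a simplified stand-in for illustration; DeepSeek-V3&#8217;s production router scores and balances experts differently):</p>

```python
import math

def top_k_route(logits, k=2):
    # Softmax over per-expert logits (max-subtracted for numerical stability)
    exps = [math.exp(v - max(logits)) for v in logits]
    denom = sum(exps)
    probs = [e / denom for e in exps]
    # Keep only the k highest-probability experts
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    total = sum(probs[i] for i in top)
    # Renormalize so the selected experts' weights sum to 1
    return {i: probs[i] / total for i in top}

weights = top_k_route([1.2, -0.3, 0.8, 0.1], k=2)
print(weights)  # this token is processed by experts 0 and 2 only
```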



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="31" data-enlighter-title="DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings" data-enlighter-group="3">    def __post_init__(self):
        """Initialize special tokens after dataclass initialization"""
        self.special_tokens = {
            "story_start": "&lt;|story|>",
            "story_end": "&lt;/|story|>",
        }

    def to_dict(self):
        """Convert configuration to dictionary"""
        return {
            'vocab_size': self.vocab_size,
            'n_layer': self.n_layer,
            'n_head': self.n_head,
            'n_embd': self.n_embd,
            'block_size': self.block_size,
            'dropout': self.dropout,
            'bias': self.bias,
            'kv_lora_rank': self.kv_lora_rank,
            'q_lora_rank': self.q_lora_rank,
            'rope_dim': self.rope_dim,
            'n_experts': self.n_experts,
            'n_experts_per_token': self.n_experts_per_token,
            'expert_intermediate_size': self.expert_intermediate_size,
            'shared_expert_intermediate_size': self.shared_expert_intermediate_size,
            'use_shared_expert': self.use_shared_expert,
            'aux_loss_weight': self.aux_loss_weight,
            'multi_token_predict': self.multi_token_predict,
            'special_tokens': self.special_tokens,
        }

    def to_json_string(self, indent=2):
        """Convert configuration to JSON string"""
        return json.dumps(self.to_dict(), indent=indent)

    @classmethod
    def from_dict(cls, config_dict):
        """Create configuration from dictionary"""
        # Remove special_tokens from dict as it's set in __post_init__
        config_dict = {k: v for k, v in config_dict.items() if k != 'special_tokens'}
        return cls(**config_dict)

    @classmethod
    def from_json_string(cls, json_string):
        """Create configuration from JSON string"""
        return cls.from_dict(json.loads(json_string))
</pre>



<p><strong>Lines 31-75: Special Methods for Serialization:</strong> We implement <code data-enlighter-language="python" class="EnlighterJSRAW">__post_init__</code> to add special tokens after initialization, ensuring they&#8217;re always present but not required in the constructor. The <code data-enlighter-language="python" class="EnlighterJSRAW">to_dict</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">to_json_string</code> methods enable easy serialization for saving configurations alongside trained models. The class methods <code data-enlighter-language="python" class="EnlighterJSRAW">from_dict</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">from_json_string</code> provide deserialization, creating a complete round-trip for configuration management. This pattern is essential for reproducibility — we can save a configuration with our trained model and later reconstruct the exact architecture.</p>
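<p>The payoff is a trivially checkable round trip. The sketch below mirrors the same pattern on a hypothetical two-field stand-in (<code>TinyConfig</code> is ours, not part of the post&#8217;s code):</p>

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TinyConfig:
    n_layer: int = 6
    n_embd: int = 256

    def to_json_string(self, indent=2):
        # Serialize all fields to JSON, as DeepSeekConfig.to_json_string does
        return json.dumps(asdict(self), indent=indent)

    @classmethod
    def from_json_string(cls, json_string):
        # Rebuild the config from its JSON form
        return cls(**json.loads(json_string))

cfg = TinyConfig(n_layer=12)
restored = TinyConfig.from_json_string(cfg.to_json_string())
assert restored == cfg  # dataclass-generated __eq__ makes the check one line
```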



<p>Next, we implement the RMSNorm and RoPE modules.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings" data-enlighter-group="4">class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization"""
    def __init__(self, ndim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(ndim))

    def forward(self, x):
        norm = x.norm(dim=-1, keepdim=True) * (x.size(-1) ** -0.5)
        return self.weight * x / (norm + self.eps)

</pre>



<p><strong>RMSNorm Implementation (Lines 1-10):</strong> Our <code data-enlighter-language="python" class="EnlighterJSRAW">RMSNorm</code> class is remarkably simple. In the constructor, we create a learnable <code data-enlighter-language="python" class="EnlighterJSRAW">weight</code> parameter (the <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/ae5/ae539dfcc999c28e25a0f3ae65c1de79-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\gamma' title='\gamma' class='latex' /> in our equations) initialized to ones. In the forward pass, we compute the L2 norm of the input along the feature dimension, multiply by <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/133/1339bb612c6c85b22f5312b00f737c97-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='d^{-0.5}' title='d^{-0.5}' class='latex' /> to get the RMS, and then scale the input by the inverse of this norm (plus epsilon for numerical stability) and multiply by the learned weight parameter. This normalization ensures our activations have unit RMS, helping with training stability and gradient flow.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="12" data-enlighter-title="DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings" data-enlighter-group="5">class RotaryEmbedding(nn.Module):
    """Rotary Positional Embedding (RoPE) for better position understanding"""
    def __init__(self, dim, max_seq_len=2048):
        super().__init__()
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
        self.max_seq_len = max_seq_len

    def forward(self, x, seq_len=None):
        if seq_len is None:
            seq_len = x.shape[-2]

        t = torch.arange(seq_len, device=x.device).type_as(self.inv_freq)
        freqs = torch.outer(t, self.inv_freq)
        cos, sin = freqs.cos(), freqs.sin()
        return cos, sin

def apply_rope(x, cos, sin):
    """Apply rotary position embedding"""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
</pre>



<p><strong>The </strong><code data-enlighter-language="python" class="EnlighterJSRAW">RotaryEmbedding</code><strong> Class (Lines 12-27):</strong> The constructor creates the inverse frequency vector <code data-enlighter-language="python" class="EnlighterJSRAW">inv_freq</code> following the same frequency schedule used in sinusoidal positional embeddings, where each pair of dimensions is assigned a frequency <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/a96/a9602250dfaa233c3a731010eb6d96e6-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\theta_i = 10000^{-2i/d}' title='\theta_i = 10000^{-2i/d}' class='latex' />. We use <code data-enlighter-language="python" class="EnlighterJSRAW">register_buffer</code> rather than a parameter because these frequencies shouldn&#8217;t be learned — they&#8217;re fixed by our positional encoding design. In the forward pass, we create position indices from 0 to <code data-enlighter-language="python" class="EnlighterJSRAW">seq_len</code>, compute the outer product with inverse frequencies (giving us a matrix where entry <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b7f/b7f3ec4bdf57e0f4164d80a9a58e7941-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='(t, i)' title='(t, i)' class='latex' /> is <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/92b/92b5ef845e301bbd691bd5eb19bcfc91-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='t \cdot \theta_i ' title='t \cdot \theta_i ' class='latex' />), and compute the cosine and sine values. The resulting cosine and sine tensors are then broadcast across the batch, head, and sequence dimensions when they are applied to the query and key vectors during attention computation.</p>



<p><strong>The </strong><code data-enlighter-language="python" class="EnlighterJSRAW">apply_rope</code><strong> Function (Lines 29-32):</strong> This elegant function applies the 2D rotation. We chunk the input into pairs of dimensions (effectively treating each pair of dimensions as the real and imaginary components of a complex number). We then apply the rotation formula: </p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/5a8/5a86c9adb62d363939512cb68e326152-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='(x_1^\prime, x_2^\prime) = (x_1 \cos \theta - x_2 \sin \theta, x_1 \sin \theta + x_2 \cos \theta).' title='(x_1^\prime, x_2^\prime) = (x_1 \cos \theta - x_2 \sin \theta, x_1 \sin \theta + x_2 \cos \theta).' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/5a8/5a86c9adb62d363939512cb68e326152-ffffff-000000-0.png?lossy=2&strip=1&webp=1 337w,https://b2633864.smushcdn.com/2633864/wp-content/latex/5a8/5a86c9adb62d363939512cb68e326152-ffffff-000000-0.png?size=126x7&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/5a8/5a86c9adb62d363939512cb68e326152-ffffff-000000-0.png?size=252x14&lossy=2&strip=1&webp=1 252w' sizes='(max-width: 337px) 100vw, 337px' /> </p>



<p>The chunking operation splits along the last dimension. We compute each rotated component and then concatenate them back together. This vectorized implementation is far more efficient than iterating over dimension pairs in Python. </p>
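<p>Why this rotation encodes <em>relative</em> position can be verified numerically. The sketch below is a hypothetical standalone check on a single 2D dimension pair, independent of the module above: the attention score between a rotated query and key depends only on the position offset.</p>

```python
import math

def rotate(v, theta):
    # 2D rotation of one (x1, x2) dimension pair by angle theta
    c, s = math.cos(theta), math.sin(theta)
    return (v[0] * c - v[1] * s, v[0] * s + v[1] * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (0.3, -0.7), (1.1, 0.4)
theta = 0.1   # the per-position angle for this dimension pair
m, n = 9, 4   # query and key positions

score = dot(rotate(q, m * theta), rotate(k, n * theta))
# Shifting both positions by the same amount leaves the score unchanged ...
shifted = dot(rotate(q, (m + 3) * theta), rotate(k, (n + 3) * theta))
# ... because the score depends only on the offset (m - n)
relative = dot(rotate(q, (m - n) * theta), k)
print(abs(score - shifted) < 1e-9, abs(score - relative) < 1e-9)  # True True
```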



<p><strong>Design Choices and Tradeoffs:</strong> Several decisions merit discussion. We chose partial RoPE (<code data-enlighter-language="python" class="EnlighterJSRAW">rope_dim=64</code> rather than full <code data-enlighter-language="python" class="EnlighterJSRAW">n_embd=256</code>) because empirical research shows that applying RoPE to all dimensions can sometimes hurt performance — some dimensions benefit from remaining content-focused rather than encoding position. Our LoRA ranks are fairly high (128 and 192) relative to the 256-dimensional embeddings; in larger models, the compression ratio would be more aggressive. The special tokens pattern (<code data-enlighter-language="python" class="EnlighterJSRAW">story_start</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">story_end</code>) provides explicit boundaries that help the model learn story structure — it knows when a generation should terminate.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<div id="pitch" style="padding: 40px; width: 100%; background-color: #F4F6FA;">
	<h3>What's next? We recommend <a target="_blank" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend">PyImageSearch University</a>.</h3>

	<script src="https://fast.wistia.com/embed/medias/kno0cmko2z.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_kno0cmko2z videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img decoding="async" src="https://fast.wistia.com/embed/medias/kno0cmko2z/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>

	<div style="margin-top: 32px; margin-bottom: 32px; ">
		<strong>Course information:</strong><br/>
		86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: April 2026<br/>
		<span style="color: #169FE6;">★★★★★</span> 4.84 (128 Ratings) • 16,000+ Students Enrolled
	</div>

	<p><strong>I strongly believe that if you had the right teacher you could <em>master</em> computer vision and deep learning.</strong></p>

	<p>Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?</p>

	<p>That’s <em>not</em> the case.</p>

	<p>All you need to master computer vision and deep learning is for someone to explain things to you in <em>simple, intuitive</em> terms. <em>And that’s exactly what I do</em>. My mission is to change education and how complex Artificial Intelligence topics are taught.</p>

	<p>If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to <em>successfully</em> and <em>confidently</em> apply computer vision to your work, research, and projects. Join me in computer vision mastery.</p>

	<p><strong>Inside PyImageSearch University you'll find:</strong></p>

	<ul style="margin-left: 0px;">
		<li style="list-style: none;">&check; <strong>86+ courses</strong> on essential computer vision, deep learning, and OpenCV topics</li>
		<li style="list-style: none;">&check; <strong>86 Certificates</strong> of Completion</li>
		<li style="list-style: none;">&check; <strong>115+ hours hours</strong> of on-demand video</li>
		<li style="list-style: none;">&check; <strong>Brand new courses released <em>regularly</em></strong>, ensuring you can keep up with state-of-the-art techniques</li>
		<li style="list-style: none;">&check; <strong>Pre-configured Jupyter Notebooks in Google Colab</strong></li>
		<li style="list-style: none;">&check; Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)</li>
		<li style="list-style: none;">&check; Access to <strong>centralized code repos for <em>all</em> 540+ tutorials</strong> on PyImageSearch</li>
		<li style="list-style: none;">&check; <strong> Easy one-click downloads</strong> for code, datasets, pre-trained models, etc.</li>
		<li style="list-style: none;">&check; <strong>Access</strong> on mobile, laptop, desktop, etc.</li>
	</ul>

	<p style="text-align: center;">
		<a target="_blank" class="button link" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend" style="background-color: #6DC713; border-bottom: none;">Click here to join PyImageSearch University</a>
	</p>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Summary"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Summary">Summary</a></h2>



<p>In this blog, we walk through the foundations of <strong>DeepSeek-V3</strong>, starting with its theoretical underpinnings and the four pillars that shape its architecture. We explore why these pillars matter, how they guide the design of the model, and what we aim to build by the end of the lesson. By laying out the prerequisites and setup, we ensure that we’re equipped with the right tools and mindset before diving into the implementation details.</p>



<p>Next, we focus on the <strong>model configuration</strong>, where we break down the essential parameters that define DeepSeek-V3’s behavior. We discuss how these configurations influence performance, scalability, and adaptability, and why they are critical for building a robust model. Alongside this, we introduce <strong>Rotary Positional Embeddings (RoPE)</strong>, a geometric approach to positional encoding that enhances the model’s ability to capture sequential information with precision.</p>



<p>Finally, we bring theory into practice by implementing both the configuration and RoPE step by step. We highlight how these components integrate seamlessly, forming the backbone of DeepSeek-V3. By the end, we not only understand the theoretical aspects but also gain hands-on experience in building and customizing the model. Together, these steps demystify the process and set the stage for deeper experimentation with advanced Transformer architectures.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Citation-Information"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Citation-Information">Citation Information</a></h3>



<p><strong>Mangla, P.</strong> “DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings,” <em>PyImageSearch</em>, S. Huot, A. Sharma, and P. Thakur, eds., 2026, <a href="https://pyimg.co/1atre" target="_blank" rel="noreferrer noopener">https://pyimg.co/1atre</a> </p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="classic" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings" data-enlighter-group="6">@incollection{Mangla_2026_deepseek-v3-model-theory-config-and-rotary-positional-embeddings,
  author = {Puneet Mangla},
  title = {{DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/1atre},
}
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), </strong><em><strong>simply enter your email address in the form below!</strong></em></p>



<div id="download-the-code" class="post-cta-wrap">
<div class="gpd-post-cta">
	<div class="gpd-post-cta-content">
		

			<div class="gpd-post-cta-top">
				<div class="gpd-post-cta-top-image"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1" alt="" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1 410w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=126x174&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=252x348&lossy=2&strip=1&webp=1 252w" sizes="(max-width: 410px) 100vw, 410px" /></div>
				
				<div class="gpd-post-cta-top-title"><h4>Download the Source Code and FREE 17-page Resource Guide</h4></div>
				<div class="gpd-post-cta-top-desc"><p>Enter your email address below to get a .zip of the code and a <strong>FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning.</strong> Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!</p></div>


			</div>

			<div class="gpd-post-cta-bottom">
				<form id="footer-cta-code" class="footer-cta" action="https://www.getdrip.com/forms/4130035/submissions" method="post" target="blank" data-drip-embedded-form="4130035">
					<input name="fields[email]" type="email" value="" placeholder="Your email address" class="form-control" />

					<button type="submit">Download the code!</button>

					<div style="display: none;" aria-hidden="true"><label for="website">Website</label><br /><input type="text" id="website" name="website" tabindex="-1" autocomplete="false" value="" /></div>
				</form>
			</div>


		
	</div>

</div>
</div>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/03/09/deepseek-v3-model-theory-config-and-rotary-positional-embeddings/">DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>SAM 3 for Video: Concept-Aware Segmentation and Object Tracking</title>
		<link>https://pyimagesearch.com/2026/03/02/sam-3-for-video-concept-aware-segmentation-and-object-tracking/</link>
		
		<dc:creator><![CDATA[Piyush Thakur]]></dc:creator>
		<pubDate>Mon, 02 Mar 2026 13:45:00 +0000</pubDate>
				<category><![CDATA[Computer Vision]]></category>
		<category><![CDATA[Detection]]></category>
		<category><![CDATA[SAM3]]></category>
		<category><![CDATA[Segmentation]]></category>
		<category><![CDATA[Tracking]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[computer vision]]></category>
		<category><![CDATA[concept-aware segmentation]]></category>
		<category><![CDATA[detection]]></category>
		<category><![CDATA[gradio app]]></category>
		<category><![CDATA[hugging face transformers]]></category>
		<category><![CDATA[multi-object tracking]]></category>
		<category><![CDATA[object tracking]]></category>
		<category><![CDATA[pytorch]]></category>
		<category><![CDATA[sam3]]></category>
		<category><![CDATA[single-click tracking]]></category>
		<category><![CDATA[streaming inference]]></category>
		<category><![CDATA[text-prompt tracking]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[video segmentation]]></category>
		<category><![CDATA[video tracking]]></category>
		<category><![CDATA[webcam segmentation]]></category>
		<guid isPermaLink="false">https://pyimagesearch.com/?p=52983</guid>

					<description><![CDATA[<p>Table of Contents SAM 3 for Video: Concept-Aware Segmentation and Object Tracking Configuring Your Development Environment Setup and Imports Text-Prompt Video Tracking Load the SAM3 Video Model Helper Function: Visualizing Video Segmentation Masks, Bounding Boxes, and Tracking IDs Main Pipeline:&#8230;</p>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/03/02/sam-3-for-video-concept-aware-segmentation-and-object-tracking/">SAM 3 for Video: Concept-Aware Segmentation and Object Tracking</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity" id="TOC"/>




<script src="https://fast.wistia.com/embed/medias/sczi5z84gj.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_sczi5z84gj seo=true videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img decoding="async" src="https://fast.wistia.com/embed/medias/sczi5z84gj/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>



<div class="toc">
  <hr class="TOC"/>
  <p class="has-large-font-size"><strong>Table of Contents</strong></p>

  <ul>
    <li id="TOC-h1-SAM-3-for-Video-Concept-Aware-Segmentation-and-Object-Tracking">
      <a rel="noopener" target="_blank" href="#h1-SAM-3-for-Video-Concept-Aware-Segmentation-and-Object-Tracking">SAM 3 for Video: Concept-Aware Segmentation and Object Tracking</a>
    </li>

    <li id="TOC-h2-Configuring-Your-Development-Environment">
      <a rel="noopener" target="_blank" href="#h2-Configuring-Your-Development-Environment">Configuring Your Development Environment</a>
    </li>

    <li id="TOC-h2-Setup-and-Imports">
      <a rel="noopener" target="_blank" href="#h2-Setup-and-Imports">Setup and Imports</a>
    </li>

    <li id="TOC-h2-Text-Prompt-Video-Tracking">
      <a rel="noopener" target="_blank" href="#h2-Text-Prompt-Video-Tracking">Text-Prompt Video Tracking</a>
    </li>
    <ul>
      <li id="TOC-h3-Load-the-SAM3-Video-Model">
        <a rel="noopener" target="_blank" href="#h3-Load-the-SAM3-Video-Model">Load the SAM3 Video Model</a>
      </li>
      <li id="TOC-h3-Helper-Function-Visualizing-Video-Segmentation-Masks-Bounding-Boxes-and-Tracking-IDs">
        <a rel="noopener" target="_blank" href="#h3-Helper-Function-Visualizing-Video-Segmentation-Masks-Bounding-Boxes-and-Tracking-IDs">Helper Function: Visualizing Video Segmentation Masks, Bounding Boxes, and Tracking IDs</a>
      </li>
      <li id="TOC-h3-Main-Pipeline-Running-the-Full-Video-Segmentation-and-Tracking-Workflow">
        <a rel="noopener" target="_blank" href="#h3-Main-Pipeline-Running-the-Full-Video-Segmentation-and-Tracking-Workflow">Main Pipeline: Running the Full Video Segmentation and Tracking Workflow</a>
      </li>
      <li id="TOC-h3-Launch-the-Gradio-Application">
        <a rel="noopener" target="_blank" href="#h3-Launch-the-Gradio-Application">Launch the Gradio Application</a>
      </li>
      <li id="TOC-h3-Output-Text-Prompt-Video-Segmentation-and-Tracking-Results">
        <a rel="noopener" target="_blank" href="#h3-Output-Text-Prompt-Video-Segmentation-and-Tracking-Results">Output: Text-Prompt Video Segmentation and Tracking Results</a>
      </li>
    </ul>

    <li id="TOC-h2-Real-Time-Text-Prompt-Tracking-Webcam">
      <a rel="noopener" target="_blank" href="#h2-Real-Time-Text-Prompt-Tracking-Webcam">Real-Time Text-Prompt Tracking (Webcam)</a>
    </li>
    <ul>
      <li id="TOC-h3-Helper-Function-Stable-Color-Overlays-for-Real-Time-Video-Tracking">
        <a rel="noopener" target="_blank" href="#h3-Helper-Function-Stable-Color-Overlays-for-Real-Time-Video-Tracking">Helper Function: Stable Color Overlays for Real-Time Video Tracking</a>
      </li>
      <li id="TOC-h3-Streaming-Inference-Function-Maintaining-Temporal-Memory-Across-Live-Video-Frames">
        <a rel="noopener" target="_blank" href="#h3-Streaming-Inference-Function-Maintaining-Temporal-Memory-Across-Live-Video-Frames">Streaming Inference Function: Maintaining Temporal Memory Across Live Video Frames</a>
      </li>
      <li id="TOC-h3-Launch-the-Gradio-Application-2">
        <a rel="noopener" target="_blank" href="#h3-Launch-the-Gradio-Application-2">Launch the Gradio Application</a>
      </li>
      <li id="TOC-h3-Output-Real-Time-Webcam-Video-Segmentation-Results">
        <a rel="noopener" target="_blank" href="#h3-Output-Real-Time-Webcam-Video-Segmentation-Results">Output: Real-Time Webcam Video Segmentation Results</a>
      </li>
    </ul>

    <li id="TOC-h2-Single-Click-Object-Tracking">
      <a rel="noopener" target="_blank" href="#h2-Single-Click-Object-Tracking">Single-Click Object Tracking</a>
    </li>
    <ul>
      <li id="TOC-h3-Load-the-SAM3-Tracker-Video-Model">
        <a rel="noopener" target="_blank" href="#h3-Load-the-SAM3-Tracker-Video-Model">Load the SAM3 Tracker Video Model</a>
      </li>
      <li id="TOC-h3-Extract-First-Frame-Preparing-the-Initial-Frame-for-Object-Selection">
        <a rel="noopener" target="_blank" href="#h3-Extract-First-Frame-Preparing-the-Initial-Frame-for-Object-Selection">Extract First Frame: Preparing the Initial Frame for Object Selection</a>
      </li>
      <li id="TOC-h3-Tracking-Object-Function-Propagating-a-Single-Object-Mask-Across-Video-Frames">
        <a rel="noopener" target="_blank" href="#h3-Tracking-Object-Function-Propagating-a-Single-Object-Mask-Across-Video-Frames">Tracking Object Function: Propagating a Single Object Mask Across Video Frames</a>
      </li>
      <li id="TOC-h3-Launch-the-Gradio-Application-3">
        <a rel="noopener" target="_blank" href="#h3-Launch-the-Gradio-Application-3">Launch the Gradio Application</a>
      </li>
      <li id="TOC-h3-Output-Single-Click-Video-Object-Tracking-Results">
        <a rel="noopener" target="_blank" href="#h3-Output-Single-Click-Video-Object-Tracking-Results">Output: Single-Click Video Object Tracking Results</a>
      </li>
    </ul>

    <li id="TOC-h2-Multi-Click-Object-Tracking">
      <a rel="noopener" target="_blank" href="#h2-Multi-Click-Object-Tracking">Multi-Click Object Tracking</a>
    </li>
    <ul>
      <li id="TOC-h3-Initialize-Few-Colors-Defining-a-Color-Palette-for-Multi-Object-Tracking-Visualization">
        <a rel="noopener" target="_blank" href="#h3-Initialize-Few-Colors-Defining-a-Color-Palette-for-Multi-Object-Tracking-Visualization">Initialize Few Colors: Defining a Color Palette for Multi-Object Tracking Visualization</a>
      </li>
      <li id="TOC-h3-Extract-First-Frame-Preparing-the-First-Frame-for-Multi-Object-Selection">
        <a rel="noopener" target="_blank" href="#h3-Extract-First-Frame-Preparing-the-First-Frame-for-Multi-Object-Selection">Extract First Frame: Preparing the First Frame for Multi-Object Selection</a>
      </li>
      <li id="TOC-h3-Tracking-Object-Function-Tracking-Multiple-Objects-with-Unique-IDs-Across-Video-Frames">
        <a rel="noopener" target="_blank" href="#h3-Tracking-Object-Function-Tracking-Multiple-Objects-with-Unique-IDs-Across-Video-Frames">Tracking Object Function: Tracking Multiple Objects with Unique IDs Across Video Frames</a>
      </li>
      <li id="TOC-h3-Launch-the-Gradio-Application-4">
        <a rel="noopener" target="_blank" href="#h3-Launch-the-Gradio-Application-4">Launch the Gradio Application</a>
      </li>
      <li id="TOC-h3-Output-Multi-Object-Video-Segmentation-and-Tracking-Results">
        <a rel="noopener" target="_blank" href="#h3-Output-Multi-Object-Video-Segmentation-and-Tracking-Results">Output: Multi-Object Video Segmentation and Tracking Results</a>
      </li>
    </ul>

    <li id="TOC-h2-Summary">
      <a rel="noopener" target="_blank" href="#h2-Summary">Summary</a>
    </li>
    <ul>
      <li id="TOC-h3-Citation-Information">
        <a rel="noopener" target="_blank" href="#h3-Citation-Information">Citation Information</a>
      </li>
    </ul>
  </ul>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h1-SAM-3-for-Video-Concept-Aware-Segmentation-and-Object-Tracking"/>



<h2 class="wp-block-heading"><a href="#TOC-h1-SAM-3-for-Video-Concept-Aware-Segmentation-and-Object-Tracking">SAM 3 for Video: Concept-Aware Segmentation and Object Tracking</a></h2>



<p>In <a href="https://pyimg.co/uming" target="_blank" rel="noreferrer noopener">Part 1</a> of this series, we introduced Segment Anything Model 3 (SAM 3) and saw how it moves beyond geometric prompts to concept-based visual understanding. We learned how the model can segment <em>all instances</em> of a concept using natural language and visual examples.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/03/sam-3-sam3-video-concept-aware-segmentation-object-tracking-featured.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="940" height="780" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/sam-3-sam3-video-concept-aware-segmentation-object-tracking-featured.png?lossy=2&strip=1&webp=1" alt="sam-3-sam3-video-concept-aware-segmentation-object-tracking-featured.png" class="wp-image-53016" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/sam-3-sam3-video-concept-aware-segmentation-object-tracking-featured.png?size=126x105&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/sam-3-sam3-video-concept-aware-segmentation-object-tracking-featured-300x249.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/sam-3-sam3-video-concept-aware-segmentation-object-tracking-featured.png?size=378x314&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/sam-3-sam3-video-concept-aware-segmentation-object-tracking-featured.png?size=504x418&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/sam-3-sam3-video-concept-aware-segmentation-object-tracking-featured.png?size=630x523&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/sam-3-sam3-video-concept-aware-segmentation-object-tracking-featured-768x637.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/03/sam-3-sam3-video-concept-aware-segmentation-object-tracking-featured.png?lossy=2&amp;strip=1&amp;webp=1 940w" sizes="(max-width: 630px) 100vw, 630px" /></a></figure></div>


<p>In <a href="https://pyimg.co/5c4ag" target="_blank" rel="noreferrer noopener">Part 2</a>, we went one step further. We explored multi-modal prompting and interactive workflows. We combined text, bounding boxes, and point clicks to build precise and controllable segmentation pipelines on images.</p>



<p>So far, however, everything we have done lives in a <strong>static world</strong>.</p>



<p>Images do not move. Objects do not disappear. There is no notion of time.</p>



<p>Video changes everything.</p>



<p>In videos, segmentation alone is not enough. We also need <strong>temporal consistency</strong>. If a person appears in frame 1 and walks across the scene, the model must not only segment that person — it must also <strong>remember</strong> that it is the same person in frame 200.</p>



<p>This is where SAM3 becomes fundamentally different from previous systems.</p>



<p>SAM3 does not treat video as a bag of independent images. Instead, it maintains a <strong>streaming memory</strong> and a <strong>tracking state</strong> that allows it to propagate object identities across frames. Detection, segmentation, and tracking are no longer separate steps. They are part of a single, unified pipeline.</p>



<p>In other words, SAM3 does not just answer:     “Where is the object in this frame?”</p>



<p>It answers:     “Where is this concept over time?”</p>



<p>In this tutorial, we will focus entirely on <strong>using SAM3 with video</strong>. We will build several practical pipelines that combine detection, segmentation, and tracking into one coherent workflow.</p>



<p>Specifically, we will implement 4 tasks:</p>



<ul class="wp-block-list">
<li><strong>Video Detection, Segmentation, and Tracking Using a Text Prompt<br></strong>Here, we use a text prompt such as <code data-enlighter-language="python" class="EnlighterJSRAW">"person"</code> and let SAM3 detect, segment, and track all instances of that concept throughout the video.</li>



<li><strong>Real-Time Detection, Segmentation, and Tracking Using a Text Prompt via Webcam<br></strong>This is the same idea, but running in real time on a live camera stream.</li>



<li><strong>Detection, Segmentation, and Tracking Using a Single Click on an Object<br></strong>Here, we do not use text. We simply click on an object in the first frame and let SAM3 track it.</li>



<li><strong>Detection, Segmentation, and Tracking Using Multiple Clicks on Objects<br></strong>In this case, we select multiple objects interactively and track all of them at once.</li>
</ul>



<p>Across these examples, we will see the same core idea again and again:</p>



<ul class="wp-block-list">
<li>First, SAM3 recognizes what to track.</li>



<li>Then, it segments it.</li>



<li>Finally, it remembers and propagates it through time.</li>
</ul>



<p>This lesson is the 3rd of a 4-part series on <strong>SAM 3</strong>:</p>



<ol class="wp-block-list">
<li><em><strong><a href="https://pyimg.co/uming" target="_blank" rel="noreferrer noopener">SAM 3: Concept-Based Visual Understanding and Segmentation</a></strong></em></li>



<li><em><strong><a href="https://pyimg.co/5c4ag" target="_blank" rel="noreferrer noopener">Advanced SAM 3: Multi-Modal Prompting and Interactive Segmentation</a></strong></em></li>



<li><em><strong><a href="https://pyimg.co/luxfd" target="_blank" rel="noreferrer noopener">SAM 3 for Video: Concept-Aware Segmentation and Object Tracking</a></strong></em> <strong>(this tutorial)</strong></li>



<li><em>Lesson 4</em></li>
</ol>



<p><strong>To learn how to build SAM3-powered video segmentation and tracking pipelines — using text prompts, real-time webcam streams, and interactive multi-object tracking inside a dynamic Gradio interface, </strong><em><strong>just keep reading.</strong></em></p>



<div id="pyi-source-code-block" class="source-code-wrap"><div class="gpd-source-code">
    <div class="gpd-source-code-content">
        <img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/source-code-icon.png?lossy=2&strip=1&webp=1" alt="">
        <h4>Looking for the source code to this post?</h4>
                    <a href="#download-the-code" class="pyis-cta-modal-open-modal">Jump Right To The Downloads Section <svg class="svg-icon arrow-right" width="12" height="12" aria-hidden="true" role="img" focusable="false" viewBox="0 0 14 14" fill="none" xmlns="http://www.w3.org/2000/svg"><path d="M6.8125 0.1875C6.875 0.125 6.96875 0.09375 7.09375 0.09375C7.1875 0.09375 7.28125 0.125 7.34375 0.1875L13.875 6.75C13.9375 6.8125 14 6.90625 14 7C14 7.125 13.9375 7.1875 13.875 7.25L7.34375 13.8125C7.28125 13.875 7.1875 13.9062 7.09375 13.9062C6.96875 13.9062 6.875 13.875 6.8125 13.8125L6.1875 13.1875C6.125 13.125 6.09375 13.0625 6.09375 12.9375C6.09375 12.8438 6.125 12.75 6.1875 12.6562L11.0312 7.8125H0.375C0.25 7.8125 0.15625 7.78125 0.09375 7.71875C0.03125 7.65625 0 7.5625 0 7.4375V6.5625C0 6.46875 0.03125 6.375 0.09375 6.3125C0.15625 6.25 0.25 6.1875 0.375 6.1875H11.0312L6.1875 1.34375C6.125 1.28125 6.09375 1.1875 6.09375 1.0625C6.09375 0.96875 6.125 0.875 6.1875 0.8125L6.8125 0.1875Z" fill="#169FE6"></path></svg></a>
            </div>
</div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>Would you like immediate access to 3,457 images curated and labeled with hand gestures to train, explore, and experiment with &#8230; for free? Head over to <a href="https://universe.roboflow.com/isl/az-6mqow?ref=pyimagesearch" target="_blank" rel="noreferrer noopener">Roboflow</a> and get a free account to grab these hand gesture images. </p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Configuring-Your-Development-Environment"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Configuring-Your-Development-Environment">Configuring Your Development Environment</a></h2>



<p>To follow this guide, you need to have the following libraries installed on your system.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="1">!pip install -q git+https://github.com/huggingface/transformers av gradio
</pre>



<p>First, we install the <strong>latest development version of </strong><code data-enlighter-language="python" class="EnlighterJSRAW">transformers</code> <strong>directly from GitHub</strong>.</p>



<p>We do this because SAM3 is very new. Its video processors, tracker models, and streaming APIs are not yet available in older stable releases. Installing from GitHub ensures we get access to:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">Sam3VideoProcessor</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">Sam3VideoModel</code> for concept-based segmentation</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">Sam3TrackerVideoProcessor</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">Sam3TrackerVideoModel</code> for video tracking</li>



<li>The video session and streaming inference utilities</li>
</ul>



<p>In short, this gives us <strong>all SAM3 image and video capabilities</strong> in one place.</p>



<p>Next, we install <code data-enlighter-language="python" class="EnlighterJSRAW">av</code>. This is a Python binding for FFmpeg. We use it to decode video files, read frames efficiently, and handle video streams coming from disk or webcam. Without <code data-enlighter-language="python" class="EnlighterJSRAW">av</code>, working with video frames would be slow and unreliable.</p>



<p>Finally, we install <code data-enlighter-language="python" class="EnlighterJSRAW">gradio</code>. We use Gradio to build interactive demos, run webcam-based real-time segmentation, and create simple UI components for clicking on objects and visualizing results. This allows us to turn SAM3 into a <strong>live, interactive video application</strong>, not just a notebook script.</p>



<p>We also pass the <code data-enlighter-language="python" class="EnlighterJSRAW">-q</code> flag to keep the installation output quiet. This keeps our notebook clean and easy to read.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<!-- wp:paragraph -->
<h3>Need Help Configuring Your Development Environment?</h3>
<!-- /wp:paragraph -->

<!-- wp:image {"align":"center","id":18137,"sizeSlug":"large","linkDestination":"custom"} -->
<figure class="wp-block-image aligncenter size-large"><a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank" rel="noreferrer noopener"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-18137" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?lossy=2&strip=1&webp=1 500w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=126x84&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=252x168&lossy=2&strip=1&webp=1 252w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=378x253&lossy=2&strip=1&webp=1 378w" sizes="(max-width: 500px) 100vw, 500px" /></a><figcaption>Having trouble configuring your development environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? Be sure to join <a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">PyImageSearch University</a> — you will be up and running with this tutorial in a matter of minutes. </figcaption></figure>
<!-- /wp:image -->

<!-- wp:paragraph -->
<p>All that said, are you:</p>
<!-- /wp:paragraph -->

<!-- wp:list -->
<ul><li>Short on time?</li><li>Learning on your employer’s administratively locked system?</li><li>Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?</li><li><strong>Ready to run the code immediately on your Windows, macOS, or Linux system?</strong></li></ul>
<!-- /wp:list -->

<!-- wp:paragraph -->
<p>Then join <a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank">PyImageSearch University</a> today!</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p><strong>Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides pre-configured to run on Google Colab’s ecosystem right in your web browser!</strong> No installation required.</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!</p>
<!-- /wp:paragraph -->



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Setup-and-Imports"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Setup-and-Imports">Setup and Imports</a></h2>



<p>Once the dependencies are installed, we can import the libraries we need for video processing, model execution, and interactive demos.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="2">import cv2
import torch
import numpy as np
import gradio as gr

from PIL import Image
from accelerate import Accelerator
from transformers.video_utils import load_video
from transformers import Sam3VideoModel, Sam3VideoProcessor
from transformers import Sam3TrackerVideoModel, Sam3TrackerVideoProcessor
</pre>



<p>First, we import <strong>OpenCV</strong> (<code data-enlighter-language="python" class="EnlighterJSRAW">cv2</code>). We use OpenCV to read frames from videos and webcams, convert between different image formats, and perform basic image and video I/O operations. This forms the low-level video input layer of our system.</p>



<p>Next, we import <strong>PyTorch</strong> (<code data-enlighter-language="python" class="EnlighterJSRAW">torch</code>). PyTorch is the backend that runs SAM3. We use it to move models and tensors to GPU or CPU, run inference efficiently, and control precision and memory usage. All heavy computation in this tutorial happens inside PyTorch.</p>



<p>We also import <strong>NumPy</strong> (<code data-enlighter-language="python" class="EnlighterJSRAW">numpy</code>). NumPy is used for simple array manipulations, converting between OpenCV images, PIL images, and tensors, and handling masks and frame buffers in a lightweight way.</p>



<p>Then, we import <strong>Gradio</strong> (<code data-enlighter-language="python" class="EnlighterJSRAW">gradio</code>). We use Gradio to build interactive demos, create webcam-based real-time segmentation apps, and handle mouse clicks for point-based object selection. This turns our video pipeline into a <strong>live, interactive application</strong> instead of a static script.</p>



<p>Next, we import <strong>PIL’s </strong><code data-enlighter-language="python" class="EnlighterJSRAW">Image</code>. This is used to convert frames into PIL format when required by the processor, handle RGB image conversions cleanly, and bridge between OpenCV and the Transformers preprocessing pipeline.</p>



<p>Then, we import <code data-enlighter-language="python" class="EnlighterJSRAW">Accelerator</code> <strong>from Hugging Face Accelerate</strong>. The <code data-enlighter-language="python" class="EnlighterJSRAW">Accelerator</code> class helps us automatically place the model on CPU or GPU, write device-agnostic code, and scale to different hardware setups without changing logic. This keeps the code clean and portable.</p>



<p>Next, we import <code data-enlighter-language="python" class="EnlighterJSRAW">load_video</code> from <code data-enlighter-language="python" class="EnlighterJSRAW">transformers.video_utils</code>. This utility function loads a video file from disk, decodes it into a list (or generator) of frames, and handles resizing and format conversion in a consistent way. We use this for file-based video experiments.</p>



<p>Finally, we import the <strong>SAM3 video models and processors</strong>. These are the core components of this tutorial.</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">Sam3VideoModel</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">Sam3VideoProcessor</code> are used for:
<ul class="wp-block-list">
<li>Text-prompted video segmentation</li>



<li>Concept detection on video frames</li>
</ul>
</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">Sam3TrackerVideoModel</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">Sam3TrackerVideoProcessor</code> are used for:
<ul class="wp-block-list">
<li>Point-based prompting</li>



<li>Interactive object selection</li>



<li>Multi-object tracking with memory across frames</li>
</ul>
</li>
</ul>



<p>Together, these 2 model families allow us to cover <strong>all 4 workflows</strong>:</p>



<ul class="wp-block-list">
<li>Text-prompted tracking</li>



<li>Webcam-based tracking</li>



<li>Single-click object tracking</li>



<li>Multi-click object tracking</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Text-Prompt-Video-Tracking"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Text-Prompt-Video-Tracking">Text-Prompt Video Tracking</a></h2>



<p>In this section, we use a natural language prompt such as <code data-enlighter-language="python" class="EnlighterJSRAW">"person"</code> or <code data-enlighter-language="python" class="EnlighterJSRAW">"car"</code> to detect, segment, and track objects across an entire video. SAM3 maintains temporal memory, ensuring that object identities remain consistent from the first frame to the last. The result is a fully annotated video with masks, bounding boxes, and tracking IDs propagated over time.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Load-the-SAM3-Video-Model"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Load-the-SAM3-Video-Model">Load the SAM3 Video Model</a></h3>



<p>Before we process any video, we need to load the SAM3 model and its processor.</p>



<p>Since these models are large, we load them <strong>once</strong> and reuse them across all videos and frames.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="3"># ------------------------------------------------
# Model setup (loaded once)
# ------------------------------------------------
accelerator = Accelerator()
device = accelerator.device

model = Sam3VideoModel.from_pretrained(
   "facebook/sam3"
).to(device, dtype=torch.bfloat16)

processor = Sam3VideoProcessor.from_pretrained("facebook/sam3")
</pre>



<p>First, we create an <strong>Accelerator</strong> object. The <code data-enlighter-language="python" class="EnlighterJSRAW">Accelerator</code> automatically:</p>



<ul class="wp-block-list">
<li>Detects whether a GPU is available</li>



<li>Chooses the best device (CPU or GPU)</li>



<li>Handles device placement in a clean and consistent way</li>
</ul>



<p>By using this, we avoid hardcoding <code data-enlighter-language="python" class="EnlighterJSRAW">"cuda"</code> or <code data-enlighter-language="python" class="EnlighterJSRAW">"cpu"</code> anywhere in our code. The <code data-enlighter-language="python" class="EnlighterJSRAW">device</code> variable now represents <strong>where the model will run</strong>.</p>



<p>Next, we load the <strong>SAM3 Video model</strong>. This does 3 important things:</p>



<ul class="wp-block-list">
<li>It downloads the pretrained SAM3 weights from Hugging Face</li>



<li>It constructs the full video-capable segmentation model</li>



<li>It moves the model to the selected device (GPU or CPU)</li>
</ul>



<p>We also explicitly set <code data-enlighter-language="python" class="EnlighterJSRAW">dtype=torch.bfloat16</code>. This tells PyTorch to run the model in <strong>bfloat16 precision</strong>. Using bfloat16 reduces memory usage significantly, speeds up inference on modern GPUs, and has almost no impact on segmentation quality. This is especially important because SAM3 is a <strong>very large model</strong>.</p>
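<p>To get a feel for why the precision choice matters, here is a back-of-the-envelope estimate of weight memory. The parameter count below is a hypothetical round number for illustration, not SAM3&#8217;s actual size; the point is only that bfloat16 stores 2 bytes per weight versus 4 for float32:</p>

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory needed just to store the model weights."""
    return n_params * bytes_per_param / 1e9

# Hypothetical parameter count, for illustration only
n_params = 850_000_000

fp32 = weight_memory_gb(n_params, 4)  # float32: 4 bytes per weight
bf16 = weight_memory_gb(n_params, 2)  # bfloat16: 2 bytes per weight

print(f"fp32: {fp32:.1f} GB, bf16: {bf16:.1f} GB")  # bf16 is exactly half
```

<p>Activations, the optimizer (if training), and framework overhead add more on top, but halving the bytes per weight alone roughly halves the footprint of loading the model.</p>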



<p>Then, we load the <strong>processor</strong>. The processor is responsible for preprocessing video frames (resizing, normalization, padding), encoding text prompts, formatting inputs for the model, and post-processing outputs (resizing masks back to the original resolution).</p>



<p>At this point, we have:</p>



<ul class="wp-block-list">
<li>A fully loaded <strong>SAM3 video model</strong> on the correct device</li>



<li>A <strong>video processor</strong> that knows how to prepare inputs and decode outputs</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Helper-Function-Visualizing-Video-Segmentation-Masks-Bounding-Boxes-and-Tracking-IDs"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Helper-Function-Visualizing-Video-Segmentation-Masks-Bounding-Boxes-and-Tracking-IDs">Helper Function: Visualizing Video Segmentation Masks, Bounding Boxes, and Tracking IDs</a></h3>



<p>Before we start running video inference, we define a small helper function to <strong>visualize segmentation and tracking results</strong>.</p>



<p>This function overlays:</p>



<ul class="wp-block-list">
<li>Segmentation masks</li>



<li>Bounding boxes</li>



<li>Object IDs</li>



<li>Confidence scores</li>
</ul>



<p>directly on top of each video frame.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="4"># ------------------------------------------------
# Visualization helper
# ------------------------------------------------
def overlay_masks_boxes(image, masks, boxes, scores, object_ids, alpha=0.5):
   image = image.copy()
   h, w = image.shape[:2]

   for i, mask in enumerate(masks):
       color = np.random.randint(0, 255, (3,), dtype=np.uint8)

       if mask.shape[-2:] != (h, w):
           mask = cv2.resize(mask.astype(np.uint8), (w, h)) > 0

       colored = np.zeros_like(image)
       colored[mask] = color
       image = cv2.addWeighted(image, 1.0, colored, alpha, 0)

       x1, y1, x2, y2 = boxes[i].astype(int)
       cv2.rectangle(image, (x1, y1), (x2, y2), color.tolist(), 2)

       label = f"ID {int(object_ids[i])} | {scores[i]:.2f}"
       cv2.putText(
           image,
           label,
           (x1, max(y1 - 5, 15)),
           cv2.FONT_HERSHEY_SIMPLEX,
           0.5,
           color.tolist(),
           1,
           cv2.LINE_AA,
       )

   return image
</pre>



<p>First, we create a copy of the input image and read its spatial resolution. We do this to avoid modifying the original frame and ensure all masks are resized to the same height and width.</p>



<p>Next, we loop over <strong>each detected object</strong>. Each iteration corresponds to <strong>one tracked instance</strong> in the frame. Inside the loop, we generate a <strong>random color</strong> for that object. This makes each object visually distinct in the overlay.</p>



<p>Then, we ensure the mask matches the image resolution. Because masks may be produced at a lower resolution than the frame, we resize them to the full frame size and convert them to a Boolean mask. This ensures perfect alignment with the video frame.</p>



<p>Next, we create a colored overlay and blend it with the image. This:</p>



<ul class="wp-block-list">
<li>Paints the object region with the selected color</li>



<li>Blends it with transparency (<code data-enlighter-language="python" class="EnlighterJSRAW">alpha</code>)</li>



<li>Keeps the original image visible underneath</li>
</ul>



<p>Then, we draw the <strong>bounding box</strong> for the same object. This helps us visually confirm the detection region and the spatial extent of each tracked object.</p>



<p>Next, we prepare a <strong>label string</strong>. This shows the <strong>tracking ID</strong> assigned by SAM3 and the <strong>confidence score</strong> of the detection.</p>



<p>We then render this text near the bounding box. We slightly shift the text upward to avoid overlapping with the box. Finally, after all objects are drawn, we return the annotated frame.</p>
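<p>The core blending step (painting the masked region with a color, then mixing it into the frame at a given alpha) can be reproduced in a NumPy-only sketch. This mirrors what <code data-enlighter-language="python" class="EnlighterJSRAW">cv2.addWeighted</code> computes, including saturation at 255; the frame, mask, and color below are synthetic:</p>

```python
import numpy as np

def blend_mask(frame, mask, color, alpha=0.5):
    # out = frame + alpha * colored_overlay, saturated to the uint8 range,
    # which is what cv2.addWeighted(frame, 1.0, colored, alpha, 0) computes
    overlay = np.zeros_like(frame, dtype=np.float32)
    overlay[mask] = color
    return np.clip(frame.astype(np.float32) + alpha * overlay, 0, 255).astype(np.uint8)

# Synthetic gray frame with a square mask in its center
frame = np.full((100, 100, 3), 200, dtype=np.uint8)
mask = np.zeros((100, 100), dtype=bool)
mask[25:75, 25:75] = True

blended = blend_mask(frame, mask, np.array([255.0, 0.0, 0.0]))
print(blended[0, 0].tolist())    # [200, 200, 200] -> background untouched
print(blended[50, 50].tolist())  # [255, 200, 200] -> red channel saturates
```

<p>Outside the mask, the overlay is zero, so the frame passes through unchanged; inside it, the object color is mixed in while the underlying pixels stay visible.</p>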



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Main-Pipeline-Running-the-Full-Video-Segmentation-and-Tracking-Workflow"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Main-Pipeline-Running-the-Full-Video-Segmentation-and-Tracking-Workflow">Main Pipeline: Running the Full Video Segmentation and Tracking Workflow</a></h3>



<p>Now we build the main function that runs SAM3 on a video using a <strong>text prompt</strong> and produces an annotated output video.</p>



<p>This function does 4 things:</p>



<ul class="wp-block-list">
<li>Loads the video frames</li>



<li>Initializes a SAM3 video session with memory</li>



<li>Propagates segmentation and tracking across frames</li>



<li>Writes an annotated video to disk</li>
</ul>



<p>Let us walk through the full pipeline step by step.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="5"># ------------------------------------------------
# Main pipeline
# ------------------------------------------------
def run_sam3_video(video_path, text_prompt):
   # Load frames for SAM3
   video_frames, _ = load_video(video_path)

   # Reliable FPS extraction (OpenCV)
   cap = cv2.VideoCapture(video_path)
   fps = cap.get(cv2.CAP_PROP_FPS)
   cap.release()

   if fps is None or fps &lt;= 0:
       fps = 25.0
   fps = float(fps)

   # Init SAM3 session
   inference_session = processor.init_video_session(
       video=video_frames,
       inference_device=device,
       processing_device="cpu",
       video_storage_device="cpu",
       dtype=torch.bfloat16,
   )

   inference_session = processor.add_text_prompt(
       inference_session=inference_session,
       text=text_prompt,
   )

   outputs_per_frame = {}

   for model_outputs in model.propagate_in_video_iterator(
       inference_session=inference_session,
       max_frame_num_to_track=len(video_frames),
   ):
       processed = processor.postprocess_outputs(
           inference_session,
           model_outputs,
       )
       outputs_per_frame[model_outputs.frame_idx] = processed

   # Prepare output video
   h, w = video_frames[0].shape[:2]
   out_path = "sam3_annotated.mp4"

   writer = cv2.VideoWriter(
       out_path,
       cv2.VideoWriter_fourcc(*"mp4v"),
       fps,
       (w, h),
   )

   for idx, frame in enumerate(video_frames):
       outputs = outputs_per_frame.get(idx)

       if outputs and len(outputs["object_ids"]) > 0:
           frame = overlay_masks_boxes(
               frame,
               outputs["masks"].cpu().numpy(),
               outputs["boxes"].cpu().numpy(),
               outputs["scores"].cpu().numpy(),
               outputs["object_ids"].cpu().numpy(),
           )

       # OpenCV expects BGR
       writer.write(frame[:, :, ::-1])

   writer.release()

   return out_path, out_path
</pre>



<p>First, we load the video frames using the Transformers utility. This returns:</p>



<ul class="wp-block-list">
<li>A list of RGB frames as NumPy arrays</li>



<li>Video metadata (e.g., FPS), which we ignore here because we read FPS more reliably with OpenCV</li>
</ul>



<p>Next, we extract the <strong>frame rate</strong> using OpenCV. We do this because:</p>



<ul class="wp-block-list">
<li>We want the output video to play at the same speed as the input</li>



<li>Some video loaders do not always return reliable FPS metadata</li>
</ul>



<p>If FPS is missing or invalid, we fall back to a default value.</p>
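<p>OpenCV reports an FPS of 0.0 when the metadata is absent, so the guard above can be captured in a small standalone helper (25.0 is the same default used in the pipeline):</p>

```python
def safe_fps(raw_fps, default=25.0):
    # cv2.VideoCapture.get(cv2.CAP_PROP_FPS) returns 0.0 when FPS is unknown,
    # and some loaders may hand back None, so treat both as "missing"
    if raw_fps is None or raw_fps <= 0:
        return default
    return float(raw_fps)

print(safe_fps(29.97))  # valid metadata passes through
print(safe_fps(0.0))    # missing metadata falls back to 25.0
print(safe_fps(None))   # so does None
```
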



<p>Instead of processing each frame independently, SAM3 uses a <strong>video session</strong> that maintains the following:</p>



<ul class="wp-block-list">
<li>Temporal memory</li>



<li>Object identities</li>



<li>Propagation state</li>
</ul>



<p>We initialize the session using <code data-enlighter-language="python" class="EnlighterJSRAW">processor.init_video_session(...)</code>. We pass:</p>



<ul class="wp-block-list">
<li>The full list of frames</li>



<li>The device for model inference (GPU or CPU)</li>



<li>CPU for processing and storage (to save GPU memory)</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">dtype=torch.bfloat16</code> for efficient computation</li>
</ul>



<p>This session object now holds the entire video, memory, and tracking state.</p>



<p>Next, we tell SAM3 <strong>what concept we want to track</strong> using <code data-enlighter-language="python" class="EnlighterJSRAW">processor.add_text_prompt(...)</code>.</p>



<p>For example:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">"person"</code></li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">"car"</code></li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">"player in red jersey"</code></li>
</ul>



<p>From this point on, the session is configured to detect, segment, and track <strong>all instances of this concept</strong> across the video.</p>



<p>The real work happens when we call <code data-enlighter-language="python" class="EnlighterJSRAW">model.propagate_in_video_iterator(...)</code>.</p>



<p>This function:</p>



<ul class="wp-block-list">
<li>Processes frames sequentially</li>



<li>Uses memory to propagate masks and identities</li>



<li>Emits results for each frame</li>
</ul>



<p>For each frame:</p>



<ul class="wp-block-list">
<li>We post-process raw model outputs</li>



<li>Convert them into usable masks, boxes, scores, and IDs</li>



<li>Store them in a dictionary indexed by frame number</li>
</ul>



<p>At the end of this loop, we have a full timeline of segmentation and tracking results.</p>



<p>We now prepare an OpenCV video writer.</p>



<ul class="wp-block-list">
<li>Same resolution as input</li>



<li>Same FPS</li>



<li>MP4 format</li>
</ul>



<p>We now loop over each frame. If SAM3 produced results for that frame: we overlay masks, boxes, IDs, and scores using our helper function. Then we write the frame to the output video. We convert RGB → BGR because OpenCV expects BGR format.</p>
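<p>The channel flip is just a reversal along the last axis. A tiny standalone NumPy check (not part of the pipeline) makes this concrete:</p>

```python
import numpy as np

# A 1x1 "image" that is pure red in RGB channel order
rgb = np.array([[[255, 0, 0]]], dtype=np.uint8)

# frame[:, :, ::-1] reverses the channel axis: RGB -> BGR
bgr = rgb[:, :, ::-1]

# In BGR order, red now sits in the last channel
assert bgr[0, 0, 0] == 0 and bgr[0, 0, 2] == 255
```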



<p>We close the video writer and return the path to the annotated video. At this point, we have a <strong>complete pipeline</strong> that: takes a video + text prompt → returns a fully segmented and tracked video.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Launch-the-Gradio-Application"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Launch-the-Gradio-Application">Launch the Gradio Application</a></h3>



<p>Now that our video pipeline is ready, we wrap everything into a simple <strong>Gradio web interface</strong>.</p>



<p>This allows us to:</p>



<ul class="wp-block-list">
<li>Upload a video</li>



<li>Enter a text prompt (e.g., <code data-enlighter-language="python" class="EnlighterJSRAW">"person"</code>)</li>



<li>Run SAM3 segmentation and tracking</li>



<li>Preview the result</li>



<li>Download the annotated video</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="6"># ------------------------------------------------
# Gradio UI
# ------------------------------------------------
with gr.Blocks() as demo:
   gr.Markdown("# 🎥 SAM3 Video Segmentation &amp; Tracking")

   with gr.Row():
       video_input = gr.Video(label="Upload Video")
       prompt = gr.Textbox(
           label="Text Prompt",
           placeholder="e.g. person, chair, bed",
           value="person",
       )

   run_btn = gr.Button("Run Segmentation")

   video_out = gr.Video(label="Annotated Output")
   download = gr.File(label="Download Video")

   run_btn.click(
       fn=run_sam3_video,
       inputs=[video_input, prompt],
       outputs=[video_out, download],
   )

demo.launch(debug=True)
</pre>



<p><code data-enlighter-language="python" class="EnlighterJSRAW">gr.Blocks()</code> lets us build a <strong>custom UI layout</strong> instead of a simple one-function demo. Everything inside this block becomes part of our web interface.</p>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">gr.Markdown(...)</code> call is just a header that appears at the top of the page.</p>



<p>Here, we place 2 widgets side by side:</p>



<ul class="wp-block-list">
<li>A <strong>video uploader</strong></li>



<li>A <strong>text box</strong> for the concept prompt</li>
</ul>



<p>The default value is <code data-enlighter-language="python" class="EnlighterJSRAW">"person"</code>, so the app works immediately without typing anything.</p>



<p>This button will trigger the SAM3 pipeline.</p>



<p>We create 2 outputs:</p>



<ul class="wp-block-list">
<li>One to <strong>preview</strong> the annotated video</li>



<li>One to <strong>download</strong> the result file</li>
</ul>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">run_btn.click(...)</code> binding tells Gradio: when the button is clicked, call <code data-enlighter-language="python" class="EnlighterJSRAW">run_sam3_video(video, prompt)</code> and display its outputs.</p>



<p>Recall that our function returns 2 values.</p>



<p>So:</p>



<ul class="wp-block-list">
<li>The first output goes to the video preview</li>



<li>The second output goes to the download widget</li>
</ul>



<p>This starts a local web server and opens the interface in the browser.</p>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">debug=True</code> flag helps:</p>



<ul class="wp-block-list">
<li>Show errors in the console</li>



<li>Make debugging easier during development</li>
</ul>



<p>At this point, we have a <strong>complete application</strong>:</p>



<ul class="wp-block-list">
<li>upload a video </li>



<li>type a concept </li>



<li>get a fully segmented and tracked output video</li>
</ul>



<p>This completes the <strong>text-prompt video segmentation and tracking</strong> pipeline.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Output-Text-Prompt-Video-Segmentation-and-Tracking-Results"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Output-Text-Prompt-Video-Segmentation-and-Tracking-Results">Output: Text-Prompt Video Segmentation and Tracking Results</a></h3>



<figure style="text-align: center; max-width: 700px; margin: auto;">
<iframe width="700" height="264" src="https://www.youtube.com/embed/pf6nChxBCTw" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

  <figcaption style="text-align: center; margin-top: 8px;">
    <strong>Figure 1:</strong> Text-Prompt Video Segmentation and Tracking Demo (source: GIF by the author).
  </figcaption>
</figure>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Real-Time-Text-Prompt-Tracking-Webcam"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Real-Time-Text-Prompt-Tracking-Webcam">Real-Time Text-Prompt Tracking (Webcam)</a></h2>



<p>Here, we extend text-prompt tracking to a live webcam stream. Instead of processing a preloaded video, frames arrive continuously, and SAM3 updates its tracking state in real time. This enables live, concept-aware segmentation with stable object identities across streaming frames.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Helper-Function-Stable-Color-Overlays-for-Real-Time-Video-Tracking"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Helper-Function-Stable-Color-Overlays-for-Real-Time-Video-Tracking">Helper Function: Stable Color Overlays for Real-Time Video Tracking</a></h3>



<p>In the previous pipeline, we used a visualization helper to draw masks and bounding boxes on each frame. For streaming and webcam scenarios, we slightly modify this helper so that <strong>each tracked object keeps a stable color across frames</strong>.</p>



<p>This small change makes tracking much easier to follow visually, especially when objects move across the scene.</p>



<p>Here is the updated helper:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="7"># ------------------------------------------------
# Visualization helper
# ------------------------------------------------
def overlay_masks_boxes(image, masks, boxes, scores, object_ids, alpha=0.5):
   image = image.copy()
   h, w = image.shape[:2]

   for i, mask in enumerate(masks):
       # Stable color per object id
       rng = np.random.default_rng(int(object_ids[i]))
       color = rng.integers(0, 255, size=3, dtype=np.uint8)

       if mask.shape[-2:] != (h, w):
           mask = cv2.resize(mask.astype(np.uint8), (w, h)) > 0

       colored = np.zeros_like(image)
       colored[mask] = color
       image = cv2.addWeighted(image, 1.0, colored, alpha, 0)

       x1, y1, x2, y2 = boxes[i].astype(int)
       cv2.rectangle(image, (x1, y1), (x2, y2), color.tolist(), 2)

       label = f"ID {int(object_ids[i])} | {scores[i]:.2f}"
       cv2.putText(
           image,
           label,
           (x1, max(y1 - 5, 15)),
           cv2.FONT_HERSHEY_SIMPLEX,
           0.5,
           color.tolist(),
           1,
           cv2.LINE_AA,
       )

   return image
</pre>



<p>Let us walk through what changed and why it matters.</p>



<p>In streaming inference, objects appear across many frames. If colors change every frame, it becomes hard to follow which object is which.</p>



<p>To solve this, we generate colors based on the <strong>object ID</strong>:</p>



<p>Here, on <strong>Lines 10 and 11</strong>:</p>



<ul class="wp-block-list">
<li>The object ID is used as the random seed.</li>



<li>The same object always produces the same color.</li>



<li>Tracking becomes visually consistent across frames.</li>
</ul>



<p>So if object ID 3 appears in 200 frames, it will always use the same color.</p>
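<p>The determinism is easy to verify in isolation. This snippet reproduces only the seeding trick from the helper, nothing else:</p>

```python
import numpy as np

def id_color(object_id):
    # Seeding the generator with the object id makes the color a pure
    # function of the id, so it is identical in every frame
    rng = np.random.default_rng(int(object_id))
    return rng.integers(0, 255, size=3, dtype=np.uint8)

# Same id -> same color, no matter how many times (frames) we ask
assert (id_color(3) == id_color(3)).all()
```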



<p>Sometimes masks are produced at a lower resolution, so we resize them to match the frame. <strong>Lines 13 and 14</strong> ensure masks perfectly align with the streaming frame.</p>



<p>Next, we create a colored mask and blend it with the original frame on <strong>Lines 16-18</strong>. The transparency factor <code data-enlighter-language="python" class="EnlighterJSRAW">alpha</code> controls how strongly the mask appears.</p>
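<p>For intuition, <code data-enlighter-language="python" class="EnlighterJSRAW">cv2.addWeighted(src1, a, src2, b, g)</code> computes <code data-enlighter-language="python" class="EnlighterJSRAW">src1*a + src2*b + g</code> with saturation to the valid pixel range. A NumPy stand-in (illustrative, not the OpenCV call itself) shows how <code data-enlighter-language="python" class="EnlighterJSRAW">alpha</code> scales the overlay:</p>

```python
import numpy as np

def blend(image, colored, alpha):
    # Mirrors cv2.addWeighted(image, 1.0, colored, alpha, 0):
    # out = image * 1.0 + colored * alpha, clipped to the uint8 range
    out = image.astype(np.float32) + colored.astype(np.float32) * alpha
    return np.clip(out, 0, 255).astype(np.uint8)

image = np.full((1, 1, 3), 100, dtype=np.uint8)
colored = np.full((1, 1, 3), 200, dtype=np.uint8)
assert blend(image, colored, 0.5)[0, 0, 0] == 200  # 100 + 0.5 * 200
```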



<p>We then draw bounding boxes for each tracked object. <strong>Line 21</strong> helps confirm detection regions visually.</p>



<p>Finally, we render the object ID and confidence score and draw it above the box. This allows us to verify identity consistency and inspect tracking confidence per frame (<strong>Lines 23-33</strong>).</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Streaming-Inference-Function-Maintaining-Temporal-Memory-Across-Live-Video-Frames"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Streaming-Inference-Function-Maintaining-Temporal-Memory-Across-Live-Video-Frames">Streaming Inference Function: Maintaining Temporal Memory Across Live Video Frames</a></h3>



<p>So far, we processed videos by loading all frames first and then propagating segmentation across the entire clip.</p>



<p>For webcam or live-stream scenarios, this approach does not work because frames arrive <strong>one at a time</strong>. Instead, we need a streaming pipeline that:</p>



<ul class="wp-block-list">
<li>Maintains tracking memory across frames</li>



<li>Processes each incoming frame independently</li>



<li>Updates segmentation and tracking state continuously</li>
</ul>



<p>To achieve this, we maintain a persistent session and reuse it across frames.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="8"># ------------------------------------------------
# Global streaming state (kept across frames)
# ------------------------------------------------
STREAM_STATE = {
   "session": None,
   "prompt": None,
}
</pre>



<p>First, we create a small global structure that keeps track of the active inference session.</p>



<p>This dictionary stores:</p>



<ul class="wp-block-list">
<li>The active SAM3 inference session</li>



<li>The currently used text prompt</li>
</ul>



<p>This allows us to reuse the same session across frames instead of recreating it repeatedly.</p>
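<p>Stripped of the model calls, the reuse logic reduces to a small cache keyed on the prompt. A standalone sketch, where <code data-enlighter-language="python" class="EnlighterJSRAW">make_session</code> is a hypothetical stand-in for the SAM3 session setup:</p>

```python
STREAM_STATE = {"session": None, "prompt": None}

def ensure_session(text_prompt, make_session):
    # Re-create the session only when none exists yet or the prompt changed;
    # otherwise reuse it so tracking memory survives across frames
    if STREAM_STATE["session"] is None or STREAM_STATE["prompt"] != text_prompt:
        STREAM_STATE["session"] = make_session(text_prompt)
        STREAM_STATE["prompt"] = text_prompt
    return STREAM_STATE["session"]

calls = []
stub = lambda p: calls.append(p) or object()  # records each (re)creation
s1 = ensure_session("person", stub)
s2 = ensure_session("person", stub)   # same prompt -> session reused
s3 = ensure_session("car", stub)      # prompt changed -> fresh session
assert s1 is s2 and s2 is not s3 and calls == ["person", "car"]
```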



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="8" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="9"># ------------------------------------------------
# Streaming inference function
# ------------------------------------------------
def process_webcam_frame(frame, text_prompt):
   global STREAM_STATE

   if frame is None:
       return None

   # Initialize session if needed or if prompt changed
   if (
       STREAM_STATE["session"] is None
       or STREAM_STATE["prompt"] != text_prompt
   ):
       session = processor.init_video_session(
           inference_device=device,
           processing_device="cpu",
           video_storage_device="cpu",
           dtype=torch.bfloat16,
       )
       session = processor.add_text_prompt(
           inference_session=session,
           text=text_prompt,
       )
       STREAM_STATE["session"] = session
       STREAM_STATE["prompt"] = text_prompt

   session = STREAM_STATE["session"]

   # Preprocess frame
   inputs = processor(images=frame, device=device, return_tensors="pt")

   # Streaming forward pass
   with torch.no_grad():
       model_outputs = model(
           inference_session=session,
           frame=inputs.pixel_values[0],
           reverse=False,
       )

   # Postprocess to original resolution
   outputs = processor.postprocess_outputs(
       session,
       model_outputs,
       original_sizes=inputs.original_sizes,
   )

   # Visualize
   if outputs and len(outputs["object_ids"]) > 0:
       frame = overlay_masks_boxes(
           frame,
           outputs["masks"].cpu().numpy(),
           outputs["boxes"].cpu().numpy(),
           outputs["scores"].cpu().numpy(),
           outputs["object_ids"].cpu().numpy(),
       )

   return frame
</pre>



<p>Now we define the function that processes each incoming webcam frame. The <code data-enlighter-language="python" class="EnlighterJSRAW">global</code> statement gives the function access to the stream state defined earlier. If no frame is available, we simply return <code data-enlighter-language="python" class="EnlighterJSRAW">None</code> (<strong>Lines 14 and 15</strong>).</p>



<p>Next, we check whether a session already exists or whether the user has changed the prompt. If either condition is true, we create a new session. Unlike offline video processing, we do not pass frames during initialization because frames arrive one at a time. We then attach the prompt and store the session globally (<strong>Lines 18-33</strong>).</p>



<p>This session (<strong>Line 35</strong>) now contains:</p>



<ul class="wp-block-list">
<li>Tracking memory</li>



<li>Object identity history</li>



<li>Propagation state</li>
</ul>



<p>across previous frames.</p>



<p>To convert each frame into tensors before inference, we call the processor, which handles resizing, normalization, tensor formatting, and device placement (<strong>Line 38</strong>).</p>



<p>We now run inference for the current frame (<strong>Lines 41-46</strong>). Key points:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">torch.no_grad()</code> disables gradients, improving speed.</li>



<li>The frame is processed using the <strong>existing session</strong>.</li>



<li>SAM3 updates tracking memory internally.</li>
</ul>



<p>So segmentation and identities propagate automatically.</p>



<p>Model outputs are resized back to original resolution. This produces masks, bounding boxes, scores, and object IDs for the current frame (<strong>Lines 49-53</strong>).</p>



<p>If objects are detected, we overlay results on the frame. This produces the annotated frame. The processed frame is returned to the UI or video stream (<strong>Lines 56-65</strong>).</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Launch-the-Gradio-Application-2"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Launch-the-Gradio-Application-2">Launch the Gradio Application</a></h3>



<p>Now we connect our streaming inference pipeline to a <strong>live webcam interface</strong> using Gradio.</p>



<p>This allows us to:</p>



<ul class="wp-block-list">
<li>Capture frames directly from a webcam</li>



<li>Run SAM3 segmentation and tracking in real time</li>



<li>Visualize masks and tracked objects continuously</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="10"># ------------------------------------------------
# Gradio UI
# ------------------------------------------------
with gr.Blocks() as demo:
   gr.Markdown("# 📷 SAM3 Live Webcam Segmentation &amp; Tracking")

   with gr.Row():
       webcam = gr.Image(
           sources=["webcam"],
           streaming=True,
           label="Webcam",
           type="numpy",
       )

       output = gr.Image(
           label="Live Segmentation",
           type="numpy",
       )

   prompt = gr.Textbox(
       label="Text Prompt",
       value="person",
       placeholder="e.g. person, face, chair, bottle",
   )

   webcam.stream(
       fn=process_webcam_frame,
       inputs=[webcam, prompt],
       outputs=output,
   )

demo.launch(debug=True)
</pre>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">gr.Blocks()</code> context creates the Gradio app layout where we can combine multiple UI components. The <code data-enlighter-language="python" class="EnlighterJSRAW">gr.Markdown(...)</code> call displays a header explaining what the demo does.</p>



<p>We place input and output side by side. Inside <code data-enlighter-language="python" class="EnlighterJSRAW">gr.Row</code>, we define 2 components:</p>



<p>First component:</p>



<ul class="wp-block-list">
<li>Captures frames from the webcam</li>



<li>Streams frames continuously</li>



<li>Sends frames as NumPy arrays to our function</li>
</ul>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">streaming=True</code> flag enables real-time frame delivery.</p>



<p>The second component displays the processed frame returned by our streaming pipeline.</p>



<p>Next, we create a textbox for specifying the concept to track. Users can change this dynamically while the webcam runs. If the prompt changes, our pipeline automatically resets the session.</p>



<p>The key connection is the <code data-enlighter-language="python" class="EnlighterJSRAW">webcam.stream(...)</code> call. This tells Gradio:</p>



<ul class="wp-block-list">
<li>Send every webcam frame to <code data-enlighter-language="python" class="EnlighterJSRAW">process_webcam_frame</code></li>



<li>Pass along the current text prompt</li>



<li>Display the returned frame in the output panel</li>
</ul>



<p>This loop runs continuously while the webcam is active.</p>



<p>Finally, we launch the interface. This starts a local server and opens the demo in a browser. The <code data-enlighter-language="python" class="EnlighterJSRAW">debug=True</code> flag helps diagnose errors during development.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Output-Real-Time-Webcam-Video-Segmentation-Results"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Output-Real-Time-Webcam-Video-Segmentation-Results">Output: Real-Time Webcam Video Segmentation Results</a></h3>



<figure style="text-align: center; max-width: 700px; margin: auto;">
<iframe width="700" height="394" src="https://www.youtube.com/embed/kLH-uQX1nrE" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

  <figcaption style="text-align: center; margin-top: 8px;">
    <strong>Figure 2:</strong> Real-Time Webcam Video Segmentation Demo (source: GIF by the author).
  </figcaption>
</figure>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Single-Click-Object-Tracking"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Single-Click-Object-Tracking">Single-Click Object Tracking</a></h2>



<p>In this workflow, we remove text prompts and select an object by clicking on it in the first frame. SAM3 segments the clicked object and propagates its mask throughout the video using its tracking memory. With just one foreground point, we obtain consistent object tracking across the full sequence.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Load-the-SAM3-Tracker-Video-Model"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Load-the-SAM3-Tracker-Video-Model">Load the SAM3 Tracker Video Model</a></h3>



<p>So far, we used text prompts to detect and track concepts across videos and live streams.</p>



<p>Now we move to a different workflow: <strong>interactive tracking</strong>, where we manually select objects and let SAM3 track them across frames.</p>



<p>To enable this, we switch from the text-prompt video model to the <strong>tracker-specific video model</strong>.</p>



<p>Here is how we load it:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="11"># Initialize model
device = Accelerator().device
model = Sam3TrackerVideoModel.from_pretrained("facebook/sam3").to(device, dtype=torch.bfloat16)
processor = Sam3TrackerVideoProcessor.from_pretrained("facebook/sam3")
</pre>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">Accelerator().device</code> automatically selects the available hardware:</p>



<ul class="wp-block-list">
<li>GPU if available</li>



<li>CPU otherwise</li>
</ul>



<p>This keeps our code portable across machines.</p>



<p>We load the <strong>SAM3 tracker model</strong>, which is optimized for:</p>



<ul class="wp-block-list">
<li>Point-based object prompting</li>



<li>Interactive tracking</li>



<li>Multi-object identity propagation</li>



<li>Frame-to-frame tracking consistency</li>
</ul>



<p>Unlike the previous model, this one does not require text prompts. Instead, it expects clicks or point annotations.</p>



<p>We also move the model to the selected device and run it in <strong>bfloat16 precision</strong>, reducing memory usage and speeding up inference.</p>



<p>We load the processor which prepares inputs and postprocesses outputs specifically for tracking workflows. It handles frame preprocessing, prompt encoding (clicks or points), mask decoding, and identity propagation formatting.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Extract-First-Frame-Preparing-the-Initial-Frame-for-Object-Selection"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Extract-First-Frame-Preparing-the-Initial-Frame-for-Object-Selection">Extract First Frame: Preparing the Initial Frame for Object Selection</a></h3>



<p>Before we start tracking objects in a video, we need a way to <strong>select them</strong>.</p>



<p>In our workflow, we select objects by clicking on them in the <strong>first frame</strong>. So, the first step is to extract that frame from the video.</p>



<p>Here is a small helper function that does exactly that:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="12">def extract_first_frame(video_path):
   """Extract first frame from video for point selection"""
   cap = cv2.VideoCapture(video_path)
   ret, frame = cap.read()
   cap.release()
   if ret:
       return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
   return None
</pre>



<p>First, OpenCV opens the video file and prepares it for frame reading. Then it attempts to read the next frame from the video.</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">ret</code> indicates whether reading succeeded.</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">frame</code> contains the actual image data.</li>
</ul>



<p>Since we just opened the video, this call returns the <strong>first frame</strong>.</p>



<p>We release the video resource immediately after reading the frame. This is important because:</p>



<ul class="wp-block-list">
<li>It frees system resources</li>



<li>It prevents file locks</li>



<li>It keeps later video processing clean</li>
</ul>



<p>OpenCV loads images in <strong>BGR format</strong>, but most visualization and processing pipelines expect <strong>RGB</strong>. So we convert BGR to RGB before returning the frame.</p>



<p>If the frame cannot be read, the function safely returns <code data-enlighter-language="python" class="EnlighterJSRAW">None</code>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Tracking-Object-Function-Propagating-a-Single-Object-Mask-Across-Video-Frames"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Tracking-Object-Function-Propagating-a-Single-Object-Mask-Across-Video-Frames">Tracking Object Function: Propagating a Single Object Mask Across Video Frames</a></h3>



<p>We now build the core function that allows us to <strong>track an object through an entire video using a single click</strong>.</p>



<p>The workflow is simple:</p>



<ul class="wp-block-list">
<li>The user uploads a video.</li>



<li>The user clicks on the object in the first frame.</li>



<li>SAM3 segments that object.</li>



<li>The tracker propagates the object mask across all frames.</li>



<li>A new annotated video is generated.</li>
</ul>



<p>Let us walk through the implementation step by step:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="13">def track_object(video_path, point_coords):
   """Track object through video based on clicked point"""
   if video_path is None:
       return None, "Please upload a video first"

   if point_coords is None:
       return None, "Please click on the first frame to select an object"

   try:
       # Load video
       video_frames, _ = load_video(video_path)

       # Get click coordinates
       x, y = int(point_coords[0]), int(point_coords[1])

       # Initialize session
       inference_session = processor.init_video_session(
           video=video_frames,
           inference_device=device,
           dtype=torch.bfloat16,
       )

       # Add point annotation
       points = [[[[x, y]]]]
       labels = [[[1]]]  # 1 for foreground point

       processor.add_inputs_to_inference_session(
           inference_session=inference_session,
           frame_idx=0,
           obj_ids=1,
           input_points=points,
           input_labels=labels,
       )

       # First, segment the object on the first frame
       outputs = model(
           inference_session=inference_session,
           frame_idx=0,
       )
       first_frame_masks = processor.post_process_masks(
           [outputs.pred_masks],
           original_sizes=[[inference_session.video_height, inference_session.video_width]],
           binarize=False
       )[0]

       # Propagate through video
       video_segments = {0: first_frame_masks}
       for sam3_tracker_video_output in model.propagate_in_video_iterator(inference_session):
           video_res_masks = processor.post_process_masks(
               [sam3_tracker_video_output.pred_masks],
               original_sizes=[[inference_session.video_height, inference_session.video_width]],
               binarize=False
           )[0]
           video_segments[sam3_tracker_video_output.frame_idx] = video_res_masks

       # Create output video with masks
       output_path = "/tmp/output_tracked.mp4"
       fourcc = cv2.VideoWriter_fourcc(*'mp4v')
       height, width = video_frames[0].shape[:2]
       out = cv2.VideoWriter(output_path, fourcc, 30.0, (width, height))

       for idx in range(len(video_frames)):
           frame = video_frames[idx].copy().astype(np.uint8)
           if idx in video_segments:
               masks = video_segments[idx]
               # Convert mask to float32 first, then to boolean
               mask = masks[0, 0].float().cpu().numpy() > 0.0
               # Overlay red mask
               overlay = frame.copy()
               overlay[mask] = [255, 0, 0]
               frame = cv2.addWeighted(frame.astype(np.float32), 0.6, overlay.astype(np.float32), 0.4, 0).astype(np.uint8)
           out.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))

       out.release()

       status = f"✅ Successfully tracked object through {len(video_segments)} frames at point ({x}, {y})"
       return output_path, status

   except Exception as e:
       return None, f"❌ Error: {str(e)}"
</pre>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">track_object()</code> function accepts:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">video_path</code>: Path to the uploaded video file.</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">point_coords</code>: <code data-enlighter-language="python" class="EnlighterJSRAW">(x, y)</code> coordinates of the user’s click on the first frame.</li>
</ul>



<p>The goal is simple: “Given a video and one clicked point, track that object through the entire video.”</p>



<p>If no video is uploaded, tracking cannot begin. The function returns:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">None</code>: No output video</li>



<li>A message explaining the issue</li>
</ul>



<p>Likewise, tracking requires a <strong>foreground prompt</strong>. SAM3 needs at least one positive point to know: which object to segment, and where that object exists in the first frame. Without it, tracking is undefined.</p>



<p>Inside a <code data-enlighter-language="python" class="EnlighterJSRAW">try</code> block, the <code data-enlighter-language="python" class="EnlighterJSRAW">load_video()</code> function reads the video file, extracts all frames into memory, and returns them as a list of NumPy arrays.</p>



<p>Why load all frames?</p>



<p>Because SAM3 tracking requires:</p>



<ul class="wp-block-list">
<li>Access to the entire temporal sequence</li>



<li>Mask propagation across frames</li>



<li>Internal memory consistency</li>
</ul>



<p>Each frame shape is typically: <code data-enlighter-language="python" class="EnlighterJSRAW">(height, width, 3)</code>. The UI provides floating-point coordinates. We convert them to integers because:</p>



<ul class="wp-block-list">
<li>Pixel indices must be integers</li>



<li>Mask indexing requires integer positions</li>
</ul>



<p>This <code data-enlighter-language="python" class="EnlighterJSRAW">(x, y)</code> now represents a foreground location inside frame 0.</p>



<p>Next, we initialize a video session which creates an internal tracking session, stores all video frames, model memory state, and object tracking buffers. We also set computation device (either CPU or GPU) and use <code data-enlighter-language="python" class="EnlighterJSRAW">bfloat16</code> for faster inference and lower memory usage. This prepares SAM3’s brain to process a full video.</p>



<p>Then, we prepare a foreground prompt. Since SAM3 expects inputs in batch format, <code data-enlighter-language="python" class="EnlighterJSRAW">points = [[[[x, y]]]]</code> means:</p>



<ul class="wp-block-list">
<li>Batch size = 1</li>



<li>Object ID = 1</li>



<li>One point</li>



<li>Coordinates <code data-enlighter-language="python" class="EnlighterJSRAW">(x, y)</code></li>
</ul>



<p>Similarly, in <code data-enlighter-language="python" class="EnlighterJSRAW">labels = [[[1]]]</code>, the value <code data-enlighter-language="python" class="EnlighterJSRAW">1</code> means:</p>



<ul class="wp-block-list">
<li>Foreground point</li>
</ul>



<p>If it were <code data-enlighter-language="python" class="EnlighterJSRAW">0</code>, it would mean:</p>



<ul class="wp-block-list">
<li>Background point</li>
</ul>



<p>So this tells SAM3: “This pixel belongs to the object.”</p>
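<p>To make the nesting concrete, here is a tiny, standalone sketch of the same structure (the click coordinates are hypothetical, not from the actual demo):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Hypothetical click coordinates returned by the UI
x, y = 284, 161

# [batch][object][points][coordinates]
points = [[[[x, y]]]]

# [batch][object][points]: 1 = foreground, 0 = background
labels = [[[1]]]

# One batch, containing one object, with one point
print(len(points), len(points[0]), len(points[0][0]))</pre>



<p>Each extra level of brackets corresponds to one level of the batch format, which is why a single click ends up wrapped four lists deep.</p>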



<p>Then <code data-enlighter-language="python" class="EnlighterJSRAW">processor.add_inputs_to_inference_session()</code> injects the prompt into the tracking session.</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">frame_idx=0</code>: Object exists in the first frame</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">obj_ids=1</code>: This is object ID 1</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">input_points</code>: Where the object is</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">input_labels</code>: Foreground signal</li>
</ul>



<p>At this point, the model knows that object 1 is located at <code data-enlighter-language="python" class="EnlighterJSRAW">(x, y)</code> in frame 0.</p>



<p>We explicitly run segmentation on frame 0. This produces:</p>



<ul class="wp-block-list">
<li>Raw mask logits</li>



<li>Low-resolution mask predictions</li>
</ul>



<p>Then <code data-enlighter-language="python" class="EnlighterJSRAW">processor.post_process_masks()</code>:</p>



<ul class="wp-block-list">
<li>Resizes masks to original resolution</li>



<li>Converts internal representation into full-size masks</li>



<li>Keeps them as float probabilities (not binarized)</li>
</ul>



<p>We now have a full-resolution mask for frame 0.</p>



<p>On <strong>Line 47</strong>, we store the frame 0 results first. Then <code data-enlighter-language="python" class="EnlighterJSRAW">model.propagate_in_video_iterator()</code> runs SAM3&#8217;s tracking mechanism.</p>



<p>What happens internally:</p>



<ul class="wp-block-list">
<li>It uses memory from frame 0</li>



<li>Matches object appearance across frames</li>



<li>Predicts masks for each new frame</li>
</ul>



<p>For each frame, <code data-enlighter-language="python" class="EnlighterJSRAW">processor.post_process_masks(...)</code> resizes the masks to the original resolution, and we store them in a dictionary keyed by frame index.</p>



<p>Final structure:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">video_segments = {
  0: mask0,
  1: mask1,
  2: mask2,
  ...
}</pre>



<p>Now we have segmentation results for the full video.</p>



<p>Next, we define the output video path and set the codec to <code data-enlighter-language="python" class="EnlighterJSRAW">mp4v</code>. We read the resolution from the first frame so the output video matches the input resolution. We then initialize the OpenCV writer with <code data-enlighter-language="python" class="EnlighterJSRAW">cv2.VideoWriter()</code>, passing the output path, the codec, 30 FPS, and the frame size <code data-enlighter-language="python" class="EnlighterJSRAW">(width, height)</code>.</p>



<p>We then iterate over each frame and create a copy to avoid modifying the original frames. We move the output tensor to the CPU, convert it to a NumPy array, compute probabilities, and threshold the result to obtain a Boolean mask. The resulting <code data-enlighter-language="python" class="EnlighterJSRAW">mask</code> is <code data-enlighter-language="python" class="EnlighterJSRAW">True</code> where the object exists and <code data-enlighter-language="python" class="EnlighterJSRAW">False</code> elsewhere.</p>
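<p>As a minimal illustration of the thresholding step (with toy values standing in for the real SAM3 output), thresholding at zero is all it takes to turn the float mask into a Boolean one:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np

# Toy stand-in for one frame's mask values after .float().cpu().numpy();
# the real values come from SAM3 — these are made up for illustration
mask_values = np.array([[0.7, -1.2],
                        [2.3, -0.4]], dtype=np.float32)

# True where the object is predicted, False elsewhere
mask = mask_values &gt; 0.0</pre>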



<p>We then overlay a red mask and blend it using <code data-enlighter-language="python" class="EnlighterJSRAW">cv2.addWeighted()</code>, producing a frame with 60% original content and 40% red overlay for smooth visualization. Because OpenCV expects BGR format, we convert the frame using <code data-enlighter-language="python" class="EnlighterJSRAW">cv2.cvtColor()</code> with the <code data-enlighter-language="python" class="EnlighterJSRAW">cv2.COLOR_RGB2BGR</code> flag.</p>
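<p>Under the hood, the blend is simple per-pixel arithmetic. Here is a NumPy-only sketch on a toy 2&#215;2 frame that performs the same computation as <code data-enlighter-language="python" class="EnlighterJSRAW">cv2.addWeighted(frame, 0.6, overlay, 0.4, 0)</code> (the data is fabricated for illustration; the actual code calls OpenCV directly):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">import numpy as np

# Toy 2x2 gray "frame" and a mask marking one object pixel
frame = np.full((2, 2, 3), 200, dtype=np.uint8)
mask = np.array([[True, False],
                 [False, False]])

# Paint the object pixels red on a copy of the frame
overlay = frame.copy().astype(np.float32)
overlay[mask] = np.array([255, 0, 0], dtype=np.float32)

# Same arithmetic as cv2.addWeighted(frame, 0.6, overlay, 0.4, 0)
blended = (frame.astype(np.float32) * 0.6 + overlay * 0.4).astype(np.uint8)</pre>



<p>The masked pixel becomes a red-tinted blend of its original color, while unmasked pixels come out unchanged — exactly the smooth overlay effect described above.</p>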



<p><code data-enlighter-language="python" class="EnlighterJSRAW">out.release()</code> finalizes the file, writes any remaining buffers, and closes the video properly. Without this call, the output file may become corrupted.</p>



<p>Finally, we return the path to the saved video and a success message. If an error occurs:</p>



<ul class="wp-block-list">
<li>The function safely returns an error message</li>



<li>The application does not crash</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Launch-the-Gradio-Application-3"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Launch-the-Gradio-Application-3">Launch the Gradio Application</a></h3>



<p>We now connect our tracking pipeline to an interactive Gradio interface. This interface allows users to upload a video, click on an object in the first frame, and automatically track that object across the entire clip.</p>



<p>Here is the full interface code:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="14"># Create Gradio interface with blocks for better control
with gr.Blocks(title="SAM3 Video Tracker") as demo:
   gr.Markdown("# 🎯 SAM3 Video Object Tracker")
   gr.Markdown("Upload a video and click on an object in the first frame to track it throughout the video")

   with gr.Row():
       with gr.Column():
           video_input = gr.Video(label="Upload Video")
           first_frame = gr.Image(label="Click on object to track", type="numpy")
           point_display = gr.Textbox(label="Selected Point", interactive=False)
           track_btn = gr.Button("Track Object", variant="primary")

       with gr.Column():
           video_output = gr.Video(label="Tracked Video")
           status_output = gr.Textbox(label="Status")

   # Store clicked point
   clicked_point = gr.State(None)

   # Extract first frame when video is uploaded
   def on_video_upload(video):
       if video:
           frame = extract_first_frame(video)
           return frame, None, "Upload complete. Click on the object you want to track."
       return None, None, ""

   video_input.change(
       on_video_upload,
       inputs=[video_input],
       outputs=[first_frame, clicked_point, status_output]
   )

   # Handle click on first frame
   def on_click(img, evt: gr.SelectData):
       x, y = evt.index[0], evt.index[1]
       # Draw a circle on the clicked point
       img_copy = img.copy()
       cv2.circle(img_copy, (x, y), 5, (255, 0, 0), -1)
       return img_copy, (x, y), f"Point selected: ({x}, {y})"

   first_frame.select(
       on_click,
       inputs=[first_frame],
       outputs=[first_frame, clicked_point, point_display]
   )

   # Track button
   track_btn.click(
       track_object,
       inputs=[video_input, clicked_point],
       outputs=[video_output, status_output]
   )

# Launch
demo.launch(debug=True)
</pre>



<p>The Gradio interface is built using <code data-enlighter-language="python" class="EnlighterJSRAW">gr.Blocks</code>, which gives full control over layout, components, and event handling. The goal is simple: “Allow a user to upload a video, click on a single object in the first frame, and track that object throughout the entire video.”</p>



<p>At the top of the interface, we display two <code data-enlighter-language="python" class="EnlighterJSRAW">gr.Markdown()</code> sections. The first acts as the main heading so users immediately understand what the application does. The second provides short instructions explaining the workflow: upload a video, click on an object in the first frame, and then track it.</p>



<p>Next, we structure the layout using a <code data-enlighter-language="python" class="EnlighterJSRAW">gr.Row()</code>. Inside that row, we create two columns. The left column contains all inputs and interactions. The right column displays outputs.</p>



<p>In the left column, we first add a video upload component. This allows the user to upload a video file from their system. Once uploaded, the backend receives the file path. That file path is later used to extract frames and run tracking.</p>



<p>Below the video upload, we place an image component. This image will display the first frame of the uploaded video. We set its type to NumPy so that the backend receives the frame as a NumPy array. This is important because we draw visual markers on the frame using OpenCV when the user clicks.</p>



<p>Below the image, we add a textbox labeled <code data-enlighter-language="python" class="EnlighterJSRAW">"Selected Point"</code>. This textbox is non-interactive, meaning the user cannot manually edit it. It simply displays the coordinates of the selected point so the user can confirm their click.</p>



<p>Under that, we add a <code data-enlighter-language="python" class="EnlighterJSRAW">"Track Object"</code> button. This button is styled as primary so it visually stands out as the main action. When clicked, it triggers the tracking pipeline.</p>



<p>In the right column, we create a video output component. This will display the processed video after tracking is complete. Below it, we add a status textbox. This displays messages such as upload confirmation, tracking success, or error details.</p>



<p>To maintain interaction state, we use a Gradio State variable called <code data-enlighter-language="python" class="EnlighterJSRAW">clicked_point</code>. This variable stores the coordinates of the selected object. Initially, it is set to <code data-enlighter-language="python" class="EnlighterJSRAW">None</code>. State is important because the click event and the tracking event happen at different times, and we need a way to remember which point the user selected.</p>



<p>When a video is uploaded, a function <code data-enlighter-language="python" class="EnlighterJSRAW">on_video_upload()</code> is triggered. This function checks whether a valid video exists. If it does, we extract the first frame using a helper function. That first frame is returned to the image component so the user can see it. We also reset the stored clicked point to <code data-enlighter-language="python" class="EnlighterJSRAW">None</code>, ensuring that any previous selection is cleared. Finally, we return a status message informing the user that the upload is complete and they should click on an object.</p>



<p>If no video is uploaded, the function returns empty values, keeping the interface clean.</p>



<p>When the user clicks on the first frame image, another function <code data-enlighter-language="python" class="EnlighterJSRAW">on_click()</code> handles the event. The click event provides the pixel coordinates of the selected location through <code data-enlighter-language="python" class="EnlighterJSRAW">evt.index</code>. We extract the <code data-enlighter-language="python" class="EnlighterJSRAW">x</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">y</code> coordinates from that event data.</p>



<p>Next, we create a copy of the displayed image. This is important because we do not want to modify the original image directly. On this copy, we draw a small filled circle at the clicked location using OpenCV. The circle visually marks the selected object so the user knows exactly where they clicked.</p>



<p>After drawing the marker, we return three things:</p>



<ul class="wp-block-list">
<li>The updated image with the circle drawn</li>



<li>The tuple <code data-enlighter-language="python" class="EnlighterJSRAW">(x, y)</code> stored in the state variable</li>



<li>A formatted string such as <code data-enlighter-language="python" class="EnlighterJSRAW">"Point selected: (x, y)"</code> displayed in the textbox</li>
</ul>



<p>This ensures the UI updates immediately and the selected point is stored for later use.</p>



<p>The Track Object button is connected to the backend tracking function. When pressed, it sends two inputs:</p>



<ul class="wp-block-list">
<li>The uploaded video path</li>



<li>The stored clicked point</li>
</ul>



<p>The tracking function then performs segmentation and mask propagation across the entire video. Once processing is complete, it returns:</p>



<ul class="wp-block-list">
<li>The path to the output tracked video</li>



<li>A status message indicating success or failure</li>
</ul>



<p>These outputs are displayed in the video output component and the status textbox, respectively.</p>



<p>Finally, the application is launched with debug mode enabled. Debug mode prints detailed logs in case of errors, which is helpful during development and testing.</p>



<p>The complete flow works as follows:</p>



<ul class="wp-block-list">
<li>The user uploads a video.</li>



<li>The first frame is extracted and displayed.</li>



<li>The user clicks on an object.</li>



<li>A visual marker appears and the coordinates are stored.</li>



<li>The user presses Track Object.</li>



<li>The backend processes the video and returns the tracked result.</li>



<li>The output video and status message are displayed.</li>
</ul>



<p>This design keeps the interface simple, intuitive, and focused on a single-click tracking workflow while properly managing state and user interaction.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Output-Single-Click-Video-Object-Tracking-Results"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Output-Single-Click-Video-Object-Tracking-Results">Output: Single-Click Video Object Tracking Results</a></h3>



<figure style="text-align: center; max-width: 700px; margin: auto;">
<iframe width="700" height="401" src="https://www.youtube.com/embed/MV-6LwSSRhM" title="Single-Click Video Object Tracking Demo" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
  <figcaption style="text-align: center; margin-top: 8px;">
    <strong>Figure 3:</strong> Single-Click Video Object Tracking Demo (source: GIF by the author).
  </figcaption>
</figure>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Multi-Click-Object-Tracking"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Multi-Click-Object-Tracking">Multi-Click Object Tracking</a></h2>



<p>In this final setup, we select multiple objects by clicking different locations in the first frame. Each click initializes a unique object ID, and SAM3 tracks all selected objects simultaneously. The output video shows multiple masks with distinct colors, preserving identity consistency across frames.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Initialize-Few-Colors-Defining-a-Color-Palette-for-Multi-Object-Tracking-Visualization"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Initialize-Few-Colors-Defining-a-Color-Palette-for-Multi-Object-Tracking-Visualization">Initialize a Few Colors: Defining a Color Palette for Multi-Object Tracking Visualization</a></h3>



<p>When tracking multiple objects at the same time, visualization becomes very important. If all objects share the same mask color, it becomes difficult to understand which mask corresponds to which object.</p>



<p>To solve this, we assign <strong>different colors to different tracked objects</strong>.</p>



<p>Here is a small color palette we use:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="15"># Different colors for different objects
COLORS = [
   [255, 0, 0],    # Red
   [0, 255, 0],    # Green
   [0, 0, 255],    # Blue
   [255, 255, 0],  # Yellow
   [255, 0, 255],  # Magenta
   [0, 255, 255],  # Cyan
   [255, 128, 0],  # Orange
   [128, 0, 255],  # Purple
]
</pre>



<p>Each entry in this list represents an RGB color used to render masks and overlays for a tracked object.</p>



<p>For example:</p>



<ul class="wp-block-list">
<li>Object 1 may appear in <strong>red</strong></li>



<li>Object 2 in <strong>green</strong></li>



<li>Object 3 in <strong>blue</strong></li>



<li>and so on.</li>
</ul>



<p>During visualization, we typically assign colors based on object index or object ID, cycling through the list if the number of objects exceeds the available colors.</p>
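<p>The cycling is just modular indexing. For example, with 10 tracked objects and the 8 colors above:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">COLORS = [
    [255, 0, 0],    # Red
    [0, 255, 0],    # Green
    [0, 0, 255],    # Blue
    [255, 255, 0],  # Yellow
    [255, 0, 255],  # Magenta
    [0, 255, 255],  # Cyan
    [255, 128, 0],  # Orange
    [128, 0, 255],  # Purple
]

# Objects 9 and 10 (indices 8 and 9) wrap back around to red and green
assigned = [COLORS[obj_idx % len(COLORS)] for obj_idx in range(10)]</pre>



<p>This guarantees every object gets a color even when there are more objects than palette entries, at the cost of repeated colors beyond the eighth object.</p>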



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Extract-First-Frame-Preparing-the-First-Frame-for-Multi-Object-Selection"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Extract-First-Frame-Preparing-the-First-Frame-for-Multi-Object-Selection">Extract First Frame: Preparing the First Frame for Multi-Object Selection</a></h3>



<p>For multi-object tracking, we again begin by extracting the first frame of the video. This frame is used as the interaction surface where users click on multiple objects they want to track.</p>



<p>The helper function below reads the first frame from the uploaded video.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="16">def extract_first_frame(video_path):
   """Extract first frame from video for point selection"""
   cap = cv2.VideoCapture(video_path)
   ret, frame = cap.read()
   cap.release()
   if ret:
       return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
   return None
</pre>



<p>First, we open the video using OpenCV’s <code data-enlighter-language="python" class="EnlighterJSRAW">VideoCapture</code>. Immediately after opening, we read a single frame. Since the video has just been opened, this corresponds to the very first frame.</p>



<p>Next, we release the video handle to free system resources and avoid locking the file for later processing steps.</p>



<p>OpenCV loads images in BGR format, but our visualization and model pipelines expect RGB images. Therefore, we convert the frame from BGR to RGB before returning it.</p>



<p>If frame extraction fails, the function safely returns <code data-enlighter-language="python" class="EnlighterJSRAW">None</code>, allowing the application to handle the error gracefully.</p>



<p>This function now allows users to click on <strong>multiple objects in the first frame</strong>, which we will use as prompts for tracking several objects simultaneously in the next step.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Tracking-Object-Function-Tracking-Multiple-Objects-with-Unique-IDs-Across-Video-Frames"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Tracking-Object-Function-Tracking-Multiple-Objects-with-Unique-IDs-Across-Video-Frames">Tracking Object Function: Tracking Multiple Objects with Unique IDs Across Video Frames</a></h3>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="17">def track_objects(video_path, points_list):
   """Track multiple objects through video based on clicked points"""
   if video_path is None:
       return None, "Please upload a video first"

   if not points_list or len(points_list) == 0:
       return None, "Please click on at least one object in the first frame"

   try:
       # Load video
       video_frames, _ = load_video(video_path)

       # Initialize session
       inference_session = processor.init_video_session(
           video=video_frames,
           inference_device=device,
           dtype=torch.bfloat16,
       )

       # Prepare points for all objects
       obj_ids = list(range(1, len(points_list) + 1))
       input_points = [[[[int(x), int(y)]] for x, y in points_list]]
       input_labels = [[[1] for _ in points_list]]  # All are foreground points

       # Add all objects to inference session
       processor.add_inputs_to_inference_session(
           inference_session=inference_session,
           frame_idx=0,
           obj_ids=obj_ids,
           input_points=input_points,
           input_labels=input_labels,
       )

       # First, segment objects on the first frame
       outputs = model(
           inference_session=inference_session,
           frame_idx=0,
       )
       first_frame_masks = processor.post_process_masks(
           [outputs.pred_masks],
           original_sizes=[[inference_session.video_height, inference_session.video_width]],
           binarize=False
       )[0]

       # Initialize video segments with first frame
       video_segments = {0: {
           obj_id: first_frame_masks[i]
           for i, obj_id in enumerate(inference_session.obj_ids)
       }}

       # Propagate through video
       for sam3_tracker_video_output in model.propagate_in_video_iterator(inference_session):
           video_res_masks = processor.post_process_masks(
               [sam3_tracker_video_output.pred_masks],
               original_sizes=[[inference_session.video_height, inference_session.video_width]],
               binarize=False
           )[0]
           video_segments[sam3_tracker_video_output.frame_idx] = {
               obj_id: video_res_masks[i]
               for i, obj_id in enumerate(inference_session.obj_ids)
           }

       # Create output video with masks
       output_path = "/tmp/output_tracked.mp4"
       fourcc = cv2.VideoWriter_fourcc(*'mp4v')
       height, width = video_frames[0].shape[:2]
       out = cv2.VideoWriter(output_path, fourcc, 30.0, (width, height))

       for idx in range(len(video_frames)):
           frame = video_frames[idx].copy().astype(np.uint8)

           if idx in video_segments:
               # Create overlay for all objects
               overlay = frame.copy().astype(np.float32)

               for obj_idx, (obj_id, masks) in enumerate(video_segments[idx].items()):
                   # Convert mask to float32 first, then to boolean
                   mask = masks[0].float().cpu().numpy() > 0.0

                   # Use different color for each object
                   color = COLORS[obj_idx % len(COLORS)]
                   overlay[mask] = np.array(color, dtype=np.float32)

               # Blend overlay with original frame
               frame = cv2.addWeighted(frame.astype(np.float32), 0.6, overlay, 0.4, 0).astype(np.uint8)

           out.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))

       out.release()

       status = f"✅ Successfully tracked {len(points_list)} object(s) through {len(video_segments)} frames"
       return output_path, status

   except Exception as e:
       import traceback
       return None, f"❌ Error: {str(e)}\n{traceback.format_exc()}"
</pre>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">track_objects()</code> function accepts:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">video_path</code>: Path to the uploaded video file.</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">points_list</code>: A list of <code data-enlighter-language="python" class="EnlighterJSRAW">(x, y)</code> coordinates where the user clicked on the first frame (one click per object).</li>
</ul>



<p>The goal is simple: “Given a video and multiple clicked points, track all selected objects through the entire video.”</p>



<p>If no video is uploaded, tracking cannot begin. The function returns:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">None</code>: No output video</li>



<li>A message explaining the issue</li>
</ul>



<p>Likewise, tracking requires at least one foreground prompt. SAM3 needs one or more positive points to know:</p>



<ul class="wp-block-list">
<li>Which objects to segment</li>



<li>Where those objects exist in the first frame</li>
</ul>



<p>If <code data-enlighter-language="python" class="EnlighterJSRAW">points_list</code> is empty, tracking is undefined.</p>



<p>Inside a <code data-enlighter-language="python" class="EnlighterJSRAW">try</code> block, the <code data-enlighter-language="python" class="EnlighterJSRAW">load_video()</code> function reads the video file, extracts all frames into memory, and returns them as a list of NumPy arrays.</p>



<p>Why load all frames?</p>



<p>Because SAM3 tracking requires:</p>



<ul class="wp-block-list">
<li>Access to the entire temporal sequence</li>



<li>Mask propagation across frames</li>



<li>Internal memory consistency</li>
</ul>



<p>Each frame shape is typically: <code data-enlighter-language="python" class="EnlighterJSRAW">(height, width, 3)</code>. Next, we initialize a video session using <code data-enlighter-language="python" class="EnlighterJSRAW">processor.init_video_session()</code>. This creates an internal tracking session that:</p>



<ul class="wp-block-list">
<li>Stores all video frames</li>



<li>Maintains model memory state</li>



<li>Manages object tracking buffers</li>
</ul>



<p>We also:</p>



<ul class="wp-block-list">
<li>Set the computation device (CPU or GPU)</li>



<li>Use <code data-enlighter-language="python" class="EnlighterJSRAW">bfloat16</code> for faster inference and lower memory usage</li>
</ul>



<p>This step prepares SAM3’s internal tracking mechanism to process the full video.</p>



<p>Now we prepare inputs for <strong>multiple objects</strong>. First, we build the object IDs with <code data-enlighter-language="python" class="EnlighterJSRAW">obj_ids = list(range(1, len(points_list) + 1))</code>. If the user clicked 3 points, this becomes:</p>



<p><code data-enlighter-language="python" class="EnlighterJSRAW">[1, 2, 3]</code></p>



<p>Each clicked point is treated as a separate object with its own ID.</p>



<p>Then we structure the coordinates:</p>



<p><code data-enlighter-language="python" class="EnlighterJSRAW">input_points = [[[[int(x), int(y)]] for x, y in points_list]]</code></p>



<p>SAM3 expects batch format:</p>



<p><code data-enlighter-language="python" class="EnlighterJSRAW">[batch][object][points][coordinates]</code></p>



<p>So this structure means:</p>



<ul class="wp-block-list">
<li>Batch size = 1</li>



<li>Multiple objects</li>



<li>One point per object</li>



<li>Each point has <code data-enlighter-language="python" class="EnlighterJSRAW">(x, y)</code></li>
</ul>



<p>We convert coordinates to integers because:</p>



<ul class="wp-block-list">
<li>Pixel indices must be integers</li>



<li>Mask indexing requires integer positions</li>
</ul>



<p>Each <code data-enlighter-language="python" class="EnlighterJSRAW">(x, y)</code> now represents a foreground location for a different object in frame 0.</p>



<p>Next, we define labels:</p>



<p><code data-enlighter-language="python" class="EnlighterJSRAW">input_labels = [[[1] for _ in points_list]]</code></p>



<p><code data-enlighter-language="python" class="EnlighterJSRAW">1</code> means:</p>



<ul class="wp-block-list">
<li>Foreground point</li>
</ul>



<p>If it were <code data-enlighter-language="python" class="EnlighterJSRAW">0</code>, it would mean:</p>



<ul class="wp-block-list">
<li>Background point</li>
</ul>



<p>So this tells SAM3: “Each of these clicked pixels belongs to a separate object.”</p>
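<p>To see how these structures expand, here is a standalone sketch with three hypothetical clicks (the coordinates are made up for illustration):</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group=""># Three hypothetical clicks on three different objects in frame 0
points_list = [(120.4, 80.9), (340.0, 210.5), (50.2, 400.7)]

# One ID per clicked object: [1, 2, 3]
obj_ids = list(range(1, len(points_list) + 1))

# [batch][object][points][coordinates] — integer pixels, one point per object
input_points = [[[[int(x), int(y)]] for x, y in points_list]]

# [batch][object][points] — 1 marks every point as foreground
input_labels = [[[1] for _ in points_list]]</pre>



<p>Note that the batch dimension stays 1; it is the object dimension that grows with each additional click.</p>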



<p>Then we inject everything into the inference session using <code data-enlighter-language="python" class="EnlighterJSRAW">processor.add_inputs_to_inference_session()</code>:</p>



<p>Here:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">frame_idx=0</code>: Objects exist in the first frame</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">obj_ids</code>: Multiple object identifiers</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">input_points</code>: Click locations</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">input_labels</code>: Foreground signals</li>
</ul>



<p>At this point, the model knows that multiple objects are present at the selected coordinates in frame 0.</p>



<p>Next, we explicitly run segmentation on frame 0. This produces:</p>



<ul class="wp-block-list">
<li>Raw mask logits</li>



<li>Low-resolution mask predictions for all objects</li>
</ul>



<p>Then we post-process the masks using <code data-enlighter-language="python" class="EnlighterJSRAW">processor.post_process_masks(...)</code>. This step:</p>



<ul class="wp-block-list">
<li>Resizes masks to original resolution</li>



<li>Converts internal representation into full-size masks</li>



<li>Keeps them as float probabilities (not binarized)</li>
</ul>



<p>We now have full-resolution masks for all selected objects in frame 0.</p>



<p>Now we initialize storage on <strong>Lines 46-49</strong>:</p>



<p>This means:</p>



<ul class="wp-block-list">
<li>Frame 0 results are stored first</li>



<li>Each object ID maps to its own mask</li>
</ul>



<p>Structure becomes:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">video_segments = {
   0: {
       1: mask_for_object_1,
       2: mask_for_object_2,
       ...
   }
}</pre>



<p>Next, we propagate through the video using <code data-enlighter-language="python" class="EnlighterJSRAW">model.propagate_in_video_iterator()</code>. This runs SAM3’s tracking mechanism.</p>



<p>What happens internally:</p>



<ul class="wp-block-list">
<li>It uses memory from frame 0</li>



<li>Matches object appearance across frames</li>



<li>Maintains identity consistency for each object</li>



<li>Predicts masks for each new frame</li>
</ul>



<p>For every frame:</p>



<ul class="wp-block-list">
<li>We resize masks</li>



<li>Store them in a dictionary per object</li>
</ul>



<p>Final structure:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="" data-enlighter-group="">video_segments = {
 0: {1: mask0_1, 2: mask0_2},
 1: {1: mask1_1, 2: mask1_2},
 2: {1: mask2_1, 2: mask2_2},
 ...
}</pre>



<p>Now we have segmentation results for all objects across the full video.</p>



<p>Next, we define:</p>



<ul class="wp-block-list">
<li>Output video path</li>



<li>Codec format (<code data-enlighter-language="python" class="EnlighterJSRAW">mp4v</code>)</li>
</ul>



<p>We get the resolution from the first frame to ensure the output video matches the input resolution. Then we initialize <code data-enlighter-language="python" class="EnlighterJSRAW">cv2.VideoWriter()</code> which sets:</p>



<ul class="wp-block-list">
<li>Output path</li>



<li>Codec</li>



<li>30 FPS</li>



<li>Frame dimensions</li>
</ul>



<p>Now we iterate through each frame. We copy each frame to avoid modifying the original.</p>



<p>If masks exist for that frame:</p>



<ul class="wp-block-list">
<li>We create an overlay</li>



<li>For each object:
<ul class="wp-block-list">
<li>Move the mask tensor to the CPU</li>



<li>Convert it to a NumPy array</li>



<li>Threshold the probabilities into a Boolean mask</li>
</ul>
</li>
</ul>



<p>Now:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">True</code>: Object pixels</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">False</code>: Background</li>
</ul>
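<p>The thresholding step is a one-liner in NumPy. The sketch below assumes a 0.5 cutoff on a tiny probability map; the exact threshold in the real pipeline may differ:</p>

```python
import numpy as np

# Hedged sketch: per-pixel probabilities -> Boolean object mask.
# The 0.5 threshold is an assumption for illustration.
probs = np.array([[0.9, 0.2],
                  [0.6, 0.1]], dtype=np.float32)
mask = probs > 0.5          # True = object pixel, False = background
num_object_pixels = int(mask.sum())  # True counts as 1 when summed
```
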



<p>Unlike the single-object version, here we:</p>



<ul class="wp-block-list">
<li>Assign a different color for each object</li>



<li>Use <code data-enlighter-language="python" class="EnlighterJSRAW">COLORS[obj_idx % len(COLORS)]</code></li>



<li>Overlay masks for multiple objects</li>
</ul>



<p>Then we blend using <code data-enlighter-language="python" class="EnlighterJSRAW">cv2.addWeighted()</code>. This results in:</p>



<ul class="wp-block-list">
<li>60% original frame</li>



<li>40% colored overlay</li>



<li>Smooth visualization</li>
</ul>
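<p>Under the hood, <code data-enlighter-language="python" class="EnlighterJSRAW">cv2.addWeighted(frame, 0.6, overlay, 0.4, 0)</code> is a per-pixel weighted sum. The NumPy equivalent below shows the arithmetic on a single pixel, blending a gray frame pixel with a red overlay pixel:</p>

```python
import numpy as np

# Hedged sketch: the 60/40 blend as plain arithmetic on one pixel.
frame_pixel = np.array([100, 100, 100], dtype=np.float32)  # original
overlay_pixel = np.array([255, 0, 0], dtype=np.float32)    # red mask color
blended = 0.6 * frame_pixel + 0.4 * overlay_pixel
```

<p>The object regions shift toward their assigned color while the underlying frame remains visible, which is what produces the smooth, semi-transparent overlays in the output video.</p>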



<p>OpenCV expects BGR format, so we convert using <code data-enlighter-language="python" class="EnlighterJSRAW">cv2.COLOR_RGB2BGR</code>.</p>



<p>Finally, <code data-enlighter-language="python" class="EnlighterJSRAW">out.release()</code> finalizes the file, writes any remaining buffers, and properly closes the video. Without this step, the output video may become corrupted.</p>



<p>At the end, we return:</p>



<ul class="wp-block-list">
<li>Path to the saved video</li>



<li>A success message indicating how many objects were tracked</li>
</ul>



<p>If anything fails:</p>



<ul class="wp-block-list">
<li>The function safely returns the error</li>



<li>The application does not crash</li>



<li>The traceback is included for debugging</li>
</ul>
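<p>That error-handling contract can be sketched as a small wrapper. This is an illustration in the spirit of the description above, not the tutorial's actual code; <code data-enlighter-language="python" class="EnlighterJSRAW">safe_track</code> and the failing function are hypothetical names:</p>

```python
import traceback

# Hedged sketch: return (result, status) instead of raising, so the
# Gradio app keeps running and the traceback is surfaced for debugging.
def safe_track(track_fn, *args):
    try:
        return track_fn(*args), "Tracking complete"
    except Exception:
        return None, f"Error during tracking:\n{traceback.format_exc()}"


def broken(_video, _points):
    # Simulated failure inside the tracking pipeline.
    raise ValueError("no frames decoded")


result, status = safe_track(broken, "video.mp4", [(10, 20)])
```
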



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Launch-the-Gradio-Application-4"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Launch-the-Gradio-Application-4">Launch the Gradio Application</a></h3>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="18"># Create Gradio interface with blocks for better control
with gr.Blocks(title="SAM3 Multi-Object Video Tracker") as demo:
   gr.Markdown("# 🎯 SAM3 Multi-Object Video Tracker")
   gr.Markdown("Upload a video and click on multiple objects in the first frame to track them. Each object gets a different color!")

   with gr.Row():
       with gr.Column():
           video_input = gr.Video(label="Upload Video")
           first_frame = gr.Image(label="Click on objects to track (multiple clicks supported)", type="numpy")

           with gr.Row():
               clear_points_btn = gr.Button("Clear Points", variant="secondary")
               track_btn = gr.Button("Track Objects", variant="primary")

           points_display = gr.Textbox(label="Selected Points", interactive=False, lines=5)

       with gr.Column():
           video_output = gr.Video(label="Tracked Video")
           status_output = gr.Textbox(label="Status")
           gr.Markdown("""
           ### Color Legend:
           - 🔴 Red - Object 1
           - 🟢 Green - Object 2
           - 🔵 Blue - Object 3
           - 🟡 Yellow - Object 4
           - 🩷 Magenta - Object 5
           - 🩵 Cyan - Object 6
           - 🟠 Orange - Object 7
           - 🟣 Purple - Object 8
           """)

   # Store clicked points and original frame
   clicked_points = gr.State([])
   original_frame = gr.State(None)

   # Extract first frame when video is uploaded
   def on_video_upload(video):
       if video:
           frame = extract_first_frame(video)
           return frame, frame, [], "Upload complete. Click on objects you want to track."
       return None, None, [], ""

   video_input.change(
       on_video_upload,
       inputs=[video_input],
       outputs=[first_frame, original_frame, clicked_points, status_output]
   )

   # Handle click on first frame
   def on_click(img, orig_frame, points, evt: gr.SelectData):
       if orig_frame is None:
           return img, points, "Please upload a video first"

       x, y = evt.index[0], evt.index[1]

       # Add point to list
       points.append((x, y))

       # Draw all points on the image
       img_copy = orig_frame.copy()
       for i, (px, py) in enumerate(points):
           color = COLORS[i % len(COLORS)]
           cv2.circle(img_copy, (px, py), 8, tuple(color), -1)
           cv2.circle(img_copy, (px, py), 10, (255, 255, 255), 2)
           # Add number label
           cv2.putText(img_copy, str(i+1), (px+15, py+5),
                      cv2.FONT_HERSHEY_SIMPLEX, 0.6, tuple(color), 2)

       points_text = "\n".join([f"Object {i+1}: ({x}, {y})" for i, (x, y) in enumerate(points)])

       return img_copy, points, points_text

   first_frame.select(
       on_click,
       inputs=[first_frame, original_frame, clicked_points],
       outputs=[first_frame, clicked_points, points_display]
   )

   # Clear points button
   def clear_points(orig_frame):
       return orig_frame, [], ""

   clear_points_btn.click(
       clear_points,
       inputs=[original_frame],
       outputs=[first_frame, clicked_points, points_display]
   )

   # Track button
   track_btn.click(
       track_objects,
       inputs=[video_input, clicked_points],
       outputs=[video_output, status_output]
   )

# Launch
demo.launch(debug=True)
</pre>



<p>The Gradio interface is built using <code data-enlighter-language="python" class="EnlighterJSRAW">gr.Blocks</code>, which allows us to design a structured, interactive layout with full control over components and events. The goal is simple: “Create an interactive UI where a user uploads a video, clicks on multiple objects in the first frame, and then tracks them across the entire video.”</p>



<p>At the top of the interface, we display a title and short instructions using <code data-enlighter-language="python" class="EnlighterJSRAW">gr.Markdown()</code>. This helps users immediately understand what the application does and what steps they need to follow. Clear instructions reduce confusion and improve usability.</p>



<p>Next, we organize the layout using a <code data-enlighter-language="python" class="EnlighterJSRAW">gr.Row()</code>. Inside that row, we create two columns using <code data-enlighter-language="python" class="EnlighterJSRAW">gr.Column()</code>. The left column handles inputs and interactions. The right column displays outputs and tracking results. This separation keeps the workflow intuitive and clean.</p>



<p>In the left column, we first create a video upload component. This allows the user to upload a video file from their system. Once a video is uploaded, the backend receives the file path, which is later used to extract frames and perform tracking.</p>



<p>Below the video upload, we place an image component that displays the first frame of the uploaded video. This image is interactive and supports click events. We set its type to NumPy so the backend receives the image as a NumPy array. This is important because we draw circles and labels on the frame using OpenCV.</p>



<p>Under the image, we add two buttons side by side. The first button clears selected points. The second button triggers the tracking pipeline. The track button is styled as the primary button so it visually stands out as the main action.</p>



<p>Below the buttons, we add a textbox that displays selected points. This textbox is non-interactive, meaning users cannot edit it manually. It simply shows a formatted list of selected objects and their coordinates. This helps users confirm that they clicked the correct locations before starting tracking.</p>



<p>In the right column, we create a video output component. This will display the processed video returned by the tracking function. Below that, we add a status textbox to show messages such as upload confirmation, tracking success, or error details. Finally, we display a color legend using Markdown so users understand which color corresponds to which object during visualization.</p>



<p>To manage interaction data across events, we use Gradio’s State component. One state variable stores the list of clicked points. This list grows as the user clicks on multiple objects. Another state variable stores the original first frame. This is important because every time a new click occurs, we redraw all points on a clean copy of the original frame instead of repeatedly drawing over an already modified image. Without this, the markers would stack incorrectly and distort the visualization.</p>



<p>When a video is uploaded, a function <code data-enlighter-language="python" class="EnlighterJSRAW">on_video_upload()</code> is triggered. This function extracts the first frame from the video and returns four values:</p>



<ul class="wp-block-list">
<li>The extracted first frame for display</li>



<li>The same frame stored as the original clean frame</li>



<li>An empty list of clicked points</li>



<li>A status message confirming upload completion</li>
</ul>



<p>This ensures that each new upload resets the application state properly.</p>



<p>When the user clicks on the first frame image, another function <code data-enlighter-language="python" class="EnlighterJSRAW">on_click()</code> handles the event. The click event provides the pixel coordinates of the selected location. First, we check whether a video has been uploaded. If not, we return a message asking the user to upload one.</p>



<p>If a frame exists, we extract the <code data-enlighter-language="python" class="EnlighterJSRAW">x</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">y</code> coordinates from the click event. We then append this coordinate pair to the stored list of points. After updating the list, we redraw the image. We copy the original clean frame and loop over all stored points. For each point:</p>



<ul class="wp-block-list">
<li>We select a color from the predefined <code data-enlighter-language="python" class="EnlighterJSRAW">COLORS</code> list</li>



<li>We draw a filled circle at the clicked location</li>



<li>We draw a white border around the circle for better visibility</li>



<li>We add a numeric label next to the point indicating Object 1, Object 2, and so on</li>
</ul>



<p>This ensures each selected object is visually distinct and clearly labeled.</p>



<p>We also generate formatted text listing all selected objects and their coordinates. This text is displayed in the textbox so the user can verify selections.</p>



<p>The Clear Points button is connected to a function that resets the interface. It restores the original clean frame, empties the clicked points list, and clears the points display textbox. This allows the user to start fresh without reloading the video.</p>



<p>The Track Objects button is connected to the tracking function. When clicked, it sends:</p>



<ul class="wp-block-list">
<li>The uploaded video</li>



<li>The stored list of clicked points</li>
</ul>



<p>to the backend tracking pipeline. The tracking function processes the video, segments and propagates masks for all selected objects, and returns:</p>



<ul class="wp-block-list">
<li>The path to the processed output video</li>



<li>A status message</li>
</ul>



<p>These are then displayed in the output video component and the status textbox.</p>



<p>Finally, the application is launched with debug mode enabled. Debug mode provides detailed logs in case errors occur during development, making troubleshooting easier.</p>



<p>Overall, the interface follows this flow:</p>



<ul class="wp-block-list">
<li>The user uploads a video.</li>



<li>The first frame is extracted and displayed.</li>



<li>The user clicks multiple objects.</li>



<li>Points are stored and visualized with colors and labels.</li>



<li>The user presses Track Objects.</li>



<li>The backend processes the video and returns the tracked result.</li>



<li>The output video and status message are displayed.</li>
</ul>



<p>State management ensures smooth interaction across multiple events, and the two-column layout keeps inputs and outputs clearly separated.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Output-Multi-Object-Video-Segmentation-and-Tracking-Results"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Output-Multi-Object-Video-Segmentation-and-Tracking-Results">Output: Multi-Object Video Segmentation and Tracking Results</a></h3>



<figure style="text-align: center; max-width: 700px; margin: auto;">
  <!-- Paste your embed code below -->
<iframe width="700" height="394" src="https://www.youtube.com/embed/1hvRp8C_k6Q" title="" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

  <!-- Optional Caption -->
  <figcaption style="align: center; margin-top: 8px;">
    <strong>Figure 4:</strong> Multi-Object Video Segmentation and Tracking Demo (source: GIF by the author).
  </figcaption>
</figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<div id="pitch" style="padding: 40px; width: 100%; background-color: #F4F6FA;">
	<h3>What's next? We recommend <a target="_blank" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend">PyImageSearch University</a>.</h3>

	<script src="https://fast.wistia.com/embed/medias/kno0cmko2z.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_kno0cmko2z videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img decoding="async" src="https://fast.wistia.com/embed/medias/kno0cmko2z/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>

	<div style="margin-top: 32px; margin-bottom: 32px; ">
		<strong>Course information:</strong><br/>
		86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: April 2026<br/>
		<span style="color: #169FE6;">★★★★★</span> 4.84 (128 Ratings) • 16,000+ Students Enrolled
	</div>

	<p><strong>I strongly believe that if you had the right teacher you could <em>master</em> computer vision and deep learning.</strong></p>

	<p>Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?</p>

	<p>That’s <em>not</em> the case.</p>

	<p>All you need to master computer vision and deep learning is for someone to explain things to you in <em>simple, intuitive</em> terms. <em>And that’s exactly what I do</em>. My mission is to change education and how complex Artificial Intelligence topics are taught.</p>

	<p>If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to <em>successfully</em> and <em>confidently</em> apply computer vision to your work, research, and projects. Join me in computer vision mastery.</p>

	<p><strong>Inside PyImageSearch University you'll find:</strong></p>

	<ul style="margin-left: 0px;">
		<li style="list-style: none;">&check; <strong>86+ courses</strong> on essential computer vision, deep learning, and OpenCV topics</li>
		<li style="list-style: none;">&check; <strong>86 Certificates</strong> of Completion</li>
		<li style="list-style: none;">&check; <strong>115+ hours hours</strong> of on-demand video</li>
		<li style="list-style: none;">&check; <strong>Brand new courses released <em>regularly</em></strong>, ensuring you can keep up with state-of-the-art techniques</li>
		<li style="list-style: none;">&check; <strong>Pre-configured Jupyter Notebooks in Google Colab</strong></li>
		<li style="list-style: none;">&check; Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)</li>
		<li style="list-style: none;">&check; Access to <strong>centralized code repos for <em>all</em> 540+ tutorials</strong> on PyImageSearch</li>
		<li style="list-style: none;">&check; <strong> Easy one-click downloads</strong> for code, datasets, pre-trained models, etc.</li>
		<li style="list-style: none;">&check; <strong>Access</strong> on mobile, laptop, desktop, etc.</li>
	</ul>

	<p style="text-align: center;">
		<a target="_blank" class="button link" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend" style="background-color: #6DC713; border-bottom: none;">Click here to join PyImageSearch University</a>
	</p>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Summary"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Summary">Summary</a></h2>



<p>In this tutorial, we extended SAM3 from image-based segmentation workflows to full video understanding and tracking. We first built pipelines that detect, segment, and track concepts across videos using simple text prompts, enabling automatic tracking of objects such as people or vehicles without manual annotation.</p>



<p>Next, we moved to streaming inference, where SAM 3 processes frames continuously from a webcam while maintaining tracking memory across time. This allowed us to build real-time segmentation and tracking systems that operate on live video streams.</p>



<p>We then explored interactive tracking workflows, where users select objects directly using click prompts. Starting from single-object tracking, we progressed to multi-object tracking, enabling several objects to be tracked simultaneously with consistent identity and color-coded visualization.</p>



<p>By the end of this tutorial, we developed complete end-to-end systems that combine detection, segmentation, tracking, and interactive workflows into practical applications using Gradio interfaces. Together with the previous parts of this series, we now have a full understanding of how SAM 3 enables concept-aware segmentation and tracking across both images and videos, opening the door to intelligent video editing, annotation, and analysis workflows.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Citation-Information"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Citation-Information">Citation Information</a></h3>



<p><strong>Thakur, P.</strong> “SAM 3 for Video: Concept-Aware Segmentation and Object Tracking,” <em>PyImageSearch</em>, S. Huot, G. Kudriavtsev, and A. Sharma, eds., 2026, <a href="https://pyimg.co/luxfd" target="_blank" rel="noreferrer noopener">https://pyimg.co/luxfd</a></p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="classic" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="SAM 3 for Video: Concept-Aware Segmentation and Object Tracking" data-enlighter-group="19">@incollection{Thakur_2026_sam-3-sam3-for-video-concept-aware-segmentation-and-object-tracking,
  author = {Piyush Thakur},
  title = {{SAM 3 for Video: Concept-Aware Segmentation and Object Tracking}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Georgii Kudriavtsev and Aditya Sharma},
  year = {2026},
  url = {https://pyimg.co/luxfd},
}
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), </strong><em><strong>simply enter your email address in the form below!</strong></em></p>



<div id="download-the-code" class="post-cta-wrap">
<div class="gpd-post-cta">
	<div class="gpd-post-cta-content">
		

			<div class="gpd-post-cta-top">
				<div class="gpd-post-cta-top-image"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1" alt="" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1 410w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=126x174&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=252x348&lossy=2&strip=1&webp=1 252w" sizes="(max-width: 410px) 100vw, 410px" /></div>
				
				<div class="gpd-post-cta-top-title"><h4>Download the Source Code and FREE 17-page Resource Guide</h4></div>
				<div class="gpd-post-cta-top-desc"><p>Enter your email address below to get a .zip of the code and a <strong>FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning.</strong> Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!</p></div>


			</div>

			<div class="gpd-post-cta-bottom">
				<form id="footer-cta-code" class="footer-cta" action="https://www.getdrip.com/forms/4130035/submissions" method="post" target="blank" data-drip-embedded-form="4130035">
					<input name="fields[email]" type="email" value="" placeholder="Your email address" class="form-control" />

					<button type="submit">Download the code!</button>

					<div style="display: none;" aria-hidden="true"><label for="website">Website</label><br /><input type="text" id="website" name="website" tabindex="-1" autocomplete="false" value="" /></div>
				</form>
			</div>


		
	</div>

</div>
</div>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/03/02/sam-3-for-video-concept-aware-segmentation-and-object-tracking/">SAM 3 for Video: Concept-Aware Segmentation and Object Tracking</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)</title>
		<link>https://pyimagesearch.com/2026/02/23/vector-search-using-ollama-for-retrieval-augmented-generation-rag/</link>
		
		<dc:creator><![CDATA[Vikram Singh]]></dc:creator>
		<pubDate>Mon, 23 Feb 2026 13:45:00 +0000</pubDate>
				<category><![CDATA[AI & Machine Learning]]></category>
		<category><![CDATA[LLMOps]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[Vector Databases]]></category>
		<category><![CDATA[approximate nearest neighbor]]></category>
		<category><![CDATA[citation support]]></category>
		<category><![CDATA[embeddings]]></category>
		<category><![CDATA[faiss]]></category>
		<category><![CDATA[hnsw]]></category>
		<category><![CDATA[llm grounding]]></category>
		<category><![CDATA[local llm]]></category>
		<category><![CDATA[ollama]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[rag]]></category>
		<category><![CDATA[retrieval augmented generation]]></category>
		<category><![CDATA[semantic search]]></category>
		<category><![CDATA[sentence transformers]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[vector search]]></category>
		<guid isPermaLink="false">https://pyimagesearch.com/?p=52851</guid>

					<description><![CDATA[<p>Table of Contents Vector Search Using Ollama for Retrieval-Augmented Generation (RAG) How Vector Search Powers Retrieval-Augmented Generation (RAG) From Search to Context The Flow of Meaning Putting It All Together What Is Retrieval-Augmented Generation (RAG)? The Retrieve-Read-Generate Architecture Explained Why&#8230;</p>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/02/23/vector-search-using-ollama-for-retrieval-augmented-generation-rag/">Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity" id="TOC"/>


<div class="yoast-breadcrumbs"><span><span><a href="https://pyimagesearch.com/">Home</a></span></div>


<div class="toc">
<hr class="TOC"/>
<p class="has-large-font-size"><strong>Table of Contents</strong></p>
<ul>
    <li id="TOC-h1-Vector-Search-Using-Ollama-for-Retrieval-Augmented-Generation-RAG">
        <a rel="noopener" target="_blank" href="#h1-Vector-Search-Using-Ollama-for-Retrieval-Augmented-Generation-RAG">
            Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)
        </a>
    </li>

    <li id="TOC-h2-How-Vector-Search-Powers-Retrieval-Augmented-Generation-RAG">
        <a rel="noopener" target="_blank" href="#h2-How-Vector-Search-Powers-Retrieval-Augmented-Generation-RAG">
            How Vector Search Powers Retrieval-Augmented Generation (RAG)
        </a>
        <ul>
            <li id="TOC-h3-From-Search-to-Context">
                <a rel="noopener" target="_blank" href="#h3-From-Search-to-Context">
                    From Search to Context
                </a>
            </li>
            <li id="TOC-h3-The-Flow-of-Meaning">
                <a rel="noopener" target="_blank" href="#h3-The-Flow-of-Meaning">
                    The Flow of Meaning
                </a>
            </li>
            <li id="TOC-h3-Putting-It-All-Together-Vector-Search">
                <a rel="noopener" target="_blank" href="#h3-Putting-It-All-Together-Vector-Search">
                    Putting It All Together
                </a>
            </li>
        </ul>
    </li>

    <li id="TOC-h2-What-Is-Retrieval-Augmented-Generation-RAG">
        <a rel="noopener" target="_blank" href="#h2-What-Is-Retrieval-Augmented-Generation-RAG">
            What Is Retrieval-Augmented Generation (RAG)?
        </a>
        <ul>
            <li id="TOC-h3-The-Retrieve-Read-Generate-Architecture-Explained">
                <a rel="noopener" target="_blank" href="#h3-The-Retrieve-Read-Generate-Architecture-Explained">
                    The Retrieve-Read-Generate Architecture Explained
                </a>
            </li>
            <li id="TOC-h3-Why-Retrieval-Augmented-Generation-RAG-Improves-LLM-Accuracy">
                <a rel="noopener" target="_blank" href="#h3-Why-Retrieval-Augmented-Generation-RAG-Improves-LLM-Accuracy">
                    Why Retrieval-Augmented Generation (RAG) Improves LLM Accuracy
                </a>
            </li>
            <li id="TOC-h3-The-Broader-Picture-A-Hybrid-of-Search-and-Generation">
                <a rel="noopener" target="_blank" href="#h3-The-Broader-Picture-A-Hybrid-of-Search-and-Generation">
                    The Broader Picture: A Hybrid of Search and Generation
                </a>
            </li>
            <li id="TOC-h3-Key-Takeaway">
                <a rel="noopener" target="_blank" href="#h3-Key-Takeaway">
                    Key Takeaway
                </a>
            </li>
        </ul>
    </li>

    <li id="TOC-h2-How-to-Build-a-RAG-Pipeline-with-FAISS-and-Ollama-Local-LLM">
        <a rel="noopener" target="_blank" href="#h2-How-to-Build-a-RAG-Pipeline-with-FAISS-and-Ollama-Local-LLM">
            How to Build a RAG Pipeline with FAISS and Ollama (Local LLM)
        </a>
        <ul>
            <li id="TOC-h3-Step-1-Implementing-HNSW-Vector-Search-with-FAISS-for-RAG">
                <a rel="noopener" target="_blank" href="#h3-Step-1-Implementing-HNSW-Vector-Search-with-FAISS-for-RAG">
                    Step 1: Implementing HNSW Vector Search with FAISS for RAG
                </a>
            </li>
            <li id="TOC-h3-Step-2-Prompt-Engineering-for-Retrieval-Augmented-Generation-RAG">
                <a rel="noopener" target="_blank" href="#h3-Step-2-Prompt-Engineering-for-Retrieval-Augmented-Generation-RAG">
                    Step 2: Prompt Engineering for Retrieval-Augmented Generation (RAG)
                </a>
            </li>
            <li id="TOC-h3-Step-3-Generating-Grounded-Answers-with-Ollama-Local-LLM">
                <a rel="noopener" target="_blank" href="#h3-Step-3-Generating-Grounded-Answers-with-Ollama-Local-LLM">
                    Step 3: Generating Grounded Answers with Ollama Local LLM
                </a>
            </li>
            <li id="TOC-h3-Adding-Feedback-Loops-to-Improve-Retrieval-Accuracy">
                <a rel="noopener" target="_blank" href="#h3-Adding-Feedback-Loops-to-Improve-Retrieval-Accuracy">
                    Adding Feedback Loops to Improve Retrieval Accuracy
                </a>
            </li>
            <li id="TOC-h3-Putting-It-All-Together-RAG-Pipeline">
                <a rel="noopener" target="_blank" href="#h3-Putting-It-All-Together-RAG-Pipeline">
                    Putting It All Together
                </a>
            </li>
        </ul>
    </li>

    <li id="TOC-h2-Configuring-Your-Development-Environment-Setting-Up-Ollama-and-FAISS-for-a-Local-RAG-Pipeline">
        <a rel="noopener" target="_blank" href="#h2-Configuring-Your-Development-Environment-Setting-Up-Ollama-and-FAISS-for-a-Local-RAG-Pipeline">
            Configuring Your Development Environment: Setting Up Ollama and FAISS for a Local RAG Pipeline
        </a>
        <ul>
            <li id="TOC-h3-Optional-Dependencies">
                <a rel="noopener" target="_blank" href="#h3-Optional-Dependencies">
                    Optional Dependencies
                </a>
            </li>
            <li id="TOC-h3-Local-LLM-Setup-Ollama">
                <a rel="noopener" target="_blank" href="#h3-Local-LLM-Setup-Ollama">
                    Local LLM Setup (Ollama)
                </a>
            </li>
        </ul>
    </li>

    <li id="TOC-h2-Implementation-Walkthrough">
        <a rel="noopener" target="_blank" href="#h2-Implementation-Walkthrough">
            Implementation Walkthrough
        </a>
        <ul>
            <li id="TOC-h3-Configuration-config-py">
                <a rel="noopener" target="_blank" href="#h3-Configuration-config-py">
                    Configuration (config.py)
                </a>
            </li>
        </ul>
    </li>

    <li id="TOC-h2-Integrating-Ollama-with-FAISS-Vector-Search-for-RAG">
        <a rel="noopener" target="_blank" href="#h2-Integrating-Ollama-with-FAISS-Vector-Search-for-RAG">
            Integrating Ollama with FAISS Vector Search for RAG
        </a>
        <ul>
            <li id="TOC-h3-Overview-and-Setup">
                <a rel="noopener" target="_blank" href="#h3-Overview-and-Setup">
                    Overview and Setup
                </a>
            </li>
            <li id="TOC-h3-Health-Check-and-Model-Discovery">
                <a rel="noopener" target="_blank" href="#h3-Health-Check-and-Model-Discovery">
                    Health Check and Model Discovery
                </a>
            </li>
            <li id="TOC-h3-Making-the-Ollama-Call">
                <a rel="noopener" target="_blank" href="#h3-Making-the-Ollama-Call">
                    Making the Ollama Call
                </a>
            </li>
            <li id="TOC-h3-Optional-Cloud-Fallback-OpenAI">
                <a rel="noopener" target="_blank" href="#h3-Optional-Cloud-Fallback-OpenAI">
                    Optional: Cloud Fallback (OpenAI)
                </a>
            </li>
            <li id="TOC-h3-Selecting-the-Top-k-Relevant-Chunks">
                <a rel="noopener" target="_blank" href="#h3-Selecting-the-Top-k-Relevant-Chunks">
                    Selecting the Top-k Relevant Chunks
                </a>
            </li>
            <li id="TOC-h3-Splitting-Answers-into-Sentences">
                <a rel="noopener" target="_blank" href="#h3-Splitting-Answers-into-Sentences">
                    Splitting Answers into Sentences
                </a>
            </li>
            <li id="TOC-h3-Computing-Sentence-Support">
                <a rel="noopener" target="_blank" href="#h3-Computing-Sentence-Support">
                    Computing Sentence Support
                </a>
            </li>
            <li id="TOC-h3-Formatting-and-Styling">
                <a rel="noopener" target="_blank" href="#h3-Formatting-and-Styling">
                    Formatting and Styling
                </a>
            </li>
            <li id="TOC-h3-The-Core-generate-rag-response">
                <a rel="noopener" target="_blank" href="#h3-The-Core-generate-rag-response">
                    The Core: generate_rag_response()
                </a>
            </li>
            <li id="TOC-h3-Summary-of-the-Utilities">
                <a rel="noopener" target="_blank" href="#h3-Summary-of-the-Utilities">
                    Summary of the Utilities
                </a>
            </li>
        </ul>
    </li>

    <li id="TOC-h2-Running-a-Local-RAG-Pipeline-with-Ollama-and-FAISS">
        <a rel="noopener" target="_blank" href="#h2-Running-a-Local-RAG-Pipeline-with-Ollama-and-FAISS">
            Running a Local RAG Pipeline with Ollama and FAISS
        </a>
        <ul>
            <li id="TOC-h3-Imports-and-Module-Wiring">
                <a rel="noopener" target="_blank" href="#h3-Imports-and-Module-Wiring">
                    Imports and Module Wiring
                </a>
            </li>
            <li id="TOC-h3-Ensure-Embeddings-load-or-build-once">
                <a rel="noopener" target="_blank" href="#h3-Ensure-Embeddings-load-or-build-once">
                    Ensure Embeddings (load or build once)
                </a>
            </li>
            <li id="TOC-h3-Ensure-Indexes-Flat-must-exist-HNSW-optional">
                <a rel="noopener" target="_blank" href="#h3-Ensure-Indexes-Flat-must-exist-HNSW-optional">
                    Ensure Indexes (Flat must exist; HNSW is optional)
                </a>
            </li>
            <li id="TOC-h3-Interactive-QA-Loop-Optional-Mode">
                <a rel="noopener" target="_blank" href="#h3-Interactive-QA-Loop-Optional-Mode">
                    Interactive Q&amp;A Loop — Optional Mode
                </a>
            </li>
            <li id="TOC-h3-Pretty-Printing-the-Answer-and-Context">
                <a rel="noopener" target="_blank" href="#h3-Pretty-Printing-the-Answer-and-Context">
                    Pretty Printing the Answer and Context (optional prompt/support)
                </a>
            </li>
            <li id="TOC-h3-CLI-Entry-Point-main-flags-loading-answering">
                <a rel="noopener" target="_blank" href="#h3-CLI-Entry-Point-main-flags-loading-answering">
                    CLI Entry Point (main) — flags, loading, answering
                </a>
            </li>
            <li id="TOC-h3-Standard-Python-Entrypoint">
                <a rel="noopener" target="_blank" href="#h3-Standard-Python-Entrypoint">
                    Standard Python Entrypoint
                </a>
            </li>
        </ul>
    </li>

    <li id="TOC-h2-Tiny-Gotchas-and-Tips">
        <a rel="noopener" target="_blank" href="#h2-Tiny-Gotchas-and-Tips">
            Tiny Gotchas and Tips
        </a>
    </li>

    <li id="TOC-h2-How-to-Run-a-Local-RAG-System-with-Ollama-and-FAISS">
        <a rel="noopener" target="_blank" href="#h2-How-to-Run-a-Local-RAG-System-with-Ollama-and-FAISS">
            How to Run a Local RAG System with Ollama and FAISS
        </a>
    </li>

    <li id="TOC-h2-Example-Output">
        <a rel="noopener" target="_blank" href="#h2-Example-Output">
            Example Output
        </a>
    </li>

    <li id="TOC-h2-What-You-Learned-Building-a-Production-Ready-Local-RAG-System-with-Ollama-and-FAISS">
        <a rel="noopener" target="_blank" href="#h2-What-You-Learned-Building-a-Production-Ready-Local-RAG-System-with-Ollama-and-FAISS">
            What You Learned: Building a Production-Ready Local RAG System with Ollama and FAISS
        </a>
    </li>

    <li id="TOC-h2-Summary">
        <a rel="noopener" target="_blank" href="#h2-Summary">
            Summary
        </a>
        <ul>
            <li id="TOC-h3-Citation-Information">
                <a rel="noopener" target="_blank" href="#h3-Citation-Information">
                    Citation Information
                </a>
            </li>
        </ul>
    </li>
</ul>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h1-Vector-Search-Using-Ollama-for-Retrieval-Augmented-Generation-RAG"/>



<h2 class="wp-block-heading"><a href="#TOC-h1-Vector-Search-Using-Ollama-for-Retrieval-Augmented-Generation-RAG">Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)</a></h2>



<p>In the previous lessons, you learned how to generate text embeddings, store them efficiently, and perform fast vector search using FAISS. Now, it’s time to put that search power to use — by connecting it with a language model to build a complete Retrieval-Augmented Generation (RAG) pipeline.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/vector-search-using-ollama-for-rag-featured.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="940" height="780" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-using-ollama-for-rag-featured.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52900" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-using-ollama-for-rag-featured.png?size=126x105&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-using-ollama-for-rag-featured-300x249.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-using-ollama-for-rag-featured.png?size=378x314&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-using-ollama-for-rag-featured.png?size=504x418&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-using-ollama-for-rag-featured.png?size=630x523&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-using-ollama-for-rag-featured-768x637.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-using-ollama-for-rag-featured.png?lossy=2&amp;strip=1&amp;webp=1 940w" sizes="(max-width: 630px) 100vw, 630px" /></a></figure></div>


<p>RAG is the bridge between retrieval and reasoning — it lets your LLM (large language model) access facts it hasn’t memorized. Instead of relying solely on pre-training, the model fetches relevant context from your own data before answering, ensuring responses that are accurate, up-to-date, and grounded in evidence.</p>



<p>Think of it as asking a well-trained assistant a question: they don’t guess — they quickly look up the right pages in your company wiki, then answer with confidence.</p>



<p>This lesson is the last of a 3-part series on <strong>Retrieval-Augmented Generation (RAG)</strong>:</p>



<ol class="wp-block-list">
<li><em><strong><a href="https://pyimg.co/msp43" target="_blank" rel="noreferrer noopener">TF-IDF vs. Embeddings: From Keywords to Semantic Search</a></strong></em></li>



<li><em><strong><a href="https://pyimg.co/htl5f" target="_blank" rel="noreferrer noopener">Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained</a></strong></em></li>



<li><em><strong><a href="https://pyimg.co/q68nv" target="_blank" rel="noreferrer noopener">Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)</a></strong></em><strong> (this tutorial)</strong></li>
</ol>



<p><strong>To learn how to make your LLM do the same, </strong><em><strong>just keep reading.</strong></em></p>



<div id="pyi-source-code-block" class="source-code-wrap"><div class="gpd-source-code">
    <div class="gpd-source-code-content">
        <img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/source-code-icon.png?lossy=2&strip=1&webp=1" alt="">
        <h4>Looking for the source code to this post?</h4>
                    <a href="#download-the-code" class="pyis-cta-modal-open-modal">Jump Right To The Downloads Section <svg class="svg-icon arrow-right" width="12" height="12" aria-hidden="true" role="img" focusable="false" viewBox="0 0 14 14" fill="none" xmlns="http://www.w3.org/2000/svg"><path d="M6.8125 0.1875C6.875 0.125 6.96875 0.09375 7.09375 0.09375C7.1875 0.09375 7.28125 0.125 7.34375 0.1875L13.875 6.75C13.9375 6.8125 14 6.90625 14 7C14 7.125 13.9375 7.1875 13.875 7.25L7.34375 13.8125C7.28125 13.875 7.1875 13.9062 7.09375 13.9062C6.96875 13.9062 6.875 13.875 6.8125 13.8125L6.1875 13.1875C6.125 13.125 6.09375 13.0625 6.09375 12.9375C6.09375 12.8438 6.125 12.75 6.1875 12.6562L11.0312 7.8125H0.375C0.25 7.8125 0.15625 7.78125 0.09375 7.71875C0.03125 7.65625 0 7.5625 0 7.4375V6.5625C0 6.46875 0.03125 6.375 0.09375 6.3125C0.15625 6.25 0.25 6.1875 0.375 6.1875H11.0312L6.1875 1.34375C6.125 1.28125 6.09375 1.1875 6.09375 1.0625C6.09375 0.96875 6.125 0.875 6.1875 0.8125L6.8125 0.1875Z" fill="#169FE6"></path></svg></a>
            </div>
</div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-How-Vector-Search-Powers-Retrieval-Augmented-Generation-RAG"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-How-Vector-Search-Powers-Retrieval-Augmented-Generation-RAG">How Vector Search Powers Retrieval-Augmented Generation (RAG)</a></h2>



<p>Before we start wiring our first Retrieval-Augmented Generation (RAG) pipeline, let’s pause to understand how far we’ve come — and why this next step is a natural progression.</p>



<p>In <strong><a href="https://pyimg.co/msp43" target="_blank" rel="noreferrer noopener">Lesson 1</a></strong>, we learned how to translate language into geometry.</p>



<p>Each sentence became a vector — a point in high-dimensional space — where <strong>semantic closeness</strong> means <strong>directional similarity</strong>. Instead of matching exact words, embeddings capture <em>meaning</em>.</p>
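<p>To make “directional similarity” concrete, here is a minimal NumPy sketch (the three toy 3-D vectors are invented purely for illustration; real embedding models produce hundreds of dimensions):</p>

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-D "embeddings" -- real models output 384+ dimensions.
cat = np.array([1.0, 0.9, 0.1])
kitten = np.array([0.9, 1.0, 0.2])   # points in nearly the same direction
invoice = np.array([0.1, 0.2, 1.0])  # points somewhere else entirely

print(cosine_similarity(cat, kitten))   # close to 1.0 (similar meaning)
print(cosine_similarity(cat, invoice))  # much smaller (unrelated meaning)
```

<p>Two sentences about the same topic end up pointing in nearly the same direction, so their cosine similarity approaches 1.0 even when they share no words.</p>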



<p>In <strong><a href="https://pyimg.co/htl5f" target="_blank" rel="noreferrer noopener">Lesson 2</a></strong>, we tackled the scale problem: when millions of such vectors exist, finding the nearest ones efficiently demands specialized data structures such as <strong>FAISS indexes </strong>— Flat, HNSW, and IVF.</p>



<p>These indexes allow us to perform lightning-fast <em>approximate nearest neighbor</em> (ANN) searches with only a small trade-off in precision.</p>



<p>Now, in <strong>Lesson 3</strong>, we finally connect this retrieval ability to an LLM.</p>



<p>Think of the FAISS index as a <strong>semantic memory vault</strong> — it remembers every sentence you’ve embedded.</p>



<p>RAG acts as the <strong>retrieval layer</strong> that fetches the most relevant facts when you ask a question, passing those snippets to the model before it generates an answer.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-From-Search-to-Context"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-From-Search-to-Context">From Search to Context</a></h3>



<p>Traditional vector search stops at retrieval:</p>



<p>You enter a query; the system finds semantically similar passages and displays them as search results.</p>



<p>RAG goes one step further — it <strong>feeds</strong> those retrieved passages <em>into</em> the language model’s input prompt.</p>



<p>Instead of reading raw similarity scores, the model sees sentences such as:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="1">Context:
1. Vector databases store and search embeddings efficiently using ANN.
2. FAISS supports multiple indexing strategies including Flat, HNSW, and IVF.

User Question:
What’s the advantage of using HNSW over Flat indexes?
</pre>



<p>Now the model doesn’t have to “guess” — it answers with contextually grounded reasoning.</p>



<p>That is what transforms search into retrieval-based reasoning (<strong>Figure 1</strong>).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-90.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="945" height="741" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-90.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52902" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-90.png?size=126x99&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-90-300x235.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-90.png?size=378x296&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-90.png?size=504x395&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-90.png?size=630x494&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-90-768x602.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-90.png?lossy=2&amp;strip=1&amp;webp=1 945w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 1: </strong>RAG extends vector search by adding a reasoning layer on top of retrieval (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-The-Flow-of-Meaning"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-The-Flow-of-Meaning">The Flow of Meaning</a></h3>



<p>Let’s connect all the components (<strong>Table 1</strong>).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-91.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="875" height="411" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-91.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52904" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-91.png?size=126x59&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-91-300x141.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-91.png?size=378x178&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-91.png?size=504x237&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-91.png?size=630x296&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-91-768x361.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-91.png?lossy=2&amp;strip=1&amp;webp=1 875w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Table 1:</strong> Step-by-step transformation from text embedding to a generated answer in a RAG pipeline.</figcaption></figure></div>


<p>This is the <strong>essence of RAG</strong> — combining the recall strength of retrieval with the reasoning power of generation.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Putting-It-All-Together-Vector-Search"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Putting-It-All-Together-Vector-Search">Putting It All Together</a></h3>



<p>Imagine browsing through a giant photo album of your entire text corpus.</p>



<p>Vector search helps you instantly find pictures with <em>similar colors and patterns</em> — that’s embeddings at work.</p>



<p>But RAG doesn’t stop there. It shows those pictures to a <strong>storyteller</strong> (the LLM), who uses them to narrate a coherent story about what’s happening across them.</p>



<p>Embeddings give you <em>semantic lookup</em>.</p>



<p>RAG gives you <em>semantic understanding</em> (<strong>Figure 2</strong>).</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-92-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="207" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-92-1024x207.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52907" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-92.png?size=126x25&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-92-300x61.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-92.png?size=378x76&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-92.png?size=504x102&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-92.png?size=630x127&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-92-768x155.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-92-1024x207.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-92-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-92-1536x310.png?lossy=2&amp;strip=1&amp;webp=1 1536w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 2:</strong> RAG sits at the intersection of retrieval and reasoning — transforming raw text into embeddings, searching the vector index for context, and guiding the LLM to turn meaning into insight (source: image by the author).</figcaption></figure></div>


<p>If this flow made sense, you’re ready for the real action — understanding how <strong>Retrieval Augmented Generation actually works</strong> under the hood.</p>



<p>Next, we’ll break down the architecture, components, and the retrieve-read-generate process that powers modern RAG pipelines.</p>






<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-What-Is-Retrieval-Augmented-Generation-RAG"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-What-Is-Retrieval-Augmented-Generation-RAG">What Is Retrieval-Augmented Generation (RAG)?</a></h2>



<p>Large Language Models (LLMs) have changed how we interact with information.</p>



<p>But they come with two fundamental weaknesses: they <strong>can’t access external data</strong>, and <strong>their knowledge is frozen at training time</strong>.</p>



<p>Even the most powerful LLMs (e.g., GPT-4 or Mistral) rely entirely on patterns learned during training.</p>



<p>They don’t know about the latest company reports, your private PDFs, or a proprietary codebase unless explicitly retrained — which is expensive, slow, and often impossible for organizations working with sensitive data.</p>



<p>This is exactly where <strong>Retrieval-Augmented Generation (RAG)</strong> steps in.</p>



<p>RAG acts as a <em>bridge</em> between <strong>frozen LLM knowledge</strong> and <strong>fresh, external information</strong>.</p>



<p>Instead of forcing the model to memorize everything, we give it a <strong>retrieval memory system</strong> — a searchable knowledge store filled with your domain data.</p>



<p>Imagine giving your LLM a library card — and access to an intelligent librarian.</p>



<p>Whenever a question arrives, the LLM doesn’t rely on its memory alone — it sends the librarian to fetch relevant documents, reads them carefully, and then generates a grounded, evidence-based response.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-The-Retrieve-Read-Generate-Architecture-Explained"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-The-Retrieve-Read-Generate-Architecture-Explained">The Retrieve-Read-Generate Architecture Explained</a></h3>



<p>RAG systems follow a predictable 3-step pipeline that connects information retrieval with text generation:</p>



<h4 class="wp-block-heading">Retrieve</h4>



<p>The user’s question is first converted into a numerical vector (embedding).</p>



<p>This vector represents the <em>semantic meaning</em> of the query and is matched against stored document vectors in a <strong>vector index</strong> (e.g., FAISS, Pinecone, or Milvus).</p>



<p>The top-<em>k</em> closest matches — meaning the most semantically similar chunks — are returned as potential context.</p>



<h4 class="wp-block-heading">Read</h4>



<p>These retrieved chunks are merged into a short <em>context window</em> — effectively a mini-knowledge pack relevant to the user’s query.</p>



<p>This step is vital: instead of dumping the entire corpus into the model, we pass only the most useful and concise context.</p>
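<p>A minimal sketch of that packing step is shown below. The chunk texts and the budget are invented for illustration, and a production reader would count tokens with the model’s tokenizer rather than characters:</p>

```python
def pack_context(chunks: list[str], budget_chars: int = 1500) -> str:
    """Greedily pack the highest-ranked chunks until the budget is hit.

    `chunks` are assumed to arrive already sorted by similarity score,
    so truncation always drops the least relevant material first.
    """
    picked, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        line = f"{i}. {chunk}"
        if used + len(line) > budget_chars:
            break  # stop before overflowing the context window
        picked.append(line)
        used += len(line)
    return "Context:\n" + "\n".join(picked)

chunks = [
    "Vector databases store and search embeddings efficiently using ANN.",
    "FAISS supports multiple indexing strategies including Flat, HNSW, and IVF.",
]
print(pack_context(chunks))
```

<p>Because the chunks are pre-ranked, a tight budget simply keeps fewer (but still the most relevant) passages.</p>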



<h4 class="wp-block-heading">Generate</h4>



<p>The LLM (e.g., one running locally through Ollama or remotely via an API) takes both the <strong>query</strong> and <strong>retrieved context</strong>, then composes an answer that blends natural language fluency with factual grounding.</p>



<p>If well-designed, the model avoids hallucinating and gracefully responds <em>“I don’t know”</em> when information is missing.</p>



<p><strong>Figure 3</strong> displays a high-level visual summary of this process.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-104.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="1024" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-104.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52961" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-104-150x150.png?lossy=2&amp;strip=1&amp;webp=1 150w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-104-300x300.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-104.png?size=378x378&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-104.png?size=504x504&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-104.png?size=630x630&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-104-768x768.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-104.png?lossy=2&amp;strip=1&amp;webp=1 1024w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 3:</strong> RAG connects a retriever (search) with a generator (LLM) to produce context-aware, fact-grounded responses (source: image by the author).</figcaption></figure></div>
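<p>As a rough sketch, the generate step can be a single HTTP call to a local Ollama server. The model name, the localhost URL, and the helper names below are assumptions for illustration (Ollama’s default endpoint is <code>http://localhost:11434/api/generate</code>; adjust for your setup). The payload builder is separated from the network call so the latter stays optional:</p>

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_payload(query: str, context: str, model: str = "llama3") -> dict:
    """Combine retrieved context and the user query into one grounded prompt."""
    prompt = (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    # stream=False returns one JSON object instead of a token stream.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(query: str, context: str, model: str = "llama3") -> str:
    """POST the prompt to a running Ollama server and return the answer text."""
    data = json.dumps(build_payload(query, context, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Build (but do not send) a payload, so this runs without a server.
payload = build_payload("What does HNSW trade off?", "HNSW trades a little recall for speed.")
print(payload["model"])
```

<p>Grounding the prompt with an explicit “answer only from the context” instruction is what lets the model respond “I don’t know” instead of hallucinating.</p>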


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Why-Retrieval-Augmented-Generation-RAG-Improves-LLM-Accuracy"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Why-Retrieval-Augmented-Generation-RAG-Improves-LLM-Accuracy">Why Retrieval-Augmented Generation (RAG) Improves LLM Accuracy</a></h3>



<p>At first glance, RAG may appear to be “just another way to query a model,” but it represents a fundamental shift in <em>how</em> LLMs reason.</p>



<p>Traditional LLMs store <em>knowledge</em> in their parameters — they <strong>memorize</strong> facts.</p>



<p>RAG decouples knowledge from parameters and instead <strong>retrieves</strong> it on demand.</p>



<p>This means you can keep your model small, fast, and efficient, while still answering domain-specific queries with accuracy.</p>



<p>Let’s unpack this with a few concrete advantages, as reported in <strong>Table 2</strong>.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-94-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="427" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-94-1024x427.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52913" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-94.png?size=126x53&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-94-300x125.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-94.png?size=378x158&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-94.png?size=504x210&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-94.png?size=630x263&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-94-768x320.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-94-1024x427.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-94-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Table 2:</strong> Common LLM limitations and how RAG mitigates each issue.</figcaption></figure></div>


<p>The result?</p>



<p>A <strong>modular intelligence system</strong> — where the retriever evolves with your data, and the generator focuses purely on language reasoning.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-The-Broader-Picture-A-Hybrid-of-Search-and-Generation"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-The-Broader-Picture-A-Hybrid-of-Search-and-Generation">The Broader Picture: A Hybrid of Search and Generation</a></h3>



<p>You can think of RAG as the perfect fusion of <strong>information retrieval</strong> and <strong>natural language generation</strong>.</p>



<p>Traditional search engines stop at retrieval — they return ranked documents.</p>



<p>LLMs go further — they <em>interpret and explain</em>.</p>



<p>RAG combines both: <em>find relevant context, then generate insights from it.</em></p>



<p>It’s the same principle behind how humans answer questions:</p>



<ul class="wp-block-list">
<li>We first <strong>recall</strong> or <strong>look up</strong> what we know.</li>



<li>Then we <strong>synthesize</strong> an answer in our own words.</li>
</ul>



<p>RAG gives LLMs the same skill — combining retrieval precision with generative fluency.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Key-Takeaway"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Key-Takeaway">Key Takeaway</a></h3>



<p>RAG doesn’t replace fine-tuning — it complements it.</p>



<p>It’s the fastest, cheapest, and most reliable way to make LLMs domain-aware without touching their weights.</p>



<p>Once you set up your retriever (built from the FAISS indexes we created in Lesson 2) and connect it to a generator (which we’ll later run via Ollama), you’ll have a self-contained intelligent assistant — one that can reason over your data and answer complex questions in natural language.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-How-to-Build-a-RAG-Pipeline-with-FAISS-and-Ollama-Local-LLM"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-How-to-Build-a-RAG-Pipeline-with-FAISS-and-Ollama-Local-LLM">How to Build a RAG Pipeline with FAISS and Ollama (Local LLM)</a></h2>



<p>Now that you understand what Retrieval Augmented Generation is and why it matters, let’s break down how to actually <strong>build</strong> one — conceptually first, before we dive into the code.</p>



<p>A RAG pipeline may sound complicated, but in practice it’s a clean, modular system made of 3 major parts: the <strong>retriever</strong>, the <strong>reader</strong>, and the <strong>generator</strong>.</p>



<p>Each part does one job well, and together they form the backbone of every production-grade RAG system — whether you’re querying a few PDFs or an entire knowledge base.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Step-1-Implementing-HNSW-Vector-Search-with-FAISS-for-RAG"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Step-1-Implementing-HNSW-Vector-Search-with-FAISS-for-RAG">Step 1: Implementing HNSW Vector Search with FAISS for RAG</a></h3>



<p>The retriever’s job is to search your document corpus and return the chunks most relevant to a user query.</p>



<p>It’s powered by the vector indexes you built in <strong>Lesson 2</strong>, which enable efficient approximate nearest-neighbor (ANN) search.</p>



<p>When a user asks a question, here’s what happens:</p>



<ul class="wp-block-list">
<li>The query text is <strong>embedded</strong> using the same Sentence Transformer model used during indexing.</li>



<li>That query embedding is compared with your stored document embeddings via a <strong>FAISS index</strong>.</li>



<li>The retriever returns the <em>top-k</em> results (typically 3-5 chunks) ranked by cosine similarity.</li>
</ul>
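<p>The three bullet points above can be sketched without FAISS at all: brute-force cosine scoring over a toy matrix of embeddings. FAISS performs the same ranking against a prebuilt index so it scales to millions of vectors; the random “embeddings” here are stand-ins for real Sentence Transformer outputs:</p>

```python
import numpy as np

def retrieve_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    """Brute-force retriever: cosine-score every document, return the top-k.

    Normalizing both sides makes the dot product equal cosine similarity,
    which is exactly the trick used with FAISS inner-product indexes.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top]

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(10, 8))                   # 10 fake document embeddings
query_vec = doc_vecs[4] + 0.01 * rng.normal(size=8)   # query near document 4
print(retrieve_top_k(query_vec, doc_vecs, k=3))       # document 4 ranks first
```
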



<p>Think of it as Google Search for your private data — except instead of matching keywords, it matches meaning (<strong>Figure 4</strong>).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-95.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="821" height="773" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-95.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52915" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-95.png?size=126x119&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-95-300x282.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-95.png?size=378x356&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-95.png?size=504x475&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-95.png?size=630x593&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-95-768x723.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-95.png?lossy=2&amp;strip=1&amp;webp=1 821w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 4:</strong> A visual comparison of keyword search vs. vector search — traditional keyword search relies on word overlap, while vector search uses semantic proximity in embedding space to capture meaning and context (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Step-2-Prompt-Engineering-for-Retrieval-Augmented-Generation-RAG"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Step-2-Prompt-Engineering-for-Retrieval-Augmented-Generation-RAG">Step 2: Prompt Engineering for Retrieval-Augmented Generation (RAG)</a></h3>



<p>Once the relevant chunks are retrieved, we can’t just throw them at the LLM.</p>



<p>They must be <strong>assembled and formatted</strong> into a coherent, bounded prompt.</p>



<p>This is the job of the reader — a lightweight logic layer that:</p>



<ul class="wp-block-list">
<li>Ranks and filters retrieved chunks by similarity score or metadata (e.g., document name or section).</li>



<li>Merges them into a context block that stays within the LLM’s <strong>context-window limit</strong> (say, 4K-8K tokens).</li>



<li>Wraps them inside a consistent prompt template.</li>
</ul>
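<p>The merging step is worth seeing concretely. A minimal sketch of the reader's budget logic, using a character budget as a rough proxy for tokens (the <code>merge_chunks</code> helper is ours, not part of the lesson code):</p>

```python
def merge_chunks(chunks, max_chars=1000):
    # Greedily pack the highest-ranked chunks until the budget is exhausted;
    # `chunks` is assumed to be pre-sorted by similarity score
    selected, used = [], 0
    for text in chunks:
        if used + len(text) > max_chars:
            break
        selected.append(text)
        used += len(text)
    return "\n\n".join(selected)
```

<p>A real implementation would count tokens with the model's tokenizer, but the shape of the logic is identical.</p>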



<p>In our code, this will be handled using utilities from <code data-enlighter-language="python" class="EnlighterJSRAW">config.py</code> — notably <code data-enlighter-language="python" class="EnlighterJSRAW">build_prompt()</code>, which combines system prompts, retrieved text, and user queries into a final message ready for the model (<strong>Figure 5</strong>).</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-105-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="319" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-105-1024x319.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52963" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-105.png?size=204x64&amp;lossy=2&amp;strip=1&amp;webp=1 204w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-105-300x93.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-105.png?size=409x127&amp;lossy=2&amp;strip=1&amp;webp=1 409w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-105.png?size=614x191&amp;lossy=2&amp;strip=1&amp;webp=1 614w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-105-768x239.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-105.png?size=819x255&amp;lossy=2&amp;strip=1&amp;webp=1 819w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-105-1024x319.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-105-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 5:</strong> The reader transforms retrieved text into a well-structured prompt for the generator (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Step-3-Generating-Grounded-Answers-with-Ollama-Local-LLM"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Step-3-Generating-Grounded-Answers-with-Ollama-Local-LLM">Step 3: Generating Grounded Answers with Ollama Local LLM</a></h3>



<p>Finally, the generator — your LLM — reads the composed prompt and generates a response grounded in the retrieved data.</p>



<p>In our implementation, this will be the stage where we integrate with <strong>Ollama</strong>, a local LLM runtime capable of running models (e.g., Llama 3, Mistral, or Gemma 2) on your machine.</p>



<p>But the design will stay <strong>framework-agnostic</strong>, so you can later swap Ollama for an API call to OpenAI, Claude, or an enterprise model running in-house.</p>



<p>What makes this step powerful is the <strong>synergy</strong> between retrieval and generation: the LLM isn’t hallucinating — it’s <em>reasoning</em> with evidence. If the context doesn’t contain the answer, it should politely say so, thanks to the strict vs. synthesis prompt patterns defined in <code data-enlighter-language="python" class="EnlighterJSRAW">config.py</code> (<strong>Figure 6</strong>).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-97.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="552" height="701" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-97.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52920" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-97.png?size=126x160&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-97-236x300.png?lossy=2&amp;strip=1&amp;webp=1 236w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-97.png?size=378x480&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-97.png?lossy=2&amp;strip=1&amp;webp=1 552w" sizes="(max-width: 552px) 100vw, 552px" /></a><figcaption class="wp-element-caption"><strong>Figure 6:</strong> A modular view of the RAG pipeline, showing the interaction between the Retriever, Reader, and Generator components, with a feedback loop from the generator to the retriever for iterative context refinement (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Adding-Feedback-Loops-to-Improve-Retrieval-Accuracy"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Adding-Feedback-Loops-to-Improve-Retrieval-Accuracy">Adding Feedback Loops to Improve Retrieval Accuracy</a></h3>



<p>In more advanced systems, RAG doesn’t end at generation. You can capture user feedback (e.g., thumbs-up/down or re-query actions) to fine-tune retrieval parameters, re-rank documents, or even re-embed sections of your corpus. This transforms a static RAG setup into a <strong>continually learning knowledge engine</strong>.</p>
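<p>One lightweight way to act on that feedback is to bias future rankings toward chunks users endorsed. The sketch below is a hypothetical design (the <code>FeedbackReranker</code> class and its <code>boost</code> parameter are ours, shown only to make the idea concrete):</p>

```python
from collections import defaultdict

class FeedbackReranker:
    """Hypothetical re-ranker: thumbs-up/down votes nudge future chunk scores."""

    def __init__(self, boost=0.05):
        self.boost = boost
        self.votes = defaultdict(int)  # chunk_id -> net vote count

    def record(self, chunk_id, thumbs_up):
        self.votes[chunk_id] += 1 if thumbs_up else -1

    def rerank(self, hits):
        # hits: list of (chunk_id, similarity); re-sort with the feedback bias applied
        return sorted(
            hits,
            key=lambda h: h[1] + self.boost * self.votes[h[0]],
            reverse=True,
        )
```

<p>Production systems often go further (e.g., re-embedding or fine-tuning a re-ranker), but even this simple bias closes the loop between users and retrieval.</p>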



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Putting-It-All-Together-RAG-Pipeline"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Putting-It-All-Together-RAG-Pipeline">Putting It All Together</a></h3>



<p><strong>Figure 7</strong> displays a conceptual flow that ties the 3 components together.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-5.jpeg" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="1024" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-5.jpeg?lossy=2&strip=1&webp=1" alt="" class="wp-image-52923" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-5-150x150.jpeg?lossy=2&amp;strip=1&amp;webp=1 150w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-5-300x300.jpeg?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-5.jpeg?size=378x378&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-5.jpeg?size=504x504&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-5.jpeg?size=630x630&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-5-768x768.jpeg?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-5.jpeg?lossy=2&amp;strip=1&amp;webp=1 1024w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 7:</strong> Step-by-step view of a RAG pipeline with optional feedback, illustrating how a user query is embedded, searched in FAISS, ranked, and passed to an LLM — while allowing feedback loops to enhance future retrieval quality (source: image by the author).</figcaption></figure></div>


<p>Each box in this pipeline maps directly to a section of your upcoming implementation.</p>



<p>In code, these steps will unfold through modular utilities and clean interfaces so you can swap retrievers, tweak prompt templates, or change models without rewriting the entire system.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Configuring-Your-Development-Environment-Setting-Up-Ollama-and-FAISS-for-a-Local-RAG-Pipeline"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Configuring-Your-Development-Environment-Setting-Up-Ollama-and-FAISS-for-a-Local-RAG-Pipeline">Configuring Your Development Environment: Setting Up Ollama and FAISS for a Local RAG Pipeline</a></h2>



<p>To follow this RAG pipeline guide, you&#8217;ll need several Python packages installed on your system. The tutorial builds upon semantic embeddings and vector search, requiring machine learning libraries, HTTP clients, and visualization tools.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="2">$ pip install sentence-transformers==2.7.0
$ pip install faiss-cpu==1.8.0.post1
$ pip install numpy==1.26.4
$ pip install requests==2.32.3
$ pip install rich==13.8.1
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Optional-Dependencies"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Optional-Dependencies">Optional Dependencies</a></h3>



<p>For visualization and enhanced functionality:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="6" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="3">$ pip install scikit-learn==1.5.1
$ pip install matplotlib==3.9.2
$ pip install "ollama>=0.1.0"
</pre>



<p>This installs the Python client only. The Ollama runtime must be installed separately.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Local-LLM-Setup-Ollama"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Local-LLM-Setup-Ollama">Local LLM Setup (Ollama)</a></h3>



<p>The RAG pipeline uses Ollama for local language model inference. Install Ollama separately:</p>



<ul class="wp-block-list">
<li><strong>Install Ollama:</strong> Visit <a href="https://ollama.ai" target="_blank" rel="noreferrer noopener">ollama.ai</a> and follow the installation instructions for your platform.</li>



<li><strong>Pull a model:</strong> Once Ollama is installed, download a model:</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="4">$ ollama pull llama3
</pre>



<ul class="wp-block-list">
<li><strong>Verify installation:</strong></li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="5">$ ollama list
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<!-- wp:paragraph -->
<h3>Need Help Configuring Your Development Environment?</h3>
<!-- /wp:paragraph -->

<!-- wp:image {"align":"center","id":18137,"sizeSlug":"large","linkDestination":"custom"} -->
<figure class="wp-block-image aligncenter size-large"><a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank" rel="noreferrer noopener"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-18137" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?lossy=2&strip=1&webp=1 500w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=126x84&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=252x168&lossy=2&strip=1&webp=1 252w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=378x253&lossy=2&strip=1&webp=1 378w" sizes="(max-width: 500px) 100vw, 500px" /></a><figcaption>Having trouble configuring your development environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? Be sure to join <a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">PyImageSearch University</a> — you will be up and running with this tutorial in a matter of minutes. </figcaption></figure>
<!-- /wp:image -->

<!-- wp:paragraph -->
<p>All that said, are you:</p>
<!-- /wp:paragraph -->

<!-- wp:list -->
<ul><li>Short on time?</li><li>Learning on your employer’s administratively locked system?</li><li>Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?</li><li><strong>Ready to run the code immediately on your Windows, macOS, or Linux system?</strong></li></ul>
<!-- /wp:list -->

<!-- wp:paragraph -->
<p>Then join <a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank">PyImageSearch University</a> today!</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p><strong>Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides pre-configured to run on Google Colab’s ecosystem right in your web browser!</strong> No installation required.</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!</p>
<!-- /wp:paragraph -->



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Implementation-Walkthrough"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Implementation-Walkthrough">Implementation Walkthrough</a></h2>



<p>We’ll cover this in 3 parts:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">config.py</code>: central configuration and prompt templates</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">rag_utils.py</code>: retrieval + LLM integration logic</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">03_rag_pipeline.py</code>: driver script that ties everything together</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Configuration-config-py"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Configuration-config-py">Configuration (config.py)</a></h3>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">config.py</code> module defines paths, constants, and templates that are used throughout the RAG pipeline. Think of it as the “control room” for your entire setup.</p>



<h4 class="wp-block-heading">Directory and Path Setup</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="6">from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent
DATA_DIR = BASE_DIR / "data"
INPUT_DIR = DATA_DIR / "input"
OUTPUT_DIR = DATA_DIR / "output"
INDEX_DIR = DATA_DIR / "indexes"
FIGURES_DIR = DATA_DIR / "figures"
</pre>



<p>Here, we define a <strong>consistent directory structure</strong> so that every script can find data, indexes, and output files, regardless of where it runs from.</p>



<p>This ensures reproducibility — a key trait for multi-script projects like this one.</p>



<p><em><strong>Tip:</strong></em> Using <code data-enlighter-language="python" class="EnlighterJSRAW">Path(__file__).resolve().parent.parent</code> automatically points to your project’s root directory, keeping all paths portable.</p>



<h4 class="wp-block-heading">Corpus and Embedding Artifacts</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="7">CORPUS_PATH = INPUT_DIR / "corpus.txt"
CORPUS_META_PATH = INPUT_DIR / "corpus_metadata.json"
EMBEDDINGS_PATH = OUTPUT_DIR / "embeddings.npy"
METADATA_ALIGNED_PATH = OUTPUT_DIR / "metadata_aligned.json"
DIM_REDUCED_PATH = OUTPUT_DIR / "pca_2d.npy"
</pre>



<p>These paths represent:</p>



<ul class="wp-block-list">
<li><strong>Corpus files:</strong> your input text and metadata</li>



<li><strong>Embedding artifacts:</strong> precomputed vectors and PCA-reduced coordinates for visualization</li>
</ul>



<p>We also include environment variable overrides (e.g., <code data-enlighter-language="python" class="EnlighterJSRAW">CORPUS_PATH</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">CORPUS_META_PATH</code>) to make it easy to point to new datasets without editing code.</p>
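<p>The override pattern looks something like this (a sketch of the idea; the <code>resolve_corpus_path</code> helper name is ours):</p>

```python
import os
from pathlib import Path

INPUT_DIR = Path("data/input")  # stand-in for the value defined in config.py

def resolve_corpus_path(env=os.environ):
    # An environment variable override wins; otherwise fall back to the repo default
    return Path(env.get("CORPUS_PATH", str(INPUT_DIR / "corpus.txt")))

default_path = resolve_corpus_path(env={})
custom_path = resolve_corpus_path(env={"CORPUS_PATH": "/tmp/my_corpus.txt"})
```

<p>Setting <code>CORPUS_PATH=/path/to/other.txt</code> before running any script then redirects the whole pipeline to a new dataset, with zero code edits.</p>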



<h4 class="wp-block-heading">Index Artifacts</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="8">FLAT_INDEX_PATH = INDEX_DIR / "faiss_flat.index"
HNSW_INDEX_PATH = INDEX_DIR / "faiss_hnsw.index"
</pre>



<p>These define storage for your <strong>Flat</strong> (exact) and <strong>HNSW</strong> (approximate) FAISS indexes.</p>



<p>They’re generated in Lesson 2 and reused here for retrieval.</p>



<h4 class="wp-block-heading">Model and General Settings</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="9">EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
SEED = 42
DEFAULT_TOP_K = 5
SIM_THRESHOLD = 0.35
</pre>



<ul class="wp-block-list">
<li><strong>Sentence Transformer model:</strong> the same compact model used for embedding queries and documents</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">SEED</code>: ensures deterministic sampling</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">DEFAULT_TOP_K</code>: number of chunks retrieved per question</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">SIM_THRESHOLD</code>: a similarity cut-off to filter weak matches</li>
</ul>
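<p>The threshold is applied as a simple filter on retrieval hits before anything reaches the prompt. A minimal sketch (the <code>filter_hits</code> helper is illustrative, not part of the lesson code):</p>

```python
SIM_THRESHOLD = 0.35  # mirrors the value in config.py

def filter_hits(hits, threshold=SIM_THRESHOLD):
    # Keep only (text, score) pairs whose similarity clears the cut-off,
    # so barely-related chunks never pad out the LLM's context
    return [(text, score) for text, score in hits if score >= threshold]
```

<p>Tuning this value trades recall for precision: raise it and the model sees less (but cleaner) context; lower it and weak matches start leaking in.</p>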



<h4 class="wp-block-heading">Prompt Templates for RAG</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="10">STRICT_SYSTEM_PROMPT = (
    "You are a concise assistant. Use ONLY the provided context."
    " If the answer is not contained verbatim or explicitly, say you do not know."
)
SYNTHESIZING_SYSTEM_PROMPT = (
    "You are a concise assistant. Rely ONLY on the provided context, but you MAY synthesize"
    " an answer by combining or paraphrasing the facts present. If the context truly lacks"
    " sufficient evidence, say you do not know instead of guessing."
)
</pre>



<p>These 2 templates control <strong>LLM behavior</strong>:</p>



<ul class="wp-block-list">
<li><strong>Strict mode</strong><strong>:</strong> purely extractive, no paraphrasing</li>



<li><strong>Synthesizing mode</strong><strong>:</strong> allows combining retrieved snippets to form explanatory answers</li>
</ul>



<p>This distinction is critical when testing <em>retrieval quality</em> versus <em>generation quality</em>.</p>



<h4 class="wp-block-heading">Intelligent Prompt Builder</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="11">def build_prompt(context_chunks, question: str, allow_synthesis: bool = False) -> str:
    system_prompt = SYNTHESIZING_SYSTEM_PROMPT if allow_synthesis else STRICT_SYSTEM_PROMPT
    context_str = "\n\n".join(context_chunks)
    return f"System: {system_prompt}\n{CONTEXT_HEADER}\n{context_str}\n\n" + USER_QUESTION_TEMPLATE.format(question=question)
</pre>



<p>This function <strong>assembles the final prompt</strong> fed into the LLM.</p>



<p>It concatenates retrieved context snippets, appends the system instructions, and ends with the user query.</p>



<p><em><strong>Tip:</strong></em> The key here is flexibility — by toggling <code data-enlighter-language="python" class="EnlighterJSRAW">allow_synthesis</code>, you can dynamically switch between <em>closed-book</em> and <em>open-book</em> answering styles.</p>
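<p>Here is the toggle in action. This sketch is self-contained, so it stubs in abridged prompt strings and placeholder values for <code>CONTEXT_HEADER</code> and <code>USER_QUESTION_TEMPLATE</code> (both actually defined in <code>config.py</code>; the stand-ins below are assumptions for illustration):</p>

```python
STRICT_SYSTEM_PROMPT = "Use ONLY the provided context."           # abridged
SYNTHESIZING_SYSTEM_PROMPT = "You MAY synthesize from the context."  # abridged
CONTEXT_HEADER = "Context:"                       # stand-in for config.py value
USER_QUESTION_TEMPLATE = "Question: {question}"   # stand-in for config.py value

def build_prompt(context_chunks, question, allow_synthesis=False):
    system_prompt = SYNTHESIZING_SYSTEM_PROMPT if allow_synthesis else STRICT_SYSTEM_PROMPT
    context_str = "\n\n".join(context_chunks)
    return (
        f"System: {system_prompt}\n{CONTEXT_HEADER}\n{context_str}\n\n"
        + USER_QUESTION_TEMPLATE.format(question=question)
    )

chunks = ["FAISS is a library for efficient vector similarity search."]
closed = build_prompt(chunks, "What is FAISS?")                          # extractive
open_book = build_prompt(chunks, "What is FAISS?", allow_synthesis=True)  # synthesizing
```

<p>Same retrieval, same question, two different contracts with the model.</p>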



<h4 class="wp-block-heading">Directory Bootstrap</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="12">for d in (OUTPUT_DIR, INDEX_DIR, FIGURES_DIR):
    d.mkdir(parents=True, exist_ok=True)
</pre>



<p>Ensures that all critical folders exist before any writing occurs — a small but essential safeguard for production stability (<strong>Figure 8</strong>).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-106.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="700" height="466" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-106.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52967" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-106.png?size=126x84&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-106-300x200.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-106.png?size=378x252&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-106.png?size=504x336&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-106.png?size=630x419&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-106.png?lossy=2&amp;strip=1&amp;webp=1 700w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 8:</strong> A high-level overview of the RAG Configuration Flow, showing how <code>config.py</code> centralizes paths, corpus files, embedding models, prompt templates, and model settings — feeding these configurations into the rest of the RAG pipeline (i.e., vector store, retrieval logic, and Ollama LLM) (source: image by the author).</figcaption></figure></div>


<p>At this point, the configuration module provides the foundation for the next step: actually <em>retrieving and generating answers</em>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Integrating-Ollama-with-FAISS-Vector-Search-for-RAG"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Integrating-Ollama-with-FAISS-Vector-Search-for-RAG">Integrating Ollama with FAISS Vector Search for RAG</a></h2>



<p>Now that our FAISS index is ready to serve embeddings, the next step is to <strong>connect it with an LLM</strong> — the final reasoning layer that generates natural-language answers based on retrieved context.</p>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">rag_utils.py</code> file is where <em>retrieval meets generation</em>.</p>



<p>It ties together the embedding search results, builds prompts, calls the LLM (Ollama by default), and even adds explainability through citations and per-sentence support scoring.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Overview-and-Setup"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Overview-and-Setup">Overview and Setup</a></h3>



<p>Let’s start by looking at the top of the file:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="13">import os, json, re, requests
import numpy as np
from typing import List, Dict, Tuple, Any

try:
    import ollama  # type: ignore
except ImportError:
    ollama = None
</pre>



<p>At the core, this script:</p>



<ul class="wp-block-list">
<li>Uses <strong>Ollama</strong> for local LLM inference, but gracefully falls back to HTTP if the Python client isn’t installed.</li>



<li>Imports <strong>NumPy</strong> for fast vector math, <strong>requests</strong> for API calls, and <strong>typing</strong> hints for readability.</li>
</ul>



<p>Then, it configures Ollama’s endpoints:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="14">OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_API_URL = f"{OLLAMA_BASE_URL}/api/generate"
OLLAMA_TAGS_URL = f"{OLLAMA_BASE_URL}/api/tags"
</pre>



<p><em><strong>Tip:</strong></em> You can override <code data-enlighter-language="python" class="EnlighterJSRAW">OLLAMA_BASE_URL</code> with an environment variable — handy when deploying on remote servers or Docker containers (<strong>Figure 9</strong>).</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-99.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="889" height="261" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-99.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52931" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-99.png?size=126x37&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-99-300x88.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-99.png?size=378x111&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-99.png?size=504x148&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-99.png?size=630x185&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-99-768x225.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-99.png?lossy=2&amp;strip=1&amp;webp=1 889w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 9: </strong>High-level flow of a Retrieval-Augmented Generation (RAG) system — the RAG pipeline retrieves relevant context, sends it to the Ollama server for model inference, and returns the final LLM response to the user (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Health-Check-and-Model-Discovery"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Health-Check-and-Model-Discovery">Health Check and Model Discovery</a></h3>



<p>Before we make any generation calls, it’s good practice to confirm that Ollama is actually reachable.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="15">def ollama_available() -> bool:
    try:
        r = requests.get(OLLAMA_TAGS_URL, timeout=2)
        return r.status_code == 200
    except requests.RequestException:
        return False
</pre>



<p>If this returns <code data-enlighter-language="python" class="EnlighterJSRAW">False</code>, your RAG pipeline will still work — it will simply skip generation or return a warning message.</p>



<p>Similarly, you can list all locally available models:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="16">def list_ollama_models() -> List[str]:
    """Return a list of available local Ollama model names (empty if unreachable)."""
    try:
        resp = requests.get(OLLAMA_TAGS_URL, timeout=2)
        resp.raise_for_status()
        data = resp.json()
    except (requests.RequestException, ValueError):
        return []
    models = []
    for m in data.get("models", []):
        name = m.get("name", "")
        if name.endswith(":latest"):
            name = name.rsplit(":", 1)[0]
        if name:
            models.append(name)
    return sorted(set(models))
</pre>



<p>This lets you dynamically query what’s installed (e.g., <code data-enlighter-language="python" class="EnlighterJSRAW">llama3</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">mistral</code>, or <code data-enlighter-language="python" class="EnlighterJSRAW">gemma2</code>).</p>



<p>If you’re running an interactive RAG app, this list can populate a dropdown for user selection.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Making-the-Ollama-Call"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Making-the-Ollama-Call">Making the Ollama Call</a></h3>



<p>Here’s the heart of your LLM connector:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="17">def call_ollama(model: str, prompt: str, stream: bool = False) -> str:
    """Call Ollama using python client if available else raw HTTP."""
    if ollama is not None:
        try:
            if stream:
                out = []
                for chunk in ollama.generate(model=model, prompt=prompt, stream=True):
                    out.append(chunk.get("response", ""))
                return "".join(out)
            else:
                resp = ollama.generate(model=model, prompt=prompt)
                return resp.get("response", "")
        except Exception:
            pass
</pre>



<ul class="wp-block-list">
<li>If the <code data-enlighter-language="python" class="EnlighterJSRAW">ollama</code> library is installed, the function uses its <strong>official Python client</strong> for better efficiency and streaming support.</li>



<li>If not, it falls back to a <strong>manual HTTP request</strong>:</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="18">payload = {"model": model, "prompt": prompt, "stream": stream}
resp = requests.post(OLLAMA_API_URL, json=payload, timeout=120, stream=stream)
</pre>



<p>It even supports streaming tokens one by one — useful for building chat UIs or dashboards that display the answer as it’s generated.</p>



<p><em>Why this dual approach?</em></p>



<p>Not all environments (e.g., Docker containers or lightweight cloud runners) have the <code data-enlighter-language="python" class="EnlighterJSRAW">ollama</code> Python package installed, but they can still access the REST (Representational State Transfer) API.</p>
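<p>When <code data-enlighter-language="python" class="EnlighterJSRAW">stream=True</code>, the REST endpoint returns newline-delimited JSON objects, each carrying a partial token in its <code data-enlighter-language="python" class="EnlighterJSRAW">response</code> field, with a final object marked <code data-enlighter-language="python" class="EnlighterJSRAW">"done": true</code>. A sketch of assembling such a stream, simulated without a network call (the helper name <code data-enlighter-language="python" class="EnlighterJSRAW">join_stream</code> is hypothetical):</p>

```python
import json

def join_stream(lines):
    """Assemble a full answer from Ollama-style streaming lines."""
    out = []
    for line in lines:
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):  # final object signals end of generation
            break
    return "".join(out)

# Simulated stream, shaped like /api/generate output with stream=true
stream = [
    '{"response": "FAISS ", "done": false}',
    '{"response": "is fast.", "done": false}',
    '{"response": "", "done": true}',
]
print(join_stream(stream))  # FAISS is fast.
```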



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Optional-Cloud-Fallback-OpenAI"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Optional-Cloud-Fallback-OpenAI">Optional: Cloud Fallback (OpenAI)</a></h3>



<p>There’s a commented-out section providing an optional fallback to <strong>OpenAI’s API</strong>.</p>



<p>If uncommented, you can quickly switch between local and cloud models (e.g., <code data-enlighter-language="python" class="EnlighterJSRAW">gpt-4o-mini</code>).</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="19"># OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
# openai.api_key = os.getenv("OPENAI_API_KEY")
# def call_openai(prompt: str, model: str = OPENAI_MODEL) -> str:
#     ...
</pre>



<p>This flexibility lets you deploy the same RAG logic on-premises (Ollama) or in the cloud (OpenAI).</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Selecting-the-Top-k-Relevant-Chunks"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Selecting-the-Top-k-Relevant-Chunks">Selecting the Top-k Relevant Chunks</a></h3>



<p>Once a user asks a question, we compute its embedding and retrieve similar text chunks:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="20">def select_top_k(question_emb, embeddings, texts, metadata, k=5, sim_threshold=0.35):
    sims = embeddings @ question_emb  # cosine if normalized
    ranked = np.argsort(-sims)
    results = []
    for idx in ranked[:k * 2]:
        score = float(sims[idx])
        if score &lt; sim_threshold:
            break
        results.append({
            "id": metadata[idx]["id"],
            "text": texts[idx],
            "score": score,
            "topic": metadata[idx].get("topic", "unknown")
        })
        if len(results) >= k:
            break
    return results
</pre>



<p>This function:</p>



<ul class="wp-block-list">
<li>Computes cosine similarities between the query and all embeddings.</li>



<li>Ranks them, filters by a similarity threshold, and returns the top-<em>k</em> chunks with metadata.</li>
</ul>



<p>This lightweight retrieval replaces the need to re-query FAISS every time — perfect for quick experiments or small datasets.</p>
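<p>At its core, this ranking is just dot products over unit-normalized vectors. A self-contained sketch of the same math on toy 2-D "embeddings" (vectors and dimensions are illustrative, not from the real model):</p>

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query, vectors, k=2):
    """Rank by dot product; with unit vectors this equals cosine similarity."""
    q = normalize(query)
    sims = [sum(a * b for a, b in zip(q, normalize(v))) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: -sims[i])
    return [(i, round(sims[i], 3)) for i in order[:k]]

docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]  # toy 2-D "embeddings"
print(top_k([1.0, 0.1], docs))  # [(0, 0.995), (1, 0.774)]
```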



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Splitting-Answers-into-Sentences"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Splitting-Answers-into-Sentences">Splitting Answers into Sentences</a></h3>



<p>Once the LLM produces an answer, we may want to analyze it sentence-by-sentence.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="21">def _sentence_split(text: str) -> List[str]:
    raw = re.split(r'(?&lt;=[.!?])\s+|\n+', text.strip())
    return [s.strip() for s in raw if s and not s.isspace()]
</pre>



<p>This regex-based approach avoids heavy NLP libraries and still performs well for most English prose.</p>
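<p>You can try the splitter standalone. This sketch reproduces the same regex (renamed without the leading underscore for convenience):</p>

```python
import re
from typing import List

def sentence_split(text: str) -> List[str]:
    # Split on whitespace that follows ., !, or ?, and on newlines.
    raw = re.split(r'(?<=[.!?])\s+|\n+', text.strip())
    return [s.strip() for s in raw if s and not s.isspace()]

print(sentence_split("FAISS is fast. Really!\nUse HNSW?"))
# ['FAISS is fast.', 'Really!', 'Use HNSW?']
```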



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Computing-Sentence-Support"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Computing-Sentence-Support">Computing Sentence Support</a></h3>



<p>A unique feature of this pipeline is its ability to score each sentence in the LLM’s answer by how well it aligns with the retrieved context chunks.</p>



<p>This helps determine which parts of the generated answer are actually supported by the retrieved evidence — forming the basis for citations such as [1], [2].</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="22">def _compute_support(sentences, retrieved, metadata, embeddings, model):
    id_to_idx = {m["id"]: i for i, m in enumerate(metadata)}
    chunk_vecs, ranks = [], []
    for rank, r in enumerate(retrieved, start=1):
        idx = id_to_idx.get(r["id"])
        if idx is None:
            continue
        chunk_vecs.append(embeddings[idx])
        ranks.append(rank)
    if not chunk_vecs:
        return [], sentences

    chunk_matrix = np.vstack(chunk_vecs)
    sent_embs = model.encode(sentences, normalize_embeddings=True, convert_to_numpy=True)
</pre>



<p>Each sentence is embedded and compared to the embeddings of the top-<em>k</em> retrieved chunks.</p>



<p>This yields two useful artifacts:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">support_rows</code>: structured table of support scores</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">cited_sentences</code>: answer text annotated with citations such as [1], [2]</li>
</ul>
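<p>The heart of the support computation is a similarity matrix between sentence embeddings and chunk embeddings: the best-matching chunk per sentence gives the citation rank, and its similarity gives the support score. A toy sketch with hand-made unit vectors (the vectors and dimensions are illustrative stand-ins for real embeddings):</p>

```python
import numpy as np

# Toy unit vectors standing in for sentence and chunk embeddings.
sent_embs = np.array([[1.0, 0.0], [0.0, 1.0]])     # 2 answer sentences
chunk_matrix = np.array([[0.6, 0.8], [1.0, 0.0]])  # 2 retrieved chunks

# Cosine similarity of every sentence against every chunk (rows: sentences).
sims = sent_embs @ chunk_matrix.T

# Best-supporting chunk and its score per sentence: the basis for [n] citations.
best_rank = sims.argmax(axis=1) + 1  # 1-based rank of the supporting chunk
best_score = sims.max(axis=1)
print(best_rank.tolist(), best_score.tolist())  # [2, 1] [1.0, 0.8]
```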



<h4 class="wp-block-heading">Example: Sentence-to-Context Alignment</h4>



<p>For example, suppose the user asked:</p>



<p><em>“What is Streamlit used for?”</em></p>



<p>The retriever would return the top-<em>k</em> most relevant chunks for that query.</p>



<p>Each sentence in the model’s generated answer is then compared to the retrieved chunks to determine how well it is supported (<strong>Table 3</strong>).</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-100-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="330" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-100-1024x330.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52934" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-100.png?size=126x41&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-100-300x97.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-100.png?size=378x122&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-100.png?size=504x162&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-100.png?size=630x203&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-100-768x247.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-100-1024x330.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-100-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Table 3:</strong> Example mapping of answer sentences to their retrieved context ranks and similarity scores.</figcaption></figure></div>


<p><em><strong>Note:</strong></em> The context ranks come from the retrieval step based on the query <em>“What is Streamlit used for?”</em>. The similarity scores show how strongly each sentence aligns with those retrieved chunks — indicating how well each part of the generated answer is supported by evidence.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Formatting-and-Styling"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Formatting-and-Styling">Formatting and Styling</a></h3>



<p>To display results nicely, the <code data-enlighter-language="python" class="EnlighterJSRAW">_apply_style()</code> helper supports different output styles:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="23">def _apply_style(answer, style, cited_sentences):
    if style == "bullets" and cited_sentences:
        return "\n" + "\n".join(f"- {s}" for s in cited_sentences)
    return answer
</pre>



<p>This allows both <strong>paragraph</strong> and <strong>bullet</strong>-point summaries with inline citations — perfect for user-facing dashboards.</p>
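<p>A quick standalone demonstration of the two styles (renamed without the leading underscore; the example sentences are made up):</p>

```python
def apply_style(answer, style, cited_sentences):
    # Bullet mode rebuilds the answer from the cited sentences; otherwise pass through.
    if style == "bullets" and cited_sentences:
        return "\n" + "\n".join(f"- {s}" for s in cited_sentences)
    return answer

cited = ["Streamlit builds data apps. [1]", "It needs only Python. [2]"]
print(apply_style("ignored", "bullets", cited))
# - Streamlit builds data apps. [1]
# - It needs only Python. [2]
print(apply_style("Plain answer.", "paragraph", cited))  # Plain answer.
```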



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-The-Core-generate-rag-response"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-The-Core-generate-rag-response">The Core: generate_rag_response()</a></h3>



<p>Finally, the star of this file — the main RAG generation function:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="24">def generate_rag_response(question, model, embeddings, texts, metadata,
                          llm_model_name="llama3", top_k=5,
                          allow_synthesis=False, force_strict=False,
                          add_citations=False, compute_support=False,
                          style="paragraph") -> Dict:
</pre>



<p>This function orchestrates the full retrieval-generation pipeline:</p>



<h4 class="wp-block-heading">Step 1: Detect intent and embeddings</h4>



<p>It embeds the question and automatically decides whether to allow synthesis:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="25">if any(pat in q_lower for pat in config.AUTO_SYNTHESIS_PATTERNS):
    allow_synthesis = True
    heuristic_triggered = True
</pre>



<p>So if a query contains words like <em>“why”</em> or <em>“benefits”</em>, the model automatically switches to a paraphrasing mode instead of strict extraction.</p>
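<p>The heuristic itself is a simple substring check. A sketch of the idea (the pattern list below is illustrative; the real one lives in <code data-enlighter-language="python" class="EnlighterJSRAW">config.AUTO_SYNTHESIS_PATTERNS</code>):</p>

```python
# Illustrative pattern list; the real values live in config.AUTO_SYNTHESIS_PATTERNS.
AUTO_SYNTHESIS_PATTERNS = ("why", "benefit", "advantage", "compare")

def should_synthesize(question: str) -> bool:
    """Enable paraphrasing mode when the question asks for reasoning."""
    q_lower = question.lower()
    return any(pat in q_lower for pat in AUTO_SYNTHESIS_PATTERNS)

print(should_synthesize("Why normalize embeddings?"))  # True
print(should_synthesize("What is IVF indexing?"))      # False
```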



<h4 class="wp-block-heading">Step 2: Retrieve top-<em>k</em> chunks</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="26">top = select_top_k(q_emb, embeddings, texts, metadata, k=top_k)
prompt = build_prompt([r["text"] for r in top], question, allow_synthesis=allow_synthesis)
</pre>



<h4 class="wp-block-heading">Step 3: Generate via LLM</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="27">if not ollama_available():
    answer = "[Ollama not available at base URL.]"
else:
    answer = call_ollama(llm_model_name, prompt)
</pre>



<h4 class="wp-block-heading">Step 4: Optional post-processing</h4>



<p>If citations or support scoring are enabled:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="28">sentences = _sentence_split(answer)
support_rows, cited_sentences = _compute_support(sentences, top, metadata, embeddings, model)
answer = _apply_style(answer, style, cited_sentences)
</pre>



<p>Finally, it returns a structured dictionary — containing everything from the retrieved context to the generated answer and support metrics.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Summary-of-the-Utilities"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Summary-of-the-Utilities">Summary of the Utilities</a></h3>



<p>The <code data-enlighter-language="python" class="EnlighterJSRAW">rag_utils.py</code> file provides a robust and extensible RAG backbone:</p>



<ul class="wp-block-list">
<li><strong>Local-first design:</strong> works seamlessly with Ollama or over HTTP</li>



<li><strong>Hybrid retrieval:</strong> embedding search + FAISS indexes</li>



<li><strong>Explainable outputs:</strong> sentence-level support and citations</li>



<li><strong>Prompt control:</strong> configurable synthesis vs. strict modes</li>



<li><strong>Output flexibility:</strong> paragraph or bullet styles, JSON export</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-109-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="253" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-109-1024x253.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52977" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-109.png?size=126x31&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-109-300x74.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-109.png?size=378x93&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-109.png?size=504x125&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-109.png?size=630x156&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-109-768x190.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-109-1024x253.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-109-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-109-1536x379.png?lossy=2&amp;strip=1&amp;webp=1 1536w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 10:</strong> A Retrieval-Augmented Generation (RAG) pipeline powered by Ollama — user queries are encoded, relevant context is fetched using FAISS, prompts are built and passed to the model, and the final answer is generated with citations (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Running-a-Local-RAG-Pipeline-with-Ollama-and-FAISS"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Running-a-Local-RAG-Pipeline-with-Ollama-and-FAISS">Running a Local RAG Pipeline with Ollama and FAISS</a></h2>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Imports-and-Module-Wiring"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Imports-and-Module-Wiring">Imports and Module Wiring</a></h3>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="29">"""

Steps:
1. Load embeddings &amp; indexes (or build fallbacks)
2. Accept user question(s)
3. Retrieve top-k relevant chunks
4. Construct prompt &amp; call Ollama (fallback to placeholder if unavailable)
5. Display answer with retrieved context &amp; scores
"""
from __future__ import annotations

import argparse
import json
from pathlib import Path

import numpy as np
from rich import print
from rich.table import Table

from pyimagesearch import config
from pyimagesearch.embeddings_utils import load_embeddings, load_corpus, get_model, generate_embeddings
from pyimagesearch.vector_search_utils import build_flat_index, load_index, build_hnsw_index
from pyimagesearch.rag_utils import generate_rag_response, list_ollama_models, ollama_available
</pre>



<p><strong>What this sets up:</strong></p>



<ul class="wp-block-list">
<li>CLI (command line interface) flags (<code data-enlighter-language="python" class="EnlighterJSRAW">argparse</code>), pretty terminal output (<code data-enlighter-language="python" class="EnlighterJSRAW">rich</code>), NumPy for arrays.</li>



<li>Pulls in <strong>config paths</strong>, <strong>embedding helpers</strong>, <strong>FAISS index builders and loaders</strong>, the <strong>RAG core</strong> (<code data-enlighter-language="python" class="EnlighterJSRAW">generate_rag_response</code>), and Ollama helpers.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Ensure-Embeddings-load-or-build-once"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Ensure-Embeddings-load-or-build-once">Ensure Embeddings (load or build once)</a></h3>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="30">def ensure_embeddings(corpus_path=None, meta_path=None):
    if config.EMBEDDINGS_PATH.exists():
        emb, meta = load_embeddings()
        texts, _ = load_corpus(corpus_path or config.CORPUS_PATH, meta_path or config.CORPUS_META_PATH)
        return emb, meta, texts
    texts, meta = load_corpus(corpus_path or config.CORPUS_PATH, meta_path or config.CORPUS_META_PATH)
    model = get_model()
    emb = generate_embeddings(texts, model=model)
    from pyimagesearch.embeddings_utils import save_embeddings
    save_embeddings(emb, meta)
    return emb, meta, texts
</pre>



<p><strong>What it does (and why):</strong></p>



<ul class="wp-block-list">
<li>If <code data-enlighter-language="python" class="EnlighterJSRAW">data/output/embeddings.npy</code> is present, it <strong>loads</strong> the embeddings and aligned metadata, then reads the current corpus to ensure your text list is up to date.</li>



<li>If not present, it <strong>embeds</strong> the corpus with SentenceTransformer and <strong>caches</strong> both artifacts to disk for speed on re-runs.</li>
</ul>
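<p>This load-or-build caching pattern generalizes beyond embeddings. A minimal self-contained sketch using <code data-enlighter-language="python" class="EnlighterJSRAW">np.save</code>/<code data-enlighter-language="python" class="EnlighterJSRAW">np.load</code> (the helper name and the toy build functions are hypothetical):</p>

```python
import tempfile
from pathlib import Path

import numpy as np

def load_or_build(path: Path, build):
    """Load a cached embedding matrix if present; otherwise build and cache it."""
    if path.exists():
        return np.load(path)
    emb = build()
    np.save(path, emb)
    return emb

with tempfile.TemporaryDirectory() as d:
    cache = Path(d) / "embeddings.npy"
    first = load_or_build(cache, lambda: np.ones((3, 4), dtype="float32"))    # builds
    second = load_or_build(cache, lambda: np.zeros((3, 4), dtype="float32"))  # loads cache
    print(np.array_equal(first, second))  # True
```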



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Ensure-Indexes-Flat-must-exist-HNSW-optional"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Ensure-Indexes-Flat-must-exist-HNSW-optional">Ensure Indexes (Flat must exist; HNSW is optional)</a></h3>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="31">def ensure_indexes(embeddings):
    # Try load flat
    idx = None
    if config.FLAT_INDEX_PATH.exists():
        try:
            from pyimagesearch.vector_search_utils import load_index
            idx = load_index(config.FLAT_INDEX_PATH)
        except Exception:
            idx = None
    if idx is None:
        idx = build_flat_index(embeddings)
    # Optional: attempt HNSW
    hnsw = None
    if config.HNSW_INDEX_PATH.exists():
        try:
            hnsw = load_index(config.HNSW_INDEX_PATH)
        except Exception:
            hnsw = None
    else:
        try:
            hnsw = build_hnsw_index(embeddings)
        except Exception:
            hnsw = None
    return idx, hnsw
</pre>



<p><strong>What it does (and why):</strong></p>



<ul class="wp-block-list">
<li><strong>Flat index (exact, inner product):</strong> Attempts to load from disk; if missing, builds from the embedding matrix. This guarantees you always have a correct baseline.</li>



<li><strong>HNSW (approximate, fast):</strong> Loads if available; otherwise builds the index. If FAISS isn’t installed with HNSW support, it <strong>fails gracefully</strong> and returns <code data-enlighter-language="python" class="EnlighterJSRAW">None</code>.</li>



<li><strong>Returns:</strong> A tuple (<code data-enlighter-language="python" class="EnlighterJSRAW">flat</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">hnsw</code>) for downstream use.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Interactive-QA-Loop-Optional-Mode"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Interactive-QA-Loop-Optional-Mode">Interactive Q&amp;A Loop — Optional Mode</a></h3>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="32">def interactive_loop(model, embeddings, texts, metadata, llm_model: str, top_k: int, allow_synth: bool):
    print("[bold cyan]Enter questions (type 'exit' to quit).[/bold cyan]")
    while True:
        try:
            q = input("Question> ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\n[red]Exiting.[/red]")
            break
        if not q:
            continue
        if q.lower() in {"exit", "quit"}:
            break
        result = generate_rag_response(q, model, embeddings, texts, metadata, llm_model_name=llm_model, top_k=top_k, allow_synthesis=allow_synth)
        show_result(result)
</pre>



<p><strong>What it does (and why):</strong></p>



<ul class="wp-block-list">
<li>Lets you <strong>chat</strong> with your local RAG system.</li>



<li>For each typed question, calls <code data-enlighter-language="python" class="EnlighterJSRAW">generate_rag_response(...)</code> — retrieves context → builds the prompt → calls Ollama → formats the answer — and prints a rich table of the results.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Pretty-Printing-the-Answer-and-Context"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Pretty-Printing-the-Answer-and-Context">Pretty Printing the Answer and Context (optional prompt/support)</a></h3>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="33">def show_result(result, show_prompt: bool = False, show_support: bool = False):
    print("\n[bold green]Answer[/bold green]:")
    print(result["answer"].strip())
    synth_flag = "yes" if result.get("synthesis_used") else "no"
    if result.get("synthesis_used") and result.get("synthesis_heuristic"):
        print(f"[dim]Synthesis: {synth_flag} (auto-enabled by heuristic)\n[/dim]")
    else:
        print(f"[dim]Synthesis: {synth_flag}\n[/dim]")
    table = Table(title="Retrieved Context")
    table.add_column("Rank")
    table.add_column("ID")
    table.add_column("Score", justify="right")
    table.add_column("Snippet")
    for i, r in enumerate(result["retrieved"], start=1):
        snippet = r["text"][:80] + ("..." if len(r["text"]) > 80 else "")
        table.add_row(str(i), r["id"], f"{r['score']:.3f}", snippet)
    print(table)
    if show_prompt:
        print("[bold yellow]\n--- Prompt Sent to LLM ---[/bold yellow]")
        print(result.get("prompt", "[prompt missing]"))
    if show_support and result.get("support"):
        support_table = Table(title="Sentence Support Scores")
        support_table.add_column("Sentence")
        support_table.add_column("Rank")
        support_table.add_column("Score", justify="right")
        for row in result["support"]:
            support_table.add_row(row["sentence"], str(row["citation_rank"]), f"{row['support_score']:.3f}")
        print(support_table)
</pre>



<p><strong>What it does (and why):</strong></p>



<ul class="wp-block-list">
<li>Prints the <strong>final answer</strong> and indicates whether <strong>synthesis</strong> was used (including whether it was auto-enabled by the heuristic).</li>



<li>Renders a <strong>Retrieved Context</strong> table showing rank, ID, similarity score, and a clean snippet.</li>



<li>If <code data-enlighter-language="python" class="EnlighterJSRAW">--show-prompt</code> is used, prints the full prompt for transparency.</li>



<li>If <code data-enlighter-language="python" class="EnlighterJSRAW">--support-scores</code> is enabled, shows per-sentence <strong>support strength</strong> against the retrieved chunks — useful for debugging groundedness.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-CLI-Entry-Point-main-flags-loading-answering"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-CLI-Entry-Point-main-flags-loading-answering">CLI Entry Point (main) — flags, loading, answering</a></h3>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="34">def main():
    parser = argparse.ArgumentParser(description="Minimal RAG pipeline demo")
    parser.add_argument("--llm-model", default="llama3", help="Ollama model name (must be pulled beforehand, e.g. 'ollama pull llama3')")
    parser.add_argument("--top-k", type=int, default=config.DEFAULT_TOP_K)
    parser.add_argument("--corpus-path", type=str, help="Override corpus file path")
    parser.add_argument("--corpus-meta-path", type=str, help="Override corpus metadata path")
    parser.add_argument("--question", type=str, help="Single question to answer (skip interactive mode)")
    parser.add_argument("--allow-synthesis", action="store_true", help="Permit model to synthesize answer by combining provided context facts")
    parser.add_argument("--list-models", action="store_true", help="List available local Ollama models and exit")
    parser.add_argument("--show-prompt", action="store_true", help="Display the full constructed prompt for debugging/teaching")
    parser.add_argument("--strict", action="store_true", help="Force strict extractive mode (disable synthesis even if heuristic matches)")
    parser.add_argument("--citations", action="store_true", help="Annotate sentences with citation indices")
    parser.add_argument("--style", choices=["paragraph", "bullets"], default="paragraph", help="Answer formatting style")
    parser.add_argument("--support-scores", action="store_true", help="Compute and display per-sentence support scores")
    parser.add_argument("--json", action="store_true", help="Output full result JSON to stdout (suppresses pretty tables except retrieved context)")
    args = parser.parse_args()
    if args.list_models:
        if not ollama_available():
            print("[red]Ollama not reachable at default base URL. Start Ollama to list models.[/red]")
            return
        models = list_ollama_models()
        if not models:
            print("[yellow]No models returned. Pull some with: ollama pull llama3[/yellow]")
        else:
            print("[bold cyan]Available Ollama models:[/bold cyan]")
            for m in models:
                print(f" - {m}")
        return
    print(f"[bold magenta]Using LLM model:[/bold magenta] {args.llm_model}")
    print("[bold magenta]Loading embeddings...[/bold magenta]")
    embeddings, metadata, texts = ensure_embeddings(corpus_path=args.corpus_path, meta_path=args.corpus_meta_path)
    model = get_model()
    print("[bold magenta]Preparing indexes (flat + optional hnsw)...[/bold magenta]")
    flat, hnsw = ensure_indexes(embeddings)
    # NOTE: We use embedding matrix directly for retrieval selection in rag_utils (cosine) for transparency.
    if args.question:
        result = generate_rag_response(
            args.question,
            model,
            embeddings,
            texts,
            metadata,
            llm_model_name=args.llm_model,
            top_k=args.top_k,
            allow_synthesis=args.allow_synthesis,
            force_strict=args.strict,
            add_citations=args.citations,
            compute_support=args.support_scores,
            style=args.style,
        )
        if args.json:
            import json as _json
            print(_json.dumps(result, indent=2))
        show_result(result, show_prompt=args.show_prompt, show_support=args.support_scores)
    else:
        # For interactive mode we keep previous behavior (could extend flags similarly if desired)
        interactive_loop(model, embeddings, texts, metadata, args.llm_model, args.top_k, args.allow_synthesis)

    print("[green]\nFinished RAG demo.\n[/green]")
</pre>



<p><strong>What it does (and why):</strong></p>



<ul class="wp-block-list">
<li>Defines a <strong>rich set of flags</strong> to control the model, retrieval depth, strictness vs. synthesis, prompt visibility, citations, style, and JSON output.</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">--list-models</code> lets you sanity-check your local Ollama setup without running the full pipeline.</li>



<li>Loads or creates embeddings, prepares indexes, then either:
<ul class="wp-block-list">
<li>answers a <strong>single</strong> question (<code data-enlighter-language="python" class="EnlighterJSRAW">--question ...</code>), or</li>



<li>launches the <strong>interactive loop</strong>.</li>
</ul>
</li>



<li>Optional JSON output is useful for scripting or automated tests.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Standard-Python-Entrypoint"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Standard-Python-Entrypoint">Standard Python Entrypoint</a></h3>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="35">if __name__ == "__main__":
    main()
</pre>



<p><strong>What it does:</strong></p>



<ul class="wp-block-list">
<li>Runs the CLI when you execute <code data-enlighter-language="python" class="EnlighterJSRAW">python 03_rag_pipeline.py</code>.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Tiny-Gotchas-and-Tips"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Tiny-Gotchas-and-Tips">Tiny Gotchas and Tips</a></h2>



<ul class="wp-block-list">
<li>If FAISS was installed without HNSW support, <code data-enlighter-language="python" class="EnlighterJSRAW">ensure_indexes</code> will still work — it just will not provide an HNSW index. The Flat index is always available.</li>



<li>Make sure the Ollama model you request (e.g., <code data-enlighter-language="python" class="EnlighterJSRAW">llama3</code>) is <strong>pulled</strong> first:</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="36">ollama pull llama3
</pre>



<ul class="wp-block-list">
<li>You can view exactly what the model saw with:</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="37">python 03_rag_pipeline.py --question "What is IVF indexing?" --show-prompt
</pre>



<ul class="wp-block-list">
<li>For teaching and debugging groundedness:</li>
</ul>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="38">python 03_rag_pipeline.py --question "Why normalize embeddings?" --citations --support-scores
</pre>
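<p>If you are unsure whether your FAISS build includes HNSW (the first tip above), you can check directly. This small helper (the function name is ours, not part of the lesson’s code) tests for the <code>IndexHNSWFlat</code> attribute and degrades gracefully when FAISS is not installed at all:</p>

```python
def hnsw_available():
    """Return True only if an installed FAISS build exposes the HNSW index type."""
    try:
        import faiss  # works for both faiss-cpu and faiss-gpu builds
    except ImportError:
        return False
    return hasattr(faiss, "IndexHNSWFlat")

print("HNSW available:", hnsw_available())
```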



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-How-to-Run-a-Local-RAG-System-with-Ollama-and-FAISS"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-How-to-Run-a-Local-RAG-System-with-Ollama-and-FAISS">How to Run a Local RAG System with Ollama and FAISS</a></h2>



<p>Now that everything’s wired up — embeddings, FAISS indexes, and the RAG utilities — it’s time to see the full pipeline in action.</p>



<p>You can start by verifying your local Ollama setup and ensuring the model (e.g., <strong>Llama 3</strong>) is pulled:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="39">ollama pull llama3
</pre>



<p>Then, from your project root, launch the RAG pipeline:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="40">python 03_rag_pipeline.py --question "What is FAISS?" --show-prompt --support-scores
</pre>



<p>If you’d rather chat interactively:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="41">python 03_rag_pipeline.py
</pre>



<p>You’ll be greeted with a prompt like:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="42">Question> Why do we normalize embeddings?
</pre>



<p>and can exit at any time with <code data-enlighter-language="python" class="EnlighterJSRAW">exit</code> or <code data-enlighter-language="python" class="EnlighterJSRAW">Ctrl+C</code>.</p>

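<p>The exit handling can be sketched as a small REPL. Note this is a simplified stand-in: the real <code>interactive_loop</code> takes the model, embeddings, and corpus as arguments, whereas here a generic <code>answer_fn</code> callback and an injectable <code>input_fn</code> (handy for testing) take their place:</p>

```python
def interactive_loop(answer_fn, input_fn=input):
    """Read questions until 'exit'/'quit', Ctrl+C, or Ctrl+D."""
    answered = 0
    try:
        while True:
            question = input_fn("Question> ").strip()
            if question.lower() in {"exit", "quit"}:
                break
            if question:
                print(answer_fn(question))
                answered += 1
    except (KeyboardInterrupt, EOFError):
        pass  # Ctrl+C / Ctrl+D also end the session cleanly
    return answered

# Simulated session: two questions, then "exit".
scripted = iter(["Why normalize embeddings?", "What is FAISS?", "exit"])
count = interactive_loop(lambda q: f"(answer to: {q})",
                         input_fn=lambda prompt: next(scripted))
```

Catching <code>KeyboardInterrupt</code> inside the loop is what lets <code>Ctrl+C</code> end the session without a traceback.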



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Example-Output"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Example-Output">Example Output</a></h2>



<p>Here’s what a typical run looks like inside your terminal (<strong>Figure 11</strong>), while <strong>Figure 12</strong> illustrates the end-to-end flow from embeddings through FAISS retrieval to Ollama generation.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-102.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="529" height="646" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-102.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52938" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-102.png?size=126x154&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-102-246x300.png?lossy=2&amp;strip=1&amp;webp=1 246w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-102.png?size=378x462&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-102.png?lossy=2&amp;strip=1&amp;webp=1 529w" sizes="(max-width: 529px) 100vw, 529px" /></a><figcaption class="wp-element-caption"><strong>Figure 11:</strong> Example terminal output of the local RAG pipeline showing the answer, retrieved context, and sentence-level support scores (source: image by the author).</figcaption></figure></div>

<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-103.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="906" height="487" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-103.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52940" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-103.png?size=126x68&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-103-300x161.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-103.png?size=378x203&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-103.png?size=504x271&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-103.png?size=630x339&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-103-768x413.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-103.png?lossy=2&amp;strip=1&amp;webp=1 906w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 12:</strong> End-to-end flow of retrieval augmented generation using local embeddings, FAISS, and Ollama (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-What-You-Learned-Building-a-Production-Ready-Local-RAG-System-with-Ollama-and-FAISS"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-What-You-Learned-Building-a-Production-Ready-Local-RAG-System-with-Ollama-and-FAISS">What You Learned: Building a Production-Ready Local RAG System with Ollama and FAISS</a></h2>



<p>In this tutorial, you built and tested a complete, local <strong>Retrieval-Augmented Generation (RAG)</strong> system:</p>



<ul class="wp-block-list">
<li>Connected the FAISS vector store built in Lesson 2 to a local LLM served by Ollama.</li>



<li>Used embeddings to retrieve semantically relevant chunks from your corpus.</li>



<li>Constructed prompts dynamically and generated grounded answers, optionally including citations and synthesis.</li>
</ul>



<p>This closes the loop of your <strong>vector</strong> → <strong>retrieval</strong> → <strong>generation</strong> workflow — forming the foundation for more advanced, production-ready RAG pipelines.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<div id="pitch" style="padding: 40px; width: 100%; background-color: #F4F6FA;">
	<h3>What's next? We recommend <a target="_blank" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend">PyImageSearch University</a>.</h3>

	<script src="https://fast.wistia.com/embed/medias/kno0cmko2z.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_kno0cmko2z videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img decoding="async" src="https://fast.wistia.com/embed/medias/kno0cmko2z/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>

	<div style="margin-top: 32px; margin-bottom: 32px; ">
		<strong>Course information:</strong><br/>
86+ total classes &#8226; 115+ hours of on-demand code walkthrough videos &#8226; Last updated: April 2026<br/>
		<span style="color: #169FE6;">★★★★★</span> 4.84 (128 Ratings) • 16,000+ Students Enrolled
	</div>

	<p><strong>I strongly believe that if you had the right teacher you could <em>master</em> computer vision and deep learning.</strong></p>

	<p>Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?</p>

	<p>That’s <em>not</em> the case.</p>

	<p>All you need to master computer vision and deep learning is for someone to explain things to you in <em>simple, intuitive</em> terms. <em>And that’s exactly what I do</em>. My mission is to change education and how complex Artificial Intelligence topics are taught.</p>

	<p>If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to <em>successfully</em> and <em>confidently</em> apply computer vision to your work, research, and projects. Join me in computer vision mastery.</p>

	<p><strong>Inside PyImageSearch University you'll find:</strong></p>

	<ul style="margin-left: 0px;">
		<li style="list-style: none;">&check; <strong>86+ courses</strong> on essential computer vision, deep learning, and OpenCV topics</li>
		<li style="list-style: none;">&check; <strong>86 Certificates</strong> of Completion</li>
		<li style="list-style: none;">&check; <strong>115+ hours hours</strong> of on-demand video</li>
		<li style="list-style: none;">&check; <strong>Brand new courses released <em>regularly</em></strong>, ensuring you can keep up with state-of-the-art techniques</li>
		<li style="list-style: none;">&check; <strong>Pre-configured Jupyter Notebooks in Google Colab</strong></li>
		<li style="list-style: none;">&check; Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)</li>
		<li style="list-style: none;">&check; Access to <strong>centralized code repos for <em>all</em> 540+ tutorials</strong> on PyImageSearch</li>
		<li style="list-style: none;">&check; <strong> Easy one-click downloads</strong> for code, datasets, pre-trained models, etc.</li>
		<li style="list-style: none;">&check; <strong>Access</strong> on mobile, laptop, desktop, etc.</li>
	</ul>

	<p style="text-align: center;">
		<a target="_blank" class="button link" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend" style="background-color: #6DC713; border-bottom: none;">Click here to join PyImageSearch University</a>
	</p>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Summary"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Summary">Summary</a></h2>



<p>In this final lesson, you brought everything together (i.e., embeddings, vector search, and generation) to build a complete <strong>Retrieval-Augmented Generation (RAG)</strong> pipeline from scratch. You began by understanding how retrieval connects to language models, bridging the gap between semantic search and contextual reasoning.</p>



<p>Next, you explored how the system uses <strong>SentenceTransformer embeddings</strong> and <strong>FAISS indexes</strong> to fetch relevant context from a corpus before generating an answer. You then examined the RAG utilities in detail — from <code data-enlighter-language="python" class="EnlighterJSRAW">ollama_available()</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">call_ollama()</code>, which handle model calls and fallbacks, to <code data-enlighter-language="python" class="EnlighterJSRAW">select_top_k()</code>, which performs the crucial retrieval step by ranking and filtering results based on cosine similarity. You also saw how automatic synthesis heuristics determine when to allow the LLM to combine information creatively, adding flexibility to the pipeline.</p>
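<p>The retrieval step described above can be approximated in a few lines of NumPy. This is a hedged sketch, not the lesson’s actual <code>select_top_k()</code> (whose signature differs); it assumes all embeddings are L2-normalized so that a dot product equals cosine similarity:</p>

```python
import numpy as np

def select_top_k(query_vec, embeddings, k=3, min_score=0.0):
    """Rank corpus rows by cosine similarity to the query (simplified sketch)."""
    scores = embeddings @ query_vec          # dot products == cosine (normalized)
    order = np.argsort(-scores)[:k]          # highest similarity first
    return [(int(i), float(scores[i])) for i in order if scores[i] >= min_score]

# Tiny synthetic corpus of 5 normalized 8-dimensional vectors.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(5, 8))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = corpus[2]                            # identical to row 2, so row 2 ranks first
top = select_top_k(query, corpus, k=2)
```

The <code>min_score</code> threshold mirrors the filtering idea in the lesson: chunks that are barely similar to the query are dropped rather than handed to the LLM.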



<p>Then came the <strong>driver script</strong>, where the theoretical pieces transformed into a working application. You walked through the full flow — loading embeddings, preparing indexes, retrieving the top-<em>k</em> most relevant chunks, and generating context-aware answers via Ollama. You also learned how to add citations, measure support scores, and switch between strict and synthesis modes for transparent reasoning.</p>



<p>Finally, you ran the pipeline locally, queried your own data, and observed meaningful, grounded responses generated by a local LLM. With this, you completed a true end-to-end workflow — from encoding and indexing knowledge to retrieving and generating answers — running fully offline and powered by <strong>FAISS and Ollama</strong>.</p>



<p>In short, you did not just learn RAG — you built it.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Citation-Information"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Citation-Information">Citation Information</a></h3>



<p><strong>Singh, V.</strong> “Vector Search Using Ollama for Retrieval-Augmented Generation (RAG),” <em>PyImageSearch</em>, P. Chugh, S. Huot, A. Sharma, and P. Thakur, eds., 2026, <a href="https://pyimg.co/q68nv" target="_blank" rel="noreferrer noopener">https://pyimg.co/q68nv</a></p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="classic" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)" data-enlighter-group="43">@incollection{Singh_2026_vector-search-using-ollama-for-rag,
  author = {Vikram Singh},
  title = {{Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)}},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/q68nv},
}
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), </strong><em><strong>simply enter your email address in the form below!</strong></em></p>



<div id="download-the-code" class="post-cta-wrap">
<div class="gpd-post-cta">
	<div class="gpd-post-cta-content">
		

			<div class="gpd-post-cta-top">
				<div class="gpd-post-cta-top-image"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1" alt="" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1 410w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=126x174&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=252x348&lossy=2&strip=1&webp=1 252w" sizes="(max-width: 410px) 100vw, 410px" /></div>
				
				<div class="gpd-post-cta-top-title"><h4>Download the Source Code and FREE 17-page Resource Guide</h4></div>
				<div class="gpd-post-cta-top-desc"><p>Enter your email address below to get a .zip of the code and a <strong>FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning.</strong> Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!</p></div>


			</div>

			<div class="gpd-post-cta-bottom">
				<form id="footer-cta-code" class="footer-cta" action="https://www.getdrip.com/forms/4130035/submissions" method="post" target="blank" data-drip-embedded-form="4130035">
					<input name="fields[email]" type="email" value="" placeholder="Your email address" class="form-control" />

					<button type="submit">Download the code!</button>

					<div style="display: none;" aria-hidden="true"><label for="website">Website</label><br /><input type="text" id="website" name="website" tabindex="-1" autocomplete="false" value="" /></div>
				</form>
			</div>


		
	</div>

</div>
</div>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/02/23/vector-search-using-ollama-for-retrieval-augmented-generation-rag/">Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained</title>
		<link>https://pyimagesearch.com/2026/02/16/vector-search-with-faiss-approximate-nearest-neighbor-ann-explained/</link>
		
		<dc:creator><![CDATA[Vikram Singh]]></dc:creator>
		<pubDate>Mon, 16 Feb 2026 13:45:00 +0000</pubDate>
				<category><![CDATA[Deep Learning]]></category>
		<category><![CDATA[Retrieval Augmented Generation]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[Vector Databases]]></category>
		<category><![CDATA[ann]]></category>
		<category><![CDATA[approximate nearest neighbor]]></category>
		<category><![CDATA[cosine similarity]]></category>
		<category><![CDATA[embeddings]]></category>
		<category><![CDATA[faiss]]></category>
		<category><![CDATA[flat index]]></category>
		<category><![CDATA[hnsw]]></category>
		<category><![CDATA[ivf]]></category>
		<category><![CDATA[rag]]></category>
		<category><![CDATA[recall at k]]></category>
		<category><![CDATA[retrieval augmented generation]]></category>
		<category><![CDATA[semantic search]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[vector database]]></category>
		<category><![CDATA[vector search]]></category>
		<guid isPermaLink="false">https://pyimagesearch.com/?p=52740</guid>

					<description><![CDATA[<p>Table of Contents Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained From Exact to Approximate: Why Indexing Matters The Trouble with Brute-Force Search The Curse of Dimensionality Enter the Approximate Nearest Neighbor (ANN) Accuracy vs. Latency: The Core Trade-Off&#8230;</p>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/02/16/vector-search-with-faiss-approximate-nearest-neighbor-ann-explained/">Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<hr class="wp-block-separator has-alpha-channel-opacity" id="TOC"/>




<div class="toc">
<hr class="TOC"/>
<p class="has-large-font-size"><strong>Table of Contents</strong></p>

<ul>
  <li id="TOC-h1-Vector-Search-with-FAISS-Approximate-Nearest-Neighbor-ANN-Explained">
    <a rel="noopener" target="_blank" href="#h1-Vector-Search-with-FAISS-Approximate-Nearest-Neighbor-ANN-Explained">Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained</a>
  </li>

  <li id="TOC-h2-From-Exact-to-Approximate-Why-Indexing-Matters">
    <a rel="noopener" target="_blank" href="#h2-From-Exact-to-Approximate-Why-Indexing-Matters">From Exact to Approximate: Why Indexing Matters</a>
  </li>
  <ul>
    <li id="TOC-h3-The-Trouble-with-Brute-Force-Search">
      <a rel="noopener" target="_blank" href="#h3-The-Trouble-with-Brute-Force-Search">The Trouble with Brute-Force Search</a>
    </li>
    <li id="TOC-h3-The-Curse-of-Dimensionality">
      <a rel="noopener" target="_blank" href="#h3-The-Curse-of-Dimensionality">The Curse of Dimensionality</a>
    </li>
    <li id="TOC-h3-Enter-the-Approximate-Nearest-Neighbor-ANN">
      <a rel="noopener" target="_blank" href="#h3-Enter-the-Approximate-Nearest-Neighbor-ANN">Enter the Approximate Nearest Neighbor (ANN)</a>
    </li>
    <li id="TOC-h3-Accuracy-vs-Latency-The-Core-Trade-Off">
      <a rel="noopener" target="_blank" href="#h3-Accuracy-vs-Latency-The-Core-Trade-Off">Accuracy vs. Latency: The Core Trade-Off</a>
    </li>
    <li id="TOC-h3-What-Youll-Learn-Next">
      <a rel="noopener" target="_blank" href="#h3-What-Youll-Learn-Next">What You’ll Learn Next</a>
    </li>
  </ul>

  <li id="TOC-h2-Inside-FAISS-How-Vector-Indexing-Works">
    <a rel="noopener" target="_blank" href="#h2-Inside-FAISS-How-Vector-Indexing-Works">Inside FAISS: How Vector Indexing Works</a>
  </li>
  <ul>
    <li id="TOC-h3-What-Is-FAISS">
      <a rel="noopener" target="_blank" href="#h3-What-Is-FAISS">What Is FAISS?</a>
    </li>
    <li id="TOC-h3-Why-We-Need-Index-Structures">
      <a rel="noopener" target="_blank" href="#h3-Why-We-Need-Index-Structures">Why We Need Index Structures</a>
    </li>
    <li id="TOC-h3-Flat-Index-The-Exact-Baseline">
      <a rel="noopener" target="_blank" href="#h3-Flat-Index-The-Exact-Baseline">Flat Index: The Exact Baseline</a>
    </li>
    <li id="TOC-h3-IVF-Flat-Coarse-Clustering-for-Speed">
      <a rel="noopener" target="_blank" href="#h3-IVF-Flat-Coarse-Clustering-for-Speed">IVF-Flat: Coarse Clustering for Speed</a>
    </li>
    <li id="TOC-h3-HNSW-Navigable-Graph-Search">
      <a rel="noopener" target="_blank" href="#h3-HNSW-Navigable-Graph-Search">HNSW: Navigable Graph Search</a>
    </li>
    <li id="TOC-h3-Comparing-the-Three-Index-Types">
      <a rel="noopener" target="_blank" href="#h3-Comparing-the-Three-Index-Types">Comparing the Three Index Types</a>
    </li>
    <li id="TOC-h3-What-Youll-Do-Next">
      <a rel="noopener" target="_blank" href="#h3-What-Youll-Do-Next">What You’ll Do Next</a>
    </li>
  </ul>

  <li id="TOC-h2-Configuring-Your-Development-Environment">
    <a rel="noopener" target="_blank" href="#h2-Configuring-Your-Development-Environment">Configuring Your Development Environment</a>
  </li>
  <ul>
    <li id="TOC-h3-Note-on-FAISS-Installation">
      <a rel="noopener" target="_blank" href="#h3-Note-on-FAISS-Installation">Note on FAISS Installation</a>
    </li>
    <li id="TOC-h3-Optional-Dependencies">
      <a rel="noopener" target="_blank" href="#h3-Optional-Dependencies">Optional Dependencies</a>
    </li>
    <li id="TOC-h3-Verifying-Your-Installation">
      <a rel="noopener" target="_blank" href="#h3-Verifying-Your-Installation">Verifying Your Installation</a>
    </li>
  </ul>

  <li id="TOC-h2-Implementation-Walkthrough">
    <a rel="noopener" target="_blank" href="#h2-Implementation-Walkthrough">Implementation Walkthrough</a>
  </li>
  <ul>
    <li id="TOC-h3-pyimagesearch-config-py-what-we-use-in-Lesson-2">
      <a rel="noopener" target="_blank" href="#h3-pyimagesearch-config-py-what-we-use-in-Lesson-2">pyimagesearch/config.py (what we use in Lesson 2)</a>
    </li>
    <li id="TOC-h3-pyimagesearch-vector-search-utils-py">
      <a rel="noopener" target="_blank" href="#h3-pyimagesearch-vector-search-utils-py">pyimagesearch/vector_search_utils.py</a>
    </li>
    <li id="TOC-h3-02-vector-search-ann-py-driver">
      <a rel="noopener" target="_blank" href="#h3-02-vector-search-ann-py-driver">02_vector_search_ann.py (driver)</a>
    </li>
  </ul>

  <li id="TOC-h2-Benchmarking-and-Analyzing-Results">
    <a rel="noopener" target="_blank" href="#h2-Benchmarking-and-Analyzing-Results">Benchmarking and Analyzing Results</a>
  </li>
  <ul>
    <li id="TOC-h3-How-to-Interpret-These-Results">
      <a rel="noopener" target="_blank" href="#h3-How-to-Interpret-These-Results">How to Interpret These Results</a>
    </li>
    <li id="TOC-h3-Takeaways">
      <a rel="noopener" target="_blank" href="#h3-Takeaways">Takeaways</a>
    </li>
  </ul>

  <li id="TOC-h2-Summary">
    <a rel="noopener" target="_blank" href="#h2-Summary">Summary</a>
  </li>
  <ul>
    <li id="TOC-h3-Citation-Information">
      <a rel="noopener" target="_blank" href="#h3-Citation-Information">Citation Information</a>
    </li>
  </ul>
</ul>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h1-Vector-Search-with-FAISS-Approximate-Nearest-Neighbor-ANN-Explained"/>



<h2 class="wp-block-heading"><a href="#TOC-h1-Vector-Search-with-FAISS-Approximate-Nearest-Neighbor-ANN-Explained">Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained</a></h2>



<p>In this tutorial, you’ll learn how vector databases achieve lightning-fast retrieval using Approximate Nearest Neighbor (ANN) algorithms. You’ll explore how FAISS structures your embeddings into efficient indexes (e.g., Flat, IVF, and HNSW), benchmark their performance, and see how recall and latency trade off as your dataset scales.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/vector-search-with-faiss-ann-explained-featured.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="940" height="780" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-with-faiss-ann-explained-featured.png?lossy=2&strip=1&webp=1" alt="vector-search-with-faiss-ann-explained-featured.png" class="wp-image-52757" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-with-faiss-ann-explained-featured.png?size=126x105&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-with-faiss-ann-explained-featured-300x249.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-with-faiss-ann-explained-featured.png?size=378x314&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-with-faiss-ann-explained-featured.png?size=504x418&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-with-faiss-ann-explained-featured.png?size=630x523&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-with-faiss-ann-explained-featured-768x637.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/vector-search-with-faiss-ann-explained-featured.png?lossy=2&amp;strip=1&amp;webp=1 940w" sizes="(max-width: 630px) 100vw, 630px" /></a></figure></div>


<p>This lesson is the 2nd of a 3-part series on <strong>Retrieval-Augmented Generation (RAG)</strong>:</p>



<ol class="wp-block-list">
<li><em><strong><a href="https://pyimg.co/msp43" target="_blank" rel="noreferrer noopener">TF-IDF vs. Embeddings: From Keywords to Semantic Search</a></strong></em></li>



<li><em><strong><a href="https://pyimg.co/htl5f" target="_blank" rel="noreferrer noopener">Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained</a></strong></em><strong> (this tutorial)</strong></li>



<li><em><strong><a href="https://pyimg.co/q68nv" target="_blank" rel="noreferrer noopener">Vector Search Using Ollama for Retrieval-Augmented Generation (RAG)</a></strong></em></li>
</ol>



<p><strong>To learn how to scale semantic search from milliseconds to millions of vectors,</strong><em><strong> just keep reading.</strong></em></p>



<div id="pyi-source-code-block" class="source-code-wrap"><div class="gpd-source-code">
    <div class="gpd-source-code-content">
        <img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/source-code-icon.png?lossy=2&strip=1&webp=1" alt="">
        <h4>Looking for the source code to this post?</h4>
                    <a href="#download-the-code" class="pyis-cta-modal-open-modal">Jump Right To The Downloads Section <svg class="svg-icon arrow-right" width="12" height="12" aria-hidden="true" role="img" focusable="false" viewBox="0 0 14 14" fill="none" xmlns="http://www.w3.org/2000/svg"><path d="M6.8125 0.1875C6.875 0.125 6.96875 0.09375 7.09375 0.09375C7.1875 0.09375 7.28125 0.125 7.34375 0.1875L13.875 6.75C13.9375 6.8125 14 6.90625 14 7C14 7.125 13.9375 7.1875 13.875 7.25L7.34375 13.8125C7.28125 13.875 7.1875 13.9062 7.09375 13.9062C6.96875 13.9062 6.875 13.875 6.8125 13.8125L6.1875 13.1875C6.125 13.125 6.09375 13.0625 6.09375 12.9375C6.09375 12.8438 6.125 12.75 6.1875 12.6562L11.0312 7.8125H0.375C0.25 7.8125 0.15625 7.78125 0.09375 7.71875C0.03125 7.65625 0 7.5625 0 7.4375V6.5625C0 6.46875 0.03125 6.375 0.09375 6.3125C0.15625 6.25 0.25 6.1875 0.375 6.1875H11.0312L6.1875 1.34375C6.125 1.28125 6.09375 1.1875 6.09375 1.0625C6.09375 0.96875 6.125 0.875 6.1875 0.8125L6.8125 0.1875Z" fill="#169FE6"></path></svg></a>
            </div>
</div>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-From-Exact-to-Approximate-Why-Indexing-Matters"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-From-Exact-to-Approximate-Why-Indexing-Matters">From Exact to Approximate: Why Indexing Matters</a></h2>



<p>In the <a href="https://pyimg.co/msp43" target="_blank" rel="noreferrer noopener">previous lesson</a>, you learned how to turn text into embeddings — compact, high-dimensional vectors that capture semantic meaning.</p>



<p>By computing cosine similarity between these vectors, you could find which sentences or paragraphs were most alike.</p>



<p>That worked beautifully for a <strong>small handcrafted corpus</strong> of 30-40 paragraphs.</p>



<p>But what if your dataset grows to <strong>millions of documents</strong> or <strong>billions of image embeddings</strong>?</p>



<p>Suddenly, your brute-force search breaks down — and that’s where <strong>Approximate Nearest Neighbor (ANN)</strong> methods come to the rescue.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-The-Trouble-with-Brute-Force-Search"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-The-Trouble-with-Brute-Force-Search">The Trouble with Brute-Force Search</a></h3>



<p>In <a href="https://pyimg.co/msp43" target="_blank" rel="noreferrer noopener">Lesson 1</a>, we compared every query vector <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/769/7694f4a66316e53c8cdd9d9954bd611d-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='q' title='q' class='latex' /> against all stored embeddings <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/1ba/1ba8aaab47179b3d3e24b0ccea9f4e30-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='x_i' title='x_i' class='latex' /> using cosine similarity:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/267/267b0e5e21633dc8eea832587d1fb69e-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{similarity}(q, x_i) = \displaystyle\frac{q \cdot x_i}{\| q\| \| x_i\|}' title='\text{similarity}(q, x_i) = \displaystyle\frac{q \cdot x_i}{\| q\| \| x_i\|}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/267/267b0e5e21633dc8eea832587d1fb69e-ffffff-000000-0.png?lossy=2&strip=1&webp=1 187w,https://b2633864.smushcdn.com/2633864/wp-content/latex/267/267b0e5e21633dc8eea832587d1fb69e-ffffff-000000-0.png?size=126x24&lossy=2&strip=1&webp=1 126w' sizes='(max-width: 187px) 100vw, 187px' /></p>



<p>If your dataset contains <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8d9/8d9c307cb7f3c4a32822a51922d1ceaa-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='N' title='N' class='latex' /> vectors, this means <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/8d9/8d9c307cb7f3c4a32822a51922d1ceaa-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='N' title='N' class='latex' /> dot products <strong>for each query</strong> — an operation that scales linearly as <img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/336/33697ce7dfa48ba80980d298c8089378-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='O(N)' title='O(N)' class='latex' />.</p>



<p>That’s fine for 500 vectors, but catastrophic for 50 million.</p>



<p>Let’s do a quick back-of-the-envelope estimate:</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-68-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="302" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-68-1024x302.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52845" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-68.png?size=126x37&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-68-300x88.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-68.png?size=378x111&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-68.png?size=504x149&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-68.png?size=630x186&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-68-768x226.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-68-1024x302.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-68-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Table 1:</strong> Computational cost of brute-force cosine similarity as corpus size scales, assuming <code>≈50</code> GFLOPS/sec effective CPU throughput.</figcaption></figure></div>


<p>Even on modern hardware capable of tens of gigaflops (GFLOPS) per second, brute-force search over billions of vectors still incurs multi-second latency per query — before accounting for memory bandwidth limits — making indexing essential for production-scale retrieval.</p>
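<p>To make that concrete, here is a minimal back-of-the-envelope sketch. The figures match the assumptions behind <strong>Table 1</strong> (384-dimensional embeddings, ≈50 GFLOPS of effective CPU throughput); both are illustrative assumptions, not measurements:</p>

```python
# Back-of-the-envelope per-query cost of brute-force similarity search.
# Assumptions (illustrative): 384-dim embeddings, ~50 GFLOPS effective throughput.
DIM = 384
GFLOPS = 50e9  # floating-point operations per second (assumed)

def brute_force_seconds(n_vectors: int, dim: int = DIM) -> float:
    """Estimate per-query latency: one dot product costs ~2*dim FLOPs."""
    flops = 2 * dim * n_vectors  # a multiply and an add per dimension
    return flops / GFLOPS

for n in (500, 1_000_000, 50_000_000, 1_000_000_000):
    print(f"{n:>13,} vectors -> {brute_force_seconds(n) * 1e3:12.3f} ms/query")
```

<p>Running the sketch shows why the curve in <strong>Figure 1</strong> bends from "instant" to "unusable": latency grows strictly linearly with corpus size.</p>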


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-51.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="680" height="573" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-51.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52771" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-51.png?size=126x106&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-51-300x253.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-51.png?size=378x319&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-51.png?size=504x425&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-51.png?size=630x531&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-51.png?lossy=2&amp;strip=1&amp;webp=1 680w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 1:</strong> Brute-force cosine similarity scales linearly with corpus size, becoming infeasible beyond hundreds of millions of vectors (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-The-Curse-of-Dimensionality"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-The-Curse-of-Dimensionality">The Curse of Dimensionality</a></h3>



<p>High-dimensional data introduces another subtle problem: <strong>distances begin to blur</strong>.</p>



<p>In 384-dimensional space, most points are almost equally distant from each other.</p>



<p>This phenomenon, called the <em>curse of dimensionality</em>, makes “nearest neighbor” less meaningful unless your metric is very carefully chosen and your embeddings are normalized.</p>



<p>🧠<strong> Intuition:</strong> imagine scattering points on a 2D plane — you can easily tell which ones are close.</p>



<p>In 384D, everything is “far away,” so naive distance computations lose discriminative power.</p>



<p>That’s why normalization (L2) is critical — it keeps embeddings on a unit hypersphere, letting cosine similarity focus purely on angle, not length.</p>
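<p>A quick NumPy sketch of this idea, using toy random vectors in place of real embeddings:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 384)).astype("float32")  # toy 384-dim "embeddings"

# L2-normalize each row so every vector lies on the unit hypersphere.
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

q = emb[0]
# Full cosine formula vs. a plain dot product:
cosine = (emb @ q) / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q))
dot = emb @ q  # for unit vectors, the dot product IS the cosine similarity
print(np.allclose(cosine, dot, atol=1e-6))
```

<p>After normalization, length information is gone, so any remaining difference between vectors is purely angular.</p>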



<p>But even with normalization, you still can’t escape the sheer <strong>computational load</strong> of checking every vector.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Enter-the-Approximate-Nearest-Neighbor-ANN"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Enter-the-Approximate-Nearest-Neighbor-ANN">Enter the Approximate Nearest Neighbor (ANN)</a></h3>



<p>So, how do we search faster without comparing against <em>every</em> vector?</p>



<p>Approximate Nearest Neighbor (ANN) algorithms do exactly that.</p>



<p>They trade a small amount of accuracy (recall) for <strong>huge latency gains</strong> — often 100× faster.</p>



<p>Conceptually, they <strong>build an index</strong> that helps jump directly to promising regions of vector space.</p>



<p>Think of it like finding a book in a massive library:</p>



<ul class="wp-block-list">
<li><strong>Brute-force search:</strong> opening every book until you find the right topic.</li>



<li><strong>Indexed search:</strong> using the catalog — you might not find the <em>exact</em> match every time, but you’ll get very close, much faster.</li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-65.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="640" height="639" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-65.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52828" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-65-150x150.png?lossy=2&amp;strip=1&amp;webp=1 150w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-65-300x300.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-65.png?size=378x377&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-65.png?size=504x503&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-65.png?lossy=2&amp;strip=1&amp;webp=1 640w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 2:</strong> Approximate nearest neighbor methods (e.g., HNSW) skip most of the search space, dramatically reducing query time while maintaining high recall (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Accuracy-vs-Latency-The-Core-Trade-Off"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Accuracy-vs-Latency-The-Core-Trade-Off">Accuracy vs. Latency: The Core Trade-Off</a></h3>



<p>ANN methods exist in many flavors — graph-based (HNSW), cluster-based (IVF), tree-based (KD-Tree), hashing-based (LSH), and hybrid approaches — but they all share a single philosophy:</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/b88/b888831619cfe380f3fd7c665bbf2416-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{Speed up retrieval by avoiding exhaustive search.}' title='\text{Speed up retrieval by avoiding exhaustive search.}' class='latex' srcset='https://b2633864.smushcdn.com/2633864/wp-content/latex/b88/b888831619cfe380f3fd7c665bbf2416-ffffff-000000-0.png?lossy=2&strip=1&webp=1 350w,https://b2633864.smushcdn.com/2633864/wp-content/latex/b88/b888831619cfe380f3fd7c665bbf2416-ffffff-000000-0.png?size=126x6&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/latex/b88/b888831619cfe380f3fd7c665bbf2416-ffffff-000000-0.png?size=252x12&lossy=2&strip=1&webp=1 252w' sizes='(max-width: 350px) 100vw, 350px' /></p>



<p>They do so by:</p>



<ul class="wp-block-list">
<li><strong>Pre-structuring</strong> the vector space (clustering, graphs, hashing).</li>



<li><strong>Restricting</strong> comparisons to only nearby candidates.</li>



<li><strong>Accepting</strong> that some true neighbors might be missed — in exchange for massive speed-ups.</li>
</ul>



<p>You’ll often see performance expressed using <code data-enlighter-language="python" class="EnlighterJSRAW">recall@k</code>, the fraction of true neighbors recovered among the top-<em>k</em> results.</p>



<p>A good ANN index achieves 0.9 &#8211; 0.99 <code data-enlighter-language="python" class="EnlighterJSRAW">recall@k</code> while being orders of magnitude faster than brute force.</p>
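<p>Computing <code data-enlighter-language="python" class="EnlighterJSRAW">recall@k</code> is straightforward. A minimal sketch (the toy neighbor IDs below are made up for illustration):</p>

```python
import numpy as np

def recall_at_k(true_ids: np.ndarray, approx_ids: np.ndarray) -> float:
    """Fraction of ground-truth top-k neighbors recovered by the ANN index.

    Both arrays have shape (n_queries, k): exact top-k IDs vs. ANN top-k IDs.
    """
    hits = sum(len(set(t) & set(a)) for t, a in zip(true_ids, approx_ids))
    return hits / true_ids.size

true_ids = np.array([[3, 7, 1], [4, 2, 9]])    # exact top-3 per query (toy)
approx_ids = np.array([[3, 1, 5], [2, 9, 4]])  # ANN top-3 per query (toy)
print(recall_at_k(true_ids, approx_ids))       # 5 of 6 true neighbors found
```

<p>Note that order within the top-<em>k</em> doesn't matter here; recall only asks whether the true neighbors appear anywhere in the returned set.</p>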


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-52-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="228" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-52-1024x228.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52776" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-52.png?size=204x45&amp;lossy=2&amp;strip=1&amp;webp=1 204w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-52-300x67.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-52.png?size=409x91&amp;lossy=2&amp;strip=1&amp;webp=1 409w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-52.png?size=614x137&amp;lossy=2&amp;strip=1&amp;webp=1 614w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-52-768x171.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-52.png?size=819x182&amp;lossy=2&amp;strip=1&amp;webp=1 819w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-52-1024x228.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-52-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Table 2:</strong> Example recall and latency comparison of Flat, IVF-Flat, and HNSW indexes on a medium-scale corpus.</figcaption></figure></div>

<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-62.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="690" height="579" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-62.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52805" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-62.png?size=126x106&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-62-300x252.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-62.png?size=378x317&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-62.png?size=504x423&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-62.png?size=630x529&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-62.png?lossy=2&amp;strip=1&amp;webp=1 690w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 3:</strong> Approximate indexes (e.g., HNSW and IVF-Flat) achieve near-perfect recall while being orders of magnitude faster than brute-force Flat search (source: image by the author).</figcaption></figure></div>


<p>The Flat index delivers exact results but with high latency, whereas HNSW and IVF-Flat balance speed and accuracy for scalable retrieval.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-What-Youll-Learn-Next"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-What-Youll-Learn-Next">What You’ll Learn Next</a></h3>



<p>In the next section, we’ll open the black box and look <em>inside</em> FAISS — the industry-standard library that powers vector search at scale.</p>



<p>You’ll understand how these indexes (i.e., Flat, IVF, and HNSW) are built, what parameters control their trade-offs, and how to construct them yourself in code.</p>



<p>By the end, you’ll see how these algorithms let you scale semantic search from a few dozen vectors to millions — <strong>without loss of meaning</strong>.</p>






<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Inside-FAISS-How-Vector-Indexing-Works"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Inside-FAISS-How-Vector-Indexing-Works">Inside FAISS: How Vector Indexing Works</a></h2>



<p>In Lesson 1, we learned how to generate embeddings and compute semantic similarity.</p>



<p>In Section 1 of this post, we discovered why brute-force search breaks down as our vector collection grows.</p>



<p>Now, let’s open the black box and explore how <strong>FAISS</strong> makes large-scale vector search practical.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-What-Is-FAISS"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-What-Is-FAISS">What Is FAISS?</a></h3>



<p><strong>FAISS (Facebook AI Similarity Search)</strong> is an open-source library developed by Meta AI for <em>efficient similarity search and clustering of dense vectors</em>.</p>



<p>It’s implemented in highly optimized C++ with Python bindings and optional GPU support.</p>



<p>In simple terms, FAISS is the <strong>NumPy of vector search</strong> — it provides a consistent interface for building, training, querying, and persisting vector indexes, with dozens of back-ends optimized for both accuracy and speed.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-69-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="363" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-69-1024x363.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52848" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-69.png?size=126x45&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-69-300x106.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-69.png?size=378x134&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-69.png?size=504x179&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-69.png?size=630x223&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-69-768x272.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-69-1024x363.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-69-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Table 3:</strong> Core components of FAISS for efficient approximate nearest neighbor (ANN) search.</figcaption></figure></div>

<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-67.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="779" height="524" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-67.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52837" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-67.png?size=126x85&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-67-300x202.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-67.png?size=378x254&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-67.png?size=504x339&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-67.png?size=630x424&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-67-768x517.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-67.png?lossy=2&amp;strip=1&amp;webp=1 779w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 4:</strong> FAISS serves as a high-performance bridge between embeddings and retrieval logic, performing similarity search to return top-<em>k</em> relevant chunks efficiently (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Why-We-Need-Index-Structures"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Why-We-Need-Index-Structures">Why We Need Index Structures</a></h3>



<p>Every FAISS index is a data structure that helps you locate nearest neighbors efficiently in a large vector space.</p>



<p>Instead of storing raw vectors in a flat array and scanning them linearly, FAISS builds one of several possible structures that act as “shortcuts” through the space.</p>



<p>Let’s explore the three we will implement in this tutorial — <strong>Flat</strong>, <strong>IVF-Flat</strong>, and <strong>HNSW</strong> — and see how each balances speed and accuracy.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Flat-Index-The-Exact-Baseline"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Flat-Index-The-Exact-Baseline">Flat Index: The Exact Baseline</a></h3>



<p>The <strong>Flat index</strong> is the simplest and most precise approach.</p>



<p>It stores every embedding as is and computes cosine similarity (or inner product) between the query and each vector.</p>



<p class="has-text-align-center"><img src='https://b2633864.smushcdn.com/2633864/wp-content/latex/aed/aed011bc30921e8494c4910d6b59f3ee-ffffff-000000-0.png?lossy=2&strip=1&webp=1' alt='\text{similarity}(q, x_i) = q \cdot x_i' title='\text{similarity}(q, x_i) = q \cdot x_i' class='latex' /></p>



<p>This is essentially our brute-force baseline from Lesson 1.</p>



<p>It achieves perfect recall (1.0) but scales poorly — every query requires scanning all vectors.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-66.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="932" height="626" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-66.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52831" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-66.png?size=126x85&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-66-300x202.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-66.png?size=378x254&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-66.png?size=504x339&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-66.png?size=630x423&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-66-768x516.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-66.png?lossy=2&amp;strip=1&amp;webp=1 932w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 5:</strong> In brute-force vector search, each query is sequentially compared to every stored vector — an <code>O(N)</code> process that ensures exact matches but becomes prohibitively slow for large-scale datasets (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-IVF-Flat-Coarse-Clustering-for-Speed"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-IVF-Flat-Coarse-Clustering-for-Speed">IVF-Flat: Coarse Clustering for Speed</a></h3>



<p><strong>IVF</strong> stands for <em>Inverted File Index</em>.</p>



<p>The idea is simple but powerful: partition the vector space into coarse clusters and search only a few of them at query time.</p>



<p><strong>Step-by-step:</strong></p>



<ul class="wp-block-list">
<li>FAISS trains a coarse quantizer using <em>k-means</em> on your embeddings. Each cluster center is a “list” in the index — there are <code data-enlighter-language="python" class="EnlighterJSRAW">nlist</code> of them.</li>



<li>Each vector is assigned to its closest centroid.</li>



<li>During search, instead of scanning all lists, FAISS only probes <code data-enlighter-language="python" class="EnlighterJSRAW">nprobe</code> of them (the most promising clusters).</li>
</ul>



<p>That reduces search complexity from <code data-enlighter-language="python" class="EnlighterJSRAW">O(N)</code> to roughly <code data-enlighter-language="python" class="EnlighterJSRAW">O((N / nlist) × nprobe)</code>: each probed list holds about <code data-enlighter-language="python" class="EnlighterJSRAW">N / nlist</code> vectors, and only <code data-enlighter-language="python" class="EnlighterJSRAW">nprobe</code> lists are scanned.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-57-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="192" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-57-1024x192.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52792" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-57.png?size=126x24&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-57-300x56.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-57.png?size=378x71&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-57.png?size=504x95&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-57.png?size=630x118&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-57-768x144.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-57-1024x192.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-57-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Table 4:</strong> Key configuration parameters of the FAISS IVF index and their impact on recall and search latency.</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-HNSW-Navigable-Graph-Search"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-HNSW-Navigable-Graph-Search">HNSW: Navigable Graph Search</a></h3>



<p>The <strong>Hierarchical Navigable Small World (HNSW)</strong> index uses a graph instead of clusters.</p>



<p>Each embedding is a node connected to its nearest neighbors.</p>



<p>At query time, the algorithm navigates the graph — starting from an entry point in the sparse top layer and greedily moving to ever-closer nodes until it reaches a local minimum.</p>



<p>This search is then refined by probing nearby connections to find better candidates.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-58-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="188" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-58-1024x188.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52794" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-58.png?size=126x23&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-58-300x55.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-58.png?size=378x69&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-58.png?size=504x93&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-58.png?size=630x116&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-58-768x141.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-58-1024x188.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-58-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Table 5:</strong> Key HNSW parameters controlling graph connectivity and search-time accuracy-latency trade-offs.</figcaption></figure></div>

<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-63.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="927" height="610" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-63.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52814" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-63.png?size=126x83&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-63-300x197.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-63.png?size=378x249&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-63.png?size=504x332&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-63.png?size=630x415&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-63-768x505.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-63.png?lossy=2&amp;strip=1&amp;webp=1 927w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 6:</strong> HNSW organizes vectors into multiple graph layers. The query starts at a coarse layer and navigates downward, refining its path at each level until it reaches the base layer to retrieve the nearest neighbors efficiently (source: image by the author).</figcaption></figure></div>


<p><strong>Intuition: </strong>It’s like finding a friend’s house by asking neighbors for directions instead of checking every street yourself.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Comparing-the-Three-Index-Types"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Comparing-the-Three-Index-Types">Comparing the Three Index Types</a></h3>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-59-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="261" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-59-1024x261.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52800" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-59.png?size=126x32&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-59-300x76.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-59.png?size=378x96&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-59.png?size=504x128&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-59.png?size=630x161&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-59-768x195.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-59-1024x261.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-59-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Table 6:</strong> Performance comparison of FAISS index types for approximate nearest neighbor (ANN) search.</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-What-Youll-Do-Next"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-What-Youll-Do-Next">What You’ll Do Next</a></h3>



<p>Now that you understand what each index does, you’ll see how to implement them with FAISS.</p>



<p>In the next section, we’ll dive into the code — explaining <code data-enlighter-language="python" class="EnlighterJSRAW">config.py</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">vector_search_utils.py</code>, then building and benchmarking indexes side by side.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Configuring-Your-Development-Environment"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Configuring-Your-Development-Environment">Configuring Your Development Environment</a></h2>



<p>To follow this guide, you&#8217;ll need to install several Python libraries for vector search and approximate nearest neighbor (ANN) indexing.</p>



<p>The core dependencies are:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="1">$ pip install sentence-transformers==2.7.0
$ pip install faiss-cpu==1.8.0
$ pip install numpy==1.26.4
$ pip install rich==13.8.1
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Note-on-FAISS-Installation"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Note-on-FAISS-Installation">Note on FAISS Installation</a></h3>



<ul class="wp-block-list">
<li>Use <code data-enlighter-language="python" class="EnlighterJSRAW">faiss-cpu</code> for CPU-only environments (most common)</li>



<li>If you have a CUDA-compatible GPU, you can install <code data-enlighter-language="python" class="EnlighterJSRAW">faiss-gpu</code> instead for better performance with large datasets</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Optional-Dependencies"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Optional-Dependencies">Optional Dependencies</a></h3>



<p>If you plan to extend the examples with your own experiments:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="2">$ pip install matplotlib  # For creating performance visualizations
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Verifying-Your-Installation"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Verifying-Your-Installation">Verifying Your Installation</a></h3>



<p>You can verify FAISS is properly installed by running:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="3">import faiss
print(f"FAISS version: {faiss.__version__}")
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<!-- wp:paragraph -->
<h3>Need Help Configuring Your Development Environment?</h3>
<!-- /wp:paragraph -->

<!-- wp:image {"align":"center","id":18137,"sizeSlug":"large","linkDestination":"custom"} -->
<figure class="wp-block-image aligncenter size-large"><a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank" rel="noreferrer noopener"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-18137" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?lossy=2&strip=1&webp=1 500w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=126x84&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=252x168&lossy=2&strip=1&webp=1 252w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2021/01/pyimagesearch_plus_jupyter.png?size=378x253&lossy=2&strip=1&webp=1 378w" sizes="(max-width: 500px) 100vw, 500px" /></a><figcaption>Having trouble configuring your development environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? Be sure to join <a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">PyImageSearch University</a> — you will be up and running with this tutorial in a matter of minutes. </figcaption></figure>
<!-- /wp:image -->

<!-- wp:paragraph -->
<p>All that said, are you:</p>
<!-- /wp:paragraph -->

<!-- wp:list -->
<ul><li>Short on time?</li><li>Learning on your employer’s administratively locked system?</li><li>Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?</li><li><strong>Ready to run the code immediately on your Windows, macOS, or Linux system?</strong></li></ul>
<!-- /wp:list -->

<!-- wp:paragraph -->
<p>Then join <a href="https://pyimagesearch.com/pyimagesearch-university/" target="_blank">PyImageSearch University</a> today!</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p><strong>Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides pre-configured to run on Google Colab’s ecosystem right in your web browser!</strong> No installation required.</p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p>And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!</p>
<!-- /wp:paragraph -->



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Implementation-Walkthrough"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Implementation-Walkthrough">Implementation Walkthrough</a></h2>



<p>Alright, time to get our hands dirty. </p>



<p>You’ve seen why indexing matters and how FAISS structures (e.g., Flat, HNSW, and IVF) work conceptually; now we’ll implement them and run them side by side. We’ll build each index, query it with the same vectors, and compare <code data-enlighter-language="python" class="EnlighterJSRAW">recall@k</code>, <strong>average query latency</strong>, <strong>build time</strong>, and a rough <strong>memory footprint</strong>. </p>



<p>To keep things clean, we’ll proceed in three parts: </p>



<ul class="wp-block-list">
<li><strong>config:</strong> paths and constants</li>



<li><strong>FAISS utilities:</strong> builders, query/save/load, benchmarking helpers</li>



<li><strong>driver script:</strong> ties it all together and prints a clear, comparable report</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-pyimagesearch-config-py-what-we-use-in-Lesson-2"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-pyimagesearch-config-py-what-we-use-in-Lesson-2">pyimagesearch/config.py (what we use in Lesson 2)</a></h3>



<p>These constants keep paths, filenames, and defaults in one place, so the rest of the code stays clean and portable.</p>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="4">from pathlib import Path
import os

# Base paths
BASE_DIR = Path(__file__).resolve().parent.parent
DATA_DIR = BASE_DIR / "data"
INPUT_DIR = DATA_DIR / "input"
OUTPUT_DIR = DATA_DIR / "output"
INDEX_DIR = DATA_DIR / "indexes"
FIGURES_DIR = DATA_DIR / "figures"

# Corpus files (allow environment overrides for flexibility)
_CORPUS_OVERRIDE = os.getenv("CORPUS_PATH")
_CORPUS_META_OVERRIDE = os.getenv("CORPUS_META_PATH")
CORPUS_PATH = Path(_CORPUS_OVERRIDE) if _CORPUS_OVERRIDE else INPUT_DIR / "corpus.txt"
CORPUS_META_PATH = Path(_CORPUS_META_OVERRIDE) if _CORPUS_META_OVERRIDE else INPUT_DIR / "corpus_metadata.json"

# Embedding artifacts
EMBEDDINGS_PATH = OUTPUT_DIR / "embeddings.npy"
METADATA_ALIGNED_PATH = OUTPUT_DIR / "metadata_aligned.json"
DIM_REDUCED_PATH = OUTPUT_DIR / "pca_2d.npy"

# Index artifacts
FLAT_INDEX_PATH = INDEX_DIR / "faiss_flat.index"
HNSW_INDEX_PATH = INDEX_DIR / "faiss_hnsw.index"

# Models &amp; defaults
EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
SEED = 42
DEFAULT_TOP_K = 5
SIM_THRESHOLD = 0.35  # not used in this lesson, but fine to keep

# Ensure dirs exist
for d in (OUTPUT_DIR, INDEX_DIR, FIGURES_DIR):
    d.mkdir(parents=True, exist_ok=True)
</pre>



<p><strong>What’s happening here (and why):</strong></p>



<ul class="wp-block-list">
<li>We compute <code data-enlighter-language="python" class="EnlighterJSRAW">BASE_DIR</code> and derive data/ subfolders so the repo runs the same on any machine.</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">CORPUS_PATH</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">CORPUS_META_PATH</code> can be overridden via environment variables — a nice quality-of-life touch if you swap datasets.</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">EMBEDDINGS_PATH</code> is the cache produced in Lesson 1; we load it here to avoid re-embedding.</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">FLAT_INDEX_PATH</code> and <code data-enlighter-language="python" class="EnlighterJSRAW">HNSW_INDEX_PATH</code> are where we’ll <strong>persist</strong> FAISS indexes after benchmarking. (IVF can be added similarly if you want on-disk reuse.)</li>



<li>The directory loop at the bottom makes sure first runs never fail due to missing folders.</li>
</ul>



<p><em><strong>Note:</strong></em> The file also contains prompt-related constants for Lesson 3 (RAG). We <strong>don’t</strong> use them here — safe to ignore in this lesson.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-pyimagesearch-vector-search-utils-py"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-pyimagesearch-vector-search-utils-py">pyimagesearch/vector_search_utils.py</a></h3>



<p>This module is your FAISS “Swiss Army Knife.” We’ll walk through every function it provides: the index builders, <code data-enlighter-language="python" class="EnlighterJSRAW">brute_force_search</code>, <code data-enlighter-language="python" class="EnlighterJSRAW">measure_latency</code>, and <code data-enlighter-language="python" class="EnlighterJSRAW">estimate_index_memory_bytes</code>, plus the <strong>query/save/load</strong> helpers.</p>



<h4 class="wp-block-heading">a) FAISS Import Guard</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="5">try:
    import faiss  # type: ignore
except ImportError:
    faiss = None  # type: ignore

def require_faiss():
    if faiss is None:
        raise ImportError("faiss not installed. Please pip install faiss-cpu (or faiss-gpu).")
</pre>



<p>This is a friendly safety check. Any time we build or query an index, we call <code data-enlighter-language="python" class="EnlighterJSRAW">require_faiss()</code>. If FAISS isn’t installed, you get a clear error with the exact package to install.</p>



<h4 class="wp-block-heading">b) Flat (exact) Index</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="6">def build_flat_index(embeddings: np.ndarray):
    require_faiss()
    d = embeddings.shape[1]
    index = faiss.IndexFlatIP(d)
    index.add(embeddings.astype(np.float32))
    return index
</pre>



<p><strong>What it does:</strong></p>



<p>We create an <code data-enlighter-language="python" class="EnlighterJSRAW">IndexFlatIP</code>, where <strong>IP = inner product</strong>. Because our embeddings are L2-normalized (from Lesson 1), inner product equals cosine similarity; note that this equivalence holds only when both the stored embeddings and the query vectors are L2-normalized. We then <code data-enlighter-language="python" class="EnlighterJSRAW">.add()</code> all vectors (as <code data-enlighter-language="python" class="EnlighterJSRAW">float32</code>) to the index. This gives <strong>perfect recall</strong> but scans every vector on each query, so it becomes slow at scale.</p>
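<p>As a quick sanity check of that equivalence (an illustrative NumPy snippet, not part of the lesson’s codebase):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8)).astype(np.float32)
query = rng.normal(size=8).astype(np.float32)

# Cosine similarity of the raw (unnormalized) vectors
cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# L2-normalize both sides (as Lesson 1 does), then take a plain inner product
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
inner = docs_n @ query_n  # what IndexFlatIP would score

print(np.allclose(inner, cosine, atol=1e-5))  # True
```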



<h4 class="wp-block-heading">c) HNSW (graph) Index</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="7">def build_hnsw_index(embeddings: np.ndarray, m: int = 32, ef_construction: int = 128, ef_search: int = 64):
    require_faiss()
    d = embeddings.shape[1]
    index = faiss.IndexHNSWFlat(d, m, faiss.METRIC_INNER_PRODUCT)
    index.hnsw.efConstruction = ef_construction
    index.hnsw.efSearch = ef_search
    index.add(embeddings.astype(np.float32))
    return index
</pre>



<p><strong>What it does:</strong></p>



<p>HNSW builds a multi-layer graph over your vectors:</p>



<ul class="wp-block-list">
<li><code data-enlighter-language="python" class="EnlighterJSRAW">m</code>: sets connectivity per node (more links → better recall and higher memory)</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">efConstruction</code>: controls how widely we explore during graph building</li>



<li><code data-enlighter-language="python" class="EnlighterJSRAW">efSearch</code>: controls breadth at query time (higher → better recall, slower)</li>
</ul>



<p>We add vectors and return a ready-to-query graph index.</p>
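<p>To build intuition for the traversal, here is a minimal single-layer greedy search over a nearest-neighbor graph in plain NumPy. This is a teaching sketch only: real HNSW maintains multiple layers and a search beam of size <code data-enlighter-language="python" class="EnlighterJSRAW">efSearch</code> rather than a single greedy walk.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 300, 8, 8
xb = rng.normal(size=(n, d)).astype(np.float32)
xb = xb / np.linalg.norm(xb, axis=1, keepdims=True)

# One-layer graph: link every node to its m most similar nodes (skip self at column 0)
sims = xb @ xb.T
neighbors = np.argsort(-sims, axis=1)[:, 1:m + 1]

def greedy_search(q, entry=0):
    # Hop to whichever neighbor scores best; stop at a local optimum
    best = entry
    while True:
        cand = neighbors[best]
        nxt = cand[int(np.argmax(xb[cand] @ q))]
        if xb[nxt] @ q > xb[best] @ q:
            best = nxt
        else:
            return best

result = greedy_search(xb[42])
print(result, float(xb[result] @ xb[42]))
```

<p>Each hop strictly improves the score, so the walk always terminates; raising connectivity (<code data-enlighter-language="python" class="EnlighterJSRAW">m</code>) gives it more escape routes from poor local optima, which is exactly the recall/memory trade-off described above.</p>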



<h4 class="wp-block-heading">d) IVF-Flat (clustered) Index</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="8">def build_ivf_flat_index(embeddings: np.ndarray, nlist: int = 8, nprobe: int = 4):
    require_faiss()
    d = embeddings.shape[1]
    quantizer = faiss.IndexFlatIP(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    if not index.is_trained:
        index.train(embeddings.astype(np.float32))
    index.add(embeddings.astype(np.float32))
    index.nprobe = nprobe
    return index
</pre>



<p><strong>What it does:</strong></p>



<p>IVF partitions your space into <code data-enlighter-language="python" class="EnlighterJSRAW">nlist</code> coarse clusters (centroids via <em>k</em>-means). You must <strong>train</strong> first, then add vectors. Queries only <strong>probe</strong> a handful of lists (<code data-enlighter-language="python" class="EnlighterJSRAW">nprobe</code>) near the query, drastically reducing the number of distance comparisons.</p>



<ul class="wp-block-list">
<li>Tune <code data-enlighter-language="python" class="EnlighterJSRAW">nlist</code> up as your dataset grows (more clusters).</li>



<li>Tune <code data-enlighter-language="python" class="EnlighterJSRAW">nprobe</code> for recall vs. latency.</li>
</ul>
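<p>The probing idea can be sketched in a few lines of NumPy (a toy illustration: FAISS trains the centroids with <em>k</em>-means, whereas here we simply sample them):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, nlist, nprobe, k = 200, 16, 8, 2, 3
xb = rng.normal(size=(n, d)).astype(np.float32)
xb = xb / np.linalg.norm(xb, axis=1, keepdims=True)

# "Train": pick nlist points as stand-in centroids (FAISS runs k-means here)
centroids = xb[rng.choice(n, nlist, replace=False)]
assign = np.argmax(xb @ centroids.T, axis=1)  # each vector's home list

# Query: score the centroids, probe only the nprobe best lists, scan those vectors
q = xb[0]
probed = np.argsort(-(centroids @ q))[:nprobe]
candidates = np.flatnonzero(np.isin(assign, probed))
scores = xb[candidates] @ q
topk = candidates[np.argsort(-scores)[:k]]
print(topk)  # the query itself (index 0) ranks first
```

<p>Only the vectors in the probed lists are scored, which is the entire speedup; a true neighbor sitting in an unprobed list is simply missed, which is where the recall loss comes from.</p>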



<h4 class="wp-block-heading">e) Unified Query Wrapper</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="9">def query_index(index, query_vectors: np.ndarray, top_k: int = config.DEFAULT_TOP_K) -> Tuple[np.ndarray, np.ndarray]:
    distances, indices = index.search(query_vectors.astype(np.float32), top_k)
    return indices, distances
</pre>



<p><strong>What it does:</strong></p>



<p>FAISS returns (<code data-enlighter-language="python" class="EnlighterJSRAW">distances, indices</code>), but most of our code wants (<code data-enlighter-language="python" class="EnlighterJSRAW">indices, distances</code>). This wrapper normalizes that behavior and ensures the input dtype is <code data-enlighter-language="python" class="EnlighterJSRAW">float32</code>. Use this anywhere you want a consistent top-<em>k</em> interface.</p>



<h4 class="wp-block-heading">f) Save and Load Indexes</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="10">def save_index(index, path: Path):
    require_faiss()
    faiss.write_index(index, str(path))

def load_index(path: Path):
    require_faiss()
    return faiss.read_index(str(path))
</pre>



<p><strong>What it does:</strong></p>



<p>Build once, reuse many times. These functions serialize an index to disk and read it back instantly — essential when your dataset is large or when you run repeated experiments.</p>



<h4 class="wp-block-heading">g) Brute-Force Baseline (ground truth)</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="11">def brute_force_search(embeddings: np.ndarray, queries: np.ndarray, top_k: int) -> Tuple[np.ndarray, np.ndarray]:
    sims = queries @ embeddings.T  # cosine if normalized
    top_indices = np.argsort(-sims, axis=1)[:, :top_k]
    top_scores = np.take_along_axis(sims, top_indices, axis=1)
    return top_indices, top_scores
</pre>



<p><strong>What it does:</strong></p>



<p>This computes exact cosine similarities (via matrix multiplication) and sorts them to get the true top-<em>k</em>. We use this to measure <code data-enlighter-language="python" class="EnlighterJSRAW">recall@k</code> of ANN indexes.</p>
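<p>One practical aside: <code data-enlighter-language="python" class="EnlighterJSRAW">np.argsort</code> fully sorts every row, while <code data-enlighter-language="python" class="EnlighterJSRAW">np.argpartition</code> selects the top-<em>k</em> first and sorts only those survivors. A sketch of that variant (our own, not the lesson’s code):</p>

```python
import numpy as np

def brute_force_search_fast(embeddings, queries, top_k):
    sims = queries @ embeddings.T
    # argpartition is O(n) per row, versus O(n log n) for a full argsort
    part = np.argpartition(-sims, top_k - 1, axis=1)[:, :top_k]
    part_scores = np.take_along_axis(sims, part, axis=1)
    order = np.argsort(-part_scores, axis=1)  # sort only the k survivors
    idx = np.take_along_axis(part, order, axis=1)
    return idx, np.take_along_axis(part_scores, order, axis=1)

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 32)).astype(np.float32)
qs = rng.normal(size=(5, 32)).astype(np.float32)
idx, scores = brute_force_search_fast(emb, qs, 5)
```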



<h4 class="wp-block-heading">h) Latency measurement helper</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="12">def measure_latency(func, *args, repeat: int = 3, **kwargs) -> Dict[str, Any]:
    import time
    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        times.append(time.perf_counter() - start)
    return {"avg_s": float(np.mean(times)), "stdev_s": float(np.std(times)), "runs": times}
</pre>



<p><strong>What it does:</strong></p>



<p>This runs the same query callable a few times and returns the average and standard deviation. This gives stable latency numbers for fair comparisons between Flat, HNSW, and IVF.</p>



<h4 class="wp-block-heading">i) Rough Memory Estimator (educational)</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="13">def estimate_index_memory_bytes(index_type: str, n: int, d: int, m: int = 32, nlist: int = 0) -> int:
    base = n * d * 4
    if index_type == "Flat":
        return base
    if index_type == "HNSW":
        return base + n * m * 4
    if index_type == "IVF-Flat":
        return base + nlist * d * 4
    return base
</pre>



<p><strong>What it does:</strong></p>



<p>Provides a <strong>lower-bound</strong> estimate (in bytes) for apples-to-apples reporting.</p>



<ul class="wp-block-list">
<li><strong>Flat:</strong> just the raw vectors (<code data-enlighter-language="python" class="EnlighterJSRAW">float32</code> = 4 bytes).</li>



<li><strong>HNSW:</strong> vectors + approximate link list.</li>



<li><strong>IVF-Flat:</strong> vectors + centroid matrix.</li>
</ul>



<p>FAISS has internal overhead, so treat this as a teaching aid, not an exact profiler.</p>
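<p>As a worked example using the same formulas (the sizes here are chosen purely for illustration): 10,000 MiniLM vectors at <em>d</em> = 384:</p>

```python
n, d, m, nlist = 10_000, 384, 32, 64   # illustrative sizes, not benchmark values

flat = n * d * 4                        # raw float32 vectors only
hnsw = flat + n * m * 4                 # plus roughly 4 bytes per graph link
ivf = flat + nlist * d * 4              # plus the centroid matrix

print(flat / 1e6, hnsw / 1e6, ivf / 1e6)  # 15.36, 16.64, ~15.46 MB
```

<p>Notice that IVF’s centroid overhead is tiny, while HNSW’s link storage grows linearly with <code data-enlighter-language="python" class="EnlighterJSRAW">m</code>.</p>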



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-02-vector-search-ann-py-driver"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-02-vector-search-ann-py-driver">02_vector_search_ann.py (driver)</a></h3>



<p>This script glues everything together: </p>



<ul class="wp-block-list">
<li>loads embeddings</li>



<li>builds indexes</li>



<li>benchmarks recall and latency</li>



<li>displays results</li>



<li>persists indexes</li>
</ul>



<h4 class="wp-block-heading">a) Ensuring Embeddings Exist</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="14">def ensure_embeddings():
    if config.EMBEDDINGS_PATH.exists():
        emb, meta = load_embeddings()
        texts, _ = load_corpus()
        return emb, meta, texts
    texts, meta = load_corpus()
    model = get_model()
    emb = generate_embeddings(texts, model=model)
    from pyimagesearch.embeddings_utils import save_embeddings
    save_embeddings(emb, meta)
    return emb, meta, texts
</pre>



<p>First run? We load corpus → embed → save. Subsequent runs? We simply load <code data-enlighter-language="python" class="EnlighterJSRAW">embeddings.npy</code>. This keeps the ANN lesson focused on indexing rather than recomputing vectors.</p>



<h4 class="wp-block-heading">b) Query Sampling and Recall</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="15">def sample_query_indices(n: int, total: int, seed: int = config.SEED) -> np.ndarray:
    rng = np.random.default_rng(seed)
    return rng.choice(total, size=min(n, total), replace=False)

def compute_recall(brute_force_idx: np.ndarray, ann_idx: np.ndarray) -> float:
    matches = 0
    total = brute_force_idx.shape[0] * brute_force_idx.shape[1]
    for row_true, row_test in zip(brute_force_idx, ann_idx):
        matches += len(set(row_true.tolist()) &amp; set(row_test.tolist()))
    return matches / total if total else 0.0
</pre>



<p>We pick some random query rows from the embedding matrix and compare ANN results against the ground truth to compute <code data-enlighter-language="python" class="EnlighterJSRAW">recall@k</code>.</p>
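<p>Here is a tiny worked example of the recall computation, with hypothetical index results at <em>k</em> = 3:</p>

```python
import numpy as np

truth = np.array([[0, 1, 2], [3, 4, 5]])   # exact top-3 per query (ground truth)
ann = np.array([[0, 2, 9], [4, 5, 3]])     # what a hypothetical ANN index returned

matches = sum(len(set(t).intersection(a))
              for t, a in zip(truth.tolist(), ann.tolist()))
recall = matches / truth.size
print(recall)  # (2 + 3) / 6 = 0.833...
```

<p>Order within the top-<em>k</em> doesn’t matter here: recall only asks how many of the true neighbors the index found.</p>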



<h4 class="wp-block-heading">c) Benchmark Core</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="16">def benchmark(
    embeddings: np.ndarray,
    queries: np.ndarray,
    k: int = config.DEFAULT_TOP_K,
    hnsw_m: int = 32,
    hnsw_ef_search: int = 64,
    ivf_nlist: int = 8,
    ivf_nprobe: int = 4,
    auto_adjust_ivf: bool = True,
):
    # 1) Ground truth
    bf_idx, bf_scores = brute_force_search(embeddings, queries, k)
    results = []

    # 2) Flat
    build_start = time.perf_counter()
    flat_index = build_flat_index(embeddings)
    build_time_flat = time.perf_counter() - build_start
    flat_index.search(queries[:1].astype(np.float32), k)  # warm-up
    latency_flat = measure_latency(flat_index.search, queries.astype(np.float32), k)
    flat_dist, flat_idx_res = flat_index.search(queries.astype(np.float32), k)
    recall_flat = compute_recall(bf_idx, flat_idx_res)
    if recall_flat &lt; 0.99:
        print(f"[red]Warning: Flat recall {recall_flat:.3f} &lt; 0.99. Check normalization or brute force implementation.[/red]")
    results.append(("Flat", recall_flat, latency_flat["avg_s"], build_time_flat,
                    estimate_index_memory_bytes("Flat", embeddings.shape[0], embeddings.shape[1])))

    # 3) HNSW
    try:
        build_start = time.perf_counter()
        hnsw_index = build_hnsw_index(embeddings, m=hnsw_m, ef_search=hnsw_ef_search)
        build_time_hnsw = time.perf_counter() - build_start
        hnsw_index.search(queries[:1].astype(np.float32), k)
        latency_hnsw = measure_latency(hnsw_index.search, queries.astype(np.float32), k)
        hnsw_dist, hnsw_idx_res = hnsw_index.search(queries.astype(np.float32), k)
        recall_hnsw = compute_recall(bf_idx, hnsw_idx_res)
        results.append(("HNSW", recall_hnsw, latency_hnsw["avg_s"], build_time_hnsw,
                        estimate_index_memory_bytes("HNSW", embeddings.shape[0], embeddings.shape[1], m=hnsw_m)))
    except Exception as e:
        results.append(("HNSW (unavailable)", 0.0, 0.0, 0.0, 0))
        print(f"[yellow]HNSW build failed: {e}[/yellow]")

    # 4) IVF-Flat (+ tiny-corpus auto-tuning)
    try:
        effective_nlist = ivf_nlist
        if auto_adjust_ivf:
            N = embeddings.shape[0]
            if N &lt; ivf_nlist * 5:
                shrunk = max(2, min(ivf_nlist, max(2, N // 2)))
                if shrunk != ivf_nlist:
                    print(f"[yellow]Adjusting nlist from {ivf_nlist} to {shrunk} for tiny corpus (N={N}).[/yellow]")
                    effective_nlist = shrunk
        build_start = time.perf_counter()
        ivf_index = build_ivf_flat_index(embeddings, nlist=effective_nlist, nprobe=ivf_nprobe)
        build_time_ivf = time.perf_counter() - build_start
        ivf_index.search(queries[:1].astype(np.float32), k)
        latency_ivf = measure_latency(ivf_index.search, queries.astype(np.float32), k)
        ivf_dist, ivf_idx_res = ivf_index.search(queries.astype(np.float32), k)
        recall_ivf = compute_recall(bf_idx, ivf_idx_res)
        results.append(("IVF-Flat", recall_ivf, latency_ivf["avg_s"], build_time_ivf,
                        estimate_index_memory_bytes("IVF-Flat", embeddings.shape[0], embeddings.shape[1], nlist=effective_nlist)))
    except Exception as e:
        results.append(("IVF-Flat (failed)", 0.0, 0.0, 0.0, 0))
        print(f"[yellow]IVF build failed: {e}[/yellow]")

    return bf_idx, results
</pre>



<p>The pipeline is: </p>



<ul class="wp-block-list">
<li>create a <strong>ground-truth</strong> baseline </li>



<li>build each index </li>



<li><strong>warm up</strong> </li>



<li>measure latency </li>



<li>compute recall vs. the ground-truth baseline </li>



<li>record a row of metrics</li>
</ul>



<p>The IVF block also includes a <strong>tiny-corpus auto-adjust</strong> so you don’t “over-partition” small datasets.</p>



<h4 class="wp-block-heading">d) Results Table</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="17">def display_results(results):
    table = Table(title="ANN Benchmark - Recall / Query(ms) / Build(ms) / Mem(KB)")
    table.add_column("Index")
    table.add_column("Recall", justify="right")
    table.add_column("Query (ms)", justify="right")
    table.add_column("Build (ms)", justify="right")
    table.add_column("Mem (KB)", justify="right")
    for name, recall, q_latency, build_time, mem_bytes in results:
        table.add_row(
            name,
            f"{recall:.3f}",
            f"{q_latency*1000:.2f}",
            f"{build_time*1000:.1f}",
            f"{mem_bytes/1024:.1f}",
        )
    print(table)
</pre>



<p>This produces the clean terminal table you’ve seen earlier — great for screenshots.</p>



<h4 class="wp-block-heading">e) Persisting Indexes</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="18">def persist_indexes(embeddings: np.ndarray):
    from pyimagesearch.vector_search_utils import save_index, build_flat_index
    flat = build_flat_index(embeddings)
    save_index(flat, config.FLAT_INDEX_PATH)
    try:
        from pyimagesearch.vector_search_utils import build_hnsw_index
        hnsw = build_hnsw_index(embeddings)
        save_index(hnsw, config.HNSW_INDEX_PATH)
    except Exception:
        pass
</pre>



<p>We persist Flat and HNSW by default, so Lesson 3 can load them directly. (Add IVF persistence if you want — same pattern.)</p>



<h4 class="wp-block-heading">f) CLI Entrypoint</h4>



<pre class="EnlighterJSRAW" data-enlighter-language="python" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="19">def main():
    parser = argparse.ArgumentParser(description="ANN benchmark demo")
    parser.add_argument("--k", type=int, default=5)
    parser.add_argument("--queries", type=int, default=5)
    parser.add_argument("--hnsw-m", type=int, default=32)
    parser.add_argument("--hnsw-ef-search", type=int, default=64)
    parser.add_argument("--ivf-nlist", type=int, default=8)
    parser.add_argument("--ivf-nprobe", type=int, default=4)
    parser.add_argument("--no-auto-adjust-ivf", action="store_true")
    args = parser.parse_args()

    print("[bold magenta]Loading embeddings...[/bold magenta]")
    embeddings, metadata, texts = ensure_embeddings()

    q_idx = sample_query_indices(args.queries, embeddings.shape[0])
    queries = embeddings[q_idx]

    print("[bold magenta]Benchmarking indexes...[/bold magenta]")
    bf_idx, results = benchmark(
        embeddings,
        queries,
        k=args.k,
        hnsw_m=args.hnsw_m,
        hnsw_ef_search=args.hnsw_ef_search,
        ivf_nlist=args.ivf_nlist,
        ivf_nprobe=args.ivf_nprobe,
        auto_adjust_ivf=not args.no_auto_adjust_ivf,
    )
    display_results(results)

    print("[bold magenta]Persisting indexes...[/bold magenta]")
    persist_indexes(embeddings)

    print("[green]\nDone. Proceed to Post 3 for RAG integration.\n[/green]")
</pre>



<p>Run it like this:</p>



<pre class="EnlighterJSRAW" data-enlighter-language="shell" data-enlighter-theme="" data-enlighter-highlight="" data-enlighter-linenumbers="true" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="20">$ python 02_vector_search_ann.py --queries 10 --k 5 --ivf-nlist 32 --ivf-nprobe 8 --hnsw-m 32 --hnsw-ef-search 128
</pre>



<h4 class="wp-block-heading">Terminal Output (example)</h4>


<div class="wp-block-image">
<figure class="aligncenter size-full"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-64.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="368" height="299" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-64.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52817" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-64.png?size=126x102&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-64-300x244.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-64.png?lossy=2&amp;strip=1&amp;webp=1 368w" sizes="(max-width: 368px) 100vw, 368px" /></a></figure></div>

<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-60-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="234" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-60-1024x234.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52801" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-60.png?size=126x29&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-60-300x68.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-60.png?size=378x86&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-60.png?size=504x115&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-60.png?size=630x144&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-60-768x175.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-60-1024x234.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-60-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 7:</strong> The FAISS index lifecycle — from generating embeddings and building the index to benchmarking, saving for reuse, and querying for top-<em>k</em> results (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Benchmarking-and-Analyzing-Results"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Benchmarking-and-Analyzing-Results">Benchmarking and Analyzing Results</a></h2>



<p>After running the script with: </p>



<p><code data-enlighter-language="python" class="EnlighterJSRAW">python 02_vector_search_ann.py --queries 10 --k 5 --ivf-nlist 32 --ivf-nprobe 8 --hnsw-m 32 --hnsw-ef-search 128</code></p>



<p>FAISS prints a benchmark table comparing <strong>Flat</strong>, <strong>HNSW</strong>, and <strong>IVF-Flat</strong> across recall, query latency, build time, and memory footprint.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><a href="https://pyimagesearch.com/wp-content/uploads/2026/02/image-61-scaled.png" target="_blank" rel=" noreferrer noopener"><img decoding="async" width="1024" height="381" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-61-1024x381.png?lossy=2&strip=1&webp=1" alt="" class="wp-image-52803" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-61.png?size=126x47&amp;lossy=2&amp;strip=1&amp;webp=1 126w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-61-300x112.png?lossy=2&amp;strip=1&amp;webp=1 300w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-61.png?size=378x141&amp;lossy=2&amp;strip=1&amp;webp=1 378w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-61.png?size=504x188&amp;lossy=2&amp;strip=1&amp;webp=1 504w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-61.png?size=630x234&amp;lossy=2&amp;strip=1&amp;webp=1 630w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-61-768x286.png?lossy=2&amp;strip=1&amp;webp=1 768w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-61-1024x381.png?lossy=2&amp;strip=1&amp;webp=1 1024w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-61-scaled.png?lossy=2&amp;strip=1&amp;webp=1 1080w, https://b2633864.smushcdn.com/2633864/wp-content/uploads/2026/02/image-61-1536x571.png?lossy=2&amp;strip=1&amp;webp=1 1536w" sizes="(max-width: 630px) 100vw, 630px" /></a><figcaption class="wp-element-caption"><strong>Figure 8:</strong> Benchmark results comparing FAISS Flat, HNSW, and IVF-Flat indexes across recall, query latency, build time, and memory usage (source: image by the author).</figcaption></figure></div>


<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-How-to-Interpret-These-Results"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-How-to-Interpret-These-Results">How to Interpret These Results</a></h3>



<p>Let’s break down what this means.</p>



<ul class="wp-block-list">
<li><strong>Flat Index:</strong><br>This performs an <strong>exact nearest-neighbor search</strong>. Every query compares against all stored embeddings. It gives <strong>perfect recall (1.0)</strong> but becomes <strong>slow at scale</strong> because it’s linear in the number of vectors.</li>



<li><strong>HNSW (Hierarchical Navigable Small World):</strong><br>A <strong>graph-based index</strong> that builds layered shortcuts between points. It maintains <strong>near-perfect recall</strong> while cutting down query latency by 10-50× on large datasets.<br>Here, because our dataset is tiny (41 vectors), the difference isn’t dramatic — but with 1M+ vectors, HNSW scales gracefully.</li>



<li><strong>IVF-Flat (Inverted File Index):</strong><br>This structure first clusters vectors into “lists” (centroids), then searches only a few clusters (<code data-enlighter-language="python" class="EnlighterJSRAW">nprobe</code>).<br>You can see a <strong>small recall drop (0.98)</strong> because some true neighbors fall outside probed clusters — but <strong>query speed improves noticeably</strong>.<br>The build time is slightly higher since training the quantizer (<em>k</em>-means step) adds overhead.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Takeaways"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Takeaways">Takeaways</a></h3>



<ul class="wp-block-list">
<li>For small datasets, <strong>Flat</strong> is perfectly fine — it’s fast and exact.</li>



<li>For large-scale retrieval (e.g., thousands of documents), <strong>HNSW</strong> offers the best balance between accuracy and latency.</li>



<li>For ultra-large datasets with millions of vectors, <strong>IVF-Flat</strong> or <strong>IVF-PQ</strong> can drastically cut memory and latency with minimal recall loss.</li>



<li>Always benchmark with your actual data — FAISS lets you trade off accuracy and performance based on your use case.</li>
</ul>
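<p>The <code>recall@k</code> column in the benchmark is measured against the Flat index's exact results. A minimal, NumPy-only sketch of that metric (the helper name here is ours, not from the benchmark script):</p>

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbors the ANN index also returned."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    return hits / exact_ids.size

# Toy example: the ANN index misses one true neighbor in the second query.
exact = np.array([[0, 1, 2], [3, 4, 5]])    # ground truth from a Flat index
approx = np.array([[0, 1, 2], [3, 4, 9]])   # candidate ids from an ANN index
print(round(recall_at_k(approx, exact), 3))  # 0.833
```

<p>This is why the Flat index always scores 1.0 in the table: it is compared against itself, and the ANN indexes are scored by how much of its exact answer they recover.</p>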



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<div id="pitch" style="padding: 40px; width: 100%; background-color: #F4F6FA;">
	<h3>What's next? We recommend <a target="_blank" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend">PyImageSearch University</a>.</h3>

	<script src="https://fast.wistia.com/embed/medias/kno0cmko2z.jsonp" async></script><script src="https://fast.wistia.com/assets/external/E-v1.js" async></script><div class="wistia_responsive_padding" style="padding:56.25% 0 0 0;position:relative;"><div class="wistia_responsive_wrapper" style="height:100%;left:0;position:absolute;top:0;width:100%;"><div class="wistia_embed wistia_async_kno0cmko2z videoFoam=true" style="height:100%;position:relative;width:100%"><div class="wistia_swatch" style="height:100%;left:0;opacity:0;overflow:hidden;position:absolute;top:0;transition:opacity 200ms;width:100%;"><img decoding="async" src="https://fast.wistia.com/embed/medias/kno0cmko2z/swatch" style="filter:blur(5px);height:100%;object-fit:contain;width:100%;" alt="" aria-hidden="true" onload="this.parentNode.style.opacity=1;" /></div></div></div></div>

	<div style="margin-top: 32px; margin-bottom: 32px; ">
		<strong>Course information:</strong><br/>
		86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: April 2026<br/>
		<span style="color: #169FE6;">★★★★★</span> 4.84 (128 Ratings) • 16,000+ Students Enrolled
	</div>

	<p><strong>I strongly believe that if you had the right teacher you could <em>master</em> computer vision and deep learning.</strong></p>

	<p>Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?</p>

	<p>That’s <em>not</em> the case.</p>

	<p>All you need to master computer vision and deep learning is for someone to explain things to you in <em>simple, intuitive</em> terms. <em>And that’s exactly what I do</em>. My mission is to change education and how complex Artificial Intelligence topics are taught.</p>

	<p>If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to <em>successfully</em> and <em>confidently</em> apply computer vision to your work, research, and projects. Join me in computer vision mastery.</p>

	<p><strong>Inside PyImageSearch University you'll find:</strong></p>

	<ul style="margin-left: 0px;">
		<li style="list-style: none;">&check; <strong>86+ courses</strong> on essential computer vision, deep learning, and OpenCV topics</li>
		<li style="list-style: none;">&check; <strong>86 Certificates</strong> of Completion</li>
		<li style="list-style: none;">&check; <strong>115+ hours hours</strong> of on-demand video</li>
		<li style="list-style: none;">&check; <strong>Brand new courses released <em>regularly</em></strong>, ensuring you can keep up with state-of-the-art techniques</li>
		<li style="list-style: none;">&check; <strong>Pre-configured Jupyter Notebooks in Google Colab</strong></li>
		<li style="list-style: none;">&check; Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)</li>
		<li style="list-style: none;">&check; Access to <strong>centralized code repos for <em>all</em> 540+ tutorials</strong> on PyImageSearch</li>
		<li style="list-style: none;">&check; <strong> Easy one-click downloads</strong> for code, datasets, pre-trained models, etc.</li>
		<li style="list-style: none;">&check; <strong>Access</strong> on mobile, laptop, desktop, etc.</li>
	</ul>

	<p style="text-align: center;">
		<a target="_blank" class="button link" href="https://pyimagesearch.com/pyimagesearch-university/?utm_source=blogPost&utm_medium=bottomBanner&utm_campaign=What%27s%20next%3F%20I%20recommend" style="background-color: #6DC713; border-bottom: none;">Click here to join PyImageSearch University</a>
	</p>
</div>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h2-Summary"/>



<h2 class="wp-block-heading"><a href="#TOC-h2-Summary">Summary</a></h2>



<p>In this tutorial, you moved beyond understanding embeddings and stepped into the practical world of <strong>vector search</strong>. You began by revisiting the limitations of brute-force similarity search and saw why indexing is crucial when datasets grow larger. Then, you explored how <strong>FAISS</strong>, Meta’s open-source library for efficient similarity search, builds the backbone of modern vector databases.</p>



<p>From there, you implemented and compared three major index types (i.e., <strong>Flat</strong>, <strong>HNSW</strong>, and <strong>IVF-Flat</strong>), each offering a different trade-off between speed, recall, and memory. You wrote code to benchmark their performance, measuring <code data-enlighter-language="python" class="EnlighterJSRAW">recall@k</code>, <strong>query latency</strong>, <strong>build time</strong>, and <strong>memory usage</strong>, and visualized the results directly in your terminal. Along the way, you saw how HNSW maintains near-perfect accuracy while drastically improving query speed, and how IVF-Flat compresses the search space to achieve sub-millisecond responses even with slight precision loss.</p>



<p>By the end, you not only understood <em>how</em> approximate nearest neighbor (ANN) search works but also saw <em>why</em> it matters — enabling scalable, real-time retrieval without compromising too much on accuracy. You also saved your indexes, preparing them for use in the next lesson, where we’ll connect them to a <strong>Retrieval-Augmented Generation (RAG) </strong>pipeline and bring everything full circle: embeddings, vector search, and large language models working together in action.</p>



<hr class="wp-block-separator has-alpha-channel-opacity" id="h3-Citation-Information"/>



<h3 class="wp-block-heading"><a href="#TOC-h3-Citation-Information">Citation Information</a></h3>



<p><strong>Singh, V.</strong> “Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained,” <em>PyImageSearch</em>, P. Chugh, S. Huot, A. Sharma, and P. Thakur, eds., 2026, <a href="https://pyimg.co/htl5f" target="_blank" rel="noreferrer noopener">https://pyimg.co/htl5f</a></p>



<pre class="EnlighterJSRAW" data-enlighter-language="raw" data-enlighter-theme="classic" data-enlighter-highlight="" data-enlighter-linenumbers="false" data-enlighter-lineoffset="" data-enlighter-title="Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained" data-enlighter-group="22">@incollection{Singh_2026_vector-search-with-faiss-ann-explained,
  author = {Vikram Singh},
  title = {{Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained}},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/htl5f},
}
</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), </strong><em><strong>simply enter your email address in the form below!</strong></em></p>



<div id="download-the-code" class="post-cta-wrap">
<div class="gpd-post-cta">
	<div class="gpd-post-cta-content">
		

			<div class="gpd-post-cta-top">
				<div class="gpd-post-cta-top-image"><img decoding="async" src="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1" alt="" srcset="https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?lossy=2&strip=1&webp=1 410w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=126x174&lossy=2&strip=1&webp=1 126w,https://b2633864.smushcdn.com/2633864/wp-content/uploads/2020/01/cta-source-guide-1.png?size=252x348&lossy=2&strip=1&webp=1 252w" sizes="(max-width: 410px) 100vw, 410px" /></div>
				
				<div class="gpd-post-cta-top-title"><h4>Download the Source Code and FREE 17-page Resource Guide</h4></div>
				<div class="gpd-post-cta-top-desc"><p>Enter your email address below to get a .zip of the code and a <strong>FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning.</strong> Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!</p></div>


			</div>

			<div class="gpd-post-cta-bottom">
				<form id="footer-cta-code" class="footer-cta" action="https://www.getdrip.com/forms/4130035/submissions" method="post" target="blank" data-drip-embedded-form="4130035">
					<input name="fields[email]" type="email" value="" placeholder="Your email address" class="form-control" />

					<button type="submit">Download the code!</button>

					<div style="display: none;" aria-hidden="true"><label for="website">Website</label><br /><input type="text" id="website" name="website" tabindex="-1" autocomplete="false" value="" /></div>
				</form>
			</div>


		
	</div>

</div>
</div>
<p>The post <a rel="nofollow" href="https://pyimagesearch.com/2026/02/16/vector-search-with-faiss-approximate-nearest-neighbor-ann-explained/">Vector Search with FAISS: Approximate Nearest Neighbor (ANN) Explained</a> appeared first on <a rel="nofollow" href="https://pyimagesearch.com">PyImageSearch</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
