PyImageSearch

Building and Training a Kimi-K2 Model Using DeepSeek-V3 Components

Puneet Mangla — Mon, 11 May 2026 12:45:00 +0000

Table of Contents

Building and Training a Kimi-K2 Model Using DeepSeek-V3 Components
Kimi-K2 vs DeepSeek-V3: Key Architecture Differences in LLM Design

Mixture of Experts Scaling in Kimi-K2: Model Size, Sparsity, and Efficiency
Attention Head Optimization in Kimi-K2 for Efficient Long-Context LLMs

MuonClip Optimizer: Stabilizing Large-Scale LLM Training in Kimi-K2

Token Efficiency in LLM Training: Why It Matters for Kimi-K2
Attention Logit Explosion in LLMs: Training Instability and Challenges
QK-Clip: Preventing Attention Logit Explosion in Kimi-K2 Training

Training Data Optimization for Kimi-K2: Improving Token Utility in LLMs

Token Utility in LLM Training: Maximizing Learning per Token
Knowledge Data Rephrasing for LLMs: Improving Training Data Quality

Kimi-K2 Implementation: Training an Open-Source LLM with DeepSeek-V3

Multi-Head Latent Attention (MLA) with Max Logit Tracking in Kimi-K2
Implementing the MuonClip Optimizer for Stable LLM Training
Complete Kimi-K2 Training Pipeline: Setup, Config, and Optimization

Summary

Citation Information

Building and Training a Kimi-K2 Model Using DeepSeek-V3 Components

The landscape of large language models (LLMs) is undergoing a fundamental transformation toward agentic intelligence, where models can autonomously perceive, plan, reason, and act within complex and dynamic environments. This paradigm shift moves beyond traditional static imitation learning toward models that actively learn through interaction, acquire skills beyond their training distribution, and adapt their behavior based on experience. Agentic intelligence represents a critical capability for the next generation of foundation models, with transformative implications for tool use, software development, and real-world autonomy.

Kimi-K2 stands at the forefront of this revolution. As a 1.04 trillion-parameter Mixture-of-Experts (MoE) language model with 32 billion activated parameters, Kimi-K2 was purposefully designed to address the core challenges of agentic capability development. The model achieves remarkable performance across diverse benchmarks:

66.1 on Tau2-bench
76.5 on ACEBench (en)
65.8 on SWE-bench Verified
53.7 on LiveCodeBench v6
75.1 on GPQA-Diamond

On the LMSYS (Large Model Systems Organization) Arena leaderboard, Kimi-K2 ranks as the top open-source model and 5th overall, competing closely with Claude 4 Opus and Claude 4 Sonnet.

In this lesson, we dive deep into the technical innovations behind Kimi-K2, focusing on its architectural differences from DeepSeek-V3, the revolutionary MuonClip optimizer, and training data improvements. We also provide a complete implementation guide using DeepSeek-V3 components as building blocks.

Kimi-K2 vs DeepSeek-V3: Key Architecture Differences in LLM Design

While Kimi-K2 builds on DeepSeek-V3’s architecture, several strategic modifications were made to optimize agentic capabilities and inference efficiency. Understanding these architectural differences is crucial for implementing the model effectively (Table 1).

Table 1: Kimi-K2 vs DeepSeek-V3 Configurations (source: Kimi Team, 2026).

Mixture of Experts Scaling in Kimi-K2: Model Size, Sparsity, and Efficiency

The most significant architectural departure lies in Kimi-K2’s aggressive sparsity scaling. Through carefully controlled small-scale experiments, the Kimi team developed a sparsity scaling law that demonstrated a clear relationship: with the number of activated parameters held constant (i.e., constant FLOPs), increasing the total number of experts consistently lowers both training and validation loss. This finding led to a dramatic increase in model sparsity.

Kimi-K2 employs 384 experts compared to DeepSeek-V3’s 256 experts, representing a 50% increase. Despite this, the model maintains 8 active experts per token, resulting in a sparsity ratio of 48 (384/8) versus DeepSeek-V3’s 32 (256/8). This increased sparsity comes with a trade-off: while total parameters grow to 1.04 trillion (54% more than DeepSeek-V3’s 671B), the number of activated parameters actually decreases to 32.6B (13% less than DeepSeek-V3’s 37B). This design choice optimizes the compute-performance frontier, achieving superior model quality while maintaining efficient inference.
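
The sparsity figures quoted above follow from simple division. The sketch below reproduces them; the helper name `sparsity_ratio` is illustrative and does not come from either model's codebase.

```python
def sparsity_ratio(total_experts: int, active_experts: int) -> float:
    """Sparsity as used above: total experts divided by experts active per token."""
    return total_experts / active_experts


kimi_k2 = sparsity_ratio(384, 8)
deepseek_v3 = sparsity_ratio(256, 8)
expert_increase_pct = (384 - 256) / 256 * 100

print(f"Kimi-K2 sparsity:     {kimi_k2:.0f}")
print(f"DeepSeek-V3 sparsity: {deepseek_v3:.0f}")
print(f"Expert increase:      {expert_increase_pct:.0f}%")
```

Holding the 8 active experts fixed while raising the total is what moves the ratio from 32 to 48.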

Attention Head Optimization in Kimi-K2 for Efficient Long-Context LLMs

A critical optimization for agentic applications involves the number of attention heads. DeepSeek-V3 sets the number of attention heads to roughly twice the number of model layers (128 heads for 61 layers) to better utilize memory bandwidth. However, as context length increases, this design incurs significant inference overhead.

For agentic applications requiring efficient long-context processing, this becomes prohibitive. With a 128k sequence length, increasing attention heads from 64 to 128 (while keeping 384 total experts) leads to an 83% increase in inference FLOPs. Through controlled experiments, the Kimi team found that doubling the number of attention heads yields only modest improvements in validation loss (0.5% to 1.2%) under iso-token training conditions.

Given that sparsity 48 already provides strong performance, the marginal gains from doubling attention heads do not justify the inference cost. Kimi-K2 therefore uses 64 attention heads (half of DeepSeek-V3’s 128), dramatically reducing inference costs for long-context agentic workloads while maintaining competitive performance.

MuonClip Optimizer: Stabilizing Large-Scale LLM Training in Kimi-K2

The MuonClip optimizer represents one of the most significant innovations in Kimi-K2’s development, addressing the fundamental challenge of training stability at trillion-parameter scale while maintaining token efficiency. Understanding MuonClip requires examining both the underlying Muon optimizer and the novel QK-Clip mechanism that makes it stable for large-scale training.

Token Efficiency in LLM Training: Why It Matters for Kimi-K2

Given the increasingly limited availability of high-quality human data, token efficiency has emerged as a critical factor in LLM scaling. Token efficiency refers to how much performance improvement is achieved per token consumed during training. The Muon optimizer, introduced by Jordan et al. (2024), substantially outperforms AdamW under the same compute budget, model size, and training data volume.

Previous work in Moonlight demonstrated that Muon’s token efficiency gains make it an ideal choice for maximizing the intelligence extracted from limited high-quality tokens. However, scaling Muon to trillion-parameter models revealed a critical challenge: training instability due to exploding attention logits.

Attention Logit Explosion in LLMs: Training Instability and Challenges

During medium-scale training runs using vanilla Muon, attention logits rapidly exceeded magnitudes of 1000, leading to numerical instabilities and occasional training divergence (Figure 1). This phenomenon occurred more frequently with Muon than with AdamW, suggesting that Muon’s aggressive optimization dynamics amplify instabilities in the attention mechanism.

Figure 1: Attention logits rapidly exceed 1000, which could lead to potential numerical instabilities and even training divergence (source: Kimi Team, 2026).

Existing mitigation strategies proved insufficient:

Logit soft-capping (used in Gemma) directly clips attention logits, but the dot products between queries and keys can still grow excessively before capping is applied
Query-Key Normalization (QK-Norm) (Dehghani et al., 2023) is incompatible with Multi-head Latent Attention (MLA) because full key matrices are not explicitly materialized during inference
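
To make the soft-capping limitation concrete, here is a minimal sketch that assumes the common tanh form of soft-capping. The cap value of 50 is chosen only for illustration. The capped value is bounded, but the raw query-key dot product feeding it is still 1000, so the weights that produced it remain unconstrained.

```python
import math


def soft_cap(logit: float, cap: float) -> float:
    """Tanh soft-capping: squashes a logit smoothly into the range [-cap, cap]."""
    return cap * math.tanh(logit / cap)


raw_logit = 1000.0  # pre-cap query-key dot product that has already exploded
cap = 50.0          # illustrative cap value, an assumption for this example

capped = soft_cap(raw_logit, cap)
small = soft_cap(1.0, cap)

print(f"raw={raw_logit}, capped={capped:.4f}")
print(f"a small logit stays nearly unchanged: {small:.4f}")
```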

QK-Clip: Preventing Attention Logit Explosion in Kimi-K2 Training

To address this fundamental challenge, the Kimi team proposed QK-Clip, a novel weight-clipping mechanism that explicitly constrains attention logits by rescaling the query and key projection weights post-update. The elegance of QK-Clip lies in its simplicity: it does not alter forward and backward computation in the current step but instead uses maximum logits as a guiding signal to control weight growth (Figure 2).

Figure 2: Maximum logits for Kimi-K2 with MuonClip and τ = 100 over the entire training run. The max logits rapidly increase to the capped value of 100 before decaying to a stable range (source: Kimi Team, 2026).

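For each attention head $h$, the attention mechanism computes:

$$Q^h = X W_q^h, \quad K^h = X W_k^h, \quad V^h = X W_v^h$$

The attention output is:

$$O^h = \mathrm{softmax}\left(\frac{1}{\sqrt{d}} Q^h (K^h)^\top\right) V^h$$

QK-Clip defines the max logit per head as:

$$S_{\max}^h = \frac{1}{\sqrt{d}} \max_{X \in B} \max_{i,j} Q_i^h (K_j^h)^\top$$

where $B$ is the current batch and $i, j$ index different tokens.

When $S_{\max}^h$ exceeds a threshold $\tau$ (set to 100 for Kimi-K2), QK-Clip rescales the weights. Critically, the rescaling is applied per-head rather than globally, minimizing intervention on heads that remain stable:

$$\gamma_h = \min\left(1, \frac{\tau}{S_{\max}^h}\right), \quad W_q^h \leftarrow \sqrt{\gamma_h}\, W_q^h, \quad W_k^h \leftarrow \sqrt{\gamma_h}\, W_k^h$$

For MLA, the rescaling is applied per component: the head-specific content components $q^C$ and $k^C$ are each scaled by $\sqrt{\gamma_h}$, the head-specific rotary component $q^R$ is scaled by $\gamma_h$, and the shared rotary component $k^R$ is left untouched.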

This per-head, component-aware clipping represents a substantial refinement over naive global clipping strategies.
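
A small worked example shows why splitting the correction evenly between queries and keys works. The sketch below uses toy vectors for a single head, writes the scale factor as `gamma` (1 when the head is stable, otherwise the threshold divided by the max logit), and omits the $1/\sqrt{d}$ factor for brevity. All names are illustrative.

```python
import math


def qk_clip_gamma(max_logit: float, tau: float) -> float:
    """Per-head scale factor: 1.0 when the head is stable, tau / max_logit otherwise."""
    return min(1.0, tau / max_logit)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


tau = 100.0
q = [30.0, 40.0]   # toy query vector for one head
k = [4.0, 3.0]     # toy key vector for the same head
logit = dot(q, k)  # 120 + 120 = 240, above tau

gamma = qk_clip_gamma(logit, tau)
root = math.sqrt(gamma)
q_clipped = [root * x for x in q]  # same effect as scaling the query weights
k_clipped = [root * x for x in k]  # same effect as scaling the key weights
clipped_logit = dot(q_clipped, k_clipped)

stable_gamma = qk_clip_gamma(42.0, tau)  # below tau, so no rescaling

print(f"before: {logit}, after: {clipped_logit:.4f}, stable head gamma: {stable_gamma}")
```

Because the projections are linear, scaling the vectors here is equivalent to scaling the rows of the projection weights that produced them.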

Figure 3 describes the complete algorithm for the MuonClip optimizer.

Figure 3: MuonClip Algorithm (source: Kimi Team, 2026).

Training Data Optimization for Kimi-K2: Improving Token Utility in LLMs

Beyond architectural and optimizer innovations, Kimi-K2’s superior performance stems significantly from strategic improvements in training data. With high-quality human-generated data becoming increasingly scarce, the focus shifts to increasing token utility, defined as the effective learning signal each token contributes to model updates.

Token Utility in LLM Training: Maximizing Learning per Token

Token efficiency in pre-training encompasses 2 related but distinct concepts:

Optimizer efficiency: How effectively the optimizer extracts signal from each gradient update (addressed by MuonClip)
Token utility: The inherent information density and learning signal in each token

Increasing token utility directly improves token efficiency. A naive approach involves repeated exposure to the same tokens across multiple epochs, but this leads to overfitting and reduced generalization. The key innovation in Kimi-K2 lies in a sophisticated synthetic data generation strategy that amplifies high-quality tokens without inducing overfitting.

Knowledge Data Rephrasing for LLMs: Improving Training Data Quality

Pre-training on knowledge-intensive text presents a fundamental trade-off: a single epoch is insufficient for comprehensive knowledge absorption, while multi-epoch repetition yields diminishing returns. To resolve this tension, Kimi-K2 employs a synthetic rephrasing framework with the following 3 key components.

Style- and Perspective-Diverse Prompting

To enhance linguistic diversity while maintaining factual integrity, carefully engineered prompts guide a large language model to generate faithful rephrasings in varied styles and perspectives. This approach ensures that while surface-level linguistic features change, the underlying factual content remains consistent. The diversity of expressions forces the model to learn robust representations of the same knowledge across multiple linguistic realizations.

Chunk-wise Autoregressive Generation

Long documents pose a challenge for standard LLM-based rewriting due to implicit output length limitations. Kimi-K2 addresses this through a chunk-based autoregressive strategy: documents are segmented, each segment is rephrased individually with preserved context, and segments are stitched back together to form complete passages. This methodology prevents information loss and maintains global coherence across extended texts (Figure 4).
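
The segment, rephrase, and stitch loop described above can be sketched as follows. This is only an illustration of the control flow: `rephrase_fn` stands in for an LLM call, `toy_rephraser` is a hypothetical placeholder that upper-cases text, and the word-based chunking is an assumption rather than the Kimi team's actual segmentation.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def rephrase_document(
    text: str,
    rephrase_fn: Callable[[str, str], str],
    max_words: int = 50,
) -> str:
    """Rephrase chunk by chunk, passing the already rewritten text as context."""
    rewritten: List[str] = []
    for chunk in split_into_chunks(text, max_words):
        context = " ".join(rewritten)  # everything rewritten so far
        rewritten.append(rephrase_fn(chunk, context))
    return " ".join(rewritten)  # stitch the segments back together


def toy_rephraser(chunk: str, context: str) -> str:
    """Stand-in for an LLM call: upper-cases the chunk and ignores the context."""
    return chunk.upper()


document = "muon is an optimizer that orthogonalizes momentum updates for matrix parameters"
result = rephrase_document(document, toy_rephraser, max_words=4)
print(result)
```

Passing the accumulated rewrite as context is what makes the generation autoregressive across chunks, so later segments can stay consistent with earlier ones.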

Fidelity Verification

To ensure consistency between original and rewritten content, fidelity checks compare the semantic alignment of each rephrased passage with its source. This quality control step prevents the introduction of hallucinations or factual errors during the rephrasing process.

Figure 4: Auto-regressive chunk-wise rephrasing pipeline for long input excerpts (source: Kimi Team, 2026).

Mathematics Data Rephrasing

To enhance mathematical reasoning capabilities, high-quality mathematical documents are rewritten into a “learning-note” style following the SwallowMath methodology (Figure 5). This transformation converts dense mathematical exposition into more pedagogical formats that better support learning. Additionally, data diversity is increased by translating high-quality mathematical materials from other languages into English, which expands the pool of high-quality mathematical training data.

Figure 5: Four-stage pipeline for constructing SwallowMath (source: Fujii et al., 2026).

Overall Pre-training Corpus

The complete Kimi-K2 pre-training corpus comprises 15.5 trillion tokens of curated, high-quality data spanning 4 primary domains:

Web Text: General knowledge and natural language understanding
Code: Programming and structured reasoning
Mathematics: Quantitative reasoning and formal problem-solving
Knowledge: Domain-specific expertise and factual information

Kimi-K2 Implementation: Training an Open-Source LLM with DeepSeek-V3

In this section, we walk through the key implementation details for training Kimi-K2, focusing specifically on the components that differ from the standard DeepSeek-V3 implementation. We’ll examine the enhanced Multi-head Latent Attention with max logit tracking, the MuonClip optimizer implementation, and the custom training setup.

Multi-Head Latent Attention (MLA) with Max Logit Tracking in Kimi-K2

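The Multi-head Latent Attention (MLA) mechanism in Kimi-K2 extends DeepSeek-V3’s implementation with critical modifications to support QK-Clip. The key enhancement is per-head max-logit tracking during the forward pass, which provides the signal needed for weight clipping by the optimizer.

The listings in this section assume the imports below. DeepSeekConfig, RMSNorm, RotaryEmbedding, and apply_rope are taken to come from the DeepSeek-V3 components this lesson builds on.

import math
from typing import Callable, List, Optional, Union

import torch
import torch.nn as nn
import torch.nn.functional as F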

class MultiheadLatentAttention(nn.Module):
    """MLA block that also records per-head max attention logits for QK-Clip."""
    def __init__(self, config: DeepSeekConfig):
        super().__init__()
        self.config = config
        self.n_embd = config.n_embd
        self.n_head = config.n_head
        self.head_dim = config.n_embd // config.n_head

        # Compression dimensions
        self.kv_lora_rank = config.kv_lora_rank
        self.q_lora_rank = config.q_lora_rank
        self.rope_dim = config.rope_dim

        # KV compression
        self.kv_proj = nn.Linear(self.n_embd, self.kv_lora_rank, bias=False)
        self.kv_norm = RMSNorm(self.kv_lora_rank)

        # KV decompression
        self.k_decompress = nn.Linear(self.kv_lora_rank, self.n_head * self.head_dim, bias=False)
        self.v_decompress = nn.Linear(self.kv_lora_rank, self.n_head * self.head_dim, bias=False)

        # Query compression
        self.q_proj = nn.Linear(self.n_embd, self.q_lora_rank, bias=False)
        self.q_decompress = nn.Linear(self.q_lora_rank, self.n_head * self.head_dim, bias=False)

        # RoPE projections
        self.k_rope_proj = nn.Linear(self.n_embd, self.n_head * self.rope_dim, bias=False)
        self.q_rope_proj = nn.Linear(self.q_lora_rank, self.n_head * self.rope_dim, bias=False)

        # Output projection
        self.o_proj = nn.Linear(self.n_head * self.head_dim, self.n_embd, bias=config.bias)

        # Dropout
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)

        # RoPE
        self.rope = RotaryEmbedding(self.rope_dim, config.block_size)

        # Causal mask
        self.register_buffer(
            "causal_mask",
            torch.tril(torch.ones(config.block_size, config.block_size)).view(
                1, 1, config.block_size, config.block_size
            )
        )

        self.max_logits = 0.0  # Track maximum attention logits

On Lines 1-47, we define the MLA architecture following DeepSeek-V3’s design with compression and decompression of queries and key-values through low-rank projections. The key innovation appears on Line 49, where we initialize self.max_logits = 0.0, a critical state variable that tracks the maximum attention logits across heads. This tracking mechanism is essential for QK-Clip to function properly.

    def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
        B, T, C = x.size()

        # Compression phase
        kv_compressed = self.kv_norm(self.kv_proj(x))
        q_compressed = self.q_proj(x)

        # Decompression phase
        k_content = self.k_decompress(kv_compressed)
        v = self.v_decompress(kv_compressed)
        q_content = self.q_decompress(q_compressed)

        # RoPE components
        k_rope = self.k_rope_proj(x)
        q_rope = self.q_rope_proj(q_compressed)

        # Reshape for multi-head attention
        k_content = k_content.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        q_content = q_content.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k_rope = k_rope.view(B, T, self.n_head, self.rope_dim).transpose(1, 2)
        q_rope = q_rope.view(B, T, self.n_head, self.rope_dim).transpose(1, 2)

        # Apply RoPE
        cos, sin = self.rope(x, T)
        q_rope = apply_rope(q_rope, cos, sin)
        k_rope = apply_rope(k_rope, cos, sin)

        # Concatenate content and rope parts
        q = torch.cat([q_content, q_rope], dim=-1)
        k = torch.cat([k_content, k_rope], dim=-1)

On Lines 52-82, we implement the standard forward pass through the compression-decompression pipeline. The input undergoes compression via kv_proj and q_proj, followed by decompression through dedicated linear layers. We then reshape tensors for multi-head processing and apply Rotary Position Embeddings (RoPE) only to the dedicated rotary components (q_rope and k_rope), leaving the content components position-free, before concatenating the two parts. This separation allows per-head QK-Clip to target only the appropriate components without affecting shared rotary embeddings.

        # Attention computation
        scale = 1.0 / math.sqrt(q.size(-1))
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        with torch.no_grad():
            # Per-head max logit for QK-Clip: move heads first, flatten the batch
            # and token dimensions, then reduce. Stored as a list of Python floats.
            per_head = scores.transpose(0, 1).reshape(scores.size(1), -1)
            self.max_logits = per_head.max(dim=-1).values.tolist()

        # Apply causal mask
        scores = scores.masked_fill(self.causal_mask[:, :, :T, :T] == 0, float('-inf'))

        # Apply padding mask if provided (1 = keep, 0 = padding).
        # masked_fill avoids the NaN that 0 * -inf would produce in an additive mask.
        if attention_mask is not None:
            padding_mask = attention_mask[:, None, None, :] == 0
            scores = scores.masked_fill(padding_mask, float('-inf'))

        # Softmax and dropout
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)

        # Apply attention to values
        out = torch.matmul(attn_weights, v)

        # Reshape and project
        out = out.transpose(1, 2).contiguous().view(B, T, self.n_head * self.head_dim)
        out = self.resid_dropout(self.o_proj(out))

        return out

On Lines 89-94, we compute attention scores and implement the crucial max logit tracking. The score computation follows standard scaled dot-product attention. However, Lines 92-94 represent a key departure from vanilla DeepSeek-V3: we track the maximum attention logit per head using torch.no_grad() to avoid affecting gradients. The scores tensor has shape [batch, num_heads, seq_len, seq_len], and we transpose and reshape to extract per-head maximum values. This per-head granularity enables targeted intervention only on heads exhibiting logit explosion, minimizing disruption to stable heads.
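
To see the per-head reduction without any tensor library, the sketch below performs the same computation on nested Python lists laid out as [batch][head][query][key]. The function name and the toy values are illustrative only.

```python
from typing import List

Scores = List[List[List[List[float]]]]  # [batch][head][query][key]


def per_head_max_logits(scores: Scores) -> List[float]:
    """Return one maximum per head, reduced over batch, query, and key positions."""
    num_heads = len(scores[0])
    maxima = [float("-inf")] * num_heads
    for sample in scores:
        for h, head_scores in enumerate(sample):
            for row in head_scores:
                maxima[h] = max(maxima[h], max(row))
    return maxima


scores = [
    [[[1.0, 2.0], [3.0, 4.0]], [[150.0, 0.0], [0.0, 1.0]]],   # sample 0: head 0, head 1
    [[[0.5, 9.0], [2.0, 1.0]], [[-3.0, 7.0], [20.0, 5.0]]],   # sample 1: head 0, head 1
]
max_logits = per_head_max_logits(scores)
exploding_heads = [h for h, s in enumerate(max_logits) if s > 100.0]

print(max_logits)       # [9.0, 150.0]
print(exploding_heads)  # [1]
```

Only head 1 crosses the threshold of 100, so only that head would be rescaled.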

On Lines 97-113, we complete the attention mechanism with causal masking, optional padding masks, softmax normalization, and dropout. The final output projection maintains the standard MLA architecture. The elegance of this implementation lies in its non-invasiveness: max logit tracking adds minimal computational overhead (a single max operation under torch.no_grad) while providing the critical signal for optimizer-level weight clipping.
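
One subtlety when building padding masks deserves a short demonstration: multiplying a 0/1 mask by negative infinity is not safe, because 0 times infinity is NaN under IEEE floating-point rules. The plain-Python sketch below contrasts that pitfall with a select-style mask. The variable names are illustrative.

```python
import math

NEG_INF = float("-inf")
attention_mask = [1, 1, 0]  # 1 = real token, 0 = padding
scores = [0.7, -0.2, 1.5]

# Pitfall: (1 - mask) * -inf evaluates 0 * -inf for real tokens, which is NaN.
additive = [(1 - m) * NEG_INF for m in attention_mask]
broken = [s + a for s, a in zip(scores, additive)]

# Safer: select -inf only where the mask is 0 and keep the score otherwise.
masked = [s if m == 1 else NEG_INF for s, m in zip(scores, attention_mask)]


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


weights = softmax(masked)
print(broken)   # NaN for the real tokens, -inf for the padded one
print(weights)  # padded position receives exactly zero weight
```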

Implementing the MuonClip Optimizer for Stable LLM Training

The MuonClip optimizer represents the core innovation enabling stable trillion-parameter training. Our implementation integrates Newton-Schulz orthogonalization, RMS matching, weight decay, and per-head QK-Clip into a unified optimizer.

def apply_qk_clip_per_head(
    query_weights: torch.Tensor,
    key_weights: torch.Tensor,
    max_logits_per_head: Union[List[float], torch.Tensor],
    tau: float = 100.0
) -> None:
    if isinstance(max_logits_per_head, list):
        max_logits_per_head = torch.tensor(
            max_logits_per_head,
            device=query_weights.device,
            dtype=query_weights.dtype
        )
    apply_qk_clip_vectorized(query_weights, key_weights, max_logits_per_head, tau)

On Lines 1-13, we define the entry point for the QK-Clip application. The function accepts query and key projection weights along with per-head max logits and a threshold (defaulting to 100). We handle both list and tensor inputs for flexibility, converting lists to tensors on the appropriate device with matching dtype. The critical design choice here is in-place modification: we directly modify weight tensors to avoid memory allocation overhead during optimization.

@torch.no_grad()
def apply_qk_clip_vectorized(
    query_weights: torch.Tensor,
    key_weights: torch.Tensor,
    max_logits_per_head: torch.Tensor,
    tau: float = 100.0
) -> None:
    
    q_out, q_in = query_weights.shape[0], query_weights.shape[1]
    k_out, k_in = key_weights.shape[0], key_weights.shape[1]
    num_heads = len(max_logits_per_head)
    d_k = q_out // num_heads

    # Ensure tensor type
    if not isinstance(max_logits_per_head, torch.Tensor):
        max_logits_per_head = torch.tensor(
            max_logits_per_head,
            device=query_weights.device,
            dtype=query_weights.dtype
        )

    # Compute scaling factors: gamma = tau / max_logit where max_logit > tau
    needs_clip = max_logits_per_head > tau

On Lines 15-48, we extract dimensions, ensure tensor type compatibility, and compute the per-head scaling factor only for heads where $S_{\max}^h > \tau$.

    # If no clipping needed, return early
    if not needs_clip.any():
        return

    gamma = torch.where(
        needs_clip,
        tau / max_logits_per_head.clamp(min=1e-8),
        torch.ones_like(max_logits_per_head)
    )
    sqrt_gamma = torch.sqrt(gamma)

    # nn.Linear stores weights as [out_features, in_features], and the forward
    # pass splits out_features head-first, so head h owns rows [h*d_k, (h+1)*d_k).
    # Views share underlying storage, so in-place ops modify the original tensor.
    q_reshaped = query_weights.view(num_heads, q_out // num_heads, q_in)
    k_reshaped = key_weights.view(num_heads, k_out // num_heads, k_in)

    # Apply per-head scaling IN-PLACE: broadcast sqrt_gamma [num_heads]
    # over [num_heads, d_k, in_features]
    q_reshaped.mul_(sqrt_gamma.view(num_heads, 1, 1))
    k_reshaped.mul_(sqrt_gamma.view(num_heads, 1, 1))

On Lines 80-97, we perform the actual weight clipping through careful tensor reshaping and in-place multiplication. Because nn.Linear stores its weight as [out_features, in_features] and the forward pass splits the output dimension head-first via view(B, T, n_head, head_dim), each head owns a contiguous block of d_k rows. The weights are therefore viewed with an explicit head dimension so that every head can be scaled independently using in-place multiplication (mul_) with broadcasting. The square root scaling ensures that when query and key both receive $\sqrt{\gamma_h}$, their dot product receives the full scaling $\gamma_h$. This property lets us clip attention logits by rescaling the weights that produce them, rather than clipping logits directly after they’re computed.

Lines 77 and 78 implement early exit if no head requires clipping, which becomes a common case later in training when attention logits stabilize. This optimization avoids unnecessary computation when the model is well-behaved.
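
The row layout matters for per-head scaling, so here is a dependency-free sketch of the same idea on a nested list. It assumes the head-first row ordering implied by the view(B, T, n_head, head_dim) call in the forward pass, and includes the early exit for the all-stable case. The function name is hypothetical.

```python
import math
from typing import List


def scale_heads_in_place(weight: List[List[float]], max_logits: List[float], tau: float) -> None:
    """Scale each head's block of rows by sqrt(tau / max_logit); rows are stored head-first."""
    num_heads = len(max_logits)
    d_k = len(weight) // num_heads
    if all(s <= tau for s in max_logits):
        return  # early exit: every head is stable
    for h, s in enumerate(max_logits):
        if s <= tau:
            continue
        root = math.sqrt(tau / s)
        for r in range(h * d_k, (h + 1) * d_k):  # rows owned by head h
            weight[r] = [root * w for w in weight[r]]


# 2 heads, d_k = 2, in_features = 2, so the weight matrix has 4 rows
weight = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
scale_heads_in_place(weight, max_logits=[50.0, 400.0], tau=100.0)
print(weight)
```

Head 0 stays untouched, while the two rows of head 1 are multiplied by sqrt(100 / 400) = 0.5.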

class MuonClip(torch.optim.Optimizer):
    def __init__(
        self,
        params,
        lr: float = 1e-3,
        momentum: float = 0.95,
        weight_decay: float = 0.01,
        tau: float = 100.0,
        ns_steps: int = 5,
        eps: float = 1e-7
    ):
        if lr < 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        if not 0.0 <= momentum <= 1.0:
            raise ValueError(f"Invalid momentum value: {momentum}")
        if weight_decay < 0.0:
            raise ValueError(f"Invalid weight_decay value: {weight_decay}")
        if tau <= 0.0:
            raise ValueError(f"Invalid tau value: {tau}")

        defaults = dict(
            lr=lr,
            momentum=momentum,
            weight_decay=weight_decay,
            tau=tau,
            ns_steps=ns_steps,
            eps=eps
        )
        super().__init__(params, defaults)

        # For QK-Clip functionality
        self.model = None
        self.attention_layers = []

    def set_model(self, model: nn.Module):
        self.model = model
        if hasattr(model, 'get_attention_layers'):
            self.attention_layers = model.get_attention_layers()

On Lines 1-33, we define the MuonClip optimizer class, inheriting from PyTorch’s base Optimizer. The constructor accepts standard hyperparameters (learning rate, momentum, weight decay) plus MuonClip-specific parameters (the QK-Clip threshold $\tau$ and the number of Newton-Schulz steps). We validate all parameters and initialize state tracking. Critically, Lines 35-38 implement model registration through set_model(), which extracts attention layers for later QK-Clip application. This design separates optimizer logic from model architecture, allowing the optimizer to operate on any model exposing a get_attention_layers() method.
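
The get_attention_layers() contract is not shown in the listings, so the sketch below illustrates one way a model could satisfy it. Every class here is a hypothetical stand-in written in plain Python: ToyAttention mimics only the max_logits attribute, and ToyOptimizer mirrors only the registration logic of set_model().

```python
class ToyAttention:
    """Stand-in for an attention module; exposes only what QK-Clip reads."""

    def __init__(self):
        self.max_logits = 0.0


class ToyBlock:
    def __init__(self):
        self.attn = ToyAttention()


class ToyModel:
    def __init__(self, n_layer: int):
        self.blocks = [ToyBlock() for _ in range(n_layer)]

    def get_attention_layers(self):
        """Return every attention module so the optimizer can read max_logits."""
        return [block.attn for block in self.blocks]


class ToyOptimizer:
    """Mirrors only the registration logic of set_model()."""

    def __init__(self):
        self.model = None
        self.attention_layers = []

    def set_model(self, model):
        self.model = model
        if hasattr(model, "get_attention_layers"):
            self.attention_layers = model.get_attention_layers()


optimizer = ToyOptimizer()
optimizer.set_model(ToyModel(n_layer=3))
print(len(optimizer.attention_layers))
```

A model without the method is silently skipped, which means QK-Clip becomes a no-op rather than raising an error.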

    @torch.no_grad()
    def step(self, closure: Optional[Callable] = None) -> Optional[float]:
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            momentum = group['momentum']
            weight_decay = group['weight_decay']
            ns_steps = group['ns_steps']
            eps = group['eps']

            for p in group['params']:
                if p.grad is None:
                    continue

                grad = p.grad
                state = self.state[p]

                # Initialize momentum buffer
                if len(state) == 0:
                    state['momentum_buffer'] = torch.zeros_like(p)

                buf = state['momentum_buffer']

                # Apply momentum: Mt = μMt−1 + Gt
                buf.mul_(momentum).add_(grad)

                if p.ndim >= 2:  # 2D+ parameters - use Muon
                    # Apply Newton-Schulz orthogonalization
                    if p.ndim > 2:
                        original_shape = buf.shape
                        buf_2d = buf.view(buf.shape[0], -1)
                        orthogonal_update = newton_schulz(buf_2d, ns_steps, eps)
                        orthogonal_update = orthogonal_update.view(original_shape)
                    else:
                        orthogonal_update = newton_schulz(buf, ns_steps, eps)

                    # RMS matching factor: 0.2 * sqrt(max(n, m)), using the 2D view of the parameter
                    n, m = p.shape[0], p.numel() // p.shape[0]
                    rms_factor = 0.2 * math.sqrt(max(n, m))
                    orthogonal_update = orthogonal_update * rms_factor

                    # Update: Wt = Wt−1 − η(Ot + λWt−1)
                    p.add_(orthogonal_update + weight_decay * p, alpha=-lr)
                else:
                    # 1D parameters - standard momentum
                    p.add_(buf + weight_decay * p, alpha=-lr)

        # Apply QK-Clip
        self._apply_qk_clip()

        return loss

On Lines 41-94, we implement the core optimization step integrating Muon updates with QK-Clip. The step begins with standard closure handling and parameter group iteration. Lines 41-68 implement momentum accumulation ($M_t = \mu M_{t-1} + G_t$) using in-place operations for memory efficiency. The critical branching occurs at Line 70: parameters with 2+ dimensions receive Muon treatment.
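
The momentum rule is easy to trace by hand for a single scalar parameter. The sketch below applies it three times with a constant gradient of 1 and the default momentum of 0.95. The helper name is illustrative.

```python
def accumulate_momentum(buffer: float, grad: float, mu: float = 0.95) -> float:
    """One step of M_t = mu * M_{t-1} + G_t for a single scalar parameter."""
    return mu * buffer + grad


buffer = 0.0
history = []
for grad in [1.0, 1.0, 1.0]:
    buffer = accumulate_momentum(buffer, grad)
    history.append(buffer)

print(history)
```

The buffer grows toward 1 / (1 - 0.95) = 20 under a constant gradient, which is why the orthogonalized update is rescaled rather than applied at its raw magnitude.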

On Lines 72-83, we apply the Muon update for matrix parameters. Newton-Schulz orthogonalization produces an approximately orthogonal version of the momentum buffer, which we then scale by an RMS-matching factor (the MuonClip algorithm uses $0.2 \cdot \sqrt{\max(n, m)}$) to match AdamW’s RMS characteristics. This scaling ensures Muon’s updates have similar magnitudes to AdamW, enabling easier hyperparameter transfer. Finally, Line 86 applies the update with weight decay: $W_t = W_{t-1} - \eta (O_t + \lambda W_{t-1})$. Line 89 applies standard momentum updates to 1D parameters such as biases and normalization layers.
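
The newton_schulz helper is not shown in the listings. To convey what orthogonalization does, the sketch below uses the classic cubic Newton-Schulz iteration, $X \leftarrow 1.5X - 0.5XX^\top X$, in plain Python. This is a swapped-in textbook variant, not the tuned iteration a Muon implementation would use, and the function and helper names are assumptions for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def newton_schulz_cubic(g: Matrix, steps: int = 15, eps: float = 1e-7) -> Matrix:
    """Approximate the orthogonal polar factor of g with X <- 1.5 X - 0.5 X X^T X."""
    norm = math.sqrt(sum(v * v for row in g for v in row)) + eps
    x = [[v / norm for v in row] for row in g]  # singular values now lie in (0, 1]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


momentum = [[2.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # toy 2x3 momentum buffer
update = newton_schulz_cubic(momentum)
gram = matmul(update, transpose(update))        # should be close to the 2x2 identity
print(gram)
```

Each iteration pushes every singular value of the normalized matrix toward 1, so the rows of the result end up nearly orthonormal while the directions of the original buffer are preserved.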

    def _apply_qk_clip(self):
        """Apply QK-Clip to attention layers to prevent logit explosion."""
        if not self.attention_layers:
            return

        tau = self.param_groups[0]['tau']

        for attention_layer in self.attention_layers:
            if not hasattr(attention_layer, 'max_logits'):
                continue

            max_logits = attention_layer.max_logits

            # Handle both scalar and per-head max logits
            if isinstance(max_logits, (int, float)):
                if max_logits == 0:
                    continue  # no forward pass has been recorded yet
                max_logits = [max_logits]
            elif len(max_logits) == 0:
                continue

            # Query weights first, then key weights, matching the function signature
            apply_qk_clip_per_head(
                attention_layer.q_decompress.weight.data,
                attention_layer.k_decompress.weight.data,
                max_logits,
                tau
            )

On Lines 96-122, we apply QK-Clip after all weight updates. The _apply_qk_clip() method iterates through all registered attention layers, extracts their max_logits attribute (populated during forward pass), and applies per-head clipping to the query and key decompression weights. This post-update clipping ensures weights don’t grow unboundedly across training steps while preserving gradient information within each step.

Complete Kimi-K2 Training Pipeline: Setup, Config, and Optimization

Finally, we bring everything together in a complete training configuration:

config = DeepSeekConfig()
config.multi_token_predict = 0
config.n_experts = 8
config.n_head = 4

training_args = TrainingArguments(
    output_dir="./kimik2_checkpoints",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    learning_rate=5e-4,
    warmup_steps=10,
    weight_decay=0.01,
    logging_dir="./kimik2_checkpoints/logs",
    logging_steps=50,
    save_steps=50,
    save_total_limit=3,
    eval_steps=50,
    eval_strategy="steps",
    save_strategy="steps",
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    gradient_accumulation_steps=4,
    fp16=True,
    dataloader_num_workers=2,
    remove_unused_columns=False,
    report_to="none",
    push_to_hub=False,
    save_safetensors=False,
)

On Lines 1-4, we configure the model architecture. Kimi-K2 does not use Multi-Token Prediction, so we set multi_token_predict=0, which also simplifies training. We use 8 experts and 4 attention heads to keep this educational implementation small; the production-scale Kimi-K2 and DeepSeek-V3 models use hundreds of experts and far more attention heads.

On Lines 6-30, we define training arguments following best practices for small-scale experiments. We use gradient accumulation (4 steps) to simulate larger batch sizes with limited GPU memory, enable mixed-precision training (fp16=True) for speed and memory efficiency, and configure regular evaluation and checkpointing every 50 steps. The learning rate of 5e-4 is conservative for stable training, with a brief 10-step warmup.
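As a quick sanity check on these settings, the number of samples behind each optimizer step can be worked out from the arguments above. The helper below is an illustrative sketch: the function name and the single-GPU default (num_devices=1) are assumptions, not part of the training script.

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Samples that contribute to a single optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


# Values taken from the TrainingArguments above; one GPU is assumed.
print(effective_batch_size(per_device_batch=8, grad_accum_steps=4))
```

With per_device_train_batch_size=8 and gradient_accumulation_steps=4, each optimizer step sees 32 samples per device.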

model = DeepSeek(config)

data_collator = DeepSeekDataCollator(tokenizer)

optimizer = MuonClip(model.parameters(), lr=5e-3)
optimizer.set_model(model)

# Create trainer
trainer = DeepSeekTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    optimizers=(optimizer, None)
)

print("✓ Trainer created. Starting training...")
print("=" * 80)

# Train!
trainer.train()

print("=" * 80)
print("✓ Training complete!")

# Save final model
trainer.save_model("./kimik2_final")
tokenizer.save_pretrained("./kimik2_final")
print("✓ Model saved to ./kimik2_final")

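On Lines 31-36, we initialize the model and create a MuonClip optimizer. Critically, Line 36 registers the model with the optimizer using set_model(), enabling QK-Clip to access attention layers. This registration must occur before training begins. Note that the optimizer is built with lr=5e-3. With standard Hugging Face Trainer behavior, a pre-built optimizer keeps its own learning rate, so the learning_rate=5e-4 in TrainingArguments applies only when the Trainer constructs the optimizer itself.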

On Lines 39-60, we instantiate the custom trainer with all components and launch training. The optimizers=(optimizer, None) argument provides our custom optimizer to Hugging Face Trainer, overriding its default optimizer creation. After training completes, we save both the model weights and tokenizer for later inference.

Summary

We began by detailing how to train Kimi-K2 from scratch using DeepSeek-V3 components, emphasizing the architectural differences that set Kimi-K2 apart. We explored the model’s scale and sparsity, showing that reducing the number of attention heads allowed us to balance efficiency and performance. A key part of this journey was the introduction of the MuonClip optimizer, which stabilizes training while pushing the limits of large-scale language modeling.

We then turned to the challenges of token efficiency and the attention logit explosion problem. To address these, we introduced the QK-Clip innovation, which helped us control runaway logits and improve overall stability. Alongside this, we refined our training data pipeline, focusing on token utility and knowledge data rephrasing to ensure that every token contributed meaningfully to the model’s learning process. These improvements allowed us to maximize the value of the data while keeping training efficient.

Finally, we described the implementation details, including enhanced multi-head latent attention with max logit tracking and the practical integration of the MuonClip optimizer. We concluded with a complete training setup, showing how all these innovations came together to make Kimi-K2 a robust, efficient, and scalable model. By combining architectural refinements, optimizer breakthroughs, and data improvements, this lesson demonstrated how these techniques push the boundaries of what’s possible in modern language model training.

Citation Information

Mangla, P. “Building and Training a Kimi-K2 Model Using DeepSeek-V3 Components,” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/d3tge

@incollection{Mangla_2026_building-training-kimi-k2-model-using-deepseek-v3,
  author = {Puneet Mangla},
  title = {{Building and Training a Kimi-K2 Model Using DeepSeek-V3 Components}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/d3tge},
}

Semantic Caching for LLMs: TTLs, Confidence, and Cache Safety

Vikram Singh — Mon, 04 May 2026 12:45:00 +0000

Table of Contents

Semantic Caching for LLMs: TTLs, Confidence, and Cache Safety
Why Semantic Caching for LLMs Requires Production Hardening
Cache TTL in Semantic Caching: Preventing Stale LLM Responses
MLOps Project Structure for Semantic Caching with FastAPI and Redis
How to Implement Cache TTL Validation in Python and Redis
Confidence Scoring in Semantic Caching: Beyond Similarity for LLMs
Implementing Confidence Scoring for LLM Cache Optimization (Code Walkthrough)
Query Normalization and Deduplication for Efficient Semantic Caching
Preventing Cache Poisoning in Semantic Caching for LLM Systems
End-to-End Semantic Cache Hardening: TTL, Confidence, and Safety Demos
Semantic Caching Limitations: Trade-Offs in LLM Optimization Systems
Summary

Semantic Caching for LLMs: TTLs, Confidence, and Cache Safety

In this lesson, you will learn how to harden a semantic cache for LLMs, one of the most important LLMOps patterns for reducing redundant inference costs, and move from a working semantic caching prototype to a system that can survive real-world usage with TTL validation, confidence scoring, deduplication, and cache poisoning prevention.

This lesson is the last in a 2-part series on Semantic Caching for LLMs:

Semantic Caching for LLMs: FastAPI, Redis, and Embeddings
Semantic Caching for LLMs: TTLs, Confidence, and Cache Safety (this tutorial)

To learn how to harden a semantic cache for LLMs and make it safe, reliable, and production-ready, just keep reading.

Why Semantic Caching for LLMs Requires Production Hardening

In Lesson 1, we built a semantic cache that works end-to-end. It correctly avoids redundant LLM calls, reuses responses for identical queries, and even handles paraphrased inputs via semantic similarity. For many tutorials, that would be the end of the story.

In real systems, however, working is only the starting point.

A semantic cache that works under ideal conditions can still fail in subtle and dangerous ways when exposed to real users, long-running processes, and evolving information. These failures do not usually appear as crashes or explicit errors. Instead, they show up as silent correctness issues, degraded user trust, and unpredictable behavior over time.

What Lesson 1 Solved — and What It Didn’t

Lesson 1 focused on the correctness of flow:

Requests move through exact match → semantic match → LLM fallback (generation)
Cached responses are reused when appropriate
The system is observable and debuggable
Nothing is hidden behind abstractions

What it intentionally did not address was long-term safety.

We did not ask:

How old is this cached response, and should we still trust it?
What happens if the LLM returns an error or partial output?
What if the cache slowly fills with duplicates?
What if similarity is high but the answer is no longer valid?

Those questions only matter once the system runs for days or weeks, not minutes.

Real-World Failure Modes in Semantic Caching

Semantic caching introduces failure modes that rarely exist in traditional exact-match caches.

For example:

A cached answer with very high similarity may still be stale
An error response may be accidentally cached and reused
Slight variations of the same query may create duplicate entries
Old but similar answers may appear correct while being subtly wrong

None of these issues breaks the system outright. Instead, they quietly degrade correctness and user trust over time.

These are the hardest bugs to detect because the system continues to respond quickly and confidently.

Why “It Works” Does Not Mean “It’s Safe”

A semantic cache sits directly in the decision path of an LLM system. When it makes a mistake, that mistake is amplified through reuse.

If an unsafe response enters the cache:

It can be served repeatedly
It can outlive the conditions that made it valid
It can be returned with high confidence

This is why semantic caching requires more discipline, not less, than direct LLM calls.

In this lesson, we will take the working system from Lesson 1 and begin hardening it. We will introduce explicit safeguards for staleness, confidence, duplication, and safety — without changing the core architecture.

The goal is not to make the system perfect, but to make its failures controlled, visible, and predictable.

That is the difference between a demo and a system you can trust.

Cache TTL in Semantic Caching: Preventing Stale LLM Responses

Once a semantic cache is deployed and begins reusing LLM responses, a new question immediately arises:

How long should a cached response be trusted?

Unlike traditional caches that store deterministic outputs, semantic caches store model-generated answers. These answers are only valid within a certain window of time and context. Without explicit controls, a semantic cache can continue serving responses that are technically valid but practically wrong.

This section explains why cached LLM responses become stale, how TTLs help, and what it means for a cache entry to be unsafe.

Why Cached LLM Responses Become Stale

LLM responses are not timeless.

They are influenced by:

evolving APIs and libraries
changing business logic or documentation
updated prompts or system behavior
newly introduced edge cases

A cached answer that was correct an hour ago may no longer reflect the current state of the world.

Semantic caching amplifies this risk because:

responses are reused aggressively
high similarity can mask outdated content
cached answers are returned with confidence

Without staleness controls, the cache slowly becomes a museum of old truths.

TTL as a Safety Mechanism

A time-to-live (TTL) specifies how long a cache entry remains valid.

Once the TTL expires:

the entry is treated as unsafe
it should no longer be reused
a fresh LLM response must be generated

TTL does not guarantee correctness, but it limits the blast radius of staleness.

In semantic caching, TTL is not an optimization. It is a correctness safeguard.

Application-Level TTL vs. Redis EXPIRE

There are 2 common ways to implement TTLs when using Redis:

Redis EXPIRE

Redis automatically deletes keys after a fixed duration
Expired entries are removed entirely
The application has no visibility into expired data

Application-Level TTL (Used Here)

Entries remain stored in Redis
Expiration is checked at read time by the application
The application decides whether an entry is safe to reuse

In this system, TTL is enforced at the application layer rather than using Redis TTL via the native EXPIRE command, a deliberate choice that prioritizes observability over automation.

This choice allows us to:

inspect expired entries during debugging
apply custom expiration logic
combine TTL with other safety signals (such as confidence)

We trade automatic deletion for control and observability.

When a Cache Entry Becomes Unsafe

In this system, a cache entry is considered unsafe when any of the following are true:

its TTL has expired
its content is malformed or erroneous
its confidence score falls below an acceptable threshold

TTL is the first and most basic of these checks.

If an entry fails the TTL check, semantic similarity is irrelevant.

Reusing it would prioritize speed over correctness.

Designing TTLs for LLM Workloads

There is no universal “correct” TTL for LLM responses.

Instead, TTLs should be chosen based on:

how fast the underlying information changes
how costly incorrect answers are
how frequently similar queries appear

Short TTLs:

reduce staleness risk
increase LLM calls

Long TTLs:

improve cache hit rate
increase risk of outdated responses

In Lesson 1, we used a conservative default TTL to keep behavior predictable. In this lesson, we will focus on how TTLs are enforced rather than on tuning them for a specific domain.

TTL design is a policy decision. TTL enforcement is a correctness requirement.

MLOps Project Structure for Semantic Caching with FastAPI and Redis

Before diving into individual components, let’s take a moment to understand how the project is organized.

A clear directory structure is especially important in LLM-backed systems, where responsibilities span API orchestration, caching, embeddings, model calls, and observability. In this project, each concern is isolated into its own module so the request flow remains easy to trace and reason about.

After downloading the source code from the “Downloads” section, your directory structure should look like this:

.
├── app
│   ├── api
│   │   ├── __init__.py
│   │   └── ask.py
│   ├── cache
│   │   ├── __init__.py
│   │   ├── poisoning.py
│   │   ├── schemas.py
│   │   ├── semantic_cache.py
│   │   └── ttl.py
│   ├── config
│   │   ├── __init__.py
│   │   └── settings.py
│   ├── embeddings
│   │   ├── __init__.py
│   │   └── embedder.py
│   ├── llm
│   │   ├── __init__.py
│   │   └── ollama_client.py
│   ├── main.py
│   └── observability
│       └── metrics.py
├── complete-codebase.txt
├── docker-compose.yml
├── Dockerfile
├── README.md
└── requirements.txt

Let’s break this down at a high level.

The app/ Package

The app/ directory contains all runtime application code. Nothing outside this folder is imported at runtime.

This keeps the service self-contained and makes it easy to reason about deployment and dependencies.

app/main.py: Application Entry Point

This file defines the FastAPI application and registers all routers.

It contains no business logic — only service wiring. Every request to the system enters through this file.

app/api/: API Layer

The api/ package defines HTTP-facing endpoints.

ask.py: Implements the /ask endpoint and acts as the orchestration layer for the entire semantic caching pipeline.

The API layer is responsible for:

validating input
enforcing cache ordering
coordinating cache, embeddings, and LLM calls
returning structured debug information

It does not implement caching or similarity logic directly.

app/cache/: Caching Logic

This package contains all cache-related functionality.

semantic_cache.py: Core semantic cache implementation (exact match, semantic match, Redis storage, similarity search).
schemas.py: Defines the cache entry schema used for Redis storage.
ttl.py: Application-level TTL configuration and expiration checks.
poisoning.py: Safety checks to prevent invalid or error responses from being reused.

By isolating caching logic here, the API layer stays clean and reusable.

app/embeddings/: Embedding Generation

embedder.py: Handles embedding generation via Ollama’s embedding endpoint.

This module has a single responsibility: converting text into semantic vectors.

It does not cache, rank, or validate embeddings.

app/llm/: LLM Client

ollama_client.py: Wraps calls to the Ollama text-generation endpoint.

Isolating LLM interaction allows the rest of the system to remain model-agnostic.

app/observability/: Metrics

metrics.py: Implements simple in-memory counters for cache hits, misses, and LLM calls.

These metrics are intentionally lightweight and meant for learning and debugging, not production monitoring.

Configuration and Infrastructure

Configuration lives inside the app/ package, while the infrastructure files sit at the project root:

app/config/settings.py: Centralizes environment-based configuration (Redis host, TTLs, model names).
Dockerfile and docker-compose.yml: Define a reproducible runtime environment for the API and Redis.
requirements.txt: Lists all Python dependencies required to run the service.

How to Implement Cache TTL Validation in Python and Redis

In the previous section, we discussed why cached LLM responses become stale and why TTLs are necessary. In this section, we move from concept to code and look at how TTL validation is enforced in practice.

The key idea is simple but important:

Cache entries are not deleted automatically. They are validated at read time.

This design choice keeps cache behavior explicit, observable, and safe.

The Default TTL Configuration

TTL configuration is centralized in a single helper function:

File: app/cache/ttl.py

def default_ttl():
    return settings.CACHE_TTL_SECONDS

Rather than hardcoding a value, the TTL is loaded from configuration. This allows different environments to use different TTLs without changing the code.

At this stage, the specific TTL value is not important. What matters is that:

every cache entry receives a TTL at creation time
TTL is treated as metadata, not as a Redis feature
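To make the configuration path concrete, here is a minimal sketch of how a settings object could supply the TTL from the environment. The Settings class, the optional env argument, and the 3600-second fallback are assumptions for illustration; the actual app/config/settings.py may be organized differently.

```python
import os


class Settings:
    """Hypothetical settings object; the real settings module may differ."""

    def __init__(self, env=None):
        env = os.environ if env is None else env
        # Assumed fallback of one hour when the variable is not set.
        self.CACHE_TTL_SECONDS = int(env.get("CACHE_TTL_SECONDS", "3600"))


settings = Settings()


def default_ttl() -> int:
    # Same shape as the helper above: the TTL comes from configuration.
    return settings.CACHE_TTL_SECONDS


print(default_ttl())
```

Because the value is read from the environment, a staging deployment could run with a short TTL while production uses a longer one, with no code change.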

Checking Whether an Entry Has Expired

TTL enforcement happens through a dedicated validation function:

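import time

def is_expired(entry):
    try:
        # Redis returns hash field values as strings, so coerce before comparing
        created_at = int(entry["created_at"])
        ttl = int(entry["ttl"])
        now = int(time.time())
        return now > (created_at + ttl)
    except (KeyError, ValueError, TypeError):
        # Missing or malformed metadata: fail safe and treat the entry as expired
        return True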

This function answers 1 question:

Is this cache entry still safe to reuse?

If the current time exceeds created_at + ttl, the entry is considered expired and must not be reused.

Fail-Safe Expiration Behavior

Notice the exception handling at the end of is_expired().

If the entry:

is missing required fields
contains malformed values
cannot be parsed safely

…it is treated as expired by default.

This is a deliberate fail-safe design.

When dealing with cached LLM responses, silently trusting malformed data is more dangerous than recomputing a response. If the system is unsure, it expires the entry and falls back to the LLM.

Correctness always wins over reuse.
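The fail-safe behavior is easiest to see by running the check against a few entries. The sketch below repeats is_expired() so it runs on its own; the sample entries are made up for illustration and use string values, as a Redis hash would return them.

```python
import time


def is_expired(entry):
    try:
        created_at = int(entry["created_at"])
        ttl = int(entry["ttl"])
        return int(time.time()) > (created_at + ttl)
    except (KeyError, ValueError, TypeError):
        return True


now = int(time.time())

# Illustrative entries (not taken from the project).
fresh = {"created_at": str(now), "ttl": "3600"}
stale = {"created_at": str(now - 7200), "ttl": "3600"}
missing_ttl = {"created_at": str(now)}                 # triggers KeyError
garbled = {"created_at": "yesterday", "ttl": "3600"}   # triggers ValueError

for name, entry in [("fresh", fresh), ("stale", stale),
                    ("missing_ttl", missing_ttl), ("garbled", garbled)]:
    print(name, is_expired(entry))
```

Only the fresh entry is reported as still valid; the stale entry and both malformed entries are treated as expired.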

Figure 1: Application-level TTL validation for semantic cache entries. Cached responses are reused only within their TTL window and are rejected at read time once expired (source: image by the author).

Best-Effort Cleanup During Cache Reads

TTL validation does more than reject expired entries — it also performs opportunistic cleanup during cache searches.

Inside the semantic cache search logic:

expired entries are detected
expired keys are removed from Redis
the cache continues scanning remaining entries

This cleanup happens:

without background workers
without scheduled jobs
without blocking the request

This is not a full garbage collector. It is a best-effort hygiene mechanism that keeps the cache from accumulating junk over time.
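The scan-and-clean pattern described above can be sketched as follows. This is a simplified illustration, not the project code: a plain dict stands in for Redis, and scan_live_entries() and the injectable now argument are hypothetical names added so the example is repeatable.

```python
import time


def is_expired(entry, now=None):
    """Same fail-safe check as before, with an injectable clock for testing."""
    try:
        now = int(time.time()) if now is None else int(now)
        return now > int(entry["created_at"]) + int(entry["ttl"])
    except (KeyError, ValueError, TypeError):
        return True


def scan_live_entries(store, now=None):
    """Return non-expired entries, deleting expired ones along the way."""
    live = []
    for key in list(store):  # snapshot the keys because we delete while scanning
        entry = store[key]
        if is_expired(entry, now):
            del store[key]   # best-effort cleanup, no background worker needed
            continue
        live.append((key, entry))
    return live


# A dict stands in for Redis in this sketch.
store = {
    "cache:1": {"created_at": "1000", "ttl": "60"},    # expired at t=1060
    "cache:2": {"created_at": "1000", "ttl": "600"},   # valid until t=1600
    "cache:3": {"ttl": "60"},                          # malformed, treated as expired
}
live = scan_live_entries(store, now=1100)
print([key for key, _ in live])
print(sorted(store))
```

After the scan, only cache:2 remains both in the result and in the store. Entries that are never scanned are never cleaned, which is why this is hygiene rather than garbage collection.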

Why We Validate on Read, Not Delete on Write

At this point, a natural question arises:

Why not just use Redis EXPIRE and let Redis delete entries automatically?

There are 3 reasons this system validates TTLs on read instead:

Visibility: Expired entries remain inspectable during debugging.
Control: The application decides what “expired” means, not Redis.
Composability: TTL checks can be combined with confidence scoring, poisoning detection, and other safety signals.

By validating at read time, TTL becomes part of the decision-making pipeline rather than an invisible background mechanism.

Confidence Scoring in Semantic Caching: Beyond Similarity for LLMs

Up to this point, semantic caching decisions have relied heavily on semantic similarity. If a cached response is similar enough to a new query, it feels reasonable to reuse it.

In practice, this assumption breaks down.

High similarity answers an important question — “Is this response about the same thing?” — but it does not answer an equally important one:

“Is this response still safe to reuse right now?”

Confidence scoring exists to bridge that gap.

Why High Similarity Can Still Be Wrong

Semantic similarity measures closeness in meaning, not correctness over time.

Consider a cached response that:

has very high embedding similarity to the current query
was generated hours or days ago
refers to information that has since changed

From a vector perspective, the response still appears “correct.”

From a system perspective, it may no longer be trustworthy.

This problem is subtle because:

similarity scores remain high
responses look fluent and confident
failures are silent rather than catastrophic

Without an additional signal, the cache has no way to distinguish relevant but stale from relevant and safe.

Combining Semantic Similarity with Freshness

Confidence scoring introduces a second dimension: freshness.

Rather than deciding reuse based on similarity alone, the cache evaluates a combined signal that reflects:

how semantically close the response is
how recently the response was generated

At a high level, confidence answers the question:

“How comfortable are we reusing this response right now?”

Fresh responses with high similarity score high confidence.

Old responses, even with high similarity, gradually lose confidence as they age.

This ensures that time acts as a natural decay mechanism.

Figure 2: Confidence scoring combines semantic similarity with freshness. Even highly similar cached responses lose confidence over time and are eventually rejected (source: image by the author).

Understanding the Confidence Score (High-Level)

In this system, confidence is a weighted combination of:

semantic similarity
freshness relative to TTL

You do not need to think about exact formulas at this stage. What matters is the behavior:

Confidence starts high when an entry is created
Confidence decreases as the entry ages
Confidence is dominated by semantic similarity, which carries the larger weight
Expired entries contribute no freshness and are already rejected by the TTL check

Confidence is not a probability. It is a reuse heuristic designed to favor correctness over speed.

How Confidence Affects Cache Reuse Decisions

Confidence scoring acts as a gatekeeper in the cache pipeline.

Even if:

the entry is not expired
the semantic similarity is above threshold

…the cache will reject reuse if confidence falls below an acceptable level.

When this happens:

the cache treats the entry as unsafe
the request falls back to the LLM
a fresh response is generated and stored

This behavior ensures that the cache degrades gracefully.

As uncertainty increases, the system automatically shifts work back to the LLM rather than returning questionable results.

Why Confidence Belongs in the Cache (Not the LLM)

It’s tempting to push this logic downstream and let the LLM “fix” stale responses.

That approach fails for 2 reasons:

the LLM has no context about cache age
the LLM cannot distinguish reused content from fresh inference

Confidence must be enforced before reuse, not after generation.

By embedding confidence checks directly into the cache, we ensure that reuse decisions are explicit, explainable, and controllable.

Implementing Confidence Scoring for LLM Cache Optimization (Code Walkthrough)

In the previous section, we introduced confidence scoring as a conceptual safeguard: a way to prevent semantically similar but stale responses from being reused.

In this section, we make that idea concrete by implementing it.

We will walk through where confidence is computed, where it is enforced, and what happens when a cached entry is rejected.

Where Confidence Is Computed

Confidence is computed inside the semantic cache, alongside similarity scoring.

def compute_confidence(similarity: float, created_at: int, ttl: int) -> float:
    age = time.time() - created_at

    if ttl <= 0:
        freshness = 1.0
    else:
        freshness = max(0.0, 1.0 - (age / ttl))

    confidence = (0.7 * similarity) + (0.3 * freshness)
    return round(confidence, 3)

This function combines 2 signals:

Semantic similarity: how close the meanings are
Freshness: how recent the response is relative to its TTL

The exact weights are not important here. What matters is the behavior:

Fresh, similar responses score high confidence
Old responses lose confidence over time
Expired entries lose their freshness term entirely, leaving at most 0.7 × similarity

Confidence is therefore bounded, decaying, and explicitly defined.
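A few worked numbers make the decay visible. The sketch below uses the same 0.7/0.3 weighting as compute_confidence(), but adds a now parameter (an assumption, introduced only so the results are repeatable) in place of reading the clock.

```python
def compute_confidence(similarity: float, created_at: int, ttl: int,
                       now: float) -> float:
    # Same formula as above; `now` is injected instead of calling time.time().
    age = now - created_at
    if ttl <= 0:
        freshness = 1.0
    else:
        freshness = max(0.0, 1.0 - (age / ttl))
    return round((0.7 * similarity) + (0.3 * freshness), 3)


TTL = 3600
brand_new = compute_confidence(0.95, created_at=0, ttl=TTL, now=0)
half_life = compute_confidence(0.95, created_at=0, ttl=TTL, now=TTL / 2)
at_expiry = compute_confidence(0.95, created_at=0, ttl=TTL, now=TTL)
weak_match = compute_confidence(0.50, created_at=0, ttl=TTL, now=0)

print(brand_new, half_life, at_expiry, weak_match)
```

With a similarity of 0.95, confidence falls from 0.965 when the entry is new to 0.815 at half its TTL and 0.665 at expiry. A brand-new entry with a similarity of only 0.5 scores 0.65, so a fresh entry can score above its raw similarity, but a weak match still cannot reach 0.7.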

Why Confidence Is Computed in the Cache

Notice that confidence is computed inside the cache layer, not in the API.

This ensures:

all reuse decisions are centralized
confidence logic is applied consistently
the API remains an orchestration layer, not a policy engine

The API does not need to understand how confidence is computed — only whether it is acceptable.

Where Confidence Is Enforced

Confidence enforcement happens in the request pipeline in ask.py.

elif cached.get("confidence", 0.0) < 0.7:
    miss_reason = "low_confidence"

This check occurs after:

exact or semantic matching
TTL validation
poisoning checks

And before a cached response is returned.

If confidence is below the threshold:

the cache entry is rejected
the request is treated as a cache miss
the pipeline falls back to the LLM

This ensures that reuse happens only when confidence meets an acceptable threshold.
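The ordering of these checks can be sketched as a single decision function. This is a hypothetical condensation of the pipeline, not the code in ask.py: the function name, the boolean inputs, and every reason string other than "low_confidence" are assumptions.

```python
from typing import Optional

CONFIDENCE_THRESHOLD = 0.7  # same literal as in the snippet above


def miss_reason_for(cached: Optional[dict],
                    expired: bool,
                    poisoned: bool) -> Optional[str]:
    """Return why a cached entry is rejected, or None if it may be reused."""
    if cached is None:
        return "no_match"          # hypothetical label
    elif expired:
        return "expired"           # hypothetical label
    elif poisoned:
        return "poisoned"          # hypothetical label
    elif cached.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return "low_confidence"    # label used by the original code
    return None


print(miss_reason_for({"confidence": 0.82}, expired=False, poisoned=False))
print(miss_reason_for({"confidence": 0.55}, expired=False, poisoned=False))
print(miss_reason_for({"confidence": 0.99}, expired=True, poisoned=False))
```

Because the TTL check runs first, an expired entry is rejected even when its confidence value looks acceptable. An entry with no confidence field defaults to 0.0 and is rejected as well.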

Why Rejection Is Safer Than Reuse

When confidence is low, the system has 2 choices:

reuse a response it does not fully trust
generate a fresh response

This implementation always chooses the second option.

The cost of an extra LLM call is predictable.

The cost of serving an incorrect response is not.

By rejecting low-confidence entries, the cache degrades gracefully instead of failing silently.

What Happens After Rejection

Once a cached entry is rejected:

the request proceeds to the LLM
a new response is generated
the new response is stored with a fresh timestamp and TTL

Over time, this naturally refreshes the cache without requiring explicit invalidation logic.

Making Rejections Observable

Confidence-based rejections are not hidden.

They are surfaced via:

miss_reason = "low_confidence"
debug metadata returned to the client
cache miss metrics

This makes it possible to understand why the cache did not reuse a response — a critical property when tuning thresholds later.

Query Normalization and Deduplication for Efficient Semantic Caching

At this point, our semantic cache is safe against stale and low-confidence responses. However, there is another failure mode that appears once the system runs for longer periods of time:

The cache slowly fills with duplicate entries representing the same query.

This problem does not break correctness, but it can silently degrade cache quality and efficiency.

Why Duplicate Cache Entries Are a Problem

In natural language systems, users rarely type queries the same way twice.

Consider the following inputs, which differ only in letter case and spacing:

What is semantic caching?
what is semantic caching?
"  What   is semantic caching?  " (extra spaces)

From a human perspective, these queries are identical.

From a naïve cache’s perspective, they are completely different strings.

If we store each variation separately:

cache size grows unnecessarily
similarity scans become slower
cache hit rate decreases
identical LLM work is repeated

This is not a semantic problem — it is a normalization problem.

Normalizing Queries Before Caching

To prevent this, the cache normalizes queries before storing them.

def _hash_query(query: str) -> str:
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

This function performs 3 important steps:

Lowercasing: Ensures case-insensitive matching
Whitespace normalization: Collapses extra spaces and removes leading/trailing whitespace
Hashing: Produces a fixed-length identifier for fast comparison

The result is a stable identifier that depends on the query’s words, not on its casing or spacing.

Deduplication at Store Time

Deduplication happens when a new cache entry is about to be written.

query_hash = self._hash_query(query)

for key in self.r.smembers(f"{self.namespace}:keys"):
    data = self.r.hgetall(key)
    if data and data.get("query_hash") == query_hash:
        return

Before storing a new entry, the cache checks whether an entry with the same normalized hash already exists in the cache.

If it does:

the new entry is not stored
the cache avoids creating a duplicate
storage space and future scans are preserved

This approach ensures that identical queries map to a single cache entry, regardless of how they were formatted.
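The store-time check can be sketched without Redis. In the example below, a dict replaces the Redis key set and hashes; the InMemoryDedupCache class, its key format, and the boolean return value are assumptions made for illustration.

```python
import hashlib


class InMemoryDedupCache:
    """Toy stand-in for the Redis-backed cache, showing only deduplication."""

    def __init__(self):
        self.entries = {}  # key -> entry dict

    @staticmethod
    def _hash_query(query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def store(self, query: str, response: str) -> bool:
        """Store the entry unless its normalized hash already exists."""
        query_hash = self._hash_query(query)
        # Linear scan over existing entries, mirroring the loop above.
        for data in self.entries.values():
            if data.get("query_hash") == query_hash:
                return False  # duplicate: skip the write
        key = f"entry:{len(self.entries)}"
        self.entries[key] = {"query_hash": query_hash, "response": response}
        return True


cache = InMemoryDedupCache()
first = cache.store("What is semantic caching?", "answer A")
second = cache.store("  what is   semantic caching?", "answer B")
third = cache.store("How does TTL validation work?", "answer C")
print(first, second, third, len(cache.entries))
```

The second write is skipped, so the original response is kept. Note that the scan touches every stored entry on each write. That is fine for a small cache, but a larger deployment might keep a separate hash-to-key index to avoid it.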

Why Deduplication Happens in the Cache Layer

Deduplication is enforced inside the cache rather than in the API layer.

This design ensures:

all cache writes are normalized consistently
deduplication logic lives next to storage logic
API code remains simple and declarative

The API does not need to care how deduplication works — only that the cache remains clean.

Why Hash-Based Deduplication Works Well Here

Using a hash instead of raw strings provides several advantages:

fixed-length comparisons
efficient storage
no dependency on query length
practical collision resistance

For this system, SHA-256 is more than sufficient. The goal is stability and simplicity, not cryptographic security.

What Deduplication Does Not Solve

It’s important to understand the limits of this approach.

Hash-based deduplication:

prevents exact duplicates after normalization
does not merge semantically similar queries
does not replace semantic caching

In other words:

deduplication keeps the cache clean
semantic similarity keeps the cache useful

They solve different problems and complement each other.

Preventing Cache Poisoning in Semantic Caching for LLM Systems

So far, we’ve protected the semantic cache against staleness, low confidence, and duplicate entries. There is one more failure mode that can silently undermine the entire system if left unchecked:

Cache poisoning — storing responses that should never be reused.

Cache poisoning does not usually crash the system. Instead, it causes the cache to confidently serve bad answers repeatedly, amplifying a single failure into many incorrect responses.

What Cache Poisoning Looks Like in LLM Systems

In the context of LLM-backed systems, cache poisoning typically happens when:

the LLM returns an error message
the response is empty or incomplete
the output is malformed due to a timeout or partial generation

If these responses are cached, every future “hit” returns the same failure instantly — fast, but incorrect.

This is especially dangerous because:

the cache appears to be working
responses are returned quickly
the system looks healthy from the outside

Poisoning Prevention Strategy

Rather than trying to detect every possible bad response, this system uses a simple, conservative heuristic:

If a response looks unsafe, do not cache it.

This keeps the logic easy to reason about and avoids false positives.

Detecting Poisoned Entries

Poisoning detection is implemented in a dedicated helper function.

def is_poisoned(entry):
    resp = entry.get("response", "")
    if not resp or resp.startswith("[LLM Error]"):
        return True
    return False

This function flags an entry as poisoned if:

the response is empty, or
the response is an explicit LLM error

These conditions are intentionally strict. When in doubt, the entry is treated as unsafe.
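To see the rule in action, the sketch below runs the same check against a few made-up entries. The sample dictionaries are assumptions for illustration; the condition itself mirrors the helper above.

```python
def is_poisoned(entry: dict) -> bool:
    """Same rule as above: empty responses and explicit LLM errors are unsafe."""
    resp = entry.get("response", "")
    return not resp or resp.startswith("[LLM Error]")


# Hypothetical cache entries covering each branch of the check.
samples = {
    "valid": {"response": "Semantic caching reuses answers by meaning."},
    "empty": {"response": ""},
    "missing": {},
    "error": {"response": "[LLM Error] Simulated failure"},
}

results = {name: is_poisoned(entry) for name, entry in samples.items()}
print(results)
```

A whitespace-only response would pass this check as written. Stripping the response before testing it is one possible extension.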

Where Poisoning Is Enforced

In ask.py, poisoning checks are applied before any cached response is reused.

elif is_poisoned(cached):
    miss_reason = "poisoned"

If a cached entry is poisoned:

it is rejected immediately
the request is treated as a cache miss
the pipeline falls back to the LLM

This ensures that invalid responses are never reused, even if they have high similarity or appear fresh.
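The snippet above shows only one branch of the read path. A minimal sketch of how such a chain might look follows. The function name miss_reason_for, the branch ordering, the confidence field, and the "low_confidence" label are assumptions; only the "no_match" and "poisoned" labels appear in this lesson.

```python
from typing import Optional


def is_poisoned(entry: dict) -> bool:
    resp = entry.get("response", "")
    return not resp or resp.startswith("[LLM Error]")


def miss_reason_for(cached: Optional[dict], min_confidence: float) -> Optional[str]:
    """Return None when the entry may be reused, otherwise a miss reason."""
    if cached is None:
        return "no_match"
    elif is_poisoned(cached):
        return "poisoned"
    elif cached.get("confidence", 0.0) < min_confidence:
        return "low_confidence"  # hypothetical label, not taken from the source
    return None


poisoned_entry = {"response": "[LLM Error] timeout", "confidence": 0.95}
clean_entry = {"response": "Semantic caching reuses answers by meaning.", "confidence": 0.81}

print(miss_reason_for(None, 0.7))
print(miss_reason_for(poisoned_entry, 0.7))
print(miss_reason_for(clean_entry, 0.7))
```

Note that the poisoned entry is rejected even though its confidence is high, which is the behavior described above.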

Why Poisoned Entries Are Rejected, Not Repaired

The cache does not attempt to “fix” poisoned entries.

Trying to repair cached LLM output introduces:

ambiguity
hidden transformations
unpredictable behavior

Instead, the system takes the safest possible action:

reject the entry
generate a fresh response
overwrite with a clean result

This keeps the cache behavior explicit and predictable.

Making Poisoning Visible

Just like low-confidence rejections, poisoning is not silent.

The reason is surfaced via:

miss_reason = "poisoned"
debug metadata returned to the client
cache miss metrics

This makes it possible to distinguish between:

semantic misses
safety rejections
forced fallbacks

Visibility is a critical part of safety.

What This Approach Does Not Cover

This poisoning strategy is intentionally simple.

It does not attempt to:

analyze response quality
validate structured output
detect hallucinations
score semantic correctness

Those checks are domain-specific and belong outside the cache.

The cache’s responsibility is narrow:

Do not reuse responses that are obviously unsafe.

End-to-End Semantic Cache Hardening: TTL, Confidence, and Safety Demos

In Lesson 1, we verified that semantic caching works.

In this lesson, we harden that system by watching each safety mechanism activate in practice.

The goal of these demos is not performance testing.

The goal is behavioral verification.

Each demo isolates one hardening feature and makes its effect visible through the response payload.

Demo Case 1: TTL Expiration Forces a Cache Miss

Start by sending a query and populating the cache:

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain semantic caching for LLMs"}'

This first request falls back to the LLM and stores a new cache entry.

After waiting longer than the configured TTL, send the same request again:

sleep 61
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain semantic caching for LLMs"}'

Expected Behavior

Exact-match lookup finds an entry
TTL validation fails
Entry is rejected
LLM is called again

Example response

{
  "from_cache": false,
  "debug": {
    "hit": false,
    "miss_reason": "no_match"
  }
}

This confirms that stale responses are not reused.
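The TTL validation step can be sketched as a simple age comparison. The function name is_expired, the timestamp values, and the 60-second TTL are assumptions (the demo waits 61 seconds, but the configured TTL is not shown here).

```python
import time
from typing import Optional


def is_expired(created_at: float, ttl_seconds: float, now: Optional[float] = None) -> bool:
    """Return True once the entry is older than its TTL."""
    now = time.time() if now is None else now
    return (now - created_at) > ttl_seconds


stored_at = 1_000.0  # hypothetical write timestamp, in seconds

still_fresh = is_expired(stored_at, ttl_seconds=60, now=stored_at + 30)
past_ttl = is_expired(stored_at, ttl_seconds=60, now=stored_at + 61)
print(still_fresh, past_ttl)
```

Passing now explicitly keeps the check deterministic, which makes it easy to test without sleeping.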

Demo Case 2: Semantic Reuse When Confidence Remains High

Now consider a cached response that is still within TTL and retains sufficient confidence.

Send a semantically similar query:

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "How does semantic caching reduce LLM calls?"}'

Expected Behavior

Semantic similarity match found
Confidence computed
Confidence above threshold
Cached response reused

Example response

{
  "from_cache": true,
  "debug": {
    "hit": true,
    "cache_path": "semantic_match",
    "confidence": 0.81
  }
}

This demonstrates that semantic reuse is allowed when both relevance and freshness remain acceptable.

Demo Case 3: Failed LLM Responses Are Never Cached

A safe semantic cache must ensure that failed LLM responses are never reused. This demo shows cache poisoning prevention at write time.

The rule is enforced by a guard around the store call:

if not response.startswith("[LLM Error]"):
    cache.store(...)

Only valid responses are ever written to Redis.

How We Demonstrate This

We do not shut down Ollama or the embedding service.

Network failures abort the request before caching logic runs and are not suitable demos.

Instead, we simulate an LLM failure.

Step 1: Temporarily Simulate an LLM Error

In generate_llm_response():

if "simulate error" in prompt.lower():
    return "[LLM Error] Simulated failure"

Step 2: Send a Query

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "Simulate error in semantic caching"}'

Expected Behavior

from_cache = false
Cache miss
Error response returned

Step 3: Send the Same Query Again

Expected Result

Cache miss again
LLM called again
No cached response reused

Why the Miss Reason Is no_match

Failed responses are never stored
No cache entry exists to reject or evaluate
Cache poisoning checks apply only to existing entries

This is intentional and correct.
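The whole demo can also be reproduced in memory. The sketch below is a simplified simulation, not the project's code: fake_llm, ask, and the dictionary-backed cache are hypothetical stand-ins, and the trigger phrase is chosen to match the demo query.

```python
def fake_llm(prompt: str) -> str:
    """Stand-in for the LLM call; fails on the demo trigger phrase."""
    if "simulate error" in prompt.lower():
        return "[LLM Error] Simulated failure"
    return f"Answer to: {prompt}"


cache: dict[str, str] = {}   # normalized query -> response
stats = {"llm_calls": 0}


def ask(query: str) -> dict:
    """Minimal request path: cache lookup, LLM fallback, guarded write."""
    key = " ".join(query.lower().split())
    if key in cache:
        return {"from_cache": True, "response": cache[key]}
    stats["llm_calls"] += 1
    response = fake_llm(query)
    if not response.startswith("[LLM Error]"):
        cache[key] = response  # only valid responses are ever written
    return {"from_cache": False, "response": response}


first = ask("Simulate error in semantic caching")
second = ask("Simulate error in semantic caching")
print(first["from_cache"], second["from_cache"], stats["llm_calls"], len(cache))
```

Both requests miss, the stand-in LLM is called twice, and the cache stays empty, which matches the expected results listed above.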

Demo Case 4: Deduplication Under Query Variations

Send a query with unusual spacing:

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "   What   is   semantic   caching?   "}'

Then send the normalized version:

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "What is semantic caching?"}'

Expected Behavior

Both queries map to the same normalized hash
Only one cache entry exists
Exact-match reuse occurs

Example response

{
  "from_cache": true,
  "debug": {
    "hit": true,
    "cache_path": "exact_match"
  }
}

This confirms deduplication is working correctly.

Demo Case 5: Observing Metrics After Hardening

After running several demos, inspect the metrics endpoint:

curl http://localhost:8000/internal/metrics

Example response

{
  "hits": 3,
  "misses": 4,
  "llm_calls": 4,
  "_note": "In-memory metrics. Reset on restart. Not production-ready."
}

Metrics help you verify that:

safety rejections increase misses
LLM calls rise when reuse is unsafe
the system degrades gracefully
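Counters like these take very little code. The following is a sketch of an in-memory metrics object, assuming a hypothetical Metrics class; the hit_rate helper is an addition for illustration and is not part of the endpoint output shown above.

```python
from dataclasses import dataclass, asdict


@dataclass
class Metrics:
    """In-memory counters; values reset whenever the process restarts."""
    hits: int = 0
    misses: int = 0
    llm_calls: int = 0

    def record_hit(self) -> None:
        self.hits += 1

    def record_miss(self) -> None:
        # In this sketch, every miss falls through to the LLM.
        self.misses += 1
        self.llm_calls += 1

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


metrics = Metrics()
for _ in range(3):
    metrics.record_hit()
for _ in range(4):
    metrics.record_miss()

print(asdict(metrics))
print(round(metrics.hit_rate(), 3))
```

Because the counters live in process memory, each worker process would keep its own copy, which is one reason the endpoint flags them as not production-ready.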

What These Demos Prove

Across these scenarios, we verified that:

Stale entries are rejected
Low-confidence reuse is prevented
Poisoned responses are never cached
Duplicate entries are avoided
Cache behavior is observable and explainable

The cache no longer optimizes for speed alone.

It optimizes for safe reuse.

Semantic Caching Limitations: Trade-Offs in LLM Optimization Systems

By this point, we’ve built a semantic cache that is not only functional, but also hardened against common failure modes: staleness, low confidence, poisoning, duplication, and silent reuse.

However, no system design is complete without clearly stating what it does not attempt to solve.

This section makes those boundaries explicit.

Why This Cache Still Uses O(N) Scans

All semantic lookups in this implementation perform a linear scan over cached entries.

That means:

every semantic search compares the query embedding against all stored embeddings
time complexity grows linearly with cache size

This is not an oversight.

It is a deliberate design choice made for:

teaching clarity
transparency
small-to-medium cache sizes

By avoiding ANN indexes or vector databases, every decision remains visible and debuggable. You can trace exactly why a match was selected or rejected.

For educational systems and low-volume services, this trade-off is acceptable — and often desirable.
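To make the cost concrete, here is a minimal sketch of a linear scan in pure Python. The function names, the toy 3-dimensional vectors, and the 0.8 threshold are assumptions; the project itself uses NumPy for vector math and real embeddings.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec: list[float], entries: list[dict],
               threshold: float) -> Optional[tuple[dict, float]]:
    """Compare the query against every entry: O(N) in the number of entries."""
    best: Optional[tuple[dict, float]] = None
    for entry in entries:
        score = cosine_similarity(query_vec, entry["embedding"])
        if score >= threshold and (best is None or score > best[1]):
            best = (entry, score)
    return best


# Toy entries with made-up embeddings.
entries = [
    {"query": "what is semantic caching?", "embedding": [1.0, 0.0, 0.0]},
    {"query": "how do I reset my password?", "embedding": [0.0, 1.0, 0.0]},
]

match = best_match([0.9, 0.1, 0.0], entries, threshold=0.8)
print(match[0]["query"] if match else None)
```

Every entry is visited once per lookup, so doubling the cache size roughly doubles the scan time. That is the trade-off this section describes.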

What We Intentionally Did Not Implement

To keep the system focused and understandable, several production features were intentionally left out:

Approximate nearest neighbor (ANN) indexing
Redis Vector Search or RediSearch
Background garbage collection workers
Distributed locks for thundering herd prevention
Request coalescing or single-flight patterns
Multi-process or persistent metrics
Cache warming strategies

Each of these adds complexity that would obscure the core ideas being taught.

This cache is designed to explain semantic caching, not to compete with specialized retrieval infrastructure.

When This Design Is “Good Enough”

This architecture works well when:

cache size is modest (hundreds to low thousands of entries)
traffic is low to moderate
correctness and explainability matter more than raw throughput
you are experimenting with semantic reuse behavior
you want to understand cache dynamics before scaling

Typical examples include:

internal tools
developer-facing APIs
research prototypes
educational systems
early-stage LLM applications

In these contexts, the simplicity of the design is a strength, not a weakness.

When You Need a Vector Database or ANN Index

As usage grows, linear scans eventually become the bottleneck.

You should consider a dedicated vector search solution when:

cache size grows into tens or hundreds of thousands of entries
latency requirements become strict
multiple workers or services share the same cache
semantic search dominates request time

At that point, the following technologies become appropriate:

FAISS (Facebook AI Similarity Search)
Milvus
Pinecone
Redis Vector Search

Importantly, the hardening concepts from this lesson still apply. TTLs, confidence scoring, poisoning prevention, and observability remain relevant even when the storage backend changes.

The Core Trade-Off, Revisited

This lesson deliberately favors:

clarity over cleverness
explicit decisions over hidden automation
safety over aggressive reuse

That makes it an ideal foundation, not a final destination.

Summary

In this lesson, we took a working semantic cache and made it safe, bounded, and explainable.

Rather than focusing on improving cache hit rates at all costs, we introduced guardrails to ensure cached LLM responses are reused only when they are trustworthy.

We added application-level TTL validation to prevent stale responses from persisting indefinitely, combined semantic similarity with freshness through confidence scoring, and enforced explicit rejection paths for low-confidence and expired entries.

We also addressed subtle but dangerous failure modes that appear in real systems over time. Query normalization and deduplication prevent silent cache bloat, and poisoning checks ensure that error responses are never reused.

Observability signals make every cache decision inspectable rather than implicit. Together, these changes transform the cache from a performance optimization into a reliability component.

Finally, we made the system’s limitations explicit. This design favors clarity, correctness, and debuggability over raw scalability. It deliberately avoids ANN indexes, vector databases, and distributed coordination, making it suitable for small-to-medium systems and educational use cases.

As workloads grow, the same hardening principles apply even when the underlying storage or retrieval strategy changes.

With this lesson, semantic caching is no longer just fast. It is defensive, explainable, and production-aware.

Citation Information

Singh, V. “Semantic Caching for LLMs: TTLs, Confidence, and Cache Safety,” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/ahr3p

@incollection{Singh_2026_semantic-caching-llms-ttls-confidence-cache-safety,
  author = {Vikram Singh},
  title = {{Semantic Caching for LLMs: TTLs, Confidence, and Cache Safety}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/ahr3p},
}

Semantic Caching for LLMs: FastAPI, Redis, and Embeddings

Vikram Singh — Mon, 27 Apr 2026 12:45:00 +0000

Table of Contents

Semantic Caching for LLMs: FastAPI, Redis, and Embeddings
Introduction: Why Semantic Caching Matters for LLM Systems
How Semantic Caching Works for LLMs: Embeddings and Similarity Search Explained
Semantic Caching Architecture and Request Flow
Configuring Your Environment for Semantic Caching: FastAPI, Redis, and Ollama Setup
Project Structure
FastAPI Entry Point for Semantic Caching: Wiring the API Service
FastAPI Ask Endpoint: End-to-End Semantic Caching Request Flow
Embeddings: Turning Text into Semantic Vectors
The Semantic Cache: Cosine Similarity, Redis Storage, and Reusing Meaning
Cache Entries: What Exactly Gets Stored?
End-to-End Demo: Verifying Core Cache Behavior
Summary

In this lesson, you will learn how to build a semantic cache for LLM applications using FastAPI, Redis, and embedding-based similarity search, and how requests flow from exact matches to semantic matches before falling back to the LLM.

This lesson is the 1st in a 2-part series on Semantic Caching for LLMs:

Semantic Caching for LLMs: FastAPI, Redis, and Embeddings (this tutorial)
Semantic Caching for LLMs: TTLs, Confidence, and Cache Safety

To learn how to build a semantic cache for LLM applications using embeddings and Redis, just keep reading.

Introduction: Why Semantic Caching Matters for LLM Systems

Cost, Latency, and Redundant LLM Calls

Large language models are powerful, but they are not cheap. Every request to an LLM involves tokenization, inference, decoding, and network overhead. Even when models are hosted locally, response times are measured in hundreds of milliseconds or seconds rather than microseconds.

In real applications, this cost compounds quickly. Users often ask similar questions repeatedly, either across sessions or within the same workflow. Each request is treated as a fresh LLM invocation, even when the underlying intent has already been handled before.

This leads to 3 systemic problems:

High latency: Users wait for responses that could have been reused instantly
Increased cost: Identical reasoning is paid for multiple times
Wasted capacity: LLM throughput is consumed by redundant requests

These issues become especially visible under load, where repeated paraphrased queries can overwhelm an otherwise well-sized system.

Why Exact-Match Caching Breaks Down for Natural Language

Traditional caching assumes that identical inputs produce identical outputs. This works well for APIs, database queries, and deterministic functions. It fails for natural language.

From a string-matching perspective, the following queries are completely unrelated:

“What is semantic caching?”
“Can you explain how semantic caching works?”
“How does caching based on embeddings work for LLMs?”

A traditional cache keyed on raw strings will miss all three. As a result, the system calls the LLM three times, even though a human would expect the same answer.

This brittleness causes exact-match caches to have extremely low hit rates in LLM-backed systems. Worse, it gives a false sense of optimization. The cache exists, but it almost never helps in practice.

Where Semantic Caching Fits in Real Systems

Semantic caching addresses this mismatch by caching meaning instead of exact text.

Rather than asking “have I seen this string before?”, a semantic cache asks “have I answered something semantically similar before?”. It does this by converting queries into embeddings and comparing them using a similarity metric such as cosine similarity.

In a real system, semantic caching sits between the application layer and the LLM:

The application sends a query
The cache evaluates whether a prior response is reusable
Only true cache misses reach the LLM

When designed correctly, this layer is invisible to the user. Responses feel faster, costs drop, and the system scales more gracefully without changing the frontend or prompt logic.

This lesson focuses on building that layer explicitly and transparently, using FastAPI, Redis, and embeddings, without hiding the mechanics behind heavy abstractions.

Figure 1: Why semantic caching matters for LLM systems. Exact-match caching treats paraphrased queries as unique requests, resulting in repeated LLM calls. Semantic caching groups queries by meaning, reducing latency and redundant inference (source: image by the author).

Exact-match caching treats paraphrased queries as unique requests, resulting in repeated LLM calls. Semantic caching groups similar queries by meaning, allowing responses to be reused and reducing both latency and cost.

How Semantic Caching Works for LLMs: Embeddings and Similarity Search Explained

Section 1 explained why semantic caching exists.

This section explains how it works, conceptually, before we touch any FastAPI, Redis, or code.

The goal here is to give you a mental execution model that you can keep in your head while reading the implementation.

From Text to Meaning: Embeddings as the Cache Key

Semantic caching replaces raw text comparison with vector similarity.

Instead of caching responses under the literal query string, the system converts each query into an embedding: a high-dimensional numeric vector that captures semantic meaning. Queries that are worded differently but mean the same thing produce embeddings that are close together in vector space.

This is what allows the cache to recognize paraphrases as equivalent:

“How do I reset my password?”
“I forgot my password, what should I do?”
“Guide me through password recovery”

The exact strings differ, but their embeddings lie close together.

At a high level, semantic caching works by:

Generating an embedding for the incoming query
Comparing it against embeddings stored in the cache
Reusing a cached response if similarity is high enough

The similarity metric used in this lesson is cosine similarity, which measures the angle between two vectors rather than their raw magnitude.
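As a small illustration, cosine similarity divides the dot product of two vectors by the product of their lengths. The sketch below uses pure Python and made-up vectors (the project itself relies on NumPy); it shows that scaling a vector does not change the score.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|): depends on direction only, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in v]   # same direction, 10x the magnitude
orthogonal = [3.0, 0.0, -1.0]    # dot product with v is zero

same_direction = cosine_similarity(v, scaled)
unrelated = cosine_similarity(v, orthogonal)
print(round(same_direction, 6), round(unrelated, 6))
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are unrelated.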

Why a Layered Cache Beats Semantic-Only Caching

While semantic matching is powerful, it is also computationally expensive.

Embedding generation requires a model call. Similarity search requires vector math. Doing this for every request, even when the exact same query has already been seen, would be wasteful.

That is why this lesson uses a layered caching strategy.

Layer 1: Exact Match (Fast Path)

The query is normalized and hashed.

If the same query has already been answered, the response is returned immediately.

No embedding generation
No similarity computation
Minimal latency

This handles repeated identical queries efficiently.

Layer 2: Semantic Match (Flexible Path)

If no exact match exists, the query is embedded and compared against cached embeddings.

This layer catches:

paraphrases
minor wording differences
reordered phrases

Semantic matches trade compute cost for much higher cache hit rates.

Layer 3: LLM Fallback (Slow Path)

If neither exact nor semantic matches succeed, the request is forwarded to the LLM.

The response is then stored in the cache so future requests can reuse it.

This layered approach ensures:

the cheapest checks happen first
expensive operations are only used when necessary
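The three layers can be sketched as a single function that tries the cheapest check first. Everything here is a simplified assumption for illustration: layered_lookup, the stub callables, and the "llm_fallback" label are hypothetical, and the semantic layer is passed in as a callable so that the embedding and similarity details stay out of the way.

```python
import hashlib
from typing import Callable, Optional


def layered_lookup(
    query: str,
    exact_cache: dict[str, str],
    semantic_search: Callable[[str], Optional[str]],
    call_llm: Callable[[str], str],
) -> tuple[str, str]:
    """Try the cheapest layer first; return (response, path taken)."""
    normalized = " ".join(query.lower().split())
    key = hashlib.sha256(normalized.encode()).hexdigest()

    # Layer 1: exact match on the normalized hash (no embedding, no LLM).
    if key in exact_cache:
        return exact_cache[key], "exact_match"

    # Layer 2: semantic match (embedding and similarity happen in the callable).
    semantic_hit = semantic_search(query)
    if semantic_hit is not None:
        return semantic_hit, "semantic_match"

    # Layer 3: LLM fallback, then populate the cache for future requests.
    response = call_llm(query)
    exact_cache[key] = response
    return response, "llm_fallback"


def no_semantic_hit(query: str) -> Optional[str]:
    return None  # stub: the semantic layer always misses


def echo_llm(query: str) -> str:
    return f"LLM answer for: {query}"  # stub LLM


cache: dict[str, str] = {}
first = layered_lookup("What is semantic caching?", cache, no_semantic_hit, echo_llm)
second = layered_lookup("what is   semantic caching?", cache, no_semantic_hit, echo_llm)
print(first[1], second[1])
```

The first call falls through to the stub LLM, and the second call, which differs only in casing and spacing, is served from the exact-match layer.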

Confidence, Freshness, and Cache Safety

Semantic similarity alone is not enough to decide whether a cached response should be reused.

This lesson introduces the idea of confidence scoring, which combines:

Similarity: how close the embeddings are
Freshness: how old the cached entry is

A highly similar but stale response should not necessarily be trusted. Likewise, a fresh response with low similarity should be rejected.
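One way to combine the two signals is to scale similarity by a freshness factor. The formula below is an assumption chosen for illustration (a linear decay over the TTL), not necessarily the scoring used in this project; the function name and sample values are hypothetical as well.

```python
def confidence(similarity: float, age_seconds: float, ttl_seconds: float) -> float:
    """Scale similarity by a freshness factor that decays linearly from 1 to 0 over the TTL."""
    freshness = max(0.0, 1.0 - age_seconds / ttl_seconds)
    return similarity * freshness


fresh_and_similar = confidence(0.92, age_seconds=6, ttl_seconds=60)     # 0.92 * 0.9
stale_but_similar = confidence(0.92, age_seconds=54, ttl_seconds=60)    # 0.92 * 0.1
fresh_but_dissimilar = confidence(0.40, age_seconds=6, ttl_seconds=60)  # 0.40 * 0.9

for name, value in [
    ("fresh_and_similar", fresh_and_similar),
    ("stale_but_similar", stale_but_similar),
    ("fresh_but_dissimilar", fresh_but_dissimilar),
]:
    print(name, round(value, 3))
```

With a hypothetical reuse threshold of 0.7, only the first case would be served from the cache. The stale and the dissimilar entries would both be rejected.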

In addition, cached entries are validated to prevent:

expired responses
poisoned entries (errors, empty outputs)

These checks ensure the cache improves correctness and performance rather than degrading them.

Figure 2: Layered semantic caching request flow (source: image by the author).

Incoming queries first attempt an exact-match lookup, then fall back to semantic similarity search using embeddings, and finally call the LLM only on cache miss. This ordering minimizes latency and unnecessary model calls.

Note: In this lesson, we implement this flow using Redis as a simple embedding store with linear similarity scans, rather than a dedicated vector database.

Semantic Caching Architecture and Request Flow

In Section 2, you learned how semantic caching works conceptually.

In this section, we map that mental model to a real request flow in an LLM-backed service.

The goal is to answer one question clearly:

What happens, step by step, when a user sends a request to this system?

We will stay implementation-aware, but not code-specific yet. That comes next.

High-Level System Components

At a high level, the system consists of 5 logical components:

API layer: Receives user requests and orchestrates the caching pipeline.
Exact-match cache: Performs fast hash-based lookups for identical queries.
Embedding model: Converts text queries into semantic vectors when needed.
Semantic cache: Stores embeddings and responses and performs similarity matching.
LLM: Acts as the final fallback when no cache entry is suitable.

Each component has a narrowly defined responsibility. This separation is intentional and keeps the system easy to reason about and extend.

In this implementation:

The API layer is built using FastAPI and acts as the orchestration point.
Redis is used as the backing store for both exact-match and semantic cache layers.
Ollama provides both embedding generation and LLM inference locally.

These choices keep the system lightweight, self-contained, and easy to reason about while still reflecting real production patterns.

End-to-End Request Flow

When a user sends a query, the system processes it in the following order.

Step 1: Request enters the API

The API receives a text query along with optional flags, such as bypass_cache, which indicates whether to bypass the cache. Input validation happens immediately to prevent meaningless or malformed queries from entering the pipeline.

This ensures the cache is not polluted with empty or invalid entries.

Step 2: Exact-match cache lookup

The query is normalized and hashed.

The system checks whether an identical query has already been answered.

If an exact match exists and is valid, the response is returned immediately.
No embeddings are generated.
The LLM is not touched.

This is the fastest possible path through the system.

Step 3: Embedding generation

If the exact-match lookup fails, the query is passed to the embedding model.

The model converts the text into a numeric vector that captures semantic meaning. This vector becomes the key for semantic comparison.

This step is intentionally skipped when an exact match succeeds.

Step 4: Semantic cache lookup

The embedding is compared against cached embeddings using a similarity metric.

A cached response is reused only if:

similarity exceeds a defined threshold
the entry has not expired
the entry is not poisoned
the computed confidence is high enough

If a suitable match is found, the response is returned to the user without calling the LLM.

Step 5: LLM fallback and cache population

If both cache layers miss, the request is forwarded to the LLM.

Once a response is generated:

it is returned to the user
it is stored in the cache with metadata, timestamps, and TTL (Time To Live)

This ensures future requests can reuse the result.

Why This Architecture Works Well

This architecture is intentionally conservative and explicit.

Cheap operations happen first.
Expensive operations are deferred.
Every step is observable and debuggable.
No component hides complexity behind opaque abstractions.

Most importantly, the system degrades gracefully. Even when the cache provides no benefit, the request still succeeds via the LLM.

Figure 3: Architecture and request flow for a layered semantic caching system (source: image by the author).

User queries enter the API, attempt an exact-match lookup, fall back to semantic similarity search using embeddings, and call the LLM only when both cache layers miss. Successful LLM responses are stored for future reuse.

Configuring Your Environment for Semantic Caching: FastAPI, Redis, and Ollama Setup

To follow this guide, you need a small set of Python libraries and system services that support API orchestration, vector similarity, and LLM interaction. The goal is to keep the environment lightweight, reproducible, and easy to reason about.

At a minimum, you will need:

Python 3.10 or newer
Redis (used as the cache backing store)
An LLM + embedding provider (Ollama in this tutorial)

All required Python dependencies are pip-installable.

Installing Python Dependencies

Create and activate a virtual environment (recommended), then install the required packages:

$ pip install fastapi uvicorn redis httpx python-dotenv numpy

These libraries provide the following functionality:

fastapi: API layer and request orchestration
uvicorn: ASGI server for running the service
redis: Client for communicating with the cache store
httpx: HTTP client for embedding and LLM calls
numpy: Vector math for cosine similarity
python-dotenv: Environment-based configuration

Verifying Redis

This lesson assumes Redis is running locally on the default port.

You can verify Redis is available with:

$ redis-cli ping
PONG

If Redis is not installed, you can start it quickly using Docker (alternatively, use the docker-compose.yml provided in the code zip):

$ docker run -p 6379:6379 redis:7

Setting Up Ollama

This system uses Ollama for both embedding generation and LLM inference. Make sure Ollama is installed and running, and that the required models are available.

For example:

$ ollama pull nomic-embed-text
$ ollama pull llama3.2

Once running, Ollama exposes local HTTP endpoints that the application will call directly for embeddings and text generation.

Project Structure

Before diving into individual components, let’s take a moment to understand how the project is organized.

After downloading the source code from the “Downloads” section, your directory structure should look like this:

.
├── app
│   ├── api
│   │   ├── __init__.py
│   │   └── ask.py
│   ├── cache
│   │   ├── __init__.py
│   │   ├── poisoning.py
│   │   ├── schemas.py
│   │   ├── semantic_cache.py
│   │   └── ttl.py
│   ├── config
│   │   ├── __init__.py
│   │   └── settings.py
│   ├── embeddings
│   │   ├── __init__.py
│   │   └── embedder.py
│   ├── llm
│   │   ├── __init__.py
│   │   └── ollama_client.py
│   ├── main.py
│   └── observability
│       └── metrics.py
├── complete-codebase.txt
├── docker-compose.yml
├── Dockerfile
├── README.md
└── requirements.txt

Let’s break this down at a high level.

The app/ Package

The app/ directory contains all runtime application code. Nothing outside this folder is imported at execution time.

This keeps the service self-contained and makes it easy to reason about deployment and dependencies.

app/main.py: Application Entry Point

This file defines the FastAPI application and registers all routers.

It contains no business logic — only service wiring. Every request into the system enters through this file.

app/api/: API Layer

The api/ package defines HTTP-facing endpoints.

ask.py: Implements the /ask endpoint and acts as the orchestration layer for the entire semantic caching pipeline.

The API layer is responsible for:

input validation
enforcing cache ordering
coordinating cache, embeddings, and LLM calls
returning structured debug information

It does not implement caching or similarity logic directly.

app/cache/: Caching Logic

This package contains all cache-related functionality.

semantic_cache.py: Core semantic cache implementation (exact match, semantic match, Redis storage, similarity search).
schemas.py: Defines the cache entry schema used for Redis storage.
ttl.py: Application-level TTL configuration and expiration checks.
poisoning.py: Safety checks to prevent invalid or error responses from being reused.

By isolating caching logic here, the API layer stays clean and reusable.

app/embeddings/: Embedding Generation

embedder.py: Handles embedding generation via Ollama’s embedding endpoint.

This module has a single responsibility: convert text into semantic vectors.

It does not cache, rank, or validate embeddings.

app/llm/: LLM Client

ollama_client.py: Wraps calls to the Ollama text-generation endpoint.

Keeping LLM interaction isolated allows the rest of the system to remain model-agnostic.

app/observability/: Metrics

metrics.py: Implements simple in-memory counters for cache hits, misses, and LLM calls.

These metrics are intentionally lightweight and meant for learning and debugging, not production monitoring.
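To make the idea concrete, here is a minimal sketch of what such in-memory counters could look like. The method names cache_hit(), cache_miss(), and llm_call() match the calls used later in the /ask endpoint; the Metrics class, the snapshot() helper, and the lock are assumptions added for illustration and may differ from the actual metrics.py.

```python
from collections import Counter
from threading import Lock


class Metrics:
    """Tiny in-memory counter store; values reset whenever the process restarts."""

    def __init__(self) -> None:
        self._counts = Counter()
        self._lock = Lock()  # guards updates from concurrent request threads

    def _bump(self, name: str) -> None:
        with self._lock:
            self._counts[name] += 1

    def cache_hit(self) -> None:
        self._bump("cache_hits")

    def cache_miss(self) -> None:
        self._bump("cache_misses")

    def llm_call(self) -> None:
        self._bump("llm_calls")

    def snapshot(self) -> dict:
        """Return a copy so callers cannot mutate the live counters."""
        with self._lock:
            return dict(self._counts)


metrics = Metrics()
metrics.cache_miss()
metrics.llm_call()
metrics.cache_hit()
print(metrics.snapshot())
```

Because the counters live in process memory, they disappear on restart and are not shared across workers, which is consistent with treating them as a learning aid rather than a monitoring solution.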

Configuration and Infrastructure

Supporting the application code:

app/config/settings.py: Centralizes environment-based configuration (Redis host, TTLs, model names). As the directory tree above shows, it lives inside the app/ package.
Dockerfile and docker-compose.yml: Define a reproducible runtime environment for the API and Redis.
requirements.txt: Lists all Python dependencies required to run the service.
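Environment-based configuration can be sketched as follows. The setting names REDIS_HOST, REDIS_PORT, OLLAMA_HOST, OLLAMA_PORT, and EMBEDDING_MODEL appear in the code shown later in this lesson; the Settings dataclass, the load_settings() helper, and the default values are assumptions for illustration, not the contents of the real settings.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    REDIS_HOST: str
    REDIS_PORT: int
    OLLAMA_HOST: str
    OLLAMA_PORT: int
    EMBEDDING_MODEL: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, falling back to assumed defaults."""
    return Settings(
        REDIS_HOST=env.get("REDIS_HOST", "localhost"),
        REDIS_PORT=int(env.get("REDIS_PORT", "6379")),        # Redis default port
        OLLAMA_HOST=env.get("OLLAMA_HOST", "localhost"),
        OLLAMA_PORT=int(env.get("OLLAMA_PORT", "11434")),     # Ollama default port
        EMBEDDING_MODEL=env.get("EMBEDDING_MODEL", "nomic-embed-text"),
    )


# Simulate the variables a docker-compose file might inject.
settings = load_settings({"REDIS_HOST": "redis", "OLLAMA_HOST": "ollama"})
print(settings.REDIS_HOST, settings.REDIS_PORT)
```

Passing the environment in as a mapping keeps the function easy to test; in the running service it would simply read os.environ.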

FastAPI Entry Point for Semantic Caching: Wiring the API Service

Before we look at caching logic, embeddings, or Redis, it’s important to understand how the service itself is wired together. Every request to the semantic cache enters the system through a single FastAPI application, defined in app/main.py.

This file acts as the entry point of the service. Its responsibility is not to implement business logic, but to connect the application components and expose HTTP routes.

Application Entry Point (app/main.py)

from fastapi import FastAPI
from api.ask import router as ask_router

app = FastAPI(title="Semantic Cache Basics")
app.include_router(ask_router)

Let’s break this down.

The FastAPI() call creates the application object. This object represents the entire web service and is what the ASGI (Asynchronous Server Gateway Interface) server (uvicorn) runs when the container starts.

The application itself contains no knowledge of caching, embeddings, or LLMs. It simply defines a runtime container that will host those capabilities.

Router Registration

Instead of defining endpoints directly in main.py, the application imports a router from api/ask.py and registers it using include_router().

This pattern serves several purposes:

Separation of concerns: Routing and request handling live outside the application entry point.
Scalability: As the system grows, additional routers (for health checks, metrics, or admin endpoints) can be added without modifying core application wiring.
Readability: main.py remains easy to understand at a glance, even as the codebase expands.

At runtime, FastAPI merges the routes defined in ask_router into the main application. When a request arrives at the /ask endpoint, FastAPI resolves it through the registered router and forwards it to the appropriate handler function.

Why This Matters

Keeping the entry point minimal is intentional. It ensures that:

The application startup process is predictable
Routing logic is easy to trace
Core functionality can evolve independently of service wiring

With the application structure in place, we can now focus on what actually happens when a request reaches the system.

In the next section, we will walk through the /ask endpoint and see how it orchestrates exact-match caching, semantic search, and LLM fallback step by step.

FastAPI Ask Endpoint: End-to-End Semantic Caching Request Flow

This section makes the architecture concrete. We now walk through the /ask endpoint, which orchestrates the entire semantic caching pipeline from request arrival to response delivery.

The goal here is not to memorize code, but to understand why each step exists, where it lives, and how it protects performance, cost, and correctness.

The Role of the Ask Endpoint

The Ask endpoint is the control plane of the system.

It does not:

Compute similarity
Store embeddings
Talk directly to Redis internals

Instead, it:

Validates input
Decides which cache layers to consult
Enforces ordering between cheap and expensive operations
Collects observability signals
Guarantees a response even on cache failure

This separation is intentional. Cache logic remains reusable and testable, while orchestration logic stays explicit at the API boundary.

Defining the API Contract

We begin by defining the request and response models.

class AskRequest(BaseModel):
    query: str
    bypass_cache: bool = False

The request consists of a user query and an optional bypass_cache flag. This flag allows us to force a cache miss during debugging or testing, ensuring that the LLM and embedding pipeline still function correctly.

Before the request ever reaches the cache, the query field is validated by a field validator defined as a method on AskRequest.

@field_validator('query')
@classmethod
def validate_query(cls, v: str) -> str:
    if not v or not v.strip():
        raise ValueError("Query cannot be empty or whitespace-only")
    return v.strip()

This validation step protects the system at the boundary. Rejecting empty or whitespace-only queries prevents:

wasted embedding computation
cache pollution with meaningless entries
unnecessary LLM calls

This is a recurring pattern in production systems: fail fast, before expensive operations are triggered.
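If you want to experiment with this rule without installing Pydantic, the same fail-fast check can be approximated with a standard-library dataclass. This is only a stand-in: AskRequestSketch is a hypothetical name, and the real AskRequest is the Pydantic model shown above.

```python
from dataclasses import dataclass


@dataclass
class AskRequestSketch:
    query: str
    bypass_cache: bool = False

    def __post_init__(self) -> None:
        # Same rule as the Pydantic validator: reject empty input, keep the trimmed text.
        if not self.query or not self.query.strip():
            raise ValueError("Query cannot be empty or whitespace-only")
        self.query = self.query.strip()


print(AskRequestSketch(query="  What is semantic caching?  ").query)

try:
    AskRequestSketch(query="   ")
except ValueError as exc:
    print(f"rejected: {exc}")
```

The rejected request never reaches the cache, the embedder, or the LLM, which is the whole point of validating at the boundary.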

class AskResponse(BaseModel):
    response: str
    from_cache: bool
    similarity: float
    debug: dict

The response model intentionally exposes diagnostic information through fields such as from_cache, similarity, and debug. During development, this makes cache behavior transparent rather than opaque.

Initializing the Cache

Before handling requests, we create a SemanticCache instance:

cache = SemanticCache()

The endpoint itself remains stateless. All persistence and reuse live inside the cache layer.

Step 1: Entering the Endpoint

The endpoint is registered using FastAPI’s routing mechanism:

@router.post("/ask", response_model=AskResponse)
def ask_endpoint(request: AskRequest):

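FastAPI automatically validates incoming requests and outgoing responses against the schemas defined earlier. A request that fails validation is rejected with a 422 response before the handler runs, and a return value that does not match AskResponse surfaces as a server error instead of being sent to the client silently.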

Inside the handler, we extract the query and initialize tracking state:

query = request.query
miss_reason = None

The miss_reason variable exists purely for observability. Rather than treating cache misses as a black box, we explicitly track why a miss occurred.

Step 2: Exact-Match Cache Lookup (Fast Path)

The first decision point is the exact-match cache lookup:

if not request.bypass_cache:
    cached = cache.search(None, exact_query=query)

This is the cheapest path through the system.

If the same query has already been answered, the response can be returned immediately:

no embeddings are generated
no similarity computation occurs
the LLM is not touched

If a cached entry is found, it is validated:

if is_expired(cached):
    miss_reason = "expired"
elif is_poisoned(cached):
    miss_reason = "poisoned"
elif cached.get("confidence", 0.0) < 0.7:
    miss_reason = "low_confidence"

Only entries that are fresh, valid, and confident are allowed to short-circuit the pipeline.
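The validation chain above can be sketched as a single helper that returns the reason an entry was rejected. The bodies of is_expired() and is_poisoned() below are assumptions (the real ones live in ttl.py and poisoning.py and are not shown in this lesson), and miss_reason_for() is a hypothetical name.

```python
import time
from typing import Optional

CONFIDENCE_FLOOR = 0.7  # same cutoff as the snippet above


def is_expired(entry: dict, now: Optional[float] = None) -> bool:
    """Assumed rule: an entry expires once created_at + ttl is in the past."""
    now = time.time() if now is None else now
    return int(entry["created_at"]) + int(entry["ttl"]) <= now


def is_poisoned(entry: dict) -> bool:
    """Assumed rule: empty responses and LLM error strings must never be reused."""
    response = entry.get("response", "")
    return not response.strip() or response.startswith("[LLM Error]")


def miss_reason_for(entry: dict, now: Optional[float] = None) -> Optional[str]:
    """Return None when the entry is reusable, otherwise the reason it was rejected."""
    if is_expired(entry, now):
        return "expired"
    if is_poisoned(entry):
        return "poisoned"
    if float(entry.get("confidence", 0.0)) < CONFIDENCE_FLOOR:
        return "low_confidence"
    return None


fresh = {"created_at": 1_000, "ttl": 60, "response": "A cache keyed by meaning.", "confidence": 0.9}
print(miss_reason_for(fresh, now=1_030))
print(miss_reason_for(fresh, now=2_000))
print(miss_reason_for({**fresh, "response": "[LLM Error] timeout"}, now=1_030))
```

The int() and float() conversions are there because values read back from Redis with decode_responses=True arrive as strings.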

When all checks pass, the endpoint returns immediately:

metrics.cache_hit()
return AskResponse(...)

This path typically completes in milliseconds and handles repeated identical queries efficiently.

Step 3: Embedding Generation (Escalation Point)

If the exact-match lookup fails or is bypassed, the endpoint escalates:

embedding = embed_text(query)

Embedding generation is expensive, even when running locally. For this reason, it is intentionally delayed until all cheaper options have been exhausted.

This single design choice has a significant impact on system efficiency.

Step 4: Semantic Cache Lookup

With the embedding available, the endpoint attempts a semantic search:

cached = cache.search(embedding)

This path catches paraphrased and reworded queries. As before, cached entries are validated to ensure they are safe to reuse.

If a suitable match is found, the response is returned without calling the LLM.

Step 5: Explicit Cache Bypass

The bypass_cache flag is handled explicitly:

if request.bypass_cache:
    miss_reason = "bypass"

This allows controlled testing and debugging without modifying code or disabling cache logic globally.

Step 6: LLM Fallback and Cache Population

If both cache layers miss, the request is forwarded to the LLM:

metrics.cache_miss()
response = generate_llm_response(query)
metrics.llm_call()

This is the slowest path through the system, but it ensures that a cache miss still produces an answer.

Successful responses are stored in the cache:

if not response.startswith("[LLM Error]"):
    cache.store(query, embedding, response, metadata=metadata)

Responses beginning with [LLM Error] are intentionally not cached, preventing cache poisoning and ensuring failures do not propagate to future requests.

Control Flow Summary

The endpoint follows a simple, explicit sequence: exact-match lookup, embedding generation, semantic lookup, LLM fallback, and cache population.

Figure 4: LLM API Control Flow with Layered Semantic Caching (source: image by the author).

Every expensive operation is deferred until absolutely necessary.
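The ordering can be condensed into one function. The sketch below is not the tutorial's ask_endpoint: handle_ask() and the fake dependencies are hypothetical stand-ins, passed in as plain callables so the flow can be run without Redis, Ollama, or FastAPI.

```python
from typing import Callable, Optional


def handle_ask(
    query: str,
    bypass_cache: bool,
    exact_lookup: Callable[[str], Optional[str]],
    embed: Callable[[str], list],
    semantic_lookup: Callable[[list], Optional[str]],
    call_llm: Callable[[str], str],
    store: Callable[[str, list, str], None],
) -> dict:
    """Cheapest step first; each later step runs only if the earlier ones missed."""
    if not bypass_cache:
        hit = exact_lookup(query)             # 1. exact match: no embedding, no LLM
        if hit is not None:
            return {"response": hit, "from_cache": True, "path": "exact"}

    embedding = embed(query)                  # 2. pay for an embedding only now

    if not bypass_cache:
        hit = semantic_lookup(embedding)      # 3. semantic match: no LLM
        if hit is not None:
            return {"response": hit, "from_cache": True, "path": "semantic"}

    response = call_llm(query)                # 4. slowest path
    if not response.startswith("[LLM Error]"):
        store(query, embedding, response)     # 5. populate the cache for next time
    return {"response": response, "from_cache": False, "path": "llm"}


# Stand-in dependencies backed by plain Python objects.
exact_store: dict = {}
calls: list = []


def fake_embed(q: str) -> list:
    calls.append("embed")
    return [float(len(q))]


def fake_llm(q: str) -> str:
    calls.append("llm")
    return f"answer to: {q}"


def fake_store(q: str, emb: list, resp: str) -> None:
    exact_store[q] = resp


deps = dict(
    exact_lookup=exact_store.get,
    embed=fake_embed,
    semantic_lookup=lambda emb: None,
    call_llm=fake_llm,
    store=fake_store,
)

first = handle_ask("What is semantic caching?", False, **deps)
second = handle_ask("What is semantic caching?", False, **deps)
print(first["path"], second["path"], calls)
```

The calls list records that the second, identical request triggered neither an embedding nor an LLM call, which is the behavior the layered design is meant to guarantee.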

Embeddings: Turning Text into Semantic Vectors

Up to this point, we have treated embeddings as a black box: something expensive that we try to avoid unless absolutely necessary.

In this section, we will open that box just enough to understand what embeddings are, when they are generated, and why they enable semantic caching without diving into vector math or model internals.

Why Embeddings Exist in This System

Exact-match caching works only when queries are identical at the string level. As soon as wording changes, exact matching breaks down.

Embeddings solve this problem by converting text into a numeric representation that captures meaning rather than surface form.

Queries that mean the same thing tend to produce vectors that are close together in vector space, even if their wording differs significantly.

This is the foundation that makes semantic caching possible.

Embedding Generation Happens on Demand

In our implementation, embeddings are generated only after the exact-match cache fails.

This decision is intentional.

Embedding generation involves:

a model invocation
network overhead
serialization and deserialization
non-trivial latency

Because of this cost, embeddings are treated as an escalation step, not a default operation.

This is why the /ask endpoint first attempts an exact-match lookup before calling embed_text().

The embed_text Function

def embed_text(text: str):

This function has one responsibility: Convert input text into a semantic vector representation.

It does not perform caching, similarity search, or validation. Those concerns live elsewhere.

Calling the Embedding Model

url = f"http://{settings.OLLAMA_HOST}:{settings.OLLAMA_PORT}/api/embeddings"

Here, we construct the Ollama embedding endpoint URL from configuration values (settings.OLLAMA_HOST and settings.OLLAMA_PORT).

This allows the embedding service to run locally, inside Docker, or on a remote host without changing code.

resp = httpx.post(
    url,
    json={"model": settings.EMBEDDING_MODEL, "prompt": text},
    timeout=10.0
)

This request sends 2 key pieces of information to the embedding service:

the embedding model name (e.g., nomic-embed-text)
the input text to embed

The timeout ensures the request does not hang indefinitely. Embedding generation is expensive, but it should still fail fast if something goes wrong.

Handling the Response

resp.raise_for_status()
return resp.json().get("embedding", [])

If the request succeeds, the embedding model returns a numeric vector — typically a list of floating-point values.

This vector represents the semantic meaning of the input text and becomes the key used for similarity comparison in the cache.

At this stage, we treat the vector as an opaque object. We do not inspect its dimensionality or normalize it here.

Error Handling Strategy

except Exception as e:
    # "from e" keeps the original exception chained for debugging
    raise RuntimeError(f"Failed to generate embedding: {e}") from e

If embedding generation fails for any reason (network issues, model errors, timeouts), the function raises an exception.

This is intentional.

If embeddings cannot be generated, the system cannot safely perform semantic matching. Silently continuing would lead to unpredictable behavior, so we fail loudly instead.
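Putting the fragments together, the whole function might look roughly like the sketch below. Two things differ from the tutorial's code and are assumptions made so the example runs anywhere: it uses the standard library's urllib instead of httpx, and the transport is injectable (the post parameter, post_json, and EMBEDDINGS_URL are hypothetical names). It also treats an empty vector as a failure, which is stricter than the original snippet.

```python
import json
import urllib.request
from typing import Callable, List

EMBEDDINGS_URL = "http://localhost:11434/api/embeddings"  # assumed host and port


def post_json(url: str, payload: dict, timeout: float) -> dict:
    """POST a JSON body and decode the JSON reply (standard library only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


def embed_text(
    text: str,
    model: str = "nomic-embed-text",
    post: Callable[[str, dict, float], dict] = post_json,
) -> List[float]:
    """Return the embedding for `text`, raising RuntimeError on any failure."""
    try:
        body = post(EMBEDDINGS_URL, {"model": model, "prompt": text}, 10.0)
    except Exception as exc:
        # Chain the original exception so the root cause stays visible.
        raise RuntimeError(f"Failed to generate embedding: {exc}") from exc
    embedding = body.get("embedding") or []
    if not embedding:
        # Stricter than the original snippet, which would return [] here.
        raise RuntimeError("Failed to generate embedding: empty vector in response")
    return embedding


# Exercise the function with a canned transport instead of a live Ollama server.
fake_post = lambda url, payload, timeout: {"embedding": [0.1, 0.2, 0.3]}
print(embed_text("What is semantic caching?", post=fake_post))
```

Rejecting an empty vector keeps the "fail loudly" promise: an empty list would otherwise flow into the similarity search and quietly match nothing.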

Why the Embedder Is Intentionally Simple

Notice what this function does not do:

it does not store embeddings
it does not perform similarity search
it does not retry failed requests
it does not fall back to alternative models

Those decisions are deliberate.

For Lesson 1, the embedder exists purely to convert text into vectors. Keeping it small and focused makes the system easier to understand and test.

How the Embedder Is Used in the Pipeline

At runtime, the embedder is called only when necessary:

Exact-match cache fails
The query is passed to embed_text()
The returned vector is sent to the semantic cache
Similarity is computed against stored embeddings

This ensures embeddings are generated only when cheaper paths have already failed.

Key Takeaways

Embeddings are generated via a simple HTTP call to a local model
The embedder has a single responsibility
Errors are surfaced immediately
Embeddings act as semantic keys for cache lookup

With embedding generation understood, we are now ready to look at the semantic cache itself: how embeddings and responses are stored, scanned, and matched.

In the next section, we will walk through the semantic cache implementation, starting with a deliberately naive but correct linear scan approach.

The Semantic Cache: Cosine Similarity, Redis Storage, and Reusing Meaning

At this point, we understand how queries enter the system and how text is converted into embeddings. What remains is the component that ties everything together: the semantic cache itself.

The semantic cache is responsible for 2 things:

Storing past queries, embeddings, and responses
Retrieving the best reusable response for a new query

In Lesson 1, we intentionally implement the cache in the simplest correct way possible: a linear scan over cached entries. This keeps the implementation easy to reason about and makes the request flow fully transparent.

The Semantic Cache Module

The cache logic lives in semantic_cache.py:

class SemanticCache:

This class encapsulates all Redis interaction and similarity logic. The API layer never talks to Redis directly.

Initializing the Cache

def __init__(self):
    self.r = redis.Redis(
        host=settings.REDIS_HOST,
        port=settings.REDIS_PORT,
        decode_responses=True
    )
    self.similarity_threshold = 0.85
    self.namespace = "semantic_cache:v1"

Here we establish a Redis connection and configure 2 important parameters:

Similarity threshold: Only responses with sufficiently high semantic similarity are eligible for reuse.
Namespace prefix: All Redis keys are namespaced to avoid collisions and allow future versioning.

For Lesson 1, the exact threshold value is not important. What matters is that a threshold exists and is applied consistently.

Storing Cache Entries

The first core operation is storing new entries.

def store(self, query, embedding, response, metadata=None):

This method is called only after a successful LLM response.

Creating a Cache Entry

entry = CacheEntry(
    id=entry_uuid,
    query=query,
    query_hash=query_hash,
    embedding=json.dumps(embedding),
    response=response,
    created_at=int(time.time()),
    ttl=default_ttl(),
    metadata=metadata or {}
)

Each cache entry stores:

the original query
a normalized query hash (used for exact matching)
the embedding (serialized for Redis storage)
the LLM response
creation timestamp and TTL
optional metadata for observability

This structure allows the cache to support both exact-match and semantic lookups.

Writing to Redis

self.r.hset(redis_key, mapping=entry.dict())
self.r.sadd(f"{self.namespace}:keys", redis_key)

Each cache entry is stored as a Redis hash, and all entry keys are tracked in a Redis set.

This allows the cache to iterate over all entries during search operations.

For Lesson 1, this approach is intentionally simple and explicit.
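The hash-plus-index pattern can be tried without a Redis server. In the sketch below, FakeRedis is a hypothetical in-memory stand-in that mimics only the four calls used, and the key format namespace:id is an assumption. The to_redis_mapping() helper is also an assumption: redis-py accepts only bytes, strings, integers, and floats as hash field values, so a nested field such as metadata would need to be serialized before hset; this lesson does not show how the tutorial's code handles that.

```python
import json
from typing import Dict, Set


def to_redis_mapping(entry: dict) -> Dict[str, str]:
    """Flatten an entry so every hash field is a plain string."""
    flat = {}
    for field, value in entry.items():
        # Nested containers (for example metadata) are JSON-encoded; scalars become str.
        flat[field] = json.dumps(value) if isinstance(value, (dict, list)) else str(value)
    return flat


class FakeRedis:
    """In-memory stand-in that mimics only hset, sadd, smembers, and hgetall."""

    def __init__(self) -> None:
        self.hashes: Dict[str, Dict[str, str]] = {}
        self.sets: Dict[str, Set[str]] = {}

    def hset(self, key: str, mapping: Dict[str, str]) -> None:
        self.hashes.setdefault(key, {}).update(mapping)

    def sadd(self, key: str, member: str) -> None:
        self.sets.setdefault(key, set()).add(member)

    def smembers(self, key: str) -> Set[str]:
        return set(self.sets.get(key, set()))

    def hgetall(self, key: str) -> Dict[str, str]:
        return dict(self.hashes.get(key, {}))


namespace = "semantic_cache:v1"
r = FakeRedis()
entry = {"id": "abc123", "query": "What is semantic caching?", "ttl": 3600,
         "metadata": {"model": "example-model"}}
redis_key = f"{namespace}:{entry['id']}"

r.hset(redis_key, mapping=to_redis_mapping(entry))   # one hash per entry
r.sadd(f"{namespace}:keys", redis_key)               # index of all entry keys

for key in r.smembers(f"{namespace}:keys"):          # how a linear scan finds entries
    print(key, r.hgetall(key)["metadata"])
```

The index set is what makes the linear scan possible: the search walks its members and loads each hash in turn.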

Searching the Cache

The second core operation is lookup.

def search(self, embedding, exact_query=None):

This method supports 2 search modes, which map directly to the layered cache strategy used in the API.

Exact-Match Lookup (Fast Path)

if exact_query:
    query_hash = self._hash_query(exact_query)

When an exact query is provided, the cache first attempts a hash-based lookup.

Each cached entry is scanned until a matching hash is found. If found, the entry is returned immediately with a similarity score of 1.0.

No embeddings are involved in this path.
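A normalized hash can be sketched in a few lines. The internals of _hash_query are not shown in this lesson, so the choices below (lowercasing, collapsing whitespace, SHA-256) are assumptions, and hash_query is a hypothetical stand-in name.

```python
import hashlib


def hash_query(query: str) -> str:
    """Hash a normalized form of the query so trivial variations share one key."""
    normalized = " ".join(query.lower().split())  # lowercase, collapse whitespace runs
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


a = hash_query("What is semantic caching?")
b = hash_query("   What   is   semantic   caching?   ")
c = hash_query("How does semantic caching work?")
print(a == b, a == c)
```

The first two queries differ only in spacing and produce the same hash, while the paraphrase produces a different one and therefore has to go through the semantic path.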

Semantic Lookup (Flexible Path)

If no exact match is found and an embedding is provided, the cache performs a semantic search:

sim = self.cosine_similarity(query_embedding, cached_embedding)

Each cached embedding is compared against the query embedding using cosine similarity.

Only entries that exceed the configured similarity threshold are considered candidates.

Selecting the Best Match

During the scan, the cache tracks the highest similarity score and returns the best matching entry.

This ensures that even when multiple entries are similar, the most relevant response is reused.
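The semantic path can be sketched with nothing but the standard library. Cosine similarity is the dot product of two vectors divided by the product of their lengths; the best_match() helper, the toy three-dimensional vectors, and the handling of zero-length vectors are assumptions for illustration, not the tutorial's exact code.

```python
import json
import math
from typing import List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """dot(a, b) / (|a| * |b|); returns 0.0 for mismatched or zero-length vectors."""
    if len(a) != len(b) or not a:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_embedding, entries, threshold=0.85) -> Optional[Tuple[dict, float]]:
    """Scan every entry and keep the highest score strictly above the threshold."""
    best_entry, best_sim = None, threshold
    for entry in entries:
        cached_embedding = json.loads(entry["embedding"])  # stored as a JSON string
        sim = cosine_similarity(query_embedding, cached_embedding)
        if sim > best_sim:
            best_entry, best_sim = entry, sim
    return (best_entry, best_sim) if best_entry is not None else None


entries = [
    {"response": "about caching", "embedding": json.dumps([1.0, 0.0, 0.0])},
    {"response": "about cooking", "embedding": json.dumps([0.0, 1.0, 0.0])},
    {"response": "also caching", "embedding": json.dumps([0.9, 0.1, 0.0])},
]
match = best_match([1.0, 0.05, 0.0], entries)
print(match[0]["response"], round(match[1], 3))
```

Two of the three entries clear the threshold here, and the scan keeps the closer one rather than the first one it encounters.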

Why This Implementation Is O(N)

Every search scans all cached entries.

This is not an accident.

For Lesson 1, a linear scan has 3 advantages:

the behavior is easy to understand
the logic is fully visible
debugging is straightforward

More advanced indexing strategies belong in later lessons.

Why Expired Entries Are Cleaned During Search

While scanning entries, expired items are removed opportunistically.

This prevents stale data from accumulating indefinitely without introducing background workers or schedulers.
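Opportunistic cleanup can be sketched with a dictionary and a set standing in for the Redis hashes and the key index. The scan_and_clean() name and the expiry rule (created_at + ttl) are assumptions; with Redis, the two removals would presumably map to deleting the hash and removing its key from the index set.

```python
from typing import Dict, List


def scan_and_clean(store: Dict[str, dict], index: set, now: int) -> List[dict]:
    """Return live entries and delete expired ones encountered along the way."""
    live = []
    for key in list(index):                      # copy: the index is mutated below
        entry = store.get(key)
        if entry is None:                        # dangling index member
            index.discard(key)
            continue
        if entry["created_at"] + entry["ttl"] <= now:
            del store[key]                       # drop the entry itself
            index.discard(key)                   # and its key from the index
            continue
        live.append(entry)
    return live


store = {
    "k1": {"response": "fresh", "created_at": 1_000, "ttl": 600},
    "k2": {"response": "stale", "created_at": 1_000, "ttl": 10},
}
index = set(store)
live = scan_and_clean(store, index, now=1_100)
print([e["response"] for e in live], sorted(index))
```

The cost of cleanup is folded into work the search already does, so no separate scheduler is needed; the trade-off is that entries nobody searches past are never removed.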

Key Takeaways

The semantic cache owns all Redis interactions
Exact-match lookup is attempted before semantic matching
Semantic similarity is computed using embeddings
A linear scan trades performance for clarity
The cache returns the best reusable response, not just the first match

At this stage, the system is fully functional: queries can be answered, cached, and reused.

Cache Entries: What Exactly Gets Stored?

So far, we’ve treated the cache as a logical concept: something that stores queries, embeddings, and responses.

In this section, we’ll make that concrete by looking at the structure of a cache entry. Understanding this structure is important because it explains why the cache can support both exact-match and semantic lookup — without duplicating data or logic.

The Cache Entry Schema

Cache entries are defined using a Pydantic model:

class CacheEntry(BaseModel):
    id: str
    query: str
    query_hash: str
    embedding: str
    response: str
    created_at: int
    ttl: int
    metadata: Optional[Dict] = Field(default_factory=dict)

Each field exists for a specific reason. Let’s walk through them one by one.

Identity and Query Fields

id: str
query: str
query_hash: str

id: uniquely identifies the cache entry and is used to construct the Redis key.
query: stores the original user input. This is useful for debugging and inspection.
query_hash: stores a normalized hash of the query and enables exact-match lookup.

At this stage, it’s enough to know that the hash ensures identical queries can be matched quickly. We’ll revisit how and why this normalization matters in a later lesson.

Embedding Storage

embedding: str

Embeddings are stored as a JSON-serialized string, not as a raw Python list.

This choice is deliberate:

Redis hash fields hold flat string values, so a list has to be serialized before it can be stored
Serialization keeps the schema simple
Deserialization happens only when similarity needs to be computed

For Lesson 1, the important takeaway is that embeddings are stored once, alongside the response generated for the same query.

Response and Timing Information

response: str
created_at: int
ttl: int

response: is the text returned by the LLM.
created_at: records when the entry was generated.
ttl: defines how long the entry is considered valid.

The cache does not rely on Redis expiration here. Instead, validity is checked at read time. This gives the application full control over when an entry should be reused or rejected.

We intentionally avoid deeper TTL semantics in this lesson.

Metadata and Safety

metadata: Optional[Dict] = Field(default_factory=dict)

Metadata allows the cache to store contextual information such as:

pipeline name
model identifier
request origin

The use of default_factory=dict avoids shared mutable state across cache entries — a subtle but important correctness detail.

At this stage, metadata is informational rather than functional.
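The shared-state pitfall that default_factory guards against is easiest to see in plain Python. The sketch below uses a function with a mutable default argument to show the problem, and a standard-library dataclass with field(default_factory=dict) to show the fix; Pydantic's Field(default_factory=dict) follows the same idea. Both tag_shared() and EntrySketch are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict


def tag_shared(name: str, metadata: dict = {}) -> dict:   # one dict reused by every call
    metadata[name] = True
    return metadata


@dataclass
class EntrySketch:
    id: str
    metadata: Dict = field(default_factory=dict)           # new dict per instance


tag_shared("first")
leaked = tag_shared("second")
print(leaked)

a, b = EntrySketch("a"), EntrySketch("b")
a.metadata["pipeline"] = "demo"
print(a.metadata, b.metadata)
```

The second call to tag_shared() still sees the key written by the first call, whereas each EntrySketch instance gets its own empty dictionary.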

Why This Schema Works Well

This schema supports the layered caching strategy naturally:

Exact match uses query_hash
Semantic match uses embedding
Freshness checks use created_at and ttl
Safety checks use response and metadata

All required information is co-located in a single cache entry, making lookup and validation straightforward.

End-to-End Demo: Verifying Core Cache Behavior

In this section, we will verify that the semantic cache behaves as expected under a small set of controlled scenarios.

These examples are meant to be run locally by the reader. The responses shown below are representative and may vary slightly depending on the model and configuration.

Demo Case 1: Cold Request (LLM Fallback)

We begin with a query that has not been seen before.

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "What is semantic caching?"}'

Expected behavior

Exact-match cache miss
Semantic cache miss
LLM call
Cache population

Response

Figure 5: Cold request flow showing a cache miss at both the exact-match and semantic cache layers, triggering an LLM fallback. The response is generated by the model and stored for future reuse (source: image by the author).

The key signal here is "from_cache": false, confirming the request fell back to the LLM.

Demo Case 2: Exact-Match Cache Hit

Now we send the same query again.

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "What is semantic caching?"}'

Expected behavior

Exact-match cache hit
No embedding generation
No LLM call

Example response

Figure 6: Exact-match cache behavior. The repeated query is served directly from the cache via an exact string match, bypassing embedding generation and the LLM entirely (source: image by the author).

Here, the cache reused the response immediately using an exact-match lookup.

Optional Demo: Whitespace Normalization

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "   What   is   semantic   caching?   "}'

Assuming the entry from Demo Case 1 is still cached, this request hits the exact-match cache because the query is normalized before it is hashed.

Demo Case 3: Semantic Cache Hit (Paraphrased Query)

Next, we send a paraphrased version of the original query.

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "Can you explain how semantic caching works?"}'

Expected behavior

Exact-match cache miss
Embedding generation
Semantic cache hit
No LLM call

Example response

Figure 7: Semantic cache hit for a paraphrased query. Although the input text differs, the cached response is reused based on embedding similarity, avoiding a new LLM call (source: image by the author).

Even though the query text is different, the cache successfully reused the response based on semantic similarity.

Demo Case 4: Forcing a Cache Miss with bypass_cache

The bypass_cache flag allows us to force the system to skip both cache layers.

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "What is semantic caching?", "bypass_cache": true}'

Expected behavior

Exact-match cache skipped
Semantic cache skipped
LLM called unconditionally

Example response

Figure 8: Cache bypass behavior. The request explicitly skips all cache layers via bypass_cache, ensuring the LLM pipeline executes independently of cached responses (source: image by the author).

This is useful for debugging and validating that the LLM pipeline still works independently of the cache.

Observing Cache Metrics (Optional)

You can inspect basic cache statistics using the /internal/metrics endpoint:

curl http://localhost:8000/internal/metrics

Example response

Figure 9: Internal cache metrics showing hit, miss, and bypass counters, enabling lightweight observability of cache behavior during development and debugging (source: image by the author).

These metrics make cache behavior observable without requiring external tooling.

If you can reproduce these behaviors locally, you’ve successfully implemented a working semantic cache.

In the next lesson, we will take this system and begin hardening it for real-world use.

Summary

In this lesson, we built a complete semantic caching system for LLM applications from the ground up. We started by wiring a FastAPI service and defining a clean request–response contract, then implemented a layered caching strategy that prioritizes cheap exact-match lookups before escalating to semantic similarity and, finally, LLM inference.

We walked through how text queries are converted into embeddings on demand, how cached responses and embeddings are stored in Redis, and how the cache decides whether a prior response can be safely reused. By keeping the implementation intentionally simple and explicit, every step in the request flow remains observable and easy to reason about.

Finally, we verified the system end-to-end by running controlled demos: a cold request falling back to the LLM, an exact-match cache hit, a semantic cache hit for a paraphrased query, and an explicit cache bypass. At this point, you have a working semantic cache that behaves correctly, makes its decisions visible, and serves as a solid foundation for further hardening and optimization.

Citation Information

Singh, V. “Semantic Caching for LLMs: FastAPI, Redis, and Embeddings,” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/yso6f

@incollection{Singh_2026_semantic-caching-for-llms-fastapi-redis-and-embeddings,
  author = {Vikram Singh},
  title = {{Semantic Caching for LLMs: FastAPI, Redis, and Embeddings}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/yso6f},
}

Pytest Tutorial: MLOps Testing, Fixtures, and Locust Load Testing

Vikram Singh — Mon, 20 Apr 2026 12:45:00 +0000

Table of Contents

Pytest Tutorial: MLOps Testing, Fixtures, and Locust Load Testing
Introduction to MLOps Testing: Building Reliable ML Systems with Pytest
Why Testing Is Non-Negotiable in MLOps

What You Will Learn: Pytest, Fixtures, and Load Testing for MLOps
From FastAPI to Testing: Extending Your MLOps Pipeline with Validation

Test-Driven MLOps: Applying Software Testing Best Practices to ML Pipelines

What to Test in MLOps Pipelines: Models, APIs, and Configurations
Unit vs Integration vs Performance Testing
The Software Testing Pyramid for MLOps: Unit, Integration, and Load Testing

Project Structure and Test Layout

Test Directory Structure for MLOps: unit, integration, and performance
Understanding Pytest Fixtures: Using conftest.py for Reusable Test Setup
Where to Place Tests in MLOps Projects: Unit vs Integration vs Performance

Unit Testing in MLOps with Pytest

The Code Under Test: Inference Service and Dummy Model
services/inference_service.py
models/dummy_model.py
Writing Pytest Unit Tests for MLOps: test_inference_service.py
Testing the Inference Service with Pytest (MLOps Unit Tests)
Testing ML Models in Isolation with Pytest
How to Run Pytest Unit Tests for MLOps Projects

Integration Testing in MLOps

Using FastAPI TestClient for Integration Testing with Pytest
How FastAPI TestClient Works for API Testing
Testing API Endpoints (/health, /predict)
What Integration Tests Verify in an MLOps API
Testing the /predict Endpoint in an MLOps API
Testing Documentation Endpoints (/docs, /openapi.json)
What This Ensures
Testing Error Handling in FastAPI APIs with Pytest
Integration Test Breakdown: What Each Test Validates
How to Run Integration Tests with Pytest in MLOps

Performance and Load Testing with Locust

Why Load Testing Is Essential for MLOps and ML APIs
Locust Load Testing Concepts: Users, Spawn Rate, and Tasks Explained
Writing the locustfile.py
What This Locust Load Test Validates in an MLOps API
Running Locust: Headless Mode vs Web UI Dashboard
Generating Locust Load Testing Reports for ML APIs
Understanding Test Metrics (RPS, failures, latency, P95/P99)

MLOps Test Configuration: YAML and Environment Variables

Understanding test_config.yaml for MLOps Testing
What test_config.yaml Controls in MLOps Pipelines
Overriding Application Configuration in Test Mode
How Configuration Overrides Work: YAML and Environment Variables
Why Configuration Management Matters in MLOps Testing
Using Environment Variables for Test Isolation

Code Quality in MLOps: Linting, Formatting, and Static Analysis Tools

Linting Python Code with flake8
Formatting Python Code with Black Pipelines
Using isort to Manage Python Imports
How to Run isort for Clean Python Imports
Static Type Checking with MyPy for MLOps Codebases
Using a Makefile to Automate MLOps Testing and Code Quality

Automating Testing with a Pytest Test Runner Script

Running Automated Tests with run_tests.sh
Understanding Pytest Output and Test Results
Why Automated Testing Workflows Matter in MLOps
Integrating Pytest into CI/CD Pipelines

Automating Load Testing in MLOps with Locust Scripts

Running Automated Locust Load Tests with run_locust.sh
Automatically Generating Load Testing Reports for ML APIs
Preparing Load Testing for CI/CD and Cloud MLOps Pipelines

Test Coverage in MLOps: Measuring and Improving Code Coverage

Using pytest-cov to Measure Test Coverage
How to Measure Code Coverage in MLOps Projects
How to Increase Test Coverage in MLOps Pipelines
Recommended Test Coverage Targets for MLOps Systems

Summary

Citation Information

Pytest Tutorial: MLOps Testing, Fixtures, and Locust Load Testing

In this lesson, you will learn how to make ML systems reliable, correct, and production-ready through structured testing and validation. You will walk through unit tests, integration tests, load and performance checks, fixtures, code quality tools, and automated test runs, giving you everything you need to ensure your ML API behaves predictably under real-world conditions.

This lesson is the last of a 2-part series on Software Engineering for Machine Learning Operations (MLOps):

FastAPI for MLOps: Python Project Structure and API Best Practices
Pytest Tutorial: MLOps Testing, Fixtures, and Locust Load Testing (this tutorial)

To learn how to test, validate, and stress-test your ML services like a professional MLOps engineer, just keep reading.


Introduction to MLOps Testing: Building Reliable ML Systems with Pytest

Testing is the backbone of reliable MLOps. A model might look great in a notebook, but once wrapped in services, APIs, configs, and infrastructure, dozens of things can break silently: incorrect inputs, unexpected model outputs, missing environment variables, slow endpoints, and downstream failures. This lesson shows you how to catch those problems before they reach production.

In this lesson, you will learn the complete testing workflow for machine learning (ML) systems: from small, isolated unit tests to full API integration checks and load testing your endpoints under real traffic conditions. You will also understand how to structure your tests, how each type of test fits into the MLOps lifecycle, and how to design a test suite that grows cleanly as your project evolves.

To learn how to validate, benchmark, and harden your ML applications for production, just keep reading.

Why Testing Is Non-Negotiable in MLOps

Machine learning adds layers of unpredictability on top of regular software engineering. Models drift, inputs vary, inference latency can increase, and small code changes can ripple into major behavioral shifts. Without testing, you have no safety net. Proper tests make your system observable, predictable, and safe to deploy.

What You Will Learn: Pytest, Fixtures, and Load Testing for MLOps

You will walk through a practical testing workflow tailored for ML applications: writing unit tests for inference logic, validating API endpoints end-to-end, using fixtures to isolate environments, verifying configuration behavior, and running load tests to understand real-world performance. Each example connects directly to the codebase you built earlier.

From FastAPI to Testing: Extending Your MLOps Pipeline with Validation

Previously, you learned how to structure a clean ML codebase, configure environments, separate services, and expose reliable API endpoints. Now, you will stress-test that foundation. This lesson transforms your structured application into a validated, production-ready system with tests that catch issues before users ever see them.

Test-Driven MLOps: Applying Software Testing Best Practices to ML Pipelines

Test-driven development (TDD) matters even more in ML because models introduce uncertainty on top of normal software complexity. A single mistake in preprocessing, an incorrect model version, or a slow endpoint can break your application in ways that are hard to detect without a structured testing strategy. Test-driven MLOps gives you a predictable workflow: write tests, run them often, and let failures guide improvements.

What to Test in MLOps Pipelines: Models, APIs, and Configurations

ML systems require testing across multiple layers because issues can appear anywhere: in preprocessing logic, service code, configuration loading, API endpoints, or the model itself. You should verify that your inference service behaves correctly with both valid and invalid inputs, that your API returns consistent responses, that your configuration behaves as expected, and that the entire pipeline works end-to-end. Even when using a dummy model, testing ensures that the structure of your system remains correct as the real model is swapped in later.

Unit vs Integration vs Performance Testing

Unit tests focus on the smallest pieces of your system: functions, helper modules, and the inference service. They run fast and break quickly when a small change introduces an error. Integration tests validate how components work together: routes, services, configs, and the FastAPI layer. They ensure your API behaves consistently no matter what changes inside the codebase. Performance tests simulate real user traffic, evaluating latency, throughput, and failure rates under load. Together, these 3 types of tests create full confidence in your ML application.

The Software Testing Pyramid for MLOps: Unit, Integration, and Load Testing

The testing pyramid helps prioritize effort: many unit tests at the bottom, fewer integration tests in the middle, and a small number of heavy performance tests at the top. ML systems especially benefit from this structure because most failures occur in smaller utilities and service functions, not in the final API layer. By weighting your test suite correctly, you get fast feedback during development while still validating the entire system before deployment.

Project Structure and Test Layout

A clean testing layout makes your ML system predictable, scalable, and easy to maintain. By separating tests into clear categories (e.g., unit, integration, and performance), you ensure that each kind of test has a focused purpose and a natural home inside the repository. This structure also mirrors how real production MLOps teams organize their work, making your project easier to extend as your system grows.

Test Directory Structure for MLOps: unit, integration, and performance

Your Lesson 2 repository includes a dedicated tests/ directory with 3 subfolders:

tests/
├── unit/
├── integration/
└── performance/

unit/: holds small, fast tests that validate individual pieces such as the DummyModel, the inference service, or helper functions.
integration/: contains tests that spin up the FastAPI app and verify endpoints like /health, /predict, and the OpenAPI docs.
performance/: includes Locust load testing scripts that simulate real traffic hitting your API to measure latency, throughput, and error rates.

This layout ensures that each type of test is separated by intent and runtime cost, giving you a clean way to scale your test suite over time.

Understanding Pytest Fixtures: Using conftest.py for Reusable Test Setup

The conftest.py file is the backbone of your testing environment. Pytest automatically loads fixtures defined here and makes them available across all test files without explicit imports.

Your project uses conftest.py to provide:

FastAPI TestClient fixture: allows integration tests to call your API exactly the way a real HTTP client would.
Sample input data: keeps repeated values out of your test files.
Expected outputs: help tests stay focused on behavior rather than setup.

This shared setup reduces duplication, keeps tests clean, and ensures consistent test behavior across the entire suite.

Where to Place Tests in MLOps Projects: Unit vs Integration vs Performance

A simple rule-of-thumb keeps your test organization disciplined:

Put tests in unit/ when the code under test does not require a running API or external system.
Example: testing that the DummyModel.predict() returns “positive” for the word great.
Put tests in integration/ when the test needs the full FastAPI app running.
Example: calling /predict and checking that the API returns a JSON response.
Put tests in performance/ when measuring speed, concurrency limits, or error behavior under load.
Example: Locust scripts simulating dozens of users sending /predict requests at once.

Following this pattern ensures your tests remain stable, fast, and easy to reason about as the project grows.

Unit Testing in MLOps with Pytest

Unit tests are your first safety net in MLOps. Before you hit the API, spin up Locust, or ship to production, you want to know: Does my core prediction code behave exactly the way I think it does?

In this lesson, you do that by testing 2 things in isolation:

inference service: services/inference_service.py
dummy model: models/dummy_model.py

All of that is captured in tests/unit/test_inference_service.py.

The Code Under Test: Inference Service and Dummy Model

First, recall what you are testing.

services/inference_service.py

"""
Simple inference service for making model predictions.
"""
from models.dummy_model import DummyModel
from core.logger import logger

# Initialize model
model = DummyModel()
logger.info(f"Loaded model: {model.model_name}")


def predict(input_text: str) -> str:
    """
    Make a prediction using the loaded model.
   
    Args:
        input_text: Input text for prediction
       
    Returns:
        Prediction result as string
    """
    logger.info(f"Making prediction for input: {input_text[:50]}...")
   
    try:
        prediction = model.predict(input_text)
        logger.info(f"Prediction result: {prediction}")
        return prediction
    except Exception as e:
        logger.error(f"Error during prediction: {str(e)}")
        raise

This file does 3 things:

Initializes a DummyModel once at import time and logs that it loaded.
Exposes a predict(input_text: str) -> str function that:
- Logs the incoming input (truncated to 50 chars).
- Calls model.predict(...).
- Logs and returns the prediction.
Catches any exception, logs the error, and re-raises it so failures are visible.

You are not testing FastAPI here, just pure Python logic: given some text, does this function consistently return the correct label?

models/dummy_model.py

"""
Placeholder dummy model class.
"""
from typing import Any


class DummyModel:
    """
    A placeholder ML model class that returns keyword-based predictions.
    """
   
    def __init__(self) -> None:
        """Initialize the dummy model."""
        self.model_name = "dummy_classifier"
        self.version = "1.0.0"
   
    def predict(self, input_data: Any) -> str:
        """
        Make a prediction using a simple keyword rule (for demonstration).

        Args:
            input_data: Input data for prediction

        Returns:
            "positive" if the text contains "good" or "great", otherwise "negative"
        """
        text = str(input_data).lower()
        if "good" in text or "great" in text:
            return "positive"
        return "negative"

This model is deliberately simple:

The constructor sets model_name and version for logging and version tracking.
The predict() method:
- Converts any input to lowercase text.
- Returns "positive" if it sees "good" or "great" in the text.
- Returns "negative" otherwise.

Your unit tests will assert that both the service and model behave exactly like this.

Writing Pytest Unit Tests for MLOps: test_inference_service.py

Here is the full unit test module:

"""
Unit tests for the inference service.
"""
import pytest
from services.inference_service import predict
from models.dummy_model import DummyModel


class TestInferenceService:
    """Test class for inference service."""
   
    def test_predict_returns_string(self):
        """Test that predict() returns a string."""
        result = predict("some input text")
        assert isinstance(result, str)
   
    def test_predict_positive_input(self):
        """Test prediction with positive input."""
        result = predict("This is good")
        assert result == "positive"
   
    def test_predict_negative_input(self):
        """Test prediction with negative input."""
        result = predict("This is bad")
        assert result == "negative"


class TestDummyModel:
    """Test class for DummyModel."""
   
    def test_model_initialization(self):
        """Test that the model initializes correctly."""
        model = DummyModel()
        assert model.model_name == "dummy_classifier"
        assert model.version == "1.0.0"
   
    def test_predict_with_good_word(self):
        """Test that the model returns positive for 'good'."""
        model = DummyModel()
        result = model.predict("This is good")
        assert result == "positive"
   
    def test_predict_with_great_word(self):
        """Test that the model returns positive for 'great'."""
        model = DummyModel()
        result = model.predict("This is great")
        assert result == "positive"
   
    def test_predict_without_keywords(self):
        """Test that the model returns negative without keywords."""
        model = DummyModel()
        test_inputs = ["test", "random text", "negative sentiment"]
        for input_text in test_inputs:
            result = model.predict(input_text)
            assert result == "negative"

Let us break it down.

Testing the Inference Service with Pytest (MLOps Unit Tests)

The first test class focuses on the service function, not the API:

class TestInferenceService:
    """Test class for inference service."""
   
    def test_predict_returns_string(self):
        """Test that predict() returns a string."""
        result = predict("some input text")
        assert isinstance(result, str)

This test checks that predict() returns a string for a typical input.
If someone later changes predict() to return a dict, tuple, or Pydantic model, this test will fail immediately.

    def test_predict_positive_input(self):
        """Test prediction with positive input."""
        result = predict("This is good")
        assert result == "positive"
   
    def test_predict_negative_input(self):
        """Test prediction with negative input."""
        result = predict("This is bad")
        assert result == "negative"

These 2 tests verify the happy-path behavior:

Text containing "good" should be classified as "positive".
Text without "good" or "great" should default to "negative".

Notice what’s not happening here:

No FastAPI client.
No HTTP calls.
No environment or config loading.

This is pure, fast, deterministic testing of the core service logic.
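One branch these tests do not reach is the except block in predict(), which logs the error and re-raises it. A minimal sketch of how that path could be exercised is shown here. To stay self-contained, it uses a local stand-in (predict_with) that mirrors the service's try/except instead of importing services.inference_service; FailingModel and predict_with are hypothetical names, not part of the repository.

```python
import logging

logger = logging.getLogger("inference_sketch")


class FailingModel:
    """Hypothetical stand-in whose predict() always raises."""

    def predict(self, input_data):
        raise RuntimeError("model unavailable")


def predict_with(model, input_text: str) -> str:
    """Mirror of the service's predict(): log the error, then re-raise it."""
    try:
        return model.predict(input_text)
    except Exception as e:
        logger.error(f"Error during prediction: {e}")
        raise


def test_predict_reraises_model_errors():
    """The caller should see the original exception, not a swallowed one."""
    try:
        predict_with(FailingModel(), "some input text")
    except RuntimeError as exc:
        assert str(exc) == "model unavailable"
    else:
        raise AssertionError("expected RuntimeError to propagate")
```

Against the real module, the same idea would typically use pytest.raises together with the monkeypatch fixture to swap the module-level model for a failing object.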

Testing ML Models in Isolation with Pytest

The second test class targets the model directly:

class TestDummyModel:
    """Test class for DummyModel."""
   
    def test_model_initialization(self):
        """Test that the model initializes correctly."""
        model = DummyModel()
        assert model.model_name == "dummy_classifier"
        assert model.version == "1.0.0"

This verifies that your model is initialized correctly.
In real projects, this might include loading weights, setting up devices, or configuration. Here, it is just model_name and version, but the pattern is the same.

    def test_predict_with_good_word(self):
        """Test that the model returns positive for 'good'."""
        model = DummyModel()
        result = model.predict("This is good")
        assert result == "positive"
   
    def test_predict_with_great_word(self):
        """Test that the model returns positive for 'great'."""
        model = DummyModel()
        result = model.predict("This is great")
        assert result == "positive"

These tests assert that the keyword-based classification logic works: both "good" and "great" map to "positive".

    def test_predict_without_keywords(self):
        """Test that the model returns negative without keywords."""
        model = DummyModel()
        test_inputs = ["test", "random text", "negative sentiment"]
        for input_text in test_inputs:
            result = model.predict(input_text)
            assert result == "negative"

This test loops over several neutral and negative phrases to make sure the model consistently returns “negative” when no positive keywords are present.
This is your guardrail against accidental changes to the keyword logic.
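Because predict() relies on substring matching, a few edge cases follow directly from the rule and are worth pinning down. The sketch below uses a local copy of the keyword logic so it runs on its own; the EDGE_CASES list is illustrative and not part of the repository's test suite.

```python
class DummyModel:
    """Local copy of the tutorial's keyword rule, so the sketch runs on its own."""

    def predict(self, input_data) -> str:
        text = str(input_data).lower()
        if "good" in text or "great" in text:
            return "positive"
        return "negative"


# (input, expected label) pairs that follow from substring matching.
EDGE_CASES = [
    ("GREAT work", "positive"),        # matching is case-insensitive
    ("goodbye", "positive"),           # "good" matches inside a longer word
    ("This is not good", "positive"),  # negation is ignored
    ("", "negative"),                  # empty input has no keywords
    (12345, "negative"),               # non-string input is converted with str()
]


def test_keyword_edge_cases():
    model = DummyModel()
    for text, expected in EDGE_CASES:
        assert model.predict(text) == expected, f"unexpected label for {text!r}"
```

Cases like "goodbye" and "not good" are not bugs in a dummy model, but recording them as tests makes the current behavior explicit before a real model replaces it.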

How to Run Pytest Unit Tests for MLOps Projects

To run just these tests:

pytest tests/unit/ -v

Or with Poetry:

poetry run pytest tests/unit/ -v

You will see output similar to:

tests/unit/test_inference_service.py::TestInferenceService::test_predict_returns_string PASSED
tests/unit/test_inference_service.py::TestInferenceService::test_predict_positive_input PASSED
tests/unit/test_inference_service.py::TestInferenceService::test_predict_negative_input PASSED
tests/unit/test_inference_service.py::TestDummyModel::test_model_initialization PASSED
...

When everything is green, you know:

Your core prediction logic is stable.
The dummy model behaves exactly as designed.
You can now safely move on to integration tests and performance tests in later sections.

Integration Testing in MLOps

Unit tests validate your core Python logic, but integration tests answer a different question:

“Does the entire application behave correctly when all components work together?”

This means testing:

FastAPI app
routing layer
service functions
model
configuration loaded at runtime

All of this happens using FastAPI’s TestClient and your actual running application object (app from main.py).

Let’s break it down.

Using FastAPI TestClient for Integration Testing with Pytest

Your conftest.py defines a reusable client fixture:

import pytest
from fastapi.testclient import TestClient
from main import app

@pytest.fixture
def client():
    """Create a test client for the FastAPI app."""
    return TestClient(app)

How FastAPI TestClient Works for API Testing

TestClient(app) calls your FastAPI application in-process.
No server is launched and no network traffic occurs.
Every test function receives a fresh client (fixtures are function-scoped by default) that exposes the same request interface a real HTTP client would use.

This lets you write code such as:

response = client.get("/health")

as if you were calling a real deployed API, but entirely offline and deterministic.

Testing API Endpoints (/health, /predict)

Here is the integration test code from your repo:

class TestHealthEndpoint:
    def test_health_check_returns_ok(self, client):
        response = client.get("/health")

        assert response.status_code == 200
        assert response.json() == {"status": "ok"}
   
    def test_health_check_has_correct_content_type(self, client):
        response = client.get("/health")

        assert response.status_code == 200
        assert "application/json" in response.headers["content-type"]

What Integration Tests Verify in an MLOps API

Your /health route is reachable.
It always returns a 200 response.
It returns valid JSON.
The content type is correct.

Here is the real FastAPI code being tested (main.py):

@app.get("/health")
async def health_check():
    logger.info("Health check requested")
    return {"status": "ok"}

The handler returns exactly what the tests assert on: a 200 response with the JSON body {"status": "ok"}.

Testing the /predict Endpoint in an MLOps API

Your integration tests call the prediction endpoint:

class TestPredictEndpoint:

    def test_predict_endpoint(self, client):
        response = client.post("/predict", params={"input": "good movie"})
        assert response.status_code == 200
        assert "prediction" in response.json()
   
    def test_predict_positive(self, client):
        response = client.post("/predict", params={"input": "This is a great movie!"})
        assert response.status_code == 200
        assert response.json()["prediction"] == "positive"
   
    def test_predict_negative(self, client):
        response = client.post("/predict", params={"input": "This is bad"})
        assert response.status_code == 200
        assert response.json()["prediction"] == "negative"

This tests:

The endpoint exists and accepts POST requests.
The parameter is correctly passed using params={"input": ...}.
The internal inference logic (service → model) behaves correctly end-to-end.
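Passing params= places the value in the URL query string, not in the request body. As a rough illustration using only the standard library, this is the kind of URL such a request targets and how the value decodes again on the receiving side. The exact encoding of spaces (+ versus %20) can differ between HTTP clients.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Build the URL a client would request for params={"input": "good movie"}.
params = {"input": "good movie"}
url = "/predict?" + urlencode(params)

# On the receiving side, the query string decodes back to the original value.
decoded = parse_qs(urlsplit(url).query)

print(url)      # /predict?input=good+movie
print(decoded)  # {'input': ['good movie']}
```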

Here is the actual API endpoint in your main.py:

@app.post("/predict")
async def predict_route(input: str):
    return {"prediction": predict_service(input)}

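Because input is declared as a plain str parameter, FastAPI reads it from the query string, which is why the tests pass it with params= rather than as a JSON body.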

Testing Documentation Endpoints (/docs, /openapi.json)

FastAPI generates these endpoints automatically, and API consumers often depend on them, so it is worth confirming that they stay reachable.

Your tests:

class TestAPIDocumentation:
    def test_openapi_schema_accessible(self, client):
        response = client.get("/openapi.json")

        assert response.status_code == 200
        schema = response.json()
        assert "openapi" in schema
        assert "info" in schema
   
    def test_swagger_ui_accessible(self, client):
        response = client.get("/docs")

        assert response.status_code == 200
        assert "text/html" in response.headers["content-type"]

What This Ensures

The OpenAPI schema is generated.
Swagger UI loads successfully.
No misconfiguration broke the docs.
Consumers (frontend teams, other ML services, monitoring) can introspect your API.

This is standard for production ML systems.

Testing Error Handling in FastAPI APIs with Pytest

Your code includes error tests that verify robustness:

class TestErrorHandling:
    def test_nonexistent_endpoint_returns_404(self, client):
        response = client.get("/nonexistent")
        assert response.status_code == 404
   
    def test_invalid_method_on_health_endpoint(self, client):
        response = client.post("/health")
        assert response.status_code == 405  # Method Not Allowed
   
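    def test_malformed_requests_handled_gracefully(self, client):
        # Note: as written, this only confirms that /health still responds with 200;
        # it does not send a malformed payload.
        response = client.get("/health")
        assert response.status_code == 200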

Integration Test Breakdown: What Each Test Validates

Table 1: Key API edge case tests and their importance in ensuring system reliability

These tests ensure your service behaves consistently even when clients behave incorrectly.

How to Run Integration Tests with Pytest in MLOps

To run only the integration tests:

Using pytest directly

pytest tests/integration/ -v

With Poetry

poetry run pytest tests/integration/ -v

With Makefile

make test-integration

You will see output like:

tests/integration/test_api_routes.py::TestHealthEndpoint::test_health_check_returns_ok PASSED
tests/integration/test_api_routes.py::TestPredictEndpoint::test_predict_positive PASSED
tests/integration/test_api_routes.py::TestAPIDocumentation::test_swagger_ui_accessible PASSED
...

Green = your API works correctly end-to-end.

Performance and Load Testing with Locust

Performance testing is critical for ML systems because even a lightweight model can become slow, unstable, or unresponsive when many users hit the API at once. With Locust, you can simulate hundreds or thousands of concurrent users calling your ML inference endpoints and measure how your API behaves under pressure.

This section explains why load testing matters, how Locust works, how your actual test file is structured, and how to interpret its results.

Why Load Testing Is Essential for MLOps and ML APIs

ML inference services have unique scaling behaviors:

Model loading requires significant memory.
Inference latency grows non-linearly under load.
CPU/GPU bottlenecks show up only when multiple users hit the system.
Thread starvation can cause cascading failures.
Autoscaling decisions depend on real-world load patterns.

A service that performs well for one user may fail miserably at 50 users.

Load testing ensures:

The API stays responsive under traffic.
Latency stays under acceptable thresholds.
No unexpected failures or timeouts occur.
You understand the system’s scaling limits before going to production.

Locust is perfect for this because it is lightweight, Python-based, and designed for web APIs.

Locust Load Testing Concepts: Users, Spawn Rate, and Tasks Explained

Locust simulates user behavior using simple Python classes.

Users

A “user” is an independent client that continuously makes requests to your API.

Example:

10 users = 10 active clients repeatedly calling /predict.

Spawn rate

How quickly Locust ramps up users.

Example:

spawn rate 2 = add 2 users per second until target is reached.

This helps simulate realistic traffic spikes instead of instantly launching all users.

Tasks

Each simulated user executes a set of tasks (e.g., repeatedly calling the /predict endpoint).

Every task can have a weight:

Higher weight = more frequent calls.

This lets you mimic real user patterns like:

90% predict calls
10% health checks

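Your project's locustfile.py uses the same weighting mechanism, with a single weighted task for /predict and a one-time /health check when each user starts.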
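The arithmetic behind spawn rate and task weights is simple enough to sketch directly. The helper names below are hypothetical, and the 9:1 weights are the illustrative split from the list above. Locust picks tasks at random in proportion to their weights, so the shares are expected values rather than exact counts.

```python
def ramp_up_seconds(target_users: int, spawn_rate: float) -> float:
    """Seconds needed to reach target_users when adding spawn_rate users per second."""
    return target_users / spawn_rate


def task_shares(weights: dict) -> dict:
    """Expected fraction of task executions for each task, given its weight."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


print(ramp_up_seconds(10, 2))                    # 5.0
print(task_shares({"predict": 9, "health": 1}))  # {'predict': 0.9, 'health': 0.1}
```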

Writing the locustfile.py

from locust import HttpUser, task, between

class MLAPIUser(HttpUser):
    """
    Locust user class for testing the ML API.
   
    Simulates a user making requests to the API endpoints.
    """
   
    # Wait between 1 and 3 seconds between requests
    wait_time = between(1, 3)
   
    @task(10)
    def test_predict(self):
        """
        Test the predict endpoint.
       
        This task has weight 10, making it the most frequently called.
        """
        payload = {"input": "The movie was good"}
        with self.client.post("/predict", params=payload, catch_response=True) as response:
            if response.status_code == 200:
                response_data = response.json()
                if "prediction" in response_data:
                    response.success()
                else:
                    response.failure(f"Missing prediction in response: {response_data}")
            else:
                response.failure(f"HTTP {response.status_code}")
   
    def on_start(self):
        """
        Called when a user starts testing.
       
        Used for setup tasks like authentication.
        """
        # Verify the API is reachable
        response = self.client.get("/health")
        if response.status_code != 200:
            print(f"Warning: API health check failed with status {response.status_code}")

What This Locust Load Test Validates in an MLOps API

Creates a simulated user (MLAPIUser) that calls /predict.
Gives the /predict task a weight of 10; because it is the only task defined, every task execution is a /predict call.
Sends realistic input (“The movie was good”).
Validates:
- Response code is 200.
- JSON contains “prediction”.
Marks failures explicitly for clean reporting.
On startup, each user verifies that /health works.

This matches your API perfectly:

/predict is POST with query parameter input=...
/health is GET and returns status OK

The locustfile works against this API as written, so no changes are needed to start load testing.

Running Locust: Headless Mode vs Web UI Dashboard

Locust supports two modes.

A. Web UI Mode (Interactive Dashboard)

Launch Locust:

locust -f tests/performance/locustfile.py --host=http://localhost:8000

Then open:

http://localhost:8089

You will see a dashboard where you can:

Set number of users
Set spawn rate
Start/stop tests
View real-time stats

B. Headless Mode (Automated CI/CD or scripting)

You already have a script:

software-engineering-mlops-lesson2/scripts/run_locust.sh

Run:

./scripts/run_locust.sh http://localhost:8000 10 2 5m

This executes:

10 users
spawn rate 2 users per second
run time 5 minutes
save HTML report

No UI; perfect for pipelines.

Generating Locust Load Testing Reports for ML APIs

Your script uses:

--html="reports/locust_reports/locust_report_.html"

Which produces files like:

reports/locust_reports/locust_report_20251030_031331.html
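The contents of run_locust.sh are not reproduced here, so the following is only a sketch of an equivalent headless invocation. It assumes the script's positional arguments map to Locust's --users, --spawn-rate, and --run-time options, and it infers the timestamp format from the example filename above. build_locust_command is a hypothetical helper.

```python
from datetime import datetime


def build_locust_command(host, users, spawn_rate, run_time, now=None):
    """Return the argument list for a headless Locust run with an HTML report."""
    # Timestamp format inferred from the example report name (an assumption).
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    report = f"reports/locust_reports/locust_report_{stamp}.html"
    return [
        "locust",
        "-f", "tests/performance/locustfile.py",
        "--host", host,
        "--users", str(users),
        "--spawn-rate", str(spawn_rate),
        "--run-time", run_time,
        "--headless",
        "--html", report,
    ]


cmd = build_locust_command("http://localhost:8000", 10, 2, "5m")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from a CI job. Keeping the timestamp in the filename means earlier reports are not overwritten.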

Each report includes:

Requests per second (RPS)
Failure stats
Full latency distribution
Percentiles (50th, 95th, 99th)
Charts of active users and response times

These HTML reports are great for:

Comparing deployments
Regression testing API performance
Flagging slow model versions
Archiving performance history

Everything is already correctly set up in your repo.

Understanding Test Metrics (RPS, failures, latency, P95/P99)

Locust gives several performance metrics you must understand for ML systems.

Requests per Second (RPS)

How many inference calls your API can handle per second.

CPU-bound models lead to low RPS
Simple models lead to high RPS

Increasing the number of users shows where your model and server saturate.
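When reading RPS, keep in mind that the simulated users themselves limit throughput. With wait_time = between(1, 3), each user pauses about 2 seconds on average between requests, so a small user count cannot generate much traffic no matter how fast the server is. A back-of-the-envelope estimate, assuming closed-loop users and an illustrative 40 ms latency:

```python
def estimated_rps(users: int, mean_wait_s: float, mean_latency_s: float) -> float:
    """Approximate steady-state requests per second for closed-loop users.

    Each user repeats: send a request, wait for the response, then pause.
    """
    return users / (mean_wait_s + mean_latency_s)


# between(1, 3) averages 2 seconds of wait; 40 ms latency is an assumed value.
print(round(estimated_rps(10, 2.0, 0.040), 2))   # 4.9
print(round(estimated_rps(100, 2.0, 0.040), 2))  # 49.02
```

If the measured RPS falls well below this estimate as users increase, latency is growing and the server is likely approaching saturation.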

Failures

Locust marks a request as failed when:

Status code ≠ 200
Response JSON does not contain "prediction"
Timeout occurs
Server returns an internal error

Your catch_response=True logic handles this explicitly.

This prevents “hidden” failures.

Latency (ms)

Response time per request, typically measured in milliseconds.

For ML inference services, latency is often the most important metric.

You will see:

Average latency
Median (P50)
Slowest (max latency)

P95 / P99 (Tail Latency)

The 95th and 99th percentile response times.

These capture tail behavior: the response times that the slowest 5% and 1% of requests meet or exceed.

Example:

P50 = 40 ms
P95 = 210 ms
P99 = 540 ms

This means:

Most users see fast responses, but a small % experience major slowdowns.

This is common in ML workloads due to:

Model warmup
Thread contention
Python GIL contention
Model cache misses

Production Service Level Objectives (SLOs) usually track P95 and P99, not averages.
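To make the gap between averages and tail latency concrete, here is a small sketch that computes nearest-rank percentiles over made-up latency samples chosen to reproduce the example figures above. Locust calculates its own statistics internally; this only illustrates the definition.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies (ms): 90 fast requests, 8 slow ones, 2 very slow ones.
latencies_ms = [40] * 90 + [210] * 8 + [540] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean_ms:.1f}")  # mean=63.6
for pct in (50, 95, 99):
    print(f"P{pct}={percentile(latencies_ms, pct)}")  # P50=40, P95=210, P99=540
```

The mean of 63.6 ms looks healthy, yet 1 in 20 requests takes at least 210 ms, which is why SLOs are usually written against percentiles.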

MLOps Test Configuration: YAML and Environment Variables

ML systems behave differently across production, development, and testing environments.

Your Lesson 2 codebase separates these environments cleanly using:

A test-specific YAML config
A modified BaseSettings loader
.env overrides for test mode

This ensures that tests run quickly, deterministically, and without polluting real environment settings.

Let’s break down how this works.

Understanding test_config.yaml for MLOps Testing

# Test Configuration
environment: "test"
log_level: "DEBUG"

# API Configuration
api_host: "127.0.0.1"
api_port: 8000
debug: true

# Performance Testing
performance:
  baseline_users: 10
  spawn_rate: 2
  test_duration: "5m"

# Model Configuration
model:
  name: "dummy_classifier"
  version: "1.0.0"

What test_config.yaml Controls in MLOps Pipelines

Table 2: Configuration keys and their roles in test environment setup

This config prevents tests from accidentally picking up production configs.

Overriding Application Configuration in Test Mode

Your test environment uses a special configuration loader inside:

core/config.py

Here is the real code:

def load_config() -> Settings:
    # Load base settings from environment
    settings = Settings()
   
    # Load additional configuration from YAML if it exists
    config_path = "configs/test_config.yaml"
    if os.path.exists(config_path):
        yaml_config = load_yaml_config(config_path)
       
        # Override settings with YAML values if they exist
        for key, value in yaml_config.items():
            if hasattr(settings, key):
                setattr(settings, key, value)
   
    return settings

How Configuration Overrides Work: YAML and Environment Variables

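Step 1: BaseSettings loads values from environment variables (the .env file and operating system (OS) variables) and falls back to field defaults.
Step 2: The YAML configuration overrides them. Any top-level key in test_config.yaml that matches an attribute on Settings replaces the value from Step 1; keys without a matching attribute are ignored.
Final output:
The application runs with the test configuration for every overridden field, regardless of what the development or production environment variables contain.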
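The override order can be sketched with the standard library alone. The Settings dataclass and the yaml_config dict below are hypothetical stand-ins for the project's Settings class and the parsed test_config.yaml; only the hasattr/setattr loop mirrors load_config().

```python
import os
from dataclasses import dataclass, field


@dataclass
class Settings:
    """Hypothetical stand-in for the project's Settings class."""

    api_host: str = field(default_factory=lambda: os.getenv("API_HOST", "0.0.0.0"))
    log_level: str = field(default_factory=lambda: os.getenv("LOG_LEVEL", "INFO"))


def apply_overrides(settings: Settings, overrides: dict) -> Settings:
    """Copy matching keys onto settings; keys without a matching attribute are skipped."""
    for key, value in overrides.items():
        if hasattr(settings, key):
            setattr(settings, key, value)
    return settings


# Plain dict standing in for the parsed test_config.yaml.
yaml_config = {
    "api_host": "127.0.0.1",
    "log_level": "DEBUG",
    "model": {"name": "dummy_classifier"},
}

settings = apply_overrides(Settings(), yaml_config)
print(settings.api_host, settings.log_level)  # 127.0.0.1 DEBUG
print(hasattr(settings, "model"))             # False
```

Note the second print: a nested section such as model only takes effect if the settings object declares an attribute with that name. Whether the real Settings class does so is not shown in this lesson.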

Why Configuration Management Matters in MLOps Testing

Integration tests always use the same port, host, and log settings.
Tests are repeatable and deterministic.
You never accidentally load production API keys or endpoints.
CI/CD pipelines get consistent behavior.

This pattern is very common in real-world MLOps systems.

Using Environment Variables for Test Isolation

Your test environment uses a .env.example file:

# API Configuration
API_PORT=8000
API_HOST=0.0.0.0
DEBUG=true

# Environment
ENVIRONMENT=test

# Logging
LOG_LEVEL=DEBUG

During setup, users run:

cp .env.example .env

This creates the .env used during tests.

Why Test-Specific .env Variables Matter

Table 3: Environment variables and their impact on test execution

Combined with YAML Overrides

.env → applies defaults

test_config.yaml → overrides final values

This gives you a flexible and safe configuration stack.

Code Quality in MLOps: Linting, Formatting, and Static Analysis Tools

Testing ensures correctness, but code quality tools ensure that your ML system remains maintainable as it grows.

In Lesson 2, you introduce a full suite of professional-quality tooling:

flake8 for linting
Black for auto-formatting
isort for import ordering
MyPy for static typing
Makefile for automation consistency

Together, they enforce the same engineering discipline used on real production ML teams at scale.

Linting Python Code with flake8

Linting catches code smells, stylistic issues, and subtle bugs before they hit production.

Your repository includes a real .flake8 file:

[flake8]
max-line-length = 88
extend-ignore = E203, W503
exclude =
    .git,
    __pycache__,
    .venv,
    venv,
    env,
    build,
    dist,
    *.egg-info,
    .pytest_cache,
    .mypy_cache
per-file-ignores =
    __init__.py:F401
max-complexity = 10

What your flake8 setup enforces

88-character line limit (matches Black)
Ignores E203 and W503, two style checks that conflict with Black's formatting
Avoids checking generated or virtual-env directories
Allows unused imports only in __init__.py files
Enforces a maximum complexity score of 10
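
To get a feel for what a complexity score measures, here is a rough sketch that counts decision points with Python's ast module. It is only an approximation written for illustration: flake8 delegates the real check to the mccabe plugin, which may count some constructs differently, and the approximate_complexity helper and the sample function are hypothetical.

```python
import ast

# Node types treated as decision points in this approximation.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)


def approximate_complexity(source: str) -> int:
    """Return 1 plus the number of branching nodes found in the source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


SAMPLE = '''
def classify(text):
    if not text:
        return "empty"
    for word in text.split():
        if word == "good" or word == "great":
            return "positive"
    return "negative"
'''

score = approximate_complexity(SAMPLE)
print(score)
```

The sample has two if statements, one loop, and one boolean operator, so it stays well under the limit of 10. Functions that exceed the limit are usually candidates for splitting into smaller helpers.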

Run flake8 manually

poetry run flake8 .

Or via Makefile

make lint

Linting becomes part of your day-to-day workflow and prevents style drift across your ML services.

Formatting Python Code with Black Pipelines

Black is an automatic code formatter; it rewrites Python code into a consistent style.

Your Lesson 2 pyproject.toml includes:

[tool.black]
line-length = 88
target-version = ['py39']
include = '\.pyi?$'

This means:

Python source files (.py) and stub files (.pyi) are formatted.
Max line length is 88 characters.
Black targets Python 3.9 syntax.

Format all code:

poetry run black .

Or using the Makefile shortcut:

make format

Black removes tedious decisions about spacing, commas, and line breaks, ensuring all contributors share the same style.

Using isort to Manage Python Imports

isort automatically manages import sorting and grouping.

Your pyproject.toml contains:

[tool.isort]
profile = "black"
multi_line_output = 3

This aligns isort’s output with Black’s formatting rules, avoiding conflicts.

How to Run isort for Clean Python Imports

poetry run isort .

Or via Makefile:

make format

Why This Matters

As ML services grow, import lists become messy. isort keeps them clean and consistent, which improves readability.

Static Type Checking with MyPy for MLOps Codebases

Static typing is increasingly important in MLOps systems, especially when passing models, configs, and data structures between services.

Your repo contains a full mypy.ini:

[mypy]
python_version = 3.9
warn_return_any = True
warn_unused_configs = True
disallow_untyped_defs = False
ignore_missing_imports = True

[mypy-tests.*]
disallow_untyped_defs = False

[mypy-locust.*]
ignore_missing_imports = True

What This Config Enforces

Warns when a function with a declared return type returns a value typed as Any
Warns about unused config sections
Does not require type hints everywhere (reasonable for ML codebases)
Suppresses errors for imports that have no type information (common with ML libraries)
Allows untyped defs in tests
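
As an illustration of the warn_return_any setting, the sketch below shows the kind of function MyPy is expected to flag and one way to satisfy it. Both function names are hypothetical and do not come from the repository.

```python
import json
from typing import Any, Dict


def load_config_loose(raw: str) -> Dict[str, Any]:
    # json.loads is typed as returning Any, so with warn_return_any = True
    # MyPy is expected to warn that this function returns Any.
    return json.loads(raw)


def load_config_checked(raw: str) -> Dict[str, Any]:
    # Narrowing the value with isinstance gives MyPy a concrete type to verify.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("config must be a JSON object")
    return data


config = load_config_checked('{"api_port": 8000}')
print(config["api_port"])
```

Both functions behave the same for valid input. The checked version also fails early with a clear error when the payload is not an object, which is the kind of runtime mistake the type checker nudges you to handle.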

Run MyPy

poetry run mypy .

Or via Makefile:

make type-check

Why MyPy Is Critical in ML Systems

Prevents silent type errors (e.g., passing a list where a tensor is expected)
Catches config mistakes before runtime
Improves refactor safety for large ML codebases

Using a Makefile to Automate MLOps Testing and Code Quality

Your Makefile automates all key development tasks:

make test          # Run all tests
make test-unit     # Unit tests only
make test-integration
make format        # Black + isort
make lint          # flake8
make type-check    # mypy
make load-test     # Locust performance tests
make clean         # Reset environment

This ensures:

Every developer uses the same commands
CI/CD pipelines can call the same interface
Tooling stays consistent across machines

Example workflow for contributors:

make format
make lint
make type-check
make test

If all commands pass, you know your code is clean, consistent, and ready for production.

Automating Testing with a Pytest Test Runner Script

As your ML system grows, running dozens of unit, integration, and performance tests manually becomes tedious and error-prone.

Lesson 2 includes a fully automated test runner (scripts/run_tests.sh) that enforces a predictable, repeatable workflow for your entire test suite.

This script acts like a miniature CI pipeline that you can run locally. It prints structured logs, enforces failure conditions, and ensures that no test is accidentally skipped.

Running Automated Tests with run_tests.sh

Your repository includes a fully functional test runner:

#!/bin/bash

# Test Runner Script for MLOps Lesson 2

set -e

echo "🧪 Running MLOps Lesson 2 Tests..."

# Colors for output
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'

print_status() {
    echo -e "${GREEN}✅ $1${NC}"
}

print_warning() {
    echo -e "${YELLOW}⚠️  $1${NC}"
}

print_error() {
    echo -e "${RED}❌ $1${NC}"
}

# Run unit tests
echo ""
echo "📝 Running unit tests..."
# Run pytest inside the if condition so set -e does not exit before the failure branch
if poetry run pytest tests/unit/ -v; then
    print_status "Unit tests passed"
else
    print_error "Unit tests failed"
    exit 1
fi

# Run integration tests
echo ""
echo "🔗 Running integration tests..."
# Run pytest inside the if condition so set -e does not exit before the failure branch
if poetry run pytest tests/integration/ -v; then
    print_status "Integration tests passed"
else
    print_error "Integration tests failed"
    exit 1
fi

echo ""
print_status "All tests completed successfully!"

How to Run It

./scripts/run_tests.sh

or, via Makefile:

make test

What It Does

Runs unit tests
Runs integration tests
Stops immediately (set -e) if anything fails
Prints colored output for clarity
Provides a clear pass/fail summary

This mirrors real CI pipelines where a failing test stops deployment.
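
The fail-fast behavior can also be sketched in Python, which may help if you want the same gate on a machine without bash. This is an illustrative equivalent, not part of the repository: run_stages is a hypothetical helper, and the demo uses harmless interpreter commands in place of the real pytest invocations.

```python
import subprocess
import sys
from typing import List, Sequence

# The two stages the shell script runs (listed for reference, not executed here).
TEST_STAGES = [
    ["poetry", "run", "pytest", "tests/unit/", "-v"],
    ["poetry", "run", "pytest", "tests/integration/", "-v"],
]


def run_stages(stages: Sequence[List[str]]) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for command in stages:
        completed = subprocess.run(command)
        if completed.returncode != 0:
            print(f"Stage failed: {' '.join(command)}")
            return completed.returncode
    print("All stages passed")
    return 0


# Demo with stand-in commands: the second stage fails, so the third never runs.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
fail = [sys.executable, "-c", "raise SystemExit(3)"]
exit_code = run_stages([ok, fail, ok])
print(exit_code)
```

Passing TEST_STAGES to run_stages and handing the result to sys.exit would give a CI system the same non-zero signal the shell script produces.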

Understanding Pytest Output and Test Results

When you run the script, you will typically see output like this:

🧪 Running MLOps Lesson 2 Tests...

📝 Running unit tests...
============================= test session starts ==============================
collected 7 items

tests/unit/test_inference_service.py::TestInferenceService::test_predict_returns_string PASSED
tests/unit/test_inference_service.py::TestInferenceService::test_predict_positive_input PASSED
tests/unit/test_inference_service.py::TestInferenceService::test_predict_negative_input PASSED
tests/unit/test_inference_service.py::TestDummyModel::test_model_initialization PASSED
tests/unit/test_inference_service.py::TestDummyModel::test_predict_with_good_word PASSED
tests/unit/test_inference_service.py::TestDummyModel::test_predict_with_great_word PASSED
tests/unit/test_inference_service.py::TestDummyModel::test_predict_without_keywords PASSED

============================== 7 passed in 0.45s ===============================
✅ Unit tests passed

Then integration tests:

🔗 Running integration tests...

tests/integration/test_api_routes.py::TestHealthEndpoint::test_health_check_returns_ok PASSED
tests/integration/test_api_routes.py::TestPredictEndpoint::test_predict_positive PASSED
tests/integration/test_api_routes.py::TestAPIDocumentation::test_swagger_ui_accessible PASSED
tests/integration/test_api_routes.py::TestErrorHandling::test_nonexistent_endpoint_returns_404 PASSED

============================== 8 passed in 0.78s ===============================
✅ Integration tests passed

Finally:

✅ All tests completed successfully!

Why Automated Testing Workflows Matter in MLOps

You see exactly which tests failed.
You immediately know whether the API is healthy.
You build the habit of treating tests as a gatekeeper before shipping ML code.

This is foundational MLOps workflow discipline.

Integrating Pytest into CI/CD Pipelines

Your test runner is already written as if it were part of CI.

Very soon, you will plug this into:

GitHub Actions
GitLab CI
CircleCI
AWS CodeBuild
Azure DevOps

A typical GitHub Actions step would look like:

- name: Run Tests
  run: ./scripts/run_tests.sh

Since your script exits with non-zero status on failures, the CI job fails automatically.

What this enables in production ML workflows:

No pull request gets merged unless tests pass
Deployments are blocked if integration tests fail
Load testing can be added as a gated step
Test failures provide early feedback on regressions
Teams enforce consistent standards across developers

You already have everything CI needs:

A deterministic test runner
A strict exit-on-fail system
Separate unit and integration test layers
Makefile wrappers for automation
Poetry ensuring repeatable environments

Once you introduce CI/CD in later lessons, these scripts plug in seamlessly.

Automating Load Testing in MLOps with Locust Scripts

Performance testing becomes essential once an ML API starts supporting real traffic. You want confidence that your inference service will not collapse under load, that p95/p99 latencies remain acceptable, and that the system behaves predictably when scaling horizontally.

Manually running Locust is fine for experimentation, but production MLOps requires automated, repeatable load tests. Lesson 2 provides a dedicated script (run_locust.sh) that lets you run performance tests with a single command and automatically generates HTML reports for analysis.

Running Automated Locust Load Tests with run_locust.sh

#!/bin/bash

# Simple Locust Load Testing Script for MLOps Lesson 2

set -e

echo "🚀 Starting Locust Load Testing..."

# Configuration
HOST=${1:-"http://localhost:8000"}
USERS=${2:-10}
SPAWN_RATE=${3:-2}
RUN_TIME=${4:-"5m"}

echo "🔧 Configuration: $USERS users, spawn rate $SPAWN_RATE, run time $RUN_TIME"

# Create reports directory
mkdir -p reports/locust_reports

# Check if the API is running
echo "🏥 Checking if API is running..."
if ! curl -s "$HOST/health" > /dev/null; then
    echo "❌ API is not reachable at $HOST"
    echo "Please start the API server first with: python main.py"
    exit 1
fi

echo "✅ API is reachable"

# Run Locust load test
echo "🧪 Starting load test..."

TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
HTML_REPORT="reports/locust_reports/locust_report_$TIMESTAMP.html"

poetry run locust \
    -f tests/performance/locustfile.py \
    --host="$HOST" \
    --users="$USERS" \
    --spawn-rate="$SPAWN_RATE" \
    --run-time="$RUN_TIME" \
    --html="$HTML_REPORT" \
    --headless

echo "✅ Load test completed!"
echo "📊 Report: $HTML_REPORT"

How to Run It

Basic load test:

./scripts/run_locust.sh

10 users, spawn rate 2 users/sec, run for 5 minutes.

Custom parameters:

./scripts/run_locust.sh http://localhost:8000 30 5 2m

This means:

30 users total
5 users per second spawn rate
2-minute runtime
Tests /predict endpoint repeatedly (because of locustfile.py)

What This Script Automates

API health check before running
Creates the reports directory and a timestamped HTML report file
Runs Locust in headless mode
Stores HTML reports for analysis
Fails gracefully when API is unreachable

This gives you a push-button reproducible performance test, a key requirement in professional MLOps.

Automatically Generating Load Testing Reports for ML APIs

Every run creates a unique HTML report:

reports/locust_reports/
    locust_report_20251203_031331.html
    locust_report_20251203_041215.html
    ...

This file includes:

Requests per second (RPS)
Response time percentiles (p50, p90, p95, p99)
Failure rates
Total requests
Charts for concurrency vs performance
Per-endpoint performance metrics

You can open the report in your browser:

open reports/locust_reports/locust_report_20251203_031331.html

(Windows)

start reports\locust_reports\locust_report_XXXX.html

Why This Is Important

Performance regressions are one of the most common ML service failures:

model upgrades slow down inference unintentionally
logging overhead increases latency
new preprocessing increases CPU usage
hardware changes alter throughput

By keeping each test run stored, you can compare historical performance.

This is the foundation of automatic performance regression detection.
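
Because each report name embeds a timestamp, ordering runs for comparison takes only a few lines. The sketch below assumes the locust_report_YYYYMMDD_HHMMSS.html naming shown above; the sort_reports helper is hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

PREFIX = "locust_report_"
SUFFIX = ".html"


def sort_reports(filenames: List[str]) -> List[Tuple[datetime, str]]:
    """Return (timestamp, filename) pairs ordered from oldest to newest."""
    dated = []
    for name in filenames:
        if name.startswith(PREFIX) and name.endswith(SUFFIX):
            stamp = name[len(PREFIX):-len(SUFFIX)]
            # Same format the script passes to `date`: %Y%m%d_%H%M%S
            dated.append((datetime.strptime(stamp, "%Y%m%d_%H%M%S"), name))
    return sorted(dated)


reports = sort_reports(
    ["locust_report_20251203_041215.html", "locust_report_20251203_031331.html"]
)
latest = reports[-1][1]
print(latest)
```

From here, a comparison job could open the two most recent reports side by side or feed them into whatever regression check your team prefers.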

Preparing Load Testing for CI/CD and Cloud MLOps Pipelines

Your load testing script is already CI-ready.

Here is how it fits into a production MLOps pipeline.

Option 1 — GitHub Actions

- name: Run Load Tests
  run: ./scripts/run_locust.sh http://localhost:8000 20 5 1m

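Since the script exits non-zero when the health check fails, and Locust itself exits non-zero by default when requests fail, it becomes a gated step:

Deployment is blocked if the API is unreachable or returns errors under load.
Enforcing a latency budget (for example, a p95 limit) requires an additional threshold check on top of the script.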

Option 2 — Nightly Performance Jobs

Teams often run Locust nightly to catch degradations early:

baseline: 20 users
alert if p95 > 300 ms
alert if failures > 1%

Reports are archived automatically via your script.
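
The alert rules above can be sketched as a small threshold check. This is an illustration under assumptions: the script itself does not compute these values, so the latency samples and failure counts are presumed to come from your own parsing of Locust's output, and the percentile and check_thresholds helpers are hypothetical.

```python
import math
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_thresholds(
    latencies_ms: List[float],
    failures: int,
    total: int,
    p95_limit_ms: float = 300.0,
    max_failure_rate: float = 0.01,
) -> List[str]:
    """Return alert messages; an empty list means the run is within limits."""
    alerts = []
    p95 = percentile(latencies_ms, 95)
    if p95 > p95_limit_ms:
        alerts.append(f"p95 {p95:.0f} ms exceeds the {p95_limit_ms:.0f} ms limit")
    failure_rate = failures / total
    if failure_rate > max_failure_rate:
        alerts.append(f"failure rate {failure_rate:.1%} exceeds {max_failure_rate:.0%}")
    return alerts


# 20 samples: most requests are fast, but the slowest two push p95 over the limit.
latencies = [100.0] * 18 + [350.0, 400.0]
alerts = check_thresholds(latencies, failures=1, total=200)
print(alerts)
```

A nightly job could exit non-zero whenever the returned list is not empty, turning the alert rules into an enforceable gate.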

Option 3 — Cloud Load Testing (AWS/GCP/Azure)

Your script can run inside:

AWS CodeBuild
Azure Pipelines
Google Cloud Build

Simply modify the host:

./scripts/run_locust.sh https://staging.mycompany.com/api 50 10 10m

Why CI Load Tests Matter

Prevents slow releases from being deployed
Ensures model swaps do not tank performance
Protects SLAs (Service Level Agreements)
Helps capacity planning and autoscaling decisions
Detects bottlenecks before customers do

Your repository already contains everything needed to industrialize performance testing.

Test Coverage in MLOps: Measuring and Improving Code Coverage

Even with strong unit, integration, and performance testing, you still need a way to quantify how much of your codebase is actually exercised. This is where test coverage comes in. Coverage tools show you which lines are tested, which are skipped, and where hidden bugs may still be lurking. This is especially important in ML systems, where subtle code paths (error handling, preprocessing, retry logic) can easily be missed.

Your Lesson 2 environment includes pytest-cov, allowing you to generate detailed coverage reports in a single command.

Using pytest-cov to Measure Test Coverage

Coverage is enabled simply by adding --cov flags to pytest.

Basic usage:

pytest --cov=.

Your repo's pyproject.toml declares pytest-cov under [tool.poetry.group.dev.dependencies], so poetry install sets it up and coverage works out of the box.

A more detailed command:

pytest --cov=. --cov-report=term-missing

This reports:

total coverage percentage
statement and miss counts for each file
line numbers of the statements that were not executed

Example output you might see:

---------- coverage: platform linux, python 3.9 ----------
Name                                Stmts   Miss  Cover
--------------------------------------------------------
services/inference_service.py          22      0   100%
models/dummy_model.py                  16      0   100%
core/config.py                         40      8    80%
core/logger.py                         15      0   100%
tests/unit/test_inference_service.py   28      0   100%
--------------------------------------------------------
TOTAL                                 121      8    93%

This gives immediate visibility into which modules need more test attention.
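
The Cover column is simple arithmetic on the Stmts and Miss columns. The sketch below recomputes it from the example figures above; the cover_percent helper is hypothetical, and coverage.py applies its own rounding rules, so treat this as an approximation of how the numbers relate.

```python
from typing import Dict, Tuple

# (statements, missed) per module, copied from the example output above.
REPORT: Dict[str, Tuple[int, int]] = {
    "services/inference_service.py": (22, 0),
    "models/dummy_model.py": (16, 0),
    "core/config.py": (40, 8),
    "core/logger.py": (15, 0),
    "tests/unit/test_inference_service.py": (28, 0),
}


def cover_percent(statements: int, missed: int) -> int:
    """Percentage of statements executed, rounded to a whole number."""
    return round(100 * (statements - missed) / statements)


total_statements = sum(stmts for stmts, _ in REPORT.values())
total_missed = sum(miss for _, miss in REPORT.values())
config_cover = cover_percent(*REPORT["core/config.py"])
total_cover = cover_percent(total_statements, total_missed)
print(config_cover, total_cover)
```

Note that the total is weighted by statement count, so one large, poorly covered module lowers the overall figure more than several small ones.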

How to Measure Code Coverage in MLOps Projects

To formally measure coverage for Lesson 2, run:

pytest -v --cov=. --cov-report=html

This generates a full HTML report inside:

htmlcov/index.html

Open it in your browser:

open htmlcov/index.html

(Windows)

start htmlcov\index.html

The HTML report visualizes:

executed vs missed lines
branch coverage (when pytest is run with --cov-branch)
per-module summaries
clickable source code with line highlighting

HTML reports like this are widely used in industry pipelines.

Integrating Coverage into Your Workflow

Your Makefile could easily support it:

make coverage

But even without that, pytest-cov gives you everything you need to evaluate test completeness.

How to Increase Test Coverage in MLOps Pipelines

ML systems often have unusual testing challenges:

multiple code paths depending on data
dynamic model loading
error cases that only appear in production
preprocessing/postprocessing steps
branching logic based on config values
retry and timeout logic
logging behavior that might hide bugs

To increase coverage meaningfully:

1. Test failure modes

Example: model not loaded, invalid input, exceptions in service layer.

2. Test alternative branches

For example, your dummy model has:

if "good" in text or "great" in text:
    return "positive"
return "negative"

Coverage increases when you test:

positive branch
fallback branch
edge cases like empty strings

3. Test configuration-dependent behavior

Since your system loads from:

.env
YAML
runtime values

Try testing scenarios where each layer overrides the one before it.

4. Test logging paths

Logging is crucial in MLOps, and ensuring logs appear where expected also contributes to coverage.

5. Test the API under different payloads

Missing parameters, malformed types, unexpected values.

6. Test integration between modules

Even simple ML systems can break across module boundaries, so testing interactions raises coverage dramatically.
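
The first two items in the list (failure modes and alternative branches) can be sketched as plain assert-based tests. The DummyModel here mirrors the keyword rule shown earlier, while the InferenceService wrapper and its "model not loaded" error are hypothetical and exist only to illustrate a failure-mode test. Under pytest you would typically use pytest.raises instead of the manual try/except.

```python
from typing import Optional


class DummyModel:
    """Stand-in that mirrors the keyword rule shown above."""

    def predict(self, text: str) -> str:
        if "good" in text or "great" in text:
            return "positive"
        return "negative"


class InferenceService:
    """Hypothetical service wrapper used only to illustrate a failure mode."""

    def __init__(self, model: Optional[DummyModel]) -> None:
        self.model = model

    def predict(self, text: str) -> str:
        if self.model is None:
            raise RuntimeError("model not loaded")
        return self.model.predict(text)


def test_positive_branch() -> None:
    assert DummyModel().predict("a great movie") == "positive"


def test_fallback_branch() -> None:
    assert DummyModel().predict("a dull movie") == "negative"


def test_empty_string() -> None:
    assert DummyModel().predict("") == "negative"


def test_model_not_loaded() -> None:
    try:
        InferenceService(model=None).predict("good")
    except RuntimeError:
        return
    raise AssertionError("expected RuntimeError when no model is loaded")


# Run the tests directly so the sketch works without pytest installed.
for test in (test_positive_branch, test_fallback_branch, test_empty_string, test_model_not_loaded):
    test()
print("all branch tests passed")
```

Each test targets one path through the code, so a coverage report after running them would show both branches of the keyword rule and the error path as executed.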

Recommended Test Coverage Targets for MLOps Systems

High coverage is good, but perfection is unrealistic and unnecessary.

Here are industry-grade ML-specific targets:

Table 4: Recommended test coverage ranges across system components

Why You Do Not Aim for 100%

ML models are often treated as black boxes
Some branches (especially failure conditions) are difficult to simulate
Performance code paths are not always practical to test

A strong MLOps system targets:

Overall coverage: 80-90%

This ensures the most important logic is covered while avoiding diminishing returns.

Critical paths: 100%

Inference, preprocessing, conversion, routing, safety checks.

Performance-sensitive code: covered via load tests

This is why Locust complements pytest rather than replacing it.

Summary

In this lesson, you learned how to make ML systems safe, correct, and production-ready through a full testing and validation workflow. You started by understanding why ML services need far more than “just unit tests,” and how a layered approach (unit, integration, and performance tests) creates confidence in both the code and the behavior of the system. You then explored a real test layout with dedicated folders, fixtures, and isolation, and saw how each type of test validates a different piece of the pipeline.

From there, you implemented unit tests for the inference service and dummy model, followed by integration tests that exercise real FastAPI endpoints, documentation routes, and error handling. You also learned how to perform load testing with Locust, simulate concurrent users, generate performance reports, and interpret latency and failure metrics. This is an essential skill for production ML APIs.

Finally, you covered the tools that keep an ML codebase clean and maintainable: linting, formatting, static typing, and the Makefile commands that tie everything together. You closed with automated test runners, load-test scripts, and coverage reporting, giving you an end-to-end workflow that mirrors real MLOps engineering practice.

By now, you have seen how professional ML systems are tested, validated, measured, and maintained. This sets you up for the next module, where we will begin building data pipelines and reproducible ML workflows.

Citation Information

Singh, V. “Pytest Tutorial: MLOps Testing, Fixtures, and Locust Load Testing,” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/4ztdu

@incollection{Singh_2026_pytest-tutorial-mlops-testing-fixtures-locust-load-testing,
  author = {Vikram Singh},
  title = {{Pytest Tutorial: MLOps Testing, Fixtures, and Locust Load Testing}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/4ztdu},
}

FastAPI for MLOps: Python Project Structure and API Best Practices

Vikram Singh — Mon, 13 Apr 2026 12:45:00 +0000

Table of Contents

FastAPI for MLOps: Python Project Structure and API Best Practices
Introduction

What You Will Build and Learn
Why Software Engineering Comes First in MLOps Best Practices
Where This Fits in the Overall Curriculum

Python Project Structure Best Practices for MLOps

How to Structure a Python Project with src/ Layout
Python Project Structure Explained: Repository Walkthrough
Python Project Structure Best Practices: Directory Breakdown
How This Structure Scales to Larger ML Systems

Managing Python Dependencies with Poetry for ML Projects

Python Poetry vs PDM vs UV: Choosing a Package Manager for MLOps
Understanding pyproject.toml for Python Project Configuration
Installing Dependencies (Poetry, PDM, UV)
Managing Python Virtual Environments for Reproducible MLOps
Automating MLOps Setup with Python Environment Scripts

Configuration Management in MLOps: YAML, .env, and Pydantic

Using Pydantic Settings for MLOps Configuration Management
What This Means for MLOps Configuration and System Design
Loading YAML and Merging Layers
Designing YAML Configs for Scalable MLOps Pipelines
Using .env Files for Secure MLOps Configuration
Why Configuration Management Matters in MLOps Systems
How the App Uses Configuration (src/main.py)
How FastAPI Uses Configuration in Production MLOps Systems
Extending MLOps Configuration Safely in Python Projects

Logging Best Practices for MLOps and FastAPI Applications

Why Logging Is Critical for ML Systems
Logger Initialization
Log Formatting and Levels
Logging Across the App
Together, This Gives Us Structured, Traceable Behavior Across the App

FastAPI for MLOps: Building a Production ML API

Why FastAPI Is Ideal for MLOps API Development
Creating a FastAPI Application for Machine Learning APIs
Implementing Health Check Endpoints in FastAPI (MLOps)
Building a FastAPI Prediction Endpoint for ML Models
Behind This Endpoint Is Your Prediction Engine
Deploying FastAPI with Uvicorn for MLOps Applications
Auto-Generated API Docs (Swagger, ReDoc)

MLOps Architecture: Service Layer Design Patterns

Why We Separate Services from Routes
Designing an ML Inference Service
Scaling MLOps Systems with Modular Service Architecture

Model Abstraction in MLOps: Decoupling ML from APIs

Designing a Python ML Model Class for MLOps
How to Replace Dummy Models with Production ML Models
Versioning the Model Class

Building Reusable Utilities in Python MLOps Projects

Loading YAML Configs
Adding New Helper Functions

Running a FastAPI MLOps Application Locally

Running via Poetry
Running via UV
Running Python MLOps Projects with PDM
Testing FastAPI Endpoints: Health Check and Prediction API

Summary

Citation Information

FastAPI for MLOps: Python Project Structure and API Best Practices

In this lesson, you will learn how to structure a Machine Learning (ML) project like a real production system, complete with a src directory layout, layered configuration, environment management, logging, and a FastAPI service that exposes your model through clean Application Programming Interface (API) routes.

This lesson is the 1st of a 2-part series on Software Engineering for Machine Learning Operations (MLOps):

FastAPI for MLOps: Python Project Structure and API Best Practices (this tutorial)
Lesson 2

To learn how to build reliable, scalable ML software the right way, just keep reading.

Introduction

Modern ML systems do not succeed because of models alone — they succeed because of the software engineering wrapped around them. Most real-world failures in MLOps come from poor structure, missing configuration, messy environments, unclear APIs, or nonexistent logging, not from bad ML.

This lesson gives you the engineering foundation you need to build ML systems that are stable, testable, and production-ready. You’ll learn how to structure your project, manage environments, load configurations, build APIs, and prepare your system for future modules like testing, deployment, and automation.

To learn how solid software engineering underpins every ML workflow, just keep reading.

What You Will Build and Learn

In this lesson, you’ll build the backbone of a real ML application: a clean repository layout, environment management with modern tooling, configuration loading via Pydantic, structured logging, a FastAPI interface, and a simple service layer to power prediction.

These concepts form the “foundation layer” every MLOps system relies on — regardless of the model you eventually plug in.

Why Software Engineering Comes First in MLOps Best Practices

ML projects often fail not because the model is wrong, but because the plumbing around the model collapses. Scripts turn into spaghetti, notebooks become unmaintainable, configs get scattered, and environments drift until the system becomes impossible to debug.

Good software engineering fixes this by introducing structure, consistency, and predictable behavior. When your API, config, logs, and model code work together cleanly, everything built on top (e.g., testing, serving, scaling, monitoring) suddenly becomes reliable.

Where This Fits in the Overall Curriculum

This lesson is the foundation of the entire MLOps series. Everything that comes next — testing, model integration, deployment workflows, Continuous Integration/Continuous Delivery (CI/CD) automation, monitoring, and scaling — builds on the engineering habits you establish here.

Think of this as your “software engineering base layer.” Once you master this structure, adding real models, adding load testing, or plugging the system into cloud infrastructure becomes far easier.

Python Project Structure Best Practices for MLOps

A well-structured repository is the first sign of a healthy ML system. Before we write any API code or load a model, we need a layout that cleanly separates configuration, services, models, and utilities. This not only prevents chaos — it makes testing, scaling, and future modules dramatically easier.

How to Structure a Python Project with src/ Layout

ML projects quickly become messy if everything sits at the root level. The src/ layout prevents naming collisions, enforces imports that match production structure, and makes it clear where application code actually lives.

This is the same structure used in mature Python services deployed in production environments.

Python Project Structure Explained: Repository Walkthrough

Here's the repository layout we're working with in this module:

sw-eng-mlops/
│
├── src/
│   ├── core/
│   ├── models/
│   ├── services/
│   ├── api/
│   ├── utils/
│   └── config/
│
├── tests/
│   ├── unit/
│   ├── integration/
│   └── performance/
│
├── pyproject.toml
├── README.md
├── setup_env.sh
└── .env.example

This structure is intentionally clean: core/ contains primitives, models/ stores your ML logic, services/ contains business logic, and api/ exposes everything through FastAPI routes.

Python Project Structure Best Practices: Directory Breakdown

core/ — The Application Base Layer

This folder contains shared components such as logging setup, base classes, or utility abstractions. Everything here is meant to be reusable across the whole system.

models/ — ML or Dummy Model Code

Even if you’re starting with a dummy model, isolating model code here makes it easy to swap in real models later.

services/ — The Business Logic Layer

This is where you place the logic that actually powers /predict, not inside the API route. This separation keeps production-grade APIs maintainable.

api/ — FastAPI Endpoints

Routes live here. Each endpoint calls a service, which calls a model.

Tight, clean, and testable.

utils/ — Shared Helpers

Config loaders, YAML readers, or general-purpose helper functions sit here.

If it isn’t domain logic or a model, it goes here.

config/ — Configuration Files

YAML configs, BaseSettings classes, validation logic, and environment overrides.

Centralizing config makes behavior predictable and testable.
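
The route, service, and model chain described in this breakdown can be sketched in a few lines. This is a simplified illustration, not the project's code: the class names are assumptions, the keyword rule is a placeholder, and predict_route is a plain function standing in for a FastAPI route handler so the example runs without any dependencies.

```python
from typing import Dict


class DummyModel:
    """models/: the prediction logic (a placeholder keyword rule here)."""

    def predict(self, text: str) -> str:
        return "positive" if "good" in text else "negative"


class InferenceService:
    """services/: business logic that wraps the model."""

    def __init__(self, model: DummyModel) -> None:
        self.model = model

    def predict(self, text: str) -> Dict[str, str]:
        # Preprocessing lives in the service, not in the route or the model.
        cleaned = text.strip().lower()
        return {"input": text, "prediction": self.model.predict(cleaned)}


def predict_route(payload: Dict[str, str], service: InferenceService) -> Dict[str, str]:
    """api/: a plain function standing in for a FastAPI route handler."""
    return service.predict(payload["text"])


service = InferenceService(DummyModel())
response = predict_route({"text": "  Good results  "}, service)
print(response)
```

Because the route only forwards to the service, each layer can be tested on its own, and swapping DummyModel for a real model does not touch the API code.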

How This Structure Scales to Larger ML Systems

This layout scales easily as your ML workload grows:

Add a new model → create a folder inside models/.
Add a new prediction workflow → add a service in services/.
Add new API functionality → add a route in api/.
Add data pipelines or vector DB logic → expand core/ or services/.

This way, the project grows horizontally, not chaotically.

Managing Python Dependencies with Poetry for ML Projects

Modern MLOps projects rely on predictable, repeatable environments — and this section teaches you how to create exactly that. Before we build APIs or load models, we need a clean, isolated workspace where dependencies are installed, versions are pinned, and tools behave consistently across machines.

To learn how to manage dependencies, virtual environments, and setup scripts in real-world ML projects, just keep reading.

Python Poetry vs PDM vs UV: Choosing a Package Manager for MLOps

There are 3 modern Python toolchains worth knowing:

Poetry: full-featured dependency + environment + packaging manager.
PDM (Python Dependency Manager): simpler and faster than Poetry, with PEP-582 support.
UV: an extremely fast Rust-based package manager from Astral.

All 3 support pyproject.toml, the modern Python standard for dependencies and metadata.

Teams often standardize on a single tool, but your project supports all three, so students can use whichever they prefer.

Understanding pyproject.toml for Python Project Configuration

Your pyproject.toml defines:

project name, version, description
dependencies like fastapi, pydantic, pyyaml
dev tools like pytest (Lesson 2)
optional entrypoints (start-server = "src.main:main")

In other words, it is the single source of truth for installation and build metadata.

Any tool (Poetry, PDM, UV, pip) reads this file to install exactly what the project needs.

This is how professional ML systems avoid “works on my machine” issues.

Installing Dependencies (Poetry, PDM, UV)

Using Poetry (recommended)

poetry install
poetry shell
poetry run python src/main.py

Poetry creates an isolated virtual environment and resolves all versions deterministically.

Using UV (lightweight + blazing fast)

uv venv
source .venv/bin/activate
uv pip install -e .
python src/main.py

UV is perfect for fast installs and CI systems where speed matters.

Using PDM (simple + modern)

pdm install
pdm run python src/main.py

PDM feels like npm — no venv folder by default; lightweight and straightforward.

Managing Python Virtual Environments for Reproducible MLOps

Regardless of what tool you choose, the goal is the same: isolate project dependencies from the system Python installation.

Poetry creates its own environment automatically.
UV uses .venv/ inside your project.
PDM can create or avoid virtual environments depending on the configuration.

The important principle:

Never install ML dependencies globally.

Environments keep your project reproducible and safe.
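
One quick way to confirm you are not about to install into the system interpreter is to compare sys.prefix with sys.base_prefix. The sketch below is a small guard you could run before installing anything; the in_virtualenv helper name is hypothetical, and the check covers venv-style environments.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


active = in_virtualenv()
if not active:
    print("Warning: no virtual environment detected; packages would install globally.")
else:
    print(f"Virtual environment active at {sys.prefix}")
```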

Automating MLOps Setup with Python Environment Scripts

Your project includes a helper script:

./scripts/setup_env.sh

This script:

Detects whether Poetry, UV, or plain pip is available
Installs dependencies using the detected tool
Creates or activates the .env file
Shows the next steps to start the API

This is extremely helpful for teams because it removes all “setup guessing” and gives new developers a consistent starting point.
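The detection step can be sketched in a few lines of Python. This is only an illustration of the idea, not the contents of scripts/setup_env.sh: the function names, the preference order, and the command table are assumptions made for this example.

```python
import shutil
from typing import Callable, List, Optional

# Hypothetical command table; the real script may use different commands.
# The uv and pip entries assume a virtual environment is already active.
INSTALL_COMMANDS = {
    "poetry": ["poetry", "install"],
    "uv": ["uv", "pip", "install", "-e", "."],
    "pip": ["pip", "install", "-e", "."],
}


def detect_tool(which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Return the first tool found on PATH, checked in preference order."""
    for tool in ("poetry", "uv", "pip"):
        if which(tool) is not None:
            return tool
    return None


def install_command(tool: str) -> List[str]:
    """Look up the install command for a detected tool."""
    return INSTALL_COMMANDS[tool]
```

Passing the lookup function as a parameter keeps the detection logic testable without touching the real PATH.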

You now know how environments, dependency managers, and pyproject.toml work together to create a stable foundation for ML systems. With everything installed and configured, you’re ready to build and serve a real API.

Up next, we’ll create your first ML service with FastAPI and connect it to your project’s service layer.


Configuration Management in MLOps: YAML, .env, and Pydantic

How the entire ML system loads, merges, and applies configuration at runtime.

Configuration is one of the most important engineering foundations in any ML system. In this lesson, the goal is to understand not only why configuration matters but exactly how this project loads and merges config values. That means stepping through the real code inside src/core/config.py, .env.example, and configs/config.yaml.

We also show how the API, model, and services consume configuration, so that when you replace the dummy model with a real one, the pattern already scales.

Let’s walk through it piece by piece.

Using Pydantic Settings for MLOps Configuration Management

Your configuration system starts with a Settings class:

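from pydantic_settings import BaseSettings  # Pydantic v2; on Pydantic v1 use: from pydantic import BaseSettings


class Settings(BaseSettings):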
    api_host: str = "0.0.0.0"
    api_port: int = 8000
    debug: bool = False
    environment: str = "development"
    log_level: str = "INFO"

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"

What This Means for MLOps Configuration and System Design

Pydantic’s BaseSettings automatically reads:

environment variables
the .env file
any overrides you pass at runtime

Defaults are provided in code so the system always works, even if .env is missing.

Type safety ensures that if someone writes API_PORT=hello, the app will fail fast.

This is the right pattern for ML systems where dozens of environment variables must be synchronized across dev, test, staging, and production.
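To make the fail-fast behavior concrete, here is a minimal standard-library sketch of the same idea: typed defaults, overridden by environment variables, with a bad value rejected at load time. It is not how Pydantic implements BaseSettings; MiniSettings, load_from_env, and _coerce are hypothetical names used only for illustration.

```python
import os
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional


@dataclass
class MiniSettings:
    """Typed defaults, mirroring a few fields of the Settings class."""
    api_host: str = "0.0.0.0"
    api_port: int = 8000
    debug: bool = False


def _coerce(raw: str, target: type) -> Any:
    """Convert a raw environment string to the declared field type."""
    if target is bool:
        lowered = raw.strip().lower()
        if lowered in {"1", "true", "yes", "on"}:
            return True
        if lowered in {"0", "false", "no", "off"}:
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    return target(raw)  # int("hello") raises ValueError


def load_from_env(env: Optional[Mapping[str, str]] = None) -> MiniSettings:
    """Build settings from defaults plus upper-cased environment variables."""
    env = os.environ if env is None else env
    values = {}
    for field in fields(MiniSettings):
        raw = env.get(field.name.upper())
        if raw is None:
            continue  # keep the code default
        try:
            values[field.name] = _coerce(raw, field.type)
        except ValueError as exc:
            raise ValueError(f"{field.name.upper()}={raw!r} is invalid: {exc}") from exc
    return MiniSettings(**values)
```

With this sketch, load_from_env({"API_PORT": "hello"}) raises a ValueError before any server starts, which is the same outcome the Pydantic class gives you through its validation error.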

Loading YAML and Merging Layers

Next comes one of the most important parts of your system:

import os


def load_config() -> Settings:
    settings = Settings()

    config_path = "configs/config.yaml"
    if os.path.exists(config_path):
        yaml_config = load_yaml_config(config_path)

        for key, value in yaml_config.items():
            if hasattr(settings, key):
                setattr(settings, key, value)

    return settings

Why This Is Powerful

You now have layered configuration, which production ML systems use everywhere:

Layer 1: Code defaults

Ensures the app always runs.

Layer 2: YAML (configs/config.yaml)

Great for team-shared configs, model settings, cache sizes, service parameters.

Layer 3: .env file

Local overrides (ports, debug mode, secrets).

Layer 4: Runtime environment variables

Final source of truth in cloud deployments.

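This layered system prevents the “hard-coded value” trap and keeps ML infra consistent across environments.

One caveat: as written, load_config() builds Settings() first (code defaults, .env, and environment variables) and then copies matching YAML keys on top. A key that appears in configs/config.yaml therefore takes precedence over the same key in .env or the environment. To get the precedence listed above, keep environment-specific keys out of the YAML file, or apply YAML values only to fields that were not set through the environment.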
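The four layers can be sketched with plain dictionaries, where each later layer overrides the earlier ones. This is an illustration of the intended precedence under stated assumptions, not the project's load_config(); resolve_config and DEFAULTS are hypothetical names.

```python
from typing import Any, Dict, Mapping

# Layer 1: code defaults (a subset of the Settings fields).
DEFAULTS: Dict[str, Any] = {"api_host": "0.0.0.0", "api_port": 8000, "debug": False}


def resolve_config(
    yaml_values: Mapping[str, Any],
    dotenv_values: Mapping[str, Any],
    env_values: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge layers in order: defaults < YAML < .env < environment variables."""
    merged = dict(DEFAULTS)
    for layer in (yaml_values, dotenv_values, env_values):
        for key, value in layer.items():
            if key in merged:  # ignore unknown keys, like the hasattr() check
                merged[key] = value
    return merged
```

Because the layers are applied in a fixed order, the last one that defines a key wins, and unknown keys such as a nested model section are skipped rather than silently added.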

Designing YAML Configs for Scalable MLOps Pipelines

Your YAML file contains deeper structural config:

api_host: "0.0.0.0"
api_port: 8000
debug: true
environment: "development"

log_level: "INFO"

model:
  name: "dummy_classifier"
  version: "1.0.0"
  cache_size: 100

service:
  timeout: 30
  max_retries: 3

Even though Settings does not yet support nested objects for models or services, YAML allows you to introduce new structured configuration later. This is how real ML teams configure:

model version
tokenizer version
max batch size
timeouts
cache settings
experiment IDs
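When the nested model and service sections are eventually consumed, one option is to map each section onto its own small typed object. The sketch below uses standard-library dataclasses rather than Pydantic; ModelConfig, ServiceConfig, and parse_sections are hypothetical names, and the defaults simply repeat the values from the YAML file above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class ModelConfig:
    name: str = "dummy_classifier"
    version: str = "1.0.0"
    cache_size: int = 100


@dataclass(frozen=True)
class ServiceConfig:
    timeout: int = 30
    max_retries: int = 3


def parse_sections(raw: Dict[str, Any]) -> Tuple[ModelConfig, ServiceConfig]:
    """Build typed section objects from a parsed YAML dictionary.

    Missing sections fall back to defaults; an unknown key inside a
    section raises TypeError, which surfaces typos early.
    """
    model = ModelConfig(**(raw.get("model") or {}))
    service = ServiceConfig(**(raw.get("service") or {}))
    return model, service
```

For example, parse_sections({"model": {"name": "sentiment"}}) keeps the default version and cache size while overriding only the name.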

Using .env Files for Secure MLOps Configuration

You also provide .env.example:

API_PORT=8000
API_HOST=0.0.0.0
DEBUG=true
ENVIRONMENT=development
LOG_LEVEL=INFO

Why Configuration Management Matters in MLOps Systems

.env.example acts as documentation and a template.
You copy it to .env, fill values, and the system boots.
This is a best practice in every production ML repo.

How the App Uses Configuration (src/main.py)

Your FastAPI entrypoint reads config like this:

logger.info(f"Starting server on {settings.api_host}:{settings.api_port}")

uvicorn.run(
    "main:app",
    host=settings.api_host,
    port=settings.api_port,
    reload=settings.debug
)

Meaning:

Change .env to API_PORT=9000: your app runs on port 9000, provided configs/config.yaml does not also set api_port (load_config() applies YAML values last, so they win).
Change YAML to debug: false: Hot reload turns off.

This is the practical benefit of structured configuration: no hard-coded values are buried inside the code.

How FastAPI Uses Configuration in Production MLOps Systems

Today, your inference service is simple, but in real projects, you might use:

model name
version
batch size
latency budget
max retries
cache settings
rate limits

All of these come from settings, not hardcoded logic.

This lesson teaches the pattern, so when the dummy model is eventually replaced with an Open Neural Network Exchange (ONNX) model, a Hugging Face model, or a custom PyTorch model, the service already has the right structure.

Extending MLOps Configuration Safely in Python Projects

Suppose tomorrow you want:

MODEL_PATH=models/checkpoint.pt
ENABLE_CACHE=true
CACHE_TTL=300

You add:

model_path: str = "models/dummy.pt"
enable_cache: bool = False
cache_ttl: int = 120

Then update .env.example. Then, optionally override in YAML.

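The new settings are available on the settings object as soon as they are declared, with no rewrites or refactoring of the loading code; only the code that consumes them needs to change.

This is the level of software engineering maturity this lesson aims to build.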

Logging Best Practices for MLOps and FastAPI Applications

Logging is one of the most underappreciated parts of an ML system. A model prediction might take milliseconds, but diagnosing a production issue without proper logs can take hours. Good logs reduce that time to minutes. In this section, we’ll look at how our lesson’s project initializes a logger, formats log messages, and uses logs consistently across the entire API.

Why Logging Is Critical for ML Systems

ML systems fail in ways traditional software does not.

A model might produce an unexpected prediction, a dependency might break silently, or the environment might load the wrong configuration. Logging gives you the breadcrumbs needed to understand:

What inputs reached the API
What model version was used
What the service did before failing
How often errors occur
Whether latency is increasing

Logs are your “black box recorder” when something goes wrong, and they’re equally important when everything seems to be working — because they tell you why things are working.

Logger Initialization

The project defines a single shared logger in src/core/logger.py:

import logging
import sys

logger = logging.getLogger("mlops-lesson1")
logger.setLevel(logging.INFO)

handler = logging.StreamHandler(sys.stdout)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

if not logger.handlers:
    logger.addHandler(handler)

Here’s what this setup accomplishes:

A named logger (mlops-lesson1) groups logs for later aggregation (e.g., in Datadog, ELK (Elasticsearch, Logstash, Kibana), OpenTelemetry).
INFO as the default level ensures we capture meaningful operational details without spamming output.
A StreamHandler writes logs to stdout — the standard for containerized deployments (Docker, Kubernetes).
A simple timestamped formatter makes logs human-readable while remaining machine-parseable.
The if not logger.handlers: guard prevents duplicate logs if modules are reloaded.

This small file gives us a production-friendly logger with minimal overhead.

Log Formatting and Levels

The logger uses this format:

2025-01-01 12:34:56,789 - INFO - Prediction result: positive

Each part of the log line matters:

Timestamp: crucial for correlating logs with events or latency spikes.
Log level: signals severity: INFO, WARNING, ERROR.
Message: the human-readable explanation.

In MLOps systems, you’ll most commonly use:

INFO for model loading, API calls, predictions
WARNING for slow responses, unexpected patterns
ERROR when something fails

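Because the logger module can be executed more than once during development (for example, when the app is re-imported or the module is loaded under two different paths), you may see log duplication without safeguards, which is why we include the if not logger.handlers: check.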

If you later want structured JSON logs (for cloud log ingestion), this same module is the place to upgrade.
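As a rough idea of what that upgrade could look like, here is a minimal JSON formatter built only on the standard logging module. JsonFormatter and build_json_logger are hypothetical names, and the field names in the payload are an assumption; match them to whatever your log platform expects.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback when logger.exception() was used.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_json_logger(name: str, stream) -> logging.Logger:
    """Create (or reuse) a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid a second, plain-text copy via the root logger
    if not logger.handlers:   # same duplicate-handler guard as in the lesson
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

In the service you would pass sys.stdout as the stream; an io.StringIO buffer is convenient when testing the output.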

Logging Across the App

The logger is used in multiple places, showing a consistent logging strategy.

Health endpoint (src/main.py)

@app.get("/health")
async def health_check():
    logger.info("Health check requested")
    return {"status": "ok"}

This gives visibility into uptime checks — important when a load balancer or Kubernetes performs probes.

Prediction endpoint (src/services/inference_service.py)

logger.info(f"Making prediction for input: {input_text[:50]}...")
prediction = model.predict(input_text)
logger.info(f"Prediction result: {prediction}")

Here we log:

The incoming input (truncated to avoid leaking full user data)
The model’s output
Any errors

If something goes wrong:

except Exception as e:
    logger.error(f"Error during prediction: {str(e)}")
    raise

This ensures errors appear in the logs before FastAPI turns the unhandled exception into an HTTP 500 response.

Server startup (src/main.py)

logger.info(f"Starting server on {settings.api_host}:{settings.api_port}")

This is important for:

verifying the config loaded correctly
ensuring the correct port is used
debugging environments with conflicting overrides

Together, This Gives Us Structured, Traceable Behavior Across the App

If a user reports:

“The API feels slow today.”

You can immediately look at:

prediction request timestamps
whether model loading was triggered again
whether latency warnings appear
whether certain inputs correlate with errors

Without logs, you’re flying blind.
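The project as shown does not emit latency warnings yet. One way to add them is a small timing decorator that logs at WARNING level when a call exceeds a threshold. This is a sketch: log_latency and the 0.5-second default are assumptions, and it expects the mlops-lesson1 logger to be configured as in src/core/logger.py.

```python
import functools
import logging
import time

logger = logging.getLogger("mlops-lesson1")


def log_latency(threshold_seconds: float = 0.5):
    """Log how long the wrapped function takes; warn when it is slow."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on failure, so slow errors are timed too.
                elapsed = time.perf_counter() - start
                level = logging.WARNING if elapsed > threshold_seconds else logging.INFO
                logger.log(level, "%s took %.3f s", func.__name__, elapsed)

        return wrapper

    return decorator
```

Decorating the service-level predict function with @log_latency() would then make slow predictions stand out in the same log stream as everything else.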

FastAPI for MLOps: Building a Production ML API

APIs are the interface between your ML system and the outside world. Whether the consumer is a mobile app, a batch job, another microservice, or a human developer testing in Postman, every interaction eventually flows through an API. In MLOps, your API becomes the stable contract that hides internal details (model type, version, preprocessing, logging) — allowing you to upgrade models without breaking clients.

Why FastAPI Is Ideal for MLOps API Development

FastAPI gives you a fast, typed, and production-ready way to expose ML predictions.

It handles validation, serialization, documentation, and error responses, so your ML logic stays clean and modular.

The goal is simple: your API should stay stable even when everything behind it changes — models, configs, logging, monitoring, infrastructure.

Creating a FastAPI Application for Machine Learning APIs

Your project defines the API inside src/main.py:

from fastapi import FastAPI
app = FastAPI(
    title="ML Service API",
    description="Code Foundations & API Engineering for MLOps",
    version="0.1.0"
)

This initializes a fully documented ML service with:

A title for the UI
A description that shows up in Swagger
A semantic version
Automatically generated schemas

FastAPI instantly gives you API docs and a clean, declarative way to add endpoints.

Implementing Health Check Endpoints in FastAPI (MLOps)

A health endpoint is the first thing any production system needs.

Kubernetes, AWS Application Load Balancer (ALB), Docker Compose, Jenkins, and uptime monitors all rely on it.

Your implementation:

@app.get("/health")
async def health_check():
    logger.info("Health check requested")
    return {"status": "ok"}

This performs 2 critical functions:

Confirms the API server is alive
Confirms logs are working

It also gives you a simple smoke test to verify the environment.

Building a FastAPI Prediction Endpoint for ML Models

The /predict endpoint is where real ML work happens.

@app.post("/predict")
async def predict_route(input: str):
    return {"prediction": predict_service(input)}

This endpoint:

Accepts a simple string input, which FastAPI reads from the query string (?input=...)
Passes it into the inference service
Returns a structured JSON prediction

Because prediction logic is isolated in services/inference_service.py, the API stays lightweight and focused on HTTP behavior — not business logic.

Behind This Endpoint Is Your Prediction Engine

from models.dummy_model import DummyModel

model = DummyModel()

def predict(input_text: str) -> str:
    logger.info(f"Making prediction for input: {input_text[:50]}...")
    prediction = model.predict(input_text)
    logger.info(f"Prediction result: {prediction}")
    return prediction

Even though this is a dummy model, the structure mirrors real production design:

The service layer owns the prediction logic
The model is instantiated once
Logging wraps the input and output

When you upgrade to a real transformer or classifier, the API does not need to change.

Deploying FastAPI with Uvicorn for MLOps Applications

The server entrypoint lives at the bottom of main.py:

def main():
    logger.info(f"Starting server on {settings.api_host}:{settings.api_port}")
    uvicorn.run(
        "main:app",
        host=settings.api_host,
        port=settings.api_port,
        reload=settings.debug
    )

A few details matter:

reload=settings.debug turns on auto-reload on code changes when debug is true → perfect for development
host and port come from config → ideal for containers/cloud
logging is integrated → so you can trace server start behavior

You can run the server with:

poetry run start-server

uvicorn src.main:app --reload

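Both give you a live API. The uvicorn command always hot-reloads because of --reload; the start-server entrypoint hot-reloads only when debug is true.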

Auto-Generated API Docs (Swagger, ReDoc)

FastAPI automatically exposes:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc
OpenAPI schema: http://localhost:8000/openapi.json

These docs are invaluable in ML workflows because:

You can test predictions interactively
Product, QA, and frontend engineers can explore endpoints
Payload schemas are always up to date
No one needs to ask “What does this endpoint expect?”

FastAPI generates this from your Python type hints, which makes documentation essentially free.

MLOps Architecture: Service Layer Design Patterns

The service layer is where your application’s real business logic lives. In an ML system, this includes preprocessing, model selection, inference, error handling, postprocessing, and logging. By keeping this logic out of your API routes, you ensure that your codebase remains modular, testable, and ready for future model upgrades.

Why We Separate Services from Routes

FastAPI routes should only handle HTTP concerns: input validation, request parsing, and response formatting.

They should not know how your model works internally.

Separating logic into a services/ folder gives you:

Cleaner API routes: easier to read and maintain
Better testability: you can unit test the inference logic without starting a server
Loose coupling: upgrading models doesn’t require rewriting routes
Clear ownership: one layer handles HTTP, the other handles ML logic

This separation is one of the most critical software engineering patterns in MLOps — you want your system flexible enough that models can change, scale, or switch frameworks without touching your API.
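The testability point is easy to demonstrate: service logic can be exercised with a stub model and no running server. The sketch below uses the standard unittest module; StubModel is hypothetical, and unlike the project's predict(), this version takes the model as a parameter so the example stays self-contained.

```python
import unittest


class StubModel:
    """Hypothetical stand-in for a real model; always answers 'positive'."""

    def predict(self, input_data) -> str:
        return "positive"


def predict(input_text: str, model) -> str:
    """Service-level function: no HTTP objects go in or come out."""
    return model.predict(input_text)


class PredictServiceTest(unittest.TestCase):
    def test_returns_model_output(self):
        self.assertEqual(predict("anything", StubModel()), "positive")

    def test_propagates_model_errors(self):
        class BrokenModel:
            def predict(self, input_data):
                raise RuntimeError("model unavailable")

        with self.assertRaises(RuntimeError):
            predict("anything", BrokenModel())
```

Neither test imports FastAPI or opens a socket, which is the practical payoff of keeping routes and services apart.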

Designing an ML Inference Service

Your inference logic lives in:

src/services/inference_service.py

Let’s look at how it’s structured:

from models.dummy_model import DummyModel
from core.logger import logger

# Initialize model
model = DummyModel()
logger.info(f"Loaded model: {model.model_name}")

This loads the model once at startup. In a real ML system, this is where:

You load a transformer model
You warm up a GPU
You hydrate a vector store
You initialize the tokenizer/preprocessor state

Then comes the prediction function:

def predict(input_text: str) -> str:
    logger.info(f"Making prediction for input: {input_text[:50]}...")
   
    try:
        prediction = model.predict(input_text)
        logger.info(f"Prediction result: {prediction}")
        return prediction
    except Exception as e:
        logger.error(f"Error during prediction: {str(e)}")
        raise

This function represents the business logic of your ML service:

It truncates the input to 50 characters for logging
Calls the model’s predict()
Logs errors and output cleanly
Returns only the result — not HTTP details

This is exactly why we keep services separate: inference is not an HTTP concern, so it does not belong in a route.

Scaling MLOps Systems with Modular Service Architecture

A great design scales. Tomorrow, your system might need:

SentimentService: for NLP
RecommendationService: for personalization
VisionService: that loads YOLO or CLIP
BatchService: for async workflows
RetrievalService: for Retrieval-Augmented Generation (RAG) pipelines

You don’t modify main.py or existing endpoints.

You simply add more files under:

src/services/
├── inference_service.py  
├── recommendation_service.py  
├── vision_service.py  
└── retrieval_service.py

Each service becomes independent, testable, and reusable.

Later in Lesson 2, this design becomes even more powerful because:

Unit tests: target individual services
Integration tests: validate routes and services working together
Load tests: measure the throughput of the /predict pipeline

By the time you add real ML models, this service layer becomes the heart of your system.

Model Abstraction in MLOps: Decoupling ML from APIs

Models change constantly in MLOps. Today you may be serving a dummy classifier; tomorrow it might be a 7B LLM or a YOLOv12 object detector. A good software engineering foundation treats the model as a pluggable, versioned component that can be replaced with minimal friction.

Your current models/ directory demonstrates exactly how this abstraction works.

Designing a Python ML Model Class for MLOps

Your lesson uses a simple placeholder model located at:

src/models/dummy_model.py

The goal of this class isn’t to perform “real” ML — it’s to give you a clean structure that mimics how production model classes are written.

from typing import Any


class DummyModel:
    def __init__(self) -> None:
        self.model_name = "dummy_classifier"
        self.version = "1.0.0"
   
    def predict(self, input_data: Any) -> str:
        text = str(input_data).lower()
        if "good" in text or "great" in text:
            return "positive"
        return "negative"

Even in this tiny model, you already see foundational patterns:

A constructor to load or initialize model state
A predict() method that defines the inference interface
model_name and version fields for introspection and tracking

This interface is intentionally minimal: it forces your service and API layers to depend on an abstraction, not on implementation details.

In real MLOps systems, this exact pattern makes it easy to introduce new models without breaking your API.
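If you want the interface to be explicit rather than implied, typing.Protocol is one way to write it down. This is a sketch, not part of the project: TextModel, KeywordSentimentModel, and run_inference are hypothetical names, while DummyModel repeats the class shown above.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TextModel(Protocol):
    """The minimal interface the service layer depends on."""

    model_name: str
    version: str

    def predict(self, input_data: Any) -> str: ...


class DummyModel:
    def __init__(self) -> None:
        self.model_name = "dummy_classifier"
        self.version = "1.0.0"

    def predict(self, input_data: Any) -> str:
        text = str(input_data).lower()
        return "positive" if ("good" in text or "great" in text) else "negative"


class KeywordSentimentModel:
    """Hypothetical replacement model with the same interface."""

    def __init__(self) -> None:
        self.model_name = "keyword_sentiment"
        self.version = "0.1.0"

    def predict(self, input_data: Any) -> str:
        return "negative" if "bad" in str(input_data).lower() else "positive"


def run_inference(model: TextModel, text: str) -> str:
    """Works with any object that satisfies the TextModel protocol."""
    return model.predict(text)
```

Neither model class inherits from TextModel; they satisfy it structurally, so a type checker can flag a model that is missing predict() or version without forcing a shared base class.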

How to Replace Dummy Models with Production ML Models

Here’s where the abstraction shines.

If tomorrow you decide to replace the dummy model with:

A Hugging Face transformer
A PyTorch Lightning checkpoint
A TensorRT engine
An ONNX Runtime session
A vLLM text-generation server
A YOLO detection model

…all you need to do is drop a new file into:

src/models/

For example:

src/models/
├── dummy_model.py
├── sentiment_model.py
├── llm_generation_model.py
└── object_detector.py

And update your service:

from models.sentiment_model import SentimentModel
model = SentimentModel()

Nothing else changes.

Your FastAPI routes stay the same.

Your service interface stays the same.

Your tests stay the same (except for new model-specific tests).

This is model decoupling.

This is how ML systems avoid turning into tangled spaghetti when models evolve.

Versioning the Model Class

Model versioning is a real production concern, and your dummy model subtly teaches the pattern.

self.version = "1.0.0"

Model versioning matters because:

You may deploy multiple models at once
Clients might depend on specific behaviors
A/B testing needs separate versions
Rollbacks require deterministic reproducibility
Monitoring tools (e.g., Prometheus or Langfuse) track model changes

In production, versioning happens in several places:

version field in the class
model registry tag (MLflow, SageMaker, Hugging Face Hub)
Docker image tag
config.yaml entry
model card metadata

Your project follows the simplest, clearest entrypoint: a version attribute that propagates everywhere the model is used.

Later in Lesson 2, test cases and load tests will automatically pick up this version, mimicking real-world CI/CD systems that validate each model release.
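One lightweight way to let the version propagate is to return it alongside every prediction, so logs and clients can tell which model answered. The helper below is a sketch; predict_with_metadata and the response keys are assumptions, and the project's /predict route currently returns only the prediction.

```python
from typing import Any, Dict


class DummyModel:
    """Same placeholder model as in src/models/dummy_model.py."""

    def __init__(self) -> None:
        self.model_name = "dummy_classifier"
        self.version = "1.0.0"

    def predict(self, input_data: Any) -> str:
        text = str(input_data).lower()
        if "good" in text or "great" in text:
            return "positive"
        return "negative"


def predict_with_metadata(model: Any, input_text: str) -> Dict[str, str]:
    """Attach the model's name and version to the prediction result."""
    return {
        "prediction": model.predict(input_text),
        "model_name": model.model_name,
        "model_version": model.version,
    }
```

During an A/B test or a rollback, this extra metadata makes it possible to attribute each response to a specific model release.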

Building Reusable Utilities in Python MLOps Projects

A well-designed ML system always contains a dedicated utilities layer — small, reusable functions that solve cross-cutting problems without polluting your core logic, service layer, or API routes.

In this project, the src/utils/ folder gives you a clean space to organize those helpers, starting with configuration loading, and is ready to grow as your system becomes more complex.

This layer keeps your codebase maintainable, testable, and extensible.

Loading YAML Configs

Your primary helper is load_yaml_config() found in:

src/utils/helpers.py

Here’s the implementation:

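from pathlib import Path
from typing import Any, Dict

import yaml  # provided by the pyyaml dependency


def load_yaml_config(path: str) -> Dict[str, Any]: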
    config_path = Path(path)
   
    if not config_path.exists():
        return {}
   
    try:
        with open(config_path, 'r', encoding='utf-8') as file:
            config = yaml.safe_load(file)
            return config if config is not None else {}
    except yaml.YAMLError as e:
        print(f"Error loading YAML config from {path}: {e}")
        return {}
    except Exception as e:
        print(f"Unexpected error loading config from {path}: {e}")
        return {}

This function may look simple, but it embodies 3 production-level lessons:

Separation of concerns

Your application logic (FastAPI, inference services) should not know how a YAML file is parsed. They should only receive clean configuration objects.

Fault tolerance

In real deployments:

configs may be missing
YAML indentation may break
a misconfigured CI pipeline may pass an empty file

Returning {} instead of crashing gives you graceful degradation.

Extensibility

Tomorrow you may add:

JSON config support
remote config loading (S3, Google Cloud Storage (GCS), Azure Blob)
encrypted secrets
multiple config layers

This helper becomes the foundation.

Inside core/config.py, you saw how load_yaml_config() merges YAML values into your Pydantic settings. This is a real-world pattern used in production MLOps stacks like Airflow, FastAPI microservices, Ray Serve, and MLflow.

Adding New Helper Functions

The utilities layer is designed to grow organically as your system grows.

Common helpers you may introduce later include:

String helpers

text normalization
input cleaning
token counting

File helpers

safe file writes
temporary directory management
checksum calculation for model files

Model helpers

downloading artifacts from cloud storage
caching models on disk
validating model signatures

API helpers

request validation
standardized error responses
retry/backoff wrappers around external calls

Monitoring helpers

timing decorators
metrics emitters (Prometheus, StatsD, OpenTelemetry)
latency buckets

All of these belong in one place:

src/utils/

This prevents your service layer or route handlers from becoming cluttered and ensures that common functionality is implemented once and reused everywhere.
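As an example of the kind of helper that fits here, the following sketch implements the checksum calculation for model files mentioned above. file_sha256 and verify_checksum are hypothetical names; nothing like them exists in the project yet.

```python
import hashlib
from pathlib import Path
from typing import Union


def file_sha256(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Union[str, Path], expected_hex: str) -> bool:
    """Compare a file's digest with an expected value (case-insensitive)."""
    return file_sha256(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat even for multi-gigabyte checkpoints, and a service could call verify_checksum() before loading a downloaded model artifact.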

Running a FastAPI MLOps Application Locally

At this point, you have a fully structured ML application: configuration, logging, models, service layer, and a clean FastAPI interface. Now it’s time to actually run the system locally.

This section walks you through running the API with Poetry, UV, or PDM, depending on your setup. We’ll conclude with a quick validation test to ensure everything works end-to-end.

Running via Poetry

If you’re using Poetry (recommended for most workflows), your steps are:

# Install dependencies
poetry install

# Activate the environment (optional; on Poetry 2.x this requires the poetry-plugin-shell plugin)
poetry shell

# Start the API server
poetry run python src/main.py

You should see log lines like:

INFO - Starting server on 0.0.0.0:8000
INFO - Loaded model: dummy_classifier

Figure 1: Running ML API using Poetry

Running via UV

If you prefer UV (super-fast installer by Astral), run:

# Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install project in editable mode
uv pip install -e .

# Start the API
python src/main.py

This path is great for users who want lightweight dependency management without Poetry’s abstraction.

Running Python MLOps Projects with PDM

If your workflow uses PDM, run:

# Install dependencies
pdm install

# Start the server
pdm run python src/main.py

PDM offers a cleaner pyproject-first workflow and works well for CI/CD pipelines that prefer explicit environment setup.

Figure 2: Terminal showing a successful server started via PDM dependency resolution.

Testing FastAPI Endpoints: Health Check and Prediction API

Once the server is running, validate the system with 2 quick API calls.

Health Check

Open:

http://localhost:8000/health

Expected response:

{"status": "ok"}

This confirms:

the API is reachable
config and logger initialized
FastAPI routes are registered

Prediction Test

Send a prediction request:

curl -X POST "http://localhost:8000/predict?input=This+is+good"

Expected response:

{"prediction": "positive"}

Under the hood:

the service layer logs the request
the dummy model classifies sentiment
the API returns structured JSON
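The same two checks can be scripted with the standard library, which is handy for a quick smoke test after each change. This is a sketch that assumes the server is running on the default host and port; build_predict_url and smoke_test are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumes the default api_port from the config


def build_predict_url(text: str, base_url: str = BASE_URL) -> str:
    """Encode the input as a query parameter, as the curl example does."""
    return f"{base_url}/predict?{urlencode({'input': text})}"


def smoke_test(base_url: str = BASE_URL) -> dict:
    """Call /health and /predict on a running server and return both payloads."""
    with urlopen(f"{base_url}/health", timeout=5) as response:
        health = json.load(response)
    request = Request(build_predict_url("This is good", base_url), method="POST")
    with urlopen(request, timeout=5) as response:
        prediction = json.load(response)
    return {"health": health, "prediction": prediction}
```

Calling smoke_test() against a running server should return the two JSON payloads shown above; if the server is down, urlopen raises an error instead of returning.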

Figure 3: Auto-generated documentation for the ML API.

Figure 4: Real terminal output from running the /predict endpoint, validating the end-to-end workflow of the ML API.


Summary

In this lesson, you learned how to build a clean, scalable foundation for ML systems using real software-engineering practices. You now understand why ML projects must be structured like production services — not experiments — if they are ever going to ship reliably.

We began by exploring the why: ML code becomes maintainable only when you enforce clear boundaries between configuration, logic, services, and I/O. That idea naturally led to the src/ layout, which gave our project a predictable and extensible shape.

You then learned how to manage dependencies using Poetry, UV, or PDM — ensuring that every ML environment is reproducible, isolated, and easy to rebuild. This solved the classic “it works on my machine” trap that haunts ML teams.

Next, we built a robust configuration system using Pydantic BaseSettings, merging defaults, YAML files, and .env variables into a single typed interface. You now have a configuration pattern used by real-world production ML systems.

We also implemented structured logging, enabling the application to communicate what it’s doing internally — a prerequisite for debugging, observability, and monitoring.

From there, you built your first production-style ML API with FastAPI, complete with /health, /predict, and auto-generated documentation. You learned how to expose ML logic cleanly, and why APIs are the interface between ML systems and the real world.

We introduced the Service Layer, showing how routes should delegate to independent business logic so APIs stay thin and models stay swappable. This design decision is what makes the system testable and future-proof.

You then explored model abstraction, using a simple dummy model to illustrate how real models (PyTorch, TensorFlow, ONNX, vLLM, Transformers) can be slotted in without changing the API layer.

Finally, you saw how helper utilities make the system cleaner, and how to run the full application with Poetry, UV, or PDM. The result is a working ML service that looks, behaves, and organizes itself like production-grade software.

By completing this lesson, you’ve built the foundation required for every advanced MLOps practice: testing, performance monitoring, CI/CD, orchestration, and deployment.

You’re now ready for Lesson 2, where we transform this service into a fully tested, validated, and performance-monitored ML system.

Citation Information

Singh, V. “FastAPI for MLOps: Python Project Structure and API Best Practices,” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/yn8a5

@incollection{Singh_2026_fastapi-for-mlops-python-project-structure,
  author = {Vikram Singh},
  title = {{FastAPI for MLOps: Python Project Structure and API Best Practices}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/yn8a5},
}


The post FastAPI for MLOps: Python Project Structure and API Best Practices appeared first on PyImageSearch.

Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen

Piyush Thakur — Mon, 06 Apr 2026 13:03:56 +0000


Table of Contents

Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen
Why Agentic AI Outperforms Traditional Vision Pipelines
Why Agentic AI Improves Computer Vision and Segmentation Tasks
What We Will Build: An Agentic AI Vision and Segmentation System
Agentic AI Workflow: Vision-Language Reasoning and Segmentation Loop
Agentic AI Architecture: Combining VLMs and SAM 3 for Vision

Vision-Language Model (VLM): The Reasoning Component
SAM 3: Open-Vocabulary Object Segmentation
The Agentic Feedback Loop: Reasoning, Verification, and Refinement
Why Agentic Segmentation Outperforms One-Shot Models

Final Output: Agentic Vision System with Segmentation and Reasoning
Key Takeaway: VLM + SAM 3 = Intelligent Vision Agent
Configuring Your Development Environment
Python Setup and Imports for Agentic AI Vision System
Loading SAM 3 and Qwen Vision-Language Models in Transformers
Implementing VLM Inference for Agentic Vision Reasoning with Qwen2.5-VL
Implementing the SAM 3 Text-Prompted Segmentation Function
Implementing the Agentic AI Segmentation Pipeline with Iterative Refinement
Visualizing and Saving the Segmentation Results
Running the Agentic AI Vision System on Real Images
Agentic Segmentation Output: Iterative Prompt Refinement in Action
Summary

Citation Information

Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen

This lesson is the 4th and final part of our series on SAM 3. In the previous parts, we built a strong foundation for concept-aware segmentation.

In Part 1, we introduced the fundamentals of SAM 3 and explored how it enables concept-based visual understanding and segmentation. We moved beyond fixed labels and used natural language to describe objects.

In Part 2, we extended this idea by introducing multi-modal prompting and interactive segmentation. We combined text, points, and bounding boxes to gain more precise control over segmentation.

In Part 3, we extended this into the temporal domain. We applied SAM 3 to videos and built systems for concept-aware segmentation and object tracking across frames.

In this final part, we take a major step forward. Instead of treating segmentation as a single-step prediction, we introduce an agentic AI system that can reason, verify, and iteratively refine its outputs.


To learn how to build an Agentic AI Vision System with SAM 3 and Qwen, just keep reading.


Why Agentic AI Outperforms Traditional Vision Pipelines

Modern computer vision systems are evolving beyond traditional pipelines.

Traditionally, we designed systems where:

an image is passed to a vision model
the model produces a prediction
the pipeline ends there

This approach works well for clearly defined tasks. However, it struggles when tasks require understanding intent, handling ambiguity, or refining outputs.

To address this, we now transition toward agentic AI systems.

Agentic systems are not limited to a single prediction. Instead, they behave more like an iterative reasoning loop.

They can:

interpret a user request
select the appropriate models or tools
evaluate intermediate outputs
refine their decisions over multiple steps

This shift allows us to build systems that are adaptive, iterative, and self-correcting.

Why Agentic AI Improves Computer Vision and Segmentation Tasks

Vision tasks are often ambiguous.

For example, consider the instruction:

“the bag on the leftmost side”

A traditional segmentation model cannot directly handle this:

it expects fixed labels like “bag”
it does not understand spatial reasoning like “leftmost”

This is where agentic design becomes powerful.

We introduce a Vision-Language Model (VLM) to:

understand the instruction
extract the correct intent
translate it into a form usable by a segmentation model

Then, instead of trusting the output blindly, we:

verify the result
refine the input if needed
retry the process

This creates a loop where the system continuously improves.

What We Will Build: An Agentic AI Vision and Segmentation System

In this lesson, we build an agentic segmentation system that combines reasoning with perception.

The system takes:

an image
a natural language instruction

and produces:

segmentation masks
bounding boxes
confidence scores
a final overlay visualization

Agentic AI Workflow: Vision-Language Reasoning and Segmentation Loop

The pipeline follows these steps:

User Input: First, we provide an image along with a natural language instruction.
Instruction Understanding (VLM): Next, the VLM processes both the image and the text. It extracts the core intent and converts it into a short concept.
Concept Simplification: The system converts complex instructions into concise phrases. For example:
- “the bag on the leftmost side” → “leftmost bag”
Segmentation (SAM3): Then, SAM3 uses this concept to generate:
- segmentation masks
- bounding boxes
- confidence scores
Verification (VLM): After segmentation, the VLM evaluates whether the output matches the instruction.
Refinement Loop: If the result is incorrect:
- the VLM refines the concept
- SAM3 runs again
- the process repeats
This loop continues until the result aligns with the user’s intent.

Agentic AI Architecture: Combining VLMs and SAM 3 for Vision

Before implementing the code, we break down the system into its core components.

Vision-Language Model (VLM): The Reasoning Component

The VLM is the reasoning component of our system. It performs 3 key roles:

Instruction Understanding. It interprets the natural language input in the context of the image.

Concept Generation. It converts long instructions into short, structured phrases. For example:

“the person wearing a red shirt” → “person red shirt”
“the car in the background” → “background car”

This step is critical because segmentation models perform better with:

short
object-centric
unambiguous phrases

Result Verification. After segmentation, the VLM checks:

whether the correct object was segmented
whether spatial or contextual constraints are satisfied

SAM 3: Open-Vocabulary Object Segmentation

SAM3 acts as the perception component.

Unlike traditional segmentation models, SAM3 supports:

flexible prompts
open-vocabulary segmentation

This means we are not restricted to predefined classes.

Given a concept phrase, SAM3 produces:

pixel-level segmentation masks
bounding boxes
confidence scores

This makes SAM3 ideal for integration with a language-based reasoning system.

The Agentic Feedback Loop: Reasoning, Verification, and Refinement

The most important part of this system is the agentic loop.

Instead of a linear pipeline, we build a feedback-driven process.

Step-by-step:

Generate a segmentation concept
Run segmentation using SAM3
Evaluate the output using the VLM

If the output is incorrect:

identify what went wrong
refine the concept
retry segmentation
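
The feedback loop above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the pipeline we build later in this lesson: agentic_loop, propose, segment, and verify are hypothetical names, and the three toy callables only stand in for the VLM and SAM3 calls.

```python
from typing import Callable, Optional, Tuple


def agentic_loop(
    instruction: str,
    propose: Callable[[str], str],
    segment: Callable[[str], list],
    verify: Callable[[str, str, list], Tuple[bool, Optional[str]]],
    max_rounds: int = 3,
) -> Tuple[str, list]:
    """Generate a concept, segment, verify, and refine until accepted."""
    concept = propose(instruction)
    used, result = concept, []
    for _ in range(max_rounds):
        used = concept                      # the concept that produces `result`
        result = segment(used)
        ok, better = verify(instruction, used, result)
        if ok:
            break
        # Keep the current concept if the verifier offers no alternative.
        concept = better or used
    return used, result


# Toy stand-ins (assumptions for illustration only).
def toy_propose(instruction: str) -> str:
    return "leftmost brown grocery bag"


def toy_segment(concept: str) -> list:
    # Pretend the segmenter only recognizes the plain word "bag".
    return ["mask_0"] if concept == "bag" else []


def toy_verify(instruction: str, concept: str, result: list):
    return (True, None) if result else (False, "bag")


final_concept, masks = agentic_loop(
    "the bag on the leftmost side", toy_propose, toy_segment, toy_verify
)
print(final_concept, masks)
```

Passing the models in as callables keeps the control flow easy to test without loading any weights.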

Why Agentic Segmentation Outperforms One-Shot Models

This loop introduces several important capabilities:

Self-correction: The system can recover from incorrect predictions
Robustness: It handles ambiguous or complex instructions better
Generalization: It works with open-ended language instead of fixed labels
Improved alignment: Outputs better match user intent over iterations

Final Output: Agentic Vision System with Segmentation and Reasoning

By the end of this tutorial, we build a system that:

understands natural language instructions
converts them into structured segmentation concepts
performs open-vocabulary segmentation
verifies its own outputs
improves results through iterative refinement

This represents a shift from static, one-shot predictions to dynamic, reasoning-driven vision systems.

Key Takeaway: VLM + SAM 3 = Intelligent Vision Agent

The real power of this system is not just segmentation.

It is the collaboration between models:

the VLM provides reasoning
SAM3 provides perception
the loop provides intelligence

Together, they form an agentic vision system that can think, act, and improve.

Configuring Your Development Environment

To follow this guide, you need to have the following libraries installed on your system.

!pip install -q transformers accelerate pillow torch torchvision bitsandbytes

First, we install the transformers library. This library provides access to a wide range of pretrained models, including the Vision-Language Model we will use in this project.

Next, we install accelerate, which helps efficiently run large models across GPUs and manage device placement automatically.

After that, we install pillow, a lightweight Python library used for image loading and processing. We will use this library to read images and prepare them for model inference.

We also install torch, which serves as the core deep learning framework for this project. Both the Vision-Language Model and the segmentation model rely on torch for tensor computations and GPU acceleration.

Along with torch, we install torchvision, which provides datasets, transforms, and model utilities for computer vision tasks.

Finally, we install bitsandbytes. This library enables efficient memory usage when working with large models by supporting quantization and optimized GPU kernels.

The -q flag runs the installation in quiet mode, reducing unnecessary output in the notebook.
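
After installing, it can help to confirm that every package is importable before loading any models. The snippet below is a small optional sketch using only the standard library; missing_packages is a hypothetical helper name, not part of any of the libraries above.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names match the pip names except for pillow, which is imported as PIL.
required = ["transformers", "accelerate", "PIL", "torch", "torchvision", "bitsandbytes"]
print("Missing:", missing_packages(required))
```

An empty list means all six packages were found.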


Python Setup and Imports for Agentic AI Vision System

Now that our environment is ready, we import the libraries required to build our agentic vision system. These libraries will help us perform deep learning inference, process images, visualize segmentation outputs, and load the models.

import torch
import numpy as np
import os
import json
from PIL import Image, ImageDraw
import matplotlib
import matplotlib.pyplot as plt
from transformers import (
   AutoProcessor,
   Qwen2_5_VLForConditionalGeneration,
   Sam3Model,
   Sam3Processor,
)

First, we import torch. This is the primary deep learning framework used to run both the Vision-Language Model and the segmentation model. PyTorch handles tensor computations and GPU acceleration during inference.

Next, we import numpy, a popular library for numerical computing in Python. We will use NumPy when working with arrays such as segmentation masks and bounding boxes returned by the segmentation model.

After that, we import the os and json libraries. The os module helps us manage file paths and directories, while the json module allows us to parse structured responses generated by the Vision-Language Model.

Next, we import Image and ImageDraw from the Pillow library. Pillow is a lightweight image processing library that allows us to load, manipulate, and display images. In this project, we will use it to read input images and create segmentation overlays.

Then, we import matplotlib, which we will use to visualize the results. Specifically, we use matplotlib.pyplot to create figures that display the original image, bounding boxes, and segmentation masks.

Finally, we import several classes from the transformers library. These classes allow us to load and run the models used in our system.

The AutoProcessor class automatically prepares inputs for multimodal models by handling both text and image preprocessing.
The Qwen2_5_VLForConditionalGeneration class loads the Qwen2.5-VL Vision-Language Model, which will interpret user instructions and generate segmentation prompts.
The Sam3Model and Sam3Processor classes load the SAM3 segmentation model and prepare its inputs.

Before loading the models, we configure PyTorch to use optimized GPU settings. These settings help improve inference performance, especially when running large multimodal models.

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype  = torch.bfloat16 if device == "cuda" else torch.float32
print(f"Using device: {device}, dtype: {dtype}")

First, we enable TensorFloat-32 (TF32) support in PyTorch. TF32 is a numerical format supported by NVIDIA GPUs from the Ampere generation onward. It speeds up float32 matrix multiplications and convolutions by computing with a shorter mantissa, which is usually accurate enough for inference. Because we load the models in bfloat16 on the GPU, these flags mainly benefit operations that still run in float32.

Next, we determine which device will be used for inference. Here, we check whether a CUDA-enabled GPU is available. If a GPU is detected, the system runs on "cuda". Otherwise, it falls back to the CPU.

After that, we configure the tensor precision. When running on a GPU, we use bfloat16 precision. This reduces memory usage and speeds up computation while preserving enough numerical accuracy for inference tasks.

If the system runs on a CPU, we instead use the standard float32 precision, which ensures compatibility with CPU computations.

Finally, we print the device configuration. This helps confirm whether the system is using the GPU and which precision mode is active. This information is useful when debugging performance or memory issues during model inference.
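
To see why the precision choice matters for memory, here is a rough back-of-the-envelope calculation. It is only a sketch: the 7 billion figure is an assumed round number for a "7B" model, and the estimate covers the weights alone, not activations or caches.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


ASSUMED_PARAMS = 7e9  # assumption: a round figure for a "7B" model

bf16_gib = weight_memory_gib(ASSUMED_PARAMS, 2)  # bfloat16 stores 2 bytes per value
fp32_gib = weight_memory_gib(ASSUMED_PARAMS, 4)  # float32 stores 4 bytes per value
print(f"bfloat16: {bf16_gib:.1f} GiB, float32: {fp32_gib:.1f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is often the difference between fitting on a single GPU or not.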

Loading SAM 3 and Qwen Vision-Language Models in Transformers

Now that the environment is configured, we load the two core models used in our agentic vision system: a Vision-Language Model (VLM) and a segmentation model.

The VLM will interpret the user’s instruction and generate a clean segmentation concept. The segmentation model will then use that concept to detect and segment objects in the image.

VLM_MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # swap for Qwen/Qwen3-VL-8B once released in transformers
SAM_MODEL_ID = "facebook/sam3"

print("Loading VLM...")
vlm_processor = AutoProcessor.from_pretrained(VLM_MODEL_ID, trust_remote_code=True)
vlm_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
   VLM_MODEL_ID,
   device_map="auto",
   torch_dtype=dtype,
   trust_remote_code=True,
)
vlm_model.eval()
print("VLM loaded.")

print("Loading SAM3...")
sam_processor = Sam3Processor.from_pretrained(SAM_MODEL_ID)
sam_model = Sam3Model.from_pretrained(SAM_MODEL_ID, torch_dtype=dtype).to(device)
sam_model.eval()
print("SAM3 loaded.")

First, we define the model identifiers. These identifiers correspond to the pretrained models hosted on the Hugging Face model hub.

The Qwen2.5-VL-7B-Instruct model is a Vision-Language Model capable of understanding both images and text instructions. We will use this model to interpret the user’s request and generate segmentation prompts.

The second model, SAM3, is an open-vocabulary segmentation model that can segment objects based on text prompts.

Next, we load the Vision-Language Model. We first load the processor associated with the model. The processor prepares the inputs required by the VLM, including tokenizing text prompts and preprocessing images.

The trust_remote_code=True argument allows the Transformers library to download and execute custom Python code provided by the model repository, so enable it only for repositories you trust.

Next, we load the model itself. The from_pretrained() method downloads the pretrained model weights and initializes the model architecture.

The device_map="auto" argument relies on accelerate to place the model's layers across the available devices automatically, which is useful when a large model does not fit comfortably on a single GPU.

We also specify torch_dtype=dtype, which ensures the model runs using the precision we configured earlier: bfloat16 on GPU or float32 on CPU.

After loading the model, we switch it to evaluation mode. Evaluation mode disables training-specific behaviors such as dropout, ensuring consistent inference results.

Next, we load the segmentation model. Similar to the VLM, we first load the Sam3Processor. This processor handles preprocessing tasks such as preparing the input image and formatting segmentation prompts.

Next, we load the SAM3 model. The from_pretrained() function loads the segmentation model weights, and we move the model to the appropriate device using .to(device).

Finally, we set the model to evaluation mode. At this point, both models are fully initialized. The Vision-Language Model will interpret user instructions, while SAM3 will perform open-vocabulary segmentation based on those instructions.

Implementing VLM Inference for Agentic Vision Reasoning with Qwen2.5-VL

Now that our models are loaded, we implement a helper function that allows us to run inference using the Vision-Language Model. This function will take an image and a list of chat messages as input and return the model’s response.

In our agentic pipeline, this function plays a very important role. We will use it to:

extract a clean segmentation prompt from the user instruction
refine prompts if segmentation fails
verify whether the segmentation results match the user intent

def vlm_generate(image: Image.Image, messages: list, max_new_tokens: int = 512) -> str:
   """
   Mirrors: send_generate_request()
   Runs VLM inference given a list of chat messages and returns the reply string.
   """
   text_input = vlm_processor.apply_chat_template(
       messages, tokenize=False, add_generation_prompt=True
   )
   inputs = vlm_processor(
       text=[text_input],
       images=[image],
       return_tensors="pt",
   )
   inputs = {k: v.to(vlm_model.device) for k, v in inputs.items()}
   input_len = inputs["input_ids"].shape[1]

   with torch.no_grad():
       generated_ids = vlm_model.generate(
           **inputs,
           max_new_tokens=max_new_tokens,
           do_sample=False,
       )

   new_tokens = generated_ids[0][input_len:]
   return vlm_processor.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

First, we define the function vlm_generate. This function takes three inputs:

image: the input image that the model will analyze
messages: a list of chat-style prompts used to guide the model
max_new_tokens: the maximum number of tokens the model can generate

The function returns a string response produced by the Vision-Language Model.

Next, we convert the chat messages into the format expected by the model. Many modern Vision-Language Models use a chat-style interface similar to conversational AI systems. The apply_chat_template() method converts the list of messages into a properly formatted text prompt that the model understands.

The argument add_generation_prompt=True tells the processor that the model should generate a response after the provided messages.
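
For reference, the messages argument is a plain list of dictionaries. The sketch below builds one system turn and one user turn that pairs an image with text, following the message layout used in this lesson; build_messages is a hypothetical helper name, and the None image is only a placeholder.

```python
from typing import Any


def build_messages(system_text: str, image: Any, user_text: str) -> list:
    """Build a two-turn chat: a system instruction and a user turn with image + text."""
    return [
        {"role": "system", "content": system_text},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": user_text},
            ],
        },
    ]


messages = build_messages(
    "You are a precise vision assistant.",
    image=None,  # placeholder; pass a PIL image in real use
    user_text='User description: "the bag on the leftmost side"',
)
print([m["role"] for m in messages])
```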

Next, we prepare the inputs for the model. Here, we pass both the text prompt and the image to the processor. The processor converts these inputs into tensors that can be processed by the model. The argument return_tensors="pt" ensures the outputs are returned as PyTorch tensors.

Next, we move the tensors to the same device as the model. This step ensures that both the model and the input tensors reside on the same device, either the GPU or CPU.

After that, we store the length of the input tokens. This value helps us determine which tokens belong to the model’s generated response, rather than the original prompt.

Next, we perform inference using the model. We use torch.no_grad() to disable gradient computations. Since we are only performing inference, this reduces memory usage and improves performance.

Inside this block, we generate the model’s output. The generate() function performs autoregressive text generation. The parameter max_new_tokens limits the length of the generated response. We also set do_sample=False, which selects greedy decoding, so the same input produces the same output instead of a randomly sampled one.

Next, we extract only the tokens generated by the model. This removes the original prompt tokens, leaving only the newly generated tokens.
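
The slicing step is easy to picture with ordinary lists. In this small sketch, the token IDs are made up for illustration and strip_prompt_tokens is a hypothetical name. The point is only that generate() returns the prompt tokens followed by the new tokens, so slicing at the prompt length leaves the reply.

```python
def strip_prompt_tokens(generated_ids: list, input_len: int) -> list:
    """Drop the first input_len tokens, which belong to the prompt."""
    return generated_ids[input_len:]


prompt_ids = [101, 7592, 2088]   # made-up IDs standing in for the prompt
reply_ids = [2023, 2003, 102]    # made-up IDs standing in for the reply
generated = prompt_ids + reply_ids

new_tokens = strip_prompt_tokens(generated, len(prompt_ids))
print(new_tokens)
```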

Finally, we convert the generated tokens into readable text. The decode() method converts token IDs back into text. We also remove special tokens and strip unnecessary whitespace.

At this point, the function returns the final response generated by the Vision-Language Model.

This function will serve as the core interface between our agentic system and the Vision-Language Model. In the next sections, we will use it to extract segmentation prompts and evaluate the outputs produced by the segmentation model.

Implementing the SAM 3 Text-Prompted Segmentation Function

Now, we implement a helper function that runs segmentation using the SAM3 model. This function will take an input image and optional prompts, run the SAM3 model, and return the segmentation results.

In our agentic pipeline, this function serves as the tool used by the agent to perform segmentation.

Specifically, it returns three important outputs:

segmentation masks
bounding boxes
confidence scores

def call_sam(
   image: Image.Image,
   text_prompt: str   = None,
   input_boxes        = None,   # list of [x1,y1,x2,y2]
   input_boxes_labels = None,   # list of 0/1 labels per box
   threshold: float   = 0.5,
) -> dict:
   """
   Mirrors: call_sam_service()
   Returns dict with keys: masks, boxes, scores (all as numpy arrays).
   """
   kwargs = dict(images=image, return_tensors="pt")
   if text_prompt:
       kwargs["text"] = text_prompt
   if input_boxes is not None:
       kwargs["input_boxes"] = [input_boxes]
       kwargs["input_boxes_labels"] = [input_boxes_labels or [1] * len(input_boxes)]

   inputs = sam_processor(**kwargs).to(device)

   with torch.no_grad():
       outputs = sam_model(**inputs)

   results = sam_processor.post_process_instance_segmentation(
       outputs,
       threshold=threshold,
       mask_threshold=0.5,
       target_sizes=inputs.get("original_sizes").tolist(),
   )[0]

   return {
       "masks":  results["masks"].cpu().numpy(),                          # [N, H, W] bool
       "boxes":  results["boxes"].cpu().to(torch.float32).numpy(),        # [N, 4]    xyxy
       "scores": results["scores"].cpu().to(torch.float32).numpy(),       # [N]
   }

First, we define the function call_sam. This function accepts several inputs:

The image parameter is the input image that we want to segment.
The text_prompt parameter allows us to perform concept-based segmentation. SAM3 can segment objects using natural language prompts such as "bag" or "leftmost bag".
The input_boxes parameter allows us to guide the segmentation model using bounding boxes. Each box is defined by four coordinates: [x1, y1, x2, y2]
Similarly, input_boxes_labels specifies whether each box corresponds to a positive or negative prompt.
Finally, the threshold parameter determines the confidence threshold used when filtering segmentation results.

Next, we prepare the inputs required by the SAM3 processor.

Here, we create a dictionary containing the image input. The return_tensors="pt" argument ensures that the processed outputs are returned as PyTorch tensors.

If a text prompt is provided, we include it in the input dictionary. This allows SAM3 to perform text-guided segmentation.

Next, we check whether bounding boxes are provided. If bounding boxes exist, we pass them to the processor along with their labels. If no labels are specified, we automatically assign positive labels (1) to all boxes.

Next, we preprocess the inputs using the SAM3 processor. The processor converts the image, prompts, and bounding boxes into tensors that the model can understand. We also move these tensors to the selected device (GPU or CPU).

Now we perform inference using SAM3. We wrap the inference step inside torch.no_grad() to disable gradient calculations. Since we are performing inference only, this improves performance and reduces memory usage. The model returns raw segmentation outputs.

Next, we convert the raw model outputs into usable segmentation results. The post_process_instance_segmentation() function performs several important tasks:

filters predictions using the confidence threshold
converts predicted masks to the correct image resolution
extracts bounding boxes and scores

The [0] index retrieves the results corresponding to the input image.
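
To make the filtering step concrete, here is a simplified stand-in for what a confidence threshold does. It is not the library's implementation: filter_by_score is a hypothetical helper, the detections are made up, and whether the real post-processing uses a strict or non-strict comparison is an assumption here.

```python
def filter_by_score(boxes, scores, threshold=0.5):
    """Keep only the instances whose confidence score exceeds the threshold."""
    kept_boxes, kept_scores = [], []
    for box, score in zip(boxes, scores):
        if score > threshold:
            kept_boxes.append(box)
            kept_scores.append(score)
    return kept_boxes, kept_scores


# Made-up detections for illustration.
boxes = [[0, 0, 10, 10], [5, 5, 20, 20], [1, 1, 2, 2]]
scores = [0.9, 0.3, 0.6]

kept_boxes, kept_scores = filter_by_score(boxes, scores, threshold=0.5)
print(kept_boxes, kept_scores)
```

Raising the threshold trades recall for precision: fewer instances survive, but the ones that do are more confident.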

Finally, we return the segmentation results. The function returns a dictionary containing three elements.

The masks array contains the segmentation masks with shape: [N, H, W] where N represents the number of detected objects.
The boxes array contains the bounding box coordinates in the format: [x1, y1, x2, y2]
Finally, the scores array contains the confidence score for each detected object.

We also move the tensors to the CPU and convert them into NumPy arrays. This makes them easier to process and visualize in later steps.

At this point, the call_sam() function provides a simple interface for running SAM3 segmentation within our agentic vision pipeline.
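
Once the results are plain arrays, simple geometric checks become straightforward. As an optional illustration (not part of the pipeline in this lesson), the sketch below picks the leftmost box and the highest-scoring instance from xyxy boxes. leftmost_index and best_score_index are hypothetical helpers, and the sample values are made up.

```python
def leftmost_index(boxes) -> int:
    """Index of the box with the smallest x1 (boxes are [x1, y1, x2, y2])."""
    if len(boxes) == 0:
        raise ValueError("no boxes to choose from")
    return min(range(len(boxes)), key=lambda i: boxes[i][0])


def best_score_index(scores) -> int:
    """Index of the instance with the highest confidence score."""
    if len(scores) == 0:
        raise ValueError("no scores to choose from")
    return max(range(len(scores)), key=lambda i: scores[i])


# Made-up results in the same layout as the "boxes" and "scores" entries.
boxes = [[120, 40, 200, 180], [15, 50, 90, 170], [230, 30, 300, 160]]
scores = [0.71, 0.64, 0.93]

print(leftmost_index(boxes), best_score_index(scores))
```

Both helpers index their inputs directly, so they accept Python lists as well as NumPy arrays.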

Implementing the Agentic AI Segmentation Pipeline with Iterative Refinement

Now we implement the core function of our system. This function orchestrates the entire agentic workflow by combining the Vision-Language Model and the segmentation model.

Instead of running segmentation only once, the system follows an agentic loop: the Vision-Language Model interprets the user request, SAM3 runs segmentation, and the VLM then verifies the result and refines the prompt if needed.

def run_single_image_inference(
   image_path: str,
   user_prompt: str,
   max_agent_rounds: int = 3,
   seg_threshold: float  = 0.5,
   output_dir: str       = "agent_output",
   debug: bool           = True,
) -> str | None:
   """
   Mirrors: run_single_image_inference() from sam3.agent.inference

   Agentic loop:
     Round 1 — VLM reads image + user prompt → produces a concise SAM3 concept phrase
     Round 2 — SAM3 segments with that phrase → VLM verifies / refines if needed
     Round N — repeat until VLM is satisfied or max_agent_rounds reached
   Returns path to the saved output image (or None on failure).
   """
   os.makedirs(output_dir, exist_ok=True)
   image = Image.open(image_path).convert("RGB")

   # ── Round 1: VLM extracts a clean SAM3 text prompt ──────────────────────
   extraction_messages = [
       {
           "role": "system",
           "content": (
               "You are a precise vision assistant. "
               "Your job is to convert a user's free-form description into a SHORT, "
               "clean object concept phrase suitable for an open-vocabulary segmentation model. "
               "Reply with ONLY a JSON object: {\"sam_prompt\": \"\"}. "
               "No explanation, no markdown, just the JSON."
           ),
       },
       {
           "role": "user",
           "content": [
               {"type": "image", "image": image},
               {"type": "text",  "text": f"User description: \"{user_prompt}\""},
           ],
       },
   ]

The run_single_image_inference function serves as the main entry point of our agentic vision system. It accepts several inputs:

image_path: the path to the image we want to analyze
user_prompt: the natural language description of the object to segment
max_agent_rounds: the maximum number of refinement iterations
seg_threshold: the confidence threshold for segmentation
output_dir: the directory where the output image will be saved
debug: a flag that enables detailed logging

The function returns the path of the saved output image or None if segmentation fails.

First, we create the output directory and load the image. The os.makedirs() function ensures that the output directory exists. If the directory already exists, the exist_ok=True argument prevents an error. Next, we open the input image using Pillow and convert it to RGB format.

Here, we define a system message that instructs the Vision-Language Model to convert the user description into a short concept phrase. The SAM3 model performs better with short noun-style prompts such as:

leftmost bag
red apple
wooden chair

rather than long sentences.

We also include the user input. This message contains both the image and the user instruction.

   if debug:
       print(f"\n[Agent] Round 1 — extracting SAM3 prompt from: '{user_prompt}'")

   vlm_reply = vlm_generate(image, extraction_messages)
   if debug:
       print(f"[Agent] VLM raw reply: {vlm_reply}")

   # Parse the JSON; fall back to raw reply if needed
   try:
       clean = vlm_reply.strip().lstrip("```json").rstrip("```").strip()
       sam_prompt = json.loads(clean)["sam_prompt"]
   except Exception:
       sam_prompt = user_prompt  # graceful fallback
   if debug:
       print(f"[Agent] SAM3 prompt → '{sam_prompt}'")

Next, we call the VLM inference function. The Vision-Language Model analyzes the image and generates a clean segmentation prompt.

For example:

User prompt: "the bag on the leftmost side"
Model output: {"sam_prompt": "leftmost bag"}

Next, we extract the segmentation prompt from the JSON response. This step strips any Markdown code-fence markers the model may have wrapped around its reply, parses the JSON string into a Python dictionary, and reads the sam_prompt field.

If the response cannot be parsed, we fall back to the original user prompt.
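
The parsing step can also be written as a small standalone helper. One detail worth knowing is that str.lstrip() and str.rstrip() remove a set of characters rather than an exact prefix or suffix. That works for replies that begin with "{", but explicit prefix checks are easier to reason about. The sketch below is one possible variant, not the code used in this lesson. parse_sam_prompt is a hypothetical name, and treating an empty sam_prompt as a failure is an added assumption.

```python
import json

FENCE = "`" * 3  # a Markdown code-fence marker


def parse_sam_prompt(reply: str, fallback: str) -> str:
    """Extract "sam_prompt" from a VLM reply, or return the fallback."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence, with or without a "json" language tag.
        text = text[len(FENCE):]
        if text.startswith("json"):
            text = text[len("json"):]
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    try:
        value = json.loads(text.strip()).get("sam_prompt")
    except (json.JSONDecodeError, AttributeError):
        return fallback  # not JSON, or JSON that is not an object
    if isinstance(value, str) and value.strip():
        return value.strip()
    return fallback  # missing or empty field


fenced_reply = FENCE + 'json\n{"sam_prompt": "leftmost bag"}\n' + FENCE
print(parse_sam_prompt(fenced_reply, fallback="the bag on the leftmost side"))
print(parse_sam_prompt("Sorry, I cannot help.", fallback="the bag on the leftmost side"))
```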

   # ── Agentic segmentation loop ────────────────────────────────────────────
   sam_result = None
   final_prompt = sam_prompt

   for round_idx in range(max_agent_rounds):
       if debug:
           print(f"\n[Agent] Round {round_idx + 2} — calling SAM3 with '{final_prompt}'")

       sam_result = call_sam(image, text_prompt=final_prompt, threshold=seg_threshold)
       n_masks = len(sam_result["masks"])
       if debug:
           print(f"[Agent] SAM3 found {n_masks} instance(s)")

Now we begin the agentic segmentation loop. Here, we initialize two variables:

sam_result: stores the segmentation output
final_prompt: stores the prompt used for segmentation

Next, we enter the iterative loop. This loop allows the system to refine segmentation prompts up to a maximum number of rounds.

Inside the loop, we call the SAM3 segmentation function. This function returns segmentation results including masks, bounding boxes, and confidence scores.

Next, we count the number of detected objects. This value helps determine whether the segmentation succeeded.

       # ── Verification: ask VLM if the result looks right ─────────────────
       if n_masks == 0:
           # No masks found — ask VLM to rephrase
           refine_messages = [
               {
                   "role": "system",
                   "content": (
                       "You are a vision assistant helping refine segmentation prompts. "
                       "The segmentation model found NO objects. "
                       "Suggest a simpler or broader alternative concept phrase. "
                       "Reply ONLY with JSON: {\"sam_prompt\": \"\"}."
                   ),
               },
               {
                   "role": "user",
                   "content": [
                       {"type": "image", "image": image},
                       {"type": "text",  "text": (
                           f"Original user intent: \"{user_prompt}\". "
                           f"Failed prompt: \"{final_prompt}\". "
                           "Suggest a better phrase."
                       )},
                   ],
               },
           ]
           vlm_reply = vlm_generate(image, refine_messages)
           if debug:
               print(f"[Agent] VLM refine reply: {vlm_reply}")
           try:
               clean = vlm_reply.strip().lstrip("```json").rstrip("```").strip()
               final_prompt = json.loads(clean)["sam_prompt"] or final_prompt  # keep the old prompt if the reply is empty
           except Exception:
               break  # give up if we can't parse

If SAM3 fails to detect any objects, we ask the Vision-Language Model to refine the segmentation prompt. We construct a new prompt asking the model to generate a simpler or broader concept phrase.

For example:

Original prompt: "leftmost brown grocery bag"
Suggested prompt: "bag"

The VLM then generates a new segmentation prompt, and the loop repeats.
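
To illustrate what "simpler or broader" can mean, here is a purely rule-based sketch that drops leading modifiers one word at a time. It is not how the VLM refines prompts in this lesson. broaden_candidates is a hypothetical helper, and it assumes the object noun is the last word of the phrase.

```python
def broaden_candidates(phrase: str) -> list:
    """Drop leading words one at a time; the last word is assumed to be the object noun."""
    words = phrase.split()
    return [" ".join(words[i:]) for i in range(1, len(words))]


print(broaden_candidates("leftmost brown grocery bag"))
```

A heuristic like this could serve as a last resort when the VLM reply cannot be parsed, although it ignores the image entirely.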

       else:
           # We have masks — ask VLM to verify they match the user intent
           verify_messages = [
               {
                   "role": "system",
                   "content": (
                       "You are a vision QA assistant. "
                       "Given the original user intent and the segmentation result metadata, "
                       "decide if the segmentation is correct. "
                       "Reply ONLY with JSON: {\"ok\": true/false, \"reason\": \"...\", \"sam_prompt\": \"\"}."
                   ),
               },
               {
                   "role": "user",
                   "content": [
                       {"type": "image", "image": image},
                       {"type": "text",  "text": (
                           f"User intent: \"{user_prompt}\".\n"
                           f"SAM3 was given prompt: \"{final_prompt}\".\n"
                           f"Result: {n_masks} mask(s) found, "
                           f"scores: {sam_result['scores'].tolist()}, "
                           f"boxes: {sam_result['boxes'].tolist()}.\n"
                           "Is this correct? If yes, ok=true. If not, provide a better sam_prompt."
                       )},
                   ],
               },
           ]
           vlm_reply = vlm_generate(image, verify_messages, max_new_tokens=256)
           if debug:
               print(f"[Agent] VLM verify reply: {vlm_reply}")
           try:
               clean = vlm_reply.strip().lstrip("```json").rstrip("```").strip()
               verdict = json.loads(clean)
               if verdict.get("ok", True):
                   if debug:
                       print("[Agent] VLM verified result ✓ — stopping.")
                   break
               else:
                   final_prompt = verdict.get("sam_prompt", final_prompt)
                   if debug:
                       print(f"[Agent] VLM says not ok → retrying with '{final_prompt}'")
           except Exception:
               break  # can't parse verdict, accept current result

If SAM3 successfully detects objects, we verify whether the result matches the user intent.

In this step, we ask the Vision-Language Model to evaluate the segmentation results.

The model receives:

the original user instruction
the segmentation prompt used
the number of detected masks
the confidence scores
the bounding boxes

Based on this information, the model decides whether the segmentation result is correct.

The model returns a JSON response such as:

{
"ok": true,
"reason": "correct object detected"
}

{
"ok": false,
"sam_prompt": "bag"
}

If the segmentation is incorrect, the system updates the segmentation prompt. The loop then repeats using the new prompt. If the segmentation result is correct, the loop stops. This verification step allows the system to self-correct its segmentation decisions.

   # ── Render and save output ───────────────────────────────────────────────
   if sam_result is None or len(sam_result["masks"]) == 0:
       print("[Agent] No masks produced — check your prompt or image.")
       return None

   output_path = os.path.join(
       output_dir,
       os.path.splitext(os.path.basename(image_path))[0] + "_segmented.png"
   )
   _save_overlay(image, sam_result, output_path, title=f'"{user_prompt}"')
   print(f"\n[Agent] Output saved → {output_path}")
   return output_path

After the agentic loop finishes, we check whether segmentation succeeded. If no objects were detected, the function prints a warning and returns None. Otherwise, we build the output path inside output_dir from the input file name, with a _segmented.png suffix.

Finally, we call _save_overlay to visualize the segmentation results. It renders the segmentation masks and bounding boxes and saves the resulting image to disk.

Taken as a whole, this function implements the agentic reasoning loop that makes our system powerful.

Instead of relying on a single segmentation attempt, the system:

interprets the user request
generates a segmentation prompt
runs segmentation
evaluates the results
refines the prompt if necessary

This iterative process allows the system to produce more accurate results and demonstrates how multiple AI models can collaborate within an agentic vision pipeline.
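
The five steps above can be sketched as a small control loop. This is only an illustration of the control flow, not the tutorial's actual function: the callables extract_prompt, segment, verify, and refine are hypothetical stand-ins for the VLM and SAM3 calls, and the stubs in the demo assume that only the bare noun "bag" can be segmented.

```python
from typing import Callable, Optional, Tuple


def agentic_segment(
    user_prompt: str,
    extract_prompt: Callable[[str], str],
    segment: Callable[[str], int],
    verify: Callable[[str, str, int], Tuple[bool, str]],
    refine: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (prompt_used, n_masks) after at most max_rounds segmentation attempts."""
    prompt = extract_prompt(user_prompt)           # interpret request, build first prompt
    used_prompt, n_masks = None, 0
    for _ in range(max_rounds):
        used_prompt, n_masks = prompt, segment(prompt)  # run segmentation
        if n_masks == 0:
            prompt = refine(user_prompt, prompt)   # nothing found: ask for a new phrase
            continue
        ok, suggestion = verify(user_prompt, prompt, n_masks)  # evaluate the result
        if ok:
            break
        prompt = suggestion or prompt              # verifier proposed a better phrase
    return (used_prompt if n_masks else None), n_masks


# Stub components (hypothetical): only the bare noun "bag" is segmentable.
tried = []


def fake_segment(prompt: str) -> int:
    tried.append(prompt)
    return 1 if prompt == "bag" else 0


result = agentic_segment(
    "the bag on the leftmost side",
    extract_prompt=lambda text: "leftmost paper bag",
    segment=fake_segment,
    verify=lambda intent, prompt, n: (True, ""),
    refine=lambda intent, failed: "bag",
)
print(result)  # ('bag', 1)
print(tried)   # ['leftmost paper bag', 'bag']
```

Passing the models in as callables keeps the loop testable without loading either model.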

Visualizing and Saving the Segmentation Results

After running the agentic segmentation pipeline, we want to visualize the results in a clear and interpretable way. For this purpose, we implement a helper function that overlays the segmentation masks and bounding boxes on top of the original image.

This function generates a side-by-side visualization showing both the detected bounding boxes and the segmentation masks.

def _save_overlay(image: Image.Image, sam_result: dict, output_path: str, title: str = ""):
   masks  = sam_result["masks"]
   boxes  = sam_result["boxes"]
   scores = sam_result["scores"]

   fig, axes = plt.subplots(1, 2, figsize=(16, 8))

   # Left: original + boxes
   axes[0].imshow(image)
   axes[0].set_title(f"Detected boxes  |  {title}", fontsize=11)
   axes[0].axis("off")
   cmap = matplotlib.colormaps.get_cmap("rainbow").resampled(max(len(masks), 1))
   for i, (box, score) in enumerate(zip(boxes, scores)):
       x1, y1, x2, y2 = box
       color = cmap(i)[:3]
       rect = plt.Rectangle(
           (x1, y1), x2 - x1, y2 - y1,
           linewidth=2, edgecolor=color, facecolor="none"
       )
       axes[0].add_patch(rect)
       axes[0].text(x1, y1 - 4, f"{score:.2f}", color=color, fontsize=9, fontweight="bold")

   # Right: mask overlay
   composite = image.convert("RGBA")
   for i, mask in enumerate(masks):
       color = tuple(int(c * 255) for c in cmap(i)[:3])
       mask_img = Image.fromarray((mask * 255).astype(np.uint8))
       overlay  = Image.new("RGBA", composite.size, color + (0,))
       overlay.putalpha(mask_img.point(lambda v: int(v * 0.5)))
       composite = Image.alpha_composite(composite, overlay)

   axes[1].imshow(composite)
   axes[1].set_title(f"SAM3 masks  ({len(masks)} instance(s))", fontsize=11)
   axes[1].axis("off")

   plt.tight_layout()
   plt.savefig(output_path, dpi=150, bbox_inches="tight")
   plt.close()

We begin by defining the _save_overlay function, which takes the original image, the segmentation output from SAM3, the output path, and an optional title. From the segmentation results, we extract the masks, bounding boxes, and confidence scores. The masks represent pixel-level regions for each detected object, the boxes define object boundaries, and the scores indicate how confident the model is for each detection.

To visualize these results, we create a figure with two side-by-side panels. The left panel displays the original image along with bounding boxes, while the right panel shows the segmentation masks overlaid on the image.

The process starts by rendering the original image and assigning a distinct color to each detected object using a colormap. For every detection, we draw a rectangle corresponding to its bounding box and place the confidence score near it. This provides a quick overview of what the model has detected and how reliable those detections are.

For the mask visualization, the image is first converted to RGBA format so that transparent overlays can be applied. Each segmentation mask is then assigned a color, converted into an image, and used to create a semi-transparent overlay. These overlays are composited onto the original image, allowing the segmented regions to stand out while still preserving the underlying content.
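
The compositing step amounts to a per-pixel alpha blend between the image and a flat color. A minimal NumPy sketch of the same idea is shown below. It is not the tutorial's Pillow code: the function name blend_mask is hypothetical, and it uses an exact opacity of 0.5, whereas the Pillow version works with 8-bit alpha values.

```python
import numpy as np


def blend_mask(image: np.ndarray, mask: np.ndarray, color, alpha: float = 0.5) -> np.ndarray:
    """Blend `color` into `image` (H, W, 3 uint8) wherever the binary `mask` (H, W) is set."""
    weight = alpha * mask.astype(np.float32)[..., None]   # (H, W, 1) per-pixel opacity
    tint = np.asarray(color, dtype=np.float32)            # (3,) overlay color
    blended = (1.0 - weight) * image.astype(np.float32) + weight * tint
    return blended.round().astype(np.uint8)


image = np.full((2, 2, 3), 100, dtype=np.uint8)  # flat gray 2x2 image
mask = np.array([[1, 0], [0, 0]], dtype=bool)    # only the top-left pixel is masked
out = blend_mask(image, mask, color=(200, 0, 0))
print(out[0, 0].tolist(), out[1, 1].tolist())    # [150, 50, 50] [100, 100, 100]
```

Pixels outside the mask keep their original values, which is why the underlying content stays visible.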

The final composite is displayed in the second panel, along with the number of detected instances. The visualization is then saved to disk using a resolution of 150 DPI for clarity, with tight_layout() ensuring proper spacing and bbox_inches="tight" removing unnecessary margins. The figure is closed afterward to free up memory.

This results in a clean and intuitive visualization that combines bounding boxes, confidence scores, and segmentation masks, making it easy to verify the model’s predictions.

Running the Agentic AI Vision System on Real Images

Now that we have implemented all the components of our pipeline, we can run the complete agentic vision system on an example image.

In this step, we provide an image along with a natural language instruction and let the system handle the rest.

output_image_path = run_single_image_inference(
   image_path  = "/content/groceries.jpg",
   user_prompt = "the bag on the leftmost side",
   max_agent_rounds = 3,
   seg_threshold    = 0.5,
   output_dir       = "agent_output",
   debug            = True,
)

if output_image_path:
   img = Image.open(output_image_path)
   img.show()

We begin by calling the run_single_image_inference() function, which executes the complete agentic pipeline. The input image is provided through the image_path parameter, and in this example, we use groceries.jpg. Along with the image, we pass a natural language instruction — “the bag on the leftmost side”. This instruction is intentionally written in free-form language to demonstrate how the system can interpret human-like queries.

The pipeline is configured to allow up to three refinement iterations using max_agent_rounds=3. A confidence threshold of 0.5 is used to filter segmentation results, and the final output is saved to the agent_output directory. Debugging is enabled to log intermediate steps such as prompt generation, segmentation outputs, and verification decisions.

Once the pipeline runs, it returns the path to the output image if segmentation is successful. We then load this image using Pillow and display it. The final visualization includes bounding boxes around detected objects, segmentation masks overlaid on the image, and confidence scores for each detection.

Under the hood, the system follows an iterative process. The Vision-Language Model first analyzes the image and converts the user’s instruction into a concise segmentation prompt. This prompt is passed to SAM3, which generates segmentation masks. The result is then evaluated by the Vision-Language Model to determine whether it matches the user’s intent. If the output is not satisfactory, the prompt is refined and the process repeats. Once the result is verified, the system produces the final visualization and saves it to disk.

Agentic Segmentation Output: Iterative Prompt Refinement in Action

The input image (Figure 1) shows multiple grocery bags placed inside the trunk of a car.

We provide the following natural language instruction:

"the bag on the leftmost side"

This instruction is not a fixed label. Instead, it includes spatial reasoning, which makes the task more challenging for standard segmentation models.

Figure 1: Input Image (source: Sam3 Official Repo assets)

Now let’s examine how the system processes this instruction.

[Agent] Round 1 — extracting SAM3 prompt from: 'the bag on the leftmost side'
[Agent] VLM raw reply: {"sam_prompt": "leftmost paper bag"}

First, the Vision-Language Model interprets the instruction and generates an initial segmentation prompt:

[Agent] SAM3 prompt -> 'leftmost paper bag'

[Agent] Round 2 — calling SAM3 with 'leftmost paper bag'
[Agent] SAM3 found 0 instance(s)

Next, SAM3 attempts segmentation using this prompt.

However, no objects are detected.

This shows an important limitation: SAM3 is sensitive to how the prompt is phrased.

[Agent] VLM refine reply: {"sam_prompt": "leftmost brown paper bag"}

The system does not stop here.

Instead, the Vision-Language Model refines the prompt by adding more descriptive information.

[Agent] Round 3 — calling SAM3 with 'leftmost brown paper bag'
[Agent] SAM3 found 0 instance(s)

Again, SAM3 fails to detect any objects.

At this point, we observe something important: More detailed prompts do not always improve segmentation.

[Agent] VLM refine reply: {"sam_prompt": "leftmost bag"}

Now, the model simplifies the prompt.

This step is critical. Instead of making the prompt more complex, the system makes it more general.

[Agent] Round 4 — calling SAM3 with 'leftmost bag'
[Agent] SAM3 found 1 instance(s)

This time, SAM3 successfully detects the object.

[Agent] VLM verify reply: {
 "ok": true,
 "reason": "The segmentation correctly identifies the leftmost bag as per the user's intent."
 "sam_prompt": ""
}

Finally, the Vision-Language Model verifies the result and confirms that the segmentation is correct.

The agentic loop stops here, and the system saves the final output image with a bounding box and segmentation mask overlaid on the input image.

Figure 2: Agentic AI Iterative Refinement Output (source: image by the author)

The output image (Figure 3) shows:

the detected bounding box around the leftmost bag
the segmentation mask highlighted in color
the correct object selected based on the user’s instruction

Figure 3: Generated Output with bounding box, mask, confidence score (source: image by the author).

Summary

In this lesson, we built an agentic AI vision system that combines a Vision-Language Model with a segmentation model to solve a real-world problem.

Instead of relying on a single model, we designed a pipeline where multiple components work together in a loop. This allows the system to not only perform segmentation, but also understand instructions, evaluate results, and improve itself automatically.

First, we used a Vision-Language Model to interpret the user’s natural language query and convert it into a clean segmentation prompt.

Next, we used SAM3 to perform open-vocabulary segmentation using that prompt.

Then, we introduced an agentic loop where the Vision-Language Model verifies the segmentation output and refines the prompt if necessary.

Finally, we visualized the results by overlaying bounding boxes and segmentation masks on the original image.

This approach highlights an important shift in computer vision. Instead of building static pipelines, we are now moving toward interactive and self-correcting systems that can adapt to user intent.

Such systems can be extended to a wide range of applications, including:

interactive image editing
robotics and autonomous perception
visual assistants
multimodal search systems

In the future, we can further improve this system by:

adding support for multiple images or video inputs
integrating more tools into the agent loop
introducing memory for long-term reasoning
optimizing inference for real-time applications

By combining Vision-Language Models with powerful segmentation models, we take a step closer to building intelligent visual systems that can understand and act on human instructions.

This represents the foundation of next-generation AI systems.

Citation Information

Thakur, P. “Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen,” PyImageSearch, S. Huot, G. Kudriavtsev, and A. Sharma, eds., 2026, https://pyimg.co/ohlwd

@incollection{Thakur_2026_building-an-agentic-ai-vision-system-with-sam-3-and-qwen,
  author = {Piyush Thakur},
  title = {{Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Georgii Kudriavtsev and Aditya Sharma},
  year = {2026},
  url = {https://pyimg.co/ohlwd},
}

The post Agentic AI Vision System: Object Segmentation with SAM 3 and Qwen appeared first on PyImageSearch.

Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3

Puneet Mangla — Mon, 30 Mar 2026 12:45:00 +0000

Table of Contents

Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3
Why Next-Token Prediction Limits DeepSeek-V3
Multi-Token Prediction in DeepSeek-V3: Predicting Multiple Tokens Ahead
DeepSeek-V3 Architecture: Multi-Token Prediction Heads Explained
Gradient Insights for Multi-Token Prediction in DeepSeek-V3
DeepSeek-V3 Training vs. Inference: How MTP Changes Both
Multi-Token Prediction Loss Weighting and Decay for DeepSeek-V3
Step-by-Step Implementation of Multi-Token Prediction Heads in DeepSeek-V3
Integrating Multi-Token Prediction with DeepSeek-V3’s Core Transformer
Theoretical Foundations: MTP, Curriculum Learning, and Auxiliary Tasks
Multi-Token Prediction Benefits: Coherence, Planning, and Faster Convergence
Summary

Citation Information

Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3

In the first three parts of this series, we built the foundation of DeepSeek-V3 by implementing its configuration and Rotary Positional Embeddings (RoPE), exploring the efficiency gains of Multi-Head Latent Attention (MLA), and scaling capacity through the Mixture of Experts (MoE). Each of these components adds a crucial piece to the puzzle, progressively shaping a model that balances performance, scalability, and efficiency. With these building blocks in place, we are now ready to tackle another defining innovation: Multi-Token Prediction (MTP).

Unlike traditional autoregressive training, which supervises only the next token, MTP trains DeepSeek-V3 to forecast several future tokens at every position. As discussed later in this lesson, this gives the model a denser training signal and can speed up convergence at the cost of extra training compute, while inference with the main head stays as cheap as in a standard language model. The approach also improves the model’s ability to capture richer contextual patterns across sequences.

In this lesson, we will explore the theory behind MTP, examine why it represents a leap forward in language modeling, and implement it step by step. As with the earlier lessons, this installment continues our broader mission to reconstruct DeepSeek-V3 from scratch, showing how innovations including RoPE, MLA, MoE, and now MTP fit together into a cohesive architecture that will culminate in the assembly and training of the full model.

This lesson is the 4th in a 6-part series on Building DeepSeek-V3 from Scratch:

DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings
Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture
DeepSeek-V3 from Scratch: Mixture of Experts (MoE)
Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3 (this tutorial)
Lesson 5
Lesson 6

To learn about DeepSeek-V3 and build it from scratch, just keep reading.

Why Next-Token Prediction Limits DeepSeek-V3

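Traditional language models are trained with a simple objective: given tokens $x_{1:t} = (x_1, \ldots, x_t)$, predict the next token $x_{t+1}$. Mathematically, we maximize:

$$\mathcal{L}_{\text{NTP}} = \sum_{t=1}^{T-1} \log P(x_{t+1} \mid x_{1:t})$$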

This autoregressive factorization is elegant and has proven remarkably effective. However, it has a fundamental limitation: the model only receives a training signal for immediate next-token prediction. It never explicitly learns to plan multiple steps ahead.
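
To make the factorization concrete, the sketch below scores a short sequence under a toy model. The bigram table and the function name are illustrative assumptions (a real LLM conditions on the full prefix, not only on the previous token). The point is just that the sequence log-likelihood is a sum of one-step-ahead terms.

```python
import math

# Toy conditional distribution P(next | previous); values are made up for illustration.
BIGRAM = {
    "the": {"cat": 0.5, "mat": 0.5},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
}


def next_token_log_likelihood(tokens):
    """Sum of log P(x_{t+1} | x_t): one term per next-token prediction."""
    total = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        total += math.log(BIGRAM[prev][nxt])
    return total


sentence = ["the", "cat", "sat", "on", "the", "mat"]
# Only the two "the -> ..." steps are uncertain, so the result is 2 * log(0.5).
print(next_token_log_likelihood(sentence))
```

Every term looks exactly one token ahead, which is the limitation the rest of this lesson addresses.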

Consider generating the sentence: “The cat sat on the mat because it was comfortable.” When predicting “because,” the model should already be considering how the sentence will complete — including the subordinate clause, the pronoun reference, and the conclusion. But with next-token prediction alone, there’s no explicit gradient signal encouraging this forward planning. The model might learn it implicitly through exposure to many examples, but we’re not directly optimizing for it.

This limitation becomes especially apparent in tasks requiring long-term coherence (e.g., story generation, multi-paragraph reasoning, or code generation), where later statements must be consistent with earlier declarations. The model can easily generate locally fluent text that globally contradicts itself because its training objective only looks one token ahead.

Multi-Token Prediction in DeepSeek-V3: Predicting Multiple Tokens Ahead

Multi-Token Prediction (Figure 1) addresses this by adding auxiliary prediction heads that forecast multiple tokens into the future. Alongside the standard prediction $P(x_{t+1} \mid x_{1:t})$, we also predict:

$$P(x_{t+2} \mid x_{1:t}, x_{t+1}), \quad P(x_{t+3} \mid x_{1:t}, x_{t+1}, x_{t+2}),$$

and so on for $n$ tokens ahead. Critically, these predictions are computed in parallel during training (not autoregressively): we know all ground truth tokens, so we can supervise all predictions simultaneously.

Figure 1: Multi-Token Prediction Head (source: Dai et al., 2024).

The complete training objective becomes:

$$\mathcal{L}_{\text{MTP}} = \sum_{d=1}^{n} \lambda_d \sum_{t=1}^{T-d} \log P(x_{t+d} \mid x_{1:t}, x_{t+1:t+d-1})$$

where $n$ is the number of future tokens we predict, $\lambda_d$ are weighting coefficients (typically decreasing with distance: $\lambda_1 > \lambda_2 > \ldots$), and we’ve explicitly shown that predictions at depth $d$ condition on both the context up to position $t$ and the intermediate tokens up to $t+d-1$.

DeepSeek-V3 Architecture: Multi-Token Prediction Heads Explained

Implementing MTP requires architectural additions. We can’t just reuse the main language modeling head for future predictions — we need to condition on the intermediate tokens. DeepSeek-V3 implements this through a hierarchy of prediction heads, each specialized for a particular future depth.

Head Architecture: For predicting $d$ tokens ahead (with $d \geq 2$, since $d = 1$ is the main head), we have a head that combines:

The hidden representation from the Transformer at position $t$: $h_t$
The embedding of the token at position $t+d-1$: $E(x_{t+d-1})$

The combination follows:

$$\tilde{h}_t^{(d)} = W_c \left[\text{RMSNorm}(h_t) \,;\, \text{RMSNorm}(E(x_{t+d-1}))\right]$$

where $[\cdot\,;\,\cdot]$ denotes concatenation along the feature dimension and $W_c$ is the combination projection. This combined representation is then processed through a mini-Transformer (lightweight attention and feedforward layers) before projecting to the vocabulary:

$$\text{logits}_{t+d} = W_{\text{out}} \, \text{Transformer}_d\big(\tilde{h}_t^{(d)}\big)$$

The intuition is powerful: to predict token $x_{t+d}$, we start with the representation at position $t$ (encoding all context), incorporate the embedding of token $x_{t+d-1}$ (telling us what word we’ve just generated), process through a small Transformer (allowing the model to refine this combination), and project to vocabulary (producing logits over the vocabulary). This architecture naturally encourages forward planning: the model must learn representations at position $t$ that are useful for predictions multiple steps ahead.

Gradient Insights for Multi-Token Prediction in DeepSeek-V3

From an optimization perspective, MTP provides richer gradient signals. In standard training, the hidden representation $h_t$ receives gradients only from predicting $x_{t+1}$. With MTP, $h_t$ also receives gradients from predicting $x_{t+2}, x_{t+3}, \ldots$ These additional gradients encourage $h_t$ to encode information relevant not just for the immediate next token, but for multiple future tokens.

Moreover, the gradients from future predictions flow through different pathways — through the MTP heads’ mini-Transformers. This creates a form of multi-task learning in which different prediction depths impose distinct consistency constraints on the learned representations. A representation that works well for predicting 1 token ahead might not be good for predicting 5 tokens ahead; MTP encourages learning representations that support both.

We can think of this as adding an implicit regularizer. The additional prediction objectives constrain the learned representations to be more structured, more forward-looking, and more globally coherent. It’s similar in spirit to multi-task learning, where auxiliary tasks improve representation quality even if we care primarily about one main task.

DeepSeek-V3 Training vs. Inference: How MTP Changes Both

During Training: We compute all predictions in parallel. For a sequence of length $T$, we predict:

Main head: positions 1 through $T-1$ predict positions 2 through $T$
Depth-1 head: positions 1 through $T-2$ predict positions 3 through $T$
Depth-2 head: positions 1 through $T-3$ predict positions 4 through $T$

Each prediction uses the ground truth intermediate tokens (available during training), so there’s no error accumulation. The losses are computed independently and summed with appropriate weights.
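
The alignment between input positions and targets can be sketched in a few lines of plain Python. The function name mtp_alignment and the convention that head 0 is the main head are assumptions made for this illustration. Positions are 1-indexed to match the list above.

```python
def mtp_alignment(tokens, n_mtp_heads):
    """Map each head to (input_positions, target_tokens) for teacher-forced training.

    Head 0 is the main next-token head; head k (k >= 1) is the depth-k MTP head,
    which predicts the token k + 1 steps ahead of each input position.
    """
    seq_len = len(tokens)
    plan = {}
    for k in range(n_mtp_heads + 1):
        offset = k + 1                                    # how far ahead this head looks
        positions = list(range(1, seq_len - offset + 1))  # 1-indexed input positions
        targets = tokens[offset:]                         # ground-truth tokens to predict
        plan[k] = (positions, targets)
    return plan


tokens = ["A", "B", "C", "D", "E"]  # T = 5
for head, (positions, targets) in mtp_alignment(tokens, n_mtp_heads=2).items():
    print(head, positions, targets)
```

Each deeper head loses one trailing position, because the last positions in the sequence have no token that far ahead to predict.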

During Inference: Interestingly, MTP heads are typically not used during autoregressive generation. Once training is complete, we generate text using only the main prediction head in the standard autoregressive manner. The MTP heads have served their purpose by improving the learned representations; we don’t need their multi-step predictions at inference time.

This is computationally appealing: we get the benefits of MTP (better representations, improved coherence) during training, but inference remains as efficient as a standard language model. There’s no additional computational cost at deployment.

Multi-Token Prediction Loss Weighting and Decay for DeepSeek-V3

The weighting coefficients $\lambda_d$ are important hyperparameters. Intuitively, predictions further in the future are harder and less reliable, so we should weight them less heavily. A common scheme is exponential decay:

$$\lambda_d = \beta^{d-1}$$

where $0 < \beta < 1$. For example, with $\beta = 0.5$:

Depth 1 (predicting $x_{t+1}$ from $h_t$): weight 1.0
Depth 2 (predicting $x_{t+2}$ from $h_t$): weight 0.5
Depth 3 (predicting $x_{t+3}$ from $h_t$): weight 0.25

In our implementation, we use a simpler approach: uniform weighting of 0.3 for all MTP losses relative to the main loss. This is less sophisticated but easier to tune and still provides the core benefits.
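
Both weighting schemes are easy to compare in code. In the sketch below, the function names and the parameter name beta are illustrative, and summing the MTP losses before applying the 0.3 weight is an assumption about how the uniform scheme is applied.

```python
def exponential_decay_weights(n_depths, beta):
    """Weights 1, beta, beta^2, ... for depths 1..n_depths (requires 0 < beta < 1)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return [beta ** (depth - 1) for depth in range(1, n_depths + 1)]


def combine_losses(main_loss, mtp_losses, mtp_weight=0.3):
    """Uniform scheme: main loss plus one fixed weight on every MTP loss.

    Summing the weighted MTP losses is an assumption made for this sketch.
    """
    return main_loss + mtp_weight * sum(mtp_losses)


print(exponential_decay_weights(3, beta=0.5))  # [1.0, 0.5, 0.25]
print(combine_losses(2.0, [3.0, 4.0]))         # 2.0 + 0.3 * (3.0 + 4.0)
```

The uniform scheme has a single knob to tune, whereas the decay scheme lets far-future predictions contribute progressively less.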

Step-by-Step Implementation of Multi-Token Prediction Heads in DeepSeek-V3

Let’s implement the complete MTP system:

class MultiTokenPredictionHead(nn.Module):
    """
    Multi-Token Prediction Head

    Each head predicts a token at a specific future position.
    Combines previous hidden state with future token embedding.
    """
    def __init__(self, config: DeepSeekConfig, depth: int):
        super().__init__()
        self.depth = depth
        self.n_embd = config.n_embd

        # Combine previous hidden state with future token embedding
        self.combine_proj = nn.Linear(2 * config.n_embd, config.n_embd, bias=config.bias)

        # Normalization
        self.norm1 = RMSNorm(config.n_embd)
        self.norm2 = RMSNorm(config.n_embd)

        # Transformer components (mini-transformer for each head)
        self.attn = MultiheadLatentAttention(config)
        self.mlp = MixtureOfExperts(config)
        self.attn_norm = RMSNorm(config.n_embd)
        self.mlp_norm = RMSNorm(config.n_embd)

Lines 1-24: Prediction Head Structure. Each MultiTokenPredictionHead is specialized for a particular depth: head 1 predicts one token beyond the main head's next-token target, head 2 predicts two tokens beyond it, and so on. We store the depth for potential depth-conditional processing (though we don’t use it in this simple implementation).

The architecture has 3 main components: a combination projection that merges the hidden state and future token embeddings, normalization layers for stabilization, and a mini-Transformer consisting of an attention module and an MoE. This mini-Transformer is complete but lightweight — it has the same architecture as our main model blocks but serves a specialized purpose.

    def forward(self, prev_hidden, future_token_embed):
        """
        Args:
            prev_hidden: [B, T, D] - Hidden states from previous layer
            future_token_embed: [B, T, D] - Embeddings of future tokens

        Returns:
            hidden: [B, T, D] - Processed hidden states
        """
        # Normalize inputs
        prev_norm = self.norm1(prev_hidden)
        future_norm = self.norm2(future_token_embed)

        # Combine representations
        combined = torch.cat([prev_norm, future_norm], dim=-1)
        hidden = self.combine_proj(combined)

        # Process through mini-transformer
        hidden = hidden + self.attn(self.attn_norm(hidden))
        moe_out, _ = self.mlp(self.mlp_norm(hidden))
        hidden = hidden + moe_out

        return hidden

Lines 26-41: The Combination Strategy. The forward method takes two inputs: prev_hidden (the hidden representation at position $t$, encoding all context up to that point) and future_token_embed (the embedding of the token at position $t+1$, providing information about what’s been generated). We normalize both inputs independently, which prevents scale mismatches between the hidden representations (which may have grown or shrunk through many Transformer layers) and the embeddings (which come fresh from the embedding layer). We concatenate along the feature dimension, doubling the dimensionality, then project back to n_embd dimensions. This projection learns how to merge content from these two different sources.

Lines 44-46: Mini-Transformer Processing. The combined representation flows through a lightweight Transformer. First, attention with a residual connection: the model can attend across the sequence, allowing position $t$ to gather information from other positions when predicting $x_{t+2}$. This is crucial because the prediction might depend on context earlier in the sequence. Then, MoE with a residual connection: the expert networks can apply non-linear transformations, refining the combined representation. The use of the same MLA attention and MoE that we’ve already implemented is elegant, since we’re reusing well-tested components. The pre-norm architecture (normalizing before attention and MoE rather than after) has become standard in modern Transformers for training stability.

Line 48: Returning Refined Hidden State. The output hidden state has the same dimensionality as the input ([B, T, D]), so it can be projected through the vocabulary matrix to get logits for predicting $x_{t+2}$. This hidden state has been enriched with information from both the context (via prev_hidden) and the intermediate token (via future_token_embed), and has been refined through attention and expert processing. It represents the model’s best understanding of what should come next-next, not just next.

Integrating Multi-Token Prediction with DeepSeek-V3’s Core Transformer

The MTP heads integrate into the main model during training. After computing the final hidden states from the main Transformer, we apply the following operations:

Main prediction: Project $h_t$ to vocabulary to predict $x_{t+1}$, compute cross-entropy loss
Depth-1 prediction: For each position $t$, get embedding of $x_{t+1}$ (ground truth), combine with $h_t$ through head 1, project to vocabulary to predict $x_{t+2}$, compute cross-entropy loss
Depth-2 prediction: For each position $t$, get embedding of $x_{t+2}$ (ground truth), combine with head-1 output, project to vocabulary to predict $x_{t+3}$, compute cross-entropy loss

The key insight is that we chain the heads: head 2’s input includes head 1’s output. This creates a hierarchical structure in which each head builds on the previous one, progressively looking further into the future.
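
The chaining can be sketched end to end in PyTorch. This is a simplified illustration rather than the series' implementation: TinyMTPHead is a hypothetical stand-in that uses LayerNorm and a small MLP in place of RMSNorm, MLA, and MoE. Sharing one output projection (lm_head) across all heads, and combining losses as main plus 0.3 times the sum of MTP losses, are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMTPHead(nn.Module):
    """Simplified stand-in for MultiTokenPredictionHead (an MLP replaces MLA + MoE)."""

    def __init__(self, n_embd: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(n_embd)
        self.norm2 = nn.LayerNorm(n_embd)
        self.combine_proj = nn.Linear(2 * n_embd, n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, n_embd), nn.GELU(), nn.Linear(n_embd, n_embd)
        )

    def forward(self, prev_hidden, future_token_embed):
        # Normalize both inputs, concatenate on the feature axis, project back to D.
        combined = torch.cat(
            [self.norm1(prev_hidden), self.norm2(future_token_embed)], dim=-1
        )
        hidden = self.combine_proj(combined)
        return hidden + self.mlp(hidden)  # residual refinement


def mtp_losses(hidden, tokens, embed, lm_head, heads):
    """Return [main loss, depth-1 loss, depth-2 loss, ...] for tokens of shape [B, T]."""
    vocab = lm_head.out_features
    seq_len = tokens.size(1)
    # Main head: the state at position t predicts the token at t + 1.
    logits = lm_head(hidden[:, :-1])
    losses = [F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))]
    prev = hidden
    for depth, head in enumerate(heads, start=1):
        n_valid = seq_len - depth - 1  # positions that still have a target
        # Combine the previous stage's output at position t with the ground-truth
        # embedding of the token at t + depth, then predict the token at t + depth + 1.
        future_embed = embed(tokens[:, depth:depth + n_valid])
        prev = head(prev[:, :n_valid], future_embed)
        logits = lm_head(prev)
        targets = tokens[:, depth + 1:]
        losses.append(F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1)))
    return losses


torch.manual_seed(0)
B, T, D, V = 2, 6, 16, 50
embed, lm_head = nn.Embedding(V, D), nn.Linear(D, V, bias=False)
heads = nn.ModuleList([TinyMTPHead(D) for _ in range(2)])
tokens = torch.randint(0, V, (B, T))
hidden = torch.randn(B, T, D)  # stands in for the main Transformer's output
losses = mtp_losses(hidden, tokens, embed, lm_head, heads)
total = losses[0] + 0.3 * sum(losses[1:])
print([round(loss.item(), 3) for loss in losses], round(total.item(), 3))
```

Because each head consumes the previous head's output, the usable sequence shrinks by one position per depth, which is what the n_valid slice tracks.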

Theoretical Foundations: MTP, Curriculum Learning, and Auxiliary Tasks

MTP has interesting theoretical connections to other areas of machine learning:

Temporal Difference Learning: In reinforcement learning, temporal difference learning propagates value information backward from future states. MTP does something analogous — it propagates gradient information backward from future predictions, encouraging current representations to encode future-relevant information.

Auxiliary Tasks: MTP can be viewed as an auxiliary task framework in which the auxiliary tasks are future token predictions. Research in multi-task learning shows that auxiliary tasks improve representation quality when they are related but distinct from the main task. Future token prediction is perfectly related (it is the same task at different time steps) but distinct (it requires different information).

Curriculum Learning: The depth-weighted loss structure implements a form of curriculum — we emphasize near-future predictions (easier, more reliable) more than far-future predictions (harder, noisier). This gradually increasing difficulty may help training by first learning short-term dependencies before tackling long-term structure.

Multi-Token Prediction Benefits: Coherence, Planning, and Faster Convergence

Research on Multi-Token Prediction shows several empirical benefits:

Improved Coherence: Models trained with MTP generate more globally coherent text, with fewer contradictions or topic drift over long generations
Better Planning: For tasks like story writing or code generation, where early decisions constrain later possibilities, MTP helps the model make forward-compatible choices
Faster Convergence: The additional training signals can accelerate learning, reaching target performance with fewer training steps
Regularization: MTP acts as a regularizer, preventing overfitting by encouraging representations that support multiple related objectives

However, MTP also has costs. Training becomes more complex: we must manage multiple prediction heads and carefully weight their losses. Training is also slower, because computing multiple predictions per position increases the prediction-head computation roughly in proportion to the number of future tokens predicted (the factor is not exactly linear because positions near the end of a sequence have no target that many tokens ahead). Memory usage increases due to the additional heads’ parameters.
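To see why the overhead is not exactly linear, we can count prediction targets. The sketch below is a back-of-the-envelope illustration, not a measurement: the helper names `mtp_prediction_counts` and `relative_head_cost` are hypothetical, and it only counts how many positions have a valid target at each depth, ignoring the shared trunk and any per-head layers.

```python
def mtp_prediction_counts(seq_len: int, n_future: int) -> list[int]:
    """Positions that have a valid target d tokens ahead, for d = 1..n_future."""
    return [max(seq_len - d, 0) for d in range(1, n_future + 1)]


def relative_head_cost(seq_len: int, n_future: int) -> float:
    """Total MTP predictions divided by the predictions made by next-token-only training."""
    counts = mtp_prediction_counts(seq_len, n_future)
    return sum(counts) / counts[0]


counts = mtp_prediction_counts(seq_len=8, n_future=3)
ratio = relative_head_cost(seq_len=8, n_future=3)
print(counts)           # [7, 6, 5]
print(round(ratio, 3))  # 2.571
```

For a length-8 sequence and 3 future tokens, the heads make about 2.57× as many predictions as next-token-only training. The ratio approaches 3 as sequences get longer.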

The tradeoff is typically favorable for larger models and longer-form generation tasks. For small models or short-sequence tasks, the overhead may outweigh the benefits. In our children’s story generation task, MTP should help with maintaining narrative consistency across a story.

Summary

In the first three lessons of this series, we progressively assembled the foundations of DeepSeek-V3: starting with its configuration and Rotary Positional Embeddings (RoPE), then advancing to the efficiency of Multi-Head Latent Attention (MLA), and scaling capacity through the Mixture of Experts (MoE). Each of these innovations has added a crucial piece to the architecture, balancing efficiency, scalability, and representational power. With those components in place, we turn to another breakthrough that redefines how language models learn and generate text: Multi-Token Prediction (MTP).

Traditional autoregressive models rely on next-token prediction, a strategy that, while effective, can be shortsighted — focusing only on immediate context rather than broader sequence-level patterns. MTP addresses this limitation by enabling the model to predict multiple tokens ahead, accelerating training and inference while enriching contextual understanding. In this lesson, we explore the shortcomings of next-token prediction, introduce the architecture of specialized prediction heads, and examine why MTP works from a gradient perspective.

We then dive into practical considerations (e.g., weighted loss, decay strategies, and implementation details), before integrating MTP into the main model. By the end, we see how this innovation not only improves efficiency but also strengthens the theoretical and empirical foundations of DeepSeek-V3, bringing us closer to assembling the complete architecture.

Citation Information

Mangla, P. “Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3,” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/alrep

@incollection{Mangla_2026_autoregressive-model-limits-and-mTP-in-deepseek-v3,
  author = {Puneet Mangla},
  title = {{Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/alrep},
}

The post Autoregressive Model Limits and Multi-Token Prediction in DeepSeek-V3 appeared first on PyImageSearch.

DeepSeek-V3 from Scratch: Mixture of Experts (MoE)

Puneet Mangla — Mon, 23 Mar 2026 12:45:00 +0000

Table of Contents

DeepSeek-V3 from Scratch: Mixture of Experts (MoE)
The Scaling Challenge in Neural Networks
Mixture of Experts (MoE): Mathematical Foundation and Routing Mechanism
SwiGLU Activation in DeepSeek-V3: Improving MoE Non-Linearity
Shared Expert in DeepSeek-V3: Universal Processing in MoE Layers
Auxiliary-Loss-Free Load Balancing in DeepSeek-V3 MoE
Sequence-Wise Load Balancing for Mixture of Experts Models
Expert Specialization in MoE: Emergent Behavior in DeepSeek-V3
Implementation: Building the DeepSeek-V3 MoE Layer from Scratch
MoE Design Decisions in DeepSeek-V3: SwiGLU, Shared Experts, and Routing
MoE Computational and Memory Analysis in DeepSeek-V3
MoE Expert Specialization in Practice: Real-World Behavior
Training Dynamics of MoE: Load Balancing and Expert Utilization
Mixture of Experts vs Related Techniques: Switch Transformers and Sparse Models
Summary
Citation Information

DeepSeek-V3 from Scratch: Mixture of Experts (MoE)

In the first two parts of this series, we established the foundations of DeepSeek-V3 by implementing its core configuration and positional encoding, followed by a deep dive into Multi-Head Latent Attention (MLA). Together, these components set the stage for a model that is both efficient and capable of handling long-range dependencies. With those building blocks in place, we now explore another key innovation in DeepSeek-V3: the Mixture of Experts (MoE).

MoE introduces a dynamic way of scaling model capacity without proportionally increasing computational cost. Instead of activating every parameter for every input, the model selectively routes tokens through specialized “expert” networks, allowing it to expand representational power while keeping inference efficient. In this lesson, we’ll unpack the theory behind MoE, explain how expert routing works, and then implement it step by step. This installment continues our broader goal of reconstructing DeepSeek-V3 from scratch — showing how each innovation, from RoPE to MLA to MoE, fits together into a cohesive architecture that balances scale, efficiency, and performance.

This lesson is the 3rd in a 6-part series on Building DeepSeek-V3 from Scratch:

DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings
Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture
DeepSeek-V3 from Scratch: Mixture of Experts (MoE) (this tutorial)
Lesson 4
Lesson 5
Lesson 6

To learn about DeepSeek-V3 and build it from scratch, just keep reading.

The Scaling Challenge in Neural Networks

As we scale neural networks, we face a fundamental tradeoff: larger models have greater capacity to learn complex patterns, but they’re more expensive to train and deploy. A standard Transformer feedforward layer applies the same computation to every token:

$$\text{FFN}(x) = W_2 \, \text{GELU}(W_1 x)$$

where $W_1 \in \mathbb{R}^{d_{ff} \times d_{model}}$ and $W_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$ are weight matrices, typically with $d_{ff} = 4 \, d_{model}$. For our model with $d_{model} = 256$, this means $d_{ff} = 1024$, giving us approximately 256K parameters per weight matrix (about 512K per FFN (FeedForward Network)) per layer.

To increase model capacity, we could simply make $d_{ff}$ larger, say $2048$ instead of $1024$. This doubles the FFN parameters and theoretically doubles capacity. But it also doubles the computation for every token, even if most don’t need that extra capacity.

Mixture of Experts (Figure 1) offers a more efficient scaling paradigm: instead of a single large FFN, we create multiple smaller expert FFNs and route each token to a subset of these experts. This gives us the capacity of a much larger model while maintaining computational efficiency.

Figure 1: Types of Mixture of Experts Models (source: Dai et al., 2024).

Mixture of Experts (MoE): Mathematical Foundation and Routing Mechanism

Consider $N$ expert networks $E_1, \dots, E_N$, each with the same architecture as a standard FFN:

$$E_i(x) = W_2^{(i)} \, \text{GELU}(W_1^{(i)} x)$$

for $i = 1, \dots, N$. Instead of using all experts for every token, we select the top-k experts. The selection is determined by a learned routing function:

$$p(x) = \text{softmax}(W_r x + b)$$

where $W_r$ is the router weight matrix and $b$ is a learnable bias vector with one entry per expert. This gives us a probability distribution over experts for each token.

Top-k Routing: We select the top-k experts based on router probabilities:

$$\mathcal{T}(x) = \text{TopK}(p(x), k)$$

The final output combines the selected experts, weighted by their normalized routing probabilities:

$$y = \sum_{i \in \mathcal{T}(x)} \frac{p_i(x)}{\sum_{j \in \mathcal{T}(x)} p_j(x)} \, E_i(x)$$

The renormalization ensures the selected experts’ weights sum to 1.

Capacity and Computation: With $N = 4$ experts and $k = 2$ (our configuration), each token activates 2 out of 4 experts. If each expert has the same size as a standard FFN, we have $4\times$ the parameters but only $2\times$ the computation per token. This is the MoE efficiency advantage: parameter count scales with $N$, but computation scales with $k$.
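As a small worked example of the top-k routing and renormalization described above, the following pure-Python sketch routes a single token. The logits are made-up values and the helper names `softmax` and `top_k_route` are illustrative. It also checks a useful identity: renormalizing the top-k softmax probabilities gives the same weights as a softmax over only the selected logits, which is the form the implementation later in this lesson uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Return {expert_index: weight} for the k most probable experts, weights summing to 1."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


logits = [2.0, 0.5, 1.0, -1.0]      # hypothetical router logits for one token (N = 4)
weights = top_k_route(logits, k=2)  # experts 0 and 2 are selected

# Softmax over only the selected logits yields the same weights.
selected = sorted(weights)
direct = softmax([logits[i] for i in selected])

print({i: round(w, 4) for i, w in weights.items()})  # {0: 0.7311, 2: 0.2689}
print([round(w, 4) for w in direct])                 # [0.7311, 0.2689]
```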

SwiGLU Activation in DeepSeek-V3: Improving MoE Non-Linearity

DeepSeek uses SwiGLU (Swish-Gated Linear Unit) instead of the traditional GELU (Gaussian Error Linear Unit) activation. SwiGLU is a gated activation function that has shown superior performance in language models:

$$\text{SwiGLU}(x) = \text{Swish}(W_{gate} \, x) \odot (W_{up} \, x)$$

where:

$W_{gate} \, x$: projects the input to the hidden dimension
$W_{up} \, x$: is another projection to the hidden dimension
$\text{Swish}(z) = z \cdot \sigma(z)$: is the Swish activation (a smooth version of ReLU), with $\sigma$ the sigmoid function
$\odot$: denotes element-wise multiplication

The result is then projected back:

$$y = W_{down} \, \text{SwiGLU}(x)$$

The gating mechanism allows the network to control information flow more precisely than simple activation functions. The Swish activation provides smooth gradients everywhere, improving training dynamics compared to ReLU’s hard threshold.
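To make the gating concrete, here is a minimal pure-Python sketch of the element-wise part of SwiGLU. The helper names `swish` and `swiglu_gate` and the input values are illustrative, and the gate and up projections are assumed to have been applied already.

```python
import math


def swish(z: float) -> float:
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_gate(gate_pre, up):
    """Element-wise Swish(gate) * up, i.e., SwiGLU before the down projection."""
    return [swish(g) * u for g, u in zip(gate_pre, up)]


gate_pre = [-1.0, 0.0, 1.0]   # hypothetical outputs of the gate projection
up = [2.0, 2.0, 2.0]          # hypothetical outputs of the up projection
gated = swiglu_gate(gate_pre, up)
print([round(v, 4) for v in gated])  # [-0.5379, 0.0, 1.4621]
```

Unlike ReLU, which would zero out the first entry entirely, Swish lets a small negative signal through, so the gate changes smoothly around zero.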

Shared Expert in DeepSeek-V3: Universal Processing in MoE Layers

DeepSeek introduces a shared expert that processes all tokens in addition to the routed experts. This design addresses a key limitation of pure MoE: some computations are beneficial for all tokens regardless of their content.

The shared expert has a larger hidden dimension (768 in our configuration vs 512 for individual experts) and processes every token. This ensures that:

Common patterns are efficiently handled by dedicated capacity
Specialized experts can focus on token-specific features
Training is more stable with guaranteed gradient flow

The shared expert serves as a “base” computation that’s always present, while routed experts add specialized processing on top of it.

Auxiliary-Loss-Free Load Balancing in DeepSeek-V3 MoE

A critical challenge in MoE is load balancing. If the router learns to always send tokens to the same one or two experts, we lose the benefits of having multiple experts — the unused experts contribute nothing, and the overused ones become bottlenecks.

Traditional MoE models use an auxiliary loss that penalizes uneven expert usage. Such a loss is computed from the number of tokens routed to each expert, normalized by the batch size and scaled by a coefficient. However, auxiliary losses add complexity and require careful tuning.

DeepSeek’s Innovation: Auxiliary-loss-free load balancing through dynamic bias updates. Instead of penalizing imbalance during training, we adjust the router biases to encourage balanced usage:

During training, we monitor how many tokens are routed to each expert. This gives us an expert_usage vector, where each entry counts the number of tokens assigned to a particular expert. We then compute the average usage across all experts.

To maintain a balanced load, we adjust the router biases: if an expert is used more than the average, its bias is decreased to make it less likely to be chosen in the future; if it is used less than the average, its bias is increased to make it more likely to be selected. This dynamic bias update encourages fair distribution of tokens across experts without requiring an explicit auxiliary loss.

Let $u_i$ denote the usage (number of tokens) of expert $i$, and let

$$\bar{u} = \frac{1}{N} \sum_{i=1}^{N} u_i$$

be the average usage across all $N$ experts. The router bias for expert $i$, denoted $b_i$, is updated as:

$$b_i \leftarrow \left\{\begin{array}{ll} b_i - \eta, & \text{if } u_i > \bar{u} \\ b_i + \eta, & \text{if } u_i \leq \bar{u} \end{array}\right.$$

where $\eta$ is the learning rate controlling the magnitude of the bias adjustment.

This approach:

Eliminates the need for auxiliary loss hyperparameter tuning
Provides smoother load balancing over time
Doesn’t interfere with the primary task loss
Automatically adapts to data distribution changes

The bias updates are performed with a small learning rate (0.001 in our implementation) to ensure gradual adjustment without disrupting training.
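A single balancing step can be sketched in a few lines of plain Python. The usage counts below are made up for illustration, and `update_expert_bias` is a hypothetical helper name, not part of the model code.

```python
def update_expert_bias(bias, usage, eta=0.001):
    """One balancing step: lower the bias of overused experts, raise it for all others."""
    avg = sum(usage) / len(usage)
    return [b - eta if u > avg else b + eta for b, u in zip(bias, usage)]


usage = [8, 6, 2, 0]   # hypothetical tokens per expert (8 tokens, top-2 routing)
bias = update_expert_bias([0.0, 0.0, 0.0, 0.0], usage)
print(bias)  # [-0.001, -0.001, 0.001, 0.001]
```

Each step moves a bias by only 0.001, so it takes many consecutive steps of the same imbalance before a bias is large enough to change which experts win the top-k selection.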

Sequence-Wise Load Balancing for Mixture of Experts Models

For even better load balancing, DeepSeek can use a complementary sequence-wise auxiliary loss. This encourages different sequences in a batch to use different experts:

$$\mathcal{L}_{seq} = \frac{1}{B^2} \sum_{a \neq b} \text{sim}(s_a, s_b)$$

where $s_a$ is the expert usage vector for sequence $a$ (i.e., which experts were used), $B$ is the number of sequences in the batch, and $\text{sim}(\cdot, \cdot)$ measures similarity (a dot product in our implementation). By minimizing this loss, we encourage sequences to be complementary: if sequence A uses experts 1 and 2 heavily, sequence B should use experts 3 and 4.
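The sequence A / sequence B example can be checked numerically. This is a minimal sketch that assumes dot-product similarity between usage vectors and an average over all pairs with self-pairs excluded. The helper names and usage vectors are illustrative.

```python
def usage_similarity(s_a, s_b):
    """Dot-product similarity between two expert-usage vectors."""
    return sum(a * b for a, b in zip(s_a, s_b))


def complementary_loss(usages):
    """Pairwise similarity summed over distinct sequence pairs, divided by B * B."""
    b = len(usages)
    total = sum(
        usage_similarity(usages[i], usages[j])
        for i in range(b) for j in range(b) if i != j
    )
    return total / (b * b)


seq_a = [0.5, 0.5, 0.0, 0.0]   # leans on experts 1 and 2
seq_b = [0.0, 0.0, 0.5, 0.5]   # leans on experts 3 and 4
seq_c = [0.5, 0.5, 0.0, 0.0]   # same pattern as seq_a

print(complementary_loss([seq_a, seq_b]))  # 0.0
print(complementary_loss([seq_a, seq_c]))  # 0.25
```

Complementary sequences give zero loss, while two sequences that lean on the same experts are penalized.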

Expert Specialization in MoE: Emergent Behavior in DeepSeek-V3

A fascinating property of MoE is expert specialization. Even though we don’t explicitly tell experts what to specialize in, they often learn to handle different types of patterns. In language models, researchers have observed:

Syntactic experts: Handle grammatical structures, verb conjugations
Semantic experts: Process meaning, synonyms, and conceptual relationships
Domain experts: Specialize in specific topics (e.g., scientific text, dialogue)
Numerical experts: Handle arithmetic, dates, quantities

This specialization emerges naturally as the routing function learns which experts are most effective for different inputs. Gradient flow during training reinforces this — when an expert performs well on certain patterns, the router learns to send similar patterns to that expert.

Mathematically, we can think of each expert as learning a local model that’s particularly good in some region of the input space. The router function implicitly partitions the input space, assigning different regions to different experts. This is similar to a mixture of experts in classical machine learning, but learned end-to-end through backpropagation.

Implementation: Building the DeepSeek-V3 MoE Layer from Scratch

Let’s implement the complete MoE layer with expert networks, routing, and load balancing:

class SwiGLU(nn.Module):
    """SwiGLU activation function used in DeepSeek experts"""
   
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, bias: bool = True):
        super().__init__()
        self.gate_proj = nn.Linear(input_dim, hidden_dim, bias=bias)
        self.up_proj = nn.Linear(input_dim, hidden_dim, bias=bias)
        self.down_proj = nn.Linear(hidden_dim, output_dim, bias=bias)
       
    def forward(self, x: torch.Tensor):
        gate = F.silu(self.gate_proj(x))  # SiLU activation
        up = self.up_proj(x)
        return self.down_proj(gate * up)

Lines 1-13: SwiGLU Activation: The SwiGLU class implements a gated activation mechanism. We have 3 linear projections:

gate_proj: for the gating signal
up_proj: for the value branch
down_proj: for the output projection

The forward pass applies SiLU (Sigmoid Linear Unit) to the gate projection, multiplies it element-wise with the up-projection, and projects back down. This creates a more expressive activation than simple GELU, with the gating mechanism allowing fine-grained control over information flow.

class MoEExpert(nn.Module):
    """Expert network for Mixture of Experts using SwiGLU"""

    def __init__(self, config: DeepSeekConfig):
        super().__init__()
        self.expert_mlp = SwiGLU(
            config.n_embd,
            config.expert_intermediate_size,
            config.n_embd,
            config.bias
        )

    def forward(self, x: torch.Tensor):
        return self.expert_mlp(x)

Lines 14-27: Expert with SwiGLU: Each MoEExpert is now a SwiGLU network instead of a simple FFN. The intermediate size (expert_intermediate_size) controls capacity — we use 512 in our configuration, which is smaller than the shared expert’s 768. This asymmetry reflects the fact that routed experts handle specialized patterns, while the shared expert handles common operations.

class MixtureOfExperts(nn.Module):
    """
    DeepSeek MoE layer with shared expert and auxiliary-loss-free load balancing
   
    Key features:
    - Shared expert that processes all tokens
    - Auxiliary-loss-free load balancing via bias updates
    - Top-k routing to selected experts
    """

    def __init__(self, config: DeepSeekConfig):
        super().__init__()
        self.config = config
        self.n_experts = config.n_experts
        self.top_k = config.n_experts_per_token
        self.n_embd = config.n_embd

        # Router: learns which experts to use for each token
        self.router = nn.Linear(config.n_embd, config.n_experts, bias=False)

        # Expert networks
        self.experts = nn.ModuleList([
            MoEExpert(config) for _ in range(config.n_experts)
        ])

        # Shared expert (processes all tokens)
        if config.use_shared_expert:
            self.shared_expert = SwiGLU(
                config.n_embd,
                config.shared_expert_intermediate_size,
                config.n_embd,
                config.bias
            )
        else:
            self.shared_expert = None

        # Auxiliary-loss-free load balancing
        self.register_buffer('expert_bias', torch.zeros(config.n_experts))
        self.bias_update_rate = 0.001

        self.dropout = nn.Dropout(config.dropout)

Lines 28-68: MoE Layer Structure: The MixtureOfExperts class orchestrates routing and expert execution. The 3 key additions:

shared_expert: full-capacity expert that processes all tokens
expert_bias: buffer for auxiliary-loss-free balancing
bias_update_rate: controls how quickly biases adapt

The dropout provides regularization across the entire MoE output.

    def forward(self, x: torch.Tensor):
        batch_size, seq_len, hidden_dim = x.shape
        x_flat = x.view(-1, hidden_dim)

        # Routing phase with bias for load balancing
        router_logits = self.router(x_flat) + self.expert_bias

        # Top-k routing
        top_k_logits, top_k_indices = torch.topk(router_logits, self.top_k, dim=-1)
        routing_weights = torch.zeros_like(router_logits)
        routing_weights.scatter_(-1, top_k_indices, F.softmax(top_k_logits, dim=-1))

        # Expert computation
        output = torch.zeros_like(x_flat)
        expert_usage = torch.zeros(self.n_experts, device=x.device)

Lines 70-84: Routing with Learnable Bias. The forward pass begins by flattening the input for efficient processing. We compute router logits and add the expert bias — this is the key to auxiliary-loss-free balancing. Overused experts have negative bias (making them less likely to be selected), while underused experts have positive bias (encouraging them to be selected). We then perform top-k selection and softmax normalization across the selected experts.

        # Process through selected experts
        for expert_idx in range(self.n_experts):
            expert_mask = (top_k_indices == expert_idx).any(dim=-1)
            expert_usage[expert_idx] = expert_mask.sum().float()

            if expert_mask.any():
                expert_input = x_flat[expert_mask]
                expert_output = self.experts[expert_idx](expert_input)

                # Weight by routing probability
                weights = routing_weights[expert_mask, expert_idx].unsqueeze(-1)
                output[expert_mask] += expert_output * weights

        # Add shared expert output (processes all tokens)
        if self.shared_expert is not None:
            shared_output = self.shared_expert(x_flat)
            output += shared_output

        # Auxiliary-loss-free load balancing (update biases during training)
        if self.training:
            with torch.no_grad():
                avg_usage = expert_usage.mean()
                for i in range(self.n_experts):
                    if expert_usage[i] > avg_usage:
                        self.expert_bias[i] -= self.bias_update_rate
                    else:
                        self.expert_bias[i] += self.bias_update_rate

        output = self.dropout(output)
        return output.view(batch_size, seq_len, hidden_dim), router_logits.view(batch_size, seq_len, -1)

Lines 86-97: Expert Processing. We iterate over all experts, identifying which tokens route to each one via the expert_mask. For each expert with assigned tokens, we extract those tokens, process them through the expert network, weight them by routing probability, and accumulate them into the output. This selective execution is what makes MoE efficient — we don’t compute all experts for all tokens.

Lines 100-102: Shared Expert. The shared expert processes all tokens unconditionally and adds its output to the routed experts’ output. This ensures every token receives some baseline processing, improving training stability and providing capacity for universal patterns. The shared expert’s larger hidden dimension (768 vs 512) reflects its broader responsibility.

Lines 105-112: Auxiliary-Loss-Free Balancing. During training, we update expert biases based on usage. We compute average usage across experts, then adjust biases: overused experts receive negative adjustments (discouraging future selection), while underused experts receive positive adjustments (encouraging future selection). Using the torch.no_grad() context ensures these bias updates don’t interfere with gradient computation. The small update rate (0.001) provides smooth, stable balancing over time.

Lines 114-115: Output and Return. We apply dropout to the combined output (routed + shared experts) and reshape back to the original dimensions. We return both the output and router logits — the latter can be used for optional auxiliary loss computation.

    def _complementary_sequence_aux_loss(self, router_logits, seq_mask=None):
      """
      router_logits: [batch_size, seq_len, num_experts]
          Raw logits from the router before softmax.
      seq_mask: optional mask for padding tokens.
      """

      # Convert to probabilities
      probs = F.softmax(router_logits, dim=-1)  # [B, T, E]

      # Aggregate per-sequence expert usage
      if seq_mask is not None:
          probs = probs * seq_mask.unsqueeze(-1)  # mask padding
      seq_usage = probs.sum(dim=1)  # [B, E]

      # Normalize per sequence
      seq_usage = seq_usage / seq_usage.sum(dim=-1, keepdim=True)

      # Compute pairwise similarity between sequences
      sim_matrix = torch.matmul(seq_usage, seq_usage.transpose(0, 1))  # [B, B]

      # Encourage complementarity: zero the diagonal (self-similarity), minimize the rest
      batch_size = seq_usage.size(0)
      off_diag = sim_matrix * (1.0 - torch.eye(batch_size, device=sim_matrix.device))
      loss = off_diag.mean()

      return loss

Lines 117-143: Complementary Sequence-Wise Loss. This method implements an alternative load-balancing approach. It converts router logits to probabilities, aggregates expert usage for each sequence, and computes pairwise similarity between sequences’ expert usage patterns. By minimizing off-diagonal similarity, we encourage different sequences to use different experts, promoting diversity in expert utilization. This can be added to the training loss with a small weight (e.g., 0.01).

MoE Design Decisions in DeepSeek-V3: SwiGLU, Shared Experts, and Routing

Several implementation choices merit discussion:

SwiGLU vs GELU: We use SwiGLU instead of traditional GELU because empirical research shows it consistently outperforms GELU in language models. The gating mechanism provides more expressive power, and SiLU’s smoothness improves gradient flow. The computational cost is slightly higher (three projections instead of two), but the quality improvement justifies it.

Shared Expert Design: The shared expert is a DeepSeek innovation that addresses a key limitation of pure MoE: some computations benefit all tokens. By providing dedicated capacity for universal processing, we free routed experts to specialize more aggressively. The larger hidden dimension (768 vs 512) for the shared expert reflects empirical findings that shared capacity requires more parameters than individual experts.

Auxiliary-Loss-Free Balancing: Traditional MoE uses auxiliary losses, such as:

$$\mathcal{L}_{aux} = \alpha \cdot N \sum_{i=1}^{N} f_i \cdot P_i$$

where $N$ is the number of experts, $f_i$ is the fraction of tokens routed to expert $i$, and $P_i$ is the average routing probability assigned to expert $i$. This requires tuning the coefficient $\alpha$ (typically 0.01-0.1). Our bias-based approach eliminates the need for this hyperparameter, simplifying training. The tradeoff is that bias updates are less direct than gradient-based learning, but in practice, the smoother adaptation works well.
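For comparison with the bias-based approach, here is a minimal sketch of such a load-balancing loss in plain Python. It assumes top-1 assignments (one expert index per token) to keep the bookkeeping simple. The function name `load_balancing_loss` and the example values are illustrative.

```python
def load_balancing_loss(assignments, router_probs, alpha=0.01):
    """alpha * N * sum_i f_i * P_i, with one assigned expert index per token."""
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    # f_i: fraction of tokens routed to expert i
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    # P_i: average routing probability given to expert i
    p = [sum(row[i] for row in router_probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = load_balancing_loss([0, 1, 2, 3], [[0.25] * 4] * 4)
collapsed = load_balancing_loss([0, 0, 0, 0], [[0.7, 0.1, 0.1, 0.1]] * 4)
print(round(balanced, 4), round(collapsed, 4))  # 0.01 0.028
```

With perfectly uniform routing the loss equals the coefficient itself (0.01 here), and it grows as routing collapses onto one expert.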

Complementary Sequence-Wise Loss: This alternative balancing approach is useful when batch diversity is high. By encouraging different sequences to use different experts, we naturally achieve balance. However, if the batch contains very similar sequences (e.g., all from the same domain), this loss may not be effective. It’s best used in combination with bias-based balancing or as an optional auxiliary objective.

Expert Capacity: Production MoE systems often implement expert capacity constraints: if too many tokens route to one expert, excess tokens are dropped or routed to a second choice. We don’t implement this in our educational model, but the formula would be:

$$\text{capacity} = \frac{\text{tokens per batch}}{\text{number of experts}} \times \text{factor}$$

where factor is typically 1.25-1.5. Tokens beyond this capacity are handled via overflow strategies.
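A capacity limit of this kind could look like the following sketch. Rounding up with `math.ceil` and keeping tokens in arrival order are assumptions of this example, and the helper names are hypothetical.

```python
import math


def expert_capacity(tokens_per_batch, n_experts, factor=1.25):
    """Maximum tokens one expert accepts (rounding up is an assumption of this sketch)."""
    return math.ceil(tokens_per_batch / n_experts * factor)


def split_overflow(token_ids, capacity):
    """Keep the first `capacity` tokens for the expert; the rest overflow."""
    return token_ids[:capacity], token_ids[capacity:]


capacity = expert_capacity(tokens_per_batch=64, n_experts=4, factor=1.25)
kept, overflow = split_overflow(list(range(25)), capacity)  # 25 tokens chose this expert
print(capacity, len(kept), len(overflow))  # 20 20 5
```

The overflow tokens would then be dropped or sent to their next-best expert, depending on the strategy chosen.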

MoE Computational and Memory Analysis in DeepSeek-V3

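Let’s analyze the computational cost, counting each multiply-add as 2 FLOPs. For a standard FFN with hidden dimension $d_{ff}$:

$$\text{FLOPs}_{FFN} = 2 \times (2 \cdot d_{model} \cdot d_{ff}) = 4 \, d_{model} \, d_{ff}$$

For our MoE with $N = 4$ routed experts (each with $d_{expert} = 512$), $k = 2$ selected, and shared expert ($d_{shared} = 768$):

$$\text{FLOPs}_{MoE} = \text{FLOPs}_{router} + k \cdot \text{FLOPs}_{SwiGLU}(d_{expert}) + \text{FLOPs}_{SwiGLU}(d_{shared})$$

The SwiGLU computation involves three projections:

$$\text{FLOPs}_{SwiGLU}(d_h) = 3 \times (2 \cdot d_{model} \cdot d_h) = 6 \, d_{model} \, d_h$$

For our configuration ($d_{model} = 256$):

Routing: $2 \times 256 \times 4 = 2{,}048$ (negligible)
Routed experts: $2 \times 6 \times 256 \times 512 \approx 1.57\text{M}$
Shared expert: $6 \times 256 \times 768 \approx 1.18\text{M}$
Total: 2.75M FLOPs per token

Compare to a standard FFN with $d_{ff} = 1024$: $4 \times 256 \times 1024 \approx 1.05\text{M}$ FLOPs. Our MoE uses 2.6× more computation but has much higher capacity (4 experts × 512 + 1 shared × 768 = 2,816 vs 1,024). We get about 2.75× capacity for 2.6× computation, which is roughly linear scaling, and that is the goal.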
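The totals above can be reproduced with a few lines of arithmetic. This sketch assumes an embedding size of 256 and counts 2 FLOPs per multiply-add. Both are assumptions chosen because they reproduce the 2.75M total, and the function names are illustrative.

```python
D_MODEL = 256                         # assumed embedding size of the tutorial model
N_EXPERTS, TOP_K = 4, 2
D_EXPERT, D_SHARED, D_FF = 512, 768, 1024


def linear_flops(d_in, d_out):
    """FLOPs for one token through a d_in x d_out linear layer (2 per multiply-add)."""
    return 2 * d_in * d_out


def swiglu_flops(d_model, d_hidden):
    """Gate, up, and down projections of one SwiGLU block."""
    return 3 * linear_flops(d_model, d_hidden)


router = linear_flops(D_MODEL, N_EXPERTS)
routed = TOP_K * swiglu_flops(D_MODEL, D_EXPERT)
shared = swiglu_flops(D_MODEL, D_SHARED)
moe_total = router + routed + shared
dense = 2 * linear_flops(D_MODEL, D_FF)   # standard FFN: up and down projections

print(f"MoE: {moe_total / 1e6:.2f}M, dense: {dense / 1e6:.2f}M, ratio: {moe_total / dense:.1f}x")
# MoE: 2.75M, dense: 1.05M, ratio: 2.6x
```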

Memory usage during the forward pass stores activations for active experts only. During backpropagation, every expert that received at least one token in the batch gets gradients, so gradient and optimizer state must be kept for all experts, yet at this scale the memory remains manageable. The bias vector is tiny (4 floats for 4 experts).

MoE Expert Specialization in Practice: Real-World Behavior

While we can’t demonstrate this in our small toy model, in larger-scale MoE models, expert specialization is observable through analysis of routing patterns. Researchers have visualized which experts activate for different types of inputs, revealing clear specialization. For example:

Multilingual models: Different experts handle different languages
Code models: Some experts handle syntax, others semantics, others API patterns
Reasoning models: Numerical experts for math, logical experts for inference, retrieval experts for factual recall

This specialization isn’t programmed — it emerges from optimization. The routing function learns to partition the input space, and experts learn to excel in their assigned partitions. It’s a beautiful example of how end-to-end learning can discover structured solutions.

Training Dynamics of MoE: Load Balancing and Expert Utilization

In practice, MoE training exhibits interesting dynamics:

Early Training: Routing is initially random or near-uniform. All experts receive a similar load. The shared expert learns basic patterns that benefit all tokens.

Mid Training: Routing starts specializing. Some experts become preferred for certain patterns. Load imbalance can emerge without careful management. Bias-based balancing begins correcting the imbalance.

Late Training: Experts are clearly specialized. Routing is confident (high softmax probabilities for selected experts). Load is balanced through continuous bias adjustment. The shared expert handles universal operations while routed experts focus on specialized patterns.

Monitoring expert usage during training is valuable. We can log:

Per-expert selection frequency
Routing entropy (higher means more uniform)
Expert bias magnitudes (large values indicate strong correction needed)
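The first two quantities can be computed from the top-k indices alone. The sketch below uses one possible definition of routing entropy (the Shannon entropy of the per-expert selection frequencies). The helper names and index lists are hypothetical.

```python
import math
from collections import Counter


def selection_frequency(top_k_indices, n_experts):
    """Fraction of all expert selections that went to each expert."""
    counts = Counter(i for token in top_k_indices for i in token)
    total = sum(counts.values())
    return [counts[i] / total for i in range(n_experts)]


def routing_entropy(freqs):
    """Shannon entropy (in nats) of the selection frequencies; the maximum is log(n_experts)."""
    return -sum(p * math.log(p) for p in freqs if p > 0)


balanced = selection_frequency([(0, 1), (0, 2), (1, 3), (2, 3)], n_experts=4)
collapsed = selection_frequency([(0, 1)] * 4, n_experts=4)
print(balanced, round(routing_entropy(balanced), 4))    # [0.25, 0.25, 0.25, 0.25] 1.3863
print(collapsed, round(routing_entropy(collapsed), 4))  # [0.5, 0.5, 0.0, 0.0] 0.6931
```

An entropy that keeps falling toward zero during training is a sign that routing is collapsing onto a few experts.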

Mixture of Experts vs Related Techniques: Switch Transformers and Sparse Models

MoE shares ideas with several other architectural patterns:

Switch Transformers: Use top-1 routing (only one expert per token) for maximum efficiency. Simpler but less expressive than top-k.

Expert Choice: Instead of tokens choosing experts, experts choose tokens. Helps with load balancing but changes the computational pattern.

Sparse Attention: Like MoE, selectively activates parts of the network. Can be combined with MoE for extreme efficiency.

Dynamic Networks: Adapt network structure based on input. MoE is a specific form of dynamic computation.

With our MoE implementation complete, we’ve added efficient scaling to our model: capacity grows at least as fast as per-token computation, and adding more routed experts at a fixed top-k would add capacity at almost no extra per-token cost. Combined with MLA’s memory efficiency and the upcoming MTP’s improved training signal, we’re building a model that’s efficient in training, efficient in inference, and capable of strong performance. Next, we’ll tackle Multi-Token Prediction, which improves the training signal itself by having the model look further ahead.

Summary

In the third installment of our DeepSeek-V3 from Scratch series, we turn our attention to the Mixture of Experts (MoE) framework, a powerful approach to scaling neural networks efficiently. We begin by unpacking the scaling challenge in modern architectures and how MoE addresses it through selective expert activation. From its mathematical foundation to the introduction of SwiGLU activation, we explore how enhanced non-linearity and universal shared experts contribute to more flexible and expressive models.

We then examine the mechanics of load balancing, highlighting innovations (e.g., auxiliary-loss-free balancing and complementary sequence-wise strategies). These techniques ensure that experts are used effectively without introducing unnecessary complexity. We also explore how expert specialization emerges naturally during training, leading to diverse behaviors across experts that improve overall performance. This emergent specialization is not just theoretical — it becomes visible in practice, shaping how the model processes different types of input.

Finally, we walk through the implementation of MoE, discussing design decisions, computational trade-offs, and memory analysis. We connect these insights to related techniques, showing how MoE integrates into the broader landscape of efficient deep learning. By the end, we not only understand the theory but also gain practical knowledge of how to implement and optimize MoE within DeepSeek-V3. This part of the series equips us with the tools to harness expert specialization while keeping training dynamics balanced and efficient.

Citation Information

Mangla, P. “DeepSeek-V3 from Scratch: Mixture of Experts (MoE),” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/a1w0g

@incollection{Mangla_2026_deepseek-v3-from-scratch-moe,
  author = {Puneet Mangla},
  title = {{DeepSeek-V3 from Scratch: Mixture of Experts (MoE)}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/a1w0g},
}

The post DeepSeek-V3 from Scratch: Mixture of Experts (MoE) appeared first on PyImageSearch.

Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture

Puneet Mangla — Mon, 16 Mar 2026 12:45:00 +0000

Table of Contents

Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture
The KV Cache Memory Problem in DeepSeek-V3
Multi-Head Latent Attention (MLA): KV Cache Compression with Low-Rank Projections
Query Compression and Rotary Positional Embeddings (RoPE) Integration
Attention Computation with Multi-Head Latent Attention (MLA)
Implementation: Multi-Head Latent Attention (MLA)
Multi-Head Latent Attention and KV Cache Optimization
Summary

Citation Information

Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture

In the first part of this series, we laid the foundation by exploring the theoretical underpinnings of DeepSeek-V3 and implementing key configuration elements such as Rotary Positional Embeddings (RoPE). That tutorial established how DeepSeek-V3 manages long-range dependencies and sets up its architecture for efficient scaling. By grounding theory in working code, we ensured that readers not only understood the concepts but also saw how they translate into practical implementation.

With that groundwork in place, we now turn to one of DeepSeek-V3’s most distinctive innovations: Multi-Head Latent Attention (MLA). While traditional attention mechanisms have proven remarkably effective, they often come with steep computational and memory costs. MLA reimagines this core operation by introducing a latent representation space that dramatically reduces overhead while preserving the model’s ability to capture rich contextual relationships.

In this lesson, we’ll break down the theory behind MLA, explore why it matters, and then implement it step by step. This installment continues our hands-on approach — moving beyond abstract concepts to practical code — while advancing the broader goal of the series: to reconstruct DeepSeek-V3 from scratch, piece by piece, until we assemble and train the full architecture.

This lesson is the 2nd of the 6-part series on Building DeepSeek-V3 from Scratch:

DeepSeek-V3 Model: Theory, Config, and Rotary Positional Embeddings
Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture (this tutorial)
Lesson 3
Lesson 4
Lesson 5
Lesson 6

To learn about DeepSeek-V3 and build it from scratch, just keep reading.

The KV Cache Memory Problem in DeepSeek-V3

To understand why MLA is revolutionary, we must first understand the memory bottleneck in Transformer inference. Standard multi-head attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q, K, V \in \mathbb{R}^{T \times d_k}$ are the query, key, and value matrices for sequence length $T$. In autoregressive generation (producing one token at a time), we cannot afford to recompute attention over all previous tokens from scratch at each step: that would be $O(T^2)$ computation per token generated.

Instead, we cache the key and value matrices. When generating token $t$, we only compute $q_t$ (the query for the new token), then compute attention using $q_t$ and the cached keys and values $K_{1:t}$ and $V_{1:t}$. This reduces computation from $O(T^2)$ to $O(T)$ per generated token, a dramatic speedup.
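To make the caching idea concrete, here is a minimal single-head sketch (not the tutorial's implementation). It checks that attending with only the newest query against cached keys and values reproduces full causal attention. The toy sizes and the variable names (`k_cache`, `v_cache`, `step_outputs`) are assumptions chosen for illustration.

```python
import math
import torch

torch.manual_seed(0)
T, d = 6, 8                        # toy sequence length and head dimension
q = torch.randn(T, d)              # one query row per position
k = torch.randn(T, d)
v = torch.randn(T, d)

# Reference: full causal attention computed in one shot.
scores = (q @ k.T) / math.sqrt(d)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
full_out = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1) @ v

# Incremental decoding: append one key/value row per step and attend
# with only the newest query against everything cached so far.
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)
step_outputs = []
for t in range(T):
    k_cache = torch.cat([k_cache, k[t:t + 1]], dim=0)
    v_cache = torch.cat([v_cache, v[t:t + 1]], dim=0)
    weights = torch.softmax((q[t:t + 1] @ k_cache.T) / math.sqrt(d), dim=-1)  # [1, t+1]
    step_outputs.append(weights @ v_cache)
cached_out = torch.cat(step_outputs, dim=0)

print(torch.allclose(full_out, cached_out, atol=1e-5))
```

Each decoding step does work proportional to the current cache length, which is where the per-token $O(T)$ cost comes from.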

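However, this cache comes at a steep memory cost. For a model with $L$ layers, $H$ attention heads, and head dimension $d_{head}$, the KV cache for one sequence of length $T$ requires:

$$\text{Memory}_{KV} = 2 \times L \times H \times d_{head} \times T \times \text{bytes per element}$$

where the factor of 2 accounts for storing both keys and values. For a model like GPT-3 with 96 layers, 96 heads, a head dimension of 128, and a sequence length of 2048, and assuming 16-bit (2-byte) storage, this is:

$$2 \times 96 \times 96 \times 128 \times 2048 \times 2 \text{ bytes} \approx 9.7 \text{ GB (9 GiB) per sequence}$$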

This means you can only serve a handful of users concurrently on even high-end GPUs. The memory bottleneck is often the limiting factor in deployment, not computation.
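The arithmetic can be sketched in a few lines of Python. The helper name `kv_cache_bytes`, the 2-byte element size, and the 80 GB budget are assumptions for illustration. The budget also ignores model weights and activations, so the real number of concurrent sequences would be lower.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Leading factor of 2: one tensor for keys and one for values.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem


# GPT-3-like shape from the text, assuming 16-bit storage.
per_sequence = kv_cache_bytes(n_layers=96, n_heads=96, head_dim=128, seq_len=2048)
print(f"{per_sequence / 2**30:.1f} GiB per sequence")  # 9.0 GiB per sequence

# Hypothetical 80 GB memory budget devoted entirely to KV cache.
budget_bytes = 80 * 10**9
max_sequences = budget_bytes // per_sequence
print(max_sequences)  # 8
```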

Multi-Head Latent Attention (MLA): KV Cache Compression with Low-Rank Projections

MLA (Figure 1) solves this through a compress-decompress strategy inspired by Low-Rank Adaptation (LoRA). The key insight: we do not need to store full $d_{model}$-dimensional representations. We can compress them into a lower-dimensional latent space for storage, then decompress when needed for computation.

Figure 1: Multi-Head Latent Attention architecture (source: DeepSeek-AI, 2025).

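Step 1. Key-Value Compression: Instead of storing $K$ and $V$ directly, we project the input through a low-rank bottleneck:

$$C_{KV} = X W_{DKV}$$

where $X \in \mathbb{R}^{T \times d_{model}}$ is the input, $W_{DKV} \in \mathbb{R}^{d_{model} \times d_c}$ is the down-projection, and $d_c$ is the low-rank dimension. We only cache $C_{KV}$ rather than the full $K$ and $V$.

Step 2. Key-Value Decompression: When we need the actual key and value matrices for attention computation, we decompress:

$$K^C = C_{KV} W_{UK}, \qquad V = C_{KV} W_{UV}$$

where $W_{UK}, W_{UV} \in \mathbb{R}^{d_c \times d_{model}}$ are up-projection matrices. This decomposition approximates the full key and value matrices through a low-rank factorization: $K \approx X W_{DKV} W_{UK}$ and $V \approx X W_{DKV} W_{UV}$.

Memory Savings: Instead of caching $K, V \in \mathbb{R}^{T \times d_{model}}$ (that is, $2 d_{model}$ values per token), we cache $C_{KV} \in \mathbb{R}^{T \times d_c}$. The reduction factor is $2 d_{model} / d_c$. For our configuration, where $d_c = d_{model} / 2$, this is a 4× reduction. For larger models with $d_c = d_{model} / 8$, it's a 16× reduction, which is transformative for deployment.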
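The compress-decompress idea can be sketched with three linear layers. This is a minimal illustration rather than the tutorial's module: the sizes (`d_model = 256`, `d_c = 128`) are assumed values chosen only so that the latent is half the model width, and the layer names are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_c, T = 256, 128, 10    # assumed toy sizes with d_c = d_model / 2

kv_down = nn.Linear(d_model, d_c, bias=False)   # plays the role of W_DKV
k_up = nn.Linear(d_c, d_model, bias=False)      # plays the role of W_UK
v_up = nn.Linear(d_c, d_model, bias=False)      # plays the role of W_UV

x = torch.randn(1, T, d_model)
c_kv = kv_down(x)     # [1, T, d_c]: the only tensor that needs caching
k = k_up(c_kv)        # [1, T, d_model], rebuilt on demand
v = v_up(c_kv)        # [1, T, d_model], rebuilt on demand

full_cache = k.numel() + v.numel()    # what standard attention would store
latent_cache = c_kv.numel()           # what the latent cache stores
reduction = full_cache / latent_cache
print(reduction)  # 4.0

# The effective key map x -> k is a product of two thin matrices,
# so its rank cannot exceed d_c.
w_k_eff = k_up.weight.detach().double() @ kv_down.weight.detach().double()
rank = torch.linalg.matrix_rank(w_k_eff).item()
```

The rank check makes the trade-off explicit: keys and values are constrained to a $d_c$-dimensional subspace in exchange for the smaller cache.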

Query Compression and Rotary Positional Embeddings (RoPE) Integration

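MLA extends compression to queries, though less aggressively since queries are not cached:

$$C_Q = X W_{DQ}, \qquad Q^C = C_Q W_{UQ}$$

where $W_{DQ} \in \mathbb{R}^{d_{model} \times d_c^{q}}$ and the query rank $d_c^{q}$ can be different from $d_c$. In our configuration, $d_c^{q}$ is somewhat larger than $d_c$: we give queries slightly more capacity.

Now comes the clever part: integrating RoPE. We split both queries and keys into content and positional components:

$$Q = [Q^C ; Q^R], \qquad K = [K^C ; K^R]$$

where $[\cdot\,;\cdot]$ denotes concatenation. The content components come from the compression-decompression process described above. The positional components are separate projections that we apply RoPE to:

$$q_t^R = \text{RoPE}_t(c_t^Q W_{QR}), \qquad k_t^R = \text{RoPE}_t(x_t W_{KR})$$

where $\text{RoPE}_t(\cdot)$ denotes applying rotary embedding at position $t$, and $c_t^Q$ and $x_t$ are the rows of $C_Q$ and $X$ for token $t$. This separation is crucial: content and position are independently represented and combined only in the attention scores.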
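The property that makes rotary embeddings useful here is that the dot product between a rotated query and a rotated key depends only on their relative offset. The sketch below checks this numerically. `rope_rotate` is a hypothetical helper written for this example (it is not the series' `apply_rope`), and the "rotate-half" pairing of dimensions and the base of 10000 are assumptions.

```python
import torch


def rope_rotate(x, pos, base=10000.0):
    """Rotate a vector of even length by position-dependent angles."""
    d = x.shape[-1]
    # One frequency per 2-D pair; dimension i is paired with i + d/2.
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float64) / d))
    angles = pos * inv_freq
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


torch.manual_seed(0)
q = torch.randn(64, dtype=torch.float64)
k = torch.randn(64, dtype=torch.float64)

# Same relative offset (3) at two very different absolute positions.
score_near = torch.dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = torch.dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(torch.allclose(score_near, score_far, atol=1e-9))
```

Because each pair of dimensions undergoes a pure rotation, vector norms are preserved and only the angle difference between positions survives in the score.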

Attention Computation with Multi-Head Latent Attention (MLA)

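The complete attention computation becomes:

$$Q = [Q^C ; Q^R], \qquad K = [K^C ; K^R], \qquad V = C_{KV} W_{UV}$$

Then standard multi-head attention:

$$\text{head}_i = \text{Attention}(Q_i, K_i, V_i)$$

where $Q_i, K_i, V_i$ are per-head projections. The attention scores naturally incorporate both content similarity (through $Q^C (K^C)^\top$) and positional information (through $Q^R (K^R)^\top$).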
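The claim that scores combine content and positional terms follows from how dot products behave under concatenation. A small numerical check, using random stand-in tensors and assumed toy sizes:

```python
import torch

torch.manual_seed(0)
B, H, T, d_head, d_rope = 2, 4, 5, 32, 16    # assumed toy sizes

q_content = torch.randn(B, H, T, d_head)
k_content = torch.randn(B, H, T, d_head)
# Stand-ins for positional components that have already been rotated.
q_rope = torch.randn(B, H, T, d_rope)
k_rope = torch.randn(B, H, T, d_rope)

q = torch.cat([q_content, q_rope], dim=-1)
k = torch.cat([k_content, k_rope], dim=-1)

joint = q @ k.transpose(-2, -1)                       # [B, H, T, T]
split = (q_content @ k_content.transpose(-2, -1)
         + q_rope @ k_rope.transpose(-2, -1))         # content term + position term

print(torch.allclose(joint, split, atol=1e-4))
```

So concatenating along the feature dimension is equivalent to adding a content score and a positional score for every query-key pair.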

Causal Masking: For autoregressive language modeling, we must prevent tokens from attending to future positions. We apply a causal mask:

$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$

This ensures position $i$ can only attend to positions $j \le i$, maintaining the autoregressive property.

Attention Weights and Output: After computing scores with the causal mask applied:

$$A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)$$

where $d_k = d_{head} + d_{rope}$ is the effective key dimension (content plus RoPE dimensions). We apply attention to values:

$$O = \text{Concat}(A_1 V_1, \ldots, A_H V_H)\, W_O$$

where $W_O$ is the output projection. Finally, dropout is applied for regularization, and the result is added to the residual connection.
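As a quick sanity check of the masking and softmax steps, the following single-head sketch (toy sizes assumed) builds the additive mask $M$ and confirms that future positions receive exactly zero weight while each row still sums to 1.

```python
import math
import torch

torch.manual_seed(0)
T, d_k = 4, 8                      # assumed toy sizes
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)
v = torch.randn(T, d_k)

# Additive causal mask: 0 on and below the diagonal, -inf above it.
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
mask = torch.zeros(T, T).masked_fill(future, float("-inf"))

weights = torch.softmax(q @ k.T / math.sqrt(d_k) + mask, dim=-1)
out = weights @ v

print(weights)   # upper triangle is exactly zero
```

The first row can only see itself, so its output equals the first value vector.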

Implementation: Multi-Head Latent Attention (MLA)

Here is the complete implementation of MLA:

class MultiheadLatentAttention(nn.Module):
    """
    Multihead Latent Attention (MLA) - DeepSeek's efficient attention mechanism

    Key innovations:
    - Compression/decompression of queries and key-values
    - LoRA-style low-rank projections for efficiency
    - RoPE with separate content and positional components
    """

    def __init__(self, config: DeepSeekConfig):
        super().__init__()
        self.config = config
        self.n_embd = config.n_embd
        self.n_head = config.n_head
        self.head_dim = config.n_embd // config.n_head

        # Compression dimensions
        self.kv_lora_rank = config.kv_lora_rank
        self.q_lora_rank = config.q_lora_rank
        self.rope_dim = config.rope_dim

Lines 11-21: Configuration and Dimensions. We extract key parameters from the configuration object, computing the head dimension as $d_{head} = d_{model} / n_{head}$. We store compression ranks (kv_lora_rank and q_lora_rank) and the RoPE dimension. These define the memory-accuracy tradeoff: lower ranks mean more compression but potentially lower quality. Our choices balance efficiency with model capacity.

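        # KV compression
        self.kv_proj = nn.Linear(self.n_embd, self.kv_lora_rank, bias=False)
        self.kv_norm = nn.RMSNorm(self.kv_lora_rank)  # needs PyTorch >= 2.4; a custom RMSNorm module also works

        # KV decompression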
        self.k_decompress = nn.Linear(self.kv_lora_rank, self.n_head * self.head_dim, bias=False)
        self.v_decompress = nn.Linear(self.kv_lora_rank, self.n_head * self.head_dim, bias=False)

        # Query compression
        self.q_proj = nn.Linear(self.n_embd, self.q_lora_rank, bias=False)
        self.q_decompress = nn.Linear(self.q_lora_rank, self.n_head * self.head_dim, bias=False)

        # RoPE projections
        self.k_rope_proj = nn.Linear(self.n_embd, self.n_head * self.rope_dim, bias=False)
        self.q_rope_proj = nn.Linear(self.q_lora_rank, self.n_head * self.rope_dim, bias=False)

        # Output projection
        self.o_proj = nn.Linear(self.n_head * self.head_dim, self.n_embd, bias=config.bias)

        # Dropout
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)

        # RoPE
        self.rope = RotaryEmbedding(self.rope_dim, config.block_size)

        # Causal mask
        self.register_buffer(
            "causal_mask",
            torch.tril(torch.ones(config.block_size, config.block_size)).view(
                1, 1, config.block_size, config.block_size
            )
        )

Lines 23-29: KV Compression Pipeline. The compression-decompression architecture follows the low-rank factorization principle. The kv_proj layer performs the down-projection from $d_{model}$ (n_embd) to $d_c$ (kv_lora_rank), cutting the dimensionality in half. We apply RMSNorm to the compressed representation for stability: this normalization helps prevent the compressed representation from drifting to extreme values during training. The decompression layers k_decompress and v_decompress then expand back to $n_{head} \times d_{head}$ dimensions. Note that we use bias=False for these projections: in practice, biases in attention projections tend to add parameters without a clear quality benefit.

Lines 31-33: Query Processing and RoPE Projections. Query handling follows a similar compression pattern but with a slightly higher rank (q_lora_rank is larger than kv_lora_rank). The asymmetry makes sense: we do not cache queries, so memory pressure is lower, and we can afford more capacity. The RoPE projections are separate pathways: k_rope_proj projects directly from the input x, while q_rope_proj projects from the compressed query representation. Both target the RoPE dimension of 64 per head. This separation of content and position is architecturally elegant: the model learns different transformations for “what” (content) versus “where” (position).

Lines 36-51: Infrastructure Components. The output projection o_proj combines multi-head outputs back to the model dimension. We include 2 dropout layers:

attn_dropout: applied to attention weights (reducing overfitting on attention patterns)
resid_dropout: applied to the final output (regularizing the residual connection)

The RoPE module is instantiated with our chosen dimension and maximum sequence length. Finally, we create and register a causal mask as a buffer — by using register_buffer, this tensor moves with the model to GPU/CPU and is included in the state dict, but is not treated as a learnable parameter.
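The behavior of `register_buffer` is easy to verify on a toy module. `MaskHolder` below is a hypothetical class written only for this check. It is not part of the tutorial's code.

```python
import torch
import torch.nn as nn


class MaskHolder(nn.Module):
    """Toy module with one learnable layer and one registered buffer."""

    def __init__(self, block_size: int):
        super().__init__()
        self.proj = nn.Linear(4, 4, bias=False)
        self.register_buffer(
            "causal_mask",
            torch.tril(torch.ones(block_size, block_size)).view(
                1, 1, block_size, block_size
            ),
        )


m = MaskHolder(block_size=8)
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]
state_keys = list(m.state_dict().keys())

print(param_names)    # ['proj.weight']
print(buffer_names)   # ['causal_mask']
print(state_keys)     # ['proj.weight', 'causal_mask']
```

An optimizer built from `m.parameters()` therefore never touches the mask, while calls such as `m.to(device)` or `m.double()` still move or cast it together with the weights.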

    def forward(self, x: torch.Tensor, attention_mask: Optional[torch.Tensor] = None):
        B, T, C = x.size()

        # Compression phase
        kv_compressed = self.kv_norm(self.kv_proj(x))
        q_compressed = self.q_proj(x)

        # Decompression phase
        k_content = self.k_decompress(kv_compressed)
        v = self.v_decompress(kv_compressed)
        q_content = self.q_decompress(q_compressed)

        # RoPE components
        k_rope = self.k_rope_proj(x)
        q_rope = self.q_rope_proj(q_compressed)

        # Reshape [B, H, T, d_head] for multi-head attention
        k_content = k_content.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        q_content = q_content.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k_rope = k_rope.view(B, T, self.n_head, self.rope_dim).transpose(1, 2)
        q_rope = q_rope.view(B, T, self.n_head, self.rope_dim).transpose(1, 2)

        # Apply RoPE
        cos, sin = self.rope(x, T)
        q_rope = apply_rope(q_rope, cos, sin)
        k_rope = apply_rope(k_rope, cos, sin)

        # Concatenate content and rope parts
        q = torch.cat([q_content, q_rope], dim=-1)
        k = torch.cat([k_content, k_rope], dim=-1)

Lines 52-57: Compression Phase. The forward pass begins by compressing the input. We project onto the KV latent space and apply normalization, then separately project the input onto the query latent space. These operations are lightweight, just matrix multiplications. The compressed KV representation is what we would cache during inference (queries are not cached). Notice that kv_compressed has shape $[B, T, d_c]$ versus the original $[B, T, d_{model}]$, so we have already halved the per-token width.

Lines 60-73: Decompression and RoPE. We decompress to get content components and compute separate RoPE projections. Then comes a crucial reshaping step: we convert from $[B, T, H \cdot d_{head}]$ to $[B, H, T, d_{head}]$, moving the head dimension before the sequence dimension. This layout is required for multi-head attention: each head operates independently, and we want to batch those operations. The .transpose(1, 2) operation efficiently swaps dimensions without copying data.
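The reshaping mechanics can be checked in isolation. The sketch below uses assumed toy sizes to show that `view` plus `transpose` shares storage with the original tensor, and why merging heads back needs `.contiguous()` before `.view`.

```python
import torch

B, T, H, d_head = 2, 5, 4, 8       # assumed toy sizes
flat = torch.arange(B * T * H * d_head, dtype=torch.float32).view(B, T, H * d_head)

# Split heads: [B, T, H*d_head] -> [B, T, H, d_head] -> [B, H, T, d_head]
heads = flat.view(B, T, H, d_head).transpose(1, 2)
shares_storage = heads.data_ptr() == flat.data_ptr()   # no copy was made

# Stand-in for an attention output laid out contiguously as [B, H, T, d_head].
attn_out = heads.contiguous()

# Merging heads back: transpose yields a strided view, so a plain .view fails.
try:
    attn_out.transpose(1, 2).view(B, T, H * d_head)
    view_failed = False
except RuntimeError:
    view_failed = True

merged = attn_out.transpose(1, 2).contiguous().view(B, T, H * d_head)
print(shares_storage, view_failed, torch.equal(merged, flat))
```

Head `h` at token `t` holds features `h * d_head` through `(h + 1) * d_head - 1` of that token, which is why the round trip restores the original tensor exactly.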

Lines 76-82: RoPE Application and Concatenation. We fetch cosine and sine tensors from our RoPE module and apply the rotation to both queries and keys. Critically, we only rotate the RoPE components, not the content components. This maintains the separation between “what” and “where” information. We then concatenate along the feature dimension, creating final query and key tensors of shape $[B, H, T, d_{head} + d_{rope}]$. The attention scores will capture both content similarity and relative position.

        # Attention computation
        scale = 1.0 / math.sqrt(q.size(-1))
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        # Apply causal mask
        scores = scores.masked_fill(self.causal_mask[:, :, :T, :T] == 0, float('-inf'))

        # Apply padding mask if provided (1 = real token, 0 = padding)
        if attention_mask is not None:
            padding_mask_additive = torch.zeros_like(attention_mask, dtype=scores.dtype).masked_fill(attention_mask == 0, float('-inf'))
            scores = scores + padding_mask_additive.unsqueeze(1).unsqueeze(2)

        # Softmax and dropout
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)

        # Apply attention to values
        out = torch.matmul(attn_weights, v)

        # Reshape and project
        out = out.transpose(1, 2).contiguous().view(B, T, self.n_head * self.head_dim)
        out = self.resid_dropout(self.o_proj(out))

        return out

Lines 84-94: Attention Score Computation and Masking. We compute scaled dot-product attention: $\text{scores} = QK^\top / \sqrt{d_k}$. The scaling factor $1/\sqrt{d_k}$ is critical for training stability: without it, attention logits would grow large as dimensions increase, saturating the softmax and leading to vanishing gradients. We apply the causal mask using masked_fill, setting future positions to negative infinity so they contribute zero probability after softmax. If an attention mask is provided (for handling padding), we convert it to an additive mask and add it to scores. This handles variable-length sequences in a batch.
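One detail worth checking when building an additive padding mask is how the $-\infty$ entries are produced. The sketch below (toy mask values assumed, with the convention 1 = real token and 0 = padding) builds the mask with `masked_fill`, and also shows that multiplying a 0/1 tensor by $-\infty$ produces NaN wherever the factor is 0.

```python
import torch

# Two sequences of length 4; the trailing zeros mark padding.
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]])
B, T = attention_mask.shape

scores = torch.zeros(B, 1, T, T)                       # stand-in for real logits
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal, float("-inf"))

# Additive padding mask: 0 for real tokens, -inf for padded key positions.
pad_additive = torch.zeros(B, T).masked_fill(attention_mask == 0, float("-inf"))
scores = scores + pad_additive[:, None, None, :]       # broadcast over heads and queries
weights = torch.softmax(scores, dim=-1)

# Pitfall: 0 * -inf is NaN, so this version corrupts the real-token positions.
naive = (1 - attention_mask).float() * float("-inf")

print(torch.isnan(naive).any().item(), torch.isnan(weights).any().item())
```

Padded key positions end up with zero attention weight, and every row still sums to 1 because the first token is always visible.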

Lines 97-107: Attention Weights and Output. We apply softmax to convert scores to probabilities, ensuring they sum to 1 over the key dimension. Dropout is applied to attention weights, which can help generalization by discouraging the model from becoming overly dependent on specific attention patterns. We multiply attention weights by values to get our output. The final transpose and reshape convert from the multi-head layout back to $[B, T, H \cdot d_{head}]$, concatenating all heads. The output projection and residual dropout complete the attention module.

Multi-Head Latent Attention and KV Cache Optimization

Multi-Head Latent Attention (MLA) is one approach to KV cache optimization — compression through low-rank projections. Other approaches include the following:

Multi-Query Attention (MQA), where all heads share a single key and value
Grouped-Query Attention (GQA), where heads are grouped to share KV pairs
KV Cache Quantization, which stores keys and values at lower precision (INT8 or INT4)
Cache Eviction Strategies, which discard less important past tokens

Each approach has the following trade-offs:

MQA and GQA are simpler but can reduce quality more than MLA
Quantization can degrade accuracy
Cache eviction strategies discard historical context

DeepSeek-V3’s MLA offers an appealing middle ground — significant memory savings with minimal quality loss through a principled compression approach.
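To compare the memory side of these approaches, the sketch below counts cached elements per token per layer. The helper name `kv_elems_per_token` and all sizes are assumptions for illustration. The MLA line follows the accounting used in this lesson and counts only the compressed latent. Quantization and eviction are omitted because they change precision and retained tokens rather than per-token width.

```python
def kv_elems_per_token(scheme, n_heads, head_dim, n_kv_groups=None, latent_dim=None):
    """Cached elements per token per layer under a given attention scheme."""
    if scheme == "mha":   # every head keeps its own key and value
        return 2 * n_heads * head_dim
    if scheme == "gqa":   # one key/value pair per group of heads
        return 2 * n_kv_groups * head_dim
    if scheme == "mqa":   # a single key/value pair shared by all heads
        return 2 * head_dim
    if scheme == "mla":   # only the compressed latent is stored
        return latent_dim
    raise ValueError(f"unknown scheme: {scheme}")


# Assumed example sizes, not taken from any specific model.
n_heads, head_dim = 32, 128
sizes = {
    "mha": kv_elems_per_token("mha", n_heads, head_dim),
    "gqa": kv_elems_per_token("gqa", n_heads, head_dim, n_kv_groups=8),
    "mqa": kv_elems_per_token("mqa", n_heads, head_dim),
    "mla": kv_elems_per_token("mla", n_heads, head_dim, latent_dim=512),
}
print(sizes)  # {'mha': 8192, 'gqa': 2048, 'mqa': 256, 'mla': 512}
```

These counts capture memory only. They say nothing about the quality side of each trade-off, which has to be measured empirically.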

For readers interested in diving deeper into KV cache optimization, we recommend exploring the “KV Cache Optimization” series, which covers these techniques in detail, including implementation strategies, benchmarking results, and guidance on choosing the right approach for a given use case.

With MLA implemented, we have addressed one of the primary memory bottlenecks in Transformer inference — the KV cache. Our attention mechanism can now serve longer contexts and more concurrent users within the same hardware budget. In the next lesson, we will address another critical challenge: scaling model capacity efficiently through Mixture of Experts (MoE).

Summary

In this 2nd lesson of our DeepSeek-V3 from Scratch series, we dive into the mechanics of Multi-Head Latent Attention (MLA) and why it is a crucial innovation for scaling large language models.

We begin by introducing MLA and framing it against the KV cache memory problem, a common bottleneck in Transformer architectures. By understanding this challenge, we set the stage for how MLA provides a more efficient solution through compression and smarter attention computation.

We then explore how low-rank projections enable MLA to compress key-value representations without losing essential information. This compression is paired with query compression and RoPE integration, ensuring that positional encoding remains geometrically consistent while reducing computational overhead.

Together, these techniques rethink the attention mechanism, balancing efficiency and accuracy and making MLA a powerful tool for modern architectures.

Finally, we walk through the implementation of MLA, showing how it connects directly to KV cache optimization.

By the end of this lesson, we not only understand the theory but also gain hands-on experience implementing MLA and integrating it into DeepSeek-V3. This practical approach shows how MLA reshapes attention computation, paving the way for more memory-efficient and scalable models.

Citation Information

Mangla, P. “Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture,” PyImageSearch, S. Huot, A. Sharma, and P. Thakur, eds., 2026, https://pyimg.co/scgjl

@incollection{Mangla_2026_build-deepseek-v3-mla-architecture,
  author = {Puneet Mangla},
  title = {{Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture}},
  booktitle = {PyImageSearch},
  editor = {Susan Huot and Aditya Sharma and Piyush Thakur},
  year = {2026},
  url = {https://pyimg.co/scgjl},
}

The post Build DeepSeek-V3: Multi-Head Latent Attention (MLA) Architecture appeared first on PyImageSearch.