<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DSP LOG</title>
	<atom:link href="https://dsplog.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://dsplog.com</link>
	<description>Signal Processing</description>
	<lastBuildDate>Sat, 14 Mar 2026 05:40:44 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>Loss functions for handling class imbalance</title>
		<link>https://dsplog.com/2026/03/05/loss-functions-for-handling-class-imbalance/</link>
					<comments>https://dsplog.com/2026/03/05/loss-functions-for-handling-class-imbalance/#respond</comments>
		
		<dc:creator><![CDATA[Krishna Sankar]]></dc:creator>
		<pubDate>Thu, 05 Mar 2026 01:05:23 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Asymmetric Loss]]></category>
		<category><![CDATA[Class Balanced Loss]]></category>
		<category><![CDATA[Focal Loss]]></category>
		<category><![CDATA[Logit Adjusted Loss]]></category>
		<category><![CDATA[Weighted Cross Entropy]]></category>
		<guid isPermaLink="false">https://dsplog.com/?p=2720</guid>

					<description><![CDATA[<p>To handle class imbalance, multiple strategies have emerged. This post covers Weighted Cross Entropy, Focal Loss, Asymmetric Loss, Class-Balanced Loss and Logit-Adjusted Loss. </p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2026/03/05/loss-functions-for-handling-class-imbalance/">Loss functions for handling class imbalance</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Most real-world datasets have <strong>class imbalance</strong>, where a &#8220;<strong>majority</strong>&#8221; class dwarfs the &#8220;<strong>minority</strong>&#8221; samples. Typical examples include identifying rare pathologies in medical diagnosis, flagging anomalous transactions for fraud detection, or detecting sparse foreground objects against a vast background in computer vision.</p>



<p>The machine learning models we have discussed &#8211; <strong>binary classification</strong> <sup>(refer post <a href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/">Gradients for Binary Classification with Sigmoid</a>)</sup> and <strong>multiclass classification</strong> <sup>(refer post <a href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/">Gradients for multi class classification with Softmax</a>)</sup> &#8211; need tweaks to <strong>learn</strong> from these imbalanced datasets. Without these adjustments, a model can &#8220;<strong>cheat</strong>&#8221; by favouring the <strong>majority class</strong>, reporting a <strong>pseudo high accuracy</strong> even though the per-class accuracy is low.</p>



<p>Different strategies have emerged over the years, and in this article we are covering the approaches listed below.</p>



<ol class="wp-block-list">
<li><strong>Weighted cross entropy</strong>
<ul class="wp-block-list">
<li>Foundational baseline, where a <strong>class-specific weight factor</strong> is applied to the standard cross-entropy loss to scale the loss based on the <strong>frequency</strong> of each class. </li>
</ul>
</li>



<li><strong><a href="https://arxiv.org/abs/1708.02002" target="_blank" rel="noreferrer noopener">Focal Loss for Dense Object Detection</a>,</strong> Lin et al. (2017)
<ul class="wp-block-list">
<li>Proposes a <strong>modulating</strong> factor <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?(1-p_t) ^\gamma" alt=""> to the cross-entropy loss to <strong>down-weight</strong> easy/frequent examples, which indirectly forces the model to focus on hard/rare examples</li>
</ul>
</li>



<li><strong><a href="https://arxiv.org/abs/2009.14119" target="_blank" rel="noreferrer noopener">Asymmetric Loss for Multi-Label Classification</a>, </strong>Ridnik et al. (2021)
<ul class="wp-block-list">
<li>Extends the intuition of Focal Loss by using an <strong>independent</strong> <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma" alt=""> hyper-parameter for positive and negative samples. This allows more aggressive down-weighting of easy/frequent examples while preserving the gradient signal for hard/rare samples.</li>



<li>Additionally, the authors introduce a <strong>probability margin</strong> that explicitly zeros out the loss from easy/frequent samples. </li>
</ul>
</li>



<li><strong><a href="https://arxiv.org/abs/1901.05555" target="_blank" rel="noreferrer noopener">Class-Balanced Loss Based on Effective Number of Samples</a>, </strong>Cui et al. (CVPR 2019)
<ul class="wp-block-list">
<li>Based on the intuition that there are <strong>similarities among the samples</strong>, the authors propose a framework to capture the <strong>diminishing benefit</strong> of adding more data samples to a class.</li>
</ul>
</li>



<li><strong><a href="https://arxiv.org/abs/2007.07314" target="_blank" rel="noreferrer noopener">Long-tail Learning via Logit Adjustment</a>, </strong>Menon et al. (ICLR 2021)
<ul class="wp-block-list">
<li>Starting from <strong>Bayes rule</strong>, the authors propose that adding a <strong>class-dependent offset based on the prior probabilities</strong> helps the model learn to <strong>minimise the balanced error rate</strong> (the average of the per-class error rates) instead of the global error rate. </li>
</ul>
</li>
</ol>



<span id="more-2720"></span>



<h2 class="wp-block-heading">Weighted Cross Entropy</h2>



<p>Standard Cross Entropy treats all classes equally, which becomes problematic when your dataset contains 1,000s of easy background examples but only 100s of rare foreground objects. In such cases, the majority class dominates the loss and biases the model. <strong>Weighted Cross Entropy (WCE)</strong> addresses this by assigning a static weight to each class, manually boosting the importance of rare samples.</p>



<h3 class="wp-block-heading">Binary weighted Cross Entropy</h3>



<p>For binary classification, a weighting factor <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha \in [0, 1]" alt=""> is applied to the standard BCE formula to scale the loss.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
BCE_W(y, p) = -[\alpha \cdot y \log(p) + (1 - \alpha) \cdot (1 - y) \log(1 - p)]
" alt="Weighted Binary Cross Entropy Formula"/>



<p>where <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha" alt=""> is typically set to the <strong>inverse of the class frequency</strong>. </p>



<p>Setting a high <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha" alt=""> for the rare class (e.g., 0.9 for the 100 foreground samples) and a low weight for the frequent class (0.1 for the 1,000 background samples) ensures that the rare foreground objects provide a <strong>sufficient gradient</strong> signal during training.</p>



<h3 class="wp-block-heading">Multiclass Weighted Cross Entropy</h3>



<p>In the multiclass case with <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?K" alt="" align="absmiddle"> classes, the loss for a single example where <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?c" alt="" align="absmiddle"> is the ground-truth label is defined as:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? CE_W(p) = -\alpha_c \log(p_c) " alt="Multiclass Weighted Cross Entropy Formula"/>



<p>Where <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_c" alt="" align="absmiddle"> is a fixed weight assigned to class <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?c" alt="" align="absmiddle">, typically calculated using the <strong>Inverse Class Frequency</strong>:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? \alpha_c = \frac{N}{K \cdot n_c}\\
\text{where, } \\
N \text{ is the total number of samples} " alt="Inverse Class Frequency formula"/>



<p>Weighted versions of the cross-entropy loss are natively supported in PyTorch as:</p>



<ul class="wp-block-list">
<li><strong>torch.nn.BCEWithLogitsLoss</strong> <sup>(<a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html" target="_blank" rel="noopener">refer</a>)</sup> : using the argument <strong>pos_weight</strong> for the binary classification</li>



<li><strong>torch.nn.CrossEntropyLoss</strong> <sup>(<a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html" target="_blank" rel="noopener">refer</a>)</sup> : using the argument <strong>weight</strong> for multiclass classification.</li>
</ul>
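As a minimal sketch of both APIs (assuming PyTorch is installed; the tensors and the 10:1 weight below are illustrative, not tuned), the built-in weighted losses can be cross-checked against the formulas above. Note that <strong>BCEWithLogitsLoss</strong> parameterises the weighting as a single <strong>pos_weight</strong> multiplier on the positive term rather than the <em>α / (1 − α)</em> pair:

```python
import torch
import torch.nn as nn

# Binary: BCEWithLogitsLoss weights the positive term by pos_weight.
# With ~1,000 background vs ~100 foreground samples, a 10:1 ratio is a
# natural starting point (illustrative values).
logits = torch.tensor([0.5, -1.2, 2.0])
targets = torch.tensor([1.0, 0.0, 1.0])
pos_weight = torch.tensor([10.0])
loss_b = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, targets)

# Manual cross-check of the weighted BCE formula
p = torch.sigmoid(logits)
manual_b = -(pos_weight * targets * torch.log(p)
             + (1 - targets) * torch.log(1 - p)).mean()
assert torch.allclose(loss_b, manual_b)

# Multiclass: CrossEntropyLoss with inverse-class-frequency weights
counts = torch.tensor([1000.0, 300.0, 100.0])  # n_c for each class
alpha = counts.sum() / (len(counts) * counts)  # alpha_c = N / (K * n_c)
loss_fn = nn.CrossEntropyLoss(weight=alpha)
loss_mc = loss_fn(torch.randn(4, 3), torch.tensor([0, 2, 1, 2]))
```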



<p>Toy example computing the loss manually vs. with the PyTorch implementation @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/loss_functions_for_class_imbalance/weighted_cross_entropy.ipynb">loss_functions_for_class_imbalance/weighted_cross_entropy.ipynb</a></p>



<iframe src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/loss_functions_for_class_imbalance/weighted_cross_entropy.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<h2 class="wp-block-heading">Focal Loss (Lin et al. 2017)</h2>



<p>In the paper <strong><a href="https://arxiv.org/abs/1708.02002" target="_blank" rel="noreferrer noopener">Focal Loss for Dense Object Detection</a></strong>, Lin et al. (2017), the authors propose an extension to the standard Cross Entropy loss to <strong>focus training on hard/rare examples</strong>. The key intuition is that adding a probability-dependent modulating factor to the loss down-weights the contribution of <strong>easy/frequent examples</strong> (where the estimated probability is close to the truth). This indirectly forces the training to focus specifically on the hard/rare examples.</p>



<p>Focal loss is defined as : </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
FL(y,p) = -[(1-p)^\gamma y  \log(p) +  p^\gamma (1-y)\log(1-p) ]
" alt="">



<p>where, </p>



<ul class="wp-block-list">
<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y\in\{0,1\}" alt=""> represent the ground truth labels and</li>



<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?p\in[0,1]" alt=""> is the estimated probability</li>



<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma" alt=""> is a hyperparameter to control the modulating factor</li>
</ul>



<p>Note : the standard cross entropy loss for binary classification is</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
CE(y,p) = -[y\log(p) + (1-y)\log(1-p) ]
" alt="">
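To see the modulating factor at work numerically (a toy sketch with <em>γ = 2</em>; the probability values are illustrative), compare the focal loss against the cross entropy for a positive example (<em>y = 1</em>), where both losses reduce to functions of <em>p</em> alone:

```python
import math

gamma = 2.0  # focusing parameter

def ce(p):
    # standard cross entropy for a positive example (y = 1)
    return -math.log(p)

def fl(p, gamma=gamma):
    # focal loss for y = 1: the (1 - p)^gamma factor modulates CE
    return -((1 - p) ** gamma) * math.log(p)

# Easy example: the model is already confident
easy = fl(0.9) / ce(0.9)   # (1 - 0.9)^2 = 0.01 -> loss cut 100x
# Hard example: the model is unsure
hard = fl(0.1) / ce(0.1)   # (1 - 0.1)^2 = 0.81 -> loss barely reduced
```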



<h3 class="wp-block-heading">Gradients in standard Cross Entropy Loss</h3>



<p>To understand how Focal Loss works, let us explore the gradient, i.e. the derivative of the loss with respect to the model&#8217;s output logit <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?z" alt="">. The model outputs a real number <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt="">, which is converted to a probability <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?p\in[0,1]" alt=""> using the sigmoid function <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\sigma(z)" alt="">. <br><br>Using the <strong>chain rule from calculus</strong>&nbsp;<a href="https://en.wikipedia.org/wiki/Chain_rule#Intuitive_explanation" target="_blank" rel="noopener"><sup>(refer wiki entry on Chain Rule)</sup></a>, the gradient of the loss with respect to <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt="">, <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial {L}}{\partial z}" alt="">, is the gradient of the loss with respect to the probability <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial {L}}{\partial p}" alt=""> multiplied by the gradient of the probability with respect to the logit <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial p}{\partial z}" alt=""> i.e.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial {L}}{\partial \mathbf{z}} = \frac{\partial {L}}{\partial p} \cdot \frac{\partial p}{\partial z} 
" alt="">



<p>For the standard <strong>Cross Entropy</strong> loss, as derived in the post on <a href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Gradients_with_Binary_Cross_Entropy_BCE_Loss" target="_blank" rel="noreferrer noopener">Gradients for Binary Classification with Sigmoid</a>, the gradient is, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
\frac{\partial {CE}}{\partial {p}} &#038; = &#038; -\left[\frac{y}{p} - \frac{1-y}{1-p} \right]  \\
\frac{\partial {p}}{\partial {z}} &#038; = &#038; p(1-p)  \\
\\
\text{then, }
\\

\frac{\partial {CE}}{\partial {z}} 
&#038; = &#038; \frac{\partial {CE}}{\partial {p}} \cdot   \frac{\partial {p}}{\partial {z}} \\
&#038;=&#038; -\left[\frac{y}{p} - \frac{1-y}{1-p} \right] \cdot p(1-p) \\
&#038;=&#038;-\left[y(1-p) - (1-y)p \right]  \\
&#038;=&#038;p-y


\end{array}
" alt="">



<p>The <strong>gradient is linear</strong> and depends only on the error &#8211; this means that even an &#8220;easy/frequent&#8221; example (where the error is small, e.g., 0.1), when <strong>summed over a large number of easy examples</strong>, still contributes to the loss and can <strong>overwhelm</strong> the training.</p>
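The closed-form result can be sanity-checked with autograd (a small sketch assuming PyTorch; the logits and labels are arbitrary):

```python
import torch

# Arbitrary logits and labels for the check
z = torch.tensor([1.5, -0.7, 0.2], requires_grad=True)
y = torch.tensor([1.0, 0.0, 1.0])

p = torch.sigmoid(z)
ce = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).sum()
ce.backward()

# autograd agrees with the derived closed form dCE/dz = p - y
assert torch.allclose(z.grad, p.detach() - y)
```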



<h3 class="wp-block-heading">Gradients in Focal Loss</h3>



<p>For computing the gradients with focal loss, let us define the <strong>ground truth labels</strong> <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y\in\{0,1\}" alt=""> and the model&#8217;s <strong>estimated probability</strong> <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?p\in[0,1]" alt=""> as : </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? p_t = \begin{cases} p &#038; \text{if } 
y = 1 \\ 1 - p 
&#038; 
 \text{otherwise} \end{cases} " alt="Definition of pt">



<p>where,</p>



<ul class="wp-block-list">
<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y=0" alt=""> : background class with 1,000s of easy/frequent examples</li>



<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y=1" alt=""> : foreground class with 100s of hard/rare examples</li>
</ul>



<p>Taking the case of <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y=1" alt="">, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
\frac{\partial {FL}}{\partial p_t} 
&#038; = &#038; -(1-p_t)^\gamma \cdot \frac{\partial }{\partial p_t}\log(p_t) - \log(p_t) \frac{\partial  }{\partial p_t}(1-p_t)^\gamma \\

&#038; = &#038; -(1-p_t)^\gamma \cdot \frac{1}{(p_t)} + \gamma (1-p_t)^{\gamma-1}\log(p_t) \\


\end{array}
" alt="">



<p>Multiplying with <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial {p_t}}{\partial {z}}  =  p_t(1-p_t) " alt=""> ,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}

\frac{\partial {FL}}{\partial {z}} 
&#038; = &#038; \frac{\partial {FL}}{\partial {p_t}} \cdot   \frac{\partial {p_t}}{\partial {z}} \\

&#038; = &#038; \left[-(1-p_t)^\gamma \cdot \frac{1}{p_t} + \gamma (1-p_t)^{\gamma-1}\log(p_t) \right] \cdot p_t(1-p_t) \\

&#038; = &#038; -(1-p_t)^{\gamma+1} + \gamma p_t(1-p_t)^{\gamma}\log(p_t) \\
&#038; = &#038; (1-p_t)^{\gamma}\left[-(1-p_t) + \gamma p_t\log(p_t) \right] \\
&#038; = &#038; \underbrace{(1-p_t)^{\gamma}}_{\text{scaling term}}\left[\underbrace{(p_t-1)}_{\text{CE term}} + \underbrace{\gamma p_t\log(p_t)}_{\text{focal term}} \right] 

\end{array}
" alt="">



<p>Sweeping the value of <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?p" alt=""> from 0 to 1 for <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma=2" alt="">, the behaviour of the individual terms is as shown in the plot below. </p>



<div class="wp-block-cover"><span aria-hidden="true" class="wp-block-cover__background has-background-dim"></span><div class="wp-block-cover__inner-container is-layout-flow wp-block-cover-is-layout-flow">
<div class="wp-block-group"><div class="wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained">
<figure class="wp-block-gallery aligncenter has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex">
<figure class="wp-block-image size-full is-style-default"><a href="https://dsplog.com/db-install/wp-content/uploads/2026/01/image-2.png"><img fetchpriority="high" decoding="async" width="711" height="536" data-id="2733" src="https://dsplog.com/db-install/wp-content/uploads/2026/01/image-2.png" alt="" class="wp-image-2733" srcset="https://dsplog.com/db-install/wp-content/uploads/2026/01/image-2.png 711w, https://dsplog.com/db-install/wp-content/uploads/2026/01/image-2-300x226.png 300w" sizes="(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px" /></a></figure>
</figure>
</div></div>
</div></div>



<p>code @<strong> <a href="https://github.com/dsplog/dsplog.com/blob/main/code/loss_functions_for_class_imbalance/focal_loss_terms.py" target="_blank" rel="noreferrer noopener">focal_loss_terms.py</a></strong></p>



<p>The model learns easy/frequent examples much faster and <strong><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?p" alt=""> is close to the ground truth <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt=""></strong>, which means <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?p_t\rightarrow 1" alt="">. As <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?p_t" alt=""> approaches 1, the scaling term <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?(1-p_t) ^\gamma" alt=""> effectively <strong>silences the gradient.</strong></p>



<p>Plugging in numbers, when the model is estimating <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?p_t \approx 0.99" alt=""> for the frequent examples, the throttle becomes <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?(1-0.99)^2 \approx 0.0001" alt=""> and the gradient from these examples is effectively silenced.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
\frac{\partial FL}{\partial z} &#038; \approx &#038;(1-p_t)^\gamma (p - y)
\end{array}
" alt="">



<p>Thus the term <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?(1-p_t) ^\gamma" alt=""> acts as a <strong>throttle</strong> for <strong>easy/frequent examples</strong>.</p>
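The throttling can be observed directly with autograd (a sketch assuming PyTorch; <em>γ = 2</em> and the logit value are illustrative). For an easy positive example with <em>p ≈ 0.99</em>, the full focal gradient comes out a few thousand times smaller than the CE gradient:

```python
import torch

gamma = 2.0
y = torch.tensor([1.0])

def grad_wrt_logit(loss_fn, z0):
    # gradient of the given loss at logit z0, via autograd
    z = torch.tensor([z0], requires_grad=True)
    loss_fn(torch.sigmoid(z)).backward()
    return z.grad.item()

def ce(p):
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).sum()

def fl(p):
    return -((1 - p) ** gamma * y * torch.log(p)
             + p ** gamma * (1 - y) * torch.log(1 - p)).sum()

# z = 4.6 gives p = sigmoid(4.6) ~ 0.99: an easy, confident positive
g_ce = grad_wrt_logit(ce, 4.6)
g_fl = grad_wrt_logit(fl, 4.6)
ratio = abs(g_fl / g_ce)   # ~3e-4: the easy example is throttled
```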






<h3 class="wp-block-heading">The Weighting Factor <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha" alt=""></h3>



<p>With the focusing parameter <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma" alt=""> down-weighting easy/frequent examples, choosing the <strong>class weight</strong> <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha" alt=""> parameter as the <strong>inverse of class frequency</strong> is no longer preferred. To understand the intuition, let us define <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_t" alt=""> as below :</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? 
\alpha_t = 
\begin{cases} \alpha &amp; \text{if } 
y = 1 \text{ (foreground)} \\ 
1 - \alpha &amp; \text{if } y = 0 \text{ (background)}
\end{cases} " alt="Definition of alpha_t" />



<p>The <strong>Focal Loss</strong> including <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_t" alt=""> is :</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? FL(p_t) = -\alpha_t (1-p_t)^\gamma \log(p_t) " alt="Alpha Balanced Focal Loss"/>






<p>When we go with the <strong>inverse of class frequency</strong>, the typical values of <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha" alt=""> are : </p>



<ul class="wp-block-list">
<li>high <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha" alt=""> (around 0.9) for <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y=1" alt=""> (hard/rare foreground class) and </li>



<li>low <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha" alt=""> (around 0.1) for <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y=0" alt=""> (easy/frequent background class)</li>
</ul>



<p>With the Focal Loss, the focusing term <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?(1-p_t)^\gamma" alt=""> aggressively down-weights the easy examples and the accumulated loss from the background class drops drastically. Then with high <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha" alt=""> the <strong>hard/rare foreground</strong> class with only 100s of examples will now<strong> dominate the gradient and can cause instability</strong>.</p>



<p>Therefore, as <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma" alt=""> <strong>is increased</strong>, <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha" alt=""> should be <strong>decreased</strong>. In the paper, for <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma=2" alt="">, the authors found the best balance was actually <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha=0.25" alt=""> for the foreground class <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y=1" alt="">.</p>



<h3 class="wp-block-heading">Extension to Multiclass Focal Loss</h3>



<p>While the binary case uses a single probability <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?p" alt="">, multiclass classification involves <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?C" alt=""> distinct classes. In the <strong>multiclass</strong> setting, the model outputs a <strong>vector of logits</strong>, which is transformed into probabilities using the <strong>Softmax</strong> function. The estimated probability for the <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?l^{th}" alt=""> class is :</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? 
P_{l} = \frac{e^{z_{l}}}{\sum_{j=1}^{C} e^{z_{j}}}"/>



<p>The Multiclass Focal Loss for a single example with <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?l^{th}" alt=""> ground-truth class is,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? 
FL_{\text{multi class}} = -\alpha_{l} (1 - P_{l})^\gamma \log(P_{l})
"/>



<p>Typically <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma=2" alt=""> is chosen as a scalar, and the weighting factor <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_{l}" alt=""> is defined as a class-dependent vector. </p>



<p>Choosing <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_{l}=0.25" alt=""> for the rare classes and <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_{l}=0.75" alt=""> for the frequent classes is a choice that can be arrived at via hyper-parameter tuning. Though it is counter-intuitive to give a higher <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha" alt=""> to frequent classes, it helps prevent their contribution from being completely <strong>throttled</strong> by the <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?(1-p_t)^\gamma" alt=""> term. </p>
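A minimal sketch of the multiclass focal loss (assuming PyTorch; the function name, batch, and <em>α</em> values are illustrative). Setting <em>γ = 0</em> with uniform <em>α = 1</em> recovers the standard cross entropy, which gives a convenient self-check:

```python
import torch
import torch.nn.functional as F

def focal_loss_multiclass(logits, targets, alpha, gamma=2.0):
    # -alpha_l * (1 - P_l)^gamma * log(P_l), averaged over the batch;
    # alpha is a length-C tensor of per-class weights
    log_p = F.log_softmax(logits, dim=-1)
    log_p_l = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    p_l = log_p_l.exp()
    return (-alpha[targets] * (1 - p_l) ** gamma * log_p_l).mean()

logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
# illustrative: higher alpha for frequent class 0, lower for rare 1, 2
alpha = torch.tensor([0.75, 0.25, 0.25])
loss = focal_loss_multiclass(logits, targets, alpha)

# gamma = 0 and uniform alpha = 1 reduce to standard cross entropy
assert torch.allclose(
    focal_loss_multiclass(logits, targets, torch.ones(3), gamma=0.0),
    F.cross_entropy(logits, targets))
```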



<p>Toy example showing implementation of Focal Loss for binary and multi-class classification @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/loss_functions_for_class_imbalance/focal_loss_binary_multiclass.ipynb"><strong>loss_functions_for_class_imbalance/focal_loss_binary_multiclass.ipynb</strong></a></p>



<iframe src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/loss_functions_for_class_imbalance/focal_loss_binary_multiclass.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<h2 class="wp-block-heading">Asymmetric Loss (Ridnik et al. 2021)</h2>



<p>In the focal loss definition, the <strong>same <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma" alt=""> is used for both the background class</strong> with its high count of easy examples and the <strong>rare foreground class</strong>. If a higher <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma" alt=""> is used to throttle the gradients of the easy background class, it also throttles the learning of the hard foreground classes.</p>



<p>In the paper <a href="https://arxiv.org/abs/2009.14119" target="_blank" rel="noopener"><strong>Asymmetric Loss for Multi-Label Classification</strong></a>, Ridnik et al. (2021), the authors propose to <strong>decouple</strong> the <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma" alt=""> for the <strong>foreground and background classes</strong>.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?

L = \begin{cases} 
-(1-p)^{(\gamma_+)} \log(p) &#038; \text{if } y=1 \quad \text{(hard/rare foreground class)}\\ 
-p^{(\gamma_-)} \log(1-p) &#038; \text{if } y=0 \quad \text{(easy/frequent background class)}
\end{cases}
" alt=""/>



<p>To give <strong>emphasis to the contribution of positive samples</strong>, set <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma_- \gt \gamma_+" alt="">. </p>



<p>Typical values are <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma_+ =0" alt="">, so that the hard/low-count positive samples behave like standard cross entropy loss, and <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma_- =2" alt=""> to throttle the gradients of the easy/high-count background class.</p>



<p>The authors further propose <strong>adding a margin</strong> on the probability of easy background samples via probability shifting, which <strong>discards them when the probability is below a threshold</strong>.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
p_m= \max(p-m,0)
" alt=""/>



<p>with <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> as a hyperparameter and a typical value being <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?m=0.2" alt="">. <br></p>



<p>Combining both, the <strong>Asymmetric Loss</strong> is defined as,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?

ASL = \begin{cases} 
-(1-p)^{(\gamma_+)} \log(p) &#038; \text{if } y=1 \quad \text{(hard/rare foreground class)}\\ 
-p_m^{(\gamma_-)} \log(1-p_m) &#038; \text{if } y=0 \quad \text{(easy/frequent background class)}
\end{cases}
" alt=""/>
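A toy sketch of the asymmetric loss (assuming PyTorch; the function name is hypothetical, the defaults <em>γ+ = 0</em>, <em>γ− = 2</em>, <em>m = 0.2</em> follow the typical values above, and the probabilities are illustrative):

```python
import torch

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=2.0, m=0.2, eps=1e-8):
    # probability shifting for the negative branch: p_m = max(p - m, 0)
    p_m = torch.clamp(p - m, min=0.0)
    loss_pos = -((1 - p) ** gamma_pos) * torch.log(p.clamp(min=eps))
    loss_neg = -(p_m ** gamma_neg) * torch.log((1 - p_m).clamp(min=eps))
    return torch.where(y == 1, loss_pos, loss_neg).mean()

p = torch.tensor([0.9, 0.15, 0.6])   # estimated probabilities
y = torch.tensor([1, 0, 0])          # ground-truth labels
loss = asymmetric_loss(p, y)

# The easy negative (p = 0.15 < m = 0.2) is discarded entirely:
assert asymmetric_loss(torch.tensor([0.15]), torch.tensor([0])).item() == 0.0
```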



<p>Toy implementation of the asymmetric loss @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/loss_functions_for_class_imbalance/assymetric_loss.ipynb"><strong>loss_functions_for_class_imbalance/assymetric_loss.ipynb</strong></a></p>



<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/loss_functions_for_class_imbalance/assymetric_loss.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<h2 class="wp-block-heading">Class-Balanced Loss (Cui et al. 2019)</h2>



<p>In the paper <strong><a href="https://arxiv.org/abs/1901.05555" target="_blank" rel="noreferrer noopener">Class-Balanced Loss Based on Effective Number of Samples</a>, </strong>Cui et al. (2019), the authors argue that there are <strong>similarities among the samples</strong>: as the number of samples increases, the probability that a new sample is already <strong>covered</strong> by the existing samples increases. Based on this intuition, the authors propose a framework to capture the <strong>diminishing benefit</strong> of adding more data samples to a class.</p>



<h3 class="wp-block-heading">Derivation</h3>



<p>Let us denote the effective number of samples as <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?E_n" alt="">, and the total volume of this space as <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?N" alt="">. Consider the case where we have <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?n-1" alt=""> examples and are about to sample the <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?n^{th}" alt=""> example. The probability that the newly sampled example overlaps with the previous samples is,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?p=\frac{E_{n-1}}{N}" alt=""/>



<p>The <strong>expected volume</strong> after adding the <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?n^{th}" alt=""> example is,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}E_n 
&#038; = &#038; pE_{n-1} + (1-p)(E_{n-1}+1) \\
&#038; = &#038; pE_{n-1} + E_{n-1}+1 -pE_{n-1} - p \\
&#038; = &#038; E_{n-1} + 1-p\\
\quad \text{substituting for } p, \\ 
&#038; = &#038; E_{n-1} + 1- \frac{E_{n-1}}{N} \\
&#038; = &#038; \frac{NE_{n-1} + N - E_{n-1}}{N} \\
&#038; = &#038; 1 + \frac{N-1}{N}E_{n-1}  \\
&#038; = &#038; 1 + \beta E_{n-1}, \quad \text{where, } \beta =  \frac {N-1}{N}



\end{array}
" alt=""/>



<p>To solve for <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?E_n" alt="">, re-writing as a <strong>geometric series</strong>,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llll}
n=1, &#038; E_1 &#038; =&#038;  1\\
n=2, &#038; E_2 &#038;= &#038; 1+\beta E_1 = 1+\beta \\
n=3, &#038; E_3 &#038;= &#038; 1+\beta E_2 = 1+\beta(1+\beta) = 1+ \beta + \beta^2 \\
n=4, &#038; E_4 &#038;= &#038; 1+\beta E_3 = 1+\beta(1+\beta+\beta^2) = 1+ \beta + \beta^2 + \beta^3 \\
\vdots

\end{array}
" alt=""/>



<p>In general, <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?E_n" alt=""> can be written as</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llllll}
E_n &#038; = &#038; \sum_{j=1}^{n} \beta^{j-1} &#038; = &#038; 1 + \beta + \beta^2 + \cdots + \beta^{n-1}

\end{array}
" alt=""/>



<p>Solving for <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?E_n" alt="">, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llllll}
E_n - \beta E_n &#038; = &#038; (1 + \beta + \beta^2 + \cdots + \beta^{n-1}) - \beta(1 + \beta + \beta^2 + \cdots + \beta^{n-1}) \\
&#038; = &#038; 1-\beta^n \\
\text{solving, }\\
(1-\beta)E_n &#038; = &#038; 1-\beta^n \\
E_n &#038; = &#038; (1-\beta^n)/(1-\beta)

\end{array}
" alt=""/>



<p><strong>Note</strong>:</p>



<ul class="wp-block-list">
<li>When <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta=0" alt="">, the effective number of samples <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?E_n=1" alt=""> indicating that there is <strong>no benefit</strong> in adding more samples.</li>



<li>When <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta\rightarrow 1" alt="">, the effective number of samples <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?E_n=n" alt="">, indicating that each sample is treated as <strong>unique</strong>.</li>
</ul>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
\lim_{\beta \to 1}E_n  
&#038; = &#038; \lim_{\beta \to 1}\frac{(1-\beta^n)}{(1-\beta)} \\
\text{using L'Hopital's rule, } \\
&#038;  = &#038; \frac{-n\beta^{n-1}}{-1} = n
\end{array}
" alt=""/>



<p>In the paper, the authors explore <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta" alt=""> as a <strong>hyper-parameter</strong> and report that on the long-tailed CIFAR-10 (imbalance factor = 50) dataset, the best value is <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta=0.9999" alt="">. In this dataset, the <strong>most frequent class has 5000</strong> images, while the <strong>rarest class has 100</strong> images. With <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta=0.9999" alt="">, the effective numbers of samples for the rarest and most frequent classes are</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
E_{100} = \frac{1-\beta^{100}}{1-\beta} = 99.5 \\
E_{5000} = \frac{1-\beta^{5000}}{1-\beta} = 3934.85 \\
\end{array}
" alt=""/>






<figure class="wp-block-table aligncenter"><table class="has-fixed-layout"><thead><tr><td><strong>Weighting Scheme</strong></td><td><strong>β Value</strong></td><td><strong>Majority En​</strong></td><td><strong>Minority En​</strong></td><td><strong>Ratio (Maj/Min)</strong></td></tr></thead><tbody><tr><td><strong>Inverse Frequency</strong></td><td><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta\rightarrow 1" alt=""></td><td>5000</td><td>100</td><td><strong>50.0 : 1</strong></td></tr><tr><td><strong>Class-Balanced</strong></td><td><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta=0.9999" alt=""></td><td>3934.85</td><td>99.5</td><td><strong>39.5 : 1</strong></td></tr><tr><td><strong>Class-Balanced</strong></td><td><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta=0.999" alt=""></td><td>993</td><td>95.3</td><td><strong>10.4 : 1</strong></td></tr><tr><td><strong>No Weighting</strong></td><td><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta=0" alt=""></td><td>1</td><td>1</td><td><strong>1.0 : 1</strong></td></tr></tbody></table><figcaption class="wp-element-caption">Table : Relative ratio of Effective samples in CIFAR long tail (imbalance factor=50) dataset</figcaption></figure>



<p>Though the raw sample-count ratio between the frequent and rare classes is 50:1, by <strong>choosing a lower</strong> <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta" alt="">, we assume <strong>higher redundancy</strong> in the dataset and give less weight to the sample count of the majority class.</p>



<h3 class="wp-block-heading">Applying to loss</h3>



<p>To balance the loss, for each class <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?i" alt=""> with <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?n_i" alt=""> samples, a weighting factor <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_i" alt=""> that is <strong>inversely proportional</strong> to the effective number of samples for that class is chosen, i.e.</p>



<p><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_i \propto 1/E_{n_i}" alt="">.</p>



<p>To keep the total loss on roughly the same scale when applying <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_i" alt="">, a normalization is applied so that the <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_i" alt=""> sum to the class count <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?C" alt="">, i.e.</p>



<p><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\sum_{i=1}^C\alpha_i = C" alt=""> <br></p>



<p>With this definition, </p>



<p>a) the <strong>class balanced softmax loss</strong> is,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
\mathcal{L}_{\text{CB CE}}(\mathbf{y}, \mathbf{p}) 
&#038; = &#038; - \sum_{i=1}^{C} \alpha_iy_i \log(p_i) \\
&#038; = &#038; - \sum_{i=1}^{C} \(\frac{1-\beta}{1-\beta^{n_i}}\)y_i\log(p_i)
\end{array}
" alt=""/>



<p>b) <strong>class balanced focal loss </strong>is,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array} {lll}

\mathcal{L}_{\text{CB FL}}(\mathbf{y}, \mathbf{p}) 
&#038; = &#038; -\sum_{i=1}^{C}  \alpha_i(1-p_i)^\gamma y_i\log(p_i) \\ 
&#038; = &#038; - \sum_{i=1}^{C} \(\frac{1-\beta}{1-\beta^{n_i}}\)(1-p_i)^\gamma y_i\log(p_i)
\end{array}
" alt=""/>



<p><strong>Class-Balanced Loss</strong> is a <strong>specific weighting strategy</strong> for standard loss functions; it provides a mathematically grounded way to calculate the weight <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_i" alt="">, capturing the &#8220;<strong>effective number of samples</strong>&#8221;.</p>
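<p>As a minimal sketch (NumPy; the three class counts and the β value below are illustrative toy numbers, not from the paper), the class-balanced weights can be computed as:</p>

```python
import numpy as np

counts = np.array([5000, 500, 100])              # samples per class (toy values)
beta = 0.999

eff_num = (1.0 - beta ** counts) / (1.0 - beta)  # effective number per class
alpha = 1.0 / eff_num                            # inverse effective number
alpha = alpha / alpha.sum() * len(counts)        # normalize so sum(alpha) == C

print(alpha)                                     # the rare class gets the largest weight
```

<p>The resulting weights can then be passed to a standard loss, e.g. <code>torch.nn.CrossEntropyLoss(weight=...)</code>, to obtain the class-balanced cross entropy above.</p>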



<p>Code to find the class balanced weights @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/loss_functions_for_class_imbalance/class_balanced_weights.ipynb"><strong>loss_functions_for_class_imbalance/class_balanced_weights.ipynb</strong></a></p>



<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/loss_functions_for_class_imbalance/class_balanced_weights.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<h2 class="wp-block-heading">Logit Adjustment (Menon et al 2021)</h2>



<p>In the paper <strong><a href="https://arxiv.org/abs/2007.07314" target="_blank" rel="noreferrer noopener">Long-tail Learning via Logit Adjustment</a>, </strong>Menon et al. (ICLR 2021), the authors argue that for scenarios with heavy class imbalance, the <strong>average misclassification error is not a suitable metric</strong>.</p>



<h3 class="wp-block-heading">Average Classification error in Multiclass classification </h3>



<p>Consider that <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt=""> is an <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi? n" alt=""> dimensional input feature vector <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\[x_1, x_2, \cdots, x_n\]" alt=""> and the model is trained on a multiclass classification task to learn the probability of <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?L" alt=""> classes.</p>



<p>The model <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?f_y(x)" alt=""> outputs a vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z}%20\in%20\mathbb{R}^{L%20\times%201}" alt=""> which captures the <strong>unnormalized log-probability (aka logit) </strong>for each class. The scores are converted into <strong>probabilities</strong> using the <strong>SoftMax</strong> function. For the class <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?k" alt="">, the estimated probability is,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array} {lll}
P(y_k|\mathbf{x}) = \frac{\exp(f_{y_k}(\mathbf{x}))}{\sum_{i=1}^L\exp(f_{y_i}(\mathbf{x}))}
\end{array}
" alt=""/>



<p>Taking logarithm, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array} {lll}
\ln(P(y_k|\mathbf{x})) 
&#038; = &#038; \ln\(\frac{\exp(f_{y_k}(\mathbf{x}))}{\sum_{i=1}^L\exp(f_{y_i}(\mathbf{x}))}\) \\
&#038; = &#038; \ln(\exp(f_{y_k}(\mathbf{x}))) - \ln\(\sum_{i=1}^L\exp(f_{y_i}(\mathbf{x}))\) \\
&#038; = &#038; f_{y_k}(\mathbf{x}) -  C

\end{array}
" alt=""/>



<p>where the constant <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?C=\ln\(\sum_{i=1}^L\exp(f_{y_i}(\mathbf{x}))\)" alt="">. <br><br>To estimate the probability of the true class <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?k" alt=""> given the input <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt="">, the training loop <strong>minimizes the negative log likelihood</strong>, i.e.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array} {lll}
L(y,f(\mathbf{x})) &#038; = &#038; -\ln(P(y_k|\mathbf{x})) \\ 
&#038;=&#038; -\ln \(\frac{\exp(f_{y_k}(\mathbf{x}))}{\sum_{i=1}^L\exp(f_{y_i}(\mathbf{x}))}\) \\
&#038;=&#038;-f_{y_k}(\mathbf{x}) + \ln \(\sum_{i=1}^L\exp(f_{y_i}(\mathbf{x}))\) \\
&#038;=&#038;-f_{y_k}(\mathbf{x}) + \ln \(\exp(f_{y_k}(\mathbf{x})) + \sum_{i=1,i \ne k}^L\exp(f_{y_i}(\mathbf{x}))\)\\
&#038;=&#038;-f_{y_k}(\mathbf{x}) + \ln \(\exp(f_{y_k}(\mathbf{x})) \(1 + \frac{\sum_{i=1,i \ne k}^L\exp(f_{y_i}(x))}{\exp(f_{y_k}(x))}\)\) \\
&#038;=&#038;-f_{y_k}(\mathbf{x}) + f_{y_k}(\mathbf{x})  + \ln \(1 + \frac{\sum_{i=1,i \ne k}^L\exp(f_{y_i}(\mathbf{x}))}{\exp(f_{y_k}(\mathbf{x}))}\) \\
&#038;=&#038; \ln \(1 + \sum_{i=1,i \ne k}^L\frac{\exp(f_{y_i}(\mathbf{x}))}{\exp(f_{y_k}(\mathbf{x}))}\) \\
&#038;=&#038; \ln \(1 + \sum_{i=1,i \ne k}^L\exp(f_{y_i}(\mathbf{x})-f_{y_k}(\mathbf{x}))\)


\end{array}
" alt=""/>



<p>From the above equation, we can see that when the <strong>logit corresponding to the true class</strong> <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?f_{y_k}(x)" alt=""> is <strong>much greater</strong> than the <strong>logit corresponding to an incorrect class</strong> <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?f_{y_i}(x)" alt="">, i.e. <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?f_{y_k}(x) \gg f_{y_i}(x)" alt="">, the exponential term <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\exp(f_{y_i}(x) - f_{y_k}(x)) \rightarrow 0" alt=""> and the <strong>loss tends to 0</strong>.</p>
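<p>The identity derived above can be verified numerically; a small sketch (plain Python, with arbitrary illustrative logit values):</p>

```python
import math

f = [2.0, -1.0, 0.5]   # logits for L = 3 classes (illustrative values)
k = 0                  # index of the true class

# negative log likelihood computed directly from the softmax
nll = -math.log(math.exp(f[k]) / sum(math.exp(fi) for fi in f))

# the equivalent form derived above: ln(1 + sum_{i != k} exp(f_i - f_k))
rhs = math.log(1.0 + sum(math.exp(fi - f[k]) for i, fi in enumerate(f) if i != k))

print(nll, rhs)        # the two agree; both shrink as f[k] grows
```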



<p>To understand how the <strong>class imbalance affects the loss</strong>, the term <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?f_{y_i}(\mathbf{x}) - f_{y_k}(\mathbf{x})" alt=""> can be expanded using <strong>Bayes rule</strong> <sup>(<a href="https://en.wikipedia.org/wiki/Bayes%27_theorem" target="_blank" rel="noopener">refer wiki entry</a>)</sup> as,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array} {lll}
f_{y_i}(\mathbf{x}) - f_{y_k}(\mathbf{x}) 
&#038; = &#038; \ln(P(y_i|\mathbf{x})) - \ln(P(y_k|\mathbf{x}))  \\ 
&#038; = &#038; \ln\(\frac{P(y_i|\mathbf{x})}{P(y_k|\mathbf{x})}\)  \\ 
\text{using Bayes rule, }\\
&#038; = &#038; \ln\(\frac{\frac{P(\mathbf{x}|y_i)P(y_i)}{P(\mathbf{x})}}{\frac{P(\mathbf{x}|y_k)P(y_k)}{P(\mathbf{x})}}\) \\

&#038; = &#038; \ln\(\frac{P(\mathbf{x}|y_i)P(y_i)}{P(\mathbf{x}|y)P(y_k)}\) \\
&#038; = &#038; \underbrace{\ln\(\frac{P(\mathbf{x}|y_i)}{P(\mathbf{x}|y_k)}\)}_{\text{likelihood}} + \underbrace{\ln\(\frac{P(y_i)}{P(y_k)}\)}_{\text{class frequency}\\



\end{array}
" alt=""/>



<p>If the <strong>classes are balanced,</strong> then the <strong>class frequency term <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\ln\(\frac{P(y_i)}{P(y_k)}\)" alt=""> tends to 0</strong> and does not contribute to the loss. However, when there is <strong>class imbalance</strong>, for example with the <strong>class <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?k" alt=""> being rare</strong>, the term <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\ln\(\frac{P(y_i)}{P(y_k)}\)" alt=""> is a <strong>large positive number contributing to the loss</strong>. </p>



<p>To minimize the loss, instead of doing the &#8220;hard work&#8221; of learning discriminative features in the likelihood term <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\ln(\frac{P(\mathbf{x}|y_i)}{P(\mathbf{x}|y_k)})" alt="">, the model can &#8220;<strong>cheat</strong>&#8221; by biasing its predictions toward the majority class <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y_i" alt="">.</p>



<p>Thus we can see that a model which <strong>minimizes the average misclassification error </strong>has its learning affected by the prior probabilities i.e. <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(y|\mathbf{x}) \propto P(\mathbf{x}|y)P(y)" alt="">.</p>



<h3 class="wp-block-heading">Logit Adjustment for Balanced Error rate</h3>



<p>For a model to <strong>minimize the balanced error rate</strong> i.e. <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P^{\text{bal}}(y|\mathbf{x}) \propto \frac{1}{L} P(\mathbf{x}|y)" alt="">, the loss should depend only on the likelihood <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi? P(\mathbf{x}|y)" alt=""> and <strong>not be affected by the prior probabilities </strong><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(y)" alt="">.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array} {lll}

E_{bal} = \frac{1}{L}\sum_{i=1}^{L}P(\hat{y} \ne y_i | y_i)
\end{array}
" alt=""/>



<p>This can be done by <strong>dividing the posterior probabilities </strong><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(y|\mathbf{x})" alt=""> by the <strong>prior probabilities</strong>. This is equivalent to <strong>subtracting the log prior</strong> of each class, i.e. <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi? \ln(P(y_i))" alt="">, from the model output <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?f_y(x)" alt=""> capturing the<strong> log probabilities</strong> <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z}%20\in%20\mathbb{R}^{L%20\times%201}" alt="">. </p>



<p>Defining <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\pi_i=P(y_i)" alt=""> as the probability of each class <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?i" alt="">, the adjusted logit for each class is, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i^{\text{adj}} = f_{y_i}(\mathbf{x}) - \tau\ln (\pi_i)" alt=""/>



<p>where, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\tau" alt=""> is a hyperparameter to tune. </p>



<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\tau=1" alt=""> : Theoretically aligns the model to <strong>minimize the balanced error rate</strong>; this is the typically chosen value.</li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?0 \lt \tau \lt 1" alt=""> : Provides a <strong>partial correction</strong>, useful for balancing overall accuracy and per-class recall in noisy datasets</li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\tau \gt 1" alt=""> : <strong>Over-corrects for minority classes</strong>, pushing decision boundaries further to prioritize rare class recall</li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\tau=0" alt="">: Disables the adjustment, reverting the model to standard cross entropy loss.</li>
</ul>



<p>The <strong>loss function with the adjusted logits</strong> is,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array} {lll}
L_{\text{logit adj} }
&#038; = &#038; -\ln \left( \frac{e^{f_{y_k}(\mathbf{x}) - \tau \ln(\pi_k)}}{\sum_{i=1}^L e^{f_{y_i}(\mathbf{x}) - \tau \ln \pi_i}} \right) \\
&#038; = &#038; -(f_{y_k}(\mathbf{x}) - \tau \ln (\pi_k)) + \ln \left( \sum_{i=1}^L e^{f_{y_i}(\mathbf{x}) - \tau \ln (\pi_i)} \right)
\end{array}
" alt="Logit Adjusted Loss Formula"/>



<p>Adding <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\tau \ln(\pi_i)" alt=""> to the logits inside the training loss (the opposite sign of the inference-time adjustment) enforces a <strong>class-dependent margin</strong>. This forces the model to &#8220;<strong>work harder</strong>&#8221; on minority classes by requiring a higher logit score for a rare class to achieve the same loss as a majority class.</p>
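<p>A minimal numerical sketch of the logit-adjusted cross entropy (NumPy; the priors and logits below are illustrative toy values), showing that with the same raw logit a rare true class incurs a larger loss than a frequent one:</p>

```python
import numpy as np

def logit_adjusted_nll(logits, true_idx, priors, tau=1.0):
    # add tau * ln(pi_i) to each logit inside the softmax (training-time loss)
    z = logits + tau * np.log(priors)
    z = z - z.max()                          # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[true_idx]

priors = np.array([0.90, 0.09, 0.01])        # heavily imbalanced class priors
logits = np.array([2.0, 0.0, 2.0])           # classes 0 and 2 score the same

loss_freq = logit_adjusted_nll(logits, 0, priors)   # frequent true class
loss_rare = logit_adjusted_nll(logits, 2, priors)   # rare true class
print(loss_freq, loss_rare)                  # loss_rare > loss_freq

# tau = 0 disables the adjustment and reduces to standard cross entropy
```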



<p>During <strong>inference</strong>, the adjustment is typically <strong>removed</strong> to use the raw learned likelihoods, resulting in a model that has learned to treat each class with equal importance regardless of its original frequency in the training set.</p>



<p>Example code with logit adjusted loss @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/loss_functions_for_class_imbalance/logit_adjusted_loss.ipynb"><strong>loss_functions_for_class_imbalance/logit_adjusted_loss.ipynb</strong></a></p>



<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/loss_functions_for_class_imbalance/logit_adjusted_loss.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<h2 class="wp-block-heading">Summary</h2>



<p>This article covers:</p>



<p><strong>Evolution:</strong> How we move beyond standard Cross Entropy to specialized loss functions like <strong>Focal Loss</strong> and <strong>Asymmetric Loss</strong> to handle extreme class imbalance.</p>



<p><strong>Math:</strong> Detailed derivations of the gradients for <strong>Focal Loss</strong> and a Bayesian decomposition of <strong>Logit Adjustment</strong> to show how models &#8220;cheat&#8221; using prior probabilities.</p>



<p><strong>Intuition:</strong> A look at the <strong>Effective Number of Samples</strong> framework, capturing the diminishing returns of adding more data to a majority class.</p>



<p><strong>Code:</strong> Complete Python and PyTorch implementations, including toy examples and notebooks comparing manual derivations against library-standard functions.</p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2026/03/05/loss-functions-for-handling-class-imbalance/">Loss functions for handling class imbalance</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://dsplog.com/2026/03/05/loss-functions-for-handling-class-imbalance/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Word Embeddings using neural networks</title>
		<link>https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/</link>
					<comments>https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#respond</comments>
		
		<dc:creator><![CDATA[Krishna Sankar]]></dc:creator>
		<pubDate>Sat, 27 Dec 2025 15:12:18 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[CBOW]]></category>
		<category><![CDATA[Embeddings]]></category>
		<category><![CDATA[GLoVE]]></category>
		<category><![CDATA[Hierarchical SoftMax]]></category>
		<category><![CDATA[NCE]]></category>
		<category><![CDATA[Negative Sampling]]></category>
		<category><![CDATA[Noise Contrastive Estimation]]></category>
		<guid isPermaLink="false">https://dsplog.com/?p=2252</guid>

					<description><![CDATA[<p>The post covers various neural network based word embedding models: starting from the Neural Probabilistic Language Model of Bengio et al 2003, then reduction of complexity using Hierarchical Softmax and Noise Contrastive Estimation, and further works like CBOW, GloVe, Skip Gram and Negative Sampling which enabled training on much larger datasets. </p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/">Word Embeddings using neural networks</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In machine learning, converting the input data <strong>(text, images, or time series)</strong> into a <strong>vector</strong> format (also known as <strong>embeddings</strong>) forms a key building block for enabling downstream tasks. This article explores in detail the architecture of some of the <strong>neural network</strong> based <strong>word embedding models</strong> in the literature.</p>



<p>Papers referred : </p>



<ol class="wp-block-list">
<li><em><a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf" target="_blank" rel="noreferrer noopener">Neural Probabilistic Language Model</a>,</em> Bengio et al 2003 
<ul class="wp-block-list">
<li>proposed a <strong>neural network </strong>architecture to <strong>jointly learn word feature vectors</strong> and the <strong>probability of words in a sequence</strong>.</li>
</ul>
</li>



<li><em><a href="https://proceedings.mlr.press/r5/morin05a/morin05a.pdf" target="_blank" rel="noreferrer noopener">Hierarchical Probabilistic Neural Network Language Model</a></em>, <em>Morin &amp; Bengio (2005)</em>
<ul class="wp-block-list">
<li>since the <strong>softmax</strong> layer used for finding the <strong>probability scales with the vocabulary</strong> size, proposed a <strong>hierarchical</strong> version of <strong>softmax</strong> to reduce the complexity from <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?O(|V|)" alt=""> to <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?O(\log_2|V|)" alt="">.</li>
</ul>
</li>



<li><em><a href="https://www.jmlr.org/papers/volume13/gutmann12a/gutmann12a.pdf" target="_blank" rel="noreferrer noopener">Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics</a></em>, Gutmann et al 2012,
<ul class="wp-block-list">
<li>Instead of directly estimating the data distribution, noise contrastive estimation estimates the probability of a sample being from the <strong>data versus from a known noise distribution</strong>. </li>



<li>This approach was extended to <strong>neural language models</strong> in the paper <em><a href="https://arxiv.org/abs/1206.6426" target="_blank" rel="noreferrer noopener">A fast and simple algorithm for training neural probabilistic language models</a> A Mnih et al, 2012</em>.</li>
</ul>
</li>



<li><em><a href="https://arxiv.org/abs/1301.3781" target="_blank" rel="noreferrer noopener">Efficient Estimation of Word Representations in Vector Space</a></em>, Mikolov et al 2013. 
<ul class="wp-block-list">
<li>proposed <strong>simpler neural architectures</strong> with the intuition that simpler models enable training on much larger corpus of data.</li>



<li><strong>Continuous Bag of Words</strong> (CBOW), to predict the center word given the context, and <strong>Skip Gram</strong>, to predict surrounding words given the center word, were introduced. </li>
</ul>
</li>



<li><a href="https://arxiv.org/abs/1310.4546" target="_blank" rel="noreferrer noopener">Distributed Representations of Words and Phrases and their Compositionality,</a> Mikolov et al 2013
<ul class="wp-block-list">
<li>the speedup provided by sub-sampling of frequent words also helps to improve the accuracy of the less-frequent words</li>



<li>a simplified variant of <strong>Noise Contrastive Estimation</strong> called <strong>Negative Sampling</strong></li>
</ul>
</li>



<li><a href="https://nlp.stanford.edu/pubs/glove.pdf" target="_blank" rel="noreferrer noopener">GloVe: Global Vectors for Word Representation,</a> Pennington et al 2014
<ul class="wp-block-list">
<li>proposes that the <strong>ratio of co-occurrence probabilities</strong> captures semantic information better than raw co-occurrence probabilities.</li>
</ul>
</li>
</ol>



<span id="more-2252"></span>



<p>In this post we will cover the key aspects proposed in the above papers with supporting python code. </p>


<div id="ez-toc-container" class="ez-toc-v2_0_82_2 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction">
<div class="ez-toc-title-container">
<p class="ez-toc-title" style="cursor:inherit">Table of Contents</p>
<span class="ez-toc-title-toggle"><a href="#" class="ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle" aria-label="Toggle Table of Content"><span class="ez-toc-js-icon-con"><span class=""><span class="eztoc-hide" style="display:none;">Toggle</span><span class="ez-toc-icon-toggle-span"><svg style="fill: #999;color:#999" xmlns="http://www.w3.org/2000/svg" class="list-377408" width="20px" height="20px" viewBox="0 0 24 24" fill="none"><path d="M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z" fill="currentColor"></path></svg><svg style="fill: #999;color:#999" class="arrow-unsorted-368013" xmlns="http://www.w3.org/2000/svg" width="10px" height="10px" viewBox="0 0 24 24" version="1.2" baseProfile="tiny"><path d="M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z"/></svg></span></span></span></a></span></div>
<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-1" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Neural_Probabilistic_Language_Model_Bengio_et_al_2003">Neural Probabilistic Language Model (Bengio et al, 2003)</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-2" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Neural_network_Architecture">Neural network Architecture</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-3" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Model">Model</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-4" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Python_code">Python code</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-5" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Hierarchical_Softmax_Morin_Bengio_2005">Hierarchical Softmax (Morin &amp; Bengio 2005)</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-6" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Derivation">Derivation</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-7" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Binary_Tree">Binary Tree</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-8" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Model-2">Model</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-9" 
href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Python_code_%E2%80%93_Naive_implementation_using_for-loops">Python code &#8211; Naive implementation using for-loops</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-10" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Python_code_%E2%80%93_Vectorized_implementation">Python code &#8211; Vectorized implementation</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-11" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Noise_contrastive_estimation_Gutmann_et_al_2012_Mnih_et_al_2012">Noise contrastive estimation (Gutmann et al 2012, Mnih et al 2012)</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-12" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Model-3">Model</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-13" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Noise_Distribution">Noise Distribution</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-14" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Python_code-2">Python code</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-15" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Word2Vec_papers_Mikolov_et_al_2013">Word2Vec papers (Mikolov et al, 2013)</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-16" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Continuous_Bag_of_Words_CBOW_Model">Continuous Bag of Words (CBOW) Model</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link 
ez-toc-heading-17" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Continuous_Skip-gram_Model">Continuous Skip-gram Model</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-18" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Negative_Sampling">Negative Sampling</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-19" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Python_code-3">Python code</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-20" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#GloVe_Embeddings_Penning_et_al_2014">GloVe Embeddings ( Penning et al 2014)</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-21" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Model-4">Model</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-22" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Python_code-4">Python code</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-23" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/#Summary">Summary</a></li></ul></nav></div>




<h2 class="wp-block-heading">Neural Probabilistic Language Model (Bengio et al, 2003) </h2>



<p>Reference : <a style="font-style: italic;" href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf" target="_blank" rel="noreferrer noopener">Neural Probabilistic Language Model</a>, Bengio et al 2003 </p>



<p>The probability of a sequence of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{T}" alt=""> words <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_1^T = \left(w_1, w_2, \ldots, w_T\right)" alt="" align="absmiddle"> can be expressed as a product of conditional probabilities, each conditioned on the sequence of previous words, i.e.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\hat{P}(w_1^T) = \prod_{t=1}^T \hat{P}\big(w_t \,|\, w_1^{t-1}\big)

" alt="">



<p>For example, consider a sequence of 4 words, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_1^T = (w_1, w_2, w_3, w_4) = (\text{the}, \text{cat}, \text{sat}, \text{down})" alt="">



<p>Then, by the chain rule of probability:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\hat{P}(w_1^4) = \begin{array}{l}
\hat{P}(w_1) \times \\
\hat{P}(w_2 \,|\, w_1) \times \\
\hat{P}(w_3 \,|\, w_1, w_2) \times \\
\hat{P}(w_4 \,|\, w_1, w_2, w_3)
\end{array}" alt="">



<p>Substituting the actual words:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\hat{P}(\text{the cat sat down}) = \begin{array}{l}
\hat{P}(\text{the}) \times \\
\hat{P}(\text{cat} \,|\, \text{the}) \times \\
\hat{P}(\text{sat} \,|\, \text{the}, \text{cat}) \times \\
\hat{P}(\text{down} \,|\, \text{the}, \text{cat}, \text{sat})
\end{array}" alt="">



<p>For a long word sequence, instead of conditioning on all previous words, it is common to approximate the probability by conditioning only on the last <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n-1" alt=""> words. That is:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1})" alt="">
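As a quick numeric sketch, the chain-rule product and an n-gram approximation can be compared directly. The probabilities below are made up purely for illustration:

```python
# Hypothetical conditional probabilities for P("the cat sat down"),
# first via the full chain rule, then via a bigram (n = 2) approximation.
full = {
    ("the",): 0.10,                       # P(the)
    ("cat", "the"): 0.20,                 # P(cat | the)
    ("sat", "the", "cat"): 0.30,          # P(sat | the, cat)
    ("down", "the", "cat", "sat"): 0.40,  # P(down | the, cat, sat)
}
p_chain = 1.0
for p in full.values():
    p_chain *= p

# Bigram approximation: condition only on the immediately preceding word.
bigram = {
    ("the",): 0.10,
    ("cat", "the"): 0.20,
    ("sat", "cat"): 0.25,
    ("down", "sat"): 0.35,
}
p_bigram = 1.0
for p in bigram.values():
    p_bigram *= p

print(p_chain)   # ≈ 0.0024  (0.1 * 0.2 * 0.3 * 0.4)
print(p_bigram)  # ≈ 0.00175 (0.1 * 0.2 * 0.25 * 0.35)
```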



<h3 class="wp-block-heading">Neural network Architecture</h3>



<p>The neural probabilistic language model builds on the <strong>n-gram approximation </strong>and proposes a way to</p>



<ul class="wp-block-list">
<li><strong>Jointly</strong> learn <strong>word feature vectors</strong> (each word in the vocabulary has a feature vector, a real-valued vector in <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbb{R}^m" alt="">) and</li>



<li>Learn the <strong>probability of the sequence of words</strong> in terms of sequence of word feature vectors</li>
</ul>



<p></p>



<p>The objective is to learn a model that predicts the probability of the next word given the previous <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n-1" alt=""> words, i.e.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?f(w_t, \ldots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})" alt="">



<p>The model is subject to the following constraints:</p>



<ul class="wp-block-list">
<li>For any sequence of words, the model outputs a <strong>non-zero</strong> probability, i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?f(\dots) &gt; 0" alt="" align="absmiddle"></li>



<li>The <strong>sum of probabilities</strong> over all possible next words <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_i" alt="" align="absmiddle"> in the vocabulary <strong>equals 1</strong>, i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\sum_{i=1}^{|V|} f(w_i, w_{t-1}, \ldots, w_{t-n+1}) = 1" alt="" align="absmiddle"></li>
</ul>



<p>where <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt="" align="absmiddle"> is the vocabulary size, and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?i" alt="" align="absmiddle"> indexes over all possible words in the vocabulary.</p>



<p><strong>Note : </strong></p>



<ul class="wp-block-list">
<li><strong>Non-zero probability:</strong> Ensures that the model never completely rules out any word as a possible next word, allowing it to adapt to all possible word sequences and avoid zero-probability issues during training.</li>
</ul>



<ul class="wp-block-list">
<li><strong>Probabilities sum to one:</strong> Guarantees that <em>f</em> defines a valid probability distribution over the vocabulary for the next word, so the total probability of all possible next words is exactly 1.</li>
</ul>



<p>The estimation of the function <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?f(w_t, \ldots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})" alt="" align="absmiddle"> is done as follows : </p>



<ul class="wp-block-list">
<li>for any word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_i" alt="" align="absmiddle"> in the vocabulary <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt="" align="absmiddle">, lookup a real vector  <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C(w_i) \in \mathbb{R}^m" alt="" align="absmiddle"></li>



<li>a function <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?g" alt="" align="absmiddle"> maps an input sequence of feature vectors for words in the context, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\big(C(w_{t-n+1}), \ldots, C(w_{t-1})\big)" alt="" align="absmiddle"> to a conditional probability distribution over words in <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?V" alt="" align="absmiddle"> for the next word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_t" alt="" align="absmiddle">
</li>
</ul>



<h3 class="wp-block-heading">Model</h3>



<p>The neural network model can be expressed as:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y = b + W \cdot x + U \cdot \tanh(d + H \cdot x)" alt="">



<p>where:</p>



<ul class="wp-block-list">
<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x" alt="" align="absmiddle"></strong> is the concatenated input feature vector of the previous <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n-1" alt="" align="absmiddle"> words, with dimension <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\big[(n-1) \cdot m \times 1\big]" alt="" align="absmiddle">.</li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?H" alt="" align="absmiddle"></strong> is a weight matrix of size <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\big[h \times (n-1) \cdot m\big]" alt="" align="absmiddle">, which transforms the input <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x" alt="" align="absmiddle"> into the hidden layer space.</li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?d" alt="" align="absmiddle"></strong> is a bias vector for the hidden layer, of dimension <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\big[h \times 1\big]" alt="" align="absmiddle">.</li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?U" alt="" align="absmiddle"></strong> is a weight matrix of size <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\big[|V| \times h\big]" alt="" align="absmiddle"> that maps the hidden layer activations to the output layer, where <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt="" align="absmiddle"> is the vocabulary size.</li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?W" alt="" align="absmiddle"></strong> is a weight matrix of size <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\big[|V| \times (n-1) \cdot m\big]" alt="" align="absmiddle"> that connects the input <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x" alt="" align="absmiddle"> directly to the output layer.</li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""></strong> is the bias vector for the output layer, of dimension <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\big[|V| \times 1\big]" alt="" align="absmiddle">.</li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt="" align="absmiddle"></strong> is the output vector containing the unnormalized log-probabilities (scores) for each word in the vocabulary, of dimension <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\big[|V|%20\times%201\big]" alt="" align="absmiddle">.</li>
</ul>
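The forward pass above can be sketched in NumPy to check the shapes; the sizes below are illustrative choices, not values from the paper:

```python
import numpy as np

# Illustrative sizes: |V| = 50 words, m = 8 features, n = 4 (context of 3), h = 16 hidden units.
V, m, n, h = 50, 8, 4, 16
rng = np.random.default_rng(0)

C = rng.normal(size=(V, m))            # word feature vectors C(w_i)
context = [3, 17, 42]                  # indices of the previous n-1 words
x = C[context].reshape(-1, 1)          # concatenated input, [(n-1)*m x 1]

H = rng.normal(size=(h, (n - 1) * m))  # input -> hidden
d = rng.normal(size=(h, 1))            # hidden-layer bias
U = rng.normal(size=(V, h))            # hidden -> output
W = rng.normal(size=(V, (n - 1) * m))  # direct input -> output connection
b = rng.normal(size=(V, 1))            # output bias

y = b + W @ x + U @ np.tanh(d + H @ x)               # unnormalized scores, [|V| x 1]
p = np.exp(y - y.max()) / np.exp(y - y.max()).sum()  # softmax (shifted for stability)

assert y.shape == (V, 1) and abs(p.sum() - 1.0) < 1e-9
```

The final softmax step anticipates the normalization discussed next: it turns the score vector into a valid probability distribution over the vocabulary.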



<p>Using <strong>softmax</strong> to convert the output vector <strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt=""></strong> into a <strong>probability distribution</strong> over the vocabulary,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?f(w_t, \ldots, w_{t-n+1}) = \hat{P}(w_t \mid w_{t-n+1}^{t-1})=\hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = a(w_t)= \frac{e^{y_{w_t}}}{\sum_i e^{y_{w_i}}}" alt="">



<p>Using the <strong>softmax</strong> layer ensures the constraints defined earlier:</p>



<ul class="wp-block-list">
<li>All probabilities are <strong>positive</strong>, satisfying the <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?f &gt; 0" alt=""> constraint.</li>



<li>The probabilities <strong>sum to one</strong> across all possible next words, satisfying the normalization constraint <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\sum_i f = 1" alt="">.</li>
</ul>



<p><strong>Loss function</strong></p>



<p>The <strong>maximum likelihood</strong> estimate for selecting the target word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_t" alt=""> over all the words in vocabulary <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt=""> is equivalent to <strong>minimising the negative log likelihood</strong>,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}_{\text{NLL}} = -\sum_{i=1}^{k} \log a\big(w_t^{(i)}\big)" alt=""/>



<p>where <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k" alt=""> is the number of word sequence training examples and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_t^{(i)}" alt=""> is the target word of the <em>i</em>-th example.</p>



<p>As can be seen in the section on <a href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Loss_for_multi-class_classification">Loss for Multiclass classification <sup>(refer post on Gradients for Multiclass classification with SoftMax)</sup></a>, the <strong>negative log likelihood</strong> is indeed the <strong>Categorical Cross Entropy Loss</strong>.</p>



<h3 class="wp-block-heading">Python code</h3>



<p>The training of a <strong>Neural Probabilistic Language Model</strong> in PyTorch involves a few key components, each corresponding to the mathematical elements discussed earlier:</p>



<ul class="wp-block-list">
<li><strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html" target="_blank" rel="noopener">torch.nn.Embedding</a></strong> — implements the <em><strong>word feature vector lookup</strong></em> function <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C(w_i)" alt="">. Each word index in the vocabulary maps to a dense vector in <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbb{R}^m" alt="">.</li>



<li><strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.Linear.html" target="_blank" rel="noopener">torch.nn.Linear</a></strong> — implements the<strong> fully-connected </strong>(dense) layers, corresponding to the transformation matrices <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?W" alt="">, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?U" alt="">.</li>



<li><a href="https://pytorch.org/docs/stable/generated/torch.nn.Parameter.html" target="_blank" rel="noopener"><code><strong>torch.nn.Parameter</strong></code></a> &#8211; the parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?d" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> are explicitly created.</li>



<li><strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.functional.log_softmax.html" target="_blank" rel="noopener">torch.nn.functional.log_softmax</a></strong> — applies the <em><strong>SoftMax</strong></em> in <strong>log space </strong>to obtain <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\log \hat{P}(w_t \mid w_{t-n+1}^{t-1})" alt=""> while maintaining numerical stability.</li>



<li><strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html" target="_blank" rel="noopener">torch.nn.NLLLoss</a></strong> — implements the<strong> <em>Negative Log Likelihood Loss</em></strong>, which directly minimises <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?-\log \hat{P}(w_t \mid w_{t-n+1}^{t-1})" alt=""> for the correct target word index.</li>
</ul>



<p>These functions, combined with an optimizer such as <a href="https://pytorch.org/docs/stable/generated/torch.optim.SGD.html" target="_blank" rel="noopener">torch.optim.SGD</a> or <a href="https://pytorch.org/docs/stable/generated/torch.optim.Adam.html" target="_blank" rel="noopener">torch.optim.Adam</a>, form the complete training loop for the model.</p>
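Putting these components together, a minimal training step might look like the sketch below. The sizes and the folding of the biases <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?d" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> into the Linear layers are simplifying assumptions, not the notebook's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NPLM(nn.Module):
    def __init__(self, vocab_size, m=8, n=4, h=16):
        super().__init__()
        self.C = nn.Embedding(vocab_size, m)           # word feature lookup C(w_i)
        self.H = nn.Linear((n - 1) * m, h)             # hidden layer (bias plays the role of d)
        self.U = nn.Linear(h, vocab_size, bias=False)  # hidden -> output
        self.W = nn.Linear((n - 1) * m, vocab_size)    # direct connection (bias plays the role of b)

    def forward(self, context):                        # context: [batch, n-1] word indices
        x = self.C(context).flatten(1)                 # concatenate the n-1 embeddings
        y = self.W(x) + self.U(torch.tanh(self.H(x)))  # unnormalized scores
        return F.log_softmax(y, dim=-1)                # log-probabilities over the vocabulary

model = NPLM(vocab_size=20)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

context = torch.tensor([[1, 2, 3]])                    # previous n-1 = 3 word indices
target = torch.tensor([4])                             # index of the next word
loss = F.nll_loss(model(context), target)              # negative log likelihood
loss.backward()
opt.step()
```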



<p><strong>code @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/word_embeddings/neural_probabilistic_language_model.ipynb" target="_blank" rel="noreferrer noopener">word_embeddings/neural_probabilistic_language_model.ipynb</a><br></strong>The training loop implementing the model on a simple toy example of 20 sentences shows that the model does a reasonable job of predicting the probability of the next word. </p>



<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/word_embeddings/neural_probabilistic_language_model.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<h2 class="wp-block-heading">Hierarchical Softmax (<em>Morin &amp; Bengio 2005</em>)</h2>



<p>Reference : <em><a href="https://proceedings.mlr.press/r5/morin05a/morin05a.pdf" target="_blank" rel="noreferrer noopener">Hierarchical Probabilistic Neural Network Language Model</a></em>, <em>Morin &amp; Bengio (2005)</em></p>



<p>As computing the probability of all tokens using SoftMax scales with the vocabulary size <strong><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt=""></strong>, in the paper <em>Hierarchical Probabilistic Neural Network Language Model</em>, <em>Morin &amp; Bengio (2005)</em> proposed an approach to reduce the complexity from <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?O(|V|)" alt=""> to <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?O(\log_2|V|)" alt="">.</p>



<p>Based on the intuition shared in the paper <em><a href="https://arxiv.org/abs/cs/0108006" target="_blank" rel="noreferrer noopener">Classes for Fast Maximum Entropy Training</a>, J Goodman 2001</em>, to compute <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(Y=y|X=x)">, instead of directly computing the probability of the target word <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y"> given the context words <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?x">, it is decomposed hierarchically as the product of:</p>



<ul class="wp-block-list">
<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(C=c(y)|X=x)">   : probability that <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y"> falls in class <strong><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?C=c(y)" alt=""></strong> given context <strong><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?X=x" alt=""></strong> </li>



<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(Y|C=c(y),X=x)">   : probability of word <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?Y=y" alt="">, given <strong><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt=""></strong> is in class <strong><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?c(y)" alt=""></strong> AND context <strong><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?X=x" alt=""></strong> </li>
</ul>



<p>i.e</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(Y=y|X=x)%20=%20P(C=c(y)|X=x)\,P(Y=y|C=c(y),X=x)" alt="P(Y=y|X=x)=P(C=c(y)|X=x)\cdot P(Y=y|C=c(y),X=x)">



<p>where, </p>



<ul class="wp-block-list">
<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt="y"></strong> : the target word we want to predict (e.g., “dog”).</li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x" alt="x"></strong> : the context (the surrounding words or features used to predict the next word, e.g., “the big”).</li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?c(y)" alt="c(y)"></strong> : the cluster/class that the target word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt="y"> belongs to (e.g., <em>dog</em> → Noun class).</li>
</ul>
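The factorization can be sketched with two softmaxes over hypothetical scores: one over classes, one over the words within each class. The product still forms a valid distribution over the full vocabulary:

```python
import numpy as np

# Hypothetical setup: 3 classes of 4 words each, so |V| = 12.
rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class_scores = rng.normal(size=3)       # one score per class, given context x
word_scores = rng.normal(size=(3, 4))   # scores for the words inside each class

p_class = softmax(class_scores)                                 # P(C = c | x)
p_word_in_class = np.array([softmax(s) for s in word_scores])   # P(y | c, x)

# Full distribution over the 12 words: each word's probability is the
# product P(C = c(y) | x) * P(y | C = c(y), x).
p_full = (p_class[:, None] * p_word_in_class).ravel()
assert abs(p_full.sum() - 1.0) < 1e-9   # still a valid probability distribution
```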



<p></p>



<h3 class="wp-block-heading">Derivation</h3>



<p>To derive the decomposition of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(Y=y|X=x)" align="absmiddle">, let us introduce a class variable <strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C=c(y)" alt="c(y)" align="absmiddle">, </strong>i.e. the word <strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt="y" align="absmiddle"></strong> belongs to the class <strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?c(y)" alt="c(y)" align="absmiddle"></strong>. </p>



<p>Then the probability <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(Y=y|X=x)"> can be written as a sum over the two cases: <strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt="y"></strong> is in the class, or it is not</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
P(Y=y\mid X=x)  
&#038;=&#038; P\big(Y=y,\,C=c(y)\mid X=x\big) + {P\big(Y=y,\,C\neq c(y)\mid X=x\big)}
\end{array}" alt="">



<p>Since each word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?Y=y" alt="y" align="absmiddle"> belongs to exactly one class, the term <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(Y=y,\,C\neq c(y)\mid X=x)" align="absmiddle"> is zero.</p>



<p>Hence,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
P(Y=y\mid X=x)  
&#038;=&#038; P\big(Y=y,\,C=c(y)\mid X=x\big) + \underbrace{P\big(Y=y,\,C\neq c(y)\mid X=x\big)}_{=\,0} \\
&#038;=&#038; P\big(Y=y,\,C=c(y)\mid X=x\big)
\end{array}" alt="">



<p></p>



<p>The term <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(Y=y,\,C = c(y)|X=x)"> can be expanded using the <strong>chain rule of conditional probabilities</strong> as follows:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
P\big(Y{=}y,\,C{=}c(y)\mid X{=}x\big)
&#038;= \frac{P\big(Y{=}y,\,C{=}c(y),\,X{=}x\big)}{P(X{=}x)} \\
&#038;= \frac{P\big(C{=}c(y),\,X{=}x\big)\cdot \;P\big(Y{=}y\mid C{=}c(y),\,X{=}x\big)}{P(X{=}x)} \\
&#038;= P\big(C{=}c(y),\,X{=}x\big)\frac{\;P\big(Y{=}y\mid C{=}c(y),\,X{=}x\big)}{P(X{=}x)} \\

&#038;= P\big(C{=}c(y)\mid X{=}x\big)\cdot P(X{=}x)\frac{\;P\big(Y{=}y\mid C{=}c(y),\,X{=}x\big)}{P(X{=}x)}\\
&#038;= P\big(C{=}c(y)\mid X{=}x\big)\;P\big(Y{=}y\mid C{=}c(y),\,X{=}x\big).
\end{array}" alt="">



<p>Summarizing,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(Y=y|X=x)%20=%20P(C=c(y)|X=x)\,P(Y=y|C=c(y),X=x)" alt="P(Y=y|X=x)=P(C=c(y)|X=x)P(Y=y|C=c(y),X=x)">



<p>Thus, computing <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(Y=y|X=x)"> reduces to first predicting the <strong>class <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?c(y)"> given the context <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?x"></strong> and then <strong>predicting the word <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y"> within that class conditioned on <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?x"></strong>.</p>



<p><strong>Complexity</strong></p>



<p>With this approach, instead of computing the probability over the entire vocabulary <strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt="|V|" align="absmiddle"></strong>, the computation is broken down into computing the probability over the classes, and then the probability over the words within the chosen class.<br><br>Taking the example shared in the paper, assume <strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt="|V|" align="absmiddle"></strong> is 10000 words, broken into 100 classes of 100 words each. Then the computations needed are:</p>



<ul class="wp-block-list">
<li>Finding probability over 100 classes</li>



<li>Finding probability over 100 words in the chosen class</li>
</ul>



<p>This reduces the computation to ~200 probability calculations instead of 10000 in the flat structure. Equivalently, the complexity reduces from <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt="|V|" align="absmiddle"> to <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\sqrt{|V|}" alt="\sqrt{|V|}" align="absmiddle"> operations.</p>



<p></p>



<h3 class="wp-block-heading">Binary Tree</h3>



<p>An alternative to class-based grouping is to arrange the vocabulary words as the <strong>leaves</strong> of a <strong>binary tree</strong>. Each internal node corresponds to a binary decision (left or right child), and each <strong>leaf</strong> corresponds to <strong>one word</strong> in the vocabulary. This hierarchical arrangement reduces the search complexity from <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?O(|V|)" alt=""> to <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?O(\log_2|V|)" alt="">, making it efficient for large vocabularies.</p>



<p>For constructing the binary tree, multiple approaches are possible :</p>



<ul class="wp-block-list">
<li><strong>Perfect binary tree</strong>
<ul class="wp-block-list">
<li><span style="background-color: rgba(0, 0, 0, 0.2); color: initial;">Requires the leaves to be a power of 2 (for eg, 2, 4, 8, 16 etc). </span></li>



<li><span style="color: initial;">If </span><strong style="color: initial;"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt="|V|"></strong><span style="color: initial; background-color: rgba(0, 0, 0, 0.2);"> is not a power of 2, some leaves will remain unused</span></li>



<li><span style="color: initial;">To reach every word, it takes the same path length i.e </span><img decoding="async" style="color: initial;" src="https://dsplog.com/cgi-bin/mimetex.cgi?\lceil \log_2{|V|} \rceil" alt="ceil(log2(|V|))" align="absmiddle"><span style="color: initial;">.</span></li>



<li>Average depth:<span style="color: initial;"> exactly </span><img decoding="async" style="color: initial;" src="https://dsplog.com/cgi-bin/mimetex.cgi?\log_2{|V|}" alt="log2(|V|)" align="absmiddle"><span style="color: initial;"> since all leaves are at the same level.</span></li>
</ul>
</li>



<li><strong>Balanced binary tree</strong>
<ul class="wp-block-list">
<li><span style="color: initial;">Tries to keep the left and right subtrees of equal size.</span></li>



<li><span style="color: initial;">When the vocabulary is not a power of 2, leaf depths differ by at most 1.</span></li>



<li><span style="color: initial;">No empty leaves; every leaf corresponds to a word.</span></li>



<li>Average depth:<span style="color: initial;"> approximately </span><img decoding="async" style="color: initial;" src="https://dsplog.com/cgi-bin/mimetex.cgi?\log_2{|V|}" alt="log2(|V|)" align="absmiddle"><span style="color: initial;">, often slightly smaller because some leaves are shallower.</span></li>
</ul>
</li>



<li><strong>Word frequency based tree</strong>
<ul class="wp-block-list">
<li><span style="color: initial;">Constructed using a Huffman coding structure, frequent words are placed closer to the root node while rare words are deeper.</span></li>



<li><span style="color: initial;">This minimises the average number of binary decisions required to reach a word.</span></li>



<li>Average depth:<span style="color: initial;"> depends on the frequency distribution; it is minimised and typically much smaller than </span><img decoding="async" style="color: initial;" src="https://dsplog.com/cgi-bin/mimetex.cgi?\log_2{|V|}" alt="log2(|V|)" align="absmiddle"><span style="color: initial;"> for natural language vocabularies (due to Zipf’s law).</span></li>
</ul>
</li>
</ul>
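The frequency-based construction can be sketched with Python's heapq over a hypothetical toy vocabulary, tracking only each word's depth in the resulting tree:

```python
import heapq

# Hypothetical word frequencies; frequent words should end on shorter paths.
freqs = {"the": 20, "cat": 9, "sat": 8, "on": 7, "mat": 3, "zebra": 1}

# Each heap entry: (subtree frequency, tiebreak id, {word: depth so far}).
heap = [(f, i, {w: 0}) for i, (w, f) in enumerate(freqs.items())]
heapq.heapify(heap)
count = len(heap)
while len(heap) > 1:
    # Repeatedly merge the two least-frequent subtrees (Huffman's algorithm);
    # every word inside a merged subtree moves one level deeper.
    f1, _, d1 = heapq.heappop(heap)
    f2, _, d2 = heapq.heappop(heap)
    merged = {w: d + 1 for w, d in {**d1, **d2}.items()}
    heapq.heappush(heap, (f1 + f2, count, merged))
    count += 1

depths = heap[0][2]
assert depths["the"] < depths["zebra"]  # the frequent word sits closer to the root
```

With these frequencies, "the" ends up at depth 1 while "zebra" sits at depth 4, matching the intuition that frequent words need fewer binary decisions.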



<p>For a toy corpus of 12 words, construction of the binary tree with the above approaches is shown below. <strong>code @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/word_embeddings/binary_tree.ipynb" target="_blank" rel="noreferrer noopener">word_embeddings/binary_tree.ipynb</a></strong><br></p>



<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/word_embeddings/binary_tree.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<h3 class="wp-block-heading">Model</h3>



<p>The probability of the next word given the context <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_{t-1},...,w_{t-n+1}" alt="probability formula"> can be written as:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
P(v|w_{t-1},...,w_{t-n+1})
&#038;=&#038;\prod_{j=1}^{p}P(b_j(v)|b_1(v),...,b_{j-1}(v),w_{t-1},...,w_{t-n+1})  \\
\end{array}
" alt="probability formula">



<p>where, </p>



<ul class="wp-block-list">
<li>each word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?v" alt="" align="absmiddle"> is represented by a <strong>bit vector</strong> <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?(b_1(v),b_2(v),...,b_p(v))" alt="(b1(v), b2(v), ..., bp(v))" align="absmiddle"></li>



<li>the path length <em>p</em> depends on the position of the word in the binary tree.</li>
</ul>



<p>For example, if each word is represented by 4 bits, then the probability of predicting the next word given the context <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_{t-1},...,w_{t-n+1}" align="absmiddle"> becomes:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll} P(v|w_{t-1},...,w_{t-n+1}) &#038;=&#038;P(b_1(v)|w_{t-1},...,w_{t-n+1}) \\ &#038;\times&#038;P(b_2(v)|b_1(v),w_{t-1},...,w_{t-n+1}) \\ &#038;\times&#038;P(b_3(v)|b_1(v),b_2(v),w_{t-1},...,w_{t-n+1}) \\ &#038;\times&#038;P(b_4(v)|b_1(v),b_2(v),b_3(v),w_{t-1},...,w_{t-n+1}) \end{array}" alt="4-bit chain rule example">



<p>Taking log on both sides converts the product into a summation:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll} \log P(v|w_{t-1},...,w_{t-n+1}) &#038;=&#038;\log P(b_1(v)|w_{t-1},...,w_{t-n+1}) \\ &#038;+&#038;\log P(b_2(v)|b_1(v),w_{t-1},...,w_{t-n+1}) \\ &#038;+&#038;\log P(b_3(v)|b_1(v),b_2(v),w_{t-1},...,w_{t-n+1}) \\ &#038;+&#038;\log P(b_4(v)|b_1(v),b_2(v),b_3(v),w_{t-1},...,w_{t-n+1}) \end{array}" alt="log form 4 bit">



<p>In general, for a word represented with <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?p"> bits:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\log P(v|w_{t-1},...,w_{t-n+1})=\sum_{j=1}^{p}\log P(b_j(v)|b_1(v),...,b_{j-1}(v),w_{t-1},...,w_{t-n+1})" alt="general log form">



<p>The bit vector corresponds to the <strong>path</strong> (left or right at each node) starting from the root node to the leaf node (the word). Each internal node outputs a probability of going right (<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b=1" alt="" align="absmiddle"> ). For the true label <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b \in \{0,1\}" alt="" align="absmiddle">, the binary cross-entropy loss at that node is:</p>



<p><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?L_{node}=-\big[b\cdot\log(p)+(1-b)\cdot\log(1-p)\big]" alt="binary cross entropy"></p>



<p>The total loss for predicting <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?v" alt=""> is the sum of the node losses along the path:</p>



<p><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?L(v)=-\sum_{j=1}^{p}\big[b_j(v)\log(p_j)+(1-b_j(v))\log(1-p_j)\big]" alt="word loss"></p>



<p>where </p>



<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?p_j" alt="pj" align="absmiddle"> is the predicted probability at the <em>j-th</em> node along the path. </li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b_j(v)" alt="bj" align="absmiddle"> denotes the binary choice (0 or 1) at the <em>j-th</em> internal node along the path to word <em><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?v" alt="" align="absmiddle"></em>.</li>
</ul>



<p>This is equivalent to the negative log-likelihood of the full word probability.</p>
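A small numeric sketch (with hypothetical node outputs) confirms this equivalence: the summed binary cross-entropies along the path equal the negative log of the full word probability:

```python
import numpy as np

# Hypothetical 4-bit path to a word v: true decisions b_j and the
# predicted P(go right) = p_j at each internal node along the path.
b = np.array([1, 0, 0, 1])          # left/right decisions to reach v
p = np.array([0.9, 0.2, 0.3, 0.8])  # node outputs P(b_j = 1)

# Sum of per-node binary cross-entropy losses along the path.
loss = -np.sum(b * np.log(p) + (1 - b) * np.log(1 - p))

# Probability of the word is the product of the chosen branch probabilities.
p_word = np.prod(np.where(b == 1, p, 1 - p))
assert abs(loss - (-np.log(p_word))) < 1e-12  # path loss == -log P(v | context)
```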



<p><strong>Binary Node Predictor</strong></p>



<p>Each internal node of the binary tree acts as a logistic classifier that decides left vs right, based on both the <em>(n−1)-gram context</em> and the <em>node embedding</em>. The conditional probability of taking the binary decision <em><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b=1" alt=""></em> at a node, given the past context, is modelled as:</p>



<p><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(b=1|\;node, w_{t-1},...,w_{t-n+1})=\sigma(\alpha_{node}+\beta' \tanh(c+Wx+UN_{node}))" alt="probability equation"></p>



<p>where, </p>



<ul class="wp-block-list">
<li>the sigmoid function is <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\sigma(y)=\frac{1}{1+e^{-y}}" alt="sigmoid function">.</li>



<li>for any word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_i" alt=""> in the vocabulary <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt="">, lookup a real vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C(w_i) \in \mathbb{R}^m" alt=""></li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x" alt="x"></strong> : concatenation of the previous (n−1) word embeddings, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x \in \mathbb{R}^{(n-1)\cdot m \times 1}" alt=""></li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_{node}" alt="alpha node"></strong> : bias term specific to the node, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_{node} \in \mathbb{R}" alt="scalar"></li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta" alt="beta"></strong> : projection vector applied after hidden nonlinearity, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta \in \mathbb{R}^{h \times 1}" alt="beta in R^h"></li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?c" alt="c"></strong> : bias for hidden layer, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?c \in \mathbb{R}^{h}" alt="c in R^{h \times 1}"></li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?W" alt="W"></strong> : weight matrix projecting context to hidden space, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?W \in \mathbb{R}^{h \times (n-1)\cdot m}" alt="W in R^(h x (n-1)m)"></li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?U" alt="U"></strong> : weight matrix projecting node embedding, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?U \in \mathbb{R}^{h \times d_{node}}" alt="U in R^(h x d_node)"></li>



<li><strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?N_{node}" alt="N node"></strong> : embedding vector for the current node, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?N_{node} \in \mathbb{R}^{d_{node} \times 1}" alt="N_node in R^(d_node)"></li>
</ul>



<p>The matrices <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?W" alt="">, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?U" alt="">, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta" alt=""> (projection vector) and the bias <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?c" alt=""> are <strong>common parameters</strong> shared across all nodes. </p>



<p>Each internal node has its own <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_{node}" alt=""> (scalar bias), and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?N_{node}" alt=""> (node embedding). These take care of the decision boundary at each internal node.</p>
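<p>As a concrete illustration, the node predictor above can be sketched in NumPy (all sizes and parameter values here are toy placeholders, not the notebook's actual settings):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sizes: embedding dim m, hidden dim h, node-embedding dim d_node,
# and n-1 = 3 context words (all hypothetical placeholders)
m, h, d_node, n_ctx = 5, 8, 4, 3

# parameters shared across all internal nodes
W = rng.normal(size=(h, n_ctx * m))    # projects context to hidden space
U = rng.normal(size=(h, d_node))       # projects node embedding to hidden space
beta = rng.normal(size=(h, 1))         # projection vector after the nonlinearity
c = rng.normal(size=(h, 1))            # hidden-layer bias

# parameters specific to one internal node
alpha_node = rng.normal()              # scalar bias alpha_node
N_node = rng.normal(size=(d_node, 1))  # node embedding N_node

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# x: concatenation of the (n-1) context word embeddings
x = rng.normal(size=(n_ctx * m, 1))

# P(b=1 | node, context) = sigma(alpha_node + beta' tanh(c + W x + U N_node))
hidden = np.tanh(c + W @ x + U @ N_node)               # shape (h, 1)
p_right = sigmoid(alpha_node + (beta.T @ hidden).item())
print(p_right)
```

<p>Training fits the shared parameters plus one (<em>alpha</em>, <em>N</em>) pair per internal node, matching the parameter list above.</p>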



<h3 class="wp-block-heading">Python code &#8211; Naive implementation using for-loops</h3>



<p>For the toy corpus, a naive implementation of hierarchical softmax using for-loops is provided.</p>



<ol class="wp-block-list">
<li>Defined a toy corpus of 20 sentences with a vocabulary of around 42 words. </li>



<li>Each training example is constructed from 3 context words and the corresponding target word </li>



<li>Constructed a balanced binary tree, which has 41 internal nodes</li>



<li>Model defined with the binary node predictor for each of the nodes
<ul class="wp-block-list">
<li>The parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?W" alt="">, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?U" alt="">, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta" alt=""> (projection vector) and the bias <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?c" alt=""> are shared across all nodes.</li>



<li>Each internal node has its own <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_{node}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?N_{node}" alt=""> parameter</li>
</ul>
</li>



<li>For each target word in the training example, the path to the leaf node via the tree is known</li>



<li>Using the binary decision at each path, the loss for each example is computed </li>



<li>The loss is backpropagated to find the parameters which minimize the loss</li>
</ol>



<p>Using the trained model, to find the probabilities of the top-k words given the context words,</p>



<ul class="wp-block-list">
<li>For each word in the vocabulary find the path to its leaf node</li>



<li>Starting from the root node, find the probability at each node</li>



<li>Based on the known decision (right vs left) at each node, use either  <em>p</em> for going right OR (1-p) for going left</li>



<li>The joint probability is the product of probabilities at each node. </li>



<li>For numerical stability (loss of accuracy when many small probabilities are multiplied), log of probabilities is found and then summed</li>



<li>Finally, the log probability is exponentiated to get back a probability (optional)</li>



<li>Then the top-k candidate words are printed </li>
</ul>
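<p>The probability walk described above can be sketched as follows (the per-node probabilities and tree paths are hypothetical, hand-picked so the leaf probabilities sum to one):</p>

```python
import numpy as np

# hypothetical per-node "go right" probabilities and known decisions
# (1 = right, 0 = left) along each word's path to its leaf
paths = {
    "cat":  ([0.9, 0.8], [1, 1]),
    "dog":  ([0.9, 0.8], [1, 0]),
    "bird": ([0.9],      [0]),
}

def word_log_prob(p_right, decisions):
    # use log p when going right, log (1 - p) when going left, then sum
    return sum(np.log(p) if d == 1 else np.log(1.0 - p)
               for p, d in zip(p_right, decisions))

log_probs = {w: word_log_prob(*spec) for w, spec in paths.items()}
probs = {w: float(np.exp(lp)) for w, lp in log_probs.items()}  # optional

k = 2
top_k = sorted(probs, key=probs.get, reverse=True)[:k]
print(top_k)  # the two most probable words
```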



<p><strong>code @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/word_embeddings/hierarchical_probabilistic_neural_language_model.ipynb" target="_blank" rel="noreferrer noopener">word_embeddings/hierarchical_probabilistic_neural_language_model.ipynb</a><br></strong></p>



<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/word_embeddings/hierarchical_probabilistic_neural_language_model.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<h3 class="wp-block-heading">Python code &#8211; Vectorized implementation</h3>



<p>As one can imagine, using for-loops significantly slows down the training. To form a vectorized implementation, the following was done.</p>



<ol class="wp-block-list">
<li><strong>Path preparation</strong>
<ul class="wp-block-list">
<li>Assign a unique id to every internal node in the binary tree.</li>



<li>Precompute for each word:
<ul class="wp-block-list">
<li>sequence of node-ids on the path to its leaf,</li>



<li>binary decision targets at each node.</li>
</ul>
</li>



<li>Pad all paths to a fixed length <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?p_{pad}"> using a dummy (UNK) node id. Build a mask to ignore padded positions.</li>



</ul>
</li>



<li><strong>Parameter lookup</strong>
<ul class="wp-block-list">
<li>Use <code>torch.nn.Embedding</code> to fetch node-specific parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_{nodes}"> (biases) and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?N_{nodes}"> (embeddings).</li>



<li>Shapes:
<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?N_{nodes} \in \mathbb{R}^{n_{batch}\times p_{pad}\times d_{node}}"></li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha_{nodes} \in \mathbb{R}^{n_{batch}\times p_{pad}}"></li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x \in \mathbb{R}^{n_{batch}\times (n-1)\cdot m}"></li>
</ul>
</li>
</ul>
</li>



<li><strong>Forward pass (vectorized)</strong>
<ul class="wp-block-list">
<li>Context projection: <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?Wx \in \mathbb{R}^{n_{batch}\times h}"> and bias <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?c \in \mathbb{R}^{h}">.</li>



<li>Node projection: <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?UN_{nodes} \in \mathbb{R}^{n_{batch}\times p_{pad}\times h}"> using <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?U \in \mathbb{R}^{h\times d_{node}}">.</li>



<li>Broadcast: <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?c+Wx+UN_{nodes} \in \mathbb{R}^{n_{batch}\times p_{pad}\times h}">.</li>



<li>Nonlinearity : <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?H=\tanh(c+Wx+UN_{nodes})">.</li>



<li>Projection: <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?logits=\alpha_{nodes}+(H\cdot\beta) \in \mathbb{R}^{n_{batch}\times p_{pad}}"> with <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta \in \mathbb{R}^{h\times 1}">.</li>



<li>Probabilities: <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?p=\sigma(logits) \in \mathbb{R}^{n_{batch}\times p_{pad}}">.</li>
</ul>
</li>



<li><strong>Loss and masking</strong>
<ul class="wp-block-list">
<li>Binary cross-entropy is computed between <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?p"> and the decision targets, with mask applied to ignore padded nodes:
<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?loss_i=\sum_{j}mask_{ij}\cdot BCE(p_{ij},target_{ij})"></li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?loss=\frac{1}{n_{batch}}\sum_i loss_i"></li>
</ul>
</li>
</ul>
</li>
</ol>



<p>Notes : </p>



<ul class="wp-block-list">
<li>Compute <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?Wx+c"> once per batch and broadcast, instead of recomputing per path.</li>



<li>UNK node parameters are trainable but excluded from loss using the mask.</li>
</ul>
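<p>A minimal NumPy sketch of this vectorized forward pass and masked loss (shapes follow the list above; the notebook's <code>torch.nn.Embedding</code> lookups are replaced by plain array indexing, and all values are random placeholders):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n_batch, p_pad, h, d_node, ctx_m = 4, 6, 8, 5, 12
n_nodes = 10  # including a dummy UNK node id

# parameters shared across all nodes
W = rng.normal(size=(h, ctx_m))
U = rng.normal(size=(h, d_node))
beta = rng.normal(size=(h,))
c = rng.normal(size=(h,))

# per-node parameter tables (an Embedding is just a lookup table)
alpha_table = rng.normal(size=(n_nodes,))
N_table = rng.normal(size=(n_nodes, d_node))

# padded node-id paths, binary decision targets and mask for a batch
node_ids = rng.integers(0, n_nodes, size=(n_batch, p_pad))
targets = rng.integers(0, 2, size=(n_batch, p_pad)).astype(float)
mask = (rng.random((n_batch, p_pad)) < 0.8).astype(float)

x = rng.normal(size=(n_batch, ctx_m))  # concatenated context embeddings

# forward pass, vectorized over batch and path positions
ctx = x @ W.T + c                             # (n_batch, h), once per batch
N_nodes = N_table[node_ids]                   # (n_batch, p_pad, d_node)
node_proj = N_nodes @ U.T                     # (n_batch, p_pad, h)
H = np.tanh(ctx[:, None, :] + node_proj)      # broadcast to (n_batch, p_pad, h)
logits = alpha_table[node_ids] + H @ beta     # (n_batch, p_pad)
p = 1.0 / (1.0 + np.exp(-logits))

# masked binary cross-entropy, averaged over the batch
eps = 1e-12
bce = -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
loss = float((mask * bce).sum() / n_batch)
print(p.shape, loss)
```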



<p><strong>code @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/word_embeddings/vectorized_hierarchical_probabilistic_neural_language_model.ipynb" target="_blank" rel="noreferrer noopener">word_embeddings/vectorized_hierarchical_probabilistic_neural_language_model.ipynb</a></strong></p>



<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/word_embeddings/vectorized_hierarchical_probabilistic_neural_language_model.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<p></p>



<h2 class="wp-block-heading">Noise contrastive estimation (Gutmann et al 2012, Mnih et al 2012)</h2>



<p>As computing the probability using SoftMax scales with the vocabulary size <strong><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt=""></strong>, in the paper <em><a href="https://www.jmlr.org/papers/volume13/gutmann12a/gutmann12a.pdf" target="_blank" rel="noreferrer noopener">Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics</a></em>, Gutmann et al, 2012 proposed an approach called <strong>Noise Contrastive Estimation (NCE)</strong>. Instead of directly estimating the data distribution, <strong>NCE estimates the probability of a sample being from the data versus from a known noise distribution</strong>. By learning the ratio between the data and noise distributions, and knowing the noise distribution, the data distribution can be inferred.</p>



<p>This approach was extended to <strong>neural language models</strong> in the paper <em><a href="https://arxiv.org/abs/1206.6426" target="_blank" rel="noreferrer noopener">A fast and simple algorithm for training neural probabilistic language models</a> A Mnih et al, 2012</em>.</p>



<h3 class="wp-block-heading">Model</h3>



<p>In the <strong>Neural Probabilistic Language Model</strong>, the probability of the target word is estimated using the <strong>SoftMax</strong> computation,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llll}
\hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \hat{P}(w_t \mid h) &#038; = &#038;  \frac{\exp({y_{w_t}})}{\sum_i \exp\left(y_{w_i}\right)} \\

&#038; = &#038; \frac{\exp({s_\theta(w_t,h)})}{\sum_i \exp\left(s_\theta(w_i,h)\right)}\\

\end{array}
" alt="">



<p>where, </p>



<ul class="wp-block-list">
<li>the context words are <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?h = \left[w_{t-1}, \ldots, w_{t-n+1}\right]" align="absmiddle"></li>



<li>the term in the numerator <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y_{w_t}=s_\theta(w_t,h)" align="absmiddle"> is estimated using a neural model with parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\theta" align="absmiddle"> for the target word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_t" align="absmiddle"> given the context words <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?h" align="absmiddle"></li>



<li>the term in the denominator is the sum over all words in the vocabulary, i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\sum_i e^{y_{w_i}}=\sum_i e^{s_\theta(w_i,h)}" align="absmiddle"></li>
</ul>



<p></p>



<p>Let us define a set <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W} = \left\{w_1, w_2, \cdots, w_{T_x+T_n}\right\}" alt="" align="absmiddle">, which is the union of two sets <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\left\{\mathbf{X},\ \mathbf{N}\right\}" align="absmiddle">, where</p>



<ul class="wp-block-list">
<li>the class label <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C_t=1 \quad : \quad w_t  \in \mathbf{X}" align="absmiddle"> when the word is from the true target word distribution</li>



<li>the class label <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C_t=0 \quad : \quad w_t \in \mathbf{N}" align="absmiddle"> when the word is NOT from the true target word distribution</li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?T_x"> is the number of <strong>true (data) samples</strong> in the batch (or dataset)</li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?T_n"> is the number of <strong>noise samples</strong> generated for contrast</li>
</ul>



<p>The formulation is, </p>



<ul class="wp-block-list">
<li>for each true (data) sample <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\left(h,w^+\right)" alt="" align="absmiddle"> will draw <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k" alt="" align="absmiddle"> noise samples <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\left(w_1^-, w_2^-, \cdots, w_k^-\right)" alt="" align="absmiddle">  from <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P_n" alt="" align="absmiddle"></li>



<li>the model has to learn a binary classification where the sample <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w" align="absmiddle"> is from the <strong>true distribution </strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C_t=1" align="absmiddle"> or from the <strong>noise distribution</strong> <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C_t=0" align="absmiddle"></li>
</ul>



<p>Further, instead of computing the <strong>denominator term for normalizing to probabilities</strong>, learn it as a context-dependent normalizing term, i.e.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llll}

\hat{P}(w_t \mid h) = P_{\theta}(w \mid h) = &#038; \frac{\exp({s_\theta(w,h)})}{\mathbf{Z}_\theta(h)}
\quad, \text{where } \mathbf{Z}_\theta(h)=\sum_i \exp\left(s_\theta(w_i,h)\right)
\\

\end{array}
" alt="">



<p>The probability of a sample coming from the true distribution given the context <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?h"> can be written as,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P_{\theta}(w|h)=P(w|C_t=1)" alt="">



<p>Similarly, the probability of the word under the noise distribution is,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P_{n}(w)=P(w|C_t=0)" alt="">



<p>Further, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lr} 
P(C_t=1)&#038; = &#038;\frac{T_x}{T_x+T_n} \\
P(C_t=0)&#038; = &#038;\frac{T_n}{T_x+T_n}\\
k = \frac{P(C_t=0)}{P(C_t=1)} &#038; = \frac{T_n}{T_x}
\end{array}">



<p>Since <strong>NCE</strong> reframes the problem as a <strong>binary classification task</strong> (distinguishing true data from noise), the class labels <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?C_t"> are modelled as <strong>independent Bernoulli variables</strong>. Consequently, the <strong>conditional log likelihood</strong> is the sum of the binary cross-entropy terms: </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llll}

\mathbf{L} (\theta) &#038; = &#038; \sum_{t=1}^{T_x+T_n} \left[C_t\log ( P(C_t=1|w)) + (1-C_t)\log(P(C_t=0|w)) \right]\\
&#038; = &#038; \sum_{t=1}^{T_x} \log (P(C_t=1|w)) + \sum_{t=1}^{T_n} \log(P(C_t=0|w)) \\
\end{array}, 
\\
\\
" alt="">



<p>For a single true target word  <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w"> and its corresponding <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?k"> noise samples </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llll}
\mathcal{L}_t (\theta)
&#038; = &#038; \log (P(C_t=1|w^+)) + \sum_{j=1}^{k} \log(P(C_t=0|w_j^-)) \\
\end{array} 
\\
\\
" alt="">



<p>To evaluate this loss, we need to express <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(C_t=1|w)"> in terms of the model parameters. Using Bayes&#8217; rule, the probability that the class is true <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?C_t=1"> given the context <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?h"> and target word <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?w"> is,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
P(C_t=1|w) &#038; = &#038; \frac{P(w|C_t=1)P(C_t=1)}{P(w)}\\
&#038; = &#038; \frac{P(w|C_t=1)P(C_t=1)}{P(w|C_t=1)P(C_t=1) + P(w|C_t=0)P(C_t=0)}\\
&#038; = &#038; \frac{P(w|C_t=1)}{P(w|C_t=1) + P(w|C_t=0)\frac{P(C_t=0)}{P(C_t=1)}}\\
&#038; = &#038; \frac{P_{\theta}(w|h)}{P_{\theta}(w|h) + kP_n(w)}\\
&#038; = &#038; \frac{1}{1 + \frac{k\cdot P_n(w)}{P_{\theta}(w|h)}}\\
&#038; = &#038;  \frac{1}{1 + \frac{k P_n(w)}{\left(\frac{\exp({s_\theta(w,h)})}{\mathbf{Z}_\theta(h)}\right)}} \\
&#038; = &#038;  \frac{1}{1 + \frac{k P_n(w)\mathbf{Z}_\theta(h)}{\exp({s_\theta(w,h)})}} \\
\end{array} 


\quad \text{,where } P_{\theta}(w \mid h) =  \frac{\exp({s_\theta(w,h)})}{\mathbf{Z}_\theta(h)}
" alt="">



<p>This gives the general probability for any word <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?w">. When calculating the loss for a true target word, we substitute <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?w=w^+"> to get the positive sample probability.</p>



<p>Converting to the sigmoid form used in logistic regression, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
P(C_t=1|w^+)     
&#038; = &#038;  \frac{1}{1 + \frac{k P_n(w^+)\mathbf{Z}_\theta(h)}{\exp({s_\theta(w^+,h)})}} 
&#038; = &#038; \frac{1}{1+\frac{1}{z^+}} 
&#038; = &#038;  \frac{1}{1+\exp(\log(1/z^+))}  
&#038; = &#038; \frac{1}{1+\exp(-\log(z^+))} \\
&#038; = &#038; \sigma( \log(z^+))\\
\end{array}

\\
\\
\text{where, }\\
\begin{array}{llll}

z^+ &#038; = &#038;\frac{\exp({s_\theta(w^+,h)}) }{k P_n(w^+)\mathbf{Z}_\theta(h)} \\

\log(z^+) &#038; = &#038; \log\left(\frac{\exp({s_\theta(w^+,h)}) }{k P_n(w^+)\mathbf{Z}_\theta(h)} \right) \\
&#038; = &#038; \log \left(\exp({s_\theta(w^+,h)} ) \right) - \log(k P_n(w^+)) - \log (\mathbf{Z}_\theta(h)) \\
&#038; = &#038; \left[ s_\theta(w^+,h) - \log(k P_n(w^+)) - \log (\mathbf{Z}_\theta(h))\right]

\end{array} 

" alt="">



<p>Similarly, for a target word from the noise distribution, the probability that the class is noise <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?C_t=0"> given the context <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?h"> and target word <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?w"> is,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
P(C_t=0|w) &#038; = &#038; \frac{P(w|C_t=0)P(C_t=0)}{P(w)}\\\\
&#038; = &#038; \frac{P(w|C_t=0)P(C_t=0)}{P(w|C_t=1)P(C_t=1) + P(w|C_t=0)P(C_t=0)}\\\\
&#038; = &#038; \frac{P(w|C_t=0)P(C_t=0)/P(C_t=1)}{P(w|C_t=1) + P(w|C_t=0)\dfrac{P(C_t=0)}{P(C_t=1)}}\\\\
&#038; = &#038; \frac{k\,P_n(w)}{P_{\theta}(w|h) + k\,P_n(w)}\\\\
&#038; = &#038; \frac{1}{1 + \dfrac{P_{\theta}(w|h)}{k\,P_n(w)}}\\\\
&#038; = &#038; \frac{1}{1 + \dfrac{\exp\!\big(s_\theta(w,h)\big)}{k\,P_n(w)\,\mathbf{Z}_\theta(h)}}\\\\
\end{array}
\quad,\ \text{where } P_{\theta}(w \mid h) =  \frac{\exp\!\big(s_\theta(w,h)\big)}{\mathbf{Z}_\theta(h)}
" alt="">



<p>This gives the general probability for any word <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?w">. When calculating the loss for a noise target word, we substitute <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?w=w^-"> to get the noise sample probability.</p>



<p>Converting to the sigmoid form used in logistic regression, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
P(C_t=0|w^-) &#038; = &#038; \frac{1}{1 + \dfrac{\exp\!\big(s_\theta(w^-,h)\big)}{k\,P_n(w^-)\,\mathbf{Z}_\theta(h)}}
&#038; = &#038; \frac{1}{1+\frac{1}{z^-}} 
&#038; = &#038; \frac{1}{1+\exp(\log(1/z^-))}  
&#038; = &#038; \frac{1}{1+\exp(-\log(z^-))} \\
&#038; = &#038;  \sigma( \log(z^-))
\end{array}

\\
\text{where, } \\
\begin{array}{llll}

z^- &#038; = &#038; \frac{k P_n(w^-)\mathbf{Z}_\theta(h)}{\exp({s_\theta(w^-,h)}) }  \\
\log(z^-) &#038; = &#038; \log\left(\frac{k P_n(w^-)\mathbf{Z}_\theta(h)}{\exp({s_\theta(w^-,h)}) } \right) \\
&#038; = &#038;   \log(k P_n(w^-)) + \log (\mathbf{Z}_\theta(h)) -\log \left(\exp({s_\theta(w^-,h)})\right) \\
&#038; = &#038;  -\left[s_\theta(w^-,h)  -\log(k P_n(w^-)) - \log (\mathbf{Z}_\theta(h))  \right]

\end{array}

" alt="">



<p></p>



<p>Plugging in the terms to the log likelihood for a single example,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llll}

\mathcal{L}_t (\theta) 
&#038; = &#038; \log (P(C_t=1|w^+)) + \sum_{j=1}^{k} \log(P(C_t=0|w_j^-)) \\
&#038; = &#038; \log (\sigma( \log(z^+))) + \sum_{j=1}^{k} \log(\sigma( \log(z^-))) \\
&#038; = &#038;\log (\sigma(\left[ s_\theta(w^+,h) - \log(k P_n(w^+)) - \log (\mathbf{Z}_\theta(h))\right])) +\\
&#038;&#038; \sum_{j=1}^{k} \log(\sigma(-\left[s_\theta(w^-,h)  -\log(k P_n(w^-)) - \log (\mathbf{Z}_\theta(h))  \right] )) \\
 

\end{array}, 
\\
\\
" alt="">



<p></p>



<p>To obtain the objective function for the entire dataset, sum the log-likelihoods over all true training examples <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?t=1 \dots T_x">. For each training example at step <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?t">, we have a specific context <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?h_t">, a true target word <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_t">, and a fresh set of <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?k"> noise samples.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llll}

\mathbf{L}(\theta) &#038;=&#038;
\sum_{t=1}^{T_x} \mathcal{L}_t (\theta) \\
\ &#038; = &#038; \sum_{t=1}^{T_x} \left[ \log (P(C_t=1|w_t)) + \sum_{j=1}^{k} \log(P(C_t=0|w_{t,j}^-)) \right] 


\end{array} " alt=""/>



<p>The final loss function <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{J}(\theta)" align="absmiddle"> that we minimize is the negative log-likelihood over the full dataset:</p>



<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llll} \mathbf{J}(\theta) &amp; = &amp; - \sum_{t=1}^{T_x} \left( \log \left[ \sigma \left( s_\theta(w_t,h_t) - \log(k P_n(w_t)) - \log (\mathbf{Z}_\theta(h_t)) \right) \right] + \sum_{j=1}^{k} \log \left[ \sigma \left( \log(k P_n(w_{t,j}^-)) + \log (\mathbf{Z}_\theta(h_t)) - s_\theta(w_{t,j}^-,h_t) \right) \right] \right) \end{array} " alt=""/></figure>



<p><strong>Note : <br></strong>In the paper, <em>A fast and simple algorithm for training neural probabilistic language models, A Mnih et al, 2012</em>, the authors mention that fixing the context-dependent normalizing factor to 1, i.e. <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{Z}_\theta(h) \approx 1" alt="">, did not affect the performance on downstream tasks.</p>
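<p>Using that approximation <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{Z}_\theta(h) \approx 1" alt="">, the per-example NCE loss above can be sketched as follows (the scores and noise probabilities are made-up toy values, not outputs of a trained model):</p>

```python
import numpy as np

def log_sigmoid(y):
    # numerically stable log(sigma(y))
    return -np.logaddexp(0.0, -y)

def nce_loss(s_pos, s_neg, logp_noise_pos, logp_noise_neg, k):
    """Per-example NCE loss with Z_theta(h) approximated as 1.

    s_pos        : model score s_theta(w+, h) of the true word
    s_neg        : array of k scores s_theta(w-, h) of the noise words
    logp_noise_* : log P_n(.) of the words under the noise distribution
    """
    log_k = np.log(k)
    # positive term: log sigma(s_theta(w+,h) - log(k P_n(w+)))
    pos = log_sigmoid(s_pos - (log_k + logp_noise_pos))
    # negative terms: log sigma(log(k P_n(w-)) - s_theta(w-,h))
    neg = log_sigmoid((log_k + logp_noise_neg) - s_neg).sum()
    return -(pos + neg)   # negative log-likelihood, to be minimized

# toy example: one true word contrasted against k = 3 noise words
loss = nce_loss(
    s_pos=2.0,
    s_neg=np.array([-1.0, 0.5, -2.0]),
    logp_noise_pos=np.log(0.01),
    logp_noise_neg=np.log(np.array([0.05, 0.02, 0.1])),
    k=3,
)
print(loss)
```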



<p></p>



<h3 class="wp-block-heading">Noise Distribution</h3>



<p>The noise distribution <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P_n(w)" alt="" align="absmiddle"> is typically chosen proportional to the unigram frequency of words in the corpus:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P_n(w)=\frac{\text{count}(w)}{\sum_v \text{count}(v)}">



<p>Often a smoothed unigram distribution improves results:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P_n(w)\propto \text{count}(w)^{3/4}">
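<p>A quick sketch of the effect of the 3/4 smoothing on a toy set of counts:</p>

```python
import numpy as np

counts = np.array([100.0, 50.0, 10.0, 5.0, 1.0])  # toy unigram counts

p_unigram = counts / counts.sum()

p_smoothed = counts ** 0.75
p_smoothed /= p_smoothed.sum()

# the 3/4 power flattens the distribution: frequent words are sampled
# a bit less often, rare words a bit more often
print(p_unigram.round(3))
print(p_smoothed.round(3))
```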



<p></p>



<h3 class="wp-block-heading">Python code</h3>



<p>For the toy vocabulary, the code for the Neural Probabilistic Language Model, Bengio et al, with the SoftMax head replaced with Noise Contrastive Estimation (NCE) is provided.</p>



<p><strong>The code @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/word_embeddings/nplm_with_noise_contrastive_estimation.ipynb" target="_blank" rel="noreferrer noopener">word_embeddings/nplm_with_noise_contrastive_estimation.ipynb</a></strong></p>



<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/word_embeddings/nplm_with_noise_contrastive_estimation.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<p></p>



<h2 class="wp-block-heading">Word2Vec papers (Mikolov et al, 2013)</h2>



<p>In the paper,&nbsp;<em><a href="https://arxiv.org/abs/1301.3781" target="_blank" rel="noreferrer noopener">Efficient Estimation of Word Representations in Vector Space</a></em>, Mikolov et al, 2013 proposed architectures to reduce the computational complexity of learning word embeddings, with the intuition that simpler models enable training on much larger corpora.</p>



<p>Two architectures were proposed. </p>



<h3 class="wp-block-heading">Continuous Bag of Words (CBOW) Model</h3>



<p>When comparing with <em>Neural Probabilistic Language Model,</em> Bengio et al 2003, the following simplifications are proposed. </p>



<ul class="wp-block-list">
<li><strong>order of context words is ignored</strong>
<ul class="wp-block-list">
<li>instead of concatenating the embeddings of the previous words, averaging the word embeddings of the surrounding words is proposed</li>



<li>this approach is called &#8220;<strong>bag-of-words</strong>&#8221; as the order is not taken into consideration</li>
</ul>
</li>



<li><strong>no non linear hidden layer</strong>
<ul class="wp-block-list">
<li>the model uses a shared projection layer</li>
</ul>
</li>
</ul>



<p>Additionally, the context in this model includes future words too. </p>



<p><strong>Equations</strong></p>



<p>The neural network output is :</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z=Ux">



<p>where:</p>



<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?U%20\in%20\mathbb{R}^{|V|%20\times%20m}" alt="">&nbsp;is the<strong>&nbsp;output weight matrix</strong>, mapping from hidden dimension&nbsp;<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt="">&nbsp;to vocabulary size&nbsp;<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?|V|" alt=""></li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x%20\in%20\mathbb{R}^{m%20\times%201}" alt="x in R^{m x 1}">&nbsp;is the&nbsp;<strong>averaged context embeddings</strong>.</li>
</ul>



<p>The averaged context embedding vector&nbsp;<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x%20\in%20\mathbb{R}^{m%20\times%201}" alt="x in R^{m x 1}">&nbsp;is computed as:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x%20=%20\frac{1}{2n}%20\sum_{-n%20\le%20i%20\le%20n,%20i%20\ne%200}%20C({w_i})" alt="x averaging formula"/>



<p>where,</p>



<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C%20\in%20\mathbb{R}^{|V|%20\times%20m}" alt="C in R^{|V| x m}"> is the&nbsp;<strong>input embedding matrix</strong></li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C(w_i)%20\in%20\mathbb{R}^{m%20\times%201}" alt=""> is the&nbsp;<strong>embedding of the&nbsp;<em>i</em>-th context word</strong>, and</li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n" alt="">&nbsp;is the number of words to the&nbsp;<strong>left or right&nbsp;</strong>of the target word, giving a total context size of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?2n" alt="2n"></li>
</ul>



<p>The&nbsp;<strong>probability distribution</strong>&nbsp;over the vocabulary is obtained using the&nbsp;<strong>softmax</strong>&nbsp;function:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a(w_t)%20=%20\frac{e^{z_{w_t}}}{\sum_{i=1}^{|V|}%20e^{z_{w_i}}}" alt="softmax"/>



<p>where:</p>



<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a(w_t)" alt="">&nbsp;is the predicted probability of word&nbsp;<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_t" alt="">&nbsp;being the target word.</li>
</ul>



<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_{w_t}" alt="">&nbsp;is the score for word&nbsp;<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_t" alt="">&nbsp;from the output layer.</li>



<li>The denominator sums the exponentiated scores over all vocabulary entries.</li>
</ul>
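<p>The CBOW forward pass above can be sketched as follows (toy sizes, random parameters, and hypothetical word ids):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
V, m, n = 10, 4, 2            # toy vocab size, embedding dim, window half-size

C = rng.normal(size=(V, m))   # input embedding matrix C
U = rng.normal(size=(V, m))   # output weight matrix U

context_ids = np.array([1, 3, 7, 2])   # 2n surrounding word ids (order ignored)
assert len(context_ids) == 2 * n

x = C[context_ids].mean(axis=0)        # averaged context embedding, shape (m,)
z = U @ x                              # scores over the vocabulary, shape (V,)

a = np.exp(z - z.max())                # softmax (shifted for stability)
a /= a.sum()
print(a.argmax())                      # index of the most probable target word
```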



<p></p>



<h3 class="wp-block-heading">Continuous Skip-gram Model </h3>



<p>The Skip-gram model tries to predict <strong>context words given the current target word</strong>. The main idea is that each word is trained to predict the words surrounding it within a context window of size <em>n</em>.</p>



<ul class="wp-block-list">
<li><strong>Input:</strong> one-hot encoding of the target word <em>w<sub>t</sub></em></li>



<li><strong>Output:</strong> probability distribution over vocabulary for each context word</li>



<li><strong>No non-linear hidden layer:</strong> uses a shared projection matrix (linear)</li>
</ul>



<p></p>



<p>Given a target word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_t" alt="w_t">, the model tries to predict each surrounding context word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_{t+i}" alt="w_{t+i}"> for <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?-n \le i \le n, i \ne 0" alt="">. The training goal is to maximize the probability of all context words around each target word:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\begin{array}{ll}
J&#038;=&#038;\frac{1}{T} \sum_{t=1}^{T} \sum_{-n \le i \le n, i \ne 0} \log p(w_{t+i} | w_t) \\
\text{where, } \\
&#038;T&#038;\text{ total number of words in the training corpus} 
\end{array}


">



<p></p>



<p><strong>Equations</strong></p>



<p>The output score, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i = Ux ">



<p>where, </p>



<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x=C({w_t})" alt="" align="absmiddle">, where <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x%20\in%20\mathbb{R}^{m%20\times%201}" alt="x in R^{m x 1}"> is the embedding vector for the word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_t" alt="w_t">  </li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C%20\in%20\mathbb{R}^{|V|%20\times%20m}" alt="C in R^{|V| x m}"> is the&nbsp;<strong>input embedding matrix</strong></li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?U \in \mathbb{R}^{|V| \times m}"> is the output embedding matrix and</li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i \in \mathbb{R}^{|V| \times 1}"> </li>
</ul>



<p>The scores <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i"> are computed for each context word, and the probability of all the context words is maximized.  </p>



<p>In both CBOW and Skip-Gram, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?C" alt="C"> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?U" alt="U"> are trainable. After training, either one (or their average) is used as the word embedding.</p>



<p>A naive way to compute the probability is the softmax function,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
p(w_o|w_t)=\frac{\exp(z_{w_o})}{\sum_{k=1}^{|V|}\exp(z_k)}
">



<p>where, </p>



<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_o"> is the output word </li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_{w_o}"> is the score for the output word</li>



<li>the denominator is the normalizing constant over the vocabulary</li>
</ul>
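<p>A minimal numerical sketch of this naive softmax (the vocabulary size, embedding dimension and word indices below are arbitrary placeholders, not from the post):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
V, m = 10, 4                  # toy vocabulary size and embedding dimension
C = rng.normal(size=(V, m))   # input embedding matrix C
U = rng.normal(size=(V, m))   # output embedding matrix U

w_t, w_o = 3, 7               # target and output word indices
x = C[w_t]                    # x = C(w_t), shape (m,)
z = U @ x                     # scores over the whole vocabulary, shape (V,)

# Naive softmax: the denominator sums over all |V| words
p = np.exp(z) / np.exp(z).sum()
print(p[w_o])                 # p(w_o | w_t)
```

<p>The O(|V|) cost of the denominator on every prediction is exactly what hierarchical softmax is designed to avoid.</p>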



<p>Because the denominator requires a sum over the entire vocabulary, the paper proposes hierarchical softmax as an efficient way to compute the probability.</p>



<h3 class="wp-block-heading">Negative Sampling</h3>



<p>In the paper, <a href="https://arxiv.org/abs/1310.4546" target="_blank" rel="noreferrer noopener">Distributed Representations of Words and Phrases and their Compositionality</a>, Mikolov et al 2013 introduced two concepts:</p>



<ul class="wp-block-list">
<li>subsampling of frequent words, which speeds up training and improves the accuracy of the representations of less-frequent words</li>



<li>a simplified variant of Noise Contrastive Estimation (NCE) called Negative Sampling</li>
</ul>



<p></p>



<p>The key intuition behind Negative Sampling is that the noise contrastive loss contains terms whose only role is to normalize the scores into proper probabilities. However, learning word embeddings does not require calibrated probabilities, so the terms</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\log(k P_n(w^+)),\quad \log (\mathbf{Z}_\theta(h)), \quad \log(k P_n(w^-))" alt="">



<p>can be ignored. <br></p>



<p>With this simplification, the negative sampling loss is,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llll}


\mathbf{L}_{ns} (\theta) &#038; = &#038;\log (\sigma(\left[ s_\theta(w^+,h) \right])) +
 \sum_{t=1}^{k} \log(\sigma(-\left[s_\theta(w^-,h)   \right] )) \\

\end{array}
\\
\\
" alt="">
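<p>A small numerical sketch of this loss for one (context, positive word) pair with k sampled negative words (the vectors and k below are arbitrary, for illustration only):</p>

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
m = 4
h = rng.normal(size=m)              # context / hidden vector h
u_pos = rng.normal(size=m)          # output vector of the positive word w+
U_neg = rng.normal(size=(5, m))     # output vectors of k=5 sampled negative words w-

# L_ns = log sigma(s(w+, h)) + sum_k log sigma(-s(w-, h)), with s(w, h) = u_w . h
loss = np.log(sigmoid(u_pos @ h)) + np.sum(np.log(sigmoid(-(U_neg @ h))))
print(loss)
```

<p>Note that this objective is maximized during training (equivalently, its negation is minimized), pushing the positive score up and the negative scores down.</p>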



<p></p>



<h3 class="wp-block-heading">Python code</h3>



<p>For a toy vocabulary, word vectors are found with:</p>



<p>a) Continuous Bag of Words (CBOW) with Negative Sampling</p>



<p><strong>The code @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/word_embeddings/cbow_negative_sampling%20copy.ipynb" target="_blank" rel="noreferrer noopener">word_embeddings/cbow_negative_sampling copy.ipynb</a></strong></p>



<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/word_embeddings/cbow_negative_sampling%20copy.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<p></p>



<p>b) Skip-Gram with Negative Sampling </p>



<p><strong>The code @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/word_embeddings/skip_gram_negative_sampling.ipynb" target="_blank" rel="noreferrer noopener">word_embeddings/skip_gram_negative_sampling.ipynb</a></strong></p>



<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/word_embeddings/skip_gram_negative_sampling.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<p><br></p>



<h2 class="wp-block-heading">GloVe Embeddings (Pennington et al 2014)</h2>



<p>In the paper <a href="https://nlp.stanford.edu/pubs/glove.pdf" target="_blank" rel="noreferrer noopener">GloVe: Global Vectors for Word Representation</a>, Pennington et al 2014, the authors propose that the <strong>ratio of co-occurrence probabilities</strong> captures semantic information better than raw co-occurrence probabilities.</p>



<p>Let, </p>



<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{X}"> be the matrix of word co-occurrence counts</li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?X_{ij}" align="absmiddle"> be the number of times word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?j"> occurs in the context of word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?i">.</li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?X_{i}=\sum_kX_{ik}" align="absmiddle"> be the number of times any word appears in the context of word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?i"></li>



<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P_{ij}=P(j|i)=\frac{X_{ij}}{{X_i}}"> be the probability that word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?j"> occurs in the context of word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?i">.</li>
</ul>



<p>The authors show that, on a 6-billion-token corpus,</p>



<img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\begin{array}{llllll}
P(solid|ice)&#038;=&#038;1.9 \times 10^{-4} &#038;
P(gas|ice)&#038;=&#038;6.6\times 10^{-5} &#038;
P(water|ice)&#038;=&#038;3.0\times 10^{-3} \\

P(solid|steam)&#038;=&#038;2.2 \times 10^{-5} &#038;
P(gas|steam)&#038;=&#038;7.8\times 10^{-4} &#038;
P(water|steam)&#038;=&#038;2.2\times 10^{-3} \\

\end{array}
">



<p>Taking the ratios of the co-occurrence probabilities,</p>



<img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\begin{array}{llllll}
\frac{P(solid|ice)}{P(solid|steam)} &#038;=&#038; 8.9 \\ \\ 
\frac{P(gas|ice)}{P(gas|steam)} &#038;=&#038;8.5\times 10^{-2} \\ \\
\frac{P(water|ice)}{P(water|steam)}&#038;=&#038;1.36 \\


\end{array}
">



<p>The ratios indicate that, </p>



<ul class="wp-block-list">
<li><em>solid</em> is much more strongly related to <em>ice</em> than to <em>steam</em></li>



<li><em>gas</em> is far less likely to co-occur with <em>ice</em> than with <em>steam</em></li>



<li><em>water</em> is related to both <em>ice</em> and <em>steam</em> in similar proportions</li>
</ul>
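<p>These ratios can be reproduced directly from the probabilities quoted above (small differences from the paper's rounded ratios arise because the paper computes them from unrounded counts):</p>

```python
# Co-occurrence probabilities quoted above (GloVe paper, 6-billion-token corpus)
P_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3}
P_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3}

ratios = {w: P_ice[w] / P_steam[w] for w in P_ice}
print(ratios)  # solid >> 1, gas << 1, water near 1
```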



<h3 class="wp-block-heading">Model </h3>



<p>To capture this <strong>ratio</strong> relationship in a vector space, the authors search for a function <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?F"> that satisfies:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? F(w_i, w_j, \tilde{w}_k) = \frac{P(k|i)}{P(k|j)} = \frac{P_{ik}}{P_{jk}}" alt=""/>



<p>where </p>



<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k"> is context word</li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?i,j"> are target words</li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w \in \mathbb{R}^d"> are the word vectors and </li>



<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\tilde{w} \in \mathbb{R}^d"> are separate context vectors.</li>
</ul>



<p>The authors enforce that the relationship should be <strong>linear (vector difference)</strong> and the result should be a <strong>scalar (dot product)</strong>, leading to:</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? F((w_i - w_j)^T \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}" alt=""/>



<p></p>



<p>To satisfy this, the authors propose choosing the function <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?F=\exp()">, so that the exponential of the dot product of a vector difference becomes a ratio of probabilities,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\begin{array}{lll}
F((w_i - w_j)^T \tilde{w}_k)  &#038; = &#038; \frac{F(w_i^T \tilde{w}_k)}{F(w_j^T \tilde{w}_k)} \\
&#038;=&#038;\frac{P_{ik}}{P_{jk}}
\end{array}


" alt=""/>



<p>With this choice, a <strong>single word-context pair</strong> estimates the co-occurrence probability,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? F(w_i^T \tilde{w}_k) = \exp((w_i^T \tilde{w}_k)) = P_{ik} = \frac{X_{ik}}{X_i}" alt=""/>



<p>Taking logarithm, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\begin{array}{lll}
w_i^T \tilde{w}_k = \log(P_{ik}) &#038; = &#038; \log(X_{ik}) - \log(X_i)
\end{array}

"alt=""/>



<p><strong>Note :</strong></p>



<p>The model capturing the relation between two words <strong>should not change</strong> even if the <strong>words are swapped</strong>. Even though the co-occurrence counts are identical (<img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?X_{ik} = X_{ki}">), because the total counts of words are not equal (<img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?X_{i} \ne X_{k}">) , the <strong>conditional probability is not symmetric</strong> (<img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?P_{ik} \ne P_{ki}">).</p>



<p>The above equation is not symmetric if we swap the target word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?i"> and the context word <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k">, because the row-dependent term <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\log(X_{i})"> breaks the symmetry.</p>



<p>To make it symmetric, the authors absorb <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\log(X_{i})"> into a learnable bias term <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?b_{i}"> and then add a corresponding bias <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\tilde{b}_{k}"> for the context word. This makes the model fully symmetric, i.e.</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\begin{array}{lll}
w_i^T \tilde{w}_k + b_i + \tilde{b}_k =  \log(X_{ik}) 
\end{array}

"alt=""/>



<p>The loss function then becomes,</p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\begin{array}{lll}
J = \sum_{i=1}^{V}\sum_{k=1}^{V}\left[w^T_i \tilde{w}_k + b_i + \tilde{b}_k -  \log(X_{ik}) \right]^2
\end{array}

"alt=""/>



<p>The key aspect of this simplification is that <strong>training on pairs of words</strong> to minimize the above loss will <strong>indirectly</strong> ensure that the <strong>dot product</strong> of the <strong>vector difference of target words</strong> with a <strong>context word</strong> vector recovers the <strong>ratio of co-occurrence probabilities</strong>.</p>



<p><strong>Weighted Least Squares</strong></p>



<p>The above loss function weighs all <strong>co-occurrences equally</strong>. The authors noted that rare co-occurrences are noisy and that around 75&#8211;95% of the entries in the co-occurrence matrix are zero, so they proposed adding a <strong>weighting function</strong> to the least-squares loss above.</p>



<p>The weighting function is chosen to obey the following:</p>



<ol class="wp-block-list">
<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?f(0)=0"> (to handle the zero co-occurance counts)</li>



<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?f(x)"> should be non-decreasing so that rare co-occurrences are given less weight</li>



<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?f(x)"> should be relatively small for large values of <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?x"> so that frequent co-occurrences are not over-weighted</li>
</ol>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
f(x) =  \left\{
\begin{array}{ll}
(x/x_{max})^{\alpha} &#038; \text{if } x \le x_{max}  \\
1 &#038; \text{otherwise}
\end{array}
\right.
" alt=""/>



<p>The parameters <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha = 3/4"> and <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?x_{max} = 100"> are chosen empirically.</p>
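<p>The weighting function is straightforward to implement (a sketch using the paper's values &#945; = 3/4 and x<sub>max</sub> = 100; at x = x<sub>max</sub> both branches give 1, so the boundary convention does not matter):</p>

```python
def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: f(0)=0, non-decreasing, capped at 1 for frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

assert f(0) == 0.0                    # zero co-occurrence counts drop out of the loss
assert f(10) < f(50) < f(100) == 1.0  # non-decreasing and capped at 1
```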



<p>Then the Weighted Least Squares loss function becomes, </p>



<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\begin{array}{lll}
J = \sum_{i=1}^{V}\sum_{k=1}^{V}f(X_{ik})\left[w^T_i \tilde{w}_k + b_i + \tilde{b}_k -  \log(X_{ik}) \right]^2
\end{array}

"alt=""/>
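<p>Putting the pieces together, this weighted loss can be evaluated on a toy co-occurrence matrix (the sizes and random initializations below are arbitrary placeholders, not from the post):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3
X = rng.integers(0, 50, size=(V, V)).astype(float)  # toy co-occurrence counts

W  = rng.normal(scale=0.1, size=(V, d))  # word vectors w_i
Wt = rng.normal(scale=0.1, size=(V, d))  # context vectors w~_k
b, bt = np.zeros(V), np.zeros(V)         # biases b_i, b~_k

def f(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

# J = sum_{i,k} f(X_ik) [ w_i . w~_k + b_i + b~_k - log(X_ik) ]^2
J = 0.0
for i in range(V):
    for k in range(V):
        if X[i, k] > 0:                  # f(0) = 0, so zero counts contribute nothing
            e = W[i] @ Wt[k] + b[i] + bt[k] - np.log(X[i, k])
            J += f(X[i, k]) * e ** 2
print(J)  # non-negative scalar, minimized by gradient descent during training
```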



<h3 class="wp-block-heading">Python code</h3>



<p><strong>The code @ <a href="https://github.com/dsplog/dsplog.com/blob/main/code/word_embeddings/glove_word_embedding.ipynb" target="_blank" rel="noreferrer noopener">word_embeddings/glove_word_embedding.ipynb</a></strong></p>



<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/word_embeddings/glove_word_embedding.ipynb?flush_cache=false" width="100%" height="600"></iframe>



<h2 class="wp-block-heading">Summary</h2>



<p>This article covers</p>



<p><strong>Evolution:</strong> How we moved from Bengio&#8217;s NPLM (2003) to efficient architectures like Word2Vec and GloVe.</p>



<p><strong>Math:</strong> Detailed derivations of <strong>Hierarchical Softmax</strong> (using binary trees) and <strong>Noise Contrastive Estimation</strong> (differentiating data from noise).</p>



<p><strong>Architectures:</strong> A deep look at CBOW, Skip-Gram, and the intuition behind Negative Sampling.</p>



<p><strong>Code:</strong> Complete Python implementations for every model discussed, including vectorized implementations for efficiency.</p>



<p><strong>Acknowledgment </strong> </p>



<p>In addition to the primary papers listed above, this post draws inspiration from the excellent overview in the post <strong><a href="https://lilianweng.github.io/posts/2017-10-15-word-embedding/" target="_blank" rel="noreferrer noopener">Learning word embedding</a>, Weng, Lilian 2017</strong>. Credit also goes to the recent Large Language Models <a href="https://gemini.google.com/" target="_blank" rel="noopener"><strong>Gemini</strong></a> and <a href="https://chatgpt.com/" target="_blank" rel="noopener"><strong>ChatGPT</strong></a> which helped to bounce thoughts and refine the drafts.</p>



<p></p>



<p></p>



<p></p>



<p></p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/">Word Embeddings using neural networks</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://dsplog.com/2025/12/27/word-embeddings-using-neural-networks/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Gradients for multi class classification with Softmax</title>
		<link>https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/</link>
					<comments>https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#comments</comments>
		
		<dc:creator><![CDATA[Krishna Sankar]]></dc:creator>
		<pubDate>Sun, 22 Jun 2025 08:53:17 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Categorical Cross Entropy]]></category>
		<category><![CDATA[Linear]]></category>
		<category><![CDATA[Maximum Likelihood]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[Softmax]]></category>
		<guid isPermaLink="false">https://dsplog.com/?p=2231</guid>

					<description><![CDATA[<p>In a multi class classification problem, the output (also called the label or class) takes a finite set of discrete values . In this post, system model for a multi class classification with a linear layer followed by softmax layer is defined. The softmax function transforms the output of a linear layer into values lying &#8230; <a href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/" class="more-link">Continue reading<span class="screen-reader-text"> "Gradients for multi class classification with Softmax"</span></a></p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/">Gradients for multi class classification with Softmax</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></description>
										<content:encoded><![CDATA[</p>
<p>In a <strong>multi class classification</strong> problem, the output (also called the <strong>label</strong> or <strong>class</strong>) takes a finite set of <strong>discrete values</strong> <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y \in \{1, 2, \ldots, k\}" alt="">. In this post, <strong>system model</strong> for a multi class classification with a <strong>linear layer</strong> followed by <strong>softmax layer</strong> is defined. The <strong>softmax function</strong> transforms the output of a <strong>linear layer </strong> into values lying between 0 and 1, which can be interpreted as <strong>probability scores</strong>.</p>
</p>
<p>Next, the <strong>loss function</strong> using <strong>categorical cross entropy</strong> is explained and the <strong>gradients</strong> for the model parameters are derived using the <strong>chain rule</strong>. The <strong>analytically computed gradients</strong> are then compared with those obtained from the deep learning framework <strong>PyTorch</strong>. Finally, we implement a <strong>training loop</strong> using <strong>gradient descent</strong> for a toy multi-class classification task with <strong>2D Gaussian-distributed data</strong>.</p>
</p>
<p><span id="more-2231"></span></p>
<p><div id="ez-toc-container" class="ez-toc-v2_0_82_2 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction">
<div class="ez-toc-title-container">
<p class="ez-toc-title" style="cursor:inherit">Table of Contents</p>
<span class="ez-toc-title-toggle"><a href="#" class="ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle" aria-label="Toggle Table of Content"><span class="ez-toc-js-icon-con"><span class=""><span class="eztoc-hide" style="display:none;">Toggle</span><span class="ez-toc-icon-toggle-span"><svg style="fill: #999;color:#999" xmlns="http://www.w3.org/2000/svg" class="list-377408" width="20px" height="20px" viewBox="0 0 24 24" fill="none"><path d="M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z" fill="currentColor"></path></svg><svg style="fill: #999;color:#999" class="arrow-unsorted-368013" xmlns="http://www.w3.org/2000/svg" width="10px" height="10px" viewBox="0 0 24 24" version="1.2" baseProfile="tiny"><path d="M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z"/></svg></span></span></span></a></span></div>
<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-1" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Model">Model</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-2" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Linear_Layer">Linear Layer</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-3" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Softmax_layer">Softmax layer</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-4" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Derivatives">Derivatives</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-5" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Derivative_of_Softmax_layer">Derivative of Softmax layer</a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-6" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Derivative_for_case_ij">Derivative for case i=j</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-7" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Derivative_for_case_i_%E2%89%A0_j">Derivative for case i ≠ j</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-8" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Final_output_matrix_form">Final output (matrix form)</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link 
ez-toc-heading-9" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Code">Code</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-10" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Derivative_of_Linear_layer">Derivative of Linear layer</a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-11" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Derivative_of_Weights">Derivative of Weights</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-12" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Derivative_of_Bias">Derivative of Bias</a></li></ul></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-13" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Loss_for_multi-class_classification">Loss for multi-class classification</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-14" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Maximum_Likelihood_Estimate">Maximum Likelihood Estimate</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-15" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Connecting_to_Cross_Entropy_Loss">Connecting to Cross Entropy Loss</a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-16" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Cross_Entropy">Cross Entropy</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-17" 
href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Cross_Entropy_Loss">Cross Entropy Loss</a></li></ul></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-18" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Gradients_with_Cross_Entropy_CE_Loss">Gradients with Cross Entropy (CE) Loss</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-19" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Gradients_of_Loss_with_respect_to_Probability_dLda">Gradients of Loss with respect to Probability (dL/da)</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-20" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Gradients_of_Loss_with_respect_to_z_dLdz">Gradients of Loss with respect to z (dL/dz)</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-21" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Gradients_of_loss_with_respect_to_Parameters_dLdW_dLdb">Gradients of loss with respect to Parameters (dL/dW, dL/db)</a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-22" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Gradients_of_Weights_W">Gradients of Weights (W)</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-23" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Gradients_of_bias_b">Gradients of bias (b)</a></li></ul></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-24" 
href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Vectorised_operations_with_m_examples">Vectorised operations (with m examples)</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-25" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Code_gradients">Code (gradients)</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-26" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Training_for_toy_example_with_3_classes">Training for toy example with 3 classes</a></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-27" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Training_with_Label_Smoothing">Training with Label Smoothing</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-28" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Training_code">Training code</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-29" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/#Summary">Summary</a></li></ul></nav></div>

</p>
</p>
<p>As always, contents from <a href="https://cs229.stanford.edu/main_notes.pdf" target="_blank" rel="noreferrer noopener">CS229 Lecture Notes</a> and the notations used in the course <a href="https://youtube.com/playlist?list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&amp;si=_3Xs1piNOfQ847gd" target="_blank" rel="noopener">Deep Learning Specialization C1W1L01</a> from Dr Andrew Ng form key references.</p>
</p>
<h2 class="wp-block-heading">Model</h2>
</p>
<p>Let us take an example of estimating <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y \in \{1, 2, \ldots, k\}" alt=""> based on feature vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt=""> having <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n" alt=""> features i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x} \in \mathbb{R}^{n \times 1}" alt=""> and there are <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> examples.</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{tabular}{|c|c|c|c|c|}
\hline&amp;{example^1}&amp;{example^2}&amp;\ldots&amp;{example^m}\\
\hline{feature_1}&amp;{x_1}^{1}&amp;{x_1}^{2}&amp;\ldots&amp;{x_1}^{m}\\
\hline{feature_2}&amp;{x_2}^{1}&amp;{x_2}^{2}&amp;\ldots&amp;{x_2}^{m}\\
\hline&amp;\vdots&amp;\vdots&amp;\ldots&amp;\vdots\\
\hline{feature_n}&amp;{x_n}^{1}&amp;{x_n}^{2}&amp;\ldots&amp;{x_n}^{m}&amp;\\
\hline{output}&amp;{y}^{1}&amp;{y}^{2}&amp;\ldots&amp;{y}^{m}\end{tabular}
" alt="">
</p>
</p>
<h3 class="wp-block-heading">Linear Layer</h3>
</p>
<p>Let us <strong>assume</strong> that the variable <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt=""> is defined as a <strong>linear function</strong> of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt="">. For a single training example, this can be written as:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{align*}
\mathbf{z} &#038;= \mathbf{W} \mathbf{x} + \mathbf{b} \\
\end{align*}" alt="">
</p>
</p>
<p>where, </p>
</p>
<ul class="wp-block-list">
<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z} = 
\begin{bmatrix}
z_1 \\
z_2 \\
\vdots \\
z_k
\end{bmatrix}" alt=""> is the vector of size <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k \times 1" alt=""> i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z} \in \mathbb{R}^{k \times 1}" alt="">,</li>
</p>
<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W} = 
\begin{bmatrix}
w_{11} &amp; w_{12} &amp; \cdots &amp; w_{1n} \\
w_{21} &amp; w_{22} &amp; \cdots &amp; w_{2n} \\
\vdots &amp; \vdots &amp; \ddots &amp; \vdots \\
w_{k1} &amp; w_{k2} &amp; \cdots &amp; w_{kn}
\end{bmatrix}" alt=""> is the <strong>parameter matrix</strong> of size <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k \times n" alt=""> i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W} \in \mathbb{R}^{k \times n}" alt="">,  </li>
</p>
<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x} = 
\begin{bmatrix}
x_1 \\
x_2 \\
\vdots \\
x_n
\end{bmatrix} " alt=""> is the feature vector of size <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n \times 1" alt=""> i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x} \in \mathbb{R}^{n \times 1}" alt=""> and</li>
</p>
<li><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{b} = 
\begin{bmatrix}
b_1 \\
b_2 \\
\vdots \\
b_k
\end{bmatrix} " alt=""> is the <strong>parameter vector</strong> of size <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k \times 1" alt=""> i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{b} \in \mathbb{R}^{k \times 1}" alt=""></li>
</ul>
</p>
<p>Note : </p>
</p>
<p>This is the definition of the <strong>Linear layer</strong> in <strong>PyTorch</strong><sup> <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html" target="_blank" rel="noopener">(refer entry on Linear layer)</a></sup>. This is alternatively called a <strong>Dense</strong> <strong>Layer</strong> in <strong>Tensorflow</strong> <a href="https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense" target="_blank" rel="noopener"><sup>(refer entry on Dense)</sup></a> and a <strong>Fully Connected layer</strong> in the deep learning literature.</p>
</p>
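<p>The shapes above can be checked with a few lines of NumPy (the values of n, k and the random entries are placeholders for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3                      # n features, k classes
W = rng.normal(size=(k, n))      # parameter matrix W, shape k x n
b = rng.normal(size=(k, 1))      # bias vector b, shape k x 1
x = rng.normal(size=(n, 1))      # one training example x, shape n x 1

z = W @ x + b                    # linear layer output z = Wx + b
assert z.shape == (k, 1)
```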
<h3 class="wp-block-heading">Softmax layer</h3>
</p>
<p>To map the <strong>real valued vector</strong> <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z} \in \mathbb{R}^{k \times 1}" alt=""> to a <strong>probability vector</strong> <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a} \in \mathbb{R}^{k \times 1}" alt=""> with the elements of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a}" alt=""> <strong>summing up to 1</strong>, we use the <strong>softmax function</strong> <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? \mathbf{S}(\cdot)" alt=""> <sup><a href="https://en.wikipedia.org/wiki/Softmax_function" target="_blank" rel="noopener">(refer wiki entry on SoftMax)</a></sup>. The <strong>softmax</strong> function is defined as,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a}=\mathbf{S}(\mathbf{z}) = 
\begin{bmatrix}
\frac{e^{z_1}}{\sum_{j=1}^{k} e^{z_j}} \\
\frac{e^{z_2}}{\sum_{j=1}^{k} e^{z_j}} \\
\vdots \\
\frac{e^{z_k}}{\sum_{j=1}^{k} e^{z_j}}
\end{bmatrix}, 


\quad \in \mathbb{R}^{k \times 1}" alt="">
</p>
</p>
<p>Equivalently, this can be written as,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a} = 

\begin{bmatrix}
a_1 \\
a_2 \\
\vdots \\
a_k
\end{bmatrix}, \quad \in \mathbb{R}^{k \times 1}" alt=""/>
</p>
</p>
<p>where, each <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?a_i" alt=""> represents the normalized exponential of the corresponding <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i" alt="">. This ensures that</p>
</p>
<ul class="wp-block-list">
<li>each element <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?a_i" alt=""> lies in the <strong>range [0,1]</strong>, i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?0 \leq a_i \leq 1" alt=""></li>
</p>
<li>the <strong>sum</strong> of all the elements <strong>adds up to 1</strong>, i.e. <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\sum_{i=1}^k a_i = 1" alt="">.</li>
</ul>
</p>
<p>This makes <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a}" alt=""> interpretable as a probability distribution over the <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k" alt=""> classes.</p>
</p>
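<p>As a quick illustration, the softmax mapping above can be sketched in a few lines of NumPy (a minimal sketch, not code from this post's notebook; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result):</p>

```python
import numpy as np

def softmax(z):
    """Map a real-valued vector z to a probability vector a."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
# each element of a lies in [0, 1] and the elements sum to 1
```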
<h2 class="wp-block-heading">Derivatives</h2>
</p>
<h3 class="wp-block-heading">Derivative of Softmax layer</h3>
</p>
<p>To compute the <strong>derivative of the softmax output</strong> <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a} = \mathbf{S}(\mathbf{z})" alt=""> with respect to its input <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z}" alt="">, we need to find the <strong>Jacobian matrix</strong> <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathbf{a}}{\partial \mathbf{z}} \in \mathbb{R}^{k \times k}" alt="">. The <strong>Jacobian</strong> contains<strong> all partial derivatives </strong>of each output component <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?{a_i}" alt=""> with respect to each input component <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_j " alt="">. </p>
</p>
<p>To cover all entries of the Jacobian, let us consider two cases, i.e.</p>
</p>
<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}\frac{\partial a_i}{\partial z_j}&amp;  \text{where, } i = j \\  \end{array}" alt=""></li>
</p>
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}\frac{\partial a_i}{\partial z_j}&amp;  \text{where, } i \ne j \\  \end{array}" alt=""> </li>
</ul>
</p>
<h4 class="wp-block-heading">Derivative for case <em>i=j</em></h4>
</p>
<p>Using the product rule of derivatives, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{align*}

\frac{\partial a_i}{\partial z_i} 
&#038;= \frac{\partial}{\partial z_i} \left( e^{z_i} \cdot \frac{1}{\sum_{j=1}^{k} e^{z_j}} \right) \\
&#038;= \frac{\partial }{\partial z_i}\left(e^{z_i}\right) \cdot \frac{1}{\sum_{j=1}^{k} e^{z_j}} 
+ e^{z_i} \cdot \frac{\partial}{\partial z_i} \left( \frac{1}{\sum_{j=1}^{k} e^{z_j}} \right) \\
&#038;= e^{z_i} \cdot \frac{1}{\sum_{j=1}^{k} e^{z_j}} 
- e^{z_i} \cdot \frac{e^{z_i}}{\left( \sum_{j=1}^{k} e^{z_j} \right)^2} \\
&#038;= \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} 
- \frac{(e^{z_i})^2}{\left( \sum_{j=1}^{k} e^{z_j} \right)^2} \\
&#038;= a_i - a_i^2 \\
&#038;= a_i(1 - a_i)
\end{align*}

" alt=""/>
</p>
</p>
<h4 class="wp-block-heading">Derivative for case <em>i ≠ j</em></h4>
</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{align*}
\frac{\partial a_i}{\partial z_j} 
&#038;= \frac{\partial}{\partial z_j} \left( e^{z_i} \cdot \frac{1}{\sum_{l=1}^{k} e^{z_l}} \right) \\
&#038;= \frac{\partial }{\partial z_j} \left(e^{z_i}\right)\cdot \frac{1}{\sum_{l=1}^{k} e^{z_l}} 
+ e^{z_i} \cdot \frac{\partial}{\partial z_j} \left( \frac{1}{\sum_{l=1}^{k} e^{z_l}} \right) \\
&#038;= 0 \cdot \frac{1}{\sum_{l=1}^{k} e^{z_l}} 
- e^{z_i} \cdot \frac{e^{z_j}}{\left( \sum_{l=1}^{k} e^{z_l} \right)^2} \\
&#038;= - \frac{e^{z_i} \cdot e^{z_j}}{\left( \sum_{l=1}^{k} e^{z_l} \right)^2} \\
&#038;= - a_i \cdot a_j
\end{align*}

" alt=""/>
</p>
</p>
<h4 class="wp-block-heading">Final output (matrix form)</h4>
</p>
<p>Based on the above derivations, the derivative is defined as :</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial a_i}{\partial z_j} = \begin{cases} a_i (1 - a_i), &amp; \text{if } i = j \\ - a_i a_j, &amp; \text{if } i \ne j \end{cases}" alt=""/>
</p>
</p>
<p>In matrix form, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathbf{a}}{\partial \mathbf{z}} =
\begin{bmatrix}
a_1(1 - a_1) &#038; -a_1 a_2 &#038; -a_1 a_3 &#038; \cdots &#038; -a_1 a_k \\
-a_2 a_1 &#038; a_2(1 - a_2) &#038; -a_2 a_3 &#038; \cdots &#038; -a_2 a_k \\
-a_3 a_1 &#038; -a_3 a_2 &#038; a_3(1 - a_3) &#038; \cdots &#038; -a_3 a_k \\
\vdots &#038; \vdots &#038; \vdots &#038; \ddots &#038; \vdots \\
-a_k a_1 &#038; -a_k a_2 &#038; -a_k a_3 &#038; \cdots &#038; a_k(1 - a_k)
\end{bmatrix}, \quad \in \mathbb{R}^{k \times k}

" alt=""/>
</p>
</p>
<h4 class="wp-block-heading">Code</h4>
</p>
<p>Python code comparing the derivative of softmax computed using the derivation above against the result from the PyTorch autograd function is below. </p>
</p>
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/gradients_multiclass_classification/gradients_cross_entropy_loss.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
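<p>Independently of the notebook above, the closed-form Jacobian can be written compactly as <strong>diag(a) &#8722; a a&#8875;</strong> and checked against central finite differences (a NumPy sketch for illustration, not part of the linked notebook):</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(a):
    # da/dz = diag(a) - a a^T, the matrix written out above
    return np.diag(a) - np.outer(a, a)

z = np.array([0.5, -1.2, 2.0])
a = softmax(z)
J = softmax_jacobian(a)

# numerical check: perturb each z_j and difference the outputs
eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_num[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)
```

Each column of the Jacobian sums to zero, since the softmax outputs always sum to 1 regardless of z.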
<h3 class="wp-block-heading">Derivative of Linear layer</h3>
</p>
<p>To find the derivative of the linear layer <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{align*}
\mathbf{z} &amp;= \mathbf{W} \mathbf{x} + \mathbf{b} \\
\end{align*}" alt="">, with respect to parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{b}" alt="">, we must compute two partial derivatives:</p>
</p>
<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathbf{z}}{\partial \mathbf{W}}" alt="\frac{\partial \mathbf{z}}{\partial \mathbf{W}}"> – how the output changes with respect to the weight matrix.  </li>
</p>
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathbf{z}}{\partial \mathbf{b}}" alt="\frac{\partial \mathbf{z}}{\partial \mathbf{b}}"> – how the output changes with respect to the bias vector.  </li>
</ul>
</p>
</p>
<h4 class="wp-block-heading">Derivative of Weights</h4>
</p>
<p>To compute the derivative <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathbf{z}}{\partial \mathbf{W}}">, we evaluate how each weight parameter <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?W_{ij}"> affects each output dimension <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i">.  The <i><strong>i</strong></i>-th component of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z}" alt="\mathbf{z}"> is:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i = \sum_{j=1}^{n} W_{ij} x_j + b_i" alt="">.
</p>
</p>
<p>The partial derivative of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i" alt="z_i"> with respect to <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?W_{ij}" alt="W_{ij}"> is :</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial z_i}{\partial W_{ij}} = x_j
" alt=""/>
</p>
</p>
<p>where, </p>
</p>
<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?i \in \{1, 2, \dots, k\}" alt=""> indexes the elements of the output vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z} \in \mathbb{R}^{k \times 1}" alt=""></li>
</p>
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?j \in \{1, 2, \dots, n\}" alt=""> indexes the elements of the input vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x} \in \mathbb{R}^{n \times 1}" alt=""></li>
</ul>
</p>
<p>Since each output <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i"> depends only on the weights in the i-th row <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W}_{i,:}">, the <strong>Jacobian</strong> simplifies to a matrix where each row is <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}^\top" alt="">. This can be represented as</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial z_i}{\partial \mathbf{W}_{i,:}} = \mathbf{x}^\top \in \mathbb{R}^{1 \times n}

" alt=""/>
</p>
</p>
<p>For all the rows of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W}" alt="">, the derivative is </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathbf{z}}{\partial \mathbf{W}} = 
\begin{bmatrix}
\mathbf{x}^\top \\
\mathbf{x}^\top \\
\vdots \\
\mathbf{x}^\top
\end{bmatrix}
\in \mathbb{R}^{k \times n}
\quad \text{(each row is } \mathbf{x}^\top \text{)}
" alt=""/>
</p>
</p>
</p>
<h4 class="wp-block-heading">Derivative of Bias</h4>
</p>
<p>The bias vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{b} \in \mathbb{R}^{k \times 1}" alt=""> is added element-wise to the output of the linear transformation <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W}\mathbf{x}">. That is, each output component <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i"> is given by:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i = \sum_{j=1}^{n} W_{ij} x_j + b_i" alt=""/>
</p>
</p>
<p>So the partial derivative of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_i" alt=""> with respect to <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b_j" alt=""> is:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial z_i}{\partial b_j} = \begin{cases} 1 &#038; \text{if } i = j \\ 0 &#038; \text{if } i \ne j \end{cases} " alt=""/>
</p>
</p>
<p>This implies that the Jacobian matrix of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z}"> with respect to <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{b}"> is an identity matrix:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{I}_k \in \mathbb{R}^{k \times k}" alt=""/>
</p>
</p>
<p>This tells us that the bias only affects its corresponding output component (i.e., <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b_j" alt=""> only affects <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?z_j" alt="">).</p>
</p>
</p>
<h2 class="wp-block-heading">Loss for multi-class classification</h2>
</p>
<h3 class="wp-block-heading">Maximum Likelihood Estimate</h3>
</p>
<p>The <strong>likelihood</strong> of observing the true class c, given input <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt="">, under the model is:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(y = c \mid \mathbf{x}) = a_c
" alt=""/>
</p>
</p>
<p>The <strong>log-likelihood</strong> over a dataset with <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> examples is:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\log \mathcal{L} = \sum_{i=1}^{m} \log a^i_c

" alt=""/>
</p>
</p>
<p>where, </p>
</p>
<p><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a^i_c" alt=""> is the model&#8217;s predicted probability for the correct class <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?c" alt=""> for the i-th example.</p>
</p>
<p><strong>Maximizing</strong> this log-likelihood is equivalent to <strong>minimizing</strong> the <strong>negative log-likelihood</strong>:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L_{\text{NLL}}} = -\sum_{i=1}^{m} \log a^i_c

" alt=""/>
</p>
</p>
<h3 class="wp-block-heading">Connecting to Cross Entropy Loss</h3>
</p>
<p>To represent the ground truth class label <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y \in \{1, 2, \ldots, k\}" alt=""> as a target vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{y} \in \mathbb{R}^{k \times 1}" alt="">, a common choice is the <strong>one-hot encoding</strong> scheme, where the true class is indicated by a <strong>1</strong> in the corresponding position and <strong>0</strong> elsewhere. For example, suppose we have <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k = 3" alt="k=3"> classes and the correct label is class 2, i.e., <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y = 2" alt="y = 2">, then the one-hot encoded vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{y}" alt="\mathbf{y}"> becomes :</p>
</p>
<p><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{y} = \begin{bmatrix} 0 \\ 1 \\ 0  \end{bmatrix}" alt="\mathbf{y} = [0, 1, 0]^T"></p>
</p>
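<p>The encoding step can be sketched in NumPy (an illustrative helper, assuming 1-indexed labels as used in the text):</p>

```python
import numpy as np

def one_hot(y, k):
    """One-hot encode a 1-indexed class label y into a length-k vector."""
    v = np.zeros(k)
    v[y - 1] = 1.0  # place a 1 at the position of the true class
    return v

y_vec = one_hot(2, 3)  # k = 3 classes, correct label is class 2 -> [0, 1, 0]
```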
<h4 class="wp-block-heading">Cross Entropy</h4>
</p>
<p>To compare the model’s predicted probability vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a}" alt="\mathbf{a}"> with the one-hot encoded true label <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{y}" alt="\mathbf{y}">, we use a metric called <strong>cross-entropy</strong> <a href="https://en.wikipedia.org/wiki/Cross-entropy" target="_blank" rel="noopener"><sup>(refer wiki entry on Cross Entropy)</sup></a>.  The cross-entropy of the distribution&nbsp;<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{q}" alt="\mathbf{q}">&nbsp;relative to a distribution <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{p}" alt="\mathbf{p}"> over a given set is defined as follows:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?H(p, q) = -\operatorname{E}_p[\log q]
" alt=""/>
</p>
</p>
<p>where, </p>
</p>
<p><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\operatorname{E}_p[\cdot]" alt=""> is the&nbsp;<a href="https://en.wikipedia.org/wiki/Expected_value" target="_blank" rel="noopener">expected value</a>&nbsp;operator with respect to the distribution&nbsp;<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{p}" alt="\mathbf{p}">.</p>
</p>
<p>For&nbsp;<a href="https://en.wikipedia.org/wiki/Discrete_random_variable" target="_blank" rel="noopener">discrete</a>&nbsp;probability distributions <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{p}" alt="\mathbf{p}"> and&nbsp;<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{q}" alt="\mathbf{q}">, where <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{X}" alt=""> is the set of all possible outcomes (classes), this becomes:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?H(p,q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x)
" alt=""/>
</p>
</p>
<h4 class="wp-block-heading">Cross Entropy Loss</h4>
</p>
<p>In the context of training classification models, we use the <strong>cross-entropy loss</strong> as a cost function to minimize. For a single training example, to evaluate how well the <strong>predicted probability vector </strong><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a} \in \mathbb{R}^{k \times 1}" alt=""> matches the ground truth vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{y} \in \mathbb{R}^{k \times 1}" alt="">, the <strong>cross-entropy loss</strong> is defined as:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}_{\text{CE}}(\mathbf{y}, \mathbf{a}) = - \sum_{i=1}^{k} y_i \log(a_i)
" alt=""/>
</p>
</p>
<p>where, </p>
</p>
<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y_i" alt=""> is the true probability of class <code><em>i</em></code>  and</li>
</p>
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a_i" alt=""> is the predicted probability for class <code>i</code>.</li>
</ul>
</p>
<p>The loss encourages the model to assign <strong>higher probability to the correct class</strong> which <strong>indirectly lowers </strong>the probabilities to the<strong> incorrect classes</strong>. The <strong>smaller the cross-entropy loss</strong>, the <strong>closer the predicted probabilities</strong> are to the true labels.</p>
</p>
<p>The loss across all <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> examples is, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}_{\text{CE}}(\mathbf{y}, \mathbf{a}) = -\sum_{i=1}^{m} \sum_{j=1}^{k} y^i_j \log(a^i_j)
" alt=""/>
</p>
</p>
<p>When <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{y}^i" alt=""> is <strong>one-hot encoded</strong>, only the term for the <strong>correct class</strong> <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y^i_c" alt=""> is <strong>non-zero</strong>, so the equation reduces to</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}_{\text{CE}}(\mathbf{y}, \mathbf{a}) = -\sum_{i=1}^{m} \log(a^i_c)
" alt=""/>
</p>
</p>
<p>We can see that this <strong>cross entropy loss</strong> is identical to the <strong>negative log-likelihood</strong> derived earlier, so minimizing it is equivalent to the <strong>maximum likelihood</strong> estimate. </p>
</p>
</p>
<p>Note : </p>
</p>
<ul class="wp-block-list">
<li>Function for cross entropy loss  is available in PyTorch library as <code>torch.nn.CrossEntropyLoss</code><span style="font-size: revert; color: initial;"> </span><a style="font-size: revert;" href="https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html" target="_blank" rel="noreferrer noopener"><sup>(refer entry on CELoss in PyTorch)</sup></a><span style="font-size: revert; color: initial;">.</span>  </li>
</p>
<li>In the <code>torch.nn.CrossEntropyLoss</code> definition, we only need to provide the output of the linear layer <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z} \in \mathbb{R}^{k \times 1}" alt=""> (called <strong>logits</strong>) and the <strong>class index</strong> <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y \in \{0, 1, \ldots, k-1\}" alt=""> as an integer (note that PyTorch class indices are zero-based). The softmax and logarithm of probabilities are computed internally, so we do not need to apply softmax before passing logits to this function.</li>
</ul>
</p>
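<p>As a sanity check on the reduction above, the full cross-entropy sum over a one-hot label equals the negative log-probability of the correct class. A NumPy sketch (illustrative values; PyTorch's <code>torch.nn.CrossEntropyLoss</code> computes this same quantity directly from logits):</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.5, 0.3, -0.7])    # logits from the linear layer
a = softmax(z)
y = np.array([1.0, 0.0, 0.0])     # one-hot label, correct class c = 1

ce_full = -np.sum(y * np.log(a))  # full cross-entropy sum over all classes
ce_reduced = -np.log(a[0])        # reduced form: -log a_c
```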
<h2 class="wp-block-heading">Gradients with Cross Entropy (CE) Loss</h2>
</p>
<p>The system model for multi-class classification involves multiple steps: </p>
</p>
<ul class="wp-block-list">
<li>firstly, the vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z}" alt=""> is defined as a <strong>linear function</strong> of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt=""> using parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{b}" alt="">,</li>
</p>
<li>then <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z}" alt=""> gets transformed into an <strong>estimated probability</strong> score <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a}" alt=""> using the <strong>softmax function</strong>,</li>
</p>
<li>lastly, using the <strong>true label</strong> <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y \in \{1, 2, \ldots, k\}" alt=""> and the estimated probability score <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a}" alt="">, the <strong>cross entropy</strong> loss <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}_{\text{CE}}" alt=""> is computed.</li>
</ul>
</p>
<p>For performing gradient descent on the parameters, the goal is to find the gradients of the loss <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}_{\text{CE}}" alt=""> w.r.t. the parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{b}" alt="">. To find the gradients, we go in reverse order, i.e.</p>
</p>
<ul class="wp-block-list">
<li>first, the gradient of the loss <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}_{\text{CE}}" alt=""> w.r.t. the estimated probability score <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a}" alt="">, i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial \mathbf{a}}">, is computed.</li>
</p>
<li>then the gradient of the probability score <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a}" alt=""> w.r.t. the output of the linear function <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z}" alt=""> is multiplied with the gradient of the loss with respect to <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a}" alt="">, i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z}"></li>
</p>
<li>lastly, to find the gradients of the loss w.r.t. the parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{b}" alt="">, the gradient of the linear function output <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z}" alt=""> w.r.t. those parameters is combined with the above, and the product of all the individual gradients is used. This is written as,</li>
</ul>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{W}}
" alt="">
</p>
</p>
<p>This procedure of calculating gradients in reverse order, from the loss back to the parameters, is an application of the <strong>chain rule from calculus</strong> <a href="https://en.wikipedia.org/wiki/Chain_rule#Intuitive_explanation" target="_blank" rel="noopener"><sup>(refer wiki entry on Chain Rule)</sup></a>. This method is the foundation of <strong>back propagation</strong> used in training models <a href="https://en.wikipedia.org/wiki/Backpropagation" target="_blank" rel="noopener"><sup>(refer wiki entry on Backpropagation)</sup></a>.</p>
</p>
<h3 class="wp-block-heading">Gradients of Loss with respect to Probability (dL/da)</h3>
</p>
<p>As defined earlier, for a <strong>multi-class classification</strong> setting, the <strong>cross-entropy loss</strong> is given by:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L} = - \sum_{i=1}^{k} y_i \log a_i" alt="">
</p>
</p>
<p>Derivative of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}" alt=""> w.r.t  <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a_i" alt=""> is, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial a_i} = - \frac{y_i}{a_i}" alt="">
</p>
</p>
<p>So, the gradient is large if the predicted probability <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a_i" alt=""> is small for the correct class — this penalises the model for incorrect predictions, which is desired during training. The vectorized form of the loss gradient w.r.t. the probability vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a}" alt=""> is:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial \mathbf{a}} =
\begin{bmatrix}
-\dfrac{y_1}{a_1} \\
-\dfrac{y_2}{a_2} \\
\vdots \\
-\dfrac{y_k}{a_k}
\end{bmatrix}, \quad  \in \mathbb{R}^{k \times 1}
" alt="">
</p>
</p>
<p>Equivalently, with element-wise division,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial \mathbf{a}} =
-\frac{\mathbf{y}}{\mathbf{a}}, \quad \in \mathbb{R}^{k \times 1}
" alt="">
</p>
</p>
</p>
<h3 class="wp-block-heading">Gradients of Loss with respect to z (dL/dz)</h3>
</p>
<p>Using the chain rule, to find the gradient of the loss with respect to <strong>z</strong>, i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial \mathbf{z}}">, we multiply the derivative of the softmax output <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathbf{a}}{\partial \mathbf{z}}">, which is a <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k \times k"> matrix, with <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial \mathbf{a}}">, which is of dimension <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k \times 1">:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial \mathcal{L}}{\partial \mathbf{z}} &#038; = &#038;  \frac{\partial \mathbf{a}}{\partial \mathbf{z}} \cdot  \frac{\partial \mathcal{L}}{\partial \mathbf{a}}  \\
&#038;= &#038; \begin{bmatrix} a_1(1 - a_1) &#038; -a_1 a_2 &#038; -a_1 a_3 &#038; \cdots &#038; -a_1 a_k \\ -a_2 a_1 &#038; a_2(1 - a_2) &#038; -a_2 a_3 &#038; \cdots &#038; -a_2 a_k \\ -a_3 a_1 &#038; -a_3 a_2 &#038; a_3(1 - a_3) &#038; \cdots &#038; -a_3 a_k \\ \vdots &#038; \vdots &#038; \vdots &#038; \ddots &#038; \vdots \\ -a_k a_1 &#038; -a_k a_2 &#038; -a_k a_3 &#038; \cdots &#038; a_k(1 - a_k) \end{bmatrix}\begin{bmatrix}
-\dfrac{y_1}{a_1} \\
-\dfrac{y_2}{a_2} \\
\vdots \\
-\dfrac{y_k}{a_k}
\end{bmatrix} \\

\\
&#038;=&#038; \begin{bmatrix}a_1(1 - a_1)\left(-\frac{y_1}{a_1}\right) 
+ (-a_1 a_2)\left(-\frac{y_2}{a_2}\right)
+ (-a_1 a_3)\left(-\frac{y_3}{a_3}\right)
+ \cdots 
+ (-a_1 a_k)\left(-\frac{y_k}{a_k}\right) \\
(-a_2 a_1)\left(-\frac{y_1}{a_1}\right)
+ a_2(1 - a_2)\left(-\frac{y_2}{a_2}\right)
+ (-a_2 a_3)\left(-\frac{y_3}{a_3}\right)
+ \cdots 
+ (-a_2 a_k)\left(-\frac{y_k}{a_k}\right) \\
\vdots\\
(-a_k a_1)\left(-\frac{y_1}{a_1}\right)
+ (-a_k a_2)\left(-\frac{y_2}{a_2}\right)
+ (-a_k a_3)\left(-\frac{y_3}{a_3}\right)

+ \cdots 

+ a_k(1 - a_k)\left(-\frac{y_k}{a_k}\right) \\

\end{bmatrix} 
\\
&#038;=&#038;
\begin{bmatrix}
-y_1(1 - a_1) + a_1y_2 + a_1y_3 + \dots + a_1y_k\\
a_2y_1  -y_2(1 - a_2) + a_2y_3 + \dots + a_2y_k\\
\vdots\\
a_ky_1 + a_ky_2 + a_ky_3\dots  -y_k(1 - a_k) \\
\end{bmatrix}

\\
&#038;=&#038;
\begin{bmatrix}
-y_1 + a_1\cdot\left(y_1 + y_2 + y_3 + \dots + y_k\right)\\
-y_2 + a_2\cdot\left(y_1 + y_2 + y_3 + \dots + y_k\right)\\
\vdots\\
-y_k + a_k\cdot\left(y_1 + y_2 + y_3 + \dots + y_k\right)
\end{bmatrix}

\text{, note : }y_1 + y_2 + y_3 + \dots + y_k=1 \\

&#038;=&#038;
\begin{bmatrix}
a_1-y_1 \\
a_2-y_2 \\
\vdots\\
a_k-y_k \\
\end{bmatrix} \in \mathbb{R}^{k \times 1}



\end{array}"/>
</p>
</p>
<p>In vectorized form, this can be represented as </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \mathbf{a} - \mathbf{y}, \quad \in \mathbb{R}^{k \times 1} " alt="">
</p>
</p>
</p>
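<p>The elegant <strong>a &#8722; y</strong> result can be verified numerically against central finite differences of the loss (a NumPy sketch with illustrative values, not code from the post):</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ce_loss(z, y):
    """Cross-entropy loss for one example, applied to logits z."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.2, -1.0, 1.3])
y = np.array([0.0, 1.0, 0.0])    # one-hot label

grad_analytic = softmax(z) - y   # the a - y result derived above

# numerical check: perturb each z_j and difference the loss
eps = 1e-6
grad_num = np.zeros(3)
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    grad_num[j] = (ce_loss(z + dz, y) - ce_loss(z - dz, y)) / (2 * eps)
```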
<h3 class="wp-block-heading">Gradients of loss with respect to Parameters (dL/dW, dL/db)</h3>
</p>
<h4 class="wp-block-heading">Gradients of Weights (W)</h4>
</p>
<p>Based on the chain rule, to find the gradient of the loss with respect to the parameter <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W}" alt="">, we multiply each element of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial z_i}" alt=""> with the corresponding row <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial z_i}{\partial \mathbf{W}_{i,:}} = \mathbf{x}^\top" alt="">. </p>
</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
\frac{\partial \mathcal{L}}{\partial \mathbf{W}} 
&#038;= &#038; \frac{\partial \mathcal{L}}{\partial \mathbf{z}} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{W}}\\

&#038;=&#038;
\begin{bmatrix}
(a_1-y_1) \cdot \mathbf{x}^\top \\
(a_2-y_2)  \cdot \mathbf{x}^\top  \\
\vdots\\
(a_k-y_k)  \cdot \mathbf{x}^\top \\
\end{bmatrix} 
\\

&#038;=&#038;
\begin{bmatrix}
(a_1-y_1) \\
(a_2-y_2) \\
\vdots\\
(a_k-y_k) \\
\end{bmatrix}  \mathbf{x}^\top \\

&#038;=&#038;
\begin{bmatrix}
(a_1-y_1) \\
(a_2-y_2) \\
\vdots\\
(a_k-y_k) \\
\end{bmatrix} \cdot 
\begin{bmatrix}
x_1 &#038; x_2 &#038; \cdots &#038; x_n
\end{bmatrix} \\

&#038; = &#038; 
\begin{bmatrix}
(a_1 - y_1) \cdot x_1 &#038; (a_1 - y_1) \cdot x_2 &#038; \cdots &#038; (a_1 - y_1) \cdot x_n \\
(a_2 - y_2) \cdot x_1 &#038; (a_2 - y_2) \cdot x_2 &#038; \cdots &#038; (a_2 - y_2) \cdot x_n \\
\vdots &#038; \vdots &#038; \ddots &#038; \vdots \\
(a_k - y_k) \cdot x_1 &#038; (a_k - y_k) \cdot x_2 &#038; \cdots &#038; (a_k - y_k) \cdot x_n
\end{bmatrix} \in \mathbb{R}^{k \times n} 

\end{array} 

" alt="">
</p>
</p>
<p>This is equivalent to the outer product, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = (\mathbf{a} - \mathbf{y}) \mathbf{x}^\top, \quad \in \mathbb{R}^{k \times n}" alt="">
</p>
</p>
</p>
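<p>The outer-product form can again be spot-checked against finite differences (a NumPy sketch; the dimensions k = 3, n = 4 and the random values are illustrative assumptions):</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ce_loss(W, b, x, y):
    return -np.sum(y * np.log(softmax(W @ x + b)))

rng = np.random.default_rng(0)
k, n = 3, 4
W = rng.normal(size=(k, n))
b = rng.normal(size=k)
x = rng.normal(size=n)
y = np.array([0.0, 0.0, 1.0])    # one-hot label

a = softmax(W @ x + b)
dW = np.outer(a - y, x)          # (a - y) x^T, shape (k, n)
db = a - y                       # gradient w.r.t. the bias

# spot-check one weight entry with central finite differences
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[2, 1] += eps
Wm[2, 1] -= eps
dW_num = (ce_loss(Wp, b, x, y) - ce_loss(Wm, b, x, y)) / (2 * eps)
```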
<h4 class="wp-block-heading">Gradients of bias (b)</h4>
</p>
<p>Recall the linear transformation: <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{z} = \mathbf{W} \mathbf{x} + \mathbf{b}" alt="">. The gradients are:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial \mathcal{L}}{\partial \mathbf{b}} &#038; = &#038; \frac{\partial \mathcal{L}}{\partial \mathbf{z}} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{b}} \\

&#038;=&#038;(\mathbf{a} - \mathbf{y}) \mathbf{I}_k \\

&#038;=&#038;(\mathbf{a} - \mathbf{y}) 
\end{array} "  alt="">
</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? \quad \frac{\partial \mathcal{L}}{\partial \mathbf{b}} = \mathbf{a} - \mathbf{y}" alt="">
</p>
</p>
<p>The intuition from the above equations is:</p>
</p>
<p>If the <strong>estimated probability <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{a}" alt=""> is close to the true value <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{y}" alt=""></strong>, then the gradient is <strong>small</strong>, and the update to the parameters is correspondingly <strong>smaller</strong>. If you recall, the gradients for <strong>binary classification</strong> <sup>(refer post on <a href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/"> Gradients for Binary Classification with Sigmoid</a>)</sup> and <strong>linear regression</strong> <sup>(refer post on <a href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Gradients" target="_blank" rel="noreferrer noopener">Gradients for Linear Regression</a>)</sup> follow a similar intuitive explanation.</p>
</p>
<p>These gradients are then used in the optimizer (e.g., SGD) to update parameters and reduce the loss.</p>
</p>
<h2 class="wp-block-heading">Vectorised operations (with m examples)</h2>
</p>
<p>The <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> training examples, each with <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n" alt=""> features, are represented as:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{X} = \begin{bmatrix}
x_{1}^1 &#038; x_{1}^2 &#038; \dots &#038; x_{1}^m \\
x_{2}^1 &#038; x_{2}^2 &#038; \dots &#038; x_{2}^m \\
\vdots &#038; \vdots &#038; \ddots &#038; \vdots \\
x_{n}^1 &#038; x_{n}^2 &#038; \dots &#038; x_{n}^m
\end{bmatrix}, \quad \mathbf{X} \in \mathbb{R}^{n \times m}
" alt="">
</p>
</p>
<p>The output, which is a probability matrix across <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k" alt=""> classes for each of the <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> examples, is:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{Y} = \begin{bmatrix}
y_1^1 &#038; y_1^2 &#038; \dots &#038; y_1^m \\
y_2^1 &#038; y_2^2 &#038; \dots &#038; y_2^m \\
\vdots &#038; \vdots &#038; \ddots &#038; \vdots \\
y_k^1 &#038; y_k^2 &#038; \dots &#038; y_k^m
\end{bmatrix}, \quad \mathbf{Y} \in \mathbb{R}^{k \times m}
" alt="">
</p>
</p>
<p>The linear transformation before applying the activation function (e.g., softmax) is given by:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{Z} = \mathbf{W} \mathbf{X} + \mathbf{b}, \quad  \in \mathbb{R}^{k \times m} 
" alt="">
</p>
</p>
<p>where the parameters are:</p>
</p>
<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W} \in \mathbb{R}^{k \times n}" alt=""> and</li>
</p>
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{b} \in \mathbb{R}^{k \times 1}" alt=""></li>
</ul>
</p>
<p>The softmax activation is applied column-wise to the matrix <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{Z}"> to obtain the probability outputs:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi? 
\mathbf{A}_{:,j} = \mathrm{softmax}(\mathbf{Z}_{:,j}) = \frac{\exp(\mathbf{Z}_{:,j})}{\sum_{i=1}^{k} \exp(Z_{i,j})}, \quad \text{for } j = 1, 2, \dots, m
" alt="">
</p>
</p>
<p>In matrix form, this is written as:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{A} = \begin{bmatrix}
a_1^1 &#038; a_1^2 &#038; \dots &#038; a_1^m \\
a_2^1 &#038; a_2^2 &#038; \dots &#038; a_2^m \\
\vdots &#038; \vdots &#038; \ddots &#038; \vdots \\
a_k^1 &#038; a_k^2 &#038; \dots &#038; a_k^m
\end{bmatrix}, \quad \mathbf{A} \in \mathbb{R}^{k \times m}
" alt="">
</p>
</p>
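<p>A minimal NumPy sketch of the column-wise softmax (the max-subtraction is a standard numerical-stability trick, not part of the equations above):</p>

```python
import numpy as np

def softmax_columns(Z):
    """Apply softmax independently to each column of Z (shape k x m)."""
    Z_shift = Z - Z.max(axis=0, keepdims=True)   # stability: subtract column max
    expZ = np.exp(Z_shift)
    return expZ / expZ.sum(axis=0, keepdims=True)

# k = 2 classes, m = 3 examples (illustrative values)
Z = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.0, 0.5]])
A = softmax_columns(Z)               # each column of A sums to 1
```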
<p>The cross-entropy loss compares the predicted probabilities <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{A}"> with the ground truth one-hot encoded labels <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{Y} \in \mathbb{R}^{k \times m}">:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
L = -\frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{k} Y_{i,j} \log A_{i,j}
" alt="">
</p>
</p>
<p>The derivative of the cross-entropy loss with softmax activation, with respect to the input <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{Z}"> (logits), simplifies to:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\frac{\partial L}{\partial \mathbf{Z}} = \mathbf{A} - \mathbf{Y},  \quad \in \mathbb{R}^{k \times m}
" alt="">
</p>
</p>
<p>The gradient of the loss with respect to the weight matrix <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{W}"> is:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\frac{\partial L}{\partial \mathbf{W}} = \frac{1}{m} (\mathbf{A} - \mathbf{Y}) \mathbf{X}^\top, \in \mathbb{R}^{k \times n} 
" alt="">
</p>
</p>
<p>
As the input matrix <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{X}"> has shape <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n \times m">, the matrix product<br />
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
(\mathbf{A} - \mathbf{Y}) \mathbf{X}^\top
"> results in a matrix of shape<br />
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k \times n">. This captures the total gradient of the loss over all <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m"> examples.<br />
Averaging over the examples is done by multiplying with <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{1}{m}">.
</p>
</p>
<p>The gradient of the loss with respect to the bias vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{b}"> is computed by summing the gradient over all <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m"> examples using a row vector of ones:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\frac{\partial L}{\partial \mathbf{b}} = \frac{1}{m} (\mathbf{A} - \mathbf{Y}) \mathbf{1}_{m \times 1}, \quad \in \mathbb{R}^{k \times 1}
" alt="">
</p>
</p>
<p>
Here, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{A} - \mathbf{Y} \in \mathbb{R}^{k \times m}"> and<br />
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{1}_{m \times 1}"> sums the gradients across all <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m"> examples.<br />
The result is a <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?k \times 1"> vector, which matches the shape of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{b}">.
</p>
</p>
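<p>The vectorised gradient expressions can be sketched in NumPy as follows (the sizes and random data are arbitrary, for illustration only):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 3, 4, 5                    # features, classes, examples (arbitrary)
W = rng.standard_normal((k, n))
b = rng.standard_normal((k, 1))
X = rng.standard_normal((n, m))
Y = np.zeros((k, m))                 # one-hot labels, one per column
Y[rng.integers(0, k, size=m), np.arange(m)] = 1.0

# Forward: Z = W X + b (b broadcasts across the m columns), column-wise softmax
Z = W @ X + b
E = np.exp(Z - Z.max(axis=0, keepdims=True))
A = E / E.sum(axis=0, keepdims=True)

# Averaged gradients from the derivation
dZ = A - Y
dW = (dZ @ X.T) / m                  # shape (k, n)
db = (dZ @ np.ones((m, 1))) / m      # shape (k, 1); sums the per-example gradients
```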
<h3 class="wp-block-heading">Code (gradients)</h3>
</p>
<p>Example code comparing the gradients computed from the above derivation against gradients from PyTorch autograd.</p>
</p>
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/gradients_multiclass_classification/gradients_cross_entropy_loss.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
<h2 class="wp-block-heading">Training for toy example with 3 classes</h2>
</p>
<p>Below is an example of training a multi-class classifier based on the model and gradient descent. <strong>Synthetic training data</strong> is generated from two <strong>independent Gaussian random variables</strong> with zero mean and unit variance. The means are shifted by <strong>(-2,-2), (+2,+2), (-2,+2)</strong> corresponding to <strong>class 0</strong>, <strong>class 1</strong>, and <strong>class 2</strong> respectively.</p>
</p>
<p>The training loop is run using the <strong>manually computed gradients</strong> and using <code><strong>torch.autograd</strong></code> provided by <strong>PyTorch</strong>; we can see that both are numerically very close.</p>
</p>
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/gradients_multiclass_classification/training_multiclass_classification.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
<h2 class="wp-block-heading">Training with Label Smoothing</h2>
</p>
<p>In the previous section, we derived the gradients for multi-class classification using <strong>one-hot encoded targets</strong>. In the paper <em>&#8220;Rethinking the Inception Architecture for Computer Vision&#8221;</em> by Szegedy et al. (2016) <a href="https://arxiv.org/abs/1512.00567" target="_blank" rel="noopener"><sup>(arXiv:1512.00567)</sup></a>, the idea of <strong>label smoothing</strong> was introduced. The key observation is that one-hot targets, which drive the predicted probability for the correct class toward 1 and ignore the other classes in the loss function, encourage models to become <strong>overconfident</strong>.</p>
</p>
<p>Label smoothing combats this by replacing the <strong>hard 1</strong> in the true class with a <strong>slightly lower value</strong>, such as <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
(1 - \varepsilon)" alt="">, and distributing the remaining <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\varepsilon" alt=""> <strong>equally among the other classes</strong>. So, instead of teaching the model that <strong>one class is <em>absolutely</em> correct</strong>, we teach it that <strong>one class is <em>very likely</em> correct</strong> — allowing for some uncertainty.</p>
</p>
<p>For a classification problem with  <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?K" alt=""> classes and smoothing parameter <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\varepsilon" alt=""> , the smoothed label vector becomes:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\mathbf{y}_{\text{smooth}} = (1 - \varepsilon) \cdot \mathbf{y}_{\text{one-hot}} + \frac{\varepsilon}{K}
" alt="">
</p>
</p>
<p>For an example with <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?K" alt="">=4 classes,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
\mathbf{y}_{\text{one-hot}} &#038; = &#038; [0, 0, 1, 0] \\

\mathbf{y}_{\text{smooth}} &#038; = &#038; \left[ \frac{\varepsilon}{K}, \frac{\varepsilon}{K}, 1 - \varepsilon, \frac{\varepsilon}{K} \right] \quad \text{where } K = 4
\end{array}
" alt="">
</p>
</p>
<p>Even though we modify the target labels using label smoothing, the <strong>smoothed probabilities still sum to 1</strong>. Because of this, the <strong>gradient derivations from the previous section remain valid</strong>. </p>
</p>
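<p>The smoothing formula is a one-liner; a small sketch for the K = 4 example above:</p>

```python
import numpy as np

def smooth_labels(y_onehot, eps):
    """y_smooth = (1 - eps) * y_onehot + eps / K."""
    K = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / K

y_onehot = np.array([0.0, 0.0, 1.0, 0.0])     # K = 4, true class = 2
y_smooth = smooth_labels(y_onehot, eps=0.1)   # eps = 0.1 is a typical choice
# -> [0.025, 0.025, 0.925, 0.025]; still sums to 1
```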
<h3 class="wp-block-heading">Training code</h3>
</p>
<p>For the toy training example earlier, we compare training with smoothed labels against one-hot encoded labels. The PyTorch function <code>torch.nn.CrossEntropyLoss</code> <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html" target="_blank" rel="noreferrer noopener"><sup>(refer entry on CrossEntropyLoss in PyTorch)</sup></a> has an optional argument <strong><code>label_smoothing</code></strong> which implements label smoothing as defined earlier. </p>
</p>
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/gradients_multiclass_classification/training_label_smoothing.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
<p>In the training results on the toy example, we can see that the <strong>loss is higher</strong> for the training with <strong>label smoothing</strong>, and correspondingly the <strong>misclassification rate</strong> is also <strong>slightly higher</strong>.</p>
</p>
<p>However, label smoothing has been shown to <strong>improve generalization </strong>in larger models trained on complex datasets. The concept was first introduced in <em><strong>Rethinking the Inception Architecture for Computer Vision</strong></em> <a class="" href="https://arxiv.org/abs/1512.00567" target="_blank" rel="noopener">(Szegedy et al., 2016)</a>, and was later used in the foundational paper <em><strong>Attention is All You Need</strong></em> <a class="" href="https://arxiv.org/abs/1706.03762" target="_blank" rel="noopener">(Vaswani et al., 2017)</a>. A broader study, <strong><em>When Does Label Smoothing Help?</em> </strong><a class="" href="https://arxiv.org/abs/1906.02629" target="_blank" rel="noopener">(Müller et al., 2019)</a>, analyzed its effectiveness in large models like <strong>ResNets</strong> and <strong>Transformers</strong>.</p>
</p>
<h2 class="wp-block-heading">Summary</h2>
</p>
<p>The post covers the following key aspects:</p>
</p>
<ul class="wp-block-list">
<li><strong>System model </strong>for <strong>multi class classification</strong> with <strong>linear layer</strong> and <strong>softmax</strong></li>
</p>
<li><strong>Loss function&nbsp;</strong>based on<strong> categorical cross entropy</strong> and showing that this is <strong>Maximum Likelihood Estimate</strong></li>
</p>
<li>Computation of the&nbsp;<strong>gradient</strong>&nbsp;based on&nbsp;<strong>chain rule of derivatives</strong></li>
</p>
<li><strong>Vectorized operations</strong> for a batch of examples, implementing the computations with efficient matrix and vector math</li>
</p>
<li><strong>Training loop</strong>&nbsp;for the classification using both manual and PyTorch based gradients</li>
</p>
<li>Explains the concept of <strong>label smoothing</strong> and demonstrates it with a training loop</li>
</ul>
</p>
<p>Have any questions or feedback? Feel free to drop your feedback in the comments section. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p></p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/">Gradients for multi class classification with Softmax</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://dsplog.com/2025/06/22/gradients-for-multi-class-classification-with-softmax/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Gradients for Binary Classification with Sigmoid</title>
		<link>https://dsplog.com/2025/05/17/gradients-for-binary-classification/</link>
					<comments>https://dsplog.com/2025/05/17/gradients-for-binary-classification/#comments</comments>
		
		<dc:creator><![CDATA[Krishna Sankar]]></dc:creator>
		<pubDate>Sat, 17 May 2025 13:05:07 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Binary Classification]]></category>
		<category><![CDATA[Binary Cross Entropy]]></category>
		<category><![CDATA[Maximum Likelihood]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[Sigmoid]]></category>
		<guid isPermaLink="false">https://dsplog.com/?p=2162</guid>

					<description><![CDATA[<p>In a classification problem, the output (also called the label or class) takes a small number of discrete values rather than continuous values. For a simple binary classification problem, where output takes only two discrete values : 0 or 1, the sigmoid function can be used to transform the output of a linear regression model &#8230; <a href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/" class="more-link">Continue reading<span class="screen-reader-text"> "Gradients for Binary Classification with Sigmoid"</span></a></p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/">Gradients for Binary Classification with Sigmoid</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></description>
										<content:encoded><![CDATA[</p>
<p>In a <strong>classification</strong> problem, the output (also called the <strong>label</strong> or <strong>class</strong>) takes a small number of <strong>discrete values</strong> rather than continuous values. For a simple <strong>binary classification</strong> problem, where the output takes only <strong>two discrete values</strong>: 0 or 1, the <strong>sigmoid function</strong> can be used to transform the output of a <strong>linear regression</strong> model into a value between 0 and 1, squashing the continuous prediction into a <strong>probability</strong>-like score. This score can then be interpreted as the <strong>likelihood</strong> of the output being class 1, with a <strong>threshold</strong> (commonly 0.5) used to decide between class 0 and class 1.</p>
</p>
<p>In this post, the intuition for the <strong>loss function</strong> for <strong>binary classification</strong>, based on <strong>Maximum Likelihood Estimation (MLE)</strong>, is explained. We then derive the <strong>gradients</strong> for the model parameters using the <strong>chain rule</strong>. Gradients computed <strong>analytically</strong> are compared against gradients computed using the deep learning framework <strong>PyTorch</strong>. Further, a <strong>training loop</strong> using <strong>gradient descent</strong> is implemented for a binary classification problem with two-dimensional Gaussian distributed data.</p>
</p>
<p><span id="more-2162"></span></p>
<p><div id="ez-toc-container" class="ez-toc-v2_0_82_2 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction">
<div class="ez-toc-title-container">
<p class="ez-toc-title" style="cursor:inherit">Table of Contents</p>
<span class="ez-toc-title-toggle"><a href="#" class="ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle" aria-label="Toggle Table of Content"><span class="ez-toc-js-icon-con"><span class=""><span class="eztoc-hide" style="display:none;">Toggle</span><span class="ez-toc-icon-toggle-span"><svg style="fill: #999;color:#999" xmlns="http://www.w3.org/2000/svg" class="list-377408" width="20px" height="20px" viewBox="0 0 24 24" fill="none"><path d="M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z" fill="currentColor"></path></svg><svg style="fill: #999;color:#999" class="arrow-unsorted-368013" xmlns="http://www.w3.org/2000/svg" width="10px" height="10px" viewBox="0 0 24 24" version="1.2" baseProfile="tiny"><path d="M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z"/></svg></span></span></span></a></span></div>
<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-1" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Model">Model</a></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-2" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Sigmoid_function_and_its_derivative">Sigmoid function and its derivative</a></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-3" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Loss_function_for_binary_classification">Loss function for binary classification</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-4" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Maximum_Likelihood_Estimation">Maximum Likelihood Estimation</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-5" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Log_Likelihood">Log Likelihood</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-6" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Negative_Log_Likelihood">Negative Log Likelihood</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-7" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Averaging_the_Loss">Averaging the Loss</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-8" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Gradients_with_Binary_Cross_Entropy_BCE_Loss">Gradients with Binary Cross Entropy (BCE) Loss</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-9" 
href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Deriving_the_gradients">Deriving the gradients</a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-10" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Step1_Gradients_of_loss_wrt_to_probability_score">Step1 : Gradients of loss w.r.t to probability score</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-11" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Step2_Gradients_of_probability_score_wrt_to_output_of_linear_function">Step2 : Gradients of probability score w.r.t to output of linear function</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-12" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Step3_Gradients_of_output_of_linear_function_wrt_to_parameters">Step3 : Gradients of output of linear function w.r.t to parameters</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-13" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Gradients_of_loss_wrt_to_parameters">Gradients of loss w.r.t to parameters</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-14" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Vectorised_operations">Vectorised operations</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-15" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Gradients_computed_numerically_vs_PyTorch">Gradients computed numerically vs PyTorch</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-16" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Training_%E2%80%93_Binary_Classification">Training &#8211; 
Binary Classification</a></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-17" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/#Summary">Summary</a></li></ul></nav></div>

</p>
</p>
<p>As always, contents from <a href="https://cs229.stanford.edu/main_notes.pdf" target="_blank" rel="noreferrer noopener">CS229 Lecture Notes</a> and the notations used in the course <a href="https://youtube.com/playlist?list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&amp;si=_3Xs1piNOfQ847gd" target="_blank" rel="noopener">Deep Learning Specialization C1W1L01</a> from Dr Andrew Ng form key references.</p>
</p>
<h2 class="wp-block-heading">Model</h2>
</p>
<p>Let us take an example of estimating <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y \in \{0, 1\}" alt=""> based on feature vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt=""> having <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n" alt=""> features i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x} \in \mathbb{R}^n" alt="">.</p>
</p>
<p>There are <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> examples.</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{tabular}{|c|c|c|c|c|}
\hline&amp;{example^1}&amp;{example^2}&amp;\ldots&amp;{example^m}\\
\hline{feature_1}&amp;{x_1}^{1}&amp;{x_1}^{2}&amp;\ldots&amp;{x_1}^{m}\\
\hline{feature_2}&amp;{x_2}^{1}&amp;{x_2}^{2}&amp;\ldots&amp;{x_2}^{m}\\
\hline&amp;\vdots&amp;\vdots&amp;\ldots&amp;\vdots\\
\hline{feature_n}&amp;{x_n}^{1}&amp;{x_n}^{2}&amp;\ldots&amp;{x_n}^{m}&amp;\\
\hline{output}&amp;{y}^{1}&amp;{y}^{2}&amp;\ldots&amp;{y}^{m}\end{tabular}
" alt="">
</p>
</p>
<p>Let us <strong>assume</strong> that the variable <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt=""> is defined as a <strong>linear function</strong> of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt="">. Then <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt=""> gets transformed into a probability score <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a" alt=""> using the <strong>sigmoid function</strong>. For a single training example, this can be written as:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}{z}^1&#038;=&#038;w_1{x_1}^1+w_2{x_2}^1+\dots+w_n{x_n}^1+b\\&#038;=&#038;\mathbf{w^T}\mathbf{x}^1+b\end{array}" alt="">
</p>
</p>
<p>where, </p>
</p>
<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> is the weight vector of size <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n" alt="">, i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w} \in \mathbb{R}^n" alt="">, and</li>
</p>
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> is a scalar</li>
</ul>
</p>
<p>To convert the real number <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt=""> to a number <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a" alt=""> lying between 0 and 1, let us define</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{equation}
a^1 = \sigma(z^1) = \frac{1}{1 + \exp(-z^1)} = \frac{1}{1 + \exp\left(-(\mathbf{w^T} \mathbf{x}^1 + b)\right)}
\end{equation}" alt="">
</p>
</p>
<p>where <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\sigma(z)" alt=""> is the sigmoid function <a href="https://en.wikipedia.org/wiki/Sigmoid_function" target="_blank" rel="noopener"><sup>(refer wiki entry on sigmoid function)</sup></a></p>
</p>
<h2 class="wp-block-heading">Sigmoid function and its derivative</h2>
</p>
<p>The sigmoid function, a smooth S-shaped mathematical function, is defined as:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{equation}
\sigma(z) = \frac{1}{1 + \exp(-z)}
\end{equation}" alt="">
</p>
</p>
<p>which has the properties</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\textbf{Output range: } (0, 1)\\
\text{As } z \to -\infty, \quad \sigma(z) \to 0 \\
\text{As } z \to +\infty, \quad \sigma(z) \to 1  \\
\text{Symmetric around } z = 0: \quad \sigma(0) = 0.5
" alt="">
</p>
</p>
<p>The derivative of sigmoid <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\sigma'(z)" alt=""> is, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{rcll}
\sigma'(z) &#038; = &#038; \dfrac{d}{dz} \left( \dfrac{1}{1 + e^{-z}} \right) \\
&#038; = &#038; -1 \cdot \left(1 + e^{-z} \right)^{-2} \cdot \dfrac{d}{dz}(1 + e^{-z}) \\
&#038; = &#038; - \dfrac{1}{(1 + e^{-z})^2} \cdot (-e^{-z}) \\
&#038; = &#038; \dfrac{e^{-z}}{(1 + e^{-z})^2} \\
&#038; = &#038; \left( \dfrac{1}{1 + e^{-z}} \right) \left( \dfrac{e^{-z}}{1 + e^{-z}} \right) \\
&#038; = &#038; \sigma(z)\left(1 - \sigma(z)\right)
\end{array}" alt="">
</p>
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/gradients_binary_classification/sigmoid_and_derivative.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
<p>From the plots of the derivative of the sigmoid, two key observations:</p>
</p>
<ul class="wp-block-list">
<li><strong>Vanishing gradients</strong>: for very large or very small <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt="">, the derivative approaches 0, causing gradients to vanish during backpropagation; this slows or stalls learning in deep networks.</li>
</p>
<li><strong>Low maximum gradient</strong>: the maximum value of the derivative is 0.25, which caps the gradient flow, making it harder for deep layers to effectively update their weights.</li>
</ul>
</p>
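<p>The two observations above are easy to verify numerically; a small NumPy sketch:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)             # sigma'(z) = sigma(z) (1 - sigma(z))

z = np.linspace(-10.0, 10.0, 2001)
d = sigmoid_deriv(z)
# Peak gradient is 0.25 at z = 0; the tails vanish for large |z|
```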
<p>As mentioned in the article <a href="https://karpathy.medium.com/yes-you-should-understand-backprop-e2f06eab496b" target="_blank" rel="noopener">Yes you should understand backprop by Andrej Karpathy</a>, these aspects have to be kept in mind when using the <strong>sigmoid</strong> for training <strong>deeper neural networks</strong>.</p>
</p>
<h2 class="wp-block-heading">Loss function for binary classification</h2>
</p>
<h3 class="wp-block-heading">Maximum Likelihood Estimation </h3>
</p>
<p>Let us assume that the probability of output being 1, given input <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt=""> and parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt="">, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> is, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(y = 1 \mid \mathbf{x}, \mathbf{w}, b) = a = \sigma(z) = \frac{1}{1 + e^{-(\mathbf{w^T} \mathbf{x}+b)}}
" alt="">
</p>
</p>
<p>Then, for the binary classification, the probability of output being 0 is,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(y = 0 \mid \mathbf{x},\mathbf{w}, b) = 1-a 
" alt="">
</p>
</p>
<p>Since <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt=""> can either be 0 or 1, we can compactly write the <strong>likelihood</strong> as:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?P(y^i|x^i,\mathbf{w}, b) = (a^i)^{y^i}  (1-a^i)^{(1-y^i)} 
" alt="">
</p>
</p>
<p>The <strong>likelihood function</strong> is the probability of the actual label <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y \in \{0, 1\}" alt=""> given the prediction <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a" alt="">. When the training examples are <strong>independently and identically distributed (i.i.d.)</strong>, the total likelihood for the dataset is the <strong>product of the likelihoods</strong> of each example. With this assumption, for <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> training examples, the likelihood for the parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> is,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}(w, b) = \prod_{i=1}^{m} P(y_i \mid x_i, \mathbf{w},b)
 = \prod_{i=1}^{m} (a^i)^{y^i}  (1-a^i)^{(1-y^i)} 
" alt="">
</p>
</p>
<h3 class="wp-block-heading">Log Likelihood</h3>
</p>
<p>To avoid the<strong> product of many small numbers</strong>, we take the <strong>natural logarithm</strong> of the<strong> likelihood function</strong>. The <strong>log-likelihood </strong>for the entire dataset is the sum of the log-likelihoods for each example:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
\log \mathcal{L}(w, b) &#038;=&#038; \log \prod_{i=1}^{m} P(y_i \mid x_i, \mathbf{w},b)\\ 
&#038;=&#038; \sum_{i=1}^{m} \log P(y_i \mid x_i, \mathbf{w},b) \\
&#038;=&#038; \sum_{i=1}^{m} \log \left[(a^i)^{y^i}  (1-a^i)^{(1-y^i)} \right] \\
&#038;=&#038; \sum_{i=1}^{m} \left[ y^{i} \log a^{i} + (1 - y^{i}) \log(1 - a^{i}) \right]
\end{array}
 
" alt="">
</p>
</p>
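<p>A quick numerical check that the log of the product equals the sum of the logs (the probabilities and labels here are made up for illustration):</p>

```python
import numpy as np

a = np.array([0.9, 0.2, 0.8, 0.6])   # predicted probabilities a^i (illustrative)
y = np.array([1.0, 0.0, 1.0, 1.0])   # true labels y^i

likelihood = np.prod(a**y * (1.0 - a)**(1.0 - y))
log_likelihood = np.sum(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))
# log(likelihood) and log_likelihood agree
```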
<h3 class="wp-block-heading">Negative Log Likelihood</h3>
</p>
<p>Since <strong>optimizers</strong> like gradient descent are designed to <strong>minimize</strong> functions, we<strong> minimize the negative log-likelihood</strong> instead of<strong> maximizing the log-likelihood</strong>. </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
\text{neg } \log \mathcal{L}(w, b) 
&#038;=&#038; -\sum_{i=1}^{m} \left[ y^{i} \log a^{i} + (1 - y^{i}) \log(1 - a^{i}) \right]
\end{array}
 
" alt="">
</p>
</p>
<h3 class="wp-block-heading">Averaging the Loss</h3>
</p>
<p>Averaging the loss ensures that the total loss remains on the <strong>same scale</strong>, regardless of the size of the training dataset. This is important because it allows the use of a <strong>fixed learning rate</strong> across different dataset sizes, leading to more stable and consistent optimization behaviour.</p>
</p>
<p>The <strong>averaged negative log-likelihood</strong> is defined as:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}
\text{neg } \log \mathcal{L}_{\text{avg}}(w, b) 
&#038;=&#038; -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{i} \log a^{i} + (1 - y^{i}) \log(1 - a^{i}) \right]
\end{array}
 
" alt="">
</p>
</p>
<p>This expression is known as the <strong>Binary Cross-Entropy (BCE) Loss</strong>, which is widely used in <strong>binary classification</strong> tasks. This function is available in the PyTorch library as <code>torch.nn.BCELoss</code> <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.BCELoss.html" target="_blank" rel="noopener"><sup>(refer entry on BCELoss in PyTorch)</sup></a>.</p>
</p>
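<p>As a quick sanity check, the averaged negative log-likelihood can be coded directly; a minimal sketch (the helper name <code>bce_loss</code> and the sample values are made up, but the formula matches the averaged expression above):</p>

```python
import math

def bce_loss(a, y):
    """Averaged negative log-likelihood (binary cross-entropy)."""
    m = len(y)
    return -sum(yi * math.log(ai) + (1 - yi) * math.log(1 - ai)
                for ai, yi in zip(a, y)) / m

# made-up probability scores and labels
loss = bce_loss([0.9, 0.2, 0.8], [1, 0, 1])

# a confident correct prediction is penalised far less than a confident wrong one
assert bce_loss([0.99], [1]) < bce_loss([0.01], [1])
```
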
<h2 class="wp-block-heading">Gradients with Binary Cross Entropy (BCE) Loss</h2>
</p>
<p>The system model for binary classification involves multiple steps: </p>
</p>
<ul class="wp-block-list">
<li>firstly, the variable <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt=""> is defined as a <strong>linear function</strong> of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt=""> using parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt="">, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt="">.</li>
</p>
<li>then <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt=""> gets transformed into an <strong>estimated probability</strong> score <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a" alt=""> using the <strong>sigmoid function</strong>. </li>
</p>
<li>lastly, using the<strong> true label </strong><img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?y \in \{0, 1\}" alt=""> and the estimated probability score <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a" alt="">, the <strong>binary cross entropy</strong> loss <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}" alt=""> is computed</li>
</ul>
</p>
<p>For performing gradient descent on the parameters, the goal is to find the gradients of the loss <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}" alt=""> w.r.t. the parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt="">.  To find the gradients, we go in the reverse order i.e.</p>
</p>
<ul class="wp-block-list">
<li>firstly, gradients of the loss <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L}" alt=""> w.r.t. the estimated probability score <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a" alt=""></li>
</p>
<li>then gradients of the probability score <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?a" alt=""> w.r.t. the output of the linear function <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt=""></li>
</p>
<li>lastly, gradients of the output of the linear function <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt=""> w.r.t. the parameters  <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt="">, <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""></li>
</ul>
</p>
<p>Then the product of all the individual gradients gives the gradient of the loss w.r.t. the parameters. This is written as,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}}
" alt="">
</p>
</p>
<p>The steps described, calculating gradients in the reverse order from the loss back to the parameters<strong> </strong>is an application of the<strong> chain rule from calculus</strong> <a href="https://en.wikipedia.org/wiki/Chain_rule#Intuitive_explanation" target="_blank" rel="noopener"><sup>(refer wiki entry on Chain Rule)</sup></a>. This method is the foundation of <strong>backpropagation</strong> used in training models <a href="https://en.wikipedia.org/wiki/Backpropagation" target="_blank" rel="noopener"><sup>(refer wiki entry on Backpropagation)</sup></a>.</p>
</p>
<h3 class="wp-block-heading">Deriving the gradients</h3>
</p>
<p>For simplicity, take a single example and compute the gradients step by step.</p>
</p>
<h4 class="wp-block-heading">Step 1: Gradient of loss w.r.t. probability score</h4>
</p>
<p>With the loss <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathcal{L} = -\left[ y^{1} \log a^{1} + (1 - y^{1}) \log(1 - a^{1})\right]" alt="">, the derivative of the loss w.r.t. the sigmoid output <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?a" alt=""> is,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial \mathcal{L}}{\partial a} = -\left(\frac{y^1}{a^1} - \frac{1-y^1}{1-a^1}\right)
" alt="">
</p>
</p>
<h4 class="wp-block-heading">Step 2: Gradient of probability score w.r.t. output of linear function</h4>
</p>
<p>With <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?a^1 =\sigma(z^1)" alt=""> as the output of the sigmoid function, the derivative is </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial a}{\partial z} = \sigma(z^1)(1-\sigma(z^1))= a^1(1-a^1)
" alt="">
</p>
</p>
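<p>The identity <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\sigma'(z)=\sigma(z)(1-\sigma(z))" alt=""> can be verified with a central finite difference; a small sketch (the point <code>z</code> and step <code>h</code> are arbitrary choices):</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.7    # arbitrary point
h = 1e-6   # finite-difference step

analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

# the two derivative values agree to well within the finite-difference error
assert abs(analytic - numeric) < 1e-8
```
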
<h4 class="wp-block-heading">Step 3: Gradients of output of linear function w.r.t. parameters</h4>
</p>
<p>With <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}{z}^1&amp;=&amp;w_1{x_1}^1+w_2{x_2}^1+\dots+w_n{x_n}^1+b&amp;=&amp;\mathbf{w^T}\mathbf{x}^1+b\end{array} " alt="">, the derivative is, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial z}{\partial \mathbf{w}}  = [x_1^1, x_2^1, \dots, x_n^1] = \mathbf{x^1}
" alt="">
</p>
</p>
<p>Similarly,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial z}{\partial b}  = 1
" alt="">
</p>
</p>
<h4 class="wp-block-heading">Gradients of loss w.r.t. parameters</h4>
</p>
<p>Taking the product of the gradients from all the steps,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{align*}
\frac{\partial \mathcal{L}}{\partial w} 
&#038;= \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}} \\
&#038;= -\left( \frac{y^1}{a^1} - \frac{1 - y^1}{1 - a^1} \right)  \cdot a^1(1 - a^1) \cdot \mathbf{x^1} \\
&#038;= \left( -y^1(1 - a^1) + (1 - y^1)a^1 \right) \cdot \mathbf{x^1} \\
&#038;= \left( -y^1 + y^1a^1 + a^1 - a^1y^1 \right) \cdot \mathbf{x^1} \\
&#038;= (a^1 - y^1) \cdot \mathbf{x^1} \\
\end{align*}
" 
alt="">
</p>
</p>
<p>Similarly, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{align*}
\frac{\partial \mathcal{L}}{\partial b} 
&#038;=&#038; \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial b} \\
&#038;= -\left( \frac{y^1}{a^1} - \frac{1 - y^1}{1 - a^1} \right)  \cdot a^1(1 - a^1)  \\
&#038;= \left( -y^1(1 - a^1) + (1 - y^1)a^1 \right)   \\
&#038;= \left( -y^1 + y^1a^1 + a^1 - a^1y^1 \right)  \\
&#038;= (a^1 - y^1) \\
\end{align*}
" 
alt="">
</p>
</p>
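<p>The compact results <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?(a-y)\,x" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?(a-y)" alt=""> can be checked against a numerical gradient of the loss; a minimal sketch with a single scalar feature (all values arbitrary):</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """BCE loss of a single example with one scalar feature."""
    a = sigmoid(w * x + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

w, b, x, y = 0.5, -0.3, 2.0, 1   # arbitrary parameter and data values
a = sigmoid(w * x + b)

# analytic gradients derived above
dw_analytic = (a - y) * x
db_analytic = a - y

# central finite differences
h = 1e-6
dw_numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
db_numeric = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)

assert abs(dw_analytic - dw_numeric) < 1e-6
assert abs(db_analytic - db_numeric) < 1e-6
```
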
<p>The intuition from above equations is :</p>
</p>
<p>if the <strong>estimated probability <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?{a}=\sigma(z) = \sigma(\mathbf{w^T}\mathbf{x}+b)" alt=""> is close to the true value <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt=""></strong> then the gradient is <strong>small</strong>, and the update to the parameters is also correspondingly <strong>smaller</strong>. If you recall, the gradients for linear regression <sup>(refer post on <a href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Gradients" target="_blank" rel="noreferrer noopener">Gradients for Linear Regression</a>)</sup> follow a similar intuitive explanation.</p>
</p>
<p><strong>Note : </strong>With <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> training examples the loss is averaged, and the gradients become:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial%20L}{\partial%20w_n}&#038;=&#038;\frac{1}{m}\sum_{i=1}^m\(a^i-y^i\){x_n}^i\end{array}" alt="">
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial%20L}{\partial%20b}&#038;=&#038;\frac{1}{m}\sum_{i=1}^m\(a^i-y^i\)\end{array}" alt="">
</p>
</p>
<h3 class="wp-block-heading">Vectorised operations</h3>
</p>
<p>The <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> training examples, each having <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n" alt=""> features, are represented as, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{X} = \begin{bmatrix}
x_{1}^1 &#038; x_{1}^2 &#038; \dots &#038; x_{1}^m \\
x_{2}^1 &#038; x_{2}^2 &#038; \dots &#038; x_{2}^m \\
\vdots &#038; \vdots &#038; \ddots &#038; \vdots \\
x_{n}^1 &#038; x_{n}^2 &#038; \dots &#038; x_{n}^m
\end{bmatrix}, \quad \mathbf{X} \in \mathbb{R}^{n \times m}
" alt="">
</p>
</p>
<p>The output is,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{y} = \begin{bmatrix}
y^1 &#038; y^2 &#038; \dots &#038; y^m \end{bmatrix}, \quad \mathbf{y} \in \mathbb{R}^{1 \times m}
" alt="">
</p>
</p>
<p>The parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> represented as,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}, \quad \mathbf{w} \in \mathbb{R}^{n \times 1}
" alt="">
</p>
</p>
<p><img decoding="async" style="font-size: revert; color: initial;" src="https://dsplog.com/cgi-bin/mimetex.cgi?b\in \mathbb{R}^{1 \times 1}
" alt=""></p>
</p>
<p>where, </p>
</p>
<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> is the weight vector of size <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n" alt=""> i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w} \in \mathbb{R}^n" alt=""> and</li>
</p>
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> is a scalar</li>
</ul>
</p>
<p>The output <img decoding="async" align="absmiddle" src="https://dsplog.com/cgi-bin/mimetex.cgi?a" alt=""> is,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\mathbf{a}&#038; = &#038; \sigma(\mathbf{z}) = \sigma( \mathbf{w^T}\mathbf{X} +b) &#038;=&#038;\begin{bmatrix}
a^1 &#038; a^2 &#038; \dots &#038; a^m \end{bmatrix}, \quad \mathbf{a} \in \mathbb{R}^{1 \times m}\end{array}
" alt="">
</p>
</p>
<p>Gradients, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?d\mathbf{w} = \begin{bmatrix}
\frac{\partial \mathbf{L}}{\partial w_1} \\ \frac{\partial \mathbf{L}}{\partial w_2} \\ \vdots \\ \frac{\partial \mathbf{L}}{\partial w_n} \end{bmatrix}, \quad d\mathbf{w} \in \mathbb{R}^{n \times 1}
" alt="">
</p>
</p>
<p>The gradient w.r.t. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> can be represented in matrix operations as,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}d\mathbf{w} &#038;=&#038; \frac{1}{m}\begin{bmatrix}
x_{1}^1 &#038; x_{1}^2 &#038; \dots &#038; x_{1}^m \\
x_{2}^1 &#038; x_{2}^2 &#038; \dots &#038; x_{2}^m \\
\vdots &#038; \vdots &#038; \ddots &#038; \vdots \\
x_{n}^1 &#038; x_{n}^2 &#038; \dots &#038; x_{n}^m
\end{bmatrix}\begin{bmatrix}a^1-y^1\\a^2-y^2\\\vdots\\a^m-y^m\end{bmatrix}\\
&#038;=&#038;\frac{1}{m}\mathbf{X}(\mathbf{a}-\mathbf{y})^T\end{array}
" alt="">
</p>
</p>
<p>Similarly, for the bias term</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial L}{\partial b}=d\mathbf{b} &#038;=&#038; \frac{1}{m}\underbrace{\begin{bmatrix}
1 &#038; 1 &#038; \dots &#038; 1 \\
\end{bmatrix}}_{1\times m}\begin{bmatrix}a^1-y^1\\a^2-y^2\\\vdots\\a^m-y^m\end{bmatrix}\\
&#038;=&#038;\frac{1}{m}\sum_i^m(a^i-y^i)\end{array}
" alt="">
</p>
</p>
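<p>The two matrix expressions translate directly into NumPy; a minimal sketch (shapes follow the conventions above, the data is random and illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5                        # features, examples (illustrative sizes)

X = rng.normal(size=(n, m))        # examples stacked column-wise
y = rng.integers(0, 2, size=(1, m)).astype(float)
w = rng.normal(size=(n, 1))
b = 0.1

z = w.T @ X + b                    # shape (1, m)
a = 1.0 / (1.0 + np.exp(-z))       # sigmoid, shape (1, m)

dw = X @ (a - y).T / m             # (n, 1): matches (1/m) X (a - y)^T
db = np.sum(a - y) / m             # scalar

assert dw.shape == (n, 1)
```
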
<h3 class="wp-block-heading">Gradients computed numerically vs PyTorch</h3>
</p>
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/gradients_binary_classification/gradients_binary_cross_entropy_loss.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
<h2 class="wp-block-heading">Training &#8211; Binary Classification</h2>
</p>
<p>Below is an example of training a binary classifier based on the model and gradient descent. <strong>Synthetic training</strong> data is generated from two <strong>independent Gaussian random variables </strong>with zero mean and unit variance. The mean is shifted on <strong>half the samples by (-2,-2) </strong>and the<strong> remaining half by (+2,+2) </strong>corresponding to<strong> class 0</strong> and <strong>class 1</strong> respectively.</p>
</p>
<p>The training loop is implemented both using the <strong>numerically computed gradients </strong>and using <code><strong>torch.autograd</strong></code> provided by <strong>PyTorch</strong>, and one can see that both are numerically very close.</p>
</p>
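<p>A minimal self-contained sketch of such a training loop, using the same data recipe (the learning rate and iteration count are illustrative, not taken from the notebook):</p>

```python
import numpy as np

rng = np.random.default_rng(42)
m = 200                                        # examples per class (illustrative)

# class 0 centred at (-2, -2), class 1 at (+2, +2), unit-variance Gaussians
X0 = rng.normal(size=(2, m)) + np.array([[-2.0], [-2.0]])
X1 = rng.normal(size=(2, m)) + np.array([[+2.0], [+2.0]])
X = np.hstack([X0, X1])                        # shape (2, 2m)
y = np.hstack([np.zeros((1, m)), np.ones((1, m))])

w = np.zeros((2, 1))
b = 0.0
lr = 0.1                                       # learning rate (illustrative)

for _ in range(500):
    a = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))   # forward pass: sigmoid(w^T X + b)
    dw = X @ (a - y).T / X.shape[1]            # gradients derived earlier
    db = np.sum(a - y) / X.shape[1]
    w -= lr * dw                               # gradient-descent update
    b -= lr * db

a = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
accuracy = np.mean((a > 0.5) == (y == 1))      # threshold at 0.5
```
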
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/gradients_binary_classification/training_loop_binary_classification.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
<p>The <strong>estimated probability score</strong> indicates the <strong>likelihood</strong> that the given input corresponds to one of the classes. As can be seen in the plot <strong>Predicted Probability for Each Input</strong>, inputs<strong> close to center point</strong> (0,0) have a <strong>probability close to 0.5</strong>, and as we move <strong>away from the center</strong> the probabilities tend to be <strong>closer to either 0 or 1</strong>.</p>
</p>
<p>To convert this probability into a <strong>class label</strong>, a <strong>decision threshold</strong> needs to be applied. In this example, as can be seen in the plot of <strong>Classification Error vs Threshold</strong>, the <strong>threshold of 0.5</strong> corresponds to the <strong>lowest error rate</strong>.</p>
</p>
<p>However, there are other scenarios where the <strong>threshold of 0.5 can be inappropriate</strong> &#8211; such as <strong>imbalanced</strong> datasets or <strong>skewed class </strong>distributions. These require adjusting the threshold for better performance.</p>
</p>
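<p>Selecting a threshold can be done by sweeping candidate values and measuring the error rate; a toy sketch (the scores and labels are made-up, well-separated values, so any mid-range threshold gives zero error):</p>

```python
import numpy as np

# made-up probability scores and the corresponding true labels
a = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([0,   0,   0,   0,   1,   1,   1,   1])

# error rate at each candidate threshold
thresholds = np.linspace(0.1, 0.9, 9)
errors = [np.mean((a > t).astype(int) != y) for t in thresholds]

# for these cleanly separated scores some threshold reaches zero error
best = thresholds[int(np.argmin(errors))]
assert min(errors) == 0.0
```
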
<h2 class="wp-block-heading">Summary</h2>
</p>
<p>The post covers the following key aspects</p>
</p>
<ul class="wp-block-list">
<li><strong>Loss function </strong>based on<strong> Maximum Likelihood Estimate </strong></li>
</p>
<li>Computation of the <strong>gradient</strong> based on the <strong>chain rule of derivatives</strong></li>
</p>
<li><strong>Vectorized operations</strong> implementing all computations using efficient matrix and vector math</li>
</p>
<li><strong>Training loop</strong> for the binary classification using both manual and PyTorch based gradients</li>
</ul>
</p>
<p>Have any questions or feedback on the gradient computation techniques? Feel free to drop your feedback in the comments section. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p></p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2025/05/17/gradients-for-binary-classification/">Gradients for Binary Classification with Sigmoid</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://dsplog.com/2025/05/17/gradients-for-binary-classification/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Gradients for linear regression</title>
		<link>https://dsplog.com/2025/05/01/gradients-for-linear-regression/</link>
					<comments>https://dsplog.com/2025/05/01/gradients-for-linear-regression/#comments</comments>
		
		<dc:creator><![CDATA[Krishna Sankar]]></dc:creator>
		<pubDate>Thu, 01 May 2025 06:02:13 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Gradient]]></category>
		<category><![CDATA[MAE]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[MSE]]></category>
		<category><![CDATA[PyTorch]]></category>
		<guid isPermaLink="false">https://dsplog.com/?p=2019</guid>

					<description><![CDATA[<p>Understanding gradients is essential in machine learning, as they indicate the direction and rate of change in the loss function with respect to model parameters. This post covers the gradients for the vanilla Linear Regression case taking two loss functions Mean Square Error (MSE) and Mean Absolute Error (MAE) as examples. The gradients computed analytically &#8230; <a href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/" class="more-link">Continue reading<span class="screen-reader-text"> "Gradients for linear regression"</span></a></p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/">Gradients for linear regression</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></description>
										<content:encoded><![CDATA[</p>
<p>Understanding  <strong>gradients</strong> is essential in <strong>machine learning,</strong> as they indicate the <strong>direction</strong> and <strong>rate of change</strong> in the loss function with respect to model parameters.  This post covers the gradients for the vanilla <strong>Linear Regression </strong>case taking two loss functions <strong>Mean Square Error (MSE)</strong> and <strong>Mean Absolute Error (MAE)</strong>  as examples. </p>
</p>
<p>The gradients computed <strong>analytically</strong> are compared against gradient computed using deep learning framework <strong>PyTorch</strong>. Further, using the gradients, training loop using <strong>gradient descent</strong> is implemented for the simplest example of <strong>fitting a straight line</strong>.</p>
</p>
<p>As always, contents from <a href="https://cs229.stanford.edu/main_notes.pdf" target="_blank" rel="noreferrer noopener">CS229 Lecture Notes</a> and the notations used in the course <a href="https://youtube.com/playlist?list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0&amp;si=_3Xs1piNOfQ847gd" target="_blank" rel="noopener">Deep Learning Specialization C1W1L01</a> from Dr Andrew Ng forms key references.</p>
</p>
<p><span id="more-2019"></span></p>
<p><div id="ez-toc-container" class="ez-toc-v2_0_82_2 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction">
<div class="ez-toc-title-container">
<p class="ez-toc-title" style="cursor:inherit">Table of Contents</p>
<span class="ez-toc-title-toggle"><a href="#" class="ez-toc-pull-right ez-toc-btn ez-toc-btn-xs ez-toc-btn-default ez-toc-toggle" aria-label="Toggle Table of Content"><span class="ez-toc-js-icon-con"><span class=""><span class="eztoc-hide" style="display:none;">Toggle</span><span class="ez-toc-icon-toggle-span"><svg style="fill: #999;color:#999" xmlns="http://www.w3.org/2000/svg" class="list-377408" width="20px" height="20px" viewBox="0 0 24 24" fill="none"><path d="M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z" fill="currentColor"></path></svg><svg style="fill: #999;color:#999" class="arrow-unsorted-368013" xmlns="http://www.w3.org/2000/svg" width="10px" height="10px" viewBox="0 0 24 24" version="1.2" baseProfile="tiny"><path d="M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z"/></svg></span></span></span></a></span></div>
<nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-1" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Model">Model</a></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-2" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Least_Mean_Squares">Least Mean Squares</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-3" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Gradients">Gradients</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-4" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Vectorised_operations">Vectorised operations</a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-5" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Inputs_Outputs">Inputs &amp; Outputs</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-6" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Gradients-2">Gradients</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-7" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Training">Training</a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-8" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Training_loop_%E2%80%93_using_the_derivatives">Training loop &#8211; using the derivatives</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-9" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Computing_Gradients">Computing Gradients</a><ul class='ez-toc-list-level-5' ><li class='ez-toc-heading-level-5'><a 
class="ez-toc-link ez-toc-heading-10" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Using_PyTorch">Using PyTorch</a></li><li class='ez-toc-page-1 ez-toc-heading-level-5'><a class="ez-toc-link ez-toc-heading-11" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Numerical_approximation_finite_difference_method">Numerical approximation (finite difference method)</a></li><li class='ez-toc-page-1 ez-toc-heading-level-5'><a class="ez-toc-link ez-toc-heading-12" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Example_%E2%80%93_Analytic_vs_PyTorch_vs_Numerical_Approximation">Example &#8211; Analytic vs PyTorch vs Numerical Approximation</a></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-13" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Training_Loop_%E2%80%93_using_PyTorch">Training Loop &#8211; using PyTorch</a></li></ul></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-14" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Mean_Absolute_Error">Mean Absolute Error</a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-15" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Gradient_%E2%80%93_Absolute_function">Gradient &#8211; Absolute function</a></li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class="ez-toc-link ez-toc-heading-16" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Training-2">Training</a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-17" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Deriving_the_Gradients">Deriving the Gradients</a></li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class="ez-toc-link ez-toc-heading-18" 
href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Training_Loop_%E2%80%93_using_derivatives_and_PyTorch">Training Loop &#8211; using derivatives and PyTorch</a></li></ul></li></ul></li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class="ez-toc-link ez-toc-heading-19" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/#Summary">Summary</a></li></ul></nav></div>

</p>
</p>
<h2 class="wp-block-heading">Model</h2>
</p>
<p>Let us take an example of estimating <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt=""> based on feature vector <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt=""> having <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n" alt=""> features i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x} \in \mathbb{R}^n" alt="">.</p>
</p>
<p>There are <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> examples.</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{tabular}{|c|c|c|c|c|}
\hline&amp;{example^1}&amp;{example^2}&amp;\ldots&amp;{example^m}\\
\hline{feature_1}&amp;{x_1}^{1}&amp;{x_1}^{2}&amp;\ldots&amp;{x_1}^{m}\\
\hline{feature_2}&amp;{x_2}^{1}&amp;{x_2}^{2}&amp;\ldots&amp;{x_2}^{m}\\
\hline&amp;\vdots&amp;\vdots&amp;\ldots&amp;\vdots\\
\hline{feature_n}&amp;{x_n}^{1}&amp;{x_n}^{2}&amp;\ldots&amp;{x_n}^{m}&amp;\\
\hline{output}&amp;{y}^{1}&amp;{y}^{2}&amp;\ldots&amp;{y}^{m}\end{tabular}
" alt="">
</p>
</p>
<p><strong>Assume</strong> that the estimate <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt=""> is a <strong>linear function</strong> of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{x}" alt="">. </p>
</p>
<p>For a single training example, this can be written as : </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}{z}^1&#038;=&#038;w_1{x_1}^1+w_2{x_2}^1+\dots+w_n{x_n}^1+b\\&#038;=&#038;\mathbf{w^T}\mathbf{x}^1+b\end{array}" alt="">
</p>
</p>
<p>where, </p>
</p>
<ul class="wp-block-list">
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> is the weight vector of size <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n" alt=""> i.e. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w} \in \mathbb{R}^n" alt=""> and</li>
</p>
<li><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> is a scalar</li>
</ul>
</p>
<h2 class="wp-block-heading">Least Mean Squares</h2>
</p>
<p>To find the parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt="">, based on <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> training examples, we need to formalise a metric to quantify the &#8220;<strong>closeness</strong>&#8221; of the estimate <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt=""> to the true value <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt="">. As an arbitrary choice, let us define a metric <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?L" alt=""> based on the <strong>mean square error (MSE)</strong> as,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}L&#038;=&#038;\frac{1}{m}\sum_{i=1}^{m}\({z}^i-y^i\)^2\\&#038;=&#038;\frac{1}{m}\sum_{i=1}^{m}\(\mathbf{w^T}\mathbf{x}^i+b -y^i\)^2\end{array}" alt="">
</p>
</p>
<p>Goal is to find the parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> which <strong>minimizes</strong> the metric <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?L" alt="">. This can be considered as <strong>ordinary least squares</strong> (<sup><a href="https://en.wikipedia.org/wiki/Ordinary_least_squares" target="_blank" rel="noopener">wiki entry on ordinary least squares</a></sup>) model.</p>
</p>
<p>To find the value of parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> which <strong>minimises</strong> the metric <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?L" alt="">, let us try <strong>gradient descent </strong>method where we </p>
</p>
<p>i) start with <strong>initial random</strong> values of parameters and </p>
</p>
<p>ii) <strong>repeatedly update </strong>parameters simultaneously for all values of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""></p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\textbf{till convergence:} \quad 
\begin{cases}
\mathbf{w} := \mathbf{w} - \alpha \dfrac{\partial L}{\partial \mathbf{w}} \\
b := b - \alpha \dfrac{\partial L}{\partial b}
\end{cases}
" alt="">
</p>
</p>
<p>where,</p>
</p>
<p><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\alpha" alt=""> is the learning rate, </p>
</p>
<p><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial L}{\partial \mathbf{w}}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial L}{\partial b}" alt=""> are the partial derivatives of the loss metric <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?L" alt=""> over parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> respectively.</p>
</p>
<p>The intuition is, <em>to take repeated steps in the <strong>opposite direction</strong> of the&nbsp;<a href="https://en.wikipedia.org/wiki/Gradient" target="_blank" rel="noopener">gradient</a>&nbsp;(or approximate gradient) of the function at the current point, because this is the direction of <strong>steepest descent</strong></em> <sup><a href="https://en.wikipedia.org/wiki/Gradient_descent" target="_blank" rel="noopener">Wiki Article on gradient descent</a></sup>.</p>
</p>
<h3 class="wp-block-heading">Gradients</h3>
</p>
<p>In this formulation, we need to find the derivative of a scalar, i.e. the loss <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?L" alt="">, over a vector of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n+1" alt=""> parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt="">. </p>
</p>
<p>For easier understanding, we can write the update for each parameter as below,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lcl}w_1&#038;:=&#038;w_1-\alpha\frac{\partial%20L}{\partial%20w_1}\\w_2&#038;:=&#038;w_2-\alpha\frac{\partial%20L}{\partial%20w_2}\\&#038;\vdots&#038;\\w_n&#038;:=&#038;w_n-\alpha\frac{\partial%20L}{\partial%20w_n}\\b&#038;:=&#038;b-\alpha\frac{\partial%20L}{\partial%20b}\end{array}" alt="">
</p>
</p>
<p>Further, taking only one training example, the loss is</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}L&#038;=&#038;(\mathbf{w^T}\mathbf{x^1}+b -y^1)^2\\&#038;=&#038;(w_1{x_1}^1 + w_2{x_2}^1 + \dots + w_n{x_n}^1 + b - y^1)^2\\&#038;=&#038;(z^1-y^1)^2\end{array}" alt="">
</p>
</p>
<p>Taking the derivative w.r.t. the first parameter <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?w_1" alt="">,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial%20L}{\partial%20w_1}&#038;=&#038;\frac{\partial%20}{\partial%20w_1}(w_1{x_1}^1%20+%20w_2{x_2}^1%20+%20\dots%20+%20w_n{x_n}^1%20+%20b%20-%20y^1)^2\\&#038;=&#038;2(w_1{x_1}^1%20+%20w_2{x_2}^1%20+%20\dots%20+%20w_n{x_n}^1%20+%20b%20-%20y^1)\frac{\partial}{\partial%20w_1}(w_1{x_1}^1%20+%20w_2{x_2}^1%20+%20\dots%20+%20w_n{x_n}^1%20+%20b%20-%20y^1)\\&#038;=&#038;2(w_1{x_1}^1%20+%20w_2{x_2}^1%20+%20\dots%20+%20w_n{x_n}^1%20+%20b%20-%20y^1){x_1}^1\\&#038;=&#038;2(\mathbf{w^T}\mathbf{x^1}+b-y^1){x_1}^1\\&#038;=&#038;2(z^1-{y^1}){x_1}^1\end{array}" alt="">
</p>
</p>
<p>Similarly, for the <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n^{th}" alt=""> parameter of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt="">, the gradient is</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial%20L}{\partial%20w_n}&#038;=&#038;2(\mathbf{w^T}\mathbf{x^1}+b-y^1){x_n}^1=2(z^1-y^1){x_n}^1\end{array}" alt="">
</p>
</p>
<p>For the bias parameter <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt="">, the gradient is</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial%20L}{\partial%20b}&#038;=&#038;2(\mathbf{w^T}\mathbf{x^1}+b-y^1)=2(z^1-y^1)\end{array}" alt="">
</p>
</p>
<p>The intuition from the above equations is:</p>
</p>
<p>if the <strong>estimate <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}=\mathbf{w^T}\mathbf{x}+b" alt=""> is close to the true value <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt=""></strong> then the gradient is <strong>small</strong>, and the update to the parameters is also correspondingly <strong>smaller</strong>.</p>
</p>
<p>With <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> training examples the loss is averaged, and the gradients become:</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial%20L}{\partial%20w_n}&#038;=&#038;\frac{2}{m}\sum_{i=1}^m(\mathbf{w^T}\mathbf{x^i}+b-y^i){x_n}^i=\frac{2}{m}\sum_{i=1}^m(z^i-y^i){x_n}^i\end{array}" alt="">
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial%20L}{\partial%20b}&#038;=&#038;\frac{2}{m}\sum_{i=1}^m(\mathbf{w^T}\mathbf{x^i}+b-y^i)=\frac{2}{m}\sum_{i=1}^m(z^i-y^i)\end{array}" alt="">
</p>
</p>
<h3 class="wp-block-heading">Vectorised operations</h3>
</p>
<p>Vectorised operations allow CPUs/GPUs to do SIMD (Single Instruction Multiple Data<sup><a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data" target="_blank" rel="noopener">(Refer Wiki)</a></sup>) processing, making it much faster than using for-loops.</p>
</p>
<h4 class="wp-block-heading">Inputs &amp; Outputs</h4>
</p>
<p>In the current example, the parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> are represented as, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}, \quad \mathbf{w} \in \mathbb{R}^{n \times 1}
" alt="">
</p>
</p>
<p><img decoding="async" style="font-size: revert; color: initial;" src="https://dsplog.com/cgi-bin/mimetex.cgi?b\in \mathbb{R}^{1 \times 1}
" alt=""></p>
</p>
<p>respectively. </p>
</p>
<p>The <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?m" alt=""> training examples of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n" alt=""> features are represented as </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{X} = \begin{bmatrix}
x_{1}^1 &#038; x_{1}^2 &#038; \dots &#038; x_{1}^m \\
x_{2}^1 &#038; x_{2}^2 &#038; \dots &#038; x_{2}^m \\
\vdots &#038; \vdots &#038; \ddots &#038; \vdots \\
x_{n}^1 &#038; x_{n}^2 &#038; \dots &#038; x_{n}^m
\end{bmatrix}, \quad \mathbf{X} \in \mathbb{R}^{n \times m}
" alt="">
</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{y} = \begin{bmatrix}
y^1 &#038; y^2 &#038; \dots &#038; y^m \end{bmatrix}, \quad \mathbf{y} \in \mathbb{R}^{1 \times m}
" alt="">
</p>
</p>
<p>The output is</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\mathbf{z}&#038; = &#038; \quad \mathbf{w^T}\mathbf{X} +b \\&#038;=&#038;\begin{bmatrix}
z^1 &#038; z^2 &#038; \dots &#038; z^m \end{bmatrix}, \quad \mathbf{z} \in \mathbb{R}^{1 \times m}\end{array}
" alt="">
</p>
</p>
</p>
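<p>As a quick sanity check of the shapes above, a minimal NumPy sketch (the data is random, chosen only to verify dimensions):</p>

```python
import numpy as np

n, m = 3, 5                      # n features, m training examples
rng = np.random.default_rng(0)

w = rng.standard_normal((n, 1))  # w in R^{n x 1}
b = 0.7                          # b is a scalar, broadcast over all m columns
X = rng.standard_normal((n, m))  # X in R^{n x m}, one example per column

z = w.T @ X + b                  # z = w^T X + b, in R^{1 x m}
print(z.shape)                   # (1, 5)
```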
<h4 class="wp-block-heading">Gradients</h4>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?d\mathbf{w} = \begin{bmatrix}
\frac{\partial \mathbf{L}}{\partial w_1} \\ \frac{\partial \mathbf{L}}{\partial w_2} \\ \vdots \\ \frac{\partial \mathbf{L}}{\partial w_n} \end{bmatrix}, \quad d\mathbf{w} \in \mathbb{R}^{n \times 1}
" alt="">
</p>
</p>
<p>The gradient w.r.t. <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> can be represented in matrix operations as,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}d\mathbf{w} &#038;=&#038; \frac{2}{m}\begin{bmatrix}
x_{1}^1 &#038; x_{1}^2 &#038; \dots &#038; x_{1}^m \\
x_{2}^1 &#038; x_{2}^2 &#038; \dots &#038; x_{2}^m \\
\vdots &#038; \vdots &#038; \ddots &#038; \vdots \\
x_{n}^1 &#038; x_{n}^2 &#038; \dots &#038; x_{n}^m
\end{bmatrix}\begin{bmatrix}z^1-y^1\\z^2-y^2\\\vdots\\z^m-y^m\end{bmatrix}\\
&#038;=&#038;\frac{2}{m}\mathbf{X}(\mathbf{z}-\mathbf{y})^T\end{array}
" alt="">
</p>
</p>
</p>
<p>Similarly, for the bias term</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial L}{\partial b}=d\mathbf{b} &#038;=&#038; \frac{2}{m}\underbrace{\begin{bmatrix}
1 &#038; 1 &#038; \dots &#038; 1 \\
\end{bmatrix}}_{1\times m}\begin{bmatrix}z^1-y^1\\z^2-y^2\\\vdots\\z^m-y^m\end{bmatrix}\\
&#038;=&#038;\frac{2}{m}\sum_{i=1}^m(z^i-y^i)\end{array}
" alt="">
</p>
</p>
</p>
</p>
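<p>The two matrix expressions above translate directly into NumPy. The sketch below (random data, assumed only for illustration) also cross-checks the vectorised gradient against the per-parameter summation form:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
X = rng.standard_normal((n, m))    # n features x m examples
y = rng.standard_normal((1, m))    # true values, 1 x m
w = rng.standard_normal((n, 1))
b = 0.2

z = w.T @ X + b                    # 1 x m predictions
dw = (2.0 / m) * X @ (z - y).T     # n x 1, i.e. (2/m) X (z - y)^T
db = (2.0 / m) * np.sum(z - y)     # scalar

# cross-check dw against the per-parameter summation form
dw_loop = np.zeros((n, 1))
for j in range(n):
    dw_loop[j, 0] = (2.0 / m) * np.sum((z - y) * X[j, :])

print(np.allclose(dw, dw_loop))    # True
```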
<h3 class="wp-block-heading">Training </h3>
</p>
<h4 class="wp-block-heading">Training loop &#8211; using the derivatives</h4>
</p>
<p>Below is the code for linear regression using the gradient-descent updates derived in the previous section. </p>
</p>
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/linear_regression/linear_regression.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
<h4 class="wp-block-heading">Computing Gradients </h4>
</p>
<h5 class="wp-block-heading">Using PyTorch</h5>
</p>
<p>For the simple linear regression example, it is relatively straightforward to derive the gradients and perform the training loop. When the function for estimation involves multiple stages/layers, a.k.a. <strong>deep learning</strong><sup> <a href="https://en.wikipedia.org/wiki/Deep_learning" target="_blank" rel="noopener">(refer wiki)</a></sup>, it becomes harder to derive the gradients by hand.</p>
</p>
<p>Popular deep learning frameworks like <a href="https://pytorch.org/" target="_blank" rel="noopener">PyTorch</a> provide tools for automatic differentiation (<code>torch.autograd</code><sup> <a href="https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html" target="_blank" rel="noopener">refer pytorch entry on autograd</a></sup>) to find the gradients of each parameter based on the loss function. </p>
</p>
<h5 class="wp-block-heading">Numerical approximation (finite difference method)</h5>
</p>
<p>To verify the gradients, derivatives can be computed numerically using the finite difference<sup> <a href="https://en.wikipedia.org/wiki/Finite_difference" target="_blank" rel="noopener">(refer wiki entry on finite difference)</a></sup> method, i.e. </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?f'(x) = \lim_{\varepsilon \to 0} \displaystyle \frac{f(x+\varepsilon) - f(x - \varepsilon)}{2\varepsilon}
" alt="">
</p>
</p>
<p>where, </p>
</p>
<p><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?f'(x)" alt=""> is the true derivative of function  <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?f(x)" alt=""> and </p>
</p>
<p><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\varepsilon" alt=""> is a small constant.</p>
</p>
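<p>A minimal sketch of such a finite-difference check (the quadratic function below is an arbitrary illustrative choice):</p>

```python
def finite_diff(f, x, eps=1e-6):
    # central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

f = lambda x: x ** 2          # analytic derivative: 2x
approx = finite_diff(f, 3.0)
print(approx)                 # close to 6.0
```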
<h5 class="wp-block-heading">Example &#8211; Analytic vs PyTorch vs Numerical Approximation</h5>
</p>
<p>For the toy example below, we can see that the gradients computed analytically, by PyTorch, and by numerical approximation using the finite difference method all match. </p>
</p>
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/linear_regression/gradients_analytic_vs_finite_difference_vs_pytorch.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
</p>
<h4 class="wp-block-heading">Training Loop &#8211; using PyTorch</h4>
</p>
<p>Key aspects in the code for implementing the training loop using PyTorch :</p>
</p>
<ul class="wp-block-list">
<li>the variables are defined as torch tensors <a href="https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html" target="_blank" rel="noopener"><sup>refer pytorch article on tensors</sup></a>.
<ul class="wp-block-list">
<li>Tensors are similar to NumPy ndarrays, with the ability to run on GPUs/hardware accelerators and support for automatic differentiation</li>
</ul>
</li>
</p>
<li>defining the parameters needing gradient computation.
<ul class="wp-block-list">
<li>the parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt=""> which need gradient computation are initialised with <code>requires_grad=True</code></li>
</ul>
</li>
</p>
<li>computing the gradient
<ul class="wp-block-list">
<li>the call <code>loss.backward()</code> is used to compute the gradients for the parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt="">. </li>
</p>
<li>this makes the gradient values available in <code>w.grad</code>  and <code>b.grad</code> respectively</li>
</ul>
</li>
</p>
<li>updating the parameters
<ul class="wp-block-list">
<li>as gradient tracking is unnecessary during parameter updates, they are performed within the <code>torch.no_grad()</code> context</li>
</ul>
</li>
</p>
<li>zeroing gradients between calls
<ul class="wp-block-list">
<li>PyTorch accumulates gradients by default during each backward pass i.e. each <code>loss.backward()</code> call</li>
</p>
<li>so, performing <code>w.grad.zero_()</code> and <code>b.grad.zero_()</code> is needed to clear previous gradients. </li>
</ul>
</li>
</ul>
</p>
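<p>Putting the above points together, a minimal sketch of such a PyTorch training loop on synthetic data (the data and hyperparameters below are illustrative placeholders):</p>

```python
import torch

torch.manual_seed(0)

# synthetic data: y = 2*x1 - 3*x2 + 1 (illustrative choice)
m, n = 100, 2
X = torch.randn(n, m)
w_true = torch.tensor([[2.0], [-3.0]])
y = w_true.T @ X + 1.0

# parameters needing gradient computation
w = torch.zeros(n, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr = 0.1
for _ in range(500):
    z = w.T @ X + b                  # forward pass
    loss = torch.mean((z - y) ** 2)  # MSE loss
    loss.backward()                  # fills w.grad and b.grad
    with torch.no_grad():            # no gradient tracking for the updates
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()                   # clear accumulated gradients
    b.grad.zero_()

print(w.detach().flatten(), b.item())  # approaches [2, -3] and 1
```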
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/linear_regression/linear_regression_pytorch.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
<p>As one would expect, both training loop approaches converge to similar values for the parameters  <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt="">. </p>
</p>
<h2 class="wp-block-heading">Mean Absolute Error</h2>
</p>
<p>Another popular metric to quantify the &#8220;<strong>closeness</strong>&#8221; of the estimate <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?{z}" alt=""> to the true value <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?y" alt=""> is <strong>Mean Absolute Error</strong> (MAE). In the cases where there are outliers in the data, <strong>Mean Absolute Error (MAE)</strong> is preferred over <strong>Mean Squared Error (MSE)</strong> as <strong>MAE</strong> penalizes errors linearly rather than quadratically.</p>
<p>Formally,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
\begin{array}{lll}
L_{\text{mae}} &#038;=&#038; \frac{1}{m} \sum_{i=1}^{m} \left| z^i - y^i \right| \\
               &#038;=&#038; \frac{1}{m} \sum_{i=1}^{m} \left| \mathbf{w}^T \mathbf{x}^i + b - y^i \right|
\end{array}
" alt="">
</p>
</p>
<p>For computing the gradient of the <strong>Mean Absolute Error</strong> loss, we need the derivative of the <strong>absolute value</strong> function.</p>
</p>
<h3 class="wp-block-heading">Gradient &#8211; Absolute function</h3>
</p>
<p>The absolute value function is defined as, </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?
|x|%20=%20\begin{cases}%20x,%20&#038;%20\text{if%20}%20x%20\geq%200%20\\%20-x,%20&#038;%20\text{if%20}%20x%20%3C%200%20\end{cases}
" alt="">
</p>
</p>
<p>The derivative is </p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{d}{dx}|x|=\begin{cases}
+1&#038;\text{if%20}x\gt0\\
-1&#038;\text{if%20}x\lt0\\
\text{undefined}&#038;\text{if%20}x=0
\end{cases}
" alt="">
</p>
</p>
<p>This can be compactly written as,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{d}{dx}|x| = \mathrm{sign}(x), \quad \text{for } x \ne 0
" alt="">
</p>
</p>
<p>The absolute function is non-differentiable at <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x=0" alt="">, where the function has a <strong>sharp corner</strong>.</p>
</p>
<p>The concept of a<em> <strong>subderivative</strong> (or <strong>subgradient</strong>) </em>generalises the <em>derivative to convex functions that are not everywhere differentiable</em> <sup><a href="https://en.wikipedia.org/wiki/Subderivative" target="_blank" rel="noopener">(refer wiki entry on Subderivative)</a> </sup>. With this definition, the subderivative at <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x=0" alt=""> lies in the interval <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?[ -1, 1] " alt="">.</p>
</p>
<p>Using the concept of the <strong>Symmetric derivative</strong> <a href="https://en.wikipedia.org/wiki/Symmetric_derivative#The_absolute_value_function" target="_blank" rel="noopener"><sup>(refer wiki entry on symmetric derivative)</sup></a>, the subderivative at <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?x=0" alt=""> can be chosen as <strong>0</strong>.</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llll}f_s(0) &#038;= \lim_{h \to 0} \frac{f(0 + h) - f(0 - h)}{2h} = \lim_{h \to 0} \frac{f(h) - f(-h)}{2h} \\
&#038;= \lim_{h \to 0} \frac{|h| - |-h|}{2h} \\
&#038;= \lim_{h \to 0} \frac{|h| - |h|}{2h} \\
&#038;= \lim_{h \to 0} \frac{0}{2h} = 0.
\end{array}
" alt="">
</p>
</p>
<p>In practice, deep learning frameworks (like PyTorch, TensorFlow) and numerical libraries like NumPy define <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\text{sign}(0) = 0" alt="">. This is a valid <strong>subgradient</strong>, and it works fine in optimization.</p>
</p>
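<p>This convention can be verified directly: PyTorch's <code>sign</code> and the autograd gradient of the absolute value both return 0 at <code>x = 0</code>, matching the symmetric-derivative choice above.</p>

```python
import torch

# gradient of |x| at x = 0, as computed by autograd
x = torch.tensor(0.0, requires_grad=True)
torch.abs(x).backward()

print(torch.sign(torch.tensor(0.0)).item())  # 0.0
print(x.grad.item())                         # 0.0
```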
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/linear_regression/gradients_absolute_function.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
<h3 class="wp-block-heading">Training  </h3>
</p>
<h4 class="wp-block-heading">Deriving the Gradients </h4>
</p>
<p>For a single training example, for the <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?n^{th}" alt=""> parameter of <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt="">, the gradient is</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial%20L}{\partial%20w_n}&#038;=&#038;\frac{\partial%20}{\partial%20w_n}|w_1{x_1}^1%20+%20w_2{x_2}^1%20+%20\dots%20+%20w_n{x_n}^1%20+%20b%20-%20y^1|\\&#038;=&#038;\mathrm{sign}(w_1{x_1}^1%20+%20w_2{x_2}^1%20+%20\dots%20+%20w_n{x_n}^1%20+%20b%20-%20y^1)\frac{\partial}{\partial%20w_n}(w_1{x_1}^1%20+%20w_2{x_2}^1%20+%20\dots%20+%20w_n{x_n}^1%20+%20b%20-%20y^1)\\&#038;=&#038;\mathrm{sign}(w_1{x_1}^1%20+%20w_2{x_2}^1%20+%20\dots%20+%20w_n{x_n}^1%20+%20b%20-%20y^1){x_n}^1\\&#038;=&#038;\mathrm{sign}(\mathbf{w^T}\mathbf{x^1}+b-y^1){x_n}^1\\&#038;=&#038;\mathrm{sign}(z^1-{y^1}){x_n}^1\end{array}" alt="">
</p>
</p>
<p>Similarly for the bias term,</p>
</p>
<p>
<img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}\frac{\partial%20L}{\partial%20b}&#038;=&#038;\mathrm{sign}(\mathbf{w^T}\mathbf{x^1}+b-y^1)=\mathrm{sign}(z^1-y^1)\end{array}" alt="">
</p>
</p>
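<p>Averaged over <em>m</em> training examples and vectorised in the same way as the MSE case, these MAE gradients can be sketched in NumPy (random data, assumed only for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 4
X = rng.standard_normal((n, m))    # n features x m examples
y = rng.standard_normal((1, m))
w = rng.standard_normal((n, 1))
b = 0.0

z = w.T @ X + b                    # 1 x m predictions
s = np.sign(z - y)                 # sign of the residuals, 1 x m
dw = (1.0 / m) * X @ s.T           # n x 1 MAE gradient w.r.t. w
db = (1.0 / m) * np.sum(s)         # MAE gradient w.r.t. b
print(dw.shape, db)
```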
<h4 class="wp-block-heading">Training Loop &#8211; using derivatives and PyTorch</h4>
</p>
<p>For the same example, below is the code for training the linear regression using <strong>Mean Absolute Error </strong>as the <strong>loss function</strong>.</p>
</p>
<p>
<iframe loading="lazy" src="https://nbviewer.org/github/dsplog/dsplog.com/blob/main/code/linear_regression/linear_regression_mean_abs_error.ipynb?flush_cache=false" width="100%" height="600"></iframe>
</p>
</p>
<p>We can see that the training loops for <strong>Mean Absolute Error (MAE)</strong> using both <strong>PyTorch</strong> and the analytic gradients converge to the same parameters <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\mathbf{w}" alt=""> and <img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?b" alt="">. </p>
</p>
<h2 class="wp-block-heading">Summary</h2>
</p>
<p>The post covers the following key aspects:</p>
</p>
<ul class="wp-block-list">
<li><strong>Gradient Basics:</strong> Deriving the gradients for the Mean Squared Error and Mean Absolute Error loss functions</li>
</p>
<li><strong>Efficient Computation:</strong> Use of vectorized operations and PyTorch autograd</li>
</p>
<li><strong>Gradient Computation:</strong> Analytical, numerical (finite difference), and PyTorch comparison</li>
</p>
<li><strong>Training Loops:</strong> Implementing updates using both manual and PyTorch-based gradients</li>
</ul>
</p>
<p>Have any questions or feedback on the gradient computation techniques? Feel free to drop your feedback in the comments section. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /> </p></p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2025/05/01/gradients-for-linear-regression/">Gradients for linear regression</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://dsplog.com/2025/05/01/gradients-for-linear-regression/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Migrated to Amazon EC2 instance (from shared hosting)</title>
		<link>https://dsplog.com/2013/03/11/migrated-to-amazon-ec2-from-shared-hosting/</link>
					<comments>https://dsplog.com/2013/03/11/migrated-to-amazon-ec2-from-shared-hosting/#respond</comments>
		
		<dc:creator><![CDATA[Krishna Sankar]]></dc:creator>
		<pubDate>Mon, 11 Mar 2013 01:20:38 +0000</pubDate>
				<category><![CDATA[News]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[ec2]]></category>
		<guid isPermaLink="false">http://www.dsplog.com/?p=1961</guid>

					<description><![CDATA[<p>Being not too happy with the speed of the shared hosting, decided to move the blog to an Amazon Elastic Compute Cloud (Amazon EC2) instance.  Given this is a baby step, picked up a micro instance running an Ubuntu server and installed Apache web server, MySQL, PHP . After doing a bit of tweaking with this new &#8230; <a href="https://dsplog.com/2013/03/11/migrated-to-amazon-ec2-from-shared-hosting/" class="more-link">Continue reading<span class="screen-reader-text"> "Migrated to Amazon EC2 instance (from shared hosting)"</span></a></p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2013/03/11/migrated-to-amazon-ec2-from-shared-hosting/">Migrated to Amazon EC2 instance (from shared hosting)</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Being not too happy with the speed of the shared hosting, decided to move the blog to an <a title="Amazon Elastic Compute Cloud" href="http://aws.amazon.com/ec2/" target="_blank" rel="noopener">Amazon Elastic Compute Cloud (Amazon EC2)</a> instance.  Given this is a baby step, picked up a micro instance running an Ubuntu server and installed <a title="Apache web server" href="http://en.wikipedia.org/wiki/Apache_HTTP_Server" target="_blank" rel="noopener">Apache web server</a>, <a title="wiki entry on MySQL" href="http://en.wikipedia.org/wiki/MySQL" target="_blank" rel="noopener">MySQL</a>, <a title="wiki entry on PHP" href="http://en.wikipedia.org/wiki/PHP" target="_blank" rel="noopener">PHP</a> . After doing a bit of tweaking with this new instance, imported the SQL database and other files from the shared hosting and pointed the A name record to the new IP address. This switch happened over this weekend.</p>
<p>One particular issue which I faced was frequent crashing of MySQL due to memory limitations. Followed few online instructions to improve the situation and the current configuration seems to be holding up (but this is a cause of worry &#8211; need to figure the right solution).</p>
<p>Anyhow, hope you like the decreased page load time! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>
<p><strong>Some helpful links from the web:</strong></p>
<p>a) <a title="How to install WordPress on Amazon EC2" href="http://iampuneet.com/wordpress-amazon-ec2/" target="_blank" rel="noopener">How to install WordPress on Amazon EC2</a></p>
<p>b) <a title="Move WordPress site from shared hosting to Amazon EC2" href="http://blog.lopau.com/move-wordpress-site-from-shared-hosting-to-amazon-ec2/" target="_blank" rel="noopener">Move WordPress site from shared hosting to Amazon EC2</a></p>
<p>c) <a title="DIY: Enable CGI on your Apache server" href="http://www.techrepublic.com/blog/doityourself-it-guy/diy-enable-cgi-on-your-apache-server/1066" target="_blank" rel="noopener">DIY: Enable CGI on your Apache server</a></p>
<p>d) <a title="Import MySQL Dumpfile, SQL Datafile Into My Database" href="http://www.cyberciti.biz/faq/import-mysql-dumpfile-sql-datafile-into-my-database/" target="_blank" rel="noopener">Import MySQL Dumpfile, SQL Datafile Into My Database</a></p>
<p>e) <a title="Making WordPress Stable on EC2-Micro" href="http://www.frameloss.org/2011/11/04/making-wordpress-stable-on-ec2-micro/" target="_blank" rel="noopener">Making WordPress Stable on EC2-Micro</a></p>
<p>f) <a title="how to enable mod_rewrite in apache2.2 (debian/ubuntu)" href="http://www.lavluda.com/2007/07/15/how-to-enable-mod_rewrite-in-apache22-debian/" target="_blank">how to enable mod_rewrite in apache2.2 (debian/ubuntu)</a></p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2013/03/11/migrated-to-amazon-ec2-from-shared-hosting/">Migrated to Amazon EC2 instance (from shared hosting)</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://dsplog.com/2013/03/11/migrated-to-amazon-ec2-from-shared-hosting/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>GATE-2012 ECE Q28 (electromagnetics)</title>
		<link>https://dsplog.com/2013/02/20/gate-2012-ece-q28-electromagnetics/</link>
					<comments>https://dsplog.com/2013/02/20/gate-2012-ece-q28-electromagnetics/#respond</comments>
		
		<dc:creator><![CDATA[Krishna Sankar]]></dc:creator>
		<pubDate>Wed, 20 Feb 2013 01:50:01 +0000</pubDate>
				<category><![CDATA[GATE]]></category>
		<category><![CDATA[2012]]></category>
		<category><![CDATA[ECE]]></category>
		<category><![CDATA[electromagnetics]]></category>
		<guid isPermaLink="false">http://www.dsplog.com/?p=1933</guid>

					<description><![CDATA[<p>Question 28 on electromagnetics from GATE (Graduate Aptitude Test in Engineering) 2012 Electronics and Communication Engineering paper. Q28.&#160;A transmission line with a characteristic impedance of 100&#160;is used to match a&#160;50&#160;section to a&#160;200&#160;section. If the matching is to be done both at 429MHz and 1GHz, the length of the transmission line can be approximately (A) 82.5cm &#8230; <a href="https://dsplog.com/2013/02/20/gate-2012-ece-q28-electromagnetics/" class="more-link">Continue reading<span class="screen-reader-text"> "GATE-2012 ECE Q28 (electromagnetics)"</span></a></p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2013/02/20/gate-2012-ece-q28-electromagnetics/">GATE-2012 ECE Q28 (electromagnetics)</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></description>
										<content:encoded><![CDATA[</p>
<p>Question 28 on electromagnetics from GATE (Graduate Aptitude Test in Engineering) 2012 Electronics and Communication Engineering paper.</p>
</p>
<h2 class="wp-block-heading">Q28.&nbsp;A transmission line with a characteristic impedance of 100<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Large{\Omega}" align="bottom" border="0">&nbsp;is used to match a&nbsp;50<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Large{\Omega}" align="bottom" border="0">&nbsp;section to a&nbsp;200<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Large{\Omega}" align="bottom" border="0">&nbsp;section. If the matching is to be done both at 429MHz and 1GHz, the length of the transmission line can be approximately</h2>
</p>
<h2 class="wp-block-heading">(A) 82.5cm</h2>
</p>
<h2 class="wp-block-heading">(B) 1.05m</h2>
</p>
<h2 class="wp-block-heading">(C) 1.58m</h2>
</p>
<h2 class="wp-block-heading">(D) 1.75m</h2>
</p>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/db-install/wp-includes/js/tinymce/plugins/wordpress/img/trans.gif" alt="" title="More..."/></figure>
</p>
</p>
<p><span id="more-1933"></span></p>
</p>
<h2 class="wp-block-heading">Solution</h2>
</p>
<p>To answer this question, let us first understand the propagation in a transmission line, &nbsp;termination and the concept of impedance matching. The <strong>section 2.1 in Microwave Engineering, David M Pozar (<a href="http://www.amazon.com/gp/product/0470631554/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0470631554&amp;linkCode=as2&amp;tag=dl04-20" target="_blank" rel="noopener">buy from Amazon.com</a><img loading="lazy" decoding="async" style="border: none !important; margin: 0px !important;" alt="" src="http://www.assoc-amazon.com/e/ir?t=dl04-20&amp;l=as2&amp;o=1&amp;a=0470631554" width="1" height="1" border="0">, <a href="http://www.flipkart.com/microwave-engineering-3rd/p/itmdytmvtyj3arjj?pid=9788126510498&amp;affid=krishnadsp" target="_blank" rel="noopener">Buy from Flipkart.com</a>)&nbsp;</strong>&nbsp;is used as reference.</p>
</p>
<p>Consider a transmission line of very small length&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Delta z" align="bottom" border="0">&nbsp;having the parameters as shown in the figure below.</p>
</p>
<figure class="wp-block-image"><img loading="lazy" decoding="async" width="378" height="150" src="https://dsplog.com/db-install/wp-content/uploads/2013/02/transmission_line_model.png" alt="transmission_line_model" class="wp-image-1943" srcset="https://dsplog.com/db-install/wp-content/uploads/2013/02/transmission_line_model.png 378w, https://dsplog.com/db-install/wp-content/uploads/2013/02/transmission_line_model-300x119.png 300w, https://dsplog.com/db-install/wp-content/uploads/2013/02/transmission_line_model-375x150.png 375w" sizes="auto, (max-width: 378px) 85vw, 378px" /></figure>
</p>
</p>
<p><strong>Figure : Transmission line model&nbsp;<strong>&nbsp;(Reference Figure 2.1 in&nbsp;<strong>Microwave Engineering, David M Pozar (<a href="http://www.amazon.com/gp/product/0470631554/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0470631554&amp;linkCode=as2&amp;tag=dl04-20" target="_blank" rel="noopener">buy from Amazon.com</a><img loading="lazy" decoding="async" alt="" src="http://www.assoc-amazon.com/e/ir?t=dl04-20&amp;l=as2&amp;o=1&amp;a=0470631554" width="1" height="1" border="0">,&nbsp;<a href="http://www.flipkart.com/microwave-engineering-3rd/p/itmdytmvtyj3arjj?pid=9788126510498&amp;affid=krishnadsp" target="_blank" rel="noopener">Buy from Flipkart.com</a>)</strong></strong></strong></p>
</p>
<p>&nbsp;</p>
</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?R" align="bottom" border="0"> is the resistance per unit length <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Omega/m" align="absmiddle" border="0">,</p>
</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?L" align="bottom" border="0">&nbsp;is the inductance per unit length&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?H/m" align="absmiddle" border="0">,</p>
</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?G" align="bottom" border="0">&nbsp;is the conductance per unit length&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?S/m" align="absmiddle" border="0">,</p>
</p>
<p><strong><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?C" align="bottom" border="0"></strong>&nbsp;is the capacitance per unit length<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?F/m" align="absmiddle" border="0">.</p>
</p>
<p>Applying Kirchhoff&#8217;s voltage law,</p>
</p>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?v(z,t)-R\Delta%20zi(z,t)-L\Delta%20z\frac{\partial%20i(z,t)}{\partial%20t}-v(z+\Delta%20z%20,t)=0" alt=""/></figure>
</p>
</p>
<p>Applying Kirchhoff&#8217;s current law,</p>
</p>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?i(z,t)-G\Delta%20zv(z+\Delta%20z,t)-C\Delta%20z\frac{\partial%20v(z+\Delta%20z,t)}{\partial%20t}-i(z+\Delta%20z%20,t)=0" alt=""/></figure>
</p>
</p>
<p>Dividing the above equations by&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Delta z" align="bottom" border="0">&nbsp;and taking the limit&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Delta z \rightarrow 0" align="absmiddle" border="0">,</p>
</p>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial%20v(z,t)}{\partial%20z}=-Ri(z,t)%20-%20L\frac{\partial%20i(z,t)}{\partial%20t}" alt=""/></figure>
</p>
</p>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\partial%20i(z,t)}{\partial%20z}=-Gv(z,t)%20-%20C\frac{\partial%20v(z,t)}{\partial%20t}" alt=""/></figure>
</p>
</p>
<p>If we assume that the inputs are sinusoidal, the above equations can be re-written as</p>
</p>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{dV(z)}{dz}=-(R+jwL)I(z)" alt=""/></figure>
</p>
</p>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{dI(z)}{dz}=-(G+jwC)V(z)" alt=""/></figure>
</p>
</p>
<p>Substituting one equation into the other,</p>
</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{d^2V(z)}{dz^2}-\gamma^2V(z)=0" align="absmiddle" border="0">,</p>
</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{d^2I(z)}{dz}-\gamma^2I(z)=0" align="absmiddle" border="0">,</p>
</p>
<p>where</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\gamma=\alpha+j\beta%20=%20\sqrt%7B(R+jwL)(G+jwC)%7D" align="absmiddle" border="0">.</p>
<p>The solutions to the above equations are,</p>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?V(z)%20=%20V_0^+e^{-\gamma%20z}+V_0^-e^{\gamma%20z}" alt=""/></figure>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?I(z)%20=%20I_0^+e^{-\gamma%20z}+I_0^-e^{\gamma%20z}" align="absmiddle" border="0">.</p>
<p>The current on the line can alternatively be expressed as,</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?I(z)%20=\frac{\gamma}{R+jwL}\(V_0^+e^{-\gamma%20z}-V_0^-e^{\gamma%20z}\)=\frac{1}{Z_0}\(V_0^+e^{-\gamma%20z}-V_0^-e^{\gamma%20z}\)" align="absmiddle" border="0">,</p>
<p>where the characteristic impedance of the line is defined as,</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_0=\frac{R+jwL}{\gamma}=\sqrt{\frac{R+jwL}{G+jwC}}" align="absmiddle" border="0">.</p>
<p>The wavelength on the line is,</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\lambda=\frac{2\pi}{\beta}" align="absmiddle" border="0">.</p>
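<p>As a quick numeric sketch of the formulas above (the per-unit-length line parameters below are assumed, illustrative values, not from the original problem), the propagation constant, characteristic impedance and wavelength follow directly from R, L, G, C:</p>

```python
import cmath
import math

# Assumed (illustrative) per-unit-length parameters of a low-loss 50-ohm line
R = 0.1      # series resistance, ohm/m
L = 250e-9   # series inductance, H/m
G = 1e-6     # shunt conductance, S/m
C = 100e-12  # shunt capacitance, F/m
f = 1e9      # operating frequency, Hz
w = 2 * math.pi * f

# gamma = alpha + j*beta = sqrt((R + jwL)(G + jwC))
gamma = cmath.sqrt((R + 1j * w * L) * (G + 1j * w * C))
alpha, beta = gamma.real, gamma.imag

# Z0 = sqrt((R + jwL) / (G + jwC)); for a low-loss line this is close to sqrt(L/C)
Z0 = cmath.sqrt((R + 1j * w * L) / (G + 1j * w * C))

# wavelength on the line: lambda = 2*pi / beta
lam = 2 * math.pi / beta
```

<p>For these values Z0 comes out close to sqrt(L/C) = 50 ohm and the wavelength close to v_p/f = 0.2 m, as expected for a nearly lossless line.</p>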
<p><strong>Lossless transmission line case</strong></p>
<p>For a lossless transmission line, we can set&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?R=G=0" align="bottom" border="0">.</p>
<p>Then the propagation constant reduces to&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}\alpha=0,&amp;\beta = \omega\sqrt{LC}\end{array}" align="absmiddle" border="0">,&nbsp;the characteristic impedance is <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_0=\sqrt{\frac{L}{C}}" align="absmiddle" border="0">&nbsp;and the voltage and current on the line can be written as,</p>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?V(z)%20=%20V_0^+e^{-j\beta%20z}+V_0^-e^{j\beta%20z}" alt=""/></figure>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?I(z)%20=\frac{1}{Z_0}\(V_0^+e^{-j\beta%20z}-V_0^-e^{j\beta%20z}\)" alt=""/></figure>
<h2 class="wp-block-heading">Terminated lossless transmission line</h2>
<p>Consider a transmission line terminated with load impedance&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_L" align="absmiddle" border="0">&nbsp;as shown in figure below.</p>
<figure class="wp-block-image"><img loading="lazy" decoding="async" width="367" height="235" src="https://dsplog.com/db-install/wp-content/uploads/2013/02/transmission_line_with_load_impedance.png" alt="transmission_line_with_load_impedance" class="wp-image-1944" srcset="https://dsplog.com/db-install/wp-content/uploads/2013/02/transmission_line_with_load_impedance.png 367w, https://dsplog.com/db-install/wp-content/uploads/2013/02/transmission_line_with_load_impedance-300x192.png 300w" sizes="auto, (max-width: 367px) 85vw, 367px" /></figure>
<p><strong>Figure: Transmission line with load impedance (Reference Figure 2.4 in&nbsp;<strong>Microwave Engineering, David M Pozar (<a href="http://www.amazon.com/gp/product/0470631554/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0470631554&amp;linkCode=as2&amp;tag=dl04-20" target="_blank" rel="noopener">buy from Amazon.com</a><img loading="lazy" decoding="async" alt="" src="http://www.assoc-amazon.com/e/ir?t=dl04-20&amp;l=as2&amp;o=1&amp;a=0470631554" width="1" height="1" border="0">,&nbsp;<a href="http://www.flipkart.com/microwave-engineering-3rd/p/itmdytmvtyj3arjj?pid=9788126510498&amp;affid=krishnadsp" target="_blank" rel="noopener">Buy from Flipkart.com</a>)</strong></strong></p>
<p>At the load <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?z=0" align="absmiddle" border="0">, the ratio of the total voltage to the total current equals the load impedance&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_L" align="absmiddle" border="0">,</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_L=\frac{V(0)}{I(0)}=\frac{V_0^{+}+V_0^{-}}{V_0^{+}-V_0^-}Z_0" align="absmiddle" border="0">.</p>
<p>Alternatively,</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?V_0^{-}=\frac{Z_L-Z_0}{Z_L+Z_0}V_0^{+}" align="bottom" border="0">.</p>
<p>The reflection coefficient&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Gamma" align="absmiddle" border="0">&nbsp;is defined as the ratio of the amplitude of the reflected voltage wave to that of the incident voltage wave,</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Gamma=\frac{V_0^{-}}{V_0^{+}}=\frac{Z_L-Z_0}{Z_L+Z_0}" align="absmiddle" border="0">.</p>
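<p>A minimal sketch of this definition (the function name below is mine, chosen for illustration):</p>

```python
def reflection_coefficient(ZL, Z0):
    """Gamma = (ZL - Z0) / (ZL + Z0): ratio of reflected to incident voltage at the load."""
    return (ZL - Z0) / (ZL + Z0)

# Matched load: no reflection
print(reflection_coefficient(50, 50))   # 0.0
# Short circuit: full reflection with sign inversion
print(reflection_coefficient(0, 50))    # -1.0
```

<p>A 100-ohm load on a 50-ohm line gives Gamma = 1/3, i.e. one third of the incident voltage amplitude is reflected.</p>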
<p><strong>For no reflection to happen, i.e.&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Gamma=0" align="absmiddle" border="0">, the load impedance <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_L" align="absmiddle" border="0"> should be equal to the characteristic impedance <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_0" align="absmiddle" border="0"> of the transmission line. The above equation gives the impedance seen at the load&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?z=0" align="absmiddle" border="0">.</strong></p>
<p>The voltage and current on the line can be represented using&nbsp;<strong><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Gamma" align="absmiddle" border="0">&nbsp;</strong>as,</p>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?V(z)%20=%20V_0^+\(e^{-j\beta%20z}+\Gamma e^{j\beta%20z}\)" alt=""/></figure>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?I(z)%20=\frac{V_0^+}{Z_0}\(e^{-j\beta%20z}-\Gamma e^{j\beta%20z}\)" alt=""/></figure>
<p>Looking toward the load from a point&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?z=-l" align="absmiddle" border="0"> on the line, the input impedance seen is,</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llllll}Z_{in}&amp;=&amp;\frac{V(-l)}{I(-l)}&amp;=&amp;\frac{V_0^{+}\(e^{j\beta%20l}%20+%20\Gamma%20e^{-j\beta%20l}\)}{\frac{V_0^{+}}{Z_0}\(e^{j\beta%20l}%20-%20\Gamma%20e^{-j\beta%20l}\)}\end{array}" align="absmiddle" border="0">.</p>
<p>Substituting for&nbsp;<strong><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Gamma" align="absmiddle" border="0">,&nbsp;</strong></p>
<p><strong></strong><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llllll}Z_{in}&amp;=&amp;Z_0\frac{\(Z_L+Z_0\)e^{j\beta%20l}%20+(Z_L-Z_0)%20%20e^{-j\beta%20l}}{\(Z_L+Z_0\)e^{j\beta%20l}%20-\(Z_L-Z_0\)%20e^{-j\beta%20l}}\\&amp;=&amp;Z_0\frac{Z_L+jZ_0\tan%20\beta%20l}{Z_0+jZ_L\tan%20\beta%20l}\end{array}" align="absmiddle" border="0">.</p>
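<p>The two forms of the input impedance above (the travelling-wave form and the tangent form) agree numerically; a small sketch, with helper names of my own choosing:</p>

```python
import cmath
import math

def z_in_tan(ZL, Z0, beta_l):
    # Zin = Z0 * (ZL + j Z0 tan(beta*l)) / (Z0 + j ZL tan(beta*l))
    t = math.tan(beta_l)
    return Z0 * (ZL + 1j * Z0 * t) / (Z0 + 1j * ZL * t)

def z_in_wave(ZL, Z0, beta_l):
    # Zin = Z0 * (e^{j beta l} + Gamma e^{-j beta l}) / (e^{j beta l} - Gamma e^{-j beta l})
    gamma_L = (ZL - Z0) / (ZL + Z0)
    e_plus = cmath.exp(1j * beta_l)
    e_minus = cmath.exp(-1j * beta_l)
    return Z0 * (e_plus + gamma_L * e_minus) / (e_plus - gamma_L * e_minus)
```

<p>For a very short line (beta*l close to zero) both forms reduce to the load impedance itself, as expected.</p>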
<p><strong>Special case when <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?{l =\lambda/4}" align="absmiddle" border="0">&nbsp;(and it&#8217;s odd multiples)</strong></p>
<p>For the case when&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?{l =\(2n+1\)\frac{\lambda}{4}}" align="absmiddle" border="0"> the input impedance seen is,</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{llll}Z_{in}&amp;=&amp;Z_0\frac{Z_L+jZ_0\tan%20\beta%20l}{Z_0+jZ_L\tan%20\beta%20l}\\&amp;=&amp;Z_0\frac{Z_L+jZ_0\tan\(\frac{2\pi}{\lambda}\frac{(2n+1)\lambda}{4}\)}{Z_0+jZ_L\tan\(\frac{2\pi}{\lambda}\frac{(2n+1)\lambda}{4}\)}\\&amp;=&amp;Z_0\frac{Z_0}{Z_L}=\frac{Z_0^2}{Z_L}\end{array}" align="absmiddle" border="0">.</p>
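<p>Numerically, the quarter-wave special case can be checked directly (the values Z0 = 50 ohm and ZL = 200 ohm below are assumed for illustration):</p>

```python
import math

# Assumed example: a lambda/4 line of characteristic impedance Z0 terminated in ZL
Z0, ZL = 50.0, 200.0
beta_l = math.pi / 2                 # beta*l for l = lambda/4
t = math.tan(beta_l)                 # astronomically large: tan -> infinity at pi/2
Zin = Z0 * (ZL + 1j * Z0 * t) / (Z0 + 1j * ZL * t)
# In the limit, Zin -> Z0**2 / ZL = 12.5 ohm
```

<p>The 200-ohm load is seen as 12.5 ohm through the quarter-wave line, illustrating the impedance-inverting behaviour used for matching below.</p>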
<p>This result can be used for impedance matching.</p>
<h2 class="wp-block-heading">Quarter wave transformer</h2>
<p>Consider a circuit with load&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?R_L" align="absmiddle" border="0"> and a line with characteristic impedance <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_0" align="absmiddle" border="0"> connected by a transmission line of characteristic impedance <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_1" align="absmiddle" border="0"> with length <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\lambda/4" align="absmiddle" border="0">.</p>
<figure class="wp-block-image"><img loading="lazy" decoding="async" width="352" height="266" src="https://dsplog.com/db-install/wp-content/uploads/2013/02/quarter_wave_matching_transformer.png" alt="quarter_wave_matching_transformer" class="wp-image-1946" srcset="https://dsplog.com/db-install/wp-content/uploads/2013/02/quarter_wave_matching_transformer.png 352w, https://dsplog.com/db-install/wp-content/uploads/2013/02/quarter_wave_matching_transformer-300x226.png 300w" sizes="auto, (max-width: 352px) 85vw, 352px" /></figure>
<p><strong>Figure: Quarter Wave Matching transformer (Reference Figure 2.16 in&nbsp;<strong>Microwave Engineering, David M Pozar (<a href="http://www.amazon.com/gp/product/0470631554/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0470631554&amp;linkCode=as2&amp;tag=dl04-20" target="_blank" rel="noopener">buy from Amazon.com</a><img loading="lazy" decoding="async" alt="" src="http://www.assoc-amazon.com/e/ir?t=dl04-20&amp;l=as2&amp;o=1&amp;a=0470631554" width="1" height="1" border="0">,&nbsp;<a href="http://www.flipkart.com/microwave-engineering-3rd/p/itmdytmvtyj3arjj?pid=9788126510498&amp;affid=krishnadsp" target="_blank" rel="noopener">Buy from Flipkart.com</a>)</strong></strong></p>
<p>The input impedance seen is,</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\begin{array}{lll}Z_{in}&amp;=&amp;Z_1\frac{R_L+jZ_1\tan%20\beta%20l}{Z_1+jR_L\tan%20\beta%20l}\\&amp;=&amp;Z_1\frac{Z_1}{R_L}=\frac{Z_1^2}{R_L}\end{array}" align="absmiddle" border="0">, since&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta%20l=\frac{2\pi}{\lambda}\frac{\lambda}{4}=\frac{\pi}{2}" align="absmiddle" border="0">&nbsp;and&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\tan\frac{\pi}{2}\rightarrow\infty" align="absmiddle" border="0">.</p>
<p>So if we choose&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_1 = \sqrt{Z_0R_L}" align="absmiddle" border="0">, then the input impedance seen is&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_{in} = Z_0" align="absmiddle" border="0"> which is the condition required for having no reflection i.e. <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Gamma=0" align="absmiddle" border="0">.</p>
<p>One important aspect to note here is that&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Gamma=0" align="absmiddle" border="0">&nbsp;is not guaranteed for all frequencies, but only for certain frequencies. The frequency dependence can be found by determining the frequencies for which&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta l=\(2n+1\)\frac{\pi}{2}" align="absmiddle" border="0">.</p>
<p>Replacing&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi? l=\frac{\lambda_0}{4}" align="absmiddle" border="0"> where <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\lambda_0" align="absmiddle" border="0"> is the wavelength corresponding to frequency <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?f_0" align="absmiddle" border="0">,</p>
<p><img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta l = \(\frac{2\pi}{\lambda}\)\(\frac{\lambda_0}{4}\)=\(\frac{2\pi f}{v_p}\)\(\frac{v_p}{4f_0}\)=\frac{\pi f}{2 f_0}" align="absmiddle" border="0">.</p>
<p>It can be seen that&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\beta l=\(2n+1\)\frac{\pi}{2}" align="absmiddle" border="0">&nbsp;only for <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?f=\(2n+1\)f_0" align="absmiddle" border="0">, so the reflection coefficient <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\Gamma=0" align="absmiddle" border="0"> only at those frequencies.</p>
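<p>This frequency dependence can be sketched numerically for a quarter-wave transformer (the values Z0 = 50 ohm and RL = 200 ohm are assumed for illustration; the function name is mine):</p>

```python
import math

Z0, RL = 50.0, 200.0
Z1 = math.sqrt(Z0 * RL)   # quarter-wave section impedance, 100 ohm here

def gamma_in(f_over_f0):
    """Input reflection coefficient of the lambda/4 transformer vs normalized frequency."""
    t = math.tan(math.pi / 2 * f_over_f0)   # beta*l = (pi/2) * (f / f0)
    Zin = Z1 * (RL + 1j * Z1 * t) / (Z1 + 1j * RL * t)
    return (Zin - Z0) / (Zin + Z0)
```

<p>Evaluating gamma_in at f = f0, 3f0, 5f0 gives essentially zero, while at f = 2f0 the section is half a wavelength long, Zin = RL, and the full mismatch (Gamma = 0.6 here) reappears.</p>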
<h2 class="wp-block-heading">Solving the GATE question</h2>
<p>Applying all this to the problem at hand, we have&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?Z_0=50" align="absmiddle" border="0">,&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?R_L=200" align="absmiddle" border="0">&nbsp; and&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?R_1=100" align="absmiddle" border="0">.</p>
<p>Given that <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?R_1=\sqrt{Z_0 R_L} = \sqrt{50 * 200 }=100" align="absmiddle" border="0">, we know that a quarter wave transformer is used to achieve impedance matching.</p>
<p>Now we also know that we need to match for two frequencies&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?f_{1}=429\mbox{ MHz}" align="absmiddle" border="0"> and <img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?f_{2}=1\mbox{ GHz}" align="absmiddle" border="0">.</p>
<p>The wavelengths for the two frequencies are,</p>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\lambda_{1}=\frac{3e^8}{429e^6}*100\simeq 70\mbox{ cm}" alt=""/></figure>
<figure class="wp-block-image"><img decoding="async" src="https://dsplog.com/cgi-bin/mimetex.cgi?\lambda_{2}=\frac{3e^8}{1e^9}*100\simeq 30\mbox{ cm}" alt=""/></figure>
<p>The least common multiple of these two wavelengths is&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\lambda_{lcm}=\mbox{lcm}\(70,30\)=210\mbox{ cm}" align="absmiddle" border="0">, and the corresponding quarter wavelength is&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\lambda_{lcm}}{4}=52.5\mbox{ cm}" align="absmiddle" border="0">.</p>
<p>Given that&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?\frac{\lambda_{lcm}}{4}=52.5\mbox{ cm}" align="absmiddle" border="0">&nbsp;is not listed in the options, we go for the next higher odd multiple, i.e.&nbsp;<img decoding="async" alt="" src="https://dsplog.com/cgi-bin/mimetex.cgi?52.5*3=157.5\mbox{ cm} \simeq 1.58\mbox{ m}" align="absmiddle" border="0">.</p>
<p><strong>Based on the above, the right choice is&nbsp;(C) 1.58m</strong></p>
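<p>The arithmetic above can be checked with a short script (mirroring the post&#8217;s rounding of the wavelengths to whole centimetres):</p>

```python
import math

c = 3e8                         # speed of light, m/s
f1, f2 = 429e6, 1e9             # the two frequencies to be matched

lam1_cm = round(c / f1 * 100)   # ~70 cm
lam2_cm = round(c / f2 * 100)   # 30 cm

lam_lcm_cm = math.lcm(lam1_cm, lam2_cm)   # least common multiple: 210 cm
quarter_cm = lam_lcm_cm / 4               # 52.5 cm, not among the options
answer_m = 3 * quarter_cm / 100           # next odd multiple: 157.5 cm = 1.575 m
```

<p>Note that math.lcm requires Python 3.9 or later.</p>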
<h2 class="wp-block-heading">References</h2>
<p>[1] GATE Examination Question Papers [Previous Years] from Indian Institute of Technology, Madras&nbsp;<a href="http://gate.iitm.ac.in/gateqps/2012/ec.pdf" target="_blank" rel="noopener">http://gate.iitm.ac.in/gateqps/2012/ec.pdf</a></p>
<p>[2]&nbsp;<strong>Microwave Engineering, David M Pozar (<a href="http://www.amazon.com/gp/product/0470631554/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0470631554&amp;linkCode=as2&amp;tag=dl04-20" target="_blank" rel="noopener">buy from Amazon.com</a><img loading="lazy" decoding="async" alt="" src="http://www.assoc-amazon.com/e/ir?t=dl04-20&amp;l=as2&amp;o=1&amp;a=0470631554" width="1" height="1" border="0">,&nbsp;<a href="http://www.flipkart.com/microwave-engineering-3rd/p/itmdytmvtyj3arjj?pid=9788126510498&amp;affid=krishnadsp" target="_blank" rel="noopener">Buy from Flipkart.com</a>)&nbsp;</strong></p>
<p>The post <a rel="nofollow" href="https://dsplog.com/2013/02/20/gate-2012-ece-q28-electromagnetics/">GATE-2012 ECE Q28 (electromagnetics)</a> appeared first on <a rel="nofollow" href="https://dsplog.com">DSP LOG</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://dsplog.com/2013/02/20/gate-2012-ece-q28-electromagnetics/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
