DSP LOG

Word Embeddings using neural networks

Krishna Sankar — Sat, 27 Dec 2025 15:12:18 +0000

In machine learning, converting the input data (text, images, or time series) —into a vector format (also known as embeddings) forms a key building block for enabling downstream tasks. This article explores in detail the architecture of some of the neural network based word embedding models in the literature.

Papers referred :

Neural Probabilistic Language Model, Bengio et al 2003
- proposed a neural network architecture to jointly learn word feature vector and probability of words in a sequence.
Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)
- given the softmax layer for finding the probability scales with vocabulary, proposed a hierarchical version of softmax to reduce the complexity from to .
Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann et al 2012,
- Instead of directly estimating the data distribution, noise contrastive estimation estimates the probability of a sample being from the data versus from a known noise distribution.
- This approach was extended to neural language models in the paper A fast and simple algorithm for training neural probabilistic language models A Mnih et al, 2012.
Efficient Estimation of Word Representations in Vector Space, Mikolov et al 2013.
- proposed simpler neural architectures with the intuition that simpler models enable training on much larger corpus of data.
- Continuous Bag of Words (CBOW) to predict the center word given the context, Skip Gram to predict surrounding words given center word was introduced.
Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al 2013
- speedup provided by sub sampling of frequent words helps to improve the accuracy of the less-frequent words
- simplified variant of Noise Contrastive estimation called Negative Sampling
GloVe: Global Vectors for Word Representation, Pennington et al 2014
- propose that ratio of co-occurrence probabilities capture semantic information better than co-occurance probabilities.

In this post we will cover the key aspects proposed in the above papers with supporting python code.

Neural Probabilistic Language Model (Bengio et al, 2003)

Reference : Neural Probabilistic Language Model, Bengio et al 2003

The probability of sequence of words can be expressed as conditional probability of sequence of previous words, i.e

For example, consider a sequence of 4 words,

Then, by the chain rule of probability:

Substituting the actual words:

For a long word sequence, instead of conditioning on all previous words, it is common to approximate the probability by conditioning only on the last words. That is:

Neural network Architecture

The neural probabilistic language model builds on the n-gram approximation and proposes a way to

Jointly learn word feature vector (each word in the vocabulary has a feature vector — a real-valued vector in ) and
Learn the probability of the sequence of words in terms of sequence of word feature vectors

The objective is to learn a model that predicts the probability of the next word given the previous words, i.e.

The model is subject to the following constraints:

For any sequence of words, the model outputs a non-zero probability, i.e. 0″ alt=”” align=”absmiddle”>
The sum of probabilities over all possible next words in the vocabulary equals 1, i.e.

where is the vocabulary size, and indexes over all possible words in the vocabulary.

Note :

Non-zero probability: Ensures that the model never completely rules out any word as a possible next word, allowing it to adapt to all possible word sequences and avoid zero-probability issues during training.

Probabilities sum to one: Guarantees that f defines a valid probability distribution over the vocabulary for the next word, so the total probability of all possible next words is exactly 1.

The estimation of the function is done as follows :

for any word in the vocabulary , lookup a real vector
a function maps an input sequence of feature vectors for words in the context, to a conditional probability distribution over words in for the next word

Model

The neural network model can be expressed as:

where:

is the concatenated input feature vector of the previous words, with dimension .
is a weight matrix of size , which transforms the input into the hidden layer space.
is a bias vector for the hidden layer, of dimension .
is a weight matrix of size that maps the hidden layer activations to the output layer, where is the vocabulary size.
is a weight matrix of size that connects the input directly to the output layer.
is the bias vector for the output layer, of dimension .
is the output vector containing the unnormalized log-probabilities (scores) for each word in the vocabulary, of dimension .

Using softmax to convert the output vector into a probability distribution over the vocabulary,

Using softmax layer ensures the constraints defined earlier:

All probabilities are positive, satisfying the 0″ alt=””> constraint.
The probabilities sum to one across all possible next words, satisfying the normalization constraint .

Loss function

The maximum likelihood estimate for selecting the target word over all the words in vocabulary is equivalent to minimising the negative log likelihood,

where,

is multiple word sequence examples

As can be seen in the section on Loss for Multiclass classification ^{(refer post on Gradients for Multiclass classification with SoftMax)}, the negative log likelihood is indeed the Categorical Cross Entropy Loss.

Python code

The training of a Neural Probabilistic Language Model in PyTorch involves a few key components, each corresponding to the mathematical elements discussed earlier:

torch.nn.Embedding — implements the word feature vector lookup function . Each word index in the vocabulary maps to a dense vector in .
torch.nn.Linear — implements the fully-connected (dense) layers, corresponding to the transformation matrices , .
torch.nn.Parameter – the parameters and are explicitly created.
torch.nn.functional.log_softmax — applies the SoftMax in log space to obtain while maintaining numerical stability.
torch.nn.NLLLoss — implements the Negative Log Likelihood Loss, which directly minimises for the correct target word index.

These functions, combined with an optimizer such as torch.optim.SGD or torch.optim.Adam, form the complete training loop for the model.

code @ word_embeddings/neural_probabilistic_language_model.ipynb
The training loop implementing the model for a simple toy example of 20 sentences shows that the model is doing reasonable in predicting the probability of next word.

Hierarchical Softmax (Morin & Bengio 2005)

Reference : Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005)

As computing of the probability of all tokens using SoftMax scales with vocabulary size , in the paper Hierarchical Probabilistic Neural Network Language Model, Morin & Bengio (2005) proposed an approach to reduce the complexity from to .

Based on the intuition shared in the paper Classes for Fast Maximum Entropy Training, J Goodman 2001, to compute , instead of directly computing the probability of the target word given the context words , we decompose it hierarchically as product of :

: probability of in class given context
: probability of word , given is in class AND context

i.e

where,

: the target word we want to predict (e.g., “dog”).
: the context (the surrounding words or features used to predict the next word, e.g., “the big”).
: the cluster/class that the target word belongs to (e.g., dog → Noun class).

Derivation

To derive the the decomposition of , let us introduce a class variable , i.e. the word belongs to the class .

Then the probability of can be written as the sum of is in the class or not

Since each word belongs to exactly one class, the term is zero.

Hence,

The term can be expanded using the chain rule of conditional probabilities as follows:

Summarizing,

Thus, computing reduces to first predicting the class given the context and then predicting the word within that class conditioned on .

Complexity

With this approach, instead of computing probability over the entire vocabulary , this is broken down to computing the probability over the classes, and then computing the probability over the words within the chosen class.

Taking the example shared in the paper, assuming that is 10000 words, and we break it down into 100 classes, with each class having 100 words. Then the computations needed are:

Finding probability over 100 classes
Finding probability over 100 words in the chosen class

This reduces the computation to ~200 probability calculations instead of 10000 in the flat structure. Equivalently, the complexity reduces from to operations.

Binary Tree

An alternative to class-based grouping is to arrange the vocabulary words as the leaves of a binary tree. Each internal node corresponds to a binary decision (left or right child), and each leaf corresponds to one word in the vocabulary. This hierarchical arrangement reduces the search complexity from to , making it efficient for large vocabularies.

For constructing the binary tree, multiple approaches are possible :

Perfect binary tree
- Requires the leaves to be a power of 2 (for eg, 2, 4, 8, 16 etc).
- If is not a power of 2, some leaves will remain unused
- To reach every word, it takes the same path length i.e .
- Average depth: exactly since all leaves are at the same level.
Balanced binary tree
- Tries to keep the left and right subtrees of equal size.
- When the vocabulary is not a power of 2, leaf depths differ by at most 1.
- No empty leaves; every leaf corresponds to a word.
- Average depth: approximately , often slightly smaller because some leaves are shallower.
Word frequency based tree
- Constructed using a Huffman coding structure, frequent words are placed closer to the root node while rare words are deeper.
- This minimises the average number of binary decisions required to reach a word.
- Average depth: depends on the frequency distribution; it is minimised and typically much smaller than for natural language vocabularies (due to Zipf’s law).

For a toy corpus of 12 words, construction of the binary tree with the above approaches is shown below. code @ word_embeddings/binary_tree.ipynb

Model

The probability of the next word given the context can be written as:

where,

each word is represented by a bit vector
the path p depends on the position of the word in the binary tree.

For example, if each word is represented by 4 bits, then the probability of predicting the next word given the context becomes:

Taking log on both sides converts the product into a summation:

In general, for a word represented with bits:

The bit vector corresponds to the path (left or right at each nodes) starting from the root node to the leaf node (the word). Each internal node outputs a probability of going right ( ). For the true label , the binary cross-entropy loss at that node is:

The total loss for predicting is the sum of the node losses along the path:

where

is the predicted probability at the j-th node along the path.
denotes the binary choice (0 or 1) at the j-th internal node along the path to word .

This is equivalent to the negative log-likelihood of the full word probability.

Binary Node Predictor

Each internal node of the binary tree acts as a logistic classifier that decides left vs right, based on both the (n−1)-gram context and the node embedding. The conditional probability of taking the binary decision at a node, given the past context, is modelled as:

where,

the sigmoid function is .
for any word in the vocabulary , lookup a real vector
: concatenation of the previous (n−1) word embeddings,
: bias term specific to the node,
: projection vector applied after hidden nonlinearity,
: bias for hidden layer,
: weight matrix projecting context to hidden space,
: weight matrix projecting node embedding,
: embedding vector for the current node,

The matrices , , (projection vector) and the bias are common parameters shared across all nodes.

Each internal node has its own (scalar bias), and (node embedding). These take care of the decision boundary at each internal node.

Python code – Naive implementation using for-loops

For the toy corpus, naive implementation of hierarchical softmax using for-looops is provided.

Defined a toy corpus of 20 sentences which has around 42 words.
Training example is constructed as 3 context words and the corresponding target word
Constructed a balanced binary tree, which has 41 internal nodes
Model defined with the binary node predictor for each of the nodes
- The parameters , , (projection vector) and the bias are shared across all nodes.
- Each internal node has its own and parameter
For each target word in the training example, the path to the leaf node via the tree is known
Using the binary decision at each path, the loss for each example is computed
The loss is back propagated to find the parameters which minimizes the loss

Using the trained model, for finding the probabilities for top-k words given the context words,

For each word in the vocabulary find the path to its leaf node
Starting from the root node, find the probability at each node
Based on the known decision (right vs left) at each node, use either p for going right OR (1-p) for going left
The joint probability is the product of probabilities at each node.
For numerical stability (loss of accuracy when many small probabilities are multiplied), log of probabilities is found and then summed
On the final log probability is exponentiated to get back in probability (optional)
Then the top-k candidate words are printed

code @ word_embeddings/hierarchical_probabilistic_neural_language_model.ipynb

Python code – Vectorized implementation

As one can imagine, using the for loops significantly slows down the training. To form a vectorized implementation, the following was done.

Path preparation
- Assign a unique id to every internal node in the binary tree.
- Precompute for each word:
  - sequence of node-ids on the path to its leaf,
  - binary decision targets at each node.
- Pad all paths to a fixed length using a dummy (UNK) node id. Build a mask to ignore padded positions.
Parameter lookup
- Use torch.nn.Embedding to fetch node-specific parameters (biases) and (embeddings).
- Shapes:
Forward pass (vectorized)
- Context projection: and bias .
- Node projection: using .
- Broadcast: .
- Nonlinearity : .
- Projection: with .
- Probabilities: .
Loss and masking
- Binary cross-entropy is computed between and the decision targets, with mask applied to ignore padded nodes:

Notes :

Compute once per batch and broadcast, instead of recomputing per path.
UNK node parameters are trainable but excluded from loss using the mask.

code @ word_embeddings/vectorized_hierarchical_probabilistic_neural_language_model.ipynb

Noise contrastive estimation (Gutmann et al 2012, Mnih et al 2012)

As computing of the probability using SoftMax scales with vocabulary size , in the paper Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics, Gutmann et al 2012, provided an approach called Noise Contrastive Estimation (NCE). Instead of directly estimating the data distribution, NCE estimates the probability of a sample being from the data versus from a known noise distribution. By learning the ratio between the data and noise distributions, and knowing the noise distribution, the data distribution can be inferred.

This approach was extended to neural language models in the paper A fast and simple algorithm for training neural probabilistic language models A Mnih et al, 2012.

Model

In Neural Probabilistic Language model, the estimation of the probability of the words using SoftMax computation is,

where,

context words is
term in numerator is estimated using a neural model with parameters for the target word using the context words
term in denominator is sum over all the words in the vocabulary i.e.

Let us define a set , which is the union of two sets , where

the class label when the word is from the true target word distribution
the class label when the word is NOT from the true target word distribution
is the number of true (data) samples in the batch (or dataset)
is the the number of noise samples generated for contrast

The formulation is,

for each true (data) sample will draw noise samples from
the model has to learn a binary classification where the sample is from the true distribution or from the noise distribution

Further, instead of computing the denominator term for normalzing to probabilities, learn it as a context dependent normalizing term, i.e.

The probability of sample coming from the true distribution given the context can be written as ,

Similarly, the probability of the word in noise distribution is,

Further,

Since NCE reframes the problem as a binary classification task (distinguishing true data from noise), the class labels are modelled as independent Bernoulli variables. Consequently, the conditional log likelihood is the sum of the binary cross-entropy terms:

For a single true target word and its corresponding noise samples

To evaluate this loss, need to express in terms of model parameters. Using Bayes Rule, the probability that the class is true given the context and target word is,

This gives the general probability for any word . When calculating the loss for a true target word, we substitute to get the positive sample probability

Converting to sigmoid form which is used in logistic regression,

Similarly, for the target word from noise distribution i.e. the probability that the class is noise given the context and target word is,

This gives the general probability for any word . When calculating the loss for a noise target word, we substitute to get the noise sample probability

Converting to sigmoid form which is used in logistic regression,

Plugging in the terms to the log likelihood for a single example,

To obtain the objective function for the entire dataset, sum the log-likelihoods over all true training examples . For each training example at step , we have a specific context , a true target word , and a fresh set of noise samples.

The final loss function that we minimize is the negative log-likelihood over the full dataset:

Note :
In the paper, A fast and simple algorithm for training neural probabilistic language models A Mnih et al, 2012, authors mention that approximating the learning of context dependent normalizing factor to 1 did not affect the performance of downstream tasks.

Noise Distribution

The noise distribution is typically chosen proportional to the unigram frequency of words in the corpus:

Often a smoothed unigram distribution improves results:

Python code

For the toy vocabulary, the code for Neural Probabilistic Language Model, Bengio et al, with the SoftMax head replaced with with Noise Contrastive Estimation (NCE).

The code @ word_embeddings/nplm_with_noise_contrastive_estimation.ipynb

Word2Vec papers (Mikolov et al, 2013)

In the paper, Efficient Estimation of Word Representations in Vector Space, Mikolov et al 2013. proposed architectures to reduce the computation complexity in learning word embeddings , with the intuition that simpler models enable training on much larger corpus of data.

Two architectures where proposed.

Continuous Bag of Words (CBOW) Model

When comparing with Neural Probabilistic Language Model, Bengio et al 2003, the following simplifications are proposed.

order of context words is ignored
- instead of concatenating embedding of previous words, averaging the word embeddings of surrounding words is proposed
- this approach is called “bag-of-words” as the order is not taken into consideration
no non linear hidden layer
- the model uses a shared projection layer

Additionally, the in the model context including future words too.

Equations

The neural network output is :

where:

is the output weight matrix, mapping from hidden dimension to vocabulary size
is the averaged context embeddings.

The averaged context embedding vector is computed as:

where,

is the input embedding matrix
is the embedding of the i-th context word, and
is the number of words to the left or right of the target word, giving a total context size of

The probability distribution over the vocabulary is obtained using the softmax function:

where:

is the predicted probability of word being the target word.

is the score for word from the output layer.
The denominator sums the exponentiated scores over all vocabulary entries.

Continuous Skip-gram Model

The Skip-gram model, tries to predict context words given the current target word. The main idea is that each word is trained to predict the words surrounding it within a context window of size n.

Input: one-hot encoding of the target word w_t
Output: probability distribution over vocabulary for each context word
No non-linear hidden layer: uses a shared projection matrix (linear)

Given a target word , the model tries to predict each surrounding context word for . The training goal is to maximize the probability of all context words around each target word:

Equations

The output score,

where,

, where is the embedding vector for the word
is the input embedding matrix
is the output embedding matrix and

The scores are computed for each context word, and the probability of all the context word is maximized.

For either CBOW or Skipgram, both and are trainable. After training, either one (or their average) is used as the word embedding.

Naive way for finding the probability is using SoftMax,

where,

is the output word
is score for output word
denominator = normalizing constant over vocabulary

For finding the probability, hierarchical softmax is proposed in the paper.

Negative Sampling

In the paper, Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al 2013 introduced two concepts:

the speedup provided by sub sampling of frequent words helps to improve the accuracy of the less-frequent words
simplified variant of Noise Contrastive estimation called Negative Sampling

The key intuition in Negative Sampling is that noise contrastive loss defined had terms to normalize the score to approximate the probabilities. However, to learn word embeddings the probabilities are not needed and the terms

can be ignored.

With this simplification, the negative sampling loss is,

Python code

For the toy vocabulary, finding word vectors with

a) Continuous Bag of words (CBOW) with negative sampling

The code @ word_embeddings/cbow_negative_sampling copy.ipynb

b) Skip Graph with Negative Sampling

The code @ word_embeddings/skip_gram_negative_sampling.ipynb

GloVe Embeddings ( Penning et al 2014)

In the paper GloVe: Global Vectors for Word Representation, Penning et al 2014, authors propose that ratio of co-occurrence probabilities capture semantic information better than co-occurance probabilities.

Let,

be matrix of the word co-occurance counts
be the number of times word occur in context of word .
be the number of times any word appear in the context of word
be the probability that word occur in context of word .

Authors show that on the 6 billion token corpus dataset,

Taking the ratio of co-occurance probabilities,

The ratios indicate that,

solid and ice has a higher relationship than with steam.
gas and ice is far less likely to co-occur than with steam
water is related to both ice and steam in similar proportions

Model

To capture this ratio relationship in a vector space, the authors search for a function that satisfies:

where

is context word
are target words
are the word vectors and
are separate context vectors.

Authors enforce that the relationship should be linear (vector difference) and the result should be a scalar (dot product), leading to :

To satisfy that, authors propose choosing function so that the dot product of vector difference can be written as ratio of probabilities,

With this choice, for a single word-context pair estimates the co-occurence probabilities,

Taking logarithm,

Note :

The model capturing the relation between two words should not change even if the words are swapped. Even though the co-occurrence counts are identical (), because the total counts of words are not equal () , the conditional probability is not symmetric ().

The above equation is not symmetric if we swap target word and context word as the row-dependent term has to be handled.

To make it symmetric, the authors absorb into a learnable bias term and then adds a corresponding bias for the context word. This ensures the model is fully symmetric i.e.

The loss function then becomes ,

The key aspect in the above simplification is that, by training for the pairs of words to minimize the above loss will indirectly ensure that the dot product of vector difference of target words with context word will arrive at the ratio of probabilities.

Weighted Least Squares

The above loss function weighs all co-occurances equally. Authors noted that rare co-occurances are noisy and around 75-95% of the co-occurance is zeros, and proposed adding a weighting function to least squares loss proposed above.

The weighting function is chosen to obey the following :

(to handle the zero co-occurance counts)
should be non-decreasing so that rare co-occurance are given less weight
should be relatively small for large values of so that frequent co-occurrences are not over-weighted

The parameters and are chosen empirically.

Then the Weighted Least Squares loss function becomes,

Python code

The code @ word_embeddings/glove_word_embedding.ipynb

Summary

This article covers

Evolution: How we moved from Bengio’s NPLM (2003) to efficient architectures like Word2Vec and GloVe.

Math: Detailed derivations of Hierarchical Softmax (using binary trees) and Noise Contrastive Estimation (differentiating data from noise).

Architectures: A deep look at CBOW, Skip-Gram, and the intuition behind Negative Sampling.

Code: Complete Python implementations for every model discussed, including vectorized implementations for efficiency.

Acknowledgment

In addition to the primary papers listed above, this post draws inspiration from the excellent overview in the post Learning word embedding, Weng, Lilian 2017. Credit also goes to the recent Large Language Models Gemini and ChatGPT which helped to bounce thoughts and refine the drafts.

The post Word Embeddings using neural networks appeared first on DSP LOG.

Gradients for multi class classification with Softmax

Krishna Sankar — Sun, 22 Jun 2025 08:53:17 +0000

In a multi class classification problem, the output (also called the label or class) takes a finite set of discrete values . In this post, system model for a multi class classification with a linear layer followed by softmax layer is defined. The softmax function transforms the output of a linear layer into values lying between 0 and 1, which can be interpreted as probability scores.

Next, the loss function using categorical cross entropy is explained and derive the gradients for model parameters using the chain rule. The analytically computed gradients are then compared with those obtained from the deep learning framework PyTorch. Finally, we implement a training loop using gradient descent for a toy multi-class classification task with 2D Gaussian-distributed data.

As always, contents from CS229 Lecture Notes and the notations used in the course Deep Learning Specialization C1W1L01 from Dr Andrew Ng forms key references.

Model

Let us take an example of estimating based on feature vector having features i.e. and there are examples.

Linear Layer

Let us assume that the variable is defined as linear function of . For a single training example, this can be written as :

where,

is the vector of size i.e. ,

is the parameter matrix of size i.e. ,

is the feature vector of size i.e. and

is the parameter vector of size i.e.

Note :

This is the definition of Linear layer in PyTorch^{(refer entry on Linear layer)}. This is alternatively called as Dense Layer in Tensorflow ^{(refer entry on Dense)} and as Fully Connected layer in deep learning literature.

Softmax layer

To map the real valued vector to a probability vector with elements of summing up to 1, we use the softmax function ^{(refer wiki entry on SoftMax)}. Softmax function is defined as,

Equivalently, this can be written as,

where, each represents the normalized exponential of the corresponding . This ensures that

each element lies in the range [0,1] i.e.

sum of all the elements add upto 1, i.e. .

This makes interpretable as a probability distribution over the classes.

Derivatives

Derivative of Softmax layer

To compute the derivative of the softmax output with respect to its input , we need to find the Jacobian matrix . The Jacobian contains all partial derivatives of each output component with respect to each input component .

To find the derivative for all cases, let us split into two scenarios i.e.

Derivative for case i=j

Using the product rule of derivatives,

Derivative for case i ≠ j

Final output (matrix form)

Based on the above derivations, the derivative is defined as :

In matrix form,

Code

Python code comparing the derivative of softmax using the derivation above vs computed by PyTorch autograd function below.

Derivative of Linear layer

To find the derivative of the linear layer , with respect to parameters and , we must compute two partial derivatives:

– how the output changes with respect to the weight matrix.

– how the output changes with respect to the bias vector.

Derivative of Weights

To compute the derivative , we evaluate how each weight parameter affects each output dimension . The i-th component of is:

The partial derivative of with respect to is :

where,

index over the output vector

indexes the elements of the input vector

Since each output depends only on the weights in i-th row , the Jacobian simplifies to a matrix where each row is . This can be represented as

For all the rows of , the derivative is

Derivative of Bias

The bias vector is added element-wise to the output of the linear transformation . That is, each output component is given by:

So the partial derivative of with respect to is:

This implies that the Jacobian matrix of with respect to is an identity matrix:

This tells us that the bias only affects its corresponding output component (i.e., only affects ).

Loss for multi-class classification

Maximum Likelihood Estimate

The likelihood of observing the true class c, given input , under the model is:

The log-likelihood over a dataset with examples is:

where,

is the model’s predicted probability for the correct class for the i-th example.

Maximizing this log-likelihood is equivalent to minimizing the negative log-likelihood:

Connecting to Cross Entropy Loss

To map the ground truth class label to be represented as a target vector , a common choice is the one-hot encoding scheme, where the true class is indicated by a 1 in the corresponding position and 0 elsewhere. For example, suppose we have classes, and the correct label is class 2, i.e., , then the one-hot encoded vector becomes :

Cross Entropy

To compare the model’s predicted probability vector with the one-hot encoded true label , we use a metric called cross-entropy ^{(refer wiki entry on Cross Entropy)}. The cross-entropy of the distribution relative to a distribution over a given set is defined as follows:

where,

is the expected value operator with respect to the distribution .

For discrete probability distributions and with set of all possible outcomes or classes.

Cross Entropy Loss

In the context of training classification models, we use the cross-entropy loss as a cost function to minimize. For a single training example, to evaluate how well the predicted probability vector matches the ground truth vector , the cross-entropy loss is defined as:

where,

is the true probability of class i and

is the predicted probability for class i.

The loss encourages the model to assign higher probability to the correct class which indirectly lowers the probabilities to the incorrect classes. The smaller the cross-entropy loss, the closer the predicted probabilities are to the true labels.

Loss across all examples is,

When is one-hot coded, as only the correct class is non-zero, the equation reduces to

We can see that this cross entropy loss is same as the maximum likelihood estimate derived earlier.

Note :

Function for cross entropy loss is available in PyTorch library as torch.nn.CrossEntropyLoss ^{(refer entry on CELoss in PyTorch)}.

In the torch.nn.CrossEntropyLoss definition, we only need to provide the output of the linear layer (called logits) and the class indices as an integer. The softmax and logarithm of probabilities are computed internally, so we do not need to apply softmax before passing logits to this function.

Gradients with Cross Entropy (CE) Loss

The system model for binary classification involves multiple steps:

firstly, the vector is defined as linear function of using parameters and ,

then gets transformed into a estimated probability score using softmax function.

lastly, using the true label and estimated probability score , cross entropy loss is computed

For performing gradient descent of the parameters, the goal is to the find the gradients of the loss w.r.t to the parameters and . To find the gradients, we go in the reverse order i.e.

first, gradients of the loss w.r.t to the estimated probability score , is computed.

then gradients of the probability score w.r.t to the output of linear function is multiplied with the gradients of loss with respect to i.e.

lastly, to find the gradients of output of linear function w.r.t to parameters and , the product of all the individual gradients is used. This is written as,

The steps described, calculating gradients in the reverse order from the loss back to the parameters is an application of the chain rule from calculus ^{(refer wiki entry on Chain Rule)}. This method is the foundation of back propagation used in training models ^{(refer wiki entry on Backpropagation)}.

Gradients of Loss with respect to Probability (dL/da)

As defined earlier, for a multi-class classification setting, the cross-entropy loss is given by:

Derivative of w.r.t is,

So, the gradient is large if the predicted probability is small for the correct class — this penalises the model for incorrect predictions, which is desired during training. The vectorized form of the loss gradient w.r.t. the probability vector is:

Equivalently,

Gradients of Loss with respect to z (dL/dz)

Using chain rule, to the find the gradient of loss with respect to z i.e. , we multiply the derivative of softmax output which is a matrix with which is of dimension ,

In vectorized form, can be represented as

Gradients of loss with respect to Parameters (dL/dW, dL/db)

Gradients of Weights (W)

Based on the chain rule, to find the gradient of loss with respect to parameter , we multiply each row of the with the corresponding row from

This is equivalent to the outer product,

Gradients of bias (b)

Recall the linear transformation: . The gradients are:

The intuition from above equations is :

if the estimated probability is close to the true value then the gradient is small, and the update to the parameters is also correspondingly smaller. If you recall, the gradients for binary classification ^{(refer post on Gradients for Binary Classification with Sigmoid)}, linear regression ^{(refer post on Gradients for Linear Regression)} follows a similar intuitive explanation.

These gradients are then used in the optimizer (e.g., SGD) to update parameters and reduce the loss.

Vectorised operations (with m examples)

The training examples each having features is represented as,

The output, which is a probability matrix across classes for each of the examples, is:

The linear transformation before applying the activation function (e.g., softmax) is given by:

where, the parameters

The softmax activation is applied column-wise to the matrix to obtain the probability outputs:

In matrix form, this is written as,

The cross-entropy loss compares the predicted probabilities with the ground truth one-hot encoded labels :

The derivative of the cross-entropy loss with softmax activation, with respect to the input (logits), simplifies to:

The gradient of the loss with respect to the weight matrix is:

As the input matrix has shape , the inner product
results in a matrix of shape
. This captures the total gradient of the loss over all examples.
Averaging over the examples is done by multiplying with .

The gradient of the loss with respect to the bias vector is computed by summing the gradient over all examples using a row vector of ones:

Here, and
sums the gradients across all examples.
The result is a vector, which matches the shape of .

Code (gradients)

Example code comparing the gradients computed using the derivation with autograd from PyTorch.

Training for toy example with 3 classes

Below is an example of training a multi class classifier based on the model and gradient descent. Synthetic training data data is generated from two independent Gaussian random variables with zero mean and unit variance. Mean is shifted by (-2,-2), (+2,+2 ), (-2,+2 ) corresponding to class 0, class 1, class 2 respectively.

The training loop is done using the numerically computed gradients and using the torch.autograd provided by PyTorch, and can see that both are numerically very close.

Training with Label Smoothing

In the previous section, we derived the gradients for multi-class classification using one-hot encoded targets. In the paper “Rethinking the Inception Architecture for Computer Vision” by Szegedy et al. (2016) ^{(arXiv:1512.00567)}, the idea of label smoothing was introduced. The key observation is that one-hot targets, which drives the predicted probability for the correct class toward 1 and ignore the other classes in the loss function, encourage models to become overconfident.

Label smoothing combats this by replacing the hard 1 in the true class with a slightly lower value, such as , and distributing the remaining equally among the other classes. So, instead of teaching the model that one class is absolutely correct, we teach it that one class is very likely correct — allowing for some uncertainty.

For a classification problem with classes and smoothing parameter , the smoothed label vector becomes:

For an example with =4 classes,

Even though we modify the target labels using label smoothing, the sum of the smoothed probabilities still adds up to 1. Because of this, the gradient derivations from the previous section remain valid.

Training code

For the toy training example earlier, we compare the training with smoothed labels vs one-hot coded. In PyTorch function torch.nn.CrossEntropyLoss ^{(refer entry on CELoss in PyTorch)} has an optional argument label_smoothing which implements the label smoothing as defined earlier.

In the training results on the toy example we can see that the loss is higher for the training with label smoothing and correspondingly misclassification rate is also slightly higher.

However, label smoothing has been shown to improve generalization in larger models trained on complex datasets. The concept was first introduced in Rethinking the Inception Architecture for Computer Vision (Szegedy et al., 2016), and was later used in the foundational paper Attention is All You Need (Vaswani et al., 2017). A broader study, When Does Label Smoothing Help? (Müller et al., 2019), analyzed its effectiveness in large models like ResNets and Transformers.

Summary

The post covers the following key aspects

System model for multi class classification with linear layer and softmax

Loss function based on categorical cross entropy and showing that this is Maximum Likelihood Estimate

Computation of the gradient based on chain rule of derivatives

Vectorized Operations for batch of examples which implements computations using efficient matrix and vector math.

Training loop for the classification using both manual and PyTorch based gradients

Explains the concept of label smoothing and implements a training loop for explaining the concept

Have any questions or feedback? Feel free to drop your feedback in the comments section.

The post Gradients for multi class classification with Softmax appeared first on DSP LOG.

Gradients for Binary Classification with Sigmoid

Krishna Sankar — Sat, 17 May 2025 13:05:07 +0000

In a classification problem, the output (also called the label or class) takes a small number of discrete values rather than continuous values. For a simple binary classification problem, where output takes only two discrete values : 0 or 1, the sigmoid function can be used to transform the output of a linear regression model into a value between 0 and 1, squashing the continuous prediction into a probability-like score. This score can then be interpreted as the likelihood of the output being class 1, with a threshold (commonly 0.5) used to decide between class 0 and class 1.

In this post, the intuition for loss function for binary classification based on Maximum Likelihood Estimate (MLE) is explained. We then derive the gradients for model parameters using the chain rule. Gradients computed analytically are compared against gradients computed using deep learning framework PyTorch. Further, training loop using gradient descent for a binary classification problem having two dimensional Gaussian distributed data is implemented.

Table of Contents

As always, contents from CS229 Lecture Notes and the notations used in the course Deep Learning Specialization C1W1L01 from Dr Andrew Ng forms key references.

Model

Let us take an example of estimating based on feature vector having features i.e. .

There are examples.

Let us assume that the variable is defined as linear function of . Then gets transformed into a probability score using sigmoid function. For a single training example, this can be written as :

where,

is the feature vector of size i.e. and

is a scalar

To convert the real number to a number lying between 0 and 1, let us define

where, is the sigmoid function ^{(refer wiki entry on sigmoid function})

Sigmoid function and its derivative

Sigmoid function which has a smooth S-shaped mathematical function is defined as:

which has the properties

The derivative of sigmoid is,

From the plots of derivative of sigmoid, two key observations :

Vanishing gradients : for very large or very small , the derivative approaches 0 causing gradients to vanish during back propagation — this slows or stalls learning in deep networks.

Low Maximum Gradient : the maximum value of derivative is 0.25, which caps the gradient flow, making it harder for deep layers to effectively update their weights

As mentioned in the article Yes you should understand backprop by Andrej Karpathy, these aspects have to be kept in mind when using sigmoid for training deeper neural networks.

Loss function for binary classification

Maximum Likelihood Estimation

Let us assume that the probability of output being 1, given input and parameters , is,

Then, for the binary classification, the probability of output being 0 is,

Since can either be 0 or 1, we can compactly write the likelihood as:

The likelihood function is the probability of the actual label given the prediction . When the multiple independent training examples are independently and identically distributed (i.i.d.), the total likelihood for the dataset is the product of the likelihoods for each example. With this assumption, for training examples, the likelihood for the parameters and is,

Log Likelihood

To avoid the product of many small numbers, we take the natural logarithm of the likelihood function. The log-likelihood for the entire dataset is the sum of the log-likelihoods for each example:

Negative Log Likelihood

Since optimizers like gradient descent are designed to minimize functions, we minimize the negative log-likelihood instead of maximizing the log-likelihood.

Averaging the Loss

Averaging the loss ensures that the total loss remains on the same scale, regardless of the size of the training dataset. This is important because it allows the use of a fixed learning rate across different dataset sizes, leading to more stable and consistent optimization behaviour.

The averaged negative log-likelihood is defined as:

This expression is known as the Binary Cross-Entropy (BCE) Loss, which is widely used in binary classification tasks. This function is available in PyTorch library as torch.nn.BCELoss ^{(refer entry on BECLoss in PyTorch)}.

Gradients with Binary Cross Entropy (BCE) Loss

The system model for binary classification involves multiple steps:

firstly, the variable is defined as linear function of using parameters , .

then gets transformed into a estimated probability score using sigmoid function.

lastly, use the true label and estimated probability score , binary cross entropy loss is computed

For performing gradient descent of the parameters, the goal is to the find the gradients of the loss w.r.t to the parameters and . To find the gradients, we go in the reverse order i.e.

firstly, gradients of the loss w.r.t to the estimated probability score

then gradients of the probability score w.r.t to the output of linear function

lastly, gradients of output of linear function w.r.t to parameters ,

Then the product of all the individual gradients from the gradients of loss w.r.t to the parameters. This is written as,

The steps described, calculating gradients in the reverse order from the loss back to the parameters is an application of the chain rule from calculus ^{(refer wiki entry on Chain Rule)}. This method is the foundation of backpropagation used in training models ^{(refer wiki entry on Backpropagation)}.

Deriving the gradients

For simplicity, take a single example and computing gradients step by step,

Step1 : Gradients of loss w.r.t to probability score

With the loss , then the derivative of loss w.r.t to sigmoid output is,

Step2 : Gradients of probability score w.r.t to output of linear function

With as the output of sigmoid function, the derivative is

Step3 : Gradients of output of linear function w.r.t to parameters

With , the derivative is,

Similarly,

Gradients of loss w.r.t to parameters

Taking the product of the gradients from all the steps,

Similarly,

The intuition from above equations is :

if the estimated probability is close to the true value then the gradient is small, and the update to the parameters is also correspondingly smaller. If you recall, the gradients for linear regression ^{(refer post on Gradients for Linear Regression)} follows a similar intuitive explanation.

Note : With training examples the loss is averaged, and this becomes :

Vectorised operations

The training examples each having features is represented as,

The output is,

The parameters and represented as,

where,

is the feature vector of size i.e. and

is a scalar

The output is,

Gradients,

The gradient w.r.t to can be represented in matrix operations as,

Similarly, for the bias term

Gradients computed numerically vs PyTorch

Training – Binary Classification

Below is an example of training a binary classifier based on the model and gradient descent. Synthetic training data data is generated from two independent Gaussian random variables with zero mean and unit variance. Mean is shifted on half the samples by (-2,-2) and the remaining half by (+2,+2 ) corresponding to class 0 and class 1 respectively.

The training loop is done using the numerically computed gradients and using the torch.autograd provided by PyTorch, and can see that both are numerically very close.

The estimated probability score indicates the likelihood that the given input corresponds to one of the classes. As can be seen in the plot Predicted Probability for Each Input, inputs close to center point (0,0) have a probability close to 0.5, and as we move away from the center the probabilities tend to be closer to either 0 or 1.

To convert this probability into a class label, a decision threshold needs to be applied. In this example, as can be seen in the plot of Classification Error vs Threshold, the threshold of 0.5 is corresponding to the lowest error rate.

However, there are other scenarios where the threshold of 0.5 can be inappropriate – like dealing with imbalanced datasets, skewed class distribution etc. These require adjusting the threshold for better performance.

Summary

The post covers the following key aspects

Loss function based on Maximum Likelihood Estimate

Computation of the gradient based on chain rule of derivates

Vectorized Operations Implements all computations using efficient matrix and vector math.

Training loop for the binary classification using both manual and PyTorch based gradients

Have any questions or feedback on the gradient computation techniques? Feel free to drop your feedback in the comments section.

The post Gradients for Binary Classification with Sigmoid appeared first on DSP LOG.

Gradients for linear regression

Krishna Sankar — Thu, 01 May 2025 06:02:13 +0000

Understanding gradients is essential in machine learning, as they indicate the direction and rate of change in the loss function with respect to model parameters. This post covers the gradients for the vanilla Linear Regression case taking two loss functions Mean Square Error (MSE) and Mean Absolute Error (MAE) as examples.

The gradients computed analytically are compared against gradient computed using deep learning framework PyTorch. Further, using the gradients, training loop using gradient descent is implemented for the simplest example of fitting a straight line.

As always, contents from CS229 Lecture Notes and the notations used in the course Deep Learning Specialization C1W1L01 from Dr Andrew Ng forms key references.

Model

Let us take an example of estimating based on feature vector having features i.e. .

There are examples.

Assume that the estimate is as linear function of .

For a single training example, this can be written as :

where,

is the feature vector of size i.e. and

is a scalar

Least Mean Squares

To find the parameters and , based on training examples, need to formalise a metric to quantify the “closeness” of the estimate to the true value . As an arbitrary choice, let us define a metric based on the mean square error (MSE) as,

Goal is to find the parameters and which minimizes the metric . This can be considered as ordinary least squares (^{wiki entry on ordinary least squares}) model.

To find the value of parameters and which minimises the metric , let us try gradient descent method where we

i) start with initial random values of parameters and

ii) repeatedly update parameters simultaneously for all values of and

where,

is the learning rate,

and are the partial derivatives of the loss metric over parameters and respectively.

The intuition is, to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent ^{Wiki Article on gradient descent}.

Gradients

In this formulation, we need to find the derivative of a scalar i.e. loss over the a vector of parameters of and .

For easier understanding, can define the derivative over each parameter as below,

Further, taking only one training example, the loss is

Taking the derivative w.r.t to first parameter

Similarly, for the parameter of , the gradient is

For the bias parameter , the gradient is

The intuition from above equations is :

if the estimate is close to the true value then the gradient is small, and the update to the parameters is also correspondingly smaller.

With training examples the loss is averaged, and this becomes :

Vectorised operations

Vectorised operations allow CPUs/GPUs to do SIMD (Single Instruction Multiple Data^{(Refer Wiki)}) processing, making it much faster than using for-loops.

Inputs & Outputs

In the current example, this translates to the parameters and represented as,

respectively.

The training examples of features is represented as

The output

Gradients

The gradient w.r.t to can be represented in matrix operations as,

Similarly, for the bias term

Training

Training loop – using the derivatives

The code for linear regression using the gradients descent defined in the previous section.

Computing Gradients

Using PyTorch

For the simple linear regression example, it is relatively straight forward to derive the gradients and perform the training loop. When the function for estimation involves multiple stages/layers a.k.a deep learning^{(refer wiki)}, it becomes harder to derive the gradients.

Popular deep learning frameworks like Pytorch provides tools for automatic differentiation ( torch.autograd ^{refer pytorch entry on autograd}) to find the gradients of each parameter based on the loss function.

Numerical approximation (finite difference method)

To verify the gradients, derivatives can be computed numerically using using finite difference^{(refer wiki entry on finite difference)} method i.e.

where,

is the true derivative of function and

is a small constant.

Example – Analytic vs PyTorch vs Numerical Approximation

For the toy example below, can see that the gradients computed by Analytically vs PyTorch vs numerical approximation using finite difference methods are matching.

Training Loop – using PyTorch

Key aspects in the code for implementing the training loop using PyTorch :

the variables are defined as torch tensors ^{refer pytorch article on tensors}.
- Tensors are similar to numpy ndarrays, with capabilities to use GPU/hardware accelerators, optimized for automatic differentiation etc

defining the parameters needing gradient computation.
- the parameters and which needs gradient computation are initialised with requires_grad=True

computing the gradient
- the call loss.backward() is used to compute the gradients for the parameters and .
- this makes the gradient values available in w.grad() and b.grad() respectively

updating the paramaeters
- as gradient tracking is unnecessary during parameter updates, they are performed within torch.no_grad(): context

zeroing gradients between calls
- PyTorch accumulates gradients by default during each backward pass i.e. each loss.backward() call
- so, performing w.grad.zero_() and b.grad.zero_() is needed to clear previous gradients.

As would expect, both the training loop approaches converges to similar values for parameters and .

Mean Absolute Error

Another popular metric to quantify the “closeness” of the estimate to the true value is Mean Absolute Error (MAE). In the cases where there are outliers in the data, Mean Absolute Error (MAE) is preferred over Mean Squared Error (MSE) as MAE penalizes errors linearly rather than quadratically.

Formally,

For computing the gradient for the loss from Mean Absolute Error, need to find the gradient of absolute function.

Gradient – Absolute function

The absolute value of function is defined as,

The derivative is

This can be compactly written as,

The absolute function is non-differentiable at point at , where the function has a sharp corner.

The concept of subderivative (or subgradient) generalises the derivative to a convex functions which are not differentiable ^{(refer wiki entry on Subderivative)}. With this definition, subderivative at lies in the interval .

Using the concept of Symmetric derivative ^{(refer wiki entry on symmetric derivative)}, the subderivtive at can be chosen as 0.

In practice, deep learning frameworks (like PyTorch, TensorFlow) and numerical methods like NumPy define . This is the subgradient, and it works fine in optimization.

Training

Deriving the Gradients

For a single training example, for the parameter of , the gradient is

Similarly for the bias term,

Training Loop – using derivatives and PyTorch

For the same example, the code for training the linear regression using Mean Absolute Error as the Loss function.

Can see that both the training loops for Mean Absolute Error (MAE) using PyTorch and Analaytic approaches converges to the same parameters and .

Summary

The post covers the following key aspects

Gradient Basics: How to deriving the gradients for loss functions Mean Square Error and Mean Absolute Error

Efficient Computation: Use of vectorized operations and PyTorch autograd

Gradient Computation: Analytical, numerical (finite difference), and PyTorch comparison

Training Loops: Implementing updates using both manual and PyTorch-based gradients

Have any questions or feedback on the gradient computation techniques? Feel free to drop your feedback in the comments section.

The post Gradients for linear regression appeared first on DSP LOG.

Migrated to Amazon EC2 instance (from shared hosting)

Krishna Sankar — Mon, 11 Mar 2013 01:20:38 +0000

Being not too happy with the speed of the shared hosting, decided to move the blog to an Amazon Elastic Compute Cloud (Amazon EC2) instance. Given this is a baby step, picked up a micro instance running an Ubuntu server and installed Apache web server, MySQL, PHP . After doing a bit of tweaking with this new instance, imported the SQL database and other files from the shared hosting and pointed the A name record to the new IP address. This switch happened over this weekend.

One particular issue which I faced was frequent crashing of MySQL due to memory limitations. Followed few online instructions to improve the situation and the current configuration seems to be holding up (but this is a cause of worry – need to figure the right solution).

Anyhow, hope you like the decreased page load time!

Some helpful links from the web:

a) How to install WordPress on Amazone EC2

b) Move WordPress site from shared hosting to Amazon EC2

c) DIY: Enable CGI on your Apache server

d) Import MySQL Dumpfile, SQL Datafile Into My Database

e) Making WordPress Stable on EC2-Micro

f) how to enable mod_rewrite in apache2.2 (debian/ubuntu)

The post Migrated to Amazon EC2 instance (from shared hosting) appeared first on DSP LOG.

GATE-2012 ECE Q28 (electromagnetics)

Krishna Sankar — Wed, 20 Feb 2013 01:50:01 +0000

Question 28 on electromagnetics from GATE (Graduate Aptitude Test in Engineering) 2012 Electronics and Communication Engineering paper.

Q28. A transmission line with a characteristic impedance of 100 is used to match a 50 section to a 200 section. If the matching is to be done both at 429MHz and 1GHz, the length of the transmission line can be approximately

(A) 82.5cm

(B) 1.05m

(C) 1.58m

(D) 1.75m

Solution

To answer this question, let us first understand the propagation in a transmission line, termination and the concept of impedance matching. The section 2.1 in Microwave Engineering, David M Pozar (buy from Amazon.com, Buy from Flipkart.com) is used as reference.

Consider a transmission line of very small length having the parameters as show in figure below.

Figure : Transmission line model (Reference Figure 2.1 in Microwave Engineering, David M Pozar (buy from Amazon.com, Buy from Flipkart.com)

is the resistance per unit length ,

is the inductance per unit length ,

is the conductance per unit length ,

is the capacitance per unit length.

Applying Kirchoff’s voltage law,

Applying Kirchoff’s current law,

Dividing the above equations by and taking the limit ,

If we assume that the inputs are sinusoidal, then the above equation can be re-written as

Substituting,

where

The solution to the above equations are,

The current on the line can be alternately expressed as,

where the characteristic impedance of the line is defined as,

The wavelength on the line is,

Loss less transmission line case

For a lossless transmission line, we can set .

Then the propagation constant reduces to, the characteristic impedance is and the voltage and current on the line can be written as,

Terminated lossless transmission line

Consider a transmission line terminated with load impedance as shown in figure below.

Figure: Transmission line with load impedance (Reference Figure 2.4 in Microwave Engineering, David M Pozar (buy from Amazon.com, Buy from Flipkart.com)

At the load , the relation between the total voltage and current is related to the load impedance

Alternatively,

The reflection coefficient is defined as the amplitude of the reflected voltage to the incident voltage,

For no reflection to happen, i.e , the load impedance should be equal to the characteristic impedance of the transmission line. The above equation captures the impedance seen at the load .

The voltage and current on the line can be represented using as,

When looking from a point from the load, the input impedance seen is,

Substituting for ,

Special case when (and it’s odd multiples)

For the case when the input impedance seen is,

This result can be used to for impedance matching.

Quarter wave transformer

Consider a circuit with load and a line with characteristic impedance connected by a transmission line of characteristic impedance with length .

Figure: Quarter Wave Matching transformer (Reference Figure 2.16 in Microwave Engineering, David M Pozar (buy from Amazon.com, Buy from Flipkart.com)

The input impedance seen is,

So if we choose , then the input impedance seen is which is the condition required for having no reflection i.e. .

One important aspect to note here is that is not guaranteed for all frequencies, but rather only for certain frequencies. The frequency dependence can be found by finding the frequencies for which .

Replacing where is the wave length corresponding to frequency ,

It can be seen that only for , the term resulting in reflection coefficient only for those frequencies.

Solving the GATE question

Applying all this to the problem at hand, we have , and .

Given that , we know that a quarter wave transformer is used to achieve impedance matching.

Now we also know that we need to match for two frequencies and .

The wavelength for each frequencies are,

The least common multiple of these two wavelength is, and the corresponding quarter wave length is .

Given than is not listed in the options, we can go for the next higher odd multiple i.e.

Based on the above, the right choice is (C) 1.58m

References

[1] GATE Examination Question Papers [Previous Years] from Indian Institute of Technology, Madras http://gate.iitm.ac.in/gateqps/2012/ec.pdf

[2] Microwave Engineering, David M Pozar (buy from Amazon.com, Buy from Flipkart.com)

The post GATE-2012 ECE Q28 (electromagnetics) appeared first on DSP LOG.

Image Rejection Ratio (IMRR) with transmit IQ gain/phase imbalance

Krishna Sankar — Thu, 31 Jan 2013 01:47:36 +0000

The post on IQ imbalance in transmitter, briefly discussed the effect of amplitude and phase imbalance and also showed that IQ imbalance results in spectrum at the image frequency. In this article, we will quantify the power of the image with respect to the desired tone (also known as IMage Rejection Ratio IMRR) for different values of gain and phase imbalance.

System Model

Consider an IQ modulator having gain of and on each arm and phase imbalance of as shown in figure below.

Figure : IQ modulator with gain and phase imbalance

The output signal is,

Considering an ideal IQ demodulator multiplying the received signal with and respectively,

Ignoring the common term and writing the base band equivalent form,

This is the model for transmit IQ imbalance.

Image Rejection Ratio (IMRR) with transmit IQ imbalance

By sending a complex sinusoidal , and by taking ratio of the power of the signal at the image frequency and desired frequency , the image rejection ratio can be computed.

Let and correspondingly, and .

Finding the component

To find the component, multiply the received signal with and integrate over period .

The power of the component is,

Finding the component

To find the component, multiply the received signal with and integrate over period .

The Image Rejection Ratio (IMRR) is

Substituting and with variable and , the equation simplifies to,

A useful approximation to IMRR

When there is no phase imbalance i.e , the equation reduces to,

When there is no gain imbalance i.e , the equation reduces to,

As these two are independent, they can be added to give an approximate value of Image Rejection Ratio.

Summarizing, the Image Rejection Ratio for a given value of gain imbalance and phase imbalance is,

Simulation Results

Simple Matlab/Octave code plotting the simulated and theoretical values of Image Rejection for different values of gain and phase imbalance.

clear; close all

N  = 64;
fm = 2;
gammadB_v = [-3:.1:3];
phiDeg_v  = [-6:.2:6];
[tt gammadB_zeroIdx ] = min(abs((gammadB_v-0)));
[tt phiDeg_zeroIdx  ] = min(abs((phiDeg_v-0)));

for (ii = 1:length(gammadB_v))
   for (jj = 1:length(phiDeg_v))
      gammadB    = gammadB_v(ii); 
      phiDeg     = phiDeg_v(jj);
      gammaLin   = 10^(gammadB/20); 
      phiRad     = phiDeg*pi/180;
      epsilonLin = gammaLin -1 ;

      % transmitted signal
      xt        = exp(j*2*pi*fm*[0:N-1]/N);
      % received signal with IQ imbalance
      xht_re    = gammaLin*cos(phiRad/2)*real(xt) + sin(phiRad/2)*imag(xt);
      xht_im    = gammaLin*sin(phiRad/2)*real(xt) + cos(phiRad/2)*imag(xt);
      xht       = xht_re + j*xht_im; 

      % taking ifft() to find the +fm and -fm components
      yF        = fft(xht,N);
      y_pfm     = yF(fm+1);
      y_nfm     = yF(N-fm+1);

      est_imrr_lin    = (abs(y_nfm)./abs(y_pfm))^2;
      theory_imrr_lin = (gammaLin^2 + 1  - 2*gammaLin*cos(phiRad))./(gammaLin^2 + 1  + 2*gammaLin*cos(phiRad));
      approx_imrr_lin = (epsilonLin^2 + phiRad^2)/4;

      est_imrr_dB(ii,jj)    = 10*log10(est_imrr_lin);
      theory_imrr_dB(ii,jj) = 10*log10(theory_imrr_lin);
      approx_imrr_dB(ii,jj) = 10*log10(approx_imrr_lin);
   end
end

figure
plot(gammadB_v,theory_imrr_dB(:,phiDeg_zeroIdx),'bs-'); hold on
plot(gammadB_v,est_imrr_dB(:,phiDeg_zeroIdx),'md-');
plot(gammadB_v,approx_imrr_dB(:,phiDeg_zeroIdx),'gx-');
xlabel('gain imbalance, dB'); ylabel('image rejection, dB'); grid on;
legend('theory','estimated','approx')
title('Image Rejection Ratio with gain imbalance alone');
axis([-3 3 -50 -10]);

figure
plot(phiDeg_v,theory_imrr_dB(gammadB_zeroIdx,:),'bs-'); hold on
plot(phiDeg_v,est_imrr_dB(gammadB_zeroIdx,:),'md-');
plot(phiDeg_v,approx_imrr_dB(gammadB_zeroIdx,:),'gx-');
xlabel('phase imbalance, degree'); ylabel('image rejection, dB'); grid on;
legend('theory','estimated','approx')
title('Image Rejection Ratio with phase imbalance alone');
axis([-6 6 -50 -20]);

Figure : Image Rejection Ratio (IMRR) with gain imbalance alone

Figure : Image Rejection Ratio (IMRR) with phase imbalance alone

Observations

1) The approximate expression holds good for reasonable values of gain and phase imbalance.

2) As a rule of thumb, the following numbers are useful :

– For 1 degree of phase imbalance, the Image Rejection Ratio (IMRR) is around -41dB

– For 1dB of gain imbalance, the Image Rejection Ratio (IMRR) is around -25dB

References

[1] Cavers, J.K.; Liao, M.W.; , “Adaptive compensation for imbalance and offset losses in direct conversion transceivers,” Vehicular Technology, IEEE Transactions on , vol.42, no.4, pp.581-588, Nov 1993 doi: 10.1109/25.260752

[2] Table of trignometric identities http://www.sosmath.com/trig/Trig5/trig5/trig5.html

The post Image Rejection Ratio (IMRR) with transmit IQ gain/phase imbalance appeared first on DSP LOG.