<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>coding.vision</title>
    <description>Dan's Programming Notebook</description>
    <link>https://codingvision.net/</link>
    <atom:link href="https://codingvision.net/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Wed, 10 Feb 2021 08:31:26 +0000</pubDate>
    <lastBuildDate>Wed, 10 Feb 2021 08:31:26 +0000</lastBuildDate>
    <generator>Jekyll v3.9.0</generator>
    
      <item>
        <title>Build Tesseract 5 in Conda Environment</title>
        <description>&lt;p&gt;Here’s a short guide to building &lt;strong&gt;Tesseract 5&lt;/strong&gt; from source (master branch on GitHub).&lt;/p&gt;

&lt;p&gt;I’m writing this mainly because, at the time of writing, conda only offers Tesseract packages up to version 4.1.1. The other reason is that the cluster I’m compiling Tesseract on runs CentOS 7 and permits only inside-environment changes, so I can’t install packages with yum.&lt;/p&gt;

&lt;h5 id=&quot;in-this-guide-im-using-gccg-version-620-it-is-recommended-to-use-recent-versions-when-compiling-tesseract-5-for-example-the-build-fails-with-gccg-485&quot;&gt;In this guide I’m using &lt;strong&gt;gcc/g++&lt;/strong&gt; version &lt;strong&gt;6.2.0&lt;/strong&gt;; it is recommended to use recent versions when compiling Tesseract 5. For example, the build fails with gcc/g++ 4.8.5.&lt;/h5&gt;

&lt;h2 id=&quot;building-steps&quot;&gt;Building Steps&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;Create your &lt;strong&gt;conda environment&lt;/strong&gt; and activate it:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;conda create &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; tess-build 
conda activate tess-build
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Install the following dependencies. You’ll need at least leptonica 1.74 for this to work - I’m using 1.78.0.
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;conda &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; conda-forge automake
conda &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; conda-forge libtool
conda &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; conda-forge pkgconfig
conda &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; conda-forge leptonica
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Clone the latest Tesseract version from the master branch and navigate into the directory:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;git clone https://github.com/tesseract-ocr/tesseract.git
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;tesseract
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Run the following scripts to prepare the build:
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;./autogen.sh
./configure
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Conda might not include the path to its libraries in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LD_LIBRARY_PATH&lt;/code&gt; environment variable. I had to add it manually; otherwise, the build fails during linking:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$LD_LIBRARY_PATH&lt;/span&gt;:~/.conda/envs/tess-build/lib
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Run the makefile:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;make
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Set the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TESSDATA_PREFIX&lt;/code&gt; environment variable so Tesseract knows where to look for language packs; also, download the &lt;strong&gt;eng&lt;/strong&gt; (default) language pack into &lt;strong&gt;tessdata&lt;/strong&gt;:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;TESSDATA_PREFIX&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;/tesseract/tessdata
wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata &lt;span class=&quot;nt&quot;&gt;-P&lt;/span&gt; tessdata/
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;See if it works:
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;(tess-build) [dan.sporici@hpsl-wn02 tesseract]$ ./tesseract -v
tesseract 5.0.0-alpha-781-gb19e3
leptonica-1.78.0
libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.0.2 : libopenjp2 2.3.1
Found AVX
Found SSE
Found OpenMP 201511
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;(tess-build) [dan.sporici@hpsl-wn02 tesseract]$ ./tesseract --list-langs
List of available languages (1):
eng
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;possible-leptonica-linking-issue&quot;&gt;Possible Leptonica Linking Issue&lt;/h2&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;/usr/bin/ld: warning: libpng16.so.16, needed by /.conda/envs/tess-build/lib/liblept.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libjpeg.so.9, needed by /.conda/envs/tess-build/lib/liblept.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libgif.so.7, needed by /.conda/envs/tess-build/lib/liblept.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libwebp.so.7, needed by /.conda/envs/tess-build/lib/liblept.so, not found (try using -rpath or -rpath-link)
/.conda/envs/tess-build/lib/liblept.so: undefined reference to `png_create_read_struct@PNG16_0'
/.conda/envs/tess-build/lib/liblept.so: undefined reference to `DGifOpen'
/.conda/envs/tess-build/lib/liblept.so: undefined reference to `png_get_PLTE@PNG16_0'
/.conda/envs/tess-build/lib/liblept.so: undefined reference to `jpeg_std_error@LIBJPEG_9.0' 
/.conda/envs/tess-build/lib/liblept.so: undefined reference to `png_write_image@PNG16_0'
/.conda/envs/tess-build/lib/liblept.so: undefined reference to `EGifPutScreenDesc'
/.conda/envs/tess-build/lib/liblept.so: undefined reference to `EGifPutComment'
/.conda/envs/tess-build/lib/liblept.so: undefined reference to `WebPEncodeRGBA'
[...]
/.conda/envs/tess-build/lib/liblept.so: undefined reference to `png_init_io@PNG16_0'
collect2: error: ld returned 1 exit status
make[2]: *** [tesseract] Error 1
make[2]: Leaving directory `/tesseract'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/tesseract'
make: *** [all] Error 2
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This happens because the libraries in question (&lt;strong&gt;libpng16.so&lt;/strong&gt;, &lt;strong&gt;libjpeg.so&lt;/strong&gt;, &lt;strong&gt;libgif.so&lt;/strong&gt;, &lt;strong&gt;libwebp.so&lt;/strong&gt;) are not found in the directories included in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LD_LIBRARY_PATH&lt;/code&gt;.
If step 5 doesn’t work (although it should), you might be able to work around this by modifying the &lt;strong&gt;Makefile&lt;/strong&gt; and adding the libraries yourself after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-llept&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-make highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nv&quot;&gt;LEPTONICA_LIBS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt;/.conda/envs/tess-build/lib &lt;span class=&quot;nt&quot;&gt;-llept&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lz&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lpng16&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-ljpeg&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lgif&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-lwebp&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you follow this approach, you need to copy the libraries to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tesseract/.libs&lt;/code&gt;; otherwise you’ll get:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;(tess-build) [dan.sporici@hpsl-wn02 tesseract]$ ./tesseract
/tesseract/.libs/lt-tesseract: error while loading shared libraries: liblept.so.5: cannot open shared object file: No such file or directory
/tesseract/.libs/lt-tesseract: error while loading shared libraries: libpng16.so.16: cannot open shared object file: No such file or directory
/tesseract/.libs/lt-tesseract: error while loading shared libraries: libjpeg.so.9: cannot open shared object file: No such file or directory
/tesseract/.libs/lt-tesseract: error while loading shared libraries: libgif.so.7: cannot open shared object file: No such file or directory 
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That’s all; I hope this helps.&lt;/p&gt;
</description>
        <pubDate>Tue, 15 Sep 2020 21:45:05 +0000</pubDate>
        <link>https://codingvision.net/build-tesseract-5-in-conda-environment</link>
        <guid isPermaLink="true">https://codingvision.net/build-tesseract-5-in-conda-environment</guid>
        
        <category>tesseract</category>
        
        <category>conda</category>
        
        <category>ocr</category>
        
        
      </item>
    
      <item>
        <title>PyTorch CRNN: Seq2Seq Digits Recognition w/ CTC</title>
        <description>&lt;p&gt;This article discusses handwritten character recognition (&lt;strong&gt;OCR&lt;/strong&gt;) in images using &lt;em&gt;sequence-to-sequence&lt;/em&gt; (&lt;strong&gt;seq2seq&lt;/strong&gt;) mapping performed by a &lt;em&gt;Convolutional Recurrent Neural Network&lt;/em&gt; (&lt;strong&gt;CRNN&lt;/strong&gt;) trained with &lt;em&gt;Connectionist Temporal Classification&lt;/em&gt; (&lt;strong&gt;CTC&lt;/strong&gt;) loss. The aforementioned approach is employed in multiple modern OCR engines for handwritten text (e.g., &lt;a href=&quot;https://arxiv.org/pdf/1902.10525.pdf&quot; rel=&quot;nofollow&quot;&gt;Google’s Keyboard App&lt;/a&gt; - convolutions are replaced with Bezier interpolations) or typed text (e.g., &lt;a href=&quot;https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/6ModernizationEfforts.pdf&quot; rel=&quot;nofollow&quot;&gt;Tesseract 4’s CRNN Based Recognition Module&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;For the sake of simplicity, the example I’ll be presenting performs only digit recognition but can be easily extended to accommodate more classes of characters.&lt;/p&gt;

&lt;h5 id=&quot;the-overall-source-code-for-this-project-is-quite-long-so-im-providing-a-google-colab-document-that-includes-a-working-example&quot;&gt;The overall source code for this project is quite long so I’m providing a &lt;a href=&quot;https://colab.research.google.com/drive/1VRyObLgslpzeB33xITPdm_3E2cAxLuX3?usp=sharing&quot; rel=&quot;nofollow&quot;&gt;Google Colab&lt;/a&gt; document that includes a working example.&lt;/h5&gt;

&lt;h2 id=&quot;previous-inadequacies-and-justification&quot;&gt;Previous Inadequacies and Justification&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Why not simply segment characters in the image and recognize them one by one?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While this approach is indeed more straightforward and has been used in older OCR engines, it has its caveats, especially for handwritten text. Imperfections in the written characters can cause segmentation errors, so the recognizer ends up classifying invalid glyphs or symbols. Consider the following images for clarification:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/pytorch-crnn-seq2seq-digits-recognition/fragmented-characters.png&quot; alt=&quot;A fragmented '5' is segmented as 2 different characters that are later passed to the recognition module. &quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;A fragmented ‘5’ is segmented as 2 different characters that are later passed to the recognition module.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/pytorch-crnn-seq2seq-digits-recognition/merged-characters.png&quot; alt=&quot;The first 2 digits are 'merged' together and considered a single character by both segmentation mechanism and OCR engine.&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;The first 2 digits are ‘merged’ together and considered a single character by both segmentation mechanism and OCR engine.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Although the MNIST problem is considered solved, implying that reliable classifiers can be built to recognize individual digits, correct segmentation remains an open problem in realistic scenarios. Splitting or merging glyphs to form valid digits is a difficult challenge and requires additional knowledge to be embedded into the segmentation module.&lt;/p&gt;

&lt;h2 id=&quot;seq2seq-classifications&quot;&gt;Seq2Seq Classifications&lt;/h2&gt;

&lt;p&gt;In this context, the main advantage brought by a &lt;strong&gt;seq2seq&lt;/strong&gt; classifier is that it diminishes the impact of erroneous segmentations and takes advantage of the ability of a neural network to generalize. It only requires a valid segmentation of the word or text line in question.&lt;/p&gt;

&lt;p&gt;Consider the following simplistic model that has a &lt;strong&gt;sliding window&lt;/strong&gt; or &lt;strong&gt;mask&lt;/strong&gt; (no convolutions), of size &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(1, img_height)&lt;/code&gt;. Each set of pixels covered by the sliding window is fed into a neural network made out of neurons with &lt;strong&gt;memory&lt;/strong&gt; (e.g., &lt;strong&gt;GRU&lt;/strong&gt; or &lt;strong&gt;LSTM&lt;/strong&gt;); the job of the neural network is to take a sequence of such stripes and output recognized digits. Take a look at the following figure:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/pytorch-crnn-seq2seq-digits-recognition/one-digit-rnn.png&quot; alt=&quot;The RNN learns to recognize the digit '5' only by seeing stripes of width equal to 1 of the digit in question - think of it as a time series; by combining information from previous and current inputs, the RNN can determine the correct class.&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;The RNN learns to recognize the digit ‘5’ only by seeing stripes of width equal to 1 of the digit in question - think of it as a time series; by combining information from previous and current inputs, the RNN can determine the correct class.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Multiple digits will be included in a single sequence, because we’re feeding the network an image which contains more than one digit. It is up to the neural network to determine, during the training phase, how many stripes to take into account when classifying a digit (i.e., how much to memorize). The image below illustrates how an RNN should ‘group’ stripes together in order to recognize each digit in the sequence.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/pytorch-crnn-seq2seq-digits-recognition/rnn-ctc-ocr.png&quot; alt=&quot;The RNN receives sequences of 'vertical' arrays of pixels (stripes) covered by the sliding window of width equal to 1; once trained, the RNN will be able to memorize that certain sequences of arrays (here in colors) form specific digits and properly separate multiple digits (i.e., 'change the colors') even though they are merged in the given image.&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;The RNN receives sequences of ‘vertical’ arrays of pixels (stripes) covered by the sliding window of width equal to 1; once trained, the RNN will be able to memorize that certain sequences of arrays (here in colors) form specific digits and properly separate multiple digits (i.e., ‘change the colors’) even though they are merged in the given image.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Using this method, it is possible to train a neural network by simply saying that the image above contains the number ‘&lt;strong&gt;55207&lt;/strong&gt;’, without further information (e.g., alignment, delimiters, bounding boxes, etc.).&lt;/p&gt;

&lt;h2 id=&quot;ctc-and-duplicates-removal&quot;&gt;CTC and Duplicates Removal&lt;/h2&gt;

&lt;p&gt;CTC loss is most commonly employed to train seq2seq RNNs. It works by &lt;strong&gt;summing&lt;/strong&gt; the &lt;strong&gt;probabilities for all possible alignments&lt;/strong&gt;; the &lt;strong&gt;probability of an alignment&lt;/strong&gt; is determined by &lt;strong&gt;multiplying&lt;/strong&gt; the probabilities of having specific digits in certain slots. An alignment can be seen as a plausible sequence of recognized digits.&lt;/p&gt;

&lt;p&gt;Going back to the ‘&lt;strong&gt;55207&lt;/strong&gt;’ example, we can express the probability of the alignment \(A_{55207}\) as follows:&lt;/p&gt;

\[P(A_{55207}) = P(A_1 = 5) \cdot P(A_2 = 5) \cdot P(A_3 = 2) \cdot P(A_4 = 0) \cdot P(A_5 = 7)\]

&lt;p&gt;To properly remove duplicates and also correctly handle numbers that contain repeating digits, the &lt;strong&gt;blank&lt;/strong&gt; class is introduced, with the following rules:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;2 (or more) &lt;strong&gt;repeating digits&lt;/strong&gt; are &lt;strong&gt;collapsed&lt;/strong&gt; into a single instance of that digit unless separated by &lt;strong&gt;blank&lt;/strong&gt; - this compensates for the fact that the RNN performs a classification for each stripe that represents a part of a digit (thus producing duplicates)&lt;/li&gt;
  &lt;li&gt;multiple &lt;strong&gt;consecutive blanks&lt;/strong&gt; are &lt;strong&gt;collapsed&lt;/strong&gt; into one blank - this compensates for the spacing before, after or between the digits&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Given these aspects, there are multiple alignments that, once collapsed, lead to the correct alignment (‘&lt;strong&gt;55207&lt;/strong&gt;’).&lt;/p&gt;
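The two collapse rules can be sketched as a short helper function (illustrative only, not part of any library; ‘-’ stands for the blank class):

```python
def ctc_collapse(alignment, blank='-'):
    """Collapse a raw CTC alignment: merge repeated symbols (rule 1),
    then drop blanks (rule 2)."""
    out = []
    prev = None
    for sym in alignment:
        # skip repeats of the previous symbol; a blank in between
        # resets `prev`, so '5-5' correctly keeps both fives
        if sym != prev:
            if sym != blank:
                out.append(sym)
        prev = sym
    return ''.join(out)

print(ctc_collapse('55-55222--07'))  # '55207'
```

Running the helper on the alignment from the example above recovers the intended sequence.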

&lt;p&gt;For example:
&lt;strong&gt;55-55222--07&lt;/strong&gt;, once collapsed, leads to ‘&lt;strong&gt;55207&lt;/strong&gt;’ and suggests the correct sequence even though it has a different structure because of additional duplicates and blanks (marked as ‘&lt;strong&gt;-&lt;/strong&gt;’ here). The probability of this alignment (\(A_{55-55222--07}\)) is computed as previously shown, but it also includes the probabilities of the blank class:&lt;/p&gt;

\[P(A_{55-55222--07}) = P(A_1 = 5) \cdot P(A_2 = 5) \cdot P(A_3 = -) \cdot P(A_4 = 5) \cdot P(A_5 = 5) \cdot P(A_6 = 2) \cdot P(A_7 = 2) \cdot P(A_8 = 2) \cdot P(A_9 = -) \cdot P(A_{10} = -) \cdot P(A_{11} = 0) \cdot P(A_{12} = 7)\]

&lt;p&gt;Finally, the CTC probability of a sequence is calculated, as previously mentioned, by summing the probabilities for all different alignments:&lt;/p&gt;

\[P(S_{55207}) = \sum_{A \in Alignments(55207)}{P(A)}\]

&lt;p&gt;When training, the neural network attempts to maximize this probability for the sequence provided as ground truth.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;decoding&lt;/strong&gt; method is used to recover the text from a set of digit probabilities; a naive approach is to pick, for &lt;strong&gt;each slot&lt;/strong&gt; in the &lt;strong&gt;alignment&lt;/strong&gt;, the digit with the &lt;strong&gt;highest probability&lt;/strong&gt; and then collapse the result. This approach is easier to implement and might be enough for this example, although &lt;strong&gt;beam search&lt;/strong&gt; (which keeps the N most probable partial sequences at each step, instead of only one) is employed for such tasks in larger projects.&lt;/p&gt;
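The naive best-path decoding can be sketched in plain Python; the probability rows below are made up for illustration (11 classes: digits 0-9 plus a blank at index 10):

```python
def greedy_ctc_decode(probs, blank=10):
    """Best-path CTC decoding: argmax class per time step,
    collapse repeats, then drop blanks."""
    path = [max(range(len(p)), key=p.__getitem__) for p in probs]
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out

# toy per-slot probability rows peaked at a chosen class
def one_hotish(c):
    row = [0.01] * 11
    row[c] = 0.9
    return row

probs = [one_hotish(c) for c in [5, 5, 10, 5, 2, 10]]
print(greedy_ctc_decode(probs))  # [5, 5, 2]
```

Note that the blank between the two runs of fives is what lets the repeated digit survive the collapse.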

&lt;h2 id=&quot;including-convolutional-layers&quot;&gt;Including Convolutional Layers&lt;/h2&gt;

&lt;p&gt;Implementing convolutions in the previously described model simply implies that raw pixel information is replaced, in the input of the RNN, with higher level features. In PyTorch, the output of the convolution layers must be reshaped to the time sequence format &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(batch_size, sequence_length, gru_input_size)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the current project, the output of the convolution part has the following shape: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(batch_size, num_channels, convolved_img_height, convolved_img_width)&lt;/code&gt;. I’m permuting the tensor to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(batch_size, convolved_img_width, convolved_img_height, num_channels)&lt;/code&gt; and then reshaping the last 2 dimensions into one, which becomes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gru_input_size&lt;/code&gt;.&lt;/p&gt;
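The permute-and-reshape bookkeeping can be illustrated with NumPy on hypothetical sizes (in the actual PyTorch code this would be a `permute` followed by a `reshape`/`view`):

```python
import numpy as np

batch_size, num_channels, conv_h, conv_w = 4, 32, 5, 50  # hypothetical sizes
x = np.zeros((batch_size, num_channels, conv_h, conv_w))

# (batch, channels, H, W) -> (batch, W, H, channels): width becomes the time axis
x = x.transpose(0, 3, 2, 1)

# merge the last two dims: each time step is one column of features
gru_input_size = conv_h * num_channels
x = x.reshape(batch_size, conv_w, gru_input_size)

print(x.shape)  # (4, 50, 160)
```

Each of the 50 time steps now carries a 160-dimensional feature vector, which is exactly what the GRU expects as input.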

&lt;h2 id=&quot;dataset-generation&quot;&gt;Dataset Generation&lt;/h2&gt;

&lt;p&gt;To avoid additional steps such as image preprocessing, segmentation, and class balancing, I picked a friendlier dataset: &lt;strong&gt;EMNIST&lt;/strong&gt; for digits. The following helper script randomly picks digits from the dataset, applies affine augmentations, and concatenates them into sequences of a given length.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/pytorch-crnn-seq2seq-digits-recognition/dataset-example.png&quot; alt=&quot;Dataset example for the seq2seq CRNN - Input and Ground Truth&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Dataset example for the seq2seq CRNN - Input and Ground Truth&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;crnn-model&quot;&gt;CRNN Model&lt;/h2&gt;

&lt;p&gt;A LeNet-5 based convolution model is employed, with the following modifications:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;5x5 filters are replaced with 2 consecutive 3x3 filters&lt;/li&gt;
  &lt;li&gt;max-pooling is replaced with strided convolutions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The resulting higher-level features are fed into a &lt;strong&gt;Bi-GRU&lt;/strong&gt; RNN with a final &lt;strong&gt;linear&lt;/strong&gt; layer which has &lt;strong&gt;10&lt;/strong&gt; + 1 possible outputs ([0-9] digits + blank). I’ve chosen &lt;strong&gt;GRU&lt;/strong&gt; over &lt;strong&gt;LSTM&lt;/strong&gt; since it gave similar results but required fewer resources. A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_softmax&lt;/code&gt; activation function is used in the final layer since the loss function (PyTorch’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CTCLoss&lt;/code&gt;) requires a logarithmized version of the output; this also provides better numerical properties, as it heavily penalizes incorrect classifications.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CRNN&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;nb&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CRNN&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_classes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;image_H&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InstanceNorm2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InstanceNorm2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InstanceNorm2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InstanceNorm2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InstanceNorm2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv6&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stride&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in6&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InstanceNorm2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;postconv_height&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;postconv_width&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;31&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_input_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;postconv_height&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_hidden_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;128&lt;/span&gt; 
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_num_layers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_h&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_cell&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GRU&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_input_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_hidden_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_num_layers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_first&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bidirectional&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Linear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_hidden_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_classes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;forward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;leaky_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;leaky_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;leaky_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;leaky_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;leaky_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;leaky_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;permute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reshape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_input_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gru_h&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_h&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gru_h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;detach&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;log_softmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])])&lt;/span&gt;

        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;reset_hidden&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;h&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zeros&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_num_layers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_hidden_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gru_h&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Variable&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;crnn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CRNN&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;criterion&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CTCLoss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;blank&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reduction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'mean'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zero_infinity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;optimizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;optim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Adam&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crnn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When performing backpropagation, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CTCLoss&lt;/code&gt; method will take the following parameters:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_probabilities&lt;/code&gt; - this is the output from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_softmax&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;targets&lt;/code&gt; - a tensor which contains the expected sequence of digits&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;input_lengths&lt;/code&gt; - the length of the input sequence after it is processed by the convolutional layers (i.e. the post-convolution width)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_lengths&lt;/code&gt; - the length of the target sequence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last two parameters (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;input_lengths&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target_lengths&lt;/code&gt;) instruct the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CTCLoss&lt;/code&gt; function to ignore additional padding (in case you added padding to the images or the target sequences to fit them into a batch).&lt;/p&gt;
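&lt;p&gt;As a minimal sketch, a call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CTCLoss&lt;/code&gt; with these four arguments can look like the following (the shapes are illustrative assumptions, using the blank index 10 and the 31-timestep post-convolution width from the model above):&lt;/p&gt;

```python
import torch
import torch.nn as nn

# Hypothetical shapes: T=31 timesteps (post-conv width),
# N=2 batch samples, C=11 classes (10 digits + blank at index 10).
T, N, C = 31, 2, 11
criterion = nn.CTCLoss(blank=10, reduction='mean', zero_infinity=True)

log_probs = torch.randn(T, N, C).log_softmax(2)         # (T, N, C) log-probabilities
targets = torch.randint(0, 10, (N, 5))                  # each target: a 5-digit sequence
input_lengths = torch.full((N,), T, dtype=torch.long)   # full post-conv width, no padding
target_lengths = torch.full((N,), 5, dtype=torch.long)  # true length of each target

loss = criterion(log_probs, targets, input_lengths, target_lengths)
```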

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_probabilities&lt;/code&gt; will look like a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(T, C)&lt;/code&gt;-shaped tensor (T = number of timesteps, C = number of classes) and specifies, for each timestep, the probability of it belonging to each class. This tensor is decoded into text using a &lt;strong&gt;best path&lt;/strong&gt; (greedy) approach: for each timestep, this algorithm picks the class with the maximum probability while also collapsing multiple occurrences of the same character into one (unless they’re separated by a blank).&lt;/p&gt;
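&lt;p&gt;A best-path decoder can be sketched in a few lines (the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;best_path_decode&lt;/code&gt; is mine, not from the notebook; the blank index 10 matches the model above):&lt;/p&gt;

```python
import torch

def best_path_decode(log_probs, blank=10):
    # log_probs: a (T, C) tensor for a single sample
    best = log_probs.argmax(dim=1).tolist()  # most likely class per timestep
    decoded, prev = [], None
    for c in best:
        if c != blank and c != prev:  # collapse repeats, drop blanks
            decoded.append(c)
        prev = c
    return decoded

# Example: timesteps predicting 1, 1, blank, 1, 2 decode to [1, 1, 2]
probs = torch.full((5, 11), -10.0)
for t, c in enumerate([1, 1, 10, 1, 2]):
    probs[t, c] = 0.0
print(best_path_decode(probs))  # [1, 1, 2]
```

<p>Note how the blank between the second and third timesteps keeps the repeated 1s from collapsing into a single digit.</p>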

&lt;p&gt;In my implementation, I’ve used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y_pred.permute(1, 0, 2)&lt;/code&gt; to reorder the CRNN’s output so it matches the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CTCLoss&lt;/code&gt;’s desired input format.&lt;/p&gt;

&lt;p&gt;Another aspect you should pay attention to is resetting the &lt;strong&gt;hidden state&lt;/strong&gt; of the GRU layers (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;crnn.reset_hidden(batch_size)&lt;/code&gt;) before recognizing any new sequence; in my experience this provided better results.&lt;/p&gt;
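&lt;p&gt;The per-batch order of operations can be summarized as follows (a minimal sketch: a random tensor stands in for the CRNN’s output, and the shapes are assumptions based on the batch size and post-convolution width above):&lt;/p&gt;

```python
import torch

# Hypothetical shapes: N=4 samples, T=31 timesteps, C=11 classes.
N, T, C = 4, 31, 11

# 1. reset the GRU hidden state before each new batch,
#    i.e. crnn.reset_hidden(N) in the model above
# 2. run the forward pass; here a dummy tensor stands in for crnn(batch)
y_pred = torch.randn(N, T, C).log_softmax(2)   # (batch, time, classes)

# 3. reorder to the (T, N, C) layout that CTCLoss expects
log_probs = y_pred.permute(1, 0, 2)
assert log_probs.shape == (T, N, C)
```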

&lt;p&gt;Feel free to check the code on my Google colab (link above) for further details.&lt;/p&gt;

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;I’ve tested the model using 10,000 generated sequences: 8,000 for training and 2,000 for testing. Below are the plots for training and testing loss, and also the evolution of &lt;strong&gt;precision&lt;/strong&gt; - I’m assuming the dataset is approximately balanced. A &lt;em&gt;true positive&lt;/em&gt; (&lt;strong&gt;TP&lt;/strong&gt;) is counted only when the recognized sequence entirely matches the ground truth. The results are not ideal but I think the current model represents a decent starting point for larger projects.&lt;/p&gt;
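&lt;p&gt;Under this exact-match definition, precision reduces to the fraction of sequences decoded perfectly (the helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sequence_precision&lt;/code&gt; and the toy sequences below are illustrative, not from the notebook):&lt;/p&gt;

```python
# A sequence counts as a true positive only when the decoded digits
# match the ground truth exactly.
def sequence_precision(predictions, targets):
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(targets)

# Toy illustration with hypothetical decoded sequences:
preds = [[1, 2], [3, 4], [5, 6]]
truth = [[1, 2], [3, 0], [5, 6]]
print(sequence_precision(preds, truth))  # 2/3
```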

&lt;p&gt;The CRNN exhibits some overfitting, but the results are acceptable considering its purpose.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/pytorch-crnn-seq2seq-digits-recognition/loss-plot.png&quot; alt=&quot;Loss Evolution after 6 epochs&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Loss Evolution after 6 epochs&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/pytorch-crnn-seq2seq-digits-recognition/precision-plot.png&quot; alt=&quot;Precision Evolution after 6 epochs&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Precision Evolution after 6 epochs&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;After 6 epochs, the CRNN successfully recognizes &lt;strong&gt;7567&lt;/strong&gt; out of &lt;strong&gt;8000&lt;/strong&gt; sequences in the training set and &lt;strong&gt;1776&lt;/strong&gt; out of &lt;strong&gt;2000&lt;/strong&gt; from the testing set.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c&quot; rel=&quot;nofollow&quot;&gt;An Intuitive Explanation of Connectionist Temporal Classification&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://actamachina.com/notebooks/2019/03/28/captcha.html&quot; rel=&quot;nofollow&quot;&gt;Solving CAPTCHA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Thu, 30 Jul 2020 21:45:05 +0000</pubDate>
        <link>https://codingvision.net/pytorch-crnn-seq2seq-digits-recognition-ctc</link>
        <guid isPermaLink="true">https://codingvision.net/pytorch-crnn-seq2seq-digits-recognition-ctc</guid>
        
        <category>pytorch</category>
        
        <category>ocr</category>
        
        <category>ctc</category>
        
        <category>python</category>
        
        <category>conv-neural-network</category>
        
        
      </item>
    
      <item>
        <title>Improving Tesseract 4's OCR Accuracy through Image Preprocessing</title>
        <description>&lt;p&gt;In this work I took a look at Tesseract 4’s performance at recognizing characters from a challenging dataset and proposed a minimalistic convolution-based approach for input image preprocessing that can boost the character-level &lt;strong&gt;accuracy&lt;/strong&gt; from &lt;strong&gt;13.4%&lt;/strong&gt; to &lt;strong&gt;61.6%&lt;/strong&gt; (+359% relative change), and the &lt;strong&gt;F1 score&lt;/strong&gt; from &lt;strong&gt;16.3%&lt;/strong&gt; to &lt;strong&gt;72.9%&lt;/strong&gt; (+347% relative change) on the aforementioned dataset. The convolution kernels are determined using reinforcement learning; moreover, to simulate the lack of ground truth in realistic scenarios, the &lt;strong&gt;training set&lt;/strong&gt; consists of only &lt;strong&gt;30&lt;/strong&gt; images while the &lt;strong&gt;testing set&lt;/strong&gt; includes &lt;strong&gt;10,000&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The dataset in question is called &lt;a href=&quot;https://pero.fit.vutbr.cz/brno_mobile_ocr_dataset&quot; rel=&quot;nofollow&quot;&gt;Brno Mobile&lt;/a&gt;, and contains color photographs of typed text taken with handheld devices. Factors such as blurriness, low resolution, poor contrast, and uneven brightness make the images challenging for an OCR engine.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/improving-tesseract-4-ocr-accuracy-through-image-preprocessing/dataset-sample.webp&quot; alt=&quot;Resized image from the Brno dataset which contains text that was not recognized by Tesseract 4 during the evaluation (an empty string was returned)&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Resized image from the Brno dataset which contains text that was not recognized by Tesseract 4 during the evaluation (an empty string was returned)&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;During this experiment, the &lt;em&gt;out of the box&lt;/em&gt; version of Tesseract 4 was used, which implies:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;no retraining of the OCR engine&lt;/li&gt;
  &lt;li&gt;no lexicon / dictionary augmentations&lt;/li&gt;
  &lt;li&gt;no hints about the language used in the dataset&lt;/li&gt;
  &lt;li&gt;no hints about segmentation methods; default (automatic) segmentation is used&lt;/li&gt;
  &lt;li&gt;default settings for the recognition engine (LSTM + Tesseract)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;problem-analysis&quot;&gt;Problem Analysis&lt;/h2&gt;

&lt;p&gt;Tesseract 4 has demonstrated strong performance when tested on favorable datasets, achieving a good balance between precision and recall. Presumably, such evaluations are performed on images that resemble scanned documents or book pages (with or without additional preprocessing), in which camera-caused distortions are minimal. Tests on the Brno dataset led to much worse performance, which will be discussed later in the article.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/improving-tesseract-4-ocr-accuracy-through-image-preprocessing/tesseract-stats.webp&quot; alt=&quot;Tesseract 4's performance when evaluated using the Google Books Dataset - taken from [DAS 2016](https://github.com/tesseract-ocr/docs/tree/master/das_tutorial2016){:rel='nofollow'}&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Tesseract 4’s performance when evaluated using the Google Books Dataset - taken from &lt;a href=&quot;https://github.com/tesseract-ocr/docs/tree/master/das_tutorial2016&quot; rel=&quot;nofollow&quot;&gt;DAS 2016&lt;/a&gt;&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In the above figure, a high &lt;strong&gt;precision&lt;/strong&gt; indicates a favorable &lt;em&gt;True-Positives&lt;/em&gt; to &lt;em&gt;False-Positives&lt;/em&gt; ratio, thus revealing proper differentiation between characters (i.e. a relatively small number of misclassifications). Despite this, almost no improvement in &lt;strong&gt;recall&lt;/strong&gt; can be observed when switching from the &lt;strong&gt;base&lt;/strong&gt; classification method to the &lt;em&gt;Long Short-Term Memory&lt;/em&gt; (&lt;strong&gt;LSTM&lt;/strong&gt;) based &lt;em&gt;Convolutional Recurrent Neural Network&lt;/em&gt; (&lt;strong&gt;CRNN&lt;/strong&gt;) for &lt;em&gt;sequence to sequence&lt;/em&gt; mapping.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Despite being designed over 20 years ago, the current Tesseract classifier is incredibly difficult to beat with so-called modern methods.” - Ray Smith, author of Tesseract&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I assume that further training on different fonts might not provide significant improvements, and neither will a different classifier model. &lt;em&gt;Is there a chance that the classifier doesn’t receive the correct input?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It was pointed out in a previous article that &lt;a href=&quot;https://codingvision.net/ai/evaluating-the-robustness-of-ocr-systems&quot;&gt;Tesseract is not robust to noise&lt;/a&gt;; certain &lt;em&gt;salt-and-pepper&lt;/em&gt; noise patterns disrupt the character recognition process, leading to large segments of text being completely ignored by the OCR engine - the infamous &lt;strong&gt;empty string&lt;/strong&gt;. From empirical observations, these errors seem to occur either for a whole word or sentence or not at all, suggesting a weakness in the segmentation methodology.&lt;/p&gt;

&lt;p&gt;Whether similar behavior also appears on images with more natural distortions remained an open question - hence this experiment.&lt;/p&gt;

&lt;h2 id=&quot;black-box-considerations&quot;&gt;Black-box Considerations&lt;/h2&gt;

&lt;p&gt;Since analyzing Tesseract’s segmentation methods is a daunting task, I opted for an adaptive, external image-correction approach. To avoid diving into Tesseract 4’s source code, the OCR engine is treated as a black box; in this case, an unsupervised learning method must be employed. This ensures easier transitions to other OCR engines, as it doesn’t rely on concrete implementations but only on outputs - at the cost of processing power and optimality.&lt;/p&gt;

&lt;h2 id=&quot;proposed-solution&quot;&gt;Proposed Solution&lt;/h2&gt;
&lt;p&gt;The solution consists of directly preprocessing images before they are fed to Tesseract 4. An adaptive preprocessing operation is required in order to properly compensate for any image features that cause problems in the segmentation process. In other words, an input image must be adapted so that it complies with Tesseract 4’s preferences and maximizes the chance of producing the correct output, preferably without performing down-sampling.&lt;/p&gt;

&lt;p&gt;I chose a convolution-based approach for flexibility and speed; other articles tend to perform more rigid image adjustments (such as global changes in brightness, fixed-constant conversion to grayscale, histogram equalization, etc.), while I preferred an approach that can learn to highlight or mask regions of the image according to various features. For this, the kernels are optimized through reinforcement learning with an actor-critic model. To be more specific, it relies on &lt;em&gt;Twin Delayed Deep Deterministic Policy Gradient&lt;/em&gt; (&lt;strong&gt;TD3&lt;/strong&gt; for short) to discover features which minimize the &lt;em&gt;Levenshtein distance&lt;/em&gt; between the &lt;strong&gt;recognized text&lt;/strong&gt; and the &lt;strong&gt;ground truth&lt;/strong&gt;. I won’t dive into the implementation details of TD3 here, as that would be somewhat out of scope, but think of it as a method of optimizing the following formula:&lt;/p&gt;

\[\max_{K_1,K_2,K_3,K_4,K_5}\sum_{i=1}^{N}{-Levenshtein(OCR(Image_i * K_1 * K_2 * K_3 * K_4 * K_5),Text_i)}\]

&lt;p&gt;Where \(K_j\) is a kernel, and \(&amp;lt;Image_i, Text_i&amp;gt;\) is a tuple from the training set.&lt;/p&gt;
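
&lt;p&gt;The reward term in the objective above can be sketched in plain Python - a minimal sketch, with a hypothetical &lt;code&gt;reward&lt;/code&gt; helper standing in for the full OCR-in-the-loop setup:&lt;/p&gt;

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def reward(recognized_text, ground_truth):
    """Negative edit distance: maximized when the OCR output equals the label."""
    return -levenshtein(recognized_text, ground_truth)
```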

&lt;h5 id=&quot;a-short-simpler-proof-of-concept-of-the-convolutional-preprocessor-is-presented-in-this-google-colab-it-uses-a-different-architecture-than-the-final-one-and-has-the-purpose-of-verifying-if-the-idea-of-using-convolutions-is-feasible-and-offers-good-results-a-comparison-is-presented-between-original-and-preprocessed-images-including-recognized-texts-for-each-sample&quot;&gt;A short (simpler) proof of concept of the convolutional preprocessor is presented in &lt;a href=&quot;https://colab.research.google.com/drive/1l0qT2S3tkY4WHTRbkVK_J5jATPg0t41-?usp=sharing&quot; rel=&quot;nofollow&quot;&gt;this Google Colab&lt;/a&gt;. It uses a different architecture than the final one and has the purpose of verifying if the idea of using convolutions is feasible and offers good results. A comparison is presented between original and preprocessed images including recognized texts for each sample.&lt;/h5&gt;

&lt;p&gt;The final model is illustrated below, with &lt;strong&gt;ReLU&lt;/strong&gt; activations after each convolution to capture nonlinearities and prevent negative pixel values.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/improving-tesseract-4-ocr-accuracy-through-image-preprocessing/convolutional-preprocessor.webp&quot; alt=&quot;Architecture of the Convolutional Preprocessor used to adapt images for Tesseract 4&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Architecture of the Convolutional Preprocessor used to adapt images for Tesseract 4&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;To properly compensate for image coloring and reduce the number of channels (&lt;span style=&quot;color:red&quot;&gt;R&lt;/span&gt;, &lt;span style=&quot;color:green&quot;&gt;G&lt;/span&gt;, &lt;span style=&quot;color:blue&quot;&gt;B&lt;/span&gt;) to one, 1x1 convolutions are used. This limits overfitting to some extent while also ensuring a grayscale output. Further convolutions are applied only to the grayscale image.&lt;/p&gt;
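
&lt;p&gt;A 1x1 convolution across channels is just a learned per-pixel weighted sum - a minimal sketch in plain Python, with made-up weights:&lt;/p&gt;

```python
def conv1x1_to_grayscale(image, w_r, w_g, w_b):
    """Apply a 1x1 convolution across channels: each RGB pixel is
    collapsed into a single value via a learned weighted sum.

    `image` is a nested list of (r, g, b) tuples; the three weights
    are the trainable parameters of the 1x1 kernel.
    """
    return [[w_r * r + w_g * g + w_b * b for (r, g, b) in row]
            for row in image]

# a 1x2 "image": with weights (0.5, 0.25, 0.25) each RGB pixel
# becomes a single grayscale intensity
gray = conv1x1_to_grayscale([[(100, 50, 30), (0, 0, 0)]], 0.5, 0.25, 0.25)
```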

&lt;p&gt;&lt;em&gt;Symmetry constraints&lt;/em&gt; are additionally enforced for each 3x3 kernel in order to minimize the number of trainable parameters and avoid overfitting: for a 3x3 kernel, only 6 of the 9 values must be determined, while the rest are generated through &lt;em&gt;mirroring&lt;/em&gt;. Below are the values I obtained for the five kernels (bold to emphasize symmetry):&lt;/p&gt;

&lt;table class=&quot;data-table&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;#1&lt;/th&gt;
      &lt;th&gt;#2&lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;#3&lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;color:red&quot;&gt;&lt;strong&gt;0.7&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;0.2573&lt;/td&gt;
      &lt;td&gt;-0.3&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;-0.2996&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;color:green&quot;&gt;&lt;strong&gt;1.3&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;0.3&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;1.3&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;-0.295&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;1.2949&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;span style=&quot;color:blue&quot;&gt;&lt;strong&gt;1.3&lt;/strong&gt;&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;0.2573&lt;/td&gt;
      &lt;td&gt;-0.3&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;-0.2802&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;0.2922&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;-0.2802&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;table class=&quot;data-table&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;#4&lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;#5&lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;-0.2793&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;0.2395&lt;/td&gt;
      &lt;td&gt;0.2885&lt;/td&gt;
      &lt;td&gt;-0.294&lt;/td&gt;
      &lt;td&gt;-0.2905&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;-0.2939&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;0.2395&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;0.7119&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;1.162&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;-0.2905&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;0.2885&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;-0.2828&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;-0.2328&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;0.3&lt;/td&gt;
      &lt;td&gt;-0.294&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
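
&lt;p&gt;The mirroring idea can be sketched for one of the symmetry axes - a minimal sketch illustrating top-bottom mirroring, as in kernel #2 above (other kernels mirror along different axes):&lt;/p&gt;

```python
def mirror_kernel(top_row, middle_row):
    """Build a 3x3 kernel with top-bottom symmetry: the bottom row
    mirrors the top one, so only 6 of the 9 values are trainable.
    """
    return [list(top_row), list(middle_row), list(top_row)]

# kernel #2 from the table above: 6 free parameters, 3 mirrored
k2 = mirror_kernel((0.2573, -0.3, 0.3), (0.3, 1.3, -0.295))
```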

&lt;h2 id=&quot;preprocessing-results&quot;&gt;Preprocessing Results&lt;/h2&gt;

&lt;p&gt;I extracted the image from each convolution layer and clamped its values to the &lt;em&gt;0-255&lt;/em&gt; interval to properly view each transformation:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/improving-tesseract-4-ocr-accuracy-through-image-preprocessing/transformations.webp&quot; alt=&quot;Transformations of an image as it passes through the convolutional preprocessor, viewed from left (original) to right (final sample); observe the removal of incomplete characters from the upper-left region&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Transformations of an image as it passes through the convolutional preprocessor, viewed from left (original) to right (final sample); observe the removal of incomplete characters from the upper-left region&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;comparison&quot;&gt;Comparison&lt;/h2&gt;

&lt;p&gt;I used 10,000 images from the testing set to evaluate the current methodology and compiled the following graphs. The differences between original and preprocessed samples are illustrated with three metrics of interest: &lt;em&gt;Character Error Rate&lt;/em&gt; (&lt;strong&gt;CER&lt;/strong&gt;), &lt;em&gt;Word Error Rate&lt;/em&gt; (&lt;strong&gt;WER&lt;/strong&gt;) and &lt;em&gt;Longest Common Subsequence Error&lt;/em&gt; (&lt;strong&gt;LCSE&lt;/strong&gt;). In this article, &lt;strong&gt;LCSE&lt;/strong&gt; is computed as follows:&lt;/p&gt;

\[LCSE(Text_1,Text_2 )=|Text_1 |-|LCS(Text_1,Text_2 )|+|Text_2 |-|LCS(Text_1,Text_2 )|\]
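
&lt;p&gt;The LCSE formula above can be computed directly from an LCS length - a minimal sketch in plain Python:&lt;/p&gt;

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings,
    via the standard dynamic-programming recurrence."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcse(text1, text2):
    """Longest Common Subsequence Error, as defined above: counts the
    characters in either text that are not part of the common subsequence."""
    common = lcs_length(text1, text2)
    return (len(text1) - common) + (len(text2) - common)
```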

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/improving-tesseract-4-ocr-accuracy-through-image-preprocessing/results-comparison.webp&quot; alt=&quot;&amp;lt;span style='color:green'&amp;gt;Preprocessed&amp;lt;/span&amp;gt; vs &amp;lt;span style='color:red'&amp;gt;Original&amp;lt;/span&amp;gt; Images from the testing set; lower is better for each metric; dashed lines represent first degree approximations using least squares regression for the ease of interpretation&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;&lt;span style=&quot;color:green&quot;&gt;Preprocessed&lt;/span&gt; vs &lt;span style=&quot;color:red&quot;&gt;Original&lt;/span&gt; Images from the testing set; lower is better for each metric; dashed lines represent first degree approximations using least squares regression for the ease of interpretation&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Additionally, I plotted everything as histograms to properly see the error distributions. For &lt;strong&gt;CER&lt;/strong&gt; and &lt;strong&gt;WER&lt;/strong&gt;, the spikes around &lt;strong&gt;1&lt;/strong&gt; (100%) suggest that the aforementioned segmentation problem (at block-of-text level) produces the most frequent error: &lt;strong&gt;empty strings&lt;/strong&gt; are returned, so all characters are wrong. In certain situations, the &lt;strong&gt;WER&lt;/strong&gt; is larger than &lt;strong&gt;1&lt;/strong&gt; because the preprocessing step introduces artifacts near the border of the image, leading to the recognition of non-existent characters. In the &lt;strong&gt;LCSE&lt;/strong&gt; plot, a distribution shift can be seen from the original approximately Gaussian shape, with its peak (mode) near the average number of characters in an image (&lt;strong&gt;56.95&lt;/strong&gt;), to a more favorable shape with overall lower error rates.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/improving-tesseract-4-ocr-accuracy-through-image-preprocessing/results-distributions.webp&quot; alt=&quot;&amp;lt;span style='color:green'&amp;gt;Preprocessed&amp;lt;/span&amp;gt; vs &amp;lt;span style='color:red'&amp;gt;Original&amp;lt;/span&amp;gt; Images from the testing set; comparison of distributions of errors&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;&lt;span style=&quot;color:green&quot;&gt;Preprocessed&lt;/span&gt; vs &lt;span style=&quot;color:red&quot;&gt;Original&lt;/span&gt; Images from the testing set; comparison of distributions of errors&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
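
&lt;p&gt;The WER-above-1 effect mentioned above follows from how WER is commonly defined - a minimal sketch of one common definition (word-level edit distance over reference length; the exact computation used in the evaluation may differ):&lt;/p&gt;

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference word count.

    Because insertions count as errors, a hypothesis with spurious
    extra words (e.g. border artifacts recognized as characters) can
    push the WER above 1.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        curr = [i]
        for j, hw in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (rw != hw)))
        prev = curr
    return prev[-1] / len(ref)

# two substituted words plus two inserted words, against a
# two-word reference: 4 errors / 2 words = WER of 2.0
wer = word_error_rate("stop sign", "slop sgn xx yy")
```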

&lt;p&gt;A numeric comparison is presented below:&lt;/p&gt;

&lt;table class=&quot;data-table&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Metric&lt;/th&gt;
      &lt;th&gt;Original (Avg.)&lt;/th&gt;
      &lt;th&gt;Preprocessed (Avg.)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;CER&lt;/td&gt;
      &lt;td&gt;0.866&lt;/td&gt;
      &lt;td&gt;0.384&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;WER&lt;/td&gt;
      &lt;td&gt;0.903&lt;/td&gt;
      &lt;td&gt;0.593&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;LCSE&lt;/td&gt;
      &lt;td&gt;48.834&lt;/td&gt;
      &lt;td&gt;24.987&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Precision&lt;/td&gt;
      &lt;td&gt;0.155&lt;/td&gt;
      &lt;td&gt;0.725&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Recall&lt;/td&gt;
      &lt;td&gt;0.172&lt;/td&gt;
      &lt;td&gt;0.734&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;F1 Score&lt;/td&gt;
      &lt;td&gt;0.163&lt;/td&gt;
      &lt;td&gt;0.729&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;takeaways&quot;&gt;Takeaways&lt;/h2&gt;

&lt;p&gt;Significant improvements can be observed through this preprocessing operation. Moreover, the majority of errors probably do not occur in the &lt;em&gt;sequence to sequence&lt;/em&gt; classifier (having all recognized characters erroneous would contradict the previous performance analysis); a page-segmentation issue when automatic mode is used seems more plausible. It is shown that an array of convolutions is sufficient, in this case, to decrease error rates substantially.&lt;/p&gt;

&lt;p&gt;The OCR performance on the preprocessed images is better overall, but not good enough to be reliable - a 38% character error rate is still a large setback. I’m fairly confident that better recognition results can be obtained with more fine-tuning, a more complex architecture for the convolutional preprocessor, and a more diverse training set. However, the current implementation is already very slow to train, which makes me question whether the entire methodology is feasible from this point of view.&lt;/p&gt;

&lt;h2 id=&quot;cite&quot;&gt;Cite&lt;/h2&gt;

&lt;p&gt;If you found this relevant to your work, you can cite the article using:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;@article{sporici2020improving,
  title={Improving the Accuracy of Tesseract 4.0 OCR Engine Using Convolution-Based Preprocessing},
  author={Sporici, Dan and Cușnir, Elena and Boiangiu, Costin-Anton},
  journal={Symmetry},
  volume={12},
  number={5},
  pages={715},
  year={2020},
  publisher={Multidisciplinary Digital Publishing Institute}
}
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

</description>
        <pubDate>Sun, 07 Jun 2020 21:45:05 +0000</pubDate>
        <link>https://codingvision.net/improving-tesseract-4-ocr-accuracy-through-image-preprocessing</link>
        <guid isPermaLink="true">https://codingvision.net/improving-tesseract-4-ocr-accuracy-through-image-preprocessing</guid>
        
        <category>ocr</category>
        
        <category>pytorch</category>
        
        <category>python</category>
        
        <category>research</category>
        
        <category>tesseract</category>
        
        <category>conv-neural-network</category>
        
        <category>reinforcement-learning</category>
        
        <category>unsupervised-learning</category>
        
        
      </item>
    
      <item>
        <title>PyTorch Iterative FGVM: Targeted Adversarial Samples for Traffic-Sign Recognition</title>
        <description>&lt;p&gt;Inspired by the progress of driverless cars and by the fact that this subject is not thoroughly discussed I decided to give it a shot at creating smooth &lt;strong&gt;targeted&lt;/strong&gt; adversarial samples that are interpreted as legit traffic signs with a high confidence by a PyTorch Convolutional Neural Network (&lt;strong&gt;CNN&lt;/strong&gt;) classifier trained on the &lt;a href=&quot;http://benchmark.ini.rub.de/?section=gtsrb&amp;amp;subsection=dataset&quot; rel=&quot;nofollow&quot;&gt;GTSRB&lt;/a&gt; dataset.&lt;/p&gt;

&lt;p&gt;I’ll be using the &lt;em&gt;Fast Gradient Value Method&lt;/em&gt; (&lt;strong&gt;FGVM&lt;/strong&gt;) in an iterative manner, which is also called the &lt;em&gt;Basic Iterative Method&lt;/em&gt; (BIM). I noticed that most articles only present PyTorch code for the non-targeted &lt;em&gt;Fast Gradient Sign Method&lt;/em&gt; (&lt;strong&gt;FGSM&lt;/strong&gt;), which performs well at evading classifiers but is, in my opinion, somewhat limited.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/pytorch-iterative-fgvm-targeted-adversarial-samples-traffic-sign-recognition/fgvm-gtsrb-adversarial-sample.png&quot; alt=&quot;Smooth targeted adversarial sample generated using the current implementation, being misclassified as a 'Stop' sign.&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Smooth targeted adversarial sample generated using the current implementation, being misclassified as a ‘Stop’ sign.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h5 id=&quot;ill-try-to-discuss-in-this-article-only-the-important-aspects-of-this-problem-however-i-also-prepared-a-google-colab-notebook-which-includes-complete-source-code-and-results&quot;&gt;I’ll try to discuss in this article only the important aspects of this problem. However, I also prepared a &lt;a href=&quot;https://colab.research.google.com/drive/1CndPD5ZsW022qO1xgEAWbmcXJwkJKBAX&quot; rel=&quot;nofollow&quot;&gt;Google Colab Notebook&lt;/a&gt; which includes complete source code and results.&lt;/h5&gt;
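
&lt;p&gt;The core iterative update behind targeted FGVM/BIM can be illustrated on a toy differentiable model - a minimal sketch in plain Python, where a one-parameter logistic model stands in for the CNN and all names and values are made up for illustration:&lt;/p&gt;

```python
import math

def targeted_bim_step(x, w, b, target, alpha):
    """One targeted FGVM/BIM step on a toy logistic 'classifier':
    move the input along the negative gradient of the loss so the
    model's output drifts toward the target label."""
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))   # forward pass
    grad = (p - target) * w                    # d(cross-entropy)/dx
    # gradient *value* step (FGVM); FGSM would use sign(grad) instead
    return x - alpha * grad

# drive a sample initially classified as 0 toward target class 1
x, w, b = -2.0, 1.5, 0.0
for _ in range(100):
    x = targeted_bim_step(x, w, b, target=1.0, alpha=0.5)
p = 1.0 / (1.0 + math.exp(-(w * x + b)))       # final confidence
```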

&lt;h2 id=&quot;targeted-network&quot;&gt;Targeted Network&lt;/h2&gt;

&lt;p&gt;For this experiment, I’ve constructed a basic &lt;strong&gt;LeNet5&lt;/strong&gt;-inspired CNN in PyTorch. It performs two 5x5 convolutions on 32x32 grayscale images, separated by max-pooling. The dataset is slightly unbalanced, but this was compensated for during the training process.&lt;/p&gt;
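
&lt;p&gt;The flattened feature size feeding the first fully-connected layer can be traced with quick arithmetic - a minimal sketch, assuming valid (no-padding, stride-1) convolutions and 2x2 max-pooling:&lt;/p&gt;

```python
def conv_out(size, kernel):
    """Spatial size after a valid (no-padding, stride-1) convolution."""
    return size - kernel + 1

size = 32                  # grayscale input is 32x32
size = conv_out(size, 5)   # conv1 (5x5) -> 28x28
size = size // 2           # 2x2 max-pool -> 14x14
size = conv_out(size, 5)   # conv2 (5x5) -> 10x10
size = size // 2           # 2x2 max-pool -> 5x5
flat = 64 * size * size    # 64 channels -> 1600 inputs to fc1
```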

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/pytorch-iterative-fgvm-targeted-adversarial-samples-traffic-sign-recognition/gtsrb-results.png&quot; alt=&quot;Results of the Traffic-Sign Recognition CNN on the GTSRB Test Dataset&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Results of the Traffic-Sign Recognition CNN on the GTSRB Test Dataset&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This network is represented using the following PyTorch snippet:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;LeNet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_classes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;47&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;affine&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

      &lt;span class=&quot;nb&quot;&gt;super&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
      &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InstanceNorm2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;affine&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;affine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

      &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InstanceNorm2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;affine&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;affine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
      
      &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fc1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Linear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fc2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Linear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fc3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Linear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;128&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_classes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;


  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;forward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_pool2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

      &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;in2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_pool2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
      
      &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
      
      &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fc1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fc2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fc3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

      &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the sake of simplicity, the architecture is not optimal; achieving state-of-the-art traffic-sign recognition is, in any case, beyond the scope of this article. Evaluation results on the GTSRB test set are as follows:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Accuracy:&lt;/strong&gt; ~95%&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Precision:&lt;/strong&gt; ~93%&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Recall:&lt;/strong&gt; ~93%&lt;/li&gt;
&lt;/ul&gt;
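&lt;p&gt;As a reference for how such figures can be obtained, here is a minimal sketch of computing accuracy and macro-averaged precision/recall from predicted labels. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y_true&lt;/code&gt;/&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y_pred&lt;/code&gt; arrays below are illustrative placeholders, not actual GTSRB predictions:&lt;/p&gt;

```python
import numpy as np

def macro_precision_recall(y_true, y_pred, num_classes):
    """Macro-averaged precision and recall over all classes."""
    precisions, recalls = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return np.mean(precisions), np.mean(recalls)

# placeholder labels, standing in for test-set predictions
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

accuracy = np.mean(y_true == y_pred)
precision, recall = macro_precision_recall(y_true, y_pred, 3)
```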

&lt;h2 id=&quot;targeted-adversarial-samples-with-iterative-fgvm&quot;&gt;Targeted Adversarial Samples with Iterative FGVM&lt;/h2&gt;

&lt;p&gt;When &lt;strong&gt;training&lt;/strong&gt; a neural network, the focus is on optimizing the parameters (i.e. weights) in order to minimize the &lt;strong&gt;loss&lt;/strong&gt; (e.g. Mean Squared Error, Cross Entropy) between the &lt;strong&gt;current output&lt;/strong&gt; and the &lt;strong&gt;desired output&lt;/strong&gt;, while the inputs remain fixed. This is done through &lt;a href=&quot;https://codingvision.net/numerical-methods/gradient-descent-simply-explained-with-example&quot;&gt;gradient descent&lt;/a&gt;. As an example, if a neural network models the function below, the \(w\) (weight) and \(b\) (bias) variables are adjusted during training.&lt;/p&gt;

\[f(x) = w \cdot x + b\]

&lt;p&gt;In targeted &lt;strong&gt;FGVM&lt;/strong&gt;, \(w\) and \(b\) are fixed and the input \(x\) is adjusted through &lt;strong&gt;gradient descent&lt;/strong&gt; (with gradients computed w.r.t. the input rather than the parameters). Usually this implies minimizing the error between the &lt;strong&gt;targeted adversarial output&lt;/strong&gt; and the &lt;strong&gt;current output&lt;/strong&gt; - basically shifting the current output towards the targeted one.&lt;/p&gt;
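&lt;p&gt;The idea can be sketched on the toy function \(f(x) = w \cdot x + b\) itself: freeze \(w\) and \(b\), hand the &lt;em&gt;input&lt;/em&gt; to the optimizer, and minimize the error between the current and the targeted output. The constants below are arbitrary illustrative values:&lt;/p&gt;

```python
import torch

# frozen "model" parameters - these are NOT optimized
w = torch.tensor(2.0)
b = torch.tensor(1.0)

# the input is the variable being optimized
x = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.1)

target = torch.tensor([5.0])  # desired (targeted) output

for _ in range(500):
    optimizer.zero_grad()
    out = w * x + b
    loss = (out - target).pow(2).mean()  # MSE between current and targeted output
    loss.backward()                      # gradient flows to x, not to w or b
    optimizer.step()

# x converges towards (5 - 1) / 2 = 2
```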

&lt;p&gt;Moreover, when the input is in image-format, additional constraints must be addressed:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;images (inputs) must be clamped between 0 and 1 (float representation)&lt;/li&gt;
  &lt;li&gt;images must be smooth in order to mitigate basic noise filtering mechanisms&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;pytorch-generating-adversarial-samples&quot;&gt;PyTorch: Generating Adversarial Samples&lt;/h2&gt;

&lt;p&gt;The code I ended up with is posted below; further implementation details will also be presented.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;targeted_adversarial_class&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;INV_TRAFFIC_SIGNS_LABELS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'stop'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; 

&lt;span class=&quot;c1&quot;&gt;# optimizer for the adversarial sample
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adversarial_optimizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;optim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Adam&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;adversarial_optimizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zero_grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

  &lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;# classification loss + 0.05 * image smoothing loss
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CrossEntropyLoss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;targeted_adversarial_class&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; \
          &lt;span class=&quot;mf&quot;&gt;0.05&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;functional&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;functional&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'reflect'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FloatTensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
  

  &lt;span class=&quot;c1&quot;&gt;# this is the predicted class number
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;predicted_class&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;detach&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;# updates gradient and backpropagates errors to the input
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backward&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;adversarial_optimizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;step&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;# ensuring that the image is valid
&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;500&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;imshow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cmap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'gray'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;plt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Predicted:'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TRAFFIC_SIGNS_LABELS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predicted_class&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Loss:'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The current CNN is trained on 32x32 grayscale images, so it makes sense to start with an adversarial sample of the same size, consisting of random noise distributed over a single channel. It is also required to indicate through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;requires_grad_()&lt;/code&gt; that Autograd should compute gradients for this variable so the optimizer can update it.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; 
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next, an optimizer is created that, instead of tweaking weights, will tweak the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;adversarial_sample&lt;/code&gt; defined above:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;adversarial_optimizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;optim&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Adam&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1e-3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The loss function is defined using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch.nn.CrossEntropyLoss()&lt;/code&gt; - which is the same criterion used for training. In this example, I’ll try to create a sample that is classified as a stop sign (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;targeted_adversarial_class&lt;/code&gt;).&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;targeted_adversarial_class&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;INV_TRAFFIC_SIGNS_LABELS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'stop'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# classification loss
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CrossEntropyLoss&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;targeted_adversarial_class&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This loss function does well in generating adversarial images, but the results have a &lt;strong&gt;noisy&lt;/strong&gt; aspect (e.g., strong contrasts between small groups of pixels) and might look suspicious. Since this noise can easily be removed using basic filtering, &lt;strong&gt;smooth&lt;/strong&gt; images are preferable.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/pytorch-iterative-fgvm-targeted-adversarial-samples-traffic-sign-recognition/fgvm-noisy-sample.png&quot; alt=&quot;Using only the `CrossEntropyLoss()` will most likely generate noisy adversarial samples&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Using only the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CrossEntropyLoss()&lt;/code&gt; will most likely generate noisy adversarial samples&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;
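&lt;p&gt;To illustrate why such noise is fragile, a simple box blur already flattens most of the high-frequency content. The snippet below runs on a random tensor standing in for a noisy adversarial sample; the drop in per-pixel variance is a rough proxy for how much of the noise a basic filter would wipe out:&lt;/p&gt;

```python
import torch
import torch.nn.functional as F

# stand-in for a noisy adversarial sample (1x1x32x32, values in [0, 1])
noisy = torch.rand(1, 1, 32, 32)

# 3x3 box blur - the kind of cheap filter a detection pipeline might apply
kernel = torch.full((1, 1, 3, 3), 1.0 / 9.0)
blurred = F.conv2d(F.pad(noisy, (1, 1, 1, 1), 'reflect'), kernel)

# averaging neighbors suppresses pixel-level contrasts: variance drops sharply
print(noisy.var().item(), blurred.var().item())
```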

&lt;p&gt;Defining a smooth-image constraint can be done by minimizing the &lt;strong&gt;Mean Squared Error&lt;/strong&gt; between &lt;strong&gt;adjacent&lt;/strong&gt; pixels. Think of it as applying an edge-detection filter and attempting to minimize the overall response. However, this has an impact on the efficiency of the generated sample, as it adds dependencies between pixels. To limit this loss of freedom, only the adjacent pixels on the bottom-right side are taken into account.
The following 3x3 &lt;strong&gt;convolution&lt;/strong&gt; kernel is used to determine the color difference between a pixel and its 3 bottom-right neighbors:&lt;/p&gt;

&lt;table class=&quot;data-table&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;K&lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;-3&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;In PyTorch, I implemented the aforementioned method using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch.nn.functional.conv2d()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch.nn.functional.pad()&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;# image smoothing loss
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;functional&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;functional&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'reflect'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FloatTensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([[[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]]).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;view&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, the image is clamped so that it remains a valid float tensor:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;adversarial_sample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Multiple iterations are required in order to properly optimize the input.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;FGVM proves reliable in crafting smooth targeted adversarial samples for basic CNN-based classifiers. However, several additional problems need to be addressed before it becomes a feasible attack. The crafted sample must still be picked up by the segmentation algorithm as a possible traffic sign during the detection phase. Next, the adversarial sample’s efficacy should not be impacted by small affine transformations (e.g., being shifted 3 pixels to the left) - this might be fixed through data augmentation. Additionally, factors such as brightness, contrast or various camera properties can still reduce the success rate of an adversarial sample.&lt;/p&gt;

&lt;p&gt;Finally, samples which are more resistant to uniformly distributed noise can be obtained by removing the image smoothing constraint.&lt;/p&gt;
</description>
        <pubDate>Thu, 30 Apr 2020 21:45:05 +0000</pubDate>
        <link>https://codingvision.net/iterative-fgvm-targeted-adversarial-samples-traffic-sign-recognition</link>
        <guid isPermaLink="true">https://codingvision.net/iterative-fgvm-targeted-adversarial-samples-traffic-sign-recognition</guid>
        
        <category>pytorch</category>
        
        <category>python</category>
        
        <category>adversarial-machine-learning</category>
        
        <category>conv-neural-network</category>
        
        
      </item>
    
      <item>
        <title>RSA: Encrypt in .NET &amp; Decrypt in Python</title>
        <description>&lt;p&gt;So… one of my current projects required the following actions: asymmetrically &lt;strong&gt;encrypt&lt;/strong&gt; a string in &lt;strong&gt;.NET&lt;/strong&gt; using a public key and &lt;strong&gt;decrypt&lt;/strong&gt; it in a &lt;strong&gt;python&lt;/strong&gt; script using a private key.&lt;/p&gt;

&lt;p&gt;The problem that I’ve encountered was that, apparently, I couldn’t achieve compatibility between the two exposed classes: &lt;a href=&quot;https://docs.microsoft.com/en-us/dotnet/api/system.security.cryptography.rsacryptoserviceprovider?view=netframework-4.8&quot; rel=&quot;nofollow&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RSACryptoServiceProvider&lt;/code&gt;&lt;/a&gt; and &lt;a href=&quot;https://pycryptodome.readthedocs.io/en/latest/src/cipher/pkcs1_v1_5.html&quot; rel=&quot;nofollow&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PKCS1_v1_5&lt;/code&gt;&lt;/a&gt;. To be more specific, the python script couldn’t decrypt the ciphertext even though proper configurations were made and the provided keys were compatible. Additionally, separate encryption-decryption actions worked inside .NET and python but not in-between them.&lt;/p&gt;

&lt;p&gt;I wasn’t able to find too much information about this specific problem in the &lt;a href=&quot;https://docs.microsoft.com/en-us/dotnet/api/system.security.cryptography.rsaparameters?view=netframework-4.8&quot; rel=&quot;nofollow&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RSAParameters&lt;/code&gt;&lt;/a&gt; documentation, hence this post.&lt;/p&gt;

&lt;h2 id=&quot;solution&quot;&gt;Solution&lt;/h2&gt;

&lt;p&gt;Alright, the issue seems to be caused by a difference in &lt;strong&gt;endianness&lt;/strong&gt; between the two classes, when the RSA parameters are provided. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PKCS1_v1_5&lt;/code&gt; uses &lt;strong&gt;little endian&lt;/strong&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RSACryptoServiceProvider&lt;/code&gt; prefers &lt;strong&gt;big endian&lt;/strong&gt;. In my case, this made the encryption method use a different key than the one I thought I specified. It was also more fun to debug because the PKCS padding is randomized, so every run produced a different ciphertext.&lt;/p&gt;

&lt;p&gt;I fixed this by &lt;strong&gt;base64&lt;/strong&gt;-encoding the &lt;strong&gt;exponent&lt;/strong&gt; and &lt;strong&gt;modulus&lt;/strong&gt; in &lt;strong&gt;big-endian&lt;/strong&gt; format (in python) and then loading them with &lt;a href=&quot;https://docs.microsoft.com/en-us/dotnet/api/system.security.cryptography.rsa.fromxmlstring?view=netframework-4.8&quot; rel=&quot;nofollow&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RSACryptoServiceProvider.FromXmlString()&lt;/code&gt;&lt;/a&gt; (in .NET).&lt;/p&gt;
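&lt;p&gt;As a quick standalone illustration of the endianness pitfall (the integer below is a made-up toy value, not tied to either library), the same integer serializes to two different byte strings depending on byte order, so a consumer expecting the other order silently ends up with a different key parameter:&lt;/p&gt;

```python
import base64

n = 0x12345678
big = n.to_bytes(4, 'big')        # b'\x12\x34\x56\x78'
little = n.to_bytes(4, 'little')  # b'\x78\x56\x34\x12' - same int, reversed bytes

# The two serializations are byte-reversals of each other...
assert big == little[::-1]

# ...so their base64 encodings (the form exchanged between the two
# languages) do not match at all:
print(base64.b64encode(big).decode())     # EjRWeA==
print(base64.b64encode(little).decode())  # eFY0Eg==
```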

&lt;h2 id=&quot;working-example&quot;&gt;Working Example&lt;/h2&gt;

&lt;p&gt;I hardcoded the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(N, E, D)&lt;/code&gt; parameters for a private key in python and exported the &lt;strong&gt;exponent&lt;/strong&gt; and &lt;strong&gt;modulus&lt;/strong&gt; to be used later for encryption.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;# custom base64 encoding
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;b64_enc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'big'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;base64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b64encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# fixed a set of keys for testing purposes
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;26004126751443262055682011081007404548850063543219588539086190001742195632834884763548378850634989264309169823030784372770378521274048211537270851954737597964394738860810397764157069391719551179298507244962912383723776384386127059976543327113777072990654810746825378287761304202032439750301912045623786736128233730798303406858144431081065384988539277630625160727011582345942687126935423502995613920211095965452425548919926951203151483590222152446516520421379279591807660810550784744188433550335950652666201439521115515355539373928576162221297645781251953236644092963307595988040539993067709240004782161131243282208593&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;E&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;65537&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;844954574014654722486150458473919587206863455991060222377955072839922571984098861772377020041002939383041291761051853484512886782322743892284027026528735139923685801975918062144627908962369108081178131103781404720078456605432924519279933702927938064507063482999903002331319671303661755165294744970869186178561527578261522199503340027952798084625109041630166309505066404215223685733585467434168146932177924040219720383860880583466676764286302300281603021045351842170755190359364339936360197909582974922675680101321863304283607829144759777189360340512230537108705852116021758740440195445732631657876008160876867027543&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# construct pair of keys
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;private_key&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RSA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;construct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;N&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;E&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;public_key&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;private_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;publickey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# base64-encode parameters in big-endian format
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;EXP&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b64_enc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;public_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;MODULUS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b64_enc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;public_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;256&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'EXP:'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'MODULUS:'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MODULUS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Output:
# EXP: b'AQAB' MODULUS: b'zf4LgceVPvjMLz/pp8exH58AeBrhjLe0k4FRmd59I0k4sH6oug6Z9RfY4FvEFcssBwH1cmWF5/Zen8xbRVRyUnzer6b6cKmlzHFYf0LlbovvYMkW5pdhRcTHK2ijByGtmVgU/CEKEQTy3elpU7ZsHE8D6T1M7L2gmGAxvgldUMRu4l8BPuRyht1a9dA9b6005atpdlkCSc3emXSfyBOBwNE0UicVTVncn9SBjP7bTBGgOKshYnYsqh4BD0I7AU3xdoAsZVWudECX/zVa7uUOk1ooVYjMEyfBngrEDXrmIkAlVruUuj/eWiYwT2vXqByQgDfDvat5IS4i3ywiHAWXUQ=='
&lt;/span&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In &lt;strong&gt;.NET&lt;/strong&gt; (I used &lt;strong&gt;C#&lt;/strong&gt;), the code looks something like this:&lt;/p&gt;
&lt;div class=&quot;language-csharp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;System.Security.Cryptography&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;System.Text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;RSACryptoApp&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// parameters from the python script (public key)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;readonly&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXP&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;AQAB&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;readonly&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MODULUS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;zf4LgceVPvjMLz/pp8exH58AeBrhjLe0k4FRmd59I0k4sH6oug6Z9RfY4FvEFcssBwH1cmWF5/Zen8xbRVRyUnzer6b6cKmlzHFYf0LlbovvYMkW5pdhRcTHK2ijByGtmVgU/CEKEQTy3elpU7ZsHE8D6T1M7L2gmGAxvgldUMRu4l8BPuRyht1a9dA9b6005atpdlkCSc3emXSfyBOBwNE0UicVTVncn9SBjP7bTBGgOKshYnYsqh4BD0I7AU3xdoAsZVWudECX/zVa7uUOk1ooVYjMEyfBngrEDXrmIkAlVruUuj/eWiYwT2vXqByQgDfDvat5IS4i3ywiHAWXUQ==&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;RSACryptoServiceProvider&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;csp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;RSACryptoServiceProvider&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2048&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;csp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;FromXmlString&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&amp;lt;RSAKeyValue&amp;gt;&amp;lt;Exponent&amp;gt;&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXP&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&amp;lt;/Exponent&amp;gt;&amp;lt;Modulus&amp;gt;&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MODULUS&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&amp;lt;/Modulus&amp;gt;&amp;lt;/RSAKeyValue&amp;gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

       &lt;span class=&quot;c1&quot;&gt;// encrypting a string for testing purposes&lt;/span&gt;
       &lt;span class=&quot;kt&quot;&gt;byte&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plainText&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Encoding&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ASCII&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;GetBytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Hello from .NET&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
       &lt;span class=&quot;kt&quot;&gt;byte&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cipherText&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;csp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Encrypt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plainText&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

       &lt;span class=&quot;n&quot;&gt;Console&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;WriteLine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Encrypted: &quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Convert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;ToBase64String&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cipherText&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

       &lt;span class=&quot;c1&quot;&gt;// Output:&lt;/span&gt;
       &lt;span class=&quot;c1&quot;&gt;// Encrypted: F/agXpfSrs7HSXZz+jVq5no/xyQDXuOiVAG/MOY7WzSlp14vMOTM8TshFiWtegB3+2BZCMOEPLQFFFbxusuCFOYGGJ8yRaV7q985z/UDJVXvbX5ANYqrirobR+c868mY4V33loAt2ZFNXwr+Ubk11my1aJgHmoBem/6yPfoRd9GrZaSQnbJRSa3EDtP+8pXETkF9B98E7KvElrsRTLXEXSBygmeKsyENo5DDcARW+lVVsQuP8wUEGnth9SX4oG8i++gmQKkrv0ep6yFrn05xZJKgpOfRiTTo/Bkh7FxNP2wo7utzhtYkNnvtXaJPWAvqXg93KmNPqg1IsN4P1Swb8w==&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Back to the &lt;strong&gt;python&lt;/strong&gt; script:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;cipher&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PKCS1_v1_5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;private_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;random_generator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sentinel&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random_generator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;cipher_text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'F/agXpfSrs7HSXZz+jVq5no/xyQDXuOiVAG/MOY7WzSlp14vMOTM8TshFiWtegB3+2BZCMOEPLQFFFbxusuCFOYGGJ8yRaV7q985z/UDJVXvbX5ANYqrirobR+c868mY4V33loAt2ZFNXwr+Ubk11my1aJgHmoBem/6yPfoRd9GrZaSQnbJRSa3EDtP+8pXETkF9B98E7KvElrsRTLXEXSBygmeKsyENo5DDcARW+lVVsQuP8wUEGnth9SX4oG8i++gmQKkrv0ep6yFrn05xZJKgpOfRiTTo/Bkh7FxNP2wo7utzhtYkNnvtXaJPWAvqXg93KmNPqg1IsN4P1Swb8w=='&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;plain_text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cipher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decrypt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;base64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b64decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cipher_text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;encode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'ASCII'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sentinel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Decrypted:'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;plain_text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'ASCII'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Output:
# Decrypted: Hello from .NET
&lt;/span&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

</description>
        <pubDate>Mon, 06 Apr 2020 21:45:05 +0000</pubDate>
        <link>https://codingvision.net/rsa-encrypt-in-net-decrypt-in-python</link>
        <guid isPermaLink="true">https://codingvision.net/rsa-encrypt-in-net-decrypt-in-python</guid>
        
        <category>c-sharp</category>
        
        <category>python</category>
        
        <category>rsa</category>
        
        <category>encryption</category>
        
        
      </item>
    
      <item>
        <title>Avoid a Mistake: Correctly Calculate Multiclass Accuracy</title>
        <description>&lt;p&gt;Today I held a short laboratory which tackled different metrics used in evaluating classifiers. One of the tasks required that, given the performances of 2 classifiers as &lt;strong&gt;confusion matrices&lt;/strong&gt;, the students will calculate the &lt;strong&gt;accuracy&lt;/strong&gt; of the 2 models. One model was a &lt;strong&gt;binary classifier&lt;/strong&gt; and the other was a &lt;strong&gt;multiclass classifier&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Many students resorted to googling for an &lt;strong&gt;accuracy formula&lt;/strong&gt; which returned the following function:&lt;/p&gt;

\[{\color{Red}{ACC = \frac{TP + TN}{TP + TN + FP +FN}}}\]

&lt;p&gt;Then, they calculated a &lt;strong&gt;‘per-class’ accuracy&lt;/strong&gt; (for class \(i\), they had \(ACC_i\)) and &lt;strong&gt;macro-averaged&lt;/strong&gt; the results like below:&lt;/p&gt;

\[ACC = \frac{\sum_{i=1}^{i=N}{ACC_i}}{N}\]

&lt;p&gt;To their surprise, the resulting accuracy for the &lt;strong&gt;multiclass classifier&lt;/strong&gt; was &lt;strong&gt;erroneous&lt;/strong&gt; and highly different (when compared to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accuracy_score()&lt;/code&gt; from &lt;strong&gt;sklearn&lt;/strong&gt;). However, the accuracy of the &lt;strong&gt;binary classifier&lt;/strong&gt; was correct.&lt;/p&gt;

&lt;p&gt;As there wasn’t much time available, I told them to use the following &lt;strong&gt;accuracy formula&lt;/strong&gt; to match the results of &lt;strong&gt;sklearn&lt;/strong&gt;, and that I’d send an explanation later:&lt;/p&gt;

\[{\color{Green}{ACC = \frac{\sum_{i=1}^{i=N}{TP_i}}{\sum_{i = 1}^{i=N}{(TP_i + FP_i)}}}}\]

&lt;p&gt;Some of you might recognize this as &lt;strong&gt;micro-averaged precision&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The purpose of this article is to serve as a list of DO’s and DONT’s so we can avoid such mistakes in the future.&lt;/p&gt;

&lt;h2 id=&quot;what-was-wrong&quot;&gt;What was wrong?&lt;/h2&gt;

&lt;p&gt;Basically, you’re prone to get invalid results if you &lt;strong&gt;average&lt;/strong&gt; accuracy values in an attempt to obtain the &lt;strong&gt;global accuracy&lt;/strong&gt;. But… even if you directly calculate the &lt;strong&gt;global accuracy&lt;/strong&gt; using the &lt;span style=&quot;color:red&quot;&gt;above formula&lt;/span&gt;, you’d get skewed values.&lt;/p&gt;

&lt;p&gt;Take a look at the following classifier, described using a &lt;strong&gt;confusion matrix&lt;/strong&gt;:&lt;/p&gt;

&lt;table class=&quot;data-table&quot;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;\&lt;/th&gt;
      &lt;th&gt;Class #0&lt;/th&gt;
      &lt;th&gt;Class #1&lt;/th&gt;
      &lt;th&gt;Class #2&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Class #0&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;100&lt;/td&gt;
      &lt;td&gt;100&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Class #1&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;100&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
      &lt;td&gt;100&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Class #2&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;100&lt;/td&gt;
      &lt;td&gt;100&lt;/td&gt;
      &lt;td&gt;0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;You’ll notice that \(TP = 0\) thus the classifier is doing a really bad job.&lt;/p&gt;

&lt;p&gt;If we follow the students’ approach and calculate the &lt;strong&gt;‘per-class’ accuracy&lt;/strong&gt; (let’s say &lt;strong&gt;Class #0&lt;/strong&gt;), we have:&lt;/p&gt;

\[TP_0 = 0, TN_0 = 200, FP_0 = 200, FN_0 = 200\]

\[\color{Red}{ACC_0 = \frac{0 + 200}{0+200+200+200} = 0.333(3)}\]

&lt;p&gt;This already looks suspicious. You’ll get the same results for the other 2 classes, so… on average, \(\color{Red}{ACC = 0.333(3)}\).
This is definitely wrong.&lt;/p&gt;

&lt;p&gt;If you directly compute &lt;strong&gt;global accuracy&lt;/strong&gt; using the &lt;span style=&quot;color:red&quot;&gt;same formula&lt;/span&gt; (summing all \(TP's\), \(TN's\), …), you get the same result because of the symmetry. This happens mainly because of the \(TN\) in the numerator which grows faster than any other term. In other words, as the number of classes grows, this error grows as well; a similar model, but with &lt;strong&gt;4 classes&lt;/strong&gt;, gets a &lt;strong&gt;0.5&lt;/strong&gt; accuracy.&lt;/p&gt;
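&lt;p&gt;The discrepancy can be reproduced with a few lines of plain Python on the confusion matrix above (variable and helper names are mine, for illustration):&lt;/p&gt;

```python
# Confusion matrix from the article: rows = true class, cols = predicted class.
cm = [[0, 100, 100],
      [100, 0, 100],
      [100, 100, 0]]

n = len(cm)
total = sum(sum(row) for row in cm)  # 600 samples

def per_class_acc(i):
    # the red formula: (TP + TN) / (TP + TN + FP + FN)
    tp = cm[i][i]
    fn = sum(cm[i]) - tp                       # rest of row i
    fp = sum(cm[r][i] for r in range(n)) - tp  # rest of column i
    tn = total - tp - fn - fp
    return (tp + tn) / total

# Macro-averaging the red formula: 0.333..., despite zero correct predictions.
macro_acc = sum(per_class_acc(i) for i in range(n)) / n

# True (global) accuracy: correct predictions over all samples = 0.0
global_acc = sum(cm[i][i] for i in range(n)) / total
```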

&lt;p&gt;Using the &lt;span style=&quot;color:green&quot;&gt;second formula&lt;/span&gt;, the &lt;strong&gt;global accuracy&lt;/strong&gt; becomes:&lt;/p&gt;

\[\color{Green}{ACC = \frac{0+0+0}{(0+200) + (0+200) + (0 + 200)} = 0}\]

&lt;p&gt;This indeed yields the correct result. Moreover, it matches the output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;accuracy_score()&lt;/code&gt; from &lt;strong&gt;sklearn&lt;/strong&gt; on more diverse confusion matrices as well.&lt;/p&gt;

&lt;h5 id=&quot;if-you-compute-per-class-accuracies-using-the-second-formula-and-average-the-values-youre-basically-getting-a-macro-averaged-precision-point-is-thats-not-accuracy---so-dont-do-that&quot;&gt;If you compute &lt;strong&gt;‘per class’ accuracies&lt;/strong&gt; using the &lt;span style=&quot;color:green&quot;&gt;second formula&lt;/span&gt; and average the values, you’re basically getting a &lt;strong&gt;macro-averaged precision&lt;/strong&gt;. Point is, that’s not &lt;strong&gt;accuracy&lt;/strong&gt; - so don’t do that.&lt;/h5&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;I’d recommend avoiding:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;the idea of calculating a &lt;strong&gt;global accuracy&lt;/strong&gt; by averaging &lt;strong&gt;‘per-class’ accuracies&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;the &lt;span style=&quot;color:red&quot;&gt;red formula&lt;/span&gt;, which includes \(TN\), since the &lt;span style=&quot;color:green&quot;&gt;other one&lt;/span&gt; returns correct values for any number of classes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, you can compute &lt;strong&gt;precision&lt;/strong&gt;, &lt;strong&gt;recall&lt;/strong&gt;, &lt;strong&gt;F1&lt;/strong&gt; in a ‘per-class’ manner. But I’m not so sure it also works with the &lt;strong&gt;accuracy&lt;/strong&gt;.&lt;/p&gt;

</description>
        <pubDate>Tue, 10 Dec 2019 21:45:05 +0000</pubDate>
        <link>https://codingvision.net/avoid-a-mistake-correctly-calculate-multiclass-accuracy</link>
        <guid isPermaLink="true">https://codingvision.net/avoid-a-mistake-correctly-calculate-multiclass-accuracy</guid>
        
        <category>sklearn</category>
        
        <category>python</category>
        
        <category>metric</category>
        
        
      </item>
    
      <item>
        <title>C# Predict the Random Number Generator of .NET</title>
        <description>&lt;p&gt;This post targets to underline the &lt;strong&gt;predictability&lt;/strong&gt; of the random… or better said &lt;strong&gt;pseudo-random number generator&lt;/strong&gt; (PRNG) exposed by the &lt;strong&gt;.NET&lt;/strong&gt; framework (aka the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Random()&lt;/code&gt; class), under certain assumptions. Because of the nature of the implementation, &lt;strong&gt;100% accuracy&lt;/strong&gt; can be obtained with a fairly simple idea and a rather short code snippet.&lt;/p&gt;

&lt;h5 id=&quot;the-presented-method-definitely-isnt-something-new-in-the-domain-of-cryptography-however-the-purpose-of-the-article-is-to-bring-awareness-about-this-specific-weakness&quot;&gt;The presented method definitely isn’t something new in the domain of cryptography, however the purpose of the article is to bring awareness about this specific weakness.&lt;/h5&gt;

&lt;p&gt;The following scenario is considered:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;no access&lt;/strong&gt; to the &lt;strong&gt;process’s memory&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;must work for &lt;strong&gt;any chosen seed&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;a limited set of generated &lt;strong&gt;random numbers&lt;/strong&gt; is &lt;strong&gt;visible&lt;/strong&gt; to the attacker&lt;/li&gt;
  &lt;li&gt;we focus on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Random.NextDouble()&lt;/code&gt; as there is no data loss caused by &lt;strong&gt;int casting&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ll be presenting a short summary of the algorithm used by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Random()&lt;/code&gt; and how we can predict the random numbers. If you feel like going directly to the code, scroll down to the bottom of the article.&lt;/p&gt;

&lt;h2 id=&quot;the-random-class&quot;&gt;The Random class&lt;/h2&gt;

&lt;p&gt;While many pseudo-random implementations (e.g., libc’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rand()&lt;/code&gt;) rely on a &lt;a href=&quot;https://en.wikipedia.org/wiki/Linear_congruential_generator&quot; rel=&quot;nofollow&quot;&gt;Linear Congruential Generator (LCG)&lt;/a&gt; which generates each number in the sequence by taking into account the previous one, I discovered that &lt;strong&gt;.NET&lt;/strong&gt;’s &lt;strong&gt;random number generator&lt;/strong&gt; uses a different approach.&lt;/p&gt;

&lt;p&gt;By looking at the implementation of the &lt;a href=&quot;https://referencesource.microsoft.com/#mscorlib/system/random.cs&quot; rel=&quot;nofollow&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Random()&lt;/code&gt;&lt;/a&gt; class, one can easily observe that pseudo-random number generation is based on a &lt;a href=&quot;https://rosettacode.org/wiki/Subtractive_generator&quot; rel=&quot;nofollow&quot;&gt;Subtractive Generator&lt;/a&gt;, which permits the user to specify a custom seed or use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Environment.TickCount&lt;/code&gt; (system’s uptime in milliseconds) as default.&lt;/p&gt;

&lt;p&gt;The core of the pseudo-random generator is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;InternalSample()&lt;/code&gt; (line #100) method which constructs the sequence of numbers. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Random.NextDouble()&lt;/code&gt; will actually call the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Sample()&lt;/code&gt; method which returns the value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;InternalSample()&lt;/code&gt; divided by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Int32.MaxValue&lt;/code&gt;, as this is claimed to improve the distribution of random numbers.
Without going into much detail regarding the included gimmicks, we can describe the generator as follows:&lt;/p&gt;

\[R_i = R_i - R_j, j=i+21\]

\[R_i = \left\{\begin{matrix}
R_i - 1, if (R_i = Int32.Max)\\ 
R_i, else
\end{matrix}\right.\]

\[R_i = \left\{\begin{matrix}
R_i + Int32.Max, if (R_i &amp;lt; 0)\\ 
R_i, if (R_i \geqslant 0)
\end{matrix}\right.\]

\[retVal = \frac{R_i}{Int32.Max}\]

&lt;p&gt;where \(R_i\) contributes to describing the state of the algorithm and \(retVal\) is, obviously, the returned value.&lt;/p&gt;

&lt;p&gt;To store the state of the pseudo-random number generator, a &lt;strong&gt;circular array&lt;/strong&gt; of &lt;strong&gt;56 ints&lt;/strong&gt; is employed - this means \(i\) and \(j\) will get re-initialized to &lt;strong&gt;1&lt;/strong&gt; whenever they exceed the length of the array - however the &lt;strong&gt;offset&lt;/strong&gt; of &lt;strong&gt;21&lt;/strong&gt; remains constant.&lt;/p&gt;
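&lt;p&gt;The update rules above can be condensed into a short sketch (Python here for brevity; the names and structure are mine and mirror the formulas rather than the actual .NET source):&lt;/p&gt;

```python
INT32_MAX = 2**31 - 1  # corresponds to Int32.MaxValue

def subtractive_step(state, i, j):
    """One subtractive-generator update: R_i = R_i - R_j, kept in [0, Int32.MaxValue)."""
    r = state[i] - state[j]
    if r == INT32_MAX:  # keep the value strictly below Int32.MaxValue
        r -= 1
    if r < 0:           # wrap negative differences back into range
        r += INT32_MAX
    state[i] = r
    return r / INT32_MAX  # the value NextDouble() would hand back

def advance(i):
    # indices wrap inside the 56-slot circular array (1..55); the offset of 21 stays fixed
    i += 1
    return 1 if i == 56 else i
```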

&lt;h2 id=&quot;predicting-random-numbers&quot;&gt;Predicting Random Numbers&lt;/h2&gt;

&lt;p&gt;In my opinion, it seems rather difficult to determine the starting state of the algorithm without knowing the seed. But… we notice that the algorithm outputs pseudo-random numbers which directly expose the values of its state array.&lt;/p&gt;

&lt;p&gt;In other words, if we have access to a randomly generated number \(retVal\), we can compute \(R_i\), and \(R_i\) is used to generate future states &amp;amp; numbers in the sequence. However, we will need values for \(i = 1, \dots, 55\) in order to cover the whole state array.&lt;/p&gt;

&lt;h5 id=&quot;if-we-manage-to-leak-a-continuous-set-of-55-generated-numbers-we-have-enough-information-to-describe-and-construct-a-new-generator-by-providing-a-circular-array-of-states-which-will-output-the-same-numbers-as-the-original-but-can-be-used-as-a-predictor&quot;&gt;If we manage to leak a continuous set of &lt;strong&gt;55&lt;/strong&gt; generated numbers, we have enough information to describe and construct a new generator (by providing a circular array of states) which will output the same numbers as the original but can be used as a predictor.&lt;/h5&gt;

&lt;p&gt;In my implementation, I’m using the following trick to simplify things: I don’t convert the leaked \(retVal\) back to \(R_i\) (by multiplying with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Int32.MaxValue&lt;/code&gt;) because I’d have to divide it again to compare the results. So I’m working directly with differences of leaked values (instead of differences of \(R_i\)’s) – I hope it makes sense.&lt;/p&gt;

&lt;p&gt;Here’s the code I used, it should help clear things up.&lt;/p&gt;

&lt;div class=&quot;language-csharp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Program&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
	&lt;span class=&quot;cm&quot;&gt;/* predicts random numbers, given 2 state descriptors */&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;computeDiffAndOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;diff&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r1&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diff&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;diff&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;-=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;/(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Int32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MaxValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;diff&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;diff&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;diff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
	
	&lt;span class=&quot;k&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;cm&quot;&gt;/* this we break */&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;Random&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
		
		&lt;span class=&quot;cm&quot;&gt;/* describes the state of the subtractive generator */&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SeedArray&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;56&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
		
		&lt;span class=&quot;cm&quot;&gt;/* leaking the state by observing the first 55 random numbers */&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;56&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;++)&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;SeedArray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;NextDouble&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
		
		&lt;span class=&quot;cm&quot;&gt;/* the offset is known from the original implementation */&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;21&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		
		&lt;span class=&quot;cm&quot;&gt;/* from the theory part: i = index1, j = index2 */&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index1&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index2&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index1&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		
		&lt;span class=&quot;cm&quot;&gt;/* running a few tests */&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;++)&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
			&lt;span class=&quot;cm&quot;&gt;/* handling the circular array limits */&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index1&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;56&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
				&lt;span class=&quot;n&quot;&gt;index1&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
			
			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index2&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;56&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
				&lt;span class=&quot;n&quot;&gt;index2&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
			
			&lt;span class=&quot;cm&quot;&gt;/* this is the predicted random number */&lt;/span&gt;
			&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;predictedValue&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;computeDiffAndOffset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SeedArray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SeedArray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]);&lt;/span&gt;

			&lt;span class=&quot;cm&quot;&gt;/* this is the correct random number */&lt;/span&gt;
			&lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correctRandom&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;NextDouble&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
			
			&lt;span class=&quot;cm&quot;&gt;/* we compare them as doubles */&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predictedValue&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correctRandom&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0.00001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;throw&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Exception&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;Format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Failed at {0} vs {1}&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;predictedValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correctRandom&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
			
			&lt;span class=&quot;cm&quot;&gt;/* printing the results */&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;Console&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;WriteLine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Predicted: &quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;predictedValue&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot; | Correct: &quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correctRandom&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

			&lt;span class=&quot;cm&quot;&gt;/* updating the state of the generator */&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;SeedArray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;predictedValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
			
			&lt;span class=&quot;n&quot;&gt;index1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;++;&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;index2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;++;&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You should get something like this when running it (well, different numbers because you’ll have a different seed - but you get the point). Tested it on &lt;strong&gt;.NET 4.7.2&lt;/strong&gt;.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;Predicted: 0.562743733899083 | Correct: 0.562743733899083
Predicted: 0.0782367256834342 | Correct: 0.0782367256834343
Predicted: 0.48149561019684 | Correct: 0.48149561019684
Predicted: 0.768610569075034 | Correct: 0.768610569075034
Predicted: 0.288163338456379 | Correct: 0.288163338456379
Predicted: 0.652038850659523 | Correct: 0.652038850659523
Predicted: 0.331446861071254 | Correct: 0.331446861071255
Predicted: 0.573066327056413 | Correct: 0.573066327056413
[...]
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Definitely don’t use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Random()&lt;/code&gt; for cryptographic functions. Bad idea.
However, limiting the information provided to the adversary (i.e. hiding the randomly generated numbers) would greatly diminish the effectiveness of this attack.&lt;/p&gt;

&lt;p&gt;Not much else to be said. It’s my first take at breaking something which is not an LCG - it might not be state-of-the-art level (performance-wise) but I hope you found this informative.&lt;/p&gt;
</description>
        <pubDate>Fri, 06 Dec 2019 21:45:05 +0000</pubDate>
        <link>https://codingvision.net/c-predict-random-number-generator-net</link>
        <guid isPermaLink="true">https://codingvision.net/c-predict-random-number-generator-net</guid>
        
        <category>c-sharp</category>
        
        <category>prng</category>
        
        <category>exploit</category>
        
        
      </item>
    
      <item>
        <title>Evaluating the Robustness of OCR Systems</title>
<description>&lt;p&gt;In this article, I’m going to discuss my Bachelor’s degree final project, which is about evaluating the robustness of &lt;strong&gt;OCR systems&lt;/strong&gt; (such as &lt;strong&gt;Tesseract&lt;/strong&gt; or &lt;strong&gt;Google’s Cloud Vision&lt;/strong&gt;) when adversarial samples are presented as inputs. It’s somewhere in-between &lt;strong&gt;fuzzing&lt;/strong&gt; and &lt;strong&gt;adversarial sample crafting&lt;/strong&gt;, on a black box, the main objective being the creation of &lt;strong&gt;OCR-proof&lt;/strong&gt; images with minimal amounts of noise.&lt;/p&gt;

&lt;p&gt;It’s an old project that I recently presented at an &lt;a href=&quot;https://spritz.math.unipd.it/events/2019/PIU2019/PagesOutput/SSS/index.html&quot; rel=&quot;nofollow&quot;&gt;International Security Summer School&lt;/a&gt; hosted by the University of Padua. I decided to also publish it here mainly because of the positive feedback received when presented at the summer school.&lt;/p&gt;

&lt;p&gt;I’ll try to focus on methodology and results, which I consider being of interest, without diving into implementation details.&lt;/p&gt;

&lt;h5 id=&quot;i-published-this-1-year-ago---not-sure-if-it-still-works-as-described-here-hopefully-it-does-but-im-pretty-sure-google-made-changes-to-the-vision-engine-since-then&quot;&gt;I published this ~1 year ago - not sure if it still works as described here. Hopefully it does, but I’m pretty sure Google made changes to the Vision engine since then.&lt;/h5&gt;

&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;/h2&gt;

&lt;p&gt;Let’s start with what I considered to be plausible use cases for this project and what problems it would be able to solve.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Confidentiality&lt;/strong&gt; of text included in images? – It is no surprise to us that large services (that’s you, Google) will scan hosted images for texts in order to improve classification or extract user information. We might want some of that information to remain private.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Smart &lt;strong&gt;CAPTCHA&lt;/strong&gt;? – This aims to improve the efficiency of CAPTCHAs by creating images which are easier to read by humans, thus reducing the discomfort, while also rendering OCR-based bots ineffective.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Defense against &lt;strong&gt;content generators&lt;/strong&gt;? – This could serve as a defense mechanism against programs which scan documents and republish content (sometimes using different names) in order to gain undeserved merits.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;challenges&quot;&gt;Challenges&lt;/h2&gt;

&lt;p&gt;Now, let’s focus on the different constraints and challenges:&lt;/p&gt;

&lt;h3 id=&quot;1-complex--closed-source-architecture&quot;&gt;1. Complex / closed-source architecture&lt;/h3&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/tess-pipeline.png&quot; alt=&quot;Tesseract's pipeline as [presented at DAS 2016](https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/2ArchitectureAndDataStructures.pdf){:rel='nofollow'}&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Tesseract’s pipeline as &lt;a href=&quot;https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/2ArchitectureAndDataStructures.pdf&quot; rel=&quot;nofollow&quot;&gt;presented at DAS 2016&lt;/a&gt;&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Modern OCR systems are more complex than basic convolutional neural networks as they need to perform multiple actions (e.g.: deskewing, layout detection, text row segmentation), therefore finding ways to correctly compute gradients is a daunting task. Moreover, many of them do not provide access to the source code, making it difficult to use techniques such as &lt;strong&gt;FGSM&lt;/strong&gt; or &lt;strong&gt;GAN&lt;/strong&gt;s.&lt;/p&gt;

&lt;h3 id=&quot;2-binarization&quot;&gt;2. Binarization&lt;/h3&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/binarization.png&quot; alt=&quot;Result of the binarization procedure, using an adaptive threshold&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Result of the binarization procedure, using an adaptive threshold&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;An OCR system usually applies a binarization procedure (e.g.: &lt;strong&gt;Otsu&lt;/strong&gt;’s method) to the image before running it through the main classifier in order to separate the text from the background, the ideal output being pure black text on a clean white background.&lt;/p&gt;

&lt;p&gt;This proves troublesome because it prevents the sample generator from altering pixels by small amounts: for example, turning a black pixel into a grayish one will be reverted by the binarization process, thus generating no feedback.&lt;/p&gt;
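&lt;p&gt;As a tiny illustration (a sketch with a fixed global threshold – real systems, e.g. Otsu’s method, pick it adaptively), a small grayish perturbation simply vanishes after thresholding:&lt;/p&gt;

```python
def binarize(pixels, threshold=128):
    # global thresholding: darker than `threshold` becomes text (0), the rest background (255)
    return [0 if p < threshold else 255 for p in pixels]

clean  = [0, 0, 255, 255]   # black text on a white background
nudged = [40, 0, 255, 255]  # first pixel pushed towards gray
assert binarize(clean) == binarize(nudged)  # the perturbation is erased: no feedback
```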

&lt;h3 id=&quot;3-adaptive-classification&quot;&gt;3. Adaptive classification&lt;/h3&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/adaptive-classifier.png&quot; alt=&quot;Tesseract's adaptive classifier incorrectly recognizes an 'h' as a 'b', in the first image. In the second sample, Tesseract observes a correct 'h' character (confidence is larger than a threshold) adjusts the classifier's configuration and correctly classifies the first 'h'&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Tesseract’s adaptive classifier incorrectly recognizes an ‘h’ as a ‘b’, in the first image. In the second sample, Tesseract observes a correct ‘h’ character (confidence is larger than a threshold) adjusts the classifier’s configuration and correctly classifies the first ‘h’&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This is specific to Tesseract, which is rather dated nowadays - still very popular, though. Modern classifiers might be using this method, too. It consists of performing two passes over the same input image. In the first pass, characters which can be recognized with a certain confidence are selected and used as temporary training data. In the second pass, the OCR attempts to classify the characters which were not recognized in the first iteration, using what it previously learned.&lt;/p&gt;

&lt;p&gt;Considering this, having an adversarial generator which alters one character at a time might not work as expected since that character might appear later in the image.&lt;/p&gt;

&lt;h3 id=&quot;4-lower-entropy&quot;&gt;4. Lower entropy&lt;/h3&gt;

&lt;p&gt;This refers to the fact that the input data is rather ‘limited’ for an OCR system when compared to… let’s say object recognition. As an example, images which contain 3D objects have larger variance than those which contain characters since the characters have a rather fixed shape and format. This should make it more difficult to create adversarial samples for character classifiers without applying distortions.&lt;/p&gt;

&lt;p&gt;A direct consequence is that it greatly restricts the amount of noise that can be added to an image so that the readability is preserved.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/noise-readability.png&quot; alt=&quot;Applying noise in an image usually decreases readability, which is not what we want here&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Applying noise in an image usually decreases readability, which is not what we want here&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;5-dictionaries&quot;&gt;5. Dictionaries&lt;/h3&gt;

&lt;p&gt;OCR systems will attempt to improve their accuracy by employing dictionaries with predefined words. Altering a single character in a word (i.e.: the incremental approach) might not be effective in this case.&lt;/p&gt;

&lt;h2 id=&quot;targeted-ocr-systems&quot;&gt;Targeted OCR Systems&lt;/h2&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/tesseract-gocr.png&quot; alt=&quot;Tested locally on Tesseract 4.0 and remotely on Google's Cloud Vision OCR&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Tested locally on Tesseract 4.0 and remotely on Google’s Cloud Vision OCR&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;For this project, I used &lt;strong&gt;Tesseract 4.0&lt;/strong&gt; for prototyping and testing, as it had no timing restrictions and allowed me to run a fast, parallel model with high throughput so I could test if the implementation works as expected. Later, I moved to &lt;strong&gt;Google’s Cloud Vision OCR&lt;/strong&gt; and tried some ‘remote’ fuzzing through the API.&lt;/p&gt;

&lt;h2 id=&quot;methodology&quot;&gt;Methodology&lt;/h2&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/architecture.png&quot; alt=&quot;A rather simplified view of the flow; a feedback-based adversarial samples generator (in image: obfuscator) alters inputs in order to maximize the error of the OCR system&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;A rather simplified view of the flow; a feedback-based adversarial samples generator (in image: obfuscator) alters inputs in order to maximize the error of the OCR system&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In order to be able to cover even black-box cases, I used a &lt;strong&gt;genetic algorithm&lt;/strong&gt; guided by the feedback of the targeted OCR system. Since the confidence of the classifier alone is not a good metric for this problem, a score function based on the &lt;a href=&quot;https://en.wikipedia.org/wiki/Levenshtein_distance&quot; rel=&quot;nofollow&quot;&gt;Levenshtein distance&lt;/a&gt; and the &lt;strong&gt;amount of noise&lt;/strong&gt; is employed.&lt;/p&gt;
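&lt;p&gt;A minimal sketch of such a score function (the weighting factor and the exact mixing are placeholders of mine, not the project’s actual parameters): it rewards OCR errors while penalizing the amount of noise spent to cause them.&lt;/p&gt;

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def score(ocr_output, true_text, noise_pixels, total_pixels, lam=5.0):
    # higher is better for the attacker: many recognition errors, little noise
    return levenshtein(ocr_output, true_text) - lam * noise_pixels / total_pixels
```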

&lt;p&gt;One of the main problems here was the size of the search space which was partially solved by identifying regions of interest in the image and focusing only on these. Also, lots of parameter tuning…&lt;/p&gt;

&lt;h2 id=&quot;noise-properties&quot;&gt;Noise properties&lt;/h2&gt;

&lt;p&gt;Given the constraints, the following properties of the noise model must be matched:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;high contrast&lt;/strong&gt; – so it bypasses the binarization process and generates feedback&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;low density&lt;/strong&gt; – in order to maintain readability by exploiting the natural &lt;strong&gt;low-pass filtering&lt;/strong&gt; capability of human vision&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Applying &lt;strong&gt;salt-and-pepper&lt;/strong&gt; noise in a smart manner will, hopefully, satisfy the constraints.&lt;/p&gt;
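&lt;p&gt;Such noise could be applied along these lines (a sketch; the density and the seeding are arbitrary choices of mine):&lt;/p&gt;

```python
import random

def salt_and_pepper(pixels, density=0.02, seed=0):
    # flip a sparse, random subset of pixels to pure black or pure white:
    # high contrast survives binarization, low density preserves readability
    rng = random.Random(seed)
    out = list(pixels)
    for idx in range(len(out)):
        if rng.random() < density:
            out[idx] = rng.choice((0, 255))
    return out
```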

&lt;h2 id=&quot;working-modes&quot;&gt;Working modes&lt;/h2&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/modes.png&quot; alt=&quot;Different working modes for small and large characters, in order to preserve readability. Both managed to entirely hide the given text when tested on Tesseract 4.0&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Different working modes for small and large characters, in order to preserve readability. Both managed to entirely hide the given text when tested on Tesseract 4.0&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Initially, the algorithm worked using only the &lt;strong&gt;overtext&lt;/strong&gt; mode, which applied noise inside the rectangle containing the characters. However, this method is not the best choice for texts written with smaller characters, mainly because there are fewer pixels that can be altered, so even minimal amounts of noise drastically lower readability. For this special case, the noise is instead inserted in-between the text rows (&lt;strong&gt;artifacts&lt;/strong&gt; mode) in order to preserve the original characters. Both methods presented similar success rates in hiding texts from the targeted OCR system.&lt;/p&gt;

&lt;p&gt;Just for fun, here’s what happens if the score function is inverted, which translates as “generate an image with as much noise as possible, but which can be read by OCR software”. Weird, but it’s still recognized…&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/inverted-function.png&quot; alt=&quot;Tesseract recognized the original text with **no errors**. How about you?&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Tesseract recognized the original text with &lt;strong&gt;no errors&lt;/strong&gt;. How about you?&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;results-on-tesseract&quot;&gt;Results on Tesseract&lt;/h2&gt;

&lt;p&gt;Promising results were achieved while testing against Tesseract 4.0. The following figure presents an early (non-final) sample in which the word “&lt;strong&gt;Random&lt;/strong&gt;” is not recognized by Tesseract:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/tess-results-ui.png&quot; alt=&quot;The first word is successfully hidden from the OCR system&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;The first word is successfully hidden from the OCR system&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;tests-on-googles-cloud-vision-platform&quot;&gt;Tests on Google’s Cloud Vision Platform&lt;/h2&gt;

&lt;p&gt;This is where things get interesting.&lt;/p&gt;

&lt;h5 id=&quot;the-implemented-score-function-can-be-maximized-in-2-ways-hiding-characters-or-tricking-the-ocr-engine-into-adding-characters-which-shouldnt-be-there&quot;&gt;The implemented score function can be maximized in 2 ways: hiding characters or tricking the OCR engine into adding characters which shouldn’t be there.&lt;/h5&gt;

&lt;p&gt;One of the samples managed to create a &lt;strong&gt;loop&lt;/strong&gt; in the recognition process of &lt;strong&gt;Google’s Cloud Vision OCR&lt;/strong&gt;, causing the same text to be recognized multiple times. No &lt;strong&gt;DoS&lt;/strong&gt; resulted (or none that I’m aware of); I’m still not sure whether the loop persisted - it either ran for a small number of iterations, failed (timed out?), or load balancers compensated for it by using different instances.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/cloud_ocr_bug.png&quot; alt=&quot;Possible loop in the recognition process: the same text gets recognized multiple times. The bottom-left and the top-right corners are 'merged' into an oblique text row so the recognition process is sent back to already processed text.&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Possible loop in the recognition process: the same text gets recognized multiple times. The bottom-left and the top-right corners are ‘merged’ into an oblique text row so the recognition process is sent back to already processed text.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Let’s take a closer look at the sample: below, you can see how the adversarial sample was interpreted by Google’s Cloud Vision OCR system. The image was submitted directly to the Cloud Vision platform via the &lt;a href=&quot;https://cloud.google.com/vision/&quot; rel=&quot;nofollow&quot;&gt;“Try the API”&lt;/a&gt; option so, at the moment of testing, the results could be easily reproduced.&lt;/p&gt;
&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/cloud_ocr_bug2.png&quot; alt=&quot;Rectangles returned by Cloud Vision indicate that additional text rows are 'created' during the recognition thus creating a loop&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Rectangles returned by Cloud Vision indicate that additional text rows are ‘created’ during the recognition thus creating a loop&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;There is also the ‘boring’ case, where the characters are simply hidden:&lt;/p&gt;
&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/cloud-ocr-artifacts.png&quot; alt=&quot;Once again, using the artifacts mode on a small text since larger texts are way easier to hide&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Once again, using the artifacts mode on a small text since larger texts are way easier to hide&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;It works, but the project reached its objective and is no longer in development.
It seems difficult to create samples that work for all OCR systems (&lt;strong&gt;generalization&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;Also, the samples are vulnerable to changes at the &lt;strong&gt;preprocessing&lt;/strong&gt; stage in the OCR pipeline such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;noise filtering (e.g.: median filters)&lt;/li&gt;
  &lt;li&gt;compression techniques (e.g.: Fourier compression)&lt;/li&gt;
  &lt;li&gt;downscaling-&amp;gt;upscaling (e.g.: Autoencoders)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, we can conclude that, using this approach, it is more challenging to mask small characters without making the text difficult to read. I compiled the following graph, which compares the images generated by the algorithm (below &lt;strong&gt;7%&lt;/strong&gt; noise density) with a set of images that contain random noise (&lt;strong&gt;15%&lt;/strong&gt; noise density). The two sets contain different images with character sizes of 12, 21, 36 and 50. The random noise set contains 62 samples for each size - average values were used.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Noise efficiency&lt;/strong&gt; is computed by taking into account the &lt;strong&gt;Levenshtein distance&lt;/strong&gt; and the total &lt;strong&gt;amount of noise&lt;/strong&gt; in the image.&lt;/p&gt;
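&lt;p&gt;One plausible way to express such a metric - this is my own illustrative formulation, not necessarily the exact one used for the graph - is recognition damage obtained per unit of added noise:&lt;/p&gt;

```c
/* Hypothetical noise-efficiency metric: edit distance between the
   ground truth and the OCR output, normalized by the number of
   noisy pixels that were added to obtain it. */
double noise_efficiency(int edit_distance, int noisy_pixels)
{
    if (noisy_pixels == 0)
        return 0.0; /* no noise added, nothing to credit */
    return (double)edit_distance / (double)noisy_pixels;
}
```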

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/evaluating-the-robustness-of-ocr-systems/noise-eff-cloudocr.png&quot; alt=&quot;As characters get smaller, the efficiency of the noise added by the algorithm decreases - the random noise samples behave in an opposite manner.&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;As characters get smaller, the efficiency of the noise added by the algorithm decreases - the random noise samples behave in an opposite manner.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;interesting-todos&quot;&gt;Interesting TODOs&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Extracting templates from samples and training a generator?&lt;/li&gt;
  &lt;li&gt;Directly exploiting the row segmentation feature?&lt;/li&gt;
  &lt;li&gt;Attacking Otsu’s binarization method?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Maybe someday…&lt;/p&gt;

&lt;h2 id=&quot;cite&quot;&gt;Cite&lt;/h2&gt;

&lt;p&gt;Should you find this relevant to your work, you can cite the article using:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;@inproceedings{sporici2018evaluation,
  title={An Evaluation of OCR Systems Against Adversarial Machine Learning},
  author={Sporici, Dan and Chiroiu, Mihai and Cioc{\^\i}rlan, Dan},
  booktitle={International Conference on Security for Information Technology and Communications},
  pages={126--141},
  year={2018},
  organization={Springer}
}
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

</description>
        <pubDate>Sat, 07 Sep 2019 21:45:05 +0000</pubDate>
        <link>https://codingvision.net/evaluating-the-robustness-of-ocr-systems</link>
        <guid isPermaLink="true">https://codingvision.net/evaluating-the-robustness-of-ocr-systems</guid>
        
        <category>research</category>
        
        <category>genetic-algorithm</category>
        
        <category>ocr</category>
        
        <category>tesseract</category>
        
        <category>adversarial-machine-learning</category>
        
        
      </item>
    
      <item>
        <title>Hot Patching C/C++ Functions with Intel Pin</title>
        <description>&lt;p&gt;5 years ago, I said in one of my articles that I would return, one day, with a method of &lt;strong&gt;hot patching&lt;/strong&gt; functions inside live processes. So… I guess this is that day.&lt;/p&gt;

&lt;p&gt;What we’ll try to achieve here is to &lt;strong&gt;replace&lt;/strong&gt;, from outside, a function inside a &lt;strong&gt;running executable&lt;/strong&gt;, without stopping/freezing the process (or crashing it…).&lt;/p&gt;

&lt;p&gt;In my opinion, applying hot patches is quite a daunting task, if implemented from scratch, since:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;it requires access to a different process’ memory (most operating systems are fans of &lt;strong&gt;process isolation&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;has software compatibility constraints (&lt;strong&gt;Windows&lt;/strong&gt; binaries vs &lt;strong&gt;Linux&lt;/strong&gt; binaries)&lt;/li&gt;
  &lt;li&gt;has architecture compatibility constraints (&lt;strong&gt;32bit&lt;/strong&gt; vs &lt;strong&gt;64bit&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;it implies working with machine code and brings certain issues to the table&lt;/li&gt;
  &lt;li&gt;it has only a didactic purpose - probably no one would actually use a ‘from-scratch’ method since there are tools that do this better&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Considering these, I guess it is better to use something that was actually written for this task rather than coding something manually.
Therefore, we’ll be looking at a way to do this with &lt;a href=&quot;https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool&quot; rel=&quot;nofollow&quot;&gt;Intel Pin&lt;/a&gt;. I stumbled upon this tool while working on a completely different project, but it seems to be quite versatile. Basically, it is described as a &lt;strong&gt;Dynamic Binary Instrumentation Tool&lt;/strong&gt;; however, we’ll be using it to facilitate the procedure of writing code to another process’ memory.&lt;/p&gt;

&lt;h2 id=&quot;initial-preparations&quot;&gt;Initial Preparations&lt;/h2&gt;

&lt;p&gt;Start by &lt;a href=&quot;https://software.intel.com/en-us/articles/pin-a-binary-instrumentation-tool-downloads&quot; rel=&quot;nofollow&quot;&gt;downloading Intel Pin&lt;/a&gt; and extract it somewhere in your workspace.&lt;/p&gt;

&lt;h5 id=&quot;im-doing-this-tutorial-on-ubuntu-x86_64-but-im-expecting-the-code-to-be-highly-similar-on-windows-or-other-operating-systems&quot;&gt;I’m doing this tutorial on Ubuntu x86_64, but I’m expecting the code to be highly similar on Windows or other operating systems.&lt;/h5&gt;

&lt;p&gt;Now, I imagine this turns out to be useful for endpoints that provide remote services to clients - i.e.: a server receives some sort of input and is expected to also return something. Let’s say that someone discovered that a service is vulnerable to certain inputs - so it can be compromised by the first attacker who submits a specially crafted request. We’ll consider that taking the service down, compiling, deploying and launching a new instance is not a desirable solution, so hot patching is wanted until a new version is ready.&lt;/p&gt;

&lt;p&gt;I’ll use the following &lt;strong&gt;dummy&lt;/strong&gt; C program to illustrate the aforementioned model - to keep it simple, I’m reading inputs from &lt;strong&gt;stdin&lt;/strong&gt; (instead of a tcp stream / network).&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;cp&quot;&gt;#include &amp;lt;stdio.h&amp;gt;
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// TODO: hot patch this method&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;read_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Tell me your name:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    
    &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;scanf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;%s&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// this looks bad&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Hello, %s!&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// not gonna end too soon&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;read_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Some of you probably noticed that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read_input()&lt;/code&gt; function is not very well written since it’s reading inputs using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;scanf(&quot;%s&quot;, name);&lt;/code&gt; and thus enabling an attacker to hijack the program’s execution using &lt;strong&gt;buffer overflow&lt;/strong&gt;.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/hot-patching-functions-with-intel-pin/buffer_overflow.png&quot; alt=&quot;Scanf() reading exceeds the limits of the allocated buffer&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Scanf() reading exceeds the limits of the allocated buffer&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;We intend to patch this vulnerability by “replacing” the vulnerable reading function (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read_input()&lt;/code&gt;) with another one that we know is actually safe. I’m using quotes there because it will act more like a re-routing procedure - the code of the original (vulnerable) function will still be in the process’ memory, but all the calls will be forwarded to the new (patched) method.&lt;/p&gt;

&lt;p&gt;I hope it makes sense for now.&lt;/p&gt;
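&lt;p&gt;The re-routing idea can be illustrated in plain C with a function pointer - this is only a conceptual analogy, not how Pin does it (Pin overwrites the original function’s entry with a jump rather than going through a pointer):&lt;/p&gt;

```c
#include <string.h>

/* Conceptual illustration of re-routing: all calls go through a
   pointer, so "patching" means swapping the pointer to the new code.
   The old function's body stays in memory but is never reached. */
void greet_old(char *out) { strcpy(out, "hello (vulnerable)"); }
void greet_new(char *out) { strcpy(out, "hello (patched)"); }

/* current routing target - initially the vulnerable version */
void (*greet)(char *) = greet_old;

/* re-route every future call to the patched version */
void hot_patch(void) { greet = greet_new; }
```

&lt;p&gt;After &lt;code&gt;hot_patch()&lt;/code&gt; runs, every call site that goes through &lt;code&gt;greet&lt;/code&gt; reaches the new code - which is the effect we want to obtain, from outside, on a live process.&lt;/p&gt;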

&lt;h2 id=&quot;projects-structure&quot;&gt;Project’s Structure&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Intel Pin&lt;/strong&gt; works by performing actions, indicated in &lt;strong&gt;tools&lt;/strong&gt;, to targeted &lt;strong&gt;binaries&lt;/strong&gt; or &lt;strong&gt;processes&lt;/strong&gt;. As an example, you may have a tool that says &lt;em&gt;‘increase a counter each time you find a RET instruction’&lt;/em&gt; that you can attach to an executable and get the value of the counter at a certain time.&lt;/p&gt;

&lt;p&gt;It offers a directory with examples of &lt;strong&gt;tools&lt;/strong&gt;, which can be found at &lt;strong&gt;pin/source/tools/&lt;/strong&gt;. In order to avoid updating makefile dependencies, we’ll work here, so continue by creating a new directory (mine’s named &lt;strong&gt;Hotpatch&lt;/strong&gt;) - this is where the coding happens.&lt;/p&gt;

&lt;p&gt;Also, copy a &lt;strong&gt;makefile&lt;/strong&gt; to your new directory, if you don’t feel like writing one:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nb&quot;&gt;cp&lt;/span&gt; ../SimpleExamples/makefile &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And use the following as your &lt;strong&gt;makefile.rules&lt;/strong&gt; file:&lt;/p&gt;

&lt;div class=&quot;language-make highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nv&quot;&gt;TEST_TOOL_ROOTS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; hotpatch &lt;span class=&quot;c&quot;&gt;# for hotpatch.cpp&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;SANITY_SUBSET&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$(TEST_TOOL_ROOTS)&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$(TEST_ROOTS)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, create a file named &lt;strong&gt;hotpatch.cpp&lt;/strong&gt; with some dummy code and run the &lt;strong&gt;make&lt;/strong&gt; command. If everything works fine, you should end up with something like this…&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/hot-patching-functions-with-intel-pin/directory_structure.png&quot; alt=&quot;Directory structure for the Hotpatch tool&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Directory structure for the Hotpatch tool&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;coding-the-hot-patcher&quot;&gt;Coding the Hot Patcher&lt;/h2&gt;

&lt;p&gt;The whole idea revolves around registering a &lt;strong&gt;callback&lt;/strong&gt; which is called every time the binary loads an image (see &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IMG_AddInstrumentFunction()&lt;/code&gt;). Since the method is defined in the running program itself, we’re interested in the moment the process loads its own image. In this callback, we look for the method that we want to &lt;strong&gt;hot patch&lt;/strong&gt; (replace) - in my example, it’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read_input()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can list the functions that are present in a binary using:&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;nm targeted_binary_name
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The process of replacing a function (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RTN_ReplaceSignatureProbed()&lt;/code&gt;) is based on &lt;strong&gt;probes&lt;/strong&gt; - as you can tell by the name - which, according to &lt;strong&gt;Intel&lt;/strong&gt;’s claims, incur less overhead and are less intrusive. Under the hood, &lt;strong&gt;Intel Pin&lt;/strong&gt; will overwrite the original function’s instructions with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JMP&lt;/code&gt; that points to the replacement function. It is up to you to call the original function, if needed.&lt;/p&gt;

&lt;p&gt;Without further ado, the code I ended up with:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;cp&quot;&gt;#include &quot;pin.H&quot;
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target_routine_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;read_input&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;


&lt;span class=&quot;c1&quot;&gt;// replacement routine's code (i.e. patched read_input)&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;read_input_patched&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;original_routine_ptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;return_address&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Tell me your name:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// 5 stars stdin reading method&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fgets&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stdin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;strcspn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\r\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// discard rest of the data from stdin&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getchar&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;'\n'&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EOF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Hello, %s!&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;


&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;loaded_image_callback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IMG&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;current_image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// look for the routine in the loaded image&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;RTN&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;current_routine&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RTN_FindByName&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target_routine_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    

    &lt;span class=&quot;c1&quot;&gt;// stop if the routine was not found in this image&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RTN_Valid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_routine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// skip routines which are unsafe for replacement&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RTN_IsSafeForProbedReplacement&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_routine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cerr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Skipping unsafe routine &quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target_routine_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot; in image &quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IMG_Name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;endl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// replacement routine's prototype: returns void, default calling standard, name, takes no arguments &lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PROTO&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;replacement_prototype&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PROTO_Allocate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PIN_PARG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CALLINGSTD_DEFAULT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target_routine_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PIN_PARG_END&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// replaces the original routine with a jump to the new one &lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;RTN_ReplaceSignatureProbed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_routine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                               &lt;span class=&quot;n&quot;&gt;AFUNPTR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_input_patched&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; 
                               &lt;span class=&quot;n&quot;&gt;IARG_PROTOTYPE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                               &lt;span class=&quot;n&quot;&gt;replacement_prototype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                               &lt;span class=&quot;n&quot;&gt;IARG_ORIG_FUNCPTR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                               &lt;span class=&quot;n&quot;&gt;IARG_FUNCARG_ENTRYPOINT_VALUE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                               &lt;span class=&quot;n&quot;&gt;IARG_RETURN_IP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                               &lt;span class=&quot;n&quot;&gt;IARG_END&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;PROTO_Free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;replacement_prototype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cout&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Successfully replaced &quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target_routine_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot; from image &quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IMG_Name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;current_image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;endl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;


&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[])&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PIN_InitSymbols&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PIN_Init&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cerr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Failed to initialize PIN.&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;endl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; 
        &lt;span class=&quot;n&quot;&gt;exit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;EXIT_FAILURE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;// registers a callback for the &quot;load image&quot; action&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;IMG_AddInstrumentFunction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loaded_image_callback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// runs the program in probe mode&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PIN_StartProgramProbed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EXIT_SUCCESS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After running &lt;strong&gt;make&lt;/strong&gt;, use a command like the following one to attach &lt;strong&gt;Intel Pin&lt;/strong&gt; to a running instance of the targeted process.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nb&quot;&gt;sudo&lt;/span&gt; ../../../pin &lt;span class=&quot;nt&quot;&gt;-pid&lt;/span&gt; &lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;pidof targeted_binary_name&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-t&lt;/span&gt; obj-intel64/hotpatch.so
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;results-and-conclusions&quot;&gt;Results and Conclusions&lt;/h2&gt;

&lt;p&gt;Aaand it seems to be working:&lt;/p&gt;
&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/hot-patching-functions-with-intel-pin/hot_patched_process.png&quot; alt=&quot;Testing the Hot Patched version against Buffer Overflow&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;Testing the Hot Patched version against Buffer Overflow&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;To conclude, I’m pretty sure &lt;strong&gt;Intel Pin&lt;/strong&gt; is capable of far more complex things than what I’m presenting here, which I’d call examples-level (it’s actually inspired by an example). To me, it seems rather strange that it’s not a more popular tool - and no, I’m not paid by Intel to endorse it.&lt;/p&gt;

&lt;p&gt;However, I hope this article manages to provide support and solutions/ideas to those who are looking into &lt;strong&gt;hot patching&lt;/strong&gt; methods and who, like me, had never heard of &lt;strong&gt;Intel Pin&lt;/strong&gt; before.&lt;/p&gt;

</description>
        <pubDate>Tue, 20 Aug 2019 21:45:05 +0000</pubDate>
        <link>https://codingvision.net/hot-patching-functions-with-intel-pin</link>
        <guid isPermaLink="true">https://codingvision.net/hot-patching-functions-with-intel-pin</guid>
        
        <category>intel-pin</category>
        
        <category>hot-patch</category>
        
        <category>cpp</category>
        
        <category>buffer-overflow</category>
        
        
      </item>
    
      <item>
        <title>Gradient Descent Simply Explained (with Example)</title>
        <description>&lt;p&gt;So… I’ll try to explain here the concept of &lt;strong&gt;gradient descent&lt;/strong&gt; as simply as possible in order to provide some insight into what’s happening from a mathematical perspective and why the formula works. I’ll try to keep it short and split this into 2 &lt;em&gt;chapters&lt;/em&gt;: &lt;strong&gt;theory&lt;/strong&gt; and &lt;strong&gt;example&lt;/strong&gt; - take it as an ELI5 linear regression tutorial.&lt;/p&gt;

&lt;p&gt;Feel free to skip the mathy stuff and jump directly to the &lt;strong&gt;example&lt;/strong&gt; if you feel that it might be easier to understand.&lt;/p&gt;

&lt;h2 id=&quot;theory-and-formula&quot;&gt;Theory and Formula&lt;/h2&gt;

&lt;p&gt;For the sake of simplicity, we’ll work in the &lt;strong&gt;1D&lt;/strong&gt; space: we’ll optimize a function that has only one &lt;strong&gt;coefficient&lt;/strong&gt; so it is easier to plot and comprehend.
The function can look like this:&lt;/p&gt;

\[f(x) = w \cdot x + 2\]

&lt;p&gt;where we have to determine the value of \(w\) such that the function successfully matches / approximates a set of known points.&lt;/p&gt;

&lt;p&gt;Since our interest is to find the best coefficient, we’ll consider \(w\) as a &lt;strong&gt;variable&lt;/strong&gt; in our formulas and while computing the derivatives; \(x\) will be treated as a &lt;strong&gt;constant&lt;/strong&gt;. In other words, we don’t compute the &lt;strong&gt;derivative&lt;/strong&gt; with respect to \(x\) since we don’t want to find values for it - we already have a set of inputs for the function, we’re not allowed to change them.&lt;/p&gt;

&lt;p&gt;To properly grasp the gradient descent, as an optimization method, you need to know the following mathematical fact:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;strong&gt;derivative&lt;/strong&gt; of a function is &lt;span style=&quot;color:green&quot;&gt;positive&lt;/span&gt; when the function &lt;span style=&quot;color:green&quot;&gt;increases&lt;/span&gt; and is &lt;span style=&quot;color:red&quot;&gt;negative&lt;/span&gt; when the function &lt;span style=&quot;color:red&quot;&gt;decreases&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And writing this mathematically…&lt;/p&gt;

\[\frac{\mathrm{d} }{\mathrm{d} w}f(w) {\color{Green}&amp;gt; 0} \rightarrow  f(w) {\color{Green}\nearrow }\]

\[\frac{\mathrm{d} }{\mathrm{d} w}f(w) {\color{Red}&amp;lt; 0} \rightarrow  f(w) {\color{Red}\searrow }\]

&lt;p&gt;This is happening because the derivative can be seen as the slope of a function’s plot at a given point. I won’t go into details here, but check out the graph below - it should help.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why is this important?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Because, as you probably know already, &lt;strong&gt;gradient descent&lt;/strong&gt; attempts to &lt;span style=&quot;color:red&quot;&gt;minimize&lt;/span&gt; the &lt;strong&gt;error function&lt;/strong&gt; (aka cost function).&lt;/p&gt;
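&lt;p&gt;Before moving on, here’s a tiny sanity check of that idea in code (the function and step size below are my own toy choices, not part of the example that follows): repeatedly stepping &lt;em&gt;against&lt;/em&gt; the sign of the derivative moves us toward the minimum.&lt;/p&gt;

```python
# Toy check: f(w) = (w - 3)^2 has its minimum at w = 3.
# Stepping against the derivative's sign should move w toward 3.
def f(w):
    return (w - 3) ** 2

def df(w):                     # analytic derivative of f
    return 2 * (w - 3)

w = 5.0                        # right of the minimum, so df(w) is positive
for _ in range(20):
    w = w - 0.1 * df(w)        # move opposite to the slope

print(round(w, 3))             # ends up close to 3
```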

&lt;p&gt;Now, assuming we use the &lt;strong&gt;MSE&lt;/strong&gt; (Mean Squared Error) function, we have something that looks like this:&lt;/p&gt;

\[\hat{y_i} = f(x_i)\]

\[MSE = \frac{1}{n} \cdot \sum_{i=1}^{i=n}{(y_i - \hat{y_i})^2}\]

&lt;p&gt;Where: \(y_i\) is the correct value, \(\hat{y_i}\) is the current (computed) value and \(n\) is the number of points we’re using to compute the \(MSE\).&lt;/p&gt;
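&lt;p&gt;As a quick sanity check, the formula above is a one-liner in code (the helper name is my own):&lt;/p&gt;

```python
# Direct transcription of the MSE formula above (helper name is my own)
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# e.g. targets 5 and 7 versus predictions 19 and 28:
print(mse([5, 7], [19, 28]))   # 318.5
```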

&lt;h5 id=&quot;the-mse-is-always-positive-since-its-a-sum-of-squared-values-and-therefore-has-a-known-minimum-which-is-0---so-it-can-be-minimized-using-the-aforementioned-method&quot;&gt;The &lt;strong&gt;MSE&lt;/strong&gt; is &lt;strong&gt;always positive&lt;/strong&gt; (since it’s a sum of squared values) and therefore has a &lt;strong&gt;known minimum&lt;/strong&gt;, which is &lt;strong&gt;0&lt;/strong&gt; - so it can be &lt;span style=&quot;color:red&quot;&gt;minimized&lt;/span&gt; using the aforementioned method.&lt;/h5&gt;

&lt;p&gt;Take a look at the plot below: the &lt;strong&gt;sign&lt;/strong&gt; of the &lt;strong&gt;slope&lt;/strong&gt; provides useful information about where the &lt;strong&gt;minimum&lt;/strong&gt; of the function is. We can use the value of the &lt;strong&gt;slope&lt;/strong&gt; (the derivative) to adjust the value of the coefficient &lt;strong&gt;w&lt;/strong&gt; (i.e.: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;w = w - slope&lt;/code&gt;).&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=&quot; data-echo=&quot;/imgs/posts/gradient-descent-simply-explained-with-example/mse-slope-plot.png&quot; alt=&quot;The sign of the slope can be used to locate the function's minimum value.&quot; /&gt;
  &lt;figcaption&gt;&lt;p&gt;The sign of the slope can be used to locate the function’s minimum value.&lt;/p&gt;
&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Time to compute the derivative. Before that, I must warn you: it’s quite a &lt;em&gt;long&lt;/em&gt; formula but I tried to do it step by step. Behold!&lt;/p&gt;

\[\frac{\mathrm{d}}{\mathrm{d} w}MSE = \frac{\mathrm{d}}{\mathrm{d} w} (\frac{1}{n} \cdot \sum_{i=1}^{i=n}{(y_i - \hat{y_i})^2}) =\]

\[= \frac{1}{n} \cdot \frac{\mathrm{d}}{\mathrm{d} w} (\sum_{i=1}^{i=n}{(y_i - \hat{y_i})^2}) =\]

\[= \frac{1}{n} \cdot \sum_{i=1}^{i=n}{\frac{\mathrm{d}}{\mathrm{d} w}((y_i - \hat{y_i})^2}) =\]

\[= \frac{2}{n} \cdot \sum_{i=1}^{i=n}{(y_i - \hat{y_i}) \cdot (-1) \cdot \frac{\mathrm{d} \hat{y_i}}{\mathrm{d} w}}\]

&lt;p&gt;Phew. 
From here, you’d have to replace \(\frac{\mathrm{d} \hat{y_i}}{\mathrm{d} w}\) with the derivative of the function you chose to optimize. For \(\hat{y_i} = w \cdot x_i + 2\), we get:&lt;/p&gt;

\[= \frac{2}{n} \cdot \sum_{i=1}^{i=n}{(y_i - \hat{y_i}) \cdot (-1) \cdot x_i}\]

&lt;p&gt;And that’s about it. You can now update the values of your coefficient \(w\) using the following formula:&lt;/p&gt;

\[w = w - learning\_rate \cdot \frac{\mathrm{d }}{\mathrm{d} w}MSE(w)\]
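&lt;p&gt;Putting the derivative and the update rule together for the 1D model \(f(x) = w \cdot x + 2\), one update step could be sketched like this (the data points and learning rate below are made up for illustration):&lt;/p&gt;

```python
# One gradient-descent step for f(x) = w*x + 2
# (data points and learning rate are made up for illustration)
xs = [1.0, 2.0]
ys = [4.0, 6.0]                # generated with w = 2, i.e. f(x) = 2*x + 2
w, lr, n = 0.0, 0.1, 2

# dMSE/dw = (2/n) * sum((y_i - y_hat_i) * (-1) * x_i)
grad = (2 / n) * sum((y - (w * x + 2)) * (-1) * x for x, y in zip(xs, ys))
w = w - lr * grad              # w moves from 0.0 toward the true value 2
print(w)                       # 1.0 after this first step
```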

&lt;h2 id=&quot;example&quot;&gt;Example&lt;/h2&gt;

&lt;p&gt;We’ll do the example in a &lt;strong&gt;2D&lt;/strong&gt; space, in order to represent a basic &lt;strong&gt;linear regression&lt;/strong&gt; (a &lt;strong&gt;Perceptron&lt;/strong&gt; without an activation function). 
Given the function below:&lt;/p&gt;

\[f(x) = w_1 \cdot x + w_2\]

&lt;p&gt;we have to find \(w_1\) and \(w_2\), using &lt;strong&gt;gradient descent&lt;/strong&gt;, so it approximates the following set of points:&lt;/p&gt;

\[f(1) = 5, f(2) = 7\]

&lt;p&gt;We start by writing the &lt;strong&gt;MSE&lt;/strong&gt;:&lt;/p&gt;

\[MSE = \frac{1}{n} \cdot \sum_{i=1}^{i=2}{(y_i - (w_1 \cdot x_i + w_2))^2}\]

&lt;p&gt;And then the differentiation part. Since there are &lt;strong&gt;2 coefficients&lt;/strong&gt;, we compute &lt;strong&gt;partial derivatives&lt;/strong&gt; - each one corresponds to its coefficient.&lt;/p&gt;

&lt;p&gt;For \(w_1\):&lt;/p&gt;

\[\frac{\partial}{\partial w_1} (\frac{1}{n} \cdot \sum_{i=1}^{i=2}{(y_i - (w_1 \cdot x_i + w_2))^2}) =\]

\[= \frac{1}{n} \cdot \sum_{i=1}^{i=2}{\frac{\partial}{\partial w_1}(y_i - (w_1 \cdot x_i + w_2))^2} =\]

\[= \frac{1}{n} \cdot 2 \cdot \sum_{i=1}^{i=2}{(y_i - (w_1 \cdot x_i + w_2)) \cdot (-1) \cdot x_i} =\]

\[= -\frac{2}{n} \cdot \sum_{i=1}^{i=2}{(y_i - (w_1 \cdot x_i + w_2)) \cdot x_i}\]

&lt;p&gt;For \(w_2\):&lt;/p&gt;

\[\frac{\partial}{\partial w_2} (\frac{1}{n} \cdot \sum_{i=1}^{i=2}{(y_i - (w_1 \cdot x_i + w_2))^2}) =\]

\[= -\frac{2}{n} \cdot \sum_{i=1}^{i=2}{(y_i - (w_1 \cdot x_i + w_2))}\]

&lt;p&gt;Now, we pick some &lt;strong&gt;random&lt;/strong&gt; values for our coefficients. Let’s say \(w_1 = 9\) and \(w_2 = 10\).&lt;/p&gt;

&lt;p&gt;We compute:&lt;/p&gt;

\[f(1) = 9 \cdot 1 + 10 = 19, f(2) = 9 \cdot 2 + 10 = 28\]

&lt;p&gt;Obviously, these are not the outputs we’re looking for, so we’ll continue by adjusting the coefficients (we’ll consider a &lt;strong&gt;0.15&lt;/strong&gt; learning rate):&lt;/p&gt;

\[w_1 = w_1 - learning\_rate \cdot \frac{\partial}{\partial w_1} MSE =\]

\[= 9 + 0.15 \cdot \frac{2}{2} \cdot \sum_{i=1}^{i=2}{(y_i - (w_1 \cdot x_i + w_2)) \cdot x_i} =\]

\[= 9 + 0.15 \cdot ((5 - (9 \cdot 1 + 10)) \cdot 1 + (7 - (9 \cdot 2 + 10)) \cdot 2) =\]

\[= 9 - 0.15 \cdot 56 = 0.6\]

\[w_2 = w_2 - learning\_rate \cdot \frac{\partial}{\partial w_2} MSE =\]

\[= 10 + 0.15 \cdot \frac{2}{2} \cdot \sum_{i=1}^{i=2}{(y_i - (w_1 \cdot x_i + w_2))} =\]

\[= 10 + 0.15 \cdot ((5 - (9 \cdot 1 + 10)) + (7 - (9 \cdot 2 + 10))) =\]

\[= 10 - 0.15 \cdot 35 = 4.75\]

&lt;p&gt;Recalculating the outputs of our function, we observe that they are somewhat closer to our expected values.&lt;/p&gt;

\[f(1) = 0.6 \cdot 1 + 4.75 = 5.35, f(2) = 0.6 \cdot 2 + 4.75 = 5.95\]

&lt;p&gt;Running a second step of optimization:&lt;/p&gt;

\[w_1 = 0.6 + 0.15 \cdot ((5 - (0.6 \cdot 1 + 4.75)) \cdot 1 + (7 - (0.6 \cdot 2 + 4.75)) \cdot 2) =\]

\[= 0.6 + 0.15 \cdot 1.75 = 0.86\]

\[w_2 = 4.75 + 0.15 \cdot ((5 - (0.6 \cdot 1 + 4.75)) + (7 - (0.6 \cdot 2 + 4.75))) =\]

\[= 4.75 + 0.15 \cdot 0.7 = 4.85\]

&lt;p&gt;Now, this is going to take multiple iterations in order to converge and we’re not going to do everything by hand.
Writing this formula as a Python script yields the following results:&lt;/p&gt;
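&lt;p&gt;For reference, a minimal sketch of such a script (assuming the same starting values \(w_1 = 9\), \(w_2 = 10\) and the same 0.15 learning rate) might look like this:&lt;/p&gt;

```python
# Minimal gradient descent for f(x) = w1*x + w2 on the points f(1) = 5, f(2) = 7,
# starting from w1 = 9, w2 = 10 with a 0.15 learning rate (as in the example above)
xs = [1.0, 2.0]
ys = [5.0, 7.0]
w1, w2 = 9.0, 10.0
lr = 0.15
n = len(xs)

for step in range(1, 201):
    preds = [w1 * x + w2 for x in xs]
    mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n
    print(f"{step}: w1 = {w1:.3f}, w2 = {w2:.3f}, MSE: {mse}")
    print(f"   f(1) = {preds[0]:.3f}, f(2) = {preds[1]:.3f}")
    # partial derivatives of the MSE, as derived above
    g1 = -(2 / n) * sum((y - p) * x for x, y, p in zip(xs, ys, preds))
    g2 = -(2 / n) * sum(y - p for y, p in zip(ys, preds))
    w1 = w1 - lr * g1
    w2 = w2 - lr * g2
```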

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;1: w1 = 9.000, w2 = 10.000, MSE: 318.5 
   f(1) = 19.000, f(2) = 28.000
------------------------------------------------
2: w1 = 0.600, w2 = 4.750, MSE: 0.6125 
   f(1) = 5.350, f(2) = 5.950
------------------------------------------------
3: w1 = 0.862, w2 = 4.855, MSE: 0.345603125 
   f(1) = 5.718, f(2) = 6.580
------------------------------------------------
4: w1 = 0.881, w2 = 4.810, MSE: 0.330451789063 
   f(1) = 5.691, f(2) = 6.572
------------------------------------------------
5: w1 = 0.906, w2 = 4.771, MSE: 0.316146225664 
   f(1) = 5.676, f(2) = 6.582
------------------------------------------------
6: w1 = 0.929, w2 = 4.732, MSE: 0.302460106908 
   f(1) = 5.662, f(2) = 6.591
------------------------------------------------
7: w1 = 0.953, w2 = 4.694, MSE: 0.289366466781 
   f(1) = 5.647, f(2) = 6.600
------------------------------------------------
8: w1 = 0.976, w2 = 4.657, MSE: 0.276839656487 
   f(1) = 5.633, f(2) = 6.609
------------------------------------------------
9: w1 = 0.998, w2 = 4.621, MSE: 0.264855137696 
   f(1) = 5.619, f(2) = 6.617


[...]


------------------------------------------------
195: w1 = 1.984, w2 = 3.026, MSE: 7.04866766459e-05 
     f(1) = 5.010, f(2) = 6.994
------------------------------------------------
196: w1 = 1.984, w2 = 3.026, MSE: 6.74352752985e-05 
     f(1) = 5.010, f(2) = 6.994
------------------------------------------------
197: w1 = 1.984, w2 = 3.025, MSE: 6.45159705491e-05 
     f(1) = 5.010, f(2) = 6.994
------------------------------------------------
198: w1 = 1.985, w2 = 3.025, MSE: 6.17230438739e-05 
     f(1) = 5.009, f(2) = 6.994
------------------------------------------------
199: w1 = 1.985, w2 = 3.024, MSE: 5.90510243065e-05 
     f(1) = 5.009, f(2) = 6.994
------------------------------------------------
200: w1 = 1.985, w2 = 3.024, MSE: 5.64946777215e-05 
     f(1) = 5.009, f(2) = 6.994
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It converges to \(w_1 = 2\) and \(w_2 = 3\) which are, indeed, the coefficients we were looking for.&lt;/p&gt;

&lt;p&gt;In practice, I recommend experimenting with &lt;strong&gt;smaller&lt;/strong&gt; learning rates and more iterations - large learning rates can lead to &lt;strong&gt;divergence&lt;/strong&gt; (the coefficients stray from their correct values and tend to plus or minus infinity).&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;I guess this is all. Reading it now, I think it might take more than 5 minutes but… it’s still a short article compared to others that discuss the same subject :))&lt;/p&gt;

&lt;p&gt;I hope this proves useful as a starting point and you’ve got something out of it. &lt;strong&gt;Backpropagation&lt;/strong&gt; of errors in &lt;strong&gt;neural networks&lt;/strong&gt; works in a similar fashion, although the number of dimensions is way larger than what was presented here. Aaand it contains some additional features in order to handle &lt;strong&gt;non-convex&lt;/strong&gt; functions (and avoid getting stuck in &lt;strong&gt;local minima&lt;/strong&gt;). Maybe in another article we’ll take a look at those, too.&lt;/p&gt;
</description>
        <pubDate>Mon, 12 Aug 2019 21:45:05 +0000</pubDate>
        <link>https://codingvision.net/gradient-descent-simply-explained-with-example</link>
        <guid isPermaLink="true">https://codingvision.net/gradient-descent-simply-explained-with-example</guid>
        
        <category>algorithm</category>
        
        <category>optimization</category>
        
        <category>gradient-descent</category>
        
        
      </item>
    
  </channel>
</rss>
