Today, I’d like to share a quick post. As you already know, Kaggle is the place to study and learn machine learning. Kaggle’s users share their solutions and insights for many ML problems. How can they share so much information in an easy and practical way? They do it through kernels. What does a kernel look like? Let’s take a look.

From Kaggle:

Scripts are short snippets of code that do individual tasks, but what we have created is something more. Kernels are a combination of environment, input, code, and output – all stored together for every version you create. By storing all of these attributes together Kernels are fundamentally reproducible, easy to share, and easy to learn from or fork. Kernels contain both the code needed for an analysis, and the analysis itself.

In my opinion the idea of a kernel is a great one. Now we have a platform where we can write scripts, add data, explain the algorithm in a wiki-like format, examine data, plot graphics and share our knowledge, all in one place! On top of that, you don’t need to clone anything locally or set up your machine to run small tests; everything can be done in your browser. You can develop your code in Python or R (which, by the way, I’m studying). You can also fork someone else’s kernel, modify it, improve it, run it again and share it with others!

There are two types of kernels: Script and Notebook. The image below explains pretty well when to choose one or the other.

For this post, I created a notebook where I implemented the same multi-class logistic regression for the MNIST dataset. Remember that we used the very same dataset in our previous posts, so we’ll be able to compare results. The link for the kernel is here.

After running the kernel, our classifier’s accuracy is 87.84%. In Octave we had 87.50% without regularization and 87.75% with some regularization, both close to the 87.84%.

So we can see that the kernel is a very good tool for prototyping and testing, and a great place to learn and share knowledge.

Bye and until next time!

---

Sometimes when we train our algorithm, it becomes too specific to our dataset, which is not good. Why? Because the algorithm must also be able to correctly classify data it has never seen before. So today, I’ll show you a way to improve the accuracy of our algorithm on new data.

The problem described in the introduction is known as overfitting. Overfitting occurs when the algorithm becomes too specialized to the training data. It means that during training it will classify very well (even 100%) but it will perform poorly on new data. More information about overfitting can be found on Wikipedia.

The image below depicts what happens when we overfit:

Figure 1 – Overfitting problem

As you can see, the green line is what we get when we overfit. It tries to split every data point exactly as labeled. This occurs, for example, when we have too many features.

Regularization is a technique used to avoid overfitting. Simply put, we’ll add a term to the function minimized by gradient descent.

Currently our logistic regression uses the following gradient descent update:

$$a_j := a_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

In order to add regularization we just need to add the term $\frac{\lambda}{m} a_j$ (for $j \geq 1$; the bias parameter $a_0$ is not regularized):

$$a_j := a_j - \alpha \left[\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m} a_j\right]$$

With some manipulation we’ll find:

$$a_j := a_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

The term

$$\left(1 - \alpha\frac{\lambda}{m}\right)$$

is a number slightly below 1, and intuitively we can see that it will reduce the value of $a_j$ by some amount on every iteration.
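The regularized update above can be sketched in Python (the blog’s own code is in Octave); the tiny dataset and the hyperparameter values here are made up purely for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_step(theta, X, y, alpha, lam):
    """One gradient descent step for logistic regression with L2 regularization.
    Each row of X starts with a 1 (the x0 bias feature); theta[0] is not regularized."""
    m = len(y)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, xi)))
        for j in range(len(theta)):
            grad[j] += (h - yi) * xi[j] / m
    new_theta = []
    for j in range(len(theta)):
        reg = (lam / m) * theta[j] if j > 0 else 0.0  # skip the bias term
        new_theta.append(theta[j] - alpha * (grad[j] + reg))
    return new_theta

# Made-up one-feature example: each row is [1 (bias), feature]
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 0.2], [1.0, 1.8]]
y = [0, 1, 0, 1]
theta = [0.0, 0.0]
for _ in range(200):
    theta = regularized_step(theta, X, y, alpha=0.5, lam=0.1)
```

Note that the shrink factor $(1 - \alpha\lambda/m)$ is applied implicitly here by adding $\frac{\lambda}{m}a_j$ to the gradient, which is the same update written differently.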

Please don’t confuse the parameter $a$ with $\alpha$. I’m sorry, but KaTeX renders the letter a and alpha very similarly.

For this post, we’ll rescale our features as well, using the following formula (as discussed here):

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Splitting the dataset as 10k samples for training and 32k for testing, I ran the following tests:

- No regularization: , 50 iterations, learning rate , we get 87.50%.
- Regularization: , 50 iterations, learning rate , we get 87.75%.
- Regularization: , 50 iterations, learning rate , we get 77.53%. (Underfitting)

Submitting our regularized algorithm to Kaggle, we finally got 87.642%, which is close to our 87.75%.

Of course, this algorithm can be improved and tuned, but you got the idea! You can find the source code here.

Next time we’ll start looking at neural networks, but I have to confess that it will take a little while to prepare the post!

Bye and I hope to see you soon!

---

In the last post we classified MNIST using our classifier. I was wondering how we could evaluate our algorithm with “real” data. What if we could evaluate it against algorithms from other people?

I guess most of you already know Kaggle, right? Kaggle is a website where you will find tons of material and competitions (sometimes with money prizes) where you can practice machine learning. It’s a must for those who want to learn more about it.

Currently there is a competition to classify the MNIST dataset and since we have already done a classifier for that, why not test it and try to improve it?

As mentioned before, Kaggle is a website where you can learn machine learning, share techniques and insights, ask questions, etc. It’s really the place to find information.

You must first create an account and then subscribe to the “Digit Recognizer” competition. Today, there are 1871 people participating and the final result will come in 2020. To give you a taste of how your algorithm is performing, the site immediately reports results over 25% of the testing data.

Kaggle gives us 42k labeled images for training and we need to classify 28k test images. In the training file, the first column is the label and the other 784 columns are the pixels of the 28×28 image.

For both files, the first line is the header of the columns, so it’s needed only for our understanding, not for our algorithm. The files are in .csv format, which is easy to import in Octave.

After classifying all 28k images, we need to create a file to be submitted for evaluation, in the format described on the competition’s page.
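As a sketch, the submission file could be generated like this in Python (the blog uses Octave); the `ImageId`/`Label` header follows the competition’s sample submission, and the predictions here are made up:

```python
import csv
import io

# Made-up predictions for illustration; in practice this would be the
# classifier's output for each of the 28k test images, in file order.
predictions = [2, 0, 9, 7]

buf = io.StringIO()  # stand-in for open("submission.csv", "w", newline="")
writer = csv.writer(buf)
writer.writerow(["ImageId", "Label"])  # header expected by the competition
for image_id, label in enumerate(predictions, start=1):
    writer.writerow([image_id, label])

print(buf.getvalue())
```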

Before getting to the results, I would like to share the approach I used to train my algorithm. I used part of the training dataset for training and the rest to simulate the test against Kaggle’s real dataset. This is what I got from Octave:

Hits: 26715, Miss: 5285. Total: 32000

Multi-class Logistic Regression accuracy: 83.48%

The results that I got from my dataset were really close to what Kaggle gave me, as you can see below:

- Test using a learning rate of 0.6, 200 iterations and all 42k images for training: 0.77828.
- Test using the first 5k images for training, learning rate 0.3 and 100 iterations: 0.70785.
- Last test using 10k images for training, learning rate 0.6 and 200 iterations: 0.83485.

So as you can see, you can tune your algorithm using several parameters during training. This is the tricky part. In the next post we’ll try to improve our algorithm using well-known techniques to avoid overfitting.

As usual, the code is in my github.

Bye bye

---

Until now our algorithm could only perform binary classification; in other words, it could only tell whether a sample belongs to one class or not. I was wondering whether we could improve our algorithm into a multi-class classifier and use it to classify images.

For this post, I’ll be using the well-known MNIST database for training and classification. It has 10 different classes of handwritten digits, 0 to 9.

From wikipedia:

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.

The MNIST database contains 60k training images and 10k testing images. The images are 28 x 28 pixels and have 8 bits of resolution for grayscale levels. The complete database can be found here.

In order to load the images, we’ll use functions already written for matlab/octave which can be found here.

As we know, our logistic regression algorithm can only tell us “yes, most probably it’s X” or “no, most probably it’s not X”. So, with this in mind, we could build 10 of these classifiers, one for each digit, and classify a number among the other nine.

We can remember from the Logistic Regression post that we have the following:

$$h(x) = g(a^T x)$$

and then

$$g(z) = \frac{1}{1 + e^{-z}}$$

For the Iris dataset, we had four features (sepal length/width and petal length/width) and then $a$ was a vector of 1×5. Remember that $x_0$ is always 1. For the MNIST database, the images are 28×28 pixels, which means that we’ll have 784 features.

As we need 10 classifiers, we’ll have to calculate all the parameters for ten equations. In vector form we have:

$$h^{(k)}(x) = g\left(a^{(k)T} x\right), \quad k = 1, \ldots, 10$$

In this case, instead of having a 1 × M vector, we’ll have a 10 × M matrix (which we’ll call $A$), where M is the number of features plus one, in our case 785. This is easy to do in Octave using the same code that we used in the logistic regression post.

First we’ll need to train each class’s classifier. In the code we’ll have a loop repeating the very same gradient descent steps. Once all parameters are found, we classify. The classification runs all 10 classifiers and picks the one with the highest probability. Below is an example for the digit seven, where we can clearly see it had a 99.98% chance of being well classified:
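The one-vs-all idea can be sketched in Python (the blog’s code is in Octave). This toy example uses a single feature and three classes instead of MNIST’s 784 features and ten classes; all data here is made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_binary(X, y, alpha, iters):
    """Plain gradient descent for one binary logistic classifier."""
    theta = [0.0] * len(X[0])
    m = len(y)
    for _ in range(iters):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            h = sigmoid(sum(t * x for t, x in zip(theta, xi)))
            for j in range(len(theta)):
                grad[j] += (h - yi) * xi[j] / m
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

def train_one_vs_all(X, labels, num_classes, alpha=0.5, iters=500):
    """One classifier per class: relabel the data as 'this class vs the rest'."""
    thetas = []
    for c in range(num_classes):
        y = [1 if label == c else 0 for label in labels]
        thetas.append(train_binary(X, y, alpha, iters))
    return thetas

def predict(thetas, xi):
    """Pick the class whose classifier reports the highest probability."""
    probs = [sigmoid(sum(t * x for t, x in zip(theta, xi))) for theta in thetas]
    return probs.index(max(probs))

# Toy data: rows are [1 (bias), feature]; three classes spread along the feature axis
X = [[1.0, 0.1], [1.0, 0.2], [1.0, 1.0], [1.0, 1.1], [1.0, 2.0], [1.0, 2.1]]
labels = [0, 0, 1, 1, 2, 2]
thetas = train_one_vs_all(X, labels, num_classes=3)
```

For MNIST, `X` would have 785 columns (bias plus 784 pixels) and `num_classes` would be 10, but the structure is the same.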

First I trained the algorithm with all 60k samples and some different parameters. It took some time to make all those calculations!

We can clearly see that the learning rate and iteration count have a big impact on the final accuracy of the classifier. Of course the code is available on GitHub as usual.

Learning rate

start value: 0 for entire matrix

iteration: 200

Hits: 9036, Miss: 964. Total: 10000

Multi-class Logistic Regression accuracy: 90.36%

Learning rate

start value: 0 for entire matrix

iteration: 500

Hits: 8954, Miss: 1046. Total: 10000

Multi-class Logistic Regression accuracy: 89.54%

Learning rate

start value: 0.1 for entire matrix

iteration: 50

Hits: 8807, Miss: 1193. Total: 10000

Multi-class Logistic Regression accuracy: 88.07%

Learning rate

start value: 0 for entire matrix

iteration: 50

Hits: 8607, Miss: 1393. Total: 10000

Multi-class Logistic Regression accuracy: 86.07%

Bye and until next time!

---

Today we’ll get our hands dirty and test the logistic regression algorithm. For this post, we are going to use the well-known iris flower data set. This dataset has three classes of flowers which can be classified according to their sepal width/length and petal width/length. From the dataset source: “*One class is linearly separable from the other 2 […]*”, which makes this dataset handy for our purpose of binary classification.

This dataset has 150 measurements of sepal width/length and petal width/length, 50 for each type of flower. The first thing to do is understand our data by plotting all combinations of sepal width/length and petal width/length, as in the image below:

Figure 1 – Iris flower data visualization (Iris Setosa in blue)

As we can see, the Iris Setosa flower can be linearly separated from the other two classes in all combinations. In other words, we can plot a straight line which splits the flowers into two different groups, and this is what we must do: find the curve equation. Remember, in our example it is linear, but it could be any curve that allows us to classify, say, Iris Versicolor among the other two.

As we can see, the logistic regression classifier was able to classify Iris Setosa without errors.

Now, let’s take a look at the other two classes:

Figure 2 – Iris Versicolor vs all other Iris flowers

Figure 3 – Iris Virginica vs all other Iris flowers

As we can see, both Versicolor and Virginica have samples which overlap each other. If we try to train and classify, we’ll get errors during classification. For this test I used 117 samples for training and 33 for testing the algorithm.

After training, here are the parameters found:

and 10000 iterations

Iris Setosa:

Iris Versicolor:

Iris Virginica:

We need to remember that:

$$h(x) = \frac{1}{1 + e^{-(a_0 x_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4)}}$$

where:

$x_0$: 1

$x_1$: Sepal length

$x_2$: Sepal width

$x_3$: Petal length

$x_4$: Petal width
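With the parameters in hand, classifying a new flower is just evaluating the hypothesis above. A minimal sketch in Python (the blog’s code is in Octave); the parameter values here are made up for illustration, the real trained values are in the Octave code:

```python
import math

def predict_proba(params, features):
    """h(x) = sigmoid(a' * x), with x0 = 1 prepended to the features."""
    x = [1.0] + list(features)
    z = sum(a * xi for a, xi in zip(params, x))
    return 1.0 / (1.0 + math.exp(-z))

# Made-up parameters for an "is it Iris Setosa?" classifier
params = [1.0, 1.2, 3.1, -4.1, -2.0]
# features: sepal length, sepal width, petal length, petal width (cm)
setosa_like = [5.1, 3.5, 1.4, 0.2]
virginica_like = [6.5, 3.0, 5.5, 2.0]

p1 = predict_proba(params, setosa_like)     # high probability
p2 = predict_proba(params, virginica_like)  # low probability
```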

The table 1 below summarizes the results:

Table 1 – Results over testing samples

| | True positive / negative | False positive | False negative | Accuracy |
|---|---|---|---|---|
| Iris Setosa | 33 | 0 | 0 | 100% |
| Iris Versicolor | 25 | 2 | 6 | 75.76% |
| Iris Virginica | 32 | 1 | 0 | 96.97% |

Most probably we would have worse results if I had used more data for testing. I think I got lucky that the samples selected weren’t close to the boundaries. Out of curiosity, I tested again over all the data (training and test) and got the following (which we all know we shouldn’t be doing =D):

Table 2 – Results over all samples

| | True positive / negative | False positive | False negative | Accuracy |
|---|---|---|---|---|
| Iris Setosa | 150 | 0 | 0 | 100% |
| Iris Versicolor | 106 | 12 | 32 | 70.67% |
| Iris Virginica | 147 | 3 | 0 | 97.33% |

As usual, I wrote the code in Octave and the source is available on my GitHub page.

In the next post we’ll test our classifier with the MNIST dataset and cover multi-class classification.

See you then!

---

Yeah, things are getting more interesting, huh? In the last posts we covered linear regression, where we fit a straight line to represent a set of points the best way possible. This is a simple and powerful way to predict values. But sometimes, instead of predicting a value, we want to classify data.

Why would we do that? I know it’s easy to understand, but for those who didn’t catch it: why is this interesting?

We humans classify almost everything in our lives without noticing it. We are able to recognize, differentiate and categorize things. We know that a dog is a dog and not a wolf. We can recognize people, cars, buildings, etc. Now, imagine if we could make a program able to do the same. Wouldn’t it be nice?

The **classification problem is just like the regression problem**, except that the values we now want to predict take on only a small number of discrete values. We’ll first cover the binary classification problem, which means that we have only two classes.

For example, we could classify a house as expensive or not expensive depending on the number of rooms, area, price, city, etc.

In linear regression we were using a linear equation to fit the best curve given a set of points.

In **logistic regression** we don’t want to fit a curve to a set of points as in linear regression, but instead classify data into categories. For that we’ll use the sigmoid function, which is expressed as:

$$g(z) = \frac{1}{1 + e^{-z}}$$

It has the following shape:

Figure 1 – Sigmoid function

This function is interesting since it maps any value of $z$ into a number between 0 and 1. So $h(x) = g(a^T x)$ will actually represent the probability that our output is 0 or 1, where 0 and 1 are our two classes.
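We can check this mapping numerically; a quick sketch in Python (the blog’s code is in Octave):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Any real input is squashed into (0, 1), with g(0) = 0.5 at the midpoint
print(sigmoid(-10))  # very close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # very close to 1
```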

As our output is discrete, we may “round” the result as follows:

$$y = 1 \text{ if } h(x) \geq 0.5, \qquad y = 0 \text{ if } h(x) < 0.5$$

Looking at Figure 1 we can see that at $z = 0$ we have $y = 0.5$, so:

$$h(x) \geq 0.5 \text{ when } z \geq 0, \qquad h(x) < 0.5 \text{ when } z < 0$$

where (again) $z = a^T x$.

Now, if we pay attention, we have two inequalities. This means that we have a curve and two regions: one where $a^T x \geq 0$ and another where $a^T x < 0$, which are our two possible classes.

It’s worth noticing that $a^T x$ doesn’t need to be linear; it can be any polynomial function of the features.

As in linear regression, we must minimize the error. In logistic regression the error function is defined as

$$E = \frac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}\left(h(x^{(i)}), y^{(i)}\right)$$

where the cost function is

$$\mathrm{Cost}(h(x), y) = -\log(h(x)) \quad \text{if } y = 1$$

Figure 2 – Cost error for y = 1

$$\mathrm{Cost}(h(x), y) = -\log(1 - h(x)) \quad \text{if } y = 0$$

Figure 3 – Cost error for y = 0

These two cost functions indicate that $h(x)$ must be equal to $y$ to have zero error. If not, the error grows very fast, towards infinity as the prediction approaches the wrong class.

We can simplify and merge both equations into a single one:

$$\mathrm{Cost}(h(x), y) = -y\log(h(x)) - (1 - y)\log(1 - h(x))$$

Note that using this equation guarantees that the error function will be convex for logistic regression, which means we’ll converge to the global minimum when using gradient descent.

Finally, our error function will be:

$$E = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - h(x^{(i)})\right)\right]$$
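The cost behavior described above is easy to verify numerically. A small Python sketch (the blog’s code is in Octave):

```python
import math

def cost(h, y):
    """Per-sample cross-entropy cost: -y*log(h) - (1 - y)*log(1 - h)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# Confident correct predictions cost almost nothing...
print(cost(0.999, 1))  # near 0
print(cost(0.001, 0))  # near 0
# ...while confident wrong predictions are punished heavily
print(cost(0.001, 1))  # large
```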

As in linear regression, we can use gradient descent to minimize the error function (eq. 3) and find the parameters $a$.

We still need to calculate the derivatives and follow the same algorithm. In each iteration we must update the parameters as we did in linear regression:

$$a_j := a_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

In this post we covered the theory of logistic regression, which is very close to linear regression. Despite the name, it’s a classification algorithm which allows us to classify data into two different classes.

In the next post we’ll get our hands dirty and test this algorithm, classifying some data and checking how it performs.

Seeya!!!

---

In this post we’ll show a different way to solve the error minimization problem in linear regression. Instead of using gradient descent, we can solve this linear system using matrices.

As we know, a system of linear equations can be solved with matrices. In this case, we don’t need to define a learning rate $\alpha$, we don’t need to iterate, and we don’t need to scale features.

We minimize the squared errors by taking the derivatives, setting them to zero, and then solving the linear system with the following matrix equation:

$$a = (X^T X)^{-1} X^T y$$

**Remember**: $a$ is a vector.

How does this equation work, and why? As seen before, once we define the error function, the whole idea is to find a way to minimize it. In our case we are using the mean squared error (MSE). We can get the minimum error by taking the derivatives $\frac{\partial E}{\partial a_j}$ and setting them to zero:

$$\frac{\partial E}{\partial a_j} = 0$$

Working on the equation we have

$$X^T X a = X^T y$$

and finally

$$a = (X^T X)^{-1} X^T y$$

Now, we can see the same result step by step in matrix form:

i) $X a = y$

ii) $X^T X a = X^T y$

iii) $a = (X^T X)^{-1} X^T y$
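For the simple (one feature) case, $(X^T X)$ is just a 2×2 matrix, so the normal equation can be solved by hand. A sketch in Python (the blog’s code is in Octave); the data points are made up and lie exactly on a line so the answer is easy to check:

```python
def normal_equation_simple(xs, ys):
    """Solve a = (X'X)^-1 X'y for h(x) = b + a*x, where X = [1, x]."""
    m = len(xs)
    # Entries of the 2x2 matrix X'X and the vector X'y
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    s_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    det = m * s_xx - s_x * s_x
    b = (s_xx * s_y - s_x * s_xy) / det  # intercept
    a = (m * s_xy - s_x * s_y) / det     # slope
    return b, a

# Points lying exactly on y = 2x + 1, so the exact solution is recoverable
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b, a = normal_equation_simple(xs, ys)
```

In Octave, the same computation is a one-liner with the backslash operator (`a = (X' * X) \ (X' * y)`), which is also numerically safer than computing the inverse explicitly.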

You can find a very good (and very long) explanation on how it works here.

I wrote a script in Octave to calculate the parameters using the normal equation method with both datasets used for simple and multiple linear regression. Here are the results compared to those from gradient descent:

**Simple linear regression**

Starting the gradient descent algorithm with $a = 0$, $b = 0$ and 10000 iterations

| | Normal equation | Gradient descent |
|---|---|---|
| | 1.2921 | 1.4826 |
| | 9.7777 | 0.065248 |
| Error | 53.776 | 55.529 |

We can see that gradient descent finally converged to a local minimum.

**Multiple linear regression**

Starting the algorithm with all parameters at 0 and 10000 iterations

| | Normal equation | Gradient descent |
|---|---|---|
| | 77.09660 | 73.632116 |
| | 0.33745 | 0.652041 |
| | -1.01820 | -0.859574 |
| Error | 11.529 | 11.716 |

I did another test with 100,000 iterations for multiple linear regression and it finally converged to the global minimum, since I got the very same parameters.

As we can see, the normal equation is another way to solve linear regression. It’s more direct, it finds the optimal solution and there are no parameters to tune. But why don’t we always use it? Why do we need gradient descent?

Well, since nothing is perfect, the normal equation needs much more processing than gradient descent. For few features (fewer than 10,000, according to Andrew Ng) it’s acceptable, but beyond that we should start thinking about gradient descent. Gradient descent can be thought of as an approximation of the ideal solution, since it may or may not converge to the global minimum.

See ya!

---

The last two posts were about linear regression. I explained a little of the theory and left an example to test the algorithm, which works but could be improved. How can we do that?

One of the problems with gradient descent is when two features are not on the same scale, for example house price ($200,000 – $1,000,000) vs number of bedrooms (1 – 5). When that happens, we descend quickly along small ranges and very slowly along large ranges, which leads to oscillation and slow convergence.

By scaling the features, the main idea is to put all features on the same (or a close) scale so that we converge faster.

Figure 1 – Feature scaled (left) vs unscaled feature (right)

*Image taken from internet*

So we can imagine that the learning rate could work well for one feature but be too high for another. It’s easy to imagine that if we throw a small marble in the image on the right it will oscillate a lot before stopping in the middle, while it would stop much quicker in the left image.

The idea of scaling a feature is to have the feature’s values between 0 and 1, so that all features’ values end up pretty close. But how can we scale a feature?

There are several techniques (honestly, I don’t know why one is better than another; if anyone knows, please leave a comment =D):

**i) Rescaling**

In this technique, we subtract the minimum from the input and divide by the range of values (max – min):

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

**ii) Mean normalization**

In this technique, we first subtract the mean of the input values and then divide by the range of values (max – min):

$$x' = \frac{x - \mu}{\max(x) - \min(x)}$$

**iii) Standardization**

In this technique, we first subtract the mean of the input values and then divide by the standard deviation:

$$x' = \frac{x - \mu}{\sigma}$$
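The three techniques can be sketched in Python (the blog’s code is in Octave); the price list is made up:

```python
def rescale(xs):
    """i) Rescaling: (x - min) / (max - min), mapping values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def mean_normalize(xs):
    """ii) Mean normalization: (x - mean) / (max - min); result is centered on 0."""
    mu = sum(xs) / len(xs)
    rng = max(xs) - min(xs)
    return [(x - mu) / rng for x in xs]

def standardize(xs):
    """iii) Standardization: (x - mean) / standard deviation."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    sd = var ** 0.5
    return [(x - mu) / sd for x in xs]

prices = [200_000, 450_000, 1_000_000]
print(rescale(prices))
```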

It’s worth remembering that you have to save the values used in the feature scaling, as you’ll need them after processing to convert the results back to real values.

Another issue with gradient descent is that it can oscillate and never converge, or even overflow. We use the learning rate parameter to tune the speed of convergence. If it’s too high, gradient descent will oscillate, overflow or never converge to the minimum. If it’s too small, it will take too many iterations to converge.

Personally, I start with 0.001, 0.01… to see what happens first, and then increase or decrease the learning rate in multiples of 3. I found that multiplying by 10 was very often too much.

As mentioned before, there is a way to check whether gradient descent is converging: we can always plot error vs iteration and see what is going on. The error should decrease over time.

We saw that gradient descent can work well, but in some conditions it can be unstable or too slow. The techniques discussed in this post can help with or solve these issues, allowing the algorithm to perform better.

See you next time.

---

In the last post we talked about simple linear regression, where we calculated the “trend line” of a set of points for a single variable. But what if we have more than a single variable? How can we solve it?

Imagine that we have a database of student grades with the following data: grade, country, school name, grade level, student sex and student age. With multiple linear regression we could try to predict the grade of a student knowing their country, school name, grade level, sex and age. Of course we don’t need to use all of this information; actually, one of the main qualities of a data scientist is finding the best data to use. Most of the time we have irrelevant data, or data that must be worked on to become useful. An obvious example is the student’s name, which is not (or shouldn’t be) correlated with the grade.

In simple linear regression we had $y = ax + b$. In the last post’s example, y was the car price and x was the year of manufacture.

Now, for our student grades we would have something like

$$y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 + a_5 x_5$$

where:

$y$ = student grade

$x_1$ = country

$x_2$ = school name

$x_3$ = grade level

$x_4$ = student sex

$x_5$ = student age

Do you remember the error function that we had to minimize?

$$E = \frac{1}{2m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2$$

We can still use it with gradient descent, as in simple linear regression. But instead of using

$$h(x) = a + bx$$

we’ll use

$$h(x) = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 + a_5 x_5.$$

Or even better, we’ll use

$$h(x) = \sum_{n=0}^{5} a_n x_n.$$

where $x_0 = 1$, so both forms are equivalent. This will simplify the partial derivatives and the calculation in Octave later.

It’s the same approach, with the only difference that we’ll have more partial derivatives, as shown below:

$$\frac{\partial E}{\partial a_n} = \frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)x_n^{(i)}$$

In the partial derivative $\frac{\partial E}{\partial a_n}$ we use n, which can be 0, 1, 2, 3, 4 or 5, to avoid repetition. Don’t forget that $x_0$ is always 1.

Updating the parameters follows the same rule as for simple linear regression:

$$a_n := a_n - \alpha\frac{\partial E}{\partial a_n}$$

where $\alpha$ is the learning rate and n can be 0, 1, 2, 3, 4 or 5.
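The update rule above can be sketched in Python (the blog’s code is in Octave); the grade/age/sex samples below are made up for illustration:

```python
def gradient_descent_step(theta, X, y, alpha):
    """One update of all parameters: a_n -= alpha * dE/da_n,
    with dE/da_n = (1/m) * sum((h(x) - y) * x_n) and x_0 = 1."""
    m = len(y)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        h = sum(t * x for t, x in zip(theta, xi))
        for j in range(len(theta)):
            grad[j] += (h - yi) * xi[j] / m
    return [theta[j] - alpha * grad[j] for j in range(len(theta))]

# Made-up "grade vs age and sex" samples: rows are [1 (bias), age, sex]
X = [[1.0, 15.0, 0.0], [1.0, 16.0, 1.0], [1.0, 17.0, 0.0], [1.0, 18.0, 1.0]]
y = [6.0, 7.0, 8.0, 9.0]
theta = [0.0, 0.0, 0.0]
for _ in range(5000):
    theta = gradient_descent_step(theta, X, y, alpha=0.002)
```

Note the small learning rate: the unscaled age feature (values around 15–18) forces it, which is exactly the motivation for the feature scaling discussed in the previous post.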

It’s hard to visualize what happens since we have more than 3 dimensions, so I ran tests with two features (age and sex); the results are in the images below.

First we can see how grade depends on age and sex.

Figure 1 – grade vs age and sex

Of course, in this example sex is not a continuous value, the only two possible values being 0 and 1, but the idea for continuous variables is still the same.

We can see that our algorithm converges over the iterations, which is a good sign. By the way, for this example the learning rate was 0.01 instead of the 0.001 of the last post.

As we are doing a linear regression, we expect to find a plane passing through the sample points right in the “middle”, in other words, right where the mean squared error (MSE) is minimized. Never forget that we could find a local minimum instead of the global one.

Figure 3 – Linear regression result (age)

Figure 4 – Linear regression result (sex)

In Figure 3 we can see how the plane is dividing the age samples, and in Figure 4, how it is dividing the sex samples. So, in multiple linear regression, we try to find the equation which minimizes the MSE according to all selected features.

I did everything in Octave, and if you are interested in the source code, check my GitHub page at https://github.com/marcelojo/multi_linear_regression.git.

Bye

---

After almost two years of personal hard work, I’m back to share a little about linear regression and gradient descent.

What is linear regression?

From Wikipedia:

In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. […] Linear regression can be used to fit a predictive model to an observed data set of y and X values. After developing such a model, if an additional value of X is then given without its accompanying value of y, the fitted model can be used to make a prediction of the value of y.

**Simple linear regression** is when *y* is a function of a single variable *x*, and **multiple linear regression** is when *y* is a function of multiple variables, like x1, x2, x3, etc. In this post we’ll cover simple linear regression.

In other words, and going straight to the point, linear regression is the linear trend line that we know from Excel, as in the image below.

Figure 1 – Linear regression

As we can see, we have a set of points where y changes according to a single variable x. Take as an example the price of a car versus its year of manufacture.

We can find a line equation which represents all the sampled points with the least possible error. With this equation it will be possible to predict y for new x values. The general equation of a straight line is $y = ax + b$. **So, our problem is to find the best constants a and b.**

Now that we know what linear regression is and what we can do with it, how can we find the best fit with the least error?

First, how can we calculate the error? There are several ways. You could just subtract the real y from the predicted y. Or you could take the absolute values after subtraction. Or you could subtract, square the result and take the average. This last one is the mean squared error, which is normally used for data fitting, which is our case.

The error equation is

$$E = \frac{1}{2m}\sum_{i=1}^{m}\left(h(x_i) - y_i\right)^2 \tag{1}$$

h in our case is the line equation: $h(x) = ax + b$. So, basically, this formula subtracts y from the predicted value, squares it to have only positive numbers, and takes the average. The $\frac{1}{2}$ here will be useful later for gradient descent, but is not mandatory.

Now, we must find a way to minimize this error. As we know, we’ll predict y with the line equation $h(x) = ax + b$. This is where gradient descent kicks in. From Wikipedia:

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function.

From calculus class, we know that with the first derivative we can find the local minimum of a curve. If the curve is convex, the local minimum is actually the global one. The idea is very simple but still very powerful, and it’s used very often in machine learning.

Since $h(x) = ax + b$, we must find the partial derivatives of the error (equation 1) with respect to **a** and **b**:

$$\frac{\partial E}{\partial a} = \frac{1}{m}\sum_{i=1}^{m}\left(h(x_i) - y_i\right)x_i \tag{2}$$

$$\frac{\partial E}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(h(x_i) - y_i\right) \tag{3}$$

Equations 2 and 3 give us the direction towards the minimum. The size of each step is determined by a parameter $\alpha$ called the **learning rate**.

So, the new constants **a** and **b** are given by equations 4 and 5:

$$a := a - \alpha\frac{\partial E}{\partial a} \tag{4}$$

$$b := b - \alpha\frac{\partial E}{\partial b} \tag{5}$$

Every time we calculate this, we take a step towards the minimum.

- Define the function h. In our case of simple linear regression, we use $h(x) = ax + b$.
- Define the error equation. In our case we use the mean squared error.
- Calculate $\frac{\partial E}{\partial a}$ and $\frac{\partial E}{\partial b}$ (equations 2 and 3).
- Define the learning rate $\alpha$.
- Calculate the new parameters **a** and **b** (equations 4 and 5). In the next iteration, use these new parameters and recalculate.
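The steps above can be sketched as follows in Python (the blog’s code is in Octave); the sample points are made up, scattered around y = 3x + 2:

```python
def fit_line(xs, ys, alpha=0.05, iterations=1000):
    """Gradient descent for h(x) = a*x + b, minimizing the mean squared error."""
    a, b = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Partial derivatives of E = 1/(2m) * sum((a*x + b - y)^2)
        grad_a = sum((a * x + b - y) * x for x, y in zip(xs, ys)) / m
        grad_b = sum((a * x + b - y) for x, y in zip(xs, ys)) / m
        # Step towards the minimum, scaled by the learning rate
        a -= alpha * grad_a
        b -= alpha * grad_b
    return a, b

# Made-up noisy points around y = 3x + 2
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.9, 8.2, 10.9, 14.1]
a, b = fit_line(xs, ys)
```

With these points, the fit lands very close to a = 3 and b = 2, which is exactly what we’d expect from data generated around y = 3x + 2.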

In the images below, we can see the final result of our algorithm: a line passing close to the middle of all the points, hence minimizing the mean squared error. The parameters found after 1000 iterations were:

Figure 2 – Linear regression result

A good way to debug the code is to check the error at every iteration. Here is the plot of error vs iteration, showing the error decreasing, which is a good sign. After about 100 iterations the error almost stopped decreasing.

We can also see at each step how we improve our final curve. We started with a = b = 0, which is the x-axis. With the updates of *a* and *b*, the line converges to the final result.

Figure 4 – Linear regression steps

What happens if we set **a** and **b** to different start values?

Figure 5 – Not so optimal result

Although the error decreases and the algorithm converges, we can see that we actually found a local minimum, since this line seems to split the set of points in half, but not in an optimal manner. We can even try 10000 iterations and we’ll get the same result. So, the start values of **a** and **b** matter.

Here are some tips:

- Try different learning rates $\alpha$. It can make the algorithm converge faster. If your error increases after some steps, try decreasing $\alpha$.
- Try different start values for **a** and **b**, since you can end up in a local minimum.
- Test this code with a different error function, play with different data and start points for *a* and *b*, test different values of alpha, understand the code and, overall, have fun!

As we can see, linear regression is quite simple but powerful enough to solve simple problems. The most important thing is the concept, which will be reused in other algorithms that we’ll discuss later on the blog.

You can download this algorithm done in Octave from my github repository at https://github.com/marcelojo/simple_linear_regression.git

---