Cross Validation

2016-01-30T00:00:00+00:00

Cross Validation

30 Jan 2016

How should you use cross validation to analyze a dataset?

Analysis without Cross Validation

Consider that you are analysis a regression dataset. So you use the Root Mean Square error (RMSE). We split the dataset into a training set (70%) and test set (30%).

An analysis for a Machine Learning algorithm includes these experiments:

Model Complexity Experiments
Learning Curve Experiments

What is the purpose of these?

In the Model Complexity Experiments you change the model by varying the hyperparameters of the model. You record RMSE errors in each case. (The k in kNN is a hyperparameter, the number of hidden nodes/layers, momentum and learning rates in a Neural Network are hyperparameters.)

The hyperparameters that gives you the lowest error on the '''test set''' are your optimal hyperparameters. This gives you the "best" model. By best, we mean it is the model that best generalizes: it is the model with the least bias and least variance.

The Learning Curve Experiments help you to determine how much data you need to learn this "best" model. It gives you a lower bound on the samples needed to learn this model. Isn't this useful? If you are trying to buy a house in Atlanta, you want to know what is the minimum number of houses you want to explore before you understand the housing market here and put your bid.

The Learning Curve experiments also gives you a second check to prove that the model you selected actually does not have high bias or high variance. To understand this, watch this video.

Cross Validation

Ok, so now where does cross validation play into this? Let's watch this video.

Cross validation is simply a new way to find the hyperparameters for the "best" model. But why use cross validation? To remove bias due to your data.

Lets say that you have a dataset of all houses in Buckhead, Marietta, Decatur and Downtown (suburbs) in Atlanta. And you are trying to find a model that can estimate the cost of any house in Atlanta. You don't want to train your data on the houses of Buckhead and Marietta alone. So perhaps you randomly reshuffle your data and partition them into folds. If you have 10 folds, you train on 9 folds and find the error for the 10th fold. If you repeat this procedure for all the 10 folds, you can find an average error. This error is a better estimate of the training error (also called cross validation error now because, duh, you used cross validation). Run the model on your test set to get the test error.

Conclusion

In conclusion, cross validation is just a fancy new way to do training.

How to split the data into 70/30?

If you consider the housing example again, you don't want your test set (30%) to be biased (For example, most of the samples are houses in Downtown). So somehow you need to make sure that the training data and testing data equally represent houses in all the four suburbs. How do you that? Random sampling, or more accurately Stratified Sampling.

You can do some analysis to determine if you did the 70/30 split properly, but thats beyond the course for now. But a simple way to do it is to draw histograms of your training and testing data. The histograms will show how many houses from each suburb are in these data sets. If the histograms look similar, you did the split properly. Comparing histograms is not beyond this course, do that.

Analysis of Supervised Learning Algorithms

2016-01-15T00:00:00+00:00

Analysis of Supervised Learning Algorithms

15 Jan 2016

Introduction

Imagine that you are using Decision Trees to fit some data. The hypothesis set here is the set of all decision trees, that is, when you find your best hypothesis it is going to be some decision tree. Here the decision tree is also called the model or representation. A machine learning algorithm, for example ID3, is an algorithm that is used to learn this model. We will first discuss bias and variance as the properties of the model. We will then understand how we can measure performance of these algorithms and accuracy of the model predicted.

Bias and Variance

If you run the ID3 algorithm several times you will find a new decision tree everytime. This error introduced in the prediction because of this behavior is called variance. If you run ID3 algorithm several times you will find a new decision tree everytime. Let us say you could average out the trees and find a new tree that is representative of these trees. It is still going to be slightly different than the best decision tree. This error is called bias. Further there are two types of biases: preference bias are the set of hypothesis that the algorithm prefers to predict and restriction bias are the set of hypothesis that the algorithm is restricted to predict.

Causes of Bias/Variance

The variance is high when these decision trees are more dissimilar to each other. This is caused when there is too much noise in the data. On the other hand a model has high bias when you might have sampled the data from a particular population. Since our prediction algorithm is finding a hypothesis set which is a decision tree, we are biasing our algorithm to only find decision trees.

A popular technique to reduce error due to variance is Bagging and Resampling. In bagging (Bootstrap Aggregating), we recreate our training set by random selection with replacement. Each training set is then used to predict a model. Then we can average the model or use a validation technique to pick the best model.

Measuring Performance

Like any algorithm the performance of a machine learning algorithm can be measured using asymptotic analysis - time and space complexity. For example, most of the supervised learning algorithms like Decision Trees and Neural Networks learn a model. The space required for these algorithms is equivalent to the space required for these models. However in instance based learning methods like KNN, the space required is equivalent to the space required to store the entire sample space.

Measuring Accuracy

To measure accuracy of a hypothesis we need to compare it to the true hypothesis. However we never know the true hypothesis, so we use other techniques to measure model prediction error. For this article, the model prediction error will be dependent on our data: we are trying to find a hypothesis that best fits the data. For a regression problem we use least square error, for a classification problem we use Type I/II errors.

We can find the best hypothesis using model complexity experiments and learning curves.

Model Complexity

Model Complexity curves have an error metric on its Y axis and the complexity on the X axis. The graph has atleast two curves: one for the training set and one for the testing set. To partially remove variance due to noise, cross validation is used instead of a testing set.

The purpose of this graph is to find the best complexity our model should have. Here, the term "best" is loosely used. In Machine Learning best means it can model future unseen data. If it is not a good forecaster it is not the best.

With this graph you can determine the point at which overfitting is the least. This is the point where the training and validation curves diverge. If there are multiple points, we choose the one where the complexity is the least because of Ockham's Razor.

You can change the model complexity of Supervised Learning models in many ways. The complexity of a model comes from its preference and restriction bias. Every model has a preference bias and when you change complexity of your model, you are basically changing the preference bias. For the best results you should change complexity until you touch the hypothesis space that is the entire restriction bias.

Decision Trees - Changing height of trees, different pruning algorithms
Neural Networks - Changing the number of hidden layers/nodes, changing momentum and optimization algorithm
k-NN - Changing the value of k, choosing different subset of the feature space
SVM - Changing the kernel
Boosting - Changing the number of base learners, changing the base learner

So, that's model complexity.

Learning Curves

Learning curves, again, have an error metric on its Y axis and the size of the dataset on its X axis. Like the Model Complexity curves the graph has atleast two curves: one for the training set and the other for the testing or cross-validation set.

Assume that you run a Model Complexity experiment on a dataset for a given model. We find a point where the training and cross-validation curves diverge. Lets say that the gap between training and cross-validation curve at the point of overfitting is relatively large. It can be due to two reasons: our model is not expressive enough or we don't have enough data. To help matters here we use the learning curve.

In a learning curve experiment, we fix the model complexity and vary the size of the training set used to train the data. If our data does not have noise and the model can perfectly fit the data, the training and cross validation curves will meet when the training size on the X axis is all the data. If they don't meet, than it means we don't have enough data or our model needs to be more complex. We change the complexity of our model and rerun the experiment while noting down the gap between the two curves.

In most cases, if the complexity required for the best fit in a model complexity graph and a learning curve graph should be the same. If they are not the same, we need to gather more data.

The ML Dude

Cross Validation

Cross Validation

Analysis without Cross Validation

Cross Validation

Conclusion

Analysis of Supervised Learning Algorithms

Analysis of Supervised Learning Algorithms

Introduction

Bias and Variance

Causes of Bias/Variance

Measuring Performance

Measuring Accuracy

Model Complexity

Learning Curves

References