By now, you should be able to name some hyperparameters that you need to tune or tweak to increase the test accuracy of your model. Two of these, the learning rate α (alpha) and the regularization rate λ (lambda), were introduced during the last lesson. With a k-NN classifier, we would be tweaking k (the number of closest neighbors with a vote). Unless your algorithm is extremely simple, such as Linear Regression solved with the Normal Equation, you need to fine-tune some set of hyperparameters to maximize the performance of your model. This is true whether you are building a regressor or a classifier.
Terminology: The values we want to tweak, such as the learning rate, are called hyperparameters, not parameters. This distinguishes them from the parameters of a parametric learning algorithm, which are learned from the data. In the linear models we have been using, the weights (coefficients) are parameters.
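To make the distinction concrete, here is a minimal sketch. The model choice (scikit-learn's `Ridge`) and the synthetic dataset are just assumptions for illustration; any of the lesson's linear models would behave the same way. The hyperparameter is set by us before training; the parameters are learned from the data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data, just for illustration.
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

# Hyperparameter: the regularization rate (the lesson's λ; scikit-learn
# names the argument alpha), chosen by us *before* training.
model = Ridge(alpha=0.1)
model.fit(X, y)

# Parameters: the weights (coefficients) *learned* from the data.
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```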
We have already been splitting our data into training and testing sets. This testing set, the "holdout data", has allowed us to test how the model performs on data it has never seen. There is still a problem. If we kept the same test set and tweaked the hyperparameters (e.g. the learning rate) until the accuracy reached its maximum, we would be maximizing the training/testing performance on this specific test set. If we were to split the dataset again (e.g. using `train_test_split(X, y)`), we would get a new random split, and it is highly unlikely that the same hyperparameter values would perform as well as before.
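You can see the problem in action with a quick sketch (the iris dataset and a k-NN classifier are assumptions here, chosen only for illustration): the exact same hyperparameter scores differently depending on how the data happens to be split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The same hyperparameter (k=3), three different random splits.
for seed in (0, 1, 2):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(f"seed={seed}  test accuracy={knn.score(X_test, y_test):.3f}")
```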
Stop reading for a bit. How would you try to fix this? How would you make sure that the chosen hyperparameters would perform well?
A potential remedy is to create a second holdout set called the validation set. A typical split would be, for example, 60 % for training, 20 % for validation, and 20 % for testing.
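One way to produce such a three-way split is to call `train_test_split` twice; the percentages below follow the 60/20/20 example above and are just one reasonable choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out the test set (20 % of the data) ...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ... then split the rest into training (60 % of total) and validation
# (20 % of total). 0.25 of the remaining 80 % equals 20 % of the whole dataset.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90, 30, 30
```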
This solution can be great if you have a surplus of data and a linear problem that is fairly easy to solve. If not, your model's performance will likely suffer from bias: important training examples might be missing from the training set. You can try your best to split the dataset in a fair way, but with high-dimensional data or limited samples, the task will be challenging.
The scikit-learn documentation says the same:
"However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets."
A second option is to skip the validation split completely. Instead, you reuse different sections of your training data for validation in turn. The typical approach is k-fold cross-validation:
There are several variants of k-fold CV, such as LOO (Leave-One-Out) and StratifiedKFold, many of which are implemented in scikit-learn. Read the documentation for more info on those. There are also alternative validation methods such as bootstrapping. This lesson will focus on the typical, most common implementation of k-fold CV, sketched below.
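As a sketch of the mechanics (scikit-learn's `KFold` on the iris data, both assumptions for illustration): each iteration hands you one fold as the test set and the remaining k-1 folds as the training set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# k=5 folds; shuffle so folds aren't biased by the order of the rows.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    print(f"Training #{fold}: {len(train_idx)} train samples, {len(test_idx)} test samples")
```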
K-fold cross-validation works no matter the size of your dataset. Every sample takes part in testing exactly once and in training k-1 times. This way, you don't need to be as careful about the distribution of the various populations inside your data as with the train/test/validation approach.
Below is a visualization of k-fold CV, where k=5. The whole dataset is split into five folds. If you were to concatenate the folds 1…5 at any point, the end result would be the complete `X`. The chosen model will be trained k times, and during each of these 5 training runs, we will record the chosen metrics (e.g. MSE).
|             | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
| ----------- | ------ | ------ | ------ | ------ | ------ |
| Training #1 | TEST   | train  | train  | train  | train  |
| Training #2 | train  | TEST   | train  | train  | train  |
| Training #3 | train  | train  | TEST   | train  | train  |
| Training #4 | train  | train  | train  | TEST   | train  |
| Training #5 | train  | train  | train  | train  | TEST   |
Notice that all k training runs use the same hyperparameters. To find out how the hyperparameters affect the end result, you need to run the cross-validation multiple times. For example, you might want to run the k-fold CV using learning rates `alphas = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]`. In this case, you would have to run the CV `len(alphas)` times, so 5 times. Since `k=5`, the model will be trained `len(alphas) * k` times; in this case, we will be fitting the model to the data 25 times.
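Here is a minimal sketch of that 25-fit loop, under a few assumptions: a synthetic regression dataset, `SGDRegressor` as the model (its `eta0` argument is the constant learning rate, playing the role of alpha here), and `cross_val_score` handling the k=5 folds for us.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

alphas = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
for alpha in alphas:  # 5 CV runs ...
    model = make_pipeline(
        StandardScaler(),  # scale features so SGD stays stable
        SGDRegressor(learning_rate="constant", eta0=alpha, random_state=42),
    )
    # ... each run fits the model k=5 times -> 25 fits in total.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"alpha={alpha:.0e}  mean MSE={-scores.mean():.1f}")
```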
When you are searching for the hyperparameters that will minimize the test error, you need to use either a separate validation set or cross-validation. Otherwise, you end up selecting hyperparameters that happen to fit the selected training set well, but if you were to re-split `X`, the results would not hold.
Cross-validation is the more commonly used option when the dataset is small-ish. What counts as "small" depends on your dataset: how many features it has and how difficult it is to split the data into subsets so that all groups or populations are represented fairly. With big data and deep learning models, the cost of training k times usually isn't worth it.
In the next lesson, we will learn about hyperparameter optimizers such as GridSearchCV and ParameterGrid.