This lesson will be short and simple, but make sure you understand the concept. In fact, we have already performed a training and testing split once. Remember these lines from the Linear Regression notebook, in which we trained a k-NN classifier on the MNIST digits dataset?
# Choose [100-1796] for training, [0-99] for testing.
X_train = data[100:]
X_test = data[:100]
# Do the same for the labels.
y_train = digits.target[100:]
y_test = digits.target[:100]
X_train will contain a subset of data, which is our feature matrix (by convention, the feature matrix is often called X). The subset stored in X_train consists of the elements with indices 100, 101, 102, …, 1795, 1796. X_test will contain another subset of data: the elements with indices 0, 1, 2, …, 98, 99. Similarly, y_train and y_test contain the corresponding targets.
When we train the model, we use X_train and y_train. When we test whether the model can predict correctly or not, we use X_test and y_test.
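If you want to convince yourself of the split sizes, a quick check of the array shapes (the digits data has 1797 samples of 64 features, so the numbers below follow from that) could look like this:
# Quick sanity check of the split sizes.
print(X_train.shape, y_train.shape)  # (1697, 64) (1697,)
print(X_test.shape, y_test.shape)    # (100, 64) (100,)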
Often, the split is chosen as a certain percentage of the dataset. The Pareto principle states that "for many events, roughly 80% of the effects come from 20% of the causes" (Wikipedia). Thus, a 20/80 test/train split is a good starting point. Andrew Ng suggests 30/70 as a good default starting split in his famous Machine Learning course (Coursera).
In our k-NN example, our test set was only 100 samples, which is roughly a 5/95 split. Feel free to visit the classification notebook and test whether changing the ratio has a large effect on the model accuracy or not.
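As a rough sketch of such an experiment (assuming the scikit-learn digits dataset and a KNeighborsClassifier with default parameters, as in the notebook), you could move the slice point and compare accuracies:
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
data = digits.images.reshape((len(digits.images), -1))

# Compare a few test-set sizes by moving the slice point.
for n_test in (100, 360, 540):  # roughly 5%, 20% and 30% of 1797 samples
    X_train, X_test = data[n_test:], data[:n_test]
    y_train, y_test = digits.target[n_test:], digits.target[:n_test]
    knn = KNeighborsClassifier()
    knn.fit(X_train, y_train)
    print(f"{n_test} test samples: accuracy {knn.score(X_test, y_test):.3f}")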
Often, the feature matrix is shuffled before splitting. This is to ensure that the data isn't in some particular order which might affect the accuracy. Imagine if the dataset were ordered by target label: a non-shuffled 30% split would only include the lowest digits (0, 1 and 2) in its test set!
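To see the problem concretely, here is a small sketch (assuming the scikit-learn digits dataset) that sorts the data by label and then takes the first 30% as a test set without shuffling:
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Simulate a dataset that happens to be ordered by target label.
order = np.argsort(y)
X_sorted, y_sorted = X[order], y[order]

# Take the first 30% as a test set without shuffling.
split_point = int(len(X_sorted) * 0.3)
print(set(y_sorted[:split_point]))  # only the lowest digits appear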
We could perform this shuffling with the random library:
import random
# X (training features) and y (targets)
X = mnist.images.reshape((1797, -1))
y = mnist.target
# Shuffle a list of indices and reorder X and y with it,
# so that both are shuffled in exactly the same way.
random.seed(42)
indices = list(range(len(X)))
random.shuffle(indices)
X = X[indices]
y = y[indices]
# 539 if len(X) is 1797
split_point = int(len(X) * 0.3)
# 30% of both X and y used for testing
X_test = X[:split_point]
y_test = y[:split_point]
# 70% of both X and y used for training
X_train = X[split_point:]
y_train = y[split_point:]
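A quick check of the resulting shapes (1797 samples of 64 features, so the split point is 539):
print(X_test.shape, y_test.shape)    # (539, 64) (539,)
print(X_train.shape, y_train.shape)  # (1258, 64) (1258,)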
Luckily, we can utilize scikit-learn's ready-made model selection tools. The code above can be shortened to:
from sklearn.model_selection import train_test_split
# X (training features) and y (targets)
X = mnist.images.reshape((1797, -1))
y = mnist.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
If we set random_state to some integer, the result of the shuffling will be the same each time, like a loaded die that always lands the same way. Often you would skip setting the random state manually, leaving it at its default (None).
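A small sketch to see this in practice (the seed value 0 below is arbitrary; any integer works):
# Two splits with the same random_state produce identical results.
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(X, y, test_size=0.3, random_state=0)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(X, y, test_size=0.3, random_state=0)
print((X_te_a == X_te_b).all())  # True
# With the default random_state=None the split changes between runs.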
In the MNIST classification notebook, we performed a sanity check to make sure that each target label [0-9] is present in y_test. The code performing this was:
for digit in set(y_test):
    count = list(y_test).count(digit)
    print(f"Count of {digit}'s in the validation set: {count}")
In order to train and evaluate the model, it should be obvious that all classes must be present in both the training and testing splits. If we shuffle the feature matrix X, we might be unlucky and get zero instances of some class in our test set.
You can, of course, perform a sanity check to be sure, but there are other ways. We can use stratified sampling, which splits the dataset into subpopulations (strata). The benefit is that the split will preserve the relative class ratios in both the training and testing groups.
# Here we would stratify based on target (y)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    stratify=y)
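To see what stratification buys us, we can compare the class proportions of the full target vector and the stratified test set (a sketch assuming NumPy and the y defined above):
import numpy as np

# Proportion of each digit in the full dataset vs. the stratified test set.
print(np.bincount(y) / len(y))
print(np.bincount(y_test) / len(y_test))
# The two rows should be (nearly) identical thanks to stratify=y.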
Note that we might want to stratify based on some other column too. Imagine the scenario below:
# Imagine X contains data where one of the features is "sex"
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    stratify=X["sex"])
The larger your dataset is, the less likely it is that you will end up with a biased test set. If your dataset has a skewed class distribution, stratification alone will not solve all your problems. You will need other ways of fighting the class imbalance, such as getting more data, choosing the algorithm with care, or resampling. These are a bit more advanced topics, so don't worry if you don't fully understand the problem of class imbalance yet.