Now that we know how to use cross-validation, we are ready to use GridSearchCV from scikit-learn.
Remember: grid search with cross-validation is a great way to evaluate your model's hyperparameters when you don't have large volumes of data. So far, we haven't seen a training process in this course that takes more than a few seconds. When working with large quantities of data and deep learning models (such as models trained on the ImageNet database), training a model from scratch can take weeks on a home computer. In cases like this, you will most likely not want to train the model k times, but instead opt for a train/validation/test split.
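As a quick refresher, such a three-way split can be produced by calling `train_test_split` twice. The sketch below is only illustrative: the 60/20/20 proportions and the Iris toy dataset are arbitrary choices, not something fixed by this course:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off a 20 % test set...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ...then carve a validation set out of the remaining 80 %.
# 0.25 * 0.8 = 0.2, giving a 60/20/20 train/validation/test split.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)
```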
Also, note that the datasets we have been using have been fairly simple. For example, our classifier has been doing quite well with a straight, linear decision boundary. Soon, you will learn how to use curved boundaries (i.e., polynomial features).
Using GridSearchCV is simple. All we need is a model and a dictionary of parameters:
```python
from sklearn.linear_model import SGDClassifier

# Choose a classifier/regressor that fits your task
model = SGDClassifier()

# Define the grid of parameters you want to compare.
# Note: in scikit-learn >= 1.3 the logistic loss is spelled "log_loss".
param_grid = {
    "loss": ("hinge", "log"),
    "penalty": ("l1", "l2"),
    "alpha": [0.001, 0.0001, 0.00001],
}
```
In the example above, we would be comparing two loss functions (hinge and logistic) and two regularization functions (L1 and L2). The alpha is the regularization rate. In machine learning literature, alpha is often reserved for the learning rate and lambda for the regularization rate. Scikit-learn uses another convention, where eta (η) is the learning rate and alpha (α) is the regularization rate. The parameter `learning_rate` is set to `"optimal"` by default, in which case the learning rate is computed from the regularization rate and the number of updates. See the documentation for more information. This regularization rate is the third hyperparameter in our parameter grid.
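For reference, the scikit-learn documentation defines the `"optimal"` schedule as η(t) = 1 / (α(t₀ + t)), where t is the number of updates performed so far and t₀ is chosen by a heuristic proposed by Léon Bottou. A larger regularization rate α therefore also means a smaller effective learning rate.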
After this, you can simply instantiate and fit the GridSearchCV:
```python
from sklearn.model_selection import GridSearchCV

gridsearch = GridSearchCV(model, param_grid, scoring="f1_weighted")
gridsearch.fit(X, y)
```
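Here, `X` and `y` are assumed to hold the features and labels of the dataset you are working with. Note the amount of work hiding behind that `fit()` call: the grid above contains 2 × 2 × 3 = 12 parameter combinations, and each combination is cross-validated, so `fit()` trains 12 × 5 = 60 models in total (plus one final refit, as discussed at the end of this section).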
By default, GridSearchCV uses `k=5` for cross-validation, so `X` will be split into five folds. In the example above, we have also specified a scoring metric (`"f1_weighted"`). After calling the `fit()` function, you will have access to the `cv_results_`, `best_estimator_` and `best_params_` attributes. The `cv_results_` attribute is in dictionary format, which makes it easy to examine its contents using our old-time favorite: Pandas.
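Before turning to `cv_results_`, here is a minimal sketch of the other two attributes in action (the printed values are illustrative, not from a real run):

```python
# The parameter combination that achieved the best mean test score
print(gridsearch.best_params_)
# e.g. {'alpha': 0.001, 'loss': 'hinge', 'penalty': 'l1'}

# best_estimator_ is a fitted SGDClassifier using those parameters
predictions = gridsearch.best_estimator_.predict(X)
```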
```python
import pandas as pd

# Choose to keep only some of the cv_results_.keys()
cols_to_show = ["param_alpha", "param_loss", "param_penalty",
                "std_test_score", "mean_test_score", "rank_test_score"]

# Display the dictionary as a DataFrame, sorted so the best rank comes first
pd.DataFrame(gridsearch.cv_results_).sort_values("rank_test_score")[cols_to_show]
```
An example output can be seen below. The column names have been shortened for viewing purposes.
|      | alpha  | loss  | penalty | std      | mean     | rank |
| ---: | -----: | ----: | ------: | -------: | -------: | ---- |
| 0    | 0.001  | hinge | l1      | 0.039126 | 0.959743 | 1    |
| 6    | 0.0001 | log   | l1      | 0.034139 | 0.953115 | 2    |
| 3    | 0.001  | log   | l2      | 0.034724 | 0.945920 | 3    |
| …    | …      | …     | …       | …        | …        | …    |
| 11   | 1e-05  | log   | l2      | 0.068470 | 0.820289 | 12   |
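Finally, note that GridSearchCV has `refit=True` by default, meaning that after the search the best parameter combination is refit on the whole of `X`. You can therefore call `gridsearch.predict()` directly, without extracting `best_estimator_` yourself.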