By now, you have learned many data preprocessing and machine learning techniques, but we haven't yet discussed the actual workflow: how do you make your code easy to read and maintain? One helpful solution is Scikit Learn's pipeline and compose modules.
Some of the useful classes in each are:
In the future, the contents of the pipeline module will likely be moved into the compose module and refurbished.
Note: FeatureUnion is very similar to ColumnTransformer, except that it doesn't perform any column selection by default. Check the Scikit Learn documentation for its basic usage and compare it to ColumnTransformer during this lesson.
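To see the difference in practice, here is a minimal sketch with a toy numeric matrix (the data and the transformer choices are illustrative, not part of this lesson's material):

import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(10, 4)  # toy numeric feature matrix

# Each transformer receives the full input; the outputs are
# concatenated column-wise. FeatureUnion selects no columns itself.
union = FeatureUnion([
    ("scaled", StandardScaler()),
    ("pca", PCA(n_components=2)),
])
X_combined = union.fit_transform(X)  # shape: (10, 4 + 2)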
ColumnTransformer, FeatureUnion and Pipeline all use transformers. Every transformer is required, as the Pipeline documentation states, to "…implement fit and transform methods."
We have already used some transformers, such as:
If we need to perform custom data preprocessing, such as turning the names in the Titanic dataset into titles, we can create our own transformers. Below is a basic example:
from sklearn.base import BaseEstimator, TransformerMixin

class YourOwnTitanicParser(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        # Nothing to learn from the data, so fit only returns self.
        return self

    def _split_title(self, cell):
        # "Braund, Mr. Owen Harris" -> "Mr."
        cell = cell.split(",")[1]
        cell = cell.split()[0]
        return cell

    def transform(self, X, y=None):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X["title"] = X["name"].apply(self._split_title)
        X = X.drop(columns="name")
        return X.values
Normally, we would use this class in a Pipeline instead of calling it manually, but we can test it anyway:
nameparser = YourOwnTitanicParser()
titanic_parsed = nameparser.transform(titanic)
The transform returns a NumPy array of the values in the DataFrame: the title has been added, and the name has been removed.
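Here is a minimal sketch of how the parser might be used as a Pipeline step; the SimpleImputer is just a hypothetical placeholder for a downstream step, not part of this lesson:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# The parser runs first; the imputer is a placeholder second step
# that fills missing values in the resulting object array.
pipe = Pipeline([
    ("parse_titles", YourOwnTitanicParser()),
    ("impute", SimpleImputer(strategy="most_frequent")),
])
titanic_processed = pipe.fit_transform(titanic)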
Now, a good question is whether this approach is necessary. What do we gain compared to the old method of calling one function after another? It depends. In some cases, you might want to use a transformer parameter as a hyperparameter when performing a grid search. Code readability might increase by using transformer classes. Also, some custom transformers might even be useful across multiple projects, if the data in those projects shares a similar format (e.g. email addresses, postal codes, lat-long coordinates or other fields that you need to parse).
Often, your dataset will consist of mixed feature and data types that require separate operations. For example, you might want to add polynomial degrees to some features but not all. Also, you might have a mixture of categorical features (e.g. strings like ["cat", "dog"]) and numerical features (e.g. height or weight).
As illustrated above, the ColumnTransformer will run multiple Pipelines with different transformers, and the outputs of each are concatenated together. Note that the dimensionality of the dataset can (and often will) change: in this example, the OneHotEncoder adds dummy variables, increasing the number of features.
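As a concrete sketch (with hypothetical column names, not this lesson's dataset), the numeric columns could go through polynomial expansion and scaling while a categorical column is one-hot encoded:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

df = pd.DataFrame({
    "height": [150, 160, 170],
    "weight": [50, 60, 70],
    "animal": ["cat", "dog", "cat"],
})

# Numeric branch: polynomial features, then scaling.
numeric_pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("scale", StandardScaler()),
])

# Each branch transforms only its own columns; the outputs are
# concatenated into one feature matrix.
preprocess = ColumnTransformer([
    ("numeric", numeric_pipe, ["height", "weight"]),
    ("categorical", OneHotEncoder(), ["animal"]),
])
X_prepared = preprocess.fit_transform(df)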
If the transformers you are using have input parameters, you can add those to GridSearchCV's parameter grid. You will learn more about this in the Notebook and in the Scikit Learn documentation.
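A minimal sketch of the idea (the toy data and parameter choices are illustrative): pipeline step parameters are addressed in the grid with the "step name__parameter" syntax.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.rand(40, 3)   # toy features
y = np.tile([0, 1], 20)     # toy binary labels

pipe = Pipeline([
    ("poly", PolynomialFeatures()),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Step parameters use the "<step name>__<parameter>" naming.
param_grid = {
    "poly__degree": [1, 2, 3],     # transformer parameter
    "model__C": [0.1, 1.0, 10.0],  # estimator parameter
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)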
Pipelines can either add or remove complexity, depending on the kinds of preprocessing steps required.
Advanced use of Pipelines is beyond the scope of this course, but understanding the basics is a necessity: otherwise you might find it impossible to follow some parts of the Scikit Learn documentation or other online sources.