One-Hot Encoding

Now you should be able to load datasets as Pandas DataFrames, perform exploratory data analysis and split the dataset into training and testing split. You have worked with a dataset consisting of more than 100,000 storage devices. You have also performed some basic feature selection (skipping some of the S.M.A.R.T attributes). Feature engineering isn't unfamiliar either: remember creating the "Load Hours per Power-On" feature using two existing features? You have also trained a model (k-NN) based on some continuous data (pixel values).

In the HDD dataset, the S.M.A.R.T. stats are continuous. For example, the smart_9_raw is a continuous value indicating the power-on hours.

Now all data is continuous, though. In the HDD dataset, the "model" is categorical and consist of string values (e.g. "ST4000DM000"). Another categorical field is the "failure", which by default is an integer containing either 0 or 1. In the MNIST dataset, the target is categorical: 0, 1, 2, … 9.

It is time to learn how to deal with categorical data.

Read with care: Usually, you cannot feed strings ("ST4000DM000") into machine learning algorithms. They need to be encoded into numbers by category (0 = "ST4000DM000", 1 = "Some_Other", …). This process is called label encoding.

Label Encoding in practice

# Simplified case. Here we would have a list of labels
labels = ["Cat", "Dog", "Hamster"]

# The feature would contain in order: a cat, a hamster, a dog, ..., a cat.
encoded_feature = [1, 3, 2, ... 1]

So how would one perform this operation? There are various ways, such as calling apply() on DataFrame and writing a function that will map values using a dictionary. Easier, already-existing approach would be using Sci-Kit learn.

from sklearn.preprocessing import LabelEncoder

# These are our labels that need encoding.
y = ["cat", "hamster", "dog", "dog", "cat", "hamster"]

# Instantiate the object
le = LabelEncoder()

# Fit the 'targets' list to LabelEncoder.
le.fit(y)

# Print the variable containing label names
print(le.classes_)

"""
    OUTPUT would be:
    array(['cat', 'dog', 'hamster'])
"""

The code above fitted list y to the le, which is an instance constructed by LabelEncoder. The le object will store the unique values in classes_ variable. The target values for these labels/classes are integers between 0 and n_classes - 1.

# Calling transform() encodes the labels
y = le.transform(y)
print(y)

"""
    OUTPUT:
    [0 2 1 1 0 2]
"""

The transform method converts the labels to their corresponding labels. To get back to their original values, we need to perform inverse transform:

y = le.inverse_transform(y)
print(y)

"""
    OUTPUT:
    ['cat' 'hamster' 'dog' 'dog' 'cat' 'hamster']
"""

This allows us to convert string into integers, which can them be used for training a machine learning model. Note that the original labels are not required to be strings. They can be a mixture of various data types:

from sklearn.preprocessing import LabelEncoder

y = [10, 20.5, 12040, "string"]

LabelEncoder().fit_transform(y)

"""
    Output in Jupyter Notebook
    array([0, 2, 1, 3], dtype=int64)
"""

One-hot encoding in practice

Unless the categories are one-hot encoded, the machine learning algorithm will assume that the numbers are in order (n-1 < n < n+1). Regression algorithms would try to fit a line using this ordinal variable. This is fine if the categories are meant to be ordinal. "Cat", "Dog" and "Hamster" are definitely not ordinal, whereas military ranks such as "Cadet", "Lieutenant" and "Captain" are. Lieutenant is a higher rank than cadet, but lower than captain. You would have to be careful with the ordering, though, so you don't accidentally perform alphabetic sorting.

But what to do when the feature is not cardinal or ordinal? These non-ordered, non-ranked variables are called nominal numbers. The number simply identifies the feature type or a set membership, but a smaller or larger value doesn't describe/quantify it in any way that would allow you do perform basic arithmetic calculations in a meaningful way. Solution is: to break this feature into multiple binary features. This process is called one-hot encoding.

There are several ways of performing one-hot encoding. We will discuss three here: LabelBinarizer and OneHotEncoder from sci-kit learn and Pandas.get_dummies().

LabelBinarizer

The LabelBinarizer shares has a similar interface as the LabelEncoder we used previously. The code below will look familiar:

from sklearn.preprocessing import LabelBinarizer

y = ["cat", "hamster", "dog", "dog", "cat", "hamster"]

# Construct the object
lb = LabelBinarizer()

# Fit and transform within one line
y_encoded = lb.fit_transform(y)

print(y_encoded)
print(type(y_encoded))

"""
    OUTPUT:
    [[1 0 0]
    [0 0 1]
    [0 1 0]
    [0 1 0]
    [1 0 0]
    [0 0 1]]
    <class 'numpy.ndarray'>
"""

To convert this into readable table, our y had this kind of values before the transfer:

| y | | --------- | | "cat" | | "hamster" | | "dog" | | "dog" | | "cat" | | "hamster" |

After the translate, it is:

| cat | dog | hamster | | ---- | ---- | ------- | | 1 | 0 | 0 | | 0 | 0 | 1 | | 0 | 1 | 0 | | 0 | 1 | 0 | | 1 | 0 | 0 | | 0 | 0 | 1 |

Let's prove this by calling the inverse transform:

for animal_code in y_encoded:
    animal_code = animal_code.reshape(1, -1)
    animal_str = lb.inverse_transform(animal_code)
    print(f"Encoded {animal_code[0]} refers to {animal_str[0]}")

"""
    OUTPUT:
    Encoded [1 0 0] refers to cat
    Encoded [0 0 1] refers to hamster
    Encoded [0 1 0] refers to dog
    Encoded [0 1 0] refers to dog
    Encoded [1 0 0] refers to cat
    Encoded [0 0 1] refers to hamster
"""

Instead of 1 feature (column), we now have three. All three features are in binary format. An animal either is or isn't a cat, dog or a hamster.

OneHotEncoder

OneHotEncoder is a very similar tool compared to LabelBinarizer. By default, it creates a SciPy CSR matrix, which is a format for storing a sparse matrix efficiently (by storing only the non-zero entries.) During this course, we will not need sparse matricies, but feel free to look into the topic. To get a typical dense matrix, we need to set the sparse=False parameter.

The code below creates a y_encoded that is identical to one created above using LabelBinarizer.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

y = ["cat", "hamster", "dog", "dog", "cat", "hamster"]
y = np.array(y).reshape(-1, 1)

ohe = OneHotEncoder(sparse=False)

y_encoded = ohe.fit_transform(y)

So far, we haven't seen any reasons for using OneHotEncoder over LabelBinarizer. Thus, if this is all you need, LabelBinarizer will suit your needs just fine. There is one parameter worth mentioning though, that is not present in the LabelBinarizer. This parameter, drop, allows you to drop the first feature. Another handy option is to drop="if_binary", which will only drop the first feature of the columns where only two unique values are present. This is usually what we want: only hot-encode columns that contain non-binary, nominal values.

# Notice the drop keyword argument
ohe = OneHotEncoder(drop="first", sparse=False)

y_encoded = ohe.fit_transform(y)

Let's investigate how this functions as a table. This the y after the transform, when using drop=first:

| dog | hamster | | ---- | ------- | | 0 | 0 | | 0 | 2 | | 1 | 0 | | 1 | 0 | | 0 | 0 | | 0 | 1 |

Notice the the cat column got dropped, since it was the first column. When performing drop for the first column, cat is the situation where both of these two columns are zero. Dog is the 1-0 and hamster is the 0-1. (The 1-1 doesn't exist, since an animal can't be both a hamster and a dog at the same time.) This will easily turn your make your unintuitive to read, and thus also your coefficient will be slightly confusing to intepret. Why would we do this?

When using normal equation (LinearRegression), you must do this to all features that are being one-hot encoded. If you want to understand why, read "Think twice before dropping that first one-hot encoded column" by Red Huq.
For any model that includes regularization (=most), you can skip this for any feature that has more than two columns after one-hot encoding.

IMPORTANT! Notice that you can skip dropping the first columns only if there are more than two columns. You want to avoid situation where you have a column "CanSwim" and "CannotSwim". Here you would only have two columns. These are mutually inclusive as they have correlation of -1. If one is 1, the other must be 0. This type of multicollinearity needs to be avoided. With three or more, you don't need to drop anything. For example, "LowSwimSkill", "MediumSwimSkill" and "HighSwimSkill" columns would be fine.

Get_Dummies

You will notice that OneHotEncoder and LabelBinarizer can be slightly cumbersome to use with Pandas DataFrames, especially if you have heterogenous data where different columns require different kind of operations. Sci-Kit learn's Dataset transformation tools, such as ColumnTransformer, will help, but using those is beyond the scope of this course.

Luckily, Pandas has its own tool: get_dummies(). It functions very similarly to previous tools, but instead of (numpy) array, it return a DataFrame.

import pandas as pd

y = pd.Series(["cat", "hamster", "dog", "dog", "cat", "hamster"])

y_encoded = pd.get_dummies(y, prefix=prefix)

y_encoded is a DataFrame and the content is:

| RangeIndex | ycat | ydog | y_hamster | | ---------: | ----: | ----: | --------- | | 0 | 1 | 0 | 0 | | 1 | 0 | 0 | 1 | | 2 | 0 | 1 | 0 | | 3 | 0 | 1 | 0 | | 4 | 1 | 0 | 0 | | 5 | 0 | 0 | 1 |

Sadly, there is no option to drop first for only binary columns. The only available parameter for dropping columns is the drop_first=True/False. In the example above, it would drop the y_cat. On the bonus side, it automatically drops columns that contain only 0's and 1's and columns that have float values.

The data type of the feature (column) will affect how get_dummies treats the variables. Only object and category dtypes will be affected. Settings data types correctly might solve problems for you.

Conclusion

For most cases, you can use any of the options above or create your own tools (e.g. by utilizing the apply()in Pandas.) It is important that you know what you want. Below are some guidelines that should be helpful in most situations. Note that the decision must be made per each column (which is a feature that has some variable type).

Is your variable non-relevant, having no causality with the predicted value, or includes only unique values.
Delete the variable or use feature engineering techniques to turn it into useful data.
Is your variable continuous or categorical?
Continuous: Do not one-hot encode.
Categorical: You may want to one-hot encode
Is your variable binary? It has maximum of two unique values?
Binary: One-hot encode AND drop first.
Non-binary: You may want to one-hot encode.
Is your categorical variable nominal or ordinal? Can they by meaningfully ordered as (a > b > c)?
Nominal: One-hot encode but do not drop first.
Ordinal: No need to one-hot encode. Perform label encoding from ["junior", "medior", "senior"] to [1, 2, 3]

Let's imagine you have a dataset containing the following dataset.head(3):

| Alive | Military Rank | Pet | Score | | ----- | -------------- | ------- | ----- | | yes | Private | Dog | 82.4 | | no | Corporal | Cat | 68.7 | | yes | First Sergeant | Hamster | 91.2 |

After all the work, your dataset.head(3) might look like this:

| Alive | Military Rank | PetDot | PetCat | Pet_Hamster | Score | | ----- | ------------- | ------- | ------- | ----------- | ----- | | 1 | 2 | 1 | 0 | 0 | 82.4 | | 0 | 4 | 0 | 1 | 0 | 68.7 | | 1 | 9 | 0 | 0 | 1 | 91.2 |

Note: Military Rank has been preprocessed using LabelEncoder or similar.

Note #2: Index has been hidden from these examples to limit the column values. By default, it would be the typical RangeIndex.