This course module focuses on terminology and on various concepts in supervised learning. Let's start by looking at metrics that measure how successful a trained classification model is. What you should already know is that the full dataset can be split into training and testing sets. The model is trained using the training subset, which usually consists of around 75-80 % of the whole dataset. The testing set includes all the remaining samples.
Note: you might read about a train-test-validation split when going through various documentations and tutorials. For now, we will keep things simple and leave out the "validation" part of the split. It will become useful when fine-tuning hyperparameters, which you will learn about later.
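As a minimal sketch of how such a split could be done with scikit-learn (the tiny arrays below are made-up stand-ins for the real Titanic features and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data standing in for the preprocessed Titanic features and labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# test_size=0.25 keeps roughly 75 % of the samples for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print(X_train.shape, X_test.shape)  # e.g. (7, 2) (3, 2)
```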
After the model has been trained using the training set X_train and the correct labels y_train, we will run predictions on X_test. With the Titanic dataset, we are predicting whether a passenger survived or not. The predictions are a list that includes a label for each vector in X_test: [0, 1, 0, 0, 0, 1, 1, 1 …]. Let's call this list variable predictions. (In academic contexts, you will often see the predictions named y_hat, with the mathematical notation ŷ. The predicted y is ŷ.)
Having y_test, which includes the correct labels, and the predicted labels in predictions, we can call scikit-learn's classification_report:
from sklearn.metrics import classification_report

print(classification_report(y_test,
                            predictions,
                            target_names=["Deceased", "Survived"]))
The output in the console or notebook is:
precision recall f1-score support
Deceased 0.83 0.86 0.84 206
Survived 0.75 0.70 0.72 122
accuracy 0.80 328
macro avg 0.79 0.78 0.78 328
weighted avg 0.80 0.80 0.80 328
Notice that the numbers may vary. One major source of variation is the randomness in train_test_split(), which can be locked using the parameter random_state=123. Another source of randomness is the np.random.normal() we use when imputing the missing age data. This can be locked using np.random.seed(123). (Food for thought: if either of those is None, the seed is taken from the operating system, using the system time or a similar constantly changing value. A computer program cannot perform true randomization.)
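Here is a small sketch demonstrating both ways of locking the randomness; the dummy array is only there to keep the example self-contained:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seeding NumPy's global generator locks np.random.normal(), which we use
# for the age imputation.
np.random.seed(123)
print(np.random.normal(30, 10, size=3))  # same three values on every run

# The same random_state always produces the same shuffle and split.
data = np.arange(10)
a_train, a_test = train_test_split(data, test_size=0.25, random_state=123)
b_train, b_test = train_test_split(data, test_size=0.25, random_state=123)
print((a_test == b_test).all())  # True
```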
We also printed the confusion matrix:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, predictions))
The output:
[[177 29]
[ 37 85]]
Now it is time to learn how to interpret those values.
Let's start with the confusion matrix, since its values are needed to calculate the metrics in the classification report. It is important to remember that the metrics discussed in this lesson apply to classification models. Regression models' performance is usually measured using the distance from the correct answer (e.g. mean squared error).
The concept of a confusion matrix is simple: it is a 2-axis grid. On the x-axis, you have the predicted labels. On the y-axis, you have the actual (y_test) labels. The table below should clarify this:
| | Predicted negative (0) | Predicted positive (1) |
| ------------------- | ---------------------- | ---------------------- |
| Actual negative (0) | True Negative | False Positive |
| Actual positive (1) | False Negative | True Positive |
Note! There can be more than two labels. If you have 5 labels, your confusion matrix will be a 5x5 matrix; columns are the predicted labels and rows are the actual labels. In such cases, the confusion matrix can be confusing to interpret, and a classification report will be easier on the eyes. (A small hypothetical example is shown below.)
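As a quick, made-up illustration (the animal labels below are not from the Titanic data), scikit-learn's confusion_matrix takes a labels argument that fixes the row and column order:

```python
from sklearn.metrics import confusion_matrix

# A made-up 3-label example: rows are actual labels, columns are predicted
# labels, in the order given by the labels argument.
y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "bird", "dog"]

print(confusion_matrix(y_true, y_pred, labels=["cat", "dog", "bird"]))
# [[2 1 0]
#  [0 1 1]
#  [1 0 1]]
```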
In our case, the values were: 177 true negatives, 29 false positives, 37 false negatives and 85 true positives. These can be explained as:

- True negative (177): the passenger died, and the model correctly predicted "deceased".
- False positive (29): the passenger died, but the model predicted "survived".
- False negative (37): the passenger survived, but the model predicted "deceased".
- True positive (85): the passenger survived, and the model correctly predicted "survived".
Think for a moment: which type of error is usually the more severe one, a false positive or a false negative? Imagine that an AI system is used to predict whether you have some disease that needs treatment.
Now, let's imagine that the disease is fairly rare: only 1 % of tested individuals actually carry it. If we only measured accuracy, what accuracy would we get if the machine learning model always predicted negative ("healthy")? Below is the confusion matrix for 1000 cases:
| | Predicted "healthy" | Predicted "sick" | | ---------------- | ------------------- | ---------------- | | Actual "healthy" | TN = 990 | FP = 0 | | Actual "sick" | FN = 10 | TP = 0 |
All 10 individuals carrying the disease were marked as "healthy". No one was sent to get treatment on time, so the consequences would most likely be severe.
What if I told you that the accuracy would still be very good? In fact, it would be 99 %. This is why measuring a binary classifier's performance by accuracy alone may not be a wise decision. Let's check how accuracy, precision, recall and F1-score are computed.
For the calculations in this document, we will be using the y_test and predictions values saved from the Titanic lesson. The saved data has the shape (328, 2): one row per test sample, with columns y and y_hat.
import pandas as pd

# Saved y_test ("y") and predictions ("y_hat") from the Titanic lesson.
ys = pd.read_csv("03-data/titanic_y_yhat.csv")

tp = len(ys[(ys["y"] == 1) & (ys["y_hat"] == 1)])
tn = len(ys[(ys["y"] == 0) & (ys["y_hat"] == 0)])
fp = len(ys[(ys["y"] == 0) & (ys["y_hat"] == 1)])
fn = len(ys[(ys["y"] == 1) & (ys["y_hat"] == 0)])

print(f"TN {tn:>4} | FP {fp:>4}")
print(f"FN {fn:>4} | TP {tp:>4}")
Output:
TN 177 | FP 29
FN 37 | TP 85
Accuracy is a very simple concept: it is the ratio of correct predictions to all predictions. The formula for calculating accuracy is:
$ accuracy = \frac{tp + tn}{tp + tn + fp + fn} $
For our disease case, the accuracy would be (0 + 990) / (990 + 10 + 0 + 0) = 0.99
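To make the formula concrete, here is the same calculation for the Titanic predictions, using the TP/TN/FP/FN counts computed earlier:

```python
# Accuracy for the Titanic predictions, from the confusion-matrix counts above.
tp, tn, fp, fn = 85, 177, 29, 37

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 2))  # 0.8
```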
According to the scikit-learn documentation: "The precision is intuitively the ability of the classifier not to label as positive a sample that is negative." Note that both TP and FP are in the same column of the confusion matrix. Precision answers the question: "Of all the samples the model labeled 'cat', how many actually are cats?"
$ precision = \frac{tp}{tp+fp} $
If you forgot what TP and FP indicate, this form might help:
$ precision = \frac{hits}{hits + false\_alarms} $
For our disease case, the precision would be 0 / (0 + 0), which in Python will throw a ZeroDivisionError unless we add a tiny epsilon (e.g. 1e-16) to the divisor. We didn't find any positives, because the machine assumed everyone to be healthy. Mathematically, 0 / 0 is undefined (NaN), so the precision cannot be defined here.
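Using the Titanic counts from above, a small sketch of the epsilon workaround could look like this:

```python
# Precision for the Titanic predictions; the epsilon guards against the 0/0
# case of the disease example above.
tp, fp = 85, 29
eps = 1e-16

precision = tp / (tp + fp + eps)
print(round(precision, 2))  # 0.75
```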
Recall (also known as sensitivity) indicates how many of the actual positives we found. According to the scikit-learn documentation: "The recall is intuitively the ability of the classifier to find all the positive samples." Note that both TP and FN are in the same row of the confusion matrix. Recall answers the question: "Of all the actual cats in our dataset, how many did the model correctly classify as cats?"
$ recall = \frac{tp}{tp+fn} $
If you forgot what FN indicates, this form might help:
$ recall = \frac{hits}{hits + misses} $
For our disease case, the recall would be 0 / (0 + 10) = 0. If the system were working correctly, we would want the recall to be as high as possible (as close to 1.0 as possible), since that would indicate that the system catches nearly all positive cases.
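And again with the Titanic counts from above:

```python
# Recall for the Titanic predictions: how many of the actual survivors we found.
tp, fn = 85, 37
eps = 1e-16

recall = tp / (tp + fn + eps)
print(round(recall, 2))  # 0.7
```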
F1-score is the harmonic mean between precision and recall.
$ f1\_score = 2 \cdot \frac{precision \cdot recall}{precision + recall} $
Libraries such as scikit-learn will calculate an F-beta score if a beta parameter is given. The F-beta score is otherwise the same as the F1-score, except that the factor 2 in the numerator is replaced by (1 + β²) and the precision term in the divisor is multiplied by β². This makes the harmonic mean weighted: recall is treated as β times as important as precision.

$ fb\_score = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall} $
For our disease case, the F1-score is either NaN or 0, depending on whether you treat its precision as NaN or 0. (Read above to see why the precision would be NaN.) Plain Python won't give an answer at all, since it will throw a ZeroDivisionError.
Tip: options to stop a ZeroDivisionError from causing problems include adding a tiny epsilon value to the divisor or catching the error with a try-except block, as sketched below.
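Below is a minimal sketch of such a helper (a hypothetical f_beta function, not scikit-learn's implementation), using the try-except option and the Titanic counts from above:

```python
# Hypothetical helper: F-beta from precision and recall, falling back to NaN
# when both are zero (the ZeroDivisionError case discussed above).
def f_beta(precision, recall, beta=1.0):
    try:
        return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    except ZeroDivisionError:
        return float("nan")

precision = 85 / (85 + 29)   # Titanic TP / (TP + FP)
recall = 85 / (85 + 37)      # Titanic TP / (TP + FN)
print(round(f_beta(precision, recall), 2))            # 0.72, the plain F1
print(round(f_beta(precision, recall, beta=2.0), 2))  # F2: recall weighted more
```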
All of these metrics work so that higher is better, and all of them range between 0.0 and 1.0. They simply penalize different kinds of errors: precision punishes false alarms (FP), while recall punishes misses (FN).
Notice that a perfect model doesn't exist; there will always be some false positives or false negatives (false alarms or misses), and there is a trade-off between the two. How would one adjust this trade-off in one direction or the other? For now, it is enough to know that logistic regression is like linear regression, but the output of the linear function is mapped between 0 and 1 using a sigmoid function. In the next lesson, you will learn more about this!
In scikit-learn, after a model has been trained (using model.fit(X_train, y_train)), you can access the output values of the sigmoid function by calling model.predict_proba(X_train). The output can look something like the table below:
| [:, 0] | [:, 1] |
| ---------- | ---------- |
| 0.91457964 | 0.08542036 |
| 0.81814925 | 0.18185075 |
| 0.23080618 | 0.76919382 |
The first column is the probability of the given sample belonging to class 0 ("deceased"), and the right column is the probability of it belonging to class 1 ("survived"). Calling model.predict(X_train) performs the same operation but applies a class decision rule: f(x) = x > 0.5. So, if the probability of surviving the Titanic is predicted to be higher than 50 %, the model predicts the class as 1 ("survived"). Notice that the samples that lie close to the decision boundary (close to the threshold, 0.5) are the ones most likely to end up as FNs or FPs. Thus, offsetting this cut-off point will affect the balance between recall and precision.
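Below is a minimal sketch of moving that cut-off point. The single-feature data and the 0.3 threshold are made up purely for illustration, not taken from the Titanic lesson:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data standing in for the preprocessed Titanic features.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_train, y_train)

# Probability of class 1 ("survived") for each training sample.
proba_survived = model.predict_proba(X_train)[:, 1]

# model.predict() uses a 0.5 cut-off; lowering the threshold catches more
# positives (higher recall) at the price of more false alarms (lower precision).
threshold = 0.3
custom_predictions = (proba_survived >= threshold).astype(int)
print(custom_predictions)
```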
Important! Remember that these metrics are used for classification. For regression, the simplest error between the prediction and the ground truth would be the absolute error. An example would be predicting a house price: the prediction is 32 thousand, the ground truth is 35 thousand, so the absolute error is 3 thousand.
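In code, that example is simply:

```python
# Absolute error for the house-price example above (values in thousands).
prediction, ground_truth = 32, 35
print(abs(prediction - ground_truth))  # 3
```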
Food for thought: you might be wondering that the logistic regressor (as well as many other linear classifiers) seems to be a binary classifier. If this is the case, how would we be able to perform multiclass classification? (Hint: imagine that there are green, red and blue teddy bears on the table. You are green-red color blind: you can only tell whether a teddy bear is blue or not. For this problem, you are a binary classifier: an item either is blue or it is one of the other colors. Your friend also has a color vision deficiency and cannot differentiate green from blue; your friend can only tell whether a teddy bear is red or not. How would you and your friend be able to classify 21 colored teddy bears into 7 reds, 7 greens and 7 blues?)