# Confusion Matrix

In statistical classification, we create algorithms or models to predict or classify data into a finite set of classes. Since models are not perfect, some data points will be classified incorrectly. A confusion matrix is a tabular summary showing how well the model is performing.

One dimension of the matrix holds the actual values and the other dimension holds the predicted values. Each entry in the matrix is a count, so the matrix resembles a two-dimensional histogram. For example, it records how many data points were predicted as "true" when they were actually "false".

A confusion matrix is useful in both binary and multiclass classification problems. Many performance metrics can be computed from the matrix. Learning these metrics is handy for a statistician or data scientist.

## Discussion

• What are the elements and terminology used in Confusion Matrix?

Let's consider a concrete example of a pregnancy test. Based on a urine test, we predict if a person is pregnant or not. We assume that the ground truth (pregnant or not) is available to us. We therefore have four possibilities:

• True Positive (TP): We predict a pregnant person is pregnant. This is a good prediction.
• True Negative (TN): We predict a non-pregnant person is not pregnant. This is a good prediction.
• False Positive (FP): We predict a non-pregnant person is pregnant. This type of error is also called Type I Error.
• False Negative (FN): We predict a pregnant person is not pregnant. This type of error is also called Type II Error.

When these are arranged in matrix form, it will be apparent that correct predictions are represented along the main diagonal. Incorrect predictions are in the non-diagonal cells. This makes it easy to see where predictions have gone wrong. We may also say that the matrix represents the model's inability to classify correctly, and hence the "confusion" in the model.
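
As a minimal sketch of the four outcomes, the snippet below tallies them from paired actual and predicted labels and arranges the counts in a 2x2 matrix. The sample data, the function name `tally_outcomes` and the choice of rows for actual classes are assumptions made for illustration.

```python
from collections import Counter


def tally_outcomes(actual, predicted):
    """Count TP, TN, FP and FN for boolean labels (True = pregnant)."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if truth and guess:
            counts["TP"] += 1
        elif not truth and not guess:
            counts["TN"] += 1
        elif not truth and guess:
            counts["FP"] += 1  # Type I error
        else:
            counts["FN"] += 1  # Type II error
    return counts


# Hypothetical ground truth and test results for six people.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
outcomes = tally_outcomes(actual, predicted)

# Rows are actual classes, columns are predicted classes (positive first),
# so correct predictions sit on the main diagonal.
matrix = [
    [outcomes["TP"], outcomes["FN"]],
    [outcomes["FP"], outcomes["TN"]],
]
print(matrix)
```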

• What metrics are used for evaluating the performance of a prediction model?

Performance metrics from a confusion matrix are represented in the following equations:

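$$\begin{aligned}
\text{Recall or Sensitivity} &= \frac{TP}{TP+FN} = \frac{TP}{\text{All Positives}}\\
\text{Specificity} &= \frac{TN}{TN+FP} = \frac{TN}{\text{All Negatives}}\\
\text{Precision} &= \frac{TP}{TP+FP} = \frac{TP}{\text{Predicted Positives}}\\
\text{Prevalence} &= \frac{TP+FN}{\text{Total}} = \frac{\text{All Positives}}{\text{Total}}\\
\text{Accuracy} &= \frac{TP+TN}{\text{Total}}\\
\text{Error Rate} &= \frac{FP+FN}{\text{Total}}
\end{aligned}$$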

It's important to understand the significance of these metrics. Accuracy is an overall measure of correct prediction, regardless of the class (positive or negative). The complement of accuracy is error rate or misclassification rate.

High recall implies that very few positives are misclassified as negatives. High precision implies very few negatives are misclassified as positives. There's a trade-off here. If the model is partial towards positives, we'll end up with high recall but low precision. If the model favours negatives, we'll end up with low recall and high precision.

High specificity, like high precision, implies that very few negatives are misclassified as positives. If positive represents some disease, specificity is the model's ability to clear a disease-free person. Sensitivity is the model's ability to diagnose a diseased person.

Ideally, recall, specificity, precision and accuracy should all be close to 1. False negative rate (FNR), false positive rate (FPR) and error rate should be close to 0.

• Could you give a numerical example showing calculations of performance measures of a prediction model?

This example has 165 samples, with TP = 100, FN = 5, TN = 50 and FP = 10. We show the following calculations:

• Recall or True Positive Rate (TPR): TP/(TP+FN) = 100/(100+5) = 0.95
• False Negative Rate (FNR): 1 - TPR = 0.05
• Specificity or True Negative Rate (TNR): TN/(TN+FP) = 50/(50+10) = 0.83
• False Positive Rate (FPR): 1 - TNR = 0.17
• Precision: TP/(TP+FP) = 100/(100+10) = 0.91
• Prevalence: (TP+FN)/Total = (100+5)/165 = 0.64
• Accuracy: (TP+TN)/Total = (100+50)/165 = 0.91
• Error Rate: (FP+FN)/Total = (10+5)/165 = 0.09
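
The calculations above can be reproduced with a few lines of Python. This is only a sketch: the counts are the ones implied by the listed formulas (TP = 100, FN = 5, TN = 50, FP = 10), and the dictionary keys are illustrative names.

```python
# Counts implied by the worked example (165 samples in total).
tp, fn, tn, fp = 100, 5, 50, 10
total = tp + fn + tn + fp

metrics = {
    "recall (TPR)": tp / (tp + fn),
    "FNR": fn / (tp + fn),
    "specificity (TNR)": tn / (tn + fp),
    "FPR": fp / (tn + fp),
    "precision": tp / (tp + fp),
    "prevalence": (tp + fn) / total,
    "accuracy": (tp + tn) / total,
    "error rate": (fp + fn) / total,
}

# Print each metric rounded to two decimals, as in the list above.
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```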

• Why do we need so many performance measures when accuracy can be sufficient?

If the dataset has 90% positives, then achieving 90% accuracy is easy by predicting only positives. Thus, accuracy is not a sufficient measure when the dataset is imbalanced. Accuracy also doesn't differentiate between Type I (False Positive) and Type II (False Negative) errors. This is where the confusion matrix gives us more useful measures with FPR and FNR, or their complementary measures, Specificity and Recall respectively.
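
The effect of class imbalance on accuracy can be seen in a small sketch. The dataset of 90 positives and 10 negatives and the always-positive "model" are hypothetical, chosen only to mirror the 90% scenario described above.

```python
# Hypothetical imbalanced dataset: 90 positives (1) and 10 negatives (0).
actual = [1] * 90 + [0] * 10
# A trivial "model" that ignores its input and always predicts positive.
predicted = [1] * len(actual)

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)
tn = sum(a == 0 and p == 0 for a, p in pairs)
fp = sum(a == 0 and p == 1 for a, p in pairs)
fn = sum(a == 1 and p == 0 for a, p in pairs)

accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

# Accuracy looks good, yet not a single negative is identified.
print(accuracy, recall, specificity)
```

Accuracy alone hides the fact that every negative is misclassified, which specificity exposes immediately.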

Consider the multiclass problem of iris classification that has three classes: setosa, versicolor and virginica. Suppose a model has an accuracy of 84% (32/38). This figure alone doesn't tell us where the errors are happening. With the confusion matrix, it's easy to see that only versicolor is wrongly classified. The matrix also shows that versicolor is misclassified as virginica and never as setosa. We can also see that Recall is 62.5% (10/16) for versicolor.

In fact, when classes are not evenly represented in the data, a confusion matrix of raw counts doesn't give an adequate visual representation. For this reason, we use a normalized confusion matrix, in which counts are typically divided by the total for each actual class, to take care of class imbalance.
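
A minimal sketch of row normalization follows. The versicolor row (10 correct, 6 predicted as virginica) and the total of 38 come from the text; the split of the remaining 22 samples into 13 setosa and 9 virginica is an assumption made only so that the example is complete. The function name `normalize_rows` is also illustrative.

```python
labels = ["setosa", "versicolor", "virginica"]

# Rows are actual classes, columns are predicted classes.
# Only the versicolor row and the total of 38 follow the text;
# the setosa (13) and virginica (9) counts are assumed.
counts = [
    [13, 0, 0],
    [0, 10, 6],
    [0, 0, 9],
]


def normalize_rows(matrix):
    """Divide each row by its sum so that every non-empty row adds up to 1."""
    normalized = []
    for row in matrix:
        row_total = sum(row)
        normalized.append([cell / row_total if row_total else 0.0 for cell in row])
    return normalized


normalized = normalize_rows(counts)
for label, row in zip(labels, normalized):
    print(label, [round(cell, 3) for cell in row])
```

With this normalization, each diagonal entry is the recall of its class, so classes of different sizes can be compared directly.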

• What are other performance metrics for a classification/prediction problem?

F-measure takes the harmonic mean of Recall and Precision: (2*Recall*Precision)/(Recall+Precision). The result lies closer to the smaller of the two values. Applying this to our earlier example, we get F-measure = (2*0.95*0.91)/(0.95+0.91) = 0.93.
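
The F-measure can be sketched as a small helper. The function name `f_measure` is illustrative, and the counts are those of the earlier 165-sample example (TP = 100, FP = 10, FN = 5).

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


tp, fp, fn = 100, 10, 5
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = f_measure(precision, recall)
print(round(f1, 2))
```

Because the harmonic mean never exceeds the arithmetic mean, a model can't obtain a high F-measure by excelling at only one of the two.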

A commonly used graphical measure is the ROC Curve. It's generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as we vary the threshold for assigning observations to a given class.
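
As a rough illustration of how the points of an ROC curve arise, the sketch below sweeps a threshold over model scores and records the resulting (FPR, TPR) pairs. The labels, the scores and the helper name `roc_points` are hypothetical, and the code assumes that both classes are present in the data.

```python
def roc_points(actual, scores):
    """Return one (FPR, TPR) pair per candidate threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    # Sweep from strict to lenient; an infinite threshold predicts no positives.
    for threshold in [float("inf")] + sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        tp = sum(p and a == 1 for p, a in zip(predicted, actual))
        fp = sum(p and a == 0 for p, a in zip(predicted, actual))
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels (1 = positive) and model scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
for fpr, tpr in roc_points(actual, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting these pairs with FPR on the x-axis and TPR on the y-axis gives the curve, which runs from (0, 0) to (1, 1).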

How often will we be wrong if we always predict the majority class? Null Error Rate gives us a measure for this. It's a useful baseline when evaluating a model. In our example, null error rate would be 60/165 = 0.36. If the model always predicted positive, it would be wrong 36% of the time.

Cohen's Kappa measures how much better a classifier performs than one that classifies simply by chance. A high Kappa score implies that the observed accuracy is well above the accuracy expected by chance.
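
Both baseline measures can be sketched in a few lines, using the counts of the earlier 165-sample example. The function names are illustrative, and the Kappa computation uses the usual definition (observed minus expected agreement, divided by one minus expected agreement), which the text above doesn't spell out.

```python
def null_error_rate(class_counts):
    """Error rate of always predicting the most frequent actual class."""
    total = sum(class_counts)
    return (total - max(class_counts)) / total


def cohens_kappa(matrix):
    """Cohen's Kappa for a square matrix (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected if predictions were independent of the actual classes.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Counts from the worked example, laid out as [[TP, FN], [FP, TN]].
matrix = [[100, 5], [10, 50]]
null_rate = null_error_rate([105, 60])  # 105 actual positives, 60 negatives
kappa = cohens_kappa(matrix)
print(round(null_rate, 2), round(kappa, 2))
```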

• What's the procedure to make or use a Confusion Matrix?

We certainly need both the actual values and the predicted values. We can arrange the actual values by rows and the predicted values by columns, although some may swap the two. It's therefore important to read the arrangement of the matrix correctly. For each actual value, count the number of predicted values for each class. Fill these counts into the matrix.

There's no threshold for good accuracy, sensitivity or other measures. They should be interpreted in the context of the problem, domain and business.
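
The counting procedure can be sketched without any library. The helper name `build_confusion_matrix` is illustrative, and the layout (actual values by rows, predicted values by columns) is one of the two possible arrangements. The toy labels are the same as in the scikit-learn sample later in this article.

```python
def build_confusion_matrix(actual, predicted, labels):
    """Return counts with actual classes as rows and predicted classes as columns."""
    index = {label: position for position, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(actual, predicted):
        matrix[index[truth]][index[guess]] += 1
    return matrix


actual = ["cat", "ant", "cat", "cat", "ant", "bird"]
predicted = ["ant", "ant", "cat", "cat", "ant", "cat"]
labels = ["ant", "bird", "cat"]

matrix = build_confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>5}", row)
```

Passing the label order explicitly fixes which row and column belong to which class, which helps avoid misreading the arrangement.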

• Could you mention some tools and techniques in relation to the Confusion Matrix?

In R, the package caret (Classification and Regression Training) can be used to get the confusion matrix with all relevant statistical information. The function is confusionMatrix(data=predicted, reference=expected). This arranges actuals (called reference) by columns and predictions by rows.

In Python, the package sklearn.metrics has an equivalent function, confusion_matrix(actual, predicted). This arranges actuals by rows and predictions by columns. Other related and useful functions are accuracy_score(actual, predicted) and classification_report(actual, predicted).

## Milestones

1904

Mathematician Karl Pearson publishes a paper titled On the theory of contingency and its relation to association and normal correlation. Contingency and correlation between two variables can be seen as the genesis of confusion matrix.

1971

James Townsend publishes a paper titled Theoretical analysis of an alphabetic confusion matrix. Uppercase English letters are shown to human participants who try to identify them. Letters are presented with or without introduced noise. The resulting confusion matrix is of size 26x26. With noise, Townsend finds that 'W' is misidentified as 'V' 37% of the time; 32% of 'Q' are misidentified as 'O'; 'H' is identified correctly only 19% of the time.

1998

The term Confusion Matrix becomes popular in the ML community when it appears in a glossary featured in the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, edited by Ron Kohavi and Foster Provost.

2011

In a paper titled Comparing Multi-class Classifiers: On the Similarity of Confusion Matrices for Predictive Toxicology Applications, researchers show how to compare predictive models based on their confusion matrices. For lower FNR, they propose regrouping performance measures of multiclass classifiers into a binary classification problem.

## Sample Code

    # Source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
    # Accessed 2019-08-19
    >>> from sklearn.metrics import confusion_matrix
    >>> y_true = [2, 0, 2, 2, 0, 1]
    >>> y_pred = [0, 0, 2, 2, 0, 2]
    >>> confusion_matrix(y_true, y_pred)
    array([[2, 0, 0],
           [0, 0, 1],
           [1, 0, 2]])

    >>> y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
    >>> y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
    >>> confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
    array([[2, 0, 0],
           [0, 0, 1],
           [1, 0, 2]])

    # In the binary case, we can extract true positives, etc as follows:
    >>> tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
    >>> (tn, fp, fn, tp)
    (0, 2, 1, 1)

    # -------------------------------------------------------
    # Source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
    # Accessed: 2019-08-19
    >>> from sklearn.metrics import classification_report
    >>> y_true = [0, 1, 2, 2, 2]
    >>> y_pred = [0, 0, 2, 2, 1]
    >>> target_names = ['class 0', 'class 1', 'class 2']
    >>> print(classification_report(y_true, y_pred, target_names=target_names))
                  precision    recall  f1-score   support

         class 0       0.50      1.00      0.67         1
         class 1       0.00      0.00      0.00         1
         class 2       1.00      0.67      0.80         3

        accuracy                           0.60         5
       macro avg       0.50      0.56      0.49         5
    weighted avg       0.70      0.60      0.61         5

    >>> y_pred = [1, 1, 0]
    >>> y_true = [1, 1, 1]
    >>> print(classification_report(y_true, y_pred, labels=[1, 2, 3]))
                  precision    recall  f1-score   support

               1       1.00      0.67      0.80         3
               2       0.00      0.00      0.00         0
               3       0.00      0.00      0.00         0

       micro avg       1.00      0.67      0.80         3
       macro avg       0.33      0.22      0.27         3
    weighted avg       1.00      0.67      0.80         3


## Cite As

Devopedia. 2019. "Confusion Matrix." Version 6, August 20. Accessed 2024-06-25. https://devopedia.org/confusion-matrix

Contributed by 2 authors

Last updated on 2019-08-20 06:07:27