Confusion Matrix

Confusion Matrix. Source: Idris 2018.

In statistical classification, we create algorithms or models to predict or classify data into a finite set of classes. Since models are not perfect, some data points will be classified incorrectly. A confusion matrix is a tabular summary of how well the model is performing.

One dimension of the matrix takes the actual values and the other dimension takes the predicted values. The entries in the matrix are counts, much like a histogram. For example, one cell records how many data points were predicted as "true" when they were actually "false".

The confusion matrix is useful in both binary and multiclass classification problems. Many performance metrics can be computed from the matrix, and knowing them is handy for a statistician or data scientist.

Discussion

  • What are the elements and terminology used in Confusion Matrix?
    Illustrating basic terms of a confusion matrix. Source: Narkhede 2018.

    Let's also consider a concrete example of a pregnancy test. Based on a urine test, we predict if a person is pregnant or not. We assume that the ground truth (pregnant or not) is available to us. We therefore have four possibilities:

    • True Positive (TP): We predict a pregnant person is pregnant. This is a good prediction.
    • True Negative (TN): We predict a non-pregnant person is not pregnant. This is a good prediction.
    • False Positive (FP): We predict a non-pregnant person is pregnant. This type of error is also called Type I Error.
    • False Negative (FN): We predict a pregnant person is not pregnant. This type of error is also called Type II Error.

    When these are arranged in matrix form, it will be apparent that correct predictions are represented along the main diagonal. Incorrect predictions are in the non-diagonal cells. This makes it easy to see where predictions have gone wrong. We may also say that the matrix represents the model's inability to classify correctly, and hence the "confusion" in the model.
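
    As a minimal sketch, the four counts can be tallied directly from paired actual and predicted labels. The labels below are made up for illustration (1 = pregnant, 0 = not pregnant):

    # Hypothetical labels for the pregnancy example: 1 = pregnant, 0 = not pregnant
    actual    = [1, 0, 1, 1, 0, 0, 1, 0]
    predicted = [1, 0, 0, 1, 1, 0, 1, 0]

    # Tally each of the four outcomes by comparing actual and predicted values
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

    print(tp, tn, fp, fn)    # 3 3 1 1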

  • What metrics are used for evaluating the performance of a prediction model?
    Illustrating the many metrics calculated from the confusion matrix. Source: Devopedia 2019.

    Performance metrics from a confusion matrix are represented in the following equations:

    $$\begin{aligned}
    Recall\ or\ Sensitivity &= \frac{TP}{TP+FN} = \frac{TP}{AllPositives} \\
    Specificity &= \frac{TN}{TN+FP} = \frac{TN}{AllNegatives} \\
    Precision &= \frac{TP}{TP+FP} = \frac{TP}{PredictedPositives} \\
    Prevalence &= \frac{TP+FN}{Total} = \frac{AllPositives}{Total} \\
    Accuracy &= \frac{TP+TN}{Total} \\
    Error\ Rate &= \frac{FP+FN}{Total}
    \end{aligned}$$

    It's important to understand the significance of these metrics. Accuracy is an overall measure of correct prediction, regardless of the class (positive or negative). The complement of accuracy is error rate or misclassification rate.

    High recall implies that very few positives are misclassified as negatives. High precision implies very few negatives are misclassified as positives. There's a trade-off here. If the model is partial towards positives, we'll end up with high recall but low precision. If the model favours negatives, we'll end up with low recall and high precision.

    High specificity, like high precision, implies that very few negatives are misclassified as positives. If positive represents some disease, specificity is the model's confidence in clearing a person as disease-free. Sensitivity is the model's confidence in diagnosing a person as diseased.

    Ideally, recall, specificity, precision and accuracy should all be close to 1. The false negative rate (FNR), false positive rate (FPR) and error rate should be close to 0.
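
    The following sketch computes these metrics directly from the four cell counts, reusing the made-up counts from the pregnancy sketch above:

    # Made-up cell counts from the pregnancy sketch
    tp, tn, fp, fn = 3, 3, 1, 1
    total = tp + tn + fp + fn

    recall      = tp / (tp + fn)       # sensitivity, true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    precision   = tp / (tp + fp)
    prevalence  = (tp + fn) / total
    accuracy    = (tp + tn) / total
    error_rate  = (fp + fn) / total    # 1 - accuracy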

  • Could you give a numerical example showing calculations of performance measures of a prediction model?
    Example confusion matrix with sample values. Source: Markham 2014.

    This example has 165 samples. We show the following calculations, reproduced in the code sketch after this list:

    • Recall or True Positive Rate (TPR): TP/(TP+FN) = 100/(100+5) = 0.95
    • False Negative Rate (FNR): 1 - TPR = 0.05
    • Specificity or True Negative Rate (TNR): TN/(TN+FP) = 50/(50+10) = 0.83
    • False Positive Rate (FPR): 1 - TNR = 0.17
    • Precision: TP/(TP+FP) = 100/(100+10) = 0.91
    • Prevalence: (TP+FN)/Total = (100+5)/165 = 0.64
    • Accuracy: (TP+TN)/Total = (100+50)/165 = 0.91
    • Error Rate: (FP+FN)/Total = (10+5)/165 = 0.09
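
    As a quick check, the same numbers can be reproduced with scikit-learn by constructing labels directly from the four counts; the sketch below is illustrative and not part of the original example:

    from sklearn.metrics import confusion_matrix

    # Labels built from the counts: TP=100, TN=50, FP=10, FN=5
    y_true = [1]*100 + [0]*50 + [0]*10 + [1]*5
    y_pred = [1]*100 + [0]*50 + [1]*10 + [0]*5

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tp / (tp + fn))              # recall: 0.95
    print(tn / (tn + fp))              # specificity: 0.83
    print(tp / (tp + fp))              # precision: 0.91
    print((tp + tn) / len(y_true))     # accuracy: 0.91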
  • Why do we need so many performance measures when accuracy can be sufficient?
    Normalized confusion matrix is useful when there's class imbalance. Source: Scikit-learn 2019b.

    If the dataset has 90% positives, then achieving 90% accuracy is easy by predicting only positives. Thus, accuracy is not a sufficient measure when the dataset is imbalanced. Accuracy also doesn't differentiate between Type I (False Positive) and Type II (False Negative) errors. This is where the confusion matrix gives us more useful measures with FNR and FPR, or their complementary measures, Recall and Specificity respectively.

    Consider the multiclass problem of iris classification with three classes: setosa, versicolor and virginica. A model achieves an accuracy of 84% (32/38), but accuracy alone doesn't tell us where the errors are happening. With the confusion matrix, it's easy to see that only versicolor is wrongly classified. The matrix also shows that versicolor is misclassified as virginica and never as setosa. We can also see that Recall is 62% (10/16) for versicolor.

    In fact, when classes are not evenly represented in the data, the raw confusion matrix by itself doesn't give an adequate visual representation. For this reason, we use a normalized confusion matrix that takes care of class imbalance.
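
    A row-normalized matrix divides each row by the number of actual samples in that class. The sketch below assumes a 3x3 matrix consistent with the iris figures above (the exact setosa and virginica counts are assumed):

    import numpy as np

    # Assumed counts: rows are actuals (setosa, versicolor, virginica),
    # columns are predictions; consistent with 32/38 correct and 10/16 for versicolor
    cm = np.array([[13,  0,  0],
                   [ 0, 10,  6],
                   [ 0,  0,  9]])

    # Normalize each row by its total so entries become per-class fractions
    cm_normalized = cm / cm.sum(axis=1, keepdims=True)
    print(cm_normalized.round(2))
    # versicolor row: 0.62 correct, 0.38 misclassified as virginica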

  • What are other performance metrics for a classification/prediction problem?

    F-measure takes the harmonic mean of Recall and Precision, (2*Recall*Precision)/(Recall+Precision). It's a value closer to the smaller of the two. Applying this to our earlier example, we get F-measure = (2*0.95*0.91)/(0.95+0.91) = 0.93

    A commonly used graphical measure is the ROC Curve. It's generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as we vary the threshold for assigning observations to a given class.

    How often will we be wrong if we always predict the majority class? Null Error Rate gives us a measure for this. It's a useful baseline when evaluating a model. In our example, null error rate would be 60/165 = 0.36. If the model always predicted positive, it would be wrong 36% of the time.

    Cohen's Kappa tells us how well a classifier is performing compared to classifying simply by chance. A high Kappa score implies that accuracy differs a lot from the null error rate.
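
    Here's a sketch of these additional metrics with scikit-learn, reusing the made-up labels constructed earlier for the 165-sample example:

    from sklearn.metrics import f1_score, cohen_kappa_score, roc_curve

    y_true = [1]*100 + [0]*50 + [0]*10 + [1]*5
    y_pred = [1]*100 + [0]*50 + [1]*10 + [0]*5

    print(f1_score(y_true, y_pred))             # F-measure: 0.93
    print(cohen_kappa_score(y_true, y_pred))    # Cohen's Kappa: 0.80

    # Null error rate: error if we always predict the majority class
    majority = max(sum(y_true), len(y_true) - sum(y_true))
    print(1 - majority / len(y_true))           # 0.36

    # roc_curve expects scores or probabilities; with hard 0/1 predictions
    # it degenerates to a single operating point
    fpr, tpr, thresholds = roc_curve(y_true, y_pred)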

  • What's the procedure to make or use a Confusion Matrix?
    Confusion matrix for multiclass classification. Source: Krüger 2016, table 5.1.

    We certainly need both the actual values and the predicted values. We can arrange the actual values by rows and the predicted values by columns, although some may swap the two. It's therefore important to read the arrangement of the matrix correctly. For each actual value, count the number of predicted values for each class. Fill these counts into the matrix.

    There's no threshold for good accuracy, sensitivity or other measures. They should be interpreted in the context of problem, domain and business.
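
    The tallying procedure can be sketched in a few lines: index the classes, then count actual/predicted co-occurrences. The labels below are made up:

    import numpy as np

    labels = ["setosa", "versicolor", "virginica"]
    index = {name: i for i, name in enumerate(labels)}

    # Made-up actual and predicted values
    actual    = ["setosa", "versicolor", "virginica", "versicolor"]
    predicted = ["setosa", "virginica",  "virginica", "versicolor"]

    # Rows are actuals, columns are predictions; each cell is a count
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for a, p in zip(actual, predicted):
        cm[index[a], index[p]] += 1
    print(cm)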

  • Could you mention some tools and techniques in relation to the Confusion Matrix?

    In R, the caret package (Classification and Regression Training) can be used to get a confusion matrix with all relevant statistical information. The function is confusionMatrix(data=predicted, reference=expected). It arranges actuals (called reference) by columns and predictions by rows.

    In Python, the sklearn.metrics package has an equivalent function, confusion_matrix(actual, predicted). It arranges actuals by rows and predictions by columns. Other related and useful functions are accuracy_score(actual, predicted) and classification_report(actual, predicted).
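
    A short sketch of the Python calls mentioned above, with made-up labels (see the Sample Code section for fuller output):

    from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

    actual    = [1, 0, 1, 1, 0, 1]
    predicted = [1, 0, 0, 1, 1, 1]

    print(confusion_matrix(actual, predicted))       # actuals by rows, predictions by columns
    print(accuracy_score(actual, predicted))         # 0.67
    print(classification_report(actual, predicted))  # precision, recall, f1-score per class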

Milestones

1904
Contingency or correlation between hair colours of siblings. Source: Pearson 1904, table VI.

Mathematician Karl Pearson publishes a paper titled On the theory of contingency and its relation to association and normal correlation. Contingency and correlation between two variables can be seen as the genesis of the confusion matrix.

1971
Confusion matrix is useful in multiclass problems. Source: Townsend 1971, table 2.

James Townsend publishes a paper titled Theoretical analysis of an alphabetic confusion matrix. Uppercase English letters are shown to human participants who try to identify them. Letters are presented with or without added visual noise. The resulting confusion matrix is of size 26x26. With noise, Townsend finds that 'W' is misidentified as 'V' 37% of the time, 32% of 'Q' are misidentified as 'O', and 'H' is identified correctly only 19% of the time.

1998

The term Confusion Matrix becomes popular in the ML community when it appears in a glossary featured in Special Issue on Applications of Machine Learning and the Knowledge Discovery Process by Ron Kohavi and Foster Provost.

2011

In a paper titled Comparing Multi-class Classifiers: On the Similarity of Confusion Matrices for Predictive Toxicology Applications, researchers show how to compare predictive models based on their confusion matrices. For lower FNR, they propose regrouping performance measures of multiclass classifiers into a binary classification problem.

Sample Code

  • # Source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
    # Accessed 2019-08-19
    >>> from sklearn.metrics import confusion_matrix
    >>> y_true = [2, 0, 2, 2, 0, 1]
    >>> y_pred = [0, 0, 2, 2, 0, 2]
    >>> confusion_matrix(y_true, y_pred)
    array([[2, 0, 0],
           [0, 0, 1],
           [1, 0, 2]])
     
    >>> y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
    >>> y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
    >>> confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])
    array([[2, 0, 0],
           [0, 0, 1],
           [1, 0, 2]])
     
    # In the binary case, we can extract true positives, etc as follows:
    >>> tn, fp, fn, tp = confusion_matrix([0, 1, 0, 1], [1, 1, 1, 0]).ravel()
    >>> (tn, fp, fn, tp)
    (0, 2, 1, 1)
     
     
    # -------------------------------------------------------
    # Source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
    # Accessed: 2019-08-19
    >>> from sklearn.metrics import classification_report
    >>> y_true = [0, 1, 2, 2, 2]
    >>> y_pred = [0, 0, 2, 2, 1]
    >>> target_names = ['class 0', 'class 1', 'class 2']
    >>> print(classification_report(y_true, y_pred, target_names=target_names))
                  precision    recall  f1-score   support
     
         class 0       0.50      1.00      0.67         1
         class 1       0.00      0.00      0.00         1
         class 2       1.00      0.67      0.80         3
     
        accuracy                           0.60         5
       macro avg       0.50      0.56      0.49         5
    weighted avg       0.70      0.60      0.61         5
     
    >>> y_pred = [1, 1, 0]
    >>> y_true = [1, 1, 1]
    >>> print(classification_report(y_true, y_pred, labels=[1, 2, 3]))
                  precision    recall  f1-score   support
     
               1       1.00      0.67      0.80         3
               2       0.00      0.00      0.00         0
               3       0.00      0.00      0.00         0
     
       micro avg       1.00      0.67      0.80         3
       macro avg       0.33      0.22      0.27         3
    weighted avg       1.00      0.67      0.80         3
     

References

  1. Brownlee, Jason. 2016. "What is a Confusion Matrix in Machine Learning." Machine Learning Mastery, November 18. Accessed 2019-08-18.
  2. Caret. 2019. "confusionMatrix: Create a confusion matrix." Caret Docs, via rdrr, May 02. Accessed 2019-08-18.
  3. Idris, Awab. 2018. "Confusion Matrix." Medium, July 11. Accessed 2019-06-27.
  4. Kohavi, Ron and Foster Provost, eds. 1998. "Glossary of Terms." Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Machine Learning, vol. 30, pp. 271-274, Kluwer Academic Publishers. Accessed 2019-06-28.
  5. Krüger, Frank. 2016. "Activity, Context, and Plan Recognition with Computational Causal Behaviour Models." ResearchGate, December. Accessed 2019-08-20.
  6. Makhtar, Mokhairi and Daniel C. Neagu and Mick J. Ridley. 2011. "Comparing Multi-class Classifiers: On the Similarity of Confusion Matrices for Predictive Toxicology Applications." In: Yin H., Wang W., Rayward-Smith V. (eds), Intelligent Data Engineering and Automated Learning, Lecture Notes in Computer Science, vol. 6936, Springer, Berlin, Heidelberg. Accessed 2019-08-18.
  7. Markham, Kevin. 2014. "Simple guide to confusion matrix terminology." Data School, March 25. Accessed 2019-06-27.
  8. Narkhede, Sarang. 2018. "Understanding Confusion Matrix." Towards Data Science, via Medium, May 09. Accessed 2019-08-18.
  9. Parikh, R., A. Mathai, S. Parikh, G. Chandra Sekhar, and R. Thomas. 2008. "Understanding and using sensitivity, specificity and predictive values." Indian journal of ophthalmology, 56(1), 45–50, Jan-Feb. doi:10.4103/0301-4738.37595. Accessed 2019-08-20.
  10. Pearson, Karl. 1904. "On the theory of contingency and its relation to association and normal correlation." Drapers' Company Research Memoirs, Biometric Series I, Dept. of Applied Mathematics, University of London. Accessed 2019-06-28.
  11. Scikit-learn. 2019a. "sklearn.metrics.confusion_matrix." scikit-learn, v0.21.3, July 30. Accessed 2019-08-18.
  12. Scikit-learn. 2019b. "Confusion matrix." scikit-learn, v0.21.3, July 30. Accessed 2019-08-18.
  13. Scikit-learn. 2019c. "sklearn.metrics.classification_report." scikit-learn, v0.21.3, July 30. Accessed 2019-08-18.
  14. Sharma, Abhishek. 2017. "Confusion Matrix in Machine Learning." GeeksforGeeks, October 15. Updated 2018-02-07. Accessed 2019-08-18.
  15. Townsend, James. 1971. "Theoretical analysis of an alphabetic confusion matrix." Attention Perception & Psychophysics 9(1):40-50, via ResearchGate, January. Accessed 2019-06-28.

Further Reading

  1. Caret. 2019. "confusionMatrix: Create a confusion matrix." Caret Docs, via rdrr, May 02. Accessed 2019-08-18.
  2. Vanneti, Marco. 2007. "Confusion matrix online calculator." Accessed 2019-08-18.
  3. Mills, Peter. 2017. "Bayesian Learning for Statistical Classification." Stats and Bots, via Medium, September 26. Accessed 2019-08-18.
  4. Narkhede, Sarang. 2018. "Understanding Confusion Matrix." Towards Data Science, via Medium, May 09. Accessed 2019-08-18.
