Confusion Matrix
 Summary

In statistical classification, we create algorithms or models to predict or classify data into a finite set of classes. Since models are not perfect, some data points will be classified incorrectly. A confusion matrix is a tabular summary showing how well the model is performing.
In one dimension, the matrix takes the actual values. The matrix then maps these to the predicted values in the other dimension. In effect, the matrix is like a histogram: its entries are counts. For example, it records how many data points were predicted as "true" when they were actually "false".
A confusion matrix is useful in both binary and multiclass classification problems. Many performance metrics can be computed from the matrix. Learning these metrics is handy for a statistician or data scientist.
Discussion

What are the elements and terminology used in Confusion Matrix?
Let's consider a concrete example of a pregnancy test. Based on a urine test, we predict if a person is pregnant or not. We assume that the ground truth (pregnant or not) is available to us. We therefore have four possibilities:
 True Positive (TP): We predict a pregnant person is pregnant. This is a good prediction.
 True Negative (TN): We predict a non-pregnant person is not pregnant. This is a good prediction.
 False Positive (FP): We predict a non-pregnant person is pregnant. This type of error is also called Type I Error.
 False Negative (FN): We predict a pregnant person is not pregnant. This type of error is also called Type II Error.
When these are arranged in matrix form, correct predictions lie along the main diagonal. Incorrect predictions are in the off-diagonal cells. This makes it easy to see where predictions have gone wrong. We may also say that the matrix represents the model's inability to classify correctly, and hence the "confusion" in the model.
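The four outcomes can be counted directly from paired lists of actual and predicted labels. The following is a minimal sketch in Python; the function name count_outcomes and the six sample labels are made up for illustration and are not from any source.

```python
def count_outcomes(actual, predicted):
    """Count TP, TN, FP and FN for boolean labels (True = pregnant)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if truth and guess:
            counts["TP"] += 1
        elif not truth and not guess:
            counts["TN"] += 1
        elif not truth and guess:
            counts["FP"] += 1  # Type I error
        else:
            counts["FN"] += 1  # Type II error
    return counts


# Hypothetical ground truth and test results for six people.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

counts = count_outcomes(actual, predicted)

# Rows are actual classes, columns are predicted classes.
# Correct predictions (TP, TN) sit on the main diagonal.
matrix = [
    [counts["TP"], counts["FN"]],
    [counts["FP"], counts["TN"]],
]
print(matrix)
```

Here rows hold the actual classes and columns the predicted ones; this is only one of the two possible arrangements.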

What metrics are used for evaluating the performance of a prediction model?
Performance metrics from a confusion matrix are represented in the following equations:
$$\begin{aligned}
\text{Recall or Sensitivity} &= \frac{TP}{TP+FN} = \frac{TP}{\text{All Positives}}\\
\text{Specificity} &= \frac{TN}{TN+FP} = \frac{TN}{\text{All Negatives}}\\
\text{Precision} &= \frac{TP}{TP+FP} = \frac{TP}{\text{Predicted Positives}}\\
\text{Prevalence} &= \frac{TP+FN}{\text{Total}} = \frac{\text{All Positives}}{\text{Total}}\\
\text{Accuracy} &= \frac{TP+TN}{\text{Total}}\\
\text{Error Rate} &= \frac{FP+FN}{\text{Total}}
\end{aligned}$$
It's important to understand the significance of these metrics. Accuracy is an overall measure of correct prediction, regardless of the class (positive or negative). The complement of accuracy is error rate or misclassification rate.
High recall implies that very few positives are misclassified as negatives. High precision implies that very few of the predicted positives are actually negatives. There's a tradeoff here. If the model is partial towards positives, we'll end up with high recall but low precision. If the model favours negatives, we'll end up with low recall and high precision.
High specificity, like high precision, implies that very few negatives are misclassified as positives. If positive represents some disease, specificity is the model's confidence in clearing a person as disease-free. Sensitivity is the model's confidence in diagnosing a person as diseased.
Ideally, recall, specificity, precision and accuracy should all be close to 1. False Negative Rate (FNR), False Positive Rate (FPR) and error rate should be close to 0.

Could you give a numerical example showing calculations of performance measures of a prediction model?
This example has 165 samples, with TP = 100, FN = 5, TN = 50 and FP = 10. We show the following calculations:
 Recall or True Positive Rate (TPR): TP/(TP+FN) = 100/(100+5) = 0.95
 False Negative Rate (FNR): 1 - TPR = 0.05
 Specificity or True Negative Rate (TNR): TN/(TN+FP) = 50/(50+10) = 0.83
 False Positive Rate (FPR): 1 - TNR = 0.17
 Precision: TP/(TP+FP) = 100/(100+10) = 0.91
 Prevalence: (TP+FN)/Total = (100+5)/165 = 0.64
 Accuracy: (TP+TN)/Total = (100+50)/165 = 0.91
 Error Rate: (FP+FN)/Total = (10+5)/165 = 0.09
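These calculations can be reproduced with a few lines of Python. This is a sketch rather than a reference implementation: the function name summarize is hypothetical, and the four counts are the ones used in the calculations above.

```python
def summarize(tp, fn, tn, fp):
    """Derive common performance metrics from the four confusion matrix counts."""
    total = tp + fn + tn + fp
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "recall": recall,            # True Positive Rate
        "fnr": 1 - recall,           # False Negative Rate
        "specificity": specificity,  # True Negative Rate
        "fpr": 1 - specificity,      # False Positive Rate
        "precision": tp / (tp + fp),
        "prevalence": (tp + fn) / total,
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
    }


metrics = summarize(tp=100, fn=5, tn=50, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The function assumes every denominator is nonzero. A model that never predicts positive, for example, would need a guard around the precision calculation.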

Why do we need so many performance measures when accuracy can be sufficient?
If the dataset has 90% positives, then achieving 90% accuracy is easy by predicting only positives. Thus, accuracy is not a sufficient measure when the dataset is imbalanced. Accuracy also doesn't differentiate between Type I (False Positive) and Type II (False Negative) errors. This is where the confusion matrix gives us more useful measures with FPR and FNR; or their complementary measures, Specificity and Recall respectively.
Consider the multiclass problem of iris classification that has three classes: setosa, versicolor and virginica. In one example, the classifier has an accuracy of 84% (32/38), but this figure doesn't tell us where the errors are happening. With the confusion matrix, it's easy to see that only versicolor is wrongly classified. The matrix also shows that versicolor is misclassified as virginica and never as setosa. We can also see that recall is 62.5% (10/16) for versicolor.
In fact, when classes are not evenly represented in the data, raw counts by themselves don't give an adequate visual representation. For this reason, we use a normalized confusion matrix, in which each count is divided by the total for its actual class.
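The point about imbalanced data is easy to demonstrate. The sketch below builds a hypothetical dataset of 90 positives and 10 negatives, plus a trivial "model" that always predicts positive; both are assumptions made purely for illustration.

```python
# Hypothetical imbalanced dataset: 90 positives (1) and 10 negatives (0).
actual = [1] * 90 + [0] * 10
# A trivial "model" that predicts positive for every sample.
predicted = [1] * len(actual)

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)
tn = sum(a == 0 and p == 0 for a, p in pairs)
fp = sum(a == 0 and p == 1 for a, p in pairs)
fn = sum(a == 1 and p == 0 for a, p in pairs)

accuracy = (tp + tn) / len(actual)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} specificity={specificity:.2f}")
```

Accuracy looks respectable at 0.90, yet specificity is 0 because not a single negative is identified. Accuracy alone hides this.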
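One common way to normalize is to divide each count by the total of its row, so that every row of actual-class counts sums to 1. The sketch below applies this to an iris-like matrix. The versicolor row (10 correct, 6 mistaken for virginica) and the overall total of 38 follow the text; the individual setosa and virginica counts (13 and 9) are assumed for illustration.

```python
labels = ["setosa", "versicolor", "virginica"]

# Rows are actual classes, columns are predicted classes.
# Only the versicolor row and the total of 38 come from the text;
# the setosa (13) and virginica (9) counts are assumed.
matrix = [
    [13, 0, 0],
    [0, 10, 6],
    [0, 0, 9],
]


def normalize_rows(m):
    """Divide each cell by its row total (assumes no row sums to zero)."""
    return [[cell / sum(row) for cell in row] for row in m]


normalized = normalize_rows(matrix)
for label, row in zip(labels, normalized):
    print(label, [round(value, 3) for value in row])
```

After row normalization, the diagonal entry of each row is the recall for that class, which makes classes of different sizes directly comparable.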

What are other performance metrics for a classification/prediction problem?
F-measure takes a harmonic mean of Recall and Precision, (2*Recall*Precision)/(Recall+Precision). It's a value closer to the smaller of the two. Applying this to our earlier example, we get F-measure = (2*0.95*0.91)/(0.95+0.91) = 0.93
A commonly used graphical measure is the ROC Curve. It's generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as we vary the threshold for assigning observations to a given class.
How often will we be wrong if we always predict the majority class? Null Error Rate gives us a measure for this. It's a useful baseline when evaluating a model. In our example, null error rate would be 60/165 = 0.36. If the model always predicted positive, it would be wrong 36% of the time.
Cohen's Kappa can be applied to know how well a classifier is performing as opposed to classifying simply by chance. A high Kappa score implies accuracy differs a lot from null error rate.
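The F-measure formula can be checked with a short Python sketch. The helper name f_measure is hypothetical; the counts TP = 100, FP = 10 and FN = 5 are those of the earlier numerical example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Counts from the earlier numerical example.
tp, fp, fn = 100, 10, 5
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = f_measure(precision, recall)
print(f"F-measure: {f1:.2f}")

# The harmonic mean is pulled towards the smaller value:
# the arithmetic mean of 0.9 and 0.1 is 0.5, but the F-measure is much lower.
lopsided = f_measure(0.9, 0.1)
print(f"F-measure for 0.9 and 0.1: {lopsided:.2f}")
```

Working from unrounded precision and recall, the F-measure for the example is 200/215, which rounds to 0.93.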
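The points of an ROC curve can be computed by sweeping a threshold over the model's scores and recomputing TPR and FPR each time. This is a minimal sketch: the function name roc_points and the six scores and labels are invented for illustration, and a sample is treated as positive when its score is at or above the threshold.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, for binary labels (1 = positive)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    # Use each distinct score as a threshold, from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical model scores and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Plotting these pairs with FPR on the x-axis and TPR on the y-axis gives the curve. Each point corresponds to a different confusion matrix for the same model.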
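Both the null error rate and Cohen's Kappa can be computed from the confusion matrix itself. The sketch below uses the counts of the earlier example and the standard definition of Kappa, (observed agreement - expected agreement) / (1 - expected agreement). That formula is not spelled out in this article, and the function name cohen_kappa is hypothetical.

```python
def cohen_kappa(matrix):
    """Cohen's Kappa for a square confusion matrix (rows actual, columns predicted)."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the main diagonal (the accuracy).
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Expected agreement by chance, from the row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Counts from the earlier example: rows are actual [positive, negative],
# columns are predicted [positive, negative].
matrix = [
    [100, 5],
    [10, 50],
]
total = sum(sum(row) for row in matrix)

# Null error rate: error made by always predicting the majority actual class.
null_error_rate = 1 - max(sum(row) for row in matrix) / total
kappa = cohen_kappa(matrix)

print(f"null error rate: {null_error_rate:.2f}")
print(f"kappa: {kappa:.2f}")
```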

What's the procedure to make or use a Confusion Matrix?
We certainly need both the actual values and the predicted values. We can arrange the actual values by rows and the predicted values by columns, although some may swap the two. It's therefore important to read the arrangement of the matrix correctly. For each actual value, count the number of predicted values for each class. Fill these counts into the matrix.
There's no threshold for good accuracy, sensitivity or other measures. They should be interpreted in the context of the problem, domain and business.
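The counting procedure can be sketched in Python for any number of classes. The function name confusion_matrix_counts, the class names and the sample labels below are all hypothetical; rows hold actual classes and columns hold predicted classes.

```python
from collections import Counter


def confusion_matrix_counts(actual, predicted, labels):
    """Build a matrix with actual classes as rows and predicted classes as columns."""
    # Count how often each (actual, predicted) pair occurs.
    pair_counts = Counter(zip(actual, predicted))
    # A Counter returns 0 for pairs that never occur.
    return [[pair_counts[(a, p)] for p in labels] for a in labels]


# Hypothetical labels for seven samples.
labels = ["cat", "dog", "bird"]
actual = ["cat", "cat", "dog", "dog", "bird", "bird", "bird"]
predicted = ["cat", "dog", "dog", "dog", "bird", "cat", "bird"]

matrix = confusion_matrix_counts(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>5}: {row}")
```

Passing the label order explicitly fixes the row and column order, which avoids misreading the arrangement later.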

Could you mention some tools and techniques in relation to the Confusion Matrix?
In R, the package caret (Classification and Regression Training) can be used to get a confusion matrix with all relevant statistical information. The function is confusionMatrix(data=predicted, reference=expected). This arranges actuals (called reference) by columns and predictions by rows.
In Python, the package sklearn.metrics has an equivalent function, confusion_matrix(actual, predicted). This arranges actuals by rows and predictions by columns. Other related and useful functions are accuracy_score(actual, predicted) and classification_report(actual, predicted).
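A short usage sketch of the scikit-learn functions named above follows. It assumes scikit-learn is installed; the eight labels are made up for illustration.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical binary labels: 1 = positive, 0 = negative.
actual = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 1, 1, 0, 0]

# Rows are actual classes and columns are predicted classes,
# both in sorted label order (0 first, then 1).
matrix = confusion_matrix(actual, predicted)
print(matrix)

# For a binary problem, flattening the matrix yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = matrix.ravel()

accuracy = accuracy_score(actual, predicted)
print(f"accuracy: {accuracy:.2f}")

# Text summary with per-class precision, recall and F1 score.
report = classification_report(actual, predicted)
print(report)
```

Because labels are sorted, the negative class comes first here. The top-left cell is therefore TN rather than TP, which differs from layouts that list the positive class first.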
Milestones
1904: Mathematician Karl Pearson publishes a paper titled On the theory of contingency and its relation to association and normal correlation. Contingency and correlation between two variables can be seen as the genesis of the confusion matrix.
1971: James Townsend publishes a paper titled Theoretical analysis of an alphabetic confusion matrix. Uppercase English letters are shown to human participants who try to identify them. Letters are presented with or without introduced noise. The resulting confusion matrix is of size 26x26. With noise, Townsend finds that 'W' is misidentified as 'V' 37% of the time; 32% of 'Q' are misidentified as 'O'; 'H' is identified correctly only 19% of the time.
1998: The term Confusion Matrix becomes popular in the ML community when it appears in a glossary featured in Special Issue on Applications of Machine Learning and the Knowledge Discovery Process by Ron Kohavi and Foster Provost.
2011: In a paper titled Comparing Multiclass Classifiers: On the Similarity of Confusion Matrices for Predictive Toxicology Applications, researchers show how to compare predictive models based on their confusion matrices. For lower FNR, they propose regrouping performance measures of multiclass classifiers into a binary classification problem.
References
 Brownlee, Jason. 2016. "What is a Confusion Matrix in Machine Learning." Machine Learning Mastery, November 18. Accessed 2019-08-18.
 Caret. 2019. "confusionMatrix: Create a confusion matrix." Caret Docs, via rdrr, May 02. Accessed 2019-08-18.
 Idris, Awab. 2018. "Confusion Matrix." Medium, July 11. Accessed 2019-06-27.
 Kohavi, Ron and Foster Provost, eds. 1998. "Glossary of Terms." Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Machine Learning, vol. 30, pp. 271-274, Kluwer Academic Publishers. Accessed 2019-06-28.
 Krüger, Frank. 2016. "Activity, Context, and Plan Recognition with Computational Causal Behaviour Models." ResearchGate, December. Accessed 2019-08-20.
 Makhtar, Mokhairi and Daniel C. Neagu and Mick J. Ridley. 2011. "Comparing Multiclass Classifiers: On the Similarity of Confusion Matrices for Predictive Toxicology Applications." In: Yin H., Wang W., Rayward-Smith V. (eds), Intelligent Data Engineering and Automated Learning, Lecture Notes in Computer Science, vol. 6936, Springer, Berlin, Heidelberg. Accessed 2019-08-18.
 Markham, Kevin. 2014. "Simple guide to confusion matrix terminology." Data School, March 25. Accessed 2019-06-27.
 Narkhede, Sarang. 2018. "Understanding Confusion Matrix." Towards Data Science, via Medium, May 09. Accessed 2019-08-18.
 Parikh, R., A. Mathai, S. Parikh, G. Chandra Sekhar, and R. Thomas. 2008. "Understanding and using sensitivity, specificity and predictive values." Indian Journal of Ophthalmology, 56(1), 45–50, Jan-Feb. doi:10.4103/0301-4738.37595. Accessed 2019-08-20.
 Pearson, Karl. 1904. "On the theory of contingency and its relation to association and normal correlation." Drapers' Company Research Memoirs, Biometric Series I, Dept. of Applied Mathematics, University of London. Accessed 2019-06-28.
 Scikit-learn. 2019a. "sklearn.metrics.confusion_matrix." scikit-learn, v0.21.3, July 30. Accessed 2019-08-18.
 Scikit-learn. 2019b. "Confusion matrix." scikit-learn, v0.21.3, July 30. Accessed 2019-08-18.
 Scikit-learn. 2019c. "sklearn.metrics.classification_report." scikit-learn, v0.21.3, July 30. Accessed 2019-08-18.
 Sharma, Abhishek. 2017. "Confusion Matrix in Machine Learning." GeeksforGeeks, October 15. Updated 2018-02-07. Accessed 2019-08-18.
 Townsend, James. 1971. "Theoretical analysis of an alphabetic confusion matrix." Attention Perception & Psychophysics 9(1):40-50, via ResearchGate, January. Accessed 2019-06-28.
Further Reading
 Caret. 2019. "confusionMatrix: Create a confusion matrix." Caret Docs, via rdrr, May 02. Accessed 2019-08-18.
 Vanneti, Marco. 2007. "Confusion matrix online calculator." Accessed 2019-08-18.
 Mills, Peter. 2017. "Bayesian Learning for Statistical Classification." Stats and Bots, via Medium, September 26. Accessed 2019-08-18.
 Narkhede, Sarang. 2018. "Understanding Confusion Matrix." Towards Data Science, via Medium, May 09. Accessed 2019-08-18.
See Also
 Hypothesis Testing and Types of Errors
 ROC Curve
 Machine Learning
 Contingency Table
 Statistical Classification
 Statistical Inference