In many applications, there's a need to decide between two alternatives. In the military, radar operators look at an approaching object and decide if it's a threat. Doctors look at an image and decide if it shows a tumour. For facial recognition, an algorithm has to decide if two faces match. In Machine Learning, we call this binary classification while in radar we call it signal detection.
The decision depends on a threshold. Receiver Operating Characteristic (ROC) Curve is a graphical plot that helps us see the performance of a binary classifier or diagnostic test when the threshold is varied. Using the ROC Curve, we can select a threshold that best suits our application. The idea is to maximize correct classification or detection while minimizing false positives. ROC Curve is also useful when comparing alternative classifiers or diagnostic tests.
How do we define or plot the ROC Curve?
Let's take a binary classification problem that has two distributions: one for positives and one for negatives. To classify subjects into one of these two classes, we select a threshold. Anything above the threshold is classified as positive. The accuracy of the classifier depends directly on the threshold we use. ROC Curve is plotted by varying the thresholds and recording the classifier's performance at each threshold.
ROC curve plots True Positive Rate (TPR) versus False Positive Rate (FPR). TPR is also called recall or sensitivity. TPR is the probability that we detect a signal when it's present. FPR is the complement of specificity: (1-specificity). FPR is the probability that we detect a signal when it's not present. Being based on only recall and specificity, ROC curve is independent of prevalence, that is, how common the condition is in the population.
An ideal classifier will have an ROC curve that rises sharply from the origin, so that TPR is already high before FPR starts to rise. Each point on the ROC curve represents the performance of the classifier at one threshold value.
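The threshold sweep described above can be sketched in a few lines of Python. This is only an illustration: the function name roc_points, the labels and the scores are made up, and a score equal to the threshold is assumed to count as positive.

```python
def roc_points(labels, scores):
    """Return one (FPR, TPR) pair per threshold, from (0, 0) to (1, 1)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Start above the highest score so that nothing is classified positive.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        # Assumption: a score equal to the threshold is classified positive.
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical data: 1 marks a positive subject, 0 a negative one.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for fpr, tpr in roc_points(labels, scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each printed pair is one point on the ROC curve. Lowering the threshold can only move the point up or to the right.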
Which application domains are using ROC Curves?
ROC started in radar applications. It was later applied in many other domains including psychology, medicine, radiology, biometrics, and meteorology. More recently, it's being used in machine learning and data mining.
In medical practice, it's used for assessing diagnostic biomarkers, imaging tests or even risk assessment. It's been used to analyse information processing in the brain during sensory difference testing.
ROC has been used to describe the performance of instruments built to detect explosives. In engineering, it's been used to evaluate the accuracy of pipeline reliability analysis and predict the failure threshold value.
What is AUC and its significance?
The area under a plotted ROC Curve is called Area Under the ROC Curve (AUC), Area Under the Curve (AUC), or AUROC. It's been said that "ROC is a probability curve and AUC represents degree or measure of separability". In other words, AUC is a single metric that can be used to quantify how well two classes are separated by a binary classifier. It's also useful when comparing different classifiers.
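Given a set of ROC points, one common way to estimate the area is the trapezoidal rule. The sketch below assumes the points include the end points (0, 0) and (1, 1). The function name auc_trapezoid and the sample points are hypothetical.

```python
def auc_trapezoid(points):
    """Area under the piecewise-linear curve through (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Each pair of neighbouring points contributes one trapezoid.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points for some classifier.
roc = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc_trapezoid(roc))       # 0.75

# The diagonal, which corresponds to random guessing.
diagonal = [(0.0, 0.0), (1.0, 1.0)]
print(auc_trapezoid(diagonal))  # 0.5
```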
AUC has some useful properties. It's scale-invariant. This means it tells how well predictions are ranked rather than their absolute values. AUC is also classification-threshold-invariant. We can objectively compare prediction models irrespective of classification thresholds used. However, these properties are not desirable for some applications.
AUC is also prevalence-invariant. Suppose a health condition is prevalent in only 1% of the population. A simple classifier can achieve 99% accuracy by predicting negative always. AUC however gives a more useful value of 0.5.
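These properties can be checked with a small sketch. It relies on the known equivalence between AUC and the probability that a randomly chosen positive is scored higher than a randomly chosen negative, with ties counted as half. The function name auc_rank and all data are made up for illustration.

```python
import math


def auc_rank(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical population with 1% prevalence and a classifier that
# gives everyone the same score, that is, it always predicts negative.
labels = [1] * 10 + [0] * 990
always_negative = [0.0] * 1000
accuracy = sum(1 for y in labels if y == 0) / len(labels)
print(accuracy)                            # 0.99
print(auc_rank(labels, always_negative))   # 0.5

# Scale invariance: a monotonic rescaling of the scores keeps the ranking,
# so the AUC does not change.
labels2 = [1, 1, 0, 1, 0, 0]
scores2 = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
rescaled = [math.exp(10 * s) for s in scores2]
print(auc_rank(labels2, scores2) == auc_rank(labels2, rescaled))  # True
```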
How do I interpret an AUC value?
In theory, AUC can take any value in [0,1]. More realistically, AUC has a range [0.5,1] since the ROC curve is expected to be above the diagonal. A value of 0.5 implies very poor separation and is represented by the diagonal ROC curve. A value of 1 implies perfect separation, where TPR is 1 at all values of FPR. As a rule of thumb, we have an excellent classifier if AUC is >= 0.9 and a good classifier if it's >= 0.8.
Why do I need an ROC Curve when TPR and FPR may be adequate?
ROC Curve is a useful tool to compare classification methods and decide which one is better. Suppose a computer algorithm is implemented to diagnose a medical condition. Using ROC curves, we can compare its performance against a doctor's diagnosis, and against a doctor's diagnosis when aided with computer-assisted detection (CAD). As shown in the figure, a doctor using CAD gives the best performance. The other two approaches have the same AUC but the doctor has a higher specificity (lower FPR).
In any binary classification problem, it's not possible to agree on a single threshold and consequently on values of sensitivity and specificity. Take the case of diagnostic testing as an example. Threshold would be adjusted based on the context and available information, such as patient history, presence of symptoms, or even likelihood of getting sued for a missed cancer. If we just plot two points for two classifiers, it's hard to know which one is better. Once we plot entire ROC curves, it's easy to see which one is better.
For a binary classification problem, how do I select the optimum threshold on the ROC Curve?
Two commonly used criteria are:
- Minimum-d: Here d is the distance from a point on the curve to the top-left corner or (0,1) point. The optimum point is the one with the shortest distance.
- Youden index: This is the vertical distance from the curve to the diagonal, which equals TPR minus FPR. To find the optimum point on the curve, we should maximize the Youden index.
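Both criteria are easy to compute once the operating points are known. The sketch below uses made-up (threshold, FPR, TPR) triples, and the helper names are hypothetical. It also shows that the two criteria need not pick the same threshold.

```python
import math

# Hypothetical operating points: (threshold, FPR, TPR).
points = [
    (0.9, 0.00, 0.30),
    (0.7, 0.02, 0.70),
    (0.5, 0.20, 0.85),
    (0.3, 0.50, 0.95),
    (0.1, 1.00, 1.00),
]


def youden(fpr, tpr):
    """Vertical distance from the ROC point to the diagonal."""
    return tpr - fpr


def distance_to_corner(fpr, tpr):
    """Euclidean distance from the ROC point to the top-left corner (0, 1)."""
    return math.hypot(fpr, 1.0 - tpr)


best_youden = max(points, key=lambda p: youden(p[1], p[2]))
best_min_d = min(points, key=lambda p: distance_to_corner(p[1], p[2]))

print("Youden index picks threshold", best_youden[0])  # 0.7
print("Minimum-d picks threshold", best_min_d[0])      # 0.5
```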
ROC Curve and AUC ignore prevalence and misclassification costs. Yet the two kinds of error rarely cost the same. For example, poor sensitivity means missed cancer and delayed treatment whereas poor specificity means unnecessary treatment. Likewise, a false positive on a blood test for HIV simply means a discarded blood sample but a false negative will infect the blood recipient. For this reason, decision makers should consider financial costs and combine ROC analysis with utility-based decision theory to find the optimum threshold.
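One simple way to bring costs and prevalence into the choice of threshold is to minimize the expected misclassification cost per subject. The sketch below is an assumption-laden illustration, not a full utility analysis: the operating points, prevalence and cost values are all made up, and the helper names are hypothetical.

```python
# Hypothetical operating points: (threshold, FPR, TPR).
points = [
    (0.9, 0.00, 0.30),
    (0.7, 0.02, 0.70),
    (0.5, 0.20, 0.85),
    (0.3, 0.50, 0.95),
    (0.1, 1.00, 1.00),
]


def expected_cost(fpr, tpr, prevalence, cost_fn, cost_fp):
    """Average misclassification cost per subject at one operating point."""
    miss_rate = 1.0 - tpr  # false negative rate
    return prevalence * cost_fn * miss_rate + (1.0 - prevalence) * cost_fp * fpr


def cheapest(prevalence, cost_fn, cost_fp):
    """Operating point with the lowest expected cost."""
    return min(
        points,
        key=lambda p: expected_cost(p[1], p[2], prevalence, cost_fn, cost_fp),
    )


# Assumption: 1% prevalence, and a miss costs 100 times as much as a false alarm.
print(cheapest(0.01, 100.0, 1.0)[0])  # 0.7
# With equal costs, the rare condition makes a stricter threshold cheaper.
print(cheapest(0.01, 1.0, 1.0)[0])    # 0.9
```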
How do I apply ROC Curves to multiclass problems?
Given \(c\) classes, the ROC space has \(c(c-1)\) dimensions. This makes it difficult to apply ROC Curve methodology to multiclass problems. However, some attempt has been made to apply it to 3 classes where AUC concept is extended to Volume Under the ROC Surface (VUS).
One approach is to reframe the problem into \(c\) one-vs-all binary classifiers. However, ROC Curve may not be suitable since FPR will be underestimated due to the large number of negative data points. For this reason, the Precision vs. Recall curve is more suitable.
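The one-vs-all reframing can be sketched as follows: for each class, samples of that class are treated as positives and all others as negatives, and a binary AUC is computed from that class's scores. The function names and data are hypothetical, and AUC is computed with the pairwise ranking definition.

```python
def auc_rank(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(true_classes, class_scores):
    """class_scores maps each class to its list of per-sample scores."""
    result = {}
    for cls, scores in class_scores.items():
        # Relabel: the current class is positive, every other class is negative.
        binary = [1 if t == cls else 0 for t in true_classes]
        result[cls] = auc_rank(binary, scores)
    return result


# Hypothetical three-class problem with six samples.
true_classes = ["a", "a", "b", "b", "c", "c"]
class_scores = {
    "a": [0.8, 0.6, 0.3, 0.1, 0.2, 0.1],
    "b": [0.1, 0.3, 0.5, 0.2, 0.3, 0.1],
    "c": [0.1, 0.1, 0.2, 0.7, 0.5, 0.8],
}
print(one_vs_rest_auc(true_classes, class_scores))
# {'a': 1.0, 'b': 0.75, 'c': 0.875}
```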
What are some pitfalls or drawbacks of using ROC Curve and AUC?
In practice, AUC must be presented with a confidence interval, such as 95% CI, since it's estimated from a population sample. However, one study in clinical chemistry showed that many researchers failed to include CI or constructed them incorrectly.
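One way to obtain such an interval is a percentile bootstrap: resample the subjects with replacement many times, recompute AUC each time, and take the middle 95% of the estimates. The sketch below is a minimal illustration of that technique, not the method used by any particular package. The function names, data and default parameters are assumptions.

```python
import random


def auc_rank(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for AUC."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Skip resamples that contain only one class: AUC is undefined there.
        if 0 < sum(ys) < n:
            estimates.append(auc_rank(ys, [scores[i] for i in idx]))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


# Hypothetical sample of 14 subjects.
labels = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.5, 0.45, 0.4, 0.3, 0.2, 0.1]
print("AUC:", auc_rank(labels, scores))
print("95% CI:", bootstrap_auc_ci(labels, scores, n_resamples=500))
```

With so few subjects the interval is wide, which is exactly the information a bare AUC value hides.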
AUC involves loss of information. Two ROC curves crossing each other can have the same AUC but each will have a range of thresholds at which it's better. Clinicians and patients interpret sensitivity and specificity but don't find AUC useful. They're not interested in performance across all thresholds. In ML, cost curves have been proposed as an alternative. Another alternative is H-measure.
AUC ignores the misclassification costs. A new test may be deemed worthless by using AUC alone. AUC also ignores prevalence but it's known that prevalence affects test results. While sensitivity and specificity are also independent of prevalence, prevalence can be considered during interpretation of the ROC curve.
Jorge M. Lobo et al. give many other reasons why AUC is not a suitable measure.
What software packages are available for ROC analysis?
In R language, we can use the pROC package. Once we obtain the actual and predicted values, we can obtain the AUC along with confidence interval using the function ci.auc(). On GitHub, sachsmc/plotROC is an open source package for easily plotting ROC curves. It uses ggplot2, to which it adds handy functions for plotting.
In Python, a webpage on Scikit-learn gives code examples showing how to plot ROC curves and compute AUC for both binary and multiclass problems. It makes use of functions, including auc, that are part of the sklearn.metrics package.
The idea of ROC starts in the 1940s with the use of radar during World War II. The task is to identify enemy aircraft while avoiding false detection of benign objects. ROC provides a suitable threshold for radar receiver operators. This also explains the origin of the term Receiver Operating Characteristic (ROC).
Peterson and Birdsall explain the ROC Curve in detail in the context of signal detection theory. They plot probability of signal detection versus probability of false alarm. Curve-1 represents optimum operation, curve-3 sets the lower limit, and curve-2 is by guessing. Curve-1 is produced by varying the operating level or threshold β, which is also the slope of the curve at that point. The value on the y-intercept is the one that needs to be maximized.
L.B. Lusted applies ROC methodology to compare different studies of chest film interpretations for detection of pulmonary tuberculosis. This is the first application of ROC to radiology. It subsequently inspires the use of ROC in many diagnostic imaging systems. Lusted himself publishes Decision-making studies in patient management in 1971.
The concept of Free-Response Receiver Operating Characteristic (FROC) Curve is introduced in the auditory domain. In free-response analysis, in addition to detection, we also need to point out the location. The term "free-response" was coined in 1961. In 1978, FROC is applied for the first time in imaging. FROC can help where ROC can fail. For example, in ROC analysis a location-level false positive and a location-level false negative could "cancel" each other. This gives an image-level true positive: the image does show cancer but the wrong location is reported.
Andrew Bradley notes that ROC curve is useful for visualizing a classifier's performance but not suitable for comparing multiple classification methods. A single performance measure is more desirable. He discusses how AUC can be used as a measure for comparing machine learning algorithms. He explains why AUC is a better measure than overall accuracy.
An article titled Better decisions through science appears in Scientific American. It brings ROC Curve to the attention of a wider audience. One example in this article talks about glaucoma diagnosis using eye fluid pressure. It defines the basic terms and shows hypothetical distribution curves, ROC Curves, and AUC. It states that AUC is a reflection of a test's accuracy.
- Berrar, Daniel, and Peter Flach. 2012. "Caveats and pitfalls of ROC analysis in clinical microarray research (and how to avoid them)." Briefings in Bioinformatics, vol. 13, no. 1, pp. 83–97, January. Accessed 2019-07-23.
- Bradley, Andrew P. 1997. "The use of the area under the ROC curve in the evaluation of machine learning algorithms." Pattern Recognition, vol. 30, no. 7, pp. 1145-1159, Elsevier Science Ltd. Accessed 2019-08-20.
- Chakraborty, D. P. 2013. "A brief history of free-response receiver operating characteristic paradigm data analysis." Academic Radiology, 20(7), 915–919, July. doi:10.1016/j.acra.2013.03.001. Accessed 2019-08-20.
- Döring, Matthias. 2018. "Performance Measures for Multi-Class Problems." Data Science Blog, December 04. Accessed 2019-07-23.
- Ekelund, Suzanne. 2011. "ROC curves – what are they and how are they used?" Acute Care Testing, January. Accessed 2019-07-23.
- Fawcett, Tom. 2006. "Introduction to ROC analysis." Pattern Recognition Letters, 27(8):861-874, June. Accessed 2019-08-20.
- Google Developers. 2019. "Classification: ROC Curve and AUC." Crash Course, Machine Learning, Google, March 05. Accessed 2019-07-23.
- Hajian-Tilaki, Karimollah. 2013. "Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation." Caspian Journal of Internal Medicine, vol. 4, no. 2, pp. 627-35. Accessed 2019-07-23.
- Halligan, Steve, Douglas G. Altman, and Susan Mallett. 2015. "Disadvantages of using the area under the receiver operating characteristic curve to assess imaging tests: A discussion and proposal for an alternative approach." European Radiology, vol. 25, no. 4, pp. 932–939. Accessed 2019-07-23.
- Hand, David J. and Robert J. Till. 2001. "A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems." Machine Learning, vol. 45, pp. 171–186, Kluwer Academic Publishers. Accessed 2019-08-20.
- Hanley, James A., and Barbara J. McNeil. 1982. "The meaning and use of the area under a receiver operating characteristic (ROC) curve." Radiology, vol. 143, no. 1, pp. 29-36, April. Accessed 2019-08-20.
- Joy, Janet E, Edward E Penhoet, and Diana B Petitti, eds. 2005. "ROC Analysis: Key Statistical Tool for Evaluating Detection Technologies." Appendix C in Saving Women's Lives: Strategies for Improving Breast Cancer Detection and Diagnosis, National Academies Press. Accessed 2019-08-20.
- Kumar, Rajeev and Abhaya Indrayan. 2011. "Receiver Operating Characteristic (ROC) Curve for Medical Researchers." Indian Pediatrics, vol. 48, pp. 277-287, April 17. Accessed 2019-08-20.
- Landgrebe, Thomas C.W. and Robert P.W. Duin. 2007. "Approximating the multiclass ROC by pairwise analysis." Pattern Recognition Letters, vol. 28, pp. 1747–1758, Elsevier. Accessed 2019-08-20.
- Li, Bai. 2018. "Useful properties of ROC curves, AUC scoring, and Gini Coefficients." Lucky's Notes, April 04. Accessed 2019-08-22.
- Narkhede, Sarang. 2018. "Understanding AUC - ROC Curve." Towards Data Science, via Medium, June 26. Accessed 2019-07-23.
- Obuchowski, Nancy A., Michael L. Lieber, and Frank H. Wians. 2004. "ROC Curves in Clinical Chemistry: Uses, Misuses, and Possible Solutions." Clinical Chemistry, vol. 50, no. 7, pp. 1118-1125, June 30. Accessed 2019-08-22.
- Park, Seong Ho, Jin Mo Goo, and Chan-Hee Jo. 2004. "Receiver Operating Characteristic (ROC) Curve: Practical Review for Radiologists." Korean J Radiol., vol. 5, no. 1, pp. 11–18, Jan-Mar. Accessed 2019-08-22.
- Peterson, William Wesley, and Theodore G. Birdsall. 1953. "The theory of signal detectability." TR No. 13, Engineering Research Institute, Univ. of Michigan, June. Accessed 2019-08-20.
- Rickert, Joseph. 2019. "ROC Curves." R Views, RStudio, January 17. Accessed 2019-08-22.
- Sachs, Michael C. 2018. "Generate ROC Curve Charts for Print and Interactive Use." Via GitHub IO, June 01. Accessed 2019-08-22.
- Scikit-learn Docs. 2019. "Receiver Operating Characteristic (ROC)." Scikit-learn v0.21.3, July 30. Accessed 2019-08-22.
- Sonego, Paolo, András Kocsor, and Sándor Pongor. 2008. "ROC analysis: applications to the classification of biological sequences and 3D structures." Briefings in Bioinformatics, vol. 9, no. 3, pp. 198–209, May. Accessed 2019-08-20.
- Streiner, David L., and John Cairney. 2007. "What’s Under the ROC? An Introduction to Receiver Operating Characteristics Curves." The Canadian Journal of Psychiatry, vol. 52, no. 2, pp. 121-128. February. Accessed 2019-08-22.
- Swets, J.A., R.M. Dawes, and J. Monahan. 2000. "Better decisions through science." Scientific American, pp. 82–87, October. Accessed 2019-08-20.
- Tape, Thomas G. 2019a. "The Area Under an ROC Curve." Interpreting Diagnostic Tests, University of Nebraska Medical Center. Accessed 2019-07-23.
- Tape, Thomas G. 2019b. "Plotting and Intrepretating an ROC Curve." Interpreting Diagnostic Tests, University of Nebraska Medical Center. Accessed 2019-07-23.
- Tee, Kong Fah, Lutfor Rahman Khan, and Tahani Coolen-Maturi. 2015. "Application of receiver operating characteristic curve for pipeline reliability analysis." Proc IMechE Part O: Journal of Risk and Reliability, December. Accessed 2019-08-20.
- Treadway, Andrew. 2019. "How to get an AUC confidence interval." Open Source Automation, August 20. Accessed 2019-08-22.
- Wichchukit, Sukanya and Michael O'Mahony. 2010. "A transfer of technology from engineering: use of ROC curves from signal detection theory to investigate information processing in the brain during sensory difference testing." J Food Sci., 75(9):R183-93, Nov-Dec. Accessed 2019-08-20.
- Wikimedia Commons. 2015. "File:ROC curves colors.svg." Wikimedia Commons, December 05. Accessed 2019-08-20.
- Wikipedia. 2019. "Receiver operating characteristic." Wikipedia, August 17. Accessed 2019-08-21.
- Young, M., Wen Fan, Anna Raeva, and Jose Almirall. 2013. "Application of Receiver Operating Characteristic (ROC) Curves for Explosives Detection Using Different Sampling and Detection Techniques." Sensors (Basel, Switzerland), 13(12), 16867–16881. doi:10.3390/s131216867. Accessed 2019-08-20.
- Turner, David A. 1978. "An Intuitive Approach to Receiver Operating Characteristic Curve Analysis." J Nucl Med, vol. 19, no. 2, pp. 213-220. Accessed 2019-08-22.
- Bohne, Julien. 2018. "Beyond the ROC AUC: Toward Defining Better Performance Metrics." BCG Gamma, via Medium, November 01. Accessed 2019-08-22.
- Lobo, Jorge M., Alberto Jiménez-Valverde, and Raimundo Real. 2007. "AUC: a misleading measure of the performance of predictive distribution models." Global Ecology and Biogeography, Blackwell Publishing Ltd. Accessed 2019-07-23.