Hypothesis Testing and Types of Errors
Suppose we want to study income of a population. We study a sample from the population and draw conclusions. The sample should represent the population for our study to be a reliable one.
The null hypothesis \((H_0)\) is that the sample represents the population. Hypothesis testing provides a framework for deciding whether we have sufficient evidence to reject the null hypothesis or to retain it (often loosely called accepting it).
Population characteristics are either assumed or drawn from third-party sources or judgements by subject matter experts. Population data and sample data are each characterised by the moments of their distributions (mean, variance, skewness and kurtosis). We test the null hypothesis for equality of moments where the population characteristic is available, and conclude whether the sample represents the population.
For example, given only mean income of population, we validate if mean income of sample is close to population mean to conclude if sample represents the population.
What are the math representations of population and sample parameters?
Population mean and population variance are denoted by the Greek letters \(\mu\) and \(\sigma^2\) respectively, while sample mean and sample variance are denoted by the Latin letters \(\bar x\) and \(s^2\) respectively.
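As a small illustration of the notation, the sketch below computes \(\bar x\) and \(s^2\) for a made-up sample using Python's standard library. The income figures are hypothetical and chosen only for the example.

```python
import statistics

# Hypothetical incomes (in thousands); illustrative values, not real data.
sample = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8]

x_bar = statistics.mean(sample)          # sample mean, x-bar
s2 = statistics.variance(sample)         # sample variance s^2 (divides by n - 1)
pop_var = statistics.pvariance(sample)   # variance if these values were the whole population (divides by n)

print(f"x_bar = {x_bar:.2f}, s^2 = {s2:.2f}, population-style variance = {pop_var:.2f}")
```

The sample variance divides by n - 1 rather than n, so it is always slightly larger than the population-style variance computed from the same values.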
What's the relevance of sampling error to hypothesis testing?
Suppose we obtain a sample mean \(\bar x\) from a population with mean \(\mu\). The two are related by \(|\bar x - \mu| \ge 0\):
- If the difference is not significant, we conclude the difference is due to sampling. This is called sampling error and this happens due to chance.
- If the difference is significant, we conclude the sample does not represent the population. The reason has to be more than chance for difference to be explained.
Hypothesis testing helps us to conclude if the difference is due to sampling error or due to reasons beyond sampling error.
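One common way to judge whether a difference is "significant" is to express it in units of the standard error of the mean. The sketch below assumes a known population standard deviation; the function name `z_score` and all numbers are hypothetical.

```python
from math import sqrt

def z_score(x_bar, mu, sigma, n):
    """Difference between sample and population mean, in standard-error units."""
    standard_error = sigma / sqrt(n)
    return (x_bar - mu) / standard_error

# The same 2-unit difference, judged with a small and a large sample.
z_small_sample = z_score(x_bar=52.0, mu=50.0, sigma=10.0, n=10)
z_large_sample = z_score(x_bar=52.0, mu=50.0, sigma=10.0, n=400)

print(round(z_small_sample, 3), round(z_large_sample, 3))
```

With 10 observations the difference is well within one standard error and is plausibly sampling error. With 400 observations the same difference is four standard errors away and is hard to attribute to chance alone.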
What are some assumptions behind hypothesis testing?
A common assumption is that the observations are independent and come from a random sample. Either the population distribution must be Normal or the sample size must be large enough. If the sample size is large enough, we can invoke the Central Limit Theorem (CLT) regardless of the underlying population distribution. Due to the CLT, the sampling distribution of the sample statistic (such as the sample mean) will be approximately Normal.
A rule of thumb is 30 observations but in some cases even 10 observations may be sufficient to invoke the CLT. Others require at least 50 observations.
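The effect of sample size on the CLT can be explored with a quick simulation. This sketch assumes a skewed Exponential(1) population (mean 1, standard deviation 1), chosen purely for illustration, and tracks how the distribution of sample means changes as n grows.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

def sample_means(n, trials=5000):
    """Means of `trials` samples of size n drawn from an Exponential(1) population."""
    return [statistics.fmean([random.expovariate(1.0) for _ in range(n)])
            for _ in range(trials)]

def skewness(values):
    """Moment-based skewness; near 0 for a symmetric shape such as the Normal."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])

results = {}
for n in (2, 10, 30):
    means = sample_means(n)
    # (mean of sample means, spread of sample means, skewness of sample means)
    results[n] = (statistics.fmean(means), statistics.stdev(means), skewness(means))
    print(n, [round(r, 3) for r in results[n]])
```

As n grows, the spread of the sample means should shrink towards \(\sigma/\sqrt{n}\) and the skewness should move towards zero, which is the approach to Normality that the CLT describes.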
What are one-tailed and two-tailed tests?
When acceptance of \(H_0\) involves boundaries on both sides, we invoke the two-tailed test. For example, if we define \(H_0\) as sample drawn from population with age limits in the range of 25 to 35, then testing of \(H_0\) involves limits on both sides.
Suppose instead that we define the population as those older than 50. We are interested in rejecting a sample if the age is less than or equal to 50; we are not concerned about any upper limit. Here we invoke the one-tailed test. A one-tailed test could be left-tailed or right-tailed.
Consider the average gas price in California compared to the national average of $2.62. If we believe that the price is higher in California, we use a right-tailed test. If we believe that the California price is different from the national average but we don't know if it's higher or lower, we use a two-tailed test. Symbolically, given the alternative or research hypothesis \(H_1\), we state:
- \(H_0\): \(\mu = \$ 2.62\)
- \(H_1\) right-tailed: \(\mu > \$ 2.62\)
- \(H_1\) two-tailed: \(\mu \neq \$ 2.62\)
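To see how the choice of tail changes the outcome, the following sketch runs a z-test on the gas price example. The sample mean, standard deviation and sample size are assumptions made up for illustration; only the national average of $2.62 comes from the text.

```python
from math import sqrt
from statistics import NormalDist

MU_0 = 2.62                        # national average, from the example above
x_bar, sigma, n = 2.70, 0.30, 50   # hypothetical California sample: mean, known std dev, size

z = (x_bar - MU_0) / (sigma / sqrt(n))
std_normal = NormalDist()          # standard Normal: mean 0, std dev 1

p_right = 1 - std_normal.cdf(z)            # H1: mu > 2.62
p_two = 2 * (1 - std_normal.cdf(abs(z)))   # H1: mu != 2.62

print(f"z = {z:.3f}, right-tailed p = {p_right:.4f}, two-tailed p = {p_two:.4f}")
```

With these made-up numbers the right-tailed p-value falls below 0.05 while the two-tailed p-value does not, since the two-tailed test counts deviations in both directions.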
What are the types of errors in hypothesis testing?
In concluding whether sample represents population, there is scope for committing errors on following counts:
- Not accepting that sample represents population when in reality it does. This is called type-I or \(\alpha\) error.
- Accepting that sample represents population when in reality it does not. This is called type-II or \(\beta\) error.
For instance, if the null hypothesis is that an applicant is not creditworthy, granting a loan to an applicant with a low credit score is an \(\alpha\) error, while not granting a loan to an applicant with a high credit score is a \(\beta\) error.
The symbols \(\alpha\) and \(\beta\) are used to represent the probability of type-I and type-II errors respectively.
How do we measure type-I or \(\alpha\) error?
The p-value can be interpreted as the probability of getting a result that's same or more extreme when the null hypothesis is true.
The observed sample mean \(\bar x\) is overlaid on the distribution expected under the null hypothesis, which is centred on \(\mu\) with spread governed by \(\sigma^2\) (for the mean of \(n\) observations, the variance is \(\sigma^2/n\)). The proportion of values beyond \(\bar x\) and away from \(\mu\) (in the left tail, the right tail, or both) is the p-value. If p-value <= \(\alpha\), we reject the null hypothesis. The results are then said to be statistically significant, that is, unlikely to be due to chance alone.
Assuming \(\alpha\)=0.05, if p-value > 5% we conclude that the sample is consistent with being drawn from a population with mean \(\mu\) and variance \(\sigma^2\), and we fail to reject \(H_0\) (loosely, we accept it). Otherwise, the evidence suggests the sample is not part of that population and we reject \(H_0\).
We preselect \(\alpha\) based on how much type-I error we're willing to tolerate. \(\alpha\) is called level of significance. The standard for level of significance is 0.05 but in some studies it may be 0.01 or 0.1. In the case of two-tailed tests, it's \(\alpha/2\) on either side.
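The meaning of \(\alpha\) as a tolerated type-I error rate can be checked by simulation. The sketch below repeatedly draws samples from a population where \(H_0\) is true by construction and counts how often a two-tailed z-test rejects it. The population parameters are hypothetical.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed for repeatability
MU, SIGMA, N, ALPHA = 50.0, 10.0, 30, 0.05   # hypothetical population and test settings
std_normal = NormalDist()

def two_tailed_p(sample):
    """Two-tailed z-test p-value against the known population mean MU."""
    z = (fmean(sample) - MU) / (SIGMA / N ** 0.5)
    return 2 * (1 - std_normal.cdf(abs(z)))

trials = 10_000
# Every sample really does come from the population, so each rejection is a type-I error.
rejections = sum(
    two_tailed_p([random.gauss(MU, SIGMA) for _ in range(N)]) <= ALPHA
    for _ in range(trials)
)
type1_rate = rejections / trials
print(f"observed type-I error rate: {type1_rate:.3f}")
```

The observed rejection rate should land close to the chosen \(\alpha\), which is what it means to preselect the level of significance.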
How do we determine sample size and confidence interval for sample estimate?
The Law of Large Numbers suggests that the larger the sample size, the more accurate the estimate. Accuracy here means that the variance of the estimate tends towards zero as sample size increases. Sample size can be chosen to suit an accepted level of tolerance for deviation.
The confidence interval of the sample mean is obtained by offsetting the sample mean by a margin of error on either side, where the margin is a multiple of the standard error of the mean. If the population variance is known, we conduct a z-test based on the Normal distribution. Otherwise, the variance has to be estimated from the sample and we use a t-test based on the t-distribution.
The formulae for determining sample size and confidence interval depend on what we want to estimate (mean, variance or others), the sampling distribution of the estimate, and the standard deviation of the estimate's sampling distribution.
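For the specific case of estimating a mean with known population standard deviation, the standard formulae are \(\bar x \pm z_{\alpha/2}\,\sigma/\sqrt{n}\) for the interval and \(n = (z_{\alpha/2}\,\sigma/E)^2\) for the sample size needed to reach a margin of error \(E\). A minimal sketch follows; the function names and input values are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population std dev is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

def sample_size_for_margin(sigma, margin, confidence=0.95):
    """Smallest n whose margin of error does not exceed `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / margin) ** 2)

n_needed = sample_size_for_margin(sigma=15.0, margin=2.0)
low, high = z_interval(x_bar=100.0, sigma=15.0, n=n_needed)
print(n_needed, round(low, 2), round(high, 2))
```

When the variance must be estimated from the sample, the same structure applies with a t-distribution quantile in place of z. That needs a t-distribution implementation, which the Python standard library does not provide.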
How do we measure type-II or \(\beta\) error?
We overlay the sample mean's distribution on the population distribution. The proportion of the sampling estimate's distribution that overlaps the population distribution, falling within the region where \(H_0\) is not rejected, is the \(\beta\) error.
The larger the overlap, the larger the chance that the sample does belong to the population with mean \(\mu\) and variance \(\sigma^2\). Incidentally, despite the overlap, the p-value may be less than 5%. This happens when the sample mean is far from the population mean, but the variance of the sample mean is large enough that the overlap is still substantial.
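For a concrete case, \(\beta\) can be computed for a right-tailed z-test once a specific alternative mean is assumed. In the sketch below, the null mean, the alternative mean, the standard deviation and the sample size are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist

mu_0, mu_1 = 50.0, 53.0               # mean under H0 and an assumed true mean under H1
sigma, n, alpha = 10.0, 25, 0.05
se = sigma / sqrt(n)                  # standard error of the sample mean

# Sample means above this value lead to rejecting H0 in a right-tailed test.
critical = NormalDist(mu_0, se).inv_cdf(1 - alpha)

# beta: probability that the sample mean stays below the critical value
# even though the true mean is mu_1 (H0 is false but not rejected).
beta = NormalDist(mu_1, se).cdf(critical)
power = 1 - beta

print(f"critical value = {critical:.2f}, beta = {beta:.3f}, power = {power:.3f}")
```

Here \(\beta\) is the area of the alternative's sampling distribution that lies inside the non-rejection region of the null distribution, which is the overlap described above.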
How do we control \(\alpha\) and \(\beta\) errors?
Errors \(\alpha\) and \(\beta\) are dependent on each other. For a fixed sample size, increasing one decreases the other. Choosing suitable values for these depends on the cost of making these errors. Perhaps it's worse to convict an innocent person (type-I error) than to acquit a guilty person (type-II error), in which case we choose a lower \(\alpha\). But it's possible to decrease both errors by collecting more data.
Just as p-value manifests \(\alpha\), Power of Test manifests \(\beta\). Power of test is \(1-\beta\). Among the various ways to interpret power are:
- Probability of rejecting the null hypothesis when, in fact, it is false.
- Probability that a test of significance will pick up on an effect that is present.
- Probability of avoiding a Type II error.
Low p-value and high power help us decisively conclude sample doesn't belong to population. When we cannot conclude decisively, it's advisable to go for larger samples and multiple samples.
In fact, power is increased by increasing the sample size, the effect size or the significance level. Variance also affects power: lower variance gives higher power.
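These dependencies can be made concrete with the power formula for a right-tailed z-test, \(1-\beta = 1 - \Phi(z_{1-\alpha} - \delta\sqrt{n}/\sigma)\), where \(\delta\) is the true difference in means. The sketch below is one possible illustration; the function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed(effect, sigma, n, alpha=0.05):
    """Power of a right-tailed z-test to detect a true mean shift of `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(z_crit - effect * sqrt(n) / sigma)

base = power_right_tailed(effect=3.0, sigma=10.0, n=25)
more_data = power_right_tailed(effect=3.0, sigma=10.0, n=100)
bigger_effect = power_right_tailed(effect=6.0, sigma=10.0, n=25)
looser_alpha = power_right_tailed(effect=3.0, sigma=10.0, n=25, alpha=0.10)
stricter_alpha = power_right_tailed(effect=3.0, sigma=10.0, n=25, alpha=0.01)
less_noise = power_right_tailed(effect=3.0, sigma=5.0, n=25)

for name, value in [("base", base), ("more data", more_data),
                    ("bigger effect", bigger_effect), ("alpha = 0.10", looser_alpha),
                    ("alpha = 0.01", stricter_alpha), ("lower variance", less_noise)]:
    print(f"{name:>15}: power = {value:.3f}")
```

At a fixed sample size, tightening \(\alpha\) lowers power (raises \(\beta\)), which is the trade-off between the two errors. Increasing n raises power without loosening \(\alpha\).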
What are some misconceptions in hypothesis testing?
A common misconception is to consider "p value as the probability that the null hypothesis is true". In fact, the p-value is computed under the assumption that the null hypothesis is true. The p-value is the probability of observing the values, or more extreme values, if the null hypothesis is true.
Another misconception, sometimes called base rate fallacy, is that under controlled \(\alpha\) and adequate power, statistically significant results correspond to true differences. This is not the case, as shown in the figure. Even with \(\alpha\)=5% and power=80%, 36% of statistically significant p-values will not report the true difference. Suppose 1000 hypotheses are tested and only 10% of the null hypotheses are false (base rate). With 80% power, these 100 give only 80 true positives, while 5% of the 900 true null hypotheses give 45 false positives. Thus 45 of the 125 significant results, or 36%, are false.
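The 36% figure can be reproduced with a few lines of arithmetic. The sketch assumes 1000 tested hypotheses, a count implied by the 80 true positives mentioned above; the function name is hypothetical.

```python
def false_discovery_share(n_tests, base_rate, power, alpha):
    """Split significant results into true and false positives."""
    false_nulls = n_tests * base_rate      # hypotheses where a real effect exists
    true_nulls = n_tests - false_nulls     # hypotheses where no effect exists
    true_positives = power * false_nulls   # real effects that are detected
    false_positives = alpha * true_nulls   # type-I errors among the true nulls
    share = false_positives / (true_positives + false_positives)
    return true_positives, false_positives, share

tp, fp, share = false_discovery_share(n_tests=1000, base_rate=0.10, power=0.80, alpha=0.05)
print(f"true positives = {tp:.0f}, false positives = {fp:.0f}, false share = {share:.0%}")
```

Lowering the base rate in this function pushes the false share higher, which is why a significant result in a field with few real effects deserves extra caution.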
P-value doesn't measure the size of the effect, for which confidence interval is a better approach. A drug that gives 25% improvement may not mean much if symptoms are innocuous compared to another drug that gives small improvement from a disease that leads to certain death. Context is therefore important.
The field of statistical testing probably starts with John Arbuthnot who applies it to test sex ratios at birth. Subsequently, others in the 18th and 19th centuries use it in other fields. However, modern terminology (null hypothesis, p-value, type-I or type-II errors) is formed only in the 20th century.
Neyman and Pearson publish On the problem of the most efficient tests of statistical hypotheses. They introduce the notion of alternative hypotheses. They also describe both type-I and type-II errors (although they don't use these terms). They state, "Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong."
Johnson's textbook titled Statistical methods in research is perhaps the first to introduce to students the Neyman-Pearson hypothesis testing at a time when most textbooks follow Fisher's significance testing. Johnson uses the terms "error of the first kind" and "error of the second kind". In time, Fisher's approach is called P-value approach and the Neyman-Pearson approach is called fixed-α approach.
- Biau, David Jean, Brigitte M. Jolles, and Raphaël Porcher. 2010. "P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers." Clin Orthop Relat Res, vol. 468, no. 3, pp. 885–892, March. Accessed 2021-05-28.
- Carver, Ronald P. 1993. "The Case against Statistical Significance Testing, Revisited." The Journal of Experimental Education, vol. 61, no. 4, Statistical Significance Testing in Contemporary Practice, pp. 287-292. Accessed 2021-05-28.
- Frankfort-Nachmias, Chava, Anna Leon-Guerrero, and Georgiann Davis. 2020. "Chapter 8: Testing Hypotheses." In: Social Statistics for a Diverse Society, SAGE Publications. Accessed 2021-05-28.
- Gordon, Max. 2011. "How to best display graphically type II (beta) error, power and sample size?" August 11. Accessed 2018-05-18.
- Heard, Stephen B. 2015. "In defence of the P-value" Types of Errors. February 9. Updated 2015-12-04. Accessed 2018-05-18.
- Huberty, Carl J. 1993. "Historical Origins of Statistical Testing Practices: The Treatment of Fisher versus Neyman-Pearson Views in Textbooks." The Journal of Experimental Education, vol. 61, no. 4, Statistical Significance Testing in Contemporary Practice, pp. 317-333. Accessed 2021-05-28.
- Kensler, Jennifer. 2013. "The Logic of Statistical Hypothesis Testing: Best Practice." Report, STAT T&E Center of Excellence. Accessed 2021-05-28.
- Klappa, Peter. 2014. "Sampling error and hypothesis testing." On YouTube, December 10. Accessed 2021-05-28.
- Lane, David M. 2021. "Section 10.8: Confidence Interval on the Mean." In: Introduction to Statistics, Rice University. Accessed 2021-05-28.
- McNeese, Bill. 2015. "How Many Samples Do I Need?" SPC For Excel, BPI Consulting, June. Accessed 2018-05-18.
- McNeese, Bill. 2017. "Interpretation of Alpha and p-Value." SPC for Excel, BPI Consulting, April 6. Updated 2020-04-25. Accessed 2021-05-28.
- Neyman, J., and E. S. Pearson. 1933. "On the problem of the most efficient tests of statistical hypotheses." Philos Trans R Soc Lond A., vol. 231, issue 694-706, pp. 289–337. doi: 10.1098/rsta.1933.0009. Accessed 2021-05-28.
- Nurse Key. 2017. "Chapter 15: Sampling." February 17. Accessed 2018-05-18.
- Pearson, Karl. 1900. "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling." Philosophical Magazine, Series 5 (1876-1900), pp. 157–175. Accessed 2021-05-28.
- Reinhart, Alex. 2015. "Statistics Done Wrong: The Woefully Complete Guide." No Starch Press.
- Rolke, Wolfgang A. 2018. "Quantitative Variables." Department of Mathematical Sciences, University of Puerto Rico - Mayaguez. Accessed 2018-05-18.
- Six-Sigma-Material.com. 2016. "Population & Samples." Six-Sigma-Material.com. Accessed 2018-05-18.
- Walmsley, Angela and Michael C. Brown. 2017. "What is Power?" Statistics Teacher, American Statistical Association, September 15. Accessed 2021-05-28.
- Wang, Jing. 2014. "Chapter 4.II: Hypothesis Testing." Applied Statistical Methods II, Univ. of Illinois Chicago. Accessed 2021-05-28.
- Weigle, David C. 1994. "Historical Origins of Contemporary Statistical Testing Practices: How in the World Did Significance Testing Assume Its Current Place in Contemporary Analytic Practice?" Paper presented at the Annual Meeting of the Southwest Educational Research Association, San Antonio, TX, January 27. Accessed 2021-05-28.
- Wikipedia. 2018. "Margin of Error." May 1. Accessed 2018-05-18.
- Wikipedia. 2021. "Law of large numbers." Wikipedia, March 26. Accessed 2021-05-28.
- howMed. 2013. "Significance Testing and p value." August 4. Updated 2013-08-08. Accessed 2018-05-18.
- Foley, Hugh. 2018. "Introduction to Hypothesis Testing." Skidmore College. Accessed 2018-05-18.
- Buskirk, Trent. 2015. "Sampling Error in Surveys." Accessed 2018-05-18.
- Zaiontz, Charles. 2014. "Assumptions for Statistical Tests." Real Statistics Using Excel. Accessed 2018-05-18.
- DeCook, Rhonda. 2018. "Section 9.2: Types of Errors in Hypothesis testing." Stat1010 Notes, Department of Statistics and Actuarial Science, University of Iowa. Accessed 2018-05-18.