Hypothesis Testing and Types of Errors
 Summary

Suppose we want to study the income of a population. We study a sample from the population and draw conclusions. The sample should represent the population for our study to be reliable.
The null hypothesis \((H_0)\) is that the sample represents the population. Hypothesis testing provides a framework to conclude whether we have sufficient evidence to accept or reject the null hypothesis.
Population characteristics are either assumed, drawn from third-party sources, or based on the judgement of subject matter experts. Population data and sample data are characterised by the moments of their distributions (mean, variance, skewness and kurtosis). Where a population characteristic is available, we test the null hypothesis of equality of moments and conclude whether the sample represents the population.
For example, given only the mean income of the population, we check whether the mean income of the sample is close to the population mean to conclude whether the sample represents the population.
Discussion
What are the math representations of population and sample parameters? Population mean and population variance are denoted by the Greek letters \(\mu\) and \(\sigma^2\) respectively, while sample mean and sample variance are denoted by the Latin letters \(\bar x\) and \(s^2\) respectively.
What's the relevance of sampling error to hypothesis testing? Suppose we obtain a sample mean of \(\bar x\) from a population of mean \(\mu\). The two are related by \(|\bar x - \mu| \ge 0\):
 If the difference is not significant, we conclude the difference is due to sampling. This is called sampling error and it happens due to chance.
 If the difference is significant, we conclude the sample does not represent the population. Something more than chance is needed to explain the difference.
Hypothesis testing helps us to conclude if the difference is due to sampling error or due to reasons beyond sampling error.
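Sampling error can be illustrated with a small simulation. This is only a sketch under assumed values: the population mean and standard deviation (50,000 and 12,000), the Normal shape of the population and the function name `mean_sampling_error` are hypothetical and not from the article. Every sample here really is drawn from the population, so any gap between \(\bar x\) and \(\mu\) is due to chance alone.

```python
import random
import statistics

def mean_sampling_error(n, mu=50_000.0, sigma=12_000.0, trials=500, seed=0):
    """Average |x_bar - mu| over many random samples of size n.

    mu and sigma are assumed, illustrative population values.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        errors.append(abs(statistics.fmean(sample) - mu))
    return statistics.fmean(errors)

# The typical gap between sample mean and population mean shrinks as n grows.
for n in (10, 100, 1000):
    print(n, round(mean_sampling_error(n), 1))
```

A difference much larger than the typical gap for a given sample size is what a hypothesis test flags as going beyond sampling error.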
What are some assumptions behind hypothesis testing? A common assumption is that the observations are independent and come from a random sample. Either the population distribution must be Normal or the sample size must be large enough. If the sample size is large enough, we can invoke the Central Limit Theorem (CLT) regardless of the underlying population distribution. Due to the CLT, the sampling distribution of the sample statistic (such as the sample mean) will be approximately Normal.
A rule of thumb is 30 observations, but in some cases even 10 observations may be sufficient to invoke the CLT. Others require at least 50 observations.
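The effect of the CLT can be seen in a short simulation. The sketch below is illustrative only: it assumes a strongly skewed population (exponential with rate 1), and the function names are hypothetical. It measures the skewness of the distribution of sample means, which should approach 0 (the skewness of a Normal distribution) as the sample size grows.

```python
import random
import statistics

def skewness(values):
    """Population skewness: mean of cubed standardized deviations."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def skewness_of_sample_means(n, trials=4000, seed=1):
    """Skewness of the means of `trials` samples of size n from Exp(1)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]
    return skewness(means)

# Skewness of the sample mean falls towards 0 as n increases.
for n in (2, 10, 30, 50):
    print(n, round(skewness_of_sample_means(n), 2))
```

How large a sample is "large enough" depends on how skewed the population is, which is why different rules of thumb exist.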
What are one-tailed and two-tailed tests? When acceptance of \(H_0\) involves boundaries on both sides, we invoke the two-tailed test. For example, if we define \(H_0\) as the sample being drawn from a population with ages in the range of 25 to 35, then testing \(H_0\) involves limits on both sides.
Suppose instead we define the population as those older than 50. We're interested in rejecting a sample if the age is less than or equal to 50; we're not concerned about any upper limit. Here we invoke the one-tailed test. A one-tailed test could be left-tailed or right-tailed.
Consider the average gas price in California compared to the national average of $2.62. If we believe that the price is higher in California, we consider a right-tailed test. If we believe that the California price is different from the national average but we don't know if it's higher or lower, we consider a two-tailed test. Symbolically, given the alternative or research hypothesis \(H_1\), we state,
 \(H_0\): \(\mu = \$ 2.62\)
 \(H_1\) right-tailed: \(\mu > \$ 2.62\)
 \(H_1\) two-tailed: \(\mu \neq \$ 2.62\)
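The gas price hypotheses can be tested with a z-test, sketched below. Only the national average of $2.62 comes from the text. The California sample mean ($2.70), the known standard deviation ($0.25), the sample size (40) and the function name `z_test_p_value` are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(x_bar, mu0, sigma, n, tail="two"):
    """p-value of a z-test of H0: mu = mu0, assuming a known sigma."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    cdf = NormalDist().cdf(z)
    if tail == "right":   # H1: mu > mu0
        return 1 - cdf
    if tail == "left":    # H1: mu < mu0
        return cdf
    if tail == "two":     # H1: mu != mu0
        return 2 * min(cdf, 1 - cdf)
    raise ValueError("tail must be 'left', 'right' or 'two'")

# Hypothetical sample: mean $2.70, sigma $0.25, n = 40.
p_right = z_test_p_value(2.70, 2.62, 0.25, 40, tail="right")
p_two = z_test_p_value(2.70, 2.62, 0.25, 40, tail="two")
print(f"right-tailed p = {p_right:.4f}, two-tailed p = {p_two:.4f}")
```

For the same data, the two-tailed p-value is twice the one-tailed p-value in the observed direction, so the two-tailed test demands stronger evidence.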
What are the types of errors in hypothesis testing? In concluding whether the sample represents the population, we can commit two kinds of errors:
 Not accepting that the sample represents the population when in reality it does. This is called type-I or \(\alpha\) error.
 Accepting that the sample represents the population when in reality it does not. This is called type-II or \(\beta\) error.
For instance, if the null hypothesis is that an applicant is not creditworthy, granting a loan to an applicant with a low credit score is an \(\alpha\) error. Not granting a loan to an applicant with a high credit score is a \(\beta\) error.
The symbols \(\alpha\) and \(\beta\) are used to represent the probability of type-I and type-II errors respectively.
How do we measure type-I or \(\alpha\) error? The p-value can be interpreted as the probability of getting a result that's the same or more extreme when the null hypothesis is true.
The observed sample mean \(\bar x\) is overlaid on the population distribution of values with mean \(\mu\) and variance \(\sigma^2\). The proportion of values beyond \(\bar x\) and away from \(\mu\) (in the left tail, the right tail, or both tails) is the p-value. If p-value \(\le \alpha\), we reject the null hypothesis. The results are then said to be statistically significant, that is, unlikely to be due to chance alone.
Assuming \(\alpha = 0.05\), if p-value > 5%, we conclude that the data are consistent with the sample being drawn from a population with mean \(\mu\) and variance \(\sigma^2\), and we accept \(H_0\). Otherwise, there's insufficient evidence that the sample is part of the population and we reject \(H_0\).
We preselect \(\alpha\) based on how much type-I error we're willing to tolerate. \(\alpha\) is called the level of significance. The standard level of significance is 0.05, but in some studies it may be 0.01 or 0.1. In two-tailed tests, it's \(\alpha/2\) on either side.
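The role of \(\alpha\) can be made concrete for a test based on the standard Normal distribution. The following is a minimal sketch, with hypothetical function names, that computes the critical z-value marking the rejection region (using \(\alpha/2\) per side for a two-tailed test) and applies the decision rule on a p-value.

```python
from statistics import NormalDist

def critical_z(alpha=0.05, tail="two"):
    """Critical z-value (as a positive number) for the given significance level."""
    std = NormalDist()
    if tail == "two":
        # alpha is split equally between the two tails.
        return std.inv_cdf(1 - alpha / 2)
    if tail in ("left", "right"):
        # All of alpha sits in a single tail.
        return std.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'left', 'right' or 'two'")

def reject_null(p_value, alpha=0.05):
    """Decision rule: reject H0 when the p-value is at most alpha."""
    return p_value <= alpha

print(round(critical_z(0.05, "two"), 3))    # two-tailed cutoff
print(round(critical_z(0.05, "right"), 3))  # one-tailed cutoff
print(reject_null(0.03), reject_null(0.20))
```

Comparing a test statistic against the critical value and comparing the p-value against \(\alpha\) are equivalent ways of reaching the same decision.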
How do we determine sample size and confidence interval for sample estimate? The Law of Large Numbers suggests that the larger the sample size, the more accurate the estimate. Accuracy here means that the variance of the estimate tends towards zero as the sample size increases. Sample size can be chosen to suit the accepted level of tolerance for deviation.
The confidence interval of the sample mean is the sample mean offset on either side by a margin of error, which is a multiple of the standard error of the mean. If the population variance is known, we conduct a z-test based on the Normal distribution. Otherwise, the variance has to be estimated and we use a t-test based on the t-distribution.
The formulae for determining sample size and confidence interval depend on what we want to estimate (mean, variance or others), the sampling distribution of the estimate and the standard deviation of the estimate's sampling distribution.
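For the common case of estimating a mean with known population standard deviation \(\sigma\), the standard z-based formulae are \(\bar x \pm z_{\alpha/2}\,\sigma/\sqrt{n}\) for the confidence interval and \(n = (z_{\alpha/2}\,\sigma/E)^2\) for the sample size that achieves a margin of error \(E\). The sketch below implements these two formulae; the function names and the numbers in the usage lines are hypothetical, and other estimates (or unknown \(\sigma\)) would need different formulae.

```python
from math import ceil, sqrt
from statistics import NormalDist

def confidence_interval(x_bar, sigma, n, confidence=0.95):
    """z-based confidence interval for the mean, assuming sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)  # z times the standard error of the mean
    return x_bar - margin, x_bar + margin

def sample_size_for_margin(sigma, margin, confidence=0.95):
    """Smallest n whose margin of error does not exceed `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / margin) ** 2)

# Hypothetical numbers: sigma = 15, desired margin of error = 3.
n = sample_size_for_margin(sigma=15, margin=3)
low, high = confidence_interval(x_bar=100, sigma=15, n=n)
print(n, round(low, 2), round(high, 2))
```

Because the margin of error shrinks with \(\sqrt{n}\), halving the margin requires roughly four times as many observations.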
How do we measure type-II or \(\beta\) error? We overlay the distribution of the sample mean on the population distribution. The proportion of overlap of the sampling estimate's distribution with the population distribution is the \(\beta\) error.
The larger the overlap, the larger the chance that the sample does belong to the population with mean \(\mu\) and variance \(\sigma^2\). Incidentally, despite the overlap, the p-value may be less than 5%. This happens when the sample mean is far from the population mean, but the variance of the sample mean is such that the overlap is significant.
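One common way to quantify \(\beta\) is sketched below for a right-tailed z-test. It assumes a known \(\sigma\) and a specific true mean \(\mu_1\) under the alternative; these assumptions, the function name and the example numbers are illustrative and not from the article. \(\beta\) is then the probability that the sample mean falls short of the rejection cutoff even though the true mean is \(\mu_1\).

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, mu1, sigma, n, alpha=0.05):
    """beta for a right-tailed z-test of H0: mu = mu0 when the true mean is mu1."""
    se = sigma / sqrt(n)  # standard error of the sample mean
    # Smallest sample mean that leads to rejecting H0 at level alpha.
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Probability the sample mean stays below the cutoff when the true mean is mu1.
    return NormalDist(mu1, se).cdf(cutoff)

# Hypothetical numbers: H0 mean 100, true mean 105, sigma 15.
for n in (25, 50, 100):
    print(n, round(type_ii_error(100, 105, 15, n), 3))
```

The overlap, and hence \(\beta\), shrinks as the sample size grows or as the true mean moves further from \(\mu_0\).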
How do we control \(\alpha\) and \(\beta\) errors? Errors \(\alpha\) and \(\beta\) are dependent on each other. For a fixed sample size, decreasing one increases the other. Choosing suitable values for these depends on the cost of making these errors. Perhaps it's worse to convict an innocent person (type-I error) than to acquit a guilty person (type-II error), in which case we choose a lower \(\alpha\). However, it's possible to decrease both errors by collecting more data.
Just as the p-value relates to \(\alpha\), the power of a test relates to \(\beta\). Power of a test is \(1-\beta\). Among the various ways to interpret power are:
 Probability of rejecting the null hypothesis when, in fact, it is false.
 Probability that a test of significance will pick up on an effect that is present.
 Probability of avoiding a Type II error.
A low p-value and high power help us decisively conclude that the sample doesn't belong to the population. When we cannot conclude decisively, it's advisable to use larger samples and multiple samples.
In fact, power increases with larger sample sizes, larger effect sizes and higher significance levels. Variance also affects power.
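The dependence of power on sample size, effect size, significance level and variance can be seen in the standard sample-size formula for a right-tailed z-test, \(n = \left((z_{1-\alpha} + z_{1-\beta})\,\sigma/\delta\right)^2\), where \(\delta\) is the true shift in the mean. The sketch below assumes a known \(\sigma\); the function name and example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_power(delta, sigma, alpha=0.05, power=0.80):
    """n needed by a right-tailed z-test to detect a shift `delta` with the given power."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha)  # controls type-I error
    z_beta = std.inv_cdf(power)       # power = 1 - beta, controls type-II error
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Hypothetical effect of half a standard deviation.
print(sample_size_for_power(delta=0.5, sigma=1.0))              # baseline
print(sample_size_for_power(delta=0.5, sigma=1.0, alpha=0.01))  # stricter alpha needs more data
print(sample_size_for_power(delta=0.5, sigma=1.0, power=0.90))  # higher power needs more data
print(sample_size_for_power(delta=1.0, sigma=1.0))              # larger effect needs less data
```

Read the other way round, a fixed sample size gives more power when the effect is larger, the variance is smaller or \(\alpha\) is higher.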
What are some misconceptions in hypothesis testing? A common misconception is to consider "p-value as the probability that the null hypothesis is true". In fact, the p-value is computed under the assumption that the null hypothesis is true. The p-value is the probability of observing the values, or more extreme values, if the null hypothesis is true.
Another misconception, sometimes called the base rate fallacy, is that under controlled \(\alpha\) and adequate power, statistically significant results correspond to true differences. This is not the case, as shown in the figure. Even with \(\alpha\)=5% and power=80%, 36% of statistically significant p-values will not reflect a true difference. Suppose 1,000 hypotheses are tested and only 10% of the null hypotheses are false (base rate). Power of 80% on these 100 gives only 80 true positives, while \(\alpha\)=5% on the remaining 900 gives 45 false positives. Thus 45 of the 125 significant results, or 36%, are false.
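The base rate arithmetic can be checked with a few lines of code. This is a minimal sketch with a hypothetical function name. The figure of 1,000 tests is an assumption consistent with the 80 true positives mentioned above, and the resulting share doesn't depend on it.

```python
def false_discovery_share(base_rate, alpha, power, n_tests=1000):
    """Share of statistically significant results that are false positives."""
    real_effects = base_rate * n_tests       # null hypothesis is false
    no_effects = n_tests - real_effects      # null hypothesis is true
    true_positives = power * real_effects    # real effects that are detected
    false_positives = alpha * no_effects     # true nulls rejected by chance
    return false_positives / (false_positives + true_positives)

# Base rate 10%, alpha 5%, power 80%: about 0.36 of significant results are false.
print(false_discovery_share(base_rate=0.10, alpha=0.05, power=0.80))
```

With a higher base rate the share drops sharply, which is why the plausibility of the hypotheses being tested matters when interpreting significant results.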
The p-value doesn't measure the size of the effect, for which a confidence interval is a better approach. A drug that gives 25% improvement may not mean much if the symptoms are innocuous, compared to another drug that gives a small improvement in a disease that leads to certain death. Context is therefore important.
Milestones
The field of statistical testing probably starts with John Arbuthnot, who applies it to test sex ratios at birth. Subsequently, others in the 18th and 19th centuries use it in other fields. However, modern terminology (null hypothesis, p-value, type-I or type-II errors) is formed only in the 20th century.
Pearson introduces the concept of p-value with the chi-squared test. He gives equations for calculating P and states that it's "the measure of the probability of a complex system of n errors occurring with a frequency as great or greater than that of the observed system."
Ronald A. Fisher develops the concept of p-value and shows how to calculate it in a wide variety of situations. He also notes that a value of 0.05 may be considered a conventional cutoff.
Neyman and Pearson publish "On the problem of the most efficient tests of statistical hypotheses". They introduce the notion of alternative hypotheses. They also describe both type-I and type-II errors (although they don't use these terms). They state, "Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong."
Johnson's textbook titled "Statistical methods in research" is perhaps the first to introduce students to Neyman-Pearson hypothesis testing, at a time when most textbooks follow Fisher's significance testing. Johnson uses the terms "error of the first kind" and "error of the second kind". In time, Fisher's approach is called the P-value approach and the Neyman-Pearson approach is called the fixed-α approach.
Carver makes the following suggestions: use the term "statistically significant"; interpret results with respect to the data first and statistical significance second; and pay attention to the size of the effect.
References
 Biau, David Jean, Brigitte M. Jolles, and Raphaël Porcher. 2010. "P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers." Clin Orthop Relat Res, vol. 468, no. 3, pp. 885–892, March. Accessed 2021-05-28.
 Carver, Ronald P. 1993. "The Case against Statistical Significance Testing, Revisited." The Journal of Experimental Education, vol. 61, no. 4, Statistical Significance Testing in Contemporary Practice, pp. 287–292. Accessed 2021-05-28.
 Frankfort-Nachmias, Chava, Anna Leon-Guerrero, and Georgiann Davis. 2020. "Chapter 8: Testing Hypotheses." In: Social Statistics for a Diverse Society, SAGE Publications. Accessed 2021-05-28.
 Gordon, Max. 2011. "How to best display graphically type II (beta) error, power and sample size?" August 11. Accessed 2018-05-18.
 Heard, Stephen B. 2015. "In defence of the P-value" Types of Errors. February 9. Updated 2015-12-04. Accessed 2018-05-18.
 howMed. 2013. "Significance Testing and p value." August 4. Updated 2013-08-08. Accessed 2018-05-18.
 Huberty, Carl J. 1993. "Historical Origins of Statistical Testing Practices: The Treatment of Fisher versus Neyman-Pearson Views in Textbooks." The Journal of Experimental Education, vol. 61, no. 4, Statistical Significance Testing in Contemporary Practice, pp. 317–333. Accessed 2021-05-28.
 Kensler, Jennifer. 2013. "The Logic of Statistical Hypothesis Testing: Best Practice." Report, STAT T&E Center of Excellence. Accessed 2021-05-28.
 Klappa, Peter. 2014. "Sampling error and hypothesis testing." On YouTube, December 10. Accessed 2021-05-28.
 Lane, David M. 2021. "Section 10.8: Confidence Interval on the Mean." In: Introduction to Statistics, Rice University. Accessed 2021-05-28.
 McNeese, Bill. 2015. "How Many Samples Do I Need?" SPC for Excel, BPI Consulting, June. Accessed 2018-05-18.
 McNeese, Bill. 2017. "Interpretation of Alpha and p-Value." SPC for Excel, BPI Consulting, April 6. Updated 2020-04-25. Accessed 2021-05-28.
 Neyman, J., and E. S. Pearson. 1933. "On the problem of the most efficient tests of statistical hypotheses." Philos Trans R Soc Lond A., vol. 231, issue 694-706, pp. 289–337. doi: 10.1098/rsta.1933.0009. Accessed 2021-05-28.
 Nurse Key. 2017. "Chapter 15: Sampling." February 17. Accessed 2018-05-18.
 Pearson, Karl. 1900. "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling." Philosophical Magazine, Series 5 (1876-1900), pp. 157–175. Accessed 2021-05-28.
 Reinhart, Alex. 2015. "Statistics Done Wrong: The Woefully Complete Guide." No Starch Press.
 Rolke, Wolfgang A. 2018. "Quantitative Variables." Department of Mathematical Sciences, University of Puerto Rico - Mayaguez. Accessed 2018-05-18.
 SixSigmaMaterial.com. 2016. "Population & Samples." SixSigmaMaterial.com. Accessed 2018-05-18.
 Walmsley, Angela and Michael C. Brown. 2017. "What is Power?" Statistics Teacher, American Statistical Association, September 15. Accessed 2021-05-28.
 Wang, Jing. 2014. "Chapter 4.II: Hypothesis Testing." Applied Statistical Methods II, Univ. of Illinois Chicago. Accessed 2021-05-28.
 Weigle, David C. 1994. "Historical Origins of Contemporary Statistical Testing Practices: How in the World Did Significance Testing Assume Its Current Place in Contemporary Analytic Practice?" Paper presented at the Annual Meeting of the Southwest Educational Research Association, San Antonio, TX, January 27. Accessed 2021-05-28.
 Wikipedia. 2018. "Margin of Error." May 1. Accessed 2018-05-18.
 Wikipedia. 2021. "Law of large numbers." Wikipedia, March 26. Accessed 2021-05-28.
Further Reading
 Foley, Hugh. 2018. "Introduction to Hypothesis Testing." Skidmore College. Accessed 2018-05-18.
 Buskirk, Trent. 2015. "Sampling Error in Surveys." Accessed 2018-05-18.
 Zaiontz, Charles. 2014. "Assumptions for Statistical Tests." Real Statistics Using Excel. Accessed 2018-05-18.
 DeCook, Rhonda. 2018. "Section 9.2: Types of Errors in Hypothesis testing." Stat1010 Notes, Department of Statistics and Actuarial Science, University of Iowa. Accessed 2018-05-18.