# Hypothesis Testing and Types of Errors

Suppose we want to study the income of a population. We study a sample drawn from the population and draw conclusions. For our study to be reliable, the sample must represent the population.

The null hypothesis $$(H_0)$$ is that the sample represents the population. Hypothesis testing provides us with a framework to conclude if we have sufficient evidence to either accept or reject the null hypothesis.

Population characteristics are either assumed, drawn from third-party sources, or based on judgements by subject matter experts. Population data and sample data are characterised by the moments of their distributions (mean, variance, skewness and kurtosis). Where a population characteristic is available, we test the null hypothesis for equality of moments and conclude whether the sample represents the population.

For example, given only the mean income of the population, we check if the mean income of the sample is close to the population mean to conclude whether the sample represents the population.

## Discussion

• What are the math representations of population and sample parameters?

Population mean and population variance are denoted by the Greek letters $$\mu$$ and $$\sigma^2$$ respectively, while sample mean and sample variance are denoted by the Latin letters $$\bar x$$ and $$s^2$$ respectively.

• What's the relevance of sampling error to hypothesis testing?

Suppose we obtain a sample mean $$\bar x$$ from a population with mean $$\mu$$. The two are related by $$|\bar x - \mu| \geq 0$$:

• If the difference is not significant, we attribute it to sampling. This is called sampling error and it happens due to chance.
• If the difference is significant, we conclude that the sample does not represent the population. The difference then has to be explained by more than chance.

Hypothesis testing helps us to conclude if the difference is due to sampling error or due to reasons beyond sampling error.
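To see sampling error in action, here's a minimal Python sketch. The income figures (mean 50,000, standard deviation 12,000) are assumed values for illustration, not from the article. Repeated samples from the same population give sample means that differ from $$\mu$$ purely by chance.

```python
import random
import statistics

# Hypothetical population: incomes with mean 50000 and sd 12000 (assumed values).
random.seed(42)
population = [random.gauss(50000, 12000) for _ in range(100_000)]
mu = statistics.mean(population)

# Draw several random samples; their means differ from mu purely by chance.
for _ in range(3):
    sample = random.sample(population, 100)
    x_bar = statistics.mean(sample)
    print(f"sample mean = {x_bar:,.0f}, |x_bar - mu| = {abs(x_bar - mu):,.0f}")
```

Each run of the loop produces a slightly different sample mean; the small deviations from $$\mu$$ are sampling error.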

• What are some assumptions behind hypothesis testing?

A common assumption is that the observations are independent and come from a random sample. The population distribution must be Normal, or the sample size must be large enough. If the sample size is large enough, we can invoke the Central Limit Theorem (CLT) regardless of the underlying population distribution. Due to the CLT, the sampling distribution of the sample statistic (such as the sample mean) will be approximately Normal.

A common rule of thumb is 30 observations, but in some cases even 10 observations may be sufficient to invoke the CLT, while others require at least 50.
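A quick way to see the CLT at work is to simulate it. The sketch below (with assumed parameters) samples from a skewed exponential distribution, yet the distribution of sample means clusters symmetrically around the true mean with the standard deviation theory predicts.

```python
import random
import statistics

# CLT sketch: even for a skewed population (exponential with mean 1.0),
# the distribution of sample means is approximately Normal for n = 30.
random.seed(0)
n, trials = 30, 5000
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]

# Theory: mean of sample means ~ 1.0, sd ~ 1/sqrt(30) ~ 0.183.
print(round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
```

Despite the heavy right skew of the exponential distribution, the simulated sampling distribution matches the Normal approximation closely at n = 30.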

• What are one-tailed and two-tailed tests?

When acceptance of $$H_0$$ involves boundaries on both sides, we use a two-tailed test. For example, if we define $$H_0$$ as the sample being drawn from a population with ages in the range of 25 to 35, then testing $$H_0$$ involves limits on both sides.

Suppose we define the population as those older than 50. We are interested in rejecting a sample if its mean age is less than or equal to 50; we are not concerned with any upper limit. Here we use a one-tailed test, which can be either left-tailed or right-tailed.

Consider the average gas price in California compared to the national average of \$2.62. If we believe the price is higher in California, we use a right-tailed test. If we believe the California price differs from the national average but don't know whether it's higher or lower, we use a two-tailed test. Symbolically, given the alternative or research hypothesis $$H_1$$, we state,

• $$H_0$$: $$\mu = \$2.62$$
• $$H_1$$ right-tailed: $$\mu > \$2.62$$
• $$H_1$$ two-tailed: $$\mu \neq \$2.62$$
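Continuing the gas price example, here's a sketch of how the one-tailed and two-tailed p-values come out of a z-test. The sample figures (25 stations, sample mean \$2.87, known $$\sigma$$ = \$0.40) are assumed for illustration, not from the article.

```python
from math import erf, sqrt

def phi(z):
    # Standard Normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical numbers (assumed): a sample of 25 California stations
# averages $2.87, with known sigma = $0.40.
mu0, x_bar, sigma, n = 2.62, 2.87, 0.40, 25
z = (x_bar - mu0) / (sigma / sqrt(n))    # test statistic

p_right = 1 - phi(z)             # H1: mu > 2.62 (right-tailed)
p_two = 2 * (1 - phi(abs(z)))    # H1: mu != 2.62 (two-tailed)
print(f"z = {z:.3f}, right-tailed p = {p_right:.4f}, two-tailed p = {p_two:.4f}")
```

Note that the two-tailed p-value is exactly double the one-tailed value for a symmetric distribution, which is why the two-tailed test is the more conservative choice.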

• What are the types of errors in hypothesis testing?

In concluding whether the sample represents the population, there is scope for committing errors on the following counts:

• Rejecting that the sample represents the population when in reality it does. This is called type-I or $$\alpha$$ error.
• Accepting that the sample represents the population when in reality it does not. This is called type-II or $$\beta$$ error.

For instance, granting a loan to an applicant with a low credit score is an $$\alpha$$ error. Not granting a loan to an applicant with a high credit score is a $$\beta$$ error.

The symbols $$\alpha$$ and $$\beta$$ are used to represent the probability of type-I and type-II errors respectively.

• How do we measure type-I or $$\alpha$$ error?

The p-value can be interpreted as the probability of getting a result as extreme as, or more extreme than, the observed one when the null hypothesis is true.

The observed sample mean $$\bar x$$ is overlaid on the population distribution with mean $$\mu$$ and variance $$\sigma^2$$. The proportion of values beyond $$\bar x$$ and away from $$\mu$$ (in the left tail, the right tail, or both tails) is the p-value. If p-value $$\leq \alpha$$, we reject the null hypothesis. The results are then said to be statistically significant and not due to chance.

Assuming $$\alpha$$=0.05, if the p-value is greater than 0.05, we conclude that the sample is highly likely to be drawn from a population with mean $$\mu$$ and variance $$\sigma^2$$, and we accept $$H_0$$. Otherwise, there's insufficient evidence that the sample is part of the population and we reject $$H_0$$.

We preselect $$\alpha$$ based on how much type-I error we're willing to tolerate; $$\alpha$$ is called the level of significance. The standard level of significance is 0.05, but some studies use 0.01 or 0.1. In two-tailed tests, $$\alpha/2$$ applies on each side.

• How do we determine sample size and confidence interval for sample estimate?

The Law of Large Numbers suggests that the larger the sample size, the more accurate the estimate: the variance of the estimate tends towards zero as the sample size increases. The sample size can be chosen to suit the accepted level of tolerance for deviation.

The confidence interval of the sample mean is the sample mean offset on either side by a margin based on the standard error. If the population variance is known, we conduct a z-test based on the Normal distribution. Otherwise, the variance has to be estimated and we use a t-test based on the t-distribution.

The formulae for determining sample size and confidence interval depend on what we want to estimate (mean/variance/others), the sampling distribution of the estimate, and the standard deviation of the estimate's sampling distribution.
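As an illustration, here's a minimal sketch of both formulae for estimating a mean with known $$\sigma$$: the z-interval $$\bar x \pm z_{\alpha/2}\,\sigma/\sqrt{n}$$ and the sample size $$n = (z_{\alpha/2}\,\sigma/E)^2$$ for a desired margin of error $$E$$. The income figures are assumed, not from the article.

```python
from math import sqrt, ceil

z_95 = 1.96  # critical value for 95% confidence (two-tailed)

def ci_mean(x_bar, sigma, n, z=z_95):
    # 95% confidence interval for the mean with known sigma (z-interval)
    half_width = z * sigma / sqrt(n)
    return (x_bar - half_width, x_bar + half_width)

def sample_size(sigma, E, z=z_95):
    # Smallest n giving margin of error at most E: n = (z*sigma/E)^2
    return ceil((z * sigma / E) ** 2)

# Hypothetical income study (assumed values): sigma = 12000, desired E = 2000.
n = sample_size(12000, 2000)
print(n)                            # required sample size
print(ci_mean(50000, 12000, n))     # 95% CI around a sample mean of 50000
```

For an unknown $$\sigma$$, the same structure applies with the sample standard deviation and a t-distribution critical value in place of 1.96.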

• How do we measure type-II or $$\beta$$ error?

We overlay the sample mean's sampling distribution on the population distribution; the proportion of overlap between the two is the $$\beta$$ error.

The larger the overlap, the larger the chance that the sample belongs to the population with mean $$\mu$$ and variance $$\sigma^2$$. Incidentally, despite the overlap, the p-value may be less than 5%. This happens when the sample mean is far from the population mean but the variance of the sample mean is such that the overlap is significant.
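Here's a sketch of how $$\beta$$ can be computed for a right-tailed z-test, reusing the earlier (assumed) gas price figures and a hypothetical true mean of \$2.80: $$\beta$$ is the probability that $$\bar x$$ falls below the rejection cutoff even though the true mean is not $$\mu_0$$.

```python
from math import erf, sqrt

def phi(z):
    # Standard Normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

# Beta for a right-tailed z-test at alpha = 0.05.
# Numbers are hypothetical (assumed for illustration).
mu0, mu1, sigma, n, z_alpha = 2.62, 2.80, 0.40, 25, 1.645
se = sigma / sqrt(n)
cutoff = mu0 + z_alpha * se          # reject H0 when x_bar > cutoff
beta = phi((cutoff - mu1) / se)      # P(fail to reject | true mean is mu1)
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")
```

Note that $$\beta$$ depends on the true alternative mean $$\mu_1$$: the closer $$\mu_1$$ is to $$\mu_0$$, the larger the overlap and the larger the $$\beta$$ error.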

• How do we control $$\alpha$$ and $$\beta$$ errors?

Errors $$\alpha$$ and $$\beta$$ are interdependent: decreasing one increases the other. Choosing suitable values for them depends on the cost of making these errors. Perhaps it's worse to convict an innocent person (type-I error) than to acquit a guilty person (type-II error), in which case we choose a lower $$\alpha$$. However, it's possible to decrease both errors by collecting more data.

Just as the p-value quantifies $$\alpha$$, the Power of a Test quantifies $$\beta$$: power is $$1-\beta$$. Among the ways to interpret power are:

• Probability of rejecting the null hypothesis when, in fact, it is false.
• Probability that a test of significance will pick up on an effect that is present.
• Probability of avoiding a Type II error.

A low p-value and high power help us decisively conclude that the sample doesn't belong to the population. When we cannot conclude decisively, it's advisable to go for larger samples and multiple samples.

In fact, power is increased by increasing the sample size, effect size or significance level. Lower variance also increases power.
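Continuing the (assumed) gas price figures, this sketch shows how power grows with sample size for a right-tailed z-test, using the relation power $$= 1 - \Phi(z_\alpha - (\mu_1-\mu_0)/(\sigma/\sqrt{n}))$$:

```python
from math import erf, sqrt

def phi(z):
    # Standard Normal cumulative distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu0, mu1, sigma, n, z_alpha=1.645):
    # Power of a right-tailed z-test: P(reject H0 | true mean is mu1)
    se = sigma / sqrt(n)
    return 1 - phi(z_alpha - (mu1 - mu0) / se)

# Hypothetical effect (assumed): mu0 = 2.62, mu1 = 2.80, sigma = 0.40.
for n in (25, 50, 100):
    print(n, round(power(2.62, 2.80, 0.40, n), 3))
```

The same function also shows the other levers: a larger effect $$\mu_1-\mu_0$$, a smaller $$\sigma$$, or a smaller $$z_\alpha$$ (more lenient significance level) all raise the power.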

• What are some misconceptions in hypothesis testing?

A common misconception is to read the p-value as the probability that the null hypothesis is true. In fact, the p-value is computed under the assumption that the null hypothesis is true: it's the probability of observing the values, or more extreme values, if the null hypothesis is true.

Another misconception, sometimes called the base rate fallacy, is that under controlled $$\alpha$$ and adequate power, statistically significant results correspond to true differences. This is not the case. Even with $$\alpha$$=5% and power=80%, 36% of statistically significant p-values will not report a true difference. This is because if only 10% of the null hypotheses are false (the base rate), 80% power on these gives only 80 true positives per 1000 tests, while $$\alpha$$=5% applied to the 900 true nulls yields 45 false positives.
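The arithmetic behind that 36% figure can be checked directly:

```python
# Base-rate arithmetic from the text: 1000 hypotheses tested,
# 10% of nulls are false, alpha = 0.05, power = 0.80.
total = 1000
false_nulls = int(0.10 * total)             # 100 real effects
true_nulls = total - false_nulls            # 900 no-effect cases

true_positives = int(0.80 * false_nulls)    # power catches 80 of the 100
false_positives = int(0.05 * true_nulls)    # alpha admits 45 of the 900

significant = true_positives + false_positives
fdr = false_positives / significant         # fraction of significant results that are false
print(f"{significant} significant results, {fdr:.0%} of them false")
```

So of the 125 statistically significant results, 45 (36%) are false positives, even though both $$\alpha$$ and power were well controlled.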

The p-value doesn't measure the size of the effect, for which a confidence interval is a better approach. A drug that gives 25% improvement in innocuous symptoms may not mean much compared to another drug that gives a small improvement against a disease that leads to certain death. Context is therefore important.

## Milestones

1710

The field of statistical testing probably starts with John Arbuthnot who applies it to test sex ratios at birth. Subsequently, others in the 18th and 19th centuries use it in other fields. However, modern terminology (null hypothesis, p-value, type-I or type-II errors) is formed only in the 20th century.

1900

Pearson introduces the concept of p-value with the chi-squared test. He gives equations for calculating P and states that it's "the measure of the probability of a complex system of n errors occurring with a frequency as great or greater than that of the observed system."

1925

Ronald A. Fisher develops the concept of p-value and shows how to calculate it in a wide variety of situations. He also notes that a value of 0.05 may be considered as conventional cut-off.

1933

Neyman and Pearson publish On the problem of the most efficient tests of statistical hypotheses. They introduce the notion of alternative hypotheses. They also describe both type-I and type-II errors (although they don't use these terms). They state, "Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong."

1949

Johnson's textbook titled Statistical methods in research is perhaps the first to introduce to students the Neyman-Pearson hypothesis testing at a time when most textbooks follow Fisher's significance testing. Johnson uses the terms "error of the first kind" and "error of the second kind". In time, Fisher's approach is called P-value approach and the Neyman-Pearson approach is called fixed-α approach.

1993

Carver makes the following suggestions: use the term "statistically significant"; interpret results with respect to the data first and statistical significance second; and pay attention to the size of the effect.

## Cite As

Devopedia. 2021. "Hypothesis Testing and Types of Errors." Version 14, May 28. Accessed 2021-09-09. https://devopedia.org/hypothesis-testing-and-types-of-errors