Hypothesis testing allows us to test a specific hypothesis about a population parameter using the evidence obtained from a sample. The earliest use of statistical hypothesis testing is generally credited to the question of whether male and female births are equally likely (null hypothesis), which was addressed in the 1700s by John Arbuthnot and later by Pierre-Simon Laplace.
Let \(p\) be the population ratio (defined as the ratio of boys to the total number of babies). We hypothesize that \[H_{0}:p=0.5\] This is called the null hypothesis, which is the hypothesis we want to test. If the null hypothesis is false, we have \[H_{1}:p\neq0.5\] This is called the alternative hypothesis. How can we test which hypothesis is true? We can answer this question by collecting a sample. Suppose we have collected a sample of \(50\) babies and computed a sample ratio of \(\hat{p}=0.55\). Does this prove or disprove the hypothesis?
Note that the ratio \(\hat{p}\) can be regarded as a sample mean. Let \(X_{i}\) be a random variable that equals \(1\) if the \(i\)-th baby is a boy and \(0\) otherwise. Then, \(\hat{p}=\frac{1}{n}\sum_{i=1}^{n}X_{i}\). The variance of \(\hat{p}\) is given by \[Var(\hat{p})=\frac{1}{n^{2}}\sum_{i=1}^{n}Var(X_{i})=\frac{p(1-p)}{n}\] since the \(X_{i}\) are independent Bernoulli random variables. By the Central Limit Theorem, we have \[\frac{\hat{p}-p}{\sqrt{\frac{p(1-p)}{n}}}\to N(0,1)\] Suppose \(H_{0}\) is true; then we know the distribution of \(\hat{p}\). In particular, there is a 95% chance that \(\hat{p}\) falls in the interval \[p\pm1.96\sqrt{\frac{p(1-p)}{n}}=0.5\pm0.14\] Our observed sample ratio \(\hat{p}=0.55\) is not outrageous: it is well within this interval, so the evidence is not against the null hypothesis. This does not mean \(H_{0}\) is true, but \(H_{0}\) remains reasonable given the observed sample ratio \(\hat{p}=0.55\).
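The interval above is easy to verify numerically. A minimal sketch using only the standard library, with \(n=50\) and \(\hat{p}=0.55\) from the example:

```python
import math

# Under H0: p = 0.5, with n = 50 observations
p0, n = 0.5, 50

# Standard error of p-hat under the null hypothesis
se = math.sqrt(p0 * (1 - p0) / n)

# 95% interval: p0 +/- 1.96 * se  (1.96 is the two-sided 5% normal critical value)
lo, hi = p0 - 1.96 * se, p0 + 1.96 * se
print(f"95% interval under H0: [{lo:.3f}, {hi:.3f}]")  # roughly [0.361, 0.639]

# The observed sample ratio lies inside the interval
p_hat = 0.55
print(lo <= p_hat <= hi)  # True
```

The half-width \(1.96\sqrt{0.5\cdot 0.5/50}\approx 0.139\) matches the \(\pm 0.14\) quoted above.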
Suppose instead we had observed \(\hat{p}=0.65\). This piece of evidence does not seem consistent with the null hypothesis: if \(H_{0}\) were true, there would be less than a 5% chance of observing a sample ratio this far from \(0.5\). It is extremely unlikely. Based on this sample, we are more inclined to reject \(H_{0}\). Rejecting the null hypothesis does not mean it is false; it means our evidence does not support this hypothesis.
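The informal "less than 5% chance" claim can be made precise with a two-sided p-value. A sketch using the CLT normal approximation (the function name and interface are illustrative, not from the text):

```python
import math

def two_sided_p(p_hat, p0, n):
    """Two-sided p-value for H0: p = p0, via the CLT normal approximation."""
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

print(two_sided_p(0.55, 0.5, 50))  # about 0.48: consistent with H0
print(two_sided_p(0.65, 0.5, 50))  # about 0.034: below 0.05, evidence against H0
```

With \(\hat{p}=0.55\) the p-value is large, so the data are unsurprising under \(H_0\); with \(\hat{p}=0.65\) it drops below 0.05, matching the reasoning above.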
Definitions
Definition 61.1 (Hypothesis testing) A null hypothesis (\(H_0\)) is a statement that there is no relationship or effect between variables. An alternative hypothesis (\(H_1\)) is the opposite of the null hypothesis. Hypothesis testing seeks to find evidence to reject the null.
Definition 61.2 (Rejection region) The range of extreme values in the sampling distribution where, if the test statistic falls, the null hypothesis (\(H_0\)) is rejected. When the test statistic lands in the rejection region, \(H_0\) is highly unlikely to be true given the evidence. This region is determined by the significance level (\(\alpha\)).
Definition 61.3 (Type I/II error) A Type I error is rejecting the \(H_0\) when it is true. A Type II error is failing to reject the \(H_0\) when it is false.
| | Decision: Reject \(H_{0}\) | Decision: Fail to reject \(H_{0}\) |
|---|---|---|
| \(H_{0}\) is true | Type I error | \(\checkmark\) |
| \(H_{0}\) is false | \(\checkmark\) | Type II error |
Definition 61.4 (Significance level) Significance level is the probability of making a Type I error, typically denoted by \(\alpha\): \[ \alpha = P(\textrm{reject }H_{0}|H_{0}\textrm{ is true}).\] To control the Type I error, we typically set \(\alpha\) to a very small value. Typical choices include \(\alpha = 0.1, 0.05, 0.01\).
Definition 61.5 (p-value) The probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data, assuming \(H_0\) is true. A small \(p\)-value provides strong evidence against \(H_0\).
Definition 61.6 (Decision rule) We reject the null hypothesis (\(H_0\)) under a given significance level (\(\alpha\)) if \(p \leq \alpha\). We say a result is statistically significant, by the standards of the study, when \(p \leq \alpha\).
Decision rule of one-sided and two-sided tests
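As an illustration of how the decision rule depends on the alternative, the following sketch turns a \(z\)-statistic into a p-value for each kind of alternative (the function name and interface are my own, not from the text):

```python
import math

def p_value(z, alternative="two-sided"):
    """p-value of a z-statistic under H0, for the stated alternative."""
    upper = 0.5 * math.erfc(z / math.sqrt(2))    # P(Z >= z)
    if alternative == "two-sided":
        return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|)
    if alternative == "greater":                 # H1: parameter > null value
        return upper
    return 1.0 - upper                           # H1: parameter < null value

alpha = 0.05
z = 1.8
print(p_value(z, "two-sided") <= alpha)  # False: fail to reject H0
print(p_value(z, "greater") <= alpha)    # True: reject H0
```

The same statistic can be significant under a one-sided alternative yet insignificant under a two-sided one, because the two-sided p-value splits \(\alpha\) between the two tails.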
Testing the sample mean
Tom is 185 cm tall and he wants to know whether he is taller than the average male (\(\mu\)).
We adopt significance level \(\alpha=0.05\). Because \(p<\alpha\), we reject the \(H_0\) and accept \(H_1\). We thus conclude Tom is significantly taller than the average.
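The numbers behind this conclusion are not shown here, so the following is only a sketch with a hypothetical sample of male heights, using a large-sample normal approximation to the t distribution (a real analysis would use the t distribution itself, e.g. scipy.stats.ttest_1samp):

```python
import math
import statistics

# Hypothetical sample of male heights in cm -- not data from the text
heights = [176, 171, 168, 180, 174, 169, 177, 172, 175, 170]

mu0 = 185                                  # H0: mu = 185; H1: mu < 185
n = len(heights)
xbar = statistics.mean(heights)
s = statistics.stdev(heights)              # sample standard deviation

t = (xbar - mu0) / (s / math.sqrt(n))      # t-statistic
# One-sided p-value P(T <= t), normal approximation
p = 0.5 * math.erfc(-t / math.sqrt(2))

print(f"t = {t:.2f}, p = {p:.2g}")
print(p < 0.05)  # True: reject H0, the average male is shorter than 185 cm
```

With this made-up sample the t-statistic is strongly negative, so \(p < \alpha\) and we reject \(H_0\), mirroring the conclusion in the text.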
Why t-test?
By convention, we would like to use a \(t\)-test. For large samples, the \(t\)-statistic converges to the standard normal distribution, so the \(t\)-test is equivalent to a \(z\)-test under the CLT. For small samples, the \(t\)-test is valid if the sample is drawn from a normal distribution.
Comparing two sample means
Do male students on average perform better in exams than female students (or the other way around)?
We adopt significance level \(\alpha=0.05\). Because \(p>\alpha\), we fail to reject the \(H_0\). We thus conclude the average score of male students is not significantly different from females.
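A sketch of the two-sample comparison with hypothetical exam scores, using Welch's unequal-variance t-statistic and a normal approximation for the p-value (real data would typically go through scipy.stats.ttest_ind):

```python
import math
import statistics

# Hypothetical exam scores -- illustrative only, not data from the text
male   = [78, 82, 75, 90, 68, 85, 77, 80]
female = [80, 76, 88, 72, 84, 79, 83, 75]

m1, m2 = statistics.mean(male), statistics.mean(female)
v1, v2 = statistics.variance(male), statistics.variance(female)

# Welch's t-statistic: allows unequal variances in the two groups
se = math.sqrt(v1 / len(male) + v2 / len(female))
t = (m1 - m2) / se

# Two-sided p-value, normal approximation
p = math.erfc(abs(t) / math.sqrt(2))

print(f"t = {t:.3f}, p = {p:.3f}")
print(p > 0.05)  # True: fail to reject H0 (no significant difference)
```

Here the group means are nearly identical relative to the spread of the scores, so the t-statistic is close to zero, \(p > \alpha\), and we fail to reject \(H_0\), as in the text.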