60  Confidence intervals

Confidence intervals provide a method of adding more information to an estimator \(\hat{\theta}\) when we wish to estimate an unknown parameter \(\theta\). We can find an interval \((A,B)\) that we think has high probability of containing \(\theta\). The length of such an interval gives us an idea of how closely we can estimate \(\theta\).

Definition 60.1 (Confidence interval) A \(100(1-\alpha)\%\) confidence interval (CI) for \(\theta\) is an interval \([a, b]\) such that the probability that the interval contains the true \(\theta\) is \((1-\alpha)\).

Confidence intervals are reported to indicate the degree of precision of our estimates. The narrower the confidence interval, the more precise the estimate.

We seek an interval which includes the true value with reasonably high probability. A standard choice is \(\alpha=0.05\), corresponding to a 95% confidence interval.

CI of sample mean

Let’s set \(\alpha=5\%\); that is, we want an interval that contains the true mean 95% of the time. Assume the sample size \(n\) is large enough to invoke the CLT, so that, approximately,

\[\begin{aligned} \frac{\bar{X}_{n}-\mu}{\sigma/\sqrt{n}} & \sim N(0,1)\end{aligned}\] Let’s find the interval \([a,b]\) such that \[P\left(a\leq\frac{\bar{X}_{n}-\mu}{\sigma/\sqrt{n}}\leq b\right)=\Phi(b)-\Phi(a)=0.95\]

Since the standard normal distribution is symmetric, we choose \(b=-a\), so that \(\Phi(b)-\Phi(a)=1-2\Phi(a)=0.95\), that is, \(\Phi(a)=0.025\). By looking at the CDF of the standard normal, we get \(a=-1.96\), \(b=1.96\). Thus,

\[P\left(-1.96\leq\frac{\bar{X}_{n}-\mu}{\sigma/\sqrt{n}}\leq1.96\right)=0.95\]

With a little rearrangement, we have

\[P\left(\bar{X}_{n}-1.96\frac{\sigma}{\sqrt{n}}\leq\mu\leq\bar{X}_{n}+1.96\frac{\sigma}{\sqrt{n}}\right)=0.95\] Therefore, the interval \(\left[\bar{X}_{n}-1.96\frac{\sigma}{\sqrt{n}},\bar{X}_{n}+1.96\frac{\sigma}{\sqrt{n}}\right]\) contains the true mean 95% of the time.

In practice, we do not know the true \(\sigma^2\), so we replace it with the sample variance \(s^2\). Theorem 56.2 ensures the result still holds if the sample size is reasonably large.

Theorem 60.1 (Confidence interval of sample mean) For large \(n\), the approximate \(100(1-\alpha)\%\) confidence interval for the population mean \(\mu\), based on the sample mean \(\bar{X}_{n}\), is \[\left[ \bar{X}_{n}- z_{\alpha/2}\frac{s}{\sqrt{n}}, \bar{X}_{n}+ z_{\alpha/2}\frac{s}{\sqrt{n}} \right],\] where \(s^2\) is the sample variance and \(z_{\alpha/2}\) is the critical value such that \(\Phi(z_{\alpha/2})=1-\frac{\alpha}{2}\).
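The interval in Theorem 60.1 can be computed in a few lines of R. The sketch below is only an illustration: the helper name `mean_ci` and the simulated data are assumptions made for this example, and the interval is the large-sample (normal) approximation described above.

```r
# Hypothetical helper (not part of the original text):
# large-sample confidence interval for the population mean
mean_ci <- function(x, conf.level = 0.95) {
  n <- length(x)
  alpha <- 1 - conf.level
  # critical value z_{alpha/2}, i.e. the 1 - alpha/2 quantile of N(0, 1)
  z <- qnorm(1 - alpha / 2)
  half_width <- z * sd(x) / sqrt(n)
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# simulated data, for illustration only
set.seed(1)
x <- rnorm(200, mean = 0.5, sd = 1)

ci95 <- mean_ci(x)                     # 95% interval
ci99 <- mean_ci(x, conf.level = 0.99)  # 99% interval, wider than the 95% one
print(ci95)
print(ci99)
```

A higher confidence level uses a larger critical value, so the 99% interval is always wider than the 95% interval computed from the same sample.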

Some commonly used confidence levels
  • 90% CI: \(\alpha=0.10\), \(z_{0.050}=1.65\)
  • 95% CI: \(\alpha=0.05\), \(z_{0.025}=1.96\)
  • 99% CI: \(\alpha=0.01\), \(z_{0.005}=2.58\)
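The critical values listed above can be reproduced with the standard normal quantile function `qnorm`. The variable names below are chosen for this illustration only.

```r
# confidence levels and the corresponding alpha
conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels

# z_{alpha/2} is the (1 - alpha/2) quantile of the standard normal
z <- qnorm(1 - alpha / 2)
print(round(z, 3))
# 1.645 1.960 2.576

# sanity check: probability mass of N(0, 1) between -1.96 and 1.96
central_mass <- pnorm(1.96) - pnorm(-1.96)
print(central_mass)  # approximately 0.95
```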
# the true mean
mu <- 0.5

# simulating the population
G <- rnorm(1000, mu)

# initialize plot 
plot(NA, 
    xlim = c(-2,2), 
    ylim = c(0,50), 
    xlab="CIs",
    ylab="Index")

# repeat experiment
for (i in 1:50) {
    # sample size (small: with s in place of sigma, coverage falls slightly below 95%)
    k <- 10
    # draw random sample
    x <- sample(G,k)
    # sample mean 
    m <- mean(x)
    # sample standard deviation
    s <- sd(x)
    # confidence interval
    ci_1 <- m - 1.96* s/sqrt(k) 
    ci_2 <- m + 1.96* s/sqrt(k)
    # draw the CI
    segments(ci_1, i, ci_2, i)
}

# the true mean
abline(v=mu, col=2)

Interpretation of CIs

A 95% CI means: 95% of the intervals constructed in this way will cover the true population mean \(\mu\). Once a sample has been taken and an interval constructed, that particular interval either covers \(\mu\) or it doesn’t. But if we were able to take many such samples and reconstruct the interval each time, about 95% of the intervals would contain the true mean.
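The repeated-sampling interpretation can be checked numerically. The following is a minimal sketch, assuming a normal population with mean 0.5 and a sample size of 100 (both chosen here for illustration): it repeats the experiment many times and records how often the interval covers the true mean.

```r
set.seed(42)

mu   <- 0.5     # true mean (assumed for the simulation)
n    <- 100     # sample size
reps <- 10000   # number of repeated experiments
z    <- qnorm(0.975)

# for each experiment, record whether the 95% CI covers mu
covered <- replicate(reps, {
  x <- rnorm(n, mean = mu)
  half_width <- z * sd(x) / sqrt(n)
  abs(mean(x) - mu) <= half_width
})

# proportion of intervals that contain the true mean
coverage <- mean(covered)
print(coverage)
```

The printed proportion should be close to 0.95; it will not be exactly 0.95 because of simulation noise and because \(s\) is used in place of \(\sigma\).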

Common misconceptions

Suppose we have a 95% confidence interval \([2.7,3.7]\). Which of the following statements is true?

1. We are 95% confident that the sample mean is between 2.7 and 3.7.

False. The CI is centered at the sample mean \(\bar{X}\), so it always contains it.

2. 95% of the population observations are in 2.7 to 3.7.

False. The CI is about covering the population mean, not for covering 95% of the entire population.

3. The true mean falls in the interval [2.7, 3.7] with probability 95%.

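False. The true mean \(\mu\) is a fixed number, not a random one. Once the interval \([2.7, 3.7]\) has been computed, it either contains \(\mu\) or it does not; the 95% describes the procedure over repeated samples, not this particular interval.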

4. If a new random sample is taken, we are 95% confident that the new sample mean will be between 2.7 and 3.7.

False. The confidence interval is for covering the population mean, not for covering the mean of another sample.

5. This confidence interval is not valid if the population or sample is not normally distributed.

False. The construction of the CI only uses the normality of the sampling distribution of the sample mean (by the CLT). Neither the population nor the sample is required to be normally distributed.
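Statements 2 and 5 can be illustrated with a small simulation. This is a sketch under assumed settings (an exponential population with mean 1 and samples of size 200, neither of which comes from the text): the population is strongly skewed, yet the interval still covers \(\mu\) in roughly 95% of repetitions, while only a small share of individual observations falls inside any one interval.

```r
set.seed(7)

mu   <- 1      # mean of an Exponential(rate = 1) population
n    <- 200    # sample size
reps <- 5000   # number of repeated experiments
z    <- qnorm(0.975)

# coverage of the true mean under a skewed (non-normal) population
covered <- replicate(reps, {
  x <- rexp(n, rate = 1)
  abs(mean(x) - mu) <= z * sd(x) / sqrt(n)
})
coverage <- mean(covered)
print(coverage)      # close to 0.95 despite the skewed population

# one particular interval, and the share of population values inside it
x  <- rexp(n, rate = 1)
ci <- mean(x) + c(-1, 1) * z * sd(x) / sqrt(n)
population   <- rexp(1e5, rate = 1)
share_inside <- mean(population >= ci[1] & population <= ci[2])
print(share_inside)  # far below 0.95
```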

Applications

Do male students do better in exams than female students (or the other way around)? Comparing the sample means is not enough, because the estimates are subject to sampling error.

Let \(\bar{X}_1\) and \(\bar{X}_2\) be the average scores of female and male students respectively. By the CLT, for large \(n_1\) and \(n_2\), approximately \[\bar{X}_1 \sim N\left( \mu_1, \frac{\sigma_1^2}{n_1} \right)\] \[\bar{X}_2 \sim N\left( \mu_2, \frac{\sigma_2^2}{n_2} \right)\]

If the two samples are independent, the difference in sample means is also approximately normal, and the variances add: \[\bar{X}_1 - \bar{X}_2 \sim N\left( \mu_1-\mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)\]

We construct a confidence interval for \(\mu_1-\mu_2\) based on \(\bar{X}_1 - \bar{X}_2\) and check if it contains \(0\).
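The interval for the difference can be computed directly from the normal approximation above. The sketch below uses simulated scores (the group sizes, means, and standard deviations are made up for this illustration and are not the survey data) and replaces the unknown variances with the sample variances.

```r
set.seed(123)

# simulated scores for two independent groups (illustrative values only)
x1 <- rnorm(120, mean = 80, sd = 10)
x2 <- rnorm(150, mean = 80, sd = 12)

# point estimate of the difference in means
diff_hat <- mean(x1) - mean(x2)

# standard error: sqrt(s1^2 / n1 + s2^2 / n2)
se <- sqrt(var(x1) / length(x1) + var(x2) / length(x2))

# large-sample 95% CI for mu1 - mu2
z  <- qnorm(0.975)
ci <- c(diff_hat - z * se, diff_hat + z * se)
print(ci)

# does the interval contain 0?
contains_zero <- ci[1] <= 0 && 0 <= ci[2]
print(contains_zero)
```

For samples of this size the result is very close to the interval reported by `t.test(x1, x2)`, which uses a t critical value instead of the normal one.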

# load dataset
data <- read.csv("../dataset/survey.csv")

# scores of female (x1) and male (x2) students, as numeric vectors
x1 <- subset(data, Gender == 'F')$Score
x2 <- subset(data, Gender == 'M')$Score

# Welch two-sample t-test; the reported CI is for the difference in means (x1 - x2)
res <- t.test(x1, x2, conf.level = .95)

# CI contains 0, meaning there is no significant difference at the 5% level
cat("Means:",  res$estimate, "CI:", res$conf.int)
Means: 82.72727 81.65 CI: -3.439391 5.593937

The fact that the confidence interval contains \(0\) means that the difference between the two means is not statistically significant at the 5% level. The observed difference in sample means can be explained by estimation error (for example, because the samples are small).