57 Statistical inference
Definition 57.1 (Statistical model) A statistical model is a formalized structure used for quantifying uncertainty, which consists of
- a set of random variables of interest: \(X_1,...,X_n\)
- a specification of the joint distribution of the random variables \(f(X_1,...,X_n; \theta_1,...,\theta_k)\)
- the parameters \(\theta_1,...,\theta_k\) that determine the distribution
- (possibly) the distributions of the parameters \(h(\theta_1,...,\theta_k)\).
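For example, a simple coin-tossing model takes \(X_1,...,X_n\) to be the outcomes of \(n\) tosses, specifies the joint distribution \(f(x_1,...,x_n; \theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i}\), has a single parameter \(\theta\) (the probability of heads), and, in a Bayesian treatment, additionally specifies a prior distribution \(h(\theta)\), e.g. \(\theta \sim \mathrm{Beta}(a, b)\).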
Definition 57.2 (Statistical inference) A statistical inference is a procedure that produces a probabilistic statement about some or all parts of a statistical model (e.g. a parameter, a conditional distribution, etc.).
Definition 57.3 (Parameter space) The characteristics that determine the joint distribution of the random variables of interest are called the parameters of the distribution. The set of all possible values of the parameters is called the parameter space. In the coin-tossing model above, for instance, the parameter space is \(\Theta = [0, 1]\).
There is a debate over whether unknown parameters should be treated as random variables or merely as fixed numbers. Treating parameters as fixed numbers is typically adopted by the frequentist framework: we use the observables to construct the best estimate of the unknown parameter, e.g. \[\bar{X}_n \to_p \mu.\] Treating parameters as random variables is typically associated with the Bayesian framework: we update the distribution of the parameters with the information provided by the observables, \[p(\theta \mid x_1,...,x_n) \propto p(x_1,...,x_n \mid \theta)\, p(\theta).\]
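To make the contrast concrete, here is a minimal Python sketch (ours, not from the text) for the coin-tossing model: the frequentist estimate is the sample mean, while the Bayesian posterior follows from Beta-Bernoulli conjugacy. The true parameter, the sample size, and the flat Beta(1, 1) prior are all chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated coin flips with true (unknown) success probability 0.7.
theta_true = 0.7
x = rng.binomial(1, theta_true, size=50)

# Frequentist: theta is a fixed number; estimate it with the sample mean.
theta_hat = x.mean()

# Bayesian: theta is a random variable with a Beta(a, b) prior.
# By Beta-Bernoulli conjugacy the posterior is Beta(a + #successes, b + #failures).
a, b = 1.0, 1.0                      # illustrative flat prior
a_post = a + x.sum()
b_post = b + len(x) - x.sum()
posterior_mean = a_post / (a_post + b_post)

print(f"frequentist estimate: {theta_hat:.3f}")
print(f"posterior mean:       {posterior_mean:.3f}")
```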
Definition 57.4 (Statistic) Suppose we observe random variables \(X_1,...,X_n\). Let \(g\) be a real-valued function of \(n\) variables. Then the random variable \(T=g(X_1,...,X_n)\) is called a statistic.
Definition 57.5 (Sampling distribution) A statistic \(T\) is a function of random variables, so it is itself a random variable. The distribution of \(T\) is called the sampling distribution of \(T\).
The name comes from the fact that \(T\) depends on the sampling process: a different sample gives a different value of the statistic.
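The sampling distribution can be approximated by simulation: draw many samples, recompute the statistic on each, and look at the resulting values. A minimal sketch for the sample mean of an Exponential(1) sample (the distribution, sample size, and replication count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 10_000

# Draw `reps` independent samples of size n and compute the statistic
# T = sample mean on each one; the resulting values are draws from
# the sampling distribution of T.
samples = rng.exponential(scale=1.0, size=(reps, n))
T = samples.mean(axis=1)

print(f"mean of T:    {T.mean():.3f}  (population mean is 1)")
print(f"std dev of T: {T.std():.3f}  (theory: 1/sqrt(n) = {1/np.sqrt(n):.3f})")
```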
Definition 57.6 (Estimator) Let \(\theta\) be a parameter of interest in a statistical model. An estimator of parameter \(\theta\) is a statistic \(\hat\theta = g(X_1,...,X_n)\) constructed to learn about \(\theta\). If \(X_1=x_1, ..., X_n=x_n\) are observed, then \(g(x_1,...,x_n)\) is called the estimate of \(\theta\).
Constructing an estimator
How to construct an estimator is an art in itself, one that needs to be guided by principles. We introduce methods for constructing estimators in the following sections.
Definition 57.7 (Method of moments) Let \(X_1,...,X_n\) be a random sample from a distribution with at least \(k\) finite moments. Let \(m_j = E(X^j; \theta)\) be the \(j\)-th order moment, \(j=1,...,k\). Suppose the parameter of interest \(\theta\) can be expressed as a function of the moments: \(\theta = M(m_1, ..., m_k)\). The method-of-moments estimator of \(\theta\) is given by \[\hat\theta = M(\hat m_1, ..., \hat m_k),\] where \(\hat m_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j\) is the \(j\)-th sample moment, \(j=1,...,k\).
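As a worked illustration (the Gamma example is ours, not from the text): for a \(\mathrm{Gamma}(k, \theta)\) distribution, \(m_1 = k\theta\) and \(m_2 - m_1^2 = k\theta^2\), so \(k = m_1^2/(m_2 - m_1^2)\) and \(\theta = (m_2 - m_1^2)/m_1\). Plugging in the sample moments gives the method-of-moments estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated sample from Gamma(shape=2, scale=3); in practice the data are given.
k_true, theta_true = 2.0, 3.0
x = rng.gamma(k_true, theta_true, size=5_000)

# Sample moments: m1_hat = mean(X), m2_hat = mean(X^2).
m1 = np.mean(x)
m2 = np.mean(x**2)

# Invert the moment equations  m1 = k*theta,  m2 - m1^2 = k*theta^2.
k_hat = m1**2 / (m2 - m1**2)
theta_hat = (m2 - m1**2) / m1

print(f"k_hat = {k_hat:.3f}, theta_hat = {theta_hat:.3f}")
```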
The method of moments is guided by the LLN, since the sample moments converge to the true moments in large samples. As an example, the method-of-moments estimator for the population mean \(\mu = m_1\) is \[\hat\mu = \frac{1}{n}\sum_{i=1}^{n} X_i.\] Since \(\sigma^2 = m_2 - m_1^2\), the estimator for the population variance is \[\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}_n^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2.\] The LLN ensures \(\hat\sigma^2 \to_p \sigma^2\) as the sample size grows, but \(\hat\sigma^2\) is biased in finite samples: \(E(\hat\sigma^2) = \frac{n-1}{n}\sigma^2\).
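A quick simulation (ours, for illustration) makes that finite-sample bias visible: with \(n = 5\), the method-of-moments estimator averages about \(\frac{n-1}{n}\sigma^2 = 0.8\sigma^2\), while the Bessel-corrected sample variance that divides by \(n-1\) is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 100_000
sigma2 = 1.0  # true variance of N(0, 1)

x = rng.normal(0.0, 1.0, size=(reps, n))

# Method-of-moments estimator: divides by n, biased downward by (n-1)/n.
mom = x.var(axis=1)              # ddof=0 by default
# Bessel-corrected sample variance: divides by n-1, unbiased.
unbiased = x.var(axis=1, ddof=1)

print(f"E[mom]      ~ {mom.mean():.3f}  (theory: {(n - 1) / n * sigma2:.3f})")
print(f"E[unbiased] ~ {unbiased.mean():.3f}  (theory: {sigma2:.3f})")
```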