15 Data descriptives*

A random variable is a mathematical abstraction that provides a bridge between theoretical probability and real-world data. Every dataset can be viewed as observations from random variables.

Although the outcome of any single event is uncertain, we can use patterns in past observations to predict the general behavior of these variables. By collecting data, we can estimate how often certain outcomes occur and connect those frequencies to theoretical distributions.

\[\begin{array}{ccccccc} \textrm{Question with} & \rightarrow & \textrm{Data} & \rightarrow & \textrm{Patterns}\\ \textrm{uncertainty} & & \downarrow & & \downarrow\\ & & \textrm{RVs} & \rightarrow & \textrm{Distributions} & \rightarrow & \textrm{Predictions}\\ \end{array}\]
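
As a rough illustration of linking observed frequencies to a theoretical distribution, the sketch below simulates rolls of a fair die and compares the relative frequency of each face with the theoretical probability of 1/6. The die, the seed, and the sample size are assumptions chosen for illustration; none of them come from the exam data.

```r
# Simulated rolls of a fair six-sided die (an illustrative assumption)
set.seed(42)
rolls <- sample(1:6, size = 6000, replace = TRUE)

# Empirical relative frequency of each face
rel_freq <- table(rolls) / length(rolls)

# Theoretical probability of each face under the uniform model
theoretical <- rep(1 / 6, 6)

# Largest gap between observed and theoretical proportions
max_gap <- max(abs(rel_freq - theoretical))

print(rel_freq)
print(max_gap)
```

With more rolls, the gap between the observed proportions and 1/6 tends to shrink, which is the sense in which data "connect" to a distribution.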

Columns as random variables

In a dataset, we view every column as a random variable.

# Load a dataset from a CSV file
exam <- read.csv("../dataset/exam.csv")

# View the data: each column is a random variable
head(exam)

  id gender     major hw mid final overall
1  1 Female Economics 85  89    74      81
2  2 Female   Finance 90  84    79      83
3  3 Female Economics 90  71    51      65
4  4 Female   Finance 86  84    68      76
5  5   Male   Finance 80  84    67      75
6  6 Female   Finance 96 100    99      99

Summary statistics

We can describe the distribution of a variable with summary statistics such as the mean, the median, and other quantiles (quartiles, deciles, and percentiles).

summary(exam$overall)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  45.00   72.00   77.50   76.56   84.00   99.00
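
Quantiles beyond those reported by `summary()` can be obtained with `quantile()`. The sketch below uses a small vector of hypothetical scores, invented for illustration rather than taken from the exam data, and relies on R's default quantile method.

```r
# Hypothetical scores, invented for illustration (not the exam data)
scores <- c(45, 60, 65, 72, 75, 78, 80, 84, 90, 99)

# Quartiles: the 25th, 50th and 75th percentiles
quartiles <- quantile(scores, probs = c(0.25, 0.50, 0.75))

# Deciles: the 10th, 20th, ..., 90th percentiles
deciles <- quantile(scores, probs = seq(0.1, 0.9, by = 0.1))

# The 50th percentile coincides with the median
median_score <- median(scores)

print(quartiles)
print(deciles)
print(median_score)
```

The same call applied to a column of the dataset, for example `quantile(exam$overall, probs = 0.9)`, would give the score below which roughly 90% of the observations fall.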

Histograms

One way to visualize the distribution of a variable is to plot a histogram. A histogram groups data points into intervals, showing how often data values fall within each range. The horizontal axis represents the intervals (or bins), and the vertical axis shows the count of data points in each bin or, when the histogram is rescaled, the density, so that the areas of the bars sum to one.

A histogram gives an approximation of the true (unknown) distribution. It is not itself the distribution. A distribution refers to the theoretical assignment of probabilities across all possible outcomes, whereas a histogram represents the empirical frequencies observed in a finite sample.

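# prob = TRUE plots densities instead of counts; ann = FALSE omits the title and axis labels
hist(exam$final, prob = TRUE, ann = FALSE)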
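
To see what a histogram computes, the binning can be inspected without drawing it. The following is a minimal sketch on simulated data; the normal distribution, its parameters, and the seed are assumptions standing in for a score column, not the exam data.

```r
# Simulated stand-in for a score column (assumption: not the exam data)
set.seed(1)
x <- rnorm(200, mean = 70, sd = 10)

# plot = FALSE returns the binning without drawing anything
h <- hist(x, plot = FALSE)

# Counts add up to the sample size
total_count <- sum(h$counts)

# Densities are scaled so that the bar areas (height * width) add up to 1
total_area <- sum(h$density * diff(h$breaks))

print(total_count)
print(total_area)
```

Because the simulated sample has a known distribution, drawing `hist(x, prob = TRUE)` and then overlaying `curve(dnorm(x, mean = 70, sd = 10), add = TRUE)` shows how the empirical bars only approximate the theoretical density.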

Boxplots

A boxplot, also known as a box-and-whisker plot, displays the median, quartiles, and range of the data. The box spans the interquartile range (IQR), which contains the middle 50% of the data; its lower and upper edges correspond to the first quartile (Q1) and the third quartile (Q3), and the line inside the box marks the median. Each whisker extends from the box to the most extreme observation that lies within 1.5 times the IQR of Q1 or Q3. Points beyond the whiskers are plotted individually and treated as potential outliers.

boxplot(final ~ major, data = exam)
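
The numbers behind a boxplot can be retrieved with `boxplot.stats()`. The sketch below uses hypothetical scores, invented for illustration and not taken from the exam data, with one deliberately low value so that the 1.5 × IQR rule flags it.

```r
# Hypothetical scores, invented for illustration (not the exam data)
scores <- c(20, 60, 65, 70, 72, 75, 78, 80, 85, 90, 95)

# boxplot.stats() returns the five numbers drawn by boxplot():
# lower whisker, lower hinge, median, upper hinge, upper whisker
b <- boxplot.stats(scores)  # default coef = 1.5

# Width of the box, and the fences that separate whisker range from outliers
box_width <- b$stats[4] - b$stats[2]
lower_fence <- b$stats[2] - 1.5 * box_width
upper_fence <- b$stats[4] + 1.5 * box_width

print(b$stats)                     # 60.0 67.5 75.0 82.5 95.0
print(c(lower_fence, upper_fence)) # 45 105
print(b$out)                       # 20
```

Note that R draws the box edges at the hinges from `fivenum()`, which can differ slightly from the quartiles reported by `quantile()` or `summary()` for some sample sizes.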

Scatter plots

To examine the relationship between two variables, we can make a scatter plot of \(Y\) against \(X\). An upward-sloping pattern indicates that the variables tend to move together (positive correlation), whereas a downward-sloping pattern suggests that they move in opposite directions (negative correlation). A flat pattern with no visible slope indicates little or no linear correlation between the two variables.

plot(final ~ mid, data = exam)
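
The direction seen in a scatter plot can be summarized numerically with the correlation coefficient from `cor()`. The sketch below uses simulated data; the variable names, parameters, and seed are assumptions for illustration and do not come from the exam data. The last case shows why a correlation of zero only rules out a linear relationship.

```r
# Simulated stand-in for two score columns (assumption: not the exam data)
set.seed(1)
mid <- rnorm(100, mean = 75, sd = 10)
final_up   <- mid + rnorm(100, sd = 5)        # tends to move with mid
final_down <- 150 - mid + rnorm(100, sd = 5)  # tends to move against mid

r_up   <- cor(mid, final_up)    # positive: upward-sloping cloud
r_down <- cor(mid, final_down)  # negative: downward-sloping cloud

# A perfectly dependent but non-linear relationship with zero correlation
x <- -5:5
y <- x^2
r_curve <- cor(x, y)

print(c(r_up, r_down, r_curve))
```

Plotting `plot(y ~ x)` for the last pair would show a clear U-shape, so a scatter plot can reveal structure that the correlation coefficient alone misses.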