System setup

Click ‘code’ on the left to view the R setup.

# Load packages
library(dplyr)
library(tidyr)
library(ggplot2)
library(gganimate)
library(knitr)

# set seed
set.seed(123)

# knitr chunk options
opts_chunk$set(echo = TRUE,
               eval = TRUE,
               include = TRUE,
               warning = FALSE,
               message = FALSE,
               cache = TRUE,
               cache.extra = rand_seed,
               fig.width = 7,
               fig.height = 7,
               tidy = FALSE)

Background

Click ‘code’ on the left to show the code used to generate the figures in this section.

I have wanted to try David Robinson’s (@drob) gganimate package since I came across it a few months ago. The package extends Hadley Wickham’s (@hadleywickham) ggplot2 package by adding a ‘frame’ aesthetic that provides a time factor to geoms, allowing them to be animated.

Recently, the opportunity for me to put the package through its paces came while I was preparing materials for an introductory biostatistics tutorial for undergrad students. I wanted to demonstrate the central limit theorem and law of large numbers, and thought that animations would help deliver the message.

The central limit theorem provides a shortcut to knowing the sampling distribution, which is the probability distribution of a statistic (e.g., mean, proportion) for all possible samples from a population. The theorem is one of the cornerstones of probability theory because it allows you to make statements about the sampling distribution for a particular statistic without having to sample the entire population. As such, it forms the basis of statistical inference.

The central limit theorem states that the sampling distribution of an independent, random variable is normal or near-normal if a sufficiently large number of (equal-sized) samples are taken. If the sampling distribution for a statistic follows a normal or near-normal distribution we can make probability statements about the range of values in which the statistic lies. For example, there is a 68% probability that the sample statistic is within 1 standard deviation of the population value, and a 95% probability that it lies within about 2 standard deviations (see figure below). In the case of the sampling distribution, the standard deviation of the distribution is also called the standard error.

# Generate formal distribution of mean = 0, sd = 1
norm.df <- rnorm(200000, mean = 0, sd = 1) %>%
    data_frame(values = .)

# Calculate density from norm.df
norm.dens <- density(norm.df$values)
norm.dens <- data_frame(x = norm.dens$x, y = norm.dens$y)

# Calculate quantiles from norm.df
q2.5 <- quantile(norm.df$values, 0.025)
q97.5 <- quantile(norm.df$values, 0.975)
q16 <- quantile(norm.df$values, 0.16)
q84 <- quantile(norm.df$values, 0.84)

# Use quantiles to filter data for geom_ribbon
dens_95 <- norm.dens %>%
    filter(x > q2.5 & x < q97.5) %>%
    mutate(quant = rep('2SD', length(x)))
dens_68 <- norm.dens %>%
    filter(x > q16 & x < q84) %>%
    mutate(quant = rep('1SD', length(x)))
ribbon <- do.call(rbind, list(dens_95, dens_68))

# Plots
norm.plot <- ggplot() +
    geom_density(data = norm.df, aes(x = values),
                 fill = '#FFFFFF') +
    geom_ribbon(data = ribbon[ribbon$quant == '2SD', ],
                              aes(x = x, ymax = y, ymin = 0), 
                fill = '#00517f') +
    geom_ribbon(data = ribbon[ribbon$quant == '1SD', ],
                              aes(x = x, ymax = y, ymin = 0),
                fill = '#0072B2') +
    geom_density(data = norm.df, aes(x = values),
                 size = 1) +
    geom_segment(aes(x = -1, xend = 1,
                     y = 0.22, yend = 0.22), 
               colour = "#FFFFFF", size = 1.5) +
    geom_label(aes(label = '1 SD: 68%', x = 0, y = 0.22), 
              colour = 'gray10', fill = '#FFFFFF') +
    geom_segment(aes(x = -1.96, xend = 1.96,
                     y = 0.05, yend = 0.05), 
               colour = "#FFFFFF", size = 1.5) +
    geom_label(aes(label = '2 SD: 95%', x = 0, y = 0.05), 
              colour = 'gray10', fill = '#FFFFFF') +
    scale_x_continuous(breaks = seq(from = -4, to = 4, by = 1)) +
    labs(x = '\nStandard deviation (SD)', y = 'Density\n') +
    theme(plot.title = element_text(size = 17), 
          axis.text = element_text(size = 17),
          axis.title = element_text(size = 17),
          axis.ticks = element_blank(),
          panel.grid.major = element_line(colour = 'gray80', size = 0.3),
          panel.background = element_blank())
# Save plot
ggsave('normal-distr.png', norm.plot, width = 7, height = 5)