Today we will learn

  1. How to work with RMarkdown.

  2. How to look at Probability Distributions in R.

     1. Discrete Probability Distributions.
     2. Continuous Probability Distributions.

  3. What the Central Limit Theorem is.


1 RMarkdown revisited

You have already started getting acquainted with R Markdown while working on the first assignment. Now it’s time to learn a bit more about it.

This is an R Markdown document. It has the file extension .Rmd. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit or Preview button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document. An R code chunk is anything that goes between ```{r} ... ```.

You can embed an R code chunk by typing it out or with a shortcut in RStudio (Mac: Cmd+Alt+I, Windows/Linux: Ctrl+Alt+I).

Here is a code chunk for you:

# The first line sets an option for the final document that can be produced from
# the .Rmd file. Don't worry about it.
knitr::opts_chunk$set(
  echo = TRUE, # show the code in the final document
  collapse = TRUE # show code and its output together, in one block
)

# The next bit is quite powerful and useful.
# First you define which packages you need for your analysis and assign them to
# the p_needed object.
p_needed <-
  c("viridis","ggplot2")

# Now you check which packages are already installed on your computer.
# The function installed.packages() returns a vector with all the installed
# packages.
packages <- rownames(installed.packages())
# Then you check which of the packages you need are not installed on your
# computer yet. Essentially you compare the vector p_needed with the vector
# packages. The result of this comparison is assigned to p_to_install.
p_to_install <- p_needed[!(p_needed %in% packages)]
# If at least one element is in p_to_install you then install those missing
# packages.
if (length(p_to_install) > 0) {
  install.packages(p_to_install)
}
# Now that all packages are installed on the computer, you can load them for
# this project. Additionally the expression returns whether the packages were
# successfully loaded.
sapply(p_needed, require, character.only = TRUE)
## Loading required package: viridis
## Loading required package: viridisLite
## Loading required package: ggplot2
## viridis ggplot2 
##    TRUE    TRUE

1.1 Chunk Options

Chunk output in your final document can be customized with knitr options, arguments set in the {} of a chunk header. Here are some useful arguments:

  • include = FALSE prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
  • echo = FALSE prevents the code, but not the results, from appearing in the finished file. This is a useful way to embed figures.
  • message = FALSE prevents messages that are generated by code from appearing in the finished file.
  • warning = FALSE prevents warnings that are generated by code from appearing in the finished file.
  • fig.cap = "..." adds a caption to graphical results.
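For example, a chunk that should run and show its plot, but hide the code and any startup messages, could be declared like this (the chunk name and caption are purely illustrative):

````markdown
```{r pmf-plot, echo = FALSE, message = FALSE, fig.cap = "PMF of a fair coin"}
plot(dbinom(0:10, size = 10, prob = 0.5))
```
````

The options are written inside the `{r ...}` header, separated by commas.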

See the R Markdown Reference Guide for a complete list of knitr chunk options.

1.2 Text formatting options

Markdown also provides an easy way to format the text in your document. Here are some basics:
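For instance (standard Markdown syntax):

```markdown
*italics* and **bold**
`inline code`
# Header 1
## Header 2
[linked text](http://rmarkdown.rstudio.com)
- unordered list item
1. ordered list item
```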

For more formatting options, we recommend the R Markdown Cheat Sheet.


2 Probability Distributions

All the probability functions are defined in R already. To call them you have to combine a prefix and the distribution name.

The prefixes you can use are d, p, q and r. When to use which prefix?

  1. Use d when you know x and want to know the density at that point.

    • The result is a point of the probability mass/density function.
    • If you put in a range of x values you will get the PDF/PMF of the distribution.
  2. Use p when you know q and want to know the probability up to that point.

    • The result is a point on the cumulative distribution function or the area under the probability density function.
    • If you put in a range of values you will get the CDF of the distribution.
  3. Use q when you know the area p and want to know the corresponding x value.

    • This is exactly the opposite to p.
  4. Use r if you want random numbers drawn from a distribution.

    • The result is a sample from a distribution.
The distribution names are:

  • binom for the Binomial distribution
  • norm for the Normal distribution
  • unif for the Uniform distribution

For each of the distributions you need to specify different parameters!
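Putting prefix and distribution name together, here is a small sketch; note that each distribution takes its own parameters (size and prob for the Binomial, mean and sd for the Normal, min and max for the Uniform):

```r
# Binomial: parameters size and prob
dbinom(x = 2, size = 10, prob = 0.5)  # P(X = 2) in 10 fair coin flips

# Normal: parameters mean and sd
pnorm(q = 0, mean = 0, sd = 1)        # P(X <= 0), which is 0.5 by symmetry

# Uniform: parameters min and max
punif(q = 0.25, min = 0, max = 1)     # P(X <= 0.25), which is 0.25

# q is the inverse of p
qnorm(p = pnorm(q = 1.5))             # returns 1.5
```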

2.1 Overview of R functions

Visually, you can see what each of these functions does in the following plots:

[Figure: Discrete Distribution]

[Figure: Continuous Distribution]


2.2 Discrete Probability Distributions

Let’s start with Discrete Probability Distributions. Can you think of examples of Discrete Probability Distributions?

  • Coin flip (multiple times)
  • Number of votes obtained
  • Number of members in a household
  • Gender of a random person

2.2.1 The Bernoulli Distribution

The Bernoulli distribution is the simplest Discrete Probability Distribution. It is a special case of the Binomial distribution, namely a Binomial with size 1. Thus, it is ideal for modelling one-time yes/no (success/failure) events.

dbinom(x = 1, size = 1, prob = 0.8)  # What does the result tell us?  
## [1] 0.8

dbinom(x = 0, size = 1, prob = 0.8)  # What does the result tell us?
## [1] 0.2
  • x = number of successes (e.g. passing an exam)
  • size = number of trials (1 coin flip)
  • prob = probability of ‘success’ (e.g. probability of passing an exam)

The best example is one coin flip (or one voter voting yes/no, a person being either a woman/man). Let’s simulate one simple coin toss:

rbinom(n = 1, size = 1, prob = 0.5)
## [1] 0

Ok, we could have done this with a real coin. Let’s toss one coin 1000 times. Any volunteers?

Luckily, R can simulate this quicker.

trial1 <- rbinom(n = 1000, size = 1, prob = 0.5)
  • n = number of observations (i.e. how many coin flips we want to observe)

And now we want to see how successful we were:

table(trial1)
## trial1
##   0   1 
## 474 526
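Is a 474/526 split plausible for a fair coin? The number of heads in 1000 flips follows a Binomial(1000, 0.5), so we can compare the observed counts with the expected count and its standard deviation (a quick check; your simulated counts will differ):

```r
n <- 1000
p <- 0.5

n * p                  # expected number of heads: 500
sqrt(n * p * (1 - p))  # standard deviation: about 15.8
```

Both simulated counts lie well within two standard deviations of 500, so this split is nothing unusual.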

2.2.2 The Binomial Distribution

We get a binomial distribution if we – like in the lecture example – toss a coin twice. What are the possible outcomes, the so-called sample space?

  • Sample space: {HH, HT, TH, TT}

  • What is the probability of observing exactly 0 (1, 2) heads (successes)?

dbinom(x = c(0, 1, 2),
       size = 2,
       prob = 0.5)
## [1] 0.25 0.50 0.25
  • What is the probability of observing 0 or 1 heads (successes)?
pbinom(1, 2, 0.5)
## [1] 0.75
  • What is the probability of more than 1 heads (successes)?
1 - pbinom(1, 2, 0.5)
## [1] 0.25

# or:

pbinom(1, 2, 0.5, lower.tail = F)
## [1] 0.25
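As a sanity check, the value from pbinom() is just the dbinom() probabilities summed up to that point:

```r
# P(X <= 1) via the CDF ...
pbinom(1, size = 2, prob = 0.5)
## [1] 0.75

# ... equals the PMF values at 0 and 1, added up
sum(dbinom(0:1, size = 2, prob = 0.5))
## [1] 0.75
```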

2.2.3 Example I: Coin Flips

We talked about probability in a frequentist understanding. Let’s check if repeating this experiment 1,000 times gives us the values we would expect.

Out of 1000 repetitions, you would expect about 250 HH, 500 HT or TH, and 250 TT.

ex_trial <- rbinom(n = 1000, size = 2, prob = 0.5)
table(ex_trial)
## ex_trial
##   0   1   2 
## 249 501 250
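To make the comparison explicit, we can put the simulated relative frequencies next to the theoretical probabilities (your numbers will differ slightly, since no seed is set):

```r
ex_trial <- rbinom(n = 1000, size = 2, prob = 0.5)  # re-drawing the sample

# empirical relative frequencies vs. theoretical PMF
rbind(empirical   = table(ex_trial) / 1000,
      theoretical = dbinom(0:2, size = 2, prob = 0.5))
```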

2.2.4 The PMF and CDF of a Binomial

Let’s plot the binomial probability mass function to get a better understanding.

2.2.4.1 Base R

We start by creating an x-axis and an empty plot.

# First, we need an x-axis
x_values <- 0:100

# Now we produce an empty plot.
plot(
  x = x_values,
  ylim = c(0, 0.1), # With ylim you can adjust the limits of the y axis.
  type = "n",       # type "n" produces an empty plot window.
  main = "PMF for Binomial (n = 100, p = 0.5)",
  xlab = "x",
  ylab = "p(x)",
  las = 1,
  frame = F
  )

# Let's add the points.
points(x_values, 
       dbinom(x_values, size = 100, p = 0.5), 
       cex = 0.3, 
       pch = 19, 
       col = viridis(1)
       )

# And those lines we know from the lecture.
segments(x0 = x_values, 
         y0 = 0, 
         x1 = x_values, 
         y1 = dbinom(x_values, size = 100, p = 0.5), 
         col = viridis(1)
         )  

Question: Why should we not use a line to connect the points?

\(\Rightarrow\) … because the outcome space is discrete, i.e. there is nothing like 2.5 heads out of a given number of coin flips.

What does the cumulative distribution function of the same Binomial look like?

plot(x = x_values, 
     y = pbinom(x_values, size = 100, p = 0.5),
     type = "p",
     main = "CDF for Binomial (n = 100, p = 0.5)",
     xlab = "x",
     ylab = "Probability",
     bty = "n",
     las = 1,
     pch = 19, 
     cex = 0.3, 
     col = viridis(1),
     frame = F
     )

2.2.4.2 ggplot2

ggplot2 is a very widely used visualization package in the R community. We will provide you with code using this package in addition to code using the built-in functions of R.

First, a few words about ggplot2. When working with this package, we think of images as being constructed from layers: the data level, the aesthetic mapping level, and the geom level. These three are essential, but there are more:

  • Data: your data frame
  • Aesthetic mapping: how data (variables) become plot attributes (axes, color, sizes)
  • Geoms: geometric representations of data (points, lines, bars)
  • Scales: modifying the mapping from data to plot (customizing axes, colors, sizes)
  • Facets: sub-panels in the plot
  • Coordinates: features of the coordinate system (orientation…)

With the ggplot2 package, it is most effective to work with datasets. So we’ll start by putting all of our vectors into a data.frame object.

library(ggplot2) 

# create a dataset
binom_data <- data.frame(x = x_values, 
                         density = dbinom(x_values, size = 100, p = 0.5))

# plot 
ggplot( # empty plot for our data
  data = binom_data, # data used for plotting
  mapping = aes(x = x, y = density) # axes
) + 
  geom_point(color = viridis(1)) + # add points
  geom_segment(aes( # add vertical lines
    x = x, xend = x, # segments sit at each x value
    y = 0, yend = density # and run from 0 up to the density
  ),
  color = viridis(1) # color the lines
  ) +
  theme_minimal() + # change the appearance 
  labs(
    title = "PMF for Binomial (n = 100, p = 0.5)",
    y = "p(x)"
  )  

Question: Why should we not use a line to connect the points?

\(\Rightarrow\) … because the outcome space is discrete, i.e. there is nothing like 2.5 heads out of a given number of coin flips.

What does the cumulative distribution function of the same Binomial look like?


# Add new column to our dataset 

binom_data$CDF <- pbinom(x_values, size = 100, p = 0.5) 

ggplot(
  data = binom_data, # data used for plotting
  mapping = aes(x = x, y = CDF) # axes
) +
  geom_point(color = viridis(1)) + # add points
  theme_minimal() + # change the appearance 
  labs(
    title = "CDF for Binomial (n = 100, p = 0.5)",
    y = "P(x)"
  )  

2.2.5 Example II: Airline Reservations

A flight has 100 available seats, but the airline accepted 120 reservations (overbooking).

The probability of each person showing up is 0.85.

Now the airline wants to know the probability that more than 100 people show up.

  • Remember the binomial process:

    • n independent trials
    • only two possible outcomes (show up/don’t show up)
    • success probability remains constant

We want to know:

\(Pr(X>100)\).

This is \(Pr(X=101)+Pr(X=102)+\ldots+Pr(X=120)\) (equal to \(1-Pr(X\leq100)\)).

  a. What is the probability that more than 100 people show up?
# Answer to a):
1 - pbinom(q = 100, size = 120, p = 0.85)
## [1] 0.6586167
  b. Plot the cumulative distribution function of the problem.

Base R

# Answer to b):
passengers  <- 0:120 # for x-axis

plot(x = passengers, 
     y = pbinom(q = passengers, size = 120, p = 0.85),
     bty = "n",
     las = 1,
     pch = 19,
     xlab = "Number of People Showing Up",
     ylab = "Probability",
     main = "Flight Overbooking Probability\nCDF of Binomial (n = 120, p = 0.85)",
     cex = 0.5, 
     col = viridis(1),
     frame = F
     )

ggplot2

# Answer to b):

# create a dataframe 
airline_data <- data.frame(x = 0:120,
                           CDF = pbinom(q = 0:120, size = 120, p = 0.85))
# plot 
ggplot(data = airline_data,  # data used for plotting
       mapping = aes(x = x, y = CDF)) +
  geom_point(color = viridis(1)) + # add points
  theme_minimal() + # change the appearance
  labs(x = "Number of People Showing Up",
       y = "Probability",
       title = "Flight Overbooking Probability\nCDF of Binomial (n = 120, p = 0.85)")  

  c. What is the probability of more than 100 people showing up if the airline limits overbooking to 110?
# Answer to c):
1 - pbinom(q = 100, size = 110, p = 0.85)
## [1] 0.02442528

2.3 Continuous Probability Distributions

We will often be looking at the normal distribution. The most common one is the standard normal distribution.

What do you know about the standard normal distribution?

Let’s go through those prefixes for the normal distribution again. If you want to know the probability that \(x < 1.5\), use \(p\):

pnorm(q = 1.5, mean = 0, sd = 1)
## [1] 0.9331928

Conversely, if you want to know the probability that \(x > 1.5\), use:

1 - pnorm(q = 1.5, mean = 0, sd = 1)
## [1] 0.0668072

If you want to know below which value of \(x\) 30% of the probability mass lies, use \(q\):

qnorm(p = 0.3, mean = 0, sd = 1)
## [1] -0.5244005

If you want to generate some random numbers from the normal distribution, use \(r\):

rnorm(n = 20, mean = 0, sd = 1)
##  [1]  1.31448559  1.06028156 -0.08608919  0.05084223  1.47368483  1.79681256
##  [7] -0.36423616 -2.24560283  0.16885770  0.95537072  1.19338135 -0.43046938
## [13]  1.89498528 -0.70042200 -0.49988659  1.29233980 -0.63823273 -0.10754757
## [19] -1.33323577 -1.26583208
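With only 20 draws, the sample statistics are still quite noisy. With many more draws, the empirical mean and standard deviation settle close to the true parameters (a quick sketch):

```r
big_sample <- rnorm(n = 100000, mean = 0, sd = 1)

mean(big_sample)  # close to 0
sd(big_sample)    # close to 1
```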

It is a good idea to plot those distributions (since R can draw better than me):

Base R

x_values <- seq(from = -5, to = 5, by = 0.1)

plot(x = x_values, 
     y = dnorm(x_values),
     bty = "n", 
     las = 1,
     type = "l",
     col = viridis(1),
     lwd = 2,
     main = "PDF of N(0, 1)",
     ylab = "f(x)",
     xlab = "x"
     )

plot(x = x_values, 
     y = pnorm(x_values),
     bty = "n", 
     las = 1,
     type = "l",
     col = viridis(1),
     lwd = 2,
     main = "CDF of N(0, 1)",
     ylab = "F(x)",
     xlab = "x"
     )

# What if we have a different mean and sd?
plot(x = x_values, 
     y = dnorm(x_values, mean = 3, sd = 2),
     bty = "n", 
     las = 1,
     type = "l",
     col = viridis(1),
     lwd = 2,
     main = "PDF of N(3, 4)",
     ylab = "f(x)",
     xlab = "x"
     )

ggplot2

x_values <- seq(from = -5, to = 5, by = 0.1)

# ggplot can also work without datasets

ggplot() + # this creates an empty plot
  geom_line(aes(x = x_values, y = dnorm(x_values)),
            color = viridis(1)) +
  labs(title = "PDF of N(0, 1)",
       y = "f(x)",
       x = "x") +
  theme_bw()

ggplot() + # this creates an empty plot
  geom_line(aes(x = x_values, y = pnorm(x_values)),
            color = viridis(1)) +
  labs(title = "CDF of N(0, 1)",
       y = "F(x)",
       x = "x") +
  theme_bw()

ggplot() + # this creates an empty plot
  geom_line(aes(x = x_values, y = dnorm(x_values, mean = 3, sd = 2)),
            color = viridis(1)) +
  labs(title = "PDF of N(3, 4)",
       y = "f(x)",
       x = "x") +
  theme_bw()

2.3.1 Example III: Normal Approximation of the Binomial

We saw that the binomial distribution can be approximated by the normal. Let’s calculate the airplane reservation example using a normal distribution.

First, we have to think about the underlying distribution. Using the central limit theorem, we can approximate the binomial process with a normal distribution. 85 percent of the people are expected to show up, so the mean can be calculated by \(E(X)=n*p\). The variance can be calculated using \(Var(X)=n*p*(1-p)\) (see slide 15).

Let’s calculate the mean and the variance of the normal distribution:

n <- 120
p <- 0.85
mean <- n * p
sd <- sqrt((n * p * (1 - p)))  # Why do we have to use the square root?
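To answer the question in the comment: the formula \(Var(X)=n*p*(1-p)\) gives the variance, but R's dnorm()/pnorm() functions expect the standard deviation, i.e. the square root of the variance. A quick numeric check:

```r
n <- 120
p <- 0.85

n * p                  # E(X) = 102
n * p * (1 - p)        # Var(X) = 15.3
sqrt(n * p * (1 - p))  # sd, about 3.91 -- this is what pnorm() needs
```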

Let’s solve an example with the normal distribution:

  1. What is the probability that more than 100 people will show up, if we accept 120 reservations?
  2. Plot the normal distribution.
  3. Compare your result to the Binomial process from above.

Base R

# part a)
1 - pnorm(q = 100, mean = mean, sd = sd)
## [1] 0.695433

# part b)
passengers <- seq(from = 0, to = 120, by = 0.01)
densities <- dnorm(passengers, mean = mean, sd = sd)

plot(
  x = passengers,
  y = densities,
  bty = "n",
  las = 1,
  col = viridis(1),
  xlab = "Number of People Showing Up",
  ylab = "Probability Density",
  type = "l", 
  lwd = 2, 
  cex = 2, 
  main = "Flight Overbooking Probability \nPDF of N(mean = 102, var = 15.3)"
)


# part c)
1 - pnorm(q = 100, mean = mean, sd = sd)
## [1] 0.695433
1 - pbinom(q = 100, size = n, prob = p)
## [1] 0.6586167

ggplot2

# part a)
1 - pnorm(q = 100, mean = mean, sd = sd)
## [1] 0.695433

# part b)
# add the normal densities to the dataset from before
airline_data$PDF_normal <- dnorm(airline_data$x, mean = mean, sd = sd)

ggplot(airline_data,
       aes(x = x,
           y = PDF_normal)) +
  geom_point(color = viridis(1)) +
  labs(
    title = "Flight Overbooking Probability \nPDF of N(mean = 102, var = 15.3)",
    x = "Number of People Showing Up",
    y = "Probability Density"
  )  +
  theme_bw()


# part c)
1 - pnorm(q = 100, mean = mean, sd = sd)
## [1] 0.695433
1 - pbinom(q = 100, size = n, prob = p)
## [1] 0.6586167
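The normal answer (about 0.695) overshoots the exact binomial answer (about 0.659). A standard refinement, not required here but good to know, is the continuity correction: since the binomial is discrete, \(Pr(X>100) = Pr(X\geq101)\), which the normal approximates better as the area above 100.5:

```r
n <- 120
p <- 0.85
mean <- n * p
sd <- sqrt(n * p * (1 - p))

# continuity correction: evaluate the normal CDF halfway between 100 and 101
1 - pnorm(q = 100.5, mean = mean, sd = sd)  # closer to the exact binomial value
```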

3 Central Limit Theorem

Finally, we want to better understand the central limit theorem. Let’s generate a random population:

pop <- rnorm(1000, mean = 32, sd = 5) # we know the "true" values
summary(pop)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.56   28.53   32.04   31.86   35.25   46.98
var(pop)
## [1] 24.73022

hist(pop,
     bty = "n",
     las = 1,
     border = "white",
     col = viridis(1)
     )

plot(density(pop),
     bty = "n",
     las = 1,
     col = viridis(1),
     lwd = 2,
     main = "Density of pop"
     )

sample(pop, 200) # take a random sample of size 200 from the population
##   [1] 36.93082 36.54696 23.45970 35.51881 27.37409 32.51348 25.10221 35.18597
##   [9] 37.13408 35.06441 33.52704 34.45158 35.63108 40.29986 29.14245 30.98282
##  [17] 21.66426 22.00992 31.49777 36.54033 33.88720 26.79773 27.60831 33.10906
##  [25] 34.12169 30.22840 24.55076 30.54083 34.36782 32.64015 33.84921 36.36075
##  [33] 38.23185 27.06962 29.22697 32.95937 39.42808 24.55226 26.12642 38.43527
##  [41] 29.25858 27.46874 37.28140 33.99208 34.50292 28.29550 28.97459 33.76632
##  [49] 28.14253 29.27708 38.49585 36.35217 33.85595 32.31176 41.27658 36.32172
##  [57] 36.52738 29.58460 35.78529 39.08515 31.38177 26.39403 27.60635 25.74090
##  [65] 21.89504 31.41977 30.71153 30.53805 29.64588 35.51645 35.27028 22.95281
##  [73] 32.79409 32.35229 38.91122 24.88910 30.95103 29.97282 34.72249 36.20102
##  [81] 31.06040 28.53275 26.48491 37.34261 24.85404 36.04349 31.28513 34.01126
##  [89] 16.73428 24.78637 33.88933 27.04017 34.40393 35.93792 30.44927 32.32813
##  [97] 28.91513 29.03460 40.32361 29.23619 31.97603 37.90542 26.40260 26.50592
## [105] 33.32107 41.50311 18.47899 35.64270 38.09545 34.32512 29.22667 24.01326
## [113] 28.65408 28.99791 26.27415 30.76394 19.70311 31.17917 25.26360 29.54748
## [121] 35.34261 32.00747 32.31820 38.93889 23.64730 38.94296 28.93498 42.16050
## [129] 26.94477 25.42930 29.45360 24.73424 25.86380 31.30271 39.39200 38.73369
## [137] 27.16565 38.11936 30.51352 28.90673 26.19865 40.03153 28.62669 35.10153
## [145] 33.56588 31.70024 27.65010 35.33694 37.23552 33.85825 32.28117 28.09501
## [153] 25.84925 38.88781 35.85157 37.66122 42.67684 25.88897 25.35223 35.59036
## [161] 28.61398 36.09913 34.15776 31.01985 33.25578 34.19536 35.12530 39.79903
## [169] 26.65571 22.80853 34.44628 30.06352 32.63694 32.05541 44.40386 31.94904
## [177] 29.41028 33.82357 29.05341 29.71995 23.00324 30.83428 25.17209 33.78374
## [185] 35.80245 35.53615 31.13256 28.59384 34.15877 40.21165 40.85183 29.43100
## [193] 39.20999 39.28381 38.93926 23.49569 30.24117 28.98075 28.76855 27.55243

3.1 CLT Trial 1: Sample Size = 100

Let’s take 100 samples with 100 persons in each sample. First, we create an empty vector - trial1 - of length 100 to store the results:

trial1 <- rep(NA, 100) 

Now we have to do some programming. We create a for loop counting \(i\) from 1 to 100. In each iteration, the loop saves the mean of a fresh sample in the vector trial1.

for (i in 1:100){
  trial1[i] <- mean(sample(pop, 100))
}

trial1
##   [1] 30.93682 31.46890 31.45704 31.53510 31.95533 32.31625 33.03437 31.80165
##   [9] 32.49047 32.77467 31.67344 31.51761 32.14674 31.81019 32.14353 32.30401
##  [17] 31.78063 32.10755 31.33877 31.20198 31.42197 32.55274 32.09743 31.85166
##  [25] 31.47953 31.83981 32.38201 31.08044 31.67311 30.91909 31.53512 31.04304
##  [33] 32.06269 31.22232 32.50422 31.72397 31.65665 32.18320 31.97848 31.69677
##  [41] 32.11451 31.38452 31.97795 32.05717 32.09142 31.44163 31.78505 32.58827
##  [49] 31.96724 31.46066 31.46748 31.76464 32.30361 31.20908 31.89620 31.59008
##  [57] 31.99709 31.47925 32.11829 31.64346 31.89097 31.74634 31.27185 32.14840
##  [65] 32.14844 31.82621 31.56987 31.67060 32.26069 31.95811 32.34756 32.00648
##  [73] 31.95268 31.64926 31.73842 31.55474 31.81083 31.88900 31.59826 31.72610
##  [81] 31.99750 31.82868 32.45928 31.55107 31.62084 31.62700 31.22489 32.31652
##  [89] 31.89759 31.14663 30.81093 31.97516 31.80280 32.09413 32.83339 31.89023
##  [97] 32.02146 32.07605 30.73424 30.93446
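A for loop is the most explicit way to do this, but the same sampling distribution can be generated in one line with replicate(), which evaluates an expression a fixed number of times (the result differs from the loop above only by randomness):

```r
pop <- rnorm(1000, mean = 32, sd = 5)  # the population, as generated above

trial1_alt <- replicate(100, mean(sample(pop, 100)))

length(trial1_alt)
## [1] 100
```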

Let’s have a look at our sampling distribution:

hist(trial1,
     bty = "n",
     las = 1,
     border = "white",
     col = viridis(1),
     xlim = c(30, 34),
     main = "Distribution of sample means\n(sample size: n = 100)"
     )


mean(trial1)
## [1] 31.80645
var(trial1)
## [1] 0.1958393

3.2 CLT Trial 2: Sample Size = 400

We do it again. But this time we will take 100 samples with 400 people in each sample. It works just as above.

trial2 <- rep(NA, 100)

for (i in 1:100){
  trial2[i] <- mean(sample(pop, 400))
}

hist(trial2,
     bty = "n",
     las = 1,
     border = "white", 
     col = viridis(1),
     xlim = c(30, 34), 
     main = "Distribution of sample means\n(sample size: n = 400)"
     )


mean(trial2)
## [1] 31.87106
var(trial2)
## [1] 0.03688213

Does the variance of the sampling distribution behave according to \(\frac{sd^2}{n}\)?

The original standard deviation was 5, so \(\frac{25}{400} = 0.0625\) should be the variance of our sampling distribution!

var(pop)/400
## [1] 0.06182554
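The simulated variance came out noticeably smaller than this prediction (roughly 0.037 vs. 0.062). A plausible reason, going beyond the lecture formula (which assumes independent draws): sample() draws without replacement, and drawing n = 400 out of a population of only N = 1000 shrinks the variance of the sample mean by the finite population correction \(\frac{N-n}{N-1}\):

```r
pop <- rnorm(1000, mean = 32, sd = 5)  # the population, as generated above

N <- 1000
n <- 400

# variance of the sample mean when sampling without replacement
var(pop) / n * (N - n) / (N - 1)
```

This lands much closer to the simulated value.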

In the lecture we learned that the sampling distribution of the mean always approaches a normal distribution, regardless of the shape of the original distribution!

You don’t believe me? See for yourself.

Income, for example, is usually not really normally distributed. First, we generate some hypothetical income data.

income <- rgamma(1000, shape = 1.1, scale = 2000) 
summary(income)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.92   674.90  1450.44  2140.21  2938.65 15636.39
var(income)
## [1] 4589128

# Let's have a look.
par(mfrow = c(1, 2)) # two plots side-by-side (alternative to chunk arguments)
hist(income,
     bty = "n",
     las = 1,
     border = "white",
     col = viridis(2)[1]
     )

plot(density(income),
     bty = "n",
     las = 1,
     col = viridis(2)[2],
     lwd = 2,
     main = "Density of income"
     )

This is clearly a non-normal distribution. Now we want to get our sampling distribution of the mean again.

3.3 CLT Trial 3: A non-normal population distribution

trial3 <- rep(NA, 100)

for (i in 1:100){
  trial3[i] <- mean(sample(income, 100))
}

hist(trial3 ,
     bty = "n",
     las = 1,
     border = "white",
     col = viridis(1),
     xlim = c(1500, 3000),
     main = "Distribution of sample means\n(sample size: n = 100)"
     )

plot(density(trial3),
     bty = "n",
     las = 1,
     lwd = 2,
     col = viridis(1),
     xlim = c(1500, 3000),
     main = "Density of sample means\n(sample size: n = 100)"
     )

mean(trial3)
## [1] 2143.915
var(trial3)
## [1] 40406.81

3.4 CLT Trial 4: A non-normal population distribution II

Base R

trial4 <- rep(NA, 100)

for (i in 1:100){
  trial4[i] <- mean(sample(income, 400)) #this time, we sample 400 people
}

hist(trial4,
     bty = "n",
     las = 1,
     border = "white",
     col = viridis(1),
     xlim = c(1500, 3000),
     main = "Distribution of sample means\n(sample size: n = 400)"
     )

plot(density(trial4),
     bty = "n", 
     las = 1,
     lwd = 2,
     col = viridis(1),
     xlim = c(1500, 3000),
     main = "Density of sample means\n(sample size: n = 400)"
     )

mean(trial4)
## [1] 2137.173
var(trial4)
## [1] 7402.915

ggplot2

trial4 <- rep(NA, 100)

for (i in 1:100){
  trial4[i] <- mean(sample(income, 400)) #this time, we sample 400 people
}

ggplot() +
  geom_histogram(aes(x = trial4),
                 boundary = 2200, 
                 binwidth = 20,
                 color = "white",
                 fill = viridis(1)) +
  theme_bw() +
  labs(title = "Distribution of sample means\n(sample size: n = 400)")

ggplot() +
  geom_density(aes(x = trial4)) +
  labs(title = "Density of sample means\n(sample size: n = 400)",
       y = "Density",
       x = "Sample mean") +
  theme_bw()

mean(trial4)
## [1] 2135.09
var(trial4)
## [1] 7263.942


3.5 Set Seeds!

Setting seeds is extremely helpful for the homework and for replicability in general. To do so, call set.seed() with the same number right before each random draw you want to reproduce.

a <- rnorm(5)

b <- rnorm(5)

a == b
## [1] FALSE FALSE FALSE FALSE FALSE


set.seed(1234)
a <- rnorm(5)

set.seed(1234)
b <- rnorm(5)

a == b
## [1] TRUE TRUE TRUE TRUE TRUE

Concluding remarks

Below you will find some exercises to check your knowledge.


Exercises

Exercise A: Airline reservations again

Now the airline also has a bigger plane.

The flight has 250 available seats, but the airline accepted 275 reservations (overbooking).

Since this is a long haul flight, the probability of each person showing up is a bit higher at 0.89.

Now the airline wants to know the probability that more than 250 people show up.

  • Remember the binomial process:

    • n independent trials
    • only two possible outcomes (show up, don’t show up)
    • success probability remains constant

We want to know:

\(Pr(X>250)\).

This is \(Pr(X=251)+Pr(X=252)+\ldots+Pr(X=275)\) (equal to \(1-Pr(X\leq250)\)).

  1. What is the probability that more than 250 people show up?
  2. Plot the cumulative distribution function of the problem.
  3. What is the probability of more than 250 people showing up if the airline limits overbooking to 260?
# Answer to a):
1 - pbinom(q = 250, size = 275, p = 0.89)
## [1] 0.1323218

# Answer to b):
xaxis <- 0:275 # for x-axis
plot(x = xaxis, 
     y = pbinom(q = xaxis, size = 275, p = 0.89),
     bty = "n",
     las = 1,
     pch = 19,
     xlab = "Number of people showing up",
     ylab = "Cumulative Distribution",
     main = "CDF of overbooking Binomial(n = 275, p = 0.89)",
     cex = 0.5, 
     col = viridis(1),
     frame = F
     )


# Answer to c):
1 - pbinom(q = 250, size = 260, p = 0.89)

Exercise B: Airline reservations and the Normal Approximation

We saw that the binomial distribution can be approximated by the normal. Let’s calculate the airplane reservation example using a normal distribution.

First, we have to think about the underlying distribution. Using the central limit theorem, we can approximate the binomial process with a normal distribution. 89 percent of the people are expected to show up, so the mean can be calculated by \(E(X)=n*p\). The variance can be calculated using \(Var(X)=n*p*(1-p)\) (see slide 15).

Let’s calculate the mean and the standard deviation of the normal distribution:

n <- 275
p <- 0.89

mean <- n * p
sd <- sqrt((n * p * (1 - p)))  # Why do we have to use the square root?

mean
## [1] 244.75
sd
## [1] 5.18869

Let’s solve an example with the normal distribution:

  1. What is the probability that more than 250 people will show up, if we make 275 reservations?
  2. Plot the normal distribution.
  3. Compare your result to the Binomial process from above.
# answer to a)
1 - pnorm(q = 250, mean = mean, sd = sd)
## [1] 0.155813

# answer to b)
xaxis <- seq(0, 275, 0.01)
densities <- dnorm(xaxis, mean = mean, sd = sd)
plot(xaxis, densities,
     bty = "n",
     las = 1,
     col = viridis(1),
     xlab = "Number of People Showing Up", 
     ylab = "Probability Density", 
     type = "l", 
     lwd = 2, 
     cex = 2, 
     main = "PDF of overbooking example N(mean = 244.75, sd = 5.19)"
     )


# answer to c)
1 - pnorm(q = 250, mean = mean, sd = sd)
## [1] 0.155813
1 - pbinom(q = 250, size = n, prob = p)
## [1] 0.1323218