Chapter 7 Inferential statistics, take 2

This chapter will show you how to calculate confidence intervals and perform hypothesis tests using R’s in-built hypothesis testing functions. We will quickly go over the same examples from the previous sections using R’s functions to recover the values we calculated earlier.

7.1 Analyzing categorical variables

We will learn how to use two workhorse functions in this section: prop.test(...) and chisq.test(...). The former will be used for making inferences for single and two sample proportions, while the latter will be used for chi squared goodness of fit tests and tests of independence.

Remember You can always run ?prop.test or ?chisq.test in the console to view R’s help documentation for these functions.

7.1.1 Single sample proportions

The main function we will use in this section is prop.test(...); it both calculates confidence intervals and performs hypothesis tests all in one fell swoop! The important things for us, then, are understanding what inputs we need and understanding the outputs. For single-sample proportion inferences, the function prop.test requires the following inputs:

prop.test(
  x = # The number of successes
  , n = # the sample size
  , p = # the null proportion of your hypothesis test, the default value is NULL
  , alternative = # the type of hypothesis test to run corresponding to your alternative hypothesis; one of "two.sided", "less", or "greater"
  , conf.level = # your confidence level as a decimal
  , correct = FALSE # This turns OFF the Yates' continuity correction so that calculations are made like we've learned in class.
)

The function will output the results of our desired hypothesis test and a corresponding confidence interval.

Note: The confidence intervals that R calculates are ever-so-slightly different from those that we calculate by hand in two important ways:

  • Setting the option correct = FALSE gets us closer to our by-hand techniques, but R still uses a slightly different method under the hood. See this CrossValidated Stack Exchange post for more information.

  • If your alternative hypothesis is one-sided and you select the option alternative = "less" or alternative = "greater", then R will report a one-sided confidence interval that we do not discuss in this class.

To understand the output of prop.test(...) and witness the discrepancies mentioned above, let’s return to the 3 basic examples we started with in the previous chapter. Recall that we have a hypothetical example with

  • \(n = 200\)

  • Our sample has 110 successes, i.e. \(\hat{p} = .55\)

  • We are comparing our sample data to the null hypothesis \(H_0: p = .5\)

  1. Two-sided hypothesis test: In this case our hypothesis test is

\[\begin{array}{cc} H_0: p = .5 \\ H_a: p \neq .5 \end{array} \]

It is quite easy to perform this hypothesis test with the function prop.test(...); we simply run the following code.

result <- prop.test(
  x = 110 # the NUMBER of successes
  , n = 200 # sample size
  , p = .5 # the null proportion
  , alternative = "two.sided" # since we're doing a two-sided hypothesis test
  , conf.level = .95
  , correct = FALSE
)
result
## 
##  1-sample proportions test without continuity correction
## 
## data:  110 out of 200, null probability 0.5
## X-squared = 2, df = 1, p-value = 0.1573
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.4807562 0.6173593
## sample estimates:
##    p 
## 0.55

Notice that the code stores the output of the hypothesis test as the variable result, then prints a summary of the test. Take a few moments to read through the results and notice that it gives us all of the information we want! In particular,

  • the “sample estimate” p is the sample proportion \(\hat{p}\);

  • the \(p\)-value is about 0.1573;

  • the associated 95% confidence interval is \((0.4807, 0.6173)\).

Let’s compare this with the results we’d obtain making the same calculations “by hand.” First we’ll calculate a \(p\)-value and compare it to that calculated by prop.test above.

n <- 200
pHat <- 110/200
null.SE <- sqrt(.5*.5/200)
p.value.hand <- 2*pnorm(pHat, mean = .5, sd = null.SE, lower.tail = FALSE)

c(p.value.hand, result$p.value)
## [1] 0.1572992 0.1572992

They’re the same, just as we would hope. Now let’s calculate a 95% confidence interval “by hand” and compare the results to that calculated by prop.test above.

conf.int.hand <- pHat + c(-1,1)*1.96*null.SE
# our result
conf.int.hand
## [1] 0.4807035 0.6192965
#from prop.test
result$conf.int
## [1] 0.4807562 0.6173593
## attr(,"conf.level")
## [1] 0.95

Notice that the bounds of our “by hand” calculation are ever so slightly different from the bounds calculated by prop.test; this is because R calculates these confidence intervals with a slightly different, more sophisticated technique than we cover in this class. Fret not, however, as the outputs are similar enough in most cases. One interesting thing to observe is that the confidence interval from prop.test is not symmetric about the sample proportion \(\hat{p}\)!
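In fact, when correct = FALSE, the interval that prop.test reports is the Wilson score interval, which comes from inverting the test statistic rather than centering on \(\hat{p}\); that is why it is not symmetric. A minimal sketch reproducing R’s bounds (this formula is beyond what we cover in class and is shown only for the curious):

```r
# Wilson score interval for 110 successes in n = 200 trials at 95% confidence.
pHat <- 110 / 200
n <- 200
z <- qnorm(.975)

# the interval is centered at a value pulled slightly toward 1/2
center <- (pHat + z^2 / (2 * n)) / (1 + z^2 / n)
halfWidth <- (z / (1 + z^2 / n)) * sqrt(pHat * (1 - pHat) / n + z^2 / (4 * n^2))
center + c(-1, 1) * halfWidth
## [1] 0.4807562 0.6173593
```

These bounds agree exactly with the interval printed by prop.test above.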

In our remaining examples, we will move a little more quickly.

  2. One-sided lower hypothesis test: In this case our hypothesis test is

\[\begin{array}{cc} H_0: p = .5 \\ H_a: p < .5 \end{array} \]

This hypothesis test is also quite easy to evaluate with prop.test. See the code below:

result <- prop.test(
  x = 110 # the NUMBER of successes
  , n = 200 # sample size
  , p = .5 # the null proportion
  , alternative = "less" # ONLY CHANGE FROM Above
  , conf.level = .95
  , correct = FALSE
)
result
## 
##  1-sample proportions test without continuity correction
## 
## data:  110 out of 200, null probability 0.5
## X-squared = 2, df = 1, p-value = 0.9214
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
##  0.0000000 0.6068119
## sample estimates:
##    p 
## 0.55

Most of the output is exactly as we’d expect it to be. The \(p\)-value of this test is the complement of half of the \(p\)-value from the two-sided test (think about the corresponding tail areas shaded under a normal distribution), and the point estimate is the same.
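We can verify this relationship between the one-sided and two-sided p-values directly from the normal approximation, using the numbers from this example:

```r
pHat <- .55
null.SE <- sqrt(.5 * .5 / 200)
z <- (pHat - .5) / null.SE                       # observed z-statistic, about 1.414

p.two  <- 2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided p-value, about 0.1573
p.less <- pnorm(z)                               # lower-tailed p-value, about 0.9214
c(p.less, 1 - p.two / 2)                         # the same number, two ways
```
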

Caution: The main change is the reported confidence interval. Because this is a one-sided test, R reports a one-sided confidence interval that we do not cover in this class. Suppose \(\alpha\) is the significance level as a decimal with corresponding \(z\)-critical value \(z_\alpha\). Then a one-sided lower confidence interval is approximately

\[ \left( 0, \hat{p} + z_\alpha * SE\right) \]

For our example, a 95% confidence interval corresponds to the significance level \(\alpha = .05\), so we can calculate the upper bound by hand:

z.crit <- qnorm(.05, lower.tail = F)
pHat + z.crit*null.SE
## [1] 0.6081544

This is very close to, but not exactly, the upper bound of the confidence interval given by R’s prop.test function. For our purposes they are close enough. That said, if you are looking for a confidence interval as we calculate them in class, set the alternative option to "two.sided".

  3. One-sided upper hypothesis test: In this case our hypothesis test is

\[\begin{array}{cc} H_0: p = .5 \\ H_a: p > .5 \end{array} \]

This hypothesis test is also quite easy to evaluate with prop.test. See the code below:

result <- prop.test(
  x = 110 # the NUMBER of successes
  , n = 200 # sample size
  , p = .5 # the null proportion
  , alternative = "greater" # ONLY CHANGE FROM Above
  , conf.level = .95
  , correct = FALSE
)
result
## 
##  1-sample proportions test without continuity correction
## 
## data:  110 out of 200, null probability 0.5
## X-squared = 2, df = 1, p-value = 0.07865
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
##  0.4918534 1.0000000
## sample estimates:
##    p 
## 0.55

The result of this test is very similar to the output from the one-sided lower example and all of the same comments apply. Again, the main change is the reported confidence interval.

Caution: As in the case of a one-sided lower test, R reports a one-sided confidence interval, which deviates from what we learn in this class! Suppose \(\alpha\) is the significance level as a decimal with corresponding \(z\)-critical value \(z_\alpha\). Then a one-sided upper confidence interval is approximately

\[ \left( \hat{p} - z_\alpha * SE, 1 \right) \]

For our example, a 95% confidence interval corresponds to the significance level \(\alpha = .05\). The lower bound of the one-sided confidence interval is

z.crit <- qnorm(.05, lower.tail = F)
pHat - z.crit*null.SE
## [1] 0.4918456

This is very close to, but not exactly, the lower bound of the confidence interval given by R’s prop.test function.

As a reminder, if you are looking for a confidence interval as we calculate them in class, set the alternative option to "two.sided".

7.1.1.1 “Real Life” Single sample proportion test.

We close this section with an example of a single sample proportion test where we start with real-world data. As in the previous chapter we’ll use the gss2010 data set from the openintro package. Here are the first six rows of this data set.

head(gss2010)
## # A tibble: 6 × 5
##   hrsrelax mntlhlth  hrs1 degree         grass    
##      <int>    <int> <int> <fct>          <fct>    
## 1        2        3    55 BACHELOR       <NA>     
## 2        4        6    45 BACHELOR       LEGAL    
## 3       NA       NA    NA LT HIGH SCHOOL <NA>     
## 4       NA       NA    NA LT HIGH SCHOOL NOT LEGAL
## 5       NA       NA    NA LT HIGH SCHOOL NOT LEGAL
## 6       NA       NA    NA LT HIGH SCHOOL LEGAL

As of this writing, marijuana is legal in many states, so we will assume that a majority of Americans in 2024 believe that the drug should be legalized. We want to use the gss2010 data to see whether national opinion has changed over the past 14 years. In other words, did a minority of Americans think marijuana should be legalized in 2010?

If \(p\) denotes the proportion of Americans who think marijuana should be legalized, the hypothesis test we should use to answer this question is

\[\begin{array}{cc} H_0: p = .5 \quad (\text{or} \quad p \geq .5) \\ H_a: p < .5 \end{array} \]

We should look at the grass column of the data set, which records respondents’ answers to the question “Do you think the use of marijuana should be made legal, or not?” Some of the entries in this column are NA, so we should drop these values from our data set. Note: before dropping these values, it is good practice to check whether non-response is associated with educational attainment, since degree is the only complete column in this data set and we need to be careful about maintaining independence among observations. One could check this with a chi-squared test of independence, but we will eyeball it for now.

gss2010 %>%
  # add a binary column for replying to the grass question
  mutate(grassReply = if_else(is.na(grass),  'No', 'Yes')) %>%
  # create a two-way table for degree and grassReply
  xtabs(~degree + grassReply, data = .) %>%
  # create a proportion table where we divide by the column totals
  prop.table(., margin = 2)
##                 grassReply
## degree                   No        Yes
##   BACHELOR       0.18343949 0.18347895
##   GRADUATE       0.10445860 0.10802224
##   HIGH SCHOOL    0.49681529 0.48530580
##   JUNIOR COLLEGE 0.07515924 0.06830818
##   LT HIGH SCHOOL 0.14012739 0.15488483

The proportion table above shows that the distribution of educational attainment is just about the same between those that did and did not reply to the question about marijuana legalization. Thus, we can safely drop the NA values from the grass column and proceed with our hypothesis test. To use prop.test we need the sample size and the number of people who responded LEGAL to the grass question.

gss2010 %>% 
  #select all rows where grass is not NA
  filter(!is.na(grass)) %>%
  # get sample size and number of respondents answering legal and not legal
  summarize(
    n = n()
    , notLegal = sum(grass == 'NOT LEGAL')
    , legal = sum(grass == 'LEGAL')
  )
## # A tibble: 1 × 3
##       n notLegal legal
##   <int>    <int> <int>
## 1  1259      656   603

We can finally proceed with our hypothesis test. Let’s use a significance level of \(\alpha = .05\).

testResults <- prop.test(
  x = 603
  , n = 1259
  , p = .5
  , alternative = 'less'
  , conf.level = .95
  , correct = F
)
testResults
## 
##  1-sample proportions test without continuity correction
## 
## data:  603 out of 1259, null probability 0.5
## X-squared = 2.2311, df = 1, p-value = 0.06763
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
##  0.0000000 0.5021298
## sample estimates:
##         p 
## 0.4789515

About 48% of the respondents in our sample believed that marijuana should be legalized. However, our p-value of 0.068 is greater than our significance level, so we fail to reject the null. In other words, we have no evidence that a minority of Americans supported marijuana legalization in 2010.
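As a sanity check, we can recover these numbers from the normal approximation by hand, just as in the toy examples. Note that the X-squared statistic reported by prop.test is simply the square of the z-statistic:

```r
pHat <- 603 / 1259
null.SE <- sqrt(.5 * .5 / 1259)
z <- (pHat - .5) / null.SE   # about -1.494

pnorm(z)   # lower-tailed p-value, about 0.0676
z^2        # about 2.2311, the X-squared value reported by prop.test
```
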

7.1.2 Two sample proportions

The main function we will use in this section is again prop.test(...). In this section we will do one ‘toy’ example and one example using real-world data. The inputs and usage are almost identical to those for single sample proportion inferences. The arguments are as follows.

prop.test(
  # x1 and x2 are the number of successes in sample 1 and 2
  x = c(x1, x2)  
  # n1 and n2 are the sizes of sample 1 and 2
  , n = c(n1,n2)
  # leave p null when testing for a difference in proportion, our most common use
  , p = NULL
  , alternative = # the type of hypothesis test to run corresponding to your alternative hypothesis; one of "two.sided", "less", or "greater"
  , conf.level = # your confidence level as a decimal
  , correct = FALSE # This turns OFF the Yates' continuity correction so that calculations are made like we've learned in class.
)

The output of prop.test will look almost identical to when you perform a single sample inference and all of our cautions and commentary from the previous section apply in this section as well. First we’ll do a basic, made up example to see what the output looks like, then do an example with real data to practice interpreting the results.

Basic, made up example: Suppose you have two samples, one with size \(n_1 = 344\) and another with size \(n_2 = 432\). The first sample has 177 successes and the second has 124. You want to check for a difference between the proportion of successes in populations 1 and 2. In other words, your hypothesis test is

\[\begin{array}{cc} H_0: p_1 - p_2 = 0 \\ H_a: p_1 - p_2 \neq 0 \end{array} \]

The following code uses prop.test to perform this hypothesis test with a significance level of \(\alpha = .05\)

results <- prop.test(
  x = c(177, 124)
  , n = c(344, 432)
  , alternative = 'two.sided'
  , conf.level = .95
  , correct = FALSE
)
results
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  c(177, 124) out of c(344, 432)
## X-squared = 41.744, df = 1, p-value = 1.04e-10
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.1596070 0.2953887
## sample estimates:
##    prop 1    prop 2 
## 0.5145349 0.2870370

The printed results above give us all of the information we need. The \(p\)-value of our test is about \(1.04\times 10^{-10} < \alpha = .05\), so the data provide compelling evidence of a difference between the population proportions.

How different are the two proportions you ask? Remember, that’s exactly what the confidence interval for a difference in proportions reports. In this case, it means that \(p_1 - p_2\) lies between about 16% and 30% with 95% confidence. In other words, with 95% confidence, the proportion of successes in population 1 is between 16% and 30% higher than the proportion of successes in population 2.
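To connect this output to our by-hand methods, the sketch below recovers the test statistic and confidence interval. Recall that the hypothesis test uses the pooled proportion for its standard error, while the confidence interval does not:

```r
x1 <- 177; n1 <- 344
x2 <- 124; n2 <- 432
p1 <- x1 / n1
p2 <- x2 / n2

# hypothesis test: pool the samples to estimate the null standard error
pPool <- (x1 + x2) / (n1 + n2)
SE.null <- sqrt(pPool * (1 - pPool) * (1 / n1 + 1 / n2))
z <- (p1 - p2) / SE.null
2 * pnorm(abs(z), lower.tail = FALSE)    # p-value, about 1.04e-10

# confidence interval: use the unpooled standard error
SE <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
(p1 - p2) + c(-1, 1) * qnorm(.975) * SE  # about (0.160, 0.295)
```

Both results agree with the prop.test output above up to rounding.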

7.1.2.1 “Real world” Example

In this example we turn again to the gss2010 data set from the openintro package and investigate another question about marijuana legalization opinion. After many years around colleges and universities, I hypothesize that, in general, people with college degrees have a different opinion on marijuana legalization than people without college degrees.

Let \(p_C\) denote the proportion of Americans with college degrees who think marijuana should be legalized and let \(p_H\) be the same proportion for Americans without college degrees. Our hypothesis test is then

\[\begin{array}{cc} H_0: p_C - p_H = 0 \\ H_a: p_C - p_H \neq 0 \end{array} \]

We need to group the individuals in our sample into college degree recipients and non-college-degree recipients. The different levels of the degree factor are

sort(unique(gss2010$degree))
## [1] BACHELOR       GRADUATE       HIGH SCHOOL    JUNIOR COLLEGE LT HIGH SCHOOL
## Levels: BACHELOR GRADUATE HIGH SCHOOL JUNIOR COLLEGE LT HIGH SCHOOL

so any individual whose degree level contains the string ‘HIGH SCHOOL’ does not have a college degree. Let’s add a column to our data set, then get the counts we need to perform our test. As in the example in the last section, we will drop the individuals who did not respond to the legalization question. To make the hypothesis test slightly easier to run, we will store the following summary table in memory.

LegalizeIt <- gss2010 %>%
  #drop rows with NA in grass column
  filter(!is.na(grass)) %>%
  # TRUE/FALSE binary for college degree
  mutate(collegeDegree = !str_detect(degree, 'HIGH SCHOOL')) %>%
  group_by(collegeDegree) %>%
  summarize(
    n = n()
    , legal = sum(grass == 'LEGAL')
    , notLegal = sum(grass == 'NOT LEGAL')
  ) %>%
  #arrange the results so that collegeDegree = TRUE comes first. 
  arrange(desc(collegeDegree))
LegalizeIt
## # A tibble: 2 × 4
##   collegeDegree     n legal notLegal
##   <lgl>         <int> <int>    <int>
## 1 TRUE            453   234      219
## 2 FALSE           806   369      437

Now we are ready to perform our hypothesis test with prop.test and a significance level of \(\alpha = .05\).

results <- prop.test(
  x = LegalizeIt$legal
  , n = LegalizeIt$n
  , p = NULL
  , alternative = 'two.sided'
  , conf.level = .95
  , correct = FALSE
)
results 
## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  LegalizeIt$legal out of LegalizeIt$n
## X-squared = 4.0096, df = 1, p-value = 0.04524
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.001287983 0.116191845
## sample estimates:
##    prop 1    prop 2 
## 0.5165563 0.4578164

The p-value of our test is 0.045 which just squeezes in under \(\alpha = .05\), so we have statistically significant evidence that there is a difference in opinion on marijuana legalization between those with and without college degrees.

We can use the confidence interval from our results to be a bit more specific. At the 95% confidence level, the proportion of people with college degrees who think marijuana should be legalized is between 0.1% and 11.6% higher than the proportion of people without college degrees.

Remember, this was all calculated using 2010 data, so you may want to redo this investigation with more up-to-date information.

7.1.3 Chi-squared goodness of fit test

A goodness of fit test checks whether your sample data deviate from some expected distribution. For example, you can use this type of test to investigate whether political affiliation in a county deviates from that of the state the county is in, or to test whether your dice are actually fair.

To perform a chi-squared goodness of fit test easily in R we will use the function chisq.test(...). The inputs to this function are as follows:

chisq.test( x = c(...) # a vector of your observed counts
            , p = c(...) # a vector of your expected proportions/probabilities
            , correct = F # turn off the continuity correction, like usual
            )

We will look at pet ownership in Seattle as an example of a goodness of fit test. From www.zipatlas.com, the 10 most populous zip codes in the Seattle metro area are those in the table below. The column pop gives each zip code’s population in 2024. Note: we convert zip to a character vector so that the zip codes are treated as the categorical variables they deserve to be.

zipDF <- tibble(
  zip = c(98115,98103,98133, 98105,98118,98125,98122, 98198, 98117, 98155),
  pop = c(54457,51878,50921,50302,48874,43993,41268,38942,36115,35550)
) %>% 
  mutate(zip = as.character(zip)
         , popProp = pop/sum(pop)) %>%
  arrange(zip)
zipDF
## # A tibble: 10 × 3
##    zip     pop popProp
##    <chr> <dbl>   <dbl>
##  1 98103 51878  0.115 
##  2 98105 50302  0.111 
##  3 98115 54457  0.120 
##  4 98117 36115  0.0798
##  5 98118 48874  0.108 
##  6 98122 41268  0.0912
##  7 98125 43993  0.0973
##  8 98133 50921  0.113 
##  9 98155 35550  0.0786
## 10 98198 38942  0.0861

We want to see if pet ownership is evenly distributed among these zip codes. To do this, we will use the seattlepets data set from the openintro package. As usual, be sure to check the documentation to make sure you understand what the data represents. In this case, it’s a data set detailing pet registrations in Seattle. The first 6 observations in the data set are shown below.

seattlepets %>% 
  head()
## # A tibble: 6 × 7
##   license_issue_date license_number animal_name species primary_breed               secondary_breed zip_code
##   <date>             <chr>          <chr>       <chr>   <chr>                       <chr>           <chr>   
## 1 2018-11-16         8002756        Wall-E      Dog     Mixed Breed, Medium (up to… Mix             98108   
## 2 2018-11-11         S124529        Andre       Dog     Terrier, Jack Russell       Dachshund, Sta… 98117   
## 3 2018-11-21         903793         Mac         Dog     Retriever, Labrador         <NA>            98136   
## 4 2018-11-23         824666         Melb        Cat     Domestic Shorthair          <NA>            98117   
## 5 2018-12-30         S119138        Gingersnap  Cat     Domestic Shorthair          Mix             98144   
## 6 2018-12-16         S138529        Cody        Dog     Retriever, Labrador         <NA>            98103

There are many zip_codes in the data set:

length(unique(seattlepets$zip_code))
## [1] 168

so we need to do some manipulation before we can make the comparison in question. First, we reformat any zip code of the form #####-#### so that it contains just the first five digits, then we filter for only the 10 most populous zip codes, and finally count the number of licenses in each zip.

df <- seattlepets %>%
  mutate(zip_code = str_replace(zip_code, '-[0-9]{1,}', '')
         ) %>%
  filter(zip_code %in% zipDF$zip) %>%
  group_by(zip_code) %>%
  summarize(n = n()) %>% 
  arrange(zip_code)

df
## # A tibble: 10 × 2
##    zip_code     n
##    <chr>    <int>
##  1 98103     4506
##  2 98105     2070
##  3 98115     4748
##  4 98117     3879
##  5 98118     2375
##  6 98122     2521
##  7 98125     2882
##  8 98133     1775
##  9 98155       28
## 10 98198        6

Next we make sure that the zip codes are in the same order in both data sets

all.equal(df$zip_code, zipDF$zip)
## [1] TRUE

Now we’re ready to perform our test. If pet ownership were evenly distributed across zip codes, we’d expect the proportion of licenses in each zip code to roughly match the population distribution. Thus, we want to compare the license counts (the column n from df above) to the population proportions (popProp from zipDF). The hypothesis test we’re performing is

\[\begin{array}{cc} H_0:\text{Pet ownership is distributed like population in Seattle} \\ H_a: \text{Pet ownership is not distributed like population in Seattle} \end{array} \]

Let’s perform this test with a significance level of \(\alpha = .05\). Executing the test is now easy with chisq.test!

results <- chisq.test(x = df$n
           , p = zipDF$popProp
           , correct = F)
results 
## 
##  Chi-squared test for given probabilities
## 
## data:  df$n
## X-squared = 8549.1, df = 9, p-value < 2.2e-16

Our p-value is far below the significance level, so we reject the null in favor of the alternative. In other words, we have strong evidence suggesting that pet ownership/license distribution does not match the population distribution among the 10 most populous zip codes in the Seattle metro area.
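To see where the X-squared value comes from, we can rebuild the statistic by hand from the observed counts and expected proportions. The counts below are hardcoded from the df and zipDF tables above so the sketch is self-contained:

```r
obs <- c(4506, 2070, 4748, 3879, 2375, 2521, 2882, 1775, 28, 6)   # df$n
pop <- c(51878, 50302, 54457, 36115, 48874, 41268, 43993, 50921,
         35550, 38942)                                            # zipDF$pop
popProp <- pop / sum(pop)

expected <- sum(obs) * popProp            # expected counts under the null
X2 <- sum((obs - expected)^2 / expected)
X2                                        # about 8549.1, matching chisq.test
pchisq(X2, df = length(obs) - 1, lower.tail = FALSE)  # p-value, essentially 0
```
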

You might wonder where the distribution deviates most from expectation. You can use the residuals element of the results object above to answer that question.

If obs and exp are the observed and expected counts respectively, then each residual is:

\[ \frac{\text{obs} - \text{exp}}{\sqrt{\text{exp}}} \]

res <- results$residuals
names(res) <- zipDF$zip
sort(res)
##      98198      98155      98133      98105      98118      98122      98125      98103      98115 
## -46.069289 -43.506938 -19.230237 -13.083768  -5.868322   5.449082   9.587784  31.180236  32.275231 
##      98117 
##  42.696200

The output above shows us that the zip code 98198 sees way fewer licenses than we’d expect if pet ownership matched the population distribution and the zip code 98117 sees way more licensing than we would expect based on population alone.

7.1.4 Chi-squared test of independence

Recall from the previous chapter that a chi-squared test of independence tests for an association between two categorical variables. No surprise, there is a built in function in R to perform these tests and it is again chisq.test(...). The only difference from the last section is how you input the data: for a goodness of fit test you supply a vector of observed counts and a vector of expected proportions, while for a test of independence you pass in a two-way contingency table of counts. For example:

table <- gss2010 %>%
  mutate(grass = if_else(is.na(grass), 'NO REPLY', grass)) %>%
  xtabs(~degree + grass, data = .)

table
##                 grass
## degree           LEGAL NO REPLY NOT LEGAL
##   BACHELOR         119      144       112
##   GRADUATE          73       82        63
##   HIGH SCHOOL      304      390       307
##   JUNIOR COLLEGE    42       59        44
##   LT HIGH SCHOOL    65      110       130
results <- chisq.test(table)
results
## 
##  Pearson's Chi-squared test
## 
## data:  table
## X-squared = 22.338, df = 8, p-value = 0.004326

Since our p-value is less than the significance level of \(\alpha = .05\), we conclude that the data imply an association between highest educational attainment and opinion on marijuana legalization. You might then wonder if you can infer what this association actually looks like to some degree. The key to this is looking at the residuals of this hypothesis test.

Recall that the chi-squared test statistic is

\[ \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}}. \]

where obs and exp represent the observed and expected counts and the sum ranges over all cells in the contingency table. The residual of a cell in a contingency table is simply

\[\text{residual} = \frac{(\text{obs} - \text{exp})}{\sqrt{\text{exp}}} \]

so a residual simply measures how far above or below the observed count is from the expected count, where the difference is normalized by the square root of the expected count. Large residuals imply a large deviation from expectation, so you can think of any cell with a large residual as representing a place where the detected association between variables is more evident. We can access the residuals from our test as follows:

results$residuals
##                 grass
## degree                  LEGAL     NO REPLY    NOT LEGAL
##   BACHELOR        0.795903942 -0.001589914 -0.761336064
##   GRADUATE        1.083344800 -0.188315397 -0.832659358
##   HIGH SCHOOL     0.505993107  0.283830520 -0.795608394
##   JUNIOR COLLEGE -0.118711782  0.443908579 -0.371782229
##   LT HIGH SCHOOL -2.633232818 -0.659297213  3.245833866

Most of our residuals are around 1 or less in absolute value, except for those corresponding to people who did not finish high school. In that row, fewer people than we would expect under independence believe that marijuana should be legalized, and more think it should remain illegal.

Does this mean that more education causes one to think marijuana should be legal? Of course not! We’re simply observing an association in an observational data set and it would be a huge mistake to try to infer a causal relationship between the two variables. There are likely many confounding variables at play here.
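Each residual comes from an expected count of the form (row total × column total) / grand total. As a by-hand check of the largest residual, using counts hardcoded from the table above:

```r
# expected count for the LT HIGH SCHOOL / NOT LEGAL cell under independence
rowTot <- 65 + 110 + 130              # LT HIGH SCHOOL row total
colTot <- 112 + 63 + 307 + 44 + 130   # NOT LEGAL column total
grandTot <- 2044                      # total respondents in the table
obs <- 130                            # observed count in that cell

expected <- rowTot * colTot / grandTot   # about 97.9
res <- (obs - expected) / sqrt(expected)
res                                      # about 3.246, matching results$residuals
```
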

7.2 Analyzing numerical variables

In this section we will learn how to use R’s in-built functions to perform statistical inference for numerical variables. After the last chapter, it should come as no surprise that the bulk of our inference will be performed by a single function. For categorical variables we frequently used prop.test(...); the analogous function for numerical variables is t.test(...). The following subsections will guide you through the use of this function to perform inferences for single sample means, two sample mean comparisons, and paired data. We will then discuss using the function aov for performing ANOVA tests.

Note: The function t.test(...) requires that you have access to the actual data, not just summary statistics. In what follows we will also discuss the tsum.test(...) function from the BSDA package, which executes t-tests from summary statistics.

7.2.1 Single sample mean

If you have a data set, making an inference about a numerical variable therein with t.test is quite easy. The basic syntax for a single numerical variable is as follows.

t.test(
  x = # a VECTOR of numerical values, typically a column of a data set.
  , mu = # the null mean of your hypothesis test; defaults to 0.  
  , alternative = # the type of hypothesis test to run corresponding to your alternative hypothesis; one of "two.sided", "less", or "greater"
  , conf.level = # your confidence level as a decimal
)

We will use the fastfood data set from the openintro package to demonstrate the use of this function. This data set gives the nutritional breakdown of many menu items from a handful of fast food restaurants.

head(fastfood)
## # A tibble: 6 × 17
##   restaurant item     calories cal_fat total_fat sat_fat trans_fat cholesterol sodium total_carb fiber sugar
##   <chr>      <chr>       <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>  <dbl>      <dbl> <dbl> <dbl>
## 1 Mcdonalds  Artisan…      380      60         7       2       0            95   1110         44     3    11
## 2 Mcdonalds  Single …      840     410        45      17       1.5         130   1580         62     2    18
## 3 Mcdonalds  Double …     1130     600        67      27       3           220   1920         63     3    18
## 4 Mcdonalds  Grilled…      750     280        31      10       0.5         155   1940         62     2    18
## 5 Mcdonalds  Crispy …      920     410        45      12       0.5         120   1980         81     4    18
## 6 Mcdonalds  Big Mac       540     250        28      10       1            80    950         46     3     9
## # ℹ 5 more variables: protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>

Suppose we want to know the average caloric content of a burger from a fast food restaurant. Moreover, we suspect that the average burger contains more than 600 calories. To put this in the context of statistical inference, we want to calculate a confidence interval and perform a hypothesis test. Let \(\mu\) denote the average calorie content of a fast food burger. Our hypothesis test is

\[\begin{array}{cc} H_0: \mu = 600 \\ H_a: \mu > 600 \end{array} \]

We can use the data set above to help make these inferences! We will use t.test(...) after massaging the data a bit. First, we’ll select all rows from fastfood where the item contains the word “burger” or “Burger”

burgers <- fastfood %>%
  filter(str_detect(item, '[bB]urger')) 

dim(burgers)
## [1] 56 17

So our data set has only 56 burgers. Let’s first execute our hypothesis test with a significance level of \(\alpha = .05\) to see if our idea has any legs.

results <- t.test(x = burgers$calories
       , alternative = 'greater'
       , mu = 600
       , conf.level = .95)
results
## 
##  One Sample t-test
## 
## data:  burgers$calories
## t = 0.28482, df = 55, p-value = 0.3884
## alternative hypothesis: true mean is greater than 600
## 95 percent confidence interval:
##  552.1307      Inf
## sample estimates:
## mean of x 
##  609.8214

Our \(p\)-value is about 0.3884, greater than the significance level, so we fail to reject the null. In other words, our sample data do not provide sufficient evidence to conclude that the average burger has more than 600 calories.

Notice that the confidence interval reported above is one-sided to be consistent with the hypothesis test, just like the output of prop.test(...) when performing a one-sided hypothesis test. We can reproduce this confidence interval in a similar way. You can find the lower bound with:

t.crit <- qt(.05, df = 55, lower.tail = F)
SE <- sd(burgers$calories)/sqrt(56)
mean(burgers$calories) - t.crit*SE
## [1] 552.1307

The upper bounds for one-sided confidence intervals associated with one-sided lower hypothesis tests can be calculated analogously.
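For instance, had we instead run a lower-tailed test (alternative = 'less'), the reported interval would stretch from \(-\infty\) up to an upper bound. A sketch, mirroring the code above (this hypothetical test is not one we ran):

```r
# Hypothetical lower-tailed version, H_a: mu < 600
t.crit <- qt(.05, df = 55, lower.tail = FALSE)  # same critical value as before
SE <- sd(burgers$calories)/sqrt(56)             # same standard error
mean(burgers$calories) + t.crit*SE              # upper bound of the interval (-Inf, ub]
```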

If we want a standard two-sided confidence interval, we simply need to set alternative = 'two.sided'.

results <- t.test(x = burgers$calories
       , alternative = 'two.sided'
       , mu = 600
       , conf.level = .95)
results$conf.int
## [1] 540.7165 678.9263
## attr(,"conf.level")
## [1] 0.95

Thus, with 95% confidence, fast food burgers contain between about 541 and 679 calories on average.
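As a check, you can reproduce this two-sided interval by hand with the same technique as before, now using a two-tailed critical value:

```r
t.crit <- qt(.975, df = 55)                  # two-tailed critical value
SE <- sd(burgers$calories)/sqrt(56)
mean(burgers$calories) + c(-1, 1)*t.crit*SE  # should match the interval above
```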

Now, suppose you don’t have a data set available and only have summary statistics. This happens frequently on homework, for example. If you want to use a function similar to t.test(...), you have to either write your own function (left as an exercise for the reader, as it is beyond the scope of these notes) or install a new package called BSDA. This package, whose name stands for “Basic Statistics and Data Analysis,” was created by Alan T. Arnholt to accompany the textbook of the same name by Larry J. Kitchens. In any case, remember that a package must be installed once before it can be loaded, and loaded into your R session before you can use any of its functions. In other words, you must run

install.packages('BSDA')

once, the first time you want to use the functions, then run

library(BSDA)

any time you start a new R session and want to use this function. For a single sample t-test, the syntax for tsum.test(...) is as follows.

tsum.test(
    mean.x = #sample average
    , s.x = #sample standard deviation
    , n.x = #sample size
    , alternative = #type of test to perform
    , mu = #null mean
    , conf.level = # confidence level as a decimal
)

We will use tsum.test(...) to perform the exact same hypothesis test as above, this time inputting the sample statistics instead of the raw data.

xBar <- mean(burgers$calories)
s <- sd(burgers$calories)
n <- dim(burgers)[1]

tsum.test(
  mean.x = xBar
  , s.x = s
  , n.x = n
  , alternative = 'greater'
  , mu = 600
  , conf.level = .95
)
## Warning in tsum.test(mean.x = xBar, s.x = s, n.x = n, alternative = "greater", : argument 'var.equal'
## ignored for one-sample test.
## 
##  One-sample t-Test
## 
## data:  Summarized x
## t = 0.28482, df = 55, p-value = 0.3884
## alternative hypothesis: true mean is greater than 600
## 95 percent confidence interval:
##  552.1307       NA
## sample estimates:
## mean of x 
##  609.8214

Et voilà: the exact same results as above! You can safely ignore the warning about var.equal, as the function is doing exactly what we want.

7.2.2 Two sample mean comparison

The functions t.test(...) and tsum.test(...) also perform two-sample t-tests for us. Recall that, roughly, a two-sample t-test checks for an association between a two-valued categorical variable and a single numerical variable. With these tests you make inferences about the difference between two population averages \(\mu_1 - \mu_2\) from the difference in sample averages \(\overline{x}_1 - \overline{x}_2\), so the associated confidence intervals are estimating the difference in population averages.
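Recall that the test statistic behind this comparison (with unequal population variances assumed, as we do in this class) is

\[ t = \frac{(\overline{x}_1 - \overline{x}_2) - \mu_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \]

where \(\mu_0\) is the null difference in means (usually 0). R computes this statistic, along with the appropriate degrees of freedom, for us.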

If you have actual data for a two-sample t-test, use t.test(...) with the following syntax.

t.test(
  x = #vector of values for first group
 , y = #vector of values for the second group
 , mu = #null difference in population means, USUALLY = 0! 
 , alternative = # the type of hypothesis test to run
 , var.equal = FALSE # we will assume the pop. variances are unequal in this class 
 , conf.level = # your confidence level as a decimal
    )

If you don’t have data and only have sample statistics, use tsum.test(...) with the following syntax.

tsum.test(
   mean.x = #sample mean from group 1 
  , s.x = #sample std dev from group 1
  , n.x = # sample size from group 1
  , mean.y = #sample mean of group 2
  , s.y = # sample std dev of group 2
  , n.y = # sample size of group 2
  , mu = #null difference in population means, USUALLY = 0! 
  , alternative = # the type of hypothesis test to run
  , var.equal = FALSE # we will assume the pop. variances are unequal in this class 
  , conf.level = # your confidence level as a decimal
)

Let’s see these functions in action, again using the fastfood data set.

We typically think that salads are a healthier option than burgers when dining out. While “healthiness” is a complicated notion, we can compare the calorie content of these food options as a first pass at healthiness. For instance, a fast food diner may assume they are eating a lower calorie option by ordering a salad over a burger. How true is this on average? Let’s get the data ready for comparison.

burgers <- fastfood %>%
  filter(str_detect(item, '[bB]urger')) 
salads <- fastfood %>%
  filter(str_detect(item, '[sS]alad'))

Let \(\mu_B\) and \(\mu_S\) denote the average calorie content of fast food burgers and salads respectively. We want to test whether fast food salads are lower calorie than burgers, on average. Expressed as a hypothesis test, this is

\[\begin{array}{cc} H_0: \mu_S - \mu_B = 0 \\ H_a: \mu_S - \mu_B < 0 \end{array} \]

because if \(\mu_S < \mu_B\) then \(\mu_S - \mu_B < 0\). Executing this test with t.test is easy! We will use a significance level of \(\alpha = .05\).

results <- t.test(
  x = salads$calories
  , y = burgers$calories
  , alternative = 'less'
  , conf.level = .95
  , var.equal = F
)
results
## 
##  Welch Two Sample t-test
## 
## data:  salads$calories and burgers$calories
## t = -5.5512, df = 107.05, p-value = 1.039e-07
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -169.6743
## sample estimates:
## mean of x mean of y 
##  367.8125  609.8214

With a p-value of \(p \approx 1.04 \times 10^{-7} < .05 = \alpha\), we reject the null hypothesis and accept the alternative. In other words, the data provide strong evidence that, on average, fast food salads have fewer calories than burgers. Just how different are these averages? Let’s estimate the difference \(\mu_S - \mu_B\) with a typical two-sided confidence interval.

results <- t.test(
  x = salads$calories
  , y = burgers$calories
  , alternative = 'two.sided'
  , conf.level = .95
  , var.equal = F
)
results$conf.int
## [1] -328.4319 -155.5859
## attr(,"conf.level")
## [1] 0.95

This calculation implies that fast food salads contain between about 155.6 and 328.4 fewer calories than burgers on average at the 95% confidence level.

As a last calculation in this section, we will perform the same hypothesis test using tsum.test(...) for demonstration purposes. We need to first calculate all required summary statistics since they haven’t been provided.

# sample size, sample avg, and sample std dev for salads
n.s <- length(salads$calories)
mean.s <- mean(salads$calories)
s.s <- sd(salads$calories)
# sample size, sample avg, and sample std dev for burgers
n.b <- length(burgers$calories)
mean.b <- mean(burgers$calories)
s.b <- sd(burgers$calories)
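As a quick sanity check, these summary statistics are all you need to compute the Welch test statistic by hand:

```r
SE <- sqrt(s.s^2/n.s + s.b^2/n.b)  # standard error of the difference in means
(mean.s - mean.b)/SE               # should match the t statistic reported above, about -5.55
```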

Now we need to chuck these values into tsum.test(...). Here we’ll perform our initial one-sided hypothesis test.

tsum.test(
  mean.x = mean.s
  , s.x = s.s
  , n.x = n.s
  , mean.y = mean.b
  , s.y = s.b
  , n.y = n.b
  , alternative = 'less'
  , conf.level = .95
  , var.equal = F
)
## 
##  Welch Modified Two-Sample t-Test
## 
## data:  Summarized x and y
## t = -5.5512, df = 107.05, p-value = 1.039e-07
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##         NA -169.6743
## sample estimates:
## mean of x mean of y 
##  367.8125  609.8214

As expected, this test returns the same result. Again, tsum.test(...) is primarily useful when you’re doing homework or checking someone else’s work and only have sample statistics available.

7.2.3 Paired data

The functions t.test and tsum.test both handle paired data with ease. Recall that paired data arise whenever there is a precise dependency between the observations of two specific variables. Paired data often come up when looking at before/after data or when you measure two similar variables for the same individuals (reading and writing results, for instance).

The syntax for both functions is similar to that for a two-sample t-test. First, t.test runs a paired t-test with the following.

t.test(
  x = #vector of values for first group
 , y = #vector of values for the second group
 , mu = #null difference in population means, USUALLY = 0! 
 , alternative = # the type of hypothesis test to run
 , paired = TRUE # this tells R to treat the observations as paired
 , conf.level = # your confidence level as a decimal
    )

To run a paired t-test with tsum.test, recall that the statistic you typically want to analyze when dealing with paired data is the average difference \(\mu_{diff}\) and that the mechanics of this analysis, once you’ve identified that you’re working with paired data, are identical to a single sample t-test. Thus, you can run this with

tsum.test(
    mean.x = # average sample difference
    , s.x = # sample std dev of differences
    , n.x = # number of differences
    , mu = # null mean difference, USUALLY = 0!
    , alternative = # the type of hypothesis test to run
    , conf.level = # your confidence level as a decimal
)

As examples, we’ll use the SkiExperts data set that’s included in the PairedData package. Of course, you need to install the package before using this data set.

data("SkiExperts")
head(SkiExperts)
##   Subject Actual Imaginary
## 1      S1  41.12     44.26
## 2      S2  40.21     44.80
## 3      S3  37.67     37.71
## 4      S4  38.50     37.41
## 5      S5  39.04     38.90
## 6      S6  38.91     34.14

This data set, per the R documentation, “gives the actual and motor imaginary performances (time) in ski for 12 experts” where time is measured in seconds. This data set is from the paper Differences in motor imagery time when predicting task duration in alpine skiers and equestrian riders. The researchers essentially asked expert skiers to predict how long it would take them to ski a section of a ski course after inspecting it, then had them ski that section and recorded their times. Thus, we want to compare each skier’s predicted time to their actual time and test for a difference.

Since it is the same skier making the prediction, then skiing the course, these data are paired. We will look at the difference between actual and predicted times, so \(diff = actual - predicted\), and we simply want to know if there is a difference between a skier’s predicted and actual time. Our hypothesis test is thus

\[\begin{array}{cc} H_0: \mu_{diff} = 0 \\ H_a: \mu_{diff} \neq 0 \end{array} \]

Let’s run this test with t.test using a significance level of .05.

t.test(x = SkiExperts$Actual
       , y = SkiExperts$Imaginary
       , mu = 0
       , alternative = 'two.sided'
       , paired = TRUE
       , conf.level = .95
       )
## 
##  Paired t-test
## 
## data:  SkiExperts$Actual and SkiExperts$Imaginary
## t = -0.63931, df = 11, p-value = 0.5357
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -2.384268  1.310934
## sample estimates:
## mean difference 
##      -0.5366667

Let’s interpret the confidence interval first, then the p-value. The confidence interval gives us a range of plausible values for \(\mu_{diff}\): with 95% confidence, the average difference between actual and predicted times is between -2.4 seconds and 1.3 seconds. Negative values correspond to actual times that are shorter than predicted, and positive values correspond to actual times that are longer than predicted. No surprise, with this confidence interval and a corresponding p-value of about 0.5357, we fail to reject the null. We do not have evidence that expert skiers’ actual ski times deviate much from their estimates, on average.

Now let’s get at the same results using tsum.test. First we need the summary statistics.

AvgDiff <- mean(SkiExperts$Actual - SkiExperts$Imaginary)
SdDiff <- sd(SkiExperts$Actual - SkiExperts$Imaginary)
nDiff <- dim(SkiExperts)[1] #number of rows = number of observations

Now we use tsum.test:

tsum.test(mean.x = AvgDiff
          , s.x = SdDiff
          , n.x = nDiff
          , alternative = 'two.sided'
          , mu = 0
          , conf.level = .95)
## Warning in tsum.test(mean.x = AvgDiff, s.x = SdDiff, n.x = nDiff, alternative = "two.sided", : argument
## 'var.equal' ignored for one-sample test.
## 
##  One-sample t-Test
## 
## data:  Summarized x
## t = -0.63931, df = 11, p-value = 0.5357
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -2.384268  1.310934
## sample estimates:
##  mean of x 
## -0.5366667

As expected, we get the same exact results.
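One more equivalence worth knowing: a paired t-test is just a one-sample t-test on the differences, so running t.test directly on the vector of differences reproduces these results as well.

```r
# One-sample t-test on the differences; equivalent to the paired test above
t.test(SkiExperts$Actual - SkiExperts$Imaginary
       , mu = 0
       , alternative = 'two.sided'
       , conf.level = .95)
```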

7.2.4 Analysis of variance (ANOVA)

In the last section of this chapter, we will learn how to run ANOVA tests in R. Recall that ANOVA is an acronym for Analysis of Variance. Now, you might expect the command for an ANOVA test to be anova(...), but you’d be wrong unfortunately. While this function does exist, it performs a different type of ANOVA analysis on different types of R objects (anova(...) compares different models built from the same data set… this is a topic beyond this class.)

Instead, the function we need to use is simply aov(...). The first argument of this function uses a slightly new format: a formula, which is simply a way to express a relationship between variables. The basic format is Response variable ~ Explanatory variable(s). If you have more than one explanatory variable, you simply add the variable names together on the right-hand side (for example, response ~ var1 + var2). This might seem odd in the abstract, so it might be best to see a formula in action.

Remember that an ANOVA test checks for an association between a categorical explanatory variable and a numerical response. If there were no association between the two variables, the average values of the numerical variable at each level of the categorical variable would be the same. If there is an association, there would have to be one average that differs from the others. Thus, the hypothesis test for ANOVA looks like

\[\begin{array}{l} H_0: \mu_1 = \mu_2 = \cdots = \mu_k \\ H_a: \text{at least one mean differs from the others} \end{array} \]

where \(\mu_i\) is the average of the numerical variable in the \(i^{th}\) group of the categorical variable.
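Behind the scenes, an ANOVA test compares the variability between the group means to the variability within the groups via an F statistic,

\[ F = \frac{MS_{\text{group}}}{MS_{\text{error}}} \]

where large values of \(F\) indicate that the group means spread out more than chance alone would predict. Both mean squares and the \(F\) value appear in the summary output of aov.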

For an example, let’s use the data set gss_cat from the forcats package (this is part of the Tidyverse, so loaded when you attach tidyverse). This data set gives a collection of categorical variables from the General Social Survey from the years 2000-2014.

head(gss_cat)
## # A tibble: 6 × 9
##    year marital         age race  rincome        partyid            relig              denom         tvhours
##   <int> <fct>         <int> <fct> <fct>          <fct>              <fct>              <fct>           <int>
## 1  2000 Never married    26 White $8000 to 9999  Ind,near rep       Protestant         Southern bap…      12
## 2  2000 Divorced         48 White $8000 to 9999  Not str republican Protestant         Baptist-dk w…      NA
## 3  2000 Widowed          67 White Not applicable Independent        Protestant         No denominat…       2
## 4  2000 Never married    39 White Not applicable Ind,near rep       Orthodox-christian Not applicab…       4
## 5  2000 Divorced         25 White Not applicable Not str democrat   None               Not applicab…       1
## 6  2000 Married          25 White $20000 - 24999 Strong democrat    Protestant         Southern bap…      NA

This data set has many categorical variables and only two numerical variables: age and tvhours. In this example, let’s test to see if there is any association between a person’s religious affiliation and the amount of time they spend watching TV per day. The variable tvhours is missing for many observations, so we should filter out the missing values first.

df <- gss_cat %>%
  filter(!is.na(tvhours))
dim(df)
## [1] 11337     9

We have 11337 observations with non-null tvhours. Let’s look at the unique religions represented in this survey.

unique(df$relig)
##  [1] Protestant              Orthodox-christian      None                    Christian              
##  [5] Catholic                Other                   Inter-nondenominational Jewish                 
##  [9] Native american         No answer               Buddhism                Moslem/islam           
## [13] Hinduism                Other eastern           Don't know             
## 16 Levels: No answer Don't know Inter-nondenominational Native american Christian ... Not applicable

To simplify matters, we will lump the “Don’t know”, “None” and “No answer” replies into one category called “Atheist/agnostic”.

df <- df %>%
  mutate(relig = case_when(
    relig == "Don't know" ~ "Atheist/agnostic"
    , relig == "None" ~ "Atheist/agnostic"
    , relig == "No answer" ~ "Atheist/agnostic"
    , .default = relig
  ))

Before running an ANOVA test, let’s look at some summary statistics.

df %>%
  group_by(relig) %>%
  summarize(n = n()
            , meanTV = mean(tvhours)
            , sdTV = sd(tvhours)) %>%
  arrange(desc(n))
## # A tibble: 13 × 4
##    relig                       n meanTV  sdTV
##    <chr>                   <int>  <dbl> <dbl>
##  1 Protestant               5650   3.15  2.70
##  2 Catholic                 2695   2.96  2.35
##  3 Atheist/agnostic         1991   2.72  2.63
##  4 Christian                 391   2.79  2.50
##  5 Jewish                    189   2.52  2.07
##  6 Other                     124   2.73  2.26
##  7 Buddhism                   81   2.38  2.11
##  8 Moslem/islam               64   2.44  2.15
##  9 Inter-nondenominational    54   2.87  2.66
## 10 Hinduism                   37   1.89  1.20
## 11 Orthodox-christian         36   2.42  2.13
## 12 Native american            13   3.46  4.07
## 13 Other eastern              12   1.67  1.56

We should assess whether or not we meet the criteria for inference.

  • Are the observations independent within and between groups? This is a reasonable assumption given that this is a random sample of GSS results over the years.

  • Are the data approximately normally distributed within each group? The variable tvhours is very likely right skewed since many people watch some TV but some people watch a lot. The data likely violate the normality assumption. Since we have fairly large sample sizes in each group, we will ignore this, though this may not be best practice!

  • Is the standard deviation relatively stable across groups? There is some variability among all standard deviations, but they are not all that different… Again, this may not be best practice all of the time, but we will assume this condition is satisfied for the sake of the exercise.

Now that we’ve checked the conditions and acknowledged that some of our assumptions are a bit shaky, we’re ready to execute the ANOVA test. Since our assumptions are shaky, we should not stake too much on the results of this test. The code chunk below demonstrates how to run the test and get a nice printout of the results. Notice that the first argument of aov is a formula with response ~ explanatory. Let’s use a significance level of 0.05.

test <- aov(tvhours ~ relig, data = df)
summary(test)
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## relig          12    482   40.15   6.031 1.25e-10 ***
## Residuals   11324  75394    6.66                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since our p-value \(p \approx 1.25 \times 10^{-10}\) is less than the significance level, we can reject the null and accept the alternative. The data provide sufficient evidence to detect an association between religion and hours spent watching TV. As usual, since this is observational data, we have no way of deciding if this is a causal relationship; we’ve merely detected a relationship (and possibly a tenuous one at that given the questionable assumptions!).

What does this relationship actually look like? Well, since we’re looking for differences in averages, we’d like some way of comparing them all. R does this automatically; the comparisons are stored in the test result above as coefficients.

test$coefficients
##                  (Intercept)                religBuddhism                religCatholic 
##                  2.718232044                 -0.335515995                  0.242064802 
##               religChristian                religHinduism religInter-nondenominational 
##                  0.072049286                 -0.826340152                  0.152138326 
##                  religJewish            religMoslem/islam         religNative american 
##                 -0.199713526                 -0.280732044                  0.743306417 
##      religOrthodox-christian                   religOther           religOther eastern 
##                 -0.301565378                  0.007574407                 -1.051565378 
##              religProtestant 
##                  0.427254681

Notice that the religions in the output above are ordered alphabetically and that “Atheist/agnostic” is conspicuously missing. This is the religion corresponding to the intercept value. In particular, (Intercept) reports the average daily TV hours that Atheist/agnostic identifiers watch. The other coefficients report the change from the Atheist/agnostic group to the group in question’s average. For example, the religBuddhism coefficient is -0.335515995, which means that the average daily TV hours Buddhists watch is about \(2.718232044 - 0.335515995 = 2.382716\). Is this a significant difference? That’s a separate hypothesis test. However, from the results of our ANOVA test, we can conclude that some pair of religions watch different amounts of TV on average. One can use techniques that we won’t cover in this class to identify which groups deviate.
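Following the arithmetic above, you can recover any group’s average by adding its coefficient to the intercept. For example:

```r
# average daily TV hours for Buddhist respondents
test$coefficients["(Intercept)"] + test$coefficients["religBuddhism"]
## about 2.38
```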

One obstacle that you may encounter “out in the wild” (on your homework, for instance) is that the data you are provided or collect on your own are not formatted appropriately for the aov function. Notice that the input requires the categorical explanatory variable to be in one column and the numerical response variable to be in another. Sometimes your data are not presented in this way.

For example, suppose 10 students were randomly selected from each of FVCC, UM, and MSU at the end of STAT216 to take a statistics proficiency exam.

We want to compare the average exam scores for these schools. An ANOVA test is just the ticket to compare these schools. The null and alternative hypotheses are as follows.

  • \(H_0\): there is no association between stats proficiency and school.

  • \(H_a\): there is an association between stats proficiency and school.

Imagine that the exams are scored and you’re given the data below.

msu um fvcc
72 81 83
82 84 83
87 77 88
83 85 78
89 80 85
82 82 89
85 74 81
83 74 90
87 80 82
80 84 85

You want to run an ANOVA test, but recognize that the numerical variable is split between 3 columns and that each column name is actually a value of the school variable! This will not do for using the aov function. The solution is reformatting the data appropriately.

If your data is already loaded into R, you can use the pivot_longer function from the tidyr package (loaded with the tidyverse). Here, our data set is called df. This function takes the column names and turns them into column values, just as desired.

df <- df %>% 
  pivot_longer(cols = 1:3 #selects columns 1-3 for the values
               , names_to = 'school' # the name of the new column
               , values_to = 'score')
head(df) # returns first 6 results to inspect formatting. 
## # A tibble: 6 × 2
##   school score
##   <chr>  <dbl>
## 1 msu       72
## 2 um        81
## 3 fvcc      83
## 4 msu       82
## 5 um        84
## 6 fvcc      83

Now, if your data isn’t in R already, you need to get it into R somehow. In an ideal world, you will have the data in a CSV file or some other spreadsheet-type file. The world is not ideal, however, so sometimes you have to enter the data into R by hand. For the example we’re working on, I would recommend entering your data into R in the following way.

# store each column as a list of numbers named evocatively
msu <- c(85, 86, 86, 78, 88, 83, 82, 86, 83, 77)
um <- c(80, 78, 79, 82, 76, 78, 74, 87, 74, 76)
fvcc <- c(83, 86, 83, 88, 85, 86, 90, 86, 83, 83)
# now make one long list of numbers
scores <- c(msu, um, fvcc)
# now make a list of values for your categorical variable
# this should respect the order of the lists you made above
school <- c( rep("msu", 10), rep("um", 10), rep("fvcc", 10))
# now make a dataframe/tibble with these two lists as columns.
df <- tibble( school, scores)
head(df)
## # A tibble: 6 × 2
##   school scores
##   <chr>   <dbl>
## 1 msu        85
## 2 msu        86
## 3 msu        86
## 4 msu        78
## 5 msu        88
## 6 msu        83

The outputs of these two processes look different, but that’s only because they are sorted differently. Each score value from the msu column in the original data set is represented in the scores column in a row with school == 'msu'.

Now we can easily run our desired ANOVA test.

results <- aov(scores ~ school, data = df) 

results %>%  summary()
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## school       2  254.1  127.03    11.1 0.000303 ***
## Residuals   27  308.9   11.44                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Lo and behold, the data provide evidence of an association between school and statistics proficiency. We’ll look at the coefficients to try to get some idea about how the schools’ averages differ.

results$coefficients
## (Intercept)   schoolmsu    schoolum 
##        85.3        -1.9        -6.9

This gives a point estimate for each school’s average test score with FVCC having the highest at 85.3. According to this fabricated data, MSU and UM STAT216 students do not score as high, on average. This is, of course, completely made up.
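Since the raw scores are right there in the vectors we created, it is easy to verify that the coefficients simply encode the three group means:

```r
msu <- c(85, 86, 86, 78, 88, 83, 82, 86, 83, 77)
um <- c(80, 78, 79, 82, 76, 78, 74, 87, 74, 76)
fvcc <- c(83, 86, 83, 88, 85, 86, 90, 86, 83, 83)

mean(fvcc) # 85.3, the intercept (fvcc comes first alphabetically)
mean(msu)  # 83.4 = 85.3 - 1.9
mean(um)   # 78.4 = 85.3 - 6.9
```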

On a closing note for the chapter, recall that one goal of statistical inference is detecting relationships between variables. In this chapter we encountered many different ways to do this.

  • 2-sample proportion tests check for associations between two (binary) categorical variables.

  • chi-squared tests of independence check for associations between two categorical variables

  • 2-sample t-tests check for an association between a (binary) categorical variable and a numerical variable

  • ANOVA tests for an association between a categorical variable and a numerical variable.

You may notice here that we have so far skipped testing for an association between two numerical variables. The main reason for this is that these associations can be somewhat more complicated than those we’ve checked already. However, you’ve likely already seen one of the main tools for this task: linear regression. The next chapter discusses linear regression and describes how to use R to perform regression tasks.