Chapter 7 Inferential statistics, take 2
This chapter will show you how to calculate confidence intervals and perform hypothesis tests using R’s in-built hypothesis testing functions. We will quickly go over the same examples from the previous sections using R’s functions to recover the values we calculated earlier.
7.1 Analyzing categorical variables
We will learn how to use two workhorse functions in this section: prop.test(...) and chisq.test(...). The former will be used for making inferences for single and two sample proportions, while the latter will be used for chi-squared goodness of fit tests and tests of independence.
Remember: You can always run ?prop.test or ?chisq.test in the console to view R’s help documentation for these functions.
7.1.1 Single sample proportions
The main function we will use in this section is prop.test(...); it both calculates confidence intervals and performs hypothesis tests, all in one fell swoop! The important things for us, then, are understanding what inputs we need and understanding the outputs. For single-sample proportion inferences, the function prop.test requires the following inputs:
prop.test(
x = # The number of successes
, n = # the sample size
, p = # the null proportion of your hypothesis test, the default value is NULL
, alternative = # the type of hypothesis test to run corresponding to your alternative hypothesis; one of "two.sided", "less", or "greater"
, conf.level = # your confidence level as a decimal
, correct = FALSE # This turns OFF the Yates' continuity correction so that calculations are made like we've learned in class.
)
The function will output the results of our desired hypothesis test and a corresponding confidence interval.
Note: The confidence intervals that R calculates are ever-so-slightly different from those that we calculate by hand in two important ways:
- Setting the option correct = FALSE gets us closer to our by-hand techniques, but R still uses a slightly different method under the hood. See this CrossValidated Stack Exchange post for more information.
- If your alternative hypothesis is one-sided and you select the option alternative = "less" or alternative = "greater", then R will report a one-sided confidence interval that we do not discuss in this class.
To understand the output of prop.test(...) and witness the discrepancies mentioned above, let’s return to the 3 basic examples we started with in the previous chapter. Recall that in our hypothetical example:
- the sample size is \(n = 200\);
- our sample has 110 successes, i.e., \(\hat{p} = .55\);
- we are comparing our sample data to the null hypothesis \(H_0: p = .5\).
- Two-sided hypothesis test: In this case our hypothesis test is
\[\begin{array}{cc} H_0: p = .5 \\ H_a: p \neq .5 \end{array} \]
It is quite easy to perform this hypothesis test with the function prop.test(...); we simply run the following code.
result <- prop.test(
x = 110 # the NUMBER of successes
, n = 200 # sample size
, p = .5 # the null proportion
, alternative = "two.sided" # since we're doing a two-sided hypothesis test
, conf.level = .95
, correct = FALSE
)
result
##
## 1-sample proportions test without continuity correction
##
## data: 110 out of 200, null probability 0.5
## X-squared = 2, df = 1, p-value = 0.1573
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4807562 0.6173593
## sample estimates:
## p
## 0.55
Notice that the code stores the output of the hypothesis test as the variable result, then prints a summary of the test. Take a few moments to read through the results and notice that it gives us all of the information we want! In particular,
- the “sample estimate” p is the sample proportion \(\hat{p}\);
- the \(p\)-value is about 0.1573;
- the associated 95% confidence interval is \((0.4807, 0.6173)\).
Let’s compare this with the results we’d obtain making the same calculations “by hand.” First we’ll calculate a \(p\)-value and compare it to that calculated by prop.test above.
n <- 200
pHat <- 110/200
null.SE <- sqrt(.5*.5/200)
p.value.hand <- 2*pnorm(pHat, mean = .5, sd = null.SE, lower.tail = FALSE)
c(p.value.hand, result$p.value)
## [1] 0.1572992 0.1572992
They’re the same, just as we would hope. Now let’s calculate a 95% confidence interval “by hand” and compare the results to that calculated by prop.test above.
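The chunk that produced the output below can be reconstructed along the following lines; this is a sketch, not necessarily the author’s exact code, and it assumes the class conventions of the critical value \(z^* = 1.96\) and the same null standard error \(\sqrt{.5 \cdot .5/200}\) used in the p-value calculation above.

```r
# 95% confidence interval "by hand," reusing the null standard error
pHat <- 110/200
null.SE <- sqrt(.5*.5/200)
c(pHat - 1.96*null.SE, pHat + 1.96*null.SE) # by-hand interval
result$conf.int                             # interval from prop.test above
```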
## [1] 0.4807035 0.6192965
## [1] 0.4807562 0.6173593
## attr(,"conf.level")
## [1] 0.95
Notice that the bounds of our “by hand” calculation are ever so slightly different from the bounds calculated by prop.test; this is because R calculates these confidence intervals with slightly different, more sophisticated techniques than we cover in this class. Fret not, however, as the outputs are similar enough in most cases. One interesting thing to observe is that the confidence interval from prop.test is not symmetric about the sample proportion \(\hat{p}\)!
In our remaining examples, we will move a little more quickly.
- One-sided lower hypothesis test: In this case our hypothesis test is
\[\begin{array}{cc} H_0: p = .5 \\ H_a: p < .5 \end{array} \]
This hypothesis test is quite easy to evaluate with prop.test as well. See the code below:
result <- prop.test(
x = 110 # the NUMBER of successes
, n = 200 # sample size
, p = .5 # the null proportion
, alternative = "less" # the only change from above
, conf.level = .95
, correct = FALSE
)
result
##
## 1-sample proportions test without continuity correction
##
## data: 110 out of 200, null probability 0.5
## X-squared = 2, df = 1, p-value = 0.9214
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
## 0.0000000 0.6068119
## sample estimates:
## p
## 0.55
Most of the output is exactly as we’d expect it to be. The \(p\)-value of this test is the complement of half of the \(p\)-value from the two-sided test (think about the corresponding tail areas shaded under a normal distribution) and the point estimate is the same.
Caution: The main change is the output of the confidence interval. Because this is a one-sided test, R reports a one-sided confidence interval that we do not cover in this class. Suppose \(\alpha\) is the significance level as a decimal with corresponding \(z\)-critical value \(z_\alpha\). Then a one-sided lower confidence interval is approximately
\[ \left( 0, \hat{p} + z_\alpha * SE\right) \]
For our example, we can calculate this upper bound by hand: for a 95% confidence interval, the corresponding significance level is \(\alpha = .05\).
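One plausible way to compute this bound is the chunk below (a sketch of the hidden calculation; we again reuse the null standard error from the earlier p-value computation).

```r
# one-sided upper bound "by hand" with alpha = .05
pHat <- 110/200
null.SE <- sqrt(.5*.5/200)
pHat + qnorm(.95)*null.SE # z_alpha = qnorm(.95)
```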
## [1] 0.6081544
This is very close to, but not exactly, the upper bound of the confidence interval given by R’s prop.test function. For our purposes they are close enough. That said, if you are looking for a confidence interval as we calculate them in class, set the alternative option to "two.sided".
- One-sided upper hypothesis test: In this case our hypothesis test is
\[\begin{array}{cc} H_0: p = .5 \\ H_a: p > .5 \end{array} \]
This hypothesis test is quite easy to evaluate with prop.test as well. See the code below:
result <- prop.test(
x = 110 # the NUMBER of successes
, n = 200 # sample size
, p = .5 # the null proportion
, alternative = "greater" # the only change from above
, conf.level = .95
, correct = FALSE
)
result
##
## 1-sample proportions test without continuity correction
##
## data: 110 out of 200, null probability 0.5
## X-squared = 2, df = 1, p-value = 0.07865
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.4918534 1.0000000
## sample estimates:
## p
## 0.55
The result of this test is very similar to the output from the one-sided lower example and all of the same comments apply. Again, the main change is the reported confidence interval.
Caution: As in the case of a one-sided lower test, R reports a one-sided confidence interval, which deviates from what we learn in this class! Suppose \(\alpha\) is the significance level as a decimal with corresponding \(z\)-critical value \(z_\alpha\). Then a one-sided upper confidence interval is approximately
\[ \left( \hat{p} - z_\alpha * SE, 1 \right) \]
For our example, we can calculate this by hand: for a 95% confidence interval, the corresponding significance level is \(\alpha = .05\). The lower bound for the one-sided confidence interval is
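As before, a short chunk along these lines reproduces the bound (a sketch; the null standard error is reused, matching the earlier calculations).

```r
# one-sided lower bound "by hand" with alpha = .05
pHat <- 110/200
null.SE <- sqrt(.5*.5/200)
pHat - qnorm(.95)*null.SE # z_alpha = qnorm(.95)
```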
## [1] 0.4918456
This is very close to, but not exactly, the lower bound of the confidence interval given by R’s prop.test function. As a reminder, if you are looking for a confidence interval as we calculate them in class, set the alternative option to "two.sided".
7.1.1.1 “Real Life” Single sample proportion test.
We close this section with an example of a single sample proportion test where we start with real-world data. As in the previous chapter, we’ll use the gss2010 data set from the openintro package. Here are the first six rows of this data set.
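The output below presumably comes from a call like the following (assuming the openintro package is loaded, as in the previous chapter):

```r
# peek at the first six rows of the data set
head(gss2010)
```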
## # A tibble: 6 × 5
## hrsrelax mntlhlth hrs1 degree grass
## <int> <int> <int> <fct> <fct>
## 1 2 3 55 BACHELOR <NA>
## 2 4 6 45 BACHELOR LEGAL
## 3 NA NA NA LT HIGH SCHOOL <NA>
## 4 NA NA NA LT HIGH SCHOOL NOT LEGAL
## 5 NA NA NA LT HIGH SCHOOL NOT LEGAL
## 6 NA NA NA LT HIGH SCHOOL LEGAL
As of this writing, marijuana is legal in many states, so we will assume that a majority of Americans in 2024 believe that the drug should be legalized. We want to use the gss2010 data to see whether the national opinion has changed over the past 14 years. In other words, did a minority of Americans think marijuana should be legalized in 2010?
If \(p\) denotes the proportion of Americans who think marijuana should be legalized, the hypothesis test we should use to answer this question is
\[\begin{array}{cc} H_0: p = .5 \quad (\text{or} \quad p \geq .5) \\ H_a: p < .5 \end{array} \]
We should look at the grass column of the data set, which records respondents’ answers to the question “Do you think the use of marijuana should be made legal, or not?” Some of the entries in this column are NA, so we should drop these values from our data set. Note: before dropping these values, it is good practice to check whether there is an association between non-response and educational attainment, because degree is the only complete column in this data set and we need to be careful about maintaining independence among observations. One could check this with a chi-squared test of independence, but we will eyeball it for now.
gss2010 %>%
# add a binary column for replying to the grass question
mutate(grassReply = if_else(is.na(grass), 'No', 'Yes')) %>%
# create a two-way table for degree and grassReply
xtabs(~degree + grassReply, data = .) %>%
# create a proportion table where we divide by the column totals
prop.table(., margin = 2)
## grassReply
## degree No Yes
## BACHELOR 0.18343949 0.18347895
## GRADUATE 0.10445860 0.10802224
## HIGH SCHOOL 0.49681529 0.48530580
## JUNIOR COLLEGE 0.07515924 0.06830818
## LT HIGH SCHOOL 0.14012739 0.15488483
The proportion table above shows that the distribution of educational attainment is just about the same between those that did and did not reply to the question about marijuana legalization. Thus, we can safely drop the NA values from the grass column and proceed with our hypothesis test. To use prop.test we need the sample size and the number of people who responded LEGAL to the grass question.
gss2010 %>%
#select all rows where grass is not NA
filter(!is.na(grass)) %>%
# get sample size and number of respondents answering legal and not legal
summarize(
n = n()
, notLegal = sum(grass == 'NOT LEGAL')
, legal = sum(grass == 'LEGAL')
)
## # A tibble: 1 × 3
## n notLegal legal
## <int> <int> <int>
## 1 1259 656 603
We can finally proceed with our hypothesis test, using a significance level of \(\alpha = .05\).
testResults <- prop.test(
x = 603
, n = 1259
, p = .5
, alternative = 'less'
, conf.level = .95
, correct = F
)
testResults
##
## 1-sample proportions test without continuity correction
##
## data: 603 out of 1259, null probability 0.5
## X-squared = 2.2311, df = 1, p-value = 0.06763
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
## 0.0000000 0.5021298
## sample estimates:
## p
## 0.4789515
About 48% of the respondents in our sample believed that marijuana should be legalized. However, our p-value of 0.068 is greater than our significance level, so we fail to reject the null. In other words, the data do not provide sufficient evidence that a minority of Americans supported marijuana legalization in 2010.
7.1.2 Two sample proportions
The main function we will use in this section is again prop.test(...). In this section we will do one ‘toy’ example and one example using real-world data. The input and use are almost identical to those for single sample proportion inferences. The arguments/inputs are as follows.
prop.test(
# x1 and x2 are the number of successes in sample 1 and 2
x = c(x1, x2)
# n1 and n2 are the sizes of sample 1 and 2
, n = c(n1,n2)
# leave p null when testing for a difference in proportion, our most common use
, p = NULL
, alternative = # the type of hypothesis test to run corresponding to your alternative hypothesis; one of "two.sided", "less", or "greater"
, conf.level = # your confidence level as a decimal
, correct = FALSE # This turns OFF the Yates' continuity correction so that calculations are made like we've learned in class.
)
The output of prop.test will look almost identical to when you perform a single sample inference, and all of our cautions and commentary from the previous section apply in this section as well. First we’ll do a basic, made-up example to see what the output looks like, then an example with real data to practice interpreting the results.
Basic, made-up example: Suppose you have two samples, one with size \(n_1 = 344\) and another with size \(n_2 = 432\). The first sample has 177 successes and the second has 124. You want to check for a difference between the proportion of successes in populations 1 and 2. In other words, your hypothesis test is
\[\begin{array}{cc} H_0: p_1 - p_2 = 0 \\ H_a: p_1 - p_2 \neq 0 \end{array} \]
The following code uses prop.test to perform this hypothesis test with a significance level of \(\alpha = .05\).
results <- prop.test(
x = c(177, 124)
, n = c(344, 432)
, alternative = 'two.sided'
, conf.level = .95
, correct = FALSE
)
results
##
## 2-sample test for equality of proportions without continuity correction
##
## data: c(177, 124) out of c(344, 432)
## X-squared = 41.744, df = 1, p-value = 1.04e-10
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.1596070 0.2953887
## sample estimates:
## prop 1 prop 2
## 0.5145349 0.2870370
The printed results above give us all of the information we need. The \(p\)-value of our test is \(1.0401892\times 10^{-10} < \alpha = .05\), so the data provide compelling evidence of a difference between the population proportions.
How different are the two proportions you ask? Remember, that’s exactly what the confidence interval for a difference in proportions reports. In this case, it means that \(p_1 - p_2\) lies between about 16% and 30% with 95% confidence. In other words, with 95% confidence, the proportion of successes in population 1 is between 16% and 30% higher than the proportion of successes in population 2.
7.1.2.1 “Real world” Example
In this example we turn again to the gss2010 data set from the openintro package and investigate another question about opinions on marijuana legalization. After many years around colleges and universities, I hypothesize that, in general, people with college degrees have a different opinion on marijuana legalization than people without college degrees.
Let \(p_C\) denote the proportion of Americans with college degrees who think marijuana should be legalized and let \(p_H\) be the same proportion for Americans without college degrees. Our hypothesis test is then
\[\begin{array}{cc} H_0: p_C - p_H = 0 \\ H_a: p_C - p_H \neq 0 \end{array} \]
We need to classify the individuals in our sample into college degree recipients and non-college degree recipients. The different levels of the degree factor are
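The levels shown below can be listed with a one-liner like the following (a sketch of the hidden chunk):

```r
# list the levels of the degree factor
levels(gss2010$degree)
```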
## [1] BACHELOR GRADUATE HIGH SCHOOL JUNIOR COLLEGE LT HIGH SCHOOL
## Levels: BACHELOR GRADUATE HIGH SCHOOL JUNIOR COLLEGE LT HIGH SCHOOL
so any individual whose degree level contains the string ‘HIGH SCHOOL’ does not have a college degree. Let’s add a column to our data set, then get the counts we need to perform our test. As in the example in the last section, we will drop the individuals who did not respond to the legalization question. To make the hypothesis test slightly easier to run, we will store the following summary table in memory.
LegalizeIt <- gss2010 %>%
#drop rows with NA in grass column
filter(!is.na(grass)) %>%
# TRUE/FALSE binary for college degree
mutate(collegeDegree = !str_detect(degree, 'HIGH SCHOOL')) %>%
group_by(collegeDegree) %>%
summarize(
n = n()
, legal = sum(grass == 'LEGAL')
, notLegal = sum(grass == 'NOT LEGAL')
) %>%
#arrange the results so that collegeDegree = TRUE comes first.
arrange(desc(collegeDegree))
LegalizeIt
## # A tibble: 2 × 4
## collegeDegree n legal notLegal
## <lgl> <int> <int> <int>
## 1 TRUE 453 234 219
## 2 FALSE 806 369 437
Now we are ready to perform our hypothesis test with prop.test
and a significance level of \(\alpha = .05\).
results <- prop.test(
x = LegalizeIt$legal
, n = LegalizeIt$n
, p = NULL
, alternative = 'two.sided'
, conf.level = .95
, correct = FALSE
)
results
##
## 2-sample test for equality of proportions without continuity correction
##
## data: LegalizeIt$legal out of LegalizeIt$n
## X-squared = 4.0096, df = 1, p-value = 0.04524
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.001287983 0.116191845
## sample estimates:
## prop 1 prop 2
## 0.5165563 0.4578164
The p-value of our test is 0.045, which just squeezes in under \(\alpha = .05\), so we have statistically significant evidence of a difference in opinion on marijuana legalization between those with and without college degrees.
We can use the confidence interval from our results to be a bit more specific. At the 95% confidence level, the proportion of people with college degrees who think marijuana should be legalized is between 0.1% and 11.6% higher than the proportion of people without college degrees.
Remember, this was all calculated using 2010 data, so you may want to redo this investigation with more up-to-date information.
7.1.3 Chi-squared goodness of fit test
A goodness of fit test checks whether your sample data deviate from some expected distribution. For example, you can use this type of test to investigate whether political affiliation in a county deviates from that of the state containing the county, or to test whether your dice are actually fair.
To perform a chi-squared goodness of fit test easily in R, we will use the function chisq.test(...). The inputs to this function are as follows:
chisq.test( x = c(...) # a list of your observed counts
, p = c(...) # a vector of your expected proportions/probabilities
, correct = F # turn off the continuity correction, like usual
)
We will look at pet ownership in Seattle as an example of a goodness of fit test. From www.zipatlas.com, the 10 most populous zip codes in the Seattle metro area are those in the table below. The column pop gives each zip code’s population in 2024. Note: we convert the zip codes to a character vector so that they are rightly treated as categorical variables.
zipDF <- tibble(
zip = c(98115,98103,98133, 98105,98118,98125,98122, 98198, 98117, 98155),
pop = c(54457,51878,50921,50302,48874,43993,41268,38942,36115,35550)
) %>%
mutate(zip = as.character(zip)
, pop = pop
, popProp = pop/sum(pop) ) %>%
arrange(zip)
zipDF
## # A tibble: 10 × 3
## zip pop popProp
## <chr> <dbl> <dbl>
## 1 98103 51878 0.115
## 2 98105 50302 0.111
## 3 98115 54457 0.120
## 4 98117 36115 0.0798
## 5 98118 48874 0.108
## 6 98122 41268 0.0912
## 7 98125 43993 0.0973
## 8 98133 50921 0.113
## 9 98155 35550 0.0786
## 10 98198 38942 0.0861
We want to see if pet ownership is evenly distributed among these zip codes. To do this, we will use the seattlepets data set from the openintro package. As usual, be sure to check the documentation to make sure you understand what the data represent. In this case, it’s a data set detailing pet registrations in Seattle. The first 6 observations in the data set are shown below.
## # A tibble: 6 × 7
## license_issue_date license_number animal_name species primary_breed secondary_breed zip_code
## <date> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2018-11-16 8002756 Wall-E Dog Mixed Breed, Medium (up to… Mix 98108
## 2 2018-11-11 S124529 Andre Dog Terrier, Jack Russell Dachshund, Sta… 98117
## 3 2018-11-21 903793 Mac Dog Retriever, Labrador <NA> 98136
## 4 2018-11-23 824666 Melb Cat Domestic Shorthair <NA> 98117
## 5 2018-12-30 S119138 Gingersnap Cat Domestic Shorthair Mix 98144
## 6 2018-12-16 S138529 Cody Dog Retriever, Labrador <NA> 98103
There are many zip_codes in the data set:
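The count below can be obtained with something like the following (a sketch of the hidden chunk):

```r
# number of distinct zip codes appearing in the raw data
length(unique(seattlepets$zip_code))
```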
## [1] 168
so we need to do some manipulation before we can make the comparison in question. First, we reformat any zip code of the form #####-#### so that it contains just the first five digits, then we filter for only the 10 most populous zip codes, and finally we count the number of licenses in each zip.
df <- seattlepets %>%
mutate(zip_code = str_replace(zip_code, '-[0-9]{1,}', '')
) %>%
filter(zip_code %in% zipDF$zip) %>%
group_by(zip_code) %>%
summarize(n = n()) %>%
arrange(zip_code)
df
## # A tibble: 10 × 2
## zip_code n
## <chr> <int>
## 1 98103 4506
## 2 98105 2070
## 3 98115 4748
## 4 98117 3879
## 5 98118 2375
## 6 98122 2521
## 7 98125 2882
## 8 98133 1775
## 9 98155 28
## 10 98198 6
Next, we make sure that the zip codes are in the same order in both data sets:
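A quick check along these lines produces the TRUE below (a sketch of the hidden chunk):

```r
# verify that both data sets list the zip codes in the same order
all(df$zip_code == zipDF$zip)
```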
## [1] TRUE
Now we’re ready to perform our test. If pet ownership in different zip codes were evenly distributed, we’d expect the proportion of licenses distributed to be roughly the same as the population distribution. Thus, we want to compare the license count (the column n from df above) to the population proportion (popProp from zipDF). The hypothesis test we’re performing is
\[\begin{array}{cc} H_0:\text{Pet ownership is distributed like population in Seattle} \\ H_a: \text{Pet ownership is not distributed like population in Seattle} \end{array} \]
Let’s perform this test with a significance level of \(\alpha = .05\). Executing the test is now easy with chisq.test!
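The hidden chunk that runs the test presumably looks something like this, passing the observed license counts and the expected population proportions; the name results matches the object referenced later in this section.

```r
results <- chisq.test(
  x = df$n            # observed license counts per zip code
  , p = zipDF$popProp # expected proportions from the population
)
results
```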
##
## Chi-squared test for given probabilities
##
## data: df$n
## X-squared = 8549.1, df = 9, p-value < 2.2e-16
Our p-value is far less than the significance level, so we reject the null hypothesis in favor of the alternative. In other words, we have strong evidence suggesting that the distribution of pet ownership/licensing does not match the population distribution among the 10 most populous zip codes in the Seattle metro area.
You might wonder where the distribution deviates the most from expectation. You can use the residuals element of the results object above to answer that question.
If obs and exp are the observed and expected counts respectively, then each residual is:
\[ \frac{\text{obs} - \text{exp}}{\sqrt{\text{exp}}} \]
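Assuming the test output was stored as results (as referenced above), the sorted residuals below can be printed with a one-liner like this (a sketch):

```r
# residuals, sorted from largest deficit to largest surplus
sort(results$residuals)
```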
## 98198 98155 98133 98105 98118 98122 98125 98103 98115
## -46.069289 -43.506938 -19.230237 -13.083768 -5.868322 5.449082 9.587784 31.180236 32.275231
## 98117
## 42.696200
The output above shows us that the zip code 98198 sees way fewer licenses than we’d expect if pet ownership matched the population distribution and the zip code 98117 sees way more licensing than we would expect based on population alone.
7.1.4 Chi-squared test of independence
Recall from the previous chapter that a chi-squared test of independence tests for an association between two categorical variables. No surprise, there is a built-in function in R to perform these tests, and it is again chisq.test(...). The only difference from the last section is how you input the data: for a goodness of fit test you supply observed counts and expected proportions, while for a test of independence you pass in a two-way table of observed counts. We build such a table for educational attainment and legalization opinion below.
table <- gss2010 %>%
mutate(grass = if_else(is.na(grass), 'NO REPLY', grass)) %>%
xtabs(~degree + grass, data = .)
table
## grass
## degree LEGAL NO REPLY NOT LEGAL
## BACHELOR 119 144 112
## GRADUATE 73 82 63
## HIGH SCHOOL 304 390 307
## JUNIOR COLLEGE 42 59 44
## LT HIGH SCHOOL 65 110 130
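The test output below presumably comes from passing the contingency table straight to chisq.test, along these lines (storing the result so we can inspect it later):

```r
results <- chisq.test(table) # test of independence on the two-way table
results
```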
##
## Pearson's Chi-squared test
##
## data: table
## X-squared = 22.338, df = 8, p-value = 0.004326
Since our p-value is less than the significance level of \(\alpha = .05\), we conclude that the data imply an association between highest educational attainment and opinion on marijuana legalization. You might then wonder whether you can infer, to some degree, what this association actually looks like. The key is looking at the residuals of this hypothesis test.
Recall that the chi-squared test statistic is
\[ \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}}. \]
where obs and exp represent the observed and expected counts and the sum ranges over all cells in the contingency table. The residual of a cell in a contingency table is simply
\[\text{residual} = \frac{(\text{obs} - \text{exp})}{\sqrt{\text{exp}}} \]
so a residual simply measures how far above or below the observed count is from the expected count, normalized by the square root of the expected count. Large residuals imply a large deviation from expectation, so you can think of any cell with a large residual as representing a place where the detected association between the variables is most evident. We can access the residuals from our test as follows:
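Assuming the test output was stored as results, the residuals print as below (a sketch):

```r
results$residuals # one residual per cell of the contingency table
```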
## grass
## degree LEGAL NO REPLY NOT LEGAL
## BACHELOR 0.795903942 -0.001589914 -0.761336064
## GRADUATE 1.083344800 -0.188315397 -0.832659358
## HIGH SCHOOL 0.505993107 0.283830520 -0.795608394
## JUNIOR COLLEGE -0.118711782 0.443908579 -0.371782229
## LT HIGH SCHOOL -2.633232818 -0.659297213 3.245833866
Most of our residuals are around 1 or less in absolute value, except for those corresponding to people who did not finish high school. In this group, fewer people than we would expect under independence believe that marijuana should be legalized, and more than expected think it should not be legal.
Does this mean that more education causes one to think marijuana should be legal? Of course not! We’re simply observing an association in an observational data set and it would be a huge mistake to try to infer a causal relationship between the two variables. There are likely many confounding variables at play here.
7.2 Analyzing numerical variables
In this section we will learn how to use R’s in-built functions to perform statistical inference for numerical variables. After the last chapter, it should come as no surprise that the bulk of our inference will be performed by a single function. For categorical variables we frequently used prop.test(...); the analogous function for numerical variables is t.test(...). The following subsections will guide you through the use of this function to perform inferences for single sample means, two sample mean comparisons, and paired data. We will then discuss using the function aov for performing ANOVA tests.
Note: The function t.test(...) requires that you have access to the actual data, not just summary statistics. In what follows we will also discuss the tsum.test(...) function from the BSDA package, which executes t-tests from summary statistics.
7.2.1 Single sample mean
If you have a data set, making an inference about a numerical variable therein with t.test is quite easy. The basic syntax for a single numerical variable is as follows.
t.test(
x = # a VECTOR of numerical values, typically a column of a data set.
, mu = # the null mean of your hypothesis test; defaults to 0.
, alternative = # the type of hypothesis test to run corresponding to your alternative hypothesis; one of "two.sided", "less", or "greater"
, conf.level = # your confidence level as a decimal
)
We will use the fastfood data set from the openintro package to demonstrate the use of this function. This data set gives the nutritional breakdown of many menu items from a number of fast food restaurants.
## # A tibble: 6 × 17
## restaurant item calories cal_fat total_fat sat_fat trans_fat cholesterol sodium total_carb fiber sugar
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mcdonalds Artisan… 380 60 7 2 0 95 1110 44 3 11
## 2 Mcdonalds Single … 840 410 45 17 1.5 130 1580 62 2 18
## 3 Mcdonalds Double … 1130 600 67 27 3 220 1920 63 3 18
## 4 Mcdonalds Grilled… 750 280 31 10 0.5 155 1940 62 2 18
## 5 Mcdonalds Crispy … 920 410 45 12 0.5 120 1980 81 4 18
## 6 Mcdonalds Big Mac 540 250 28 10 1 80 950 46 3 9
## # ℹ 5 more variables: protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>
Suppose we want to know the average caloric content of a burger from a fast food restaurant. Moreover, we suspect that the average burger contains more than 600 calories. To put this in the context of statistical inference, we want to calculate a confidence interval and perform a hypothesis test. Let \(\mu\) denote the average calorie content of a fast food burger. Our hypothesis test is
\[\begin{array}{cc} H_0: \mu = 600 \\ H_a: \mu > 600 \end{array} \]
We can use the data set above to help make these inferences! We will use t.test(...) after massaging the data a bit. First, we’ll select all rows from fastfood where the item contains the word “burger” or “Burger”.
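One plausible reconstruction of that hidden filter is below; the pattern 'urger' is our guess for catching both capitalizations, and we assume the tidyverse is loaded, as elsewhere in these notes.

```r
# keep only menu items whose name contains "burger" or "Burger"
burgers <- fastfood %>%
  filter(str_detect(item, 'urger'))
dim(burgers)
```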
## [1] 56 17
So our data set has only 56 burgers. Let’s first execute our hypothesis test with a significance level of \(\alpha = .05\) to see if our idea has any legs.
results <- t.test(x = burgers$calories
, alternative = 'greater'
, mu = 600
, conf.level = .95)
results
##
## One Sample t-test
##
## data: burgers$calories
## t = 0.28482, df = 55, p-value = 0.3884
## alternative hypothesis: true mean is greater than 600
## 95 percent confidence interval:
## 552.1307 Inf
## sample estimates:
## mean of x
## 609.8214
Our \(p\)-value is about 0.3884, greater than the significance level, so we fail to reject the null. In other words, our sample data do not provide sufficient evidence to conclude that the average burger has more than 600 calories.
Notice that the confidence interval reported above is one-sided, to be consistent with the hypothesis test, just like the output of prop.test(...) when performing a one-sided hypothesis test. We can reproduce this confidence interval in a similar way; you can find the lower bound with:
t.crit <- qt(.05, df = 55, lower.tail = F)
SE <- sd(burgers$calories)/sqrt(56)
mean(burgers$calories) - t.crit*SE
## [1] 552.1307
The upper bounds for one-sided confidence intervals associated with one-sided lower hypothesis tests can be calculated analogously.
If we want a standard two-sided confidence interval, we simply need to set alternative = "two.sided".
results <- t.test(x = burgers$calories
, alternative = 'two.sided'
, mu = 600
, conf.level = .95)
results$conf.int
## [1] 540.7165 678.9263
## attr(,"conf.level")
## [1] 0.95
Thus, with 95% confidence, fast food burgers contain between about 541 and 679 calories on average.
Now, suppose you don’t have a data set available and only have summary statistics. This happens frequently on homework, for example. If you want to use a function similar to t.test(...), you have to either write your own function (this is left as an exercise for the reader, as it is beyond the scope of these notes) or install a new package called BSDA. This package, whose name stands for “Basic Statistics and Data Analysis,” was created by Alan T. Arnholt to accompany a textbook of the same name, written by Larry J. Kitchens. In any case, remember that a package must be installed before it can be loaded into your R session, and it must be loaded into your R session before you can use any functions therein. In other words, you must run
install.packages('BSDA')
once, the first time you want to use the functions, then run
library(BSDA)
any time you start a new R session and want to use this package. For a single sample t-test, the syntax for tsum.test(...) is as follows.
tsum.test(
mean.x = #sample average
, s.x = #sample standard deviation
, n.x = #sample size
, alternative = #type of test to perform
, mu = #null mean
, conf.level = # confidence level as a decimal
)
We will use tsum.test(...)
to perform the exact same hypothesis test as above, this time inputting the sample statistics instead of the raw data.
xBar <- mean(burgers$calories)
s <- sd(burgers$calories)
n <- dim(burgers)[1]
tsum.test(
mean.x = xBar
, s.x = s
, n.x = n
, alternative = 'greater'
, mu = 600
, conf.level = .95
)
## Warning in tsum.test(mean.x = xBar, s.x = s, n.x = n, alternative = "greater", : argument 'var.equal'
## ignored for one-sample test.
##
## One-sample t-Test
##
## data: Summarized x
## t = 0.28482, df = 55, p-value = 0.3884
## alternative hypothesis: true mean is greater than 600
## 95 percent confidence interval:
## 552.1307 NA
## sample estimates:
## mean of x
## 609.8214
Et voilà, the exact same results as above! You can ignore the warning about var.equal
as the function is exhibiting exactly the behavior we want.
7.2.2 Two sample mean comparison
The functions t.test(...)
and tsum.test(...)
also perform two-sample t-tests for us. Recall that, roughly, a two-sample t-test checks for an association between a two-valued categorical variable and a single numerical variable. With these tests you make inferences about the difference between two population averages \(\mu_1 - \mu_2\) from the difference in sample averages \(\overline{x}_1 - \overline{x}_2\), so the associated confidence intervals are estimating the difference in population averages.
If you have actual data for a two-sample t-test, use t.test(...)
with the following syntax.
t.test(
x = #vector of values for first group
, y = #vector of values for the second group
, mu = #null difference in population means, USUALLY = 0!
, alternative = # the type of hypothesis test to run
, var.equal = FALSE # we will assume the pop. variances are unequal in this class
, conf.level = # your confidence level as a decimal
)
If you don’t have data and only have sample statistics, use tsum.test(...)
with the following syntax.
tsum.test(
mean.x = #sample mean from group 1
, s.x = #sample std dev from group 1
, n.x = # sample size from group 1
, mean.y = #sample mean of group 2
, s.y = # sample std dev of group 2
, n.y = # sample size of group 2
, mu = #null difference in population means, USUALLY = 0!
, alternative = # the type of hypothesis test to run
, var.equal = FALSE # we will assume the pop. variances are unequal in this class
, conf.level = # your confidence level as a decimal
)
Let’s see these functions in action, again using the fastfood
data set.
We typically think that salads are a healthier option than burgers when dining out. While “healthiness” is a complicated notion, we can compare the calorie content of these food options as a first pass at healthiness. For instance, a fast food diner may assume they are eating a lower calorie option by ordering a salad over a burger. How true is this on average? Let’s get the data ready for comparison.
burgers <- fastfood %>%
filter(str_detect(item, '[bB]urger'))
salads <- fastfood %>%
filter(str_detect(item, '[sS]alad'))
Let \(\mu_B\) and \(\mu_S\) denote the average calorie content of fast food burgers and salads, respectively. We want to test whether fast food salads are lower calorie than burgers, on average. Written as a hypothesis test, this is
\[\begin{array}{cc} H_0: \mu_S - \mu_B = 0 \\ H_a: \mu_S - \mu_B < 0 \end{array} \]
because if \(\mu_S < \mu_B\) then \(\mu_S - \mu_B < 0\). Executing this test with t.test
is easy! We will use a significance level of \(\alpha = .05\).
results <- t.test(
x = salads$calories
, y = burgers$calories
, alternative = 'less'
, conf.level = .95
, var.equal = F
)
results
##
## Welch Two Sample t-test
##
## data: salads$calories and burgers$calories
## t = -5.5512, df = 107.05, p-value = 1.039e-07
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -169.6743
## sample estimates:
## mean of x mean of y
## 367.8125 609.8214
With a p-value of \(p \approx 1.04 \times 10^{-7} < .05 = \alpha\), we reject the null hypothesis and accept the alternative. In other words, the data provide strong evidence that, on average, fast food salads have fewer calories than burgers. Just how different are these averages? Let’s estimate the difference \(\mu_S - \mu_B\) with a typical two-sided confidence interval.
results <- t.test(
x = salads$calories
, y = burgers$calories
, alternative = 'two.sided'
, conf.level = .95
, var.equal = F
)
results$conf.int
## [1] -328.4319 -155.5859
## attr(,"conf.level")
## [1] 0.95
This calculation implies that fast food salads contain between about 155.6 and 328.4 fewer calories than burgers on average at the 95% confidence level.
As a last calculation in this section, we will perform the same hypothesis test using tsum.test(...)
for demonstration purposes. We first need to calculate all the required summary statistics since they haven’t been provided.
# sample size, sample avg, and sample std dev for salads
n.s <- length(salads$calories)
mean.s <- mean(salads$calories)
s.s <- sd(salads$calories)
# sample size, sample avg, and sample std dev for burgers
n.b <- length(burgers$calories)
mean.b <- mean(burgers$calories)
s.b <- sd(burgers$calories)
Now we need to chuck these values into tsum.test(...)
. Here we’ll perform our initial one-sided hypothesis test.
tsum.test(
mean.x = mean.s
, s.x = s.s
, n.x = n.s
, mean.y = mean.b
, s.y = s.b
, n.y = n.b
, alternative = 'less'
, conf.level = .95
, var.equal = F
)
##
## Welch Modified Two-Sample t-Test
##
## data: Summarized x and y
## t = -5.5512, df = 107.05, p-value = 1.039e-07
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## NA -169.6743
## sample estimates:
## mean of x mean of y
## 367.8125 609.8214
As expected, this test returns the same result. Again, tsum.test(...)
is primarily useful when you’re doing homework or checking someone else’s work and only have sample statistics available.
7.2.3 Paired data
The functions t.test
and tsum.test
both handle paired data with ease. Recall that paired data arise whenever there is a precise dependency between the observations of two specific variables. Paired data often come up when looking at before/after measurements or when you measure two similar variables for the same individuals (reading and writing scores, for instance).
The syntax for both functions is similar to that for a two-sample t-test. First, t.test
runs a paired t-test with the following.
t.test(
x = #vector of values for first group
, y = #vector of values for the second group
, mu = #null difference in population means, USUALLY = 0!
, alternative = # the type of hypothesis test to run
, paired = TRUE # this tells R the data are paired
, conf.level = # your confidence level as a decimal
)
To run a paired t-test with tsum.test
recall that the statistic you typically want to analyze when dealing with paired data is the average difference \(\mu_{diff}\) and that the mechanics of this analysis, once you’ve identified that you’re working with paired data, are identical to a single sample t-test. Thus, you can run this with
tsum.test(
mean.x = # average sample difference
, s.x = # sample std dev of differences
, n.x = # number of differences
, alternative = # the type of hypothesis test to run
, conf.level = # your confidence level as a decimal
)
As examples, we’ll use the SkiExperts
data set that’s included in the PairedData
package. Of course, you need to install the package before using this data set.
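A chunk along these lines installs and loads the package and produces the preview below.

```r
# install.packages("PairedData")  # run once to install the package
library(PairedData)   # load the package into the session
data(SkiExperts)      # attach the data set
head(SkiExperts)      # preview the first six rows
```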
## Subject Actual Imaginary
## 1 S1 41.12 44.26
## 2 S2 40.21 44.80
## 3 S3 37.67 37.71
## 4 S4 38.50 37.41
## 5 S5 39.04 38.90
## 6 S6 38.91 34.14
This data set, per the R
documentation, “gives the actual and motor imaginary performances (time) in ski for 12 experts” where time is measured in seconds. This data set is from the paper Differences in motor imagery time when predicting task duration in alpine skiers and equestrian riders. The researchers essentially asked expert skiers to predict how long it would take them to ski a section of a ski course after inspecting it, then had them ski that section and recorded their times. Thus, we want to compare each skier’s predicted time to their actual time and test for a difference.
Since it is the same skier making the prediction, then skiing the course, these data are paired. We will look at the difference between actual and predicted times, so \(diff = actual - predicted\), and we simply want to know if there is a difference between a skier’s predicted and actual times. Our hypothesis test is thus
\[\begin{array}{cc} H_0: \mu_{diff} = 0 \\ H_a: \mu_{diff} \neq 0 \end{array} \]
Let’s run this test with t.test using a significance level of .05.
t.test(x = SkiExperts$Actual
, y = SkiExperts$Imaginary
, mu = 0
, alternative = 'two.sided'
, paired = TRUE
, conf.level = .95
)
##
## Paired t-test
##
## data: SkiExperts$Actual and SkiExperts$Imaginary
## t = -0.63931, df = 11, p-value = 0.5357
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -2.384268 1.310934
## sample estimates:
## mean difference
## -0.5366667
Let’s interpret the confidence interval first, then the p-value. The confidence interval gives us a range of plausible values for \(\mu_{diff}\); in other words, with 95% confidence, the average difference between actual and predicted times is between -2.4 seconds and 1.3 seconds. Since \(diff = actual - predicted\), negative values correspond to actual times that were shorter than predicted and positive values to actual times that were longer than predicted. No surprise, with this confidence interval and a corresponding p-value of about 0.5357, we fail to reject the null. We do not have evidence that expert skiers’ actual ski times deviate from their predictions on average.
Now let’s get at the same results using tsum.test
. First we need the summary statistics.
AvgDiff <- mean(SkiExperts$Actual - SkiExperts$Imaginary)
SdDiff <- sd(SkiExperts$Actual - SkiExperts$Imaginary)
nDiff <- dim(SkiExperts)[1] #number of rows = number of observations
Now we use tsum.test
tsum.test(mean.x = AvgDiff
, s.x = SdDiff
, n.x = nDiff
, alternative = 'two.sided'
, mu = 0
, conf.level = .95)
## Warning in tsum.test(mean.x = AvgDiff, s.x = SdDiff, n.x = nDiff, alternative = "two.sided", : argument
## 'var.equal' ignored for one-sample test.
##
## One-sample t-Test
##
## data: Summarized x
## t = -0.63931, df = 11, p-value = 0.5357
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -2.384268 1.310934
## sample estimates:
## mean of x
## -0.5366667
As expected, we get the same exact results.
7.2.4 Analysis of variance (ANOVA)
In the last section of this chapter, we will learn how to run ANOVA tests in R
. Recall that ANOVA is an acronym for Analysis of Variance. Now, you might expect the command for an ANOVA test to be anova(...)
, but you’d be wrong unfortunately. While this function does exist, it performs a different type of ANOVA analysis on different types of R
objects (anova(...)
compares different models built from the same data set… this is a topic beyond this class.)
Instead, the function we need to use is simply aov(...)
. The input to this function uses a slightly new format: the first argument is a formula, which is simply a way to express a relationship between variables. The basic format is Response variable ~ Explanatory variable(s)
. If you have more than one explanatory variable, you simply add the variable names together (e.g., y ~ x1 + x2). This might seem odd in the abstract, so it is best to see a formula in action.
Remember that an ANOVA test checks for an association between a categorical explanatory variable and a numerical response. If there were no association between the two variables, the average values of the numerical variable at each level of the categorical variable would be the same. If there is an association, there would have to be one average that differs from the others. Thus, the hypothesis test for ANOVA looks like
\[\begin{array}{l} H_0: \mu_1 = \mu_2 = \cdots = \mu_k \\ H_a: \text{at least one mean differs from the others} \end{array} \]
where \(\mu_i\) is the average of the numerical variable in the \(i^{th}\) group of the categorical variable.
For an example, let’s use the data set gss_cat
from the forcats
package (this is part of the Tidyverse, so loaded when you attach tidyverse
). This data set gives a collection of categorical variables from the General Social Survey from the years 2000-2014.
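A quick peek at the data, via something like the following, gives the preview below.

```r
library(tidyverse)  # attaches forcats, which contains gss_cat
head(gss_cat)       # preview the first six rows
```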
## # A tibble: 6 × 9
## year marital age race rincome partyid relig denom tvhours
## <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
## 1 2000 Never married 26 White $8000 to 9999 Ind,near rep Protestant Southern bap… 12
## 2 2000 Divorced 48 White $8000 to 9999 Not str republican Protestant Baptist-dk w… NA
## 3 2000 Widowed 67 White Not applicable Independent Protestant No denominat… 2
## 4 2000 Never married 39 White Not applicable Ind,near rep Orthodox-christian Not applicab… 4
## 5 2000 Divorced 25 White Not applicable Not str democrat None Not applicab… 1
## 6 2000 Married 25 White $20000 - 24999 Strong democrat Protestant Southern bap… NA
This data set has many categorical variables and only two numerical variables: age
and tvhours
. In this example, let’s test to see if there is any association between a person’s religious affiliation and the amount of time they spend watching TV per day. The variable tvhours
is missing for many observations, so we should filter out the missing values first.
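One way to do this (the name df is our choice) is:

```r
df <- gss_cat %>%
  filter(!is.na(tvhours))  # keep only rows where tvhours is recorded
dim(df)                    # rows and columns after filtering
```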
## [1] 11337 9
We have 11337 observations with non-null tvhours
. Let’s look at the unique religions represented in this survey.
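For instance:

```r
unique(df$relig)  # the distinct religion values in the filtered data
```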
## [1] Protestant Orthodox-christian None Christian
## [5] Catholic Other Inter-nondenominational Jewish
## [9] Native american No answer Buddhism Moslem/islam
## [13] Hinduism Other eastern Don't know
## 16 Levels: No answer Don't know Inter-nondenominational Native american Christian ... Not applicable
To simplify matters, we will lump the “Don’t know”, “None” and “No answer” replies into one category called “Atheist/agnostic”.
df <- df %>%
mutate(relig = case_when(
relig == "Don't know" ~ "Atheist/agnostic"
, relig == "None" ~ "Atheist/agnostic"
, relig == "No answer" ~ "Atheist/agnostic"
, .default = relig
))
Before running an ANOVA test, let’s look at some summary statistics.
df %>%
group_by(relig) %>%
summarize(n = n()
, meanTV = mean(tvhours)
, sdTV = sd(tvhours)) %>%
arrange(desc(n))
## # A tibble: 13 × 4
## relig n meanTV sdTV
## <chr> <int> <dbl> <dbl>
## 1 Protestant 5650 3.15 2.70
## 2 Catholic 2695 2.96 2.35
## 3 Atheist/agnostic 1991 2.72 2.63
## 4 Christian 391 2.79 2.50
## 5 Jewish 189 2.52 2.07
## 6 Other 124 2.73 2.26
## 7 Buddhism 81 2.38 2.11
## 8 Moslem/islam 64 2.44 2.15
## 9 Inter-nondenominational 54 2.87 2.66
## 10 Hinduism 37 1.89 1.20
## 11 Orthodox-christian 36 2.42 2.13
## 12 Native american 13 3.46 4.07
## 13 Other eastern 12 1.67 1.56
We should assess whether or not we meet the criteria for inference.
Are the observations independent within and between groups? This is a reasonable assumption given that this is a random sample of GSS results over the years.

Are the data approximately normally distributed within each group? The variable tvhours
is very likely right skewed since many people watch some TV but some people watch a lot. The data likely violate the normality assumption. Since we have fairly large sample sizes in each group we will ignore this, though this may not be best practice!

Is the standard deviation relatively stable across groups? There is some variability among the standard deviations, but they are not all that different. Again, this may not be best practice all of the time, but we will assume this condition is satisfied for the sake of the exercise.
Now that we’ve checked the conditions and acknowledged that some of our assumptions are a bit shaky, we’re ready to execute the ANOVA test. Since our assumptions are shaky, we should not stake too much on the results of this test. The code chunk below demonstrates how to run the test and get a nice printout of the results. Notice that the first argument of aov
is a formula with response ~ explanatory
. Let’s use a significance level of 0.05.
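A chunk like the following produces the summary shown below; the object name test is our choice (and is referenced again when we inspect the coefficients).

```r
test <- aov(tvhours ~ relig, data = df)  # response ~ explanatory
summary(test)                            # print the ANOVA table
```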
## Df Sum Sq Mean Sq F value Pr(>F)
## relig 12 482 40.15 6.031 1.25e-10 ***
## Residuals 11324 75394 6.66
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since our p-value \(p \approx 1.25 \times 10^{-10}\) is less than the significance level, we reject the null and accept the alternative. The data provide sufficient evidence of an association between religion and hours spent watching TV. As usual, since these are observational data, we have no way of deciding whether this is a causal relationship; we’ve merely detected a relationship (and possibly a tenuous one at that, given the questionable assumptions!).
What does this relationship actually look like? Well, since we’re looking for differences in averages, we’d like some way of comparing them all. R
does this automatically; the comparisons are stored in the test
result above as coefficients:
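Assuming the fitted model was saved as test (our name for it), the coefficients can be extracted with:

```r
test$coefficients  # baseline (intercept) plus per-group offsets
```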
## (Intercept) religBuddhism religCatholic
## 2.718232044 -0.335515995 0.242064802
## religChristian religHinduism religInter-nondenominational
## 0.072049286 -0.826340152 0.152138326
## religJewish religMoslem/islam religNative american
## -0.199713526 -0.280732044 0.743306417
## religOrthodox-christian religOther religOther eastern
## -0.301565378 0.007574407 -1.051565378
## religProtestant
## 0.427254681
Notice that the religions in the output above are ordered alphabetically and that “Atheist/agnostic” is conspicuously missing. This is the religion corresponding to the intercept value. In particular, (Intercept)
reports the average daily TV hours that Atheist/agnostic identifiers watch. The other coefficients report the change from the Atheist/agnostic group to the group in question’s average. For example, the religBuddhism
coefficient is -0.335515995, which means that the average daily TV hours Buddhists watch is about \(2.718232044 - 0.335515995 = 2.382716\). Is this a significant difference? That’s a separate hypothesis test. However, from the results of our ANOVA test, we can conclude that some pair of religions watch different amounts of TV on average. One can use techniques that we won’t cover in this class to identify which groups deviate.
One obstacle that you may encounter “out in the wild” (on your homework, for instance) is that the data you are provided or collect on your own are not formatted appropriately for the aov
function. Notice that the input requires the categorical explanatory variable to be in one column and the numerical response in another. Sometimes your data are not presented this way.
For example, suppose 10 students were randomly selected from each of FVCC, UM, and MSU at the end of STAT216 to take a statistics proficiency exam.
We want to compare the average exam scores for these schools. An ANOVA test is just the ticket to compare these schools. The null and alternative hypothesis are as follows.
\(H_0\): there is no association between stats proficiency and school.
\(H_a\): there is an association between stats proficiency and school.
Imagine that the exams are scored and you’re given the data below.
msu | um | fvcc |
---|---|---|
85 | 80 | 83 |
86 | 78 | 86 |
86 | 79 | 83 |
78 | 82 | 88 |
88 | 76 | 85 |
83 | 78 | 86 |
82 | 74 | 90 |
86 | 87 | 86 |
83 | 74 | 83 |
77 | 76 | 83 |
You want to run an ANOVA test, but recognize that the numerical variable is split between 3 columns and that each column name is actually a value of the school
variable! This will not do for using the aov
function. The solution is reformatting the data appropriately.
If your data is already loaded into R, you can use the pivot_longer
function from the tidyverse
package. Here, our data set is called df
. This function takes the column names and turns them into column values, just as desired.
df <- df %>%
pivot_longer(cols = 1:3 #selects columns 1-3 for the values
, names_to = 'school' # the name of the new column
, values_to = 'score')
head(df) # returns first 6 results to inspect formatting.
## # A tibble: 6 × 2
## school score
## <chr> <dbl>
## 1 msu    85
## 2 um     80
## 3 fvcc   83
## 4 msu    86
## 5 um     78
## 6 fvcc   86
Now, if your data isn’t in R already, you need to get it in there somehow. In an ideal world, you will have the data in a CSV file or some other spreadsheet-type file. The world is not ideal, however, so sometimes you have to enter the data into R by hand. For the example we’re working on, I would recommend entering your data into R in the following way.
# store each column as a list of numbers named evocatively
msu <- c(85, 86, 86, 78, 88, 83, 82, 86, 83, 77)
um <- c(80, 78, 79, 82, 76, 78, 74, 87, 74, 76)
fvcc <- c(83, 86, 83, 88, 85, 86, 90, 86, 83, 83)
# now make one long list of numbers
scores <- c(msu, um, fvcc)
# now make a list of values for your categorical variable
# this should respect the order of the lists you made above
school <- c( rep("msu", 10), rep("um", 10), rep("fvcc", 10))
# now make a dataframe/tibble with these two lists as columns.
df <- tibble( school, scores)
head(df)
## # A tibble: 6 × 2
## school scores
## <chr> <dbl>
## 1 msu 85
## 2 msu 86
## 3 msu 86
## 4 msu 78
## 5 msu 88
## 6 msu 83
The outputs of these two processes look different, but that’s only because the rows are sorted differently. Each score from the msu column in the original data set appears in the scores
column in a row with school == 'msu'
.
Now we can easily run our desired ANOVA test.
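For instance, fitting the model and printing the ANOVA table in one go:

```r
summary(aov(scores ~ school, data = df))  # ANOVA table for the school comparison
```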
## Df Sum Sq Mean Sq F value Pr(>F)
## school 2 254.1 127.03 11.1 0.000303 ***
## Residuals 27 308.9 11.44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lo and behold, the data provide evidence of an association between school and statistics proficiency. We’ll look at the coefficients to try to get some idea of how the schools’ averages differ.
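For example:

```r
aov(scores ~ school, data = df)$coefficients  # group means relative to the baseline group
```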
## (Intercept) schoolmsu schoolum
## 85.3 -1.9 -6.9
This gives a point estimate for each school’s average test score with FVCC having the highest at 85.3. According to this fabricated data, MSU and UM STAT216 students do not score as high, on average. This is, of course, completely made up.
On a closing note for the chapter, recall that one goal of statistical inference is detecting relationships between variables. In this chapter we encountered many different ways to do this.
2-sample proportion tests check for associations between two (binary) categorical variables.
chi-squared tests of independence check for associations between two categorical variables.
2-sample t-tests check for an association between a (binary) categorical variable and a numerical variable
ANOVA tests for an association between a categorical variable and a numerical variable.
You may notice that we have so far skipped testing for an association between two numerical variables. The main reason for this is that these associations can be somewhat more complicated than those we’ve checked already. However, you’ve likely already seen one of the main tools for this task: linear regression. The next chapter discusses linear regression and describes how to use R to perform regression tasks.