In this activity we will analyze Montana Fish, Wildlife, and Park’s 2020 deer hunting season report to address a few questions about deer hunting in the state. The data set breaks down the results from the 2020 season by hunting district, species, sex of deer, and more. We want to address the following questions
Do Montanan’s harvest more whitetail or mule deer annually?
Which district(s) yield the most deer? Which yield the most by species?
Which district(s) yield the most “big bucks”?
Are the big bucks spread out evenly around Northwest Montana?
Check out this map to familiarize yourself with the location of the hunting districts.
Fire up RStudio. We’ll use the tidyverse
and
openintro
packages in this lab, so let’s first load these
packages from the console. Be sure to also include these packages in the
beginning of your lab notebook.
library(knitr)
library(tidyverse)
library(openintro)
We will again use R Markdown to create a lab report. In RStudio, go
to New File -> R Markdown. Then, choose “From Template” and select
Lab Report for OpenIntro Statistics Labs
.
For a bit of additional assistance, check out this video created by OpenIntro.
Note: For each exercise, you should have a subsection that begins
with ### Exercise
followed by the exercise number and a
completely blank like. As an example,
[End of Exercise 1 work.]
### Exercise 2
[Your work here.]
Like a previous activity, we need to download the data set from the
web. Copy and paste the following code into a chunk in your lab report,
then evaluate the chunk by clicking the green arrow in the top left
corner. This chunk will download the csv
file, then read it
into Rstudio
and store the data as deer
<- "https://raw.githubusercontent.com/ckaterba/STAT216_activity_data/main/mtDeerHunt2020.csv"
url if(!file.exists("./deer.csv")){
download.file(url, "deer.csv")
}<- read.csv("deer.csv") deer
In the console, type view(deer)
to explore the data a
bit. You will notice 8 variables:
district
: an integer representing the hunting
district. Notice that this is a categorical variable even though it is a
number.
species
: categorical with two values md
and wt
representing mule deer and whitetail deer
respectively.
harvest
: an integer reporting the count of the
species harvested in the district.
bucks
, does
, fawns
: an
integer reporting the count of the gender/age harvested in the
district.
smallBucks
and bigBucks
: an integer
reporting the count of the number of small and big bucks. Here a small
buck has fewer than 4 points and a big buck has four or more
points.
All of the numbers above are counts of categorical variables, so we will be applying the techniques and tools from Chapter 6 in our textbook to analyze this data.
In this section, we’ll attempt to address our first question: Do Montanan’s harvest more mule deer or whitetail deer annually? The data for 2020 will serve as a point estimate.
How many deer were harvested in 2020, ignoring species? Hint:
add up the harvest
column
What proportion of deer harvested were whitetails? How about mule
deer? Suggestion: Create 2 new data sets, one for each species,
using the subset
command. For example,
mule <- subset(deer, species == 'md')
. This will select
only the rows with mule deer as species. Then sum the
harvest
column in each of your new data sets to get the
total number of each species harvested.
Write the null and research hypothesis to test the claim that there is a difference between the proportions of mule and whitetail deer harvested annually. *Note that there are only two species of deer, mule and whitetail, so if there were no difference in their proportions, there would be a 50/50 split in their harvest totals.
To perform the hypothesis test from Exercise 1, we want to use the
command prop.test
. This function takes a handful of
arguments:
x
is a vector of counts of successes.
n
is a vector of counts of trials.
p
is a vector of probabilities of success. Used to
specify the null proportion.
alternative
specifies the type of hypothesis test.
Options are "two.sided"
, "greater"
, or
"less"
.
conf.level
specifies the confidence level. Notice
that when performing a two-sided hypothesis test, the significance level
is 1 - conf.level
.
correct
is either TRUE
or
FALSE
indicating whether Yates’ continuity correction
should be applied where possible. We will usually set this equal to
FALSE
The code chunk below is a template to help you perform the hypothesis
test. You can copy and paste this chunk into your lab report. Every
entry inside of a list/vector (ie inside of c()
) needs to
be separated by a comma.
prop.test(x = , #number of mule deer
n = , #total number of deer harvested,
p = , #enter the null proportion
alternative = '', #what type of hypothesis test are we performing?
conf.level = .95, #this is the standard confidence level
correct = FALSE
)
Run your code chunk to perform this hypothesis test.
What is the \(p\)-value of your test?
Interpret the confidence interval in the context of the proplem using complete sentences.
Interpret the results of your hpothesis test using complete sentences in the context of the problem.
In this section we want to identify the hunting districts in Montana
with the best deer harvests. We will get practice using
summarise
and group_by
. The code chunk below
will help you construct the summary we will use in the rest of the
activity.
<- deer %>%
byDistrict group_by(district) %>% #all aggregate functions (eg sum()) will group by district
summarize(total = sum(harvest), #this will add the number of wt to the number of md in each district
bucks = #enter code to find the number of bucks in each district
smallBucks = #find the number of small bucks in each district
bigBucks = #find the number of big bucks in each district
propBucks = #find the proportion of bucks harvested in each district
propBigBucks = # find the proportion large bucks out of all bucks in the district
)
As a first step, complete the summary table in the above code
chunk, then evaluate your code chunk. The resulting table should have
one row for each district and seven column. Explore the results with
view(byDistrict)
.
What district saw the largest total harvest of deer in 2020? What district had the smallest? Look at the map of districts. Does geography help explain why these districts’ harvest totals?
The Whitefish Range is mostly in District 110. Based on the data
in our byDistrict
table, does this seem like an efficient
district for harvesting a deer? Explain your answer.
Next we want to see if there is a relationship between the proportion of bucks harvested in a district, and the proportion of those bucks that are considered big.
Start a new code chunk and use it to make a scatter plot with
propBucks
on the horizontal axis and
probBigBucks
on the vertical axis.
Describe any trend you notice in your scatter plot. Be sure to mention the strength of this trend.
If you noticed a trend, try to explain why that trend may be there.
In the final section of this activity, we want to determine if the harvested big bucks are evenly spread around Northwest Montana. More specifically, we will attempt to determine if the harvest big bucks are evenly distributed around the region.
Montana FWP divides the state into regions, then divides each region into districts. There are 7 regions in the state and Northwest Montana is Region 1. The first digit in a hunting districts’ code represents the region.
The district
column of our data set
byDistrict
is an integer, so we can use subset
along with an inequality ( ie <
, >
,
<=
, or >=
) to select only the districts
in Region 1.
byDistrict
so that the result only contains the districts
in Region 1. Name the resulting data set region1
.Now we’re all set to test to check to see if the big bucks are evenly distributed throughout Northwest Montana, by district. To do this, we’ll use a \(\chi^2\) goodness of fit test. The null hypothesis will be the notion that the big bucks are evenly distributed throughout the region. The research hypothesis will be that they are not evenly distributed.
There are 17 hunting districts in Region 1. Let \(p_i\) denote the proportion of all harvested big bucks in the \(i^{th}\) district, listed in ascending order. Then \(H_0\) and \(H_A\) are
\(H_0\): \(p_1 = P_2 = \cdots = p_{17} = 1/17\)
\(H_A\): at least one proportion is different from \(1/17\).
We will perform this test with the function chisq.test
.
When performing a goodness of fit test, we only need 3 inputs
x
: a vector of counts
p
: a list of proportions with the same length as
x
correct
: this will be set equal to
FALSE
to be consistent with what we’ve done in
class.
chisq.test( x = region1$bigBucks,
p = rep( 1/17, 17),
correct = FALSE
)
##
## Chi-squared test for given probabilities
##
## data: region1$bigBucks
## X-squared = 1791.7, df = 16, p-value < 2.2e-16
From this test, we should clearly reject the null and accept the research, since the \(p\)-value is quite small. In other words, the 2020 data provide compelling evidence to believe that the big bucks are not equally distributed among the districts in Region 1.
This result is no surprise. There are a few district with very low
overall harvest numbers. In the console, run view(region1)
.
Notice that districts 141 and 150 have very low total harvest numbers
compared to the other districts.
Similarly, there are districts that have quite large harvest totals. There is very likely an association between total harvest and the number of big bucks harvested. This association is clearly exemplified in the scatter plot below
<- ggplot(region1, aes(x = total, y = bigBucks)) +
harvestPlot geom_point(color = "steelblue") +
labs(x = "Total harvest",
y = "Big buck harvest",
title = "Region 1: harvest vs buck buck totals")
harvestPlot
So, while our first test did tell us something, it didn’t tell us anything very surprising: the more deer that are harvest in a district, the more big bucks are harvested in that district.
Instead, we may be more interested in determining if the categorical variables “hunting district” and “buck size” are dependent. We can test this using a \(\chi^2\) test of independence.
To be safe, we’ll throw out districts 141 and 150. Copy and paste this code into your own code chunk.
<- subset(region1, total>100 ) region1
To perform the test of independence, we only need to input the
correct “matrix” of numbers. Remember that each row in
region 1
represents a district. The columns of interest are
smallBucks
and bigBucks
, which are the 4th and
5th columns of region1
. Thus region1[,4:5]
is
the matrix of numbers we’re looking for!
chisq.test( x = # the data in question goes here ,
correct = FALSE)
Copy this code into your own code chunk, complete it, then run your code chunk.
What can we infer, if anything, from the results of this test?
The results may be statistically significant, but are they practically significant? In other words, do you think you could use these calculations to effectively plan a more fruitful hunt, assuming the goal of the hunt is to harvest a big buck?
This Markdown template was taken from the Openintro Stats Labs.