In this activity, you will analyze a few Montana scratch lottery games using random variables. Imagine that you’re going to spend $60 per week on scratch lottery tickets. Montana scratch tickets cost $1, $2, $3, $5, $10, or $20.
Question: Does it matter which ticket(s) you buy if you’re going to spend $60 per week on lottery tickets? How does the cost of a lottery ticket relate to its expected winning?
In this lab, we hope to:
Increase your familiarity with RStudio. In particular, we will work on loading and manipulating data sets.
Increase your familiarity and comfort with random variables.
Fire up RStudio. We’ll use the tidyverse and openintro packages in this lab, so let’s first load these packages from the console.
library(tidyverse)
library(openintro)We will again use R Markdown to create a lab report. In RStudio, go
to New File -> R Markdown. Then, choose “From Template” and select
Lab Report for OpenIntro Statistics Labs.
For a bit of additional assistance, check out this video created by OpenIntro.
Note: For each exercise, you should have a subsection that begins
with ### Exercise followed by the exercise number and a
completely blank like. As an example,
[End of Exercise 1 work.]
### Exercise 2
[Your work here.]
Any time you want to make calculations using R in your Markdown
document, remember that you can create a code chunk by writing
```{r ChodeChunkName} on one line, followed by your code on
a new line, then close the code chunk with ``` on another
new line.
As a first step, check out the Montana scratch tickets. The data set we’ll use was scrapped from the internet in October 2021. The games you see may be different, from those in the data set, but the ideas will be the same.
Next, we need to load the data into RStudio. Start an R code chunk in your Markdown document, then copy and paste the following into your code chunk:
url <- "https://raw.githubusercontent.com/ckaterba/ScratchLottoScrape/master/outputCSV/allScratchGames.csv"
if(!file.exists("allScratchGames.csv")){
download.file(url, "allScratchGames.csv")
}
df <- as_tibble(read.csv("allScratchGames.csv"))Run the code chunk by clicking the green triangle in the top right corner of the chunk. This code chunk downloads a CSV (comma separated value) file with all of the Montana scratch lotto games, their prices, payouts, and probabilities of all payouts.
This is a three part question:
Look at a one dollar, ten dollar, and a twenty dollar ticket. if you were going to buy $20 worth of tickets, only buying tickets for these games, what would you do? Note: you don’t have to make any calculations yet, but explain your answer.
The game “Baby, It’s Cold Outside!” has two payouts of $10 in the dataset. Why is this? Looking at the odds tables on the Montana Scratch Lotto home page may be helpful.
How many scratch lotto tickets are in the dataset? Hint:
The command length gives you the length of a list. The
command unique returns a list all of the unique values in a
list. So if our list were
x <- c('a', 'a', 'a' 'b', 'b', 'c'),
length(unique(x)) = 3 is the number of unique values in
x.
Recall that we’re investigating the expected winnings of scratch
lotto tickets. In particular, we want to see if cost per ticket matters
if we were to spend $60 a week on tickets. To do so, we’re going to
learn about four features of RStudio that are very convenient for data
analysis. These features are piping, the mutate function,
the group_by function, and the summarize
function.
Piping and pipes are a code
formatting tool that help you clearly express a sequence of multiple
data manipulation operations. The pipe, symbolically, is
%>%.
mutate is a function that adds new variables to a
data set and preserves the existing ones.
group_by is a function that creates a grouped data
set, where you group variables by the common values. We will typically
use this function before using the summarize
function.
summarize creates a new data frame from an existing
one with one row for each combination of grouping variables. The entries
in each row will by some type of summary (specified by you!) of the
variables in each group.
This might sound quite general and/or vague, so it may be easier to get a handle on what these functions do by looking at a small toy example.
#Creating a small example
toy <- tibble( name = rep(letters[1:3], length = 9),
value = 1:9,
prob = c(.1, .2, .3, .3, .2, .1, .6, .6, .6 )) %>%
arrange(name)
toy## # A tibble: 9 × 3
## name value prob
## <chr> <int> <dbl>
## 1 a 1 0.1
## 2 a 4 0.3
## 3 a 7 0.6
## 4 b 2 0.2
## 5 b 5 0.2
## 6 b 8 0.6
## 7 c 3 0.3
## 8 c 6 0.1
## 9 c 9 0.6
Our data set toy has 3 values in the name
variable, a, b, and c, with three
entries for each name.
Suppose we want to square all the value variables. The
mutate function gives us an easy way to do that. Notice in
the example below, we define a new variable name, sq_value,
then set that equal to what we want to calculate.
toy %>% #this is our first example of piping! this means do whatever follows to the data frame toy
mutate(sq_value = value^2)## # A tibble: 9 × 4
## name value prob sq_value
## <chr> <int> <dbl> <dbl>
## 1 a 1 0.1 1
## 2 a 4 0.3 16
## 3 a 7 0.6 49
## 4 b 2 0.2 4
## 5 b 5 0.2 25
## 6 b 8 0.6 64
## 7 c 3 0.3 9
## 8 c 6 0.1 36
## 9 c 9 0.6 81
What if we wanted to add up value for each value of the
name variable? The functions group_by and
summarize are a succinct method for this:
toy %>%
group_by(name) %>%
summarize(valSum = sum(value))## # A tibble: 3 × 2
## name valSum
## <chr> <int>
## 1 a 12
## 2 b 15
## 3 c 18
CAUTION! We have not changed the toy
data set at all, since we didn’t define it to be the mutated or
summarized data frame!
toy## # A tibble: 9 × 3
## name value prob
## <chr> <int> <dbl>
## 1 a 1 0.1
## 2 a 4 0.3
## 3 a 7 0.6
## 4 b 2 0.2
## 5 b 5 0.2
## 6 b 8 0.6
## 7 c 3 0.3
## 8 c 6 0.1
## 9 c 9 0.6
If we wanted to store the summarized data frame, we’d have to give it a name:
toySummary <- toy %>%
group_by(name) %>%
summarize(valSum = sum(value))
toySummary## # A tibble: 3 × 2
## name valSum
## <chr> <int>
## 1 a 12
## 2 b 15
## 3 c 18
We want to calculate the expected winnings and standard
deviation of spending $60 on each type of Montana Scratch Lotto ticket
and will use mutate and summarize to do this
for us. First, we need to calculate the expected value of a single
ticket.
Recall that the expected value of a discrete random variable \(X\) is \[ E(X ) = \sum_{i = 1 }^k x_i \cdot P(X = x_i) \]
Here is an example of how one can calculate the expected value of a
random variable in R using our toy example
from above as a model. Since we need to use the expected value to
calculate the variance, we want to store the expected value of each
name in each row of toy. This can be
accomplished by using group_by and mutate
both.
toy <- toy %>%
group_by(name) %>%
mutate(exp_val = sum(value*prob))
toy## # A tibble: 9 × 4
## # Groups: name [3]
## name value prob exp_val
## <chr> <int> <dbl> <dbl>
## 1 a 1 0.1 5.5
## 2 a 4 0.3 5.5
## 3 a 7 0.6 5.5
## 4 b 2 0.2 6.2
## 5 b 5 0.2 6.2
## 6 b 8 0.6 6.2
## 7 c 3 0.3 6.9
## 8 c 6 0.1 6.9
## 9 c 9 0.6 6.9
Notice that in each row with name = a, the expected
value is exp_val = 5.5 and similarly for
name = b or c.
For the scratch lotto tickets, we want to calculate the expected winnings of a ticket. Recall that winnings are equal to payout minus price.
For Exercise 2, copy and paste code above into a new code chunk in your lab report.
df <- df %>%
group_by(#your work here# ) %>%
mutate( Winnings = # your work here # ) %>%
mutate( exp_val = # your work here # )#your work here# and replacing it with your
work.df using the mutate
function to create a variable called Winnings (remember,
winnings are payout minus price).df using mutate
to create a new variable called exp_val that is the
sum of Winnings multiplied by
Probabilty.subset command will help. We’re subsetting the data set
df by the variable Name. Be sure to use
== inside of the subset function.Now that we have the expected value for a single ticket, we can calculate the variance and standard deviation for a single ticket. Recall that for a discrete random variable \(X\), the variance and standard deviation are:
\[ V(X ) = \sum_{i = 1 }^k (x_i - E(X))^2\cdot P(X = x_i) \quad \text{and} \quad \text{SD}(X) = \sqrt{V(X)}\]
As above, we’ll use the toy example to calculate both of
these values. Note that in the example below toy is already
grouped by name so we don’t have to add that in again.
toy <- toy %>%
mutate(var = sum( prob*(value - exp_val)^2)) %>%
mutate(sd = sqrt(var))
toy## # A tibble: 9 × 6
## # Groups: name [3]
## name value prob exp_val var sd
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 a 1 0.1 5.5 4.05 2.01
## 2 a 4 0.3 5.5 4.05 2.01
## 3 a 7 0.6 5.5 4.05 2.01
## 4 b 2 0.2 6.2 5.76 2.4
## 5 b 5 0.2 6.2 5.76 2.4
## 6 b 8 0.6 6.2 5.76 2.4
## 7 c 3 0.3 6.9 7.29 2.7
## 8 c 6 0.1 6.9 7.29 2.7
## 9 c 9 0.6 6.9 7.29 2.7
df <- df %>%
mutate(var = #your work here# ) %>%
mutate(sd = #your work here# ) df called var that
calculates the variance of the winnings using mutate,
modeling your work on the example above.df called sd that
calculates the standard deviation of the winnings, moeling your work on
the example above.At this point, we’ve calculated the expected value, variance, and standard deviation for every scratch lotto ticket, but the data set we’re using is quite large since we have one row for every possible outcome of every possible ticket. Our goal in this section is to tidy up our dataset. For each scratch lotto game, we want 1 row. In that row we want
NamePriceexp_val, the expected winnings of a single ticketvar, the variance of the winnings for a single
ticketsd, the standard deviation of the winnings for a
ticketWe’ve already constructed these variables, and they are constant for
each scratch lotto game, so we can easily use the summarize
function to compress df into a much tidier dataset.
We’ll use toy as an example of what we’ll do with
df
toySummary <- toy %>%
group_by(name) %>%
summarise(exp_val = mean(exp_val),
var = mean(var),
sd = mean(sd))
toySummary## # A tibble: 3 × 4
## name exp_val var sd
## <chr> <dbl> <dbl> <dbl>
## 1 a 5.5 4.05 2.01
## 2 b 6.2 5.76 2.4
## 3 c 6.9 7.29 2.7
It might seem strange to take the average of exp_val,
var, and sd, but since these values are
constant for a, b, and c, the
average just returns the appropriate value (ie the expected value,
variance, and standard deviation of ‘a’). Note that you can use any
summary stat here (mean, median, mode, max, min, etc.) and it will
return the correct value.
The following code will construct the summary dataset that we hope for, and arranges the scratch tickets first by price, then by expected value. This means all the tickets that cost $1 are grouped together, sorted by increasing expected value.
scratchSummary <- df %>%
group_by(Name) %>%
summarise(Price = mean(Price),
exp_val = mean(exp_val),
var = mean(var),
sd = mean(sd)) %>%
arrange(Price, exp_val)First, create a new code chunk in Exercise 4. Then copy and paste the code above into your new code chunk. Evaluate all code chunks above this one by clicking the grey down pointing triangle with a green line under it. Then run this code chunk by clicking the right pointing green triangle.
In the console type view(scratchSummary).
This will bring up a new window displaying our new summary of all the
scratch tickets.
Which one dollar ticket has the highest (ie least negative) expected value?
Comparing the expected value of tickets that cost different amounts of money may not be the best comparison. If you spend $20 on a single ticket, of course you’re going to lose more money on average than if you only spent $1.
In this section, we’ll analyze the expected value and standard deviation of purchasing $60 worth of each type of scratch lotto ticket. Note that we’re looking at $60 because it is the least common multiple of the prices of all tickets.
As a first step, we need to determine how many of each ticket we
should buy. We will add this column to our scratchSummary
data set. To calculate this number, we should divide 60 by the
price.
scratchSummary <- scratchSummary %>%
mutate(num_tix = # your work here # )Create a new code chunk in your lab report and use the code above as a template to calculate the number of tickets of each type we need to buy.
Now we want to calculate the expected value of $60 worth of each type
of scratch lotto ticket. Observe that purchasing $60 worth of tickets is
playing the same game num_tix number of times. Thus we want
to calculate the expected value of \[ X + X +
\cdots + X\] where \(X\)
represents a single ticket and there are num_tix \(X\)’s in the sum. For instance if we’re
looking at a $20 game, we need to calculate the expected value of \(X + X + X\). Using what we learned in
Chapter 3, this is:
\[ E(X + X + X ) = E(X) + E(X) + E(X) = 3E(X).\]
Thus, we can calculate the expected value of $60 dollars worth of
tickets by multiplying exp_val by num_tix
Similarly, the if \(X\) is the random variable denoting the winnings of a single $20 ticket, we can calculate the variance of \(X + X + X\) as
\[V(X + X + X ) = V(X) + V(X) + V(X) = 3V(X)\]
This can be done by multiplying var by
num_tix.
Finally, the standard deviation, as usual, is just the square root of the variance.
The code below will help you add these columns to
scratchSummary
scratchSummary <- scratchSummary %>%
mutate(exp_val_60 = #your work here# ) %>%
mutate(var_60 = #your work here# ) %>%
mutate(sd_60 = #your work here #) %>%
arrange( #your work here# )arrange function to order the data set by
exp_val_60exp_val_60 vs
Price. Price should be on the horizontal axis. You may want
to review the first activity for assistance in making the plot. Describe
any trend you observe.sd_60 vs
Price. Describe any trend you observe.This Markdown template was taken from the Openintro Stats Labs.