For your homework this week, write a brief report including the results you are asked to obtain and the R code you used to generate it. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that make a substantial contribution to your report.
Go through the STATS 250 materials on probability and random variables at https://open.umich.edu/find/open-educational-resources/statistics/statistics-250-introduction-statistics-data-analysis. If anything is unclear, spend some time working on figuring it out. Report back on anything that remains elusive.
rnorm()
, pnorm()
and qnorm()
which are basic tools for working with the normal distribution. Look at the documentation provided by typing ?pnorm
.pnorm()
to find \({\mathrm{P}}(Z>2.5)\).1 - pnorm(2.5)
## [1] 0.006209665
qnorm()
to find a value \(\alpha\) such that \({\mathrm{P}}(Z>\alpha)=0.2\).qnorm(1-0.2)
## [1] 0.8416212
qnorm()
and the symmetry of the standard normal distribution to find a value \(\beta\) such that \({\mathrm{P}}(|Z|>\beta) = 0.1\).qnorm(1-0.1/2)
## [1] 1.644854
\(\mathbb{E}(M) = (1+2+3+4+5+6)*\frac{1}{6}=3.5\)
\(Var(M) = \mathbb{E}(M^2) - (\mathbb{E}(M))^2 = (1+4+9+16+25+36)*\frac{1}{6} - (3.5)^2 = 2.92\)
sample()
in R to generate a large sample (say, \(n=1000\)) of realizations from the probability model for a single die. Use this sample to check your calculation by finding the sample mean and sample variance.set.seed(2018)
die = sample(1:6, 1000, replace = T)
mean(die)
## [1] 3.454
var(die)
## [1] 2.986871
\(\mathbb{E}(S) = \mathbb{E}(M) + \mathbb{E}(B) = 2\mathbb{E}(M) = 7\)
\(Var(S) = Var(M) + Var(B) = 2Var(M) = 5.84\)
set.seed(2018)
die1 = sample(1:6, 1000, replace = T)
die2 = sample(1:6, 1000, replace = T)
die_sum = die1 + die2
1 - pnorm(q = 10, mean =7, sd = sqrt(5.84))
## [1] 0.1072274
mean(die_sum>=10)
## [1] 0.167
\({\mathrm{P}}(S\ge 10) = \frac{6}{36} = 0.167\)
We consider a simple randomized experiment where an outcome is measured for each subject, half of which are assigned to get treatment 1, with the other half getting treatment 2. We suppose that individuals picked for treatment 1 are assigned randomly. Suppose the experimenters have 24 subjects at their disposal, perhaps they label them \(\{1,2,\dots,24\}\) and they type sample(1:24,size=12)
in R to choose the subjects assigned to treatment 1. To see whether there is statistical evidence for an effect of the treatment on the outcome, we can work out probabilities under the null hypothesis that there is no effect. A natural test statistic is
\[T=\mbox{mean outcome of subjects getting treatment 2} - \mbox{mean outcome of subjects getting treatment 1}.\]
Large values of this statistic suggest that treatment 2 has an effect of increasing the measured outcome compared to treatment 1. But how large does this statistic have to be to conclude that the observed effect is real, not just the result of chance variation when in fact the treatments are equivalent? Under the null hypothesis that the treatment has no effect, we can work out the distribution of \(T\) by reassigning treaments randomly!
Let’s revisit the mice from Homework 1. Your task is to draw 1000 hypothetical realizations for the statistic \(T\) supposing a model where the mice are re-randomized to treatment 1 (regular chow) and treatment 2 (high fat diet) under the null hypothesis that treatment has no effect. Then, report what proportion of these realizations the statistic \(T\) is larger than the value obtained for the actual experimental assignment. Interpret your result as a statistical hypothesis test addressing a scientific question.
The following R hints may be useful:
Call the dataset mice
. Then, mice$Bodyweight
lists all the outcomes, and sample(mice$Bodyweight,12)
draws twelve of these at random. Suppose these are hypothetically assigned to treatment 1 (regular chow).
Using replicate(sample(mice$Bodyweight,12),n=1000)
allows you to generate n=1000
hypothetical realizations of 12 bodyweights for mice assigned to treatment 1. (These are not the weights of mice assigned to treatment 1 in the actual experiment, but we are working under a hypothesis that their weights would have been the same whichever treatment they got.) You should use a much smaller number of replicates, say n=5
while developing and debugging your solution.
The function apply()
gives a convenient way to find the mean bodyweight for each hypothetical treatment 1 sample.
If you follow this approach, you’ll have to figure out how to compute the value of the \(T\) statistic for these hypothetical samples, since \(T\) includes both treatment 1 and treatment 2. For each hypothetical realization, all the mice that are not in the realization of treatment 1 get assigned treatment 2. Thus, the sum of the means for treatment 1 and treatment 2 is the same for every hypothetical realization.
Recall that sum(x>a)/length(x)
gives the fraction of elements of the vector x
that are larger than a
.
set.seed(2018)
mice = read.csv("https://ionides.github.io/401w18/hw/hw01/femaleMiceWeights.csv")
T = mean(mice$Bodyweight[mice$Diet=="hf"]) - mean(mice$Bodyweight[mice$Diet=="chow"])
group1 = replicate(sample(mice$Bodyweight,12),n=1000)
group1_sum = apply(group1, 2, sum)
group2_sum = sum(mice$Bodyweight) - group1_sum
T_permutation = (group2_sum - group1_sum)/12
sum(T_permutation > T)/length(T_permutation)
## [1] 0.03
Since p-value=0.03<0.05, we reject the null hypothesis at 0.05 level. Namely we have enough evidence to say the bodyweight of mice using high fat diet is larger than he bodyweight of mice using regular chow on average.
License: This material is provided under an MIT license
Acknowledgement: The randomized experiment draws on material from from https://genomicsclass.github.io/book