The answer is \(n-m+1\). As a check, when \(m = 1\) the formula gives \(\sum_{i = 1}^{n} 1 = n-1+1 = n\), as expected.
Because \(\bar{x}\) does not depend on \(i\), it can be treated as a constant, so we have \(\sum_{i = 1}^{n} (x_i - \bar{x}) = \sum_{i = 1}^{n} x_i - \sum_{i = 1}^{n}\bar{x} = \sum_{i = 1}^{n} x_i - n\bar{x}\).
However, \(n\bar{x} = n\frac{1}{n}\sum_{i = 1}^{n}x_i = \sum_{i = 1}^{n}x_i\), so the above quantity is zero.
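As a quick numerical check of this identity, here is a minimal sketch in R. The vector `x` is hypothetical example data, not part of the homework. The sum is zero only up to floating-point rounding.

```r
# Hypothetical example data; any numeric vector would do.
x <- c(2.5, 3.1, 4.7, 5.0, 6.2)

# Deviations of each observation from the sample mean.
deviations <- x - mean(x)

# Their sum should be zero, up to floating-point rounding error.
total <- sum(deviations)
abs(total) < 1e-12
```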
\(\textbf{1}^\top\textbf{x} = \begin{bmatrix} 1 & 1 & \dots & 1\end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n\end{bmatrix} = x_1 + x_2 + \dots + x_n = \sum_{i = 1}^{n}x_i\)
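The same identity can be checked in R with matrix multiplication. This is a sketch on a hypothetical vector `x`; `ones` and `inner` are names introduced here for illustration.

```r
# Hypothetical example data.
x <- c(2.5, 3.1, 4.7, 5.0, 6.2)

# A vector of ones with the same length as x.
ones <- rep(1, length(x))

# t(ones) %*% x is a 1 x 1 matrix; as.numeric() converts it to a scalar.
inner <- as.numeric(t(ones) %*% x)

# The inner product agrees with the ordinary sum.
all.equal(inner, sum(x))
```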
First, load the data:
mice <- read.csv("https://ionides.github.io/401f18/hw/hw01/femaleMiceWeights.csv")
The sample mean is given by \(\bar{x} = \frac{1}{12}\sum_{i = 1}^{12}x_i\). The sample standard deviation is given by \(s = \sqrt{\frac{1}{11}\sum_{i = 1}^{12}(x_i - \bar{x})^2}\).
To obtain these values in R, we first subset to the high fat diet:
data_hf = mice[mice$Diet == "hf",2]
We then use the mean and sd functions to obtain the desired values:
sample_mean = mean(data_hf)
sample_sd = sd(data_hf)
sample_mean
## [1] 26.83417
sample_sd
## [1] 4.097606
hist(data_hf, freq = FALSE)
normal.x = seq(from = min(data_hf), to = max(data_hf),length = 100)
normal.y = dnorm(normal.x, mean = sample_mean, sd = sample_sd)
lines(normal.x,normal.y,col = "blue")
It does not appear that the data fit a normal distribution model well. The data are slightly skewed and have a much thicker tail than the fitted normal density.
normal_draws = rnorm(12,sample_mean,sample_sd)
hist(normal_draws, freq = FALSE)
normal.x = seq(from = min(normal_draws), to = max(normal_draws),length = 100)
normal.y = dnorm(normal.x, mean = sample_mean, sd = sample_sd)
lines(normal.x,normal.y,col = "blue")
Even though the data in (c) were actually drawn from a normal distribution, they do not appear to fit a normal distribution model any better than the mice data. If we were to repeat this procedure, we would often see a similar result. This tells us that, based on only the 12 data points we collected, we cannot reasonably conclude that a normal distribution model is inappropriate.
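One way to see this is to repeat the draw several times and compare the histograms side by side. The following is a sketch, not part of the original solution: the seed, the number of repetitions (6), and the name `normal_samples` are arbitrary choices, and the mean and standard deviation are copied from the output above.

```r
set.seed(401)  # arbitrary seed, only for reproducibility

# Values copied from the sample_mean and sample_sd output above.
sample_mean <- 26.83417
sample_sd <- 4.097606

# Each column holds one simulated sample of size 12.
normal_samples <- replicate(6, rnorm(12, sample_mean, sample_sd))

# Plot each sample with the true normal density overlaid.
old_par <- par(mfrow = c(2, 3))
for (i in 1:6) {
  hist(normal_samples[, i], freq = FALSE, main = paste("Normal sample", i))
  curve(dnorm(x, mean = sample_mean, sd = sample_sd), add = TRUE, col = "blue")
}
par(old_par)
```

With only 12 points per panel, the histograms can look quite different from one another even though every sample comes from the same normal distribution.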
The percentage is given by: \(\int_{1}^\infty \frac{1}{\sqrt{2\pi}}\exp\left(\frac{-x^2}{2}\right)\,dx + \int_{-\infty}^{-1} \frac{1}{\sqrt{2\pi}}\exp\left(\frac{-x^2}{2}\right)\,dx\)
We can evaluate this in R using:
1-pnorm(1) + pnorm(-1)
## [1] 0.3173105
Alternatively, we can obtain the same result by symmetry:
2*pnorm(-1)
## [1] 0.3173105
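As a further check, the two tail integrals can be evaluated numerically rather than through pnorm. This is a sketch using R's integrate function on the standard normal density; `tail_upper`, `tail_lower`, and `tail_total` are names introduced here.

```r
# Numerically integrate the standard normal density over each tail.
tail_upper <- integrate(dnorm, lower = 1, upper = Inf)$value
tail_lower <- integrate(dnorm, lower = -Inf, upper = -1)$value

# The total should agree closely with 2 * pnorm(-1).
tail_total <- tail_upper + tail_lower
tail_total
2 * pnorm(-1)
```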
above = sum(data_hf > sample_mean + sample_sd)
below = sum(data_hf < sample_mean - sample_sd)
Then we calculate the percentage more than 1 sample standard deviation away from the sample mean:
(above+below)/12
## [1] 0.3333333
This value is fairly close to the value given under a normal approximation.
my_t <- function(n){
rnorm(1) / sqrt(sum(rnorm(n)^2/n))
}
(a)-(c)
t_draws = replicate(10000,my_t(6))
hist(t_draws, breaks = 15, freq = FALSE)
x = seq(from = min(t_draws), to = max(t_draws),length = 100)
t.y = dt(x, df = 6)
normal.y = dnorm(x, 0, 1)
lines(x,t.y, col = "blue")
lines(x,normal.y, col = "blue",lty = "dashed")
Both densities are unimodal and symmetric; however, the t distribution has thicker tails than the normal distribution.
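The difference in the tails can also be made concrete by comparing tail probabilities. As a sketch, the cutoff of 3 below is an arbitrary choice for illustration, and `t_tail` and `normal_tail` are names introduced here.

```r
# Probability of landing more than 3 units from zero under each distribution.
t_tail <- 2 * pt(-3, df = 6)   # t distribution with 6 degrees of freedom
normal_tail <- 2 * pnorm(-3)   # standard normal

# The t distribution puts more mass in the tails.
t_tail
normal_tail
t_tail > normal_tail
```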
my_F <- function(m,n){
sum(rnorm(m)^2/m) / sum(rnorm(n)^2/n)
}
(a)-(b)
f_draws = replicate(10000,my_F(5,10))
hist(f_draws, breaks = 15, freq = FALSE)
x = seq(from = min(f_draws), to = max(f_draws),length = 100)
f.y = df(x, df1 = 5, df2 = 10)
lines(x,f.y, col = "blue")
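Beyond the visual comparison, one simple numerical check is to compare the simulated mean with the theoretical mean of an F distribution, which is \(n/(n-2)\) for \(n > 2\) denominator degrees of freedom. The sketch below redefines my_F so that it runs on its own; the seed is an arbitrary choice.

```r
set.seed(401)  # arbitrary seed, only for reproducibility

# Same construction as my_F above: a ratio of two scaled chi-square draws.
my_F <- function(m, n) {
  sum(rnorm(m)^2 / m) / sum(rnorm(n)^2 / n)
}

f_draws <- replicate(10000, my_F(5, 10))

# Simulated mean versus the theoretical mean n / (n - 2) with n = 10.
simulated_mean <- mean(f_draws)
theoretical_mean <- 10 / (10 - 2)
simulated_mean
theoretical_mean
```

The two values should be close but not identical, since the simulated mean is subject to Monte Carlo error.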