Formulas

(1)\(\quad\quad\mathbf{b}=\big(\mathbb{X}^\top \mathbb{X} \big)^{-1}\, \mathbb{X}^\top\mathbf{y}\)

(2)\(\quad\quad{\mathrm{Var}}(\hat{\boldsymbol{\beta}})= \sigma^2 \big( \mathbb{X}^\top \mathbb{X} \big)^{-1}\)

(3)\(\quad\quad{\mathrm{Var}}(\mathbb{A}\mathbf{Y})=\mathbb{A}{\mathrm{Var}}(\mathbf{Y})\mathbb{A}^\top\)

(4)\(\quad\quad{\mathrm{Var}}(X)={\mathrm{E}}\big[ (X-{\mathrm{E}}[X])^2\big] = {\mathrm{E}}[X^2]-\big({\mathrm{E}}[X]\big)^2\)

(5)\(\quad\quad{\mathrm{Cov}}(X,Y)={\mathrm{E}}\big[ \big(X-{\mathrm{E}}[X]\big)\big(Y-{\mathrm{E}}[Y]\big)\big] = {\mathrm{E}}[XY]-{\mathrm{E}}[X]\,{\mathrm{E}}[Y]\)

(6)\(\quad\quad\) The binomial (\(n,p\)) distribution has mean \(np\) and variance \(np(1-p)\).

From ?pnorm:

pnorm(q, mean = 0, sd = 1)
qnorm(p, mean = 0, sd = 1)
q: vector of quantiles.
p: vector of probabilities.
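
For example, pnorm() returns the probability that a normal random variable is at most q, and qnorm() inverts this:

pnorm(1.96)                   # P(Z <= 1.96) for a standard normal, about 0.975
qnorm(0.025)                  # the 2.5% quantile, about -1.96
pnorm(10, mean = 8, sd = 2)   # P(X <= 10) when X ~ N(8, sd = 2), about 0.84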

Q1. Calculating mean and variance, and making a normal approximation

Let \(X_1,X_2,\dots,X_n\) be independent random variables, each following the Uniform(0,1) distribution. Find the mean and variance of \(X_1\). Use this to find the mean and variance of \(\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\). Now suppose \(n=50\) and suppose that \(\bar X\) is well approximated by a normal distribution. Find \({\mathrm{P}}(0.45<\bar X<0.55)\). Write your answer as a call to pnorm(). Your call to pnorm() may include any numerical expressions that you cannot evaluate without access to a computer or calculator.
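
For reference, a sketch of one possible form of the final pnorm() call, using the facts that a Uniform(0,1) variable has mean 1/2 and variance 1/12, so \(\bar X\) has mean 1/2 and variance \(1/(12n)\):

# sketch: with n = 50, the sd of Xbar is sqrt(1/(12 * 50))
pnorm(0.55, mean = 0.5, sd = sqrt(1/(12 * 50))) -
  pnorm(0.45, mean = 0.5, sd = sqrt(1/(12 * 50)))   # roughly 0.78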

Q2. Comparing means using a linear model

Let’s consider the crabs data set we studied in lab. Recall that species (sp) is a factor with two levels, Blue (B) and Orange (O). We want to study the difference in frontal lobe size (FL) between the two species.

library(MASS)
data(crabs)
head(crabs)
##   sp sex index   FL  RW   CL   CW  BD
## 1  B   M     1  8.1 6.7 16.1 19.0 7.0
## 2  B   M     2  8.8 7.7 18.1 20.8 7.4
## 3  B   M     3  9.2 7.8 19.0 22.4 7.7
## 4  B   M     4  9.6 7.9 20.1 23.1 8.2
## 5  B   M     5  9.8 8.0 20.3 23.0 8.2
## 6  B   M     6 10.8 9.0 23.0 26.5 9.8

Consider the model \(Y_i = \mu_1 x_{Bi} + \mu_2 x_{Oi} + \epsilon_i\), \(i=1,\dots,200\), where \(Y_i\) is the FL of observation \(i\), \(x_{Bi}\) is 1 if sp = B for observation \(i\) and 0 otherwise, and similarly \(x_{Oi}\) is 1 if sp = O for observation \(i\) and 0 otherwise. The \(\epsilon_i\) are i.i.d. with mean 0 and variance \(\sigma^2\). This model is fitted in R using the lm() function, and the summary is given below.

lm_crab <- lm(FL~sp-1, data=crabs)
summary(lm_crab)
## 
## Call:
## lm(formula = FL ~ sp - 1, data = crabs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.010 -2.410  0.390  2.169  7.244 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## spB   14.056      0.315   44.62   <2e-16 ***
## spO   17.110      0.315   54.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.15 on 198 degrees of freedom
## Multiple R-squared:  0.9615, Adjusted R-squared:  0.9611 
## F-statistic:  2470 on 2 and 198 DF,  p-value: < 2.2e-16

(a). What do \(\mu_1\) and \(\mu_2\) measure in the above probability model?

(b). Build a 95% confidence interval for \(\mu_1\) using a normal approximation.
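
A minimal sketch of the calculation for (b), using the spB estimate and standard error reported in the summary above:

# sketch: normal-approximation 95% CI for mu_1 from the spB row
14.056 + c(-1, 1) * qnorm(0.975) * 0.315   # roughly (13.44, 14.67)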

(c). Recall from homework that the full estimated covariance matrix of \(\hat{\boldsymbol{\mu}} = (\hat{\mu}_1,\hat{\mu}_2)\) can be found by

V <- summary(lm_crab)$cov.unscaled * summary(lm_crab)$s^2
V
##            spB        spO
## spB 0.09923719 0.00000000
## spO 0.00000000 0.09923719

Use V and the information provided in summary(lm_crab) to build a 95% confidence interval for \(\mu_1-\mu_2\).
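
A sketch of the calculation, assuming lm_crab and V are defined as above:

# sketch: 95% CI for mu_1 - mu_2 using the estimated covariance matrix V
est <- coef(lm_crab)["spB"] - coef(lm_crab)["spO"]                    # about -3.05
se  <- sqrt(V["spB", "spB"] + V["spO", "spO"] - 2 * V["spB", "spO"])  # about 0.45
est + c(-1, 1) * qnorm(0.975) * se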

Q3. Making and interpreting an F-test

Consider the birth weight data set we have seen in lab. For this question, we will look at the columns bwt (birth weight), lwt (mother’s weight), age (mother’s age), and race (mother’s race: 1 for white, 2 for black, and 3 for other).

library(MASS)
data(birthwt)
head(birthwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

We want to study the relationship between bwt and race using an F test, with lwt and age included as confounding variables. Let the null hypothesis, \(H_0\), be the probability model where bwt is modeled to depend linearly on lwt and age. Let \(H_a\) be the probability model where \(H_0\) is extended to include race as a factor, as fitted in R by

lm_bw <- lm(bwt ~ lwt + age +factor(race), data = birthwt)

The results from summary(lm_bw) and anova(lm_bw) are as follows:

summary(lm_bw)
## 
## Call:
## lm(formula = bwt ~ lwt + age + factor(race), data = birthwt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2103.50  -429.68    41.74   486.10  1902.20 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2461.147    314.722   7.820 3.97e-13 ***
## lwt              4.620      1.788   2.584  0.01054 *  
## age              1.299     10.108   0.128  0.89789    
## factor(race)2 -447.615    161.369  -2.774  0.00611 ** 
## factor(race)3 -239.357    115.189  -2.078  0.03910 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 704.9 on 184 degrees of freedom
## Multiple R-squared:  0.08536,    Adjusted R-squared:  0.06548 
## F-statistic: 4.293 on 4 and 184 DF,  p-value: 0.00241
anova(lm_bw)
## Analysis of Variance Table
## 
## Response: bwt
##               Df   Sum Sq Mean Sq F value   Pr(>F)   
## lwt            1  3448639 3448639  6.9398 0.009148 **
## age            1   334183  334183  0.6725 0.413247   
## factor(race)   2  4750632 2375316  4.7799 0.009467 **
## Residuals    184 91436202  496936                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
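
For reference, the same F test can be obtained as a direct comparison of the two models; a sketch, assuming birthwt is loaded as above:

lm_bw0 <- lm(bwt ~ lwt + age, data = birthwt)   # the model under H_0
anova(lm_bw0, lm_bw)                            # F test for adding factor(race)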

(a). Write out the null and alternative hypotheses of the F test by completely specifying the probability models.

(b). Interpret the results of anova(lm_bw). Specifically, read the sample test statistic from the R output, give the distribution of the model-generated test statistic under \(H_0\), and explain how the resulting p-value is calculated and interpreted.
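
As a sketch, the p-value in the factor(race) row can be reproduced from the F value and its degrees of freedom using pf():

# sketch: upper-tail probability of an F(2, 184) distribution at the observed statistic
pf(4.7799, df1 = 2, df2 = 184, lower.tail = FALSE)   # about 0.0095, matching Pr(>F) above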

Q4. Prediction using a linear model

We continue with the birthwt data from Q3. Consider the model fitted under \(H_a\). Suppose we are interested in predicting the birth weight of a baby whose mother is white, 30 years old, and has weight 130.

Specify a row matrix \(\mathbf{x}^*\) so that \(y^*=\mathbf{x}^*\mathbf{b}\) gives the least squares prediction. Use summary(lm_bw) from Q3 to find \(y^*\).
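
A minimal sketch of the calculation in R, assuming lm_bw from Q3 is available; the entries of \(\mathbf{x}^*\) follow the coefficient order in summary(lm_bw) ((Intercept), lwt, age, factor(race)2, factor(race)3):

# sketch: x* for a 30-year-old white (race = 1) mother with weight 130
xstar <- matrix(c(1, 130, 30, 0, 0), nrow = 1)
ystar <- xstar %*% coef(lm_bw)   # 2461.147 + 4.620*130 + 1.299*30, roughly 3100.7
ystar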