Formulas

(1)\(\quad\quad\mathbf{b}=\big(\mathbb{X}^\top \mathbb{X} \big)^{-1}\, \mathbb{X}^\top\mathbf{y}\)

(2)\(\quad\quad{\mathrm{Var}}(\hat{\boldsymbol{\beta}})= \sigma^2 \big( \mathbb{X}^\top \mathbb{X} \big)^{-1}\)

(3)\(\quad\quad{\mathrm{Var}}(\mathbb{A}\mathbf{Y})=\mathbb{A}{\mathrm{Var}}(\mathbf{Y})\mathbb{A}^\top\)

(4)\(\quad\quad{\mathrm{Var}}(X)={\mathrm{E}}\big[ (X-{\mathrm{E}}[X])^2\big] = {\mathrm{E}}[X^2]-\big({\mathrm{E}}[X]\big)^2\)

(5)\(\quad\quad{\mathrm{Cov}}(X,Y)={\mathrm{E}}\big[ \big(X-{\mathrm{E}}[X]\big)\big(Y-{\mathrm{E}}[Y]\big)\big] = {\mathrm{E}}[XY]-{\mathrm{E}}[X]\,{\mathrm{E}}[Y]\)

(6)\(\quad\quad\) The binomial (\(n,p\)) distribution has mean \(np\) and variance \(np(1-p)\).

From ?pnorm:

pnorm(q, mean = 0, sd = 1)
qnorm(p, mean = 0, sd = 1)
q: vector of quantiles.
p: vector of probabilities.
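
For example, pnorm() returns the probability that a normal random variable is at most q, and qnorm() inverts this:

pnorm(1.96)                   # P(Z <= 1.96) for a standard normal, about 0.975
qnorm(0.025)                  # the 2.5% quantile, about -1.96
pnorm(10, mean = 8, sd = 2)   # P(X <= 10) when X ~ N(8, sd = 2), about 0.84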

Q1. Calculating mean and variance, and making a normal approximation

Let \(X_1,X_2,\dots,X_n\) be independent random variables, each following the Uniform(0,1) distribution. Find the mean and variance of \(X_1\). Use this to find the mean and variance of \(\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\). Now suppose \(n=50\) and suppose that \(\bar X\) is well approximated by a normal distribution. Find \({\mathrm{P}}(0.45<\bar X<0.55)\). Write your answer as a call to pnorm(). Your call to pnorm() may include any numerical expressions that you cannot evaluate without access to a computer or calculator.
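
For reference, a sketch of one possible form of the final pnorm() call, using the facts that a Uniform(0,1) variable has mean 1/2 and variance 1/12, so \(\bar X\) has mean 1/2 and variance \(1/(12n)\):

# sketch: with n = 50, the sd of Xbar is sqrt(1/(12 * 50))
pnorm(0.55, mean = 0.5, sd = sqrt(1/(12 * 50))) -
  pnorm(0.45, mean = 0.5, sd = sqrt(1/(12 * 50)))   # roughly 0.78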

Q2. Comparing means using a linear model

Let’s consider the crabs data set we studied in lab. Recall that species (sp) is a factor with two levels, Blue (B) and Orange (O). We want to study the difference in frontal lobe size (FL) between the two species.

library(MASS)
data(crabs)
head(crabs)
##   sp sex index   FL  RW   CL   CW  BD
## 1  B   M     1  8.1 6.7 16.1 19.0 7.0
## 2  B   M     2  8.8 7.7 18.1 20.8 7.4
## 3  B   M     3  9.2 7.8 19.0 22.4 7.7
## 4  B   M     4  9.6 7.9 20.1 23.1 8.2
## 5  B   M     5  9.8 8.0 20.3 23.0 8.2
## 6  B   M     6 10.8 9.0 23.0 26.5 9.8

Consider the model \(Y_i = \mu_1 x_{Bi} + \mu_2 x_{Oi} + \epsilon_i\), \(i=1,\dots,200\), where \(Y_i\) is the FL of observation \(i\), \(x_{Bi}\) is 1 if sp = B for observation \(i\) and 0 otherwise, and similarly \(x_{Oi}\) is 1 if sp = O for observation \(i\) and 0 otherwise. The \(\epsilon_i\) are i.i.d. with mean 0 and variance \(\sigma^2\). This model is fitted in R using the lm() function, and the summary is given below.

lm_crab <- lm(FL~sp-1, data=crabs)
summary(lm_crab)
## 
## Call:
## lm(formula = FL ~ sp - 1, data = crabs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.010 -2.410  0.390  2.169  7.244 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## spB   14.056      0.315   44.62   <2e-16 ***
## spO   17.110      0.315   54.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.15 on 198 degrees of freedom
## Multiple R-squared:  0.9615, Adjusted R-squared:  0.9611 
## F-statistic:  2470 on 2 and 198 DF,  p-value: < 2.2e-16

(a). What do \(\mu_1\) and \(\mu_2\) measure in the above probability model?

(b). Build a 95% confidence interval for \(\mu_1\) using a normal approximation.
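
A minimal sketch of the calculation for (b), using the spB estimate and standard error reported in the summary above:

# sketch: normal-approximation 95% CI for mu_1 from the spB row
14.056 + c(-1, 1) * qnorm(0.975) * 0.315   # roughly (13.44, 14.67)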

(c). Recall from homework that the full estimated covariance matrix of \(\hat{\boldsymbol{\mu}} = (\hat{\mu}_1,\hat{\mu}_2)\) can be found by

V <- summary(lm_crab)$cov.unscaled * summary(lm_crab)$s^2
V
##            spB        spO
## spB 0.09923719 0.00000000
## spO 0.00000000 0.09923719

Use V and the information provided in summary(lm_crab) to build a 95% confidence interval for \(\mu_1-\mu_2\).
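
A sketch of the calculation, assuming lm_crab and V are defined as above:

# sketch: 95% CI for mu_1 - mu_2 using the estimated covariance matrix V
est <- coef(lm_crab)["spB"] - coef(lm_crab)["spO"]                    # about -3.05
se  <- sqrt(V["spB", "spB"] + V["spO", "spO"] - 2 * V["spB", "spO"])  # about 0.45
est + c(-1, 1) * qnorm(0.975) * se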

Q3. Making and interpreting an F-test

Consider the birth weight data set we have seen in lab. For this question, we will look at the columns bwt (birth weight), lwt (mother’s weight), age (mother’s age), and race (mother’s race: 1 for white, 2 for black, and 3 for other).

library(MASS)
data(birthwt)
head(birthwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

We want to study the relationship between bwt and race using an F test, with lwt and age included as confounding variables. Let the null hypothesis, \(H_0\), be the probability model where bwt is modeled to depend linearly on lwt and age. Let \(H_a\) be the probability model where \(H_0\) is extended to include race as a factor, as fitted in R by

lm_bw <- lm(bwt ~ lwt + age +factor(race), data = birthwt)

The results from summary(lm_bw) and anova(lm_bw) are as follows:

summary(lm_bw)
## 
## Call:
## lm(formula = bwt ~ lwt + age + factor(race), data = birthwt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2103.50  -429.68    41.74   486.10  1902.20 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2461.147    314.722   7.820 3.97e-13 ***
## lwt              4.620      1.788   2.584  0.01054 *  
## age              1.299     10.108   0.128  0.89789    
## factor(race)2 -447.615    161.369  -2.774  0.00611 ** 
## factor(race)3 -239.357    115.189  -2.078  0.03910 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 704.9 on 184 degrees of freedom
## Multiple R-squared:  0.08536,    Adjusted R-squared:  0.06548 
## F-statistic: 4.293 on 4 and 184 DF,  p-value: 0.00241
anova(lm_bw)
## Analysis of Variance Table
## 
## Response: bwt
##               Df   Sum Sq Mean Sq F value   Pr(>F)   
## lwt            1  3448639 3448639  6.9398 0.009148 **
## age            1   334183  334183  0.6725 0.413247   
## factor(race)   2  4750632 2375316  4.7799 0.009467 **
## Residuals    184 91436202  496936                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
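
For reference, the same F test can be obtained as a direct comparison of the two models; a sketch, assuming birthwt is loaded as above:

lm_bw0 <- lm(bwt ~ lwt + age, data = birthwt)   # the model under H_0
anova(lm_bw0, lm_bw)                            # F test for adding factor(race)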

(a). Write out the null and alternative hypotheses of the F test by completely specifying the probability models.

(b). Interpret the results of anova(lm_bw). Specifically, read the sample test statistic from the R output, give the distribution of the model-generated test statistic under \(H_0\), and explain how the resulting p-value is calculated and interpreted.
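
As a sketch, the p-value in the factor(race) row can be reproduced from the F value and its degrees of freedom using pf():

# sketch: upper-tail probability of an F(2, 184) distribution at the observed statistic
pf(4.7799, df1 = 2, df2 = 184, lower.tail = FALSE)   # about 0.0095, matching Pr(>F) above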

Q4. Prediction using a linear model

We continue with the birthwt data from Q3. Consider the model fitted under \(H_a\). Suppose we are interested in predicting the birth weight of a baby whose mother is white, 30 years old, and has weight 130.

Specify a row matrix \(\mathbf{x}^*\) so that \(y^*=\mathbf{x}^*\mathbf{b}\) gives the least squares prediction. Use summary(lm_bw) from Q3 to find \(y^*\).
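
A minimal sketch of the calculation in R, assuming lm_bw from Q3 is available; the entries of \(\mathbf{x}^*\) follow the coefficient order in summary(lm_bw) ((Intercept), lwt, age, factor(race)2, factor(race)3):

# sketch: x* for a 30-year-old white (race = 1) mother with weight 130
xstar <- matrix(c(1, 130, 30, 0, 0), nrow = 1)
ystar <- xstar %*% coef(lm_bw)   # 2461.147 + 4.620*130 + 1.299*30, roughly 3100.7
ystar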