Sample Quiz 2

Formulas

You are not allowed to bring any notes into the exam.
The following formulas will be provided. To use these formulas properly, you need to make appropriate definitions of the necessary quantities.

(1)\(\quad\quad\mathbf{b}=\big(\mathbb{X}^\top \mathbb{X} \big)^{-1}\, \mathbb{X}^\top\mathbf{y}\)

(2)\(\quad\quad{\mathrm{Var}}(\hat{\mathbf{\beta}})= \sigma^2 \big( \mathbb{X}^\top \mathbb{X} \big)^{-1}\)

(3)\(\quad\quad{\mathrm{Var}}(\mathbb{A}\mathbf{Y})=\mathbb{A}{\mathrm{Var}}(\mathbf{Y})\mathbb{A}^\top\)

(4)\(\quad\quad{\mathrm{Var}}(X)={\mathrm{E}}\big[ (X-{\mathrm{E}}[X])^2\big] = {\mathrm{E}}[X^2]-\big({\mathrm{E}}[X]\big)^2\)

(5)\(\quad\quad{\mathrm{Cov}}(X,Y)={\mathrm{E}}\big[ \big(X-{\mathrm{E}}[X]\big)\big(Y-{\mathrm{E}}[Y]\big)\big] = {\mathrm{E}}[XY]-{\mathrm{E}}[X]\,{\mathrm{E}}[Y]\)

(6)\(\quad\quad\) The binomial (\(n,p\)) distribution has mean \(np\) and variance \(np(1-p)\).

From ?pnorm:

pnorm(q, mean = 0, sd = 1)
qnorm(p, mean = 0, sd = 1)
q: vector of quantiles.
p: vector of probabilities.

Q1. Calculating mean and variance, and making a normal approximation

Let \(X_1,X_2,\dots,X_n\) be independent random variables, each following the Uniform(0,1) distribution. Find the mean and variance of \(X_1\). Use this to find the mean and variance of \(\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\). Now suppose \(n=30\) and suppose that \(\bar X\) is well approximated by a normal distribution. Find \({\mathrm{P}}(0.45<\bar X<0.55)\). Write your answer as a call to pnorm(). Your call to pnorm() may include any numerical calculations that you can’t work out without access to a computer or calculator.

Solution:

\(\mathbb{E}(X_1)=\int_0^1 x dx = 1/2\)

\(\mathbb{E}(X_1^2)=\int_0^1 x^2 dx = 1/3\)

\(Var(X_1)=\mathbb{E}(X_1^2)-(\mathbb{E}(X_1))^2=1/12\)

Thus, \(\mathbb{E}(\bar{X})=\mathbb{E}(\frac{1}{n}\sum_{i=1}^n X_i)=\frac{1}{n}\sum_{i=1}^n \mathbb{E}X_i=\mathbb{E}X_i=1/2\)

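Because the \(X_i\) are independent, the variance of the sum is the sum of the variances, so \(Var(\bar{X})=Var(\frac{1}{n}\sum_{i=1}^n X_i)=\frac{1}{n^2}Var(\sum_{i=1}^n X_i)=\frac{1}{n^2}\sum_{i=1}^n Var(X_i)=\frac{1}{n}Var(X_1)=\frac{1}{12n}\)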

With \(n=30\), \({\mathrm{P}}(0.45<\bar X<0.55)\) is approximately

pnorm(0.55, mean = 1/2, sd = sqrt(1/(12*30))) - pnorm(0.45, mean = 1/2, sd = sqrt(1/(12*30)))
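The pnorm() answer can be evaluated and compared with a simulation. This is a minimal sketch, not part of the expected exam answer; the seed, the number of replications, and the names p_approx and p_sim are assumptions made for illustration.

```r
n <- 30
mu <- 1/2                        # E(X-bar)
sd_xbar <- sqrt(1 / (12 * n))    # sqrt(Var(X-bar)) = sqrt(1/(12n))

# Normal approximation to P(0.45 < X-bar < 0.55)
p_approx <- pnorm(0.55, mean = mu, sd = sd_xbar) -
  pnorm(0.45, mean = mu, sd = sd_xbar)

# Simulation check: draw many sample means of 30 Uniform(0,1) variables
set.seed(1)                      # arbitrary seed, for reproducibility only
xbar <- replicate(10000, mean(runif(n)))
p_sim <- mean(xbar > 0.45 & xbar < 0.55)

p_approx
p_sim
```

The two values should agree closely, which supports the normal approximation at \(n=30\).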

Q2. Comparing means using linear model

Let’s consider the crabs data set we studied in lab. Recall that species (sp) is a factor with 2 levels, Blue (B) and Orange (O). We want to study the difference in frontal lobe size (FL) between the two species.

library(MASS)
data(crabs)

head(crabs)

##   sp sex index   FL  RW   CL   CW  BD
## 1  B   M     1  8.1 6.7 16.1 19.0 7.0
## 2  B   M     2  8.8 7.7 18.1 20.8 7.4
## 3  B   M     3  9.2 7.8 19.0 22.4 7.7
## 4  B   M     4  9.6 7.9 20.1 23.1 8.2
## 5  B   M     5  9.8 8.0 20.3 23.0 8.2
## 6  B   M     6 10.8 9.0 23.0 26.5 9.8

Consider the model \(Y_i = \mu_1x_{Bi} + \mu_2x_{Oi} + \epsilon_i\), \(i=1,...,200\). \(Y_i\) is the FL of observation i. \(x_{Bi}\) is 1 if \(sp=B\) for observation i and 0 otherwise. Similarly, \(x_{Oi}\) is 1 if \(sp=O\) for observation i and 0 otherwise. The \(\epsilon_i\) are i.i.d. with mean 0 and variance \(\sigma^2\). This model is fitted in R using the lm() function and the summary is provided below.

lm_crab <- lm(FL~sp-1, data=crabs)
summary(lm_crab)

## 
## Call:
## lm(formula = FL ~ sp - 1, data = crabs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.010 -2.410  0.390  2.169  7.244 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## spB   14.056      0.315   44.62   <2e-16 ***
## spO   17.110      0.315   54.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.15 on 198 degrees of freedom
## Multiple R-squared:  0.9615, Adjusted R-squared:  0.9611 
## F-statistic:  2470 on 2 and 198 DF,  p-value: < 2.2e-16

(a). What do \(\mu_1\) and \(\mu_2\) measure in the above probability model?

Solution:

\(\mu_1\) is the population mean frontal lobe size for blue crabs. \(\mu_2\) is the population mean frontal lobe size for orange crabs.
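One way to check this interpretation is to compare the fitted coefficients with the per-species sample means, which coincide for a no-intercept model with a single factor. This is a minimal sketch assuming the MASS package is installed; the name group_means is introduced here for illustration.

```r
library(MASS)  # provides the crabs data set

# Cell-means model: one coefficient per species, no intercept
lm_crab <- lm(FL ~ sp - 1, data = crabs)

# Sample mean of FL within each species (illustrative name)
group_means <- tapply(crabs$FL, crabs$sp, mean)

coef(lm_crab)   # estimates of mu_1 (spB) and mu_2 (spO)
group_means     # per-species sample means, for comparison
```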

(b). Build a 95% confidence interval for \(\mu_1\) using a normal approximation.

Solution:

\((14.056-1.96*0.315,14.056+1.96*0.315)=(13.44,14.67)\)
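The same interval can be reproduced from the fitted model. This is a sketch assuming the MASS package is installed; the names est, se and ci_normal are illustrative. It uses qnorm(0.975) in place of the rounded value 1.96.

```r
library(MASS)
lm_crab <- lm(FL ~ sp - 1, data = crabs)

# Estimate and standard error for mu_1, read from the coefficient table
coef_table <- coef(summary(lm_crab))
est <- coef_table["spB", "Estimate"]
se  <- coef_table["spB", "Std. Error"]

# 95% confidence interval using the normal approximation
ci_normal <- est + c(-1, 1) * qnorm(0.975) * se
ci_normal

# For comparison: confint() uses a t quantile with 198 degrees of freedom,
# so its interval is very slightly wider
confint(lm_crab, "spB", level = 0.95)
```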

(c). Recall from homework that the full estimated covariance matrix of \(\mathbf{\hat{\mu}} = (\hat{\mu}_1,\hat{\mu}_2)\) can be found by

V <- summary(lm_crab)$cov.unscaled * summary(lm_crab)$s^2
V

##            spB        spO
## spB 0.09923719 0.00000000
## spO 0.00000000 0.09923719

Use V and the information provided in summary(lm_crab) to build a 95% confidence interval for \(\mu_1-\mu_2\).

Solution:

Let \(a = (1,-1)^T\).

\(Var(a^T\mathbf{\hat{\mu}})=a^TVar(\mathbf{\hat{\mu}})a=a^TVa=0.198\)

\(\hat{\mu}_1-\hat{\mu}_2=14.056-17.110=-3.054\)

Thus we have the 95% C.I. \((-3.054-1.96*\sqrt{0.198},-3.054+1.96*\sqrt{0.198})=(-3.926,-2.182)\)
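The calculation above applies formula (3) with \(\mathbb{A}=a^T\). It can be carried out in R as follows. This is a sketch assuming the MASS package is installed; the names a, diff_hat, se_diff and ci_diff are illustrative.

```r
library(MASS)
lm_crab <- lm(FL ~ sp - 1, data = crabs)

# Estimated covariance matrix of (mu_1-hat, mu_2-hat), as in the question
V <- summary(lm_crab)$cov.unscaled * summary(lm_crab)$s^2

a <- c(1, -1)                            # contrast picking out mu_1 - mu_2
diff_hat <- sum(a * coef(lm_crab))       # a^T mu-hat
se_diff  <- sqrt(drop(t(a) %*% V %*% a)) # sqrt(a^T V a)

# 95% confidence interval using the normal approximation
ci_diff <- diff_hat + c(-1, 1) * qnorm(0.975) * se_diff
ci_diff
```

Because the off-diagonal entries of V are zero, \(a^TVa\) is simply the sum of the two diagonal entries.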

Q3. Making and interpreting an F-test

Consider the birth weight data set we have seen in lab. For this question, we will look at the columns bwt (birth weight), lwt (mother’s weight), age (mother’s age) and race (mother’s race, 1 for white, 2 for black and 3 for other).

library(MASS)
data(birthwt)

head(birthwt)

##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

We want to study the relationship between bwt and race using an F test, with lwt and age included as confounding variables. Let the null hypothesis, \(H_0\), be the probability model where bwt is modeled to depend linearly on lwt and age. Let \(H_a\) be the probability model where \(H_0\) is extended to include race as a factor, as fitted in R by

lm_bw <- lm(bwt ~ lwt + age +factor(race), data = birthwt)

The results from summary(lm_bw) and anova(lm_bw) are as follows

summary(lm_bw)

## 
## Call:
## lm(formula = bwt ~ lwt + age + factor(race), data = birthwt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2103.50  -429.68    41.74   486.10  1902.20 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2461.147    314.722   7.820 3.97e-13 ***
## lwt              4.620      1.788   2.584  0.01054 *  
## age              1.299     10.108   0.128  0.89789    
## factor(race)2 -447.615    161.369  -2.774  0.00611 ** 
## factor(race)3 -239.357    115.189  -2.078  0.03910 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 704.9 on 184 degrees of freedom
## Multiple R-squared:  0.08536,    Adjusted R-squared:  0.06548 
## F-statistic: 4.293 on 4 and 184 DF,  p-value: 0.00241

anova(lm_bw)

## Analysis of Variance Table
## 
## Response: bwt
##               Df   Sum Sq Mean Sq F value   Pr(>F)   
## lwt            1  3448639 3448639  6.9398 0.009148 **
## age            1   334183  334183  0.6725 0.413247   
## factor(race)   2  4750632 2375316  4.7799 0.009467 **
## Residuals    184 91436202  496936                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(a). Write out the null and alternative hypotheses of the F test by completely specifying the probability models.

Solution:

Under \(H_0\), we have the model \(Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}+ \epsilon_i\), \(i=1,...,189\), where \(Y_i\) is the birth weight for observation i, \(x_{i1}\) is the mother’s weight and \(x_{i2}\) is the mother’s age for observation i. The \(\epsilon_i\) are i.i.d. with mean 0 and variance \(\sigma^2\).

Under \(H_a\), we have \(Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}+\beta_3 x_{i3} + \beta_4 x_{i4}+ \epsilon_i\), \(i=1,...,189\). \(x_{i3}\) is 1 if the ith mother is black and 0 otherwise. \(x_{i4}\) is 1 if the ith mother is neither white nor black and 0 otherwise. The other symbols are defined as in the model under \(H_0\).

For the F test, we are testing

\(H_0:\beta_3=\beta_4=0\) against \(H_a:\beta_3\neq 0\) or \(\beta_4\neq 0\)
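The indicator variables \(x_{i3}\) and \(x_{i4}\) correspond to the columns that R builds for factor(race). This can be seen by inspecting the design matrix. This is a sketch assuming the MASS package is installed; the name X is illustrative.

```r
library(MASS)
lm_bw <- lm(bwt ~ lwt + age + factor(race), data = birthwt)

# Design matrix used by lm(): intercept, lwt, age, and two race indicators
X <- model.matrix(lm_bw)
colnames(X)
head(X)

# race = 1 (white) is the baseline level, so it has no column of its own
table(birthwt$race)
```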

(b). Interpret the results in anova(lm_bw). Specifically, read the sample test statistic from the R output, give the distribution of the model-generated test statistic under \(H_0\), and explain how the resulting p-value is calculated and interpreted.

Solution:

The sample test statistic is \(f_{obs}=4.7799\), read from the factor(race) row of the table. Under \(H_0\), the model-generated test statistic F follows an F distribution with 2 numerator degrees of freedom and 184 denominator degrees of freedom.

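The p-value is \(\mathbb{P}(F>f_{obs})=0.009467\), the probability under \(H_0\) that the model-generated statistic exceeds the observed value. Since \(0.009467<0.05\), we reject \(H_0\) at the 0.05 level. Namely, after accounting for the mother’s weight and age, the mother’s race has a statistically significant effect on the birth weight of babies.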
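Because factor(race) is the last term in the sequential table, its F statistic is the same as the one obtained by comparing the two nested models directly. The p-value can then be recovered with pf(). This is a sketch assuming the MASS package is installed; the names lm_h0, tab, f_obs and p_value are illustrative.

```r
library(MASS)

# Model under H_0 (no race) and model under H_a (race added as a factor)
lm_h0 <- lm(bwt ~ lwt + age, data = birthwt)
lm_bw <- lm(bwt ~ lwt + age + factor(race), data = birthwt)

# Direct comparison of the nested models
tab <- anova(lm_h0, lm_bw)
tab

# Observed F statistic and its upper-tail probability under F(2, 184)
f_obs <- tab$F[2]
p_value <- pf(f_obs, df1 = 2, df2 = 184, lower.tail = FALSE)
c(f_obs = f_obs, p_value = p_value)
```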

Q4. Prediction using a linear model

We continue with the birthwt data from Q3. Consider the model fitted under \(H_a\). Suppose we are interested in predicting the birth weight of a baby who has a 30-year-old white mother with weight 130.

Specify a row matrix \(\mathbf{x}^*\) so that \(y^*=\mathbf{x}^*\mathbf{b}\) gives the least squares predictor. Use summary(lm_bw) from Q3 to find \(y^*\).

Solution:

\(\mathbf{x}^*=(1,130,30,0,0)\).

Thus \(y^*=2461.147+130*4.620+30*1.299=3100.717\)
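The hand calculation can be checked against the fitted model, both as the product \(\mathbf{x}^*\mathbf{b}\) and with predict(). This is a sketch assuming the MASS package is installed; the names x_star, y_star and new_mother are illustrative. The result differs slightly from the value above because summary() prints rounded coefficients.

```r
library(MASS)
lm_bw <- lm(bwt ~ lwt + age + factor(race), data = birthwt)

# Row vector x*: intercept, lwt = 130, age = 30, and both race indicators 0 (white)
x_star <- c(1, 130, 30, 0, 0)
y_star <- sum(x_star * coef(lm_bw))
y_star

# Same prediction via predict(), supplying the covariates directly
new_mother <- data.frame(lwt = 130, age = 30, race = 1)
predict(lm_bw, newdata = new_mother)
```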