Formulas

(1)\(\quad\quad\mathbf{b}=\big(\mathbb{X}^\top \mathbb{X} \big)^{-1}\, \mathbb{X}^\top\mathbf{y}\)

(2)\(\quad\quad{\mathrm{Var}}(\hat{\mathbf{\beta}})= \sigma^2 \big( \mathbb{X}^\top \mathbb{X} \big)^{-1}\)

(3)\(\quad\quad{\mathrm{Var}}(\mathbb{A}\mathbf{Y})=\mathbb{A}{\mathrm{Var}}(\mathbf{Y})\mathbb{A}^\top\)

(4)\(\quad\quad{\mathrm{Var}}(X)={\mathrm{E}}\big[ (X-{\mathrm{E}}[X])^2\big] = {\mathrm{E}}[X^2]-\big({\mathrm{E}}[X]\big)^2\)

(5)\(\quad\quad{\mathrm{Cov}}(X,Y)={\mathrm{E}}\big[ \big(X-{\mathrm{E}}[X]\big)\big(Y-{\mathrm{E}}[Y]\big)\big] = {\mathrm{E}}[XY]-{\mathrm{E}}[X]\,{\mathrm{E}}[Y]\)

(6)\(\quad\quad\) The binomial (\(n,p\)) distribution has mean \(np\) and variance \(np(1-p)\).
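Formulas (1) and (2) can be checked numerically against lm(). The following is a minimal sketch on synthetic data; the data, the seed, and the variable names are made up for illustration, and \(\sigma^2\) in (2) is replaced by its estimate from the fitted model.

```r
# Synthetic data (hypothetical, for illustration only)
set.seed(1)
n <- 50
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)

X <- cbind(1, x)                 # design matrix with an intercept column
XtX_inv <- solve(t(X) %*% X)     # (X^T X)^{-1}
b <- XtX_inv %*% t(X) %*% y      # formula (1)

fit <- lm(y ~ x)
s2 <- summary(fit)$sigma^2       # estimate of sigma^2
V <- s2 * XtX_inv                # formula (2) with sigma^2 replaced by its estimate

max(abs(b - coef(fit)))          # should be numerically zero
max(abs(V - vcov(fit)))          # should be numerically zero
```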

From ?pnorm:

pnorm(q, mean = 0, sd = 1)
qnorm(p, mean = 0, sd = 1)
q: vector of quantiles.
p: vector of probabilities.
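
As a quick illustration of how these two functions relate, pnorm() returns a lower-tail probability and qnorm() inverts it. The variable names below are chosen only for this sketch.

```r
p_upper <- pnorm(1.96)     # P(Z <= 1.96) for a standard normal Z, about 0.975
z_upper <- qnorm(0.975)    # the 0.975 quantile of N(0, 1), about 1.96

# qnorm() undoes pnorm() when the same mean and sd are supplied
round_trip <- qnorm(pnorm(0.3, mean = 1, sd = 2), mean = 1, sd = 2)
c(p_upper, z_upper, round_trip)
```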

Q1. Calculating mean and variance, and making a normal approximation

Let \(X_1,X_2,\dots,X_n\) be independent random variables, each following the Uniform(0,1) distribution. Find the mean and variance of \(X_1\). Use this to find the mean and variance of \(\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\). Now suppose \(n=30\) and suppose that \(\bar X\) is well approximated by a normal distribution. Find \({\mathrm{P}}(0.45<\bar X<0.55)\). Write your answer as a call to pnorm(). Your call to pnorm() may include any numerical calculations that you can’t work out without access to a computer or calculator.

Solution:

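\(\mathbb{E}(X_1)=\int_0^1 x \, dx = 1/2\)

\(\mathbb{E}(X_1^2)=\int_0^1 x^2 \, dx = 1/3\)

\({\mathrm{Var}}(X_1)=\mathbb{E}(X_1^2)-(\mathbb{E}(X_1))^2=1/3-1/4=1/12\)

Thus, \(\mathbb{E}(\bar{X})=\mathbb{E}\big(\frac{1}{n}\sum_{i=1}^n X_i\big)=\frac{1}{n}\sum_{i=1}^n \mathbb{E}(X_i)=\mathbb{E}(X_1)=1/2\)

Since the \(X_i\) are independent, the variance of their sum is the sum of their variances, so

\({\mathrm{Var}}(\bar{X})={\mathrm{Var}}\big(\frac{1}{n}\sum_{i=1}^n X_i\big)=\frac{1}{n^2}{\mathrm{Var}}\big(\sum_{i=1}^n X_i\big)=\frac{1}{n^2}\sum_{i=1}^n {\mathrm{Var}}(X_i)=\frac{1}{n}{\mathrm{Var}}(X_1)=\frac{1}{12n}\)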

With \(n=30\), the normal approximation gives \({\mathrm{P}}(0.45<\bar X<0.55) \approx\) pnorm(0.55, mean = 1/2, sd = sqrt(1/(12*30))) - pnorm(0.45, mean = 1/2, sd = sqrt(1/(12*30)))
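
The pnorm() answer can be evaluated and compared with a simulation. This is a sketch only; the seed and the number of replications are arbitrary choices, not part of the question.

```r
n <- 30
mu <- 1/2                        # E(X-bar)
sd_xbar <- sqrt(1 / (12 * n))    # sqrt(Var(X-bar))

# Normal approximation, as in the solution
p_normal <- pnorm(0.55, mean = mu, sd = sd_xbar) -
  pnorm(0.45, mean = mu, sd = sd_xbar)

# Monte Carlo check: simulate many sample means of 30 Uniform(0,1) draws
set.seed(123)                    # arbitrary seed
xbar <- replicate(10000, mean(runif(n)))
p_sim <- mean(xbar > 0.45 & xbar < 0.55)

c(normal = p_normal, simulated = p_sim)
```

The two values should agree to about two decimal places, which supports the assumption that \(\bar X\) is close to normal for \(n=30\).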

Q2. Comparing means using linear model

Let’s consider the crabs data set we studied in lab. Recall that species (sp) is a factor with two levels, Blue (B) and Orange (O). We want to study the difference in frontal lobe size (FL) between the two species.

library(MASS)
data(crabs)
head(crabs)
##   sp sex index   FL  RW   CL   CW  BD
## 1  B   M     1  8.1 6.7 16.1 19.0 7.0
## 2  B   M     2  8.8 7.7 18.1 20.8 7.4
## 3  B   M     3  9.2 7.8 19.0 22.4 7.7
## 4  B   M     4  9.6 7.9 20.1 23.1 8.2
## 5  B   M     5  9.8 8.0 20.3 23.0 8.2
## 6  B   M     6 10.8 9.0 23.0 26.5 9.8

Consider the model \(Y_i = \mu_1x_{Bi} + \mu_2x_{Oi} + \epsilon_i\), \(i=1,\dots,200\). Here \(Y_i\) is the FL of observation \(i\), \(x_{Bi}\) is 1 if \(sp=B\) for observation \(i\) and 0 otherwise, and similarly \(x_{Oi}\) is 1 if \(sp=O\) for observation \(i\) and 0 otherwise. The errors \(\epsilon_i\) are i.i.d. with mean 0 and variance \(\sigma^2\). This model is fitted in R using the lm() function, and the summary is provided below.

lm_crab <- lm(FL~sp-1, data=crabs)
summary(lm_crab)
## 
## Call:
## lm(formula = FL ~ sp - 1, data = crabs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.010 -2.410  0.390  2.169  7.244 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## spB   14.056      0.315   44.62   <2e-16 ***
## spO   17.110      0.315   54.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.15 on 198 degrees of freedom
## Multiple R-squared:  0.9615, Adjusted R-squared:  0.9611 
## F-statistic:  2470 on 2 and 198 DF,  p-value: < 2.2e-16

(a). What do \(\mu_1\) and \(\mu_2\) measure in the above probability model?

Solution:

\(\mu_1\) is the population mean frontal lobe size for blue crabs. \(\mu_2\) is the population mean frontal lobe size for orange crabs.

(b). Build a 95% confidence interval for \(\mu_1\) using a normal approximation.

Solution:

\((14.056-1.96\times 0.315,\;14.056+1.96\times 0.315)=(13.44,14.67)\)
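
The same interval can be computed in R from the estimate and standard error printed in summary(lm_crab). This is a sketch with the two numbers typed in by hand; the t-based interval is added only for comparison and is not part of the question.

```r
est <- 14.056   # estimate of mu_1 (row spB of the coefficient table)
se  <- 0.315    # its standard error

# Normal-approximation interval, as in the solution
ci_normal <- est + c(-1, 1) * qnorm(0.975) * se

# For comparison: interval based on the t distribution with 198 residual df
ci_t <- est + c(-1, 1) * qt(0.975, df = 198) * se

rbind(normal = ci_normal, t = ci_t)
```

With 198 degrees of freedom the two intervals are nearly identical.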

(c). Recall from homework that the full estimated covariance matrix of \(\mathbf{\hat{\mu}} = (\hat{\mu}_1,\hat{\mu}_2)\) can be found by

V <- summary(lm_crab)$cov.unscaled * summary(lm_crab)$sigma^2
V
##            spB        spO
## spB 0.09923719 0.00000000
## spO 0.00000000 0.09923719

Use V and the information provided in summary(lm_crab) to build a 95% confidence interval for \(\mu_1-\mu_2\).

Solution:

Let \(a = (1,-1)^T\).

By formula (3), \({\mathrm{Var}}(a^T\mathbf{\hat{\mu}})=a^T{\mathrm{Var}}(\mathbf{\hat{\mu}})a\), which we estimate by \(a^TVa=0.0992+0.0992\approx 0.198\).

\(\hat{\mu}_1-\hat{\mu}_2=14.056-17.110=-3.054\)

Thus we have the 95% C.I. \((-3.054-1.96\times\sqrt{0.198},\;-3.054+1.96\times\sqrt{0.198})=(-3.926,-2.182)\)
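
The matrix calculation can be reproduced in R. This sketch types in V and the coefficient estimates from the output above rather than refitting the model; the names a, mu_hat, diff_hat, and var_diff are chosen for illustration.

```r
# V as printed above
V <- matrix(c(0.09923719, 0,
              0,          0.09923719),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("spB", "spO"), c("spB", "spO")))

a      <- c(1, -1)              # contrast picking out mu_1 - mu_2
mu_hat <- c(14.056, 17.110)     # estimates from summary(lm_crab)

diff_hat <- sum(a * mu_hat)                  # estimate of mu_1 - mu_2
var_diff <- as.numeric(t(a) %*% V %*% a)     # a^T V a

ci <- diff_hat + c(-1, 1) * qnorm(0.975) * sqrt(var_diff)
list(estimate = diff_hat, variance = var_diff, ci = ci)
```

Small differences in the last digit relative to the hand calculation come from rounding \(a^TVa\) to 0.198.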

Q3. Making and interpreting an F-test

Consider the birth weight data set we have seen in lab. For this question, we will look at the columns bwt (birth weight), lwt (mother’s weight), age (mother’s age), and race (mother’s race: 1 for white, 2 for black, and 3 for other).

library(MASS)
data(birthwt)
head(birthwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

We want to study the relationship between bwt and race using an F test, with lwt and age included as potential confounding variables. Let the null hypothesis, \(H_0\), be the probability model where bwt is modeled to depend linearly on lwt and age. Let \(H_a\) be the probability model where \(H_0\) is extended to include race as a factor, as fitted in R by

lm_bw <- lm(bwt ~ lwt + age +factor(race), data = birthwt)

The results from summary(lm_bw) and anova(lm_bw) are as follows

summary(lm_bw)
## 
## Call:
## lm(formula = bwt ~ lwt + age + factor(race), data = birthwt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2103.50  -429.68    41.74   486.10  1902.20 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2461.147    314.722   7.820 3.97e-13 ***
## lwt              4.620      1.788   2.584  0.01054 *  
## age              1.299     10.108   0.128  0.89789    
## factor(race)2 -447.615    161.369  -2.774  0.00611 ** 
## factor(race)3 -239.357    115.189  -2.078  0.03910 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 704.9 on 184 degrees of freedom
## Multiple R-squared:  0.08536,    Adjusted R-squared:  0.06548 
## F-statistic: 4.293 on 4 and 184 DF,  p-value: 0.00241
anova(lm_bw)
## Analysis of Variance Table
## 
## Response: bwt
##               Df   Sum Sq Mean Sq F value   Pr(>F)   
## lwt            1  3448639 3448639  6.9398 0.009148 **
## age            1   334183  334183  0.6725 0.413247   
## factor(race)   2  4750632 2375316  4.7799 0.009467 **
## Residuals    184 91436202  496936                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(a). Write out the null and alternative hypotheses of the F test by completely specifying the probability models.

Solution:

Under \(H_0\), we have the model \(Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}+ \epsilon_i\), \(i=1,\dots,189\), where \(Y_i\) is the birth weight for observation \(i\), \(x_{i1}\) is the mother’s weight, and \(x_{i2}\) is the mother’s age for observation \(i\). The errors \(\epsilon_i\) are i.i.d. with mean 0 and variance \(\sigma^2\).

Under \(H_a\), we have \(Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}+\beta_3 x_{i3} + \beta_4 x_{i4}+ \epsilon_i\), \(i=1,\dots,189\). Here \(x_{i3}\) is 1 if the \(i\)th mother is black and 0 otherwise, and \(x_{i4}\) is 1 if the \(i\)th mother is neither white nor black and 0 otherwise. All other symbols are defined as in the model under \(H_0\).

For the F test, we are testing

\(H_0:\beta_3=\beta_4=0\) against \(H_a:\beta_3\neq 0\) or \(\beta_4\neq 0\)

(b). Interpret the results in anova(lm_bw). Specifically, read the sample test statistic from the R output, give the distribution of the model-generated test statistic under \(H_0\), and explain how the resulting p-value is calculated and interpreted.

Solution:

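The sample test statistic is \(f_{obs}=4.7799\), read from the factor(race) row of the ANOVA table. Under \(H_0\), the model-generated test statistic \(F\) follows an F distribution with 2 numerator degrees of freedom and 184 denominator degrees of freedom.

The p-value is the probability, under \(H_0\), of a test statistic at least as large as the one observed: \(\mathbb{P}(F>f_{obs})=0.009467\). Since this is below 0.05, we reject \(H_0\) at the 0.05 level. That is, after accounting for the mother’s weight and age, the data provide evidence that the mother’s race is associated with the baby’s birth weight.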
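
The F statistic and p-value can be reconstructed from the sums of squares in the ANOVA table. This is a sketch with the table entries typed in by hand; the variable names are chosen for illustration.

```r
# Entries from the factor(race) and Residuals rows of anova(lm_bw)
ss_race <- 4750632;  df_race <- 2
ss_res  <- 91436202; df_res  <- 184

# F statistic: mean square for race divided by residual mean square
f_obs <- (ss_race / df_race) / (ss_res / df_res)

# p-value: upper tail of the F(2, 184) distribution beyond f_obs
p_value <- pf(f_obs, df1 = df_race, df2 = df_res, lower.tail = FALSE)

c(F = f_obs, p = p_value)
```

Because anova() on a single fitted model reports sequential sums of squares, the factor(race) row tests race after lwt and age are already in the model, which matches the hypotheses in part (a).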

Q4. Prediction using a linear model

We continue with the birthwt data from Q3. Consider the model fitted under \(H_a\). Suppose we are interested in predicting the birth weight of a baby whose mother is white, 30 years old, and has weight 130.

Specify a row matrix \(\mathbf{x}^*\) so that \(y^*=\mathbf{x}^*\mathbf{b}\) gives the least squares predictor. Use summary(lm_bw) from Q3 to find \(y^*\).

Solution:

\(\mathbf{x}^*=(1,130,30,0,0)\).

Since white is the baseline level of race, both race indicators are 0. Thus \(y^*=2461.147+130\times 4.620+30\times 1.299=3100.717\)
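
The prediction can be written as a matrix product in R. This sketch types in the rounded coefficients from summary(lm_bw) instead of using the fitted object; the names b, x_star, and y_star are chosen for illustration.

```r
# Coefficient estimates from summary(lm_bw), in the order of the design matrix
b <- c("(Intercept)"   = 2461.147,
       "lwt"           = 4.620,
       "age"           = 1.299,
       "factor(race)2" = -447.615,
       "factor(race)3" = -239.357)

# Row matrix for a white mother (both race indicators 0), lwt = 130, age = 30
x_star <- matrix(c(1, 130, 30, 0, 0), nrow = 1)

y_star <- as.numeric(x_star %*% b)
y_star
```

With the fitted object available, predict(lm_bw, newdata = data.frame(lwt = 130, age = 30, race = 1)) should give essentially the same value, differing only through the rounding of the coefficients above.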