You are not allowed to bring any notes into the exam.
The following formulas will be provided. To use these formulas properly, you need to make appropriate definitions of the necessary quantities.
(1)\(\quad\quad\mathbf{b}=\big(\mathbb{X}^\top \mathbb{X} \big)^{-1}\, \mathbb{X}^\top\mathbf{y}\)
(2)\(\quad\quad{\mathrm{Var}}(\hat{\mathbf{\beta}})= \sigma^2 \big( \mathbb{X}^\top \mathbb{X} \big)^{-1}\)
(3)\(\quad\quad{\mathrm{Var}}(\mathbb{A}\mathbf{Y})=\mathbb{A}{\mathrm{Var}}(\mathbf{Y})\mathbb{A}^\top\)
(4)\(\quad\quad{\mathrm{Var}}(X)={\mathrm{E}}\big[ (X-{\mathrm{E}}[X])^2\big] = {\mathrm{E}}[X^2]-\big({\mathrm{E}}[X]\big)^2\)
(5)\(\quad\quad{\mathrm{Cov}}(X,Y)={\mathrm{E}}\big[ \big(X-{\mathrm{E}}[X]\big)\big(Y-{\mathrm{E}}[Y]\big)\big] = {\mathrm{E}}[XY]-{\mathrm{E}}[X]\,{\mathrm{E}}[Y]\)
(6)\(\quad\quad\) The binomial (\(n,p\)) distribution has mean \(np\) and variance \(np(1-p)\).
From ?pnorm:
pnorm(q, mean = 0, sd = 1)
qnorm(p, mean = 0, sd = 1)
q: vector of quantiles.
p: vector of probabilities.
Let \(X_1,X_2,\dots,X_n\) be independent random variables, each following the Uniform(0,1) distribution. Find the mean and variance of \(X_1\). Use this to find the mean and variance of \(\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\). Now suppose \(n=30\) and suppose that \(\bar X\) is well approximated by a normal distribution. Find \({\mathrm{P}}(0.45<\bar X<0.55)\). Write your answer as a call to pnorm(). Your call to pnorm() may include any numerical calculations that you can’t work out without access to a computer or calculator.
Solution:
\(\mathbb{E}(X_1)=\int_0^1 x dx = 1/2\)
\(\mathbb{E}(X_1^2)=\int_0^1 x^2 dx = 1/3\)
\({\mathrm{Var}}(X_1)=\mathbb{E}(X_1^2)-(\mathbb{E}(X_1))^2=1/3-1/4=1/12\)
Thus, \(\mathbb{E}(\bar{X})=\mathbb{E}\big(\frac{1}{n}\sum_{i=1}^n X_i\big)=\frac{1}{n}\sum_{i=1}^n \mathbb{E}(X_i)=\mathbb{E}(X_1)=1/2\)
By independence, \({\mathrm{Var}}(\bar{X})={\mathrm{Var}}\big(\frac{1}{n}\sum_{i=1}^n X_i\big)=\frac{1}{n^2}\sum_{i=1}^n {\mathrm{Var}}(X_i)=\frac{1}{n}{\mathrm{Var}}(X_1)=\frac{1}{12n}\)
\({\mathrm{P}}(0.45<\bar X<0.55) \approx\) pnorm(0.55, mean = 1/2, sd = sqrt(1/(12*30))) - pnorm(0.45, mean = 1/2, sd = sqrt(1/(12*30)))
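With access to R, the answer above can be evaluated and sanity-checked. The following is a minimal sketch, not part of the exam solution: the simulation size (10000) and the seed are arbitrary choices made here for illustration.

```r
# Normal approximation for the mean of n = 30 Uniform(0,1) variables
n <- 30
mu <- 1 / 2                      # E[X_1]
sd_xbar <- sqrt(1 / (12 * n))    # sqrt(Var(X_1) / n)
p_normal <- pnorm(0.55, mean = mu, sd = sd_xbar) -
  pnorm(0.45, mean = mu, sd = sd_xbar)

# Monte Carlo check (seed and number of replications are arbitrary)
set.seed(1)
xbar_sim <- replicate(10000, mean(runif(n)))
p_sim <- mean(xbar_sim > 0.45 & xbar_sim < 0.55)

c(normal = p_normal, simulated = p_sim)
```

The two numbers should be close to each other if the normal approximation is reasonable at \(n=30\).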
Let’s consider the crabs data set we studied in lab. Recall that species (sp) is a factor with two levels, Blue (B) and Orange (O). We want to study the difference in frontal lobe size (FL) between the two species.
library(MASS)
data(crabs)
head(crabs)
## sp sex index FL RW CL CW BD
## 1 B M 1 8.1 6.7 16.1 19.0 7.0
## 2 B M 2 8.8 7.7 18.1 20.8 7.4
## 3 B M 3 9.2 7.8 19.0 22.4 7.7
## 4 B M 4 9.6 7.9 20.1 23.1 8.2
## 5 B M 5 9.8 8.0 20.3 23.0 8.2
## 6 B M 6 10.8 9.0 23.0 26.5 9.8
Consider the model \(Y_i = \mu_1x_{Bi} + \mu_2x_{Oi} + \epsilon_i\), \(i=1,\dots,200\). \(Y_i\) is the FL of observation i. \(x_{Bi}\) is 1 if \(sp=B\) for observation i and 0 otherwise. Similarly, \(x_{Oi}\) is 1 if \(sp=O\) for observation i and 0 otherwise. The \(\epsilon_i\) are i.i.d. with mean 0 and variance \(\sigma^2\). This model is fitted in R using the lm() function, and the summary is provided below.
lm_crab <- lm(FL~sp-1, data=crabs)
summary(lm_crab)
##
## Call:
## lm(formula = FL ~ sp - 1, data = crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.010 -2.410 0.390 2.169 7.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## spB 14.056 0.315 44.62 <2e-16 ***
## spO 17.110 0.315 54.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.15 on 198 degrees of freedom
## Multiple R-squared: 0.9615, Adjusted R-squared: 0.9611
## F-statistic: 2470 on 2 and 198 DF, p-value: < 2.2e-16
(a). What do \(\mu_1\) and \(\mu_2\) measure in the above probability model?
Solution:
\(\mu_1\) is the population mean frontal lobe size for blue crabs. \(\mu_2\) is the population mean frontal lobe size for orange crabs.
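One way to see this interpretation in R is to compare the fitted coefficients with the sample mean of FL within each species. This is an illustrative sketch (the name group_means is introduced here); for a model with only the two indicator variables and no intercept, the least squares estimates coincide with the group sample means.

```r
library(MASS)
data(crabs)

# Same model as above: one indicator per species, no intercept
lm_crab <- lm(FL ~ sp - 1, data = crabs)

# Sample mean of FL within each species
group_means <- tapply(crabs$FL, crabs$sp, mean)

coef(lm_crab)
group_means
```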
(b). Build a 95% confidence interval for \(\mu_1\) using the normal approximation.
Solution:
\((14.056-1.96\times 0.315,\; 14.056+1.96\times 0.315)=(13.44,\,14.67)\)
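The same interval can be reproduced in R from the coefficient table. This is a sketch for checking the hand calculation: it uses qnorm(0.975) in place of the rounded value 1.96, and the variable names (est, se, ci_mu1) are chosen here for illustration.

```r
library(MASS)
data(crabs)
lm_crab <- lm(FL ~ sp - 1, data = crabs)

# Estimate and standard error for spB from the coefficient table
coefs <- coef(summary(lm_crab))
est <- coefs["spB", "Estimate"]
se <- coefs["spB", "Std. Error"]

# Normal-approximation 95% confidence interval
z <- qnorm(0.975)
ci_mu1 <- c(lower = est - z * se, upper = est + z * se)
ci_mu1
```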
(c). Recall from homework that the full estimated covariance matrix of \(\mathbf{\hat{\mu}} = (\hat{\mu}_1,\hat{\mu}_2)\) can be found by
V <- summary(lm_crab)$cov.unscaled * summary(lm_crab)$s^2
V
## spB spO
## spB 0.09923719 0.00000000
## spO 0.00000000 0.09923719
Use V and the information provided in summary(lm_crab) to build a 95% confidence interval for \(\mu_1-\mu_2\).
Solution:
Let \(a = (1,-1)^T\).
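By formula (3), \({\mathrm{Var}}(a^T\mathbf{\hat{\mu}})=a^T{\mathrm{Var}}(\mathbf{\hat{\mu}})a\), which we estimate by \(a^TVa=0.09923719+0.09923719\approx 0.198\).
\(\hat{\mu}_1-\hat{\mu}_2=14.056-17.110=-3.054\)
Thus we have the 95% C.I. \((-3.054-1.96\times\sqrt{0.198},\; -3.054+1.96\times\sqrt{0.198})=(-3.926,\,-2.182)\)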
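The calculation in (c) can be checked in R along the following lines. This sketch obtains the covariance matrix with vcov(), which for an lm fit returns the same scaled matrix as the V defined in the question; the names est_diff, se_diff and ci_diff are introduced here for illustration.

```r
library(MASS)
data(crabs)
lm_crab <- lm(FL ~ sp - 1, data = crabs)

a <- c(1, -1)                    # contrast picking out mu_1 - mu_2
V <- vcov(lm_crab)               # estimated covariance matrix of (mu_1-hat, mu_2-hat)

est_diff <- sum(a * coef(lm_crab))          # a^T mu-hat
se_diff <- sqrt(drop(t(a) %*% V %*% a))     # sqrt(a^T V a)

ci_diff <- est_diff + c(-1, 1) * qnorm(0.975) * se_diff
ci_diff
```

Small differences in the last digit relative to the hand calculation are expected, since the hand calculation uses rounded inputs.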
Consider the birth weight data set we have seen in lab. For this question, we will look at the columns bwt (birth weight), lwt (mother’s weight), age (mother’s age) and race (mother’s race: 1 for white, 2 for black and 3 for other).
library(MASS)
data(birthwt)
head(birthwt)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
We want to study the relationship between bwt and race using an F test, with lwt and age included as confounding variables. Let the null hypothesis, \(H_0\), be the probability model where bwt is modeled to depend linearly on lwt and age. Let \(H_a\) be the probability model where \(H_0\) is extended to include race as a factor, as fitted in R by
lm_bw <- lm(bwt ~ lwt + age + factor(race), data = birthwt)
The results from summary(lm_bw) and anova(lm_bw) are as follows:
summary(lm_bw)
##
## Call:
## lm(formula = bwt ~ lwt + age + factor(race), data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2103.50 -429.68 41.74 486.10 1902.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2461.147 314.722 7.820 3.97e-13 ***
## lwt 4.620 1.788 2.584 0.01054 *
## age 1.299 10.108 0.128 0.89789
## factor(race)2 -447.615 161.369 -2.774 0.00611 **
## factor(race)3 -239.357 115.189 -2.078 0.03910 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 704.9 on 184 degrees of freedom
## Multiple R-squared: 0.08536, Adjusted R-squared: 0.06548
## F-statistic: 4.293 on 4 and 184 DF, p-value: 0.00241
anova(lm_bw)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## lwt 1 3448639 3448639 6.9398 0.009148 **
## age 1 334183 334183 0.6725 0.413247
## factor(race) 2 4750632 2375316 4.7799 0.009467 **
## Residuals 184 91436202 496936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(a). Write out the null and alternative hypotheses of the F test by completely specifying the probability models.
Solution:
Under \(H_0\), we have the model \(Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}+ \epsilon_i\), \(i=1,\dots,189\), where \(Y_i\) is the birth weight for observation i, \(x_{i1}\) is the mother’s weight and \(x_{i2}\) is the mother’s age for observation i. The \(\epsilon_i\) are i.i.d. with mean 0 and variance \(\sigma^2\).
Under \(H_a\), we have \(Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}+\beta_3 x_{i3} + \beta_4 x_{i4}+ \epsilon_i\), \(i=1,\dots,189\). \(x_{i3}\) is 1 if the ith mother is black and 0 otherwise. \(x_{i4}\) is 1 if the ith mother is neither white nor black and 0 otherwise. The other symbols are defined as in the model under \(H_0\).
For the F test, we are testing
\(H_0:\beta_3=\beta_4=0\) against \(H_a:\beta_3\neq 0\) or \(\beta_4\neq 0\)
(b). Interpret the results in anova(lm_bw). Specifically, read the sample test statistic from the R output, give the distribution of the model-generated test statistic under \(H_0\), and explain how the resulting p-value is calculated and interpreted.
Solution:
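The sample test statistic is \(f_{obs}=4.7799\), read from the factor(race) row of the ANOVA table. Under \(H_0\), the model-generated test statistic F follows an F distribution with 2 numerator degrees of freedom and 184 denominator degrees of freedom.
The p-value is \(\mathbb{P}(F>f_{obs})=0.009467\), the probability under \(H_0\) of a test statistic at least as large as the one observed. Since \(0.009467<0.05\), we reject \(H_0\) at the 0.05 level. Namely, after accounting for lwt and age, the mother’s race has a significant effect on the birth weight of babies.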
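The p-value and test statistic can be reproduced in R as sketched below. The name lm_h0 for the null model is introduced here for illustration; pf() with lower.tail = FALSE gives the upper tail probability of the F distribution, and anova() applied to the two nested fits carries out the same F test as the factor(race) row of anova(lm_bw).

```r
library(MASS)
data(birthwt)

# Null and alternative models (lm_h0 is a name chosen for this sketch)
lm_h0 <- lm(bwt ~ lwt + age, data = birthwt)
lm_bw <- lm(bwt ~ lwt + age + factor(race), data = birthwt)

# Upper tail probability of F(2, 184) at the observed statistic
p_from_pf <- pf(4.7799, df1 = 2, df2 = 184, lower.tail = FALSE)
p_from_pf

# Direct comparison of the nested models
cmp <- anova(lm_h0, lm_bw)
cmp
```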
We continue with the birthwt data from Q3. Consider the model fitted under \(H_a\). Suppose we are interested in predicting the birth weight of a baby whose mother is 30 years old, white, and has weight 130.
Specify a row matrix \(\mathbf{x}^*\) so that \(y^*=\mathbf{x}^*\mathbf{b}\) gives the least squares predictor. Use summary(lm_bw) from Q3 to find \(y^*\).
Solution:
\(\mathbf{x}^*=(1,130,30,0,0)\).
Thus \(y^*=2461.147+130\times 4.620+30\times 1.299=3100.717\)
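The prediction can be checked in R in two ways, sketched below: by forming the product \(\mathbf{x}^*\mathbf{b}\) directly, and by calling predict() with a one-row data frame. The names x_star, y_star and y_pred are introduced here for illustration; the result may differ slightly from the hand calculation because summary() prints rounded coefficients.

```r
library(MASS)
data(birthwt)
lm_bw <- lm(bwt ~ lwt + age + factor(race), data = birthwt)

# Row matrix: intercept, lwt = 130, age = 30, race indicators both 0 (white)
x_star <- matrix(c(1, 130, 30, 0, 0), nrow = 1)
y_star <- drop(x_star %*% coef(lm_bw))

# Same prediction via predict(); race = 1 codes a white mother
y_pred <- predict(lm_bw, newdata = data.frame(lwt = 130, age = 30, race = 1))

c(matrix_product = y_star, predict = unname(y_pred))
```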