You are not allowed to bring any notes into the exam.
The following formulas will be provided. To use these formulas properly, you need to make appropriate definitions of the necessary quantities.
(1)\(\quad\quad\mathbf{b}=\big(\mathbb{X}^\top \mathbb{X} \big)^{-1}\, \mathbb{X}^\top\mathbf{y}\)
(2)\(\quad\quad{\mathrm{Var}}(\hat{\mathbf{\beta}})= \sigma^2 \big( \mathbb{X}^\top \mathbb{X} \big)^{-1}\)
(3)\(\quad\quad{\mathrm{Var}}(\mathbb{A}\mathbf{Y})=\mathbb{A}{\mathrm{Var}}(\mathbf{Y})\mathbb{A}^\top\)
(4)\(\quad\quad{\mathrm{Var}}(X)={\mathrm{E}}\big[ (X-{\mathrm{E}}[X])^2\big] = {\mathrm{E}}[X^2]-\big({\mathrm{E}}[X]\big)^2\)
(5)\(\quad\quad{\mathrm{Cov}}(X,Y)={\mathrm{E}}\big[ \big(X-{\mathrm{E}}[X]\big)\big(Y-{\mathrm{E}}[Y]\big)\big] = {\mathrm{E}}[XY]-{\mathrm{E}}[X]\,{\mathrm{E}}[Y]\)
(6)\(\quad\quad\) The binomial (\(n,p\)) distribution has mean \(np\) and variance \(np(1-p)\).
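As an illustration of how formulas (1) and (2) line up with R output, the following minimal sketch builds a design matrix by hand for simulated data and compares the results with lm(). The simulated data, the seed, and the names X, b, s2 and V_b are assumptions made for this example only; \(\sigma^2\) in (2) is replaced by its usual estimate.

```r
set.seed(1)                               # arbitrary seed, for reproducibility only
n <- 30
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)       # simulated response (illustrative values)

X <- cbind(1, x)                          # design matrix: intercept column and x
b <- solve(t(X) %*% X, t(X) %*% y)        # formula (1)

fit <- lm(y ~ x)
s2 <- sum(resid(fit)^2) / (n - ncol(X))   # estimate of sigma^2
V_b <- s2 * solve(t(X) %*% X)             # formula (2) with sigma^2 estimated

# Both comparisons should return TRUE
all.equal(as.numeric(b), as.numeric(coef(fit)))
all.equal(as.numeric(V_b), as.numeric(vcov(fit)))
```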
From ?pnorm:
pnorm(q, mean = 0, sd = 1)
qnorm(p, mean = 0, sd = 1)
q: vector of quantiles.
p: vector of probabilities.
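A brief sketch of how these two functions relate, using values chosen only for illustration: pnorm() returns the cumulative probability at a quantile, and qnorm() inverts it. The variable names are hypothetical.

```r
# Probability that a standard normal lies within 1.96 of zero
p_mid <- pnorm(1.96) - pnorm(-1.96)

# qnorm() inverts pnorm(): the quantile with 97.5% of the mass below it
z <- qnorm(0.975)

# The mean and sd arguments handle non-standard normals:
# probability that a Normal(mean 10, sd 2) variable lies between 8 and 12
p_ns <- pnorm(12, mean = 10, sd = 2) - pnorm(8, mean = 10, sd = 2)

c(p_mid = p_mid, z = z, p_ns = p_ns)
```

Note that the sd argument is a standard deviation, not a variance.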
Let \(X_1,X_2,\dots,X_n\) be independent random variables, each following the Uniform(0,1) distribution. Find the mean and variance of \(X_1\). Use these to find the mean and variance of \(\bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\). Now suppose \(n=50\) and suppose that \(\bar X\) is well approximated by a normal distribution. Find \({\mathrm{P}}(0.45<\bar X<0.55)\). Write your answer as a call to pnorm(). Your call to pnorm() may include any numerical calculations that you can’t work out without access to a computer or calculator.
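Outside the exam, one way to sanity-check an answer of this kind is a small simulation. The sketch below estimates the probability directly from simulated sample means, so it does not depend on the normal approximation; the seed, the number of replications and the variable names are assumptions chosen for illustration.

```r
set.seed(2024)                 # arbitrary seed, for reproducibility only
n <- 50                        # sample size from the question
reps <- 10000                  # number of simulated samples (assumed)

# Each row is one sample of size n from Uniform(0,1);
# rowMeans() gives one sample mean per row
xbar <- rowMeans(matrix(runif(n * reps), nrow = reps))

# Proportion of simulated sample means falling in (0.45, 0.55)
p_sim <- mean(xbar > 0.45 & xbar < 0.55)
p_sim
```

The simulated proportion should be close to, but not exactly equal to, the value returned by the pnorm() call.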
Let’s consider the crabs data set we studied in lab. Recall that species (sp) is a factor with two levels, Blue (B) and Orange (O). We want to study the difference in frontal lobe size (FL) between the two species.
library(MASS)
data(crabs)
head(crabs)
## sp sex index FL RW CL CW BD
## 1 B M 1 8.1 6.7 16.1 19.0 7.0
## 2 B M 2 8.8 7.7 18.1 20.8 7.4
## 3 B M 3 9.2 7.8 19.0 22.4 7.7
## 4 B M 4 9.6 7.9 20.1 23.1 8.2
## 5 B M 5 9.8 8.0 20.3 23.0 8.2
## 6 B M 6 10.8 9.0 23.0 26.5 9.8
Consider the model \(Y_i = \mu_1x_{Bi} + \mu_2x_{Oi} + \epsilon_i\), \(i=1,\dots,200\). Here \(Y_i\) is the FL of observation \(i\); \(x_{Bi}\) is 1 if \(sp=B\) for observation \(i\) and 0 otherwise; similarly, \(x_{Oi}\) is 1 if \(sp=O\) for observation \(i\) and 0 otherwise. The errors \(\epsilon_i\) are i.i.d. with mean 0 and variance \(\sigma^2\). This model is fitted in R using the lm() function, and the summary is given below.
lm_crab <- lm(FL~sp-1, data=crabs)
summary(lm_crab)
##
## Call:
## lm(formula = FL ~ sp - 1, data = crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.010 -2.410 0.390 2.169 7.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## spB 14.056 0.315 44.62 <2e-16 ***
## spO 17.110 0.315 54.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.15 on 198 degrees of freedom
## Multiple R-squared: 0.9615, Adjusted R-squared: 0.9611
## F-statistic: 2470 on 2 and 198 DF, p-value: < 2.2e-16
(a). What do \(\mu_1\) and \(\mu_2\) measure in the above probability model?
(b). Build a 95% confidence interval for \(\mu_1\) using a normal approximation.
(c). Recall from homework that the full estimated covariance matrix of \(\mathbf{\hat{\mu}} = (\hat{\mu}_1,\hat{\mu}_2)\) can be found by
V <- summary(lm_crab)$cov.unscaled * summary(lm_crab)$sigma^2
V
## spB spO
## spB 0.09923719 0.00000000
## spO 0.00000000 0.09923719
Use V and the information provided in summary(lm_crab) to build a 95% confidence interval for \(\mu_1-\mu_2\).
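For reference, the matrix V can be cross-checked against R's built-in vcov() function. The sketch below refits the model so that it runs on its own, and uses the full component name sigma from summary(); the name s for the summary object is an assumption for this example.

```r
library(MASS)
data(crabs)

lm_crab <- lm(FL ~ sp - 1, data = crabs)
s <- summary(lm_crab)

# Estimated covariance matrix: unscaled (X'X)^{-1} times the estimate of sigma^2
V <- s$cov.unscaled * s$sigma^2

# vcov() should return the same matrix
all.equal(V, vcov(lm_crab))

# Square roots of the diagonal should match the Std. Error column of the summary
sqrt(diag(V))
```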
Consider the birth weight data set we have seen in lab. For this question, we will look at the columns bwt (birth weight), lwt (mother’s weight), age (mother’s age) and race (mother’s race: 1 for white, 2 for black and 3 for other).
library(MASS)
data(birthwt)
head(birthwt)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
We want to study the relationship between bwt and race using an F test, with lwt and age included as confounding variables. Let the null hypothesis, \(H_0\), be the probability model where bwt is modeled to depend linearly on lwt and age. Let \(H_a\) be the probability model where \(H_0\) is extended to include race as a factor, as fitted in R by
lm_bw <- lm(bwt ~ lwt + age +factor(race), data = birthwt)
The results from summary(lm_bw) and anova(lm_bw) are as follows:
summary(lm_bw)
##
## Call:
## lm(formula = bwt ~ lwt + age + factor(race), data = birthwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2103.50 -429.68 41.74 486.10 1902.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2461.147 314.722 7.820 3.97e-13 ***
## lwt 4.620 1.788 2.584 0.01054 *
## age 1.299 10.108 0.128 0.89789
## factor(race)2 -447.615 161.369 -2.774 0.00611 **
## factor(race)3 -239.357 115.189 -2.078 0.03910 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 704.9 on 184 degrees of freedom
## Multiple R-squared: 0.08536, Adjusted R-squared: 0.06548
## F-statistic: 4.293 on 4 and 184 DF, p-value: 0.00241
anova(lm_bw)
## Analysis of Variance Table
##
## Response: bwt
## Df Sum Sq Mean Sq F value Pr(>F)
## lwt 1 3448639 3448639 6.9398 0.009148 **
## age 1 334183 334183 0.6725 0.413247
## factor(race) 2 4750632 2375316 4.7799 0.009467 **
## Residuals 184 91436202 496936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
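The comparison between \(H_0\) and \(H_a\) can also be run explicitly by fitting both models and passing them to anova(). This is a minimal sketch; the name lm_null for the model under \(H_0\) is an assumption and does not appear in the lab code. Because factor(race) is the last term in lm_bw, the second row of this comparison should agree with the factor(race) row of the table above.

```r
library(MASS)
data(birthwt)

# Model under H_0: bwt depends linearly on lwt and age only
lm_null <- lm(bwt ~ lwt + age, data = birthwt)

# Model under H_a: race added as a factor
lm_bw <- lm(bwt ~ lwt + age + factor(race), data = birthwt)

# F test comparing the two nested models
cmp <- anova(lm_null, lm_bw)
cmp
```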
(a). Write out the null and alternative hypotheses of the F test by completely specifying the probability models.
(b). Interpret the results in anova(lm_bw). Specifically, read the sample test statistic from the R output, give the distribution of the model-generated test statistic under \(H_0\), and explain how the resulting p-value is calculated and interpreted.
We continue with the birthwt data from Q3. Consider the model fitted under \(H_a\). Suppose we are interested in predicting the birth weight of a baby whose mother is 30 years old, white, and has weight 130.
Specify a row matrix \(\mathbf{x}^*\) so that \(y^*=\mathbf{x}^*\mathbf{b}\) gives the least squares prediction. Use summary(lm_bw) from Q3 to find \(y^*\).
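With access to R, a hand-computed \(y^*\) can be compared against predict(). The sketch below refits the model so that it runs on its own; the names new_mother and y_check are assumptions for this example. The result may differ slightly from a hand calculation because summary(lm_bw) prints rounded coefficients.

```r
library(MASS)
data(birthwt)

lm_bw <- lm(bwt ~ lwt + age + factor(race), data = birthwt)

# Covariate values from the question; race = 1 codes a white mother
new_mother <- data.frame(lwt = 130, age = 30, race = 1)

# Least squares prediction from the fitted model
y_check <- predict(lm_bw, newdata = new_mother)
y_check
```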