Homework 7, STATS 401 W18

For your homework this week, write a brief report addressing the questions below, including the results you are asked to obtain and the R code you used to generate it. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that make a substantial contribution to your report.

Using a linear model for investigation of two populations

You don’t necessarily need a linear model to investigate the means in two populations, or to make statistical inferences on the difference between the means of based on a sample from each. However, it may be helpful to see how this relatively simple task fits into the linear model framework.

Using a linear model to estimate means in two populations

Let’s consider once more the mice from homework 1. Let $y_{ij}$ be the weight of the $j$th mouse receiving treatment $i$, where $i=1$ corresponds to regular chow, $i=2$ denotes high fat chow, and $j=1,\dots,n$ with $n=12$. In subscript form, we can write the sample version of the linear model as \[ y_{ij} = m_{i} + e_{ij}, \quad \mbox{for $i=1,2$ and $j=1,\dots,n$}. \] To put this model into the standard representation for a linear model, we can combine all $2n$ observations into a single vector, ${\boldsymbol{\mathrm{y}}}=(y_{11},y_{12},\dots,y_{1n},y_{21},y_{22},\dots,y_{2n})$.

Question 1. Use this standard representation, with ${\boldsymbol{\mathrm{y}}}=(y_{11},y_{12},\dots,y_{1n},y_{21},y_{22},\dots,y_{2n})$, to find the design matrix ${\mathbb{X}}$ giving the sample version of this linear model in the matrix form ${\boldsymbol{\mathrm{y}}}={\mathbb{X}}{\boldsymbol{\mathrm{b}}}+{\boldsymbol{\mathrm{e}}}$ where ${\boldsymbol{\mathrm{b}}}=(m_1,m_2)$, interpreted as a column vector.

Confidence intervals for the population means

We can build a probability model for the mouse experiment in a similar way. Let $Y_{ij}$ be a random variable modeling the weights, with $i=1,2$ and $j=1,\dots,n$. Suppose group $i$ has expected weight $\mu_i$, and the weight of mouse $(i,j)$ is modeled as an independent measurement with $\epsilon_{ij}$ having mean zero and variance $\sigma^2$. A representation of the probability model in subscript form is \[ Y_{ij} = \mu_{i} + \epsilon_{ij}, \quad \mbox{for $i=1,2$ and $j=1,\dots,n$}. \] For the corresponding matrix form ${\boldsymbol{\mathrm{Y}}}={\mathbb{X}}{\boldsymbol{\mathrm{\beta}}}+{\boldsymbol{\mathrm{\epsilon}}}$, we write ${\boldsymbol{\mathrm{\beta}}}=(\mu_1,\mu_2)$ and note that $\mu_i$ is the model mean (also called the population mean) for group $i$.

Question 2. Construct the columns of the design matrix in R and call them mu1 and mu2. We follow the convention of assigning the column name of the design matrix to match the name of the corresponding linear model coefficient. This means that the names mu1 and mu2 will show up properly in the linear model output. Call the dataset mice. Then fit a linear model lm1 <- lm(Bodyweight~mu1+mu2-1,data=mice).

Why do we need to write -1 in this R linear model formula?
Try without the -1 to see what R gives you. Why do you think this happens?
Use the output from summary(lm1) to construct a 95% confidence interval for each of $\mu_1$ and $\mu_2$, using a normal distribution approximation for $\hat\mu_1$ and $\hat\mu_2$.

Comparison of the difference in means

It is convenient to write the same model in a different way in order to compare the means between treatments. We can write \[ {\boldsymbol{\mathrm{Y}}}= \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \end{bmatrix} {\boldsymbol{\mathrm{\gamma}}} + {\boldsymbol{\mathrm{\epsilon}}} \] where ${\boldsymbol{\mathrm{\gamma}}}=(\gamma_1,\gamma_2)$ with $\gamma_1=\mu_1$ and $\gamma_2=\mu_2-\mu_1$. Here, we are calling the linear model coefficient vector ${\boldsymbol{\mathrm{\gamma}}}$ instead of ${\boldsymbol{\mathrm{\beta}}}$ since we already used ${\boldsymbol{\mathrm{\beta}}}$ for a different purpose in Questions 1 and 2.

Question 3. Code the columns of this design matrix in R, calling them mu1 and mu_diff. Then fit a linear model lm2 <- lm(weight~mu1+mu_diff-1,data=mice).

Convince yourself, by looking at summary(lm2) that you have, in some sense, fitted the same model in Questions 2 and 3. Explain your reasoning.
Another way to see this is to compare lm1$fitted.values with lm2$fitted.values. What do you find?
Use the output from summary(lm2) to find a 95% confidence interval for $\mu_2-\mu_1$. Also, write down a test of the null hypothesis that $\mu_1=\mu_2$, obtaining a p-value and drawing a conclusion at a suitable significance level.

Using the specification in Question 2 to carry out inference on $\mu_2-\mu_1$.

Having done Question 3, we don’t need to use the model specification from Questions 1 and 2 to carry out inference on $\mu_2-\mu_1$. However, it may be useful to see how we could have done that. Using the notation of Questions 1 and 2, we can write $\mu_2-\mu_1 = \begin{bmatrix}1 & -1\end{bmatrix} {\boldsymbol{\mathrm{\beta}}}$, where ${\boldsymbol{\mathrm{\beta}}}=\begin{bmatrix}\beta_1 \\ \beta_2\end{bmatrix}=\begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}$. In the same way, \[ \hat\mu_2-\hat\mu_1 = \begin{bmatrix}-1 & 1\end{bmatrix} {\boldsymbol{\mathrm{\hat\beta}}} = \begin{bmatrix}-1 & 1\end{bmatrix} \begin{bmatrix}\hat\beta_1 \\ \hat\beta_1\end{bmatrix}. \]

Question 4. First, let’s calculate the estimated covariance matrix of ${\boldsymbol{\mathrm{\hat\beta}}}$. Type names(summary(lm1)) to show all the components in the R linear model summary. The component cov.unscaled is the matrix $\big({\mathbb{X}}^{{\scriptscriptstyle \mathrm{T}}}{\mathbb{X}}\big)^{-1}$ which must then be scaled by $s^2$, where $s$ comes from the sigma component. Thus, the full estimated covariance matrix can be found by something like V <- summary(lm1)$cov.unscaled * summary(lm1)$sigma^2.

Use the matrix formula for the variance of $\begin{bmatrix}1 & -1\end{bmatrix} {\boldsymbol{\mathrm{\hat\beta}}}$ to estimate the standard deviation of $\hat\mu_2-\hat\mu_1$. Does this match your answer to Question 3(c)?
Look at the covariance matrix from summary(lm1)$cov.unscaled. What is the correlation between $\hat\mu_1$ and $\hat\mu_2$? This is a special property of the design matrix for this simple experiment.

License: This material is provided under an MIT license
Acknowledgement: The randomized experiment draws on material from from https://genomicsclass.github.io/book

Homework 7, STATS 401 W18

Due in your lab on 3/15 or 3/16

Using a linear model for investigation of two populations

Using a linear model to estimate means in two populations

Confidence intervals for the population means

Comparison of the difference in means

Using the specification in Question 2 to carry out inference on \(\mu_2-\mu_1\).