x1 = c(rep(1,12),rep(0,12))
x2 = c(rep(0,12),rep(1,12))
X = cbind(x1,x2)
X
## x1 x2
## [1,] 1 0
## [2,] 1 0
## [3,] 1 0
## [4,] 1 0
## [5,] 1 0
## [6,] 1 0
## [7,] 1 0
## [8,] 1 0
## [9,] 1 0
## [10,] 1 0
## [11,] 1 0
## [12,] 1 0
## [13,] 0 1
## [14,] 0 1
## [15,] 0 1
## [16,] 0 1
## [17,] 0 1
## [18,] 0 1
## [19,] 0 1
## [20,] 0 1
## [21,] 0 1
## [22,] 0 1
## [23,] 0 1
## [24,] 0 1
mice = read.csv("https://ionides.github.io/401w18/hw/hw01/femaleMiceWeights.csv")
mice$mu1 = (mice$Diet == "chow")*1
mice$mu2 = (mice$Diet == "hf")*1
lm1 <- lm(Bodyweight~mu1+mu2-1,data=mice);summary(lm1)
##
## Call:
## lm(formula = Bodyweight ~ mu1 + mu2 - 1, data = mice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1042 -2.4358 -0.4138 2.8335 7.1858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## mu1 23.813 1.039 22.91 <2e-16 ***
## mu2 26.834 1.039 25.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.6 on 22 degrees of freedom
## Multiple R-squared: 0.9819, Adjusted R-squared: 0.9802
## F-statistic: 595.8 on 2 and 22 DF, p-value: < 2.2e-16
We do not want to include the intercept in this model. Thus we need -1 in the lm call since otherwise R will automatically include the intercept.
lm <- lm(Bodyweight~mu1+mu2,data=mice);summary(lm)
##
## Call:
## lm(formula = Bodyweight ~ mu1 + mu2, data = mice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1042 -2.4358 -0.4138 2.8335 7.1858
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.834 1.039 25.818 <2e-16 ***
## mu1 -3.021 1.470 -2.055 0.0519 .
## mu2 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.6 on 22 degrees of freedom
## Multiple R-squared: 0.1611, Adjusted R-squared: 0.1229
## F-statistic: 4.224 on 1 and 22 DF, p-value: 0.05192
If we do not write -1, R produce NA for estimates associated with mu2. This is because there is parameter redundancy when trying to estimate an intercept and a mean for both groups.
95% confidence interval for \(\mu_1\)
c(23.813-1.96*1.039,23.813+1.96*1.039)
## [1] 21.77656 25.84944
95% confidence interval for \(\mu_2\)
c(26.834-1.96*1.039,26.834+1.96*1.039)
## [1] 24.79756 28.87044
mice$mu1 = 1
mice$mu_diff = (mice$Diet == "hf")*1
lm2 <- lm(Bodyweight~mu1+mu_diff-1,data=mice);summary(lm2)
##
## Call:
## lm(formula = Bodyweight ~ mu1 + mu_diff - 1, data = mice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1042 -2.4358 -0.4138 2.8335 7.1858
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## mu1 23.813 1.039 22.912 <2e-16 ***
## mu_diff 3.021 1.470 2.055 0.0519 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.6 on 22 degrees of freedom
## Multiple R-squared: 0.9819, Adjusted R-squared: 0.9802
## F-statistic: 595.8 on 2 and 22 DF, p-value: < 2.2e-16
Comparing summart(lm1) and summary(lm2), we notice that mu1 in lm2 has same estimate and standard error as mu1 in lm1. mu_diff in lm2 is equal to mu2 - mu1 in lm1. Also notice that lm1 and lm2 have same residual standard error 3.6.
all.equal(lm1$fitted.values, lm2$fitted.values)
## [1] TRUE
Thus fitted value in lm1 is (approximately) equal to fitted value in lm2 (they might be slightly different due to rounding errors).This again indicates that lm1 and lm 2 are essentially the same model.
95% confidence interval for \(\mu_2-\mu_1\)
# use normal approximation
c(3.021-1.96*1.470,3.021+1.96*1.470)
## [1] 0.1398 5.9022
c(3.021-qt(0.975,df=22)*1.470,3.021+qt(0.975,df=22)*1.470)
## [1] -0.02759341 6.06959341
To see this as a hythothesis problem, we want to test \(H_0:\mu_1 = \mu_2\) against \(H_1: \mu_1 \neq \mu_2\). From summary(lm2), the associated test statistics is 2.055 with p-value 0.0519. Thus we fail to reject \(H_0\) at 0.05 level. Namely we do not have enough evidence to conclude the mean of bodyweight under diet chow and hf is different. (If we choose significant level to be 0.1, we can reject \(H_0\))
V <- summary(lm1)$cov.unscaled
a = c(1,-1)
V_diff = (a %*% V %*% a)
sd_diff = sqrt(V_diff)*summary(lm1)$sigma # we need to time sigma here since cov.unscaled does not include sigma
sd_diff
## [,1]
## [1,] 1.469867
This is consistent with the standard error given in summary(lm2).
summary(lm1)$cov.unscaled
## mu1 mu2
## mu1 0.08333333 0.00000000
## mu2 0.00000000 0.08333333
The correlation between \(\hat{\mu}_1\) and \(\hat{\mu}_2\) is 0. This is because \(\mu_1\) is estimated using \(y_{1j}\),\(j=1,...,12\) while \(\mu_2\) is estimated using \(y_{2j}\),\(j=1,...,12\). Thus \(\hat{\mu}_1\) and \(\hat{\mu}_2\) are independent since \(y_{ij}\) are assumed to be independent.
License: This material is provided under an MIT license
Acknowledgement: The randomized experiment draws on material from from https://genomicsclass.github.io/book