Car drivers adjust the seat position for comfort. We are interested in identifying the major factors that determine the seat position (measured as hip center in mm) and have collected the following information on 38 drivers: age in years, weight in lbs, and height with shoes, height without shoes, seated height, lower arm length, thigh length, and lower leg length, all in cm.
library(faraway)
data("seatpos")
head(seatpos)
##   Age Weight HtShoes    Ht Seated  Arm Thigh  Leg hipcenter
## 1  46    180   187.2 184.9   95.2 36.1  45.3 41.3  -206.300
## 2  31    175   167.5 165.5   83.8 32.9  36.5 35.9  -178.210
## 3  23    100   153.6 152.2   82.9 26.0  36.6 31.0   -71.673
## 4  19    185   190.3 187.4   97.3 37.4  44.1 41.0  -257.720
## 5  23    159   178.0 174.1   93.9 29.5  40.1 36.9  -173.230
## 6  47    170   178.7 177.0   92.4 36.0  43.2 37.4  -185.150
To determine these factors, we start by fitting the following linear model:
comfort <- lm(hipcenter ~ ., data = seatpos)
summary(comfort)
##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -73.827 -22.833  -3.678  25.017  62.337
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213  166.57162   2.620   0.0138 *
## Age           0.77572    0.57033   1.360   0.1843
## Weight        0.02631    0.33097   0.080   0.9372
## HtShoes      -2.69241    9.75304  -0.276   0.7845
## Ht            0.60134   10.12987   0.059   0.9531
## Seated        0.53375    3.76189   0.142   0.8882
## Arm          -1.32807    3.90020  -0.341   0.7359
## Thigh        -1.14312    2.66002  -0.430   0.6706
## Leg          -6.43905    4.71386  -1.366   0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
Solution:
This is because some of the covariates are almost collinear. For example, height with shoes and height without shoes measure nearly the same quantity. Thus the determinant of \(\mathbb{X}^{\top} \mathbb{X}\) is close to zero and the variance of the least squares coefficient vector becomes large.
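This can be checked numerically; the sketch below applies faraway's vif to the model matrix (following the convention in Faraway's text) together with a pairwise correlation:
x <- model.matrix(comfort)[, -1]       # predictor matrix without the intercept
round(cor(x[, c("HtShoes", "Ht")]), 3) # correlation should be very close to 1
vif(x)                                 # large VIFs flag collinear predictors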
Solution:
The standard error of the height coefficient will probably decrease, since dropping one of the two nearly identical height measurements removes the collinearity.
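As an illustration (a sketch; it assumes we drop HtShoes and keep Ht), the refit below should show the shrinkage in the standard error:
comfort2 <- update(comfort, . ~ . - HtShoes)  # remove height with shoes
summary(comfort2)$coefficients["Ht", ]        # std. error should be much smaller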
Suppose we are interested in determining the energy content of municipal waste. This information is useful for the design and operation of municipal waste incinerators. The variables are:
- Energy: energy content (kcal/kg)
- Plastic: % plastic composition by weight
- Paper: % paper composition by weight
- Garbage: % garbage composition by weight
- Water: % moisture by weight
A number of samples of municipal waste were obtained, and a regression predicting Energy from the other variables was computed, with the following output:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2245.09     177.89   12.62  2.4e-12
Plastics       28.92       2.82   10.24  2.0e-10
Paper           7.64       2.31    3.30   0.0029
Garbage         4.30       1.92    2.24   0.0340
Water         -37.36       1.83  -20.37   <2e-16

Residual standard error: 31.5 on 25 degrees of freedom
Multiple R-Squared: 0.964, Adjusted R-squared: 0.958
F-statistic: 168 on 4 and 25 DF, p-value: <2e-16
Solution:
Since the two samples agree on all other covariates and \(x_{plasticA}-x_{plasticB}=10\),
\(y_A - y_B = \hat{\beta}_{plastic}x_{plasticA}+\epsilon_A-\hat{\beta}_{plastic}x_{plasticB}-\epsilon_B=10\hat{\beta}_{plastic}+\epsilon_A-\epsilon_B\)
\(\hat{\mathbb{E}}(y_A - y_B) =10\hat{\beta}_{plastic}=289.2\)
\(\widehat{Var}(y_A - y_B)=\widehat{Var}(10\hat{\beta}_{plastic})+\widehat{Var}(\epsilon_A)+\widehat{Var}(\epsilon_B)=100\widehat{Var}(\hat{\beta}_{plastic}) +2s^2 = 100\times 2.82^2+2\times 31.5^2=2779.74\)
Thus the 95% prediction interval is \((289.2-1.96\sqrt{2779.74},\ 289.2+1.96\sqrt{2779.74})\).
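Evaluating the endpoints numerically (a sketch that simply plugs in the printed summary values):
est <- 10 * 28.92                      # estimated difference, 289.2
se  <- sqrt(100 * 2.82^2 + 2 * 31.5^2) # sqrt(2779.74), about 52.7
est + c(-1, 1) * 1.96 * se             # roughly (185.9, 392.5)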
Solution:
\(x^*=(1,0,0,0,100)\). Thus \(\hat{y}^*=2245.09+100\times(-37.36)=-1490.91\)
This prediction is probably not reliable due to extrapolation. It is unlikely that we have a sample that is purely water in our data set.
Solution:
\((4.30-1.96\times 1.92,\ 4.30+1.96\times 1.92) = (0.54,\ 8.06)\)
Solution:
\(s = \sqrt{\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n-p}}\), where \(\hat{y}_i=x_i^{\top}b\) is the fitted value of \(y_i\) and \(p\) is the number of parameters in the model.
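In R, this quantity can be read off a fitted model directly; for example, for the seatpos fit above:
sqrt(sum(resid(comfort)^2) / df.residual(comfort)) # s computed from the residuals
summary(comfort)$sigma                             # same number (37.72 here)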
Recall the National Education Longitudinal Study of 1988, which examined schoolchildren’s performance on a math test in 8th grade. ‘ses’ is the socioeconomic status of the parents and ‘paredu’ is the parents’ highest level of education achieved (less than high school, high school, college, BA, MA, PhD).
library(MASS)
data(nels88)
ses_edu_lm <- lm(math ~ ses + paredu + paredu:ses, data = nels88)
Solution:
- Plot the residuals against the fitted values; no pattern should be observed.
- Plot a normal QQ-plot of the residuals and check that it roughly follows a straight line.
- Check whether there are obvious outliers.
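These checks can be produced directly from the fit; a minimal sketch:
plot(fitted(ses_edu_lm), resid(ses_edu_lm),
     xlab = "Fitted values", ylab = "Residuals") # look for structure or funneling
abline(h = 0, lty = 2)
qqnorm(resid(ses_edu_lm))                        # normal QQ-plot of the residuals
qqline(resid(ses_edu_lm))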
Solution:
Whether the parents help with their children’s schoolwork at home.
Solution:
Yes, it does. The effect of parents’ education level on children’s math scores may be influenced by parents’ socioeconomic status. Namely, the increase in children’s math score when parents’ ses increases by one unit may differ across parents’ education levels.
anova(ses_edu_lm)
## Analysis of Variance Table
##
## Response: math
##             Df  Sum Sq Mean Sq  F value    Pr(>F)
## ses          1 12391.4 12391.4 173.5021 < 2.2e-16 ***
## paredu       5  1642.4   328.5   4.5994 0.0004959 ***
## ses:paredu   5   370.7    74.1   1.0381 0.3957126
## Residuals  248 17712.0    71.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Solution:
Under \(H_0\), we have the model \(Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}+\dots+\beta_6 x_{i6} + \epsilon_i\), \(i=1,\dots,n\), where \(Y_i\) is the math score for observation \(i\), \(x_{i1}\) is the parents’ socioeconomic status, and \(x_{ij}\), \(j=2,\dots,6\), are the indicators for the parents’ education level. The \(\epsilon_i\) are i.i.d. with mean 0 and variance \(\sigma^2\).
Under \(H_a\), we have \(Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}+\dots+\beta_6 x_{i6} + \beta_7 x_{i1}x_{i2} +\dots+\beta_{11}x_{i1}x_{i6} + \epsilon_i\), \(i=1,\dots,n\).
For the F test, we are testing
\(H_0:\beta_7=\dots=\beta_{11}=0\) against \(H_a\): at least one of \(\beta_j\neq 0\), \(j=7,\dots,11\).
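The same F test can be obtained by comparing the nested models directly; a sketch (since the interaction is the last term in the sequential ANOVA, this reproduces its last row):
ses_edu_add <- lm(math ~ ses + paredu, data = nels88) # model under H0, no interaction
anova(ses_edu_add, ses_edu_lm)                        # F = 1.0381, p = 0.3957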
Solution:
\(\mathbf{\alpha}=(0,0,0)\)
Solution:
\(\mathbf{\alpha}=(1,1,-1)\)
Solution:
The row vectors in a) are linearly independent, while the row vectors in b) are linearly dependent.
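For a concrete check in R (with a hypothetical matrix, since the exercise’s vectors are not reproduced here), one can compare the rank of the matrix to its number of rows:
A <- rbind(c(1, 0, 1), c(0, 1, 1), c(1, 1, 2)) # third row = row 1 + row 2
qr(A)$rank == nrow(A)                          # FALSE: rows are linearly dependent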
Solution:
For an observational study, there may exist unobserved confounding variables that are correlated with both the response and the predictor. Thus the correlation we observe may actually be due to the confounding variables.
In a well-designed experiment, the treatment is randomly assigned, so it is uncorrelated with other covariates and confounding variables. Therefore we may be able to infer causation if the experiment is designed well enough.
For additional exercises, see notes, quizzes, and homeworks.
*Source: Modified from a previous class, with permission.