Data on 12 variables for 113 hospitals from the Study on the Efficacy of Nosocomial Infection Control (SENIC) are provided. The primary purpose of this study is to look for properties of hospitals associated with high (or low) rates of hospital-acquired infections, which have the technical name of nosocomial infections. The rate of nosocomial infections is measured by the variable Infection risk
.
We were asked to explore these data and find statistically supported evidence of associations that might explain nosocomial infection risk.
First, I read in the data and change the Med.school
and Region
to factor variables. Below are the summary statistics of the data.
library(GGally)
library(stats)
library(gridExtra)
library(faraway)
# read in the data and remove observation number labelled 'Hospital'
senic <- read.table("https://ionides.github.io/401w18/hw/hw09/senic.txt", header =T)
senic <- senic[,-1]
senic$Region <- as.factor(senic$Region)
senic$Med.school <- as.factor(senic$Med.school)
summary(senic)
## Length.of.stay Age Infection.risk Culture
## Min. : 6.700 Min. :38.80 Min. :1.300 Min. : 1.60
## 1st Qu.: 8.340 1st Qu.:50.90 1st Qu.:3.700 1st Qu.: 8.40
## Median : 9.420 Median :53.20 Median :4.400 Median :14.10
## Mean : 9.648 Mean :53.23 Mean :4.355 Mean :15.79
## 3rd Qu.:10.470 3rd Qu.:56.20 3rd Qu.:5.200 3rd Qu.:20.30
## Max. :19.560 Max. :65.90 Max. :7.800 Max. :60.50
## X.ray Beds Med.school Region Patients
## Min. : 39.60 Min. : 29.0 1:17 1:28 Min. : 20.0
## 1st Qu.: 69.50 1st Qu.:106.0 2:96 2:32 1st Qu.: 68.0
## Median : 82.30 Median :186.0 3:37 Median :143.0
## Mean : 81.63 Mean :252.2 4:16 Mean :191.4
## 3rd Qu.: 94.10 3rd Qu.:312.0 3rd Qu.:252.0
## Max. :133.50 Max. :835.0 Max. :791.0
## Nurses Facilities
## Min. : 14.0 Min. : 5.70
## 1st Qu.: 66.0 1st Qu.:31.40
## Median :132.0 Median :42.90
## Mean :173.2 Mean :43.16
## 3rd Qu.:218.0 3rd Qu.:54.30
## Max. :656.0 Max. :80.00
The summary statistics above suggest that overall the data have a wide range with a fairly even number of schools in each region. However, it also suggests that there may be outliers in each of the categories. These may be a particularly large or problematic hospital(s). It is interesting to note that many of the hospitals do not have medical schools. Additionally, we graph the each variable against all of the others to get a general understanding of the data. Note: You were not required to know this. A couple of exploratory graphs is sufficient.
# remove categorical variables for the purposes of graphing
senic_num <- senic[,c(-7,-8)]
ggpairs(senic_num)
Based on the graph above, patients, beds, and nurses are highly correlated. To avoid collinearity in the final model, we will only use Nurses
as it has the highest correlation with Infection.Risk
. We also see that Nurses
seems to have a non-linear relationship with infection risk; a transformation of this variable may be needed. All of the other variables we will be using appear to have a linear relationship with Infection.Risk
. Below, we have plotted Nurses
against Infection.Risk
to have a better look.
plot(senic$Nurses, senic$Infection.risk, xlab="Nurses", ylab="infection risk", main="Plot of Infection Risk vs Nurses")
Based on this graph, Nurses
appears to have an inverse-log relationship with Infection.Risk
. To account for this, we take the log of Nurses
and re-plot to check that log_nurses
now as a linear relationship with Infection.Risk
.
senic$log_nurses <- log(senic$Nurses)
plot(senic$log_nurses, senic$Infection.risk)
log_nurses
has a linear relationship with Infection.Risk
. We can now begin fitting initial models.
The first model we fit will take the following form:
\[ Y_{ijk} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \beta_4x_{i4} + \beta_5x_{ij} + \beta_6x_{ik} + \beta_7x_{i5} + \beta_8x_{i6} \]
where
#remove patients, beds, and nurses from data for the purposes of modelling
senic_nurses <- senic[,c(-6,-9, -10)]
lm1 <- lm(Infection.risk ~ ., data = senic_nurses)
summary(lm1)
##
## Call:
## lm(formula = Infection.risk ~ ., data = senic_nurses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64336 -0.52500 -0.05761 0.51460 2.00852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.945333 1.395755 -3.543 0.000598 ***
## Length.of.stay 0.211976 0.058715 3.610 0.000476 ***
## Age 0.026800 0.020668 1.297 0.197664
## Culture 0.050718 0.009618 5.273 7.54e-07 ***
## X.ray 0.010521 0.004905 2.145 0.034316 *
## Med.school2 0.537824 0.277267 1.940 0.055174 .
## Region2 0.309516 0.240765 1.286 0.201512
## Region3 0.199542 0.245858 0.812 0.418901
## Region4 0.938796 0.314329 2.987 0.003532 **
## Facilities -0.008068 0.010725 -0.752 0.453651
## log_nurses 0.780184 0.198547 3.929 0.000155 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8567 on 102 degrees of freedom
## Multiple R-squared: 0.6282, Adjusted R-squared: 0.5918
## F-statistic: 17.24 on 10 and 102 DF, p-value: < 2.2e-16
As we can see from the initial model, some of the variables are not significant or weakly significant.
plot(lm1$fitted.values,lm1$residuals)
Examining the residuals (above) against the fitted values of the model does not immediately reveal anything wrong with the model assumptions. They appear to be randomly scattered about zero and have a constant variance. Since an additional assumption is that the residuals are normally distributed, we examine the QQplot to confirm this.
qqnorm(lm1$residuals)
The QQplot agrees that the residuals are approximately normally distributed, perhaps with some heavy tails.
Since we are interested in obtaining the simplest model for determining the risk of infection, we iteratively remove variables with the highest p-value from the model; this is known as the backwards selection method. Since Region4
is significant, we will hold off on testing whether Region
is significant as a whole until all the other variables are significant.
We start by removing facilities and re-examining the model.
lm2 <- lm(Infection.risk ~ . -Facilities , data = senic_nurses)
summary(lm2)
##
## Call:
## lm(formula = Infection.risk ~ . - Facilities, data = senic_nurses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7093 -0.5501 -0.0667 0.4474 1.9716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.697896 1.353578 -3.471 0.000760 ***
## Length.of.stay 0.215677 0.058385 3.694 0.000355 ***
## Age 0.024207 0.020335 1.190 0.236638
## Culture 0.051175 0.009579 5.343 5.51e-07 ***
## X.ray 0.010716 0.004888 2.192 0.030599 *
## Med.school2 0.586540 0.269029 2.180 0.031519 *
## Region2 0.317667 0.240014 1.324 0.188587
## Region3 0.232042 0.241521 0.961 0.338927
## Region4 0.970907 0.310760 3.124 0.002315 **
## log_nurses 0.661260 0.119854 5.517 2.58e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8549 on 103 degrees of freedom
## Multiple R-squared: 0.6262, Adjusted R-squared: 0.5935
## F-statistic: 19.17 on 9 and 103 DF, p-value: < 2.2e-16
plot(lm2$fitted.values,lm2$residuals)
qqnorm(lm2$residuals)
Notice that upon removing facilities, the residuals did not change much and the QQplot still suggests slightly heavy tails. Since Age
is still not significant, we will remove it from the model.
lm3 <- lm(Infection.risk ~ . -Age -Facilities, data = senic_nurses)
summary(lm3)
##
## Call:
## lm(formula = Infection.risk ~ . - Age - Facilities, data = senic_nurses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64854 -0.56815 -0.06735 0.48486 2.09355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.474152 0.882280 -3.938 0.000149 ***
## Length.of.stay 0.237005 0.055679 4.257 4.56e-05 ***
## Culture 0.048120 0.009247 5.204 9.86e-07 ***
## X.ray 0.010688 0.004897 2.182 0.031330 *
## Med.school2 0.613461 0.268613 2.284 0.024415 *
## Region2 0.271780 0.237372 1.145 0.254857
## Region3 0.222193 0.241863 0.919 0.360391
## Region4 0.977740 0.311329 3.141 0.002196 **
## log_nurses 0.640985 0.118875 5.392 4.38e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8566 on 104 degrees of freedom
## Multiple R-squared: 0.621, Adjusted R-squared: 0.5919
## F-statistic: 21.3 on 8 and 104 DF, p-value: < 2.2e-16
plot(lm3$fitted.values,lm3$residuals)
qqnorm(lm3$residuals)
Residuals appear to be about the same as the previous model, but now the QQplot of the residuals does not suggest heavy tails which suggests that the errors are normally distributed.
Next, we check if we can exclude Med school from our model. In order to do this, we perform an F-test.
lm3 <- lm(Infection.risk ~ . -Age -Facilities, data = senic_nurses)
lm4 <- lm(Infection.risk ~ . -Age -Facilities -Med.school, data = senic_nurses)
summary(lm3)
##
## Call:
## lm(formula = Infection.risk ~ . - Age - Facilities, data = senic_nurses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.64854 -0.56815 -0.06735 0.48486 2.09355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.474152 0.882280 -3.938 0.000149 ***
## Length.of.stay 0.237005 0.055679 4.257 4.56e-05 ***
## Culture 0.048120 0.009247 5.204 9.86e-07 ***
## X.ray 0.010688 0.004897 2.182 0.031330 *
## Med.school2 0.613461 0.268613 2.284 0.024415 *
## Region2 0.271780 0.237372 1.145 0.254857
## Region3 0.222193 0.241863 0.919 0.360391
## Region4 0.977740 0.311329 3.141 0.002196 **
## log_nurses 0.640985 0.118875 5.392 4.38e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8566 on 104 degrees of freedom
## Multiple R-squared: 0.621, Adjusted R-squared: 0.5919
## F-statistic: 21.3 on 8 and 104 DF, p-value: < 2.2e-16
anova(lm4,lm3)
## Analysis of Variance Table
##
## Model 1: Infection.risk ~ (Length.of.stay + Age + Culture + X.ray + Med.school +
## Region + Facilities + log_nurses) - Age - Facilities - Med.school
## Model 2: Infection.risk ~ (Length.of.stay + Age + Culture + X.ray + Med.school +
## Region + Facilities + log_nurses) - Age - Facilities
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 105 80.144
## 2 104 76.317 1 3.8274 5.2158 0.02442 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see from both the summary as well as the anova table, Med school is clearly significant and hence we can not exclude it from our model.
Finally, we still need to test for whether Region
is significant in the model. Since Region
is a factor variable, we need to perform an ANOVA at the 5% level with the following null and alternative hypotheses:
$H_0: Y_{ijk} = _0 + 1x{i1} + 2x{i2} + 3x{i3} + 4x{ij} + 5x{ik} $
\(H_a: Y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \beta_4x_{ij} + \beta_5x_{i4} + \beta_6x_{i4}\)
where
lm4 <- lm(Infection.risk ~ . -Age -Facilities -Region, data = senic_nurses)
anova(lm4, lm3)
## Analysis of Variance Table
##
## Model 1: Infection.risk ~ (Length.of.stay + Age + Culture + X.ray + Med.school +
## Region + Facilities + log_nurses) - Age - Facilities - Region
## Model 2: Infection.risk ~ (Length.of.stay + Age + Culture + X.ray + Med.school +
## Region + Facilities + log_nurses) - Age - Facilities
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 107 84.382
## 2 104 76.317 3 8.0656 3.6638 0.01477 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA, wwe can reject the null hypothesis and conclude that Region
is statistically significant for determining the infection rate at the 5% level.
Therefore the final model is \(Y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \beta_4x_{ij} + \beta_5x_{i4}\)
where
Now that we have our final model, we should check for outliers and leverage points. Cook’s distance check’s for both of these things.
cook <- cooks.distance(lm3)
halfnorm(cook, nlab = 3, ylab = "Cook's Distance")
Based on the Cook’s Distance plot we should examine what happens to the fit of our model when we remove observation 8. Below is the fitted model without observation 8.
senic_nurses_8 <- senic_nurses[-8,]
lm4 <- lm(Infection.risk ~ . -Age -Facilities, data = senic_nurses_8)
summary(lm4)
##
## Call:
## lm(formula = Infection.risk ~ . - Age - Facilities, data = senic_nurses_8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.67699 -0.55303 -0.03873 0.51799 2.11012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.424266 0.865314 -3.957 0.00014 ***
## Length.of.stay 0.238428 0.054594 4.367 3.00e-05 ***
## Culture 0.058346 0.010117 5.767 8.51e-08 ***
## X.ray 0.008859 0.004868 1.820 0.07170 .
## Med.school2 0.537932 0.265443 2.027 0.04529 *
## Region2 0.378785 0.237428 1.595 0.11369
## Region3 0.284738 0.238721 1.193 0.23570
## Region4 1.043403 0.306603 3.403 0.00095 ***
## log_nurses 0.630217 0.116648 5.403 4.25e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8399 on 103 degrees of freedom
## Multiple R-squared: 0.6372, Adjusted R-squared: 0.609
## F-statistic: 22.61 on 8 and 103 DF, p-value: < 2.2e-16
Additionally, we re-examine the residuals.
plot(lm4$fitted.values, lm4$residuals)
qqnorm(lm4$residuals)
Since the coefficients of the model did not change much and only X.ray
changed from being statistically significant at the 5% level to statistically significant at the 10% level, we can keep observation 8 in the analysis but we should examine how it compares with the other observations.
Additionally, it would be interesting to plot Infection.Risk
against the fitted values and against the residuals of our final model above to see if there are infact modelling the infection risk well.
plot(lm3$fitted.values, senic$Infection.risk, main = "Fitted Values against Infection Risk")
lines(quantile(senic$Infection.risk, c(0.01, 0.99)), quantile(senic$Infection.risk, c(0.01, 0.99)), col = "red")
plot(senic$Infection.risk, lm3$residuals)
As we can see from the plots above, hospitals with very low infection rates are not predicted well with the model; the fitted values are consistently larger than the actual value. Similarly, hospitals with very high infection rates have smaller fitted values.
We are interested in determining what hospital properties are associated with high (or low) risks of infection. Our final model suggests that the length of stay, the number of cultures, the number of x-rays, whether or not the hospital is associated with a medical school, the region the hospital is in, and the logarithm of the number of nurses helps determine the infection risk of a hospital. From the values of the coefficients in our final model and assuming that a hospital cannot change its region or whether it is associated with a medical school, hospitals with low numbers of cultures, x-rays, nurses, and a shorter length of stay would see a reduction in the percentage of hospital-acquire infections. Of course, this should be taken with some caution as the number of nurses for instance is driven primarily by the size of the hospital and reducing the number of employees could result in negative consequences.