Sources: HW 11 solutions from Winter ’18
Data on 12 variables for 113 hospitals from the Study on the Efficacy of Nosocomial Infection Control (SENIC) are provided. The primary purpose of this study is to look for properties of hospitals associated with high (or low) rates of hospital-acquired infections, which have the technical name of nosocomial infections. The rate of nosocomial infections is measured by the variable Infection risk
.
We were asked to explore these data and find statistically supported evidence of associations that might explain nosocomial infection risk.
First, I read in the data and change the Med.school
and Region
to factor variables. Below are the summary statistics of the data.
library(GGally)
library(stats)
library(gridExtra)
library(faraway) #for the halfnorm function
# read in the data and remove observation number labelled 'Hospital'
senic <- read.table("https://ionides.github.io/401w18/hw/hw09/senic.txt", header =T)
senic <- senic[,-1]
senic$Region <- as.factor(senic$Region)
senic$Med.school <- as.factor(senic$Med.school)
summary(senic)
## Length.of.stay Age Infection.risk Culture
## Min. : 6.700 Min. :38.80 Min. :1.300 Min. : 1.60
## 1st Qu.: 8.340 1st Qu.:50.90 1st Qu.:3.700 1st Qu.: 8.40
## Median : 9.420 Median :53.20 Median :4.400 Median :14.10
## Mean : 9.648 Mean :53.23 Mean :4.355 Mean :15.79
## 3rd Qu.:10.470 3rd Qu.:56.20 3rd Qu.:5.200 3rd Qu.:20.30
## Max. :19.560 Max. :65.90 Max. :7.800 Max. :60.50
## X.ray Beds Med.school Region Patients
## Min. : 39.60 Min. : 29.0 1:17 1:28 Min. : 20.0
## 1st Qu.: 69.50 1st Qu.:106.0 2:96 2:32 1st Qu.: 68.0
## Median : 82.30 Median :186.0 3:37 Median :143.0
## Mean : 81.63 Mean :252.2 4:16 Mean :191.4
## 3rd Qu.: 94.10 3rd Qu.:312.0 3rd Qu.:252.0
## Max. :133.50 Max. :835.0 Max. :791.0
## Nurses Facilities
## Min. : 14.0 Min. : 5.70
## 1st Qu.: 66.0 1st Qu.:31.40
## Median :132.0 Median :42.90
## Mean :173.2 Mean :43.16
## 3rd Qu.:218.0 3rd Qu.:54.30
## Max. :656.0 Max. :80.00
The summary statistics above suggest that overall the data have a wide range with a fairly even number of hospitals in each region. However, it also suggests that there may be outliers in each of the categories. These may be a particularly large or problematic hospital(s). It is interesting to note that many of the hospitals, exactly 96, do not have medical schools.
To get a better sense of how the variables are related, we graph the each variable against all of the others.
Note: I used ggpairs as a way of combining the pairs plot and the correlation matrix. You are not expected to know this function but it is a good idea to at least use the pairs plot or other initial graphs to explore the data.
# remove categorical variables for the purposes of graphing
senic_num <- senic[,c(-7,-8)]
ggpairs(senic_num)
Based on the graph above, patients, beds, nurses, and facilities are highly correlated. This makes sense as all of these variables are different measures for the size of the hospital.
It also appears that beds, patients, and nurses have log-normal distributions.
Finally, there is additional support that the length of stay and the culture may have a couple of outliers.
Instead of directly examining the size of the hospital, I’m going to examine the level of staffing that a hospital has as a ratio of the number of nurses to the number of patients. One would expect that larger hospitals would have more nurses and more patients, but hospitals that are struggling monetarily would likely have more patients but fewer staff, which could potentially lead to higher rates of infection. Dividing nurses by patients also has the added benefit of putting the amount of care a patient receives in the same units as infection risk, culture, and x-ray.
Additionally, one might expect that the age of the patient would contribute to the infection risk and to the amount of time that the person stays in the hospital. Therefore, in the initial model, I’m also going to include an interaction term between the age and length of stay.
Below I make the ratio and fit an initial model with it, length.of.stay
, Culture
, X.ray
, Age
, an interaction between Age
and length.of.stay
, Region
, and Med.school
.
The inital model takes the form:
\(Y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \beta_4x_{ij} + \beta_5x_{i4} + \beta_6x_{i5} + \beta_7x_{i6}\).
where
senic$nurses_patients <- senic$Nurses/senic$Patients
lm1 <- lm(Infection.risk ~ Length.of.stay + Culture + X.ray + Region + Med.school +
nurses_patients + Age + Age*Length.of.stay, data = senic)
summary(lm1)
##
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + X.ray +
## Region + Med.school + nurses_patients + Age + Age * Length.of.stay,
## data = senic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9922 -0.5983 0.0513 0.5572 2.4798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.561334 5.832364 0.096 0.9235
## Length.of.stay 0.118967 0.596140 0.200 0.8422
## Culture 0.050526 0.011179 4.520 1.67e-05 ***
## X.ray 0.008350 0.005703 1.464 0.1462
## Region2 0.377097 0.272733 1.383 0.1698
## Region3 0.353301 0.275377 1.283 0.2024
## Region4 0.889871 0.359909 2.472 0.0151 *
## Med.school2 -0.034813 0.282836 -0.123 0.9023
## nurses_patients 0.604520 0.335372 1.803 0.0744 .
## Age -0.032681 0.102637 -0.318 0.7508
## Length.of.stay:Age 0.003906 0.010552 0.370 0.7120
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9623 on 102 degrees of freedom
## Multiple R-squared: 0.531, Adjusted R-squared: 0.485
## F-statistic: 11.55 on 10 and 102 DF, p-value: 4.926e-13
The most insignficant of the covariates is the Med.school
based on the p-value in the above summary. I drop this variable and refit the model.
lm2 <- lm(Infection.risk ~ Length.of.stay + Culture + X.ray + Region +
nurses_patients + Age + Age*Length.of.stay, data = senic)
summary(lm2)
##
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + X.ray +
## Region + nurses_patients + Age + Age * Length.of.stay, data = senic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.00068 -0.60128 0.06195 0.56062 2.47729
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.404859 5.664847 0.071 0.9432
## Length.of.stay 0.134698 0.579487 0.232 0.8167
## Culture 0.050670 0.011064 4.580 1.31e-05 ***
## X.ray 0.008310 0.005666 1.466 0.1456
## Region2 0.380339 0.270157 1.408 0.1622
## Region3 0.354331 0.273930 1.294 0.1987
## Region4 0.896108 0.354615 2.527 0.0130 *
## nurses_patients 0.603117 0.333571 1.808 0.0735 .
## Age -0.030833 0.101047 -0.305 0.7609
## Length.of.stay:Age 0.003668 0.010322 0.355 0.7231
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9576 on 103 degrees of freedom
## Multiple R-squared: 0.5309, Adjusted R-squared: 0.49
## F-statistic: 12.95 on 9 and 103 DF, p-value: 1.325e-13
Now, the most insignficant of the covariates is the Length.of.stay
. However, since I have included an interaction term, I must drop the interaction term first and refit the model.
lm3 <- lm(Infection.risk ~ Length.of.stay + Culture + X.ray + Region +
nurses_patients + Age, data = senic)
summary(lm3)
##
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + X.ray +
## Region + nurses_patients + Age, data = senic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02268 -0.60651 0.06226 0.55139 2.50202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.551592 1.325555 -1.171 0.2445
## Length.of.stay 0.339422 0.061615 5.509 2.63e-07 ***
## Culture 0.049797 0.010742 4.636 1.04e-05 ***
## X.ray 0.008690 0.005541 1.568 0.1199
## Region2 0.365065 0.265592 1.375 0.1722
## Region3 0.349252 0.272405 1.282 0.2027
## Region4 0.899558 0.352990 2.548 0.0123 *
## nurses_patients 0.598216 0.331883 1.802 0.0744 .
## Age 0.004178 0.022296 0.187 0.8517
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9536 on 104 degrees of freedom
## Multiple R-squared: 0.5304, Adjusted R-squared: 0.4942
## F-statistic: 14.68 on 8 and 104 DF, p-value: 3.516e-14
Next I drop Age
and refit the model:
lm4 <- lm(Infection.risk ~ Length.of.stay + Culture + X.ray + Region + nurses_patients,
data = senic)
summary(lm4)
##
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + X.ray +
## Region + nurses_patients, data = senic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.03156 -0.59225 0.05949 0.56038 2.52908
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.350249 0.772664 -1.748 0.0835 .
## Length.of.stay 0.342476 0.059148 5.790 7.39e-08 ***
## Culture 0.049158 0.010140 4.848 4.33e-06 ***
## X.ray 0.008679 0.005515 1.574 0.1186
## Region2 0.355828 0.259776 1.370 0.1737
## Region3 0.347736 0.271031 1.283 0.2023
## Region4 0.897483 0.351191 2.556 0.0120 *
## nurses_patients 0.604484 0.328673 1.839 0.0687 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9492 on 105 degrees of freedom
## Multiple R-squared: 0.5302, Adjusted R-squared: 0.4989
## F-statistic: 16.93 on 7 and 105 DF, p-value: 8.29e-15
Finally, I drop X.ray
and refit the model:
lm5 <- lm(Infection.risk ~ Length.of.stay + Culture + Region + nurses_patients, data = senic)
summary(lm5)
##
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + Region +
## nurses_patients, data = senic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.01374 -0.54171 0.08237 0.51229 2.65313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.039244 0.752146 -1.382 0.1700
## Length.of.stay 0.368183 0.057242 6.432 3.71e-09 ***
## Culture 0.053531 0.009819 5.452 3.28e-07 ***
## Region2 0.349103 0.261543 1.335 0.1848
## Region3 0.309069 0.271789 1.137 0.2580
## Region4 0.859171 0.352778 2.435 0.0165 *
## nurses_patients 0.710114 0.323979 2.192 0.0306 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9558 on 106 degrees of freedom
## Multiple R-squared: 0.5191, Adjusted R-squared: 0.4919
## F-statistic: 19.07 on 6 and 106 DF, p-value: 5.817e-15
Finally, I’m going to test whether Region
is actually significant through an ANOVA. Below are my hypotheses:
\[H_0: Y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3}\]
\[H_a: Y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \beta_4x_{ij}\] where
I’ll be testing this hypothesis at the \(\alpha = 0.05\) level.
Below is the F-test for this hypothesis:
lm_null <- lm(Infection.risk ~ Length.of.stay + Culture + nurses_patients, data = senic)
anova(lm_null, lm5)
## Analysis of Variance Table
##
## Model 1: Infection.risk ~ Length.of.stay + Culture + nurses_patients
## Model 2: Infection.risk ~ Length.of.stay + Culture + Region + nurses_patients
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 109 102.312
## 2 106 96.838 3 5.4744 1.9975 0.1188
Because the p-value is bigger than the significance level, I fail to reject the null hypothesis and conclude that I should not keep Region
in the model. Therefore my final model is \(Y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3}\), where
Now that I have a ‘final’ model, I’m going to examine the model to double check that my model assumptions are correct.
plot(senic$Length.of.stay, lm_null$residuals, xlab = "Length of Stay", ylab = "Residuals",
main = "Residuals against Length of Stay")
plot(senic$Culture, lm_null$residuals, xlab = "Culture", ylab = "Residuals",
main = "Residuals against Culture")
plot(senic$nurses_patients, lm_null$residuals, xlab = "Nurses to Patients Ratio", ylab = "Residuals",
main = "Residuals against Care Ratio")
The residual plots above suggest that they have constant variance, randomly scattered about zero, and that there are no transformations that may be needed for any of the variables. They also suggest that there may be a couple of outliers that may be influencing the model. Overall, they do not immediately reveal anything wrong with the model assumptions.
Since an additional assumption is that the residuals are normally distributed, we examine the QQplot to confirm this.
qqnorm(lm_null$residuals)
Based on the QQplot above, the residuals also appear to be normally distributed. So our model assumptions are met.
Finally, we should check for outliers and leverage points. Cook’s distance check’s for both of these things.
cook <- cooks.distance(lm_null)
halfnorm(cook, nlab = 3, ylab = "Cook's Distance")
Based on the Cook’s Distance plot we should examine what happens to the fit of our model when we remove observation 8. Below is the fitted model without observation 8.
senic_8 <- senic[-8,]
lm6 <- lm(Infection.risk ~ Length.of.stay + Culture + nurses_patients, data = senic_8)
summary(lm6)
##
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + nurses_patients,
## data = senic_8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4831 -0.7234 0.1086 0.5046 2.7388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.31210 0.60251 -0.518 0.60552
## Length.of.stay 0.30887 0.05233 5.902 4.19e-08 ***
## Culture 0.05786 0.01066 5.427 3.55e-07 ***
## nurses_patients 0.83227 0.29824 2.791 0.00622 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.958 on 108 degrees of freedom
## Multiple R-squared: 0.5051, Adjusted R-squared: 0.4914
## F-statistic: 36.75 on 3 and 108 DF, p-value: < 2.2e-16
Since the coefficients of the model did not change much, we can keep observation 8 in the analysis but we should examine how it compares with the other observations.
To compare this
Additionally, it would be interesting to plot Infection.Risk
against the fitted values and against the residuals of our final model above to see if there are infact modelling the infection risk well.
plot(lm_null$fitted.values, senic$Infection.risk, xlab = "Fitted Values", ylab = "Infection Risk",
main = "Fitted Values against Infection Risk")
lines(quantile(senic$Infection.risk, c(0.01, 0.99)), quantile(senic$Infection.risk, c(0.01, 0.99)), col = "red")
As we can see from the plot, hospitals with very low infection rates are not predicted well with the model; the fitted values are consistently larger than the actual value. Similarly, hospitals with very high infection rates have consistently smaller fitted values.
We are interested in determining what hospital properties are associated with high (or low) risks of infection. Our final model suggests that the length of stay, the number of cultures, and ratio of the number of nurses to patients determine the infection risk of a hospital. From the values of the coefficients in our final model, hospitals with shorter lengths of stay, low numbers of cultures, and a small ratio of the number of nurses to patients (in other words there is more 1 on 1 attention between nurses and patients) would have a lower percentage of hospital-acquired infections.