Sources: HW 11 solutions from Winter ’18

Exploratory Analysis

Data on 12 variables for 113 hospitals from the Study on the Efficacy of Nosocomial Infection Control (SENIC) are provided. The primary purpose of this study is to look for properties of hospitals associated with high (or low) rates of hospital-acquired infections, which have the technical name of nosocomial infections. The rate of nosocomial infections is measured by the variable Infection risk.

We were asked to explore these data and find statistically supported evidence of associations that might explain nosocomial infection risk.

First, I read in the data and change the Med.school and Region to factor variables. Below are the summary statistics of the data.

library(GGally)
library(stats)
library(gridExtra)
library(faraway) #for the halfnorm function

# read in the data and remove observation number labelled 'Hospital'

senic <- read.table("https://ionides.github.io/401w18/hw/hw09/senic.txt", header =T)
senic <- senic[,-1]
senic$Region <- as.factor(senic$Region)
senic$Med.school <- as.factor(senic$Med.school)
summary(senic)
##  Length.of.stay        Age        Infection.risk     Culture     
##  Min.   : 6.700   Min.   :38.80   Min.   :1.300   Min.   : 1.60  
##  1st Qu.: 8.340   1st Qu.:50.90   1st Qu.:3.700   1st Qu.: 8.40  
##  Median : 9.420   Median :53.20   Median :4.400   Median :14.10  
##  Mean   : 9.648   Mean   :53.23   Mean   :4.355   Mean   :15.79  
##  3rd Qu.:10.470   3rd Qu.:56.20   3rd Qu.:5.200   3rd Qu.:20.30  
##  Max.   :19.560   Max.   :65.90   Max.   :7.800   Max.   :60.50  
##      X.ray             Beds       Med.school Region    Patients    
##  Min.   : 39.60   Min.   : 29.0   1:17       1:28   Min.   : 20.0  
##  1st Qu.: 69.50   1st Qu.:106.0   2:96       2:32   1st Qu.: 68.0  
##  Median : 82.30   Median :186.0              3:37   Median :143.0  
##  Mean   : 81.63   Mean   :252.2              4:16   Mean   :191.4  
##  3rd Qu.: 94.10   3rd Qu.:312.0                     3rd Qu.:252.0  
##  Max.   :133.50   Max.   :835.0                     Max.   :791.0  
##      Nurses        Facilities   
##  Min.   : 14.0   Min.   : 5.70  
##  1st Qu.: 66.0   1st Qu.:31.40  
##  Median :132.0   Median :42.90  
##  Mean   :173.2   Mean   :43.16  
##  3rd Qu.:218.0   3rd Qu.:54.30  
##  Max.   :656.0   Max.   :80.00

The summary statistics above suggest that overall the data have a wide range with a fairly even number of hospitals in each region. However, it also suggests that there may be outliers in each of the categories. These may be a particularly large or problematic hospital(s). It is interesting to note that many of the hospitals, exactly 96, do not have medical schools.

To get a better sense of how the variables are related, we graph the each variable against all of the others.

Note: I used ggpairs as a way of combining the pairs plot and the correlation matrix. You are not expected to know this function but it is a good idea to at least use the pairs plot or other initial graphs to explore the data.

# remove categorical variables for the purposes of graphing
senic_num <- senic[,c(-7,-8)]
ggpairs(senic_num)

Based on the graph above, patients, beds, nurses, and facilities are highly correlated. This makes sense as all of these variables are different measures for the size of the hospital.

It also appears that beds, patients, and nurses have log-normal distributions.

Finally, there is additional support that the length of stay and the culture may have a couple of outliers.

Fitting an Initial Model

Instead of directly examining the size of the hospital, I’m going to examine the level of staffing that a hospital has as a ratio of the number of nurses to the number of patients. One would expect that larger hospitals would have more nurses and more patients, but hospitals that are struggling monetarily would likely have more patients but fewer staff, which could potentially lead to higher rates of infection. Dividing nurses by patients also has the added benefit of putting the amount of care a patient receives in the same units as infection risk, culture, and x-ray.

Additionally, one might expect that the age of the patient would contribute to the infection risk and to the amount of time that the person stays in the hospital. Therefore, in the initial model, I’m also going to include an interaction term between the age and length of stay.

Below I make the ratio and fit an initial model with it, length.of.stay, Culture, X.ray, Age, an interaction between Age and length.of.stay, Region, and Med.school.

The inital model takes the form:

\(Y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \beta_4x_{ij} + \beta_5x_{i4} + \beta_6x_{i5} + \beta_7x_{i6}\).

where

senic$nurses_patients <- senic$Nurses/senic$Patients

lm1 <- lm(Infection.risk ~ Length.of.stay + Culture + X.ray + Region + Med.school + 
            nurses_patients + Age + Age*Length.of.stay, data = senic)
summary(lm1)
## 
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + X.ray + 
##     Region + Med.school + nurses_patients + Age + Age * Length.of.stay, 
##     data = senic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9922 -0.5983  0.0513  0.5572  2.4798 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.561334   5.832364   0.096   0.9235    
## Length.of.stay      0.118967   0.596140   0.200   0.8422    
## Culture             0.050526   0.011179   4.520 1.67e-05 ***
## X.ray               0.008350   0.005703   1.464   0.1462    
## Region2             0.377097   0.272733   1.383   0.1698    
## Region3             0.353301   0.275377   1.283   0.2024    
## Region4             0.889871   0.359909   2.472   0.0151 *  
## Med.school2        -0.034813   0.282836  -0.123   0.9023    
## nurses_patients     0.604520   0.335372   1.803   0.0744 .  
## Age                -0.032681   0.102637  -0.318   0.7508    
## Length.of.stay:Age  0.003906   0.010552   0.370   0.7120    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9623 on 102 degrees of freedom
## Multiple R-squared:  0.531,  Adjusted R-squared:  0.485 
## F-statistic: 11.55 on 10 and 102 DF,  p-value: 4.926e-13

Iteratively finding a final model

The most insignficant of the covariates is the Med.school based on the p-value in the above summary. I drop this variable and refit the model.

lm2 <- lm(Infection.risk ~ Length.of.stay + Culture + X.ray + Region + 
            nurses_patients + Age + Age*Length.of.stay, data = senic)
summary(lm2)
## 
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + X.ray + 
##     Region + nurses_patients + Age + Age * Length.of.stay, data = senic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.00068 -0.60128  0.06195  0.56062  2.47729 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.404859   5.664847   0.071   0.9432    
## Length.of.stay      0.134698   0.579487   0.232   0.8167    
## Culture             0.050670   0.011064   4.580 1.31e-05 ***
## X.ray               0.008310   0.005666   1.466   0.1456    
## Region2             0.380339   0.270157   1.408   0.1622    
## Region3             0.354331   0.273930   1.294   0.1987    
## Region4             0.896108   0.354615   2.527   0.0130 *  
## nurses_patients     0.603117   0.333571   1.808   0.0735 .  
## Age                -0.030833   0.101047  -0.305   0.7609    
## Length.of.stay:Age  0.003668   0.010322   0.355   0.7231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9576 on 103 degrees of freedom
## Multiple R-squared:  0.5309, Adjusted R-squared:   0.49 
## F-statistic: 12.95 on 9 and 103 DF,  p-value: 1.325e-13

Now, the most insignficant of the covariates is the Length.of.stay. However, since I have included an interaction term, I must drop the interaction term first and refit the model.

lm3 <- lm(Infection.risk ~ Length.of.stay + Culture + X.ray + Region + 
            nurses_patients + Age, data = senic)
summary(lm3)
## 
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + X.ray + 
##     Region + nurses_patients + Age, data = senic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02268 -0.60651  0.06226  0.55139  2.50202 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.551592   1.325555  -1.171   0.2445    
## Length.of.stay   0.339422   0.061615   5.509 2.63e-07 ***
## Culture          0.049797   0.010742   4.636 1.04e-05 ***
## X.ray            0.008690   0.005541   1.568   0.1199    
## Region2          0.365065   0.265592   1.375   0.1722    
## Region3          0.349252   0.272405   1.282   0.2027    
## Region4          0.899558   0.352990   2.548   0.0123 *  
## nurses_patients  0.598216   0.331883   1.802   0.0744 .  
## Age              0.004178   0.022296   0.187   0.8517    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9536 on 104 degrees of freedom
## Multiple R-squared:  0.5304, Adjusted R-squared:  0.4942 
## F-statistic: 14.68 on 8 and 104 DF,  p-value: 3.516e-14

Next I drop Age and refit the model:

lm4 <- lm(Infection.risk ~ Length.of.stay + Culture + X.ray + Region + nurses_patients,
          data = senic)
summary(lm4)
## 
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + X.ray + 
##     Region + nurses_patients, data = senic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.03156 -0.59225  0.05949  0.56038  2.52908 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.350249   0.772664  -1.748   0.0835 .  
## Length.of.stay   0.342476   0.059148   5.790 7.39e-08 ***
## Culture          0.049158   0.010140   4.848 4.33e-06 ***
## X.ray            0.008679   0.005515   1.574   0.1186    
## Region2          0.355828   0.259776   1.370   0.1737    
## Region3          0.347736   0.271031   1.283   0.2023    
## Region4          0.897483   0.351191   2.556   0.0120 *  
## nurses_patients  0.604484   0.328673   1.839   0.0687 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9492 on 105 degrees of freedom
## Multiple R-squared:  0.5302, Adjusted R-squared:  0.4989 
## F-statistic: 16.93 on 7 and 105 DF,  p-value: 8.29e-15

Finally, I drop X.ray and refit the model:

lm5 <- lm(Infection.risk ~ Length.of.stay + Culture + Region + nurses_patients, data = senic)
summary(lm5)
## 
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + Region + 
##     nurses_patients, data = senic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01374 -0.54171  0.08237  0.51229  2.65313 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.039244   0.752146  -1.382   0.1700    
## Length.of.stay   0.368183   0.057242   6.432 3.71e-09 ***
## Culture          0.053531   0.009819   5.452 3.28e-07 ***
## Region2          0.349103   0.261543   1.335   0.1848    
## Region3          0.309069   0.271789   1.137   0.2580    
## Region4          0.859171   0.352778   2.435   0.0165 *  
## nurses_patients  0.710114   0.323979   2.192   0.0306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9558 on 106 degrees of freedom
## Multiple R-squared:  0.5191, Adjusted R-squared:  0.4919 
## F-statistic: 19.07 on 6 and 106 DF,  p-value: 5.817e-15

Finally, I’m going to test whether Region is actually significant through an ANOVA. Below are my hypotheses:

\[H_0: Y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3}\]

\[H_a: Y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \beta_4x_{ij}\] where

I’ll be testing this hypothesis at the \(\alpha = 0.05\) level.

Below is the F-test for this hypothesis:

lm_null <- lm(Infection.risk ~ Length.of.stay + Culture + nurses_patients, data = senic)
anova(lm_null, lm5)
## Analysis of Variance Table
## 
## Model 1: Infection.risk ~ Length.of.stay + Culture + nurses_patients
## Model 2: Infection.risk ~ Length.of.stay + Culture + Region + nurses_patients
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1    109 102.312                           
## 2    106  96.838  3    5.4744 1.9975 0.1188

Because the p-value is bigger than the significance level, I fail to reject the null hypothesis and conclude that I should not keep Region in the model. Therefore my final model is \(Y_{ij} = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3}\), where

Model diagnostics

Now that I have a ‘final’ model, I’m going to examine the model to double check that my model assumptions are correct.

Looking at the residuals

plot(senic$Length.of.stay, lm_null$residuals, xlab = "Length of Stay", ylab = "Residuals",
     main = "Residuals against Length of Stay")

plot(senic$Culture, lm_null$residuals, xlab = "Culture", ylab = "Residuals",
     main = "Residuals against Culture")

plot(senic$nurses_patients, lm_null$residuals, xlab = "Nurses to Patients Ratio", ylab = "Residuals",
     main = "Residuals against Care Ratio")

The residual plots above suggest that they have constant variance, randomly scattered about zero, and that there are no transformations that may be needed for any of the variables. They also suggest that there may be a couple of outliers that may be influencing the model. Overall, they do not immediately reveal anything wrong with the model assumptions.

Since an additional assumption is that the residuals are normally distributed, we examine the QQplot to confirm this.

qqnorm(lm_null$residuals)

Based on the QQplot above, the residuals also appear to be normally distributed. So our model assumptions are met.

Finally, we should check for outliers and leverage points. Cook’s distance check’s for both of these things.

cook <- cooks.distance(lm_null)
halfnorm(cook, nlab = 3, ylab = "Cook's Distance")

Based on the Cook’s Distance plot we should examine what happens to the fit of our model when we remove observation 8. Below is the fitted model without observation 8.

senic_8 <- senic[-8,]
lm6 <- lm(Infection.risk ~ Length.of.stay + Culture + nurses_patients, data = senic_8)
summary(lm6)
## 
## Call:
## lm(formula = Infection.risk ~ Length.of.stay + Culture + nurses_patients, 
##     data = senic_8)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4831 -0.7234  0.1086  0.5046  2.7388 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.31210    0.60251  -0.518  0.60552    
## Length.of.stay   0.30887    0.05233   5.902 4.19e-08 ***
## Culture          0.05786    0.01066   5.427 3.55e-07 ***
## nurses_patients  0.83227    0.29824   2.791  0.00622 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.958 on 108 degrees of freedom
## Multiple R-squared:  0.5051, Adjusted R-squared:  0.4914 
## F-statistic: 36.75 on 3 and 108 DF,  p-value: < 2.2e-16

Since the coefficients of the model did not change much, we can keep observation 8 in the analysis but we should examine how it compares with the other observations.

To compare this

Additionally, it would be interesting to plot Infection.Risk against the fitted values and against the residuals of our final model above to see if there are infact modelling the infection risk well.

plot(lm_null$fitted.values, senic$Infection.risk, xlab = "Fitted Values", ylab = "Infection Risk",
     main = "Fitted Values against Infection Risk")
lines(quantile(senic$Infection.risk, c(0.01, 0.99)), quantile(senic$Infection.risk, c(0.01, 0.99)), col = "red")

As we can see from the plot, hospitals with very low infection rates are not predicted well with the model; the fitted values are consistently larger than the actual value. Similarly, hospitals with very high infection rates have consistently smaller fitted values.

Conclusion

We are interested in determining what hospital properties are associated with high (or low) risks of infection. Our final model suggests that the length of stay, the number of cultures, and ratio of the number of nurses to patients determine the infection risk of a hospital. From the values of the coefficients in our final model, hospitals with shorter lengths of stay, low numbers of cultures, and a small ratio of the number of nurses to patients (in other words there is more 1 on 1 attention between nurses and patients) would have a lower percentage of hospital-acquired infections.