Introduction

In 2020, the disease COVID-19 caused a worldwide pandemic. In March of that year, stores, restaurants, universities, and more were shut down as stay-at-home orders were issued throughout the country and the world. People were instructed to wash their hands, wear masks, and limit their interactions with others in an effort to “flatten the curve” and limit the amount of death. The world changed, and it would take longer than most expected to return to any sense of normality. Sadly, on February 22, 2021, almost a year after the tragic virus became headline news in the States, the number of US deaths due to COVID-19 reached half a million. Despite all this, there are still debates about whether the nation over-reacted and whether the virus really is as big of a deal as we have made it out to be.

The question of this analysis does not center on whether the curve was flattened, but rather on whether there was a persistent relationship between the number of people who tested positive and the number of deaths. This project focuses on finding the lag, in days, between spikes in the number of new cases of COVID-19 (categorized by date of symptom onset or positive test, whichever came first) and spikes in the number of COVID-19 related deaths in the state of Utah, and on examining the strength of the relationship between these two variables. The data analyzed here cover cases in the state of Utah from March 6, 2020 to February 23, 2021. The state’s coronavirus website is updated daily but is most accessible through download, and February 23 is the date of my last download. I am not sure how this data was gathered or how COVID-19 related deaths are classified in the state of Utah, but as the data comes from an official government website, I am trusting that it is accurate and clean.

Exploratory Data Analysis

Below is the raw data for the number of new cases (graphed in black) and deaths (graphed in red) each day. There appears to be a lot of noise in the data, so some smoothing is necessary. Note that the deaths data looks like it has more variance, but its scale is smaller, so a difference of one or two is much more noticeable than in the new case data. With the change in scale in mind, this graph shows that for roughly every 200 people diagnosed with COVID-19, one will die from it. The two series appear to have similar trends, and the line representing the number of deaths each day seems to run slightly (possibly two weeks) behind the line representing the number of new cases. Intuitively this makes sense, as a portion of each day’s new cases will statistically result in death, and death from COVID-19 cannot occur unless people actually have the disease.
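Below is a minimal sketch of how such a two-scale plot can be made with ggplot2, assuming a data frame utah with columns date, new_cases, and deaths (these names are placeholders, not the actual variable names in my code):

library(ggplot2)

scale_ratio <- 200  # approximate cases-per-death ratio noted above

ggplot(utah, aes(x = date)) +
  geom_line(aes(y = new_cases), colour = "black") +
  geom_line(aes(y = deaths * scale_ratio), colour = "red") +
  scale_y_continuous(
    name = "New cases per day",
    sec.axis = sec_axis(~ . / scale_ratio, name = "Deaths per day")
  )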

The data on the website also provided a seven-day average, which accounts for the “seasonality” caused by weekends. At first it might not seem like weekends affect when people get sick or die, but doctors’ schedules do affect testing, and some of the data depends on positive tests. The two graphs below show the raw data (in black) and the seven-day averages (in red) for the number of confirmed cases (on top) and the number of deaths (on bottom).
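For reference, a centered seven-day average can be computed with stats::filter. This is only a sketch of the idea, since the averaged columns came directly from the website (whose average may be trailing rather than centered):

# centered 7-day moving average; leaves NA at the first and last 3 days
avg7 <- function(x) as.numeric(stats::filter(x, rep(1/7, 7), sides = 2))

utah$avg_cases  <- avg7(utah$new_cases)
utah$avg_deaths <- avg7(utah$deaths)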

To better understand the “seasonal” differences, or the impact of different days of the week on the data, I used a spectral analysis to look at the smoothed periodograms for both new cases and new deaths. The dashed red lines represent the seven-day averaged data. In both graphs the x axis represents the number of cycles per week picked out in the data, and the y axis gives the spectrum on a log scale. For the new cases data, the black spikes indicate that there is some periodicity in the data; the red lines, for the averaged data, remove it. For the deaths data there do not appear to be any spikes, meaning there is likely no periodicity. This makes sense, as people getting tested depends on doctors and nurses working, while sadly people do not need a doctor or nurse nearby to die.
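A sketch of how these smoothed periodograms can be produced with R’s spectrum function; setting frequency = 7 puts the axis in cycles per week, and the spans values are illustrative smoothing choices, not necessarily the ones I used:

cases_ts  <- ts(utah$new_cases, frequency = 7)
deaths_ts <- ts(utah$deaths,    frequency = 7)

# weekly effects would appear as spikes at integer cycles per week
spectrum(cases_ts,  spans = c(3, 5, 3), main = "New cases")
spectrum(deaths_ts, spans = c(3, 5, 3), main = "Deaths")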

To further examine the fit of the seven-day average data, I created frequency response plots, which tell us how much the smoother contracts or inflates the sine and cosine components at each frequency. Both smoothers act as low-pass filters, meaning they do not just smooth the high-frequency noise but also remove some of the low-frequency variation. In my data, low and high points could carry equal amounts of noise and meaning, especially the farther we get from zero new cases or deaths a day, so I will not worry about the low-pass behavior and will move forward with the averaged data. Even though the deaths data did not seem seasonal, I want to keep the same data type (raw or averaged) for both of my variables for consistency.
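One way to estimate the frequency response of the seven-day smoother is to take the ratio of the smoothed series’ estimated spectrum to the raw series’ spectrum; this is a sketch under that assumption, reusing cases_ts and the averaged column from above:

both <- ts.union(raw = cases_ts, smoothed = ts(utah$avg_cases, frequency = 7))
spec <- spectrum(na.omit(both), spans = c(3, 5, 3), plot = FALSE)
plot(spec$freq, spec$spec[, 2] / spec$spec[, 1], type = "l", log = "y",
     xlab = "Cycles per week", ylab = "Frequency ratio")
# ratios near 1 pass through the smoother; ratios near 0 are filtered out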

Determining the Lag

I will determine the lag using a sample cross-correlation function (CCF), as shown below. The most extreme cross-correlations appear at lags of about 7 to 11 days.
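The CCF itself comes from R’s ccf function; a sketch, trimming the NA values the centered average leaves at the ends:

ok <- complete.cases(utah$avg_cases, utah$avg_deaths)
ccf(utah$avg_cases[ok], utah$avg_deaths[ok], lag.max = 30,
    main = "Averaged new cases vs. averaged deaths")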

From the CCF plot alone it is difficult to see where the exact lags are, but plotting the lagged series against each other helps us see the relationships. The plots below show the average cases at a given lag on the x axis and the average deaths on the y axis. The blue numbers in the upper right-hand corners are the cross-correlation values between them. The highest possible cross-correlation value is 1, and the highest I obtained was 0.94 at lags 7-11. This means that spikes in new cases lead to spikes in COVID-19 related deaths 7 to 11 days later.
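A grid of lagged scatterplots like this can be made with lag2.plot from the astsa package, which plots one series against lagged values of the other and reports the cross-correlation in each panel. Using it here is my assumption about how such a figure is produced, not necessarily the code behind the plots above:

library(astsa)
# scatterplots of average cases h days back against average deaths today, h = 0..11
lag2.plot(utah$avg_cases[ok], utah$avg_deaths[ok], max.lag = 11)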

I will thus shift the average deaths data by a 9-day lead (the midpoint of 7 and 11), which aligns the spikes in the two datasets, and use that to create an ARIMA model. In the graph below, the two seven-day averages are plotted on top of each other with the lag adjusted by nine days. The average number of new cases is plotted in black, and the average number of deaths in red. They line up well, as expected. Again, they are measured on different scales, so the seemingly greater variability in the number of deaths is due to the fact that a change of one or two has much more impact on the number of deaths than on the number of new cases. The ratio between the two scales again shows that for roughly every 200 new cases of COVID-19, there is one COVID-19 related death 9 days later.
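Shifting by the nine-day lead amounts to pairing the cases on day t with the deaths on day t + 9; a minimal sketch:

n <- nrow(utah)
cases_aligned  <- utah$avg_cases[1:(n - 9)]   # cases on day t
deaths_aligned <- utah$avg_deaths[10:n]       # deaths on day t + 9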

Regression Analysis with ARMA

Now that I have settled on the averaged data and adjusted it for the lag, I want to examine the dependence between the two series using a regression with ARMA errors model. This gives a model that predicts, based on the number of new cases today, how many people will die of COVID-19 in 9 days. The table below shows that the best model is an ARIMA(5,1,4), as it has the lowest AIC score in the table.
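The AIC table can be generated with a grid search over (p, q), in the spirit of the aic_table function used in past course projects; this is a sketch with placeholder series names, and some fits may produce warnings or fail to converge:

aic_table <- function(y, P, Q, xreg = NULL) {
  table <- matrix(NA, P + 1, Q + 1,
                  dimnames = list(paste0("AR", 0:P), paste0("MA", 0:Q)))
  for (p in 0:P) {
    for (q in 0:Q) {
      # d = 1 differencing throughout, matching the ARIMA(p,1,q) models compared
      table[p + 1, q + 1] <- arima(y, order = c(p, 1, q), xreg = xreg)$aic
    }
  }
  table
}
aic_table(ts(deaths_aligned), P = 5, Q = 5, xreg = ts(cases_aligned))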

## [1] 5 1 4

##          MA0      MA1      MA2      MA3      MA4      MA5
## AR0 485.3740 487.3395 489.3367 487.3139 461.7029 444.3478
## AR1 487.3393 489.3394 489.2403 478.0251 462.5952 445.9475
## AR2 489.3301 489.5952 492.3429 435.9820 427.1492 389.6482
## AR3 485.8656 487.8666 488.9438 447.7467 436.7441 391.2716
## AR4 487.8647 489.8656 490.9437 462.9541 392.3059 372.0643
## AR5 485.0434 485.9320 425.5768 420.0034 369.4044 367.3971

Calling the arima function with this order, we get the model below, and we can use it to compute the z-statistic of 0.6666667 (the coefficient divided by its standard error) for the coefficient of the seven-day averaged number of new cases. The p-value from the likelihood ratio test is very small, 0.00176, so there is significant statistical evidence that the lead-adjusted seven-day average deaths data and the seven-day average new cases data are associated.

## 
## Call:
## arima(x = lAvDeaths_ts, order = c(5, 1, 4), xreg = AvCases_ts)
## 
## Coefficients:
##          ar1      ar2     ar3      ar4     ar5      ma1     ma2     ma3     ma4
##       0.7211  -1.0707  0.6900  -0.6038  0.3094  -0.7912  1.4114  -0.800  0.9806
## s.e.  0.0532   0.0563  0.0727   0.0565  0.0528   0.0153  0.0237   0.025  0.0289
##       AvCases_ts
##            4e-04
## s.e.       6e-04
## 
## sigma^2 estimated as 0.1547:  log likelihood = -173.7,  aic = 369.4

Z-statistic for the seven-day averaged number of new cases coefficient.

## [1] 0.6666667

P-value from the likelihood ratio test. As it is so low, it indicates that the model does a good job of predicting the number of deaths in nine days given today’s number of new cases.

## [1] 0.001764593
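For reference, both numbers can be computed from the fitted model; a sketch using the object names from the arima call above, where fit0 is the same model refit without the regression term:

fit  <- arima(lAvDeaths_ts, order = c(5, 1, 4), xreg = AvCases_ts)
fit0 <- arima(lAvDeaths_ts, order = c(5, 1, 4))

# z-statistic: coefficient over its standard error (4e-04 / 6e-04 = 0.67)
z <- fit$coef["AvCases_ts"] / sqrt(fit$var.coef["AvCases_ts", "AvCases_ts"])

# likelihood ratio test: twice the log-likelihood difference against chi-squared(1)
lrt_p <- 1 - pchisq(2 * (fit$loglik - fit0$loglik), df = 1)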

Residual diagnostics.

The residuals appear to be fairly normal. Plotted against time, the spikes seem to enlarge about three-quarters of the way through the data. Those spikes correspond to the November-January data, where the counts were larger and thus the variability was also larger. On the ACF plot, one out of the 20+ lags falls outside the dashed lines showing the pointwise acceptance regions at the 5% level under a null hypothesis of Gaussian white noise. Again, this could be due to different variability at different points in time, or it could indicate that for some region of time the model does not quite fit. However, as it is only one out of 20+ lags, this is still an adequate model.
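The residual plots and the Ljung-Box test below match the output of checkresiduals from the forecast package, which is my assumption about how they were produced:

library(forecast)
checkresiduals(fit)  # time plot, ACF, histogram, and Ljung-Box test of the residuals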

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,1,4)
## Q* = 22.811, df = 4, p-value = 0.0001381
## 
## Model df: 10.   Total lags used: 14

Conclusions

I have found that the lag between spikes in positive COVID-19 tests and spikes in COVID-19 deaths for the state of Utah is approximately 9 days. When the data is adjusted for this lag, there is a significant correlation between the two series. In this case, it is probably safe to say that a spike in COVID-19 cases will lead to a spike in COVID-19 related deaths about 9 days later.

This study could be enlarged or transferred to data for a different state, the whole United States, or the world to see whether this 9-day lag is consistent across regions. In addition to the data I used, the state of Utah’s COVID-19 website has data on different demographics, hospitalizations, counties, etc. Another idea for future analysis would be to look at the intermediate step of COVID-19 hospitalizations: the lag between spikes in confirmed cases and in hospitalizations, and between spikes in hospitalizations and in deaths. One could also break the data up by demographics and regions to see whether one area or a certain demographic has a longer or shorter lag, or a stronger or weaker tie between the number of new cases and the number of COVID-19 related deaths.

Resources

The data sets were downloaded from https://coronavirus-dashboard.utah.gov/. I downloaded them directly because I could not find the exact data I was looking for elsewhere online.

I borrowed code from https://stackoverflow.com/questions/3099219/ggplot-with-2-y-axes-on-each-side-and-different-scales to help me format two axes.

When I was struggling with the arima function (getting strange errors), I went to https://stats.stackexchange.com/questions/53051/simulate-arima-by-hand, which not only helped get rid of my errors but also gave me a good idea for finding the optimal (p,d,q) combination.

I looked at https://online.stat.psu.edu/stat510/lesson/8/8.2 to help me interpret a CCF plot when there is a clear lag.

I went to https://otexts.com/fpp2/arima-r.html for help plotting residuals.

Past midterm projects I used for general outlines:
https://ionides.github.io/531w20/midterm_project/project9/midtermProject.html
https://ionides.github.io/531w20/midterm_project/project26/midterm-project.html
https://ionides.github.io/531w20/midterm_project/project15/Midterm_Project.html
https://ionides.github.io/531w20/midterm_project/project17/STATS531MidProject.html