Introduction

COVID-19 has completely changed the world since early 2020: the way we attend school, the way we interact with each other, and almost every other aspect of our lives. It is no surprise, then, that many researchers in various fields set aside their projects and dedicated their time to tracking, curing, and modeling the spread of the SARS-CoV-2 coronavirus. The goal of all of this was to “flatten the curve” and keep more people from getting sick and dying in the new pandemic.

My midterm project centered on the lag between spikes and valleys in new cases and deaths in the state of Utah. This project is a continuation of that one in the sense that the same Utah COVID-19 data is used; however, this one focuses strictly on modeling the disease. The data were downloaded from the state’s coronavirus website, which is updated daily; the download used for this project was from April 19th, 2021. I am not sure how these data were gathered, how new cases and recoveries are classified in the state of Utah, or how non-reports are accounted for. However, as the data come from an official government website, I am trusting that they are accurate and clean.

Exploratory Data Analysis

The graph below shows the number of new cases reported each day from mid-March 2020 until mid-April 2021, with no missing values. The black line represents the raw number of cases, and it shows a weekly cycle, as fewer cases were reported on weekends. This is a result of the schedules of testing facilities, doctors’ offices, and the other places involved in the reporting process. The red line represents the seven-day average, which smooths out these weekly swings and is probably a more accurate representation of the underlying trend.
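The seven-day average used here is a trailing rolling mean. The original analysis was done in R; the sketch below is an illustrative Python version on made-up daily counts with a weekend reporting dip, showing how the weekly cycle is flattened out.

```python
def rolling_mean(values, window=7):
    # Trailing rolling mean: entry i averages the `window` values ending at i.
    # The first window-1 entries are None, since a full week is not yet available.
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

# Toy daily counts with a weekly pattern (low weekend reporting):
daily = [120, 130, 125, 135, 140, 60, 50] * 3
smoothed = rolling_mean(daily, window=7)
```

Because the toy pattern repeats with period seven, every full-window average collapses to the same value, which is exactly the smoothing effect described above.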

The minimum number of averaged new cases is 0.1, the maximum is 3,392.6, and the mean is 958.4.

This graph shows that there is no clear lag pattern: no lagged version of the series produces a better ACF than the unadjusted data. Thus no further adjustments were made to the data.

ARIMA Model

To begin modeling this data, I started with an AutoRegressive Integrated Moving Average (ARIMA) model, which I had not fit to the new-case data before. After differencing, the model has the form: \[ Y_{t}=\alpha+\beta_{1} Y_{t-1}+\beta_{2} Y_{t-2}+\ldots+\beta_{p} Y_{t-p}+\epsilon_{t}+\phi_{1} \epsilon_{t-1}+\phi_{2} \epsilon_{t-2}+\ldots+\phi_{q} \epsilon_{t-q} \]

The model with the lowest AIC, i.e. the one with the best fit by that criterion, is the ARIMA(5,2,4) model.

##          MA0      MA1      MA2      MA3      MA4      MA5
## AR0 4471.155 4368.471 4366.457 4360.941 4334.782 4332.224
## AR1 4374.048 4366.268 4368.037 4325.152 4324.611 4298.654
## AR2 4364.352 4365.996 4339.291 4308.693 4322.222 4301.859
## AR3 4365.752 4367.742 4332.484 4272.618 4272.760 4270.705
## AR4 4367.604 4325.014 4321.866 4273.004 4270.191 4272.126
## AR5 4339.255 4319.324 4291.398 4278.801 4269.032 4269.975
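Selecting the order from that grid is just a matter of finding its smallest entry. The fitting itself was done in R; the illustrative Python snippet below simply re-reads the AIC table above and confirms which cell is the minimum.

```python
# AIC values copied from the R grid above, keyed by (AR order p, MA order q);
# every model in the grid uses d = 2 differences.
aic = {
    (0, 0): 4471.155, (0, 1): 4368.471, (0, 2): 4366.457,
    (0, 3): 4360.941, (0, 4): 4334.782, (0, 5): 4332.224,
    (1, 0): 4374.048, (1, 1): 4366.268, (1, 2): 4368.037,
    (1, 3): 4325.152, (1, 4): 4324.611, (1, 5): 4298.654,
    (2, 0): 4364.352, (2, 1): 4365.996, (2, 2): 4339.291,
    (2, 3): 4308.693, (2, 4): 4322.222, (2, 5): 4301.859,
    (3, 0): 4365.752, (3, 1): 4367.742, (3, 2): 4332.484,
    (3, 3): 4272.618, (3, 4): 4272.760, (3, 5): 4270.705,
    (4, 0): 4367.604, (4, 1): 4325.014, (4, 2): 4321.866,
    (4, 3): 4273.004, (4, 4): 4270.191, (4, 5): 4272.126,
    (5, 0): 4339.255, (5, 1): 4319.324, (5, 2): 4291.398,
    (5, 3): 4278.801, (5, 4): 4269.032, (5, 5): 4269.975,
}
p, q = min(aic, key=aic.get)  # lowest-AIC cell -> ARIMA(p, 2, q)
```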

The summary of this model, with the coefficients for each piece, is below. From the diagnostics it seems like an acceptable fit, but around the December–January spikes in the data the fit becomes less reliable. There are also a few more lags outside the ACF boundaries than we would like to see. All of the roots of this model are outside the unit circle, so it is a stationary model.

## 
## Call:
## arima(x = dat$Seven.day.Average, order = c(5, 2, 4))
## 
## Coefficients:
##           ar1      ar2     ar3     ar4      ar5     ma1      ma2      ma3
##       -0.9899  -0.2125  0.5104  0.4350  -0.1388  0.4399  -0.2564  -0.6896
## s.e.   0.1100   0.1092  0.0821  0.1023   0.0675  0.1047   0.0603   0.0818
##           ma4
##       -0.4729
## s.e.   0.0796
## 
## sigma^2 estimated as 1982:  log likelihood = -2124.52,  aic = 4269.03

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,2,4)
## Q* = 35.721, df = 3, p-value = 8.578e-08
## 
## Model df: 9.   Total lags used: 12
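The Ljung-Box statistic reported above is Q* = n(n+2) * sum_{k=1}^{h} r_k^2 / (n - k), where r_k is the lag-k autocorrelation of the residuals. A minimal pure-Python sketch of that computation is below; it runs on simulated white noise rather than the actual ARIMA residuals, so the numbers are illustrative only.

```python
import random

def acf(x, lag):
    # Sample autocorrelation of series x at the given lag.
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[i] - m) * (x[i + lag] - m) for i in range(n - lag))
    return num / denom

def ljung_box(residuals, lags=12):
    # Q* = n(n+2) * sum_{k=1}^{h} acf_k^2 / (n - k)
    n = len(residuals)
    return n * (n + 2) * sum(
        acf(residuals, k) ** 2 / (n - k) for k in range(1, lags + 1)
    )

random.seed(42)
white_noise = [random.gauss(0, 1) for _ in range(300)]
q_white = ljung_box(white_noise, lags=12)     # small: no autocorrelation
trending = [0.1 * t for t in range(300)]
q_trend = ljung_box(trending, lags=12)        # huge: strong autocorrelation
```

For the report’s test, the degrees of freedom are the 12 lags minus the 9 estimated coefficients, giving df = 3; the tiny p-value above indicates the ARIMA residuals still carry some autocorrelation, consistent with the extra significant ACF lags noted earlier.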

SIR Model

An SIR model uses different parameters to track a disease’s path from those who are Susceptible, to those who are Infected, and finally to those who have Recovered, taking into account the people who die at any step along the way.

There are multiple ways to fit an SIR model, but they all involve parameters representing the infection rate, the recovery rate, and the death rate.
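In its basic form (leaving deaths aside), the model can be written as a set of coupled differential equations, where N is the constant population size, beta the infection rate, and gamma the recovery rate. Note that some parameterizations fold the 1/N factor into beta, which changes the scale of the fitted value:

\[ \frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad \frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I \]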

Pomp

The first method I tried uses the pomp package, as discussed in class (STATS 531). However, I ran into difficulties with it, which I explain below.

To begin, I wanted to estimate the average recovery rate, so I took the data for new cases and newly recovered patients in Utah and found the lag with the highest correlation between the two series. In the plots below, a lag of 15 days has the highest correlation. Thus the mu_IR value, the rate from infection to recovery, is 1/15 per day.
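The lag search works by shifting the recovery series by each candidate lag, computing the correlation with the case series, and keeping the lag with the highest value. The sketch below is illustrative Python on made-up series (the actual analysis used the Utah case and recovery counts in R); the toy recoveries are the cases shifted by exactly 15 days, so the search recovers that lag.

```python
import random

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length series.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    dx = sum((a - mx) ** 2 for a in x) ** 0.5
    dy = sum((b - my) ** 2 for b in y) ** 0.5
    return num / (dx * dy)

def best_lag(cases, recoveries, max_lag=30):
    # Correlate recoveries at day t + lag with cases at day t;
    # return the lag with the highest correlation.
    scores = {}
    for lag in range(1, max_lag + 1):
        scores[lag] = pearson(cases[:len(cases) - lag], recoveries[lag:])
    return max(scores, key=scores.get)

# Toy data: recoveries are exactly the case counts shifted 15 days.
random.seed(1)
cases = [random.gauss(1000, 200) for _ in range(120)]
recoveries = [0.0] * 15 + cases[:-15]
lag = best_lag(cases, recoveries)
```

With the best lag in hand, the recovery rate is simply its reciprocal (here 1/15 per day).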

Below you can see my initial guesses for the parameters; the red lines are the predicted models, and the blue are the observed COVID counts in Utah. The second graph shows the parameter values the pomp model converged to, and the fit does not appear to be much better. As mentioned, I had a few issues with pomp: after a certain number of iterations, R struggled to calculate the log-likelihood statistic, which measures how well the model fits the observed data. This is the best model it produced, and the problem also made comparing different models with given parameters tricky. I am not saying that pomp could not model this data better; given my computer errors, this was the best I could do.

Obviously this fit is not great, and I wanted to do better, so I looked into other approaches to building SIR models. The one that worked best for me was not a package per se, but a step-by-step process which generates curves for each of the S, I, and R pieces of the model to show how they interact.

SIR Process

I began by creating a function which calculates each piece of the SIR model with respect to the others, starting with the whole population in the “S”, or susceptible, state. The University of Utah stated that a fair 2020 census estimate (as the actual counts are not out yet) is that Utah has a population of roughly 3,250,000 people, and I used that as a constant population count in my model. I then used my data to find the best parameters for beta and gamma, which in this case are the rates of people contracting COVID-19 and recovering from it, respectively. The best beta was 0.0088, and the best gamma was 0.0458.
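The curve-generating step can be sketched as a forward-Euler integration of the SIR equations. This is an illustrative Python sketch, not the report’s exact fit: it uses the stated population of 3,250,000 and the fitted gamma of 0.0458, but an assumed beta of 0.25 and an assumed single initial infection, since the report’s beta = 0.0088 depends on how the optimization scaled the equations and is not directly comparable to the beta in the standard beta*S*I/N form used here.

```python
def sir_euler(n, beta, gamma, i0=1.0, days=365, dt=1.0):
    # Forward-Euler integration of:
    #   dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I
    s, i, r = n - i0, i0, 0.0
    S, I, R = [s], [i], [r]
    for _ in range(int(days / dt)):
        new_inf = beta * s * i / n * dt   # susceptible -> infected
        new_rec = gamma * i * dt          # infected -> recovered
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
        S.append(s)
        I.append(i)
        R.append(r)
    return S, I, R

N = 3_250_000                 # Utah population estimate used in the report
beta, gamma = 0.25, 0.0458    # beta is an illustrative assumption; gamma is the fitted value
S, I, R = sir_euler(N, beta, gamma)
peak_day = I.index(max(I))    # day the infected curve peaks
```

Because each step moves people between compartments without creating or destroying any, S + I + R stays at N throughout, and the infected curve rises to a single peak before declining, which is the qualitative shape the report’s graphs show.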

Below, the blue dots (which are so close they form a line) represent the total cumulative number of COVID cases, and the red line is the infected piece of the SIR model fit to the data. The second graph is the same data plotted on a log scale. The model still isn’t a perfect fit, but it seems to capture the cycle of the virus better than the pomp model I was able to achieve.

This graph shows each piece of the SIR model. The black line is those who are susceptible, the green line represents those who have recovered, and the red represents those currently infected. The blue is the actual cumulative case count. According to the model, the virus reached its peak in February, and we are now on our way toward community recovery in Utah.

Conclusions

Things which are not measured are not addressed. Modeling the coronavirus spread helps us understand where in the course of the virus we are. At midterms we were near the peak of the virus in Utah, and now, according to all of these models, the virus is beginning to decline. Had I had different resources, such as being able to get the Great Lakes cluster working in my timeframe, I trust that the pomp model would have been more accurate and better able to explain the rates in the data. However, most problems have more than one solution, and the SIR model also fits and predicts the data.

Further studies could examine whether other parts of the country fit this same trend and whether the same parameters could be adapted to model the virus’s spread in different populations. For the pomp models, trying different parameters would help refine the fit, and exploring more accurate ways to estimate the parameters would also improve the analysis.

Resources

Many students in 2020 chose to do their final projects on COVID-19 data as well. However, the spike in my data is December 2020–January 2021, which occurred after that class had finished its course. Thus, although my data covers the same disease, and although I looked through their reports, mine had different challenges and my data had a different shape, so I feel confident in concluding that my work is my own.

Data was taken from the State of Utah’s coronavirus website: https://coronavirus.utah.gov/case-counts/

Population estimates for Utah were gathered from the University of Utah’s website: https://gardner.utah.edu/blog-new-2020-census-bureau-estimates/#:~:text=On%20December%2022%2C%20the%20Census,of%20any%20state%20at%2017.6%25.

The SIR process was followed from a tutorial on statsandr.com: https://statsandr.com/blog/covid-19-in-belgium/

The paper and statistics for this project were all written by me.