The coronavirus disease (Covid-19) has spread worldwide since the beginning of 2020. By the end of March, the outbreak caused by Covid-19 in China had largely been brought under control. Building a model of this infectious disease can help other countries understand the virus more clearly. We fit the data using the methods we learned in this course.
In this project, we build a mathematical model to investigate the transmission of Covid-19 in China.
If we can find a pattern, or parameters that yield a good approximating model for the Covid-19 case data, we can be better prepared for coming outbreaks in other places.
The data we use is the 2020 daily Covid-19 dataset from Kaggle, which records the numbers of confirmed, death, and recovered cases each day across the globe. The raw data contain 24890 observations from 185 different countries/regions and, at the time we downloaded them, cover 2020-01-22 to 2020-04-26. We first select the cases relevant to our topic: in the following analysis we focus only on the data from China.
After this selection there are 2343 records from various provinces in China, which we aggregate by date. Possibly because of miscounting or later corrections, the number of recovered people on 2020-04-17 is negative, so we decided to use only the data points before 2020-04-01. In the end we have 69 daily observations of confirmed and recovered Covid-19 cases in China.
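As a rough sketch of this preprocessing step (the file name and column names such as Country.Region and Date are assumptions about the Kaggle download and may need adjusting):

library(dplyr)
# Hypothetical file/column names for the Kaggle download; adjust to the actual file.
raw <- read.csv("covid_19_data.csv")
df <- raw %>%
  filter(Country.Region == "China") %>%      # keep Chinese provinces only
  mutate(date = as.Date(Date)) %>%
  group_by(date) %>%                         # aggregate the provinces by day
  summarise(Confirmed = sum(Confirmed),
            Recovered = sum(Recovered)) %>%
  filter(date < as.Date("2020-04-01"))       # keep only dates before 2020-04-01
# If the file reports cumulative totals rather than daily counts, daily new cases
# can be obtained by differencing, e.g. mutate(Confirmed = c(Confirmed[1], diff(Confirmed))).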
Let’s have a general look at the data:
In the following analysis we do not model changes in the daily death rate; deaths are folded into the removal rate, which is treated as a constant.
df$date[which.max(df$Confirmed)]  # date with the most new confirmed cases
## [1] "2020-02-13"
max(df$Confirmed)
## [1] 13628
df$date[which.max(df$Recovered)]  # date with the most recovered cases
## [1] "2020-02-22"
max(df$Recovered)
## [1] 3845
On 2020-02-13 there were 13628 new confirmed cases of Covid-19 in China, the maximum in the data we are using; most other days are below 5000. This sudden jump in confirmed cases may have several explanations. Perhaps on 2020-02-13 hospitals received a new batch of test kits and conducted more tests than usual, or there was a public gathering that day that led to a burst of infections. Some sources believe the Chinese government changed the criterion for what counts as a coronavirus case.
Then on 2020-02-22 there were 3845 recovered cases, the maximum in the data. This 9-day lag may reflect the typical confirmed-to-recovered cycle. According to some news reports, a healthy person with a strong immune system can recover from Covid-19 on their own within 1 or 2 weeks. We can roughly say that the natural recovery rate of Covid-19 is about
3845/13628
## [1] 0.2821397
We can plot the autocorrelation function (ACF) of the confirmed and recovered cases.
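A minimal sketch of these plots, assuming the aggregated data frame df from above:

acf(df$Confirmed, main = "ACF of daily confirmed cases")
acf(df$Recovered, main = "ACF of daily recovered cases")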
As we can see, almost all autocorrelations of the confirmed cases are greater than 0; they show a weakly decreasing trend and no sign of a sinusoidal pattern. The ACF of the recovered cases also shows a decreasing trend.
We use the same SIR model as in notes 11 and 12:
\[\Delta N_{SI}\sim Binomial(S,1-e^{-\beta I/N \Delta t})\] \[\Delta N_{IR}\sim Binomial(I,1-e^{-\mu_{IR} \Delta t})\] \[B\sim Binomial(H,\rho)\] Here \(H\) is the number of people who became infected and \(B\) is the number of people reported to have Covid-19.
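The later results refer to a pomp object named sir. A minimal sketch of how it could be constructed, assuming the data frame df from above; the Euler step size and initial values are illustrative assumptions, not taken from the report:

library(pomp)
sir_step <- Csnippet("
  double dN_SI = rbinom(S, 1 - exp(-Beta*I/N*dt));
  double dN_IR = rbinom(I, 1 - exp(-mu_IR*dt));
  S -= dN_SI;
  I += dN_SI - dN_IR;
  R += dN_IR;
  H += dN_SI;                  // H accumulates new infections
")
sir_init <- Csnippet("
  S = nearbyint(N) - 1;        // assumed: one initial infective
  I = 1;
  R = 0;
  H = 0;
")
rmeas <- Csnippet("B = rbinom(H, rho);")
dmeas <- Csnippet("lik = dbinom(B, H, rho, give_log);")
sir <- pomp(
  data = data.frame(day = seq_along(df$Confirmed), B = df$Confirmed),
  times = "day", t0 = 0,
  rprocess = euler(sir_step, delta.t = 1/6),
  rinit = sir_init,
  rmeasure = rmeas,
  dmeasure = dmeas,
  accumvars = "H",
  statenames = c("S", "I", "R", "H"),
  paramnames = c("Beta", "mu_IR", "rho", "N")
)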
In the following plot, B is the actual number of people confirmed to have Covid-19, and H is the curve from our simulation.
We run 20 simulations to approximate the curve of confirmed cases with \(\beta=1.2, \mu_{IR}=0.2,\rho=0.8, N=50000\).
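A sketch of such a comparison, assuming the sir object above:

library(ggplot2)
# Simulate 20 trajectories at the guessed parameters and overlay the data.
sims <- simulate(sir,
                 params = c(Beta = 1.2, mu_IR = 0.2, rho = 0.8, N = 50000),
                 nsim = 20, format = "data.frame", include.data = TRUE)
ggplot(sims, aes(x = day, y = B, group = .id, color = .id == "data")) +
  geom_line() +
  labs(color = "observed data")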
Here are some of the log-likelihoods for the parameters we guessed.
pf <- pfilter(sir,Np=5000,params=c(Beta=1.2,mu_IR=0.2,rho=0.8,N=50000))
logLik(pf)
## [1] -2541.31
pf <- replicate(10,
                pfilter(sir, Np = 5000,
                        params = c(Beta = 1.2, mu_IR = 0.2, rho = 0.8, N = 50000)))
print(ll <- sapply(pf,logLik))
## [1] -2625.729 -2541.191 -2540.242 -2605.473 -2540.387 -2541.087 -2543.061
## [8] -2540.798 -2541.046 -2612.147
logmeanexp(ll,se=TRUE)
## se
## -2541.2193972 0.3161533
We can slice in the \(\beta\) and \(\mu_{IR}\) directions to see how the log-likelihood changes.
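A minimal sketch of such a slice along \(\beta\), run in parallel with foreach (the grid limits are illustrative, and the \(\mu_{IR}\) slice is analogous):

library(foreach)
library(doParallel)
registerDoParallel()

beta_grid <- seq(0.5, 2, length.out = 20)    # illustrative range for Beta
slice_ll <- foreach(b = beta_grid, .combine = c, .packages = "pomp") %dopar% {
  logLik(pfilter(sir, Np = 5000,
                 params = c(Beta = b, mu_IR = 0.2, rho = 0.8, N = 50000)))
}
plot(beta_grid, slice_ll, type = "l",
     xlab = expression(beta), ylab = "log-likelihood")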
From these slices we can get a plausible range for the parameters.
We also fit a SIRRR model (SIR with a three-stage recovery), as in note 12; a sketch of the modified process step is given below.
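A minimal sketch of the three-stage-recovery process step (the object bsflu2 used below would be built from a snippet like this in the same way as sir; the naming follows note 12):

sirrr_step <- Csnippet("
  double dN_SI   = rbinom(S,  1 - exp(-Beta*I/N*dt));
  double dN_IR1  = rbinom(I,  1 - exp(-mu_IR*dt));
  double dN_R1R2 = rbinom(R1, 1 - exp(-mu_R1*dt));
  double dN_R2R3 = rbinom(R2, 1 - exp(-mu_R2*dt));
  S  -= dN_SI;
  I  += dN_SI   - dN_IR1;
  R1 += dN_IR1  - dN_R1R2;     // first recovery stage, left at rate mu_R1
  R2 += dN_R1R2 - dN_R2R3;     // second recovery stage, left at rate mu_R2
  H  += dN_SI;                 // H accumulates new infections
")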
First, we test if the particle filter is working.
pf <- pfilter(bsflu2, Np = 5000,
              params = c(Beta = 1.2, mu_IR = 0.2, rho = 0.8, N = 50000,
                         mu_R1 = 1/(sum(bsflu_data$B)/512),
                         mu_R2 = 1/(sum(bsflu_data$C)/512)))
logLik(pf)
## [1] -2700.932
pf <- replicate(10,
                pfilter(bsflu2, Np = 5000,
                        params = c(Beta = 1.2, mu_IR = 0.2, rho = 0.8, N = 50000,
                                   mu_R1 = 1/(sum(bsflu_data$B)/512),
                                   mu_R2 = 1/(sum(bsflu_data$C)/512))))
print(ll <- sapply(pf,logLik))
## [1] -2700.932 -2700.932 -2700.932 -2700.932 -2700.932 -2700.932 -2700.932
## [8] -2700.932 -2700.932 -2700.932
logmeanexp(ll,se=TRUE)
## se
## -2700.932 0.000
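The local-search results below refer to an object mifs_local obtained by iterated filtering (mif2) started from the guesses above. A minimal sketch, with illustrative perturbation sizes, particle numbers, and iteration counts:

library(doRNG)
registerDoRNG(123456)          # illustrative seed for reproducible parallel runs

start_params <- c(Beta = 1.2, mu_IR = 0.2, rho = 0.8, N = 50000,
                  mu_R1 = 1/(sum(bsflu_data$B)/512),
                  mu_R2 = 1/(sum(bsflu_data$C)/512))
mifs_local <- foreach(i = 1:20, .packages = "pomp", .combine = c) %dopar% {
  mif2(bsflu2, params = start_params,
       Np = 2000, Nmif = 50,
       cooling.fraction.50 = 0.5,
       rw.sd = rw.sd(Beta = 0.02, mu_IR = 0.02, rho = 0.02))
}

The summary of the resulting log-likelihood estimates and the diagnostics plot follow.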
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2668 -2633 -2633 -2596 -2617 -2438
plot(mifs_local)
coef(mifs_local[[1]])
## Beta mu_IR rho mu_R1 mu_R2
## 0.6325013104 0.0102271440 0.4461454020 0.0005863195 0.0053760618
local_para <- coef(mifs_local[[1]])  # parameters from the first local-search replicate
pf <- pfilter(bsflu2,Np=20000,params=local_para)
logLik(pf)
## [1] -2632.985
The effective sample size is considerably small compared with the number of particles.
\(\rho\) and \(\mu_{IR}\) converge in most of the replicates; only a few replicates fail to converge in \(\beta\).
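The global-search results below refer to an object mifs_global, obtained by restarting the iterated filtering from random starting values drawn from a parameter box. A minimal sketch, with an illustrative box:

guesses <- runif_design(
  lower = c(Beta = 0.1, mu_IR = 0.01, rho = 0.05),   # illustrative lower bounds
  upper = c(Beta = 3,   mu_IR = 1,    rho = 0.9),    # illustrative upper bounds
  nseq = 100
)
fixed_params <- c(N = 50000,
                  mu_R1 = 1/(sum(bsflu_data$B)/512),
                  mu_R2 = 1/(sum(bsflu_data$C)/512))
mifs_global <- foreach(guess = iter(guesses, "row"),
                       .packages = "pomp", .combine = c) %dopar% {
  mif2(mifs_local[[1]], params = c(unlist(guess), fixed_params))
}

The summary of the global-search log-likelihoods, the parameter estimates of one fit, and its log-likelihood are shown below.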
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2701 -2669 -2668 -2672 -2668 -2668
## [1] "mif2List"
## attr(,"package")
## [1] "pomp"
## [1] "mif2d_pomp"
## attr(,"package")
## [1] "pomp"
## Beta mu_IR rho mu_R1 mu_R2
## 0.509514844 0.744081231 0.153533326 0.006264453 0.006924065
## [1] -2667.479
We can see that the log-likelihoods from the global and local searches do not differ much. In the global search, only \(\rho\) shows signs of convergence; \(\beta\) and \(\mu_{IR}\) do not converge within the box.
\(\rho\) can be interpreted as the probability of infection when a person is in contact with someone who is infected. In our report, \(\rho\) is approximately 0.15, while real-world reports place this rate at around 4%-6%.
In this project, we proposed a mathematical model for Covid-19 infection and recovery in China using a pomp (partially observed Markov process) framework. Model parameters were determined from local and global searches. In the future, we could consider changing the model structure. We could add a high-risk group to the model, since these people have close contact with infected patients; according to government reports, people in this group are isolated for about 2 weeks and about 20-30% of them were found to be infected, which suggests an SIQR (Q for quarantine) model. Some reports believe the coronavirus is more infectious than SARS because asymptomatic carriers (people who have been infected but show no symptoms) can be latent for a long period before they are confirmed; this could be captured by an SEIR model. Given more time, we could combine these features to build a SEIQR model.
The model in this report is built under idealized assumptions and may differ from the real-world disease. However, the results show that the infection rate of Covid-19 is not negligible, and people should be aware of the need to protect themselves.