In the middle of the 2020 winter semester, COVID-19 ravaged the globe, causing great losses and inconvenience to people all over the world. Since the genetic sequence of this coronavirus is similar to that of the SARS virus, analyzing the dynamics of SARS may provide useful insights for fighting COVID-19.
In this project, we study the number of SARS patients over time. SARS first broke out in 2003 in the Guangdong Province of China. To make the problem setting clear, we fix the location of our study to Hong Kong, China, and the time period to March 17, 2003 through July 11, 2003.
Our data come from the Kaggle dataset “SARS 2003 Outbreak Complete Dataset” [1]. We extracted the records associated with Hong Kong into a separate file. First, we take a look at the distribution of the data:
As we can see, the number of patients first increased as the disease broke out. The epidemic reached its worst point in April; after the peak, the number of patients gradually dropped and the epidemic subsided. In the following sections, we use the SIR model to describe these data.
The SIR model consists of three states (susceptible, infectious, removed), with total population \(N = S + I + R\). In our scenario, we don’t know the exact value of \(N\), so we have to estimate \(N\) in the process. The transitions between the states are described by \[ \begin{aligned} \Delta N_{SI} &= \mathrm{Binomial}(S, 1 - e^{-\frac{\beta}{N}I\cdot\Delta t}) \\ \Delta N_{IR} &= \mathrm{Binomial}(I, 1-e^{-\mu_{IR}\cdot\Delta t}) \\ S &= S - \Delta N_{SI} \\ I &= I + \Delta N_{SI} - \Delta N_{IR} \\ R &= R + \Delta N_{IR} \end{aligned} \]
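The transition equations above can be sketched in base R as a single Euler step (a minimal illustration, not the actual pomp rprocess; the function name `sir_step` and the parameter values are ours):

```r
# One Euler step of the stochastic SIR model above: draw the binomial
# transition counts, then update the three states.
sir_step <- function(S, I, R, N, Beta, mu_IR, delta.t = 1) {
  dN_SI <- rbinom(1, S, 1 - exp(-Beta * I / N * delta.t))
  dN_IR <- rbinom(1, I, 1 - exp(-mu_IR * delta.t))
  c(S = S - dN_SI,
    I = I + dN_SI - dN_IR,
    R = R + dN_IR)
}

set.seed(1)
x <- sir_step(S = 99990, I = 10, R = 0, N = 100000, Beta = 1.15, mu_IR = 1.1)
```

Note that the step conserves the population: `sum(x)` equals \(N\) after every update, since each individual leaving one compartment enters the next.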
In addition, we use \(\rho\) to represent the probability that an infection results in confinement. Since SARS is a severe infectious disease, it is almost certain that an infected person gets quarantined. Therefore, we expect \(\rho\) to have a value very close to 1.
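The role of \(\rho\) can be illustrated with a simple binomial measurement sketch (the count `I_true` is hypothetical; the actual measurement model lives inside the pomp object):

```r
# Sketch of the measurement idea: each of the I currently infected
# persons is confined (and hence counted) with probability rho.
set.seed(2)
rho <- 0.99
I_true <- 300                       # hypothetical true number of infections
observed <- rbinom(1, I_true, rho)  # reported (confined) cases
```

With \(\rho\) this close to 1, the reported counts track the true infection counts almost exactly.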
To start with, we randomly try some combinations of parameters and test how well their simulations conform to the data. After many attempts, we discovered that when \(\beta\) is slightly larger than \(\mu_{IR}\) (by about \(0.05\sim 0.1\)), the simulation results are satisfactory. For example, we plot the simulation for \(\beta=1.15\), \(\mu_{IR} = 1.1\), and \(\rho=0.99\):
sims <- simulate(sir, params=c(Beta=1.15, mu_IR=1.1, rho=0.99, N=100000),
  nsim=10, format="data.frame", include.data=TRUE)
ggplot(sims, mapping=aes(x=day, y=Current, group=.id, color=.id=="data")) +
  geom_line() + guides(color=FALSE)
Since we don’t know the exact value of \(N\), we refer to the maximum cumulative number of patients and set \(N=100000\) as an approximation.
Through the process of finding satisfactory simulations, we observe that the values of \(\beta\) and \(\mu_{IR}\) significantly affect the shape and scale of the simulated curve. Therefore, in order to determine the most appropriate combination of parameters, we perform slicing in the \(\beta\) and \(\mu_{IR}\) dimensions.
tmp <- foreach (v=c("Beta","mu_IR")) %do% {
  x <- subset(p, slice==v)
  plot(x[[v]], x$loglik, xlab=v, ylab="loglik",
       ylim=range(x$loglik, na.rm=TRUE))
}  # capture foreach output
As we can see from the plots, the maximum likelihood is achieved at \(\beta \approx 1.20\) and \(\mu_{IR} \approx 1.05\), which is close to the findings of our simulations.
In addition, we can evaluate the likelihood over a two-dimensional grid and plot the cross section. The contour plot reveals an interesting finding: an almost linear relationship between \(\beta\) and \(\mu_{IR}\), suggesting that the two parameters are highly correlated.
pp <- mutate(p,loglik=ifelse(loglik>max(loglik)-500,loglik,NA))
ggplot(data=pp,mapping=aes(x=Beta,y=mu_IR,z=loglik,fill=loglik)) +
geom_tile(color=NA) + scale_fill_gradient() +
geom_contour(color='black',binwidth=0.1) +
labs(x=expression(beta),y=expression(mu[IR]))
We further apply the IF2 algorithm to gain insight into the parameters.
sars_rw.sd <- 0.02
sars_cooling.fraction.50 <- 0.5
stew(file=sprintf("local_search-%d.rda",run_level),{
  t_local <- system.time({
    mifs_local <- foreach(i=1:sars_Nlocal, .packages='pomp', .combine=c) %dopar% {
      mif2(sars2, params=sars_mle, Np=sars_Np, Nmif=sars_Nmif,
        cooling.fraction.50=sars_cooling.fraction.50,
        rw.sd=rw.sd(Beta=sars_rw.sd, mu_IR=sars_rw.sd, rho=sars_rw.sd))
    }
  })
},seed=900242057,kind="L'Ecuyer")
stew(file=sprintf("lik_local-%d.rda",run_level),{
  t_local_eval <- system.time({
    liks_local <- foreach(i=1:sars_Nlocal, .combine=rbind) %dopar% {
      evals <- replicate(sars_Neval,
        logLik(pfilter(sars2, params=coef(mifs_local[[i]]), Np=sars_Np)))
      logmeanexp(evals, se=TRUE)
    }
  })
},seed=900242057,kind="L'Ecuyer")
results_local <- data.frame(logLik=liks_local[,1], logLik_se=liks_local[,2],
  t(sapply(mifs_local,coef)))
summary(results_local$logLik, digits=10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3730 -3349 -3288 -3335 -3192 -3082
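The `logmeanexp` call used when evaluating each replicated particle filter averages log likelihoods on the natural scale in a numerically stable way. A base-R sketch of the point estimate (without pomp's jack-knife standard error; the values in `evals` are hypothetical):

```r
# Numerically stable log-mean-exp: log of the mean of exp(x),
# shifted by max(x) to avoid underflow of exp() for large negative x.
log_mean_exp <- function(x) {
  m <- max(x)
  m + log(mean(exp(x - m)))
}

evals <- c(-3290.2, -3288.7, -3291.5)  # hypothetical replicated log likelihoods
lme <- log_mean_exp(evals)
```

The result always lies between `mean(evals)` and `max(evals)`, and is dominated by the largest replicate.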
According to the local search results, we can generate the pairs plot of the likelihood against the parameters. The summary of the local search results suggests that the likelihoods span a wide range. Therefore, we set a threshold 500 below the maximum likelihood as a condition and filter out the parameter sets below it.
From the pairs plot, we can conclude that \((0, 1)\) is a proper range for \(\beta\) and \(\mu_{IR}\). In addition, a proper range for \(\rho\) is \((0.7, 1)\), which is consistent with our previous assumption. Based on these results, we set the range of the search box and perform a global likelihood search with random starting values.
sars_box <- rbind(Beta=c(0,1), mu_IR=c(0,1),rho = c(0.7,1))
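Random starting values for the global search can be drawn uniformly from this box, for example (a sketch; `n_starts` stands in for the actual number of starting points, which is set elsewhere):

```r
sars_box <- rbind(Beta=c(0,1), mu_IR=c(0,1), rho=c(0.7,1))

# Draw n_starts uniform values within each row's [lower, upper] limits;
# each row of `starts` is then one candidate parameter vector.
set.seed(3)
n_starts <- 10
starts <- apply(sars_box, 1, function(lim) runif(n_starts, lim[1], lim[2]))
```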
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3730 -3730 -3726 -3604 -3617 -3030
From the summary of the global search results, we can see that most parameter settings have a likelihood much lower than the maximum value; thus we set a higher lower bound on the likelihood to filter out the poor simulations.
plot(mifs_global)
From the diagnostics, we can see that within the number of iterations we performed, \(\beta\) and \(\mu_{IR}\) have not yet shown ideal convergence behavior. After trying several different settings, we obtained similar plots, where \(\beta\) and \(\mu_{IR}\) show no apparent convergence but \(\rho\) converges to values within \((0.6, 0.8)\). However, this is consistent with the earlier simulation experiments: we discovered there that as long as \(\beta\) is slightly larger than \(\mu_{IR}\), we obtain reasonable simulation results. As a consequence, the maximization search result is broadly consistent with our empirical findings.
We plot the profile likelihood for \(\beta\), and we can see that the likelihood is maximized when \(\beta\approx 0.83\).
## lower upper
## 0.836842 0.836842
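The interval printed above can be obtained from the profile points by the standard cutoff: parameter values whose profile log likelihood lies within \(\tfrac12\chi^2_{0.95,1}\approx 1.92\) of the maximum. A base-R sketch on hypothetical profile points (the data frame `prof` is illustrative, not our actual results):

```r
# Hypothetical profile-likelihood points for Beta.
prof <- data.frame(
  Beta   = c(0.70, 0.75, 0.80, 0.8368, 0.88, 0.93),
  loglik = c(-3095, -3088, -3084, -3082, -3085, -3092)
)

# 95% cutoff: within qchisq(0.95, df=1)/2 of the maximum log likelihood.
cutoff <- max(prof$loglik) - qchisq(0.95, df = 1) / 2
ci <- range(prof$Beta[prof$loglik > cutoff])
```

When only a single profile point clears the cutoff, the lower and upper bounds coincide, as in the interval reported above.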
We use the set of parameters inferred in the previous sections and perform several simulations. The simulation plots indicate that the parameter estimates are reasonable.
However, there are also some insufficiencies in this modeling. Some of the states and parameters had to be estimated in advance, and the rest of the search was based on those estimates. We approached this project the same way the lecture slides analyzed the boarding school influenza outbreak [3], but there are many differences between the two cases. First, Hong Kong is a large city, so far more uncontrollable factors exist than in the boarding school case. The population and its patterns of human interaction have a more complex structure, which the simple SIR model may not capture. According to the Richards growth curve [4], a cumulative count in a natural environment first increases and, after some point, reaches saturation and becomes stable. The current number of ill patients approximately reflects the slope of the cumulative infection curve, but external factors may also affect the process, such as government policies, public health developments, and the invention of new medications. Second, timely policies and treatment can be applied to a boarding school influenza outbreak, whereas SARS broke out as a brand-new disease and effective treatment took time to develop. In the future, we may try to improve this model by taking the development of such external factors into consideration.
srun --pty --account=stats531w20_class --cpus-per-task=16 --mem-per-cpu=8gb --partition=standard --time=1-00:00 /bin/bash