1 Introduction

The Covid-19 virus has been spreading all over the world for several months. Since it is a brand-new virus, people search for it on Google and other search engines to learn more about it. In this project, I collect data from Google Trends\(^{[1]}\). The data records the search interest for the keyword “virus” in the United States. By applying time series analysis to it, we may find interesting patterns in how people follow this hot topic.

2 Dataset

First, we can take a brief look at the time series data.
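The analysis below uses a data frame data with a day index day and search-interest column V2. A minimal sketch of how it could be loaded and plotted, assuming the Google Trends export was saved as "multiTimeline.csv" (the file name, skip count, and column layout are assumptions):

# hypothetical loading step for the Google Trends CSV export
data <- read.csv("multiTimeline.csv", skip = 2, header = TRUE)
names(data) <- c("date", "V2")   # V2: daily search interest for "virus"
data$day <- seq_len(nrow(data))  # day index, used as the time variable below
plot(V2 ~ day, data = data, type = "l",
     xlab = "Day", ylab = "Search interest")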

The dataset contains the search interest over the past 90 days. There are two peaks of search interest. According to the timeline, one may correspond to the time when Covid-19 first appeared, and the other to the time when the virus broke out in the United States. In the following sections, I use time series analysis to study the underlying patterns.

3 POMP Model

First of all, I would like to build a POMP model to fit the data. In particular, I apply the SIR model introduced in the course\(^{[2]}\).

3.1 SIR Model

There are three compartments in the SIR model: susceptible, infected, and recovered, respectively, when studying flu data. In this project, I redefine the three compartments in terms of search behavior around the topic.

Use \(N\) to denote the total number of individuals. \(N_{SI}(t)\) and \(N_{IR}(t)\) denote the numbers of individuals who have transitioned from S to I and from I to R, respectively. Also, to simplify the model, I follow the setting in slides 11\(^{[2]}\) that

\[ \mu_{\bullet S} = \mu_{S \bullet} = \mu_{I \bullet} = \mu_{R \bullet} = 0 \]

The ODEs for the counting processes can be written as

\[ \frac{dN_{SI}}{dt} = \frac{\beta I}{N} S, \qquad \frac{dN_{IR}}{dt} = \gamma I \]

To solve the model numerically, I use the Euler method, combining a binomial approximation with exponential transition probabilities:

\[ N_{SI}(t+\delta) = N_{SI}(t) + \mathrm{Binomial}\left(S(t),\, 1 - \exp\left(-\frac{\beta I}{N} \delta\right)\right) \]
\[ N_{IR}(t+\delta) = N_{IR}(t) + \mathrm{Binomial}\left(I(t),\, 1 - \exp(-\gamma \delta)\right) \]
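Before turning to pomp, a single Euler step of this scheme can be sketched directly in base R (a minimal sketch; the parameter values in the example call are illustrative):

one_euler_step <- function(S, I, R, Beta, gamma, delta) {
  N <- S + I + R
  dN_SI <- rbinom(1, S, 1 - exp(-Beta * I / N * delta))  # new S -> I transitions
  dN_IR <- rbinom(1, I, 1 - exp(-gamma * delta))         # new I -> R transitions
  c(S = S - dN_SI, I = I + dN_SI - dN_IR, R = R + dN_IR)
}
one_euler_step(S = 762, I = 1, R = 0, Beta = 2, gamma = 0.5, delta = 1/12)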

The model is implemented using the R package pomp:

virus_rprocess <- "
  // one Euler step with binomial transitions and exponential probabilities
  double dN_SI = rbinom(S,1-exp(-Beta*I/(S+I+R)*dt));
  double dN_IR = rbinom(I,1-exp(-dt*mu_IR));
  S -= dN_SI;
  I += dN_SI - dN_IR;
  R += dN_IR;
"

virus_dmeasure <- "
  // measurement density: V2 ~ Poisson(rho*R); 1e-10 avoids a zero mean
  lik = dpois(V2,rho*R+1e-10,give_log);
"

virus_rmeasure <- "
  // measurement simulator, matching the density above
  V2 = rpois(rho*R+1e-10);
"

virus_rinit <- "
  // initial compartment sizes
  S=762;
  I=1;
  R=0;
"

virus_statenames <- c("S","I","R")
virus_paramnames <- c("Beta","mu_IR","rho")

library(pomp)

virus2 <- pomp(
  data=subset(data,select=c(day,V2)),
  times="day",
  t0=0,
  rprocess=euler(
    step.fun=Csnippet(virus_rprocess),
    delta.t=1/12),
  rmeasure=Csnippet(virus_rmeasure),
  dmeasure=Csnippet(virus_dmeasure),
  partrans=parameter_trans(
    log=c("Beta","mu_IR"),
    logit="rho"),
  statenames=virus_statenames,
  paramnames=virus_paramnames,
  rinit=Csnippet(virus_rinit)
)
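With the pomp object constructed, we can simulate from the model as a quick sanity check (the parameter values here are illustrative, not fitted):

sims <- simulate(virus2,
                 params = c(Beta = 2, mu_IR = 0.5, rho = 0.5),
                 nsim = 5, format = "data.frame")
head(sims)  # simulated states S, I, R and simulated observations V2 by day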

3.2 SIR Model Improvement

Though the log-likelihood converges when using the SIR model, this may not be the best result because some parameters fail to converge. In this section, I add more compartments, similar to the model in notes13\(^{[2]}\): the single recovered class is replaced by three successive stages \(R_1, R_2, R_3\) that individuals pass through after leaving \(I\).

The number of individuals transitioning from \(R_k\) to \(R_{k+1}\) also follows the binomial approximation:

\[ N_{R_k R_{k+1}}(t+\delta) = N_{R_k R_{k+1}}(t) + \mathrm{Binomial}\left(R_k(t),\, 1 - \exp(-\mu_{R_k R_{k+1}} \delta)\right) \]
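The construction of the extended pomp object is not shown. Below is a minimal sketch of its rprocess snippet, assuming state names c("S","I","R1","R2","R3") and parameter names mu_R1 (for \(\mu_{R_1 R_2}\)) and mu_R2 (for \(\mu_{R_2 R_3}\)); the resulting object, built analogously to virus2, is called virus3 in the code below:

virus_rprocess3 <- "
  // Euler steps for the extended chain S -> I -> R1 -> R2 -> R3
  double dN_SI   = rbinom(S,1-exp(-Beta*I/(S+I+R1+R2+R3)*dt));
  double dN_IR1  = rbinom(I,1-exp(-dt*mu_IR));
  double dN_R1R2 = rbinom(R1,1-exp(-dt*mu_R1));
  double dN_R2R3 = rbinom(R2,1-exp(-dt*mu_R2));
  S  -= dN_SI;
  I  += dN_SI - dN_IR1;
  R1 += dN_IR1 - dN_R1R2;
  R2 += dN_R1R2 - dN_R2R3;
  R3 += dN_R2R3;
"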

Similarly, I use a slice design to find appropriate initial values for the parameters.
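The slice computation itself is not shown; it can be set up with pomp's slice_design(), varying one parameter at a time while holding the others at the center (a sketch; the grid ranges and Np are illustrative):

library(pomp)
library(doParallel)  # also attaches foreach and iterators
registerDoParallel()

# slices through the likelihood surface along Beta and mu_R2
p <- slice_design(
  center = c(Beta = 5, mu_IR = 0.05, rho = 0.05, mu_R1 = 0.01, mu_R2 = 0.001),
  Beta   = rep(seq(0.001, 20, length = 20), each = 3),
  mu_R2  = rep(seq(1e-4, 1e-2, length = 20), each = 3))
p <- foreach(theta = iter(p, "row"), .combine = rbind, .packages = "pomp") %dopar% {
  theta$loglik <- logLik(pfilter(virus3,
                                 params = unlist(theta[names(theta) != "slice"]),
                                 Np = 1000))
  theta
}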


Once \(\beta\) becomes larger than 0.05, it has little influence on the log-likelihood. And we cannot find any clear relationship between \(\mu_{R_2 R_3}\) and the log-likelihood.

3.2.1 Local Search

I set \(\beta = 5\), \(\mu_{IR} = 0.05\), \(\rho = 0.05\) as the initial values, and fix \(\mu_{R_{1} R_{2}} = 0.01\), \(\mu_{R_{2} R_{3}} = 0.001\). Then I apply the IF2 algorithm to do a local search of the likelihood surface.

library(foreach)
library(doParallel)
registerDoParallel()

# run_level controls computational effort: 1 = debugging, 3 = full run
run_level <- 3
switch(run_level, {
  virus_Np=100; virus_Nmif=10; virus_Neval=10;
  virus_Nglobal=10; virus_Nlocal=10
},{
  virus_Np=20000; virus_Nmif=100; virus_Neval=10;
  virus_Nglobal=10; virus_Nlocal=10
},{
  virus_Np=60000; virus_Nmif=300; virus_Neval=10;
  virus_Nglobal=100; virus_Nlocal=20}
)

# starting values; mu_R1 and mu_R2 are held fixed during the search
virus_mle <- c(Beta=5,mu_IR=0.05,rho=0.05,mu_R1=0.01,mu_R2=0.001)

virus_rw.sd <- 0.02; virus_cooling.fraction.50 <- 0.5
stew(file=sprintf("final_local_search-%d.rda",run_level),{
  t_local <- system.time({
    mifs_local <- foreach(i=1:virus_Nlocal,
                          .packages='pomp', .combine=c) %dopar%  {
                            # virus3 is the pomp object for the extended model,
                            # built analogously to virus2 (see the sketch above)
                            mif2(virus3,
                                 params=virus_mle,
                                 Np=virus_Np,
                                 Nmif=virus_Nmif,
                                 cooling.fraction.50=virus_cooling.fraction.50,
                                 rw.sd=rw.sd(
                                   Beta=virus_rw.sd,
                                   mu_IR=virus_rw.sd,
                                   rho=virus_rw.sd)
                            )
                          }
  })
},seed=900242057,kind="L'Ecuyer")

stew(file=sprintf("final_lik_local-%d.rda",run_level),{
  t_local_eval <- system.time({
    liks_local <- foreach(i=1:virus_Nlocal,
                          .combine=rbind,.packages='pomp')%dopar% {
                            # re-estimate the log-likelihood at each search
                            # result by replicated particle filters
                            evals <- replicate(virus_Neval, logLik(
                              pfilter(virus3,params=coef(mifs_local[[i]]),Np=virus_Np)))
                            logmeanexp(evals, se=TRUE)
                          }
  })
},seed=900242057,kind="L'Ecuyer")
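The summary and diagnostic plots below can be produced along these lines (a sketch; liks_local holds the log-likelihood estimate and its standard error in columns 1 and 2):

results_local <- data.frame(logLik = liks_local[, 1],
                            logLik_se = liks_local[, 2],
                            t(sapply(mifs_local, coef)))
summary(results_local$logLik, digits = 5)
plot(mifs_local)  # IF2 convergence diagnostics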

Below is a summary of the log-likelihood estimates from the local search.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -550.8  -549.7  -549.6  -549.4  -549.1  -547.7

The plot shows that the log-likelihood does not converge at the peak value for every run. The maximum value reached is -547.7, which is better than the first model.

3.2.2 Global Search

We also try different starting values to see whether a better set of parameters can be found. Based on the results of the local search, I draw starting values from \(\beta \in [0.001, 20]\), \(\mu_{IR} \in [0.045, 0.055]\), \(\rho \in [0.01, 0.1]\).

# box of random starting values for the global search
virus_box <- rbind(
  Beta=c(0.001,20),
  mu_IR=c(0.045,0.055),
  rho = c(0.01,0.1)
)

virus_fixed_params <- c(mu_R1=0.01, mu_R2=0.001)

stew(file=sprintf("final_box_eval-%d.rda",run_level),{
  t_global <- system.time({
    mifs_global <- foreach(i=1:virus_Nglobal,
                           .combine=c,.packages='pomp') %dopar% {
                             # reuse the IF2 settings of the local search,
                             # restarting from a random draw inside virus_box
                             mif2(
                               mifs_local[[1]],
                               params=c(
                                 apply(virus_box,1,function(x)runif(1,x[1],x[2])),
                                 virus_fixed_params)
                             )}
  })
},seed=1270401374,kind="L'Ecuyer")

stew(file=sprintf("final_lik_global_eval-%d.rda",run_level),{
  t_global_eval <- system.time({
    liks_global <- foreach(i=1:virus_Nglobal,
                           .combine=rbind, .packages='pomp') %dopar% {
                             evals <- replicate(virus_Neval,
                                                logLik(pfilter(virus3,
                                                               params=coef(mifs_global[[i]]),Np=virus_Np)))
                             logmeanexp(evals, se=TRUE)
                           }
  })
},seed=442141592,kind="L'Ecuyer")
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -550.3  -549.7  -549.5  -549.5  -549.3  -547.5

In this case, the log-likelihood converges within a few iterations, although the effective sample size is not stable. The maximum log-likelihood reaches -547.5, which outperforms the simple SIR model. The pairs plot does not show any clear pattern between the log-likelihood and the parameters.
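The pairs plot referred to above can be reproduced roughly as follows (a sketch):

results_global <- data.frame(logLik = liks_global[, 1],
                             logLik_se = liks_global[, 2],
                             t(sapply(mifs_global, coef)))
pairs(~logLik + Beta + mu_IR + rho,
      data = subset(results_global, logLik > max(logLik) - 250))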

4 GARCH Model

Besides the POMP model, I also want to fit some other models commonly used in time series analysis. I first try a GARCH(1,1) model using the fGarch package\(^{[3]}\). For the search-interest series \(Y_n\), the GARCH(1,1) model takes the form
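\[ Y_n = \mu + \epsilon_n, \qquad \epsilon_n = \sigma_n \delta_n, \qquad \sigma_n^2 = \omega + \alpha_1 \epsilon_{n-1}^2 + \beta_1 \sigma_{n-1}^2, \]

where \(\{\delta_n\}\) is an i.i.d. \(N(0,1)\) sequence under the Gaussian conditional distribution used here. The fitted result is shown below.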

require(fGarch)
# fit the default GARCH(1,1) specification to the raw series
mod1 <- garchFit(data=c(data$V2),grad = "numerical",trace=FALSE)
summary(mod1)
## 
## Title:
##  GARCH Modelling 
## 
## Call:
##  garchFit(data = c(data$V2), trace = FALSE, grad = "numerical") 
## 
## Mean and Variance Equation:
##  data ~ garch(1, 1)
## <environment: 0x7ff32ef9f138>
##  [data = c(data$V2)]
## 
## Conditional Distribution:
##  norm 
## 
## Coefficient(s):
##         mu       omega      alpha1       beta1  
## 21.6806918   4.5987284   1.0000000   0.0082884  
## 
## Std. Errors:
##  based on Hessian 
## 
## Error Analysis:
##         Estimate  Std. Error  t value Pr(>|t|)    
## mu     21.680692    1.012684   21.409  < 2e-16 ***
## omega   4.598728    3.164931    1.453    0.146    
## alpha1  1.000000    0.183842    5.439 5.34e-08 ***
## beta1   0.008288    0.085644    0.097    0.923    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log Likelihood:
##  -343.224    normalized:  -3.856449 
## 
## Description:
##  Wed Apr 29 11:01:30 2020 by user:  
## 
## 
## Standardised Residuals Tests:
##                                 Statistic p-Value     
##  Jarque-Bera Test   R    Chi^2  10.96227  0.00416459  
##  Shapiro-Wilk Test  R    W      0.8437249 2.921234e-08
##  Ljung-Box Test     R    Q(10)  183.5094  0           
##  Ljung-Box Test     R    Q(15)  204.9081  0           
##  Ljung-Box Test     R    Q(20)  263.6242  0           
##  Ljung-Box Test     R^2  Q(10)  19.71943  0.03202096  
##  Ljung-Box Test     R^2  Q(15)  25.74635  0.04077484  
##  Ljung-Box Test     R^2  Q(20)  30.20732  0.0665632   
##  LM Arch Test       R    TR^2   18.9      0.09097059  
## 
## Information Criterion Statistics:
##      AIC      BIC      SIC     HQIC 
## 7.802786 7.914635 7.798973 7.847869

We find that the log-likelihood of the GARCH(1,1) model is -343.224.

5 ARIMA Model

I also try an ARIMA model. To simplify the process, I use the auto.arima function in the forecast package to select the best model.

require(forecast)
auto.arima(data$V2)
## Series: data$V2 
## ARIMA(3,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     ar2     ar3     mean
##       0.4263  0.0986  0.3469  19.8064
## s.e.  0.0984  0.1079  0.0986   7.1793
## 
## sigma^2 estimated as 94.1:  log likelihood=-327.11
## AIC=664.23   AICc=664.95   BIC=676.67

It suggests an ARIMA(3,0,0) model with a non-zero mean, which has log-likelihood -327.11.
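Written with the backshift operator \(B\) and the estimated coefficients, the fitted model is approximately

\[ (1 - 0.4263 B - 0.0986 B^2 - 0.3469 B^3)(Y_n - 19.8064) = \epsilon_n, \qquad \epsilon_n \sim N(0,\, 94.1) \]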

6 Conclusion

In this project, I try two different multi-compartment POMP models, a GARCH model, and an ARIMA model to fit the Google Trends data for the keyword “virus”. It turns out that the ARIMA(3,0,0) model is the best fit for the data, with the largest log-likelihood, -327.11. Since most of the effort was spent on finding a POMP model, my SIR model may have some shortcomings. For further analysis, we can consider different structures of POMP model.

7 References

\([1]\) Google Trends, search interest for “virus” in the United States: https://trends.google.com/trends/explore?q=virus&geo=US

\([2]\) Course notes and slides: https://ionides.github.io/531w20/

\([3]\) fGarch package documentation: https://cran.r-project.org/web/packages/fGarch/fGarch.pdf