The Covid-19 pandemic has been spreading all over the world for several months. Since it is caused by a brand new virus, people search for it on Google and other search engines to learn more about it. In this project, I collect data from Google Trends\(^{[1]}\); the data records the search interest in the keyword “virus” in the United States. By applying time series analysis, we may find some interesting patterns in how people follow this hot topic.
First, we can take a brief look at the time series data.
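The loading and plotting code is not shown in this report; below is a minimal sketch of how the Google Trends export could be read into the data frame data with columns day and V2 used later. The file name multiTimeline.csv, the skip value, and the column handling are assumptions about the download, not part of the original analysis.
# Sketch only: load the Google Trends export and plot the search interest.
# "multiTimeline.csv", skip = 3 and the column layout are assumptions.
data <- read.csv("multiTimeline.csv", skip = 3, header = FALSE)
# V1 = date, V2 = search interest; add an integer day index for later use
data$day <- seq_len(nrow(data))
plot(data$day, data$V2, type = "l",
     xlab = "Day", ylab = "Search interest",
     main = "Google Trends for 'virus' in the United States")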
The dataset contains the search interest of the past 90 days. There are two peaks of search interest. According to the timeline, one may correspond to the time when Covid-19 first appeared and the other to the time when the outbreak reached the United States. In the following sections, I use time series analysis to study the underlying pattern.
First of all, I would like to build a POMP model to fit the data. In particular, I apply the SIR model introduced in the course\(^{[2]}\).
The SIR model has three compartments, representing susceptible, infected, and recovered individuals when studying flu data. In this project, I redefine the three compartments as:
Use \(N\) to denote the number of individuals. \(N_{SI}(t)\) and \(N_{IR}(t)\) count the individuals who have transitioned from S to I and from I to R, respectively. Also, to simplify the model, I follow the setting in the Lecture 11 slides that
\[ \mu_{\bullet S} = \mu_{S \bullet} = \mu_{I \bullet} = \mu_{R \bullet} = 0 \]
The ODEs for the counting processes can be written as
\[ \frac{dN_{SI}}{dt} = \frac{\beta I}{N} S \] \[ \frac{dN_{IR}}{dt} = \gamma I \]
To solve this numerically, I use the Euler method combined with a binomial approximation with exponential transition probabilities:
\[ N_{SI}(t+\delta) = N_{SI}(t) + \mathrm{Binomial}\left(S(t),\, 1 - \exp\left(-\frac{\beta I}{N} \delta\right)\right)\] \[ N_{IR}(t+\delta) = N_{IR}(t) + \mathrm{Binomial}\left(I(t),\, 1 - \exp(-\gamma \delta)\right)\]
The model is implemented using the R package pomp.
virus_rprocess <- "
// one Euler step: binomial draws for the S->I and I->R transitions
double dN_SI = rbinom(S,1-exp(-Beta*I/(S+I+R)*dt));
double dN_IR = rbinom(I,1-exp(-dt*mu_IR));
S -= dN_SI;
I += dN_SI - dN_IR;
R += dN_IR;
"
virus_dmeasure <- "
// Poisson measurement density with reporting rate rho
lik = dpois(V2,rho*R+1e-10,give_log);
"
virus_rmeasure <- "
// draw simulated observations from the same Poisson measurement model
V2 = rpois(rho*R+1e-10);
"
virus_rinit <- "
S=762;
I=1;
R=0;
"
virus_statenames <- c("S","I","R")
virus_paramnames <- c("Beta","mu_IR","rho")
virus2 <- pomp(
data=subset(data,select=c(day,V2)),
times="day",
t0=0,
rprocess=euler(
step.fun=Csnippet(virus_rprocess),
delta.t=1/12),
rmeasure=Csnippet(virus_rmeasure),
dmeasure=Csnippet(virus_dmeasure),
partrans=parameter_trans(
log=c("Beta","mu_IR"),
logit="rho"),
statenames=virus_statenames,
paramnames=virus_paramnames,
rinit=Csnippet(virus_rinit)
)
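As a quick sanity check before fitting, we can simulate a few trajectories from virus2 and overlay them on the data. This is a minimal sketch; the parameter values used here are placeholders for illustration only.
# Sketch only: simulate from the model at guessed parameter values
library(ggplot2)
sims <- simulate(virus2,
  params = c(Beta = 40, mu_IR = 0.005, rho = 0.04),
  nsim = 5, format = "data.frame", include.data = TRUE)
ggplot(sims, aes(x = day, y = V2, group = .id, color = .id == "data")) +
  geom_line() +
  theme(legend.position = "none")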
We can use a slice design to find appropriate initial parameters for the model.
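The slice-design code is not included in the report output; a minimal sketch of how such a slice study could be set up is shown below. It assumes pomp's slice_design function (called sliceDesign in pomp versions before 2.0); the center point and parameter ranges are placeholders.
# Sketch only: evaluate the particle-filter log-likelihood along slices
# through a placeholder center point, one parameter at a time.
library(foreach)
library(doParallel)
registerDoParallel()
slices <- slice_design(
  center = c(Beta = 40, mu_IR = 0.005, rho = 0.04),
  Beta  = rep(seq(10, 60, length = 20), each = 3),
  mu_IR = rep(seq(0.001, 0.01, length = 20), each = 3),
  rho   = rep(seq(0.01, 0.1, length = 20), each = 3))
slices <- foreach(theta = iter(slices, "row"), .combine = rbind,
  .packages = "pomp") %dopar% {
  pf <- pfilter(virus2,
    params = unlist(theta[c("Beta", "mu_IR", "rho")]), Np = 1000)
  theta$loglik <- logLik(pf)
  theta
}
# one panel per sliced parameter
for (v in c("Beta", "mu_IR", "rho")) {
  dat <- subset(slices, slice == v)
  plot(dat[[v]], dat$loglik, xlab = v, ylab = "log-likelihood")
}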
(Figure: likelihood slice plots, one panel each for \(\beta\), \(\mu_{IR}\), and \(\rho\).)
We can find a peak for \(\mu_{IR}\) and \(\rho\), but there is no clear peak for \(\beta\).
Based on the slice design above, I set \(\beta = 40, \mu_{IR} = 0.005, \rho = 0.04\) as the initial values. I then apply the IF2 algorithm to carry out a local search of the likelihood surface.
run_level <- 3   # computational intensity: 1 = quick test, 3 = full run
switch(run_level, {
virus_Np=100; virus_Nmif=10; virus_Neval=10;
virus_Nglobal=10; virus_Nlocal=10
},{
virus_Np=20000; virus_Nmif=100; virus_Neval=10;
virus_Nglobal=10; virus_Nlocal=10
},{
virus_Np=60000; virus_Nmif=300; virus_Neval=10;
virus_Nglobal=100; virus_Nlocal=20}
)
virus_mle <- c(Beta=40,mu_IR=0.005,rho=0.04)
virus_rw.sd <- 0.02; virus_cooling.fraction.50 <- 0.5
stew(file=sprintf("final_2_local_search-%d.rda",run_level),{
t_local <- system.time({
mifs_local <- foreach(i=1:virus_Nlocal,
.packages='pomp', .combine=c) %dopar% {
mif2(virus2,
params=virus_mle,
Np=virus_Np,
Nmif=virus_Nmif,
cooling.fraction.50=virus_cooling.fraction.50,
rw.sd=rw.sd(
Beta=virus_rw.sd,
mu_IR=virus_rw.sd,
rho=virus_rw.sd)
)
}
})
},seed=900242057,kind="L'Ecuyer")
stew(file=sprintf("final_2_lik_local-%d.rda",run_level),{
t_local_eval <- system.time({
liks_local <- foreach(i=1:virus_Nlocal,
.combine=rbind,.packages='pomp')%dopar% {
evals <- replicate(virus_Neval, logLik(
pfilter(virus2,params=coef(mifs_local[[i]]),Np=virus_Np)))
logmeanexp(evals, se=TRUE)
}
})
},seed=900242057,kind="L'Ecuyer")
We can visualize the iteration results and also the pairs plots of log-likelihood against the parameters.
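The code producing the summary and the plots is not shown above; a minimal sketch using the objects mifs_local and liks_local created earlier could look like this (the data frame name results_local is illustrative).
# Sketch only: gather parameter estimates and likelihoods, then plot
results_local <- data.frame(logLik = liks_local[, 1],
  logLik_se = liks_local[, 2],
  t(sapply(mifs_local, coef)))
plot(mifs_local)   # iterated-filtering convergence diagnostics
pairs(~logLik + Beta + mu_IR + rho, data = results_local)
summary(results_local$logLik, digits = 5)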
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -621.4 -621.3 -621.1 -621.2 -621.1 -621.0
The likelihood converges within the first 50 iterations and reaches a maximum value of -621.0. However, \(\beta\) does not converge, and the pairs plots show that the log-likelihood increases as \(\beta\) increases. Therefore, I will consider larger starting values of \(\beta\) in the global search in the next section.
To confirm the result found above, I try different starting values in a global likelihood search. Because \(\beta\) did not converge in the local search, I set a large range \(\beta \in [20, 50]\). We again visualize the iteration results and the pairs plots of log-likelihood against the parameters.
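The code for this global search of the first model is not shown; a sketch following the same pattern as the global search code given later could look like the following. Only the \(\beta\) range [20, 50] comes from the text; the other ranges and the object names virus_box1, mifs_global1, and liks_global1 are placeholders.
# Sketch only: draw random starting values from a box, re-run mif2,
# then evaluate the likelihood at each resulting parameter set
virus_box1 <- rbind(
  Beta  = c(20, 50),
  mu_IR = c(0.001, 0.01),
  rho   = c(0.01, 0.1))
mifs_global1 <- foreach(i = 1:virus_Nglobal,
  .combine = c, .packages = "pomp") %dopar% {
  mif2(mifs_local[[1]],
    params = apply(virus_box1, 1, function(x) runif(1, x[1], x[2])))
}
liks_global1 <- foreach(i = 1:virus_Nglobal, .combine = rbind,
  .packages = "pomp") %dopar% {
  logmeanexp(replicate(virus_Neval,
    logLik(pfilter(virus2, params = coef(mifs_global1[[i]]),
      Np = virus_Np))), se = TRUE)
}
summary(liks_global1[, 1], digits = 5)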
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -621.5 -621.2 -621.1 -621.2 -621.1 -621.0
The result is similar to that of the local search. The maximum log-likelihood reaches -621.0, but the same problem appears: \(\beta\) does not converge, and the pairs plots show that a larger \(\beta\) leads to a larger likelihood.
Although the log-likelihood converges when using the SIR model, it may not be the best result because some parameters do not converge. In this section, I add more compartments, similar to the model in notes13\(^{[2]}\). The compartments \(R_1, R_2, R_3\) can represent:
And the number of individuals transitioning from \(R_k\) to \(R_{k+1}\) also follows
\[ N_{R_k R_{k+1}}(t+\delta) = N_{R_k R_{k+1}}(t) + \mathrm{Binomial}\left(R_k(t),\, 1 - \exp(- \mu_{R_k R_{k+1}} \delta)\right)\]
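The process-model code for this extended model is not shown in the report; below is a minimal sketch of how its rprocess step might look, following the bsflu example in the course notes. The state names S, I, R1, R2, R3 and the parameter names Beta, mu_IR, mu_R1, mu_R2 are taken from the fitting code below, but the snippet itself is an assumption, not the original code.
bsflu_rprocess <- "
// Sketch only: Euler-binomial steps for S -> I -> R1 -> R2 -> R3
double dN_SI   = rbinom(S,  1-exp(-Beta*I/(S+I+R1+R2+R3)*dt));
double dN_IR1  = rbinom(I,  1-exp(-dt*mu_IR));
double dN_R1R2 = rbinom(R1, 1-exp(-dt*mu_R1));
double dN_R2R3 = rbinom(R2, 1-exp(-dt*mu_R2));
S  -= dN_SI;
I  += dN_SI  - dN_IR1;
R1 += dN_IR1 - dN_R1R2;
R2 += dN_R1R2 - dN_R2R3;
R3 += dN_R2R3;
"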
Similarly, I use a slice design to find appropriate initial values for the parameters.
(Figure: likelihood slice plots for the five parameters of the extended model.)
When \(\beta\) becomes larger than 0.05, it has little influence on the log-likelihood, and we cannot find a specific relation between \(\mu_{R_2 R_3}\) and the log-likelihood.
I set \(\beta = 5, \mu_{IR} = 0.05, \rho = 0.05\) as the initial values, and fix \(\mu_{R_{1} R_{2}} = 0.01, \mu_{R_{2} R_{3}} = 0.001\). Then I apply the IF2 algorithm to carry out a local search of the likelihood surface.
run_level <- 3
switch(run_level, {
bsflu_Np=100; bsflu_Nmif=10; bsflu_Neval=10;
bsflu_Nglobal=10; bsflu_Nlocal=10
},{
bsflu_Np=20000; bsflu_Nmif=100; bsflu_Neval=10;
bsflu_Nglobal=10; bsflu_Nlocal=10
},{
bsflu_Np=60000; bsflu_Nmif=300; bsflu_Neval=10;
bsflu_Nglobal=100; bsflu_Nlocal=20}
)
bsflu_mle <- c(Beta=5,mu_IR=0.05,rho=0.05,mu_R1=0.01,mu_R2=0.001)
bsflu_rw.sd <- 0.02; bsflu_cooling.fraction.50 <- 0.5
stew(file=sprintf("final_local_search-%d.rda",run_level),{
t_local <- system.time({
mifs_local <- foreach(i=1:bsflu_Nlocal,
.packages='pomp', .combine=c) %dopar% {
mif2(bsflu2,
params=bsflu_mle,
Np=bsflu_Np,
Nmif=bsflu_Nmif,
cooling.fraction.50=bsflu_cooling.fraction.50,
rw.sd=rw.sd(
Beta=bsflu_rw.sd,
mu_IR=bsflu_rw.sd,
rho=bsflu_rw.sd)
)
}
})
},seed=900242057,kind="L'Ecuyer")
stew(file=sprintf("final_lik_local-%d.rda",run_level),{
t_local_eval <- system.time({
liks_local <- foreach(i=1:bsflu_Nlocal,
.combine=rbind,.packages='pomp')%dopar% {
evals <- replicate(bsflu_Neval, logLik(
pfilter(bsflu2,params=coef(mifs_local[[i]]),Np=bsflu_Np)))
logmeanexp(evals, se=TRUE)
}
})
},seed=900242057,kind="L'Ecuyer")
Below is a visualization of the results.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -550.8 -549.7 -549.6 -549.4 -549.1 -547.7
The plot shows that the log-likelihood does not converge exactly at the peak value; the maximum value it reaches is -547.7, which is better than the first model.
We also try different starting values to see whether we can find a better set of parameters. Based on the result of the local search, I set \(\beta \in [0.001, 20], \mu_{IR} \in [0.045, 0.055], \rho \in [0.01, 0.1]\).
virus_box <- rbind(
Beta=c(0.001,20),
mu_IR=c(0.045,0.055),
rho = c(0.01,0.1)
)
virus_fixed_params <- c(mu_R1=0.01, mu_R2=0.001)
stew(file=sprintf("final_box_eval-%d.rda",run_level),{
t_global <- system.time({
mifs_global <- foreach(i=1:virus_Nglobal,
.combine=c,.packages='pomp') %dopar% {
mif2(
mifs_local[[1]],
params=c(
apply(virus_box,1,function(x)runif(1,x[1],x[2])),
virus_fixed_params)
)}
})
},seed=1270401374,kind="L'Ecuyer")
stew(file=sprintf("final_lik_global_eval-%d.rda",run_level),{
t_global_eval <- system.time({
liks_global <- foreach(i=1:virus_Nglobal,
.combine=rbind, .packages='pomp') %dopar% {
evals <- replicate(virus_Neval,
logLik(pfilter(virus2,
params=coef(mifs_global[[i]]),Np=virus_Np)))
logmeanexp(evals, se=TRUE)
}
})
},seed=442141592,kind="L'Ecuyer")
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -550.3 -549.7 -549.5 -549.5 -549.3 -547.5
In this case, the log-likelihood converges in a few iterations, although the effective sample size is not stable. The maximum likelihood reaches -547.5, which outperforms the first simple SIR model. The pairs plot does not show any specific pattern between the log-likelihood and the parameters.
Besides the POMP models, I also want to use some other common time series models to fit the data. I first try a GARCH(1,1) model using the fGarch package\(^{[3]}\). The result is shown below.
require(fGarch)
mod1 <- garchFit(data=c(data$V2),grad = "numerical",trace=FALSE)
summary(mod1)
##
## Title:
## GARCH Modelling
##
## Call:
## garchFit(data = c(data$V2), trace = FALSE, grad = "numerical")
##
## Mean and Variance Equation:
## data ~ garch(1, 1)
## <environment: 0x7ff32ef9f138>
## [data = c(data$V2)]
##
## Conditional Distribution:
## norm
##
## Coefficient(s):
## mu omega alpha1 beta1
## 21.6806918 4.5987284 1.0000000 0.0082884
##
## Std. Errors:
## based on Hessian
##
## Error Analysis:
## Estimate Std. Error t value Pr(>|t|)
## mu 21.680692 1.012684 21.409 < 2e-16 ***
## omega 4.598728 3.164931 1.453 0.146
## alpha1 1.000000 0.183842 5.439 5.34e-08 ***
## beta1 0.008288 0.085644 0.097 0.923
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log Likelihood:
## -343.224 normalized: -3.856449
##
## Description:
## Wed Apr 29 11:01:30 2020 by user:
##
##
## Standardised Residuals Tests:
## Statistic p-Value
## Jarque-Bera Test R Chi^2 10.96227 0.00416459
## Shapiro-Wilk Test R W 0.8437249 2.921234e-08
## Ljung-Box Test R Q(10) 183.5094 0
## Ljung-Box Test R Q(15) 204.9081 0
## Ljung-Box Test R Q(20) 263.6242 0
## Ljung-Box Test R^2 Q(10) 19.71943 0.03202096
## Ljung-Box Test R^2 Q(15) 25.74635 0.04077484
## Ljung-Box Test R^2 Q(20) 30.20732 0.0665632
## LM Arch Test R TR^2 18.9 0.09097059
##
## Information Criterion Statistics:
## AIC BIC SIC HQIC
## 7.802786 7.914635 7.798973 7.847869
We can see that the log-likelihood of the GARCH(1,1) model is -343.224.
I also try an ARIMA model. To simplify the process, I use the auto.arima function in the forecast package to find the best model.
require(forecast)
auto.arima(data$V2)
## Series: data$V2
## ARIMA(3,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 mean
## 0.4263 0.0986 0.3469 19.8064
## s.e. 0.0984 0.1079 0.0986 7.1793
##
## sigma^2 estimated as 94.1: log likelihood=-327.11
## AIC=664.23 AICc=664.95 BIC=676.67
It suggests the ARIMA(3,0,0) model, which has a log-likelihood of -327.11.
In this project, I fit two different multi-compartment POMP models, a GARCH model, and an ARIMA model to the Google Trends data for the keyword “virus”. It turns out that the ARIMA(3,0,0) model fits the data best, with the largest log-likelihood of -327.11. Since most of the effort was spent on finding a POMP model, my SIR models may have some shortcomings. For further analysis, we could consider different structures of POMP model.
\([1]\) https://trends.google.com/trends/explore?q=virus&geo=US
\([2]\) https://ionides.github.io/531w20/
\([3]\) https://cran.r-project.org/web/packages/fGarch/fGarch.pdf