1. Introduction

In finance, investors are primarily concerned with the returns on their assets, particularly risky assets. Risky assets come in many forms, and in this project we focus on one specific form: stocks. Specifically, we study the daily price of the S&P 500 index over the past twenty years. The data were downloaded from finance.yahoo.com.

The S&P 500 is a stock index commonly referred to by investors. According to Wikipedia, it is calculated “based on the 500 large companies having common stock listed on the NYSE or NASDAQ” [1]. Investors usually regard the S&P 500 as an indicator of overall market behavior. To get a first impression of the index, we plot the data below.

X = read.table("SP500.csv", sep=",", header=TRUE)
sp500 = rev(X$Adj.Close)   # the file lists newest prices first; reverse to chronological order
plot(sp500, type='l', ylab="Adjusted Closing Price", main="Time plot of SP500")


Since the returns on the asset are what we really care about, we compute the returns of the S&P 500 below, followed by a plot of the log-return series over the past twenty years. Notice that we use log-returns, which is conventional in financial analysis; this is valid because \(\log(1+x)\approx x\) for small \(x\), and daily returns are usually very close to zero.
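To make the approximation explicit (a standard identity; here \(P_t\) denotes the adjusted closing price and \(R_t\) the ordinary return):

\[ r_t = \log P_t - \log P_{t-1} = \log\left(\frac{P_t}{P_{t-1}}\right) = \log(1+R_t) \approx R_t \quad \text{for small } R_t. \]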

sp500lrt = diff(log(sp500))   # log-returns: r_t = log(P_t) - log(P_{t-1})
plot(sp500lrt, type='l', ylab="log-returns", main="Time plot of log-returns of SP500")


People often model market returns as normally distributed random variables, yet the plot above suggests this model does not describe the returns well. There are clearly periods of higher volatility followed by periods of lower volatility, and so on: high volatility tends to be followed by high volatility, and low volatility by low volatility, a phenomenon known as volatility clustering. Many models have been developed to capture this kind of behavior.
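A quick diagnostic makes this clustering visible (a minimal sketch using the sp500lrt series computed above): the log-returns themselves show little linear autocorrelation, while their absolute values are strongly and persistently autocorrelated.

acf(sp500lrt, main="ACF of log-returns")                 # little linear autocorrelation
acf(abs(sp500lrt), main="ACF of absolute log-returns")   # strong persistence: volatility clustering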

In the following sections of the report, we will study the SV-in-Mean model[2][3], where SV stands for stochastic volatility.



2. SV-in-Mean Model

According to Koopman et al., “the variance in the SV model is modelled as an unobserved component that follows some stochastic process” [3]. More specifically, the SV-in-Mean model “incorporates the unobserved volatility as an explanatory variable in the mean equation” [3].

The model is given by the following form:

\[ \begin{align} Y_t& = d\cdot H_t + \exp(H_t/2)u_t,\\ H_t& = \mu + \phi(H_{t-1}-\mu) + \eta_t,\\ u_t& \sim \mathcal{N}(0,1),\\ \eta_t& \sim \mathcal{N}(0,\sigma_{\eta}^2), \end{align} \] where \(d\), \(\mu\), \(\phi\), and \(\sigma_{\eta}\) are constant parameters [2].

In the above model, \(Y_t\) is the log-return of the market at time \(t\), and \(H_t\) is the log-volatility.
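Note that \(H_t\) is a Gaussian AR(1) process, so by standard AR(1) results (assuming \(|\phi|<1\)) it has a stationary distribution with

\[ \mathbb{E}[H_t] = \mu, \qquad \operatorname{Var}(H_t) = \frac{\sigma_{\eta}^2}{1-\phi^2}. \]

This will be useful when choosing plausible parameter ranges later: \(\mu\) sets the baseline log-volatility, and \(\exp(\mu/2)\) is roughly the typical scale of the daily returns.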

Notice that \(Y_t\) is the observed process, since we can compute the log-returns directly from the data, while \(H_t\) is a latent process. Hence the above model is a partially observed Markov process (POMP) model. To fit this model to our data, we use the pomp package in R.
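For completeness, the POMP structure amounts to the standard factorization of the joint density (with \(H_{0:N}\) latent and \(Y_{1:N}\) observed)

\[ f(y_{1:N}, h_{0:N}) = f(h_0)\prod_{t=1}^{N} f(h_t \mid h_{t-1})\, f(y_t \mid h_t), \]

which is exactly what the rprocess component (the \(H_t\) dynamics) and the dmeasure component (the density of \(Y_t\) given \(H_t\)) of a pomp object encode.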

require(pomp)


3. Building the Pomp Model

In this section, we construct a pomp object for the above model and the data. A quick simulation will then serve as a sanity check of the model.


sp500_statenames  <-  c("H","Y_state")
sp500_rp_names    <-  c("d","mu","phi","sigma_eta")
sp500_ivp_names   <-  c("H_0")
sp500_paramnames  <-  c(sp500_rp_names,sp500_ivp_names)
sp500_covarnames  <-  "covaryt"

rproc1 <- "
double eta;
eta = rnorm(0, sigma_eta);
H = mu + phi*(H - mu) + eta;
"
rproc2.sim <- "
Y_state = rnorm(d*H, exp(H/2));
"
rproc2.filt <- "
Y_state = covaryt;
"
sp500_rproc.sim   <-  paste(rproc1,rproc2.sim)
sp500_rproc.filt  <-  paste(rproc1,rproc2.filt)

sp500_initializer <- "
H = H_0;
Y_state = rnorm(d*H, exp(H/2));
"

sp500_rmeasure <- "
y = Y_state;
"
sp500_dmeasure <- "
lik = dnorm(y, d*H, exp(H/2), give_log);
"

sp500_toEstimationScale <- "
Td = d;
Tmu = mu;
Tphi = logit(phi);
Tsigma_eta = log(sigma_eta);
"
sp500_fromEstimationScale <- "
Td = d;
Tmu = mu;
Tphi = expit(phi);
Tsigma_eta = exp(sigma_eta);
"

sp500.filt <- pomp(data=data.frame(y=sp500lrt,
                                   time=1:length(sp500lrt)),
                   statenames=sp500_statenames,
                   paramnames=sp500_paramnames,
                   covarnames=sp500_covarnames,
                   times="time",
                   t0=0,
                   covar=data.frame(covaryt=c(0,sp500lrt),
                                    time=0:length(sp500lrt)),
                   tcovar="time",
                   rmeasure=Csnippet(sp500_rmeasure),
                   dmeasure=Csnippet(sp500_dmeasure),
                   rprocess=discrete.time.sim(step.fun=Csnippet(sp500_rproc.filt),delta.t=1),
                   initializer=Csnippet(sp500_initializer),
                   toEstimationScale=Csnippet(sp500_toEstimationScale), 
                   fromEstimationScale=Csnippet(sp500_fromEstimationScale)
)

expit<-function(real){1/(1+exp(-real))}
logit<-function(p.arg){log(p.arg/(1-p.arg))}
params_test <- c(
  d         = 0.0001,  
  mu        = -9,       
  phi       = expit(2),
  sigma_eta = exp(-0.8),
  H_0       = 0
)
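As a quick sanity check on this starting guess (not part of the fitting code), we can inspect the natural-scale values implied by params_test:

expit(2)    # phi is about 0.881: strongly persistent log-volatility
exp(-0.8)   # sigma_eta is about 0.449
exp(-9/2)   # exp(mu/2) is about 0.011, i.e. a baseline daily volatility near 1.1%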

sim1.sim <- pomp(sp500.filt, 
                 statenames=sp500_statenames,
                 paramnames=sp500_paramnames,
                 covarnames=sp500_covarnames,
                 rprocess=discrete.time.sim(step.fun=Csnippet(sp500_rproc.sim),delta.t=1)
)

sim1.sim <- simulate(sim1.sim,seed=1,params=params_test)

plot(Y_state~time, data=sim1.sim, type='l', col='red', xlim=c(200,4900), ylim=c(-0.1,0.1), main="Observed Log-returns vs Simulated Results", ylab="Log-returns", xlab="Index")
lines(sp500lrt)
legend(210,0.1, c("Observed Log-returns","Simulated Values"), col=c("black","red"), lty=c(1,1))


plot(Y_state~time, data=sim1.sim, type='l', col='red', xlim=c(200,4900), ylim=c(-0.1,0.1), ylab="", main="Volatility and Log-returns", xlab="Index")
lines(exp(H/2)~time,data=sim1.sim)
legend(210,0.1, c("Volatility","Log-returns"), col=c("black","red"), lty=c(1,1))


The behavior of the simulation actually varies considerably with the choice of parameter values, and I tried several combinations of parameters before arriving at the result above. Since the simulated results look reasonable for this set of parameters, it makes sense in the later part of the project to start the local search from this set, and to refer to it when determining the box for the global search.


# Build a filtering object from the simulated data: the simulated observations
# become the covariate consumed by sp500_rproc.filt.
sim1.filt <- pomp(sim1.sim, 
                  covar=data.frame(
                    covaryt=c(obs(sim1.sim),NA),
                    time=c(timezero(sim1.sim),time(sim1.sim))),
                  tcovar="time",
                  statenames=sp500_statenames,
                  paramnames=sp500_paramnames,
                  covarnames=sp500_covarnames,
                  rprocess=discrete.time.sim(step.fun=Csnippet(sp500_rproc.filt),delta.t=1)
)


4. Filtering on Simulated Data

In the previous section we built a pomp object for the model, and the test results look promising, so we proceed with this model. We now check whether we can filter and re-estimate parameters on the simulated data.


run_level <- 2
sp500_Np          <- c(100, 1e3, 2e3)   # number of particles
sp500_Nmif        <- c(10,  200, 300)   # number of mif2 iterations
sp500_Nreps_eval  <- c(4,   10,  20)    # replicates for likelihood evaluation
sp500_Nreps_local <- c(10,  20,  20)    # replicates for the local search
sp500_Nreps_global<- c(10,  20,  100)   # starting points for the global search
require(doParallel)
registerDoParallel()
stew(file=sprintf("xg_pf1_level%d.rda",run_level),{
t.pf1 <- system.time(
  pf1 <- foreach(i=1:sp500_Nreps_eval[run_level],.packages='pomp',
                 .options.multicore=list(set.seed=TRUE)) %dopar% try(
                   pfilter(sim1.filt,Np=sp500_Np[run_level])
                 )
)
},seed=493536993,kind="L'Ecuyer")
(L.pf1 <- logmeanexp(sapply(pf1,logLik),se=TRUE))
##                      se 
## -3355.97675    69.43929
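Here the first number is the log-likelihood estimate and the second is its Monte Carlo standard error. As an aside, logmeanexp averages the replicated particle-filter estimates on the likelihood scale rather than the log scale; a minimal sketch of the underlying computation (using the pf1 list from above) is:

ll <- sapply(pf1, logLik)
max(ll) + log(mean(exp(ll - max(ll))))   # numerically stable log-mean-exp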


5. Fitting the Model to the Data

To obtain highly accurate results, we should use a high run level.

run_level <- 3

sp500_rw.sd_rp <- 0.02
sp500_rw.sd_ivp <- 0.1
sp500_cooling.fraction.50 <- 0.5

stew("xg_01_mif1.rda",{
t.if1 <- system.time({
  if1 <- foreach(i=1:sp500_Nreps_local[run_level],
                 .packages='pomp', .combine=c,
                 .options.multicore=list(set.seed=TRUE)) %dopar% try(
                   mif2(sp500.filt,
                        start=params_test,
                        Np=sp500_Np[run_level],
                        Nmif=sp500_Nmif[run_level],
                        cooling.type="geometric",
                        cooling.fraction.50=sp500_cooling.fraction.50,
                        transform=TRUE,
                        rw.sd = rw.sd(
                          d         = sp500_rw.sd_rp,
                          mu        = sp500_rw.sd_rp,
                          phi       = sp500_rw.sd_rp,
                          sigma_eta = sp500_rw.sd_rp,
                          H_0       = ivp(sp500_rw.sd_ivp)
                        )
                   )
                 )
  
  L.if1 <- foreach(i=1:sp500_Nreps_local[run_level],.packages='pomp',
                   .combine=rbind,.options.multicore=list(set.seed=TRUE)) %dopar% 
                   {
                     logmeanexp(
                       replicate(sp500_Nreps_eval[run_level],
                                 logLik(pfilter(sp500.filt,params=coef(if1[[i]]),Np=sp500_Np[run_level]))
                       ),
                       se=TRUE)
                   }
})
},seed=318817883,kind="L'Ecuyer")

r.if1 <- data.frame(logLik=L.if1[,1],logLik_se=L.if1[,2],t(sapply(if1,coef)))

summary(r.if1$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14870   14891   14917   14936   14923   15485

pairs(~logLik+d+mu+phi+sigma_eta,data=r.if1)

The pairs plot shows what we expected: the log-likelihood exhibits a clear peak in each of the parameters.


At this point, however, we have only carried out a local search, and completely different sets of parameters might give a better log-likelihood. To determine whether other peaks appear when starting from different sets of parameters, we need a global search.

We set up a box, sp500_box, defining a search interval for each parameter; the initial values of the parameters are drawn uniformly at random from this box, and we observe which values the fits tend to converge to.


run_level <- 2

sp500_box <- rbind(
  d         = c(-1,1),
  mu        = c(-20,0),
  phi       = c(0,0.9999),
  sigma_eta = c(0,0.9999),
  H_0       = c(-0.5,0.5)
)

stew(file="xg_box_eval_level2.rda",{
t.box <- system.time({
  if.box <- foreach(i=1:sp500_Nreps_global[run_level],.packages='pomp',.combine=c,
                    .options.multicore=list(set.seed=TRUE)) %dopar%  
    mif2(
      if1[[1]],
      start=apply(sp500_box,1,function(x)runif(1,x[1],x[2]))
    )
  
  L.box <- foreach(i=1:sp500_Nreps_global[run_level],.packages='pomp',.combine=rbind,
                   .options.multicore=list(set.seed=TRUE)) %dopar% {
                     set.seed(87932+i)
                     logmeanexp(
                       replicate(sp500_Nreps_eval[run_level],
                                 logLik(pfilter(sp500.filt,params=coef(if.box[[i]]),Np=sp500_Np[run_level]))
                       ), 
                       se=TRUE)
                   }
})
},seed=290860873,kind="L'Ecuyer")

r.box <- data.frame(logLik=L.box[,1],logLik_se=L.box[,2],t(sapply(if.box,coef)))

summary(r.box$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11197   12882   13538   13568   13869   15487

pairs(~logLik+d+mu+phi+sigma_eta,data=r.box)


run_level <- 3

sp500_box <- rbind(
  d         = c(-0.1,0.1),
  mu        = c(-10,-8.5),
  phi       = c(0.8,0.9999),
  sigma_eta = c(0,0.4),
  H_0       = c(-0.5,0.5)
)

stew(file="xg_01_box_eval.rda",{
t.box <- system.time({
  if.box <- foreach(i=1:sp500_Nreps_global[run_level],.packages='pomp',.combine=c,
                    .options.multicore=list(set.seed=TRUE)) %dopar%  
    mif2(
      if1[[1]],
      start=apply(sp500_box,1,function(x)runif(1,x[1],x[2]))
    )
  
  L.box <- foreach(i=1:sp500_Nreps_global[run_level],.packages='pomp',.combine=rbind,
                   .options.multicore=list(set.seed=TRUE)) %dopar% {
                     set.seed(87932+i)
                     logmeanexp(
                       replicate(sp500_Nreps_eval[run_level],
                                 logLik(pfilter(sp500.filt,params=coef(if.box[[i]]),Np=sp500_Np[run_level]))
                       ), 
                       se=TRUE)
                   }
})
},seed=290860873,kind="L'Ecuyer")

r.box_3_1 <- data.frame(logLik=L.box[,1],logLik_se=L.box[,2],t(sapply(if.box,coef)))

summary(r.box_3_1$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14861   14901   14918   14969   14931   15850
pairs(~logLik+d+mu+phi+sigma_eta,data=r.box_3_1)

pairs(~logLik+d+mu+phi+sigma_eta,data=subset(r.box_3_1,logLik>max(logLik)-910))


plot(if.box)


The diagnostics above indicate that an even higher level of computational effort is needed. Therefore I increase the number of mif2 iterations and run the code again.

run_level <- 3
sp500_Np          <- c(100,1e3,2e3)
sp500_Nmif        <- c(10, 200,400)
sp500_Nreps_eval  <- c(4,  10,  20)
sp500_Nreps_local <- c(10, 20,  20)
sp500_Nreps_global<- c(10, 20, 100)

sp500_box <- rbind(
  d         = c(-0.1,0.1),
  mu        = c(-10,-8.5),
  phi       = c(0.8,0.9999),
  sigma_eta = c(0,0.4),
  H_0       = c(-0.5,0.5)
)

stew(file="xg_03_box_eval.rda",{
t.box <- system.time({
  if.box <- foreach(i=1:sp500_Nreps_global[run_level],.packages='pomp',.combine=c,
                    .options.multicore=list(set.seed=TRUE)) %dopar%  
    mif2(
      if1[[1]],
      start=apply(sp500_box,1,function(x)runif(1,x[1],x[2]))
    )
  
  L.box <- foreach(i=1:sp500_Nreps_global[run_level],.packages='pomp',.combine=rbind,
                   .options.multicore=list(set.seed=TRUE)) %dopar% {
                     set.seed(87932+i)
                     logmeanexp(
                       replicate(sp500_Nreps_eval[run_level],
                                 logLik(pfilter(sp500.filt,params=coef(if.box[[i]]),Np=sp500_Np[run_level]))
                       ), 
                       se=TRUE)
                   }
})
},seed=290860873,kind="L'Ecuyer")

r.box_3_3 <- data.frame(logLik=L.box[,1],logLik_se=L.box[,2],t(sapply(if.box,coef)))

summary(r.box_3_3$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14924   14934   14940   15025   14945   15991

plot(if.box)


pairs(~logLik+d+mu+phi+sigma_eta,data=subset(r.box_3_3,logLik>max(logLik)-1000))



6. Discussion on Results

Let us review the model and see how different sets of parameters influence its behavior. The model is as below.

\[ \begin{align} Y_t& = d\cdot H_t + \exp(H_t/2)u_t,\\ H_t& = \mu + \phi(H_{t-1}-\mu) + \eta_t,\\ u_t& \sim \mathcal{N}(0,1),\\ \eta_t& \sim \mathcal{N}(0,\sigma_{\eta}^2), \end{align} \] where \(d\), \(\mu\), \(\phi\), and \(\sigma_{\eta}\) are constants.


The first case we want to study is the set of parameters that gives the best log-likelihood, where \(\phi\) is around 0.9 and \(\sigma_{\eta}\) is around 0.4. Notice that the expression for \(H_t\) can be written as \[ H_t = \mu + \phi(H_{t-1}-\mu) + \eta_t = (1-\phi)\mu + \phi H_{t-1} + \eta_t. \] Therefore we can regard \(H_t\) as a weighted average of \(\mu\) and \(H_{t-1}\). With \(\phi\) close to one, the weight on \(\mu\) is much smaller than that on \(H_{t-1}\), so the log-volatility behaves almost like a random walk. The error term \(\eta_t\) still influences the log-volatility, since \(\sigma_{\eta}\) is clearly greater than 0, but its effect is weaker than that of a standard white noise process because its standard deviation is less than 1.
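To put rough numbers on this (a back-of-the-envelope calculation using the approximate point estimates \(\phi\approx 0.9\) and \(\sigma_{\eta}\approx 0.4\), so the figures are illustrative only), the stationary standard deviation of the log-volatility and the half-life of a volatility shock are

\[ \operatorname{sd}(H_t) = \frac{\sigma_{\eta}}{\sqrt{1-\phi^2}} \approx \frac{0.4}{\sqrt{1-0.81}} \approx 0.92, \qquad t_{1/2} = \frac{\log 0.5}{\log \phi} \approx \frac{-0.693}{-0.105} \approx 6.6, \]

i.e. volatility shocks are sizeable and take roughly a week (about 6.6 trading days) to decay by half, consistent with the clustering seen in the data.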

The resulting model in this case is reasonable for our data. \(\mu\) can be regarded as the “floor” of the log-volatility, and the term \(\phi H_{t-1}\) carries the memory of the previous log-volatility. This corresponds to the pattern, discussed in the previous section, that higher volatility tends to be followed by higher volatility and lower volatility by lower volatility. The \(\eta_t\) term may explain the sudden changes seen in the volatility.

Based on the above discussion, and considering that this fitted model maximizes the log-likelihood, we conclude that it is a reasonable model for our data.


The second case has \(\sigma_{\eta}\) very close to one and \(\phi\) close to 0, and yields the second-best tier of log-likelihoods. This set of parameters indicates a much stronger influence of the error term on the log-volatility. With \(\phi\) close to 0, the log-volatility has much less memory of the past and behaves more like the combination of a constant level \(\mu\) and the normally distributed error term.

This may also be a reasonable model for explaining the data, since this fitted model attains the second-best log-likelihood.


Since fitting with pomp is very time-consuming, we hope it gives a better result than a traditional fit. We take a GARCH model as the benchmark for assessing the overall goodness of fit.

require(tseries)
benchmark <- garch(sp500lrt, grad = "numerical", trace = FALSE)   # fits a GARCH(1,1) by default
L.benchmark <- logLik(benchmark)
L.benchmark
## 'log Lik.' 15111.16 (df=3)
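The SV-in-Mean fit above attained a maximized log-likelihood of roughly 15991 with 5 parameters, against 15111.16 with 3 parameters for the GARCH benchmark. As a rough check (a sketch only, using the point estimates reported above and ignoring the Monte Carlo error of the particle-filter likelihood), an AIC comparison gives the same ranking:

aic_sv    <- -2*15991    + 2*5   # about -31972 for the SV-in-Mean model; lower is better
aic_garch <- -2*15111.16 + 2*3   # about -30216 for the GARCH benchmark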

We can therefore conclude with some confidence that the SV-in-Mean pomp model fits our data better. It is also interesting that pomp reveals two fitted modes of the model, corresponding to different explanations of market volatility. In our case one mode performs better than the other, while for other data sets the other mode might describe the market better.



7. References

[1] Wikipedia contributors. (n.d.). S&P 500 Index. Retrieved from https://en.wikipedia.org/wiki/S%26P_500_Index

[2] Hautsch, N., Ou, Y. (July 2008). Discrete-Time Stochastic Volatility Models and MCMC-Based Statistical Inference. Retrieved from http://sfb649.wiwi.hu-berlin.de/papers/pdf/SFB649DP2008-063.pdf

[3] Koopman, S.J., Uspensky, E.H. (July 30, 2001). The Stochastic Volatility in Mean model: Empirical evidence from international stock markets. Retrieved from http://personal.vu.nl/s.j.koopman/old/papers/svm300701.pdf