1. Introduction.

1.1 Background.

The NASDAQ-100 is a stock market index that includes 100 of the largest domestic and international non-financial companies listed on the Nasdaq Stock Market, selected by market capitalization. The index covers companies across major industry groups, including computer hardware and software, telecommunications, retail/wholesale trade and biotechnology [1]. It does not contain any financial companies, which belong to a separate index, and its selection criteria differ from those of the Dow Jones Industrial Average and the S&P 500 [2].

The Nasdaq-100 data used in this project are the daily close prices from 2006-04-01 to 2016-04-01, downloaded from the Yahoo Finance website. Since the Nasdaq-100 index reflects the behavior of the Nasdaq Stock Market, it is meaningful to analyse the features of this index.
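
For completeness, the same series can also be pulled directly from Yahoo Finance. The following is only a minimal sketch, assuming the quantmod package and the Yahoo ticker ^NDX; the analysis below uses the pre-downloaded CSV file instead.

# Hedged sketch: download Nasdaq-100 prices with quantmod (not used below)
library(quantmod)
ndx <- getSymbols("^NDX", src = "yahoo",
                  from = "2006-04-01", to = "2016-04-01",
                  auto.assign = FALSE)
head(Cl(ndx))  # daily close prices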

1.2 Objectives.

This project aims to analyse the features of the Nasdaq-100 index and to fit ARMA, GARCH and POMP models to it. I then interpret the models and compare how well they fit.

2. Analysis of Data.

First, I read in the data and plot it to observe its features.

data <- read.csv("Nasdaq1.csv",header = TRUE)
head(data)
##         Date    Open    High     Low   Close     Volume Adj.Close
## 1 2016-04-01 4461.77 4534.32 4452.11 4532.08 1812250000   4532.08
## 2 2016-03-31 4491.01 4504.24 4478.13 4483.66 1774270000   4483.66
## 3 2016-03-30 4494.70 4518.18 4481.84 4490.88 1717820000   4490.88
## 4 2016-03-29 4391.19 4471.77 4384.27 4467.72 1806490000   4467.72
## 5 2016-03-28 4415.04 4417.57 4393.08 4398.07 1381000000   4398.07
## 6 2016-03-24 4380.23 4405.99 4374.71 4405.53 1590990000   4405.53
N <- nrow(data)
ns <- data$Close[N:1]  # reverse to chronological order (oldest first)
plot(ns, type = "l")

I notice that the data show clear up-and-down movements from 2006 to 2008, followed by a fairly steady increase over time. The variance appears stable after 2008, but varies substantially around 2008.

The data can be denoted as \(\{z^*_n, n=1,\dots,N\}\), and the return of the Nasdaq-100 index can be expressed as \[ y_n^* = \log(z_n^*) - \log(z_{n-1}^*).\]

ns_df <- diff(log(ns))  # return
plot(ns_df, type = "l", main = "Plot of Return") 

From the plot of the return, I find high volatility around the 2008 financial crisis, and some other episodes show this as well, especially around 2009, 2010 and 2016.

acf(ns_df)

The ACF shows that the autocorrelation of the returns gradually converges to 0, although some non-zero autocorrelation exists at lags 1, 2, 15, 16, 18 and 34. Thus I decide to use a decomposition to obtain more stationary data.

ns_ts <- ts(ns_df, frequency = 365, start = c(2006, 93))  # first return on 2006-04-03 (day 93 of 2006)
ns_de <- decompose(ns_ts) 
plot(ns_de)

We can then separate the random component from the trend and seasonal components of the series; that is, we remove the trend and seasonality and keep the remainder for modelling.

3. Model Selection.

3.1 ARMA Model.

3.1.1 Construct an ARMA Model.

Since the data seems stationary, I fit the ARMA(p,q) model with parameter vector \(\theta=({\phi}_{1:p},{\psi}_{1:q},\mu,\sigma^2)\) given by \[ {\phi}(B)(Y_n-\mu) = {\psi}(B)\,\epsilon_n,\] where \[\begin{aligned} \mu &= {\mathbb{E}}[Y_n], \\ {\phi}(x) &= 1-{\phi}_1 x-\dots-{\phi}_p x^p, \\ {\psi}(x) &= 1+{\psi}_1 x+\dots+{\psi}_q x^q, \\ \epsilon_n &\sim \mathrm{iid}\; N[0,\sigma^2]. \end{aligned}\]

I then select \(p\) and \(q\) using the AIC criterion.

ns_rand <- ns_de$random
aic_table <- function(data,P,Q){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
       table[p+1,q+1] <- arima(data,order=c(p,0,q))$aic
    }
  }
  dimnames(table) <- list(paste("AR",0:P,sep=""),paste("MA",0:Q,sep=""))
  table
}
ns_aic_table <- aic_table(ns_rand,4,5)  # this step can take a little while
require(knitr)
## Loading required package: knitr
kable(ns_aic_table,digits=2)
           MA0       MA1       MA2       MA3       MA4       MA5
AR0  -12510.60 -12525.29 -12531.72 -12532.33 -12532.28 -12531.13
AR1  -12523.05 -12528.32 -12532.91 -12531.18 -12530.90 -12529.13
AR2  -12532.80 -12533.42 -12531.47 -12529.00 -12535.98 -12527.13
AR3  -12532.76 -12531.41 -12529.49 -12528.47 -12527.08 -12525.10
AR4  -12532.22 -12530.92 -12529.02 -12526.96 -12524.92 -12523.14

From the AIC table, I choose the ARMA(2,1) model. Although other models, for example ARMA(2,4), have lower AIC values, simplicity is also an important criterion, so ARMA(2,1) is more appropriate.

arma21 <- arima(ns_rand, order = c(2,0,1));arma21
## 
## Call:
## arima(x = ns_rand, order = c(2, 0, 1))
## 
## Coefficients:
##           ar1      ar2     ma1  intercept
##       -0.4770  -0.1069  0.3919      0e+00
## s.e.   0.1957   0.0235  0.1963      2e-04
## 
## sigma^2 estimated as 0.0001722:  log likelihood = 6271.71,  aic = -12533.42

I notice that the log likelihood of ARMA(2,1) is 6271.71. Then I use the plot of residuals for diagnostic analysis.

3.1.2 Diagnostics of the ARMA Model.

ns_rand1 <- ns_rand[183:2334]  # drop the leading/trailing NAs produced by decompose()
r <- resid(arima(ns_rand1, order = c(2,0,1)))
plot(r)

acf(r)

The ACF of the residuals shows some signs of autocorrelation at lags 12, 16 and 18, while the others appear to be zero. The residual plot still shows obvious high volatility around 2008 and in some other episodes.

qqnorm(ns_rand1)
qqline(ns_rand1)

The QQ plot shows heavy tails at both ends, which suggests that the data are not Gaussian. Therefore, ARMA(2,1) is not an ideal model for this dataset, and I try other models.
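
As a supplementary check, the two assumptions can also be examined with formal tests. This is only a minimal sketch using the residuals r from the ARMA(2,1) refit above; it is not part of the original diagnostics.

Box.test(r, lag = 20, type = "Ljung-Box", fitdf = 3)  # H0: residuals uncorrelated; fitdf = p + q
shapiro.test(as.numeric(r))                           # H0: residuals are Gaussian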

3.2 GARCH Model.

Another choice is to fit a GARCH(p,q) model, which is widely used for modeling financial time series.

The GARCH(p,q) model can be expressed as \[ Y_n = \epsilon_n \sqrt{V_n},\] where \[ V_n = \alpha_0 + \sum_{j=1}^p \alpha_j Y_{n-j}^2 + \sum_{k=1}^q \beta_k V_{n-k}\] and \(\epsilon_{1:N}\) is white noise.

Here we fit a GARCH(1,1) model to this dataset and calculate its log likelihood.

require(tseries)
## Loading required package: tseries
fit.garch <- garch(ns_rand1,grad = "numerical", trace = FALSE)
L.garch <- logLik(fit.garch);L.garch
## 'log Lik.' 6301.222 (df=3)
summary(fit.garch)
## 
## Call:
## garch(x = ns_rand1, grad = "numerical", trace = FALSE)
## 
## Model:
## GARCH(1,1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -7.77007 -0.49766  0.02667  0.54164  8.13096 
## 
## Coefficient(s):
##     Estimate  Std. Error  t value Pr(>|t|)    
## a0 1.472e-04   1.055e-05   13.944  < 2e-16 ***
## a1 5.000e-02   8.769e-03    5.702 1.19e-08 ***
## b1 5.000e-02   5.759e-02    0.868    0.385    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Diagnostic Tests:
##  Jarque Bera Test
## 
## data:  Residuals
## X-squared = 2912.6, df = 2, p-value < 2.2e-16
## 
## 
##  Box-Ljung test
## 
## data:  Squared.Residuals
## X-squared = 8.5785, df = 1, p-value = 0.003402

The maximized log likelihood of the GARCH(1,1) model is 6301.222, slightly larger than 6271.71, the log likelihood of ARMA(2,1). This suggests that GARCH(1,1) fits better than ARMA(2,1).
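
Since the two models have different numbers of parameters, it may also help to compare them on the AIC scale. This is only a rough sketch using the fitted objects above, taking AIC = 2k - 2*logLik with k = 3 fitted GARCH parameters (a0, a1, b1); the likelihoods are computed somewhat differently for the two models, so the comparison is only indicative.

AIC(arma21)                      # ARMA(2,1), about -12533 (as reported above)
2 * 3 - 2 * as.numeric(L.garch)  # GARCH(1,1), about -12596; lower AIC favors GARCH(1,1)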

However, I notice that the coefficient \(b1\) is not significant, and the p-value of the Jarque–Bera test is very small, so we reject the null hypothesis that the residuals are normally distributed. Moreover, the p-value of the Box–Ljung test is also smaller than 0.05, rejecting the null hypothesis that the squared residuals are uncorrelated. Therefore, GARCH(1,1) is not good enough either.

3.3 POMP Model.

3.3.1 Financial Leverage.

Before we construct the POMP model, we should note the empirical observation that negative shocks to a stock market index are associated with a subsequent increase in volatility; this phenomenon is called leverage [3].

I define leverage \(R_n\) on day \(n\) as the correlation between the index return on day \(n-1\) and the increase in the log volatility from day \(n-1\) to day \(n\) [3].

In the pomp implementation of Bretó (2014), I take \(R_n\) to be a random walk on a transformed scale, \[R_n= \frac{\exp\{2G_n\} -1}{\exp\{2G_n\}+1},\] where \(G_n\) is the usual Gaussian random walk [4].
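
On this transformed scale \(R_n\) is simply \(\tanh(G_n)\), which is how the leverage enters the C snippets below. A quick numerical check of the identity (a small sketch):

g <- seq(-3, 3, by = 0.5)
all.equal(tanh(g), (exp(2 * g) - 1) / (exp(2 * g) + 1))  # TRUE up to rounding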

3.3.2 Construct a POMP Model.

Here I fit the time-varying leverage model; before fitting, I demean the returns of the Nasdaq-100 index.

ns_rand1_demean <- ns_rand1 - mean(ns_rand1)
plot(ns_rand1_demean, type = "l", main = "Plot of Demeaned Data")

Again, the plot shows high volatility around 2008.

The model I use here is proposed by Bretó (2014) [4], and it can be expressed as: \[ \begin{align} Y_n &= \exp\{H_n/2\} \epsilon_n, \\ H_n &= \mu_h(1-\phi) + \phi H_{n-1} + \beta_{n-1}R_n\exp\{-H_{n-1}/2\} + \omega_n,\\ G_n &= G_{n-1}+\nu_n, \\ \end{align} \] where \(\beta_n=Y_n\sigma_\eta\sqrt{1-\phi^2}\), \(\{\epsilon_n\}\) is an iid \(N(0,1)\) sequence, \(\{\nu_n\}\) is an iid \(N(0,\sigma_{\nu}^2)\) sequence, and \(\{\omega_n\}\) is an iid \(N(0,\sigma_\omega^2)\) sequence. Here \(H_n\) is the log volatility.

We then use this model to build a POMP object. The following R code is adapted from Notes 15 of STATS 531 [3].

require(pomp)
## Loading required package: pomp

The basic sequential Monte Carlo algorithm is implemented by building two different pomp objects, one for filtering and one for simulation [3].

ns_statenames <- c("H","G","Y_state")
ns_rp_names <- c("sigma_nu","mu_h","phi","sigma_eta")
ns_ivp_names <- c("G_0","H_0")
ns_paramnames <- c(ns_rp_names,ns_ivp_names)
ns_covarnames <- "covaryt"
rproc1 <- "
  double beta,omega,nu;
omega = rnorm(0,sigma_eta * sqrt( 1- phi*phi ) * sqrt(1-tanh(G)*tanh(G)));
nu = rnorm(0, sigma_nu);
G += nu;
beta = Y_state * sigma_eta * sqrt( 1- phi*phi );
H = mu_h*(1 - phi) + phi*H + beta * tanh( G ) * exp(-H/2) + omega;
"
rproc2.sim <- "
Y_state = rnorm( 0,exp(H/2) );
"

rproc2.filt <- "
Y_state = covaryt;
"
ns_rproc.sim <- paste(rproc1,rproc2.sim)
ns_rproc.filt <- paste(rproc1,rproc2.filt)
ns_initializer <- "
  G = G_0;
H = H_0;
Y_state = rnorm( 0,exp(H/2) );
"
ns_rmeasure <- "
   y=Y_state;
"

ns_dmeasure <- "
lik=dnorm(y,0,exp(H/2),give_log);
"
ns_toEstimationScale <- "
Tsigma_eta = log(sigma_eta);
Tsigma_nu = log(sigma_nu);
Tphi = logit(phi);
"

ns_fromEstimationScale <- "
Tsigma_eta = exp(sigma_eta);
Tsigma_nu = exp(sigma_nu);
Tphi = expit(phi);
"
ns.filt <- pomp(data=data.frame(y=ns_rand1_demean,
                                   time=1:length(ns_rand1_demean)),
                   statenames=ns_statenames,
                   paramnames=ns_paramnames,
                   covarnames=ns_covarnames,
                   times="time",
                   t0=0,
                   covar=data.frame(covaryt=c(0,ns_rand1_demean),
                                    time=0:length(ns_rand1_demean)),
                   tcovar="time",
                   rmeasure=Csnippet(ns_rmeasure),
                   dmeasure=Csnippet(ns_dmeasure),
                   rprocess=discrete.time.sim(step.fun=Csnippet(ns_rproc.filt),delta.t=1),
                   initializer=Csnippet(ns_initializer),
                   toEstimationScale=Csnippet(ns_toEstimationScale), 
                   fromEstimationScale=Csnippet(ns_fromEstimationScale)
)


expit<-function(real){1/(1+exp(-real))}
logit<-function(p.arg){log(p.arg/(1-p.arg))}
params_test <- c(
  sigma_nu = exp(-4.5),  
  mu_h = -0.25,       
  phi = expit(4),     
  sigma_eta = exp(-0.07),
  G_0 = 0,
  H_0=0
)

sim1.sim <- pomp(ns.filt, 
                 statenames=ns_statenames,
                 paramnames=ns_paramnames,
                 covarnames=ns_covarnames,
                 rprocess=discrete.time.sim(step.fun=Csnippet(ns_rproc.sim),delta.t=1)
)

sim1.sim <- simulate(sim1.sim,seed=1,params=params_test)
sim1.filt <- pomp(sim1.sim, 
                  covar=data.frame(
                    covaryt=c(obs(sim1.sim),NA),
                    time=c(timezero(sim1.sim),time(sim1.sim))),
                  tcovar="time",
                  statenames=ns_statenames,
                  paramnames=ns_paramnames,
                  covarnames=ns_covarnames,
                  rprocess=discrete.time.sim(step.fun=Csnippet(ns_rproc.filt),delta.t=1)
)
run_level <- 3 
ns_Np <-          c(100,1e3,2e3)
ns_Nmif <-        c(10, 100,200)
ns_Nreps_eval <-  c(4,  10,  20)
ns_Nreps_local <- c(10, 20, 20)
ns_Nreps_global <-c(10, 20, 100)
require(doParallel)
## Loading required package: doParallel
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoParallel()
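
Before running the full iterated filtering, a quick sanity check (a minimal sketch, not part of the original workflow) is to confirm that the particle filter runs on the simulated data at the lowest run level:

pf1 <- pfilter(sim1.filt, params = params_test, Np = ns_Np[1])  # 100 particles
logLik(pf1)  # log likelihood of the simulated data at the simulating parameters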

Then I run iterated filtering on the Nasdaq-100 data using the IF2 algorithm [5].

ns_rw.sd_rp <- 0.02
ns_rw.sd_ivp <- 0.1
ns_cooling.fraction.50 <- 0.5

stew("mif1.rda",{
  t.if1 <- system.time({
    if1 <- foreach(i=1:ns_Nreps_local[run_level],
                   .packages='pomp', .combine=c,
                   .options.multicore=list(set.seed=TRUE)) %dopar% try(
                     mif2(ns.filt,
                          start=params_test,
                          Np=ns_Np[run_level],
                          Nmif=ns_Nmif[run_level],
                          cooling.type="geometric",
                          cooling.fraction.50=ns_cooling.fraction.50,
                          transform=TRUE,
                          rw.sd = rw.sd(
                            sigma_nu  = ns_rw.sd_rp,
                            mu_h      = ns_rw.sd_rp,
                            phi       = ns_rw.sd_rp,
                            sigma_eta = ns_rw.sd_rp,
                            G_0       = ivp(ns_rw.sd_ivp),
                            H_0       = ivp(ns_rw.sd_ivp)
                          )
                     )
                   )
    
    L.if1 <- foreach(i=1:ns_Nreps_local[run_level],.packages='pomp',
                     .combine=rbind,.options.multicore=list(set.seed=TRUE)) %dopar% 
                     {
                       logmeanexp(
                         replicate(ns_Nreps_eval[run_level],
                                   logLik(pfilter(ns.filt,params=coef(if1[[i]]),Np=ns_Np[run_level]))
                         ),
                         se=TRUE)
                     }
  })
},seed=318817883,kind="L'Ecuyer")

r.if1 <- data.frame(logLik=L.if1[,1],logLik_se=L.if1[,2],t(sapply(if1,coef)))
if (run_level>1) 
  write.table(r.if1,file="ns_params.csv",append=TRUE,col.names=FALSE,row.names=FALSE)
summary(r.if1$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  5552.0  6208.3  6263.2  6210.5  6338.4  6443.9

The maximized log likelihood is 6443.9, which is higher than those of GARCH(1,1) and ARMA(2,1). This indicates that the POMP model fits this dataset better.

Also, I plot the geometry of the likelihood surface in a neighborhood of the point estimate:

pairs(~logLik+sigma_nu+mu_h+phi+sigma_eta,data=subset(r.if1,logLik>max(logLik)-250))

3.3.3 Assess the POMP Model.

To assess the success of the model, I construct a parameter box of reasonable values and use it to generate random initial values.

ns_box <- rbind(
  sigma_nu=c(0.005,0.05),
  mu_h    =c(-1,0),
  phi = c(0.95,0.99),
  sigma_eta = c(0.5,1),
  G_0 = c(-2,2),
  H_0 = c(-1,1)
)
stew(file="box_eval.rda",{
  t.box <- system.time({
    if.box <- foreach(i=1:ns_Nreps_global[run_level],.packages='pomp',.combine=c,
                      .options.multicore=list(set.seed=TRUE)) %dopar%  
      mif2(
        if1[[1]],
        start=apply(ns_box,1,function(x)runif(1,x[1],x[2]))  # draw each starting value uniformly within its box
      )
    
    L.box <- foreach(i=1:ns_Nreps_global[run_level],.packages='pomp',.combine=rbind,
                     .options.multicore=list(set.seed=TRUE)) %dopar% {
                       set.seed(87932+i)
                       logmeanexp(
                         replicate(ns_Nreps_eval[run_level],
                                   logLik(pfilter(ns.filt,params=coef(if.box[[i]]),Np=ns_Np[run_level]))
                         ), 
                         se=TRUE)
                     }
  })
},seed=290860873,kind="L'Ecuyer")


r.box <- data.frame(logLik=L.box[,1],logLik_se=L.box[,2],t(sapply(if.box,coef)))
if(run_level>1) write.table(r.box,file="ns_params.csv",append=TRUE,col.names=FALSE,row.names=FALSE)
summary(r.box$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  3357.4  6035.9  6255.4  5899.9  6315.4  6395.6

The best log likelihood from this global search is 6395.6, which is close to 6443.9 but does not exceed it. Starting from randomly selected initial values, the search approaches the MLE obtained above from the Nasdaq-100 data while remaining slightly smaller, which indicates that our MLE is reasonable.

pairs(~logLik+log(sigma_nu)+mu_h+phi+sigma_eta+H_0,data=subset(r.box,logLik>max(logLik)-250))

To judge the overall performance, I also fit a GARCH(1,1) model to the demeaned data as a benchmark for the analysis.

require(tseries)
fit.garch.benchmark <- garch(ns_rand1_demean,grad = "numerical", trace = FALSE)
L.garch.benchmark <- logLik(fit.garch.benchmark);L.garch.benchmark
## 'log Lik.' 6301.246 (df=3)

The GARCH(1,1) model has a maximized log likelihood of 6301.246 with 3 fitted parameters, which is smaller than the 6443.9 achieved by the POMP model.
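
Because the two models have different numbers of parameters, the comparison can also be made on the AIC scale. This is only a quick sketch using the values reported above (3 fitted parameters for the GARCH benchmark, 6 for the POMP model).

2 * 3 - 2 * 6301.246  # GARCH(1,1) benchmark: about -12596.5
2 * 6 - 2 * 6443.9    # POMP model: about -12875.8 (smaller, i.e. better)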

4. Conclusion.

In this project, I use the daily returns of the Nasdaq-100 index and detrend the returns before fitting the models. Comparing the fitted models, the maximized log likelihoods are ordered as \[ARMA(2,1) < GARCH(1,1) < POMP,\] which indicates that \(GARCH(1,1)\) performs better than \(ARMA(2,1)\), and the POMP model performs best among the three.

The \(ARMA(2,1)\) model can fit the data, but the diagnostics show some significant non-zero autocorrelations, and the data are not normally distributed. This contradicts the assumptions of the \(ARMA\) model, so \(ARMA(2,1)\) is not entirely satisfactory.

\(GARCH(1,1)\) fits better than \(ARMA(2,1)\), but the diagnostics still reveal some problems: not all of the estimated coefficients are significant, the Jarque–Bera test shows that the residuals are not normal, and the Box–Ljung test suggests correlation among the squared residuals. This means that \(GARCH(1,1)\) is also not good enough.

Finally, I fit the POMP model and find that this stochastic volatility model with time-varying leverage has a maximized log likelihood of 6443.9 with 6 parameters, the best of the three models. To check this result, I set up a parameter box to generate random starting points for the global search and refit the \(GARCH(1,1)\) model on the demeaned data as a benchmark. Both checks indicate that the MLE of the POMP model is reasonable. Therefore, the POMP model is the most appropriate of the three for the Nasdaq-100 index.

References.

[1] http://www.nasdaq.com/markets/indices/nasdaq-100.aspx

[2] https://en.wikipedia.org/wiki/NASDAQ-100

[3] http://ionides.github.io/531w16/notes15/notes15.html

[4] Bretó, C. (2014). On idiosyncratic stochasticity of financial leverage effects. Statistics & Probability Letters 91:20–26.

[5] http://ionides.github.io/531w16/notes13/notes13.html