This data set comes from a Kaggle data challenge: DJIA 30 Stock Time Series [1]. The original data set contains twelve years of historical stock prices for 29 of the 30 DJIA companies. In this project, I focus on two stocks: AAPL and GS. The goal of this project is to study the volatility of the stock prices by fitting time-series models, so as to help people make statistically informed trades. In my mid-term project, I analyzed the return of each stock's daily highest price. In this final project, I model the log returns of the closing price and apply additional models, namely GARCH and POMP.
Section 2 covers data pre-processing, including data cleaning and exploratory data analysis. Section 3 is the time-series modelling part, where I fit ARIMA, GARCH, and POMP models. Section 4 presents the results and conclusions.
# To read on a local machine:
# dat <- read.csv(file = "C:/Users/asus/Desktop/stock-time-series-20050101-to-20171231/all_stocks_2006-01-01_to_2018-01-01.csv")
# To read on Great Lakes:
dat <- read.csv(file = "./all_stocks_2006-01-01_to_2018-01-01.csv")
summary(dat)
## Date Open High Low
## 2006-01-03: 31 Min. : 6.75 Min. : 7.17 Min. : 0.00
## 2006-01-04: 31 1st Qu.: 33.95 1st Qu.: 34.29 1st Qu.: 33.60
## 2006-01-05: 31 Median : 60.04 Median : 60.63 Median : 59.49
## 2006-01-06: 31 Mean : 85.62 Mean : 86.39 Mean : 84.84
## 2006-01-09: 31 3rd Qu.: 94.00 3rd Qu.: 94.74 3rd Qu.: 93.25
## 2006-01-10: 31 Max. :1204.88 Max. :1213.41 Max. :1191.15
## (Other) :93426 NA's :25 NA's :10 NA's :20
## Close Volume Name
## Min. : 6.66 Min. : 0 AXP : 3020
## 1st Qu.: 33.96 1st Qu.: 5040180 BA : 3020
## Median : 60.05 Median : 9701142 CAT : 3020
## Mean : 85.64 Mean : 20156670 CVX : 3020
## 3rd Qu.: 94.01 3rd Qu.: 20752222 DIS : 3020
## Max. :1195.83 Max. :843264044 GE : 3020
## (Other):75492
dat$Date <- as.Date(dat$Date, format = "%Y-%m-%d")
dat <- na.omit(dat)  # drop the few rows with missing prices
dat_aapl <- dat[dat$Name == "AAPL",]
dat_gs <- dat[dat$Name == "GS",]
par(mfrow=c(1,2))
plot(dat_aapl$Date,log(dat_aapl$Close),type="l",xlab="Date",ylab="Log price ($)",main="AAPL log price")
abline(h=mean(log(dat_aapl$Close)),col="red")
plot(dat_gs$Date,log(dat_gs$Close),type="l",xlab="Date",ylab="Log price ($)",main="GS log price")
abline(h=mean(log(dat_gs$Close)),col="red")
As we can see, AAPL's stock price increased steadily over the period, while GS experienced large fluctuations around 2008 and 2013, probably because of the 2008 financial crisis and its aftermath.
Now, I study the log returns of the closing price, whose fluctuations reflect the volatility of the stock.
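For a price series \(P_n\), the log return is \[r_n = \log P_n - \log P_{n-1} = \log\frac{P_n}{P_{n-1}},\] which for small price moves is approximately the relative return \((P_n-P_{n-1})/P_{n-1}\); this is exactly what diff(log(dat_aapl$Close)) computes below.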
N_aapl = length(dat_aapl$Date)
par(mfrow = c(1,2))
plot(dat_aapl$Date[2:N_aapl],diff(log(dat_aapl$Close)),type="l",xlab="Date",ylab="",main="Log Returns of AAPL price")  # the n-th return is dated at day n (note 2:N_aapl-1 would parse as 1:(N_aapl-1))
abline(h=mean(diff(log(dat_aapl$Close))),col="red")
acf(diff(log(dat_aapl$Close)))
par(mfrow = c(1,2))
N_gs = length(dat_gs$Date)
plot(dat_gs$Date[2:N_gs],diff(log(dat_gs$Close)),type="l",xlab="Date",ylab="",main="Log Returns of GS price")
abline(h=mean(diff(log(dat_gs$Close))),col="red")
acf(diff(log(dat_gs$Close)))
We can see that the log returns generally fluctuate around 0 for both stocks, with GS fluctuating more strongly than AAPL. Both stocks show elevated volatility around 2009. The ACF shows significant correlations at several lags for both stocks, and the pattern of the ACF also hints at possible seasonality.
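To complement the visual ACF check, a portmanteau test can quantify the remaining serial correlation; below is a minimal sketch using base R's Box.test (the choice of 20 lags is arbitrary, for illustration only).
# Ljung-Box test for serial correlation in the log returns
Box.test(diff(log(dat_aapl$Close)), lag = 20, type = "Ljung-Box")
Box.test(diff(log(dat_gs$Close)), lag = 20, type = "Ljung-Box")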
I then decompose the series to remove trend and seasonality and obtain stationary data.
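By default, decompose() fits an additive model \[X_t = T_t + S_t + R_t,\] where the trend \(T_t\) is estimated by a moving average, \(S_t\) is the seasonal component, and the remainder \(R_t\) is what I will treat as the stationary series.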
vot_aapl = diff(log(dat_aapl$Close))
vot_gs = diff(log(dat_gs$Close))
aapl_ts <- ts(vot_aapl, frequency = 365, start = c(2006, 1))  # the data begin on 2006-01-03
aapl_de <- decompose(aapl_ts)
gs_ts <- ts(vot_gs, frequency = 365, start = c(2006, 1))
gs_de <- decompose(gs_ts)
par(mfrow=c(1,2))
plot(aapl_de)
plot(gs_de)
First, I use auto.arima to select and fit an ARIMA model.
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
aapl_rd = aapl_de$random[!is.na(aapl_de$random)]  # remainder component; trim the NA ends left by the moving average
gs_rd = gs_de$random[!is.na(gs_de$random)]
arima_AAPL = auto.arima(aapl_rd,max.p = 5,max.q = 5,method = "ML")
arima_GS = auto.arima(gs_rd,max.p = 5,max.q = 5,method = "ML")
arima_AAPL
## Series: aapl_rd
## ARIMA(2,0,2) with zero mean
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 0.0031 -0.9283 0.0002 0.8948
## s.e. 0.0433 0.0325 0.0525 0.0386
##
## sigma^2 estimated as 0.0003626: log likelihood=6748.7
## AIC=-13487.39 AICc=-13487.37 BIC=-13457.97
arima_GS
## Series: gs_rd
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## 0.6846 -0.7454
## s.e. 0.0861 0.0786
##
## sigma^2 estimated as 0.0005257: log likelihood=6254.95
## AIC=-12503.89 AICc=-12503.89 BIC=-12486.24
The fitted ARIMA model for AAPL's volatility is: \[Y_n = 0.0031 Y_{n-1} - 0.9283 Y_{n-2} + \epsilon_n + 0.0002\epsilon_{n-1} + 0.8948\epsilon_{n-2}\] where \(\epsilon_n\) is a white noise process and \(Y_n\) is the volatility.
The fitted ARIMA model for GS's volatility is: \[Y_n = 0.6846 Y_{n-1} + \epsilon_n - 0.7454\epsilon_{n-1}\] where \(\epsilon_n\) is a white noise process and \(Y_n\) is the volatility.
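As a sanity check on the orders chosen by auto.arima, one can tabulate the AIC over a grid of candidate ARMA(p,q) models; below is a minimal sketch for the AAPL series (aic_table is a small helper defined here for illustration, with orders capped at 4 to keep the run time modest).
# AIC over a grid of ARMA(p,q) candidates; NA marks fits that fail to converge
aic_table <- function(data, P, Q){
  table <- matrix(NA, P+1, Q+1)
  for(p in 0:P) for(q in 0:Q)
    table[p+1, q+1] <- tryCatch(arima(data, order = c(p, 0, q))$aic,
                                error = function(e) NA)
  dimnames(table) <- list(paste("AR", 0:P), paste("MA", 0:Q))
  table
}
aic_table(aapl_rd, 4, 4)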
Now, I study the goodness of fit by checking the residuals for remaining correlation and for normality.
res_aapl = resid(arima_AAPL)
res_gs = resid(arima_GS)
acf(res_aapl)
acf(res_gs)
qqnorm(res_aapl)
qqline(res_aapl)
qqnorm(res_gs)
qqline(res_gs)
The residual ACF shows little remaining correlation for either stock. However, the QQ-plots show that the residuals of both stocks have heavier tails than the normal distribution, so they do not strictly follow a normal distribution, indicating that the model does not fit very well. I will try other models instead.
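The heavy tails can also be checked formally; a minimal sketch using the Jarque-Bera test from the tseries package (also used for the GARCH fit below), whose null hypothesis is the skewness and kurtosis of a normal distribution:
# Jarque-Bera test; a small p-value indicates departure from normality
tseries::jarque.bera.test(res_aapl)
tseries::jarque.bera.test(res_gs)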
As an alternative, I start with the GARCH model. The GARCH(p,q) model is \[Y_n = \epsilon_n\sqrt{V_n}\] where \[V_n = a_0 + \sum_{j=1}^p a_j Y_{n-j}^2 + \sum_{k=1}^q b_k V_{n-k}\] and \(\epsilon_{1:N}\) is white noise.
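To build intuition for this recursion, here is a minimal sketch that simulates a GARCH(1,1) path directly from the equations above; the parameter values are illustrative, not fitted.
# simulate GARCH(1,1): V is the conditional variance, Y the return
set.seed(1)
n <- 1000; a0 <- 1e-5; a1 <- 0.1; b1 <- 0.85
Y <- V <- numeric(n)
V[1] <- a0/(1 - a1 - b1)            # unconditional variance
Y[1] <- rnorm(1, sd = sqrt(V[1]))
for(i in 2:n){
  V[i] <- a0 + a1*Y[i-1]^2 + b1*V[i-1]
  Y[i] <- rnorm(1, sd = sqrt(V[i]))
}
plot(Y, type = "l", main = "Simulated GARCH(1,1)")  # note the volatility clustering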
require(tseries)
## Loading required package: tseries
garch_aapl <- garch(aapl_rd,grad = "numerical", trace = FALSE)
lik_garch_aapl <- logLik(garch_aapl)
lik_garch_aapl
## 'log Lik.' 6782.713 (df=3)
summary(garch_aapl)
##
## Call:
## garch(x = aapl_rd, grad = "numerical", trace = FALSE)
##
## Model:
## GARCH(1,1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.838959 -0.553954 -0.004992 0.561453 5.962210
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## a0 3.060e-04 3.433e-05 8.912 < 2e-16 ***
## a1 5.000e-02 7.350e-03 6.802 1.03e-11 ***
## b1 5.000e-02 9.497e-02 0.526 0.599
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Diagnostic Tests:
## Jarque Bera Test
##
## data: Residuals
## X-squared = 2472.3, df = 2, p-value < 2.2e-16
##
##
## Box-Ljung test
##
## data: Squared.Residuals
## X-squared = 12.376, df = 1, p-value = 0.0004348
garch_gs <- garch(gs_rd,grad = "numerical", trace = FALSE)
lik_garch_gs <- logLik(garch_gs)
lik_garch_gs
## 'log Lik.' 6768.54 (df=3)
summary(garch_gs)
##
## Call:
## garch(x = gs_rd, grad = "numerical", trace = FALSE)
##
## Model:
## GARCH(1,1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.24414 -0.58279 -0.01031 0.61708 5.53234
##
## Coefficient(s):
## Estimate Std. Error t value Pr(>|t|)
## a0 3.984e-06 7.377e-07 5.40 6.66e-08 ***
## a1 6.010e-02 4.827e-03 12.45 < 2e-16 ***
## b1 9.313e-01 5.162e-03 180.43 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Diagnostic Tests:
## Jarque Bera Test
##
## data: Residuals
## X-squared = 602.36, df = 2, p-value < 2.2e-16
##
##
## Box-Ljung test
##
## data: Squared.Residuals
## X-squared = 1.0006, df = 1, p-value = 0.3172
I fit a GARCH(1,1) model to both stocks and collect their log-likelihoods for further comparison. For GS, all coefficients have p-values below 0.05 and are statistically significant, which is a sign of a good fit. For AAPL, however, the \(b_1\) coefficient is not significant (p = 0.599), so its GARCH(1,1) fit should be read with more caution.
The fitted GARCH(1,1) model for AAPL volatility is: \[Y_n = \epsilon_n\sqrt{V_n}\] where \[V_n = 3.060\times10^{-4} + 5.000\times10^{-2}\,Y_{n-1}^2 + 5.000\times10^{-2}\,V_{n-1}\]
The fitted GARCH(1,1) model for GS volatility is: \[Y_n = \epsilon_n\sqrt{V_n}\] where \[V_n = 3.984\times10^{-6} + 6.010\times10^{-2}\,Y_{n-1}^2 + 9.313\times10^{-1}\,V_{n-1}\]
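The fitted model also provides an estimate of the conditional standard deviation \(\sqrt{V_n}\) at each time point. A minimal sketch, assuming (as the tseries documentation describes) that fitted.values of a garch object holds \(\pm\sqrt{V_n}\) in two columns, with an NA in the first row:
# overlay the estimated conditional sd on the absolute log returns
sigma_aapl <- fitted(garch_aapl)[,1]
plot(abs(aapl_rd), type = "l", col = "grey", ylab = "",
     main = "AAPL |log return| with fitted conditional sd")
lines(sigma_aapl, col = "red")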
Again, we check the goodness of fit:
res_aapl_garch = resid(garch_aapl)
res_gs_garch = resid(garch_gs)
acf(res_aapl_garch[-1])
acf(res_gs_garch[-1])
qqnorm(res_aapl_garch)
qqline(res_aapl_garch)
qqnorm(res_gs_garch)
qqline(res_gs_garch)
Even though GARCH does better than the ARIMA model, some correlation remains and the residuals still have heavier tails than the normal distribution. So GARCH is a better model than ARIMA, but I still want to look for a better one.
As mentioned in the course slides, the GARCH model is essentially a black box: its parameters lack a clear interpretation. To better understand how the volatility evolves, I follow the example in the slides and use a POMP model [2].
It is a fairly well established empirical observation that negative shocks to a stock-market index are associated with a subsequent increase in volatility. This phenomenon is called leverage. Following the pomp implementation of Bretó (2014) [3] in the slides, we model the leverage \(R_n\) as a random walk on a transformed scale [4]: \[R_n = \frac{e^{2G_n}-1}{e^{2G_n}+1} = \tanh(G_n),\] where \(G_n\) is the usual Gaussian random walk. The \(\tanh\) transform maps the real line to \((-1,1)\), so \(R_n\) remains a valid correlation; this is also why \(\tanh(G)\) appears in the C snippets below.
Here, I adopt the model used in the Lecture 14 slides [2], in which the leverage itself evolves as a random walk (the special case where this random walk has standard deviation zero is called the fixed leverage model). Following the notation and model representation of Bretó (2014) [3], the model is as follows: \[Y_n = e^{\frac{H_n}{2}}\epsilon_n\\H_n = \mu_h({1-\phi}) + \phi{H_{n-1}} + \beta_{n-1} R_n e^{-\frac{H_{n-1}}{2}} + \omega_n\\G_n = G_{n-1} +\nu_n\]
where \(\beta_n = Y_n\sigma_\eta\sqrt{1-\phi^2}\), \(\epsilon_n\) is an iid \(N(0,1)\) sequence, \(\nu_n\) is an iid \(N(0,\sigma_\nu^2)\) sequence, and \(\omega_n\) is Gaussian noise with mean 0 and standard deviation \(\sigma_\eta\sqrt{(1-\phi^2)(1-R_n^2)}\), matching the rprocess snippet below. To interpret the parameters: \(H_n\) is the log volatility of the stock, and \(Y_n\) is the observed, demeaned log return. For the construction of the particle filter, see the Lecture 14 slides [2].
require(pomp)
## Loading required package: pomp
## Warning: package 'pomp' was built under R version 3.6.2
## Welcome to pomp version 2.7.1.0!
## As of version 2.7.1.0, important changes have been made to the
## default settings of the particle filtering algorithms in
## 'pfilter', 'mif2', 'pmcmc', 'bsmc2'.
## These changes are not backward compatible.
## See the package NEWS for the details.
##
## For information on upgrading your pomp version < 2 code, see the
## "pomp version 2 upgrade guide" at https://kingaa.github.io/pomp/.
##
## Attaching package: 'pomp'
## The following object is masked from 'package:forecast':
##
## forecast
aapl_dmean = aapl_rd - mean(aapl_rd)  # demean, since the model assumes zero-mean returns
gs_dmean = gs_rd - mean(gs_rd)
ns_statenames <- c("H","G","Y_state")
ns_rp_names <- c("sigma_nu","mu_h","phi","sigma_eta")
ns_ivp_names <- c("G_0","H_0")
ns_paramnames <- c(ns_rp_names,ns_ivp_names)
rproc1 <- "
double beta,omega,nu;
omega = rnorm(0,sigma_eta * sqrt( 1- phi*phi ) * sqrt(1-tanh(G)*tanh(G)));
nu = rnorm(0, sigma_nu);
G += nu;
beta = Y_state * sigma_eta * sqrt( 1- phi*phi );
H = mu_h*(1 - phi) + phi*H + beta * tanh( G ) * exp(-H/2) + omega;
"
rproc2.sim <- "
Y_state = rnorm( 0,exp(H/2) );
"
rproc2.filt <- "
Y_state = covaryt;
"
ns_rproc.sim <- paste(rproc1,rproc2.sim)
ns_rproc.filt <- paste(rproc1,rproc2.filt)
ns_initializer <- "
G = G_0;
H = H_0;
Y_state = rnorm( 0,exp(H/2) );
"
ns_rmeasure <- "
y=Y_state;
"
ns_dmeasure <- "
lik=dnorm(y,0,exp(H/2),give_log);
"
ns_partrans <- parameter_trans(
log=c("sigma_eta","sigma_nu"),
logit="phi"
)
Note that, due to time constraints, I only build the POMP model for the AAPL volatility.
ns.filt <- pomp(data=data.frame(y=aapl_dmean,
time=1:length(aapl_dmean)),
statenames=ns_statenames,
paramnames=ns_paramnames,
times="time",
t0=0,
covar=covariate_table(time=0:length(aapl_dmean),covaryt=c(0,aapl_dmean),times="time"),
rmeasure=Csnippet(ns_rmeasure),
dmeasure=Csnippet(ns_dmeasure),
rprocess=discrete_time(step.fun=Csnippet(ns_rproc.filt),delta.t=1),
rinit=Csnippet(ns_initializer),
partrans=ns_partrans
)
expit<-function(real){1/(1+exp(-real))}
logit<-function(p.arg){log(p.arg/(1-p.arg))}
params_test <- c(
sigma_nu = exp(-4.5),
mu_h = -0.25,
phi = expit(4),
sigma_eta = exp(-0.07),
G_0 = 0,
H_0=0
)
sim1.sim <- pomp(ns.filt,
statenames=ns_statenames,
paramnames=ns_paramnames,
rprocess=discrete_time(step.fun=Csnippet(ns_rproc.sim),delta.t=1)
)
sim1.sim <- simulate(sim1.sim,seed=1,params=params_test)
sim1.filt <- pomp(sim1.sim,
covar=covariate_table(
time=c(timezero(sim1.sim),time(sim1.sim)),
covaryt=c(obs(sim1.sim),NA),
times="time"),
statenames=ns_statenames,
paramnames=ns_paramnames,
rprocess=discrete_time(step.fun=Csnippet(ns_rproc.filt),delta.t=1)
)
run_level <- 3
ns_Np <- switch(run_level, 100, 1e3, 2e3)         # particles per filter
ns_Nmif <- switch(run_level, 10, 100, 200)        # mif2 iterations
ns_Nreps_eval <- switch(run_level, 4, 10, 20)     # replicates for likelihood evaluation
ns_Nreps_local <- switch(run_level, 10, 20, 20)   # local-search starts
ns_Nreps_global <- switch(run_level, 10, 20, 100) # global-search starts
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoParallel()
library(doRNG)
## Loading required package: rngtools
registerDoRNG(34118892)
stew(file=sprintf("pf1-%d.rda",run_level),{
t.pf1 <- system.time(
pf1 <- foreach(i=1:ns_Nreps_eval,
.packages='pomp') %dopar% pfilter(sim1.filt,Np=ns_Np)) # replace %dopar% with %do% to run serially
},seed=493536993,kind="L'Ecuyer")
(L.pf1 <- logmeanexp(sapply(pf1,logLik),se=TRUE))
## se
## -3521.9077495 0.1322582
ns_rw.sd_rp <- 0.02
ns_rw.sd_ivp <- 0.1
ns_cooling.fraction.50 <- 0.5
ns_rw.sd <- rw.sd(
sigma_nu = ns_rw.sd_rp,
mu_h = ns_rw.sd_rp,
phi = ns_rw.sd_rp,
sigma_eta = ns_rw.sd_rp,
G_0 = ivp(ns_rw.sd_ivp),
H_0 = ivp(ns_rw.sd_ivp)
)
stew(file=sprintf("mif1-%d.rda",run_level),{
t.if1 <- system.time({
if1 <- foreach(i=1:ns_Nreps_local,
.packages='pomp', .combine=c) %dopar% mif2(ns.filt,
params=params_test,
Np=ns_Np,
Nmif=ns_Nmif,
cooling.fraction.50=ns_cooling.fraction.50,
rw.sd = ns_rw.sd)
L.if1 <- foreach(i=1:ns_Nreps_local,
.packages='pomp', .combine=rbind) %dopar% logmeanexp(
replicate(ns_Nreps_eval, logLik(pfilter(ns.filt,
params=coef(if1[[i]]),Np=ns_Np))), se=TRUE)
})
},seed=318817883,kind="L'Ecuyer")
r.if1 <- data.frame(logLik=L.if1[,1],logLik_se=L.if1[,2],
t(sapply(if1,coef)))
if (run_level>1) write.table(r.if1,file="ns_params.csv",
append=TRUE,col.names=FALSE,row.names=FALSE)
summary(r.if1$logLik,digits=5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6952 6965 6973 6984 7012 7016
The iterated filtering finds a maximized log-likelihood of 7016 [5]. The repeated stochastic maximizations also reveal the geometry of the likelihood surface in a neighborhood of this point. Meanwhile, I obtain estimates of all parameters together with their standard errors.
pairs(~logLik+sigma_nu+mu_h+phi+sigma_eta,
data=subset(r.if1,logLik>max(logLik)-20))
plot(if1)
From the iterated filtering diagnostic plots, I see that the chains converge after a long run.
However, there is no guarantee that the starting values I picked lead to the global optimum. To address this, I carry out a global search, drawing starting values at random from a box of plausible parameter values.
ns_box <- rbind(
sigma_nu=c(0.005,0.05),
mu_h =c(-1,0),
phi = c(0.95,0.99),
sigma_eta = c(0.5,1),
G_0 = c(-2,2),
H_0 = c(-1,1)
)
stew(file=sprintf("box_eval-%d.rda",run_level),{
t.box <- system.time({
if.box <- foreach(i=1:ns_Nreps_global,
.packages='pomp',.combine=c) %dopar% mif2(if1[[1]],
params=apply(ns_box,1,function(x)runif(1,x[1],x[2]))) # draw each start uniformly within its box range
L.box <- foreach(i=1:ns_Nreps_global,
.packages='pomp',.combine=rbind) %dopar% {
logmeanexp(replicate(ns_Nreps_eval, logLik(pfilter(
ns.filt,params=coef(if.box[[i]]),Np=ns_Np))),
se=TRUE)
}
})
},seed=290860873,kind="L'Ecuyer")
r.box <- data.frame(logLik=L.box[,1],logLik_se=L.box[,2],
t(sapply(if.box,coef)))
if(run_level>1) write.table(r.box,file="ns_params.csv",
append=TRUE,col.names=FALSE,row.names=FALSE)
summary(r.box$logLik,digits=5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6930 6962 6972 6978 6998 7021
We can see that the maximum log-likelihood found by the global search is 7021, which is larger than the maximum found by the local search.
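The parameter estimates at the best global fit can be read off directly:
# parameter vector at the highest log-likelihood found by the global search
r.box[which.max(r.box$logLik),]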
pairs(~logLik+log(sigma_nu)+mu_h+phi+sigma_eta+H_0,
data=subset(r.box,logLik>max(logLik)-10))
plot(if.box)
The diagnostic plots show that all chains converge.
First, I model the stock log returns using three different models: ARIMA, GARCH, and POMP. For each model, I determine the parameters that give the best fit to the data: ARIMA(2,0,2) and GARCH(1,1) for AAPL, and ARIMA(1,0,1) and GARCH(1,1) for GS. I also adapt Bretó (2014) [3] to implement a POMP model. Neither the ARIMA model nor the GARCH model explains all the correlation in the data, but GARCH gives a better fit than ARIMA, and the POMP model performs well on the data.
Second, I conduct model selection using AIC. For the AAPL stock specifically, recall that the log-likelihood for ARIMA(2,0,2) is 6748.7, the log-likelihood for GARCH(1,1) is 6782.713, and the log-likelihood for POMP is 7021. The AIC for ARIMA is \(-2\times6748.7 + 2\times5 = -13487.4\), for GARCH \(-2\times6782.713 + 2\times3 = -13559.4\), and for POMP (with 6 parameters) \(-2\times7021 + 2\times6 = -14030\). A lower AIC indicates a better fit after penalizing model complexity, so the model preference is: \[ARIMA(2,0,2) < GARCH(1,1) < POMP\] For GS, a similar study gives the same ordering: \[ARIMA(1,0,1) < GARCH(1,1) < POMP\] In conclusion, POMP is the best model to use when enough computing resources are available.
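For transparency, the AAPL comparison can be reproduced in a few lines, using the log-likelihoods and parameter counts reported above:
# AIC = -2*loglik + 2*k for each model
loglik <- c(ARIMA = 6748.7, GARCH = 6782.713, POMP = 7021)
k      <- c(ARIMA = 5,      GARCH = 3,        POMP = 6)
-2*loglik + 2*k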
[1] DJIA 30 Stock Time Series, Kaggle: https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231
[2] Edward Ionides, "14. Case study: POMP modeling to investigate financial volatility", https://ionides.github.io/531w20/14/notes14.pdf
[3] Bretó, C. 2014. On idiosyncratic stochasticity of financial leverage effects. Statistics & Probability Letters 91:20-26.
[4] Edward Ionides, "14. Case study: POMP modeling to investigate financial volatility", https://ionides.github.io/531w18/14/notes14.html#arch-and-garch-models
[5] Ionides, E. L., D. Nguyen, Y. Atchadé, S. Stoev, and A. A. King. 2015. Inference for dynamic and latent variable models via iterated, perturbed Bayes maps. Proceedings of the National Academy of Sciences of the U.S.A. 112:719-724.