1 Introduction

This data set comes from a Kaggle data challenge: DJIA 30 Stock Time Series (https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231). The original dataset contains historical stock prices (last 12 years) for 29 of 30 DJIA companies. In this project, I will mainly focus on four stocks: AAPL, GS, JPM, GOOGL. The target of this project is to study the volatility of the stock prices by fitting time-series models and to help people make statistically informed trades.

Section 2 is the data pre-processing, which involves data cleaning, exporatory data analysis. Section 3 is the time series modelling part where I fit ARIMA model to the volatility of four stocks, make hypothesis testing, and conduct frequency domain analysis.

2 Data Pre-processing

dat <- read.csv(file ="all_stocks_2006-01-01_to_2018-01-01.csv")
summary(dat)
##          Date            Open              High              Low         
##  2006-01-03:   31   Min.   :   6.75   Min.   :   7.17   Min.   :   0.00  
##  2006-01-04:   31   1st Qu.:  33.95   1st Qu.:  34.29   1st Qu.:  33.60  
##  2006-01-05:   31   Median :  60.04   Median :  60.63   Median :  59.49  
##  2006-01-06:   31   Mean   :  85.62   Mean   :  86.39   Mean   :  84.84  
##  2006-01-09:   31   3rd Qu.:  94.00   3rd Qu.:  94.74   3rd Qu.:  93.25  
##  2006-01-10:   31   Max.   :1204.88   Max.   :1213.41   Max.   :1191.15  
##  (Other)   :93426   NA's   :25        NA's   :10        NA's   :20       
##      Close             Volume               Name      
##  Min.   :   6.66   Min.   :        0   AXP    : 3020  
##  1st Qu.:  33.96   1st Qu.:  5040180   BA     : 3020  
##  Median :  60.05   Median :  9701142   CAT    : 3020  
##  Mean   :  85.64   Mean   : 20156670   CVX    : 3020  
##  3rd Qu.:  94.01   3rd Qu.: 20752222   DIS    : 3020  
##  Max.   :1195.83   Max.   :843264044   GE     : 3020  
##                                        (Other):75492
str(dat)
## 'data.frame':    93612 obs. of  7 variables:
##  $ Date  : Factor w/ 3020 levels "2006-01-03","2006-01-04",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Open  : num  77.8 79.5 78.4 78.6 78.5 ...
##  $ High  : num  79.3 79.5 78.7 78.9 79.8 ...
##  $ Low   : num  77.2 78.2 77.6 77.6 78.5 ...
##  $ Close : num  79.1 78.7 78 78.6 79 ...
##  $ Volume: int  3117200 2558000 2529500 2479500 1845600 1919900 1911900 2121100 1925300 2073400 ...
##  $ Name  : Factor w/ 31 levels "AABA","AAPL",..: 20 20 20 20 20 20 20 20 20 20 ...

We can see that the data contains missing values and the data type of variable “Date” is factor, so I will conver it into date.

dat$Date <- as.Date(dat$Date, format = "%Y-%m-%d")
dat = na.omit(dat)
par(mfrow = c(2,2))
hist(dat$Open,main = "Open all stocks")
hist(dat$High,main = "High all stocks")
hist(dat$Low, main = "Low all stocks")
hist(dat$Close,main = "Close all stocks")

The distribution of all stocks’ open, high, low and close prices are very similar, mostly concentrated on the range of $0 to $200.

dat_aapl = dat[dat$Name == "AAPL",]
dat_gs = dat[dat$Name == "GS",]
dat_jpm = dat[dat$Name == "JPM",]
dat_googl = dat[dat$Name == "GOOGL",]

Now take a closer look at four companies’ stocks

name = c("AAPL","GS","JPM","GOOGL")
#par(mfrow = c(2,2))
#hist(dat_aapl$Open,main = "Open AAPL")
#hist(dat_gs$Open,main = "Open GS")
#hist(dat_jpm$Open,main = "Open JPM")
#hist(dat_googl$Open,main = "Open GOOGL")


par(mfrow = c(2,2))
hist(dat_aapl$High,main = "High AAPL")
hist(dat_gs$High,main = "High GS")
hist(dat_jpm$High,main = "High JPM")
hist(dat_googl$High,main = "High GOOGL")

#par(mfrow = c(2,2))
#hist(dat_aapl$Low,main = "Low AAPL")
#hist(dat_gs$Low,main = "Low GS")
#hist(dat_jpm$Low,main = "Low JPM")
#hist(dat_googl$Low,main = "Low GOOGL")

#par(mfrow = c(2,2))
#hist(dat_aapl$Close,main = "Close AAPL")
#hist(dat_gs$Close,main = "Close GS")
#hist(dat_jpm$Close, main = "Close JPM")
#hist(dat_googl$Close,main = "Close GOOGL")

The distribution of GS and JPM looks like a normal distribution, while AAPL and GOOGL are very skewed to the left. Generally speaking, the price of GOOGL is the most expensive and the price of JPM is the least expensive. The price of GOOGL is also mose volitile.

3 Time Series Analysis

First, plot the High price of four stocks over the time

par(mfrow = c(2,2))
plot(dat_aapl$High ~dat_aapl$Date,type = "l",main = "AAPL")
plot(dat_gs$High ~dat_gs$Date,type = "l",main = "GS")
plot(dat_jpm$High ~dat_jpm$Date,type = "l",main = "JPM")
plot(dat_googl$High ~dat_googl$Date,type = "l",main = "GOOGL")

It seems like the price of all four stocks are gradually increasing. However, it seems that GS has some fluctuation during year 2016 to 2010

Study the ACF of the prices

par(mfrow = c(2,2))
acf(dat_aapl$High,main = "AAPL ACF",xlab = "AAPL")
acf(dat_gs$High,main = "GS ACF",xlab = "GS")
acf(dat_jpm$High,main = "JPM ACF",xlab = "JPM")
acf(dat_googl$High,main = "GOOGL ACF",xlab = "GOOGL")

From ACF plots, we see that the acf for all four stocks prices are highly correlated.

par(mfrow = c(2,2))
acf(diff(dat_aapl$High),main = "diff AAPL ACF",xlab = "AAPL")
acf(diff(dat_gs$High),main = "diff GS ACF",xlab = "GS")
acf(diff(dat_jpm$High),main = "diff JPM ACF",xlab = "JPM")
acf(diff(dat_googl$High),main = "diff GOOGL ACF",xlab = "GOOGL")

However, the difference of the prices are not highly correlated, the only correlated lag is probably around 1 or 2.

par(mfrow = c(2,2))
acf(abs(diff(dat_aapl$High) - mean(diff(dat_aapl$High))),lag.max = 200,main = "mag AAPL ACF",xlab = "AAPL")
acf(abs(diff(dat_gs$High) - mean(diff(dat_gs$High))),lag.max = 200,main = "mag GS ACF",xlab = "GS")
acf(abs(diff(dat_jpm$High) - mean(diff(dat_jpm$High))),lag.max = 200,main = "mag JPM ACF",xlab = "JPM")
acf(abs(diff(dat_googl$High) - mean(diff(dat_googl$High))),lag.max = 200,main = "mag GOOGL ACF",xlab = "GOOGL")

The magnitude are highly correlated, however. Even though the difference of the price is uncorrelated, the volitility is correlated.

Fit ARIMA model with AIC table

Try to pick the best ARMA model using the AIC table for stock price of GOOGL.

aic_table <- function(data,P,Q,I){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p+1,q+1] <- arima(data,order=c(p,0,q),method="ML")$aic
    }
  }
  dimnames(table) <- list(paste("AR",0:P, sep=""),paste("MA",0:Q,sep=""))
  return(table)
}
diff_GOOGL = diff(dat_googl$High)
diff_AAPL = diff(dat_aapl$High)
diff_GS = diff(dat_gs$High)
diff_JPM = diff(dat_jpm$High)
library(kableExtra)
options(warn=-1)
googl_aic_table <- aic_table(abs(diff_GOOGL-mean(diff_GOOGL)),4,4)
require(knitr)
## Loading required package: knitr
table = kable(googl_aic_table,digits=2,caption = "Low AIC Table") %>% kable_styling()
table
Low AIC Table
MA0 MA1 MA2 MA3 MA4
AR0 18253.37 18189.70 18166.53 18136.88 18108.55
AR1 18176.11 17979.42 17974.68 17975.96 17972.45
AR2 18142.82 17974.43 17979.28 17978.38 17978.68
AR3 18101.25 17975.42 17978.15 17980.68 17982.17
AR4 18072.78 17970.87 17977.71 17981.82 17974.48

From the AIC table, the lowest AIC value occurs at ARMA(1,5), which is 17968.60 The model is: \[Y_n = \mu + \theta{Y_{n-1}}+{\epsilon}_n + \psi_1{\epsilon_{n-1}} + \psi_2{\epsilon_{n-2}} + \psi_3{\epsilon_{n-3}} + \psi_4{\epsilon_{n-4}} + \psi_5{\epsilon_{n-5}}\] where \(\epsilon\) is a white noise process. Note that \(Y_n\) is the difference between the price, so essentially it is an ARIMA(1,1,5) with respect to the price of the stock. In the report, I will still use \(Y_n\) for consistency.

arima_GOOGL = arima(abs(diff_GOOGL-mean(diff_GOOGL)),order = c(1,0,5))
arima_GOOGL
## 
## Call:
## arima(x = abs(diff_GOOGL - mean(diff_GOOGL)), order = c(1, 0, 5))
## 
## Coefficients:
##          ar1      ma1      ma2     ma3      ma4      ma5  intercept
##       0.9972  -0.9213  -0.0315  0.0288  -0.0053  -0.0444     4.0298
## s.e.  0.0020   0.0183   0.0248  0.0257   0.0243   0.0182     0.7297
## 
## sigma^2 estimated as 22.43:  log likelihood = -8976.3,  aic = 17968.6

The arma model for GOOGL stock price is: \[Y_n = 4.0298 + 0.9972 {Y_{n-1}}+{\epsilon}_n - 0.9213{\epsilon_{n-1}} - 0.0315{\epsilon_{n-2}} + 0.0288{\epsilon_{n-3}} - 0.0053{\epsilon_{n-4}} - 0.0444{\epsilon_{n-5}}\]

To study whether a ARIMA model is more suitable, I make hypothesis and do statistical testing. We make two nested hypothese: \[H^{<0>} : \theta \in ARIMA(1,0,5)\] \[H^{<1>} : \theta \in ARIMA(1,1,2)\]

Try some other nested model and do statistical testing

googl_arma1 = arima(abs(diff_GOOGL-mean(diff_GOOGL)),order = c(1,0,4))
googl_arma1
## 
## Call:
## arima(x = abs(diff_GOOGL - mean(diff_GOOGL)), order = c(1, 0, 4))
## 
## Coefficients:
##          ar1      ma1      ma2     ma3      ma4  intercept
##       0.9965  -0.9192  -0.0333  0.0259  -0.0436     4.0281
## s.e.  0.0024   0.0185   0.0242  0.0252   0.0186     0.6757
## 
## sigma^2 estimated as 22.47:  log likelihood = -8979.25,  aic = 17972.5

We know that: \[l^{<1>} - l^{<0>} \approx (\frac{1}{2})\chi^2_{D^{<1>} - D^{<0>}}\]

which is approximate equal to 1.92. Since the difference \(l^{<1>} - l^{<0>} =-8976.3 -(-8979.25) = 2.95 > 1.92\), we cannot reject the null hypothesis, so we will choose ARIMA(1,0,4) over ARIMA(1,0,5), so the model is now: \[Y_n = 4.0281 + 0.9965{Y_{n-1}}+{\epsilon}_n - 0.9192{\epsilon_{n-1}} - 0.0333{\epsilon_{n-2}} + 0.0259{\epsilon_{n-3}} - 0.0436{\epsilon_{n-4}}\]

For the rest three stocks, will use Auto.arima to fit the model

library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
arima_AAPL = auto.arima(abs(diff_AAPL - mean(diff_AAPL)),max.p = 5,max.q = 5,method = "ML")
arima_AAPL
## Series: abs(diff_AAPL - mean(diff_AAPL)) 
## ARIMA(3,1,1) 
## 
## Coefficients:
##          ar1      ar2     ar3      ma1
##       0.0064  -0.0398  0.0395  -0.9758
## s.e.  0.0188   0.0187  0.0187   0.0045
## 
## sigma^2 estimated as 0.5355:  log likelihood=-3338.29
## AIC=6686.58   AICc=6686.6   BIC=6716.64
arima_GS = auto.arima(abs(diff_GS - mean(diff_GS)),max.p = 5,max.q = 5,method = "ML")
arima_GS
## Series: abs(diff_GS - mean(diff_GS)) 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1      ma1
##       0.0636  -0.9546
## s.e.  0.0198   0.0075
## 
## sigma^2 estimated as 3.668:  log likelihood=-6241.44
## AIC=12488.89   AICc=12488.9   BIC=12506.92
arima_JPM = auto.arima(abs(diff_JPM - mean(diff_JPM)),max.p = 5,max.q = 5,method = "ML")
arima_JPM
## Series: abs(diff_JPM - mean(diff_JPM)) 
## ARIMA(1,1,2) with drift 
## 
## Coefficients:
##          ar1      ma1     ma2  drift
##       0.8619  -1.7802  0.7847  1e-04
## s.e.  0.0371   0.0413  0.0402  3e-04
## 
## sigma^2 estimated as 0.3121:  log likelihood=-2523.61
## AIC=5057.22   AICc=5057.24   BIC=5087.28

The model for AAPL is ARIMA(3,1,1), for GS is ARIMA(1,1,1), for JPM is ARIMA(1,1,2) Write in mathematical formula: AAPL: \[Z_n = 0.0064Z_{n-1} - 0.0398Z_{n-2} + 0.0395Z_{n-3} + {\epsilon}_n - 0.9758{\epsilon_{n-1}}\] where \(Z_n = Y_n - Y_{n-1}\) GS: \[Z_n = 0.0636Z_{n-1} + {\epsilon}_n - 0.9546{\epsilon_{n-1}}\]

JPM: \[Z_n = 0.8620Z_{n-1} + {\epsilon}_n - 1.7803{\epsilon_{n-1}} + 0.7848\epsilon_{n-2}\]

Check the causality and invertibility based on the roots.

ar_root_AAPL = polyroot(c(1,-coef(arima_AAPL)[c("ar1","ar2","ar2")]))
ma_root_AAPL = polyroot(c(1,coef(arima_AAPL)[c("ma1")]))
print(abs(ar_root_AAPL))
## [1] 2.750380 3.324164 2.750380
print(abs(ma_root_AAPL))
## [1] 1.024849
ar_root_JPM = polyroot(c(1,-coef(arima_JPM)[c("ar1")]))
ma_root_JPM = polyroot(c(1,coef(arima_JPM)[c("ma1","ma2")]))
print(abs(ar_root_JPM))
## [1] 1.160208
print(abs(ma_root_JPM))
## [1] 1.023414 1.245159
ar_root_GOOGL = polyroot(c(1,-coef(googl_arma1)[c("ar1")]))
ma_root_GOOGL = polyroot(c(1,coef(googl_arma1)[c("ma1","ma2","ma3","ma4")]))
print(abs(ar_root_GOOGL))
## [1] 1.00355
print(abs(ma_root_GOOGL))
## [1] 1.027382 2.817468 2.816138 2.816138
ar_root_GS = polyroot(c(1,-coef(arima_GS)[c("ar1")]))
ma_root_GS = polyroot(c(1,coef(arima_GS)[c("ma1")]))
print(abs(ar_root_GS))
## [1] 15.71969
print(abs(ma_root_GS))
## [1] 1.047609

AAPL have all AR roots outside unit circle, so it is invertible. The MA roots of AAPL almost fall on the unit circle so it may not be causal, neither. JPM has AR roots outside the unit circle, so it is causal, but one of its MA roots is on the unit circle so it may not be invertible. GOOGL has both MA root and AR root on unit circle so it is neither invertible nor causal. GS has both AR root and MA root outside unit circle so it is invertible and causal.

Now, study the seasonality

We study the seasonality by the frequency domain analysis

par(mfrow = c(2,2))
spectrum(ts(dat_aapl$High,frequency = 250),span = c(3,5,3))
spectrum(ts(dat_googl$High,frequency = 250),span = c(3,5,3))
spectrum(ts(dat_gs$High,frequency = 250),span = c(3,5,3))
spectrum(ts(dat_jpm$High,frequency = 250),span = c(3,5,3))

None of the stock shows seasonality since there is no clear peaks on the plot.

Residual Analysis

AAPL_resid = forecast(arima_AAPL, h = 30)
checkresiduals(AAPL_resid)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,1)
## Q* = 4.0477, df = 6, p-value = 0.6702
## 
## Model df: 4.   Total lags used: 10
GOOGL_resid = forecast(arima_GOOGL, h = 30)
checkresiduals(GOOGL_resid)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,5) with non-zero mean
## Q* = 8.4968, df = 3, p-value = 0.03679
## 
## Model df: 7.   Total lags used: 10
GS_resid = forecast(arima_GS, h = 30)
checkresiduals(GS_resid)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 16.131, df = 8, p-value = 0.04054
## 
## Model df: 2.   Total lags used: 10
JPM_resid = forecast(arima_JPM, h = 30)
checkresiduals(JPM_resid)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2) with drift
## Q* = 14.227, df = 6, p-value = 0.0272
## 
## Model df: 4.   Total lags used: 10

The residuals look normal or like white noise, so the model is reasonal and is able to explain the correlation part in the data.

Conclusion

  1. The difference between the stock price is uncorrelate. However, their absolute deviance are correlated. In other words, the volatility is correlated. 2.For four stocks, fit different ARIMA models. The model for GOOGL is ARIMA(1,0,4), AAPL is ARIMA(3,1,1), for GS is ARIMA(1,1,1), for JPM is ARIMA(1,1,2). From the residual analysis, all the model work pretty well.
  2. There is no clear seasonality.
  3. It is possible that the price is related to other factors, such as the volumn. More complex model is needed in order to gain a more accurate model.

Reference

  1. DJIA 30 Stock Time Series https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231
  2. Time Series Analysis ARIMA basic model https://www.kaggle.com/szrlee/time-series-analysis-arima-basic-model/report
  3. Using R for Time Series Analysis https://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html
  4. Time Series ARIMA Models https://sites.google.com/site/econometricsacademy/econometrics-models/time-series-arima-models
  5. Lecture notes 3-8 https://ionides.github.io/531w20/