This dataset comes from a Kaggle data challenge: DJIA 30 Stock Time Series (https://www.kaggle.com/szrlee/stock-time-series-20050101-to-20171231). The original dataset contains twelve years of historical stock prices for 29 of the 30 DJIA companies. In this project I focus on four stocks: AAPL, GS, JPM, and GOOGL. The goal of this project is to study the volatility of the stock prices by fitting time series models, and thereby to help people make statistically informed trades.
Section 2 covers data pre-processing, including data cleaning and exploratory data analysis. Section 3 is the time series modelling part, where I fit ARIMA models to the volatility of the four stocks, perform hypothesis tests, and conduct frequency domain analysis.
dat <- read.csv(file ="all_stocks_2006-01-01_to_2018-01-01.csv")
summary(dat)
## Date Open High Low
## 2006-01-03: 31 Min. : 6.75 Min. : 7.17 Min. : 0.00
## 2006-01-04: 31 1st Qu.: 33.95 1st Qu.: 34.29 1st Qu.: 33.60
## 2006-01-05: 31 Median : 60.04 Median : 60.63 Median : 59.49
## 2006-01-06: 31 Mean : 85.62 Mean : 86.39 Mean : 84.84
## 2006-01-09: 31 3rd Qu.: 94.00 3rd Qu.: 94.74 3rd Qu.: 93.25
## 2006-01-10: 31 Max. :1204.88 Max. :1213.41 Max. :1191.15
## (Other) :93426 NA's :25 NA's :10 NA's :20
## Close Volume Name
## Min. : 6.66 Min. : 0 AXP : 3020
## 1st Qu.: 33.96 1st Qu.: 5040180 BA : 3020
## Median : 60.05 Median : 9701142 CAT : 3020
## Mean : 85.64 Mean : 20156670 CVX : 3020
## 3rd Qu.: 94.01 3rd Qu.: 20752222 DIS : 3020
## Max. :1195.83 Max. :843264044 GE : 3020
## (Other):75492
str(dat)
## 'data.frame': 93612 obs. of 7 variables:
## $ Date : Factor w/ 3020 levels "2006-01-03","2006-01-04",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Open : num 77.8 79.5 78.4 78.6 78.5 ...
## $ High : num 79.3 79.5 78.7 78.9 79.8 ...
## $ Low : num 77.2 78.2 77.6 77.6 78.5 ...
## $ Close : num 79.1 78.7 78 78.6 79 ...
## $ Volume: int 3117200 2558000 2529500 2479500 1845600 1919900 1911900 2121100 1925300 2073400 ...
## $ Name : Factor w/ 31 levels "AABA","AAPL",..: 20 20 20 20 20 20 20 20 20 20 ...
We can see that the data contains missing values and that the variable “Date” is stored as a factor, so I convert it to the Date type and drop the rows with missing values.
dat$Date <- as.Date(dat$Date, format = "%Y-%m-%d")
dat = na.omit(dat)
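As a quick reminder of what `na.omit()` does (a toy example, not the stock data), it drops every row that contains at least one NA:

```r
# Toy data frame with one missing value in Open.
toy <- data.frame(Open = c(1.0, NA, 3.0), Close = c(1.1, 2.1, 3.1))
colSums(is.na(toy))  # Open: 1, Close: 0
clean <- na.omit(toy)
nrow(clean)          # 2 rows remain
```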
par(mfrow = c(2,2))
hist(dat$Open,main = "Open all stocks")
hist(dat$High,main = "High all stocks")
hist(dat$Low, main = "Low all stocks")
hist(dat$Close,main = "Close all stocks")
The distributions of the open, high, low, and close prices across all stocks are very similar, mostly concentrated in the range of $0 to $200.
dat_aapl = dat[dat$Name == "AAPL",]
dat_gs = dat[dat$Name == "GS",]
dat_jpm = dat[dat$Name == "JPM",]
dat_googl = dat[dat$Name == "GOOGL",]
Now take a closer look at the four companies’ stocks.
name = c("AAPL","GS","JPM","GOOGL")
#par(mfrow = c(2,2))
#hist(dat_aapl$Open,main = "Open AAPL")
#hist(dat_gs$Open,main = "Open GS")
#hist(dat_jpm$Open,main = "Open JPM")
#hist(dat_googl$Open,main = "Open GOOGL")
par(mfrow = c(2,2))
hist(dat_aapl$High,main = "High AAPL")
hist(dat_gs$High,main = "High GS")
hist(dat_jpm$High,main = "High JPM")
hist(dat_googl$High,main = "High GOOGL")
#par(mfrow = c(2,2))
#hist(dat_aapl$Low,main = "Low AAPL")
#hist(dat_gs$Low,main = "Low GS")
#hist(dat_jpm$Low,main = "Low JPM")
#hist(dat_googl$Low,main = "Low GOOGL")
#par(mfrow = c(2,2))
#hist(dat_aapl$Close,main = "Close AAPL")
#hist(dat_gs$Close,main = "Close GS")
#hist(dat_jpm$Close, main = "Close JPM")
#hist(dat_googl$Close,main = "Close GOOGL")
The distributions of GS and JPM look roughly normal, while those of AAPL and GOOGL are heavily skewed, with most of the mass at lower prices. Generally speaking, GOOGL is the most expensive of the four and JPM the cheapest. The price of GOOGL is also the most volatile.
par(mfrow = c(2,2))
plot(dat_aapl$High ~dat_aapl$Date,type = "l",main = "AAPL")
plot(dat_gs$High ~dat_gs$Date,type = "l",main = "GS")
plot(dat_jpm$High ~dat_jpm$Date,type = "l",main = "JPM")
plot(dat_googl$High ~dat_googl$Date,type = "l",main = "GOOGL")
It seems that the prices of all four stocks are gradually increasing. However, GS shows substantial fluctuation from around 2006 to 2010.
par(mfrow = c(2,2))
acf(dat_aapl$High,main = "AAPL ACF",xlab = "AAPL")
acf(dat_gs$High,main = "GS ACF",xlab = "GS")
acf(dat_jpm$High,main = "JPM ACF",xlab = "JPM")
acf(dat_googl$High,main = "GOOGL ACF",xlab = "GOOGL")
From the ACF plots, we see that all four stock price series are highly autocorrelated.
par(mfrow = c(2,2))
acf(diff(dat_aapl$High),main = "diff AAPL ACF",xlab = "AAPL")
acf(diff(dat_gs$High),main = "diff GS ACF",xlab = "GS")
acf(diff(dat_jpm$High),main = "diff JPM ACF",xlab = "JPM")
acf(diff(dat_googl$High),main = "diff GOOGL ACF",xlab = "GOOGL")
However, the differenced prices are not highly correlated; the only notable autocorrelation is around lags 1 or 2.
par(mfrow = c(2,2))
acf(abs(diff(dat_aapl$High) - mean(diff(dat_aapl$High))),lag.max = 200,main = "mag AAPL ACF",xlab = "AAPL")
acf(abs(diff(dat_gs$High) - mean(diff(dat_gs$High))),lag.max = 200,main = "mag GS ACF",xlab = "GS")
acf(abs(diff(dat_jpm$High) - mean(diff(dat_jpm$High))),lag.max = 200,main = "mag JPM ACF",xlab = "JPM")
acf(abs(diff(dat_googl$High) - mean(diff(dat_googl$High))),lag.max = 200,main = "mag GOOGL ACF",xlab = "GOOGL")
The magnitudes, however, are highly correlated: even though the price differences are uncorrelated, the volatility is.
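The fact that increments can be uncorrelated while their magnitudes are correlated is the classic signature of conditional heteroskedasticity (ARCH-type behavior). A minimal simulation sketch, not part of the stock analysis, reproduces this pattern:

```r
# Minimal ARCH(1)-type simulation: increments are uncorrelated, but their
# magnitudes are autocorrelated, mirroring the stock ACF plots.
set.seed(1)
n <- 2000
eps <- numeric(n)
eps[1] <- rnorm(1)
for (t in 2:n) {
  sigma2 <- 0.2 + 0.5 * eps[t - 1]^2  # variance depends on the previous shock
  eps[t] <- sqrt(sigma2) * rnorm(1)
}
acf(eps, plot = FALSE)$acf[2]       # lag-1 ACF of increments: near zero
acf(abs(eps), plot = FALSE)$acf[2]  # lag-1 ACF of magnitudes: clearly positive
```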
I try to pick the best ARMA model for the GOOGL volatility series using an AIC table.
aic_table <- function(data, P, Q){
  table <- matrix(NA, (P+1), (Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p+1, q+1] <- arima(data, order = c(p,0,q), method = "ML")$aic
    }
  }
  dimnames(table) <- list(paste("AR", 0:P, sep = ""), paste("MA", 0:Q, sep = ""))
  return(table)
}
diff_GOOGL = diff(dat_googl$High)
diff_AAPL = diff(dat_aapl$High)
diff_GS = diff(dat_gs$High)
diff_JPM = diff(dat_jpm$High)
library(kableExtra)
options(warn=-1)
googl_aic_table <- aic_table(abs(diff_GOOGL-mean(diff_GOOGL)),4,4)
require(knitr)
## Loading required package: knitr
table = kable(googl_aic_table, digits = 2, caption = "GOOGL AIC Table") %>% kable_styling()
table
|     | MA0      | MA1      | MA2      | MA3      | MA4      |
|-----|----------|----------|----------|----------|----------|
| AR0 | 18253.37 | 18189.70 | 18166.53 | 18136.88 | 18108.55 |
| AR1 | 18176.11 | 17979.42 | 17974.68 | 17975.96 | 17972.45 |
| AR2 | 18142.82 | 17974.43 | 17979.28 | 17978.38 | 17978.68 |
| AR3 | 18101.25 | 17975.42 | 17978.15 | 17980.68 | 17982.17 |
| AR4 | 18072.78 | 17970.87 | 17977.71 | 17981.82 | 17974.48 |
Extending the search to MA order 5, the lowest AIC occurs at ARMA(1,5), with AIC 17968.60. The model is: \[Y_n = \mu + \theta Y_{n-1} + \epsilon_n + \psi_1\epsilon_{n-1} + \psi_2\epsilon_{n-2} + \psi_3\epsilon_{n-3} + \psi_4\epsilon_{n-4} + \psi_5\epsilon_{n-5}\] where \(\epsilon_n\) is a white noise process. Note that \(Y_n\) is the magnitude of the centered price difference, so with respect to the price itself this is roughly an ARIMA(1,1,5). In this report I will keep using \(Y_n\) for consistency.
arima_GOOGL = arima(abs(diff_GOOGL-mean(diff_GOOGL)),order = c(1,0,5))
arima_GOOGL
##
## Call:
## arima(x = abs(diff_GOOGL - mean(diff_GOOGL)), order = c(1, 0, 5))
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 ma5 intercept
## 0.9972 -0.9213 -0.0315 0.0288 -0.0053 -0.0444 4.0298
## s.e. 0.0020 0.0183 0.0248 0.0257 0.0243 0.0182 0.7297
##
## sigma^2 estimated as 22.43: log likelihood = -8976.3, aic = 17968.6
The fitted ARMA model for the GOOGL volatility is: \[Y_n = 4.0298 + 0.9972 Y_{n-1} + \epsilon_n - 0.9213\epsilon_{n-1} - 0.0315\epsilon_{n-2} + 0.0288\epsilon_{n-3} - 0.0053\epsilon_{n-4} - 0.0444\epsilon_{n-5}\]
To study which nested model is more suitable, I perform a likelihood ratio test. The two nested hypotheses are: \[H^{<0>} : \theta \in ARIMA(1,0,4)\] \[H^{<1>} : \theta \in ARIMA(1,0,5)\]
I fit the smaller nested model:
googl_arma1 = arima(abs(diff_GOOGL-mean(diff_GOOGL)),order = c(1,0,4))
googl_arma1
##
## Call:
## arima(x = abs(diff_GOOGL - mean(diff_GOOGL)), order = c(1, 0, 4))
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 intercept
## 0.9965 -0.9192 -0.0333 0.0259 -0.0436 4.0281
## s.e. 0.0024 0.0185 0.0242 0.0252 0.0186 0.6757
##
## sigma^2 estimated as 22.47: log likelihood = -8979.25, aic = 17972.5
Under Wilks’ approximation: \[l^{<1>} - l^{<0>} \approx \left(\tfrac{1}{2}\right)\chi^2_{D^{<1>} - D^{<0>}}\]
With one extra parameter, the 95% cutoff is \(\chi^2_1(0.95)/2 \approx 1.92\). The observed difference is \(l^{<1>} - l^{<0>} = -8976.30 - (-8979.25) = 2.95 > 1.92\), so the test formally favors ARIMA(1,0,5); the improvement is modest, however, and for parsimony I proceed with the simpler ARIMA(1,0,4): \[Y_n = 4.0281 + 0.9965 Y_{n-1} + \epsilon_n - 0.9192\epsilon_{n-1} - 0.0333\epsilon_{n-2} + 0.0259\epsilon_{n-3} - 0.0436\epsilon_{n-4}\]
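The 1.92 cutoff comes directly from the chi-squared quantile and can be reproduced in R:

```r
# Half of the 95% quantile of chi-squared with 1 degree of freedom
# (one extra parameter in the larger model).
cutoff <- qchisq(0.95, df = 1) / 2  # approximately 1.92
l1 <- -8976.30  # log likelihood of the ARMA(1,5) fit above
l0 <- -8979.25  # log likelihood of the ARMA(1,4) fit
(l1 - l0) > cutoff  # TRUE: the difference 2.95 exceeds the cutoff
```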
For the remaining three stocks, I use auto.arima() from the forecast package to select the models.
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
arima_AAPL = auto.arima(abs(diff_AAPL - mean(diff_AAPL)),max.p = 5,max.q = 5,method = "ML")
arima_AAPL
## Series: abs(diff_AAPL - mean(diff_AAPL))
## ARIMA(3,1,1)
##
## Coefficients:
## ar1 ar2 ar3 ma1
## 0.0064 -0.0398 0.0395 -0.9758
## s.e. 0.0188 0.0187 0.0187 0.0045
##
## sigma^2 estimated as 0.5355: log likelihood=-3338.29
## AIC=6686.58 AICc=6686.6 BIC=6716.64
arima_GS = auto.arima(abs(diff_GS - mean(diff_GS)),max.p = 5,max.q = 5,method = "ML")
arima_GS
## Series: abs(diff_GS - mean(diff_GS))
## ARIMA(1,1,1)
##
## Coefficients:
## ar1 ma1
## 0.0636 -0.9546
## s.e. 0.0198 0.0075
##
## sigma^2 estimated as 3.668: log likelihood=-6241.44
## AIC=12488.89 AICc=12488.9 BIC=12506.92
arima_JPM = auto.arima(abs(diff_JPM - mean(diff_JPM)),max.p = 5,max.q = 5,method = "ML")
arima_JPM
## Series: abs(diff_JPM - mean(diff_JPM))
## ARIMA(1,1,2) with drift
##
## Coefficients:
## ar1 ma1 ma2 drift
## 0.8619 -1.7802 0.7847 1e-04
## s.e. 0.0371 0.0413 0.0402 3e-04
##
## sigma^2 estimated as 0.3121: log likelihood=-2523.61
## AIC=5057.22 AICc=5057.24 BIC=5087.28
The selected models are ARIMA(3,1,1) for AAPL, ARIMA(1,1,1) for GS, and ARIMA(1,1,2) for JPM. Written out, with \(Z_n = Y_n - Y_{n-1}\):
AAPL: \[Z_n = 0.0064Z_{n-1} - 0.0398Z_{n-2} + 0.0395Z_{n-3} + \epsilon_n - 0.9758\epsilon_{n-1}\]
GS: \[Z_n = 0.0636Z_{n-1} + \epsilon_n - 0.9546\epsilon_{n-1}\]
JPM: \[Z_n = 0.8619Z_{n-1} + \epsilon_n - 1.7802\epsilon_{n-1} + 0.7847\epsilon_{n-2}\]
ar_root_AAPL = polyroot(c(1,-coef(arima_AAPL)[c("ar1","ar2","ar3")]))
ma_root_AAPL = polyroot(c(1,coef(arima_AAPL)[c("ma1")]))
print(abs(ar_root_AAPL))
## [1] 2.750380 3.324164 2.750380
print(abs(ma_root_AAPL))
## [1] 1.024849
ar_root_JPM = polyroot(c(1,-coef(arima_JPM)[c("ar1")]))
ma_root_JPM = polyroot(c(1,coef(arima_JPM)[c("ma1","ma2")]))
print(abs(ar_root_JPM))
## [1] 1.160208
print(abs(ma_root_JPM))
## [1] 1.023414 1.245159
ar_root_GOOGL = polyroot(c(1,-coef(googl_arma1)[c("ar1")]))
ma_root_GOOGL = polyroot(c(1,coef(googl_arma1)[c("ma1","ma2","ma3","ma4")]))
print(abs(ar_root_GOOGL))
## [1] 1.00355
print(abs(ma_root_GOOGL))
## [1] 1.027382 2.817468 2.816138 2.816138
ar_root_GS = polyroot(c(1,-coef(arima_GS)[c("ar1")]))
ma_root_GS = polyroot(c(1,coef(arima_GS)[c("ma1")]))
print(abs(ar_root_GS))
## [1] 15.71969
print(abs(ma_root_GS))
## [1] 1.047609
AAPL has all AR roots outside the unit circle, so it is causal; its MA root (about 1.02) lies almost on the unit circle, so it is at best borderline invertible. JPM’s AR root is outside the unit circle, so it is causal, but one of its MA roots is very close to the unit circle, so it may not be invertible. GOOGL has both its AR root and one MA root essentially on the unit circle, so it is neither clearly causal nor clearly invertible. GS has all AR and MA roots outside the unit circle, so it is both causal and invertible.
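The rule applied here can be wrapped in a small helper (a sketch, not from the original analysis): an ARMA model is causal when all roots of the AR polynomial lie outside the unit circle, and invertible when all MA roots do.

```r
# Check whether all roots of a lag polynomial lie outside the unit circle.
# For an AR part pass c(1, -ar_coefs); for an MA part pass c(1, ma_coefs).
all_roots_outside <- function(poly_coefs) {
  all(abs(polyroot(poly_coefs)) > 1)
}
all_roots_outside(c(1, -0.5))  # AR(1) with phi = 0.5: TRUE (causal)
all_roots_outside(c(1, -1.2))  # MA(1) with theta = -1.2: FALSE (not invertible)
```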
We study seasonality through frequency domain analysis.
par(mfrow = c(2,2))
spectrum(ts(dat_aapl$High,frequency = 250),span = c(3,5,3))
spectrum(ts(dat_googl$High,frequency = 250),span = c(3,5,3))
spectrum(ts(dat_gs$High,frequency = 250),span = c(3,5,3))
spectrum(ts(dat_jpm$High,frequency = 250),span = c(3,5,3))
None of the stocks shows seasonality, since there are no clear peaks in the spectra.
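For contrast, a simulated example (not the stock data): a series with a genuine annual cycle, sampled at 250 trading days per "year", produces a sharp spectral peak at frequency 1 cycle per year.

```r
# Simulated series with one cycle per 250 observations ("one year").
set.seed(2)
x <- ts(sin(2 * pi * (1:1000) / 250) + rnorm(1000, sd = 0.5), frequency = 250)
s <- spectrum(x, plot = FALSE)
s$freq[which.max(s$spec)]  # peak near frequency 1 (one cycle per year)
```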
AAPL_resid = forecast(arima_AAPL, h = 30)
checkresiduals(AAPL_resid)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,1)
## Q* = 4.0477, df = 6, p-value = 0.6702
##
## Model df: 4. Total lags used: 10
GOOGL_resid = forecast(arima_GOOGL, h = 30)
checkresiduals(GOOGL_resid)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,5) with non-zero mean
## Q* = 8.4968, df = 3, p-value = 0.03679
##
## Model df: 7. Total lags used: 10
GS_resid = forecast(arima_GS, h = 30)
checkresiduals(GS_resid)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)
## Q* = 16.131, df = 8, p-value = 0.04054
##
## Model df: 2. Total lags used: 10
JPM_resid = forecast(arima_JPM, h = 30)
checkresiduals(JPM_resid)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2) with drift
## Q* = 14.227, df = 6, p-value = 0.0272
##
## Model df: 4. Total lags used: 10
The AAPL residuals pass the Ljung-Box test (p = 0.67) and look like white noise, so that model explains the correlation in the data well. For GOOGL, GS, and JPM, the p-values fall slightly below 0.05, indicating some remaining autocorrelation: those models capture most, but not all, of the dependence in the data.