2.1 Explore the data
dat <- read.csv(file="NDAQ.csv",header=TRUE)
head(dat)
##         Date  Open  High   Low Close Adj.Close Volume
## 1 2015-03-02 50.37 50.93 50.29 50.92  48.18684 761800
## 2 2015-03-03 50.90 51.13 50.21 50.60  47.88402 646700
## 3 2015-03-04 50.51 51.10 50.24 51.02  48.28148 987500
## 4 2015-03-05 51.04 51.04 50.40 50.78  48.05436 635800
## 5 2015-03-06 50.66 51.10 50.12 50.19  47.49603 742000
## 6 2015-03-09 50.26 50.85 50.01 50.70  47.97866 619100
date <- as.Date(dat$Date)
price <- as.numeric(dat$Adj.Close)
- Summary our data with six stastics index:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   45.66   55.31   64.82   63.25   69.72   81.85
- Seeing that the scale of price is quite large. And the median price and mean price are very close to each other. Now we plot the change of log-adjusted closing price with time, here we use log-price is because the difference of log-price is the log-return of Nasdaq, it has economical meaning. 

- The blue line is the total mean of our data, and seeing that there is a very obvious trend that the adjusted closing price is increasing with time goes by. Maybe we can start by a random walk with drift and to see if the model fits well.
 
2.2 Fitting with random walk model
- The random walk with drift model is given by the difference equation: 
 
 \(Y_n = Y_{n-1} + \mu + \epsilon_n\)
 
 driven by a white noise process \({\epsilon_n}\). This has solution: \[Y_n = Y_0 + n\mu + \sum_{k=1}^n\epsilon_k\]
- To train our intuition, we can compare the data with simulations from a fitted model. A simple starting point is a Gaussian random walk with drift, having parameters estimated by:
N <- nrow(dat)
mu = mean(diff(log(price)))
sigma = sd(diff(log(price)))
set.seed(20180301)
X1 <- log(price[1])+cumsum(c(0,rnorm(N-1,mean=mu,sd=sigma)))
set.seed(03012018)
X2 <- log(price[1])+cumsum(c(0,rnorm(N-1,mean=mu,sd=sigma)))
par(mfrow=c(1,2))
plot(X1,type="l")
plot(X2,type="l")

- It seems like that the left one is very similar to our data plot but the right one is far more different.
- Let \(y^*_{1:N}\) be the time series of Nasdaq daily closing values downloaded from yahoo.com. Let \(z^*_n= \Delta \log y^*_n = \log y^*_{n}-\log y^*_{n-1}\).
- Let’s plot the sample autocorrelation function of the time series of Nasdaq log-returns, \(z^*_{2:N}\).
z = diff(log(price))
acf(z)

acf(abs(z-mean(z)),lag.max=200)

- For ACF of log-return it self, it looks pretty close to the ACF of white noise. There is some evidence for a small nonzero autocorrelation. However, as our data length is 756, the number of lags out of the dashed line is far more less than 5% of the whole data, which makes these nozero lags insignificant.
- But seeing the ACF plot for absolute value, it’s obvious that \(z^*_{2:N}\) can’t be a white noise process as there are many lags out of the dashed line and it must be larger than 5% of N, which means it’s significant.
- In conclusion, random walk with drift model fits not very well to our data.
 
2.3 Fitting with ARIMA model
- We seperate data into training data and test data, choose last 10% as test data.
dat_train = log(price[1:floor(0.9*length(price))])
dat_test = log(price[floor(0.9*length(price)):length(price)])
2.3.1 Choosing p, q and d
- First we should decide the order of difference, i.e. value of d in ARIMA(p,d,q).
- Usually people will pick d no more than 1, so let’s start with d = 1.
t = seq(from=1,length=floor(0.9*N)-1,by=1)
detrend_dat = diff(dat_train)
plot(t,detrend_dat,type="l",xlab="t",ylab="difference of log price",main = "Detrending Nasdaq index with d = 1")
abline(h=mean(detrend_dat),col = "blue")

- It seems like there is no trend for one order of difference (mean stationary). We can fit our model under the null hyphosis with d = 1 firstly and then we can make some test to check this.
- We seek to fit a stationary Gaussian ARIMA(p,1,q) model with parameter vector \(\theta = (\phi_{1:p},\psi_{1:q},\mu,\sigma^2)\) given by: \[\phi(B)((1-B)Y_n-\mu) = \psi(B)\varepsilon_n\] where \({\epsilon_n}\) is a white noise process (i.e. \(\sim iid N(0,\sigma^2)\));\(\mu = E(X_n)\) is the trend; \(\phi(x) = 1-\sum_{i=1}^p \phi_i x^i\) and \(\psi(x) = 1+\sum_{i=1}^q \psi_i x^i\) are the ARMA polynomials; B is lag operator; \(Y_n\) is the estimator of log(Nasdaq adjusted closing price). NA \[AIC = -2 \times l(\theta^*) + 2D\] where \(l(\theta^*)\) is the log-likelihood function deciding by parameter vector \(\theta^*\) and \(D\) is the number of parameters. The model with the lowest AIC score implies the least prediction error.
- Let’s make a table for choosing different p and q and there AIC score: - 
- 
| AR0 | -4124.41 | -4126.12 | -4126.98 | -4125.20 | -4125.42 |  - 
| AR1 | -4125.64 | -4126.59 | -4131.85 | -4130.44 | -4128.75 |  - 
| AR2 | -4126.77 | -4131.36 | -4130.68 | -4129.32 | -4128.18 |  - 
| AR3 | -4124.79 | -4123.45 | -4129.31 | -4125.48 | -4129.80 |  - 
| AR4 | -4125.24 | -4127.10 | -4127.32 | -4125.33 | -4127.27 |  
 
- From the table above, it’s easily to find that the two lowest AIC score appears at ARIMA(1,1,2) and ARIMA(2,1,1).
- The models on their right and below them have both larger AIC scores and more complex models (i.e. larger p and q, which can causing invertibility and stability problems). That’s not what we want, so we can exclude them.
- The models on their left and upper have larger AIC scores with apparently differences. Thus, we can also exclude thinking about these models. 
 
2.3.2 Regression on ARIMA models
- We fit and diagnose two models simultaneously.
price_arma12 = arima(dat_train,order=c(1,1,2)); price_arma12
## 
## Call:
## arima(x = dat_train, order = c(1, 1, 2))
## 
## Coefficients:
##           ar1     ma1      ma2
##       -0.9068  0.8424  -0.1131
## s.e.   0.0533  0.0655   0.0420
## 
## sigma^2 estimated as 0.0001317:  log likelihood = 2069.93,  aic = -4131.85
price_arma21 = arima(dat_train,order=c(2,1,1)); price_arma21
## 
## Call:
## arima(x = dat_train, order = c(2, 1, 1))
## 
## Coefficients:
##           ar1      ar2     ma1
##       -1.0163  -0.1019  0.9587
## s.e.   0.0520   0.0397  0.0369
## 
## sigma^2 estimated as 0.0001318:  log likelihood = 2069.68,  aic = -4131.36
- We can get our two models from above: 
 
 For ARIMA(1,1,2): \[(1+0.9068B)(1-B)Y_n = (1+0.8424B-0.1131B^2)\varepsilon_n\] For ARIMA(2,1,1): \[(1+1.0163B+0.1019B^2)(1-B)Y_n = (1+0.9587B)\varepsilon_n\]
- From the results above, we can see that from Fisher information, the confidence interval (i.e. standard error) for all the coefficients are quite small so that zero is not covered in it. In a word, they are all significant under Fisher information. Also we can see that both of them don’t have intercept terms, this indicates that the differences between adjacent log prices have a mean close to zero. \[E(Y_n-Y_{n-1}) \approx 0\] where \(Y_n = log(price\ at\ time\ n)\)
 
2.3.3 Test our models
- Let’s look at the roots for our models to check the casuality and invertibility
# AR root for ARIMA(1,1,2) is outside of unit circle
MA_roots <- polyroot(c(1, coef(price_arma12)[c("ma1", "ma2")]))
MA_roots # MA root for ARIMA(1,1,2)
## [1] -1.041512+0i  8.492627-0i
# MA root for ARIMA(2,1,1) is outside of unit circle
AR_roots <- polyroot(c(1, -coef(price_arma21)[c("ar1", "ar2")]))
AR_roots # AR root for ARIMA(2,1,1)
## [1] -1.106809-0i -8.864667+0i
- Seeing that all of the roots are just outside the unit circle, we can conclude that both two models can fit our data very well without stability problems.
 
 
2.4 Diagnostics
2.4.1 Seasonality
- Although we didn’t see any evidence of seasonality, people may suspect that the change of Nasdaq index may be affected by some seasonal influence and thus have seasonality effect. We check this under frequency domain.
spectrum_price <-ts(dat_train,frequency=250)
spectrum(spectrum_price,spans=c(3,5,3))

- Seeing that there aren’t any significant peaks in periodogram above, this implies that there is no seasonality effect for fitting our data.
 
2.4.2 Normality of residuals
- First let’s plot the residuals and ACF plots of two models.
par(mfrow = c(1,2))
plot(price_arma12$resid,ylab = "residuals",main = "Residuals for ARIMA(1,1,2) Model")
acf(price_arma12$resid,main = "ACF of residuals for ARIMA(1,1,2)")

par(mfrow = c(1,2))
plot(price_arma21$resid,ylab = "residuals",main = "Residuals for ARIMA(2,1,1) Model")
acf(price_arma21$resid,main = "ACF of residuals for ARIMA(2,1,1)")

- For the first sight, there seems no big problem for residuals, the mean is stationary and around zero, ACF plots show that more than 95% lags are in the dashed line.
- Using QQ-plot to check normality of residuals: 
par(mfrow = c(1,2))
qqnorm(price_arma12$resid)
qqline(price_arma12$resid,probs = c(0.25,0.75))
qqnorm(price_arma21$resid)
qqline(price_arma21$resid,probs = c(0.25,0.75))

- Both of the residuals have heavier tails than normal distribution, refusing normality assumption.
library(forecast)
pred = predict(price_arma12, n.ahead = (N-floor(0.9*N)+1))$pred
forecast1 = forecast(price_arma12, h = 25)
plot(forecast1)
x = c((N-floor(0.1*N)-1):N)
lines(x, dat_test, type="l", col="red")

- Using ARIMA(1,1,2) for example, the prediction of our test data is as above. We can see that the true data can only be covered by the border of 95% confidence interval. We need do some other method to improve our model.[learn the code from http://blog.fens.me/r-ar/]
 
2.4.3 Modify our model
- It seems like that Nasdaq index is also affected by some other factors.
- Let’s take volume as one of the considering factors in our model. We know that volume can affect the price apparently.
volume = log(as.numeric(dat$Volume))
vol_train = log(volume[1:floor(0.9*length(price))])
vol_test = log(volume[floor(0.9*length(price)):length(price)])
- repeat the steps in 2.3 and see how result change: - 
- 
| AR0 | -4133.60 | -4135.90 | -4135.95 | -4134.01 | -4133.64 |  - 
| AR1 | -4135.42 | -4135.86 | -4140.32 | -4139.21 | -4137.30 |  - 
| AR2 | -4135.77 | -4133.78 | -4139.43 | -4137.28 | -4136.03 |  - 
| AR3 | -4133.77 | -4132.48 | -4137.82 | -4133.45 | -4140.05 |  - 
| AR4 | -4133.67 | -4134.97 | -4135.93 | -4133.94 | -4137.12 |  
 
- Different from the result before, ARIMA(1,1,2) has the significant lowest score than other models. Thus, we choose ARIMA(1,1,2) as our model. 
arima112 <- arima(dat_train,order = c(1,1,2),xreg = vol_train)
arima112
## 
## Call:
## arima(x = dat_train, order = c(1, 1, 2), xreg = vol_train)
## 
## Coefficients:
##           ar1     ma1      ma2  vol_train
##       -0.9620  0.8911  -0.1029    -0.0535
## s.e.   0.0153  0.0423   0.0409     0.0160
## 
## sigma^2 estimated as 0.0001295:  log likelihood = 2075.16,  aic = -4140.32
- Representing our model in math way: \[\phi(B)((1-B)Y_n-\mu-\beta v_n) = \psi(B)\varepsilon_n\] where the other notation is the same as before, and \(v_n\) represents the value of volume at time n.
- From the result above, our new model is: \[(1+0.9620B)(1-B)(Y_n+0.0535) = (1+0.8911B-0.1028B^2)\varepsilon_n\]
- Check the root:
# AR root is outside the unit circle
MA_roots <- polyroot(c(1,coef(arima112)[c("ma1","ma2")]))
MA_roots # MA roots are outside the unit circle
## [1] -1.005512+0i  9.669575-0i
- Again, all the roots are just outside the unit circle.
- Look at residuals:
plot(arima112$resid,ylab = "residuals",main = "Residuals for ARIMA(1,1,2) Model with volume parameter")

acf(arima112$resid,main = "ACF of residuals with volume parameter")

- Similar to previous result, the residuals seem to be a Gaussian process.
- Check normality:
qqnorm(arima112$resid)
qqline(arima112$resid,probs = c(0.25,0.75))

- Although there is still havier tails than normal distribution, we can see that the tails are lighter than our original models in 2.3.
- It implies that we can explore more other factors affecting Nasdaq index.
- What’s more, if we want to make a prediction with this model, we need make more efforts on predicting the change of volume also.