1. Raising a question

Nasdaq composite is a stock market index of the common stocks and similar securities listed on the NASDAQ stock market. It is one of the three most-followed indices in US stock markets. The composition of the NASDAQ Composite is heavily weighted towards information technology companies.[from wikipedia https://en.wikipedia.org/wiki/NASDAQ]
People who engaged in IT and related fields can get some important information from the change of Nasdaq index. Like an IT company can assess their value and position in the whole field according to the price change of Nasdaq index.
It reflects the trend of US economy in some degree.
According to the reasons above, I get interested in exploring the price changing of Nasdaq index, and we using time series model to predict this change of a short future.

2. Data analysis

2.1 Explore the data

We first read in the data downloaded from Yahoo Finance.

dat <- read.csv(file="NDAQ.csv",header=TRUE)
head(dat)

##         Date  Open  High   Low Close Adj.Close Volume
## 1 2015-03-02 50.37 50.93 50.29 50.92  48.18684 761800
## 2 2015-03-03 50.90 51.13 50.21 50.60  47.88402 646700
## 3 2015-03-04 50.51 51.10 50.24 51.02  48.28148 987500
## 4 2015-03-05 51.04 51.04 50.40 50.78  48.05436 635800
## 5 2015-03-06 50.66 51.10 50.12 50.19  47.49603 742000
## 6 2015-03-09 50.26 50.85 50.01 50.70  47.97866 619100

date <- as.Date(dat$Date)
price <- as.numeric(dat$Adj.Close)

Summary our data with six stastics index:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   45.66   55.31   64.82   63.25   69.72   81.85

Seeing that the scale of price is quite large. And the median price and mean price are very close to each other. Now we plot the change of log-adjusted closing price with time, here we use log-price is because the difference of log-price is the log-return of Nasdaq, it has economical meaning.

The blue line is the total mean of our data, and seeing that there is a very obvious trend that the adjusted closing price is increasing with time goes by. Maybe we can start by a random walk with drift and to see if the model fits well.

2.2 Fitting with random walk model

The random walk with drift model is given by the difference equation:

\(Y_n = Y_{n-1} + \mu + \epsilon_n\)

driven by a white noise process \({\epsilon_n}\). This has solution: \[Y_n = Y_0 + n\mu + \sum_{k=1}^n\epsilon_k\]
To train our intuition, we can compare the data with simulations from a fitted model. A simple starting point is a Gaussian random walk with drift, having parameters estimated by:

N <- nrow(dat)
mu = mean(diff(log(price)))
sigma = sd(diff(log(price)))
set.seed(20180301)
X1 <- log(price[1])+cumsum(c(0,rnorm(N-1,mean=mu,sd=sigma)))
set.seed(03012018)
X2 <- log(price[1])+cumsum(c(0,rnorm(N-1,mean=mu,sd=sigma)))
par(mfrow=c(1,2))
plot(X1,type="l")
plot(X2,type="l")

It seems like that the left one is very similar to our data plot but the right one is far more different.
Let \(y^*_{1:N}\) be the time series of Nasdaq daily closing values downloaded from yahoo.com. Let \(z^*_n= \Delta \log y^*_n = \log y^*_{n}-\log y^*_{n-1}\).
Let’s plot the sample autocorrelation function of the time series of Nasdaq log-returns, \(z^*_{2:N}\).

z = diff(log(price))
acf(z)

acf(abs(z-mean(z)),lag.max=200)

For ACF of log-return it self, it looks pretty close to the ACF of white noise. There is some evidence for a small nonzero autocorrelation. However, as our data length is 756, the number of lags out of the dashed line is far more less than 5% of the whole data, which makes these nozero lags insignificant.
But seeing the ACF plot for absolute value, it’s obvious that \(z^*_{2:N}\) can’t be a white noise process as there are many lags out of the dashed line and it must be larger than 5% of N, which means it’s significant.
In conclusion, random walk with drift model fits not very well to our data.

2.3 Fitting with ARIMA model

We seperate data into training data and test data, choose last 10% as test data.

dat_train = log(price[1:floor(0.9*length(price))])
dat_test = log(price[floor(0.9*length(price)):length(price)])

2.3.1 Choosing p, q and d

First we should decide the order of difference, i.e. value of d in ARIMA(p,d,q).
Usually people will pick d no more than 1, so let’s start with d = 1.

t = seq(from=1,length=floor(0.9*N)-1,by=1)
detrend_dat = diff(dat_train)
plot(t,detrend_dat,type="l",xlab="t",ylab="difference of log price",main = "Detrending Nasdaq index with d = 1")
abline(h=mean(detrend_dat),col = "blue")

It seems like there is no trend for one order of difference (mean stationary). We can fit our model under the null hyphosis with d = 1 firstly and then we can make some test to check this.
We seek to fit a stationary Gaussian ARIMA(p,1,q) model with parameter vector \(\theta = (\phi_{1:p},\psi_{1:q},\mu,\sigma^2)\) given by: \[\phi(B)((1-B)Y_n-\mu) = \psi(B)\varepsilon_n\] where \({\epsilon_n}\) is a white noise process (i.e. \(\sim iid N(0,\sigma^2)\));\(\mu = E(X_n)\) is the trend; \(\phi(x) = 1-\sum_{i=1}^p \phi_i x^i\) and \(\psi(x) = 1+\sum_{i=1}^q \psi_i x^i\) are the ARMA polynomials; B is lag operator; \(Y_n\) is the estimator of log(Nasdaq adjusted closing price). NA \[AIC = -2 \times l(\theta^*) + 2D\] where \(l(\theta^*)\) is the log-likelihood function deciding by parameter vector \(\theta^*\) and \(D\) is the number of parameters. The model with the lowest AIC score implies the least prediction error.

Let’s make a table for choosing different p and q and there AIC score:

	MA0	MA1	MA2	MA3	MA4
AR0	-4124.41	-4126.12	-4126.98	-4125.20	-4125.42
AR1	-4125.64	-4126.59	-4131.85	-4130.44	-4128.75
AR2	-4126.77	-4131.36	-4130.68	-4129.32	-4128.18
AR3	-4124.79	-4123.45	-4129.31	-4125.48	-4129.80
AR4	-4125.24	-4127.10	-4127.32	-4125.33	-4127.27

From the table above, it’s easily to find that the two lowest AIC score appears at ARIMA(1,1,2) and ARIMA(2,1,1).
The models on their right and below them have both larger AIC scores and more complex models (i.e. larger p and q, which can causing invertibility and stability problems). That’s not what we want, so we can exclude them.
The models on their left and upper have larger AIC scores with apparently differences. Thus, we can also exclude thinking about these models.

2.3.2 Regression on ARIMA models

We fit and diagnose two models simultaneously.

price_arma12 = arima(dat_train,order=c(1,1,2)); price_arma12

## 
## Call:
## arima(x = dat_train, order = c(1, 1, 2))
## 
## Coefficients:
##           ar1     ma1      ma2
##       -0.9068  0.8424  -0.1131
## s.e.   0.0533  0.0655   0.0420
## 
## sigma^2 estimated as 0.0001317:  log likelihood = 2069.93,  aic = -4131.85

price_arma21 = arima(dat_train,order=c(2,1,1)); price_arma21

## 
## Call:
## arima(x = dat_train, order = c(2, 1, 1))
## 
## Coefficients:
##           ar1      ar2     ma1
##       -1.0163  -0.1019  0.9587
## s.e.   0.0520   0.0397  0.0369
## 
## sigma^2 estimated as 0.0001318:  log likelihood = 2069.68,  aic = -4131.36

We can get our two models from above:

For ARIMA(1,1,2): \[(1+0.9068B)(1-B)Y_n = (1+0.8424B-0.1131B^2)\varepsilon_n\] For ARIMA(2,1,1): \[(1+1.0163B+0.1019B^2)(1-B)Y_n = (1+0.9587B)\varepsilon_n\]
From the results above, we can see that from Fisher information, the confidence interval (i.e. standard error) for all the coefficients are quite small so that zero is not covered in it. In a word, they are all significant under Fisher information. Also we can see that both of them don’t have intercept terms, this indicates that the differences between adjacent log prices have a mean close to zero. \[E(Y_n-Y_{n-1}) \approx 0\] where \(Y_n = log(price\ at\ time\ n)\)

2.3.3 Test our models

Let’s look at the roots for our models to check the casuality and invertibility

# AR root for ARIMA(1,1,2) is outside of unit circle
MA_roots <- polyroot(c(1, coef(price_arma12)[c("ma1", "ma2")]))
MA_roots # MA root for ARIMA(1,1,2)

## [1] -1.041512+0i  8.492627-0i

# MA root for ARIMA(2,1,1) is outside of unit circle
AR_roots <- polyroot(c(1, -coef(price_arma21)[c("ar1", "ar2")]))
AR_roots # AR root for ARIMA(2,1,1)

## [1] -1.106809-0i -8.864667+0i

Seeing that all of the roots are just outside the unit circle, we can conclude that both two models can fit our data very well without stability problems.

2.4 Diagnostics

2.4.1 Seasonality

Although we didn’t see any evidence of seasonality, people may suspect that the change of Nasdaq index may be affected by some seasonal influence and thus have seasonality effect. We check this under frequency domain.

spectrum_price <-ts(dat_train,frequency=250)
spectrum(spectrum_price,spans=c(3,5,3))

Seeing that there aren’t any significant peaks in periodogram above, this implies that there is no seasonality effect for fitting our data.

2.4.2 Normality of residuals

First let’s plot the residuals and ACF plots of two models.

par(mfrow = c(1,2))
plot(price_arma12$resid,ylab = "residuals",main = "Residuals for ARIMA(1,1,2) Model")
acf(price_arma12$resid,main = "ACF of residuals for ARIMA(1,1,2)")

par(mfrow = c(1,2))
plot(price_arma21$resid,ylab = "residuals",main = "Residuals for ARIMA(2,1,1) Model")
acf(price_arma21$resid,main = "ACF of residuals for ARIMA(2,1,1)")

For the first sight, there seems no big problem for residuals, the mean is stationary and around zero, ACF plots show that more than 95% lags are in the dashed line.
Using QQ-plot to check normality of residuals:

par(mfrow = c(1,2))
qqnorm(price_arma12$resid)
qqline(price_arma12$resid,probs = c(0.25,0.75))
qqnorm(price_arma21$resid)
qqline(price_arma21$resid,probs = c(0.25,0.75))

Both of the residuals have heavier tails than normal distribution, refusing normality assumption.

library(forecast)
pred = predict(price_arma12, n.ahead = (N-floor(0.9*N)+1))$pred
forecast1 = forecast(price_arma12, h = 25)
plot(forecast1)
x = c((N-floor(0.1*N)-1):N)
lines(x, dat_test, type="l", col="red")

Using ARIMA(1,1,2) for example, the prediction of our test data is as above. We can see that the true data can only be covered by the border of 95% confidence interval. We need do some other method to improve our model.[learn the code from http://blog.fens.me/r-ar/]

2.4.3 Modify our model

It seems like that Nasdaq index is also affected by some other factors.
Let’s take volume as one of the considering factors in our model. We know that volume can affect the price apparently.

volume = log(as.numeric(dat$Volume))
vol_train = log(volume[1:floor(0.9*length(price))])
vol_test = log(volume[floor(0.9*length(price)):length(price)])

repeat the steps in 2.3 and see how result change:

	MA0	MA1	MA2	MA3	MA4
AR0	-4133.60	-4135.90	-4135.95	-4134.01	-4133.64
AR1	-4135.42	-4135.86	-4140.32	-4139.21	-4137.30
AR2	-4135.77	-4133.78	-4139.43	-4137.28	-4136.03
AR3	-4133.77	-4132.48	-4137.82	-4133.45	-4140.05
AR4	-4133.67	-4134.97	-4135.93	-4133.94	-4137.12

Different from the result before, ARIMA(1,1,2) has the significant lowest score than other models. Thus, we choose ARIMA(1,1,2) as our model.

arima112 <- arima(dat_train,order = c(1,1,2),xreg = vol_train)
arima112

## 
## Call:
## arima(x = dat_train, order = c(1, 1, 2), xreg = vol_train)
## 
## Coefficients:
##           ar1     ma1      ma2  vol_train
##       -0.9620  0.8911  -0.1029    -0.0535
## s.e.   0.0153  0.0423   0.0409     0.0160
## 
## sigma^2 estimated as 0.0001295:  log likelihood = 2075.16,  aic = -4140.32

Representing our model in math way: \[\phi(B)((1-B)Y_n-\mu-\beta v_n) = \psi(B)\varepsilon_n\] where the other notation is the same as before, and \(v_n\) represents the value of volume at time n.
From the result above, our new model is: \[(1+0.9620B)(1-B)(Y_n+0.0535) = (1+0.8911B-0.1028B^2)\varepsilon_n\]
Check the root:

# AR root is outside the unit circle
MA_roots <- polyroot(c(1,coef(arima112)[c("ma1","ma2")]))
MA_roots # MA roots are outside the unit circle

## [1] -1.005512+0i  9.669575-0i

Again, all the roots are just outside the unit circle.
Look at residuals:

plot(arima112$resid,ylab = "residuals",main = "Residuals for ARIMA(1,1,2) Model with volume parameter")

acf(arima112$resid,main = "ACF of residuals with volume parameter")

Similar to previous result, the residuals seem to be a Gaussian process.
Check normality:

qqnorm(arima112$resid)
qqline(arima112$resid,probs = c(0.25,0.75))

Although there is still havier tails than normal distribution, we can see that the tails are lighter than our original models in 2.3.
It implies that we can explore more other factors affecting Nasdaq index.
What’s more, if we want to make a prediction with this model, we need make more efforts on predicting the change of volume also.

3. Conclusion

First we fit with a random walk model and we found significant evidence that the residuals can’t be normal.
Then we fit our model with ARIMA(1,1,2) and ARIMA(2,1,1) models, finding that these two models have the same abilities on predicting the price. Both of them are very simple models. NA
The guess for fail on prediction is that in ARIMA model, we assume the trend is a constant, but the truth is not. We can see evidence from the orignal data plot. It shows that at the second half year of 2015, there is a very fast increase than other time.
We try to add other factors into our model. We assume that the trend has some relation to Volume, and it concludes that the model gets better than before. Also we drop one of the previous models. Only ARIMA(1,1,2) left.
The modification causes another problem, which is to predict the change of volume.
It’s reasonal to guess that there are still some other factors influcing the change of Nasdaq prices. More complex model is needed. ARIMA is only a simple model to catch basic characteristics of the change of price.

3.1 Advantages

Make reasonal guesses on the model and check every assumption comprehensivly.
Find the potential problems and make modification.

3.2 Disadvantage

It will be more careful if we check taking one order difference (i.e. d=1) is enough.
Although the result shows that the roots for our model are outside the unit circle, some of them are very close to the boundary. We can go further on this problem.

4 Reference

Thanks for professor’s lecture notes and previous midterm projects.[https://ionides.github.io/531w18/#midterm-project]
Other references has been noted above the report.

Midterm Project

2/27/2018

1. Raising a question

2. Data analysis

2.1 Explore the data

2.2 Fitting with random walk model

2.3 Fitting with ARIMA model

2.3.1 Choosing p, q and d

2.3.2 Regression on ARIMA models

2.3.3 Test our models

2.4 Diagnostics

2.4.1 Seasonality

2.4.2 Normality of residuals

2.4.3 Modify our model

3. Conclusion

3.1 Advantages

3.2 Disadvantage

4 Reference