The purpose of time series modeling of returns is to explore the autocorrelation structure of the data and to make a judgment about future returns, under the assumption that this structure persists in the future. A return prediction can then be turned into a trading signal. For example, if the predicted return is positive we buy; if it is negative we sell.
The Nasdaq index is a barometer of changes in market value across industrial sectors. It is more comprehensive than the Standard & Poor’s 500 index or the Dow Jones averages (which cover just 30 prominent industrial and commercial companies, 20 transportation companies and 15 utilities). The Nasdaq index includes more than 5,000 companies, more than any other single stock market. Therefore, I choose to analyze the log returns of the Nasdaq index.
I use daily Nasdaq data from 2015-03-04 to 2020-03-04, obtained from Yahoo Finance. The dataset consists of 1259 observations of 7 variables: Date, Open, High, Low, Close, Adjusted Close and Volume.
First, we summarize six summary statistics of the closing price as follows.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4267    5099    6384    6456    7638    9817
To get more information about the trend of the adjusted close prices, we plot them against time. The mean of the adjusted close prices is shown by the red line.
From the plot, we can see that the index has a general increasing trend over time, with sharp drops around the beginning of 2019 and the beginning of 2020.
To remove this trend, I analyze log returns, that is, the differences of the log prices, which give a more nearly stationary time series. Analyzing returns also has practical meaning, since investors usually care more about returns than about price levels.
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0567238 -0.0060692 -0.0008543 -0.0004441  0.0037424  0.0472298
From the output, the mean of the log returns is almost 0, and the extremes (about -5.7% and +4.7% in a single day) are small in absolute terms.
From the plot of the log returns, we can see that although the mean is almost zero, the volatility is large and the series fluctuates substantially.
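The log returns summarized above can be computed directly from a price series. The following is a minimal Python sketch (not the R code used for this report); the function name `log_returns` and the example prices are hypothetical and serve only as an illustration.

```python
import math

def log_returns(prices):
    """Return the log returns log(P_t) - log(P_{t-1}) of a price series."""
    if any(p <= 0 for p in prices):
        raise ValueError("prices must be positive")
    # Pair each price with its successor and take the difference of logs.
    return [math.log(curr) - math.log(prev)
            for prev, curr in zip(prices, prices[1:])]

# Hypothetical closing prices, for illustration only (not the Nasdaq data).
example_prices = [100.0, 102.0, 101.0, 105.0]
example_returns = log_returns(example_prices)
print(example_returns)
```

A series of n prices gives n - 1 log returns, and the log returns add up to the log of the total price ratio over the period.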
To check whether there is a seasonal effect, I look at the original data and the smoothed data in the frequency domain. We see no dominant cycle within a year, which suggests that there is no seasonal effect in this problem. Therefore, instead of a seasonal ARIMA model, we choose to use an ARMA model.
From the plot of the log returns against time, we can see wide variation in the log returns. The data fluctuate around a mean that is close to zero, so there is no visible trend. I therefore take as the null hypothesis that the series has no trend, and choose a stationary autoregressive moving average (ARMA) model.
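As an illustration of the frequency-domain check, here is a minimal Python sketch of a raw (unsmoothed) periodogram; it is not the R code behind the plots in this report. The function name `periodogram` and the toy series are assumptions made for the example. A seasonal effect would show up as a clear peak at the corresponding frequency.

```python
import cmath
import math

def periodogram(x):
    """Raw periodogram of x at the Fourier frequencies k/n, k = 1..n//2."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    freqs, power = [], []
    for k in range(1, n // 2 + 1):
        # Discrete Fourier transform of the centered series at frequency k/n.
        s = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(centered))
        freqs.append(k / n)
        power.append(abs(s) ** 2 / n)
    return freqs, power

# Toy series with a cycle of period 4 (hypothetical, not the Nasdaq data).
toy = [math.cos(2 * math.pi * t / 4) for t in range(32)]
freqs, power = periodogram(toy)
peak_freq = freqs[power.index(max(power))]
print(peak_freq)
```

For the toy series the peak sits at frequency 0.25, that is, one cycle every 4 observations. This direct sum costs on the order of n squared operations, which is acceptable for a series of about a thousand points.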
I will start by fitting an ARMA(p,q) model of the form:
\(X_n = \mu + \phi_1(X_{n-1} - \mu) + \dots + \phi_p(X_{n-p} - \mu) + \epsilon_n + \psi_1\epsilon_{n-1} + \dots + \psi_q\epsilon_{n-q}\)
where \(\{\epsilon_n\}\) is a white noise process with distribution \(\mathcal{N}(0,\sigma^2)\), \(\phi_1, \dots, \phi_p\) are the coefficients of the autoregressive part of the model, \(\psi_1, \dots, \psi_q\) are the coefficients of the moving average part, \(\mu\) is the population mean, and \(\sigma^2\) is the error variance.
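To make the model equation concrete, the following Python sketch evaluates its right-hand side with the new noise term set to zero, which gives the one-step-ahead conditional mean. The function name `arma_one_step` and the numbers in the example are hypothetical; in practice the past noise terms are not observed and would be replaced by residuals from a fitted model.

```python
def arma_one_step(mu, phi, psi, recent_values, recent_eps):
    """One-step conditional mean of an ARMA(p,q) model.

    recent_values[0] is the latest observation, recent_values[1] the one
    before it, and so on; recent_eps is ordered the same way.
    """
    if len(recent_values) < len(phi) or len(recent_eps) < len(psi):
        raise ValueError("not enough past values or past noise terms")
    # Autoregressive part: phi_j * (past value - mu).
    ar_part = sum(p * (x - mu) for p, x in zip(phi, recent_values))
    # Moving average part: psi_j * past noise term.
    ma_part = sum(q * e for q, e in zip(psi, recent_eps))
    return mu + ar_part + ma_part

# Hypothetical ARMA(1,1) with mu = 0, phi_1 = 0.5, psi_1 = 0.2.
prediction = arma_one_step(0.0, [0.5], [0.2], [0.01], [0.005])
print(prediction)
```

The sign of such a prediction is what a buy or sell rule based on predicted returns would use.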
## Warning in arima(data, order = c(p, 0, q)): possible convergence problem:
## optim gave code = 1
## Warning in arima(data, order = c(p, 0, q)): possible convergence problem:
## optim gave code = 1
## Loading required package: knitr

| | MA0 | MA1 | MA2 | MA3 | MA4 |
|---|---|---|---|---|---|
| AR0 | -7894.94 | -7893.33 | -7893.23 | -7892.64 | -7891.49 | 
| AR1 | -7893.30 | -7894.58 | -7894.96 | -7891.02 | -7889.02 | 
| AR2 | -7893.26 | -7893.35 | -7896.34 | -7894.65 | -7892.81 | 
| AR3 | -7892.82 | -7890.99 | -7894.65 | -7892.54 | -7894.59 | 
| AR4 | -7891.32 | -7893.47 | -7893.93 | -7890.91 | -7896.98 | 
From the table, ARMA(4,4), ARMA(2,2) and ARMA(1,2) have the lowest, second lowest and third lowest AIC values, respectively. The differences are small, though: the AIC of the white noise model ARMA(0,0), -7894.94, is almost the same as that of ARMA(1,2), -7894.96. ARMA(4,4) is a large model, which is hard to interpret and is more likely to have problems with parameter identifiability, invertibility and numerical stability. Therefore, I will first consider ARMA(2,2) and ARMA(1,2) for this dataset.
## 
## Call:
## arima(x = lreturn, order = c(2, 0, 2))
## 
## Coefficients:
##           ar1      ar2     ma1     ma2  intercept
##       -0.5813  -0.8413  0.5563  0.7944     -4e-04
## s.e.   0.1221   0.1041  0.1356  0.1186      3e-04
## 
## sigma^2 estimated as 0.000109:  log likelihood = 3954.17,  aic = -7896.34
## 
## Call:
## arima(x = lreturn, order = c(1, 0, 2))
## 
## Coefficients:
##           ar1     ma1      ma2  intercept
##       -0.8461  0.8318  -0.0485     -4e-04
## s.e.   0.0969  0.1004   0.0304      3e-04
## 
## sigma^2 estimated as 0.0001093:  log likelihood = 3952.48,  aic = -7894.96
The ARMA(2,2) model for \(X_{1:N}\) is defined by \((1-\phi_1 B-\phi_2 B^2)(X_n-\mu)=(1+\psi_1B+\psi_2B^2)\epsilon_n\), where \(B\) is the backshift operator.
Plugging in the estimates above gives the fitted ARMA(2,2) model:
\((1+0.5813B+0.8413B^2)(X_n+0.0004)=(1+0.5563B+0.7944B^2)\epsilon_n\)
where \(\epsilon_n\sim\mathrm{iid}\, N[0,1.09\times 10^{-4}]\).
The ARMA(1,2) model for \(X_{1:N}\) is defined by \((1-\phi_1 B)(X_n-\mu)=(1+\psi_1B+\psi_2B^2)\epsilon_n\)
Plugging in the estimates gives the fitted ARMA(1,2) model:
\((1+0.8461B)(X_n+0.0004)=(1+0.8318B-0.0485B^2)\epsilon_n\)
where \(\epsilon_n\sim\mathrm{iid}\, N[0,1.09\times 10^{-4}]\).
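The AIC values reported for the two fitted models can be checked from their log likelihoods with AIC = -2 × log likelihood + 2 × (number of parameters). The Python sketch below assumes that the parameter count includes the error variance \(\sigma^2\) in addition to the coefficients and the intercept; this assumption reproduces the reported values. The function name `aic` is chosen for the example.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 * log likelihood + 2 * parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params

# ARMA(2,2): ar1, ar2, ma1, ma2, intercept, plus sigma^2 (assumed counted).
aic_arma22 = aic(3954.17, 6)
# ARMA(1,2): ar1, ma1, ma2, intercept, plus sigma^2 (assumed counted).
aic_arma12 = aic(3952.48, 5)
print(round(aic_arma22, 2), round(aic_arma12, 2))
```

The extra AR coefficient in ARMA(2,2) raises the log likelihood by 1.69, which is only slightly more than the penalty of 1 that AIC charges per parameter on the log likelihood scale.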
ARMA(2,2)
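## [1] -0.345464+1.034085i -0.345464-1.034085i
## [1]  0.8252029-0i -1.5255512+0i
The first line gives the roots of the AR polynomial and the second the reported roots of the MA polynomial. The AR roots have modulus about 1.09, so they lie outside the unit circle. One of the reported MA roots, 0.825, lies inside the unit circle, which would make the model non-invertible. However, 0.825 and -1.526 are the roots of \(1-0.5563z-0.7944z^2\), that is, of the MA coefficients entered with the AR sign convention. With the convention of the model equation, the MA polynomial is \(1+0.5563z+0.7944z^2\), whose roots are approximately \(-0.350\pm 1.066i\), with modulus about 1.12, so the fitted ARMA(2,2) model is causal and invertible. These MA roots almost coincide with the AR roots, so the AR and MA polynomials nearly cancel, which is a sign of weak parameter identifiability. Together with the small AIC difference from ARMA(1,2), this is why I do not use ARMA(2,2) as the final model.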
ARMA(1,2)
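## [1] -1.181953+0i
## [1]  1.30098+0i 15.83513+0i
The ARMA(1,2) model is causal and invertible because all roots of the AR and MA polynomials lie outside the unit circle in the complex plane. (The MA roots printed above are those of \(1-0.8318z+0.0485z^2\); with the sign convention of the model equation, \(1+0.8318z-0.0485z^2\) has roots of approximately -1.13 and 18.28, which are also outside the unit circle.) The MA root -1.13 is fairly close to the AR root -1.18, so the AR and MA parts partly cancel, which should be kept in mind when interpreting the coefficients. So my final model is ARMA(1,2):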
\((1+0.8461B)(X_n+0.0004)=(1+0.8318B-0.0485B^2)\epsilon_n\)
where \(\epsilon_n\sim\mathrm{iid}\, N[0,1.09\times 10^{-4}]\).
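The causality and invertibility check for the final model can be reproduced from the rounded coefficients alone. The Python sketch below (not the R code used for the report) solves the AR polynomial \(1+0.8461z\) and the MA polynomial \(1+0.8318z-0.0485z^2\) exactly as they appear in the model equation; the helper names `poly_roots` and `all_outside_unit_circle` are hypothetical. Roots printed elsewhere may differ numerically if a different sign convention or unrounded coefficients were used.

```python
import cmath

def poly_roots(coeffs):
    """Roots of c0 + c1*z (+ c2*z^2), coefficients given in ascending order."""
    if len(coeffs) == 2:
        c0, c1 = coeffs
        return [complex(-c0 / c1)]
    if len(coeffs) == 3:
        c0, c1, c2 = coeffs
        # Quadratic formula; cmath.sqrt handles a negative discriminant.
        disc = cmath.sqrt(c1 * c1 - 4 * c2 * c0)
        return [(-c1 + disc) / (2 * c2), (-c1 - disc) / (2 * c2)]
    raise ValueError("this sketch only supports degree 1 or 2")

def all_outside_unit_circle(roots):
    """True if every root has modulus strictly greater than 1."""
    return all(abs(r) > 1 for r in roots)

# Fitted ARMA(1,2), coefficients as rounded in the model equation.
ar_roots = poly_roots([1, 0.8461])
ma_roots = poly_roots([1, 0.8318, -0.0485])
print(ar_roots, ma_roots)
print(all_outside_unit_circle(ar_roots), all_outside_unit_circle(ma_roots))
```

Both checks come out true for these coefficients, which agrees with the conclusion that the fitted ARMA(1,2) model is causal and invertible.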
From the residual plot, the mean of the residuals is almost zero. However, the variance does not appear to be constant over time, so the residuals look stationary in mean but not in variance.
From the ACF plot, the residuals look close to white noise. The ACF values fall inside the dashed lines except at lag 8 and lag 29, and a few such exceedances are expected by chance at the 5% level. So we find no evidence of significant autocorrelation in the residuals.
From the Q-Q plot, the residuals are not normally distributed and have heavy tails. The plot also shows asymmetry: the right tail is heavier.
The heavy-tailed residuals suggest that the current log return of the Nasdaq index cannot be explained by its own history alone. To get a better model for the log return, we need to add other relevant factors to the analysis. Since volume is an important factor in the stock market and may affect the close price, I add log(volume) to the model as a regressor.
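The ACF check can also be done numerically. The following Python sketch computes the sample autocorrelation function and the approximate 95% band \(\pm 1.96/\sqrt{N}\) that holds under the white noise hypothesis (the band usually drawn as dashed lines). The function names and the toy input are assumptions for the example, and the residuals themselves are not reproduced here.

```python
import math

def sample_acf(x, max_lag):
    """Sample autocorrelations of x at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + h] for t in range(n - h)) / c0
            for h in range(max_lag + 1)]

def white_noise_band(n, z=1.96):
    """Half-width of the approximate 95% band for white noise."""
    return z / math.sqrt(n)

def lags_outside_band(acf_values, n):
    """Lags h >= 1 whose autocorrelation falls outside the band."""
    band = white_noise_band(n)
    return [h for h, r in enumerate(acf_values) if h >= 1 and abs(r) > band]

# Toy input, for illustration only.
toy_acf = sample_acf([1.0, 2.0, 3.0, 4.0, 5.0], 1)
print(toy_acf, lags_outside_band(toy_acf, 5))
```

With a 95% band, about one lag in twenty is expected to fall outside the band even for true white noise, so a couple of exceedances among a few dozen lags is not by itself evidence of autocorrelation.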
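As a numerical complement to the Q-Q plot, sample skewness and excess kurtosis give a rough measure of asymmetry and tail weight: both are close to zero for normally distributed data, and positive excess kurtosis indicates heavier tails than the normal distribution. The Python sketch below uses simple moment estimators; the function name and the toy data are hypothetical, and no values for the actual residuals are claimed.

```python
def skewness_and_excess_kurtosis(x):
    """Moment-based sample skewness and excess kurtosis of x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    m2 = sum(d ** 2 for d in dev) / n
    if m2 == 0:
        raise ValueError("the data have zero variance")
    m3 = sum(d ** 3 for d in dev) / n
    m4 = sum(d ** 4 for d in dev) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Toy data: mostly zeros with two large values of opposite sign.
skew, excess_kurt = skewness_and_excess_kurtosis([0.0] * 8 + [-5.0, 5.0])
print(skew, excess_kurt)
```

The toy data are symmetric, so the skewness is zero, while the two outlying values produce a positive excess kurtosis.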
## 
## Call:
## arima(x = lreturn, order = c(1, 0, 2), xreg = volume)
## 
## Coefficients:
##           ar1     ma1      ma2  intercept  volume
##       -0.8307  0.8037  -0.0625    -0.1046  0.0049
## s.e.   0.1028  0.1061   0.0304     0.0283  0.0013
## 
## sigma^2 estimated as 0.0001081:  log likelihood = 3959.13,  aic = -7906.26
The estimated coefficient of log(volume), 0.0049, is large relative to its standard error, 0.0013 (a ratio of about 3.8), which indicates a significant effect of log(volume), as expected.
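The significance of the volume coefficient can be quantified with an approximate z-test, dividing the estimate by its standard error and comparing with the standard normal distribution. This Python sketch uses the rounded values from the output above; the function name `wald_z_test` is hypothetical, and the normal approximation is an assumption.

```python
import math

def wald_z_test(estimate, std_error):
    """z statistic and two-sided p-value under a normal approximation."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Coefficient and standard error of the volume regressor, as printed.
z_volume, p_volume = wald_z_test(0.0049, 0.0013)
print(z_volume, p_volume)
```

The z statistic is about 3.8, well above the usual 5% critical value of 1.96.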
Then we should also check the roots of this model.
## [1] -1.203777+0i
## [1] -1.142656+0i 13.996604-0i
All the roots are outside the unit circle, as we expected.
Again, we run diagnostic checks on the residuals.
The heavy-tail problem of the residuals is not relieved, but adding log(volume) does improve the fit: the AIC drops from -7894.96 to -7906.26.
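One way to quantify the improvement from adding log(volume) is a likelihood ratio test, since the ARMA(1,2) model without the regressor is nested in the model with it. The Python sketch below uses the two reported log likelihoods (3952.48 and 3959.13) and the usual chi-squared approximation with 1 degree of freedom; the function name is hypothetical, and the approximation is an assumption rather than an exact result.

```python
import math

def likelihood_ratio_test_1df(loglik_small, loglik_big):
    """Likelihood ratio statistic and chi-squared(1) p-value."""
    stat = 2.0 * (loglik_big - loglik_small)
    if stat < 0:
        raise ValueError("the larger model should not have a lower likelihood")
    # For chi-squared with 1 df, P(chi2 > s) = P(|Z| > sqrt(s)) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# ARMA(1,2) without and with the volume regressor.
lr_stat, lr_p = likelihood_ratio_test_1df(3952.48, 3959.13)
print(lr_stat, lr_p)
```

The statistic is about 13.3, well above the 5% critical value of 3.84 for one degree of freedom, which is consistent with the drop in AIC.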
I find that an ARMA(1,2) model fits the log returns of the Nasdaq index. Adding log(volume) as a regressor improves the model.
From the diagnostic analysis, the residuals are approximately uncorrelated, but the Q-Q plot shows that they deviate from the assumption of independent, normally distributed errors.
8. References
[1] Previous midterm project, “Time Series Analysis for Log Returns of S&P500”.
Here are some improvements of my work compared with the previous project. First, I choose a different stock index, the Nasdaq index, which is more comprehensive than the Standard & Poor’s 500 index. Second, I improve the model by adding log(volume).
[2] Lecture notes.