STATS 531 Mid-term Project

Introduction.

In this report, we apply time series techniques to the Russell 2000 Index data. The Russell 2000 Index is a small-cap stock market index of the smallest 2,000 stocks in the Russell 3000 Index. It was started by the Frank Russell Company in 1984. The index is maintained by FTSE Russell, a subsidiary of the London Stock Exchange Group. The Russell 2000 is by far the most common benchmark for mutual funds that identify themselves as “small-cap”, while the S&P 500 index is used primarily for large-capitalization stocks. It is the most widely quoted measure of the overall performance of small-cap to mid-cap company shares. (From Wikipedia: https://en.wikipedia.org/wiki/Russell_2000_Index)

Data Exploration.

First, we import the Russell 2000 Index data from 01/01/2018 to 12/31/2019, which gives us 502 observations of 7 variables. In what follows, we focus only on the Adj.Close variable.

From the plot above, we can see that the historical Adj.Close price of the Russell 2000 declined sharply at the end of 2018 and then started to recover. It fluctuated around 1550 for most of 2019. It is common practice in finance to model the log returns of an index or stock price rather than the price itself. Hence, we look at the daily log returns of the Russell 2000 Index.
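As a minimal sketch of this transformation, the daily log return is the difference of consecutive log prices. The price vector below is made up for illustration and is not the Russell 2000 data; only the name log_returns follows the report.

```r
# Hypothetical adjusted closing prices (illustrative values, NOT the real data)
adj_close <- c(1550.0, 1561.3, 1548.9, 1570.2, 1566.8)

# Daily log return: r_n = log(P_n) - log(P_{n-1})
log_returns <- diff(log(adj_close))

# There is one fewer return than there are prices
length(log_returns)

# Log returns are additive: their sum is the log return over the whole period
all.equal(sum(log_returns), log(adj_close[5] / adj_close[1]))

# Same kind of five-number summary as reported for the real series
summary(log_returns)
```

With 502 daily prices, the same call would produce 501 log returns.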

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0450350 -0.0049750  0.0009425  0.0001418  0.0066236  0.0484473

In the log_returns plot, we can see that the Russell 2000 recorded its highest return around the very beginning of 2019. In addition, log_returns fluctuates around 0 over the observed period without an obvious trend.

Next, we check the sample ACF of log_returns.

From the result above, we can see that almost all ACF values lie within the dashed lines; however, the ACF values at lags 14 and 16 are a little large. Thus, we may suspect that the log_returns series behaves largely like white noise, but we should also consider the possibility of a high-order MA model (order 16 in this case).
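The reading of the ACF plot can also be done numerically. The sketch below uses simulated white noise as a stand-in for log_returns (it is not the real data) and assumes the dashed lines are the usual 95% bounds of plus or minus 1.96 divided by the square root of the sample size, which is what R's acf() plot draws by default.

```r
set.seed(531)
# Simulated stand-in for log_returns (NOT the Russell 2000 data)
x <- rnorm(501, mean = 0, sd = 0.01)

# Sample autocorrelations at lags 0..20, without drawing the plot
a <- acf(x, lag.max = 20, plot = FALSE)
rho <- a$acf[-1]   # drop lag 0, which is always 1

# Approximate 95% bound under the white-noise hypothesis
bound <- qnorm(0.975) / sqrt(length(x))

# Lags whose sample autocorrelation falls outside the bound
outside <- which(abs(rho) > bound)
outside
```

Even for true white noise, about 1 lag in 20 is expected to fall outside the bounds by chance, so one or two isolated exceedances are weak evidence on their own. Calling plot(a) would draw the usual correlogram.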

Model Selection.

First, we try to model the log_returns series with an ARMA(p,q) model and present the AIC table as follows.
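The report does not show how the table was computed. One way such a table might be produced is sketched here; the function name aic_table and the simulated series sim_returns are assumptions made for illustration, and only arima() from base R is used.

```r
# Sketch: AIC for every ARMA(p,q) with 0 <= p <= P and 0 <= q <= Q
aic_table <- function(data, P, Q) {
  table <- matrix(NA_real_, P + 1, Q + 1,
                  dimnames = list(paste0("AR", 0:P), paste0("MA", 0:Q)))
  for (p in 0:P) {
    for (q in 0:Q) {
      # A fit can fail numerically; leave NA in that cell instead of stopping
      table[p + 1, q + 1] <- tryCatch(
        arima(data, order = c(p, 0, q))$aic,
        error = function(e) NA_real_
      )
    }
  }
  table
}

# Simulated stand-in for log_returns (NOT the Russell 2000 data)
set.seed(531)
sim_returns <- rnorm(501, mean = 0, sd = 0.01)
round(aic_table(sim_returns, P = 2, Q = 2), 2)
```

Applied to the real log_returns with P = Q = 4, a call of this form would give a 5-by-5 table laid out like the one that follows.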

	MA0	MA1	MA2	MA3	MA4
AR0	-3114.62	-3112.71	-3110.83	-3110.43	-3108.48
AR1	-3112.70	-3110.70	-3110.17	-3108.48	-3106.48
AR2	-3110.85	-3110.18	-3114.86	-3106.47	-3104.48
AR3	-3110.48	-3108.58	-3113.44	-3111.49	-3109.65
AR4	-3108.61	-3106.61	-3111.55	-3109.84	-3107.79

We can see that ARMA(2,2) has the lowest AIC (-3114.86). Note that ARMA(0,0), with an AIC of -3114.62, is also a strong candidate in terms of both low AIC and simplicity, so we take it into account as well. In addition, as discussed above, we also want to consider a high-order MA model (order 16 in this case). To sum up, we try the following 3 models and select the best one. Denoting log_returns by \(\{X_n\}\), we have:

Model (1) ARMA(0,0) \[X_n = \mu + \epsilon_n\]
Model (2) ARMA(2,2) \[X_n = \mu + \phi_1(X_{n-1}-\mu) + \phi_2(X_{n-2}-\mu) + \epsilon_n + \theta_1\epsilon_{n-1}+\theta_2\epsilon_{n-2}\]
Model (3) MA(16) \[X_n = \mu + \epsilon_n + \sum_{q=1}^{16}\theta_q\epsilon_{n-q}\]

where \(\{\epsilon_n\}\) is assumed to be Gaussian white noise in all 3 models. First, we fit the ARMA(0,0) model and check the result.

arma00<-arima(x = log_returns, order = c(0, 0, 0))
arma00

## 
## Call:
## arima(x = log_returns, order = c(0, 0, 0))
## 
## Coefficients:
##       intercept
##           1e-04
## s.e.      5e-04
## 
## sigma^2 estimated as 0.0001159:  log likelihood = 1559.31,  aic = -3114.62

Both the estimate of the intercept (1e-04) and its standard error (5e-04) are very close to 0, and the estimate is well within one standard error of zero. Thus, if we select this model, we may treat log_returns as essentially a zero-mean white noise series. Next, we try the ARMA(2,2) model.

arma22<-arima(x = log_returns, order = c(2, 0, 2))
arma22

## 
## Call:
## arima(x = log_returns, order = c(2, 0, 2))
## 
## Coefficients:
##           ar1      ar2     ma1     ma2  intercept
##       -0.7091  -0.9953  0.6942  0.9992      1e-04
## s.e.   0.0068   0.0045  0.0112  0.0289      5e-04
## 
## sigma^2 estimated as 0.0001131:  log likelihood = 1563.43,  aic = -3114.86

abs(polyroot(c(1,-arma22$coef[1:2])))

## [1] 1.002356 1.002356

abs(polyroot(c(1,arma22$coef[3:4])))

## [1] 1.000382 1.000382
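The two root checks can be reproduced without the data by plugging in the rounded coefficients from the printed fit. This is only a sketch: because the coefficients are rounded to four decimals, the moduli should agree with the values printed above only approximately.

```r
# Coefficients copied from the printed ARMA(2,2) fit (rounded to 4 decimals)
ar <- c(-0.7091, -0.9953)
ma <- c(0.6942, 0.9992)

# AR polynomial: 1 - phi_1 z - phi_2 z^2
# MA polynomial: 1 + theta_1 z + theta_2 z^2
ar_mod <- Mod(polyroot(c(1, -ar)))
ma_mod <- Mod(polyroot(c(1, ma)))

round(ar_mod, 4)   # moduli of the AR roots
round(ma_mod, 4)   # moduli of the MA roots

# Causal / invertible only if every root lies strictly outside the unit circle
all(ar_mod > 1)
all(ma_mod > 1)
```

Both checks pass, but only barely: a modulus this close to 1 leaves almost no margin once estimation error is taken into account.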

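According to the results above, all the AR and MA coefficients of the ARMA(2,2) model appear significant, while the intercept is almost zero. However, the roots of both the AR and MA polynomials have modulus very close to 1, so the fitted model sits on the boundary of causality and invertibility. Moreover, the fitted AR and MA polynomials are nearly identical, so their roots almost cancel and the model is hard to distinguish from ARMA(0,0). Hence, we have reason to doubt the causality and invertibility of this model. Finally, we check the MA(16) model.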

ma16<- arima(x = log_returns, order = c(0, 0, 16))
ma16

## 
## Call:
## arima(x = log_returns, order = c(0, 0, 16))
## 
## Coefficients:
##           ma1     ma2     ma3     ma4      ma5     ma6     ma7      ma8
##       -0.0230  0.0110  0.0758  0.0070  -0.0318  0.0158  0.0698  -0.0476
## s.e.   0.0446  0.0444  0.0444  0.0444   0.0448  0.0447  0.0444   0.0449
##           ma9     ma10     ma11    ma12    ma13     ma14     ma15    ma16
##       -0.0002  -0.0635  -0.0460  0.0230  0.0081  -0.1625  -0.0352  0.1141
## s.e.   0.0450   0.0466   0.0468  0.0484  0.0451   0.0429   0.0444  0.0458
##       intercept
##           1e-04
## s.e.      4e-04
## 
## sigma^2 estimated as 0.0001101:  log likelihood = 1571.76,  aic = -3107.53
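The AIC values in the three printed fits can be cross-checked by hand, since AIC equals minus twice the log likelihood plus twice the number of estimated parameters. The sketch below assumes the parameter count is the number of reported coefficients (including the intercept) plus one for the noise variance; the helper name aic_from_loglik is hypothetical, and the log likelihoods are copied from the output above.

```r
# AIC = -2 * loglik + 2 * k, with k = number of coefficients + 1 (for sigma^2)
aic_from_loglik <- function(loglik, n_coef) {
  k <- n_coef + 1
  -2 * loglik + 2 * k
}

aic_check <- c(
  arma00 = aic_from_loglik(1559.31, n_coef = 1),   # intercept only
  arma22 = aic_from_loglik(1563.43, n_coef = 5),   # 2 AR + 2 MA + intercept
  ma16   = aic_from_loglik(1571.76, n_coef = 17)   # 16 MA + intercept
)
round(aic_check, 2)
```

The results match the reported AIC values up to rounding of the printed log likelihoods. MA(16) raises the log likelihood by 12.45 over ARMA(0,0) but uses 16 more parameters, which is why its AIC is higher.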

abs(polyroot(c(1,ma16$coef[1:16])))

##  [1] 1.130070 1.110368 1.110368 1.125138 1.069243 1.155458 1.091528
##  [8] 1.229697 1.091528 1.198579 1.069243 1.125138 1.198579 1.155458
## [15] 1.130070 1.368787

According to the result, all roots of the MA polynomial lie outside the unit circle in the complex plane, which is desirable because it means the fitted model is invertible. From the discussion above, we proceed with ARMA(0,0) and MA(16). Let’s take a look at the ACF of their residuals.

We can see that for the ARMA(0,0) model, the residual ACF is still relatively large at lags 14 and 16. By contrast, the residuals of MA(16) are clearly consistent with the assumption of uncorrelated noise. Hence, although its AIC (-3107.53) is higher than that of ARMA(0,0) (-3114.62), we select MA(16) as our final model and move on to the diagnostics.
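The comparison of residual ACFs can be sketched on simulated data. To keep the example small and fast, the series below is a simulated MA(2), not the real log_returns, and the two competing fits are white noise and MA(2) rather than ARMA(0,0) and MA(16). The Ljung-Box test is an addition that the report itself does not use.

```r
set.seed(531)
# Simulated MA(2) series standing in for log_returns (NOT the real data)
x <- arima.sim(model = list(ma = c(0, 0.5)), n = 500, sd = 0.01)

fit_wn  <- arima(x, order = c(0, 0, 0))   # white noise with a mean
fit_ma2 <- arima(x, order = c(0, 0, 2))   # MA(2)

# Residual autocorrelations at lags 1..10 (lag 0 dropped)
acf_wn  <- acf(resid(fit_wn),  lag.max = 10, plot = FALSE)$acf[-1]
acf_ma2 <- acf(resid(fit_ma2), lag.max = 10, plot = FALSE)$acf[-1]

# Approximate 95% white-noise bound, as drawn by the dashed lines
bound <- qnorm(0.975) / sqrt(length(x))
round(rbind(white_noise = acf_wn, ma2 = acf_ma2), 3)

# Portmanteau check; fitdf discounts the two fitted MA coefficients
Box.test(resid(fit_ma2), lag = 10, type = "Ljung-Box", fitdf = 2)
```

The white-noise fit leaves the lag-2 correlation in its residuals, while the MA(2) fit absorbs it, which mirrors the pattern described for lags 14 and 16.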

Diagnosis.

In the previous part, we already checked that the driving noise of MA(16) appears uncorrelated. In the following, we carry out further diagnostics for this model.

Normality

First we check the normality of the residuals.

It seems that the residuals have heavier tails than the normal distribution. Thus, we may want to consider another distribution, such as the t-distribution. We choose 5 degrees of freedom and draw the QQ-plot of the residuals against the t-distribution as follows.

Here we can see that the t-distribution fits the residuals much better than the normal distribution.
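A QQ-plot against a t-distribution is not built into qqnorm(), so it has to be assembled from theoretical quantiles. The sketch below shows one way to do this with base R; the residuals are simulated from a scaled t-distribution with 5 degrees of freedom and are not the residuals of the fitted MA(16).

```r
set.seed(531)
# Simulated heavy-tailed residuals standing in for resid(ma16) (NOT the real residuals)
res <- 0.01 * rt(501, df = 5)
n <- length(res)

# Theoretical quantiles at the same plotting positions
p <- ppoints(n)
q_norm <- qnorm(p)
q_t5 <- qt(p, df = 5)

# The t quantiles reach further into both tails than the normal ones
range(q_norm)
range(q_t5)

# QQ-plot against t with 5 degrees of freedom, with a reference line
qqplot(q_t5, res, xlab = "t(5) quantiles", ylab = "Residuals")
qqline(res, distribution = function(p) qt(p, df = 5))
```

If the points follow the reference line in the tails, the t-distribution with the chosen degrees of freedom is a plausible description of the residuals. The choice of 5 degrees of freedom follows the report and is not estimated here.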

Seasonality

Now we check for seasonality and plot the periodogram as follows.

Assuming 250 trading days per year, the plot above has frequency in units of cycles per year. No frequency appears to have significantly more power than the others, so we accept the assumption that there is no seasonality in this dataset.
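The unit of cycles per year comes from declaring 250 observations per year when building the time series object. A minimal sketch is given below, using simulated white noise in place of log_returns; the smoothing spans are an assumption, since the report does not say whether or how the periodogram was smoothed.

```r
set.seed(531)
# Simulated stand-in for log_returns (NOT the Russell 2000 data)
x <- rnorm(501, mean = 0, sd = 0.01)

# frequency = 250 declares 250 observations (trading days) per year,
# so spectral frequencies are reported in cycles per year
x_ts <- ts(x, frequency = 250)

# Smoothed periodogram, computed without drawing it
sp <- spectrum(x_ts, spans = c(3, 5, 3), plot = FALSE)

# Frequencies run up to 125 cycles per year, i.e. half the sampling rate
range(sp$freq)

# Frequency with the most power and the corresponding period in years
peak <- sp$freq[which.max(sp$spec)]
peak
1 / peak
```

Calling plot(sp) would draw the spectrum. A seasonal pattern would show up as a peak standing well above its neighbours, which is what the report looks for and does not find.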

Conclusion and Discussion.

In this report, we apply some time series techniques to a financial dataset, the Russell 2000 Index data. We first explore the sample data to get some intuition about its features. Then we try 3 time series models, ARMA(0,0), ARMA(2,2) and MA(16), and select the best one for diagnostics. The results suggest that the MA(16) model with t-distributed driving noise can fit the data reasonably well.

The strength of this report is that it covers almost every classical step in analyzing a financial time series and gives an example of how to modify an assumption in order to get a better fit. However, all the models we compare are within the ARMA framework. There are plenty of other models, such as ARCH and GARCH, that may fit this dataset better.

Reference.

The dataset is used as teaching material and was provided by Mr. Brian Thelen, University of Michigan.
Reference books include:
Ruppert, David. 2011. Statistics and data analysis for financial engineering. New York: Springer.
Shumway, Robert H., and David S. Stoffer. 2017. Time series analysis and its applications: with R examples.