1.Summary

This project is about the time series analysis about the log return of the S&P500.

First, I take differences for log transformation of the Adjusted Close prices for S&P500. Then, the smoothed spectrum shows that there is no evidence for significant cycles.

After that, by AIC criteria, I choose to fit ARMA(1,1) model for log returns. By Fisher Information and simulation method, I confirm that the parameters I chose is reasonable.

The ACF figure shows no evidence that the residuals are correlated, but the absolute value of residuals are highly correlated.

The qqplot also shows that the distribution of residuals has a heavier tail than normal distribution, which deviates from the standard assumption that the residuals are Gaussian White Noise Processes.

2.Background

The Standard & Poor’s 500, often abbreviated as the S&P500, is an American stock market index based on the market capitalizations of 500 large companies having common stock listed on the NYSE or NASDAQ. The S&P 500 index components and their weightings are determined by S&P Dow Jones Indices. It is one of the most commonly followed equity indices, and many consider it one of the best representations of the U.S. stock market, and a bellwether for the U.S. economy.[1]

Compared with the Dow Jones industrial average stock index, standard & Poor’s 500 index has the characteristics of wide range and sampling representativeness, high accuracy, good continuity, and is generally considered to be an ideal underlying asset of stock-index futures contracts.

So it is useful to analyse S&P500 because they are representative of the industries in the United States economy. People always want to know how the market performance is and how to predict its return. Since there are not perfect method to model the distribution of the stock prices and returns, we try to use ARMA model to see where the disadvantages are, and try to give a general conclution of the returns of S&P500. This has practical value for quantitative finance.

3.Data Analysis

3.1 Explore the data

The data I use in this report is the daily data of S&P500 from 2010-03-04 to 2017-03-03, which are captured from Yahoo Finance. The dataset contains Open prices, High prices, Low prices, Close prices, Adjusted Close prices and Volume of S&P500. I’ve checked that there is no NA or omitted data, so we don’t need to work on the missing values.

We can see some of the data from below.

##         Date    Open    High     Low   Close Adj.Close     Volume
## 1 2013-03-05 1525.20 1543.47 1525.20 1539.79   1539.79 3610690000
## 2 2013-03-06 1539.79 1545.25 1538.11 1541.46   1541.46 3676890000
## 3 2013-03-07 1541.46 1545.78 1541.46 1544.26   1544.26 3634710000
## 4 2013-03-08 1544.26 1552.48 1542.94 1551.18   1551.18 3652260000
## 5 2013-03-11 1551.15 1556.27 1547.36 1556.22   1556.22 3091080000
## 6 2013-03-12 1556.22 1556.77 1548.24 1552.48   1552.48 3274910000

Here, we focus mostly on Adjusted Close prices. We plot the Adj. Close v.s. time and here we get 1259 observations.

From the plot of Adj.Close against time, we can find that stock prices have a general increasing trend over time, and there is a significant downward trend in Sep. 2015, Jan. 2016 and Feb. 2018.

3.2 Return and Log Return

For the purpose of detrend, I try to analyse returns instead of prices. Then we can get a more stationary time series data.

As for practical meaning, people always concentrate on log returns because they are simply eliminate the non-stationary properties of the data set, making the financial data more stable. Here we can plot returns and log returns to see, they are very close to each other at each time point.

In the report, I use log return as target of the research.

n=length(data$Adj.Close)
Ret = data$Adj.Close[-1]/data$Adj.Close[-n]-1
ret = diff(log(data$Adj.Close))
plot(t[-1],100*Ret,type="l",xlab="Time",ylab="%",col=1)
points(t[-1],100*ret,col=2,cex=0.1)
legend(16600,-4.5,legend=c("Return","Log-Return"),lty=1,col=c(1:2))

Here, I plot log return v.s. time, we can see that the mean of log return is almost zero, but the volatility becomes larger at the beginning of 2016 and beginning of 2018. This phenomenon is consistent with the Adj. Close price figure, that at those time, there exist significantly fluctuations.

3.3 Description of Log Returns

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.0418425 -0.0029139  0.0005307  0.0004438  0.0046022  0.0382913

As we can see, the mean of log returns are almost 0 and the min and max are approximately around zero.

##  The kurtosis of log return is 3.439882 
##  The skewness of log return is -0.5959859

From the description above, we can see that the kurtosis is 3.44 which is larger than normal distribution, which kurtosis = 3. So the S&P500 return has a heavier tail than normal. The skewness is -0.6, which means the distribution of return is asymmetric and the negative value implies that the distribution has a long left tail.

3.4 Spectrum

raw = spectrum(ret)

smooth = spectrum(ret,spans=c(25,5,25),main="Smoothed periodogram",ylim=c(1e-5,2e-4))

From the smoothed periodogram, we can see that there is no significant dominant frequency, which means there is no significant cycles. Although there are some small peaks in the spectrum, but when we move the crossbar to each peak along the estimated spectrum, it gives pointwise 95% confidence intervals, and we can see that all the peaks are insignificant.

This is not contrary to our common sense that the stock price and return is a kind like random walk, one can hardly find cycles in such a few years.

4.Fit ARMA Model

4.1 AIC table

First, we use AIC criteria to choose the model.

  • Let’s start by fitting a stationary ARMA\((p,q)\) model under the null hypothesis that there is no trend. We seek to fit a stationary Gaussian ARMA(p,q) model with parameter vector \(\theta=(\phi_{1:p},\psi_{1:q},\mu,\sigma^2)\) given by \[ \phi(B)(X_n-\mu) = \psi(B) \epsilon_n,\] where \[\begin{eqnarray} \mu &=& E[X_n] \\ \phi(x)&=&1-\phi_1 x-\dots -\phi_px^p, \\ \psi(x)&=&1+\psi_1 x+\dots +\psi_qx^q, \\ \epsilon_n&\sim&\mathrm{ iid }\, N[0,\sigma^2]. \end{eqnarray}\]

  • We need to decide where to start in terms of values of \(p\) and \(q\). Here, I use AIC criteria.

  • Akaike’s information criterion AIC is given by \[ AIC = -2 loglik(\theta) + 2D\] where D is the number of parameters in the model.

Let’s tabulate some AIC values for a range of different choices of \(p\) and \(q\).

aic_table = function(dataset,P,Q){
  table = matrix(NA,(P+1),(Q+1))
  for (p in 0:P){
    for (q in 0:Q){
      table[p+1,q+1] = arima(dataset,order=c(p,0,q))$aic
    }
  }
  dimnames(table) = list(paste("<b> AR",0:P,"</b>",sep=""),paste("MA",0:Q,sep=""))
  table
}
ret_aic_table = aic_table(ret,4,4)
require(knitr)
kable(ret_aic_table,digits=2)
MA0 MA1 MA2 MA3 MA4
AR0 -8662.23 -8660.35 -8660.56 -8658.56 -8658.90
AR1 -8660.34 -8668.46 -8658.56 -8656.56 -8663.78
AR2 -8660.37 -8658.37 -8665.43 -8663.68 -8661.95
AR3 -8658.37 -8656.37 -8663.68 -8662.04 -8667.49
AR4 -8658.75 -8663.77 -8662.00 -8660.92 -8666.47
  • From the table we can see that ARMA(1,1) has the lowest AIC. So we will choose ARMA(1,1).

  • Although we’ve learned that AIC criteria is not the only rule to help us choose the model, but we can see that besides ARMA(1,1), the lowest AIC occurs at ARMA(3,4) and ARMA(4,4). However, these models are more complex, so it may lead to problems like overfitting, numerical stability and etc. We usually prefer a simply model, which also better for interpretation.

4.2 Model Fitting

4.2.1 Fit ARMA(1,1) Model

arma11 = arima(ret,order=c(1,0,1))
arma11
## 
## Call:
## arima(x = ret, order = c(1, 0, 1))
## 
## Coefficients:
##          ar1      ma1  intercept
##       0.9409  -0.9716      4e-04
## s.e.  0.0206   0.0137      1e-04
## 
## sigma^2 estimated as 5.917e-05:  log likelihood = 4338.23,  aic = -8668.46

We have:

Write \(x_{1:N}^{*}\) for the N values of S&P500 log returns, at times \(t_{1:N}\). We model \(x_{1:N}^{*}\) as a realization of the time series model \(X_{1:N}\) defined by \((1-\phi_1 B)(X_n-\mu)=(1+\psi_1B)\epsilon_n\), where \(\epsilon_n\) is Gaussian white noise, i.e. \(\epsilon_n\sim\mathrm{ iid }\, N[0,\sigma^2]\).

Use the data we’ve got, I have the model: \[(1-0.9409 B)(X_n-0.0004)=(1-0.9716B)\epsilon_n\] where \(\epsilon_n\sim\mathrm{ iid }\, N[0,9.286*10^{-5}]\)

4.2.2 Confidence Interval for Parameters

From the summary above, we can see that for the parameter \(\phi_1\), the approximate 95% confidence interval derived from Fisher Information is: \[[0.9409-1.96*0.0206,0.9409+1.96*0.0206]=[0.900524,0.981276]\] This Confidence Interval doesn’t include 0, so we can say that \(\phi_1\) is significant at \(\alpha = 0.05\).

Similarly, for the parameter \(\psi_1\), the approximate 95% confidence interval derived from Fisher Information is: \[[-0.9716-1.96*0.0137,-0.9716+1.96*0.0137]=[-0.998452,-0.944748]\] This Confidence Interval also doesn’t include 0, so we can say that \(\psi_1\) is significant at \(\alpha = 0.05\).

To get further confirmation, I do a simulation study to see whether the parameter has been given a reliable value. We can plot the histogram and density plot for the ar1 and ma1 parameters.

These plots are consistent with the confidence interval we’ve concluded before. That the peaks occur at ma1 parameter approximately -1 and ar1 parameter approximately 0.95.

From both two methods, we can say that the ar1 and ma1 parameters are significant, and the value we chose are reasonable.

4.3 Diagnostics

4.3.1 Test for Roots

AR_roots = polyroot(c(1,-coef(arma11)[1]))
AR_roots
## [1] 1.062862+0i

We can see that the ARMA(1,1) model is causal because the AR polynomial has all its roots outside the unit circle in the complex plane.

MA_roots = polyroot(c(1,coef(arma11)[2]))
MA_roots
## [1] 1.02925+0i

We can see that the ARMA(1,1) model is invertible because the MA polynomial has all its roots outside the unit circle, too.

4.3.2 Test for Residuals

  • Test for independence

We plot residuals and ACF figures for diagnostic analysis.

From the residual plot we can see the mean of residuals is almost zero. However, we can see that the variance seems like not a constant, there are larger variance at beginning of 2016 and beginning of 2018.

From the acf plot we can see that there is a slight deviation at lag=15 and lag=24 when comparing with the Gaussian White Noise Process. The values of ACF almost fall inside the dashed lines, and we expect a fraction of 5% of the lags of the ACF to fall outside the two dashed lines under the null hypothesis. The figure shows no significant evidence that the residuals are autocorrelated.

To further study, we plot the ACF of absolute value of residuals, however, it obviously shows an autocorrelated relationship, so we will reject the standard assumption that the residuals are independent. This phenomenon shows that high volatility today implies high volatility in the future, but the price goes up or down are unknown.

This is a quite common phenomenon when trying to fit ARMA model to time series data, that the residuals have some inner correlation and some other properties. That’s why GARCH model was introduced.

  • Test for normality

From QQ-Plot we can see that the distribution of residuals have a much heavier tail than normal distribution.

4.4 Forecasting

We use tha ARMA(1,1) model to forecast the log return of S&P500. We take 90% of the data as train data and 10% of the data as test data.[2]

The dark grey bar shows 99% CI for the forecast while the light grey bar shows 95% CI. We can see that there will not be a large volatility by forecasting.

4.5 Modification

Since the ACF plot and QQ plot shows that the data fitted with ARMA model has residuals which are not normal distributed, we try to solve this problem.

I choose to use weekly data instead of daily data of S&P500, try to lose some accuracy, but get a more normalized data via this method. Also, by weekly data, we can actually get an equally spaced time series data.

I read the weekly data of S&P500 during the same period as I analysed before.

##         Date    Open    High     Low   Close Adj.Close      Volume
## 1 2013-03-04 1525.20 1552.48 1525.20 1551.18   1551.18 14574550000
## 2 2013-03-11 1551.15 1563.62 1547.36 1560.70   1560.70 18074930000
## 3 2013-03-18 1560.70 1561.56 1538.57 1556.89   1556.89 16501510000
## 4 2013-03-25 1556.89 1570.28 1546.22 1569.19   1569.19 12266080000
## 5 2013-04-01 1569.18 1573.66 1539.50 1553.28   1553.28 16991960000
## 6 2013-04-08 1553.26 1597.35 1548.63 1588.85   1588.85 16193490000

Similarly, we can plot the weekly Adjusted Close prices v.s. time.

We can see that the price plot is smoother than daily data. Still, it has an increasing trend and a sharp fall at around 2016 and beginning of 2018. That is, the weekly plot captured the main character of the data.

Then we plot the log returns of weekly data.

Also, we can see the log return plot is smoother than daily. And its volatility seems more convergence than daily.

Then we can plot the periodogram for the weekly log returns.

raw = spectrum(ret)

smooth = spectrum(ret,spans=c(25,5,25),main="Smoothed periodogram",ylim=c(5e-5,5e-4))

Similarly to the explaination before, moving the crossbar and we’ll find that the peaks are insignificant. So from the smoothed periodogram, we can see that there is no significant dominant frequency, which means there is no significant cycles.

4.5.1 AIC table_Weekly

aic_table = function(dataset,P,Q){
  table = matrix(NA,(P+1),(Q+1))
  for (p in 0:P){
    for (q in 0:Q){
      table[p+1,q+1] = arima(dataset,order=c(p,0,q))$aic
    }
  }
  dimnames(table) = list(paste("<b> AR",0:P,"</b>",sep=""),paste("MA",0:Q,sep=""))
  table
}
ret_aic_table = aic_table(ret,4,4)
require(knitr)
kable(ret_aic_table,digits=2)
MA0 MA1 MA2 MA3 MA4
AR0 -1411.60 -1411.55 -1411.62 -1413.12 -1411.64
AR1 -1411.20 -1415.84 -1413.90 -1412.47 -1410.67
AR2 -1411.02 -1413.90 -1413.24 -1412.38 -1411.23
AR3 -1412.58 -1412.61 -1410.63 -1410.94 -1409.32
AR4 -1412.46 -1410.70 -1408.70 -1409.64 -1413.39

From the AIC table, we still choose ARMA(1,1) model to fit the log return of weekly data. Because the AIC of ARMA(1,1) has the smallest value.

4.5.2 Fit ARMA(1,1) model_Weekly

arma11 = arima(ret,order=c(1,0,1))
arma11
## 
## Call:
## arima(x = ret, order = c(1, 0, 1))
## 
## Coefficients:
##          ar1      ma1  intercept
##       0.7506  -0.8696     0.0021
## s.e.  0.0898   0.0635     0.0005
## 
## sigma^2 estimated as 0.0002449:  log likelihood = 711.92,  aic = -1415.84

We have:

Write \(y_{1:N}^{*}\) for the N values of S&P500 log returns, at times \(t_{1:N}\). We model \(y_{1:N}^{*}\) as a realization of the time series model \(Y_{1:N}\) defined by \((1-\phi_1 B)(Y_n-\mu)=(1+\psi_1B)\epsilon_n\), where \(\epsilon_n\) is Gaussian white noise, i.e. \(\epsilon_n\sim\mathrm{ iid }\, N[0,\sigma^2]\).

Use the data we’ve got, I have the model: \[(1-0.7506 B)(X_n-0.0021)=(1-0.8696B)\epsilon_n\] where \(\epsilon_n\sim\mathrm{ iid }\, N[0,0.0002449]\)

4.5.3 Check Confidence Interval for Parameters

From the summary above, we can see that for the parameter \(\phi_1\), the approximate 95% confidence interval derived from Fisher Information is: \[[0.7506-1.96*0.0898,0.7506+1.96*0.0898]=[0.574592,0.926608]\] This Confidence Interval doesn’t include 0, so we can say that \(\phi_1\) is significant at \(\alpha = 0.05\).

Similarly, for the parameter \(\psi_1\), the approximate 95% confidence interval derived from Fisher Information is: \[[-0.8696 -1.96*0.0635,-0.8696 +1.96*0.0635]=[-0.99406,-0.74514]\] This Confidence Interval also doesn’t include 0, so we can say that \(\psi_1\) is significant at \(\alpha = 0.05\).

4.5.4 Test for Residuals

  • Test for normality

From the qq plot we can see that residuals for weekly data is quite close to normal distributed values, although the left tail is still heavier than normal.

The normality test shows that the we’ve solved the residuals’ normality problem to some extent.

  • Test for independence

From the residual plot we can see that the variance seems like not a constant, there are larger variance at beginning of 2016 and beginning of 2018. This also shows the main character of the model fitted by daily data.

From the ACF plot we can see that there is a slight deviation at lag=4 when comparing with the Gaussian White Noise Process. All the other values of ACF fall inside the dashed lines, so we can say that there is no evidence to reject the null hypothesis that the residuals are uncorrelated.

Then focus on the ACF plot of the absolute values of residuals, we can see deviations at lag=4 and lag=24. This shows that there are not much correlation between the absolute values of residuals. It is reasonable to consider the residuals as Gaussian White Noise Process. This shows that the ARMA model for weekly data is more fittable than for daily data.

5.Conclusion

6.References

[1]. Wikipedia S&P 500 Index

[2]. Time Series Analysis for Stock Data

[3]. Lecture Notes and Previous Homework solutions and Practice Midterm Exam solutions