Midterm Project

2 Data Analysis

2.1 Explore the data

First we read in the data, which can be downloaded from yahoo finance. It is SPX 500 historical time series data in recent 5 years. The data set consists of 1553 observations and 7 variables. It records SPX 500 index changes from 2011 to 2016. I am interested in SPX 500 index’s closed price every day, so I mainly focus on closed price.

dat <- read.csv(file="SPX500.csv",header=TRUE)

And we could summary six important statistics index of the closed price as follows.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1023    1284    1502    1580    1932    2131

Then we can plot the closed price over time to study its pattern. The blue line represents the mean for this time series.
There seems a significant increasing trend in SPX 500 index, so we need to differentiate the original to determine d in ARIMA model.

t <- seq(from=1,length=length(date),by=1)
N <- length(CLOSED)
detrend_closed <- CLOSED[2:N] - CLOSED[1:N-1]
plot(t[1:N-1],detrend_closed,type="l",xlab="t",ylab="remove tendency",main = "Detrending SPX 500")
abline(h=mean(detrend_closed),col = "blue")

The variation of the data seems unchanged during the 5 years and there is no visible trend. Thus, we can conclude that \(d=1\). Now we can analyze this time series.

2.2 Fitting an ARIMA(p,1,q) Model

First, consider if there exists a seasonal effect in this question. Assume that there’s a seasonal effect, then we could buy in the index at the trough and sell it at the crest. If so, everyone could arbitrage from the market which deviate from the principle of finance. Thus, it could not have a seasonal effect in this problem theoratically.
Meanwhile, we could plot the frequency domain of original data as follow
In the above figure, we could see that there’s no significant cycle in a year, which means there’s no seasonal effect in this problem.
After excluding the seasonal effect, let’s start by fitting a stationary ARIMA(p,1,q) model under the null hypothesis that there is no trend after differentiating once.
We seek to fit a stationary Gaussian ARIMA(p,1,q) model with parameter vector \(\theta = (\phi_{1:p},\psi_{1:q},\mu,\sigma^2)\) given by: \[\phi(B)(1-B)(X_n-\mu) = \psi(B)\varepsilon_n\] where \[\mu = E[X_n]\] \[\phi(x) = 1-\sum_{i=1}^p \phi_i x^i\] \[\psi(x) = 1+\sum_{i=1}^q \psi_i x^i\] \[\varepsilon \sim iid N[0,\sigma^2]\]
We need to decide where to start in terms of values of p and q.
In this question, I choose Akaike’s information criterion, AIC is given by \[AIC = -2 \times l(\theta^*) + 2D\]
The following table is an AIC table for a range of different choices of p and q.

## Warning in arima(data, order = c(p, 1, q)): possible convergence problem:
## optim gave code = 1

## Loading required package: knitr

	MA0	MA1	MA2	MA3	MA4
AR0	12840.57	12841.70	12843.66	12843.49	12844.64
AR1	12841.69	12843.70	12845.66	12840.74	12842.55
AR2	12843.64	12845.64	12840.97	12842.30	12844.02
AR3	12843.63	12840.56	12842.29	12843.61	12845.21
AR4	12844.85	12842.42	12843.89	12845.26	12834.72

In the AIC table, we can find the 3 smallest (p,q), namely (4,4),(3,1), and (0,0).
Firstly, ARIMA(4,1,4) is a very big model, which is hard to explain. Moreover, it is at the corner of the table, which is an unproper choice.
Secondly, consider ARIMA(0,1,0) model. If it is a true model, then \(X_n\) should follow: \[1*(1-B)X_n=1*\varepsilon_n\]
Take expectation of each side of the equation, we have: \[E[X_n - X_{n-1}]=E[\varepsilon_n]=0\] \[E[X_n] = E[X_{n-1}]\]
Thus, if the data follows ARIMA(0,1,0) model, it would not have a trend theoratically, which deviates from our original data.
To sum up, ARIMA(3,1,1) may be the only choice. Thus I choose ARIMA(3,1,1) model to analyze this dataset.

2.3 Analysing ARIMA(3,1,1) Model

2.3.1 Regression on ARIMA(3,1,1) Model

Now we fit ARIMA(3,1,1) model to the dataset.

closed_arma31 <- arima(CLOSED,order=c(3,1,1))
closed_arma31

## 
## Call:
## arima(x = CLOSED, order = c(3, 1, 1))
## 
## Coefficients:
##          ar1     ar2      ar3      ma1
##       0.7659  0.0237  -0.0462  -0.7930
## s.e.  0.1557  0.0322   0.0282   0.1545
## 
## sigma^2 estimated as 228:  log likelihood = -6415.28,  aic = 12840.56

The result shows that \(\phi_1 = 0.7659\),\(\phi_2 = 0.0237\),\(\phi_3 = -0.0462\),\(\psi_1 = -0.7930\), that is: \[(1-0.7659B-0.0237B^2+0.0462B^3)(1-B)X_n = (1-0.7930B)\varepsilon_n\]
However, noticing that standard error of ar3 coefficient is quite large, and this may leads to an unsignificant result.

2.3.2 Testing ARIMA(3,1,1) Model’s Root

To check if all parameters have a small fitted value, we can see whether its roots are in a unit circle.

AR_roots <- polyroot(c(1,-coef(closed_arma31)[c("ar1","ar2","ar3")]))
AR_roots

## [1]  1.414352-0i -4.387351-0i  3.486625+0i

MA_roots <- polyroot(c(1,coef(closed_arma31)[c("ma1")]))
MA_roots

## [1] 1.261061+0i

From the analysis given above, we can find that all the roots are outside the unit circle. This suggests that ARIMA(3,1,1) may be a good fit for the data.

2.3.3 Predictions

Based on the analysis given above, we can predict the future’s SPX 500 index based on historical data with following equation. \[(1-0.7659B-0.0237B^2+0.0462B^3)(1-B)X_n = (1-0.7930B)\varepsilon_n\]
Given \(X_{n-1}=1999.99, X_{n-2} = 1993.40, X_{n-3} = 1986.45, X_{n-4} = 1978.35\), we could say \[X_n = (1.7659B-0.7422B^2-0.0699B^3+0.0462B^4)X_n + (1-0.7930B)\varepsilon_n\] \[X_n = 2004.83 + (1-0.7930B)\varepsilon_n\]
Thus, 95% confidence interval for \(X_n\) should be \((1966.29,2043.37)\), since \(var(\epsilon_i) = sigma^2 = 228\)

2.4 Diagnostics

2.4.1 Residuals Brief View

For the diagnostic analysis, I look at the residuals first.

plot(closed_arma31$resid,ylab = "residuals",main = "Residuals for ARIMA(3,1,1) Model")

acf(closed_arma31$resid,main = "ACF of residuals")

From the acf plot, we can find a slight deviation from gaussian white noise at \(lag=22\), \(lag=24\) and \(lag=25\).

2.4.2 Testing for Normality

In our assumption, we assumes that residuals at each period are independent, and follow same normal distribution. Thus, the model’s residual is \(resid = (1-0.7930B)\varepsilon_n\) should follow normal distribution either. Then we could draw a Q-Q plot for our model.

qqnorm(closed_arma31$resid)
qqline(closed_arma31$resid,probs = c(0.25,0.75))

From the Q-Q plot, we can see that residuals have long tail on both sides, indicating that it is not a normal distribution. This would be a big problem.

2.4.3 Modify the Model

The long-tail residual indicates that current SPX 500 index could not be explained by its historical data only. We may ignore some important factors during our analysis. Thus, I am going to add a factor to my model.
As we all know, volume is an important factor in stock market, and it may affect the close price. Thus, I try to add this factor into our model and do all what I did in section 2.2 and section 2.3.
Since volume is a large number, I take a log form.

volume <- log(rev(as.numeric(dat$Volume)))

## Warning in arima(data, order = c(p, 1, q), xreg = data2): possible
## convergence problem: optim gave code = 1

## Warning in arima(data, order = c(p, 1, q), xreg = data2): possible
## convergence problem: optim gave code = 1

	MA0	MA1	MA2	MA3	MA4
AR0	12829.18	12829.68	12831.51	12831.03	12831.78
AR1	12829.65	12829.90	12832.86	12828.37	12830.12
AR2	12831.44	12832.32	12828.85	12830.43	12831.77
AR3	12831.18	12828.16	12830.58	12831.46	12832.71
AR4	12832.00	12829.98	12832.05	12832.80	12829.94

It is good to see that adding an additonal parameter does not change our ARIMA model’s parameter. And this time, we would run ARIMA(3,1,1) either, but adding one more parameter log(volume).

arma31_v <- arima(CLOSED,order = c(3,1,1),xreg = volume)
arma31_v

## 
## Call:
## arima(x = CLOSED, order = c(3, 1, 1), xreg = volume)
## 
## Coefficients:
##          ar1     ar2      ar3      ma1   volume
##       0.7270  0.0341  -0.0535  -0.7615  -7.2501
## s.e.  0.1612  0.0319   0.0276   0.1601   1.9140
## 
## sigma^2 estimated as 225.9:  log likelihood = -6408.08,  aic = 12828.16

The result shows that coefficient of ar3 becomes more significant, and there’s a significant effect on log(volume) as we expected. Then we need to examine if all the roots are outside the unit circle

AR_roots <- polyroot(c(1,-coef(arma31_v)[c("ar1","ar2","ar3")]))
AR_roots

## [1]  1.528866-0i -3.972214+0i  3.080530+0i

MA_roots <- polyroot(c(1,coef(arma31_v)[c("ma1")]))
MA_roots

## [1] 1.313209+0i

The roots are outside the unit circle as we expected.
Then we need to examine if the residual follows stationary condition and normal distribution this time.

plot(arma31_v$resid,ylab = "residuals",main = "Residuals for ARIMA(3,1,1) Model with volume parameter")

acf(arma31_v$resid,main = "ACF of residuals with volume parameter")

From the acf plot, we can find that the problem is not solved, there’s still a slight deviation from gaussian white noise at \(lag=22\), \(lag=24\) and \(lag=25\).
Looking at the Q-Q plot, we can find the same problem as before.

qqnorm(arma31_v$resid)
qqline(arma31_v$resid,probs = c(0.25,0.75))

Though adding log(volume) improves the model in many ways, it is still hard to explain why the residual of the model is long-tail, suggesting further analysis with more complex model than ARIMA.

Midterm Project

3/5/2016

1 Question Description

2 Data Analysis

2.1 Explore the data

2.2 Fitting an ARIMA(p,1,q) Model

2.3 Analysing ARIMA(3,1,1) Model

2.3.1 Regression on ARIMA(3,1,1) Model

2.3.2 Testing ARIMA(3,1,1) Model’s Root

2.3.3 Predictions

2.4 Diagnostics

2.4.1 Residuals Brief View

2.4.2 Testing for Normality

2.4.3 Modify the Model

3 Conclusion