The SSE Composite Index (Shanghai Stock Exchange Composite Index) is the most commonly used indicator of the SSE's market performance. Its constituents are all stocks (A shares and B shares) listed at the Shanghai Stock Exchange, and the index was launched on July 15, 1991 [1]. Investors spend a great amount of time and energy studying the stock market, hoping to find patterns in its trends and profit from them; many have failed. A natural question arises: since stock prices are recorded at regular time intervals, is a time series model appropriate for them? In this report, we look for patterns in the stock market data, as summarized by the Composite Index, and we examine whether basic time series approaches are appropriate for modeling stock data.
We attempt to identify potential patterns in the SSE Composite Index, using data collected from Yahoo Finance for the period 1998-01-01 to 2017-12-31 [2]. In particular, we are interested in the daily increase rate of the index, which we compute from the raw data; this quantity should be sufficient for identifying patterns in the market.
The data are collected from Yahoo Finance. The time period is from 1998-01-01 to 2017-12-31.
SSE <- getSymbols("000001.SS", auto.assign = FALSE, from = "1998-01-01", to = "2017-12-31")
The original data contain four price variables: Open, High, Low and Close. We are interested in the increase rate (%) of the index, which can be computed with the following formula:
\[ \text{Increase Rate} = \frac{\text{Close Price} - \text{Open Price}}{\text{Open Price}} \times 100\% \]
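As a minimal sketch, the formula can be checked against the first two rows of the data shown below (the example prices are copied from that table):

```r
# Increase rate (%) from open and close prices; the two example price pairs
# are taken from the first rows of the data printed below.
open  <- c(1200.948, 1223.730)
close <- c(1220.473, 1233.620)
rates <- (close - open) / open * 100
round(rates, 4)  # 1.6258 0.8082, matching the Rates column
```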
The first rows of the processed data are shown below.
head(SSE)
Open High Low Close Rates Date
1998-01-05 1200.948 1220.495 1200.222 1220.473 1.6258010 1998-01-05
1998-01-06 1223.730 1233.635 1215.409 1233.620 0.8081861 1998-01-06
1998-01-07 1233.956 1244.098 1231.146 1244.070 0.8196314 1998-01-07
1998-01-08 1243.121 1250.005 1235.885 1237.163 -0.4792782 1998-01-08
1998-01-09 1233.701 1244.999 1221.297 1239.901 0.5025489 1998-01-09
1998-01-12 1242.210 1246.651 1226.397 1226.966 -1.2271693 1998-01-12
As the stock market is closed during weekends and holidays, several rows contain missing values; these rows are removed, leaving a total of 4840 observations.
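The cleaning step can be sketched as follows (on a toy data frame; the same idiom applies to the xts object returned by getSymbols):

```r
# Drop rows with missing prices, as occurs for weekends and holidays.
df <- data.frame(Open = c(1200.9, NA, 1233.9), Close = c(1220.5, NA, 1244.1))
df_clean <- df[complete.cases(df), ]
nrow(df_clean)  # 2
```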
The following plot shows the increase rate over time. The variance seems approximately stable overall, but there may be periodicity in the data, and the variance appears to increase after a certain point. We look into this in the coming sections.
ggplot(data = SSE, aes(x = c(1:nrow(SSE)), y = Rates)) + geom_line() + xlab("Time")
First we look at the data in the frequency domain, i.e. spectrum analysis. The smoothed periodogram suggests that there is periodicity in the increase rate: a peak at approximately 0.25 cycles per day can be read from the plot, which corresponds to about one cycle every four days. As mentioned before, the stock market is closed during weekends, so a roughly weekly cycle is plausible.
spectrum(SSE$Rates, spans = c(100,3,100))
We can also look at the autocorrelation function. There are clear correlations at lags 1, 20 and 35; such patterns should be accounted for.
acf(SSE$Rates)
To capture the patterns we discovered, fitting an ARMA model to this time series is a natural starting point. Under the null hypothesis, we assume that the process is stationary and the noise is i.i.d. Gaussian.
An ARMA(p, q) model is defined as follows [3], \[ \phi(B)(X_{n} - \mu) = \psi(B)\epsilon_{n} \] where \[ BX_{n} = X_{n-1} \quad \text{and} \quad B\epsilon_{n} = \epsilon_{n-1}, \] \[ \mu = {\mathbb{E}}(X_{n}), \] \[ \phi(x) = 1-\sum_{i=1}^{p}\phi_{i}x^{i}, \] \[ \psi(x) = 1+\sum_{i=1}^{q}\psi_{i}x^{i}. \] As mentioned, under the null hypothesis the noise terms are independent and identically distributed, \[ \epsilon_{n} \sim N(0, \sigma^{2}). \]
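As a quick illustration of this model class, an ARMA(2,2) series can be simulated in R. The coefficients here are arbitrary example values (not the fitted ones reported later), chosen so the process is stationary:

```r
# Simulate an illustrative ARMA(2,2) process; the innovations are i.i.d.
# Gaussian, matching the assumption stated above.
set.seed(123)
x <- arima.sim(model = list(ar = c(0.5, -0.3), ma = c(0.2, 0.1)), n = 500)
```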
The \(p\) and \(q\) are tuning parameters. To choose suitable values, we use the AIC criterion, defined as \[ \operatorname{AIC} = -2 \times \ell(\theta^{*}) + 2D \] where \(\ell(\theta^{*})\) is the maximized log-likelihood and \(D = p+q+2\) is the number of estimated parameters (the AR and MA coefficients, the intercept, and the error variance). The smaller the AIC, the better the model. The following table gives the AIC value for each choice of \(p\) and \(q\).
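A table like the one below can be generated with a helper along these lines (a sketch; `arima` can fail to converge for some orders, hence the `try`):

```r
# Fit ARMA(p, q) for each (p, q) on a grid and record the AIC values.
aic_table <- function(data, P, Q) {
  tab <- matrix(NA, P + 1, Q + 1,
                dimnames = list(paste0("AR", 0:P), paste0("MA", 0:Q)))
  for (p in 0:P) for (q in 0:Q) {
    fit <- try(arima(data, order = c(p, 0, q)), silent = TRUE)
    if (!inherits(fit, "try-error")) tab[p + 1, q + 1] <- fit$aic
  }
  tab
}
```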
|     | MA0 | MA1 | MA2 | MA3 | MA4 |
|-----|-----|-----|-----|-----|-----|
| AR0 | 17500.91 | 17484.26 | 17485.81 | 17479.02 | 17477.89 |
| AR1 | 17484.80 | 17486.07 | 17484.82 | 17479.06 | 17479.88 |
| AR2 | 17485.10 | 17484.96 | 17469.99 | 17470.35 | 17471.62 |
| AR3 | 17480.87 | 17479.54 | 17470.30 | 17472.57 | 17474.25 |
| AR4 | 17477.59 | 17479.56 | 17471.73 | 17474.12 | 17468.38 |
| AR5 | 17479.47 | 17478.32 | 17472.23 | 17473.44 | 17464.09 |
From the table, ARMA(2,2) and the larger models ARMA(4,4) and ARMA(5,4) have the lowest AIC values. However, AIC is not the only criterion for model selection; we should also consider model complexity. With 4840 observations, a model with a handful of parameters is unlikely to overfit, but the extra parameters of the larger models buy only a slight decrease in AIC. Therefore, the simpler model, ARMA(2,2), is preferred.
The next step is to take a closer look at the ARMA(2,2) model.
model <- arima(SSE$Rates, order = c(2,0,2))
model
Call:
arima(x = SSE$Rates, order = c(2, 0, 2))
Coefficients:
ar1 ar2 ma1 ma2 intercept
0.1364 -0.7846 -0.1894 0.7722 0.0763
s.e. 0.0676 0.0895 0.0728 0.0880 0.0203
sigma^2 estimated as 2.158: log likelihood = -8728.99, aic = 17469.99
By writing it in the previous form, we have \[ (1-0.1364B+0.7846B^2)(X_n - 0.0763) = (1-0.1894B + 0.7722B^2)\epsilon_n \] where \(\epsilon_n \sim N(0, 2.158)\). The summary also shows that the standard error of each parameter is small, which means the estimates are reliable.
An examination of the roots of \(\phi(x)\) and \(\psi(x)\) is also necessary.
rootAR <- polyroot(c(1,-coef(model)[c("ar1","ar2")]))
rootAR
[1] 0.086891+1.125578i 0.086891-1.125578i
rootMA <- polyroot(c(1,coef(model)[c("ma1","ma2")]))
rootMA
[1] 0.122657+1.131363i 0.122657-1.131363i
For both \(\phi(x)\) and \(\psi(x)\), all roots lie outside the unit circle, which means the fitted model is both causal and invertible. With these properties, the model can reasonably be used to predict future values.
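The causality and invertibility claims can be double-checked numerically by computing the moduli of the roots directly from the fitted coefficients reported above:

```r
# Moduli of the AR and MA polynomial roots; values greater than 1 mean the
# roots lie outside the unit circle (causal and invertible, respectively).
ar_mod <- Mod(polyroot(c(1, -0.1364, 0.7846)))  # phi(x) = 1 - 0.1364x + 0.7846x^2
ma_mod <- Mod(polyroot(c(1, -0.1894, 0.7722)))  # psi(x) = 1 - 0.1894x + 0.7722x^2
```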
Once a model is chosen, detailed diagnostics should be carried out, and residual diagnostics are especially important. We start with the ACF plot of the residuals. The plot shows that the correlations at small lags (lag < 5) are no longer significant; however, the correlation at lag 35 remains. We therefore try to modify the model by adding a seasonal component.
acf(model$residuals)
We should also examine the distribution of the residuals with a QQ-plot. The result is striking: while most of the residuals are close to Gaussian, the tails are much heavier than normal. This suggests that low-probability events routinely occur in the stock market and are very hard to predict.
qqnorm(model$residuals)
qqline(model$residuals,probs = c(0.1,0.9))
From the previous diagnostics, we learn that a seasonality effect should be included in the model. A SARMA\((p,q) \times (P,Q)_{k}\) model is defined as follows, \[ \phi(B)\Phi(B^k)(X_{n}- \mu) = \psi(B)\Psi(B^{k})\epsilon_{n} \] where \[ \Phi(x) = 1-\Phi_1 x-\dots -\Phi_Px^P \] \[ \Psi(x) = 1+\Psi_1 x+\dots +\Psi_Qx^Q \] and the remaining symbols have the same definitions as in the ARMA(p,q) model. In our case, we set \(P=1\), \(Q=0\) and \(k=35\). The estimated coefficients are as follows.
model_modified <- arima(SSE$Rates, order = c(2,0,2), seasonal = list(order = c(1,0,0), period = 35))
model_modified
Call:
arima(x = SSE$Rates, order = c(2, 0, 2), seasonal = list(order = c(1, 0, 0),
period = 35))
Coefficients:
ar1 ar2 ma1 ma2 sar1 intercept
0.1307 -0.8070 -0.1813 0.7931 0.0368 0.0770
s.e. 0.0546 0.1011 0.0603 0.1010 0.0144 0.0211
sigma^2 estimated as 2.155: log likelihood = -8725.75, aic = 17465.49
The original ARMA coefficients barely change, and the new model again has small standard errors for every coefficient. The ACF of the residuals shows that the strong correlation at lag 35 is gone, so apart from the tail behavior the residuals can be treated as Gaussian white noise.
acf(model_modified$residuals)
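As a supplementary check (not part of the original diagnostics), a Ljung-Box test gives a formal test of the white-noise hypothesis; sketched here on simulated white noise standing in for the model residuals:

```r
# Ljung-Box test: under the null, the series is white noise up to the given
# lag; a small p-value would reject white noise. rnorm() is a stand-in for
# model_modified$residuals.
set.seed(42)
resid_example <- rnorm(1000)
bt <- Box.test(resid_example, lag = 35, type = "Ljung-Box")
```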
The QQ-plot, however, looks the same as before: a heavy-tailed distribution. We try to fix this problem with a transformation in the next section.
qqnorm(model_modified$residuals)
qqline(model_modified$residuals,probs = c(0.1,0.9))
The data do not show a clear exponential trend. A log transformation is nevertheless worth a try because of its nice properties [4]. We use the log of the Open and Close prices to compute the increase rate, and then repeat the earlier analysis.
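Assuming the transformed rate applies the same formula to log prices (our reading of the text; this is how we take the `modifiedRate` variable used below to be computed), the computation is:

```r
# Hypothetical sketch: the increase-rate formula applied to log prices.
# The example prices are taken from the first rows of the data.
open  <- c(1200.948, 1223.730)
close <- c(1220.473, 1233.620)
modifiedRate <- (log(close) - log(open)) / log(open) * 100
```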
The ARMA model on the transformed data is very similar to the original one: the parameters change only slightly, and the standard errors remain small. Note, however, that we cannot compare the likelihood or AIC with the previous models, since these criteria are not comparable across transformations of the data.
model_transformed <- arima(SSE$modifiedRate, order = c(2,0,2))
model_transformed
Call:
arima(x = SSE$modifiedRate, order = c(2, 0, 2))
Coefficients:
ar1 ar2 ma1 ma2 intercept
0.1294 -0.7786 -0.1832 0.7667 0.0084
s.e. 0.0681 0.0921 0.0728 0.0908 0.0026
sigma^2 estimated as 0.0363: log likelihood = 1156.97, aic = -2301.94
The QQ-plot, however, still shows heavy tails, so the log transformation is ineffective for this problem. On the other hand, we can conclude that the increase rate of the stock market is genuinely heavy-tailed: low-probability events happen from time to time, which makes prediction very hard.
qqnorm(model_transformed$residuals)
qqline(model_transformed$residuals,probs = c(0.1,0.9))
To sum up, this report identifies potential patterns in the increase rate of the Shanghai Stock Exchange Composite Index. Three approaches are attempted: spectrum analysis, fitting an ARMA model, and fitting modified models with a seasonal component and a log transformation. Several results can be drawn from the analysis.
From the spectrum analysis, we identify a peak at about 0.25 cycles per day, i.e. one cycle per four days. Given the weekly (five trading day) rhythm of the stock market, a roughly weekly cycle is plausible.
An ARMA(2,2) model is the most suitable for our data. The standard errors of the estimated coefficients are small, which makes the fit reliable. However, the residuals are heavy-tailed, and their ACF shows signs of seasonality; a modified model, SARMA\((2,2) \times (1,0)_{35}\), is used to address the latter.
A log transformation is applied to handle the heavy-tailed residuals; however, it proves ineffective.
From these analyses, we reach the following conclusions. We are able to discover cycles in the operation of the stock market, as well as a slight 35-day seasonality; these are the patterns we were looking for with basic time series analysis. The heavy-tailed residuals also show that low-probability events, such as sudden unpredictable changes, happen all the time in the stock market. That is to say, our stationarity assumption may not be reasonable, and a model built under that hypothesis is of limited value. More advanced techniques, such as models that allow time-varying volatility, should be considered; a basic time series approach is not effective enough.