The time series we analyze here is the monthly Brazilian GDP (Gross Domestic Product) index from January 1980 through December 1997, a total of 216 months. We first get a basic idea of how the series evolves over time by examining its trend and cycles (if any) with tools such as spectrum analysis and local linear regression. We then fit models to the data and discuss what those models mean in practice by inspecting the models we build. Finally, since prediction is always of interest once a model and a dataset are in hand, we turn to forecasting at the end of the project.
First, we generate a basic picture of the data. The plot below shows the time series of 216 months of the Brazilian GDP index.
# Read the monthly Brazilian GDP series, skipping the 14-line header
data=read.csv(file="http://www2.stat.duke.edu/~mw/data-sets/ts_data/brazil_econ", skip=14, sep="",header=FALSE)
By inspecting the plot, we can detect seasonality and probably a linear trend (which needs further examination) in the raw data. To identify the period of the seasonal effect, we examine the periodogram. The smoothed periodogram indicates a dominant cycle at a frequency of about 0.083, meaning the Brazilian GDP series goes through a cycle of thriving and recession every 12 months. This is the slowest cycle the periodogram reveals. There are also bumps at frequencies of about 0.17 and 0.33, corresponding to cycles with periods of 6 months and 3 months. In other words, Brazilian GDP over these eighteen years has a strong annual cycle, together with half-year and quarterly cycles within each year, although their power is weaker relative to the annual cycle.
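The smoothed periodogram described above can be produced with `spectrum()`; this is a sketch, and the smoother width `spans = c(3, 3)` is our choice rather than a value stated in the text:

```r
# Smoothed periodogram of the GDP series (modified Daniell smoother);
# the spans value controls the amount of smoothing and is a judgment call
spectrum(training, spans = c(3, 3),
         main = "Smoothed periodogram of Brazilian GDP")
# Mark the annual frequency (1/12 ~ 0.083) and its harmonics at
# ~0.17 (half-year) and ~0.33 (quarterly)
abline(v = c(1/12, 2/12, 4/12), lty = 2)
```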
Decomposing the original data a step further, we separate the chaotic line into a long-term trend plus a cycle plus colored noise.
# Low-frequency component (trend): wide-span loess smooth
low_f = ts(loess(training ~ (1:190), span = 0.5)$fitted)
# High-frequency component (noise): residual from a narrow-span smooth
high_f = ts(training - loess(training ~ (1:190), span = 0.1)$fitted)
# Mid-frequency component (business cycle): what remains
cycle = training - low_f - high_f
plot(ts.union(training, low_f, high_f, cycle),
     main = "Decomposed GDP as Trend + Cycle + Noise")
From the decomposition we conclude that the Brazilian economy kept booming over the years 1980-1997 in general. Beyond the upward trend, the economy has its own seasonal pattern, taking GDP as an indicator of overall economic well-being: within each year, prosperity peaks around May, June, and July, with stagnancy at the beginning and end of the year. On top of that there is some high-frequency noise in the behavior of the economy, for which we could develop an ARMA model if we wanted the details. (The detailed models that describe this pattern are discussed next.)
We now develop a model to describe in detail the underlying system generating the observations. The first model that comes to mind when looking at the seasonal pattern is SARIMA\((p,d,q){\times}(P,D,Q)_{12}\) (Seasonal AutoRegressive Integrated Moving Average). The ACF plot below reinforces our analysis that the seasonal period should be 12 months.
We first difference the data at lag twelve, obtaining the new series \(Y_{n}=(1-B^{12})X_{n}\)
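In R, the seasonal differencing is a one-liner with `diff()`; this sketch also plots the differenced series and its ACF, which are the diagnostics discussed next:

```r
# Seasonal difference at lag 12: Y_n = (1 - B^12) X_n
y <- diff(training, lag = 12)
plot(y, type = "l", main = "Seasonally differenced GDP")
acf(y, main = "ACF of the differenced series")
```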
From the plot we believe this new series \(Y_{n}\) is roughly weakly stationary with constant variance, while its ACF shows correlation along the time axis. These properties motivate us to build an ARMA model for \({Y_n}\), so we use the AIC criterion to select the best model for \({Y_n}\).
|     | MA0     | MA1     | MA2     | MA3     | MA4    | MA5    |
|-----|---------|---------|---------|---------|--------|--------|
| AR0 | 1146.59 | 1044.58 | 1022.24 | 1000.05 | 986.77 | 978.52 |
| AR1 | 977.36  | 976.68  | 978.01  | 976.23  | 977.29 | 978.72 |
| AR2 | 977.30  | 978.47  | 972.00  | 973.96  | 977.80 | 974.16 |
| AR3 | 976.64  | 977.41  | 969.59  | 966.58  | 967.19 | 974.70 |
| AR4 | 976.48  | 978.35  | 971.52  | 961.14  | 962.98 | 964.81 |
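A table like the one above can be generated with a double loop over the ARMA orders; this is a sketch, and the helper name `aic_table` is ours. Some cells may throw convergence warnings from `arima()`, which is common in this kind of grid search:

```r
# Tabulate AIC over ARMA(p, q) fits to a (differenced) series x
aic_table <- function(x, P, Q) {
  tab <- matrix(NA, P + 1, Q + 1)
  for (p in 0:P) {
    for (q in 0:Q) {
      tab[p + 1, q + 1] <- arima(x, order = c(p, 0, q))$aic
    }
  }
  dimnames(tab) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  tab
}
# Apply to the seasonally differenced series Y_n
aic_table(diff(training, lag = 12), 4, 5)
```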
The principle of AIC selection is to choose the model with the smallest AIC while keeping the model as simple and compact as possible without losing effectiveness. In this case, AR(1) and ARMA(1,1) are both good choices. We need to actually fit the models and check their adequacy to select the better one (in the sense of interpretability). We start with the less complicated AR(1); if it fits well, there is no need to try ARMA(1,1). So the first model we try is \(Y_n = {\phi}Y_{n-1}+\epsilon_n\). Substituting the definition of \(Y_n\) back in, the model for \(X_n\) is a SARIMA \((1,0,0)\times(0,1,0)_{12}\): \[(1-{\phi}B)\left((1-B^{12})X_n-\mu\right)=\epsilon_n\]
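This model corresponds to the `arima()` call whose output is reported below; the object name `sarma` matches the residual-diagnostic chunk that follows:

```r
# SARIMA (1,0,0) x (0,1,0)_12 fit to the raw training series
sarma <- arima(training, order = c(1, 0, 0),
               seasonal = list(order = c(0, 1, 0), period = 12))
```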
We check the goodness of fit by examining the dependency and normality of the residuals. The residual plot shows the appearance of white noise, with perhaps two exceptions around time point 120, which might be outliers. The ACF and QQ plots of the residuals also appear consistent with the model assumptions, and the Box-Pierce test reinforces our confidence that the residuals are independent.
par(mfrow=c(2,2))
plot(resid(sarma),ylab=" ",main="Residuals of SARIMA model")
plot(abs(sarma$residuals),ylab=" ",main="Absolute value of Residuals")
acf(resid(sarma),main="ACF of errors")
qqnorm(sarma$residuals)
qqline(sarma$residuals)
Box.test(sarma$residuals)
##
## Box-Pierce test
##
## data: sarma$residuals
## X-squared = 2.2897, df = 1, p-value = 0.1302
Since this model is quite satisfactory, we now show the coefficients of the model and a plot of its fit within the training set.
##
## Call:
## arima(x = training, order = c(1, 0, 0), seasonal = list(order = c(0, 1, 0),
## period = 12))
##
## Coefficients:
## ar1
## 0.8108
## s.e. 0.0428
##
## sigma^2 estimated as 13.86: log likelihood = -487.1, aic = 978.2
In addition to the SARMA model, another option is to fit a model of linear trend + sinusoid + ARMA errors: \[X_n= {\beta} n + A \sin({\pi}n/6 ) + B \cos({\pi}n/6 ) + \text{ARMA error}\] where \(X_n\) denotes the observation at time point \(n\). Below is the deterministic trend estimated by OLS; its residuals are then modeled with ARMA.
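The OLS step is an ordinary `lm()` fit; this reconstructs the call whose output is shown below, with `x` as the time index over the 190 training months:

```r
# Linear trend plus annual sinusoid (2*pi*x/12 = pi*x/6), fit by OLS
x <- 1:190
m2 <- lm(training ~ sin(2 * pi * x / 12) + cos(2 * pi * x / 12) + x)
```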
m2
##
## Call:
## lm(formula = training ~ sin(2 * pi * x/12) + cos(2 * pi * x/(12)) +
## x)
##
## Coefficients:
## (Intercept) sin(2 * pi * x/12) cos(2 * pi * x/(12))
## 93.6758 0.6438 -6.7460
## x
## 0.2038
We check the AIC table for the residual ARMA(p,q) models to select p and q; p=2 and q=2 gives a small AIC without making the model too large. Following the same reasoning as before, we try this model first and accept it if it behaves well; otherwise, we can enlarge the model to include more parameters. So we try ARMA(2,2) first.
|     | MA0     | MA1     | MA2     | MA3     | MA4     | MA5     |
|-----|---------|---------|---------|---------|---------|---------|
| AR0 | 1259.67 | 1179.11 | 1170.92 | 1141.18 | 1142.61 | 1140.30 |
| AR1 | 1141.16 | 1132.53 | 1128.47 | 1142.03 | 1126.12 | 1120.39 |
| AR2 | 1137.42 | 1128.97 | 1123.08 | 1124.85 | 1123.24 | 1122.27 |
| AR3 | 1133.92 | 1126.29 | 1124.66 | 1111.72 | 1118.27 | 1117.42 |
| AR4 | 1135.92 | 1128.27 | 1124.74 | 1107.06 | 1115.76 | 1117.31 |
The ACF plot of the residuals after fitting ARMA(2,2) does not look good: the prominent spike at lag 12 indicates that we should extend the model to ARMA(12,2) to capture the autocorrelation at 12 months.
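The extended model is fit to the regression residuals as sketched here; the object name `arma_full` is ours (the later fixed-coefficient refit, reported as `arma1`, zeroes out insignificant AR terms via the `fixed` argument of `arima()`):

```r
# ARMA(12,2) fit to the OLS residuals, all coefficients free
arma_full <- arima(m2$residuals, order = c(12, 0, 2))
arma_full
```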
Here is the detailed examination of the residuals after fitting the ARMA(12,2) model. None of the four diagnostics gives obvious evidence against the white-noise assumption, so this should be a reasonably good model.
##
## Call:
## arima(x = m2$residuals, order = c(12, 0, 2))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ar8
## -0.0870 0.1387 0.2552 0.0011 0.1669 0.0526 0.0763 -0.0740
## s.e. 0.1035 0.0764 0.0698 0.0668 0.0663 0.0694 0.0683 0.0661
## ar9 ar10 ar11 ar12 ma1 ma2 intercept
## -0.0125 -0.1442 -0.0797 0.5436 0.8276 0.3262 1.3447
## s.e. 0.0685 0.0675 0.0660 0.0661 0.1239 0.0921 2.8838
##
## sigma^2 estimated as 11.93: log likelihood = -508.82, aic = 1049.63
## Warning in arima(m2$residuals, order = c(12, 0, 2), fixed = c(0, 0, NA, :
## some AR parameters were fixed: setting transform.pars = FALSE
The second model we conclude with is \[X_n=93.67 + 0.64\sin({\pi}n /6)-6.75\cos({\pi}n /6) + 0.204\,n + E_n\] where \(E_n\) is an ARMA(12,2) process with the coefficients shown below.
arma1
##
## Call:
## arima(x = m2$residuals, order = c(12, 0, 2), fixed = c(0, 0, NA, 0, NA, 0, 0,
## 0, 0, NA, 0, NA, NA, NA, NA))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ar8 ar9 ar10 ar11
## 0 0 0.2270 0 0.1773 0 0 0 0 -0.1452 0
## s.e. 0 0 0.0585 0 0.0557 0 0 0 0 0.0532 0
## ar12 ma1 ma2 intercept
## 0.5931 0.6728 0.3372 1.5822
## s.e. 0.0553 0.0722 0.0659 2.9183
##
## sigma^2 estimated as 12.76: log likelihood = -515.19, aic = 1046.37
In each of the prediction graphs, the black line is the real Brazilian GDP data for the following 25 months, and the red line is the forecast from the corresponding model, with the confidence interval marked by dashed lines.
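For the SARIMA model, such a graph can be produced with `predict()`; this is a sketch assuming the held-out 25 months are stored in a vector `test` (an object name of ours):

```r
# 25-step-ahead forecast from the fitted SARIMA model
fc <- predict(sarma, n.ahead = 25)   # point forecasts and standard errors
plot(c(training, test), type = "l",
     main = "25-month forecast of Brazilian GDP", ylab = "GDP index")
lines(191:215, fc$pred, col = "red")
# Approximate 95% interval as dashed red lines
lines(191:215, fc$pred + 1.96 * fc$se, col = "red", lty = 2)
lines(191:215, fc$pred - 1.96 * fc$se, col = "red", lty = 2)
```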
Both models fit the data well and forecast quite accurately a short period ahead. However, interpreting the models to obtain real-life insight is not straightforward, which is one of the shortcomings of general ARMA and SARMA models, especially those with many parameters. We obtain numeric quantities describing how previous GDP affects future GDP behavior, but without further development we do not know why the parameters take the values they do; the conclusions stand purely at the level of statistical analysis.
[1] Dataset: http://www2.stat.duke.edu/~mw/data-sets/ts_data/brazil_econ

[2] Robert H. Shumway and David S. Stoffer, "Time Series Analysis and Its Applications: With R Examples", Third Edition.