Introduction

In this project, we analyze a monthly economic time series from the Brazilian economy. The dataset comes from the website [1], which in turn comes from the article [2] by Huerta and Lopes, 1999. The dataset contains monthly IPI (Industrial Production Index) in Brazil from January 1980 to December 1997, 216 months in total. IPI is an economic indicator that measures the real production output of manufacturing, mining, and utilities. The monthly IPI is usually used to reflect short-term changes in industrial production. The growth in IPI from month to month indicates the growth in the industry. [3]

We want to get a basic idea of the changes in Brazilian industry from 1980 to 1997 by studying IPI time series data. Specifically, we will check whether the data has any trends, cycles by making use of specturm analysis and local linear regression. Then we want to fit an appropriate model to the IPI data, which would enable us to forecast future IPI.

Data Exploration

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   72.21   94.86  103.67  104.26  114.20  134.75

The above is a brief summary of IPI over 216 months. We do not see any missing values, so we can proceed to make a time series plot.

From the time series plot above, we observe the seasonality property of IPI data. Visually, there seems to be some evidences for an increasing trend, especially in the later months. However, the evidences are not strong. There are also considerable fluctuations.

The smoothed periodogram above suggests a dominant frequency of 0.08, which corresponds to a cycle of 12 months. This means IPI in Brazil has a cycle of boom and recession every 12 months, or 1 year. We can see from the smoothed periodogram other peeks of frequencies about 0.17, 0.25, 0.35 and 0.42, corresponding to cycles about 6 months, 4 months, 3 months and 2.5 months. However, these peeks are much weaker than the dominant one.

Following notes08 in [4], we try to extract business cycles of IPI data by a band pass filter. For the IPI data, high frequency variation might be treated as “noise” and low frequency variation might be treated as “trend”. A band of mid-range frequencies might be considered to correspond to the business cycle.

From the I_low plot above, we again see some evidences for an increasing trend, especially in the later months, but the trend is not strong. From the I_high plot, we observe some noise with high frequency. From the I_cycles plot, IPI achieves peaks around the middle of each year, specifically, May, June and July; IPI reaches trough around the end of each year, specifically, December, January and February.

Model Selection

ARMA Model

Following notes05 in [4], let’s start by fitting a stationary ARMA\((p,q)\) model under the null hypothesis that there is no trend. This hypothesis, which asserts that nothing has substantially changed in IPI over the 216 months, is not entirely unreasonable from looking at the data.

We seek to fit a stationary Gaussian ARMA\((p,q)\) model with parameter vector \(\theta=({\phi}_{1:p},{\psi}_{1:q},\mu,\sigma^2)\) given by \[ {\phi}(B)(Y_n-\mu) = {\psi}(B) \epsilon_n,\] where \[\begin{eqnarray} \mu &=& {\mathbb{E}}[Y_n] \\ {\phi}(x)&=&1-{\phi}_1 x-\dots -{\phi}_px^p, \\ {\psi}(x)&=&1+{\psi}_1 x+\dots +{\psi}_qx^q, \\ \epsilon_n&\sim&\mathrm{ iid }\, N[0,\sigma^2]. \end{eqnarray}\]

We need to decide where to start in terms of values of \(p\) and \(q\). Akaike’s information criterion (AIC) is often useful when we want to select a model with reasonable predictive skill from a range of possibilities, where AIC is given by \[ AIC = -2 \times {\ell}({\theta^*}) + 2D\] and \(D\) is the number of parameters in the model.

Let’s tabulate some AIC values for a range of different choices of \(p\) and \(q\).

MA0 MA1 MA2 MA3 MA4 MA5
AR0 1730.93 1597.13 1527.51 1496.38 1481.72 1462.47
AR1 1479.45 1481.43 1470.52 1468.98 1467.40 1462.69
AR2 1481.43 1482.16 1463.12 1462.01 1451.91 1450.79
AR3 1469.46 1469.61 1476.92 1461.58 1404.26 1443.14
AR4 1467.74 1467.42 1469.36 1441.60 1463.20 1444.30

ARMA(3,4) model is recommended by the consideration of AIC. Let’s fit it and diagnose whether the model fits reasonably to the data.

We see from the plot above that this model captures major fluctuations of IPI data, but it does not estimate the peaks well. The model tends to overestimate the peaks. The following are some diagnostics.

From the diagnostics plots, residuals seem to have mean zero and constant variance despite the existance of one extreme outlier. The normality approximately holds. However, the ACF plot shows a nonzero correlation that happens at lag 12. The seasonality property of IPI data makes usual ARMA model unsuitable. The lag 12 and seasonality property of IPI data motivates us to consider SARIMA model.

SARIMA Model

Following notes06 in [4], a general SARIMA\((p,d,q)\times(P,D,Q)_{12}\) model for monthly data is \[\begin{eqnarray} {\phi}(B){\Phi}(B^{12}) ((1-B)^{d}(1-B^{12})^{D}Y_n-\mu) = {\psi}(B){\Psi}(B^{12}) \epsilon_n \end{eqnarray}\] where \(\{\epsilon_n\}\) is a white noise process, \(\mu\) is the mean of the differenced data \((1-B)^{d}(1-B^{12})^{D}Y_n\) and \[\begin{eqnarray} {\phi}(x)&=&1-{\phi}_1 x-\dots -{\phi}_px^p, \\ {\psi}(x)&=&1+{\psi}_1 x+\dots +{\psi}_qx^q, \\ {\Phi}(x)&=&1-{\Phi}_1 x-\dots -{\Phi}_Px^P, \\ {\Psi}(x)&=&1+{\Psi}_1 x+\dots +{\Psi}_Qx^Q. \end{eqnarray}\]

Recall that in data exploration phase, we find that the smoothed periodogram suggests a dominant frequency of 0.08, which corresponds to a cycle of 12 months. Also, when we try to fit an ARMA(3,4) model, we discover strong correlation of residuals at lag 12. These two observations together suggest we should difference the data with lag 12. The following is ACF plot of IPI data. The peak at lag 12 confirms that the idea of differencing the data with lag 12 is reasonable.

Another benefit of applying a difference operation to the data is that the differenced data looks more stationary and thus more suitable for SARMA modeling. Therefore we apply \((1-B^{12})\) operator to the IPI data and then fit an ARMA model for the differenced data. This is equivalent to SARIMA\((p,0,q)\times(0,1,0)_{12}\) model. It remains to find the appropriate \(p\) and \(q\). Let’s tabulate some AIC values for a range of different choices of \(p\) and \(q\).

MA0 MA1 MA2 MA3 MA4 MA5
AR0 1484.25 1387.68 1362.10 1333.23 1324.32 1315.99
AR1 1316.89 1312.20 1313.86 1311.88 1313.83 1315.78
AR2 1312.45 1314.07 1306.99 1313.84 1307.45 1317.83
AR3 1313.49 1314.34 1304.00 1303.94 1301.49 1303.16
AR4 1312.18 1313.74 1305.15 1308.00 1297.28 1299.12

SARIMA\((4,0,4)\times(0,1,0)_{12}\) is recommended with AIC is about 1297. However, we observe from the table above that SARIMA\((1,0,0)\times(0,1,0)_{12}\) has AIC value about 1316, which is very close to the AIC value for SARIMA\((4,0,4)\times(0,1,0)_{12}\), but SARIMA\((1,0,0)\times(0,1,0)_{12}\) is a much simplier model. The benefit of a simplier model is that it is more robust to overfitting issue. By considering both AIC and model complexity, we choose to fit a SARIMA\((1,0,0)\times(0,1,0)_{12}\) model. The mathematical form of this model is \[\begin{eqnarray} (1 - \phi_1 Y_n)((1-B^{12})Y_n - \mu) = \epsilon_n \end{eqnarray}\] where \(\{\epsilon_n\}\) is a white noise process and \(\mu\) is the mean of the differenced data \((1-B^{12})Y_n\).

We now use R to compute the parameters.

## 
## Call:
## arima(x = IPI, order = c(1, 0, 0), seasonal = list(order = c(0, 1, 0), period = 12))
## 
## Coefficients:
##          ar1
##       0.7488
## s.e.  0.0457
## 
## sigma^2 estimated as 36.37:  log likelihood = -656.45,  aic = 1316.89
mean(diff(IPI, lag=12))
## [1] 1.029755

The fitted model is \[\begin{eqnarray} (1 - 0.7488 Y_n)((1-B^{12})Y_n - 1.029755) = \epsilon_n \end{eqnarray}\]

We see from the plot above that this model fits the data reasonably well. The model captures most of the fluctuations and does a better job at estimating the peaks than ARMA(3,4) model. However, it still does not fit the peaks quite well. The following are some diagnostics.

From the diagnostics plots, residuals seem to have mean zero and constant variance despite the existance of one or two extreme outliers. The normality approximately holds. However, the ACF plot shows a nonzero correlation that happens at lag 12. This suggests that we still need to tune SARIMA model by introducing seasonal polynomial.

According to notes06 in [4], SARIMA\((0,1,1)\times(0,1,1)_{12}\) model has often been used for forecasting monthly time series in economics and business. Let’s try this model.

## 
## Call:
## arima(x = IPI, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
## 
## Coefficients:
##           ma1     sma1
##       -0.2907  -1.0000
## s.e.   0.0690   0.1252
## 
## sigma^2 estimated as 21.39:  log likelihood = -616.27,  aic = 1238.53

We see from the plot above that this model fits the data very well. The model not only captures most of the fluctuations, but also fits the peaks closely, which is an improvement compared with SARIMA\((1,0,0)\times(0,1,0)_{12}\) model. The following are some diagnostics.

Residuals seem to have mean zero and constant variance despite the existance of one or two extreme outliers. The normality approximately holds. This time the ACF plot does not show any correlation significantly deviating from zero.

Conclusion

In this project, we analyze monthly IPI data in Brazil from 1980 to 1997. We observe seasonality property of the data with dominant period 12 months. This indicates the boom and recession of IPI every 12 months. In extracting business cycles, we see some evidences for increasing trend of IPI, but the evidences are not strong. We find SARIMA\((0,1,1)\times(0,1,1)_{12}\) model fits the data pretty closely, which outperforms ARMA(4,3) model and SARIMA\((1,0,0)\times(0,1,0)_{12}\) model. SARIMA\((1,0,0)\times(0,1,0)_{12}\) model is better than ARMA(4,3) model because it takes seasonality property into account. SARIMA\((0,1,1)\times(0,1,1)_{12}\) model is better than SARIMA\((1,0,0)\times(0,1,0)_{12}\) model because it in addition takes into consideration that the dominant period is 12 months. We now have a better sense of how IPI in Brazil changes. SARIMA\((0,1,1)\times(0,1,1)_{12}\) model can be used to forecast future IPI in Brazil.

However, there are still some remaining problems. First, although SARIMA\((0,1,1)\times(0,1,1)_{12}\) model fits the IPI data pretty well, we still need to keep in mind the potential overfitting issue. Second, it is not obvious how to interpret the model to get some insights in reality.

Reference

[1] http://www2.stat.duke.edu/~mw/ts_data_sets.html.

[2] Huerta, G., Lopes, H. F. 1999. Bayesian forecasting and inference in latent structure for the Brazilian GDP and Industrial Production Index. ftp://ftp.stat.duke.edu/pub/WorkingPapers/99-08.html.

[3] https://fred.stlouisfed.org/series/INDPRO

[4] https://ionides.github.io/531w18/