Background

The data consist of monthly shipments from a company that manufactures pollution equipment (in thousands of French francs), from January 1986 to October 1996. The data come from Makridakis, Wheelwright and Hyndman (1998), Forecasting: Methods and Applications, John Wiley & Sons: New York, Chapter 7. We can see that until December 1989, both the level of shipments and their variability were low. From then on, shipments increased, and so did their month-to-month variation.

The goal is to explore the data and fit a decent model to it. This goal is motivated by the increasing fluctuations in the data over time: it is a chance to learn how to work with data whose variability grows, and whether certain transformations help the modeling process.

##   month   polln
## 1     1 122.640
## 2     2 120.888
## 3     3 164.688
## 4     4 147.168
## 5     5 171.696
## 6     6 228.636

Here we see a glimpse of the data. There are 130 observations. The month variable counts months starting with January 1986, and the polln variable gives that month's shipments in thousands of French francs.

Here we see that there is an increasing trend over time as well as increasing variability. Shipments climb steeply for a few months, drop considerably, then rise to a potentially even higher peak. Around 1994 there was a large decrease from which the series had not recovered by the time the data end. It would be unwise to assume the same pattern continues and that there would be another 'highest' peak in subsequent years.

The smoothed periodogram is in cycles/year, and we see the spectrum values are largest around frequencies of 0 and 4 (a period of 1/4 year, i.e., every three months). Since the polln values increase roughly exponentially with increasing variability, I will log-transform the data and replot.
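
A hedged sketch of how the periodogram and the log transform might be produced (the smoothing spans and the ts setup are my assumptions; polln is the data frame shown above):

polln_ts <- ts(polln$polln, start = c(1986, 1), frequency = 12)
spectrum(polln_ts, spans = c(3, 5, 3))     # smoothed periodogram; frequency axis in cycles/year
polln$polln_log <- log(polln$polln)        # log-transformed shipments
plot(log(polln_ts), ylab = "log shipments")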

Here, we see that a linear model looks more appropriate for the logged data, but there is still a slight downward curve towards the end. This could be for a number of reasons, including consumers improving their own pollution technology, issues with the economy, etc. The smoothed periodogram yields similar frequencies corresponding to the highest spectrum values.

One might difference the data and then fit an \(SARMA(p,q)\times(P,Q)_{\text{season}}\) model, but I want to detrend the data using a function of time to further see how the shipments varied throughout the 10 years over which the data were collected. This can be accomplished by modeling the trend with ARMA errors.


Linear Regression with ARMA(p,q) noise

One strategy to tackle trend is linear regression, modeling the residuals as ARMA errors. This means I will model the monthly shipments, denoted \(Y_{n}\), and the logged monthly shipments, denoted \(Z_{n}\), as

\[ Y_{n} = \mu_{n} + \eta_{n}\] (and analogously for \(Z_{n}\)), where \(\eta_{n}\) is a stationary, causal, invertible ARMA(p,q) process with mean zero and \(\mu_{n}\) has the (initial) linear specification

\[\mu_{n}=\text{Intercept}+\beta_{1}\,\text{Month}+\beta_{2}\,\text{Month}^{2}\]

Let’s see what the fit looks like using lm.
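
A minimal sketch of the fit for the logged series (lm_fit is a hypothetical name; the un-logged fit is analogous, with polln in place of polln_log). The formula matches the call echoed in the summary below:

lm_fit <- lm(polln_log ~ month + I(month^2), data = polln)
plot(polln$month, polln$polln_log, type = "l", xlab = "month", ylab = "log shipments")
lines(polln$month, fitted(lm_fit), col = "red")   # fitted quadratic trend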

The un-logged data is a bit too variable towards the end. It seems much easier to model the logged data, so from now on I will only look at the logged data.

We can also look at the results of fitting the model:

## 
## Call:
## lm(formula = polln_log ~ month + I(month^2), data = polln)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8485 -0.1703 -0.0072  0.1923  0.6818 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.803e+00  8.014e-02  59.927  < 2e-16 ***
## month        3.990e-02  2.824e-03  14.128  < 2e-16 ***
## I(month^2)  -1.052e-04  2.089e-05  -5.039 1.57e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2999 on 127 degrees of freedom
## Multiple R-squared:  0.9176, Adjusted R-squared:  0.9163 
## F-statistic: 706.7 on 2 and 127 DF,  p-value: < 2.2e-16

We see that the quadratic coefficient is evaluated as significant, yet it is very small in magnitude. However, it adds a nice curve to the fitted values, which appears to fit the data well. Is it worth keeping?

To check this, I will first model the errors as ARMA noise. This will give me one signal-plus-ARMA-noise model that includes \(Month^{2}\) and one that does not. Then I can use hypothesis testing to see whether it is worth keeping.

Now, let's look at the residuals of the model before modeling the errors as ARMA noise.
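
Along these lines (lag.max and the smoothing spans are my assumptions; here the periodogram frequency axis is in cycles/month, since r is a plain vector):

r <- resid(lm_fit)
plot(polln$month, r, type = "l", ylab = "regression residuals")
acf(r, lag.max = 50)               # sample ACF of the residuals
spectrum(r, spans = c(3, 5, 3))    # smoothed periodogram of the residuals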

From looking at the ACF plot, we see that the largest value is at lag 3.

Frequency values with correspondingly high spectrum are approximately 0.015 and 0.33, which in cycles per year would be 0.2 and 4, corresponding to periods of 5 years and 0.25 years (3 months) respectively. However, our data only cover 10 years, so I don't feel confident examining anything with a period of 5 years when the data barely cover 2 periods. Thus, I will look at the AIC tables with no seasonal component and with a 3-month component. I decided on a seasonality of 3 months instead of 12 months because most \(ARMA(p,q)\times(P,Q)_{12}\) fits with nice AIC values had polynomial roots that were not outside the unit circle.
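
The four AIC tables below can be generated with a helper along these lines (aic_table is a hypothetical name; Z is assumed to be built as shown, with intercept, month, and month² columns; the seasonal and xreg arguments vary per table):

Z <- cbind(1, polln$month, polln$month^2)   # intercept, month, month^2
aic_table <- function(data, P, Q, xreg = NULL,
                      seasonal = list(order = c(0, 0, 0))) {
  tab <- matrix(NA, P + 1, Q + 1,
                dimnames = list(paste0("AR", 0:P), paste0("MA", 0:Q)))
  for (p in 0:P) {
    for (q in 0:Q) {
      fit <- try(arima(data, order = c(p, 0, q), seasonal = seasonal,
                       xreg = xreg), silent = TRUE)
      if (!inherits(fit, "try-error")) tab[p + 1, q + 1] <- fit$aic   # NA = failed fit
    }
  }
  tab
}
aic_table(polln$polln_log, 4, 4, xreg = Z[, 2],
          seasonal = list(order = c(1, 0, 0), period = 3))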

AIC, \(ARMA(p,q)\) errors, no \(Month^{2}\) regressor:

|     | MA0   | MA1   | MA2   | MA3    | MA4   |
|-----|-------|-------|-------|--------|-------|
| AR0 | 56.78 | 49.09 | 48.84 | 25.92  | 27.91 |
| AR1 | 49.24 | 44.06 | 45.90 | 27.91  | 29.89 |
| AR2 | 49.80 | 39.50 | 34.24 | 27.49  | 28.44 |
| AR3 | 14.33 | 14.21 | NA    | -18.03 | NA    |
| AR4 | 14.31 | 16.12 | NA    | NA     | NA    |

AIC, \(ARMA(p,q)\) errors, with \(Month^{2}\) regressor:

|     | MA0   | MA1   | MA2    | MA3    | MA4    |
|-----|-------|-------|--------|--------|--------|
| AR0 | 58.78 | 51.08 | 50.83  | 27.88  | 29.87  |
| AR1 | 51.23 | 46.05 | 47.89  | 29.87  | 31.86  |
| AR2 | 51.79 | 41.21 | 35.97  | 29.45  | 30.40  |
| AR3 | 15.96 | 15.95 | -11.14 | -16.20 | -15.12 |
| AR4 | 16.08 | 17.88 | NA     | NA     | NA     |

AIC, \(SARMA(p,q)\times(1,0)_{3}\) errors, no \(Month^{2}\) regressor:

|     | MA0   | MA1   | MA2   | MA3   | MA4   |
|-----|-------|-------|-------|-------|-------|
| AR0 | 17.40 | 10.38 | 11.83 | NA    | NA    |
| AR1 | 10.46 | 12.13 | 5.82  | NA    | -9.15 |
| AR2 | 11.80 | 13.07 | 2.96  | NA    | NA    |
| AR3 | 7.65  | 8.97  | -6.86 | NA    | NA    |
| AR4 | 8.48  | 9.31  | -9.37 | -15.7 | -5.53 |

AIC, \(SARMA(p,q)\times(1,0)_{3}\) errors, with \(Month^{2}\) regressor:

|     | MA0   | MA1   | MA2   | MA3    | MA4   |
|-----|-------|-------|-------|--------|-------|
| AR0 | 19.08 | 12.14 | 13.62 | NA     | NA    |
| AR1 | 12.24 | 13.90 | 7.76  | NA     | NA    |
| AR2 | 13.56 | 14.81 | 4.65  | NA     | NA    |
| AR3 | 9.31  | 10.62 | -5.18 | NA     | NA    |
| AR4 | 10.12 | 10.88 | -7.71 | -13.89 | -3.92 |

Generally, we pick the model with the lowest AIC value. Looking at the tables, the best models would be \(ARMA(3,3)\) with no seasonal component and \(SARMA(4,3)\times(1,0)_{3}\) with a 3-month seasonal component. However, there are two things to note: several entries in the tables are NA because those fits failed to converge, which hints at numerical instability for the larger models, and AIC is only a rough guide, so a slightly lower AIC does not justify several extra parameters.

Thus, I will look for a simple model with a reasonable AIC value. The choice for both the seasonal and the nonseasonal specification is ARMA(3,0). Because the AIC values are lower, I will also move on with the seasonal model. Thus, the question now is whether to include \(Month^{2}\) when modeling the errors as \(SARMA(3,0)\times(1,0)_{3}\).

Hypothesis testing

Let's first revisit each model:
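
Reconstructed from the calls echoed in the output below (h1 includes the \(Month^{2}\) column of Z, h0 does not; the names match the likelihood ratio code further down):

h1 <- arima(polln$polln_log, order = c(3, 0, 0),
            seasonal = list(order = c(1, 0, 0), period = 3), xreg = Z[, 2:3])  # with Month^2
h0 <- arima(polln$polln_log, order = c(3, 0, 0),
            seasonal = list(order = c(1, 0, 0), period = 3), xreg = Z[, 2])    # Month only
h1
h0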

## 
## Call:
## arima(x = polln$polln_log, order = c(3, 0, 0), seasonal = list(order = c(1, 
##     0, 0), period = 3), xreg = Z[, 2:3])
## 
## Coefficients:
##          ar1      ar2      ar3    sar1  intercept   xreg1   xreg2
##       0.2402  -0.0415  -0.2991  0.7315     4.7854  0.0422  -1e-04
## s.e.  0.0840   0.0861   0.0973  0.0386     0.1551  0.0030     NaN
## 
## sigma^2 estimated as 0.05584:  log likelihood = 2.34,  aic = 9.31
## 
## Call:
## arima(x = polln$polln_log, order = c(3, 0, 0), seasonal = list(order = c(1, 
##     0, 0), period = 3), xreg = Z[, 2])
## 
## Coefficients:
##          ar1      ar2      ar3    sar1  intercept   xreg
##       0.2779  -0.0169  -0.2975  0.7697     5.1275  0.025
## s.e.  0.0860   0.0881   0.1015  0.0723     0.1527  0.002
## 
## sigma^2 estimated as 0.0592:  log likelihood = -1.64,  aic = 15.27

We see that there is an issue calculating the standard error for \(Month^{2}\); the result is \(NaN\). This suggests a simpler model might be preferred. Also, judging by the size of its coefficient, this variable isn't doing much to help model the data. Just to be thorough, we will conduct a likelihood ratio test to assess the significance of the \(Month^{2}\) term. We have two hypotheses:

\[ \begin{eqnarray} H^{\langle 0\rangle} &:& \theta\in \Theta^{\langle 0\rangle} \\ H^{\langle 1\rangle} &:& \theta\in \Theta^{\langle 1\rangle} \end{eqnarray} \] defined via two nested parameter subspaces, \(\Theta^{\langle 0\rangle}\subset \Theta^{\langle 1\rangle}\), with respective dimensions \(D^{\langle 0\rangle}< D^{\langle 1\rangle}\le D\).

We consider the log likelihood maximized over each of the hypotheses, \[ \begin{eqnarray} \ell^{\langle 0\rangle} &=& \sup_{\theta\in \Theta^{\langle 0\rangle}} \ell(\theta), \\ \ell^{\langle 1\rangle} &=& \sup_{\theta\in \Theta^{\langle 1\rangle}} \ell(\theta). \end{eqnarray} \] By Wilks' approximation, we have \(\ell^{\langle 1\rangle} - \ell^{\langle 0\rangle} \approx (1/2) \chi^2_{D^{\langle 1\rangle}- D^{\langle 0\rangle}}\) under the null hypothesis. The difference in dimension is 1, since the full model contains \(Month^{2}\) and the null model does not.

log_lik_ratio <- as.numeric(logLik(h1) - logLik(h0))   # h1 includes Month^2, h0 does not
pval <- 1 - pchisq(2 * log_lik_ratio, df = 1)          # Wilks: 2*(l1 - l0) ~ chi^2_1 under H0
pval
## [1] 0.004772354

We see that the p-value is 0.00477, well below the 0.05 threshold, indicating that the null hypothesis \(H^{\langle 0\rangle}\) should be rejected and that the \(Month^{2}\) term is statistically significant. However, there is still the identifiability issue: this is a case where the p-value falls below the threshold, yet one might still prefer the model under the null hypothesis for reasons the p-value doesn't capture. Thus, due to the missing standard error, the extremely small coefficient, and the fact that \(Month^{2}\) mainly matters for the last 10 observations of the data, the final model will be

\[ Z_{n} = \text{Intercept}+\beta_{1}\,\text{Month}+ \eta_{n} \] where \(\eta_{n}\) is modeled as \(SARMA(3,0)\times(1,0)_{3}\).

Diagnostics

Now we’ll look at diagnostics for the ARMA model fitted to our residuals:
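
A sketch of the checks that follow, shown for the chosen seasonal fit h0 (the nonseasonal ARMA(3,0) fit is handled identically; the smoothing spans are my assumptions):

res <- resid(h0)
plot(res, type = "l", ylab = "SARMA residuals")    # residuals over time
acf(as.numeric(res), lag.max = 50)                 # ACF out to 50 lags
spectrum(as.numeric(res), spans = c(3, 5, 3))      # smoothed periodogram of the residuals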

With the residuals plotted, our first impression is that there appears to be slight heteroskedasticity.

Now we check the ACF plots.

Of the 50 lags shown, four fall outside the dashed lines for the nonseasonal model and only one for the seasonal model. Also, there is still a slight pattern in the ACF plot. Thus, a couple of things prevent one from exclaiming "the residuals follow the null hypothesis of Gaussian white noise!" I didn't explore a slightly more complex model that might fit better, which is why there may be violations at certain lags and/or patterns in the plot. However, this suffices for me (and, I hope, for the reader), and it makes me more confident in the SARMA model than in the ARMA(3,0) model.

A good sign is that there are no dominant cycles in the periodogram of the residuals of the seasonal model. This gives evidence of IID errors for both models. All things considered, the \(SARMA(3,0)\times(1,0)_{3}\) model seems nicer, in that its residuals more closely resemble Gaussian white noise. I will stick with this model!

We should also check the roots of the fitted AR polynomials.
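
A standard check looks like the following (exactly which computation produced the four printed moduli is an assumption; here, the three nonseasonal AR root moduli plus the common modulus of the seasonal AR roots):

ar_roots  <- polyroot(c(1, -coef(h0)[c("ar1", "ar2", "ar3")]))   # roots of 1 - ar1*B - ar2*B^2 - ar3*B^3
sar_roots <- polyroot(c(1, 0, 0, -coef(h0)["sar1"]))             # roots of 1 - sar1*B^3
c(Mod(ar_roots), Mod(sar_roots)[1])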

## [1] 1.059415 1.057127 1.059415 1.095020

Our roots all lie outside the unit circle, suggesting we have a stationary causal fitted ARMA.

The ar2 coefficient, however, leaves much to be desired: its standard error is larger in magnitude than the coefficient itself. I will check whether the approximate confidence interval constructed using the profile likelihood agrees with the one constructed using the observed Fisher information.
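
A sketch of the profile likelihood computation (the grid bounds and resolution are my assumptions; ar2 is fixed at each grid value while the remaining parameters are re-optimized):

ar2_grid <- seq(-0.4, 0.3, length.out = 100)
prof_ll <- sapply(ar2_grid, function(a2) {
  fit <- try(arima(polln$polln_log, order = c(3, 0, 0),
                   seasonal = list(order = c(1, 0, 0), period = 3),
                   xreg = Z[, 2], transform.pars = FALSE,
                   fixed = c(NA, a2, NA, NA, NA, NA)),   # order: ar1-ar3, sar1, intercept, xreg
             silent = TRUE)
  if (inherits(fit, "try-error")) NA else fit$loglik
})
cutoff <- max(prof_ll, na.rm = TRUE) - qchisq(0.95, df = 1) / 2
range(ar2_grid[which(prof_ll > cutoff)])   # approximate 95% profile CI for ar2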

## [1] "the profile CI is the interval ( -0.189 , 0.05 )"
## [1] "the Fisher CI is the interval ( -0.19 , 0.156 )"

Several errors occur in the optim function along the way, showing that there may be some numerical issues. However, both CIs contain 0 and share a similar left endpoint, with the profile CI's right endpoint being smaller. To double-check with independent software, we can also use lmtest::coefci, a generic function that computes Wald confidence intervals.
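
For the chosen fit (assuming the lmtest package is installed):

lmtest::coefci(h0)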

##                 2.5 %      97.5 %
## ar1        0.10934133  0.44649036
## ar2       -0.18968794  0.15581497
## ar3       -0.49650558 -0.09855425
## sar1       0.62803872  0.91135112
## intercept  4.82827043  5.42674631
## xreg       0.02109337  0.02887182

Discussion and Conclusion

In the raw data, both the trend and the fluctuations in the monthly shipments of pollution equipment increased over time, with a slight decrease over the last several months. I could not find any historical context that might help explain the pattern in the data.

Taking the log of the data helped deal with the fluctuations. To deal with the trend, multiple trend functions were specified and the errors were treated as ARMA noise. After determining how to model the ARMA noise for the different mean functions, the mean functions were compared and a simple model was selected:

\[(1 - 0.2779B + 0.0169B^{2} + 0.2975B^{3})(1 - 0.7697B^{3})\left(\log(Y_{n}) - 5.1275 - 0.025\,\text{Month}\right) = \epsilon_{n}\]

Thus, we’ve found that a reasonable model for monthly shipments of pollution equipment is a linear model with ARMA errors after performing a log transform on the monthly shipments.


References

Makridakis, S., Wheelwright, S. C. and Hyndman, R. J. (1998). Forecasting: Methods and Applications. John Wiley & Sons: New York.