In this project, I checked whether the air passenger data is seasonal, which can help with developing future business strategies for companies serving the tourism industry. Next, through further data analysis I tested several ARMA, ARIMA, and SARIMA models and determined the best fit. Last, I predicted a full cycle of data with the final SARIMA model.
The dataset was obtained from Kaggle and contains monthly records of the number of air passengers from 1949 to 1960.
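A minimal sketch of loading the data, assuming the Kaggle CSV is saved as AirPassengers.csv with columns Month and passengers (the file name is our assumption):

data <- read.csv("AirPassengers.csv")  # monthly air passenger counts, 1949-1960
str(data)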
## 'data.frame': 144 obs. of 2 variables:
## $ Month : Factor w/ 144 levels "1949-01","1949-02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ passengers: int 112 118 132 129 121 135 148 148 136 119 ...
First, we plotted the time series of passengers over the 144 successive months from 1949 to 1960. There appears to be seasonal variation every year; specifically, a clear peak appears around the middle of each year.
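This plot can be produced along the following lines, assuming the data frame loaded above (the object name passengers_ts is ours):

passengers_ts <- ts(data$passengers, start = c(1949, 1), frequency = 12)  # monthly series
plot(passengers_ts, ylab = "Passengers", main = "Monthly air passengers, 1949-1960")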
Usually, a time series can be decomposed into three components: a trend component, a seasonal component, and a random component.
A simple, additive decomposition model for a time series can be written as: \[ Y_t = m_t+s_t+e_t \] where \(m_t\) is the trend, \(s_t\) is the seasonality, and \(e_t\) is the random component.
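A minimal sketch of the corresponding decomposition, assuming the passengers_ts object from above:

decomposition <- decompose(passengers_ts)  # additive decomposition by default
plot(decomposition)  # panels: observed, trend, seasonal, random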
From the decomposition plot, we can observe a clear increasing trend, which confirms the pattern detected in the previous exploration. Next, there is obvious seasonality. Last but not least, the variation of the random component is high in the early and late years, indicating non-stationary variance.
Next, we tried a log transformation and differencing on the original data to see whether the non-stationary patterns could be effectively adjusted.
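A sketch of the transformation, using the object name log_passengers that also appears in the model fits below:

log_passengers <- log(passengers_ts)  # stabilize the growing variance
plot(diff(log_passengers), ylab = "Differenced log passengers")  # remove the trend with a first difference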
From the plot above, we can see that after the log transformation and first difference, the series is much closer to stationarity.
Periodic behavior is common in time series. However, it can sometimes be more complex than we assume it to be. Spectral analysis is a technique commonly used to detect underlying periodicities.
Let us first try the unsmoothed periodogram. It is an inconsistent estimator of the spectral density, meaning its estimation error does not vanish as \(N\) gets large.
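For example, treating the log series as a plain numeric vector so that frequency is measured in cycles per month (an assumption about the original analysis):

spectrum(as.numeric(log_passengers), main = "Unsmoothed periodogram")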
Usually, a smoothed spectrum is far more interpretable. Here, we first try the default smoothing method.
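One way to obtain a smoothed periodogram; the spans values here are our assumption rather than a documented default:

smoothed <- spectrum(as.numeric(log_passengers), spans = c(3, 5, 3), main = "Smoothed periodogram")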
We can observe that the most powerful peak appears just below a frequency of 0.1.
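The dominant peak can then be read off programmatically:

peak <- smoothed$freq[which.max(smoothed$spec)]  # frequency of the highest peak
c(frequency = peak, period = 1 / peak)           # period in months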
## frequency period
## peak 0.08333333 12.00000000
Since frequency is measured in cycles per unit time and, by definition, the period of an oscillation is the time for one cycle, the dominant period is 12 months.
By fitting AR(1) models with and without a linear trend, we can double-check whether there is a trend.
\[ (1-\phi_1 B)(Y_n-\mu-\beta t_n) = \epsilon_n \] where \(\{\epsilon_n\}\) is Gaussian white noise with variance \(\sigma^2\). The null hypothesis is \(H_{0}:\beta = 0\) and the alternative hypothesis is \(H_1:\beta \neq 0\).
fit0 <- arima(log_passengers,order=c(1,0,0))
fit0
##
## Call:
## arima(x = log_passengers, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.9781 5.4854
## s.e. 0.0171 0.3181
##
## sigma^2 estimated as 0.01127: log likelihood = 117.07, aic = -228.13
fit1 <- arima(log_passengers,order=c(1,0,0),xreg=seq(1,144))
fit1
##
## Call:
## arima(x = log_passengers, order = c(1, 0, 0), xreg = seq(1, 144))
##
## Coefficients:
## ar1 intercept seq(1, 144)
## 0.7052 4.8130 1e-02
## s.e. 0.0591 0.0535 6e-04
##
## sigma^2 estimated as 0.009619: log likelihood = 129.69, aic = -251.39
Based on the likelihood ratio test, the difference in log likelihoods is 129.69 - 117.07 = 12.62, which exceeds 1.92, the cutoff for a test of size 5% (half the 3.84 cutoff of the \(\chi^2_1\) distribution). Therefore, we reject the null hypothesis; that is, we conclude \(\beta \neq 0\).
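The same comparison can be made explicit via Wilks' approximation, under which twice the log likelihood difference is approximately \(\chi^2_1\) distributed; a minimal check:

lrt <- 2 * (129.69 - 117.07)  # likelihood ratio statistic = 25.24
1 - pchisq(lrt, df = 1)       # p-value, far below 0.05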
|     | MA0     | MA1     | MA2     | MA3     | MA4     | MA5     |
|-----|---------|---------|---------|---------|---------|---------|
| AR0 | -231.56 | -237.51 | -240.38 | -257.89 | -277.39 | -279.60 |
| AR1 | -235.39 | -241.61 | -265.19 | -263.23 | -275.47 | -287.27 |
| AR2 | -237.60 | -270.15 | -263.26 | -264.04 | -286.80 | -287.40 |
| AR3 | -236.95 | -270.09 | -270.16 | -289.48 | -287.66 | -290.38 |
| AR4 | -250.73 | -248.75 | -276.93 | -304.15 | -295.47 | -307.13 |
| AR5 | -248.75 | -272.79 | -275.23 | -302.50 | -296.23 | -310.72 |
Here, we compared the AIC values for various choices of AR and MA orders on the differenced time series. Although the largest models attain the smallest AIC values in the table, such gains often come with numerical instability and a risk of overfitting, so we focus on the simplest models: among ARIMA(0,1,0), ARIMA(1,1,0), and ARIMA(0,1,1), the last has the lowest AIC. Therefore, we have three candidates.
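A sketch of how such an AIC table can be generated (the helper name aic_table is ours, and the exact call behind the table above may differ):

aic_table <- function(data, P, Q) {
  table <- matrix(NA, P + 1, Q + 1)
  for (p in 0:P) {
    for (q in 0:Q) {
      # d = 1 fixes the non-seasonal first difference for every fit
      table[p + 1, q + 1] <- arima(data, order = c(p, 1, q))$aic
    }
  }
  dimnames(table) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  table
}
aic_table(log_passengers, 5, 5)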
fit3 <- arima(log_passengers,order=c(0,1,0))
fit3
##
## Call:
## arima(x = log_passengers, order = c(0, 1, 0))
##
##
## sigma^2 estimated as 0.01136: log likelihood = 117.22, aic = -232.44
fit4 <- arima(log_passengers,order=c(1,1,0))
fit4
##
## Call:
## arima(x = log_passengers, order = c(1, 1, 0))
##
## Coefficients:
## ar1
## 0.2054
## s.e. 0.0818
##
## sigma^2 estimated as 0.01088: log likelihood = 120.3, aic = -236.6
fit5 <- arima(log_passengers,order=c(0,1,1))
fit5
##
## Call:
## arima(x = log_passengers, order = c(0, 1, 1))
##
## Coefficients:
## ma1
## 0.2768
## s.e. 0.0944
##
## sigma^2 estimated as 0.01072: log likelihood = 121.36, aic = -238.73
The residual plots of all three ARIMA models share similar patterns. However, in the ACF plots, the significant spike at lag 1 suggests a non-seasonal MA(1) component, and the significant spike at lag 12 suggests a seasonal MA(1) component. Moreover, the residual ACF values of ARIMA(0,1,1) are the closest to 0, verifying that the MA(1) component is a good choice.
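Diagnostics of this kind can be reproduced along the following lines (fit5 is the ARIMA(0,1,1) fit above):

plot(residuals(fit5), ylab = "ARIMA(0,1,1) residuals")        # residual time plot
acf(residuals(fit5), main = "ACF of ARIMA(0,1,1) residuals")  # residual ACF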
A seasonal ARIMA model is formed by including additional seasonal terms \((P,D,Q)_m\), where \(m\) is the number of observations per year (here \(m = 12\), for monthly data).
Considering the seasonality, we fit a SARIMA(0,1,1)(0,1,1)\(_{12}\) model. Since both trend and seasonality are present, we implemented both a non-seasonal first difference and a seasonal difference, i.e. d = D = 1. Additionally, since the span of the periodic seasonal behavior is one year, we specify the period as 12 (months per year).
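The fit reproduces the call shown in the output below; the object name fit6 is ours, continuing the numbering above.

fit6 <- arima(log_passengers, order = c(0, 1, 1),
              seasonal = list(order = c(0, 1, 1), period = 12))
fit6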
##
## Call:
## arima(x = log_passengers, order = c(0, 1, 1), seasonal = list(order = c(0, 1,
## 1), period = 12))
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001348: log likelihood = 244.7, aic = -483.4
Generally speaking, there are no obvious trends in the residual plot, although the scale of the residuals varies over time. Also, the ACF plot of the SARIMA(0,1,1)(0,1,1)\(_{12}\) residuals shows no significant autocorrelation: all spikes are within the 95% confidence boundaries. In conclusion, SARIMA(0,1,1)(0,1,1)\(_{12}\) should be a good fit.
To check for invalid model assumptions, we can examine the normality of the series with a QQ-plot.
Looking at the QQ-plot, we can see that the majority of points lie close to the line, but it is noteworthy that the points on the left form a long tail. Therefore, we double-check with the Shapiro-Wilk test, which tests \(H_0: Y_{1:N}\sim N(\mu,\sigma^2)\) against \(H_1: Y_{1:N}\) does not follow \(N(\mu,\sigma^2)\).
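These checks can be reproduced as follows, applied to the raw counts as in the output below:

qqnorm(data$passengers)   # QQ-plot against normal quantiles
qqline(data$passengers)   # reference line
shapiro.test(data$passengers)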
##
## Shapiro-Wilk normality test
##
## data: data$passengers
## W = 0.95196, p-value = 6.832e-05
From the results above, we reject \(H_0\) because the p-value is very small. In other words, the raw passenger counts deviate from normality, which is consistent with the long tail observed in the QQ-plot.
Here, let us use the SARIMA model to predict a full cycle of future air passenger numbers. We can use the forecast package in R to achieve this, specifying the forecast horizon as h = 12 months.
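A minimal sketch using the forecast package (fit6 is our name for the SARIMA fit above):

library(forecast)
fc <- forecast(fit6, h = 12, level = 95)  # 12-month horizon with a 95% prediction interval
plot(fc)  # point forecasts in blue, prediction interval shaded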
The blue line attached at the end represents the forecasts, with the 95% prediction interval shown as the grey shaded area. It is clear that the predicted cycle follows the same trend and seasonal pattern as the observed records.
For the non-stationary monthly air passenger data, we fit a SARIMA(0,1,1)(0,1,1)\(_{12}\) model, which can be written as: \[ \phi(B)\Phi(B^{12})\left((1-B)^d(1-B^{12})^D Y_n-\mu\right) = \psi(B)\Psi(B^{12})\epsilon_n \] With \(d = D = 1\) and no AR terms (\(\phi(B) = \Phi(B^{12}) = 1\)), the fitted model is \[ (1-B)(1-B^{12})Y_n-\mu = (1-0.40B)(1-0.56B^{12})\epsilon_n \] where \(\epsilon_n\) is a white noise process.