Population is always an attractive topic in time series analysis. An accurate population count helps the government make better decisions, and the population trend reflects, to some extent, the development of the whole country. In particular, people care about sustainable development, and many countries suffer from an ageing population. When population grows rapidly, resources may become insufficient in the future; on the contrary, a slow or even negative growth rate may lead to an ageing population. The United States was the first modern country to take a regular census. In this project, I analyze the trend of the United States population over the past 70 years. The result may help further analyses such as the fertility rate and the age structure of the society.
The data are collected by the U.S. Census Bureau \(^{[1]}\) from January 1952 to December 2019, recorded once a month, and the whole dataset can be downloaded from Kaggle \(^{[2]}\). An overview of the data is shown below.
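Throughout the report, `value` denotes the monthly population series. A minimal sketch of the data loading, assuming the Kaggle CSV is named "POP.csv" with columns `date` and `value` (hypothetical names):

data <- read.csv("POP.csv")  # downloaded from Kaggle
value <- data$value          # monthly US population, Jan 1952 - Dec 2019
summary(value)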
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 156309 201725 239558 243848 289364 330310
plot.ts(value, xlab = "year", ylab = "population", main = "Population of United States")
Obviously, the population increases throughout the whole period. In this case, we are more interested in the monthly increase of population (referred to below as the newly-born population). Therefore, a first-order difference transformation is meaningful. Below is the plot of the transformed data.
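A minimal sketch of this transformation, where `inc` denotes the differenced series used in the rest of the analysis:

inc <- diff(value)  # month-over-month population change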
plot.ts(inc, xlab = "year", ylab = "newly-born population", main = "Newly-born Population of United States")
From the plot above, we can easily see an outlier in April 2010: it is the only month in which the population decreases. There are several ways to handle the outlier; here I replace it with the mean of the values in the previous and the following month.
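A sketch of this replacement, assuming April 2010 is the only month with a negative increment:

i <- which(inc < 0)                       # the single month with a population decrease
inc[i] <- (inc[i - 1] + inc[i + 1]) / 2   # mean of the previous and following months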
One month is too short a unit to observe the overall trend of population. Thus, in the following analysis, I set the frequency of the time series to 12, so that the unit of time is one year. We first take a look at the estimated spectrum of the original data, with and without smoothing.
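A sketch of this re-indexing, assuming `value` starts in January 1952 (so `inc` starts in February 1952):

value <- ts(value, start = c(1952, 1), frequency = 12)
inc <- ts(inc, start = c(1952, 2), frequency = 12)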
spectrum(value)
spectrum(value, spans = c(3,5,3))
The spectrum density plot suggests a possible seasonality with a period of one year, but the local peak is not large enough to confirm it. Therefore, we look at the spectrum density plot of the first-order-differenced data.
spectrum(inc)
spectrum(inc, spans = c(3,5,3))
In this plot, the seasonality can be observed more clearly. When looking for a model to fit this time series, I will therefore prefer a SARIMA model with period equal to 12.
unseason = decompose(inc)
xadjust = inc - unseason$seasonal
spectrum(ts.union(inc, xadjust), spans = c(3,5,3), main = "Original Data and Unseasonal Data")
To confirm the observation above, we can remove the seasonal component estimated by decompose() from the differenced data. We can see that the seasonal adjustment removes most of the signal at the seasonal frequencies and little elsewhere.
We then take a look at the ACF (autocorrelation function). The first plot is for the original data (total population). The correlations are very high and decay very slowly, which shows that the first-order difference is necessary.
acf(value)
The second plot is for the first-order-differenced data. The correlations are still above the 95% confidence bounds and show a seasonality with period 12 (one year).
acf(inc)
The last plot is for the seasonally adjusted data. The correlations are still too high and decay slowly. When selecting the model, we should therefore also consider a first-order difference on the seasonal part.
acf(xadjust)
In this part, I try to find the best model. From the analysis above, the time series is nonstationary monthly data. Therefore, I select the SARIMA\((p,d,q) \times (P,D,Q)_{12}\) model. The general form can be written as
\[\phi(B) \Phi(B^{12}) ((1-B)^d (1 - B^{12})^D Y_n - \mu) = \psi(B) \Psi(B^{12}) \epsilon_n\]
where \(\{\epsilon_n\}\) is a white noise process and the intercept \(\mu\) is the mean of the differenced process \(\{(1-B)^d(1-B^{12})^D Y_n\}\), with
\[\phi(x) = 1 - \phi_1x - \cdots - \phi_px^p \] \[\psi(x) = 1 + \psi_1x + \cdots + \psi_qx^q \] \[\Phi(x) = 1 - \Phi_1x - \cdots - \Phi_Px^P \]
\[\Psi(x) = 1 + \Psi_1x + \cdots + \Psi_Qx^Q \]
First of all, I try the SARIMA\((p,1,q) \times (0,1,1)_{12}\) model, which is frequently used for forecasting monthly time series in economics and business \(^{[4]}\), and use the AIC for model comparison.
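The table below can be generated by a grid search over p and q; here is a sketch, where Arima() comes from the forecast package (note that some fits may fail to converge):

aic_table <- function(data, P, Q, seasonal) {
  table <- matrix(NA, P + 1, Q + 1)
  for (p in 0:P) {
    for (q in 0:Q) {
      # AIC of SARIMA(p,1,q) x (seasonal)_12 fitted to the data
      table[p + 1, q + 1] <- Arima(data, order = c(p, 1, q),
                                   seasonal = list(order = seasonal, period = 12))$aic
    }
  }
  dimnames(table) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  table
}
aic_table(value, 4, 5, c(0, 1, 1))  # seasonal order c(1,1,1) gives the second table below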
|     | MA0     | MA1     | MA2     | MA3     | MA4     | MA5     |
|-----|---------|---------|---------|---------|---------|---------|
| AR0 | 7426.41 | 7321.48 | 7222.77 | 7161.16 | 7119.54 | 7083.54 |
| AR1 | 7176.53 | 6894.31 | 6890.19 | 6891.84 | 6893.78 | 6895.39 |
| AR2 | 6985.95 | 6889.94 | 6891.92 | 6894.17 | 6895.64 | 6897.61 |
| AR3 | 6926.20 | 6891.91 | 6893.93 | 6895.92 | 6896.24 | 6897.17 |
| AR4 | 6910.11 | 6893.74 | 6895.67 | 6896.96 | 6899.12 | 6899.16 |
When p = 2 and q = 1 we obtain the lowest AIC value, so we fit the SARIMA\((2,1,1) \times (0,1,1)_{12}\) model and look at the ACF of its residuals.
library(forecast)  # Arima() is provided by the forecast package
model = Arima(value, order=c(2,1,1), seasonal=list(order=c(0,1,1),period=12))
summary(model)
## Series: value
## ARIMA(2,1,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sma1
## 0.8574 0.1239 -0.6508 -0.8501
## s.e. 0.0516 0.0487 0.0402 0.0247
##
## sigma^2 estimated as 693.1: log likelihood=-3439.97
## AIC=6889.94 AICc=6890.02 BIC=6913.38
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.7260941 26.05127 10.43987 -0.0003654664 0.004704941
## MASE ACF1
## Training set 0.004070084 0.09709279
acf(model$resid)
The correlation at the first lag is still slightly above the dashed confidence bound, which suggests adding an AR term to the seasonal part.
We check whether the model is causal and invertible by computing the roots of \(\phi(x)\) and \(\psi(x)\) and verifying that they lie outside the unit circle.
rootAR <- polyroot(c(1,-coef(model)[c("ar1","ar2")]))
rootAR
## [1] 1.016837+0i -7.936378-0i
rootMA <- polyroot(c(1,-coef(model)[c("ma1")]))
rootMA
## [1] -1.536541+0i
Though all the roots are outside the unit circle, the AR root corresponding to ar1 is very close to 1, which may cause problems with causality.
Then I also try the SARIMA\((p,1,q) \times (1,1,1)_{12}\) model, again using the AIC for evaluation (the same grid search as above, with seasonal order \((1,1,1)\)).
|     | MA0     | MA1     | MA2     | MA3     | MA4     | MA5     |
|-----|---------|---------|---------|---------|---------|---------|
| AR0 | 7423.04 | 7317.83 | 7220.11 | 7158.58 | 7117.69 | 7081.59 |
| AR1 | 7175.89 | 6895.24 | 6891.44 | 6892.99 | 6894.90 | 6896.56 |
| AR2 | 6986.30 | 6891.16 | 6893.12 | 6895.44 | 6896.20 | 6898.19 |
| AR3 | 6927.39 | 6893.09 | 6895.14 | 6896.69 | 6898.66 | 6899.06 |
| AR4 | 6911.56 | 6894.87 | 6896.80 | 6898.70 | 6900.62 | 6902.10 |
According to the AIC table, I pick SARIMA\((2,1,1) \times (1,1,1)_{12}\) and SARIMA\((1,1,2) \times (1,1,1)_{12}\) because their AIC values are relatively close. I fit these two models and look at the ACF of their residuals.
model1 = Arima(value, order=c(2,1,1), seasonal=list(order=c(1,1,1),period=12))
summary(model1)
## Series: value
## ARIMA(2,1,1)(1,1,1)[12]
##
## Coefficients:
## ar1 ar2 ma1 sar1 sma1
## 0.8598 0.1217 -0.6527 0.0398 -0.8642
## s.e. 0.0519 0.0490 0.0405 0.0449 0.0279
##
## sigma^2 estimated as 693.6: log likelihood=-3439.58
## AIC=6891.16 AICc=6891.26 BIC=6919.29
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.7330862 26.04387 10.40612 -0.0003691297 0.004693772
## MASE ACF1
## Training set 0.004056925 0.09727587
acf(model1$resid)
Examine the roots of \(\phi(x)\) and \(\psi(x)\):
root1AR <- polyroot(c(1,-coef(model1)[c("ar1","ar2")]))
root1AR
## [1] 1.016765+0i -8.081863-0i
root1MA <- polyroot(c(1,-coef(model1)[c("ma1")]))
root1MA
## [1] -1.532091+0i
Fit the SARIMA\((1,1,2) \times (1,1,1)_{12}\) model.
model2 = Arima(value, order=c(1,1,2), seasonal=list(order=c(1,1,1),period=12))
summary(model2)
## Series: value
## ARIMA(1,1,2)(1,1,1)[12]
##
## Coefficients:
## ar1 ma1 ma2 sar1 sma1
## 0.9832 -0.7736 0.0850 0.0389 -0.8633
## s.e. 0.0074 0.0356 0.0351 0.0450 0.0282
##
## sigma^2 estimated as 693.7: log likelihood=-3439.72
## AIC=6891.44 AICc=6891.54 BIC=6919.57
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.7336713 26.04616 10.4108 -0.00036944 0.004695427
## MASE ACF1
## Training set 0.00405875 0.09604747
acf(model2$resid)
Again, examine the roots of \(\phi(x)\) and \(\psi(x)\):
root2AR <- polyroot(c(1,-coef(model2)[c("ar1")]))
root2AR
## [1] 1.017053+0i
root2MA <- polyroot(c(1,-coef(model2)[c("ma1", "ma2")]))
root2MA
## [1] -1.14781+0i 10.24618-0i
We can see that the results of these two models are very similar. The correlation at the first lag is still slightly above the dashed bound. As with the SARIMA\((2,1,1) \times (0,1,1)_{12}\) model fitted before, all the roots are outside the unit circle, but the AR root corresponding to ar1 is very close to 1, which may cause problems with causality. In general, both models fit the data well.
Moreover, the auto.arima function suggests that SARIMA\((1,1,2) \times (1,1,1)_{12}\) is the best model. This function uses the Hyndman-Khandakar algorithm: parameters are estimated by maximizing the log likelihood, and orders are chosen by minimizing the AICc \(^{[5]}\).
auto.arima(value)
## Series: value
## ARIMA(1,1,2)(1,1,1)[12]
##
## Coefficients:
## ar1 ma1 ma2 sar1 sma1
## 0.9832 -0.7736 0.0850 0.0389 -0.8633
## s.e. 0.0074 0.0356 0.0351 0.0450 0.0282
##
## sigma^2 estimated as 693.7: log likelihood=-3439.72
## AIC=6891.44 AICc=6891.54 BIC=6919.57
In order to find out whether the sar1 term is necessary, I run a simulation study: I simulate 1000 datasets from the fitted SARIMA\((2,1,1) \times (1,1,1)_{12}\) model, refit the same model to each, and record the estimated sar1 coefficient. Below is the histogram of these estimates.
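The simulation can be implemented along the following lines; this is a sketch, assuming the forecast package's simulate() method for Arima objects and an arbitrary seed:

set.seed(531)  # arbitrary seed for reproducibility
n_sim <- 1000
sar1_est <- rep(NA, n_sim)
for (i in 1:n_sim) {
  y_sim <- simulate(model1, future = FALSE)  # draw a series from the fitted model
  fit <- try(Arima(y_sim, order = c(2, 1, 1),
                   seasonal = list(order = c(1, 1, 1), period = 12)), silent = TRUE)
  if (!inherits(fit, "try-error")) sar1_est[i] <- coef(fit)["sar1"]  # skip failed fits
}
hist(sar1_est, main = "Estimated sar1 over 1000 simulations", xlab = "sar1")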
We can see that in most replications the estimated coefficient is clearly non-zero, so I prefer the SARIMA model with P = 1. Both SARIMA\((1,1,2) \times (1,1,1)_{12}\) and SARIMA\((2,1,1) \times (1,1,1)_{12}\) fit the data well.
I use SARIMA\((1,1,2) \times (1,1,1)_{12}\) to predict the population of the United States, and apply SARIMA\((1,0,2) \times (1,1,1)_{12}\) to the differenced series to predict the monthly increase of population over the upcoming 5 years. The result shows that the population of the United States will increase steadily.
I also use SARIMA\((2,1,1) \times (1,1,1)_{12}\) to predict the population, and SARIMA\((2,0,1) \times (1,1,1)_{12}\) to predict the monthly increase over the upcoming 5 years.
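A sketch of how these forecasts can be produced with forecast(), assuming a horizon of h = 60 months (5 years); the SARIMA\((2,1,1) \times (1,1,1)_{12}\) forecasts are analogous:

plot(forecast(model2, h = 60),      # population level, SARIMA(1,1,2)x(1,1,1)_12
     main = "Population of United States: 5-year Forecast")
model2_inc <- Arima(inc, order = c(1, 0, 2),
                    seasonal = list(order = c(1, 1, 1), period = 12))
plot(forecast(model2_inc, h = 60),  # monthly increase, SARIMA(1,0,2)x(1,1,1)_12
     main = "Newly-born Population: 5-year Forecast")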
This project aims at exploring the trend of the population of the United States. Based on the spectrum and ACF analysis, I find that a SARIMA\((p,1,q) \times (P,1,Q)_{12}\) model is a good choice. I then use the AIC to choose the parameters p, q, P and Q. A simulation study further confirms this choice: both SARIMA\((2,1,1) \times (1,1,1)_{12}\) and SARIMA\((1,1,2) \times (1,1,1)_{12}\) fit the data well.
Based on the selected models, the population of the United States will increase steadily in the upcoming 5 years.
There are also some limitations of the SARIMA model. For example, the residual autocorrelations are not all within the confidence bounds, and a root of \(\phi(x)\) is close to the unit circle, which may cause problems with causality. More advanced techniques could be applied to fit the data better.