Live birth is the birth of a living child and has affected lots of fields such as economy and politics. In South Korea as well as most developed and developing countries, the number of live births has decreased and then it has threatened their labor population. Also, in South Korea it is related to the National Pension and other insurance and bank business systems. From the government and National Assembly’s position, they have to make policies to prevent declining the tendency.
The goal of this project is to analyze the number of live births in South Korea in terms of time series analysis. Specifically, we will try to find any patterns from the data and fit a time series model to capture those patterns. In order to do that, we will conduct the SARIMA modeling based on estimated spectrum density and acf plots.
We downloaded data from UNdata and Korean Statistical Information Service. Birth dataset contains the number of live births in South Korea from January 1983 to December 2017. We excluded alien armed forces, civilian aliens employed by armed forces and foreign diplomatic personnel and their dependants.
## Year Month Value Time
## 1: 1983 1 84129 Jan 1983
## 2: 1983 2 86241 Feb 1983
## 3: 1983 3 59582 Mar 1983
## 4: 1983 4 55109 Apr 1983
## 5: 1983 5 55965 May 1983
## 6: 1983 6 53323 Jun 1983
## Year Month Value Time
## 1983 : 12 1 : 35 Min. :25000 Min. :1983
## 1984 : 12 2 : 35 1st Qu.:38007 1st Qu.:1992
## 1985 : 12 3 : 35 Median :46452 Median :2000
## 1986 : 12 4 : 35 Mean :47302 Mean :2000
## 1987 : 12 5 : 35 3rd Qu.:54705 3rd Qu.:2009
## 1988 : 12 6 : 35 Max. :86241 Max. :2018
## (Other):348 (Other):210
From the time series plot, we see there are two peaks and two trough every year. From 1983 to early 2000, two peaks occurred every February and October while troughs were at summer season and December. After then, the number of live births peaked every January and at spring season whereas it dropped at summer and December.
spectrum(birthts, spans = c(5,5))
We plot the periodogram to check the seasonality in the data. There are six dominant peaks. These frequencies correspond to 12,6,4,3,2.4,2 month period, respectively.
In the data, we see there is a decreasing trend in live birth over time. Also, variance changes over time. It indicates that to fit a time series model we need to make the data stationary. Thus we will transform the data later. In terms of seasonality of the data, we confirm that there is a dominant annual cycle since ten peaks occur between every ten years in time series plot and the largest peak in the periodogram plot is at frequency 1, which corresponds to one-year period.
acf(birthts,60, main = "ACF of Live Births")
The acf plot also suggests there are strong autocorrelation between observations and seasonality among them.
For the assumption of stationarity for time series modeling, we need to have a stationary dataset. However, our data don’t have a stationary behavior. To remedy it, we take logarithm and difference with several lags. Taking a log scale makes the data less fluctuate, which suggests we use the log scale data,\(Y_n = \log(X_n)\). Also, taking difference with lag of 12 makes the data stationary and does not have a sign of seasonality.
y = birthts
ly = log(y)
dly = diff(ly)
dly12 = diff(ly, 12)
plot.ts(cbind(y,ly,dly,dly12), main = "")
We seek to fit a Seasonal ARIMA\((p,0,q)\times(P,1,Q)_{12}\) with parameter vector \(\theta = (\phi_{1:p},\psi_{1:q},\Phi_{1:P},\Psi_{1:Q},\mu,\sigma^2)\) given by \[\phi(B)\Phi(B^{12})((1-B^{12})Y_n - \mu) = \psi(B)\Psi(B^{12})\epsilon_n,\] where \[ \begin{eqnarray} \mu &=&E[Y_n]\\ \phi(x) &=& 1 -\phi_1x - \cdots - \phi_px^p\\ \Phi(x) &=& 1 -\Phi_1x - \cdots - \Phi_Px^P\\ \psi(x) &=& 1 +\psi_1x + \cdots + \psi_qx^q\\ \Psi(x) &=& 1 +\Psi_1x + \cdots + \Psi_Qx^Q\\ \epsilon_n &\sim& \text{iid}N[0,\sigma^2] \end{eqnarray} \]
In order to fit a \(SARIMA(p,0,q)\times(P,1,Q)_{12}\) model, we have to decide orders for AR and MA. Therefore, we tabulate AIC values for a range of different choices of \(p\) and \(q\). Akaike’s information criterion AIC is given by \[AIC = -2\times l(\theta^*)+2D\] where D is the number of parameters to be estimated. We select the model with the lowest AIC score.
MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | |
---|---|---|---|---|---|---|
AR0 | -977.18 | -1196.07 | -1297.81 | -1364.38 | -1394.32 | -1418.01 |
AR1 | -1426.38 | -1454.16 | -1455.23 | -1453.36 | -1451.60 | -1449.80 |
AR2 | -1456.42 | -1454.62 | -1453.31 | -1458.30 | -1456.54 | -1454.57 |
AR3 | -1454.68 | -1452.85 | -1451.66 | -1454.12 | -1454.52 | -1461.01 |
AR4 | -1453.29 | -1451.95 | -1456.96 | -1463.29 | -1462.61 | -1460.64 |
AR5 | -1451.51 | -1449.68 | -1454.40 | -1458.34 | -1450.50 | -1451.53 |
From the AIC table, although \(SARIMA(5,0,5)\times(0,1,0)_{12}\)and \(SARIMA(4,0,4)\times(0,1,0)_{12}\) have smallest values, they are so complex. Therefore, we try to calculate BIC values and see if these candidates will change. Bayesian information criterion BIC is given by \[BIC = -2\times l(\theta^*) + D\log(N)\] where D is the number of parameters and N is the sample size.
MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | |
---|---|---|---|---|---|---|
AR0 | -973.17 | -1188.05 | -1285.78 | -1348.34 | -1374.26 | -1393.95 |
AR1 | -1418.36 | -1442.13 | -1439.18 | -1433.30 | -1427.53 | -1421.72 |
AR2 | -1444.39 | -1438.58 | -1433.25 | -1434.23 | -1428.46 | -1422.48 |
AR3 | -1438.63 | -1432.79 | -1427.59 | -1426.05 | -1422.43 | -1424.91 |
AR4 | -1433.24 | -1427.88 | -1428.88 | -1431.20 | -1426.50 | -1420.53 |
AR5 | -1427.44 | -1421.60 | -1422.31 | -1422.24 | -1410.39 | -1407.41 |
From the BIC table, we select \(SARIMA(2,0,0)\times(0,1,0)_{12}\) since is has a smallest BIC value. This combination also has a local minimum in the AIC table. Thus, we decide to use the \(SARIMA(2,0,0)\times(0,1,0)_{12}\) model. To consider orders for SAR and SMA, we will plot the acf of residuals of the model. From the ACF plot, we see the autocorrelation cuts off after lag 3, which indicates the residuals might bahave like SMA(3). Similarly, from the PACF plot, the pacf cuts off after lag 2, which corresponds to SAR(2). Finally, we will fit a \(SARIMA(2,0,0)\times(2,1,3)_{12}\).
new2023 = Arima(ly, order = c(2,0,0), seasonal = list(order = c(2,1,3), period = 12))
new2023
## Series: ly
## ARIMA(2,0,0)(2,1,3)[12]
##
## Coefficients:
## ar1 ar2 sar1 sar2 sma1 sma2 sma3
## 0.6087 0.2964 -0.9979 -0.8871 0.6914 0.5229 -0.1965
## s.e. 0.0478 0.0481 0.0804 0.0701 0.0995 0.1007 0.0573
##
## sigma^2 estimated as 0.001407: log likelihood=761.51
## AIC=-1507.03 AICc=-1506.67 BIC=-1474.94
The result of fitting the SARIMA model is given by \[(1 - 0.6087B - 0.2964B^2)(1 + 0.9979B^{12} + 0.8871B^{24})(1 - B^{12})Y_n = (1 +0.6914B^{12} + 0.5229B^{24} - 0.1965B^{36})\epsilon_n\] where \(\epsilon_n \sim \text{iid} N[0,0.001421]\). From the result, we check that the AIC and BIC values are improved.
list(AR_roots = abs(polyroot(c(1,-coef(new2023)[1:2]))),
SAR_roots = abs(polyroot(c(1, -coef(new2023)[3:4]))),
SMA_roots=abs(polyroot(c(1,coef(new2023)[5:7]))))
## $AR_roots
## [1] 1.077573 3.131367
##
## $SAR_roots
## [1] 1.06175 1.06175
##
## $SMA_roots
## [1] 1.142537 1.142537 3.898046
round((1-pnorm(abs(new2023$coef)/sqrt(diag(new2023$var.coef))))*2,3)
## ar1 ar2 sar1 sar2 sma1 sma2 sma3
## 0.000 0.000 0.000 0.000 0.000 0.000 0.001
All the roots are outside the unit circle, suggesting we have a stationary causal fitted SARIMA. Therefore, the assumptions of causality and invertibility still hold. Also, all the coefficients are significant.
The time plot above shows the original live births on log scale and the fitted values from the model we considered above. We see that our model can capture the patterns from the live births but some peaks and troughs are over- or underestimated.
For the diagnostics, we first plot the residuals of the model above.
plot(resid(new2023), ylab = "Residual", type = "l", main = "Residual Plot")
We see that the residuals behave like a white noise except some points, which might be outliers. Nextly, we look the acf plot of the residuals to check whether autocorrelation exists among the residuals.
par(mfrow = c(2,1))
acf(resid(new2023),60, main = "ACF and PACF of SARIMA(2,0,0)(2,1,3)")
pacf(resid(new2023),60, main = "")
par(mfrow = c(1,1))
Based on the acf plot, the residuals look like a white noise. Lastly, to check the assumption of normality, we plot a QQ-plot of the residuals.
qqnorm(resid(new2023))
qqline(resid(new2023))
From the plot above, although the residuals have a long tail on the right hand, they can be considered as a normal distribution.
The live births in South Korea time series has a decreasing trend as well as some seasonalities, especially in a cycle per year. Usually, peaks have occurred between January and March or October while there have been troughs in summer or December. Accoring to these patterns and duration of pregnancy, we can speculate many women in South Korea have gotten pregnant in spring and early in winter rather than in autumn and late in winter.
We tried to fit the \(SARIMA(2,0,0)\times(2,1,3)_{12}\) based on the AIC and BIC tables as well as the acf plots. Although the fit of the model is good, there are some misetimated points to be improved. Thus, further analysis should try to close these gaps.
http://data.un.org/Data.aspx?d=POP&f=tableCode%3A55
https://ionides.github.io/531w18/
R. Shumway and D. Stoffer “Time Series Analysis and its Applications” 4th edition.