1.Introduction

Growth of population has always been a concern for governments of many countries. The birth rate is one of the most significant indicators that influence the population and structure of the population. It determines the speed of growth of population and the percentage of people of different ages in a country. It is well known that the scale and structure of population have important effects on the development of economics, education and social structure. Also, the birth rate depends on various factors, such as economics, and agricultural production, science and technology, especially medical level, and culture. Exploring the features of birth through history can help us view the development and constantly changing culture of a country. Moreover, knowing more about the birth cycle can help women wanting to get pregnant.

The United States, the fastest developed country in the 20th century, its birth may let us view a tip of the iceberg that why the USA can develop so fast in the last century that it must have certain relation with economics, science and culture. In this report, we are interested in mining the relationship between the birth rate and month, to improve the accuracy of birth rate prediction. We mainly solve the question about the optimal model for this series. To sum up, we would like to select birth data of the late-20-century of the United States as a research object and make a time series analysis to find out the regulation and trend of the birth.

The data we selected is CDC Births Data 1969-2008 [1]. This data is drawn from the USA Centers for Disease Control and Prevention. There are 5 columns which include 4 numerical variables and 1 categorical variable. The numerical variables are year, month, day and birth. The categorical variable is gender(Male and Female). We selected the data from 1969 to 1988 as we would like to explore the birth before the important timeline–the collapse of the Soviet Union as it is a turning point which has enormous influence on every part of the US. We will use the data to make a time series analysis of the monthly birth of the US.

In this report, we will use \(\text{SARIMA}\) method to explore the birth through time series. The report will be divided into parts. The first one is data exploration which includes the overall view of data, heteroscedasticity and decomposition of data. We also determine stationary in this part. The second section is model selection. We use Grid Search method to obtain the best fitted model. The third section is to test the model. We will carry out diagnostics in this part. The last section is the conclusion and possible improvement.

2.Data Exploration

The information we used includes dates and birth rate. Firstly, we make a plot of birth for every month with the original data. We intend to find out the relation of birth rate and the changes of birth between different months. Thus, we make the month variable from 1969 to 1988 to be a consecutive variable. As it can be seen from the time series plot with birth as y-label and month as x-label, the trend of the data can be approximately splitted into 3 parts. The first is approximately from 0 to 25, which has a significant increasing trend. The second part is from 25 to 50, and the values are descending. The last part, generally, shows an evident increasing trend. Besides the trend of the time series data, we also notice that the data has very regular fluctuations which seem to have periodicity. We, therefore, suppose that this data may fit with a \(\text{SARIMA}\) model with overall trend and periodic seasoning effect[2][3]. Before the subsequent process of analysis and model selection, the original data is found to be heteroscedastic and non-stationary. Hence, we transform the data through logarithm and then difference it. Thereafter, our data successfully meets the model assumption and is stationary. We further use grid search method to determine other parameters as long as we figure out the periodicity by using the logged data. More details have been presented in the following.

2.1 Heteroscedasticity Removing

The series shows compound growth and has evident patterns, which is a sign of heteroscedasticity, and the logarithm may be helpful to remove the increase in variance. Taking the logarithm of the data will not fatten an inflationary growth pattern, but it will straighten it out so that it can be fitted by a linear model(e.g., a random walk or ARIMA model with constant growth, or a linear exponential smoothing model) [4]. Also, taking the logarithm of the data will convert multiplicative seasonal patterns to additive patterns, so that if we perform seasonal adjustments after taking the logarithm, we should use the additive type.

Log serves our purpose since it did remove the increase in variance.

2.2 Data Decomposition

Based on the decomposition of the time series[5], we could figure out that this birth series is generally a combination of level, trend, seasonality, and noise components.

2.3 Stationarity Check

##      [,1] [,2]  [,3]  [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## ACF  0.81 0.76  0.65  0.54 0.49 0.41 0.46 0.49 0.57  0.65  0.68  0.83  0.65
## PACF 0.81 0.31 -0.10 -0.13 0.10 0.01 0.31 0.20 0.21  0.20  0.05  0.54 -0.72
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## ACF   0.61  0.49  0.39  0.34  0.27  0.31  0.34  0.42   0.5  0.53  0.66  0.49
## PACF -0.18 -0.03  0.10  0.07 -0.05 -0.01  0.10 -0.02   0.1 -0.01  0.06 -0.19
##      [,26]
## ACF   0.45
## PACF -0.17

We have seen by the ACF and PACF that this is not white noise, but a series of correlated values. Nevertheless, we can do the Ljung-Box test on the time series to formally validate this claim.

## 
##  Box-Ljung test
## 
## data:  log(birth_per_month)
## X-squared = 1625.7, df = 20, p-value < 2.2e-16

By using Ljung-Box test, it yields a small p-value which rejects the null hypothesis, showing the data is not stationary. So this supports our ACF plot consideration above where we stated: it is likely that this is not purely white noise and some time series information exists in this data. Actually, there is an obvious trend in the time series, and thus we difference the data to detrend it. The corresponding figure is shown below and proves that the series now does not have apparent trend now:

2.4 Period Detection

According to the plots above, we could figure out there seems to have cycles in the rear of the data. We use frequency domain to analyze the spectral density of the series and here is the smoothed periodogram with an obvious peak. We found that the highest frequency is 0.0833, which corresponds to 12 months(a year). This phenomenon agrees with our hypothesis that the period is 12 months.

3. Model Selection

3.1 SARIMA Model

Given that the period is equal to 12 (one year), it is supposed a general \(\text{SARIMA}(p,1,q)\times(P,1,Q)_{12}\) model [6] should be conducted for monthly data:

\[\phi(B)\Phi(B^{12})[(1-B)^d(1-B^{12})^D(Y_n - \mu)] = \psi(B)\Psi(B^{12})\epsilon_n\] where \(\left\{\epsilon_n\right\}\) is a white noise process and: \[ \begin{aligned} \mu &= \mathbf{E}\left\{(1-B)^d(1-B^{12})^DY_n\right\}\\ \phi(x) &= 1-\phi_1x-\cdots-\phi_px^p\\ \psi(x) &=1+\psi_1x+\cdots+\psi_qx^q\\ \Phi(x) &= 1-\Phi_1x-\cdots-\Phi_Px^P\\ \Psi(x) &=1+\Psi_1x+\cdots+\Psi_Qx^Q\\ \end{aligned} \]

3.2 Hyperparameters Configuration by Grid Search

Plot the ACF and PACF of monthly births number with first diff 1 and then diff with a period:

According to ACF and PACF, \(P,Q,p,q\) can be roughly estimated as \(p = 1,2\), \(q = 0, 1, 2, 3\), \(P = 0,1\), and \(Q = 0,1\). Some possible models will be conducted and compared with AIC. We try to find the optimal model by grid search method[7], and the results are listed here:

For \(P = Q = 0\):

	MA0	MA1	MA2	MA3
AR0	-1111.794	-1132.221	-1131.090	-1132.855
AR1	-1126.100	-1130.637	-1139.385	-1132.554
AR2	-1135.789	-1136.430	-1140.423	-1144.844

For \(P = 1, Q = 0\):

	MA0	MA1	MA2	MA3
AR0	-1155.080	-1183.848	-1182.144	-1182.904
AR1	-1178.725	-1181.998	-1181.566	-1183.456
AR2	-1186.145	-1187.261	-1186.935	-1184.682

For \(P = Q = 1\):

	MA0	MA1	MA2	MA3
AR0	-1206.814	-1225.003	-1223.068	-1224.053
AR1	-1222.010	-1223.032	-1221.616	-1224.838
AR2	-1226.068	-1227.809	-1228.345	-1249.984

For all combinations of possible parameters, the model \(\text{SARIMA}(2,1,3)\times(1,1,1)_{12}\) has the smallest AIC value. We select two models \(\text{SARIMA}(2,1,2)\times(1,1,1)_{12}\) and \(\text{SARIMA}(2,1,3)\times(1,1,1)_{12}\) as our candidate models.

The coefficients of the fitted model with AR=2 and MA=3 are:

	Dependent variable
Predictors	Estimates	CI	p
ar1	-1.16	-1.17 – -1.15	<0.001
ar2	-1.00	-1.00 – -1.00	<0.001
ma1	0.93	0.79 – 1.06	<0.001
ma2	0.75	0.58 – 0.91	<0.001
ma3	-0.22	-0.35 – -0.08	0.001
sar1	-0.13	-0.30 – 0.03	0.118
sma1	-0.72	-0.84 – -0.59	<0.001
Observations	227
R²	0.973

Here are the corresponding roots for AR and MA respectively:

AR
x
-0.5800218+0.8149908i
-0.5800218-0.8149908i

MA
x
-0.5730628+0.8196098i
-0.5730628-0.8196098i
4.5659101+0.0000000i

Actually, almost all roots are on the unit circle, showing that this model is at the threshold of non-invertibility.

The coefficients and roots for another model are:

	Dependent variable
Predictors	Estimates	CI	p
ar1	-0.82	-1.19 – -0.45	<0.001
ar2	-0.58	-0.87 – -0.29	<0.001
ma1	0.53	0.10 – 0.95	0.016
ma2	0.32	0.00 – 0.64	0.048
sar1	-0.02	-0.20 – 0.15	0.797
sma1	-0.76	-0.88 – -0.64	<0.001
Observations	227
R²	0.970

AR
x
-0.709599+1.109074i
-0.709599-1.109074i

MA
x
-0.81907+1.563851i
-0.81907-1.563851i

Comparatively, roots are just outside the unit circle, suggesting we have a stationary causal fitted SARIMA.

Next, we will do a formal hypothesis test using Wilk’s theorem. Suppose we have two nested hypothesis \(H_0: \text{SARIMA}(0,1,0)\times(1,1,1)_{12}\) and \(H_1: \text{SARIMA}(2,1,2)\times(1,1,1)_{12}\). Under the null hypothesis, we have

\[ 1/2\Lambda = l^{(1)}-l^{(0)}\approx1/2\chi^2_{D^{(1)}-D^{(0)}} \]

where \(\chi^2_{d}\) is a chi-squared random variable on d degree of freedom, and \(l^{(i)}\) is the maximum likelihood under the hypothesis \(H_i\).

The cut-off value for \(\chi^2_4\) distribution with \(95\%\) significance confidence is 9.49, which is smaller than \(\Lambda=29.53\). So we could reject the null hypothesis and thus \(\text{SARIMA}(2,1,2)\times(1,1,1)_{12}\) is appropriate for the data. To sum up, based on AIC table, roots, and hypothesis test result, we select this model as the best one. The fitted model is:

\[(1+0.82B+0.58B^2)(1+0.02B^{12})[(1 - B)(1 - B^{12})Y_n] = (1+0.53B+0.32B^2)(1-0.76B^{12})\epsilon_n\]

4.Diagnostics

In this plot, the blue line stands for the true log-transformed data while the red line represents the predicted data. We can see that the fitted values are quite close to the original ones. The trend and the seasonality have been captured well by our model. However, the ranges of our predicted value, namely the difference between maximum and minimum in one cycle, are generally larger than the original points for data after 200, which indicates we should adjust our model to better fit the data.

## 
##  Box-Ljung test
## 
## data:  birth_arima$residuals
## X-squared = 18.297, df = 20, p-value = 0.5678

The residuals oscillated around zero symmetrically over time. Besides, the QQ-plot proves that the residuals are normally distributed, which are generally consistent with the assumption that the residuals are white noise. The ACF plot of the residual does not show quite significant autocorrelations, even though there are few values of ACF fall outside the dashed lines. The result in Box-Ljung test does not the reject the null hypothesis, demonstrating the residuals are independently distributed as well.

5. Conclusion and Further Work

In this report, we are mainly interested in exploring the changes of birth rate over a few years, which is one of the most basic and important measures in demography. In fact, birth rates affect public policy and budgeting for education and health systems, and could have major impacts on the well-being of a country’s population. The plot of the data tells us the birth rate is increasing gradually, and it also has some seasonal features. After removing the trend of the series, we use grid search method to find the optimal hyperparameters p, q, P, and Q based on the AIC criterion. Finally, we obtain the model \(\text{SARIMA}(2,1,2)\times(1,1,1)_{12}\). The diagnostics validate our methods and show availability of the model. It tells us the birth rate has close relationship with the months and it is also periodic with the cycle 12 months. However, it also indicates that predicting the birth rate merely depending on the date is not quite comprehensive and convincing. Possibly, we could add other information like sex to improve our model [8].

Reference

[1]

A gentle introduction to wavelet for data analysis. Available: https://www.kaggle.com/asauve/a-gentle-introduction-to-wavelet-for-data-analysis/data

[2]

Time series analysis of NO2 concentration in dongcheng district, beijing. Available: https://ionides.github.io/531w20/midterm_project/project27/midterm_proj.html

[3]

Pharma sales. Available: https://ionides.github.io/531w21/midterm_project/project13/project.html

[4]

Steps in choosing a forecasting model. Available: https://people.duke.edu/~rnau/411fcst.htm

[5]

Chapter8, Notes for STATS 531, modeling and analysis of time series data. Available: https://ionides.github.io/531w22/08/notes.pdf

[6]

R. H. Shumway, D. S. Stoffer, and D. S. Stoffer, Time series analysis and its applications, vol. 3. Springer, 2000.

[7]

S. Abolmaali and S. Shirzaei, “Forecasting COVID-19 number of cases by implementing ARIMA and SARIMA with grid search in united states,” medRxiv, 2021, doi: 10.1101/2021.05.29.21258041.

[8]

H. Ritchie and M. Roser, “Gender ratio,” Our World in Data, 2019.

Study on the Birth Rate in the U.S.A.

SATST531 Midterm Project