1 Introduction of Candy Production Dataset

Candy production dateset is from \(FRED^{[1]}\) economic research, which tracks the monthly production of outputs in United States from Jan.1972 to Jan.2020. The original dataset has two columns. One is date and the other is IP, known as Industrial Production showing as a percentage of 2012 production. As in winter hoildays, like Halloween and Christmas, the candy requirment increased a lot which leads to the sudden increase of sugar production. Difference of sugar requirment may lead to the seasonility of sugar production and this is the main concern of this anlysis.

2 Exploratory Data Analysis (EDA)

1 Basic Summary

Firstly, we have a look of the basic information of the candy production data. When we look at the head five lines of the data, we can clearly see the date column and the IP column in double format.From the summary of IP, the percentage values range from 50 to 140.

##         DATE IPG3113N
## 1 1972-01-01  85.6973
## 2 1972-02-01  71.8229
## 3 1972-03-01  66.0235
## 4 1972-04-01  64.5634
## 5 1972-05-01  65.0068
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   50.72   88.68  103.42  101.75  115.49  139.80

After changing the format of date and decomposing time into year,month and time of year and month, the new dataset has 5 columns which the analysis may focus on columns of IP and time.

##         DATE IPG3113N year month     time
## 1 1972-01-01  85.6973 1972     1 1972.083
## 2 1972-02-01  71.8229 1972     2 1972.167
## 3 1972-03-01  66.0235 1972     3 1972.250
## 4 1972-04-01  64.5634 1972     4 1972.333
## 5 1972-05-01  65.0068 1972     5 1972.417

From the plot of candy IP against time, there is a clear trend and seasoning in the plot, while the variance may not stationary along the time.

2 Stationary and Seasonality

To have a close analysis of the data stationality and seasonality, ACF plot and spectrum plot are shown below. From ACF plot, almost all the acf value are above the dashed line indicating highly autocorrelation between lags showing unstationary and the it reaches the peak at lag 12 showing the dominant period of seasoning is 12 months. What’r more, the periodic wave in ACF plot indicates seasonality and AR model may be a good choice to fit.

In spectrum plot, the bandwidth value is small, so the plot is less smoother. As the red dashed line shows, it reached its peak at the frequency of about 0.083 and it is much higher than the local peaks at frequecies 0.16, 0.25 and so on. This proves the seasonality we found in ACF plot as a period of 12 monthes rather than 6 monthes or 4 monthes.

3 Trend, Noise and Cycles

In last part of EDA, we decompose the original sugar industrial production into trend, noise and cycles parts. In the decomposition plot, low frequency refers to the trend is clear in the original candy production and high frequency refers to the noise is almost weak stationary with constant mean and variance and the middle frequency refers to the cycles.

3 Fitting an ARMA model for annual analysis

After anaylse the basis information of candy production series, we would fit an ARMA model not considering seasoning and use the subset of production in August as they represents the middle level in each year.

3.1 Choosing p,d,q for ARIMA model

Accoding to the aic table below, AR(2) has the lowest aic value as 324.2974, but AR(1) and ARMA(1,1) have similar aic value. As AR(2) model is quite simple withour MA parameters, it may not fit the long-term time series data like candy production. In that case, we would choose ARMA(1,1). Also when adding a parameter, some aic values increase more than 2, indicating the data is mathematically inconsistent.

When we fit ARMA(1,1) model to the August production data, the ar1 root is quite close to 1 which indicates ARIMA(0,1,1) model is a better choice and has lower aic value than ARMA(1,1).

##          MA0      MA1      MA2      MA3      MA4      MA5
## AR0 407.5939 370.5949 359.3304 344.5750 344.3509 338.5814
## AR1 324.5205 324.5304 325.9249 326.9689 328.9609 328.7471
## AR2 324.2974 326.2968 326.7673 327.3553 329.2326 330.7218
## AR3 326.2958 326.0009 328.0060 328.7492 330.8970 332.5264
## AR4 326.7520 327.8615 328.9928 330.6098 333.4941 332.5077
## 
## Call:
## arima(x = production, order = c(1, 0, 1))
## 
## Coefficients:
##          ar1      ma1  intercept
##       0.9624  -0.2018    95.5526
## s.e.  0.0390   0.1345    13.6847
## 
## sigma^2 estimated as 40.87:  log likelihood = -158.27,  aic = 324.53
## 
## Call:
## arima(x = production, order = c(0, 1, 1))
## 
## Coefficients:
##           ma1
##       -0.2226
## s.e.   0.1281
## 
## sigma^2 estimated as 41.38:  log likelihood = -154.2,  aic = 312.41

3.2 ARIMA Model Selected

Therefore the ARIMA model selected is ARIMA(0,1,1):
\[ X_n-X_{n-1}=\epsilon_n-0.2226\epsilon_{n-1} \]

3.3 Diagonose Analysis

In this part, we would analyse the goodness of fit of the ARIMA(0,1,1) model we choose.

3.3.1 Residual Analysis

From the acf plot of residuals, we can see all the acf values are between the dashed lines, indicating there is not autocorrelation between lags. And from QQplot, we can see that the residuals fit normal distribution well with some heavy tails in both side.

3.3.2 Plots of orginal data v.s. fitted value

From the \(ggplots^{[2]}\) below, we can see that ARIMA(0,1,1) model fits both the August subset data and the full original data well, but it still miss some peaks.

4 Fitting SARIMA model

As what shown in EDA part, the time series data has seasonality of a period of 12 monthes. This means a Seasonal-ARIMA model may fit better than a normal ARIMA model.

4.1 Choosing p,d,q for SARIMA model

Based on the annual anaylsis in section 3, we would fit a model of \(SARIMA(p,d,q)\times(0,1,1)_{12}\) with annual polynomial of ARIMA(0,1,1). According to the AIC output, \(SARIMA(4,0,3)\times(0,1,1)_{12}\) has the lowest AIC value.

##          MA0      MA1      MA2      MA3      MA4      MA5
## AR0 3732.009 3420.772 3334.876 3244.124 3208.156 3187.198
## AR1 3139.166 3114.097 3110.637 3111.530 3106.361 3107.631
## AR2 3121.498 3111.363 3109.475 3111.434 3111.428 3108.144
## AR3 3109.268 3108.888 3110.859 3071.547 3069.524 3070.747
## AR4 3109.953 3110.885 3075.576 3068.566 3070.795 3071.252

When we see the summary of the selected SARIMA model, all the AR and MA roots are inside the unit circle which shows good selection.

## 
## Call:
## arima(x = candy, order = c(4, 0, 3), seasonal = list(order = c(0, 1, 1), period = 12))
## 
## Coefficients:
##          ar1     ar2     ar3      ar4     ma1     ma2      ma3     sma1
##       0.3591  0.2208  0.9040  -0.5256  0.3830  0.0981  -0.7627  -0.6951
## s.e.  0.1318  0.0293  0.0116   0.1240  0.1007  0.1188   0.1155   0.0369
## 
## sigma^2 estimated as 12.7:  log likelihood = -1525.28,  aic = 3068.57

4.2 SARIMA Model Selected

Therefore the SARIMA model selected is \(SARIMA(4,0,3)\times(0,1,1)_{12}\):

\[ (1-B^{12})(1-0.3528B-0.2203B^2-0.9054B^3+0.5191B^4)Y_n=(1-0.7B^{12})(1+0.3873B+0.1024B^2-0.7603B^3)\epsilon_n \]

4.3 Diagonose Analysis

In this part, we would analyse the goodness of fit of the \(SARIMA(4,0,3)\times(0,1,1)_{12}\) model we choose.

4.3.1 Residual Analysis

From the acf plot of residuals, we can see all the acf values are between the dashed lines, indicating there is not autocorrelation between lags. And from QQplot, we can see that the residuals fit normal distribution well with some ligher tails than the QQplot of ARIMA(0,1,1) model in both side.

4.3.2 Plots of orginal data v.s. fitted value

From the SARIMA fitted plot below, we can see a much lighter color of the plot which indicates that there are more orange lines of doublication with original data than the ARMA model and it better catches the peaks than before.

5 Conclusion and References

5.1 Conclusion

We made the time series anaysis of United States monthly sugar production on percentage of 2012 production from 1972 to 2020. The analysis conducted mainly in three parts. Firstly, we made EDA of the data and found that sugar production is nonstationary with a increasing trend and has a seasonality of dominant period of 12 months. Decomposition of time series shows clear trend, noise and cycles.
Then we fit an ARMA model to catch the annual polynomial and SARIMA model including the obvious seasonal effects. The final SARIMA model chosen is \(SARIMA(4,0,3)\times(0,1,1)_{12}\) model:

\[ (1-B^{12})(1-0.3528B-0.2203B^2-0.9054B^3+0.5191B^4)Y_n=(1-0.7B^{12})(1+0.3873B+0.1024B^2-0.7603B^3)\epsilon_n \] The analysis of residuals and plot of fitted value shows that SARIMA model fits well on the candy productiom series. In future analysis, we could apply the model we selected to forecast further sugar production which might have some benefit in sugar business.