Everyone loves dessert! No matter their age, no one can resist the temptation of frozen desserts, especially ice cream, during the hot summer months. So how does the production of frozen desserts like ice cream change from month to month? In this analysis, we answer this question by modeling the time-dependent patterns of ice cream and frozen dessert production.
In this project, we study the time series from Non-Durable Goods: Ice Cream and Frozen Dessert[1].
This dataset tracks the industrial production (IP) index monthly from 1972-01 to 2021-11. The data frame contains two variables:

- `Date`: the month of the observation
- `Prod`: the industrial production index value for that month

Our objective is to use methods from STATS 531 to identify the best-fitting time series model for the ice cream production series.
To obtain an overview of the ice cream data, we first plot the time series and overlay a locally estimated scatterplot smoothing (loess) regression line to show the general trend. From the plot, we observe an obvious seasonal pattern and a non-stationary trend in the data. Since the ARIMA framework requires stationarity, we first need to deal with the trend.
ice <- read.csv("ice_cream.csv")
names(ice) = c("Date", "Prod")

cap_fig1 = paste(
  "**Figure 1.** *US Ice Cream production data by month.*",
  "Grey region indicates 95% confidence interval for locally estimated scatterplot smoothing regression line."
)

## plot time series
ice$Date <- as.Date(ice$Date)
ice %>%
  ggplot(aes(x = Date, y = Prod)) +
  geom_line() +
  geom_smooth(method = 'loess', formula = y ~ x) +
  xlab("Month") +
  ylab("Ice cream Production") +
  theme_bw()
The blue loess curve shows that the raw data are not stationary. We need to difference the series to eliminate the trend and obtain a stationary series.
We first difference the data with lag = 1: \[y_n = \Delta x_n = x_n - x_{n-1}.\] The differenced series appears stationary, supporting the selection of \(d=1\) in the ARIMA model.
cap_fig2 = paste(
  "**Figure 2.** *Difference of US ice cream production data with lag = 1.*"
)

ice_diff = data.frame(ice$Date[-1], diff(ice$Prod, lag = 1))
names(ice_diff) = c('date_diff', 'prod_diff')

ice_diff %>%
  ggplot(aes(x = date_diff, y = prod_diff)) +
  geom_line() +
  geom_smooth(method = 'loess', formula = y ~ x) +
  xlab("Month") +
  ylab("Ice cream Production Diff with lag=1") +
  theme_bw()
The loess curve is now nearly flat and linear, indicating that the differenced data are much closer to stationary. However, the first difference alone does not address the strong seasonal pattern in the series, so we also try differencing at a seasonal lag.
By plotting the autocorrelation function (ACF), we can identify a suitable value for the seasonal lag.
cap_fig3 = paste(
  "**Figure 3.** *ACF plot for ice cream production data.*"
)

acf(ice$Prod, main = "Ice-cream Production ACF plot")
The cyclical pattern in the ACF plot makes a period of 12 months apparent, so we use a seasonal difference at lag 12 and plot the resulting series.
cap_fig4 = paste(
  "**Figure 4.** *Difference of US ice cream production data with lag = 12.*"
)

ice_diff = data.frame(ice$Date[-c(1:12)], diff(ice$Prod, lag = 12))
names(ice_diff) = c('date_diff', 'prod_diff')

ice_diff %>%
  ggplot(aes(x = date_diff, y = prod_diff)) +
  geom_line() +
  geom_smooth(method = 'loess', formula = y ~ x) +
  xlab("Month") +
  ylab("Ice cream Production Diff with lag=12") +
  theme_bw()
The seasonally differenced series (lag = 12) looks roughly stationary, which suggests setting \(D=1\) and fitting a \(SARIMA(p,d,q)\times (P,D,Q)_{12}\) model to the data.
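Concretely, the general \(SARIMA(p,d,q)\times(P,D,Q)_{12}\) model we consider can be written as

\[\phi(B)\,\Phi(B^{12})\left[(1-B)^{d}(1-B^{12})^{D} Y_n - \mu\right] = \psi(B)\,\Psi(B^{12})\,\epsilon_n,\]

where \(B\) is the backshift operator, \(\phi\) and \(\psi\) are the non-seasonal AR and MA polynomials of orders \(p\) and \(q\), \(\Phi\) and \(\Psi\) are their seasonal counterparts of orders \(P\) and \(Q\), and \(\{\epsilon_n\}\) is a white-noise process.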
To understand the deeper structure of the series, we turn to spectral analysis[3], which allows us to study the series in the frequency domain.
cap_fig5 = paste(
  "**Figure 5.** *Unsmoothed periodogram of US Ice Cream production monthly data.*"
)

spec1 = spectrum(ice$Prod, main = 'Unsmoothed periodogram')
f1 = spec1$freq[which.max(spec1$spec)]
abline(v = f1, col = 'red', lwd = 2, lty = 2)
text(0.15, 0.1, paste('freq =', round(f1, 4)), col = 'red')

cap_fig6 = paste(
  "**Figure 6.** *Smoothed periodogram of US Ice Cream production monthly data.*"
)

spec2 = spectrum(ice$Prod, spans = c(2), main = 'Smoothed periodogram')
f2 = spec2$freq[which.max(spec2$spec)]
abline(v = f2, col = 'red', lwd = 2, lty = 2)
text(0.15, 2, paste('freq =', round(f2, 4)), col = 'red')
cap_fig7 = paste(
  "**Figure 7.** *Smoothed periodogram of US Ice Cream production monthly data.*",
  "Smoothed via AR model."
)

## Using the interval between peaks here.
## find_peaks returns the indices of local maxima: a point counts as a peak
## if it exceeds all of its m neighbors on each side.
find_peaks <- function(x, m = 3){
  shape <- diff(sign(diff(x, na.pad = FALSE)))
  pks <- sapply(which(shape < 0), FUN = function(i){
    z <- i - m + 1
    z <- ifelse(z > 0, z, 1)
    w <- i + m + 1
    w <- ifelse(w < length(x), w, length(x))
    if(all(x[c(z : i, (i + 2) : w)] <= x[i + 1])) return(i + 1) else return(numeric(0))
  })
  pks <- unlist(pks)
  pks
}

spec3 = spectrum(ice$Prod, method = 'ar', main = 'Spectrum estimated via AR model picked by AIC')
f3 = spec3$freq[find_peaks(spec3$spec)[1]]
abline(v = f3, col = 'red', lwd = 2, lty = 2)
text(0.15, 9, paste('freq =', round(f3, 4)), col = 'red')
From the plots we find that both the unsmoothed and the smoothed (spans = 2) periodograms have a peak near frequency \(\omega=0.083\), which corresponds to the period \(T=\frac{1}{\omega}\approx 12.05\) months. In other words, the dominant period is about one cycle per year. Note that in Figure 7 we use the second-highest peak (the first local maximum found by `find_peaks`) to obtain the frequency. The method for finding peaks in the data was adapted from an online question[2].
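As a quick sanity check, the dominant frequency estimated above converts to a period in months by taking its reciprocal (this snippet simply reuses `f1` from the unsmoothed periodogram):

## convert the dominant frequency (cycles per month) to a period (months)
1 / f1  # approximately 12 months, i.e. one cycle per year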
For the ice cream series, high-frequency variation can be considered "noise", while low-frequency variation might be considered trend. Here, however, we want to extract the business cycles, which occupy a band of mid-range frequencies. We build a smoothing operation in the time domain to extract the business cycles, and then study them in the frequency domain.
cap_fig8 = paste(
  "**Figure 8.** *Decomposition of ice cream production.*"
)

prod = ice$Prod
date = seq(from = 1972, length = length(prod), by = 1 / 12)

ice_low = ts(loess(prod ~ date, span = 0.5)$fitted,
             start = 1972,
             frequency = 12)
ice_high = ts(prod - loess(prod ~ date, span = 0.1)$fitted,
              start = 1972,
              frequency = 12)
ice_cycles = prod - ice_high - ice_low

plot(ts.union(prod, ice_low, ice_high, ice_cycles),
     main = "Decomposition of ice production as trend + noise + cycles")
cap_fig9 = paste(
  "**Figure 9.** *Spectrum for original data and cycle data.*",
  "Business cycle data for ice cream production (red)."
)

spec_union = spectrum(ts.union(prod, ice_cycles))

spec_rps = data.frame(freq = spec_union$freq,
                      ratio = spec_union$spec[,2] / spec_union$spec[,1])
xlim = spec_rps %>%
  filter(ratio > 0.5) %>%
  summarize(mini = min(freq), maxi = max(freq)) %>%
  unlist()

cap_fig10 = paste(
  "**Figure 10.** *Spectrum response ratio.*"
)

spec_rps %>%
  ggplot(aes(x = freq, y = ratio)) +
  geom_line() +
  scale_x_continuous(name = "Frequency") +
  scale_y_continuous(name = "Spectrum Ratio (scaled by log10)",
                     trans = "log10") +
  geom_hline(yintercept = 0.5,
             col = "tomato3",
             lty = "dashed") +
  geom_hline(yintercept = max(spec_rps$ratio),
             col = "tomato3",
             lty = "dashed") +
  geom_vline(xintercept = xlim,
             col = "tomato3",
             lty = "dashed") +
  geom_text(aes(x = xlim[1],
                label = sprintf("%.3f", xlim[1]),
                y = 1e-15),
            colour = "darkred") +
  geom_text(aes(x = xlim[2],
                label = sprintf("%.3f", xlim[2]),
                y = 1e-15),
            colour = "darkred")
Here we set the cutoff value at 0.5: the spectrum response ratio plot indicates that at least half of the power is kept for frequencies with cycle lengths between \(\frac{1}{0.68}\approx 1.47\) and \(\frac{1}{0.08}=12.5\) months. Frequencies within this interval can be interpreted as those related to the ice cream production cycle.
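The band edges can be converted back to cycle lengths directly from the `xlim` vector computed above; a small sketch:

## cycle lengths (months) at the frequencies where the ratio crosses the 0.5 cutoff
1 / xlim  # roughly 12.5 months (lower frequency) and 1.47 months (upper frequency)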
The forecast package provides an automatic ARIMA model selection function, `auto.arima`, which uses information criteria such as AIC and BIC to determine a good combination of parameters. This function may not give us the best-fitting model, but it provides a useful reference for our model selection process.
autoARIMA <- auto.arima(ice$Prod)
autoARIMA
## Series: ice$Prod
## ARIMA(1,1,0)
##
## Coefficients:
## ar1
## 0.4663
## s.e. 0.0361
##
## sigma^2 = 109.4: log likelihood = -2252.05
## AIC=4508.1 AICc=4508.12 BIC=4516.89
The auto-selection process found \(ARIMA(1, 1, 0)\) to be the best-fitting model for the production data. We will analyze this model's performance further, but we can already see that the auto-selection process agrees on differencing the data once (\(d=1\)), which supports our earlier analysis.
Since the raw data are not stationary, we set \(d = 1\). Now let's find the parameters \(p\) and \(q\) for the \(ARIMA(p, 1, q)\) model by a more reliable method, the Akaike Information Criterion (AIC). The AIC penalizes both poor fit and the use of additional parameters, so a model with a low AIC value is what we are looking for. First, let's build an AIC table over small values of \(p\) and \(q\) to select from.
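Formally, for a model with \(k\) estimated parameters and maximized log likelihood \(\ell(\hat{\theta})\),

\[AIC = -2\,\ell(\hat{\theta}) + 2k,\]

so lower values indicate a better balance between goodness of fit and model complexity.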
aic_table_arima = function(data, P, Q){
  table = matrix(NA, (P + 1), (Q + 1))
  for(p in 0:P){
    for(q in 0:Q){
      try({
        table[p + 1, q + 1] = arima(data,
                                    order = c(p, 1, q)
        )$aic
      })
    }
  }
  dimnames(table) = list(paste("AR", 0:P),
                         paste("MA", 0:Q))
  table
}

aic_table_arima(ice$Prod, 5, 5)
|      | MA 0 | MA 1 | MA 2 | MA 3 | MA 4 | MA 5 |
|------|------|------|------|------|------|------|
| AR 0 | 4652.762 | 4543.907 | 4502.755 | 4499.941 | 4439.823 | 4391.367 |
| AR 1 | 4508.102 | 4510.003 | 4500.125 | 4500.605 | 4414.162 | 4356.125 |
| AR 2 | 4509.91 | 4472.067 | 4058.179 | 4440.714 | 3989.001 | 4326.366 |
| AR 3 | 4484.355 | 4281.054 | 4041.326 | NA | NA | 4325.394 |
| AR 4 | 4442.819 | 4233.119 | 4015.794 | 4018.74 | 3983.809 | 3982.459 |
| AR 5 | 4372.93 | 4179.788 | 3982.798 | 3972.499 | 4021.77 | 4013.822 |
According to the AIC table, the \(ARIMA(5,1,3)\) model has the minimum AIC value, yet following this criterion blindly tends to produce overly complex models. Looking more carefully, there is a sharp decrease in AIC at \(ARIMA(2,1,2)\), which might be a more appropriate model to fit. We will compare the candidate models in the Diagnostics section.
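To double-check this reading of the table, the minimum entry can also be located programmatically; a minimal sketch reusing the `aic_table_arima` helper above (the `arima_aic` name is ours, not from the original analysis):

## find the (AR, MA) cell with the smallest AIC; indices are order + 1
arima_aic = aic_table_arima(ice$Prod, 5, 5)
which(arima_aic == min(arima_aic, na.rm = TRUE), arr.ind = TRUE)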
Having selected the non-seasonal ARIMA orders, we now consider the seasonal part. Based on the earlier analysis, we set \(D=1\) with period 12 and calculate the AIC table.
aic_table_sarima = function(data, P, Q){
  table = matrix(NA, (P + 1), (Q + 1))
  for (p in 0:P){
    for (q in 0:Q){
      try({
        table[p + 1, q + 1] = arima(data,
                                    order = c(2, 1, 2),
                                    seasonal = list(order = c(p, 1, q), period = 12))$aic
      })
    }
  }
  dimnames(table) = list(paste('SAR', 0:P, sep = ' '), paste('SMA', 0:Q, sep = ' '))
  table
}

aic_table_sarima(ice$Prod, 3, 3)
|       | SMA 0 | SMA 1 | SMA 2 | SMA 3 |
|-------|-------|-------|-------|-------|
| SAR 0 | 3538.415 | 3321.604 | 3323.443 | 3325.039 |
| SAR 1 | 3393.053 | 3323.431 | 3325.316 | 3326.864 |
| SAR 2 | 3359.898 | 3325.065 | 3326.942 | 3323.571 |
| SAR 3 | 3344.342 | 3326.466 | 3349.119 | 3324.304 |
For the \(ARIMA(2,1,2)\) model, the seasonal part \((P,Q)_{12}=(0,1)_{12}\) gives the smallest AIC value, while \((P,Q)_{12}=(1,1)_{12}\) comes very close. We will determine which model is more stable and has better properties in the Diagnostics section.
In this section we analyze the fit of each of the three final models through visualizations, testing the residual assumptions, checking causality and invertibility, and applying Wilks' approximation.
Below are plots comparing the true and fitted values of the data. All of the models fit the data reasonably well from a visual evaluation, so we will have to take a closer look at the residuals.
cap_fig11 = paste(
  "**Figure 11.** *ARIMA(1,1,0) model fitted values (red) and original time series (black).*"
)

arima110 <- Arima(ice$Prod, order = c(1, 1, 0))

ice %>%
  ggplot(aes(x = Date, y = Prod)) +
  geom_line() +
  geom_line(aes(y = fitted(arima110)),
            col = "tomato3") +
  xlab("Month") +
  ylab("Ice-cream Production") +
  theme_bw()
cap_fig12 = paste(
  "**Figure 12.** *SARIMA(2,1,2)x(0,1,1) model fitted values (red) and original time series (black).*"
)

sarma212011 <- Arima(ice$Prod, order = c(2, 1, 2), seasonal = list(order = c(0, 1, 1), period = 12))

ice %>%
  ggplot(aes(x = Date, y = Prod)) +
  geom_line() +
  geom_line(aes(y = fitted(sarma212011)),
            col = "tomato3") +
  xlab("Month") +
  ylab("Ice-cream Production") +
  theme_bw()
cap_fig13 = paste(
  "**Figure 13.** *SARIMA(2,1,2)x(1,1,1) model fitted values (red) and original time series (black).*"
)

sarma212111 <- Arima(ice$Prod, order = c(2, 1, 2), seasonal = list(order = c(1, 1, 1), period = 12))

ice %>%
  ggplot(aes(x = Date, y = Prod)) +
  geom_line() +
  geom_line(aes(y = fitted(sarma212111)),
            col = "tomato3") +
  xlab("Month") +
  ylab("Ice-cream Production") +
  theme_bw()
## Difference between the fitted values of the two SARIMA models
res <- fitted(sarma212011) - fitted(sarma212111)
plot(res)
We assume the residuals form a normally distributed white-noise series. We can break this down into three core assumptions:

- the residuals are centered around 0,
- the residuals are uncorrelated, and
- the residuals are normally distributed.
We will see below that each model struggles with at least one assumption.
cap_fig14 = paste(
  "**Figure 14.** *Models' residuals.*",
  "Green model is covered by the blue model."
)

tibble(Date = ice$Date,
       r1 = arima110$resid,
       r2 = sarma212011$resid,
       r3 = sarma212111$resid) %>%
  ggplot(aes(x = Date)) +
  geom_line(aes(y = r1, color = "ARIMA (1,1,0)"), alpha = 0.5) +
  geom_smooth(aes(y = r1, color = "ARIMA (1,1,0)"), alpha = 0.5,
              method = 'loess', formula = y ~ x) +
  geom_line(aes(y = r2, color = "SARIMA (2,1,2)x(0,1,1)"), alpha = 0.5) +
  geom_smooth(aes(y = r3, color = "SARIMA (2,1,2)x(1,1,1)"),
              alpha = 0.5, method = 'loess', formula = y ~ x) +
  geom_line(aes(y = r3, color = "SARIMA (2,1,2)x(1,1,1)"), alpha = 0.5) +
  scale_colour_manual("", breaks = c("ARIMA (1,1,0)", "SARIMA (2,1,2)x(0,1,1)",
                                     "SARIMA (2,1,2)x(1,1,1)"),
                      values = c("red", "green", "blue")) +
  theme_bw() +
  ylab("Residuals")
The residuals from each of the models seem to be centered around 0.
By plotting the residuals of all the models together, we can see that the SARIMA models' residuals are much smaller, indicating a better fit. The two SARIMA models produce nearly identical residuals, meaning their fitted values are essentially the same.
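A quick numerical check of how close the two SARIMA fits are (reusing the fitted objects from above):

## maximum absolute difference between the two SARIMA models' residuals;
## a value near zero indicates the fits are nearly indistinguishable
max(abs(sarma212011$residuals - sarma212111$residuals))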
We assume the residuals are uncorrelated in all of our models. To check whether this assumption holds, we examine the autocorrelation plots of each model's residuals[5]. Furthermore, we can test numerically whether the correlation differs from 0 using the Ljung-Box-Pierce chi-square test[6].
\[\hat{Q} = N(N+2)\sum_{k=1}^{L(N)}\frac{1}{N-k}|\hat{\rho}_k|^2\]
where, under the null hypothesis that the series is independent,
\[\hat{Q} \sim \chi^2\left(L(N)-p-q\right).\]
Below, we see that the ARIMA model's residuals show significant correlation at numerous lags, as well as oscillating behavior. The SARIMA models fare better, but still exhibit signs of correlated residuals. All models have at least one lag that exceeds the 5% threshold (blue line) and test statistics that reject the null hypothesis.
cap_fig15 = paste(
  "**Figure 15.** *ACF plot for ARIMA(1,1,0) residuals*"
)

acf(arima110$residuals, main = "Residuals Autocorrelation")

for(l in c(12, 24, 36, 48, 60)){
  print(Box.test(arima110$residuals, lag = l, type = "Ljung-Box", fitdf = 1))
}
|             | Lag 12 | Lag 24 | Lag 36 | Lag 48 | Lag 60 |
|-------------|--------|--------|--------|--------|--------|
| \(\hat{Q}\) | 615.13 | 1214 | 1793.6 | 2387.2 | 2988.6 |
| p-value     | 0 | 0 | 0 | 0 | 0 |
cap_fig16 = paste(
  "**Figure 16.** *ACF plot for SARIMA(2,1,2)x(0,1,1) residuals*"
)

acf(sarma212011$residuals, main = "Residuals Autocorrelation")

for(l in c(12, 24, 36, 48, 60)){
  print(Box.test(sarma212011$residuals, lag = l, type = "Ljung-Box", fitdf = 5))
}
|             | Lag 12 | Lag 24 | Lag 36 | Lag 48 | Lag 60 |
|-------------|--------|--------|--------|--------|--------|
| \(\hat{Q}\) | 25.957 | 47.688 | 61.793 | 79.181 | 104.67 |
| p-value     | 0 | 0 | 0 | 0 | 0 |
cap_fig17 = paste(
  "**Figure 17.** *ACF plot for SARIMA(2,1,2)x(1,1,1) residuals*"
)

acf(sarma212111$residuals, main = "Residuals Autocorrelation")

for(l in c(12, 24, 36, 48, 60)){
  print(Box.test(sarma212111$residuals, lag = l, type = "Ljung-Box", fitdf = 6))
}
|             | Lag 12 | Lag 24 | Lag 36 | Lag 48 | Lag 60 |
|-------------|--------|--------|--------|--------|--------|
| \(\hat{Q}\) | 25.696 | 47.577 | 61.772 | 79.027 | 104.2 |
| p-value     | 0 | 0 | 0 | 0 | 0 |
We can check for normality by comparing the sample quantiles to the theoretical quantiles of a normal distribution. If the points fall close to the line in the quantile-quantile (QQ) plot, the assumption of normally distributed residuals is supported.
cap_fig18 = paste(
  "**Figure 18.** *QQ-plot for ARIMA(1,1,0) residuals*"
)

qqnorm(arima110$residuals, main = "QQ-Plot: Residuals")
qqline(arima110$residuals)

cap_fig19 = paste(
  "**Figure 19.** *QQ-plot for SARIMA(2,1,2)x(0,1,1) residuals*"
)

qqnorm(sarma212011$residuals, main = "QQ-Plot: Residuals")
qqline(sarma212011$residuals)

cap_fig20 = paste(
  "**Figure 20.** *QQ-plot for SARIMA(2,1,2)x(1,1,1) residuals*"
)

qqnorm(sarma212111$residuals, main = "QQ-Plot: Residuals")
qqline(sarma212111$residuals)
The ARIMA model's residuals fit the normal distribution well. The SARIMA models' residuals have heavier left tails than expected for a normally distributed series, but still fit reasonably well.
For our models to be well behaved, they should be both causal and invertible. Causality means that the series can be represented as a linear process (a linear combination of present and past white noise); invertibility means that the noise sequence can itself be represented as a linear process of the observations.
To test for these properties, we plot the inverse characteristic roots of the AR and MA polynomials[7]. We expect the inverse roots to lie within the unit circle.
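The same check can also be done numerically. Below is a minimal sketch using base R's `polyroot` on the `arima110` fit; the coefficient filtering with `grepl` is our own helper logic, not part of the original analysis:

## extract the AR coefficients from the fitted model
ar_coefs <- coef(arima110)[grepl("^ar", names(coef(arima110)))]
## roots of the AR polynomial 1 - phi_1 z - ... - phi_p z^p
ar_roots <- polyroot(c(1, -ar_coefs))
abs(ar_roots)      # moduli > 1 indicate causality
abs(1 / ar_roots)  # equivalently, inverse roots fall inside the unit circle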
cap_fig21 = paste(
  "**Figure 21.** *Inverse AR roots and inverse MA roots displayed in a complex plane.*"
)

plot(arima110, type = "both")

cap_fig22 = paste(
  "**Figure 22.** *Inverse AR roots and inverse MA roots displayed in a complex plane.*"
)

plot(sarma212011, type = "both")

cap_fig23 = paste(
  "**Figure 23.** *Inverse AR roots and inverse MA roots displayed in a complex plane.*"
)

plot(sarma212111, type = "both")
All models have their inverse roots within the unit circle, so causality and invertibility hold for each.
Here we compare each of the models using Wilks' approximation[4]. We want to check whether the models are significantly different from one another to warrant the added complexity.
\[\begin{equation} \begin{split} H^{<0>} &: \theta \in \Theta^{<0>} = (\mu, \sigma^2, \phi_{1})\\ H^{<1>} &: \theta \in \Theta^{<1>} = (\mu, \sigma^2, \phi_{1:2}, \psi_{1:2},\Psi_{1})\\ \ell^{<1>} - \ell^{<0>} &\approx \frac{1}{2}\chi^{2}_{4} \end{split} \end{equation}\]
log_diff = 2 * (sarma212011$loglik - arima110$loglik)
chisq_cutoff = qchisq(0.95, df = 4)
diagnose = ifelse(log_diff > chisq_cutoff,
                  "The two models are significantly different",
                  "The two models are NOT significantly different")
print(diagnose)
## [1] "The two models are significantly different"
\[\begin{equation} \begin{split} H^{<0>} &: \theta \in \Theta^{<0>} = (\mu, \sigma^2, \phi_{1})\\ H^{<1>} &: \theta \in \Theta^{<1>} = (\mu, \sigma^2, \phi_{1:2}, \psi_{1:2}, \Phi_{1}, \Psi_{1})\\ \ell^{<1>} - \ell^{<0>} &\approx \frac{1}{2}\chi^{2}_{5} \end{split} \end{equation}\]
log_diff = 2 * (sarma212111$loglik - arima110$loglik)
chisq_cutoff = qchisq(0.95, df = 5)
diagnose = ifelse(log_diff > chisq_cutoff,
                  "The two models are significantly different",
                  "The two models are NOT significantly different")
print(diagnose)
## [1] "The two models are significantly different"
\[\begin{equation} \begin{split} H^{<0>} &: \theta \in \Theta^{<0>} = (\mu, \sigma^2, \phi_{1:2}, \psi_{1:2},\Psi_{1})\\ H^{<1>} &: \theta \in \Theta^{<1>} = (\mu, \sigma^2, \phi_{1:2}, \psi_{1:2}, \Phi_{1}, \Psi_{1})\\ \ell^{<1>} - \ell^{<0>} &\approx \frac{1}{2}\chi^{2}_{1} \end{split} \end{equation}\]
log_diff = 2 * (sarma212111$loglik - sarma212011$loglik)
chisq_cutoff = qchisq(0.95, df = 1)
diagnose = ifelse(log_diff > chisq_cutoff,
                  "The two models are significantly different",
                  "The two models are NOT significantly different")
print(diagnose)
## [1] "The two models are NOT significantly different"
In this analysis, we examined US Ice Cream production data by month. We observed both a seasonal pattern and a non-stationary trend in the data, which we addressed through differencing, spectral analysis, and seasonal model parameters. We identified three promising models and performed extensive diagnostics on each.
We decided the SARIMA(2,1,2)x(0,1,1) model was the simplest model that achieved reasonable performance. There is still some correlation and heavy-tailedness in the residuals, which could affect forecasts and prediction intervals produced by this model; higher-order pairings of \(p\) and \(q\) did not resolve the residual correlation for either the ARIMA or the SARIMA models.
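As a final illustration of how the chosen model would be used, here is a minimal forecasting sketch (assuming the forecast package, whose `Arima()` function was used to fit `sarma212011` above):

## 12-month-ahead forecast with prediction intervals from the selected model
fc <- forecast(sarma212011, h = 12)
plot(fc)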
The components of this model fit our intuition about patterns in ice cream production: ice cream is produced and consumed more heavily in the summer months, demand has declined recently as healthier alternative products have come to market, and production levels depend on those of previous years.