1 Introduction

Retail sales measure the demand for consumer goods. They reflect the economic environment of a country over a given period and are a critical indicator of a country's economy. For example, consumer spending makes up almost 70% of total U.S. economic output.[1] Retail sales can therefore play a predictive role in a country's economy. The Census Bureau divides retail sales into 13 categories; the retail sales datasets investigated in this report exclude food services. Two datasets were used in this report: the raw retail sales (excluding food services) data and the seasonally adjusted retail sales (excluding food services) data. Each dataset contains 337 monthly observations from 1992-01-01 to 2020-01-01, released in the Advance Monthly Sales for Retail and Food Services by the U.S. Census Bureau. In this report, I compare the two datasets and analyze some basic time series features of the seasonally adjusted data, including exploratory data analysis and SARMA modeling, to see how retail sales change over time.

Units: Billions of Dollars.

2 Exploratory Data Analysis

These two charts show the retail sales without seasonal adjustment and the seasonally adjusted retail sales, respectively. The blue curves are loess smoothers. As we can see, there is an increasing trend over time in each plot and a clear seasonal variation in the left one.
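As a rough illustration, plots of this kind could be produced as follows (a minimal sketch: the data frame data_ad and its columns DATE and `Retail sales` follow the lm() output in Section 3.1, while data_raw is an assumed name for the unadjusted series, and the loess smoothing choices are illustrative):

```r
# Sketch of the two smoothed retail sales plots (names data_raw/data_ad assumed)
library(ggplot2)
library(gridExtra)
p_raw <- ggplot(data_raw, aes(DATE, `Retail sales`)) +
  geom_line() +
  geom_smooth(method = "loess", colour = "blue", se = FALSE) +
  ggtitle("Retail sales, not seasonally adjusted (billions of dollars)")
p_adj <- ggplot(data_ad, aes(DATE, `Retail sales`)) +
  geom_line() +
  geom_smooth(method = "loess", colour = "blue", se = FALSE) +
  ggtitle("Retail sales, seasonally adjusted (billions of dollars)")
grid.arrange(p_raw, p_adj, ncol = 2)
```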

To better compare the raw retail sales with the seasonally adjusted retail sales, I performed a spectral analysis.

The x-axis units are cycles per year. As shown by the spectrum plot, the seasonal adjustment removes most of the signal at the seasonal frequencies.

The ratio of the periodograms of the smoothed and unsmoothed time series is called the transfer function or frequency response function of the smoother.[2] The frequency response plot shows that at least two orders of magnitude are removed from the raw data at the seasonal frequencies.
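A sketch of how the spectra and the frequency response could be estimated (the smoothing spans are illustrative, and data_raw is an assumed name for the unadjusted series):

```r
# Smoothed periodograms of the raw and seasonally adjusted series
raw_ts <- ts(data_raw$`Retail sales`, start = 1992, frequency = 12)
adj_ts <- ts(data_ad$`Retail sales`,  start = 1992, frequency = 12)
spec_raw <- spectrum(raw_ts, spans = c(3, 5, 3), plot = FALSE)
spec_adj <- spectrum(adj_ts, spans = c(3, 5, 3), plot = FALSE)

# For a monthly ts the frequency axis is in cycles per year
plot(spec_raw$freq, spec_raw$spec, type = "l", log = "y",
     xlab = "frequency (cycles per year)", ylab = "spectrum")
lines(spec_adj$freq, spec_adj$spec, col = "red")

# Frequency response of the seasonal adjustment: ratio of the two spectra
plot(spec_raw$freq, spec_adj$spec / spec_raw$spec, type = "l", log = "y",
     xlab = "frequency (cycles per year)", ylab = "spectrum ratio")
```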

As we can see from the spectrum plots, the raw retail sales data contain information at various frequencies, which makes it hard to examine the particular aspects of interest. I therefore decompose the retail sales into three parts: a low frequency trend, high frequency noise, and middle frequency long-term cycles.
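One way such a decomposition could be obtained is with loess smoothers of different spans; the sketch below uses illustrative span values, not necessarily the ones behind the report's figures:

```r
# Frequency decomposition of the unadjusted series via loess smoothers
sales <- data_raw$`Retail sales`                # unadjusted series (assumed name)
year  <- seq(1992, by = 1/12, length.out = length(sales))
low    <- loess(sales ~ year, span = 0.5)$fitted          # low frequency trend
high   <- sales - loess(sales ~ year, span = 0.1)$fitted  # high frequency noise
cycles <- sales - low - high                              # middle frequency cycles
plot(ts(cbind(sales, low, high, cycles), start = 1992, frequency = 12),
     main = "Decomposition of retail sales as trend + noise + cycles")
```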

In the frequency decomposition plot, the low frequency component provides a non-parametric estimate of the trend from 1992 to 2020; the average growth rate of retail sales is about 11.46 billion dollars per year. The high frequency component contains the noisiest part of the retail sales, part of which is driven by seasonal changes. Zooming into the high frequency plot over a two-year span shows more clearly that there is a peak in retail sales in December, with another spike around May or June in the first half of the year. This pattern recurs year after year, so the seasonal variation has a period of one year. The middle frequency plot at the bottom points to possible long-term cycles in the retail sales. From 2004 to 2008, retail sales maintained a good growth rate, but beginning in the second quarter of 2008 they declined, and it was not until the third quarter of 2009 that they began to recover. The stock market crash of 2008 happened on Sep 29, 2008, and the decline in retail sales partly reflects that economic downturn.

3 SARMA modeling of the retail sales

The high frequency components in the data are of less interest here: there are always low and high sales seasons throughout the year, and these do not change much from year to year. As shown in the first part of this report, the seasonally adjusted retail sales have these components removed, so I fit models on the seasonally adjusted data.

The exploratory data analysis above also indicates a non-constant mean function and a seasonal period of 12 in the retail sales data, so a SARMA model with period 12 is a natural candidate.

Before fitting the SARMA model, I detrend the data.

3.1 Detrending data

I fit a linear model for the trend, \(X_n = \beta_1 t_n + \beta_0\).
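A minimal sketch of this fit and of the detrended series used later (the name new$detrend matches the arima() calls further below; the fit itself corresponds to the lm() output that follows):

```r
# Linear trend fit and detrended retail sales
lm_fit <- lm(`Retail sales` ~ DATE, data = data_ad)
summary(lm_fit)
new <- data.frame(DATE = data_ad$DATE,
                  detrend = data_ad$`Retail sales` - fitted(lm_fit))
```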

## 
## Call:
## lm(formula = `Retail sales` ~ DATE, data = data_ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.044  -4.724  -0.634   7.703  24.388 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.209e+01  3.081e+00  -26.64   <2e-16 ***
## DATE         2.890e-02  2.286e-04  126.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.43 on 335 degrees of freedom
## Multiple R-squared:  0.9795, Adjusted R-squared:  0.9794 
## F-statistic: 1.598e+04 on 1 and 335 DF,  p-value: < 2.2e-16

The linear fit looks quite good except around the severe economic crisis of 2008.

3.2 SARMA model

The general form of a \(SARMA(p,q) \times (P,Q)_{12}\) model[3] is \[\phi(B)\Phi(B^{12}) (Y_n-\mu) = \psi(B)\Psi(B^{12}) \epsilon_n\] where \(\{\epsilon_n\}\) is a white noise process and

\[\mu = E[Y_n]\]

\[\phi(x)=1-\phi_1 x-\dots -\phi_px^p,\] \[ \psi(x)=1+\psi_1 x+\dots +\psi_qx^q,\] \[\Phi(x)=1-\Phi_1 x-\dots -\Phi_Px^P,\] \[\Psi(x)=1+\Psi_1 x+\dots +\Psi_Qx^Q.\] In order to select the orders p and q, I calculate the AIC values for SARMA(p,q)\(\times(1,0)_{12}\) models.

3.2.1 SARMA(p,q)\(\times(1,0)_{12}\) model
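A sketch of how the AIC table could be computed (the aic_table helper is a common pattern and an assumed name; new$detrend matches the arima() calls shown later):

```r
# AIC values for SARMA(p,q)x(1,0)_12 models fitted to the detrended series
aic_table <- function(data, P, Q) {
  table <- matrix(NA, P + 1, Q + 1,
                  dimnames = list(paste0("AR", 0:P), paste0("MA", 0:Q)))
  for (p in 0:P) {
    for (q in 0:Q) {
      table[p + 1, q + 1] <- arima(data, order = c(p, 0, q),
                                   seasonal = list(order = c(1, 0, 0),
                                                   period = 12))$aic
    }
  }
  table
}
retail_aic <- aic_table(new$detrend, 5, 5)
```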

From this AIC table, I select p = 3 and q = 2 and fit a SARMA(3,2)\(\times(1,0)_{12}\) model.

         MA0      MA1      MA2      MA3      MA4      MA5
AR0  2511.38  2212.84  2041.18  1953.51  1876.02  1856.95
AR1  1691.98  1693.14  1693.03  1694.99  1695.02  1697.02
AR2  1693.00  1692.91  1688.50  1688.28  1691.80  1691.44
AR3  1692.79  1694.66  1687.40  1698.84  1695.58  1698.39
AR4  1694.71  1688.12  1695.10  1697.21  1694.01  1698.59
AR5  1694.61  1696.68  1698.03  1697.51  1698.68  1699.71

Here are the ACF of the residuals and the model output. The ACF plot shows significant autocorrelation at lag 24, which should be attributed to the seasonal part of the model.
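A sketch of this fit and the residual ACF (the object name fit_s10 is an assumption; the arima() call matches the output below):

```r
# SARMA(3,2)x(1,0)_12 fit on the detrended series and its residual ACF
fit_s10 <- arima(new$detrend, order = c(3, 0, 2),
                 seasonal = list(order = c(1, 0, 0), period = 12))
acf(residuals(fit_s10), lag.max = 48, main = "Residual ACF")
```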

## 
## Call:
## arima(x = new$detrend, order = c(3, 0, 2), seasonal = list(order = c(1, 0, 0), 
##     period = 12))
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2     sar1  intercept
##       1.2252  0.5027  -0.7306  -0.3245  -0.6754  -0.0816     0.2565
## s.e.  0.2091  0.4111   0.2028   0.2326   0.2325   0.0552     0.8148
## 
## sigma^2 estimated as 8.213:  log likelihood = -835.7,  aic = 1687.4

3.2.2 SARMA(p,q)\(\times (1,1)_{12}\) model

In order to deal with the residual autocorrelation, I try a SARMA(p,q)\(\times (1,1)_{12}\) model. Again, I use AIC values to select the model; the AIC table favors p = 3 and q = 2.

         MA0      MA1      MA2      MA3
AR0  2483.11  2199.42  2037.66  1951.35
AR1  1681.70  1682.68  1682.40  1684.39
AR2  1682.50  1682.48  1684.26  1685.95
AR3  1682.16  1684.21  1678.18  1686.76

I plot the ACF of the residuals again. There is a significant reduction in the residual autocorrelation at lag 24.

## 
## Call:
## arima(x = new$detrend, order = c(3, 0, 2), seasonal = list(order = c(1, 0, 1), 
##     period = 12))
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2    sar1     sma1
##       1.2731  0.4140  -0.6892  -0.3762  -0.6238  0.5815  -0.7547
## s.e.  0.2750  0.5422   0.2677   0.3021   0.3020  0.1289   0.1034
##       intercept
##          0.2787
## s.e.     0.7307
## 
## sigma^2 estimated as 7.91:  log likelihood = -830.09,  aic = 1678.18

From the model output, I notice that the standard errors of the parameters are all quite large, especially for the intercept. Let's take a look at the QQ plot to see whether there is any violation of the model assumptions.
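A sketch of these diagnostics (the object name fit_s11 is an assumption; the arima() call matches the output above):

```r
# QQ plot and residual plot for the SARMA(3,2)x(1,1)_12 fit
fit_s11 <- arima(new$detrend, order = c(3, 0, 2),
                 seasonal = list(order = c(1, 0, 1), period = 12))
r <- residuals(fit_s11)
qqnorm(r); qqline(r)          # check normality of the residuals
plot(r, ylab = "residuals")   # look for outliers and changing variance
```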

The QQ plot shows that about 5% of the points deviate from the line at the two ends, but the main body follows a normal distribution well. From the residual plot, we can see a few influential points around position 200 and a very high outlier at position 120. The residual plot also shows evidence of heteroscedasticity. Based on the residual plot, I decide to apply a log transformation and then fit the SARMA model again.

3.3 SARMA(p,q)\(\times (1,1)_{12}\) model on the log-transformed data

As the growth rate of retail sales is expected to remain stable, I assume the trend is preserved under the log transformation. So I reuse the linear model fitted above and transform it to the log scale. The retail sales and the trend on the log scale are presented in the plot. I then fit a SARMA model to the difference between these two curves.
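A sketch of this step (the series name new_log$detrend matches the arima() call in the output below; the other names are assumptions):

```r
# Log-scale detrending: log retail sales minus the log of the fitted linear trend
lm_fit  <- lm(`Retail sales` ~ DATE, data = data_ad)
new_log <- data.frame(DATE = data_ad$DATE,
                      detrend = log(data_ad$`Retail sales`) - log(fitted(lm_fit)))
```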

3.3.1 SARMA(p,q)\(\times(1,1)_{12}\) model on the log-transformed detrended retail sales

From the AIC table, I select the \(SARMA(2,2) \times(1,1)_{12}\) model.

          MA0       MA1       MA2       MA3
AR0  -1391.02  -1669.60  -1821.19  -1897.36
AR1  -2156.12  -2156.84  -2155.28  -2153.54
AR2  -2157.01  -2155.16  -2159.11  -2157.14
AR3  -2155.22  -2153.63  -2151.94  -2152.98

## 
## Call:
## arima(x = new_log$detrend, order = c(2, 0, 2), seasonal = list(order = c(1, 
##     0, 1), period = 12))
## 
## Coefficients:
##          ar1      ar2      ma1     ma2    sar1     sma1  intercept
##       1.9533  -0.9559  -1.0874  0.1374  0.5747  -0.7450    -0.0002
## s.e.  0.0482   0.0462   0.0710  0.0604  0.1661   0.1361     0.0064
## 
## sigma^2 estimated as 9.104e-05:  log likelihood = 1087.55,  aic = -2159.11

Comparing this \(SARMA(2,2) \times(1,1)_{12}\) fitted on the transformed data with the previous \(SARMA(3,2) \times(1,1)_{12}\) fitted on the original scale, we can see that the standard errors of the parameters are much smaller, which is very promising.

The ACF does not show significant autocorrelation among the residuals.

The red curve shows the fit from our \(SARMA(2,2) \times(1,1)_{12}\) model, which tracks the original data closely.

These two diagnostic plots do not show strong evidence against the iid normal assumption on \(\epsilon_n\). The log transformation appears to have dealt with the heteroscedasticity problem.

In order to check the numerical stability of the model, I calculate the roots of the AR, MA, SAR and SMA polynomials, respectively.
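A sketch of this computation (fit_log is an assumed name for the fitted model whose output is shown above):

```r
# Roots of the AR, MA, SAR and SMA polynomials of the SARMA(2,2)x(1,1)_12 fit
coefs <- coef(fit_log)
polyroot(c(1, -coefs["ar1"], -coefs["ar2"]))   # AR roots
polyroot(c(1,  coefs["ma1"],  coefs["ma2"]))   # MA roots
polyroot(c(1, -coefs["sar1"]))                 # SAR root
polyroot(c(1,  coefs["sma1"]))                 # SMA root
```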

## [1] 1.021714+0.047374i 1.021714-0.047374i
## [1] 1.062172+0i 6.851046-0i
## [1] 1.740032+0i
## [1] 1.342286+0i

All the roots of the AR (SAR) polynomial and the MA (SMA) polynomial lie outside the unit circle, so this \(SARMA(2,2) \times (1,1)_{12}\) process is causal and invertible. However, the two AR polynomial roots are very close to the unit circle, so let's do a simulation study.

3.4 Simulation

As we can see from the SARMA model output, the two AR polynomial roots are very close to the unit circle, so I need to consider the stability of the numerical solutions. I therefore simulate 1000 time series using the parameters from the fitted \(SARMA(2,2) \times(1,1)_{12}\) model and plot the histogram of each re-estimated parameter.
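A sketch of one way this simulation could be set up, assuming the fitted model object is called fit_log; since arima.sim only accepts non-seasonal ARMA specifications, the seasonal polynomials are multiplied out first:

```r
# Simulation study for the SARMA(2,2)x(1,1)_12 fit (fit_log is an assumed name)
set.seed(531)
coefs  <- coef(fit_log)
sigma2 <- fit_log$sigma2
N      <- length(residuals(fit_log))

# Expand phi(B)Phi(B^12) and psi(B)Psi(B^12) into full ARMA coefficient vectors
poly_mult <- function(a, b) convolve(a, rev(b), type = "open")
full_ar <- -poly_mult(c(1, -coefs["ar1"], -coefs["ar2"]),
                      c(1, rep(0, 11), -coefs["sar1"]))[-1]
full_ma <-  poly_mult(c(1,  coefs["ma1"],  coefs["ma2"]),
                      c(1, rep(0, 11),  coefs["sma1"]))[-1]

n_sim  <- 1000
params <- matrix(NA, n_sim, length(coefs), dimnames = list(NULL, names(coefs)))
for (i in 1:n_sim) {
  y_sim <- arima.sim(list(ar = full_ar, ma = full_ma), n = N,
                     sd = sqrt(sigma2)) + coefs["intercept"]
  refit <- try(arima(y_sim, order = c(2, 0, 2),
                     seasonal = list(order = c(1, 0, 1), period = 12)),
               silent = TRUE)
  if (!inherits(refit, "try-error")) params[i, ] <- coef(refit)
}

# Histograms of the re-estimated parameters
par(mfrow = c(2, 4))
for (p in colnames(params)) hist(na.omit(params[, p]), main = p, xlab = "")
```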

As we can see from these histograms, the simulation results are very comparable to the parameter estimates from the arima function in R.

4 Conclusion

In this study, I compared the raw and seasonally adjusted retail sales datasets from the U.S. Census Bureau. I found that the seasonally adjusted data are simply the raw retail sales data with the components at the seasonal frequencies reduced by around two orders of magnitude. The \(SARMA(2,2) \times(1,1)_{12}\) model fits the log-transformed data well, and the simulation study does not reveal significant numerical problems with the R output. Overall, retail sales growth has been steady, with an average increase of about 11.46 billion dollars per year from 1992 to 2020. Retail sales fell as a result of the 2008 financial crisis, but within the next two years they returned to their original level of growth.

5 References

[1] https://www.thebalance.com/what-is-retail-sales-3305722

[2] Lecture 8 The transfer (or frequency response) function

[3] Chapter 6. Extending the ARMA model: Seasonality and trend

U.S. Census Bureau, Advance Retail Sales: Retail (Excluding Food Services) [RSXFSN], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/RSXFSN, February 29, 2020.