Introduction

Keeping track of the food price on industrial scale is an important measure to regulate the food market. The Mandatory Price Reporting Act of 2010 requires the USDA to release dairy product sales information on or before Wednesday at 3:00 pm EST (unless affected by a Federal Holiday). This weekly reported price information offers insight into time trend of several dairy products in the past few years.

The data is collected from Kaggle and belongs to USDA. The data contain prices (dollars per pound) of five different kinds of dairy products from March 2012 to December 2017. This project zooms in on the price of butter alone.

Data Cleaning and Exploration

The data cleaning addresses three issues with the raw data. The first issue is duplicate entries in the dataset that need to be removed. The second issue is multiple entries of prices taken on the same day. A plausible solution is to aggregate these prices and take the average. The third issue is that some adjacent time points don’t differ by 7 days. A simple approach is to only retain those time points whose date differs from the very first day in the dataset (2012/03/03) by multiples of 7.

Data processing retains 286 weekly data points from 2012/03/03 to 2017/08/19. The points are plotted below

The line plot shows strong trend in the data (not simply white noise). There are large short term fluctuations together with a seemingly increasing trend on a global scale.

Methodology

Based on the plot of the raw prices above, I decide to first decompose the price trend into a long term trend, a short term trend and noise using loess smoothing with two different spans. For the short term trend, I plan to carry out spectral analysis and identify potential significant frequencies. For the long term trend, I plan to use ARMA models and their derivatives(SARMA, regression with ARMA errors) to explain the time series trend parametrically.

Results

Spectral Analysis for Short Term Trend

First the signal is separated into long term trend, short term trend and noise. The long term trend is acquired by loess smoothing at a span of 0.5. The noise is acquired by subtracting the original signal by the signal after loess smoothing at a span of 0.1. The choices of the spans follow the practice in class. The short term signal is calculated by subtracting the original signal by both the long term trend and the noise.

The spectrum of the short term signal is plotted below with the period set to be a year (52 weeks). A red threshold line at \(5\times 10^{-3}\) is added for reference given that the smallest unit of price is one cent.

On this spectrum, the peak of the spectrum has a density of 0.018 at 0.906 cycles per year. It indicates that there is a notable periodicity of one cycle per year.

ARMA Models for Long Term Trend

The long term trend is already plotted in the previous section. Based on the shape of the line, a linear model is worth trying. In the plot below, the X axis is the week number with the first observation in the data on 2012/03/03 set as one.

The linear model generates a visually good fit. The summary statistic of the model confirms our guess with high R-square value.

## 
## Call:
## lm(formula = long_term ~ date)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.191091 -0.089437 -0.005452  0.064953  0.270405 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.474602   0.013411  109.96   <2e-16 ***
## date        0.003295   0.000081   40.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1131 on 284 degrees of freedom
## Multiple R-squared:  0.8535, Adjusted R-squared:  0.853 
## F-statistic:  1654 on 1 and 284 DF,  p-value: < 2.2e-16

Although the R square is very high, the scatterplot and the autocorrelation plot of the residuals tell us that the residuals violate the constant variance and independence assumption.

Given that the autocorrelation at small lags are very large, ARMA models on differences of adjacent residuals could be helpful for stablizing the autocorrelation. I carry out a grid search for ideal AR components and MA components of a candidate ARIMA model, as taught in class.

MA0 MA1 MA2 MA3 MA4 MA5
AR0 -2024.192 -2409.407 -2783.383 -3095.843 -3328.633 -3487.999
AR1 -3230.896 -3590.915 -3779.325 -3852.451 -3886.078 -3900.954
AR2 -3756.354 -3930.074 -3928.415 -3926.579 -3924.950 -3923.114
AR3 -3839.539 -3928.438 -3927.592 -3924.467 -3922.782 -3920.900
AR4 -3862.532 -3926.645 -3924.497 -3925.869 -3921.937 -3921.997

The table tells us that ARIMA(2,1,1) model is the strongest candidate. The model formula is laid out below (written in lag operators). \(\{Y_{n}\}\) are the residuals from the linear model and \(\{\epsilon_{n}\}\) are assumed to be white noise.

\[(1-\phi_{1}B-\phi_{2}B^{2})[(1-B)Y_{n}]=(1+\psi_{1}B)\epsilon_{n}\]

## 
## Call:
## arima(x = slrresiduals, order = c(2, 1, 1))
## 
## Coefficients:
##          ar1      ar2     ma1
##       1.8276  -0.8416  0.9408
## s.e.  0.0314   0.0315  0.0241
## 
## sigma^2 estimated as 5.615e-08:  log likelihood = 1969.04,  aic = -3930.07

Fill in the paramters for the model

\[(1-1.828B+0.842B^{2})[(1-B)Y_{n}]=(1+0.941B)\epsilon_{n}\] in which \(\{\epsilon_{n}\}\) are a series of normal white noise with a variance of \(5.62\times 10^{-8}\).

The modulus of the two AR component roots are 1.09 and the modulus of the MA component root is 1.063. Since all the roots fall outside the unit circle, the fitted model is both causal and invertible.

The original time series(black) and the fitted data(red dashed line) almost overlap, which indicates that it is a good fit.

The autocorrelation of the residuals from the model is plotted below.

There are significant correlations at lag 18, 36, 54 and 72. This phenomenon indicates that adding a SARMA term of lag 18 could possibly help. Given that the sign of correlation is different for lag 18 and lag 36, a seasonal AR(2) term is added \(SARIMA(2,1,1)\times (2,0,0)\). The model specification is

\[(1-\phi_{1}B-\phi_{2}B^{2})(1-\phi_{3}B^{18}-\phi_{4}B^{36})[(1-B)Y_{n}]=(1+\psi_{1}B)\epsilon_{n}\]

## 
## Call:
## arima(x = slrresiduals, order = c(2, 1, 1), seasonal = list(order = c(2, 0, 
##     0), period = 18))
## 
## Coefficients:
##          ar1      ar2     ma1    sar1     sar2
##       1.7731  -0.7864  0.9243  0.3332  -0.5243
## s.e.  0.0376   0.0380  0.0334  0.0482   0.0456
## 
## sigma^2 estimated as 3.609e-08:  log likelihood = 2026.19,  aic = -4040.38

According to the output, the model is

\[(1-1.7731B+0.7864B^{2})(1-0.3332B^{18}+0.5243B^{36})[(1-B)Y_{n}]=(1+0.9243B)\epsilon_{n}\]

The AR component has two roots with modulus 1.128. The MA component has two roots with modulus 1.082. The SAR component has two roots with modulus 1.381. All the AR, SAR and MA roots are outside the unit circle, confirming that the model is causal and invertible.

In the autocorrelation plot, we can see that the autocorrelation at lag 18, 36, 54 and 72 are reduced by some amount, but lag 36 and 54 are still significant. Overall this is not a big improvement. There has to be more intricate measures to handle this seasonality, but they are beyond my planned time devoted to this project.

Conclusion

This project focuses on the weekly butter price collected on a time span of over five years. There is significant time series trend in the data. The data is decomposed into short term and long term trend. The short term trend presents a periodicity of one cycle per year with an amplitude of about 1.3 cents. The long term trend can be mostly summarized to be a linear trend. The linear trend indicates that the butter price has been increasing at 0.17 dollars per year in the past 5 years. Even though the linear model captures most of the trend in the data, the residuals after fitting the linear model show an oscillating trend. This perturbation can be explained well by an ARIMA(2,1,1) model. Even though ARIMA(2,1,1) model offers a good fit to the data, the autocorrelation plot raises a seasonal trend at a lag of 18 weeks. The seasonality trend is worth looking into, but needs extra efforts beyond the SARIMA(2,1,1)*(2,0,0) model tried in this project.

References

[1] Data source from Kaggle, uploaded by Sohier Dane [https://www.kaggle.com/sohier/weekly-dairy-product-prices]

[2] Data source from USDA [https://mpr.datamart.ams.usda.gov/menu.do?path=%5CProducts%5CDairy]

[3] The Mandatory Price Reporting Act of 2010 [https://www.congress.gov/111/plaws/publ239/PLAW-111publ239.pdf]

[4] STATS 531 Lecture Note 5, compiled by Professor Edward Ionides[https://ionides.github.io/531w20/05/notes05.pdf]

[5] STATS 531 Lecture Note 8, compiled by Professor Edward Ionides [https://ionides.github.io/531w20/08/notes08.pdf]