First, I am interested in whether there is any seasonal variation. In general, the crop is planted in the spring, grows during the summer, and is harvested in the fall. “Wheat markets have a tendency to decline between spring and the July harvest, then begin to rise from these harvest lows into fall and winter.” So my guess is that there might be one to two cycles per year.
Second, I would like to know whether there is a trend in the wheat price: is it increasing or decreasing over time? I think economic growth or human dietary habits will have an impact on the price of grains.
Then I would like to fit a suitable model for the wheat price data.
I chose the daily wheat price dataset covering 2009-10-14 to 2018-03-12. There is more than one price record per month, so I average the wheat prices within each month to get one data point per month from 2009-10 to 2018-03, for 102 data points in total.
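A minimal sketch of this monthly aggregation is shown below; the raw file name and the column names Date and Price are hypothetical, since the exact layout of the Kaggle file is not reproduced here.
library(dplyr)
# Hypothetical file and column names; the Kaggle file layout may differ.
data_wheat <- read.csv("daily_wheat_price.csv", stringsAsFactors = FALSE)
data_wheat_filtered <- data_wheat %>%
  mutate(Month = format(as.Date(Date), "%Y%m")) %>%  # e.g. "201209"
  group_by(Month) %>%
  summarise(AvgPrice = mean(Price)) %>%              # one point per month
  ungroup()                                          # 102 rows in total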
summary(data_wheat_filtered$AvgPrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 402.0 479.4 548.1 585.8 689.7 890.4
data_wheat_filtered$Month[which.max(data_wheat_filtered$AvgPrice)]
## [1] "201209"
We can see that the wheat price ranges from 402 to 890.4, and the mean price is 585.8. The price reaches its peak in 2012/09. Moreover, a report shows that grain prices rose in 2012 because of a severe drought. Although the increase in the US wheat price was not as large, the export wheat price advanced because of a reduction in world wheat production.
The blue line in Figure 1 is the mean of the overall data.
Looking at the data, a stationary model does not seem appropriate: the mean and variance are not constant over time. The variance appears somewhat larger in the first half of the period, and there are fluctuations throughout the data.
The simplest model that makes sense appears to be a cubic trend. We fit a cubic trend to the data (shown as the red line in Figure 2).
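The fit summarized below can be reproduced with a call like the following, assuming indicator is a simple 1, 2, ..., 102 time index (an assumption based on the Call shown in the output).
data_wheat_filtered$indicator <- seq_along(data_wheat_filtered$AvgPrice)
lm_cubic <- lm(AvgPrice ~ indicator + I(indicator^2) + I(indicator^3),
               data = data_wheat_filtered)
summary(lm_cubic)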
##
## Call:
## lm(formula = AvgPrice ~ indicator + I(indicator^2) + I(indicator^3),
## data = data_wheat_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.560 -28.984 -2.361 30.716 156.195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 433.190334 26.788558 16.171 < 2e-16 ***
## indicator 23.596669 2.241420 10.528 < 2e-16 ***
## I(indicator^2) -0.530041 0.050434 -10.510 < 2e-16 ***
## I(indicator^3) 0.002967 0.000322 9.216 6.09e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.17 on 98 degrees of freedom
## Multiple R-squared: 0.7641, Adjusted R-squared: 0.7569
## F-statistic: 105.8 on 3 and 98 DF, p-value: < 2.2e-16
We can use the Augmented Dickey-Fuller (ADF) test to check for stationarity. The test indicates the series is not stationary, since the p-value = 0.088 > 0.05, so we cannot reject the null hypothesis of a unit root.
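The output below matches the format of tseries::adf.test; a call along these lines would produce it, assuming ts_dat holds the monthly average prices (the start date and frequency are assumptions).
library(tseries)
ts_dat <- ts(data_wheat_filtered$AvgPrice, start = c(2009, 10), frequency = 12)
adf.test(ts_dat)  # default lag order trunc((n-1)^(1/3)) = 4 for n = 102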
##
## Augmented Dickey-Fuller Test
##
## data: ts_dat
## Dickey-Fuller = -3.2245, Lag order = 4, p-value = 0.08756
## alternative hypothesis: stationary
In the ACF plot below, the autocorrelation exceeds the dashed lines at many lags, so the cubic model may not be a good fit for our data.
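The residual ACF in question can be produced from the cubic fit, for example as follows (assuming the lm_cubic object from the sketch above).
acf(resid(lm_cubic), main = "ACF of cubic-trend residuals")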
We would like to detrend our data.
We can see there is still an obvious trend after the log transformation. A cubic trend is also fitted to the log-transformed data, shown in Figure 4. The log transformation did not change the shape of the data much; only its range decreases.
In the ACF plot (Figure 5), the autocorrelation exceeds the dashed lines at many lags, so the cubic model does not look like a good fit for the log-transformed data either.
The slow decay in residual correlation shown in the ACF plot below is a sign that differencing may be required.
The spectrum plot (Figure 6) shows no dominant peak, so it gives no evidence of seasonality.
We therefore take the first difference of the log prices: \(y_n = \Delta \log x_n = \log x_n - \log x_{n-1}\)
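In R, this transformation and the sample mean of the differenced series (a plausible source for the \(\mu\) quoted later) can be computed as follows; log_price_ts matches the name used in the model fit below, while the ts construction is an assumption.
log_price_ts <- ts(log(data_wheat_filtered$AvgPrice),
                   start = c(2009, 10), frequency = 12)
diff_log_price <- diff(log_price_ts)   # y_n = log x_n - log x_{n-1}
mean(diff_log_price)                   # roughly -0.0004, the mu quoted below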
After the transformation, the series looks more stationary, with a more constant mean. Although the variance looks larger around the 35th data point, we will do further analysis to see whether that point has a large influence on our decision. The ACF plot (Figure 8) has all autocorrelations within the acceptance region, which is a good sign for further analysis.
Next, we’ll start our model selection to find a suitable model for our data after transformation.
We’ll look at the frequency domain to see whether there is any seasonality in our data. In the spectrum plots (Figure 9 and Figure 10), there is no obvious peak, and the y-axis scale is small.
We will treat our transformed data as stationary under our null hypothesis.
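The two spectrum plots could be produced along these lines; the smoothing spans are an assumption, not necessarily the values actually used.
spectrum(diff_log_price, main = "Unsmoothed periodogram")
spectrum(diff_log_price, spans = c(3, 5, 3), main = "Smoothed periodogram")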
Then I choose to fit an \(ARIMA(p,1,q)\) model for the nonstationary monthly data, given by \[\phi (B) \big((1-B) Y_n - \mu\big) = \psi(B) \epsilon_n\] where \(\{\epsilon_n\}\) is a white noise process and \(\mu = -0.0003993\),
\[ \phi (B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_pB^p\] \[ \psi(B) = 1 + \psi_1 B + \psi_2 B^2 + \cdots + \psi_q B^q\]
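The AIC table below can be generated by looping over (p, q) and fitting ARIMA(p, 1, q) to the log prices; this is a sketch following the standard pattern from the lecture notes, not necessarily the exact code used.
aic_table <- function(data, P, Q) {
  table <- matrix(NA, P + 1, Q + 1)
  for (p in 0:P) {
    for (q in 0:Q) {
      table[p + 1, q + 1] <- arima(data, order = c(p, 1, q))$aic
    }
  }
  dimnames(table) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  table
}
wheat_aic_table <- aic_table(log_price_ts, 4, 4)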
|     | MA0     | MA1     | MA2     | MA3     | MA4     |
|-----|---------|---------|---------|---------|---------|
| AR0 | -253.76 | -254.88 | -253.37 | -254.26 | -252.38 |
| AR1 | -254.19 | -253.04 | -252.89 | -252.35 | -250.39 |
| AR2 | -254.48 | -252.94 | -252.71 | -250.81 | -248.87 |
| AR3 | -253.39 | -251.98 | -250.81 | -249.62 | -247.66 |
| AR4 | -252.17 | -250.27 | -250.35 | -249.15 | -245.63 |
From the AIC table we can see that ARIMA(0, 1, 1) has the smallest AIC. ARIMA(2, 1, 0), ARIMA(0, 1, 3), and ARIMA(1, 1, 0) have AIC values close to this minimum, but ARIMA(0, 1, 1) is a simple model with the lowest AIC, so we consider the ARIMA(0, 1, 1) model for now.
arima011 = arima(log_price_ts, order = c(0,1,1))
arima011$coef
## ma1
## 0.1908186
The fitted ARIMA(0, 1, 1) has MA coefficient 0.191, so the model is \(((1-B)Y_n -\mu) = (1+ 0.191B)\epsilon_n\), where \(\mu = -0.0003993\).
Now we check whether the ARIMA(0, 1, 1) model meets the model assumptions.
First, we check the residuals. The residuals appear evenly distributed around zero, with no obvious pattern. The ACF plot shows no autocorrelations exceeding the dashed lines, which is consistent with \(\epsilon_n\) being white noise, and the Ljung-Box test below (p-value = 0.63) gives no evidence against the null hypothesis that the residuals are uncorrelated. So the model seems like a pretty good fit.
forecast::checkresiduals(arima011)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 7.0613, df = 9, p-value = 0.6307
##
## Model df: 1. Total lags used: 10
We simulate the fitted ARIMA(0, 1, 1) model with \(\mu = -0.0003993\).
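A sketch of such a simulation, re-using the fitted MA coefficient and innovation variance; the drift handling, starting value, and seed are assumptions.
set.seed(531)
mu  <- -0.0003993                     # mean of the differenced log prices
ma1 <- unname(arima011$coef["ma1"])   # fitted MA(1) coefficient, about 0.191
n   <- length(log_price_ts)
# Simulate the differenced series as MA(1) noise around mu, then integrate
sim_diff <- mu + arima.sim(model = list(ma = ma1), n = n - 1,
                           sd = sqrt(arima011$sigma2))
sim_log  <- cumsum(c(log_price_ts[1], sim_diff))
plot(exp(sim_log), type = "l", xlab = "Month index",
     ylab = "Simulated wheat price",
     main = "Simulation from the fitted ARIMA(0,1,1)")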
The original data cover a large range of prices, and a cubic trend is visible. Although the trend appears to be decreasing, there is a small increase at the end of the data, so I would not say the price of wheat will continue to decrease in the future. Moreover, the periodogram shows no obvious peak, so we assume there is no seasonality, even though my initial guess was that wheat prices would be seasonal.
After transforming the data and comparing candidate models by AIC, I chose an ARIMA(0, 1, 1) model. The diagnostics suggest that the selected model fits our data reasonably well.
Data source: https://www.kaggle.com/nickwong64/daily-wheat-price
Drought Report: https://www.bls.gov/opub/btn/volume-1/impact-of-the-drought-on-corn-exports-paying-the-price.htm
Seasonal decaying acf: https://medium.com/@kfoofw/seasonal-lags-sarima-model-fa671a858729
SARIMA Model Parameters: https://towardsdatascience.com/time-series-forecasting-with-a-sarima-model-db051b7ae459
Past projects: https://ionides.github.io/531w16/midterm_project/project5/mid_term.html and https://ionides.github.io/531w16/midterm_project/project19/crude_oil_price.html
Lecture Notes and Script for main analysis.