First, I am interested in whether there is any seasonal variation. In general, the crop is planted in the spring, grows during the summer, and is harvested in the fall. “Wheat markets have a tendency to decline between spring and the July harvest, then begin to rise from these harvest lows into fall and winter.” So my guess is that there might be one to two cycles per year.
Second, I would like to know whether there is a trend in the wheat price: is it increasing or decreasing over time? I think economic growth or human dietary habits will have an impact on the price of grains.
Then I would like to fit a suitable model for the wheat price data.
I chose the daily wheat price dataset covering 2009-10-14 to 2018-03-12. There is more than one price record per month, so I average the wheat prices within each month to get one data point per month from 2009-10 to 2018-03, for 102 data points in total.
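A minimal sketch of this monthly aggregation is shown below; the raw file name and the column names Date and Price are hypothetical, since the exact layout of the Kaggle file is not reproduced here.
library(dplyr)
# Hypothetical file and column names; the Kaggle file layout may differ.
data_wheat <- read.csv("daily_wheat_price.csv", stringsAsFactors = FALSE)
data_wheat_filtered <- data_wheat %>%
  mutate(Month = format(as.Date(Date), "%Y%m")) %>%  # e.g. "201209"
  group_by(Month) %>%
  summarise(AvgPrice = mean(Price)) %>%              # one point per month
  ungroup()                                          # 102 rows in total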
summary(data_wheat_filtered$AvgPrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 402.0 479.4 548.1 585.8 689.7 890.4
data_wheat_filtered$Month[which.max(data_wheat_filtered$AvgPrice)]
## [1] "201209"
We can see that the wheat price ranges from 402 to 890.4, and the mean price is 585.8. The price reaches its peak in 2012/09. Moreover, a report shows that grain prices rose in 2012 because of a severe drought. Although the increase in the US wheat price was not as large, the export wheat price advanced because of a reduction in world wheat production.
The blue line in Figure 1 is the mean of the overall data.
Looking at the data, a stationary model does not seem appropriate: the mean and variance are not constant over time. The variance appears somewhat larger in the first half of the period, and there are fluctuations throughout the data.
The simplest model that makes sense appears to be a cubic trend. We fit a cubic trend to the data (shown as the red line in Figure 2).
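The fit summarized below can be reproduced with a call like the following, assuming indicator is a simple 1, 2, ..., 102 time index (an assumption based on the Call shown in the output).
data_wheat_filtered$indicator <- seq_along(data_wheat_filtered$AvgPrice)
lm_cubic <- lm(AvgPrice ~ indicator + I(indicator^2) + I(indicator^3),
               data = data_wheat_filtered)
summary(lm_cubic)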
##
## Call:
## lm(formula = AvgPrice ~ indicator + I(indicator^2) + I(indicator^3),
## data = data_wheat_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.560 -28.984 -2.361 30.716 156.195
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 433.190334 26.788558 16.171 < 2e-16 ***
## indicator 23.596669 2.241420 10.528 < 2e-16 ***
## I(indicator^2) -0.530041 0.050434 -10.510 < 2e-16 ***
## I(indicator^3) 0.002967 0.000322 9.216 6.09e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.17 on 98 degrees of freedom
## Multiple R-squared: 0.7641, Adjusted R-squared: 0.7569
## F-statistic: 105.8 on 3 and 98 DF, p-value: < 2.2e-16
We can use the Augmented Dickey-Fuller (ADF) test to check for stationarity. The test indicates the series is not stationary, since the p-value = 0.088 > 0.05, so we cannot reject the null hypothesis of a unit root.
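The output below matches the format of tseries::adf.test; a call along these lines would produce it, assuming ts_dat holds the monthly average prices (the start date and frequency are assumptions).
library(tseries)
ts_dat <- ts(data_wheat_filtered$AvgPrice, start = c(2009, 10), frequency = 12)
adf.test(ts_dat)  # default lag order trunc((n-1)^(1/3)) = 4 for n = 102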
##
## Augmented Dickey-Fuller Test
##
## data: ts_dat
## Dickey-Fuller = -3.2245, Lag order = 4, p-value = 0.08756
## alternative hypothesis: stationary
In the ACF plot below, the autocorrelation exceeds the dashed lines at many lags, so the cubic model may not be a good fit for our data.
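The residual ACF in question can be produced from the cubic fit, for example as follows (assuming the lm_cubic object from the sketch above).
acf(resid(lm_cubic), main = "ACF of cubic-trend residuals")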
We would like to detrend our data.
We can see there is still an obvious trend after the log transformation. A cubic trend is also fitted to the log-transformed data, shown in Figure 4. The log transformation did not change the shape of the data much; only its range decreases.
In the ACF plot (Figure 5), the autocorrelation exceeds the dashed lines at many lags, so the cubic model does not look like a good fit for the log-transformed data either.
The slow decay in residual correlation shown in the ACF plot below is a sign that differencing may be required.
The spectrum plot (Figure 6) shows no dominant peak, so it gives no evidence of seasonality.
We therefore take the first difference of the log prices: \(y_n = \Delta \log x_n = \log x_n - \log x_{n-1}\)
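In R, this transformation and the sample mean of the differenced series (a plausible source for the \(\mu\) quoted later) can be computed as follows; log_price_ts matches the name used in the model fit below, while the ts construction is an assumption.
log_price_ts <- ts(log(data_wheat_filtered$AvgPrice),
                   start = c(2009, 10), frequency = 12)
diff_log_price <- diff(log_price_ts)   # y_n = log x_n - log x_{n-1}
mean(diff_log_price)                   # roughly -0.0004, the mu quoted below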
After the transformation, the series looks more stationary, with a more constant mean. Although the variance looks larger around the 35th data point, we will do further analysis to see whether that point has a large influence on our decision. The ACF plot (Figure 8) has all autocorrelations within the acceptance region, which is a good sign for further analysis.
Next, we’ll start our model selection to find a suitable model for our data after transformation.
We’ll look at the frequency domain to see whether there is any seasonality in our data. In the spectrum plots (Figure 9 and Figure 10), there is no obvious peak, and the y-axis scale is small.
We will treat our transformed data as stationary under our null hypothesis.
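The two spectrum plots could be produced along these lines; the smoothing spans are an assumption, not necessarily the values actually used.
spectrum(diff_log_price, main = "Unsmoothed periodogram")
spectrum(diff_log_price, spans = c(3, 5, 3), main = "Smoothed periodogram")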
Then I choose to fit an \(ARIMA(p,1,q)\) model for the nonstationary monthly data, given by \[\phi (B) \big((1-B) Y_n - \mu\big) = \psi(B) \epsilon_n\] where \(\{\epsilon_n\}\) is a white noise process and \(\mu = -0.0003993\),
\[ \phi (B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_pB^p\] \[ \psi(B) = 1 + \psi_1 B + \psi_2 B^2 + \cdots + \psi_q B^q\]
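The AIC table below can be generated by looping over (p, q) and fitting ARIMA(p, 1, q) to the log prices; this is a sketch following the standard pattern from the lecture notes, not necessarily the exact code used.
aic_table <- function(data, P, Q) {
  table <- matrix(NA, P + 1, Q + 1)
  for (p in 0:P) {
    for (q in 0:Q) {
      table[p + 1, q + 1] <- arima(data, order = c(p, 1, q))$aic
    }
  }
  dimnames(table) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  table
}
wheat_aic_table <- aic_table(log_price_ts, 4, 4)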
|     | MA0     | MA1     | MA2     | MA3     | MA4     |
|-----|---------|---------|---------|---------|---------|
| AR0 | -253.76 | -254.88 | -253.37 | -254.26 | -252.38 |
| AR1 | -254.19 | -253.04 | -252.89 | -252.35 | -250.39 |
| AR2 | -254.48 | -252.94 | -252.71 | -250.81 | -248.87 |
| AR3 | -253.39 | -251.98 | -250.81 | -249.62 | -247.66 |
| AR4 | -252.17 | -250.27 | -250.35 | -249.15 | -245.63 |
From the AIC table we can see that ARIMA(0, 1, 1) has the smallest AIC. ARIMA(2, 1, 0), ARIMA(0, 1, 3), and ARIMA(1, 1, 0) have AIC values close to this minimum, but ARIMA(0, 1, 1) is a simple model with the lowest AIC, so we consider the ARIMA(0, 1, 1) model for now.
arima011 = arima(log_price_ts, order = c(0,1,1))
arima011$coef
## ma1
## 0.1908186
The fitted ARIMA(0, 1, 1) has MA coefficient 0.191, so the model is \(((1-B)Y_n -\mu) = (1+ 0.191B)\epsilon_n\), where \(\mu = -0.0003993\).
Now we check whether the ARIMA(0, 1, 1) model meets the model assumptions.
First, we check the residuals. The residuals appear evenly distributed around zero, with no obvious pattern. The ACF plot shows no autocorrelations exceeding the dashed lines, which is consistent with \(\epsilon_n\) being white noise, and the Ljung-Box test below (p-value = 0.63) gives no evidence against the null hypothesis that the residuals are uncorrelated. So the model seems like a pretty good fit.
forecast::checkresiduals(arima011)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 7.0613, df = 9, p-value = 0.6307
##
## Model df: 1. Total lags used: 10
We simulate the fitted ARIMA(0, 1, 1) model with \(\mu = -0.0003993\).
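A sketch of such a simulation, re-using the fitted MA coefficient and innovation variance; the drift handling, starting value, and seed are assumptions.
set.seed(531)
mu  <- -0.0003993                     # mean of the differenced log prices
ma1 <- unname(arima011$coef["ma1"])   # fitted MA(1) coefficient, about 0.191
n   <- length(log_price_ts)
# Simulate the differenced series as MA(1) noise around mu, then integrate
sim_diff <- mu + arima.sim(model = list(ma = ma1), n = n - 1,
                           sd = sqrt(arima011$sigma2))
sim_log  <- cumsum(c(log_price_ts[1], sim_diff))
plot(exp(sim_log), type = "l", xlab = "Month index",
     ylab = "Simulated wheat price",
     main = "Simulation from the fitted ARIMA(0,1,1)")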
The original data cover a large range of prices, and a cubic trend is visible. Although the trend appears to be decreasing, there is a small increase at the end of the data, so I would not say the price of wheat will continue to decrease in the future. Moreover, the periodogram shows no obvious peak, so we assume there is no seasonality, even though my initial guess was that wheat prices would be seasonal.
After transforming the data and comparing candidate models by AIC, I chose an ARIMA(0, 1, 1) model. The diagnostics suggest that the selected model fits our data reasonably well.
Data source: https://www.kaggle.com/nickwong64/daily-wheat-price
Drought Report: https://www.bls.gov/opub/btn/volume-1/impact-of-the-drought-on-corn-exports-paying-the-price.htm
Seasonal decaying acf: https://medium.com/@kfoofw/seasonal-lags-sarima-model-fa671a858729
SARIMA Model Parameters: https://towardsdatascience.com/time-series-forecasting-with-a-sarima-model-db051b7ae459
Past projects: https://ionides.github.io/531w16/midterm_project/project5/mid_term.html and https://ionides.github.io/531w16/midterm_project/project19/crude_oil_price.html
Lecture Notes and Script for main analysis.