Introduction

Nowadays, the price of gasoline is having a great impact on the everyday life. A small fluctuation on the gasoline price would influence the whole industrial chain, including the plastic production, public transportation and the most importantly, the development of fertilizer and other chemical solutions.

In this project, we would love to study the US CPI Average Price Gasoline Data[1]. We will use methods and models that we learned in class to figure out the following questions.

Can we fit a model for gasoline prices and possibly predict future prices?

Is the monthly change of the gasoline prices stable?

Data Description

the US CPI Average Price Gasoline Data tracks the price of gasoline every month from January 1978 to January 2022. We would use the data from January 1981 to January 2022. We manipulated the rows and columns of the data in order to for better model fitting and visualization. After this transformation, there are five variables in this data:

Series ID: The series ID of the dataset, which will not used in the dataset
Year: The year for each track of price
Period: The period of each price, in the format of M(month)-month ID(01..)
Label: Year-Month for each track of price
Value: The value of each track of price in \((dollar/gallon)\)

Analysis

Exploratory Data Analysis

Normal Data

We first plot the time series first to have an overview of the data.

price <- data$Value
year <- seq(from = 1981, length = length(price), by = 1/12)
plot(year, price, type = 'l', ylab = 'Average Price of gasoline')

Fig 1: the US gasoline price by month

The plot of price shows that there exist some trend in the price of gasoline per gallon. It keeps increasing since the beginning of the new centry. But we cannot observe any monthly fluctuation on the price.

Then we include the Acf plot:

acf(price)

Fig 2: the acf plot of US gasoline price by month

All values of the acf plot is greater than 0.1, which means that the data is highly autocorrelated. Hence, it indicates that the price of gasoline is highly influenced by the previous prices. (\(i.e.\) The price will tend to keep increasing when the price of last month increased.)

Differenced Data

Since the acf plot shows autocorrelation, we would like to further look into the difference data, which is the price of the next month minus the month of the current month. The value is taken into logarithm for better visualization.

plot(year, c(0,diff(log(price))), type = "l", ylab = "Average price of gasoline")

Fig 3: the differenced plot of US gasoline price by month

We would also include the acf plot.

acf(diff(price))

Fig 4: the differenced acf plot of US gasoline price by month

The autocorrelation function for the differenced data seems more stationary.

Hence, in order to answering the two proposed questions, we are going to divide our analysis into two parts.

In the first part, we will study the normal data to discuss about the trend of the gasoline prices.
In the second part, we will study the differenced data to evaluate the changing amount of the prices.

The Trend Analysis

Linear Regression

First thing we want to do is build a linear regression model to check whether there is an obvious increasing trend.

new_df <- data.frame(cbind(year, price))
lmod_1 <- lm(price~year, data = new_df)
lm(price~year, data = new_df) %>%
  tidy() %>%
  kable(align="c",  digits = c(0, 2, 3, 2, 3))

term	estimate	std.error	statistic	p.value
(Intercept)	-118.62	4.031	-29.42	0
year	0.06	0.002	29.90	0

plot(year, price, type = 'l', ylab = 'Average Price')
lines(x = year, y = lmod_1$fitted.values, type = 'l', col = 'red')

Fig 5: the linear regression plot of US gasoline price by month

The P-value is zero indicates that both the intercept and year are statistically significant.

Residual Analysis

Then we plot the residual of the regression.

Noticed that the residual is not random, it appears to be the combinaation of some U shapes. And the Acf plot shows that the residuals are highly correlated. Though the estimated of trend vs year is significant and maybe useful, but this linear model is not good for the error term.

Acf Residual Plot

acf(resid(lmod_1))

Fig 6: the acf of the residual

Regression Residual Plot

plot(lmod_1$residuals, main = 'residual plot for regression model')
abline(h = 0, col = 'red')

Fig 7: the residual plot of the linear regression

Linear Regression with ARMA Noise

Next, we want to discuss about the linear model with ARMA noises.

aic_table <- function(data,P,Q, xreg = NULL){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p+1,q+1] <- arima(data,order=c(p,0,q), xreg = xreg)$aic
    }
  }
  dimnames(table) <- list(paste("AR",0:P, sep=""),
                          paste("MA",0:Q,sep=""))
  table
}
huron_aic_table <- aic_table(price,3,3, year)
huron_aic_table

##           MA0       MA1       MA2       MA3
## AR0  773.2293  180.3854 -233.3885 -473.7231
## AR1 -667.7206 -808.2358 -819.4180 -817.5429
## AR2 -800.6259 -817.2101 -817.5066 -820.3243
## AR3 -819.9522 -817.9555 -815.8403 -819.6854

arma11_reg <- arima(price, order = c(1, 0, 1), xreg = year)
arma11_reg

## 
## Call:
## arima(x = price, order = c(1, 0, 1), xreg = year)
## 
## Coefficients:
##          ar1     ma1  intercept    year
##       0.9538  0.5158  -115.0497  0.0585
## s.e.  0.0133  0.0338    23.2750  0.0116
## 
## sigma^2 estimated as 0.01102:  log likelihood = 409.12,  aic = -808.24

According to the AIC criteria. ARMA(1,1) has the lowest AIC value in small models, so we choose regression with ARMA(1,1) noise.

The model with ARMA errors can be written as, where Mt is months and ηt is an ARMA(1,1) error.: \[ Y_t=\beta_0+\beta_1M_t+\eta_t \]

Residual Analysis of the regression with ARMA noise5

Then we plot the residual of the regression.

According to the residual plot, there is a evidence of increasing in the amplitude or the heteroskedasticity, which means our error term’s variance is not a constant, the further research can study on this thing. The acf plot shows there are several lags’ autocorrelation out of 95% confidenence interval under null hypothesis of Gussian white noise. On the other hand, the AIC table shows some larger regression model with ARMA noise has lower AIC, so maybe fit a larger ARMA noise is better.

Acf Residual Plot

plot(acf(arma11_reg$residuals))

Fig 8: the acf of the residual

Residual plot of the regression with large arma noise

plot(arma11_reg$residuals, ylab = 'Residuals')

Fig 9: the residual plot of the regression with large arma noise

Normal ARMA Model

The other model we want to try is the ARMA model.

aic_table <- function(data,P,Q){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p+1,q+1] <- arima(data,order=c(p,0,q))$aic
    }
  }
  dimnames(table) <- list(paste("AR",0:P, sep=""),
                          paste("MA",0:Q,sep=""))
  table
}
huron_aic_table <- aic_table(price,3,3)
huron_aic_table

##           MA0       MA1       MA2       MA3
## AR0 1282.1369  657.5195  162.4745 -155.7455
## AR1 -662.6866 -800.4559 -809.3660 -808.0328
## AR2 -787.8142 -806.9241 -807.7802 -816.1554
## AR3 -811.5188 -813.7894 -817.4926 -811.0718

arma11 <- arima(price, order = c(1, 0, 1))
arma11

## 
## Call:
## arima(x = price, order = c(1, 0, 1))
## 
## Coefficients:
##          ar1     ma1  intercept
##       0.9852  0.5108     2.0484
## s.e.  0.0076  0.0339     0.4358
## 
## sigma^2 estimated as 0.01121:  log likelihood = 404.23,  aic = -800.46

95% confidence intervals of coefficients are shown below.

t(confint(arma11))

##              ar1       ma1 intercept
## 2.5 %  0.9702037 0.4443697  1.194255
## 97.5 % 1.0001020 0.5771790  2.902598

None of the confidence intervals of the coefficients include 0. Next, check if the ARIMA(3,1,2) model is both causal and invertible.

autoplot(arma11, main = "Plotting the ARMA(1,1) characteristic roots")

Fig 10: The plot of ARMA(1,1) characteristic roots

According to the AIC table, the ARMA(1, 1) has relatively lower AIC value among small model. Both roots of AR and MA is out of unit circle and have very low standard error.

Freduency Domain

Finally we want to check whether there is any cyclical fluctuation of the gasoline prices. As we do in class, we plot the spectra using unparametric and parametric methods separately in the following chunks.

Unsmoothed periodogram

spectrum(price, main = 'Unsmoothed periodogram')

Fig 11: the unsmoothed periodogram

Smoothed periodogram

smoothed_freq <- spectrum(price, spans = c(3, 5, 3), main = 'Smoothed periodogram')
abline(v = smoothed_freq$freq[which.max(smoothed_freq$spec)], col = 'red', lty = 'dotted', lwd = 2)

Fig 12: the smoothed periodogram

Explannation of the specturm

domin_frequency <- smoothed_freq$freq[which.max(smoothed_freq$spec)]
period <- 1/(smoothed_freq$freq[which.max(smoothed_freq$spec)]*12)
period

## [1] 41.66667

The dominant frequcy is 0.002 and the period is 41.6666667 which is not make sense in our data because the time range of our time series data is just 40 years, this result means there is no obviously period behavior in the average gasoline price per gallon in our data.

Local linear re gression for the specturm

Then we check another smoothing method with local linear regression.

loess_smooth_model <- loess(price~year, span = 0.1)
plot(year, price, type = 'l')
lines(loess_smooth_model$x, loess_smooth_model$fitted, col = 'red')

Fig 13: Local Linear Regression

Then we plot the smoothed periodogram of the local linear regression.

spectrum(loess_smooth_model$fitted, main = 'Smoothed periodogram')
loess_spectrum <- spectrum(loess_smooth_model$fitted, main = 'Smoothed periodogram')
abline(v = loess_spectrum$freq[which.max(loess_spectrum$spec)], col = 'red', lty = 'dotted', lwd = 2)

Fig 14: the smoothed periodogram of local linear regression

loess_domin_frequency <- loess_spectrum$freq[which.max(loess_spectrum$spec)]
loess_period <- 1/(loess_spectrum$freq[which.max(loess_spectrum$spec)]*12)
loess_period

## [1] 20.83333

When we use local linear regression approach to smooth our time series data, the dominant period frequency we got is 0.004 which means the period is 20.8333333. This is a large number as well compared to our data’s time range. Based on the two smoothing method we can conclude that there is not obvious short-run cyclial behavior in the average price for gasoline per gallon.

Monthly Change stability Analysis

ARIMA model of the monthly difference

Since the differenced data is more stationary, let’s take the ARMA model for the differenced data. This differenced data has the equation that transforms the data \(y^*_{1:N}\) to \(z_{2:N}\) such that \(z_n = \Delta y^*_n = y^*_n - y^*_{n-1}\) (chp 6 pg.4).

As mentioned before, we will fit an ARMA(p,q) model for the differenced data, this is also called integrated autoregressive moving average model for \(y^*_{1:N}\), also written as ARIMA(p,1,q).

The equation for the ARIMA(p,1,q) model with intercept \(\mu\) for \(Y_{1:N}\) is as follows: \[\phi(B)[(1-B)Y_n - \mu] = \psi(B)\epsilon_n\]

where \(\epsilon_n\) is a white noise process, \(\phi(x)\) and \(\psi(x)\) are ARMA polynomials. (Lecture Notes Chapter 6, pg. 4)

aic_table <- function(data, P, Q){
  table <- matrix(NA, (P+1), (Q+1))
  for (p in 0:P){
    for(q in 0:Q){
      table[p+1,q+1] <- arima(data, order=c(p,1,q))$aic
    }
  }
  dimnames(table) <- list(paste("AR", 0:P, sep=""),
                          paste("MA", 0:Q, sep=""))
  table
}

price_aic_table <- aic_table(price,4,6)
price_aic_table

##           MA0       MA1       MA2       MA3       MA4       MA5       MA6
## AR0 -667.6667 -803.7626 -811.4060 -810.5775 -811.4778 -812.0838 -814.3443
## AR1 -788.5684 -808.9813 -810.0717 -821.7192 -820.2843 -818.6572 -818.5502
## AR2 -814.6769 -819.1057 -822.9421 -832.4099 -832.5461 -835.4129 -836.0401
## AR3 -812.9944 -810.8348 -820.9571 -819.3574 -832.4322 -833.0187 -827.1177
## AR4 -812.2095 -822.6035 -838.7941 -828.8813 -826.4605 -837.9411 -836.5241

From AIC table above, it can be seen that ARIMA(4,1,2) model has the lowest AIC.

price_arima <- arima(price, order = c(4,1,2))
price_arima

## 
## Call:
## arima(x = price, order = c(4, 1, 2))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     ma2
##       2.2056  -2.0748  0.9136  -0.2499  -1.6848  0.9513
## s.e.  0.0492   0.1051  0.1028   0.0458   0.0251  0.0227
## 
## sigma^2 estimated as 0.01028:  log likelihood = 426.4,  aic = -838.79

Model Diagnostics

As a diagnostic test, let’s plot the autocorrelation function for the residuals of this ARIMA model.

acf(residuals(price_arima))

Fig 15: the AIC plot of the ARIMA model

It can be seen that autocorrelation mimics that of a white noise, even though there are some relatively high autocorrelation in some lags.

Trend of differenced data

Next, let’s try to see whether there is a trend in this differenced time series data of the average price of gasoline over the years. It can be seen from the plot before, that there seems to be very little or no trend. We will see if this observation is supported statistically.

Let’s fit the ARIMA(4,1,2) model that was shown to be performing relatively well with a linear trend on the date.

linear_model <- arima(price, order = c(4,1,2), xreg = data$Year)
linear_model

## 
## Call:
## arima(x = price, order = c(4, 1, 2), xreg = data$Year)
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     ma2  data$Year
##       2.2110  -2.0899  0.9287  -0.2555  -1.6847  0.9516     0.0225
## s.e.  0.0489   0.1051  0.1029   0.0457   0.0246  0.0226     0.0142
## 
## sigma^2 estimated as 0.01023:  log likelihood = 427.64,  aic = -839.29

We have fitted the differenced data \(z_2:N\):
\[(1-\phi(B))(Y_n - \mu - \beta t_n)= \psi(B)\epsilon_n\]

where \(\epsilon_n\) is Gaussian white noise with variance \(sigma^2\).

To test whether there exista a linear trend, we will perform hypothesis test with null model: \[H_0: \beta = 0\] against the alternative hypotheses \[H_A = \beta \neq 0\].

Then, perform a z-test on the slope parameter on the coefficient of the date variable by calculating the test statistic: \(\frac{\beta}{s.e(\hat{\beta})}\).

0.0547/0.0710

## [1] 0.7704225

0.7704225 < 1.96. Therefore, we fail to reject our null hypothesis \(H_0\) at 5% level.

We can further confirm this by performing a likelihood ratio test, by computing the difference in log likelihood between the model with a linear trend and one without.

426.69 - 426.4

## [1] 0.29

0.29 < 1.92. Therefore, we again fail to reject our null hypothesis \(H_0\) at 5% level.

(Lecture Notes Chapter 6, page 7-8)

Therefore, both z-test and likelihood ratio test agree that there exist no linear trend in the differenced data.

Conclusion

In this project, we want to verify that if we could fit a model for the gasoline price of the United states for future prediction. We would also love to discuss the change of the monthly prices.

In our analysis, we have found that there is no apparent model for us to fit the overall trend. As the price of the gasoline is highly correlated with the previous data, we could not figure out a simple model for further prediction without any other data sources. Also, we have found no periodical pattern in the fluctuation of the gasoline prices, which means that the change of gasoline prices is somehow irreversible. However, the change of the price itself is pretty steady. It indicates that the price of gasoline would not change dramatically between months to months, but there is also no trend in the amount of change.

In the future, we want to combine this data with some other data sources, such as the CPI (Consumer price index) data to check whether there is any causal impact between the gasoline price and CPI. We would also love to refer to some important historical events, such as the oil crisis and the pandemic, and check whether these events would pose a great influence.

References

[1] U.S. Bureau of Labor Statistics Average Gasoline Price Per Gallon.

[2] Ionides, Edward. Analysis of Time Series lecture notes chapter 9.

[3] Ionides, Edward. Analysis of Time Series lecture notes chapter 5.

[4] Ionides, Edward. Analysis of Time Series lecture notes chapter 6.

[5] Otexts.comIonides, Edward. Regression with ARMA errors.

STATS 531 Midterm Project

2022/2/16

Introduction

Data Description

Analysis

Exploratory Data Analysis

Normal Data

Differenced Data

The Trend Analysis

Linear Regression

Residual Analysis

Acf Residual Plot

Regression Residual Plot

Linear Regression with ARMA Noise

Residual Analysis of the regression with ARMA noise5

Acf Residual Plot

Residual plot of the regression with large arma noise

Normal ARMA Model

Freduency Domain

Unsmoothed periodogram

Smoothed periodogram

Explannation of the specturm

Local linear re gression for the specturm

Monthly Change stability Analysis

ARIMA model of the monthly difference

Model Diagnostics

Trend of differenced data

Conclusion

References