1. Introduction, Motivation, and Project Goal

Here is the glance of the dataset to get a better sense of the data.

data = read_csv("data.csv")
data$OprDate = as.Date(data$OprDate, "%m/%d/%y")
head(data)
## # A tibble: 6 x 5
##   OprDate    Storage UShdd UScdd USchdd
##   <date>       <dbl> <dbl> <dbl>  <dbl>
## 1 2005-01-02  - 7.21  21.4 0.400   21.4
## 2 2005-01-03  - 7.59  21.6 0.600   21.6
## 3 2005-01-04  -12.4   23.7 0.600   23.7
## 4 2005-01-05  -22.5   28.8 0.300   28.8
## 5 2005-01-06  -28.9   32.0 0.400   32.0
## 6 2005-01-07  -23.4   30.6 0.300   30.6
summary(data)
##     OprDate              Storage             UShdd           UScdd       
##  Min.   :2005-01-02   Min.   :-63.9340   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.:2008-01-02   1st Qu.: -8.3993   1st Qu.: 0.90   1st Qu.: 0.300  
##  Median :2011-01-01   Median :  5.6080   Median : 9.40   Median : 1.600  
##  Mean   :2011-01-01   Mean   :  0.1346   Mean   :12.64   Mean   : 3.861  
##  3rd Qu.:2013-12-31   3rd Qu.: 10.8187   3rd Qu.:22.77   3rd Qu.: 7.300  
##  Max.   :2016-12-31   Max.   : 20.5650   Max.   :47.40   Max.   :16.700  
##      USchdd     
##  Min.   : 2.30  
##  1st Qu.: 8.20  
##  Median :12.30  
##  Mean   :15.63  
##  3rd Qu.:22.77  
##  Max.   :47.40


2. Data Analysis

2.1 Exploratory Analysis

ggplot(data_plot, aes(x=OprDate)) + geom_line(aes(y=value, col=variable)) + 
                                    scale_colour_manual(values=c("#1ABC9C", "#F39C12")) + 
                                    labs(title="Fig-1: Natural Gas Storage Change vs Degree Day") + 
                                    labs(x ="Year", y = "Value", colour='') + 
                                    theme(legend.position="top",
                                          plot.title = element_text(hjust = 0.5, face="bold"),
                                          panel.background = element_blank())

In Figure1, the green line represents the daily natural gas storage change, and the orange line represents the daily heating degree day or cooling degree day (whichever is larger on that day) from 2005 to 2016.

  • We can see that the storage change and weather data have strong seasonality. Although the pattern is relatively stable from year to year, some extreme spikes do appear such as in 2013-14 winter due to unusual weather.

  • What’s more, it is intuitive that the storage change is negatively related to the degree day. The larger temperature deviation from 65°F, the more natural gas storage decreases due to higher heating or cooling demand, especially during the winter.

Thus, we can conclude that it could be plausible to study the association between the storage change and degree day. Also, it is worth taking detailed analysis on the seasonality (see section 2.2) and consider using SARIMA model to fit the data (see section 3.2). Furthermore, although it is not obvious that there is any trend in the data from the plot directly, it would be helpful to use smoothing method to do time series decomposition to reach a more solid conclusion (see section 2.3).


2.2 Seasonality Analysis

In this section, spectrum and ACF methods are used to study the seasonality in the data.

spectrum(bind_weekly, spans=c(5, 3), col=c("#1ABC9C","#F39C12"), 
         main="", xlab="Frequency (cycle/year)")
title(main="Fig-2: Smoothed Periodogram for\nWeekly Storage Change (Green) and Degree Day (Orange)", 
      line = -2, outer=TRUE)

To better interpret the frequency, the weekly data (the median data point of each week) are used to calculate the spectrum. In Figure2, the green solid line represents the spectrum of weekly natural gas storage change, and the orange dotted line represents the spectrum of weekly degree day. We set the time series frequency as 52, so the x-axis represents cycle/year.

  • Firstly, we can see that the most powerful frequencies are at 1 and 2, which indicates that there is a strong annual cycle, and a less strong cycle around 6 months. The blue cross on the top right corner shows the 0.95 confidence interval. These two peaks both fall into the CI, so that they are statistically significant.

  • This is consistent with the demand cycle of natural gas as introduced in section 1. The major cycle is around 1 year, which is due to the strong heating demand during winter, and natural gas is one of the major resources for heating. The half-year cycle is due to using natural gas to generate electricity to meet the cooling demand during summer.

  • Secondly, the cycle patterns of spectrum for the storage change and degree day are highly consistent with each other. This also illustrates that these two data are associated with each other.

par(mfrow=c(1,2))
acf(monthly_Storage$Storage, main="Fig-3.1 Monthly Storage Change", lag=36)
acf(monthly_USchdd$USchdd, main="Fig-3.2 Monthly USchdd", lag=36)

We then zoom-out the time interval to get monthly data (taking the median data point for each month) and use ACF to double justify our findings.

  • From Figure3, we can see a strong lag around every 12 months, and a less strong lag round every 6 months. This supports our findings discussed above.

2.3 Decomposition Analysis

Although there is no obvious sign of existing any trend in the data from the exploratory analysis, it would be helpful to use smoothing method to decomposite the data and extract the low-frequency part. In this section, Loess smoothing (a local regression smoother) is used to decomposite the time series.

freq = 12
decomp_data = monthly_Storage

s_data = ts(decomp_data$Storage, start=2005, frequency=freq)
s_low = ts(loess(Storage ~ seq(1:length(Storage)), decomp_data, span=0.5)$fitted, 
           start=2005, frequency=freq)
s_hi1 = ts(s_data - loess(Storage ~ seq(1:length(Storage)), decomp_data, span=0.1)$fitted, 
          start=2005, frequency=freq)
s_hi2 = ts(s_data - loess(Storage ~ seq(1:length(Storage)), decomp_data, span=0.05)$fitted, 
          start=2005, frequency=freq)
s_cycles1 = s_data - s_hi1 - s_low
s_cycles2 = s_data - s_hi2 - s_low
s_cycles = s_cycles2

plot(ts.union(s_data, s_hi1, s_hi2, s_low, s_cycles1, s_cycles2), col="#1ABC9C", main="")
title(main="Fig-4.1 Decomposition of Monthly Storage Change\nas Trend + Noise + Cycles")

freq = 12
decomp_data = monthly_USchdd

dd_data = ts(decomp_data$USchdd, start=2005, frequency=freq)
dd_low = ts(loess(USchdd ~ seq(1:length(USchdd)), decomp_data, span=0.5)$fitted, 
            start=2005, frequency=freq)
dd_hi1 = ts(dd_data - loess(USchdd ~ seq(1:length(USchdd)), decomp_data, span=0.1)$fitted, 
           start=2005, frequency=freq)
dd_hi2 = ts(dd_data - loess(USchdd ~ seq(1:length(USchdd)), decomp_data, span=0.05)$fitted, 
           start=2005, frequency=freq)
dd_cycles1 = dd_data - dd_hi1 - dd_low
dd_cycles2 = dd_data - dd_hi2 - dd_low
dd_cycles = dd_cycles2

plot(ts.union(dd_data, dd_hi1, dd_hi2, dd_low, dd_cycles1, dd_cycles2), col="#F39C12", main="")
title(main="Fig-4.2 Decomposition of Monthly Degree Day\nas Trend + Noise + Cycles")

Figure4-1 and Figure4-2 show the decomposition of monthly storage change and degree day in terms of trend, noise, and cycles. After trying different span values, we used a span of 0.5 to extract low-frequency trend, and a span of 0.1 and 0.05 to extract two levels of high frequency.

  • Trend: After smoothing, it seems that the trends are not stable for storage change and degree day, and they go for opposite directions. The degree day seems to have a downtrend from 2014 to 2016, which might be due to two folds. On one hand, 2013-14 winter experienced some extremely cold days, causing a bump in the trend. On the other hand, we only observed 11 years data in this project, however, there might exist other long-term cycles that can not be captured from our limited data points. The trend of the storage change is consistent with the temperature movement, with a higher demand during 2013-14 winter.

  • Cycles: When span = 0.1, the major annual cycle is nicely extracted. However, from section 2.2, we learned that there exists two major frequencies in the data. Thus, we further tried to set span = 0.05 so that the significant half-year cycle can also be extracted.

  • Noise: When comparing the low frequencies for both data under span = 0.1 and span = 0.05, we can see that the residuals under span = 0.1 still shows cycles, while the residuals are much more uncorrelated when span = 0.05. The ACF plots on the residuals (Figure5) also confirmed that the former residuals contain a lag of 3 months showing seasonality, while the latter residuals are less correlated.



3. Model Fitting

In this section, we are going to fit two models on the weekly cycles of natural gas storage change and compare the performance.

We are going to use AIC to pick parameters, diagnose model residuals, and compare the performance between the two models.


3.1 Fit a SARIMA Model

sarima_aic_table = function(data, P, Q, xreg = NULL){
  table = matrix(NA, (P+1), (Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
       table[p+1, q+1] = arima(data, order = c(p,0,q), seasonal=list(order=c(0,0,0), period=12))$aic
    }
  }
  dimnames(table) = list(paste("<b> AR", 0:P, "</b>", sep = ""), paste("MA", 0:Q, sep = ""))
  table
}
res_sarima_aic_table = sarima_aic_table(s_cycles, 5, 5)

kable(res_sarima_aic_table, digits = 2)
MA0 MA1 MA2 MA3 MA4 MA5
AR0 1128.59 969.69 851.63 805.40 787.48 785.73
AR1 989.84 864.21 809.16 793.57 786.84 765.85
AR2 816.76 775.41 776.90 739.19 741.16 736.36
AR3 797.06 777.18 777.01 742.52 742.86 743.26
AR4 759.82 715.39 778.16 743.27 744.49 730.03
AR5 752.03 714.18 718.68 723.87 739.49 732.28

We select ARMA\((3,3)\) as the candidate. ARMA\((2,3)\) and ARMA\((4,1)\) are not chosen because by adding one more parameter to the model, the AIC value should increase at most by 2, because \[AIC = -2 \times {\ell}({\theta^*}) + 2D\] where \({\ell}({\theta^*})\) is the maximized log likelihood and D is the number of parameters. However, ARMA\((3,3)\) - ARMA\((2,3)\) and ARMA\((4,2)\) - ARMA\((4,1)\) are above 2, so these values are not proper here.

par(mfrow=c(1,2))

sarima_0 = Arima(s_cycles,order=c(3,0,3), seasonal=list(order=c(0,1,0),period=12))
acf(sarima_0$residuals, main="ARIMA(3,0,3)", 72)
sarima = Arima(s_cycles, order=c(3,0,3), seasonal=list(order=c(1,1,2), period=12))
acf(sarima$residuals, main="SARIMA(3,0,3)x(1,1,2)_12", 72)

title(main="Fig-6 Residuals of SARIMA Model", outer=TRUE, line=-1)

By checking residual of the ARIMA\((3,0,3)\) model, we can see significant lags every 3 months, which is obviously due to seasonality. After adding the seasonal part, the residual of the SARIMA\((3,0,3)\times(1,1,2)_{12}\) Model becomes much more uncorrelated.


3.2 Fit a SARIMA Errors Model

plot(-s_cycles, col="#1ABC9C", type="l", xlab="Year", ylab="", ylim=c(-20,30))
par(new=TRUE)
plot(dd_cycles, col="#F39C12", type="l", axes=FALSE, xlab="", ylab="", ylim=c(-15,20))
axis(side=4, col="black")
legend("topleft",legend=c("ΔStorage (Billions Cubic Feet)", "USchdd °F"), col=c("#1ABC9C", "#F39C12"),
       cex=0.8,lty=1,bty="n")

title(main="Fig-7 Monthly Storage Negative Change and Degree Day Cycle",
      outer=TRUE, line = -1)

From Figure7, we can clearly see that the cycle of storage change and the cycle of degree day have a strong association, thus we consider fitting an Errors Model.

sarima_error_aic_table = function(data, P, Q, xreg = NULL){
  table = matrix(NA, (P+1), (Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
       table[p+1, q+1] = arima(data, order = c(p,0,q), xreg = xreg,
                               seasonal=list(order=c(0,0,0),period=12))$aic
    }
  }
  dimnames(table) = list(paste("<b> AR", 0:P, "</b>", sep = ""), paste("MA", 0:Q, sep = ""))
  table
}
res_sarima_error_aic_table = sarima_error_aic_table(s_cycles, 5, 5, xreg = dd_cycles)

kable(res_sarima_error_aic_table, digits = 2)
MA0 MA1 MA2 MA3 MA4 MA5
AR0 562.64 454.31 408.97 410.87 406.38 407.83
AR1 499.76 439.24 410.92 403.89 408.17 409.70
AR2 461.83 427.40 405.17 405.10 379.68 378.17
AR3 423.75 413.61 399.22 388.64 378.04 378.11
AR4 399.54 399.09 382.68 380.28 379.76 372.42
AR5 397.19 394.71 379.41 381.24 393.84 374.32

We select ARMA\((3,4)\) as the candidate. ARMA\((1,3)\) is not chosen because ARMA\((1,4)\) - ARMA\((1,3)\) is above 2, so the value is not proper here.

par(mfrow=c(1,2))

sarima_error_0 = Arima(s_cycles,order=c(3,0,4), seasonal=list(order=c(0,1,0),period=12))
acf(sarima_error_0$residuals, main="ARIMA(3,0,4)", 72)
sarima_error = Arima(s_cycles, order=c(3,0,4), seasonal=list(order=c(1,1,1), period=12))
acf(sarima_error$residuals, main="SARIMA(3,0,4)x(1,1,1)_12", 72)

title(main="Fig-8 Residuals of SARIMA Errors Model", outer=TRUE, line=-1)

By checking residual of the ARIMA\((3,0,4)\) Errors model, we can see significant lags every 3 months, which is obviously due to seasonality. After adding the seasonal part, the residual of the SARIMA\((3,0,4)\times(1,1,1)_{12}\) Errors Model becomes much more uncorrelated. Note that at each lag h, the chance that the estimated ACF falls within this dotted line is approximately 95% under the null hypothesis. Thus, under the null hypothesis, one expects a fraction of 1/20 of the lags of the sample ACF to fall outside this band. (Ionides, Note2)


3.3 Diagnostic Analysis

## Series: s_cycles 
## ARIMA(3,0,3)(1,1,2)[12] 
## 
## Coefficients:
##           ar1      ar2     ar3     ma1     ma2     ma3     sar1     sma1
##       -1.5743  -0.4712  0.1902  2.7905  2.7112  0.9101  -0.6776  -0.1303
## s.e.   0.1039   0.1745  0.0976  0.0691  0.1299  0.0678   0.1956   0.1964
##          sma2
##       -0.8669
## s.e.   0.1903
## 
## sigma^2 estimated as 3.301:  log likelihood=-279.23
## AIC=578.47   AICc=580.29   BIC=607.3
## Series: s_cycles 
## ARIMA(3,0,4)(1,1,1)[12] 
## 
## Coefficients:
##           ar1      ar2     ar3     ma1     ma2     ma3      ma4     sar1
##       -0.3697  -0.5119  0.2362  1.5557  1.3548  0.4508  -0.1179  -0.0235
## s.e.   0.8687   0.3798  0.4319  0.8791  1.3784  1.1328   0.4695   0.1038
##          sma1
##       -0.9999
## s.e.   0.1162
## 
## sigma^2 estimated as 3.5:  log likelihood=-282.97
## AIC=585.94   AICc=587.76   BIC=614.77
par(mfrow=c(1,2))
plot(sarima$residuals)
plot(sarima_error$residuals)
title(main="Fig-9 Diagnostic Analysis: SARIMA(left), SARIMA_Errors(right)", outer=TRUE, line=-1)

par(mfrow=c(1,2))
acf(sarima$residuals, lag=72)
acf(sarima_error$residuals, lag=72)

par(mfrow=c(1,2))
qqnorm(sarima$residuals)
qqline(sarima$residuals)
qqnorm(sarima_error$residuals)
qqline(sarima_error$residuals)

par(mfrow=c(1,2))
plot(data$Storage, col="#FFE221", type="l", axes=FALSE, xlab="", ylab="", ylim=c(-60,40))
par(new=TRUE)
plot(sarima$x + s_low, col="#1ABC9C", type="l", xlab="Year", ylab="", ylim=c(-60,40), 
     main="Original Data and Fitted Result")
par(new=TRUE)
plot(fitted(sarima) + s_low, col="#C06C84", type="l", axes=FALSE, xlab="", ylab="", ylim=c(-60,40))
legend("topleft", col=c("#FFE221", "#1ABC9C", "#C06C84"), cex=0.8, lty=1, bty="n",
       legend=c("ΔStorage - Original", "ΔStorage - Cycles + Trend", "ΔStorage - Fitted + Trend"))

plot(data$Storage, col="#FFE221", type="l", axes=FALSE, xlab="", ylab="", ylim=c(-60,40))
par(new=TRUE)
plot(sarima_error$x + s_low, col="#1ABC9C", type="l", xlab="Year", ylab="", ylim=c(-60,40), 
     main="Original Data and Fitted Result")
par(new=TRUE)
plot(fitted(sarima_error) + s_low, col="#C06C84", type="l", axes=FALSE, xlab="", ylab="", ylim=c(-60,40))
legend("topleft", col=c("#FFE221", "#1ABC9C", "#C06C84"), cex=0.8, lty=1, bty="n",
       legend=c("ΔStorage - Original", "ΔStorage - Cycles + Trend", "ΔStorage - Fitted + Trend"))

From Figure 9, we can see that:

  • The residuals are uncorrelated for both models.

  • The Q-Q plots are majorly along the straight line. I also tried to set span = 0.04 to extract the cycle, and the Q-Q plots are more normal but heavy-tailed in that case.

  • Both models fit the data well, while SARIMA\((3,0,3)\times(1,1,2)_{12}\) has lower AIC and BIC.

  • The box test also shows that the residuals for both models are uncorrelated.

## 
##  Box-Pierce test
## 
## data:  sarima$residuals
## X-squared = 0.031337, df = 1, p-value = 0.8595
## 
##  Box-Pierce test
## 
## data:  sarima_error$residuals
## X-squared = 0.12524, df = 1, p-value = 0.7234


4. Conclusion

In this project, we carefully analyzed the pattern and association of the US natural gas storage change and cooling and heating degree day. We found that:



5. Reference

  1. Ionides, E. (n.d.). Stats 531 (Winter 2018) ‘Analysis of Time Series’ Retrieved March 07, 2018, from http://ionides.github.io/531w18/

  2. Heating Degree Day - HDD. Investopedia. Retrieved March 07, 2018, from https://www.investopedia.com/terms/h/heatingdegreeday.asp

  3. Natural Gas Storage. Retrieved March 07, 2018, from https://www.eia.gov/naturalgas/data.php#storage

  4. U.S. Energy Information Administration. 2015. Natural Gas Consumption and Prices Short-Term Energy Outlook. U.S. Department of Energy, Washington, DC.

  5. Sebastian Nick, Stefan Thoenes. 2013. What Drives Natural Gas Prices? – A Structural VAR Approach l. EWI Working Paper, No 13/02. Institute of Energy Economics at the University of Cologne