## Warning: 程辑包'dplyr'是用R版本4.3.2 来建造的
## 
## 载入程辑包:'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: 程辑包'zoo'是用R版本4.3.2 来建造的
## 
## 载入程辑包:'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Warning: 程辑包'tseries'是用R版本4.3.2 来建造的
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Warning: 程辑包'ggplot2'是用R版本4.3.2 来建造的
## Warning: 程辑包'forecast'是用R版本4.3.2 来建造的
## Warning: 程辑包'astsa'是用R版本4.3.2 来建造的
## 
## 载入程辑包:'astsa'
## The following object is masked from 'package:forecast':
## 
##     gas

Exploratory Data Analysis

This dataset looks at Carbon Dioxide (CO2) emissions in the United States from 1973 to 2016. A large producer of CO2 is burning coal to generate electricity for the country. We would like to explore how the CO2 levels have changed over time, given its impact on the environment. Namely, CO2 absorbs heat from our planet and traps it within our atmosphere, effectively raising the global temperature. As the US and other countries have emmitted coal in increasing quantities over the years, it has had negative affects on the environment. Only more recently have organizations pushed to decrease CO2 emissions.

There are no missing values in our data.

It appears as if there are two major trends over time: an increase in CO2 emissions from 1973 until approximately 2006-2008, and then a decrease in emissions after this point until 2016. It is also possible that there is annual seasonality in the CO2 emissions, which is something that we will explore.

Firstly, we need to run an ADF test to check whether the data is stationary. From the result we can see, p-value of the ADF test is 0.9627 which is much bigger than 0.05. It means that we can not reject the null hypothesis: the series is not stationary. So next we need to differentiate the data.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  co2[, 2]
## Dickey-Fuller = -0.78755, Lag order = 8, p-value = 0.9627
## alternative hypothesis: stationary

Exploring \(CO^2\) emissions summary:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   60.54  102.02  129.05  125.78  151.50  188.41

The differenced data is now stationary:

## Warning in adf.test(y2, alternative = "stationary"): p-value smaller than
## printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  y2
## Dickey-Fuller = -16.412, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

It appears that there are high levels of autocorrelation with lags going as far back as 35. It also appears that the autocorrelation fluctuates approximately every 6 lags, indicating the possibility of seasonality.

In the PACF, it appears that only the first 13 lags are significant, indicating that an annual seasonal model might be useful.

We can take a closer look at the decomposition of carbon dioxide emissions. The increasing then decreasing trend is very clear, and there is a very strong seasonal component.

Frequency Analysis

Unsmoothed periodogram

Smoothed Periodogram

Both unsmoothed and smoothed periodograms result in dominant frequencies(\(\omega\)) of 0.17 and 0.09

This dominant frequencies corresponds to a time period T where,

\[ T= \frac{1}{\omega} = \frac{1}{0.17} = 5.88\]

\[ T= \frac{1}{\omega} = \frac{1}{0.09} = 1.11\]

The primary frequencies are associated with periods of approximately 6 months, and 1 month. This means there is a significant seasonal pattern in the CO2 emissions data that repeats approximately every 6 months, and 1 month.

Smoothed Periodogram via AR model:

The smoothed periodogram via AR model reveals similar results to the previously smoothed periodogram, reinforcing our belief that there are cycles of periods of 6 months and 1 month in the \(CO^2\) data.

Now, the chosen ARIMA modeling approach, which incorporates these identified seasonal components, will likely yield accurate forecasts for CO2 emissions. The analysis suggests that our model will be able to project future emissions levels while considering the cyclical nature of CO2 output over time.

Modeling

Due to the seasonal nature of our data, we have elected to use a SARIMA model, which is of the form:

\(\phi(B)\Phi(B_{12})(Y_n - \mu) = \psi(B)\Psi(B_{12})\epsilon_n,\)

\(\text{where the error terms are a white noise process and}\)

\(\mu = E(Y_n)\)

\(\phi(x)= 1 - \phi_1x - \ldots - \phi_px^p,\)

\(\psi(x)= 1 + \psi_1x - \ldots + \psi_qx^q,\)

\(\Phi(x)= 1 - \Phi_1x - \ldots - \Phi_px^p,\)

\(\Psi(x)= 1 + \Psi_1x - \ldots + \Psi_qx^q.\)

Then we need to decide where to start in terms of values of p and q. To choose the appropriate p, d and q, we use auto. arima function, which uses AIC as the criterion to select the most appropriate model. This analysis is conducted below:

Model 1

## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,1,2)(1,1,1)[12]                    : 2989.733
##  ARIMA(0,1,0)(0,1,0)[12]                    : 3224.131
##  ARIMA(1,1,0)(1,1,0)[12]                    : 3122.475
##  ARIMA(0,1,1)(0,1,1)[12]                    : 2997.203
##  ARIMA(2,1,2)(0,1,1)[12]                    : 2977.264
##  ARIMA(2,1,2)(0,1,0)[12]                    : Inf
##  ARIMA(2,1,2)(0,1,2)[12]                    : 2978.851
##  ARIMA(2,1,2)(1,1,0)[12]                    : 3084.412
##  ARIMA(2,1,2)(1,1,2)[12]                    : 2988.537
##  ARIMA(1,1,2)(0,1,1)[12]                    : 2976.604
##  ARIMA(1,1,2)(0,1,0)[12]                    : 3174.039
##  ARIMA(1,1,2)(1,1,1)[12]                    : 2986.237
##  ARIMA(1,1,2)(0,1,2)[12]                    : 2978.326
##  ARIMA(1,1,2)(1,1,0)[12]                    : 3083.008
##  ARIMA(1,1,2)(1,1,2)[12]                    : 2985.612
##  ARIMA(0,1,2)(0,1,1)[12]                    : 2978.552
##  ARIMA(1,1,1)(0,1,1)[12]                    : 2974.655
##  ARIMA(1,1,1)(0,1,0)[12]                    : 3159.527
##  ARIMA(1,1,1)(1,1,1)[12]                    : 2984.802
##  ARIMA(1,1,1)(0,1,2)[12]                    : 2976.428
##  ARIMA(1,1,1)(1,1,0)[12]                    : 3084.685
##  ARIMA(1,1,1)(1,1,2)[12]                    : 2983.928
##  ARIMA(1,1,0)(0,1,1)[12]                    : 3010.334
##  ARIMA(2,1,1)(0,1,1)[12]                    : 2975.512
##  ARIMA(0,1,0)(0,1,1)[12]                    : 3026.306
##  ARIMA(2,1,0)(0,1,1)[12]                    : 2993.891
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(1,1,1)(0,1,1)[12]                    : 3025.322
## 
##  Best model: ARIMA(1,1,1)(0,1,1)[12]
## Series: co2ts 
## ARIMA(1,1,1)(0,1,1)[12] 
## 
## Coefficients:
##          ar1      ma1     sma1
##       0.6117  -0.8878  -0.7542
## s.e.  0.0845   0.0562   0.0312
## 
## sigma^2 = 21.39:  log likelihood = -1508.62
## AIC=3025.24   AICc=3025.32   BIC=3042.18
## 
## Training set error measures:
##                      ME     RMSE      MAE        MPE     MAPE     MASE
## Training set -0.2328457 4.553198 3.396243 -0.2713068 2.765444 0.537675
##                     ACF1
## Training set -0.01338409

We get the best model to be a Seasonal ARIMA(1,1,1)(0,1,1)[12] model, where the ARIMA part of the model has an AR(1) and MA(1) component and differenced once, and the seasonal component has an MA(1) component and differenced once. The seasonal lag is 12 months.

Below, we can forecast with this model on the final year of our data:

It appears that the model forecasts the CO2 emissions fairly well. The 95% confidence interval around the forecast is relatively small and it appears that the original data and fited values are similar.

Model 2

Given that there appear to be two trend components to our model, before and after ~2003, we can attempt to create a model just for before and after this date separately:

In the above ACF and PACF plots for data after 2003, when CO2 emissions started trending down, we can still see significant lags and significant seasonality.

## Warning in adf.test(co2[361:523, 2]): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  co2[361:523, 2]
## Dickey-Fuller = -4.1971, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

We ran an ADF test to test for stationary, and found that the p-value was 0.01 < 0.05, meaning that we can reject the null hypothesis and the data is in fact stationary.

## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,1,2)(1,1,1)[12]                    : 945.2503
##  ARIMA(0,1,0)(0,1,0)[12]                    : 972.1671
##  ARIMA(1,1,0)(1,1,0)[12]                    : 955.5193
##  ARIMA(0,1,1)(0,1,1)[12]                    : 923.154
##  ARIMA(0,1,1)(0,1,0)[12]                    : 965.7792
##  ARIMA(0,1,1)(1,1,1)[12]                    : 940.0908
##  ARIMA(0,1,1)(0,1,2)[12]                    : 923.6302
##  ARIMA(0,1,1)(1,1,0)[12]                    : 953.9603
##  ARIMA(0,1,1)(1,1,2)[12]                    : 935.9723
##  ARIMA(0,1,0)(0,1,1)[12]                    : 926.9546
##  ARIMA(1,1,1)(0,1,1)[12]                    : 923.8431
##  ARIMA(0,1,2)(0,1,1)[12]                    : 922.3506
##  ARIMA(0,1,2)(0,1,0)[12]                    : 966.841
##  ARIMA(0,1,2)(1,1,1)[12]                    : 939.8656
##  ARIMA(0,1,2)(0,1,2)[12]                    : 923.1046
##  ARIMA(0,1,2)(1,1,0)[12]                    : 955.3841
##  ARIMA(0,1,2)(1,1,2)[12]                    : 934.6264
##  ARIMA(1,1,2)(0,1,1)[12]                    : 924.038
##  ARIMA(0,1,3)(0,1,1)[12]                    : 924.4096
##  ARIMA(1,1,3)(0,1,1)[12]                    : 926.2082
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(0,1,2)(0,1,1)[12]                    : 981.3449
## 
##  Best model: ARIMA(0,1,2)(0,1,1)[12]
## Series: co2ts3 
## ARIMA(0,1,2)(0,1,1)[12] 
## 
## Coefficients:
##           ma1      ma2     sma1
##       -0.1929  -0.1447  -0.7092
## s.e.   0.0828   0.0922   0.0827
## 
## sigma^2 = 37.08:  log likelihood = -486.53
## AIC=981.07   AICc=981.34   BIC=993.11
## 
## Training set error measures:
##                      ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -0.2169387 5.782415 4.297936 -0.2912733 3.275145 0.4724278
##                      ACF1
## Training set -0.008265901

We select an ARIMA(0,1,2)(0,1,1)[12] to model data after 2003, based on AIC

The 95% confidence interval on these predictions are much larger than with our previous model. This could be due to the fact that the model was trained using much less data.

The fitted values and original data appear very similar in general.

In the model with the full dataset used, it appears that there may be some hints of heteroskedasticity, with greater variance as time goes on, as well as a light tailed normal distribution of residuals. Given that larger lags in the ACF and PACF plots are significant, it implies that the model has not fully captured the underlying structure of the data.

In the model with the dataset of only recent years, it appears that there may also be some hints of heteroskedasticity as time goes on and a light tailed distribution of residuals, similar to the first model. There are fewer significant lags in the ACF and PACF plots, but still some underlying structure that our model did not capture.

Conclusion

We built two models for our data: one using the full dataset, and another using data after 2003 based on the drastic change in trend after this point. This change in trend may be due to people realizing the importance of protecting the environment, coupled with the support of environmental organizations.

From the result of auto.arima and model diagnostics, we choose ARIMA(1,1,1)(0,1,1)[12] using the full dataset as our final model. Although the diagnostics of this model and the model using only recent data were similar, the confidence intervals on the forecast of the first model were much smaller. We used this model to predict the co2 emission for the next 5 years. We can clearly see that carbon dioxide emissions will show a clear downward trend in the next five years, and emissions in summer and winter will still be higher than in spring and autumn, per the seasonal trend.

Our exploration of the data shows that U.S. carbon dioxide emissions have been trending downward seasonally since 2003. And according to our forecast results, starting from 2020, the carbon dioxide emissions in the United States can be basically controlled to less than 100, and the emissions in the spring and autumn seasons can be less than 90. This shows that we attach great importance to environmental protection and Effective control of carbon dioxide emissions.

References

  1. Hyperparameters Configuring: R. Shumway and D. Stoffer Time Series Analysis and its Applications, 4th edition, 2017

  2. Slides Chapter 6: Extending the ARMA model: Seasonality, integration and trend.

  3. Slides Chapter 7: Introduction to time series analysis in the frequency domain.

  4. ARIMA modelling in R https://otexts.com/fpp2/arima-r.html.

  5. A Guide to Time Series Forecasting in R You Should Know https://www.simplilearn.com/tutorials/data-science-tutorial/time-series-forecasting-in-r

  6. Previous project: Analysis on CO2 Emission from Natural Gas Electric Power Generation in United States