1. Introduction
Crude oil, is a naturally occurring, yellowish-black liquid found in geological formations, and is commonly refined into various fuels and chemicals that benefit people’s life\(^{1}\). The price of oil influences the costs of other production and manufacturing around the world. For example, there is a direct correlation between the cost of gasoline or airplane fuel to the price of transporting goods and people\(^{2}\).

Crude oil has always been one of the most widely used fuel sources around the world that is crucial to many industries. Crude oil is appreciated due to its scarcity and its global trade in markets. Since it is so widely welcomed and paid attention to, this project hopes to analyze the crude oil price with the following research question: Can we use time series analysis to analyze crude oil prices?

This project downloads data from ourworldindata, which offers us the crude oil price over a range of timeframes from 1861 annually\(^{3}\).

2. Exploratory Data Analysis
2.1 Basic Analysis and Data Split
As we can see from the following data structure, we find that there is a large value range for crude oil prices from 1861 to 2020. Among them, the maximum crude oil price is nearly two hundred times the minimum crude oil price. Since there is a large difference between crude oil prices in latest years and crude oil prices in years before 1980 and limited computation resources. We decide to split the data and focus on analyzing the crude oil price from 1980, which will be more useful for people to refer to for investment or industry development. Crude oil prices are influenced by many factors, e.g. quantity, speculation, temporary price fluctuations, invesment\(^{4}\). In brief, the economy status may influence the crude oil price. As a result, our analysis will focus on years prior to 2020 since the economy is strongly impacted by COVID-19.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.49    1.02    1.80   12.81   14.55  111.67 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12.72   19.32   29.55   43.28   61.67  111.67 

2.2 Autocorrelation and Data Preprocessing
With the ACF figure below, we can see that there is substantial autocorrelation between crude oil price and its previous data, which suggests that it is not a stationary model, therefore driving us to take the log difference of crude oil price for further analysis.

Since it is common to analyze the log returns of financial products rather than the raw difference in prices, it remains unsure whether log difference analysis could be used for crude oil prices. With plots below, we see that the annual crude oil price log difference appears stationary with no trend, with a mean around 0 and an unclear variance trend. Therefore, taking the log difference of crude oil prices may be an appropriate approach for further analysis. The acf plot for the log difference of crude oil price also suggests that there is no significant autocorrelation when lag is greater than 0, which can lead to an assumption that the data are independent.

3. GARCH Model Analysis

The ARCH(p) model is the model of an p autogressive conditional heteroskedasticity model, which has the form \(Y_n = \epsilon_n \sqrt V_n\) where \(\epsilon_{1:N}\) is white noise and \(V_n = \alpha_0 + \sum_{j=1}^{p} \alpha_j Y^2_{n-j}\)\(^{5}\).
The generalized ARCH model, also known as GARCH(p, q), has the form: \(Y_n=\epsilon_n \sqrt{V_n}\), where \(V_n=\alpha_0 + \sum_{j=1}^{p} \alpha_j Y^2_{n-j} + \sum_{k=1}^{q} \beta_k V_{n-k}\) and \(\epsilon_{1:N}\) is white noise\(^{5}\).

3.1 GARCH Model Fitting

Loading required package: knitr

From the table above, we can see that GARCH(1,1) model is most ideal due to its lowest AIC value, which is also a popular choice (Cowpertwait and Metcalfe, 2009)\(^{5,6}\).

3.2 GARCH Model Diagnostics
Our GARCH(1,1) model has a log-likelihood of -3.331448, which will be used as a baseline for further comparison with our POMP model\(^{6}\).

As shown in the figures above, the residuals of GARCH(1, 1) model are heavy-tailed and there is no significant autocorrelation when lag is greater than 0.

4. ARMA Model Analysis

Since we don’t observe any obvious periodical pattern in our data, so we will just consider ARMA(\(p\), \(q\)) here instead of the one with period and seasonality involved.

ARMA(\(p\), \(q\)) model is given by \(Y_n\) = \(\phi_1Y_{n-1} + \phi_2Y_{n-2} + ... + \phi_pY_{n-p} + \epsilon_n + \psi_1\epsilon_{n-1}+...+\psi_q\epsilon_{n-q}\), where {\(\epsilon_n\)} is a white noise process with mean 0 and variance \(\sigma^2\); \(\phi_1 ...\phi_p\) is the coefficients of AR components, and \(\psi_1 ...\psi_q\) is the coefficients of MA component. Furthermore, based on what we discussed in the class, using the backshift operator \(B\), we can rewrite it more succinctly as : \(^{5}\)

\(\phi(B)(Y_n - \mu) = \psi(B)\epsilon_n\).

We use the Akaike information criterion (AIC) as the criterion for model selection among different choices of \(p\) and \(q\) for ARMA(\(p\), \(q\)) models. AIC is defined by \(-2 \times \ell(\theta^*) + 2D\), where \(\theta^*\) is the maximum value of the likelihood function for the model, and \(D\) represents the number of estimated parameters in the model. Models with lower AIC values are usually preferred. Although based on what have discussed in class, AIC may have weak statistical properties when being viewed as a hypothesis test, it is still useful for narrowing down models choices with reasonable predictive performances.

MA 0 MA 1 MA 2 MA 3 MA 4 MA 5
AR 0 10.86999 12.79574 13.43893 15.35390 17.18950 16.78257
AR 1 12.82316 14.03900 15.36611 16.06957 16.62043 17.01141
AR 2 13.50763 15.47364 17.36586 14.92712 16.92538 18.11608
AR 3 15.50465 17.36958 19.32842 17.21675 17.84248 16.12324
AR 4 17.41548 19.18174 18.97390 18.77001 18.94766 20.35303

From the table above, the recommended model is ARMA(0,0), which has the lowest AIC value. It’s worth noticing that ARMA(0,0) is essentially a white noise model which is represented by \(Y_n = \mu+\epsilon_n\), where {\(\epsilon_n\)} ~i.i.d. \(N(0,\sigma^2)\). This means the model contains neither autoregressive nor moving average terms, which implies that the errors are uncorrelated across time. Thus, this white noise model won’t give us too much information about the oil price and the model fitting goodness, and the lowest AIC of it may be due to the limited size of our data. Based on this, for the analysis purpose, we would like to consider the model with second lowest AIC value, which is ARMA(0,1). On the other hand, since higher AIC value might imply that there are some dependence relationships involved for the oil price between two successive years in this context, thus, we also pick ARMA(4,5) for the further comparison.

To choose the model between ARMA(0,1) and ARMA(4,5), we use Wilk’s approximation as the likelihood ratio test for hypothesis test, with the null hypothesis of choosing model ARMA(0,1) and alternative hypothesis of choosing model ARMA(4,5).

According to the definition, the approximation under the null hypothesis is:

\(l_1 - l_0 \approx (1/2)\chi^2_{D_1-D_0}\),

where \(l_1\) and \(l_0\) are the log likelihood maximization over \(H_1\) and \(H_0\), respectively; \(D_1\) and \(D_0\) are the number of parameters (dimension) under each hypothesis; \(\chi^2_d\) is a chi-squared random variable on \(d\) degrees of freedom; and \(\approx\) means “is approximately distributed as”.

With this approximation, we can calculate the test statistics and when \(l_1 - l_0\) is greater than the \((1/2)\chi^2_8\), we will reject \(H_0\) (Since the reject region should be (\((1/2)\chi^2_8\), \(\infty\)) ). Computing by R, the value of \(l_1 - l_0\) is 4.221351, which is less than \((1/2)\chi^2_8\) = 7.753657 at alpha = 0.05. Thus, we don’t reject \(H_0\) at \(\alpha = 0.05\), and conclude that we should select ARMA(0,1) based on the Wilk’s approximation.

Diagnostic plots for the ARMA model

Now, we are going to check the basic assumptions for the ARMA model.

Causality and Invertibility

By looking at the plot for roots of the AR and MA polynomials, we can see since the \(p\) component for AR is 0 in our case, and there is only one root for the MA component, where the inverse of it is within the unit circle. That is, the absolute values of MA roots are greater than 1, which means the fitted model is invertible.

Normality

From the QQ plot, we can see that although the majority of points are falling on the line, there is still a little bit right-skewed regarding the distribution of the residuals.

5. POMP Model Analysis

5.1 Build the POMP model

We utilized the POMP model proposed in the lecture to analyze the volatility of SSE Composite Index. The equation and notations that we build for this POMP model are adopted from Breto (2014). Denote \(H_n=log(\sigma^2_n)=2log(\sigma_n)\) and the model as follows:

\[\begin{align} Y_n &= exp(H_n/2) \epsilon_n \\ H_n &= \mu_h (1-\phi) + \phi H_{n-1} + \beta_{n-1} R_n exp(-H_{n-1}/2) + \omega_n \\ G_n &= G_{n-1}+\nu_n \\ \end{align}\]

where, \[\begin{align} \beta_n &= Y_n \sigma_{\eta} \sqrt{1-\phi^2} \\ \sigma_{\omega} &= \sigma_{\eta} \sqrt{1-\phi^2} \sqrt{1-R_n^2} \\ \epsilon_n &\overset{i.i.d}{\sim} N(0, 1) \\ \nu_n &\overset{i.i.d}{\sim} N(0, \sigma_{\nu}^2) \\ \omega_n &\overset{i.i.d}{\sim} N(0, \sigma_{\omega}^2) \\ Rn &= \frac{exp(2G_n)-1}{exp(2G_n)+1} \end{align}\]

Here, we perform a series of steps to build POMP Model. First, two different POMP objects are built: one is for filtering and the other one is for simulation. After that, parameter transformations are conducted to do optimization procedures such as iterated filtering.\(^{7}\) The pomp object, which is suitable for filtering is also built and then the model is simulated, and the parameters are initialized randomly as well.\(^{7}\)

5.2 Filtering on simulated data

                       se 
-65.07335642   0.01206845 

We obtain a log likelihood estimate of -65.07335642 with a Monte Carlo standard error of 0.01206845.

5.3 MLE from local search

The iterated filtering algorithm, also known as IF2, of lonides was used here. This procedure theoretically converges toward the region of parameter space where maximizing the maximum likelihood.\(^{8}\)

   user  system elapsed 
499.574   7.918 134.869 
Resulting log likelihood values:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -2.931  -2.925  -2.922  -2.922  -2.919  -2.909 
Point estimates of log likelihood and parameters:
                       
logLik    -2.9089474934
logLik_se  0.0006050061
sigma_nu   0.0110668412
mu_h      -2.3075042546
phi        0.9299229690
sigma_eta  0.0384124579
G_0        0.5255061678
H_0       -3.3873146780
Best AIC:  17.81789

The best AIC is around 17.81789. The maximum likelihood is -2.909. We also check the diagnostic plots to see whether we could improve the model further.

Diagnostic plots for the maximization procedure

The convergence plots show that some parameters could not converge very well.

The plausible range for each parameter can be seen by plotting the pairs plot for log-likelihood and parameters:

We find that \(\sigma_{\nu}\) has the range in 0.005 to 0.020; \(\mu_{h}\) has the range in -1.5 to 0; \(\phi\) is focus on 0.99; and \(\sigma_{\eta}\) has the range from 0.08 to 0.12.

5.4 MLE from global search

From the graph above, a large box of parameters can be tried to reach global maximization:

\[\sigma_\nu \in (0.005,0.020) \\ \mu_h \in (-1.5,0)\\ \phi \in (0.98,0.99)\\ \sigma_{\eta} \in (0.08,0.12)\\ G_0 \in (-2,2)\\ H_0 \in (-1,1)\]

Resulting log likelihood values:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -3.242  -2.925  -2.917  -2.905  -2.913  -2.225 
Point estimates of log likelihood and parameters:
                       
logLik    -2.2247882431
logLik_se  0.0006180195
sigma_nu   3.5940750892
mu_h      -2.5611369643
phi        0.7183664974
sigma_eta  0.0426707700
G_0       -0.9851877186
H_0       -4.6612062595
Best AIC:  16.44958

The new maximum log-likelihood using global search is -2.225 , which is better than the result using local parameter search. The best AIC is around 16.44958, which is the lowest among all models considered.

Diagnostic plots for the maximization procedure

From diagnostics plots above, we see that: \(\sigma_nu\), \(\mu_h\), \(G_0\) and \(H_0\) are convergent within around 50-100 iterations. log-likelihood converges to a larger value. \(\phi\) and \(\sigma_\eta\) seems to converge to a certain range. Their converging rate seems to be slower than the other parameters.

5.5 profile likelihood over \(\phi\)

The profile likelihood of \(\phi\) was investigated to see whether \(\phi\) lies in the range of (0.95, 0.99) as suggested in local search. Another reason for checking the profile likelihood of \(\phi\) is that it seems a strong positive relationship between \(\phi\) and log-likelihood. That is, as \(\phi\) increasing and getting close to 1, the log-likelihood also increases.

When \(\phi\) is smaller than 0, stack of points lay above the threshold of the 95% confidence interval. However we need to be cautious of the points lay under the threshold line when phi is between 0.5 and 1.

6. Conclusion

As a commonly used fuel source, Crude oil plays an important role all over the world and in the industrial fields. Our project is analyzing the annual Crude oil price, with a specific focus on the period of 1980 to 2020. We mainly focus on the three models for analysis, Garch, ARMA, as well as POMP. Starting with the Exploratory Data Analysis, there is no obvious trend or period observed in our data. With the fitting of GARCH models, we finally decide that GARCH(1,1) is the most ideal one among the choices based on the AIC criterion. After using Wilk’s approximation as the likelihood ratio test, as well as the AIC values as a reference, we decide ARMA(0,1) is the most recommended one in the ARMA model part. In the POMP model part, we find that our parameters converge to the maximization of the maximum likelihood. Additionally, the maximum log-likelihood using global search is greater than using the local parameter search in our case with the lowest AIC value of 16.4.

After these analyses, we find that these models are appropriate for the time series analysis for the Crude oil price data set. Although due to time and computing resource limitations, shrinking data samples by selecting parts of the original data set as our analysis target may affect on the performance of the models (For instance, in the ARMA model analysis, it may result in the ARMA(0,0) corresponding to the lowest AIC value), we still obtain the reasonable analysis results from the data. In the future work, we may try to increase the size of the data set to see if there is any further interesting finding.

7. References
[1] Wikipedia: Crude oil. URL:https://en.wikipedia.org/wiki/Petroleum#Uses. access at 04/09/2022.
[2] Investopedia: How Oil Prices Impact the U.S. Economy. URL:https://www.investopedia.com/articles/investing/032515/how-oil-prices-impact-us-economy.asp#:~:text=The%20price%20of%20oil%20influences,costs%20and%20cheaper%20airline%20tickets. access at 04/09/2022.

[3] “Crude Oil Prices.” Our World in Data, https://ourworldindata.org/grapher/crude-oil-prices?time=earliest..latest. access at 04/09/2022.

[4] Forbes. URL: https://www.forbes.com/sites/forbesbooksauthors/2021/01/25/factors-that-influence-pricing-of-oil-and-gas/?sh=525908b9338d. access at 04/09/2022.

[5] Lecture Notes, Chapter 4: “Linear time series models and the algebra of ARMA models”. URL: https://ionides.github.io/531w22/04/notes.pdf. access at 04/10/2022.

[6] Lecture Notes, Chapter 16: A case study of financial volatility and a POMP model with observations driving latent dynamics”. URL: https://ionides.github.io/531w22/16/notes.pdf. access at 04/11/2022.

[7] Final Project 2021: Volatility analysis on the Shanghai Composite Index. URL: https://github.com/ionides/531w21/blob/main/final_project/project16/blinded.Rmd. access at 04/12/2022.

[8] Lecture Notes, Chapter 12: “Simulation of stochastic dynamic models”. URL: https://ionides.github.io/531w20/12/notes12.pdf. access at 04/14/2022.

[9] Lecture Notes, Chapter 14: “Likelihood maximization for POMP models”. URL: https://ionides.github.io/531w20/14/notes14.pdf. access at 04/14/2022.