1 Background & Dataset

Ozone concentration is an important factor when considering the air quality. Ozone generally does no harm to human body, but can effectively kill microorganisms such as bacteria. However, ozone can be dangerous without proper handling. According to OSHA (Occupational Safety and Health Administration), ozone levels should never exceed the following average: 0.10 ppm (parts per million) for 8 hours per day exposure.

Also, the guidelines for ozone in the workplace are

In this project, I’d like to invesigate how the ozone concentration varied in Ann Arbor during the last year. The dataset is from EPA - United States Environmental Protection Agency. It contains the daily maximum 8 hours ozone concentration from 2019-1-1 to 2019-12-31. As you can see from the following table, the maximum value for the ozone concentration is 0.066 ppm, which is greater than the value in the guidelines for 8 hours per day exposure doing heavy work. It shows that it’s necessary to conduct a time series analysis to fit the ozone concentration data in Ann Arbor.

Summary of the ozone concentration dataset
Date Daily.Max.8.hour.Ozone.Concentration
Min. :2019-01-01 Min. :0.00200
1st Qu.:2019-04-02 1st Qu.:0.03000
Median :2019-07-02 Median :0.03700
Mean :2019-07-02 Mean :0.03672
3rd Qu.:2019-10-01 3rd Qu.:0.04300
Max. :2019-12-31 Max. :0.06600

2 Objectives

This project aims to answer or bring insights to the following questions:

3 Data Exploration

First, let’s plot the time series. From the plot, the mean of the data does not look stationary and the non-constant variance implies that the covariance is also not stationary. From the Feburary to June, the variance keeps increasing and then decreases a little bit from June to Sepetember. Therefore, a weekly stationary model is not appropriate for the this time series dataset.

3.1 Seasonality Analysis

In this section, the seasonality analysis will be conducted to see if the time series data has any seasonal variations.

The smoothed periodogram shows that there is no obvious cycle in the time seires. Some local peaks occur at frequency 0.29 and 0.34. However, the distance from the peak to its base is very close to the 95% confidence interval shown on the top right corner, thus it’s not statistically signifanct that the cycles corresponding to the frequencies 0.34 and 0.29 can describe the data.

The spectrum estimated by fitting an AR(p) model with p selected by AIC also shows that no frequency stands out significantly, from which we can also conclude that there is no cycle in the Ann Arbor ozone concentration in 2019 time series.

3.2 Fitting an ARMA Model without trend

A log transform is inappropriate to the data because the variance of the time series data seems to be repeating increasing and then decreasing. Also, a trend is still observable after the log transformation.

For now, let’s assume that there is no trend and start by fitting a stationary ARMA(p, q) model. This hypothesis is not entirelty reasonable because it asserts that the ozone concentration does not significantly change in the last year.

We seek to fit a stationary Gaussian ARMA(p,q) model with parameter vector \(\theta=(\phi_{1:p},\psi_{1:q},\mu,\sigma^2)\) given by \[ \phi(B)(X_n-\mu) = \psi(B) \epsilon_n,\] where

\[ \begin{eqnarray} \mu &=& \mathbb{E}[X_n] \\ \phi(x)&=&1-\phi_1 x-\dots -\phi_px^p, \\ \psi(x)&=&1+\psi_1 x+\dots +\psi_qx^q, \\ \epsilon_n&\sim&\mathrm{ iid }\, N[0,\sigma^2]. \end{eqnarray} \]

3.2.1 AIC Table

Among all the values for p and q, we need to pick one for our ARMA model. We plan to select the model with the lowest Akaike’s Information criterion score. AIC was derived as an approach to minimizing prediction error. It is useful when viewed as a way to select a model with reasonable predictive skill from a range of possibilities.

Akaike’s information criterion AIC is given by \[ AIC = -2 \times \ell(\theta^*) + 2D\] It penalizes the likelihood of each model by a measure of its complexity. Increasing the number of parameters leads to additional overfitting which can decrease predictive skill of the fitted model.

MA0 MA1 MA2 MA3 MA4 MA5
AR0 -2288.15 -2420.72 -2459.74 -2468.05 -2486.04 -2488.17
AR1 -2498.57 -2516.60 -2540.10 -2540.45 -2538.65 -2538.50
AR2 -2504.25 -2541.41 -2540.21 -2541.84 -2539.95 -2538.33
AR3 -2513.58 -2540.22 -2538.23 -2539.97 -2538.23 -2537.11
AR4 -2523.17 -2538.28 -2538.83 -2538.31 -2534.43 -2538.16

The AIC table shows that ARMA(2,3) has the lowest AIC value, which is -2541.84. The ARMA(2,1) model has the second lowest AIC value, which is -2541.41. However, the ARMA(2,1) model is simpler and we prefer to select model with less parameters. Therefore, I’m going to to use ARMA(2,1) to fit the data.

The coefficients and log likelihood of the model are shown below.

## 
## Call:
## arima(x = dat[, Daily.Max.8.hour.Ozone.Concentration], order = c(2, 0, 1))
## 
## Coefficients:
##          ar1      ar2      ma1  intercept
##       1.3501  -0.3548  -0.9138     0.0325
## s.e.  0.0576   0.0567   0.0249     0.0057
## 
## sigma^2 estimated as 5.37e-05:  log likelihood = 1275.7,  aic = -2541.41

From the plot above, we can swee that the model can capture the main features of the ozone concentration, but has the problem of under-estimating peaks.

3.2.2 Check AR and MA roots

Then, we need to check the AR and MA roots and see if they can be cancelled and check invertibility

AR_roots <- polyroot(c(1,-coef(arma21)[c("ar1","ar2")])) 
AR_roots
## [1] 1.007345+0i 2.798274+0i

Both the AR and MA roots are outside unit circle, suggesting we have a stationary causal fitted model. The MA root and AR roots are not close to each other, which implies there is no parameter redundancy problem. One of the AR roots (1.007) is close to the unit circle, so the fitted model is close to non-invertibility. This could potentially cause numerical instability.

3.2.3 Diagnostics Analysis

The residual plot shows no abnormal pattern; the mean fluctuates around 0 and the variance does not change very much. The ACF plot shows that there is no significant autocorrelation within 25 lags despite the fact that the ACF value exceeds the blue dashed line at lag 20. However, only 1 out of 25 lags shows some abnormal pattern which agrees with the definition of 95% confidence interval. The QQ-plot shows that the residual partially satisfies the normality assumption made above \(\epsilon_n \sim N[0,\sigma^2]\) though the left tail deviates from the normal distribution somehow.

3.3 Fitting an ARMA Model with trend

Applying a difference operation to the data can make it look more stationary and therefore more appropriate for ARMA modeling. This transformation makes the model become ARIMA, which is integrated autoregressive moving average model. After this transformation, the plot below shows that a mean stationary model would be appropriate to the time series data.

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -2.900e-02 -4.000e-03  1.000e-03  2.198e-05  5.000e-03  2.400e-02

3.3.1 AIC Table

We can then look at the AIC table to find appropriate values for p and q. The AIC table shows that ARIMA(1,1,1), ARIMA(2,1,1), ARIMA(1,1,2) and ARIMA(1,1,3) are good candidates.

MA0 MA1 MA2 MA3 MA4 MA5
AR0 -2428.38 -2504.98 -2534.11 -2534.48 -2532.69 -2532.51
AR1 -2457.70 -2535.45 -2534.25 -2535.88 -2533.98 -2532.36
AR2 -2482.77 -2534.25 -2532.26 -2534.01 -2532.35 -2530.59
AR3 -2502.18 -2532.30 -2532.86 -2530.43 -2528.46 -2530.29
AR4 -2504.17 -2531.39 -2532.52 -2530.52 -2528.52 -2530.20

3.3.2 Hypothesis Test

To statistically judge whether a trend is appropriate to the model, we can use the null hypothesis that the model has no trend, which is ARMA(2,1) and compares it to the alternative hypothesis/model ARIMA(2,1,1).

Let’s fit the alternative model ARIMA(2,1,1).

## 
## Call:
## arima(x = dat[, Daily.Max.8.hour.Ozone.Concentration], order = c(2, 1, 1))
## 
## Coefficients:
##          ar1      ar2      ma1
##       0.3650  -0.0502  -0.9094
## s.e.  0.0578   0.0561   0.0264
## 
## sigma^2 estimated as 5.377e-05:  log likelihood = 1272.12,  aic = -2536.24

The difference of the log likelihood is 1275.70-1272.12=2.58 which is larger than the 1.92 cutoff for a test at 5% size. Therefore, we can reject the null hypothesis, which means the ARIMA(2,1,1) model better fits the data.

3.3.3 AR and MA roots

Then, we need to check the AR and MA roots and see if they can be cancelled and check invertibility.

AR_roots2 <- polyroot(c(1,-coef(arima211)[c("ar1","ar2")])) 
AR_roots2
## [1] 3.637206+2.588377i 3.637206-2.588377i

Both the AR roots and MA root are outside the unit circle and are not close to each other, suggesting the non-invertibility and causality of the model.

3.3.4 Diagnostics Analysis

The diagnostics plots for the ARIMA(2,1,1) model looks close to the ARMA(2,1), both showing no abnormal pattern.

4 Conclusion

In this report, we sought to find out what time series models will best fit the 2019 ozone concentration time series. To summarize, the model ARIMA(2,1,1) is the most appropriate one. The spectrum plot shows that there is no seasonal variation in the time series data within the year 2019. By conducting hypothesis test, a model with trend proves to be statistically significant.

5 Reference

  1. Ionides, E. (n.d.). Stats 531 (Winter 2020) ‘Analysis of Time Series’ Retrieved from http://ionides.github.io/531w20/

  2. OHSA and Ozone. Retrieved from https://www.ozonesolutions.com/knowledge-center/osha-and-ozone.html/

  3. Outdoor Air Quality Data. Retrieved from https://www.epa.gov/outdoor-air-quality-data/

  4. Previous year midterm project 531w18 - Project 24. Retrieved from https://ionides.github.io/531w18/midterm_project/project24/531_midterm_project.html