1. Introduction

High levels of fine particulate matter, especially PM2.5 (i.e., atmospheric particulate matter with a diameter smaller than 2.5 μm), are among the most harmful air pollutants, causing economic losses and damaging human health through cardiovascular and respiratory disease. The World Health Organization (WHO) estimates that about 3 million people die every year due to ambient outdoor air pollution, and its guideline value for PM2.5 is 10 μg/m3 [1]. However, the median daily concentration in Delhi is 124 μg/m3 (shown in the summary table below), more than ten times the WHO guideline value. Our project therefore aims to fit a time series model that captures the pattern of PM2.5 concentration, which may help the Indian government formulate policies to mitigate PM2.5 pollution.

2. Exploratory data analysis

We collected daily PM2.5 concentration data for Delhi from January 2015 to December 2018 [2]. There are 1461 entries with 2 missing values, which we impute with the previous day's value (last observation carried forward).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.88   62.41  103.93  124.00  164.29  685.36       2
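A minimal sketch of this imputation step is shown below; the daily data frame name `X` and its column `PM2.5` are assumed names for illustration, not taken from the original code.

```r
library(zoo)                        # provides na.locf (last observation carried forward)

# Replace each NA in the daily series with the most recent non-missing value
X$PM2.5 <- na.locf(X$PM2.5, na.rm = FALSE)
sum(is.na(X$PM2.5))                 # should now be 0
```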

After imputing the missing data, we plotted the daily PM2.5 series and its histogram. The daily concentrations are strongly right-skewed, and the series does not appear to be mean- or covariance-stationary. In addition, the large number of daily observations complicates the seasonality analysis [3]. We therefore reduce the data to weekly means and apply a log transformation.
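A sketch of the weekly aggregation and log transformation follows; it assumes a daily data frame `X` with `Date` and `PM2.5` columns, and names the weekly data frame `Z` to match the output shown later.

```r
# Average the imputed daily values within each calendar week,
# then take logs to stabilise the variance
X$week <- cut(as.Date(X$Date), breaks = "week")
Z <- aggregate(PM2.5 ~ week, data = X, FUN = mean)
Z$PM2.5 <- log(Z$PM2.5)

plot(Z$PM2.5, type = "l", ylab = "log weekly mean PM2.5")   # time plot
hist(Z$PM2.5, main = "", xlab = "log weekly mean PM2.5")    # histogram
```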

After the log transformation, we plotted the weekly PM2.5 series and its histogram. The variance of the transformed data is relatively stable and the mean is nearly constant. The histogram also looks approximately normal, and a Shapiro-Wilk test (p-value = 0.053 > 0.05) does not reject normality of the log weekly data.

## 
##  Shapiro-Wilk normality test
## 
## data:  Z$PM2.5
## W = 0.98706, p-value = 0.05304

Before establishing the ARMA model, we conducted a spectral analysis of the log weekly data. The raw periodogram is difficult to interpret, so we applied a periodogram smoother. The smoothed spectrum peaks at frequencies of roughly 0.019 to 0.025 cycles per week, which corresponds to a cycle of about 0.77 years (i.e., 1/0.025/52) to 1 year (i.e., 1/0.019/52). This supports the visual impression of seasonality, as well as the expectation that PM2.5 follows a yearly cycle.
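A sketch of the smoothed periodogram is given below; the `spans` values are illustrative choices, not necessarily those used for the plot described above.

```r
# Smoothed periodogram of the log weekly series; frequency is in cycles per week
smoothed  <- spectrum(Z$PM2.5, spans = c(3, 5, 3), main = "Smoothed periodogram")
peak_freq <- smoothed$freq[which.max(smoothed$spec)]
1 / peak_freq / 52                  # dominant period expressed in years
```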

3. Model selection and analysis

3.1 ARMA model analysis

Having checked the weak stationarity of the weekly data, we fit a general stationary \(ARMA(p,q)\) model. Its formula [3] is

\[\begin{align*} \phi(B)(Y_n -\mu) = \psi(B)\varepsilon_n \end{align*}\]

where

\[\begin{align*} &\mu = \mathbb{E}[Y_n]\\ &\phi(x) =1 - \phi_1 x - \phi_2 x^2 - \dots - \phi_p x^p\\ &\psi(x) = 1 + \psi_1 x + \psi_2 x^2 + \dots + \psi_q x^q\\ &\varepsilon_{n} \overset{iid}{\sim} N(0, \sigma^2) \end{align*}\]

Next, we choose the values of p and q for \(ARMA(p,q)\). Akaike's information criterion (AIC) is a general approach for this selection: it compares the likelihoods of different models after penalizing each by a measure of its complexity. From the AIC table below, the models with (p,q) = (4,2), (4,1) and (5,2) have relatively low AIC values, so we consider these three models as our candidates.

|     | MA0    | MA1    | MA2    | MA3    | MA4    | MA5   |
|-----|--------|--------|--------|--------|--------|-------|
| AR0 | 397.75 | 246.76 | 196.19 | 155.59 | 122.44 | 95.85 |
| AR1 | 88.54  | 79.20  | 78.60  | 66.47  | 68.46  | 70.25 |
| AR2 | 78.80  | 80.40  | 79.80  | 68.45  | 68.23  | 69.34 |
| AR3 | 79.69  | 75.93  | 77.83  | 67.87  | 80.46  | 67.18 |
| AR4 | 74.34  | 55.21  | 55.89  | 69.70  | 57.07  | 65.21 |
| AR5 | 70.47  | 56.60  | 55.73  | 57.73  | 67.07  | 65.75 |
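Such a table can be computed with the standard AIC-table loop described in the course notes [3]; the code below is a sketch of that approach rather than the exact code used here.

```r
# Fit ARMA(p,q) for each (p,q) on a grid and record the AIC values
aic_table <- function(data, P, Q) {
  table <- matrix(NA, P + 1, Q + 1)
  for (p in 0:P) {
    for (q in 0:Q) {
      # NA if a particular fit fails to converge
      table[p + 1, q + 1] <- tryCatch(arima(data, order = c(p, 0, q))$aic,
                                      error = function(e) NA)
    }
  }
  dimnames(table) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  table
}
aic_table(Z$PM2.5, 5, 5)
```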

We first fit the ARMA(4,2), ARMA(5,2) and ARMA(4,1) models. The results show that all three are causal and invertible. However, the roots of the ARMA(4,1) model are close to the unit circle, so we disregard it.

We plot the ACF of the residuals of the ARMA(4,2) and ARMA(5,2) models. There is no substantial autocorrelation in the residuals of either model, although the autocorrelation at lag 52 falls outside the confidence band. The Q-Q plots also show slightly heavier tails than normal, so the models are generally acceptable but do not fit the data perfectly.
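These diagnostics can be reproduced along the following lines; this is a sketch, and the object names `arma42` and `arma52` are ours, not taken from the original code.

```r
arma42 <- arima(Z$PM2.5, order = c(4, 0, 2))
arma52 <- arima(Z$PM2.5, order = c(5, 0, 2))

# Residual ACF: look for autocorrelation, in particular near lag 52 (one year)
acf(residuals(arma42), lag.max = 60)
acf(residuals(arma52), lag.max = 60)

# Q-Q plots to check the normality of the residuals
qqnorm(residuals(arma42)); qqline(residuals(arma42))
qqnorm(residuals(arma52)); qqline(residuals(arma52))
```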

Since the ARMA(4,2) and ARMA(5,2) fits are very similar, we prefer the simpler model and select ARMA(4,2) as our candidate.

## 
## Call:
## arima(x = Z$PM2.5, order = c(4, 0, 2))
## 
## Coefficients:
##          ar1     ar2      ar3      ar4      ma1      ma2  intercept
##       0.7746  0.9072  -0.3409  -0.3773  -0.1812  -0.8188     4.6582
## s.e.  0.1198  0.2042   0.1160   0.0643   0.1197   0.1191     0.0146
## 
## sigma^2 estimated as 0.06896:  log likelihood = -19.95,  aic = 55.89
## [1] 1.024588 1.588981 1.588981 1.024588
## [1] 1.105116 1.105116
## 
## Call:
## arima(x = Z$PM2.5, order = c(5, 0, 2))
## 
## Coefficients:
##          ar1     ar2      ar3      ar4      ar5      ma1      ma2  intercept
##       0.6632  0.9909  -0.2879  -0.2915  -0.1165  -0.1043  -0.8957     4.6584
## s.e.  0.1265  0.1752   0.1125   0.0881   0.0819   0.1079   0.1071     0.0134
## 
## sigma^2 estimated as 0.06818:  log likelihood = -18.87,  aic = 55.73
## [1] 1.019756 1.281529 1.019756 2.537503 2.537503
## [1] 1.056629 1.056629
## 
## Call:
## arima(x = Z$PM2.5, order = c(4, 0, 1))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1  intercept
##       1.5184  -0.3456  0.0137  -0.2117  -0.9754     4.6558
## s.e.  0.0716   0.1263  0.1256   0.0701   0.0347     0.0232
## 
## sigma^2 estimated as 0.07006:  log likelihood = -20.6,  aic = 55.21
## [1] 1.018689 1.018689 2.133523 2.133523
## [1] 1.025205
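The numeric vectors printed after each fit appear to be the moduli of the AR and MA polynomial roots; values greater than 1 indicate causality and invertibility. A sketch of this check for the ARMA(4,2) fit, reusing the `arma42` object assumed in the sketch above:

```r
# Moduli of the roots of the AR polynomial 1 - ar1*x - ... - ar4*x^4
abs(polyroot(c(1, -coef(arma42)[c("ar1", "ar2", "ar3", "ar4")])))
# Moduli of the roots of the MA polynomial 1 + ma1*x + ma2*x^2
abs(polyroot(c(1, coef(arma42)[c("ma1", "ma2")])))
```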

Because some seasonality still appeared in the residual ACF plot, we tried adding a difference term to see whether it improves performance. The diagnostic plots are visually comparable, but the log likelihood is lower and the AIC is higher. In addition, one MA root lies inside the unit circle, so the differenced model is not invertible. The ARMA(4,2) model is therefore preferred to the ARIMA(4,1,2) model.

## 
## Call:
## arima(x = Z$PM2.5, order = c(4, 1, 2))
## 
## Coefficients:
##           ar1      ar2      ar3     ar4      ma1     ma2
##       -0.0206  -0.8043  -0.1321  0.0406  -0.2989  0.8866
## s.e.   0.0938   0.0887   0.0786  0.0759   0.0647  0.0563
## 
## sigma^2 estimated as 0.07652:  log likelihood = -28.42,  aic = 70.84
## [1] 1.072688 1.072688 3.326770 6.440826
## [1] 1.2438725 0.9067722

3.2 SARMA model analysis

Because the ARMA model is imperfect and the spectral analysis in Section 2 suggests a yearly cycle, we also tried SARMA models. We chose to include a seasonal AR term rather than a seasonal MA term because a potential seasonal trend appears both in the spectral analysis and in the residual ACF of the ARMA model. We therefore used a \(SARMA(p,q)\times(1,0,0)\) model rather than a \(SARMA(p,q)\times(1,0,1)\) model.

Based on the spectral density above and the ACF of the ARMA residuals, we select a period of 52 weeks. The general \(SARMA(p,q)\times(P,0,Q)_{52}\) model [3] is

\[\begin{align*} \phi(B)\Phi(B^{52})(Y_n -\mu) = \psi(B)\Psi(B^{52})\varepsilon_n \end{align*}\]

where

\[\begin{align*} \mu &= \mathbb{E}[Y_n]\\ \phi(x) &=1 - \phi_1 x - \phi_2 x^2 - \dots - \phi_p x^p\\ \psi(x) &= 1 + \psi_1 x + \psi_2 x^2 + \dots + \psi_q x^q\\ \Phi(x) &=1 - \Phi_1 x - \Phi_2 x^2 - \dots - \Phi_P x^P\\ \Psi(x) &= 1 + \Psi_1 x + \Psi_2 x^2 + \dots + \Psi_Q x^Q\\ \varepsilon_{n} &\overset{iid}{\sim} N(0, \sigma^2) \end{align*}\]

In our model \(P = 1\) and \(Q = 0\), so \(\Phi(x) = 1 - \Phi_1 x\) and \(\Psi(x) = 1\).

## 
## Call:
## arima(x = Z$PM2.5, order = c(4, 0, 2), seasonal = list(order = c(1, 0, 0), period = 52))
## 
## Coefficients:
##          ar1      ar2     ar3     ar4      ma1     ma2    sar1  intercept
##       0.7483  -0.7834  0.5403  0.2831  -0.2157  1.0000  0.3413     4.6994
## s.e.  0.0725   0.0770  0.0812  0.0688   0.0132  0.0238  0.0790     0.1978
## 
## sigma^2 estimated as 0.06671:  log likelihood = -19.43,  aic = 56.86

3.3 Likelihood ratio test

To further assess whether the seasonal term should be kept in the model, we use a likelihood ratio test. The nested hypotheses are

\[\begin{align*} H^{<0>}:\theta \in \Theta^{(0)}\\ H^{<1>}:\theta \in \Theta^{(1)} \end{align*}\]

where \(H^{<0>}\) is the \(ARMA(4,2)\) model and \(H^{<1>}\) is the \(SARMA(4,2)\times(1,0,0)_{52}\) model.

The log likelihood of the \(ARMA(4,2)\) model is -19.95, and that of the \(SARMA(4,2)\times(1,0,0)_{52}\) model is -19.43. By Wilks' theorem, twice the difference in log likelihoods is approximately \(\chi^2_1\) distributed under \(H^{<0>}\), giving a cutoff of \(3.84/2 \approx 1.92\) on the log-likelihood scale for a test at 5% size. The observed difference of 0.52 is much smaller than this cutoff, so we should not keep the seasonal term. As mentioned in lecture 6 of the course notes [3], SARMA models may not work well for higher-frequency data. We therefore conclude that \(ARMA(4,2)\) fits the data better.
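A sketch of the corresponding calculation, with the log likelihoods taken from the fitted models reported above:

```r
loglik_arma  <- -19.95    # ARMA(4,2)
loglik_sarma <- -19.43    # SARMA(4,2)x(1,0,0)_52

lrt_stat <- 2 * (loglik_sarma - loglik_arma)   # = 1.04
cutoff   <- qchisq(0.95, df = 1)               # = 3.84, i.e. 1.92 on the log-likelihood scale
lrt_stat > cutoff                              # FALSE: do not reject the ARMA(4,2) model
```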

4. Conclusion

Our project aims to fit a time series model that captures the pattern of PM2.5 concentration in Delhi. We collected the daily data and found it was not stationary, so we aggregated it into weekly means and took logs to make it stationary. We then fit ARMA and SARMA models to the transformed data. Our results show that the \(SARMA(4,2)\times(1,0,0)_{52}\) model fits the data worse than the \(ARMA(4,2)\) model, which suggests that an explicit seasonal (annual) term is not needed for the weekly data. In future work, we could make predictions with the selected \(ARMA(4,2)\) model and investigate the factors that influence PM2.5 pollution.

5. References

1. World Health Organization, ambient (outdoor) air quality guidelines: https://www.who.int/phe/health_topics/outdoorair/outdoorair_aqg/en
2. Air Quality Data in India (Kaggle dataset): https://www.kaggle.com/rohanrao/air-quality-data-in-india
3. STATS 531 (Winter 2020) course notes: https://ionides.github.io/531w20/