Summary

This project aims at time series analysis of PM2.5 concentration in Shanghai, China.


1. Introduction

1.1 Background

Air pollution has long been a crucial environmental concern in many countries since the rise of industrialization. This issue is particularly pressing in China. As a developing country with fast expanding manufacture industry and ongoing urbanization, China is suffering from poor air quality in many of its major cities.

Among all air pollution indicators, PM2.5 is a representitive particle to measure the air quality and is most discussed by general public. Hourly PM2.5 concentration data in 5 Chinese cities has been recorded for years and the data is currently available online. In this project, I am going to focus on the data in Shanghai, the city that I spent 9 years living in.

1.2 Objective

This project aims to answer or bring insights to following questions:

  • What time series model fits PM2.5 hourly data well (in a monthly time span)? Is there any significant seasonality?

  • What factors can help explain the pattern of PM2.5 concentration? Does PM2.5 level associate with temperature, humidity, or other variables?

The exploration of above aspects would help us better understand the dynamic of PM2.5 pollution and encourage more effective solution to this environmental problem.

1.3 Scope of Data

To narrow down the scope of this project, hourly PM2.5 data from Jan 01 2015 to Jan 31 2015 is selected to form a time series \(Y_n\) with \(N=744\).

Some key environmental variables of the same length are also kept for later analysis.

## 'data.frame':    744 obs. of  5 variables:
##  $ HUMI         : num  34.6 40.2 40.2 40.2 43.3 ...
##  $ Iws          : num  11 15 20 24 29 33 38 42 47 54 ...
##  $ precipitation: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TEMP         : num  2 1 1 1 0 0 0 -1 1 2 ...
##  $ pm           : num  32 40 37 44 38 38 32 41 51 50 ...
##       HUMI             Iws         precipitation          TEMP       
##  Min.   : 18.07   Min.   :  0.00   Min.   :0.00000   Min.   :-3.000  
##  1st Qu.: 55.63   1st Qu.:  6.00   1st Qu.:0.00000   1st Qu.: 4.000  
##  Median : 70.06   Median : 22.00   Median :0.00000   Median : 7.000  
##  Mean   : 66.76   Mean   : 58.33   Mean   :0.06969   Mean   : 6.737  
##  3rd Qu.: 80.67   3rd Qu.: 69.00   3rd Qu.:0.00000   3rd Qu.: 9.000  
##  Max.   :100.00   Max.   :403.00   Max.   :5.10000   Max.   :20.000  
##        pm        
##  Min.   : 11.00  
##  1st Qu.: 33.75  
##  Median : 77.00  
##  Mean   : 84.59  
##  3rd Qu.:119.00  
##  Max.   :364.00

2. Data Exploration

Judging by the plot, this series is clearly not mean stationary, and the variance is not stable. We can take the log to stablized variance and take temporal difference to de-trend the data.

Zoomed the plot at two particular days, we can also observe a pattern within each day: PM2.5 level has the first peak around 8-11 (possibly associated with morning commute), drops during noon and early afternoon, and peaks again across the evening (possibly associated with evening commute and the lack of sunlight).

3. De-trend Data

Based on previous observation, we first take the log of data to get a new series, and then take the first difference in model fitting: \[Z_n = log(Y_n) \] \[\Delta Z_n=\Delta log(Y_n)= log(Y_n)-log(Y_{n-1})\]

Now the variance is more stable and a mean stationary model is applicable.

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.6423000 -0.0920900  0.0000000 -0.0007016  0.0982100  0.8622000

The ACF of the first difference of this time series has a vague sinusoidal pattern. An AR model might be applicable. The spectrum estimation with AR method suggests an AR(2) model.


4. Model Fitting:

4.1 ARMA Model Selection

To select a suitable model, an AIC table is constructed (see supplementary material). Taking both AIC value and the simplicity of model into consideration, ARIMA(2,1,0) model is the best choice. Besides, the roots of AR polynomial are both outside unit circle. The fitted ARMA is causal, which is desired.

## 
## Call:
## arima(x = sh.pm, order = c(2, 1, 0))
## 
## Coefficients:
##          ar1     ar2
##       0.1242  0.0937
## s.e.  0.0366  0.0366
## 
## sigma^2 estimated as 0.02995:  log likelihood = 249.05,  aic = -492.11

Roots of polynomial of AR:

## [1]  2.670828-0i -3.997241+0i

4.2 Hypothesis Testing

Judging by the estimated values and standard errors of two AR parameters, they are both significantly different from zero. A formal Likelihood Ratio Test can be conducted:

\(H_0:\phi_1=\phi_2=0\)

\(H_a:\phi_1\ or\ \phi_2\neq0\)

We run the null model to get the log likelihood of it and then calculate the test statistics. Under null hypothesis, the test statistics follow a \(\chi_2^2\) distribution

## 
## Call:
## arima(x = sh.pm, order = c(0, 1, 0))
## 
## 
## sigma^2 estimated as 0.03079:  log likelihood = 238.76,  aic = -475.51

\(LRT=2[log L(\theta_a)-log L(\theta_0)]=2*(249.05-238.76)=20.58>\chi_{2,0.95}^2\)

The AR parameters are significant.

4.3 Residual Diagnostics

Now we examine the residuals from this ARIMA(2,1,0) model. The residuals of is symmetric with bell shape, but its tail is heavier than normal distribution. Judging by the plot from section 2, the variance is not constant across the series. A few time points have larger variance. The unconstant variance might be the reason behind the non-normality of residuals. The ACF plot suggests that residuals may be mildly correlated. Perhaps there are other unknown covariates driving the trend behind this time series.

4.4 Seasonality

To further improve the model, we can try to add seasonality factors. The natural time cycle is 24 hour/day. With this factor added, the log likelihood sightly improved. We can also use Likelihood Ratio Test to test the significance of this seasonal parameter.

## 
## Call:
## arima(x = sh.pm, order = c(2, 1, 0), seasonal = list(order = c(1, 0, 0), period = 24))
## 
## Coefficients:
##          ar1     ar2     sar1
##       0.1229  0.0920  -0.0245
## s.e.  0.0367  0.0367   0.0377
## 
## sigma^2 estimated as 0.02993:  log likelihood = 249.26,  aic = -490.53

\(H_0:\) Seasonal AR parameter \(\Phi_1=0\)

\(H_a:\) Seasonal AR parameter \(\Phi_1\neq0\)

\(LRT=2[log L(\theta_a)-log L(\theta_0)]=2*(249.26-249.05)<<\chi_{1,0.95}^2\)

Therefore the 24 hour seasonal parameter is not significant. To keep the simplicity of the model, we can exclude this parameter.


5. Further Exploration on Trend

To better explain the pattern of PM2.5 Concentration, we can leverage other environmental variables available in this dataset.

Four environmental variables: humidity, cumulated wind speed, precipitation, and temperature, are added to the ARIMA(2,1,0) respectively. Likelihood Ratio Test are also conducted.

According to the output, the only significant covariate is cumulated wind speed.

##                         logL   LRT p_val
## Humidity             249.124 0.141 0.707
## Cumulated Wind Speed 253.395 8.683 0.003
## Precipitation        249.722 1.337 0.248
## Temperature          249.146 0.185 0.667
## 
## Call:
## arima(x = sh.pm, order = c(2, 1, 0), xreg = sh15jan$Iws)
## 
## Coefficients:
##          ar1     ar2  sh15jan$Iws
##       0.1324  0.1095        8e-04
## s.e.  0.0366  0.0369        3e-04
## 
## sigma^2 estimated as 0.0296:  log likelihood = 253.4,  aic = -498.79

6. Conclusion

\[(1-0.1324B-0.1095B^2)[(1-B)log(Y_n)-0.0008W_n]=\epsilon_n\]

\(W_n\) is the cumulated wind speed at \(t_n\) (m/s). \(Y_n\) is the PM2.5 concentration at \(t_n\) (ug/m^3). \(\epsilon_n\) is gaussian white noise \(\epsilon_n \overset{iid} \sim N[0,\sigma^2]\), where \(\sigma^2=0.0296\)


7. Future Analysis


8. Supplementary Material

1. Model selection by AIC

## Loading required package: knitr
MA0 MA1 MA2 MA3 MA4 MA5
AR0 -475.51 -485.29 -490.91 -490.75 -488.76 -487.15
AR1 -487.60 -491.07 -490.57 -488.71 -486.94 -485.58
AR2 -492.11 -490.40 -489.05 -487.12 -485.12 -496.86
AR3 -490.55 -489.19 -487.25 -491.43 -495.71 -491.35
AR4 -488.91 -487.24 -497.86 -494.68 -503.06 -486.25

2. Data Source

3. Missing data


9. Reference

[1] https://archive.ics.uci.edu/ml/datasets/PM2.5+Data+of+Five+Chinese+Cities
[2] Liang, X., S. Li, S. Zhang, H. Huang, and S. X. Chen (2016), PM2.5 data reliability, consistency, and air quality assessment in five Chinese cities, J. Geophys. Res. Atmos., 121, 10220-10236
[3] Lecture notes and source code of Stats 531: https://ionides.github.io/531w18/
[4] Guerrero, Victor M. and Perera, Rafael (2004) “Variance Stabilizing Power Transformation for Time Series,” Journal of Modern Applied Statistical Methods: Vol.3: Iss. 2, Article 9