Stats 531 Midterm Project

Summary

This project aims at time series analysis of PM2.5 concentration in Shanghai, China.

Hourly data from Jan 01 2015 to Jan 31 2015 is analysed. Log transformation is applied to stablize variance. Since the data shows certain trend, first difference is taken to achieve mean stationary.
An ARIMA(2,1,0) model is selected by AIC, but the residuals are not normally distributed. There is no significant seasonality.
The association between PM2.5 level and multiple environmental covariates (humidity, temperatur and etc.) are examined. The statistical test shows that the only significant covariate is cumulated wind speed.

1. Introduction

1.1 Background

Air pollution has long been a crucial environmental concern in many countries since the rise of industrialization. This issue is particularly pressing in China. As a developing country with fast expanding manufacture industry and ongoing urbanization, China is suffering from poor air quality in many of its major cities.

Among all air pollution indicators, PM2.5 is a representitive particle to measure the air quality and is most discussed by general public. Hourly PM2.5 concentration data in 5 Chinese cities has been recorded for years and the data is currently available online. In this project, I am going to focus on the data in Shanghai, the city that I spent 9 years living in.

1.2 Objective

This project aims to answer or bring insights to following questions:

What time series model fits PM2.5 hourly data well (in a monthly time span)? Is there any significant seasonality?
What factors can help explain the pattern of PM2.5 concentration? Does PM2.5 level associate with temperature, humidity, or other variables?

The exploration of above aspects would help us better understand the dynamic of PM2.5 pollution and encourage more effective solution to this environmental problem.

1.3 Scope of Data

To narrow down the scope of this project, hourly PM2.5 data from Jan 01 2015 to Jan 31 2015 is selected to form a time series \(Y_n\) with \(N=744\).

Some key environmental variables of the same length are also kept for later analysis.

## 'data.frame':    744 obs. of  5 variables:
##  $ HUMI         : num  34.6 40.2 40.2 40.2 43.3 ...
##  $ Iws          : num  11 15 20 24 29 33 38 42 47 54 ...
##  $ precipitation: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TEMP         : num  2 1 1 1 0 0 0 -1 1 2 ...
##  $ pm           : num  32 40 37 44 38 38 32 41 51 50 ...

##       HUMI             Iws         precipitation          TEMP       
##  Min.   : 18.07   Min.   :  0.00   Min.   :0.00000   Min.   :-3.000  
##  1st Qu.: 55.63   1st Qu.:  6.00   1st Qu.:0.00000   1st Qu.: 4.000  
##  Median : 70.06   Median : 22.00   Median :0.00000   Median : 7.000  
##  Mean   : 66.76   Mean   : 58.33   Mean   :0.06969   Mean   : 6.737  
##  3rd Qu.: 80.67   3rd Qu.: 69.00   3rd Qu.:0.00000   3rd Qu.: 9.000  
##  Max.   :100.00   Max.   :403.00   Max.   :5.10000   Max.   :20.000  
##        pm        
##  Min.   : 11.00  
##  1st Qu.: 33.75  
##  Median : 77.00  
##  Mean   : 84.59  
##  3rd Qu.:119.00  
##  Max.   :364.00

2. Data Exploration

Judging by the plot, this series is clearly not mean stationary, and the variance is not stable. We can take the log to stablized variance and take temporal difference to de-trend the data.

Zoomed the plot at two particular days, we can also observe a pattern within each day: PM2.5 level has the first peak around 8-11 (possibly associated with morning commute), drops during noon and early afternoon, and peaks again across the evening (possibly associated with evening commute and the lack of sunlight).

3. De-trend Data

Based on previous observation, we first take the log of data to get a new series, and then take the first difference in model fitting: \[Z_n = log(Y_n) \] \[\Delta Z_n=\Delta log(Y_n)= log(Y_n)-log(Y_{n-1})\]

Now the variance is more stable and a mean stationary model is applicable.

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.6423000 -0.0920900  0.0000000 -0.0007016  0.0982100  0.8622000

The ACF of the first difference of this time series has a vague sinusoidal pattern. An AR model might be applicable. The spectrum estimation with AR method suggests an AR(2) model.

4. Model Fitting:

4.1 ARMA Model Selection

To select a suitable model, an AIC table is constructed (see supplementary material). Taking both AIC value and the simplicity of model into consideration, ARIMA(2,1,0) model is the best choice. Besides, the roots of AR polynomial are both outside unit circle. The fitted ARMA is causal, which is desired.

## 
## Call:
## arima(x = sh.pm, order = c(2, 1, 0))
## 
## Coefficients:
##          ar1     ar2
##       0.1242  0.0937
## s.e.  0.0366  0.0366
## 
## sigma^2 estimated as 0.02995:  log likelihood = 249.05,  aic = -492.11

Roots of polynomial of AR:

## [1]  2.670828-0i -3.997241+0i

4.2 Hypothesis Testing

Judging by the estimated values and standard errors of two AR parameters, they are both significantly different from zero. A formal Likelihood Ratio Test can be conducted:

\(H_0:\phi_1=\phi_2=0\)

\(H_a:\phi_1\ or\ \phi_2\neq0\)

We run the null model to get the log likelihood of it and then calculate the test statistics. Under null hypothesis, the test statistics follow a \(\chi_2^2\) distribution

## 
## Call:
## arima(x = sh.pm, order = c(0, 1, 0))
## 
## 
## sigma^2 estimated as 0.03079:  log likelihood = 238.76,  aic = -475.51

\(LRT=2[log L(\theta_a)-log L(\theta_0)]=2*(249.05-238.76)=20.58>\chi_{2,0.95}^2\)

The AR parameters are significant.

4.3 Residual Diagnostics

Now we examine the residuals from this ARIMA(2,1,0) model. The residuals of is symmetric with bell shape, but its tail is heavier than normal distribution. Judging by the plot from section 2, the variance is not constant across the series. A few time points have larger variance. The unconstant variance might be the reason behind the non-normality of residuals. The ACF plot suggests that residuals may be mildly correlated. Perhaps there are other unknown covariates driving the trend behind this time series.

4.4 Seasonality

To further improve the model, we can try to add seasonality factors. The natural time cycle is 24 hour/day. With this factor added, the log likelihood sightly improved. We can also use Likelihood Ratio Test to test the significance of this seasonal parameter.

## 
## Call:
## arima(x = sh.pm, order = c(2, 1, 0), seasonal = list(order = c(1, 0, 0), period = 24))
## 
## Coefficients:
##          ar1     ar2     sar1
##       0.1229  0.0920  -0.0245
## s.e.  0.0367  0.0367   0.0377
## 
## sigma^2 estimated as 0.02993:  log likelihood = 249.26,  aic = -490.53

\(H_0:\) Seasonal AR parameter \(\Phi_1=0\)

\(H_a:\) Seasonal AR parameter \(\Phi_1\neq0\)

\(LRT=2[log L(\theta_a)-log L(\theta_0)]=2*(249.26-249.05)<<\chi_{1,0.95}^2\)

Therefore the 24 hour seasonal parameter is not significant. To keep the simplicity of the model, we can exclude this parameter.

5. Further Exploration on Trend

To better explain the pattern of PM2.5 Concentration, we can leverage other environmental variables available in this dataset.

Four environmental variables: humidity, cumulated wind speed, precipitation, and temperature, are added to the ARIMA(2,1,0) respectively. Likelihood Ratio Test are also conducted.

According to the output, the only significant covariate is cumulated wind speed.

##                         logL   LRT p_val
## Humidity             249.124 0.141 0.707
## Cumulated Wind Speed 253.395 8.683 0.003
## Precipitation        249.722 1.337 0.248
## Temperature          249.146 0.185 0.667

## 
## Call:
## arima(x = sh.pm, order = c(2, 1, 0), xreg = sh15jan$Iws)
## 
## Coefficients:
##          ar1     ar2  sh15jan$Iws
##       0.1324  0.1095        8e-04
## s.e.  0.0366  0.0369        3e-04
## 
## sigma^2 estimated as 0.0296:  log likelihood = 253.4,  aic = -498.79

6. Conclusion

In conclusion, we decide to use an ARIMA(2,1,0) model to describe this time series. The fitted model is:

\[(1-0.1324B-0.1095B^2)[(1-B)log(Y_n)-0.0008W_n]=\epsilon_n\]

\(W_n\) is the cumulated wind speed at \(t_n\) (m/s). \(Y_n\) is the PM2.5 concentration at \(t_n\) (ug/m^3). \(\epsilon_n\) is gaussian white noise \(\epsilon_n \overset{iid} \sim N[0,\sigma^2]\), where \(\sigma^2=0.0296\)

Daily seasonality is not significant in this model, observed daily cycle is de-trended through first difference.
Among all available environmental variables, cumulated wind speed is the only significant covariate to predict the trend of PM2.5 concentration.

7. Future Analysis

Variance stablization: As mentioned in section 3.3, the residuals are not fully aligned with normal assumption. Advanced variance stablization methods could be applied on time series before model fitting.
Model generalization: The model is constructed with data of only one month. We can test its fitness on data from other months, or calibrate the parameter with longer series. Some long-term seasonality can be discovered if the analysis includes data across years.
Comparison among cities: We can apply similar analysis procedure on PM2.5 concentration data from other cities, and compare the pattern among different cities.

8. Supplementary Material

1. Model selection by AIC

## Loading required package: knitr

	MA0	MA1	MA2	MA3	MA4	MA5
AR0	-475.51	-485.29	-490.91	-490.75	-488.76	-487.15
AR1	-487.60	-491.07	-490.57	-488.71	-486.94	-485.58
AR2	-492.11	-490.40	-489.05	-487.12	-485.12	-496.86
AR3	-490.55	-489.19	-487.25	-491.43	-495.71	-491.35
AR4	-488.91	-487.24	-497.86	-494.68	-503.06	-486.25

2. Data Source

The data is from UCI Machine Learning Repository. Detail variable description can be found in the repository.
The original dataset contains hourly PM2.5 concentration data from 5 cities in China from year 2010 to 2015. Temperature, humidity and other environmental variables are also available in this dataset.

3. Missing data

PM2.5 concentration is measured by different detectors in the city. The data from those detectors are only slightly different at each time point. I primarily use the data from “US.Post” detector. If it is missing, I use the data from “Xuhui” detector to replace the NAs. After this process there are only 3 missing datas, I use the average of nearby time points to replace them.
For other variables (wind speed, precipitation and etc.), missing data is rare. I also use the average of nearby time points to replace NAs.

9. Reference

[1] https://archive.ics.uci.edu/ml/datasets/PM2.5+Data+of+Five+Chinese+Cities
[2] Liang, X., S. Li, S. Zhang, H. Huang, and S. X. Chen (2016), PM2.5 data reliability, consistency, and air quality assessment in five Chinese cities, J. Geophys. Res. Atmos., 121, 10220-10236
[3] Lecture notes and source code of Stats 531: https://ionides.github.io/531w18/
[4] Guerrero, Victor M. and Perera, Rafael (2004) “Variance Stabilizing Power Transformation for Time Series,” Journal of Modern Applied Statistical Methods: Vol.3: Iss. 2, Article 9