Time Series Analysis of NO2 Concentration in Dongcheng District, Beijing

1 Introduction
2 Exploratory Data Analysis
3 Time Series Analysis
4 Conclusion
5 Reference

1 Introduction

1.1 Background

Nitrogen dioxide (\(NO_2\)) is a poisonous gas, contributing to the chemical reactions that make ozone. Increasing atmospheric \(NO_2\) concentrations can threat the ecological environment and adversely impact human. The \(NO_2\) concentrations is one of the most important indicators of air pollution. In this project, we are going to do a time series analysis of \(NO_2\) concentration in Dongcheng District, Beijing. All data reflecting the air quality in Dongcheng District are retrieved from Tiantan air-quality monitoring site.

1.2 Goals

We have three main goals in this project. The first one is to find a relatively better ARIMA model fitting the hourly \(NO_2 concentration\) data. The second one is to figure out whether some environmental factors can affect \(NO_2\) concentration. The last one is to check whether there are any seasonalities.

1.3 Data Description

The raw data set comes from UCI Machine Learning Repository. It actually includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. The time ranges from March 1st, 2013 to February 28th, 2017. This project aims at focusing on the data from Feburary 1st, 2015 to Feburary 28th, 2015, which has 672 observations (24 hours \(\times\) 28 days). The variables we are concerned with are hourly \(NO_2\) concentration \((ug/m^3)\), temperature (degree Celsius), precipitation \((mm)\) and wind speed \((m/s)\). Below shows the summary of the data briefly.

##       NO2          Temperature     Precipitation        Wind Speed   
##  Min.   :  2.00   Min.   :-7.400   Min.   :0.000000   Min.   :0.000  
##  1st Qu.: 25.00   1st Qu.:-1.100   1st Qu.:0.000000   1st Qu.:1.100  
##  Median : 49.00   Median : 2.000   Median :0.000000   Median :1.700  
##  Mean   : 55.32   Mean   : 2.662   Mean   :0.006101   Mean   :1.954  
##  3rd Qu.: 80.00   3rd Qu.: 6.000   3rd Qu.:0.000000   3rd Qu.:2.700  
##  Max.   :148.00   Max.   :15.900   Max.   :0.800000   Max.   :6.800

2 Exploratory Data Analysis

In order to make the data less skewed, we take log transformation first. Then, we draw the spectrum plot and ACF of plot for the transfromed data.

From the spectrum plot, we find this is no obvious dominant frequencies but some small peaks marked by the dotted red line. They correspond to frequencies of 0.081, 0.17 and 0.28 cycles per hour, or about 12.27, 5.92 and 3.61 hours per cycle. There are also roughly two bumps at lags of 12 and 24 from the ACF plot. We’ll check its seasonality later by fitting data in a seasonal model.

3 Time Series Analysis

3.1 ARMA Model

Let’s get started with ARMA model. What we do first is to decide where to start in terms of values of p and q. Thus, we tabulate AIC tables below.

	MA0	MA1	MA2	MA3	MA4	MA5
AR0	1427.6984	986.9075	795.7126	703.6688	609.9149	585.3122
AR1	524.6264	503.1098	505.0136	504.8699	506.8264	503.3316
AR2	504.1689	505.0486	506.6640	506.8610	507.2728	504.9745
AR3	504.4966	506.2709	503.9863	505.1825	508.1939	506.1603
AR4	505.6968	507.3322	502.7489	506.3242	508.3087	509.8437
AR5	505.7764	506.5776	502.7820	508.3097	508.4650	509.9879

From the AIC table, we prefer to choose ARMA(1,1) with the lowest AIC value 503.1098. Besides, ARMA(1,1) is a relative simple model among these optional models. We refit to the data in ARMA(1,1) and obtain the value of each parameter.

##       ar1       ma1 intercept 
##  0.914748 -0.215361  3.787732

To evalute the goodness of fit of this model, I can consider the sample ACF plot and QQ plot of residuals. When looking at the sample ACF plot below, there are slight oscillations at specific lags, like multiples of 3, that indicate the dependence in errors is not being modeled sufficiently well. From the QQ-plot, heavier tails of residuals than normal distribution occur on both sides.

3.2 Linear Regression with ARMA Errors

In order to explain the pattern of \(NO_2\) concentration more precisely, we hope to include some covariates in our model. Currently, we have three different environmental covariates, temperature(TEMP), precipitation(RAIN) and wind speed (WSPM). The model and AIC table are shown below. \[\phi(B)(Y_n - \mu - \beta_1 \times TEMP - \beta_2 \times RAIN - \beta_3 \times WSPM) = \psi(B)\epsilon_n\] Where \[\phi(x) = 1-\phi_1 x - \cdots -\phi_p x^p\] \[\psi(x) = 1+\psi_1 x + \cdots +\psi_q x^q\] \(B\) is the backshift operator, \(\{\epsilon_n\}\) is an independent Gaussian white noise process.

	MA0	MA1	MA2	MA3	MA4
AR0	1056.4571	790.1174	696.5757	641.2574	571.4230
AR1	508.2045	481.5911	482.5438	484.5004	485.2630
AR2	485.9811	482.4566	481.1257	483.1229	487.9385
AR3	482.2958	484.2805	485.4838	481.2219	483.2649
AR4	484.2727	485.3666	482.3529	480.6695	480.2665
AR5	485.7113	486.4210	483.4431	480.1730	482.0892

ARMA(5,4) has the smallest AIC but it is way more complex. We know that a model with slightly larger AIC is acceptable if it is simpler. ARMA(1,1) is the one that satisfies the conditions. Refit to data into linear regression with error model and get values of these parameters. Looking at the values, we find that two ARMA parameters are somewhat equivalent to those of previous model indicating that the enviromental factors can only affect the global trend for \(NO_2\) concentration. In addition, the parameters of three factors are all negative, which means the smaller precipitation and wind speed and the lower temperature lead to higher \(NO_2\) concentration.

##         ar1         ma1   intercept        TEMP        RAIN        WSPM 
##  0.91707881 -0.25906024  3.98114129 -0.03090788 -0.60875394 -0.05300977

From the ACF plot of residuals, the oscillatory component for 3-lagged variation might suggest there should be a seasonal trend. Likelihood Ratio Test(LRT) is also carried out as follows: \[H_0: \beta_1 = \beta_2 = \beta_3 = 0\] \[H_a:\mbox{not all } \beta 's \mbox{ equal zero}\] .

The LRT statistic is calculated as \[2[logL(\theta_a)−logL(\theta_0)]=2(-232.59−(-247.55))=29.93 > \chi^2_{3, 0.95}\] We will reject the null hypothesis because LRT Statistic is larger than the \(\chi^2_{3, 0.95}\) cutoff, which means the enviromental factors are somewhat inevitable in the model.

3.3 Linear Regression with SARIMA Errors

Based on the spectrum plot before and the acf plot of the residuals above, we find the period of 3 hours is reasonble. Other period candidates are roughly the harmonic of period of 3 hours, which means it is the base of periods of 6, 12, and 24 hours. For simplicity, we fix ARIMA part as ARIMA(1,0,1) first and then study the seasonality part.

\[\Phi(B^3)\phi(B)(Y_n - \mu - \beta_1 \times TEMP - \beta_2 \times RAIN - \beta_3 \times WSPM) = \Psi(B^3)\psi(B)\epsilon_n\] Where \[\phi(x) = 1-\phi_1 x\] \[\psi(x) = 1+\psi_1 x\] \[\Phi(x) = 1-\Phi_1 x - \cdots -\Phi_p x^p\] \[\Psi(x) = 1+\Psi_1 x + \cdots +\Psi_q x^q\] \(B\) is the backshift operator, \(\{\epsilon_n\}\) is an independent Gaussian white noise process.

After then, we create AIC table with respect to seasonality part. we find model SARIMA\((1,0,1)\times(2,0,2)_3\) with the lowest AIC.

	SMA0	SMA1	SMA2	SMA3	SMA4
SAR0	481.5911	483.1921	484.5738	485.1345	484.6664
SAR1	483.2210	485.0843	486.2798	485.4085	483.9058
SAR2	484.3961	486.0549	479.4858	487.3742	485.9078
SAR3	484.9277	480.9422	482.3747	483.7473	481.7220
SAR4	484.7496	484.7003	486.5044	481.5281	483.4694

From the sample ACF plot below, we notice that autocorrelations are much smaller than those in above two ACF plots but there is still a non-negligible autocorrelation at a lag of about 11 and heavy tailes still exist.

The hypothesis is shown as follows: \[H_0: \Phi_1 = \Phi_2 = \Psi_1 = \Psi_2 = 0\] \[H_a: \mbox{They are not all zero}.\]

The LRT statistic is calculated as \[2[logL(\theta_a)−logL(\theta_0)]=2(-228.17−(-233.80))=11.23 > \chi^2_{4, 0.95}\] Since the statistic is greater than \(\chi^2_{4, 0.95}\), we conclude the null hypothese should be rejected at 0.05 level.

4 Conclusion

In this project, we take a look at the hourly \(NO_2\) concentration in Feburary monitored in Tiantan air-quality monitoring site. After a sequence of model selections and evaluations, we finally think the linear regresson with SARIMA\((1,0,1)\times(2,0,2)_3\) error model for this dataset is more reasonable to use due to having all the variables being significant. The error part has seasonal tread and the global trend can be explained by three environmental factors, which are all negative-correlated with \(NO_2\) concentrartion. The details of the final model are shown as follows

\[\Phi(B^3)\phi(B)(Y_n - 3.98 + 0.03 \times TEMP + 0.58 \times RAIN + 0.06 \times WSPM) = \Psi(B^3)\psi(B)\epsilon_n\] Where \[\phi(x) = 1-0.93 x\] \[\psi(x) = 1-0.29 x\] \[\Phi(x) = 1+1.21 x + 0.56 x^2\] \[\Psi(x) = 1+1.11 x + 0.46 x^2\] \({\epsilon_n} \sim N(0, 0.1152)\).

However, residuals are more heavy-tailed than normal distribution on both sides. In addition, the period of 3 hours is harder to interpret within one day. If we enlarge the dataset and extend the time range, the period of 12 hours (half of a day) or 24 hours (one day) might be the more reasonable one, maybe the non-negligible autocorrelation at a lag of about 11 is also can be removed.

5 Reference

[1] Data resource from https://archive.ics.uci.edu/ml/machine-learning-databases/00501/

[2] Previous projects from https://ionides.github.io/531w18/midterm_project/

[3] Course notes https://ionides.github.io/531w20/

[4] Zhang, S., Guo, B., Dong, A., He, J., Xu, Z. and Chen, S.X. (2017) Cautionary Tales on Air-Quality Improvement in Beijing. Proceedings of the Royal Society A, Volume 473, No. 2205, Pages 20170457.

[5] Nianliang Cheng, Yunting Li, el, Ground-Level NO2 in Urban Beijing: Trends, Distribution, and Effects of Emission Reduction Measures