1 Introduction

1.1 Background

Nitrogen dioxide (\(NO_2\)) is a poisonous gas, contributing to the chemical reactions that make ozone. Increasing atmospheric \(NO_2\) concentrations can threat the ecological environment and adversely impact human. The \(NO_2\) concentrations is one of the most important indicators of air pollution. In this project, we are going to do a time series analysis of \(NO_2\) concentration in Dongcheng District, Beijing. All data reflecting the air quality in Dongcheng District are retrieved from Tiantan air-quality monitoring site.

1.2 Goals

We have three main goals in this project. The first one is to find a relatively better ARIMA model fitting the hourly \(NO_2 concentration\) data. The second one is to figure out whether some environmental factors can affect \(NO_2\) concentration. The last one is to check whether there are any seasonalities.

1.3 Data Description

The raw data set comes from UCI Machine Learning Repository. It actually includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. The time ranges from March 1st, 2013 to February 28th, 2017. This project aims at focusing on the data from Feburary 1st, 2015 to Feburary 28th, 2015, which has 672 observations (24 hours \(\times\) 28 days). The variables we are concerned with are hourly \(NO_2\) concentration \((ug/m^3)\), temperature (degree Celsius), precipitation \((mm)\) and wind speed \((m/s)\). Below shows the summary of the data briefly.

##       NO2          Temperature     Precipitation        Wind Speed   
##  Min.   :  2.00   Min.   :-7.400   Min.   :0.000000   Min.   :0.000  
##  1st Qu.: 25.00   1st Qu.:-1.100   1st Qu.:0.000000   1st Qu.:1.100  
##  Median : 49.00   Median : 2.000   Median :0.000000   Median :1.700  
##  Mean   : 55.32   Mean   : 2.662   Mean   :0.006101   Mean   :1.954  
##  3rd Qu.: 80.00   3rd Qu.: 6.000   3rd Qu.:0.000000   3rd Qu.:2.700  
##  Max.   :148.00   Max.   :15.900   Max.   :0.800000   Max.   :6.800

2 Exploratory Data Analysis

In order to make the data less skewed, we take log transformation first. Then, we draw the spectrum plot and ACF of plot for the transfromed data.

From the spectrum plot, we find this is no obvious dominant frequencies but some small peaks marked by the dotted red line. They correspond to frequencies of 0.081, 0.17 and 0.28 cycles per hour, or about 12.27, 5.92 and 3.61 hours per cycle. There are also roughly two bumps at lags of 12 and 24 from the ACF plot. We’ll check its seasonality later by fitting data in a seasonal model.

3 Time Series Analysis

3.1 ARMA Model

Let’s get started with ARMA model. What we do first is to decide where to start in terms of values of p and q. Thus, we tabulate AIC tables below.

MA0 MA1 MA2 MA3 MA4 MA5
AR0 1427.6984 986.9075 795.7126 703.6688 609.9149 585.3122
AR1 524.6264 503.1098 505.0136 504.8699 506.8264 503.3316
AR2 504.1689 505.0486 506.6640 506.8610 507.2728 504.9745
AR3 504.4966 506.2709 503.9863 505.1825 508.1939 506.1603
AR4 505.6968 507.3322 502.7489 506.3242 508.3087 509.8437
AR5 505.7764 506.5776 502.7820 508.3097 508.4650 509.9879

From the AIC table, we prefer to choose ARMA(1,1) with the lowest AIC value 503.1098. Besides, ARMA(1,1) is a relative simple model among these optional models. We refit to the data in ARMA(1,1) and obtain the value of each parameter.

##       ar1       ma1 intercept 
##  0.914748 -0.215361  3.787732

To evalute the goodness of fit of this model, I can consider the sample ACF plot and QQ plot of residuals. When looking at the sample ACF plot below, there are slight oscillations at specific lags, like multiples of 3, that indicate the dependence in errors is not being modeled sufficiently well. From the QQ-plot, heavier tails of residuals than normal distribution occur on both sides.

3.2 Linear Regression with ARMA Errors

In order to explain the pattern of \(NO_2\) concentration more precisely, we hope to include some covariates in our model. Currently, we have three different environmental covariates, temperature(TEMP), precipitation(RAIN) and wind speed (WSPM). The model and AIC table are shown below. \[\phi(B)(Y_n - \mu - \beta_1 \times TEMP - \beta_2 \times RAIN - \beta_3 \times WSPM) = \psi(B)\epsilon_n\] Where \[\phi(x) = 1-\phi_1 x - \cdots -\phi_p x^p\] \[\psi(x) = 1+\psi_1 x + \cdots +\psi_q x^q\] \(B\) is the backshift operator, \(\{\epsilon_n\}\) is an independent Gaussian white noise process.

MA0 MA1 MA2 MA3 MA4
AR0 1056.4571 790.1174 696.5757 641.2574 571.4230
AR1 508.2045 481.5911 482.5438 484.5004 485.2630
AR2 485.9811 482.4566 481.1257 483.1229 487.9385
AR3 482.2958 484.2805 485.4838 481.2219 483.2649
AR4 484.2727 485.3666 482.3529 480.6695 480.2665
AR5 485.7113 486.4210 483.4431 480.1730 482.0892

ARMA(5,4) has the smallest AIC but it is way more complex. We know that a model with slightly larger AIC is acceptable if it is simpler. ARMA(1,1) is the one that satisfies the conditions. Refit to data into linear regression with error model and get values of these parameters. Looking at the values, we find that two ARMA parameters are somewhat equivalent to those of previous model indicating that the enviromental factors can only affect the global trend for \(NO_2\) concentration. In addition, the parameters of three factors are all negative, which means the smaller precipitation and wind speed and the lower temperature lead to higher \(NO_2\) concentration.

##         ar1         ma1   intercept        TEMP        RAIN        WSPM 
##  0.91707881 -0.25906024  3.98114129 -0.03090788 -0.60875394 -0.05300977

From the ACF plot of residuals, the oscillatory component for 3-lagged variation might suggest there should be a seasonal trend. Likelihood Ratio Test(LRT) is also carried out as follows: \[H_0: \beta_1 = \beta_2 = \beta_3 = 0\] \[H_a:\mbox{not all } \beta 's \mbox{ equal zero}\] .

The LRT statistic is calculated as \[2[logL(\theta_a)−logL(\theta_0)]=2(-232.59−(-247.55))=29.93 > \chi^2_{3, 0.95}\] We will reject the null hypothesis because LRT Statistic is larger than the \(\chi^2_{3, 0.95}\) cutoff, which means the enviromental factors are somewhat inevitable in the model.

3.3 Linear Regression with SARIMA Errors

Based on the spectrum plot before and the acf plot of the residuals above, we find the period of 3 hours is reasonble. Other period candidates are roughly the harmonic of period of 3 hours, which means it is the base of periods of 6, 12, and 24 hours. For simplicity, we fix ARIMA part as ARIMA(1,0,1) first and then study the seasonality part.

\[\Phi(B^3)\phi(B)(Y_n - \mu - \beta_1 \times TEMP - \beta_2 \times RAIN - \beta_3 \times WSPM) = \Psi(B^3)\psi(B)\epsilon_n\] Where \[\phi(x) = 1-\phi_1 x\] \[\psi(x) = 1+\psi_1 x\] \[\Phi(x) = 1-\Phi_1 x - \cdots -\Phi_p x^p\] \[\Psi(x) = 1+\Psi_1 x + \cdots +\Psi_q x^q\] \(B\) is the backshift operator, \(\{\epsilon_n\}\) is an independent Gaussian white noise process.

After then, we create AIC table with respect to seasonality part. we find model SARIMA\((1,0,1)\times(2,0,2)_3\) with the lowest AIC.

SMA0 SMA1 SMA2 SMA3 SMA4
SAR0 481.5911 483.1921 484.5738 485.1345 484.6664
SAR1 483.2210 485.0843 486.2798 485.4085 483.9058
SAR2 484.3961 486.0549 479.4858 487.3742 485.9078
SAR3 484.9277 480.9422 482.3747 483.7473 481.7220
SAR4 484.7496 484.7003 486.5044 481.5281 483.4694

From the sample ACF plot below, we notice that autocorrelations are much smaller than those in above two ACF plots but there is still a non-negligible autocorrelation at a lag of about 11 and heavy tailes still exist.

The hypothesis is shown as follows: \[H_0: \Phi_1 = \Phi_2 = \Psi_1 = \Psi_2 = 0\] \[H_a: \mbox{They are not all zero}.\]

The LRT statistic is calculated as \[2[logL(\theta_a)−logL(\theta_0)]=2(-228.17−(-233.80))=11.23 > \chi^2_{4, 0.95}\] Since the statistic is greater than \(\chi^2_{4, 0.95}\), we conclude the null hypothese should be rejected at 0.05 level.

4 Conclusion

In this project, we take a look at the hourly \(NO_2\) concentration in Feburary monitored in Tiantan air-quality monitoring site. After a sequence of model selections and evaluations, we finally think the linear regresson with SARIMA\((1,0,1)\times(2,0,2)_3\) error model for this dataset is more reasonable to use due to having all the variables being significant. The error part has seasonal tread and the global trend can be explained by three environmental factors, which are all negative-correlated with \(NO_2\) concentrartion. The details of the final model are shown as follows

\[\Phi(B^3)\phi(B)(Y_n - 3.98 + 0.03 \times TEMP + 0.58 \times RAIN + 0.06 \times WSPM) = \Psi(B^3)\psi(B)\epsilon_n\] Where \[\phi(x) = 1-0.93 x\] \[\psi(x) = 1-0.29 x\] \[\Phi(x) = 1+1.21 x + 0.56 x^2\] \[\Psi(x) = 1+1.11 x + 0.46 x^2\] \({\epsilon_n} \sim N(0, 0.1152)\).

However, residuals are more heavy-tailed than normal distribution on both sides. In addition, the period of 3 hours is harder to interpret within one day. If we enlarge the dataset and extend the time range, the period of 12 hours (half of a day) or 24 hours (one day) might be the more reasonable one, maybe the non-negligible autocorrelation at a lag of about 11 is also can be removed.

5 Reference

[1] Data resource from https://archive.ics.uci.edu/ml/machine-learning-databases/00501/

[2] Previous projects from https://ionides.github.io/531w18/midterm_project/

[3] Course notes https://ionides.github.io/531w20/

[4] Zhang, S., Guo, B., Dong, A., He, J., Xu, Z. and Chen, S.X. (2017) Cautionary Tales on Air-Quality Improvement in Beijing. Proceedings of the Royal Society A, Volume 473, No. 2205, Pages 20170457.

[5] Nianliang Cheng, Yunting Li, el, Ground-Level NO2 in Urban Beijing: Trends, Distribution, and Effects of Emission Reduction Measures