Nitrogen dioxide (\(NO_2\)) is a poisonous gas, contributing to the chemical reactions that make ozone. Increasing atmospheric \(NO_2\) concentrations can threat the ecological environment and adversely impact human. The \(NO_2\) concentrations is one of the most important indicators of air pollution. In this project, we are going to do a time series analysis of \(NO_2\) concentration in Dongcheng District, Beijing. All data reflecting the air quality in Dongcheng District are retrieved from Tiantan air-quality monitoring site.
We have three main goals in this project. The first one is to find a relatively better ARIMA model fitting the hourly \(NO_2 concentration\) data. The second one is to figure out whether some environmental factors can affect \(NO_2\) concentration. The last one is to check whether there are any seasonalities.
The raw data set comes from UCI Machine Learning Repository. It actually includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. The time ranges from March 1st, 2013 to February 28th, 2017. This project aims at focusing on the data from Feburary 1st, 2015 to Feburary 28th, 2015, which has 672 observations (24 hours \(\times\) 28 days). The variables we are concerned with are hourly \(NO_2\) concentration \((ug/m^3)\), temperature (degree Celsius), precipitation \((mm)\) and wind speed \((m/s)\). Below shows the summary of the data briefly.
## NO2 Temperature Precipitation Wind Speed
## Min. : 2.00 Min. :-7.400 Min. :0.000000 Min. :0.000
## 1st Qu.: 25.00 1st Qu.:-1.100 1st Qu.:0.000000 1st Qu.:1.100
## Median : 49.00 Median : 2.000 Median :0.000000 Median :1.700
## Mean : 55.32 Mean : 2.662 Mean :0.006101 Mean :1.954
## 3rd Qu.: 80.00 3rd Qu.: 6.000 3rd Qu.:0.000000 3rd Qu.:2.700
## Max. :148.00 Max. :15.900 Max. :0.800000 Max. :6.800
In order to make the data less skewed, we take log transformation first. Then, we draw the spectrum plot and ACF of plot for the transfromed data.
From the spectrum plot, we find this is no obvious dominant frequencies but some small peaks marked by the dotted red line. They correspond to frequencies of 0.081, 0.17 and 0.28 cycles per hour, or about 12.27, 5.92 and 3.61 hours per cycle. There are also roughly two bumps at lags of 12 and 24 from the ACF plot. We’ll check its seasonality later by fitting data in a seasonal model.
Let’s get started with ARMA model. What we do first is to decide where to start in terms of values of p and q. Thus, we tabulate AIC tables below.
MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | |
---|---|---|---|---|---|---|
AR0 | 1427.6984 | 986.9075 | 795.7126 | 703.6688 | 609.9149 | 585.3122 |
AR1 | 524.6264 | 503.1098 | 505.0136 | 504.8699 | 506.8264 | 503.3316 |
AR2 | 504.1689 | 505.0486 | 506.6640 | 506.8610 | 507.2728 | 504.9745 |
AR3 | 504.4966 | 506.2709 | 503.9863 | 505.1825 | 508.1939 | 506.1603 |
AR4 | 505.6968 | 507.3322 | 502.7489 | 506.3242 | 508.3087 | 509.8437 |
AR5 | 505.7764 | 506.5776 | 502.7820 | 508.3097 | 508.4650 | 509.9879 |
From the AIC table, we prefer to choose ARMA(1,1) with the lowest AIC value 503.1098. Besides, ARMA(1,1) is a relative simple model among these optional models. We refit to the data in ARMA(1,1) and obtain the value of each parameter.
## ar1 ma1 intercept
## 0.914748 -0.215361 3.787732
To evalute the goodness of fit of this model, I can consider the sample ACF plot and QQ plot of residuals. When looking at the sample ACF plot below, there are slight oscillations at specific lags, like multiples of 3, that indicate the dependence in errors is not being modeled sufficiently well. From the QQ-plot, heavier tails of residuals than normal distribution occur on both sides.
In order to explain the pattern of \(NO_2\) concentration more precisely, we hope to include some covariates in our model. Currently, we have three different environmental covariates, temperature(TEMP), precipitation(RAIN) and wind speed (WSPM). The model and AIC table are shown below. \[\phi(B)(Y_n - \mu - \beta_1 \times TEMP - \beta_2 \times RAIN - \beta_3 \times WSPM) = \psi(B)\epsilon_n\] Where \[\phi(x) = 1-\phi_1 x - \cdots -\phi_p x^p\] \[\psi(x) = 1+\psi_1 x + \cdots +\psi_q x^q\] \(B\) is the backshift operator, \(\{\epsilon_n\}\) is an independent Gaussian white noise process.
MA0 | MA1 | MA2 | MA3 | MA4 | |
---|---|---|---|---|---|
AR0 | 1056.4571 | 790.1174 | 696.5757 | 641.2574 | 571.4230 |
AR1 | 508.2045 | 481.5911 | 482.5438 | 484.5004 | 485.2630 |
AR2 | 485.9811 | 482.4566 | 481.1257 | 483.1229 | 487.9385 |
AR3 | 482.2958 | 484.2805 | 485.4838 | 481.2219 | 483.2649 |
AR4 | 484.2727 | 485.3666 | 482.3529 | 480.6695 | 480.2665 |
AR5 | 485.7113 | 486.4210 | 483.4431 | 480.1730 | 482.0892 |
ARMA(5,4) has the smallest AIC but it is way more complex. We know that a model with slightly larger AIC is acceptable if it is simpler. ARMA(1,1) is the one that satisfies the conditions. Refit to data into linear regression with error model and get values of these parameters. Looking at the values, we find that two ARMA parameters are somewhat equivalent to those of previous model indicating that the enviromental factors can only affect the global trend for \(NO_2\) concentration. In addition, the parameters of three factors are all negative, which means the smaller precipitation and wind speed and the lower temperature lead to higher \(NO_2\) concentration.
## ar1 ma1 intercept TEMP RAIN WSPM
## 0.91707881 -0.25906024 3.98114129 -0.03090788 -0.60875394 -0.05300977
From the ACF plot of residuals, the oscillatory component for 3-lagged variation might suggest there should be a seasonal trend. Likelihood Ratio Test(LRT) is also carried out as follows: \[H_0: \beta_1 = \beta_2 = \beta_3 = 0\] \[H_a:\mbox{not all } \beta 's \mbox{ equal zero}\] .
The LRT statistic is calculated as \[2[logL(\theta_a)−logL(\theta_0)]=2(-232.59−(-247.55))=29.93 > \chi^2_{3, 0.95}\] We will reject the null hypothesis because LRT Statistic is larger than the \(\chi^2_{3, 0.95}\) cutoff, which means the enviromental factors are somewhat inevitable in the model.
Based on the spectrum plot before and the acf plot of the residuals above, we find the period of 3 hours is reasonble. Other period candidates are roughly the harmonic of period of 3 hours, which means it is the base of periods of 6, 12, and 24 hours. For simplicity, we fix ARIMA part as ARIMA(1,0,1) first and then study the seasonality part.
\[\Phi(B^3)\phi(B)(Y_n - \mu - \beta_1 \times TEMP - \beta_2 \times RAIN - \beta_3 \times WSPM) = \Psi(B^3)\psi(B)\epsilon_n\] Where \[\phi(x) = 1-\phi_1 x\] \[\psi(x) = 1+\psi_1 x\] \[\Phi(x) = 1-\Phi_1 x - \cdots -\Phi_p x^p\] \[\Psi(x) = 1+\Psi_1 x + \cdots +\Psi_q x^q\] \(B\) is the backshift operator, \(\{\epsilon_n\}\) is an independent Gaussian white noise process.
After then, we create AIC table with respect to seasonality part. we find model SARIMA\((1,0,1)\times(2,0,2)_3\) with the lowest AIC.
SMA0 | SMA1 | SMA2 | SMA3 | SMA4 | |
---|---|---|---|---|---|
SAR0 | 481.5911 | 483.1921 | 484.5738 | 485.1345 | 484.6664 |
SAR1 | 483.2210 | 485.0843 | 486.2798 | 485.4085 | 483.9058 |
SAR2 | 484.3961 | 486.0549 | 479.4858 | 487.3742 | 485.9078 |
SAR3 | 484.9277 | 480.9422 | 482.3747 | 483.7473 | 481.7220 |
SAR4 | 484.7496 | 484.7003 | 486.5044 | 481.5281 | 483.4694 |
From the sample ACF plot below, we notice that autocorrelations are much smaller than those in above two ACF plots but there is still a non-negligible autocorrelation at a lag of about 11 and heavy tailes still exist.
The hypothesis is shown as follows: \[H_0: \Phi_1 = \Phi_2 = \Psi_1 = \Psi_2 = 0\] \[H_a: \mbox{They are not all zero}.\]
The LRT statistic is calculated as \[2[logL(\theta_a)−logL(\theta_0)]=2(-228.17−(-233.80))=11.23 > \chi^2_{4, 0.95}\] Since the statistic is greater than \(\chi^2_{4, 0.95}\), we conclude the null hypothese should be rejected at 0.05 level.
In this project, we take a look at the hourly \(NO_2\) concentration in Feburary monitored in Tiantan air-quality monitoring site. After a sequence of model selections and evaluations, we finally think the linear regresson with SARIMA\((1,0,1)\times(2,0,2)_3\) error model for this dataset is more reasonable to use due to having all the variables being significant. The error part has seasonal tread and the global trend can be explained by three environmental factors, which are all negative-correlated with \(NO_2\) concentrartion. The details of the final model are shown as follows
\[\Phi(B^3)\phi(B)(Y_n - 3.98 + 0.03 \times TEMP + 0.58 \times RAIN + 0.06 \times WSPM) = \Psi(B^3)\psi(B)\epsilon_n\] Where \[\phi(x) = 1-0.93 x\] \[\psi(x) = 1-0.29 x\] \[\Phi(x) = 1+1.21 x + 0.56 x^2\] \[\Psi(x) = 1+1.11 x + 0.46 x^2\] \({\epsilon_n} \sim N(0, 0.1152)\).
However, residuals are more heavy-tailed than normal distribution on both sides. In addition, the period of 3 hours is harder to interpret within one day. If we enlarge the dataset and extend the time range, the period of 12 hours (half of a day) or 24 hours (one day) might be the more reasonable one, maybe the non-negligible autocorrelation at a lag of about 11 is also can be removed.
[1] Data resource from https://archive.ics.uci.edu/ml/machine-learning-databases/00501/
[2] Previous projects from https://ionides.github.io/531w18/midterm_project/
[3] Course notes https://ionides.github.io/531w20/
[4] Zhang, S., Guo, B., Dong, A., He, J., Xu, Z. and Chen, S.X. (2017) Cautionary Tales on Air-Quality Improvement in Beijing. Proceedings of the Royal Society A, Volume 473, No. 2205, Pages 20170457.
[5] Nianliang Cheng, Yunting Li, el, Ground-Level NO2 in Urban Beijing: Trends, Distribution, and Effects of Emission Reduction Measures