1 Introduction

PM10 refers to particulate matter (PM) with a particle size of less than 10 microns. PM10 can penetrate deep into the lungs and cause a broad range of health effects, in particular respiratory and cardiovascular illnesses.[1] In this analysis, we aim to identify any seasonality or trend in the PM10 concentration (ug/m3) distribution using the Beijing Multi-Site Air-Quality Dataset[2]. Finding potential patterns in PM10 concentration would enable air pollution forecasting, so that people could take appropriate protective measures to prevent harmful impacts.

The entire dataset includes 12 sub-datasets of hourly air pollution records from 12 nationally-controlled air-quality monitoring stations. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data for each air-quality station are matched with the nearest weather station of the China Meteorological Administration. Each sub-dataset has 18 columns and spans March 1st, 2013 to February 28th, 2017. The variables include a record index, year, month, day, hour, PM2.5, PM10, SO2, NO2, CO, O3, Temperature, Pressure, Dew Point Temperature, Rain, Wind Direction, Wind Speed, and Station.

In this analysis, we focus on one of the twelve datasets, which contains the air pollution data of Wanliu station. There are 35,064 records in total. However, we only analyze the data from October 6th to October 26th, 2014, which corresponds to a three-week period. During this time period, there are \(24\ hours\ \times 7\ days\ \times 3\ weeks\ = 504\) hourly observations in total.

2 Data Preprocessing: Dealing with Missing Values

Before diving into the time series analysis, we preprocess the data to make sure it is clean. We find 69 missing values in the “PM10” column of the 2014 data. We impute each missing value with the closest non-NA value in the same column. For example, if the PM10 value at 9am on October 2nd is missing, we impute it with the PM10 value at 8am on October 2nd, if available.
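As a minimal sketch of this imputation scheme (a Python/pandas illustration on made-up toy values; the report's actual preprocessing is done in R):

```python
import numpy as np
import pandas as pd

# Toy hourly PM10 column with gaps (illustrative values, not the real data)
pm10 = pd.Series([120.0, np.nan, 98.0, np.nan, np.nan, 75.0])

# Carry the closest earlier non-NA value forward (as in the 9am <- 8am example);
# bfill() handles a gap at the very start of the series.
imputed = pm10.ffill().bfill()

print(imputed.tolist())  # [120.0, 120.0, 98.0, 98.0, 98.0, 75.0]
```

`ffill()` matches the report's "previous hour if available" rule; the trailing `bfill()` only matters if the series starts with a missing value.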

3 Exploratory Data Analysis

In order to investigate the time series data, we will start with the basic time plot.

3.1 Logarithmic Transformation

The variation of the data shown in the plot above increases and decreases with the level of the series, so a logarithmic transformation should be useful [9]. As the plot below shows, there seems to be a weekly seasonality (S = \(24\ hours\ \times 7\ days\ = 168\ lags\)) and a within-week trend in this 3-week hourly PM10 concentration data. PM10 tends to start low at the beginning of a week, then slowly increases and fluctuates around a peak during the week, and finally drops sharply at the end of the week. The series appears non-stationary.

Looking at the first difference of the log-transformed PM10 data, we notice that the series has a roughly constant zero mean but a changing variance. This series therefore also appears non-stationary.
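To illustrate why this combination helps, here is a small Python/NumPy sketch on a simulated series (not the report's data) showing that taking logs stabilises level-dependent variance and that differencing then yields a roughly mean-zero series:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 504  # same length as the 3-week hourly window
# Toy positive series whose fluctuations grow with its level
y = np.exp(0.005 * np.arange(n) + rng.normal(0.0, 0.2, n))

log_y = np.log(y)            # log transform stabilises the variance
diff_log_y = np.diff(log_y)  # first difference removes the slow trend

# The differenced log series fluctuates around zero
print(round(float(diff_log_y.mean()), 4))
```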

3.2 Trend and Seasonality

3.2.1 ACF Plot

The following plot is the ACF plot of the log-transformed PM10 concentration data from Wanliu station. The autocorrelations at small lags are large and positive and then decrease as the lag increases, indicating a potential trend in the data. At larger lags, we can also see a “scalloped” shape, which is potentially due to seasonality.
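The slow ACF decay described above can be reproduced on a toy trended series (a pure-NumPy sketch; the series and numbers are illustrative, not the report's data):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
n = 504  # same length as the 3-week hourly window
trended = 0.01 * np.arange(n) + rng.normal(0.0, 0.5, n)  # toy series with a trend

rho = acf(trended, 50)
# A trend makes small-lag autocorrelations large, positive, and slowly decaying
print(round(float(rho[1]), 3), round(float(rho[10]), 3))
```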

3.2.2 Spectrum Density Plots

## [1] 10.66667

In the spectral density plot, the dominant frequency is around 0.004 cycles per hour, which corresponds to a dominant period of roughly \(1/0.0039 \approx 256\) hours, i.e. about 10.7 days (the value printed above, in days). Although there are many other local peaks, the one near 0.004 is much higher. This result agrees with what the ACF plot indicates.
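The idea of reading a dominant period off the periodogram can be sketched as follows (Python/SciPy on a simulated series with a known 168-hour cycle; illustrative only, not the report's computation):

```python
import numpy as np
from scipy.signal import periodogram

rng = np.random.default_rng(1)
n = 504                      # hourly observations over 3 weeks
period = 168                 # hypothetical weekly cycle in hours
t = np.arange(n)
y = np.sin(2 * np.pi * t / period) + rng.normal(0.0, 0.3, n)

# Sampling frequency of 1 per hour -> frequencies in cycles per hour
freqs, power = periodogram(y, fs=1.0)
dominant = freqs[np.argmax(power)]
print(round(1 / dominant, 1), "hours per cycle")
```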

3.2.3 Decomposition Plot

We then decompose the log(PM10) data into trend, cycle, and noise components. In the decomposition plot, the low-frequency component corresponds to the trend, the mid-frequency component to the cycles, and the high-frequency component to the noise.

The trend component shows that the PM10 concentration goes up and down twice in October 2014, which might indicate a wave-like trend.

The mid-frequency component suggests that we might have a cycle of around a week. The high-frequency plot indicates that the noise may not be stationary.

Since all three plots seem to agree with each other, we now assume that there is seasonality in our data; that is, our null hypothesis is that the data has seasonality.

4 Time Series Analysis

4.1 ARMA Model

4.1.1 Logarithmic Transformation

Although we observe some potential trend and seasonality in the data, we will still start by fitting a stationary Gaussian ARMA(p,q) model with parameter vector \(\theta=(\phi_{1:p},\psi_{1:q},\mu,\sigma^2)\) given by \[\phi (B)(\log(Y_n)-\mu)=\psi (B)\epsilon_n,\] where \[\mu=\mathbb{E} [\log(Y_n)],\]

\[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\] \[\epsilon_n \sim iid N[0,\sigma^2] .\]

4.1.1.1 Choosing p and q for ARMA(p,q) Model

Next, we decide where to start in terms of the values of p and q by tabulating AIC values for a range of choices of p and q.

## [1] "The minimum AIC is: 266.232923466563"
MA0 MA1 MA2 MA3 MA4 MA5
AR0 1705.15 1205.62 899.87 721.84 635.27 520.22
AR1 312.17 318.53 307.34 302.88 304.76 317.09
AR2 299.51 302.78 303.53 314.16 297.74 293.86
AR3 295.21 299.97 269.63 304.74 301.38 272.44
AR4 293.31 294.28 271.58 266.60 270.26 272.14
AR5 294.31 296.12 272.07 266.23 270.41 272.61

Based on Akaike’s information criterion, we select the model with the lowest AIC score. In our case, ARMA(5,3) has the lowest AIC score of 266.23.
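As a sanity check, the selected model's AIC can be reproduced by hand from the log likelihood reported below (AIC = \(-2\ell + 2k\); R's arima counts k = 5 AR + 3 MA + intercept + \(\sigma^2\) = 10 parameters):

```python
# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
log_lik = -123.12   # log likelihood reported for the ARMA(5,3) fit
k = 5 + 3 + 1 + 1   # AR terms + MA terms + intercept + sigma^2

aic = -2 * log_lik + 2 * k
print(round(aic, 2))  # matches the 266.23 in the AIC table up to rounding
```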

4.1.1.3 Diagnostic Analysis

The standard errors of the estimated AR and MA coefficients are close to 0, which is fairly small. However, the standard error of the intercept is around 0.5, which is fairly large, indicating that the model does not provide a good estimate of the intercept. The variance of the model is around 0.09 and the log likelihood is around -123.12. To further investigate whether the model has captured the information in the data, we conduct residual diagnostics.

According to the histogram of the residuals, the mean of the residuals is close to zero. The left tail seems to be longer than the right tail.

By looking at the autocorrelation function (ACF) plot of the residuals, we can check whether the residuals are uncorrelated. In our case, the ACF values are all close to zero at small lags. However, some autocorrelations at lags around 40 to 50 are significantly below zero.

To check the normality assumption on the residuals, we look at the QQ-plot and find that the points curve far away from the line at each end, in opposite directions. The residuals clearly do not follow a normal distribution.

Consequently, the Gaussian white noise assumption might not be valid, and we might need a more advanced model to fit the data.
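One way to quantify what the QQ-plot shows is a formal normality test on the residuals. The sketch below (Python/SciPy, on simulated stand-ins rather than the actual residuals) shows how a long-tailed sample is flagged:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
heavy_tailed = rng.standard_t(df=2, size=504)  # stand-in for long-tailed residuals
gaussian = rng.normal(0.0, 1.0, size=504)      # stand-in for well-behaved residuals

# Shapiro-Wilk test: a small p-value rejects the normality assumption
_, p_heavy = stats.shapiro(heavy_tailed)
_, p_gauss = stats.shapiro(gaussian)
print(p_heavy < 0.05, p_gauss < 0.05)
```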

4.1.2 Square Root Transformation

The QQ-plot above indicates that the Gaussian assumption is not met. The reason might be that we have many small values near 0, in which case a square root transformation of the data might be more appropriate. Our model therefore changes as follows:

\[\phi (B)(\sqrt{Y_n}-\mu)=\psi (B)\epsilon_n,\] where \[\mu=\mathbb{E} [\sqrt{Y_n}],\]

\[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\] \[\epsilon_n \sim iid N[0,\sigma^2] .\]

4.1.2.1 Choosing p and q for ARMA(p,q) Model

MA0 MA1 MA2 MA3 MA4 MA5 MA6
AR0 3197.327 2616.944 2271.729 2017.404 1889.463 1801.734 1726.610
AR1 1560.806 1547.003 1543.844 1544.742 1545.628 1547.174 1549.136
AR2 1544.233 1544.004 1545.279 1545.602 1547.286 1542.081 1543.986
AR3 1543.552 1545.552 1545.864 1546.483 1540.772 1543.928 1544.572
AR4 1545.552 1546.774 1539.515 1546.852 1542.596 1537.634 1550.184
AR5 1546.072 1547.184 1540.733 1542.625 1548.848 1543.217 1535.361
AR6 1546.506 1542.091 1542.606 1544.602 1533.731 1553.447 1537.296
## [1] "The minimum AIC is: 1533.73147949887"

As the table above shows, ARMA(6,4) has the lowest AIC of 1533.73. This value is much larger than what we obtained from the model fitted to the log-transformed data. The reason is that when PM10 concentrations are large, the log-transformed values are much smaller than the square-root-transformed values, so the likelihood is evaluated on a much more compressed scale, leading to a smaller AIC. These two AICs are therefore not comparable.

4.1.2.3 Diagnostic Analysis

As the histogram of the residuals shows, the mean of the residuals is close to zero. Except for some outliers, both tails become shorter and the histogram looks more symmetric.

The autocorrelation function (ACF) plot also shows some improvement: fewer lags have autocorrelations that are significantly different from zero.

Most importantly, the QQ-plot shows that, except for some outliers, the normality assumption is nearly satisfied, although the distribution still has slightly longer tails than the Gaussian.

As the plot above shows, ARMA(6,4) captures almost every peak and pattern in the data. Since the model fitted to the square-root-transformed data fits well and seems to meet all the assumptions, we choose the square root transformation and pick ARMA(6,4) as our best model so far. This also suggests that there might not be any seasonality in this data.

4.2 SARMA Model

Since we observe some weekly seasonality in the data as discussed at the beginning, we will then try to fit a seasonal autoregressive moving average (SARMA) model.

For hourly data, the model we want to fit, \(SARMA(p,q) \times (P,Q)_{24\ \times 7} = SARMA(p,q) \times (P,Q)_{168}\), is \[\phi (B) \Phi(B^{168})(\log(Y_n)-\mu)=\psi (B) \Psi(B^{168})\epsilon_n,\] where \[\mu=\mathbb{E} [\log(Y_n)],\]

\[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\] \[\Phi (x) = 1 - \Phi_1 x - ... - \Phi_P x^P ,\] \[\Psi (x) = 1 + \Psi_1 x + ... + \Psi_Q x^Q ,\] \[\epsilon_n \sim iid N[0,\sigma^2] .\]

4.2.1 Fit SARMA(1,0) Model

Due to the computational complexity, we will not choose the p and q values based on AIC. For simplicity, we fit \(SARMA(1,0) \times (1,0)_{168}\).

## 
## Call:
## arima(x = log(WanliuPM10_2014_Oct$PM10), order = c(1, 0, 0), seasonal = list(order = c(1, 
##     0, 0), period = 168), method = "ML", optim.control = list(maxit = 1000))
## 
## Coefficients:
##       ar1   sar1  intercept
##         1  0.022     4.7238
## s.e.    0  0.055  4586.9469
## 
## sigma^2 estimated as 0.1072:  log likelihood = -153,  aic = 314.01

As we can see from the results, the fitted model is \[ (1-B)(1-0.022B^{168})(\log(Y_n)-4.7238)=\epsilon_n.\]

However, the standard error of the AR(1) coefficient is exactly zero while the standard error of the intercept is extremely large, indicating the fitted model is inappropriate for the data. In fact, the AR(1) coefficient is estimated at the unit-root boundary (\(\hat{\phi}_1 = 1\)), so the fitted process is effectively a random walk rather than a stationary one. This result is consistent with the following residual diagnostic analysis: the histogram and the QQ-plot indicate that the residuals are heavy-tailed, while the ACF plot shows oscillatory components, suggesting an AR(2) structure for the residuals. Thus \(SARMA(3,0) \times (P,Q)_{168}\) might be a better fit for the data, or the existence of seasonality itself is suspect.

4.2.2 Diagnostic Analysis

4.2.3 STL(Seasonal and Trend decomposition using Loess) decomposition

To further investigate whether seasonality exists in the data, we use the \(mstl()\) function in the \(forecast\) library to help handle seasonality[8]. This function provides a convenient automated STL (Seasonal and Trend decomposition using Loess) decomposition, i.e. it decomposes the time series into components such as trend, seasonality, cycles, and remainder. As the plot below shows, the PM10 series can only be decomposed into trend and remainder components. This result suggests that there are no seasonal patterns in the data, which is consistent with our previous analysis.

4.3 Modeling Trend using Linear Regression with ARMA Errors

Next, we want to model the trend in the PM10 data using linear regression with ARMA errors.

In order to fully utilize the dataset to model the trend of PM10 concentration, we want to incorporate extra information such as temperature (degree Celsius), precipitation (mm) and wind speed (m/s), and allow autocorrelation in the regression error term instead of using white noise.

We want to consider a regression model of the form

\[\sqrt{Y_n} = \mu + \beta_1\ TEMP + \beta_2\ RAIN + \beta_3\ WSPM + \eta_n,\] \[\phi (B)\ \eta_n=\psi (B)\epsilon_n,\] which is equivalent to \[\phi (B)\ (\sqrt{Y_n} - \mu - \beta_1\ TEMP - \beta_2\ RAIN - \beta_3\ WSPM) = \psi (B)\epsilon_n,\] where \[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\]

\[\epsilon_n \sim iid N[0,\sigma^2] .\] Our null hypothesis is \[H_0: \beta_1=\beta_2=\beta_3=0,\]

and the alternative hypothesis is \[H_1: \exists \beta_i \ne 0.\]

Based on the AIC table, ARMA(4,2) seems appropriate. We conduct Z-tests on the slopes \(\beta_1\), \(\beta_2\), and \(\beta_3\).

\[Z_{\beta_1} = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} = \frac{0.0061}{0.0435} = 0.1402\ \rightarrow |Z_{\beta_1}| < 1.96 \rightarrow fail\ to\ reject\ H_0\]

\[Z_{\beta_2} = \frac{\hat{\beta_2}}{SE(\hat{\beta_2})} = \frac{-0.2654}{1.1770} = -0.2255\ \rightarrow |Z_{\beta_2}| < 1.96 \rightarrow fail\ to\ reject\ H_0\]

\[Z_{\beta_3} = \frac{\hat{\beta_3}}{SE(\hat{\beta_3})} = \frac{-0.0336}{0.0701} = -0.4793\ \rightarrow |Z_{\beta_3}| < 1.96 \rightarrow fail\ to\ reject\ H_0\]

All the coefficients are statistically insignificant. We do not have enough evidence to reject the null hypothesis, indicating the fitted model might not be appropriate: temperature, rain, and wind speed may not help explain the trend of the data.

MA0 MA1 MA2 MA3 MA4
AR0 2958.796 2545.255 2246.807 2005.584 1888.333
AR1 1564.706 1551.625 1549.124 1550.077 1551.057
AR2 1549.156 1549.149 1550.587 1550.989 1552.696
AR3 1548.798 1550.796 1552.169 1551.869 1546.529
AR4 1550.793 1552.005 1545.222 1546.578 1548.359
AR5 1551.480 1552.617 1546.478 1548.375 1550.367
## [1] "The minimum AIC is: 1545.22203636253"
## 
## Call:
## arima(x = sqrt(WanliuPM10_2014_Oct$PM10), order = c(4, 0, 2), xreg = WanliuPM10_2014_Oct[, 
##     c("TEMP", "RAIN", "WSPM")], optim.control = list(maxit = 10000))
## 
## Coefficients:
##           ar1     ar2     ar3      ar4     ma1     ma2  intercept    TEMP
##       -0.2819  0.6189  0.8195  -0.2334  1.4591  0.9301    11.7510  0.0061
## s.e.   0.0557  0.0358  0.0438   0.0447  0.0383  0.0438     2.1065  0.0435
##          RAIN     WSPM
##       -0.2654  -0.0336
## s.e.   1.1770   0.0701
## 
## sigma^2 estimated as 1.193:  log likelihood = -761.61,  aic = 1545.22
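The three Z statistics can be checked by direct arithmetic from the estimates and standard errors in the arima() output above (a plain Python sketch):

```python
# Estimates and standard errors taken from the fitted regression-with-ARMA-errors model
coefs = {
    "TEMP": (0.0061, 0.0435),
    "RAIN": (-0.2654, 1.1770),
    "WSPM": (-0.0336, 0.0701),
}

for name, (beta, se) in coefs.items():
    z = beta / se
    # |Z| < 1.96 means we fail to reject H0: beta = 0 at the 5% level
    print(name, round(z, 4), abs(z) < 1.96)
```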

5 Conclusion

Inspired by our observations from the exploratory data analysis, we assumed that there is seasonality in the PM10 concentration data from Wanliu station during the three-week period of October 2014. We conducted time series analysis by fitting ARMA(p,q) models to the log- and square-root-transformed data and fitting a SARMA model to the log-transformed data, with residual diagnostics for each fitted model as discussed above. By carefully investigating the residual plots and the STL (Seasonal and Trend decomposition using Loess) decomposition plot, we concluded that we need to reject our null hypothesis: the “seasonality” we observed earlier might actually come from the wave-like trend in the data.

Therefore, we fitted a linear regression model with ARMA errors in order to model the trend in the data, incorporating extra information from the dataset such as temperature (degrees Celsius), precipitation (mm) and wind speed (m/s). However, none of the coefficients of these variables are statistically significant, indicating that the trend may not be explained by them. What’s more, by carefully looking at the vertical scale of the STL decomposition plot [11], we can see that the trend has a relatively narrow range of values: it stays at a relatively high level for a long time and then drops dramatically. Because of this special behavior, a polynomial trend may also be unable to fit the data. Both results imply there is little trend in the data.

In conclusion, there is no seasonality or trend in the PM10 concentration data from Wanliu station during the three-week period of October 2014. Therefore, we can conclude that the square-root-transformed data can be fitted by a stationary Gaussian ARMA(6,4) model as follows: \[(1-0.05B-1.39B^2-0.79B^3+0.93B^4+0.46B^5-0.16B^6)(\sqrt{Y_n}-12.3) = (1+1.11B-0.25B^2-1.27B^3-0.58B^4)\epsilon_n\] where \[{\epsilon_n} \sim N(0, 1.171)\].

6 References

[1] Health effects of air pollutants https://www.aqhi.gov.hk/en/health-advice/health-effects-of-air-pollutants9b5f.html?start=5

[2] Data source: https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data#

[3] 2020 midterm projects 1 https://ionides.github.io/531w20/midterm_project/project1/Midterm-Project.html#trend-noise-and-cycles

[4] 2020 midterm projects 27 https://ionides.github.io/531w20/midterm_project/project27/midterm_proj.html#linear-regression-with-sarima-errors

[5] ARMA model https://ionides.github.io/531w21/05/index.html

[6] SARMA model https://ionides.github.io/531w21/06/index.html

[7] Regression with ARIMA errors in R https://otexts.com/fpp2/regarima.html

[8] STL with multiple seasonal periods https://otexts.com/fpp2/complexseasonality.html

[9] Transformations and adjustments https://otexts.com/fpp2/transformations.html

[10] Interpreting STL decomposition plots https://otexts.com/fpp2/complexseasonality.html

[11] Regression with ARIMA errors in R https://otexts.com/fpp2/regarima.html