1 Introduction

PM10 refers to particulate matter (PM) with a particle size of less than 10 microns. PM10 can penetrate deep into the lungs and cause a broad range of health effects, in particular respiratory and cardiovascular illnesses.[1] In this analysis, we aim to identify any seasonality or trend in the PM10 concentration (ug/m3) distribution using the Beijing Multi-Site Air-Quality Dataset[2]. Finding potential patterns in PM10 concentration would enable air pollution forecasting, so that people could take appropriate protective measures to prevent harmful impacts.

The entire dataset includes 12 sub-datasets of hourly air pollution records from 12 nationally-controlled air-quality monitoring stations. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data for each air-quality station are matched with the nearest weather station of the China Meteorological Administration. Each sub-dataset has 18 columns and spans March 1st, 2013 to February 28th, 2017. The variables include a record index, year, month, day, hour, PM2.5, PM10, SO2, NO2, CO, O3, Temperature, Pressure, Dew Point Temperature, Rain, Wind Direction, Wind Speed, and Station.

In this analysis, we focus on one of the twelve datasets, which contains the air pollution data of Wanliu station. There are 35,064 records in total. However, we only analyze the data from October 6th to October 26th, 2014, which corresponds to a three-week period. During this time period, there are \(24\ hours\ \times 7\ days\ \times 3\ weeks\ = 504\) hourly observations in total.

2 Data Preprocessing: Dealing with Missing Values

Before diving into the time series analysis, we preprocess the data to make sure it is clean. We find 69 missing values in the “PM10” column of the 2014 data. We impute each missing value with the closest non-NA value in the same column. For example, if the PM10 value at 9am on October 2nd is missing, we impute it with the PM10 value at 8am on October 2nd, if available.
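As a minimal sketch of this imputation scheme (a Python/pandas illustration on made-up toy values; the report's actual preprocessing is done in R):

```python
import numpy as np
import pandas as pd

# Toy hourly PM10 column with gaps (illustrative values, not the real data)
pm10 = pd.Series([120.0, np.nan, 98.0, np.nan, np.nan, 75.0])

# Carry the closest earlier non-NA value forward (as in the 9am <- 8am example);
# bfill() handles a gap at the very start of the series.
imputed = pm10.ffill().bfill()

print(imputed.tolist())  # [120.0, 120.0, 98.0, 98.0, 98.0, 75.0]
```

`ffill()` matches the report's "previous hour if available" rule; the trailing `bfill()` only matters if the series starts with a missing value.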

3 Exploratory Data Analysis

In order to investigate the time series data, we will start with the basic time plot.

3.1 Logarithmic Transformation

The variation of the data shown in the plot above increases and decreases with the level of the series, so a logarithmic transformation should be useful [9]. As the plot below shows, there seems to be a weekly seasonality (S = \(24\ hours\ \times 7\ days\ = 168\ lags\)) and a within-week trend in this 3-week hourly PM10 concentration data. PM10 tends to start low at the beginning of a week, then slowly increases and fluctuates around a peak during the week, and finally drops sharply at the end of the week. The series appears non-stationary.

Looking at the first difference of the log-transformed PM10 data, we notice that the series has a roughly constant zero mean but a changing variance. This series therefore also appears non-stationary.
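To illustrate why this combination helps, here is a small Python/NumPy sketch on a simulated series (not the report's data) showing that taking logs stabilises level-dependent variance and that differencing then yields a roughly mean-zero series:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 504  # same length as the 3-week hourly window
# Toy positive series whose fluctuations grow with its level
y = np.exp(0.005 * np.arange(n) + rng.normal(0.0, 0.2, n))

log_y = np.log(y)            # log transform stabilises the variance
diff_log_y = np.diff(log_y)  # first difference removes the slow trend

# The differenced log series fluctuates around zero
print(round(float(diff_log_y.mean()), 4))
```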

3.2 Trend and Seasonality

3.2.1 ACF Plot

The following plot is the ACF plot of the log-transformed PM10 concentration data from Wanliu station. The autocorrelations at small lags are large and positive and then decrease as the lag increases, indicating a potential trend in the data. At larger lags, we can also see a “scalloped” shape, which is potentially due to seasonality.
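The slow ACF decay described above can be reproduced on a toy trended series (a pure-NumPy sketch; the series and numbers are illustrative, not the report's data):

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
n = 504  # same length as the 3-week hourly window
trended = 0.01 * np.arange(n) + rng.normal(0.0, 0.5, n)  # toy series with a trend

rho = acf(trended, 50)
# A trend makes small-lag autocorrelations large, positive, and slowly decaying
print(round(float(rho[1]), 3), round(float(rho[10]), 3))
```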

3.2.2 Spectrum Density Plots

## [1] 10.66667

In the spectral density plot, the dominant frequency is around 0.004 cycles per hour, which corresponds to a dominant period of roughly \(1/0.0039 \approx 256\) hours, i.e. about 10.7 days (the value printed above, in days). Although there are many other local peaks, the one near 0.004 is much higher. This result agrees with what the ACF plot indicates.
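The idea of reading a dominant period off the periodogram can be sketched as follows (Python/SciPy on a simulated series with a known 168-hour cycle; illustrative only, not the report's computation):

```python
import numpy as np
from scipy.signal import periodogram

rng = np.random.default_rng(1)
n = 504                      # hourly observations over 3 weeks
period = 168                 # hypothetical weekly cycle in hours
t = np.arange(n)
y = np.sin(2 * np.pi * t / period) + rng.normal(0.0, 0.3, n)

# Sampling frequency of 1 per hour -> frequencies in cycles per hour
freqs, power = periodogram(y, fs=1.0)
dominant = freqs[np.argmax(power)]
print(round(1 / dominant, 1), "hours per cycle")
```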

3.2.3 Decomposition Plot

We then decompose the log(PM10) data into trend, cycle, and noise components. In the decomposition plot, the low-frequency component corresponds to the trend, the mid-frequency component to the cycles, and the high-frequency component to the noise.

The trend component shows that the PM10 concentration goes up and down twice in October 2014, which might indicate a wave-like trend.

The mid-frequency component suggests that we might have a cycle of around a week. The high-frequency plot indicates that the noise may not be stationary.

Since all three plots seem to agree with each other, we now assume that there is seasonality in our data; that is, our null hypothesis is that the data has seasonality.

4 Time Series Analysis

4.1 ARMA Model

4.1.1 Logarithmic Transformation

Although we observe some potential trend and seasonality in the data, we will still start by fitting a stationary Gaussian ARMA(p,q) model with parameter vector \(\theta=(\phi_{1:p},\psi_{1:q},\mu,\sigma^2)\) given by \[\phi (B)(\log(Y_n)-\mu)=\psi (B)\epsilon_n,\] where \[\mu=\mathbb{E} [\log(Y_n)],\]

\[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\] \[\epsilon_n \sim iid N[0,\sigma^2] .\]

4.1.1.1 Choosing p and q for ARMA(p,q) Model

Next, we decide where to start in terms of the values of p and q by tabulating AIC values for a range of choices of p and q.

## [1] "The minimum AIC is: 266.232923466563"
MA0 MA1 MA2 MA3 MA4 MA5
AR0 1705.15 1205.62 899.87 721.84 635.27 520.22
AR1 312.17 318.53 307.34 302.88 304.76 317.09
AR2 299.51 302.78 303.53 314.16 297.74 293.86
AR3 295.21 299.97 269.63 304.74 301.38 272.44
AR4 293.31 294.28 271.58 266.60 270.26 272.14
AR5 294.31 296.12 272.07 266.23 270.41 272.61

Based on Akaike’s information criterion, we select the model with the lowest AIC score. In our case, ARMA(5,3) has the lowest AIC score of 266.23.
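As a sanity check, the selected model's AIC can be reproduced by hand from the log likelihood reported below (AIC = \(-2\ell + 2k\); R's arima counts k = 5 AR + 3 MA + intercept + \(\sigma^2\) = 10 parameters):

```python
# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
log_lik = -123.12   # log likelihood reported for the ARMA(5,3) fit
k = 5 + 3 + 1 + 1   # AR terms + MA terms + intercept + sigma^2

aic = -2 * log_lik + 2 * k
print(round(aic, 2))  # matches the 266.23 in the AIC table up to rounding
```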

4.1.1.3 Diagnostic Analysis

The standard errors of the estimated AR and MA coefficients are close to 0, which is fairly small. However, the standard error of the intercept is around 0.5, which is fairly large, indicating that the model does not provide a good estimate of the intercept. The variance of the model is around 0.09 and the log likelihood is around -123.12. To further investigate whether the model has captured the information in the data, we conduct residual diagnostics.

According to the histogram of the residuals, the mean of the residuals is close to zero. The left tail seems to be longer than the right tail.

By looking at the autocorrelation function (ACF) plot of the residuals, we can check whether the residuals are uncorrelated. In our case, the ACF values are all close to zero at small lags. However, some autocorrelations at lags around 40 to 50 are significantly below zero.

To check the normality assumption on the residuals, we look at the QQ-plot and find that the points curve far away from the line at each end, in opposite directions. The residuals clearly do not follow a normal distribution.

Consequently, the Gaussian white noise assumption might not be valid, and we might need a more advanced model to fit the data.
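One way to quantify what the QQ-plot shows is a formal normality test on the residuals. The sketch below (Python/SciPy, on simulated stand-ins rather than the actual residuals) shows how a long-tailed sample is flagged:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
heavy_tailed = rng.standard_t(df=2, size=504)  # stand-in for long-tailed residuals
gaussian = rng.normal(0.0, 1.0, size=504)      # stand-in for well-behaved residuals

# Shapiro-Wilk test: a small p-value rejects the normality assumption
_, p_heavy = stats.shapiro(heavy_tailed)
_, p_gauss = stats.shapiro(gaussian)
print(p_heavy < 0.05, p_gauss < 0.05)
```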

4.1.2 Square Root Transformation

The QQ-plot above indicates that the Gaussian assumption is not met. The reason might be that we have many small values near 0, in which case a square root transformation of the data might be more appropriate. Our model therefore changes as follows:

\[\phi (B)(\sqrt{Y_n}-\mu)=\psi (B)\epsilon_n,\] where \[\mu=\mathbb{E} [\sqrt{Y_n}],\]

\[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\] \[\epsilon_n \sim iid N[0,\sigma^2] .\]

4.1.2.1 Choosing p and q for ARMA(p,q) Model

MA0 MA1 MA2 MA3 MA4 MA5 MA6
AR0 3197.327 2616.944 2271.729 2017.404 1889.463 1801.734 1726.610
AR1 1560.806 1547.003 1543.844 1544.742 1545.628 1547.174 1549.136
AR2 1544.233 1544.004 1545.279 1545.602 1547.286 1542.081 1543.986
AR3 1543.552 1545.552 1545.864 1546.483 1540.772 1543.928 1544.572
AR4 1545.552 1546.774 1539.515 1546.852 1542.596 1537.634 1550.184
AR5 1546.072 1547.184 1540.733 1542.625 1548.848 1543.217 1535.361
AR6 1546.506 1542.091 1542.606 1544.602 1533.731 1553.447 1537.296
## [1] "The minimum AIC is: 1533.73147949887"

As the table above shows, ARMA(6,4) has the lowest AIC of 1533.73. This value is much larger than what we obtained from the model fitted to the log-transformed data. The reason is that when PM10 concentrations are large, the log-transformed values are much smaller than the square-root-transformed values, so the likelihood is evaluated on a much more compressed scale, leading to a smaller AIC. These two AICs are therefore not comparable.

4.1.2.3 Diagnostic Analysis

As the histogram of the residuals shows, the mean of the residuals is close to zero. Except for some outliers, both tails become shorter and the histogram looks more symmetric.

The autocorrelation function (ACF) plot also shows some improvement: fewer lags have autocorrelations that are significantly different from zero.

Most importantly, the QQ-plot shows that, except for some outliers, the normality assumption is nearly satisfied, although the distribution still has slightly longer tails than the Gaussian.

As the plot above shows, ARMA(6,4) captures almost every peak and pattern in the data. Since the model fitted to the square-root-transformed data fits well and seems to meet all the assumptions, we choose the square root transformation and pick ARMA(6,4) as our best model so far. This also suggests that there might not be any seasonality in this data.

4.2 SARMA Model

Since we observe some weekly seasonality in the data as discussed at the beginning, we will then try to fit a seasonal autoregressive moving average (SARMA) model.

For hourly data, the model we want to fit, \(SARMA(p,q) \times (P,Q)_{24\ \times 7} = SARMA(p,q) \times (P,Q)_{168}\), is \[\phi (B) \Phi(B^{168})(\log(Y_n)-\mu)=\psi (B) \Psi(B^{168})\epsilon_n,\] where \[\mu=\mathbb{E} [\log(Y_n)],\]

\[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\] \[\Phi (x) = 1 - \Phi_1 x - ... - \Phi_P x^P ,\] \[\Psi (x) = 1 + \Psi_1 x + ... + \Psi_Q x^Q ,\] \[\epsilon_n \sim iid N[0,\sigma^2] .\]

4.2.1 Fit SARMA(1,0) Model

Due to the computational complexity, we will not choose the p and q values based on AIC. For simplicity, we fit \(SARMA(1,0) \times (1,0)_{168}\).

## 
## Call:
## arima(x = log(WanliuPM10_2014_Oct$PM10), order = c(1, 0, 0), seasonal = list(order = c(1, 
##     0, 0), period = 168), method = "ML", optim.control = list(maxit = 1000))
## 
## Coefficients:
##       ar1   sar1  intercept
##         1  0.022     4.7238
## s.e.    0  0.055  4586.9469
## 
## sigma^2 estimated as 0.1072:  log likelihood = -153,  aic = 314.01

As we can see from the results, the fitted model is \[ (1-B)(1-0.022B^{168})(\log(Y_n)-4.7238)=\epsilon_n.\]

However, the standard error of the AR(1) coefficient is exactly zero while the standard error of the intercept is extremely large, indicating the fitted model is inappropriate for the data. In fact, the AR(1) coefficient is estimated at the unit-root boundary (\(\hat{\phi}_1 = 1\)), so the fitted process is effectively a random walk rather than a stationary one. This result is consistent with the following residual diagnostic analysis: the histogram and the QQ-plot indicate that the residuals are heavy-tailed, while the ACF plot shows oscillatory components, suggesting an AR(2) structure for the residuals. Thus \(SARMA(3,0) \times (P,Q)_{168}\) might be a better fit for the data, or the existence of seasonality itself is suspect.

4.2.2 Diagnostic Analysis

4.2.3 STL(Seasonal and Trend decomposition using Loess) decomposition

To further investigate whether seasonality exists in the data, we use the \(mstl()\) function in the \(forecast\) library to help handle seasonality[8]. This function provides a convenient automated STL (Seasonal and Trend decomposition using Loess) decomposition, i.e. it decomposes the time series into components such as trend, seasonality, cycles, and remainder. As the plot below shows, the PM10 series can only be decomposed into trend and remainder components. This result suggests that there are no seasonal patterns in the data, which is consistent with our previous analysis.

4.3 Modeling Trend using Linear Regression with ARMA Errors

Next, we want to model the trend in the PM10 data using linear regression with ARMA errors.

In order to fully utilize the dataset to model the trend of PM10 concentration, we want to incorporate extra information such as temperature (degree Celsius), precipitation (mm) and wind speed (m/s), and allow autocorrelation in the regression error term instead of using white noise.

We want to consider a regression model of the form

\[\sqrt{Y_n} = \mu + \beta_1\ TEMP + \beta_2\ RAIN + \beta_3\ WSPM + \eta_n,\] \[\phi (B)\ \eta_n=\psi (B)\epsilon_n,\] which is equivalent to \[\phi (B)\ (\sqrt{Y_n} - \mu - \beta_1\ TEMP - \beta_2\ RAIN - \beta_3\ WSPM) = \psi (B)\epsilon_n,\] where \[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\]

\[\epsilon_n \sim iid N[0,\sigma^2] .\] Our null hypothesis is \[H_0: \beta_1=\beta_2=\beta_3=0,\]

and the alternative hypothesis is \[H_1: \exists \beta_i \ne 0.\]

Based on the AIC table, ARMA(4,2) seems appropriate. We conduct Z-tests on the slopes \(\beta_1\), \(\beta_2\), and \(\beta_3\).

\[Z_{\beta_1} = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} = \frac{0.0061}{0.0435} = 0.1402\ \rightarrow |Z_{\beta_1}| < 1.96 \rightarrow fail\ to\ reject\ H_0\]

\[Z_{\beta_2} = \frac{\hat{\beta_2}}{SE(\hat{\beta_2})} = \frac{-0.2654}{1.1770} = -0.2255\ \rightarrow |Z_{\beta_2}| < 1.96 \rightarrow fail\ to\ reject\ H_0\]

\[Z_{\beta_3} = \frac{\hat{\beta_3}}{SE(\hat{\beta_3})} = \frac{-0.0336}{0.0701} = -0.4793\ \rightarrow |Z_{\beta_3}| < 1.96 \rightarrow fail\ to\ reject\ H_0\]

All the coefficients are statistically insignificant. We do not have enough evidence to reject the null hypothesis, indicating the fitted model might not be appropriate: temperature, rain, and wind speed may not help explain the trend of the data.

MA0 MA1 MA2 MA3 MA4
AR0 2958.796 2545.255 2246.807 2005.584 1888.333
AR1 1564.706 1551.625 1549.124 1550.077 1551.057
AR2 1549.156 1549.149 1550.587 1550.989 1552.696
AR3 1548.798 1550.796 1552.169 1551.869 1546.529
AR4 1550.793 1552.005 1545.222 1546.578 1548.359
AR5 1551.480 1552.617 1546.478 1548.375 1550.367
## [1] "The minimum AIC is: 1545.22203636253"
## 
## Call:
## arima(x = sqrt(WanliuPM10_2014_Oct$PM10), order = c(4, 0, 2), xreg = WanliuPM10_2014_Oct[, 
##     c("TEMP", "RAIN", "WSPM")], optim.control = list(maxit = 10000))
## 
## Coefficients:
##           ar1     ar2     ar3      ar4     ma1     ma2  intercept    TEMP
##       -0.2819  0.6189  0.8195  -0.2334  1.4591  0.9301    11.7510  0.0061
## s.e.   0.0557  0.0358  0.0438   0.0447  0.0383  0.0438     2.1065  0.0435
##          RAIN     WSPM
##       -0.2654  -0.0336
## s.e.   1.1770   0.0701
## 
## sigma^2 estimated as 1.193:  log likelihood = -761.61,  aic = 1545.22
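The three Z statistics can be checked by direct arithmetic from the estimates and standard errors in the arima() output above (a plain Python sketch):

```python
# Estimates and standard errors taken from the fitted regression-with-ARMA-errors model
coefs = {
    "TEMP": (0.0061, 0.0435),
    "RAIN": (-0.2654, 1.1770),
    "WSPM": (-0.0336, 0.0701),
}

for name, (beta, se) in coefs.items():
    z = beta / se
    # |Z| < 1.96 means we fail to reject H0: beta = 0 at the 5% level
    print(name, round(z, 4), abs(z) < 1.96)
```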

5 Conclusion

Inspired by our observations from the exploratory data analysis, we assumed that there is seasonality in the PM10 concentration data from Wanliu station during the three-week period of October 2014. We conducted time series analysis by fitting ARMA(p,q) models to the log- and square-root-transformed data and fitting a SARMA model to the log-transformed data, with residual diagnostics for each fitted model as discussed above. By carefully investigating the residual plots and the STL (Seasonal and Trend decomposition using Loess) decomposition plot, we concluded that we need to reject our null hypothesis: the “seasonality” we observed earlier might actually come from the wave-like trend in the data.

Therefore, we fitted a linear regression model with ARMA errors in order to model the trend in the data, incorporating extra information from the dataset such as temperature (degrees Celsius), precipitation (mm) and wind speed (m/s). However, none of the coefficients of these variables are statistically significant, indicating that the trend may not be explained by them. What’s more, by carefully looking at the vertical scale of the STL decomposition plot [11], we can see that the trend has a relatively narrow range of values: it stays at a relatively high level for a long time and then drops dramatically. Because of this special behavior, a polynomial trend may also be unable to fit the data. Both results imply there is little trend in the data.

In conclusion, there is no seasonality or trend in the PM10 concentration data from Wanliu station during the three-week period of October 2014. Therefore, we can conclude that the square-root-transformed data can be fitted by a stationary Gaussian ARMA(6,4) model as follows: \[(1-0.05B-1.39B^2-0.79B^3+0.93B^4+0.46B^5-0.16B^6)(\sqrt{Y_n}-12.3) = (1+1.11B-0.25B^2-1.27B^3-0.58B^4)\epsilon_n\] where \[{\epsilon_n} \sim N(0, 1.171)\].

6 References

[1] Health effects of air pollutants https://www.aqhi.gov.hk/en/health-advice/health-effects-of-air-pollutants9b5f.html?start=5

[2] Data source: https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data#

[3] 2020 midterm projects 1 https://ionides.github.io/531w20/midterm_project/project1/Midterm-Project.html#trend-noise-and-cycles

[4] 2020 midterm projects 27 https://ionides.github.io/531w20/midterm_project/project27/midterm_proj.html#linear-regression-with-sarima-errors

[5] ARMA model https://ionides.github.io/531w21/05/index.html

[6] SARMA model https://ionides.github.io/531w21/06/index.html

[7] Regression with ARIMA errors in R https://otexts.com/fpp2/regarima.html

[8] STL with multiple seasonal periods https://otexts.com/fpp2/complexseasonality.html

[9] Transformations and adjustments https://otexts.com/fpp2/transformations.html

[10] Interpreting STL decomposition plots https://otexts.com/fpp2/complexseasonality.html

[11] Regression with ARIMA errors in R https://otexts.com/fpp2/regarima.html