PM10 is particulate matter (PM) with particle size less than 10 microns. PM10 can penetrate deep into the lungs and cause a broad range of health effects, in particular respiratory and cardiovascular illnesses.[1] In this analysis, we aim to identify any seasonality or trend in the PM10 concentration (ug/m3) distribution using the Beijing Multi-Site Air-Quality Dataset[2]. Identifying such patterns would enable air pollution forecasts, so that people could take protective measures against harmful exposure.
The full dataset consists of 12 sets of hourly air pollution records from 12 nationally controlled air-quality monitoring stations. The air-quality data come from the Beijing Municipal Environmental Monitoring Center. The meteorological data for each air-quality station are matched with the nearest weather station of the China Meteorological Administration. Each dataset has 18 columns, spanning March 1st, 2013 to February 28th, 2017. The variables include year, month, day, hour, PM2.5, PM10, SO2, NO2, CO, O3, temperature, pressure, dew point temperature, rain, wind direction, wind speed, and station.
In this analysis, we focus on one of the twelve datasets, which contains the air pollution data of Wanliu station. There are 35,064 records in total. However, we restrict our attention to the data from October 6th to October 26th, 2014, a 3-week period. During this period there are \(24\ hours\ \times 7\ days\ \times 3\ weeks\ = 504\) hourly observations in total.
Before diving into time series analysis, we preprocess the data to make sure they are clean. There are 69 missing values in the “PM10” column of the 2014 data. We impute each missing value with the closest preceding non-NA value in the same column. For example, if the PM10 value at 9am on October 2nd is missing, we impute it with the PM10 value at 8am on October 2nd if available.
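This imputation rule (last observation carried forward) can be sketched in a few lines. The analysis itself was done in R; the function name `impute_locf` and the values below are illustrative, not the Wanliu data:

```python
# Last-observation-carried-forward imputation: a pure-Python sketch of the
# rule described above. Values are made-up hourly PM10 readings (ug/m3).
def impute_locf(values):
    """Replace each None with the closest preceding non-missing value."""
    out, last = [], None
    for v in values:
        if v is None:
            v = last          # carry the previous hour's PM10 forward
        out.append(v)
        last = v
    return out

pm10 = [34.0, None, None, 52.0, None, 61.0]
print(impute_locf(pm10))  # [34.0, 34.0, 34.0, 52.0, 52.0, 61.0]
```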
In order to investigate the time series data, we will start with the basic time plot.
The variation of the data shown in the plot above increases and decreases with the level of the series, so a logarithmic transformation should be useful [9]. As the plot below shows, there seems to be a weekly seasonality (S = \(24\ hours\ \times 7\ days\ = 168\ lags\)) and a within-week trend in this 3-week hourly PM10 concentration data. PM10 tends to start low at the beginning of a week, then slowly increases and fluctuates around a peak during the week, and finally drops sharply at the end of the week. The data appear non-stationary.
Looking at the difference of the log-transformed PM10 data, we notice that the series has an approximately constant zero mean but a changing variance. This differenced series also appears non-stationary.
The following plot is the ACF plot of the log-transformed PM10 concentration data from Wanliu station. The autocorrelations at small lags are large and positive and then decrease as the lag increases, indicating a potential trend in the data. At larger lags we also see a “scalloped” shape, potentially due to seasonality.
## [1] 10.66667
In the spectral density plot, the dominant frequency is around 0.004 cycles per hour, which corresponds to a dominant period of around 11 days. Although there are many other local peaks, the one around 0.004 is much higher. This agrees with what the ACF plot indicates.
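The frequency-to-period conversion is a one-liner. The frequency 1/256 below is an assumed value chosen to reproduce the printed dominant period of about 10.67 days; the peak in the plot sits near 0.004:

```python
# Convert a dominant spectral frequency (in cycles per hour) to a period in days.
def period_in_days(freq_per_hour):
    return 1.0 / freq_per_hour / 24.0

print(round(period_in_days(1.0 / 256.0), 2))  # 10.67
print(round(period_in_days(0.004), 2))        # 10.42
```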
We then decompose the log(PM10) data into trend, cycle, and noise components. In the decomposition plot, the low-frequency component corresponds to the trend, the mid-frequency component to the cycles, and the high-frequency component to the noise.
The trend shows that the PM10 concentration goes up and down twice in October 2014, suggesting a wave-like trend.
The mid-frequency component suggests a cycle of around a week. The high-frequency plot indicates that the noise may not be stationary.
Since all three plots seem to agree with each other, we now posit that there is seasonality in our data, i.e., our null hypothesis is that the data have seasonality.
Although we observe some potential trend and seasonality in the data, we will still start by fitting a stationary Gaussian ARMA(p,q) model with parameter vector \(\theta=(\phi_{1:p},\psi_{1:q},\mu,\sigma^2)\) given by \[\phi (B)(log(Y_n)-\mu)=\psi (B)\epsilon_n,\] where \[\mu=\mathbb{E} [log(Y_n)],\]
\[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\] \[\epsilon_n \sim iid N[0,\sigma^2] .\]
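The recursion defined by these equations can be sketched in a few lines of pure Python. The function `simulate_arma`, its coefficients, and the seed are illustrative assumptions, not the fitted Wanliu values:

```python
import random

# Minimal ARMA(p, q) simulator mirroring phi(B)(X_n - mu) = psi(B) eps_n.
def simulate_arma(phi, psi, mu=0.0, sigma=1.0, n=500, burn=100, seed=1):
    random.seed(seed)
    p, q = len(phi), len(psi)
    x, eps = [], []
    for t in range(n + burn):
        e = random.gauss(0.0, sigma)
        # AR part: weighted sum of past deviations from the mean.
        ar = sum(phi[i] * (x[t - 1 - i] - mu) for i in range(p) if t - 1 - i >= 0)
        # MA part: weighted sum of past innovations.
        ma = sum(psi[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        x.append(mu + ar + ma + e)
        eps.append(e)
    return x[burn:]  # drop burn-in so the series is near its stationary regime

series = simulate_arma(phi=[0.6], psi=[0.3], mu=4.5, n=504)
```

With 504 draws (one per hourly observation in our window), the sample mean should hover near \(\mu = 4.5\).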
Next, we decide where to start in terms of the values of p and q by tabulating AIC values for a range of choices of p and q.
## [1] "The minimum AIC is: 266.232923466563"
| | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
|---|---|---|---|---|---|---|
| AR0 | 1705.15 | 1205.62 | 899.87 | 721.84 | 635.27 | 520.22 |
| AR1 | 312.17 | 318.53 | 307.34 | 302.88 | 304.76 | 317.09 |
| AR2 | 299.51 | 302.78 | 303.53 | 314.16 | 297.74 | 293.86 |
| AR3 | 295.21 | 299.97 | 269.63 | 304.74 | 301.38 | 272.44 |
| AR4 | 293.31 | 294.28 | 271.58 | 266.60 | 270.26 | 272.14 |
| AR5 | 294.31 | 296.12 | 272.07 | 266.23 | 270.41 | 272.61 |
Based on Akaike’s information criterion, we select the model with the lowest AIC score. In our case, ARMA(5,3) has the lowest AIC score of 266.23.
##
## Call:
## arima(x = log(WanliuPM10_2014_Oct$PM10), order = c(5, 0, 3), method = "ML",
## optim.control = list(maxit = 1000))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## -1.4114 0.1045 1.4178 0.7908 -0.0769 2.3216 2.1735 0.8107
## s.e. 0.0719 0.0896 0.0359 0.0969 0.0511 0.0585 0.0942 0.0628
## intercept
## 4.5621
## s.e. 0.4671
##
## sigma^2 estimated as 0.09465: log likelihood = -123.12, aic = 266.23
According to the results, the fitted model is \[ (1 +1.4114B-0.1045 B^2 -1.4178B^3- 0.7908B^4+0.0769B^5)(log(Y_n)-4.5621) = (1+2.3216B+2.1735B^2+0.8107B^3)\epsilon_n\]
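As a quick sanity check, the reported AIC can be reproduced from the printed log likelihood via AIC \(= 2k - 2\,\mathrm{logLik}\), where k = 10 here (5 AR coefficients, 3 MA coefficients, the intercept, and \(\sigma^2\), which is how R's `arima()` counts parameters):

```python
# AIC = 2k - 2*logLik; k = 10 for ARMA(5,3) with intercept and sigma^2.
def aic(loglik, k):
    return 2 * k - 2 * loglik

print(round(aic(-123.12, 10), 2))  # 266.24, matching the R output up to rounding
```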
The standard errors of the estimated AR and MA coefficients are close to 0, which is fairly small, while the standard error of the intercept is around 0.5, which is fairly large, indicating the model does not provide a good estimate of the intercept. The variance of the model is around 0.09 and the log likelihood is around -123.12. In order to further investigate whether the model has captured the information in the data, we conduct residual diagnostics.
According to the histogram of the residuals, the mean of the residuals is close to zero. The left tail seems to be longer than the right tail.
By looking at the autocorrelation function (ACF) plot of the residuals, we can check whether the residuals are uncorrelated. In our case, the ACF is close to zero at small lags. However, some autocorrelations at lags around 40 to 50 are significantly below zero.
In order to check the normality assumption of the residuals, we look at the QQ-plot and find that the points curve away from the line at both ends in opposite directions. The residuals clearly do not follow a normal distribution.
Consequently, the Gaussian white noise assumption might not be valid. We might need a more advanced model to fit the data.
The QQ-plot above indicates that we do not meet the Gaussian assumption. The reason might be that we have many small values near 0, so a square-root transformation of the data might be more appropriate. Therefore, our model changes as follows:
\[\phi (B)(\sqrt{Y_n}-\mu)=\psi (B)\epsilon_n,\] where \[\mu=\mathbb{E} [\sqrt{Y_n}],\]
\[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\] \[\epsilon_n \sim iid N[0,\sigma^2] .\]
| | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | MA6 |
|---|---|---|---|---|---|---|---|
| AR0 | 3197.327 | 2616.944 | 2271.729 | 2017.404 | 1889.463 | 1801.734 | 1726.610 |
| AR1 | 1560.806 | 1547.003 | 1543.844 | 1544.742 | 1545.628 | 1547.174 | 1549.136 |
| AR2 | 1544.233 | 1544.004 | 1545.279 | 1545.602 | 1547.286 | 1542.081 | 1543.986 |
| AR3 | 1543.552 | 1545.552 | 1545.864 | 1546.483 | 1540.772 | 1543.928 | 1544.572 |
| AR4 | 1545.552 | 1546.774 | 1539.515 | 1546.852 | 1542.596 | 1537.634 | 1550.184 |
| AR5 | 1546.072 | 1547.184 | 1540.733 | 1542.625 | 1548.848 | 1543.217 | 1535.361 |
| AR6 | 1546.506 | 1542.091 | 1542.606 | 1544.602 | 1533.731 | 1553.447 | 1537.296 |
## [1] "The minimum AIC is: 1533.73147949887"
As the table above shows, ARMA(6,4) has the lowest AIC of 1533.73. This value is much larger than the one obtained for the log-transformed data, but the two AICs are not comparable: the likelihoods are computed on different scales of the data. When the PM10 concentration is large, the log-transformed values are much smaller than the square-root-transformed values, which by itself leads to a smaller AIC on the log scale.
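One standard way to make such AICs comparable is to add \(-2\) times the log-Jacobian of each transformation back to its transformed-scale AIC. The sketch below (with made-up PM10 values, not the Wanliu data) shows that whenever \(y > 4\) the log transform's Jacobian term is more negative, which is exactly why the raw log-scale AIC comes out smaller even for an equally good fit:

```python
import math

# Illustrative PM10 values (ug/m3); NOT the Wanliu data.
y = [45.0, 80.0, 120.0, 60.0, 200.0]

# On the original scale, logLik_orig = logLik_trans + sum(log |T'(y)|),
# so a scale-comparable AIC is AIC_trans - 2 * sum(log |T'(y)|).
jac_log = sum(math.log(1.0 / v) for v in y)                      # T(y)=log(y),  T'(y)=1/y
jac_sqrt = sum(math.log(1.0 / (2.0 * math.sqrt(v))) for v in y)  # T(y)=sqrt(y), T'(y)=1/(2*sqrt(y))

# For y > 4 the log transform's term is more negative, so its raw AIC is smaller.
print(jac_log < jac_sqrt)  # True
```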
##
## Call:
## arima(x = sqrt(WanliuPM10_2014_Oct$PM10), order = c(6, 0, 4), method = "ML",
## optim.control = list(maxit = 1000))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ma1 ma2
## 0.0531 1.3906 0.7878 -0.9348 -0.4642 0.1631 1.0989 -0.2466
## s.e. 0.6191 0.4183 0.5786 0.1642 0.6393 0.1265 0.6215 0.2923
## ma3 ma4 intercept
## -1.2678 -0.5845 12.3045
## s.e. 0.3332 0.5790 0.3654
##
## sigma^2 estimated as 1.171: log likelihood = -757.86, aic = 1539.72
According to the results, the fitted model is \[(1 -0.05B-1.39B^2-0.79B^3+0.93B^4+0.46B^5-0.16B^6)(\sqrt{Y_n}-12.3) = (1+1.11B-0.25B^2-1.27B^3-0.58B^4)\epsilon_n\]
As the histogram of the residuals shows, the mean of the residuals is close to zero. Except for some outliers, both tails are shorter and the histogram looks more symmetric.
The autocorrelation function (ACF) plot also shows some improvement: fewer lags have autocorrelations significantly different from zero.
Most importantly, except for some outliers, the QQ-plot suggests the normality assumption is nearly satisfied, although the distribution still has somewhat longer tails than a Gaussian.
As the plot above shows, ARMA(6,4) captures almost every peak and pattern of the data. Since the model fitted to the square-root-transformed data fits well and seems to meet all the assumptions, we tentatively pick the square-root transformation and ARMA(6,4) as our best model. This also suggests that there might not be any seasonality in the data.
Since we observe some weekly seasonality in the data as discussed at the beginning, we will then try to fit a seasonal autoregressive moving average (SARMA) model.
For hourly data with weekly seasonality, the model we want to fit is \(SARMA(p,q) \times (P,Q)_{24\ \times 7} = SARMA(p,q) \times (P,Q)_{168}\): \[\phi (B) \Phi(B^{168})(log(Y_n)-\mu)=\psi (B) \Psi(B^{168})\epsilon_n,\] where \[\mu=\mathbb{E} [log(Y_n)],\]
\[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\] \[\Phi (x) = 1 - \Phi_1 x - ... - \Phi_P x^P ,\] \[\Psi (x) = 1 + \Psi_1 x + ... + \Psi_Q x^Q ,\] \[\epsilon_n \sim iid N[0,\sigma^2] .\]
Due to the computational complexity, we will not choose p and q based on AIC. For simplicity, we fit \(SARMA(1,0) \times (1,0)_{168}\).
##
## Call:
## arima(x = log(WanliuPM10_2014_Oct$PM10), order = c(1, 0, 0), seasonal = list(order = c(1,
## 0, 0), period = 168), method = "ML", optim.control = list(maxit = 1000))
##
## Coefficients:
## ar1 sar1 intercept
## 1 0.022 4.7238
## s.e. 0 0.055 4586.9469
##
## sigma^2 estimated as 0.1072: log likelihood = -153, aic = 314.01
As we can see from the results, the fitted model is \[ (1-B)(1-0.022B^{168})(log(Y_n)-4.7238)=\epsilon_n,\]
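Expanding the fitted operator makes the problem explicit: the estimated AR(1) coefficient of exactly 1 contributes a unit-root factor \((1-B)\),

\[(1-B)(1-0.022B^{168}) = 1 - B - 0.022B^{168} + 0.022B^{169},\]

so the fitted process has a root on the unit circle and is not a stationary causal AR process.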
However, the standard error of the AR(1) coefficient is exactly zero while the standard error of the intercept is extremely large, indicating the fitted model is inappropriate for the data. This is consistent with the following residual diagnostics. The histogram and the QQ-plot indicate the residuals are heavy-tailed, while the ACF plot seems to have oscillatory components, suggesting an AR(2) structure in the residuals. Thus \(SARMA(3,0) \times (P,Q)_{168}\) might fit the data better, or the existence of seasonality itself is suspect.
To further investigate whether seasonality exists in the data, we use the \(mstl()\) function from the \(forecast\) package to handle seasonality[8]. This function provides automated STL (Seasonal and Trend decomposition using Loess) decomposition, i.e., it decomposes the time series into components such as trend, seasonality, cycles, and remainder. As the plot below shows, the PM10 series can only be decomposed into trend and remainder components. This suggests there is no seasonal pattern in the data, consistent with our previous analysis.
Next, we model the trend in the PM10 data using linear regression with ARMA errors.
In order to fully utilize the dataset, we incorporate extra information such as temperature (degrees Celsius), precipitation (mm), and wind speed (m/s), and allow autocorrelation in the regression error term instead of assuming white noise.
We want to consider a regression model of the form
\[\sqrt{Y_n} = \mu + \beta_1\ TEMP + \beta_2\ RAIN + \beta_3\ WSPM + \eta_n,\] \[\phi (B)\ \eta_n=\psi (B)\epsilon_n,\] which is equivalent to \[\phi (B)\ (\sqrt{Y_n} - \mu - \beta_1\ TEMP - \beta_2\ RAIN - \beta_3\ WSPM) = \psi (B)\epsilon_n,\] where \[\phi (x) = 1 - \phi_1 x - ... - \phi_px^p ,\] \[\psi (x) = 1 + \psi_1 x + ... + \psi_q x^q ,\]
\[\epsilon_n \sim iid N[0,\sigma^2] .\] Our null model is \[H_0: \beta_1=\beta_2=\beta_3=0\],
and the alternative hypothesis is \[H_1: \exists \beta_i \ne 0.\]
Based on the AIC table below, ARMA(4,2) seems appropriate. We conducted Z-tests on the slopes \(\beta_1\), \(\beta_2\), and \(\beta_3\).
\[Z_{\beta_1} = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})} = \frac{0.0061}{0.0435} = 0.1402\ \rightarrow |Z_{\beta_1}| < 1.96 \rightarrow fail\ to\ reject\ H_0\]
\[Z_{\beta_2} = \frac{\hat{\beta_2}}{SE(\hat{\beta_2})} = \frac{-0.2654}{1.1770} = -0.2255\ \rightarrow |Z_{\beta_2}| < 1.96 \rightarrow fail\ to\ reject\ H_0\]
\[Z_{\beta_3} = \frac{\hat{\beta_3}}{SE(\hat{\beta_3})} = \frac{-0.0336}{0.0701} = -0.4793\ \rightarrow |Z_{\beta_3}| < 1.96 \rightarrow fail\ to\ reject\ H_0\]
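The three statistics above can be reproduced directly from the coefficient table in the R output, as each is simply the estimate divided by its standard error (the helper `z_stat` is ours, for illustration):

```python
# Z statistic for a regression coefficient: estimate / standard error.
def z_stat(beta_hat, se):
    return beta_hat / se

print(round(z_stat(0.0061, 0.0435), 4))   # 0.1402  (TEMP)
print(round(z_stat(-0.2654, 1.1770), 4))  # -0.2255 (RAIN)
print(round(z_stat(-0.0336, 0.0701), 4))  # -0.4793 (WSPM)
```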
All the coefficients are statistically insignificant; we do not have enough evidence to reject the null hypothesis, indicating the fitted model might not be appropriate. Temperature, rain, and wind speed may not help explain the trend of the data.
| | MA0 | MA1 | MA2 | MA3 | MA4 |
|---|---|---|---|---|---|
| AR0 | 2958.796 | 2545.255 | 2246.807 | 2005.584 | 1888.333 |
| AR1 | 1564.706 | 1551.625 | 1549.124 | 1550.077 | 1551.057 |
| AR2 | 1549.156 | 1549.149 | 1550.587 | 1550.989 | 1552.696 |
| AR3 | 1548.798 | 1550.796 | 1552.169 | 1551.869 | 1546.529 |
| AR4 | 1550.793 | 1552.005 | 1545.222 | 1546.578 | 1548.359 |
| AR5 | 1551.480 | 1552.617 | 1546.478 | 1548.375 | 1550.367 |
## [1] "The minimum AIC is: 1545.22203636253"
##
## Call:
## arima(x = sqrt(WanliuPM10_2014_Oct$PM10), order = c(4, 0, 2), xreg = WanliuPM10_2014_Oct[,
## c("TEMP", "RAIN", "WSPM")], optim.control = list(maxit = 10000))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2 intercept TEMP
## -0.2819 0.6189 0.8195 -0.2334 1.4591 0.9301 11.7510 0.0061
## s.e. 0.0557 0.0358 0.0438 0.0447 0.0383 0.0438 2.1065 0.0435
## RAIN WSPM
## -0.2654 -0.0336
## s.e. 1.1770 0.0701
##
## sigma^2 estimated as 1.193: log likelihood = -761.61, aic = 1545.22
Inspired by our observations from the exploratory data analysis, we assumed that there is seasonality in the PM10 concentration data from Wanliu station during the 3-week period in October 2014. We conducted time series analysis by fitting ARMA(p,q) models to the log- and square-root-transformed data and a SARMA model to the log-transformed data, with residual diagnostics for each fitted model as discussed above. By carefully investigating the residual plots and the STL (Seasonal and Trend decomposition using Loess) decomposition plot, we concluded that we should reject our null hypothesis. The “seasonality” we observed previously might instead come from the wave-like trend in the data.
Therefore, we fitted a linear regression model with ARMA errors to model the trend, incorporating extra information from the dataset such as temperature (degrees Celsius), precipitation (mm), and wind speed (m/s). However, none of the coefficients of these variables is statistically significant, indicating that the trend may not be explained by them. Moreover, looking carefully at the vertical scale of the STL decomposition plot [11], the trend has a relatively narrow range of values: it stays at a relatively high level for a long time and then drops dramatically. Given this property, a polynomial trend may also not fit the data. Both results imply there is little systematic trend in the data.
In conclusion, there is no clear seasonality or trend in the PM10 concentration data from Wanliu station during this 3-week period of October 2014. Therefore, we conclude that the square-root-transformed data can be fitted by a stationary Gaussian ARMA(6,4) model as follows: \[(1-0.05B-1.39B^2-0.79B^3+0.93B^4+0.46B^5-0.16B^6)(\sqrt{Y_n}-12.3) = (1+1.11B-0.25B^2-1.27B^3-0.58B^4)\epsilon_n\] where \[{\epsilon_n} \sim N(0, 1.171)\].
[1] Health effects of air pollutants https://www.aqhi.gov.hk/en/health-advice/health-effects-of-air-pollutants9b5f.html?start=5
[2] Data resource from: https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data#
[3] 2020 midterm projects 1 https://ionides.github.io/531w20/midterm_project/project1/Midterm-Project.html#trend-noise-and-cycles
[4] 2020 midterm projects 27 https://ionides.github.io/531w20/midterm_project/project27/midterm_proj.html#linear-regression-with-sarima-errors
[5] ARMA model https://ionides.github.io/531w21/05/index.html
[6] SARMA model https://ionides.github.io/531w21/06/index.html
[7] Regression with ARIMA errors in R https://otexts.com/fpp2/regarima.html
[8] STL with multiple seasonal periods https://otexts.com/fpp2/complexseasonality.html
[9] Transformations and adjustments https://otexts.com/fpp2/transformations.html
[10] Interpretating STL decomposition plot https://otexts.com/fpp2/complexseasonality.html
[11] Regression with ARIMA errors in R https://otexts.com/fpp2/regarima.html