Introduction

What is Particulate Matter? What is PM2.5?

“Particulate matter (PM) is a term used to describe the mixture of solid particles and liquid droplets in the air. It can be either human-made or naturally occurring. Some examples include dust, ash and sea-spray. Particulate matter (including soot) is emitted during the combustion of solid and liquid fuels, such as for power generation, domestic heating and in vehicle engines. Particulate matter varies in size (i.e. the diameter or width of the particle). PM2.5 means the mass per cubic metre of air of particles with a size (diameter) generally less than 2.5 micrometres (µm). PM2.5 is also known as fine particulate matter.” (UK Department for Environment, Food and Rural Affairs)

Health Effects of PM2.5

“The biggest impact of particulate air pollution on public health is understood to be from long-term exposure to PM2.5, which increases the age-specific mortality risk, particularly from cardiovascular causes. Several plausible mechanisms for this effect on mortality have been proposed, although it is not yet clear which is the most important. Exposure to high concentrations of PM (e.g. during short-term pollution episodes) can also exacerbate lung and heart conditions, significantly affecting quality of life, and increase deaths and hospital admissions. Children, the elderly and those with predisposed respiratory and cardiovascular disease, are known to be more susceptible to the health impacts from air pollution” (WHO)

Sources of PM2.5

PM2.5 pollution caused by human activity is known to be more important than natural sources, which contribute only a small part of the total concentration. In some rural areas, industrial emissions can also be a key source of PM2.5, as can the use of non-smokeless fuels and other domestic sources of smoke such as bonfires. In addition to direct emission of PM2.5 particles, fine particulate matter can also form through chemical reactions of sulphur dioxide and nitrogen oxides (toxic gases).

Dataset Description

In this project, I use the hourly time series of PM2.5 measurements at the US Embassy in Beijing as the analysis object. The dataset was collected by Song Xi Chen of the Guanghua School of Management, Peking University. Besides the hourly PM2.5 readings at the US Embassy, meteorological data from Beijing Capital International Airport are also included. The data cover the period from Jan 1st, 2010 to Dec 31st, 2014, with missing values denoted as NA. The dataset contains 13 air-quality-related attributes, but in this project I focus only on the PM2.5 values. There are 43824 instances, and all PM2.5 values are integers.

Project Goal

The goal of this project is to analyze the properties of this PM2.5 time series and find a reasonable time series model to fit the data, which could help people better understand how PM2.5 evolves over time. In this report, I first explore the data, then use two different methods to fit it, performing residual analysis on the results of each. All findings regarding the data are collected in the conclusion.

Analysis of Data

Data Inspection

In this section, I give a brief look at what the data are like; the data preprocessing is also done here. First, load the data and plot the series in the time domain:

library(imputeTS)   # for na_ma() imputation of missing values
library(tseries)    # for adf.test() used below

tsdata <- read.csv("beijing_pm25.csv", header = TRUE)
ts_pm25 <- tsdata$pm25[1:8760]   # first year (2010) of hourly data
No <- tsdata$No[1:8760]          # hourly observation index
# impute NAs with an exponentially weighted moving average, window k = 4
ts_pm25 <- na_ma(ts_pm25, k = 4, weighting = "exponential")
plot(ts_pm25, type = "l", xlab = "Hours", ylab = "PM2.5",
     main = "Hourly PM2.5 data in Beijing from Jan 1st, 2010 to Dec 31st, 2010")

To get a better understanding of the data, I decompose the time series by frequency band. The first panel is the original time series. The low-frequency component (ts_low) is the trend, the high-frequency component (ts_hi) is the noise, and the remaining mid-frequency component (ts_cycles) contains the main cycles. From the decomposition plot, it is hard to identify any obvious cycles; the decomposition only gives a nonparametric estimate of the trend, so further analysis is needed.
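The decomposition code is not echoed above; the following is a minimal sketch of such a band-pass decomposition using loess smoothers (the span values are illustrative assumptions, not necessarily the ones behind the plot).

ts_low <- loess(ts_pm25 ~ No, span = 0.5)$fitted           # low frequency: trend
ts_hi <- ts_pm25 - loess(ts_pm25 ~ No, span = 0.1)$fitted  # high frequency: noise
ts_cycles <- ts_pm25 - ts_low - ts_hi                      # mid frequency: cycles
dec <- ts.union(ts(ts_pm25), ts(ts_low), ts(ts_hi), ts(ts_cycles))
colnames(dec) <- c("raw", "trend", "noise", "cycles")
plot(dec, main = "Decomposition of PM2.5 as trend + noise + cycles")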

Stationarity Test

Second, I test the stationarity of this time series using the Augmented Dickey-Fuller (ADF) test, which tests the null hypothesis that a unit root is present in a time series sample. The ADF statistic used in the test is a negative number; the more negative it is, the stronger the rejection of the hypothesis that there is a unit root, at some level of confidence. (https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test)

adf.test(ts_pm25)
## Warning in adf.test(ts_pm25): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_pm25
## Dickey-Fuller = -10.365, Lag order = 20, p-value = 0.01
## alternative hypothesis: stationary

The Augmented Dickey-Fuller test rejects the null hypothesis, so the time series can be treated as stationary. However, beyond the evidence of the ADF test, all values of the series are positive and the distribution is strongly right-skewed, which sits awkwardly with a Gaussian stationary model. A natural first step is therefore to analyze the series on the log scale. To avoid taking the log of zero, 1 is added to the original series before transforming. The following shows the series on the log scale.

ts_log <- log(ts_pm25 + 1)   # natural logarithm
plot(ts_log, type = "l", xlab = "Hours", ylab = "log(PM2.5 + 1)",
     main = "Hourly log PM2.5 data in Beijing from Jan 1st, 2010 to Dec 31st, 2010")

Even after transforming the data to the log scale, the ADF test still rejects the null hypothesis: the transformed series is also stationary.

adf.test(ts_log)
## Warning in adf.test(ts_log): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_log
## Dickey-Fuller = -10.89, Lag order = 20, p-value = 0.01
## alternative hypothesis: stationary

SARMA error model

Detrend

To fit a SARMA error model, the first step is to extract the trend and determine the seasonality before fitting the ARMA part.

To decide the trend of the model, I fit polynomials of several different orders to the time series.

fitted = lm(ts_log~No)
summary(fitted)
## 
## Call:
## lm(formula = ts_log ~ No)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5567 -0.6355  0.1280  0.7333  2.6566 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.228e+00  2.067e-02 204.525   <2e-16 ***
## No          3.399e-06  4.087e-06   0.831    0.406    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9674 on 8758 degrees of freedom
## Multiple R-squared:  7.894e-05,  Adjusted R-squared:  -3.524e-05 
## F-statistic: 0.6914 on 1 and 8758 DF,  p-value: 0.4057
fitted = lm(ts_log~poly(No,2,raw=TRUE))
summary(fitted)
## 
## Call:
## lm(formula = ts_log ~ poly(No, 2, raw = TRUE))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6132 -0.6323  0.0977  0.7009  2.7558 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.955e+00  3.077e-02  128.54   <2e-16 ***
## poly(No, 2, raw = TRUE)1  1.905e-04  1.622e-05   11.74   <2e-16 ***
## poly(No, 2, raw = TRUE)2 -2.136e-08  1.793e-09  -11.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9597 on 8757 degrees of freedom
## Multiple R-squared:  0.01602,    Adjusted R-squared:  0.0158 
## F-statistic:  71.3 on 2 and 8757 DF,  p-value: < 2.2e-16
fitted = lm(ts_log~poly(No,3,raw=TRUE))
summary(fitted)
## 
## Call:
## lm(formula = ts_log ~ poly(No, 3, raw = TRUE))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6637 -0.6305  0.1046  0.7041  2.7611 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               4.068e+00  4.099e-02  99.234  < 2e-16 ***
## poly(No, 3, raw = TRUE)1  3.576e-05  4.052e-05   0.883    0.377    
## poly(No, 3, raw = TRUE)2  2.279e-08  1.075e-08   2.121    0.034 *  
## poly(No, 3, raw = TRUE)3 -3.359e-12  8.064e-13  -4.166 3.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9588 on 8756 degrees of freedom
## Multiple R-squared:  0.01797,    Adjusted R-squared:  0.01763 
## F-statistic: 53.41 on 3 and 8756 DF,  p-value: < 2.2e-16

Although the degree 2 and degree 3 terms are statistically significant, all three fits explain almost none of the variance (R-squared at most 0.018), so the polynomial trend is very weak and there is no strong evidence that the second or third order polynomial performs substantially better. A degree 1 linear regression is therefore enough for the trend. The following shows the detrending process. This decision is also consistent with the series being stationary.

fitted = lm(ts_log~No)
ts_log_without_trend <-  fitted$resid
plot(ts_log, type = "l", xlab = "Hours", ylab = "log(PM2.5 + 1)",
     main = "Hourly log PM2.5 data in Beijing in 2010, with fitted linear trend")
abline(fitted, col = 'red')

plot(ts_log_without_trend,type= "l", xlab = "Hours", ylab ="PM2.5 resid",  main = "Hourly PM2.5 data without trend")

adf.test(ts_log_without_trend)
## Warning in adf.test(ts_log_without_trend): p-value smaller than printed p-
## value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_log_without_trend
## Dickey-Fuller = -10.89, Lag order = 20, p-value = 0.01
## alternative hypothesis: stationary

After removing the overall trend from the Beijing PM2.5 data, the residual series is still stationary.

Seasonality

To find the seasonal pattern in this data, we need to examine the time series in the frequency domain. The following is the unsmoothed periodogram of the PM2.5 series, with the x axis in cycles per hour.

spec_unsmooth = spectrum(ts_log_without_trend, main = "Unsmoothed periodogram", xlab = 'frequency (cycle per hour)')

spec_unsmooth$freq[which.max(spec_unsmooth$spec)]
## [1] 0.001666667

The following is the smoothed periodogram of the PM2.5 series, again with the x axis in cycles per hour.

spec = spectrum(ts_log_without_trend, spans=c(3,5,3), main = "Smoothed periodogram", xlab = 'frequency (cycle per hour)')
#spec$freq[which.max(spec$spec)]
abline(v=c(0.0416, 0.0832),lty="dotted",col="red")

From the frequency domain, we can see two dominant frequencies (local peaks) in this time series: one at about 0.0416 cycles per hour and the other at about 0.0832, which is twice 0.0416. The frequency 0.0416 corresponds to the daily (24 hour) seasonality, while 0.0832 corresponds to a period of about 12 hours, the first harmonic of the daily cycle. Therefore, in the following ARMA model fitting, the primary daily seasonality will be included.

Deciding the Seasonal Model

Besides the frequency domain analysis, the PACF also helps to decide the seasonality parameter (the period). In time series analysis, the partial autocorrelation function (PACF) gives the partial correlation of a stationary time series with its own lagged values, controlling for the values of the time series at all shorter lags. It contrasts with the autocorrelation function, which does not control for other lags [5]. In the PACF figure of the time series, there is clear partial autocorrelation around lags 24 and 48, outside the blue dashed significance bounds. Therefore I choose a SARIMA(p,0,q)\(\times\)(1,0,0) model with period 24.
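The PACF figure referred to above can be produced as follows (a sketch; lag.max = 72, covering three days of lags, is an assumed choice).

pacf(ts_log_without_trend, lag.max = 72,
     main = "PACF of detrended log PM2.5")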

Fitting ARMA model

After extracting the trend and identifying the daily seasonality, we need to fit the detrended series with a proper SARMA model. The following part uses an AIC table to find the most appropriate (p, q) orders.
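The chunk that builds the table is not echoed; a sketch of the standard helper, consistent with the knitr message and the table below, is:

require(knitr)
aic_table <- function(data, P, Q){
  table <- matrix(NA, (P + 1), (Q + 1))
  for (p in 0:P) {
    for (q in 0:Q) {
      # SARMA(p,q) errors with a seasonal AR(1) term at period 24
      table[p + 1, q + 1] <- arima(data, order = c(p, 0, q),
        seasonal = list(order = c(1, 0, 0), period = 24))$aic
    }
  }
  dimnames(table) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  table
}
kable(aic_table(ts_log_without_trend, 5, 5), digits = 2)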

## Loading required package: knitr
         MA0      MA1     MA2     MA3     MA4     MA5
AR0 23009.24 14093.04 9197.00 6481.48 4831.98 3790.94
AR1  1047.61   948.18  947.00  941.36  940.97  942.91
AR2   944.70   943.52  952.03  949.38  944.51  944.90
AR3   945.08   948.68  943.26  940.12  945.12  946.94
AR4   939.98   941.64  942.26  938.41  944.51  949.30
AR5   941.43   943.46  945.40  947.45  956.08  936.70

From the AIC table we can see that ARMA(4,0), ARMA(4,1) and ARMA(5,0) achieve low AIC values with relatively few parameters (the nominally smaller values for the largest models, such as ARMA(5,5), are treated with caution, since large SARMA fits are prone to numerical instability). We therefore examine these three parameter settings, as fitted below.
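The three candidate fits can be reproduced roughly as follows (a sketch assuming forecast::Arima(), whose print method matches the "Series:" output shown below; the names fit40, fit41 and fit50 are introduced here).

library(forecast)
fit40 <- Arima(ts_log_without_trend, order = c(4, 0, 0),
               seasonal = list(order = c(1, 0, 0), period = 24))
fit41 <- Arima(ts_log_without_trend, order = c(4, 0, 1),
               seasonal = list(order = c(1, 0, 0), period = 24))
fit50 <- Arima(ts_log_without_trend, order = c(5, 0, 0),
               seasonal = list(order = c(1, 0, 0), period = 24))
fit40; fit41; fit50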

## Series: ts_log_without_trend 
## ARIMA(4,0,0)(1,0,0)[24] with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4    sar1    mean
##       1.0666  -0.0974  0.0168  -0.0285  0.0320  0.0000
## s.e.  0.0107   0.0156  0.0156   0.0107  0.0108  0.0661
## 
## sigma^2 estimated as 0.0651:  log likelihood=-462.99
## AIC=939.98   AICc=940   BIC=989.53
## Series: ts_log_without_trend 
## ARIMA(4,0,1)(1,0,0)[24] with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1    sar1     mean
##       1.1994  -0.2391  0.0298  -0.0271  -0.1329  0.0322  -0.0009
## s.e.  0.2745   0.2930  0.0304   0.0115   0.2745  0.0108   0.0658
## 
## sigma^2 estimated as 0.06511:  log likelihood=-462.82
## AIC=941.64   AICc=941.65   BIC=998.26
## Series: ts_log_without_trend 
## ARIMA(5,0,0)(1,0,0)[24] with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ar5    sar1     mean
##       1.0663  -0.0972  0.0160  -0.0200  -0.0079  0.0323  -0.0001
## s.e.  0.0107   0.0156  0.0157   0.0156   0.0107  0.0108   0.0656
## 
## sigma^2 estimated as 0.06511:  log likelihood=-462.72
## AIC=941.43   AICc=941.45   BIC=998.06

Checking the causality of the models
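The roots reported below can be computed with polyroot(), which takes the coefficients of the AR polynomial \(\phi(x) = 1 - \phi_1 x - \dots - \phi_p x^p\) in increasing order of power (a sketch, assuming the fit objects defined above).

ar_roots <- function(fit, p) polyroot(c(1, -coef(fit)[paste0("ar", 1:p)]))
ar_roots(fit40, 4)   # likewise ar_roots(fit41, 4) and ar_roots(fit50, 5)
# the seasonal AR polynomial 1 - sar1 * x has a single root at 1/sar1
polyroot(c(1, -coef(fit40)["sar1"]))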

AR Root for ARMA(4,0):

## [1]  1.053366+0.000000i -1.577214+3.147864i -1.577214-3.147864i
## [4]  2.690498+0.000000i

AR Root for ARMA(4,1):

## [1]  1.054030-0.000000i -1.184492+3.618964i -1.184492-3.618964i
## [4]  2.415409+0.000000i

AR Root for ARMA(5,0):

## [1]  1.054460+0.000000i -0.534127+3.193158i -0.534127-3.193158i
## [4]  2.348532+0.000000i -4.856943-0.000000i

All AR roots are outside the unit circle, so all three models are causal. However, each model has one root close to the unit circle (modulus about 1.05), meaning the fits are near the boundary of non-causality. Next, the seasonal AR root of each model:

Seasonal AR Root for ARMA(4,0):

## [1] 31.2254+0i

Seasonal AR Root for ARMA(4,1):

## [1] 31.09241+0i

Seasonal AR Root for ARMA(5,0):

## [1] 30.99272+0i

All seasonal AR roots are also well outside the unit circle, so the seasonal AR components are causal too. (Note that invertibility concerns MA roots; of the three candidates only ARMA(4,1) has an MA term, and its MA root, of modulus 1/0.1329 ≈ 7.5, also lies outside the unit circle, so that model is invertible.)

Residual Analysis

The following are the residual and Q-Q plots for each model. The Q-Q plot, or quantile-quantile plot, is a graphical tool for assessing whether a set of data plausibly came from some theoretical distribution; here I use the normal Q-Q plot to check whether the residuals follow a normal distribution. The residuals largely follow the reference line on the Q-Q plots, which suggests they are approximately normally distributed.
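These plots can be generated as follows (a sketch for the SARMA(4,0)\(\times\)(1,0) fit; the same pattern applies to the other two models).

r40 <- residuals(fit40)
plot(r40, type = "l", ylab = "Residuals",
     main = "Residuals of the SARMA(4,0)x(1,0) model")
qqnorm(r40); qqline(r40)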

Residual and Q-Q plot for SARMA(4,0)\(\times\)(1,0) model:

Residual and Q-Q plot for SARMA(4,1)\(\times\)(1,0) model:

Residual and Q-Q plot for SARMA(5,0)\(\times\)(1,0) model:

Another evaluation of the residuals uses the ACF (in full, the autocorrelation function), which gives the autocorrelation of a series with its own lagged values.

ACF plot for SARMA(4,0)\(\times\)(1,0) model:

ACF plot for SARMA(4,1)\(\times\)(1,0) model:

ACF plot for SARMA(5,0)\(\times\)(1,0) model:
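Each ACF plot can be produced in the same way (a sketch, again shown for the first model).

acf(residuals(fit40), lag.max = 48,
    main = "ACF of SARMA(4,0)x(1,0) residuals")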

From the fitting results, we can see that the ar5 parameter of ARMA(5,0) and the ma1 parameter of ARMA(4,1) have zero inside their approximate 95% confidence intervals (estimate ± 1.96 standard errors), while none of the ARMA(4,0) parameter intervals contain zero. This increases our confidence in choosing SARMA(4,0)\(\times\)(1,0). Besides that, the Q-Q plots show that the residuals of all three models approximately follow a Gaussian distribution, and the residual ACF plots cut off essentially immediately, showing no significant autocorrelation at positive lags. Therefore, I decide to use the SARMA(4,0)\(\times\)(1,0) model as the SARMA error.

Simulation

With this model, the time series can be simulated. In the following simulation plot, the red line indicates the series generated by the trend plus SARMA(4,0)\(\times\)(1,0) error model and the blue line indicates the original PM2.5 series.
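A sketch of how such a simulation can be produced, assuming forecast's simulate() method for fitted Arima objects (the seed is an arbitrary choice for reproducibility):

set.seed(531)
sim_err <- simulate(fit40, nsim = length(ts_log))  # simulated SARMA errors
sim_log <- predict(fitted) + as.numeric(sim_err)   # add back the linear trend
plot(as.numeric(ts_log), type = "l", col = "blue",
     xlab = "Hours", ylab = "log(PM2.5 + 1)")
lines(sim_log, col = "red")
legend("topright", legend = c("original", "simulated"),
       col = c("blue", "red"), lty = 1)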

SARIMA model

Different from the trend plus SARMA error model, another way to fit this time series is a SARIMA model, which combines the integrated (differenced) ARMA model with seasonality.

As above, we use an AIC table to find the optimal (p, q) values.
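A sketch, reusing the same helper pattern but with first order ordinary and seasonal differencing (d = 1, D = 1):

sarima_aic_table <- function(data, P, Q){
  table <- matrix(NA, (P + 1), (Q + 1))
  for (p in 0:P) {
    for (q in 0:Q) {
      table[p + 1, q + 1] <- arima(data, order = c(p, 1, q),
        seasonal = list(order = c(1, 1, 0), period = 24))$aic
    }
  }
  dimnames(table) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  table
}
kable(sarima_aic_table(ts_log, 5, 5), digits = 2)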

        MA0     MA1     MA2     MA3     MA4     MA5
AR0 4465.16 4415.81 4417.10 4419.10 4419.41 4419.11
AR1 4417.01 4417.07 4419.11 4421.10 4198.50 4198.58
AR2 4417.17 4419.15 4421.08 4091.97 4199.72 4202.39
AR3 4419.16 4421.15 4088.44 4081.98 4081.89 4084.00
AR4 4419.01 4197.14 4084.79 4207.53 4201.63 4206.16
AR5 4417.98 4198.12 4201.14 4009.06 4006.85 3887.91

Fitting model:

## 
## Call:
## arima(x = ts_log, order = c(4, 1, 4), seasonal = list(order = c(1, 1, 0), period = 24))
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1     ar2     ar3      ar4      ma1      ma2      ma3     ma4
##       0.2340  0.3289  0.7288  -0.3870  -0.1856  -0.3689  -0.7710  0.3255
## s.e.  0.0899     NaN  0.0291   0.0844   0.0931      NaN   0.0401  0.0911
##          sar1
##       -0.4983
## s.e.   0.0093
## 
## sigma^2 estimated as 0.09435:  log likelihood = -2090.44,  aic = 4200.87

Residual Analysis

Similar to the residual analysis section above, the following shows the residual diagnostics for the SARIMA(4,1,4)\(\times\)(1,1,0) model.

Despite the somewhat heavy tails, we can still see that the residuals approximately follow a Gaussian distribution.

Simulation

With this model, the time series can be simulated. In the following simulation plot, the red line indicates the series generated by the SARIMA(4,1,4)\(\times\)(1,1,0) model and the blue line indicates the original PM2.5 series.
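A sketch of the SARIMA simulation, assuming (as before) that a simulate() method is available for the fitted model; fit414 is a name introduced here for the fit shown above.

fit414 <- arima(ts_log, order = c(4, 1, 4),
                seasonal = list(order = c(1, 1, 0), period = 24))
set.seed(531)
sim414 <- simulate(fit414, nsim = length(ts_log))
plot(as.numeric(ts_log), type = "l", col = "blue",
     xlab = "Hours", ylab = "log(PM2.5 + 1)")
lines(as.numeric(sim414), col = "red")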

Conclusion

In this report, I analyze the hourly Beijing PM2.5 time series for 2010. Based on exploratory frequency-domain analysis, model fitting and residual analysis, five main conclusions can be stated:

  1. This time series can be properly fitted on the log scale. It has an approximately linear trend with respect to time; a simple linear regression on time (for the SARMA error model) or a first order difference (for the SARIMA model) is sufficient to detrend it.

  2. Both the trend + SARMA error model and the SARIMA model fit the data closely on the log scale. The residuals are reasonably small for both the SARIMA(4,1,4)\(\times\)(1,1,0) model and the SARMA(4,0)\(\times\)(1,0) error model with trend. However, the residuals only approximately follow a Gaussian distribution, with heavy tails on the Q-Q plots. This may imply that the series requires a more complex model to fit, for example a multiple-seasonality ARIMA model.

  3. The dominant period of this time series is 24 hours, which implies PM2.5 in Beijing has a strong daily pattern.

  4. Both the SARMA error model ( SARMA(4,0)\(\times\)(1,0) ) and the SARIMA model ( SARIMA(4,1,4)\(\times\)(1,1,0) ) fit the time series well and capture the trend. However, judging from the simulation plots and the log likelihoods of the fitted models, the SARMA error model is more accurate for this series: the SARIMA(4,1,4)\(\times\)(1,1,0) model tends to exaggerate local peaks of the time series, where the SARMA error model performs better.

  5. Because of limited computational resources, monthly and yearly seasonality were not considered properly. This could be the main source of the simulation error.

Source

[1] UK Department for Environment, Food and Rural Affairs (Defra), https://laqm.defra.gov.uk/public-health/

[2] World Health Organization (WHO), Air Quality and Health Questions and Answers, https://www.who.int/phe/air_quality_q&a.pdf

[3] Song Xi Chen, Guanghua School of Management, Center for Statistical Science, Peking University, Beijing PM2.5 Data Set, https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data

[4] Ionides, E., Stats 531 (Winter 2020) ‘Analysis of Time Series’, http://ionides.github.io/531w20/

[5] Wikipedia, Partial autocorrelation function, https://en.wikipedia.org/wiki/Partial_autocorrelation_function