1.Introduction

In the area of aerodynamics, Particulate Matter 2.5 (PM 2.5) means the particulate matter whose equivalent diameter is equal to or less than 2.5 micrometer. Many times, it is used to measure the air quality. PM 2.5 can suspend in the air for a long time, and the higher content concentration of PM 2.5, the even worse of air pollution. Meanwhile, PM 2.5 would significantly influence visibility and carry hazardous substance. With the development of industry, high concentration of PM 2.5 become a serious problem in some countries and affect people’s health, like China. Beijing, the capital of China, is suffered by high PM 2.5 concentration for a long time. Also, the PM 2.5 concentration does not have a obvious periodical change when we merely see the data superficially, which lets the government hard to formulate effective policies. In this project, I would find a time series model to fit the dataset of PM 2.5 concentration collected near Olympic Sports Center in Beijing. If the model fitted the data well, it could find the pattern of the trend of PM 2.5 concentration and be used to predict the future PM 2.5 concentration, which could help the Chinese Government to formulate the relative policy to control the PM 2.5 concentration.

2.Exploratory Data Analysis

The dataset I use in this project is downloaded from the UCI machine learning repository. The original dataset has 35065 instances which are collected from 2013 to 2017 near Olympic Sports Center in Beijing. Each instance has several features, like the time to record the information, PM 2.5 concentration, PM 10 concentration, etc. Since I am only interested in the relationship between time and PM 2.5 concentration, I extract features year, month, day and PM 2.5 for each instance. Also, to decrease the size of dataset, I use the data collected at 8 pm from 2014 to 2016. The reason I select 8 pm is after observing the dataset, I find the PM 2.5 concnetration is relatively high at night. Right now, there are 1096 instances and I fill all the missing value by its previous value.

## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
##   pm25.year pm25.month pm25.day pm25.PM2.5
## 1      2014          1        1         61
## 2      2014          1        2        160
## 3      2014          1        3         36
## 4      2014          1        4        179
## 5      2014          1        5         47
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   25.75   57.00   84.39  111.00  598.00

The first table shows the first 5 instances in the dataset. The summary shows the general information about the instances. From the summary and histogram, we could easily find that the data of PM 2.5 concentration is significantly skewed. Also, based on the “year~PM 2.5 concentration” graph above, it shows there is no mean stationary and convariance stationary for the data. Furthermore, the dataset size is still too large, which might influence my future work. So, I decide to continue decreasing the size of dataset by picking one value for each week, and do log transformation and difference operation on each instance. For the difference operation, it simply means to get the difference between two adjacent data. Thus, we can transform the data \(y_{1:N}\) to \(Z_{2:N}\), where

\(z_{n}=\Delta y_{n}=y_{n}-y_{n-1}\)

Here, \(y_{n}\) is the log transformed PM 2.5 concentration.

## 
##  Shapiro-Wilk normality test
## 
## data:  pm25_week
## W = 0.99509, p-value = 0.8838

After the log transformation and difference operation, the plot shows the variance of the data is relatively stable and the mean is nearly constant. Thus, we can say the data is almost weak stationary. Furthermore, I use the histogram and Shapiro Wilk test to confirm the normal distribution of the dataset. By now, we can use ARMA model to fit the data.

## [1] "The dominant frequency corresponding to the highest density in the above periodgrams are 0.435871743486974 and 0.41875"

Before applying the ARMA model, I want to do some data analysis using the frequenyc domain. In the graph above, we see that when frequency = 0.41875 or 0.43587, it has the highest density. But we cannot say the data has a periodical change in around 2.5 weeks since there are several peaks and different frequency might confuse with each other.

3.Model Selection

3.1 General ARMA Model

First, since I have already confirmed the weak stationary of the dataset above, I will apply the general ARMA model to fit the data to see the effect. The formula of the general stationary ARMA model is

\(\phi(B)\left(Y_{n}-\mu\right)=\psi(B) \epsilon_{n}\)

where \(\epsilon_{n}\) is a white noise process. It has mean 0 and covariance \(\sigma^{2}\). \(Y_{n}\) is the random variable. In this problem, it is the variable of PM 2.5 concentration on 12 pm. \(B\) is the backshift operator, which is given by \(BY_{n} = Y_{n-1}\). Furthermore, \(\mu\) is the constant mean. Last,

\(\phi(x)=1-\phi_{1} x-\cdots-\phi_{p} x^{p}\), which represents the autoregressive model.

\(\psi(x)=1+\psi_{1} x+\cdots+\psi_{q} x^{q}\), which represents the moving average model.

Then I should choose value p and q for ARMA(p,q). Akaike’s information criterion (AIC) is a general approach to pick p and q. It is a method to compare likelihoods of different models by penalizing the likelihood of each model by a measure of its complexit. And the formula is \(A I C=-2 \times \ell(\theta)+2 D\), which means “Minus twice the maximized log likelihood plus twice the number of parameters”. And I will select the model based on the low AIC score and simple model (comparatively small p and q).

MA0 MA1 MA2 MA3 MA4
AR0 744.63 588.26 492.85 494.75 495.41
AR1 664.57 552.91 494.77 496.53 498.52
AR2 632.44 538.41 495.53 496.67 496.06
AR3 599.05 521.32 496.02 497.93 495.00
AR4 578.31 517.42 497.82 500.01 496.31
AR5 574.26 516.59 499.59 501.75 496.72

From the AIC table above, when (p,q) = (0,2), the model has the lowest AIC value. Also, MA(2) is a very simple model. So, due to the model simplicity and low AIC value, we apply the MA(2) model here to fit the data.

## 
## Call:
## arima(x = pm25_week, order = c(0, 0, 2))
## 
## Coefficients:
##           ma1     ma2  intercept
##       -1.0160  0.0160    -0.0011
## s.e.   0.0917  0.0874     0.0019
## 
## sigma^2 estimated as 1.206:  log likelihood = -238.53,  aic = 485.05
## [1]  1.0 62.5

After generating the model under the circumstance when (p,q) = (0,2), we could check the invertibility by calculating the MA polynomial. The results is that all the roots are on or outside the unit circle in the complex plane, which might let us believe the model is invertible

3.1.1 Diagnostic Analysis

Based on the left plot above, it shows there is no substantial autocorrelation for the residuals. But when lag=32, the corresponding ACF is a little bit out of the confidence interval. Also, when checking the Q-Q plot, we find that there is a little bit heavier tail, which means the model is generally acceptable but not perfectly fit the data.

3.2 Seasonal ARMA model

Due to the imperfect of the general ARMA model, I want to try the seasonal ARMA model since the data I use is collected in three years for each week. So, I conjecture seasonality might be a factor to influence the data, and the ACF plot shows above support my conjecture. However, since the periodical change of the PM 2.5 is not highly related to month, week or year, I set the period be 32 based on the above ACF. The \(\operatorname{SARMA}(p, q) \times(P, Q)_{32}\) model for the data is

\(\phi(B) \Phi\left(B^{32}\right)\left(Y_{n}-\mu\right)=\psi(B) \Psi\left(B^{32}\right) \epsilon_{n}\) where

\(\Phi(x)=1-\Phi_{1} x-\cdots-\Phi_{P} x^{P}\) represents the seasonal AR model

\(\Psi(x)=1+\Psi_{1} x+\cdots+\Psi_{Q} x^{Q}\) represents the seasonal MA model

## 
## Call:
## arima(x = pm25_week, order = c(0, 0, 2), seasonal = list(order = c(1, 0, 0), 
##     period = 32))
## 
## Coefficients:
##           ma1     ma2     sar1  intercept
##       -1.0235  0.0235  -0.1969    -0.0008
## s.e.   0.0914  0.0862   0.0845     0.0017
## 
## sigma^2 estimated as 1.155:  log likelihood = -235.91,  aic = 481.82

3.2.1 Diagnostic Analysis

## 
##  Shapiro-Wilk normality test
## 
## data:  pm25_sarma$residuals
## W = 0.988, p-value = 0.2018

Based on the plot above, it shows there is no substantial autocorrelation for the residuals. Also, after using the Shapiro-wilk test and drawing the Q-Q plot, I find that residuals are almost normally distributed.

3.2.2 Likelihood Ratio Test

To further confirm the necessity to keep the seasonal term in the model, I use the likelihood ratio test. The parameter space of the model is \(\Theta \subset \mathbb{R}^{D}\). And we have two nested hypotheses here,

\(H^{\langle 0\rangle}: \theta \in \Theta^{\langle 0\rangle}\)

\(H^{\langle 1\rangle}: \theta \in \Theta^{\langle 1\rangle}\)

Here \(H^{\langle 0\rangle}\) corresponds to the ARMA(0,2) model and \(H^{\langle 1\rangle}\) corresponds to the SARMA\((0, 2) \times(1, 0)_{32}\) model. And two nested parameter subspances do have the relationship \(\Theta^{\langle 0\rangle} \subset \Theta^{\langle 1\rangle}\). The null hypothesis of the

## [1] 2.614412

The log likelihood of the ARMA(0,2) model is -238.53, and the log likelihood of the SARMA\((0, 2) \times(1, 0)_{32}\) model is -235.91. The difference of log likelihhod is 2.614, which is larger than the 1.92 cutoff for a test at 5% size. So, we should keep the seasonal term. Thus, we could safely to say SARMA\((0, 2) \times(1, 0)_{32}\) fits the data well.

4.Forecast

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Finally, I use the seasonal ARMA model to do the prediction of the following 7 time series data. But when comparing to the actual data, I find the data generated from the trivial prediction is not quite close to the ground truth, even though the 95% confidence interval could include these actual data. I think there are several reasons to cause such an inaccuracy: first, this time series data is unpredicted and does not have a regular pattern since PM 2.5 concentration is highly related to the human activity; second, the ARMA model will only do the prediction by looking backward, which will let the line of prediction become a straight line finally.

5.Conclusion

All in all, after doing the log transformation and difference operation, I get the dataset that can be used to apply ARMA model. After applying the MA(2) model, we find that the residuals generated from the model have some autocorrelation and not quitely noraml distributed. Then, we add the seasonal term to the model based on the ACF plot, and find that SARMA\((0, 2) \times(1, 0)_{32}\) could fit the data well. 32 is an unusual choice, which means the dataset of PM 2.5 concentration does not have a regular pattern, like monthly, quarterly. In my mind, PM 2.5 concentration is highly related with human activity and human activity is always unpredicted. Also, through the forecast, we realize that even though the seasonal ARMA model fits the data well, it could not perform well in doing prediction when merely using the “forecast” package. So, maybe we need use more advanced techniques and add the factors of human activity to do the prediction in the future. By now, we have captured the pattern of the PM 2.5 concentration dataset in the past, and I believe it could be suggestive to the Chinese Government to make some environmential policies.

Reference

download the dataset from https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data

refer the slides on https://ionides.github.io/531w20/

refer the webiste to know the basic knowledge of PM 2.5 https://baike.baidu.com/item/细颗粒物/804913?fromtitle=PM%202.5&fromid=847195&fr=aladdin

refer the documentation on the website to learn how to use forecast function https://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html

refer the documentation to realize the disadvantage of using forecast in arima model https://libres.uncg.edu/ir/uncw/f/zhai2005-2.pdf