1 Introduction

Nowadays, Boston is a popular choice for people considering a move to a new place to live, so understanding the wind speed pattern of Boston is useful for them. In this project, we analyze the average wind speed data for Boston over the past 20 years. The main techniques we use come from time series analysis (ARMA models, spectral density, etc.). We hope to find out whether there is a trend, seasonality, or other interesting features in the average wind speed in Boston. The analysis aims to give people a reference when they consider moving to Boston.

2 Data Summary

The data is retrieved from NOAA (National Centers for Environmental Information) [1]. The dataset contains the daily average wind speed in Boston, MA for the past 20 years (2000-2019). To build the series we analyze, we keep the average wind speed on the 5th and 25th day of each month (the dates shown in the output below), which gives a new dataset of length 480.
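The sketch below shows one way this subset could be constructed in R. The column names DATE and AWND come from the printed output below; the file name of the downloaded NOAA CSV is an assumption.

```r
# Read the daily GHCND data downloaded from NOAA (file name is an assumption)
raw <- read.csv("boston_daily_2000_2019.csv", stringsAsFactors = FALSE)
raw$DATE <- as.Date(raw$DATE)

# Keep the average wind speed (AWND) on the 5th and 25th of each month
keep <- as.integer(format(raw$DATE, "%d")) %in% c(5, 25)
wind <- raw[keep, c("DATE", "AWND")]

head(wind)    # first few rows, as printed below
nrow(wind)    # 2 days x 12 months x 20 years = 480
```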

\(~\)

The output below shows the first few observations and the summary statistics of the average wind speed. The mean is about 11 mph and is very close to the median.

##          DATE  AWND
## 5  2000-01-05 16.11
## 25 2000-01-25 19.01
## 36 2000-02-05 12.30
## 56 2000-02-25 14.99
## 65 2000-03-05 12.30
## 85 2000-03-25  8.95
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.58    8.28   10.51   10.90   12.80   30.20

\(~\)

The plot below is a line plot of the average wind speed. It gives a clear view of what the data look like: the wind speed varies between roughly 5 mph and 30 mph.

We model the trend using the formula \(y = \beta_{0} + \beta_{1}x + \beta_{2}x^2\). The red line on the plot shows the fitted trend. This line is nearly flat, which indicates that there is no clear trend in the Boston wind speed, so the data look mean stationary. However, over relatively small intervals the data do not look particularly variance stationary.
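A minimal sketch of this quadratic trend fit, assuming the subset is stored in a data frame called wind as in the sketch above:

```r
# Fit a quadratic trend in the observation index and overlay it on the series
n <- nrow(wind)
t <- seq_len(n)
trend_fit <- lm(wind$AWND ~ t + I(t^2))

plot(t, wind$AWND, type = "l",
     xlab = "Observation index", ylab = "Average wind speed (mph)")
lines(t, fitted(trend_fit), col = "red", lwd = 2)
summary(trend_fit)   # small, insignificant slope and curvature indicate a flat trend
```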

\(~\)

To make the data more stationary, we apply a log transformation followed by a first difference. The modified data are shown below, and they look more stationary.
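A sketch of the transformation, assuming the result is the series called p in the arima() output later in this report:

```r
# Log-transform to stabilize the variance, then take a first difference
p <- diff(log(wind$AWND))

plot(p, type = "l",
     xlab = "Observation index", ylab = "Differenced log wind speed")
```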

\(~\)

The plot below is the ACF of the data, which we use to examine the autocorrelation between observations. The autocorrelation at lag 1 is fairly large and negative. This suggests that an AR(2) model may be a reasonable choice.
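The ACF can be computed directly from the transformed series:

```r
# Sample autocorrelation function of the differenced log series
acf(p, main = "ACF of the differenced log wind speed")
```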

3 Spectral Density

We can use the spectral density to look for seasonality in the dataset. If we find a significant peak, we can use the formula \(period = \frac{1}{frequency}\) to calculate the corresponding period.

The plot below is the unsmoothed spectral density (periodogram). The plot is noisy, which suggests that a smoothing technique is needed.
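A minimal sketch of the unsmoothed estimate:

```r
# Unsmoothed periodogram of the differenced log series
spectrum(p, main = "Unsmoothed periodogram")
```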

\(~\)

We smooth the density three times and the result is shown below. If we move the blue crossbar along the density curve, none of the peaks appears significant. This suggests that the time series does not contain a clear seasonal pattern.
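One way to obtain a smoothed estimate and read off the period of the largest peak; the span values below are an assumption, not necessarily the ones used for the plot:

```r
# Periodogram smoothed by repeated modified Daniell smoothers
smoothed <- spectrum(p, spans = c(3, 5, 3), main = "Smoothed periodogram")

# Frequency of the largest estimated peak and its period (period = 1 / frequency)
peak_freq <- smoothed$freq[which.max(smoothed$spec)]
c(frequency = peak_freq, period = 1 / peak_freq)
```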

4 ARMA Model

Since we have found no clear seasonal pattern, we choose to model the data with an ARMA (autoregressive moving average) model.

For an ARMA(p,q) model with parameter vector \(\theta=\left(\phi_{1: p}, \psi_{1: q}, \mu, \sigma^{2}\right)\), we have:

\[\phi(B) \left(Y_{n}-\mu\right)=\psi(B) \epsilon_{n}\]

where,

\[\mu=\mathbb{E}\left[Y_{n}\right]\]

\[\psi(x)=1+\psi_{1} x+\psi_{2} x^{2}+\cdots+\psi_{q} x^{q}\]

\[\phi(x)=1-\phi_{1} x-\phi_{2} x^{2}-\cdots-\phi_{p} x^{p}\]

\[\epsilon_{n} \sim \operatorname{iid} N\left[0, \sigma^{2}\right]\]

and \(B\) is the backward shift operator.

4.1 Model Fitting and Selection

We choose the ARMA model using the AIC. We try different combinations of p and q and pick the model with the lowest AIC. The formula for the AIC is \(AIC=-2 \times \ell(\theta)+2D\), where \(D\) is the number of parameters, which penalizes overly complex models, and \(\ell(\theta)\) is the maximized log likelihood. The AIC helps us compare models by balancing goodness of fit against model size.
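A sketch of how such an AIC table could be computed, assuming the differenced log series is stored in p:

```r
# AIC of ARMA(p, q) models fitted over a grid of p and q values
aic_table <- function(data, P, Q) {
  table <- matrix(NA, P + 1, Q + 1)
  for (i in 0:P) {
    for (j in 0:Q) {
      table[i + 1, j + 1] <- arima(data, order = c(i, 0, j))$aic
    }
  }
  dimnames(table) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  table
}

round(aic_table(p, 4, 4), 2)
```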

\(~\)

The table below reports the AIC for different values of p and q. The ARMA(2,1) model gives the lowest AIC value, so we choose the ARMA(2,1) model.

|     | MA0    | MA1    | MA2    | MA3    | MA4    |
|-----|--------|--------|--------|--------|--------|
| AR0 | 647.63 | 353.61 | 354.70 | 352.37 | 353.90 |
| AR1 | 496.20 | 354.53 | 356.75 | 355.47 | 355.94 |
| AR2 | 455.88 | 352.11 | 353.94 | 355.91 | 357.57 |
| AR3 | 435.23 | 353.91 | 355.62 | 356.58 | 358.87 |
| AR4 | 427.58 | 355.88 | 357.93 | 358.15 | 359.37 |

\(~\)

The output below is the summary of the ARMA(2,1) model that we chose.

## 
## Call:
## arima(x = p, order = c(2, 0, 1))
## 
## Coefficients:
##          ar1     ar2      ma1  intercept
##       0.0490  0.0990  -0.9917     -2e-04
## s.e.  0.0476  0.0475   0.0154      2e-04
## 
## sigma^2 estimated as 0.1186:  log likelihood = -171.05,  aic = 352.11

From the output above, the fitted model is

\[(1-0.0490B-0.0990B^2)\left(Y_{n}+0.0002\right)=(1-0.9917B)\epsilon_{n},\]

with estimated \(\sigma^2 = 0.1186\).

\(~\)

Then, we check the roots:

  • AR roots:
## [1] 2.940365 3.434850
  • MA roots:
## [1] 1.008399
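These moduli could be computed with polyroot(), assuming the fitted model is stored in m (the object name used in the diagnostics below):

```r
# Refit the chosen ARMA(2,1) model
m <- arima(p, order = c(2, 0, 1))

# Moduli of the AR roots: solutions of 1 - phi1*x - phi2*x^2 = 0
abs(polyroot(c(1, -coef(m)[c("ar1", "ar2")])))

# Modulus of the MA root: solution of 1 + psi1*x = 0
abs(polyroot(c(1, coef(m)["ma1"])))
```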

All roots lie outside the unit circle, which indicates that the fitted ARMA model is stationary and causal. The MA root is close to 1, which indicates a potential risk of non-invertibility. The MA root is not close to the AR roots, so there should not be a big problem with parameter redundancy.

4.2 Model Diagnostics

It is necessary to check whether the model fits well. We do this by checking the assumption that the residuals are iid normal.

In the QQ plot below, most of the points follow the straight line reasonably well, so the residual distribution is close to normal and there is no obvious problem with heavy tails.
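A minimal sketch of the QQ plot:

```r
# Normal QQ plot of the ARMA(2,1) residuals
qqnorm(m$residuals, main = "QQ plot of ARMA(2,1) residuals")
qqline(m$residuals, col = "red")
```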

\(~\)

We can also use the Shapiro-Wilk test to check normality. The null hypothesis of this test is that the distribution is normal. From the results below, the p-value is not smaller than 0.05, so we cannot reject the normality assumption.

## 
##  Shapiro-Wilk normality test
## 
## data:  m$residuals
## W = 0.9957, p-value = 0.2145

\(~\)

From the ACF plot of the residuals, the observations appear to be uncorrelated: the autocorrelations at all small lags lie within the blue dashed lines.
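The residual ACF can be produced in the same way as before:

```r
# Sample autocorrelation of the residuals; lags inside the dashed band
# are consistent with uncorrelated (white-noise) residuals
acf(m$residuals, main = "ACF of ARMA(2,1) residuals")
```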

Thus, based on the checks above, the model appears to fit the data well.

5 Conclusion

Based on the analysis above, the average wind speed in Boston over the past 20 years shows no clear trend and no clear seasonal pattern. After a log transformation and first differencing, an ARMA(2,1) model chosen by AIC fits the data well: the residuals look approximately iid normal and uncorrelated. The MA root close to the unit circle suggests some caution about invertibility.

6 Reference

[1] NOAA (National Centers for Environmental Information), daily weather data for Boston, MA: https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00014739/detail

[2] Course website of STATS 531, Winter 2020 (instructor: Edward L. Ionides): https://ionides.github.io/531w20/

[3] R. Shumway and D. Stoffer, "Time Series Analysis and Its Applications", 4th edition: https://www.stat.pitt.edu/stoffer/tsa4/tsa4.pdf

[4] Previous homework that I submitted on Canvas: https://umich.instructure.com/courses/200