## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

Introduction

Besides coffee and technology companies, one of the things that Seattle is most famous for is how frequent it rains. Daily preciptation may fluctuate a lot, and average precipitation of a month may be different based on different seasons. In this project, I will analyze the monthly average precipitation from 1997 to 2017.

Data Summary and Exploratory Data Analysis

In the original dataset, it contains daily precipitation data of seattle from 1948 to 2017 with other features, such as maximum and minimum temperature of the day. Since I’m only interested in the more general precipitation trend in recent years, I decided to use only the data from 1997 to 2017. Plus, I took the average of the daily precipitation of a month, and used the transformed data in the following analysis.

As can be seen from the summary below, the minimum average monthly precipitation is 0, and the maximum is precipitation.

Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0.0423848 0.0910357 0.1095686 0.1545968 0.521

Based on the plot of average monthly preciptation against time, there is no apparent trend or seasoning. Both the mean and variance of average monthly preciptation seems stationary from 1997 to 2017.

The decomposition of the data in trends, season, noise and cycles. As can be seen from the plots below, the trend doesn’t fluctuate, and the noise seems mean and variance stationary despite a peak near 2007.

Fitting ARMA(p,q) model

Now we can fit a stationary Gaussian ARMA(p,q) model \(\phi(B)\left(Y_{n}-\mu\right)=\psi(B) \epsilon_{n}\) with parameters \(\theta=\left(\phi_{1: p}, \psi_{1: q}, \mu, \sigma^{2}\right)\)

\[ \begin{aligned} \mu &=\mathbb{E}\left[Y_{n}\right] \\ \phi(x) &=1-\phi_{1} x-\cdots-\phi_{p} x^{p} \\ \psi(x) &=1+\psi_{1} x+\cdots+\psi_{q} x^{q} \\ \epsilon_{n} & \sim N\left[0, \sigma^{2}\right] \end{aligned} \]

Specifically, \(\phi(x)=1-\phi_{1} x-\cdots-\phi_{p} x^{p}\) represents autoregressive model, and \(\psi(x)=1+\psi_{1} x+\cdots+\psi_{q} x^{q}\) represents moving average model.

ARMA(p,q) model

MA0 MA1 MA2 MA3 MA4 MA5
AR0 -504.22 -548.75 -556.23 -555.68 -553.76 -564.13
AR1 -556.82 -555.14 -554.97 -553.70 -569.22 -570.89
AR2 -555.35 -578.75 -630.78 -633.36 -632.16 -631.73
AR3 -559.64 -587.23 -632.69 -627.86 -591.49 -630.70
AR4 -567.97 -591.21 -632.93 -630.45 -624.15 -628.20

Based on the AIC (Akaike’s information criterion), an approach to compare different models based on the likelihoods with penalty based on complexity of models, we have three best models with the lowest AIC, ARMA(3,2), ARMA(2,2),ARMA(2,3). Since ARMA(2,3) is rather simpler and lower in terms of AIC, I chose it to fit the precipitation data.

Diagnostics

After fitting the data using ARMA(2,3) model, we need to check the causality and the Invertibility of the models by checking the roots of the AR polynomial and MA polynomial. The first root of the AR polynomial is inside the unit circle, so it’s not causal. For the invertibility, since all the roots are outside the unit circle, the model is therefore invertible. So it’s reliable to use the ARMA(2,3) model to fit the data.

# check causality
abs(polyroot(c(1,arma23$coef[1],arma23$coef[2])))
## [1] 0.456927 2.189294
#check invertibility
abs(polyroot(c(1,arma23$coef[3], arma23$coef[4],arma23$coef[5])))
## [1] 1.000007 1.000007 6.668350

Based on the plot, the mean of residuals is around 0, and the variance seems fairly constant. Thus, the model meets the requirement of white noise.

Based on the plot, there is no autocorrelation between different residuals outside the 95% confidence interval, and, according to the normal QQ plot, the data doesn’t seem normal. In fact, the distribution of residuals has heavy tail respective to the normal distribution.

Spectrum Analysis

Even though we assumed there is no seasonality in the data, we can still use the spectrum analysis to find the periods of the data. Using both the unparametric and parametric spectrum method, the dominant frequency seems to be around 0.0825, and the period is around 12 months.

Smoothed Unsmoothed
Frequency 0.08203125 0.08316633
period 12.1904761905 12.0240967709

Fitting SARMA(p,q)x(P,Q) model

In order to check whether there’s an actual seasonality in the data, I decided to use the seasonal ARMA model (SARMA). I used the period as 9 based on the ACF plot of ARMA(2,3) model.

The AIC values of SARMA(2,3)×(1,0) and SARMA(2,3)×(0,1) are -635.72 and -635.27. The SARMA(2,3)×(1,0) has lower AIC, so it’s a better model. In fact, based on the AIC, SARMA(2,3)×(1,0) improves by having lower AIC compared to ARMA(2,3) model.

SARMA(2,3)×(1,0) SARMA(2,3)×(0,1)
AIC -635.72 -635.27

Diagnostics

Based on the plot, there is no autocorrelation between different residuals outside the 95% confidence interval, and, according to the normal QQ plot, the data still doesn’t seem normal. In fact, the distribution of residuals still has heavy tail respective to the normal distribution.

As can be seen from the plot below, the SARMA(2,3)×(1,0) model fits well to the data, although it doesn’t necessarily capture the peaks.

Conclusion

In this report, I seek to use the time series models to fit the monthly average precipitation in Seattle from 1997 to 2017. To summarize, the model ARMA(2,3) is the best model if seasonality is not considered. Although there is only weak evidence of seasonality, when fitting the model SARMA(2,3)×(1,0), the fit does improve in terms of AIC, and based on the ACF plot, all the residuals are within 95% confidence interval.

Reference

  1. Code and Invertibility concept from Ionides, Edward ‘Analysis of Time Series’ from https://ionides.github.io/531w21/

  2. Data from: https://www.kaggle.com/rtatman/did-it-rain-in-seattle-19482017