1. Introduction

In this project, we study the time series of daily minimum temperatures in Melbourne, Australia. The data set consists of 3650 daily observations covering the years 1981 to 1990. We will fit two models to the data and test their forecasting performance.

2. Explore the Data

First we read in the data.

data=read.csv("daily-minimum-temperatures-in-me.csv",header=F)
colnames(data)=c("date","temp")
# coerce the temperature column to numeric; entries that cannot be parsed become NA
data$temp=as.numeric(as.character(data$temp))
# drop rows with missing temperatures
data=na.omit(data)

Summary statistics of the daily minimum temperature (the data are recorded in degrees Celsius) can be computed with summary() and are as follows.
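summary(data$temp)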

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.30   11.00   11.19   14.00   26.30

In order to test how well our model fits the data, we separate the data set into a training set and a test set. The training set consists of the first 3300 observations and the test set consists of the remaining 350 observations. We build the model on the training set and then evaluate its performance on the test set.

training=data[1:3300,2]
test=data[3301:3650,2]

Then we plot the training data over time.

plot(training,type="l")

From the plot, we see strong periodic behavior, with regularly spaced peaks. The spacing between these peaks is roughly one year (about 365 days), as expected for an annual temperature cycle.

Now we take a look at the spectral density, starting with the raw periodogram.

spectrum(training)

Next we compute a smoothed, nonparametric estimate of the spectrum.

spec=spectrum(training,spans=c(3,5,3), main="Smoothed periodogram")

We now determine the dominant frequency.

spec$freq[which.max(spec$spec)]
## [1] 0.002666667

We see that the dominant frequency is about 0.00267 cycles per day, which corresponds to a period of approximately 375 days. Given the coarse frequency resolution of the periodogram at low frequencies, this is consistent with the expected annual period of about 365 days.
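The implied period in days is the reciprocal of this frequency:

1/spec$freq[which.max(spec$spec)]  # approximately 375 days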

Then we look at the ACF of the training data.

acf(training,lag.max=1000)

The ACF also shows that the training set has a period of about 365 days.

All of this evidence leads us to fit a model with an annual seasonal component.

3.1 Fitting a Sinusoidal Regression with ARMA Errors

Based on the discussion above, a natural choice would be a SARMA \((p,d,q)\times(P,D,Q)_{365}\) model with a seasonal period of 365 days. However, R's arima function cannot handle such a long seasonal lag. Fortunately, the plot suggests there is no long-term trend, so we can instead model the seasonality explicitly: we fit a sinusoidal function with an annual period, subtract it from the data, and then analyze the residuals.
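Concretely, this means modeling the data as \[{X_n}=\mu+\beta_1 \times \sin(2{\pi}n/365)+\beta_2\times \cos(2{\pi}n/365)+{\eta_n},\] where \({\eta_n}\) is a mean-zero noise process that we will later model as an ARMA process.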

x=seq(1,3300)
l=lm(training~sin(2*pi/365*x)+cos(2*pi/365*x))
summary(l)
## 
## Call:
## lm(formula = training ~ sin(2 * pi/365 * x) + cos(2 * pi/365 * 
##     x))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4127 -1.9590 -0.0864  1.8092 11.4731 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         11.13852    0.04858  229.29   <2e-16 ***
## sin(2 * pi/365 * x)  1.68459    0.06885   24.47   <2e-16 ***
## cos(2 * pi/365 * x)  3.86297    0.06855   56.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.791 on 3297 degrees of freedom
## Multiple R-squared:  0.534,  Adjusted R-squared:  0.5337 
## F-statistic:  1889 on 2 and 3297 DF,  p-value: < 2.2e-16

The fitted model is \[{X_n}=11.13852+1.68459 \times \sin(2{\pi}n/365)+3.86297\times \cos(2{\pi}n/365)+{\eta_n}\] where \({\eta_n}\) are mean-zero residuals.

We plot the data and fitted values.

plot(training,type="l",col="red")
lines(l$fitted.values,col="blue")

Then we focus on the residuals. First, we plot their ACF.

r=l$residuals
acf(r)

The ACF plot shows substantial autocorrelation, so the residuals are not white noise. This motivates fitting an ARMA model to the residuals.

Akaike’s information criterion (AIC) is given by \[AIC = -2 \times {\ell}({\theta^*}) + 2D,\] where \({\ell}({\theta^*})\) is the maximized log likelihood and \(D\) is the number of estimated parameters.

Let’s tabulate some AIC values for a range of different choices of p and q.

aic_table = function(data,P,Q){
  table = matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p+1,q+1] = arima(data,order=c(p,0,q))$aic
    }
  }
  dimnames(table)=list(paste("AR",0:P,sep=""),paste("MA",0:Q,sep=""))
  table
}
d_aic_table = aic_table(r,4,5)
require(knitr)
kable(d_aic_table,digits=2)
          MA0       MA1       MA2       MA3       MA4       MA5
AR0  16139.30  15224.34  15125.31  15110.94  15108.07  15104.88
AR1  15121.82  15101.35  15095.72  15075.76  15075.47  15077.46
AR2  15105.09  15101.31  15073.61  15075.45  15077.85  15079.35
AR3  15096.92  15074.50  15075.46  15077.59  15079.28  15081.22
AR4  15094.97  15075.39  15077.59  15079.32  15081.28  15073.32

From the AIC table, we select ARMA(2,0), ARMA(3,0), and ARMA(3,1) as candidates: although a few larger models (e.g. ARMA(2,2)) attain slightly lower AIC values, these small models are close in AIC and are less prone to parameter redundancy and numerical instability. We analyze the ARMA(2,0) model first.

arma=arima(r,order = c(2,0,0))
arma
## 
## Call:
## arima(x = r, order = c(2, 0, 0))
## 
## Coefficients:
##          ar1      ar2  intercept
##       0.5546  -0.0753    -0.0002
## s.e.  0.0174   0.0174     0.0797
## 
## sigma^2 estimated as 5.68:  log likelihood = -7548.55,  aic = 15105.09

Then we run some basic diagnostics on the residuals of this ARMA(2,0) fit, which estimate \({\epsilon_n}\).

acf(arma$residuals,lag.max = 500)

The ACF plot of these residuals shows no substantial autocorrelation, so they look like white noise.

Box.test(arma$residuals)
## 
##  Box-Pierce test
## 
## data:  arma$residuals
## X-squared = 0.044561, df = 1, p-value = 0.8328

The Box-Pierce test gives a p-value of 0.83, so there is no evidence against the null hypothesis that the residuals are uncorrelated.
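For reference, the Box-Pierce statistic with \(h\) lags (here \(h=1\), the default in Box.test) is \[Q = n\sum_{k=1}^{h} \hat{\rho}_k^2,\] which is approximately \(\chi^2_h\)-distributed when the residuals are uncorrelated.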

qqnorm(arma$residuals)
qqline(arma$residuals)

From the QQ-plot above, we see that the sample quantiles lie roughly on a straight line.

Thus it is reasonable to assume these residuals are independent and identically normally distributed. Hence \({\eta_n}\) can be modeled as \[{\eta_n}=0.5546 \times{\eta_{n-1}}-0.0753 \times{\eta_{n-2}}+{\epsilon_n}\] where \({\epsilon_n} \sim iid\; N(0,5.68)\).

Therefore the whole model can be written as \[{X_n}=11.13852+1.68459 \times \sin(2{\pi}n/365)+3.86297\times \cos(2{\pi}n/365)+{\eta_n},\] where \[{\eta_n}=0.5546 \times{\eta_{n-1}}-0.0753 \times{\eta_{n-2}}+{\epsilon_n},\qquad {\epsilon_n} \sim iid\; N(0,5.68).\]

The diagnostics look good enough that we do not pursue the ARMA(3,0) or ARMA(3,1) candidates further.
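As a cross-check, the seasonal signal and the ARMA(2,0) noise can also be estimated jointly by passing the sinusoidal terms to arima() as regressors. The sketch below reuses the objects defined above (training and x = 1:3300); its estimates should be close to, but need not exactly match, the two-stage fit.

# joint fit (sketch): sinusoidal regressors with ARMA(2,0) errors
joint=arima(training,order=c(2,0,0),
            xreg=cbind(s1=sin(2*pi/365*x),c1=cos(2*pi/365*x)))
joint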

3.2 Test the Performance of the Model

At the end of this part, we use the test set to evaluate the forecasting performance of our model.

library(forecast)
x=seq(3301,3650)
# seasonal mean over the test period, using the sinusoidal coefficients fitted above
seasonal_mean=coef(l)[1]+coef(l)[2]*sin(2*pi/365*x)+coef(l)[3]*cos(2*pi/365*x)
pred=predict(arma,n.ahead=350)
plot(x,test,type='l',ylim=c(0,25))
lines(x,pred$pred+seasonal_mean,type="l",col='red')

In the plot, the black line shows the test data and the red line shows our predictions. Unfortunately, we can hardly say that the model predicts well: the ARMA(2,0) forecast reverts quickly to the seasonal mean, and its prediction standard errors appear too small, so that most of the test observations fall outside the implied prediction interval.
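To make this concrete, we can overlay approximate 95% prediction bands based on the ARMA forecast standard errors (a sketch, reusing pred and seasonal_mean from the code above):

# approximate 95% prediction bands for the combined model (sketch)
lines(x,pred$pred+seasonal_mean-1.96*pred$se,col='red',lty=2)
lines(x,pred$pred+seasonal_mean+1.96*pred$se,col='red',lty=2)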

To work around this restriction in R, we can aggregate the data to a monthly scale and fit another model.

3.3 Fitting a SARMA Model

To make the data easier to analyze, we aggregate the daily observations into monthly means. Again we divide the data into two parts: the first 100 months form the training set and the last 20 months form the test set.

data1=data
data1$date=as.Date(data1$date)
data1$year=as.numeric(format(data1$date,format="%Y"))
data1$month=as.numeric(format(data1$date,format="%m"))
# average the daily minimum temperatures within each year-month
data1=aggregate(data1[,2],list(data1[,3],data1[,4]),mean)
colnames(data1)=c("year","month","temp")
data1=data1[order(data1$year,data1$month),]
# month index: t = 1 corresponds to January 1981
data1$t=data1$year*12+data1$month-23772
training1=data1[1:100,3]
test1=data1[101:120,3]

Then we take a look at the monthly data.

plot(training1,type="l")

There is clear periodic behavior, with a period of 12 months.

Now we take a look at the spectral density of the monthly series, starting with the raw periodogram.

spectrum(training1)

We again compute a smoothed, nonparametric estimate of the spectrum.

spec1=spectrum(training1,spans=c(3,5,3), main="Smoothed periodogram")

We now determine the dominant frequency.

spec1$freq[which.max(spec1$spec)]
## [1] 0.08

We see that the dominant frequency is 0.08 cycles per month, which corresponds to a period of 1/0.08 = 12.5 months. Given the frequency resolution, this is consistent with the expected 12-month annual cycle.
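Again, the implied period is the reciprocal of the dominant frequency:

1/spec1$freq[which.max(spec1$spec)]  # 12.5 months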

Then we look at the ACF of the training data.

acf(training1)

The ACF also shows that the training set has a period of about 12 months.

We take a seasonal difference at lag 12 to obtain the new series \(Y_{n}=(1-B^{12})X_{n}\), and then plot the new series and its ACF.

training2=diff(training1,lag=12)
plot(training2,type = "l")

acf(training2)

From the plot, the new series \(Y_{n}\) looks approximately weakly stationary with roughly constant variance, which motivates us to build an ARMA model for \({Y_n}\). We again use the AIC model selection criterion to choose among candidate models for \({Y_n}\).

aic_table <- function(data,P,Q){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p+1,q+1] <- arima(data,order=c(p,0,q))$aic
    }
  }
  dimnames(table) <- list(paste("AR",0:P,sep=""),paste("MA",0:Q,sep=""))
  table
}
d_aic_table <- aic_table(training2,4,5)
## Warning in arima(data, order = c(p, 0, q)): possible convergence problem:
## optim gave code = 1

## Warning in arima(data, order = c(p, 0, q)): possible convergence problem:
## optim gave code = 1

## Warning in arima(data, order = c(p, 0, q)): possible convergence problem:
## optim gave code = 1
require(knitr)
kable(d_aic_table,digits=2)
        MA0     MA1     MA2     MA3     MA4     MA5
AR0  301.07  302.64  299.72  301.10  301.97  303.97
AR1  302.43  300.46  300.06  301.79  303.96  305.55
AR2  298.42  299.69  301.61  300.12  301.58  286.62
AR3  299.96  301.63  303.61  297.61  291.04  294.14
AR4  301.66  303.31  305.57  302.59  292.83  287.26

From the AIC table, we select ARMA(2,0) for \({Y_n}\): it has the lowest AIC among the models with few parameters, while some of the larger fits produced convergence warnings (see above), so their lower reported AIC values (e.g. for ARMA(2,5)) may not be reliable. Thus we choose a SARMA \((2,0,0)\times(1,0,0)_{12}\) model for the monthly series.

sarma=arima(training1,
            xreg=seq(1:100), order=c(2,0,0),
            seasonal=list(order=c(1,0,0),period=12)
)
sarma
## 
## Call:
## arima(x = training1, order = c(2, 0, 0), seasonal = list(order = c(1, 0, 0), 
##     period = 12), xreg = seq(1:100))
## 
## Coefficients:
##          ar1    ar2    sar1  intercept  seq(1:100)
##       0.1961  0.266  0.8731    11.5082      0.0041
## s.e.  0.1089  0.098  0.0448     1.3965      0.0169
## 
## sigma^2 estimated as 1.515:  log likelihood = -171.39,  aic = 354.77

Again, we run some basic diagnostics.

acf(sarma$residuals)

Box.test(sarma$residuals)
## 
##  Box-Pierce test
## 
## data:  sarma$residuals
## X-squared = 0.0059406, df = 1, p-value = 0.9386
qqnorm(sarma$residuals)
qqline(sarma$residuals)

The ACF plot shows no substantial autocorrelation in the residuals. The Box-Pierce test gives a p-value of 0.94, so there is no evidence against the hypothesis that the residuals are uncorrelated. The QQ-plot indicates that the sample quantiles lie roughly on a straight line.

Thus it is reasonable to assume residuals are independent and identically normally distributed.

Hence the whole SARMA model can be written as \[(1-0.8731\times {B^{12}})(1-0.1961\times {B}-0.2660\times B^2)({X_n}-11.5082-0.0041\,n)={\epsilon_n},\]

where \(n\) is the month index and \({\epsilon_n}\) are independent Gaussian random variables with mean 0 and variance 1.515. The estimated linear trend (0.0041 degrees per month) is small and not statistically significant.

3.4 Test the Performance of the SARMA Model

Similarly, we test the performance of the SARMA model.

pred1=predict(sarma,n.ahead=20,newxreg=seq(101,120))
plot(seq(101,120),test1,type='l',ylim=c(0,20))
lines(seq(101,120),pred1$pred,type="l",col='red')
# approximate 95% prediction interval
lines(seq(101,120),pred1$pred-1.96*pred1$se,col='red',lty=2)
lines(seq(101,120),pred1$pred+1.96*pred1$se,col='red',lty=2)

In the graph above, the black line is the original test data. The red solid line is our point forecast, and the two red dashed lines are the bounds of the 95% prediction interval.

We find that all of the test observations lie within the 95% prediction interval, which indicates that the forecasts are reasonable.
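This can also be checked numerically by computing the empirical coverage of the interval (a small sketch; a value of 1 means every test point is covered):

# proportion of test observations inside the 95% prediction interval
mean(test1>=pred1$pred-1.96*pred1$se & test1<=pred1$pred+1.96*pred1$se)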

4. Conclusion

Both models fit the training data well, in the sense that their residuals are consistent with Gaussian white noise. However, the first model does not perform well on the test data, probably because we only used an AR model for the residuals, and AR forecasts revert quickly to the mean and so add little beyond the seasonal signal. The second model performs well on both the training and test sets; its 95% prediction intervals cover all of the test observations, so it can be used to forecast future minimum temperatures.

5. References

[1] Data set: http://datamarket.com/data/list/?q=provider:tsdl

[2] Robert H. Shumway and David S. Stoffer, "Time Series Analysis and Its Applications: With R Examples", Third Edition.

[3] Course notes: http://ionides.github.io/531w16