y <- read.table(file="ann_arbor_weather.csv",header=1)
arma2018 <- arima(y$Low[y$Year<=2018], order=c(1,0,1))
arma2018
##
## Call:
## arima(x = y$Low[y$Year <= 2018], order = c(1, 0, 1))
##
## Coefficients:
##          ar1      ma1  intercept
##       0.8257  -0.7881    -2.8193
## s.e.  0.2650   0.2866     0.8080
##
## sigma^2 estimated as 53.06: log likelihood = -401.76, aic = 811.52
arma2020 <- arima(y$Low[y$Year<=2020], order=c(1,0,1))
arma2020
##
## Call:
## arima(x = y$Low[y$Year <= 2020], order = c(1, 0, 1))
##
## Coefficients:
##           ar1      ma1  intercept
##       -0.0005  -0.0005    -2.8086
## s.e.  13.1214  13.2431     0.6842
##
## sigma^2 estimated as 56.16: log likelihood = -411.97, aic = 831.93
What do you conclude by comparing these fitted models? Do you notice anything surprising?
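A minimal sketch of one follow-up check (not part of the output above): fit a mean-only, white-noise model to the same data through 2020 and compare log likelihoods, since near-zero AR and MA coefficients with very large standard errors suggest the ARMA terms may be contributing little.
arma0 <- arima(y$Low[y$Year<=2020], order=c(0,0,0)) # white-noise (mean-only) comparison fit
arma0$loglik     # compare with the ARMA(1,1) log likelihood below
arma2020$loglik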
y <- read.table(file="ann_arbor_weather.csv",header=1)
r <- arima(y$Low, order=c(1,0,1))$resid
shapiro.test(r)
##
## Shapiro-Wilk normality test
##
## data: r
## W = 0.99418, p-value = 0.8922
We can check that the residuals have an approximately normal marginal distribution by, for example, plotting a histogram:
hist(r)
The importance of normality may depend on what we want to do with the resulting fitted model. For many purposes, a central limit effect may make conclusions insensitive to moderate departures from normality. Severe outliers may nevertheless be important, and plotting the residuals can help to identify them.
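A normal quantile plot and a time plot of the residuals can make tail behavior and individual outliers easier to see than the histogram; a minimal sketch:
qqnorm(r)                 # normal quantile plot of the ARMA(1,1) residuals
qqline(r)                 # reference line through the quartiles
plot(r, ylab="residual")  # residuals over time, to see when any outliers occur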
What purposes for a fitted model might be more—or less—sensitive to normality?
If you’re curious to read more about this, Section 3.10 of Box (1976) gives an example where addressing dependence is much more important than addressing non-normality.
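A quick check of residual dependence, which Box's example suggests can matter more than non-normality, is the sample autocorrelation function; a minimal sketch (na.pass allows for the missing observation):
acf(r, na.action=na.pass)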
y <- read.table(file="ann_arbor_weather.csv",header=1)$Low
year <- seq_along(y)-1 # year since 1900
ad <- abs(y-median(y,na.rm=TRUE))
lm.ad <- lm(ad~year)
summary(lm.ad)
##
## Call:
## lm(formula = ad ~ year)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.9934 -3.3128 -0.6988  2.8033 15.5023
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.71299    0.79788   5.907 3.23e-08 ***
## year         0.01983    0.01110   1.786   0.0766 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.479 on 122 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.02547, Adjusted R-squared: 0.01749
## F-statistic: 3.189 on 1 and 122 DF, p-value: 0.07662
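A plot of the absolute deviations against year, with the fitted line from lm.ad, gives a visual check of how much this trend is driven by a few large values; a minimal sketch:
plot(year, ad, xlab="year since 1900", ylab="absolute deviation of Low from its median")
abline(lm.ad)  # fitted regression line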
This is borderline significant, so maybe we are onto something. Perhaps the change in variability will be clearer if we separate it from the possible change in expected value:
y <- read.table(file="ann_arbor_weather.csv",header=1)$Low
year <- seq_along(y)-1 # year since 1900
y.na <- is.na(y)
y <- y[!y.na]
year <- year[!y.na]
y2 <- resid(lm(y~year+I(year^2)+I(year^3)))
ad2 <- abs(y2-median(y2))
lm.ad2 <- lm(ad2~year)
summary(lm.ad2)
##
## Call:
## lm(formula = ad2 ~ year)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.8807 -3.2359 -0.8642  2.3473 15.6314
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.46909    0.79646   5.611 1.28e-07 ***
## year         0.02306    0.01108   2.081   0.0395 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.471 on 122 degrees of freedom
## Multiple R-squared: 0.03427, Adjusted R-squared: 0.02636
## F-statistic: 4.33 on 1 and 122 DF, p-value: 0.03955
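As a simple robustness check, sketched here and not part of the original analysis, a rank-based association test between the detrended absolute deviations and year does not rely on the linear-model normality assumptions:
cor.test(ad2, year, method="kendall")  # Kendall rank correlation test for trend in variability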
We have statistical significance at the 5% level, but have we p-hacked? How would you assess this?
Do you think it is scientifically plausible that climate change is affecting the variation but not (yet) the mean? What is your reasoning? As statisticians, we do not necessarily have extensive scientific knowledge about the data we are analyzing, but we can still use whatever we know to help interpret the data analysis.