y <- read.table(file="ann_arbor_weather.csv",header=1)
arma2018 <- arima(y$Low[y$Year<=2018], order=c(1,0,1))
arma2018
##
## Call:
## arima(x = y$Low[y$Year <= 2018], order = c(1, 0, 1))
##
## Coefficients:
##          ar1      ma1  intercept
##       0.8257  -0.7881    -2.8193
## s.e.  0.2650   0.2866     0.8080
##
## sigma^2 estimated as 53.06: log likelihood = -401.76, aic = 811.52
arma2020 <- arima(y$Low[y$Year<=2020], order=c(1,0,1))
arma2020
##
## Call:
## arima(x = y$Low[y$Year <= 2020], order = c(1, 0, 1))
##
## Coefficients:
##           ar1      ma1  intercept
##       -0.0005  -0.0005    -2.8086
## s.e.  13.1214  13.2431     0.6842
##
## sigma^2 estimated as 56.16: log likelihood = -411.97, aic = 831.93
What do you conclude by comparing these fitted models? Do you notice anything surprising?
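A minimal sketch of one follow-up check (not part of the output above): fit a mean-only, white-noise model to the same data through 2020 and compare log likelihoods, since near-zero AR and MA coefficients with very large standard errors suggest the ARMA terms may be contributing little.
arma0 <- arima(y$Low[y$Year<=2020], order=c(0,0,0)) # white-noise (mean-only) comparison fit
arma0$loglik     # compare with the ARMA(1,1) log likelihood below
arma2020$loglik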
y <- read.table(file="ann_arbor_weather.csv",header=1)
r <- arima(y$Low, order=c(1,0,1))$resid
shapiro.test(r)
##
## Shapiro-Wilk normality test
##
## data: r
## W = 0.99418, p-value = 0.8922
We can check that the residuals have an approximately normal marginal distribution by, for example, plotting a histogram:
hist(r)
The importance of normality may depend on what we want to do with the resulting fitted model. For many purposes, a central limit effect may make conclusions insensitive to moderate departures from normality. Severe outliers may nevertheless be important, and plotting the residuals can help to identify them.
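A normal quantile plot and a time plot of the residuals can make tail behavior and individual outliers easier to see than the histogram; a minimal sketch:
qqnorm(r)                 # normal quantile plot of the ARMA(1,1) residuals
qqline(r)                 # reference line through the quartiles
plot(r, ylab="residual")  # residuals over time, to see when any outliers occur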
What purposes for a fitted model might be more—or less—sensitive to normality?
If you’re curious to read more about this, Section 3.10 of Box (1976) gives an example where addressing dependence is much more important than addressing non-normality.
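A quick check of residual dependence, which Box's example suggests can matter more than non-normality, is the sample autocorrelation function; a minimal sketch (na.pass allows for the missing observation):
acf(r, na.action=na.pass)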
y <- read.table(file="ann_arbor_weather.csv",header=1)$Low
year <- seq_along(y)-1 # year since 1900
ad <- abs(y-median(y,na.rm=TRUE))
lm.ad <- lm(ad~year)
summary(lm.ad)
##
## Call:
## lm(formula = ad ~ year)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.9934 -3.3128 -0.6988  2.8033 15.5023
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.71299    0.79788   5.907 3.23e-08 ***
## year         0.01983    0.01110   1.786   0.0766 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.479 on 122 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.02547, Adjusted R-squared: 0.01749
## F-statistic: 3.189 on 1 and 122 DF, p-value: 0.07662
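A plot of the absolute deviations against year, with the fitted line from lm.ad, gives a visual check of how much this trend is driven by a few large values; a minimal sketch:
plot(year, ad, xlab="year since 1900", ylab="absolute deviation of Low from its median")
abline(lm.ad)  # fitted regression line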
This is borderline significant, so maybe we are onto something. Perhaps the change in variability will be clearer if we separate it from the possible change in expected value:
y <- read.table(file="ann_arbor_weather.csv",header=1)$Low
year <- seq_along(y)-1 # year since 1900
y.na <- is.na(y)
y <- y[!y.na]
year <- year[!y.na]
y2 <- resid(lm(y~year+I(year^2)+I(year^3)))
ad2 <- abs(y2-median(y2))
lm.ad2 <- lm(ad2~year)
summary(lm.ad2)
##
## Call:
## lm(formula = ad2 ~ year)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.8807 -3.2359 -0.8642  2.3473 15.6314
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.46909    0.79646   5.611 1.28e-07 ***
## year         0.02306    0.01108   2.081   0.0395 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.471 on 122 degrees of freedom
## Multiple R-squared: 0.03427, Adjusted R-squared: 0.02636
## F-statistic: 4.33 on 1 and 122 DF, p-value: 0.03955
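As a simple robustness check, sketched here and not part of the original analysis, a rank-based association test between the detrended absolute deviations and year does not rely on the linear-model normality assumptions:
cor.test(ad2, year, method="kendall")  # Kendall rank correlation test for trend in variability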
We have statistical significance at the 5% level, but have we p-hacked? How would you assess this?
Do you think it is scientifically plausible that climate change is affecting the variation but not (yet) the mean? What is your reasoning? As statisticians, we do not necessarily have extensive scientific knowledge about the data we are analyzing, but we can still use whatever we know to help interpret the data analysis.