Global warming is a trending topic nowadays. It has observable impacts on the environment, animals, and human lives. Glaciers shrink and sea levels rise; lake ice begins to melt earlier; heat waves worsen and become deadly. According to NASA, the “Intergovernmental Panel on Climate Change (IPCC), which includes more than 1,300 scientists from the United States and other countries, forecasts a temperature rise of 2.5 to 10 degrees Fahrenheit over the next century.” \([1]\). To shed further light on the topic, we conduct an analysis of monthly and annual temperatures for two of the biggest cities in North America, Mexico City and Los Angeles, between the mid-\(19^{th}\) century and \(2013\). The main objectives are to test for an upward trend in the temperature levels of these two cities, to find any temperature cycles, and to model the data with an ARMA model.
The data set is acquired from data.world \([2]\). It consists of monthly average temperatures from 1849/1/1 to 2013/9/1 for major cities.
First, we read the Los Angeles data and then plot it.
## mean temperature in Los Angeles for the last 30 years 16.51158
## lowest temperature in Los Angeles for the last 30 years 7.213
## highest temperature in Los Angeles for the last 30 years 27.336
To give a better overview of the data, we restrict attention to the most recent 30 years. The average monthly temperature fluctuates around its mean of 16.512 degrees Celsius, with a maximum of 27.336 and a minimum of 7.213 degrees over this period. The same basic information can be read off the plot. Since no obvious trend is visible, we assume the monthly data are mean stationary.
From the above figure, the average temperature in Los Angeles over the past 30 years shows clear periodic behavior, with one peak per year \([3]\). The period is about 12 months, since the average temperature cycles every 12 months. We therefore decided to explore the data in the frequency domain. First, we check the period of the data. To verify the result more robustly, we estimate it with both the “Smoothed Periodogram” method and the “Spectrum estimated via AR model picked by AIC”. Below are the resulting plots, frequencies, and periods \([4]\).
From the above plots, we can see that the dominant frequency for both methods is just below 0.1. We now find the exact dominant frequencies for both methods and calculate the corresponding periods using the formula \(T = \dfrac{1}{f}\) \([4]\).
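As an illustration of how such dominant frequencies can be extracted in R, here is a minimal sketch on a simulated monthly series with a 12-month cycle (the simulated series and all variable names are our own; the actual analysis uses the Los Angeles temperatures):

```r
# Simulated stand-in for the monthly temperature series: a 12-month
# cosine cycle plus noise.
set.seed(1)
n <- 360                                  # 30 years of monthly data
x <- 16 + 8 * cos(2 * pi * (1:n) / 12) + rnorm(n, sd = 1)

# Smoothed periodogram; spans controls the amount of smoothing.
sp_smooth <- spectrum(x, spans = c(3, 5, 3), plot = FALSE)
f_smooth <- sp_smooth$freq[which.max(sp_smooth$spec)]

# Spectrum estimated via an AR model with order chosen by AIC.
sp_ar <- spectrum(x, method = "ar", plot = FALSE)
f_ar <- sp_ar$freq[which.max(sp_ar$spec)]

cat("dominant period (smoothing):", round(1 / f_smooth, 3), "months\n")
cat("dominant period (AR):", round(1 / f_ar, 3), "months\n")
```

Both estimates should recover a period close to 12 months on this series.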
## dominant frequency for smoothing method is 0.0835
## dominant frequency for AR method is 0.0832
## dominant period for smoothing method is 11.976
## dominant period for AR method is 12.024
| | Smoothing | AR-fitted |
|---|---|---|
| Frequency | 0.0835 | 0.0832 |
| Period | 11.976 | 12.024 |
From the spectral density plots, both methods peak at a similar frequency, which is confirmed by the exact dominant frequencies for the smoothed (0.0835) and AR-fitted (0.0832) methods. Because the two methods agree, we are confident in this spectral analysis. The corresponding periods are 11.976 and 12.024 months, respectively, which is consistent with the background knowledge mentioned above that the temperature cycles every 12 months.
We perform the same spectral density analysis for Mexico City.
We spot almost the same dominant frequency in the Mexico City data. Let’s see exactly what it is.
## dominant frequency for smoothing method is 0.0833
## dominant frequency for AR method is 0.0832
## dominant period for smoothing method is 12
## dominant period for AR method is 12.024
| | Smoothing | AR-fitted |
|---|---|---|
| Frequency | 0.0833 | 0.0832 |
| Period | 12.000 | 12.024 |
We get almost the same results, as expected. In the next section, we look for a linear trend. Depending on whether a trend exists, we then process the data accordingly and fit an ARMA model to it. We carry out these steps for both cities and then compare the results.
For trend analysis, we extract and use the annual data (the average of the temperatures for each year).
Let’s extract the annual data for Los Angeles and plot it.
Now, we want to answer the question: is the data trended? To find out, we fit a linear regression model to the data and run a statistical significance test on the coefficient of index, where index is an artificially created time component. Here is the summary of the linear fit \([5]\):
##
## Call:
## lm(formula = temperature ~ index, data = losAngelos_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27382 -0.29311 -0.02612 0.32705 1.14248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.548e+01 7.786e-02 198.816 < 2e-16 ***
## index 4.711e-03 8.185e-04 5.755 4.23e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4962 on 162 degrees of freedom
## Multiple R-squared: 0.1697, Adjusted R-squared: 0.1646
## F-statistic: 33.12 on 1 and 162 DF, p-value: 4.226e-08
Here is the \(99\%\) confidence interval:
## For 99% Confidence interval of the coefficient, lower bound is 0.0026 and upper bound is 0.0068
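An interval like the one above can be obtained with `confint()`. A minimal sketch, using simulated annual data in place of the real series (the column names `temperature` and `index` follow the `lm()` call above; the simulated values are our own):

```r
# Simulated stand-in for the annual Los Angeles series: a small positive
# trend plus noise, over 164 years.
set.seed(2)
losAngelos_data <- data.frame(index = 1:164)
losAngelos_data$temperature <- 15.5 + 0.005 * losAngelos_data$index +
  rnorm(164, sd = 0.5)

LR_model <- lm(temperature ~ index, data = losAngelos_data)
ci <- confint(LR_model, "index", level = 0.99)  # 99% CI for the slope
print(ci)
```

If the interval excludes 0, the slope is significant at the 1% level.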
Since \(0\) is outside the \(99\%\) confidence interval, we confirm that there is a linear trend. Before further analysis, we first detrend the data \([6]\). This step is crucial for ARMA model fitting, since the ARMA model assumes mean stationarity, which data with a positive trend do not satisfy.
detrended_annual_los = annual_los - predict(LR_model, losAngelos_data) # (annual data) - (linear fit of annual data)
We are going to use the smoothing method for spectral density estimation \([4]\).
As we can see from the graph, the first major frequency is around 0.1 (first peak) and the second major frequency is around 0.27 (second peak). We will locate those peaks and find their respective periods.
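Locating the two largest peaks can be done by finding local maxima of the smoothed periodogram. A minimal sketch on a simulated series with an approximately 11-year cycle (the simulated data and variable names are our own, standing in for the detrended annual series):

```r
# Simulated stand-in for the detrended annual series.
set.seed(5)
n <- 164
x <- 0.5 * cos(2 * pi * (1:n) / 11) + rnorm(n, sd = 0.4)

sp <- spectrum(x, spans = c(3, 3), plot = FALSE)

# Indices where the smoothed spectrum has a local maximum.
peaks <- which(diff(sign(diff(sp$spec))) == -2) + 1
# The two peaks with the largest spectral density.
top2 <- peaks[order(sp$spec[peaks], decreasing = TRUE)][1:2]

cat("major periods:", round(1 / sp$freq[top2], 3), "\n")
```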
## first major period is 10.588
## second major period is 3.75
So, the first major period is 10.588 years and the second major period is 3.75 years. The reason could be a cyclic behavior in our solar system.
Now we fit the ARMA model to our detrended dataset.
For ARMA fitting, we create an AIC table and then choose the model with the smallest AIC value. Here are the mathematical formulation of the ARMA model and the AIC table.
\(\phi(B)\left(Y_{n}-\mu\right)=\psi(B) \epsilon_{n}\) with parameters \(\theta=\left(\phi_{1: p}, \psi_{1: q}, \mu, \sigma^{2}\right)\)
\[ \begin{aligned} \mu &=\mathbb{E}\left[Y_{n}\right] \\ \phi(x) &=1-\phi_{1} x-\cdots-\phi_{p} x^{p} \\ \psi(x) &=1+\psi_{1} x+\cdots+\psi_{q} x^{q} \\ \epsilon_{n} & \sim N\left[0, \sigma^{2}\right] \end{aligned} \] Finding the most accurate ARMA model is important for this analysis. We use the Akaike information criterion (AIC), which was derived as an approach to minimizing prediction error \([4]\). Models with lower AIC generally have lower prediction error, meaning a better fit. Relying solely on AIC can lead to mistakes in model selection, but it is often useful. We therefore calculate the AIC value for each model with p ranging from 0 to 4 and q from 0 to 5. Below is the result of the AIC estimation.
| | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
|---|---|---|---|---|---|---|
| AR0 | 237.58 | 230.87 | 228.94 | 228.18 | 218.70 | 220.51 |
| AR1 | 227.63 | 216.64 | 218.59 | 220.42 | 220.50 | 222.70 |
| AR2 | 224.10 | 218.60 | 220.37 | 222.34 | 222.47 | 222.30 |
| AR3 | 222.31 | 220.36 | 222.53 | 220.14 | 221.75 | 222.49 |
| AR4 | 217.31 | 219.14 | 219.57 | 221.31 | 222.70 | 224.24 |
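An AIC table like the one above can be tabulated with a double loop over p and q. A minimal sketch, run on a short simulated series so the snippet is self-contained (the real table is computed on the detrended annual data):

```r
# Simulated stand-in for the detrended annual series.
set.seed(3)
y <- arima.sim(model = list(ar = 0.6, ma = -0.3), n = 150)

# AIC for every ARMA(p, q) with 0 <= p <= P and 0 <= q <= Q.
aic_table <- function(data, P, Q) {
  tab <- matrix(NA, P + 1, Q + 1,
                dimnames = list(paste0("AR", 0:P), paste0("MA", 0:Q)))
  for (p in 0:P) {
    for (q in 0:Q) {
      # Some (p, q) combinations may fail to converge; keep NA then.
      tab[p + 1, q + 1] <- tryCatch(arima(data, order = c(p, 0, q))$aic,
                                    error = function(e) NA)
    }
  }
  tab
}

tbl <- round(aic_table(y, 2, 2), 2)
print(tbl)
```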
From the above table, most models have an AIC around 220, but ARMA(1,1) has the lowest value, 216.64. Therefore, we use ARMA(1,1) to fit our yearly temperature data.
## ar1 ma1 intercept
## 0.9091 -0.7410 0.0179
After fitting the data with the ARMA(1,1), the model could be written as below: \[ \begin{aligned} \phi(B)\left(Y_{n}-0.0179\right)=\psi(B) \epsilon_{n}\\ \end{aligned} \]
\[ \begin{aligned} \phi(x) &=1 - 0.9091x \\ \psi(x) &=1 - 0.7410x \\ \end{aligned} \]
Besides, we also need to calculate the roots of the AR and MA polynomials, which help us check the causality and invertibility of the model.
Let’s start by checking for causality and invertibility.
## Absolute value of AR polynomial roots are: 1.099956
##
## Absolute value of MA polynomial roots are: 1.349567
The absolute value of the AR root is 1.0999 and that of the MA root is 1.3496. Because both are larger than 1 (outside the unit circle), the model satisfies causality and invertibility. Therefore, the ARMA(1,1) model is reliable.
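The root check can be reproduced with `polyroot()`, using the reported ARMA(1,1) coefficients (ar1 = 0.9091, ma1 = -0.7410):

```r
ar1 <- 0.9091
ma1 <- -0.7410

# phi(x) = 1 - ar1 * x  and  psi(x) = 1 + ma1 * x, as defined above.
ar_roots <- polyroot(c(1, -ar1))
ma_roots <- polyroot(c(1, ma1))

cat("Absolute value of AR polynomial roots:", abs(ar_roots), "\n")
cat("Absolute value of MA polynomial roots:", abs(ma_roots), "\n")
```

Both moduli exceed 1, confirming causality and invertibility.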
Now, we check our assumptions for the residuals of our model. Those assumptions are:

- They should be normally distributed.
- They should be white noise (\(\gamma(h) = 0\) for \(h \neq 0\)), i.e., no autocorrelation.
For the normality check, we plot QQ plots.
From the plot above, the residuals fluctuate around a mean of 0, and their variance does not appear to change over time. This suggests the residuals have mean 0 and constant variance. Next, we check the normality of the residuals.
In the QQ plot of the residuals, although a few points at both tails deviate from the line, almost all other points lie on the solid black line. Therefore, we conclude that the residuals also meet the normality assumption.
To check the white noise assumption, we plot the ACF of the residuals.
From the plot, it is reasonable to say that residuals are white noise.
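These residual diagnostics can be reproduced as sketched below, here on an ARMA(1,1) fitted to a simulated series so the snippet runs on its own. The Ljung–Box test is our own addition as a numerical complement to the ACF plot, not part of the original analysis:

```r
# Simulated stand-in for the detrended annual series, plus an ARMA(1,1) fit.
set.seed(4)
y <- arima.sim(model = list(ar = 0.9, ma = -0.7), n = 164)
fit <- arima(y, order = c(1, 0, 1))
res <- residuals(fit)

qqnorm(res); qqline(res)  # normality check
acf(res)                  # white-noise check: spikes should stay in the bands

# Ljung-Box test: large p-values are consistent with white noise.
bt <- Box.test(res, lag = 12, type = "Ljung-Box")
print(bt)
```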
Let’s extract the annual data for Mexico City and plot it.
Now, we follow similar steps (as in 3.1) to find the trend, if there is any. Here is the \(99\%\) confidence interval:
## For 99% Confidence interval of the coefficient, lower bound: 0.0059 and upper bound: 0.0083
As we have seen for Los Angeles, there is also an upward trend in annual temperatures in Mexico City (since \(0\) is outside the \(99\%\) confidence interval of the index coefficient). Consequently, we use the detrended data for the rest of our analysis.
detrended_annual_mex = annual_mex - predict(LR_model_mex, mexico_data)
We use the same method (smoothing method) for spectral density estimation.
As we can see, there are again two dominant frequencies, similar to those spotted in the Los Angeles data. Let’s find exactly what they are.
## first major period is 10.588
## second major period is 3.913
So, the first major period is 10.588 years and the second major period is 3.913 years. The first major periods are the same for both cities, and the second major periods are close to each other. We discuss this further in the comparisons section (section 5).
We are going to use ARIMA(3,0,2), i.e., ARMA(3,2), which has a reasonably small AIC value.
## ar1 ar2 ar3 ma1 ma2 intercept
## 1.0085 -0.7453 0.3089 -0.7521 0.6749 -0.0055
Again we start by checking for causality and invertibility of the model.
## Absolute value of AR polynomial roots are: 1.40995 1.40995 1.628712
##
## Absolute value of MA polynomial roots are: 1.217227 1.217227
Since all roots are outside the unit circle, the model is causal and invertible.
Let’s continue with the normality check of the residuals using QQ plots.
The QQ plot again confirms the normality of the residuals.
Lastly, here is the ACF plot for the white noise assumption check.
In conclusion, we found the results of the average temperature analysis to be the same for Los Angeles and Mexico City. For both cities, there is a warming trend as time goes on. The annual temperatures of both cities share a dominant period of about 10.59 years, possibly because the cities are geographically close to each other. Finally, we found ARMA(1,1) suitable for fitting the detrended Los Angeles data and ARMA(3,2) suitable for fitting the detrended Mexico City data.
\([1]\) - Online source: https://climate.nasa.gov/resources/global-warming-vs-climate-change/
\([2]\) - Online source: https://data.world/data-society/global-climate-change-data/workspace/file?filename=GlobalLandTemperatures%2FGlobalLandTemperaturesByMajorCity.csv
\([3]\) - Online source (Solution for previous homework): https://ionides.github.io/531w21/hw03/sol03.html
\([4]\) - Lecture notes for Stats 531: https://ionides.github.io/531w22/
\([5]\) - Online source: https://projects.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%205%20Trend%20Tests.htm
\([6]\) - Online source: https://www.statology.org/detrend-data/