February 21, 2022

1 Background

Global warming is a trending topic nowadays. It has observable impacts on the environment, animals, and human lives. Glaciers shrink and sea levels rise; lake ice begins to melt earlier; heat waves worsen and become deadly. According to NASA, the “Intergovernmental Panel on Climate Change (IPCC), which includes more than 1,300 scientists from the United States and other countries, forecasts a temperature rise of 2.5 to 10 degrees Fahrenheit over the next century.” \([1]\). To throw further light on the topic, we decided to analyze monthly and annual temperatures for two of the biggest cities in North America, Mexico City and Los Angeles, between the mid-\(19^{th}\) century and \(2013\). The main objectives are to test for an upward trend in the temperature levels of these two cities, to find any temperature cycles, and to model the data with an ARMA model.

The data set is acquired from data.world \([2]\). It consists of monthly average temperatures from 1849/1/1 to 2013/9/1 for major cities.

2 Data Exploration For Monthly Data

2.1 Los Angeles

2.1.1 Data Processing for Los Angeles data

First, we read the Los Angeles data and then plot it.
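This step can be sketched as follows; the column names (dt, AverageTemperature, City) are assumptions based on the data.world CSV layout, and the toy rows below stand in for the real file, which would be loaded with read.csv().

```r
# Toy stand-in for read.csv("GlobalLandTemperaturesByMajorCity.csv");
# column names are assumed from the data.world file layout.
temps <- data.frame(
  dt = as.Date(c("2010-01-01", "2010-02-01", "2010-03-01")),
  AverageTemperature = c(13.1, 13.8, 14.9),
  City = "Los Angeles"
)
# Keep only the Los Angeles rows (spelling follows the report's object name)
losAngelos_data <- subset(temps, City == "Los Angeles")

# Plot the monthly series
plot(AverageTemperature ~ dt, data = losAngelos_data, type = "l",
     xlab = "Date", ylab = "Average temperature (Celsius)")
```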

## mean temperature in Los Angeles for the last 30 years  16.51158
## lowest temperature in Los Angeles for the last 30 years  7.213
## highest temperature in Los Angeles for the last 30 years  27.336

To give a better overview, we restrict attention to the most recent 30 years. Over this period the average monthly temperature fluctuates around its mean of 16.51 degrees Celsius, with a minimum of 7.213 and a maximum of 27.336 degrees Celsius. These basic features can also be read off the plot. Since no obvious trend is visible in the plot, we assume the monthly data are mean stationary.

2.1.2 Frequency Domain Analysis for Los Angeles data

The above figure shows that the average temperature in Los Angeles over the past 30 years exhibits periodic behavior, with a peak every year \([3]\). The period between peaks is about 12 months, since the average temperature cycles every 12 months. We therefore explore the data with frequency domain analysis. First, we need to check the periods of the data. To cross-validate the result, we estimate the periods with both the smoothed periodogram and a spectrum estimated via an AR model selected by AIC. Below are the plots, frequencies, and periods for our analysis \([4]\).

From the above plots, we can see that the dominant frequency for both methods is just below 0.1. Now we find the exact dominant frequencies for both methods and calculate the respective periods using the formula \(T = \dfrac{1}{f}\) \([4]\).
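The two estimates and the \(T = 1/f\) calculation can be sketched as below, using a synthetic series with a 12-month cycle in place of the Los Angeles data; the smoothing spans c(3, 5, 3) are an assumed choice.

```r
# Synthetic stand-in: 30 years of monthly data with an annual cycle
set.seed(531)
n <- 360
x <- 16 + 8 * sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 1)

smoothed <- spectrum(x, spans = c(3, 5, 3), plot = FALSE)  # smoothed periodogram
ar_fit   <- spectrum(x, method = "ar", plot = FALSE)       # AR spectrum, order by AIC

# Dominant frequency = location of the spectral peak; period T = 1/f
f_smooth <- smoothed$freq[which.max(smoothed$spec)]
f_ar     <- ar_fit$freq[which.max(ar_fit$spec)]
cat("dominant period for smoothing method is", 1 / f_smooth, "\n")
cat("dominant period for AR method is", 1 / f_ar, "\n")
```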

## dominant frequency for smoothing method is  0.0835
## dominant frequency for AR method is  0.0832
## dominant period for smoothing method is  11.976
## dominant period for AR method is  12.024
            Smoothing   AR-fitted
Frequency      0.0835      0.0832
Period        11.976      12.024

From the spectral density plots, both methods place their first peak at a similar frequency, which is confirmed by the exact dominant frequencies of the smoothed (0.0835) and AR-fitted (0.0832) estimates. Because the two methods agree, we are confident in this spectral analysis. The corresponding periods are 11.976 and 12.024 months respectively, consistent with the background information mentioned above that the temperature cycles every 12 months.

2.2 Mexico City

2.2.1 Frequency Domain Analysis for Mexico City data

We perform a similar spectral density analysis for Mexico City.

We spot almost the same dominant frequency in the Mexico City data. Let us compute it exactly.

## dominant frequency for smoothing method is  0.0833
## dominant frequency for AR method is  0.0832
## dominant period for smoothing method is  12
## dominant period for AR method is  12.024
            Smoothing   AR-fitted
Frequency      0.0833      0.0832
Period        12.000      12.024

We get almost the same results, as expected. In the next section, we test for a linear trend. Depending on its existence, we will process the data accordingly and fit an ARMA model. We carry out these steps for both cities and then compare the results.

3 Trend Analysis, Spectral Density Analysis and ARMA fitting for annual Los Angeles temperatures

For trend analysis, we extract and use the annual data (the average temperature for each year).

3.1 Trend Analysis for annual Los Angeles temperatures

Let’s extract the annual data for Los Angeles and plot it.

Now, we want to answer the question: is the data trended?

To find out, we fit a linear regression model to the data and run a statistical significance test on the coefficient of the index, where the index is an artificially created time component. Here is the summary of the linear fit \([5]\):
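The fit and the confidence interval can be reproduced with a sketch like the one below, shown here on synthetic trended data in place of the annual series.

```r
# Synthetic stand-in for the annual Los Angeles temperatures
set.seed(531)
index <- 1:164
temperature <- 15.5 + 0.005 * index + rnorm(164, sd = 0.5)
losAngelos_data <- data.frame(index = index, temperature = temperature)

# Linear fit of temperature on the artificial time index
LR_model <- lm(temperature ~ index, data = losAngelos_data)
summary(LR_model)

# 99% confidence interval for the slope; a trend is claimed if 0 lies outside
confint(LR_model, "index", level = 0.99)
```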

## 
## Call:
## lm(formula = temperature ~ index, data = losAngelos_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27382 -0.29311 -0.02612  0.32705  1.14248 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.548e+01  7.786e-02 198.816  < 2e-16 ***
## index       4.711e-03  8.185e-04   5.755 4.23e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4962 on 162 degrees of freedom
## Multiple R-squared:  0.1697, Adjusted R-squared:  0.1646 
## F-statistic: 33.12 on 1 and 162 DF,  p-value: 4.226e-08

Here is the \(99\%\) confidence interval:

## For 99% Confidence interval of the coefficient, lower bound is  0.0026   and upper bound is  0.0068

Since \(0\) is outside the \(99\%\) confidence interval, we confirm that there is a linear trend. Before further analysis, we first detrend the data \([6]\). This step is crucial for ARMA model fitting, since the ARMA model assumes mean stationary data; a dataset with a positive trend does not satisfy the mean stationarity criterion.

detrended_annual_los = annual_los - predict(LR_model, losAngelos_data) # (annual data) - (linear fit of annual data)

3.2 Spectral Density Analysis for annual Los Angeles temperatures

We are going to use smoothing method for spectral density estimation \([4]\).

As we can see from the graph, the first major frequency is around 0.1 (first peak) and the second major frequency is around 0.27 (second peak). We will spot those peaks and find their respective periods.

## first major period is  10.588
## second major period is  3.75

So, the first major period is 10.588 years and the second major period is 3.75 years. A possible explanation for these cycles is periodic behavior in our solar system, such as the roughly 11-year solar cycle.

Now we fit the ARMA model to our detrended dataset.

3.3 ARMA fitting for annual Los Angeles temperatures

For ARMA fitting, we create an AIC table and then choose the model with the lowest AIC value. Here is the mathematical formulation of the ARMA model, followed by the AIC table.

\(\phi(B)\left(Y_{n}-\mu\right)=\psi(B) \epsilon_{n}\) with parameters \(\theta=\left(\phi_{1: p}, \psi_{1: q}, \mu, \sigma^{2}\right)\)

\[ \begin{aligned} \mu &=\mathbb{E}\left[Y_{n}\right] \\ \phi(x) &=1-\phi_{1} x-\cdots-\phi_{p} x^{p} \\ \psi(x) &=1+\psi_{1} x+\cdots+\psi_{q} x^{q} \\ \epsilon_{n} & \sim N\left[0, \sigma^{2}\right] \end{aligned} \]

Finding an accurate ARMA model is important for this analysis. We use the Akaike information criterion (AIC), which was derived as an approach to minimizing prediction error \([4]\). Models with lower AIC generally have lower prediction error, indicating a better fit. Relying solely on AIC can lead to mistakes in model choice, but it is often useful. We therefore calculate the AIC value for each model over a grid of p and q values. Below are the AIC values for p from 0 to 4 and q from 0 to 5.

        MA0     MA1     MA2     MA3     MA4     MA5
AR0  237.58  230.87  228.94  228.18  218.70  220.51
AR1  227.63  216.64  218.59  220.42  220.50  222.70
AR2  224.10  218.60  220.37  222.34  222.47  222.30
AR3  222.31  220.36  222.53  220.14  221.75  222.49
AR4  217.31  219.14  219.57  221.31  222.70  224.24
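A sketch of how such an AIC table is built (in the spirit of the course notes \([4]\)), demonstrated on a synthetic ARMA(1,1) series rather than the detrended temperatures:

```r
# Tabulate AIC over a grid of ARMA(p, q) orders; tryCatch guards against
# occasional non-convergence of arima()
aic_table <- function(data, P, Q) {
  table <- matrix(NA, P + 1, Q + 1)
  for (p in 0:P) {
    for (q in 0:Q) {
      table[p + 1, q + 1] <- tryCatch(
        arima(data, order = c(p, 0, q))$aic,
        error = function(e) NA
      )
    }
  }
  dimnames(table) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  table
}

set.seed(531)
y <- arima.sim(n = 164, model = list(ar = 0.9, ma = -0.7))
aic_table(y, 2, 2)
```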

From the above table, most models have AIC values around 220, but ARMA(1,1) has the lowest, 216.64. We therefore use ARMA(1,1) to fit our yearly temperature data.

##       ar1       ma1 intercept 
##    0.9091   -0.7410    0.0179

After fitting the data with the ARMA(1,1), the model could be written as below: \[ \begin{aligned} \phi(B)\left(Y_{n}-0.0179\right)=\psi(B) \epsilon_{n}\\ \end{aligned} \]

\[ \begin{aligned} \phi(x) &=1 - 0.9091x \\ \psi(x) &=1 - 0.7410x \\ \end{aligned} \]

Besides, we also need to calculate the roots of the AR and MA polynomials, which helps us check the causality and invertibility of the model.

3.4 Diagnosis of the ARMA model for annual Los Angeles temperatures

Let’s start by checking for causality and invertibility.
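This amounts to finding the polynomial roots with polyroot(); a sketch using the fitted coefficients reported above (ar1 = 0.9091, ma1 = -0.7410):

```r
# Coefficients from the fitted ARMA(1,1); polyroot() takes coefficients in
# increasing order of powers, so c(1, -0.9091) encodes 1 - 0.9091 x
ar_roots <- polyroot(c(1, -0.9091))   # roots of phi(x) = 1 - 0.9091 x
ma_roots <- polyroot(c(1, -0.7410))   # roots of psi(x) = 1 - 0.7410 x
cat("Absolute value of AR polynomial roots are:", abs(ar_roots), "\n")
cat("Absolute value of MA polynomial roots are:", abs(ma_roots), "\n")
```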

## Absolute value of AR polynomial roots are:  1.099956
## 
## Absolute value of MA polynomial roots are:  1.349567

The absolute value of the AR root is 1.099 and that of the MA root is 1.35. Because both are larger than 1 (outside the unit circle), the model meets the requirements of causality and invertibility. Therefore, the ARMA(1,1) model is reliable.

Now, we check our assumptions on the residuals of our model. Those assumptions are:

  1. They should be normally distributed

  2. They should be white noise (\(\gamma(h) = 0\) for \(h \neq 0\)): no autocorrelation.

For normality check we plot qq plots.
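The check can be sketched as below; the ARMA(1,1) is refitted here on a synthetic series so the snippet runs on its own, standing in for the model fitted to the detrended data.

```r
# Refit an ARMA(1,1) on simulated data as a stand-in for the report's model
set.seed(531)
y <- arima.sim(n = 164, model = list(ar = 0.9, ma = -0.7))
arma11 <- arima(y, order = c(1, 0, 1))
res <- residuals(arma11)

qqnorm(res)   # sample quantiles against normal quantiles
qqline(res)   # reference line through the quartiles
```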

From the plot above, the residuals fluctuate around a mean of roughly 0, and their variance does not appear to change over time. This suggests the residuals have mean zero and constant variance as time increases. Next, we check the normality of the residuals.

From the QQ plot of the residuals, although some points at both tails deviate from the line, almost all other points lie on the solid black line. Therefore, we conclude that the residuals approximately satisfy the normality assumption.

To check the white noise assumption, we plot the ACF of the residuals.
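A matching sketch of the ACF check, again self-contained via a synthetic refit:

```r
# Residuals of an ARMA(1,1) refitted on simulated data (stand-in series)
set.seed(531)
y <- arima.sim(n = 164, model = list(ar = 0.9, ma = -0.7))
res <- residuals(arima(y, order = c(1, 0, 1)))

# For white noise, all lags h > 0 should stay within the confidence band
acf(res, main = "ACF of ARMA(1,1) residuals")
```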

From the plot, it is reasonable to say that the residuals are white noise.

4 Trend Analysis, Spectral Density Analysis and ARMA fitting for annual Mexico City temperatures

4.1 Trend Analysis for annual Mexico City temperatures

Let’s extract the annual data for Mexico City and plot it.

Now, we follow steps similar to those in Section 3.1 to test for a trend. Here is the \(99\%\) confidence interval:

## For 99% Confidence interval of the coefficient, lower bound: 0.0059 and upper bound: 0.0083

As for Los Angeles, there is also an upward trend in annual temperatures for Mexico City (since \(0\) is outside the \(99\%\) confidence interval for the index coefficient). Consequently, we use the detrended data for the rest of our analysis.

detrended_annual_mex = annual_mex - predict(LR_model_mex, mexico_data)

4.2 Spectral Density Analysis for annual Mexico City temperatures

We use the same method (smoothing method) for spectral density estimation.

As we can see, there are again two dominant frequencies similar to those we spotted in the Los Angeles data. Let’s find what exactly they are.

## first major period is  10.588
## second major period is  3.913

So, the first major period is 10.588 years and the second major period is 3.913 years. The first major periods are the same for both cities, and the second major periods are close to each other. We discuss this further in the conclusion (Section 5).

4.3 ARMA fitting for annual Mexico City temperatures

We are going to use ARIMA(3,0,2), which has a reasonably small AIC value.

##       ar1       ar2       ar3       ma1       ma2 intercept 
##    1.0085   -0.7453    0.3089   -0.7521    0.6749   -0.0055

4.4 Diagnosis of the ARMA model for annual Mexico City temperatures

Again we start by checking for causality and invertibility of the model.

## Absolute value of AR polynomial roots are:  1.40995 1.40995 1.628712
## 
## Absolute value of MA polynomial roots are:  1.217227 1.217227

Since all roots are outside the unit circle, the model is causal and invertible.

Let’s continue with the normality check of the residuals using QQ plots.

The QQ plot also confirms the normality of the residuals.

Lastly, here is the ACF plot for the white noise assumption check.

5 Conclusion

In conclusion, we found that the results of the analysis of average temperatures are similar for Los Angeles and Mexico City. For both cities, we find an upward (global warming) trend as time goes on. We also found that the annual temperatures of both cities share a dominant period of 10.588 years; a possible reason is that the two cities are geographically close to each other. Finally, we found ARMA(1,1) suitable for fitting the detrended Los Angeles data and ARMA(3,2) suitable for fitting the detrended Mexico City data.

References

\([1]\) - Online source: https://climate.nasa.gov/resources/global-warming-vs-climate-change/

\([2]\) - Online source: https://data.world/data-society/global-climate-change-data/workspace/file?filename=GlobalLandTemperatures%2FGlobalLandTemperaturesByMajorCity.csv

\([3]\) - Online source (Solution for previous homework): https://ionides.github.io/531w21/hw03/sol03.html

\([4]\) - Lecture notes for Stats 531: https://ionides.github.io/531w22/

\([5]\) - Online source: https://projects.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%205%20Trend%20Tests.htm

\([6]\) - Online source: https://www.statology.org/detrend-data/