In the age of information, more and more people enjoy the benefits of weather forecasts. Weather predictions can be accessed conveniently and instantly, making them popular among a wide variety of people, who rely on temperature forecasts to decide their outfits and travel plans. It is therefore of great importance and interest to identify the elements involved in predicting temperature, and to find the best predictive model for the time series data.
Comparatively speaking, India has one of the most extreme climates in the world. Unlike most other tropical countries, India has a very hot summer (April-July) and a very cold winter (December-January). In 2019, floods caused by torrential rain became a catastrophe in India, causing over 140 deaths and forcing hundreds of thousands of people to evacuate their homes. It is therefore of great interest to investigate Indian time series data that incorporates the relevant features of a weather report, such as wind speed, humidity and pressure. We obtained our Daily Climate Time Series Data from Kaggle.com, with daily reports of mean temperature, humidity, wind speed and mean pressure from January 1, 2013 to April 24, 2017 in the capital territory of India. The dataset contains 1462 observations and 4 features for the city of Delhi. Statistically speaking, every feature can contribute to changes in the other features. However, in this project we will mainly focus on the interpretation of temperature, as it is the most intuitive quantity and the one most commonly used by citizens. Our goal is to find the best model and quantify the extent to which temperature is influenced by the other variables.
In this report, we explore two different methods to remove trend and seasonality from the original data. We first difference the data and fit ARMA models that investigate the relationship between mean temperature changes and the other variables; we then use a non-difference approach, fitting regression models with ARMA errors on the original scale of the data without differencing adjacent data points. In this way, we can find out whether there is a relationship between the temperature of Delhi and the other three meteorological features. Finally, we compare the models we have adopted and analyze the output, with implications for the real world.
Our data set consists of four variables: mean temperature, wind speed, humidity and mean pressure. From the plot we can see that the first three variables show oscillatory behavior, while pressure is stable almost all the time despite some outliers.
Now we focus on the temperature. To identify the cycle, we first take a look at the spectrum.
The unsmoothed spectrum plot reveals a cycle with frequency 0.00267 cycles per day in the raw data, which corresponds to a period of about one year (1/0.00267 ≈ 375 days). This matches our initial expectation for the length of the period. To view the cycle more clearly, we decompose the data into different frequency bands: the high-frequency component can be regarded as noise and the low-frequency component as trend.
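As an illustrative sketch (Python on synthetic data, not the report's R analysis), a periodogram of a daily series with an annual cycle peaks near 1/365 ≈ 0.0027 cycles per day, consistent with the frequency of 0.00267 found above:

```python
import numpy as np
from scipy.signal import periodogram

# Synthetic daily series with the same length as the dataset (1462 days)
# and a pure annual cycle; the periodogram should peak near 1/365.25.
t = np.arange(1462)
y = 25 + 10 * np.sin(2 * np.pi * t / 365.25)

f, pxx = periodogram(y, fs=1.0)   # fs = 1 sample per day
peak = f[np.argmax(pxx)]

assert abs(peak - 1 / 365.25) < 0.001   # dominant frequency ~ 1 cycle per year
```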
From the plot we can see the seasonality, and we can also identify the non-stationarity of the data. To explore the relationship between temperature and the other three meteorological features, we need to remove the trend and seasonality from the original data. The reason is that trend and seasonality may be caused by confounders, e.g. the seasonal cycle of Delhi's climate and global warming, and would therefore introduce spurious relationships between temperature and the other variables. So we first remove trend and seasonality before doing the subsequent analysis.
In this report, we used two different methods to remove trend and seasonality from the original data.
For this part, we try to difference the mean temperature data such that \[\Delta Y_n=Y_n-Y_{n-1}\]
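As a small illustrative sketch (Python on synthetic data, not the report's R code), first differencing turns a series with a linear trend into one that is mean stationary:

```python
import numpy as np

# Synthetic series with a linear trend plus noise; after differencing,
# the mean is roughly the trend slope (0.5), constant over time.
rng = np.random.default_rng(0)
t = np.arange(200)
y = 0.5 * t + rng.normal(size=200)   # linear trend + noise
dy = np.diff(y)                      # Delta Y_n = Y_n - Y_{n-1}

assert dy.shape == (199,)
assert abs(dy.mean() - 0.5) < 0.1    # mean near the trend slope
```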
plot(ta$date[-c(1)],diff(ta$meantemp),type='l',main = '',ylab = "Differenced temperature",xlab = "Date")
Now the differenced series appears mean stationary around zero, and covariance stationarity no longer seems to be a problem.
adf.test(diff(ta$meantemp))
## Warning in adf.test(diff(ta$meantemp)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(ta$meantemp)
## Dickey-Fuller = -14.011, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
The Augmented Dickey-Fuller test also rejects the null hypothesis, indicating that the differenced time series is stationary.
The estimated ACF plot shows an exponential decay in the estimated autocorrelation. Almost all values lie inside the dashed lines, which suggests they are not significantly different from 0, indicating stationarity. We also difference humidity, wind speed and mean pressure, and use the differenced series to fit the models. The adjusted data are shown below.
To analyze the relationship between mean temperature changes and other variables, we’ll fit four models separately.
For model 1, we want to explore the relation between the changes of mean temperature and humidity. To choose a proper model, we use Akaike's information criterion (AIC) to select the best ARMA model for the noise. We create an AIC table with autoregressive order up to 4 and moving average order up to 4; models with AR or MA order of 5 or more are considered likely to overfit.
|     | MA0     | MA1     | MA2     | MA3     | MA4     |
|-----|---------|---------|---------|---------|---------|
| AR0 | 4822.32 | 4823.92 | 4783.26 | 4728.27 | 4726.92 |
| AR1 | 4824.04 | 4760.65 | 4737.27 | 4726.85 | 4728.85 |
| AR2 | 4796.88 | 4732.05 | 4732.61 | 4728.85 | 4730.83 |
| AR3 | 4748.53 | 4731.22 | 4730.42 | 4730.74 | 4732.76 |
| AR4 | 4743.54 | 4728.97 | 4730.82 | 4732.79 | 4734.71 |
The model to consider here is ARMA(1,3), which has the lowest AIC value. An alternative is ARMA(1,1), which has a larger AIC value but fewer parameters. To decide between them, we conduct a hypothesis test using Wilks' approximation: the null hypothesis is that ARMA(1,1) is adequate, while the alternative corresponds to ARMA(1,3). We then use the approximation \[\Lambda=2(l_1-l_0)\approx \chi^2_{D_1-D_0}\]
where \(l_i\) is the maximum log likelihood under hypothesis \(H_i\) and \(D_i\) is the number of parameters under hypothesis \(H_i\). If \(\Lambda\) is bigger than \(\chi^2\) cutoff, we’ll reject the null hypothesis. The maximum log likelihood can be estimated by the function \(arima()\). Then \(\Lambda=37.796\), and the cutoff of \(\chi^2_2\) is
qchisq(0.95,df=2)
## [1] 5.991465
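The same comparison can be sketched in Python (the statistic is the value reported above; the cutoff matches the `qchisq` output):

```python
from scipy.stats import chi2

# Sketch of the Wilks likelihood ratio test, using the reported statistic
# Lambda = 2*(l1 - l0) = 37.796 and D1 - D0 = 2 extra MA parameters.
Lambda = 37.796
cutoff = chi2.ppf(0.95, df=2)    # 95% cutoff of chi-square with 2 df

assert round(cutoff, 6) == 5.991465
assert Lambda > cutoff           # reject H0; prefer ARMA(1,3)
```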
This means we will reject the null hypothesis and ARMA(1,3) is better. Therefore our model 1 is \[\Delta(temperature)_t=\beta_0+\beta_1\Delta(humidity)_t+\epsilon_t\]
where \(\epsilon_t\) is ARMA(1,3) and the coefficients are shown below.
##
## Call:
## arima(x = diff(ta$meantemp), order = c(1, 0, 3), xreg = diff(ta$humidity))
##
## Coefficients:
## ar1 ma1 ma2 ma3 intercept diff(ta$humidity)
## 0.2268 -0.2899 -0.1592 -0.1568 0.0033 -0.1370
## s.e. 0.1161 0.1149 0.0296 0.0390 0.0162 0.0041
##
## sigma^2 estimated as 1.474: log likelihood = -2356.43, aic = 4726.85
Then we check the AR and MA roots. It seems that all the roots lie outside the unit circle, suggesting the desirable properties of causality and invertibility.
polyroot(c(1,-coef(result1)[c('ar1')]))
## [1] 4.409386+0i
polyroot(c(1,coef(result1)[c('ma1','ma2','ma3')]))
## [1] 1.306381+0.000000i -1.160867+1.880017i -1.160867-1.880017i
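The same root check can be sketched in Python, using the (rounded) model-1 coefficients from the `arima()` output above:

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Roots of the AR polynomial 1 - phi_1 z and the MA polynomial
# 1 + theta_1 z + theta_2 z^2 + theta_3 z^3 should lie outside the unit circle.
ar_roots = P.polyroots([1.0, -0.2268])                    # 1 - 0.2268 z
ma_roots = P.polyroots([1.0, -0.2899, -0.1592, -0.1568])  # 1 + theta(z)

assert all(abs(r) > 1 for r in ar_roots)   # causal
assert all(abs(r) > 1 for r in ma_roots)   # invertible
assert abs(abs(ar_roots[0]) - 4.4094) < 1e-3
```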
Since we have our model with ARMA(1,3) errors, we need to check whether the model assumptions are valid. First we look at the residuals as a time series plot: there are no abnormal patterns, and the mean of the residuals is around 0. Then we check the ACF of the residuals. Almost all values lie inside the dashed lines apart from lag 20, suggesting no autocorrelation between residuals; having one of 32 lags fall outside the dashed lines is acceptable for a \(95\%\) acceptance region under the null hypothesis. Finally, we check the assumption that \(\{\epsilon_t\}\sim N(0,\sigma^2)\) with a QQ plot. Compared to the normal distribution, the residuals of our model are somewhat heavy-tailed, so the standard errors in the previous result are not entirely reliable.
The arima() function in R uses the observed Fisher information to calculate standard errors for the coefficients. An approximate \(95\%\) confidence interval can be constructed from these standard errors. Here we consider inference on the coefficient of the change in humidity. From the model output, we can construct the CI \[[-0.1370-1.96*0.0041,-0.1370+1.96*0.0041]=[-0.1450,-0.1290]\]
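The interval arithmetic above can be sketched directly (estimate and standard error are the reported values for the humidity term):

```python
# Fisher-information 95% confidence interval: estimate +/- 1.96 * s.e.
beta, se = -0.1370, 0.0041
lo, hi = beta - 1.96 * se, beta + 1.96 * se

assert round(lo, 4) == -0.145    # matches the reported -0.1450
assert round(hi, 4) == -0.129    # matches the reported -0.1290
```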
Since 0 is not in the confidence interval, the humidity term is significant. However, the assumption diagnostics reveal that the standard error may be unreliable, so we also use simulation to construct a confidence interval for \(\beta_1\); below is the distribution of results from 1000 simulations.
set.seed(12)
k <- 1000                                   # number of simulations
params <- coef(result1)
ar <- params[grep('^ar', names(params))]    # AR coefficients
ma <- params[grep('^ma', names(params))]    # MA coefficients
sigma <- sqrt(result1$sigma2)
theta <- matrix(NA, nrow = k, ncol = length(params),
                dimnames = list(NULL, names(params)))
for (i in 1:k) {
  # simulate from the fitted regression-with-ARMA(1,3)-error model
  Y <- arima.sim(list(ar = ar, ma = ma), n = length(ta$meantemp) - 1, sd = sigma) +
    params[6] * diff(ta$humidity) + params[5]
  # refit and store the estimated coefficients
  theta[i, ] <- coef(arima(Y, order = c(1, 0, 3), xreg = diff(ta$humidity)))
}
## 2.5% 97.5%
## -0.1451306 -0.1292738
In this plot, the dashed lines correspond to the interval calculated before, and the 95% confidence interval from the simulation is \([-0.1451,-0.1293]\), which is close to the Fisher-information CI for model 1. Since 0 is not in the CI, the change of humidity is statistically significantly associated with the change of temperature. The coefficient of differenced humidity is negative, so a decrease in the change of humidity is associated with an increase in the change of temperature.
Our second model focuses on the relation between changes of mean temperature and wind speed. With the same approach as model 1, we first search for a proper ARMA model for the error.
|     | MA0     | MA1     | MA2     | MA3     | MA4     |
|-----|---------|---------|---------|---------|---------|
| AR0 | 5646.08 | 5597.70 | 5568.55 | 5536.76 | 5537.42 |
| AR1 | 5609.94 | 5539.62 | 5541.29 | 5537.06 | 5538.80 |
| AR2 | 5596.27 | 5541.21 | 5512.48 | 5538.73 | 5540.27 |
| AR3 | 5559.93 | 5538.25 | 5538.51 | 5540.49 | 5542.41 |
| AR4 | 5553.30 | 5538.75 | 5540.47 | 5542.47 | 5544.13 |
It seems that the best model is ARMA(2,2), and an alternative model is ARMA(1,1). Notice, however, that the AIC of ARMA(2,2) is more than 2 below that of the larger nested ARMA(2,3) model. This is impossible under exact maximization, since adding a parameter cannot decrease the maximized log likelihood, so AIC can increase by at most 2 when one parameter is added. There must therefore be some numerical error, and the validity of the ARMA(2,2) fit is doubtful. First we use the same hypothesis test as in model 1: it turns out \(\Lambda=31.14\), and the cutoff of \(\chi^2_2\) is
qchisq(0.95,df=2)
## [1] 5.991465
So we’ll reject the null hypothesis that ARMA(1,1) is better. However, if we check the AR and MA roots of ARMA(2,2)
polyroot(c(1,-coef(result3)[c('ar1','ar2')]))
## [1] 1.031981-0i 1.344141+0i
polyroot(c(1,coef(result3)[c('ma1','ma2')]))
## [1] 1.030882+0.079158i 1.030882-0.079158i
some roots are near the unit circle, suggesting that the model is not stable. Together with the suspect AIC value, it is possible that ARMA(2,2) is overfitting. The roots of ARMA(1,1) are quite desirable, so ARMA(1,1) is our choice.
abs(polyroot(c(1,-coef(result4)[c('ar1')])))
## [1] 1.749814
abs(polyroot(c(1,coef(result4)[c('ma1')])))
## [1] 1.245916
Therefore, model 2 is \[\Delta(temperature)_t=\beta_0+\beta_1\Delta(wind\ speed)_t+\epsilon_t\]
where \(\epsilon_t\) is ARMA(1,1) and the coefficients are shown below
##
## Call:
## arima(x = diff(ta$meantemp), order = c(1, 0, 1), xreg = diff(ta$wind_speed))
##
## Coefficients:
## ar1 ma1 intercept diff(ta$wind_speed)
## 0.5715 -0.8026 0.0026 0.0250
## s.e. 0.0419 0.0289 0.0194 0.0095
##
## sigma^2 estimated as 2.577: log likelihood = -2764.81, aic = 5539.62
Again, we look at the model diagnostics to see whether the model above violates any of the assumptions.
From the plots, there is no striking pattern in the residuals that we should worry about, and there is no autocorrelation between residuals. The QQ plot reveals that the residuals are heavy-tailed, so the standard errors of model 2 are not entirely reliable. Again, we conduct a simulation.
Here, we check the significance of wind speed. As described above, we can construct a confidence interval for \(\beta_1\) from the estimated standard error: \[[0.0250-1.96*0.0095,0.0250+1.96*0.0095]=[0.0064,0.0436]\]
Since the standard errors may not be reliable, we also construct a CI with 1000 simulations.
## 2.5% 97.5%
## 0.006894073 0.044547646
The dashed lines correspond to the confidence interval calculated above, and the \(95\%\) simulation CI is \([0.0069,0.0445]\), which is close to it. Since 0 is not inside the CI, the change of wind speed is statistically significant and plays a role in changes of temperature. The coefficient of wind speed is positive, so a larger change in wind speed is associated with a larger temperature difference.
For model 3, we explore the relation between changes of mean temperature and mean pressure. The model selection procedure is the same as in the previous models. First, we select a model from the AIC table.
|     | MA0     | MA1     | MA2     | MA3     | MA4     |
|-----|---------|---------|---------|---------|---------|
| AR0 | 5502.41 | 5447.91 | 5405.78 | 5373.42 | 5372.82 |
| AR1 | 5464.05 | 5374.03 | 5375.19 | 5372.58 | 5374.47 |
| AR2 | 5444.13 | 5375.02 | 5353.89 | 5374.55 | 5376.55 |
| AR3 | 5405.00 | 5373.76 | 5374.38 | 5376.31 | 5377.92 |
| AR4 | 5393.15 | 5374.26 | 5376.22 | 5378.36 | 5373.26 |
The model to consider here is ARMA(1,3). We reject ARMA(0,4) for fear of overfitting, since it also lies on the edge of the table. An alternative choice is ARMA(1,1). Using Wilks' approximation, we test ARMA(1,1) as the null hypothesis against ARMA(1,3) as the alternative. Here \(\Lambda=5.45\) and the cutoff of \(\chi^2_2\) is 5.99, so we fail to reject the null hypothesis and choose ARMA(1,1).
polyroot(c(1,-coef(result6)[c('ar1')]))
## [1] 1.745004+0i
polyroot(c(1,coef(result6)[c('ma1')]))
## [1] 1.208809+0i
The roots of ARMA(1,1) suggest a desirable property of causality and invertibility. And the model 3 is \[\Delta(temperature)_t=\beta_0+\beta_1\Delta(pressure)_t+\epsilon_t\]
where \(\epsilon_t\) is ARMA(1,1) and the coefficients are shown as following
##
## Call:
## arima(x = diff(ta$meantemp), order = c(1, 0, 1), xreg = diff(ta$meanpressure))
##
## Coefficients:
## ar1 ma1 intercept diff(ta$meanpressure)
## 0.5731 -0.8273 0.0027 -0.3314
## s.e. 0.0386 0.0253 0.0161 0.0245
##
## sigma^2 estimated as 2.301: log likelihood = -2682.02, aic = 5374.03
Now we check the model assumptions; the plots are similar to those for the previous models.
From the plots we draw the same conclusions as before: there is no abnormal pattern in the residuals; there is no autocorrelation between the residuals; and, compared to the normal distribution, the residuals of model 3 are heavy-tailed. So we again need simulations.
From the result of arima(), we can calculate the confidence interval of \(\beta_1\), which is \[[-0.3314-1.96*0.0245,-0.3314+1.96*0.0245]=[-0.3794,-0.2834]\]
Since model 3 does not meet the normality assumption, we again conduct 1000 simulations to construct a confidence interval for \(\beta_1\); the plot shows the distribution of the results of the 1000 simulations.
## 2.5% 97.5%
## -0.3775430 -0.2854574
The dashed lines correspond to the CI calculated before, and the 95% simulation CI \([-0.3775,-0.2854]\) is close to it. Since 0 is also outside the confidence interval, the change of mean pressure is statistically significant and has an effect on the change of temperature. The coefficient of pressure is negative, so the change of pressure plays a negative role in the change of mean temperature.
For the last model, we include all three variables (humidity, wind speed and mean pressure). First, as before, we choose an ARMA model from the AIC table.
|     | MA0     | MA1     | MA2     | MA3     | MA4     |
|-----|---------|---------|---------|---------|---------|
| AR0 | 4689.21 | 4688.66 | 4640.28 | 4586.62 | 4581.69 |
| AR1 | 4689.46 | 4607.39 | 4588.10 | 4581.67 | 4583.57 |
| AR2 | 4662.03 | 4584.32 | 4585.13 | 4583.63 | 4585.51 |
| AR3 | 4616.45 | 4584.23 | 4584.74 | 4584.91 | 4587.17 |
| AR4 | 4606.43 | 4583.63 | 4585.62 | 4586.91 | 4588.26 |
It seems there are more numerical errors than in the previous models, but we can select ARMA(1,3), with ARMA(2,1) as an alternative. Again, we conduct a hypothesis test with ARMA(2,1) as the null hypothesis and ARMA(1,3) as the alternative. We have \(\Lambda=4.65\) and the 0.95 cutoff of \(\chi^2_1\) is
qchisq(0.95,df=1)
## [1] 3.841459
Therefore, we reject the null hypothesis and choose ARMA(1,3).
polyroot(c(1,-coef(result7)[c('ar1')]))
## [1] 3.229521+0i
polyroot(c(1,coef(result7)[c('ma1','ma2','ma3')]))
## [1] 1.242954-0.000000i -1.209748+2.155796i -1.209748-2.155796i
The AR and MA roots are all outside the unit circle, suggesting good properties of causality and invertibility. Model 4 is \[\Delta(temperature)_t=\beta_0+\beta_1\Delta(humidity)_t+\beta_2\Delta(wind\ speed)_t+\beta_3\Delta(pressure)_t+\epsilon_t\]
where \(\epsilon_t\) is ARMA(1,3) and following are the coefficients of model 4
##
## Call:
## arima(x = diff(ta$meantemp), order = c(1, 0, 3), xreg = all_xreg)
##
## Coefficients:
## ar1 ma1 ma2 ma3 intercept humidity wind_speed
## 0.3096 -0.4086 -0.1549 -0.1317 0.0033 -0.1322 -0.0281
## s.e. 0.1063 0.1060 0.0322 0.0412 0.0134 0.0041 0.0068
## meanpre
## -0.2326
## s.e. 0.0197
##
## sigma^2 estimated as 1.331: log likelihood = -2281.83, aic = 4581.67
Same as before, we check the assumptions of model 4.
From the plot, we cannot identify any abnormal pattern in the residuals, and there does not seem to be autocorrelation between residuals. Compared to the normal distribution, the residuals of model 4 are heavy-tailed. To address this problem, we again use simulation.
We can derive confidence intervals for \(\beta_1,\beta_2,\beta_3\) from the results of model 4, which are \[\beta_1:[-0.1322-1.96*0.0041,-0.1322+1.96*0.0041]=[-0.1402,-0.1241]\]
\[\beta_2:[-0.0281-1.96*0.0068,-0.0281+1.96*0.0068]=[-0.0414,-0.0148]\]
\[\beta_3:[-0.2326-1.96*0.0197,-0.2326+1.96*0.0197]=[-0.2712,-0.1940]\]
## 2.5% 97.5% 2.5% 97.5% 2.5% 97.5%
## -0.13974427 -0.12403137 -0.04195919 -0.01579606 -0.26859836 -0.19323215
The dashed lines correspond to the intervals above, and the 95% simulation confidence intervals are similar. Since 0 is outside all the CIs, all three variables are statistically significant and have effects on the change of temperature. All the coefficients are negative; therefore, an increase in the difference of any of these variables is associated with a decrease in the temperature difference.
Notice that when fitting the model with all three variables, the coefficient of humidity is stable, but the coefficient of wind speed changes sign to negative and that of mean pressure changes considerably. This may indicate that humidity is not correlated with the other two variables, while the changes of wind speed and mean pressure may be related to each other, which can be one of our future tasks.
The second method is to use a regression model with ARMA errors on the original scale of the data, without differencing adjacent data points. The idea is to first extract the high-frequency part of the data to remove the confounding of season and trend, and then model the extracted data to find out whether there is a relationship between the temperature of Delhi and the other three meteorological features.
To extract the high-frequency part of the data, we used LOESS method to remove seasonality and trend. The plots of data after processing are shown in Figure 4.1. From the plots, we can see that only the high-frequency part of the features were extracted and trend and seasonality were removed.
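The extraction step can be sketched as follows (Python on synthetic data, not the report's R code; a centered moving average stands in for the LOESS smoother used in the report):

```python
import numpy as np

# Synthetic daily series: level + annual cycle + noise. Subtracting a smooth
# low-frequency fit leaves only the high-frequency component.
rng = np.random.default_rng(1)
t = np.arange(1462)                                  # same length as the dataset
y = 25 + 10 * np.sin(2 * np.pi * t / 365.25) + rng.normal(0, 1, t.size)

window = 61                                          # ~2-month smoothing window
smooth = np.convolve(y, np.ones(window) / window, mode="same")
high_freq = y - smooth                               # trend/seasonality removed

interior = high_freq[window:-window]                 # drop edge effects
assert abs(interior.mean()) < 0.3                    # residual centered near zero
```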
To give a rough idea of whether there is a marginal relationship between any pair of features, pairwise Pearson correlations are plotted in Figure 4.2. From the plot, we can see negative correlations between temperature and pressure, as well as between temperature and humidity. The relationship between temperature and wind speed cannot be directly read from the correlation plot.
To investigate the potential lagged relationships between temperature and other three meteorological features, sample cross-correlation functions (CCF) between pairwise time series were calculated, where CCF can be formulated as, \[ \begin{align} \hat{\rho}_{xy}(h) = \frac{\sum_{n=1}^{N-h} (x_{n+h} - \bar{x})(y_n - \bar{y})}{\sqrt{\sum_{n=1}^{N} (x_{n} - \bar{x})^2\sum_{n=1}^{N} (y_{n} - \bar{y})^2}} \end{align} \]
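The formula above can be implemented directly; this sketch (synthetic data, purely illustrative) recovers a known lag of 3 between two series:

```python
import numpy as np

def ccf(x, y, h):
    """Sample rho_xy(h) for lag h >= 0, as in the formula in the text."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    num = np.sum(xm[h:] * ym[:len(y) - h])   # sum over n of (x_{n+h})(y_n) terms
    den = np.sqrt(np.sum(xm**2) * np.sum(ym**2))
    return num / den

# Build x so that x_{n+3} ~ y_n: the CCF should peak at h = 3.
rng = np.random.default_rng(2)
y = rng.normal(size=500)
x = np.concatenate([np.zeros(3), y[:-3]]) + 0.1 * rng.normal(size=500)

best = max(range(10), key=lambda h: ccf(x, y, h))
assert best == 3
assert ccf(x, y, 3) > 0.9
```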
The CCF plots are shown in Figure 4.3. The CCF plot of temperature vs pressure shows a negative lagged and instantaneous relationship between temperature and pressure: the lower the pressure, the higher the temperature. The same conclusion holds for the lagged and instantaneous relationship between temperature and humidity, which also shows a negative relationship. For temperature and wind speed, the CCF plot shows a positive instantaneous relationship, but its magnitude is not very strong.
To further investigate the cross-covariance in frequency domain, the coherency and phase of cross-spectrum were investigated between temperature and other three meteorological features. The plots are shown in Figure 4.4.
From the first row of the plots, we can identify two significant frequencies at which temperature and pressure are correlated: 1 cycle per 2.2 days and 1 cycle per 88 days. The phase plot shows that the phases at these two frequencies are negative, meaning there is a lagged effect between temperature and pressure at these frequencies. We conjecture that the frequency of 1 cycle per 88 days is significant because the seasonality has not been completely removed from the original data, since we only focus on the high-frequency relationship between the two time series.
The second row shows the coherency and phase plots of temperature and humidity. The coherency plot suggests a correlation between temperature and humidity at almost every frequency. This may indicate a complicated relationship between temperature and humidity, and further research needs to be conducted.
The third row suggests that the relationship between temperature and wind speed is negligible at almost all frequencies, since the 95% confidence intervals at nearly all frequencies cover 0. This indicates that there may be no relationship between temperature and wind speed, or that the relationship is very weak.
To quantify the relationship between temperature and the other three features, a regression model with ARMA errors was fitted. To find appropriate AR and MA orders, AIC values were calculated for different ARMA models of temperature without any regressor, as shown in Table 4.1. The optimal model was selected by trial and error, starting from the model with the smallest AIC and checking whether the model was non-invertible or non-causal. The ARMA model with AR and MA orders \((1,2)\) was selected as the optimal model.
|     | MA0     | MA1     | MA2     | MA3     | MA4     | MA5     |
|-----|---------|---------|---------|---------|---------|---------|
| AR0 | 6095.44 | 5570.75 | 5394.55 | 5366.16 | 5355.67 | 5354.95 |
| AR1 | 5350.53 | 5352.19 | 5349.50 | 5350.78 | 5352.34 | 5354.09 |
| AR2 | 5352.12 | 5286.89 | 5350.25 | 5351.97 | 5281.68 | 5282.98 |
| AR3 | 5349.77 | 5350.05 | 5286.57 | 5353.99 | 5353.83 | 5355.46 |
| AR4 | 5350.84 | 5351.97 | 5288.55 | 5290.50 | 5282.56 | 5284.53 |
| AR5 | 5351.65 | 5283.13 | 5283.91 | 5293.96 | 5292.56 | 5289.24 |
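The table's bookkeeping can be checked against the identity AIC = 2k - 2*loglik, using values reported for the selected ARMA(1,2) null model:

```python
# The ARMA(1,2) null model reports loglik = -2669.75 with k = 5 estimated
# parameters (ar1, ma1, ma2, intercept, sigma^2); the implied AIC should
# match the AR1/MA2 entry of the table.
loglik, k = -2669.75, 5
aic = 2 * k - 2 * loglik

assert abs(aic - 5349.50) < 0.01
```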
The estimates of the parameters with standard errors are shown in Table 4.2, and the corresponding optimal null model can be formulated as
\[\begin{align} (Y_t + 0.01) = 0.53\,(Y_{t-1} + 0.01) + \epsilon_t + 0.12\,\epsilon_{t-1} + 0.09\,\epsilon_{t-2} \end{align}\]
|          | ar1  | ma1  | ma2  | intercept |
|----------|------|------|------|-----------|
| estimate | 0.53 | 0.12 | 0.09 | -0.01     |
| se       | 0.06 | 0.06 | 0.04 | 0.10      |
The marginal relationships between temperature and other meteorological features were investigated using regression model with ARMA error and likelihood ratio tests were performed to check the significance of the correlations.
Before performing the likelihood ratio tests, the null model was fitted: \[\begin{equation} Temperature^{hs}_t = \beta_0 + \epsilon_t \tag{4.1} \end{equation}\] where \(\epsilon_t\) is an ARMA(1,2) error and the superscript “hs” denotes the high-frequency component of the time series. The log likelihood of model (4.1) was -2669.75.
Model diagnostics for the null model were also performed; the diagnostic plots are shown in Figure 4.5. There do not appear to be any violations of the model assumptions. The absolute values of the AR and MA roots were also calculated: 1.9 for the AR root and (3.28, 3.28) for the MA roots. All roots lie outside the unit circle, indicating no violations of invertibility or causality.
The following model was fitted,
\[\begin{equation} Temperature^{hs}_t = \beta_0 + \beta_1 Pressure^{hs}_t + \epsilon_t \tag{4.2} \end{equation}\]
where \(\epsilon_t\) is an ARMA(1,2) error and the superscript “hs” denotes the high-frequency component of the time series.
|          | ar1  | ma1  | ma2  | intercept | pressure |
|----------|------|------|------|-----------|----------|
| estimate | 0.51 | 0.13 | 0.08 | -0.01     | -0.31    |
| se       | 0.07 | 0.07 | 0.05 | 0.09      | 0.02     |
The fitted model gave a pressure coefficient of -0.31 with standard error 0.02. The LRT statistic was 158.16, which follows a chi-square distribution with 1 degree of freedom; the corresponding p-value is essentially 0, so the coefficient is significantly different from \(0\). From the sign of the coefficient, we conclude that there is a significantly negative correlation between temperature and pressure.
The following model was fitted,
\[\begin{equation} Temperature^{hs}_t = \beta_0 + \beta_1 Humidity^{hs}_t + \epsilon_t \tag{4.3} \end{equation}\]
where \(\epsilon_t\) is an ARMA(1,2) error and the superscript “hs” denotes the high-frequency component of the time series.
|          | ar1  | ma1  | ma2  | intercept | humidity |
|----------|------|------|------|-----------|----------|
| estimate | 0.49 | 0.33 | 0.14 | -0.01     | -0.14    |
| se       | 0.06 | 0.06 | 0.05 | 0.08      | 0.00     |
The fitted model gave a humidity coefficient of -0.14 with standard error below 0.005 (reported as 0.00). The LRT statistic was 804.76, which follows a chi-square distribution with 1 degree of freedom; the corresponding p-value is essentially 0, so the coefficient is significantly different from \(0\). From the sign of the coefficient, we conclude that there is a significantly negative correlation between temperature and humidity.
The following model was fitted,
\[\begin{equation} Temperature^{hs}_t = \beta_0 + \beta_1 WindSpeed^{hs}_t + \epsilon_t \tag{4.4} \end{equation}\]
where \(\epsilon_t\) is an ARMA(1,2) error and the superscript “hs” denotes the high-frequency component of the time series.
|          | ar1  | ma1  | ma2  | intercept | windspeed |
|----------|------|------|------|-----------|-----------|
| estimate | 0.52 | 0.12 | 0.09 | -0.01     | 0.02      |
| se       | 0.06 | 0.07 | 0.04 | 0.10      | 0.01      |
The fitted model gave a wind speed coefficient of 0.02 with standard error 0.01. The LRT statistic was 6.52, which follows a chi-square distribution with 1 degree of freedom; the corresponding p-value is 0.01, so the coefficient is significantly different from \(0\). From the sign of the coefficient, we conclude that there is a significantly positive relationship between temperature and wind speed. However, this positive relationship is weaker than those of the other two features, since its p-value is near the critical threshold of the test.
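The three marginal likelihood ratio tests above can be sketched numerically (the statistics are the values reported in this section):

```python
from scipy.stats import chi2

# Each LRT statistic is compared against a chi-square with 1 df.
stats = {"pressure": 158.16, "humidity": 804.76, "wind_speed": 6.52}
pvals = {name: chi2.sf(stat, df=1) for name, stat in stats.items()}

assert pvals["pressure"] < 1e-10 and pvals["humidity"] < 1e-10  # essentially 0
assert round(pvals["wind_speed"], 2) == 0.01                    # near the threshold
```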
Since pressure, humidity and wind speed may be correlated with one another, a joint model regressing temperature on all three was fitted,
\[\begin{equation} Temperature^{hs}_t = \beta_0 + \beta_1 Pressure^{hs}_t + \beta_2 Humidity^{hs}_t + \beta_3 WindSpeed^{hs}_t + \epsilon_t \tag{4.5} \end{equation}\]
where \(\epsilon_t\) is an ARMA(1,2) error and the superscript “hs” denotes the high-frequency component of the time series.
The parameters of the joint model are shown in Table 4.6. The coefficients for pressure, humidity and wind speed are -0.22, -0.13 and -0.03, with corresponding p-values of \(<0.00001\), \(<0.00001\) and \(0.00001\) respectively. From the joint model, the negative relationships of temperature with pressure and with humidity are consistent with the previous marginal analysis; however, the relationship between temperature and wind speed changes sign after controlling for humidity and pressure. We conjecture that this is because wind speed is correlated with humidity and pressure.
|          | ar1  | ma1  | ma2  | intercept | pressure | humidity | windspeed |
|----------|------|------|------|-----------|----------|----------|-----------|
| estimate | 0.48 | 0.30 | 0.12 | -0.01     | -0.22    | -0.13    | -0.03     |
| se       | 0.06 | 0.06 | 0.05 | 0.08      | 0.02     | 0.00     | 0.01      |
In this report, we used two different methods to remove trend and seasonality from the original data. We first differenced the data and fit ARMA models investigating the relationship between mean temperature changes and the other variables. We reached the following sub-conclusions:
We then tried the non-difference approach, fitting regression models with ARMA errors on the original scale of the data without differencing adjacent data points. We reached the following sub-conclusions based on the non-difference ARMA models: