In the age of information, more and more people enjoy the benefits of weather forecasts. Weather predictions can be accessed conveniently and instantly, making them popular among a wide variety of people, who rely on temperature forecasts to decide their outfits and travel plans. It is therefore of great importance and interest to identify the elements involved in predicting temperature, and to find the best predictive model for the time series data.
Comparatively speaking, India has one of the most extreme climates in the world. Unlike most other tropical countries, India has a very hot summer (April-July) and a very cold winter (December-January). In 2019, floods caused by torrential rain became a catastrophe in India, causing over 140 deaths and forcing hundreds of thousands of people to evacuate their homes. It is therefore of great interest to investigate Indian time series data that incorporates the relevant features of a weather report, such as wind speed, humidity and pressure. We obtained our Daily Climate Time Series Data from Kaggle.com, with daily reports of mean temperature, humidity, wind speed and mean pressure from January 1, 2013 to April 24, 2017 in the capital territory of India. The dataset contains 1462 observations and 4 features for the city of Delhi. Statistically speaking, every feature can contribute to changes in the other features. However, in this project we will mainly focus on the interpretation of temperature, as it is the most intuitive quantity and the one most commonly used by citizens. Our goal is to find the best model and quantify the extent to which temperature is influenced by the other variables.
In this report, we explore two different methods to remove trend and seasonality from the original data. We first difference the data and fit ARMA models that investigate the relationship between mean temperature changes and the other variables; we then use a non-difference approach, fitting regression models with ARMA errors on the original scale of the data without differencing adjacent data points. In this way, we can find out whether there is a relationship between the temperature of Delhi and the other three meteorological features. Finally, we compare the models we have adopted and analyze the output, with implications for the real world.
Our data set consists of four variables: mean temperature, wind speed, humidity and mean pressure. From the plot we can see that the first three variables show oscillatory behavior, while pressure is stable almost all the time despite some outliers.
Now we focus on the temperature. To identify the cycle, we first take a look at the spectrum.
The unsmoothed spectrum plot reveals a cycle with frequency 0.00267 cycles per day in the raw data, which corresponds to a period of about one year (1/0.00267 ≈ 375 days). This matches our initial expectation for the length of the period. To view the cycle more clearly, we decompose the data into different frequency bands: the high-frequency component can be regarded as noise and the low-frequency component as trend.
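As an illustrative sketch (Python on synthetic data, not the report's R analysis), a periodogram of a daily series with an annual cycle peaks near 1/365 ≈ 0.0027 cycles per day, consistent with the frequency of 0.00267 found above:

```python
import numpy as np
from scipy.signal import periodogram

# Synthetic daily series with the same length as the dataset (1462 days)
# and a pure annual cycle; the periodogram should peak near 1/365.25.
t = np.arange(1462)
y = 25 + 10 * np.sin(2 * np.pi * t / 365.25)

f, pxx = periodogram(y, fs=1.0)   # fs = 1 sample per day
peak = f[np.argmax(pxx)]

assert abs(peak - 1 / 365.25) < 0.001   # dominant frequency ~ 1 cycle per year
```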
From the plot we can see the seasonality, and we can also identify the non-stationarity of the data. To explore the relationship between temperature and the other three meteorological features, we need to remove the trend and seasonality from the original data. The reason is that trend and seasonality may be caused by confounders, e.g. the seasonal cycle of Delhi's climate and global warming, and would therefore introduce spurious relationships between temperature and the other variables. So we first remove trend and seasonality before doing the subsequent analysis.
In this report, we used two different methods to remove trend and seasonality from the original data.
For this part, we try to difference the mean temperature data such that \[\Delta Y_n=Y_n-Y_{n-1}\]
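As a small illustrative sketch (Python on synthetic data, not the report's R code), first differencing turns a series with a linear trend into one that is mean stationary:

```python
import numpy as np

# Synthetic series with a linear trend plus noise; after differencing,
# the mean is roughly the trend slope (0.5), constant over time.
rng = np.random.default_rng(0)
t = np.arange(200)
y = 0.5 * t + rng.normal(size=200)   # linear trend + noise
dy = np.diff(y)                      # Delta Y_n = Y_n - Y_{n-1}

assert dy.shape == (199,)
assert abs(dy.mean() - 0.5) < 0.1    # mean near the trend slope
```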
plot(ta$date[-c(1)],diff(ta$meantemp),type='l',main = '',ylab = "Differenced temperature",xlab = "Date")
Now the differenced series appears mean stationary around zero, and covariance stationarity no longer seems to be a problem.
adf.test(diff(ta$meantemp))
## Warning in adf.test(diff(ta$meantemp)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(ta$meantemp)
## Dickey-Fuller = -14.011, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
The Augmented Dickey-Fuller test also rejects the null hypothesis, indicating that the differenced time series is stationary.
The estimated ACF plot shows an exponential decay in the estimated autocorrelation. Almost all values lie inside the dashed lines, which suggests they are not significantly different from 0, indicating stationarity. We also difference humidity, wind speed and mean pressure, and use the differenced series to fit the models. The adjusted data are shown below.
To analyze the relationship between mean temperature changes and other variables, we’ll fit four models separately.
For model 1, we want to explore the relation between the changes of mean temperature and humidity. To choose a proper model, we use Akaike's information criterion (AIC) to select the best ARMA model for the noise. We create an AIC table with autoregressive order up to 4 and moving average order up to 4; models with AR or MA order of 5 or more are considered likely to overfit.
|     | MA0     | MA1     | MA2     | MA3     | MA4     |
|-----|---------|---------|---------|---------|---------|
| AR0 | 4822.32 | 4823.92 | 4783.26 | 4728.27 | 4726.92 |
| AR1 | 4824.04 | 4760.65 | 4737.27 | 4726.85 | 4728.85 |
| AR2 | 4796.88 | 4732.05 | 4732.61 | 4728.85 | 4730.83 |
| AR3 | 4748.53 | 4731.22 | 4730.42 | 4730.74 | 4732.76 |
| AR4 | 4743.54 | 4728.97 | 4730.82 | 4732.79 | 4734.71 |
The model to consider here is ARMA(1,3), which has the lowest AIC value. An alternative is ARMA(1,1), which has a larger AIC value but fewer parameters. To decide between them, we conduct a hypothesis test using Wilks' approximation: the null hypothesis is that ARMA(1,1) is adequate, while the alternative corresponds to ARMA(1,3). We then use the approximation \[\Lambda=2(l_1-l_0)\approx \chi^2_{D_1-D_0}\]
where \(l_i\) is the maximum log likelihood under hypothesis \(H_i\) and \(D_i\) is the number of parameters under hypothesis \(H_i\). If \(\Lambda\) is bigger than \(\chi^2\) cutoff, we’ll reject the null hypothesis. The maximum log likelihood can be estimated by the function \(arima()\). Then \(\Lambda=37.796\), and the cutoff of \(\chi^2_2\) is
qchisq(0.95,df=2)
## [1] 5.991465
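The same comparison can be sketched in Python (the statistic is the value reported above; the cutoff matches the `qchisq` output):

```python
from scipy.stats import chi2

# Sketch of the Wilks likelihood ratio test, using the reported statistic
# Lambda = 2*(l1 - l0) = 37.796 and D1 - D0 = 2 extra MA parameters.
Lambda = 37.796
cutoff = chi2.ppf(0.95, df=2)    # 95% cutoff of chi-square with 2 df

assert round(cutoff, 6) == 5.991465
assert Lambda > cutoff           # reject H0; prefer ARMA(1,3)
```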
This means we will reject the null hypothesis and ARMA(1,3) is better. Therefore our model 1 is \[\Delta(temperature)_t=\beta_0+\beta_1\Delta(humidity)_t+\epsilon_t\]
where \(\epsilon_t\) is ARMA(1,3) and the coefficients are shown below.
##
## Call:
## arima(x = diff(ta$meantemp), order = c(1, 0, 3), xreg = diff(ta$humidity))
##
## Coefficients:
## ar1 ma1 ma2 ma3 intercept diff(ta$humidity)
## 0.2268 -0.2899 -0.1592 -0.1568 0.0033 -0.1370
## s.e. 0.1161 0.1149 0.0296 0.0390 0.0162 0.0041
##
## sigma^2 estimated as 1.474: log likelihood = -2356.43, aic = 4726.85
Then we check the AR and MA roots. It seems that all the roots lie outside the unit circle, suggesting the desirable properties of causality and invertibility.
polyroot(c(1,-coef(result1)[c('ar1')]))
## [1] 4.409386+0i
polyroot(c(1,coef(result1)[c('ma1','ma2','ma3')]))
## [1] 1.306381+0.000000i -1.160867+1.880017i -1.160867-1.880017i
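The same root check can be sketched in Python, using the (rounded) model-1 coefficients from the `arima()` output above:

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Roots of the AR polynomial 1 - phi_1 z and the MA polynomial
# 1 + theta_1 z + theta_2 z^2 + theta_3 z^3 should lie outside the unit circle.
ar_roots = P.polyroots([1.0, -0.2268])                    # 1 - 0.2268 z
ma_roots = P.polyroots([1.0, -0.2899, -0.1592, -0.1568])  # 1 + theta(z)

assert all(abs(r) > 1 for r in ar_roots)   # causal
assert all(abs(r) > 1 for r in ma_roots)   # invertible
assert abs(abs(ar_roots[0]) - 4.4094) < 1e-3
```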
Since we have our model with ARMA(1,3) errors, we need to check whether the model assumptions are valid. First we look at the residuals as a time series plot: there are no abnormal patterns, and the mean of the residuals is around 0. Then we check the ACF of the residuals. Almost all values lie inside the dashed lines apart from lag 20, suggesting no autocorrelation between residuals; having one of 32 lags fall outside the dashed lines is acceptable for a \(95\%\) acceptance region under the null hypothesis. Finally, we check the assumption that \(\{\epsilon_t\}\sim N(0,\sigma^2)\) with a QQ plot. Compared to the normal distribution, the residuals of our model are somewhat heavy-tailed, so the standard errors in the previous result are not entirely reliable.
The arima() function in R uses the observed Fisher information to calculate standard errors for the coefficients. An approximate \(95\%\) confidence interval can be constructed from these standard errors. Here we consider inference on the coefficient of the change in humidity. From the model output, we can construct the CI \[[-0.1370-1.96*0.0041,-0.1370+1.96*0.0041]=[-0.1450,-0.1290]\]
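The interval arithmetic above can be sketched directly (estimate and standard error are the reported values for the humidity term):

```python
# Fisher-information 95% confidence interval: estimate +/- 1.96 * s.e.
beta, se = -0.1370, 0.0041
lo, hi = beta - 1.96 * se, beta + 1.96 * se

assert round(lo, 4) == -0.145    # matches the reported -0.1450
assert round(hi, 4) == -0.129    # matches the reported -0.1290
```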
Since 0 is not in the confidence interval, the humidity term is significant. However, the assumption diagnostics reveal that the standard error may be unreliable, so we also use simulation to construct a confidence interval for \(\beta_1\); below is the distribution of results from 1000 simulations.
set.seed(12)
k <- 1000                                   # number of simulations
params <- coef(result1)
ar <- params[grep('^ar', names(params))]    # AR coefficients
ma <- params[grep('^ma', names(params))]    # MA coefficients
sigma <- sqrt(result1$sigma2)
theta <- matrix(NA, nrow = k, ncol = length(params),
                dimnames = list(NULL, names(params)))
for (i in 1:k) {
  # simulate from the fitted regression-with-ARMA(1,3)-error model
  Y <- arima.sim(list(ar = ar, ma = ma), n = length(ta$meantemp) - 1, sd = sigma) +
    params[6] * diff(ta$humidity) + params[5]
  # refit and store the estimated coefficients
  theta[i, ] <- coef(arima(Y, order = c(1, 0, 3), xreg = diff(ta$humidity)))
}
## 2.5% 97.5%
## -0.1451306 -0.1292738
In this plot, the dashed lines correspond to the interval calculated before, and the 95% confidence interval from the simulation is \([-0.1451,-0.1293]\), which is close to the Fisher-information CI for model 1. Since 0 is not in the CI, the change of humidity is statistically significantly associated with the change of temperature. The coefficient of differenced humidity is negative, so a decrease in the change of humidity is associated with an increase in the change of temperature.
Our second model focuses on the relation between changes of mean temperature and wind speed. With the same approach as model 1, we first search for a proper ARMA model for the error.
|     | MA0     | MA1     | MA2     | MA3     | MA4     |
|-----|---------|---------|---------|---------|---------|
| AR0 | 5646.08 | 5597.70 | 5568.55 | 5536.76 | 5537.42 |
| AR1 | 5609.94 | 5539.62 | 5541.29 | 5537.06 | 5538.80 |
| AR2 | 5596.27 | 5541.21 | 5512.48 | 5538.73 | 5540.27 |
| AR3 | 5559.93 | 5538.25 | 5538.51 | 5540.49 | 5542.41 |
| AR4 | 5553.30 | 5538.75 | 5540.47 | 5542.47 | 5544.13 |
It seems that the best model is ARMA(2,2), and an alternative model is ARMA(1,1). Notice, however, that the AIC of ARMA(2,2) is more than 2 below that of the larger nested ARMA(2,3) model. This is impossible under exact maximization, since adding a parameter cannot decrease the maximized log likelihood, so AIC can increase by at most 2 when one parameter is added. There must therefore be some numerical error, and the validity of the ARMA(2,2) fit is doubtful. First we use the same hypothesis test as in model 1: it turns out \(\Lambda=31.14\), and the cutoff of \(\chi^2_2\) is
qchisq(0.95,df=2)
## [1] 5.991465
So we’ll reject the null hypothesis that ARMA(1,1) is better. However, if we check the AR and MA roots of ARMA(2,2)
polyroot(c(1,-coef(result3)[c('ar1','ar2')]))
## [1] 1.031981-0i 1.344141+0i
polyroot(c(1,coef(result3)[c('ma1','ma2')]))
## [1] 1.030882+0.079158i 1.030882-0.079158i
some roots are near the unit circle, suggesting that the model is not stable. Together with the suspect AIC value, it is possible that ARMA(2,2) is overfitting. The roots of ARMA(1,1) are quite desirable, so ARMA(1,1) is our choice.
abs(polyroot(c(1,-coef(result4)[c('ar1')])))
## [1] 1.749814
abs(polyroot(c(1,coef(result4)[c('ma1')])))
## [1] 1.245916
Therefore, model 2 is \[\Delta(temperature)_t=\beta_0+\beta_1\Delta(wind\ speed)_t+\epsilon_t\]
where \(\epsilon_t\) is ARMA(1,1) and the coefficients are shown below
##
## Call:
## arima(x = diff(ta$meantemp), order = c(1, 0, 1), xreg = diff(ta$wind_speed))
##
## Coefficients:
## ar1 ma1 intercept diff(ta$wind_speed)
## 0.5715 -0.8026 0.0026 0.0250
## s.e. 0.0419 0.0289 0.0194 0.0095
##
## sigma^2 estimated as 2.577: log likelihood = -2764.81, aic = 5539.62
Again, we look at the model diagnostics to see whether the model above violates any of the assumptions.
From the plots, there is no striking pattern in the residuals that we should worry about, and there is no autocorrelation between residuals. The QQ plot reveals that the residuals are heavy-tailed, so the standard errors of model 2 are not entirely reliable. Again, we conduct a simulation.
Here, we check the significance of wind speed. As described above, we can construct a confidence interval for \(\beta_1\) from the estimated standard error: \[[0.0250-1.96*0.0095,0.0250+1.96*0.0095]=[0.0064,0.0436]\]
Since the standard errors may not be reliable, we also construct a CI with 1000 simulations.
## 2.5% 97.5%
## 0.006894073 0.044547646
The dashed lines correspond to the confidence interval calculated above, and the \(95\%\) simulation CI is \([0.0069,0.0445]\), which is close to it. Since 0 is not inside the CI, the change of wind speed is statistically significant and plays a role in changes of temperature. The coefficient of wind speed is positive, so a larger change in wind speed is associated with a larger temperature difference.
For model 3, we explore the relation between changes of mean temperature and mean pressure. The model selection procedure is the same as in the previous models. First, we select a model from the AIC table.
|     | MA0     | MA1     | MA2     | MA3     | MA4     |
|-----|---------|---------|---------|---------|---------|
| AR0 | 5502.41 | 5447.91 | 5405.78 | 5373.42 | 5372.82 |
| AR1 | 5464.05 | 5374.03 | 5375.19 | 5372.58 | 5374.47 |
| AR2 | 5444.13 | 5375.02 | 5353.89 | 5374.55 | 5376.55 |
| AR3 | 5405.00 | 5373.76 | 5374.38 | 5376.31 | 5377.92 |
| AR4 | 5393.15 | 5374.26 | 5376.22 | 5378.36 | 5373.26 |
The model to consider here is ARMA(1,3). We reject ARMA(0,4) for fear of overfitting, since it also lies on the edge of the table. An alternative choice is ARMA(1,1). Using Wilks' approximation, we test ARMA(1,1) as the null hypothesis against ARMA(1,3) as the alternative. Here \(\Lambda=5.45\) and the cutoff of \(\chi^2_2\) is 5.99, so we fail to reject the null hypothesis and choose ARMA(1,1).
polyroot(c(1,-coef(result6)[c('ar1')]))
## [1] 1.745004+0i
polyroot(c(1,coef(result6)[c('ma1')]))
## [1] 1.208809+0i
The roots of ARMA(1,1) suggest a desirable property of causality and invertibility. And the model 3 is \[\Delta(temperature)_t=\beta_0+\beta_1\Delta(pressure)_t+\epsilon_t\]
where \(\epsilon_t\) is ARMA(1,1) and the coefficients are shown as following
##
## Call:
## arima(x = diff(ta$meantemp), order = c(1, 0, 1), xreg = diff(ta$meanpressure))
##
## Coefficients:
## ar1 ma1 intercept diff(ta$meanpressure)
## 0.5731 -0.8273 0.0027 -0.3314
## s.e. 0.0386 0.0253 0.0161 0.0245
##
## sigma^2 estimated as 2.301: log likelihood = -2682.02, aic = 5374.03
Now we check the model assumptions; the plots are similar to those for the previous models.
From the plots we draw the same conclusions as before: there is no abnormal pattern in the residuals; there is no autocorrelation between the residuals; and, compared to the normal distribution, the residuals of model 3 are heavy-tailed. So we again need simulations.
From the result of arima(), we can calculate the confidence interval of \(\beta_1\), which is \[[-0.3314-1.96*0.0245,-0.3314+1.96*0.0245]=[-0.3794,-0.2834]\]
Since model 3 does not meet the normality assumption, we again conduct 1000 simulations to construct a confidence interval for \(\beta_1\); the plot shows the distribution of the results of the 1000 simulations.
## 2.5% 97.5%
## -0.3775430 -0.2854574
The dashed lines correspond to the CI calculated before, and the 95% simulation CI \([-0.3775,-0.2854]\) is close to it. Since 0 is also outside the confidence interval, the change of mean pressure is statistically significant and has an effect on the change of temperature. The coefficient of pressure is negative, so the change of pressure plays a negative role in the change of mean temperature.
For the last model, we include all three variables (humidity, wind speed and mean pressure). First, as before, we choose an ARMA model from the AIC table.
|     | MA0     | MA1     | MA2     | MA3     | MA4     |
|-----|---------|---------|---------|---------|---------|
| AR0 | 4689.21 | 4688.66 | 4640.28 | 4586.62 | 4581.69 |
| AR1 | 4689.46 | 4607.39 | 4588.10 | 4581.67 | 4583.57 |
| AR2 | 4662.03 | 4584.32 | 4585.13 | 4583.63 | 4585.51 |
| AR3 | 4616.45 | 4584.23 | 4584.74 | 4584.91 | 4587.17 |
| AR4 | 4606.43 | 4583.63 | 4585.62 | 4586.91 | 4588.26 |
It seems there are more numerical errors than in the previous models, but we can select ARMA(1,3), with ARMA(2,1) as an alternative. Again, we conduct a hypothesis test with ARMA(2,1) as the null hypothesis and ARMA(1,3) as the alternative. We have \(\Lambda=4.65\) and the 0.95 cutoff of \(\chi^2_1\) is
qchisq(0.95,df=1)
## [1] 3.841459
Therefore, we reject the null hypothesis and choose ARMA(1,3).
polyroot(c(1,-coef(result7)[c('ar1')]))
## [1] 3.229521+0i
polyroot(c(1,coef(result7)[c('ma1','ma2','ma3')]))
## [1] 1.242954-0.000000i -1.209748+2.155796i -1.209748-2.155796i
The AR and MA roots are all outside the unit circle, suggesting good properties of causality and invertibility. Model 4 is \[\Delta(temperature)_t=\beta_0+\beta_1\Delta(humidity)_t+\beta_2\Delta(wind\ speed)_t+\beta_3\Delta(pressure)_t+\epsilon_t\]
where \(\epsilon_t\) is ARMA(1,3) and following are the coefficients of model 4
##
## Call:
## arima(x = diff(ta$meantemp), order = c(1, 0, 3), xreg = all_xreg)
##
## Coefficients:
## ar1 ma1 ma2 ma3 intercept humidity wind_speed
## 0.3096 -0.4086 -0.1549 -0.1317 0.0033 -0.1322 -0.0281
## s.e. 0.1063 0.1060 0.0322 0.0412 0.0134 0.0041 0.0068
## meanpre
## -0.2326
## s.e. 0.0197
##
## sigma^2 estimated as 1.331: log likelihood = -2281.83, aic = 4581.67
Same as before, we check the assumptions of model 4.
From the plot, we cannot identify any abnormal pattern in the residuals, and there does not seem to be autocorrelation between residuals. Compared to the normal distribution, the residuals of model 4 are heavy-tailed. To address this problem, we again use simulation.
We can derive confidence intervals for \(\beta_1,\beta_2,\beta_3\) from the results of model 4, which are \[\beta_1:[-0.1322-1.96*0.0041,-0.1322+1.96*0.0041]=[-0.1402,-0.1241]\]
\[\beta_2:[-0.0281-1.96*0.0068,-0.0281+1.96*0.0068]=[-0.0414,-0.0148]\]
\[\beta_3:[-0.2326-1.96*0.0197,-0.2326+1.96*0.0197]=[-0.2712,-0.1940]\]
## 2.5% 97.5% 2.5% 97.5% 2.5% 97.5%
## -0.13974427 -0.12403137 -0.04195919 -0.01579606 -0.26859836 -0.19323215
The dashed lines correspond to the intervals above, and the 95% simulation confidence intervals are similar. Since 0 is outside all the CIs, all three variables are statistically significant and have effects on the change of temperature. All the coefficients are negative; therefore, an increase in the difference of any of these variables is associated with a decrease in the temperature difference.
Notice that when fitting the model with all three variables, the coefficient of humidity is stable, but the coefficient of wind speed changes sign to negative and that of mean pressure changes considerably. This may indicate that humidity is not correlated with the other two variables, while the changes of wind speed and mean pressure may be related to each other, which can be one of our future tasks.
The second method is to use a regression model with ARMA errors on the original scale of the data, without differencing adjacent data points. The idea is to first extract the high-frequency part of the data to remove the confounding of season and trend, and then model the extracted data to find out whether there is a relationship between the temperature of Delhi and the other three meteorological features.
To extract the high-frequency part of the data, we used LOESS method to remove seasonality and trend. The plots of data after processing are shown in Figure 4.1. From the plots, we can see that only the high-frequency part of the features were extracted and trend and seasonality were removed.
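The extraction step can be sketched as follows (Python on synthetic data, not the report's R code; a centered moving average stands in for the LOESS smoother used in the report):

```python
import numpy as np

# Synthetic daily series: level + annual cycle + noise. Subtracting a smooth
# low-frequency fit leaves only the high-frequency component.
rng = np.random.default_rng(1)
t = np.arange(1462)                                  # same length as the dataset
y = 25 + 10 * np.sin(2 * np.pi * t / 365.25) + rng.normal(0, 1, t.size)

window = 61                                          # ~2-month smoothing window
smooth = np.convolve(y, np.ones(window) / window, mode="same")
high_freq = y - smooth                               # trend/seasonality removed

interior = high_freq[window:-window]                 # drop edge effects
assert abs(interior.mean()) < 0.3                    # residual centered near zero
```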
To give a rough idea of whether there is a marginal relationship between any pair of features, pairwise Pearson correlations are plotted in Figure 4.2. From the plot, we can see negative correlations between temperature and pressure, as well as between temperature and humidity. The relationship between temperature and wind speed cannot be directly read from the correlation plot.
To investigate the potential lagged relationships between temperature and other three meteorological features, sample cross-correlation functions (CCF) between pairwise time series were calculated, where CCF can be formulated as, \[ \begin{align} \hat{\rho}_{xy}(h) = \frac{\sum_{n=1}^{N-h} (x_{n+h} - \bar{x})(y_n - \bar{y})}{\sqrt{\sum_{n=1}^{N} (x_{n} - \bar{x})^2\sum_{n=1}^{N} (y_{n} - \bar{y})^2}} \end{align} \]
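The formula above can be implemented directly; this sketch (synthetic data, purely illustrative) recovers a known lag of 3 between two series:

```python
import numpy as np

def ccf(x, y, h):
    """Sample rho_xy(h) for lag h >= 0, as in the formula in the text."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    num = np.sum(xm[h:] * ym[:len(y) - h])   # sum over n of (x_{n+h})(y_n) terms
    den = np.sqrt(np.sum(xm**2) * np.sum(ym**2))
    return num / den

# Build x so that x_{n+3} ~ y_n: the CCF should peak at h = 3.
rng = np.random.default_rng(2)
y = rng.normal(size=500)
x = np.concatenate([np.zeros(3), y[:-3]]) + 0.1 * rng.normal(size=500)

best = max(range(10), key=lambda h: ccf(x, y, h))
assert best == 3
assert ccf(x, y, 3) > 0.9
```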
The CCF plots are shown in Figure 4.3. The CCF plot of temperature vs pressure shows a negative lagged and instantaneous relationship between temperature and pressure: the lower the pressure, the higher the temperature. The same conclusion holds for the lagged and instantaneous relationship between temperature and humidity, which also shows a negative relationship. For temperature and wind speed, the CCF plot shows a positive instantaneous relationship, but its magnitude is not very strong.
To further investigate the cross-covariance in frequency domain, the coherency and phase of cross-spectrum were investigated between temperature and other three meteorological features. The plots are shown in Figure 4.4.
From the first row of the plots, we can identify two significant frequencies at which temperature and pressure are correlated: 1 cycle per 2.2 days and 1 cycle per 88 days. The phase plot shows that the phases at these two frequencies are negative, meaning there is a lagged effect between temperature and pressure at these frequencies. We conjecture that the frequency of 1 cycle per 88 days is significant because the seasonality has not been completely removed from the original data, since we only focus on the high-frequency relationship between the two time series.
The second row shows the coherency and phase plots of temperature and humidity. The coherency plot suggests a correlation between temperature and humidity at almost every frequency. This may indicate a complicated relationship between temperature and humidity, and further research needs to be conducted.
The third row suggests that the relationship between temperature and wind speed is negligible at almost all frequencies, since the 95% confidence intervals at nearly all frequencies cover 0. This indicates that there may be no relationship between temperature and wind speed, or that the relationship is very weak.
To quantify the relationship between temperature and the other three features, a regression model with ARMA errors was fitted. To find appropriate AR and MA orders, AIC values were calculated for different ARMA models of temperature without any regressor, as shown in Table 4.1. The optimal model was selected by trial and error, starting from the model with the smallest AIC and checking whether the model was non-invertible or non-causal. The ARMA model with AR and MA orders \((1,2)\) was selected as the optimal model.
|     | MA0     | MA1     | MA2     | MA3     | MA4     | MA5     |
|-----|---------|---------|---------|---------|---------|---------|
| AR0 | 6095.44 | 5570.75 | 5394.55 | 5366.16 | 5355.67 | 5354.95 |
| AR1 | 5350.53 | 5352.19 | 5349.50 | 5350.78 | 5352.34 | 5354.09 |
| AR2 | 5352.12 | 5286.89 | 5350.25 | 5351.97 | 5281.68 | 5282.98 |
| AR3 | 5349.77 | 5350.05 | 5286.57 | 5353.99 | 5353.83 | 5355.46 |
| AR4 | 5350.84 | 5351.97 | 5288.55 | 5290.50 | 5282.56 | 5284.53 |
| AR5 | 5351.65 | 5283.13 | 5283.91 | 5293.96 | 5292.56 | 5289.24 |
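The table's bookkeeping can be checked against the identity AIC = 2k - 2*loglik, using values reported for the selected ARMA(1,2) null model:

```python
# The ARMA(1,2) null model reports loglik = -2669.75 with k = 5 estimated
# parameters (ar1, ma1, ma2, intercept, sigma^2); the implied AIC should
# match the AR1/MA2 entry of the table.
loglik, k = -2669.75, 5
aic = 2 * k - 2 * loglik

assert abs(aic - 5349.50) < 0.01
```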
The estimates of the parameters with standard errors are shown in Table 4.2, and the corresponding optimal null model can be formulated as
\[\begin{align} (Y_t + 0.01) = 0.53\,(Y_{t-1} + 0.01) + \epsilon_t + 0.12\,\epsilon_{t-1} + 0.09\,\epsilon_{t-2} \end{align}\]
|          | ar1  | ma1  | ma2  | intercept |
|----------|------|------|------|-----------|
| estimate | 0.53 | 0.12 | 0.09 | -0.01     |
| se       | 0.06 | 0.06 | 0.04 | 0.10      |
The marginal relationships between temperature and other meteorological features were investigated using regression model with ARMA error and likelihood ratio tests were performed to check the significance of the correlations.
Before performing the likelihood ratio tests, the null model was fitted: \[\begin{equation} Temperature^{hs}_t = \beta_0 + \epsilon_t \tag{4.1} \end{equation}\] where \(\epsilon_t\) is an ARMA(1,2) error and the superscript “hs” denotes the high-frequency component of the time series. The log likelihood of model (4.1) was -2669.75.
Model diagnostics for the null model were also performed; the diagnostic plots are shown in Figure 4.5. There do not appear to be any violations of the model assumptions. The absolute values of the AR and MA roots were also calculated: 1.9 for the AR root and (3.28, 3.28) for the MA roots. All roots lie outside the unit circle, indicating no violations of invertibility or causality.
The following model was fitted,
\[\begin{equation} Temperature^{hs}_t = \beta_0 + \beta_1 Pressure^{hs}_t + \epsilon_t \tag{4.2} \end{equation}\]
where \(\epsilon_t\) is an ARMA(1,2) error and the superscript “hs” denotes the high-frequency component of the time series.
|          | ar1  | ma1  | ma2  | intercept | pressure |
|----------|------|------|------|-----------|----------|
| estimate | 0.51 | 0.13 | 0.08 | -0.01     | -0.31    |
| se       | 0.07 | 0.07 | 0.05 | 0.09      | 0.02     |
The fitted model gave a pressure coefficient of -0.31 with standard error 0.02. The LRT statistic was 158.16, which follows a chi-square distribution with 1 degree of freedom; the corresponding p-value is essentially 0, so the coefficient is significantly different from \(0\). From the sign of the coefficient, we conclude that there is a significantly negative correlation between temperature and pressure.
The following model was fitted,
\[\begin{equation} Temperature^{hs}_t = \beta_0 + \beta_1 Humidity^{hs}_t + \epsilon_t \tag{4.3} \end{equation}\]
where \(\epsilon_t\) is an ARMA(1,2) error and the superscript “hs” denotes the high-frequency component of the time series.
|          | ar1  | ma1  | ma2  | intercept | humidity |
|----------|------|------|------|-----------|----------|
| estimate | 0.49 | 0.33 | 0.14 | -0.01     | -0.14    |
| se       | 0.06 | 0.06 | 0.05 | 0.08      | 0.00     |
The fitted model gave a humidity coefficient of -0.14 with standard error below 0.005 (reported as 0.00). The LRT statistic was 804.76, which follows a chi-square distribution with 1 degree of freedom; the corresponding p-value is essentially 0, so the coefficient is significantly different from \(0\). From the sign of the coefficient, we conclude that there is a significantly negative correlation between temperature and humidity.
The following model was fitted,
\[\begin{equation} Temperature^{hs}_t = \beta_0 + \beta_1 WindSpeed^{hs}_t + \epsilon_t \tag{4.4} \end{equation}\]
where \(\epsilon_t\) is an ARMA(1,2) error and the superscript “hs” denotes the high-frequency component of the time series.
|          | ar1  | ma1  | ma2  | intercept | windspeed |
|----------|------|------|------|-----------|-----------|
| estimate | 0.52 | 0.12 | 0.09 | -0.01     | 0.02      |
| se       | 0.06 | 0.07 | 0.04 | 0.10      | 0.01      |
The fitted model gave a wind speed coefficient of 0.02 with standard error 0.01. The LRT statistic was 6.52, which follows a chi-square distribution with 1 degree of freedom; the corresponding p-value is 0.01, so the coefficient is significantly different from \(0\). From the sign of the coefficient, we conclude that there is a significantly positive relationship between temperature and wind speed. However, this positive relationship is weaker than those of the other two features, since its p-value is near the critical threshold of the test.
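The three marginal likelihood ratio tests above can be sketched numerically (the statistics are the values reported in this section):

```python
from scipy.stats import chi2

# Each LRT statistic is compared against a chi-square with 1 df.
stats = {"pressure": 158.16, "humidity": 804.76, "wind_speed": 6.52}
pvals = {name: chi2.sf(stat, df=1) for name, stat in stats.items()}

assert pvals["pressure"] < 1e-10 and pvals["humidity"] < 1e-10  # essentially 0
assert round(pvals["wind_speed"], 2) == 0.01                    # near the threshold
```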
Since pressure, humidity and wind speed may be correlated with one another, a joint model regressing temperature on all three was fitted,
\[\begin{equation} Temperature^{hs}_t = \beta_0 + \beta_1 Pressure^{hs}_t + \beta_2 Humidity^{hs}_t + \beta_3 WindSpeed^{hs}_t + \epsilon_t \tag{4.5} \end{equation}\]
where \(\epsilon_t\) is an ARMA(1,2) error and the superscript “hs” denotes the high-frequency component of the time series.
The parameters of the joint model are shown in Table 4.6. The coefficients for pressure, humidity and wind speed are -0.22, -0.13 and -0.03, with corresponding p-values of \(<0.00001\), \(<0.00001\) and \(0.00001\) respectively. From the joint model, the negative relationships of temperature with pressure and with humidity are consistent with the previous marginal analysis; however, the relationship between temperature and wind speed changes sign after controlling for humidity and pressure. We conjecture that this is because wind speed is correlated with humidity and pressure.
|          | ar1  | ma1  | ma2  | intercept | pressure | humidity | windspeed |
|----------|------|------|------|-----------|----------|----------|-----------|
| estimate | 0.48 | 0.30 | 0.12 | -0.01     | -0.22    | -0.13    | -0.03     |
| se       | 0.06 | 0.06 | 0.05 | 0.08      | 0.02     | 0.00     | 0.01      |
In this report, we used two different methods to remove trend and seasonality from the original data. We first differenced the data and fit ARMA models investigating the relationship between mean temperature changes and the other variables. We reached the following sub-conclusions:
We then tried the non-difference approach, fitting regression models with ARMA errors on the original scale of the data without differencing adjacent data points. We reached the following sub-conclusions based on the non-difference ARMA models: