About the Project

Research Content

A shortage of electricity resources is a persistent challenge for daily life and production. To ensure a reliable electricity supply, it is crucial to study the patterns of electricity consumption. Analyzing consumption data allows us to predict peak demand periods and prepare the supply in advance. Moreover, understanding consumption patterns helps optimize the distribution of electricity resources and reduce waste.

In this project, we study the time series from the Electric Power Consumption dataset [1]. We apply time series models and methods to find a good fit for the power consumption data. All our efforts are aimed at answering the following question:

Can we find a model that fits power consumption based on past electricity consumption data?

Data Description

The Electric Power Consumption dataset consists of 52,416 observations of energy consumption recorded in Tetouan, a city in northern Morocco, during 2017. The data are recorded at 10-minute intervals with 9 features. Here we focus mainly on the feature “Zone 1 Power Consumption”.

Model and Methods

In this project, we fitted our dataset with \(ARIMA\) and \(SARIMA\) models. What are the \(ARIMA\) and \(SARIMA\) models, and how do we determine which model fits the data better? Below we introduce the models and the selection methods used in this project [2]. (Note: ARIMA is from chapter 6 slide 9; SARIMA is from chapter 6 slide 15; AIC is from chapter 5 slide 21; Fisher information is from chapter 5 slide 10.)

ARIMA and SARIMA Model

ARIMA model

Given time series data \(\{X_t\}_{t=1}^N\), an \(\displaystyle {\text{ARMA}}(p,q)\) model is given by:

\[\phi (B)(X_t-\mu) = \psi (B) \varepsilon_t\]

Where:

  • \(\mu = E[X_t]\).
  • \(\phi (B) = 1 - \sum_{i=1}^{p} \phi_i B^i\) is the autoregressive part of the model.
  • \(\psi(B) = 1 + \sum_{i=1}^{q} \psi_i B^i\) is the moving average part of the model.
  • \(B\) is the lag operator. \(BY_n = Y_{n-1}\).
  • \(\varepsilon_t\) is the error term at time \(t\); we typically assume \(\varepsilon_t \overset{iid}{\sim} \mathcal{N}(0,\sigma^2)\).

When time series data are non-stationary, they are hard to model with \(\displaystyle {\text{ARMA}}(p,q)\). The difference operator \((1 - B)^d\) is then introduced to detrend the data and make them stationary, giving the \(\displaystyle {\text{ARIMA}}(p,d,q)\) model:

\[\phi (B) ((1 - B)^d X_t-\mu) = \psi (B) \varepsilon_t\]

Here \(d\) is the order of differencing. When \(d = 1\), \((1 - B) X_t = X_t - X_{t-1}\): the previous observation is subtracted from each observation. This calculation removes a linear trend from the original series.
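
As a quick illustration of the differencing step, here is a minimal R sketch on simulated data (not the project's dataset): a series with a linear trend becomes approximately stationary after one application of \((1 - B)\).

```r
# Simulated AR(1) noise around a linear trend; one difference removes the trend.
set.seed(1)
x  <- 0.5 * (1:200) + arima.sim(model = list(ar = 0.6), n = 200)
dx <- diff(x)                    # (1 - B) x_t = x_t - x_{t-1}
plot.ts(cbind(original = x, differenced = dx))
```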

SARIMA Model

The \(SARIMA\) model is an extension of the \(ARIMA\) model that captures both non-seasonal and seasonal patterns in time series data (S stands for Seasonal). The \(\displaystyle {\text{SARIMA}}(p,d,q)\times(P,D,Q)_s\) model equation is given by:

\[\phi (B)\Phi (B^s) [(1 - B)^d (1 - B^s)^D X_t-\mu] = \psi (B) \Psi (B^s) \varepsilon_t\]

Where:

  • \(s\) is the length of the seasonal period.
  • \(\Phi (B^s) = 1 - \sum_{i=1}^{P} \Phi_i B^{is}\) is the seasonal autoregressive (SAR) part.
  • \(\Psi(B^s) = 1 + \sum_{i=1}^{Q} \Psi_i B^{is}\) is the seasonal moving average (SMA) part.
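
In R, both ARIMA and SARIMA models of this form can be fitted with the base `arima()` function. The sketch below shows the seasonal specification; the orders and period mirror the hourly model selected later in this report, and `x` is a placeholder for the series being fitted.

```r
# Fit a SARIMA(2,0,0)x(1,1,0)_24 model; `x` is a placeholder for the hourly series.
fit <- arima(x, order = c(2, 0, 0),
             seasonal = list(order = c(1, 1, 0), period = 24))
```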

Model Selection Criterion

We use the following criteria to select the model that best fits the data.

  • AIC

    The goal of AIC is to measure the goodness of fit of the model while penalizing the number of parameters in the model to avoid overfitting. The formula for calculating a model’s AIC is as follows:

    \[AIC = 2k - 2\ln(L)\] where:

    • \(k\) is the number of parameters in the model. In an \(ARIMA\) model, \(k\) counts the autoregressive, differencing, and moving average terms; in a \(SARIMA\) model, the seasonal parameters are also included.

    • \(L\) is the maximized value of the model's likelihood function.

    When using \(ARIMA\) models for time series analysis, we typically calculate the AIC values for multiple models and choose the model with the smallest AIC value. A smaller AIC value means that the model provides a better fit to the data while keeping the number of parameters as low as possible.

  • Minimum Root

    The minimum root (the characteristic root with the smallest magnitude) is used to assess model stability. For \(ARIMA\) and \(SARIMA\) models, if a root lies on or near the unit circle, the fit may be unstable. To check stability, we inspect the fitted model's minimum root: a minimum root well outside the unit circle (magnitude much larger than 1) ensures that long-term forecasts of the time series do not diverge or become unbounded. (A short R sketch of this and the following diagnostics appears after this list.)

  • Fisher Confidence Interval and Bootstrap Confidence Interval

    A confidence interval provides a range of uncertainty for a parameter estimate. A \(1-\alpha\) CI is constructed so that it covers the true parameter value with probability \(1-\alpha\) (95% when \(\alpha = 0.05\)). In this report, we report a boolean variable, “whether the CI contains 0”, to judge whether a parameter is significant. If the answer is True, the parameter is not significantly different from 0 and can be dropped.

    We tested two kinds of CI: the Fisher CI, constructed from the Fisher information, and the Bootstrap CI, constructed by resampling.

  • The Number of Lags outside ACF CI

    The confidence interval of the ACF is used to determine whether autocorrelation coefficients are significantly non-zero. The “number of lags outside ACF CI” counts how many lags in the ACF plot have autocorrelation coefficients falling outside the confidence interval; such lags indicate statistically significant autocorrelation at those specific lags.

  • Residuals’ Normality

    If the residuals are found to be normally distributed, this supports the validity of statistical inferences made from the model. Here, we use the Shapiro-Wilk test to decide whether the normality assumption can be rejected, and a QQ plot to diagnose the residuals' normality.
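
To make these criteria concrete, the following minimal R sketch shows how each diagnostic can be computed for a fitted `arima()` model. The object name `fit` and the lag cutoff are placeholders, and the 95% bands use the usual normal approximation.

```r
# Diagnostics for a fitted arima() model `fit` (placeholder name).
res <- residuals(fit)

# Minimum root: smallest-magnitude root of the AR/MA polynomials.
# A modulus close to 1 warns of near non-causality or non-invertibility.
ar <- fit$coef[grep("^ar", names(fit$coef))]
ma <- fit$coef[grep("^ma", names(fit$coef))]
roots <- c(if (length(ar)) polyroot(c(1, -ar)),   # 1 - phi_1 z - ... - phi_p z^p
           if (length(ma)) polyroot(c(1,  ma)))   # 1 + psi_1 z + ... + psi_q z^q
if (length(roots)) min(Mod(roots)) else NA

# Fisher CI from the observed information (standard errors reported by arima()).
se <- sqrt(diag(fit$var.coef))
cbind(lower = fit$coef - 1.96 * se, upper = fit$coef + 1.96 * se)

# Number of lags whose sample ACF falls outside the approximate 95% band.
r <- acf(res, lag.max = 50, plot = FALSE)
sum(abs(r$acf[-1]) > 1.96 / sqrt(r$n.used))   # lag 0 excluded (always 1)

# Residual normality: Shapiro-Wilk test and QQ plot.
shapiro.test(res)
qqnorm(res); qqline(res)
```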

Building Model

Data Pre-processing

Our dataset contains power consumption records from January 9th to December 30th, 2017, with one data point per ten-minute window. To better study the consumption patterns, we fitted models on two temporal scales. First, on a daily basis, we took the 13,103 ten-minute data points from April, May, and June, which account for the electricity usage over these 91 days, as the first dataset for modeling. Second, on an hourly basis, we used the power usage over the 2,184 hours within April, May, and June as the second dataset for modeling.
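
As a rough sketch of this pre-processing step (the CSV file name, column names, and timestamp format below are assumptions about the raw file layout, not taken from the report), the ten-minute records can be aggregated into hourly and daily totals like this:

```r
# Aggregate 10-minute records into hourly and daily totals.
# File name, column names, and timestamp format are assumptions about the raw CSV.
power <- read.csv("powerconsumption.csv")
power$DateTime <- as.POSIXct(power$DateTime, format = "%m/%d/%Y %H:%M")
power <- subset(power, format(DateTime, "%m") %in% c("04", "05", "06"))  # April-June
power$hour <- format(power$DateTime, "%Y-%m-%d %H")
power$day  <- as.Date(power$DateTime)
hourly <- aggregate(Zone1 ~ hour, data = power, FUN = sum)  # hourly totals
daily  <- aggregate(Zone1 ~ day,  data = power, FUN = sum)  # daily totals
```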

Explore periods

The plot shows clear periodicity, and the spectrum peaks at frequency \(\omega = 0.042\) (cycles per hour), as marked on the plot. The dominant frequency therefore corresponds to a period \(T=\frac{1}{\omega} \approx 23.8\) hours. So the predominant period is roughly one cycle per 24 hours, i.e., one cycle per day. Harmonics introduce some additional peaks in the spectrum plot.
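
A minimal R sketch of how such a dominant frequency can be read off a smoothed periodogram (the smoothing spans are illustrative, and `hourly$Zone1` refers to the placeholder series from the earlier aggregation sketch):

```r
# Smoothed periodogram of the hourly series; find the dominant frequency and period.
sp <- spectrum(hourly$Zone1, spans = c(5, 5), plot = FALSE)  # spans chosen for illustration
peak <- sp$freq[which.max(sp$spec)]                          # dominant frequency (cycles/hour)
1 / peak                                                     # corresponding period, ~24 hours
```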

We set the cutoff value at 0.5. The corresponding lower boundary of the cycle length is \(\frac{1}{0.064} = 15.625\) hours, with no upper boundary. So cycle lengths longer than 15.625 hours can be interpreted as belonging to the power consumption cycle. The predominant period found above, 24 hours, lies in this interval, and one day is also a reasonable cycle in real-life terms.

We then turn to periodicity in our daily data.

The “Smoothed periodogram” plot does not show a typical seasonal pattern. As there are only 91 observations in our daily consumption dataset, we cannot claim that daily power consumption has no periodicity, but within this report we cannot identify a significant cycle for daily power consumption.

Explore residuals

In this report, we used two different methods to remove trend and seasonality from the original data.

First, we use the differencing method.

The second approach we used to remove trend and seasonality was LOESS smoothing. LOESS extracts the trend component of the data, which we then subtract from the original series. The remaining component can be viewed as volatility, and that is the data we fit with ARMA models.
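
A minimal sketch of this LOESS detrending step (the span value and the series name are placeholder choices, not taken from the report):

```r
# LOESS detrending: extract a smooth trend and keep the remainder ("volatility").
t <- seq_along(hourly$Zone1)
trend <- fitted(loess(hourly$Zone1 ~ t, span = 0.5))  # span is an illustrative choice
volatility <- hourly$Zone1 - trend                    # series passed to ARMA fitting
```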

Model Selection

The seasonality observed in the hourly data encourages us to fit SARMA or SARIMA models. However, the size of our dataset prevents us from directly searching over SARMA or SARIMA models, because their fitting process converges slowly. To address this issue, we follow [3] (Note: The idea of fitting ARMA models on data of different scales comes from [3].) and fit ARMA models separately on the hourly and daily data to search for candidate \((p, q)\) and \((P, Q)\) values. By constraining the search space to these candidate \((p, q)\) and \((P, Q)\) pairs, the workload of the SARMA or SARIMA search is greatly reduced.

Find SARIMA Models

In this section, we find suitable ARIMA models for hourly data and daily data, and then combine them to form the SARIMA model.

We first search for a suitable ARIMA model for the hourly data based on AIC values. As can be seen in the AIC table, the lowest AIC value is achieved at AR = 5 and MA = 5, but we consider this a result of overfitting. We support this conjecture with the smallest-root table, in which we record the smallest-magnitude root of each ARMA model. When AR = 5 and MA = 5, the smallest root becomes 1.0004, which indicates a possible problem of non-invertibility or non-causality. Similar problems also occur when there are many AR and MA parameters. By jointly considering invertibility, causality and model complexity, we restrict our search range to ARIMA(0, 1, 0), ARIMA(1, 1, 1), ARIMA(1, 1, 0), ARIMA(2, 1, 0), ARIMA(0, 1, 1), ARIMA(0, 1, 2) and ARIMA(1, 1, 2). Among these models, we choose the one with the smallest AIC value, which is ARIMA(2, 1, 0); we test this model in the following part. (A sketch of how the two tables below can be constructed is given after them.)

AIC of some ARIMA models (Hourly)
MA0 MA1 MA2 MA3 MA4 MA5
AR0 23078.87 22022.00 21830.32 21797.04 21798.18 21794.32
AR1 21961.65 21817.40 21805.67 21798.68 21543.55 21502.38
AR2 21776.69 21478.11 21452.67 21446.42 21328.71 21327.84
AR3 21765.25 21457.60 21450.14 21328.58 21325.15 21323.26
AR4 21735.73 21446.92 21451.75 21452.14 21310.41 21323.54
AR5 21731.44 21447.28 21448.94 21376.11 21315.12 21162.07
Smallest roots of ARIMA models (Hourly)
MA0 MA1 MA2 MA3 MA4 MA5
AR0 NA 1.5824 1.8556 1.8199 2.1686 1.5987
AR1 1.5771 2.2392 2.7366 1.9133 1.0028 1.0029
AR2 1.8658 1.0046 1.0039 1.0045 1.0461 1.0540
AR3 1.5594 1.0041 1.0041 1.0424 1.0432 1.0427
AR4 1.3334 1.0044 1.0041 1.0041 1.0010 1.0433
AR5 1.2799 1.0045 1.0043 1.0041 1.0003 1.0004
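
The two tables above can be produced by a grid search over \((p, q)\); a minimal R sketch (the series name is a placeholder, and failed fits are simply skipped) is:

```r
# Grid search over ARIMA(p,1,q): record AIC and the smallest-magnitude root.
aic_tab <- root_tab <- matrix(NA, 6, 6,
                              dimnames = list(paste0("AR", 0:5), paste0("MA", 0:5)))
for (p in 0:5) for (q in 0:5) {
  fit <- try(arima(hourly$Zone1, order = c(p, 1, q)), silent = TRUE)
  if (inherits(fit, "try-error")) next
  aic_tab[p + 1, q + 1] <- AIC(fit)
  cf <- fit$coef
  roots <- c(if (p > 0) polyroot(c(1, -cf[grep("^ar", names(cf))])),
             if (q > 0) polyroot(c(1,  cf[grep("^ma", names(cf))])))
  if (length(roots)) root_tab[p + 1, q + 1] <- min(Mod(roots))
}
```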

We diagnose the ARIMA(2, 1, 0) model by analyzing its roots and residuals. As reported below, the roots of the two AR coefficients in the ARIMA(2, 1, 0) model are \(1.4212 + 1.2089i\) and \(1.4212-1.2089i\). Both lie outside the unit circle, indicating causality and invertibility. As for the residuals, they may not be white noise, because both the normality and the uncorrelatedness assumptions are violated. In the QQ plot, a heavy-tailed distribution is evident, showing that the residuals do not follow a normal distribution. In the ACF plot, a significantly large ACF value is observable at lag = 24, which may imply seasonality. More than 5% of the ACF values exceed the confidence interval boundary, so we conclude that the residuals are not uncorrelated. This diagnosis suggests that a simple ARIMA model cannot fit the hourly data well and necessitates the use of a SARIMA model.

## AR roots: 1.4212+1.2089i 1.4212-1.2089i 
## Mod of AR roots: 1.8658 1.8658

We now attempt to find an ARIMA model for the daily data; this model may provide a hint about how to choose the P and Q values in the SARIMA model. We first select the model based on AIC values. The AIC table below shows that most AIC values are similar: the difference between the largest and the smallest AIC value is merely \(1405.787 - 1384.096 \approx 22\), which is relatively small. This may indicate that AIC is not a suitable model selection criterion in this case. The key factor for our model selection is therefore the Bootstrap confidence interval. We bootstrap each ARIMA model 100 times and report the resulting confidence intervals in the second table below (a sketch of the bootstrap step follows the tables). As shown in this table, only two models (ARIMA(1, 1, 0) and ARIMA(0, 1, 1)) have coefficients that are significantly different from 0, so we choose between these two. An ARIMA model with only MA parameters can only capture autocorrelation within a small, finite range of lags, so we choose ARIMA(1, 1, 0). The diagnosis of this model is performed below.

AIC of some ARIMA models (Daily)
MA0 MA1 MA2 MA3 MA4 MA5
AR0 1405.787 1389.825 1390.149 1391.911 1393.876 1393.309
AR1 1396.882 1389.896 1391.831 1393.792 1395.357 1395.297
AR2 1394.549 1391.845 1393.818 1384.976 1386.929 1384.096
AR3 1394.018 1393.810 1395.845 1386.884 1388.617 1385.619
AR4 1395.971 1395.670 1397.527 1386.934 1388.845 1386.598
AR5 1397.104 1396.106 1394.099 1389.656 1390.227 1388.395
Does the Bootstrap CI cover 0? (Daily)
MA0 MA1 MA2 MA3 MA4 MA5
AR0 NA FALSE TRUE TRUE TRUE TRUE
AR1 FALSE TRUE TRUE TRUE TRUE TRUE
AR2 TRUE TRUE TRUE TRUE TRUE TRUE
AR3 TRUE TRUE TRUE TRUE TRUE TRUE
AR4 TRUE TRUE TRUE TRUE TRUE TRUE
AR5 TRUE TRUE TRUE TRUE TRUE TRUE
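
The bootstrap intervals above can be obtained with a parametric bootstrap of each fitted model; a minimal sketch for one model (ARIMA(1, 1, 0) on the daily series, with placeholder names) is:

```r
# Parametric bootstrap CI for the AR(1) coefficient of ARIMA(1,1,0) on daily data.
fit <- arima(daily$Zone1, order = c(1, 1, 0))
boot_ar1 <- replicate(100, {
  sim <- arima.sim(model = list(ar = fit$coef["ar1"]),
                   n = length(daily$Zone1) - 1, sd = sqrt(fit$sigma2))
  arima(sim, order = c(1, 0, 0), include.mean = FALSE)$coef["ar1"]
})
quantile(boot_ar1, c(0.025, 0.975))   # does this interval cover 0?
```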

As shown in the following report, the model is causal and invertible, but its residuals are not white noise. The reported root of this model lies outside the unit circle, which ensures causality and invertibility. The residuals are not white noise because the QQ plot shows a heavy-tailed (Cauchy-like) distribution and 8 lags have ACF values outside the confidence interval.

## AR roots: -2.9185+0i 
## Mod of AR roots: 2.9185

We now combine the models obtained in the previous parts to form the following SARIMA model, with order \((2, 0, 0)\times(1, 1, 0)\) and period 24. The model diagnosis shows that it is invertible and causal, with residuals that are nearly white noise. All of its roots are well outside the unit circle, so causality and invertibility are not in question. The residuals in the residual plot look like white noise with mean 0. Further checks such as the ACF plot show that the residuals are uncorrelated, since most ACF values fall within the confidence interval. The QQ plot, however, suggests that the residuals may not be normal due to their heavy-tailed distribution. Further experiments could be conducted to find a model with better-behaved residuals.

## 
## Call:
## arima(x = data, order = arima_order, seasonal = list(order = seasonal_order, 
##     period = period), xreg = xreg)
## 
## Coefficients:
##          ar1      ar2     sar1
##       1.1265  -0.2341  -0.3980
## s.e.  0.0216   0.0216   0.0205
## 
## sigma^2 estimated as 195.6:  log likelihood = -8766.09,  aic = 17540.18
## AR roots: 1.1743+0i 3.6372+0i 
## Mod of AR roots: 1.1743 3.6372 
## SAR roots: -2.5123+0i 
## Mod of SAR roots: 2.5123

Find SARMA Models

Similar to the previous part, we fit two ARMA models separately on the hourly and daily data, but we use LOESS to detrend the data rather than relying on the differencing operation of the ARIMA models.

We first conduct the experiment on the hourly data. As shown in the following AIC table, the smallest AIC value is attained when p = 4 and q = 3, but this may be a consequence of overfitting. As in the previous section, we look at the smallest-root table for insight. It shows that invertibility and causality are only preserved when \(p + q\) is relatively low, so ARMA(2, 1) may be a good choice. Increasing p beyond 2 and q beyond 1 does not yield much AIC decrease, which confirms that ARMA(2, 1) is a reasonable choice in terms of AIC and model complexity.

AIC of some ARMA models (Hourly)
MA0 MA1 MA2 MA3 MA4 MA5
AR0 26848.26 24349.97 22851.26 22116.88 21779.66 21516.12
AR1 22996.31 21883.66 21628.33 21541.24 21499.36 21452.23
AR2 21469.67 21445.86 21437.67 21333.38 21324.90 21452.73
AR3 21450.74 21442.80 21330.90 21328.65 21326.89 21256.73
AR4 21438.81 21444.53 21445.55 20880.29 21327.33 21372.80
AR5 21438.74 21422.88 21368.90 21446.63 21240.31 21278.38
Smallest root of ARMA models (Hourly)
MA0 MA1 MA2 MA3 MA4 MA5
AR0 NA 1.1043 1.1493 1.1968 1.2691 1.1703
AR1 1.0978 1.1618 1.2522 1.3634 1.4939 1.2523
AR2 1.1862 1.2542 1.1641 1.0789 1.0939 1.2343
AR3 1.2537 1.2266 1.0983 1.0928 1.0942 1.0014
AR4 1.1901 1.2276 1.2117 1.0011 1.0914 1.0258
AR5 1.1722 1.0008 1.0111 1.1691 1.0035 1.0011

We diagnose the ARMA(2, 1) model. This model is causal and invertible, because its roots lie outside the unit circle and cannot cancel with each other. Its residuals, however, appear problematic: the QQ plot indicates heavy tails, and the ACF plot rejects the hypothesis of uncorrelated residuals.

## AR roots: 1.1607+0.475i 1.1607-0.475i 
## Mod of AR roots: 1.2542 1.2542 
## MA roots: -6.0076+0i 
## Mod of MA roots: 6.0076

Similar to the previous section, the variance of the AIC values for the ARMA models fitted on daily data is very low, so we need to select models according to other criteria, such as the smallest-root table. In that table, only a few models are not near the boundary of non-invertibility or non-causality. Among the invertible and causal models, we prefer those containing both AR and MA parameters, since we find that the absence of AR or MA parameters causes the final SARMA model to have more ACF outliers. We therefore limit our choice to ARMA(1, 1) or ARMA(1, 2), and considering model complexity, we finally choose ARMA(1, 1).

AIC of some ARMA models (Daily)
MA0 MA1 MA2 MA3 MA4 MA5
AR0 1381.363 1379.262 1381.220 1380.942 1382.884 1378.387
AR1 1379.416 1381.238 1383.091 1376.639 1378.893 1376.587
AR2 1381.098 1374.419 1376.022 1378.537 1380.482 1377.213
AR3 1382.404 1376.088 1375.349 1372.461 1373.834 1369.246
AR4 1383.778 1378.003 1376.227 1373.453 1373.892 1370.148
AR5 1381.155 1382.226 1377.019 1372.497 1374.220 1371.271
Smallest root of ARMA models (Daily)
MA0 MA1 MA2 MA3 MA4 MA5
AR0 NA 4.6717 6.5990 1.3918 1.3527 1.0757
AR1 4.7876 6.3511 2.5110 1.0000 1.0002 1.0000
AR2 4.0903 1.0000 1.0001 1.0000 1.0000 1.0000
AR3 2.0190 1.0000 1.0000 1.0000 1.0000 1.0000
AR4 1.6031 1.0000 1.0000 1.0000 1.0000 1.0000
AR5 1.2396 1.1862 1.0229 1.0000 1.0000 1.0000

The diagnosis of ARMA(1, 1) shows that the model possesses both invertibility and causality, since its roots lie outside the unit circle, but its residuals may not be white noise. The violation of the white-noise assumption is visible in the QQ plot and the ACF plot: in the QQ plot, many residuals deviate from the QQ line when the quantile is less than -1, and in the ACF plot about 7 lags have ACF values outside the boundary, indicating correlated residuals.

## AR roots: 16.3752+0i 
## Mod of AR roots: 16.3752 
## MA roots: -6.3511+0i 
## Mod of MA roots: 6.3511

Directly combining the two models obtained in the previous parts leads to convergence problems, so in the end we select a similar model, SARMA(1, 0, 1)(1, 0, 1), whose diagnosis is shown as follows. The model is both invertible and causal, as confirmed by its roots. Its residuals satisfy the mean-0 assumption and the uncorrelated-residual assumption but violate the normality assumption. The residuals fluctuate around 0 in the residual plot, which supports the mean-0 assumption, and the ACF plot further confirms the uncorrelated-residual assumption, with no more than 5% of ACF values outside the confidence interval. Nevertheless, the residuals do not follow the normality assumption, due to their heavy-tailed distribution.

Fitted Value

We plot the fitted values and the original values in the same time series plot (400 data points are selected for clarity). The fitted values track the original series closely, suggesting a satisfactory model.

Association between Power Consumption Zones

In the previous sections, we developed SARMA / SARIMA models based on the power consumption data collected from only one zone of Tetouan (the Quads zone), while the discrepancy and association between different zones' power consumption also deserve attention. In the following parts, we first detrend and deseasonalize the power consumption data of three zones (Quads, Smir and Boussafou) to remove the negative effect of trend and seasonality on the cross-correlation function (CCF) analysis. Then, we report and discuss the CCF, coherence and phase plots, in an attempt to provide a detailed analysis of the three zones' association. Finally, some linear regression models with ARMA errors are fitted and tested; these models further underpin our findings from the CCF, coherence and phase plots. (We design the experiments of this part according to a previous year's project [4] (Note: We add some innovations, such as comparing the HP filter with LOESS, and plotting the detrended data together to view the association.))

Detrend and Deseasonal Data

Following the practice of [5] (Note: slide 9), we use a smoother to detect and remove the trend within the data, but unlike [5], we use a LOESS smoother, since it is not significantly different from the HP filter in terms of detrending effect. As can be seen in the following figure, the HP filter and the LOESS smoother generate nearly the same detrending results and may be interchangeable. We also keep using the LOESS smoother for consistency with our earlier experiments, which detrended the data with LOESS.
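
A minimal sketch of this comparison is given below; it assumes the `mFilter` package for the HP filter, and the smoothing parameters shown are illustrative choices rather than those used in the report.

```r
# Compare HP-filter and LOESS trend estimates on a daily series (placeholder names).
library(mFilter)
y  <- daily$Zone1                                     # placeholder daily series
hp <- hpfilter(y, freq = 100, type = "lambda")$trend  # lambda chosen for illustration
lo <- fitted(loess(y ~ seq_along(y), span = 0.5))
plot.ts(cbind(series = y, hp_trend = hp, loess_trend = lo))
```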

To remove seasonality, we follow [3] (Note: In their project, they deseasonalize the data by taking averages.) and eliminate the seasonality by aggregating data points. More specifically, we aggregate the hourly data into daily data. One noticeable difference is that we sum the hourly data over each 24-hour period rather than taking the mean, because power consumption is cumulative, unlike temperature, humidity or other variables whose averages are meaningful. The comments on [3] raise the concern that aggregating data points may make a delayed signal unrecognizable. We argue that this is not problematic, at least for our data: as can be seen in the next section's CCF, coherence and phase plots, we can still identify the delayed signal.

We plot the power consumption of the three zones together to give a straightforward view of their association, and we also compute their sample correlations for a more precise picture. In the following figure, the power consumption of Quads and Smir cycles together, but the association between these two zones and Boussafou is hard to recognize just by eyeballing the figure. We therefore quantitatively measure their correlations in R. The results are reported in the table below. The power consumption of Quads and Smir shows a positive correlation, which matches our observation in the plot that they nearly cycle together. The power consumption of Boussafou appears approximately independent of Quads, and its association with Smir is also relatively weak. This suggests that it may not be suitable to model the relationship using Boussafou's power consumption as a predictor.

Sample Correlation of Three Zones’ Power Consumption
Quads Smir Boussafou
Quads 1.00000 0.59302 0.00061
Smir 0.59302 1.00000 -0.15690
Boussafou 0.00061 -0.15690 1.00000

CCF, Coherence and Phase Plots

In this section, we examine the lagged relationships among the three zones' power consumption through CCF, coherence and phase plots. The theory behind these plots is reviewed first; we then present and interpret the analysis results for our data.

The CCF (sample cross-correlation function) is widely used for identifying the lagged relationship between two time series. It is defined as:

\[ \begin{align} \hat{\rho}_{xy}(h) = \frac{\sum_{n=1}^{N-h} (x_{n+h} - \bar{x})(y_n - \bar{y})}{\sqrt{\sum_{n=1}^{N} (x_{n} - \bar{x})^2\sum_{n=1}^{N} (y_{n} - \bar{y})^2}} \end{align} \] where \(x_n\) and \(y_n\) come from the two time series \(x_{1:N}\) and \(y_{1:N}\).
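
In R, the sample CCF and the lag at which it peaks can be obtained directly (a sketch; `quads_d` and `smir_d` are placeholder names for the detrended, deseasonalized series):

```r
# Sample cross-correlation between detrended Quads and Smir series.
cc <- ccf(quads_d, smir_d, lag.max = 30, plot = TRUE)
cc$lag[which.max(cc$acf)]   # lag at which the cross-correlation peaks
```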

The coherence and phase plots are obtained from the Fourier transform of the cross-covariance. The Fourier transform of the cross-covariance produces the cross-spectrum \(\lambda_{XY}(w) = \sum_{h = -\infty}^\infty e^{- 2 \pi i w h} Cov(X_{n+h}, Y_n)\). By normalizing this spectrum (i.e., \(\rho_{XY}(w) = \frac{\lambda_{XY}(w)}{\sqrt{\lambda_{XX}(w)\ \lambda_{YY}(w)}}\)), we obtain a complex-valued correlation between the frequency components of the two time series at each frequency \(w\). The magnitude of this correlation is the coherence, measuring the degree of association between \(x_{1:N}\) and \(y_{1:N}\) at frequency \(w\). The phase is the angle of this correlation in the complex plane; a phase deviating from 0 indicates that the two series are out of phase at that frequency, with a phase near \(\pm\pi\) corresponding to negative correlation. (The formulas in this part follow [5].)
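
A minimal R sketch for estimating coherence and phase from the smoothed cross-periodogram (the smoothing spans and series names are placeholder choices):

```r
# Coherence and phase between the detrended Quads and Smir series.
s <- spectrum(ts.union(ts(quads_d), ts(smir_d)), spans = c(3, 5), plot = FALSE)
plot(s, plot.type = "coherency")  # squared coherence with confidence bounds
plot(s, plot.type = "phase")      # phase; intervals covering 0 suggest in-phase cycles
```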

From the CCF plots, we can see the lagged relationships among these three zones. In the comparison of the Quads and Smir zones, the positive local maximum around lag 0 shows a positive and nearly instantaneous relationship between the two zones' power consumption. However, we also notice that the other local maxima exceeding the 95% confidence interval appear to follow a 7-day period; this may result from incomplete seasonality removal in the previous parts, and a further experiment may be needed to find a better deseasonalizing method. As for the CCF of Quads vs. Boussafou, a negative relationship at lag 1 is identified by the fact that the minimum and most significant CCF value is attained at lag = 1. For Smir vs. Boussafou, we also observe a negative relationship at lag 1. These findings are consistent with our earlier sample correlation calculation. Since the sample correlations of Quads vs. Boussafou and Smir vs. Boussafou are small, we only estimate the relationship between Quads and Smir hereafter.

The coherence and phase plots for Quads vs. Smir are displayed below. In the coherence plot, we select the three strongest coherence peaks, at frequencies \(w_1 \approx 0.05208\), \(w_2 \approx 0.14583\) and \(w_3 \approx 0.28125\), which correspond to about 19.2, 6.9 and 3.6 days per cycle respectively (these three frequencies are marked by orange lines in the figure below). The signals at these three frequencies are the most powerful in the cross-spectrum. We then identify their corresponding phases in the phase plot. As shown there, the 95% confidence interval covers 0 or nearly approaches 0 at all three frequencies. This indicates that the peaks at frequencies \(w_1\), \(w_2\) and \(w_3\) occur simultaneously for Quads and Smir, so Quads and Smir may cycle together.

Fit and Test Linear Regression Model with ARMA Errors

We verify the association between Quads and Smir by constructing a linear regression model with the detrended Quads data as the response, the detrended Smir data as the predictor, and errors following an ARMA process. Model selection is performed using the AIC criterion.
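
In R, such a linear regression with ARMA errors can be fitted through the `xreg` argument of `arima()` (a sketch; the series names are placeholders, and the orders shown are just two of the candidates discussed below):

```r
# Linear regression of detrended Quads on detrended Smir with ARMA errors.
fit32 <- arima(quads_d, order = c(3, 0, 2), xreg = smir_d)  # ARMA(3,2) errors
fit00 <- arima(quads_d, order = c(0, 0, 0), xreg = smir_d)  # Gaussian-noise errors
```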

As reported in the AIC table below, the variation of the AIC values across AR and MA parameter combinations is not large, and there appear to be numerical errors. For instance, compare the AIC of ARMA(3, 1) errors (1349.984) with that of ARMA(4, 1) errors (1351.982): adding one parameter can increase the AIC by at most 2 (when the log likelihood does not improve), and \(1351.982 - 1349.984 \approx 2\) sits essentially at this boundary, hinting at numerical problems in the likelihood maximization. Therefore, AIC may not be a reliable model selection criterion in this case. For the sake of caution, however, we still test some ARMA error models with low AIC values (such as ARMA(3, 2) errors).

MA0 MA1 MA2 MA3 MA4 MA5
AR0 1343.926 1344.560 1346.216 1347.894 1349.772 1350.516
AR1 1344.423 1346.362 1347.969 1349.812 1351.744 1344.778
AR2 1346.306 1348.113 1345.670 1345.117 1349.475 1344.916
AR3 1348.032 1349.984 1342.963 1344.501 1345.431 1346.846
AR4 1350.031 1351.982 1346.020 1345.062 1345.694 1344.850
AR5 1350.075 1343.915 1345.892 1344.068 1345.003 1342.190

We test the ARMA(3, 2) errors and find that although its residuals are not problematic, it has an MA root on the boundary of invertibility. In the residual analysis plots, the residuals behave like white noise: the residual plot and QQ plot show no violation of the mean-0 and normality assumptions, even though the residuals may exhibit some heteroskedasticity, and the ACF plot suggests that the residuals are uncorrelated. However, the reported MA root of modulus 1 makes the model nearly non-invertible. Therefore, ARMA(3, 2) errors are not a reasonable choice.

## AR roots: 1.1137+0.3226i 1.1137-0.3226i -31.7546+0i 
## Mod of AR roots: 1.1595 1.1595 31.7546 
## MA roots: 1+0i 1.5739+0i 
## Mod of MA roots: 1 1.5739

We finally decide to use ARMA(0, 0) errors (i.e., a linear regression model with Gaussian noise) for two reasons.

First, there is no significant difference between the residual diagnostics of this model and those of the linear regression model with ARMA errors. As shown in the following plots, the diagnostics are similar to those of the previous model with ARMA errors, except at lag 6 in the ACF plot, where the ACF value slightly exceeds the boundary.

Second, the likelihood ratio test (LRT) shows that this model outperforms the null hypothesis model. We display these two models and then discuss the LRT results. The Linear Regression (LR) model with ARMA(0, 0) noise can be expressed as:

\[ Y_n^{Detrended} = 16.7093 + 0.4331 X_n^{Detrended} + \epsilon_n \] where \(Y_n^{Detrended}\) is the detrended power consumption in the Quads zone, \(X_n^{Detrended}\) is the detrended power consumption in the Smir zone, and \(\epsilon_n\) is Gaussian noise.

The model of null hypothesis is:

\[ Y_n^{Detrended} = 19.9995 + \epsilon_n \] where \(Y_n^{Detrended}\) is the detrended power consumption in the Quads zone, and \(\epsilon_n\) is Gaussian noise.

We review the definition of LRT for completeness. According to [6], the test statistic of LRT is:

\[ \ell^{\langle 1 \rangle} - \ell^{\langle 0 \rangle} \approx \tfrac{1}{2} \chi_{D^{\langle 1 \rangle} - D^{\langle 0 \rangle}}^2 \] where \(\chi_d^2\) is a chi-squared random variable on \(d\) degrees of freedom, \(D^{\langle 1 \rangle}\) and \(D^{\langle 0 \rangle}\) are the numbers of parameters in the two models, and \(\ell^{\langle 1 \rangle}\) and \(\ell^{\langle 0 \rangle}\) are their maximized log likelihoods. When the test statistic exceeds the cutoff value of the chi-squared distribution, we can reject the null hypothesis. (The 95% cutoff for the log-likelihood difference is about 1.92, i.e., half of the 95% quantile 3.84 of \(\chi_1^2\).)
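
A minimal R sketch of this test using the two fitted models (placeholder names; `logLik()` extracts the maximized log likelihoods):

```r
# Likelihood ratio test: LR with ARMA(0,0) errors vs. intercept-plus-noise null model.
fit1 <- arima(quads_d, order = c(0, 0, 0), xreg = smir_d)  # alternative model
fit0 <- arima(quads_d, order = c(0, 0, 0))                 # null: intercept + noise
ll_diff <- as.numeric(logLik(fit1) - logLik(fit0))
ll_diff > qchisq(0.95, df = 1) / 2     # TRUE: reject the null at the 95% level
1 - pchisq(2 * ll_diff, df = 1)        # corresponding p-value
```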

In the LRT, the log-likelihood difference of these two models reaches 19.7182, exceeding the chi-squared cutoff value of 1.92, which gives us enough evidence to reject the null hypothesis. In addition, the z-statistic of \(0.4331 / 0.0616 \approx 7.03\) for the coefficient of the detrended Smir zone power consumption shows that this coefficient differs significantly from 0.

## Liklihood Ratio Test
## Pure Noise with Intercept log likelihood: -688.6814 
## LR with ARMA(0, 0) Errors log likelihood: -668.9632 
## log likelihood difference: 19.7182 
## degree of freedom difference: 1 
## chi-square cutoff: 1.9207 
## p-value: 3.38935e-10 
## Test result:
##         Reject null hypothesis, choose LR with ARMA(0, 0) Errors

Our fitted model supports our previous finding that the correlation between the power consumption of the Quads zone and the Smir zone is positive, because the slope parameter in the model is \(0.4331 > 0\). One possible reason for this positive correlation is that the two zones share similar residential/commercial activities and infrastructure due to geographical proximity [7].

Conclusion

In this project, we analysed the power consumption data of Tetouan during 2017. We extracted the data from April to June and aggregated it into electricity consumption per hour and per day. We mainly fitted the hourly dataset, which shows some key patterns: it has a seasonality of 24 hours, confirmed by our spectrum investigation in the frequency domain, and a non-linear trend, so we transformed and detrended the data. Based on the minimum AIC criterion and the minimum root criterion, we constructed two different \(SARIMA\) models through two different approaches:

Model 1 - \(\displaystyle {\text{SARIMA}}(2,0,0)\times(1,1,0)_{24}\):

\[(1 - \phi_1 B - \phi_2 B^2)(1-\Phi_1 B^{24}) [(1 - B^{24}) X_t-\mu] = \varepsilon_t\]

Model 2 - \(\displaystyle {\text{SARMA}}(1,1)\times(1,1)_{24}\).

\[(1 - \phi_1 B )(1-\Phi_1 B^{24}) [X_t-\mu] = (1+\psi_1 B) (1+\Psi_1 B^{24}) \varepsilon_t\]

The residual analysis gives satisfactory results for both models, and both fit the hourly data well in general. However, there are still some limitations in our report:

  • The fitting results of models with different parameters are not significantly different; no model stands out. This may be because the trend component of the original data was not well identified, so the data used for fitting were not stationary enough.

  • The dataset contains other variables such as humidity and temperature, which we did not fully utilize. Incorporating them could give a better understanding of electricity consumption and a better model.

References

[1]
FEDESORIANO, “Electric power consumption.” https://www.kaggle.com/datasets/fedesoriano/electric-power-consumption/data, 2022.
[2]
E. Ionides, “Time series analysis.” https://ionides.github.io/531w24/, 2024.
[3]
[4]
[5]
E. L. Ionides, "STATS 531 Lecture 09." https://ionides.github.io/531w24/09/index.html.
[6]
E. L. Ionides, "STATS 531 Lecture 05." https://ionides.github.io/531w24/05/index.html.
[7]