Introduction

Solar power generation is a well-documented daily cycle, inherently driven by the Earth’s rotation. Understanding this cycle is essential for efficient solar energy production forecasting. However, the extent to which this periodicity remains stable - especially in the 21st century with dynamic environmental conditions - remains an open question. This raises the question of whether solar power generation continues to follow the expected 24-hour cycle as a strictly predictable pattern.

This study explores the periodicity of solar power generation to develop a forecasting model using time series data. Through exploratory data analysis (EDA) and spectral analysis, we assess both the daily and sub-daily periodic structures and investigate the presence of potential hidden patterns. Based on these findings, we develop a Seasonal Autoregressive Integrated Moving Average (SARIMA) model to forecast power generation at an hourly interval, aiming to provide deeper insights into the characteristics of solar power generation cycles.

Data

The dataset used in this project consists of solar power generation data collected from two solar power plants in India over a period of 34 days. The dataset is structured into two categories of files: power generation data and sensor readings. The power generation data is recorded at the inverter level, where each inverter is connected to multiple lines of solar panels, capturing real-time energy production information. This allows for an analysis of variations in energy output across different inverters within the same plant.

The sensor data, on the other hand, is collected at the plant level using a strategically positioned array of sensors. These sensors measure environmental parameters such as temperature, irradiation, and other meteorological factors that influence solar power generation. The plant-level sensor readings provide a comprehensive view of the environmental conditions affecting the entire solar facility. By combining power generation data with sensor readings, this dataset facilitates a detailed examination of the relationship between environmental factors and solar energy production efficiency.

This dataset is sourced from Kaggle and is publicly available under the title “Solar Power Generation Data” [1].

Exploratory Data Analysis

We conduct an exploratory data analysis to examine periodic patterns and potential factors that are associated with solar power generation. The EDA focuses on identifying key temporal patterns in the data, assessing periodicity, and exploring factors that may contribute to variations in daily solar power generation. We aim to uncover underlying patterns to develop a more efficient forecasting model.

The time series plot reveals clear daily periodic patterns, corresponding to the natural solar cycle. Total generation varies by date with fluctuating peaks, indicating a non-stationary mean. There is no upward or downward trend, suggesting a potential stationarity in variance.

Autocorrelation Function

We then examine the autocorrelation function (ACF) at different time resolutions to investigate the stationarity. At 15-minute intervals, which is the original data, the ACF exhibits a sinusoidal pattern. As the time span increases to hourly and three-hour intervals, the periodic pattern remains, but with decreasing magnitude of autocorrelation The oscillating pattern with gradual decay suggests short-term dependence, indicating that within day values are closely related [2].

At longer time spans from 12 hours to a day, the ACF reveals a stationarity process. The strong lag 1 autocorrelation with a sharp decay suggests short term dependence. It means the shorter span of recent values have a stronger influence on today’s power generation, compared to older days [2].

As time span increases, the periodicity weakens and the autocorrelation decreases. This indicates a strong intraday periodicity of solar power generation. The forecasting model may be more efficient when using hourly data to balance bias and variance.

Basic EDA

The heatmap visualizes the daily solar power generation pattern, aligning with the common expectation that the power generation increases as the sun rises, and starts to decrease as the sun sets. The day-to-day variability can be attributed to the weather conditions. The darker part of the heatmap corresponds to zero power generation, highlighting the substantial amount of zero values in our dataset. However, these zero values are an inherent part of the dataset. Since they provide meaningful information about the underlying process, we choose to not remove those values.

This plot also aligns with the expected solar energy generation pattern. One notable observation is increasing variance over time during the period of power generation.

Relationship with Weather Factors

To further investigate the factors influencing total power generation, we examine the relationship with key weather variables: irradiation, module temperature, and ambient temperature. The key weather variables are provided in the dataset. Solar irradiation is “power received from the sun”, which can be understood as the amount of sunlight received. Module temperature refers to the temperature of the photovoltaic cells. This can be understood as the temperature of the solar panel. Ambient temperature refers to the temperature of the surrounding air [3-5].

The pairplots show a strong correlation among all four variables. Two pairs - total generation and irradiation, module temperature and irradiation - have strong positive linear relationships. Interestingly, module temperature and ambient temperature have relatively dispersed linear trends, as does ambient temperature and irradiation. This suggests a weaker correlation. However, the absolute value of the correlation coefficients are greater than 0.7. While these relationships may be weaker than irradiation with total generation, they are still considered strong correlations. This suggests that all three weather variables are closely related to each other, even though other confounding factors and latent variables may weaken their linear relationship.

Spectral Analysis

For this section, we look into the spectrum of the total solar power generated by plant 1, apart from finding the dominant frequency, we aim to apply Seasonal and Trend decomposition using Loess to deeply understand the differences of trend between two plants.

Periodogram

Plotting the unsmoothed periodogram, smoothed periodogram, and AR based periodogram, we concluded the period is 24 hours for plant 1, which aligns with our EDA.

## For Plant 1 
## Dominant Frequency (Raw): 0.010625  -> Period: 23.52941 Hours
## Dominant Frequency (Smoothed): 0.010625  -> Period: 23.52941 Hours
## Dominant Frequency (AR-based): 0.010625  -> Period: 23.52941 Hours

STL Decomposition

From the STL decomposition of both plants, by separating the seasonal effect and trend effect, we may observe some differences between two plants. Plant 1 has shown an optimal seasonal (daily) peak at noon, which can be understood by the intensity of lights (irradiation). Interestingly, we observe two peaks of trend during May, which might be correlated to its particular biological features.

SARMA modeling

Given the clear daily seasonality within our data, it makes sense to fit a SARIMA model over a ARIMA model to model the daily behavioral patterns in the data.

The general form of a seasonal ARIMA model with seasonal period 24, with each data point being an hour, denoted as SARIMA\((p,d,q) \times (P,D,Q)_{24}\), is given by \[ \phi(B) \, \Phi(B^{24}) \, (1 - B)^d \, (1 - B^{24})^D \, (Y_n-\mu) = \psi(B) \, \Psi(B^{24}) \, \epsilon_n, \] where - \(\phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p\)
represents the nonseasonal AR polynomial, - \(\psi(B) = 1 + \psi_1 B + \psi_2 B^2 + \cdots + \psi_q B^q\)
represents the nonseasonal MA polynomial, - \(\Phi(B^{24}) = 1 - \Phi_1 B^{24} - \Phi_2 B^{24 \cdot 2} - \cdots - \Phi_P B^{24P}\)
represents the seasonal AR polynomial, - \(\Psi(B^{24}) = 1 + \Psi_1 B^{24} + \Psi_2 B^{24 \cdot 2} + \cdots + \Psi_Q B^{24Q}\)
represents the seasonal MA polynomial, - \(d\) and \(D\) are the orders of nonseasonal and seasonal differencing, respectively, - \(s = 96\) denotes the seasonal period, - \(\epsilon_n\) is a white noise process [6].

For our model selection, we optimized the AIC of our model. The AIC is given by \[ \text{AIC} = -2 \log(L(\hat{\theta})) + 2k, \]

where:

The AIC provides a measure of the trade-off between the model’s goodness-of-fit and its complexity [6].

Below, we present the AIC values for various SARIMA\((p,d,q) \times (0,1,1)_{24}\) models fitted on our data for different combinations of \(p\) and \(q\).

MA0 MA1 MA2 MA3 MA4 MA5
AR0 21198.09 20679.76 20496.34 20403.23 20358.43 20329.31
AR1 20358.02 20329.04 20323.85 20314.37 20309.72 20303.97
AR2 20319.32 20225.62 20227.56 20227.14 20229.13 20230.87
AR3 20305.12 20227.57 20228.48 20231.37 20231.11 20233.12
AR4 20279.61 20227.20 20229.20 20229.57 20232.18 20227.71
AR5 20265.11 20229.20 20231.18 20219.46 20228.12 20232.47

Our analysis indicates that a Seasonal ARIMA model performs better than a ARIMA model, with an AIC that is over 600 units lower. Specifically, after parameter tuning, the optimal model was identified as \[ \text{SARIMA}(2,0,1) \times (0,1,1)_{24}, \]

Although one could formally test the models via a likelihood ratio test, the SARIMA model is trivially superior in terms of likelihood. However, upon examining the AIC table for the SARIMA models, we notice an unusual pattern: the AIC often jumps by more than 2 points between model specifications. Such behavior suggests that the model estimation may be encountering convergence issues.

Residual Analysis

We will evaluate this model’s residuals, their autocorrelation structure, and their QQ plot to determine if the anomalous AIC behavior stems from non-convergence or other issues.

## 
## Call:
## arima(x = hour_df$total_power, order = c(4, 0, 1), seasonal = list(order = c(0, 
##     1, 1), period = 24))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     sma1
##       1.8036  -0.9307  0.1074  -0.0619  -0.8457  -0.7779
## s.e.  0.0471   0.0796  0.0751   0.0400   0.0287   0.0281
## 
## sigma^2 estimated as 1.334e+10:  log likelihood = -10106.6,  aic = 20227.2
## 
## Training set error measures:
##                     ME     RMSE      MAE MPE MAPE     MASE        ACF1
## Training set -69.63747 113752.6 72392.07 NaN  Inf 0.737202 0.001454645

## 
## Call:
## arima(x = hour_df$total_power, order = c(4, 0, 1), seasonal = list(order = c(0, 
##     1, 1), period = 24))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     sma1
##       1.8036  -0.9307  0.1074  -0.0619  -0.8457  -0.7779
## s.e.  0.0471   0.0796  0.0751   0.0400   0.0287   0.0281
## 
## sigma^2 estimated as 1.334e+10:  log likelihood = -10106.6,  aic = 20227.2

Our diagnostic analysis (e.g., via Q-Q plots) reveals that the residuals of our fitted model exhibit significantly fatter tails than expected under the normality assumption. Since the above log likelihood function relies on the assumption that the residuals are normally distributed, deviations from normality—especially in the form of heavy tails—can lead to inefficient or biased parameter estimates.

Transforming the dataset

To address this issue, we propose applying a power transformation, such as the Box–Cox transformation, to normalize our data [9]. The Box–Cox transformation is defined as \[ Y^{(\lambda)} = \begin{cases} \frac{Y^\lambda - 1}{\lambda}, & \lambda \neq 0, \\ \log(Y), & \lambda = 0. \end{cases} \]

By selecting an appropriate value of \(\lambda\), we can stabilize the variance and make the distribution of the residuals more normal [9]. This transformation should hypothetically lead to a more reliable estimation of the model parameters via the log likelihood function.

MA0 MA1 MA2 MA3 MA4 MA5
AR0 11040.32 10373.44 10101.69 9937.18 9866.63 9823.94
AR1 9890.29 9810.24 9795.50 9778.22 9772.31 9758.93
AR2 9773.43 9647.82 9643.28 9642.00 9643.41 9644.16
AR3 9749.43 9644.39 9643.09 9638.71 9642.62 9646.18
AR4 9714.94 9641.80 9639.06 9653.57 9646.41 9644.08
AR5 9695.52 9643.50 9645.75 9643.98 9641.66 9643.85

Our analysis indicates that the best model on the Box–Cox transformed dataset is \[ \text{SARMA}(4,0,1) \times (0,1,1)_{24}, \] with an AIC of -9641.80.

Diagnostics

Below, we perform the same diagnostics alongside an ADF test to evaluate model performance.

## Series: ts_data 
## ARIMA(4,0,1)(0,1,1)[24] 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     sma1
##       1.9452  -1.1727  0.2414  -0.0853  -0.8465  -0.7246
## s.e.  0.0461   0.0850  0.0790   0.0396   0.0275   0.0310
## 
## sigma^2 = 14998:  log likelihood = -4813.9
## AIC=9641.8   AICc=9641.95   BIC=9674.34
## 
## Training set error measures:
##                      ME     RMSE      MAE MPE MAPE      MASE         ACF1
## Training set -0.7230995 120.1379 78.67159 NaN  Inf 0.5812528 0.0003646159
## 
## Call:
## arima(x = hour_df$total_power_boxcox, order = c(4, 0, 1), seasonal = list(order = c(0, 
##     1, 1), period = 24))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     sma1
##       1.9452  -1.1727  0.2414  -0.0853  -0.8465  -0.7246
## s.e.  0.0461   0.0850  0.0790   0.0396   0.0275   0.0310
## 
## sigma^2 estimated as 14882:  log likelihood = -4813.9,  aic = 9641.8
## 
## Training set error measures:
##                      ME     RMSE      MAE MPE MAPE      MASE         ACF1
## Training set -0.7230995 120.1379 78.67159 NaN  Inf 0.5693076 0.0003646159

## 
## Call:
## arima(x = hour_df$total_power_boxcox, order = c(4, 0, 1), seasonal = list(order = c(0, 
##     1, 1), period = 24))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     sma1
##       1.9452  -1.1727  0.2414  -0.0853  -0.8465  -0.7246
## s.e.  0.0461   0.0850  0.0790   0.0396   0.0275   0.0310
## 
## sigma^2 estimated as 14882:  log likelihood = -4813.9,  aic = 9641.8

This model has roots

## [1] "AR roots:"
## [1] 0.9948592+0.2736962i 0.9948592-0.2736962i 0.4202435-3.2916579i
## [4] 0.4202435+3.2916579i
## [1] "MA roots:"
## [1] 1.181347+0i
## [1] "Seasonal MA roots:"
## [1] 1.380015+0i

Since all of the roots lie outside the unit circle, our model is stationary and causal, even when accounting for seasonal effects. Additionally, the MA root being outside the unit circle confirms that the model is invertible. It is worth noting that two of the AR roots are close to 1, which means the model is at the threshhold of non-causality and could be unstable.

Additionally, we performed a Ljung-Box test on the residuals to assess autocorrelation. The Ljung-Box test is a statistical procedure used to detect whether any group of autocorrelations in the residuals of a model significantly deviates from zero, indicating that the model may not have captured all the temporal dependencies in the data [8].

## 
##  Box-Ljung test
## 
## data:  model$residuals
## X-squared = 17.418, df = 20, p-value = 0.6257

The Ljung-Box test produced a p-value of 0.6257, indicating that we do not have enough evidence to reject the null hypothesis that there is no significant autocorrelation in the residuals, suggesting that the model has adequately captured the serial correlation in the data [8].

However, even after applying the Box–Cox transformation to stabilize the variance, the residuals continue to exhibit fat tails. This heavy-tailed behavior suggests that the standard normality assumption for the error distribution may not be fully appropriate. Consequently, we decided to fit the model using a Studentized t-distribution, which has fatter tails, to better capture the characteristics of the residuals, which is given by: \[ f(\varepsilon_t \mid \nu, \sigma^2) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\sigma\,\Gamma\left(\frac{\nu}{2}\right)} \left[1 + \frac{1}{\nu}\left(\frac{\varepsilon_t}{\sigma}\right)^2 \right]^{-\frac{\nu+1}{2}}, \]

where a smaller value of \(\nu\) corresponds to heavier tails. By allowing for such an error structure, we aim to obtain more robust parameter estimates.

SARMA with t-distributed errors

Below, we use the rugarch package to manually try to fit a SARIMA model with t‑distributed errors on the Box–Cox transformed time series. To capture seasonality, we generate Fourier terms (with \(K=1\)) [7].

## Warning in .sgarchfit(spec = spec, data = data, out.sample = out.sample, : 
## ugarchfit-->warning: solver failer to converge.
## 
## *---------------------------------*
## *          GARCH Model Fit        *
## *---------------------------------*
## 
## Conditional Variance Dynamics    
## -----------------------------------
## GARCH Model  : sGARCH(0,0)
## Mean Model   : ARFIMA(4,0,1)
## Distribution : std 
## 
## Convergence Problem:
## Solver Message:

However, when fitting the model above, our model does not converge. This lack of convergence can likely be attributed to numerical instability.

Final Model

Because we were unable to get the t-distributed errors to converge, the SARIMA on our power transformed time series data seems to be our best alternative, given by

## 
## Call:
## arima(x = hour_df$total_power_boxcox, order = c(4, 0, 1), seasonal = list(order = c(0, 
##     1, 1), period = 24))
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1     sma1
##       1.9452  -1.1727  0.2414  -0.0853  -0.8465  -0.7246
## s.e.  0.0461   0.0850  0.0790   0.0396   0.0275   0.0310
## 
## sigma^2 estimated as 14882:  log likelihood = -4813.9,  aic = 9641.8
## 
## Training set error measures:
##                      ME     RMSE      MAE MPE MAPE      MASE         ACF1
## Training set -0.7230995 120.1379 78.67159 NaN  Inf 0.5693076 0.0003646159
## [1] "Boxcox lambda: 0.475520302060264"

Conclusion

We focused on developing a forecast model for solar power generation dataset, after thorough investigation of the periodicities and spectral density of the data. Our final model was \(\text{SARMA}(4,0,1) \times (0,1,1)_{24}\). With 24 seasonal terms, we fitted the model on our power transformed dataset. The model was built upon a previous study on wind power generation [10] by adopting and extending their model for solar power generation. In additions to the previous models, we address the fat-tailed residuals through power transformation and heavy-tailed distributions.

However, one limitation of our approach is that our attempt to model t-distributed errors did not converge. The residuals continue to exhibit heavier tails than expected under a normal distribution. Interpretations of our final model should incorporate additional uncertainty to account for the potential impact of this heavy-tailed behavior.

Our findings with the final model selection revealed that the solar power generation continues to follow the 24-hour daily cycle. The solar power generation process itself is a complex process, creating significant challenges in forecasting and model specification. The nature of the dataset due to its numerous confounding factors and latent variables contribute to this challenge. Additionally, ARMA model may not be sufficient to capture the intrinsic dynamics of periodicities and complexities in the dataset.

References

[1] Anikannal. (2019). Solar Power Generation Data. Kaggle. Retrieved from https://www.kaggle.com/datasets/anikannal/solar-power-generation-data/data.

[2] Fiveable. (n.d.). Autocorrelation function (ACF) interpretation. Fiveable Library. Retrieved from https://library.fiveable.me/intro-time-series/unit-4/autocorrelation-function-acf-interpretation/study-guide/kRV2Op01nfiMsv6o

[3] Wikipedia contributors. (n.d.). Solar irradiance. Wikipedia, The Free Encyclopedia. Retrieved February 21, 2025, from https://en.wikipedia.org/wiki/Solar_irradiance

[4] Seven Sensor. (n.d.). What is a module temperature sensor & why is it important in PV installations? Retrieved from https://www.sevensensor.com/what-is-a-module-temperature-sensor-why-it-is-important-in-pv-installations

[5] Wikipedia contributors. (n.d.). Room temperature. Wikipedia, The Free Encyclopedia. Retrieved February 21, 2025, from https://en.wikipedia.org/wiki/Room_temperature#:~:text=In%20contrast%2C%20ambient%20temperature%20is,from%20an%20ideal%20room%20temperature

[6] Ionides, E. (2025). Notes for STATS/DATASCI 531, Modeling and Analysis of Time Series Data.

[7] “Fitting ARIMA-GARCH Model Using “Rugarch” Package.” Cross Validated, 12 Oct. 2015, stats.stackexchange.com/questions/176550/fitting-arima-garch-model-using-rugarch-package. Accessed 21 Feb. 2025.

[8] Diagnostics, STAT 510. (n.d.). PennState: Statistics Online Courses. https://online.stat.psu.edu/stat510/lesson/3/3.2

[9] Rossiter, D G. “Box-Cox Transformation.” Cornell.edu, www.css.cornell.edu/faculty/dgr2/_static/files/R_html/Transformations.html#2_the_box-cox_transform. Accessed 21 Feb. 2025.

[10] STATS-531 - Midterm Project. (2015). Github.io. https://ionides.github.io/531w24/midterm_project/project11/blinded.html#Modeling