STATS531 Midterm Project

Published

February 22, 2024

Introduction

The sun releases a stream of charged particles from its upper atmosphere, known as the solar wind. [1] This stream is largely deflected by Earth's magnetosphere, and its interaction with the magnetosphere is a major component of space weather. [2] The Geostationary Operational Environmental Satellites (GOES) are a series of geostationary satellites that provide continuous monitoring of the space environment. [3] In this project, we analyze X-ray flux data from GOES [4] and perform time series analysis and forecasting on the data.

Data Processing

We begin our analysis by previewing the first few rows of the dataset.

  year month day  sat        flux1        flux2
1 2005     1   1 go10 2.345150e-06 4.784602e-07
2 2005     1   2 go10 4.112945e-07 2.196580e-08
3 2005     1   3 go10 3.982454e-07 2.178465e-08
4 2005     1   4 go10 3.743698e-07 2.658230e-08
5 2005     1   5 go10 1.267019e-07 1.192896e-08
6 2005     1   6 go10 4.886487e-08 5.977783e-09

Interpolation and Log Transformation

We interpolate the missing values using na.approx from the zoo package and apply a log transformation to the flux1 and flux2 columns, since the STL decomposition used below does not allow missing values. Below is a plot of the log-transformed and interpolated values for flux1 and flux2.
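This preprocessing step can be sketched as follows; a minimal sketch, assuming the data frame shown above is named `goes` (the name is hypothetical):

```r
library(zoo)

# Linearly interpolate missing flux values; na.rm = FALSE keeps
# leading/trailing NAs rather than dropping them.
goes$flux1 <- na.approx(goes$flux1, na.rm = FALSE)
goes$flux2 <- na.approx(goes$flux2, na.rm = FALSE)

# Log-transform the interpolated fluxes (values are strictly positive).
goes$log_flux1 <- log(goes$flux1)
goes$log_flux2 <- log(goes$flux2)
```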

Smoothing

From the visualization, one can observe that the time series is quite noisy. To reduce the noise and reveal underlying patterns, we apply a rolling mean, and then an iterated rolling mean, to the log-transformed values of flux1 and flux2.

Below is a plot comparing the original, rolling-mean, and iterated rolling-mean series for flux1 and flux2.
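The smoothing step can be sketched with zoo's rolling mean; the window length and series name are assumptions, not taken from the report:

```r
library(zoo)

window <- 30  # smoothing window in days (assumed)

# One pass of a centered rolling mean over the log-transformed series.
smooth1 <- rollmean(goes$log_flux2, k = window, fill = NA)

# "Iterated" rolling mean: apply the same filter a second time,
# which further attenuates short-term noise.
smooth2 <- rollmean(smooth1, k = window, fill = NA)
```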

Time Series Analysis and Forecasting

Autocorrelation and ARIMA Modeling

We first analyze the autocorrelation function in the dataset. Below is the ACF plot for the training set of flux2.

The autocorrelation function (ACF) plot displays the autocorrelation at lags extending up to approximately 1750. There is a sharp initial decline at the early lags, followed by a gradual decay, indicating strong autocorrelation at short lags. The ACF values remain outside the confidence bounds, shown as dashed blue lines, across a broad range of lags; these bounds mark the threshold beyond which autocorrelations are deemed statistically significant. The fact that the ACF does not die out quickly but remains significant across many lags suggests that past values exert a prolonged influence on future values. A further distinctive feature is the oscillatory pattern, which implies cyclical behavior: spikes that exceed the significance bounds at regular intervals suggest the flux2 series may exhibit a seasonal effect.
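A sketch of how such a plot can be produced, reusing the training series name `curr_series_training` that appears in the model summary later in the report:

```r
# ACF of the training series out to ~1750 lags, with 95% bounds
# drawn as dashed lines by default.
acf(curr_series_training, lag.max = 1750,
    main = "ACF of flux2 (training set)")
```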

ARIMA Modeling and Forecasting

We then fit an ARIMA model to the training series. Based on the ACF plot, we set the maximum values of p and q to 50, since the autocorrelation drops rapidly over roughly the first 50 lags. We specify seasonal = TRUE due to the presence of an oscillating pattern, and set d = 1 so that the series is differenced once to induce stationarity. Below is the model summary:

Series: curr_series_training 
ARIMA(2,1,2) 

Coefficients:
         ar1     ar2      ma1      ma2
      0.0264  0.6760  -0.2483  -0.6934
s.e.  0.0575  0.0438   0.0634   0.0602

sigma^2 = 0.1428:  log likelihood = -960.55
AIC=1931.09   AICc=1931.12   BIC=1959.48

Training set error measures:
                      ME      RMSE       MAE         MPE      MAPE     MASE
Training set 0.001731157 0.3774885 0.1720078 -0.05473336 0.9597207 1.043259
                   ACF1
Training set 0.01303998

After searching over the candidate parameters, the best model found is ARIMA(2,1,2). One can observe that the ar1 coefficient is not significant (its estimate is smaller than its standard error). The ME is close to zero, indicating that the fitted values are essentially unbiased. However, the MAPE is relatively high, and the MASE is greater than 1, suggesting that the forecasts are not very accurate.
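The parameter search described above can be sketched with forecast's auto.arima; the exact search settings are assumptions consistent with the text, not the report's verbatim code:

```r
library(forecast)

# Search over ARMA orders up to 50 with one ordinary difference,
# allowing seasonal terms, as described in the text.
model <- auto.arima(curr_series_training,
                    max.p = 50, max.q = 50, d = 1,
                    seasonal = TRUE)
summary(model)
```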

We then use the fitted model to forecast the next 31 days and plot the forecasted values alongside the actual values. Below are two plots of the forecasted and actual values. The first shows the full past series together with the forecast, its prediction intervals, and the actual observations; the second zooms in on the forecast horizon to examine the difference between the predictions and the actual observations.

From the two plots, the actual values appear to suffer from high volatility, whereas the forecasted values are relatively stable. From the second plot in particular, we can observe that the ARIMA model fails to capture the short-term oscillating patterns in the actual values of flux2. The ARIMA(2,1,2) forecast captures only the general trend of the data, not the detailed periodic fluctuations.
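A minimal sketch of the forecasting step, assuming the fitted model object is `model` as in the diagnostics later in the report:

```r
library(forecast)

# Forecast 31 days ahead with prediction intervals.
fc <- forecast(model, h = 31)

# Full history with the forecast appended.
plot(fc)

# Zoomed view: show only the last 100 observations plus the forecast
# (the 'include' value is an assumption).
plot(fc, include = 100)
```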

STL Decomposition and Component Forecasting

Decomposition

To accurately determine the periodic frequency for the STL decomposition, we conducted spectral analysis to identify the frequency with the greatest power. We first detrended the original series using a rolling mean, which avoids the undesirable influence of the 0th frequency, which represents the trend.

From the smoothed periodogram, we can determine that the total time of an oscillation period is about 13.67089 days.
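The period can be recovered from a smoothed periodogram roughly as follows; the detrending window and smoothing spans are assumptions:

```r
library(zoo)

# Detrend with a rolling mean (window assumed) to suppress the
# 0th-frequency trend component.
detrended <- goes$log_flux2 - rollmean(goes$log_flux2, k = 30, fill = NA)

# Smoothed periodogram; spans control the degree of smoothing.
spec <- spectrum(na.omit(detrended), spans = c(3, 5), plot = FALSE)

# The dominant frequency (cycles per day) gives the period in days.
period <- 1 / spec$freq[which.max(spec$spec)]
period  # about 13.67 days in the report
```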

We set frequency = 365.25 for the daily data to account for leap years over a long period. Then, we interpolate the missing values and apply STL decomposition, using the rolling-mean window for t.window and the identified period for s.window. Below is a plot of the STL components.

The top panel shows the original flux2 series, the second panel the trend component, the third the seasonal component, and the bottom panel the remainder. The plot suggests complex seasonality, and the large remainder component indicates that the decomposition may not be sufficient to capture the underlying patterns in the data.
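The decomposition step can be sketched as below; the specific s.window and t.window values are assumptions (stl() requires odd window spans), chosen to approximate the identified ~14-day period and the rolling-mean window:

```r
library(zoo)

# Daily series; frequency = 365.25 accounts for leap years.
flux2_ts <- ts(na.approx(goes$log_flux2), frequency = 365.25)

# STL decomposition; window spans are assumed (both must be odd).
fit_stl <- stl(flux2_ts, s.window = 13, t.window = 31)
plot(fit_stl)  # data, seasonal, trend, remainder panels
```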

Component Forecasting

We fit separate ARIMA models to each component and forecast the next 31 points for each component. Below are the forecasts for the trend, seasonal, and remainder components.

Each plot includes a blue shaded area representing the forecasted values with their confidence intervals. The forecasts appear good for the seasonal and remainder components, but the trend component is not well captured. Combining the forecasts yields the following plot.

There is a visible discrepancy between the actual and forecasted values. The actual values are more volatile and trend downward overall, while the forecasted values not only fail to capture the volatility but also trend upward.
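The component-wise forecasting and recombination can be sketched as follows, assuming the STL fit is stored in `fit_stl` (the column names are those produced by stl()):

```r
library(forecast)

comps <- fit_stl$time.series   # columns: seasonal, trend, remainder
h <- 31

# Fit a separate ARIMA model to each component and keep its
# point forecast for the next h days.
fc_parts <- lapply(c("seasonal", "trend", "remainder"), function(name) {
  forecast(auto.arima(comps[, name]), h = h)$mean
})

# Recombine by summing the three component forecasts.
combined <- Reduce(`+`, fc_parts)
```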

Model Diagnostics

In this section, we perform diagnostic checks on both the baseline and the STL model, following procedures similar to those in [5].

Baseline Model

We perform a diagnostic check on the baseline ARIMA model. Below are a plot of the residuals and the ACF plot of the residuals.

The residuals are spread uniformly in the vertical direction but not horizontally, and the points in the QQ plot deviate from the reference line, indicating that the residuals are not normally distributed.
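These diagnostics can be produced with base R graphics, assuming the fitted ARIMA object is `model` as in the Ljung-Box output below:

```r
res <- residuals(model)

# Residuals over time, QQ plot against the normal distribution,
# and the residual ACF.
plot(res, main = "Residuals")
qqnorm(res); qqline(res)
acf(res, main = "ACF of residuals")
```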

We also perform a Ljung-Box test to check for autocorrelation in the residuals.


    Box-Ljung test

data:  residuals(model)
X-squared = 71.472, df = 20, p-value = 1.045e-07

It yields a p-value of \(1.045\times 10^{-7}\), which is less than \(0.05\). This suggests that the residuals are autocorrelated and that the model is not capturing the underlying patterns in the data.
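The test above corresponds to a single call to Box.test; with lag = 20 and no fitdf correction, the degrees of freedom match the reported df = 20:

```r
# Ljung-Box test for autocorrelation in the ARIMA residuals.
Box.test(residuals(model), lag = 20, type = "Ljung-Box")
```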

STL Model

Below is a plot of the residuals and the ACF plot of the residuals for the combined forecast.

The points in the QQ plot now fall on the line more closely than for the baseline model, and the ACF plot shows few significant lags, so the residuals are likely not autocorrelated.

Conclusion

The time series analysis and forecasting part of this study explores two main approaches: direct ARIMA modeling and forecasting, and component-wise forecasting in which each component of an STL decomposition is modeled with its own ARIMA model.

The decision to use the ARIMA model (specifically the ARIMA(2,1,2) configuration) was based on the analysis of the autocorrelation function, which showed strong autocorrelation with oscillatory patterns at short lags. Nonetheless, the forecast plots of the ARIMA model show general trend forecasts and do not capture the short-term oscillations and detailed cyclical fluctuations observed in the actual values. This limitation is particularly evident in the plots of forecasts against actual observations, highlighting the model’s inability to replicate the high volatility and complex patterns inherent in the data.

To improve the accuracy of the forecasts, an STL decomposition of the time series was used to provide a finer-grained analysis by splitting the series into trend, seasonal, and remainder components. Each component was then forecast separately with an ARIMA model. As can be seen from the confidence intervals in the forecast plots, the component-based method performs well in capturing seasonal and remainder variation. The prediction plots also show that the trend component is forecast better under the STL-based approach than under the direct ARIMA model.

Finally, model diagnostics for both methods reveal their fitting and predictive abilities. The residuals of the ARIMA model show signs of non-normality and autocorrelation, as shown by the QQ plot and the Ljung-Box test. This suggests that the model may not adequately capture the underlying structure of the data. In contrast, the STL-based forecast is more consistent with the assumption of normally distributed residuals, with fewer significant lags in the ACF plot, indicating a better fit to the data.

Considering both the analysis and the diagnostics, the STL decomposition and component forecasting approach appears to provide a more nuanced and flexible framework for understanding and forecasting this time series. Its ability to decompose the series into interpretable components and produce more detailed forecasts, particularly for the seasonal and remainder components, makes it potentially the more effective tool for this data set. Overall, combining STL decomposition with component-wise ARIMA forecasting fits the data better and achieves higher prediction accuracy than the direct ARIMA model.

References

[1] Wikipedia contributors. “Solar Wind.” Wikipedia, Wikimedia Foundation, https://en.wikipedia.org/wiki/Solar_wind. Accessed 22 Feb. 2024.

[2] Wikipedia contributors. “Space Weather.” Wikipedia, Wikimedia Foundation, https://en.wikipedia.org/wiki/Space_weather. Accessed 22 Feb. 2024.

[3] “Geostationary Operational Environmental Satellites (GOES) - Office of Satellite and Product Operations.” NOAA’s Office of Satellite and Product Operations, National Oceanic and Atmospheric Administration, https://www.ospo.noaa.gov/Operations/GOES/index.html. Accessed 22 Feb. 2024.

[4] terhorst. “GOES Daily.” GitHub, https://github.com/terhorst/stats504/blob/main/lectures/week5/goes_daily.csv.gz. Accessed 22 Feb. 2024.

[5] STATS531 Midterm Project Group 15. “Forecasting Solar Flare from X-ray Flux Time Series.” STATS 531 Midterm Project, University of Michigan, https://ionides.github.io/531w21/midterm_project/project15/project.html. Accessed 22 Feb. 2024.