1 Introduction

Sunspots are areas that appear dark on the surface of the Sun. Solar flares are a sudden explosion of energy caused by tangling, crossing or reorganizing of magnetic field lines near sunspots. They have massive destructive power when reaches certain amount of magnitude and can interfere space-earth radio communications and neutralize space-earth electronics equipment. In the this study, we want to see whether it’s possible to build a solar flare forecasting model on time series data from the Geostationary Operational Environmental Satellite (GOES).

GOES is operated by the United States’ National Oceanic and Atmospheric Administration (NOAA)’s National Environmental Satellite, Data, and Information Service division. It supports weather forecasting, severe storm tracking, and meteorology research. GOES spacecraft carry on board the Space Environment Monitor (SEM) instrument consist of a magnetometer, X-ray sensor, high energy proton and alpha particle detector, and a energetic particles sensor. The X-ray sensor (XRS) found on board of GOES capable of registering two wavelength bands : 0.05-0.4nm and 0.1-0.8nm. The data for this study is the X-ray flux captured by the 0.1-0.8nm sensor during 2019. The detail about the data is in the next section. When the X-ray flux reaches certain intensity thresholds, solar flares occur.

1.1 Research Objectives

Given the X-ray flux time series, we want to understand these questions:

  • Is it possible to predict future solar flare events with the past data ? Furthermore, does the time series have any seasonality effect ?

  • If the data indeed have predictive power, the next task is to find which classical time series models coverd in stats531 best fit the data. In addition, it’s known that solar flares and sunpots happen together, we shall analyze whether a sunspot time series helps forecast solar flares.

  • Finally, we want to formally assess the proposed models’ predictive powers with a reserved test dataset and examine their limitations.

2 Data Preparation & Processing

2.1 Preprocessing Note

We use the the GOES 2019 X-ray flux for this project. To prepare the data, we downloaded the raw satellite 2019 data available in this source. Then, we took advantage of a python script script from Vlad Landa’s github to convert the raw data into a csv file. The csv file contains X-ray flux intensities recorded by GOES every 1 minute from 2019-01-01 00:00 to 2019-12-31 23:59.

Since there are too many data points, we aggregated the data, creating one single observation for every 12 hours of 1-minute data. The aggregation was done by taking the 97.5 percentile of every 12 hour of 1-minute data points. After aggregating, our dataset contains \(366 \times 2 = 732\) points. We now justify our method of aggregation. The project goal is to predict solar flares and since solar flares is known to occur less than an hour and the X-ray reading during solar flares is much higher than normal, taking average over \(12 \times 60 = 720\) minutes would dilute the flare intensity. While we can take the max(), this method is brittle against outlier noises.

In addition, we can also convert the raw intensities into flare categories by using the below table (from Stanford Solar Center) to provide another perspective of the data. Finally, since all of the intensities in the dataset are smaller number \(<10^{-6}\), we take log transformation on the time series for numerical stability.

Class Intensity Solar Flares’ Effect on Earth
B \(< 10^{-6}\) Too small to harm Earth.
C \([10^{-6}, 10^{-5})\) Small with few noticeable consequences on Earth.
M \([10^{-5}, 10^{-4})\) Can cause brief radio blackouts that affect Earth’s polar regions and minor radiation storms.
X \(\ge 10^{-4}\) Can trigger planet-wide radio blackouts and long-lasting radiation storms

2.2 Exploratory Analysis

After preprocessing, the data is a time series \(\{ y_1, y_2, \ldots, y_{732} \}\) where \(y_i\) is the top 97.5 percentile of X-ray readings every-12 hour in log scale. We split this data into a train set \(\{y_1, \ldots, y_{718}\}\) and a test set \(\{y_{719}, \ldots, y_{732}\}\). The analysis in Section 3 is based on the train set and the test set is utilized to assess model performance in Section 4.

Observe from the plot, the time series have multiple peaks coincidental to solar flare events. The horizontal lines specify the threshold for C, M and X solar category that may have impacts to the Earth. There are no obvious trends in the X-ray flux time series and class C flares make up most of the flare events.

Train Set X-ray fluxtime series in log scale

Train Set X-ray fluxtime series in log scale

Train Set X-ray intensities (in log scale) histogram

Train Set X-ray intensities (in log scale) histogram

Train Set Solar Flare Caterogy Distribution

Train Set Solar Flare Caterogy Distribution

3 Research Methodology & Results

3.1 Does data have predictive power ?

The first question is whether we can utilize the X-ray time series \(y = (y_1, y_2, \ldots, y_n)\) to predict future solar events. If not then the time series is just white noise or a random walk distribution. To formally analyze, we perform a hypothesis test where \(H_0:\) \(y_1, y_2, \ldots, y_n\) are iid. This can be bone by ploting the acf of y. At significant level \(\alpha = 5\)%, only 1/20 lag should be outside the acceptance region (blue dashed-line). We observe 4/20 lags outside, and therefor reject the null hypothesis that y are white noise and have confidence that it may be possible to predict future events.

On the other hand, a different way to answer is to fit ARMA(p, q) models and see whether ARMA(0, 0) is favored in term of AIC scores. As we illustrate in the next section, models with \(p > 0\) and \(q > 0\) are indeed better than \(p = q = 0\).

3.2 Which model best fits the data ?

Our analysis implies that we can learn from the past data and \(ARMA(p, q)\) models may be good candidates. Auto Regressive Moving Average (ARMA) is formally defined as \(Y_n = \phi_1 Y_{n-1} + \ldots \phi_2Y_{n-p} + \mu + \psi_1 \epsilon_{n-1} + \ldots + \psi_q \epsilon_{n-q}\), or succinctly, \(\phi(B)(Y_n - \mu) = \psi(B)\epsilon_n\) where B is the backshift operator,i.e., \(Y_{n-1} = B Y_{n}\) and

\[\begin{align*} \phi(B) &= 1 + \phi_1 B + \ldots \phi_p B^p \\ \psi(B) &= 1 + \psi_1 B + \ldots \psi_q B^q \\ \end{align*}\]

MA0 MA1 MA2 MA3 MA4 MA5
AR0 2075.84 2027.11 2029.09 2020.55 2022.55 2023.73
AR1 2033.46 2029.10 2029.41 2022.55 2020.37 2025.20
AR2 2024.10 2024.33 2021.31 2015.46 2025.20 2015.56
AR3 2022.56 2022.09 2023.30 2025.29 2009.72 2015.42
AR4 2021.89 2023.80 2025.61 2027.20 2002.05 2010.47
AR5 2023.81 2025.63 2000.45 2001.90 2001.44 2005.87

To select parameters (p, q), we fit different models with \(p,q \in \{1,2,3,4,5\}\) then pick the one with lowest AIC. One thing to note from the below table is that AIC values seems not reliable because of some big jumps of AIC even when number of parameters changes by only 1. Under that scenario, theoretically, AIC score can only move up or down by 2. As the majority of values in the table are in range 2220-2233, we pick only models within this range. The 4 candidates are ARMA(0, 3), ARMA(0, 4), ARMA(1, 3) and ARMA(1, 4) where AIC is 2020.55, 2022.55, 2022.55 and 2020.37 respectively. Inspecting the coefficients, we observe that the intercept is the same for all models but standard deviation is largest for ARMA(0, 4). In addition, ARMA(1,4)’s coefficients of AR1, MA1, MA2 and MA4 are also biggest. It implies that ARMA(1, 4) indeed best fits the X-ray flux train time series.

Intercept SE(Intercept) AR Coef. MA 1 Coef. MA 2 Coef. MA 3 Coef. MA 4 Coef.
ARMA(0, 3) -12.648 0.041 0.271 -0.023 -0.117
ARMA(0, 4) -12.648 0.041 0.272 -0.022 -0.118 -0.002
ARMA(1, 3) -12.648 0.041 0.012 0.259 -0.026 -0.117
ARMA(1, 4) -12.681 0.095 0.993 -0.73 -0.292 -0.09 0.132

3.2.1 95% CI for ARMA(1,4)

As we decided to proceed with ARMA(1,4), the following is the 95% confidence interval for the models’ coefficients.

2.5% 97.5%
Intercept -12.867588 -12.494012
AR1 0.880000 1.008484
MA1 -0.804588 -0.656412
MA2 -0.383244 -0.201356
MA3 -0.181232 0.001832
MA4 0.060940 0.202060

The reported AR1’s 95% CI in the table is produced by the profile log likelihood. Since AR1 parameter is the one component that actually involves the past data, we want to be more precise.

The Fisher approximation CI is \(0.9930 + c(-1.96,1.96) * 0.0079 = c(0.977516, 1.008484)\) and we claim it’s much narrower than the true 95% CI. This can be seen from plotting out the profile log likelihood in the below. In addition, a simulation was carried out to verify that the profile log likehood curve’s shape. Because the the curve around the MLE (the highest peak) is very sharp, an approximate quadratic curve at the MLE would a result a much narrower CI than the true interval.

AR by Profile Likelihood

AR by Profile Likelihood

AR Distribution by Simulation

AR Distribution by Simulation

3.2.2 ARMA(1, 4) GoF

To evaluate ARMA(1, 4) Goodness of Fit, we observe that the roots MA polynomial \(\psi(x)\) are outside of a unit circle and means that the model is causal. However, the root of AR polynomial \(\phi(x)\) is \(r_1 = 1.007095\) barely outside of the unit circle and it’s possible that ARMA(1, 4) is non-invertible. The following plot is to visualize the inverse roots of ARMA polys. As such, for invertible and causal models, the inverse roots should be inside the unit circle.

Plot of Inverse ARMA Roots

Plot of Inverse ARMA Roots

Given that ARMA(1,4) nearly on the verge of invertibility, we need to check the residuals. While there are no obivious residual patterns, residual is normal but not for extreme values and the acf also indicates some correlation. These are signs of the ill-fitness of ARMA(1,4). Moreover, we can also plot out the fitted data and overlay with the original time series. As observed from this figure, the original y is the gray dashed line and the fitted values are red points and lines. The fitted time series doesn’t seem to be able to capture the peaks correctly. These are signs of illness of fit of ARMA(1,4) over this data. As such, in the next section, we want to assess the model’s predictive power on a reserved test set.

ARMA(1,4) Residual Plot

ARMA(1,4) Residual Plot

ARMA(1,4) Fitted Time Series vs X-ray Train Set

ARMA(1,4) Fitted Time Series vs X-ray Train Set

3.3 Is there seasonal effect ?

Many solar activities are periodic. Thus, we’re interested to study whether the X-ray time series exhibit any periodic behaviors. To do so, we examine the data in frequency domain and its periodogram. In the following, the unsmoothed periodgram doesn’t have any obvious peaks. On the other hand, the smoothed periodogram and periodogram estimated by AR(p) have the dominant frequency about \(f =.125\) or period = 8 observations = 4 days.

## [1] "Smoothed peridogram peak is at 0.125"

## [1] "AR(p) by AIC peridogram peak is at 0.12625250501002"

Another way to look for periodic behaviors is to decompose the time series into trend, noise and cycle by using loess smoothing function with different span parameters for high and low frequency. Looking at the cycle, we do see a peak every 8 observations.

3.3.1 SARIMA for seasonality

Since there is a seasonal effect every 8 observations, we consider ARIMA(p, q, P, Q, s). The SARIMA model is defined as \(\phi_p(B) \Phi_P(B^{12})\left( ((1 -B)^d(1 - B^{12})^D)Y_n - \mu \right) = \psi(B) \Psi(B^{12}) \epsilon_n\) where the coefficients of \(\Phi(B^{12}), \Psi(B^{12})\) are the seasonal parameters.

As the X-ray time series has no obvious trend, d = D = 0. We choose p = 1, q = 4 and s = 8 from our analysis . Similar to last section, an AIC table can generated for the seasonal P and Q. The table again has many big jumps. Moreover, when including seasonal parameters, there’s a big risk of overfitting data and the Occam’s Razor principle tells us to select the simplest possible models. Ideally, we would like to stick to the upper left of the AIC table as much possible. Because the AIC scores have huge gaps after AR2-MA0 and AR0-MA2 and across the sub-table starting at AR1-MA1, we ignore these models and pick P = 2 and Q = 0 which is the simplest model with lowest AIC.

MA0 MA1 MA2 MA3 MA4
AR0 2020.37 2019.75 2021.39 1966.94 1968.93
AR1 2022.00 1970.27 1961.76 1969.51 1939.70
AR2 2020.89 1964.79 1961.75 1943.91 1939.18
AR3 1941.92 1943.08 1933.34 1920.02 1921.98
AR4 1943.50 1929.36 1929.18 1921.98 1923.94

3.3.2 SARIMA(1,4,2,0,8) GoF

We go through the same routine as in the last section to produce 95% CI for model parameters and evaluate SARIMA Goodness of Fit. SARIMA(1,4,2,0,8) are invertible and causal but residuals seem to be correlated and the fitted values still miss the peaks of the original time series.

2.5% 97.5%
Intercept -12.747396 -12.551004
AR1 -0.420444 1.437244
MA1 -1.167244 0.690444
MA2 -0.411972 0.110172
MA3 -0.181736 -0.028464
MA4 -0.049512 0.204112
SAR1 -0.000492 0.147292
SAR2 -0.007472 0.142272
Plot of SARIMA Inverse Roots

Plot of SARIMA Inverse Roots

SARIMA(1,4,2,0,8) Residual Plot

SARIMA(1,4,2,0,8) Residual Plot

SARMIA(2,0) fitted values

SARMIA(2,0) fitted values

3.4 Regression against Sunspot Count

The geophysical properties of sunspots and solar flares are intimately connected and thus we want to study whether the daily sunspot counts help improve the solar flare prediction.

3.4.1 Sunspot Count Data

We downloaded the daily sunspot number from this source: http://www.sidc.be/silso/DATA/EISN/EISN_current.csv. There are two things to note about this raw data. 2019 is a leap year yet the sunspot data doesn’t include 02/29, so we added the missing data by simply taking the average of counts in 02/28 and 03/01. In addition, since our X-ray time series has one observation for every 12-hours or 0.5 days, we create the sunspot count every 12-hour time series by replicating one daily-count into two identical halfday-counts.

Moreover, the sunspot time series seems to have a trend which can be seen by employing a loess function with low frequency span. As such, we need to detrend by the Hodrick-Prescott (HP) Filter. HP filter of \((x_1, \ldots, x_n)\) and the time series \((s^*_1, \ldots, s^*_n)\) such that:

\[s^{*}_{1,n} = \text{arg min}_{s} \sum_{i=1}^n (x_i - s_i)^2 + \lambda\sum_{i=1}^n (s_{i+1} - 2s_i + s_{i-1})^2\]

Lastly, we also split the sunspot time series into a train set and a test set. The reserved test set will be discussed in the section about model performance.

Sunspot Count Data

Sunspot Count Data

3.4.2 Regression with Sunspot Count

If it’s possible to predict the X-ray flux intensities from the number of sunspots, there should be correlation between the two time series. As noted from cross correlation plot, at significant level \(\alpha = 5\)%, we expect only 1/20 outside the acceptance reason. We noticed 2 lags outside and conclude that the X-ray and sunspot time series are correlated.

Next, we select p and q of the ARIMA(p,q) regressing against sunspot count by constructing AIC, report 95% CI for relevant parameters and evaluate its goodness of fit as before. X-ray ARMA(1,4) regressing model has the most reasonable AIC. Unfortunately, the regression ARMA(1,4) model has the same problem like ARMA(1,4) and SARIMA(1,4,2,0,8). It could not capture the peaks of the flare intensities

MA0 MA1 MA2 MA3 MA4
AR0 2077.58 2028.66 2030.65 2022.05 2024.05
AR1 2035.12 2030.66 2030.98 2024.05 2021.88
AR2 2025.60 2025.92 2022.74 2024.72 2026.33
AR3 2024.20 2023.60 2024.73 2026.73 2010.91
AR4 2023.37 2025.26 2027.07 2007.18 2030.46
2.5% 97.5%
Intercept -12.697580 -12.666220
AR1 0.918516 1.067084
MA1 -0.819748 -0.638252
MA2 -0.385036 -0.202364
MA3 -0.160660 -0.019540
MA4 -0.055084 0.318884
Xreg -0.020068 0.041868
Plot of ARMA Regression Inverse Roots

Plot of ARMA Regression Inverse Roots

ARMA Regression Residual Plot

ARMA Regression Residual Plot

Regression ARIMA Fitted Values

Regression ARIMA Fitted Values

3.5 Section Conclusion

In summary, we conclude the following from our research investigation:

  1. The data indeed possess predictive power. This is supported by inspecting the autocorrelation of the time series which revealed significant correlation within the time series. It can also be seen by the fact that the AIC value of ARMA(0, 0) is unfavorable compared to ARMA(\(p \ne 0, q \ne 0\)) models.

  2. Among ARMA(p,q), we find that ARMA(1, 4) fits the data best in term of AIC. In addition, a seasonal effect every 8 observations can be observed from a smoothed periodogram. SARIMA(1,4,2,0,8) is selected by AIC to model seasonality. However, we remark that the SARIMA model may be overparameterized and we cautiously proceed with both ARMA and SARIMA in the next section.

  3. We also create the sunspot count every half-day time series and study whether it helps with predicting X-ray intensities. The cross correlation is significant between the two time series, X-ray flux and sunspot count. Following a similar routine, Regression ARMA(1, 4) is chosen to be the most suitable regression model.

  4. While each of the three models seems acceptable in term of the polynomial roots and residual diagnosis, their fitted values miss the high peaks of original time series data. We suspect it is because the sharp increase in intensity during a solar flare can not be captured adequately with linear relationships among past observations. The linear relationships are however the cornerstone of ARIMA models.

In the next section, we assess the fitted models’ actual predictive power over the reserved test dataset.

4 Model Predictive Power Assessment

4.1 Test Set Data

As mentioned in the Data section, we reserved a test set for this section. The test set size is 14 half-day data points. The reason for a relatively small test set is that state of the art solar flare models in the literature can not predict accurately more than a few days. The test set is also selected so that it includes a flare event of class M. The X-ray flux time series and flare category plots are shown in the below.

Test Set Flare Cateogry Distribution

Test Set Flare Cateogry Distribution

4.2 Model Performance

Given the fiited models from previous sections, in R, we can use forecast package to generate the point estimate and 95% CI prediction for the next 14 lags \(\{y_1^*, \ldots, y_{14}^*\}\). The below 4 plots show the forecast of fitted White Noise, ARMA(1, 4), SARIMA(1, 4, 2, 0, 8) and Sunspot Regression ARMA(1, 4) in black and the ground truth test data series in red. Visually, the predicted point estimates don’t seem to track the test set time series well.

Mean Estimated and 95% CI Forecast (black) with X-ray true test (red)

Mean Estimated and 95% CI Forecast (black) with X-ray true test (red)

A more quantitative way to assess model performance is the root mean squared error (RMSE). Given the true intensities \(\{y^0_1, \ldots, y^0_n\}\) and the predicted \(\{y_1^*, \ldots, y^*_n \}\), RMSE is defined as

\[\text{rmse} = \sqrt{\cfrac{1}{n}\cdot \sum_{i=1}^n(y^0_i- y^*_i)^2}\]

White Nose ARMA(1,4) SARIMA(2,0) Reg ARMA(1,4)
rmse 1.100761 1.049367 1.134904 1.063858

From the table, SARIMA(1,4,2,0,8) has the biggest RMSE, worse than white noise. This could be due to overfitting. ARMA(1,4) has the lowest rmse and surprisingly regression against number of sunspot counts doesn’t improve the error though still do better than white noise.

RMSE is a good metric. Nevertheless, it doesn’t offer much room for interpretation. A more intuitive way to judge the performance is to convert the raw forecast intensities into flare categories and calculate the prediction accuracy. Let \(\{z^0_1, \ldots, z^0_n\}\) be true category labels and \(\{z^*_1, \ldots, z^*_n\)} predicted ones. Accuracy score = \(\cfrac{1}{n}\cdot\sum_{i=1}^n \mathbb{1}(z^0_i = z^*_i)\).

White Nose ARMA(1,4) SARIMA(2,0) Reg ARMA(1,4
Accuracy Score 0.5714286 0.5714286 0.5714286 0.5714286

At the first look, since there are 3 categories (B, C, M) in the test set, an accuracy score of 57.14 % seems like a fine score. However, by noticing that majority of the test set is C class, and so this score might not be that good. We see an obvious problems by printing out the prediction vector of all models. They all just predicted ‘C’.

1 2 3 4 5 6 7 8 9 10 11 12 13 14
White Nose C C C C C C C C C C C C C C
ARMA(1,4) C C C C C C C C C C C C C C
SARIMA(2,0) C C C C C C C C C C C C C C
Reg ARMA(1,4 C C C C C C C C C C C C C C

This problem can be quantitatively formulated in term of the Sensitivity, Specificity and F1 scores which are defined in the following link https://en.wikipedia.org/wiki/F-score. A detailed analysis of these scores will not add much value to the current discussion and we shall only remark the intuitive conclusion from this table that by completely focusing on one metric (Sensitivity/ Specificity) and ignoring the other, the overall balanced performance (f1-score) suffers.

Sensitivity Specificity F1
Class: B 0 1 NA
Class: C 1 0 0.7272727
Class: M 0 1 NA

In summary, the result of this section means that even though ARIMA(1,4) and Sunspot Regression ARMA(1,4) perform better than the White Noise model in term of RMSE, they are not better in term of actual solar flare category classification prediction. This defection is probably linked to the fact that all of the models couldn’t capture the peaks of the X-ray flux time series. In case of SARIMA(1,4,2,0,8), its RMSE is worse than White Noise, mostly due to overparameterization.

5 Conclusion Summary & Discussion

To summarize our findings in Section 3 and 4,

  1. We find that there is correlation within the X-ray flux time series and that implies that it’s possible to use the past data to predict the future solar flares. In addition, a seasonality every 8 observations is discovered by the smoothed periodogram.

  2. ARMA(1, 4) and SARIMA(1,4,2,0,8) are the best fitted models in term of AIC. In addition, the cross correlation of the daily number of sunspots data and X-ray flux is significant and so we took advantage of this fact by fitting a Regression ARMA(1, 4) model.

  3. While ARMA(1, 4) and Regression ARMA(1, 4) do better than White Noise in term of RMSE. For practical purpose of classifying future Solar Flare category, they didn’t do better than White Noise model. The SARIMA model in fact performed worse than white noise in term of RMSE which might be due to overparameterization.

The bad performance is likely linked to the fact that none of the fitted models could correctly capture the peaks of the data. We suspect that the reason is that the sharp increase in intensities during solar flares can not be modeled accurately with only linear relationship among observations. As a consequence, a future work of exploring time series based on non-linear relationships is a reasonable direction to explore.

Reference:

[1] Edward Ionides.Stats 531 lecture notes and class material.https://ionides.github.io/531w21/

[2] Robert H., David S. Stoffer.Time Series Analysis and Its Applications: With R Examples. Springer, 3rd ed, 2011.

[3] Chen et al, Identifying Solar Flare Precursors Using Time Series of SDO/HMI Images and SHARP Parameters, AGU, 2019

[4] Vlad Landa et al, Low Dimensional Convolutional Neural Network For Solar Flares GOES Time Series Classification, arXiv, 2021.

[4] Sunspots and Solar Flares, https://spaceplace.nasa.gov/solar-activity/en/

[5] What are the different types, or classes, of flares?, http://solar-center.stanford.edu/SID/activities/flare.html

[6] 2019 GOES X-ray flux data, https://satdat.ngdc.noaa.gov/sem/goes/data/avg/2019/

[7] 2019 Sunpot Count data, http://www.sidc.be/silso/datafiles