Image captured by the Daniel K. Inouye Solar Telescope.

Check out NASA’s sunspot live feed here. And the solar cycle progression from the Space Weather Prediction Center.

1 Introduction

The term ‘sunspots’ refers to areas on the Sun’s surface with significantly stronger magnetic fields than others. Visually they manifest as areas visibly darker than the rest of the solar surface. Most often they have an inner, darker part called the umbra surrounded by a lighter penumbra region. They most commonly appear in pairs with magnetic fields with opposite polarities [1]. Though the extent of causality is often difficult to definitively state there are a number of reasons that the monitoring and forecasting of sunspots could have significant positive ramifications.

Coronal Mass Ejections (CMEs) are large clouds of plasma and magnetic fields that, upon hitting the Earth, cause heavy fluctuations in the planet’s magnetic fields [2]. Solar flares are intense bursts of radiation that can affect the Earth’s atmosphere. The occurrence of both of these phenomena are relatively strongly correlated with the prevalence of sunspots. Solar flares are known to disrupt communications and systems that rely on radio waves such as GPS. CMEs likewise are known to, through their effect on Earth’s magnetic field, affect the efficacy and durability of satellite technology [3].

The effects of both these solar events can have potentially significant effects on satellite, radio and many other technologies. As a dramatic example, the Carrington event of 1859 was a period of unusually high solar activity, mirrored by a high number of sunspots [4]. The event caused telegraph systems around the world go haywire and make the aurora borealis visible in even some tropical areas (as opposed to being confined to polar regions like usual). If scientists can predict high periods of sunspot activity they can better take measures to preserve their satellite technology and mitigate the effects on an ever widening array of technologies used in the modern world.

Additionally, it is generally agreed that sunspot prevalence, through associated solar activity levels, plays a role in earth’s climate [5]. A dramatic example is that of the Maunder minimum. The term refers to the period from 1645 to 1715 where sunspots were exceedingly rare. It also happened to coincide significantly with the ‘Little Ice Age’ experienced by the earth in the 17th C [1]. How exactly and to what extent the effect is causal is still a topic of debate, however. Research of late has suggested that the effect on climate change depends strongly on the time scale in question. The causal effects on climate seem to be very small on the 11-year solar period but increase in magnitude sharply as one takes into account much longer ‘secular cycles’ in sunspot activity (which have a period of 100 to more than 200 years) [5]. There seems to be no indication that the Earth is heading to a new Maunder minimum so understanding the long-term trend of solar activity could be crucial in planning and simulating the effects of impending climate change.

It’s evident that a robust understanding of space weather is of extreme importance given our world’s increasing reliance on modern technologies. Moreover, space is a fascinating topic. That said, there is no shortage of scientific research on applying different statistical modeling techniques to this particular data set. Sunspots and related solar activities have been studied using various techniques including, multifractal analysis, correlation analysis, wavelet transforms, deep neural networks, autoregression, and much more.

Despite the varying modeling techniques, there is one phenomenon that all researchers agree on - the sun exhibits an approximate 11 year solar cycle. The solar cycle is indicated by the frequency and intensity of visible sunspots, and have proven to be quite difficult to predict [6]. In this study, we will employ various classical time series methods in attempt to capture the implicit behavior in sunspot activity.

1.1 Data

Our data includes 3300 (as of February 15, 2024) observations of monthly mean relative sunspot numbers (count) from 1749 to present. The data was collected by the Swiss Federal Observatory, Zurich until 1960, then the Tokyo Astronomical Observatory. The data is readily available at the Solar Influences Data Analysis Center’s (SIDC) website. A description of the data set provided by the SIDC is given below:

Column 1-2: Gregorian Calender Date
Column 3: Date in fraction of year for the middle of the corresponding month
Column 4: Monthly mean total sunspot number
Column 5: Monthly mean standard deviation of the input sunspot numbers from individual stations.
Column 6: Number of observations used to compute the monthly mean total sunspot number.
Column 7: Definitive/provisional marker. A blank indicates that the value is definitive. A ’*’ symbol indicates that the monthly value is still provisional and is subject to a possible revision (Usually the last 3 to 6 months).

This report will focus on uni-variate analysis, and thus we will only use column 4, monthly mean total sunspot number.

Table 1.1: Monthly mean sunspot data
Year	Month	date.franction	monthly.mean	monthly.sd	monthly.obs	prov.marker
1749	2	1749.123	104.3	-1	-1	1
1749	3	1749.204	116.7	-1	-1	1
1749	4	1749.288	92.8	-1	-1	1
1749	5	1749.371	141.7	-1	-1	1
1749	6	1749.455	139.2	-1	-1	1

2 Analysis

2.1 Exploratory Analysis

As one does in time series modeling, we start off by simply plotting the data.

Figure 2.1: Time series plots of full data and tail subset

From the top plot in Figure 2.1, there is no obvious trend. However, the sharp peaks seem to be regular indicating there might be some form seasonality. It’s difficult to tell if the data is stationary. For a time series to be stationary, it must possess properties that do not depend on the time at which the series is observed [7]. So, a series of data points with trends or seasonality are not stationary, but a series with cyclic behavior (but no trend or seasonality) is stationary [7]. We zoom into the data in the bottom plot in Figure 2.1 to get a better idea of what’s going on.

Over short term periods, it may appear that there are linear trends. However, looking back at the full data, it becomes apparent that these are elements of a much broader cyclical trend. There are many successive points indicating auto-regression, however the strength of the relationship changes over time with a stronger relationship occurring at the trough of the cycle. This shows an obvious change in variance. Furthermore, the bottom plot highlights the general consensus of an approximate 11 year solar cycle.

Figure 2.2: ACF of full data at lags = 250

The sample autocorrelation plot in Figure 2.2 also provide us with some insight. 250 lags displays a patterned behavior in which a decaying sin wave becomes obvious, suggesting, again, that future values of the series are correlated with past values. There is autocorrelation of 0.5 at approximately lag 11 (132/12moth re-scaled), supplying further evidence of the solar cycle. Ultimately, this process appears to be non-stationary and seasonal.

First, we’ll seasonally difference the data, shown in Figure 2.3. Although it is not obvious, this appears to be stationary, so we won’t take an additional first difference.

Figure 2.3: Seasonally differenced full data with ACF and PACF

Later in the report, we’ll use the ACF and PACF shown in Figure 2.3 to estimate the appropriate model parameters to fit to our series.

2.2 Spectral Analysis

Over the course of 234 years, we observe 22 peaks, approximately equally spaced. This hints us there may be a period of 10.6 years.

To investigate the periodical behavior of the time series data, we apply Fourier transform on the data to obtain signals in the frequency domain. This involves estimating the spectral density function of our time series data.

Let \(\gamma_h\) be the stationary autocovariance function of our time series data. The spectral density function is given by \[\lambda(\omega)=\sum_{h=-\infty}^\infty \gamma_he^{-2\pi i\omega h},\] where \(\omega\) is frequency [8]. Euler’s formula gives a sinusoidal expression of the exponentials [8]. We write \[e^{2\pi i\omega h}=\cos(2\pi\omega h)+i\sin(2\pi\omega h).\] The sine and cosine terms give a Fourier basis for expressing the time series as a weighted sum of these sinusoidal bases. We are interested in the weights of the bases, that is the frequency components [8]. We wish to estimate the spectral density function and identify the dominant frequency (that is the frequency where the frequency component is the largest). We apply the three methods shown in the figures below to estimate the spectral density function.

We use R to compute the periodogram on the unsmoothed raw data directly and will use the formula, \(\frac{1}{\omega\times12}\) where \(\omega\) = frequency to translate the values from the frequency domain back to the time domain. Note that we multiply \(\omega\) by 12 since we are dealing with monthly data. We first compute the dominant frequency an unsmoothed method (shown in Figure 2.4), which is 0.0917 cycles per year, that is 10.909 years per cycle. This is close to our first observation. However, the unsmoothed periodogram doesn’t provide us much visual clue on the data.

Now, we compute the periodogram on smoothed data. One way of smoothing the data is to use the default smoother provided by spec.pgram. This smooth the periodogram with modified Daniell smoothers [9]. We set the window sizes to be 15 and 15, based on the previous analysis on the approximate cycle length of the data. From the plot in Figure 2.5, we find the dominant frequency is 0.0917 cycles per year and equivalently 10.90 years per cycle.

Beside the non-parametric models based on periodograms, we can also fit an autoregressive model to estimate the spectral density function. An autoregressive model with order \(p\) (\(AR(p)\) for short) is written as,

\[Y_n=\sum_{i=1}^p\varphi_i Y_{n-i}+\epsilon_n,\]

where \(Y_{1:n}\) are random variables representing the AR(\(p\)) model, \(\varphi_{1:p}\) are parameters of the model and \(\epsilon_{1:n}\) is a white noise process of independent and identically distributed normal random variables [10]. The order \(p\) itself is also a parameter and is determined using Akaike Information Criterion (AIC) [11]. We don’t go into the details of AIC here and we refer the readers to [11] or later in the report for more details. We run the following code to fit an \(AR(p)\) model, where the parameter \(p\) is picked based on AIC by R.

We observe the \(AR(p)\) model, shown in Figure 2.6, has a dominant frequency of 0.0841 cycles per year, or 11.88 years per cycle.

All three models give a period of around 11 years, which confirm our observation and also coincide with our prior knowledge of sunspot cycles.

2.2.1 Unsmoothed Data

Figure 2.4: Unsoothed periodogram

2.2.2 Smoothed Data

Figure 2.5: Smoothed periodogram

2.2.3 Estimating SDF with AR model

Figure 2.6: Spectral Density selected by AR process

3 Model Selection

3.1 SARMA

A seasonal ARMA model (SARMA) is construed by including additional season terms in a ARMA models [12]. It is defined as follows: \[SARMA(p,q)\times(P,Q)_m\] Where m = number of observations per year. For example, a general \(SARMA(p,q)\times(P,Q)_{12}\) model for monthly data is \[\phi(B)\Phi(B^{12})(Y_n-\mu)=\psi(B)\Psi(B^{12})\epsilon_n,\] where \(\epsilon_n\) is a white noise process and

\[ \begin{aligned} \mu = \exp && \\ \phi(x) = 1-\phi_1x-\dots-\phi_px^p,&& \\ \psi(x) = 1+\psi_1x+\dots+\psi_qx^q,&& \\ \Phi(x) = 1-\Phi_1x-\dots-\Phi_Px^P,&& \\ \Psi(x) = 1+\Psi_1x+\dots+\Psi_Qx^Q.&& \\ \end{aligned} \]

The seasonal part of an AR or MA model can be discovered through the seasonal lags of the PACF and ACF [7]. We will consider the ACF and PACF from Figure 2.2 to estimate values for \(p,q,P,Q\). Then we will select a second model using a more algorithmic approach called grid search and compare the performance of both. Grid search is used to find optimal hyper parameters of a model which results in the most accurate fitting.

There are spikes in the PACF at lags 12, 24 and 26, but nothing similar in the ACF. This could be indicative of a seasonal AR(2) term. In terms of non-seasonal lags, the PACF shows numerous significant spikes, suggesting a high, complex \(AR(p)\) term. Lastly, the decaying wave pattern in the ACF isn’t suggestive of any straightforward model.

3.2 Grid Search

Our eye-balling analysis suggests that an \(SARIMA(3,0,0)\times(2,0,0)_{12}\) (Model 1) might be a good fit. We’ll fit this along with several other models using an algorithmic approach called grid search.

Of these models (results shown in tables below), we select the one with the lowest Akaike’s information criterion (AIC). AIC is essentially “minus twice the maximized log likelihood plus twice the number of parameters [11],” and is defined by:

\[AIC=-2\times\ell(\theta^*)+2D\]

The grid search approach chose the \(ARIMA(1,0,2)\times(1,0,0)_{12}\) (Model 2) model with the lowest AIC (i.e., the best). For our final model, we will incorporate findings from the exploratory analysis, spectral analysis, and knowledge from related readings. When fitting a seasonal component or SARIMA model, it is often times done so with some prior knowledge. An example of this prior is the 11-year solar cycle. This cycle is easily noticeable when plotting the data on a short term, and confirmed by our spectral analysis. That being said, we will now fit a simple \(AR(2)\) (Model 3) model with an external regressor to capture the 11-year cycle. This model is defined by,

\[y_t=\beta_0 + \beta \cos(2\pi t\times132)+\epsilon_t\] where \(\epsilon_t\) is the \(AR(2)\) process.

3.2.1 Table 1

Table 3.1: P=1, Q=0
	MA0	MA1	MA2	MA3
AR0	34725.37	33183.45	32492.11	32173.24
AR1	31075.33	30665.93	30617.80	30619.78
AR2	30824.40	30622.40	30619.79	30618.77
AR3	30703.01	30617.17	30625.55	30617.23

3.2.2 Table 2

Table 3.2: P=1, Q=1
	MA0	MA1	MA2	MA3
AR0	34531.85	33153.12	32488.93	32175.15
AR1	31070.80	30667.11	30618.38	30620.33
AR2	30823.64	30622.90	30620.35	30619.12
AR3	30703.85	30617.78	30626.85	30617.66

3.2.3 Table 3

Table 3.3: P=0, Q=1
	MA0	MA1	MA2	MA3
AR0	35348.09	33542.57	32659.92	32263.74
AR1	31074.55	30665.66	30617.24	30619.22
AR2	30824.14	30621.83	30619.23	30618.13
AR3	30702.84	30616.61	30625.47	30616.57

4 Diagnostics

4.1 Residual Check

A good way to check if we are overfitting the data or leaving out useful information is by checking the residuals. A residual is simply the difference between an observation and an estimated value:

\[e_t=y_t-\hat{y_t}\] A model that has adequately captured the information in the data will produced residuals with the following properties:

Residuals are uncorrelated
Residuals have mean zero
Residuals have constant variance
Residuals are normally distributed [7].

We will use the checkresiduals() function to help us determine if these properties are satisfied.

Using the visualizations and summary outputs from Figures (4.1, 4.2, 4.3) it’s obvious we haven’t fit the data as well as one would hope. Looking at the ACFs on the bottom right hand side, each model has several significant spikes, telling us there is correlation in the residual series. The histograms in the bottom right show approximate normality, however all three models display a concerning hint of right skewness. Lastly, we have the results from the Ljung-Box test. The Ljung-Box test is a method of testing for the lack of serial correlation [13]; \(H_0\) the model appears good, \(H_A\) the model shows lack of it. Yet again, with a p-values for all three models significant at the \(\alpha=0.05\) level, we receive another residual violation.

4.1.1 Model 1

Figure 4.1: Residual fheck for model 1

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 174.71, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

4.1.2 Model 3

Figure 4.2: Residual check for model 2

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 134.95, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

4.1.3 Model 3

Figure 4.3: Residual Check for Model 3

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 241.98, df = 24, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 24

5 Conclusion and Future extensions

In this report, we deployed a number of classical time series approaches to “re-discover” a known scientific fact. Despite the big ending being spoiled, it’s beneficial to go through steps and gain a robust understanding of the process - kind of like an episode of Columbo. This analysis showed that modeling something that is seeming regular isn’t always straight forward. There’s, without question, a lot of room to improve on our models. It might be useful to take a more in-depth look at the spectrum in higher frequencies, or, perhaps, there are additional cycles that we overlooked.

There are numerous avenues one could take exploring this data set. It would be interesting to see how more modern machine/deep learning methods perform in comparison to classical time series approaches, especially in terms of forecasting. The 11-year solar cycle is indeed glaring, however perhaps it’s so bright that it’s caused researchers to overlook hidden seasonal trends? After becoming comfortable with modeling that data in the long-run it would be a good extra challenge to take a stab at the unpredictable variation in the short term. Developing a robust understanding of the variation in the seasonal peaks of solar activity would give scientists heightened abilities to mitigate the effects of harmful solar events.

6 History

To save readers time, we decided to add this to the end of the report. However, some of you might enjoy this!

Since ancient times dark spots on the surface of the Sun have been observed by cultures around the world. The earliest written records of these blemishes, known as ‘sunspots’ date back to ancient China, at least as early as 364 B.C. Arab, Armenian, Russian and other texts from soon after the 4th Century B.C all make reference to dark formations on the Sun’s surface. In medieval Europe, the first records of observations of these formations in medieval Europe are found in chronicles dated 1128 A.D.[14].

In 1608 the invention (or at least patenting) of the telescope in Holland changed the face of astronomy together, including the study of sunspots. By 1609 Thomas Harriot had made detailed sketches of the sunspots on the surface of the Sun. By 1610 Johann Fabricius had published the first paper detailing the appearance of sunspots (by tracking the position of particular sunspots he was also able to deduce that the Sun rotated on its axis with a period of about 30 days which turned out to be very accurate) [15]. Soon after their discovery a number of astronomers across Europe diligently took notes on the formation of sunspots. However, by the middle of the 18th Century, interest had waned and so the number of observations reached a nadir before picking up later in that century.

This history makes sunspots the earliest recorded feature of the solar surface. Until 1776 when Christian Horrebu suggested periodicity in the presence of sunspots, scientists were of the opinion that their appearance was completely stochastic. In 1843, Heinrich Schwabe showed that the number of observed sunspots had a period of approximately 10 years [[14]. In 1852 Rudolph Wolff in Zurich started the first permanent record keeping of sunspot numbers. He also introduced the idea of a daily ‘relative’ sunspot number (a.k. the ‘Wolf number’) and calculated a solar period of 11.1 years (an improvement from Horrebu’s calculation).

From Wolff’s work, records of sunspot observations from around 1740 onwards are considered to be quite reliable. In 2011 researchers started a project to reconstruct sunspot data from as far back as 1610, however these still have significant issues that cause worries about reliability [14]. In 2015 the SILSO (Sunspot Index and Long-term Solar Observations International Data Center [16]) adopted a formula for a new Wolf number (Version 2.0). It is also known that the maximum Wolf number of a solar cycle is directly (inversely) proportional to the minimum Wolf number of the previous cycle (the duration of the previous cycle), significant at \(1\%\) significance levels [14].

Refernces

Hoyt, D.V. (1979). Variations in sunspot structure and climate. Climatic Change 2, 79–92.

Reames, D.V. (2013). The two sources of solar energetic particles. Space Science Reviews 175, 53–92.

Radasky, W.A. (2011). Overview of the impact of intense geomagnetic storms on the u.s. High voltage power grid. In 2011 IEEE international symposium on electromagnetic compatibility, pp. 300–305. 10.1109/ISEMC.2011.6038326.

Hayakawa, H., Ebihara, Y., Willis, D.M., Toriumi, S., Iju, T., Hattori, K., Wild, M.N., Oliveira, D.M., Ermolli, I., Ribeiro, J.R., et al. (2019). Temporal and spatial evolutions of a large sunspot group and great auroral storms around the carrington event in 1859. Space Weather 17, 1553–1569.

Nagovitsyn, Y.A. (2014). Specific features in the effect of solar activity on the earth’s climate changes. Geomagnetism and Aeronomy 54, 1010–1013.

Dobrijevic, D. Solar cycle: What is it and why does it matter?

Hyndman, R.J., and Athanasopoulos, G. (2018). Forecasting: Principles and practice 2nd ed. (Melbourne, Australia: OTexts).

Ionides, E.L. DATASCI/STATS 531 (winter 2024) “modeling and analysis of time series data” chaterp 8 lecture slides.

R Core Team (2023). R: A language and environment for statistical computing (Vienna, Austria: R Foundation for Statistical Computing).

10.

Ionides, E.L. DATASCI/STATS 531 (winter 2024) “modeling and analysis of time series data” chaterp 3 lecture slides.

11.

Ionides, E.L. DATASCI/STATS 531 (winter 2024) “modeling and analysis of time series data” chaterp 5 lecture slides.

12.

Ionides, E.L. DATASCI/STATS 531 (winter 2024) “modeling and analysis of time series data” chaterp 5 lecture slides.

13.

Ljung box test: definition.

14.

Vasiljeva, I., and Pishkalo, M. (2021). History of sunspot research and forecast of the maximum of solar cycle 25. Kinematics and Physics of Celestial Bodies 37, 200–211.

15.

Casanovas, J. (1997). Early observations of sunspots: Scheiner and galileo. In 1st advances in solar physics euroconference. Advances in physics of sunspots, p. 3.

16.

SILSO World Data Center (1740-2023). The international sunspot number. International Sunspot Number Monthly Bulletin and online catalogue.

Midterm Project

2024-02-23

1 Introduction

1.1 Data

2 Analysis

2.1 Exploratory Analysis

2.2 Spectral Analysis

2.2.1 Unsmoothed Data

2.2.2 Smoothed Data

2.2.3 Estimating SDF with AR model

3 Model Selection

3.1 SARMA

3.2 Grid Search

3.2.1 Table 1

3.2.2 Table 2

3.2.3 Table 3

4 Diagnostics

4.1 Residual Check

4.1.1 Model 1

4.1.2 Model 3

4.1.3 Model 3

5 Conclusion and Future extensions

6 History

Refernces