Introduction

The dataset here is the monthly sunspot dataset which describes a avaerage monthly count of the number of observed sunspots for just over 230 years[1]. The sources of the dataset is credited to Adrews & Herzberg (1985). The units are a count and there are 2820 observations in total. The unit for time sale is in month and the value is the average number of observed sunspots of each day for that month.[2]

Since the data contains over 1000 time points, I am going to mainly focused on the last 408 time points which is from Jan-1950 to Dec-1983. Here, I want to fit an appropriate model to perform some forecast work. We also have the real data from 1983 to now and we can compare to see the results

Exploratory Analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.20   26.38   63.20   77.04  113.55  253.80

This is the basic summary of the average number of sunspot in month. We can see the data is spreaded out from minimum 0.2 to maximum 253.8.

From this time series, we can clearly see there are some seasonality for the data. The tread here seems stationary but need further examination to make sure.

This is the smoothed periodogram for the data, we can see there is a large peak at frequency about 0.007, which corresponds to a cycle of 142 months which is about 11 years. According to wekipedia page about sunspot, it says “number varies according to the approximately 11-year solar cycle”, which is consistent to this frequency.[3] Although there are other peaks at larger frequency, they are much smaller than the large one.

Model Selection

ARMA Model

First under the assumption that there is no trend, let fits a simple gaussian ARMA(p, q) model to check fitness. The model is \[ \phi(B)(Y_n-\mu) = \psi(B)\epsilon_n \] where \[ \begin{align} \mu &= \mathbb{E}[Y_n] \\ \phi(x) &= 1-\phi_1 x-\cdots-\phi_p x^p \\ \psi(x)&= 1+\psi_1x + \cdots +\psi_qx^q \\ \epsilon_n &\sim \text{i.i.d. } N[0, \sigma^2] \end{align} \]

Then, by using the AIC value to find the proper value for p and q where minimize \[ AIC = -2 *l(\theta) + 2D \] where \(l(\theta)\) is the log likelihood and \(D\) is the number of parameters

Here is the table of some AIC values for different p and q

##          MA0      MA1      MA2      MA3      MA4      MA5
## AR0 4454.887 4076.280 3920.832 3818.672 3765.943 3708.110
## AR1 3571.186 3530.262 3526.337 3521.267 3523.156 3524.601
## AR2 3548.773 3528.965 3523.274 3523.081 3525.113 3526.027
## AR3 3521.921 3522.554 3524.714 3494.522 3495.737 3520.918
## AR4 3522.655 3524.533 3525.494 3526.132 3521.578 3523.387

We can see that ARMA(3, 3) is the model with lowest AIC value.

Then try fit it and make some disgnose plot

## Series: sun$Sunspots 
## ARIMA(3,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ar3      ma1      ma2     ma3     mean
##       1.6973  -0.4114  -0.2890  -1.1454  -0.2300  0.3929  75.5019
## s.e.  0.2477   0.4920   0.2448   0.2356   0.3496  0.1236   5.0574
## 
## sigma^2 estimated as 297.5:  log likelihood=-1739.26
## AIC=3494.52   AICc=3494.88   BIC=3526.61

Here is the coefficients of the model and their standard errors

From the graph of fitted and original graph, we can see overall, the model fits the data well, but it tends to under estimate the peaks. The following are disgnostics plot

From these plots, we can see the residuals are fairly random, residuals seems to have a zero mean and constant variance. The ACF plot shows no corralation for different lags. The normal approximately holds despite a little skewed to the left.

SARIMA Model

According to the notes, a general \(SARIMA(p, d, q) * (P, D, Q)_{k}\) is \[ \phi(B)\Phi(B^{k})((1-B)^d(1-B^{12})^D(Y_n-\mu))=\psi(B)\Psi(B^{k})\epsilon_n \] where \(\{\epsilon_n\}\) is a gaussian white noise process and \[ \begin{align} \mu &= \mathbb{E}[Y_n] \\ \phi(x) &= 1-\phi_1 x-\cdots-\phi_p x^p \\ \psi(x)&= 1+\psi_1x + \cdots +\psi_qx^q \\ \Phi (x) &= 1-\Phi_1x-\cdots-\Phi_Px^P \\ \Psi(x) &= 1+\Psi_1x+\cdots+\Psi_Qx^Q \end{align} \] From the periodogram which suggests the dominant frequency to be 0.007, thus, we would try to use the 142 month as the cycle. Here we use difference at one lag since applying a difference operation to the data can make it look more stationary. We also choose the p and q by constructing the AIC table

##          MA0      MA1      MA2      MA3      MA4      MA5
## AR0 2819.311 2637.882 2588.690 2547.174 2529.981 2494.887
## AR1 2472.402 2435.177 2433.377 2430.451 2431.951 2433.950
## AR2 2455.478 2434.881 2430.209 2431.647 2433.405 2435.355
## AR3 2431.953 2431.234 2431.817 2433.523 2435.395 2430.719
## AR4 2431.831 2433.232 2433.421 2434.928 2431.542 2431.423

From the table, we can see the (p, q) = (2, 2) gives us the lowest AIC value. Then, we would fit this model and examine its wellness of fit

Here are the coefficients and standard error for the model

## Series: sun$Sunspots 
## ARIMA(2,0,2)(0,1,0)[142] 
## 
## Coefficients:
##          ar1     ar2     ma1      ma2
##       0.3748  0.5845  0.1674  -0.4532
## s.e.  0.1474  0.1412  0.1374   0.0688
## 
## sigma^2 estimated as 527.9:  log likelihood=-1210.1
## AIC=2430.21   AICc=2430.44   BIC=2448.13

We can also draw the fitted vs original line here

From the graph, we can see that the fitted line fit the graph pretty good at the start, then it starts to deviate a lot from the original one, it overestimate the peaks. This may suggest the problem of overfitting of the model.

Then, we draw the diagostic plots for this model

From the graphs, we can see that the ACF and Normal QQ-plot looks very similar to those to ARMA(3, 3), however, the residuals have some zeors at earlier time and have relatively large residuals at later time, which suggest the problem of overfitting.

Comparing to real data

Here, I would use the ARMA(3, 3) to do the prediction since the SARIMA model is sufferring from overfitting. Here I only use the first 400 points after 1983.

Draw the graph of prediction vs real

From the graph, we can see that the prediction indeed capture the upadn down trend, but it always underestimate the peaks.

conclusion

In this project, we analyse the monthly sunspot count from 1950 to 1983. After some analysis, we choose the model to be ARMA(3, 3), which fits the data pretty close. When doing prediction, the model tends to underestimate the peaks, but overall, it captures the trend relatively well. To further imporve the model, we can try different SARIMA model or incorporate more data points when fitting the model since here I only used 409 points to fit the model.

Reference

[1] https://machinelearningmastery.com/time-series-datasets-for-machine-learning/

[2] http://www.sidc.be/silso/datafiles

[3] https://en.wikipedia.org/wiki/Sunspot

[4] https://ionides.github.io/531w18/midterm_project/project3/MidtermPr oj.html

[5] https://ionides.github.io/531w20/notes