The scientific question motivating my work is how to predict the weekly maximum solar radiation in San Francisco using time series, and based on our prediction models, what will be the maximum solar radiation in San Francisco this week? San Francisco locates at Northern California with latitude 37° N and longitude 122° W, which is famous of its plenty of sunshine throughout the year. Many factors with seasonal effects or long-term trends may influence the change of solar radiation. Based on the weekly maximum solar radiation data at San Francisco International Airport from January 2001 to December 2010, we aim to fit applicable time series models and predict the future maximum solar radiation.
I got my original data from National Solar Radiation Data Base 1991-2010 Update provided by website: http://rredc.nrel.gov/solar/old_data/nsrdb/1991-2010/hourly/list_by_state.html. Site file 724940 provides various hourly solar radiation data at San Francisco International Airport from January 1st 1991 to December 31st 2010 based on different measure models and directions. According to National Solar Radiation Database 1991-2010 Update: User’s Manual, I choose METSTAT-modeled global horizontal solar radiation (data in column P), which is total amount of direct and diffuse solar radiation measured by Meteorological-Statistical solar model received on a horizontal surface. The unit of the data is “\(Wh/m^2\)”. I want to fit models with most recent data, so I only use hourly solar radiation from Jan 1st 2001 to Dec 31st 2010, totally more than 80000 observations. In order to decrease the number of observations and fit models effectively, I decide to use weekly max solar radiation during these ten years. I achieve this by two steps. First, I use ten loops for data in year 2001 to year 2010 respectively to get daily max solar radiation totally 3652 observations. Second, I get weekly maxima from daily maxima. In the US, a week starts on Sunday ends on Saturday. However, Jan 1st 2001 is Monday, and Dec 31st 2010 is Friday. This means the first and the last week both have six days. I use the maximum among the first six observations of daily max solar radiation as the weekly maximum for the first week. I use the maximum among the last six observations of daily max solar radiation as the weekly maximum for the last week. For the 7th to the 3646th observations in daily max solar radiation, I put them into a loop to get weekly maxima for each of the remaining week. Combining these weekly maxima together, I get data of the weekly max solar radiation from Jan 1st 2001 to Dec 31st 2010. There are totally 522 observations. I use the first 366 data points (from Jan 1st 2001 to Dec 31st 2007) as the training data to set up our model. I use the last 156 data points (from Jan 1st 2008 to Dec 31st 2010) to check the performance of our model forecast.
From the plot of training data, we can see a strong seasonality with a period one year: in each year, weekly max solar radiation increases first then decrease; the peaks occur in the middle of each year. Also, there are small fluctuations but don’t affect the seasonal trend. The mean of weekly max solar radiation is 778.153 \(Wh/m^2\) shown by blue dotted line. Both the histogram and the QQ-plot show that the distribution of weekly max solar radiation is not normal.
I take a periodogram of our training data to confirm the frequency, and I get the following:
The plot above indicates a freuency of 1/52, i.e. period is 52 weeks and the data is of seasonality. Next, I look at the acf of the data. Note that at the seasonal lags, the acf decays very slowly and the acf showing not stationary also confirms that the period is 52 weeks.
I draw two red vertical lines at 1/52 and 1/26, where exactly dominant spikes locate. Thus, the periodogram shows that the frequency of our training data is 1/52 (the period is 52 weeks). This coincides with our analysis of training data before: the seasonal trend has a period of one year (approximately 52 weeks). In order to clearly see the trend of our data, I use stl function to decompose our time series into seasonal, trend and irregular components (residuals) by loess.
We can clearly see a seasonality and a downward trend. Before fitting data into models, we need to detrend to get a stationary time series. First, I take the first difference of our training data to remove the downward trend. Then I take the seasonal difference to remove the seasonal trend. After taking these two differences, I get the plot of our training data. It appears that most of the seasonality is gone. Data move up and down around the mean level zero. It looks to be stationary.
There is a downward trend but also a seasonality in data, so I use a multiplicative seasonal ARIMA model with nonseasonal orders p, d, and q, seasonal orders P, D, and Q, and seasonal period s. The model has a form: ARIMA\((p,d,q)\times(P,D,Q)_{s}\). We can choose adequate factors for our model by acf of the data that both first and seasonal differences have been taken.
From the acf plot, the correlation is significant at seasonal lag 52, so we can confirm that the model has a seasonal period S=52 and D=1. There is also a significant correlation at lag1 in acf. These information suggests that a simple model which incorporates the lag1 and lag 52 autocorrelations might be adequate. Thus, we can specify our multiplicative seasonal ARIMA\((0,1,1)\times(0,1,1)_{52}\) model (named as m1). The AIC for model m1 is 3103.36, and coefficients are: ma1:-0.8683 and sma1:-0.9214.
Also, I over fit the model to be ARIMA\((0,1,1)\times(0,1,2)_{52}\) (named as m2). The AIC for m2 is 3105.32, and coefficients are: ma1:-0.8689, sma1:-0.9707 and sma2:-0.0179. AIC for overfitted model m2 increased. Estimate of coefficients for ma1 and sma1 have changed very little, and the new coefficients for sma2 sufficiently close to 0, so I choose model m1. Thus, our multiplicative seasonal ARIMA model is: ARIMA\((0,1,1)\times(0,1,1)_{52}\).
The plot of residuals does not suggest any major irregularities with the model. The histogram of residuals looks normal, but the QQ-plot suggests some heavily-tailed distribution.
Acf shows that residuals of the fitted model have little serial correlation, and we can accept that they are independent. Also, I get the periodogram of the residuals of the fitted model. There is no obvious trend. Most residuals are in the 95% confidence interval, and they fluctuate across a horizontal level. The periodogram of the residuals suggests white noise.
I use the fitted ARIMA\((0,1,1)\times(0,1,1)_{52}\) model to predict weekly max solar radiation from Jan 1st 2008 to Dec 31st 2010. As shown in the plot, our prediction generally follows trends of real data. The predicted data curve and real data curve are almost coincide. Almost all data points fall in the 95% confidence interval of our prediction except those in early weeks of 2008 and 2010. In general, our fitted multiplicative seasonal ARIMA model gives pretty good prediction.
Our multicplicative seasoanl ARIMA\((0,1,1)\times(0,1,1)_{52}\) model predicts weekly max solar radiation very well. The 95% confidence interval for the forecast of ARIMA\((0,1,1)\times(0,1,1)_{52}\) covers most real data. The predict curve of ARIMA\((0,1,1)\times(0,1,1)_{52}\) also shows data fluctuations. The answer to my question is that the weekly maximum solar radiation in San Francisco can be predicted by the seasonal ARIMA model. Running in R, seasonal ARIMA model predicts the maximum solar radiation in San Francisco during this week is about 971.9234 \(Wh/m^2\). However, there are still some problems needed to be pointed out. First, the residuals of ARIMA\((0,1,1)\times(0,1,1)_{52}\) model does not show perfect normality. The QQ-plot of the model suggests somewhat heavily-tailed distribution. I tried to do log or square root transformation for my data, but the QQ-plot of residuals of fitted models do not get any better. I guess the imperfect normality of residuals occurs because the distribution of our weekly max solar radiation data are not normal. However, our purpose is to do prediction, and the forecast of our fitted models are pretty well. Second, in our weekly max data, the first observation is the maximum of six days, so the last observation. This may lead some inaccuracy. Finally, the period of our data is not perfectly accurate. There are approximatley but not exactly 52 weeks in a year. Some errors may occur when we use the period 52 weeks to conduct a long-term analysis of our dataset.