Introduction

In this report, we explore and analyze the number of wildfires occurring in California over the last 120 years. California experiences wildfires every year. Wildfires are severe incidents that may cause serious damage or loss of property, land, and human life. Having a statistical model that describes the evolution of wildfire numbers over time helps us to understand the trends and thus potentially be better prepared for future events.

In this study, we explore a California wildfires dataset and count the wildfire alarms that occurred each month. We investigate the seasonality of wildfires using analytical tools in the frequency domain. We find evidence that wildfires occur with approximately 12- and 6-month periods, with the 12-month seasonality being more prominent. The number of fires also increases over the years, almost linearly. Based on this data exploration, we build the following time series models: a seasonal autoregressive moving average (SARMA) model and a seasonal autoregressive integrated moving average (SARIMA) model, including both detrending and seasonal treatments. We use a square-root transformation of the number of fires to make the time series more suitable for analysis. We find model parameters that ensure the chosen models are stationary, causal, and invertible. Moreover, we conduct model diagnostic tests to check that the white noise assumptions underlying the SAR(I)MA models are valid. As a result, we obtain two models that fit the time series sufficiently well, both with a period of 12 months: 1) \(\text{SARIMA} (2,0,2)\times(4,1,0)_{12}\) for the transformed data, and 2) \(\text{SARMA} (2,2)\times(4,0)_{12}\) for the detrended and transformed data. The diagnostic tests show that model 1) fits the data better than model 2), as the autocorrelation function (ACF) of its residuals shows no evidence of lag dependence. Finally, the current analysis has a number of limitations related to data processing and modeling, which can be addressed in future studies.

Methods

We aim to build a model for a California wildfires dataset. A common approach in time series analysis is to fit the data with an autoregressive moving average (ARMA) model and its extensions [1,2]. For computations, we use the R language and its built-in ‘arima’ function [3]. We provide a theoretical summary next [4].

Theoretical background

We consider a collection of jointly defined random variables \(Y_1,Y_2,\dots,Y_N\), briefly \(Y_{1:N}\). Recall from ([2], Ch.2) that the autocovariance function for the random variables \(Y_{1:N}\) is \[\gamma_{m,n}=E[(Y_m-\mu_m)(Y_n-\mu_n)]\] with mean function \(\mu_n=E[Y_n]\). Also, we say that a time series model is weakly stationary if it is both mean stationary (\(\mu_n = \mu\) for all \(n\)) and covariance stationary (\(\gamma_{m,n}\) depends only on the lag \(h=n-m\), so that \(\gamma_{n,n+h}=\gamma_h\) for all \(n\)) ([2], Chs.2-3).

We consider an ARMA model, which is a linear process in the sense that it expresses \(Y_n\) as a linear combination of white noise terms \(\{\epsilon_n\}\). We are interested in stationary ARMA models that can fit the detrended time series ([1], Ch.2).

The ARMA model can be described as follows ([2], Ch.1): \[\phi(B) (Y_n-\mu) = \psi(B) \epsilon_n\] with autoregressive (AR) polynomial of order \(p\): \[\phi(x)= 1-\phi_1 x -\phi_2 x^2 -\dots -\phi_p x^p\] and moving average (MA) polynomial of order \(q\): \[\psi(x)= 1+\psi_1 x +\psi_2 x^2 +\dots +\psi_q x^q\] applied to the backshift (lag) operator \(B\) such that \(B Y_n = Y_{n-1}\). Here \(\mu\) is the mean of the stationary ARMA(p,q) process and \(\{\epsilon_n\}\) is a white noise process (i.i.d. with mean 0 and variance \(\sigma^2\), [2] Ch.3).

Recall that the ARMA(p,q) model is causal if the roots of \(\phi(x)=0\) (for the AR(p) part) lie outside the unit circle in the complex plane. The ARMA(p,q) model is invertible if the roots of \(\psi(x)=0\) (for the MA(q) part) also lie outside the unit circle. Moreover, the AR(p) and MA(q) polynomials should have no common roots; otherwise the model can be simplified by cancelling them in the \(\text{MA}(\infty)\) representation. See [2], Ch.4 for more details.

There are useful extensions of ARMA models ([2], Ch.6). To transform non-stationary data to stationary, we can use the integrated autoregressive moving average model ARIMA(p,d,q). It applies the difference operator \((1-B)^d\) to \(Y_n\) to detrend the data, i.e. make it closer to stationary; \(d=1\) is the usual choice. \[\phi(B)\big[ (1-B)^d Y_n-\mu \big] = \psi(B) \, \epsilon_n\]
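In R, the differencing order \(d\) is passed directly to the ‘arima’ function; a minimal sketch is given below, where y is a placeholder series.

# sketch: fit ARIMA(2,1,1), i.e. an ARMA(2,1) model for the differenced series (1-B) y_n
# note: arima() drops the mean/intercept term whenever d > 0
fit_d1 <- arima(y, order = c(2, 1, 1))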

To incorporate a larger time scale (seasonality), one can use a \(\text{SARMA} (p,q)\times(P,Q)_s\) model, where \(s\) indicates the seasonal period. The model can be written as ([2], Ch.6): \[\phi(B)\Phi(B^{s})\big[Y_n-\mu\big] = \psi(B)\Psi(B^{s}) \epsilon_n\] where \(\Phi(B^{s}),\Psi(B^{s})\) are polynomials of orders \(P\) and \(Q\), respectively, that operate on the seasonal time scale.

There is also a general model that combines both of the above: \(\text{SARIMA}(p,d,q)\times(P,D,Q)_s\), with a seasonal difference operator of order \(D\). Causality and invertibility of these models are checked in the same way as for ARMA(p,q).
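Written out, the combined model takes the form ([2], Ch.6): \[\phi(B)\Phi(B^{s})\big[(1-B)^d(1-B^{s})^D Y_n-\mu\big] = \psi(B)\Psi(B^{s})\,\epsilon_n\]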

Model selection approach

For our analysis, we transform and detrend the data before fitting the (S)ARMA model. Detrending is required because these models are stationary. The transformation helps reduce heteroscedasticity and thus makes the variance more homogeneous [5]. To fit a SARIMA model, detrending is not needed, because the built-in differencing serves a similar purpose (see [1], p.58, Example 2.8).

To find seasonality in the data, the Fourier transform of the time series into the frequency domain is helpful. It allows us to estimate the spectral density function and identify the most important frequency modes (those with the highest spectral peaks) ([2], Ch.7-8).

The specific ARMA(p,q) model for a stationary process (the detrended time series) can be selected based on the minimum Akaike information criterion (AIC) value ([2], Ch.4). Note that if the AIC variability across candidate models is small, e.g. less than 10%, then this criterion is not informative and can be ignored. We can also guide model selection by the following desirable properties: simplicity, causality, and invertibility. Usually, both approaches are combined.

Finally, we conduct model diagnostics using residuals, ACF, and QQ-plots to test whether the model assumptions (e.g. white noise process) are appropriate.

Data Exploration and Pre-Processing

We study a California Wildfires dataset with 21,318 rows, each representing a wildfire occurrence in California, and 18 columns, covering 1878 to 2020; see [6] and Appendix A. We consider data from 1900-2020 and take into account only the following features: the year (“YEAR_”) and the date of the alarm notice (“ALARM_DATE”). We drop rows in which the alarm date is missing or contains an obvious typo.

By counting the number of alarms in each month we obtain the number of fires occurring each month. Months with no recorded alarms are assigned zero counts.
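A minimal sketch of this aggregation step is given below; the data frame name ‘fires’ and the use of the dplyr package are our assumptions, while the column name “ALARM_DATE” follows the dataset.

library(dplyr)
# count alarms per calendar month
monthly <- fires %>%
  mutate(month = format(as.Date(ALARM_DATE), "%Y-%m")) %>%
  count(month, name = "N")
# build a complete 1900-2020 monthly grid and fill months with no alarms with zero
all_months <- format(seq(as.Date("1900-01-01"), as.Date("2020-12-01"), by = "month"), "%Y-%m")
ts <- data.frame(month = all_months) %>%
  left_join(monthly, by = "month") %>%
  mutate(N = ifelse(is.na(N), 0, N),
         date = as.Date(paste0(month, "-01")))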

We create plots showing the number of wildfires occurring each month and each year from 1900 to 2020. There appears to be some cyclical structure in the data, with higher numbers of wildfires in recent years. There is also possibly an outlier in the monthly plot around 2010.

Explore Data in the Frequency Domain

We start by plotting the spectral density estimate [7]. We plot the spans-based (smoothed periodogram) and the AIC-based AR estimates of the spectrum together ([2], Ch.8). Then, we search for possible seasonality by examining the highest spectral peaks.
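A sketch of this step is shown below; the span values are illustrative assumptions.

# non-parametric estimate: smoothed periodogram of the monthly counts
spec_np <- spectrum(ts$N, spans = c(25, 5, 25), plot = FALSE)
# parametric estimate: AR model with order chosen by AIC
spec_ar <- spectrum(ts$N, method = "ar", plot = FALSE)
# overlay the two estimates (frequency is in cycles per month)
plot(spec_np$freq, spec_np$spec, type = "l", log = "y",
     xlab = "frequency (cycles per month)", ylab = "spectrum")
lines(spec_ar$freq, spec_ar$spec, col = "red")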

From this plot, we find two main periods in the number of wildfires: around 1 year and around 6 months. See Appendix C for the spectrum decomposition and mid-range frequency identification; it shows that 6 and 12 months indeed lie within the mid-range band of approximately 8-32 months. We also explore the spectrum of the annual number of fires and find no evidence of seasonality (Appendix A).

Modeling

In this section we present two models: SARMA and SARIMA for fitting monthly data of the number of wildfires in California.

SARMA modeling

First, we conduct a search for stationary ARMA models based on the minimum AIC criterion (see Appendix D). The results suggest the stationary, causal, and invertible ARMA(2,1) model for the monthly data, but its residuals show autocorrelation at lag 12.

Second, we continue our search with the annual data, applying a similar AIC analysis (see Appendix D). For this fit, we detrend and transform the annual data as well. We find that a stationary AR(4) model for the annual data is causal, invertible, and shows almost no evidence of residual autocorrelation.

Finally, the previous two steps suggest combining the monthly ARMA(2,1) and the annual AR(4) into one SARMA model. We start with \(\text{SARMA} (2,1)\times(4,0)_{12}\) and encounter convergence issues. We therefore try \(\text{SARMA} (2,2)\times(4,0)_{12}\), which works relatively well, as shown in the following summary and plots.

In addition, we try three different detrending methods: linear regression, the Hodrick-Prescott (HP) filter, and loess. HP-filter detrending yields unit MA roots and thus is not a good choice. Linear detrending leaves some extra autocorrelation at several lags in the ACF. Loess detrending gives the best results, though there is a bit of ACF dependence at lag 30; this could be a result of the detrending, the transformation, or the data processing.

# loess detrending of the square-root transformed monthly counts
# (the loess fit below is a sketch; the span value is an illustrative assumption)
N_loess_sqrt = loess(sqrt(N) ~ as.numeric(date), data = ts, span = 0.5)
detrended_transfomed = N_loess_sqrt$residuals

# fit SARMA(2,2)x(4,0) with period 12 to the detrended, transformed series
fire_sarma = arima(detrended_transfomed, order=c(2,0,2),
                   seasonal=list(order=c(4,0,0), period=12))
print(fire_sarma)
## 
## Call:
## arima(x = detrended_transfomed, order = c(2, 0, 2), seasonal = list(order = c(4, 
##     0, 0), period = 12))
## 
## Coefficients:
##           ar1     ar2     ma1      ma2    sar1    sar2    sar3    sar4
##       -0.1871  0.4029  0.4580  -0.1850  0.2566  0.2303  0.1769  0.2132
## s.e.   0.2939  0.1844  0.2967   0.1257  0.0258  0.0264  0.0265  0.0263
##       intercept
##          0.0194
## s.e.     0.3412
## 
## sigma^2 estimated as 1.246:  log likelihood = -2227.55,  aic = 4475.1
## AR roots: 1.82 1.36
## MA roots: 1.4 3.87
## SAR roots: 1.06 1.65 1.64 1.64

We conduct the model diagnostics by plotting the residuals and the QQ-plot. They show the presence of an outlier. Overall, the ACF plot and diagnostics confirm that the residuals can be treated as a white noise process, which supports the use of the SARMA model.
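The diagnostics can be produced along the following lines (a sketch; plot titles and the lag range are illustrative).

res <- residuals(fire_sarma)
plot(res, type = "l", ylab = "residuals", main = "SARMA residuals")
acf(res, lag.max = 50, main = "ACF of SARMA residuals")
qqnorm(res); qqline(res)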

SARIMA Model

In the trend exploration section we showed that the first-difference operator works better for our data than detrending, as it is more effective at reducing heteroscedasticity. Therefore, it is natural to fit the monthly data with a SARIMA model. Note that we do not need to detrend the data for such modeling, though the square-root transformation is still valid and helpful.

We aim to obtain a stationary model that is causal and invertible, has i.i.d. residuals (no significant autocorrelation in the residual ACF), and is relatively simple. By trial and error (Appendix E), we find the following model: \(\text{SARIMA} (2,0,2)\times(4,1,0)_{12}\) with a seasonal period of 12 months and the first difference applied on the seasonal (annual) scale: \(D=1\) while the monthly \(d=0\). This model has all AR, MA, and SAR roots outside the unit circle. Thus, the model (for the differenced series) is stationary, causal, and invertible.

# transform data: square root of the monthly wildfire counts
temp.use = (ts$N)^0.5
# fit SARIMA(2,0,2)x(4,1,0) with period 12
fires_sarima = arima(temp.use,order=c(2,0,2),seasonal=list(order=c(4,1,0),
                                                           period=12))
print(fires_sarima)
## 
## Call:
## arima(x = temp.use, order = c(2, 0, 2), seasonal = list(order = c(4, 1, 0), 
##     period = 12))
## 
## Coefficients:
##           ar1     ar2     ma1      ma2     sar1     sar2     sar3     sar4
##       -0.1615  0.5287  0.4085  -0.3158  -0.7653  -0.5440  -0.3981  -0.2190
## s.e.   0.2566  0.1847  0.2607   0.1410   0.0261   0.0312   0.0311   0.0259
## 
## sigma^2 estimated as 1.218:  log likelihood = -2190.55,  aic = 4399.11
## AR roots: 1.54 1.23
## MA roots: 1.25 2.54
## SAR roots: 1.42 1.5 1.5 1.42

Then we proceed with the model diagnostics. The ACF of the residuals shows no evidence of significant autocorrelation, so the residuals are consistent with being independent. The residual plot and the QQ-plot suggest that the SARIMA residuals are indeed close to Gaussian white noise.

Note that this model has the same AR and MA orders, p=q=2 and P=4, Q=0, as the SARMA fit in the previous section. The SARIMA fit is better than the SARMA fit because its residual ACF shows less autocorrelation.

Conclusion

In the current report, we present an analysis of wildfires occurring in California in 1900-2020. We took publicly available data on fire alarm dates and aggregated it into the number of fires per month. The resulting time series has a couple of key features. It has a 12-month seasonality, confirmed by our spectrum investigation in the frequency domain. It also has a non-uniform variance of fluctuations that grows over the years, and a nearly linearly increasing trend. To work with such a time series we transformed and detrended the data, which makes it possible to fit stationary ARMA models that assume the residuals are a white noise process. We conducted model selection based partly on the minimum AIC criterion, while also aiming for a causal and invertible model with no significant residual autocorrelation. We found two models that fit our data well: 1) \(\text{SARMA} (2,2)\times(4,0)\) with a period of 12 months, which fits the detrended data, and 2) \(\text{SARIMA} (2,0,2)\times(4,1,0)\), also with a period of 12 months, which fits the data without detrending, as its built-in first-difference operator has a similar effect. Overall, the second model gives the better fit.

There are limitations to the presented study, related to the data processing (e.g., dropping records with missing or malformed alarm dates and filling missing months with zero counts) and to the modeling choices described above. These limitations raise interesting questions to be addressed in future studies.

References

[1] R.H.Shumway and D.S.Stoffer, “Time series analysis and its applications: with R examples”, 4th edition, Springer Nature, 2017, DOI 10.1007/978-3-319-52452-8

[2] E. Ionides, “Time series analysis”, STATS/DATASCI 531 Course, University of Michigan, Winter 2022, https://github.com/ionides/531w22

[3] RDocumentation, arima: ARIMA Modeling of Time Series, https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/arima

[4] V. F., Homework report #3, STATS/DATASCI 531 Course, University of Michigan, Winter 2022

[5] E. Ionides, personal communication (office hours), 2/17/2022.

[6] Dataset sources: https://gis.data.ca.gov/datasets/e3802d2abf8741a187e73a9db49d68fe_0/explore?location=37.357681%2C-118.992700%2C6.00 , https://cvw.cac.cornell.edu/PyDataSci1/wildfires

[7] Example of plotting two spectra together: https://lbelzile.github.io/timeseRies/spectral-estimation-in-r.html

Appendix

A. Spectrum for Data Aggregated by Year

The dataset summary is the following:

##   ï..OBJECTID        YEAR_         STATE              AGENCY         
##  Min.   :21440   Min.   :1878   Length:21318       Length:21318      
##  1st Qu.:26774   1st Qu.:1950   Class :character   Class :character  
##  Median :32105   Median :1981   Mode  :character   Mode  :character  
##  Mean   :32104   Mean   :1976                                        
##  3rd Qu.:37435   3rd Qu.:2005                                        
##  Max.   :42764   Max.   :2020                                        
##                  NA's   :77                                          
##    UNIT_ID           FIRE_NAME           INC_NUM            ALARM_DATE        
##  Length:21318       Length:21318       Length:21318       Min.   :0219-05-29  
##  Class :character   Class :character   Class :character   1st Qu.:1970-06-01  
##  Mode  :character   Mode  :character   Mode  :character   Median :1994-06-22  
##                                                           Mean   :1987-07-24  
##                                                           3rd Qu.:2009-08-01  
##                                                           Max.   :2106-09-26  
##                                                           NA's   :5364        
##   CONT_DATE             CAUSE          COMMENTS           REPORT_AC     
##  Length:21318       Min.   : 1.000   Length:21318       Min.   :     0  
##  Class :character   1st Qu.: 4.000   Class :character   1st Qu.:    15  
##  Mode  :character   Median : 9.000   Mode  :character   Median :    60  
##                     Mean   : 9.226                      Mean   :  2270  
##                     3rd Qu.:14.000                      3rd Qu.:   350  
##                     Max.   :19.000                      Max.   :499945  
##                     NA's   :48                          NA's   :12551   
##    GIS_ACRES            C_METHOD       OBJECTIVE       FIRE_NUM        
##  Min.   :      0.0   Min.   :1.000   Min.   :1.000   Length:21318      
##  1st Qu.:     29.9   1st Qu.:1.000   1st Qu.:1.000   Class :character  
##  Median :    161.5   Median :6.000   Median :1.000   Mode  :character  
##  Mean   :   1877.3   Mean   :4.546   Mean   :1.013                     
##  3rd Qu.:    649.3   3rd Qu.:8.000   3rd Qu.:1.000                     
##  Max.   :1032699.0   Max.   :8.000   Max.   :2.000                     
##  NA's   :7           NA's   :12222   NA's   :195                       
##   SHAPE_Length       SHAPE_Area        
##  Min.   :     11   Min.   :-7.116e+09  
##  1st Qu.:   2118   1st Qu.:-4.148e+06  
##  Median :   4916   Median :-1.024e+06  
##  Mean   :  11333   Mean   :-1.211e+07  
##  3rd Qu.:  10550   3rd Qu.:-1.925e+05  
##  Max.   :1695543   Max.   : 1.453e+05  
## 

We construct smoothed spectral density estimates for the annual number of wildfires using periodogram-based and AIC-based (AR) methods.

Next we find the frequency at which the spectrum is maximized.
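A sketch of how the maximum is located is given below; the spectrum object name spec_annual (the non-parametric estimate for the annual series) is our assumption.

i_max <- which.max(spec_annual$spec)
freq_max <- spec_annual$freq[i_max]
cat("Max annual spectrum by unparametric method", round(spec_annual$spec[i_max], 3),
    "is at frequency", round(freq_max, 4),
    "or", round(1 / freq_max), "years period.\n")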

## Max annual spectrum by unparametric method 8803.701 is at frequency 0.0083 or 120 years period.

We notice from the spectrum plot that the maximum is attained at the lowest (near-zero) frequency, which corresponds to the trend. Therefore, we do not assume any seasonality for the annual data.

B. Detrending by linear regression or the Hodrick-Prescott filter

Consider a linear regression fit of the monthly fire data, transformed by the square root. We plot the residuals, i.e., the linearly detrended, transformed series.

#linear regression
mod = lm(N^0.5~date, data = ts)

plot(resid(mod),ylab = 'Residual', main='Linearly Detrended, Transformed')

We also can use the Hodrick-Prescott (HP) filter to extract the trend of the monthly data ([2], Ch.9). For given observations \({y^*_{1:N}}\), the HP filter is the time series \({s^*_{1:N}}\) constructed as \[\begin{equation} {s^*_{1:N}} = \arg\min_{s_{1:N}} \left\{ \sum^{N}_{n=1}\big({y^*_n}-s_{n}\big)^2 + \lambda\sum^{N-1}_{n=2}\big(s_{n+1}-2s_{n}+s_{n-1}\big)^2 \right\}, \end{equation}\] where \(\lambda\) is a smoothing parameter.
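A direct implementation of this penalized least-squares problem is sketched below; the value of \(\lambda\) is an illustrative assumption (\(\lambda = 129600\) is a common choice for monthly data).

# HP filter: solve s* = (I + lambda * D'D)^{-1} y, with D the second-difference matrix
hp_trend <- function(y, lambda = 129600) {
  N <- length(y)
  D <- diff(diag(N), differences = 2)        # (N-2) x N second-difference matrix
  solve(diag(N) + lambda * crossprod(D), y)  # trend component s*
}
y_star <- sqrt(ts$N)                    # square-root transformed monthly counts
fires_hp <- y_star - hp_trend(y_star)   # HP-detrended series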

The next plot shows the resulting HP-filtered number of fires per month together with the square-root transformed data. As expected, the variance becomes more uniform over time when the data are transformed.

C. Decomposition

To further investigate cycles, we proceed with a frequency decomposition, looking at the trend, noise, and cycle components of the monthly fire counts ([2], Ch.8, slide 18). Note that here we do not transform the data but rather investigate the original time series of wildfire counts.
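One way to obtain such a decomposition is loess smoothing at two bandwidths, sketched below; the span values are illustrative assumptions.

u <- ts$N
t_idx <- seq_along(u)
u_low <- loess(u ~ t_idx, span = 0.5)$fitted       # low frequency: trend
u_hi  <- u - loess(u ~ t_idx, span = 0.1)$fitted   # high frequency: noise
u_cycle <- u - u_low - u_hi                        # mid-range frequencies: cycles
plot(stats::ts(cbind(u, u_low, u_hi, u_cycle), start = 1900, frequency = 12),
     main = "Decomposition of monthly wildfire counts")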

To find the band of mid-range frequencies, we plot the frequency response (ratio of spectra) between the cycle component and the original time series.
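The ratio of spectra can be computed as follows (a sketch, reusing u and u_cycle from the decomposition above; the smoothing spans are illustrative).

spec_both <- spectrum(cbind(u, u_cycle), spans = c(3, 3), plot = FALSE)
ratio <- spec_both$spec[, 2] / spec_both$spec[, 1]   # cycle spectrum / original spectrum
plot(spec_both$freq, ratio, type = "l", log = "y",
     xlab = "frequency (cycles per month)", ylab = "spectrum ratio")
abline(h = 0.5, lty = 2)   # threshold used to define the mid-range band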

                                                     low    hi
frequency range, region for ratio greater than 0.5  0.031  0.123

The analysis shows that the mid-range frequencies lie between 0.031 and 0.123 cycles per month (periods of 1/0.031 ≈ 32.3 and 1/0.123 ≈ 8.1 months, respectively). The previously found periods of 12 and 6 months are within this band.

D. AIC search for stationary ARMA fit

Monthly data

For stationary ARMA modeling we consider the monthly number of wildfires, transformed and detrended by loess smoothing. We fit multiple ARMA(p,q) models with various values of p and q and compute the Akaike information criterion (AIC) for each (see [2], Ch.5).
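The table below can be produced with a simple double loop over candidate orders, sketched here following the approach in [2], Ch.5; some fits may warn or fail to converge, which tryCatch handles by recording NA.

aic_table <- function(data, P, Q) {
  table <- matrix(NA, P + 1, Q + 1)
  for (p in 0:P) {
    for (q in 0:Q) {
      table[p + 1, q + 1] <- tryCatch(arima(data, order = c(p, 0, q))$aic,
                                      error = function(e) NA)
    }
  }
  dimnames(table) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  table
}
monthly_aic <- aic_table(detrended_transfomed, 5, 5)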

MA0 MA1 MA2 MA3 MA4 MA5
AR0 6203.78 5521.90 5209.87 5119.44 5079.02 5077.10
AR1 5233.12 5200.90 5113.57 5095.14 5079.28 4995.25
AR2 5166.64 4865.65 4490.82 4381.44 4360.55 4359.88
AR3 4953.01 4751.35 4366.34 4494.81 4384.99 4364.30
AR4 4851.84 4736.14 4372.37 4369.94 4489.55 4388.34
AR5 4761.47 4709.20 4363.40 4372.02 4370.85 4366.03
## [1] "ARMA AIC variability is 30 %"
## [1] 4359.884

We find that the AIC variability across the table is about 30%, so the differences are not negligible. The table shows that ARMA(2,5) has the lowest AIC, 4359.884.

We notice that the AICs for ARMA(p,q) with p=2, q=0,...,5 are relatively small, so these models can be investigated further for causality and invertibility. We find that ARMA(2,1) and AR(2) have all roots outside the unit circle. ARMA(2,1) has the smaller AIC value and thus is preferable.

## arma25 roots for AR  1 1  and MA 1.02 1.02 2.15 2.15 5.11
## arma24 roots for AR  1 1  and MA 1.02 1.02 2.73 2.73
## arma23 roots for AR  1 1  and MA 1.02 1.02 3.79
## arma22 roots for AR  1 1  and MA 1.03 1.03
## arma21 roots for AR  1.14 1.14  and MA 1.31
## arma20 roots for AR  2.16 2.16
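These moduli come from the polynomial roots of each fitted model; for example, for ARMA(2,1) the check looks as follows (a sketch, with the fitted-object name assumed).

arma21 <- arima(detrended_transfomed, order = c(2, 0, 1))
AR_roots <- polyroot(c(1, -coef(arma21)[c("ar1", "ar2")]))
MA_roots <- polyroot(c(1, coef(arma21)["ma1"]))
cat("arma21 roots for AR ", round(Mod(AR_roots), 2), " and MA", round(Mod(MA_roots), 2), "\n")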

We plot the ACF of the ARMA(2,1) residuals for the monthly data. There is evidence of autocorrelation at a lag of 12 months that is not captured by this simple model.

Annual data

We check the AIC values for ARMA models fitted to the annual data, using loess detrending of the transformed series. The minimum AIC corresponds to ARMA(4,4). However, the variability of AIC within this table is smaller than 10%, so AIC is not a helpful criterion here.

MA0 MA1 MA2 MA3 MA4 MA5
AR0 477.19 472.93 472.67 473.90 475.70 477.04
AR1 471.79 473.68 473.76 475.65 477.36 479.01
AR2 473.59 474.89 474.07 470.22 470.12 472.40
AR3 474.37 476.09 471.46 471.95 469.31 469.62
AR4 475.91 477.91 475.55 467.39 461.73 463.02
AR5 477.90 472.22 478.79 467.19 476.34 478.30
## [1] "annual ARMA AIC variability is 4 %"
## [1] 461.7259

We continue with a manual model choice, looking for a stationary, causal, and invertible ARMA(p,q) model with p=4 and q=0,...,4. We see that AR(4) and ARMA(4,1) are both suitable models for the annual data, as they are causal and invertible. AR(4) has the smaller AIC and is therefore used for further investigation.

## arma44 roots for AR  1.06 1.13 1.13 1.06  and MA 1.01 1 1 1.13
## arma43 roots for AR  1.31 1.04 1.04 2.15  and MA 1 1 1
## arma42 roots for AR  1.09 1.09 3.59 10.63  and MA 1.01 1.01
## arma41 roots for AR  2.06 1.73 2.06 2.18  and MA 27.62
## arma40 roots for AR  2.02 1.75 2.02 2.11

We plot the ACF of the AR(4) model residuals for the annual data. It is a good model, with only a small amount of autocorrelation at lag 10, which is negligible.

E. Trial and error with SARIMA

We consider different p, q, P, Q for fitting the transformed number of fires per month. The selection criteria are: the roots should lie outside the unit circle, the residual ACF should show no lag dependence (no points outside the confidence interval), the residuals of the fit should be more or less homogeneous, and there should be no convergence problems.

# transform data: square root of the monthly wildfire counts
temp.use = (ts$N)^0.5
# fit a candidate SARIMA(2,0,2)x(3,1,0) with period 12
fires_sarima_test = arima(temp.use, order=c(2,0,2),
                          seasonal=list(order=c(3,1,0), period=12))
print(fires_sarima)
## 
## Call:
## arima(x = temp.use, order = c(2, 0, 2), seasonal = list(order = c(4, 1, 0), 
##     period = 12))
## 
## Coefficients:
##           ar1     ar2     ma1      ma2     sar1     sar2     sar3     sar4
##       -0.1615  0.5287  0.4085  -0.3158  -0.7653  -0.5440  -0.3981  -0.2190
## s.e.   0.2566  0.1847  0.2607   0.1410   0.0261   0.0312   0.0311   0.0259
## 
## sigma^2 estimated as 1.218:  log likelihood = -2190.55,  aic = 4399.11
#check roots of arma part of the model
AR_roots <- polyroot(c(1,-coef(fires_sarima_test)[c("ar1","ar2")]))
MA_roots <- polyroot(c(1,coef(fires_sarima_test)[c("ma1","ma2")]))
cat('AR roots:',round(Mod(AR_roots),2),'\n')
## AR roots: 1.18 2.42
cat('MA roots:',round(Mod(MA_roots),2),'\n')
## MA roots: 1.28 4.3
#check roots of the seasonal AR (SAR) part of the model
SAR_roots <- polyroot(c(1,-coef(fires_sarima_test)[c("sar1","sar2","sar3")]))
cat('SAR roots:',round(Mod(SAR_roots),2),'\n')
## SAR roots: 1.59 1.6 1.59