1. Introduction

1.1. Adélie Penguins

Adélie penguins is a kind of penguin species that mainly lives along the Antarctic continent. They live on food and distribute on breeding colonies around costal areas, spending their winter time offshore in the seas around the Antarctic pack ice. Adélie penguins feed on tiny aquatic creatures like shrimp-like krill, and also eat fish and squid. During their breeding season in October, they migrate and live in large communities called colonies along rockly coastlines. The colonies become their major living space to have a better bredding environment.

1.2. Population Arrival

The arrival of adult Adélie penguins documented annually usually performs as ice and weather conditions permit. The number of adult penguins present provides a measure of the number of adults arriving daily at the breeding colonies, which serves as a metric to sense and control the environmental conditions such as sea ice extent during late winter and early spring.

1.2. Background & Motivation

In recent years, the human behaviors throw more and more harm onto the Antarctic and the environment becomes worse and worse for those creatures. Lack of food and habitats that majorly consist of ice glacier are destroyed due to the constant global warming. Penguins, the animals that live on the Antartic annually, are grealty affected and placed at a dangerous situation. I happend to read a WeChat push in Chinese describing the bad situation of pengiuns the other day and was deeply motivated by this topic. I would like to take this chance to explore the how number of pengiuns are affected as time goes by in the past about 30 years and whether it brings some warnings to human beings about the situation of Antarctic creatures.

2. Data

2.1. Data Source

The data source I found is a research on the ocean informatics study in UCSD. The data consists observations of the number of population arrivial for Adélie penguins from October 20, 1991 to February 1, 2018 recorded at different islands and colonies in the Antarctic. The total number of records is 3,034. Here I show some sample data from the original dataset:
studyName Date.GMT Island Colony Adults
PAL9192 1991-10-20 HUM 1.1 24
PAL9192 1991-10-20 HUM 1.2 23
PAL9192 1991-10-20 HUM 1.3 5
PAL9192 1991-10-20 HUM 2.1 94
PAL9192 1991-10-20 HUM 2.2 57

2.2. Data Preprocessing

Since the data is observed for different islands and colonies of Adélie penguins for each recorded date, and the record time only happens in winter hours including October to February data for each year. I intended to aggregate and group the data into monthly data for convinience. I extend and separate the data into evenly distributed time series data from the year 1991 to 2018 by calculating the mean of the population arrivial for each month. For each year, there is supposed to get 5 data points, representing the data in October to February correspondingly. In order to keep an evenly distributed time series, I fill in those monthly missing data in a year with the average values of known datat in that year. Therefore, I get 135 time points in the end as the project dataset.

3. Analysis

3.1. Data Transformation

We can see that the data shows heavy non-stationarity with roughly exponential decrease and obvious seasonality, I decide to use logarithm transformation on the time series values. I transform the population arrivial by using a \(log_{10}\) transformation. Since we can notice that in the original dataset, there exist some 0 values, which will cause the logarithm to be infinity, so I add 0.1 to each of the data before feeding into the logarithm:

Here since my plotted time series pattern shares some similar traits with the project 1 in the past 2018 midterm project, so I utiize some of the methods taken by this author. Therefore I would like to use regression with ARMA errors that separates the data into two considerations, one for depicting the trend for this series, while the other to depict the error terms using existing models learned in class. Since our main goal to fit a time series model is around ARMA models amd their derivatives like ARIMA and SARIMA models. To apply an ARMA model to the errors, we need to have the weak stationarity properties. So follow the method used by the referred project, we can first check the variance for each segments of points:

Point 1-30 Point 31-60 Point 61-90 Point 91-120
Local variance 0.5427803 0.4786229 0.3850376 0.2475385

We can see that although there exists some differences in the variance, we can for now, saying that we can try and use the ARMA models to fit the error terms.

3.2. Data Decomposition

We can first use ACF function and the periodogram to check the seasonality of the data. It is shown that a period with 5 can be clearly observed, which means that the population arrivial for Adélie penguins indicates a yearly cycle.

From the periodogram we can also perceive a peak at frequency 0.4, since the period for that frequency turns out to be 2.5, which is hard to interpret since the data are only for serveral months, not for each month in a year. So here I would like to explore more for the cycle of 5 data points, which is the peak at frequency 0.2 in the periodogram.

We can then see a decomposition of the series by showing the trend and the noise, which helps us determine which model we would like to use and specify the series of data. I use the lowest frequency 0.2 to extract the trend, and the highest seen frequency 0.5 to generate the noises.

3.3. Regression Models for Trend

As it shows some evidence for a decreasing trend, to model this trend, I try three regression models and compare their effects.The three models includes a linear regression model \(y_n = \beta x_n + \mu\), a quadratic model \(y_n = \beta_2 x_n^2 + \beta_1x_n + \mu\) and a cubic model \(y_n = \beta_3 x_n^3 + \beta_2 x_n^2 + \beta_1x_n + \mu\), here \(y_n\) represents the population arrivials and \(x_n\) is the time series. The results for fitting them to the series output some R statistics for comparison:

## 
## Call:
## lm(formula = log.pop.arrivial ~ time)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72895  0.01262  0.19502  0.32395  0.66864 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 54.84773   14.07469   3.897 0.000154 ***
## time        -0.02667    0.00702  -3.799 0.000220 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6357 on 133 degrees of freedom
## Multiple R-squared:  0.09788,    Adjusted R-squared:  0.0911 
## F-statistic: 14.43 on 1 and 133 DF,  p-value: 0.0002204
## 
## Call:
## lm(formula = log.pop.arrivial ~ time + I(time^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.63544  0.01534  0.21959  0.34876  0.64938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.418e+03  4.052e+03  -0.843    0.401
## time         3.437e+00  4.042e+00   0.850    0.397
## I(time^2)   -8.638e-04  1.008e-03  -0.857    0.393
## 
## Residual standard error: 0.6363 on 132 degrees of freedom
## Multiple R-squared:  0.1029, Adjusted R-squared:  0.08928 
## F-statistic: 7.568 on 2 and 132 DF,  p-value: 0.0007735
## 
## Call:
## lm(formula = log.pop.arrivial ~ time + I(time^3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.63584  0.01534  0.21950  0.34864  0.64950 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.253e+03  2.702e+03  -0.834    0.406
## time         1.700e+00  2.021e+00   0.841    0.402
## I(time^3)   -1.432e-07  1.676e-07  -0.854    0.395
## 
## Residual standard error: 0.6363 on 132 degrees of freedom
## Multiple R-squared:  0.1028, Adjusted R-squared:  0.08924 
## F-statistic: 7.565 on 2 and 132 DF,  p-value: 0.0007753

Comparing the above three R output results, we can notice that the p-values for the quadratic and cubic polynomial regression models are large, so I choose the linear regression model to fit the data, so we can describe the up-coming model as a composition of a linear regression model plus a error term: \[ y_n = -0.02667x_n + 54.84773 + error.\]

We can also plot the estimated linear model and see whether it fits well:

Shown in the fitting model, we can know that there is a linear decresing trend for the series, considering the logarithm we take, we can see the number of population arrivial decreses in an exponential speed.

3.4. SARMA error

We have already figured out that the seasonality should be considered in this time series. So follow the procedures in the lecture slide, we can fit a Gaussian ARMA(p,q) model with seasonality term \(SARMA(p,q)\times(P,Q)_5\), use \(\epsilon_n\) to represent the error sequence, which can be written by: \[\begin{aligned} \phi(B)\Phi(B^5)(\epsilon_n - \mu) = \psi(B)\Psi(B^5)w_n, \end{aligned}\] where \[\begin{aligned} \mu &= \mathbb{E}[\epsilon_n], \\ \phi(x) &= 1 - \phi_1x - \phi_2x^2 - ... - \phi_px^p, \\ \psi(x) &= 1 + \psi_1x + \psi_2x^2 + ... + \psi_qx^q, \\ \Phi(x) &= 1 - \Phi_1x - \Phi_2x^2 - ... - \Phi_px^P, \\ \Psi(x) &= 1 + \Psi_1x + \Psi_2x^2 + ... + \Psi_qx^Q. \\ w &\sim \textbf{i.i.d.} N[0, \sigma^2]. \end{aligned}\]

Like what we did in class, I would like to use AIC to help me choose a suitable ARMA model. For the seasonality term, I plan to use a simple \((1,0)_5\) model. So use the R code to generate an AIC table for the selection of \((p,q)\):
MA0 MA1 MA2 MA3 MA4 MA5
AR0 199.75 201.72 203.10 203.95 205.69 192.16
AR1 201.71 203.67 201.68 201.05 203.05 194.05
AR2 203.21 201.54 201.56 202.94 203.28 188.73
AR3 203.53 201.11 201.33 197.89 199.87 190.31
AR4 205.27 202.72 203.10 201.56 196.01 192.26

We can see that the \(SARMA(2,5)\times(1,0)_5\) gives the lowest AIC value 188.73. Another simpler candidate may be picked as \(SARMA(0,5)\times(1,0)_5\) with AIC value 192.16. Output the statistics for the two models and compare:

## 
## Call:
## arima(x = error, order = c(2, 0, 5), seasonal = list(order = c(1, 0, 0), period = 5))
## 
## Coefficients:
##          ar1     ar2     ma1      ma2     ma3      ma4      ma5    sar1
##       0.0221  0.2392  0.0510  -0.1707  0.0066  -0.0296  -0.8574  0.9788
## s.e.  0.1210  0.1045  0.0974   0.0707  0.0706   0.0646   0.0715  0.0209
##       intercept
##          0.0016
## s.e.     0.0420
## 
## sigma^2 estimated as 0.1934:  log likelihood = -84.36,  aic = 188.73
## 
## Call:
## arima(x = error, order = c(0, 0, 5), seasonal = list(order = c(1, 0, 0), period = 5))
## 
## Coefficients:
##          ma1      ma2     ma3     ma4      ma5    sar1  intercept
##       0.0897  -0.0084  0.0693  0.0158  -0.7830  0.9814    -0.0576
## s.e.  0.0784   0.0794  0.0857  0.0862   0.0954  0.0195     0.3691
## 
## sigma^2 estimated as 0.2047:  log likelihood = -88.08,  aic = 192.16

From the output, I decide to choose \(SARMA(2,5)\times(1,0)_5\) with a higher log likelihood and lower estimated \(\sigma^2\) as my selected SARMA model for the errors. Now I can check its invertibility and casuality, which seems good to me.

## [1] "The AR roots:"
## [1] -0.04623+2.044084i -0.04623-2.044084i
## [1] "The MA roots:"
## [1]  0.3335292+1.007533i -0.8507752+0.558256i -0.8507752-0.558256i
## [4]  1.0000005-0.000000i  0.3335292-1.007533i

Now, the complete model is: \[\begin{aligned} Y_n = −0.02667x_n &+ 54.84773 + \epsilon_n, \\ \phi(B)\Phi(B^5)(\epsilon_n−\mu) &= \psi(B)\Psi(B^5)w_n, \end{aligned}\] where \[\begin{aligned} \mu &= \mathbb{E}[\epsilon_n], \\ \phi(x) &= 1 - 0.0221x - 0.2392x^2, \\ \psi(x) &= 1 + 0.0510x - 0.1707x^2 + 0.0066x^3 - 0.0296x^4 - 0.8574x^5, \\ \Phi(x) &= 1 - 0.9788x, \\ \Psi(x) &= 1. \\ w &\sim \textbf{i.i.d.} N[0, 0.1934]. \end{aligned}\]

3.5. Model Diagnostic

To stress the validity of the SARMA model for error term, we need to perform a diagnostic analysis:

The above generated diagnostic plots shows that there’s no evidence to show the model with heteroskacity, and the correlation of residuals for the error term is not perceived by us. Th QQ plot looks basically credible to indicate the normality of the residuals. So we can draw a conclusion that \(SARMA(2,5)\times(1,0)_5\) is a valid model.

3.6. Model Fitting

We can fit the data with the acquired SARMA model:

We can see that the fitted model can roughly depict the decreasing trend and the seasonality of the series correctly. However, it fails to give an accurate displaying for the data at the beginning and also fails to display the low extremes up to -1. Remember that I fill up some non-recorded values with the mean for known valus for that year, some of the disability of the model may be possibly come from this extra computation and data augmentation. Roughly speaking, for the data that has 0 population arrivial for a specific time, the measurement might have some problems since it only gives the observed data in a colony in one day, the accident might be large in terms of the coming of Adélie penguins coming that day. So for this model and its final effect, I would say, it roughly captures the trait of the data, but due to some reason, including the data set and the way data is acquired, it shows obvious disability.

4. Exploration: Differencing Model

Inspired by project 1 in the past 2018 midterm project, I decide to use the SARIMA model and try differencing term to see whether the trend can be depicted without using a regression model. With the model \(SARIMA(p,1,q)\times(1,1,0)_5\) with the difference of \(d=1\) and \(D=1\), I need to reselect the \((p,q)\) values from AIC table:

MA0 MA1 MA2 MA3 MA4 MA5 MA6 MA7
AR0 276.25 204.46 205.96 206.69 208.62 210.46 185.40 189.34
AR1 240.67 205.86 207.55 208.64 209.91 208.80 189.31 188.26
AR2 234.23 206.91 208.35 194.67 212.39 201.88 186.71 187.27
AR3 230.77 208.42 210.17 208.92 197.63 204.78 186.21 188.17
AR4 228.23 210.16 212.00 202.20 199.04 191.39 188.80 190.63
AR5 220.26 210.94 212.04 199.67 204.47 187.79 190.64 191.06
AR6 217.16 211.11 212.55 198.33 199.82 189.51 191.18 190.84
AR7 215.99 212.25 214.17 216.42 207.26 192.13 193.59 192.69

Suggested by the AIC table, I select \(SARIMA(0,1,6)\times(1,1,0)_5\) to represent my data:

## 
## Call:
## arima(x = log.pop.arrivial, order = c(0, 0, 6), seasonal = list(order = c(1, 
##     1, 0), period = 5))
## 
## Coefficients:
##          ma1     ma2     ma3     ma4      ma5     ma6    sar1
##       0.1064  0.0464  0.1520  0.0654  -0.9008  0.0266  0.1496
## s.e.  0.1147  0.1103  0.1213  0.0922   0.1405  0.0880  0.1123
## 
## sigma^2 estimated as 0.1908:  log likelihood = -82.93,  aic = 181.87

The given model is then: \[\begin{aligned} Y_n = x_n - x_{n-1}, \\ \phi(B)\Phi(B^5)((1-B)(1-B^{5})Y_n−\mu) &= \psi(B)\Psi(B^5)\epsilon_n, \end{aligned}\] where \[\begin{aligned} \mu &= \mathbb{E}[Y_n], \\ \phi(x) &= 1, \\ \psi(x) &= 1 + 0.1064x + 0.0464x^2 + 0.1520x^3 + 0.0654x^4 - 0.9008x^5 + 0.0266x^6, \\ \Phi(x) &= 1 - 0.1496x, \\ \Psi(x) &= 1. \\ \epsilon &\sim \textbf{i.i.d.} N[0, 0.1908]. \end{aligned}\]

And it is necessary to perform again, the diagnostic analysis:

Similarly, to simulate and see the effect of the newly used \(SARIMA(0,1,6)\times(1,1,0)_5\) model, we can together plot the original log-transformed data with two given fitted models:

As we can see in the above plot, comparing the newly added SARIMA model with the previous SARMA error model, the SARIMA model with differencing term can capture better trait at the beginning of the times series, and for the second half of the data, two models differ very little. The problems that discussed above about the \(SARMA(2,5)\times(1,0)_5\) error model still exist on the \(SARIMA(0,1,6)\times(1,1,0)_5\) model. But in the end, my choice would be the \(SARIMA(0,1,6)\times(1,1,0)_5\) model since the slightly better performance.

5. Conclusion

5.1. Modeling Strategy

For this time series data, it is a real-life measurement for Adélie penguins’ living and the implicit showing of the environmental impacts. I carried my modeling job by first mapping the data points to 5 data points representing the average population arrivials in October, November, December, January and February in a year, and expanding the time series into an evenly distributed one. I saw the seasonality of a cycle 5 by plotting the data and an exponential decrease, so I decided to perform a logarithm transformation first and use SARMA model or SARIMA model.

To find a suitable model, I first try regression with SARMA error by using linear regression to depict the trend. The strategy I used to get the fitted SARMA model is to use AIC table and select a model with small AIC values and passes model diagonostics. We can lastly, see that, the selected model can roughly represent the trend, but for many details, it fails to replicate.

Then I also tried to use SARIMA model directly without regression to get the trend. The results show that this model fit better for the beginning data, and roughly the same effect as the regression plus SARMA error model.

5.2. Model Selection

Stated above, I finally select the \(SARIMA(0,1,6)\times(1,1,0)_5\) model for the log-transformed population arrivials to represent the time series.

5.3. Real Thinking

The main goal of this project is not just exploring a data series about Adélie penguins. We can see that the population arrivials for this creatures observed has been decreasing exponentially in the past 20 years, which is a very suprising speed. In recent years, the exponentially decreasing trend has been slower, but the number of population arrivials has reached a small one, which indicates the terrible situation for Adélie penguins. From this project, I would like to alert us human beings about the danger of Antarctic creatures and their living spaces.

6. Reference

[1] Adélie pengiuns basic information: https://en.wikipedia.org/wiki/Ad%C3%A9lie_penguin and https://www.nationalgeographic.com/animals/birds/a/adelie-penguin/.

[2] Introduction to population arrivial: https://oceaninformatics.ucsd.edu/datazoo/catalogs/pallter/datasets/92.

[3] Data source and introduction: https://oceaninformatics.ucsd.edu/datazoo/catalogs/pallter/datasets/92/datatables/92.

[4] Majorly referred past midterm projects in 2018: https://github.com/ionides/531w18/blob/master/midterm_project/project1/midterm_project.Rmd.

[5] Theory basis: STATS531 lecture notes from https://ionides.github.io/531w20/.