Introduction

Avocado is known as a popular foods among millennials. It is rich in nutrition and can be found in many dishes. As a huge fan of avocado, I eat avocado toast and avocado salad a lot. However, I have never bought avocados from a market myself, nor do I have any sense of its prices. Some business articles point out that avocado is getting increasingly expensive, growing by up to \(129\%\) in recent years [3]. Empirically speaking, the price of an item is closely related to its market demands annd sales. Therefore, in this project, we are going to analyze the following problems:

To be specific, we focus our problem on the avocado market in California. Our data is measured once a week, from year 2015 to 2018 [1].

Exploratory analysis

There are two types of avocado on the market, namely conventional and organic. Typically, organic avocado is more expensive, but the difference between the two types may vary over time. First, we take a view of the average avocado price per piece from year 2015 to 2018, measured on a weekly basis:

We also compare the total volume of avocado sales of both types. Note that in the following chart the unit of y axis is “thousand pieces”.

The two types of avocado show similar trends in sales, with the conventional type occupies much greater share in the market and contains more fierce fluctuation. For clarification, in the following sections we restrict our discussion to the sales of conventional type avocado.

Trend and periodogram of avocado price

An initial attempt

The ACF plot of conventional avocado average price contains an apparent oscillation pattern, which is an evidence for MA(2) models.

We can also examine its AIC table for information of a proper ARMA model of the avocado price time series:

MA0 MA1 MA2 MA3 MA4 MA5
AR0 -14.70 -154.92 -193.89 -236.00 -237.72 -252.61
AR1 -271.44 -269.45 -267.83 -266.47 -266.08 -267.34
AR2 -269.45 -270.79 -269.92 -263.84 -266.87 -264.62
AR3 -267.78 -270.08 -274.73 -267.19 -264.33 -264.04
AR4 -266.15 -268.54 -273.88 -265.28 -271.83 -262.50
AR5 -265.41 -267.14 -265.20 -263.38 -270.41 -268.57

According to the AIC table, ARMA(3, 2) model provides the smallest AIC value, which is consistent with our first impression from the ACF plot. Accordingly, we try fitting the data with an ARMA(3, 2) model:

## 
## Call:
## arima(x = conv$AveragePrice, order = c(3, 0, 2))
## 
## Coefficients:
##           ar1     ar2     ar3     ma1     ma2  intercept
##       -0.9543  0.7451  0.7529  1.9827  0.9999     1.0954
## s.e.   0.0536  0.0665  0.0573  0.0264  0.0263     0.0653
## 
## sigma^2 estimated as 0.01025:  log likelihood = 144.36,  aic = -274.73

According to R output, our model is estimated as \[ \Phi(B)(Y_n - \mu) = \Psi(B)\epsilon_n \] where \[ \begin{aligned} \Phi(B) &= 1 + 0.95B - 0.75B^2 - 0.75B^3, \\ \Psi(B) &= 1 + 1.98B + 1.00B^2, \\ \mu &= 1.10,\\ \sigma &\sim N(0, 0.01). \end{aligned} \]

To evaluate the fitness of this model, we start by analyzing the ACF plot of residuals.

From the ACF plot, it appears that the residuals are uncorrelated, since almost all lags are within the 5% level of chance variance (except at lag = 15). However, there seems to be a damped periodic pattern in the lags, which implies that there might be a trend in the original data. In addition, as we draw the scatter plot of residuals, we could infer that there is likelihood that the variance of residual is slightly increasing with time.

We want to perform diagonostic analysis on the model to better evaluate its adequecy on our dataset. Check the roots of \(\Phi(X)\):

print(abs(polyroot(c(1,-coef(arma32)[c("ar1", "ar2", "ar3")]))))
## [1] 1.141146 1.078885 1.078885

All roots of the AR polynomial are ouotside the unit circle, which means our model is causal. However, checking the roots of \(\Psi(X)\),

print(abs(polyroot(c(1,coef(arma32)[c("ma1", "ma2")]))))
## [1] 1.000053 1.000053

we find the roots are very close to the unit circle, which means that our model is at the threshold of non-invertibility. This is undesirable for our analysis, because non-invertible models give numerically unstable estimates of residuals. Therefore, in the upcoming sections, we want to adjust our model to avoid this situation.

Seasonality and trend

To analyze the trend and seasonality of avacado average price, we plot the spectrum density for the time series. The unit of frequency is cycles per week.

There is a peak in the middle frequency, with the value

## [1] 0.2277778

this stands for a cycle every \(1 / 0.228 = 4.39\) weeks, which is approximately a month. As the power density at this peak is not very high, we can interpret that there is a weak monthly cycle in our time series.

The highest peak corresponds to the frequency value

## [1] 0.01666667

which is \(1 / 0.0167 = 60\) weeks.

To examine the period and trend in the time series, we can decompose the time series as a combination of trend, noise and cycle:

From the decomposition, we could observe a yearly period in which the avacado price rise and fall. On the whole, the average price do increases from 2015 to late 2017, but starts to drop there after. At the same time, the variance of average price gets larger over time, which can be reflected from the high frequency components. In addition, the periodic pattern is getting more apparent. In earlier years, there are only small fluctuations around the mean value. In comparison, in recent years, the pattern of fluctuation grew larger in scale and gradually evolved a clear pattern, which has two ups and downs each year: a smaller one follows by a larger one.

Trend and periodogram of avocado sales volume

We then move to analyze the avocado sales volume time series. An oscillated damping periodic pattern could als be captured in the conventional avocado sales volume time series, which is a clue for MA(2) models:

As we plot the spectrum density for this time series, we could detect a farmiliar pattern:

Recall the spectrum density plot for average price, we see that despite the magnitude of power, the pattern of the two spectrums are amazingly similar:

In comparison, the spectrum of the total volume series is smoother. A major difference between the two series, however, is the frequency at which the spectrum achieves the maximum power. For the total volume series, the peak is attained at requency

spec2$freq[4]
## [1] 0.02222222

which indicates a cycle of \(1/0.0222 = 45\) weeks.

Since the pattern of the two spectrum density plots are largely similar to each other, we want to examine the correlation between the two time series. According to Law of demand [4][5] in economics, we know that price change affects the demand of a product. Therefore, we fit a regression between the two variables with an ARMA model.

Regression

It is a common practice to apply log transform to time series in economic problems [6], therefore we apply it to our dataset before applying to a regression on ARIMA and examining its AIC table:

MA0 MA1 MA2 MA3 MA4 MA5
AR0 -196.65 -257.39 -282.41 -305.03 -317.75 -337.58
AR1 -349.54 -363.28 -361.82 -360.47 -358.53 -356.86
AR2 -362.33 -361.67 -360.74 -362.49 -360.52 -358.95
AR3 -362.21 -363.98 -362.43 -360.55 -358.50 -356.96
AR4 -360.48 -362.41 -360.42 -358.84 -357.95 -355.63
AR5 -358.54 -360.48 -369.31 -374.04 -372.25 -371.19

The AIC table suggests that a ARMA(3,1) model has the lowest AIC value, which is slightly lower than the AIC value for ARMA(1,1). However, a simpler model is prefered. Therefore, we select an ARMA(1,1) model for our data, and plot the ACF for residuals:

It is noticeable that lag=53 has a significantly large value,

acf1$acf[53]
## [1] 0.2699861

which indicates that 53-1=52 weeks is a cycle. This is approximately one year, which is of actual meaning. Taking this into account, we can modify our model to account for the seasonal pattern by introducing a seasonal operator:

## 
## Call:
## arima(x = log(conv$Total.Volume), order = c(1, 0, 1), seasonal = list(order = c(1, 
##     0, 0), period = 52), xreg = log(conv$AveragePrice))
## 
## Coefficients:
##          ar1      ma1    sar1  intercept  log(conv$AveragePrice)
##       0.9483  -0.4440  0.3899    15.6696                 -1.1083
## s.e.  0.0251   0.0841  0.0824     0.0790                  0.0662
## 
## sigma^2 estimated as 0.005421:  log likelihood = 196.05,  aic = -380.09

This give us the following model in specific: \[ \log Y_n = 15.6696 - 1.1083\log z_n + \epsilon_n, \\ \] where \(Y_n\) is the total volume of avocado sales, \(z_n\) is the average price, and \(\epsilon_n\) is assumed to be an ARMA(1,1) model satisfying \[ (1-0.9483B)(1-0.3899B^{52})\epsilon_n = (1-0.444B)\omega_n, \] and \(\{\omega_n\}\) is a Gaussian white noise, \(\omega_n\sim N[0, 0.0054]\). As before, we want to examine the adequacy of this model. The roots of characteristic polynomials are

coeff <- c(1, -0.9483, rep(0, 50), -0.3899, 0.3697)
print(abs(polyroot(coeff)))
##  [1] 1.018279 1.018279 1.018274 1.018215 1.018279 1.018279 1.018279
##  [8] 1.018279 1.018279 1.018278 1.018279 1.018274 1.018278 1.018279
## [15] 1.018279 1.018279 1.018279 1.018279 1.018279 1.018279 1.018279
## [22] 1.018279 1.018279 1.018279 1.018279 1.018279 1.018279 1.018279
## [29] 1.018279 1.054662 1.018279 1.018279 1.018278 1.018281 1.018282
## [36] 1.018279 1.018279 1.018279 1.018278 1.018278 1.018278 1.018278
## [43] 1.018279 1.018278 1.018279 1.018279 1.018279 1.018277 1.018278
## [50] 1.018279 1.018279 1.018275 1.018281
print(abs(polyroot(c(1, -0.444))))
## [1] 2.252252

all roots are safely outside the unit circle, which means that our model for \(\{\epsilon_n\}\) is stationary, causal and invertible.

From the ACF plot, all residual lags are within the range of chance variance, indicating that \(\{\epsilon_n\}\) is uncorrelated. The negative linear relation between the avacado price and sales volume verifies that avacado market demand respond negatively to the variation of its market price, which is consistent with the law of demand as well.

Discussion and conclusion

In this project, we explored the trend and periodity of avocado price and sales volume time series and also analyzed their association. We extracted monthly and annual patterns from the spectrum density analysis, and also determined the trend of price variation from the spetrum decomposition, which is not constantly increasing but shows a decreasing trend in recent years. In addition, we applied a regression on the price and sales volume time series and discovered a negative correlation between the two time series, which is consistent with our empirical knowledge.

However, we could see that after many transformation,even though the error terms are uncorrelated, there still exists an oscillated damping periodic pattern in the residual lags. The reason behind the persistence of this damping characteristic remains to be solved in future.

Reference

  1. Data source: original source on https://hassavocadoboard.com/, directly downloaded from https://www.kaggle.com/neuromusic/avocado-prices, then pre-processed with Python.
  2. https://www.refinery29.com/en-us/2019/07/238137/avocado-prices-rise-demand
  3. Avocado prices have boomed in the last decade — here’s why they are so expensive, https://www.businessinsider.com/why-avocados-are-so-expensive-2019-10
  4. The relationship between demand and price. UKEssays.com. 11 2018. All Answers Ltd. 02 2020. https://www.ukessays.com/essays/economics/the-relationship-between-demand-and-price-economics-essay.php?vref=1.
  5. Law of demand, https://www.investopedia.com/terms/l/lawofdemand.asp.
  6. When to log transform a time series before fitting an ARIMA model. https://stats.stackexchange.com/questions/6330/when-to-log-transform-a-time-series-before-fitting-an-arima-model