Pharma sales dataset is built from transactional data collected in six years (2014-2019) indicating the date and time of the sale and pharmaceutical drug brand and quantity.This dataset is from the Kaggle. The drugs are classified into eight groups by Anatomical Therapeutic Chemical (ATC) Classification System. For the purpose of this project, we focus on one specific group of drugs, R06, in which are antihistamines for systemic use. The main concern of this analysis is to find the connection between seasonal change and the difference in antihistamine sales across the year.
Firstly, we want to look at the basic statistics of the drug sales.
## datum M01AB M01AE N02BA
## Min. :2014-01-05 Min. : 7.67 Min. : 6.237 Min. : 3.50
## 1st Qu.:2015-06-15 1st Qu.:29.39 1st Qu.:22.387 1st Qu.:21.30
## Median :2016-11-23 Median :34.56 Median :26.790 Median :26.50
## Mean :2016-11-23 Mean :35.10 Mean :27.168 Mean :27.06
## 3rd Qu.:2018-05-04 3rd Qu.:40.17 3rd Qu.:31.047 3rd Qu.:32.48
## Max. :2019-10-13 Max. :65.33 Max. :53.571 Max. :60.12
## N02BE N05B N05C R03
## Min. : 86.25 Min. : 18.00 Min. : 0.000 Min. : 2.00
## 1st Qu.:149.30 1st Qu.: 47.00 1st Qu.: 2.000 1st Qu.: 21.00
## Median :198.30 Median : 57.00 Median : 3.979 Median : 35.00
## Mean :208.63 Mean : 61.74 Mean : 4.139 Mean : 38.44
## 3rd Qu.:252.47 3rd Qu.: 71.00 3rd Qu.: 6.000 3rd Qu.: 51.00
## Max. :546.90 Max. :154.00 Max. :17.000 Max. :131.00
## R06
## Min. : 1.00
## 1st Qu.:11.47
## Median :17.50
## Mean :20.22
## 3rd Qu.:26.00
## Max. :65.00
From the following boxplots, we find that all of the drug sales have outliers.
From the following pairwise scatterplot, we find that there is no sign of strong linear relationship between R06 and other variables
According to the coefficient of time-related attribute in the linear model, the mean is barely increasing in time, although we cannot trust the model statistics because of the violation of assumptions to the error term based on the residual acf plot. From the time plot of R06 sales against time, we find stationary model may not be appropriate for the data.
##
## Call:
## lm(formula = R06 ~ weekly$time + I(weekly$time^2), data = weekly)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.259 -8.035 -2.175 5.942 42.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.030e+05 1.040e+06 0.291 0.771
## weekly$time -3.020e+02 1.031e+03 -0.293 0.770
## I(weekly$time^2) 7.526e-02 2.556e-01 0.294 0.769
##
## Residual standard error: 11.1 on 299 degrees of freedom
## Multiple R-squared: 0.05567, Adjusted R-squared: 0.04935
## F-statistic: 8.812 on 2 and 299 DF, p-value: 0.0001911
From the sample autocorrelation plot, we find that the sample ACF has a peak at around 52, indicating strong evidence for period around 52, which corroberates with the inital assumption that the yearly period seasonality. The periodic waves in sample ACF shows seasonal autoregressive model should be appropriate for this data.
We can investigate the drug sale data from the frequency domain to get insights from another perspective. We can examine the periodogram of the weekly sales data and decompose the R06 drug sale into trend, noise and cycle components.
The fluctuations in antihistamine sales appears to be cyclical. As antihistamines are frequently used to treat allergies, which can often be triggered by changes in the season [https://acaai.org/allergies/seasonal-allergies]. Let’s take a look at this data in the frequency domain to see if our hypothesis of a yearly antihistamine sale cycle is supported:
There does appear to be a peak at approximately 1 cycle per year, but let’s see if we can make the cyclical pattern even clearer by using a loess bandpass filter. We’ll try to separate out the low frequency trend and high frequency noise to extract the cyclical behavior:
a_low <- ts(loess(anhist~as.numeric(weekly$datum), span=0.5)$fitted, start=2014, frequency=52)
a_high <- ts(anhist - loess(anhist~as.numeric(weekly$datum), span=0.1)$fitted, start=2014, frequency=52)
a_cycles <- anhist - a_high - a_low
plot(ts.union(anhist, a_low, a_high, a_cycles))
##Citation: From Chapter 8 notes
The mid-range frequencies look like a fairly regular, one year cycle, which is promising. Let’s plot a periodogram and extract the peak frequency of this cycle to see if that is borne out by the data:
cycle_spec <- spectrum(a_cycles, main="Spectrum of Antihistamine Cycles", ylab="spectrum", xlab="Cycles per year")
abline(v=c(1, 2), col=c("red", "blue"), lty=c(2,2))
The above plot shows the periodogram of the mid-range frequency cyclic behavior, with the red line representing our expected peak frequency of 1 cycle per year, and the blue line representing the first harmonic of that expected frequency: 2 cycles per year.
Let’s calculate the actual frequency of that first peak and the corresponding period:
peak_freq <- cycle_spec$freq[which.max(cycle_spec$spec)]
peak_freq
## [1] 0.975
period <- 1/peak_freq
period
## [1] 1.025641
The periodogram does indeed return a peak frequency of approximately 0.975, which corresponds to a period of approximately 1.03 years. This does indeed support the idea that antihistamine sales experience an annual cycle.
To establish a baseline model for comparison later, we start with an ARIMA model without considering the seasonal effects. The \(Y_{1:N}\) here be the sales of drug R06 on each week from 2014 to 2019. According to the lecture note 5 equation[5], we will fit a stationary Gaussian ARMA(p,q) model with parameter vector \(\theta = (\phi_{1:p}, \psi_{1:q}, \mu, \sigma^2)\) given by
\[\phi(B)(Y_N - \mu) = \psi(B)\epsilon_{N}\]
where
\[E[Y_N] = \mu \\ \phi(x) = 1 - \phi_1 x - ... - \phi_p x^p\\ \psi(x) = 1 - \psi_1 x - ... - \psi_q x^q\\ \epsilon_N \sim iid N(0, \sigma^2)\]
We choose to use an AIC table as our first step in the model selection process. According to the AIC table below, ARMA(1, 1) has the best AIC value 2047.86. We also spot an evidence for mathematical instability at ARMA(2, 2) and ARMA(3, 2) where the AIC value increased more than 2.
MA0 | MA1 | MA2 | MA3 | |
---|---|---|---|---|
AR0 | 2328.96 | 2193.73 | 2140.96 | 2105.58 |
AR1 | 2071.40 | 2047.86 | 2048.47 | 2047.95 |
AR2 | 2049.29 | 2049.00 | 2039.43 | 2049.95 |
AR3 | 2047.89 | 2049.73 | 2049.48 | 2048.10 |
We want to assess the goodness of fit of the ARMA(1, 1) model by looking at its residuals. Based on the residual ACF, there is no strong evidence for autocorrelation between the residual values at different lags. We superimposed the simulated data according to our ARMA(1, 1) model on top of the original R06 sales data. As we expected, the ARMA model does not capture the seasonal effects and fits poorly at the beginning and the ending part.
As we found from our exploratory data analysis, the time series data has a seasonal component with yearly period of 52 data points. Now we want to add the seasonal component into our model and see how much it improves the fit. As the data we used is a weekly dataset, according to the lecturen notes6 equation[S1] the general \(SARMA(p,q) \times (P, Q)_{52}\) model for weekly data is
\[\phi(B) \Phi(B^{52})(Y_N - \mu) = \psi(B) \Psi(B^{52}) \epsilon_N\]
where
\[E[Y_N] = \mu \\ \phi(x) = 1 - \phi_1 x - ... - \phi_p x^p\\ \psi(x) = 1 - \psi_1 x - ... - \psi_q x^q\\ \Phi(x) = 1 - \Phi_1 x - ... - \Phi_P x^P\\ \Psi(x) = 1 - \Psi_1 x - ... - \Psi_Q x^Q \\ \epsilon_N \sim iid N(0, \sigma^2)\]
MA0 | MA1 | MA2 | |
---|---|---|---|
AR0 | 2211.98 | 2143.27 | 2112.87 |
AR1 | 2070.40 | 2048.01 | 2048.71 |
AR2 | 2049.42 | 2049.19 | 2041.36 |
According to the AIC table above, the SARIMA(1,1) will have the best AIC value 2048.0090021. Although the AIC of SARIMA(2,2) is lower, there is an evidence for mathematical instability at SARMA(2, 2) and SARMA(1, 2) where the AIC value increased more than 2.
set.seed(2021)
sarma11 <- arima(weekly$R06, order=c(1, 0, 1), seasonal=list(order=c(1, 0, 0), period=52))
acf(sarma11$residuals, main='SARMA(1, 1) ACF')
t1 <- simulate(sarma11, nsim=302)
plot(R06~datum, data=weekly, type='l', xlab='date', main='weekly R06 sales')
lines(x=weekly$datum, y=t1, col='red')
Again, in order to check the goodness of fit of the SARMA(1, 1) model, we will look at its residuals. Based on the residual ACF, there is no strong evidence for autocorrelation between the residual values at different lags. We overlay the simulated data(red line) according to our SARMA(1, 1) model on the original R06 sales data(black line). As we expected, the SARMA model does capture the seasonal effects, and it fits better than the ARMA(1,1) showed above.
Overall, we conclude that the SARMA(1,1) model performs best for our dataset, and captures some weekly seasonality. We hypothesize that the weekly seasonality is likely due to the correlation between prescriptions and business week periodicity, though there are perhaps still more complex underlying causes as well.
While we investigated linear regression with ARMA errors, we could not include outside covariates due to lack of available background information about our dataset, and thus we found that the regression approach did not outperform the other approaches we’ve outlined in this report.
The following sources to compose this project:
“Seasonal Allergies.” ACAAI Public Website, 29 Oct. 2018, acaai.org/allergies/seasonal-allergies.
Ionides, Edward L. “Chapter 6: Extending the ARMA model: Seasonality, integration and trend.” 2021, ionides.github.io/531w21/06/notes.pdf.
Ionides, Edward L. “Chapter 5: Parameter estimation and model identification for ARMA models.” 2021, ionides.github.io/531w21/05/notes.pdf.
Ionides, Edward L. “Chapter 8: Smoothing in the Time and Frequency Domains.” 2021, ionides.github.io/531w21/08/notes.pdf.