1 Pharma Sales Dataset

Pharma sales dataset is built from transactional data collected in six years (2014-2019) indicating the date and time of the sale and pharmaceutical drug brand and quantity.This dataset is from the Kaggle. The drugs are classified into eight groups by Anatomical Therapeutic Chemical (ATC) Classification System. For the purpose of this project, we focus on one specific group of drugs, R06, in which are antihistamines for systemic use. The main concern of this analysis is to find the connection between seasonal change and the difference in antihistamine sales across the year.

2 Exploratory Data Analysis

2.1 Basic Statistics

Firstly, we want to look at the basic statistics of the drug sales.

##      datum                M01AB           M01AE            N02BA      
##  Min.   :2014-01-05   Min.   : 7.67   Min.   : 6.237   Min.   : 3.50  
##  1st Qu.:2015-06-15   1st Qu.:29.39   1st Qu.:22.387   1st Qu.:21.30  
##  Median :2016-11-23   Median :34.56   Median :26.790   Median :26.50  
##  Mean   :2016-11-23   Mean   :35.10   Mean   :27.168   Mean   :27.06  
##  3rd Qu.:2018-05-04   3rd Qu.:40.17   3rd Qu.:31.047   3rd Qu.:32.48  
##  Max.   :2019-10-13   Max.   :65.33   Max.   :53.571   Max.   :60.12  
##      N02BE             N05B             N05C             R03        
##  Min.   : 86.25   Min.   : 18.00   Min.   : 0.000   Min.   :  2.00  
##  1st Qu.:149.30   1st Qu.: 47.00   1st Qu.: 2.000   1st Qu.: 21.00  
##  Median :198.30   Median : 57.00   Median : 3.979   Median : 35.00  
##  Mean   :208.63   Mean   : 61.74   Mean   : 4.139   Mean   : 38.44  
##  3rd Qu.:252.47   3rd Qu.: 71.00   3rd Qu.: 6.000   3rd Qu.: 51.00  
##  Max.   :546.90   Max.   :154.00   Max.   :17.000   Max.   :131.00  
##       R06       
##  Min.   : 1.00  
##  1st Qu.:11.47  
##  Median :17.50  
##  Mean   :20.22  
##  3rd Qu.:26.00  
##  Max.   :65.00

From the following boxplots, we find that all of the drug sales have outliers.

From the following pairwise scatterplot, we find that there is no sign of strong linear relationship between R06 and other variables

According to the coefficient of time-related attribute in the linear model, the mean is barely increasing in time, although we cannot trust the model statistics because of the violation of assumptions to the error term based on the residual acf plot. From the time plot of R06 sales against time, we find stationary model may not be appropriate for the data.

## 
## Call:
## lm(formula = R06 ~ weekly$time + I(weekly$time^2), data = weekly)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.259  -8.035  -2.175   5.942  42.507 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)       3.030e+05  1.040e+06   0.291    0.771
## weekly$time      -3.020e+02  1.031e+03  -0.293    0.770
## I(weekly$time^2)  7.526e-02  2.556e-01   0.294    0.769
## 
## Residual standard error: 11.1 on 299 degrees of freedom
## Multiple R-squared:  0.05567,    Adjusted R-squared:  0.04935 
## F-statistic: 8.812 on 2 and 299 DF,  p-value: 0.0001911

2.2 Stationarility and Seasonality

From the sample autocorrelation plot, we find that the sample ACF has a peak at around 52, indicating strong evidence for period around 52, which corroberates with the inital assumption that the yearly period seasonality. The periodic waves in sample ACF shows seasonal autoregressive model should be appropriate for this data.

2.3 Frequency Domain Analysis

We can investigate the drug sale data from the frequency domain to get insights from another perspective. We can examine the periodogram of the weekly sales data and decompose the R06 drug sale into trend, noise and cycle components.

The fluctuations in antihistamine sales appears to be cyclical. As antihistamines are frequently used to treat allergies, which can often be triggered by changes in the season [https://acaai.org/allergies/seasonal-allergies]. Let’s take a look at this data in the frequency domain to see if our hypothesis of a yearly antihistamine sale cycle is supported:

There does appear to be a peak at approximately 1 cycle per year, but let’s see if we can make the cyclical pattern even clearer by using a loess bandpass filter. We’ll try to separate out the low frequency trend and high frequency noise to extract the cyclical behavior:

a_low <- ts(loess(anhist~as.numeric(weekly$datum), span=0.5)$fitted, start=2014, frequency=52)
a_high <- ts(anhist - loess(anhist~as.numeric(weekly$datum), span=0.1)$fitted, start=2014, frequency=52)
a_cycles <- anhist - a_high - a_low
plot(ts.union(anhist, a_low, a_high, a_cycles))

##Citation: From Chapter 8 notes

The mid-range frequencies look like a fairly regular, one year cycle, which is promising. Let’s plot a periodogram and extract the peak frequency of this cycle to see if that is borne out by the data:

cycle_spec <- spectrum(a_cycles, main="Spectrum of Antihistamine Cycles", ylab="spectrum", xlab="Cycles per year")
abline(v=c(1, 2), col=c("red", "blue"), lty=c(2,2))

The above plot shows the periodogram of the mid-range frequency cyclic behavior, with the red line representing our expected peak frequency of 1 cycle per year, and the blue line representing the first harmonic of that expected frequency: 2 cycles per year.

Let’s calculate the actual frequency of that first peak and the corresponding period:

peak_freq <- cycle_spec$freq[which.max(cycle_spec$spec)]
peak_freq

## [1] 0.975

period <- 1/peak_freq
period

## [1] 1.025641

The periodogram does indeed return a peak frequency of approximately 0.975, which corresponds to a period of approximately 1.03 years. This does indeed support the idea that antihistamine sales experience an annual cycle.

3 Fitting an ARMA model

To establish a baseline model for comparison later, we start with an ARIMA model without considering the seasonal effects. The \(Y_{1:N}\) here be the sales of drug R06 on each week from 2014 to 2019. According to the lecture note 5 equation[5], we will fit a stationary Gaussian ARMA(p,q) model with parameter vector \(\theta = (\phi_{1:p}, \psi_{1:q}, \mu, \sigma^2)\) given by

\[\phi(B)(Y_N - \mu) = \psi(B)\epsilon_{N}\]

where

\[E[Y_N] = \mu \\ \phi(x) = 1 - \phi_1 x - ... - \phi_p x^p\\ \psi(x) = 1 - \psi_1 x - ... - \psi_q x^q\\ \epsilon_N \sim iid N(0, \sigma^2)\]

3.1 Model Selection for ARIMA Model

We choose to use an AIC table as our first step in the model selection process. According to the AIC table below, ARMA(1, 1) has the best AIC value 2047.86. We also spot an evidence for mathematical instability at ARMA(2, 2) and ARMA(3, 2) where the AIC value increased more than 2.

	MA0	MA1	MA2	MA3
AR0	2328.96	2193.73	2140.96	2105.58
AR1	2071.40	2047.86	2048.47	2047.95
AR2	2049.29	2049.00	2039.43	2049.95
AR3	2047.89	2049.73	2049.48	2048.10

3.2 Model Diagnostics for ARIMA Model

We want to assess the goodness of fit of the ARMA(1, 1) model by looking at its residuals. Based on the residual ACF, there is no strong evidence for autocorrelation between the residual values at different lags. We superimposed the simulated data according to our ARMA(1, 1) model on top of the original R06 sales data. As we expected, the ARMA model does not capture the seasonal effects and fits poorly at the beginning and the ending part.

4 Fitting SARIMA Model

As we found from our exploratory data analysis, the time series data has a seasonal component with yearly period of 52 data points. Now we want to add the seasonal component into our model and see how much it improves the fit. As the data we used is a weekly dataset, according to the lecturen notes6 equation[S1] the general \(SARMA(p,q) \times (P, Q)_{52}\) model for weekly data is

\[\phi(B) \Phi(B^{52})(Y_N - \mu) = \psi(B) \Psi(B^{52}) \epsilon_N\]

where

\[E[Y_N] = \mu \\ \phi(x) = 1 - \phi_1 x - ... - \phi_p x^p\\ \psi(x) = 1 - \psi_1 x - ... - \psi_q x^q\\ \Phi(x) = 1 - \Phi_1 x - ... - \Phi_P x^P\\ \Psi(x) = 1 - \Psi_1 x - ... - \Psi_Q x^Q \\ \epsilon_N \sim iid N(0, \sigma^2)\]

4.1 Model Selection for SARIMA Model

	MA0	MA1	MA2
AR0	2211.98	2143.27	2112.87
AR1	2070.40	2048.01	2048.71
AR2	2049.42	2049.19	2041.36

According to the AIC table above, the SARIMA(1,1) will have the best AIC value 2048.0090021. Although the AIC of SARIMA(2,2) is lower, there is an evidence for mathematical instability at SARMA(2, 2) and SARMA(1, 2) where the AIC value increased more than 2.

4.2 Model Diagnostics for SARIMA Model

set.seed(2021)
sarma11 <- arima(weekly$R06, order=c(1, 0, 1), seasonal=list(order=c(1, 0, 0), period=52))
acf(sarma11$residuals, main='SARMA(1, 1) ACF')

t1 <- simulate(sarma11, nsim=302)
plot(R06~datum, data=weekly, type='l', xlab='date', main='weekly R06 sales')
lines(x=weekly$datum, y=t1, col='red')

Again, in order to check the goodness of fit of the SARMA(1, 1) model, we will look at its residuals. Based on the residual ACF, there is no strong evidence for autocorrelation between the residual values at different lags. We overlay the simulated data(red line) according to our SARMA(1, 1) model on the original R06 sales data(black line). As we expected, the SARMA model does capture the seasonal effects, and it fits better than the ARMA(1,1) showed above.

5 Conclusion

Overall, we conclude that the SARMA(1,1) model performs best for our dataset, and captures some weekly seasonality. We hypothesize that the weekly seasonality is likely due to the correlation between prescriptions and business week periodicity, though there are perhaps still more complex underlying causes as well.

While we investigated linear regression with ARMA errors, we could not include outside covariates due to lack of available background information about our dataset, and thus we found that the regression approach did not outperform the other approaches we’ve outlined in this report.

References and Group Activity:

The following sources to compose this project:

“Seasonal Allergies.” ACAAI Public Website, 29 Oct. 2018, acaai.org/allergies/seasonal-allergies.

Ionides, Edward L. “Chapter 6: Extending the ARMA model: Seasonality, integration and trend.” 2021, ionides.github.io/531w21/06/notes.pdf.

Ionides, Edward L. “Chapter 5: Parameter estimation and model identification for ARMA models.” 2021, ionides.github.io/531w21/05/notes.pdf.

Ionides, Edward L. “Chapter 8: Smoothing in the Time and Frequency Domains.” 2021, ionides.github.io/531w21/08/notes.pdf.

STATS 531 Midterm Project