Introduction

The story of GameStop is arguably one of the most astonishing episodes in stock market history. GameStop, an American brick-and-mortar video game retailer, was widely considered an outdated business in a sunset industry. Facing competition from online retailers such as Amazon and Best Buy, as well as the shift toward fully digital distribution of video games, GameStop's stock price had fallen to around 3 dollars a year ago. However, such bearish views were not shared by a group of retail investors who gathered on the popular social media platform Reddit, in a forum called WallStreetBets. Holding a bullish view on GameStop shares, they began buying the stock aggressively, squeezing out short sellers and driving an exponential increase in GameStop's price.

The story of GameStop is certainly not an isolated one. Other stocks discussed on the Reddit forum include AMC, Koss, BlackBerry, and Bed Bath & Beyond. Although many early-informed investors made a fortune investing in GameStop and other WallStreetBets-favored stocks, many other investors and hedge funds lost a great deal of money due to the dramatic price swings. In this project, we use stochastic volatility models to analyze the GameStop stock, aiming to understand the variance of its returns, quantify investment risk, and provide guidance for future investments in such stocks.

The dataset comes from Yahoo Finance and consists of the date, open price, high price, low price, close price, adjusted close price, and traded volume for GameStop stock from 2020-04-08 to 2021-04-07. We mainly focus on the daily adjusted close price of this year-long series [1].

The primary goal of this research is to analyze the volatility of the GameStop stock price by fitting several time series models (ARMA, GARCH, and POMP) and assessing how well each fits the data.

Exploratory Data Analysis
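The data can be loaded and summarized as follows. This is a minimal sketch: the file name GME.csv is our own placeholder for the Yahoo Finance download.

# Load the Yahoo Finance export (sketch; "GME.csv" is a placeholder name)
dat <- read.csv("GME.csv")
dat$Date <- as.Date(dat$Date)
head(dat, 3)        # first rows, printed below
summary(dat[, -1])  # numeric summaries, printed below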

##         Date Open High  Low Close Adj.Close  Volume
## 1 2020-04-08 3.23 3.67 3.20  3.41      3.41 2884500
## 2 2020-04-09 3.60 4.25 3.49  3.89      3.89 5908600
## 3 2020-04-13 4.25 4.76 4.16  4.74      4.74 6844500
##       Open              High              Low              Close       
##  Min.   :  3.230   Min.   :  3.670   Min.   :  3.200   Min.   :  3.41  
##  1st Qu.:  4.655   1st Qu.:  4.845   1st Qu.:  4.505   1st Qu.:  4.72  
##  Median : 10.250   Median : 10.860   Median :  9.920   Median : 10.35  
##  Mean   : 38.835   Mean   : 44.074   Mean   : 33.262   Mean   : 37.73  
##  3rd Qu.: 19.685   3rd Qu.: 20.525   3rd Qu.: 18.930   3rd Qu.: 19.70  
##  Max.   :379.710   Max.   :483.000   Max.   :262.270   Max.   :347.51  
##    Adj.Close          Volume         
##  Min.   :  3.41   Min.   :  1330100  
##  1st Qu.:  4.72   1st Qu.:  3369600  
##  Median : 10.35   Median :  6764300  
##  Mean   : 37.73   Mean   : 16875707  
##  3rd Qu.: 19.70   3rd Qu.: 14663100  
##  Max.   :347.51   Max.   :197157900

As the summary shows, the adjusted close price spans a huge range, from 3.41 to 347.51 dollars, roughly a hundredfold increase. The daily trading volume likewise ranges from about 1.3 million to 197 million shares.

ggplot(dat, aes(x=Date, y=Adj.Close)) +
  geom_line() +
  ggtitle("GameStop Adjusted Stock Price since April 2020") +
  xlab("Date") +
  ylab("Adjusted Close Price")

The plot of the adjusted stock price over the past year shows the fluctuation of the closing prices clearly. Until January 2021, the stock price stays low. Beginning around mid-January, it surges to over 300 dollars, after which it quickly falls back below 50 dollars. Then, around March 2021, the price rises again.

Next, we explore the log returns of the GameStop stock. The log return is defined as \[R_n=\log(y_n)-\log(y_{n-1})\]

We further subtract the mean of the log returns to obtain a mean-stationary series, which is visualized below.
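A minimal sketch of this transformation, using the variable names (log_diff, demean_price) that appear in the code later in this report:

# Log returns of the adjusted close price, then demeaned
log_diff <- diff(log(dat$Adj.Close))
demean_price <- log_diff - mean(log_diff)
plot(demean_price, type = "l",
     xlab = "Time index", ylab = "Demeaned log return")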

Fitting an ARMA(p,q) Model

Now we can fit a stationary Gaussian ARMA(p,q) model \(\phi(B)\left(Y_{n}-\mu\right)=\psi(B) \epsilon_{n}\) with parameters \(\theta=\left(\phi_{1: p}, \psi_{1: q}, \mu, \sigma^{2}\right)\), where

\[ \begin{aligned} \mu &=\mathbb{E}\left[Y_{n}\right] \\ \phi(x) &=1-\phi_{1} x-\cdots-\phi_{p} x^{p} \\ \psi(x) &=1+\psi_{1} x+\cdots+\psi_{q} x^{q} \\ \epsilon_{n} & \sim N\left[0, \sigma^{2}\right] \end{aligned} \]

Here \(\phi(x)=1-\phi_{1} x-\cdots-\phi_{p} x^{p}\) is the autoregressive polynomial and \(\psi(x)=1+\psi_{1} x+\cdots+\psi_{q} x^{q}\) is the moving average polynomial [2].

ARMA(p,q) model selection
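The AIC table below can be produced with the usual tabulation over a grid of orders, as in the course notes [2]. This is a sketch: the helper name aic_table is our own, and some (p, q) combinations may need try() if the optimizer fails to converge.

# Tabulate AIC over a grid of ARMA(p,q) orders (sketch)
aic_table <- function(data, P, Q) {
  table <- matrix(NA, P + 1, Q + 1)
  for (p in 0:P) {
    for (q in 0:Q) {
      table[p + 1, q + 1] <- arima(data, order = c(p, 0, q))$aic
    }
  }
  dimnames(table) <- list(paste0("AR", 0:P), paste0("MA", 0:Q))
  table
}
gme_aic_table <- aic_table(demean_price, 4, 5)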

        MA0     MA1     MA2     MA3     MA4     MA5
AR0 -226.81 -224.81 -237.36 -251.00 -258.69 -257.12
AR1 -224.81 -224.31 -239.04 -260.88 -259.05 -257.06
AR2 -228.30 -228.66 -243.82 -259.10 -262.95 -261.35
AR3 -238.86 -255.12 -256.65 -258.05 -261.88 -260.04
AR4 -265.50 -263.74 -262.21 -263.07 -261.29 -264.86

According to the AIC table, we choose ARMA(1, 3) as our baseline model, with AIC score -260.88: it sits in a neighborhood of similarly good AIC scores while keeping the model complexity comparable to the alternatives. The fitted ARMA model for \(R_n\) is \[ R_n = 0.016 - 0.55 R_{n-1} + \epsilon_n + 0.61 \epsilon_{n-1} + 0.21 \epsilon_{n-2} + 0.42 \epsilon_{n-3} \]
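The selected model can be refit directly to read off the coefficients; a sketch, where the object name gme_arma13 is our own:

# Refit the chosen ARMA(1,3) on the demeaned log returns
gme_arma13 <- arima(demean_price, order = c(1, 0, 3))
gme_arma13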

Diagnostics
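The residual plots discussed below can be generated along these lines (a sketch, assuming gme_arma13 from above):

# Residual ACF and normal Q-Q plot for the ARMA(1,3) fit
acf(resid(gme_arma13), main = "ARMA(1,3) residual ACF")
qqnorm(resid(gme_arma13))
qqline(resid(gme_arma13))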

Based on the residual ACF plot, there are significant correlations at several lags. This indicates that our ARMA model has not captured all the signal, and an extension of the current ARMA model, for example including seasonality, might improve the fit. Moreover, the Q-Q plot shows that the residuals have heavier tails than a normal distribution. However, this is only a baseline model for comparison, and we do not expect it to perform especially well; the unsatisfactory result also motivates us to try other models.

Fitting a GARCH(p,q) Model

In the financial world, the GARCH(p,q) model is often used for this kind of data. It takes the following form [3]:

\[Y_n = \epsilon_n\sqrt{V_n},\] where \[V_n=\alpha_0+\sum^p_{j=1}\alpha_jY_{n-j}^2+\sum_{k=1}^q\beta_kV_{n-k}.\]

In our analysis, we first use GARCH(1,1) as the baseline model and then try to improve on it with higher-order GARCH(p,q) models. GARCH(1,1) takes the simple form \[Y_n = \epsilon_n\sqrt{V_n},\] where \[V_n=\alpha_0+\alpha_1Y_{n-1}^2+\beta_1V_{n-1}.\]
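The AIC table below can be computed along the following lines. This is a sketch: the helper garch_aic is our own, and note that in tseries::garch the first element of order is the GARCH part and the second the ARCH part, matching the order = c(4, 2) call later.

library(tseries)
# AIC = 2 * (number of coefficients) - 2 * logLik for a GARCH(p,q) fit
garch_aic <- function(p, q, x = log_diff) {
  fit <- garch(x, order = c(p, q), grad = "numerical", trace = FALSE)
  2 * (p + q + 1) - 2 * as.numeric(logLik(fit))
}
garch_aic(1, 1)  # baseline GARCH(1,1)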

         q = 1   q = 2   q = 3   q = 4
p = 1  -400.89 -369.45 -356.28 -354.64
p = 2  -398.66 -366.11 -355.96 -351.32
p = 3  -396.44 -399.47 -354.22 -354.95
p = 4  -393.48 -417.42 -361.84 -357.63

GARCH(1,1) model

## 
## Call:
## garch(x = log_diff, grad = "numerical", trace = FALSE)
## 
## Coefficient(s):
##        a0         a1         b1  
## 0.0002697  0.1245660  0.8725749
## 'log Lik.' 203.4436 (df=3)

In the GARCH(1,1) model, the log-likelihood value is 203.4436.

Diagnostics
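The diagnostic plots discussed below can be reproduced as follows (a sketch; the object name fit_garch11 is our own):

# Residual diagnostics for the GARCH(1,1) fit
fit_garch11 <- garch(log_diff, grad = "numerical", trace = FALSE)
res11 <- na.omit(residuals(fit_garch11))  # the first residual is NA
acf(res11, main = "GARCH(1,1) residual ACF")
qqnorm(res11)
qqline(res11)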

As we can see from the ACF plot, the residual autocorrelations lie well within the confidence band, indicating that the residuals are essentially uncorrelated at all lags. The Q-Q plot, however, shows that the residuals have heavy tails compared to the normal distribution; this deviation from normality may undermine the fit of the model.

GARCH(4,2) model

## 
## Call:
## garch(x = log_diff, order = c(4, 2), grad = "numerical", trace = FALSE)
## 
## Coefficient(s):
##        a0         a1         a2         b1         b2         b3         b4  
## 5.253e-04  1.165e-01  2.015e-01  1.344e-01  1.576e-11  2.306e-01  3.219e-01
## 'log Lik.' 228.4462 (df=7)

In the GARCH(4,2) model, the log-likelihood value is 228.4462, larger than the 203.4436 of the GARCH(1,1) model. Even after accounting for the additional parameters in the GARCH(4,2) model, the AIC criterion still favors it (-442.88 versus -400.89 for GARCH(1,1)).

Diagnostics

As we can see from the ACF plot, the residual autocorrelations lie within the confidence band except at lags 2 and 6, so some mild autocorrelation may remain, although it is not pronounced. The Q-Q plot again shows that the residuals have heavy tails compared to the normal distribution, which may undermine the fit of the model. This motivates us to try a different class of models.

The POMP Model

Fixed Leverage Model

A frequently observed phenomenon in financial markets is that negative shocks to a stock market index are associated with a subsequent increase in volatility. We can formally define the leverage \(R_n\) on day \(n\) as the correlation between the index return on day \(n-1\) and the increase in log volatility from day \(n-1\) to day \(n\) [3].

We adopt the pomp implementation of Bretó (2014), which models \(R_n\) as a random walk on a transformed scale, \(R_n = \frac{e^{2G_n}-1}{e^{2G_n}+1}\), where \(G_n\) is a Gaussian random walk.

When the Gaussian random walk has standard deviation zero, the model reduces to the special case of a fixed leverage model. The POMP framework allows the model parameters to vary over time as latent random processes. Following Bretó (2014), the model is constructed as follows:

\[Y_n=e^{\frac{H_n}{2}}\epsilon_n\] \[H_n = \mu_h(1-\phi)+\phi H_{n-1}+\beta_{n-1}R_ne^{\frac{-H_{n-1}}{2}}+\omega_n\] \[G_n = G_{n-1}+\nu_n\] \[\beta_n = Y_n\sigma_\eta\sqrt{1-\phi^2}\] \[\epsilon_n \sim i.i.d N(0,1)\] \[\nu_n \sim i.i.d N(0,\sigma^2_\nu)\] \[\omega_n \sim N(0, \sigma^2_{\omega,n})\] where \(\sigma^2_{\omega, n} = \sigma^2_\eta(1-\phi^2)(1-R_n^2)\).

Here \(H_n\) is the log-volatility, \(X_n=(G_n, H_n, Y_n)\) is the state variable, and \(Y_n\) is the measurement variable, observed perfectly as a component of \(X_n\).

# Building a POMP model
# reference: https://ionides.github.io/531w21/16/slides-annotated.pdf

GME_statenames <- c("H", "G", "Y_state")
GME_rp_names <- c("sigma_nu", "mu_h", "phi", "sigma_eta")
GME_ivp_names <- c("G_0", "H_0")
GME_paramnames <- c(GME_rp_names, GME_ivp_names)

# One step of the latent process: update G (random walk) and H (log-volatility)
rproc1 <- "
  double beta, omega, nu;
  omega = rnorm(0, sigma_eta * sqrt(1 - phi*phi) *
                   sqrt(1 - tanh(G)*tanh(G)));
  nu = rnorm(0, sigma_nu);
  G += nu;
  beta = Y_state * sigma_eta * sqrt(1 - phi*phi);
  H = mu_h*(1 - phi) + phi*H + beta * tanh(G) * exp(-H/2) + omega;
"
# For simulation, draw a new return; for filtering, plug in the observed return
rproc2.sim <- "
  Y_state = rnorm(0, exp(H/2));
"
rproc2.filt <- "
  Y_state = covaryt;
"
GME_rproc.sim <- paste(rproc1, rproc2.sim)
GME_rproc.filt <- paste(rproc1, rproc2.filt)

GME_rinit <- "
  G = G_0;
  H = H_0;
  Y_state = rnorm(0, exp(H/2));
"

# The return is observed perfectly
GME_rmeasure <- "
  y = Y_state;
"
GME_dmeasure <- "
  lik = dnorm(y, 0, exp(H/2), give_log);
"

# Transform parameters to an unconstrained scale for optimization
GME_partrans <- parameter_trans(
  log = c("sigma_eta", "sigma_nu"),
  logit = "phi"
)
# Simulate with arbitrary parameters
# reference: https://ionides.github.io/531w21/16/slides-annotated.pdf

GME.filt <- pomp(
  data = data.frame(y = demean_price, time = 1:length(demean_price)),
  statenames = GME_statenames,
  paramnames = GME_paramnames,
  times = "time",
  t0 = 0,
  covar = covariate_table(
    time = 0:length(demean_price),
    covaryt = c(0, demean_price),
    times = "time"),
  rmeasure = Csnippet(GME_rmeasure),
  dmeasure = Csnippet(GME_dmeasure),
  rprocess = discrete_time(step.fun = Csnippet(GME_rproc.filt), delta.t = 1),
  rinit = Csnippet(GME_rinit),
  partrans = GME_partrans
)
# reference: https://ionides.github.io/531w21/16/slides-annotated.pdf

params_test <- c(
  sigma_nu = exp(-6.5),
  mu_h = -5,
  phi = expit(1.5),
  sigma_eta = exp(0.7),
  G_0 = 0,
  H_0 = 0
)

sim1.sim <- pomp(GME.filt,
  statenames = GME_statenames,
  paramnames = GME_paramnames,
  rprocess = discrete_time(step.fun = Csnippet(GME_rproc.sim), delta.t = 1)
)
sim1.sim <- simulate(sim1.sim, seed = 1, params = params_test)
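The simulated and observed return series can be overlaid as follows (a sketch):

# Compare simulated returns with the observed demeaned returns
plot(as.numeric(obs(sim1.sim)), type = "l", col = "red",
     xlab = "Time index", ylab = "Demeaned log return")
lines(demean_price, col = "black")
legend("topright", c("simulated", "observed"),
       col = c("red", "black"), lty = 1)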

The plot of simulated returns versus observed returns is shown above. In general, the simulated data deviate from the volatility of the observed data, especially at earlier times, where the simulated series shows extremely high volatility while the observed series is relatively calm.

# reference: https://ionides.github.io/531w21/16/slides-annotated.pdf

sim1.filt <- pomp(sim1.sim,
  covar = covariate_table(
    time = c(timezero(sim1.sim), time(sim1.sim)),
    covaryt = c(obs(sim1.sim), NA),
    times = "time"),
  statenames = GME_statenames,
  paramnames = GME_paramnames,
  rprocess = discrete_time(step.fun = Csnippet(GME_rproc.filt), delta.t = 1)
)

Filtering on simulated data

# reference: https://ionides.github.io/531w21/16/slides-annotated.pdf

# Algorithmic settings scale with run_level (3 = full run)
run_level <- 3
GME_Np <- switch(run_level, 100, 1e3, 2e3)          # number of particles
GME_Nmif <- switch(run_level, 10, 100, 200)         # mif2 iterations
GME_Nreps_eval <- switch(run_level, 4, 10, 20)      # likelihood evaluations
GME_Nreps_local <- switch(run_level, 10, 20, 20)    # local search replicates
GME_Nreps_global <- switch(run_level, 10, 20, 100)  # global search replicates

registerDoParallel()
registerDoRNG(34118892)

stew(file = sprintf("pf1-%d.rda", run_level), {
  t.pf1 <- system.time(
    pf1 <- foreach(i = 1:GME_Nreps_eval,
      .packages = 'pomp') %dopar% pfilter(sim1.filt, Np = GME_Np))
})
(L.pf1 <- logmeanexp(sapply(pf1, logLik), se = TRUE))
##                        se 
## 171.85743715   0.06558164

Fitting the stochastic leverage model

# reference: https://ionides.github.io/531w21/16/slides-annotated.pdf

GME_rw.sd_rp <- 0.02   # random-walk sd for regular parameters
GME_rw.sd_ivp <- 0.1   # random-walk sd for initial-value parameters
GME_cooling.fraction.50 <- 0.5
GME_rw.sd <- rw.sd(
  sigma_nu = GME_rw.sd_rp,
  mu_h = GME_rw.sd_rp,
  phi = GME_rw.sd_rp,
  sigma_eta = GME_rw.sd_rp,
  G_0 = ivp(GME_rw.sd_ivp),
  H_0 = ivp(GME_rw.sd_ivp)
)

stew(file=sprintf("mif1-%d.rda",run_level),{
t.if1 <- system.time({
if1 <- foreach(i=1:GME_Nreps_local,
.packages='pomp', .combine=c) %dopar% mif2(GME.filt,
params=params_test,
Np=GME_Np,
Nmif=GME_Nmif,
cooling.fraction.50=GME_cooling.fraction.50,
rw.sd = GME_rw.sd)
L.if1 <- foreach(i=1:GME_Nreps_local,
.packages='pomp', .combine=rbind) %dopar% logmeanexp(
replicate(GME_Nreps_eval, logLik(pfilter(GME.filt,
params=coef(if1[[i]]),Np=GME_Np))), se=TRUE)
})
})
# reference https://ionides.github.io/531w21/16/slides-annotated.pdf

r.if1 <- data.frame(logLik=L.if1[,1],logLik_se=L.if1[,2],
                    t(sapply(if1,coef)))
if (run_level>1) write.table(r.if1,file="GME_params.csv",
                             append=TRUE,col.names=FALSE,row.names=FALSE)
summary(r.if1$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   238.6   239.3   239.3   239.3   239.4   239.7

From here, we can see that the maximum log-likelihood of the POMP model is 239.7. This is an improvement over the GARCH models (203.44 for GARCH(1,1) and 228.45 for GARCH(4,2)), and since the numbers of parameters are comparable, the POMP model gives the best fit so far.

Diagnostics
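The convergence plots discussed below come from pomp's built-in diagnostics for the mif2 results (a sketch):

plot(if1)  # trace plots of the log-likelihood and parameter estimates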

The convergence diagnostics of the POMP model are plotted above. From the MIF2 convergence plot, the log-likelihood converges quickly, within 150 iterations. \(\phi\) also appears to converge within 150 iterations, \(G_0\) settles into an interval after 150 iterations, and \(\mu_h\) converges quickly as well. However, \(H_0\) and \(\sigma_\nu\) do not appear to converge.

To address this non-convergence and search for a global maximum, we next use randomized starting values drawn from a large box of parameter values.

Likelihood maximization using randomized starting values

# reference: https://ionides.github.io/531w21/16/slides-annotated.pdf

GME_box <- rbind(
  sigma_nu = c(0.005, 0.05),
  mu_h = c(-1, 0),
  phi = c(0.85, 0.99),
  sigma_eta = c(0.25, 1),
  G_0 = c(-3, 3),
  H_0 = c(-2, 2)
)

stew(file=sprintf("box_eval-%d.rda",run_level),{
t.box <- system.time({
if.box <- foreach(i=1:GME_Nreps_global,
.packages='pomp',.combine=c) %dopar% mif2(if1[[1]],
params=apply(GME_box,1,function(x)runif(1,x)))

L.box <- foreach(i=1:GME_Nreps_global,
.packages='pomp',.combine=rbind) %dopar% {
logmeanexp(replicate(GME_Nreps_eval, logLik(pfilter(
GME.filt,params=coef(if.box[[i]]),Np=GME_Np))),
se=TRUE)}
})
})

r.box <- data.frame(logLik = L.box[, 1], logLik_se = L.box[, 2],
                    t(sapply(if.box, coef)))
if (run_level > 1) write.table(r.box, file = "GME_params.csv",
                               append = TRUE, col.names = FALSE, row.names = FALSE)

summary(r.box$logLik, digits = 5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   229.6   235.2   238.4   237.1   239.4   240.1

Diagnostics
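As before, the diagnostics come from plotting the mif2 results; a pairs plot of the parameter estimates against the log-likelihood is also informative (a sketch):

plot(if.box)  # convergence diagnostics for the global search
pairs(~logLik + sigma_nu + mu_h + phi + sigma_eta,
      data = subset(r.box, logLik > max(logLik) - 20))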

From the diagnostics plots above, we see that:

\(\sigma_\nu\), \(\mu_h\), \(G_0\), and \(H_0\) converge within roughly 50-100 iterations.

The log-likelihood converges to a larger value than in the local search.

\(\phi\) and \(\sigma_\eta\) seem to converge to a certain range, but their convergence is slower than that of the other parameters.

Conclusions

We modeled the GameStop stock log returns with three models: ARMA, GARCH, and POMP. We compared these models by their log-likelihood on the same data to determine which gives the best fit. The ARMA(1, 3) model gives a log-likelihood of 136.44 with an AIC score of -262.88 (5 parameters). The GARCH(1, 1) model gives a log-likelihood of 203.44 with an AIC score of -400.89 (3 parameters). The GARCH(4, 2) model gives a log-likelihood of 228.44 with an AIC score of -442.88 (7 parameters). The POMP model gives a median log-likelihood of 239.30 with an AIC score of -466.6 (6 parameters). Taking both the log-likelihood and the model complexity into consideration, we prefer the POMP model by AIC score. According to the MIF convergence diagnostic plots, we believe the POMP model is a good fit for the data, since the log-likelihood converges quickly. However, some parameters do not converge well, including \(H_0\), which represents the initial log-volatility. This might happen for several reasons. An easy fix might be running more iterations or choosing better starting values; if that does not solve the problem, we would need to learn more about the stock market and refine the model with financial domain knowledge.

References

Contribution

Description of individual contributions removed for anonymity