Social scientists have long drawn comparisons between the spread of information and the spread of disease. Methods from epidemiology have been used to model such dynamics by treating an idea, meme, or news event as the infectious agent; this analogy has only grown in popularity with the explosive growth of social media [1, 2].
In this project, we apply epidemiological methods – specifically, compartmental models – to model information flow during the 2021 GameStop short squeeze, an event in which thousands of Reddit users coordinated to buy shares of the video game retailer GameStop, whose stock had been heavily shorted by major hedge funds. The resulting jump in the stock price caused these hedge funds to incur significant losses and set off a flurry of attention on both social media sites and major news networks [3].
We use the search frequency of the term “gme”, GameStop’s ticker symbol, as a proxy for the spread of news. The data were downloaded from Google Trends [4] and consist of normalized daily measurements over an 88-day period.
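Loading the export might look like the following sketch; the file name and the two-line preamble skip are assumptions about the particular Google Trends download rather than the exact code used here.

```r
# Sketch of loading the Google Trends export; the file name is hypothetical,
# and the two skipped preamble lines reflect the usual Google Trends CSV layout.
gme <- read.csv("gme_trends.csv", skip = 2, header = TRUE)
names(gme) <- c("date", "hits")
gme$hits <- as.numeric(gme$hits)
gme$hits[is.na(gme$hits)] <- 0                         # treat any "<1" entries as 0
gme$date <- as.Date(gme$date)
gme$day  <- as.numeric(gme$date - min(gme$date)) + 1   # day index 1..88
head(gme)
```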
The main objectives of this project are as follows:
Model the information spread of a current event using an interpretable model.
Draw parallels between the spread of information and the spread of disease.
First, we note that the search data are normalized to have a maximum value of \(100\). Understandably, the search frequency of the term “gme” peaks in late January, during the initial short squeeze. Notably, the data strongly resemble the measles dataset studied in class, lending support to the idea that such trends can be modeled using methods from epidemiology.
The initial peak is followed by multiple smaller peaks in the following months. The latter half of the data exhibits strong weekly seasonality, as shown in the plot of the smoothed periodogram below. However, this may simply reflect the fact that many stock exchanges are closed on weekends, rather than anything directly attributable to the short squeeze itself [5].
[1] "period corresponding to maximum spectral density (days): 7.5"
Combined with the evidence from the periodogram, the erratic, quasi-regular trend in the latter half of the data suggests that we may want to choose a model that accounts for both regular and irregular resurgences (or even delays) in popularity.
Given our analyses above, we find it appropriate to model our data with a compartmental model, commonly used in epidemiological studies – specifically, we use a SIRS model to represent the latent process.
We define susceptible in this context to mean individuals who have not yet heard about the short squeeze, infected to mean individuals who have heard about the short squeeze and are actively interested in it, and recovered to refer to individuals who have lost interest in the short squeeze and have no further inclination to search about it. In analogy with the original SIR model, the reporting rate would be the proportion of interested individuals who end up actually searching about the event online.
Consider that the popularity of a trend can resurge long after the initial event; indeed, this behavior seems to be evident in the data. Because of this, we suppose that recovered individuals – represented by the state \(R\) – move back into the initial susceptible state \(S\) after some time.
The number of individuals in each compartment at each time point \(t\) is given by the following equations:
\[\begin{align*} S(t) &= S(0) + N_{RS}(t) - N_{SI}(t) \\ I(t) &= I(0) + N_{SI}(t) - N_{IR}(t) \\ R(t) &= R(0) + N_{IR}(t) - N_{RS}(t) \\ \end{align*}\]
where \(N_{XY}\) is a counting process that represents the number of individuals transitioning from state \(X\) to state \(Y\). The associated compartment transfer rates are given by:
\[\begin{align*} \frac{dN_{SI}(t)}{dt} &= \mu_{SI}(t)S(t) = \beta \frac{I(t)}{N}S(t) \\ \frac{dN_{IR}(t)}{dt} &= \mu_{IR}I(t) \\ \frac{dN_{RS}(t)}{dt} &= \mu_{RS}R(t) \\ \end{align*}\]
We use Euler’s method to approximate the above system. In particular, each counting-process increment is approximated by a binomial random variable with exponential transition probabilities:
\[\begin{align*} \Delta N_{SI}(t) &\sim \mathrm{Bin}(S, 1 - e^{-\beta\frac{I}{N}\Delta t}) \\ \Delta N_{IR}(t) &\sim \mathrm{Bin}(I, 1 - e^{-\mu_{IR}\Delta t}) \\ \Delta N_{RS}(t) &\sim \mathrm{Bin}(R, 1 - e^{-\mu_{RS}\Delta t}) \\ \end{align*}\]
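As a concrete sketch, these Euler steps can be written as a pomp Csnippet; the state and parameter names below (Beta, mu_IR, mu_RS, eta, N, and the accumulator H) mirror the parameter table later in the report, but the exact code used for the analysis may differ.

```r
library(pomp)

# One Euler step of the SIRS process model; H accumulates new infections
# within each reporting interval, for use in the measurement model below.
sirs_step <- Csnippet("
  double dN_SI = rbinom(S, 1 - exp(-Beta * I / N * dt));
  double dN_IR = rbinom(I, 1 - exp(-mu_IR * dt));
  double dN_RS = rbinom(R, 1 - exp(-mu_RS * dt));
  S += dN_RS - dN_SI;
  I += dN_SI - dN_IR;
  R += dN_IR - dN_RS;
  H += dN_SI;
")

# Initial conditions: a fraction eta of the population starts susceptible,
# seeded with a single interested ('infected') individual.
sirs_rinit <- Csnippet("
  S = nearbyint(eta * N);
  I = 1;
  R = nearbyint((1 - eta) * N);
  H = 0;
")
```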
We suppose that the number of searches \(Q\) is drawn from a negative binomial distribution with parameters \(H\), \(\rho\), where \(H\) is the underlying number of new infections per time step and \(\rho\) is the reporting rate. Mathematically:
\[ Q \sim \mathrm{NegBin}(H, \rho) \]
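Reading \(\mathrm{NegBin}(H, \rho)\) as a size-and-probability parameterization (an assumption about the exact specification), the measurement model could be coded as:

```r
# Negative binomial measurement model: Q is the observed (normalized) search
# frequency, H the accumulated new infections, rho the reporting rate.
dmeas <- Csnippet("lik = dnbinom(Q, H, rho, give_log);")
rmeas <- Csnippet("Q = rnbinom(H, rho);")
```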
We simulate 20 trajectories from some manually chosen parameters in order to check the validity of our model.
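A sketch of how the model components could be assembled into a pomp object and simulated is given below; the parameter values in `guess` are illustrative placeholders, not the manually chosen set used for the figure.

```r
library(ggplot2)

gme_sirs <- pomp(
  data = data.frame(day = gme$day, Q = gme$hits),
  times = "day", t0 = 0,
  rprocess = euler(sirs_step, delta.t = 1/6),
  rinit = sirs_rinit,
  rmeasure = rmeas,
  dmeasure = dmeas,
  accumvars = "H",                               # reset H at each observation
  partrans = parameter_trans(log   = c("Beta", "mu_IR", "mu_RS", "N"),
                             logit = c("eta", "rho")),
  statenames = c("S", "I", "R", "H"),
  paramnames = c("Beta", "mu_IR", "mu_RS", "eta", "rho", "N")
)

# Illustrative starting values only.
guess <- c(Beta = 1, mu_IR = 0.5, mu_RS = 0.05, eta = 0.5, rho = 0.2, N = 2000)

sims <- simulate(gme_sirs, params = guess, nsim = 20,
                 format = "data.frame", include.data = TRUE)

ggplot(sims, aes(x = day, y = Q, group = .id, colour = .id == "data")) +
  geom_line() +
  guides(colour = "none") +
  labs(y = "normalized search frequency")
```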
There is great variability in the simulations, but many of the trajectories seem to peak early, then level off around some smaller, nonzero value, much like the real data. Thus, we conclude that our model is at least reasonable, and continue on to conduct a parameter search.
The model parameters we are interested in estimating are the transmission rate \(\beta\), the recovery rate \(\mu_{IR}\), the resurgence rate \(\mu_{RS}\), the initial susceptible fraction \(\eta\), the reporting rate \(\rho\), and the population size \(N\).
Since the search frequency data are normalized to have maximum \(100\), it is difficult to interpret the reporting rate \(\rho\) and the population size \(N\). We will still vary these parameters so as not to overconstrain our model, but due to convergence concerns, we limit the range of the reporting rate \(\rho\) to \([0.1, 0.3]\) and the range of the population size \(N\) to \([1000, 10000]\).
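For the global search, candidate starting points could be drawn uniformly from a box respecting these limits; the ranges assumed below for the remaining parameters are illustrative guesses.

```r
# Design of random starting points for the global search; rho and N use the
# stated limits, while the other ranges are illustrative.
guesses <- runif_design(
  lower = c(Beta = 0.5, mu_IR = 0.1, mu_RS = 0.001, eta = 0.1,
            rho = 0.1, N = 1000),
  upper = c(Beta = 10,  mu_IR = 2,   mu_RS = 0.5,   eta = 0.9,
            rho = 0.3, N = 10000),
  nseq = 100
)
```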
We define an optimal set of parameters as one that maximizes the likelihood of the model. To find it, we use iterated filtering – the IF2 algorithm – to simulate and evaluate the likelihood of many randomly chosen candidate parameter sets. As a benchmark, we compute the likelihood of the parameter set used in the simulation in the previous section:
[1] "log-likelihood: -297.244571096161"
Let’s plot a scatterplot matrix to see if we can garner any insight into our likelihood surface.
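Assuming the filtered search results are collected in a data frame `results` with one row per parameter set (as in the table further below), the scatterplot matrix could be produced with:

```r
# Scatterplot matrix of log-likelihood against the main parameters, keeping
# only parameter sets within 50 log units of the best.
keep <- subset(results, is.finite(loglik) & loglik > max(loglik, na.rm = TRUE) - 50)
pairs(~ loglik + Beta + mu_IR + mu_RS + eta, data = keep, pch = 16, cex = 0.5)
```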
From the parameter search, we can see that our model has converged on estimates of \(\beta\) around \(1\), although the likelihood profile is somewhat flat there. Our model also favors values of the recovery rate \(\mu_{IR}\) around \(0.5\). For \(\mu_{RS}\) and \(\eta\), the parameter search did not seem to converge; the model does not appear to have an overall preference. However, there is a clear relationship between \(\beta\) and \(\eta\): higher values of \(\beta\) are associated with lower values of \(\eta\).
Finally, we obtain our best parameter set:
# A tibble: 1 x 8
Beta mu_IR mu_RS eta rho N loglik loglik.se
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.835 0.497 0.0356 0.674 0.299 1361. -294. 0.0986
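A row like the one above could be pulled out of the search results with something along these lines (again assuming the `results` data frame from the sketch above):

```r
library(dplyr)

# Best parameter set by particle-filter log-likelihood.
best <- results |>
  filter(is.finite(loglik)) |>
  arrange(desc(loglik)) |>
  slice(1)
best
```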
To check model fit, let’s simulate some values from the model using the optimal parameters we just found.
The simulations are still quite inconclusive! In general, the only similarities between the simulations and the actual data are that they tend to peak early on and stay lower after day 50 or so. There is still considerable variability.
Let’s see if the associated profile confidence intervals give us any more information about parameter optimality.
# A tibble: 1 x 2
min max
<dbl> <dbl>
1 0.550 8.21
The profile likelihood curve for \(\beta\) is quite flat, suggesting convergence issues or, at the least, weak identifiability of \(\beta\). However, we see that the model tends to favor higher values of \(\beta\). An approximate 95% confidence interval for \(\beta\), based on Wilks’ Theorem, is given by \([0.55, 8.21]\).
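For reference, the interval follows from the Wilks cutoff of roughly 1.92 log-likelihood units below the profile maximum; with a profile data frame `beta_profile` holding columns Beta and loglik (an assumed container for the profile results), the computation might look like:

```r
library(dplyr)

# Wilks-based 95% interval: keep the best log-likelihood at each fixed Beta,
# then take the range of Beta values within qchisq(0.95, 1)/2 of the maximum.
beta_profile |>
  group_by(Beta) |>
  summarise(loglik = max(loglik), .groups = "drop") |>
  filter(loglik > max(loglik) - qchisq(0.95, df = 1) / 2) |>
  summarise(min = min(Beta), max = max(Beta))
```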
# A tibble: 1 x 2
min max
<dbl> <dbl>
1 0.297 0.992
Here, we see that the model prefers middling values of the recovery rate \(\mu_{IR}\), although the likelihood profile is not very distinct. A 95% confidence interval for \(\mu_{IR}\) is given by \([0.30, 0.99]\).
# A tibble: 1 x 2
min max
<dbl> <dbl>
1 0.00877 0.994
The profile likelihood plot for \(\mu_{RS}\) is so scattered as to be essentially useless; there is no real evidence for any particular value, given that the confidence interval spans nearly the entire range of allowed values.
Overall, there is not enough evidence to conclude that these parameters are a good fit for the data. However, the underlying model and dynamics seem reasonable.
Our parameter search was mostly inconclusive. However, it is clear that the model favors higher values of \(\beta\), the information transmission rate. This matches both the data and intuition – given the speed at which social media spreads news, it makes sense that the information transmission rate is on the higher side.
We can also interpret the recovery rate \(\mu_{IR}\) as what we call the “hype factor” – how quickly do trends fade in societal consciousness? Our parameter search revealed that the recovery rate should be smaller than the information transmission rate, which does seem to match up with the data. However, we would still expect the recovery rate to be somewhat larger in magnitude since there is a drastic decrease in search frequency right after the initial peak.
The other two parameters turned out to be difficult to estimate effectively. The search did not converge on any particular values for the initial susceptible fraction \(\eta\) or the resurgence rate \(\mu_{RS}\). In particular, we would expect the resurgence rate \(\mu_{RS}\) to be small in magnitude, but our optimization returned inconclusive results. This is most likely due to model mis-specification.
Overall, based on model diagnostics and simulations, it remains somewhat inconclusive whether our model is a good fit for the data. However, we believe the interpretability offered by compartmental models can be useful in understanding how information might spread like a pathogen. Previous research has already shown the utility of modeling information spread using epidemiological methods, and we hope this project has provided further evidence for that claim.
Our model was somewhat simplistic – more sophisticated models could include additional susceptible states to differentiate between the speeds of information spread in different media (e.g., social media vs. newspapers) [6, 7]. Another extension would be to model the resurgence rate \(\mu_{RS}\) and the reporting rate \(\rho\) as time-varying.
If we had more time, we would have included the value of $GME as a covariate. From there, we would have conducted a likelihood ratio test using Wilks’ approximation to judge whether the model with or without the stock price was a better predictor of the data. This has many interesting implications; for example, we may be able to tell whether the stock price can be viewed as an external driver for the GameStop hype, or whether such search trends are self-driven. (Extrapolating further, we could even determine whether reverse causation is at play, and if the information spread is itself a driver for the stock price!)
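For concreteness, the planned comparison would use the statistic \(\Lambda = 2(\ell_1 - \ell_0)\), where \(\ell_1\) and \(\ell_0\) are the maximized log-likelihoods of the models with and without the stock-price covariate. Under Wilks’ approximation, \(\Lambda\) is approximately \(\chi^2_d\)-distributed, with \(d\) the number of added parameters, so a single added coefficient would favor including the covariate at the 5% level whenever \(\Lambda > 3.84\).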
On a computational note, we ran into many problems with infinite likelihoods due to the incompatibility of a binomial measurement model with our data; as a result, we had to give up a little interpretability by modeling the search frequency with a negative binomial distribution. Future work could explore alternatives that are both interpretable and compatible with the latent process.
[1] Wang, Lin & Wood, Brendan, “An Epidemiological Approach to Model the Viral Propagation of Memes”.
Applied Mathematical Modelling, Volume 35, Issue 11: 5442-5447. November 11, 2011.
[2] Stanford University, “How Fake News Spreads Like a Real Virus”. October 9, 2019.
[3] Wikipedia, “GameStop Short Squeeze”. Accessed April 8, 2021.
[4] Google Trends, Google. Accessed April 6, 2021.
[5] Investopedia, “Weekend Effect”. Accessed April 13, 2021.
[6] Indiana University, “The American Journalist in the Digital Age: Key Findings”. 2014.
[7] Pew Research Center, “The Evolving Role of News on Twitter and Facebook”. July 14, 2015.
Our project took inspiration from the following former STATS 531 projects: W20 Project 32 for the SIR model with delay, and W18 Project 13 and W20 Project 34 for the idea of applying epidemiological models to information spread. All uncited statistical methods, as well as the code used in this project, were adapted from the STATS 531 class notes.