Investigation of online player increase in CS caused by COVID-19 Pandamic

Introduction

Counter-Strike: Global Offensive (CS:GO) is a 2012 multi-player first-person shooter developed by Valve and Hidden Path Entertainment.Since the game’s release, Global Offensive has remained one of the most played and highest-grossing games on Steam. The game won the fan’s choice “eSports Game of the Year” award at The Game Awards 2015.Also, Global Offensive has one of the most popular esport scenes in the world. The Global Offensive professional scene consists of leagues and tournaments hosted by third-party organisations, and Valve-sponsored tournaments known as Major Championships. Majors are considered the most prestigious tournaments in Counter-Strike circuit and have among the largest prize pools; originally announced at US$250,000, the prize pools for Majors have risen to US$1,000,000 since MLG Columbus 2016 [1].

COVID-19, a previously unknown respiratory illness caused by the coronavirus SARS-CoV-2, was declared a pandemic by the World Health Organization (WHO) on 11 March 2020, less than 3 months after cases were first detected. With now over 479 million confirmed cases and more than 6 million deaths recorded worldwide, there are grave concerns about the global health, societal and economic effects of this virus, particularly on vulnerable and disadvantaged populations, and in low- and middle-income countries with fragile health systems [2].

In this study, we are investigating how the COVID-19 pandemic affects the players.The dataset is from SteamDB, a third-website party provides better insight into the Steam platform[3]. To investigate the affect of COVID-19, we focus on the number of online players start from January 1st, 2020 to January 1st, 2022.

Exploratory Data Analysis

For convenience of analysis, we measure the number of players in 1000 as a unit.

# read in the data set
df = read.csv("chart.csv",encoding="UTF-8") 
player_df = df

# convert date from character to datetime type
# Drop the variable "flags" because it is empty
player_df = player_df %>% mutate(DateTime = as.POSIXct(DateTime))
player_df = player_df %>% select(-Flags)

# Sub-setting the data to focus on the time series starting from 2020
player_df = player_df %>% filter(DateTime > "2019-12-31 EDT", DateTime < "2022-01-01 EDT")
# Change the unit of the player count to be 1000
df = df %>% mutate(Players = Players/1000, Twitch.Viewers = Twitch.Viewers/1000)
player_df = player_df %>% mutate(Players = Players/1000, Twitch.Viewers = Twitch.Viewers/1000)
# Create a new column "day" which represents the order of dates
player_df = player_df %>% mutate(day = 1:n())

# dimension
player_df_dim = dim(player_df)

# check whether there are NAs 
player_NA_count = sapply(player_df, function(x) sum(is.na(x)))

######################### Time plots #########################

cap_fig1 = paste(
  "**Figure 1.** *Number of CS players over time.*",
   "The blue line is the trend estimated by Loess smoothing. Grey region indicates the corresponding 95% confidence intervel."
)

# Time plot for the number of player numbers and Twitch.Viewers
p1 = player_df %>% ggplot() + 
  geom_line(aes(x = DateTime, y = Players), color = "blue")  +
  geom_smooth(aes(x = DateTime, y = Players), method='loess', color = "orange") +
  theme_bw() + labs(x = "Date Time", y = "count (1k)")+theme(text = element_text(size=rel(5)))

p1

Figure 1. Number of CS players over time. The blue line is the trend estimated by Loess smoothing. Grey region indicates the corresponding 95% confidence intervel.

We create a time plot (Figure 1) to understand how the number of players changed over time, in particular, during 2020 and 2021. We can observe two major peaks in the time plot, one appears around May 2020 and the other appears around May 2021. There is also a small peak appears at the end of 2021. The COVID-19 pandemic started from 2019 and dominated 2020, during the period of which, many countries took enforced quarantine policy. Due to the quarantine policy, people were staying at home, so they had more chance to try different kind of games to spend the spare time. This may explain the peak of the player numbers at the beginning of 2020. Then, during 2021, as an increasing population got vaccinated, people went out of their home to social and play with friends. The upturn in the epidemic has eased tension in the society, which may explain the second peak at the start of 2021.[4]

Then, we explore the log returns of the players online. In our analysis, the return can be understood as the rate of increase in player numbers with positive values indicating an increase in the number; and negative values indicating a decrease in the player number. The log return is defined as: \[ R_n = log(y_n) - log(y_{n-1})\]

# By viewing the rate of increase in player numbers as "returns", we consider fitting the GARCH model to model the volatility of the rate of increase.
# Here, for EDA, we plot the demeaned log rate of increase
cap_fig2 = paste(
  "**Figure 2.** *Demeaned Log rate of increase in the number of players over time.*",
   ""
)

log_diff = diff(log(player_df$Players))
demean_players = log_diff-mean(log_diff)
log_df = data.frame(day = c(1:length(log_diff)), log_diff = log_diff)
log_df2 = data.frame(day = c(1:length(demean_players)), demean_players = demean_players)
log_df2$DateTime = player_df$DateTime[2:nrow(player_df)]

ggplot(log_df2, aes(x=DateTime, y=demean_players)) +
  geom_line() +
  xlab("Date Time") +
  ylab("Demeaned Log rate of increase")+
  geom_line( color="#69b3a2") +
  theme_bw() + theme(text = element_text(size=rel(5)))

Figure 2. Demeaned Log rate of increase in the number of players over time.

We plot the demeaned log rate of increase in the number of players in Figure 2. From the plot, we observe a high volatility of returns at the beginning of 2020, the end of 2020, and the end of 2021. These observations are consistent with the trend shown in the time plot where a high volatility corresponds to a big increase/decrease of the number of players. Therefore, by viewing the rate of increase in player numbers as “returns”, we consider fitting the GARCH model to model the volatility of the rate of increase.

Then, we generate ACF plot to observe the possible cycle of the data. According to the ACF plot, it seems like we have a period of 7 days.

# Plot the sample autocorrelation function
cap_fig3 = paste(
  "**Figure 3.** *Auto-correlation of the demeaned Log rate of increase in the number of players.*",
   "The accpetance region is constructed by the dashed line."
)

acf(log_df2$demean_players, main = "ACF: Demeaned Log rate of increase")

Figure 3. Auto-correlation of the demeaned Log rate of increase in the number of players. The accpetance region is constructed by the dashed line.

Spectrum Analysis

cap_fig4 = paste(
  "**Figure 4.** *Unsmoothed periodogram of the demeaned Log rate of increase in the number of players.*",
   ""
)

# Unsmoothed Spectrum
# Code from the lecture notes and previous midterm project
raw_spec = spectrum(log_df2 %>% .$demean_players, main="Unsmoothed periodogram", plot = FALSE)
sales_spec = tibble(freq = raw_spec$freq, spec = raw_spec$spec)
max_omega = sales_spec$freq[which.max(sales_spec$spec)]

sales_spec %>%
  ggplot(aes(x = freq, y = spec)) + 
  geom_line() + 
  scale_x_continuous(name = "Frequency (unit: cycle/day)") + 
  scale_y_continuous(name = "Spectrum",
                     trans = "log10") +
  ggtitle("Unsmoothed periodogram") + 
  theme_bw() +
  geom_vline(xintercept = max_omega,
             colour = "tomato3",
             linetype = "dashed") +
  geom_text(aes(x = max_omega,
                label = sprintf("%.3f", max_omega),
                y = 0.05),
            colour = "darkred")

Figure 4. Unsmoothed periodogram of the demeaned Log rate of increase in the number of players.

We transform the data to the Fourier basis in the frequency domain. Figure 4 shows us an unsmoothed periodogram for the log rate of increase in the number of players. The periodogram shows that the frequency is around 0.143, which coincides with our observation that the cycle is around 7 days.

Decomposition

By applying Loess smoothing, we generate the plot of the decomposition of the rate of increase as trend, noice and cycles.

# Code from the lecture notes and previous midterm project
cap_fig5 = paste(
  "**Figure 5.** *Decomposition of the time series for the demeaned Log rate of increase in the number of players.*",
   "The plots are raw data, trend, noise, and circle."
)

log_df2$DateTime = player_df$DateTime[2:nrow(player_df)]
num_day_per_year = log_df2 %>% mutate(year = lubridate::year(DateTime)) %>% count(year)

rate = log_df2 %>% .$demean_players
# date2019 = seq(from = 2019,length = 364 , by = 1 / 364)
date2020 = seq(from = 2020,length = 365 , by = 1 / 365)
date2021 = seq(from = 2021,length = 365 , by = 1 / 365)
date = c(date2020, date2021)

`Rate low` = ts(loess(rate ~ date, span = 0.5)$fitted,
            start = 2020, 
            frequency = 365)
`Rate high` = ts(rate - loess(rate ~ date, span = 0.1)$fitted,
           start = 2020,
           frequency = 365)
`Rate cycles` = rate - `Rate high` - `Rate low`
plot(ts.union(rate, `Rate low`, `Rate high`, `Rate cycles`),
     main = "Decomposition of the rate of increase as trend + noise + cycles")

Figure 5. Decomposition of the time series for the demeaned Log rate of increase in the number of players. The plots are raw data, trend, noise, and circle.

Benchmark Model

Model Selection

To begin modeling the data, we started with an Auto Regressive Integrated moving Average (ARIMA) model. Formally, the ARIMA(p,d,q) model with intercept $\mu$ for $Y_{1:N}$ is:

\[\phi(B)\left[(1 - B)^dY_n - \mu\right] = \psi(B)\epsilon_n\] where $\{\epsilon_n\}$ is a white noise process; $\phi(B)$ and $\psi(B)$ are ARMA polynomials.[5]

# Code is from previous final project 15
generate_aic_table=function(data, P, Q, D=0, ...){
    table=matrix(NA, (P+1), (Q+1))
    for(p in 0:P) {
        for(q in 0:Q) {
          model_aic = try(
            arima(data, order = c(p, D, q), method="ML", ...)$aic, 
            silent = TRUE
          )
          table[p + 1,q + 1] = ifelse(
            inherits(model_aic, "try-error"), NA, model_aic
          )
        }
    }
    dimnames(table) = list(paste("AR", 0:P, sep=""), paste("MA", 0:Q, sep=""))
  table
}
table_s00 = generate_aic_table(data = log_df2$demean_players, P = 5, Q = 5) #-2784.976 p=5, q=2
table_s10 = generate_aic_table(data = log_df2$demean_players, P = 5, Q = 5, 
                               seasonal = list(order = c(1, 0, 0), period = 7)) # -2886.925 p=3, q=5
table_s01 = generate_aic_table(data = log_df2$demean_players, P = 5, Q = 5, 
                               seasonal = list(order = c(0, 0, 1), period = 7)) # -2881.090 p=5, q=5
table_s11 = generate_aic_table(data = log_df2$demean_players, P = 5, Q = 5, 
                               seasonal = list(order = c(1, 0, 1), period = 7)) # -2938.117 p=5, q=5
table_sD1 = generate_aic_table(data = log_df2$demean_players, P = 5, Q = 5, D = 1) # -2857.070 p=5, q=5

knitr::kable(table_sD1)

	MA0	MA1	MA2	MA3	MA4	MA5
AR0	-1830.990	-2330.206	-2328.217	-2468.700	-2493.033	-2508.317
AR1	-1937.467	-2328.211	-2334.017	-2483.642	-2495.928	-2492.195
AR2	-2031.141	-2377.432	-2512.038	-2595.291	-2507.587	-2525.926
AR3	-2077.671	-2406.629	-2527.133	-2555.439	-2524.457	-2523.733
AR4	-2090.029	-2458.608	-2574.332	-2713.468	-2728.386	-2597.201
AR5	-2217.953	-2649.929	-2691.986	-2782.355	-2786.964	-2857.070

We fix $d$ to 1 and explore the model performance with different values of $p$ and $q$. Based on the AIC table above, we find that the model with p = 5 and q = 5 gives us the lowest AIC. In addition to the ARIMA model, we have also explored the ARMA model and the SARIMA model. The most competitive SARIMA model gives lower AIC than ARIMA(5,1,5) but we decide to keep using ARIMA(5,1,5) as our baseline model because the ARIMA model is relatively simple, and there is not much difference in the AIC values.

Model Diagnostics

# Code from previous midterm project
cap_fig7 = paste(
  "**Figure 7.** *Fitted value(Red) vs Observed value(Black).*",
  ""
)

ARIMA515 = arima(log_df2$demean_players, order = c(5, 1, 5), method="ML")
ARIMA515_loglik = ARIMA515$loglik - sum(log_df2 %>% .$demean_players)


log_df2 %>%
  ggplot() +
  geom_line(aes(x = DateTime, y = demean_players)) +
  geom_line(aes(x = DateTime, y = fitted(ARIMA515)),
            col = "tomato3") +
  labs(x = "Date", y = "Demeaned Log rate of increase in the number of players") + 
  theme_bw()

Figure 7. Fitted value(Red) vs Observed value(Black).

# Code from previous midterm project
cap_fig8 = paste(
  "**Figure 8.** *Residuals of the benchmark model.*",
  ""
)
## Residual plot
tibble(Date = log_df2, Residual = ARIMA515$residuals) %>%
  ggplot(aes(x = Date$DateTime, y = Residual)) +
  geom_line() +
  xlab("Date") +
  ylab("Residuals") +
  geom_hline(yintercept = 0,
             col = "tomato3") + 
  theme_bw()

Figure 8. Residuals of the benchmark model.

# Code from previous midterm project
cap_fig9 = paste(
  "**Figure 9.** *Residuals Autocorrelation function of the benchmark model.*",
  ""
)
# Autocorrelation function of the final model
Acf_plot = acf(ARIMA515$residuals, main = "Residuals Autocorrelation")

Figure 9. Residuals Autocorrelation function of the benchmark model.

# Code from previous midterm project
cap_fig10 = paste(
  "**Figure 10.** *QQ-plot for the residuals of the benchmark model.*",
  ""
)
qqnorm(ARIMA515$residuals, main = "QQ-Plot: Residuals")
qqline(ARIMA515$residuals)

Figure 10. QQ-plot for the residuals of the benchmark model.

Generally speaking, there is no obvious trend in the residual plot.
In the ACF plot, almost all spikes are within the 95% confidence boundaries. the QQ-plot also shows that the residuals have heavier tail than normal. However, this is only a baseline model for comparison and we do not expect it to perform too well. The unsatisfactory result also motivates us to try other models.

GARCH model

As mentioned previously, we decide to use the GARCH model to model the volatility of the rate of increase in player numbers. Specifically, we utilize GARCH(5,5) to model the volatility of the of the rate of increase. GARCH(5,5) takes a simple form that \[ Y_n = \epsilon_n\sqrt{V_n}\] where \[ V_n = \alpha_0 + \alpha_1Y_{n-1}^{2}+\beta_1V_{n-1}\]

require(tseries) 
fit.garch <- garch(log_df2$demean_players,grad = "numerical",trace = FALSE)
L.garch <- tseries:::logLik.garch(fit.garch)
L.garch

## 'log Lik.' 1170.101 (df=3)

par(mfrow = c(1, 2))
acf(resid(fit.garch)[-1],na.action = na.pass, main = "ACF of GARCH(5,5) residuals")
qqnorm(resid(fit.garch))
qqline(resid(fit.garch))

For the GARCH(5,5) model, we obtain a maximized log likelihood of 1170.101. The acf plot indicates that there still exists the cycling behavior of the residuals. And noticing the points on the Q-Q plot, the fall along a line in the middle of the graph, but curve off in the extremities. The Q-Q plot demonstrates the residuals have heavy tails comparing to the normal distribution. Hence the residuals may deviate from the standard normal distribution, and hence may undermine the fitting of the model.

As a result, GARCH model is not a good choice in this case. We can try other models instead to find a better model as substitute.

POMP Fixed Leverage Model

Leverage is a financial term, representing the phenomenon that negative shocks to a stockmarket index are associated with a subsequent increase in volatility.[6] Let $R_n$ on day $n$ denotes the correlation between the return on day $n−1$ and the increase in the log volatility from day $n−1$ to day $n$.[7]
In the pomp implementation of Breto (2014), which models $R_n$ as a random walk on a transformed scale, \[ R_b = \frac{exp(2G_n)-1}{exp(2G_n)+1}\]where $G_n$ is Gaussian random walk.[4]
Following the idea of Breto(2014), the model representation can be expressed by equations below: \[ Y_n = exp(\frac{H_n}{2})\epsilon_n\] \[ H_n = \mu_h(1-h)+\phi H_{n-1}+\beta_{n-1}R_nexp(\frac{-H_{n-1}}{2})+w_n\] \[ G_n = G_{n-1}+\upsilon_n\] \[ \beta_n = Y_n \sigma_\eta\sqrt{1-\phi^{2}}\] \[ \epsilon_n \sim i.i.d N(0,1)\] \[\upsilon_n \sim i.i.dN(0,\sigma_v^{2})\] \[\omega_n \sim N(0,\sigma_{\omega,n}^{2})\] where $\sigma_{\omega,n}^{2} = \sigma_\eta^{2}(1-\phi^{2})(1-R_n^{2})$.

GME_statenames <- c("H","G","Y_state")
GME_rp_names <- c("sigma_nu","mu_h","phi","sigma_eta")
GME_ivp_names <- c("G_0","H_0")
GME_paramnames <- c(GME_rp_names,GME_ivp_names)

rproc1 <- "
double beta,omega,nu;
omega = rnorm(0,sigma_eta * sqrt( 1- phi*phi ) *
sqrt(1-tanh(G)*tanh(G)));
nu = rnorm(0, sigma_nu);
G += nu;
beta = Y_state * sigma_eta * sqrt( 1- phi*phi );
H = mu_h*(1 - phi) + phi*H + beta * tanh( G )
* exp(-H/2) + omega;
"
rproc2.sim <- "
Y_state = rnorm( 0,exp(H/2) );
"
rproc2.filt <- "
Y_state = covaryt;
"
GME_rproc.sim <- paste(rproc1,rproc2.sim)
GME_rproc.filt <- paste(rproc1,rproc2.filt)

GME_rinit <- "
G = G_0;
H = H_0;
Y_state = rnorm( 0,exp(H/2) );
"

GME_rmeasure <- "
y=Y_state;
"

GME_dmeasure <- "
lik=dnorm(y,0,exp(H/2),give_log);
"

GME_partrans <- parameter_trans(
log=c("sigma_eta","sigma_nu"),
logit="phi"
)

sim1.filt <- pomp(data=data.frame(
y=log_df2$demean_players,time=1:length(log_df2$demean_players)),
statenames=GME_statenames,
paramnames=GME_paramnames,
times="time",
t0=0,
covar=covariate_table(
time=0:length(log_df2$demean_players),
covaryt=c(0,log_df2$demean_players),
times="time"),
rmeasure=Csnippet(GME_rmeasure),
dmeasure=Csnippet(GME_dmeasure),
rprocess=discrete_time(step.fun=Csnippet(GME_rproc.filt),
delta.t=1),
rinit=Csnippet(GME_rinit),
partrans=GME_partrans
)

params_test <- c(
sigma_nu = exp(-6.5),
mu_h = -5,
phi = expit(1.5),
sigma_eta = exp(0.7),
G_0 = 0,
H_0=0
)

sim1.sim <- pomp(sim1.filt,
statenames=GME_statenames,
paramnames=GME_paramnames,
rprocess=discrete_time(
step.fun=Csnippet(GME_rproc.sim),delta.t=1)
)
sim1.sim <- simulate(sim1.sim,seed=1,params=params_test)
plot(Y_state~time, data=sim1.sim, type='l', col='blue', main="Observed returns and simulated returns", ylab="Demeaned log rate of increase")
lines(log_df$log_diff,col='black')
legend('topright' , c("Observed Rates","Simulated Rates"), col=c("black","blue"), lty=c(1,1),cex = 0.5)

The plot above compares the observed rate of increase and our simulated rate of increase (from the model with the parameters of our initial guess). We see that the simulated rates capture some of the patterns of the observed rates but there are still many deviations. We will use this result as a starting point. And in the later analysis, we will determine a more proper interval for each parameter through the pair plot.

Filtering on simulated data

sim1.filt2 <- pomp(sim1.sim,
covar=covariate_table(
time=c(timezero(sim1.sim),time(sim1.sim)),
covaryt=c(obs(sim1.sim),NA),
times="time"),
statenames=GME_statenames,
paramnames=GME_paramnames,
rprocess=discrete_time(
step.fun=Csnippet(GME_rproc.filt),delta.t=1)
)

run_level <- 3
GME_Np <- switch(run_level, 100, 1e3, 2e3)
GME_Nmif <- switch(run_level, 10, 100, 200)
GME_Nreps_eval <- switch(run_level, 4, 10, 20)
GME_Nreps_local <- switch(run_level, 10, 20, 20)
GME_Nreps_global <- switch(run_level, 10, 20, 100)

registerDoParallel()
registerDoRNG(34118892)

stew(file=sprintf("pf1-%d.rda",run_level),{
t.pf1 <- system.time(
pf1 <- foreach(i=1:GME_Nreps_eval,
.packages='pomp') %dopar% {
  pfilter(sim1.filt2,Np=GME_Np)
  })
}) -> results
(L.pf1 <- logmeanexp(sapply(pf1,logLik),se=TRUE))

##                      se 
## 518.3896817   0.1722256

We carry out replicated particle filters at our initial guess of the parameters. We obtain a log likelihood estimate of 518.3896817 with a Monte Carlo standard error of 0.1722256. Next, we will fit the stochastic leverage model to the online player data with some randomly selected initial values.

Fitting the stochastic leverage model

# reference https://ionides.github.io/531w21/16/slides-annotated.pdf
GME_rw.sd_rp <- 0.02
GME_rw.sd_ivp <- 0.1
GME_cooling.fraction.50 <- 0.5
GME_rw.sd <- rw.sd(
sigma_nu = GME_rw.sd_rp,
mu_h = GME_rw.sd_rp,
phi = GME_rw.sd_rp,
sigma_eta = GME_rw.sd_rp,
G_0 = ivp(GME_rw.sd_ivp),
H_0 = ivp(GME_rw.sd_ivp)
)
stew(file=sprintf("mif1-%d.rda",run_level),{
t.if1 <- system.time({
if1 <- foreach(i=1:GME_Nreps_local,
.packages='pomp', .combine=c) %dopar% mif2(sim1.filt,
params=params_test,
Np=GME_Np,
Nmif=GME_Nmif,
cooling.fraction.50=GME_cooling.fraction.50,
rw.sd = GME_rw.sd)
L.if1 <- foreach(i=1:GME_Nreps_local,
.packages='pomp', .combine=rbind) %dopar% logmeanexp(
replicate(GME_Nreps_eval, logLik(pfilter(sim1.filt,
params=coef(if1[[i]]),Np=GME_Np))), se=TRUE)
})
},seed=318817883,kind="L'Ecuyer")
r.if1 <- data.frame(logLik=L.if1[,1],logLik_se=L.if1[,2],
t(sapply(if1,coef)))
if (run_level>1) write.table(r.if1,file="GME_params.csv",
append=TRUE,col.names=FALSE,row.names=FALSE)
summary(r.if1$logLik,digits=5)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1272    1274    1274    1275    1276    1277

pairs(~logLik+sigma_nu+mu_h+phi+sigma_eta,
data=subset(r.if1,logLik>max(logLik)-20))

The best log likelihood is 1277. Comparing to the results of GARCH(5,5), which is 1170.101, the stochastic leverage model does seem like a better choice for this particular dataset.

Diagnostic

As demonstrated above, the convergence diagnostics of the pomp model is plotted. We can see from the MIF2 convergence plot that the log-likelihood, $\phi$, $\mu_h$ and $\sigma_\eta$ quickly converges before 50 iterations. The $\sigma_\upsilon$ seems to converge within 100 iterations, $G_0$ seems to converge to an interval around 75 iterations. And for $H_0$, it does not seem to converge.

To address the non-convergence problem and obtain an optimization, we will use randomized starting values from a large box in the pomp model to obtain a global maximization.[8]

Likelihood maximization using randomized starting values

GME_box <- rbind(
 sigma_nu=c(0.005,0.05),
 mu_h    =c(-7,0),
 phi = c(0.95,0.99),
 sigma_eta = c(0.5,1),
 G_0 = c(-2,2),
 H_0 = c(-1,1)
)
stew(file=sprintf("box_eval-%d.rda",run_level),{
  t.box <- system.time({
    if.box <- foreach(i=1:GME_Nreps_global,
      .packages='pomp',.combine=c) %dopar% mif2(if1[[1]],
        params=apply(GME_box,1,function(x)runif(1,x)))
    L.box <- foreach(i=1:GME_Nreps_global,
      .packages='pomp',.combine=rbind) %dopar% {
         logmeanexp(replicate(GME_Nreps_eval, logLik(pfilter(
       sim1.filt,params=coef(if.box[[i]]),Np=GME_Np))), 
           se=TRUE)}
  })
})
r.box <- data.frame(logLik=L.box[,1],logLik_se=L.box[,2],
  t(sapply(if.box,coef)))
if(run_level>1) write.table(r.box,file="GME_params.csv",
  append=TRUE,col.names=FALSE,row.names=FALSE)
summary(r.box$logLik,digits=5)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1258    1271    1274    1273    1276    1280

Diagnostic

maxlik=subset(r.box,logLik==max(logLik))
row.names(maxlik) = NULL
kable(maxlik,digits=3)

logLik	logLik_se	sigma_nu	mu_h	phi	sigma_eta	G_0	H_0
1279.503	0.064	0	-6.492	0.964	1.061	0.523	-4.866

params_test <- c(
  sigma_nu = exp(log(maxlik$sigma_nu)),  
  mu_h = maxlik$mu_h,       
  phi = expit(logit(maxlik$phi)),     
  sigma_eta = exp(log(maxlik$sigma_eta)),
  G_0 = maxlik$G_0,
  H_0=maxlik$H_0
)
sim1.sim <- pomp(sim1.filt,
statenames=GME_statenames,
paramnames=GME_paramnames,
rprocess=discrete_time(
step.fun=Csnippet(GME_rproc.sim),delta.t=1)
)

sim1.sim <- simulate(sim1.sim,seed=8,params=params_test)
plot(Y_state~time,data=sim1.sim,type='l',col="blue",ylab="return")
lines(log_df$log_diff,type='l',col="black")
legend("topright",legend=c("Original","Simulated"),col=c("black","red"),
       cex=0.8,lty=1,bty="n")

From diagnostics plots above, we see that:
$\mu_h$, $\phi$, $\sigma_\eta$, $G_0$ and $H_0$ do not converage at all.
The log likelihood converges at nearly the same value but slower since the steps between y-coordinates increase from 20 to 200.
$\sigma_\upsilon$ converges within 50 iterations.

Conclusion

Model = c("ARIMA(5,1,5)", "GARCH(5,5)", "POMP w. fixed values", "POMP w. randomized values")
`Maximum Likelihood` = c(ARIMA515_loglik, 1170.101, 1277, 1280)
knitr::kable(data.frame(Model, `Maximum Likelihood`))

Model	Maximum.Likelihood
ARIMA(5,1,5)	1439.535
GARCH(5,5)	1170.101
POMP w. fixed values	1277.000
POMP w. randomized values	1280.000

In this project, we focus on the data for CS:GO players. Creatively, we view the increase rate of online players as the returns for financial asset and model the volatility of the rate of increase. We first select the ARIMA(5,1,5) as our baseline model. And then, we apply several advanced model for the financial data to our data of interest. The summary table above compares the log likelihood for the models we have explored. We find that the ARIMA(5,1,5) perform the best with the highest log likelihood of about 1439.535. The second most competitive model is POMP with randomized values, which has a log likelihood of about 1280. One possible explanation for the relatively poor performance of the three advanced model we explored is that these models are designed for financial asset but not for the data set we use. It is likely that the time series from our data have features that cannot be well captured by the models for financial assets. Modeling the number of online players can help game companies better understand the market, and can also supplement the research of other fields such as Sociology, Psychology and Economics. Therefore, more advanced models need to developed for the online player data.

Reference

1, The Counter-Strike encyclopedia that you can edit

2, Budd, J., Miller, B.S., Manning, E.M. et al. Digital technologies in the public-health response to COVID-19. Nat Med 26, 1183–1192 (2020).

3, Counter-Strike: Global Offensive on SteamDB

4, We provide possible reasons for the peaks in the time plot based on A Timeline of COVID-19 Developments in 2020 and https://www.ajmc.com/view/a-timeline-of-covid19-developments-in-2020.

5, Ionides, E. (2022). Notes for STATS/DATASCI 531, Page 11 of Chapter 6 slides: Extending the ARMA model: Seasonality, integration and trend

6, The Wilshire 5000 Index Time Series Analysis and Modelling

7, Class note in Winter 2020

8, To The Moon or Not - Analysis on GameStop Stock Price

Source

The general structure for the part before the GARCH model refers to the previous midterm project: Candy Production Data and store sales data. These two previous projects provide a complete pipeline in analyzing the time series data using materials we learnt at the first half of the semester. We make adjustments to specific sections to accommodate it to our analysis. In addition, we simplify the analysis because we believe the main focus of this project is the part starts from the exploration of the GARCH model, which applies knowledge from the second half of the semester.

The second part of this project uses the general structure from the case study discussed in class and a previous final project: The Wilshire 5000 Index Time Series Analysis and Modelling. The case study and the previous project provide a clear procedure in modeling the financial data. Although our project does not focus on the financial data but concentrates on the data for CS players, we observe that there are similarities in the features. We view the increase rate of player numbers as returns, so the analysis from the previous project provides a good reference to us. We make adjustments to some sections to accommodate it to our research context. Moreover, we include diagnostic plots to assess the model performance.