Counter-Strike: Global Offensive (CS:GO) is a multiplayer first-person shooter released in 2012, developed by Valve and Hidden Path Entertainment. Since the game's release, Global Offensive has remained one of the most played and highest-grossing games on Steam, and it won the fans' choice "eSports Game of the Year" award at The Game Awards 2015. Global Offensive also has one of the most popular esports scenes in the world. The professional scene consists of leagues and tournaments hosted by third-party organisations, as well as Valve-sponsored tournaments known as Major Championships. Majors are considered the most prestigious tournaments in the Counter-Strike circuit and have among the largest prize pools; originally announced at US$250,000, the prize pools for Majors have risen to US$1,000,000 since MLG Columbus 2016 [1].
COVID-19, a previously unknown respiratory illness caused by the coronavirus SARS-CoV-2, was declared a pandemic by the World Health Organization (WHO) on 11 March 2020, less than 3 months after cases were first detected. With over 479 million confirmed cases and more than 6 million deaths now recorded worldwide, there are grave concerns about the global health, societal, and economic effects of this virus, particularly on vulnerable and disadvantaged populations and in low- and middle-income countries with fragile health systems [2].
In this study, we investigate how the COVID-19 pandemic affected CS:GO players. The dataset is from SteamDB, a third-party website that provides detailed insight into the Steam platform [3]. To investigate the effect of COVID-19, we focus on the number of online players from January 1, 2020 to January 1, 2022.
For convenience of analysis, we measure the number of players in units of 1,000.
# load required packages
library(tidyverse)   # dplyr, ggplot2, tibble
library(lubridate)
library(knitr)

# read in the data set
df = read.csv("chart.csv", encoding = "UTF-8")
player_df = df

# convert date from character to datetime type
# drop the variable "Flags" because it is empty
player_df = player_df %>% mutate(DateTime = as.POSIXct(DateTime))
player_df = player_df %>% select(-Flags)

# sub-setting the data to focus on the time series starting from 2020
player_df = player_df %>% filter(DateTime > "2019-12-31 EDT", DateTime < "2022-01-01 EDT")

# change the unit of the player count to be 1000
df = df %>% mutate(Players = Players / 1000, Twitch.Viewers = Twitch.Viewers / 1000)
player_df = player_df %>% mutate(Players = Players / 1000, Twitch.Viewers = Twitch.Viewers / 1000)

# create a new column "day" which represents the order of dates
player_df = player_df %>% mutate(day = 1:n())

# dimension
player_df_dim = dim(player_df)

# check whether there are NAs
player_NA_count = sapply(player_df, function(x) sum(is.na(x)))
######################### Time plots #########################
cap_fig1 = paste(
  "**Figure 1.** *Number of CS:GO players over time.*",
  "The blue line shows the raw player counts; the orange line is the trend estimated by Loess smoothing. The grey region indicates the corresponding 95% confidence interval."
)
# Time plot for the number of players
p1 = player_df %>% ggplot() +
  geom_line(aes(x = DateTime, y = Players), color = "blue") +
  geom_smooth(aes(x = DateTime, y = Players), method = "loess", color = "orange") +
  theme_bw() +
  labs(x = "Date Time", y = "count (1k)") +
  theme(text = element_text(size = rel(5)))
p1
We create a time plot (Figure 1) to understand how the number of players changed over time, in particular during 2020 and 2021. We observe two major peaks, one around May 2020 and the other around May 2021, along with a smaller peak at the end of 2021. The COVID-19 pandemic began in late 2019 and dominated 2020, during which many countries enforced quarantine policies. With people staying at home, they had more opportunity to try different kinds of games in their spare time, which may explain the peak in player numbers in the first half of 2020. Then, during 2021, as an increasing share of the population got vaccinated, people went out to socialize and play with friends; the easing of the epidemic relaxed tension in society, which may explain the second peak in 2021.[4]
Next, we explore the log returns of the number of players online. In our analysis, the return can be understood as the rate of increase in player numbers: positive values indicate an increase and negative values a decrease. The log return is defined as \[ R_n = \log(y_n) - \log(y_{n-1}).\]
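For intuition, a tiny worked example with made-up counts (in thousands of players): a one-day rise from 800 to 840 gives a log return of about +0.049, i.e. roughly a 4.9% increase, while a drop would give a negative value.
# toy check of the log return for a hypothetical rise from 800k to 840k players
log(840) - log(800)  # approximately 0.0488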
# By viewing the rate of increase in player numbers as "returns", we consider fitting the GARCH model to model the volatility of the rate of increase.
# Here, for EDA, we plot the demeaned log rate of increase
cap_fig2 = paste(
  "**Figure 2.** *Demeaned Log rate of increase in the number of players over time.*",
  ""
)
log_diff = diff(log(player_df$Players))
demean_players = log_diff - mean(log_diff)
log_df = data.frame(day = 1:length(log_diff), log_diff = log_diff)
log_df2 = data.frame(day = 1:length(demean_players), demean_players = demean_players)
log_df2$DateTime = player_df$DateTime[2:nrow(player_df)]

ggplot(log_df2, aes(x = DateTime, y = demean_players)) +
  geom_line(color = "#69b3a2") +
  xlab("Date Time") +
  ylab("Demeaned Log rate of increase") +
  theme_bw() +
  theme(text = element_text(size = rel(5)))
We plot the demeaned log rate of increase in the number of players in Figure 2. From the plot, we observe high volatility of the returns at the beginning of 2020, the end of 2020, and the end of 2021. These observations are consistent with the time plot, where high volatility corresponds to a large increase or decrease in the number of players. Therefore, by viewing the rate of increase in player numbers as “returns”, we consider fitting a GARCH model to the volatility of the rate of increase.
Next, we generate the sample autocorrelation (ACF) plot to look for a possible cycle in the data. The ACF plot (Figure 3) suggests a period of about 7 days.
# Plot the sample autocorrelation function
cap_fig3 = paste(
  "**Figure 3.** *Auto-correlation of the demeaned Log rate of increase in the number of players.*",
  "The acceptance region is bounded by the dashed lines."
)
acf(log_df2$demean_players, main = "ACF: Demeaned Log rate of increase")

cap_fig4 = paste(
  "**Figure 4.** *Unsmoothed periodogram of the demeaned Log rate of increase in the number of players.*",
  ""
)
# Unsmoothed spectrum
# Code from the lecture notes and a previous midterm project
raw_spec = spectrum(log_df2$demean_players, main = "Unsmoothed periodogram", plot = FALSE)
sales_spec = tibble(freq = raw_spec$freq, spec = raw_spec$spec)
max_omega = sales_spec$freq[which.max(sales_spec$spec)]

sales_spec %>%
  ggplot(aes(x = freq, y = spec)) +
  geom_line() +
  scale_x_continuous(name = "Frequency (unit: cycle/day)") +
  scale_y_continuous(name = "Spectrum", trans = "log10") +
  ggtitle("Unsmoothed periodogram") +
  theme_bw() +
  geom_vline(xintercept = max_omega, colour = "tomato3", linetype = "dashed") +
  geom_text(aes(x = max_omega, label = sprintf("%.3f", max_omega), y = 0.05),
            colour = "darkred")
We transform the data to the Fourier basis in the frequency domain. Figure 4 shows the unsmoothed periodogram of the demeaned log rate of increase in the number of players. The dominant frequency is around 0.143 cycles per day, which corresponds to a period of about 7 days and coincides with the cycle suggested by the ACF plot.
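Since the period is the reciprocal of the frequency, the dominant frequency can be converted to a period directly; a small sketch using the `max_omega` value computed above:
# Convert the dominant frequency (cycles/day) into a period (days/cycle);
# a frequency near 0.143 corresponds to roughly a 7-day (weekly) cycle.
1 / max_omega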
By applying Loess smoothing, we decompose the rate of increase into trend, noise, and cycles (Figure 5).
# Code from the lecture notes and a previous midterm project
cap_fig5 = paste(
  "**Figure 5.** *Decomposition of the time series for the demeaned Log rate of increase in the number of players.*",
  "The plots are raw data, trend, noise, and cycles."
)
log_df2$DateTime = player_df$DateTime[2:nrow(player_df)]
num_day_per_year = log_df2 %>% mutate(year = lubridate::year(DateTime)) %>% count(year)

rate = log_df2$demean_players
# date2019 = seq(from = 2019, length = 364, by = 1 / 364)
date2020 = seq(from = 2020, length = 365, by = 1 / 365)
date2021 = seq(from = 2021, length = 365, by = 1 / 365)
date = c(date2020, date2021)

`Rate low` = ts(loess(rate ~ date, span = 0.5)$fitted,
                start = 2020, frequency = 365)
`Rate high` = ts(rate - loess(rate ~ date, span = 0.1)$fitted,
                 start = 2020, frequency = 365)
`Rate cycles` = rate - `Rate high` - `Rate low`
plot(ts.union(rate, `Rate low`, `Rate high`, `Rate cycles`),
     main = "Decomposition of the rate of increase as trend + noise + cycles")
To begin modeling the data, we start with an Autoregressive Integrated Moving Average (ARIMA) model. Formally, the ARIMA(p,d,q) model with intercept \(\mu\) for \(Y_{1:N}\) is:
\[\phi(B)\left[(1 - B)^dY_n - \mu\right] = \psi(B)\epsilon_n\] where \(\{\epsilon_n\}\) is a white noise process; \(\phi(B)\) and \(\psi(B)\) are ARMA polynomials.[5]
# Code is from previous final project 15
generate_aic_table = function(data, P, Q, D = 0, ...) {
  table = matrix(NA, (P + 1), (Q + 1))
  for (p in 0:P) {
    for (q in 0:Q) {
      model_aic = try(
        arima(data, order = c(p, D, q), method = "ML", ...)$aic,
        silent = TRUE
      )
      table[p + 1, q + 1] = ifelse(
        inherits(model_aic, "try-error"), NA, model_aic
      )
    }
  }
  dimnames(table) = list(paste("AR", 0:P, sep = ""), paste("MA", 0:Q, sep = ""))
  table
}

table_s00 = generate_aic_table(data = log_df2$demean_players, P = 5, Q = 5) # -2784.976 p=5, q=2
table_s10 = generate_aic_table(data = log_df2$demean_players, P = 5, Q = 5,
                               seasonal = list(order = c(1, 0, 0), period = 7)) # -2886.925 p=3, q=5
table_s01 = generate_aic_table(data = log_df2$demean_players, P = 5, Q = 5,
                               seasonal = list(order = c(0, 0, 1), period = 7)) # -2881.090 p=5, q=5
table_s11 = generate_aic_table(data = log_df2$demean_players, P = 5, Q = 5,
                               seasonal = list(order = c(1, 0, 1), period = 7)) # -2938.117 p=5, q=5
table_sD1 = generate_aic_table(data = log_df2$demean_players, P = 5, Q = 5, D = 1) # -2857.070 p=5, q=5

knitr::kable(table_sD1)
|     | MA0       | MA1       | MA2       | MA3       | MA4       | MA5       |
|-----|-----------|-----------|-----------|-----------|-----------|-----------|
| AR0 | -1830.990 | -2330.206 | -2328.217 | -2468.700 | -2493.033 | -2508.317 |
| AR1 | -1937.467 | -2328.211 | -2334.017 | -2483.642 | -2495.928 | -2492.195 |
| AR2 | -2031.141 | -2377.432 | -2512.038 | -2595.291 | -2507.587 | -2525.926 |
| AR3 | -2077.671 | -2406.629 | -2527.133 | -2555.439 | -2524.457 | -2523.733 |
| AR4 | -2090.029 | -2458.608 | -2574.332 | -2713.468 | -2728.386 | -2597.201 |
| AR5 | -2217.953 | -2649.929 | -2691.986 | -2782.355 | -2786.964 | -2857.070 |
We fix \(d\) at 1 and explore the model performance for different values of \(p\) and \(q\). Based on the AIC table above, the model with p = 5 and q = 5 gives the lowest AIC. In addition to the ARIMA model, we also explored ARMA and SARIMA models. The most competitive SARIMA model gives a lower AIC than ARIMA(5,1,5), but we keep ARIMA(5,1,5) as our baseline because it is relatively simple and the difference in AIC values is modest.
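Rather than scanning the table by eye, the AIC-minimizing order can also be read off programmatically; a minimal sketch using the `table_sD1` matrix computed above:
# Locate the entry with the smallest AIC; arr.ind = TRUE returns the
# row (AR order + 1) and column (MA order + 1) of that entry.
best = which(table_sD1 == min(table_sD1, na.rm = TRUE), arr.ind = TRUE)
rownames(table_sD1)[best[1]]  # AR order of the AIC-minimizing model
colnames(table_sD1)[best[2]]  # MA order of the AIC-minimizing model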
# Code from previous midterm project
library(forecast)  # provides fitted() for Arima objects
cap_fig7 = paste(
  "**Figure 7.** *Fitted values (red) vs observed values (black).*",
  ""
)
ARIMA515 = arima(log_df2$demean_players, order = c(5, 1, 5), method = "ML")
ARIMA515_loglik = ARIMA515$loglik - sum(log_df2$demean_players)

log_df2 %>%
  ggplot() +
  geom_line(aes(x = DateTime, y = demean_players)) +
  geom_line(aes(x = DateTime, y = fitted(ARIMA515)), col = "tomato3") +
  labs(x = "Date", y = "Demeaned Log rate of increase in the number of players") +
  theme_bw()
# Code from previous midterm project
cap_fig8 = paste(
  "**Figure 8.** *Residuals of the benchmark model.*",
  ""
)
## Residual plot
tibble(Date = log_df2, Residual = ARIMA515$residuals) %>%
  ggplot(aes(x = Date$DateTime, y = Residual)) +
  geom_line() +
  xlab("Date") +
  ylab("Residuals") +
  geom_hline(yintercept = 0, col = "tomato3") +
  theme_bw()
# Code from previous midterm project
cap_fig9 = paste(
  "**Figure 9.** *Residuals Autocorrelation function of the benchmark model.*",
  ""
)
# Autocorrelation function of the benchmark model
Acf_plot = acf(ARIMA515$residuals, main = "Residuals Autocorrelation")

# Code from previous midterm project
cap_fig10 = paste(
  "**Figure 10.** *QQ-plot for the residuals of the benchmark model.*",
  ""
)
qqnorm(ARIMA515$residuals, main = "QQ-Plot: Residuals")
qqline(ARIMA515$residuals)
Generally speaking, there is no obvious trend in the residual plot.
In the ACF plot, almost all spikes are within the 95% confidence boundaries. The QQ-plot, however, shows that the residuals have heavier tails than the normal distribution. Since this is only a baseline model for comparison, we do not expect it to perform especially well; the unsatisfactory result also motivates us to try other models.
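As a numerical complement to the visual ACF check, a Ljung-Box test can be run on the residuals; a minimal sketch, where the choices of lag (14, two weekly cycles) and fitdf (the 10 fitted ARMA coefficients) are illustrative rather than taken from the analysis above:
# Ljung-Box test for remaining autocorrelation in the ARIMA(5,1,5) residuals;
# a large p-value is consistent with white-noise residuals.
Box.test(ARIMA515$residuals, lag = 14, type = "Ljung-Box", fitdf = 10)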
As mentioned previously, we decide to use a GARCH model for the volatility of the rate of increase in player numbers. Specifically, we fit a GARCH(1,1) model, which takes the simple form \[ Y_n = \epsilon_n\sqrt{V_n},\] where \[ V_n = \alpha_0 + \alpha_1 Y_{n-1}^{2} + \beta_1 V_{n-1}.\]
require(tseries)
# tseries::garch() fits a GARCH(1,1) model by default (order = c(1, 1))
fit.garch <- garch(log_df2$demean_players, grad = "numerical", trace = FALSE)
L.garch <- tseries:::logLik.garch(fit.garch)
L.garch
## 'log Lik.' 1170.101 (df=3)
par(mfrow = c(1, 2))
acf(resid(fit.garch)[-1], na.action = na.pass, main = "ACF of GARCH(1,1) residuals")
qqnorm(resid(fit.garch))
qqline(resid(fit.garch))
For the GARCH(1,1) model, we obtain a maximized log likelihood of 1170.101. The ACF plot indicates that some cyclic behavior remains in the residuals. The points on the Q-Q plot fall along a line in the middle of the graph but curve off at the extremities, showing that the residuals have heavier tails than the normal distribution. Hence the residuals deviate from the standard normal assumption, which may undermine the fit of the model.
As a result, the GARCH model is not a good choice in this case, and we look for a better substitute.
Leverage is a financial term describing the phenomenon that negative shocks to a stock market index are associated with a subsequent increase in volatility.[6] Let \(R_n\) denote the leverage on day \(n\), defined as the correlation between the return on day \(n-1\) and the increase in the log volatility from day \(n-1\) to day \(n\).[7]
We follow the pomp implementation of Breto (2014), which models \(R_n\) as a random walk on a transformed scale, \[ R_n = \frac{\exp(2G_n)-1}{\exp(2G_n)+1},\] where \(G_n\) is a Gaussian random walk.[4]
Following the idea of Breto (2014), the model can be expressed by the equations below: \[ Y_n = \exp\left(\frac{H_n}{2}\right)\epsilon_n,\] \[ H_n = \mu_h(1-\phi)+\phi H_{n-1}+\beta_{n-1}R_n\exp\left(-\frac{H_{n-1}}{2}\right)+\omega_n,\] \[ G_n = G_{n-1}+\nu_n,\] \[ \beta_n = Y_n \sigma_\eta\sqrt{1-\phi^{2}},\] \[ \epsilon_n \sim \text{i.i.d. } N(0,1),\quad \nu_n \sim \text{i.i.d. } N(0,\sigma_\nu^{2}),\quad \omega_n \sim N(0,\sigma_{\omega,n}^{2}),\] where \(\sigma_{\omega,n}^{2} = \sigma_\eta^{2}(1-\phi^{2})(1-R_n^{2})\).
<- c("H","G","Y_state")
GME_statenames <- c("sigma_nu","mu_h","phi","sigma_eta")
GME_rp_names <- c("G_0","H_0")
GME_ivp_names <- c(GME_rp_names,GME_ivp_names)
GME_paramnames
<- "
rproc1 double beta,omega,nu;
omega = rnorm(0,sigma_eta * sqrt( 1- phi*phi ) *
sqrt(1-tanh(G)*tanh(G)));
nu = rnorm(0, sigma_nu);
G += nu;
beta = Y_state * sigma_eta * sqrt( 1- phi*phi );
H = mu_h*(1 - phi) + phi*H + beta * tanh( G )
* exp(-H/2) + omega;
"
<- "
rproc2.sim Y_state = rnorm( 0,exp(H/2) );
"
<- "
rproc2.filt Y_state = covaryt;
"
<- paste(rproc1,rproc2.sim)
GME_rproc.sim <- paste(rproc1,rproc2.filt)
GME_rproc.filt
<- "
GME_rinit G = G_0;
H = H_0;
Y_state = rnorm( 0,exp(H/2) );
"
<- "
GME_rmeasure y=Y_state;
"
<- "
GME_dmeasure lik=dnorm(y,0,exp(H/2),give_log);
"
<- parameter_trans(
GME_partrans log=c("sigma_eta","sigma_nu"),
logit="phi"
)
<- pomp(data=data.frame(
sim1.filt y=log_df2$demean_players,time=1:length(log_df2$demean_players)),
statenames=GME_statenames,
paramnames=GME_paramnames,
times="time",
t0=0,
covar=covariate_table(
time=0:length(log_df2$demean_players),
covaryt=c(0,log_df2$demean_players),
times="time"),
rmeasure=Csnippet(GME_rmeasure),
dmeasure=Csnippet(GME_dmeasure),
rprocess=discrete_time(step.fun=Csnippet(GME_rproc.filt),
delta.t=1),
rinit=Csnippet(GME_rinit),
partrans=GME_partrans
)
<- c(
params_test sigma_nu = exp(-6.5),
mu_h = -5,
phi = expit(1.5),
sigma_eta = exp(0.7),
G_0 = 0,
H_0=0
)
<- pomp(sim1.filt,
sim1.sim statenames=GME_statenames,
paramnames=GME_paramnames,
rprocess=discrete_time(
step.fun=Csnippet(GME_rproc.sim),delta.t=1)
)<- simulate(sim1.sim,seed=1,params=params_test)
sim1.sim plot(Y_state~time, data=sim1.sim, type='l', col='blue', main="Observed returns and simulated returns", ylab="Demeaned log rate of increase")
lines(log_df$log_diff,col='black')
legend('topright' , c("Observed Rates","Simulated Rates"), col=c("black","blue"), lty=c(1,1),cex = 0.5)
The plot above compares the observed rate of increase with the rate of increase simulated from the model at our initial parameter guess. The simulated rates capture some of the patterns of the observed rates, but there are still many deviations. We use this result as a starting point, and in the later analysis we determine more suitable ranges for each parameter from the pairs plot.
library(doParallel)
library(doRNG)

sim1.filt2 <- pomp(sim1.sim,
  covar = covariate_table(
    time = c(timezero(sim1.sim), time(sim1.sim)),
    covaryt = c(obs(sim1.sim), NA),
    times = "time"),
  statenames = GME_statenames,
  paramnames = GME_paramnames,
  rprocess = discrete_time(step.fun = Csnippet(GME_rproc.filt), delta.t = 1)
)

run_level <- 3
GME_Np <- switch(run_level, 100, 1e3, 2e3)
GME_Nmif <- switch(run_level, 10, 100, 200)
GME_Nreps_eval <- switch(run_level, 4, 10, 20)
GME_Nreps_local <- switch(run_level, 10, 20, 20)
GME_Nreps_global <- switch(run_level, 10, 20, 100)

registerDoParallel()
registerDoRNG(34118892)

stew(file = sprintf("pf1-%d.rda", run_level), {
  t.pf1 <- system.time(
    pf1 <- foreach(i = 1:GME_Nreps_eval,
                   .packages = 'pomp') %dopar% {
      pfilter(sim1.filt2, Np = GME_Np)
    })
})
(L.pf1 <- logmeanexp(sapply(pf1, logLik), se = TRUE))
## se
## 518.3896817 0.1722256
We carry out replicated particle filters at our initial guess of the parameters and obtain a log likelihood estimate of 518.3896817 with a Monte Carlo standard error of 0.1722256. Next, we fit the stochastic leverage model to the online player data by iterated filtering, starting from these initial values.
# reference https://ionides.github.io/531w21/16/slides-annotated.pdf
GME_rw.sd_rp <- 0.02
GME_rw.sd_ivp <- 0.1
GME_cooling.fraction.50 <- 0.5
GME_rw.sd <- rw.sd(
  sigma_nu = GME_rw.sd_rp,
  mu_h = GME_rw.sd_rp,
  phi = GME_rw.sd_rp,
  sigma_eta = GME_rw.sd_rp,
  G_0 = ivp(GME_rw.sd_ivp),
  H_0 = ivp(GME_rw.sd_ivp)
)

stew(file = sprintf("mif1-%d.rda", run_level), {
  t.if1 <- system.time({
    if1 <- foreach(i = 1:GME_Nreps_local,
                   .packages = 'pomp', .combine = c) %dopar% mif2(sim1.filt,
      params = params_test,
      Np = GME_Np,
      Nmif = GME_Nmif,
      cooling.fraction.50 = GME_cooling.fraction.50,
      rw.sd = GME_rw.sd)
    L.if1 <- foreach(i = 1:GME_Nreps_local,
                     .packages = 'pomp', .combine = rbind) %dopar% logmeanexp(
      replicate(GME_Nreps_eval, logLik(pfilter(sim1.filt,
        params = coef(if1[[i]]), Np = GME_Np))), se = TRUE)
  })
}, seed = 318817883, kind = "L'Ecuyer")

r.if1 <- data.frame(logLik = L.if1[, 1], logLik_se = L.if1[, 2],
                    t(sapply(if1, coef)))
if (run_level > 1) write.table(r.if1, file = "GME_params.csv",
                               append = TRUE, col.names = FALSE, row.names = FALSE)
summary(r.if1$logLik, digits = 5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1272 1274 1274 1275 1276 1277
pairs(~logLik+sigma_nu+mu_h+phi+sigma_eta,
data=subset(r.if1,logLik>max(logLik)-20))
The best log likelihood from this search is 1277. Compared with the GARCH(1,1) result of 1170.101, the stochastic leverage model does seem like a better choice for this particular dataset.
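The convergence diagnostics discussed in the next paragraph can be produced with pomp's plot method on the iterated-filtering output; a minimal sketch, assuming `if1` holds the mif2 results from the local search above:
# Trace plots of the log likelihood and each parameter across MIF2 iterations
plot(if1)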
As demonstrated above, we plot the convergence diagnostics of the pomp model. From the MIF2 convergence plot, the log likelihood, \(\phi\), \(\mu_h\), and \(\sigma_\eta\) converge quickly, within about 50 iterations. \(\sigma_\nu\) seems to converge within 100 iterations, and \(G_0\) settles into an interval after around 75 iterations. \(H_0\), however, does not seem to converge.
To address the non-convergence problem and carry out a global optimization, we rerun the pomp model with randomized starting values drawn from a large parameter box.[8]
GME_box <- rbind(
  sigma_nu = c(0.005, 0.05),
  mu_h = c(-7, 0),
  phi = c(0.95, 0.99),
  sigma_eta = c(0.5, 1),
  G_0 = c(-2, 2),
  H_0 = c(-1, 1)
)

stew(file = sprintf("box_eval-%d.rda", run_level), {
  t.box <- system.time({
    if.box <- foreach(i = 1:GME_Nreps_global,
                      .packages = 'pomp', .combine = c) %dopar% mif2(if1[[1]],
      params = apply(GME_box, 1, function(x) runif(1, x[1], x[2])))
    L.box <- foreach(i = 1:GME_Nreps_global,
                     .packages = 'pomp', .combine = rbind) %dopar% {
      logmeanexp(replicate(GME_Nreps_eval, logLik(pfilter(sim1.filt,
        params = coef(if.box[[i]]), Np = GME_Np))), se = TRUE)
    }
  })
})

r.box <- data.frame(logLik = L.box[, 1], logLik_se = L.box[, 2],
                    t(sapply(if.box, coef)))
if (run_level > 1) write.table(r.box, file = "GME_params.csv",
                               append = TRUE, col.names = FALSE, row.names = FALSE)
summary(r.box$logLik, digits = 5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1258 1271 1274 1273 1276 1280
maxlik = subset(r.box, logLik == max(logLik))
row.names(maxlik) = NULL
kable(maxlik, digits = 3)
| logLik   | logLik_se | sigma_nu | mu_h   | phi   | sigma_eta | G_0   | H_0    |
|----------|-----------|----------|--------|-------|-----------|-------|--------|
| 1279.503 | 0.064     | 0        | -6.492 | 0.964 | 1.061     | 0.523 | -4.866 |
params_test <- c(
  sigma_nu = exp(log(maxlik$sigma_nu)),
  mu_h = maxlik$mu_h,
  phi = expit(logit(maxlik$phi)),
  sigma_eta = exp(log(maxlik$sigma_eta)),
  G_0 = maxlik$G_0,
  H_0 = maxlik$H_0
)
sim1.sim <- pomp(sim1.filt,
  statenames = GME_statenames,
  paramnames = GME_paramnames,
  rprocess = discrete_time(step.fun = Csnippet(GME_rproc.sim), delta.t = 1)
)

sim1.sim <- simulate(sim1.sim, seed = 8, params = params_test)
plot(Y_state ~ time, data = sim1.sim, type = 'l', col = "blue", ylab = "return")
lines(log_df$log_diff, type = 'l', col = "black")
legend("topright", legend = c("Original", "Simulated"), col = c("black", "blue"),
       cex = 0.8, lty = 1, bty = "n")
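The corresponding convergence diagnostics for the global search can likewise be produced with pomp's plot method; a minimal sketch, assuming `if.box` holds the mif2 results from the box search above:
# Trace plots of the log likelihood and each parameter for the global search
plot(if.box)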
From the diagnostics plots above, we see that:

- \(\mu_h\), \(\phi\), \(\sigma_\eta\), \(G_0\), and \(H_0\) do not converge at all.
- The log likelihood converges at nearly the same value but more slowly, as the steps between y-axis values increase from 20 to 200.
- \(\sigma_\nu\) converges within 50 iterations.
= c("ARIMA(5,1,5)", "GARCH(5,5)", "POMP w. fixed values", "POMP w. randomized values")
Model `Maximum Likelihood` = c(ARIMA515_loglik, 1170.101, 1277, 1280)
::kable(data.frame(Model, `Maximum Likelihood`)) knitr
| Model                     | Maximum.Log.Likelihood |
|---------------------------|------------------------|
| ARIMA(5,1,5)              | 1439.535               |
| GARCH(1,1)                | 1170.101               |
| POMP w. fixed values      | 1277.000               |
| POMP w. randomized values | 1280.000               |
In this project, we focus on data for CS:GO players. We view the rate of increase in the number of online players as the return on a financial asset and model the volatility of this rate. We first select ARIMA(5,1,5) as our baseline model, and then apply several models designed for financial data to our data of interest. The summary table above compares the log likelihoods of the models we explored. We find that ARIMA(5,1,5) performs the best, with the highest log likelihood of about 1439.535. The second most competitive model is the POMP model with randomized starting values, which has a log likelihood of about 1280. One possible explanation for the relatively poor performance of the three advanced models we explored is that they are designed for financial assets rather than for the dataset we use; the time series in our data may have features that these models cannot capture well. Modeling the number of online players can help game companies better understand the market, and can also supplement research in other fields such as sociology, psychology, and economics. Therefore, more advanced models need to be developed for online player data.
[1] The Counter-Strike encyclopedia that you can edit.
[3] Counter-Strike: Global Offensive on SteamDB.
[4] A Timeline of COVID-19 Developments in 2020, https://www.ajmc.com/view/a-timeline-of-covid19-developments-in-2020 (basis for the possible explanations of the peaks in the time plot).
[5] Ionides, E. (2022). Notes for STATS/DATASCI 531, Chapter 6 slides, page 11: Extending the ARMA model: Seasonality, integration and trend.
[6] The Wilshire 5000 Index Time Series Analysis and Modelling.
The general structure of the part before the GARCH model follows two previous midterm projects: Candy Production Data and Store Sales Data. These projects provide a complete pipeline for analyzing time series data using material from the first half of the semester. We adjusted specific sections to fit our analysis and simplified some of them, because we believe the main focus of this project is the part starting from the exploration of the GARCH model, which applies knowledge from the second half of the semester.
The second part of this project follows the general structure of the case study discussed in class and a previous final project, The Wilshire 5000 Index Time Series Analysis and Modelling. The case study and the previous project provide a clear procedure for modeling financial data. Although our project concentrates on data for CS:GO players rather than financial data, the two share similar features: we view the rate of increase in player numbers as returns, so the analysis from the previous project serves as a useful reference. We adjusted some sections to fit our research context and added diagnostic plots to assess model performance.