On September 18, 2019, Major League Baseball’s (MLB) Houston Astros defeated the Texas Rangers 3-2. The win clinched a postseason berth for the Astros and marked their third consecutive 100-win season (out of 162 games), making them the first team to accomplish this feat since 2004 [1]. However, their historic run was to be marred by controversy.
On November 12 of the same year, shortly after the Astros lost the World Series to the Washington Nationals, baseball writers Ken Rosenthal and Evan Drellich published a story in The Athletic accusing the Astros of using video technology to steal the pitching signs of opposing teams [2]. Pitchers rely on signs that indicate what type of pitch to throw at a given time. These signs are relayed to the pitcher by the catcher, who is behind and out of view of the batter. The signs are generally standard throughout baseball. For example, if the catcher points one finger down, he is asking the pitcher to throw a fastball, while two fingers indicate a curveball. The article stated that a camera feed from center field was being relayed to the Astros dugout, where a member of the Astros would loudly bang on a trash can to let the batter know what type of pitch was about to be thrown. After the article was published, many videos surfaced online in which the banging sound can be heard whenever certain signs were made while the Astros were batting [3].
After a subsequent investigation by MLB, the Houston Astros were found guilty and were punished with fines and the loss of draft picks. In the following weeks, the Houston players and staff were criticized for showing little remorse for their actions, with team owner Jim Crane saying the scheme “didn’t impact the game” [4].
This report analyzes batting data of the Houston Astros before, during, and after the sign stealing occurred to determine whether the system benefited the team. Specifically, my goal is to model the performance of the Houston Astros over time using an ARMA model.
In the aforementioned videos, the loud banging can be heard whenever a sign for an offspeed pitch is made. These pitches are slower than fastballs, but can throw off the timing of the batter, causing him to swing and miss. Certain offspeed pitches also move away from the strike zone right as the ball reaches the batter. If a batter knows that such a pitch is about to be thrown, he is less likely to swing and chase after the ball.
The data I am using are the swing-and-miss rates on offspeed pitches for Houston Astros batters in home games during the 2016-2019 seasons [5]. According to the MLB investigation, the Astros used their scheme in the 2017 and 2018 seasons, but not in 2019. Thus, we can compare the middle two years of our data with the first and last years to look for any benefit the scheme had. Also note that the camera capturing signs was only installed in Houston’s home stadium, which is why we are only looking at data from home games.
We can visualize this data.
There are 81 home games each regular season, plus about 8 home games each postseason depending on how far the Astros advanced (in 2016, Houston did not qualify for the postseason). We can already see that the miss rate is greater in 2016 than in the following years. However, it may be that the team simply became better overall at not swinging and missing at offspeed pitches. To account for this factor, we can subtract each season’s average away-game miss rate from that season’s home-game miss rates. This modification leaves us with the difference between home and away miss rates. If we can still see a trend, then there is stronger evidence for foul play during Astros home games. We plot the new series.
We can still see a decrease in the miss rate difference from 2016 to 2018, with perhaps a slight increase near the end of the 2019 season. This will be the time series that we will use for the remainder of the report.
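The home-minus-away adjustment described above can be sketched in a few lines. This is only an illustration in Python rather than the R used for the report, and the miss rates, the `home` and `away` dictionaries, and the `home_minus_away` helper are all hypothetical; they are not the report's data or code.

```python
from statistics import mean

# Hypothetical per-game miss rates on offspeed pitches, keyed by season.
# These numbers are made up for illustration; they are NOT the report's data.
home = {
    2016: [0.36, 0.31, 0.40],
    2017: [0.28, 0.33, 0.25],
}
away = {
    2016: [0.34, 0.30, 0.35],
    2017: [0.31, 0.29, 0.33],
}

def home_minus_away(home, away):
    """Subtract each season's mean away miss rate from every home game of that season."""
    adjusted = []
    for season in sorted(home):
        away_mean = mean(away[season])
        adjusted.extend(rate - away_mean for rate in home[season])
    return adjusted

series = home_minus_away(home, away)
print([round(x, 3) for x in series])
```

A negative value means the team missed less often in that home game than it did on average on the road that season.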
There is a clear trend in our time series. We can model the data using polynomial regression with the residuals modeled as an ARMA process: \[Y_n = P(n) + \eta_n\] where \(P\) is a polynomial and \(\eta_n\) is a stationary ARMA(\(p,q\)) model.
We first must fit a polynomial to the data. Given the nature of the trend, I will fit a 3rd degree polynomial to the series.
##
## Call:
## lm(formula = missRate ~ poly(game, 3))
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.136977 -0.038090 -0.000454  0.037490  0.168324
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -0.004077   0.002932  -1.391   0.1652
## poly(game, 3)1 -0.248173   0.054609  -4.545 7.64e-06 ***
## poly(game, 3)2  0.127292   0.054609   2.331   0.0203 *
## poly(game, 3)3  0.120442   0.054609   2.206   0.0281 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05461 on 343 degrees of freedom
## Multiple R-squared: 0.08277, Adjusted R-squared: 0.07474
## F-statistic: 10.32 on 3 and 343 DF, p-value: 1.615e-06
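All three polynomial terms are significant at the 5% level, and we can see that the polynomial follows the trend of the series, although the adjusted R-squared of about 0.07 shows that the trend explains only a small share of the game-to-game variation. Now we can look at the residuals to see whether it is appropriate to fit an ARMA model to them.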
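As a quick arithmetic check on the regression summary, the t values and the adjusted R-squared can be recomputed from the printed numbers. The sketch below is in Python rather than R and uses only values copied from the output; the number of games (347) is not stated in the report and is inferred here from the residual degrees of freedom.

```python
import math

# (estimate, standard error) pairs copied from the lm() summary above.
coefficients = {
    "(Intercept)":    (-0.004077, 0.002932),
    "poly(game, 3)1": (-0.248173, 0.054609),
    "poly(game, 3)2": (0.127292, 0.054609),
    "poly(game, 3)3": (0.120442, 0.054609),
}
r_squared = 0.08277
residual_se = 0.05461
residual_df = 343
n_predictors = 3
# Assumption: n = residual df + number of fitted coefficients (inferred, not stated).
n_obs = residual_df + n_predictors + 1

# t value = estimate / standard error.
t_values = {name: est / se for name, (est, se) in coefficients.items()}

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df

# Cross-check of n_obs: with an orthogonal polynomial basis, the intercept's
# standard error should be close to residual_se / sqrt(n_obs).
intercept_se_check = residual_se / math.sqrt(n_obs)

for name, t in t_values.items():
    print(f"{name:16s} t = {t:7.3f}")
print(f"adjusted R-squared = {adj_r_squared:.5f}")
print(f"intercept s.e. check = {intercept_se_check:.6f}")
```

The recomputed values agree with the printed summary up to rounding, which also supports the inferred series length of 347 games.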
The residuals do not appear to have a trend over time. The QQ-plot shows that they approximately follow a normal distribution. From the ACF plot, we see that only one of the 25 lags has a significant autocorrelation, while the rest are insignificant; at the 5% level we would expect roughly one such lag by chance alone. The residuals therefore look consistent with a stationary process, and we can proceed to fitting an ARMA model to them.
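To make the ACF reading concrete, here is a minimal Python sketch (not the report's R code) of the sample autocorrelation function and the usual approximate 95% band of ±1.96/√n. The input is simulated white noise standing in for the regression residuals, and the series length of 347 is inferred from the regression output rather than stated in the report.

```python
import math
import random

def sample_acf(x, max_lag):
    """Sample autocorrelations r_0..r_max_lag (standard biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    c0 = sum(v * v for v in centered) / n
    return [
        sum(centered[t] * centered[t + h] for t in range(n - h)) / n / c0
        for h in range(max_lag + 1)
    ]

n_obs = 347                        # assumption: inferred from 343 residual df + 4 coefficients
bound = 1.96 / math.sqrt(n_obs)    # approximate 95% band under white noise

# Stand-in for the regression residuals: simulated white noise, NOT the report's data.
random.seed(1)
noise = [random.gauss(0, 0.055) for _ in range(n_obs)]
acf = sample_acf(noise, 25)
outside = [h for h in range(1, 26) if abs(acf[h]) > bound]

print(f"band = +/-{bound:.3f}")
print("lags outside the band:", outside)
print("expected by chance among 25 lags:", 25 * 0.05)
```

Since about 25 × 0.05 = 1.25 lags are expected to cross the band even for pure white noise, a single significant lag is not strong evidence of remaining structure.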
We will consider all ARMA(\(p,q\)) models where \(0 \leq p,q \leq 4\). To start, we can find the AIC values of each of these models when fit to the residuals.
| | MA0 | MA1 | MA2 | MA3 | MA4 |
|---|---|---|---|---|---|
| AR0 | -1033.12 | -1032.26 | -1030.27 | -1030.34 | -1028.78 |
| AR1 | -1032.24 | -1030.26 | -1033.16 | -1032.48 | -1030.52 |
| AR2 | -1030.32 | -1033.58 | -1033.85 | -1030.59 | -1038.15 |
| AR3 | -1030.19 | -1032.47 | -1030.60 | -1028.59 | -1026.60 |
| AR4 | -1028.26 | -1030.50 | -1038.12 | -1037.88 | -1040.87 |
The lowest AIC value is from the ARMA(4,4) model. However, there are cases where the AIC increases by more than 2 when one parameter is added, for example from ARMA(2,2) to ARMA(2,3). Because these models are nested, adding a parameter cannot lower the maximized log likelihood, so the AIC should rise by at most 2; larger increases indicate problems in maximizing the likelihood function for some models. Also, an ARMA(4,4) model is rather complex, which can lead to issues such as non-invertibility and numerical instability. Among the simpler models (those with \(p, q \leq 2\)), the ARMA(2,2) model has the lowest AIC. This model achieves a reasonable balance between fit and simplicity, so we will select it as the model for our data. We can find the parameters of this model.
##
## Call:
## arima(x = errors, order = c(2, 0, 2), method = "ML")
##
## Coefficients:
##          ar1      ar2      ma1     ma2  intercept
##       0.1361  -0.9419  -0.1872  0.9668     0.0000
## s.e.  0.0611   0.0695   0.0674  0.0361     0.0028
##
## sigma^2 estimated as 0.002869: log likelihood = 522.93, aic = -1033.85
We can write this model as: \[X_n = 0.14X_{n-1} - 0.94X_{n-2} + \epsilon_n - 0.19\epsilon_{n-1} + 0.97\epsilon_{n-2},\] where \(\epsilon_n \sim N(0,0.0029)\) and 0.0029 is the estimated innovation variance.
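Returning briefly to the model selection step, the checks on the AIC table can be reproduced mechanically. The Python sketch below (again not the report's R code) uses only numbers copied from the table and from the `arima()` output; the count of six parameters for ARMA(2,2) (two AR, two MA, intercept, variance) is my assumption about how the AIC was computed.

```python
# AIC values copied from the table above; aic[p][q] is the ARMA(p, q) fit.
aic = [
    [-1033.12, -1032.26, -1030.27, -1030.34, -1028.78],
    [-1032.24, -1030.26, -1033.16, -1032.48, -1030.52],
    [-1030.32, -1033.58, -1033.85, -1030.59, -1038.15],
    [-1030.19, -1032.47, -1030.60, -1028.59, -1026.60],
    [-1028.26, -1030.50, -1038.12, -1037.88, -1040.87],
]

# Model with the lowest AIC overall.
best = min(
    ((p, q) for p in range(5) for q in range(5)),
    key=lambda pq: aic[pq[0]][pq[1]],
)

# Adding one parameter to a nested model should raise the AIC by at most 2.
# Flag every step in the table that violates this (rounded to two decimals,
# the precision of the table).
suspect = []
for p in range(5):
    for q in range(5):
        for p2, q2 in ((p + 1, q), (p, q + 1)):
            if p2 < 5 and q2 < 5:
                jump = round(aic[p2][q2] - aic[p][q], 2)
                if jump > 2:
                    suspect.append(((p, q), (p2, q2), jump))

# AIC = -2 * log likelihood + 2 * k. Assumption: k = 6 for ARMA(2,2).
aic_22 = -2 * 522.93 + 2 * 6

print("lowest AIC at ARMA", best)
for small, large, jump in suspect:
    print(f"ARMA{small} -> ARMA{large}: AIC rises by {jump}")
print(f"recomputed ARMA(2,2) AIC = {aic_22:.2f}")
```

The flagged steps are the ones where the numerical optimizer most likely stopped short of the true maximum, which is the reason for treating the larger models with caution.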
To make sure that we can use the ARMA(2,2) model, we should run some diagnostics. First, we can check the roots of the AR and MA polynomials.
## [1] "AR Polynomial Roots"
## [1] 1.030373 1.030373
## [1] "MA Polynomial Roots"
## [1] 1.017025 1.017025
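The values printed above are the moduli of the roots. Both roots of each polynomial fall outside the unit circle, so our model is both causal and invertible, although only narrowly, since the moduli are close to 1. Also note that, apart from the intercept, every coefficient is more than two standard errors away from 0, so the approximate 95% confidence intervals for the AR and MA coefficients do not include 0.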
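Both checks can be reproduced from the printed coefficients. The following is a minimal Python sketch, not the report's R code; `quadratic_roots` is a helper written for this illustration, and the intervals use the normal approximation estimate ± 1.96 standard errors.

```python
import cmath

def quadratic_roots(c0, c1, c2):
    """Roots of c0 + c1*z + c2*z**2 = 0, allowing complex values."""
    disc = cmath.sqrt(c1 * c1 - 4 * c2 * c0)
    return ((-c1 + disc) / (2 * c2), (-c1 - disc) / (2 * c2))

# (estimate, standard error) pairs copied from the arima() output above.
estimates = {
    "ar1": (0.1361, 0.0611),
    "ar2": (-0.9419, 0.0695),
    "ma1": (-0.1872, 0.0674),
    "ma2": (0.9668, 0.0361),
    "intercept": (0.0000, 0.0028),
}
ar1, ar2 = estimates["ar1"][0], estimates["ar2"][0]
ma1, ma2 = estimates["ma1"][0], estimates["ma2"][0]

# AR polynomial: 1 - ar1*z - ar2*z^2.  MA polynomial: 1 + ma1*z + ma2*z^2.
ar_moduli = [abs(z) for z in quadratic_roots(1.0, -ar1, -ar2)]
ma_moduli = [abs(z) for z in quadratic_roots(1.0, ma1, ma2)]

# Approximate 95% intervals and whether each one excludes zero.
excludes_zero = {}
for name, (est, se) in estimates.items():
    lower, upper = est - 1.96 * se, est + 1.96 * se
    excludes_zero[name] = lower > 0 or upper < 0

print("AR root moduli:", [round(m, 4) for m in ar_moduli])
print("MA root moduli:", [round(m, 4) for m in ma_moduli])
print("interval excludes 0:", excludes_zero)
```

Each polynomial has a pair of complex conjugate roots, which is why the two moduli within each pair are identical.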
We can now look at a plot of the residuals of the model.
We can see that the residuals look similar to white noise. There is no noticeable trend or change in variability over time. To check these assumptions, we can look at a QQ-plot of the residuals and run a Shapiro-Wilk test for normality.
##
## Shapiro-Wilk normality test
##
## data: arma22$residuals
## W = 0.99449, p-value = 0.2474
The QQ-plot shows that the distribution of the residuals is close to normal, apart from a somewhat heavy left tail. However, the p-value from the Shapiro-Wilk test is 0.2474, which is greater than 0.05, so we fail to reject the hypothesis that the residuals are normally distributed.
Lastly, we will look at the ACF of the residuals to check for any remaining autocorrelation.
All but two lags are insignificant, so the residuals are reasonably consistent with white noise.
If we are able to find any seasonality in the data, we may be able to use a SARMA model to better model the series and ensure that all lags have insignificant autocorrelations. We can start by looking at a smoothed periodogram of the series in order to find the dominant frequency.
The dominant frequency is about 0.05 cycles per game, which corresponds to a period of about 1/0.05 = 20 games. After experimenting with seasonal models using this information, I was unable to find a model which performed significantly better than the non-seasonal model. We should not necessarily expect any seasonality in the data, and the period of 20 games has no fundamental meaning with regard to a baseball season. Generally in an MLB season, two teams will play each other for 3-5 games in a row before playing against a different team. However, there is likely no relation between how well an opposing team pitched and the order in which the Astros played against such a team, so any seasonality which may exist is probably due to chance. For these reasons, I will stick with the original ARMA(2,2) model.
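The step from dominant frequency to period can be illustrated with a small example. The Python sketch below computes a raw (unsmoothed) periodogram, which is a simplification of the smoothed periodogram used in the report, and applies it to a synthetic series with a built-in 20-game cycle; the series and the `periodogram` helper are assumptions made for illustration only.

```python
import cmath
import math

def periodogram(x):
    """Raw periodogram at the Fourier frequencies k/n for k = 1..n//2."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    result = []
    for k in range(1, n // 2 + 1):
        dft = sum(
            v * cmath.exp(-2j * math.pi * k * t / n)
            for t, v in enumerate(centered)
        )
        result.append((k / n, abs(dft) ** 2 / n))
    return result

# Synthetic series with a 20-game cycle; NOT the report's data.
series = [math.sin(2 * math.pi * t / 20) for t in range(200)]

# The dominant frequency is where the periodogram peaks; period = 1 / frequency.
dominant_freq, _ = max(periodogram(series), key=lambda fp: fp[1])
period = 1 / dominant_freq

print(f"dominant frequency = {dominant_freq:.3f} cycles per game")
print(f"period = {period:.1f} games")
```

On real data the peak is rarely this clean, which is one reason to smooth the periodogram before reading off a dominant frequency.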
Now that we have finalized our model, we can try to forecast the performance of the Houston Astros for the upcoming 2020 season. I will run the forecast for the next 50 games.
The forecast predicts that the home-minus-away swing-and-miss rate on offspeed pitches by the Houston Astros will increase at the beginning of the 2020 season. This is likely due to the upward trend at the end of the 2019 season. Since the players on the team may change between seasons, this forecast may not be very accurate. However, it is consistent with an increase in miss rate when no sign stealing occurs.
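For intuition about what the ARMA part contributes to such a forecast, the point forecasts of a zero-mean ARMA(2,2) follow a simple recursion in which every future innovation is replaced by its expected value of 0. The Python sketch below uses the fitted coefficients from this report, but the starting values `last_x` and `last_eps` are hypothetical, and it covers only the residual process, not the polynomial trend that is added on top of it.

```python
def arma22_forecast(last_x, last_eps, ar, ma, horizon):
    """Point forecasts for a zero-mean ARMA(2,2); future innovations are set to 0.

    last_x   = (x_{n-1}, x_n): the two most recent values of the series
    last_eps = (eps_{n-1}, eps_n): the two most recent innovations
    """
    x = list(last_x)
    eps = list(last_eps)
    forecasts = []
    for _ in range(horizon):
        nxt = ar[0] * x[-1] + ar[1] * x[-2] + ma[0] * eps[-1] + ma[1] * eps[-2]
        x.append(nxt)
        eps.append(0.0)  # expected value of every future innovation
        forecasts.append(nxt)
    return forecasts

# Coefficients copied from the arima() output in this report.
ar = (0.1361, -0.9419)
ma = (-0.1872, 0.9668)

# Hypothetical starting values, chosen only for illustration.
last_x = (0.02, -0.03)
last_eps = (0.01, -0.02)

forecasts = arma22_forecast(last_x, last_eps, ar, ma, horizon=50)
print([round(f, 4) for f in forecasts[:5]])
print("largest |forecast|, games 41-50:", round(max(abs(f) for f in forecasts[40:]), 4))
```

Because the process is causal, the ARMA forecasts oscillate and shrink toward zero as the horizon grows, so over 50 games the overall direction of the forecast is driven mainly by the extrapolated polynomial trend.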
We have been able to model the batting behavior of the Houston Astros over a period during which they began to steal the signs of opposing pitchers. The results suggest that knowing which pitch was going to be thrown allowed the Houston batters to show better batting discipline and swing and miss at fewer pitches. In this sense, their system did have a tangible impact on the game. The extent of this impact is still unclear. We do not know whether this decrease in miss rate actually helped the Astros win more games. It is also possible that the decrease in miss rate is due to other factors which we did not consider.
If we looked at statistics other than miss rate, we could potentially strengthen or weaken the evidence for the impact the sign stealing had on games. The performance of other teams in the league could also be considered. If similar trends are found in teams which did not steal signs, then the benefit of sign stealing may be lower than what our analysis suggests.
Many baseball fans have been calling for the Houston Astros to be stripped of their 2017 World Series title, and we have shown that their complaints may have merit.
I also referenced the lecture slides for methods and code.