Currently, the United States is in the midst of the Democratic primary season, in which voters will choose their preferred nominee to face incumbent Republican president Donald Trump. This primary cycle has been contentious and polarizing, with a heavy emphasis on the idea of “electability”, which many attribute to the remarkable success of Joe Biden on Super Tuesday and the recent withdrawal of one-time front-runner Elizabeth Warren.
One contributing factor to this question of electability is the national approval rating of Donald Trump. Drops in approval coinciding with Democratic debates or other significant events during this primary season could indicate that moderate Republicans or independent voters are leaning toward a Democratic vote in November, and could play a strategic role in swaying voters’ choices in the upcoming primaries, including Michigan’s on March 10th.
In this analysis, we will attempt to answer the following questions:
Approval ratings data were obtained from fivethirtyeight and contain daily estimates of national approval ratings from 3/11/2017 through 3/6/2020. Details of fivethirtyeight’s estimation of national approval ratings are provided here. These data provide approval estimates for both voters and the general public, but here we will focus on modeling approval among voters only. To fit our model we will hold out the March 2020 approval ratings and use forecasting accuracy on that period to assess our final time series model.
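The train/holdout split described above can be sketched as follows. This is an illustrative Python sketch (the analysis itself is in R), using placeholder values in place of the real fivethirtyeight estimates:

```python
import numpy as np

# Placeholder daily series spanning the fivethirtyeight sample
# (3/11/2017 through 3/6/2020); real values would come from their data.
dates = np.arange(np.datetime64("2017-03-11"), np.datetime64("2020-03-07"))
approval = np.full(dates.size, 42.0)   # stand-in for the voter-approval estimates

# Hold out March 2020 for forecast validation; fit on everything before it.
cutoff = np.datetime64("2020-03-01")
train, holdout = approval[dates < cutoff], approval[dates >= cutoff]
```

The holdout then covers 3/1/2020 through 3/6/2020, and forecast accuracy is judged against those six days.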
Here, we plot the approval ratings data both as the raw time series and as the first-order differenced time series. We can see that the raw time series does not appear stationary, but instead shows an approximately cubic temporal trend. On the other hand, the first-order differencing transformation does indeed seem to make the data appear stationary. We will keep this in mind when making our model choices in the next section.
When observing the first-order difference series of the approval ratings, we notice quite a large variation right at the beginning of the time series, potentially indicating that the first few days of Trump’s administration were volatile in terms of public opinion. To avoid complications arising from this spuriously high variability, we cut the first 5 timestamps from the series. The truncated difference series now appears to have both a stationary mean and variance, which is desirable for ARIMA modeling.
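The differencing-plus-truncation step can be sketched as follows (a Python sketch on a synthetic stand-in series, since the report’s own code is in R):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: a slowly trending series plus noise, like the raw ratings.
y = 44.0 - 0.01 * np.arange(300) + rng.normal(0.0, 0.3, 300)

dy = np.diff(y)   # first-order difference, length n - 1
dy = dy[5:]       # drop the first 5 volatile observations, as in the text
```

Note that differencing already shortens the series by one, so dropping 5 more points leaves n - 6 observations.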
Looking at the sample autocorrelation function (ACF) plots for the raw and differenced data, we see further evidence in support of utilizing the first-order differencing transformation, as the raw data are significantly autocorrelated at all lags <= 30 (and likely much further).
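For reference, the sample ACF that R’s `acf` plots can be computed directly; a minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation function at lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([1.0 if k == 0 else np.dot(x[:-k], x[k:]) / denom
                     for k in range(max_lag + 1)])
```

Applied to a strongly trending series, this yields autocorrelations near 1 at small lags, which is the behavior seen in the raw approval data.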
The raw, unadjusted approval ratings time series suggests a cubic temporal trend. We will first explore a third-order polynomial regression with ARIMA errors to model our time series. We begin by fitting a cubic OLS model to the data:
##
## Call:
## lm(formula = approve_estimate ~ day + I(day^2) + I(day^3), data = trump_voters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9862 -0.9158 -0.1480 1.1347 3.1531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.447e+01 1.638e-01 271.49 <2e-16 ***
## day -2.676e-02 1.288e-03 -20.78 <2e-16 ***
## I(day^2) 5.835e-05 2.717e-06 21.47 <2e-16 ***
## I(day^3) -3.302e-08 1.622e-09 -20.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.353 on 1096 degrees of freedom
## Multiple R-squared: 0.3599, Adjusted R-squared: 0.3581
## F-statistic: 205.4 on 3 and 1096 DF, p-value: < 2.2e-16
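An analogous cubic fit can be sketched with numpy’s polynomial tools. This Python sketch uses synthetic data whose trend coefficients are loosely based on the fitted values above; they are illustrative, not the report’s estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
day = np.arange(1100, dtype=float)
# Synthetic series with a cubic trend loosely based on the fitted coefficients.
trend = 44.5 - 2.7e-2 * day + 5.8e-5 * day ** 2 - 3.3e-8 * day ** 3
y = trend + rng.normal(0.0, 1.3, day.size)

coef = np.polyfit(day, y, deg=3)      # returns highest-degree coefficient first
fitted = np.polyval(coef, day)
```

The recovered intercept and residual spread should closely match the generating values, mirroring how well the cubic OLS fit captures the real trend.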
Clearly, a third-order polynomial captures the temporal trend in our data quite well. We will now compare AIC values of third-order polynomial regressions with ARIMA(\(p\),1,\(q\)) errors for values of \(p\) and \(q\) from 0 to 4. Based on AIC, we identify ARIMA(3,1,3) as the best model for the errors in our cubic regression. However, when we inspect this model we run into convergence problems (note the NaN standard error below), and the fitted \(\beta\) coefficient on the regression term is exactly 0. Based on these results, we will move forward without the regression terms.
|     | MA0    | MA1    | MA2    | MA3    | MA4    |
|-----|--------|--------|--------|--------|--------|
| AR0 | 424.48 | 426.39 | 427.34 | 428.80 | 430.76 |
| AR1 | 426.38 | 428.21 | 428.94 | 430.80 | 432.76 |
| AR2 | 427.38 | 428.96 | 430.95 | 432.80 | 434.76 |
| AR3 | 428.78 | 430.78 | 432.78 | 419.44 | 421.36 |
| AR4 | 430.75 | 432.75 | 434.75 | 421.34 | 419.28 |
##
## Call:
## arima(x = trump_voters$approve_estimate, order = c(3, 1, 3), xreg = day^3, method = "ML")
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ar3 ma1 ma2 ma3 day^3
## 0.4312 -0.8574 0.5718 -0.4344 0.9033 -0.5791 0
## s.e. 0.8472 0.1490 0.8140 0.8518 0.1360 0.8489 NaN
##
## sigma^2 estimated as 0.08441: log likelihood = -201.72, aic = 419.44
Based on our above observation that transforming the data via first-order differencing enforces stationarity, we will test ARIMA(\(p\),1,\(q\)) models, where \(p\) and \(q\) range from 0 to 4. We will use AIC to determine the most appropriate model (the lower AIC the better).
## Warning in log(s2): NaNs produced
|     | MA0    | MA1    | MA2    | MA3    | MA4    |
|-----|--------|--------|--------|--------|--------|
| AR0 | 422.65 | 424.56 | 425.50 | 426.96 | 428.91 |
| AR1 | 424.55 | 426.38 | 427.09 | 428.95 | 430.91 |
| AR2 | 425.54 | 427.11 | 429.12 | 430.96 | 432.91 |
| AR3 | 426.93 | 428.93 | 430.93 | 417.58 | 419.48 |
| AR4 | 428.91 | 430.91 | 432.91 | 419.49 | 417.63 |
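As a sanity check on one cell of this table, the ARIMA(0,1,0) AIC has a closed form: with no AR or MA terms, the differenced series is modeled as iid Gaussian noise, so the likelihood depends only on the innovation variance. A Python sketch (the analysis itself uses R’s `arima`; the function name here is illustrative):

```python
import numpy as np

def aic_arima010(y):
    """AIC of an ARIMA(0,1,0) fit: a Gaussian likelihood on the differences.

    With no AR/MA terms, the model says diff(y) ~ iid N(0, sigma^2), so the
    only estimated parameter is sigma^2 (hence the 2 * 1 penalty term).
    """
    dy = np.diff(np.asarray(y, dtype=float))
    n = dy.size
    sigma2 = np.mean(dy ** 2)                 # MLE of the innovation variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + 2 * 1
```

Applied to the real voter series, this formula should reproduce the AR0/MA0 entry (422.65) up to R’s parameter-counting conventions.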
Based on these results we would choose the ARIMA(3,1,3) model (AIC = 417.6). However, when we investigate the roots of the AR and MA polynomials, three concerns arise: one MA root lies inside the unit circle (the model is non-invertible), the complex AR roots sit very close to the unit circle, and the near-equal moduli of AR and MA roots suggest approximate cancellation of factors, a sign of over-parameterization:
## [1] "Roots of the AR Poly: -0.087+1.019i"
## [2] "Roots of the AR Poly: -0.087-1.019i"
## [3] "Roots of the AR Poly: 1.596+0i"
## [1] "Modulus of the AR Poly: 1.022" "Modulus of the AR Poly: 1.022"
## [3] "Modulus of the AR Poly: 1.596"
## [1] "Roots of the MA Poly: -0.71+0i"
## [2] "Roots of the MA Poly: 1.095+1.058i"
## [3] "Roots of the MA Poly: 1.095-1.058i"
## [1] "Modulus of the MA Poly: 0.71" "Modulus of the MA Poly: 1.522"
## [3] "Modulus of the MA Poly: 1.522"
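The root check above can be reproduced with a short numpy sketch. This is Python rather than the R used in the report, `arma_roots` is an illustrative helper, and small discrepancies from the printed values can arise from rounding in the displayed coefficients:

```python
import numpy as np

def arma_roots(coefs):
    """Roots of 1 - c1*z - ... - cp*z^p (the AR convention).

    For an MA polynomial 1 + t1*z + ... + tq*z^q, pass the negated thetas.
    """
    c = np.asarray(coefs, dtype=float)
    return np.roots(np.r_[-c[::-1], 1.0])   # np.roots wants highest degree first

# Coefficients of the fitted ARIMA(3,1,3) reported above.
ar = [0.4312, -0.8574, 0.5718]
ma = [-0.4344, 0.9033, -0.5791]

ar_mod = np.abs(arma_roots(ar))
ma_mod = np.abs(arma_roots([-t for t in ma]))
# Moduli near 1 flag roots close to the unit circle; an MA modulus below 1
# signals non-invertibility.
```

For a causal, invertible ARMA model, every root modulus must exceed 1.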
By studying the AIC table, we do not see evidence of instability in the MLE calculation (no AIC increases greater than 2 when adding a single parameter). However, the three concerns above suggest that a more parsimonious model than the ARIMA(3,1,3) may be preferable. The next-smallest model with the lowest AIC is the ARIMA(0,1,0) model, which is equivalent to a random walk.
We can validate this model choice using the output of the `auto.arima` function in the `forecast` package, which chooses the best model using AIC. This method also chooses the ARIMA(0,1,0) as the most appropriate model.
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,1,2) with drift : 431.4651
## ARIMA(0,1,0) with drift : 427.1169
## ARIMA(1,1,0) with drift : 430.029
## ARIMA(0,1,1) with drift : 429.0343
## ARIMA(0,1,0) : 425.1096
## ARIMA(1,1,1) with drift : 431.8709
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(0,1,0) : 422.6545
##
## Best model: ARIMA(0,1,0)
The ARIMA(0,1,0) model is given by: \[ (1-B)(Y_n - \mu) = \epsilon_n,\] where \(\mu\) = 42.55, \(B\) is the backshift operator, and \(\epsilon_{1:n}\) are iid Gaussian white-noise terms. This random walk model is equivalent to an AR(1) model whose autoregressive coefficient equals 1.
The ACF of the residuals looks overall uncorrelated, with correlations at lags 8 and 20 just barely touching or crossing the significance bounds. The Q-Q plot indicates a significant deviation from normality in the residuals, which is an important assumption of our ARIMA model. This may be expected here based on the study of the roots of the AR and MA terms above. Looking more closely at the residuals, we see that their values are heavily clustered around 0. This is in accordance with the small \(\sigma^2\) estimate of 0.085 from the fitted ARIMA(0,1,0) model; however, it should make us concerned about overfitting.
Here, we plot the fitted values of the ARIMA(0,1,0) model over the original observed approval ratings data, along with a LOESS-smoothed estimation of trend. We see that the fitted values essentially overlay the original curve exactly. Knowing this is a random walk model with a very small \(\sigma^2\) value, this is not surprising. The fact that the random walk was derived as our most appropriate model indicates that day-to-day fluctuations in approval rating for President Trump are essentially attributable to noise.
Based on the following periodograms and their respective significance bars, there are no significant frequencies that would indicate strong cyclical trends in Presidential approval ratings. In the smoothed periodogram, we see some interesting spikes around frequencies of 0.1 and 0.25, which correspond to periods of roughly 10 and 4 days, respectively. These periods may be due to news cycles or perhaps the persistence of positive/negative feedback on current presidential events in social (media) circles. There is also a slight bump in the periodogram at frequency = 0.05 (period = 20 days), which matches up with the slightly significant correlations observed at lag=20 in the first order difference series.
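The raw periodogram underlying these plots is just the squared magnitude of the DFT at the positive Fourier frequencies; a minimal numpy sketch (the report’s own plots come from R):

```python
import numpy as np

def raw_periodogram(x):
    """Squared DFT magnitudes at the positive Fourier frequencies (cycles/obs)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2 / n
    freqs = np.fft.rfftfreq(n)        # 0, 1/n, 2/n, ..., 0.5
    return freqs[1:], power[1:]       # drop the zero (mean) frequency
```

A frequency of 0.1 cycles per day corresponds to the roughly 10-day period noted above; a pure sinusoid at that frequency produces a single sharp peak there.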
Since our model is a random walk without drift, we know it will not be very useful for forecasting: a random walk predicts the last observed value for all \(k\) future steps, with increasingly wide 95% prediction intervals. Using the `forecast` function from the `forecast` package, we can confirm this is the exact behavior of our ARIMA(0,1,0) model (blue line = prediction, red line = ground truth).
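This forecast behavior follows directly from the model: the \(h\)-step-ahead variance of a random walk is \(h\sigma^2\), so the interval half-width grows like \(\sqrt{h}\). A Python sketch (`rw_forecast` is an illustrative helper, not the `forecast` package API):

```python
import numpy as np

def rw_forecast(y, h, sigma):
    """h-step random-walk forecast: flat at the last value, intervals ~ sqrt(h)."""
    steps = np.arange(1, h + 1)
    point = np.full(h, float(y[-1]))
    half = 1.96 * sigma * np.sqrt(steps)   # 95% interval half-width at each step
    return point, point - half, point + half
```

The point forecast is constant while the interval widens monotonically, exactly the fan shape seen in the forecast plot.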
Our goal was to investigate three questions regarding Trump’s presidential approval ratings time series, as laid out in Section (2). Our conclusions for these three questions are: