1. Introduction

Currently, the United States is in the midst of the Democratic primary season, in which voters will choose their preferred nominee to face incumbent Republican president Donald Trump. This primary cycle has been contentious and polarizing, with a heavy emphasis on the idea of “electability”, which many attribute to the remarkable success of Joe Biden on Super Tuesday and the recent withdrawal of one-time front-runner Elizabeth Warren.

One contributing factor to this question of electability is the national approval rating of Donald Trump. Drops in approval ratings coinciding with Democratic debates or other significant events during this primary season could be an indication that moderate Republicans or independent voters are leaning towards a Democratic vote in November, and could play a strategic role in swaying voters’ choices in the upcoming primaries, including Michigan’s on March 10th.

2. Questions of interest

In this analysis, we will attempt to answer the following questions:

  1. Can we fit an ARIMA-family time series model to model President Trump’s approval rating?
  2. Is there evidence of strong cyclical patterns in Presidential approval ratings?
  3. Can our model be used to predict future Presidential approval ratings?

3. Data

Approval ratings data were obtained from fivethirtyeight and contain daily estimates of national approval ratings from 3/11/2017 through 3/6/2020. Details of fivethirtyeight’s estimation of national approval ratings are provided in the Resources section. These data provide approval estimates for both voters and the general public, but here we will focus on modeling approval among voters only. To fit our model, we will hold out the approval ratings for March 2020 and use forecasting accuracy on this held-out period to assess our final time series model.
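
A minimal sketch of the data preparation, assuming fivethirtyeight’s topline file layout (columns modeldate, subgroup, and approve_estimate); the actual file name and column names may differ:

```r
library(dplyr)

# Read the fivethirtyeight topline approval file (file name assumed)
approval <- read.csv("approval_topline.csv", stringsAsFactors = FALSE)

trump_voters <- approval %>%
  filter(subgroup == "Voters") %>%                          # voters only, per the text above
  mutate(modeldate = as.Date(modeldate, format = "%m/%d/%Y")) %>%
  arrange(modeldate) %>%
  filter(modeldate < as.Date("2020-03-01")) %>%              # hold out March 2020 for forecasting
  mutate(day = as.integer(modeldate - min(modeldate)) + 1)   # day index used as regressor later
```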

4. Data Exploration

Here, we plot the approval ratings data both as the raw time series and as the first-order differenced time series. We can see that the raw time series does not appear stationary, but rather shows an approximately cubic temporal trend. On the other hand, the first-order differencing transformation does indeed seem to make the data appear stationary. We will keep this in mind when making our model choices in the next section.

When observing the first-order difference series of the approval ratings, we notice quite a large variation right at the beginning of the series, potentially indicating that the first few days of Trump’s administration were somewhat volatile in terms of public opinion. To avoid complications that may arise from this spuriously high variability, we remove the first 5 time points from the series. The truncated approval ratings difference series now appears to have both stationary mean and variance, which is desirable in ARIMA modeling.
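
A sketch of the differencing and truncation step (whether the first points are dropped before or after differencing is a judgment call; variable names are assumptions):

```r
approve   <- trump_voters$approve_estimate
d_approve <- diff(approve)[-(1:5)]   # first difference, dropping the volatile first 5 points

par(mfrow = c(1, 2))
plot(approve, type = "l", xlab = "Day", ylab = "Approval (%)", main = "Raw series")
plot(d_approve, type = "l", xlab = "Day", ylab = "First difference", main = "Differenced series")
```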

Looking at the sample autocorrelation function (ACF) plots for the raw and differenced data, we see further evidence in support of utilizing the first-order differencing transformation, as the raw data are significantly autocorrelated at all lags <= 30 (and likely much further).
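
The ACF plots referenced here can be produced with base R, for example:

```r
# Sample ACFs of the raw and first-differenced series (sketch)
par(mfrow = c(1, 2))
acf(approve,   lag.max = 30, main = "ACF: raw approval")
acf(d_approve, lag.max = 30, main = "ACF: first difference")
```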

5. Model selection

Regression with ARIMA errors

The trend in the unadjusted, raw approval ratings time series suggests a cubic temporal trend. We will first explore a third-order polynomial regression with ARIMA errors to model our time series. We begin by fitting a cubic OLS model to the data:

## 
## Call:
## lm(formula = approve_estimate ~ day + I(day^2) + I(day^3), data = trump_voters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9862 -0.9158 -0.1480  1.1347  3.1531 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.447e+01  1.638e-01  271.49   <2e-16 ***
## day         -2.676e-02  1.288e-03  -20.78   <2e-16 ***
## I(day^2)     5.835e-05  2.717e-06   21.47   <2e-16 ***
## I(day^3)    -3.302e-08  1.622e-09  -20.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.353 on 1096 degrees of freedom
## Multiple R-squared:  0.3599, Adjusted R-squared:  0.3581 
## F-statistic: 205.4 on 3 and 1096 DF,  p-value: < 2.2e-16

All polynomial terms are highly significant, so a third-order polynomial appears to capture the temporal trend in our data quite well. We will now compare AIC values of third-order polynomial regressions with ARIMA(\(p\),1,\(q\)) errors for values of \(p\) and \(q\) from 0 to 4 (a sketch of this grid search follows the output below). Based on AIC, we identify ARIMA(3,1,3) errors as best for our cubic regression model. However, when we look into this fit we run into convergence problems: the standard-error calculation produces NaNs and the regression coefficient is estimated to be exactly 0. Based on these results, we will move forward without the regression terms.

        MA0     MA1     MA2     MA3     MA4
AR0  424.48  426.39  427.34  428.80  430.76
AR1  426.38  428.21  428.94  430.80  432.76
AR2  427.38  428.96  430.95  432.80  434.76
AR3  428.78  430.78  432.78  419.44  421.36
AR4  430.75  432.75  434.75  421.34  419.28
## 
## Call:
## arima(x = trump_voters$approve_estimate, order = c(3, 1, 3), xreg = day^3, method = "ML")
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##          ar1      ar2     ar3      ma1     ma2      ma3  day^3
##       0.4312  -0.8574  0.5718  -0.4344  0.9033  -0.5791      0
## s.e.  0.8472   0.1490  0.8140   0.8518  0.1360   0.8489    NaN
## 
## sigma^2 estimated as 0.08441:  log likelihood = -201.72,  aic = 419.44
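
The AIC grid behind the table above could be computed along the following lines. This is a sketch, not the exact code used: the final call shown above uses xreg = day^3 only, though the full cubic design matrix cbind(day, day^2, day^3) is another option.

```r
# AIC grid for cubic regression with ARIMA(p,1,q) errors (sketch)
day  <- trump_voters$day
xreg <- day^3   # matches the call shown above

aic_xreg <- matrix(NA, 5, 5, dimnames = list(paste0("AR", 0:4), paste0("MA", 0:4)))
for (p in 0:4) {
  for (q in 0:4) {
    fit <- try(arima(trump_voters$approve_estimate, order = c(p, 1, q),
                     xreg = xreg, method = "ML"), silent = TRUE)
    if (!inherits(fit, "try-error")) aic_xreg[p + 1, q + 1] <- fit$aic
  }
}
round(aic_xreg, 2)
```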

ARIMA(p,d,q) Models

Model Fitting

Based on our observation above that first-order differencing makes the data appear stationary, we will fit ARIMA(\(p\),1,\(q\)) models with \(p\) and \(q\) ranging from 0 to 4. We will use AIC to determine the most appropriate model (the lower the AIC, the better); a sketch of this grid search follows the table below.

## Warning in log(s2): NaNs produced
        MA0     MA1     MA2     MA3     MA4
AR0  422.65  424.56  425.50  426.96  428.91
AR1  424.55  426.38  427.09  428.95  430.91
AR2  425.54  427.11  429.12  430.96  432.91
AR3  426.93  428.93  430.93  417.58  419.48
AR4  428.91  430.91  432.91  419.49  417.63
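
This is the same grid search as before, now without regression terms (a sketch):

```r
aic_tab <- matrix(NA, 5, 5, dimnames = list(paste0("AR", 0:4), paste0("MA", 0:4)))
for (p in 0:4) for (q in 0:4) {
  fit <- try(arima(trump_voters$approve_estimate, order = c(p, 1, q), method = "ML"),
             silent = TRUE)
  if (!inherits(fit, "try-error")) aic_tab[p + 1, q + 1] <- fit$aic
}
round(aic_tab, 2)
```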

Based on these results we would choose the ARIMA(3,1,3) model (AIC = 417.6). However, when we investigate the roots of the AR and MA polynomials (printed below, along with a sketch of how they are computed), three concerns arise:

  1. One of the MA roots (0.71) does not fall outside the unit circle, indicating a non-invertible model. This is concerning because non-invertible models give numerically unstable estimates of residuals, and are therefore undesirable.
  2. One of the AR roots (1.022) just barely falls outside the unit circle. This indicates that the model is on the cusp of being non-causal.
  3. The remaining AR and MA roots (1.596 and 1.522, respectively) are very close in magnitude, and therefore could indicate redundancy in the model.
## [1] "Roots of the AR Poly: -0.087+1.019i"
## [2] "Roots of the AR Poly: -0.087-1.019i"
## [3] "Roots of the AR Poly: 1.596+0i"
## [1] "Modulus of the AR Poly: 1.022" "Modulus of the AR Poly: 1.022"
## [3] "Modulus of the AR Poly: 1.596"
## [1] "Roots of the MA Poly: -0.71+0i"    
## [2] "Roots of the MA Poly: 1.095+1.058i"
## [3] "Roots of the MA Poly: 1.095-1.058i"
## [1] "Modulus of the MA Poly: 0.71"  "Modulus of the MA Poly: 1.522"
## [3] "Modulus of the MA Poly: 1.522"

Studying the AIC table, we do not see evidence of instability in the MLE calculation (no increases greater than 2 from adding one parameter). However, the three concerns above indicate that a more reduced model than the ARIMA(3,1,3) may be preferable. The reduced model with the lowest AIC is the ARIMA(0,1,0) model, which is equivalent to a random walk.

We can validate this model choice using the output of the auto.arima function in the forecast package, which chooses the best model using AIC. This method also chooses the ARIMA(0,1,0) as the most appropriate model.
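
The exact call is not shown here, but a call along the following lines produces the trace below (a sketch):

```r
library(forecast)
# trace = TRUE prints the stepwise model search; d = 1 fixes first-order differencing
auto.arima(trump_voters$approve_estimate, d = 1, ic = "aic", trace = TRUE)
```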

## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,1,2) with drift         : 431.4651
##  ARIMA(0,1,0) with drift         : 427.1169
##  ARIMA(1,1,0) with drift         : 430.029
##  ARIMA(0,1,1) with drift         : 429.0343
##  ARIMA(0,1,0)                    : 425.1096
##  ARIMA(1,1,1) with drift         : 431.8709
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(0,1,0)                    : 422.6545
## 
##  Best model: ARIMA(0,1,0)

The ARIMA(0,1,0) model is given by: \[ (1-B)(Y_n - \mu) = \epsilon_n,\] where \(\mu\) = 42.55, \(B\) is the backshift operator, and \(\epsilon_{1:n}\) are iid Gaussian white noise terms. This random walk model is essentially equivalent to an AR(1) model in which the autoregressive coefficient is equal to 1.
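
A minimal sketch of fitting this final model (we assume the quoted \(\mu\) corresponds to the sample mean of the series):

```r
fit010 <- arima(trump_voters$approve_estimate, order = c(0, 1, 0), method = "ML")
fit010$sigma2                          # innovation variance estimate
mean(trump_voters$approve_estimate)    # approximately 42.55 (assumed source of the quoted mu)
```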

Model Diagnostics

The ACF of the residuals looks uncorrelated overall, with correlations at lags 8 and 20 just barely touching/passing the significance bounds. The Q-Q plot indicates a significant deviation from normality in the residuals, which is an important assumption of our ARIMA model. This may be expected here based on the study of the roots of the AR and MA terms above. Looking more closely at the residuals, we see that the residual values are heavily clustered around 0. This is consistent with the small \(\sigma^2\) estimate of 0.085 from the fitted ARIMA(0,1,0) model, though it should also make us wary of overfitting.
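
These diagnostics can be generated along the following lines (a sketch, assuming fit010 holds the ARIMA(0,1,0) fit):

```r
res <- residuals(fit010)
par(mfrow = c(1, 3))
acf(res, lag.max = 30, main = "ACF of residuals")
qqnorm(res); qqline(res)                              # check the normality assumption
plot(res, ylab = "Residual", main = "Residuals over time")
```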

6. Response to Question 1

Here, we plot the fitted values of the ARIMA(0,1,0) model over the original observed approval ratings data, along with a LOESS-smoothed estimation of trend. We see that the fitted values essentially overlay the original curve exactly. Knowing this is a random walk model with a very small \(\sigma^2\) value, this is not surprising. The fact that the random walk was derived as our most appropriate model indicates that day-to-day fluctuations in approval rating for President Trump are essentially attributable to noise.
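
A sketch of the overlay plot described here (the LOESS span is an assumption):

```r
fitted010 <- trump_voters$approve_estimate - residuals(fit010)   # fitted = observed - residual
plot(trump_voters$day, trump_voters$approve_estimate, type = "l", col = "grey50",
     xlab = "Day", ylab = "Approval (%)")
lines(trump_voters$day, fitted010, col = "red")                   # ARIMA(0,1,0) fitted values
lo <- loess(approve_estimate ~ day, data = trump_voters, span = 0.3)
lines(trump_voters$day, predict(lo), col = "blue", lwd = 2)       # LOESS-smoothed trend
```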

7. Response to Question 2

Based on the following periodograms and their respective significance bars, there are no significant frequencies that would indicate strong cyclical trends in Presidential approval ratings. In the smoothed periodogram, we see some interesting spikes around frequencies of 0.1 and 0.25, which correspond to periods of roughly 10 and 4 days, respectively. These periods may be due to news cycles or perhaps the persistence of positive/negative feedback on current presidential events in social (media) circles. There is also a slight bump in the periodogram at frequency = 0.05 (period = 20 days), which matches up with the slightly significant correlations observed at lag=20 in the first order difference series.
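
A sketch of the periodogram computation; we assume the spectra were taken of the differenced series, and the smoothing spans are assumptions:

```r
par(mfrow = c(1, 2))
spectrum(d_approve, main = "Unsmoothed periodogram")
spectrum(d_approve, spans = c(5, 7), main = "Smoothed periodogram")
```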

8. Response to Question 3

Since our model is a random walk without drift, we know that it will not be very useful for forecasting, as random walk models predict the last known value for all \(k\) future values, with increasingly wider 95% confidence intervals. Using the forecast function from the forecast package, we can confirm this is the exact behavior of our ARIMA(0,1,0) model (blue line = prediction, red line = ground truth).
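
A sketch of the forecasting step (h = 6 assumes the held-out window runs from 3/1 through 3/6/2020):

```r
library(forecast)
fc <- forecast(fit010, h = 6)
plot(fc)   # point forecast is flat at the last observed value, with widening intervals
```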

9. Conclusions

Our goal was to investigate three questions with regards to Trump’s Presidential Approval Ratings time series, as laid out in Section (2). Our conclusions for these three questions are:

  1. An ARIMA(0,1,0) model (i.e., a random walk without drift) appears most suitable for these data. We can interpret this as daily approval ratings being well predicted by those of the previous day. The day-to-day changes are clearly driven by outside factors (e.g., policies being passed, stock market fluctuations, or, in Trump’s case, a single tweet), but more data and more sophisticated models would be required to model this phenomenon in a way that could identify the strongest driving factors.
  2. Using spectral decomposition of our time series, we found no significant evidence of cyclical patterns in Trump’s approval ratings. It is worth noting again that the approval ratings used here are aggregated from several different pollsters by fivethirtyeight and represent an estimate of average national approval among voters. If these ratings could be broken down by demographics such as age, race, and party affiliation, an underlying cyclical pattern that is undetectable in the aggregated data might be uncovered.
  3. Our ARIMA(0,1,0) model is, by nature, not useful for forecasting, as it always predicts the last known data point. Employing some of the more sophisticated methodologies described above might make meaningful forecasting of Trump’s approval rating possible. This would be most interesting for the demographic of independent voters or “moderate” Republicans; as Trump continues to swing strongly to the right, these individuals may decide to swing left come Election Day in November.

10. Resources

  1. https://fivethirtyeight.com/features/did-sexism-and-fear-of-sexism-keep-warren-from-winning-the-nomination/
  2. https://projects.fivethirtyeight.com/trump-approval-ratings/
  3. https://fivethirtyeight.com/features/how-were-tracking-donald-trumps-approval-ratings/
  4. 2018 Midterm Project Examples
  5. Course Notes
  6. https://datascienceplus.com/time-series-analysis-using-arima-model-in-r/
  7. https://people.duke.edu/~rnau/Notes_on_the_random_walk_model--Robert_Nau.pdf