Since the invention of motion pictures, film has steadily gained popularity throughout the world. The global film industry continues to grow: as of 2020 it was valued at roughly $239.4 billion, a figure expected to reach $318.2 billion by 2025 and $410.6 billion by 2030 [1]. Even through tumultuous times, the film industry has reliably drawn in substantial revenue.
Although the value of the global film industry has increased greatly over time, and is expected to continue doing so, it is unclear how that value has shifted across its various income sources. With the rise of streaming services such as Netflix, Hulu, HBO Max, and others, movies are more accessible than ever in the home. Patrons no longer need to purchase DVDs or attend movie theaters to gain access to feature films. The goal of this work is to analyze whether this increased at-home accessibility has led to a decrease in movie theater attendance, and hence a decrease in box office revenues for new films. We approach this question from several angles, most importantly the seasonality that arises as popular movies are released in similar time frames, such as the holiday season or the start of summer.
In order to answer this question we use The Movies Dataset from Kaggle [2]. It contains movie information from The Movie Database (TMDB), a collaborative online movie database with information entered by users according to guidelines [3] and accompanied by a free API, matched with rating information from GroupLens, a research lab in the Department of Computer Science and Engineering at the University of Minnesota [4]. The dataset contains data on 45,466 movies (i.e. our observations) released between 1874 and 2020. Because the data only extend to 2020, we do not need to account for any effects the pandemic may have had on the film industry.
We concern ourselves only with the movies released between 1991 and 2015, a subset containing 27,115 observations. Given the collaborative nature of TMDB, we further restrict this subset to movies that were released to the public, earned more than $100 million in revenue, have a non-zero budget entry, and received more than 10 votes contributing to their rating. This focuses the analysis on major feature films, discarding any movie that might be made with non-monetary intentions, including film festival entries, small-scale projects, and low-budget “direct to DVD” movies. As a result, we are left with 3,683 movies between the years 1991 and 2015, averaging 153.46 movies per year.
Using this subset, we construct a time series of movie revenues by taking the mean revenue of the movies released in each month and then taking its logarithm. We use the mean rather than the total monthly revenue to filter out the effect of the number of movies released that month, focusing instead on the particular effect that being released in a given month has on a movie's revenue. Formally, index each month by \(i=v+u/12\), where \(u\) is the month and \(v\) is the year, and let \(X^j_i\) be the revenue of the \(j\)th movie released in that month. We model the average revenue of movies in a given month and year, \[R_i = \frac{\sum_{j = 1}^{M_i}X^{j}_i}{M_i},\] whose sample version is \[r^*_i = \frac{\sum_{j = 1}^{m_i}x^{*j}_i}{m_i},\] where \(m_i\) is the number of movies released in month \(u\) of year \(v\). By construction \(R_i\) is an averaged random variable. We are aware that this construction might create a non-constant variance problem, especially if the number of movie releases varies greatly across months: if the revenue of a movie at index \(i\) has a distribution with mean \(\mu_i\) and variance \(\sigma^2\), then \(E[R_i] = \mu_i\) and \(Var[R_i] = \frac{\sigma^2}{m_i}\).
We will comment on this issue in the modeling diagnostics section.
The decision to take logarithms of the mean revenue \(R_i\) is motivated partially by numerical convergence issues in the model fitting process and partially by ease of interpretation, as the model coefficients can then be read as percent changes in mean revenue rather than absolute dollar changes. The plot of the time series, together with linear and quadratic trends estimated by least squares, can be seen below:
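As a minimal sketch of this construction (assuming the filtered data sit in a data frame `movies` with hypothetical columns `release_date` and `revenue`; the names and details are illustrative, not the exact code used):

```r
library(dplyr)

# Build the monthly index i = v + u/12 and the log mean revenue series
monthly <- movies %>%
  mutate(year  = as.numeric(format(release_date, "%Y")),
         month = as.numeric(format(release_date, "%m")),
         index = year + month / 12) %>%
  group_by(index) %>%
  summarise(log_mean_rev = log(mean(revenue)), .groups = "drop")

# Linear and quadratic trends estimated by least squares
fit_lin  <- lm(log_mean_rev ~ index, data = monthly)
fit_quad <- lm(log_mean_rev ~ index + I(index^2), data = monthly)

plot(monthly$index, monthly$log_mean_rev, type = "l",
     xlab = "Year", ylab = "Log mean monthly revenue")
lines(monthly$index, fitted(fit_lin),  col = "red")
lines(monthly$index, fitted(fit_quad), col = "blue")
```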
Looking at the time series of monthly movie revenue, we observe that mean monthly movie revenue exhibits a positive trend, suggesting the use of regression with ARMA errors to incorporate both the trend and the time dependence. In selecting the trend regressor, there is very little difference between the quadratic and linear fits seen above, so for simplicity we use a linear trend.
Next, the autocorrelation function (ACF) of the log mean movie revenue, plotted below, shows clear oscillatory behavior with peaks occurring roughly every 12 months, that is, yearly peaks in movie revenue. Although there are smaller peaks at 6-month intervals, the peaks at 12-month intervals are much more substantial.
A smoothed periodogram of the data supports the same conclusion. After testing several smoothing factors, we produced the periodogram below. Its first major peak falls at a frequency of approximately 0.085 cycles per month, which corresponds to a period of approximately 12 months. Intuitively this makes sense, as most major blockbuster films are released during the summer months (June-August) [5], when school-aged children are on break and people generally have more free time for activities such as going to the movie theater. Historically, the highest-grossing films have been released in this time frame, so the seasonality is unsurprising. The smaller peak at a 6-month period is likely due to increased movie attendance during the holiday season at the end of the year, but because it is much less pronounced than the 12-month peak, we do not analyze it further. Based on this evidence, we continue the analysis assuming a yearly seasonality in the data.
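A sketch of how such a periodogram can be produced (the `spans` smoothing values are illustrative choices, not necessarily the ones used here):

```r
# Monthly series with unit time = 1 month, so the frequency axis of the
# periodogram is in cycles per month; an annual cycle sits near 1/12 ~ 0.083.
rev_ts <- ts(monthly$log_mean_rev)
spec <- spectrum(rev_ts, spans = c(5, 5), main = "Smoothed periodogram")
abline(v = 1/12, lty = "dotted")   # expected annual frequency
spec$freq[which.max(spec$spec)]    # frequency of the dominant peak
```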
Our preliminary results indicate two things that we will have to incorporate into our model:

1. There is a clear trend in our data, as seen in the time series plot of log mean monthly movie revenue. Our results indicate the trend is better modeled as linear, since a quadratic trend might over-fit the data.
2. There is a 12-month (1-year) seasonality in the data, as indicated by both the autocorrelation plot and the periodogram. While we might also consider the smaller 6-month seasonality, the annual seasonality makes more sense from a theoretical standpoint.
Under these considerations we adopt a signal plus colored noise framework, modeling the noise with a Seasonal AutoRegressive Moving Average (SARMA) model.
Signal plus noise models are among the most common models in statistics, serving as a generalization of regression. A general signal plus noise model is given by \(Y_n=\mu_n+\eta_n\) and consists of two parts: the signal, given by the mean function \(\mu_n\), and the noise, given by a zero-mean stochastic process \(\{\eta_n\}\). When the noise \(\{\eta_n\}\) is assumed to consist of uncorrelated Gaussian random variables (equivalently, independent ones, since for Gaussian random variables independence is equivalent to being uncorrelated), we have the special case of the signal plus white noise model commonly used in regression settings. In our case the seasonality in the data violates the uncorrelatedness (i.e. independence) assumption, so we must use a signal plus colored noise model, in which the dependence in the noise is modeled explicitly. We model the noise with a SARMA model and the signal, the mean function, with a classical regression framework. Such a model is also referred to as regression with SARMA errors.
A SARMA model is a special case of the broader AutoRegressive Moving Average (ARMA) framework. An ARMA(p,q) model for a stationary time series is defined as:
\[Y_n = \mu + \phi_1(Y_{n-1} - \mu) + \dots + \phi_p(Y_{n-p} - \mu) + \epsilon_n + \psi_1\epsilon_{n-1} + \dots + \psi_q\epsilon_{n-q}\]
where \(\mu\) is the mean of \(Y_n\) and \(\{\epsilon_n \}\) is a Gaussian white noise process with \(\epsilon_n\sim N(0,\sigma^2)\). An ARMA(p,q) model consists of two components: an order-p autoregressive (AR) component with coefficients \(\phi_1,\dots,\phi_p\), and an order-q moving average (MA) component with coefficients \(\psi_1,\dots,\psi_q\). Using the backshift operator \(B\), the AR and MA components are succinctly written in terms of the polynomials \(\phi(B)= 1-\phi_1B-\phi_2B^2-\dots -\phi_pB^p\) and \(\psi(B)= 1+\psi_1B+\psi_2B^2+\dots +\psi_qB^q\), aptly called the AR(p) and MA(q) polynomials. Using these polynomials, the ARMA(p,q) model can be written in its compact form:
\[\phi(B)(Y_n-\mu)=\psi(B)\epsilon_n.\]
The ARMA framework can be extended to incorporate seasonality by adding multiplicative AR and MA polynomials for the seasonal terms, giving the Seasonal AutoRegressive Moving Average (SARMA) model. The \(SARMA(p,q)\times(P,Q)_{S}\) model is a special case of the ARMA model of the form \[\phi(B)\Phi(B^{S})(Y_n-\mu)=\psi(B)\Psi(B^{S})\epsilon_n\]
where \(\Phi(B^{S})\) and \(\Psi(B^{S})\) are the Seasonal AutoRegressive (SAR) and Seasonal Moving Average (SMA) polynomials, with the exponent \(S\) denoting the seasonal period. In the context of monthly data with annual seasons we have the \(SARMA(p,q)\times(P,Q)_{12}\) model of the form
\[\phi(B)\Phi(B^{12})(Y_n-\mu)=\psi(B)\Psi(B^{12})\epsilon_n\]
where the AR and MA polynomials are factored into monthly polynomials in \(B\) and annual (seasonal) polynomials in \(B^{12}\). Further details about ARMA and, by extension, SARMA models can be found in Chapter 3 of Shumway and Stoffer [6] and in Lectures 4-6 of E.L. Ionides [7].
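As a concrete illustration of the multiplicative structure, expanding the simplest seasonal case, \(SARMA(1,0)\times(1,0)_{12}\), shows how the monthly and seasonal AR terms interact:
\[(1-\phi_1 B)(1-\Phi_1 B^{12})(Y_n-\mu) = \epsilon_n \quad\Longleftrightarrow\quad Y_n-\mu = \phi_1(Y_{n-1}-\mu) + \Phi_1(Y_{n-12}-\mu) - \phi_1\Phi_1(Y_{n-13}-\mu) + \epsilon_n.\]
The cross term \(\phi_1\Phi_1\) at lag 13 is what distinguishes the multiplicative SARMA specification from simply adding a lag-12 term to an AR model.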
After modeling the noise, we model the signal \(\mu_n\) with an appropriate mean function. In our case we use a linear regression with year as the covariate, giving a mean function of the form \(\mu_n=\beta_1 \cdot Year_n\), where \(Year\) is the index used to generate the time series, \(Year= v+\frac{u}{12}\), with \(u\) the month and \(v\) the year of the movie's release, matching the index \(i\) defined in the Data section. Note that the intercept term is already incorporated in the \(\mu\) term on the left-hand side of the \(SARMA(p,q)\times(P,Q)_{12}\) model and is consequently not included in the mean function.
Combining the signal and the noise components, we have the signal plus colored noise model \(R_i=\beta_1 \cdot Year_i+\eta_i\), where the noise \(\eta_i\) is given by the \(SARMA(p,q)\times(P,Q)_{12}\) model \(\phi(B)\Phi(B^{12})(R_n-\mu_n)=\psi(B)\Psi(B^{12})\epsilon_n\). Within this framework, we use model selection by AIC to choose the model that generalizes best.
Note that the number of candidate regression with \(SARMA(p,q)\times(P,Q)_{12}\) errors models grows multiplicatively in the maximum orders of the AR, MA, SAR, and SMA polynomials. Given this multiplicative nature of the selection problem, we constrain ourselves to \(SARMA(p,q)\times(1,1)_{12}\) models, applying model selection only to the AR(p) and MA(q) polynomials. The decision to fix the SAR(P) and SMA(Q) polynomials at order one is motivated partially by the desire for a simpler model that avoids numerical issues in parameter estimation, and partially by our preliminary results favoring the use of both SAR and SMA polynomials of order 1.
Likewise, we constrain our mean function, the regression, to contain only the covariate \(Year\) rather than higher-order polynomials in \(Year\). This is motivated by our preliminary data analysis: in the time series plot, the quadratic trend in \(Year\) estimated by OLS is close to the linear trend. These results are further supported by an initial analysis using both \(Year\) and \(Year^2\) (the second-order polynomial of \(Year\)) as covariates, in which the coefficient of \(Year^2\) is statistically significant but practically insignificant. Furthermore, among the models compared, those with \(Year\) tended to fit better than those with \(Year^2\).
Under this general framework we fit \(SARMA(p,q)\times(1,1)_{12}\) models for each combination of p and q ranging from 0 to 5, resulting in \(6\times6=36\) different models. The model parameters are estimated by maximum likelihood. To compare these models we use Akaike's Information Criterion (AIC), an estimate of prediction error that balances the opposing forces of model fit and model complexity. The AIC is defined as \(AIC=-2\cdot \ell(\theta^*)+2D\), where \(\ell(\theta^*)\) is the maximized log likelihood of the model and \(D\) is the number of parameters. When comparing models by AIC, the model with the smaller AIC is preferred.
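A sketch of the model search (assuming the series `rev_ts` and index `year` from the earlier sketches; `arima` is the base-R fitting routine, and fits that fail are recorded as NA):

```r
# Grid search over SARMA(p,q)x(1,1)_12 models, comparing by AIC
aic_table <- matrix(NA, 6, 6,
                    dimnames = list(paste0("AR", 0:5), paste0("MA", 0:5)))
year <- monthly$index  # trend covariate

for (p in 0:5) {
  for (q in 0:5) {
    fit <- try(arima(rev_ts, order = c(p, 0, q),
                     seasonal = list(order = c(1, 0, 1), period = 12),
                     xreg = year), silent = TRUE)
    if (!inherits(fit, "try-error")) aic_table[p + 1, q + 1] <- fit$aic
  }
}
round(aic_table, 2)
```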
The AIC values of the \(SARMA(p,q)\times(1,1)_{12}\) models for all combinations of p and q between 0 and 5 are displayed in the table below. From this table we select the models with the smallest AIC and verify that the roots of the model polynomials do not cancel and lie outside the unit circle.
| | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
|---|---|---|---|---|---|---|
| AR0 | 356.89 | 358.74 | 360.13 | 359.50 | 361.44 | 363.09 |
| AR1 | 358.73 | 359.69 | 360.93 | 361.47 | 355.28 | 357.17 |
| AR2 | 360.04 | 361.05 | 357.72 | 361.16 | 363.23 | 361.84 |
| AR3 | 359.79 | 361.79 | 363.96 | 362.98 | 363.71 | 365.06 |
| AR4 | 361.79 | 363.79 | 365.74 | 365.02 | 356.16 | 360.97 |
| AR5 | 363.33 | 365.72 | 363.36 | NA | NA | 353.22 |
Based on this procedure, we select the \(SARMA(0,0)\times(1,1)_{12}\) model: white noise at the monthly level, with seasonal AR and MA terms at period 12. The model formula can be written as follows:
\[(1-\Phi B^{12})(R_n-\mu-\beta \cdot Year_n) = (1+\Psi B^{12})\epsilon_n\]
The fitted time series vs actual time series can be seen in the graph below.
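A sketch of how the selected model can be fit and this comparison produced (same assumed objects as above):

```r
# Final model: regression on Year with SARMA(0,0)x(1,1)_12 errors
fit <- arima(rev_ts, order = c(0, 0, 0),
             seasonal = list(order = c(1, 0, 1), period = 12),
             xreg = year)

# Fitted values are the observations minus the one-step residuals
plot(year, rev_ts, type = "l", xlab = "Year",
     ylab = "Log mean monthly revenue")
lines(year, rev_ts - residuals(fit), col = "red")
legend("topleft", legend = c("actual", "fitted"),
       col = c("black", "red"), lty = 1)
```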
Shown below are the statistics for the parameter estimates.
| | coef | se | Z score | p-val | lb | ub |
|---|---|---|---|---|---|---|
| sar1 | 0.980 | 0.0092 | 110.0 | 0.0e+00 | 0.970 | 1.000 |
| sma1 | -0.760 | 0.0570 | -13.0 | 0.0e+00 | -0.870 | -0.650 |
| intercept | -57.000 | 13.0000 | -4.4 | 1.3e-05 | -83.000 | -32.000 |
| year | 0.038 | 0.0066 | 5.8 | 0.0e+00 | 0.025 | 0.051 |
From the table we obtain the point estimates \(\hat{\Phi} = 0.98\), \(\hat{\Psi} = -0.76\), \(\hat{\mu} = -57\), and \(\hat{\beta} = 0.038\), where lb and ub are the lower and upper bounds of the 95% confidence intervals (the 0.025 and 0.975 quantiles, respectively).
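As a concrete check, the bounds follow from the usual normal approximation \(\hat{\theta} \pm 1.96\,\widehat{se}(\hat{\theta})\); for the trend coefficient, \[\hat{\beta} \pm 1.96 \times 0.0066 = 0.038 \pm 0.013 \approx (0.025,\ 0.051),\] matching the lb and ub columns of the table.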
All parameters have significant p-values by the Z-test. However, the SARMA coefficients indicate root problems, so confidence intervals obtained from the Fisher information approximation may not be reliable. Here we check the roots for causality and invertibility:
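A sketch of this check (assuming the fitted object `fit` from the earlier sketch; causality and invertibility require all roots of the seasonal polynomials in \(B\) to lie outside the unit circle):

```r
# Roots (in B) of the seasonal polynomials 1 - Phi*B^12 and 1 + Psi*B^12
Phi <- coef(fit)["sar1"]
Psi <- coef(fit)["sma1"]
sar_roots <- polyroot(c(1, rep(0, 11), -Phi))  # coefficients in increasing order
sma_roots <- polyroot(c(1, rep(0, 11),  Psi))
print(min(Mod(sar_roots)))  # = (1/Phi)^(1/12), barely above 1
print(min(Mod(sma_roots)))  # = (1/|Psi|)^(1/12)
```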
## [1] "sar root:"
## [1] 1.001336
## [1] "sma root:"
## [1] 1.023199
These roots lie barely outside the unit circle, indicating near-violations of causality and invertibility. One could address this by dropping either \(\Phi\) or \(\Psi\); after attempting this, we decided against it, as the log likelihood of the model dropped significantly. We therefore continue with this model but, to obtain more accurate confidence intervals, perform a simulation study.
We follow the bootstrap algorithm outlined in Chapter 5 of the lecture notes [7], simulating from the fitted model and re-estimating the parameters on each simulated series.
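A sketch of the parametric bootstrap under stated assumptions (since \(SARMA(0,0)\times(1,1)_{12}\) is an ARMA model with single AR and MA terms at lag 12, `arima.sim` can generate the noise directly; the number of replications and the seed are illustrative):

```r
set.seed(531)
n_sim  <- 500
theta  <- coef(fit)      # sar1, sma1, intercept, year
sigma2 <- fit$sigma2
n      <- length(rev_ts)

boot_est <- matrix(NA, n_sim, length(theta),
                   dimnames = list(NULL, names(theta)))

for (b in 1:n_sim) {
  # SARMA(0,0)x(1,1)_12 noise: AR and MA coefficients at lag 12 only
  noise <- arima.sim(n = n,
                     model = list(ar = c(rep(0, 11), theta["sar1"]),
                                  ma = c(rep(0, 11), theta["sma1"])),
                     sd = sqrt(sigma2))
  y_sim <- theta["intercept"] + theta["year"] * year + noise
  refit <- try(arima(y_sim, order = c(0, 0, 0),
                     seasonal = list(order = c(1, 0, 1), period = 12),
                     xreg = year), silent = TRUE)
  if (!inherits(refit, "try-error")) boot_est[b, ] <- coef(refit)
}

# Empirical 95% confidence intervals from the simulation estimates
apply(boot_est, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
```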
In the plot above, the histogram shows the empirical distribution of the simulation estimates, the red line marks the original point estimate from the SARMA model, and the blue lines mark the bounds of the \(95\%\) confidence interval. The quadratic approximation underlying the Fisher confidence interval is not very accurate for the SAR1 coefficient, whose estimate has a left-skewed distribution, but it is accurate for the SMA1 coefficient and the trend estimate.
After fitting our model, we perform diagnostic checks to verify model assumptions. The main assumption we need to check is that the errors, as approximated by the residuals, are normally (Gaussian) distributed with mean 0 and constant variance. To do so we will look at the time series of the residuals and the normal Quantile-Quantile plot of the residuals.
The time series plot of the residuals can be seen above. The residuals are centered at 0, satisfying our zero-mean assumption. However, it is clear from the plot that the magnitude (absolute value) of the residuals generally decreases as the year increases, exhibiting a “funnel shape.” This violates the constant variance assumption. As discussed in the Data section, heteroskedasticity (non-constant variance) is expected in our data and is a direct, undesired consequence of how we constructed the time series: since each observation is the mean of the movie revenues for a given month, its variance depends on how many movies meeting our filtering criteria were released in that month. The number of movies released each month is plotted below.
As the plot shows, the number of movies released per month is, on average, increasing over time, indicating that the decreasing magnitude of the residuals can be attributed to the increasing number of movies each month.
The next assumption we check is that the errors, estimated by the residuals, are normally distributed. To do so we plot the normal quantile-quantile (QQ) plot of the residuals, with the theoretical quantiles of a standard normal on the horizontal axis and the empirical quantiles of the residuals on the vertical axis.
The normal QQ plot, seen above, indicates that the empirical quantiles of the residuals closely match those of the normal, with some deviations in the tails where the empirical quantiles exceed the theoretical ones, suggesting that the residuals may have slightly heavier tails than the normal distribution. The deviation in the lower tail is minor and can be attributed to statistical randomness, while the deviation in the upper tail is more pronounced, though still modest.
Finally, we look into whether the error terms in our model, as estimated by the residuals, are independent. We check this assumption by plotting the autocorrelation function of the residuals up to a 240-month lag (i.e., 20 years).
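A sketch of the three residual diagnostics used in this section (same assumed `fit` object as above):

```r
# Residual diagnostics for the fitted model
res <- residuals(fit)

plot(year, res, type = "l", xlab = "Year", ylab = "Residuals")
abline(h = 0, lty = "dashed")        # check: zero mean, funnel shape?

qqnorm(res); qqline(res)             # check: normality of residuals

acf(res, lag.max = 240, main = "ACF of residuals")  # check: independence
```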
The autocorrelation function stays within the 95% confidence band at all lags, indicating that our residuals are independent across time lags.
Overall, the model assumptions for regression with \(SARMA(0,0)\times(1,1)_{12}\) errors hold, with the exception of the constant variance assumption. As discussed above, the heteroskedasticity in our data is an unwanted consequence of how we constructed the time series and is consequently hard to eliminate.
In this section we conduct a robustness check on the trend in our data by repeating our analysis after adjusting movie revenues for inflation.
According to the community guidelines for entering data into TMDB, revenues must be quoted in nominal terms. This suggests that part of the trend we capture, the increase in log mean revenue over time, is simply inflation: the increase in prices in an economy due to an increasing money supply. To determine whether monthly movie revenues are increasing in real terms (i.e., after adjusting for inflation), we repeat our analysis on inflation-adjusted revenues.
We use the Consumer Price Index (CPI) data from FRED (Federal Reserve Economic Data), maintained by the Research Division of the Federal Reserve Bank of St. Louis [8]. The CPI is a measure of inflation obtained from the cost of a fixed basket of goods representative of the needs of an average consumer [9]. Unlike other inflation measures, the CPI captures inflation as felt by consumers (hence the name) rather than by the overall economy. Since movie revenues are inherently tied to whether audiences can afford to watch movies, the CPI should be a more appropriate measure of inflation than the alternatives.
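A sketch of the adjustment (assuming a hypothetical data frame `cpi` with columns `index`, the same year-plus-month/12 index, and `cpi`, the monthly CPIAUCSL values, and that `movies` carries the `index` column from the earlier construction):

```r
library(dplyr)

# Base CPI: latest available month (here, end of 2015)
cpi_base <- cpi$cpi[which.max(cpi$index)]

# Real (2015-dollar) revenue = nominal revenue * CPI_base / CPI_month.
# Joining on a floating-point key may need rounding in practice.
movies_real <- movies %>%
  left_join(cpi, by = "index") %>%
  mutate(revenue_real = revenue * cpi_base / cpi)

# The log mean monthly series and model fits are then rebuilt
# from revenue_real exactly as before.
```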
Using the CPI series from FRED (CPIAUCSL) [8], we adjust the movie revenues to be quoted in 2015 dollars and repeat our analysis. We again select the \(SARMA(0,0)\times(1,1)_{12}\) model; the main difference is that the trend, the coefficient of \(Year\), is 0.014 for the inflation-adjusted revenue rather than 0.038 for nominal revenue. This difference is consistent with our expectations, as the 0.024 drop in the \(Year\) coefficient corresponds to the average annual inflation in the United States between 1991 and 2015 [8].
| | coef | se | Z score | p-val |
|---|---|---|---|---|
| sar1 | 0.9840 | 0.00948 | 104.000 | 0.0000000 |
| sma1 | -0.7570 | 0.05850 | -12.900 | 0.0000000 |
| intercept | -9.3500 | 13.20000 | -0.706 | 0.4800287 |
| year | 0.0139 | 0.00661 | 2.100 | 0.0354437 |
The goal of this report was to model movie revenue over time, exploring how monthly revenue has changed as movies have become increasingly accessible. Based on the above analysis, the best-fitting model for these data is the regression with \(SARMA(0,0)\times(1,1)_{12}\) errors model. The year variable was found to be a significant trend term, and the data follow a seasonal pattern of roughly 12 months per cycle. These yearly oscillations align well with expectations, given the high release volume of big-name movies in the summer months. The results hold for both nominal and real (inflation-adjusted) revenue.
As the diagnostic plots above show, the model fits the data well. One surprising result is that movie revenue remained fairly stable, with a slight positive trend, from 1991 to 2015. Our original hypothesis was that revenue would decrease with the great rise in at-home movie availability, first through DVDs and then through streaming services. Surprisingly, the opposite is true: box office revenues slightly increased over the time frame of this data set. That said, there are several plausible explanations. First, many “cinematic universes” with ever-growing fan bases exist, such as the Marvel Cinematic Universe, Star Wars, James Bond, and Harry Potter; the growing number of fans who closely follow these releases is a likely source of increased revenue. Another plausible explanation is that, through the study of audience psychology, filmmakers and marketers have learned how to appeal to large groups of people; by learning what piques viewers' interest, they can draw larger crowds and generate greater revenue.
There are many possibilities for future research with this dataset. One interesting comparison would be the cross-covariance between monthly revenue and GDP, as we suspect people are more likely to go to the movies as their income increases. Another possible direction would be to focus on cinematic-universe movies and explore their trends. Lastly, studying the trend in return on investment (RoI), rather than revenue, may provide different insights.
References.