Introduction

Background

In the spring of 2014, West African communities, and specifically those in the nations of Guinea, Liberia, and Sierra Leone, were struck by the beginning of an Ebola epidemic that would last more than a year and cause over 11,000 total deaths [1]. In the small, coastal nation of Sierra Leone, there were over 14,000 confirmed cases of Ebola and almost 4,000 deaths in less than two years [4]. Due to the rural nature of the country and the lack of health infrastructure, however, there were certainly more cases and deaths than were officially recorded.

An extremely dangerous virus, Ebola is transmitted from animal to human and from human to human through direct contact with or ingestion of bodily fluids from an infected individual. Once someone has been infected by the Ebola virus, they go through an incubation period where they show no symptoms and are not contagious; this period can last anywhere from 2 to 21 days. The symptoms of Ebola are often mistaken for those of other tropical diseases such as malaria or typhoid, making it very difficult to diagnose and control. Another important factor in the spread of Ebola are burial rituals, many of which require close contact with the deceased, whose body can still contain the virus and pass it on [3]. Because of these factors, and the insufficient healthcare infrastructure in many West African communities, Ebola spread rampantly and wreaked havoc for many months

Motivating Question

Many different people are interested in modeling the course of an epidemic in order to gain an understanding of how a disease like Ebola travels through a population. By accurately modeling what happened in Sierra Leone from 2014 through 2016, we can have a better idea of the length of the incubation period, the length of the infectious period, and the rate at which cases are reported/confirmed, all of which would be useful for healthcare workers and policy makers the next time an Ebola epidemic hits. The goal of this analysis, then, is to try to find a model that accurately captures some of these characteristics of the Sierra Leone Ebola epidemic.

It would also be interesting to see how well a model for the Sierra Leone epidemic can generalize to the portions of the epidemic that took place in Guinea and Liberia. A model is much more useful if it can extend accurately to new data sources.

Data Exploration

The data I will be using for this analysis consist of the number of confirmed Ebola cases in Sierra Leone from the end of August 2014 through the end of December 2015 [6]. The data is provided on an approximately weekly basis and is labeled as the number of confirmed Ebola cases in the last 21 days. This format makes the data a little more complicated to work with (weekly reports with information from the past three weeks means cases are counted multiple times), but I will discuss the modeling choices that I made later in this report.

As seen above, this data source misses the very beginning of the epidemic in Sierra Leone, but seems to capture the majority of the epidmic’s course very well. The maximum number of Ebola cases confirmed in the previous three weeks was 1,455 and the minimum was 0, which can be seen at the very end of the epidemic in late 2015. When fitting models, I actually removed the last 5 observations in the dataset due to the gaps in reporting that can be seen in the plot above. Additionally, these values were all zeros and removing them decreases the size of the time series without losing any information – even without those observations, we see the epidemic has effectively ended by the end of 2015. After removing these points, the data contains 54 observations.

Model Fitting

Setting Up the Model

When choosing an appropriate model for this data, I initially wanted to use a basic \(\textrm{SEIR}\) model, that would contain one compartment for each portion of the population: the susceptibles, the incubators (infected, but not yet infectious), the infectious, and the recovered/removed. When working with this model, however, it became clear that adding more compartments would make the model fit the data better. Specifically, I needed to alter how I modeled the incubation stage of Ebola.

As mentioned above, the data is reported on an approximately weekly schedule, but the incubation period can last up to three weeks. This indicated that modeling the incubation period as three separate compartments could be more effective than just using one. Therefore, I am using the following compartment model for these data:

\[ S \longrightarrow E_1 \longrightarrow E_2 \longrightarrow E_3 \longrightarrow I \longrightarrow R\]

The \(\textrm{S}\) compartment contains the susceptible population, \(\textrm{E}_1\) through \(\textrm{E}_3\) contain the people who have been infected but are not yet contagious, \(\textrm{I}\) contains those who are actively infectious, and \(\textrm{R}\) contains those who are no longer infectious. The \(\textrm{R}\) compartment consists of both recovered individuals and those who have died from Ebola and have already been buried. Folks who are dead, but not yet buried will still be in the \(\textrm{I}\) compartment since Ebola is frequently passed during burial rituals.

Since the time frame for the Ebola epidemic is so short, I felt that it was reasonable to consider the population size fixed and not model births into the system. Additionally, there is the potential that the antibodies in the blood of recovered individuals make them immune to re-contracting Ebola, so I did not return them to the susceptible pool [2].

The observed data, \(C(t)\), is the number of cases of Ebola that were confirmed in the last three weeks. Since this number doesn’t correspond directly to one of the compartments in the model, there are multiple ways to think about how to model the observed count. The first method I tried was to model \(C(t)\) as the number of people who moved from \(\textrm{E}_3\) to \(\textrm{I}\) since the last observation; these are the newly infectious individuals. This method suffers, however, from the problem of individuals being triple counted, and the models I built using this assumption failed to capture the structure of the data well.

The second method that I used, and the one that I will report in this analysis, is to model \(C(t)\) as the number of people who are infectious at time \(t\). In other words, this is the number of individuals in the \(\textrm{I}\) compartment. With this setup, it makes sense that individuals are counted in the \(C(t)\) variable for multiple weeks. After symptoms have begun, infected individuals typically die six to sixteen days later or begin to recover seven to fourteen days later [2]. Additionally, neither death nor the beginning of recovery indicate that the individual is no longer infectious – some recovered patients still retain the virus in bodily fluids, and many infections are passed through burial rituals after death. Therefore, a length of three weeks for the infectious period is not unreasonable. Modeling \(C(t)\) in this way provided a model much more consistent with the data than others that I tried.

The Model

Now that the setup of the compartments and the handling of the observed data has been decided, it is time to outline the specific model structure that I used.

Process model:

The state variables \(\textrm{S}\), \(\textrm{E}_1\) through \(\textrm{E}_3\), \(\textrm{I}\), and \(\textrm{R}\) are modeled as follows:

\[\begin{eqnarray} S(t + \delta) &=& S(t) - N_{SE_1}(\delta) \hspace{.5 cm} \textrm{where } N_{SE_1} \sim bin(S(t), 1-e^{-\beta \frac{I(t)}{N}\delta}) \\ E_1(t + \delta) &=& E_1(t) + N_{SE_1}(\delta) - N_{E_1E_2}(\delta) \hspace{.5 cm} \textrm{ where } N_{E_1E_2} \sim bin(E_1(t), 1-e^{-\mu_{EE}\delta}) \\ E_2(t + \delta) &=& E_2(t) + N_{E_1E_2}(\delta) - N_{E_2E_3}(\delta) \hspace{.5 cm} \textrm{ where } N_{E_2E_3} \sim bin(E_2(t), 1-e^{-\mu_{EE}\delta}) \\ E_3(t + \delta) &=& E_3(t) + N_{E_2E_3}(\delta) - N_{E_3I}(\delta) \hspace{.5 cm} \textrm{ where } N_{E_3I} \sim bin(E_3(t), 1-e^{-\mu_{EI}\delta}) \\ I(t + \delta) &=& I(t) + N_{E_3I}(\delta) - N_{IR}(\delta) \hspace{.5 cm} \textrm{ where } N_{IR} \sim bin(I(t), 1-e^{-\mu_{IR}\delta}) \\ R(t + \delta) &=& R(t) + N_{IR}(\delta) \\ \end{eqnarray}\]

In these equations, the quantities of the form \(N_{ij}(\delta)\) indicate the number of individuals who moved from compartment \(i\) to compartment \(j\) in a time step of length \(\delta\). These equations come from using Euler’s method for the \(\textrm{SEIR}\) model outlined above. I chose to use a binomial approximation with exponential probabilities because they seemed to work best with the data.

The parameter \(\beta\) represents the contact rate, or how likely it is for a susceptible to become infected when in contact with an infected individual. The parameter \(N\) corresponds to the total size of the population. I decided to use \(\mu_{EE}\) as the rate of moving from \(\textrm{E}_1\) to \(\textrm{E}_2\) and the rate of moving from \(\textrm{E}_2\) to \(\textrm{E}_3\). Since I am using three compartments for the incubation stage, it seems reasonable to assume that the rate of moving across those compartments should be constant. Finally, I used \(\mu_{EI}\) and \(\mu_{IR}\) to model the rate of moving from \(\textrm{E}_3\) to \(\textrm{I}\) and from \(\textrm{I}\) to \(\textrm{R}\), respectively.

Measurement model:

At time \(t_n\), we observe the number of Ebola cases that were confirmed in the last three weeks, which is denoted as \(C_n\). A case of Ebola will likely not be confirmed until the infected individual begins to show symptoms, which coincides with the beginning of the infectious period. Additionally, the number of confirmed cases will not be the same as the number of actual cases, since not everyone will go to the doctor when sick. Therefore, the observed data (number of reported cases in the past three weeks) given the total number of infectious people will be modeled as follows:

\[ C_n|I(t_n) \sim Neg Bin( \mu = \rho I(t_n), \sigma^2 = \rho I(t_n) + \frac{(\rho I(t_n))^2}{a}) \]

In this expression, \(\rho\) is the reporting rate corresponding to what fraction of actual Ebola cases will be reported and confirmed, and \(a\) is the dispersion parameter that controls the relationship between the mean and the variance of a negative binomial random variable. I chose to use a negative binomial distribution rather than a Poisson because it gives more freedom to better fit the model to the observed data.

Model Refinement and Parameter Estimation

After setting up the model, the next step is to begin fitting it to the observed data. To do this, I began by running a few simulations to see what range of parameter values was appropriate for the data. After finding what I thought were reasonable starting values, I performed a global search for the maximum likelihood estimate with random initializations using iterated filtering. Here is a summary of the likelihood estimates that this procedure produced:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -275.7  -275.6  -272.6  -270.2  -264.2  -262.1 

The maximum likelihood obtained by this global search is \(-262.1\) with a corresponding standard error of \(0.0771\). The parameter estimates that give this maximum likelihood are:

     Beta     mu_EE     mu_EI     mu_IR       rho         N         a 
    0.155     1.723     0.405     0.106     0.904 34561.830    54.039 

To make sure that the modeling was working as anticipated, I used these parameter estimates to simulate data that I could compare to the actually observed data. Ten of these simulations are plotted below in red, while the observed data is plotted in blue. From this plot, it appears that my global search found parameter estimates that fit the behavior of the data quite well.

After locating this MLE when doing a wide, global search and finding that it models the data pretty well, I refined my methods to look for an estimate with even higher likelihood. For this, I again used iterated filtering, but started the algorithm at the global MLE each time, to search in the neighborhood of that estimate. Here is a summary of the likelihood estimates that this localized search produced:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -262.9  -256.7  -255.6  -256.1  -253.7  -252.9 

The maximum likelihood obtained by this local search is \(-252.9\) with a corresponding standard error of \(0.1365\); this increases the likelihood by about 10 over the previous best parameter estimates. The parameter estimates that give this maximum likelihood are as follows:

     Beta     mu_EE     mu_EI     mu_IR       rho         N         a 
     0.55     10.37      1.46      0.51      0.91 485076.35    932.84 

When comparing these parameter estimates to those found by the global search, we find that they are quite different. This seems to indicate that perhaps the global search that I performed above wasn’t able to actually explore the parameter space as well as I had hoped. I will adress some of the limitations of the searches that I did below.

After obtaining the MLE from the local search, I again did some simulations to look at the structure of the data that are produced from these parameter estimates. The results of 10 simulations are plotted below, with the observed data in blue.

This plot seems to indicate that the parameter estimates do produce data that is similar in shape to the observed epidemic data. The main difference seems to be the height of the peak (the maximum number of confirmed Ebola cases), which is higher in all simulations than in the actual data. This might indicate that the estimate of the reporting rate parameter \(\rho\) is too high. Indeed, I was surprised to get an estimated reporting rate as high as 91% for this epidemic.

Interpretation of Parameters

When trying to interpret parameter estimates, there were a couple of interesting features that can be seen in the plots below. The first thing to notice is the relationship between the estimates of \(\beta\) and \(\mu_{IR}\), which have a clear linear relationship; as the estimate for \(\beta\) increases, the estimate for \(\mu_{IR}\) also increases. They are also on a very similar scale. This seems to indicate that the flow of individuals into the incubation and infectious compartments is roughly equal to the flow out, which doesn’t seem like an unreasonable result.

The other interesting aspect of this plot is that the estimate of \(\mu_{EE}\) that gives the maximum likelihood is much higher than the other estimates for that parameter (above 10 vs. about 3-5). I’m not sure if this is something that should be concerning, but it is something to keep in mind when interpreting these parameter estimates. For example, an estimate of 3 for \(\mu_{EE}\) indicates a time of about 2.5 to 3 days spent in each of the first two incubation compartments, while an estimate of 10 for \(\mu_{EE}\) indicates a time of less than one day. The first estimate fits much better with domain knowledge about the length of the incubation period, so I think this parameter estimate would be something to look into more carefully.

Another parameter estimate to consider is that of \(a\), which is the dispersion parameter for the negative binomial measurement model. I used a negative binomial model so that the mean and variance were not constrained to be equal, and \(a\) is the parameter that controls the relationship between mean and variance. The estimate of \(932.84\) for \(a\), however, is very large and indicates that an approximately equal mean and variance is appropriate. Therefore, maybe a Poisson measurement model would be a simpler model that is good for the data and would be something to look into.

I can also look at the rate parameters \(\mu_{EE}\), \(\mu_{EI}\), and \(\mu_{IR}\) to get an idea of how long individuals stay in each phase of the disease. As mentioned above, I am somewhat concerned about the estimate of \(\mu_{EE}\), but by using this value and the one for \(\mu_{EI}\), I can estimate that a patient spends about 6 days in the incubation stage of the disease. This is a little low, but seems not totally inconsistent with the domain knowledge about Ebola. From the estimate of \(\mu_{IR}\), we can estimate people are infectious for about two weeks, which fits very well with what is known about Ebola.

Finally, I want to discuss the treatment of \(N\), the parameter controlling the total population size. Since the population of Sierra Leone during the epidemic is known (about 6.1 million [4]), I could have used that as a fixed value in the model. However, it didn’t make sense to me that everyone in the country was equally susceptible to Ebola, or even that everyone was susceptible at all. What I mean is that there are some regions of the country or communities where no one was exposed to the virus, so they were highly unlikely to get sick. To deal with this, I could have fixed \(N\) to be 6.1 million and then had a very low \(\beta\) value, or I could have left \(N\) as a parameter and let the data determine the best value for it. I chose the latter, which definitely had consequences for the interpretability of the modeling results. With the final parameter estimate of \(N = 485,000\) it appears that only about 8% of the population was susceptible at the time. I’m not sure if this makes sense in the context of the epidemic, and this is a modeling choice that should be investigated further.

Model Diagnostics

Goodness of Fit

In order to test model goodness of fit, I looked at diagnostic plots for the final local iterated filtering search that I did above. This is the search that resulted in my maximum likelihood estimate of \(-252.9\). When looking at these plots, there are some concerning patterns, but also some easy (though time consuming) fixes. In the top plot, we see that effective sample size doesn’t drop too low – the lowestvalues seem to be about \(2,000\). However, there is a lot of oscillation in the effective sample size, meaning it drops to those lower values frequently. Additionally, we see not great convergencve for most of the parameters. The likelihood seems to be flattening out, indicating that we have found a local maximum, but the parameter estimates seem to be spreading out from their original starting point and not stabilizing.

Diagnostic plots that look like this indicate that the iterated filtering needs to be run with more particles and for more iterations. Unfortunately, I was not able to access the amount of computational power that was needed to run the iterated filtering at the optimal algorithmic parameter values. However, the results that I got from the model fitting I was able to do are promising enough to convince me that this was a good model to fit to the data and that a little more time and power would give good final results.

I also attempted to look at the profile likelihood for some of the parameters, but had little success with the amount of computing time available to me. Using more particles and more starting values would have allowed me to get an idea of the likelihood structure for parameters of interest and do inference about the parameter estimates. This is definitely something that should be done in the future, with more time and more available computing power.

Comparing to a Benchmark

The compartment model that I used for this Ebola data is relatively complex and computationally intensive. To justify the use of such a model, I wanted to compare it to a linear Gaussian auto-regression moving-average (ARMA) model that is a basic approach to modeling time series. The best likelihood from an ARMA model came from the \(\textrm{ARMA(3,3)}\) and was \(-299.49\) when fitting 8 parameters. The \(\textrm{SEIR}\) model fit above has 7 parameters and a maximum likelihood estimate of \(-252.9\). This gives me some confidence that the compartment model is doning something competitive with the data and perhaps gives more intuition about the nature of an Ebola epidemic.

Other Epidemic Data

A final goal that I had was to see if the model that was fit to the Sierra Leone data could be generalized to other sources of Ebola epidemic data. For example, the same type of data I used in this analysis is available for the Ebola epidemic in Guinea. From the plot below, we can see that the course of the epidemic in Guinea was similar to the one in Sierra Leone, though it appears to have taken longer to subside.

Although I was not able to actually fit the same \(\textrm{SEIR}\) model to this data, I believe that it would also do reasonably well because the data behaves similarly to the Sierra Leone data. It would be interesting to fit the model and see how the parameter estimates change – this would give an idea of how Ebola epidemics behave differently in different places.

Conclusion and Future Analysis

First, it appears that an \(\textrm{SEIR}\) model with the incubation compartment split into three separate compartments is a good model for Ebola epidemic data of the form used here (weekly case reports from the past three weeks). Although the model I fit would definitely have benefitted from longer run times and more computational power, it seems clear that this type of model fits the data well. In addition to simulations that produced data similar to the observed case counts, some of the parameter estimates also seemed to align with domain knowledge about the spread of Ebola.

If I had more time and more computation power available to me, I would have liked to run the iterated filtering algorithms with more particles and more iterations to get better convergence and better results. I also would have liked to be able to look at profile confidence intervals to do some inference about the parameter estimates. It would also have been interesting to fit the \(\textrm{SEIR}\) model to the Guinea Ebola epidmic and compare that to what happened in Sierra Leone. Finally, I would have liked to include separate compartments in the model for individuals who died and individuals who recovered from the disease. Unfortunately, I did not have enough data on the number of Ebola cases that resulted in death to fit this model.

Sources:

[1] “Ebola (Ebola Virus Disease).” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 22 June 2016, www.cdc.gov/vhf/ebola/outbreaks/2014-west-africa/index.html.

[2] “Ebola Virus Disease.” Wikipedia, Wikimedia Foundation, 24 Apr. 2018, en.wikipedia.org/wiki/Ebola_virus_disease.

[3] “Ebola Virus Disease.” World Health Organization, World Health Organization, www.who.int/mediacentre/factsheets/fs103/en/.

[4] “Ebola Virus Epidemic in Sierra Leone.” Wikipedia, Wikimedia Foundation, 10 Apr. 2018, en.wikipedia.org/wiki/Ebola_virus_epidemic_in_Sierra_Leone.

[5] King, A A, et al. Case Study: Forecasting Ebola. kingaa.github.io/sbied/ebola/ebola.html.

[6] Larsen, Liam. Ebola Cases, 2014 to 2016 | Kaggle, 23 Apr. 2017, www.kaggle.com/kingburrito666/ebola-cases.