1 Background

Tuberculosis (TB) is an infectious disease caused by the bacteria mycobacterium tuberculosis. While TB primarily affects a person’s lungs, it is known to affect other organs such as kidneys and bones. The disease spreads through coming into contact with tiny water droplets from coughing or sneezing of persons with the bacteria. TB is particularly deadly to people with fragile immune systems, these mostly being very young children, the elderly, and persons infected with the Human Immunodeficiency Virus (HIV). It is a serious world health concern, affecting mostly developing nations. Worldwide, about 9 million new cases of TB develop and about 1.7 million deaths are due to TB annually.

Once a person is infected with myobacterium tuberculosis, he then is a carrier of the bacteria, and in most cases does not immediately experience symptoms of the disease. After an incubation period of two to twelve weeks, this inactive TB becomes active TB and only then are they able to spread the disease and experience symptoms such as chest pains, coughing up blood, or extreme fatigue. It should be noted that there exist individuals with inactive TB who may never transition to active TB individuals during their lifetimes.

India is said to be the country that bears the highest economic burden of TB worldwide. In 2014 along, there were 9 million people globally who were reported to have active TB, out of these, 2.2 million cases belonged to India. This has been estimated to cost the Indian economy USD 340B annually. Developing effective strategies to combat the spread of TB is difficult as it is believed to be heavily underdiagnosed in India. To quote an article in http://www.tbfacts.org/tb-india/, “Case notification is estimated to be only 58%. Over one third of cases are not diagnosed, or they are diagnosed but not treated, or they are diagnosed and treated but not notified to the RNTCP (Revised National Tuberculosis Control Program)”. 40% of the Indian population is estimated to carry mycobacterium tuberculosis, although majority of carriers have inactive TB.

Being able to accurately estimate the actual number of TB cases and their trend in a particular region is imperative, especially for a government with scarce resources, and who needs to surgically target its meager economic resources to a few big wins.

2 Previous work on TB Epidemiological Models

It is customary to model the spread of a disease via compartment models. A compartment model divides the whole population under consideration into mutually exclusive and collectively exhaustive compartments. A simple compartment model is what is called an S-I-R model where individuals in a population seen to flow from a stage of being Susceptible, Infected, and then Recovered. Researchers in the past have used this and the following deterministic models to describe the spread of TB among individuals in a population:

  1. Susceptible \(\longrightarrow\) Infected \(\longrightarrow\) Susceptible
  2. Susceptible \(\longrightarrow\) Exposed \(\longrightarrow\) Infected \(\longrightarrow\) Recovered
  3. Susceptible \(\longrightarrow\) Exposed \(\longrightarrow\) Infected \(\longrightarrow\) Quarantined \(\longrightarrow\) Recovered \(\longrightarrow\) Vaccinated \(\longrightarrow\) Susceptible
  4. Susceptible \(\longrightarrow\) Latent TB \(\longrightarrow\) Active TB \(\longrightarrow\) Diagnosed \(\longrightarrow\) Treated \(\longrightarrow\) Rrecovered \(\longrightarrow\) Relapse or Reinfection to Active TB

The shortcoming of an SIR or an SEIR model is that they assume permanent immunity of people who have recovered from an active TB infection. However, this is not the case in practice. Moreover, the models above assume that we have access to fully observed TB case data. However, the reality is that active TB cases are not all reported and as latent TB carriers exhibit no symptoms, its exact number is even more difficult to estimate. Because this framework is deterministic, it does not account for the stochastic variability in the flow of persons from one compartment to another.

Other researchers have applied stochastic methods to analyze TB case data. For example, Kumar et al. [5] fit a SARMA time series model to TB case data gathered from a hospital in Delhi to study the existence of seasonality of the disease in that area. Wah et al. [6] fit a linear regression model with covariates and SARIMA errors to Singapore TB case data and determined subpopulations where TB indicence rates were higher than the rest of the population.

Again, like the deterministic compartment model approach, this assumes that complete TB case data are available. Where TB cases are unreported, time series models constructed using the conventional ARMA framework will not explain the true underlying mechanism that drive the spread of TB. Moreover, there is the issue of interpretability of the class of ARMA models. Once we obtain estimates of the parameters of ARMA models, these do not help in identifying which one of the stages of the spread of TB in the compartment models above have most impact on the spread of TB.

This brings us to the need to model TB case data using a framework that will account for the partially observed TB case data that is typically available in ministries of health and is interpretable.

3 The Data Set

From the paper of Kumar, V, et al. [5], we obtain a data set of the number of pulmonary TB cases reported in January 2007 (month 1) to December 2012 (month 72) at a Directly Observed Treatment Short Course Center of Fatehpur Beri primary health care center located in the southern district of Delhi, India’s capital region located in the northern part of the Indian subcontinent. The treatment center caters to a population of 64,000 and people within this region may seek TB treatment at alternative treatment centers. The treatment center administers spatum smear examinations for acid-fast-bacilli in accordance with RNTCP guidelines and results of these examinations were recorded. A positive case is recorded during the first month a person records a positive smear test result.

set.seed(73800112)
require(pomp)
require(foreach)
require(ggplot2)

delhi <- read.csv("delhi.csv",head=T)
month <- delhi$month
cases <- delhi$cases
plot(month,cases,type='o',
     main="Monthly Reported TB Cases in Delhi (2007-2012)",
     xlab="Months",ylab="No. of Reported Cases")

We observe that the number of TB cases recorded at the treatment center fluctuates and peaks towards the last or the first few months of each year. There also seems to be a decreasing overall trend after six years.

4 Objectives of this Project

In this report, we build a Partially Observed Markov Process (POMP) model to describe partially observed TB case data for the southern Delhi area. A POMP model combines the interpretable elements of the compartment model approach to modeling spread of diseases and incorporates a probabilistic model for the unobserved data, as well as the underlying true stochastic process that generated the case data. With this model, we aim to answer the following questions:

5 The SEIQRV Tuberculosis Transmission Model

We use Mishra and Srivastava’s [4] TB transmission model as the deterministic skeleton of our POMP model. Compared to more common models, it differentiates itself through adding a ‘Quarantined’ compartment. Quarantining TB positive patients is seen to control the rate of transmission ofthe disease and hence provide an improvement in modeling accuracy over the more common compartment models while remaining parsimonious. The following are the compartments:

The flow of people between these compartments are visualized in the diagram below (obtained from figure 1 of [4]). The quantities listed in the arrows below represent rates of transfer of people from one compartment to another while \(\Lambda\), \(d\), and \(\delta\) represent birth rate, natural mortality rate, and mortality rate due to TB, respectively. In this project, we do not concern ourselves with demographics and assume these quantities to be equal to zero.

Caption for the picture.

In the above, the model parameters are specifically defined to be:

\(\beta =\) ineffectivity contact rate

\(\sigma =\) vaccinating rate coefficient for susceptible population

\(\gamma =\) rate of transmission from E to I

\(\alpha =\) rate of transmission from I to Q

\(\phi =\) rate of transmission from I to R

\(\eta =\) rate of recovery

\(\epsilon =\) rate of transfer from R to S

\(\rho =\) rate of transmission from V to S

6 The Deterministic Skeleton of the POMP Model

We use Euler’s Method as to numerically approximate the ordinary differential equations used by Mishra and Srivastava in [4] to describe their SEIQRV model. These approximations are given below:

\(S(t) = S(0) -N_{SE}(t)-N_{SV}(t)+N_{VS}(t)+N_{RS}(t)\)

\(E(t) = E(0) -N_{EI}(t) + N_{SI}(t)\)

\(I(t) = I(0) - N_{IQ}(t) + N_{IR}(t) + N_{EI}(t)\)

\(Q(t) = Q(0) -N_{QR}(t)+N_{IQ}(t)\)

\(R(t) = R(0) - N_{RS}(t) + N_{IR}(t) + N_{QR}(t)\)

\(V(t) = V(0) - N_{VS}(t) + N_{SV}(t)\)

7 Modelling rProcess: Transitions of Individuals Among Compartments

We approximate the counts of the number of persons moving from one compartment to another as follows. For \(t=k\delta\), \(k\) a positive integer and \(\delta\) a fixed positive number:

\(\tilde{N}_{SE}(t+\delta) = \tilde{N}_{SE}(t) + \Delta N_{SE}\) where \(\Delta N_{SE} \sim Binomial(S(t),1-exp(-\beta E(t) \delta))\) and \(\beta = \lambda /N\), \(N\) is the total number of susceptible individuals assumed to be equal to 64,000.

\(\tilde{N}_{EI}(t+\delta) = \tilde{N}_{EI}(t) + \Delta N_{EI}\) where \(\Delta N_{EI} \sim Binomial(S(t),1-exp(-\gamma I(t) \delta))\)

\(\tilde{N}_{IQ}(t+\delta) = \tilde{N}_{IQ}(t) + \Delta N_{IQ}\) where \(\Delta N_{IQ} \sim Binomial(S(t),1-exp(-\alpha Q(t) \delta))\)

\(\tilde{N}_{QR}(t+\delta) = \tilde{N}_{QR}(t) + \Delta N_{QR}\) where \(\Delta N_{QR} \sim Binomial(S(t),1-exp(-\eta R(t) \delta))\)

\(\tilde{N}_{SV}(t+\delta) = \tilde{N}_{SV}(t) + \Delta N_{SV}\) where \(\Delta N_{SV} \sim Binomial(S(t),1-exp(-\sigma V(t) \delta))\)

\(\tilde{N}_{VS}(t+\delta) = \tilde{N}_{VS}(t) + \Delta N_{VS}\) where \(\Delta N_{VS} \sim Binomial(S(t),1-exp(-\rho S(t) \delta))\)

\(\tilde{N}_{RS}(t+\delta) = \tilde{N}_{RS}(t) + \Delta N_{RS}\) where \(\Delta N_{RS} \sim Binomial(S(t),1-exp(-\epsilon S(t) \delta))\)

\(\tilde{N}_{IR}(t+\delta) = \tilde{N}_{IR}(t) + \Delta N_{IR}\) where \(\Delta N_{IR} \sim Binomial(S(t),1-exp(-\phi R(t) \delta))\)

8 Modeling the Measurement Model

We assume that the recorded cases of TB in the data set correspond to the number of individuals that go from compartment E to I, in which they are reported to health authorities. The recorded cases of TB we have thus are a sample of the true number of TB cases in the southern Delhi region. So, we model the case data C as

\(C_t \sim Binomial (H(t) - H(t-1), \psi)\)

We set \(\psi = 0.58\), the estimated TB case notification rate of India.

9 Model Fitting

We are now ready to combine the above elements discussed into a POMP model. We do this in the code below:

##---r process----------------------------------------------------
sir_step <- Csnippet("
  double dN_SE = rbinom(S,1-exp(-lambda*I/N*dt));
  double dN_EI = rbinom(E,1-exp(-gamma*I*dt));
  double dN_IQ = rbinom(I,1-exp(-alpha*Q*dt));
  double dN_QR = rbinom(Q,1-exp(-eta*R*dt));
  double dN_SV = rbinom(S,1-exp(-sigma*V*dt));
  double dN_VS = rbinom(V,1-exp(-rho*S*dt));
  double dN_RS = rbinom(R,1-exp(-epsilon*S*dt));
  double dN_IR = rbinom(I,1-exp(-phi*R*dt));
  S -= dN_SE - dN_SV + dN_VS + dN_RS;
  E += dN_SE - dN_EI;
  I += dN_EI - dN_IQ - dN_IR;
  Q += dN_IQ - dN_QR;
  R += dN_QR - dN_RS + dN_IR;
  V += dN_SV - dN_VS;
  H += dN_EI;
")

##---initialize variables in r process----------------------------
sir_init <- Csnippet("
  S = N-1;
  E = 1;
  I = 0;
  Q = 0;
  R = 0;
  V = 0;
  H = 0;
")

##---create pomp object
seirqv <- pomp(delhi,times = "month", t0=1,
               rprocess=euler.sim(sir_step,delta.t=1/2),
               initializer=sir_init,
               paramnames=c("lambda","gamma","alpha","eta",
                            "sigma","rho","epsilon","phi","N"),
               statenames=c("S","E","I","Q","R","V","H"))

##---add dmeasure and r measure into pomp object------------------
dmeas <- Csnippet("lik = dbinom(cases,H,psi,give_log);")
rmeas <- Csnippet("cases = rbinom(H,psi);")
seirqv <- pomp(seirqv,rmeasure=rmeas,dmeasure=dmeas,
               statenames="H",zeronames="H",paramnames="psi")

10 Results

Mishra and Srivastava in [4] use the parameters \(\lambda = 38,400\), \(\gamma=0.3\), \(\alpha=0.1\), \(\eta=0.4\),\(\sigma=0.3\),\(rho=0.6\),\(\epsilon=0.3\),\(\phi=0.68\) which they solved from their set of differential equations via Runge Kuttamethods to run simulations for their dynamical model. We use these same values as parameter inputs in out POMP model when we simulate from the POMP model and plot the simulated values againstthe data below.

##---simulate from pomp model we created--------------------------
sims <- simulate(seirqv,
                 params=c(N=64000,psi=0.58,
                          lambda=38400,gamma=0.3,alpha=0.1,eta=0.4,
                          sigma=0.3,rho=0.6,epsilon=0.3,phi=0.68),
                 nsim=1000,as=TRUE,include=TRUE)

ggplot(sims,aes(x=time,y=cases,group=sim,color=sim=="data"))+
  geom_line()+guides(color=FALSE)

The above graph shows the simulated values as red lines that do not coincide with the plot of the data in blue. This indicates that either the SEIQRV model is inadequate as a deterministic skeleton to our POMP model, that demigraphics or people migration play a bigger role than thought, or that the parameter values used are grossly inappropriate. Out of the three possibilities mentioned, it is the second mentioned that seems most likely: In 2014, the population of Delhi was 16,787,941 in 2012, while in 2007 the figure was at 14.3 million. The influx of about 2,487,941 people in the six year time span could have affected disease transmission dynamics and increased out simulated estimates the number of TB positive cases in the region.

We evaluate the loglikelihood of the above model with the specified parameters below:

##---evaluate log likelihood--------------------------------------
p <- c(N=64000,psi=0.58, lambda=38400,gamma=0.3,alpha=0.1,eta=0.4,
                          sigma=0.3,rho=0.6,epsilon=0.3,phi=0.68)
t <- c(1:72)
t0 <- 0
x0 <- init.state(seirqv,params=p,t0=t0)
x <- rprocess(seirqv,xstart=x0,times=c(t0,t),params=p,offset=1)
y <- rmeasure(seirqv,params=p,x=x,times=t)
ll <- dmeasure(seirqv,y=y,x=x,times=t,params=p,log=TRUE)
ell <- apply(ll,1,sum)
summary(exp(ell))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1
logmeanexp(ell,se=TRUE)
##    se 
##  0 NA

11 Fitting a SARIMA model

We also fit a SARIMA model with the parameters below chosen by the AIC criterion and perform a forecast 12 months into the future. Model diagnostics below show no autocorrelation among residuals at all lags, a statistically significant fit. Moreover, the model yields a log likelihood of -132.96. However, we must keep in mind that this only models reported cases of TB.

require(astsa)
ts <- sarima(delhi$cases,p=3,d=0,q=3,P=1,D=0,Q=1,S=12)
## initial  value 0.933172 
## iter   2 value 0.755527
## iter   3 value 0.435208
## iter   4 value 0.366123
## iter   5 value 0.252134
## iter   6 value 0.212190
## iter   7 value 0.202171
## iter   8 value 0.190223
## iter   9 value 0.186255
## iter  10 value 0.173477
## iter  11 value 0.168174
## iter  12 value 0.165245
## iter  13 value 0.162790
## iter  14 value 0.155068
## iter  15 value 0.154732
## iter  16 value 0.151564
## iter  17 value 0.149887
## iter  18 value 0.148252
## iter  19 value 0.147333
## iter  20 value 0.145634
## iter  21 value 0.144191
## iter  22 value 0.143074
## iter  23 value 0.141637
## iter  24 value 0.141332
## iter  25 value 0.141257
## iter  26 value 0.141202
## iter  27 value 0.141091
## iter  28 value 0.140781
## iter  29 value 0.140142
## iter  30 value 0.139429
## iter  31 value 0.138758
## iter  32 value 0.137855
## iter  33 value 0.137634
## iter  34 value 0.137620
## iter  35 value 0.137608
## iter  36 value 0.137602
## iter  37 value 0.137581
## iter  38 value 0.137579
## iter  39 value 0.137565
## iter  40 value 0.137563
## iter  41 value 0.137562
## iter  42 value 0.137561
## iter  43 value 0.137559
## iter  44 value 0.137555
## iter  45 value 0.137549
## iter  46 value 0.137542
## iter  47 value 0.137537
## iter  48 value 0.137536
## iter  49 value 0.137536
## iter  50 value 0.137536
## iter  50 value 0.137536
## iter  50 value 0.137536
## final  value 0.137536 
## converged
## initial  value 1.159532 
## iter   2 value 0.884125
## iter   3 value 0.561488
## iter   4 value 0.493776
## iter   5 value 0.478535
## iter   6 value 0.460855
## iter   7 value 0.453789
## iter   8 value 0.446738
## iter   9 value 0.444706
## iter  10 value 0.443923
## iter  11 value 0.442885
## iter  12 value 0.442711
## iter  13 value 0.442581
## iter  14 value 0.442442
## iter  15 value 0.442242
## iter  16 value 0.442109
## iter  17 value 0.442061
## iter  18 value 0.442031
## iter  19 value 0.441963
## iter  20 value 0.441861
## iter  21 value 0.441842
## iter  22 value 0.441833
## iter  23 value 0.441809
## iter  24 value 0.441772
## iter  25 value 0.441529
## iter  26 value 0.441453
## iter  27 value 0.441425
## iter  28 value 0.441398
## iter  29 value 0.441260
## iter  30 value 0.441181
## iter  31 value 0.441085
## iter  32 value 0.441058
## iter  33 value 0.441047
## iter  34 value 0.441038
## iter  35 value 0.441019
## iter  36 value 0.440963
## iter  37 value 0.440887
## iter  38 value 0.440812
## iter  39 value 0.440766
## iter  40 value 0.440754
## iter  41 value 0.440746
## iter  42 value 0.440606
## iter  43 value 0.440353
## iter  44 value 0.440308
## iter  45 value 0.440256
## iter  46 value 0.440151
## iter  47 value 0.439855
## iter  48 value 0.439450
## iter  49 value 0.438916
## iter  50 value 0.437605
## iter  51 value 0.437045
## iter  52 value 0.436021
## iter  53 value 0.435907
## iter  54 value 0.435330
## iter  55 value 0.435327
## iter  56 value 0.433830
## iter  57 value 0.432807
## iter  58 value 0.431847
## iter  59 value 0.431761
## iter  60 value 0.431697
## iter  61 value 0.431457
## iter  62 value 0.431195
## iter  63 value 0.430488
## iter  64 value 0.430042
## iter  65 value 0.429869
## iter  66 value 0.429811
## iter  67 value 0.429729
## iter  68 value 0.429572
## iter  69 value 0.429519
## iter  70 value 0.429501
## iter  71 value 0.429494
## iter  72 value 0.429476
## iter  73 value 0.429451
## iter  74 value 0.429426
## iter  75 value 0.429413
## iter  76 value 0.429409
## iter  77 value 0.429402
## iter  78 value 0.429402
## iter  79 value 0.429398
## iter  80 value 0.429396
## iter  81 value 0.429386
## iter  82 value 0.429374
## iter  83 value 0.429231
## iter  84 value 0.429231
## iter  85 value 0.429203
## iter  86 value 0.429192
## iter  87 value 0.429129
## iter  88 value 0.428993
## iter  89 value 0.428815
## iter  90 value 0.428561
## iter  91 value 0.428357
## iter  92 value 0.428121
## iter  93 value 0.427960
## iter  94 value 0.427896
## iter  95 value 0.427830
## iter  96 value 0.427803
## iter  97 value 0.427786
## iter  98 value 0.427776
## iter  99 value 0.427770
## iter 100 value 0.427759
## final  value 0.427759 
## stopped after 100 iterations

ts$AIC
## [1] 1.794669
ts$fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = xmean, include.mean = FALSE, optim.control = list(trace = trc, 
##     REPORT = 1, reltol = tol))
## 
## Coefficients:
##          ar1      ar2     ar3      ma1     ma2      ma3    sar1     sma1
##       2.2993  -1.8383  0.5095  -2.1226  1.4483  -0.2392  0.9553  -0.4827
## s.e.  0.3028   0.5619  0.2798   0.6428  0.9759   0.3795  0.0391   0.1920
##        xmean
##       5.2694
## s.e.  1.8883
## 
## sigma^2 estimated as 1.724:  log likelihood = -132.96,  aic = 285.92
ts.forecast <- sarima.for(delhi$cases,n.ahead=12,p=2,d=0,q=2,P=1,D=0,Q=1,S=12)

12 Conclusion

The comparison between the POMP model and the SARIMA Model is inconclusive from what we see above. The POMP approach to analyzing this data set is promising, however, in the future, tweaks to the deterministic skeleton are needed to take demographics as well as other factors.

These other factors which affect TB transmission which can be included are: patient behaviour in following through with prescribed medication, patients who self-cure, differentiation of patients with different strains of TB, such as multi-drug resistant TB, and the possibility of co-infection with HIV which further decreases patient survival probability.

13 References

  1. http://www.tbfacts.org/tb-india/
  2. http://www.tbfacts.org/tb-statistics-india/
  3. Mandal, S. and Arinaminpathy, N., “Transmission Modeling and Health Systems: the Case of TB in India”, Int Health 2015; 7: 114-120, http://www.ncbi.nlm.nih.gov/pubmed/25733561
  4. Mishra, B. and Srivastava, J., “Mathematical Model on Pulmonary and Multi-Drug Resistant Tuberculosis Patients with Vaccination”, Journal of the Egyptian Mathematical Society (2014) 22, 311-316, http://www.sciencedirect.com/science/article/pii/S1110256X13000965
  5. Kumar, V, et al, “Seasonality of Tuberculosis in Delhi, India: A Time Series Analysis”, Tuberculosis Research and Treatment (2014), http://dx.doi.org/10.1155/2014/514093
  6. Wah, W. et al, “Time series analysis of demographic and temporal trends of tuberculosis in Singapore”, BMC Public Health (2014), http://bmcpublichealth.biomedcentral.com/articles/10.1186/1471-2458-14-1121