Tuberculosis (TB) is an infectious disease caused by the bacteria mycobacterium tuberculosis. While TB primarily affects a person’s lungs, it is known to affect other organs such as kidneys and bones. The disease spreads through coming into contact with tiny water droplets from coughing or sneezing of persons with the bacteria. TB is particularly deadly to people with fragile immune systems, these mostly being very young children, the elderly, and persons infected with the Human Immunodeficiency Virus (HIV). It is a serious world health concern, affecting mostly developing nations. Worldwide, about 9 million new cases of TB develop and about 1.7 million deaths are due to TB annually.
Once a person is infected with myobacterium tuberculosis, he then is a carrier of the bacteria, and in most cases does not immediately experience symptoms of the disease. After an incubation period of two to twelve weeks, this inactive TB becomes active TB and only then are they able to spread the disease and experience symptoms such as chest pains, coughing up blood, or extreme fatigue. It should be noted that there exist individuals with inactive TB who may never transition to active TB individuals during their lifetimes.
India is said to be the country that bears the highest economic burden of TB worldwide. In 2014 along, there were 9 million people globally who were reported to have active TB, out of these, 2.2 million cases belonged to India. This has been estimated to cost the Indian economy USD 340B annually. Developing effective strategies to combat the spread of TB is difficult as it is believed to be heavily underdiagnosed in India. To quote an article in http://www.tbfacts.org/tb-india/, “Case notification is estimated to be only 58%. Over one third of cases are not diagnosed, or they are diagnosed but not treated, or they are diagnosed and treated but not notified to the RNTCP (Revised National Tuberculosis Control Program)”. 40% of the Indian population is estimated to carry mycobacterium tuberculosis, although majority of carriers have inactive TB.
Being able to accurately estimate the actual number of TB cases and their trend in a particular region is imperative, especially for a government with scarce resources, and who needs to surgically target its meager economic resources to a few big wins.
It is customary to model the spread of a disease via compartment models. A compartment model divides the whole population under consideration into mutually exclusive and collectively exhaustive compartments. A simple compartment model is what is called an S-I-R model where individuals in a population seen to flow from a stage of being Susceptible, Infected, and then Recovered. Researchers in the past have used this and the following deterministic models to describe the spread of TB among individuals in a population:
The shortcoming of an SIR or an SEIR model is that they assume permanent immunity of people who have recovered from an active TB infection. However, this is not the case in practice. Moreover, the models above assume that we have access to fully observed TB case data. However, the reality is that active TB cases are not all reported and as latent TB carriers exhibit no symptoms, its exact number is even more difficult to estimate. Because this framework is deterministic, it does not account for the stochastic variability in the flow of persons from one compartment to another.
Other researchers have applied stochastic methods to analyze TB case data. For example, Kumar et al. [5] fit a SARMA time series model to TB case data gathered from a hospital in Delhi to study the existence of seasonality of the disease in that area. Wah et al. [6] fit a linear regression model with covariates and SARIMA errors to Singapore TB case data and determined subpopulations where TB indicence rates were higher than the rest of the population.
Again, like the deterministic compartment model approach, this assumes that complete TB case data are available. Where TB cases are unreported, time series models constructed using the conventional ARMA framework will not explain the true underlying mechanism that drive the spread of TB. Moreover, there is the issue of interpretability of the class of ARMA models. Once we obtain estimates of the parameters of ARMA models, these do not help in identifying which one of the stages of the spread of TB in the compartment models above have most impact on the spread of TB.
This brings us to the need to model TB case data using a framework that will account for the partially observed TB case data that is typically available in ministries of health and is interpretable.
From the paper of Kumar, V, et al. [5], we obtain a data set of the number of pulmonary TB cases reported in January 2007 (month 1) to December 2012 (month 72) at a Directly Observed Treatment Short Course Center of Fatehpur Beri primary health care center located in the southern district of Delhi, India’s capital region located in the northern part of the Indian subcontinent. The treatment center caters to a population of 64,000 and people within this region may seek TB treatment at alternative treatment centers. The treatment center administers spatum smear examinations for acid-fast-bacilli in accordance with RNTCP guidelines and results of these examinations were recorded. A positive case is recorded during the first month a person records a positive smear test result.
set.seed(73800112)
require(pomp)
require(foreach)
require(ggplot2)
delhi <- read.csv("delhi.csv",head=T)
month <- delhi$month
cases <- delhi$cases
plot(month,cases,type='o',
main="Monthly Reported TB Cases in Delhi (2007-2012)",
xlab="Months",ylab="No. of Reported Cases")
We observe that the number of TB cases recorded at the treatment center fluctuates and peaks towards the last or the first few months of each year. There also seems to be a decreasing overall trend after six years.
In this report, we build a Partially Observed Markov Process (POMP) model to describe partially observed TB case data for the southern Delhi area. A POMP model combines the interpretable elements of the compartment model approach to modeling spread of diseases and incorporates a probabilistic model for the unobserved data, as well as the underlying true stochastic process that generated the case data. With this model, we aim to answer the following questions:
Does this model describe the spread of TB well?
Is this model an improvement over conventional time series approaches such as fitting a simple SARMA model of the data to model spread of the disease?
We use Mishra and Srivastava’s [4] TB transmission model as the deterministic skeleton of our POMP model. Compared to more common models, it differentiates itself through adding a ‘Quarantined’ compartment. Quarantining TB positive patients is seen to control the rate of transmission ofthe disease and hence provide an improvement in modeling accuracy over the more common compartment models while remaining parsimonious. The following are the compartments:
The flow of people between these compartments are visualized in the diagram below (obtained from figure 1 of [4]). The quantities listed in the arrows below represent rates of transfer of people from one compartment to another while \(\Lambda\), \(d\), and \(\delta\) represent birth rate, natural mortality rate, and mortality rate due to TB, respectively. In this project, we do not concern ourselves with demographics and assume these quantities to be equal to zero.
In the above, the model parameters are specifically defined to be:
\(\beta =\) ineffectivity contact rate
\(\sigma =\) vaccinating rate coefficient for susceptible population
\(\gamma =\) rate of transmission from E to I
\(\alpha =\) rate of transmission from I to Q
\(\phi =\) rate of transmission from I to R
\(\eta =\) rate of recovery
\(\epsilon =\) rate of transfer from R to S
\(\rho =\) rate of transmission from V to S
We use Euler’s Method as to numerically approximate the ordinary differential equations used by Mishra and Srivastava in [4] to describe their SEIQRV model. These approximations are given below:
\(S(t) = S(0) -N_{SE}(t)-N_{SV}(t)+N_{VS}(t)+N_{RS}(t)\)
\(E(t) = E(0) -N_{EI}(t) + N_{SI}(t)\)
\(I(t) = I(0) - N_{IQ}(t) + N_{IR}(t) + N_{EI}(t)\)
\(Q(t) = Q(0) -N_{QR}(t)+N_{IQ}(t)\)
\(R(t) = R(0) - N_{RS}(t) + N_{IR}(t) + N_{QR}(t)\)
\(V(t) = V(0) - N_{VS}(t) + N_{SV}(t)\)We approximate the counts of the number of persons moving from one compartment to another as follows. For \(t=k\delta\), \(k\) a positive integer and \(\delta\) a fixed positive number:
\(\tilde{N}_{SE}(t+\delta) = \tilde{N}_{SE}(t) + \Delta N_{SE}\) where \(\Delta N_{SE} \sim Binomial(S(t),1-exp(-\beta E(t) \delta))\) and \(\beta = \lambda /N\), \(N\) is the total number of susceptible individuals assumed to be equal to 64,000.
\(\tilde{N}_{EI}(t+\delta) = \tilde{N}_{EI}(t) + \Delta N_{EI}\) where \(\Delta N_{EI} \sim Binomial(S(t),1-exp(-\gamma I(t) \delta))\)
\(\tilde{N}_{IQ}(t+\delta) = \tilde{N}_{IQ}(t) + \Delta N_{IQ}\) where \(\Delta N_{IQ} \sim Binomial(S(t),1-exp(-\alpha Q(t) \delta))\)
\(\tilde{N}_{QR}(t+\delta) = \tilde{N}_{QR}(t) + \Delta N_{QR}\) where \(\Delta N_{QR} \sim Binomial(S(t),1-exp(-\eta R(t) \delta))\)
\(\tilde{N}_{SV}(t+\delta) = \tilde{N}_{SV}(t) + \Delta N_{SV}\) where \(\Delta N_{SV} \sim Binomial(S(t),1-exp(-\sigma V(t) \delta))\)
\(\tilde{N}_{VS}(t+\delta) = \tilde{N}_{VS}(t) + \Delta N_{VS}\) where \(\Delta N_{VS} \sim Binomial(S(t),1-exp(-\rho S(t) \delta))\)
\(\tilde{N}_{RS}(t+\delta) = \tilde{N}_{RS}(t) + \Delta N_{RS}\) where \(\Delta N_{RS} \sim Binomial(S(t),1-exp(-\epsilon S(t) \delta))\)
\(\tilde{N}_{IR}(t+\delta) = \tilde{N}_{IR}(t) + \Delta N_{IR}\) where \(\Delta N_{IR} \sim Binomial(S(t),1-exp(-\phi R(t) \delta))\)We assume that the recorded cases of TB in the data set correspond to the number of individuals that go from compartment E to I, in which they are reported to health authorities. The recorded cases of TB we have thus are a sample of the true number of TB cases in the southern Delhi region. So, we model the case data C as
We set \(\psi = 0.58\), the estimated TB case notification rate of India.
We are now ready to combine the above elements discussed into a POMP model. We do this in the code below:
##---r process----------------------------------------------------
sir_step <- Csnippet("
double dN_SE = rbinom(S,1-exp(-lambda*I/N*dt));
double dN_EI = rbinom(E,1-exp(-gamma*I*dt));
double dN_IQ = rbinom(I,1-exp(-alpha*Q*dt));
double dN_QR = rbinom(Q,1-exp(-eta*R*dt));
double dN_SV = rbinom(S,1-exp(-sigma*V*dt));
double dN_VS = rbinom(V,1-exp(-rho*S*dt));
double dN_RS = rbinom(R,1-exp(-epsilon*S*dt));
double dN_IR = rbinom(I,1-exp(-phi*R*dt));
S -= dN_SE - dN_SV + dN_VS + dN_RS;
E += dN_SE - dN_EI;
I += dN_EI - dN_IQ - dN_IR;
Q += dN_IQ - dN_QR;
R += dN_QR - dN_RS + dN_IR;
V += dN_SV - dN_VS;
H += dN_EI;
")
##---initialize variables in r process----------------------------
sir_init <- Csnippet("
S = N-1;
E = 1;
I = 0;
Q = 0;
R = 0;
V = 0;
H = 0;
")
##---create pomp object
seirqv <- pomp(delhi,times = "month", t0=1,
rprocess=euler.sim(sir_step,delta.t=1/2),
initializer=sir_init,
paramnames=c("lambda","gamma","alpha","eta",
"sigma","rho","epsilon","phi","N"),
statenames=c("S","E","I","Q","R","V","H"))
##---add dmeasure and r measure into pomp object------------------
dmeas <- Csnippet("lik = dbinom(cases,H,psi,give_log);")
rmeas <- Csnippet("cases = rbinom(H,psi);")
seirqv <- pomp(seirqv,rmeasure=rmeas,dmeasure=dmeas,
statenames="H",zeronames="H",paramnames="psi")
Mishra and Srivastava in [4] use the parameters \(\lambda = 38,400\), \(\gamma=0.3\), \(\alpha=0.1\), \(\eta=0.4\),\(\sigma=0.3\),\(rho=0.6\),\(\epsilon=0.3\),\(\phi=0.68\) which they solved from their set of differential equations via Runge Kuttamethods to run simulations for their dynamical model. We use these same values as parameter inputs in out POMP model when we simulate from the POMP model and plot the simulated values againstthe data below.
##---simulate from pomp model we created--------------------------
sims <- simulate(seirqv,
params=c(N=64000,psi=0.58,
lambda=38400,gamma=0.3,alpha=0.1,eta=0.4,
sigma=0.3,rho=0.6,epsilon=0.3,phi=0.68),
nsim=1000,as=TRUE,include=TRUE)
ggplot(sims,aes(x=time,y=cases,group=sim,color=sim=="data"))+
geom_line()+guides(color=FALSE)
The above graph shows the simulated values as red lines that do not coincide with the plot of the data in blue. This indicates that either the SEIQRV model is inadequate as a deterministic skeleton to our POMP model, that demigraphics or people migration play a bigger role than thought, or that the parameter values used are grossly inappropriate. Out of the three possibilities mentioned, it is the second mentioned that seems most likely: In 2014, the population of Delhi was 16,787,941 in 2012, while in 2007 the figure was at 14.3 million. The influx of about 2,487,941 people in the six year time span could have affected disease transmission dynamics and increased out simulated estimates the number of TB positive cases in the region.
We evaluate the loglikelihood of the above model with the specified parameters below:
##---evaluate log likelihood--------------------------------------
p <- c(N=64000,psi=0.58, lambda=38400,gamma=0.3,alpha=0.1,eta=0.4,
sigma=0.3,rho=0.6,epsilon=0.3,phi=0.68)
t <- c(1:72)
t0 <- 0
x0 <- init.state(seirqv,params=p,t0=t0)
x <- rprocess(seirqv,xstart=x0,times=c(t0,t),params=p,offset=1)
y <- rmeasure(seirqv,params=p,x=x,times=t)
ll <- dmeasure(seirqv,y=y,x=x,times=t,params=p,log=TRUE)
ell <- apply(ll,1,sum)
summary(exp(ell))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
logmeanexp(ell,se=TRUE)
## se
## 0 NA
We also fit a SARIMA model with the parameters below chosen by the AIC criterion and perform a forecast 12 months into the future. Model diagnostics below show no autocorrelation among residuals at all lags, a statistically significant fit. Moreover, the model yields a log likelihood of -132.96. However, we must keep in mind that this only models reported cases of TB.
require(astsa)
ts <- sarima(delhi$cases,p=3,d=0,q=3,P=1,D=0,Q=1,S=12)
## initial value 0.933172
## iter 2 value 0.755527
## iter 3 value 0.435208
## iter 4 value 0.366123
## iter 5 value 0.252134
## iter 6 value 0.212190
## iter 7 value 0.202171
## iter 8 value 0.190223
## iter 9 value 0.186255
## iter 10 value 0.173477
## iter 11 value 0.168174
## iter 12 value 0.165245
## iter 13 value 0.162790
## iter 14 value 0.155068
## iter 15 value 0.154732
## iter 16 value 0.151564
## iter 17 value 0.149887
## iter 18 value 0.148252
## iter 19 value 0.147333
## iter 20 value 0.145634
## iter 21 value 0.144191
## iter 22 value 0.143074
## iter 23 value 0.141637
## iter 24 value 0.141332
## iter 25 value 0.141257
## iter 26 value 0.141202
## iter 27 value 0.141091
## iter 28 value 0.140781
## iter 29 value 0.140142
## iter 30 value 0.139429
## iter 31 value 0.138758
## iter 32 value 0.137855
## iter 33 value 0.137634
## iter 34 value 0.137620
## iter 35 value 0.137608
## iter 36 value 0.137602
## iter 37 value 0.137581
## iter 38 value 0.137579
## iter 39 value 0.137565
## iter 40 value 0.137563
## iter 41 value 0.137562
## iter 42 value 0.137561
## iter 43 value 0.137559
## iter 44 value 0.137555
## iter 45 value 0.137549
## iter 46 value 0.137542
## iter 47 value 0.137537
## iter 48 value 0.137536
## iter 49 value 0.137536
## iter 50 value 0.137536
## iter 50 value 0.137536
## iter 50 value 0.137536
## final value 0.137536
## converged
## initial value 1.159532
## iter 2 value 0.884125
## iter 3 value 0.561488
## iter 4 value 0.493776
## iter 5 value 0.478535
## iter 6 value 0.460855
## iter 7 value 0.453789
## iter 8 value 0.446738
## iter 9 value 0.444706
## iter 10 value 0.443923
## iter 11 value 0.442885
## iter 12 value 0.442711
## iter 13 value 0.442581
## iter 14 value 0.442442
## iter 15 value 0.442242
## iter 16 value 0.442109
## iter 17 value 0.442061
## iter 18 value 0.442031
## iter 19 value 0.441963
## iter 20 value 0.441861
## iter 21 value 0.441842
## iter 22 value 0.441833
## iter 23 value 0.441809
## iter 24 value 0.441772
## iter 25 value 0.441529
## iter 26 value 0.441453
## iter 27 value 0.441425
## iter 28 value 0.441398
## iter 29 value 0.441260
## iter 30 value 0.441181
## iter 31 value 0.441085
## iter 32 value 0.441058
## iter 33 value 0.441047
## iter 34 value 0.441038
## iter 35 value 0.441019
## iter 36 value 0.440963
## iter 37 value 0.440887
## iter 38 value 0.440812
## iter 39 value 0.440766
## iter 40 value 0.440754
## iter 41 value 0.440746
## iter 42 value 0.440606
## iter 43 value 0.440353
## iter 44 value 0.440308
## iter 45 value 0.440256
## iter 46 value 0.440151
## iter 47 value 0.439855
## iter 48 value 0.439450
## iter 49 value 0.438916
## iter 50 value 0.437605
## iter 51 value 0.437045
## iter 52 value 0.436021
## iter 53 value 0.435907
## iter 54 value 0.435330
## iter 55 value 0.435327
## iter 56 value 0.433830
## iter 57 value 0.432807
## iter 58 value 0.431847
## iter 59 value 0.431761
## iter 60 value 0.431697
## iter 61 value 0.431457
## iter 62 value 0.431195
## iter 63 value 0.430488
## iter 64 value 0.430042
## iter 65 value 0.429869
## iter 66 value 0.429811
## iter 67 value 0.429729
## iter 68 value 0.429572
## iter 69 value 0.429519
## iter 70 value 0.429501
## iter 71 value 0.429494
## iter 72 value 0.429476
## iter 73 value 0.429451
## iter 74 value 0.429426
## iter 75 value 0.429413
## iter 76 value 0.429409
## iter 77 value 0.429402
## iter 78 value 0.429402
## iter 79 value 0.429398
## iter 80 value 0.429396
## iter 81 value 0.429386
## iter 82 value 0.429374
## iter 83 value 0.429231
## iter 84 value 0.429231
## iter 85 value 0.429203
## iter 86 value 0.429192
## iter 87 value 0.429129
## iter 88 value 0.428993
## iter 89 value 0.428815
## iter 90 value 0.428561
## iter 91 value 0.428357
## iter 92 value 0.428121
## iter 93 value 0.427960
## iter 94 value 0.427896
## iter 95 value 0.427830
## iter 96 value 0.427803
## iter 97 value 0.427786
## iter 98 value 0.427776
## iter 99 value 0.427770
## iter 100 value 0.427759
## final value 0.427759
## stopped after 100 iterations
ts$AIC
## [1] 1.794669
ts$fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), xreg = xmean, include.mean = FALSE, optim.control = list(trace = trc,
## REPORT = 1, reltol = tol))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 sar1 sma1
## 2.2993 -1.8383 0.5095 -2.1226 1.4483 -0.2392 0.9553 -0.4827
## s.e. 0.3028 0.5619 0.2798 0.6428 0.9759 0.3795 0.0391 0.1920
## xmean
## 5.2694
## s.e. 1.8883
##
## sigma^2 estimated as 1.724: log likelihood = -132.96, aic = 285.92
ts.forecast <- sarima.for(delhi$cases,n.ahead=12,p=2,d=0,q=2,P=1,D=0,Q=1,S=12)
The comparison between the POMP model and the SARIMA Model is inconclusive from what we see above. The POMP approach to analyzing this data set is promising, however, in the future, tweaks to the deterministic skeleton are needed to take demographics as well as other factors.
These other factors which affect TB transmission which can be included are: patient behaviour in following through with prescribed medication, patients who self-cure, differentiation of patients with different strains of TB, such as multi-drug resistant TB, and the possibility of co-infection with HIV which further decreases patient survival probability.