Tuberculosis (TB) is a serious infectious disease, caused by the bacterium Mycobacterium tuberculosis. Notably, it is one of the top causes of death worldwide, affecting the respiratory system. This disease has attracted many people’ attention due to its impact on public health and its ability to spread rapidly in the crowd. TB spreads through the air when individuals with active TB cough, speak, or sneeze. It primarily harms the lungs, but it can also affect other parts of the body, including the kidneys, spine, and brain. While the incidence of TB in the U.S. is lower compared to global averages, the disease still plays a significant role in causing deaths among Americans.
The U.S. Centers for Disease Control and Prevention (CDC) actively monitors TB incidence, which is the source of the data for our report. The goal of this report is to analyze the incidence of TB in the U.S. to discover trends and patterns of the disease, including its progression, spread. It aims to enhance public understanding of TB, aiding people to take measures to prevent its transmission.
Taken from [1].
Tuberculosis Cases:
Tuberculosis Deaths:
library(ggplot2)
# Plot for number of TB cases and deaths
ggplot(tb_data, aes(x = Year)) +
geom_line(aes(y = Number, colour = "TB Cases"), size = 1) +
geom_point(aes(y = Number, colour = "TB Cases")) +
geom_line(aes(y = Number1, colour = "TB Deaths"), size = 1) +
geom_point(aes(y = Number1, colour = "TB Deaths")) +
scale_color_manual(values = c("TB Cases" = "blue", "TB Deaths" = "red")) +
theme_minimal() +
labs(title = "TB Cases and Deaths over Years",
x = "Year",
y = "Number",
caption = "Data: TB_data_usa.csv") +
theme(plot.title = element_text(hjust = 0.5),
legend.title = element_blank())
#### Figure 1: TB Cases and Deaths over years
TB cases(Blue Line):
TB deaths(Red Line)
# plot for Incidence Rate of TB and Death Rate
ggplot(tb_data, aes(x = Year)) +
geom_line(aes(y = Rate, colour = "Incidence Rate"), size = 1) +
geom_point(aes(y = Rate, colour = "Incidence Rate")) +
geom_line(aes(y = Rate1, colour = "Death Rate"), size = 1) +
geom_point(aes(y = Rate1, colour = "Death Rate")) +
scale_color_manual(values = c("Incidence Rate" = "blue", "Death Rate" = "red")) +
theme_minimal() +
labs(title = "TB Incidence and Death Rates per 100,000 People",
x = "Year",
y = "Rate",
caption = "Data: TB_data_usa.csv") +
theme(plot.title = element_text(hjust = 0.5),
legend.title = element_blank())
### Figure 2: TB Cases and Deaths over years
# Plot for percentage changes in Number and Rate
ggplot(tb_data, aes(x = Year)) +
geom_line(aes(y = Number.1, colour = "Number Change"), na.rm = TRUE) +
geom_line(aes(y = Rate.1, colour = "Rate Change"), na.rm = TRUE) +
geom_point(aes(y = Number.1, colour = "Number Change"), na.rm = TRUE) +
geom_point(aes(y = Rate.1, colour = "Rate Change"), na.rm = TRUE) +
scale_color_manual(values = c("Number Change" = "red", "Rate Change" = "green")) +
theme_minimal() +
labs(title = "Percentage Change in TB Cases and Rates",
x = "Year",
y = "Percentage Change",
color = "Legend",
caption = "Data: TB_data_usa.csv") +
theme(plot.title = element_text(hjust = 0.5))
# Plot for percentage changes in Deaths number and Death Rate
ggplot(tb_data, aes(x = Year)) +
geom_line(aes(y = Number.2, colour = "Deaths Change"), na.rm = TRUE) +
geom_line(aes(y = Rate.2, colour = "Death Rate Change"), na.rm = TRUE) +
geom_point(aes(y = Number.2, colour = "Deaths Change"), na.rm = TRUE) +
geom_point(aes(y = Rate.2, colour = "Death Rate Change"), na.rm = TRUE) +
scale_color_manual(values = c("Deaths Change" = "red", "Death Rate Change" = "green")) +
theme_minimal() +
labs(title = "Percentage Change in TB Deaths number and Death Rates",
x = "Year",
y = "Percentage Change",
color = "Legend",
caption = "Data: TB_data_usa.csv") +
theme(plot.title = element_text(hjust = 0.5))
par(mar=c(6, 4, 4, 2))
pc_spectrum_smooth <- spectrum(tb_num, spans = c(5, 7, 5),
main = "Smoothed periodogram")
mtext(text = expression(bold("Figure 5:") ~ "incidence rate for tuberculosis cases with max density frequency point."), side = 1, line = 5,cex = 0.75, adj = 0.5)
max_density_freq = round(pc_spectrum_smooth$freq[which.max(pc_spectrum_smooth$spec)], 3)
abline(v = max_density_freq, lty = "dashed", col = "red", lwd = 2)
text(x = 0.09, y = 5e+1, labels = sprintf("Max: %.3f", max_density_freq), col = "red")
model_selection_table <- function(tbdata,
max_p, d, max_q,
P = 0, D = 0, Q = 0, period = 0,
simulation_times = 100) {
aic_table <- matrix(NA, max_p + 1, max_q + 1)
smallest_root_table <- matrix(NA, max_p + 1, max_q + 1)
fisher_ci_cover_0_table <- matrix(NA, max_p + 1, max_q + 1)
simulated_ci_cover_0_table <- matrix(NA, max_p + 1, max_q + 1)
residual_normal_test_table <- matrix(NA, max_p + 1, max_q + 1)
residual_acf_outlier_table <- matrix(NA, max_p + 1, max_q + 1)
dimnames(aic_table) <- list(paste("AR", 0:max_p, sep = ""),
paste("MA", 0:max_q, sep = ""))
dimnames(smallest_root_table) <- list(paste("AR", 0:max_p, sep = ""),
paste("MA", 0:max_q, sep = ""))
dimnames(fisher_ci_cover_0_table) <- list(paste("AR", 0:max_p, sep = ""),
paste("MA", 0:max_q, sep = ""))
dimnames(simulated_ci_cover_0_table) <- list(paste("AR", 0:max_p, sep = ""),
paste("MA", 0:max_q, sep = ""))
dimnames(residual_normal_test_table) <- list(paste("AR", 0:max_p, sep = ""),
paste("MA", 0:max_q, sep = ""))
dimnames(residual_acf_outlier_table) <- list(paste("AR", 0:max_p, sep = ""),
paste("MA", 0:max_q, sep = ""))
is_sarima_model <- any(c(P, D, Q) != 0)
has_intercept <- (d == 0) && (D == 0)
for (p in 0:max_p) {
for (q in 0:max_q) {
has_at_least_one_param <- ((p + q + P + Q) > 0)
if (is_sarima_model) {
pc_model <- try(arima(tbdata, order = c(p, d, q),
seasonal = list(order = c(P, D, Q), period = period)),
silent = TRUE)
} else {
pc_model <- try(arima(tbdata, order = c(p, d, q)),
silent = TRUE)
}
if (inherits(pc_model, "try-error")) {
next
}
# aic
aic_table[p + 1, q + 1] <- pc_model$aic
# smallest root
if (p != 0) {
pc_model_ar_root <- polyroot(c(1, -coef(pc_model)[paste0("ar", 1:p)]))
pc_model_ar_root_mod <- Mod(pc_model_ar_root)
} else {
pc_model_ar_root <- NA
pc_model_ar_root_mod <- NA
}
if (q != 0) {
pc_model_ma_root <- polyroot(c(1, coef(pc_model)[paste0("ma", 1:q)]))
pc_model_ma_root_mod <- Mod(pc_model_ma_root)
} else {
pc_model_ma_root <- NA
pc_model_ma_root_mod <- NA
}
if (P != 0) {
pc_model_sar_root <- polyroot(c(1, -coef(pc_model)[paste0("sar", 1:P)]))
pc_model_sar_root_mod <- Mod(pc_model_sar_root)
} else {
pc_model_sar_root <- NA
pc_model_sar_root_mod <- NA
}
if (Q != 0) {
pc_model_sma_root <- polyroot(c(1, coef(pc_model)[paste0("sma", 1:Q)]))
pc_model_sma_root_mod <- Mod(pc_model_sma_root)
} else {
pc_model_sma_root <- NA
pc_model_sma_root_mod <- NA
}
if (any(c(p, q, P, Q) != 0)) {
smallest_root_table[p + 1, q + 1] <- min(c(pc_model_ar_root_mod,
pc_model_ma_root_mod,
pc_model_sar_root_mod,
pc_model_sma_root_mod), na.rm = TRUE)
}
# fisher ci
if (any(c(p, q, P, Q) != 0)) {
fisher_ci_low <- pc_model$coef - 1.96 * diag(pc_model$var.coef)
fisher_ci_high <- pc_model$coef + 1.96 * diag(pc_model$var.coef)
if (has_intercept) {
fisher_ci_low <- fisher_ci_low[1:(length(fisher_ci_low) - 1)]
fisher_ci_high <- fisher_ci_high[1:(length(fisher_ci_high) - 1)]
}
if (any(fisher_ci_low <= 0 & fisher_ci_high >= 0)) {
fisher_ci_cover_0_table[p + 1, q + 1] <- TRUE
} else {
fisher_ci_cover_0_table[p + 1, q + 1] <- FALSE
}
}
# simulated ci
if (has_at_least_one_param && (simulation_times > 0)) {
if (!is_sarima_model) {
simulated_ci <- simulation_arima(pc_model, c(p, d, q),
has_intercept, simulation_times, length(tbdata))
} else {
simulated_ci <- simulation_sarima(pc_model, c(p, d, q), c(P, D, Q), period,
has_intercept, simulation_times, length(tbdata))
}
if (!any(is.na(simulated_ci))) {
simulated_ci_low <- simulated_ci[1, ]
simulated_ci_high <- simulated_ci[2, ]
if (has_intercept) {
simulated_ci_low <- simulated_ci_low[1:(length(simulated_ci_low) - 1)]
simulated_ci_high <- simulated_ci_high[1:(length(simulated_ci_high) - 1)]
}
if (any(simulated_ci_low <= 0 & simulated_ci_high >= 0)) {
simulated_ci_cover_0_table[p + 1, q + 1] <- TRUE
} else {
simulated_ci_cover_0_table[p + 1, q + 1] <- FALSE
}
}
}
# residual normal test
shapiro_test_result <- shapiro.test(pc_model$residuals)
if (shapiro_test_result$p.value < 0.05) {
residual_normal_test_table[p + 1, q + 1] <- FALSE
} else {
residual_normal_test_table[p + 1, q + 1] <- TRUE
}
# residual acf test
residual_acf <- acf(pc_model$residuals, plot = FALSE, lag.max = 30)
acf_ci_high <- qnorm((1 + 0.95) / 2) / sqrt(residual_acf$n.used)
acf_ci_low <- -acf_ci_high
residual_acf_outlier_table[p + 1, q + 1] <- sum((residual_acf$acf < acf_ci_low) |
(residual_acf$acf > acf_ci_high)) - 1
}
}
result_table_list <- list(aic_table = aic_table,
smllest_root_table = smallest_root_table,
fisher_ci_cover_0_table = fisher_ci_cover_0_table,
simulated_ci_cover_0_table = simulated_ci_cover_0_table,
residual_normal_test_table = residual_normal_test_table,
residual_acf_outlier_table = residual_acf_outlier_table)
return(result_table_list)
}
select_arima_case <- model_selection_table(tb_num,
max_p = 5, d = 1, max_q = 5,
P = 0, D = 0, Q = 0, period = 0,
simulation_times = 0)
build_and_diagnose_model <- function(tbdata, model_name,
arima_order,
seasonal = FALSE,
seasonal_order = c(0, 0, 0), period = NULL,
xreg = NULL,
without_summary = FALSE,
without_plot = FALSE) {
if (seasonal) {
pc_model <- arima(tbdata, order = arima_order,
seasonal = list(order = seasonal_order, period = period),
xreg = xreg)
} else {
pc_model <- arima(tbdata, order = arima_order, xreg = xreg)
}
if (!without_summary) {
print(pc_model)
}
p <- arima_order[1]
d <- arima_order[2]
q <- arima_order[3]
P <- seasonal_order[1]
D <- seasonal_order[2]
Q <- seasonal_order[3]
if (p > 0) {
pc_model_ar_roots <- polyroot(c(1, -coef(pc_model)[paste0("ar", 1:p)]))
cat("AR roots:", round(pc_model_ar_roots, 4), "\n")
cat("Mod of AR roots:", round(Mod(pc_model_ar_roots), 4), "\n")
}
if (q > 0) {
pc_model_ma_roots <- polyroot(c(1, coef(pc_model)[paste0("ma", 1:q)]))
cat("MA roots:", round(pc_model_ma_roots, 4), "\n")
cat("Mod of MA roots:", round(Mod(pc_model_ma_roots), 4), "\n")
}
if (P > 0) {
pc_model_sar_roots <- polyroot(c(1, -coef(pc_model)[paste0("sar", 1:P)]))
cat("SAR roots:", round(pc_model_sar_roots, 4), "\n")
cat("Mod of SAR roots:", round(Mod(pc_model_sar_roots), 4), "\n")
}
if (Q > 0) {
pc_model_sma_roots <- polyroot(c(1, coef(pc_model)[paste0("sma", 1:Q)]))
cat("SMA roots:", round(pc_model_sma_roots, 4), "\n")
cat("Mod of SMA roots:", round(Mod(pc_model_sma_roots), 4), "\n")
}
if (!without_plot) {
par(mfrow = c(2, 2))
plot(pc_model$residuals, ylab = "Residuals", main = "Residual Plot")
qqnorm(pc_model$residuals, ylab = "Residuals", main = "QQ Plot")
qqline(pc_model$residuals)
acf(pc_model$residuals, type = "correlation",
lag.max = 40, main = "Autocorrelation Function")
}
return(invisible(pc_model))
}
based on the AIC and smallest root, the relatively suitable model we choose is ARIMA(0,1,5) with smallest root as 1.05
library(knitr)
kable(select_arima_case$aic_table, digits = 3, caption = "AIC of some ARIMA models (incidence number)")
MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | |
---|---|---|---|---|---|---|
AR0 | 1210.780 | 1192.318 | 1182.158 | 1181.642 | 1183.515 | 1166.624 |
AR1 | 1176.774 | 1167.845 | 1169.498 | 1171.211 | 1172.394 | 1167.008 |
AR2 | 1171.724 | 1169.381 | 1170.895 | 1172.894 | 1173.545 | 1168.920 |
AR3 | 1170.755 | 1171.805 | 1172.890 | 1168.371 | 1164.323 | 1168.386 |
AR4 | 1172.726 | 1173.735 | 1172.952 | 1171.385 | 1166.296 | 1168.971 |
AR5 | 1170.381 | 1171.447 | 1165.181 | 1166.415 | 1168.262 | 1168.700 |
kable(select_arima_case$smllest_root_table, digits = 4, caption = "Smallest roots of ARIMA models (incidence rate)")
MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | |
---|---|---|---|---|---|---|
AR0 | NA | 2.0134 | 1.3139 | 1.5106 | 1.3103 | 1.0557 |
AR1 | 1.4975 | 1.0273 | 1.0150 | 1.0118 | 1.0112 | 1.0826 |
AR2 | 1.1856 | 1.0129 | 1.0181 | 1.0192 | 1.0110 | 1.0819 |
AR3 | 1.0876 | 1.1183 | 1.0285 | 1.0000 | 1.0000 | 1.0000 |
AR4 | 1.0817 | 1.1061 | 1.0109 | 1.0108 | 1.0001 | 1.0000 |
AR5 | 1.0334 | 1.0398 | 1.0001 | 1.0001 | 1.0001 | 1.0000 |
To determine invertibility, we need to ensure that all MA roots lie outside the unit circle. Because all the modulus of MA roots are larger than 1, we can say that the moving average part is also invertible. As a result, our ARIMA(0,1,5) can be reasonable treated as an invertible model.
As for its residuals, we find the residuals seems to be white noise. One thing worth noting is that residuals at earlier time have larger variance than residuals at later time. Therefore, homoscedasticity is violated. Heteroscedasticity does not bias the coefficient estimates themselves, but it can lead to biased estimates of the standard errors, which in turn can mislead inferences made regarding the significance of predictors. In the QQ plot, a heavy-tailed distribution is obvious and this shows that the residual doesn’t obey normal distribution. In the ACF plot, nothing significant is observed, therefore, we can conclude that the residual is uncorrelated. Overall, the diagnosis conducted here suggests that a simple ARIMA model is not a very suitable model for tuberculosis incidence data because the violation of normality and homoscedasticity makes the statistical inference hard to proceed.
## MA roots: 0.7283+0.86i -0.3663+0.9901i -0.3663-0.9901i 0.7283-0.86i -1.1631+0i
## Mod of MA roots: 1.127 1.0557 1.0557 1.127 1.1631
We now build a Partially Observed Markov Process (POMP) model to take into account that the TB case data is partially observed. We consider SEIRS model. the SEIRS model is an extension of the classic SIR (Susceptible, Infectious, Recovered) model that incorporates additional compartments to capture more details of the disease dynamics. Specifically, SEIRS stands for:
In the SEIRS model, there is a S to indicate that recovered individuals may lose their immunity over time and return to the susceptible state.
Model parameters:
N
is the total population size.Beta (β)
is the transmission rate, which defines how
often a susceptible-infected contact results in a new exposure.mu_EI (μ_EI)
is the rate at which exposed individuals
become infectious.mu_IR (μ_IR)
is the recovery rate, or the rate at which
infectious individuals recover and become immune.mu_RS (μ_RS)
is the loss of immunity rate, or the rate
at which recovered individuals become susceptible again.Beta_t (β_t)
is a time-dependent component of the
transmission rate, allowing for changes in β over time, potentially
reflecting intervention measures or changes in behavior.sigmaSE (σ_SE)
is a parameter related to the infectious
force of infection; it might be involved with the level of stochasticity
or environmental variation in the transmission rate.rho
is the reporting rate, indicating what fraction of
the new infections (H) are reported/measured.k
is the dispersion parameter of the negative binomial
distribution, often used to model overdispersed count data.
-S_0, E_0, I_0, R_0
represent the initial proportions of
the population that are susceptible, exposed, infectious, and recovered,
respectively.library(tidyverse)
data |>
ggplot(aes(x=Year,y=Number))+
geom_line()+
geom_point()
#### Image 6: Year Vs Number Trend
In this case, we encounter situations where the spread of the disease is more downwards than what would be predicted by traditional deterministic or even stochastic models that assume homogeneous mixing of the population. This phenomenon, known as overdispersion, is characterized by variances that exceed the mean number of new infections. To accurately represent the overdispersion observed in empirical data within a model, we can introduce a stochastic component into the force of infection term. In the standard SEIR framework, the force of infection is defined as \(\mu_{IR} = \frac{\beta .I.S}{N}\) In the presence of overdispersion, the force of infection can be modified by introducing a multiplicative gamma-distributed white-noise term, leading to a new expression: \(\mu_{IR} = \frac{\beta .I.S}{N} . \frac{d \gamma (t)}{dt}\). Here, \(\frac{d \gamma}{dt}\) symbolizes gamma white-noise, which introduces random fluctuations to the rate at which susceptible individuals become infected. The variance of this gamma-distributed noise is governed by the parameter \(SigmaSE\) (σ_SE). The inclusion of this term adjusts the force of infection at every time step to mimic the irregular pattern in transmission rates reflective of real-world data.
The original equation for the transmission rate \(\mu_{IR}\) is given by: \(\mu_{IR} = \frac{B.I.S}{N}. \frac{d\gamma(t)}{dt}\)
Here, represents the rate of change of a gamma-distributed white-noise term, which introduces random fluctuations to the transmission rate. But as we see the constant decrease in the number of TuborCulosis cases in plot[1] which is because of advancement in healthcare treatments and medicine, we modify the equation to introduce a linear trend in the transmission coefficient \(\beta\) over time. It adjusts the transmission rate \(\mu_{IR}\) based on a linear trend in \(\beta\) the transmission rate coefficient: \(\mu_{IR}(t) = \frac{(\beta - (\beta_t.(t-1952))) I.S}{N} . \frac{d\gamma (t)}{dt}\) Here, \(\beta_{t}\) represents the time-dependent component of the transmission rate coefficient. The term $ (- (_t.(t-1952))) $ represents the linear trend over time. \((t - 1952)\) is used to make the trend start from a initial year, indicating a linear change in the transmission coefficient starting from that year.
The modified equation allows for the transmission rate to vary linearly over time based on the values of \(\beta\) and \(\beta_t\), providing a more flexible and realistic representation of how transmission dynamics may change over time due to various factors.
\[ \begin{align*} &\frac{dS}{dt} = \mu_{RS} R - \beta(t) \frac{S I}{N} - dw(t) \beta \frac{S I}{N}\\[30pt] &\frac{dE}{dt} = \beta(t) \frac{S I}{N} + dw(t) \beta \frac{S I}{N} - \mu_{EI} E\\[30pt] &\frac{dI}{dt} = \mu_{EI} E - \mu_{IR} I\\[30pt] &\frac{dR}{dt} = \mu_{IR} I - \mu_{RS} R\\[30pt] &\frac{dH}{dt} = \mu_{IR} I \end{align*} \]
\[ \begin{align*} \tilde{S}(t + \delta) &= \tilde{S}(t) - \text{Binomial}(\tilde{S}(t), 1 - \text{exp}(-\text{dw}(t) \cdot \beta \cdot \frac{I(t)}{N(t)} \delta)) \\[10pt] \tilde{E}(t + \delta) &= \tilde{E}(t) + \text{Binomial}(\tilde{S}(t), 1 - \text{exp}(-\mu_{EI} \delta)) \\[10pt] \tilde{I}(t + \delta) &= \tilde{I}(t) + \text{Binomial}(\tilde{I}(t), 1 - \text{exp}(-\mu_{IR} \delta)) \\[10pt] \tilde{R}(t + \delta) &= \tilde{R}(t) + \text{Binomial}(\tilde{R}(t), 1 - \text{exp}(-\mu_{RS} \delta)) \\[10pt] \tilde{H}(t + \delta) &= \tilde{H}(t) + \text{Binomial}(\tilde{I}(t), 1 - \text{exp}(-\mu_{IR} \delta)) \end{align*} \]
\(dw(t)\) is a gamma white-noise, introducing random fluctuations to the rate at which susceptible individuals become infected. The variance of this gamma-distributed noise is governed by the parameter \(\sigma_{SE}\)
seir_step <- function (S, E, I, R, N, Beta, mu_EI, mu_IR, delta.t, ...)
{
dN_SE <- rbinom(n=1, size = S, prob = 1 - exp(-Beta * I / N * delta.t))
dN_EI <- rbinom(n=1, size = E, prob = 1 - exp(-mu_EI * delta.t))
dN_IR <- rbinom(n=1, size = I, prob = 1 - exp(-mu_IR * delta.t))
S <- S - dN_SE
E <- E + dN_SE - dN_EI
I <- I + dN_EI - dN_IR
R <- R + dN_IR
c(S = S, E = E, I = I, R = R)
}
seir_rinit <- function (N, eta, ...) {
c(S = round(N * eta), E = 0, I = 1, R = round(N * (1 - eta)))
}
library(pomp)
# Define SEIR step function with H compartment
seir_step <- function (S, E, I, R, H, N, Beta, mu_EI, mu_IR, delta.t, ...) {
dN_SE <- rbinom(n = 1, size = S, prob = 1 - exp(-Beta * I / N * delta.t))
dN_EI <- rbinom(n = 1, size = E, prob = 1 - exp(-mu_EI * delta.t))
dN_IR <- rbinom(n = 1, size = I, prob = 1 - exp(-mu_IR * delta.t))
S <- S - dN_SE
E <- E + dN_SE - dN_EI
I <- I + dN_EI - dN_IR
R <- R + dN_IR
H <- H + dN_IR
c(S = S, E = E, I = I, R = R, H = H)
}
# Define SEIR initial conditions function with H compartment
seir_rinit <- function (N, eta, ...) {
c(S = round(N * eta), E = 0, I = 1, R = round(N * (1 - eta)), H = 0)
}
# Create pomp object for SEIR model
data |>
pomp(
times = "Year",
t0 = 0,
rprocess = euler(seir_step, delta.t = 1 / 52),
rinit = seir_rinit
) -> TBseir
# Define SEIR measurement density function
seir_dmeas <- function (Rate, H, rho, k, log, ...) {
dnbinom(x = Rate, size = k, mu = rho * H, log = log)
}
# Define SEIR measurement function
seir_rmeas <- function (H, rho, k, ...) {
c(Rate = rnbinom(n = 1, size = k, mu = rho * H))
}
# Update the pomp object with SEIR measurement functions
TBseir |>
pomp(
rmeasure = seir_rmeas,
dmeasure = seir_dmeas
) -> TBseir
seir_step <- Csnippet("
double foi = (Beta - Beta_t * (t - 1952)) * I / N;
double dw = rgammawn(sigmaSE, dt);
double dN_SE = rbinom(S, 1 - exp(-dw * foi * dt));
double dN_EI = rbinom(E, 1 - exp(-mu_EI * dt));
double dN_IR = rbinom(I, 1 - exp(-mu_IR * dt));
double dN_RS = rbinom(R, 1 - exp(-mu_RS * dt));
S += dN_RS - fmin(dN_SE, S);
E += dN_SE - fmin(dN_EI, E);
I += dN_EI - fmin(dN_IR, I);
R += dN_IR - fmin(dN_RS, R);
H += dN_IR;
S = fmax(0.0, S);
E = fmax(0.0, E);
I = fmax(0.0, I);
R = fmax(0.0, R);
H = fmax(0.0, H);
")
seir_init <- Csnippet("
S = nearbyint(S_0 * N);
E = nearbyint(E_0 * N);
I = nearbyint(I_0 * N);
R = nearbyint(R_0 * N);
H = 0;
")
seir_dmeas <- Csnippet("
lik = dnbinom_mu(Number, k, rho*H, give_log);
")
seir_rmeas <- Csnippet("
Number = rnbinom_mu(k, rho*H);
")
data |>
pomp(
times = "Year",
t0 = 1952,
rprocess = euler(seir_step, delta.t = 1/52),
rinit = seir_init,
paramnames = c("N", "Beta", "mu_EI", "mu_IR", "mu_RS", "rho", 'k', "sigmaSE", "Beta_t", "S_0", "E_0", "I_0", "R_0"),
statenames = c("S", "E", "I", "R", "H"),
dmeasure = seir_dmeas,
rmeasure = seir_rmeas,
partrans = parameter_trans(log = c("Beta", "mu_EI", "mu_IR", "k", "mu_RS", "sigmaSE", "Beta_t"),
logit = c('rho'),
barycentric = c("S_0", "E_0", "I_0", "R_0")),
accumvars = 'H',
obsnames = 'Number'
) -> TBseir_C
TBseir_C |>
simulate(
params = c(Beta = 4.342608e+01, Beta_t = 2.631072e-05, mu_EI = 1.293220e+02, mu_IR = 8.155015e-01, rho = 5.941820e-01, k = 4.208512e+00, sigmaSE = 3.896112e-01, N = 333000000, mu_RS = 3.384936e+01, S_0 = 7.542048e-01, I_0 = 1.755303e-04, E_0 = 2.474361e-04, R_0 = 2.453723e-01),
#params = coef(mif_out),
nsim = 5,
format = "data.frame",
include.data = TRUE) |>
ggplot(aes(x = Year, y = Number, group = .id, color = .id == "data")) +
geom_line() +
guides(color = "none")
The simulation runs display variability around certain points in time, specifically around the 1960s to 1980s, indicating periods of higher uncertainty or variability in the model’s predictions. After the peak around the 1980s, there is a general downward trend in the number of cases.
cyan line represents the actual observed data over the years, against which the model simulations are being compared. The simulations seem to capture the overall trend of the data well, though there are periods where the simulation runs diverge significantly from the actual data.
Initially, we began with a simple SEIRS model, which divides a population into compartments representing the stages of the disease: as described previously. However, when comparing the output from a model with these basic components to actual TB case data, it did not capture the observed trends. Among the discrepancies one could be the overdispersion in case counts and trends in the data that reflect changes in transmission over time due to factors such as public health interventions or behavior changes.
Therefore, to improve the fit, we introduce additional features to the model (also as described while introducing POMP Model):
Overdispersion: The actual TB data may show more variability than what a Poisson process would predict, known as overdispersion. To account for this, we added an additional parameter, ‘k’, which controls the dispersion of the model.
Linear time trend in transmission: Instead of a constant transmission rate (β), we include a time-dependent component (β_t) that allows the transmission rate to decrease linearly over time, which better reflects the actual decrease in TB cases due to advancements in healthcare.
Stochasticity: To further capture the random fluctuations in transmission, we add stochasticity to the force of infection term. This is done by introducing a gamma-distributed white-noise term governed by the parameter σ_SE.
After these enhancements, we re-fit the model to the TB data and
perform simulations. So far, Beta = 4.342608e+01
,
Beta_t = 2.631072e-05
,
mu_EI = 1.293220e+02
,mu_IR = 8.155015e-01
,
rho = 5.941820e-01
, k = 4.208512e+00
,
sigmaSE = 3.896112e-01
, N = 333000000
,
mu_RS = 3.384936e+01
, S_0 = 7.542048e-01
,
I_0 = 1.755303e-04
, E_0 = 2.474361e-04
,
R_0 = 2.453723e-01
are the best parameters we could find
with log likelihood of -628.8447
. As a future scope, we aim
to try global search method to find the best parameters. Due to time
contraint it was not possible to run global search.
TBseir_C |>
mif2(
params = c(Beta = 4.342608e+01, Beta_t = 2.631072e-05, mu_EI = 1.293220e+02, mu_IR = 8.155015e-01, rho = 5.941820e-01, k = 4.208512e+00, sigmaSE = 3.896112e-01, N = 333000000, mu_RS = 3.384936e+01, S_0 = 7.542048e-01, I_0 = 1.755303e-04, E_0 = 2.474361e-04, R_0 = 2.453723e-01),
rw.sd = rw_sd(Beta = 0.02, Beta_t = 0.02, mu_EI = 0.02, mu_IR = 0.02, rho = 0.02, k = 0.02, mu_RS = 0.02, sigmaSE = 0.02, S_0 = ivp(0.1), I_0 = ivp(0.1), E_0 = ivp(0.1)),
Nmif = 50, Np = 2000,
cooling.fraction.50 = 0.5
) -> mif_out
logLik(mif_out)
## [1] -629.6903
plot(mif_out)
The ARIMA model analysis suggested using an ARIMA(0,1,5) model based on minimizing the AIC and having a smallest root close to 1. However, the ARIMA model assumes the data is fully observed, which may not be appropriate for infectious disease cases that are typically under-reported.
In contrast, the POMP model explicitly accounts for partial observability by modeling both the underlying disease process (via a stochastic SEIRS compartmental model) and the observation process (reporting rate). Additional features incorporated into the POMP model include overdispersion in case counts via a negative binomial distribution, and a time-varying transmission rate to capture the decreasing trend likely due to improved healthcare interventions over time.
While finding optimal parameters for the POMP model remained challenging due to time constraints, the POMP model simulations utilizing the estimated parameters were able to reasonably capture the overall declining trend in tuberculosis cases.
In conclusion, the POMP model provides a more comprehensive approach for analyzing partially observed epidemiological data compared to the ARIMA model. However, further investigation into parameter estimation techniques could potentially improve the POMP model fit to the tuberculosis data.
(Note: Despite trying global search techniques, these efforts were omitted from the report due to encountered failures and time constraints.)
SIERS model could be updated to SEIQRV in which Q represents Quarantined and V represents Vaccinated as described by [5].