Outline
Measles is a highly contagious infectious disease, which causes fever, runny nose, and severe rash. People would have these symptoms after ten to twelve days of contacting with an infected person and these symptoms would last for about seven to ten days. According to LA County Department of Public Health: “Most people who have never been vaccinated against or sick with the measles will get it if they have contacted with the virus.”, and this probability is about 9/10. Even the person with measles can spread the disease before they have any symptoms.
After the measles vaccine was inveted, about eighty five percents of nine-month children and ninty five percents of over twelve-month children are immune. And effectiveness of the vaccine could last for many years. It is generally safe, its side effects are usually short-lived.
This report focuses on finding appropriate ARMA and pomp models for the total measles cases from 1945 to 1947 in Los Angeles. The data set was obtained from “Project Tycho”, and it was recorded weekly.
By the graph below, we could see that there was an outbreak in 1946 in Los Angeles. However, the seasonality is not obvious. For 1945 and 1946, the trend of reported measles cases is similar. It seems like it is mores easily to get this disease in Spring. But in 1947, the reported cases is relatively low and the trend is very different from 1945 and 1946.
From the plot of original data, we could find out that the variance non-stationarity is very apparent, since there is a peak between 1946 and 1947 and a fairly small peak between 1945 and 1946. By checking the ACF plot below, the non-stationarity can be demonstrated, since the peaks decay very slowly.
In order to adjust the non-stationarity, I made two kinds of transformation: one is square root transformation, the other is log transformation, and then took difference for both transformed data. Two graphs below are the results. It is obvious that the right plot is better, so I decided to use the square-root- transformed data to do the following analysis.
## integer(0)
## integer(0)
To be more clear, the data now is \[Y_n=(1-B)Z_n=(1-B)\sqrt X_n\]
Then I used AIC to select model. By checking the AIC table below, we could notice that the AIC value of ARMA(3,3) is the lowest. Since the number of parameters is acceptable, I decided to choose this model and then check the performance.
MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | |
---|---|---|---|---|---|---|
AR0 | 514.98 | 513.32 | 501.97 | 502.12 | 495.99 | 497.85 |
AR1 | 510.89 | 501.41 | 493.70 | 495.47 | 494.19 | 495.51 |
AR2 | 498.67 | 493.59 | 495.83 | 491.17 | 490.48 | 492.47 |
AR3 | 499.03 | 495.56 | 494.63 | 485.18 | 487.18 | 489.17 |
AR4 | 488.59 | 490.59 | 491.93 | 487.18 | 487.95 | 495.99 |
The table below is the result of their roots. By this result, we could notice that the roots of MA1 and MA2 do not lie outside the unite circle, which causes the concern of non-invertibility. Therefore, I decided to choose the model ARMA(3,2), since the number of parameters is smaller and the AIC value is acceptible.
AR_roots | MA_roots |
---|---|
1.11353340995196+0.41523331451202i | -0.666994478639817+0.656447484114403i |
-1.11989186201033+0i | -0.666994478639819-0.656447484114402i |
1.11353340995196-0.41523331451202i | 1.7724068324497-0i |
By the table below, we could know that no matter the roots of ARs or MAs, they all lie outside of the unite circle. So, there is no concern of non-causality or non-invertibility.
AR_roots2 | MA_roots2 |
---|---|
1.17119700243801+0i | 0.1513159674741+1.42276002065889i |
-1.1118308306977-0i | 0.15131596747412-1.42276002065889i |
-3.33670481516862+0i | 0+0i |
Then I checked the ACF. By the plot below, we could notice that this model fits the data well, since except for lag0, most of the ACF values are all inside the 95% confidence interval. Only at lag11 and lag14, the peaks almost across the lower bound. However, this is not too serious.
Finally, I checked the residuals. By the scatter plot, we could know that the mean and variance are constant and from the second plot, we could tell that the residual is roughly normal distribution, which is also demonstrated by the Q-Q plot. Although the Q-Q plot shows this distribution might have a heavy tail, I think this won’t cause serious problems since it seems to be a lightly heavy tail. Therefore, after taking some transformations for the data, the model ARIMA(3,1,2) is an appropriate model.
In this section, I used the pomp package built in R to fit a SIR model for this data. All the assumption of this model is obtained from the notes of Lecture 11 of STATS 531. In this model, we assume that S, I and R are the numbers of individuals in the susceptible, infected and recovered period, respectively. \(\bigtriangleup N_{SI}\) and \(\bigtriangleup N_{IR}\) represents the individuals moving form S to I and I to R, respectively. We also assume that \(\bigtriangleup N_{SI}\sim Bin(S, 1-e^{-\beta\bigtriangleup t})\) and \(\bigtriangleup N_{IR}\sim Bin(I, 1-e^{-\gamma\bigtriangleup t})\). The vaiable H is used to trak the numbers of individuals. And then we assume our data follows the distribution of \(Bin(H(t)-H(t-1),\rho)\).
sir_step=Csnippet("
double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt));
double dN_IR = rbinom(I,1-exp(-gamma*dt));
S -= dN_SI;
I += dN_SI - dN_IR;
R += dN_IR;
")
sir_init=Csnippet("
S = N-1;
I = 1;
R = 0;
")
sir=pomp(data.frame(meas_p),time="week",t0=0,rprocess=euler.sim(sir_step,delta.t=1),initializer=sir_init,paramnames=c("N","Beta","gamma"),statenames=c("S","I","R"))
sir_step2 <- Csnippet("
double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt));
double dN_IR = rbinom(I,1-exp(-gamma*dt));
S -= dN_SI;
I += dN_SI - dN_IR;
R += dN_IR;
H += dN_IR;
")
sir_init2 <- Csnippet("
S = N-1;
I = 1;
R = 0;
H = 0;
")
dmeas=Csnippet("lik = dbinom(number,H,rho,give_log);")
rmeas=Csnippet("number = rbinom(H,rho);")
sir2=pomp(sir,
rprocess=euler.sim(sir_step2,delta.t=1),
initializer=sir_init2,
rmeasure=rmeas,dmeasure=dmeas,
zeronames="H",statenames=c("H","S","I","R"),
paramnames=c("Beta","gamma","rho","N"))
By the data from U.S. Census Bureau, the population in Los Angeles was 1,504,277 in 1940, and 1,970,358 in 1950. I took the average of these two numbers and used it as an approximation of population from 1945 to 1947.
After trying several times, I chose \(\beta=0.62\), \(\gamma=0.46\) and \(\rho=0.009\) and found out that although this model is used to fit the reported flu cases, by controllng some parameters, it might also fit well for the measles cases. Although it does not catch the trend of small peak from 1945 to 1946. It almost catches the variation of the peak from 1946 to 1947.
By fixing \(\rho\) and \(N\), I then checked the MLE of \(\beta\) and \(\gamma\). From the two plots below, we could find out that the MLE of these two parameters are very close to 0 in the range of 0.01 to 10.
The following graphs are the result of the estimation of parameters y global search. Here, I chose Np=60,000 and Nmif=300. \(\beta\) seems to almost converge, however, \(\rho\) and \(\gamma\) need more iteraion. In the beginning, the effective sample size is not stable, but after time_50, most of the sample size is larger than 100. The performance of nfail is not good, from the plot, we could notice that most of the value is above 100. Besides, the maximized likelihood is -5617, which is pretty small.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5626 -5626 -5621 -5621 -5617 -5617
By comparing two kinds of model, we could notice that POMP analysis needs a lot of computation and the ARMA model is relatively easier. Even though from the simulation, this POMP model seems to fit well for this data, after taking two days of calculation, some parameters still can not converge. To resolve this problem, we can try to increase the value of iteration, or we could change the model. Since other performances, like maximized likelihood is not good. Therefore, between these models, the ARIMA(3,1,2) model performs better in terms of result and consuming time.