Markov modeling provides an analytical framework for modeling state-switching processes that occur within an individual or a family unit, using longitudinal panel data. Income mobility and poverty dynamics in particular have attracted interest in economics and social policy. Following the analysis of Costa and De Angelis, this paper models switching between poverty states. They postulate two underlying poverty states: transitionary and permanent. I aim to recover these states using data from the Panel Study of Income Dynamics, run by the Institute for Social Research at the University of Michigan.

The Data

The Panel Study of Income Dynamics (PSID) is a long-running longitudinal household survey that began in 1968. It follows a nationally representative sample of over 18000 individuals living in 5000 families in the United States. Because the survey interviews both individuals and family units over time, it offers an opportunity to study poverty transitions by family unit. The poverty proxy used in this analysis takes the median of the Total Income variable, similar to the process used by Costa and De Angelis, and keeps only those individuals below the median.

I also use socio-economic indicators from the data as transition covariates in the HMM: age, employment status, years of education completed and race. Costa and De Angelis use the same covariates together with housing status, gender and geographical area. Gender was not included in the HMM in this analysis because it is an individual-level variable that does not apply to the family as a unit of observation. Below is a breakdown of the specific variables used, the transformations applied to them, and the survey waves.

Years: For this analysis I used 6 waves of the study: 2005, 2007, 2009, 2011, 2013 and 2015. After excluding families above the median level of income and dropping observations with non-responses in any of the covariates, the number of observations in each wave is as follows.

Year of study        2005   2007   2009   2011   2013   2015
No. of observations  17033  15767  14324  14185  12857  11078

These counts were included in the model to account for the cycles of the longitudinal dataset.

Age: This is a continuous variable from the dataset ranging from 3 - 40. Since the income level reported (see Total Income) is at the family level rather than the individual level, all the age observations were preserved. Observations coded 999, indicating a refusal to answer the question, were treated as missing and excluded from the model.

Employment Status: This variable was derived from the ‘Employment Status’ sequence of questions. A respondent is coded as Employed if they answered 1 - Working Now; as Unemployed if they answered 2 - Temporarily laid off, 3 - Looking for work, unemployed, or 6 - Housewife; as Retired if they answered 4 - Retired; and as Other for everything else.
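
The recoding rule above can be written as a small lookup. This is a minimal sketch in Python rather than the code used for the analysis; the names `EMPLOYMENT_LABELS` and `recode_employment` are hypothetical, and only the code-to-category mapping is taken from the description.

```python
# Hypothetical lookup table for the employment-status recode described above.
EMPLOYMENT_LABELS = {
    1: "Employed",    # Working now
    2: "Unemployed",  # Temporarily laid off
    3: "Unemployed",  # Looking for work, unemployed
    4: "Retired",     # Retired
    6: "Unemployed",  # Housewife (grouped with Unemployed in this analysis)
}


def recode_employment(code):
    """Map a raw employment-status code to one of the four analysis categories."""
    # Any code not listed above falls into the catch-all "Other" category.
    return EMPLOYMENT_LABELS.get(code, "Other")


if __name__ == "__main__":
    for raw in (1, 2, 3, 4, 5, 6):
        print(raw, recode_employment(raw))
```

Keeping the mapping in one dictionary makes the grouping decisions, such as placing code 6 under Unemployed, easy to audit and change.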

Years of Education Completed: This is another continuous variable, giving the years of education the respondent has completed in their lifetime. The range of this variable is 0 - 12, with 99 for respondents who refused to reply. All observations with a response of 99 were coded as missing and dropped from the analysis.

Race: This variable was derived by combining two response variables from the data: the race of the head of household and the race of the spouse. If the head and spouse have the same race, the family unit is coded as that race. If they are of different races, the Race variable is coded as Other. Because there were more than 10 distinct combinations of races, assigning dummy variables to every mixed-race family was not feasible. All the major race categories were preserved.
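
The family-level race rule can be sketched as follows. The function name `family_race` is hypothetical, and the handling of households without a spouse is an assumption of this sketch, since the description does not say how they were coded.

```python
def family_race(head_race, spouse_race=None):
    """Return the family-level race category from head and spouse race.

    Assumption (not stated in the text): when there is no spouse
    (spouse_race is None), the family takes the head's race.
    """
    if spouse_race is None or head_race == spouse_race:
        return head_race
    # Head and spouse report different races: collapse to a single category.
    return "Other"


if __name__ == "__main__":
    print(family_race("White", "White"))
    print(family_race("White", "Asian"))
```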

Total Income: Total household income in dollars reported by respondents. This income level is the same for all individuals in the household.
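
The poverty proxy described in the data section, keeping only records below the median of Total Income, can be sketched as below. The record layout and the `TotalIncome` key are hypothetical. Whether the median was taken within each wave or over the pooled data is not stated, so the function simply uses whatever list of records it is given.

```python
from statistics import median


def below_median(records, income_key="TotalIncome"):
    """Keep the records whose income is strictly below the median income.

    `records` is a list of dicts; the key name is a hypothetical choice.
    """
    cutoff = median(r[income_key] for r in records)
    return [r for r in records if r[income_key] < cutoff]


if __name__ == "__main__":
    # Toy incomes, not PSID data.
    toy = [{"TotalIncome": v} for v in (12000, 25000, 40000, 61000, 90000)]
    print(below_median(toy))
```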

Model

For this analysis, I used a hidden Markov model, which is defined by the following densities: \[\begin{array}{l} f_{X_{0}}(x_0\,;\,\theta), \\ f_{X_{n}|X_{n-1}}(x_{n}\,|\,x_{n-1}\,;\,\theta), \\ f_{Y_{n}|X_n}(y_{n}\,|\,x_n\,;\,\theta), \end{array}\] for \(n=1:N\).

Here I assume that each family has a latent rate \(X_i(t)\) of transitioning out of or back into poverty states. Each data point \(y_{ij}\) is the total income level for a family between time \(t_{j-1}\) and \(t_j\), where \(i=1,\dots,85244\) and \(j=1,\dots,6\). The unobserved poverty process \(\{X_i(t)\}\) is connected to the data through the change in income for each family unit \(i\) over reporting cycle \(j\), which can be written as \[C_{ij}= \int_{t_{j-1}}^{t_j} X_i(t)\, dt.\] [2]
A stochastic mixture model for Gaussian-distributed data was used to model \(y_{ij}\) as a Gaussian random variable with mean \(\mu\) and variance \(\sigma^2\). This model was chosen after visually examining the density of the response variable, fitting several candidate distributions to the data and comparing their AIC values. The distributions tested were lognormal, normal, Weibull and gamma. Only the lognormal and Gaussian distributions could be fitted to the data and, as the plot suggests, the data are best fit by the Gaussian model, which had the lower AIC of 1904573 compared with 2032992 for the lognormal. The Mixture Hidden Markov Model (MHMM) was selected for this analysis over the Partially Observed Markov Process (POMP) model because it allows for a continuous Gaussian-distributed response variable, which a POMP does not.

The possibility of a general dependence on \(n\) includes the possibility that there is some covariate time series \(z_{0:N}\) such that \[\begin{array}{lcl} f_{X_{0}}(x_0\,;\,\theta)&=& f_{X_{0}}(x_0\,;\,\theta,z_0), \\ f_{X_{n}|X_{n-1}}(x_{n}\,|\,x_{n-1}\,;\,\theta) &=& f_{X_{n}|X_{n-1}}(x_{n}\,|\,x_{n-1}\,;\,\theta,z_n), \\ f_{Y_{n}|X_n}(y_{n}\,|\,x_n\,;\,\theta) &=& f_{Y_{n}|X_n}(y_{n}\,|\,x_n\,;\,\theta,z_n), \end{array}\] for \(n=1:N\).
To derive a multidimensional concept of poverty, the covariates age, years of education, employment status and race were included in the model, as I hypothesized that these socio-economic factors affect the transition between poverty states. To test this hypothesis, I ran a glm regression of the response variable total income on the covariates as independent variables (see analysis section). Parameter estimation for the MHMM was carried out with a variant of the EM procedure known as the forward-backward or Baum-Welch algorithm.
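
The forward pass at the core of Baum-Welch estimation can be illustrated with a short sketch. It is written in Python for illustration only and is not the fitting routine used in this analysis. It makes several simplifying assumptions: a single observed sequence, Gaussian emissions, fixed transition probabilities with no covariates, and an initial distribution applied at the first observation. All function names and parameter values are hypothetical.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def forward_loglik(ys, init, trans, means, sds):
    """Log-likelihood of one sequence under a Gaussian HMM (scaled forward pass).

    init[k]     : probability of starting in state k
    trans[j][k] : probability of moving from state j to state k
    means, sds  : emission parameters for each state
    """
    n_states = len(init)
    # First observation: initial distribution times emission density.
    alpha = [init[k] * normal_pdf(ys[0], means[k], sds[k]) for k in range(n_states)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]  # rescale to avoid numerical underflow
    for y in ys[1:]:
        # Propagate through the transition matrix, then weight by the emission.
        alpha = [
            sum(alpha[j] * trans[j][k] for j in range(n_states))
            * normal_pdf(y, means[k], sds[k])
            for k in range(n_states)
        ]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik


if __name__ == "__main__":
    # Toy two-state example with made-up parameters (not estimates from the paper).
    incomes = [9.8, 10.1, 10.9, 11.2, 10.0]
    print(forward_loglik(incomes, [0.5, 0.5], [[0.85, 0.15], [0.16, 0.84]],
                         [10.0, 11.0], [0.5, 0.5]))
```

Baum-Welch alternates this forward pass (together with a matching backward pass) with parameter updates until the log-likelihood stops improving.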

Analysis

First, using a glm regression to test the hypothesis of a relationship between total income and the four independent variables, I found age and employment status to be statistically significant. The difference in income levels between white households and black and American Indian households was also statistically significant. Note that all of these households fall below the overall median income. Years of education completed is not significant, but the model that includes it as a transition covariate performed better in terms of log-likelihood, AIC and BIC (Table 1).


Call:
glm(formula = log(TotalIncome) ~ Age + as.factor(EmploymentStatus) + 
    as.factor(Race) + YearsCompletedEducation, data = PSIDdat)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-12.6766   -0.2183    0.2954    0.6621    1.7196  

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                             10.0168328  0.0103971 963.422  < 2e-16 ***
Age                                     -0.0022776  0.0003552  -6.411 1.45e-10 ***
as.factor(EmploymentStatus)Employed      0.3887984  0.0165892  23.437  < 2e-16 ***
as.factor(EmploymentStatus)Unemployed   -0.3289443  0.0201588 -16.318  < 2e-16 ***
as.factor(EmploymentStatus)Retired       0.1603831  0.0303615   5.282 1.28e-07 ***
as.factor(Race)Other                     0.0231736  0.0210058   1.103    0.270    
as.factor(Race)Black/African-American   -0.4039575  0.0110886 -36.430  < 2e-16 ***
as.factor(Race)American Indian          -0.4215158  0.0618219  -6.818 9.28e-12 ***
as.factor(Race)Asian                    -0.0191417  0.0554854  -0.345    0.730    
as.factor(Race)Native Hawaiian/Islander  0.2628725  0.4071302   0.646    0.518    
YearsCompletedEducation                  0.0008535  0.0008495   1.005    0.315    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 2.319653)

    Null deviance: 205549  on 85243  degrees of freedom
Residual deviance: 197711  on 85233  degrees of freedom
AIC: 313651

Number of Fisher Scoring iterations: 2

I also examined the cross-correlation between each of the two continuous covariates and income levels using a CCF plot, and found evidence of correlation at several lags.

The next step is to determine whether the number of states hypothesized by Costa and De Angelis, 2, is the optimal number of hidden (latent) states in my data. I ran the model with 2 to 5 states and with three covariate specifications: all the transition covariates, all except years of education completed, and none. The log-likelihood, AIC and BIC values allow the fits to be compared (see Table 1). I also ran a multivariate-response hidden Markov model with total income and employment status as responses, specifying employment status as a multinomial variable. The model with the best performance in terms of log-likelihood, AIC and BIC is the 2-state model with a single response and the transition covariates race, age and years of education.

Table 1
States  Log.Lik     AIC      BIC      df   Response                            Covariates
2       -922865.3   1845773  1845969  21   Total Income                        Race, Age, Years of Education
3       -934894.0   1869900  1870424  56   Total Income                        Race, Age, Years of Education
4       -934894.0   1870002  1871003  107  Total Income                        Race, Age, Years of Education
2       -1028984.0  2057994  2058115  13   Total Income and Employment Status  None
2       -934894.0   1869842  1870095  27   Total Income                        Race, Age, Years of Education, Years of Employment
3       -934894.0   1869816  1869947  14   Total Income                        None
2       -934894.0   1869826  1870004  19   Total Income                        Race and Age
3       -934894.0   1869888  1870356  50   Total Income                        Race and Age
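
The AIC and BIC columns follow from the log-likelihood and the degrees of freedom through the standard definitions AIC = 2k - 2 log L and BIC = k ln(n) - 2 log L. The sketch below recomputes them for the first row of Table 1. Taking n = 85244, the total number of observations across the six waves, is an assumption of this sketch; it reproduces the tabulated BIC to rounding.

```python
import math


def aic(loglik, df):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * df - 2 * loglik


def bic(loglik, df, n_obs):
    """Bayesian information criterion: k ln(n) - 2 log L."""
    return df * math.log(n_obs) - 2 * loglik


# Assumed sample size: the sum of the six wave counts reported in the data section.
N_OBS = 17033 + 15767 + 14324 + 14185 + 12857 + 11078

if __name__ == "__main__":
    # First row of Table 1: 2 states, log-likelihood -922865.3, 21 parameters.
    print(round(aic(-922865.3, 21)))
    print(round(bic(-922865.3, 21, N_OBS)))
```

Both criteria penalize extra parameters, BIC more heavily at this sample size, which is why the 3- and 4-state rows score worse despite having no better log-likelihood.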

Results and Discussion

Because the algorithm places covariates on the transition probabilities through a multinomial logit, the default baseline category is the first state, so its coefficients are fixed at zero. In each state component, the second-state column (State.1.2 or State.2.2) gives the effect of each covariate on the log-odds of moving to state 2 rather than state 1. The last row shows the transition probabilities within each state component when the covariates are set to 0 (see Table 2).

Table 2
                                              State Component 1       State Component 2
Covariates                                    State.1.1  State.1.2    State.2.1  State.2.2
(Intercept)                                   0.0000000  -1.6952652   0.000000   1.6539263
RaceWhite                                     0.0000000  0.3452791    0.000000   -0.1616289
RaceBlack/African-American                    0.0000000  -0.0162066   0.000000   -0.8061333
RaceAmerican Indian                           0.0000000  0.4283429    0.000000   -0.6587425
RaceAsian                                     0.0000000  -0.3316479   0.000000   0.3998702
RaceNative Hawaiian/Islander                  0.0000000  0.0955277    0.000000   -0.4882065
Age                                           0.0000000  0.0094510    0.000000   -0.0065010
YearsCompletedEducation                       0.0000000  0.0057672    0.000000   -0.0049468
Probabilities at zero values of the covariates  0.8449153  0.1550847    0.160579   0.8394210
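
The last row of Table 2 can be recovered from the intercepts by applying the inverse multinomial logit with the first state as baseline. The sketch below does this for both state components; the function name `transition_probs` is hypothetical, and the only inputs taken from the table are the two intercepts.

```python
import math


def transition_probs(linear_predictors):
    """Inverse multinomial logit (softmax) over one row of linear predictors.

    The baseline state is included in the list with linear predictor 0.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(linear_predictors)
    weights = [math.exp(eta - top) for eta in linear_predictors]
    total = sum(weights)
    return [w / total for w in weights]


# Intercepts from Table 2, with all covariates set to 0.
from_state_1 = transition_probs([0.0, -1.6952652])
from_state_2 = transition_probs([0.0, 1.6539263])

if __name__ == "__main__":
    print(from_state_1)
    print(from_state_2)
```

For a household with non-zero covariates, the second entry of the list would be the intercept plus the sum of each coefficient times its covariate value.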

To analyze the state for each individual in a household and compare it with the covariates, we use the Viterbi algorithm to find the most likely state sequence. We visualize the most probable state for each household and analyze the result. The results clearly indicate that state 1 is the permanent poverty state and state 2 is the transitionary poverty state. Charts 1 and 2 show some interesting patterns. White and mixed households are, on average, less likely to be in the permanent poverty state than black households. Chart 2 also suggests that households where the head of household and spouse are employed are more likely to be in the transitionary poverty state than households with unemployed individuals. Looking at income levels by state, households with higher incomes are overwhelmingly more likely to be in the transitionary poverty state than those with lower incomes. For the age covariate, younger individuals are on average more likely to be in the permanent poverty state, but the difference is trivial.
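
The Viterbi decoding step can be sketched for a single sequence as follows. This is an illustrative Python version, not the routine used to produce the charts. It assumes Gaussian emissions, fixed transition probabilities without covariates, and strictly positive probabilities so that logarithms are defined. The parameter values are made up, and states are numbered from 0 in the code.

```python
import math


def log_normal_pdf(y, mean, sd):
    """Log-density of a Gaussian with the given mean and standard deviation."""
    z = (y - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def viterbi(ys, init, trans, means, sds):
    """Most likely hidden-state sequence for one observed sequence."""
    n_states = len(init)
    # delta[k]: best log-probability of any state path ending in state k.
    delta = [math.log(init[k]) + log_normal_pdf(ys[0], means[k], sds[k])
             for k in range(n_states)]
    backpointers = []
    for y in ys[1:]:
        step, new_delta = [], []
        for k in range(n_states):
            # Best predecessor state for arriving in state k at this time point.
            best_j = max(range(n_states),
                         key=lambda j: delta[j] + math.log(trans[j][k]))
            step.append(best_j)
            new_delta.append(delta[best_j] + math.log(trans[best_j][k])
                             + log_normal_pdf(y, means[k], sds[k]))
        backpointers.append(step)
        delta = new_delta
    # Trace the best path backwards from the most likely final state.
    state = max(range(n_states), key=lambda k: delta[k])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy incomes (in thousands) over six waves; parameters are hypothetical.
    incomes = [10.0, 11.0, 30.0, 31.0, 29.0, 12.0]
    print(viterbi(incomes, [0.5, 0.5], [[0.85, 0.15], [0.16, 0.84]],
                  [10.0, 30.0], [2.0, 2.0]))
```

Unlike picking the most probable state separately at each wave, Viterbi returns a single path that is jointly most likely, so isolated one-wave switches are discouraged when the transition probabilities favor staying put.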

References

A Dynamic Latent Model for Poverty Measurement; Michele Costa and Luca De Angelis
Panel Study of Income Dynamics (PSID), University of Michigan
https://pdfs.semanticscholar.org/207e/81cf7b5c22176ae3972da2ecd09f6b2c3c67.pdf
http://yunus.hacettepe.edu.tr/~iozkan/eco742/hmm.html
Lecture Notes: https://ionides.github.io/531w18/13/notes13.html
depmixS4, CRAN

