The package LMest
allows us to specify and fit Latent (or Hidden) Markov (LM) models for the analysis of longitudinal continuous and categorical data.
It contains functions for parameter estimation, via the Expectation-Maximization (EM) algorithm, of the basic LM model and its extensions with individual covariates that are included by means of suitable parameterizations.
It also includes functions for simulation from these models, for performing model selection (that also includes the search for the global maximum likelihood), and for local and global decoding.
Finally, standard errors for model parameters are available, which are based on numerical or exact methods and on parametric bootstrap.
This document introduces the LMest
’s basic set of tools, and shows how to apply them for the different model specifications allowed in the package, with data frames in different formats.
LMest
contains the following main functions having specific tasks:
Functions | Description |
---|---|
lmestData() |
to manipulate data in long format |
lmestFormula() |
to build the formulas specifying the model to be estimated |
lmest() |
to estimate LM models for categorical responses with or without covariates |
se() |
to obtain standard errors for the estimated LM model |
lmestCont() |
to estimate LM models for continuous outcomes with or without covariates |
lmestMixed() |
to estimate LM models for categorical responses with initial and transition probabilities of the latent process allowed to vary across different latent subpopulations |
lmestSearch() |
to search for the global maximum of the model log-likelihood and to select the optimal number of latent states |
lmestDecoding() |
to perform local and global decoding (Viterbi algorithm) |
bootstrap() |
to perform bootstrap parametric resampling to compute standard errors for the parameter estimates |
drawLMbasic() |
to draw samples from the basic LM model with categorical responses |
—————————————— |
Data available are:
Name | Description |
---|---|
data_criminal_sim |
Simulated dataset about crimes committed by a cohort of subjects |
data_drug |
Longitudinal dataset about marijuana consumption deriving from the National Youth Survey |
data_SRHS_long |
Dataset about self-reported health status deriving from the Health and Retirement Study conducted by the University of Michigan in long format |
PSIDlong |
Longitudinal dataset deriving from the Panel Study of Income Dynamics |
RLMSdat |
Longitudinal dataset deriving from the Russia Longitudinal Monitoring Survey (RLMS) about job satisfaction |
RLMSlong |
Longitudinal dataset deriving from the Russia Longitudinal Monitoring Survey (RLMS) about job satisfaction in long format |
NLSYlong |
Longitudinal dataset deriving from the National Longitudinal Survey of Youth (NLSY) about antisocial behavior and self-esteem |
—————————————— |
See help(package="LMest")
for further details and citation("LMest")
for main references.
library(LMest)
#> LMest: Generalized Latent Markov Models (Version 3.0.4)
#> Type 'citation("LMest")' for citing this package.
This dataset, already in the LMest
package, concerns the evaluation of job satisfaction of \(n\) = 1,718 individuals followed at \(T\) = 7 years from 2008 to 2014.
The data come from the Russia Longitudinal Monitoring Survey, and are documented in ?RLMSlong
.
The response variable (named value
), corresponding to the reported job satisfaction at different time occasions, has five ordered categories from 1: absolutely satisfied
to 5: absolutely not satisfied
.
The longitudinal dataset is in long format:
data("RLMSlong")
dim(RLMSlong)
#> [1] 12026 4
str(RLMSlong)
#> 'data.frame': 12026 obs. of 4 variables:
#> $ time : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ id : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ rlms : Factor w/ 7 levels "IKSJQ","IKSJR",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ value: int 2 2 2 2 2 3 1 3 1 1 ...
This dataset, already in the LMest
package, is derived from the Panel Study of Income Dynamics (https://psidonline.isr.umich.edu).
The data used in the following application concern \(n\) = 1,446 women who were followed from 1987 to 1993.
There are two binary response variables:
Y1Fertility
(indicating whether a woman had given birth to a child in a certain year) and
Y2Employment
(indicating whether she was employed).
The covariates are: X1Race
(dummy variable equal to 1 for a black woman); X2Age
(in 1986, rescaled by its maximum value); X3Age2
(squared age); X4Education
(number of years of schooling); X5Child1_2
(number of children in the family aged between 1 and 2 years, referred to the previous year); X6Child3_5
; X7Child6_13
; X8Child14
; X9Income
of the husband (in dollars, referred to the previous year, divided by 1,000).
data("PSIDlong")
dim(PSIDlong)
#> [1] 10122 13
str(PSIDlong)
#> 'data.frame': 10122 obs. of 13 variables:
#> $ id : num 1 1 1 1 1 1 1 2 2 2 ...
#> $ time : num 1 2 3 4 5 6 7 1 2 3 ...
#> $ Y1Fertility : num 0 0 0 0 0 0 0 1 0 0 ...
#> $ Y2Employment: num 1 1 1 1 1 1 1 1 1 1 ...
#> $ X1Race : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ X2Age : num -8 -8 -8 -8 -8 -8 -8 -15 -15 -15 ...
#> $ X3Age2 : num 0.64 0.64 0.64 0.64 0.64 0.64 0.64 2.25 2.25 2.25 ...
#> $ X4Education : num 12 12 12 12 12 12 12 12 12 12 ...
#> $ X5Child1_2 : num 0 0 0 0 0 0 0 0 0 1 ...
#> $ X6Child3_5 : num 0 0 0 0 0 0 0 0 0 0 ...
#> $ X7Child6_13 : num 0 2 2 2 2 1 1 0 1 1 ...
#> $ X8Child14 : num 0 0 0 0 0 1 1 0 0 0 ...
#> $ X9Income : num 35.3 28.2 33.2 37 39.9 ...
This simulated dataset, already in the LMest
package, contains \(n\) = 60,000 observations related to the complete conviction histories of a cohort of offenders followed from the age of criminal responsibility, 10 years, until 40 years, and also includes the proportion of non-offenders.
We consider \(T=6\) age bands of length equal to five years. For every age band, we dispose of a binary variable equal to 1 if the subject has been convicted for a crime of the following ten typologies and to 0 otherwise.
The typologies of crime are: y1 violence against the person
, y2 sexual offenses
, y3 burglary
, y4 robbery
, y5 theft and handling stolen goods
, y6 fraud and forgery
, y7 criminal damage
, y8 drug offenses
, y9 motoring offenses
, y10 other offenses
.
Gender
is a covariate coded equal to 1 for male and 2 for female:
data(data_criminal_sim)
dim(data_criminal_sim)
#> [1] 60000 13
str(data_criminal_sim)
#> num [1:60000, 1:13] 1 1 1 1 1 1 2 2 2 2 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : NULL
#> ..$ : chr [1:13] "id" "sex" "time" "y1" ...
This dataset in the LMest
package, also downloadable from the package panelr
, is derived from the National Longitudinal Survey of Youth (https://www.nlsinfo.org/content/cohorts/nlsy79).
The data are a subset concerning \(n\) = 581 individuals who were followed from 1990 to 1994.
We consider two response variables: anti
, providing a measure of antisocial behavior (measured on a scale ranging from 0 to 6), self
, which is a measure of self-esteem (measured on a scale ranging from 6 to 24).
The covariates are the following: momage
, a continuous variable indicating the age of the mother at birth; gender
, a dummy variable equal to 1 for females; childage
, a continuous variable indicating the age of the child at the first interview; momwork
, a dummy variable equal to 1 if mother works; married
, a dummy variable equal to 1 if parents are married; hispanic
and black
, two dummy variables for ethnicity; pov
, a time-varying binary variable indicating the poverty status of the family at years 1990, 1992 and 1994.
data("NLSYlong")
dim(NLSYlong)
#> [1] 1743 12
Function lmestData()
allows us to check and prepare the data. For example, for the NLSYlong
dataset we have:
dt <- lmestData(data = NLSYlong, id = "id", time="time",
responsesFormula= anti+self ~NULL)
summary(dt, dataSummary="responses", varType=rep("c",ncol(dt$Y)))
#>
#> Data Info:
#> ----------
#>
#> Observations: 581
#> Time occasions: 3
#> Variables: 2
#>
#>
#> Summary:
#> ----------
#>
#> anti self
#> Min. :0.000000 Min. : 6.0000
#> 1st Qu.:0.000000 1st Qu.:18.0000
#> Median :1.000000 Median :21.0000
#> Mean :1.636833 Mean :20.3502
#> 3rd Qu.:3.000000 3rd Qu.:23.0000
#> Max. :6.000000 Max. :24.0000
#>
#> Summary by year:
#> ----------
#>
#>
#> Time = 0
#>
#> anti self
#> Min. :0.000000 Min. : 9.00000
#> 1st Qu.:0.000000 1st Qu.:18.00000
#> Median :1.000000 Median :21.00000
#> Mean :1.567986 Mean :20.07057
#> 3rd Qu.:2.000000 3rd Qu.:23.00000
#> Max. :6.000000 Max. :24.00000
#>
#> Time = 2
#>
#> anti self
#> Min. :0.000000 Min. : 6.00000
#> 1st Qu.:0.000000 1st Qu.:18.00000
#> Median :1.000000 Median :21.00000
#> Mean :1.595525 Mean :20.36213
#> 3rd Qu.:3.000000 3rd Qu.:23.00000
#> Max. :6.000000 Max. :24.00000
#>
#> Time = 4
#>
#> anti self
#> Min. :0.000000 Min. : 9.0000
#> 1st Qu.:0.000000 1st Qu.:19.0000
#> Median :1.000000 Median :21.0000
#> Mean :1.746988 Mean :20.6179
#> 3rd Qu.:3.000000 3rd Qu.:23.0000
#> Max. :6.000000 Max. :24.0000
We can display the summary of every response variable for each time occasion. For the criminal data, if we select only females, we have:
data_criminal_sim<-data.frame(data_criminal_sim)
crimf <- data_criminal_sim[data_criminal_sim$sex == 2,]
dt1 <- lmestData(data = crimf, id = "id", time = "time")
summary(dt1, varType = rep("d",ncol(dt1$Y)))
#>
#> Data Info:
#> ----------
#>
#> Observations: 5200
#> Time occasions: 6
#> Variables: 11
#>
#>
#> Proportion:
#> ----------
#>
#> time sex y1 y2 y3 y4 y5
#> 1:5200 2:1 0:0.9838782 0:0.9971474 0:0.9796154 0:0.9981731 0:0.9427564
#> 2:5200 1:0.0161218 1:0.0028526 1:0.0203846 1:0.0018269 1:0.0572436
#> 3:5200
#> 4:5200
#> 5:5200
#> 6:5200
#> y6 y7 y8 y9 y10
#> 0:0.9909615 0:0.9839103 0:0.991859 0:0.9987179 0:0.9849679
#> 1:0.0090385 1:0.0160897 1:0.008141 1:0.0012821 1:0.0150321
#>
#>
#>
#>
#>
#> Proportion by year:
#> ----------
#>
#>
#> Time = 1
#>
#> time sex y1 y2 y3 y4 y5
#> 1:5200 2:1 0:0.9886538 0:0.9982692 0:0.9696154 0:0.9990385 0:0.9328846
#> 1:0.0113462 1:0.0017308 1:0.0303846 1:0.0009615 1:0.0671154
#> y6 y7 y8 y9 y10
#> 0:0.9957692 0:0.9830769 0:0.9963462 0:0.9990385 0:0.9884615
#> 1:0.0042308 1:0.0169231 1:0.0036538 1:0.0009615 1:0.0115385
#>
#> Time = 2
#>
#> time sex y1 y2 y3 y4 y5
#> 2:5200 2:1 0:0.9755769 0:0.9942308 0:0.9598077 0:0.9965385 0:0.8901923
#> 1:0.0244231 1:0.0057692 1:0.0401923 1:0.0034615 1:0.1098077
#> y6 y7 y8 y9 y10
#> 0:0.9832692 0:0.9696154 0:0.9880769 0:0.9973077 0:0.9707692
#> 1:0.0167308 1:0.0303846 1:0.0119231 1:0.0026923 1:0.0292308
#>
#> Time = 3
#>
#> time sex y1 y2 y3 y4 y5 y6
#> 3:5200 2:1 0:0.975 0:0.9963462 0:0.975 0:0.9973077 0:0.9290385 0:0.9871154
#> 1:0.025 1:0.0036538 1:0.025 1:0.0026923 1:0.0709615 1:0.0128846
#> y7 y8 y9 y10
#> 0:0.9778846 0:0.9859615 0:0.9975 0:0.9773077
#> 1:0.0221154 1:0.0140385 1:0.0025 1:0.0226923
#>
#> Time = 4
#>
#> time sex y1 y2 y3 y4 y5
#> 4:5200 2:1 0:0.9825 0:0.9965385 0:0.9861538 0:0.9980769 0:0.9523077
#> 1:0.0175 1:0.0034615 1:0.0138462 1:0.0019231 1:0.0476923
#> y6 y7 y8 y9 y10
#> 0:0.9901923 0:0.9867308 0:0.9907692 0:0.9994231 0:0.9876923
#> 1:0.0098077 1:0.0132692 1:0.0092308 1:0.0005769 1:0.0123077
#>
#> Time = 5
#>
#> time sex y1 y2 y3 y4 y5
#> 5:5200 2:1 0:0.9873077 0:0.9982692 0:0.9919231 0:0.9984615 0:0.97
#> 1:0.0126923 1:0.0017308 1:0.0080769 1:0.0015385 1:0.03
#> y6 y7 y8 y9 y10
#> 0:0.9938462 0:0.9909615 0:0.9936538 0:0.9994231 0:0.9909615
#> 1:0.0061538 1:0.0090385 1:0.0063462 1:0.0005769 1:0.0090385
#>
#> Time = 6
#>
#> time sex y1 y2 y3 y4 y5
#> 6:5200 2:1 0:0.9942308 0:0.9992308 0:0.9951923 0:0.9996154 0:0.9821154
#> 1:0.0057692 1:0.0007692 1:0.0048077 1:0.0003846 1:0.0178846
#> y6 y7 y8 y9 y10
#> 0:0.9955769 0:0.9951923 0:0.9963462 0:0.9996154 0:0.9946154
#> 1:0.0044231 1:0.0048077 1:0.0036538 1:0.0003846 1:0.0053846
Function lmestFormula()
allows us to specify the model to be estimated. In particular, the formula for the basic LM model is specified as:
fmBasic <- lmestFormula(data = RLMSlong, response = "value")
The formula for the LM model with all covariates affecting the distribution of the latent process may be specified as:
fmLatent <- lmestFormula(data = PSIDlong, response = "Y",
LatentInitial = "X", LatentTransition ="X")
in which the column names start with “Y” ar “X”.
Alternatively, it is possible to specify subsets of covariates influencing the initial and transition probabilities of the latent process that can be also different. For example:
fmLatent2 <- lmestFormula(data = PSIDlong, response = "Y",
LatentInitial = c("X1Race","X2Age","X3Age2","X9Income"),
LatentTransition =c("X1Race","X2Age","X3Age2","X9Income"))
Function lmest()
allows us to estimate LM models for categorical responses with different model specifications and with and without covariates.
These models rely on a homogenous Markov chain of first order with a finite number of states.
Maximum likelihood estimation of model parameters is performed through the EM algorithm.
Standard errors for the parameter estimates are obtained by exact computation of the information matrix or through reliable numerical approximations of this matrix, by using option out_se=TRUE
or by using the suitable function
se()
.
Using PSIDlong
, the basic LM model with time heterogenous transition probabilities can be estimated as follows:
mod <- lmest(responsesFormula = fmLatent$responsesFormula,
index = c("id","time"),
data = PSIDlong, k = 2)
The function requires a data in long format and then an “id” column and a “time” column must be specified in the argument “index”. The number of latent state may be specified as a single values, or a vector of integer values as follows:
mod <- lmest(responsesFormula = fmLatent$responsesFormula,
index = c("id","time"),
data = PSIDlong, k = 1:3)
The suitable number of latent states is selected using the BIC or AIC criterion and returned.
Print method shows the main results:
print(mod)
#>
#> Basic Latent Markov model
#> Call:
#> lmest(responsesFormula = fmLatent$responsesFormula, data = PSIDlong,
#> index = c("id", "time"), k = 1:3)
#>
#> Available objects:
#> [1] "lk" "piv" "Pi" "Psi" "np" "k"
#> [7] "aic" "bic" "lkv" "V" "n" "TT"
#> [13] "modBasic" "Lk" "Bic" "Aic" "call" "data"
#>
#> Convergence info:
#> LogLik np k AIC BIC n TT
#> -6900.447 17 2 13834.89 13924.6 1446 7
Standard errors can be obtained with function se()
as:
se(mod)
#> $sepiv
#> [1] 0.01541024 0.01541024
#>
#> $sePi
#> , , time = 1
#>
#> state
#> state 1 2
#> 1 0 0
#> 2 0 0
#>
#> , , time = 2
#>
#> state
#> state 1 2
#> 1 0.02212323 0.02212323
#> 2 0.01311762 0.01311762
#>
#> , , time = 3
#>
#> state
#> state 1 2
#> 1 0.02149914 0.02149914
#> 2 0.01074772 0.01074772
#>
#> , , time = 4
#>
#> state
#> state 1 2
#> 1 0.023346423 0.023346423
#> 2 0.009465333 0.009465333
#>
#> , , time = 5
#>
#> state
#> state 1 2
#> 1 0.02084359 0.02084359
#> 2 0.01037142 0.01037142
#>
#> , , time = 6
#>
#> state
#> state 1 2
#> 1 0.02017732 0.02017732
#> 2 0.01002753 0.01002753
#>
#> , , time = 7
#>
#> state
#> state 1 2
#> 1 0.02561638 0.02561638
#> 2 0.01156591 0.01156591
#>
#>
#> $sePsi
#> , , item = 1
#>
#> state
#> category 1 2
#> 0 0.005557113 0.00282196
#> 1 0.005557113 0.00282196
#>
#> , , item = 2
#>
#> state
#> category 1 2
#> 0 0.01066309 0.004149002
#> 1 0.01066309 0.004149002
For the data PSIDlong
, we can estimate an LM model with covariates affecting the distribution of the latent process by fixing \(k\) = 2 latent states as follows:
mod2 <- lmest(responsesFormula = fmLatent$responsesFormula,
latentFormula = fmLatent$latentFormula,
index = c("id","time"),
data = PSIDlong, k = 2,
paramLatent = "multilogit",
start = 0, out_se=TRUE)
#> ------------|-------------|-------------|-------------|-------------|-------------|
#> k | start | step | lk | lk-lko | discrepancy |
#> ------------|-------------|-------------|-------------|-------------|-------------|
#> 2 | 0 | 0 | -8957.56 |
#> 2 | 0 | 10 | -6782.67 | 0.320199 | 0.00644032 |
#> 2 | 0 | 20 | -6782.14 | 0.00309348 | 0.000645719 |
#> 2 | 0 | 29 | -6782.13 | 5.10209e-05 | 8.2241e-05 |
#> ------------|-------------|-------------|-------------|-------------|-------------|
Every 10 iterations of the EM algorithm, the function displays, along with the indication of the chosen model specification and number of latent states, the type of starting values used, the number of iterations, the value of the log-likelihood at the end of the current iteration, the difference with respect to the log-likelihood at the end of the previous iteration, and the discrepancy between the corresponding parameter vectors.
The summary()
method returns the estimation results:
summary(mod2)
#> Call:
#> lmest(responsesFormula = fmLatent$responsesFormula, latentFormula = fmLatent$latentFormula,
#> data = PSIDlong, index = c("id", "time"), k = 2, start = 0,
#> paramLatent = "multilogit", out_se = TRUE)
#>
#> Coefficients:
#>
#> Be - Parameters affecting the logit for the initial probabilities:
#> logit
#> 2
#> intercept -1.4473
#> X1Race 0.0783
#> X2Age -0.0952
#> X3Age2 -0.5027
#> X4Education 0.2170
#> X5Child1_2 -0.8882
#> X6Child3_5 -0.5797
#> X7Child6_13 0.1647
#> X8Child14 0.1014
#> X9Income -0.0189
#>
#> Standard errors for Be:
#> logit
#> 2
#> intercept 0.6170
#> X1Race 0.1625
#> X2Age 0.0678
#> X3Age2 0.2916
#> X4Education 0.0376
#> X5Child1_2 0.1171
#> X6Child3_5 0.1116
#> X7Child6_13 0.7246
#> X8Child14 0.1638
#> X9Income 0.0044
#>
#> Ga - Parameters affecting the logit for the transition probabilities:
#> logit
#> 1 2
#> intercept -3.5582 -3.7495
#> X1Race -0.1057 -0.4382
#> X2Age 0.0298 -0.0497
#> X3Age2 0.2579 0.2224
#> X4Education 0.1658 -0.0061
#> X5Child1_2 -0.1673 -0.2244
#> X6Child3_5 -0.2238 0.1116
#> X7Child6_13 0.0874 -0.0968
#> X8Child14 -0.0671 0.1715
#> X9Income -0.0113 0.0088
#>
#> Standard errors for Ga:
#> logit
#> 1 2
#> intercept 0.7166 0.7512
#> X1Race 0.1920 0.2026
#> X2Age 0.0728 0.0777
#> X3Age2 0.3135 0.3174
#> X4Education 0.0407 0.0421
#> X5Child1_2 0.1373 0.1780
#> X6Child3_5 0.1250 0.1444
#> X7Child6_13 0.0751 0.0996
#> X8Child14 0.1392 0.1326
#> X9Income 0.0033 0.0028
#>
#> Psi - Conditional response probabilities:
#> , , item = 1
#>
#> state
#> category 1 2
#> 0 0.8992 0.9485
#> 1 0.1008 0.0515
#>
#> , , item = 2
#>
#> state
#> category 1 2
#> 0 0.9068 0.0347
#> 1 0.0932 0.9653
#>
#>
#> Standard errors for the conditional response probability matrix:
#> , , item = 1
#>
#> state
#> category 1 2
#> 0 0.0056 0.0028
#> 1 0.0056 0.0028
#>
#> , , item = 2
#>
#> state
#> category 1 2
#> 0 0.0099 0.0041
#> 1 0.0099 0.0041
A plot of the conditional response probabilities referred to the categories of the multivariate response is obtained as:
plot(mod2, what = "CondProb")
To get an insight on the results, we observe that the second latent state corresponds to women with the lowest propensity to fertility and the highest propensity to have a job.
The first state corresponds to women with both low propensity to fertility and to have a job position.
A path diagram of the estimated transition probabilities is obtained as:
plot(mod2, what="transitions")
The averaged estimated transition matrix shows a quite high persistence in the same latent state: we observe a probability of around 0.06 of moving from the first to the second state.
The estimated marginal distribution of the latent states can be represented for each time occasion in the following plot:
plot(mod2, what="marginal")
The LM model for continuous outcomes may be estimated by using function lmestCont()
.
For the data NLSYlong
we estimate a multivariate LM model with covariates in the latent process. The selection of the number of latent states can be performed by setting option k
in a suitable way:
modc <- lmestCont(responsesFormula = anti + self ~ NULL,
latentFormula = ~ gender + childage + hispanic + black + pov +
momwork + married| gender + childage + hispanic + black+ pov+
momwork + married,
index = c("id", "time"),
data = dt$data,
k = 1:3,
modBasic=1,
output = TRUE,
tol=10^-1)
We can display the estimation results with a plot of the indices based on the Akaike Information Criterion (AIC) and on the Bayesian Information Criterion (BIC):
plot(modc,what="modSel")
A plot of the ellipse of the estimated overall density with weighs given by the estimated marginal distribution of the latent states is obtained as:
plot(modc, what="density")
A density plot for each component (latent state) is obtained as:
plot(modc,what="density",components=c(1,2))
The latent states are ordered according to increasing values of antisocial behavior.
The path diagram of the estimated transition probabilities is obtained as:
plot(modc,what="transitions")
Other results and asymptotic tests can be obtained by using the estimated standard errors
semodc<-se(modc)
TabBe <-cbind(modc$Be, semodc$seBe, modc$Be/semodc$seBe)
colnames(TabBe) <- c("estBe", "s.e.Be","t-test")
round(TabBe,3)
#> estBe s.e.Be t-test
#> intercept -0.223 2.056 -0.108
#> gender -0.505 0.280 -1.803
#> childage 0.045 0.231 0.196
#> hispanic -0.392 0.374 -1.048
#> black 0.043 0.351 0.122
#> pov 0.403 0.352 1.145
#> momwork 0.091 0.293 0.311
#> married -0.258 0.361 -0.716
The argument Be
, returned by the function, contains the estimated regression parameters affecting the distribution of the initial probabilities: gender log-odds (second row of Be
) is negative and significant, indicating that females tend to be allocated in the first latent state at the beginning of the survey.
TabGa1 <- cbind(modc$Ga,semodc$seGa,modc$Ga/semodc$seGa)
colnames(TabGa1) <- c("estGa(1)","estGa(2)", "s.e.Ga(1)","s.e.Ga(2)", "t-test(1)","t-test(2)")
round(TabGa1,3)
#> estGa(1) estGa(2) s.e.Ga(1) s.e.Ga(2) t-test(1) t-test(2)
#> intercept -2.388 -1.494 6.284 5.247 -0.380 -0.285
#> gender -0.272 0.388 0.853 1.073 -0.319 0.362
#> childage 0.015 -0.125 0.736 0.573 0.021 -0.218
#> hispanic -0.024 -0.063 1.008 1.260 -0.024 -0.050
#> black 0.186 -0.232 1.285 0.987 0.145 -0.235
#> pov 0.142 -0.125 1.841 0.768 0.077 -0.163
#> momwork -0.045 -0.083 0.957 0.755 -0.048 -0.110
#> married -0.120 0.115 1.376 1.049 -0.087 0.110
Output Ga
contains the estimated parameters affecting the distribution of the transition probabilities. They measure the influence of each covariate on the transition between states.
Function lmestMixed()
allows us to estimate mixed LM models for categorical responses to take into account additional sources of (time-fixed) dependence in the data.
For the data data_criminal_sim
we are interested to evaluate the patterns of criminal behavior among individuals.
At this aim, we estimate a model with \(k_1\) = 2 latent classes and \(k_2\) = 2 latent states, restricting the analysis to females.
responsesFormula <- lmestFormula(data = crimf,response = "y")$responsesFormula
modm <- lmestMixed(responsesFormula =responsesFormula,
index = c("id","time"),
k1 = 2, k2 = 2,
tol = 10^-3,
data = crimf)
Results:
summary(modm)
#> Call:
#> lmestMixed(responsesFormula = responsesFormula, data = crimf,
#> index = c("id", "time"), k1 = 2, k2 = 2, tol = 10^-3)
#>
#> Coefficients:
#>
#> Mass probabilities:
#> [1] 0.5929 0.4071
#>
#> Initial probabilities:
#> u
#> v 1 2
#> 1 0.9186 0.7904
#> 2 0.0814 0.2096
#>
#> Transition probabilities:
#> , , u = 1
#>
#> v1
#> v0 1 2
#> 1 0.9667 0.0333
#> 2 0.4983 0.5017
#>
#> , , u = 2
#>
#> v1
#> v0 1 2
#> 1 0.9762 0.0238
#> 2 0.3826 0.6174
#>
#>
#> Conditional response probabilities:
#> , , j = 1
#>
#> v
#> y 1 2
#> 0 0.9962 0.8637
#> 1 0.0038 0.1363
#>
#> , , j = 2
#>
#> v
#> y 1 2
#> 0 0.9985 0.9838
#> 1 0.0015 0.0162
#>
#> , , j = 3
#>
#> v
#> y 1 2
#> 0 0.9988 0.7935
#> 1 0.0012 0.2065
#>
#> , , j = 4
#>
#> v
#> y 1 2
#> 0 1 0.9807
#> 1 0 0.0193
#>
#> , , j = 5
#>
#> v
#> y 1 2
#> 0 0.9821 0.5601
#> 1 0.0179 0.4399
#>
#> , , j = 6
#>
#> v
#> y 1 2
#> 0 0.9985 0.9176
#> 1 0.0015 0.0824
#>
#> , , j = 7
#>
#> v
#> y 1 2
#> 0 0.9975 0.8521
#> 1 0.0025 0.1479
#>
#> , , j = 8
#>
#> v
#> y 1 2
#> 0 0.9981 0.931
#> 1 0.0019 0.069
#>
#> , , j = 9
#>
#> v
#> y 1 2
#> 0 1 0.9863
#> 1 0 0.0137
#>
#> , , j = 10
#>
#> v
#> y 1 2
#> 0 0.9993 0.8454
#> 1 0.0007 0.1546
round(modm$Psi[2, , ], 3)
#> j
#> v 1 2 3 4 5 6 7 8 9 10
#> 1 0.004 0.001 0.001 0.000 0.018 0.001 0.003 0.002 0.000 0.001
#> 2 0.136 0.016 0.207 0.019 0.440 0.082 0.148 0.069 0.014 0.155
We can identify the first latent state as that of females with null or very low tendency to commit crimes, whereas the second latent state corresponds to criminals having mainly the following types of activity: theft, burglary, and other offences. According to the estimated transition matrix, females classified in the first cluster present a higher probability (of around 0.5) to move from the second to the first latent state than those assigned to the second cluster (of around 0.4), revealing a more pronounced tendency to commit less crimes over time.
Function lmestSearch()
allows us to deal with both model selection and multimodality of the likelihood function. Different initializations are employed to search for the global maximum of the log-likelihood function.
Two main criteria are provided to select the number of latent states: AIC and BIC.
For example, for the RLMSlong
dataset we can estimate the basic LM model for increasing values of the latent states \(k\) ranging from 1 to 4:
out <- lmestSearch(responsesFormula = fmBasic$responsesFormula,
index = c("id","time"),
data = RLMSlong,version ="categorical", k = 1:4,
modBasic = 1, seed = 123)
We can display the results of the model selection with:
summary(out)
#> Call:
#> lmestSearch(responsesFormula = fmBasic$responsesFormula, data = RLMSlong,
#> index = c("id", "time"), k = 1:4, version = "categorical",
#> seed = 123, modBasic = 1)
#>
#> states lk np AIC BIC
#> 1 -14943.73 4 29895.45 29917.25
#> 2 -13921.09 11 27864.18 27924.12
#> 3 -13557.20 20 27154.41 27263.39
#> 4 -13392.93 31 26847.86 27016.78
The minimum BIC index corresponds to the model with \(k\)=4 latent states and the model has 31 free parameters.
The estimation results for the selected number of states may be displayed as follows:
mod4 <- out$out.single[[4]]
summary(mod4)
#> Call:
#> NULL
#>
#> Coefficients:
#>
#> Initial probabilities:
#> est_piv
#> [1,] 0.1717
#> [2,] 0.4400
#> [3,] 0.2021
#> [4,] 0.1863
#>
#> Transition probabilities:
#> , , time = 2
#>
#> state
#> state 1 2 3 4
#> 1 0.8809 0.0675 0.0250 0.0266
#> 2 0.0173 0.9219 0.0520 0.0088
#> 3 0.0165 0.0925 0.8727 0.0183
#> 4 0.0145 0.0642 0.1568 0.7645
#>
#> , , time = 3
#>
#> state
#> state 1 2 3 4
#> 1 0.8809 0.0675 0.0250 0.0266
#> 2 0.0173 0.9219 0.0520 0.0088
#> 3 0.0165 0.0925 0.8727 0.0183
#> 4 0.0145 0.0642 0.1568 0.7645
#>
#> , , time = 4
#>
#> state
#> state 1 2 3 4
#> 1 0.8809 0.0675 0.0250 0.0266
#> 2 0.0173 0.9219 0.0520 0.0088
#> 3 0.0165 0.0925 0.8727 0.0183
#> 4 0.0145 0.0642 0.1568 0.7645
#>
#> , , time = 5
#>
#> state
#> state 1 2 3 4
#> 1 0.8809 0.0675 0.0250 0.0266
#> 2 0.0173 0.9219 0.0520 0.0088
#> 3 0.0165 0.0925 0.8727 0.0183
#> 4 0.0145 0.0642 0.1568 0.7645
#>
#> , , time = 6
#>
#> state
#> state 1 2 3 4
#> 1 0.8809 0.0675 0.0250 0.0266
#> 2 0.0173 0.9219 0.0520 0.0088
#> 3 0.0165 0.0925 0.8727 0.0183
#> 4 0.0145 0.0642 0.1568 0.7645
#>
#> , , time = 7
#>
#> state
#> state 1 2 3 4
#> 1 0.8809 0.0675 0.0250 0.0266
#> 2 0.0173 0.9219 0.0520 0.0088
#> 3 0.0165 0.0925 0.8727 0.0183
#> 4 0.0145 0.0642 0.1568 0.7645
#>
#>
#> Conditional response probabilities:
#> , , item = 1
#>
#> state
#> category 1 2 3 4
#> 0 0.5793 0.0694 0.0239 0.0181
#> 1 0.3463 0.8413 0.3040 0.1679
#> 2 0.0493 0.0635 0.5576 0.1851
#> 3 0.0222 0.0248 0.1035 0.4541
#> 4 0.0030 0.0009 0.0111 0.1748
A plot of the conditional response probabilities referred to the categories of the univariate response is obtained with:
plot(mod4, what="CondProb")
Function lmestDecoding()
allows us to predict the sequence of latent states for the sample units on the basis of the output of the main estimation functions, and so to perform a “dynamic pattern recognition”.
For the basic LM model estimated
by using data PSIDlong
the local (Ul
) and global (Ug
) decoding (Viterbi algorithm) are given by:
dec <- lmestDecoding(mod)
head(dec$Ul)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 2 2 2 2 2 2 2
#> [2,] 2 2 2 1 1 1 2
#> [3,] 2 2 2 2 2 2 2
#> [4,] 1 1 2 2 2 2 2
#> [5,] 2 2 2 2 2 2 2
#> [6,] 2 2 2 2 2 2 2
head(dec$Ug)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 2 2 2 2 2 2 2
#> [2,] 2 2 2 1 1 1 2
#> [3,] 2 2 2 2 2 2 2
#> [4,] 1 1 2 2 2 2 2
#> [5,] 2 2 2 2 2 2 2
#> [6,] 2 2 2 2 2 2 2
Function bootstrap()
allows us to obtain standard errors by parametric bootstrap. A reasonable number of bootstrap samples is \(B=200\) and just for illustrative purposes we show how to obtain \(B=2\) samples from the output modc
of the model with covariates for the NLSYlong
data estimated with function lmestCont()
as follows:
mboot <- bootstrap(modc, n = 581, B = 2, seed = 172)
Object seMu
contains the estimated standard errors for the conditional means:
mboot$seMu
#> state
#> 1 2
#> [1,] 0.01780044 0.07510123
#> [2,] 0.09425562 0.18199815
Function drawLMbasic()
allows us to draw samples from the estimated basic LM model.
For example, considering the basic LM model illustrated with the RLMSlong
dataset, we can obtain a sample of responses of size \(n\) = 100 as follows:
draw3 <- drawLMbasic(est = mod4, format = "matrices", seed = 4321, n = 100)
#> ------------|
#> sample unit|
#> ------------|
#> ------------|
head(draw3$Y)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] 1 0 1 1 1 1 2
#> [2,] 1 0 1 1 1 1 1
#> [3,] 0 2 1 0 1 2 1
#> [4,] 1 1 1 1 1 1 1
#> [5,] 1 2 1 1 1 1 1
#> [6,] 1 1 1 1 3 1 2
Every line of \(Y\) contains the sample responses of each unit for the seven time occasions.
The package also provides functions to draw samples from other LM models.