# A bi-dimensional finite mixture model for longitudinal data subject to dropout

###### Abstract

In longitudinal studies, subjects may be lost to follow-up or miss some of the planned visits, leading to incomplete response sequences. When the probability of non-response, conditional on the available covariates and the observed responses, still depends on unobserved outcomes, the dropout mechanism is said to be non-ignorable. A common objective is to build a reliable association structure to account for dependence between the longitudinal and the dropout processes. Starting from the existing literature, we introduce a random coefficient based dropout model where the association between outcomes is modeled through discrete latent effects. These effects are outcome-specific and account for heterogeneity in the univariate profiles. Dependence between profiles is introduced by using a bi-dimensional representation for the corresponding distribution. In this way, we define a flexible latent class structure that efficiently describes both dependence within each of the two margins of interest and dependence between them. By using this representation we show that, unlike standard (unidimensional) finite mixture models, the non-ignorable dropout model properly nests its ignorable counterpart. We illustrate the proposed modeling approach by analyzing data from a longitudinal study on the dynamics of cognitive functioning in the elderly. Further, the effects of assumptions about non-ignorability of the dropout process on model parameter estimates are investigated locally using the index of local sensitivity to non-ignorability.

Keywords: Panel data, Informative missingness, Nonparametric Maximum Likelihood, Concomitant latent variables, Index of Sensitivity to Non-Ignorability.

## 1 Introduction

In longitudinal studies, measurements from the same individuals (units) are repeatedly taken over time. However, individuals may be lost to follow-up or fail to show up at some of the planned measurement occasions, leading to attrition (also referred to as dropout) and intermittent missingness, respectively. rub1976 provides a well-known taxonomy for mechanisms that generate incomplete sequences. If the probability of a missing response depends neither on the observed nor on the missing responses, conditional on the observed covariates, the data are said to be missing completely at random (MCAR). Data are missing at random (MAR) if, conditional on the observed data (both covariates and responses), the missingness does not depend on the non-observed responses. When the previous assumptions do not hold, that is when, conditional on the observed data, the mechanism leading to missing data still depends on the unobserved responses, data are referred to as missing not at random (MNAR). In the context of likelihood inference, when the parameters in the measurement and in the missingness processes are distinct, processes leading either to MCAR or MAR data may be ignored; when either the parameter spaces are not distinct or the missing data process is MNAR, missing data are non-ignorable (NI). Only when the ignorability property is satisfied can standard (likelihood) methods be used to obtain consistent parameter estimates. Otherwise, some form of joint modeling of the longitudinal measurements and the missingness process is required. See litrub2002 for a comprehensive review of the topic.
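The taxonomy above can be written compactly. The notation in this sketch is assumed for illustration and is not taken from the paper: $\mathbf{y}_i^{o}$ and $\mathbf{y}_i^{m}$ denote the observed and the missing responses for unit $i$, $\mathbf{r}_i$ the vector of missingness indicators, $\mathbf{x}_i$ the observed covariates, and $\boldsymbol{\psi}$ the parameters of the missingness process.

$$
\begin{aligned}
\text{MCAR:}\quad & \Pr(\mathbf{r}_i \mid \mathbf{y}_i^{o}, \mathbf{y}_i^{m}, \mathbf{x}_i; \boldsymbol{\psi}) = \Pr(\mathbf{r}_i \mid \mathbf{x}_i; \boldsymbol{\psi}), \\
\text{MAR:}\quad & \Pr(\mathbf{r}_i \mid \mathbf{y}_i^{o}, \mathbf{y}_i^{m}, \mathbf{x}_i; \boldsymbol{\psi}) = \Pr(\mathbf{r}_i \mid \mathbf{y}_i^{o}, \mathbf{x}_i; \boldsymbol{\psi}), \\
\text{MNAR:}\quad & \Pr(\mathbf{r}_i \mid \mathbf{y}_i^{o}, \mathbf{y}_i^{m}, \mathbf{x}_i; \boldsymbol{\psi}) \ \text{still depends on } \mathbf{y}_i^{m}.
\end{aligned}
$$

Under this reading, MCAR is a special case of MAR, and ignorability for likelihood inference additionally requires that $\boldsymbol{\psi}$ be distinct from the parameters of the measurement process.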

For this purpose, we focus on the class of Random Coefficient Based Dropout Models (RCBDMs; Little1995). In this framework, separate (conditional) models are built for the two partially observed processes, and the link between them is due to sharing common or dependent individual- (and possibly outcome-) specific random coefficients. The model structure is completed by assuming that the random coefficients are drawn from a given probability distribution. A choice is therefore needed to define such a distribution and, in recent years, the literature has focused on both parametric and nonparametric specifications. Frequently, the random coefficients are assumed to be Gaussian (e.g. ver2002; gao2004), but this assumption was questioned by several authors, see e.g. sch1999, since the resulting inference can be sensitive to such assumptions, especially in the case of short longitudinal sequences. For this reason, alf2009 proposed to leave the random coefficient distribution unspecified, defining a semi-parametric model where the longitudinal and the dropout processes are linked through dependent (discrete) random coefficients. tso2009 suggested following a similar approach for handling intermittent, potentially non-ignorable, missing data. A related approach to deal with longitudinal Gaussian data subject to missingness was proposed by Beunc2008, where a finite mixture of mixed effect regression models for the longitudinal and the dropout processes was discussed. Further generalizations in the shared parameter model framework were proposed by cre2011, who discussed an approach based on partially shared individual- (and outcome-) specific random coefficients, and by bart2015, who extended standard latent Markov models to handle potentially informative dropout via shared discrete random coefficients.
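As a sketch of the structure just described, and in notation that is assumed here rather than taken from the paper, let $\mathbf{b}_i = (\mathbf{b}_{i1}, \mathbf{b}_{i2})$ collect the random coefficients entering the longitudinal and the dropout model for unit $i$, with distribution $G$. If the two processes are taken to be conditionally independent given $\mathbf{b}_i$ (the usual working assumption in this class of models), the joint density of the observed responses $\mathbf{y}_i^{o}$ and the missingness indicators $\mathbf{r}_i$ is

$$
f(\mathbf{y}_i^{o}, \mathbf{r}_i \mid \mathbf{x}_i) = \int f(\mathbf{y}_i^{o} \mid \mathbf{b}_{i1}, \mathbf{x}_i)\, f(\mathbf{r}_i \mid \mathbf{b}_{i2}, \mathbf{x}_i)\, \mathrm{d}G(\mathbf{b}_{i1}, \mathbf{b}_{i2}).
$$

When $G$ is left unspecified and estimated by a discrete distribution on $K$ support points $(\boldsymbol{\zeta}_{1k}, \boldsymbol{\zeta}_{2k})$ with masses $\pi_k$, the integral becomes a finite sum:

$$
f(\mathbf{y}_i^{o}, \mathbf{r}_i \mid \mathbf{x}_i) = \sum_{k=1}^{K} \pi_k\, f(\mathbf{y}_i^{o} \mid \boldsymbol{\zeta}_{1k}, \mathbf{x}_i)\, f(\mathbf{r}_i \mid \boldsymbol{\zeta}_{2k}, \mathbf{x}_i).
$$

Any dependence between the two processes then comes only from the dependence between $\mathbf{b}_{i1}$ and $\mathbf{b}_{i2}$ under $G$.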

In the present paper, the association structure between the measurement and the dropout processes is based on a random coefficient distribution which is left completely unspecified, and estimated through a discrete distribution, leading to a (bi-dimensional) finite mixture model. The adopted bi-dimensional structure allows the bivariate distribution for the random coefficients to reduce to the product of the corresponding marginals when the dropout mechanism is ignorable. Therefore, a peculiar feature of the proposed modeling approach, when compared to standard finite mixture models, is that the MNAR specification properly nests the MAR/MCAR ones, and this allows a straightforward (local) sensitivity analysis. We propose to explore the sensitivity of parameter estimates in the longitudinal model to the assumptions on non-ignorability of the dropout process by adapting the so-called index of sensitivity to non-ignorability (ISNI) introduced by trox2004 and ma2005, considering different perturbation scenarios.
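The nesting property can be illustrated numerically. The following is a minimal sketch, not the estimation procedure of the paper: the masses are hypothetical, and the helper names `marginals` and `mutual_information` are introduced here only for illustration. It builds two joint probability tables for the pair of latent class labels with the same marginals. One factorizes into the product of the marginals (the ignorable case) and one does not, and the mutual information between the two labels is zero only in the first.

```python
import math

def marginals(joint):
    """Row and column marginals of a joint probability table joint[g][l]."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    return rows, cols

def mutual_information(joint):
    """Mutual information (in nats) between the two latent class labels."""
    rows, cols = marginals(joint)
    mi = 0.0
    for g, row in enumerate(joint):
        for l, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (rows[g] * cols[l]))
    return mi

# Hypothetical masses: 2 classes for the longitudinal process, 3 for dropout.
pi_long = [0.6, 0.4]
pi_drop = [0.5, 0.3, 0.2]

# Ignorable case: the joint masses are the product of the marginals.
joint_ignorable = [[a * b for b in pi_drop] for a in pi_long]

# Non-ignorable case: same marginals, but the masses do not factorize.
joint_nonignorable = [
    [0.40, 0.15, 0.05],
    [0.10, 0.15, 0.15],
]

print(mutual_information(joint_ignorable))
print(mutual_information(joint_nonignorable))
```

In a unidimensional finite mixture the two processes share a single class label, so with more than one component the labels are perfectly dependent and the product form cannot be obtained as a special case; the two-way table of masses is what makes the factorized case reachable.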

The paper is structured as follows. In Section 2 we introduce the motivating application, the Leiden 85+ study, concerning the dynamics of cognitive functioning in the elderly. Section 3 discusses general random coefficient based dropout models, while our proposal is detailed in Section 4. Sections 5-6 detail the proposed EM algorithm for maximum likelihood estimation of model parameters and the index of local sensitivity we propose. Section 7 provides the application of the proposed model to data from the motivating example, using either MAR or MNAR assumptions, and the results from the sensitivity analysis. The last section contains concluding remarks.

## 2 Motivating example: Leiden 85+ data

The motivating data come from the Leiden 85+ study, a retrospective study involving 705 inhabitants of Leiden (the Netherlands) who reached the age of 85 years between September 1997 and September 1999. The study aimed at identifying demographic and genetic determinants of the dynamics of cognitive functioning in the elderly. Several covariates collected at the beginning of the study were considered: gender (female is the reference category), educational status, distinguishing between primary (reference category) and higher education, and plasma Apolipoprotein E (APOE) genotype. The educational level was determined by the number of years each subject went to school; primary education corresponds to less than 7 years of schooling. As regards the APOE genotype, the three largest groups were considered: $\epsilon_2$, $\epsilon_3$ and $\epsilon_4$. The latter allele is known to be linked to an increased risk of dementia, whereas $\epsilon_2$ allele carriers are relatively protected. Only 541 subjects have complete covariate information and will be considered in the following.

Study participants were visited yearly until the age of 90 at their place of residence, and face-to-face interviews were conducted through a questionnaire whose items are designed to assess orientation, attention, language skills and the ability to perform simple actions. The Mini Mental State Examination index, in the following MMSE (fol1975), is obtained by summing the scores on the items of the questionnaire designed to assess potential cognitive impairment. The observed values are integers ranging between 0 and 30 (maximum total score).

A number of enrolled subjects drop out prematurely, because of poor health conditions or death. In Table 2, we report the total number of available measures for each follow-up visit. We also report the number (and the percentage) of participants who leave the study between the current and the subsequent occasion, distinguishing between those who drop out and those who die. As can be seen, fewer than half of the study participants present complete longitudinal sequences (49%), and this is mainly due to death (44% of the subjects died during the follow-up).

Follow-up age | Total | Complete (%) | Do not participate (%) | Die (%)
---|---|---|---|---
85-86 | 541 | 484 (89.46) | 9 (1.66) | 48 (8.87)
86-87 | 484 | 422 (87.19) | 3 (0.62) | 59 (12.19)
87-88 | 422 | 373 (88.39) | 2 (0.47) | 47 (11.14)
88-89 | 373 | 318 (85.25) | 6 (1.61) | 49 (13.14)
89-90 | 318 | 266 (83.65) | 15 (4.72) | 37 (11.63)
Total | 541 | 266 (49.17) | 35 (6.47) | 240 (44.36)
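The counts in the table can be cross-checked with a few lines of code. The sketch below is illustrative only: the tuple layout and the helper names `percent` and `check_flow` are assumptions made here, and the counts are transcribed from the table. It verifies that the three exit categories add up to the number of subjects at the start of each interval, that completers of one interval are the subjects entering the next, and it recomputes the percentages.

```python
# Counts transcribed from the table above, one tuple per follow-up interval:
# (age interval, subjects at start, complete, do not participate, die)
follow_up = [
    ("85-86", 541, 484, 9, 48),
    ("86-87", 484, 422, 3, 59),
    ("87-88", 422, 373, 2, 47),
    ("88-89", 373, 318, 6, 49),
    ("89-90", 318, 266, 15, 37),
]

def percent(count, total):
    """Percentage of `total` represented by `count`, to two decimals."""
    return round(100.0 * count / total, 2)

def check_flow(rows):
    """Return True if every interval is internally consistent and the
    subjects completing one interval are those starting the next."""
    for i, (_, total, complete, dropout, die) in enumerate(rows):
        if complete + dropout + die != total:
            return False
        if i + 1 < len(rows) and rows[i + 1][1] != complete:
            return False
    return True

enrolled = follow_up[0][1]
completers = follow_up[-1][2]
total_dropout = sum(row[3] for row in follow_up)
total_deaths = sum(row[4] for row in follow_up)

for age, total, complete, dropout, die in follow_up:
    print(age, total, percent(complete, total),
          percent(dropout, total), percent(die, total))
print("Overall", enrolled, percent(completers, enrolled),
      percent(total_dropout, enrolled), percent(total_deaths, enrolled))
```

The per-interval percentages use the number of subjects at the start of that interval as denominator, while the overall figures use the 541 subjects with complete covariate information.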