Tree-Structured Clustering in Fixed Effects Models
Fixed effects models are very flexible because they do not make assumptions about the distribution of the effects and can also be used if the heterogeneity component is correlated with explanatory variables. A disadvantage is the large number of effects that have to be estimated. A recursive partitioning (or tree-based) method is proposed that identifies clusters of units sharing the same effect. The approach reduces the number of parameters to be estimated and is useful in particular if one is interested in identifying clusters with the same effect on a response variable. It is shown that the method performs well and outperforms competitors like the finite mixture model, in particular if the heterogeneity component is correlated with explanatory variables. The usefulness of the approach for identifying clusters that share the same effect is illustrated in two applications.
Keywords: Fixed effects model; random effects model; recursive partitioning; tree-structured regression; regularization.
The analysis of longitudinal data and cross-sectional data that come in clusters requires taking the dependence of observations and the heterogeneity of measurement units into account. Typically, measurements within units tend to be more similar than measurements between units. If the heterogeneity is ignored, poor performance of estimators and misleading standard errors are to be expected.
The most popular and widely used model to account for unobserved heterogeneity is the random effects model; see, for example, VerMol:2000, MolVer:2005 and MccSea:2001. Typically, in the random effects model it is assumed that the random effects follow a normal distribution. This strong assumption results in an economical model, but inference may be sensitive to the specification of the distribution of random effects; see HeaKur:2001, AgrCafOhm:2004 and LitAloMol:2007.
Several approaches to weaken the assumption of normally distributed random effects have been proposed. More flexible distributions are obtained, for example, by using mixtures of normals as proposed by Chenetal:2002 and MagZeg:96. Huang:2009 proposed diagnostic methods for random-effect misspecification and ClaHart:2009 proposed tests for the assumption of the normal distribution. More recently, lombardia2012new proposed the class of semi-mixed effects models, a continuum of models that combine random and fixed effects.
An alternative approach to model heterogeneity uses finite mixtures. In finite mixtures of generalized linear models it is assumed that the density or mass function of the responses given the explanatory variables is determined by a finite mixture of components. Each of the components has its own response distribution and own parameters that determine the influence of explanatory variables. If only part of the parameters, for example the intercepts, are allowed to vary over components one obtains a discrete distribution of the heterogeneity part of the model. Models of that type were considered by FolLam:89 and Aitkin:99. FolLam:89 investigated the identifiability of finite mixtures of binomial regression models and gave sufficient identifiability conditions for mixing of binary and binomial distributions. grun2008identifiability considered identifiability for mixtures of multinomial logit models.
Finite mixture models replace the assumption of a fixed continuous distribution of random effects by the assumption of a discrete distribution. One may see this as an alternative and flexible specification of the heterogeneity component only. However, by assuming a discrete distribution of the intercepts instead of a continuous distribution as in random effects models one also implicitly assumes that there are clusters of units that share the same effect. In some applications it is definitely of interest to identify these units. We will consider an example in which the units are schools and one wants to know which schools are similar in their performance with regard to the education of students.
Here we consider an alternative to finite mixture models with the same objectives, namely the use of a flexible discrete distribution and the identification of units that share the same effect. However, the starting point is different. We use a fixed effects model in which each unit has its own parameter. An advantage is that no structural assumptions on the unit-specific effects have to be made. Clusters of parameters, and therefore units with the same effect, are found by tree methodology, although a different one than in classical trees.
Classical recursive partitioning techniques or trees were first introduced by MorSon:63. Very popular methods are classification and regression trees (CART) by BreiFrieOls:84 and C4.5 by Quinlan:86 and Quinlan:93. A newer version of recursive partitioning based on conditional inference was proposed by Hotetal:2006. An overview of recursive partitioning in health science was given by ZhaSin:1999, and one with a focus on psychometrics by Strobetal:2009. An easily accessible introduction to the basic concepts is found in HasTibFri:2009B.
The tree methodology used here differs from these approaches. In CART and other classical approaches the whole covariate space is recursively partitioned into subspaces. In order to obtain a partitioning in the intercepts (or slopes) only, one has to apply a different form of tree. It has to be designed in such a way that the subspaces are built for specific effects only, for example the intercepts, while other parameters that represent common effects of explanatory variables are not partitioned into subspaces. Our main focus is on the clustering of intercepts; however, we will also refer to the case of unit-specific slopes. One big advantage of using recursive partitioning techniques is their computational efficiency. The proposed tree-structured model in particular enables the evaluation of high-dimensional data. Alternative approaches to identify clusters within a fixed effects model framework, as proposed by TuOelkFixed, fail in high-dimensional settings.
The article is organized as follows: In Section 2 we introduce the tree-structured model for unit-specific intercepts, and in Section 3 we present an illustrative example. Details of the fitting procedure are given in Section 4. After a short introduction of related approaches in Section 5, we give the results of more extensive simulation studies (Section 6). Finally, Section 7 contains a second application.
2 Accounting for Heterogeneity in Clustered Data
Consider clustered data given by $(y_{it}, x_{it}, z_{it})$, $i = 1, \dots, n$, $t = 1, \dots, T_i$, where $y_{it}$ denotes the response of measurement $t$ for unit $i$, and $x_{it}$ and $z_{it}$ are two sets of predictive variables. In longitudinal data the units can, for example, represent persons that are measured repeatedly. In the following, we consider alternative methods to account for the potential heterogeneity of units. We start with methods that use random effects, then consider fixed effects models and finite mixtures.
2.1 Random Effects Models
In a generalized linear mixed model (GLMM) the mean response $\mu_{it} = E(y_{it} \mid b_i, x_{it}, z_{it})$ is linked to the explanatory variables by
$$g(\mu_{it}) = x_{it}^T \beta + z_{it}^T b_i,$$
where $x_{it}^T \beta$ is a linear term which contains the fixed effects $\beta$. The second term contains the random effects $b_i$ for covariates $z_{it}$ that vary across units, and $g$ is a known link function. In a GLMM it is assumed that the distribution of $y_{it} \mid b_i, x_{it}, z_{it}$ follows a simple exponential family and that the observations are conditionally independent. For the random effects $b_i$, which model the heterogeneity of the units, one typically assumes a normal distribution $b_i \sim N(0, Q)$.
In a GLMM the distribution of the random effects is used to account for the heterogeneity of the units, and the focus is mainly on the parametric term $x_{it}^T \beta$. Although the distributional assumption for the random effects makes the estimation of the model very efficient, there are also some disadvantages. If the assumed distribution is very different from the true data-generating distribution, inference can be biased. The assumption of a continuous distribution also does not allow different units to have identical effects; hence, clustering of units is not possible. Another crucial point of the GLMM is the assumption that the random effects and the covariates are uncorrelated. Violations of this assumption can lead to poor estimation accuracy; see, for example, grilli2011endo. Functions for the estimation of generalized linear mixed models are provided by the R package lme4 (lme4:2015), which we will use for the computations in the applications and simulations.
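To make the heterogeneity described above concrete, the following minimal sketch simulates data from a random-intercept model and recovers the intraclass correlation from within- and between-unit variation. It is purely illustrative (all names and the variance-decomposition estimator are our own, not from the paper), and it uses numpy only rather than lme4.

```python
import numpy as np

# Illustrative simulation: y_it = mu + b_i + eps_it with
# b_i ~ N(0, 1) and eps_it ~ N(0, 1), so the true ICC is 0.5.
rng = np.random.default_rng(0)
n, T = 500, 10
b = rng.normal(0.0, 1.0, n)                    # unit-specific random intercepts
y = 1.0 + b[:, None] + rng.normal(0.0, 1.0, (n, T))

within = y.var(axis=1, ddof=1).mean()          # estimates sigma_eps^2
between = y.mean(axis=1).var(ddof=1)           # estimates sigma_b^2 + sigma_eps^2 / T
sigma_b2 = between - within / T                # method-of-moments estimate of sigma_b^2
icc = sigma_b2 / (sigma_b2 + within)           # intraclass correlation, close to 0.5
```

With this data-generating process, measurements within a unit share the draw $b_i$ and are therefore correlated, which is exactly the dependence that pooled estimation ignores.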
2.2 Fixed Effects Models
In contrast to mixed models, fixed effects models represent heterogeneity among units by using one parameter vector for each unit. The mean response is linked to the explanatory variables in the form
$$g(\mu_{it}) = x_{it}^T \beta + z_{it}^T \beta_i,$$
where again $x_{it}$ is a vector of covariates that have the same effect across all units and $z_{it}$ contains covariates that have different effects over units. Each measurement unit $i$ has its own parameter vector $\beta_i$. The specification of one parameter vector per unit results in a very large number of parameters, which can affect estimation accuracy. Moreover, there is typically not enough information to distinguish between all units. To cope with these problems one can assume that there are groups of units that share the same effect on the response. Forming clusters of units leads to a reduced number of parameters and stable estimates. There are several strategies to identify these clusters: the fixed effects model with regularization considered in the next section, or the finite mixture model (Section 2.4).
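A fixed effects fit with unit-specific intercepts amounts to least squares on a design matrix with one dummy column per unit plus the shared covariates. The sketch below (names and the two-cluster setup are illustrative assumptions, not taken from the paper) shows that the per-unit intercepts are recoverable when enough measurements per unit are available:

```python
import numpy as np

# Illustrative fixed effects fit: every unit i gets its own intercept
# alpha_i, while the slope beta is shared across all units.
rng = np.random.default_rng(1)
n, T, beta = 40, 25, 2.0
alpha = np.where(np.arange(n) < 20, -1.0, 1.0)   # two latent clusters of intercepts
x = rng.normal(size=(n, T))
y = alpha[:, None] + beta * x + rng.normal(0.0, 0.5, (n, T))

# design matrix: one dummy column per unit plus the common covariate
D = np.kron(np.eye(n), np.ones((T, 1)))          # unit indicators, shape (n*T, n)
X = np.column_stack([D, x.reshape(-1, 1)])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
alpha_hat, beta_hat = coef[:n], coef[n]          # n intercepts and one common slope
```

The estimated intercepts `alpha_hat` fall into two clearly separated groups, which is the kind of latent cluster structure the tree approach is designed to detect; note also that $n$ grows with the number of units, which is the parameter-proliferation problem described above.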
2.3 Tree-Structured Clustering
In the approach considered here one assumes that the fixed effects model holds, but not all the unit-specific parameters are assumed to be different. Clusters (or groups) of measurement units are identified by recursive partitioning methods. We first consider unit-specific intercepts only. Let us start with the simplest case in which all intercepts are equal, that is, the linear predictor has the form $\eta_{it} = \beta_0 + x_{it}^T \beta$. If there are two clusters the corresponding linear predictor is given by
$$\eta_{it} = \beta_{0,g(i)} + x_{it}^T \beta, \quad g(i) \in \{1, 2\},$$
where $g(i)$ denotes whether unit $i$ is in the first or the second group. A simple test, for example a likelihood ratio test, of the hypothesis $H_0: \beta_{01} = \beta_{02}$ can be used to determine whether the model with two groups is more adequate for the data than the model in which all the intercepts are equal. By iterative splitting into subsets, guided by test statistics, one obtains a clustering of units that have to be distinguished with regard to their intercept.
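For a normal response, the split test just described reduces to comparing residual sums of squares of the nested models. The following sketch (data-generating setup and names are our illustrative assumptions) computes the likelihood ratio statistic for a candidate split of the units against the single-intercept model:

```python
import numpy as np
from scipy.stats import chi2

# Illustrative split test: one common intercept (H0) versus two
# group-specific intercepts (H1), normal response, no covariates.
rng = np.random.default_rng(2)
n, T = 30, 20
group = np.repeat([0, 1], [15, 15])              # candidate split of the units
alpha = np.where(group == 0, 0.0, 0.8)           # true group intercepts differ
y = alpha[:, None] + rng.normal(0.0, 1.0, (n, T))

def rss(design, yv):
    """Residual sum of squares of a least squares fit."""
    coef, *_ = np.linalg.lstsq(design, yv, rcond=None)
    resid = yv - design @ coef
    return resid @ resid

yv = y.ravel()                                    # unit-major ordering
X0 = np.ones((n * T, 1))                          # H0: beta_01 = beta_02
X1 = np.column_stack([np.repeat(group == 0, T),
                      np.repeat(group == 1, T)]).astype(float)
lr = n * T * np.log(rss(X0, yv) / rss(X1, yv))    # LR statistic for nested normal models
p = chi2.sf(lr, df=1)                             # one extra intercept under H1
```

A small p-value indicates that the two-group model fits clearly better, so the split would be accepted; in the tree procedure such tests guide which split to perform next.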
In general, regression trees can be seen as a representation of a partition of the predictor space. A tree is built by successively splitting one node $\Omega$, which is already a subset of the predictor space, into two subsets $\Omega \cap S$ and $\Omega \cap \bar{S}$, with the split being determined by only one variable. In a fixed effects model, when specifying specific intercepts for each unit, the unit number itself can be seen as a nominal categorical variable with $n$ categories. The partition has the form $\Omega = S \cup \bar{S}$, where $S$ and $\bar{S}$ are disjoint, non-empty subsets, $\bar{S}$ being the complement of $S$. Using this notation, another representation of model (3) is given by
$$\eta_{it} = I(i \in S)\beta_{01} + I(i \in \bar{S})\beta_{02} + x_{it}^T \beta,$$
where $I(a)$ denotes the indicator function with $I(a) = 1$ if $a$ is true and $I(a) = 0$ otherwise. After several splits one obtains a clustering of the units, and the predictor of the resulting model can be represented by
$$\eta_{it} = \sum_{g=1}^{G} I(i \in S_g)\beta_{0g} + x_{it}^T \beta,$$
where $S_1, \dots, S_G$ is a partition of $\{1, \dots, n\}$ consisting of $G$ clusters that have to be distinguished in terms of their individual intercepts.
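The clustered predictor above can be translated directly into a design matrix: one indicator column per cluster of units plus the common covariates. The helper below is a minimal sketch under our own interface assumptions (the function name and the representation of a partition as a list of index sets are illustrative):

```python
import numpy as np

def partition_design(partition, n, T, x):
    """Design for eta_it = sum_g I(i in S_g) * beta_0g + x_it * beta.

    `partition` is a list of disjoint sets of unit indices covering
    0..n-1; `x` holds the common covariate in unit-major order.
    """
    G = len(partition)
    Z = np.zeros((n * T, G))
    for g, S in enumerate(partition):
        for i in S:
            Z[i * T:(i + 1) * T, g] = 1.0     # indicator I(i in S_g)
    return np.column_stack([Z, x.reshape(n * T, 1)])

# three units, two measurements each, clusters {0, 1} and {2}
X = partition_design([{0, 1}, {2}], n=3, T=2, x=np.arange(6.0))
```

Each row carries exactly one cluster indicator, so refining the partition simply adds columns; with the trivial partition of singletons this reduces to the full fixed effects design.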
In the following we will use the model abbreviation TSC for tree-structured clustering.
2.4 Finite Mixture Models
An alternative approach that also allows one to identify clusters of units is the finite mixture model. Such models were, for example, considered by FolLam:89 and Aitkin:99. The general assumption in finite mixtures of generalized regression models is that the mixture consists of $K$ components, where each component follows a parametric distribution from the exponential family. The density of the mixture is given by
$$f(y_{it} \mid x_{it}) = \sum_{k=1}^{K} \pi_k f_k(y_{it} \mid x_{it}; \beta_k, \phi_k),$$
where $f_k(\cdot)$ denotes the $k$-th component of the mixture with parameter vector $\beta_k$ and dispersion parameter $\phi_k$. For the unknown component weights, $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K} \pi_k = 1$ have to hold.
Here we consider models with components that differ in their intercepts. Within the framework of finite mixtures one specifies for the $k$-th component of the mixture a model with predictor $\eta_{itk} = \beta_{0k} + x_{it}^T \beta$. For models with normal response the mixture components are given by $f_k = N(\mu_{itk}, \sigma^2)$, where $\mu_{itk} = \beta_{0k} + x_{it}^T \beta$ and the variance is fixed across all components. For models with a binary response the mixture components are $f_k = B(1, \pi_{itk})$, where $\pi_{itk} = P(y_{it} = 1 \mid x_{it})$ and $\text{logit}(\pi_{itk}) = \beta_{0k} + x_{it}^T \beta$. For further details, see GruLei:2007.
Estimation of the mixture model is usually obtained by the EM algorithm, with the number of components being specified beforehand. The optimal number of components is chosen afterwards, for example by information criteria like AIC or BIC. GruLei:2008 provide the R package flexmix, which is used for the computations in our applications and simulations. Regularization and variable selection for mixture models have been considered by khalili07 and stadler10, but not with the objective of clustering units with regard to their effects.
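To illustrate the EM estimation just mentioned, the following is a minimal sketch of EM for a two-component normal mixture that differs only in its intercepts (no covariates, common variance). It is a simplified stand-in for what flexmix fits, written in Python with numpy; all names and the quantile-based initialization are our own assumptions.

```python
import numpy as np

# Simulate a two-component intercept mixture: means -1 and +1, sd 0.5.
rng = np.random.default_rng(3)
z = rng.random(400) < 0.5                      # latent component membership
y = np.where(z, 1.0, -1.0) + rng.normal(0.0, 0.5, 400)

# Initialize: equal weights, component means at the quartiles.
pi_k = 0.5
mu = np.array([np.quantile(y, 0.25), np.quantile(y, 0.75)])
sigma = y.std()
for _ in range(100):                           # EM iterations
    # E-step: posterior responsibility of component 2 for each observation
    d1 = np.exp(-0.5 * ((y - mu[0]) / sigma) ** 2)
    d2 = np.exp(-0.5 * ((y - mu[1]) / sigma) ** 2)
    r = pi_k * d2 / ((1 - pi_k) * d1 + pi_k * d2)
    # M-step: update weight, component intercepts and the common variance
    pi_k = r.mean()
    mu = np.array([np.sum((1 - r) * y) / np.sum(1 - r),
                   np.sum(r * y) / np.sum(r)])
    sigma = np.sqrt(np.mean((1 - r) * (y - mu[0]) ** 2 + r * (y - mu[1]) ** 2))
```

After convergence the component intercepts `mu` recover the true values, the weight `pi_k` is close to 0.5, and the posterior responsibilities `r` provide the cluster assignment of the observations; in practice the number of components $K$ would be varied and chosen by AIC or BIC as described above.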