Composite Robust Estimators for Linear Mixed Models
Abstract
The Classical TukeyHuber Contamination Model (CCM) is a usual framework to describe the mechanism of outliers generation in robust statistics. In a data set with observations and variables, under the CCM, an outlier is a unit, even if only one or few values are corrupted. Classical robust procedures were designed to cope with this setting and the impact of observations were limited whenever necessary. Recently, a different mechanism of outliers generation, namely Independent Contamination Model (ICM), was introduced. In this new setting each cell of the data matrix might be corrupted or not with a probability independent on the status of the other cells. ICM poses new challenge to robust statistics since the percentage of contaminated rows dramatically increase with , often reaching more than 50% . When this situation appears, classical affine equivariant robust procedures do not work since their breakdown point is 50%. For this contamination model we propose a new type of robust methods namely composite robust procedures which are inspired on the idea of composite likelihood, where low dimension likelihood, very often the likelihood of pairs, are aggregate together in order to obtain an approximation of the full likelihood which is more tractable. Our composite robust procedures are build over pairs of observations in order to gain robustness in the independent contamination model. We propose composite S and estimators for linear mixed models. Composite estimators are proved to have an high breakdown point both in the CCM and ICM. A Monte Carlo study shows that our estimators compare favorably with respect to classical Sestimators under the CCM and outperform them under the ICM. One example based on a real data set illustrates the new robust procedure.
Keywords: Composite Sestimators, Composite estimators, Independent Contamination Model, TukeyHuber Contamination Model, Robust estimation.
1 Introduction
The purpose of this paper is to find robust procedures for mixed linear models. This class of models include among others ANOVA models with repeated measures, models with random nested design and models for studying longitudinal data. These models are generally based on the assumption that the data follow a normal distribution and therefore the parameters are estimated using the maximum likelihood principle. See for example, Searle et al. (1992). As is well known, in general, the estimator obtained by maximum likelihood under the assumption that the data have a normal distribution is very sensitive to the presence of a small fraction of outliers in the sample. More than that, just one outlier may have an unbounded effect on this estimator. There are many robust estimators that have been proposed to avoid a large outlier influence. A large list of references of these proposals is available in Heritier et al. (2009). VictoriaFeser and Copt (2006) introduces a very interesting robust Sestimator for mixed linear models based on Mscales which has breakdown point equal to . We can also mention Gill (2000), Jiang and Zhang (2001), Sinha (2004), Copt and Heritier (2006), JacqminGadda et al. (2007), Lachosa et al. (2009), Chervoneva and Vishnyakov (2011) and Koller (2013a) which studied an SMDMestimator. The procedure proposed in the last paper is implemented in the R package robustlmm (Koller, 2013b).
However all these procedures are focused on coping with outliers generated under the Classical (TukeyHuber) Contamination Model (CCM), where some percentage of the units that compose the sample are replaced by outliers. However Alqallaf et al. (2009) introduced another type of contamination (called Independent Contamination Model, ICM) that may occur in multivariate data. Instead of contaminating a percentage of the units that compose the sample, the different cells of each unit may be independently contaminated. In this case, if the dimension of each unit is large, even a small fraction of cell contamination may lead to a large fraction of units with at least one contaminated cell. This type of contamination specially occurs when the different variables that compose each unit are measured from independent laboratories. Alqallaf et al. (2009) showed that for this type of contamination the breakdown point of affine equivariant procedures for multivariate location and covariance matrix tends to zero when the number of variables increases and therefore their degree of robustness is not satisfactory. A similar phenomenon occurs when dealing with mixed linear models. In particular the Sestimator procedure introduced in VictoriaFeser and Copt (2006) loses robustness for high dimensional data with independent contamination.
In this paper we propose a new class of robust estimators for linear mixed models. These estimators may be thought as robust counterparts of the composite likelihood estimators proposed by Lindsay (1988). If a vector of dimension is observed, the composite likelihood estimators are based on the likelihood of all the subvectors of a dimension . The estimators that we propose here are based on scales of the Mahalanobis distances of two dimensional subvectors of . The scale estimators were introduced by Yohai and Zamar (1988) and provides scales estimators which are simultaneously highly robust and highly efficient. We are going to show that these estimators have a robust behavior for both contamination models: the classical contamination model and the independent contamination model. In particular we will show that the breakdown point for the classical contamination model is , while for the independent contamination model is .
In Section 2 the model and the notation are presented. Section 2.1 introduces the Composite Sestimator, while Section 3 defines the Composite estimator. Sections 4 and 5 discuss the breakdown properties and the asymptotic normality of the Composite estimator, Section 6 provides details on the computational algorithm and Section 7 illustrates with a real data set the advantages of the proposed estimator. In Section 8 we perform a Monte Carlo simulation that shows that the proposed procedure has a robust behavior under both contamination models. Section 9 provides some concluding remarks. An Appendix contains details on computational aspects and the proofs of statements reported in previous Sections.
2 Model and Notation
Denote by the multivariate normal distribution of dimension with mean and covariance matrix . Many statistical models for components of variance and longitudinal analysis are of the following form. In the case of fixed covariables is assumed that independent dimensional om vectors in are observed, and has distribution , where
(1)  
are a fixed matrices and is an unknown vector parameter. Moreover,
(2) 
where , are matrices, is the identity, and are unknown parameters, where
In the case of random covariables, that is, when are i.i.d random matrices, it is assumed that
(3) 
This is equivalent to independent of with distribution . However, in Section 5, where we study the asymptotic properties of the proposed estimators, we use a weaker assumption. In fact we only require that be independent of and have elliptical distribution with center and covariance matrix .
This setup covers several statistical models, for instance that of the form
(4) 
where are as before, , , are known design matrices for the random effects, are independent dimensional vectors with distribution , where is the identity and are dimensional error vectors with distribution . Then, in this case we have with , .
2.1 Composite Sestimator
A very interesting class of Sestimators for the model defined by (1) and (2) was proposed by VictoriaFeser and Copt (2006).
Given a dimensional column vector and a vector and matrix the square of the Mahalanobis distance is defined by
Let , where is the set of nonnegative real numbers, satisfying the following properties:
 A1

.
 A2

implies .
 A3

is continuous.
 A4

sup.
 A5

If and , then .
Let be defined by
where is a chisquare distribution with degrees of freedom. Then, given a sample , an Mscale estimator is defined by the value solution of
(5) 
The Sestimator proposed by VictoriaFeser and Copt (2006) is defined by
subject to
These estimators can be thought as a constrained version of the Sestimators for multidimensional location and scatter proposed by Davies (1987).
Given a squared matrix we denote by where is the determinant of the matrix . Note that depends only on and then will be denoted by . It is easy to show that the estimators proposed by VictoriaFeser and Copt (2006) can be also defined by
where the Mscale is defined now by (5). Notice that is defined by
In the classical contamination model a fraction of the vectors are replaced by outliers. VictoriaFeser and Copt (2006) show that for this model the breakdown point of this estimator is . Therefore if , we get .
Alqallaf et al. (2009) consider a different contamination model for multivariate data: the independent contamination model. In this contamination model if we observe a vector each component of has probability of being replaced by an outlier. Therefore the probability that at least one component of be contaminated is , and this number is close to one when is large even if is small.
Alqallaf et al. (2009) showed that the breakdown point for the independent contamination model of Sestimators of multivariate location and scatter tends to when . The same happens with other popular affine equivariance estimators like the minimum volume ellipsoid (Rousseeuw, 1985), Minimum covariance determinant (Rousseeuw, 1985) or the DonohoStahel estimators (Donoho, 1982; Stahel, 1981). The Sestimator proposed by VictoriaFeser and Copt (2006) for model (1)(2) have a similar shortcoming: when , its breakdown point tends to under the independent contamination model. For this reason, hereafter we introduce a new type of estimators namely composite Sestimators and composite estimators.
Given a vector , a matrix and for a couple of indices () we denote and the submatrix
In a similar way, given a matrix we denote by the matrix of dimension built by using the corresponding rows of . We define a pairwise squared Mahalanobis distance and a pairwise scale by
(6) 
where the Mscale is now defined by (5) with given by
(7) 
Thus is defined by
(8) 
Similarly to VictoriaFeser and Copt (2006), we define for the model in (1)(2), the composite Sestimator of and by
(9) 
and the estimator of by
(10) 
One shortcoming of the composite Sestimators are, as occurs with regression Sestimators, that they are not simultaneously highly robust and highly efficient. For this reason in next section we introduce the composite estimators which are defined similarly to the estimators, but replacing the Mscale by a scale.
3 Composite Estimator
In this section we introduce the composite estimator. A scale is defined using two functions and . Given a sample , the function is used to define an Mscale by
(11) 
and the scale by
We will require that and satisfy A1A5. Put , . In Yohai and Zamar (1988) it is shown that to guarantee the Fisher consistency of the estimators of regression, it is required that satisfies the following condition:
 A6

is continuously differentiable and for .
The breakdown point of the scale is the same as that of the scale. Then we are going to set to have breakdown point close to in the classical contamination model.
The estimators are going to be defined as in the previous section by replacing the Mscales by the scales. Then is defined as in (6) and the scale is
Let be the sum of all the scales, i.e.,
(12) 
then the composite estimator of is defined as follows
(13) 
while is obtained as in (10) setting .
In the example of Section 7 and in the Monte Carlo study of Section 8 we took in the following family of functions,
(14) 
where is the family of rho functions introduced by Muler and Yohai (2002) defined by
(15) 
where , , , , and . The functions in this family have shapes close to that of those in the optimal family obtained by Yohai and Zamar (1997). However, they are easier to compute. The reason why we compose the function with the square root is that the functions are applied to the squared Mahalanobis distances. Notice that for any the scale obtained with and is equal to the scale corresponding to and divided by Hence without loss of generality we can take . We found that taking we obtain a good tradeoff between robustness and efficiency, and these are the values that we recommend to use.
It is easy to show that the composite estimators are equivariant for regression transformations of the form where is a vector, affine transformations of the form , where is a non singular matrix or scale transformations of the form , where is a scalar.
4 Breakdown point
Donoho and Huber (1983) introduced the concept of a finite sample breakdown point (FSBDP). For our case, let and be estimators of and . Informally speaking, the FSBDP of is the smallest fraction of outliers that makes the estimator unbounded.
To formalize this, let be a data set of size corresponding to model (1)(2), , , , and (). Let be the set of all the samples with such that . Given estimators and we let
Definition 1
The finite sample breakdown point of for classical contamination (FSBDPCC) at the sample is defined by where and the breakdown point of by where
Let be the set of all the samples such that for each , . Given estimators and we let
Definition 2
The finite sample breakdown point for under independent contamination (FSBDPIC) at the sample is defined by , where , and the breakdown point of by where
The following theorems, whose proofs are discussed in Appendix B, gives a lower bound for the breakdown point of composite estimators under both the classical and the independent contamination models. Before to state the Theorems we need the following notation. Given a sample we define
(16)  
(17)  
(18)  
(19)  
(20) 
Theorem 1
Note that taking , this lower bound is close to for large independently of .
Theorem 2
In this case taking this lower bound is close to for large independently of .
5 Consistency and Asymptotic Normality
In this Section we study the almost sure consistency and asymptotic normality of the composite estimators. We need the following additional assumptions for consistency
 A7

The vector is random and the error is independent of and has an elliptical density of the form
(21) where is non increasing and is strictly decreasing in a neighborhood of 0.
 A8

Let be the distribution of Then for any , we have
 A9

(Identification condition). If for all we have .
An important family of distributions satisfying A7 is the multivariate normal, in this case,
(22) 
Note that when the s satisfy (1), (2) and (3), A7 is satisfied. The following Theorem states the consistency of composite estimators.
Theorem 3
Let , , , be i.i.d random vectors with distribution and call the marginal distribution of the Assume (i) satisfies (A1A5), (ii) satisfies A1A6, (iii) under A7 and A8 holds and (iv) A9. Then, the composite estimators , and satisfy (a.s.), (a.s.). Moreover, if is given by (22) we also have (a.s.).
Note that for the consistency of and is not necessary that be multivariate normal. We do not give a formal proof of this Theorem. In Theorem 5 of the Appendix C we give a rigorous proof of the Fisher consistency of the estimating functional associated to the compose estimator. From this result we derive an heuristic proof of Theorem 3.
The following Theorem states the asymptotic normality of composite estimators. We need the following assumptions
 A10

Let be the distribution of . Then has finite second moments and is non–singular.
 A11

The functions , are twice differentiable.
Theorem 4
Let and be the composite estimator. Consider the same assumptions as in Theorem 3, A10 and A11. Then, we have
where
and and are the gradient and Hessian matrix of respectively.
We do not give the proof of Theorem 4. However, it can be obtained using standard delta method arguments, see for example Theorem 10.9 in Maronna et al. (2006). This Theorem allows to define Wald tests for null hypothesis and confidence intervals for and , but not for . However in most practical applications the interest is centered in and .
6 Computational aspects
The composite estimators are obtained by an iterative algorithm. Hereafter we provide some details. Given starting values and , we perform iterations on steps (A)(C) until convergence. Suppose that we have already computed and then we performed the following steps A, B and C to obtain and :
 (A)

Find scales () by solving equations
 (B)
 (C)

A fixed point equation for can be derived from the estimating equation (27). However we found that to use this equation to update was numerically unstable. We preferred to make this updating by means of a direct minimization of the goal function defined in (12), that is, we define by
For this purpose, in the example of Section 7 and in the Monte Carlo study of Section 8 we used the function optim of the R program.
 (D)

Once the convergence criterion for is reached, the estimator of is obtained by solving the equation (10).
To start the iterative algorithm the initial estimators and are necessary. Let , the th column of , . Then, can be obtained by means of robust regression estimator using as response and , as covariables. In our algorithm we use an MMestimator as implemented in function lmRob of the R package robust using an efficiency of . Once this initial estimator is computed, the residuals can be evaluated. Then, a robust covariance matrix of robust under the ICM model is obtained applying to these residuals the estimator presented in Agostinelli et al. (2014) based on filtering and Sestimators with missing observations. Call to this matrix and let be the vector of the values of the lower triangular side of this matrix including the diagonal elements. In a similar way, let be the column vector of the values of the lower triangular side of the matrix . An initial estimator of could be obtained by means of a regression estimator using as response and as covariables. Since neither nor need to have outliers, it is not necessary to use a robust estimator, in fact we use function lm of R to perform this step.
7 Example
Hereafter we present one application of the introduced method on a real data set. The example is a prospective longitudinal study of children with disorder of neural development. In this data set, outliers are present in the couples rather than in the units and the composite estimator provides a different analysis with respect to maximum likelihood and classical robust procedures.
7.1 Autism
The data used in this example were collected by researchers at the University of Michigan (Anderson et al., 2009) as part of a prospective longitudinal study of children and they are analyzed, among others, also in West et al. (2007). The children were divided into three diagnostic groups when they were years old: autism, pervasive developmental disorder (PDD), and nonspectrum children. The study was designed to collect information on each child at ages , , , , and years, although not all children were measured at each age. One of the study objectives was to assess the relative influence of the initial diagnostic category (autism or PDD), language proficiency at age , and other covariates on the developmental trajectories of the socialization of these children. Study participants were children who had consecutive referrals to one of two autism clinics before the age of years. Social development was assessed at each age using the Vineland Adaptive Behavior Interview survey form, a parentreported measure of socialization. The dependent variable, vsae (Vineland Socialization Age Equivalent), was a combined score that included assessments of interpersonal relationships, play/leisure time activities, and coping skills. Initial language development was assessed using the Sequenced Inventory of Communication Development (SICD) scale; children were placed into one of three groups (sicdegp, , where is the indicator function of the group) based on their initial SICD scores on the expressive language subscale at age . We consider the subset of children for which all measurements are available. We analyze this data using a regression model with random coefficients where vsae is explained by intercept, age, age and sicdegp as a factor variable plus interaction among the age related variables and sicdegp. Hereafter, the variable age is shifted by . Let be the value of the th vsae for the th ages value , then it is assumed that for we have
where are i.i.d. random coefficients with mean and covariance matrix
are fixed coefficients and the are i.i.d. random errors independent of the random coefficients with zero mean and variance . Then, the model could be rewritten in term of (1) and (2) with , , and , ,
while the variance and covariance structure is as follows. Let, a vector of ones, , which corresponds to age and which corresponds to . Then, we have , , , , and . is the scale of the error term, , , , , and .
Method  Int.  

Max. Lik.  12.847  6.851  0.062  5.245  2.154  6.345  4.512  0.133  0.236 
[0.000]  [0.000]  [0.579]  [0.041]  [0.325]  [0.000]  [0.000]  [0.446]  [0.121]  
Composite  12.143  6.308  0.089  5.214  4.209  5.361  3.852  0.082  0.061 
[0.000]  [0.000]  [0.329]  [0.000]  [0.012]  [0.000]  [0.001]  [0.578]  [0.677]  
S Rocke  10.934  7.162  0.107  4.457  0.108  5.769  4.995  0.094  0.419 
[0.000]  [0.001]  [0.666]  [0.049]  [0.957]  [0.002]  [0.000]  [0.688]  [0.011]  
SMDM  12.346  6.020  0.001  5.192  2.173  5.190  3.870  0.046  0.151 
[0.000]  [0.000]  [0.992]  [0.010]  [0.213]  [0.000]  [0.000]  [0.781]  [0.300] 
Table 1 report the estimators and the inference for the fixed term parameters using different methods, while Table 2 reports the estimators of the random effect terms. ML, S and SMDM provide similar results, while differences are present with the composite method. Main differences are on the estimation of the random effects terms, both in size (error variance component) and shape (correlation components). Composite assign part of the total variance to the random components while the other methods assign it to the error term. In fact, variances estimated by composite are in general larger than that estimated by the other methods; composite suggests negative correlation between intercept and age, while ML, S and SMDM suggest positive correlation. Composite provides small estimates compared to the other methods for the error variance. These discrepancies reflects mainly on the inference for the fixed term coefficients where the variable sicdegp is significant using composite but is not using ML, S and SMDM procedures. Interactions between age and sicdegp is highly non significant using composite and SMDM while it is somewhat significant using S.
To investigate more the reasons of differences between composite robust procedure and classic robust procedure results, we investigate cell, couple and row outliers. For a given dimension we define as dimension outliers those dimension observations such that the corresponding squared Mahalanobis distance is greater than a quantile order of a chisquare distribution with degree of freedom. In particular we call cell, couple and row outliers the dimension, dimension and dimension outliers respectively. Composite procedure identifies couple outliers out of couples () at . The affected rows, with at least one couple outliers, are out of . This means that the classic S and SMDM procedures have to deal with a data set with a level of contamination about . We also run the analysis using the composite S estimator, not reported here, the results are similar to those obtained by the composite estimator.
Method  

Maximum Likelihood  2.643  2.328  0.102  0.775  0.429  0.038  51.360 
Composite  9.362  9.670  0.052  4.019  0.002  0.327  5.164 
S Rocke  9.467  3.373  0.222  2.170  1.062  0.349  22.209 
SMDM  5.745  0.092  0.115  0.727  0.813  0.103  25.385 
8 Monte Carlo simulations
In this section we describe the results of a Monte Carlo study with the aim of illustrating the performance of the new procedure in the classical contamination Model (CCM) and in the independent contamination model (ICM). We consider a 2way crossed classification with interaction linear mixed model
(23) 
where , , and . Here, we set , and which leads to .