Reliable inference for complex models by discriminative composite likelihood estimation
Abstract
Composite likelihood estimation has an important role in the analysis of multivariate data for which the full likelihood function is intractable. An important issue in composite likelihood inference is the choice of the weights associated with lower-dimensional data subsets, since the presence of incompatible submodels can deteriorate the accuracy of the resulting estimator. In this paper, we introduce a new approach for simultaneous parameter estimation by tilting, or reweighting, each sub-likelihood component, called discriminative composite likelihood estimation (DMcLE). The data-adaptive weights maximize the composite likelihood function, subject to moving a given distance from uniform weights; the resulting weights can then be used to rank lower-dimensional likelihoods in terms of their influence in the composite likelihood function. Our analytical findings and numerical examples support the stability of the resulting estimator compared to estimators constructed using standard composition strategies based on uniform weights. The properties of the new method are illustrated through simulated data and real spatial data on multivariate precipitation extremes.
Keywords: Composite likelihood estimation; Model selection; Exponential tilting; Stability; Robustness
1 Introduction
While likelihood-based inference is central to modern statistics, for many multivariate problems the full likelihood function is impossible to specify or its evaluation involves a prohibitive computational cost. These limitations have motivated the development of composite likelihood approaches, which avoid the full likelihood by compounding a set of low-dimensional likelihoods into a surrogate criterion function. Composite likelihood inference has proved useful in a number of fields, including geostatistics, analysis of spatial extremes, statistical genetics, and longitudinal data analysis. See Varin&al11 for a comprehensive survey of composite likelihood theory and applications; larribe2011composite review several applications in genetics.
Let $Y = (Y_1, \dots, Y_d)^\top$ be a random vector and $f(y; \theta)$ be the assumed density model for $Y$, indexed by the parameter $\theta \in \Theta \subseteq \mathbb{R}^q$. Suppose that the full likelihood function, $L(\theta; y) \propto f(y; \theta)$, is difficult to specify or compute, but we can specify low-dimensional distributions with one, two, or more variables. Specifically, let $S_1, \dots, S_M$ be a set of marginal or conditional low-dimensional variables constructed from $Y$ with associated likelihoods $L_m(\theta; s_m) \propto f_m(s_m; \theta)$, where $f_m$, $m = 1, \dots, M$, denotes the $m$th low-dimensional density model for $S_m$. The low-dimensional variables are user-defined and could be constructed by taking marginal models, pairs like $S_m = (Y_j, Y_k)$, or conditional variables like $Y_j \mid Y_k$. The overall structure of such lower-dimensional models is sometimes referred to as the composite likelihood design (Lindsay&al11), and its choice is often driven by computational convenience. For example, if $Y$ follows a $d$-variate normal distribution with covariance matrix $\Sigma$, the full likelihood is hard to compute when $d$ is large due to the inversion of $\Sigma$, which involves $O(d^3)$ operations. In contrast, using submodels for variable pairs $(Y_j, Y_k)$, $j < k$, can reduce the computational burden, since it involves simply inverting $2 \times 2$ partial covariance matrices.
Following Lindsay88, we define the composite likelihood function by
$$L_c(\theta; y) = \prod_{m=1}^{M} f_m(s_m; \theta)^{w_m}, \qquad (1)$$
where $w_1, \dots, w_M$ are nonnegative weights, possibly depending on the data. A well-known issue in composite likelihood estimation is the selection of the weights, as their specification plays a crucial role in determining both the efficiency and the reliability of the resulting composite likelihood estimator (Lindsay88; Joe&Lee09; Cox&Reid04; Varin&al11; Xu&Reid11). Despite the importance of the weights, many statistical and computational challenges still hinder their selection (Lindsay&al11).
This paper is concerned with the stability of composite likelihood selection. Stability occurs when the maximizer of the overall composite likelihood function is not overly affected by the existence of locally optimal parameters that work only for a relatively small portion of the data subsets. The presence of such local optima arises from incompatibility between the assumed full-likelihood model and the lower-dimensional models. For example, suppose that the true distribution of $Y$ is a $d$-variate normal distribution with zero mean vector, unit variances and correlation $\rho$ for most variable pairs, while the true correlation is $\rho' \neq \rho$ for some small fraction of the pairs. If one mistakenly assumes that all correlations are equal, both the maximum likelihood estimator and the pairwise likelihood estimator with uniform weights, $w_m = 1/M$, $m = 1, \dots, M$, are inconsistent for $\rho$ in this situation. Other examples of incompatible models are given in Xu&Reid11. In applications, model compatibility is hard to detect, especially when $M$ is large, so incompatible submodels are often included in the composite likelihood function with detrimental effects on the accuracy of the global composite likelihood estimator.
Motivated by the above issues, we introduce the discriminative maximum composite likelihood estimator (DMcLE), a new methodology for reliable likelihood composition and simultaneous parameter estimation. The new approach computes smooth weights by maximizing the composite likelihood function for a sample of observations, subject to moving a given distance, say $\delta \geq 0$, from uniform weights. The DMcLE is regarded as a generalization of the traditional McLE. If $\delta = 0$, the DMcLE is exactly the common composite likelihood estimator with uniform weights. When $\delta > 0$, incompatible submodels are down-weighted, resulting in estimators with bounded worst-case bias. Our analytical findings and simulations support the validity of the proposed method compared to classic composite likelihood estimators with uniform weights. The new framework is illustrated through estimation of max-stable models, which have proved useful for describing extreme environmental occurrences such as hurricanes, floods and storms (davison2012).
The proposed procedure is useful in two respects. First, the resulting weights are a valuable diagnostic tool for composite likelihood selection. Small weights signal suspicious models, which can then be examined further, leading to improved assumptions. Conversely, the method can be employed to identify influential data subsets for many types of composite likelihood estimators. Second, the estimates obtained by the method are trustworthy at least for the bulk of the data subsets (those compatible with the model assumptions). Clearly, assigning the same weight to all the models, including the ones in strong disagreement with the majority of the data, would lead to biased global estimates, which can be an untrustworthy representation of the entire dataset.
The proposed method is a type of data tilting, a general technique which involves replacing uniform weights with more general weights. To our knowledge, this is the first work that introduces tilting for lower-dimensional data subsets within the composite likelihood framework. In robust statistics, tilting has typically been employed to robustify parametric estimating equations, or to obtain a natural ordering of the data in terms of influence (Choi00). Tilting has also been used to obtain measures of outlyingness and influence of data subsets; e.g., see hall99; critchley04; lazar05; camponovo12. genton2014 use a tilting approach in the context of multivariate functional data to rank the influence of data subsets.
The rest of the paper is organized as follows. In Section 2, we describe the new methodology for simultaneous likelihood selection and estimation; we give an efficient algorithm and introduce the compatibility plot, a new graphical tool to assess the adequacy of the submodels. In Section 3, we study the properties of the new estimator and give its limit distribution. In Section 4, we provide simulated examples in finite samples confirming our theoretical findings. In Section 5, we illustrate the new procedure on the Tasmanian rainfall spatial data on multivariate precipitation extremes. In Section 6, we conclude and discuss possible extensions. Proofs of technical results are deferred to a separate appendix.
2 Methodology
2.1 Composite likelihood selection
Given $n$ independent observations from the true distribution of $Y$, we construct the set of marginal or conditional low-dimensional observations $s_m^{(i)}$, $m = 1, \dots, M$, $i = 1, \dots, n$, and define the weighted composite log-likelihood function
$$\ell_w(\theta) = \sum_{m=1}^{M} w_m\, \ell_m(\theta), \qquad (2)$$
where $w_1, \dots, w_M$ are constants playing the role of importance weights. The weight $w_m$ characterizes the impact of the $m$th sub-likelihood,
$$\ell_m(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log f_m\big(s_m^{(i)}; \theta\big),$$
on the overall composite likelihood function $\ell_w(\theta)$. We define incompatibility by assuming that there is a global parameter, say $\theta^*$, which suits most submodels. Specifically, we assume partial models $f_m(s_m; \theta_m)$, where $\theta_m \neq \theta^*$ for the incompatible models and $\theta_m = \theta^*$ for the compatible models.
Next, we introduce the DMcLE procedure for simultaneous discrimination of discordant models and parameter estimation. We propose to select the weight $w_m$ to be small when, for a value of $\theta$ that is appropriate for the majority of the data subsets, the sub-likelihood function for the $m$th data subset, $\ell_m(\theta)$, is small. To this end, the weight vector $w = (w_1, \dots, w_M)^\top$ is regarded as a discrete distribution on $M$ points, and the discrepancy between $w$ and the uniform distribution $(1/M, \dots, 1/M)^\top$ is measured by the Kullback-Leibler divergence
$$D(w) = \sum_{m=1}^{M} w_m \log(M\, w_m), \qquad (3)$$
where, by convention, $0 \log 0 = 0$. For a given parameter $\theta$, data-dependent weights are then chosen by solving the following program:
$$\widehat{w}(\theta) = \operatorname*{arg\,max}_{w}\ \sum_{m=1}^{M} w_m\, \ell_m(\theta) \quad \text{subject to} \quad D(w) = \delta, \quad \sum_{m=1}^{M} w_m = 1, \quad w_m \geq 0. \qquad (4)$$
Finally, the DMcLE, denoted by $\widehat{\theta}_\delta$, is defined as the maximizer of the resulting composite log-likelihood function,
$$\widehat{\theta}_\delta = \operatorname*{arg\,max}_{\theta \in \Theta}\ \sum_{m=1}^{M} \widehat{w}_m(\theta)\, \ell_m(\theta),$$
where $\widehat{w}(\theta) = (\widehat{w}_1(\theta), \dots, \widehat{w}_M(\theta))^\top$ is the vector of data-dependent weights. Equivalently, $\widehat{\theta}_\delta$ can be obtained by computing the profiled estimator $\widehat{\theta}(w)$, which maximizes the composite log-likelihood for a given weight vector $w$, and then solving (4) with $\theta = \widehat{\theta}(w)$.
The composite likelihood estimator entails moving away from uniform weights in the direction that emphasizes the contribution of the most useful data subsets. If $\delta > 0$, the relative importance of the sub-likelihoods that are incompatible with the data is diminished in the composite likelihood equation (2). The special case $\delta = 0$ corresponds to the composite likelihood estimator with uniform weights $w_m = 1/M$; all the data subsets are then regarded as equally compatible. Other divergence measures may be considered in place of the Kullback-Leibler divergence (3), which could be useful in particular estimation setups, although these are not pursued in this paper. The Kullback-Leibler divergence, however, has the advantage of automatically giving nonnegative weights, while allowing one or more weights to approach zero, without imposing additional constraints or relying on an algorithm to ensure this property. For example, when $M$ is very large it could be useful to modify (3) to promote sparsity, i.e. to select a relatively large number of weights that are exactly zero.
2.2 Dataadaptive weights and parameter estimation
The program in (4) is solved by maximizing the Lagrangian function
$$\mathcal{L}(w, \lambda_1, \lambda_2) = \sum_{m=1}^{M} w_m\, \ell_m(\theta) - \lambda_1 \Big\{ \sum_{m=1}^{M} w_m \log(M\, w_m) - \delta \Big\} - \lambda_2 \Big( \sum_{m=1}^{M} w_m - 1 \Big), \qquad (5)$$
where $\lambda_1$ and $\lambda_2$ are Lagrange multipliers. It is easy to see that the solution to (5) has the form
$$w_m(\theta) = \kappa\, \exp\{\xi\, \ell_m(\theta)\}, \qquad m = 1, \dots, M, \qquad (6)$$
where $\kappa > 0$ and $\xi \geq 0$ depend on the Lagrange multipliers $\lambda_1$ and $\lambda_2$. From the two constraints in (4), $\xi$ is obtained by solving
$$\sum_{m=1}^{M} w_m(\theta) \log\{ M\, w_m(\theta) \} = \delta, \qquad (7)$$
and $\kappa = 1 / \sum_{m=1}^{M} \exp\{\xi\, \ell_m(\theta)\}$. The DMcLE is then computed by maximizing the composite log-likelihood evaluated at the weights $w_m(\theta)$.
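To make the weight computation concrete, the following sketch solves for the tilted weights given a vector of sub-log-likelihood values and a target divergence, using bisection on the tilt and exploiting the fact that the divergence from uniform weights increases monotonically with the tilt. This is our own illustration of the exponential-tilting solution, not the authors' implementation; the function name and numerical guards are assumptions.

```python
import numpy as np

def tilted_weights(loglik, delta):
    """Exponential-tilting weights w_m proportional to exp(xi * loglik_m),
    with the tilt xi >= 0 chosen so the KL divergence from uniform weights
    equals delta (illustrative sketch)."""
    M = len(loglik)
    if delta == 0.0:
        return np.full(M, 1.0 / M)          # delta = 0 recovers uniform weights

    def kl_and_w(xi):
        z = np.exp(xi * (loglik - loglik.max()))   # shift for numerical stability
        w = z / z.sum()
        return np.sum(w * np.log(M * w + 1e-300)), w

    # The divergence increases monotonically in xi (toward log M), so bisect.
    lo, hi = 0.0, 1.0
    while kl_and_w(hi)[0] < delta and hi < 1e8:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if kl_and_w(mid)[0] < delta:
            lo = mid
        else:
            hi = mid
    return kl_and_w(0.5 * (lo + hi))[1]
```

Sub-likelihoods with larger values receive larger weights, and the uniform solution is recovered exactly at $\delta = 0$.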
Lemma 1 in the appendix shows that computing the DMcLE, $\widehat{\theta}_\delta$, is equivalent to solving the estimating equations
$$\sum_{m=1}^{M} w_m(\theta)\, u_m(\theta) = 0, \qquad (8)$$
where $u_m(\theta) = \partial \ell_m(\theta) / \partial \theta$ denotes the partial score function corresponding to the $m$th data subset. Thus, (8) is a weighted estimating equation involving the partial scores, with weights depending on the data and on $\delta$. A small weight $w_m(\theta)$ implies a modest contribution of the $m$th score, $u_m(\theta)$, to the overall composite likelihood equation. The constant $\delta$ is regarded as a stability parameter which can be used to control the relative impact of the incompatible lower-dimensional likelihoods. In particular, if $\delta$ is large, incompatible models receive a low weight, with a relatively small effect on the final parameter estimates. If $\delta = 0$, all the submodels are treated equally in terms of the impact of the corresponding sub-likelihoods.
Equation (8) highlights the resemblance to the estimating functions of classic robust M-estimators, whose main aim is to reduce the influence of outliers in the full likelihood function. Indeed, the approach followed here coincides with the robust estimation approach of Choi00 in the particular case where the components of $Y$ are independent and all the submodels are identical to the full likelihood model. In general, however, the DMcLE is very different from Choi00 and other similar robust methods. The main difference is that the weights in (6) refer to the variables $S_1, \dots, S_M$, which are constructed by taking subsets of the original vector $Y$ and are possibly correlated; in robust M-estimation, weights refer to independent observations on the original vector. Thus, in our approach all observations corresponding to the $m$th data subset receive the same weight, $w_m$. This reflects our need to control for the incompatibility of a portion of the submodels, rather than to reduce the effect of outlying observations with respect to the full model.
2.3 Computing $\widehat{\theta}_\delta$
The form of equation (8) suggests a simple algorithm that simultaneously computes weights and parameter estimates. At each step, we update the weights based on the previous parameter estimate and then compute a fresh parameter estimate using the new weights. Starting from an initial estimate, $\widehat{\theta}^{(0)}$, we compute
$$w_m^{(t+1)} = \frac{\exp\{\xi^{(t)}\, \ell_m(\widehat{\theta}^{(t)})\}}{\sum_{k=1}^{M} \exp\{\xi^{(t)}\, \ell_k(\widehat{\theta}^{(t)})\}}, \qquad \widehat{\theta}^{(t+1)} = \operatorname*{arg\,max}_{\theta \in \Theta}\ \sum_{m=1}^{M} w_m^{(t+1)}\, \ell_m(\theta), \qquad t = 0, 1, \dots, \qquad (9)$$
with $\xi^{(t)}$ solving the divergence constraint (7) at $\theta = \widehat{\theta}^{(t)}$, iterating
until convergence is reached. We consider a relative convergence criterion on the weights and stop iterating when the largest relative change in the weights falls below some tolerance level $\epsilon > 0$. A practical advantage is that (9) is easy to implement when a basic composite likelihood estimator with fixed weights is already available.
In our numerical studies, the algorithm performed satisfactorily. In all our examples, convergence was reached in a few iterations, and we noted that the computational cost does not increase much as $M$ grows. This behavior makes the proposed algorithm well-suited to high-dimensional problems with a large number of sub-likelihoods and is shared by analogous iteratively reweighted algorithms for M-estimation with well-established theory (e.g., see arslan2004convergence). Although we do not offer theoretical results on the general behaviour of our algorithm, convergence results may be derived following an argument analogous to basu2004iteratively in the context of iteratively reweighted procedures for minimum divergence estimators.
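The iteration above can be illustrated on a toy problem. The sketch below is our own illustration, not the paper's code: it uses Gaussian sub-likelihoods with known unit variances and, for brevity, a fixed tilt rather than one solved from the divergence constraint at each step. Eight compatible sub-samples centered at 0 coexist with two incompatible ones centered at 5.

```python
import numpy as np

rng = np.random.default_rng(0)
# Eight compatible sub-samples centered at 0, two incompatible ones centered at 5.
data = [rng.normal(0.0, 1.0, 50) for _ in range(8)] + \
       [rng.normal(5.0, 1.0, 50) for _ in range(2)]

def avg_loglik(x, theta):
    # Average Gaussian log-likelihood with unit variance (additive constants dropped).
    return -0.5 * np.mean((x - theta) ** 2)

xi = 5.0                                   # fixed tilt (illustrative choice)
theta = float(np.mean(np.concatenate(data)))   # start from the pooled estimate
for _ in range(50):
    ll = np.array([avg_loglik(x, theta) for x in data])
    z = np.exp(xi * (ll - ll.max()))
    w = z / z.sum()                        # tilted weights, as in (6)
    theta_new = float(np.sum(w * np.array([np.mean(x) for x in data])))
    if abs(theta_new - theta) < 1e-10:
        break
    theta = theta_new
```

After a few iterations the two shifted sub-samples receive essentially zero weight, and the estimate settles near the mean of the compatible majority rather than the biased pooled average.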
2.4 Compatibility profile plots (CPPs)
Let $m_1, \dots, m_M$ be the arrangement of the sub-likelihood indices implied by the fitted weights, so that $\widehat{w}_{m_1} \geq \cdots \geq \widehat{w}_{m_M}$, where $\widehat{w}_m$, $m = 1, \dots, M$, are the data-dependent weights computed by the algorithm in Section 2.3. The ordering induces an importance ranking for the submodels in terms of their compatibility with the true distribution generating the data. Based on this ranking, a graphical tool is introduced, called the compatibility profile plot (CPP). The CPP traces the fitted weights, $\widehat{w}_m$, $m = 1, \dots, M$, as $\delta$ moves away from zero and can be used to inspect the compatibility of individual sub-likelihoods. For instance, a sharp decrease of a few weights below the uniform level $1/M$ suggests that the corresponding sub-likelihoods are likely to be misspecified and that a different model should be used for those components. The weights often exhibit diverging trajectories (see for example Figure 2), which may be used to determine a suitable value for the parameter $\delta$. For example, the plots help us pick a value of $\delta$ corresponding to a sufficient degree of separation between compatible and incompatible models. Eventually, $\delta$ reaches an equilibrium point where the trajectories are maximally separated. After the equilibrium, the weights cluster together again as all but one tend to 0 and a single weight converges to 1.
2.5 Selection of $\delta$
The stability parameter $\delta$ tunes the extent to which we down-weight incompatible models, so its selection is important to discuss. One approach is to select the tuning constant closest to 0 (i.e., closest to uniform weights) such that the point estimates of the parameters of interest are sufficiently stable. If all the sub-likelihoods are compatible, $\delta = 0$ already gives stable estimates, and moving away from it is expected to have little impact on the estimates. In the presence of incompatible sub-likelihoods, values of $\delta$ close to 0 tend to give unstable estimates in terms of bias and variance, so we move away from 0 until stability is reached. For example, in Figure 2 (right), the correlation estimator is far from the true correlation value of 0.5 when $\delta = 0$; as $\delta$ moves away from zero, the estimate changes rapidly until stability is reached. The above discussion suggests a simple data-driven procedure to select $\delta$:

1. Define an equally spaced grid of candidate values $0 = \delta_1 < \delta_2 < \cdots < \delta_K$.

2. Starting from $\delta_1 = 0$, compute the corresponding point estimates $\widehat{\theta}_{\delta_k}$, $k = 1, \dots, K$.

3. Select the optimal value $\delta^* = \delta_k$ using the stopping rule $\|\widehat{\theta}_{\delta_k} - \widehat{\theta}_{\delta_{k-1}}\| < \tau$, where $\tau > 0$ is some threshold value.
By definition, $\delta^*$ is the value closest to 0 such that the variation of the point estimates is smaller than some acceptable threshold. Based on our simulations, an equally spaced grid between 0 and 1 typically works well, and choices of $\delta$ not too far from 0 already give considerable stability. If only a very small portion of the data subsets are incompatible, it may be useful to consider refinements of the grid near 0.
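The stopping rule can be sketched as a small helper. This is illustrative only; `estimate` is a hypothetical placeholder for a routine returning the DMcLE point estimate at a given $\delta$ (a scalar here, for simplicity).

```python
def select_delta(estimate, grid, threshold):
    """Return the smallest grid value after which consecutive point estimates
    change by less than `threshold` (the stopping rule described above).
    `estimate` maps a delta value to the corresponding point estimate."""
    values = [estimate(d) for d in grid]
    for k in range(1, len(grid)):
        if abs(values[k] - values[k - 1]) < threshold:
            return grid[k]
    return grid[-1]   # no stabilization detected: return the largest delta
```

For a vector-valued parameter, `abs` would be replaced by a norm of the difference between consecutive estimates.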
3 Properties
3.1 Large sample behavior of $\widehat{\theta}_\delta$ and standard errors
To emphasize reliability aspects, it is helpful to distinguish between the true process generating the data and the parametric model used for inference. Assume that $Y$ has true distribution $G$, while the true distribution of the subvector $S_m$ is denoted by $G_m$. The density function of $G_m$ with respect to the dominating measure is denoted by $g_m$. Let $\{F_\theta,\ \theta \in \Theta\}$ be a parametric family of distributions for $Y$ and let $f_m(\cdot\,; \theta)$ denote the corresponding densities of the submodels. We assume that the model is identifiable, i.e. for $\theta \neq \theta'$ the submodel densities $f_m(\cdot\,; \theta)$ and $f_m(\cdot\,; \theta')$ differ for some $m$.
The composite likelihood function (2) is correctly specified if there is a parameter $\theta^* \in \Theta$ such that $g_m = f_m(\cdot\,; \theta^*)$ for all $m = 1, \dots, M$; when no such $\theta^*$ exists, then (2) is misspecified, meaning that it contains incompatible models. The optimal parameter, $\theta^*_\delta$, is defined as the minimizer of the weighted composite Kullback-Leibler divergence
$$D_w(\theta) = \sum_{m=1}^{M} w_m^*\, H_m(\theta), \qquad (10)$$
where
$$H_m(\theta) = -\,\mathbb{E}_{G} \log f_m(S_m; \theta)$$
is the cross-entropy between the true distribution $G_m$ and the parametric submodel $f_m(\cdot\,; \theta)$, and $w_m^*$ ($m = 1, \dots, M$) here denote asymptotic weights computed as in Section 2.3 with the sub-log-likelihoods replaced by their population counterparts. In the remainder of the paper, we assume that $\theta^*_\delta$ is the unique minimizer of (10).
Next, consistency and asymptotic normality of $\widehat{\theta}_\delta$ are established. We note that standard M-estimation theory cannot be applied directly to equation (8), because the weights in (4) depend on random averages; thus, some additional care is needed to characterize the asymptotic behavior of $\widehat{\theta}_\delta$.
Proposition 3.1
Assume: (C1) $\theta^*_\delta$ is an interior point of $\Theta$; (C2) each sub-log-likelihood converges uniformly in probability to its population counterpart as $n \to \infty$; and (C3) the population counterparts are continuous in $\theta$. Then the maximum composite likelihood estimator $\widehat{\theta}_\delta$ converges in probability to $\theta^*_\delta$ defined in (10).
A direct consequence is Fisher-consistency of $\widehat{\theta}_\delta$, i.e. under correct composite likelihood specification the optimal target value is $\theta^*_\delta = \theta^*$ for all $\delta \geq 0$. This can be seen by taking the expectation of equation (8) at $\theta = \theta^*$:
$$\sum_{m=1}^{M} w_m^*\, \mathbb{E}_G\, u_m(\theta^*) = 0, \qquad (11)$$
since $\mathbb{E}_G\, u_m(\theta) = 0$ if and only if $\theta = \theta^*$, for all $m = 1, \dots, M$. This means that the estimating equation (8) is solved by $\theta^*$ regardless of the choice of $\delta$, since changing the latter affects only the weights, but not the partial scores. Section 3.2 discusses the bias in the presence of incompatible sub-likelihoods.
Proposition 3.2
Under conditions (C1)–(C3) in Proposition 3.1 and additional regularity conditions given in the Appendix, $\sqrt{n}\,(\widehat{\theta}_\delta - \theta^*_\delta)$ converges in distribution to the $q$-variate normal $N_q(0, H^{-1} J H^{-1})$ as $n \to \infty$, where $H$ and $J$ are the matrices
$$H = -\sum_{m=1}^{M} w_m^*\, \mathbb{E}\Big\{ \frac{\partial\, u_m(\theta^*_\delta)}{\partial \theta^\top} \Big\}, \qquad J = \operatorname{var}\Big\{ \sum_{m=1}^{M} w_m^*\, u_m(\theta^*_\delta) \Big\}, \qquad (12)$$
where $w_m^*$ denotes the limiting weight of the $m$th sub-likelihood ($m = 1, \dots, M$), and expectations are taken with respect to the true distribution $G$.
The random weights, $\widehat{w}_m$, play a crucial role in determining the asymptotic behavior of $\widehat{\theta}_\delta$. This feature is also found in model averaging, where parameter estimators obtained from different models, say $\widehat{\theta}_1, \dots, \widehat{\theta}_M$, are combined into a global estimator through random weights (claeskens08, Chapter 7). The connection with model averaging is further highlighted by the normal location example in Section 4. Here the random weights converge in probability to constants; thus, the asymptotic variance takes the usual sandwich form, and $H$ and $J$ can be consistently estimated analogously to Varin&al11, with the weights computed as in Section 2.3. Resampling techniques such as the jackknife and bootstrap may also be used.
3.2 Bias under incompatible models
In this section, we examine the first-order properties of our estimator in the presence of incompatible models. For clarity of exposition, we consider the case of a scalar parameter; analogous arguments can easily be extended to the general case. To represent incompatibility, we assume heterogeneous parameters for a subset of the submodels: the incompatible submodels, $M_0$ in number, follow the drift model $\theta_m = \theta^* + b$, while $\theta_m = \theta^*$ for the remaining ones. In addition, we assume that the Fisher informations of the submodels are bounded away from zero and infinity. A first-order Taylor expansion of the weights and partial scores in (8) about $\theta^*$, under suitable regularity conditions, yields an approximation for the bias of $\widehat{\theta}_\delta$ in terms of the drift $b$, the tilt $\xi$, and the Fisher informations. Therefore, an approximate upper bound to the bias is
$$\sup_{b} \ \frac{M_0\, |b|\, e^{-\xi c b^2}}{(M - M_0) + M_0\, e^{-\xi c b^2}}, \qquad (13)$$
where $c > 0$ is a constant depending on the Fisher informations,
which is regarded as the worst-case bias under incompatible models. Clearly, when $\delta = 0$ (equivalently, $\xi = 0$), the worst-case bias grows linearly in $b$. When $\delta > 0$, the worst-case bias is bounded and the estimator achieves bias control. In particular, if $b = 0$ all the models are compatible and the bias vanishes. If $b$ is large, since the denominator in (13) dominates the numerator, the maximal bias decreases quickly to 0.
A second-order Taylor expansion of the weights and partial scores in (8) about $\theta^*$ (not shown here) can be used to derive an upper bound for the mean squared error. Analogously to (13), when $\delta = 0$ (equivalently, $\xi = 0$), the worst-case mean squared error grows quadratically in $b$. When $\delta > 0$, the maximal mean squared error is bounded, meaning that the estimator achieves both bias and variance control. This theoretical understanding is confirmed by the numerical simulations in Section 4.
As an illustration, Figure 1 shows the maximal bias for the multivariate normal model in which a fraction of the variable pairs drifts away from the common correlation. Clearly, the classic estimator with equal weights ($\delta = 0$) is very risky for this model, since the maximal bias can be potentially very large. This undesirable behavior can be easily avoided by setting $\delta > 0$: if the degree of incompatibility is strong (large drift), the worst-case bias approaches zero, while for intermediate degrees of incompatibility the bias remains bounded and can be controlled by tuning $\delta$.
4 Examples
4.1 Example 1: Estimation of correlation
Suppose the random vector $Y = (Y_1, \dots, Y_d)^\top$ follows a multivariate normal distribution with zero mean vector, unit variances, and covariances $\operatorname{cov}(Y_j, Y_k) = \rho'$ for the pairs $(j, k)$ in some small subset of indices, and $\operatorname{cov}(Y_j, Y_k) = \rho \neq \rho'$ otherwise. If we model $Y$ as a multivariate normal with zero mean vector and all correlations equal, then the model is clearly misspecified and the maximum likelihood estimator is not consistent for $\rho$.
When constructing a composite likelihood function, we only need pairwise lower-dimensional likelihoods, since the marginal univariate sub-likelihoods do not contain information on the correlation. Therefore, the correlation estimator is obtained as described in Section 2 by maximizing the pairwise likelihood
$$\ell_w(\rho) = \sum_{j < k} w_{jk}\, \ell_{jk}(\rho), \qquad \ell_{jk}(\rho) = \frac{1}{n} \sum_{i=1}^{n} \log f\big(y_j^{(i)}, y_k^{(i)}; \rho\big), \qquad (14)$$
where the sum runs over all $M = d(d-1)/2$ variable pairs. Note that (14) combines bivariate normal models with zero mean and covariance given by $2 \times 2$ matrices with diagonal elements equal to 1 and off-diagonal elements equal to $\rho$. Therefore, the resulting estimator will be consistent for $\rho$ only if the incompatible pairs receive negligible weight.
In Table 1, we show the finite-sample bias and variance of the DMcLE for different values of $\delta$. As a comparison, we report results for the MLE; the usual McLE with uniform weights corresponds to the column with $\delta = 0$. When all the sub-likelihoods are compatible, not surprisingly the MLE has the best performance in terms of variance. For the DMcLE, however, both bias and variance do not increase much as long as $\delta$ is not too far from 0. In the presence of incompatible submodels, the bias of the MLE and of the DMcLE with uniform weights ($\delta = 0$) is very large compared to the DMcLE with $\delta > 0$, whose bias becomes negligible once $\delta$ moves away from 0. In addition to bias control, we note that our procedure also achieves variance reduction when incompatible submodels are present and $\delta$ is small. These results suggest that setting $\delta$ slightly above zero (e.g., 0.1, 0.2, or 0.3) already gives substantial stability and reduces the mean squared error of the corresponding estimator.
MLE  DMcLE(δ)

δ =  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1
Bias  
1  0.00  0.00  0.06  0.14  0.21  0.27  0.37  0.43  0.52  0.62  0.68  0.78  
3  2.43  1.75  0.12  0.00  0.05  0.12  0.20  0.26  0.36  0.43  0.50  0.58  
5  6.32  4.53  0.46  0.00  0.05  0.11  0.19  0.27  0.34  0.42  0.51  0.57  
Var  
1  0.10  0.12  0.12  0.12  0.13  0.14  0.13  0.13  0.14  0.14  0.14  0.15  
3  0.18  0.20  0.14  0.13  0.13  0.14  0.13  0.14  0.14  0.15  0.16  0.16  
5  0.18  0.22  0.15  0.13  0.13  0.14  0.14  0.14  0.15  0.15  0.16  0.17 
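A minimal simulation in the spirit of Example 1 can be sketched as follows. This is our own illustration, not the paper's code: it uses a fixed tilt in place of solving (7) for a target $\delta$, and a grid search in place of a numerical optimizer; one variable pair is made incompatible by setting its true correlation to 0 while the model assumes a common correlation.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
d, n, rho = 6, 500, 0.5
S = np.full((d, d), rho)
np.fill_diagonal(S, 1.0)
S[0, 1] = S[1, 0] = 0.0          # one incompatible pair (true correlation 0)
Y = rng.multivariate_normal(np.zeros(d), S, size=n)

pairs = list(combinations(range(d), 2))
stats = {(a, b): (np.mean(Y[:, a]**2), np.mean(Y[:, b]**2), np.mean(Y[:, a]*Y[:, b]))
         for a, b in pairs}

def pair_loglik(r, a, b):
    # Average bivariate normal log-likelihood (unit variances, correlation r).
    sxx, syy, sxy = stats[(a, b)]
    return -0.5*np.log(1 - r**2) - (sxx - 2*r*sxy + syy) / (2*(1 - r**2))

grid = np.linspace(-0.9, 0.9, 181)
xi = 20.0                         # fixed tilt (illustrative choice)
w = np.full(len(pairs), 1.0/len(pairs))
for _ in range(20):
    obj = [sum(wk*pair_loglik(r, a, b) for wk, (a, b) in zip(w, pairs)) for r in grid]
    rho_hat = float(grid[int(np.argmax(obj))])
    ll = np.array([pair_loglik(rho_hat, a, b) for a, b in pairs])
    z = np.exp(xi*(ll - ll.max()))
    w = z / z.sum()               # down-weights the incompatible pair
```

The incompatible pair ends up with by far the smallest weight, and the estimate settles near the common correlation of the compatible majority.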
Figure 2 illustrates the compatibility profile plot (left) and the parameter estimates (right) for one simulated sample. When $\delta = 0$, the estimator is unreliable. As $\delta$ moves away from zero, the importance profile shows two distinct groups of sub-likelihoods, with the four overlapping paths at the bottom corresponding to misspecified sub-likelihoods. For moderate $\delta$, the estimator correctly exploits the information from the compatible sub-likelihoods and gives estimates close to the true value $\rho = 0.5$. Finally, as $\delta$ grows large, a single partial likelihood tends to dominate the others, and much of the information from the other useful data pairs is ignored; the composite estimate for large $\delta$ is therefore less accurate than that obtained at moderate $\delta$.
4.2 Example 2: Location of heterogeneous normal variates
Let $X_1, \dots, X_M$ be independent normal variables with common mean $\mu$ and heterogeneous variances $\sigma_1^2, \dots, \sigma_M^2$. This is the basic meta-analysis model, where a weighted average of a series of study estimates, say $\bar{x}_1, \dots, \bar{x}_M$, is used to obtain a more precise estimate of $\mu$. The inverse of each estimate's variance, $1/\sigma_m^2$, is the optimal study weight ensuring minimum variance of the combined estimate. All the parameter information is contained in the marginal models, so the following negative one-wise composite log-likelihood function is minimized:
$$\ell_w(\mu, \sigma_1^2, \dots, \sigma_M^2) = \sum_{m=1}^{M} w_m \Big\{ \frac{\log \sigma_m^2}{2} + \frac{1}{2 n \sigma_m^2} \sum_{i=1}^{n} \big( x_m^{(i)} - \mu \big)^2 \Big\}, \qquad (15)$$
and the profiled composite likelihood estimators of the variances are
$$\widehat{\sigma}_m^2(\mu) = \frac{1}{n} \sum_{i=1}^{n} \big( x_m^{(i)} - \mu \big)^2, \qquad m = 1, \dots, M.$$
Replacing $\sigma_1^2, \dots, \sigma_M^2$ by $\widehat{\sigma}_1^2(\mu), \dots, \widehat{\sigma}_M^2(\mu)$ in (15) gives the profiled objective, which is then minimized subject to the sum-to-one constraint on the weights and the divergence constraint in (4).
The resulting location estimator, say $\widehat{\mu}_\delta$, solves the fixed-point equation
$$\widehat{\mu}_\delta = \frac{\sum_{m=1}^{M} w_m\, \bar{x}_m / \widehat{\sigma}_m^2(\widehat{\mu}_\delta)}{\sum_{m=1}^{M} w_m / \widehat{\sigma}_m^2(\widehat{\mu}_\delta)}, \qquad (16)$$
where $w_m$ is computed as in (6) for a given $\delta$, and the variance estimators are $\widehat{\sigma}_m^2(\mu) = n^{-1} \sum_{i=1}^{n} (x_m^{(i)} - \mu)^2$ ($m = 1, \dots, M$).
If the degree of incompatibility of some models is very strong, the estimator $\widehat{\mu}_\delta$ is nearly as good as the estimator obtained by ignoring the corresponding data subsets altogether. If all the models are compatible, $\widehat{\mu}_\delta$ still performs well in terms of accuracy. In particular, if all the partial likelihoods are correctly specified, $\widehat{\mu}_\delta$ behaves like the classic inverse-variance weighted estimator. If some study means are far from the bulk of the others, then finding $\widehat{\mu}_\delta$ is approximately equivalent to solving (16) with zero weights assigned to the corresponding submodels.
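The fixed-point iteration for this example can be sketched as follows. This is our own illustration, not the paper's code: the tilt is held fixed rather than solved from the divergence constraint, six compatible studies share mean 0, and two incompatible ones are shifted by 4.

```python
import numpy as np

rng = np.random.default_rng(2)
M, n = 8, 100
shifts = np.array([0.0]*6 + [4.0]*2)          # last two studies are incompatible
sd = rng.uniform(0.5, 2.0, M)                 # heterogeneous true scales
data = [rng.normal(shifts[m], sd[m], n) for m in range(M)]

xi = 10.0                                     # fixed tilt (illustrative choice)
xbar = np.array([np.mean(x) for x in data])
mu = float(np.mean(xbar))                     # start from the unweighted average
for _ in range(100):
    s2 = np.array([np.mean((x - mu)**2) for x in data])  # profiled variances
    ll = -0.5*(np.log(s2) + 1.0)              # profiled average log-likelihoods
    z = np.exp(xi*(ll - ll.max()))
    w = z / z.sum()                           # tilted weights, as in (6)
    mu_new = float(np.sum(w*xbar/s2) / np.sum(w/s2))     # fixed-point update (16)
    if abs(mu_new - mu) < 1e-12:
        break
    mu = mu_new
```

The shifted studies inflate their profiled variances, receive small tilted weights, and the combined estimate settles near the common mean of the compatible studies.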
Table 2 shows the bias and variance of $\widehat{\mu}_\delta$ under correctly specified and misspecified sub-likelihoods. The usual McLE with uniform weights corresponds to the column with $\delta = 0$. For comparison purposes, we also show the maximum likelihood estimator with weights $1/\widehat{\sigma}_m^2$, where $\widehat{\sigma}_m$ is the sample standard deviation for the $m$th variable. The results correspond to the location model with misspecification introduced by a location shift in some of the submodels. When all the sub-likelihoods are compatible, the MLE has the best performance, with the DMcLE for small $\delta$ doing comparably well. In the presence of two incompatible submodels, the bias of the MLE and of the DMcLE with uniform weights ($\delta = 0$) is large compared to the DMcLE with $\delta > 0$; the bias is quite small for moderate values of $\delta$. The variance of the DMcLE for $\delta > 0$ is also quite small compared to the McLE with uniform weights; interestingly, in a few cases the variance is smaller than that of the MLE. This confirms the behavior observed in other numerical examples and in the derivations given in Section 3.2. Across a number of other simulation settings, we found that $\delta$ slightly larger than zero gives estimators with negligible bias and relatively small mean squared errors.
MLE  DMcLE(δ)

δ =  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7
Bias  
0  0.00  0.01  0.00  0.01  0.07  0.14  0.19  0.21  0.22  
2  1.35  36.32  1.27  0.16  0.09  0.22  0.26  0.59  1.79  
0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  
2  2.53  39.47  1.14  0.21  0.04  0.00  0.00  0.00  0.01  
Var  
0  3.51  5.03  4.08  6.69  9.55  11.01  11.89  12.49  12.90  
2  9.50  6.66  5.06  6.93  10.23  15.14  17.58  18.70  21.67  
0  0.41  0.65  0.43  0.47  0.53  0.59  0.65  0.71  0.76  
2  0.53  0.73  0.48  0.51  0.53  0.55  0.57  0.59  0.61 
5 Multivariate models for spatial extremes: application to the Tasmanian rainfall data
Max-stable processes have emerged as a useful representation of extreme environmental occurrences such as hurricanes, floods and storms (davison2012). However, their estimation poses significant challenges, since they lack a general multivariate density expression. A well-studied case is the Gaussian max-stable process defined as $Z(x) = \max_{i \geq 1}\, \zeta_i\, \phi(x - X_i; \Sigma)$, where $\{(\zeta_i, X_i),\ i \geq 1\}$ are the points of a Poisson process on $(0, \infty) \times \mathbb{R}^2$ with intensity measure $\zeta^{-2}\, d\zeta\, dx$, and $\phi(\cdot\,; \Sigma)$ is the bivariate normal density with zero mean and covariance $\Sigma$ (Smith90). The process has unit Fréchet margins with distribution function $\exp(-1/z)$, $z > 0$. Smith90 interprets the points as extreme environmental episodes, such as storms, where $\zeta_i$, $X_i$, and $\Sigma$ represent the storm magnitude, center, and shape, respectively.
Next, we apply the DMcLE to estimate the extremal covariance parameter $\Sigma$ in the context of the Tasmanian rainfall data described below. For a finite set of spatially referenced locations, $x_1, \dots, x_d$, the joint distribution of the random vector $(Z(x_1), \dots, Z(x_d))$ has no analytical representation for $d \geq 3$. Padoan10 give a closed-form expression for the bivariate density and propose estimation based on the pairwise likelihood function. Given observations at the $d$ locations, $z^{(i)} = (z_1^{(i)}, \dots, z_d^{(i)})$, $i = 1, \dots, n$, the weighted pairwise likelihood function obtained by considering all location pairs is
$$\ell_w(\Sigma) = \sum_{j < k} w_{jk} \sum_{i=1}^{n} \log f\big(z_j^{(i)}, z_k^{(i)}; \Sigma\big),$$
where $f(z_j, z_k; \Sigma)$ is the bivariate density
$$f(z_j, z_k; \Sigma) = \exp\Big( -\frac{\Phi(w)}{z_j} - \frac{\Phi(v)}{z_k} \Big) \Big[ \Big( \frac{\Phi(w)}{z_j^2} + \frac{\phi(w)}{a z_j^2} - \frac{\phi(v)}{a z_j z_k} \Big) \Big( \frac{\Phi(v)}{z_k^2} + \frac{\phi(v)}{a z_k^2} - \frac{\phi(w)}{a z_j z_k} \Big) + \frac{v\, \phi(w)}{a^2 z_j^2 z_k} + \frac{w\, \phi(v)}{a^2 z_j z_k^2} \Big]. \qquad (17)$$
In the above expression, $\Phi$ and $\phi$ are the standard normal probability and density functions, respectively; $w = a/2 + a^{-1} \log(z_k / z_j)$; $v = a - w$; and $a^2 = (x_j - x_k)^\top \Sigma^{-1} (x_j - x_k)$. For fixed locations, the extremal dependence behaviour is determined by $a$, which is therefore the main interest for inference. Since the above model requires unit Fréchet margins, the observed margins, $y$, are transformed to unit Fréchet by the transformation $z = \{1 + \gamma (y - \mu)/\sigma\}^{1/\gamma}$, where $\mu$, $\sigma$ and $\gamma$ are location, scale and shape parameters obtained from the empirical distribution.
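For the margin transformation, a simple rank-based alternative to the parametric transform described here maps each margin to approximately unit Fréchet via the empirical distribution function. This sketch is our own illustration (the paper instead estimates location, scale and shape parameters):

```python
import numpy as np

def to_unit_frechet(y):
    """Map a sample to approximately unit Fréchet margins using the empirical
    CDF: z = -1/log(F(y)), with ranks rescaled to avoid F = 0 or 1."""
    n = len(y)
    ranks = np.argsort(np.argsort(y)) + 1.0
    F = ranks / (n + 1.0)
    return -1.0 / np.log(F)
```

The transform is monotone, so the ordering of the observations is preserved while the marginal scale becomes comparable across stations.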
We consider a data set of 20 yearly rainfall maxima recorded at 10 gauging stations from 1995 to 2014 in the Australian state of Tasmania, corresponding to the following locations, also shown in Figure 3: Bushy Park, Ross, King Island, Eddistone Point, Geeveston, Strahan, Flinders Island, Marrawah, Rocky Point, Orford (source: http://wwwc.bom.gov.au/tas/). The max-stable Gaussian model is then fitted using a pairwise likelihood function including all pairs of locations. We compute estimates for different choices of $\delta$ ranging from