Reliable inference for complex models by discriminative composite likelihood estimation
Composite likelihood estimation has an important role in the analysis of multivariate data for which the full likelihood function is intractable. An important issue in composite likelihood inference is the choice of the weights associated with lower-dimensional data sub-sets, since the presence of incompatible sub-models can deteriorate the accuracy of the resulting estimator. In this paper, we introduce a new approach for simultaneous parameter estimation by tilting, or re-weighting, each sub-likelihood component called discriminative composite likelihood estimation (D-McLE). The data-adaptive weights maximize the composite likelihood function, subject to moving a given distance from uniform weights; then, the resulting weights can be used to rank lower-dimensional likelihoods in terms of their influence in the composite likelihood function. Our analytical findings and numerical examples support the stability of the resulting estimator compared to estimators constructed using standard composition strategies based on uniform weights. The properties of the new method are illustrated through simulated data and real spatial data on multivariate precipitation extremes.
Keywords: Composite likelihood estimation; Model selection; Exponential tilting; Stability, Robustness
While likelihood-based inference is central to modern statistics, for many multivariate problems the full likelihood function is impossible to specify or its evaluation involves a prohibitive computational cost. These limitations have motivated the development of composite likelihood approaches, which avoid the full likelihood by compounding a set of low-dimensional likelihoods into a surrogate criterion function. Composite likelihood inference have proved useful in a number of fields, including geo-statistics, analysis of spatial extremes, statistical genetics, and longitudinal data analysis. See Varin&al11 for a comprehensive survey of composite likelihood theory and applications. larribe2011composite review several applications in genetics.
Let be a random vector and be the assumed density model for , indexed by the parameter , . Suppose that the full likelihood function, , is difficult to specify or compute, but we can specify low-dimensional distributions with one, two, or more variables. Specifically, let be a set of marginal or conditional low-dimensional variables constructed from with associated likelihoods , where , denotes the th low-dimensional density model for . The low-dimensional variables are user-defined and could be constructed by taking marginal models, like , pairs like , or conditional variables like . The overall structure of such lower-dimensional models is sometimes referred as to composite likelihood design (Lindsay&al11) and its choice is often driven by computational convenience. For example, if follows a -variate normal distribution , the full likelihood is hard to compute when is large due to inversion of , which involves operations. In contrast, using sub-models for variable pairs , , can reduce the computational burden since it involves simply inverting partial covariance matrices.
Following Lindsay88, we define the composite likelihood function by
where are non-negative weights, possibly depending on . A well-known issue in composite likelihood estimation is the selection of the weights, as their specification plays a crucial role in determining both efficiency and reliability of the resulting composite likelihood estimator (Lindsay88; Joe&Lee09; Cox&Reid04; Varin&al11; Xu&Reid11). Despite the importance of the weights, many statistical and computational challenges still hinder their selection (Lindsay&al11).
This paper is concerned with the aspect of stability of composite likelihood selection. Stability occurs when the maximizer of the overall composite likelihood function is not overly affected by the existence of locally optimal parameters that work only for a relatively small portion of such sub-sets, say , . The presence of such local optima arises from the incompatibility between the assumed full-likelihood model and the lower dimensional models. For example, suppose that the true distribution of is a -variate normal distribution with zero mean vector, unit variance and correlations for all variable pairs, while the true correlation is for some small fraction of the pairs. If one mistakenly assumes that all correlations are equal to , both maximum likelihood and pair-wise likelihood estimators with uniform weight, , , are not consistent for in this situation. Other examples of incompatible models are given in Xu&Reid11. In applications, model compatibility is hard to detect, especially when is large, so incompatible sub-models are often included in the composite likelihood function with detrimental effects on the accuracy of the global composite likelihood estimator.
Motivated by the above issues, we introduce the discriminative maximum composite likelihood estimator (D-McLE), a new methodology for reliable likelihood composition and simultaneous parameter estimation. The new approach computes smooth weights by maximizing the composite likelihood function for a sample of observations subject to moving a given distance, say , from uniform weights. The D-McLE is regarded as a generalization of the traditional McLE. If the D-McLE is exactly the common composite likelihood estimator with uniform weights. When , incompatible sub-models are down-weighted, thus resulting in estimators for with bounded worst-case bias. Our analytical findings and simulations support the validity of the proposed method compared to classic composite likelihood estimators with uniform weights. The new framework is illustrated through estimation of max-stable models, which have proved useful for describing extreme environmental occurrences as hurricanes, floods and storms (davison2012).
The proposed procure would be useful in two respects. First, the resulting weights would be a valuable diagnostic tool for composite likelihood selection. Small weights would signal suspicious models, which could be further examined leading to improved assumptions. Conversely, the method can be employed to identify influential data sub-sets for many types of composite likelihood estimators. Second, the estimates obtained by such method would be trustworthy at least for the bulk of the data sub-sets models (which are compatible with model assumptions). Clearly, assigning the same weight to all the models including the ones in strong disagreement with the majority of data would lead to biased global estimates, which can be an untrustworthy representations of the entire data-set.
The proposed method is a type of data tilting, a general technique which involves replacing uniform weights with more general weights. To our knowledge, this is the first work that introduces tilting for lower-dimensional data sub-sets within the composite likelihood framework. In robust statistics, tilting has been typically employed to robustify parametric estimating equations, or to obtain natural data order in terms of their influence Choi00. Tilting has also been used to obtain measures of outlyingness and influence of data-subsets; e.g., see hall99; critchley04; lazar05; camponovo12. genton2014 use a tilting approach in the context of multivariate functional data to ranking influence of data subsets.
The rest the paper is organized as follows. In Section 2, we describe the new methodology for simultaneous likelihood selection/estimation; we give an efficient algorithm and introduce the compatibility plot, a new graphical tool to assess the adequacy of the sub-models. In Section 3, we study the properties of the new estimator and give its limit distribution. In Section 4, we provide simulated examples in finite samples confirming our theoretical findings. In Section 5, we illustrate the new procedure to the Tasmanian rainfall spatial data on multivariate precipitation extremes. In Section 6, we conclude and discuss possible extensions for . Proofs of technical results are deferred to a separate appendix.
2.1 Composite likelihood selection
Given independent observations from the true distribution , we construct the set of marginal or conditional low-dimensional observations , , and define the weighted composite log-likelihood function
where are constants playing the role of importance weights. The weight characterizes the impact of the th sub-likelihood,
on the overall composite likelihood function . We define incompatibility by assuming there is a global parameter, say , which suits most sub-models. Specifically, we assume partial models , where if (incompatible models) and , if (compatible models).
Next, we introduce the D-McLE procedure for simultaneous discrimination of discordant models and parameter estimation. We propose to select the weight to be small when, for a value of that is appropriate for the majority of the data sub-sets, the sub-likelihood function for the th data sub-set, , is small. To this end, is regarded as a discrete distribution on points and the discrepancy between and the uniform distribution is measured by the Kullback-Leibler divergence
where . For a given parameter , data-dependent weights are then chosen by solving the following program
Finally, the D-McLE, denoted by , is then defined as the maximizer of the composite log-likelihood function
where is the vector of data-dependent weights. Equivalently, can be obtained by computing the profiled estimator by maximizing for a given weight and then solve (4) with .
The composite likelihood estimator entails moving away from uniform weights in the direction that emphasizes the contribution of the most useful data sub-sets. If , the relative importance of the sub-likelihoods that are incompatible with the data is diminished in the composite likelihood equation (2). The special case when corresponds to the composite likelihood estimator with uniform weights . Thus, all the data sub-sets are regarded as equally compatible. Other divergence measures may be considered in place of the Kullback-Leibler divergence (3), which could be useful in particular estimation setups, although these are not pursued in this paper. The Kullback-Leibler divergence, however, has the advantage that allows one or more zero weights, and gives automatically nonnegative wights without imposing additional constraints by some algorithm to ensure this property. For example, when is very large it could be useful to modify to promote sparsity, i.e. select relatively a large number weights that are exactly zero.
2.2 Data-adaptive weights and parameter estimation
The program in (4) is solved by maximizing the Lagrangian function
where and are Lagrange multipliers. It is easy to see that the solution to (5) has the form
where and depend on the Lagrange multipliers and . From the two constraints in (4), and are obtained by solving
and . The D-McLE is then computed by maximizing .
Lemma 1 in the appendix shows that computing the D-McLE, , is equivalent to solving the estimating equations
where denotes the partial score function corresponding to the th data subset. Thus, is a weighted estimating equation involving the partial scores with weights depending on the data and . A small weight implies a modest contribution of the th score, , to the overall composite likelihood equation. The constant is regarded as a stability parameter which can be used to control for the relative impact of the incompatible lower-dimensional likelihoods. Particularly, if is large incompatible models will receive a low weight, with a relatively small effect on the final parameter estimates. If , all the sub-models are treated equally in terms of the impact of corresponding sub-likelihoods in .
Equation (8) highlights the resemblance to estimating functions of classic robust M-estimators, whose main aim is to reduce the influence of outliers in the full likelihood function. Indeed, the approach followed here coincides with the robust estimation approach by Choi00 in the particular case where: , are independent and all sub-models , are all identical to the full likelihood model, . In general, however, the D-McLE is very different from Choi00 and other similar robust methods. The main difference is that the weights in (6) refer to variables , which are constructed by taking sub-sets of the original vector and are possibly correlated; in robust M-estimation weights refer to independent observations on the original vector . Thus, in our approach observations corresponding to the th data sub-set, namely , , receive the same weight, . This reflects our need to control for the incompatibility of a portion of the sub-models, say , , rather than reducing the effect of outlying observations with respect to the full model .
The form of equation (8) suggests a simple algorithm to simultaneously compute weights and parameter estimates. At each step of the algorithm, we update weights based on previous parameter estimates and then compute a fresh parameter estimate using the new weights. Starting from an initial estimate, , we compute:
until convergence is reached. We consider a relative convergence criterion on the weights and stop iterating when , where is some tolerance level. A practical advantageis that (9) is easy to implement when a basic composite likelihood estimator with fixed weights is already available.
In our numerical studies, the algorithm gave satisfactory performances. In all our examples convergence was reached in a few iterations and we noted that the computational cost does not increase much as grows. This behavior makes the proposed algorithm well-suited to high-dimensional problems with a large number of sub-likelihoods and is shared by analogous iteratively re-weighted algorithms for M-estimation with well-established theory (e.g. see arslan2004convergence). Although we do not offer theoretical insight on the general theoretical behaviour of our algorithm, convergence results may be derived following an argument analogous to basu2004iteratively in the context of iteratively reweighted procedures for minimum divergence estimators.
2.4 Compatibility profile plots (CPPs)
Let be the arrangement of indices implied by , where , , are data-dependent weights computed by the algorithm in Section 2.3. The ordering induces an importance ranking for the sub-models in terms of their compatibility with the true distribution generating the data. Based on this ranking, a graphical tool is introduced, called a compatibility profile plot (CPP). The CPP traces the fitted weights, , , as moves away from zero and can be used to inspect the compatibility of individual sub-likelihoods. For instance, a sharp decrease of the first weights from uniform weights , suggests that the first sub-likelihoods are likely to be misspecified and a different model should be used for such components. The weights often exhibit diverging trajectories (see for example Figure 2) which may be used to determine a suitable value for the parameter . For example, the plots help us pick a value of corresponding to a sufficient degree of separation between compatible and incompatible models. Eventually, reaches an equilibrium point where the trajectories are maximally separated. After equilibrium, weights cluster together again as they tend to 0, where a single weight converges to 1.
2.5 Selection of
The stability parameter tunes the extent to which we down-weight incompatible models, which is important to discuss. One approach is to select the tuning constant closest to 0 (i.e., closest to uniform weights) such that the point estimates of the parameters of interest are sufficiently stable. If all the sub-likelihoods are compatible, already gives stable estimates and moving away is expected to have little impact on the estimates. In the presence of incompatible sub-likelihoods, values of close to tend give unstable estimates in terms of bias and variance, so we move away from until stability is reached. For example in Figure 2 (right), the correlation estimator is far from the true correlation value of 0.5 when . As moves away from zero, changes rapidly until stability is reached when . The above discussion suggests a simple data-driven procedure to select :
Define an equally spaced grid .
Starting from compute the correspondent point estimates, , .
Select the optimal value using the stopping rule , where is some threshold value.
By definition, is the value closest to 0 such that the variation of the point estimates is smaller than some acceptable threshold. Based on our simulations, a grid between and , with typically works well and choices not too far from already give considerable stability. If a very small portion of data sub-sets are incompatible, it may be useful to consider refinements of the grid near , such as, , .
3.1 Large sample behavior of and standard errors
To emphasize reliability aspects, it is helpful to distinguish between the true process generating the data and the parametric model used for inference. Assume that has distribution , while the true distribution for the sub-vector is denoted by . The density function of with respect to the dominating measure is denoted by . Let be a parametric family of distributions for and let denote the corresponding densities with respect to . We assume that is identifiable, i.e. for , , for all .
The composite likelihood function (2) is correctly specified if there is a parameter such that for all ; when no such exists then (2) is misspecified, meaning that it contains incompatible models. The optimal parameter, , is defined as the minimizer of the weighted composite Kullback-Leibler divergence
is the cross-entropy between the true distribution and the parametric sub-model and () here denote asymptotic weights computed as in Section 2.3 with replaced by . In the remainder of the paper, we assume that is the unique maximizer of (10).
Next, consistency and asymptotic normality of are established. We note that standard M-estimation theory cannot be applied directly to equation (8) because the weights in (4) depend on random averages; thus some additional care is needed to characterize the asymptotic behavior of .
Assume: (C1) is an interior point in ; (C2) as (); and (C3) (). Then the maximum composite likelihood estimator converges in probability to defined in (10).
A direct consequence is Fisher-consistency of , i.e. under correct composite likelihood specification the optimal target value is for all . This can be seen by taking the expectation of equation (8) with :
since if and only if , for all . This means that the estimating equation (8) is solved by regardless of the choice of , since changing the latter affects only the weights , but not the partial scores . Section 3.2 discusses bias in the presence of incompatible sub-likelihoods.
Under conditions (C1) – (C3) in Proposition 3.1 and additional regularity conditions given in the Appendix, converges in distribution to the -variate normal as , where and are the following matrices
, (), and expectations are with respect to .
The random weights, , play a crucial role in determining the asymptotic behavior of . This feature is also found in model averaging, where parameter estimators obtained from different models, say , are combined into a global estimator , through random weights (claeskens08, Chapter 7). The connection with model averaging is further highlighted by the normal location example in Section 4. Here the random weights converge in probability to constants; thus, the asymptotic variance takes the usual sandwich form and , can be consistently estimated analogously to Varin&al11 with weights (), computed as in Section 2.3. Re-sampling techniques such as jackknife and bootstrap may be also used.
3.2 Bias under incompatible models
In this section, we examine the first-order properties of our estimator in the presence of incompatible models. For clarity of exposition, in this section we consider the case where , but analogous arguments can easily extended to the general case. To represent incompatibility, we assume heterogeneous parameters for the first sub-models. Particularly, let , , (), where , follows the drift model , if , and , if . In addition, we assume that the Fisher information , , are bounded away from zero and infinity. A first-order Taylor expansion of and in (8) about under suitable regularity conditions gives
Re-arranging the above expression leads to the following approximation for the bias
Therefore, an approximate upper bound to the bias, , is
which is regarded as the worst-case bias under incompatible models. Clearly, when (equivalently, ), the worst-case bias grows linearly in . When , is bounded and the estimator achieves bias control. Particularly, if and all the models are compatible, then . If is large, since the denominator in (13) dominates the numerator, the maximal bias decreases quickly to 0.
A second-order Taylor expansion of and in (8) about (not shown here) can be used to derive an upper bound for the mean squared error. Analogously to (13), when (equivalently, ), the worst-case mean squared error grows quadratically in . When , the maximal mean squared error is bounded, meaning that the estimator achieves both bias and variance control. This theoretical understanding is confirmed by the numerical simulations in Section 4.
As an illustration, Figure 1 shows the maximal bias for the multivariate normal model with where if , and , if . Clearly, the classic estimator with equal weights () is very risky for this model, since the maximal bias can be potentially very large. This undesirable behavior can be easily avoided by setting . Thus, if the degree of incompatibility is strong ( ), the worst-case bias approaches zero. For intermediate cases where the bias remains bounded and can be controlled by tuning .
4.1 Example 1: Estimation of correlation
Suppose the random vector follows a multivariate normal distribution with zero mean vector, unit variances and covariances if , for some , and otherwise. If we model as a multivariate normal with zero mean vector and all correlations equal, then the model is clearly misspecified and the maximum likelihood estimator is not consistent for .
When constructing a composite likelihood function we only need pair-wise lower-dimensional likelihoods, since the marginal univariate sub-likelihoods do not contain information on . Therefore the correlation estimator is obtained as described in Section 2 by maximizing the pairwise likelihood
where and . Note that (14) refers to combining bi-variate normal models with zero mean and covariance given by matrices with diagonal elements equal to 1 and off-diagonal elements equal to . Therefore will be consistent for only if .
In Table 1, we show the finite-sample bias and variance of the D-McLE for different values of . As a comparison, we report results for the MLE and the usual McLE with uniform weights corresponds to the column with . When all the sub-likelihoods are compatible (), not surprisingly the MLE has the best performance in terms of variance. For the D-McLE, however, both bias and variance do not increase much as long as is not too far from . In the presence of incompatible sub-models (), the bias for the MLE and D-McLE with uniform weights () is very large compared to the D-McLE with . For example, when , the bias of the D-McLE is negligible when . In addition to bias control of D-McLE, we note also that our procedure also achieves variance reduction when and is small. These results suggest that by setting slightly above zero (e.g., 0.1, 0.2, or 0.3) already gives substantial stability and reduce the mean squared error of the corresponding estimator, .
Figure 2 illustrates the profile plot (left) and parameter estimates (right) for a sample of observations. When , the estimator is unreliable with estimates between and . When moves away from zero, the importance profile shows two distinct groups of sub-likelihoods, with the four overlapping paths at the bottom corresponding to misspecified sub-likelihoods. When , the estimator exploits correctly the information from the compatible sub-likelihoods and gives estimates close to the true value . Finally, as , a single partial likelihoods tends to dominate the others, but much of the information from the other useful data pairs is ignored. Therefore the composite estimate at is inferior to that at , in terms of accuracy.
4.2 Example 2: Location of heterogeneous normal variates
Let be independent normal variables with common mean ( and heterogeneous variances (). This is the basic meta-analysis model where a weighted average of a series of study estimates, say , is combined to obtain a more precise estimate for . The inverse of the estimates’ variance, , is the optimal study weight ensuring minimum variance of the combined estimate. All the parameter information is contained in the marginal models, so the following negative one-wise composite likelihood function is minimized:
and the profiled composite likelihood estimators are
Replacing and in (15) gives , which is then minimized subject to the constraints and .
The resulting location estimator, say , solves the fixed-point equation
where is computed as in (6) for a given , and the variance estimators are (.
The degree of incompatibility of models is very strong, then the estimator is nearly as good as the estimator obtained by ignoring the corresponding data sub-sets. If all the models are compatible, still performs well in terms of accuracy. Particularly, if all the partial likelihoods are correctly specified, then . If () are far away from , then finding is approximately equivalent to solving (16) with , if .
Table 2 shows bias and variance for under correctly specified and misspecified sub-likelihoods. The usual McLE with uniform weights corresponds to the column with . For comparison purposes, we also show the maximum likelihood estimator with weights , where is the sample standard deviation for the th variable. The results correspond to the location model with () and misspecification introduced by the location shift , . When all the sub-likelihoods are compatible (), the MLE has the best performance, but the D-McLE with doing comparably well. In the presence of two incompatible sub-models (), the bias for the MLE and D-McLE with uniform weights () is large compared to the D-McLE with . The bias is quite small when . The variance of D-McLE for is also quite small compared to the McLE with uniform weights; interestingly in a few cases the variance is smaller than that of the MLE. This confirms the behavior observed in other numerical examples and in the derivations given in Section 3.2. Across a number of other simulation settings, we found that slightly larger than zero gives estimators with negligible bias and relatively small mean squared errors.
5 Multivariate models for spatial extremes: application to the Tasmanian rainfall data
Max-stable processes have emerged as a useful representation of extreme environmental occurrences such as hurricanes, floods and storms (davison2012). However, their estimation poses significant challenges, since they lack of a general multivariate density expression. A well studied case is the Gaussian max-stable process defined as , where is a Poisson process on , with intensity measure , and is the bivariate normal distribution with zero mean and covariance (Smith90). The process has unit Frechét margins with distribution function , . Smith90 interprets as extreme environmental episodes, such as storms, where , , and are the storm magnitude, center, and shape, respectively.
Next, we apply the D-McLE to estimate the extreme covariance parameter in the context of the Tasmania rainfall data described below. For a finite set of spatially-referenced indexes, , the joint distribution of the random vector has no analytical representation for . Padoan10 give a closed-form expression for the bivariate density and propose estimation based on the pairwise likelihood function. Given observations on locations, , , the weighted pairwise likelihood function obtained by considering all location pairs is
where is the bivariate density
In the above expression, and are the standard normal probability and density functions, respectively; , ; ; and . For fixed , the extremal dependence behaviour is determined by , which is therefore the main interest for inference. Since the above model requires unit Frechét margins, the observed margins, , are transformed in unit Frechét by the transformation , where and , and are location, scale and shape parameters obtained from the empirical distribution.
We consider a data set of 20 yearly rainfall maxima recorded at 10 gauging stations from 1995 to 2014 in the Australian state of Tasmania corresponding to the following locations, also shown in Figure 3: Bushy Park, Ross, King Island, Eddistone Point, Geeveston, Strahan, Flinders Island, Marrawah, Rocky Point, Orford (source: http://wwwc.bom.gov.au/tas/). The max-stable Gaussian model is then fitted using a pair-wise likelihood function including all pairs of locations. We compute estimates for different choices of ranging from