-BOOSTING ON GENERALIZED HOEFFDING
DECOMPOSITION FOR DEPENDENT VARIABLES
APPLICATION TO SENSITIVITY ANALYSIS
Magali Champion, Gaelle Chastaing, Sébastien Gadat,
Institut de Mathématiques de Toulouse
Université Joseph Fourier, LJK/MOISE
Institut National de la Recherche Agronomique, MIA
This paper is dedicated to the study of an estimator of the generalized Hoeffding decomposition. We build such an estimator using an empirical Gram-Schmidt approach and derive a consistency rate in a large dimensional settings. Then, we apply a greedy algorithm with these previous estimators to Sensitivity Analysis. We also establish the consistency of this -boosting up to sparsity assumptions on the signal to analyse. We end the paper with numerical experiments, which demonstrates the low computational cost of our method as well as its efficiency on standard benchmark of Sensitivity Analysis.
Key words and phrases: -boosting, convergence, dependent variables, generalized ANOVA decomposition, sensitivity analysis.
In many scientific fields, it is desirable to extend a multivariate regression model as a specific sum of increasing dimension functions.
Functional ANOVA decomposition or High Dimensional Representation Model (HDMR) given by Hooker (2007); Li, Rabitz, Yelvington, Oluwole, Bacon and
Schoendorf (2010) are well known expansions that allow for understanding the model behaviour, and for detecting how inputs interact to each other. For high dimensional models, the HDMR is also a good way to deal with the curse of dimensionality. Indeed, a model function may be well approximated by some first order functional components, making easier the study of a complex model. However, the existence and uniqueness of the functional ANOVA components is of major importance to valid a study. Thus, some identifiability constraints need to be imposed to make the ANOVA decomposition unique.
When input variables are independent, Hoeffding establishes the uniqueness of the decomposition provided that the summands are mutually orthogonal (see e.g. Hoeffding (1948)). Further, as pointed by Sobol (1993), the analytical expression of these components can be recursively obtained in terms of conditional expectations. Thus, their estimation can be deduced by numerical approximation of integrals (see e.g Sobol (2001); Saltelli, Ratto, Andres, Campolongo, Cariboni, Gatelli, Saisana and Tarantola (2008)).
Nevertheless, the independence assumption is often unrealistic for some real-world phenomena. In this paper, we are interested in the ANOVA expansion of some models that depend on not necessarily independent input variables. Following the work of Stone (1994), later exploited in machine learning by Hooker (2007), and in sensitivity analysis by Chastaing, Gamboa and Prieur (2012), we focus on a generalized Hoeffding decomposition under general assumptions on the inputs distribution. That is, any model function can be uniquely decomposed as a sum of hierarchically orthogonal component functions. Two summands are called hierarchically orthogonal whenever all variables included in one of them are also involved in the other. For a better understanding of the paper, this generalized ANOVA expansion will be called a Hierarchically Orthogonal Functional Decomposition (HOFD), as done in Chastaing, Gamboa and Prieur (2012).
Since analytical formulation for HOFD is rarely available, it is of great importance to develop estimation procedures. In this paper, we focus on an alternative method proposed in Chastaing, Gamboa and Prieur (2013) to estimate the HOFD components. It consists of constructing a hierarchically orthogonal basis from a suitable Hilbert orthonormal basis. Inspired by the usual Gram-Schmidt algorithm, the procedure recursively builds for each component a multidimensional basis that satisfies the identifiability constraints imposed to this summand. Then, each component is well approximated on a truncated basis, where the unknown coefficients are deduced by solving an ordinary least-squares. Nevertheless, in a high-dimensional paradigm, this procedure suffers from a curse of dimensionality. Moreover, it is numerically observed that only a few of coefficients are not close to zero, meaning that only a small number of predictors restore the major part of the information contained in the components. Thus, it is important to be able to select the most relevant representative functions, and next identify the HOFD with a limited computational budget.
In this view, we suggest in this article to transform the ordinary least-squares into a penalized regression as it has been proposed in Chastaing, Gamboa and Prieur (2013). In the present paper, we focus here on the -boosting to deal with the penalization, developped by Friedman (2001). The -boosting is a greedy strategy that performs variable selection and shrinkage. The choice of such an algorithm is motivated by the fact that the -boosting is very intuitive and easy to implement. It is also closely related (in some practical sense) to the LARS algorithm, proposed by Efron, Hastie, Johnstone and Tibshirani (2004), which solves the Lasso regression with a penalization (see e.g. Bühlmann and van de Geer (2011); Tibshirani (1996)). The -boosting and the LARS both select predictors using the maximal correlation with the current residuals. The question that naturally arises now is the following: provided that the theoretical procedure of components reconstruction is well tailored, do the estimators obtained by the -boosting converge to the theoretical true sparse parameters when the number of observations tends to infinity ?
The goal of this paper is to extend the work of Chastaing, Gamboa and Prieur (2013) by addressing this question. More precisely, the aim is to determine sufficient conditions for which the consistency of the estimators is satisfied. Further, we discuss these conditions and give some numerical examples where such conditiones are fulfilled. One interesting application of the general theory is the global sensitivity analysis (SA). We apply the -boosting to estimate the generalized sensitivity indices defined in Chastaing, Gamboa and Prieur (2012, 2013). After reminding the form of these indices, we numerically compare the -boosting performance with the LARS technique and the Forward-Backward algorithm, proposed by Zhang (2011).
The article is organized as follows. Paragraph 2.1 aims at introducing the notation of the paper.We also remind the HOFD representation of the model function in Paragraph 2.2. In Paragraph 2.3, we recall the procedure detailed in Chastaing, Gamboa and Prieur (2013) that consists in constructing well tailored hierarchically orthogonal basis to represent the components of the HOFD. At last, we highlight the curse of dimensionality we are exposed to, and present the -boosting. Section 3 gathers our main theoretical results on the proposed algorithms. Section 4 presents a numerical study of our method. We finally conclude this work in Section 5, and we provide the proofs of the two main theorems in an Appendix.
Acknowledgment Authors are indebted to Fabrice Gamboa for motivating discussions and numerous suggestions on the subject.
2. Estimation of the generalized Hoeffding decomposition components
We consider a measurable function of a random real vector of , . The response variable is a real-valued random variable defined as
where stands for a centered random variable independent of and models the variability of the response around its theoretical unknown value . We denote by the distribution law of , which is unknown in our setting, and we assume that admits a density function with respect to the Lebesgue measure on . Note that is not necessarily a tensor product of univariate distributions since the components of may be correlated.
Further, we suppose that , where denotes the Borel set of . The Hilbert space is denoted by , for which we use the inner product , and the norm as follows,
Here, stands for the expected value. Further, denotes the variance, and the covariance.
For any , we denote by the marginal distribution of and extend naturally the former notation
2.2 The generalized Hoeffding decomposition
Let us denote , with , and let be the collection of all subsets of . We also define . For , the subvector of is defined as . Conventionally, for , . The marginal distribution (resp. density) of is denoted (resp. ).
A functional ANOVA decomposition consists in expanding as a sum of increasing dimension functions,
where is a constant term, , are the main effects, , are the interaction effects, and the last component is the residual.
Decomposition (2.2) is generally not unique. However, under mild assumptions on the joint density (see Assumptions (C.1) and (C.2) in Chastaing, Gamboa and Prieur (2012)), the decomposition is unique under some additional orthogonality assumptions.
More precisely, let us introduce the set of constant functions, and for all , . Then we define , as follows:
where denotes the strict inclusion.
Definition 1 (Hierarchical Orthogonal Functional Decomposition - HOFD).
The components of the HOFD (2.2) are referred as hierarchically orthogonal, that is .
To get more details on the HOFD, the reader is referred to Hooker (2007); Chastaing, Gamboa and Prieur (2012). In this paper, we are interested in estimating the summands in (2.2). As underlined in Huang (1998), estimating all components of (2.2) suffers from a curse of dimensionality, leading to an intractable problem in practice. To bypass this issue, we assume further along the article (without loss of generality) that is centered, so that and suppose that is well approximated by
We thus assume that interactions of order can be neglected.
But even by choosing , the number of components in (2.3) can become prohibitive if the number of inputs is high. We therefore are interested by estimation procedures under sparse assumptions when the number of variables is large.
In the next section, we remind the procedure to identify components of (2.3). Through this strategy, we highlight the curse of dimensionality when is getting large, and we propose to use a greedy -boosting to tackle this issue.
2.3 Practical determination of the Sparse HOFD
General description of the procedure
We propose in this section a Two-Steps estimation procedure to identify the components in (2.3): the first one is a simplified version of the Hierarchical Orthogonal Gram-Schmidt (HOGS) procedure developed in Chastaing, Gamboa and Prieur (2013), and the second consists of a -boosting algorithm (see e.g. Friedman (2001); Bühlmann (2006)). The specificity of our new -boosting algorithm is that it is based on a random dictionary and then falls into the framework of sparse recovery problem with error in the variables.
To lead this two-steps procedure, we assume that we observe two independent and identically distributed samples and from the distribution of (the initial sample can be splitted in such two samples). We define the empirical inner product and the empirical norm associated to a -sample as
Also, for , we define the multi-index . We use the notation to define the set of all finite linear combination of elements of , also called the linear span of .
Step 1 and Step 2 of our sparse HOFD procedure will be described in details further below.
In the following, we assume that in (2.3). The procedure could be extended to any higher order approximation, but we think that the description of the methodology for helps for a better understanding. We thus have chosen to only describe this situation for the sake of clarity.
Step 1: Hierarchically Orthogonal Gram-Schmidt procedure
For each , let denote an orthonormal basis of . For , for , we set
as well as
We define , the approximation of , as
The recursive procedure below aims at constructing a basis of and a basis of for any .
For any , define , . Then, thanks to the orthogonality of , we get
Second order interactions
Let , with . As the dimension of is equal to , and that the approximation space is subject to constraints, its dimension is then equal to . We want to construct a basis for , which satisfies the hierarchical orthogonal constraints. We are looking for such a basis of the form:
The constants are determined by resolving the following constraints:
This construction can be extended to any . We refer the interested reader to Chastaing, Gamboa and Prieur (2013). Just note that the dimension of the approximation space is given by , where denotes the cardinality of .
Algorithm 1 below proposes an empirical version of the HOGS procedure. It consists in substituting the inner product by its empirical version obtained with the first data set .
For any such that , write the matrix as well as obtained using the former expressions with .
Solve (2.6) with the empirical inner product and compute .
Compute by using Equation (2.7) and .
The empirical version of the basis given by (2.4) is then:
Step 2: Greedy selection of Sparse HOFD
Each component of the HOFD defined in Definition 1 is a projection onto . Since, for , the space well approximates , it is then natural to approximate by:
where is the multi-index . For the sake of clarity (since there is no ambiguity), we will omit the summation support of in the sequel.
Now, we consider the second sample and we aim to recover the unknown coefficients on the regression problem,
However, the number of coefficients is equal to . When gets large, the usual least-squares estimator is not adapted to estimate the coefficients . We then use the penalized regression,
where is the -penalty, i.e.
Of course, such an optimisation procedure is not tractable and we instead consider the relaxed -boosting (see e.g. Friedman (2001)) to solve this penalized problem. Mimicking the notation of Temlyakov (2000); Champion, Cierco-Ayrolles, Gadat and Vignes (2013), we define the dictionary of functions as
The quantity denotes the approximation of at step , as a linear combination of elements of . At the end of the algorithm, the estimation of is denoted . The -boosting is described in Algorithm 2.
Select such that
Compute the new approximation of as
For any step , Algorithm 2 selects a function from wich provides a sufficient information on the residual . The shrinkage parameter is the standard step-length parameter of the boosting algorithm. It actually smoothly inserts the next predictor in the model, making possible a refinement of the greedy algorithm, and may statistically guarantees its convergence rate.
In a deterministic setting, the shrinkage parameter is not really useful and may be set to (see Temlyakov (2000) for further details). It is indeed useful from a practical point of view to smooth the boosting iterations.
An algorithm for our new sparse HOFD procedure
Algorithm 3 below provides now a simplified description of our sparse HOFD procedure, whose steps have been described further above.
We now obtain a strategy to estimate the components of the decomposition (2.3) in a high-dimensional paradigm. We aim to show that the obtained estimators are consistent, and that the Two-Steps procedure (summarized in Algorithm 3) is numerically convincing. The next section is devoted to the asymptotic properties of the estimators.
3. Consistency of the estimator
In this section, we study the asymptotic properties of the estimator obtained from the Algorithm 3 described in Section 2. To this end, we restrict our study to the case of and assume that is well approximated by first and second order interaction components. Hence, the observed signal may be represented as
where is the true parameter, and the functions , are constructed according to the HOFD described in the paragraph id1. We assume that we have in hand a -sample of observations, divided into two samples and . Samples in (resp. in ) of size (resp. of size ) are used for the construction of described in Algorithm 1 (resp. for the -boosting Algorithm 2 to estimate ).
The goal of this section is to study the consistency of when the sample size tends to infinity. Its objective is also to determine an optimal number of steps to get a consistent estimator from Algorithm 2.
We first briefly recall some notation: for any sequences , , we write when is a bounded sequence for large enough. Now, for any random sequence , means that is bounded in probability.
We have chosen to present our assumptions in three parts to deal with the dimension, the noise and the sparseness of the entries.
The first set of hypotheses matches with the bounded case and is adapted to the special situation of bounded support for the random variable , for instance when each follows a uniform law on a compact set where is a compact set of independent of . It is refered as in the sequel and corresponds to the following three conditions.
The number of variables satisfies
The Gram matrices introduced in (2.6) satisfies:
where denotes the determinant of a matrix.
Roughly speaking, this will be the favorable situation from a technical point of view since it will be possible to apply a Matrix Hoeffding’s type Inequality. It may be possible to slightly relax such an hypothesis using a sub-exponential tail argument. For the sake of simplicity, we have chosen to only restrict our work to the settings of .
Whatever the joint law of the random variables is, it is always possible to build an orthonormal basis from a bounded (frequency truncated) Fourier basis and thus is not so restrictive in practice.
Assumption copes with the high dimensional situation. The number of variables can grow exponentially fast with the number of observations .
Note that Hypothesis stands for a lower bound of the determinant of the Gram matrices involved in the HOFD. It is shown in Chastaing, Gamboa and Prieur (2013) that each of these Gram matrices are invertible and thus each are positive. Nevertheless, if , this hypothesis assume that such an invertibility is uniform over all choices of tensor . This hypothesis may be too strong for a large number of variables when . However, when , Hypothesis drastically relax the case and becomes very weak. It will be satisfied in many of our numerical examples. In the sequel, the parameters and will be related each other and we will obtain a consistency result of the sparse HOFD up to the condition . This constraint implicitely limits the size of since .
We will assume the noise measurement to get some bounded moments of sufficiently high order, which is true for Gaussian or bounded noise. This assumption is given by
The last assumption concerns the sparse representation of the unknown signal described by in the basis . Such an hypothesis will be usefull to assess the statistical performance of the -boosting and will be refered as in the sequel. It is legitimate by our high dimension setting and our motivation to identify the main interactions .
The true parameter satisfies uniformly with
It is possible to relax this former condition and let growing to as . The price to pay to face such a situation is then a more restrictive condition on the number of variables . We refer to Bühlmann (2006) for a short discussion on a related problem and will only consider the situation described by for the sake of simplicity.
3.2 Main results
We first provide our main result on the efficiency of the EHOFD (Algorithm 1).
Assume that holds with (resp. ) given by (resp. ). Then, if , the sequence of estimators satisfies:
The proof of this Theorem is deferred to the Appendix section. Our second main result concerns the -boosting which recovers the unknown up to a preprocessing estimation of on a first sample . Such a result is satisfied provided the sparsity Assumptions . We assume that
where is the true parameter that expands .
Theorem 2 (Consistency of the -boosting).
If we assume that and are satisfied with , then there exists a sequence , with such that
We briefly describe the proof and postpone the technical details to the Appendix section.
Sketch of Proof of Theorem 2.
Further, following the work of Champion, Cierco-Ayrolles, Gadat and Vignes (2013), we introduce a phantom residual in order to reproduce the behaviour of a deterministic boosting, studied in Temlyakov (2000). This phantom algorithm is the theoretical -boosting, performed using the randomly chosen elements of the dictionary by Equations (2.8) and (6.13), but updated using the deterministic inner product. The phantom residuals , , are defined as follows,
We then have to show that each term of the right-hand side of (3.3) converges towards zero in probability. ∎
4. Numerical Applications
In this section, we are interested by the numerical efficiency of the Two-Steps procedure given in Section 2, and we primarily focus on the practical use of the HOFD through sensitivity analysis (SA). The goal of SA is to identify and to rank the input variables that drive the uncertainty of the model output. For further details, the reader may refer to Saltelli, Chan and Scott (2000); Cacuci, Ionescu-Bujor and Navon (2005). Therefore, the HOFD presented in Paragraph 2.2 is of great interest, because it may be used to decompose the global variance of the model. Here, as each HOFD is subject to hierarchical orthogonality constraints given in Definition 1, we obtain that
Therefore, to measure the contribution of , for , in terms of variability in the model, it is then quite natural to define a sensitivity index as follows,
This definition is given and discussed in Chastaing, Gamboa and Prieur (2012). In practice, once we have applied the procedure described in Algorithm 3 to get , it is straightforward to deduce the empirical estimation of , for all . In the following, we are mostly interested by the estimation of the first and second order sensitivity indices (i.e. and , ).
We end the work with a short simulation study and we are primarily interested by the performance of the greedy selection algorithm for the prediction of generalized sensitivity indices. As the estimation of these indices consists in estimating the summands of the generalized functional ANOVA decomposition (called HOFD), we start by constructing a hierarchically orthogonal system of functions to approximate the components. As pointed above (see Assumption in Theorem 1 and 2), the invertibility of each linear system plays an important role in our theoretical study. We hence have measured for each situation the degeneracy of involved matrices given by
Then, we use a variable selection method to select a sparse number of predictors. The goal is to numerically compare three variable selection methods: the -boosting, the Forward-Backward greedy algorithm (refered as FoBa in the sequel), and the Lasso estimator. As pointed above, we have in hand a -sample of i.i.d. observations broken up into two samples of size . The first sample is used to construct the system of functions according to Algorithm 1. Let us now briefly describe how we use the Lasso and the FoBa. Each of the three selection methods aims to solve a generic minimization problem