Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

# Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Francis Bach
francis.bach@mines.org
INRIA - WILLOW Project-Team
Laboratoire d’Informatique de l’Ecole Normale Supérieure
(CNRS/ENS/INRIA UMR 8548)
45, rue d’Ulm, 75230 Paris, France
###### Abstract

For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the -norm or the block -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

## 1 Introduction

In the last two decades, kernel methods have been a prolific theoretical and algorithmic machine learning framework. By using appropriate regularization by Hilbertian norms, representer theorems enable to consider large and potentially infinite-dimensional feature spaces while working within an implicit feature space no larger than the number of observations. This has led to numerous works on kernel design adapted to specific data types and generic kernel-based algorithms for many learning tasks (see, e.g., [1, 2]).

Regularization by sparsity-inducing norms, such as the -norm has also attracted a lot of interest in recent years. While early work has focused on efficient algorithms to solve the convex optimization problems, recent research has looked at the model selection properties and predictive performance of such methods, in the linear case [3] or within the multiple kernel learning framework [4].

In this paper, we aim to bridge the gap between these two lines of research by trying to use -norms inside the feature space. Indeed, feature spaces are large and we expect the estimated predictor function to require only a small number of features, which is exactly the situation where -norms have proven advantageous. This leads to two natural questions that we try to answer in this paper: (1) Is it feasible to perform optimization in this very large feature space with cost which is polynomial in the size of the input space? (2) Does it lead to better predictive performance and feature selection?

More precisely, we consider a positive definite kernel that can be expressed as a large sum of positive definite basis or local kernels. This exactly corresponds to the situation where a large feature space is the concatenation of smaller feature spaces, and we aim to do selection among these many kernels, which may be done through multiple kernel learning [5]. One major difficulty however is that the number of these smaller kernels is usually exponential in the dimension of the input space and applying multiple kernel learning directly in this decomposition would be intractable.

In order to peform selection efficiently, we make the extra assumption that these small kernels can be embedded in a directed acyclic graph (DAG). Following [6, 7], we consider in Section 2 a specific combination of -norms that is adapted to the DAG, and will restrict the authorized sparsity patterns; in our specific kernel framework, we are able to use the DAG to design an optimization algorithm which has polynomial complexity in the number of selected kernels (Section 3). In simulations (Section 5), we focus on directed grids, where our framework allows to perform non-linear variable selection. We provide extensive experimental validation of our novel regularization framework; in particular, we compare it to the regular -regularization and shows that it is always competitive and often leads to better performance, both on synthetic examples, and standard regression and classification datasets from the UCI repository.

Finally, we extend in Section 4 some of the known consistency results of the Lasso and multiple kernel learning [3, 4], and give a partial answer to the model selection capabilities of our regularization framework by giving necessary and sufficient conditions for model consistency. In particular, we show that our framework is adapted to estimating consistently only the hull of the relevant variables. Hence, by restricting the statistical power of our method, we gain computational efficiency.

## 2 Hierarchical multiple kernel learning (HKL)

We consider the problem of predicting a random variable from a random variable , where and may be quite general spaces. We assume that we are given i.i.d. observations , . We define the empirical risk of a function from to as , where is a loss function. We only assume that is convex with respect to the second parameter (but not necessarily differentiable). Typical examples of loss functions are the square loss for regression, i.e., for , and the logistic loss or the hinge loss for binary classification, where , leading respectively to logistic regression and support vector machines. Other losses may be used for other settings (see, e.g., [2] or the Appendix).

### 2.1 Graph-structured positive definite kernels

We assume that we are given a positive definite kernel , and that this kernel can be expressed as the sum, over an index set , of basis kernels , , i.e, for all , . For each , we denote by and the feature space and feature map of , i.e., for all , . Throughout the paper, we denote by the Hilbertian norm of and by the associated dot product, where the precise space is omitted and can always be inferred from the context.

Our sum assumption corresponds to a situation where the feature map and feature space for is the concatenation of the feature maps for each kernel , i.e, and . Thus, looking for a certain and a predictor function is equivalent to looking jointly for , for all , and .

As mentioned earlier, we make the assumption that the set can be embedded into a directed acyclic graph. Directed acyclic graphs (referred to as DAGs) allow to naturally define the notions of parents, children, descendants and ancestors. Given a node , we denote by the set of its ancestors, and by , the set of its descendants. We use the convention that any is a descendant and an ancestor of itself, i.e., and . Moreover, for , we let denote the set of sources of the graph restricted to (i.e., nodes in with no parents belonging to ). Given a subset of nodes , we can define the hull of as the union of all ancestors of , i.e., . Given a set , we define the set of extreme points of as the smallest subset such that (note that it is always well defined, as ). See Figure 1 for examples of these notions.

The goal of this paper is to perform kernel selection among the kernels , . We essentially use the graph to limit the search to specific subsets of . Namely, instead of considering all possible subsets of active (relevant) vertices, we are only interested in estimating correctly the hull of these relevant vertices; in Section 2.2, we design a specific sparsity-inducing norms adapted to hulls.

In this paper, we primarily focus on kernels that can be expressed as “products of sums”, and on the associated -dimensional directed grids, while noting that our framework is applicable to many other kernels. Namely, we assume that the input space factorizes into components and that we are given sequences of length of kernels , , , such that . We thus have a sum of kernels, that can be computed efficiently as a product of sums. A natural DAG on is defined by connecting each to , , . As shown in Section 2.2, this DAG will correspond to the constraint of selecting a given product of kernels only after all the subproducts are selected. Those DAGs are especially suited to nonlinear variable selection, in particular with the polynomial and Gaussian kernels. In this context, products of kernels correspond to interactions between certain variables, and our DAG implies that we select an interaction only after all sub-interactions were already selected.

Polynomial kernels    We consider , ; the full kernel is then equal to . Note that this is not exactly the usual polynomial kernel (whose feature space is the space of multivariate polynomials of total degree less than ), since our kernel considers polynomials of maximal degree .

Gaussian kernels    We also consider , and the Gaussian-RBF kernel . The following decomposition is the eigendecomposition of the non centered covariance operator for a normal distribution with variance (see, e.g., [8]):

 e−b(x−x′)2=∑∞k=0(b/A)k2kk![e−bA(a+c)x2Hk(√2cx)][e−bA(a+c)(x′)2Hk(√2cx′)],

where , , and is the -th Hermite polynomial. By appropriately truncating the sum, i.e, by considering that the first basis kernels are obtained from the first single Hermite polynomials, and the -th kernel is summing over all other kernels, we obtain a decomposition of a uni-dimensional Gaussian kernel into components ( of them are one-dimensional, the last one is infinite-dimensional, but can be computed by differencing). The decomposition ends up being close to a polynomial kernel of infinite degree, modulated by an exponential [2]. One may also use an adaptive decomposition using kernel PCA (see, e.g., [2, 1]), which is equivalent to using the eigenvectors of the empirical covariance operator associated with the data (and not the population one associated with the Gaussian distribution with same variance). In simulations, we tried both with no significant differences.

Finally, by taking product over all variables, we obtain a decomposition of the -dimensional Gaussian kernel into components, that are adapted to nonlinear variable selection. Note that for , we obtain ANOVA-like decompositions [2].

Kernels or features?    In this paper, we emphasize the kernel view, i.e., we are given a kernel (and thus a feature space) and we explore it using -norms. Alternatively, we could use the feature view, i.e., we have a large structured set of features that we try to select from; however, the techniques developed in this paper assume that (a) each feature might be infinite-dimensional and (b) that we can sum all the local kernels efficiently (see in particular Section 3.2). Following the kernel view thus seems slightly more natural.

### 2.2 Graph-based structured regularization

Given , the natural Hilbertian norm is defined through . Penalizing with this norm is efficient because summing all kernels is assumed feasible in polynomial time and we can bring to bear the usual kernel machinery; however, it does not lead to sparse solutions, where many will be exactly equal to zero.

As said earlier, we are only interested in the hull of the selected elements , ; the hull of a set is characterized by the set of , such that , i.e., such that all descendants of are in the complement : . Thus, if we try to estimate , we need to determine which are such that . In our context, we are hence looking at selecting vertices for which .

We thus consider the following structured block -norm defined as

 ∑v∈Vdv∥βD(v)∥=∑v∈Vdv(∑w∈D(v)∥βw∥2)1/2,

where are positive weights. Penalizing by such a norm will indeed impose that some of the vectors are exactly zero. We thus consider the following minimization problem111Following [5], we consider the square of the norm, which does not change the regularization properties, but allow simple links with multiple kernel learning.:

 minβ∈∏v∈VFv1n∑ni=1ℓ(yi,∑v∈V⟨βv,Φv(xi)⟩)+λ2(∑v∈Vdv∥βD(v)∥)2. (1)

Our Hilbertian norm is a Hilbert space instantiation of the hierarchical norms recently introduced by [6]. If all Hilbert spaces are finite dimensional, our particular choice of norms corresponds to an “-norm of -norms”. While with uni-dimensional groups or kernels, the “-norm of -norms” allows an efficient path algorithm for the square loss and when the DAG is a tree [6], this is not possible anymore with groups of size larger than one, or when the DAG is a not a tree. In Section 3, we propose a novel algorithm to solve the associated optimization problem in time polynomial in the number of selected groups or kernels, for all group sizes, DAGs and losses. Moreover, in Section 4, we show under which conditions a solution to the problem in Eq. (1) consistently estimates the hull of the sparsity pattern.

Finally, note that in certain settings (finite dimensional Hilbert spaces and distributions with absolutely continuous densities), these norms have the effect of selecting a given kernel only after all of its ancestors [6]. This is another explanation why hulls end up being selected, since to include a given vertex in the models, the entire set of ancestors must also be selected.

## 3 Optimization problem

In this section, we give optimality conditions for the problems in Eq. (1), as well as optimization algorithms with polynomial time complexity in the number of selected kernels. In simulations we consider total numbers of kernels larger than , and thus such efficient algorithms are essential to the success of hierarchical multiple kernel learning (HKL).

### 3.1 Reformulation in terms of multiple kernel learning

Following [9, 10], we can simply derive an equivalent formulation of Eq. (1). Using Cauchy-Schwarz inequality, we have that for all such that and ,

 (∑v∈Vdv∥βD(v)∥)2⩽∑v∈V∥βD(v)∥2ηv=∑w∈V(∑v∈A(w)η−1v)∥βw∥2,

with equality if and only if . We associate to the vector , the vector such that , . We use the natural convention that if is equal to zero, then is equal to zero for all descendants of . We let denote the set of allowed and the set of all associated . The set and are in bijection, and we can interchangeably use or the corresponding . Note that is in general not convex (unless the DAG is a tree, see the Appendix), and if , then for all , i.e., weights of descendant kernels are smaller, which is consistent with the known fact that kernels should always be selected after all their ancestors.

The problem in Eq. (1) is thus equivalent to

 minη∈Hminβ∈∏v∈VFv1n∑ni=1ℓ(yi,∑v∈V⟨βv,Φv(xi)⟩)+λ2∑w∈Vζw(η)−1∥βw∥2. (2)

Using the change of variable and , this implies that given the optimal (and associated ), corresponds to the solution of the regular supervised learning problem with kernel matrix , where is the kernel matrix associated with kernel . Moreover, the solution is then , where are the dual parameters associated with the single kernel learning problem.

Thus, the solution is entirely determined by and (and its corresponding ). More precisely, we have (see proof in the Appendix):

###### Proposition 1

The pair is optimal for Eq. (1), with , if and only if given , is optimal for the single kernel learning problem with kernel matrix , and given , maximizes

 ∑w∈V(∑v∈A(w)η−1v)−1α⊤Kwα.

Moreover, the total duality gap can be upperbounded as the sum of the two separate duality gaps for the two optimization problems, which will be useful in Section 3.2 (see Appendix for more details). Note that in the case of “flat” regular multiple kernel learning, where the DAG has no edges, we obtain back usual optimality conditions [9, 10].

Following a common practice for convex sparsity problems [11], we will try to solve a small problem where we assume we know the set of such that is equal to zero (Section 3.3). We then “simply” need to check that variables in that set may indeed be left out of the solution. In the next section, we show that this can be done in polynomial time although the number of kernels to consider leaving out is exponential (Section 3.2).

### 3.2 Conditions for global optimality of reduced problem

We let denote the complement of the set of norms which are set to zero. We thus consider the optimal solution of the reduced problem (on ), namely,

 minβJ∈∏v∈JFv1n∑ni=1ℓ(yi,∑v∈J⟨βv,Φv(xi)⟩)+λ2(∑v∈Vdv∥βD(v)∩J∥)2, (3)

with optimal primal variables , dual variables and optimal pair . We now consider necessary conditions and sufficient conditions for this solution (augmented with zeros for non active variables, i.e., variables in ) to be optimal with respect to the full problem in Eq. (1). We denote by the optimal value of the norm for the reduced problem.

###### Proposition 2 (Nj)

If the reduced solution is optimal for the full problem in Eq. (1) and all kernels in the extreme points of are active, then we have

 maxt∈sources(Jc)α⊤Ktα/d2t⩽δ2.
###### Proposition 3 (SJ,ε)

If , then the total duality gap is less than .

The proof is fairly technical and can be found in the Appendix; this result constitutes the main technical contribution of the paper: it essentially allows to solve a very large optimization problem over exponentially many dimensions in polynomial time.

The necessary condition does not cause any computational problems. However, the sufficient condition requires to sum over all descendants of the active kernels, which is impossible in practice (as shown in Section 5, we consider of cardinal often greater than ). Here, we need to bring to bear the specific structure of the kernel . In the context of directed grids we consider in this paper, if can also be decomposed as a product, then is also factorized, and we can compute the sum over all in linear time in . Moreover we can cache the sums in order to save running time.

### 3.3 Dual optimization for reduced or small problems

When kernels , have low-dimensional feature spaces, we may use a primal representation and solve the problem in Eq. (1) using generic optimization toolboxes adapted to conic constraints (see, e.g., [12]). However, in order to reuse existing optimized supervised learning code and use high-dimensional kernels, it is preferable to use a dual optimization. Namely, we use the same technique as [9]: we consider for , the function , which is the optimal value of the single kernel learning problem with kernel matrix . Solving Eq. (2) is equivalent to minimizing with respect to .

If a ridge (i.e., positive diagonal) is added to the kernel matrices, the function is differentiable. Moreover, the function is differentiable on . Thus, the function , where is the vector with elements , is differentiable if . We can then use the same projected gradient descent strategy as [9] to minimize it. The overall complexity of the algorithm is then proportional to —to form the kernel matrices—plus the complexity of solving a single kernel learning problem—typically between and .

### 3.4 Kernel search algorithm

We are now ready to present the detailed algorithm which extends the feature search algorithm of [11]. Note that the kernel matrices are never all needed explicitly, i.e., we only need them (a) explicitly to solve the small problems (but we need only a few of those) and (b) implicitly to compute the sufficient condition , which requires to sum over all kernels, as shown in Section 3.2.

• Input: kernel matrices , , maximal gap , maximal of kernels

• Algorithm

1. Initialization: set ,
compute solutions of Eq. (3), obtained using Section 3.3

2. while and are not satisfied and

• If is not satisfied, add violating variables in to
else, add violating variables in of to

• Recompute optimal solutions of Eq. (3)

• Output: , ,

The previous algorithm will stop either when the duality gap is less than or when the maximal number of kernels has been reached. In practice, when the weights increase with the depth of in the DAG (which we use in simulations), the small duality gap generally occurs before we reach a problem larger than . Note that some of the iterations only increase the size of the active sets to check the sufficient condition for optimality; forgetting those does not change the solution, only the fact that we may actually know that we have an -optimal solution.

In the directed -grid case, the total running time complexity is a function of the number of observations , and the number of selected kernels; with proper caching, we obtain the following complexity, assuming for the single kernel learning problem, which is conservative: , which decomposes into solving single kernel learning problems, caching kernels, and computing quadratic forms for the sufficient conditions. Note that the kernel search algorithm is also an efficient algorithm for unstructured MKL.

## 4 Consistency conditions

As said earlier, the sparsity pattern of the solution of Eq. (1) will be equal to its hull, and thus we can only hope to obtain consistency of the hull of the pattern, which we consider in this section.

For simplicity, we consider the case of finite dimensional Hilbert spaces (i.e., ) and the square loss. We also hold fixed the vertex set of , i.e., we assume that the total number of features is fixed, and we let tend to infinity and decrease with .

Following [4], we make the following assumptions on the underlying joint distribution of : (a) the joint covariance matrix of (defined with appropriate blocks of size ) is invertible, (b) with and almost surely. With these simple assumptions, we obtain (see proof in the Appendix):

###### Proposition 4 (Sufficient condition)

If we have

 maxt∈sources(Wc)∑w∈D(t)∥ΣwWΣ−1WWDiag(dv∥βD(v)∥−1)v∈WβW∥2(∑v∈A(w)∩D(t)dv)2<1,

then and the hull of are consistently estimated when and .

###### Proposition 5 (Necessary condition)

If the and the hull of are consistently estimated for some sequence , then

 maxt∈sources(Wc)∥ΣwWΣ−1WWDiag(dv/∥βD(v)∥)v∈WβW∥2/d2t⩽1.

Note that the last two propositions are not consequences of the similar results for flat MKL [4], because the groups that we consider are overlapping. Moreover, the last propositions show that we indeed can estimate the correct hull of the sparsity pattern if the sufficient condition is satisfied. In particular, if we can make the groups such that the between-group correlation is as small as possible, we can ensure correct hull selection. Finally, it is worth noting that if the ratios tend to infinity slowly with , then we always consistently estimate the depth of the hull, i.e., the optimal interaction complexity. We are currently investigating extensions to the non parametric case [4], in terms of pattern selection and universal consistency.

## 5 Simulations

Synthetic examples    We generated regression data as follows: samples of variables were generated from a random covariance matrix, and the label was sampled as a random sparse fourth order polynomial of the input variables (with constant number of monomials). We then compare the performance of our hierarchical multiple kernel learning method (HKL) with the polynomial kernel decomposition presented in Section 2 to other methods that use the same kernel and/or decomposition: (a) the greedy strategy of selecting basis kernels one after the other, a procedure similar to [13], and (b) the regular polynomial kernel regularization with the full kernel (i.e., the sum of all basis kernels). In Figure 2, we compare the two approaches on 40 replications in the following two situations: original data (left) and rotated data (right), i.e., after the input variables were transformed by a random rotation (in this situation, the generating polynomial is not sparse anymore). We can see that in situations where the underlying predictor function is sparse (left), HKL outperforms the two other methods when the total number of variables increases, while in the other situation where the best predictor is not sparse (right), it performs only slightly better: i.e., in non sparse problems, -norms do not really help, but do help a lot when sparsity is expected.

UCI datasets    For regression datasets, we compare HKL with polynomial (degree 4) and Gaussian-RBF kernels (each dimension decomposed into 9 kernels) to the following approaches with the same kernel: regular Hilbertian regularization (L2), same greedy approach as earlier (greedy), regularization by the -norm directly on the vector , a strategy which is sometimes used in the context of sparse kernel learning [14] but does not use the Hilbertian structure of the kernel (lasso-), multiple kernel learning with the kernels obtained by summing all kernels associated with a single variable, a strategy suggested by [5] (MKL). For all methods, the kernels were held fixed, while in Table 1, we report the performance for the best regularization parameters obtained by 10 random half splits.

We can see from Table 1, that HKL outperforms other methods, in particular for the datasets bank-32nm, bank-32nh, pumadyn-32nm, pumadyn-32nh, which are datasets dedicated to non linear regression. Note also, that we efficiently explore DAGs with very large numbers of vertices .