SpicyMKL

Taiji SUZUKI and Ryota TOMIOKA

Department of Mathematical Informatics
Graduate School of Information Science and Technology
The University of Tokyo
{t-suzuki,tomioka}@mist.i.u-tokyo.ac.jp
Abstract

We propose a new optimization algorithm for Multiple Kernel Learning (MKL) called SpicyMKL, which is applicable to general convex loss functions and general types of regularization. The proposed SpicyMKL iteratively solves smooth minimization problems; thus, there is no need to solve SVM, LP, or QP problems internally. SpicyMKL can be viewed as a proximal minimization method and converges super-linearly. The cost of the inner minimization is roughly proportional to the number of active kernels. Therefore, when we aim for a sparse kernel combination, our algorithm scales well with an increasing number of kernels. Moreover, we give a general block-norm formulation of MKL that includes non-sparse regularizations, such as elastic-net and $\ell_p$-norm regularizations. Extending SpicyMKL, we propose an efficient optimization method for the general regularization framework. Experimental results show that our algorithm is faster than existing methods, especially when the number of kernels is large (> 1000).

1 Introduction

Kernel methods are powerful nonparametric methods in machine learning and data analysis. Typically a kernel method fits a decision function that lies in some Reproducing Kernel Hilbert Space (RKHS) (Aronszajn, 1950; Schölkopf and Smola, 2002). In such a learning framework, the choice of a kernel function can strongly influence the performance. Instead of using a fixed kernel, Multiple Kernel Learning (MKL) (Lanckriet et al., 2004; Bach et al., 2004; Micchelli and Pontil, 2005) aims to find an optimal combination of multiple candidate kernels.

More specifically, we assume that a data point lies in a space $\mathcal{X}$ and we are given $M$ candidate kernel functions $k_m : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ ($m = 1, \dots, M$). Each kernel function corresponds to one data source. A conical combination of $k_m$ ($m = 1, \dots, M$) gives the combined kernel function $\bar{k}(x, x') = \sum_{m=1}^M d_m k_m(x, x')$, where $d_m \ge 0$ is a nonnegative weight. Our goal is to find a good set of kernel weights based on some training examples.
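As a concrete illustration, the conical combination above amounts to a weighted sum of Gram matrices. The following minimal NumPy sketch builds a few Gaussian candidate kernels (the bandwidth values and function names are illustrative only) and combines them with nonnegative weights $d_m$:

```python
import numpy as np

def gaussian_gram(X, bandwidth):
    """Gram matrix K with K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

X = np.random.randn(50, 5)                       # 50 samples, 5 features (toy data)
bandwidths = [0.5, 1.0, 2.0]                     # one candidate kernel per bandwidth
grams = [gaussian_gram(X, s) for s in bandwidths]

d = np.array([0.2, 0.5, 0.3])                    # nonnegative kernel weights d_m
K_combined = sum(d_m * K_m for d_m, K_m in zip(d, grams))
```

MKL replaces the hand-picked weight vector `d` in this sketch by weights learned from the training data.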

While the MKL framework has opened up the possibility of combining multiple heterogeneous data sources (numerical features, texts, links) in a principled manner, it also poses an optimization challenge: the number of kernels can be very large. Instead of arguing whether a polynomial kernel or a Gaussian kernel fits a given problem better, we can simply put both of them into an MKL algorithm; instead of evaluating which bandwidth parameter to choose, we can simply generate kernels from all the possible combinations of parameter values and feed them to an MKL algorithm. See Gehler and Nowozin (2009); Tomioka and Suzuki (2009).

Various optimization algorithms have been proposed in the context of MKL. In the pioneering work of Lanckriet et al. (2004), MKL was formulated as a semi-definite programming (SDP) problem. Bach et al. (2004) showed that the SDP can be reduced to a second-order cone programming (SOCP) problem. However, solving an SDP or SOCP with a general-purpose convex optimization solver can be quite heavy, especially when the number of samples is large.

More recently, wrapper methods have been proposed. A wrapper method iteratively solves a single-kernel learning problem (e.g., SVM) for a given kernel combination and then updates the kernel weights. A nice property of this type of method is that it can make use of existing well-tuned solvers for SVM. The Semi-Infinite Linear Programming (SILP) approach proposed by Sonnenburg et al. (2006) utilizes a cutting-plane method for the update of the kernel weights. SILP often suffers from instability of the solution sequence, especially when the number of kernels is large, i.e., the intermediate solution oscillates around the optimal one (Rakotomamonjy et al., 2008). SimpleMKL, proposed by Rakotomamonjy et al. (2008), performs a reduced gradient descent on the kernel weights. SimpleMKL resolves the drawback of SILP, but it is still a first-order method. Xu et al. (2009) proposed a novel level method (which we call LevelMKL) as an improvement over SILP and SimpleMKL. LevelMKL is more efficient than SimpleMKL and scales well with the number of kernels by utilizing sparsity, but it shows unstable behavior as the algorithm proceeds because it solves linear programming (LP) and quadratic programming (QP) problems of increasingly large size as the iteration proceeds. HessianMKL, proposed by Chapelle and Rakotomamonjy (2008), replaces the gradient descent update of SimpleMKL with a Newton update. At each iteration, HessianMKL solves a QP problem whose size is the number of kernels to obtain the Newton update direction. HessianMKL shows second-order convergence, but scales poorly with the number of kernels because the size of the QP grows as the number of kernels grows.

The first contribution of this article is an efficient optimization algorithm for MKL based on the block 1-norm formulation introduced in Bach et al. (2005) (see also Bach et al. (2004)); see Eq. (4). The block 1-norm formulation can be viewed as a kernelized version of group lasso (Yuan and Lin, 2006; Bach, 2008). For group lasso, or more generally sparse estimation, efficient optimization algorithms have recently been studied intensively, boosted by the development of compressive sensing theory (Candes et al., 2006). Based on this view, we extend the dual augmented-Lagrangian (DAL) algorithm (Tomioka and Sugiyama, 2009) recently proposed in the context of sparse estimation to kernel-based learning. DAL is efficient when the number of unknown variables is much larger than the number of samples. This enables us to scale the proposed algorithm, which we call SpicyMKL, to thousands of kernels.

Compared to the original DAL algorithm (Tomioka and Sugiyama, 2009), our presentation is based on an application of the proximal minimization framework (Rockafellar, 1976) to the primal MKL problem. We believe that the current formulation is more transparent than the dual-based formulation in Tomioka and Sugiyama (2009), because we are not necessarily interested in solving the dual problem. Moreover, we present a theorem on the rate of convergence.

Unlike previous approaches, SpicyMKL does not need to solve SVM, LP, or QP problems internally. Instead, it minimizes a smooth function at every step. The cost of the inner minimization is proportional to the number of active kernels. Therefore, when we aim for a sparse kernel combination, the proposed algorithm is efficient even when thousands of candidate kernels are used. In fact, we show numerically that we are able to train a classifier with 3000 kernels in less than 10 seconds.

Learning a combination of kernels, however, has recently been recognized as a more complex task than initially thought. Cortes (2009) pointed out that learning a convex kernel combination with an $\ell_1$-constraint on the kernel weights (see Section 2) produces an overly sparse solution (many kernel weights are zero), and that it is often outperformed by a simple uniform combination of kernels; accordingly, an $\ell_2$-constraint was proposed instead (Cortes et al., 2009). In order to search for the best trade-off between the sparse $\ell_1$-MKL and the uniform weight combination, Kloft et al. (2009) proposed a general $\ell_p$-norm constraint and Tomioka and Suzuki (2009) proposed an elastic-net regularization, both of which smoothly connect the 1-norm MKL and the uniform weight combination.

The second contribution of this paper is an extension of the block-norm formulation that allows us to view these generalized MKL models in a unified way, together with an efficient optimization algorithm. We note that while this paper was under review, Kloft et al. (2010) presented a slightly different approach that results in a similar optimization algorithm. However, our formulation provides a clearer relationship between the block-norm formulation and the kernel-combination weights, and it is more general.

This article is organized as follows. In Section 2, we introduce the framework of block 1-norm MKL through Tikhonov regularization on the kernel weights. In Section 3, we propose an extension of the DAL algorithm to the kernel-learning setting. Our formulation of the DAL algorithm is based on a primal application of the proximal minimization framework (Rockafellar, 1976), which also sheds new light on the DAL algorithm itself. Furthermore, we discuss how the inner minimization can be carried out efficiently by exploiting the sparsity of the 1-norm MKL. In Section 4, we extend our framework to a general class of regularizers including the $\ell_p$-norm MKL (Kloft et al., 2009) and Elastic-net MKL (Tomioka and Suzuki, 2009). We extend the proposed SpicyMKL algorithm to the generalized formulation and also present a simple one-step optimization procedure for some special cases that include Elastic-net MKL and $\ell_p$-norm MKL. In Section 5, we discuss the relations between existing methods and the proposed method. In Section 6, we show the results of numerical experiments. The experiments show that SpicyMKL is efficient for block 1-norm regularization, especially when the number of kernels is large. Moreover, the one-step optimization procedure for elastic-net regularization converges quite fast; in fact, it is faster than the methods with block 1-norm regularization. Finally, we summarize our contributions in Section 7. The proof of super-linear convergence of the proposed SpicyMKL is given in Appendix A.

A Matlab implementation of SpicyMKL is available at the following URL:

http://www.simplex.t.u-tokyo.ac.jp/~s-taiji/software/SpicyMKL

2 Framework of MKL

In this section, we first consider a learning problem with fixed kernel weights in Section 2.1. Next, in Section 2.2, using Tikhonov regularization on the kernel weights, we derive a block 1-norm formulation of MKL, which can be considered a direct extension of group lasso to the kernel-based learning setting. In addition, we discuss the connection between the block 1-norm formulation and the squared block 1-norm formulation. In Section 2.3, we present a finite-dimensional version of the proposed formulation and prepare the notation for the later sections.

2.1 Fixed kernel combination

We assume that we are given $N$ samples $\{(x_i, y_i)\}_{i=1}^N$, where $x_i$ belongs to an input space $\mathcal{X}$ and $y_i$ belongs to an output space $\mathcal{Y}$ (usual settings are $\mathcal{Y} = \{\pm 1\}$ for classification and $\mathcal{Y} = \mathbb{R}$ for regression). We define the Gram matrix with respect to the kernel function $k_m$ as $K_m = \big(k_m(x_i, x_j)\big)_{i,j=1}^N$. We assume that each Gram matrix is positive definite (to avoid numerical instability, we added a small positive constant to the diagonal elements of $K_m$ in the numerical experiments).

We first consider a learning problem with fixed kernel weights. More specifically, we fix non-negative kernel weights $\boldsymbol{d} = (d_1, \dots, d_M)^\top$ and consider the RKHS $\bar{\mathcal{H}}_{\boldsymbol{d}}$ corresponding to the combined kernel function $\bar{k} = \sum_{m=1}^M d_m k_m$. The (squared) RKHS norm of a function $f$ in $\bar{\mathcal{H}}_{\boldsymbol{d}}$ is written as follows (see Sec. 6 in Aronszajn (1950), and also Micchelli and Pontil (2005)):

$\|f\|_{\bar{\mathcal{H}}_{\boldsymbol{d}}}^2 = \min_{f_1, \dots, f_M} \Big\{ \sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} \;:\; f = \sum_{m=1}^M f_m,\ f_m \in \mathcal{H}_m\ (m = 1, \dots, M) \Big\}, \qquad (1)$

where $\mathcal{H}_m$ is the RKHS that corresponds to the kernel function $k_m$. Accordingly, with a fixed kernel combination, a supervised learning problem can be written as follows:

$\min_{f_m \in \mathcal{H}_m,\, b \in \mathbb{R}}\ \sum_{i=1}^N \ell\Big(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Big) + \frac{C}{2}\sum_{m=1}^M \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m}, \qquad (2)$

where $b \in \mathbb{R}$ is a bias term and $\ell$ is a loss function, which can be the hinge loss or the logistic loss for a classification problem, or the squared loss or the SVR loss for a regression problem. The above formulation may not seem useful in practice, because we could simply compute the combined kernel function $\bar{k}$ and optimize over $f \in \bar{\mathcal{H}}_{\boldsymbol{d}}$ instead of optimizing the $M$ functions $f_1, \dots, f_M$. However, explicitly handling the kernel weights allows us to consider various generalizations of MKL in a unified manner.

2.2 Learning kernel weights

In order to learn the kernel weights through the objective (2), there is clearly a need for regularization, because the objective is a decreasing function of the kernel weights $d_m$. Roughly speaking, the kernel weight $d_m$ controls the complexity allowed for the $m$-th classifier component $f_m$; without regularization on the weights, we can get serious over-fitting.

One way to prevent such over-fitting is to penalize large kernel weights by adding a penalty term. Adding a linear penalty term on $d_m$, we obtain the following optimization problem:

$\min_{f_m \in \mathcal{H}_m,\, b,\, d_m \ge 0}\ \sum_{i=1}^N \ell\Big(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Big) + \frac{C}{2}\sum_{m=1}^M \left( \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + d_m \right). \qquad (3)$

The above formulation reduces to the block 1-norm formulation introduced in Bach et al. (2005) by explicitly minimizing over the kernel weights $d_m$ as follows:

$\min_{f_m \in \mathcal{H}_m,\, b}\ \sum_{i=1}^N \ell\Big(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Big) + C\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}, \qquad (4)$

where we used the inequality of arithmetic and geometric means; the minimum with respect to $d_m$ is attained at $d_m = \|f_m\|_{\mathcal{H}_m}$.
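Written out term by term, the elimination of the kernel weights in (3) is a one-line application of the arithmetic-geometric mean inequality: for every $d_m > 0$,

$\frac{C}{2}\left( \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m} + d_m \right) \ \ge\ C\sqrt{\frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m}\cdot d_m} \ =\ C\,\|f_m\|_{\mathcal{H}_m},$

with equality if and only if $d_m = \|f_m\|_{\mathcal{H}_m}$. Minimizing (3) over $d_m \ge 0$ therefore replaces each penalty term by $C\|f_m\|_{\mathcal{H}_m}$, which gives (4).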

The regularization term in the above block 1-norm formulation is the linear sum of RKHS norms. This formulation can be seen as a direct generalization of group lasso (Yuan and Lin, 2006) to the kernel-based learning setting, and motivates us to extend an efficient algorithm for sparse estimation to MKL.

The block 1-norm formulation (4) is related to the following squared block 1-norm formulation considered in Bach et al. (2004); Sonnenburg et al. (2006); Zien and Ong (2007); Rakotomamonjy et al. (2008):

$\min_{f_m \in \mathcal{H}_m,\, b}\ \sum_{i=1}^N \ell\Big(y_i,\ \sum_{m=1}^M f_m(x_i) + b\Big) + \frac{\tilde{C}}{2}\Big(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}\Big)^2, \qquad (5)$

which is obtained by imposing a simplex constraint $\sum_{m=1}^M d_m = 1$ ($d_m \ge 0$) on the kernel weights (Kloft et al., 2009) instead of penalizing them as in (3).

The solutions of the two problems (4) and (5) can be mapped to each other. In fact, let $(\hat{f}_1, \dots, \hat{f}_M, \hat{b})$ be the minimizer of the block 1-norm formulation (4) with regularization parameter $C$ and let $\tilde{C}$ be

$\tilde{C} = \frac{C}{\sum_{m'=1}^M \|\hat{f}_{m'}\|_{\mathcal{H}_{m'}}}. \qquad (6)$

Then $(\hat{f}_1, \dots, \hat{f}_M, \hat{b})$ also minimizes the squared block 1-norm formulation (5) with the regularization parameter $\tilde{C}$ because of the relation

$\partial_{f_m}\!\left[\frac{\tilde{C}}{2}\Big(\sum_{m'=1}^M \|f_{m'}\|_{\mathcal{H}_{m'}}\Big)^{\!2}\right]_{f=\hat{f}} = \tilde{C}\Big(\sum_{m'=1}^M \|\hat{f}_{m'}\|_{\mathcal{H}_{m'}}\Big)\,\partial_{f_m}\!\left[\sum_{m'=1}^M \|f_{m'}\|_{\mathcal{H}_{m'}}\right]_{f=\hat{f}} = C\,\partial_{f_m}\!\left[\sum_{m'=1}^M \|f_{m'}\|_{\mathcal{H}_{m'}}\right]_{f=\hat{f}},$

where $\partial_{f_m}$ denotes the subdifferential with respect to $f_m$.

2.3 Representer theorem

In this subsection, we convert the block 1-norm MKL formulation (4) into a finite dimensional optimization problem via the representer theorem and prepare notation for later sections.

The optimal solution of (4) is attained in the form $f_m(\cdot) = \sum_{i=1}^N \alpha_{m,i}\, k_m(\cdot, x_i)$ ($m = 1, \dots, M$) due to the representer theorem (Kimeldorf and Wahba, 1971). Thus, the optimization problem (4) reduces to the following finite-dimensional optimization problem:

$\min_{\alpha_1, \dots, \alpha_M \in \mathbb{R}^N,\ b \in \mathbb{R}}\ \sum_{i=1}^N \ell\Big(y_i,\ \sum_{m=1}^M (K_m\alpha_m)_i + b\Big) + C\sum_{m=1}^M \|\alpha_m\|_{K_m},$

where $\alpha_m = (\alpha_{m,1}, \dots, \alpha_{m,N})^\top \in \mathbb{R}^N$, $K_m$ is the Gram matrix defined in Section 2.1, and the norm $\|\cdot\|_{K_m}$ is defined through the inner product $\langle\boldsymbol{\alpha}, \boldsymbol{\beta}\rangle_{K_m} = \boldsymbol{\alpha}^\top K_m \boldsymbol{\beta}$ for $\boldsymbol{\alpha}, \boldsymbol{\beta} \in \mathbb{R}^N$. We also define the norm $\|\cdot\|_{K_m^{-1}}$ through the inner product $\langle\boldsymbol{\alpha}, \boldsymbol{\beta}\rangle_{K_m^{-1}} = \boldsymbol{\alpha}^\top K_m^{-1} \boldsymbol{\beta}$ for $\boldsymbol{\alpha}, \boldsymbol{\beta} \in \mathbb{R}^N$.

For simplicity we rewrite the above problem as

$\min_{\boldsymbol{\alpha} \in \mathbb{R}^{NM},\ b \in \mathbb{R}}\ \Big\{ f(\boldsymbol{\alpha}, b) := L\Big(\sum_{m=1}^M K_m\alpha_m + b\mathbf{1}\Big) + \sum_{m=1}^M \phi_m^C(\alpha_m) \Big\}, \qquad (7)$

where $L(\boldsymbol{z}) := \sum_{i=1}^N \ell(y_i, z_i)$ for $\boldsymbol{z} \in \mathbb{R}^N$, $\mathbf{1} = (1, \dots, 1)^\top \in \mathbb{R}^N$, $\boldsymbol{\alpha} = (\alpha_1^\top, \dots, \alpha_M^\top)^\top$, and $\phi_m^C(\alpha_m) := C\|\alpha_m\|_{K_m}$ denotes the regularization term.
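As a sanity check on this finite-dimensional form, the objective can be evaluated directly (without any attempt at efficiency). The sketch below assumes the logistic loss and the notation above; the function name is illustrative:

```python
import numpy as np

def block_l1_objective(alphas, b, grams, y, C):
    """Evaluate L(sum_m K_m alpha_m + b 1) + C sum_m ||alpha_m||_{K_m}
    for the logistic loss, where ||alpha||_K = sqrt(alpha^T K alpha)."""
    z = sum(K @ a for K, a in zip(grams, alphas)) + b
    data_term = np.sum(np.logaddexp(0.0, -y * z))                 # sum_i log(1 + exp(-y_i z_i))
    reg_term = C * sum(np.sqrt(a @ K @ a) for K, a in zip(grams, alphas))
    return data_term + reg_term
```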

3 A dual augmented-Lagrangian method for MKL

In this section, we first present an extension of the dual augmented-Lagrangian (DAL) algorithm to the kernel-learning problem through a new approach based on the proximal minimization framework (Rockafellar, 1976). Second, assuming that the loss function is twice differentiable, we discuss how we can compute each minimization step efficiently in Section 3.3. Finally, the method is extended to the situation where the loss function is not differentiable in Section 3.4.

3.1 MKL optimization via proximal minimization

Starting from some initial solution $(\boldsymbol{\alpha}^{(0)}, b^{(0)})$, the proximal minimization algorithm (Rockafellar, 1976) iteratively minimizes the objective (7) together with proximity terms as follows:

$\big(\boldsymbol{\alpha}^{(t+1)}, b^{(t+1)}\big) = \mathop{\mathrm{argmin}}_{\boldsymbol{\alpha} \in \mathbb{R}^{NM},\, b \in \mathbb{R}} \Big\{ f(\boldsymbol{\alpha}, b) + \frac{1}{2\gamma_t}\sum_{m=1}^M \big\|\alpha_m - \alpha_m^{(t)}\big\|_{K_m}^2 + \frac{1}{2\gamma_t}\big(b - b^{(t)}\big)^2 \Big\}, \qquad (8)$

where $\gamma_t$ is a nondecreasing sequence of proximity parameters (typically we increase $\gamma_t$ exponentially; in practice, we can use different proximity parameters for each variable, e.g., one for $\boldsymbol{\alpha}$ and one for $b$, and choose them adaptively depending on the scales of the variables), $(\boldsymbol{\alpha}^{(t)}, b^{(t)})$ is the approximate minimizer obtained at the $t$-th iteration, and $f$ is the (regularized) objective function (7). The last two terms on the right-hand side are proximity terms that keep the next solution close to the current solution $(\boldsymbol{\alpha}^{(t)}, b^{(t)})$. Thus, we call the minimization problem (8) the proximal MKL problem. Solving the proximal MKL problem (8) seems as difficult as solving the original MKL problem in the primal. However, the dual of the proximal MKL problem (8) is a smooth minimization problem and can be solved efficiently, whereas the dual of the original MKL problem (7) is a non-smooth minimization problem and is not necessarily easier to solve than the primal.

The update equation can be interpreted as an implicit gradient method on the objective function $f$. In fact, taking a subgradient of the right-hand side of the update equation (8) and setting it to zero shows that

$-\frac{1}{\gamma_t}\Big( \big(K_m(\alpha_m^{(t+1)} - \alpha_m^{(t)})\big)_{m=1}^M,\ b^{(t+1)} - b^{(t)} \Big) \in \partial f\big(\boldsymbol{\alpha}^{(t+1)}, b^{(t+1)}\big).$

This implies that the $t$-th update step is (up to the scaling by $-1/\gamma_t$ and the metric of the proximity terms) a subgradient of the original objective function $f$ at the next solution $(\boldsymbol{\alpha}^{(t+1)}, b^{(t+1)})$.
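The behavior of the proximal update with an increasing proximity parameter can already be seen in one dimension. The following toy sketch (unrelated to MKL; the test function and parameter schedule are arbitrary) runs the generic proximal-point iteration on a simple strongly convex function and prints the rapidly shrinking distance to the minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: abs(x - 3.0) + 0.5 * (x - 3.0) ** 2       # minimized at x* = 3

x = 0.0
for gamma in [1.0, 2.0, 4.0, 8.0, 16.0]:                # exponentially increasing gamma_t
    # proximal step: x_{t+1} = argmin_u f(u) + (u - x_t)^2 / (2 * gamma)
    res = minimize_scalar(lambda u: f(u) + (u - x) ** 2 / (2.0 * gamma),
                          bounds=(-10.0, 10.0), method="bounded")
    x = res.x
    print(f"gamma = {gamma:5.1f}   |x - x*| = {abs(x - 3.0):.2e}")
```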

The super-linear convergence of the proximal minimization algorithm (Rockafellar, 1976; Bertsekas, 1982; Tomioka et al., 2011) can also be extended to the kernel-based learning setting, as shown in the following theorem.

Theorem 1.

Suppose the problem (7) has a unique optimal solution $(\boldsymbol{\alpha}^*, b^*)$ (uniqueness is assumed just for simplicity; the result can be generalized to a compact optimal solution set, see Bertsekas (1982)), and that there exist a scalar $\sigma > 0$ and a $\delta$-neighborhood ($\delta > 0$) of the optimal solution such that

$f(\boldsymbol{\alpha}, b) \ \ge\ f(\boldsymbol{\alpha}^*, b^*) + \sigma\Big( \sum_{m=1}^M \|\alpha_m - \alpha_m^*\|^2 + (b - b^*)^2 \Big) \qquad (9)$

for all $(\boldsymbol{\alpha}, b)$ in this neighborhood. Then for all sufficiently large $t$ we have

$\big\|\big(\boldsymbol{\alpha}^{(t+1)}, b^{(t+1)}\big) - \big(\boldsymbol{\alpha}^*, b^*\big)\big\| \ \le\ \frac{1}{\sqrt{1 + 2\sigma\gamma_t}}\, \big\|\big(\boldsymbol{\alpha}^{(t)}, b^{(t)}\big) - \big(\boldsymbol{\alpha}^*, b^*\big)\big\|.$

Therefore, if $\gamma_t \to \infty$, the solution converges to the optimal solution super-linearly.

Proof.

The proof is given in Appendix A. ∎

3.2 Derivation of SpicyMKL

Although directly minimizing the proximal MKL problem (8) is not a trivial task, its dual problem can be solved efficiently. Once we solve the dual of the proximal minimization update (8), we can update the primal variables $(\boldsymbol{\alpha}, b)$. The resulting iteration can be written as follows:

(10)
(11)
(12)

where $\varphi_t$ is the dual objective, which we derive in the sequel, and the proximity operator $\mathrm{prox}_\lambda$ corresponding to the regularizer $\phi_m^C$ is defined as follows:

$\mathrm{prox}_{\lambda}(\boldsymbol{v}) := \mathop{\mathrm{argmin}}_{\boldsymbol{x} \in \mathbb{R}^N} \Big\{ \lambda\|\boldsymbol{x}\|_{K_m} + \frac{1}{2}\|\boldsymbol{x} - \boldsymbol{v}\|_{K_m}^2 \Big\} = \max\Big(0,\ 1 - \frac{\lambda}{\|\boldsymbol{v}\|_{K_m}}\Big)\,\boldsymbol{v}. \qquad (13)$

The above operation is known as the soft-thresholding function in the field of sparse estimation (see Figueiredo and Nowak (2003); Daubechies et al. (2004)), and it has been applied to MKL earlier in Mosci et al. (2008). Intuitively speaking, it maps a vector whose norm is smaller than $\lambda$ to zero, and it shrinks a vector whose norm is larger than $\lambda$ toward zero, so that the transition at norm $\lambda$ is continuous (soft); see Figure 1 for a one-dimensional illustration.
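A direct implementation of this operation takes only a few lines. The following sketch uses the $K$-weighted norm as in (13); passing `K=None` gives the plain Euclidean case used in ordinary group lasso:

```python
import numpy as np

def block_soft_threshold(v, lam, K=None):
    """prox_lam(v) = max(0, 1 - lam / ||v||) * v, where ||v|| is either the
    Euclidean norm or the K-weighted norm sqrt(v^T K v)."""
    norm = np.sqrt(v @ K @ v) if K is not None else np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)          # thresholded exactly to zero
    return (1.0 - lam / norm) * v        # shrunk toward zero

# example: a vector whose norm is below the threshold is mapped to zero
K = np.eye(3)
print(block_soft_threshold(np.array([0.1, 0.2, 0.1]), lam=1.0, K=K))
```

The exact zeros produced here are what make kernels inactive in the iteration (11).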

At every iteration we minimize the inner objective (the dual of the proximal MKL problem (8)) and use the minimizer to update the primal variables $(\boldsymbol{\alpha}, b)$. The overall algorithm is shown in Table 1.

Figure 1: Illustration of the soft-thresholding function $\mathrm{prox}_\lambda$ (shown in one dimension).

1. Choose a nondecreasing sequence of proximity parameters $\gamma_t$ ($t = 0, 1, 2, \dots$). 2. Minimize the augmented Lagrangian (the inner objective) with respect to the dual variable $\boldsymbol{\rho}$ to obtain $\boldsymbol{\rho}^{(t+1)}$ (Eq. (10)). 3. Update the primal variables $(\boldsymbol{\alpha}^{(t+1)}, b^{(t+1)})$ by Eqs. (11)-(12). 4. Repeat steps 2 and 3 until the stopping criterion is satisfied.

Table 1: Algorithm of SpicyMKL for block 1-norm MKL
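The structure of Table 1 can be written as a short skeleton in which the inner solver and the primal update of Eqs. (10)-(12) are abstracted as callables (so this is a schematic outline rather than a complete implementation):

```python
def spicy_mkl_outer_loop(alpha, b, gammas, solve_inner, primal_update, converged):
    """Skeleton of the outer loop of Table 1.

    solve_inner(alpha, b, gamma)        -> rho        : inner (dual) problem, Eq. (10)
    primal_update(alpha, b, rho, gamma) -> (alpha, b) : prox updates, Eqs. (11)-(12)
    converged(alpha, b, rho)            -> bool       : stopping criterion
    """
    for gamma in gammas:                        # nondecreasing proximity parameters
        rho = solve_inner(alpha, b, gamma)      # e.g., Newton method on the smooth inner objective
        alpha, b = primal_update(alpha, b, rho, gamma)
        if converged(alpha, b, rho):
            break
    return alpha, b
```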
Quick overview of the derivation of the iteration (10)-(12).

Consider the following Lagrangian of the proximal MKL problem (8):

(14)

where $\boldsymbol{\rho} \in \mathbb{R}^N$ is the Lagrange multiplier corresponding to the equality constraint $\boldsymbol{z} = \sum_{m=1}^M K_m\alpha_m + b\mathbf{1}$. The vector $\boldsymbol{\rho}^{(t+1)}$ in the first step (10) is the optimal Lagrange multiplier that maximizes the dual of the proximal MKL problem (8) (see Eq. (25)). The remaining steps (11)–(12) are obtained by minimizing the above Lagrangian with respect to $\alpha_m$ and $b$ (with $\boldsymbol{\rho}$ fixed to $\boldsymbol{\rho}^{(t+1)}$), respectively.

Detailed derivation of the iteration (10)-(12).

Let us consider the following constrained reformulation of the proximal MKL problem (8):

$\min_{\boldsymbol{\alpha},\, b,\, \boldsymbol{z}}\ L(\boldsymbol{z}) + \sum_{m=1}^M \phi_m^C(\alpha_m) + \frac{1}{2\gamma_t}\sum_{m=1}^M \big\|\alpha_m - \alpha_m^{(t)}\big\|_{K_m}^2 + \frac{1}{2\gamma_t}\big(b - b^{(t)}\big)^2$

subject to $\ \boldsymbol{z} = \sum_{m=1}^M K_m\alpha_m + b\mathbf{1}.$

The Lagrangian of the above constrained minimization problem can be written as in Eq. (14).

The dual problem can be derived by minimizing the Lagrangian (14) with respect to the primal variables $(\boldsymbol{z}, \boldsymbol{\alpha}, b)$. See Eq. (25) for the final expression. Note that the minimization is separable into minimizations with respect to $\boldsymbol{z}$ (Eq. (15)), $\boldsymbol{\alpha}$ (Eq. (20)), and $b$ (Eq. (24)).

First, minimizing the Lagrangian with respect to $\boldsymbol{z}$ gives

(15)

where $L^*$ is the convex conjugate of the loss function $L$, defined as follows:

$L^*(\boldsymbol{\rho}) = \sup_{\boldsymbol{z} \in \mathbb{R}^N}\big( \boldsymbol{\rho}^\top\boldsymbol{z} - L(\boldsymbol{z}) \big) = \sum_{i=1}^N \ell^*(y_i, \rho_i),$

where $\ell^*(y_i, \cdot)$ is the convex conjugate of the loss $\ell(y_i, \cdot)$ with respect to the second argument.

For example, the conjugate loss for the logistic loss $\ell(y, z) = \log(1 + e^{-yz})$ is the following negative entropy function:

$\ell^*(y, v) = \begin{cases} (-vy)\log(-vy) + (1 + vy)\log(1 + vy) & (-1 \le vy \le 0), \\ +\infty & (\text{otherwise}), \end{cases} \qquad (16)$

with the convention $0\log 0 = 0$.

The conjugate loss for the hinge loss $\ell(y, z) = \max(0,\ 1 - yz)$ is given as follows:

$\ell^*(y, v) = \begin{cases} vy & (-1 \le vy \le 0), \\ +\infty & (\text{otherwise}). \end{cases} \qquad (17)$
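To make the notion of conjugate loss concrete, the sketch below evaluates $\ell^*(y, v) = \sup_z\{vz - \ell(y, z)\}$ for the logistic loss by direct numerical maximization and compares it with the negative-entropy expression in (16); the bounded search interval is an implementation convenience only:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def logistic_loss(y, z):
    return np.logaddexp(0.0, -y * z)                  # log(1 + exp(-y z)), computed stably

def conjugate_numeric(y, v, z_max=60.0):
    """ell*(y, v) = sup_z { v z - ell(y, z) }, approximated on [-z_max, z_max].
    Outside -1 <= v y <= 0 the true conjugate is +inf."""
    res = minimize_scalar(lambda z: -(v * z - logistic_loss(y, z)),
                          bounds=(-z_max, z_max), method="bounded")
    return -res.fun

def conjugate_closed_form(y, v):
    p = -v * y                                        # lies in (0, 1) inside the domain
    return p * np.log(p) + (1.0 - p) * np.log(1.0 - p)

y, v = 1.0, -0.3
print(conjugate_numeric(y, v), conjugate_closed_form(y, v))   # the two values should agree closely
```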

Second, minimizing the Lagrangian (14) with respect to $\boldsymbol{\alpha}$, we obtain

(18)
(19)
(20)

where the omitted term does not depend on $\boldsymbol{\rho}$; the function $\phi_m^{C*}$ is the convex conjugate of the regularization term $\phi_m^C$, i.e.,

$\phi_m^{C*}(\boldsymbol{v}) = \sup_{\boldsymbol{\alpha} \in \mathbb{R}^N}\big( \langle\boldsymbol{v}, \boldsymbol{\alpha}\rangle - C\|\boldsymbol{\alpha}\|_{K_m}\big) = \begin{cases} 0 & (\boldsymbol{v} \text{ in the dual-norm ball of radius } C), \\ +\infty & (\text{otherwise}), \end{cases} \qquad (21)$

which is the indicator function of a ball of radius $C$.

See Fig. 2 for a one-dimensional illustration of the conjugate regularizer $\phi_m^{C*}$. In addition, the function appearing in the last line of (20) is Moreau's envelope function (see Fig. 2):

$\min_{\boldsymbol{x} \in \mathbb{R}^N}\Big( \phi_m^{C*}(\boldsymbol{x}) + \frac{1}{2}\|\boldsymbol{x} - \boldsymbol{v}\|^2 \Big). \qquad (22)$

See Eq. (13) and Figure 1 for the definition of the soft-threshold operation $\mathrm{prox}_\lambda$. Furthermore, we used the following proposition and some algebra to derive Eq. (19) from Eq. (18).

Proposition 1.

Let $\phi$ be a closed proper convex function and let $\phi^*$ be the convex conjugate of $\phi$ defined as

$\phi^*(\boldsymbol{y}) := \sup_{\boldsymbol{x}}\big( \boldsymbol{y}^\top K\boldsymbol{x} - \phi(\boldsymbol{x}) \big),$

where $K$ is a positive semidefinite matrix. Then

(23)
Proof.

It is a straightforward generalization of Moreau’s theorem. See Rockafellar (1970, Theorem 31.5) and Tomioka et al. (2011). ∎
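For reference, the classical (unweighted) Moreau theorem that Proposition 1 generalizes states that, for a closed proper convex function $\phi$ with conjugate $\phi^*$ taken with respect to the standard inner product,

$\mathrm{prox}_{\phi}(\boldsymbol{v}) + \mathrm{prox}_{\phi^*}(\boldsymbol{v}) = \boldsymbol{v} \qquad \text{and} \qquad \min_{\boldsymbol{x}}\Big( \phi(\boldsymbol{x}) + \tfrac12\|\boldsymbol{x} - \boldsymbol{v}\|^2 \Big) + \min_{\boldsymbol{x}}\Big( \phi^*(\boldsymbol{x}) + \tfrac12\|\boldsymbol{x} - \boldsymbol{v}\|^2 \Big) = \tfrac12\|\boldsymbol{v}\|^2$

for all $\boldsymbol{v}$, where $\mathrm{prox}_\phi(\boldsymbol{v}) = \mathop{\mathrm{argmin}}_{\boldsymbol{x}}\big( \phi(\boldsymbol{x}) + \tfrac12\|\boldsymbol{x} - \boldsymbol{v}\|^2 \big)$. Proposition 1 replaces the Euclidean inner product with one weighted by a positive semidefinite matrix; the precise weighted statement is Eq. (23).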

Finally, minimizing the Lagrangian (14) with respect to $b$, we obtain

(24)

where const is a term that does not depend on $\boldsymbol{\rho}$.

Combining Eqs. (15), (20), and (24), the dual of the proximal MKL problem (8) can be obtained as follows:

(25)

where the constant terms are ignored. We denote the maximand in the above dual problem by $\varphi_t(\boldsymbol{\rho})$; see Eq. (10).

Figure 2: Comparison of the conjugate regularizer $\phi_m^{C*}$ and the corresponding Moreau envelope function in one dimension. The conjugate regularizer (21) is non-differentiable at the boundary of its domain, whereas its envelope function is smooth.

3.3 Minimizing the inner objective function

The inner objective function (25) that we need to minimize at every iteration is convex, and it is differentiable when the conjugate loss $L^*$ is differentiable. In fact, the gradient and the Hessian of the inner objective can be written as follows:

(26)
(27)

where $\mathcal{M}_t \subseteq \{1, \dots, M\}$ is the set of indices corresponding to the active kernels, i.e., the kernels whose soft-thresholded coefficient vector is nonzero at the current iterate; for $m \in \mathcal{M}_t$, the scalar and the vector appearing in (26) and (27) are defined through the corresponding soft-threshold operation.

Remark 1.

The computation of the objective (25), the gradient (26), and the Hessian (27) is efficient because they require only the terms corresponding to the active kernels $m \in \mathcal{M}_t$.

The above sparsity, which makes the proposed algorithm efficient, comes from our dual formulation. By taking the dual, a flat region appears (see Fig. 2); i.e., the region on which the conjugate regularizer is flat is not a single point but has a nonempty interior.
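The resulting cost structure can be sketched as follows; the per-kernel terms are assumed here to take the form of a kernel-matrix-vector product (the exact expressions are those of Eqs. (26)-(27)), so only $|\mathcal{M}_t|$ such products are needed instead of $M$:

```python
import numpy as np

def active_kernel_sum(grams, vecs, active):
    """Accumulate K_m @ v_m over the active kernels only.  When the kernel
    combination is sparse, len(active) is much smaller than M, so the cost is
    O(len(active) * N^2) rather than O(M * N^2) per evaluation."""
    out = np.zeros(grams[0].shape[0])
    for m in active:
        out += grams[m] @ vecs[m]
    return out
```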

Since the inner objective function (25) is differentiable, and the sparsity of the intermediate solution can be exploited to evaluate the gradient and Hessian of the inner objective, the inner minimization can be carried out efficiently. We call the proposed algorithm Sparse Iterative MKL (SpicyMKL). The update equations (11)-(12) exactly correspond to the augmented Lagrangian method for the dual of the problem (7) (see Tomioka and Sugiyama (2009)), but they are derived here in a more general way using the techniques from Rockafellar (1976).

We use the Newton method with line search for the minimization of the inner objective (25). The line search is used to keep $\boldsymbol{\rho}$ inside the domain of the dual loss $L^*$. This works when the gradient of the dual loss is unbounded at the boundary of its domain, for example the logistic loss (16), in which case the minimum is never attained at the boundary. On the other hand, for the hinge loss (17), the solution typically lies at the boundary of the domain. This situation is handled separately in the next subsection.
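The mechanism of the safeguarded Newton step can be summarized by the generic sketch below, in which the objective, its derivatives, and the domain test are supplied as callables (a schematic outline, not a complete solver):

```python
import numpy as np

def newton_step_with_backtracking(rho, obj, grad, hess, in_domain,
                                  beta=0.5, max_halvings=50):
    """One damped Newton step rho + s * d with d = -H^{-1} g.  The step size s
    is halved until the candidate stays strictly inside the domain of the
    objective and decreases the objective value."""
    g = grad(rho)
    d = np.linalg.solve(hess(rho), -g)          # Newton direction
    s, f0 = 1.0, obj(rho)
    for _ in range(max_halvings):
        candidate = rho + s * d
        if in_domain(candidate) and obj(candidate) < f0:
            return candidate
        s *= beta                               # backtrack
    return rho                                  # no acceptable step found; keep the iterate
```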

3.4 Explicitly handling boundary constraints

The Newton method with line search described in the previous subsection is unsuitable when the conjugate loss function has a non-differentiable point in the interior of its domain or has a finite gradient at the boundary of its domain. We use the same augmented Lagrangian technique for these cases. More specifically, we introduce additional primal variables so that the augmented Lagrangian (AL) function becomes differentiable. First we explain this in the case of the hinge loss for classification. Next we discuss a generalization to other cases.

3.4.1 Optimization for hinge loss

Here we explain how the augmented Lagrangian technique is applied to the hinge loss. To this end, we introduce two sets of slack variables, as in the standard SVM literature (see, e.g., Schölkopf and Smola (2002)). The basic update equation (Eq. (8)) is rewritten as follows (here $\mathbb{R}_+$ denotes the set of non-negative real numbers):

where

This function can again be expressed in terms of a maximum over the dual variables, as follows:

We exchange the order of minimization and maximization as before, and remove the slack variables and the remaining primal variables by explicitly minimizing or maximizing over them (see also Section 3.2). Finally, we obtain the following update equations.

(28)
(29)
(30)
(31)

and $\boldsymbol{\rho}^{(t+1)}$ is the minimizer of the function defined as follows:

(32)

The gradient and the Hessian of this function with respect to $\boldsymbol{\rho}$ can be obtained in a similar way to Eqs. (26) and (27). Thus, we use the Newton method for the minimization (32). The overall algorithm is analogous to Table 1, with the update equations (28)-(32).

3.4.2 Optimization for general loss functions with constraints

Here we generalize the above argument to a broader class of loss functions. We assume that the dual of the loss function can be written using two twice-differentiable convex functions as

(33)

where the additional argument is an auxiliary variable. An example is the $\epsilon$-sensitive loss for regression, which is defined as