# A Geometric Algorithm for Scalable Multiple Kernel Learning

###### Abstract

We present a geometric formulation of the Multiple Kernel Learning (MKL) problem. To do so, we reinterpret the problem of learning kernel weights as searching for a kernel that maximizes the minimum (kernel) distance between two convex polytopes. This interpretation, combined with novel structural insights from our geometric formulation, allows us to reduce the MKL problem to a simple optimization routine that yields provable convergence as well as quality guarantees. As a result, our method scales efficiently to much larger data sets than most prior methods can handle. Empirical evaluation on eleven datasets shows that we are significantly faster and even compare favorably with a uniform unweighted combination of kernels.

## 1 Introduction

Multiple kernel learning is a principled alternative to choosing kernels (or kernel weights) and has been successfully applied to a wide variety of learning tasks and domains [18, 4, 2, 36, 10, 35, 22, 26]. Pioneering work by Lanckriet et al. [18] jointly optimizes the Support Vector Machine (SVM) task and the choice of kernels by exploiting convex optimization at the heart of both problems. Although theoretically elegant, this approach requires repeated invocations of semidefinite solvers. Other existing methods [26, 18, 25, 32, 33], albeit accurate, are slow and have large memory footprints.

In this paper, we present an alternate geometric perspective on the MKL problem. The starting point for our approach is to view the MKL problem as an optimization of kernel distances over convex polytopes. The ensuing formulation is a Quadratically Constrained Quadratic Program (QCQP), which we solve using a novel variant of the Matrix Multiplicative Weight Update (MMWU) method of Arora and Kale [3], a primal-dual combinatorial algorithm for solving Semidefinite Programs (SDP) and QCQPs. While the MMWU approach in its generic form does not yield an efficient solution for our problem, we show that a careful geometric reexamination of the primal-dual algorithm reveals a simple alternating optimization with extremely light-weight update steps. This algorithm can be described as simply as: “find a few violating support vectors with respect to the current kernel estimate, and reweight the kernels based on these support vectors”.

Our approach (a) does not require commercial cone or SDP solvers, (b) does not make explicit calls to SVM libraries (unlike alternating optimization based methods), (c) provably converges in a fixed number of iterations, and (d) has an extremely light memory footprint. Moreover, our focus is on optimizing MKL on a single machine. Existing techniques [26] that use careful engineering to parallelize MKL optimizations in order to scale can be viewed as complementary to our work. Indeed, our future work is focused on adding parallel components to our already fast optimization method.

A detailed evaluation on eleven datasets shows that our proposed algorithm (a) is fast, even as the data size increases beyond a few thousand points, (b) compares favorably with LibLinear [11] after Nyström kernel approximations are applied as feature transformations, and (c) compares favorably with the uniform heuristic that merely averages all kernels without searching for an optimal combination. As has been noted [7], the uniform heuristic is a strong baseline for the evaluation of MKL methods. We use LibLinear with Nyström approximations (LibLinear+) as an additional scalable baseline, and we are able to beat both of these baselines when both the number of examples and the number of kernels are significantly large.

## 2 Related Work

In practice, since the space of all kernels can be unwieldy, many methods operate by fixing a base set of kernels and determining an optimal (conic) combination. An early approach (Uniform) eliminated the search and simply used an equal-weight sum of kernel functions [22]. In their seminal work, Lanckriet et al. [18] proposed to simultaneously train an SVM as well as learn a convex combination of kernel functions. The key contribution was to frame the learning problem as an optimization over positive semidefinite kernel matrices, which in turn reduces to a QCQP. Soon after, Bach et al. [4] proposed a block-norm regularization method based on second order cone programming (SOCP).

For efficiency, researchers started using alternating optimization methods that alternate between updating the classifier parameters and the kernel weights. Sonnenburg et al. [26] modeled the MKL objective as a cutting plane problem and solved for kernel weights using Semi-Infinite Linear Programming (SILP) techniques. Rakotomamonjy et al. [25] used sub-gradient descent based methods to solve the MKL problem. An improved level set based method that combines cutting plane models with projection to level sets was proposed by Xu et al. [32]. Xu et al. [33] also derived a variant of the equivalence between group LASSO and the MKL formulation that leads to closed-form updates for kernel weights. However, as pointed out in [7], most of these methods do not compare favorably (both in accuracy as well as speed) even with the simple uniform heuristic.

Other works in MKL literature study the use of different kernel families, such as Gaussian families [19], hyperkernels [20] and non-linear families [29, 8]. Regularization based on the -norm [16] and -norm [15, 30] have also been introduced. In addition, stochastic gradient descent based online algorithms for MKL have been studied in [21]. Another work by Jain et al. [13] discusses a scalable MKL algorithm for dynamic kernels. We briefly discuss and compare with this work when presenting empirical results (Section 5).

In two-stage kernel learning, instead of combining the optimization of kernel weights and that of the best hypothesis in a single cost function, the goal is to learn the kernel weights in the first stage and then use them to learn the best classifier in the second stage. Recent two-stage approaches seem to do well in terms of accuracy. Examples include Cortes et al. [9], who optimize the kernel weights in the first stage and learn a standard SVM in the second stage, and Kumar et al. [17], who train on meta-examples derived from kernel combinations on the ground examples. In Cortes et al. [9], the authors observe that their algorithm reduces to solving a meta-SVM, which can be solved using standard off-the-shelf SVM tools such as LibSVM. However, despite being highly efficient on few examples, LibSVM is very inefficient on more than a few thousand examples due to quadratic scaling [6]. As for Kumar et al. [17], the construction of meta-examples scales quadratically in the number of samples, so their algorithm may not scale well past the small datasets evaluated in their work.

## 3 Background

##### Notation.

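We will denote vectors by boldface lowercase letters and matrices by boldface uppercase letters.

Symbol | Meaning |
---|---|
$\mathbf{0}$ | zero vector or matrix |
$\mathbf{1}$ | all-ones vector or matrix |
$\mathbf{A} \succeq 0$ | $\mathbf{A}$ is positive semidefinite |
$\mathrm{diag}(\mathbf{a})$ | the diagonal matrix $\mathbf{A}$ such that $A_{ii} = a_i$ |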

##### Modeling the geometry of SVM.

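Suppose that $\mathbf{X} \in \mathbb{R}^{n \times d}$ is a collection of $n$ training samples in a $d$-dimensional vector space (the rows $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are the points). Also, $\mathbf{y} \in \{-1, +1\}^n$ are the binary class labels for the data points in $\mathbf{X}$. Let $\mathbf{X}_+$ denote the rows corresponding to the positive entries of $\mathbf{y}$, and likewise $\mathbf{X}_-$ for the negative entries.

From standard duality, the maximum margin SVM problem is equivalent to finding the shortest distance between the convex hulls of $\mathbf{X}_+$ and $\mathbf{X}_-$. This shortest distance between the hulls will exist between two points on the respective hulls (see Figure 1). Since these points are in the hulls, they can be expressed as some convex combination of the rows of $\mathbf{X}_+$ and $\mathbf{X}_-$, respectively. That is, if $\mathbf{p}$ is the closest point on the positive hull, then $\mathbf{p}$ can be expressed as $\mathbf{p} = \boldsymbol{\alpha}_+^\top \mathbf{X}_+$, where $\boldsymbol{\alpha}_+ \geq \mathbf{0}$ and $\boldsymbol{\alpha}_+^\top \mathbf{1} = 1$, with a similar construction for $\mathbf{q}$ and $\boldsymbol{\alpha}_-$.

This in turn can be written as an optimization

$$
\begin{aligned}
\min_{\boldsymbol{\alpha}_+,\,\boldsymbol{\alpha}_-} \quad & \left\| \boldsymbol{\alpha}_+^\top \mathbf{X}_+ - \boldsymbol{\alpha}_-^\top \mathbf{X}_- \right\|^2 \\
\text{s.t.} \quad & \boldsymbol{\alpha}_+^\top \mathbf{1} = 1, \quad \boldsymbol{\alpha}_-^\top \mathbf{1} = 1, \quad \boldsymbol{\alpha}_+ \geq \mathbf{0}, \quad \boldsymbol{\alpha}_- \geq \mathbf{0}
\end{aligned}
\tag{3.1}
$$

Collecting all the terms together by defining $\boldsymbol{\alpha} = [\boldsymbol{\alpha}_+; \boldsymbol{\alpha}_-]$ (ordered to match the rows of $\mathbf{X}$), and expanding the distance term $\|\mathbf{p} - \mathbf{q}\|^2$, it is straightforward to show that Problem (3.1) is equivalent to

$$
\begin{aligned}
\min_{\boldsymbol{\alpha}} \quad & \boldsymbol{\alpha}^\top \mathbf{Y} \mathbf{X} \mathbf{X}^\top \mathbf{Y} \boldsymbol{\alpha} \\
\text{s.t.} \quad & \boldsymbol{\alpha}^\top \mathbf{y} = 0, \quad \boldsymbol{\alpha}^\top \mathbf{1} = 2, \quad \boldsymbol{\alpha} \geq \mathbf{0}
\end{aligned}
\tag{3.2}
$$

where $\mathbf{Y}$ is merely a compact way of writing $\mathrm{diag}(\mathbf{y})$. Problem (3.2) is of course the familiar dual SVM problem. The equivalence of (3.1) and (3.2) is well known, so we decline to prove it here; see Bennett and Bredensteiner [5] for a proof of this equivalence.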
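The equivalence between the hull-distance form (3.1) and the quadratic form (3.2) can be checked numerically on a toy example. The sketch below is only an illustration: the function names, the four sample points, and the weights are all made up for this purpose. It assumes the stacked coefficient vector places positive-class weights on positive rows and negative-class weights on negative rows, so that the signed combination of rows equals the difference of the two hull points.

```python
def signed_combination(points, weights, labels):
    """Return sum_i labels[i] * weights[i] * points[i] as a list of floats."""
    dim = len(points[0])
    out = [0.0] * dim
    for x, a, y in zip(points, weights, labels):
        for k in range(dim):
            out[k] += y * a * x[k]
    return out


def hull_distance_sq(points, weights, labels):
    """Squared distance between the two hull points, as in the (3.1) objective."""
    diff = signed_combination(points, weights, labels)
    return sum(v * v for v in diff)


def dual_quadratic_form(points, weights, labels):
    """Quadratic form alpha^T Y X X^T Y alpha, as in the (3.2) objective."""
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(n):
            gram_ij = sum(p * q for p, q in zip(points[i], points[j]))
            total += weights[i] * labels[i] * gram_ij * labels[j] * weights[j]
    return total


# Toy data (hypothetical): two positive and two negative points in the plane.
X = [[2.0, 0.0], [2.0, 2.0], [-1.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]
alpha = [0.5, 0.5, 0.25, 0.75]  # each class's weights sum to 1

d_hull = hull_distance_sq(X, alpha, y)
d_dual = dual_quadratic_form(X, alpha, y)
```

Both quantities agree for any feasible weight vector, which is why minimizing one is the same as minimizing the other; the constraints that the weights sum to 2 overall and that the label-weighted sum is zero simply restate that each class's weights sum to 1.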

##### Kernelizing the dual.

The geometric interpretation of the dual does not change when the examples are transformed by a reproducing kernel Hilbert space (RKHS). The Euclidean norm of the base vector space in is merely substituted with the RKHS norm:

where the kernel function stands in for the inner product. This is dubbed the kernel distance [24] or the maximum mean discrepancy [12]. The dual formulation then changes slightly, with the covariance term being replaced by the kernel matrix . For brevity, we will define .
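To make the kernel distance concrete, the following sketch computes the squared RKHS distance between two weighted point sets using the standard expansion into three double sums of kernel evaluations. The function names and the Gaussian bandwidth parameterization are assumptions made for illustration, not the authors' code. With a linear kernel the result reduces to the ordinary squared Euclidean distance between the two weighted combinations.

```python
import math


def linear_kernel(x, z):
    """Plain inner product; recovers the Euclidean case."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(x, z, bandwidth=1.0):
    """Gaussian kernel with an assumed exp(-||x - z||^2 / (2 h^2)) form."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def kernel_distance_sq(P, a, Q, b, kernel):
    """Squared RKHS distance between sum_i a_i phi(P_i) and sum_j b_j phi(Q_j)."""
    def cross(U, u, V, v):
        return sum(ui * vj * kernel(x, z)
                   for x, ui in zip(U, u)
                   for z, vj in zip(V, v))
    return cross(P, a, P, a) - 2.0 * cross(P, a, Q, b) + cross(Q, b, Q, b)


# Toy weighted point sets (hypothetical values).
P, a = [[2.0, 0.0], [2.0, 2.0]], [0.5, 0.5]
Q, b = [[-1.0, 0.0], [-1.0, 1.0]], [0.25, 0.75]

d_linear = kernel_distance_sq(P, a, Q, b, linear_kernel)
d_gauss = kernel_distance_sq(P, a, Q, b, gaussian_kernel)
d_self = kernel_distance_sq(P, a, P, a, gaussian_kernel)
```

Only kernel evaluations are needed, never explicit feature maps, which is what allows the geometric picture to carry over unchanged to the kernelized dual.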

##### Multiple kernel learning.

Multiple kernel learning is simply the SVM problem with the additional complication that the kernel function is unknown, but is expressed as some function of other known kernel functions.

Following standard practice [18] we assume that the kernel function is a convex combination of other kernel functions; i.e., that there is some set of coefficients , that , and that (which implies that the Gram matrix version is ). We regularize by setting [18]. The dual problem then takes the following form [18]:

(3.3) | ||||

s.t. |

When juxtaposed with (3.1) and (3.2), this can be interpreted as searching for the kernel that maximizes the shortest (kernel) distance between polytopes.
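As a small illustration of the object being optimized, the sketch below forms a convex combination of Gram matrices after normalizing each to unit trace. The helper names are hypothetical, and the trace normalization mirrors the preprocessing described in the experiments rather than any specific regularization constant from the formulation. Choosing equal weights gives the Uniform heuristic; MKL instead searches over the weights.

```python
def trace_normalize(K):
    """Scale a square Gram matrix (list of lists) so that its trace is 1."""
    tr = sum(K[i][i] for i in range(len(K)))
    if tr <= 0.0:
        raise ValueError("Gram matrix must have positive trace")
    return [[v / tr for v in row] for row in K]


def combine_kernels(grams, mu):
    """Return sum_i mu[i] * grams[i] for nonnegative weights that sum to 1."""
    if any(m < 0.0 for m in mu) or abs(sum(mu) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    n = len(grams[0])
    return [[sum(m * K[i][j] for m, K in zip(mu, grams)) for j in range(n)]
            for i in range(n)]


# Two toy Gram matrices (hypothetical values), combined with uniform weights.
K1 = trace_normalize([[2.0, 0.0], [0.0, 2.0]])
K2 = trace_normalize([[1.0, 1.0], [1.0, 3.0]])
K_uniform = combine_kernels([K1, K2], [0.5, 0.5])
```

Because each normalized matrix has unit trace and the weights sum to one, the combined matrix also has unit trace, so no single kernel dominates merely because of its scale.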

## 4 Our Algorithm

The MKL formulation of (3.3) can be transformed (as we shall see later) into a quadratically-constrained quadratic problem that can be solved by a number of different solvers [18, 1, 27]. However, this approach requires a memory footprint of $O(mn^2)$ to store all $m$ kernel matrices over $n$ examples. Another approach would be to exploit the min-max structure of (3.3) via an alternating optimization: note that the problem of finding the shortest distance between polytopes for a fixed kernel is merely the standard SVM problem. There are two problems with this approach: (a) standard SVM algorithms do not scale well with the number of examples $n$, and (b) it is not obvious how to adjust kernel weights in each iteration.

##### Overview.

Our solution exploits the fact that a QCQP is a special case of a general SDP. We do this in order to apply the combinatorial primal-dual matrix multiplicative weight update (MMWU) algorithm of Arora and Kale [3]. While the generic MMWU has expensive steps (a linear program and matrix exponentiation), we show how to exploit the structure of the MKL QCQP to yield a very simple alternating approach. In the “forward” step, rather than solving an SVM, we merely find two support vectors that are “most violating” normal to the current candidate hyperplane (in the lifted feature space). In the “backward” step, we reweight the kernels involved using a matrix exponentiation that we reduce to a closed-form computation without requiring expensive matrix decompositions. Our speedup comes from the facts that (a) the updates to support vectors are sparse (at most two in each step) and (b) the backward step can be computed very efficiently. This allows us to reduce our memory footprint to .

##### QCQPs and SDPs.

We start by using an observation due to Lanckriet et al. [18] to convert (3.3) into a QCQP. (We note that (4.1) is the hard-margin version of the MKL problem. The standard soft-margin variants can also be placed in this general framework [18]. For the 1-norm soft margin, we add the constraint that all entries of the dual vector are upper bounded by the margin constant $C$. For the 2-norm soft margin, another term appears in the objective, or we can simply add a constant multiple of the identity matrix to each kernel matrix.) The resulting QCQP is:

(4.1) | ||||

s.t. |

where , , and .

Next, we rewrite (4.1) in canonical SDP form in order to apply the MMWU framework:

(4.2) | ||||

s.t. | ||||

where for all .

##### The MMWU framework.

We give a brief overview of the MMWU framework of Arora and Kale [3] (for more details, the reader is directed to Satyen Kale’s thesis [14]). The approach starts with a “guess” for the optimal value of the SDP (and uses a binary search to find this guess interleaved with runs of the algorithm). Assuming that this guess at the optimal value is correct, the algorithm then attempts to find either a feasible primal () or dual assignment such that this guess is achieved.

The process starts with some assignment to (typically the identity matrix ). If this assignment is both primal feasible and at most , the process ends. Else, there must be some assignment to (the dual) that “witnesses” this lack of feasibility or optimality, and it can be found by solving a linear program using the current primal/dual assignments and constraints (i.e., is positive, has dual value at least , and satisfies constraints (4.1)).

The primal constraints and are then used to guide the search for a new primal assignment .
They are combined to form the matrix (see (4.1)), and then adjusted to form an “event matrix” (see the paragraph “The backward step” for details). (The event matrix generalizes the loss incurred by experts in traditional MWU: because it is derived from the SDP constraints, the duality gap of the SDP takes the role of the loss.)
Exponentiating the sum of all the observed so far, the algorithm exponentially re-weights primal constraints that are more important, and the process repeats.
By minimizing the loss, the assignments to and are guaranteed to result in an SDP value that approximates within a factor of .
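For intuition, the scalar analogue of this reweighting is the classical multiplicative weights update over a set of experts. The sketch below shows that scalar version only: the function name, the rate `eta`, and the loss values are hypothetical, and the sign and scaling conventions of the matrix algorithm in Arora and Kale [3] are not reproduced. In the matrix setting, roughly speaking, the cumulative loss vector is replaced by the accumulated event matrix, the scalar exponential by a matrix exponential, and the normalization by the sum of weights becomes normalization by the trace.

```python
import math


def multiplicative_weights(loss_rounds, eta):
    """Return normalized weights proportional to exp(-eta * cumulative loss)."""
    n = len(loss_rounds[0])
    cumulative = [0.0] * n
    for losses in loss_rounds:
        for i, loss in enumerate(losses):
            cumulative[i] += loss
    unnormalized = [math.exp(-eta * c) for c in cumulative]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Three experts with fixed per-round losses (hypothetical), ten rounds.
rounds = [[1.0, 0.0, 0.5] for _ in range(10)]
weights = multiplicative_weights(rounds, eta=0.5)
```

Experts that accumulate less loss receive exponentially more weight, which is the same mechanism by which the matrix version concentrates on the constraints that matter most.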

### 4.1 Our algorithm

We now adapt the above framework to solve the MKL SDP given by (4.2). As we will explain below, we can assign a priori in most cases and we can solve our problem with only one round of feasibility search. We denote the dual update in iteration by , the event matrix in iteration by and the primal variable (matrix) in iteration by . is closely related to the desired primal kernel coefficients . We denote as the accumulated dual assignment thus far and as the accumulated event matrix.

#### 4.1.1 The backward step

It will be convenient to explain the backward step first. Given and , we define where is a rate parameter to be set later. Note that (and ) is “almost-diagonal”, taking the form . Such matrices can be exponentiated in closed form.

###### Lemma 4.1.

The exponential of a matrix in the form where and , is

###### Proof.

We symbolically exponentiate an matrix of the form

Since this matrix is real and symmetric, its eigenvalues are real and its unit eigenvectors can be chosen to form an orthonormal basis. The method that we use to symbolically exponentiate it is to express it in the form

The exponential then becomes

As a matter of notation, let be the unit vector such that .

##### Eigenvalues.

The characteristic equation for is not difficult to calculate. It is:

(4.3) |

This yields eigenvalues equal to , and the other two equal to and . We label them and , respectively, and the rest are equal to .

##### Eigenvectors.

First we show that has two eigenvectors of the form :

so these are eigenvectors with eigenvalues . We will call the corresponding eigenvectors and . Since is symmetric, all of its eigenvectors are orthogonal. The remaining eigenvectors are of the form , where :

Clearly the corresponding eigenvalue for any such eigenvector is , so there are of them. The corresponding parts of these eigenvectors are labeled , where , and we assume they are unit vectors.

##### The Exponential.

For unit eigenvectors , since

and the eigenvalue is of multiplicity , we have

The last term in the equality is due to the fact that and the form an orthonormal basis for , so . ∎

Lemma 4.1 implies that we can exponentiate the event matrix (see Algorithm 1) quickly, as promised. In particular, we set where normalizes the matrix to have unit trace.
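The closed form can be sanity-checked numerically. The sketch below assumes the “almost-diagonal” matrix has a scalar $a$ on the whole diagonal and a vector $\mathbf{b}$ in the first row and first column, with zeros elsewhere. The block ordering and the function names are assumptions, since the extracted text does not preserve the lemma's displayed matrix. Under that assumption the exponential is $e^{a}$ times a matrix built from $\cosh\|\mathbf{b}\|$, $\sinh\|\mathbf{b}\|$, and the unit vector $\mathbf{b}/\|\mathbf{b}\|$, which the code compares against a truncated Taylor series.

```python
import math


def build_matrix(a, b):
    """Assumed form: a on the diagonal, b in the first row and first column."""
    size = len(b) + 1
    M = [[0.0] * size for _ in range(size)]
    for i in range(size):
        M[i][i] = a
    for i, v in enumerate(b):
        M[0][i + 1] = M[i + 1][0] = v
    return M


def exp_closed_form(a, b):
    """exp of the matrix above using cosh/sinh of ||b||; no decomposition."""
    n = len(b)
    beta = math.sqrt(sum(v * v for v in b))
    E = [[0.0] * (n + 1) for _ in range(n + 1)]
    ea = math.exp(a)
    if beta == 0.0:
        for i in range(n + 1):
            E[i][i] = ea
        return E
    u = [v / beta for v in b]
    c, s = math.cosh(beta), math.sinh(beta)
    E[0][0] = ea * c
    for i in range(n):
        E[0][i + 1] = E[i + 1][0] = ea * s * u[i]
        for j in range(n):
            identity = 1.0 if i == j else 0.0
            E[i + 1][j + 1] = ea * (identity + (c - 1.0) * u[i] * u[j])
    return E


def exp_taylor(M, terms=60):
    """Reference value: truncated power series sum_k M^k / k!."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        prod = [[sum(term[i][p] * M[p][j] for p in range(size))
                 for j in range(size)] for i in range(size)]
        term = [[v / k for v in row] for row in prod]
        result = [[result[i][j] + term[i][j] for j in range(size)]
                  for i in range(size)]
    return result


a, b = 0.3, [0.5, -1.0, 2.0]
closed = exp_closed_form(a, b)
reference = exp_taylor(build_matrix(a, b))
max_error = max(abs(closed[i][j] - reference[i][j])
                for i in range(len(closed)) for j in range(len(closed)))
```

The closed form costs time quadratic in the matrix size if written out densely, and only linear time if just the scalar coefficients and the vector are stored, which is the point of avoiding a general eigendecomposition.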

##### Practical considerations.

In Lemma 4.1, large inputs to the functions , , and will cause them to rapidly overflow even at double-precision range. Fortunately there are two steps we can take. First, and converge exponentially to , so above a high enough value, we can simply approximate and with .

Because can overflow just as much as or , this doesn’t solve the problem completely. However, since is always normalized so that , we can multiply the elements of by any factor we choose and the factor will be normalized out in the end. So above a certain value, we can use alone and throw a “quashing” factor () into the equations before computing the result, and it will be normalized out later in the computation (this also means that we can ignore the factor). For our purposes, setting suffices. This trades overflow for underflow, but underflow can be interpreted merely as one kernel disappearing from significance.
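One way to realize this idea, shown here as a sketch rather than the thresholded scheme described above, is to multiply the hyperbolic terms by a decaying exponential before evaluating them. The function name is hypothetical, and the specific scaling factor is an assumption chosen for simplicity. Since the final matrix is normalized to unit trace, a common factor applied to every entry cancels, provided the same factor is also applied to the identity part of the closed form.

```python
import math


def scaled_cosh_sinh(x):
    """Return (cosh(x) * exp(-x), sinh(x) * exp(-x)) for x >= 0 without overflow."""
    t = math.exp(-2.0 * x)  # underflows harmlessly to 0.0 for large x
    return 0.5 * (1.0 + t), 0.5 * (1.0 - t)


small = scaled_cosh_sinh(1.0)
large = scaled_cosh_sinh(1000.0)  # math.cosh(1000.0) itself would overflow
```

For large arguments both scaled values approach one half, matching the observation that the two functions become indistinguishable from half the exponential, and the remaining underflow affects only terms that are negligible after normalization.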

Note that the structure of also allows us to avoid storing it explicitly, since . We need only store the coefficients of the blocks of the .

##### The exponentiation algorithm.

#### 4.1.2 The forward step

In the forward step, we wish to check if our primal solution is feasible and optimal, and if not find updates to . In order to do so, we apply the MMWU template. The goal now is to find such that

The existence of such a will prove that the current guess is either primal infeasible or suboptimal (see Arora and Kale [3] for details).

We now exploit the structure of given by Lemma 4.1. In particular, let and . So

then reduces to:

(4.5) |

The right hand side is the negative trace of (which is normalized to ), so this becomes

(4.6) |

where .
If we let (which can be calculated at the end of the backward step), then we have simply
which is a simple collection of linear constraints that can always be satisfied. (The current margin borders a convex combination of points from each side. If we could not find a point such that the inequality is satisfied, then no point from the convex combination could be found on or past the margin, which is impossible.)

Geometrically, gives us a way to examine the training points that are farthest away from the margin. The higher a value is, the more it violates the current decision boundary. In order to find a that satisfies (4.6), we simply choose the highest elements of that correspond to both positive and negative labels, then set each corresponding entry in to . Algorithm 3 describes the pseudo-code for this process.
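The selection rule described above amounts to two argmax searches in a single pass. The following sketch is an illustration only: the function name, the score vector, and the `value` assigned to the chosen coordinates are hypothetical, because the constant used by the algorithm is not recoverable from the text here.

```python
def find_most_violating(scores, labels, value=1.0):
    """Pick the highest-scoring positive and negative examples.

    Returns a sparse update vector with `value` at the two chosen
    coordinates, plus the two indices. `value` is a placeholder.
    """
    best_pos = best_neg = None
    for i, (g, y) in enumerate(zip(scores, labels)):
        if y > 0 and (best_pos is None or g > scores[best_pos]):
            best_pos = i
        if y < 0 and (best_neg is None or g > scores[best_neg]):
            best_neg = i
    if best_pos is None or best_neg is None:
        raise ValueError("need at least one example of each class")
    update = [0.0] * len(scores)
    update[best_pos] = value
    update[best_neg] = value
    return update, best_pos, best_neg


# Hypothetical violation scores and labels.
g = [0.1, 0.7, 0.3, 0.9, -0.2]
y = [1, 1, -1, -1, 1]
update, i_pos, i_neg = find_most_violating(g, y)
```

The search is linear in the number of examples and touches only two coordinates of the dual vector, which is what keeps each forward step cheap.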

##### Practical Considerations.

We highlight two important practical consequences of our formulation. First, the procedure produces a very sparse update to : in each iteration, only two coordinates of are updated. This makes each iteration very efficient, taking only linear time. Second, by expressing in terms of we never need to explicitly compute (as ), which in turn means that we do not need to compute the (expensive) square root of explicitly.

Another beneficial feature of the dual-finding procedure for MKL is that terms involving the primal variables are either normalized (when we set the trace of to ) or eliminated (due to the fact that we have a compact closed-form expression for ), which means that we never have to explicitly maintain , save for a small number () of variables.

### 4.2 Avoiding binary search for

The objective function in (4.2) is linear, so we can scale and and use the fact that to transform the problem. (This fact follows from the KKT conditions for the original problem. The support constraints of the SVM problem can be written as . If we multiply both sides of this inequality by then it becomes an equality, by complementary slackness: . is a substitution for in the MKL problem [18] so as well.) The transformed problem is:

where . The first constraint can be transformed back into an optimization; that is, , subject to the remaining linear constraints. Because does not figure into the maximization, we can compute simply by maximizing . Practically, this means that we simply add the constraint , and the “guess” for is set to . We then know the objective, and only one iteration is needed, so the binary search is eliminated.

### 4.3 Extracting the solution from the MMWU

We start by observing that (by complementary slackness), which can be rewritten as

(4.7) |

Now recall (from section 3) that and we also use the fact that . Combining the above two we have:

(4.8) |

### 4.4 Putting it all together

Algorithm 4 summarizes the discussion in this section. The parameter is the error in approximating the objective function, but its connection to classification accuracy is loose. We set the actual value of via cross-validation (see Section 5). The parameter is the width of the SDP, a parameter that indicates how much the solution can vary at each step. is equal to the maximum absolute value of the eigenvalues of , for any [3].

###### Lemma 4.2.

is bounded by .

###### Proof.

is defined as the maximum of for all . Here denotes the largest eigenvalue in absolute value [3]. Because (see Section 4), the eigenvalues of are (with multiplicity ), and . The greater of these in absolute value is clearly .

is equal to

always has two nonzero elements, and they are equal to . They also correspond to values of with opposite signs, so if and are the coordinates in question, , because and are both negative. Because of the factor of , and because is the trace of , . This is true for any of the , so the maximum eigenvalue of in absolute value is bounded by . ∎

##### Running time.

Every iteration of Algorithm 4 will require a call to Find-, a call to Exponentiate- and an update to and . Find- requires a linear search for two maxima in , so the first is . The latter are each , which dominate Find-.

Algorithm 4 requires a total of iterations at most, where . Since we only require one run of the main algorithm, the running time is bounded by

## 5 Experiments

In this section we compare the empirical performance of MWUMKL with other multiple kernel learning algorithms. Our results have two components: (a) qualitative results that compare test accuracies on small-scale datasets, and (b) scalability results that compare training time on larger datasets.

We compare MWUMKL with the following baselines: (a) Uniform (uniformly weighted combination of kernels), and (b) LibLinear [11] with Nyström kernel approximations for each kernel (hereafter referred to as LibLinear+). We evaluate these MKL methods on binary classification datasets from the UCI data repository. They include: (a) small datasets Iono, Breast Cancer, Pima, Sonar, Heart, Vote, WDBC, WPBC, (b) medium dataset Mushroom, and (c) comparatively larger datasets Adult, CodRna, and Web.

Classification accuracy and kernel scalability results are presented on small and medium datasets (with many kernels). Scalability results (with kernels due to memory constraints) are provided for large datasets. Finally, we show results for lots of kernels on small data subsets.

##### Uniform kernel weights.

Uniform is simply LibSVM [6] run with a kernel weighted equally amongst all of the input kernels (where the kernel weights are normalized by the trace of their respective Gram matrices first).
The performance of Uniform is on par with or better than LibLinear+ on many datasets (see Figure 2) and the time is similar to MWUMKL.
However, Uniform does not scale well due to the poor scaling of LibSVM beyond a few thousand samples (see Figure 3), because of the need to hold the entire Gram matrix in memory. (This is true even when LibSVM is told to use one kernel, which it can compute on the fly: the scaling of LibSVM is at least quadratic in the number of samples [6], which is poor compared to MWUMKL and LibLinear+ with increasing sample size.)
We employ Scikit-learn [23] because it offers efficient access to LibSVM.

##### LibLinear [11] with Nyström kernel approximations [31, 34] (LibLinear+).

One important observation about multiple kernel learning is that Uniform performs as well as or better than many MKL algorithms, with better efficiency. Along this same line of thought, we should consider comparison against methods that are as simple as possible. One of the very simplest algorithms to consider is to use a linear classifier (in this case, LibLinear [11]), and transform the features of the data with a kernel approximation. For our purposes, we use Nyström approximations as described by Williams and Seeger [31] and discussed further by Yang et al. [34]. Because LibLinear is a primal method, we don’t need to scale each kernel: each kernel manifests as a set of features, which the algorithm weights by definition.

For the Nyström feature transformations, one only needs to specify the kernel function and the number of sample points desired from the data set. We usually use points, unless memory constraints force us to use fewer. Theoretically, if is the number of sample points, the number of data points, and the number of kernels, then we would need space to store double-precision floats. With regard to time, the training task is very rapid – the transformation is the bottleneck (requiring time to transform every point with every kernel approximation).
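As a back-of-the-envelope check on the memory requirement, the sketch below estimates the size of a dense transformed feature matrix. It assumes each kernel contributes one feature per sample point for every data point, so the matrix holds the product of the three counts in double-precision values. The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def nystrom_feature_bytes(num_points, num_samples, num_kernels, bytes_per_float=8):
    """Bytes needed for a dense (num_points) x (num_samples * num_kernels) matrix."""
    return num_points * num_samples * num_kernels * bytes_per_float


# Hypothetical sizes: 1000 points, 100 sample points per kernel, 10 kernels.
estimate = nystrom_feature_bytes(1000, 100, 10)
estimate_mib = estimate / (1024 ** 2)
```

The estimate grows linearly in each of the three quantities, so increasing the number of kernels quickly forces a smaller number of sample points when memory is fixed.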

We employ Scikit-learn [23] for implementations of both the linear classifier and the kernel approximation because (a) this package offloads linear support-vector classification to the natively-coded LibLinear implementation, (b) it offers a fast kernel transformation using the NumPy package, and (c) Scikit-learn makes it very easy and efficient to chain these two implementations together. In practice this method is very good and very fast for low numbers of kernels (see Figures 2, 4(a), and 4(b)). For high numbers of kernels, this scaling breaks down due to time and memory constraints (see Figure 5).

##### Legacy MKL implementations.

In all cases, we omit the results for older MKL algorithm implementations such as (a) SILP [26], (b) SdpMKL [18], (c) SimpleMKL [25], (d) LevelMKL [32], and (e) GroupMKL [33] which take significantly longer to complete, have no significant gain in accuracy, and do not scale to any datasets larger than a few thousand samples. For example, on Sonar (one of the smallest sets in our pool), each iteration of SILP takes about seconds on average whereas Uniform requires seconds on average.

##### Experimental parameters.

Size | Dataset | #Points | #Dim |
---|---|---|---|
Small | Breast Cancer | 683 | 9 |
Small | Heart | 270 | 13 |
Small | Iono | 351 | 33 |
Small | Pima | 768 | 8 |
Small | Sonar | 208 | 60 |
Small | Vote | 435 | 16 |
Small | WDBC | 569 | 30 |
Small | WPBC | 198 | 33 |
Medium | Mushroom | 8124 | 112 |
Large | Adult | 39073 | 123 |
Large | CodRna | 47628 | 8 |
Large | Web | 64700 | 300 |

Similar to Rakotomamonjy et al. [25] and Xu et al. [33], we test our algorithms on a base kernel family of polynomial kernels (of degree to ) and Gaussian kernels. Contrary to [25, 33], however, we test with Gaussian kernels that have a tighter range of bandwidths (, instead of ). The reason for this last choice is that our method actively seeks solutions for each of the kernels, and kernels that encourage overfitting the training set (such as low-bandwidth Gaussian kernels) pull MWUMKL away from a robust solution.

For small datasets, kernels are constructed using each single feature and are repeated times with different train/test partitions. For medium and large datasets, due to memory constraints on LibLinear+, we test only on kernels constructed using all features, and repeat only times. All kernels are normalized to trace . Results from small datasets are presented with a % confidence interval that the median lies in the range. Results from medium-large datasets present the median, with the min and max values as a range around the median. In each iteration, % of the examples are randomly selected as the training data and the remaining % are used as test data. Feature values of all datasets have been scaled to . SVM regularization parameter is chosen by cross-validation. For example, in Figure 2 results are presented for the best value of for each dataset and algorithm.

For MWUMKL, we choose by cross-validation. Most datasets get , but the exceptions are Web (), CodRna (), and Adult (). Contrary to existing works we do not compare the number of SVM calls (as MWUMKL does not explicitly use an underlying SVM) and the number of kernels selected.

Experiments were performed on a machine with an Intel® Core™ 2 Quad CPU ( GHz) and 2GB RAM.
All methods have an outer test harness written in Python.
MWUMKL also uses a test harness in Python with an inner core written in C++.

##### Accuracy.

On small datasets our goal is to show that MWUMKL compares favorably with LibLinear+ and Uniform in terms of test accuracies.

In Figure 2 we present the median misclassification rate for each small dataset over 30 random training/test partitions. In each case, we train the classifier with kernels for each feature in the dataset, and each kernel only operates on one feature. We are able either to beat the other methods or remain competitive with them.

##### Data Scalability.

Both MWUMKL and LibLinear+ are much faster than Uniform. At this point, Adult, CodRna, and Web are large enough datasets that Uniform fails to complete because of memory constraints. This can be seen in Figure 3, where we plot training time versus the proportion of the training data used: the training time taken by Uniform rises sharply and we are unable to train on this dataset past points. Hence, for the remaining experiments on large datasets, we compare MWUMKL with LibLinear+. In Figures 4(a) and 4(b), we choose a random partition of train and test, and then train with increasing proportions of the training partition (but always test with the whole test partition). With more data, our algorithm settles in to be competitive with LibLinear+.

##### Kernel Scalability.

We aim to demonstrate not only that MWUMKL performs well with the number of examples, but also that it performs well against the number of kernels. In fact, for an MKL algorithm to be truly scalable it should do well against both examples and kernels.

For kernel scalability, we present the training times for the best parameters of several of the datasets, divided by the number of kernels used, versus the size of the dataset (see Figure 5). We divide time by number of kernels because time scales very close to linearly with the number of kernels for all methods. Also presented are log-log models fit to the data, and the median of each experiment is plotted as a point.
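A log-log model of this kind can be fit by ordinary least squares on the logarithms of size and time, where the slope estimates the scaling exponent. The sketch below is a generic illustration with synthetic numbers generated from an exact quadratic law. The function name and the data are hypothetical and are not measurements from these experiments.

```python
import math


def fit_power_law(sizes, times):
    """Fit time ~ coeff * size**exponent by least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    coeff = math.exp(mean_y - exponent * mean_x)
    return exponent, coeff


# Synthetic timings (not real results): exactly quadratic in the dataset size.
sizes = [100, 200, 400, 800]
times = [0.002 * n ** 2 for n in sizes]
exponent, coeff = fit_power_law(sizes, times)
```

An exponent near 1 indicates roughly linear scaling in the number of examples, while an exponent near 2 or above indicates the quadratic-or-worse behavior typical of solvers that work with the full Gram matrix.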

We report the time for the same experiments that produced the results in Figure 2, and also train on increasing proportions of Mushroom (, , , and examples) with per-feature kernels. With these selections, we are testing in the neighborhood of million elements.

As expected, Uniform scales quadratically or more with the number of examples, performing very well at the lower range. The number of examples from Mushroom is not so high that LibSVM runs out of memory, but we do see the algorithm’s typical scaling.

LibLinear+ shows slightly superlinear scaling, with a high multiplier due to the matrix computations required for the feature transformations. As we run the algorithm on Mushroom, the number of samples taken for the kernel approximations is reduced so that the features can fit in machine memory. Even so, this reduction doesn’t offer any help to the scaling and at examples with kernels, training time is several hours.

Even though we reduced the number of samples for LibLinear+, MWUMKL outperforms both Uniform and LibLinear+ when both examples and kernels are greater than about .

##### Dynamic Kernels.

We also present results for a few datasets with lots of kernels. By computing columns of the kernel matrices on demand, we can run with a memory footprint of , improving scalability without affecting solution quality (a technique also used in SMOMKL [30]). Table 2 shows that we can indeed scale well beyond tens of thousands of points, as well as many kernels.
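Computing kernel columns on demand can be sketched as follows. This is a minimal illustration with a Gaussian kernel. The function name and the bandwidth parameterization are assumptions, and it is not meant to represent the C++ core used in the experiments.

```python
import math


def gaussian_kernel_column(X, j, bandwidth):
    """Return column j of the Gaussian Gram matrix of X without forming the matrix."""
    xj = X[j]
    scale = 2.0 * bandwidth ** 2
    column = []
    for x in X:
        sq = sum((a - b) ** 2 for a, b in zip(x, xj))
        column.append(math.exp(-sq / scale))
    return column


# Toy data (hypothetical): three points in the plane.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
col_1 = gaussian_kernel_column(X, 1, bandwidth=1.0)
col_2 = gaussian_kernel_column(X, 2, bandwidth=1.0)
```

Each call needs storage proportional to the number of points rather than its square, at the price of recomputing a column whenever the same example is selected again.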

Dataset | #Points | #Kernels | Time |
---|---|---|---|
Adult | 39073 | 3 | minutes |
CodRna | 47628 | 3 | seconds |
Sonar 1M | 208 | 1000000 | hours |

We choose the above datasets to compare against another work on scalable MKL [13]. Jain et al. [13] indicate the ability to deal with millions of kernels, but in effect the technique also has a memory footprint of (the footprint of MWUMKL is , in contrast). This limits any such approach to either many kernels or many points, but not both.

Since the work in Jain et al. [13] does not provide accuracy numbers, a direct head-to-head comparison is difficult to make, but we can make a subjective comparison. The above table shows times for MWUMKL with accuracy similar to or better than what LibLinear+ can achieve on the same datasets. The time numbers we achieve are similar in order of magnitude when scaled to the number of kernels demonstrated in Jain et al. [13].

## 6 Conclusions and Future Work

We have presented a simple, fast, and easy-to-implement algorithm for multiple kernel learning. Our proposed algorithm develops a geometric reinterpretation of kernel learning and leverages fast MMWU-based routines to yield an efficient learning algorithm. Detailed empirical results on data scalability, kernel scalability, and dynamic kernels demonstrate that we are significantly faster than existing legacy MKL implementations and outperform LibLinear+ as well as Uniform.

Our current results are for a single machine. As mentioned earlier, one of our future goals is to add parallelization techniques to improve the scalability of MWUMKL over data sets that are large and use a large number of kernels. The MWUMKL algorithm lends itself easily to the bulk synchronous parallel (BSP) framework [28], as most of the work is done in the loop that updates (see the last line of the loop in Algorithm 4). This task can be “sharded” for either kernels or data points, and scalability of would not suffer under BSP. Since there are many BSP frameworks and tools in use today, this is a natural direction for experimentation.
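To make the sharding idea concrete, the following sketch shows one BSP-style superstep that splits per-kernel work across workers and then combines the partial results after a barrier. It is only an illustration of the pattern: the per-kernel quantity (a quadratic form) is a stand-in chosen for this example and is not the update from Algorithm 4, and the function names are hypothetical. Python threads are used only to show the structure; they do not give real parallel speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def quadratic_form(gram, alpha):
    """Stand-in per-kernel computation: alpha^T K alpha for one Gram matrix."""
    n = len(alpha)
    return sum(alpha[r] * gram[r][c] * alpha[c]
               for r in range(n) for c in range(n))


def shard(items, num_shards):
    """Round-robin split of the work items into num_shards lists."""
    return [items[i::num_shards] for i in range(num_shards)]


def superstep(grams, alpha, num_workers=2):
    """One superstep: local computation per shard, barrier, then combine."""
    shards = shard(list(enumerate(grams)), num_workers)

    def work(part):
        return [(i, quadratic_form(g, alpha)) for i, g in part]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Collecting all results acts as the barrier between supersteps.
        partials = list(pool.map(work, shards))

    result = [0.0] * len(grams)
    for part in partials:
        for i, value in part:
            result[i] = value
    return result


# Hypothetical 2x2 Gram matrices and coefficient vector, for illustration only.
grams = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 1.0], [1.0, 2.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
alpha = [0.5, -0.5]

values = superstep(grams, alpha)
```

Sharding by data points instead of kernels would follow the same structure, with each worker handling a block of rows and the combine step summing the partial results.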

## 7 Acknowledgments

This research was partially supported by the NSF under grant CCF-0953066. The authors would also like to thank Satyen Kale and Sébastien Bubeck for their valuable feedback.

## References

- Andersen and Andersen [1999] E. D. Andersen and K. D. Andersen. The MOSEK interior point optimization for linear programming: an implementation of the homogeneous algorithm, pages 197–232. Kluwer Academic Publishers, 1999.
- Argyriou et al. [2006] Andreas Argyriou, Raphael Hauser, Charles A. Micchelli, and Massimiliano Pontil. A DC-programming algorithm for kernel selection. In ICML, Pennsylvania, USA, 2006.
- Arora and Kale [2007] Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefinite programs. In STOC, pages 227–236, New York, NY, USA, 2007. ACM.
- Bach et al. [2004] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, pages 6–, New York, NY, USA, 2004. ACM.
- Bennett and Bredensteiner [2000] Kristin P. Bennett and Erin J. Bredensteiner. Duality and geometry in SVM classifiers. In ICML, pages 57–64, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
- Chang and Lin [2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):27:1–27:27, May 2011.
- Cortes [2009] Corinna Cortes. Invited talk: Can learning kernels help performance? In ICML, Montreal, Canada, 2009.
- Cortes et al. [2009] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In NIPS, pages 396–404, Vancouver, Canada, 2009.
- Cortes et al. [2010] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Two-stage learning kernel algorithms. In ICML, pages 239–246, Haifa, Israel, 2010.
- Cristianini et al. [2006] Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz S. Kandola. On kernel-target alignment. In Innovations in Machine Learning, pages 205–256. Springer, 2006.
- Fan et al. [2008] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
- Gretton et al. [2007] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander J Smola. A kernel method for the two-sample problem. In NIPS, pages 513–. MIT, 2007.
- Jain et al. [2012] Ashesh Jain, S.V.N. Vishwanathan, and Manik Varma. SPF-GMKL: Generalized multiple kernel learning with a million kernels. In KDD, pages 750–758, New York, NY, USA, 2012. ACM.
- Kale [2007] Satyen Kale. Efficient algorithms using the multiplicative weights update method. PhD thesis, Princeton University, 2007.
- Kloft et al. [2009] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Pavel Laskov, Klaus-Robert Müller, and Alexander Zien. Efficient and accurate Lp-norm multiple kernel learning. In NIPS, pages 997–1005, Vancouver, Canada, 2009.
- Kloft et al. [2011] Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. $\ell_p$-norm multiple kernel learning. JMLR, 12:953–997, 2011.
- Kumar et al. [2012] Abhishek Kumar, Alexandru Niculescu-Mizil, Koray Kavukcuoglu, and Hal Daumé III. A binary classification framework for two stage multiple kernel learning. In ICML, pages 1295–1302, 2012.
- Lanckriet et al. [2004] Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27–72, December 2004.
- Micchelli and Pontil [2005] Charles A. Micchelli and Massimiliano Pontil. Learning the kernel function via regularization. JMLR, 6:1099–1125, December 2005.
- Ong et al. [2005] Cheng Soon Ong, Alexander J. Smola, and Robert C. Williamson. Learning the kernel with hyperkernels. JMLR, 6:1043–1071, 2005.
- Orabona and Jie [2011] Francesco Orabona and Luo Jie. Ultra-fast optimization algorithm for sparse multi kernel learning. In ICML, pages 249–256, Bellevue, USA, 2011.
- Pavlidis et al. [2001] Paul Pavlidis, Jason Weston, Jinsong Cai, and William Noble Grundy. Gene functional classification from heterogeneous data. In Proc. Intl. Conf. on Computational Biology, RECOMB ’01, pages 249–255, New York, NY, USA, 2001. ACM.
- Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.
- Phillips and Venkatasubramanian [2011] Jeff M. Phillips and Suresh Venkatasubramanian. A gentle introduction to the kernel distance. CoRR, abs/1103.1625, 2011.
- Rakotomamonjy et al. [2007] Alain Rakotomamonjy, Francis Bach, Stéphane Canu, and Yves Grandvalet. More efficiency in multiple kernel learning. In ICML, pages 775–782, New York, NY, USA, 2007. ACM.
- Sonnenburg et al. [2006] Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large scale multiple kernel learning. JMLR, 7:1531–1565, December 2006.
- Sturm [1999] J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11–12:625–653, 1999.
- Valiant [1990] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, August 1990.
- Varma and Babu [2009] Manik Varma and Bodla Rakesh Babu. More generality in efficient multiple kernel learning. In ICML, pages 1065–1072, New York, NY, USA, 2009. ACM.
- Vishwanathan et al. [2010] S. V. N. Vishwanathan, Zhaonan Sun, Nawanol Ampornpunt, and Manik Varma. Multiple kernel learning and the SMO algorithm. In NIPS, volume 22, pages 2–, Vancouver, Canada, 2010.
- Williams and Seeger [2001] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682–688, 2001.
- Xu et al. [2008] Zenglin Xu, Rong Jin, Irwin King, and Michael R. Lyu. An extended level method for efficient multiple kernel learning. In NIPS, pages 1825–1832, Vancouver, Canada, 2008.
- Xu et al. [2010] Zenglin Xu, Rong Jin, Haiqin Yang, Irwin King, and Michael R. Lyu. Simple and efficient multiple kernel learning by group lasso. In ICML, pages 1175–1182, Haifa, Israel, 2010.
- Yang et al. [2012] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random fourier features: A theoretical and empirical comparison. In NIPS, pages 485–493, 2012.
- Ye et al. [2007] Jieping Ye, Jianhui Chen, and Shuiwang Ji. Discriminant kernel and regularization parameter learning via semidefinite programming. In ICML, pages 1095–1102, New York, NY, USA, 2007. ACM.
- Zien and Ong [2007] Alexander Zien and Cheng Soon Ong. Multiclass multiple kernel learning. In ICML, pages 1191–1198, New York, NY, USA, 2007. ACM.