# Semidefinite and Spectral Relaxations for Multi-Label Classification

###### Abstract

In this paper, we address the problem of multi-label classification. We consider linear classifiers and propose to learn a prior over the space of labels to directly improve the performance of such methods. This prior takes the form of a quadratic function of the labels and can encode both attractive and repulsive relations between labels. We cast this problem as a structured prediction one, aiming at optimizing either the accuracies of the predictors or the $F_1$-score. This leads to an optimization problem closely related to the max-cut problem, which naturally leads to semidefinite and spectral relaxations. We show on standard datasets how such a general prior can improve the performance of multi-label techniques.


Rémi Lajugie INRIA, Sierra Project-team École Normale Supérieure, Paris remi.lajugie@ens.fr Piotr Bojanowski INRIA, Willow Project-team École Normale Supérieure, Paris piotr.bojanowski@inria.fr Sylvain Arlot INRIA, Sierra Project-team École Normale Supérieure, Paris sylvain.arlot@ens.fr Francis Bach INRIA, Sierra Project-team École Normale Supérieure, Paris francis.bach@inria.fr

## 1 Introduction

Multi-label classification aims at predicting a set of labels for each data instance [Tsoumakas2007multi, Zhang2013review]. This setting is ubiquitous in real-world applications and for example can take the form of video or text tagging, where the goal is to assign instances to categories [Joachims1998text]. For video, [Xiao2010sun] proposes to consider the problem of labeling scenes, on which several objects appear.

One of the main difficulties of this problem lies in the fact that the space of potential labelings is exponentially larger than the set of labels. An exhaustive search over the space of labelings is thus not possible. Moreover, contrary to the standard binary classification setting, the set of labelings has a specific structure that one has to take into account, especially when the number of labels is large. Indeed, if we are given one classifier for each label, we would probably observe that some of them predict labels that are not actually present. For instance, in image tagging, while it is very likely to see a zebra and a lion in the same image, it is rather improbable to see a reindeer with a lion. A prior over labels could, for instance, penalize the prediction of a reindeer together with a lion. Incorporating structure into the label set can be done a priori by assuming that labels are organized in a certain hierarchy [Rousu2006kernel]; [Hariharan2010large] incorporates prior knowledge when training the classifiers, which makes it possible to learn correlated classifiers. However, this prior does not affect the way predictions are made.

Our goal is to learn such a prior over labels directly from data, while the classifiers themselves are learned. This idea has already been tackled by [Petterson2011submodular], who restricted their study to the specific case of positive affinities between labels. We go beyond this approach and propose a model that takes into account both affinities and incompatibilities between labels.

Related work.

A large part of the recent literature considers a moderately large set of labels (on the order of hundreds) and therefore a huge space of labelings. In this setting it is possible to learn a specific classifier for each label separately. One way to train such classifiers is the well-known one-versus-rest technique (a.k.a. the binary relevance technique [Tsoumakas2007multi]).

Within this setting, some approaches use the structured prediction framework [Tsochantaridis2005large, Taskar2003maximum] as we do. This corresponds to considering prediction as a task over the huge output space of labelings. [Lampert2011maximum] has proposed to plug a model within a structured SVM, and considers the prior knowledge between labels as fixed a priori, whereas we aim at learning it. They defined a proper loss and the corresponding loss-augmented decoding. The loss they used is called the “max loss” and is loosely related to the Hamming loss. This approach leads to an efficient loss-augmented decoding, and avoids an exhaustive search over the power set of the labels. Other approaches [Dembczynski2013optimizing] considered the direct optimization of the $F_1$-score within a structured SVM. Another part of the recent literature dealing with multi-label classification [Bi2013efficient] considers the case where the space of labels itself is huge. In these papers, the goal is to exploit the fact that only a few labels are present in an instance. This makes it possible to reduce the dimension of the prediction space and to perform the labeling over a lower-dimensional space. The priors we propose here could be combined with these approaches.

Contributions. Our contribution is four-fold: (1) we propose a model with priors for multi-label classification allowing attractive as well as repulsive weights, (2) we cast the learning of this model into the framework of structured prediction using either the Hamming or the $F_1$ loss and propose an approach for solving exactly the loss-augmented decoding using the $F_1$ loss, (3) we propose semidefinite and spectral relaxations to efficiently solve the resulting structured prediction problem, (4) we show on real datasets how learning such a general prior can improve multi-label prediction over models where no prior is learned or where only attractive weights are allowed.

## 2 Structured Prediction for Multi-Label Classification

In this section, we review several ways to perform the multi-label classification task when a prior over the labels is fixed. Decoding consists in assigning potentially several labels to a data point belonging to some feature space. We then discuss how to learn the parameters of the predictive function. For the rest of the paper we denote our feature space by $\mathcal{X}$.

### 2.1 The multi-label classification problem

Let us consider the set $\mathcal{V}$ of possible labels, of cardinality $V$. We define the set of labelings $\mathcal{Y}$ as the set of binary vectors $\{0,1\}^V$. The set $\mathcal{Y}$ is the one on which we perform our structured prediction.

Let us assume that for each possible label $k$, we are given a linear classifier parameterized by a vector $w_k$. We denote by $W$ the vertical concatenation of all the vectors $w_k$. In the multi-label setting, the decoding problem is:

$$\hat{y}(x) \in \arg\max_{y \in \{0,1\}^V} \; \sum_{k=1}^{V} y_k \, w_k^\top x \tag{1}$$

This is usually referred to as the binary relevance method for multi-label learning [Tsoumakas2007multi].
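As a minimal sketch of binary relevance decoding, assuming labelings are encoded as 0/1 vectors and each label has its own weight vector (the names `weights` and `binary_relevance_decode` are illustrative, not taken from the paper): since the objective decouples over labels, each label is predicted independently whenever its linear score is positive.

```python
def binary_relevance_decode(weights, x):
    """Return a 0/1 labeling: label k is switched on when w_k . x > 0.

    weights: list of per-label weight vectors (hypothetical layout).
    x: feature vector of the instance.
    """
    labeling = []
    for w_k in weights:
        score = sum(w * f for w, f in zip(w_k, x))
        labeling.append(1 if score > 0 else 0)
    return labeling


# Toy example with three labels and two features.
weights = [[1.0, -1.0], [0.5, 0.5], [-2.0, 0.0]]
x = [1.0, 0.2]
print(binary_relevance_decode(weights, x))
```

No interaction between labels enters this decision, which is exactly the limitation the prior discussed next is meant to address.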

The aforementioned approach does not take into account any dependency between the different labels. A way to do so is to penalize the discriminative function by some penalty depending on the subset of predicted labels. In our case, we propose to consider:

(2) |

However, not every penalty function is admissible if (2) is to remain tractable, since the number of labelings grows exponentially with the number of labels.

A class of penalizations that is well suited to our problem is the class of submodular functions [Bach2013learning, Petterson2011submodular]. When the penalty is submodular, the decoding becomes the maximization of a supermodular function (a modular function minus a submodular function). This is known to be tractable (solvable in time polynomial in the number of labels). [Petterson2011submodular] has proposed to use a graph-cut based penalty. This corresponds to a penalty made of a linear term and a quadratic term whose matrix is proportional to the Laplacian matrix of a graph. Intuitively, this corresponds to considering that labels are organized in a graph with non-negative weights, encoding attractive affinities between the labels; the linear part of the prior corresponds to a prior over the frequencies of the classes.

For general weights, meaning that the matrix not only encodes affinities but also costs, the decoding task becomes as hard as solving a max-cut problem. In Sec. 5.1 we review common convex relaxations permitting to obtain a good approximate solution in polynomial time. Using a matrix with arbitrary entries, our decoding model becomes:

(3) |
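To illustrate why decoding with a general quadratic prior is hard, here is a small brute-force sketch. It assumes a score of the form "linear term minus quadratic penalty" over 0/1 labelings; the sign convention and the names `unary`, `A`, `quadratic_score` and `brute_force_decode` are assumptions made for this example and are not the paper's notation. The search enumerates all labelings, so it is only usable for a handful of labels.

```python
from itertools import product


def quadratic_score(y, unary, A):
    """Linear scores minus a quadratic penalty y^T A y (assumed sign convention)."""
    n = len(y)
    linear = sum(u * y_k for u, y_k in zip(unary, y))
    penalty = sum(A[j][k] * y[j] * y[k] for j in range(n) for k in range(n))
    return linear - penalty


def brute_force_decode(unary, A):
    """Exhaustive search over all 2^V labelings; exponential in V."""
    best = max(product((0, 1), repeat=len(unary)),
               key=lambda y: quadratic_score(y, unary, A))
    return list(best)


# Labels: zebra, lion, reindeer. All three classifiers fire on their own,
# but a repulsive weight between lion and reindeer removes the reindeer.
unary = [1.0, 0.8, 0.6]
no_prior = [[0.0] * 3 for _ in range(3)]
repulsive = [[0.0, 0.0, 0.0],
             [0.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
print(brute_force_decode(unary, no_prior))
print(brute_force_decode(unary, repulsive))
```

The relaxations discussed later in the paper replace this exhaustive enumeration with polynomial-time approximations.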

### 2.2 Learning the parameters

In the previous section we have assumed that we are given the linear classifiers, a linear prior and a matrix defining the quadratic prior. Thus, the decoding problem discussed above can be seen as being parameterized by these three quantities.

Suppose that we are given a set of training examples, each made of a feature vector and its labeling, and consider a loss function between two labelings. Ideally, given this loss, we would like to minimize the following regularized empirical loss:

(4) |

where the regularizer is a convex function (typically a squared $\ell_2$-norm) over the parameter space. This is a hard combinatorial problem that thus needs to be relaxed. Following [Tsochantaridis2005large, Taskar2003maximum], we define the structural hinge loss as:

(5) |

We estimate the parameters by solving the following problem:

(6) |
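The structural hinge loss of [Tsochantaridis2005large] can be sketched as follows, under the usual margin-rescaling form: the maximum over labelings of loss plus score, minus the score of the true labeling. The helper names and the brute-force maximization are assumptions made for illustration; in practice the inner maximization is the loss-augmented decoding problem studied in the rest of the paper.

```python
from itertools import product


def hamming(y, y_true):
    """Normalized Hamming loss between two 0/1 labelings."""
    return sum(a != b for a, b in zip(y, y_true)) / len(y_true)


def structural_hinge(score, loss, y_true):
    """max_y [loss(y, y_true) + score(y)] - score(y_true), by enumeration.

    score: callable mapping a 0/1 tuple to a real number (model score).
    loss:  callable comparing a candidate labeling to y_true.
    """
    candidates = product((0, 1), repeat=len(y_true))
    augmented = max(loss(y, y_true) + score(y) for y in candidates)
    return augmented - score(tuple(y_true))


# A purely linear score with per-label scores 0.2 and -0.2.
unary = [0.2, -0.2]
score = lambda y: sum(u * y_k for u, y_k in zip(unary, y))
print(structural_hinge(score, hamming, (1, 0)))
```

The hinge is always non-negative because the true labeling is itself one of the candidates, and it vanishes once the true labeling beats every other one by a margin at least equal to the loss.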

## 3 Performance Measures and Losses for Multi-Label Tasks

In order to set up the aforementioned problem, we need to define a proper loss function between labelings.

Normalized Hamming loss. The simplest loss is based on accuracy, and is defined as:

(7) |

The loss associated to accuracy is the so-called Hamming loss [Hamming1950error, Zhang2013review]. It is defined as a linear function of the binary label vector by:

(8) | ||||

(9) |

where $\mathbf{1}$ is the $V$-dimensional vector of ones. This loss corresponds to the size of the symmetric difference between the two sets of labels. Note also that, if we consider that not all errors are equivalent, one can use a weighted Hamming loss instead.
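The linearity of the Hamming loss in the predicted labeling can be checked numerically. The sketch below assumes 0/1 encoded labelings and the normalization by the number of labels; under that assumption, the loss equals a constant (the number of true labels) plus a linear form with coefficients `1 - 2 * y_true[k]`, all divided by the number of labels. Function names are illustrative.

```python
from itertools import product


def hamming_loss(y, y_true):
    """Fraction of coordinates on which the two 0/1 labelings disagree."""
    return sum(a != b for a, b in zip(y, y_true)) / len(y_true)


def hamming_loss_linear(y, y_true):
    """Same quantity written as an affine function of y (0/1 encoding assumed)."""
    n_labels = len(y_true)
    constant = sum(y_true)                      # errors made by the empty prediction
    slope = [1 - 2 * t for t in y_true]         # +1 for absent labels, -1 for present ones
    return (constant + sum(c * a for c, a in zip(slope, y))) / n_labels


# The two expressions agree on every pair of labelings with three labels.
for y_true in product((0, 1), repeat=3):
    for y in product((0, 1), repeat=3):
        assert abs(hamming_loss(y, y_true) - hamming_loss_linear(y, y_true)) < 1e-12
```

This affine form is what makes the loss-augmented decoding with the Hamming loss no harder than plain decoding: the loss only shifts the linear part of the score.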

$F_1$ loss. A common choice in the multi-label learning literature is the $F_1$ loss [Tsoumakas2007multi, Petterson2011submodular]. This loss is a function of precision and recall and has some important advantages over the Hamming loss. In the common situation where each instance has only a few labels among all the possible ones, the $F_1$ loss heavily penalizes the trivial prediction with no labels, while the Hamming loss does not.

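Precision and recall with respect to a training labeling are defined respectively as the fraction of predicted labels that are present in the training labeling, and the fraction of labels of the training labeling that are predicted.

Then the general $F_\beta$ score is defined as, for every $\beta > 0$,

$$F_\beta = (1+\beta^2)\,\frac{\text{precision}\cdot\text{recall}}{\beta^2\,\text{precision}+\text{recall}} \tag{10}$$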

The most widely used is the $F_1$ score (which turns out to be the harmonic mean of precision and recall), and the associated loss is then $1 - F_1$. More precisely:

(11) |

Note that this loss depends non-linearly on the predicted labeling.
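The following sketch computes the $F_\beta$ loss for 0/1 encoded labelings and contrasts it with the Hamming loss on a sparse example. It uses the standard identity that $F_\beta$ equals $(1+\beta^2)$ times the number of correctly predicted labels, divided by $\beta^2$ times the number of true labels plus the number of predicted labels. The convention of returning a zero loss when both labelings are empty is an assumption of this example.

```python
def f_beta_loss(y, y_true, beta=1.0):
    """1 - F_beta for two 0/1 labelings (loss 0 when both are empty, by convention)."""
    n_pred, n_true = sum(y), sum(y_true)
    denom = beta ** 2 * n_true + n_pred
    if denom == 0:
        return 0.0
    overlap = sum(a * b for a, b in zip(y, y_true))   # correctly predicted labels
    return 1.0 - (1.0 + beta ** 2) * overlap / denom


def hamming_loss(y, y_true):
    return sum(a != b for a, b in zip(y, y_true)) / len(y_true)


# One true label out of ten, and a prediction with no labels at all:
y_true = [1] + [0] * 9
empty = [0] * 10
print(hamming_loss(empty, y_true), f_beta_loss(empty, y_true))
```

On this example the empty prediction looks almost perfect under the Hamming loss but receives the maximal $F_1$ loss, which is the behavior described above for instances with few labels.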

## 4 Loss-Augmented Decoding

We propose to derive a structured-SVM-like optimization objective [Tsochantaridis2005large]. As mentioned earlier, we want to learn the parameters of our predictive function using annotated data. Following the definition of the structural hinge loss, we can write the complete optimization problem (6) as:

(12) |

Using the Hamming loss. If we use the Hamming loss, which is linear in the labeling, our optimization problem can be rewritten as follows:

(13) |

Note that the objective function of the optimization is jointly convex but not smooth.

Using the $F_1$ loss. If in turn we decide to use the $F_1$ loss, the proposed problem is harder because the labeling vector appears in the denominator. To cope with this issue, we can split the set of labelings into subsets of fixed cardinality. For each cardinality, we define the corresponding subset as the set of labelings with exactly that number of positive entries.

As is often done when optimizing the $F_1$ score, which is a contingency-table based loss [Joachims2005support], we can divide the initial problem into subproblems, one per cardinality, by replacing the full set of labelings by each of these subsets as follows:

(14) |
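The cardinality split can be sketched in the simplest setting, with no quadratic prior and 0/1 labelings (both assumptions of this example, as are the function names). For a fixed number `k` of predicted labels the $F_1$ loss is linear in the labeling, so the best labeling of size `k` is obtained by keeping the `k` largest adjusted scores; the outer loop then picks the best `k`.

```python
def f1_loss(y, y_true):
    """1 - F1 for 0/1 labelings (loss 0 when both are empty, by convention)."""
    denom = sum(y) + sum(y_true)
    if denom == 0:
        return 0.0
    overlap = sum(a * b for a, b in zip(y, y_true))
    return 1.0 - 2.0 * overlap / denom


def f1_loss_augmented_decode(scores, y_true):
    """Maximize f1_loss(y, y_true) + sum_j scores[j] * y[j], one cardinality at a time."""
    n_labels, n_true = len(scores), sum(y_true)
    best_value, best_y = float("-inf"), None
    for k in range(n_labels + 1):
        y = [0] * n_labels
        if k + n_true == 0:
            value = 0.0                          # empty prediction, empty ground truth
        else:
            # With |y| = k fixed, the loss is 1 - 2 * overlap / (k + n_true): linear in y.
            gain = [s - 2.0 * t / (k + n_true) for s, t in zip(scores, y_true)]
            top = sorted(range(n_labels), key=lambda j: gain[j], reverse=True)[:k]
            for j in top:
                y[j] = 1
            value = 1.0 + sum(gain[j] for j in top)
        if value > best_value:
            best_value, best_y = value, y
    return best_y, best_value


y_hat, value = f1_loss_augmented_decode([0.5, -0.1, 0.3], [1, 0, 0])
print(y_hat, value)
```

With a quadratic prior each fixed-cardinality subproblem is no longer a simple sort, which is why the constrained quadratic problems of the next section are needed.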

The problems of Eq. (4)–(14) above assume that we are able to solve quadratic optimization problems over the labelings. [Petterson2011submodular] proposes a greedy approximate algorithm for solving this type of problem in the specific case where the off-diagonal entries of the prior are negative.

In the following section, we propose relaxations of these problems leading to a tractable loss-augmented decoding with no restriction on the prior matrix.

## 5 Optimization over the labelings

So far, we have written three problems that we are not able to solve efficiently. The first one was the general decoding of Eq. (3). The other ones were the subproblems of Eq. (4) and Eq. (14). All of these are quadratic boolean optimization problems and are closely linked to the max-cut problem (see, e.g., [Boyd2004convex, Sec. 5.1.5]). These can be written in the canonical form as follows:

(15) |

where , and is an affine function.

Note the presence of an additional linear constraint. This constraint is only needed for the problem mentioned in Eq. (14). For the two other problems, one can simply ignore it. Eq. (15) allows us to tackle the three problems in a unified framework. In the next sections we discuss two relaxations of this problem. First we describe the standard SDP relaxation. We then show how to cast this optimization problem as a spectral problem.

### 5.1 Classical semidefinite relaxation for max-cut

The problems presented in Eq. (15) are known as two-way partitioning problems. They are a generalization of max-cut, with potentially negative entries in the matrix. They also contain an extra linear term (see Sec. 5.1.5 of [Boyd2004convex]) and potential constraints over the domain.

There exists a classical semidefinite relaxation. Following [Boyd2004convex, Aspremont2003relaxations], we use a relaxation similar to the one used by [Goemans1995improved] to approximate the max-cut problem. We introduce a new variable $Y = yy^\top$. Using this notation, the quadratic term in $y$ becomes a linear function of $Y$. Then, using the constraints that $Y$ is positive semidefinite, has rank one and has a unit diagonal, which together are equivalent to $Y = yy^\top$ with $y$ having entries in $\{-1, 1\}$, problem (15) can be rewritten as:

(16) |

Following [Boyd2004convex], the convex relaxation of this problem is obtained by removing the rank constraint. We define as the affine function where and . We use the Schur complement trick (see, e.g., [Boyd2004convex]) and define the matrix as:

Using , the vector with all coordinates equal to zero except the last one, our relaxation of (15) can be re-written as:

(17) |

Problem (17) can be solved using any standard convex optimization solver, at least when the number of labels is small. When the number of labels is large, one can use specific techniques relying explicitly on the fact that the solution is expected to be low-rank (see, e.g., [Journee2010low] and references therein).

Rounding scheme

At test time, we follow [Boyd2004convex] to round the relaxed solution, i.e., to get back to an admissible solution of (15). We notice that at the optimum of Eq. (17), the positive semidefiniteness constraint implies that the relaxed solution is a covariance matrix. Therefore, we simply sample several vectors from a normal distribution with this covariance, round each of them by taking the signs of its entries, and choose the best one in terms of the objective function. This procedure leads to good feasible points in our experiments.
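The rounding procedure can be sketched as follows, in pure Python for small problems. The sketch assumes that the relaxed solution `Y` is given as a positive semidefinite matrix with unit diagonal, that labelings are encoded in {-1, +1}, and that the objective is to be maximized; these conventions, the small diagonal jitter used to factorize rank-deficient matrices, and all function names are assumptions of this example.

```python
import math
import random


def cholesky(Y, jitter=1e-9):
    """Lower-triangular factor L with L L^T close to Y (jitter handles low rank)."""
    n = len(Y)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(max(Y[i][i] + jitter - s, 0.0))
            else:
                L[i][j] = (Y[i][j] - s) / L[j][j] if L[j][j] > 0 else 0.0
    return L


def gaussian_rounding(Y, objective, num_samples=50, seed=0):
    """Sample z ~ N(0, Y), round to signs, keep the labeling with the best objective."""
    rng = random.Random(seed)
    L = cholesky(Y)
    n = len(Y)
    best_y, best_value = None, float("-inf")
    for _ in range(num_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = [sum(L[i][k] * g[k] for k in range(i + 1)) for i in range(n)]
        y = [1 if v >= 0 else -1 for v in z]
        value = objective(y)
        if value > best_value:
            best_y, best_value = y, value
    return best_y, best_value


# Rank-one relaxed solution: rounding recovers the underlying labeling up to a global sign.
Y = [[1.0, -1.0, 1.0], [-1.0, 1.0, -1.0], [1.0, -1.0, 1.0]]
objective = lambda y: sum(Y[i][j] * y[i] * y[j] for i in range(3) for j in range(3))
print(gaussian_rounding(Y, objective, num_samples=20))
```

When the relaxed solution has higher rank, different samples round to different labelings, and keeping the best of several samples is what makes the procedure effective.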

### 5.2 Spectral relaxation

The generic problem in Eq. (15) can be relaxed by replacing the integrality constraint with a quadratic equality on the norm of the labeling vector. Note that the resulting problem is non-convex. Using the same notation as in the previous section leads to the following optimization problem:

(18) |

We deal with the linear constraint by dualizing it, yielding the following problem:

(19) |

This can be solved by performing a binary search over the Lagrange multiplier.

The inner loop problem is classical in optimization, in particular in trust-region methods [Forsythe1965stationary, Spjotvoll1972note]. It reduces—using the Lagrange multiplier technique—to solving a quadratic eigenvalue problem [Tisseur2001quadratic]. Solving the inner loop problem of Eq. (19) (with nonzero ) is equivalent to finding the minimal eigenvalue of the quadratic eigenvalue problem:

(20) |

where denotes the identity matrix. The problem above is solved efficiently by performing the SVD of the matrix :

(21) |

Once this has been solved, we get the desired solution by taking , where is the smallest non-zero eigenvalue of .

Note that when optimizing the Hamming loss, we get rid of the linear constraint. In that case we can set the corresponding Lagrange multiplier to zero and solve the inner loop problem only once.

### 5.3 Cheaper (but still efficient) solution for the spectral relaxation

In this section we present another way to deal with the spectral relaxation, inspired by [Gander1989constrained]. The proposed method is computationally more efficient than the one of the previous section, since it does not involve a binary search over the Lagrange multiplier.

We start from the problem of Eq. (18). By the change of variables and and by introducing ( is the dimensional identity matrix) we can write the problem as:

(22) |

Following [Gander1989constrained], let us simply introduce the QR factorization of the matrix , where is an orthogonal matrix and . Let us now introduce . and

Eq. (22) can be rewritten as:

(23) |

Note that the last constraint permit to fix the variable With a slight abuse of notation, corresponds to the inverse of the rotation part. Let us define with , and . Using the previous notations, we get: .

Let us also introduce Since is not entirely determined this problem is equivalent to:

(24) |

Note that we slightly abuse of notations with being restricted to its last components.

### 5.4 Links with graph-cuts

The min-cut problem can be written as an optimization problem through the following equation:

(25) |

where and . By making the change of variables and carrying on some calculations, we get the following equivalent program:

(26) |

Therefore, when the matrix has negative off-diagonal entries, the problem formulated in Eq. (15) can be solved using min-cut / max-flow. We can use standard min-cut / max-flow toolboxes by providing the matrix and .

When optimizing the loss, note that we can use the same dualization for the constraint . We proceed exactly as with the spectral relaxation except that the inner loop is solved with min-cut / max-flow.

### 5.5 Solving the $F_1$ loss-augmented decoding for negative off-diagonal entries

In this section, we show how we can solve the constrained problem by relating it to the well-studied total variation denoising problem [Chambolle2009total, Bach2013learning]. Note that, contrary to [Petterson2011submodular], in this section we deal with the cardinality constraint exactly and we do not use any approximation algorithm in this specific case. We simply use a total variation minimization algorithm to perform the constrained minimization.

Here we consider that the constraint of Eq. (15) is simply a cardinality constraint, namely that it is of the form for a certain and is the dimensional vector composed of ones.

Now, we dualize this equality constraint by introducing the associated Lagrange multiplier. This yields the following problem:

(27) |

Equivalently, after a change of variable, we get the following problem:

(28) |

The problem of Eq. (28) is a separable submodular optimization problem [Bach2013learning]. Thus solving it can be done by considering the associated proximal problem. More precisely, if we introduce the Choquet integral of the cut (often referred to as the “co-area formula” [Chambolle2009total] for the specific case of cut functions, or the Lovász extension for submodular functions), the generic proximal problem associated with any cut problem is:

(29) |

where in our case is exactly .

This problem is the well-known total variation denoising problem. There exist several efficient algorithms to deal with it, especially the ones relying on parametric max-flow techniques. Once problem (29) has been solved and we have recovered its solution, we get all the candidates for being a solution of (27) by considering its different level sets. Then, we just have to compute the associated objective values and select the optimal one.

## 6 Optimization over the parameters

We optimize our cost function in Eq. (4) with stochastic subgradient descent. When we relax the inner optimization problem over the labelings, we implicitly modify the cost function. Therefore we have to be careful when computing the subgradients.

In this section we provide the derivations in one specific case. The details for the other cases can be found in the supplementary material. When using the Hamming loss and the SDP relaxation, our cost function becomes, with :

(30) |

To obtain the subgradients, we first solve the relaxed loss-augmented inference. Using the obtained and , we compute the subgradients in and as follows:

(31) | ||||

(32) | ||||

(33) |

## 7 Experimental Evaluation

We now validate the proposed approach on standard benchmarks. We compare our implementation to [Petterson2011submodular] and to the one-versus-rest model (OvR). The code corresponding to the described method will be made publicly available. In this experimental section we first describe the used datasets and discuss the baselines to which we compare.

Datasets.
We validate our approach on four datasets.
Following [Petterson2011submodular], we picked our datasets from the Mulan repository (http://mulan.sourceforge.net/datasets.html).
We picked the yeast [Elisseeff01akernel], enron, medical [Pestian2007shared] and bibtex [Katakis08multilabeltext] datasets.
The datasets are of various sizes and natures: yeast only has 14 labels while bibtex has 159.
All of them also present different challenges (different structures, label concurrence patterns, etc.).

These datasets are given with a train / test split. We further split the training set to generate a validation set. We select all relevant parameters by plain validation on this set. We report all performances on the actual test set as given in the dataset. Characteristics of these datasets are given in Table 1.

Table 1: Characteristics of the datasets.

Dataset | Instances | Features | Labels |
---|---|---|---|
yeast | 2417 | 103 | 14 |
enron | 1702 | 1001 | 53 |
medical | 978 | 1449 | 45 |
bibtex | 7395 | 1836 | 159 |

One-versus-rest results. In Table 2 we report the performance of a one-versus-rest model for all the datasets. For every label, we train a linear classifier using a standard SVM toolbox [Fan2008liblinear]. We select the hyper-parameters by validation on a held-out part of the training set. We compare three criteria for choosing the regularization parameters. We can either select a common regularization parameter for all classes (“Single” columns) or one per class (“Multiple” column), the latter being chosen with the Hamming loss, which decouples over classes. When choosing a common parameter for all classes, one can choose it according to the $F_1$ or the Hamming loss on the validation set.

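Table 2: One-versus-rest performance for each hyper-parameter selection criterion.

Dataset | Single, $F_1$ | Single, Hamming | Multiple, Hamming |
---|---|---|---|
yeast | 0.39 | 0.40 | 0.54 |
enron | 0.48 | 0.49 | 0.46 |
medical | 0.29 | 0.29 | 0.28 |
bibtex | 0.61 | 0.66 | 0.66 |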

Table 2 shows that it is sometimes important to use the relevant loss as the criterion to select hyper-parameters. In our experiments, this becomes more and more important as the size of the label set increases, since, as discussed in Sec. 3, the Hamming loss then behaves more and more differently from the $F_1$ loss.

One would also expect that picking one parameter per label would lead to better performance. But the benefit of selecting a specific parameter per class is offset by the fact that one cannot use the $F_1$ loss in this case. In all our remaining simulations, we use a single regularization parameter for all classes.

Our model and comparison to [Petterson2011submodular]. We run our algorithm, trained with the Hamming loss, on all four datasets and compare to the available implementation of [Petterson2011submodular]. For all methods we select all hyper-parameters based on the performance in terms of $F_1$ loss on the validation set. Because of the challenging number of labels for bibtex, we were able to run neither the code from [Petterson2011submodular] nor the SDP in reasonable time.

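Table 3: Comparison of one-versus-rest (OvR), [Petterson2011submodular], min-cut (MC), and the SDP and spectral relaxations. The three columns of each relaxation correspond to different sign constraints on the entries of the prior matrix, one of them being unconstrained (“Any”).

Dataset | OvR | [Petterson2011submodular] | MC | SDP | SDP | SDP | Spectral | Spectral | Spectral |
---|---|---|---|---|---|---|---|---|---|
yeast | 0.39 | 0.36 | 0.40 | 0.40 | 0.39 | 0.39 | 0.39 | 0.37 | 0.37 |
enron | 0.48 | 0.45 | 0.47 | 0.47 | 0.47 | 0.45 | 0.48 | 0.49 | 0.49 |
medical | 0.29 | 0.33 | 0.29 | 0.31 | 0.29 | 0.24 | 0.30 | 0.21 | 0.24 |
bibtex | 0.61 | N/A | 0.61 | N/A | N/A | N/A | 0.62 | 0.57 | 0.60 |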

Table 3 compares the one-versus-rest approach, the approach described in [Petterson2011submodular] and variants of our method. We compare the two relaxations we proposed while optimizing the Hamming loss. Recall that the min-cut (MC) solution requires the prior matrix to have non-positive entries.

When the prior matrix has non-positive entries, we can measure the tightness of the proposed relaxations. We see that the various relaxations, SDP then spectral, do not degrade performance compared to the exact approach MC (which cannot be run for a general prior matrix).

We also notice that using a negative matrix is a strong limitation. The performance observed when the prior matrix is unconstrained or non-negative is better. This motivates our formulation and shows that repulsive weights between labels are relevant.

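Table 4: Loss obtained when optimizing the Hamming loss or the $F_1$ loss, without quadratic prior.

Dataset | OvR | [Petterson2011submodular] | Our Hamming | Our $F_1$ |
---|---|---|---|---|
yeast | 0.39 | 0.37 | 0.43 | 0.43 |
enron | 0.48 | 0.45 | 0.47 | 0.47 |
medical | 0.29 | 0.33 | 0.28 | 0.28 |
bibtex | 0.61 | 0.58 | 0.60 | 0.60 |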

The Hamming loss and the $F_1$ loss. In this experiment we do not make use of the quadratic prior. Table 4 gives the $F_1$ loss we obtain by optimizing either the $F_1$ loss or the Hamming loss. We compare to the implementation of the $F_1$ loss minimization in [Petterson2011submodular] (carried out using a greedy technique). In that table, “Our $F_1$” is our own implementation of the support vector technique for the $F_1$ loss [Joachims2005support] using the optimization described in Section 4. This is an exact optimization technique. We also report the results obtained by training SVMs using the one-versus-rest scheme. It appears that, on these standard datasets, optimizing the $F_1$ loss does not yield better performance than optimizing the Hamming loss.

## 8 Conclusion

We have proposed a framework to learn a prior for improving the performance of multi-label classification. This prior takes the form of a quadratic function over the space of labels and incorporates both affinities and negative affinities. Existing work [Petterson2011submodular] only takes into account positive affinities between labels. We provide semidefinite and spectral relaxations of the learning problem, yielding an efficient optimization scheme. In particular, the spectral relaxation makes it computationally feasible to deal with rather large label sets, whereas existing algorithms cannot (since the loss-augmented decoding problems have to be solved many times).

It would be interesting to see how to extend the range of applicability of the semidefinite relaxations, which is, for now, limited to multi-label problems for which the number of labels is of the order of hundreds. To that end, we could use techniques from matrix optimization theory, taking into account the fact that the solution we aim at finding has low rank [Journee2010low].