Learning Multimodal Similarity
Abstract
In many applications involving multimedia data, the definition of similarity between items is integral to several key tasks, e.g., nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications.
We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure.
Department of Computer Science and Engineering
University of California
San Diego, CA 92093-0404, USA
Gert Lanckriet gert@ece.ucsd.edu
Department of Electrical and Computer Engineering
University of California
San Diego, CA 92093-0407, USA
Editor:
1 Introduction
In applications such as content-based recommendation systems, the definition of a proper similarity measure between items is crucial to many tasks, including nearest-neighbor retrieval and classification. In some cases, a natural notion of similarity may emerge from domain knowledge, e.g., cosine similarity for bag-of-words models of text. However, in more complex, multimedia domains, there is often no obvious choice of similarity measure. Rather, viewing different aspects of the data may lead to several different, and apparently equally valid, notions of similarity. For example, if the corpus consists of musical data, each song or artist may be represented simultaneously by acoustic features (such as rhythm and timbre), semantic features (tags, lyrics), or social features (collaborative filtering, artist reviews and biographies, etc.). Although domain knowledge may be employed to imbue each representation with an intrinsic geometry (and, therefore, a sense of similarity), the different notions of similarity may not be mutually consistent. In such cases, there is generally no obvious way to combine representations to form a unified similarity space which optimally integrates heterogeneous data.
Without extra information to guide the construction of a similarity measure, the situation seems hopeless. However, if some side-information is available, e.g., as provided by human labelers, it can be used to formulate a learning algorithm to optimize the similarity measure.
This idea of using side-information to optimize a similarity function has received a great deal of attention in recent years. Typically, the notion of similarity is captured by a distance metric over a vector space (e.g., Euclidean distance in $\mathbb{R}^d$), and the problem of optimizing similarity reduces to finding a suitable embedding of the data under a specific choice of the distance metric. Metric learning methods, as they are known in the machine learning literature, can be informed by various types of side-information, including class labels (Xing et al., 2003; Goldberger et al., 2005; Globerson and Roweis, 2006; Weinberger et al., 2006), or binary similar/dissimilar pairwise labels (Wagstaff et al., 2001; Shental et al., 2002; Bilenko et al., 2004; Globerson and Roweis, 2007; Davis et al., 2007). Alternatively, multidimensional scaling (MDS) techniques are typically formulated in terms of quantitative (dis)similarity measurements (Torgerson, 1952; Kruskal, 1964; Cox and Cox, 1994; Borg and Groenen, 2005). In these settings, the representation of data is optimized so that distance (typically Euclidean) conforms to side-information. Once a suitable metric has been learned, similarity to new, unseen data can be computed either directly (if the metric takes a certain parametric form, e.g., a linear projection matrix), or via out-of-sample extensions (Bengio et al., 2004).
To guide the construction of a similarity space for multimodal data, we adopt the idea of using similarity measurements, provided by human labelers, as side-information. However, especially in heterogeneous, multimedia domains, similarity may itself be a highly subjective concept and vary from one labeler to the next (Ellis et al., 2002). Moreover, a single labeler may not be able to consistently decide if or to what extent two objects are similar, but she may still be able to reliably produce a rank-ordering of similarity over pairs (Kendall and Gibbons, 1990). Thus, rather than rely on quantitative similarity or hard binary labels of pairwise similarity, it is now becoming increasingly common to collect similarity information in the form of triadic or relative comparisons (Schultz and Joachims, 2004; Agarwal et al., 2007), in which human labelers answer questions of the form:
“Is $i$ more similar to $j$ or $\ell$?”
Although this form of similarity measurement has been observed to be more stable than quantitative similarity (Kendall and Gibbons, 1990), and clearly provides a richer representation than binary pairwise similarities, it is still subject to problems of consistency and inter-labeler agreement. It is therefore imperative that great care be taken to ensure some sense of robustness when working with perceptual similarity measurements.
In the present work, our goal is to develop a framework for integrating multimodal data so as to optimally conform to perceptual similarity encoded by relative comparisons. In particular, we follow three guiding principles in the development of our framework:

The embedding algorithm should be robust against subjectivity and inter-labeler disagreement.

The algorithm must be able to integrate multimodal data in an optimal way, i.e., the distances between embedded points should conform to perceptual similarity measurements.

It must be possible to compute distances to new, unseen data as it becomes available.
We formulate this problem of heterogeneous feature integration as a learning problem: given a data set $\mathcal{X}$ and a collection of relative comparisons $\mathcal{C}$ between pairs, learn a representation of the data that optimally reproduces the similarity measurements. This type of embedding problem has been previously studied by Agarwal et al. (2007) and Schultz and Joachims (2004). However, Agarwal et al. (2007) provide no out-of-sample extension, and neither method supports heterogeneous feature integration or addresses the problem of noisy similarity measurements.
A common approach to optimally integrate heterogeneous data is based on multiple kernel learning, where each kernel encodes a different modality
of the data. Heterogeneous feature integration via multiple kernel learning has been addressed by previous authors in a variety of contexts, including
classification (Lanckriet et al., 2004; Zien and Ong, 2007; Kloft et al., 2009; Jagarlapudi et al., 2009), regression (Sonnenburg et al., 2006; Bach, 2008; Cortes et al., 2009), and dimensionality reduction (Lin et al., 2009).
However, none of these methods specifically address the problem of learning a unified data representation which conforms to perceptual similarity
measurements.
1.1 Contributions
Our contributions in this work are twofold. First, we develop the partial order embedding (POE) framework (McFee and Lanckriet, 2009b), which allows us to use graph-theoretic algorithms to filter a collection of subjective similarity measurements for consistency and redundancy. We then formulate a novel multiple kernel learning (MKL) algorithm which learns an ensemble of feature space projections to produce a unified similarity space. Our method is able to produce nonlinear embedding functions which generalize to unseen, out-of-sample data. Figure 1 provides a high-level overview of the proposed methods.
The remainder of this paper is structured as follows. In Section 2, we develop a graphical framework for interpreting and manipulating subjective similarity measurements. In Section 3, we derive an embedding algorithm which learns an optimal transformation of a single feature space. In Section 4, we develop a novel multiple-kernel learning formulation for embedding problems, and derive an algorithm to learn an optimal space from heterogeneous data. Section 5 provides experimental results illustrating the effects of graph-processing on noisy similarity data, and the effectiveness of the multiple-kernel embedding algorithm on a music similarity task with human perception measurements. Finally, we prove hardness of dimensionality reduction in this setting in Section 6, and conclude in Section 7.
1.2 Preliminaries
A (strict) partial order is a binary relation $R$ over a set $Z$ ($R \subseteq Z^2$) which satisfies the following properties:

Irreflexivity: $(a, a) \notin R$,

Transitivity: $(a, b) \in R \wedge (b, c) \in R \Rightarrow (a, c) \in R$,

Antisymmetry: $(a, b) \in R \Rightarrow (b, a) \notin R$.

Footnote 1: The standard definition of a (non-strict) partial order also includes the reflexive property: $\forall a,\ (a, a) \in R$. For reasons that will become clear in Section 2, we take the strict definition here, and omit the reflexive property.
Every partial order can be equivalently represented as a directed acyclic graph (DAG), where each vertex is an element of $Z$ and an edge is drawn from $a$ to $b$ if $(a, b) \in R$. For any partial order, $R$ may refer to either the set of ordered tuples $\{(a, b)\}$ or the graph (DAG) representation of the partial order; the use will be clear from context. Let $\operatorname{diam}(R)$ denote the length of the longest (finite) source-to-sink path in the graph of $R$.
For a directed graph $G$, we denote by $G^{\infty}$ its transitive closure, i.e., $G^{\infty}$ contains an edge $(i, j)$ if and only if there exists a path from $i$ to $j$ in $G$. Similarly, the transitive reduction (denoted $G^{\min}$) is the minimal graph with equivalent transitivity to $G$, i.e., the graph with the fewest edges such that $(G^{\min})^{\infty} = G^{\infty}$.
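Let $\mathcal{X} = \{x_1, x_2, \dots, x_n\}$ denote the training set of $n$ items. A Euclidean embedding is a function $g : \mathcal{X} \to \mathbb{R}^d$ which maps $\mathcal{X}$ into a $d$-dimensional space equipped with the Euclidean ($\ell_2$) metric:
$$\|x - y\|_2 = \sqrt{(x - y)^\top (x - y)}.$$
For any matrix $B$, let $B_i$ denote its $i$th column vector. A symmetric matrix $A \in \mathbb{R}^{n \times n}$ has a spectral decomposition $A = V \Lambda V^\top$, where $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$ is a diagonal matrix containing the eigenvalues of $A$, and $V$ contains the eigenvectors of $A$. We adopt the convention that eigenvalues (and corresponding eigenvectors) are sorted in descending order. $A$ is positive semidefinite (PSD), denoted by $A \succeq 0$, if each eigenvalue is nonnegative: $\lambda_i \ge 0$, $i = 1, \dots, n$. Finally, a PSD matrix $A$ gives rise to the Mahalanobis distance function
$$\|x - y\|_A = \sqrt{(x - y)^\top A (x - y)}.$$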
2 A graphical view of similarity
Before we can construct an embedding algorithm for multimodal data, we must first establish the form of side-information that will drive the algorithm, i.e., the similarity measurements that will be collected from human labelers. There is an extensive body of work on the topic of constructing a geometric representation of data to fit perceptual similarity measurements. Primarily, this work falls under the umbrella of multidimensional scaling (MDS), in which perceptual similarity is modeled by numerical responses corresponding to the perceived “distance” between a pair of items, e.g., on a similarity scale of 1–10. (See Cox and Cox (1994); Borg and Groenen (2005) for comprehensive overviews of MDS techniques.)
Because “distances” supplied by test subjects may not satisfy metric properties (in particular, they may not correspond to Euclidean distances), alternative non-metric MDS (NMDS) techniques have been proposed (Kruskal, 1964). Unlike classical or metric MDS techniques, which seek to preserve quantitative distances, NMDS seeks an embedding in which the rank-ordering of distances is preserved.
Since NMDS only needs the rank-ordering of distances, and not the distances themselves, the task of collecting similarity measurements can be simplified by asking test subjects to order pairs of points by similarity:
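“Are $i$ and $j$ more similar than $k$ and $\ell$?”
or, as a special case, the “triadic comparison”
“Is $i$ more similar to $j$ or $\ell$?”
Based on this kind of relative comparison data, the embedding problem can be formulated as follows. Given is a set of objects $\mathcal{X}$, and a set of similarity measurements $\mathcal{C} = \{(i, j, k, \ell)\} \subseteq \mathcal{X}^4$, where a tuple $(i, j, k, \ell)$ is interpreted as “$i$ and $j$ are more similar than $k$ and $\ell$.” (This formulation subsumes the triadic comparisons model when $i = k$.) The goal is to find an embedding function $g : \mathcal{X} \to \mathbb{R}^d$ such that
$$\forall (i, j, k, \ell) \in \mathcal{C}: \quad \|g(i) - g(j)\|^2 + 1 \le \|g(k) - g(\ell)\|^2. \tag{1}$$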
The unit margin is forced between the constrained distances for numerical stability.
Agarwal et al. (2007) work with this kind of relative comparison data and describe a generalized NMDS algorithm (GNMDS), which formulates the embedding problem as a semidefinite program. Schultz and Joachims (2004) derive a similar algorithm which solves a quadratic program to learn a linear, axis-aligned transformation of data to fit relative comparisons.
Previous work on relative comparison data often treats each measurement $(i, j, k, \ell) \in \mathcal{C}$ as effectively independent (Schultz and Joachims, 2004; Agarwal et al., 2007). However, because the measurements encode pairwise similarity comparisons, and a pair $(i, j)$ may participate in several comparisons with other pairs, $\mathcal{C}$ may exhibit global structure which these previous methods are unable to exploit.
In Section 2.1, we develop a graphical framework to infer and interpret the global structure exhibited by the constraints of the embedding problem. Graph-theoretic algorithms presented in Section 2.2 then exploit this representation to filter this collection of noisy similarity measurements for consistency and redundancy. The final, reduced set of relative comparison constraints defines a partial order, making for a more robust and efficient embedding problem.
2.1 Similarity graphs
To gain more insight into the underlying structure of a collection of comparisons $\mathcal{C}$, we can represent $\mathcal{C}$ as a directed graph over $\mathcal{X}^2$. Each vertex in the graph corresponds to a pair $(i, j) \in \mathcal{X}^2$, and an edge from $(i, j)$ to $(k, \ell)$ corresponds to a similarity measurement $(i, j, k, \ell)$ (see Figure 2). Interpreting $\mathcal{C}$ as a graph will allow us to infer properties of global (graphical) structure of $\mathcal{C}$. In particular, two facts become immediately apparent:

If $\mathcal{C}$ contains cycles, then there exists no embedding which can satisfy $\mathcal{C}$.

If $\mathcal{C}$ is acyclic, any embedding that satisfies the transitive reduction $\mathcal{C}^{\min}$ also satisfies $\mathcal{C}$.
The first fact implies that no algorithm can produce an embedding which satisfies all measurements if the graph is cyclic. In fact, the converse of this statement is also true: if $\mathcal{C}$ is acyclic, then an embedding exists in which all similarity measurements are preserved (see Appendix A). If $\mathcal{C}$ is cyclic, however, by analyzing the graph, it is possible to identify an “unlearnable” subset of $\mathcal{C}$ which must be violated by any embedding.
Similarly, the second fact exploits the transitive nature of distance comparisons. As in the example depicted in Figure 2, any $g$ that satisfies two chained comparisons $(i, j, k, \ell)$ and $(k, \ell, p, q)$ must also satisfy $(i, j, p, q)$. In effect, the constraint $(i, j, p, q)$ is redundant, and may also be safely omitted from $\mathcal{C}$.
These two observations allude to two desirable properties in $\mathcal{C}$ for embedding methods: transitivity and antisymmetry. Together with irreflexivity, these fit the defining characteristics of a partial order. Due to subjectivity and inter-labeler disagreement, however, most collections of relative comparisons will not define a partial order. Some graph processing, presented next, based on an approximate maximum acyclic subgraph algorithm, can reduce them to a partial order.
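The graph construction described in this subsection can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names are hypothetical, and treating each pair as unordered (because distances are symmetric) is an assumption made here. Each comparison (i, j, k, l) becomes a directed edge from the pair (i, j) to the pair (k, l), and a depth-first search reports whether the collection contains a cycle.

```python
from collections import defaultdict


def build_similarity_graph(comparisons):
    """Turn each comparison (i, j, k, l) into a directed edge (i, j) -> (k, l).

    Assumption: pairs are unordered, so each vertex is stored as a sorted tuple.
    """
    graph = defaultdict(set)
    for i, j, k, l in comparisons:
        u = tuple(sorted((i, j)))
        v = tuple(sorted((k, l)))
        graph[u].add(v)
        graph.setdefault(v, set())  # make sure sink vertices appear as keys
    return dict(graph)


def has_cycle(graph):
    """Depth-first search with three colors; a gray-to-gray edge closes a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def visit(u):
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY:
                return True
            if color[v] == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)


# Two comparisons that chain consistently, then a third that contradicts them.
consistent = [("a", "b", "a", "c"), ("a", "c", "b", "c")]
inconsistent = consistent + [("b", "c", "a", "b")]
```

Here `has_cycle(build_similarity_graph(consistent))` is false, while adding the third comparison closes a cycle, so no embedding can satisfy all three measurements at once.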
2.2 Graph simplification
Because a set of similarity measurements $\mathcal{C}$ containing cycles cannot be embedded in any Euclidean space, $\mathcal{C}$ is inherently inconsistent. Cycles in $\mathcal{C}$ therefore constitute a form of label noise. As noted by Angelova (2004), label noise can have adverse effects on both model complexity and generalization. This problem can be mitigated by detecting and pruning noisy (confusing) examples, and training on a reduced, but certifiably “clean” set (Angelova et al., 2005; Vezhnevets and Barinova, 2007).
Unlike most settings, where the noise process affects each label independently — e.g., random classification noise (Angluin and Laird, 1988) — the graphical structure of interrelated relative comparisons can be exploited to detect and prune inconsistent measurements. By eliminating similarity measurements which cannot be realized by any embedding, the optimization procedure can be carried out more efficiently and reliably on a reduced constraint set.
Ideally, when eliminating edges from the graph, we would like to retain as much information as possible. Unfortunately, this is equivalent to the maximum acyclic subgraph problem, which is NP-complete (Garey and Johnson, 1979). A $\tfrac{1}{2}$-approximate solution can be achieved by a simple greedy algorithm (Algorithm 1) (Berger and Shor, 1990).
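One simple greedy heuristic for extracting an acyclic subgraph is sketched below. Algorithm 1 is not reproduced in this text, so the rule used here (visit edges in random order and keep an edge unless it would close a cycle) is an assumption for illustration, and no approximation guarantee is claimed for it. The function names and the `seed` parameter are hypothetical.

```python
import random


def reachable(adj, source, target):
    """Return True if target can be reached from source along edges in adj."""
    stack, seen = [source], {source}
    while stack:
        u = stack.pop()
        if u == target:
            return True
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def greedy_acyclic_subgraph(edges, seed=0):
    """Visit edges in random order; keep an edge unless it would close a cycle."""
    order = sorted(set(edges))          # deduplicate, fix a base order
    random.Random(seed).shuffle(order)  # reproducible random order
    adj, kept = {}, []
    for u, v in order:
        # Adding u -> v creates a cycle exactly when u is reachable from v.
        if reachable(adj, v, u):
            continue
        adj.setdefault(u, set()).add(v)
        kept.append((u, v))
    return kept


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d")]
kept = greedy_acyclic_subgraph(edges)
```

In this example the three-edge cycle loses exactly one edge, whichever is visited last, and the edge into the sink `d` is always retained.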
Once a consistent subset of similarity measurements has been produced, it can be simplified further by pruning redundancies. In the graph view of similarity measurements, redundancies can be easily removed by computing the transitive reduction of the graph (Aho et al., 1972).
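For a DAG, the transitive reduction keeps exactly those edges that are not implied by a longer path. The sketch below is a straightforward, unoptimized illustration of that idea rather than the algorithm of Aho et al. (1972); the function name is hypothetical and the input is assumed to be acyclic.

```python
def transitive_reduction(edges):
    """Drop every edge (u, v) of a DAG that is implied by a longer u -> v path."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set())

    def reachable(source, target):
        stack, seen = [source], {source}
        while stack:
            x = stack.pop()
            if x == target:
                return True
            for y in adj[x] - seen:
                seen.add(y)
                stack.append(y)
        return False

    reduced = set()
    for u in adj:
        for v in adj[u]:
            # (u, v) is redundant if v is reachable through another successor of u.
            if not any(w != v and reachable(w, v) for w in adj[u]):
                reduced.add((u, v))
    return reduced


dag = {("a", "b"), ("b", "c"), ("a", "c")}
core = transitive_reduction(dag)
```

The edge `("a", "c")` is removed because it follows from the path through `b`; the two remaining edges form the reduced constraint set.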
By filtering the constraint set for consistency, we ensure that embedding algorithms are not learning from spurious information. Additionally, pruning the constraint set by transitive reduction focuses embedding algorithms on the most important core set of constraints while reducing overhead due to redundant information.
3 Partial order embedding
Now that we have developed a language for expressing similarity between items, we are ready to formulate the embedding problem. In this section, we develop an algorithm that learns a representation of data consistent with a collection of relative similarity measurements, and allows us to map unseen data into the learned similarity space after learning. In order to accomplish this, we will assume a feature representation for $\mathcal{X}$. By parameterizing the embedding function $g$ in terms of the feature representation, we will be able to apply $g$ to any point in the feature space, thereby generalizing to data outside of the training set.
3.1 Linear projection
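To start, we assume that the data originally lies in some Euclidean space, i.e., $\mathcal{X} \subset \mathbb{R}^D$. There are of course many ways to define an embedding function $g : \mathbb{R}^D \to \mathbb{R}^d$. Here, we will restrict attention to embeddings parameterized by a linear projection matrix $M$, so that for a vector $x \in \mathbb{R}^D$,
$$g(x) = Mx.$$
Collecting the vector representations of the training set as columns of a matrix $X \in \mathbb{R}^{D \times n}$, the inner product matrix of the embedded points can be characterized as
$$A = X^\top M^\top M X. \tag{2}$$
Now, for a relative comparison $(i, j, k, \ell)$, we can express the distance constraint (1) between embedded points as follows:
$$(X_i - X_j)^\top M^\top M (X_i - X_j) + 1 \le (X_k - X_\ell)^\top M^\top M (X_k - X_\ell). \tag{3}$$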
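These inequalities can then be used to form the constraint set of an optimization problem to solve for $M$. Because, in general, $\mathcal{C}$ may not be satisfiable by a linear projection of $\mathcal{X}$, we soften the constraints by introducing a slack variable $\xi_{ijk\ell} \ge 0$ for each constraint, and minimize the empirical hinge loss over constraint violations $\frac{1}{|\mathcal{C}|} \sum_{\mathcal{C}} \xi_{ijk\ell}$. This choice of loss function can be interpreted as a generalization of ROC area (see Appendix C).
To avoid overfitting, we introduce a regularization term $\operatorname{tr}(M^\top M)$, and a tradeoff parameter $\beta > 0$ to control the balance between regularization and loss minimization. This leads to a regularized risk minimization objective:
$$\min_{M,\ \xi \ge 0} \ \operatorname{tr}(M^\top M) + \frac{\beta}{|\mathcal{C}|} \sum_{\mathcal{C}} \xi_{ijk\ell} \tag{4}$$
$$\text{s.t.} \quad (X_i - X_j)^\top M^\top M (X_i - X_j) + 1 \le (X_k - X_\ell)^\top M^\top M (X_k - X_\ell) + \xi_{ijk\ell}, \quad \forall (i, j, k, \ell) \in \mathcal{C}.$$
After learning $M$ by solving this optimization problem, the embedding can be extended to out-of-sample points $x'$ by applying the projection: $g(x') = M x'$.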
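To make the regularized objective concrete, the sketch below evaluates it for a fixed projection. It assumes the objective has the form tr(MᵀM) + (β/|C|) Σ ξ, with each slack equal to the hinge violation max(0, d(i,j) + 1 − d(k,l)); the function name `poe_objective` and the toy data are hypothetical, and no optimization is performed here.

```python
import numpy as np


def poe_objective(M, X, comparisons, beta):
    """Evaluate tr(M^T M) + (beta / |C|) * sum of hinge slacks for a fixed M.

    X holds one training point per column; comparisons holds index tuples
    (i, j, k, l) meaning "i and j are more similar than k and l".
    """
    G = M @ X  # embedded points, one per column

    def sq_dist(a, b):
        diff = G[:, a] - G[:, b]
        return float(diff @ diff)

    # Slack is the amount by which the unit-margin constraint is violated.
    slacks = [max(0.0, sq_dist(i, j) + 1.0 - sq_dist(k, l))
              for i, j, k, l in comparisons]
    objective = float(np.trace(M.T @ M)) + beta * sum(slacks) / len(comparisons)
    return objective, slacks


# Three points on a line at 0, 1 and 3, embedded by the identity projection.
X = np.array([[0.0, 1.0, 3.0]])
M = np.array([[1.0]])
objective, slacks = poe_objective(M, X, [(0, 1, 0, 2), (0, 2, 0, 1)], beta=2.0)
```

The first comparison is satisfied with margin, so its slack is zero; the second contradicts the geometry and incurs a slack of 9, which the tradeoff parameter then weighs against the regularizer.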
Note that the distance constraints in (4) involve differences of quadratic terms, and are therefore not convex. However, since $M$ only appears in the form $M^\top M$ in (4), we can equivalently express the optimization problem in terms of a positive semidefinite matrix $W = M^\top M$. This change of variables results in Algorithm 2, a convex optimization problem, more specifically a semidefinite programming (SDP) problem (Boyd and Vandenberghe, 2004), since objective and constraints are linear in $W$, including the linear matrix inequality $W \succeq 0$. The corresponding inner product matrix is
$$A = X^\top W X.$$
Finally, after the optimal $W$ is found, the embedding function $g$ can be recovered from the spectral decomposition of $W$:
$$W = V \Lambda V^\top \quad \Rightarrow \quad g(x) = \Lambda^{1/2} V^\top x.$$
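The recovery step can be checked numerically. The sketch below assumes the recovered map is g(x) = Λ^{1/2} Vᵀ x for a PSD matrix W = V Λ Vᵀ, and uses a randomly generated PSD matrix as a stand-in for a learned one; `projection_from_metric` is a hypothetical name.

```python
import numpy as np


def projection_from_metric(W):
    """Return L = Lambda^{1/2} V^T so that L^T L equals the PSD matrix W."""
    eigvals, eigvecs = np.linalg.eigh(W)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort in descending order
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard against round-off
    return np.sqrt(eigvals)[:, None] * eigvecs[:, order].T


rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
W = B.T @ B                    # arbitrary PSD matrix standing in for a learned W
L = projection_from_metric(W)

x, y = rng.standard_normal(3), rng.standard_normal(3)
mahalanobis_sq = (x - y) @ W @ (x - y)
euclidean_sq = np.sum((L @ x - L @ y) ** 2)
```

The two squared distances agree up to floating-point error, which is why optimizing over W is equivalent to optimizing over the projection itself.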
3.2 Nonlinear projection via kernels
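The formulation in Algorithm 2 can be generalized to support nonlinear embeddings by the use of kernels, following the method of Globerson and Roweis (2007): we first map the data into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ via a feature map $\phi$ with corresponding kernel function $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$; then, the data is mapped to $\mathbb{R}^d$ by a linear projection $M : \mathcal{H} \to \mathbb{R}^d$. The embedding function $g : \mathcal{X} \to \mathbb{R}^d$ is therefore the composition of the projection $M$ with $\phi$:
$$g(x) = M(\phi(x)).$$
Because $\phi$ may be nonlinear, this allows us to learn a nonlinear embedding $g$.
More precisely, we consider $M$ as being comprised of $d$ elements of $\mathcal{H}$, i.e., $M = (\omega_1, \omega_2, \dots, \omega_d)$ with $\omega_p \in \mathcal{H}$. The embedding $g$ can thus be expressed as
$$g(x) = \left( \langle \omega_p, \phi(x) \rangle_{\mathcal{H}} \right)_{p=1}^{d},$$
where $(\cdot)_{p=1}^{d}$ denotes concatenation over $d$ vectors.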
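Note that in general, $\mathcal{H}$ may be infinite-dimensional, so directly optimizing $M$ may not be feasible. However, by appropriately regularizing $M$, we may invoke the generalized representer theorem (Schölkopf et al., 2001). Our choice of regularization is the Hilbert-Schmidt norm of $M$, which, in this case, reduces to
$$\|M\|_{\mathrm{HS}}^2 = \sum_{p=1}^{d} \langle \omega_p, \omega_p \rangle_{\mathcal{H}}.$$
With this choice of regularization, it follows from the generalized representer theorem that at an optimum, each $\omega_p$ must lie in the span of the training data, i.e.,
$$\omega_p = \sum_{i=1}^{n} N_{pi}\, \phi(x_i), \quad p = 1, \dots, d,$$
for some real-valued matrix $N \in \mathbb{R}^{d \times n}$. If $\Phi$ is a matrix representation of $\mathcal{X}$ in $\mathcal{H}$ (i.e., $\Phi_i = \phi(x_i)$ for $x_i \in \mathcal{X}$), then the projection operator $M$ can be expressed as
$$M = N \Phi^\top. \tag{5}$$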
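We can now reformulate the embedding problem as an optimization over $N$ rather than $M$. Using (5), the regularization term can be expressed as
$$\|M\|_{\mathrm{HS}}^2 = \operatorname{tr}(\Phi N^\top N \Phi^\top) = \operatorname{tr}(N^\top N \Phi^\top \Phi) = \operatorname{tr}(N^\top N K),$$
where $K$ is the kernel matrix over $\mathcal{X}$:
$$K = \Phi^\top \Phi, \quad \text{with } K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}} = k(x_i, x_j).$$
To formulate the distance constraints in terms of $N$, we first express the embedding $g$ in terms of $N$ and the kernel function:
$$g(x) = M(\phi(x)) = N \Phi^\top \phi(x) = N \left( \langle \Phi_i, \phi(x) \rangle_{\mathcal{H}} \right)_{i=1}^{n} = N \left( k(x_i, x) \right)_{i=1}^{n} = N K_x,$$
where $K_x$ is the column vector formed by evaluating the kernel function $k$ at $x$ against the training set. The inner product matrix of embedded points can therefore be expressed as
$$A = K N^\top N K,$$
which allows us to express the distance constraints in terms of $N$ and the kernel matrix $K$:
$$(K_i - K_j)^\top N^\top N (K_i - K_j) + 1 \le (K_k - K_\ell)^\top N^\top N (K_k - K_\ell).$$
The embedding problem thus amounts to solving the following optimization problem in $N$ and $\xi$:
$$\min_{N,\ \xi \ge 0} \ \operatorname{tr}(N^\top N K) + \frac{\beta}{|\mathcal{C}|} \sum_{\mathcal{C}} \xi_{ijk\ell} \tag{6}$$
$$\text{s.t.} \quad (K_i - K_j)^\top N^\top N (K_i - K_j) + 1 \le (K_k - K_\ell)^\top N^\top N (K_k - K_\ell) + \xi_{ijk\ell}, \quad \forall (i, j, k, \ell) \in \mathcal{C}.$$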
Again, the distance constraints in (6) are non-convex due to the differences of quadratic terms. And, as in the previous section, $N$ only appears in the form of inner products $N^\top N$ in (6), both in the constraints and in the regularization term, so we can again derive a convex optimization problem by changing variables to $W = N^\top N \succeq 0$. The resulting embedding problem is listed as Algorithm 3, again a semidefinite programming problem (SDP), with an objective function and constraints that are linear in $W$.
After solving for $W$, the matrix $N$ can be recovered by computing the spectral decomposition $W = V \Lambda V^\top$, and defining $N = \Lambda^{1/2} V^\top$. The resulting embedding function takes the form:
$$g(x) = \Lambda^{1/2} V^\top K_x.$$
As in Schultz and Joachims (2004), this formulation can be interpreted as learning a Mahalanobis distance metric $\Phi W \Phi^\top$ over $\mathcal{H}$. More generally, we can view this as a form of kernel learning, where the kernel matrix $A$ is restricted to the set
$$A \in \left\{ K W K \ : \ W \succeq 0 \right\}. \tag{7}$$
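The out-of-sample property of the kernelized embedding can be illustrated as follows. This is a sketch under stated assumptions: the embedding is taken to be g(x) = N K_x with NᵀN = W, the Gaussian kernel and its `gamma` parameter are arbitrary choices, and W is a random PSD placeholder rather than the solution of Algorithm 3. All function names are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel between the rows of A and the rows of B."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * sq)


def kernel_embed(W, K_cols):
    """Embed points from their kernel columns K_x (one column per point)."""
    eigvals, V = np.linalg.eigh(W)
    N = np.sqrt(np.clip(eigvals, 0.0, None))[:, None] * V.T  # N^T N = W
    return N @ K_cols


rng = np.random.default_rng(1)
X_train = rng.standard_normal((5, 2))
X_new = rng.standard_normal((2, 2))
B = rng.standard_normal((5, 5))
W = B.T @ B                                  # placeholder for a learned PSD matrix

K = rbf_kernel(X_train, X_train)
G_train = kernel_embed(W, K)                 # columns are g(x_i) for training points
G_new = kernel_embed(W, rbf_kernel(X_train, X_new))  # unseen points
```

The Gram matrix of the embedded training points equals K W K, matching the kernel class above, and new points only require kernel evaluations against the training set.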
3.3 Connection to GNMDS
We conclude this section by drawing a connection between Algorithm 3 and the generalized nonmetric MDS (GNMDS) algorithm of Agarwal et al. (2007).
First, we observe that the $i$th column, $K_i$, of the kernel matrix $K$ can be expressed in terms of $K$ and the $i$th standard basis vector $e_i$:
$$K_i = K e_i.$$
From this, it follows that distance computations in Algorithm 3 can be equivalently expressed as
$$d(x_i, x_j) = (K_i - K_j)^\top W (K_i - K_j) = (e_i - e_j)^\top K^\top W K (e_i - e_j). \tag{8}$$
If we consider the extremal case where $K = I$, i.e., we have no prior feature-based knowledge of similarity between points, then Equation 8 simplifies to
$$d(x_i, x_j) = (e_i - e_j)^\top W (e_i - e_j) = W_{ii} + W_{jj} - 2 W_{ij}.$$
Therefore, in this setting, rather than defining a feature transformation, $W$ directly encodes the inner products between embedded training points. Similarly, the regularization term becomes
$$\operatorname{tr}(W K) = \operatorname{tr}(W).$$
Minimizing the regularization term can be interpreted as minimizing a convex upper bound on the rank of $W$ (Boyd and Vandenberghe, 2004), which expresses a preference for low-dimensional embeddings. Thus, by setting $K = I$ in Algorithm 3, we directly recover the GNMDS algorithm.
Note that directly learning inner products between embedded training data points rather than a feature transformation does not allow a meaningful out-of-sample extension to embed unseen data points. On the other hand, by Equation 7, it is clear that the algorithm optimizes over the entire cone of PSD matrices. Thus, if $\mathcal{C}$ defines a DAG, we could exploit the fact that a partial order over distances always allows an embedding which satisfies all constraints in $\mathcal{C}$ (see Appendix A) to eliminate the slack variables from the program entirely.
4 Multiple kernel embedding
In the previous section, we derived an algorithm to learn an optimal projection from a kernel space $\mathcal{H}$ to $\mathbb{R}^d$ such that Euclidean distance between embedded points conforms to perceptual similarity. If, however, the data is heterogeneous in nature, it may not be realistic to assume that a single feature representation can sufficiently capture the inherent structure in the data. For example, if the objects in question are images, it may be natural to encode texture information by one set of features, and color in another, and it is not immediately clear how to reconcile these two disparate sources of information into a single kernel space.
However, by encoding each source of information independently by separate feature spaces (equivalently, kernel matrices), we can formulate a multiple kernel learning algorithm to optimally combine all feature spaces into a single, unified embedding space. In this section, we will derive a novel, projection-based approach to multiple-kernel learning and extend Algorithm 3 to support heterogeneous data in a principled way.
4.1 Unweighted combination
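Let $K^1, K^2, \dots, K^m$ be a set of kernel matrices, each with a corresponding feature map $\phi^p$ and RKHS $\mathcal{H}^p$, for $p \in 1, \dots, m$. One natural way to combine the kernels is to look at the product space, which is formed by concatenating the feature maps:
$$\phi(x_i) = \left( \phi^1(x_i), \phi^2(x_i), \dots, \phi^m(x_i) \right) = \left( \phi^p(x_i) \right)_{p=1}^{m}.$$
Inner products can be computed in this space by summing across each feature map:
$$\langle \phi(x_i), \phi(x_j) \rangle = \sum_{p=1}^{m} \langle \phi^p(x_i), \phi^p(x_j) \rangle_{\mathcal{H}^p},$$
resulting in the sum-kernel, also known as the average kernel or product space kernel. The corresponding kernel matrix can be conveniently represented as the unweighted sum of the base kernel matrices:
$$\hat{K} = \sum_{p=1}^{m} K^p. \tag{9}$$
Since $\hat{K}$ is a valid kernel matrix itself, we could use $\hat{K}$ as input for Algorithm 3. As a result, the algorithm would learn a kernel from the family
$$\mathcal{K}_1 = \left\{ \hat{K} W \hat{K} \ : \ W \succeq 0 \right\} = \left\{ \sum_{p,q=1}^{m} K^p W K^q \ : \ W \succeq 0 \right\}.$$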
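The identity between the unweighted kernel sum and the concatenated feature space is easy to verify when the feature maps are explicit. The sketch below uses two hypothetical finite-dimensional modalities with linear kernels; the feature matrices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
F1 = rng.standard_normal((n, 3))   # hypothetical features for modality 1
F2 = rng.standard_normal((n, 5))   # hypothetical features for modality 2

K1 = F1 @ F1.T                     # linear kernel on modality 1
K2 = F2 @ F2.T                     # linear kernel on modality 2

F_concat = np.hstack([F1, F2])     # concatenated feature map
K_concat = F_concat @ F_concat.T   # kernel of the product space
K_sum = K1 + K2                    # unweighted sum of base kernels
```

`K_concat` and `K_sum` coincide up to round-off, so running a single-kernel method on the summed kernel is the same as running it on the concatenated features.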
4.2 Weighted combination
Note that $\mathcal{K}_1$ treats each kernel equally; it is therefore impossible to distinguish good features (i.e., those which can be transformed to best fit $\mathcal{C}$) from bad features, and as a result, the quality of the resulting embedding may be degraded. To combat this phenomenon, it is common to learn a scheme for weighting the kernels in a way which is optimal for a particular task. The most common approach to combining the base kernels is to take a positive-weighted sum
$$\sum_{p=1}^{m} \mu_p K^p, \quad \mu_p \in \mathbb{R}_+,$$
where the weights $\mu_p$ are learned in conjunction with a predictor (Lanckriet et al., 2004; Sonnenburg et al., 2006; Bach, 2008; Cortes et al., 2009). Equivalently, this can be viewed as learning a feature map
$$\phi(x_i) = \left( \sqrt{\mu_p}\, \phi^p(x_i) \right)_{p=1}^{m},$$
where each base feature map has been scaled by the corresponding weight $\sqrt{\mu_p}$.
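Applying this reasoning to learning an embedding that conforms to perceptual similarity, one might consider a two-stage approach to parameterizing the embedding (Figure 3(a)): first construct a weighted kernel combination, and then project from the combined kernel space. Lin et al. (2009) formulate a dimensionality reduction algorithm in this way. In the present setting, this would be achieved by simultaneously optimizing $W$ and $\mu_p$ to choose an inner product matrix $A$ from the set
$$\mathcal{K}_2 = \left\{ \left( \sum_{p=1}^{m} \mu_p K^p \right) W \left( \sum_{p=1}^{m} \mu_p K^p \right) \ : \ W \succeq 0,\ \forall p,\ \mu_p \ge 0 \right\}. \tag{10}$$
The corresponding distance constraints, however, contain differences of terms cubic in the optimization variables $W$ and $\mu_p$:
$$\sum_{p,q} (K^p_i - K^p_j)^\top \mu_p W \mu_q (K^q_i - K^q_j) + 1 \le \sum_{p,q} (K^p_k - K^p_\ell)^\top \mu_p W \mu_q (K^q_k - K^q_\ell),$$
and are therefore non-convex and difficult to optimize. Even simplifying the class by removing cross-terms, i.e., restricting $A$ to the form
$$\mathcal{K}_3 = \left\{ \sum_{p=1}^{m} \mu_p^2 K^p W K^p \ : \ W \succeq 0,\ \forall p,\ \mu_p \ge 0 \right\}, \tag{11}$$
still leads to a non-convex problem, due to the difference of positive quadratic terms introduced by distance calculations:
$$\sum_{p=1}^{m} (K^p_i - K^p_j)^\top \mu_p^2 W (K^p_i - K^p_j) + 1 \le \sum_{p=1}^{m} (K^p_k - K^p_\ell)^\top \mu_p^2 W (K^p_k - K^p_\ell).$$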
However, a more subtle problem with this formulation lies in the assumption that a single weight can characterize the contribution of a kernel to the optimal embedding. In general, different kernels may be more or less informative on different subsets of $\mathcal{X}$ or different regions of the corresponding feature space. Constraining the embedding to a single metric $W$ with a single weight $\mu_p$ for each kernel may be too restrictive to take advantage of this phenomenon.
4.3 Concatenated projection
We now return to the original intuition behind Equation 9. The sum-kernel represents the inner product between points in the space formed by concatenating the base feature maps $\phi^p$. The sets $\mathcal{K}_2$ and $\mathcal{K}_3$ characterize projections of the weighted combination space, and turn out to not be amenable to efficient optimization (Figure 3(a)). This can be seen as a consequence of prematurely combining kernels prior to projection.
Rather than projecting the (weighted) concatenation of $\phi^p(\cdot)$, we could alternatively concatenate learned projections $M^p(\phi^p(\cdot))$, as illustrated by Figure 3(b). Intuitively, by defining the embedding as the concatenation of $m$ different projections, we allow the algorithm to learn an ensemble of projections, each tailored to its corresponding domain space and jointly optimized to produce an optimal space. By contrast, the previously discussed formulations apply essentially the same projection to each (weighted) feature space, and are thus much less flexible than our proposed approach. Mathematically, an embedding function of this form can be expressed as the concatenation
$$g(x_i) = \left( M^p(\phi^p(x_i)) \right)_{p=1}^{m}.$$
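Now, given this characterization of the embedding function, we can adapt Algorithm 3 to optimize over multiple kernels. As in the single-kernel case, we introduce regularization terms for each projection operator $M^p$,
$$\sum_{p=1}^{m} \|M^p\|_{\mathrm{HS}}^2,$$
to the objective function. Again, by invoking the representer theorem for each $M^p$, it follows that
$$M^p = N^p (\Phi^p)^\top$$
for some matrix $N^p$, which allows us to reformulate the embedding problem as a joint optimization over $N^p$, $p = 1, \dots, m$, rather than $M^p$, $p = 1, \dots, m$. Indeed, the regularization terms can be expressed as
$$\sum_{p=1}^{m} \|M^p\|_{\mathrm{HS}}^2 = \sum_{p=1}^{m} \operatorname{tr}\left( (N^p)^\top N^p K^p \right). \tag{12}$$
The embedding function can now be rewritten as
$$g(x_i) = \left( M^p(\phi^p(x_i)) \right)_{p=1}^{m} = \left( N^p K^p_i \right)_{p=1}^{m},$$
and the inner products between embedded points take the form:
$$A_{ij} = \langle g(x_i), g(x_j) \rangle = \sum_{p=1}^{m} (K^p_i)^\top (N^p)^\top N^p K^p_j.$$
Similarly, squared Euclidean distance also decomposes by kernel:
$$\|g(x_i) - g(x_j)\|^2 = \sum_{p=1}^{m} (K^p_i - K^p_j)^\top (N^p)^\top N^p (K^p_i - K^p_j). \tag{13}$$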
Finally, since the matrices $N^p$, $p = 1, \dots, m$, only appear in the form of inner products in (12) and (13), we may instead optimize over PSD matrices $W^p = (N^p)^\top N^p$. This renders the regularization terms (12) and distances (13) linear in the optimization variables $W^p$. Extending Algorithm 3 to this parameterization of $g(\cdot)$ therefore results in an SDP, which is listed as Algorithm 4. To solve the SDP, we implemented a gradient descent solver, which is described in Appendix B.
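The per-kernel decomposition of distances can be checked numerically. The sketch below assumes the concatenated embedding stacks the vectors N^p K^p_i over the kernels and that W^p = (N^p)ᵀ N^p; the kernels and projections are random placeholders, and `mk_distance` is a hypothetical name.

```python
import numpy as np


def mk_distance(Ks, Ws, i, j):
    """Squared distance sum_p (K^p_i - K^p_j)^T W^p (K^p_i - K^p_j)."""
    total = 0.0
    for K, W in zip(Ks, Ws):
        diff = K[:, i] - K[:, j]
        total += float(diff @ W @ diff)
    return total


rng = np.random.default_rng(3)
n, m, d = 5, 3, 2
Ks, Ns = [], []
for _ in range(m):
    F = rng.standard_normal((n, 4))
    Ks.append(F @ F.T)                       # base kernel K^p
    Ns.append(rng.standard_normal((d, n)))   # per-kernel projection N^p
Ws = [N.T @ N for N in Ns]                   # W^p = (N^p)^T N^p

# Concatenated embedding: stack N^p K^p over p; columns are g(x_i).
G = np.vstack([N @ K for N, K in zip(Ns, Ks)])
direct = float(np.sum((G[:, 0] - G[:, 1]) ** 2))
decomposed = mk_distance(Ks, Ws, 0, 1)
```

Because `decomposed` depends on each W^p only linearly, the distance constraints stay linear in the optimization variables after the change of variables.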
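The class of kernels over which Algorithm 4 optimizes can be expressed algebraically as
$$\mathcal{K}_4 = \left\{ \sum_{p=1}^{m} K^p W^p K^p \ : \ \forall p,\ W^p \succeq 0 \right\}. \tag{14}$$
Note that $\mathcal{K}_4$ contains $\mathcal{K}_3$ as a special case when all $W^p$ are positive scalar multiples of each other. However, $\mathcal{K}_4$ leads to a convex optimization problem, where $\mathcal{K}_3$ does not.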
Table 1 lists the block-matrix formulations of each of the kernel combination rules described in this section. It is worth noting that it is certainly valid to first form the unweighted combination kernel $\hat{K}$ and then use $\mathcal{K}_1$ (Algorithm 3) to learn an optimal projection of the product space. However, as we will demonstrate in Section 5, our proposed multiple-kernel formulation ($\mathcal{K}_4$) outperforms the simple unweighted combination rule in practice.
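Table 1: Block-matrix formulations of the kernel combination rules $\mathcal{K}_1$ through $\mathcal{K}_4$. Each learned kernel matrix has the form $[K^1\ K^2\ \cdots\ K^m]\, B\, [K^1\ K^2\ \cdots\ K^m]^\top$, where $B$ is an $mn \times mn$ matrix composed of $n \times n$ blocks $B_{pq}$.
Kernel class | Learned kernel matrix (block $B_{pq}$)
$\mathcal{K}_1$ | $B_{pq} = W$ for all $p, q$
$\mathcal{K}_2$ | $B_{pq} = \mu_p \mu_q W$ for all $p, q$
$\mathcal{K}_3$ | $B_{pp} = \mu_p^2 W$, and $B_{pq} = 0$ for $p \ne q$
$\mathcal{K}_4$ | $B_{pp} = W^p$, and $B_{pq} = 0$ for $p \ne q$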

4.4 Diagonal learning
The MKPOE optimization is formulated as a semidefinite program over different matrices — or, as shown in Table 1, a single PSD matrix with a blockdiagonal sparsity structure. Scaling this approach to large data sets can become problematic, as they require optimizing over multiple highdimensional PSD matrices.
To cope with larger problems, the optimization problem can be refined to constrain each to the set of diagonal matrices. If are all diagonal, positive semidefiniteness is equivalent to nonnegativity of the diagonal values (since they are also the eigenvalues of the matrix). This allows the constraints to be replaced by linear constraints , and the resulting optimization problem is a linear program (LP), rather than an SDP. This modification reduces the flexibility of the model, but leads to a much more efficient optimization procedure.
More specifically, our implementation of Algorithm 4 operates by alternating gradient descent on and projection onto the feasible set (see Appendix B for details). For full matrices, this projection is accomplished by computing the spectral decomposition of each , and thresholding the eigenvalues at 0. For diagonal matrices, this projection is accomplished simply by
which can be computed in time, compared to the time required to compute spectral decompositions.
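The two projection steps described above can be sketched as follows. This is a minimal illustration rather than the solver of Appendix B: the function names `project_psd` and `project_diag` are hypothetical, a full matrix is assumed to be stored as a dense NumPy array, and a diagonal matrix as a vector of its diagonal entries.

```python
import numpy as np

def project_psd(W):
    """Project a symmetric matrix onto the PSD cone.

    Computes the spectral decomposition and thresholds the
    eigenvalues at 0, as described for the full-matrix case.
    """
    W = (W + W.T) / 2.0                    # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(W)   # eigenvalues in ascending order
    eigvals = np.maximum(eigvals, 0.0)     # threshold at 0
    return (eigvecs * eigvals) @ eigvecs.T

def project_diag(w):
    """Project a diagonal matrix (stored as a vector) onto the feasible set.

    For diagonal matrices the eigenvalues are the diagonal entries,
    so the projection is an elementwise threshold at 0.
    """
    return np.maximum(w, 0.0)
```

On a diagonal input the two functions agree, but `project_diag` touches each entry once, whereas `project_psd` requires a full eigendecomposition per matrix.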
Restricting to be diagonal not only simplifies the problem to linear programming, but also carries the added interpretation of weighting the contribution of each (kernel, training point) pair in the construction of the embedding. A large value at corresponds to point being a landmark for the features encoded in . Note that each of the formulations listed in Table 1 has a corresponding diagonal variant; however, as in the full matrix case, only and lead to convex optimization problems.
5 Experiments
To evaluate our framework for learning multimodal similarity, we first test the multiple kernel learning formulation on a simple toy taxonomy data set, and then on a real-world data set of musical perceptual similarity measurements.
5.1 Toy experiment: Taxonomy embedding
For our first experiment, we generated a toy data set from the Amsterdam Library of Object Images (ALOI) data set (Geusebroek et al., 2005). ALOI consists of RGB images of 1000 classes of objects against a black background. Each class corresponds to a single object, and examples are provided of the object under varying degrees of out-of-plane rotation.
In our experiment, we first selected 10 object classes, and from each class, sampled 20 examples. We then constructed an artificial taxonomy over the label set, as depicted in Figure 4. Using the taxonomy, we synthesized relative comparisons to span subtrees via their least common ancestor. For example,
and so on. These comparisons are consistent and therefore can be represented as a directed acyclic graph. They are generated so as to avoid redundant, transitive edges in the graph.
For features, we generated five kernel matrices. The first is a simple linear kernel over the grayscale intensity values of the images, which, roughly speaking, compares objects by shape. The other four are Gaussian kernels over histograms in the (background-subtracted) red, green, blue, and intensity channels, and these kernels compare objects based on their color or intensity distributions.
We augment this set of kernels with five “noise” kernels, each of which was generated by sampling random points from the unit sphere in and applying the linear kernel.
The data was partitioned into five 80/20 training and test set splits. To tune , we further split the training set for 5-fold cross-validation, and swept over . For each fold, we learned a diagonally-constrained embedding with Algorithm 4, using the subset of relative comparisons with and restricted to the training set. After learning the embedding, the held-out data (validation or test) was mapped into the space, and the accuracy of the embedding was determined by counting the fraction of correctly predicted relative comparisons. In the validation and test sets, comparisons were filtered to include only those of the form where belongs to the validation (or test) set, and and belong to the training set.
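The accuracy measure used here, the fraction of relative comparisons that are correctly predicted by distance in the embedding, can be sketched as below. The representation is an assumption made for illustration: points are rows of an array `X`, and each comparison is a hypothetical index tuple `(i, j, k, l)` meaning that the pair (i, j) should be closer than the pair (k, l).

```python
import numpy as np

def comparison_accuracy(X, comparisons):
    """Fraction of comparisons (i, j, k, l) with d(x_i, x_j) < d(x_k, x_l).

    X           : array of shape (n_points, n_dims), one embedded point per row.
    comparisons : iterable of index 4-tuples (an assumed encoding).
    """
    comparisons = list(comparisons)
    correct = 0
    for i, j, k, l in comparisons:
        d_ij = np.sum((X[i] - X[j]) ** 2)   # squared distances preserve ordering
        d_kl = np.sum((X[k] - X[l]) ** 2)
        correct += int(d_ij < d_kl)
    return correct / len(comparisons)
```

For the held-out evaluation, `X` would contain both the mapped validation (or test) points and the training points they are compared against.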
We repeat this experiment for each base kernel individually (i.e., optimizing over with a single base kernel), as well as the unweighted sum kernel ( with all base kernels), and finally MKPOE ( with all base kernels). The results are averaged over all training/test splits, and collected in Table 2. For comparison purposes, we include the prediction accuracy achieved by computing distances in each kernel’s native space before learning. In each case, the optimized space indeed achieves higher accuracy than the corresponding native space. (Of course, the random noise kernels still predict randomly after optimization.)
As illustrated in Table 2, taking the unweighted combination of kernels significantly degrades performance relative to the best single kernel, both in the native space (0.718 accuracy versus 0.862 for the linear kernel) and in the optimized sum-kernel space, i.e., the unweighted sum kernel optimized by Algorithm 3 (0.861 accuracy for versus 0.951 for the linear kernel). However, MKPOE () correctly identifies and omits the random noise kernels by assigning them negligible weight, and achieves higher accuracy (0.984) than any of the single kernels (0.951 for the linear kernel, after learning).


5.2 Musical artist similarity
To test our framework on a real data set, we applied the MKPOE algorithm to the task of learning a similarity function between musical artists. The artist similarity problem is motivated by several real-world applications, including recommendation and playlist generation for online radio. Because artists may be represented by a wide variety of different features (e.g., tags, acoustic features, social data), such applications can benefit greatly from an optimally integrated similarity metric.
The training data is derived from the aset400 corpus of Ellis et al. (2002), which consists of 412 popular musicians, and 16385 relative comparisons of the form . Relative comparisons were acquired from human test subjects through a web survey; subjects were presented with a query artist (), and asked to choose what they believe to be the most similar artist () from a list of 10 candidates. From each single response, 9 relative comparisons are synthesized, indicating that is more similar to than the remaining 9 artists () which were not chosen.
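As a small illustration of how one survey response expands into 9 relative comparisons, consider the sketch below. The function name and the triple encoding `(query, chosen, other)`, read as "query is more similar to chosen than to other", are assumptions made for this example and not the notation of the corpus.

```python
def comparisons_from_response(query, chosen, candidates):
    """Expand one survey response into relative comparisons.

    query      : the query artist shown to the subject.
    chosen     : the candidate the subject selected as most similar.
    candidates : the full list of candidates shown (10 in the survey).

    Returns one (query, chosen, other) triple per candidate not chosen.
    """
    return [(query, chosen, other) for other in candidates if other != chosen]
```

With 10 candidates, each response therefore contributes 9 triples, all sharing the same query and chosen artist.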
Our experiments here replicate and extend previous work on this data set (McFee and Lanckriet, 2009a). In the remainder of this section, we will first give an overview of the various types of features used to characterize each artist in Section 5.2.1. We will then discuss the experimental procedure in more detail in Section 5.2.2. The MKL embedding results are presented in Section 5.2.3, and are followed by an experiment detailing the efficacy of our constraint graph processing approach in Section 5.2.4.
5.2.1 Features
We construct five base kernels over the data, incorporating acoustic, semantic, and social views of the artists.

MFCC: For each artist, we collected between 1 and 10 songs (mean 4). For each song, we extracted a short clip consisting of 10000 half-overlapping 23ms windows. For each window, we computed the first 13 Mel Frequency Cepstral Coefficients (MFCCs) (Davis and Mermelstein, 1990), as well as their first and second instantaneous derivatives. This results in a sequence of 39-dimensional vectors (delta-MFCCs) for each song. Each artist was then summarized by a Gaussian mixture model (GMM) over delta-MFCCs extracted from the corresponding songs. Each GMM has 8 components and diagonal covariance matrices. Finally, the kernel between artists and is the probability product kernel (Jebara et al., 2004) between their corresponding delta-MFCC distributions :

Autotags: Using the MFCC features described above, we applied the automatic tagging algorithm of Turnbull et al. (2008), which for each song yields a multinomial distribution over a set of 149 musically-relevant tag words (autotags). Artist-level tag distributions were formed by averaging model parameters (i.e., tag probabilities) across all of the songs of artist . The kernel between artists and for autotags is a radial basis function applied to the distance between the multinomial distributions and :
In these experiments, we fixed .

Social tags: For each artist, we collected the top 100 most frequently used tag words from Last.fm (http://last.fm), a social music website which allows users to label songs or artists with arbitrary tag words or social tags. After stemming and stop-word removal, this results in a vocabulary of 7737 tag words. Each artist is then represented by a bag-of-words vector in , and processed by TF-IDF. The kernel between artists for social tags is the cosine similarity (linear kernel) between TF-IDF vectors.

Biography: Last.fm also provides textual descriptions of artists in the form of user-contributed biographies. We collected biographies for each artist in the aset400 data set, and after stemming and stop-word removal, we arrived at a vocabulary of 16753 biography words. As with social tags, the kernel between artists is the cosine similarity between TF-IDF bag-of-words vectors.

Collaborative filtering: Celma (2008) collected collaborative filtering data from Last.fm in the form of a bipartite graph over users and artists, where each user is associated with the artists in her listening history. We filtered this data down to include only the aset400 artists, of which all but 5 were found in the collaborative filtering graph. The resulting graph has 336527 users and 407 artists, and is equivalently represented by a binary matrix where each row corresponds to an artist, and each column corresponds to a user. The entry of this matrix is 1 if we observe a user-artist association, and 0 otherwise. The kernel between artists in this view is the cosine of the angle between corresponding rows in the matrix, which can be interpreted as counting the amount of overlap between the sets of users listening to each artist and normalizing for overall artist popularity. For the 5 artists not found in the graph, we fill in the corresponding rows and columns of the kernel matrix with the identity matrix.
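The collaborative filtering kernel can be sketched as a cosine kernel over the rows of the binary artist-by-user matrix. The function name `cosine_kernel` is hypothetical, and the dense array is an assumption for readability; at the scale reported above (hundreds of thousands of users) a sparse representation would be the natural choice.

```python
import numpy as np

def cosine_kernel(A):
    """Cosine kernel between the rows of a binary artist-by-user matrix.

    A : array of shape (n_artists, n_users) with entries in {0, 1}.

    Rows with no observed users (artists missing from the graph) receive
    identity rows and columns: zero similarity to every other artist and
    unit similarity to themselves.
    """
    A = np.asarray(A, dtype=float)
    norms = np.linalg.norm(A, axis=1)
    missing = np.flatnonzero(norms == 0)
    safe = norms.copy()
    safe[missing] = 1.0                 # avoid division by zero
    A_unit = A / safe[:, None]
    K = A_unit @ A_unit.T               # zero rows give zero rows and columns
    K[missing, missing] = 1.0           # identity fill-in on the diagonal
    return K
```

Because the entries are binary, the numerator of each kernel value counts the users shared by two artists, and the normalization divides by the geometric mean of the two audience sizes.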
5.2.2 Experimental procedure
The data set was split into 330 training and 82 test artists. Given the inherent ambiguity in the task and the format of the survey, there is a great deal of conflicting information in the survey responses. To obtain a more accurate and internally coherent set of training comparisons, directly contradictory comparisons (e.g., and ) are removed from the training set, reducing the set from 7915 to 6583 relative comparisons. The training set is further cleaned by finding an acyclic subset of comparisons and taking its transitive reduction, resulting in a minimal partial order with 4401 comparisons.
To evaluate the performance of an embedding learned from the training data, we apply it to the test data, and then measure accuracy by counting the fraction of similarity measurements correctly predicted by distance in the embedding space, where belongs to the test set, and and belong to the training set. This setup can be viewed as simulating a query (by-example) and ranking the responses from the training set. To gain a more accurate view of the quality of the embedding, the test set was also pruned to remove directly contradictory measurements. This reduces the test set from 2095 to 1753 comparisons. No further processing is applied to test measurements, and we note that the test set is not internally consistent, so perfect accuracy is not achievable.
For each experiment, the optimal is chosen from by 10-fold cross-validation, i.e., repeating the test procedure above on splits within the training set. Once is chosen, an embedding is learned with the entire training set, and then evaluated on the test set.
5.2.3 Embedding results
For each base kernel, we evaluate the test-set performance in the native space (i.e., by distances calculated directly from the entries of the kernel matrix), and by learned metrics, both diagonal and full (optimizing over with a single base kernel). Table 3 lists the results. In all cases, full-matrix embeddings improve in accuracy over the native space, as do diagonal embeddings for every kernel except CF (0.655 versus 0.704 native). In all but one case (MFCC), full-matrix embeddings outperform diagonally-constrained embeddings.
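Table 3: Test-set accuracy for each base kernel, in the native space and after learning (diagonal and full).

Kernel                      Native   Learned (diagonal)   Learned (full)
MFCC                        0.464    0.593                0.590
Autotags (AT)               0.559    0.568                0.594
Social tags (ST)            0.752    0.773                0.796
Biography (Bio)             0.611    0.629                0.760
Collaborative filter (CF)   0.704    0.655                0.776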
We then repeated the experiment by examining different groupings of base kernels: acoustic (MFCC and Autotags), semantic (Social tags and Bio), social (Collaborative filter), and combinations of the groups. The different sets of kernels were combined by Algorithm 4 (optimizing over ). The results are listed in Table 4. For comparison purposes, we also include the unweighted sum of all base kernels (listed in the Native column).
In all cases, MKPOE improves over the unweighted combination of base kernels. Moreover, many combinations outperform the single best kernel (ST), and the algorithm is generally robust in the presence of poorly-performing distractor kernels (MFCC and AT). Note that the poor performance of MFCC and AT kernels may be expected, as they derive from song-level rather than artist-level features, whereas ST provides high-level semantic descriptions which are generally more homogeneous across the songs of an artist, and Bio and CF are directly constructed at the artist level. For comparison purposes, we also trained a metric over all kernels with (Algorithm 3), and achieve 0.711 (diagonal) and 0.764 (full): significantly worse than the results.
Figure 5 illustrates the weights learned by Algorithm 4 using all five kernels and diagonally-constrained matrices. Note that the learned metrics are both sparse (many 0 weights) and non-uniform across different kernels. In particular, the (lowest-performing) MFCC kernel is eliminated by the algorithm, and the majority of the weight is assigned to the (highest-performing) social tag (ST) kernel.
A t-SNE (van der Maaten and Hinton, 2008) visualization of the space produced by MKPOE is illustrated in Figure 6. The embedding captures a great deal of high-level genre structure in low dimensions: for example, the classic rock and metal genres lie at the opposite end of the space from pop and hip-hop.
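Table 4: Test-set accuracy for combinations of base kernels: unweighted sum (Native) and MKPOE (diagonal and full).

Base kernels             Native   Learned (diagonal)   Learned (full)
MFCC + AT                0.521    0.589                0.602
ST + Bio                 0.760    0.786                0.811
MFCC + AT + CF           0.580    0.671                0.719
ST + Bio + CF            0.777    0.782                0.806
MFCC + AT + ST + Bio     0.709    0.788                0.801
All                      0.732    0.779                0.801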
5.2.4 Graph processing results
To evaluate the effects of processing the constraint set for consistency and redundancy, we repeat the experiment of the previous section with different levels of processing applied to . Here, we focus on the Biography kernel, since it exhibits the largest gap in performance between the native and learned spaces.
As a baseline, we first consider the full set of similarity measurements as provided by human judgements, including all inconsistencies. In the 80/20 split, there are 7915 total training measurements. To first deal with what appear to be the most egregious inconsistencies, we prune all directly inconsistent training measurements; i.e., whenever and both appear, both are removed. (A more sophisticated approach could be used here, e.g., majority voting, provided there is sufficient oversampling of comparisons in the data.) This variation results in 6583 training measurements, and while they are not wholly consistent, the worst violators have been pruned. Finally, we consider the fully processed case by finding a maximal consistent subset (partial order) of and removing all redundancies, resulting in a partial order with 4401 measurements.
Using each of these variants of the training set, we test the embedding algorithm with both diagonal and full-matrix formulations. The results are presented in Table 5. Each level of graph processing results in a small improvement in the accuracy of the learned space, and provides substantial reductions in computational overhead at each step of the optimization procedure for Algorithm 3.
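The first pruning step, removing directly contradictory measurements, can be sketched as follows. The encoding is an assumption for illustration: each measurement is a hypothetical 4-tuple `(i, j, k, l)` meaning that the pair (i, j) is closer than the pair (k, l), and pairs are treated as unordered. Duplicate measurements are collapsed as a side effect of using a set.

```python
def prune_direct_contradictions(comparisons):
    """Remove every measurement whose exact reverse also appears.

    comparisons : iterable of (i, j, k, l) tuples, read as d(i, j) < d(k, l).

    Returns a sorted list of edges ((i, j), (k, l)) between unordered pairs.
    Whenever both an edge and its reverse are present, both are removed.
    """
    def as_edge(c):
        i, j, k, l = c
        return (tuple(sorted((i, j))), tuple(sorted((k, l))))

    edges = {as_edge(c) for c in comparisons}
    return sorted((a, b) for (a, b) in edges if (b, a) not in edges)
```

This only eliminates cycles of length 2 in the comparison graph; longer cycles require the further processing described for the fully processed case.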
Table 5: Accuracy on the Biography kernel for each level of training-set processing.

Training set   Diagonal   Full
Full           0.604      0.754
Length-2       0.621      0.756
Processed      0.629      0.760
6 Hardness of dimensionality reduction
The algorithms given in Sections 3 and 4 attempt to produce low-dimensional solutions by regularizing , which can be seen as a convex approximation to the rank of the embedding. In general, because rank constraints are not convex, convex optimization techniques cannot efficiently minimize dimensionality. This does not necessarily imply that other techniques could not work. It is therefore natural to ask if exact solutions of minimal dimensionality can be found efficiently, particularly in the multidimensional scaling scenario, i.e., when (Section 3.3).
As a special case, one may wonder if any instance can be satisfied in . As Figure 7 demonstrates, not all instances can be realized in one dimension. Moreover, we show that it is NP-Complete to decide if a given can be satisfied in . Given an embedding, it can be verified in polynomial time whether is satisfied or not by simply computing the distances between all pairs and checking each comparison in , so the decision problem is in NP. It remains to show that the partial order embedding problem (hereafter referred to as 1-POE) is NP-Hard. We reduce from the Betweenness problem (Opatrny, 1979), which is known to be NP-complete.
Definition 6.1 (Betweenness)
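Given a finite set $S$ and a collection $T$ of ordered triples $(a, b, c)$ of distinct elements from $S$, is there a one-to-one function $f : S \to \mathbb{R}$ such that for each $(a, b, c) \in T$, either $f(a) < f(b) < f(c)$ or $f(c) < f(b) < f(a)$?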
Theorem 1
1-POE is NP-Hard.
Proof
Let be an instance of Betweenness. Let , and for each , introduce constraints and to . Since Euclidean distance in is simply line distance, these constraints force to lie between and . Therefore, the original instance is a yes-instance of Betweenness if and only if the new instance is a yes-instance of 1-POE. Since Betweenness is NP-Hard, 1-POE is NP-Hard as well.
Since 1-POE can be reduced to the general optimization problem of finding an embedding of minimal dimensionality, we can conclude that dimensionality reduction subject to partial order constraints is also NP-Hard.
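The reduction in the proof can be checked by brute force on small instances. The sketch below assumes a particular encoding that is not taken from the text: a Betweenness triple `(a, b, c)` asks for `b` to lie between `a` and `c`, and it is mapped to the two distance constraints d(a, b) < d(a, c) and d(b, c) < d(a, c), each written as a hypothetical 4-tuple. All function names are illustrative.

```python
from itertools import permutations

def reduce_to_1poe(triples):
    """Map each triple (a, b, c) to two constraints (i, j, k, l): d(i,j) < d(k,l)."""
    constraints = []
    for a, b, c in triples:
        constraints.append((a, b, a, c))
        constraints.append((b, c, a, c))
    return constraints

def satisfies_betweenness(f, triples):
    """True if f places b strictly between a and c for every triple."""
    return all(f[a] < f[b] < f[c] or f[c] < f[b] < f[a] for a, b, c in triples)

def satisfies_1poe(f, constraints):
    """True if every distance comparison holds for positions f on the line."""
    return all(abs(f[i] - f[j]) < abs(f[k] - f[l]) for i, j, k, l in constraints)

def reduction_agrees(elements, triples, positions):
    """Check, over all one-to-one placements, that both conditions coincide."""
    constraints = reduce_to_1poe(triples)
    for perm in permutations(positions, len(elements)):
        f = dict(zip(elements, perm))
        if satisfies_betweenness(f, triples) != satisfies_1poe(f, constraints):
            return False
    return True
```

For example, `reduction_agrees(["a", "b", "c", "d"], [("a", "b", "c"), ("b", "c", "d")], [0.0, 1.0, 3.0, 7.0])` enumerates all 24 placements and confirms that a placement solves the Betweenness instance exactly when it satisfies the derived distance constraints.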
7 Conclusion
We have demonstrated a novel method for optimally integrating heterogeneous data to conform to measurements of perceptual similarity. By interpreting a collection of relative similarity comparisons as a directed graph over pairs, we are able to apply graph-theoretic techniques to isolate and prune inconsistencies in the training set and reduce computational overhead by eliminating redundant constraints in the optimization procedure.
Our multiple-kernel formulation offers a principled way to integrate multiple feature modalities into a unified similarity space. Our formulation carries the intuitive geometric interpretation of concatenated projections, and results in a semidefinite program. By incorporating diagonal constraints as well, we are able to reduce the computational complexity of the algorithm, and learn a model which is both flexible (only using kernels in the portions of the space where they are informative) and interpretable (each diagonal weight corresponds to the contribution to the optimized space due to a single point within a single feature space). Table 1 provides a unified perspective of multiple kernel learning formulations for embedding problems, but it is clearly not complete. It will be the subject of future work to explore and compare alternative generalizations and restrictions of the formulations presented here.
A Embeddability of partial orders
In this appendix, we prove that any set with a partial order over distances can be embedded into while satisfying all distance comparisons.
In the special case where is a total ordering over all pairs (i.e., a chain graph), the problem reduces to non-metric multidimensional scaling (Kruskal, 1964), and a constraint-satisfying embedding can always be found by the constant-shift embedding algorithm of Roth et al. (2003). In general, is not a total order, but a respecting embedding can always be produced by reducing the partial order to a (weak) total order by topologically sorting the graph (see Algorithm 5).
Let be the dissimilarity matrix produced by Algorithm 5 on an instance . An embedding can be found by first applying classical multidimensional scaling (MDS) (Cox and Cox, 1994) to :
(15) 
where is the centering matrix, and is a vector of 1s. Shifting the spectrum of yields
(16) 
where is the minimum eigenvalue of . The embedding can be found by decomposing , so that is the column of ; this is the solution constructed by the constant-shift embedding non-metric MDS algorithm of Roth et al. (2003).
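The construction above (classical MDS centering, a spectral shift by the minimum eigenvalue, and a factorization of the shifted matrix) can be sketched numerically. The notation is an assumption made for this example: `D` denotes a symmetric dissimilarity matrix with zero diagonal, interpreted as squared distances, `H` the centering matrix, and `A` the doubly-centered matrix. The function name is hypothetical, and the code is a sketch of the standard constant-shift construction, not a transcription of equations (15) and (16).

```python
import numpy as np

def constant_shift_embedding(D):
    """Embed n points from a symmetric dissimilarity matrix D (zero diagonal).

    Returns an array with one embedded point per row.
    """
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    A = -0.5 * H @ D @ H                      # classical MDS double centering
    lam_min = np.linalg.eigvalsh(A)[0]        # eigvalsh returns ascending order
    A_shift = A - lam_min * np.eye(n)         # shift the spectrum to be >= 0
    lam, V = np.linalg.eigh(A_shift)
    lam = np.maximum(lam, 0.0)                # clip round-off negatives
    return V * np.sqrt(lam)                   # rows are the embedded points
```

Under these assumptions, the squared distance between any two distinct embedded points equals the corresponding entry of `D` minus twice the minimum eigenvalue of `A`. Since that eigenvalue is never positive, every off-diagonal dissimilarity is increased by the same nonnegative constant, so the ordering of the dissimilarities is preserved.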
Applying this transformation to affects distances by