Efficient Regularized Least-Squares Algorithms for Conditional Ranking on Relational Data
In domains like bioinformatics, information retrieval and social network analysis, one can find learning tasks where the goal consists of inferring a ranking of objects, conditioned on a particular target object. We present a general kernel framework for learning conditional rankings from various types of relational data, where rankings can be conditioned on unseen data objects. We propose efficient algorithms for conditional ranking by optimizing squared regression and ranking loss functions. We show theoretically, that learning with the ranking loss is likely to generalize better than with the regression loss. Further, we prove that symmetry or reciprocity properties of relations can be efficiently enforced in the learned models. Experiments on synthetic and real-world data illustrate that the proposed methods deliver state-of-the-art performance in terms of predictive power and computational efficiency. Moreover, we also show empirically that incorporating symmetry or reciprocity properties can improve the generalization performance.
We first motivate the study by presenting some examples relevant for the considered learning setting in Section 1.1. Next, we briefly review and compartmentalize related work in Section 1.2, and present the main contributions of the paper in Section 1.3.
Let us start with two introductory examples to explain the problem setting of conditional ranking. First, suppose that a number of persons are playing an online computer game. For many people it is always more fun to play against someone with similar skills, so players might be interested in receiving a ranking of other players, ranging from extremely difficult to beat to unexperienced players. Unfortunately, pairwise strategies of players in many games – not only in computer games but also in board or sports games – tend to exhibit a rock-paper-scissors type of relationship (Fisher, 2008), in the sense that player A beats player B (with probability greater than 0.5), who on his term beats C (with probability greater than 0.5), while player A loses from player C (as well with probability greater than 0.5). Mathematically speaking, the relation between players is not transitive, leading to a cyclic relationship and implying that no global (consistent) ranking of skills exists. Yet, a conditional ranking can always be obtained for a specific player (Pahikkala et al., 2010b).
As a second introductory example, let us consider the supervised inference of biological networks, like protein-protein interaction networks, where the goal usually consists of predicting new interactions from a set of highly-confident interactions (Yamanishi et al., 2004). Similarly, one can also define a conditional ranking task in such a context, as predicting a ranking of all proteins in the network that are likely to interact with a given target protein (Weston et al., 2004). However, this conditional ranking task differs from the previous one because (a) rankings are computed from symmetric relations instead of reciprocal ones and (b) the values of the relations are here usually not continuous but discrete.
Applications for conditional ranking tasks arise in many domains where relational information between objects is observed, such as relations between persons in preference modelling, social network analysis and game theory, links between database objects, documents, websites, or images in information retrieval (Geerts et al., 2004; Grangier and Bengio, 2008; Ng et al., 2011), interactions between genes or proteins in bioinformatics, graph matching (Caetano et al., 2009), et cetera. When approaching conditional ranking from a graph inference point of view, the goal consists of returning a ranking of all nodes given a particular target node, in which the nodes provide information in terms of features and edges in terms of labels or relations. At least two properties of graphs play a key role in such a setting. First, the type of information stored in the edges defines the learning task: binary-valued edge labels lead to bipartite ranking tasks (Freund et al., 2003), ordinal-valued edge labels to multipartite or layered ranking tasks (Fürnkranz et al., 2009) and continuous labels result in rankings that are nothing more than total orders (when no ties occur). Second, the relations that are represented by the edges might have interesting properties, namely symmetry or reciprocity, for which conditional ranking can be interpreted differently.
1.2 Learning Setting and Related Work
We present a kernel framework for conditional ranking, which covers all above situations. Unlike existing single-task or multi-task ranking algorithms, where the conditioning is respectively ignored or only happening for training objects, our approach also allows to condition on new data objects that are not known during the training phase. Thus, in light of Figure 1 that will be explained below, the algorithm is not only able to predict conditional rankings for objects A to E, but also for objects F and G that do not participate in the training dataset. From this perspective, one can define four different learning settings in total:
Setting 1: predict a ranking of objects for a given conditioning object, where both the objects to be ranked and the conditioning object were contained in the training dataset (but not the ranking of the objects for that particular conditioning object).
Setting 2: given a new conditioning object unseen in the training phase, predict a ranking of the objects encountered in the training phase.
Setting 3: given a set of new objects unseen in the training phase, predict rankings of those objects with respect to the conditioning objects encountered in the training phase.
Setting 4: predict a ranking of objects for a given conditioning object, where neither the conditioning object nor the objects to be ranked were observed during the training phase.
These four settings cover as special cases different types of conventional machine learning problems. The framework that we propose in this article can be used for all four settings, in contrast to many existing methods. In this paper, we focus mainly on setting 4.
Setting 1 corresponds to the imputation type of link prediction setting, where missing relational values between known objects are predicted. In this setting matrix factorization methods (see e.g. Srebro et al. (2005)) are often applied. Many of the approaches are based solely on exploiting the available link structure, though approaches incorporating feature information have also been proposed (Menon and Elkan, 2010; Raymond and Kashima, 2010). Setting 2 corresponds to the label ranking problem (Hüllermeier et al., 2008), where a fixed set of labels is ranked given a new object. Setting 3 can be modeled as a multi-task ranking problem where the the number of ranking tasks is fixed in advance (Agarwal, 2006). Finally, setting 4 requires that the used methods are able to generalize both over new conditioning objects and objects to be ranked (see e.g. Park and Marcotte (2012) for some recent discussion about this setting). Learning in setting 4 may be realized by using joint feature representations of conditioning objects and objects to be ranked.
In its most general form the conditional ranking problem can be considered as a special case of the listwise ranking problem, encountered especially in the learning to rank for information retrieval literature (see e.g. (Cao et al., 2006; Yue et al., 2007; Xia et al., 2008; Qin et al., 2008; Liu, 2009; Chapelle and Keerthi, 2010; Qin et al., 2008; Kersting and Xu, 2009; Xu et al., 2010; Airola et al., 2011b)). For example, in document retrieval one is supplied both with query objects and associated documents that are ranked according to how well they match the query. The aim is to learn a model that can generalize to new queries and documents, predicting rankings that capture well the relative degree to which each document matches the test query.
Previous learning approaches in this setting have typically been based on using hand-crafted low-dimensional joint feature representations of query-document pairs. In our graph-based terminology, this corresponds to having a given feature representation for edges, possibly encoding prior knowledge about the ranking task. A typical example of this kind of joint feature particularly designed for the domain of information retrieval is the well-known Okapi BM25 score, indicating how well a given query set of words matches a given set of words extracted from a text document. While this type of features are often very efficient and absolutely necessary in certain practical tasks, designing such application-specific features requires human experts and detailed information about every different problem to be solved, which may not be always available.
In contrast, we focus on a setting in which we are only given a feature representation of nodes from which the feature representations of the edges have to be constructed, that is, the learning must be performed without access to the prior information about the edges. This opens many possibilities for applications, since we are not restricted to the setting where explicit feature representations of the edges are provided. In our experiments, we present several examples of learning tasks for which our approach can be efficiently used. In addition, the focus of our work is the special case where both the conditioning objects and the objects to be ranked come from the same domain (see e.g. Weston et al. (2004); Yang et al. (2009) for a similar settings). This allows us to consider how to enforce relational properties such as symmetry and reciprocity, a subject not studied in previous ranking literature. To summarize, the considered learning setting is the following:
We are given a training set of objects (nodes), which have a feature representation of their own, either an explicit one or an implicit representation obtained via a nonlinear kernel function.
We are also given observed relations between these objects (weighted edges), whose values are known only between the nodes encountered in the training set. This information is distinct from the similarities in point (a).
The aim is to learn to make predictions for pairs of objects (edges) for which the value of this relation is unknown, by taking advantage of the features or kernels given in point (a), and in the conditional ranking setting, the goal is to learn a ranking of the object pairs.
The node kernel is used as a building block to construct a pairwise kernel able to capture similarities of the edges, which in turn, is used for learning to predict the edge weights considered in point (b).
The proposed framework is based on the Kronecker product kernel for generating implicit joint feature representations of conditioning objects and the sets of objects to be ranked. This kernel has been proposed independently by a number of research groups for modelling pairwise inputs in different application domains (Basilico and Hofmann, 2004; Oyama and Manning, 2004; Ben-Hur and Noble, 2005). From a different perspective, it has been considered in structured output prediction methods for defining joint feature representations of inputs and outputs (Tsochantaridis et al., 2005; Weston et al., 2007).
While the usefulness of Kronecker product kernels for pairwise learning has been clearly established, computational efficiency of the resulting algorithms remains a major challenge. Previously proposed methods require the explicit computation of the kernel matrix over the data object pairs, hereby introducing bottlenecks in terms of processing and memory usage, even for modest dataset sizes. To overcome this problem, one typically applies sampling strategies to avoid computing the whole kernel matrix for training. However, non-approximate methods can be implemented by taking advantage of the Kronecker structure of the kernel matrix. This idea has been traditionally used to solve certain linear regression problems (see e.g. Van Loan (2000) and references therein). More recent and related applications have emerged in link prediction tasks by (Kashima et al., 2009a; Raymond and Kashima, 2010), which can be considered under setting 1.
An alternative approach known as the Cartesian kernel has been proposed by Kashima et al. (2009b) for overcoming the computational challenges associated with the Kronecker product kernels. This kernel indeed exhibits interesting computational properties, but it can be solely employed in selected applications, because it cannot make predictions for (couples of) objects that are not observed in the training dataset, that is, in setting 4 (see Waegeman et al. (2012) for further discussion and experimental results).
There exists a large body of literature about relational kernels, with values obtained from e.g. similarity graphs of data points, random walk and path kernels, et cetera. These can be considered to be complementary to the Kronecker product based pairwise kernels in the sense that they are used to infer similarities for data points rather than for pairs of data points. Thus, if relational information, such as paths or random walks for example, is used to form a kernel for the points in a graph, a pairwise kernel could be subsequently used to construct a kernel for the edges of the same graph.
Yet another family of related methods consists of the generalization of the pairwise Kronecker kernels framework to tasks, in which the condition and target objects come from different domains. Typical examples of this type of Kronecker kernel applications are found in bioinformatics, such as the task of predicting interactions between drugs and targets (van Laarhoven et al., 2011). To our knowledge, none of these studies still concern the fourth setting. While the algorithms considered in this paper can be straightforwardly generalized to the two domain case, we only focus on the single domain case for simplicity, because the concepts of symmetric and reciprocal relations would not be meaningful with two domains.
The main contributions of this paper can be summarized as follows:
We propose kernel-based conditional ranking algorithms for setting 4, that is, for cases where predictions are performed for (couples of) objects that are not observed in the training dataset. We consider both regression and ranking based losses, extending regularized least-squares (RLS) (Saunders et al., 1998; Evgeniou et al., 2000) and the RankRLS (Pahikkala et al., 2009) algorithms to conditional ranking. We propose both update rules for iterative optimization algorithms, as well as closed-form solutions, exploiting the structure of the Kronecker product in order to make learning efficient. The proposed methods scale to graphs consisting of millions of labeled edges.
We show how prior knowledge about the underlying relation can be efficiently incorporated in the learning process, considering the specific cases of symmetric and reciprocal relations. These properties are enforced on the learned models via corresponding modifications of the Kronecker product kernel, namely the symmetric kernel studied by Ben-Hur and Noble (2005) and the reciprocal kernel introduced by us (Pahikkala et al., 2010b). We prove that, for RLS, symmetry and reciprocity can be implicitly encoded by including each edge in the training set two times, once for each direction. To our knowledge, the only related result so far has been established for the support vector machine classifiers by Brunner et al. (2012). We also prove that this implicitness, in turn, ensures the computational efficiency of the training phase with symmetric and reciprocal Kronecker kernels. These results are, to our knowledge, completely novel in the field of both machine learning and matrix algebra.
We present new generalization bounds, showing the benefits of applying RankRLS instead of basic RLS regression in conditional ranking tasks. The analysis presented in this paper shows that the larger expressive power of the space of regression functions compared to the corresponding space of conditional ranking functions indicates the learning to be likely to generalize better if the space of functions available for the training algorithm is restricted to conditional ranking functions rather than all regression functions.
Finally, we evaluate the proposed algorithms with an array of different practical problems. The results demonstrate the ability of the algorithms to solve learning problems in setting 4. Moreover, in our scalability experiments, we show that the algorithms proposed by us scale considerably better to large data sets than the state-of-the-art RankSVM solvers, even in cases where the SVMs use fast primal training methods for linear kernels.
The article is organized as follows. We start in Section 2 with a formal description of conditional ranking from a graph-theoretic perspective. The Kronecker product kernel is reviewed in Section 3 as a general edge kernel that allows for modelling the most general type of relations. In addition, we briefly recall two important subclasses of relations, namely symmetric and reciprocal relations, for which more specific, knowledge-based kernels can be derived. The proposed learning algorithms are presented in Section 4, and the connections and differences with related learning algorithms are discussed in Section 5, with a particular emphasis on the computational complexity of the algorithms. In Section 6 we present promising experimental results on synthetic and real-world data, illustrating the advantages of our approach in terms of predictive power and computational scalability.
2 General Framework
Let us start with introducing some notations. We consider ranking of data structured as a graph , where corresponds to the set of nodes, where nodes are sampled from a space , and represents the set of edges , for which labels are provided in terms of relations. Moreover, these relations are represented by weights on the edges and they are generated from an unknown underlying relation . We remark that the interval is used here only due to certain properties that are historically defined for such relations. However, relations taking values in arbitrary closed real intervals can be straightforwardly transformed to this interval with an appropriate increasing bijection and vice versa.
Following the standard notations for kernel methods, we formulate our learning problem as the selection of a suitable function , with a certain hypothesis space, in particular a reproducing kernel Hilbert space (RKHS). Given an input space and a kernel , the RKHS associated with can be considered as the completion of
in the norm
Hypotheses are usually denoted as with a vector of parameters that need to be estimated based on training data. Let us denote a training dataset of cardinality as a set of input-label pairs, then we formally consider the following variational problem in which we select an appropriate hypothesis from for training data . Namely, we consider an algorithm
with a given loss function and a regularization parameter.
with dual parameters, the kernel function associated with the RKHS and the feature mapping corresponding to .
Given two relations and defined on any triplet of nodes in , we compose the ranking of and conditioned on as
Let the number of correctly ranked pairs for all nodes in the dataset serve as evaluation criterion for verifying (3), then one aims to minimize the following empirical loss when computing the loss over all conditional rankings simultaneously:
with the Heaviside function returning one when its argument is strictly positive, returning when its argument is exactly zero and returning zero otherwise. Importantly, denotes the set of all edges starting from, or the set of all edges ending at the node , depending on the specific task. For example, concerning the relation “trust” in a social network, the former loss would correspond to ranking the persons in the network who are trusted by a specific person, while the latter loss corresponds to ranking the persons who trust that person. So, taking Figure 1 into account, we would in such an application respectively use the rankings (outgoing edges) and (incoming edges) as training info for node .
Since (4) is neither convex nor differentiable, we look for an approximation that has these properties as this considerably simplifies the development of efficient algorithms for solving the learning problem. Let us to this end start by considering the following squared loss function over the observed edges in the training set:
Such a setting would correspond to directly learning the labels on the edges in a regression or classification setting. For the latter case, optimizing (5) instead of the more conventional hinge loss has the advantage that the solution can be found by simply solving a system of linear equations (Saunders et al., 1998; Suykens et al., 2002; Shawe-Taylor and Cristianini, 2004; Pahikkala et al., 2009). However, the simple squared loss might not be optimal in conditional ranking tasks. Consider for example a node and we aim to learn to predict which of the two other nodes, or , would be closer to it. Let us denote and , and let and denote the relation between and and between and , respectively. Then it would be beneficial for the regression function to have a minimal squared difference , leading to the following loss function:
which can be interpreted as a differentiable and convex approximation of (4).
3 Relational Domain Knowledge
Above, a framework was defined where kernel functions are constructed over the edges, leading to kernels of the form . In this section we show how these kernels can be constructed using domain knowledge about the underlying relations. The same discussion was put forward for inferring relations. Details and formal proofs can be found in our previous work on this topic (Pahikkala et al., 2010b; Waegeman et al., 2012).
3.1 Arbitrary Relations
When no further restrictions on the underlying relation can be specified, the following Kronecker product feature mapping is used to express pairwise interactions between features of nodes:
where represents the feature mapping for individual nodes and denotes the Kronecker product. As shown by Ben-Hur and Noble (2005), such a pairwise feature mapping yields the Kronecker product pairwise kernel in the dual model:
with the kernel corresponding to .
It can be formally proven that with an appropriate choice of the node kernel , such as the Gaussian RBF kernel, the RKHS of the corresponding Kronecker product edge kernel allows approximating arbitrarily closely any relation that corresponds to a continuous function from to . Before summarizing this important result, we recollect the definition of universal kernels.
(Steinwart, 2002) A continuous kernel on a compact metric space (i.e. is closed and bounded) is called universal if the RKHS induced by is dense in , where is the space of all continuous functions .
Accordingly, the hypothesis space induced by the kernel can approximate any function in arbitrarily well, and hence it has the universal approximating property.
(Waegeman et al., 2012) Let us assume that the space of nodes is a compact metric space. If a continuous kernel is universal on , then defines a universal kernel on .
The proof is based on the so-called Stone-Weierstraß theorem (see e.g. Rudin (1991)). The above result is interesting because it shows, given that an appropriate loss is optimized and a universal kernel applied on the node level , that the Kronecker product pairwise kernel has the ability to assure universal consistency, guaranteeing that the expected prediction error converges to its the lowest possible value when the amount of training data approaches infinity. We refer to (Steinwart and Christmann, 2008) for a more detailed discussion on the relationship between universal kernels and consistency. As a consequence, the Kronecker product kernel can always be considered a valid choice for learning relations if no specific a priori information other than a kernel for the nodes is provided about the relation that underlies the data. However, we would like to emphasize that Theorem 3.2 does not guarantee anything about the speed of the convergence or how large training sets are required for approximating the function closely enough. As a rule of thumb, whenever we have an access to useful prior information about the relation to be learned, it is beneficial to restrict the expressive power of the hypothesis space accordingly. The following two sections illustrate this more in detail for two particular types of relational domain knowledge: symmetry and reciprocity.
3.2 Symmetric Relations
Symmetric relations form an important subclass of relations in our framework. As a specific type of symmetric relations, similarity relations constitute the underlying relation in many application domains where relations between objects need to be learned. Symmetric relations are formally defined as follows.
A binary relation is called a symmetric relation if for all it holds that .
More generally, symmetry can be defined for real-valued relations analogously as follows.
A binary relation is called a symmetric relation if for all it holds that .
For symmetric relations, edges in multi-graphs like Figure 1 become undirected. Applications arise in many domains and metric learning or learning similarity measures can be seen as special cases. If the relation is -valued as , then we end up with a classification setting instead of a regression setting.
Symmetry can be easily incorporated in our framework via the following modification of the Kronecker kernel:
The symmetric Kronecker kernel has been previously used for predicting protein-protein interactions in bioinformatics (Ben-Hur and Noble, 2005). The following theorem shows that the RKHS of the symmetric Kronecker kernel can approximate arbitrarily well any type of continuous symmetric relation.
In other words the above theorem states that using the symmetric Kronecker product kernel is a way to incorporate the prior knowledge about the symmetry of the relation to be learned by only sacrificing the unnecessary expressive power. Thus, consistency can still be assured, despite considering a smaller hypothesis space.
3.3 Reciprocal Relations
Let us start with a definition of this type of relation.
A binary relation is called a reciprocal relation if for all it holds that .
For general real-valued relations, the notion of antisymmetry can be used in place of reciprocity:
A binary relation is called an antisymmetric relation if for all it holds that .
For reciprocal and antisymmetric relations, every edge in a multi-graph like Figure 1 induces an unobserved invisible edge with appropriate weight in the opposite direction. Applications arise here in domains such as preference learning, game theory and bioinformatics for representing preference relations, choice probabilities, winning probabilities, gene regulation, et cetera. The weight on the edge defines the real direction of such an edge. If the weight on the edge is higher than 0.5, then the direction is from to , but when the weight is lower than 0.5, then the direction should be interpreted as inverted, for example, the edges from to in Figures 1 (a) and (e) should be interpreted as edges starting from instead of . If the relation is -valued as , then we end up with a three-class ordinal regression setting instead of an ordinary regression setting. Analogously to symmetry, reciprocity can also be easily incorporated in our framework via the following modification of the Kronecker kernel:
Thus, the addition of kernels in the symmetric case becomes a subtraction of kernels in the reciprocal case. One can also prove that the RKHS of this so-called reciprocal Kronecker kernel allows approximating arbitrarily well any type of continuous reciprocal relation.
Unlike many existing kernel-based methods for relational data, the models obtained with the presented kernels are able to represent any symmetric or reciprocal relation, respectively, without imposing additional transitivity properties of the relations.
4 Algorithmic Aspects
This section gives a detailed description of the different algorithms that we propose for conditional ranking tasks. Our algorithms are primarily based on solving specific systems of linear equations, in which domain knowledge about the underlying relations is taken into account. In addition, a detailed discussion about the differences between optimizing (5) and (6) is provided.
4.1 Matrix Representation of Symmetric and Reciprocal Kernels
Let us define the so-called commutation matrix, which provides a powerful tool for formalizing the kernel matrices corresponding to the symmetric and reciprocal kernels.
Definition 4.1 (Commutation matrix).
is called the commutation matrix (Abadir and Magnus, 2005), where are the standard basis vectors of .
We use the superscript to indicate the dimension of the matrix but we omit this notation when the dimensionality is clear from the context or when the considerations do not depend on the dimensionality. For , we have the following properties. First, , where is the identity matrix, since is a symmetric permutation matrix. Moreover, for every square matrix , we have , where vec is the column vectorizing operator that stacks the columns of an -matrix in an -dimensional column vector, that is,
Furthermore, for , we have
The commutation matrix is used as a building block in constructing the following types of matrices:
Definition 4.2 (Symmetrizer and skew-symmetrizer matrices).
are known as the symmetrizer and skew-symmetrizer matrix, respectively (Abadir and Magnus, 2005).
Armed with the above definitions, we will now consider how the kernel matrices corresponding to the reciprocal kernel and the symmetric kernel can be represented in a matrix notation. Note that the next proposition covers also the kernel matrices constructed between, say, nodes encountered in the training set and the nodes encountered at the prediction phase, and hence the considerations involve two different sets of nodes.
Let be a kernel matrix consisting of all kernel evaluations between nodes in sets and , with and , that is, , where and . The ordinary, symmetric and reciprocal Kronecker kernel matrices consisting of all kernel evaluations between edges in and edges in are given by
The claim concerning the ordinary Kronecker kernel is an immediate consequence of the definition of the Kronecker product, that is, the entries of are given as
where and . To prove the other two claims, we pay closer attention to the entries of . In particular, the -th column of contains all kernel evaluations of the edges in with the edge . By definition (10) of vec, this column can be written as , where is a matrix whose -th entry contains the kernel evaluation between the edges and :
We have the following properties of the symmetrizer and skew-symmetrizer matrices that straightforwardly follow from those of the commutation matrix. For any
Thus, the -th column of can, according to (11), be written as , where the -th entry of contains the kernel evaluation
which corresponds to the symmetric Kronecker kernel between the edges and . The reciprocal case is analogous. ∎
We also note that for the symmetrizer and skew-symmetrizer matrices commute with the -matrix in the following sense:
where and in the left-hand sides are -matrices and and are -matrices in the right-hand sides. Thus, due to (12), the above-considered symmetric Kronecker kernel matrix may as well be written as or as . The same applies to the reciprocal Kronecker kernel matrix due to (13).
4.2 Regression with Symmetric and Reciprocal Kernels
Let and , respectively, represent the number of nodes and edges in . In the following, we make an assumption that contains, for each ordered pair of nodes , exactly one edge starting from and ending to , that is, and corresponds to a complete directed graph on nodes which includes a loop at each node. As we will show below, this important special case enables the use of many computational short-cuts for the training phase. This assumption is dropped in Section 4.4, where we present training algorithms for the more general case.
In practical applications, a fully connected graph is most commonly available in settings where the edges are generated by comparing some direct property of the nodes, such as whether they belong to same or similar class in a classification taxonomy (for examples, see the experiments). In experimental research on small sample data, such as commonly considered in many bioinformatics applications (see e.g. (Park and Marcotte, 2012)), it may also be feasible to gather the edge information directly for all pairs through experimental comparisons. If this is not the case, then imputation techniques can be used to fill in the values for missing edges in the training data in case only a small minority is missing.
Using the notation of Proposition 4.3, we let be the kernel matrix of , containing similarities between all nodes encountered in . Due to the above assumption and Proposition 4.3, the kernel matrix containing the evaluations of the kernels , and between the edges in can be expressed as , and , respectively.
Recall that, according to the representer theorem, the prediction function obtained as a solution to problem (1) can be expressed with the dual representation (2), involving a vector of so-called dual parameters, whose dimension equals the number of edges in the training set. Here, we represent the dual solution with a vector containing one entry per each possible edge between the vertices occurring in the training set.
where is a convex loss function that maps the vector of training labels and the vector of predictions to a real value.
Up to multiplication with a constant, the loss (5) can be represented in matrix form as
Thus, for the regression approach, the objective function to be minimized becomes:
By taking the derivative of (16) with respect to , setting it to zero, and solving with respect to , we get the following system of linear equations:
If the kernel matrix is not strictly positive definite but only positive semi-definite, should be interpreted as . Accordingly, (17) can be simplified to
Using the standard notation and rules of Kronecker product algebra, we show how to efficiently solve shifted Kronecker product systems. For a more in-depth analysis of the shifted Kronecker product systems, we refer to Martin and Van Loan (2006).
Let be diagonalizable matrices, that is, the matrices can be eigen decomposed as
where contain the eigenvectors and the diagonal matrices contain the corresponding eigenvalues of and . Then, the following type of shifted Kronecker product system
where and , can be solved with respect to in time if the inverse of exists.
Before starting the actual proof, we recall certain rules concerning the Kronecker product (see e.g. Horn and Johnson (1991)) and introduce some notations. Namely, for , , and , we have:
From this, it directly follows that
Moreover, for , , and , we have:
Furthermore, for , let denote the Hadamard (elementwise) product, that is, . Further, for a vector , let denote the diagonal -matrix, whose diagonal entries are given as . Finally, for , we have:
A minimizer of (16) can be computed in time.
Since the kernel matrix is symmetric and positive semi-definite, it is diagonalizable and it has nonnegative eigenvalues. This ensures that the matrix has strictly positive eigenvalues and therefore its inverse exists. Consequently, the claim follows directly from Proposition 4.4, which can be observed by substituting for both and . ∎
We continue by considering the use of the symmetric and reciprocal Kronecker kernels and show that, with those, the dual solution can be obtained as easily as with the ordinary Kronecker kernel. We first present and prove the following two inversion identities:
Let for some square matrix . Then,
if the considered inverses exist.
For a start, we note certain directly verifiable properties of the symmetrizer and skew-symmetrizer matrices. Namely, the matrices and are idempotent, that is,
Furthermore, and are orthogonal to each other, that is,
We prove (23) by multiplying with its alleged inverse matrix and show that the result is the identity matrix:
When going from (4.2) to (27), we use the fact that commutes with , because it commutes with both and . Moreover, the second term of (4.2) vanishes, because of the orthogonality of and to each other. In (4.2) we have used the following inversion identity known in matrix calculus literature (Henderson and Searle, 1981)
Identity (24) can be proved analogously. ∎
These inversion identities indicate that we can invert a diagonally shifted symmetric or reciprocal Kronecker kernel matrix simply by modifying the inverse of a diagonally shifted ordinary Kronecker kernel matrix. This is an advantageous property, since the computational short-cuts provided by Proposition 4.4 ensure the fast inversion of the shifted ordinary Kronecker kernel matrices, and its results can thus be used to accelerate the computations for the symmetric and reciprocal cases too.
The next result uses the above inversion identities to show that, when learning symmetric or reciprocal relations with kernel ridge regression (Saunders et al., 1998; Suykens et al., 2002; Shawe-Taylor and Cristianini, 2004; Pahikkala et al., 2009), we do not explicitly have to use the symmetric and reciprocal Kronecker kernels. Instead, we can just use the ordinary Kronecker kernel to learn the desired model as long as we ensure that the symmetry or reciprocity is encoded in the labels.
Using the symmetric Kronecker kernel for RLS regression with a label vector is equivalent to using an ordinary Kronecker kernel and a label vector . One can observe an analogous relationship between the reciprocal Kronecker kernel and a label vector .
In the last equality, we again used the fact that commutes with , because it commutes with both and . Let be a new couple of nodes for which we are supposed to do a prediction with a regressor determined by the coefficients . Moreover, let denote, respectively, the base kernel evaluations of the nodes and with the nodes in the training data. Then, contains the Kronecker kernel evaluations of the edge with all edges in the training data. Further, according to Proposition 4.3, the corresponding vector of symmetric Kronecker kernel evaluations is . Now, the prediction for the couple can be expressed as
As a consequence of this, we also have a computationally efficient method for RLS regression with symmetric and reciprocal Kronecker kernels. Encoding the properties into the label matrix ensures that the corresponding variations of the Kronecker kernels are implicitly used.
4.3 Conditional Ranking with Symmetric and Reciprocal Kernels
where , is the -identity matrix, and is the vector of which every entry is equal to , be the -centering matrix. The matrix is an idempotent matrix and multiplying it with a vector subtracts the mean of the vector entries from all elements of the vector. Moreover, the following equality can be shown
where are the entries of a vector . Now, let us consider the following quasi-diagonal matrix:
where is the number of edges starting from for . Again, given the assumption that the training data contains all possible edges between the nodes exactly once and hence for all , loss function (6) can be, up to multiplication with a constant, represented in matrix form as
provided that the entries of are ordered in a way compatible with the entries of , that is, the training edges are arranged according to their starting nodes.
Analogously to the regression case, the training phase corresponds to solving the following system of linear equations:
If the ordinary Kronecker kernel is used, we get a result analogous to Corollary 4.5.
A solution of (33) can be computed in time.
While the kernel matrix is not necessarily invertible, a solution can still be obtained from the following reduced form:
This can, in turn, be rewritten as
The matrix is symmetric, and hence if is strictly positive definite, the product is diagonalizable and has nonnegative eigenvalues (see e.g. (Horn and Johnson, 1985, p. 465)). Therefore, (34) is of the form which can be solved in time due to Proposition 4.4. The situation is more involved if is positive semi-definite. In this case, we can solve the so-called primal form with an empirical kernel map (see e.g. Airola et al. (2011a)) instead of (34) and again end up with a Kronecker system solvable in time. We omit the details of this consideration due to its lengthiness and technicality. ∎
4.4 Conjugate Gradient-Based Training Algorithms
Interestingly, if we use the symmetric or reciprocal Kronecker kernel for conditional ranking, we do not have a similar efficient closed-form solution as those indicated by Corollaries 4.5 and 4.8. The same concerns both regression and ranking if the above assumption of the training data having every possible edge between all nodes encountered in the training data (i.e. for all ) is dropped. Fortunately, we can still design algorithms that take advantage of the special structure of the kernel matrices and the loss function in speeding up the training process, while they are not as efficient as the above-described closed-form solutions.
Before proceeding, we introduce some extra notation. Let be a bookkeeping matrix of the training data, that is, its rows and columns are indexed by the edges in the training data and the set of all possible pairs of nodes, respectively. Each row of contains a single nonzero entry indicating to which pair of nodes the edge corresponds. This matrix covers both the situation in which some of the possible edges are not in the training data and the one in which there are several edges adjacent to the same nodes. Objective function (14) can be written as
with the ordinary Kronecker kernel and analogously with the symmetric and reciprocal kernels. Note that the number of dual variables stored in vector is still equal to , while the number of labels in is equal to . If an edge is not in the training data, the corresponding entry in is zero, and if a particular edge occurs several times, the corresponding entry is the sum of the corresponding variables in representation (2). For the ranking loss, the system of linear equations to be solved becomes
If we use an identity matrix instead of in (32), the system corresponds to the regression loss.
To solve the above type of linear systems, we consider an approach based on conjugate gradient type of methods with early stopping regularization. The Kronecker product can be written as , where and . Computing this product is cubic in the number of nodes. Moreover, multiplying a vector with the matrices or does not increase the computational complexity, because they contain only nonzero elements. Similarly, the matrix has only non-zero elements. Finally, we observe from (30) and (31) that the matrix can be written as , where is the following quasi-diagonal matrix:
The matrices and both have nonzero entries, and hence multiplying a vector with the matrix can also be performed in time.
Conjugate gradient methods require, in the worst case, iterations in order to solve the system of linear equations (33) under consideration. However, the number of iterations required in practice is a small constant, as we will show in the experiments. In addition, since using early stopping with gradient-based methods has a regularizing effect on the learning process (see e.g. Engl et al. (1996)), this approach can be used instead of or together with Tikhonov regularization.
4.5 Theoretical Considerations
Next, we give theoretical insights to back the idea of using RankRLS-based learning methods instead of ordinary RLS regression. As observed in Section 4.3, the main difference between RankRLS and the ordinary RLS is that RankRLS enforces the learned models to be block-wise centered, that is, the aim is to learn models that, for each node , correctly predict the differences between the utility values of the edges and , rather than the utility values themselves. This is common for most of the pairwise learning to rank algorithms, since learning the individual utility values is, in ranking tasks, relevant only with relation to other utility values. Below, we consider whether the block-wise centering approach actually helps in achieving this aim. This is done via analyzing the regression performance of the utility value differences.
We start by considering the matrix forms of the objective functions of the ordinary RLS regression
and RankRLS for conditional ranking
where is a quasi-diagonal matrix whose diagonal blocks are -centering matrices. Here we make the further assumption that the label vector is block-wise centered, that is, . We are always free to make this assumption with conditional ranking tasks.
The following lemma indicates that we can consider the RankRLS problem as an ordinary RLS regression problem with a modified kernel.
Objective functions (37) and
have a common minimizer.
This lemma provides us a different perspective on RankRLS. Namely, if we have a prior knowledge that the underlying regression function to be learned is block-wise centered (i.e. we have a conditional ranking task), this knowledge is simply encoded into a kernel function, just like we do with the knowledge about the reciprocity and symmetry.
In the literature, there are many results (see e.g. De Vito et al. (2005) and references therein) indicating that the expected prediction error of the regularized least-squared based kernel regression methods obey the following type of probabilistic upper bounds. For simplicity, we only consider the regression error. Namely, for any , it holds that
where denotes the probability, is the expected prediction error,