# Ranking with Features: Algorithm and A Graph Theoretic Analysis

Aadirupa Saha, Indian Institute of Science. aadirupa@iisc.ac.in    Arun Rajkumar, Conduent Labs, India. arun.rajkumar@conduent.com

###### Abstract

We consider the problem of ranking a set of items from pairwise comparisons in the presence of features associated with the items. Recent works have established that $O(n \log n)$ samples are needed to rank well when there is no feature information present. However, this might be sub-optimal in the presence of associated features. We introduce a new probabilistic preference model called the feature-Bradley-Terry-Luce (f-BTL) model that generalizes the standard BTL model to incorporate feature information. We present a new least squares based algorithm called fBTL-LS which we show requires far fewer than $O(n \log n)$ pairs to obtain a good ranking – precisely, our new sample complexity bound is $O(\alpha \log \alpha)$, where $\alpha$ denotes the number of ‘independent items’ of the set, and in general $\alpha \ll n$. Our analysis is novel and makes use of tools from classical graph matching theory to provide tighter bounds that shed light on the true complexity of the ranking problem, capturing the item dependencies in terms of their feature representations. This was not possible with earlier matrix completion based tools used for this problem. We also prove an information theoretic lower bound on the required sample complexity for recovering the underlying ranking, which essentially shows the tightness of our proposed algorithms. The efficacy of our proposed algorithms is validated through extensive experimental evaluations on a variety of synthetic and real world datasets.

## 1 Introduction

We consider the problem of ranking a set of items from pairwise comparisons when some side information about the items is available. Given a set of items and pairwise comparisons among them, the problem of ranking from pairwise comparisons is to recover an underlying ranking among the items. This is a well studied problem in several disciplines including statistics, operations research, theoretical computer science, social choice theory, machine learning, decision systems etc [Thurstone(1927), Bradley and Terry(1952), Luce(1959), Saaty(2008), Ailon et al.(2008)Ailon, Charikar, and Newman], [Braverman and Mossel(2008), Gleich and Lim(2011), Jamieson and Nowak(2011), Negahban et al.(2012)Negahban, Oh, and Shah], [Wauthier et al.(2013)Wauthier, Jordan, and Jojic, Busa-Fekete et al.(2014)Busa-Fekete, Hüllermeier, and Szörényi, Rajkumar and Agarwal(2014), Shah and Wainwright(2015), Borkar et al.(2016)Borkar, Karamchandani, and Mirani, Chen and Joachims(2016), Rajkumar and Agarwal(2016), Shah et al.(2016)Shah, Balakrishnan, Guntuboyina, and Wainwright, Niranjan and Rajkumar(2017)]. A typical approach to solve this problem is to assume that the comparisons are generated in a stochastic fashion according to a score based pairwise probability model such as the Bradley-Terry-Luce model [Bradley and Terry(1952)] [Luce(1959)] or the Thurstone model [Thurstone(1927)] and to develop algorithms [Gleich and Lim(2011)], [Negahban et al.(2012)Negahban, Oh, and Shah], [Rajkumar and Agarwal(2014)], [Borkar et al.(2016)Borkar, Karamchandani, and Mirani] that first estimate the score vector generating the comparisons and rank the items by simply sorting the estimated scores. However in practice these algorithms suffer from several shortcomings. Firstly, side information such as features or certain relationships among items is often available. For example, to rank a set of mobile phones, it is natural to use features such as cost, battery life, size etc. 
which influence the pairwise preferences of users for one mobile phone over another. However, most algorithms do not take into account the valuable feature information associated with items. Secondly, they fail to handle the case when new items get added to the existing set, i.e., one cannot find the position of a new item in an already estimated ranking without collecting pairwise preferences of the newly added item with at least some items in the existing set. Finally, the sample complexity of previous approaches scales as $O(n \log n)$, which may often be sub-optimal in the presence of features.

In this work, we introduce the feature-Bradley–Terry–Luce (f-BTL) model of pairwise comparisons to tackle the problems listed above. The f-BTL model is a generalization of the standard BTL model where the probability of preferring one item over the other explicitly depends on their associated features. We propose fBTL-LS, a least squares based algorithm to recover the underlying true scores (and ranking) under the f-BTL model. The novelty of our approach is in the analysis of the sample complexity (the number of comparisons needed to achieve a fixed error) of our algorithm to recover a good ranking, which we show is often smaller than $O(n \log n)$, depending on the structure of the item dependencies. Our improved sample complexity guarantee is $O(\alpha \log \alpha)$, where, on an intuitive level, $\alpha$ denotes the number of main (independent) items which influence the preference structures of the rest of the items in the set. This shows a significant reduction in the number of comparisons needed compared to the earlier known bound of $O(n \log n)$, especially when $\alpha \ll n$, which is often the case in practical scenarios. The key ingredient which helps us achieve this is a relation graph that we define on the features. We apply ideas from classical graph matching theory on the relation graph and show how they help us prove our sample complexity bounds. We demonstrate the usefulness of our bounds by deriving sample complexity results for several special cases of relation graphs such as cliques, disconnected cliques, trees, star graphs, cycles etc. By explicitly modelling dependencies among features (rather than just considering them to lie in some low dimensional space), our bounds reveal the true complexity of the problem. We believe that the graph theory based approach used in this work would be of wider use to the learning theory community.
Furthermore, we also give a matching lower bound guarantee analyzing the minimal number of pairwise preferences (sample complexity) required for estimating a ‘good ranking’, which in fact establishes the optimality of our proposed algorithms. Our experiments on synthetic and real world preference data sets show that the proposed algorithm significantly outperforms existing algorithms.

### 1.1 Related Work

As mentioned earlier, ranking from pairwise comparisons has been studied extensively in various disciplines and it is not in the scope of this paper to review all the previous works. We review the work most relevant to the setting considered here. The work that is most related to our setting is that of [Niranjan and Rajkumar(2017)] which also uses features associated with items. They assume the features lie in some low dimensional space and use a matrix completion based approach to obtain a ranking. The low rank assumption is a global assumption on the features and might miss out completely on the exact dependencies among them. As we will see, a set of features in a low dimensional space might give rise to very different types of relation graphs, which may lead to very different sample complexity bounds that our analysis will capture while theirs does not. [Gleich and Lim(2011)], [Borkar et al.(2016)Borkar, Karamchandani, and Mirani] give least squares based algorithms. However, they do not consider feature information. [Negahban et al.(2012)Negahban, Oh, and Shah, Wauthier et al.(2013)Wauthier, Jordan, and Jojic, Busa-Fekete et al.(2014)Busa-Fekete, Hüllermeier, and Szörényi, Rajkumar and Agarwal(2014), Shah and Wainwright(2015)] [Chen and Joachims(2016)], [Rajkumar and Agarwal(2016)], [Shah et al.(2016)Shah, Balakrishnan, Guntuboyina, and Wainwright] work in the pairwise ranking setting under different probabilistic models (including the BTL model). Again, none of these use features explicitly and hence (as we will see in the experiments) are sub-optimal for our setting. [Jamieson and Nowak(2011)] work in a setting where the probabilities come from some unknown low dimensional feature embedding of the items. However, they require the pairs to be queried actively while we work in the passive setting.
There is a rich ranking literature on noisy sorting [Braverman and Mossel(2008)], approximation algorithms [Ailon et al.(2008)Ailon, Charikar, and Newman], duelling bandits [Yue et al.(2012)Yue, Broder, Kleinberg, and Joachims] etc., which are fundamentally different from the passive setting under the BTL model considered in this work.

### 1.2 Summary of Contributions

The main contributions of this work are as follows:

• We introduce a new probabilistic model, f-BTL, for ranking from pairwise comparisons which explicitly uses features associated with items (Section 2).

• We give a novel analysis for the sample complexity of the proposed model by using ideas from graph matching theory that captures the dependencies among features more explicitly than previous approaches (Section 3).

• We propose a novel least squares based algorithm, fBTL-LS, and provide its sample complexity guarantees for recovering a ‘good estimate’ of the score vector (and ranking) under the f-BTL model (Section 4).

• We further prove that the sample complexity guarantee of our proposed fBTL-LS algorithm is tight by proving a matching lower bound for the problem (Section 5).

• We finally give supporting experimental results to show the efficacy of our algorithm on both synthetic data (which follows the f-BTL model) and on real datasets which do not necessarily follow the f-BTL model (Section 6).

### 1.3 Organization

We give the necessary preliminaries in Section 2 and define the problem formally in Section 2.1. In Section 3, we analyze the case when the probability values for the sampled pairs are known exactly and derive a graph matching theory based sample complexity bound. In Section 4, we propose our least squares based algorithm and show theoretical guarantees of its performance. Section 5 proves a matching lower bound guarantee for the problem. In Section 6, we experimentally evaluate our algorithm on various synthetic and real world data sets. We conclude in Section 7 with directions for future work. All proofs are presented in the appendix.

Notation: We use lower case boldface letters for vectors, upper case boldface letters for matrices, lower case letters for scalars and upper case letters for constants. denotes the norm for vectors and spectral norm for matrices. denotes the Frobenius norm for matrices. We denote the set by . For a square matrix , we denote by the magnitude of the largest eigenvalue of .

## 2 Preliminaries

BTL Model: ([Bradley and Terry(1952)], [Luce(1959)]) The standard probabilistic model for pairwise comparisons is the Bradley–Terry–Luce model. In this model, the probability of preferring item $i$ over item $j$ is given by

$$P_{ij} = \frac{w_i}{w_i + w_j} \tag{1}$$

where $\mathbf{w} = (w_1, \dots, w_n)$ is a vector of scores associated with the items.

Hall’s Marriage Theorem: In our sample complexity analysis, we will use this classical result in bipartite graph matching.

###### Theorem 2.1 ([Hall(1935)]).

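Let a finite bipartite graph with parts $A$ and $C$ be given. For $S \subseteq A$, let $N_C(S)$ denote the neighbours of $S$ in $C$. Then the graph admits a matching that entirely covers $A$ if and only if

$$|N_C(S)| \ge |S| \quad \forall S \subseteq A$$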
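Hall's condition can be checked directly on small graphs. The following Python sketch tests the condition by brute force over subsets of $A$ and compares the answer with an augmenting-path matching routine. The function names and the dictionary representation of the bipartite graph are illustrative choices, not part of the original text.

```python
from itertools import combinations


def hall_condition_holds(adj):
    """adj maps each node of A to the set of its neighbours in C."""
    nodes = list(adj)
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            neighbours = set().union(*(adj[a] for a in subset))
            if len(neighbours) < size:
                return False  # this subset violates |N_C(S)| >= |S|
    return True


def has_covering_matching(adj):
    """Augmenting-path bipartite matching; True iff every node of A is matched."""
    match_of = {}  # node in C -> node in A currently matched to it

    def try_assign(a, seen):
        for c in adj[a]:
            if c in seen:
                continue
            seen.add(c)
            # Take c if it is free, or if its current partner can move elsewhere.
            if c not in match_of or try_assign(match_of[c], seen):
                match_of[c] = a
                return True
        return False

    return all(try_assign(a, set()) for a in adj)


good = {"a1": {"c1", "c2"}, "a2": {"c2"}, "a3": {"c1", "c3"}}
bad = {"a1": {"c1"}, "a2": {"c1"}, "a3": {"c2", "c3"}}  # a1, a2 share one neighbour
```

The brute-force check is exponential in $|A|$ and is meant only to illustrate the equivalence stated in the theorem; the matching routine is the practical test.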

### 2.1 Problem Setting

Let $[n]$ be the set of items to be ranked. We assume that the items are related to each other through a relation graph $G$ on the items, where a pair of items is related iff there is a corresponding edge in $G$. The items have associated feature vectors $\mathbf{u}_1, \dots, \mathbf{u}_n$. The following natural assumption relates the features and $G$:

Assumption: The feature vectors corresponding to the items in the independent set $I(G)$ of $G$ are linearly independent and form a basis for the feature space.

Let $\alpha = \alpha(G)$ be the independence number of $G$. Let $\mathbf{B}$ be the coefficient matrix that expresses each $\mathbf{u}_i$ in terms of the basis vectors as follows:

$$\mathbf{u}_i = \sum_{j \in I(G)} B_{ji}\, \mathbf{u}_j \quad \forall i \in [n] \tag{2}$$

We will assume (without loss of generality) the set $I(G)$ to be $[\alpha]$. When there are multiple maximum independent sets of $G$, we arbitrarily fix one.

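Model: We introduce the feature Bradley–Terry–Luce model (f-BTL) where the probability of preferring item $i$ over item $j$ is given by:

$$P_{ij} = \frac{\exp(\mathbf{w}^\top \mathbf{u}_i)}{\exp(\mathbf{w}^\top \mathbf{u}_i) + \exp(\mathbf{w}^\top \mathbf{u}_j)}$$

where $\mathbf{w}$ is a weight vector in the feature space. The f-BTL model reduces to the standard BTL model when the feature vectors $\mathbf{u}_1, \dots, \mathbf{u}_n$ form the standard basis (so that the feature dimension equals $n$). Let $\boldsymbol{\theta}$ be the score vector where $\theta_i = \mathbf{w}^\top \mathbf{u}_i$.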
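The reduction to the standard BTL model can be checked numerically. The sketch below is illustrative only: the function names and the example weight vector are assumptions, and the BTL scores are taken to be the exponentials of the f-BTL scores.

```python
import numpy as np


def fbtl_prob(w, u_i, u_j):
    """Probability that item i is preferred over item j under f-BTL."""
    s_i, s_j = float(np.dot(w, u_i)), float(np.dot(w, u_j))
    # exp(s_i) / (exp(s_i) + exp(s_j)), written as a logistic of the score gap.
    return 1.0 / (1.0 + np.exp(-(s_i - s_j)))


def btl_prob(scores, i, j):
    """Standard BTL probability of preferring item i over item j."""
    return scores[i] / (scores[i] + scores[j])


w = np.array([0.3, -0.2, 1.1])   # hypothetical weight vector
U = np.eye(3)                    # standard basis: row i is the feature of item i
p_f = fbtl_prob(w, U[0], U[2])
p_b = btl_prob(np.exp(w), 0, 2)  # BTL with scores exp(w_i)
```

With standard basis features, `p_f` and `p_b` agree up to floating point error.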

Sampling: We assume that a set of pairs $E$ is generated where each pair is chosen independently with probability $p$. Each pair in $E$ is compared $K$ times independently according to the f-BTL model.
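The sampling model can be simulated in a few lines. This is a sketch under stated assumptions: the names `p` (pair sampling probability) and `K` (number of repeated comparisons per pair) are chosen here for illustration, and each pair is assumed to be included independently of the others.

```python
import numpy as np


def sample_comparisons(theta, p, K, rng):
    """Return {(i, j): wins of i over j out of K} for randomly sampled pairs i < j."""
    n = len(theta)
    wins = {}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # include the pair with probability p
                p_ij = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))
                wins[(i, j)] = int(rng.binomial(K, p_ij))
    return wins


rng = np.random.default_rng(0)
theta = np.array([1.0, 0.5, 0.0, -0.5])  # hypothetical f-BTL scores
wins = sample_comparisons(theta, p=0.5, K=20, rng=rng)
```

An empirical estimate of each preference probability is then `wins[(i, j)] / K`.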

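Problem: Given $\epsilon$ and $\delta$, for what values of $p$ and $K$ under the above sampling model does one have an algorithm whose estimated score vector $\hat{\boldsymbol{\theta}}$ satisfies

$$P\left(\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|_2 \le \epsilon\right) > 1 - \delta \;?$$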

Previous results show that under the standard BTL model, the rank centrality [Negahban et al.(2012)Negahban, Oh, and Shah] [Rajkumar and Agarwal(2016)], maximum likelihood under the BTL model [Shah and Wainwright(2015)] and the least squares [Borkar et al.(2016)Borkar, Karamchandani, and Mirani] algorithms need $O(n \log n)$ comparisons to achieve a small error with high probability. However, these algorithms do not consider the features explicitly. The feature low rank model of [Niranjan and Rajkumar(2017)] uses features, but the number of pairs it requires to be compared depends on the feature dimension. As we will see, the fBTL-LS algorithm that we propose requires far fewer samples than $O(n \log n)$.

## 3 Analysis: Case When Probabilities are Known

We begin by analyzing the problem for the noiseless case where, for every pair $(i,j)$ that is compared, we have access to the exact value of $P_{ij}$. This analysis will shed light on the structure of the problem, which will be useful to analyse the case when the probabilities $P_{ij}$ have to be estimated using the comparisons. Under this setting, the goal is to bound the number of samples needed to exactly recover the score vector $\boldsymbol{\theta}$ where $\theta_i = \mathbf{w}^\top \mathbf{u}_i$. From Equation 2, we have that

$$\mathbf{w}^\top \mathbf{u}_i = \sum_{j \in I(G)} B_{ji}\, \mathbf{w}^\top \mathbf{u}_j, \quad \text{or equivalently,} \quad \theta_i = \sum_{j \in I(G)} B_{ji}\, \theta_j \quad \forall i \in [n]. \tag{3}$$

As we have access to the relation graph and the features, we only need to recover the scores of the items in $I(G)$ so that the remaining scores can be computed using Equation 3.

For a pair $(i,j)$, under the f-BTL model, the following holds:

$$\sum_{k=1}^{\alpha} \gamma_{ijk}\, \theta_k = \sum_{k=1}^{\alpha} \gamma_{ijk}\, (\mathbf{w}^\top \mathbf{u}_k) = 0 \tag{4}$$

where the coefficients $\gamma_{ijk}$ are determined by the coefficient matrix and the value of $P_{ij}$. Equation 4 shows that knowing $P_{ij}$ for any pair $(i,j)$ gives rise to a linear equation involving the scores of the items in $I(G)$. Since the f-BTL model is invariant to scaling the score vector by a positive constant, we can w.l.o.g. assume that one of the scores is normalized to a fixed value. Thus, to recover the scores for all the items, it seems that we only need as many equations of the form of Equation 4 as there are unknown scores of items in $I(G)$. However, if the coefficient $\gamma_{ijk}$ is $0$, then the corresponding equation does not involve the $k$-th independent set item. Thus, the equations (i.e., the pairs selected) should be such that each item in $I(G)$ appears in at least one of the selected equations so that it can be solved for.
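As a sanity check on why an observed pair yields a linear equation in the independent-set scores, the following identity follows directly from the f-BTL definition in Section 2.1 together with Equation 3. It is only a sketch: it is written before any normalization of the scores, so the coefficients multiplying $\theta_k$ here need not coincide with the $\gamma_{ijk}$ of Equation 4.

```latex
\log \frac{P_{ij}}{1 - P_{ij}}
  \;=\; \theta_i - \theta_j
  \;=\; \sum_{k \in I(G)} \left( B_{ki} - B_{kj} \right) \theta_k
```

The first equality holds because $P_{ij} / (1 - P_{ij}) = \exp(\theta_i) / \exp(\theta_j)$, and the second substitutes Equation 3 for both $\theta_i$ and $\theta_j$.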

Thus our problem now is to compute the number of pairs needed (under the sampling model) to ensure that with high probability each item in $I(G)$ appears in at least one equation of the form of Equation 4. To compute this number, we need to explicitly model the dependencies among features. We do this below and prove the necessary result using Hall’s marriage theorem, a classical result from graph matching theory (see Section 2).

The bipartite graph that will be of interest to us is as follows: The set $A$ is just the set of items in the independent set, i.e., $A = I(G)$. The set $C$ consists of nodes, one corresponding to each edge $(i,j)$. For an edge $(i,j)$, define

$$F_{ij} = \{k \in [\alpha] : \gamma_{ijk} \neq 0\} \tag{5}$$

$F_{ij}$ is the set of independent set nodes whose coefficients are non-zero in the equation (see Equation 4) induced by the pair $(i,j)$. Thus, by observing the pair $(i,j)$, we have an equation involving the items in $F_{ij}$. We define the edge set such that an edge from node $k \in A$ to an edge $(i,j) \in C$ is present in the bipartite graph iff $k \in F_{ij}$. For any set of edges $E$, we define the reduced bipartite graph by restricting $C$ to $E$ and defining the edge set correspondingly.

###### Theorem 3.1.

Given a set of edges $E$, the reduced bipartite graph defined above admits a matching that covers $A = I(G)$ if and only if the system of linear equations induced by the edges in $E$ can be solved.

###### Proof.

If there is a matching that covers $A$, then each node in $A$ has a distinct representative edge in $C$ which induces an equation containing it. Thus there are at least $|A|$ equations with each node appearing in at least one of them, and hence the system can be solved. On the other hand, if there is no matching that covers $A$, then by Hall’s theorem (Theorem 2.1), there must exist some subset $S \subseteq A$ whose neighbours satisfy $|N_C(S)| < |S|$. As the total number of equations that involve nodes in $S$ is less than the number of nodes in $S$, this set of equations cannot be solved. ∎
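The construction above can be sketched in code: build the sets $F_{ij}$ from the coefficients of each observed equation and test whether a matching covers every independent-set node. The coefficient values, the function names, and the dictionary layout below are hypothetical and serve only as an illustration.

```python
def support_sets(gamma, tol=1e-12):
    """F_ij = {k : gamma_ijk != 0} for each observed pair, as in Equation 5."""
    return {pair: {k for k, g in enumerate(coeffs) if abs(g) > tol}
            for pair, coeffs in gamma.items()}


def matching_covers_independent_set(gamma, alpha):
    """True iff every independent-set node 0..alpha-1 gets its own observed pair."""
    F = support_sets(gamma)
    # adj[k]: observed pairs whose equation involves independent-set node k.
    adj = {k: [pair for pair, ks in F.items() if k in ks] for k in range(alpha)}
    match_of = {}  # pair -> independent-set node matched to it

    def try_assign(k, seen):
        for pair in adj[k]:
            if pair in seen:
                continue
            seen.add(pair)
            if pair not in match_of or try_assign(match_of[pair], seen):
                match_of[pair] = k
                return True
        return False

    return all(try_assign(k, set()) for k in range(alpha))


# Hypothetical coefficients gamma_ij = (gamma_ij0, gamma_ij1) with alpha = 2.
gamma_ok = {(0, 2): [0.5, -0.5], (1, 3): [-2.0, 0.0]}
gamma_bad = {(1, 3): [-2.0, 0.0], (0, 3): [1.0, 0.0]}  # node 1 never appears
```

In `gamma_bad`, no observed equation involves the second independent-set node, so no covering matching exists.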

Theorem 3.1 gives us a novel way to analyse the number of pairs needed to obtain enough equations to solve for the score vector . In particular, we only need to bound the probability that the Hall’s marriage condition is not met to get an upper bound on the number of pairs needed. Before we prove the result, we need the following definitions for a given set . Let denote the neighbours of node in . Let , . We now prove the main result of this section

###### Theorem 3.2 (Bound On Error Probability).

Given a relation graph $G$, a feature matrix, a set of pairs generated according to the sampling model above (where each pair is chosen with probability $p$), and the exact preference probabilities $P_{ij}$ for the sampled pairs, the probability that the score vector $\hat{\boldsymbol{\theta}}$ estimated by solving the resulting equations differs from the true score vector $\boldsymbol{\theta}$ is bounded by

$$P(\hat{\boldsymbol{\theta}} \neq \boldsymbol{\theta}) \le \sum_{q=1}^{\min\{\alpha(G),\, d_{\max}(G)+1\}} \; \sum_{\substack{I \subseteq I(G) \\ |I| = q}} \binom{d_I}{q-1}\, p^{q-1} (1-p)^{c_I - (q-1)},$$

where $d_{\max}(G)$ is the maximum degree of $G$.

The above theorem gives us a way of choosing $p$ such that the probability of not satisfying Hall’s condition (and hence not having enough equations to solve) can be bounded by a suitable value. As can be seen in the theorem, the quantities of interest are $d_I$ and $c_I$, which capture the dependencies among the feature vectors of the nodes in the graph. For several common types of graphs, these quantities are easily computable and readily yield sample complexity bounds. We prove these results for some special cases below:

###### Theorem 3.3 (Sample Complexity for Common Graphs).

Under the settings of Theorem 3.2, the following sample complexity bounds hold with probability at least

• If $G$ is a disconnected graph, star graph, or a cycle,

• If $G$ is a clique,

• If $G$ is an $r$-disconnected clique (i.e. union of $r$ cliques),


Remarks: The above theorem captures the relation between the structure of the graph (and the induced dependencies among the features) and the sample complexity needed to recover the score vector under the f-BTL model. For instance, if the graph is a clique, then there is only one independent vector and hence one recovers the score using pairs. However, when the graph is disconnected, there are independent vectors and we recover the result for the BTL model. More importantly, there are graphs (such as r-disconnected cliques) where the number of pairs needed scale as and are independent of the total number of nodes . This is an example where we get significant improvements in the sample complexity by exploiting the structure of the features which [Niranjan and Rajkumar(2017)] fails to achieve. In the appendix, we also discuss the sample complexity for other graphs such as regular graphs and trees.

## 4 Algorithm For General Case

In this section, we consider the original problem where we do not have access to the exact values $P_{ij}$ but only to estimates of them obtained from the independent comparisons made. In this setting, we cannot expect to solve the linear equations exactly. We propose fBTL-LS, a least squares based algorithm, shown in Algorithm 1, to solve for the score vector. Let the graph induced by the sampled edge set on the $n$ nodes be called the comparison graph. The node-edge incidence matrix $\mathbf{Q}$ used in the algorithm is such that $\mathbf{L} = \mathbf{Q}\mathbf{Q}^\top$ is the standard unnormalized Laplacian associated with the comparison graph, i.e., $\mathbf{L} = \mathbf{D} - \mathbf{A}$ where $\mathbf{D}$ is the diagonal matrix of degrees and $\mathbf{A}$ is the adjacency matrix. Algorithm 1 is motivated by the fact that when the true probabilities are known exactly, the following holds:

$$\mathbf{Q}^\top \mathbf{B} \mathbf{v} = \mathbf{y} \tag{6}$$

where and where such that . As we only have estimates instead of , we take a least squares approach to estimate the scores.
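A minimal sketch of a least squares estimator in the spirit of Equation 6 is given below. It is not a reproduction of Algorithm 1: the log-odds target vector `y`, the orientation of the coefficient matrix (stored here as an $\alpha \times n$ array, so the product uses its transpose), the clipping constant, and all function names are assumptions made for illustration.

```python
import numpy as np


def fbtl_least_squares(B, pairs, p_hat):
    """B: (alpha, n) coefficients with theta = B.T @ v; pairs: list of (i, j);
    p_hat: estimated P_ij for each pair. Returns an estimate of theta (length n)."""
    n = B.shape[1]
    # Node-edge incidence matrix: column e has +1 at node i and -1 at node j.
    Q = np.zeros((n, len(pairs)))
    for e, (i, j) in enumerate(pairs):
        Q[i, e], Q[j, e] = 1.0, -1.0
    # Under f-BTL the log-odds of P_ij equal theta_i - theta_j.
    p = np.clip(np.asarray(p_hat, dtype=float), 1e-6, 1 - 1e-6)
    y = np.log(p / (1.0 - p))
    # Solve min_v || Q^T B^T v - y ||_2 for the independent-set scores v.
    v, *_ = np.linalg.lstsq(Q.T @ B.T, y, rcond=None)
    return B.T @ v


# Hypothetical instance: alpha = 2 independent items (0 and 1), n = 4 items.
B = np.array([[1.0, 0.0, 0.5, 2.0],
              [0.0, 1.0, 0.5, 1.0]])
theta_true = B.T @ np.array([0.4, -0.3])
pairs = [(0, 2), (1, 3)]  # the comparison graph has two components
p_exact = [1.0 / (1.0 + np.exp(-(theta_true[i] - theta_true[j]))) for i, j in pairs]
theta_hat = fbtl_least_squares(B, pairs, p_exact)
```

In this toy instance the comparison graph is disconnected, yet the exact probabilities on two pairs suffice to recover all four scores because the coefficient matrix ties the components together.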

### 4.1 Connectivity

The results of [Borkar et al.(2016)Borkar, Karamchandani, and Mirani] show that the sample complexity for the least squares algorithm for the standard BTL model depends on how well connected the comparison graph is. In particular, this is measured w.r.t. the second smallest eigenvalue of the Laplacian, which is $0$ if and only if the comparison graph is disconnected. Thus when the comparison graph is disconnected, there is no way to recover the score vector in the standard BTL case. However, as we will see below, our analysis will depend on the least eigenvalue of a different matrix and not of the Laplacian matrix. The important point to note here is that even if the comparison graph is disconnected, the fBTL-LS algorithm may still recover the score vector. This is because the algorithm makes use of the matrix of coefficients $\mathbf{B}$ to relate scores across possibly disconnected components in the comparison graph.
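The connectivity criterion mentioned above is easy to verify numerically. The following sketch (function names are illustrative) builds the unnormalized Laplacian of a small comparison graph and inspects its second smallest eigenvalue.

```python
import numpy as np


def laplacian(n, edges):
    """Unnormalized Laplacian D - A of an undirected graph on n nodes."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L


def second_smallest_eigenvalue(n, edges):
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(laplacian(n, edges))[1])


connected = second_smallest_eigenvalue(4, [(0, 1), (1, 2), (2, 3)])  # a path
disconnected = second_smallest_eigenvalue(4, [(0, 1), (2, 3)])       # two components
```

For the path the value is strictly positive, while for the two-component graph it is zero up to floating point error.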

An example of this is shown in the accompanying figure. Here and and . The comparison graph as can be seen in the figure is disconnected. The nodes circled in red are assumed to be the independent set nodes. The exact relation between the feature vectors of the independent set i.e., and those not in the independent set i.e., are given by the matrix shown in the figure. It can be verified for this example that the matrix (also shown in the figure) has non-zero eigenvalues though the Laplacian is block diagonal (which happens if and only if the comparison graph is disconnected as in this case).

We now prove the main result of this section:

###### Theorem 4.1 (Recovery Guarantee for fBTL-LS Algorithm).

Let be a set of edges generated as per the sampling model and let each pair in be compared times independently according to the f-BTL model. Then for any positive scalar , with probability at least , the normalized -error of Algorithm 1 satisfies

 ∥^θ−θ∥∥θ∥≤2a⋅ ⎷λmax(BTB)λmin(BTB)⋅√mα⋅√λnλ1

where , . Similarly and respectively denotes the minimum and maximum non-zero eigenvalues of the positive semi-definite matrix . denote the range of the f-BTL parameter such that and .

Remarks: Some explanation is in order regarding the bound in the above theorem. As can be seen, the normalized error is bounded by a product of four terms. The first term can be treated as a constant that depends on the minimum score vector corresponding to the f-BTL model. The second term is the condition number of the feature coefficient matrix $\mathbf{B}$ and captures how the features interact with each other. The third term depends on the number of pairs seen in the sampled edge set. When , this term becomes . The fourth term grows depending on how many samples one sees, as it depends on the eigenvalues of the Laplacian of the comparison graph. If both and are , then the normalized error is a constant with probability at least . Thus, the result essentially says that if one sees samples and is such that both and are , then the normalized error is bounded by a small constant.

## 5 Lower Bound

In this section, we show how the achievable $\ell_2$-error rate of the fBTL-LS algorithm (as derived in Theorem 4.1) compares to the minimax $\ell_2$-error rate possible over the class of feature Bradley-Terry-Luce (f-BTL) models. More specifically, our result in Theorem 5 proves an information-theoretic lower bound for the $\ell_2$-error rate achievable by any learning algorithm for estimating the score parameters of the f-BTL model. Our proof technique uses a constructive argument to generate the score vectors from a uniform distribution that respects the f-BTL model within the prescribed dynamic range, and reduces the stochastic inference problem to a multi-way hypothesis testing problem. Our derived lower bound guarantee is given below:

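Let us consider the following set of score vectors of an f-BTL model defined with respect to the coefficient matrix $\mathbf{B}$ and range parameters $a, b$ such that:

$$\Theta_B(a,b) = \{\boldsymbol{\theta} \in \mathbb{R}^n \mid \boldsymbol{\theta} \text{ satisfies (???)},\ |\theta_i| \le a \ \forall i \in [\alpha],\ |\theta_i| \ge b \ \forall i \in [n]\}.$$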

Now suppose the learner (i.e. an algorithm which estimates the scores of an f-BTL model) is given access to noisy pairwise preferences sampled according to an Erdős–Rényi random graph with for some , such that independent noisy pairwise preferences are available for each sampled pair, generated according to some unknown f-BTL model in $\Theta_B(a,b)$. Then, if $\hat{\boldsymbol{\theta}}$ is the learner’s estimated f-BTL score vector based on the sampled pairwise preferences, after which the environment chooses a worst case true score vector $\boldsymbol{\theta}$, for any such learning algorithm one can show that

 supθ∈ΘB(a,b)E[∥^θ−θ∥]∥θ∥≥√λmin(BTB)16bλmax(BTB)√448ζKe2(b+1),

where the expectation is taken over the randomness of the algorithm.

## 6 Experiments

We now describe the experiments we ran with the fBTL-LS algorithm on various synthetic and real world datasets. We compared our results with three algorithms: (i) ordinary least squares (OLS) [Borkar et al.(2016)Borkar, Karamchandani, and Mirani], (ii) rank centrality (RC) [Negahban et al.(2012)Negahban, Oh, and Shah] and (iii) inductive pairwise ranking based on inductive matrix completion (IMC) [Niranjan and Rajkumar(2017)]. The first two algorithms do not use any feature information while the third algorithm does. Both the OLS and RC algorithms are guaranteed to work well for the standard BTL model with $O(n \log n)$ pairs compared, while the IMC algorithm requires a number of compared pairs that depends on the feature dimension for its guarantees to hold (for a generalized BTL model that they define).

Performance Measures: We measure the performance of algorithms using the following metrics

1. Normalized $\ell_2$-error: For experiments where there is a true score vector that generates the comparisons, we use the normalized $\ell_2$ error between the estimated score vector and the true score vector.

2. Pairwise disagreement (pd) error: Suppose $P^*$ denotes the underlying pairwise preference matrix corresponding to the true (and unknown) score vector, and $\hat{P}$ is the estimated preference matrix returned by the algorithm (if the algorithm returns a score vector estimate, we compute $\hat{P}$ from it). Then the pd-error counts the fraction of pairs on which $\hat{P}$ and $P^*$ disagree, defined as:

 pd(^P,P∗)=1n2∑i0.5))

3. Sample complexity (sc): Minimum number of pairwise comparisons required to be observed to obtain a normalized $\ell_2$-error below a target threshold.
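The pairwise disagreement metric can be sketched as follows. The normalization by the number of unordered pairs and the use of a 0.5 threshold on each entry are assumptions about the exact form of the metric; the function name and the example matrices are illustrative.

```python
import numpy as np


def pd_error(P_hat, P_star):
    """Fraction of pairs i < j on which the two matrices prefer different items."""
    P_hat, P_star = np.asarray(P_hat), np.asarray(P_star)
    n = P_star.shape[0]
    iu = np.triu_indices(n, k=1)  # index pairs with i < j
    disagree = (P_hat[iu] > 0.5) != (P_star[iu] > 0.5)
    return float(np.mean(disagree))


# Hypothetical 3-item example: the matrices disagree only on the pair (0, 2).
P_star = np.array([[0.5, 0.7, 0.8],
                   [0.3, 0.5, 0.6],
                   [0.2, 0.4, 0.5]])
P_hat = np.array([[0.5, 0.6, 0.4],
                  [0.4, 0.5, 0.7],
                  [0.6, 0.3, 0.5]])
err = pd_error(P_hat, P_star)
```

Here one of the three unordered pairs is ranked differently, so `err` is one third.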

### 6.1 Experiments on Synthetic Datasets

We evaluate the four algorithms on three different types of graphs with respect to the above performance measures. We consider three different settings for this purpose: Type-I plots, with increasing node size $n$; Type-II plots, with increasing sampling rate $p$ but fixed node size $n$ and independence number $\alpha$; and Type-III plots, with increasing independence number $\alpha$ and fixed node size $n$. The details of the experimental setups and results are provided below.

Type of graphs. We use three different types of graphs for the synthetic experiments: (i) $r$-disconnected cliques: unions of $r$ cliques; (ii) regular graphs: graphs in which every node has the same degree; and (iii) trees in which every non-leaf node has the same number of children.

Data generation For each of the above type of graphs , we first fix a maximum independent set of , and embed the node of with canonical basis vector of , denoted by with , if and otherwise , . We next generate a random coefficient matrix and embed rest of the nodes in according to (2) as defined in the problem setting. We now choose a random vector and assign a BTL score to every node as defined in (3). Finally is normalized to -norm , i.e. , setting .

Parameter setting. As clear from the above data generation procedure, the feature dimension is equal to the independence number of in all the experiments. We also fix throughout for all the experiments (unless performance is reported against ), and report the average performances over runs.

### Type-I plots: with increasing node size (n)

In this setup, we compare the four algorithms, with respect to the above three performance measures with varying node size , on three different graphs: Union of disconnected cliques on nodes, -Regular graph of nodes with fixed degree and Full binary tree of nodes. The results are reported in Figure 1. They clearly reflect the superior performance of fBTL-LS  for each of the three performance measures.

Note that for graph type of -disconnected cliques, the independence number is fixed for all , unlike graph and where scales with . Now the interesting observation is that the sample complexity sc of fBTL-LS  for achieving a target error for -disconnected clique is almost constant even with varying from to , unlike the rest of the three algorithms where it scales with . This indeed justifies our claim of the required sample complexity to be , as remarked in Theorem 4.1. This also justifies why for -regular graph and full binary tree, sample complexity of fBTL-LS  monotonically increases with , as the independence number itself scales with for these two graphs.

### Type-II plots: with increasing sampling rate (p) but fixed node size (n) and independence number (α)

In this case we compare the four algorithms with varying sampling rate with respect to the two estimation error metrics, normalized -error and pairwise disagreement pd, on the following three different graphs: Union of disconnected cliques on nodes, i.e. with each clique having nodes, -Regular graphs on nodes with each node of degree and Full binary tree of height ( nodes). Thus in each case, and are kept fixed, with to be set as , varying from to . From the results in Figure 2, it clearly reflects that, as expected, the performance of all the algorithms get improved with higher sampling rate .

However, the rate of performance improvement is far more drastic for fBTL-LS compared to the rest due to its inherent ability to exploit the feature correlation, thus reaching more accurate score estimates faster.

### Type-III plots: with increasing independence number (α) and fixed node size (n)

Finally in the third setup, we compare the four algorithms with varying independence set size (or independence number) for a fixed set of nodes on the following two graphs: Union of -disconnected cliques over nodes with varying and -Regular graph of nodes with varying degree (Figure 3). The results again show that varying as , the performance metrics, normalized -error and pd, remains almost constant validating the claim of the required sample complexity of fBTL-LS  to be , as follows from Theorem 4.1. The sample complexity curves on the other hand validate the dependency of sc on , which clearly increases with higher values of , as expected.

### 6.2 Experiments on Real-World Datasets

We also evaluate the four algorithms on two real-world preference learning datasets: car and sushi. The datasets and the experimental setup are described below:

Car Dataset. ([Abbasnejad et al.(2013)Abbasnejad, Sanner, Bonilla, Poupart, et al.]) This dataset contains pairwise preferences of cars given by users, where each car represented by a -dimensional feature vector.

Sushi Dataset. ([Kamishima and Akaho(2009)]) This dataset contains over 100 sushis rated according to their preferences, where each sushi is represented by a -dimensional feature vector.

Setup. Note that the real world datasets do not satisfy any preference modelling assumption, e.g. the BTL assumption, and hence there is no true score vector associated with the item preferences. From the user preferences, we first compute the underlying pairwise preference matrix $P^*$, where $P^*_{ij}$ is computed by taking the empirical average of the number of times item $i$ is preferred over item $j$. Further, to construct the feature matrix, we use the feature information of the item set provided in each dataset. Specifically, given the feature vector of each item (with the dimensions described above for Car and Sushi), we find a set of items whose corresponding features are linearly independent and form a basis of the feature space, and use these items as the independent set $I(G)$. The coefficient matrix $\mathbf{B}$ is then constructed by representing the rest of the items as linear combinations of the items in $I(G)$, such that it satisfies (2) (see Section 2.1 for details).

Performance Measure. As noted above, the real-world datasets do not satisfy the BTL assumption, so there is no true score vector associated with the item preferences. We therefore measure the performance of the algorithms with respect to the true preference matrix , using the pairwise disagreement error pd. In both cases, our proposed algorithm outperforms the rest. We also evaluate the algorithms with an increasing number of repeated samples per pair . As expected, a higher indeed leads to improved performance. The results are reported in Figure 4.
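As a rough illustration of how a pairwise disagreement error can be computed against a preference matrix, consider the sketch below. The exact definition of pd is given earlier in the paper and is not reproduced in this section, so the version here is an assumption: it counts the fraction of compared pairs on which the ranking induced by the estimated scores disagrees with the majority preference in the matrix. The function name and the tie handling are illustrative choices.

```python
def pairwise_disagreement(P, scores):
    """Fraction of informative pairs (i, j) where the score order contradicts P.

    P[i][j] is the empirical probability that item i is preferred over item j
    (None if the pair was never compared). Pairs with P[i][j] == 0.5 carry no
    preference and are skipped. Tied scores count as a disagreement whenever
    P prefers item i, which is a simplifying assumption of this sketch.
    """
    n = len(scores)
    disagreements, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if P[i][j] is None or P[i][j] == 0.5:
                continue
            total += 1
            matrix_prefers_i = P[i][j] > 0.5
            scores_prefer_i = scores[i] > scores[j]
            if matrix_prefers_i != scores_prefer_i:
                disagreements += 1
    return disagreements / total if total else 0.0


# Toy usage: the matrix prefers item 0 over item 2 over item 1.
P_toy = [
    [0.5, 0.8, 0.6],
    [0.2, 0.5, 0.3],
    [0.4, 0.7, 0.5],
]
pd_consistent = pairwise_disagreement(P_toy, [3.0, 1.0, 2.0])
pd_reversed = pairwise_disagreement(P_toy, [1.0, 3.0, 2.0])
pd_partial = pairwise_disagreement(P_toy, [3.0, 2.0, 1.0])
```

Because only the order of the scores matters, any monotone transformation of an estimated score vector yields the same value.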

## 7 Conclusion

We considered the problem of ranking from pairwise comparisons in the presence of feature information. Existing results either fail to utilize this feature information or make broad low-rank assumptions that cannot capture the item dependencies through their feature representations. In this work, we introduced a feature-based probabilistic preference model, f-BTL, and showed that the feature information can be used to obtain tight sample complexity bounds for recovering ‘good estimates’ of the underlying scores of the preference model. We proposed a least squares based algorithm, fBTL-LS, and proved theoretical recovery guarantees for it. Furthermore, our information theoretic lower bound analysis shows the optimality of the proposed algorithm through a matching lower bound.

While least squares based algorithms are a natural choice for this problem, it would be interesting to see how Markov chain based approaches, such as rank centrality, can be extended to accommodate feature information. It would also be interesting to analyze the problem in a contextual setting that introduces feature dependencies of the users as well.

## References

• [Abbasnejad et al.(2013)Abbasnejad, Sanner, Bonilla, Poupart, et al.] Ehsan Abbasnejad, Scott Sanner, Edwin V Bonilla, Pascal Poupart, et al. Learning community-based preferences via Dirichlet process mixtures of Gaussian processes. In Proceedings of the 18th International Joint Conferences on Artificial Intelligence, 2013.
• [Ailon et al.(2008)Ailon, Charikar, and Newman] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM, 55(5), 2008.
• [Borkar et al.(2016)Borkar, Karamchandani, and Mirani] Vivek S Borkar, Nikhil Karamchandani, and Sharad Mirani. Randomized Kaczmarz for rank aggregation from pairwise comparisons. In Information Theory Workshop (ITW), 2016 IEEE. IEEE, 2016.
• [Bradley and Terry(1952)] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4), 1952.
• [Braverman and Mossel(2008)] Mark Braverman and Elchanan Mossel. Noisy sorting without resampling. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2008.
• [Busa-Fekete et al.(2014)Busa-Fekete, Hüllermeier, and Szörényi] Róbert Busa-Fekete, Eyke Hüllermeier, and Balázs Szörényi. Preference-based rank elicitation using statistical models: The case of Mallows. In Proceedings of the 31st International Conference on Machine Learning, volume 32, 2014.
• [Chen and Joachims(2016)] Shuo Chen and Thorsten Joachims. Modeling intransitivity in matchup and comparison data. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining. ACM, 2016.
• [Gleich and Lim(2011)] David F Gleich and Lek-heng Lim. Rank aggregation via nuclear norm minimization. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data mining. ACM, 2011.
• [Hall(1935)] Philip Hall. On representatives of subsets. Journal of the London Mathematical Society, 1(1), 1935.
• [Jamieson and Nowak(2011)] Kevin G Jamieson and Robert Nowak. Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems, 2011.
• [Kamishima and Akaho(2009)] Toshihiro Kamishima and Shotaro Akaho. Efficient clustering for orders. In Mining Complex Data. Springer, 2009.
• [Luce(1959)] R Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. Wiley, 1959.
• [Negahban et al.(2012)Negahban, Oh, and Shah] Sahand Negahban, Sewoong Oh, and Devavrat Shah. Iterative ranking from pair-wise comparisons. In Advances in Neural Information Processing Systems, 2012.
• [Niranjan and Rajkumar(2017)] UN Niranjan and Arun Rajkumar. Inductive pairwise ranking: Going beyond the n log(n) barrier. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, 2017.
• [Rajkumar and Agarwal(2014)] Arun Rajkumar and Shivani Agarwal. A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In Proceedings of the 31st International Conference on Machine Learning, 2014.
• [Rajkumar and Agarwal(2016)] Arun Rajkumar and Shivani Agarwal. When can we rank well from comparisons of O(n log(n)) non-actively chosen pairs? In Proceedings of the 29th Conference on Learning Theory, 2016.
• [Saaty(2008)] Thomas L Saaty. Decision making with the analytic hierarchy process. International Journal of Services Sciences, 1(1), 2008.
• [Shah et al.(2016)Shah, Balakrishnan, Guntuboyina, and Wainwright] Nihar Shah, Sivaraman Balakrishnan, Aditya Guntuboyina, and Martin Wainwright. Stochastically transitive models for pairwise comparisons: Statistical and computational issues. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
• [Shah and Wainwright(2015)] Nihar B Shah and Martin J Wainwright. Simple, robust and optimal ranking from pairwise comparisons. arXiv preprint arXiv:1512.08949, 2015.
• [Thurstone(1927)] Louis L. Thurstone. A law of comparative judgment. Psychological Review, 34(4), 1927.
• [Wauthier et al.(2013)Wauthier, Jordan, and Jojic] Fabian Wauthier, Michael Jordan, and Nebojsa Jojic. Efficient ranking from pairwise comparisons. In Proceedings of the 30th International Conference on Machine Learning, 2013.
• [Yue et al.(2012)Yue, Broder, Kleinberg, and Joachims] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5), 2012.

## Appendix A Proof of Theorem 3

###### Proof.

Note that, by Theorem 3.1, one fails to recover the true if and only if the edge set of the bipartite graph fails to cover . Thus we have

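$$
\begin{aligned}
P(\theta \neq \hat{\theta}) &= P\big(\{A \text{ is not covered by } C_M\}\big) \\
&= P\big(\{\exists S' \subseteq A \text{ s.t. } |N_{C_M}(S')| < |S'|\}\big) \qquad \text{(by Theorem ???)}
\end{aligned}
$$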
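The equivalence used in the display above, between "A is covered by a matching" and "no subset of A has fewer neighbours than elements", is the classical marriage condition of [Hall(1935)]. The sketch below checks both sides on a toy bipartite graph. It is only an illustration: the node labels are hypothetical, the left side stands in for the set A, and the right side stands in for whatever objects the paper's bipartite graph connects A to (that construction is defined in the main text, not here). The brute-force search is exponential and meant for tiny examples only.

```python
from itertools import combinations


def max_matching_size(adj, left_nodes):
    """Size of a maximum bipartite matching via augmenting paths (Kuhn's algorithm).

    `adj` maps each left node to an iterable of its right-side neighbours.
    """
    match_of_right = {}

    def try_augment(u, visited):
        for v in adj.get(u, ()):
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current partner can be re-matched.
            if v not in match_of_right or try_augment(match_of_right[v], visited):
                match_of_right[v] = u
                return True
        return False

    return sum(try_augment(u, set()) for u in left_nodes)


def hall_violation(adj, left_nodes):
    """Return a subset S' of left nodes with |N(S')| < |S'|, or None if none exists."""
    for size in range(1, len(left_nodes) + 1):
        for subset in combinations(left_nodes, size):
            neighbours = set().union(*(adj.get(u, ()) for u in subset))
            if len(neighbours) < size:
                return subset
    return None


# Toy usage: "a" and "b" compete for the single right node "e1".
left = ["a", "b", "c"]
adj_bad = {"a": ["e1"], "b": ["e1"], "c": ["e2", "e3"]}
adj_good = {"a": ["e1"], "b": ["e1", "e2"], "c": ["e2", "e3"]}

covered_bad = max_matching_size(adj_bad, left) == len(left)
covered_good = max_matching_size(adj_good, left) == len(left)
witness_bad = hall_violation(adj_bad, left)
witness_good = hall_violation(adj_good, left)
```

In the first graph the left side cannot be covered and the search returns the violating subset `("a", "b")`. In the second graph a covering matching exists and no violating subset is found, as the equivalence predicts.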

We use to denote the set of neighbours of node in a graph and to denote the set of neighbours of node in including itself, i.e. . Define and . Thus we can associate every node in the independent set to a set of edges such that . Let us also denote and let . More generally we denote .

We will also find it convenient to define and , . Clearly when , say , . In general, for we have , where the size of the intersecting sets s depends on the specific structure of the graph (see Theorem 3 for graph specific analysis).

Now if we denote the event , , and recalling , we further get

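$$
\begin{aligned}
P(\theta \neq \hat{\theta}) &= P\big(\{\exists S' \subseteq A \text{ s.t. } |N_{C_M}(S')| < |S'|\}\big) \\
&= P\big(F_1 \cup F_2 \cup F_3 \cup \dots \cup F_{\alpha(G)}\big) \\
&= P\big(F_1 \cup (F_2 \cap F_1^c) \cup (F_3 \cap F_2^c) \cup \dots \cup (F_{\alpha(G)} \cap F_{\alpha(G)-1}^c)\big) \\
&= P(F_1) + P(F_2 \cap F_1^c) + \dots + P(F_{\alpha(G)} \cap F_{\alpha(G)-1}^c)
\end{aligned}
\tag{7}
$$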

Assuming the pairwise node preferences are drawn according to the edges sampled from an Erdős-Rényi random graph and applying Theorem 3.1 on the event , it is easy to see that

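$$
\begin{aligned}
P(F_1) &= P\big(\{\exists S' \subseteq A \text{ s.t. } |N_{C_M}(S')| < |S'| = 1\}\big) \\
&= P\big(\{\exists S' = \{k\},\ k \in [\alpha(G)] \text{ s.t. no edge from } M_k \text{ is sampled in } G(n,p)\}\big) \le \sum_{i=1}^{\alpha(G)} (1-p)^{n_i},
\end{aligned}
$$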

where the last inequality follows taking union bound over all singletons in . Note that one can further bound above as . In general, for any , one can similarly derive

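$$
\begin{aligned}
P(F_q \cap F_{q-1}^c) &= P\big(\{\exists S' \subseteq A,\ |S'| = q,\ S' \text{ is not covered by } C_M \text{ and } \forall S'_1 \subset A,\ |S'_1| < q,\ S'_1 \text{ is covered by } C_M\}\big) \\
&\le \sum_{I \subseteq \mathcal{I}(G) \,\mid\, |I| = q} \binom{d_I}{q-1}\, p^{q-1} (1-p)^{c_I-(q-1)},
\end{aligned}
$$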

where the last inequality follows from the crucial observation that for any , if is not covered by but all its subsets are, then must have sampled exactly edges from and none from . Substituting these bounds into (7), we finally get,

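$$
\begin{aligned}
P(\theta \neq \hat{\theta}) &\le P(F_1) + P(F_2 \cap F_1^c) + \dots + P(F_{\alpha(G)} \cap F_{\alpha(G)-1}^c) \\
&= \sum_{q=1}^{\alpha(G)} \ \sum_{I \subseteq \mathcal{I}(G) \,\mid\, |I| = q} \binom{d_I}{q-1}\, p^{q-1} (1-p)^{c_I-(q-1)},
\end{aligned}
$$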

where we assume , if . Further note that if , then for any such that , we have , using which we further get

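$$
P(\theta \neq \hat{\theta}) \le \sum_{q=1}^{\min\{\alpha(G),\, (d_{\max}(G)+1)\}} \ \sum_{I \subseteq \mathcal{I}(G) \,\mid\, |I| = q} \binom{d_I}{q-1}\, p^{q-1} (1-p)^{c_I-(q-1)}
$$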

Thus the claim follows. ∎
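To get a feel for the final bound of this proof, the sketch below evaluates the double sum numerically. It is a hypothetical helper, not part of the paper: the caller must supply, for every independent subset I that enters the sum, the triple (q, d_I, c_I) with q = |I|, because the definitions of d_I and c_I come from the main text and are not recomputed here. The convention that the binomial coefficient is zero when its lower index exceeds its upper index is handled by `math.comb`.

```python
from math import comb


def failure_probability_bound(p, terms):
    """Evaluate sum over I of C(d_I, q-1) * p^(q-1) * (1-p)^(c_I-(q-1)).

    `terms` is an iterable of (q, d_I, c_I) triples, one per subset I with
    |I| = q. `p` is the edge-sampling probability of G(n, p), with 0 <= p < 1.
    """
    total = 0.0
    for q, d_I, c_I in terms:
        # comb returns 0 when q - 1 > d_I, matching the stated convention.
        total += comb(d_I, q - 1) * p ** (q - 1) * (1 - p) ** (c_I - (q - 1))
    return total


# Toy usage: a single term with q = 1 and c_I = 10 reduces to (1 - p)^10.
single_term = failure_probability_bound(0.5, [(1, 0, 10)])

# Two singleton terms plus one pair term whose binomial coefficient vanishes.
mixed_terms = failure_probability_bound(0.5, [(1, 0, 4), (1, 0, 4), (2, 0, 6)])
```

The single-term case mirrors the shape of the complete-graph bound analysed in Appendix B, where the sum collapses to one power of (1 - p).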

## Appendix B Proof of Theorem 3

###### Proof.

We will now analyse Theorem 3 for certain specific classes of graphs. We use the same notation as in the proof of Theorem 3.

1. Fully Disconnected Graph: Note that in this case . Also note that . Thus . Moreover , , , , and if , .

Now applying Theorem 3 and noting , we further get that,

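$$
P(\theta \neq \hat{\theta}) \le \sum_{q=1}^{\min\{\alpha(G),\, (d_{\max}(G)+1)\}} \ \sum_{I \,\mid\, |I| = q} \binom{d_I}{q-1}\, p^{q-1} (1-p)^{c_I-(q-1)} = \sum_{i=1}^{n} (1-p)^{n-1} + \sum_{i} \cdots
$$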

solving which we get . Thus the expected number of edges (pairwise preferences) in the random graph required is at least , which recovers the result for the usual BTL model.

2. Complete Graph: In this case . Without loss of generality assuming , thus we have . Thus . Moreover , , .

Applying Theorem 3 as before and noting , we further get,

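$$
P(\theta \neq \hat{\theta}) \le \sum_{q=1}^{\min\{\alpha(G),\, (d_{\max}(G)+1)\}} \ \sum_{I \,\mid\, |I| = q} \binom{d_I}{q-1}\, p^{q-1} (1-p)^{c_I-(q-1)} = (1-p)^{\binom{n}{2}} \le \left(e^{-p}\right)^{\binom{n}{2}} \le \delta,
$$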

solving which one gets . Thus the expected number of edges (pairwise preferences) in the random graph required is at least , which is intuitive as well, since in a complete graph one needs the knowledge of only pairwise preferences to recover the exact ranking (i.e. ) with high probability .
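As a hedged illustration of the "solving" step in the complete-graph case, the algebra can be sketched as follows. It uses only the standard inequality $1 - p \le e^{-p}$ and the fact that $G(n,p)$ has $\binom{n}{2}$ candidate edges. The result is a sufficient condition on $p$ derived from the displayed bound, offered as a sketch rather than as a restatement of the paper's exact expression.

$$
(1-p)^{\binom{n}{2}} \le e^{-p\binom{n}{2}} \le \delta
\quad \text{whenever} \quad
p \ge \frac{\ln(1/\delta)}{\binom{n}{2}},
\qquad \text{in which case} \qquad
p\binom{n}{2} \ge \ln(1/\delta).
$$

Here $p\binom{n}{2}$ is the expected number of sampled edges, so under this sketch the requirement depends on the confidence level $\delta$ alone and not on the number of items $n$.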

3. -Disconnected Cliques: Say has exactly disconnected cliques, , each with edges (i.e. for each ), assuming . Thus in this case . Without loss of generality assume . Then , we have . Thus . Moreover , , , and , .

Then applying Theorem 3 as above and noting , we further get,

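$$
P(\theta \neq \hat{\theta}) \le \sum_{q=1}^{\min\{\alpha(G),\, (d_{\max}(G)+1)\}} \ \sum_{I \,\mid\, |I| = q} \binom{d_I}{q-1}\, p^{q-1} (1-p)^{c_I-(q-1)} = \sum_{i=1}^{r} (1-p)^{\binom{d}{2}+r-1} + \sum_{i} \cdots
$$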

solving which one can derive . Thus the expected number of edges (pairwise preferences) in the random graph required is at least , where the last inequality follows assuming . Note that setting and , one can recover the earlier bounds we derived for disconnected and complete graphs respectively.

4. Star: Note that in this case the size of the maximal independent set . Without loss of generality assume . Thus we have that for any , . Thus . Moreover , , and