We consider the problem of learning the qualities of a collection of items by performing noisy comparisons among them. Following the standard paradigm, we assume there is a fixed “comparison graph” and every neighboring pair of items in this graph is compared $k$ times according to the Bradley-Terry-Luce model (where the probability that an item wins a comparison is proportional to the item quality). We are interested in how the relative error in quality estimation scales with the comparison graph in the regime where $k$ is large. We prove that, after a known transition period, the relevant graph-theoretic quantity is the square root of the resistance of the comparison graph. Specifically, we provide an algorithm that is minimax optimal: its relative error decay scales with the square root of the graph resistance, and we provide a matching lower bound (up to log factors). The performance guarantee of our algorithm, both in terms of the graph and the skewness of the item quality distribution, outperforms earlier results.
Graph Resistance and Learning from Pairwise Comparisons
Julien M. Hendrickx Alex Olshevsky Venkatesh Saligrama
Proceedings of the International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
Department of Mathematical Engineering, ICTEAM, UCLouvain, Belgium
Department of Electrical and Computer Engineering, Boston University, USA
This paper considers quality estimation from pairwise comparisons, which is a common method of preference elicitation from users. For example, the preference of a customer for one product over another can be thought of as the outcome of a comparison. Because customers are idiosyncratic, such outcomes will be noisy functions of the quality of the underlying items. A similar problem arises in crowdsourcing systems, which must strive for accurate inference even in the presence of unreliable or error-prone participants. Because crowdsourced tasks pay relatively little, errors are common; even among workers making a genuine effort, inherent ambiguity in the task might lead to some randomness in the outcome. These considerations make the underlying estimation algorithm an important part of any crowdsourcing scheme.
Our goal is accurate inference of true item quality from a collection of outcomes of noisy comparisons. We will use one of the simplest parametric models for the outcome of comparisons, the Bradley-Terry-Luce (BTL) model, which associates a real-valued quality measure to each item and posits that customers select an item with a probability that is proportional to its quality. Given a “comparison graph” which captures which pairs of items are to be compared, our goal is to understand how accuracy scales in terms of this graph when participants make choices according to the BTL model.
We focus on the regime where we perform many comparisons of each pair of items in the graph. In this regime, we are able to give a satisfactory answer to the underlying question. Informally, we prove that, up to various constants and logarithms, the relative estimation error will scale with the square root of measures of resistance in the underlying graph. Specifically, we propose an algorithm whose performance scales with graph resistance, as well as a matching lower bound. The difference between our upper and lower bounds depends only on the log of the confidence level and on the skewness of the item qualities. Additionally, we note that our performance guarantees scale better in terms of item skewness as compared to previous work.
We are given an undirected “comparison graph” $G = (V, E)$, where each node $i$ has a positive weight $w_i$. If $(i,j) \in E$, then we perform $k$ comparisons between $i$ and $j$. The outcomes of these comparisons are i.i.d. Bernoulli and the probability that $i$ wins a given comparison according to the BTL model is
$$p_{ij} = \frac{w_i}{w_i + w_j}. \tag{1}$$
The goal is to recover the weights $w_i$ from the outcomes of these comparisons. Because multiplying all $w_i$ by the same constant does not affect the distribution of outcomes, we will recover a scaled version of the weight vector $w$.
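As a minimal sketch of the comparison model just described, the snippet below evaluates the BTL win probability and simulates repeated comparisons on one edge. The function names `btl_win_prob` and `simulate_edge`, the seed, and the example weights are illustrative assumptions, not part of the paper.

```python
import random

def btl_win_prob(w_i, w_j):
    """Probability that item i beats item j under the BTL model."""
    return w_i / (w_i + w_j)

def simulate_edge(w_i, w_j, k, rng):
    """Fraction of k i.i.d. comparisons won by item i (hypothetical helper)."""
    p = btl_win_prob(w_i, w_j)
    wins = sum(1 for _ in range(k) if rng.random() < p)
    return wins / k

rng = random.Random(0)  # fixed seed, for reproducibility only
print(btl_win_prob(2.0, 1.0))    # item with twice the quality
print(btl_win_prob(20.0, 10.0))  # rescaled weights give the same probability
print(simulate_edge(2.0, 1.0, 10_000, rng))
```

The second call illustrates why only a scaled version of the weight vector can be recovered: rescaling all weights leaves every win probability unchanged.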
Our goal is thus to come up with a vector of estimated weights $\hat W$ close, in a scale-invariant sense, to the true but unknown vector $w$. (Footnote 1: We follow the usual convention of denoting random variables by capital letters, which is why $\hat W$ is capitalized while $w$ is not.) A natural error measure turns out to be the absolute value of the sine of the angle defined by $w$ and $\hat W$, which can also be expressed as (see Lemma A.1 in the Supplementary Information)
$$|\sin(\hat W, w)| = \inf_{\alpha \neq 0} \frac{\|\alpha w - \hat W\|_2}{\|\alpha w\|_2}. \tag{2}$$
In other words, $|\sin(\hat W, w)|$ is the relative error to the closest normalization of the true quality vector $w$. We will also discuss the connection between this error measure and others later on in the paper.
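The sine error measure can be computed directly from its geometric meaning. The sketch below (the name `sine_error` is an assumption for illustration) takes the distance from one vector to the line spanned by the other and divides by the norm of the first.

```python
import math

def sine_error(x, y):
    """|sin| of the angle between x and y: distance from y to span(x), over ||y||."""
    xx = sum(a * a for a in x)
    xy = sum(a * b for a, b in zip(x, y))
    coef = xy / xx  # coefficient of the orthogonal projection of y onto x
    residual = [b - coef * a for a, b in zip(x, y)]
    norm_res = math.sqrt(sum(r * r for r in residual))
    norm_y = math.sqrt(sum(b * b for b in y))
    return norm_res / norm_y

print(sine_error([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # rescaled copy of the same vector
print(sine_error([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

The measure is zero for any rescaling of the true weights, which is the scale invariance the estimation problem requires.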
Following earlier literature, we assume that
$$\max_{i,j} \frac{w_i}{w_j} \le b$$
for some constant $b$. The number $b$ can be thought of as a measure of the skewness of the underlying item quality. Our goal is to understand how the error between $\hat W$ and $w$ scales as a function of the comparison graph $G$.
The dominant approach to recommendation systems relies on inferring item quality from raw scores provided by users (see (Jannach et al., 2016)). However, such scores might be poorly calibrated and inconsistent; alternative approaches that offer simpler choices might perform better.
Our starting point is the Bradley-Terry-Luce (BTL) model of Eq. (1), dating back to (Bradley & Terry, 1952; Luce, 2012), which models how individuals make noisy choices between items. A number of other models in the literature have also been used as the basis of inference; we mention the Mallows model introduced in (Mallows, 1957) and the PL and Thurstone models (see description in (Hajek et al., 2014)). However, we focus here solely on the BTL model.
Our work is most closely related to the papers (Negahban et al., 2012) and (Negahban et al., 2016). These works proposed an eigenvector calculation which, provided the number of comparisons is sufficiently large, successfully recovers the true weights from the outcomes of noisy comparisons. The main result of (Negahban et al., 2016) stated that, given a comparison graph, if the number of comparisons per edge satisfied a certain lower bound, then it is possible to construct an estimate satisfying
with high probability, where are, respectively, the smallest and largest degrees in the comparison graph, is the spectral gap of a certain normalized Laplacian of the comparison graph, and both are normalized so that their entries sum to 1. It can be proved (see Lemma A.4) that the relative error on the left-hand side of Eq. (3) is within a factor of the measure provided that , so asymptotically these two measures differ only by a factor depending on the skewness .
The problem of recovering was further studied in (Rajkumar & Agarwal, 2014), where the comparison graph was taken to be a complete graph but with comparisons on edges made at non-uniform rates. The sample complexity of recovering the true weights was provided as a function of the smallest sampling rate over pairs of items.
A somewhat more general setting was considered in (Shah et al., 2016), which considered a wider class of noisy comparison models which include the BTL model as a special case. Upper and lower bounds on the minimax optimal rates in estimation, depending on the eigenvalues of a corresponding Laplacian, were obtained for absolute error in several different metrics; in one of these metrics, the Laplacian semi-metric, the upper and lower bounds were tight up to constant factors. Similarly to (Shah et al., 2016), our goal is to understand the dependence on the underlying graph, albeit in the simpler setting of the BTL model.
Our approach to the problem very closely parallels the approach of (Jiang et al., 2011), where a collection of potentially inconsistent rankings is optimally reconciled by solving an optimization problem over the comparison graph. However, whereas (Jiang et al., 2011) solves a linear programming problem, we will use a linear least squares approach, after a certain logarithmic change of variable.
We now move on to discuss work more distantly related to the present paper. We mention that the problem we study here is related, but not identical, to the so-called noisy sorting problem, introduced in (Braverman & Mossel, 2009), where better items win with probability at least for some positive . This assumption does not hold for the BTL model with arbitrary weights. Noisy sorting was also studied in the more general setting of ranking models satisfying a transitivity condition in (Shah et al., 2017) and (Pananjady et al., 2017), where near-optimal minimax rates were derived. Finally, optimal minimax rates for noisy sorting were recently demonstrated in (Mao et al., 2017).
There are a number of variations of this problem that have been studied in the literature which we do not survey at length due to space constraints. For example, the papers (Yue et al., 2012; Szörényi et al., 2015) considered the online version of this problem with corresponding regret, (Chen & Suh, 2015) considered recovering the top ranked items, (Falahatgar et al., 2017; Agarwal et al., 2017; Maystre & Grossglauser, 2015) consider recovering a ranked list of the items, and (Ajtai et al., 2016) consider a model where comparisons are not noisy if the item qualities are sufficiently far apart. We refer the reader to the references within those papers for more details on related works in these directions.
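We will construct our estimate $\hat W$ by solving a log-least-squares problem described next. We denote by $F_{ij}$ the fraction of times node $i$ wins the comparison against its neighbor $j$, and we further set $R_{ij} = F_{ij}/F_{ji}$. As the number of comparisons on each edge goes to infinity, we will have that $R_{ij}$ approaches $w_i/w_j$ with probability one. Our method consists in finding $\hat W$ as follows:
$$\hat W = \arg\min_{v > 0} \sum_{(i,j) \in E} \left( \log\frac{v_i}{v_j} - \log R_{ij} \right)^2. \tag{4}$$
This can be done efficiently by observing that it amounts to solving the linear system of equations
$$\log \hat W_i - \log \hat W_j = \log R_{ij} \quad \text{for all } (i,j) \in E$$
in the least square sense. Let $B$ be the incidence matrix of the comparison graph. (Footnote 2: Given a directed graph with $n$ nodes and $|E|$ edges, the incidence matrix is the $n \times |E|$ matrix whose $\ell$’th column has a $1$ corresponding to the source of edge $\ell$, a $-1$ corresponding to the destination of edge $\ell$, and zeros elsewhere. For an undirected graph, an incidence matrix is obtained by first orienting the edges arbitrarily.) Stacking up the $R_{ij}$ into a vector $R$, and taking logarithms entrywise, we can then write
$$B^T \log \hat W = \log R.$$
Least-square solutions satisfy
$$B B^T \log \hat W = B \log R,$$
or equivalently $L \log \hat W = B \log R$, where $L = B B^T$ is the graph Laplacian. Finally, a solution is given by
$$\log \hat W = L^{\dagger} B \log R, \tag{5}$$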
where $L^{\dagger}$ is the Moore-Penrose pseudoinverse. By using the classic results of (Spielman & Teng, 2014), Eq. (5) can be solved for $\log \hat W$ to accuracy $\epsilon$ in nearly linear time in terms of the size of the input, specifically in $O(|E| \log^{c} n \log(1/\epsilon))$ iterations for some constant $c$. We note that, for connected graphs, all solutions of (4) are equal up to a multiplicative constant and are thus equivalent in terms of criterion (2).
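The log-least-squares step can be illustrated with a small dense solver. This is a sketch under stated assumptions: it fixes the log-weight of node 0 to zero and solves the reduced normal equations by Gauss-Jordan elimination, rather than using the pseudoinverse or the nearly-linear-time solver cited above. For a connected graph this yields the same estimate up to a multiplicative constant. The function name `log_least_squares` and its argument layout are hypothetical.

```python
import math

def log_least_squares(n, edges, ratios):
    """Estimate weights from edge ratios; ratios[e] estimates w_i / w_j for edges[e] = (i, j).

    Returns a weight vector normalized so that the weight of node 0 equals 1.
    """
    # Normal equations: Laplacian times log-weights = incidence matrix times log-ratios.
    lap = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for (i, j), ratio in zip(edges, ratios):
        lap[i][i] += 1.0
        lap[j][j] += 1.0
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
        rhs[i] += math.log(ratio)  # fails if a ratio is 0 or infinite
        rhs[j] -= math.log(ratio)
    # Ground node 0: drop its row and column, keep an augmented matrix.
    m = n - 1
    aug = [[lap[a + 1][b + 1] for b in range(m)] + [rhs[a + 1]] for a in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda row: abs(aug[row][col]))
        if abs(aug[piv][col]) < 1e-12:
            raise ValueError("comparison graph must be connected")
        aug[col], aug[piv] = aug[piv], aug[col]
        for row in range(m):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                for c in range(col, m + 1):
                    aug[row][c] -= factor * aug[col][c]
    log_w = [0.0] + [aug[a][m] / aug[a][a] for a in range(m)]
    return [math.exp(z) for z in log_w]

# Noiseless ratios on a triangle with true weights (1, 2, 4).
print(log_least_squares(3, [(0, 1), (1, 2), (2, 0)], [0.5, 0.5, 4.0]))
```

With exact ratios the true weights are recovered up to scale; with empirical ratios the same call returns the least-squares fit.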
We will find it useful to view the graph as a circuit with a unit resistor on each edge; $\Omega_{ij}$ will denote the resistance between nodes $i$ and $j$ in this circuit, $\Omega_{\max}$ denotes the largest of these resistances over all pairs of nodes, and similarly $\Omega_{\rm avg}$ denotes the average resistance over all pairs. We will use $E_{ij}$ to denote the set of edges lying on at least one simple path starting at $i$ and terminating at $j$, with $E_{\max}$ denoting the largest of the $|E_{ij}|$. Naturally, $E_{\max}$ is upper bounded by the total number of edges in the comparison graph. The performance of our algorithms is described by the following theorem.
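To make the circuit view concrete, the sketch below computes the resistance between two nodes by grounding one of them, injecting a unit current at the other, and reading off the resulting potential. It uses a dense Gauss-Jordan solve and is meant only for small graphs; the names `effective_resistance` and `resistance_stats` are assumptions for illustration.

```python
def effective_resistance(n, edges, s, t):
    """Resistance between nodes s and t with a unit resistor on every edge."""
    if s == t:
        return 0.0
    keep = [v for v in range(n) if v != t]      # node t is grounded
    pos = {v: a for a, v in enumerate(keep)}
    m = n - 1
    aug = [[0.0] * (m + 1) for _ in range(m)]   # reduced Laplacian plus right-hand side
    for i, j in edges:
        for u, v in ((i, j), (j, i)):
            if u != t:
                aug[pos[u]][pos[u]] += 1.0
                if v != t:
                    aug[pos[u]][pos[v]] -= 1.0
    aug[pos[s]][m] = 1.0                        # unit current injected at s
    for col in range(m):
        piv = max(range(col, m), key=lambda row: abs(aug[row][col]))
        if abs(aug[piv][col]) < 1e-12:
            raise ValueError("graph must be connected")
        aug[col], aug[piv] = aug[piv], aug[col]
        for row in range(m):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                for c in range(col, m + 1):
                    aug[row][c] -= factor * aug[col][c]
    return aug[pos[s]][m] / aug[pos[s]][pos[s]]  # potential at s

def resistance_stats(n, edges):
    """Largest and average resistance over all unordered pairs of nodes."""
    values = [effective_resistance(n, edges, i, j)
              for i in range(n) for j in range(i + 1, n)]
    return max(values), sum(values) / len(values)

print(effective_resistance(3, [(0, 1), (1, 2)], 0, 2))  # two unit resistors in series
print(resistance_stats(3, [(0, 1), (1, 2), (2, 0)]))    # triangle
```

On the three-node path the end-to-end resistance is 2 (series resistors), while on the triangle every pair has resistance 2/3, which shows how extra edges lower the resistance.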
Let . There exist absolute constants such that, if and and , then we have, with probability at least , that
The main feature of this theorem is the favorable form of the bound in the setting when is large. Then only the leading term
dominates the expression on the right-hand-side. Taking square roots, it follows that, asymptotically,
where the notation hides a logarithmic factor in .
Our other main result is that, in the regime when is large, there is very little room for improvement.
For any comparison graph , and for any algorithm, as long as for some absolute constant , we have that
where as before is the graph Laplacian.
Comparing Theorem 1 with Theorem 2, we see that the performance bounds of Theorem 1 are minimax optimal, at least up to the logarithmic factor in the confidence level and dependence on the skewness factor . We can thus conclude that the square root of the graph resistance is the key graph-theoretic property which captures how relative error decays for learning from pairwise comparisons. This observation is the main contribution of this paper.
Table 1 quantifies how much the bound of Theorem 1 expressed in terms of improves the asymptotic decay rate on various graphs over the bound of (Negahban et al., 2016). The notation ignores log-factors. Both random graphs are taken at a constant multiple of the threshold which guarantees connectivity; for Erdos-Renyi this means and for a geometric random graph, this means connecting nodes at random positions in the unit square when they are apart.
|Graph||Eq. (3)||Theorem 1|
|2 stars joined at centers|
|Geo. random graph|
Most of the scalings for eigenvalues of normalized Laplacians used in Table 1 are either known or easy to derive. For an analysis of the eigenvalue of the barbell graph (Footnote 3: Following (Wilf, 1989), the barbell graph refers to two complete graphs on vertices connected by a line of vertices.), we refer the reader to (Landau & Odlyzko, 1981); for mixing times on the geometric random graph, we refer the reader to (Avin & Ercal, 2007); for the resistance of an Erdos-Renyi graph, we refer the reader to (Sylvester, 2016).
In terms of worst-case performance as a function of the number of nodes, our bound grows at worst as using the observation that . By contrast, for the barbell graph, the bound of (Negahban et al., 2016) grows as , and it is not hard to see this is actually the worst-case scaling in terms of the number of nodes.
Finally, we note that these comparisons use slightly different error measures: on our end vs the relative error in the -norm after have been normalized to sum to one, used by (Negahban et al., 2016). To compare both in terms of the latter, we could multiply our bounds by (see Lemma A.4).
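As mentioned earlier, we let $F_{ij}$ be the empirical rate of success of item $i$ in the $k$ comparisons between $i$ and $j$; thus $F_{ij} + F_{ji} = 1$, so that the previously introduced $R_{ij}$ can be expressed as $R_{ij} = F_{ij}/(1 - F_{ij})$. We also let $\rho_{ij} = w_i/w_j$, to which $R_{ij}$ should converge asymptotically.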
We will make a habit of stacking any of the quantities defined into vectors; thus , for example, denotes the vector in which stacks up the quantities with the choice of and consistent with the orientation in the incidence matrix . The vectors and are defined likewise.
We begin the proof with a sequence of lemmas which work their way to the main theorem. The first step is to introduce some notation for the comparison on the edge .
Let be the outcome of a single coin toss comparing coins and . Using the standard formula for the variance of a Bernoulli random variable, we obtain
where we have defined . Observe that is always upper bounded by , where we remind .
We first argue that all are reasonably close to their expected values. For the sake of concision, we state the following assumptions about the constants, , and the quantity . Note that some of the intermediate results hold under weaker assumptions, but we omit these details for the sake of simplicity.
We have that , , and .
The following lemma is a standard application of Chernoff’s inequality. For completeness, a proof is included in the Supplementary Information.
There exist absolute constants such that, under Assumption 1, we have
The next lemma provides a convenient expression for the quantity in terms of the “measurement errors” . Note that the normalization assumption is not a loss of generality since is defined up to a multiplicative constant, and is directly satisfied if is obtained from (5).
Suppose is normalized so that . There exist absolute constants such that, under Assumption 1, there holds with probability
where is a diagonal matrix whose entries are the , for all edges .
which we can write as It follows that
since is assumed normalized so that . Combining this with Eq. (5), we obtain
We thus turn our attention to analyzing the vector . Our analysis will be conditioning on the event that for all
which, by Lemma 1, holds with probability at least . We will call this event .
We begin with one implication that comes from putting together event and our assumption (in Assumption 1) for a constant that we can choose: that we can assume that
Indeed, from Eq. (10) for this last equation to hold it suffices to have Observing that
we see that assuming is sufficient for Eq. (11) to hold conditional on event .
Our analysis of begins with the observation that since
we have that
Next we use Taylor’s expansion of the function , for which we have
to obtain that can thus be expressed as
where lies between and (and lies thus between and ). We can rewrite this equality in a condensed form
where corresponds to the second terms in (12), which we will now bound. Because we have conditioned on event , which, as discussed above implies , we actually have that and that lying between and belongs to . Hence
The following lemma bounds how much the ratios of our estimates differ from the corresponding ratios of the true weights . To state it, we will use the notation
where is the standard notation for the ’th basis vector. Furthermore, we define the product
Observe that the matrix is positive semidefinite, which implies by standard arguments that
holds for all vectors .
Suppose is normalized so that . There exist absolute constants such that, under Assumption 1, with probability , we have that for all pairs ,
For any , we have that
where, recall, is the resistance between nodes and , and is the set of edges belonging to some simple path from to .
The result follows from circuit theory, and we sketch it out along with the relevant references. The key idea is that the vector has a simple electric interpretation. We have that and the ’th entry of is the current on edge when a unit of current is put into node and removed at node . For details, see the discussion in Section 4.1 of (Vishnoi, 2013).
This lemma follows from several consequences of this interpretation. First, the entries of are an acyclic flow from to ; this follows, for example, from Thomson’s principle which asserts that the current flow minimizes energy (see Theorem 4.8 of (Vishnoi, 2013)). Moreover, Thomson’s principle further asserts that . Finally, by the flow decomposition theorem (Theorem 3.5 in (Ahuja et al., 2017)), we can decompose this flow along simple paths from to ; this implies that .
With these facts in mind, we apply Cauchy-Schwarz to obtain
and then conclude the proof using Hölder’s inequality
There exist absolute constants such that, under Assumption 1, with probability , we have that for all pairs ,
We now turn to the first-term in Eq. (15), which is bounded in the next lemma.
There exist absolute constants such that, under Assumption 1, with probability we have that for all pairs ,
The random variable (where, recall, is the outcome of a single comparison between nodes and ) is zero-mean and supported on an interval of length , and consequently it is subgaussian with parameter (see Section 5.3 of (Lattimore & Szepesvári, 2018)). (Footnote 4: A random variable is said to be subgaussian with parameter if for all .) By standard properties of subgaussian random variables, it follows that is subgaussian with . It follows then from Theorem 2.1 of (Hsu et al., 2012) for subgaussian random variables applied to , that for any there is a probability at least that
where we have used , and . We now compute this trace.
where the second equality uses the well-known property of the Moore-Penrose pseudo-inverse: for any matrix (see Section 2.9 of (Drineas & Mahoney, 2018)); and the last equality uses a well-known relation between resistances and Laplacian pseudoinverses, see Chapter 4 of (Vishnoi, 2013). The result follows then from the application of (S2.Ex28) to . ∎
Having obtained the bounds in the preceding sequence of lemmas, we now return to Lemma 3 and “plug in” the results we have obtained. The result is the following lemma.
There exist absolute constants such that, under Assumption 1, with probability , we have that for all pairs ,
A particular implication is that . Applying the inequality to (20) leads then to
and now using , the proof follows by combining the last equation with Eq. (19). ∎
The next lemma demonstrates how to convert Lemma 6 into a bound on the relative error between and the true weight vector .
Suppose we have that
for all . Fix index . Then there hold
The purpose of this section is two-fold. First, we would like to demonstrate that simulations are consistent with Theorem 1; in particular, we would like to see error scalings that are consistent with the average resistance, rather than e.g., spectral gap. Second, we wish to observe that, although our results are asymptotic, in practice the scaling with resistance appears immediately, even for small . Since our main contribution is theoretical, and since we do not claim that our algorithm is better than available methods in practice, we do not perform a comparison to other methods in the literature. Additional details about our experiments are provided in the Supplementary Information.
We begin with Erdos-Renyi comparison graphs. Figure 1 shows the evolution of the error with the number $k$ of comparisons per edge. The error decreases as $1/\sqrt{k}$, as predicted. Moreover, this is already the case for small values of $k$.
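The decay of the error with the number of comparisons per edge can be explored with a toy simulation. The sketch below is not the paper's experimental setup: it assumes a path graph rather than an Erdos-Renyi graph, because on a tree the log-least-squares fit has zero residual and reduces to chaining the edge ratios, which keeps the example short. The weights, seed, and function names are illustrative assumptions.

```python
import math
import random

def sine_error(x, y):
    """|sin| of the angle between x and y."""
    coef = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    res = math.sqrt(sum((b - coef * a) ** 2 for a, b in zip(x, y)))
    return res / math.sqrt(sum(b * b for b in y))

def estimate_on_path(w, k, rng):
    """Simulate k BTL comparisons per edge of the path 0-1-...-(n-1), then estimate weights."""
    log_est = [0.0]  # log-weight of node 0 fixed to zero
    for i in range(len(w) - 1):
        p = w[i] / (w[i] + w[i + 1])
        wins = sum(1 for _ in range(k) if rng.random() < p)
        f = wins / k                     # empirical win rate of i against i+1
        log_r = math.log(f / (1.0 - f))  # log of the empirical ratio; fails if f is 0 or 1
        log_est.append(log_est[-1] - log_r)
    return [math.exp(z) for z in log_est]

w_true = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0]  # hypothetical item qualities
rng = random.Random(0)                   # fixed seed, for reproducibility only
for k in (100, 1600, 25600):
    est = estimate_on_path(w_true, k, rng)
    print(k, sine_error(est, w_true))
```

Each step multiplies the number of comparisons by 16, so an error decaying like the inverse square root of that number would shrink by roughly a factor of 4 per step, up to random fluctuations.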
Next we move to the influence of the graph properties. Figure 2 shows that the average error is asymptotically constant when grows while keeping the expected degree constant, and that it decreases as when the expected degree grows while keeping constant. This is consistent with our analysis in Table 1, and with the results of (Boumal & Cheng, 2014) showing that the average resistance of Erdos-Renyi graphs evolves as .
We next consider lattice graphs in Figure 3. For the 3D lattice, the error appears to converge to a constant when grows, which is consistent with our results since the average resistance of the 3D lattice is bounded independently of . The trend for the 2D lattice also appears consistent with a bound in predicted by our results, since the resistance on the 2D lattice evolves as .
Our main contribution has been to demonstrate, by a combination of upper and lower bounds, that the error in quality estimation from pairwise comparisons scales as the graph resistance. Our work motivates a number of open questions.
First, our upper and lower bounds are not tight with respect to the skewness measure . We conjecture that the scaling of for relative error is optimal, but neither upper nor lower bounds matching this quantity are currently known.
Second, it would be interesting to obtain a non-asymptotic version of the results presented here. Our simulations are consistent with the asymptotic scaling (ignoring the dependence on ) being effective immediately, but at the moment we can only prove this scaling governs the behavior as .
- Agarwal et al. (2017) Agarwal, A., Agarwal, S., Assadi, S., and Khanna, S. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pp. 39–75, 2017.
- Ahuja et al. (2017) Ahuja, R. K., Magnanti, T. L., and Orlin, J. B. Network Flows: Theory, Algorithms, and Applications. Pearson Education, 2017.
- Ajtai et al. (2016) Ajtai, M., Feldman, V., Hassidim, A., and Nelson, J. Sorting and selection with imprecise comparisons. ACM Transactions on Algorithms (TALG), 12(2):19, 2016.
- Avin & Ercal (2007) Avin, C. and Ercal, G. On the cover time and mixing time of random geometric graphs. Theoretical Computer Science, 380(1):2, 2007.
- Boumal & Cheng (2014) Boumal, N. and Cheng, X. Concentration of the Kirchhoff index for Erdős–Rényi graphs. Systems & Control Letters, 74:74–80, 2014.
- Bradley & Terry (1952) Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- Braverman & Mossel (2009) Braverman, M. and Mossel, E. Sorting from noisy information. arXiv preprint arXiv:0910.1191, 2009.
- Chen & Suh (2015) Chen, Y. and Suh, C. Spectral mle: Top-k rank aggregation from pairwise comparisons. In International Conference on Machine Learning, pp. 371–380, 2015.
- Drineas & Mahoney (2018) Drineas, P. and Mahoney, M. W. Lectures on Randomized Numerical Linear Algebra. The Mathematics of Data, 25:1, 2018.
- (10) Duchi, J. Assouad’s method. https://web.stanford.edu/class/stats311/Lectures/lec-04.pdf. Lecture notes.
- Falahatgar et al. (2017) Falahatgar, M., Orlitsky, A., Pichapati, V., and Suresh, A. T. Maximum selection and ranking under noisy comparisons. arXiv preprint arXiv:1705.05366, 2017.
- (12) Hajek, B. and Raginsky, M. Statistical learning theory. http://maxim.ece.illinois.edu/teaching/SLT/SLT.pdf. Book draft.
- Hajek et al. (2014) Hajek, B., Oh, S., and Xu, J. Minimax-optimal inference from partial rankings. In Advances in Neural Information Processing Systems, pp. 1475–1483, 2014.
- Hsu et al. (2012) Hsu, D., Kakade, S., and Zhang, T. A Tail Inequality for Quadratic Forms of subgaussian Random Vectors. Electronic Communications in Probability, 17, 2012.
- Jannach et al. (2016) Jannach, D., Resnick, P., Tuzhilin, A., and Zanker, M. Recommender systems – beyond matrix completion. Communications of the ACM, 59(11):94–102, 2016.
- Jiang et al. (2011) Jiang, X., Lim, L.-H., Yao, Y., and Ye, Y. Statistical ranking and combinatorial hodge theory. Mathematical Programming, 127(1):203–244, 2011.
- Landau & Odlyzko (1981) Landau, H. and Odlyzko, A. Bounds for eigenvalues of certain stochastic matrices. Linear algebra and its Applications, 38:5–15, 1981.
- Lattimore & Szepesvári (2018) Lattimore, T. and Szepesvári, C. Bandit Algorithms. http://downloads.tor-lattimore.com/banditbook/book.pdf, 2018. Book draft.
- Luce (2012) Luce, R. D. Individual choice behavior: A theoretical analysis. Courier Corporation, 2012.
- Mallows (1957) Mallows, C. L. Non-null ranking models. i. Biometrika, 44(1/2):114–130, 1957.
- Mao et al. (2017) Mao, C., Weed, J., and Rigollet, P. Minimax rates and efficient algorithms for noisy sorting. arXiv preprint arXiv:1710.10388, 2017.
- Maystre & Grossglauser (2015) Maystre, L. and Grossglauser, M. Just sort it! A simple and effective approach to active preference learning. arXiv preprint arXiv:1502.05556, 2015.
- Negahban et al. (2012) Negahban, S., Oh, S., and Shah, D. Iterative ranking from pair-wise comparisons. In Advances in neural information processing systems, pp. 2474–2482, 2012.
- Negahban et al. (2016) Negahban, S., Oh, S., and Shah, D. Rank centrality: Ranking from pairwise comparisons. Operations Research, 65(1):266–287, 2016.
- Pananjady et al. (2017) Pananjady, A., Mao, C., Muthukumar, V., Wainwright, M. J., and Courtade, T. A. Worst-case vs average-case design for estimation from fixed pairwise comparisons. arXiv preprint arXiv:1707.06217, 2017.
- Rajkumar & Agarwal (2014) Rajkumar, A. and Agarwal, S. A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In International Conference on Machine Learning, pp. 118–126, 2014.
- Shah et al. (2016) Shah, N. B., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K., and Wainwright, M. J. Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. The Journal of Machine Learning Research, 17(1):2049–2095, 2016.
- Shah et al. (2017) Shah, N. B., Balakrishnan, S., Guntuboyina, A., and Wainwright, M. J. Stochastically transitive models for pairwise comparisons: Statistical and computational issues. IEEE Transactions on Information Theory, 63(2):934–959, 2017.
- Spielman & Teng (2014) Spielman, D. A. and Teng, S.-H. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):835–885, 2014.
- Sylvester (2016) Sylvester, J. A. Random walk hitting times and effective resistance in sparsely connected erdos-renyi random graphs. arXiv preprint arXiv:1612.00731, 2016.
- Szörényi et al. (2015) Szörényi, B., Busa-Fekete, R., Paul, A., and Hüllermeier, E. Online rank elicitation for plackett-luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, pp. 604–612, 2015.
- Tao (2012) Tao, T. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.
- Vishnoi (2013) Vishnoi, N. Lx = b. Foundations and Trends in Theoretical Computer Science, 8, 2013.
- Wilf (1989) Wilf, H. S. The editor’s corner: the white screen problem. The American Mathematical Monthly, 96(8):704–707, 1989.
- Yue et al. (2012) Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
Our relative error criterion of differs somewhat from the criterion used in (Negahban et al., 2016), which was
where both and need to be normalized to sum to . To represent this compactly, we introduce the notation for positive vectors , defined as
so that the criterion of (Negahban et al., 2016) can be written simply as .
We will show that if and satisfy and , then the two relative error criteria are within a multiplicative factor of . Thus, ignoring factors depending on the skewness , we may pass from one to the other at will.
The proof will require a sequence of lemmas, which we present next. The first lemma provides some inequalities satisfied by the sine error measure.
Let and denote by the sine of the angle made by these vectors. Then we have that
Moreover, if the angle between and is less than (which always holds when and are nonnegative), we have that
Moreover, since the expressions remain valid if we permute and .
We begin with the first equality. Observe that is the distance between and its orthogonal projection on the 1-dimensional subspace spanned by ; by definition of sine, this is also , which implies the equality sought.
The second equality directly follows from the change of variable . Passing from to is necessary in case the optimal is 0, which happens when and are orthogonal.
Let now be the angle made by and . An analysis of the triangle defined by 0, and shows that