Adversarial Top-K Ranking


Changho Suh Vincent Y. F. Tan Renbo Zhao C. Suh is with the School of Electrical Engineering at Korea Advanced Institute of Science and Technology (email: chsuh@kaist.ac.kr).V. Y. F. Tan is with the Department of Electrical and Computer Engineering and the Department of Mathematics, National University of Singapore. (email: vtan@nus.edu.sg).R. Zhao is with the Department of Electrical and Computer Engineering, National University of Singapore. (email: elezren@nus.edu.sg).C. Suh is supported by a gift from Samsung. V. Y. F. Tan and R. Zhao gratefully acknowledge financial support from the National University of Singapore (NUS) under the NUS Young Investigator Award R-263-000-B37-133.
Abstract

We study the top-$K$ ranking problem where the goal is to recover the set of top-$K$ ranked items out of a large collection of $n$ items based on partially revealed preferences. We consider an adversarial crowdsourced setting where there are two population sets, and pairwise comparison samples drawn from one of the populations follow the standard Bradley-Terry-Luce model (i.e., the chance of item $i$ beating item $j$ is proportional to the relative score of item $i$ to item $j$), while in the other population, the corresponding chance is inversely proportional to the relative score. When the relative size of the two populations is known, we characterize the minimax limit on the sample size required (up to a constant) for reliably identifying the top-$K$ items, and demonstrate how it scales with the relative size. Moreover, by leveraging a tensor decomposition method for disambiguating mixture distributions, we extend our result to the more realistic scenario in which the relative population size is unknown, thus establishing an upper bound on the fundamental limit of the sample size for recovering the top-$K$ set.

Adversarial population, Bradley-Terry-Luce model, crowdsourcing, minimax optimality, sample complexity, top-$K$ ranking, tensor decompositions

I Introduction

Ranking is one of the fundamental problems that has proved crucial in a wide variety of contexts—social choice [1, 2], web search and information retrieval [3], recommendation systems [4], ranking individuals by group comparisons [5] and crowdsourcing [6], to name a few. Due to its wide applicability, a large volume of work on ranking has been done. The two main paradigms in the literature include spectral ranking algorithms [7, 3, 8] and maximum likelihood estimation (MLE) [9]. While these ranking schemes yield reasonably good estimates which are faithful globally w.r.t. the latent preferences (i.e., low $\ell_2$ loss), it is not necessarily guaranteed that this results in optimal ranking accuracy. Accurate ranking has more to do with how well the ordering of the estimates matches that of the true preferences (a discrete/combinatorial optimization problem), and less to do with how well we can estimate the true preferences (a continuous optimization problem).

In applications, a ranking algorithm that outputs a total ordering of all the items is not only overkill, but it also unnecessarily increases complexity. Often, we pay attention to only a few significant items. Thus, recent work such as that by Chen and Suh [10] studied the top-$K$ identification task. Here, one aims to recover only the correct set of top-$K$ ranked items. This work characterized the minimax limit on the sample size required (i.e., the sample complexity) for reliable top-$K$ ranking, assuming the Bradley-Terry-Luce (BTL) model [11, 12].

While this result is concerned with practical issues, there are still limitations when modeling other realistic scenarios. The BTL model considered in [10] assumes that the quality of pairwise comparison information which forms the basis of the model is the same across annotators. In reality (e.g., crowdsourced settings), however, the quality of the information can vary significantly across different annotators. For instance, there may be a non-negligible fraction of spammers who provide answers in an adversarial manner. In the context of adversarial web search [13], web contents can be maliciously manipulated by spammers for commercial, social, or political benefits. Alternatively, there may exist false information such as false voting in social networks and fake ratings in recommendation systems [14].

As an initial effort to address this challenge, we investigate a so-called adversarial BTL model, which postulates the existence of two sets of populations—the faithful and adversarial populations, which have proportions $\eta$ and $1 - \eta$, respectively. Specifically, we consider a BTL-based pairwise comparison model in which there exist latent variables indicating ground-truth preference scores of the $n$ items. In this model, it is assumed that comparison samples drawn from the faithful population follow the standard BTL model (the probability of item $i$ beating item $j$ is proportional to item $i$'s relative score to item $j$), and those of the adversarial population act in an “opposite” manner, i.e., the probability of $i$ beating $j$ is inversely proportional to the relative score. See Fig. 1.

I-A Main contributions

We seek to characterize the fundamental limits on the sample size required for top-$K$ ranking, and to develop computationally efficient ranking algorithms. There are two main contributions in this paper.

Building upon RankCentrality [7] and SpectralMLE [10], we develop a ranking algorithm and use it to characterize the minimax limit required for top-$K$ ranking, up to constant factors, for the $\eta$-known scenario. We also show the minimax optimality of our ranking scheme by proving, via information-theoretic methods, a converse or impossibility result that applies to any ranking algorithm. As a result, we find that the sample complexity is inversely proportional to $(2\eta - 1)^2$, which suggests that the less distinct the sizes of the two populations, the larger the sample complexity. We also demonstrate that our result recovers that of the $\eta = 1$ case in [10], so the work contained herein is a strict generalization of that in [10].

The second contribution is to establish an upper bound on the sample complexity for the more practically-relevant scenario where $\eta$ is unknown. A novel procedure based on the tensor decomposition approaches of Jain-Oh [15] and Anandkumar et al. [16] is proposed to first obtain an estimate $\hat{\eta}$ of the parameter that is in a small neighborhood of $\eta$, i.e., we seek to obtain an $\epsilon$-globally optimal solution. This is usually not guaranteed by traditional iterative methods such as Expectation Maximization [17]. Subsequently, the estimate $\hat{\eta}$ is then used in the ranking algorithm that assumes knowledge of $\eta$. We demonstrate that this algorithm leads to an order-wise worse sample complexity relative to the $\eta$-known case. Our theoretical analyses suggest that the degradation is unavoidable if we employ this natural two-step procedure.

I-B Related work

The most relevant related works are those by Chen and Suh [10], Negahban et al. [7], and Chen et al. [6]. Chen and Suh [10] focused on top-$K$ identification under the standard BTL model, and derived an $\ell_\infty$ error bound on preference scores which is intimately related to top-$K$ ranking accuracy. Negahban et al. [7] considered the same comparison model and derived an $\ell_2$ error bound. A key distinction in our work is that we consider a different measurement model in which there are two population sets, although the $\ell_2$ and $\ell_\infty$ norm error analyses in [10, 7] play crucial roles in determining the sample complexity.

The statistical model introduced by Chen et al. [6] attempts to represent crowdsourced settings and forms the basis of our adversarial comparison model. We note that no theoretical analysis of the sample complexity is available in [6] or other related works on crowdsourced rankings [18, 19, 20]. For example, Kim et al. [20] employed variational EM-based algorithms to estimate the latent scores; global optimality guarantees for such algorithms are difficult to establish. Jain and Oh [15] developed a tensor decomposition method [16] for learning the parameters of a mixture model [21, 22, 23] that includes our model as a special case. We specialize their model and relevant results to our setting for determining the accuracy of the estimated $\hat{\eta}$. This allows us to establish an upper bound on the sample complexity when $\eta$ is unknown.

Recently, Shah and Wainwright [24] showed that a simple counting method [25] achieves order-wise optimal sample complexity for top-$K$ ranking under a general comparison model which includes, as special cases, a variety of parametric ranking models including the one under consideration in this paper (the BTL model). However, the authors made assumptions on the statistics of the pairwise comparisons which are different from those in our model. Hence, their result is not directly applicable to our setting.

I-C Notations

We provide a brief summary of the notations used throughout the paper. Let $[n]$ represent $\{1, \ldots, n\}$. We denote by $\|x\|$, $\|x\|_1$, and $\|x\|_\infty$ the $\ell_2$ norm, $\ell_1$ norm, and $\ell_\infty$ norm of $x$, respectively. Additionally, for any two sequences $f(n)$ and $g(n)$, $f(n) \gtrsim g(n)$ or $f(n) = \Omega(g(n))$ mean that there exists a (universal) constant $c$ such that $f(n) \geq c\, g(n)$; $f(n) \lesssim g(n)$ or $f(n) = O(g(n))$ mean that there exists a constant $C$ such that $f(n) \leq C\, g(n)$; and $f(n) \asymp g(n)$ or $f(n) = \Theta(g(n))$ mean that there exist constants $c$ and $C$ such that $c\, g(n) \leq f(n) \leq C\, g(n)$. The notation $\mathrm{poly}(n)$ denotes a sequence in $O(n^C)$ for some constant $C > 0$.

II Problem Setup

We now describe the model which we will analyze subsequently. We assume that the observations used to learn the rankings are in the form of a limited number of pairwise comparisons over $n$ items. In an attempt to reflect the adversarial crowdsourced setting of our interest in which there are two population sets—the faithful and adversarial sets—we adopt a comparison model introduced by Chen et al. [6]. This is a generalization of the BTL model [11, 12]. We delve into the details of the components of the model.

Fig. 1: Adversarial top-$K$ ranking given samples $\{y_{ij}^{(\ell)}\}_{(i,j) \in \mathcal{E},\, \ell \in [L]}$, where $\mathcal{E}$ is the edge set of an Erdős–Rényi random graph.

Preference scores: As in the standard BTL model, this model postulates the existence of a ground-truth preference score vector $w := (w_1, \ldots, w_n)$. Each $w_i$ represents the underlying preference score of item $i$. Without loss of generality, we assume that the scores are in non-increasing order:

$w_1 \;\geq\; w_2 \;\geq\; \cdots \;\geq\; w_n \;>\; 0. \qquad (1)$

It is assumed that the dynamic range of the score vector is fixed irrespective of $n$:

$w_i \in [w_{\min}, w_{\max}], \quad \forall i \in [n], \qquad (2)$

for some positive constants $w_{\min}$ and $w_{\max}$. In fact, the case in which the ratio $w_{\max}/w_{\min}$ grows with $n$ can be readily translated into the above setting by first separating out those items with vanishing scores (e.g., via a simple voting method like Borda count [25, 26]).

Comparison graph: Let $\mathcal{G} = ([n], \mathcal{E})$ be the comparison graph such that items $i$ and $j$ are compared by an annotator if the node pair $(i, j)$ belongs to the edge set $\mathcal{E}$. We will assume throughout that the edge set is drawn in accordance to the Erdős-Rényi (ER) model $\mathcal{G} \sim \mathcal{G}(n, p)$. That is, a node pair $(i, j)$ appears independently of any other node pair with an observation probability $p$.

Pairwise comparisons: For each edge $(i, j) \in \mathcal{E}$, we observe $L$ comparisons between $i$ and $j$. Each outcome, indexed by $\ell$ and denoted by $y_{ij}^{(\ell)}$ (with $y_{ij}^{(\ell)} = 1$ indicating that item $i$ beats item $j$), is drawn from a mixture of Bernoulli distributions weighted by an unknown parameter $\eta \in [0, 1]$. The $\ell$-th observation of edge $(i, j)$ has distribution $\mathrm{Bern}\big(\frac{w_i}{w_i + w_j}\big)$ with probability $\eta$ and distribution $\mathrm{Bern}\big(\frac{w_j}{w_i + w_j}\big)$ with probability $1 - \eta$. Hence,

$y_{ij}^{(\ell)} \;\sim\; \begin{cases} \mathrm{Bern}\Big( \frac{w_i}{w_i + w_j} \Big) & \text{with probability } \eta, \\ \mathrm{Bern}\Big( \frac{w_j}{w_i + w_j} \Big) & \text{with probability } 1 - \eta. \end{cases} \qquad (3)$

See Fig. 1. When $\eta = \frac{1}{2}$, all the observations are fair coin tosses. In this case, no information can be gleaned about the rankings. Thus we exclude this degenerate setting from our study. The case of $\eta < \frac{1}{2}$ is equivalent to the “mirrored” case of $1 - \eta > \frac{1}{2}$ where we flip $0$'s to $1$'s and $1$'s to $0$'s. So without loss of generality, we assume that $\eta > \frac{1}{2}$. We allow $\eta$ to depend on $n$.
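To make the measurement model concrete, the following sketch simulates it end to end (a minimal illustration under our reading of (3), where $y_{ij}^{(\ell)} = 1$ encodes “item $i$ beats item $j$”; all function and variable names are our own):

```python
import numpy as np

def sample_comparisons(w, p, L, eta, rng):
    """Simulate the adversarial BTL model of Section II.

    w   : ground-truth scores, sorted as in (1)
    p   : Erdos-Renyi edge-appearance probability
    L   : number of comparisons per observed edge
    eta : proportion of the faithful population (eta > 1/2 w.l.o.g.)
    """
    n = len(w)
    outcomes = {}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:                      # edge of G(n, p)
                q_faithful = w[i] / (w[i] + w[j])     # standard BTL chance
                # each of the L annotations is faithful w.p. eta,
                # adversarial (mirrored chance) w.p. 1 - eta
                faithful = rng.random(L) < eta
                q = np.where(faithful, q_faithful, 1.0 - q_faithful)
                outcomes[(i, j)] = (rng.random(L) < q).astype(int)
    return outcomes

rng = np.random.default_rng(0)
w = np.sort(rng.uniform(1.0, 2.0, size=50))[::-1]     # scores in [w_min, w_max]
outcomes = sample_comparisons(w, p=0.5, L=20, eta=0.8, rng=rng)
```

Setting eta equal to 0.5 above produces fair coin flips regardless of the scores, matching the degenerate case excluded from our study.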

Conditioned on the graph $\mathcal{G}$, the $y_{ij}^{(\ell)}$'s are independent and identically distributed across all $(i,j)$'s and $\ell$'s, each according to the distribution in (3). The collection of sufficient statistics is

$y \;:=\; \{ y_{ij} \}_{(i,j) \in \mathcal{E}}, \quad \text{where} \quad y_{ij} \;:=\; \frac{1}{L} \sum_{\ell = 1}^{L} y_{ij}^{(\ell)}. \qquad (4)$

The per-edge number of samples $L$ is a measure of the quality of the measurements. We let $y$, $y_i := \{y_{ij} : (i,j) \in \mathcal{E}\}$, and $y_{ij}$ denote various statistics of the available data.

Performance metric: We are interested in recovering the top-$K$ ranked items in the collection of $n$ items from the data $y$. We denote the true set of top-$K$ ranked items by $\mathcal{S}_K$ which, by our ordering assumption (1), is the set $[K] := \{1, \ldots, K\}$. We would like to design a ranking scheme $\psi$ that maps from the available measurements $y$ to a set of $K$ indices. Given a ranking scheme $\psi$, the performance metric we consider is the probability of error

$P_e(\psi) \;:=\; \Pr\big( \psi(y) \neq [K] \big). \qquad (5)$

We consider the fundamental admissible region of $(p, L)$ pairs in which top-$K$ ranking is feasible for a given $\eta$, i.e., $P_e(\psi)$ can be arbitrarily small for large enough $n$. In particular, we are interested in the sample complexity

$S^* \;:=\; \min\Big\{ \tbinom{n}{2} p L \;:\; \min_{\psi}\, \max_{w \in \mathcal{W}_{\Delta_K}} P_e(\psi) \to 0 \Big\}, \qquad (6)$

where $\mathcal{W}_{\Delta_K} := \big\{ w : \frac{w_K - w_{K+1}}{w_{\max}} \geq \Delta_K \big\}$. Here we consider a minimax scenario in which, given a score estimator, nature can behave in an adversarial manner, and so she chooses the worst preference score vector that maximizes the probability of error under the constraint that the normalized score separation between the $K$-th and $(K+1)$-th items is at least $\Delta_K$. Note that $\binom{n}{2} p$ is the expected number of edges of the ER graph so $\binom{n}{2} p L$ is the expected number of pairwise samples drawn from the model of our interest.

III Main Results

As suggested in [10], a crucial parameter for successful top-$K$ ranking is the separation between the two items near the decision boundary,

$\Delta_K \;:=\; \frac{w_K - w_{K+1}}{w_{\max}}. \qquad (7)$

The sample complexity depends on $w_K$ and $w_{K+1}$ only through $\Delta_K$—more precisely, it decreases as $\Delta_K$ increases. Our contribution is to identify relationships between $(\Delta_K, \eta)$ and the sample complexity when $\eta$ is known and unknown. We will see that the sample complexity increases as $\Delta_K$ decreases. This is intuitively true as $\Delta_K$ captures how distinguishable the top-$K$ set is from the rest of the items.
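As a quick numerical illustration (hypothetical scores), the snippet below evaluates $\Delta_K$ as defined in (7) and the resulting $1/\Delta_K^2$ factor that, anticipating (11), multiplies the sample complexity:

```python
import numpy as np

def separation(w, K):
    """Normalized separation Delta_K = (w_K - w_{K+1}) / w_max from (7)."""
    w = np.sort(np.asarray(w, dtype=float))[::-1]
    return (w[K - 1] - w[K]) / w[0]

w = [2.0, 1.9, 1.5, 1.2, 1.1, 1.0]   # hypothetical preference scores
for K in (2, 3):
    d = separation(w, K)
    print(f"K={K}: Delta_K={d:.3f}, factor 1/Delta_K^2 = {1.0 / d**2:.0f}")
```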

We assume that the graph $\mathcal{G}$ is drawn from the ER model $\mathcal{G}(n, p)$ with edge appearance probability $p$. We require $p$ to satisfy

$p \;\gtrsim\; \frac{\log n}{n}. \qquad (8)$

From random graph theory, this implies that the graph is connected with high probability. If the graph were not connected, the rankings could not be inferred [9].

We start by considering the $\eta$-known scenario, in which the key ingredients of the ranking algorithm and analysis can be easily digested, and which forms the basis for the $\eta$-unknown setting.

Theorem 1 (Known $\eta$).

Suppose that $\eta$ is known and $\eta > \frac{1}{2}$. Also assume that $p$ satisfies (8) and $L \leq \mathrm{poly}(n)$. Then with probability $1 - o(1)$, the top-$K$ set can be identified exactly provided

$\binom{n}{2} p L \;\gtrsim\; \frac{n \log n}{(2\eta - 1)^2 \Delta_K^2}. \qquad (9)$

Conversely, for a fixed $\Delta_K$, if

$\binom{n}{2} p L \;\leq\; c_1\, \frac{n \log n}{(2\eta - 1)^2 \Delta_K^2} \qquad (10)$

holds, then for any top-$K$ ranking scheme $\psi$, there exists a preference score vector $w$ with separation $\Delta_K$ such that $P_e(\psi) \geq \frac{1}{2}$. Here, and in the following, $c_1, c_2, \ldots$ are finite universal constants.

Proof.

See Section IV for the algorithm and a sketch of the achievability proof (sufficiency). The proof of the converse (impossibility part) can be found in Section V. ∎

This theorem asserts that the sample complexity scales as

$S^* \;\asymp\; \frac{n \log n}{(2\eta - 1)^2 \Delta_K^2}. \qquad (11)$

This result recovers that for the faithful scenario in [10], where $\eta = 1$. When $\eta$ is uniformly bounded away from $\frac{1}{2}$, we achieve the same order-wise sample complexity. This suggests that the ranking performance is not substantially worsened if the sizes of the two populations are sufficiently distinct. For the challenging scenario in which $\eta \to \frac{1}{2}$, the sample complexity depends on how $2\eta - 1$ scales with $n$. Indeed, this dependence is quadratic. This theoretical result will be validated by experimental results in Section VII. Several other remarks are in order.

No computational barrier: Our proposed algorithm is based primarily upon two popular ranking algorithms: spectral methods and MLE, both of which enjoy nearly-linear time complexity in our ranking problem context. Hence, the information-theoretic limit promised by (11) can be achieved by a computationally efficient algorithm.

Implication of the minimax lower bound: The minimax lower bound continues to hold when $\eta$ is unknown: one can only do better in the $\eta$-known scenario, and hence the lower bound for the $\eta$-known case is also a lower bound for the $\eta$-unknown scenario.

Another adversarial scenario: Our results readily generalize to another adversarial scenario in which samples drawn from the adversarial population are completely noisy, i.e., they follow the distribution $\mathrm{Bern}(\frac{1}{2})$. With a slight modification of our proof techniques, one can easily verify that the sample complexity is on the order of $\frac{n \log n}{\eta^2 \Delta_K^2}$ if $\eta$ is known. This will be evident after we describe the algorithm in Section IV.

Theorem 2 (Unknown $\eta$).

Suppose that $\eta$ is unknown and $\eta > \frac{1}{2}$. Also assume that $p$ satisfies (8) and $L \leq \mathrm{poly}(n)$. Then with probability $1 - o(1)$, the top-$K$ set can be identified exactly provided

$\binom{n}{2} p L \;\gtrsim\; \frac{n \log n}{(2\eta - 1)^4 \Delta_K^2}. \qquad (12)$
Proof.

See Section VI for the key ideas in the proof. ∎

This theorem implies that the sample complexity satisfies

$S^* \;\lesssim\; \frac{n \log n}{(2\eta - 1)^4 \Delta_K^2}. \qquad (13)$

This bound is worse than (11)—the inverse dependence on $(2\eta - 1)^2$ is now an inverse dependence on $(2\eta - 1)^4$. This is because our algorithm involves estimating $\eta$, incurring some loss. Whether this loss is fundamentally unavoidable (i.e., whether the algorithm is order-wise optimal or not) is open. See detailed discussions in Section VIII. Moreover, since the estimation of $\eta$ is based on tensor decompositions with polynomial-time complexity, our algorithm for the $\eta$-unknown case is also, in principle, computationally efficient. Note that the minimax lower bound in (11) also serves as a lower bound in the $\eta$-unknown scenario.

IV Algorithm and Achievability Proof of Theorem 1

Fig. 2: Ranking algorithm for the $\eta$-known scenario: (1) shifting the empirical mean of pairwise measurements to get $\tilde{y}_{ij}$, which converges to $\frac{w_i}{w_i + w_j}$ as $L \to \infty$; (2) performing SpectralMLE [10] seeded by $\tilde{y}$ to obtain a score estimate $\hat{w}$; (3) returning a ranking based on the estimate $\hat{w}$. Our analysis reveals that the $\ell_\infty$ norm bound w.r.t. $\hat{w}$ satisfies (18), which in turn ensures perfect top-$K$ ranking under the sample size condition (9).

IV-A Algorithm Description

Inspired by the consistency between the preference scores and ranking under the BTL model, our scheme also adopts a two-step approach in which the score vector $w$ is first estimated and then the top-$K$ set is returned.

Recently, a top-$K$ ranking algorithm SpectralMLE [10] has been developed for the faithful scenario, and it is shown to have order-wise optimal sample complexity. The algorithm yields a small $\ell_\infty$ loss of the score vector estimate, which ensures a small point-wise estimation error. Establishing a key relationship between the $\ell_\infty$ norm error and top-$K$ ranking accuracy, Chen and Suh [10] then identify an order-wise tight bound on the $\ell_\infty$ norm error required for top-$K$ ranking, thereby characterizing the sample complexity. Our ranking algorithm builds on SpectralMLE, which proceeds in two stages: (1) an appropriate initialization that concentrates around the ground truth in an $\ell_2$ sense, which can be obtained via spectral methods [7, 3, 8]; (2) a sequence of iterative updates sharpening the estimates in a point-wise manner using MLE.

We observe that RankCentrality [7] can be employed as the spectral method in the first stage. In fact, RankCentrality exploits the fact that, in the faithful setting, the empirical mean $y_{ij}$ converges to the relative score $\frac{w_i}{w_i + w_j}$ as $L \to \infty$. This motivates the use of the empirical mean for constructing the transition probability from node $i$ to node $j$ of a Markov chain. Note that the detailed balance equations that hold as $L \to \infty$ will enforce that the stationary distribution of the Markov chain is identical to $w$ up to some constant scaling. Hence, the stationary distribution is expected to serve as a reasonably good global score estimate. However, in our problem setting where $\eta$ is not necessarily $1$, the empirical mean does not converge to the relative score; instead it behaves as

$y_{ij} \;\xrightarrow{L \to \infty}\; \eta\,\frac{w_i}{w_i + w_j} + (1 - \eta)\,\frac{w_j}{w_i + w_j} \;=\; (2\eta - 1)\,\frac{w_i}{w_i + w_j} + (1 - \eta). \qquad (14)$

Note, however, that the limit is linear in the desired relative score $\frac{w_i}{w_i + w_j}$ and $\eta$, implying that knowledge of $\eta$ leads to the relative score. A natural idea then arises. We construct a shifted version of the empirical mean:

$\tilde{y}_{ij} \;:=\; \frac{y_{ij} - (1 - \eta)}{2\eta - 1}, \qquad (15)$

and take this as an input to RankCentrality. This then forms a Markov chain that yields a stationary distribution that is proportional to $w$ as $L \to \infty$, and hence a good estimate of the ground-truth score vector when $L$ is large. This serves as a good initial estimate for the second stage of SpectralMLE as it guarantees a small point-wise error.
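A minimal sketch of this shifting step (15), reusing the outcomes dictionary from the simulation sketch in Section II (names are our own):

```python
def shifted_means(outcomes, eta):
    """De-bias the empirical means y_ij via (15), assuming eta > 1/2 is known.

    By (14), y_ij -> (2*eta - 1) * w_i/(w_i + w_j) + (1 - eta), so the
    shifted statistic converges to the relative score w_i/(w_i + w_j).
    """
    assert eta > 0.5
    return {e: (y.mean() - (1.0 - eta)) / (2.0 * eta - 1.0)
            for e, y in outcomes.items()}

shifted = shifted_means(outcomes, eta=0.8)
```

For finite $L$, the shifted statistic may fall slightly outside $[0, 1]$; any implementation should clip it before interpreting it as a probability, as done in the Rank Centrality sketch further below.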

A formal and more detailed description of the procedure is summarized in Algorithm 1. For completeness, we also include the procedure of RankCentrality in Algorithm 2. Here we emphasize two distinctions w.r.t. the second stage of SpectralMLE. First, the computation of the pointwise MLE w.r.t., say, item $i$, requires knowledge of $\eta$:

$\mathcal{L}\big(w_i = \tau,\, \hat{w}_{-i}^{(t)};\, y_i\big) \;=\; \prod_{j:(i,j) \in \mathcal{E}} \Big( \eta\,\tfrac{\tau}{\tau + \hat{w}_j^{(t)}} + (1-\eta)\,\tfrac{\hat{w}_j^{(t)}}{\tau + \hat{w}_j^{(t)}} \Big)^{L y_{ij}} \Big( \eta\,\tfrac{\hat{w}_j^{(t)}}{\tau + \hat{w}_j^{(t)}} + (1-\eta)\,\tfrac{\tau}{\tau + \hat{w}_j^{(t)}} \Big)^{L(1 - y_{ij})}. \qquad (16)$

Here, $\mathcal{L}(w_i = \tau, \hat{w}_{-i}^{(t)}; y_i)$ is the profile likelihood of the preference score vector, where $\hat{w}^{(t)}$ indicates the preference score estimate in the $t$-th iteration, $\hat{w}_{-i}^{(t)}$ denotes the score estimate excluding the $i$-th component, and $y_i$ is the data available at node $i$. The second difference is the use of a different threshold $\xi_t$ which incorporates the effect of $\eta$:

$\xi_t \;:=\; \frac{\delta_0\, w_{\max}}{2^{t+1}} \;+\; \frac{c_3}{2\eta - 1} \sqrt{\frac{\log n}{n p L}}\; w_{\max}, \qquad (17)$

where $c_3$ is a constant. This threshold is used to decide whether $\hat{w}_i^{(t+1)}$ should be set to be the pointwise MLE in (22) (if $|w_i^{\mathrm{MLE}} - \hat{w}_i^{(t)}| > \xi_t$) or remains as $\hat{w}_i^{(t)}$ (otherwise). The design of $\xi_t$ is based on (1) the $\ell_\infty$ loss incurred in the first stage; and (2) a desirable $\ell_\infty$ loss that we intend to achieve at the end of the second stage. Since these two values are different, $\xi_t$ needs to be adapted across iterations accordingly. Notice that the computation of $\xi_t$ requires knowledge of $\eta$. The two modifications in (16) and (17) result in a more complicated analysis vis-à-vis Chen and Suh [10].
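Before the formal listing, a grid-search stand-in for the pointwise MLE illustrates how $\eta$ enters the likelihood; this is only a sketch under our reconstruction of (16), not the actual optimization routine of [10]:

```python
import numpy as np

def pointwise_mle(i, w_hat, outcomes, eta, w_min, w_max, grid_size=200):
    """Coordinate-wise MLE for item i via a 1-D grid search over tau."""
    best_tau, best_ll = w_hat[i], -np.inf
    for tau in np.linspace(w_min, w_max, grid_size):
        ll = 0.0
        for (a, b), y in outcomes.items():
            if i not in (a, b):
                continue
            j = b if a == i else a
            # P(a beats b) when item i's score is tau, others fixed at w_hat
            q = tau / (tau + w_hat[j]) if a == i else w_hat[j] / (w_hat[j] + tau)
            q_mix = eta * q + (1.0 - eta) * (1.0 - q)   # eta-mixture as in (3)
            k = int(y.sum())                            # wins of a over b
            ll += k * np.log(q_mix) + (len(y) - k) * np.log(1.0 - q_mix)
        if ll > best_ll:
            best_tau, best_ll = tau, ll
    return best_tau
```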

Input: The average comparison outcome $y_{ij}$ for all $(i,j) \in \mathcal{E}$; the score range $[w_{\min}, w_{\max}]$.
Partition $\mathcal{E}$ randomly into two sets $\mathcal{E}_{\mathrm{init}}$ and $\mathcal{E}_{\mathrm{iter}}$ each containing $\frac{|\mathcal{E}|}{2}$ edges. Denote by $y^{\mathrm{init}}$ (resp. $y^{\mathrm{iter}}$) the components of $y$ obtained over $\mathcal{E}_{\mathrm{init}}$ (resp. $\mathcal{E}_{\mathrm{iter}}$).
Compute the shifted version of the average comparison output: $\tilde{y}_{ij} = \frac{y_{ij} - (1 - \eta)}{2\eta - 1}$. Denote by $\tilde{y}^{\mathrm{init}}$ the components of $\tilde{y}$ obtained over $\mathcal{E}_{\mathrm{init}}$.
Initialize $\hat{w}^{(0)}$ to be the estimate computed by Rank Centrality on $(\mathcal{E}_{\mathrm{init}}, \tilde{y}^{\mathrm{init}})$.
Successive Refinement: for $t = 0, 1, \ldots, T - 1$ do
1) Compute the coordinate-wise MLE $w_i^{\mathrm{MLE}} \in \arg\max_{\tau \in [w_{\min}, w_{\max}]} \mathcal{L}\big(w_i = \tau,\, \hat{w}_{-i}^{(t)};\, y_i^{\mathrm{iter}}\big)$ for each $i \in [n]$,
       where $\mathcal{L}$ is the likelihood function defined in (16).
2) For each $i \in [n]$, set $\hat{w}_i^{(t+1)} = w_i^{\mathrm{MLE}}$ if $|w_i^{\mathrm{MLE}} - \hat{w}_i^{(t)}| > \xi_t$, and $\hat{w}_i^{(t+1)} = \hat{w}_i^{(t)}$ otherwise,
where $\xi_t$ is the replacement threshold defined in (17).
Output the indices of the $K$ largest components of $\hat{w}^{(T)}$.
Algorithm 1 Adversarial top-$K$ ranking for the $\eta$-known scenario
Input: The shifted average comparison outcome $\tilde{y}_{ij}$ for all $(i,j) \in \mathcal{E}_{\mathrm{init}}$.
Compute the transition matrix $P = [P_{ij}]$ such that for $i \neq j$: $P_{ij} = \frac{1 - \tilde{y}_{ij}}{d_{\max}}$ and $P_{ji} = \frac{\tilde{y}_{ij}}{d_{\max}}$ for $(i,j) \in \mathcal{E}_{\mathrm{init}}$, and $P_{ii} = 1 - \sum_{j \neq i} P_{ij}$,
where $d_{\max}$ is the maximum out-degree of vertices in $\mathcal{G}$.
Output the stationary distribution $\hat{w}^{(0)}$ of $P$.
Algorithm 2 Rank Centrality [7]
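A compact sketch of Algorithm 2 follows (our own implementation choices: clipping the shifted statistics into $(0, 1)$ and extracting the stationary distribution by power iteration):

```python
import numpy as np

def rank_centrality(n, shifted, n_iter=500):
    """Rank Centrality [7] seeded with the shifted statistics (sketch).

    Transition i -> j is proportional to the estimated chance that j
    beats i; the stationary distribution then recovers w up to scaling.
    """
    deg = np.zeros(n, dtype=int)
    for (i, j) in shifted:
        deg[i] += 1
        deg[j] += 1
    d_max = max(1, deg.max())

    P = np.zeros((n, n))
    for (i, j), t in shifted.items():
        t = np.clip(t, 1e-6, 1.0 - 1e-6)   # finite-L noise can leave [0, 1]
        P[i, j] = (1.0 - t) / d_max        # t estimates P(i beats j)
        P[j, i] = t / d_max
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))

    pi = np.full(n, 1.0 / n)
    for _ in range(n_iter):                # power iteration: pi <- pi @ P
        pi = pi @ P
    return pi / pi.sum()

w0_hat = rank_centrality(len(w), shifted)  # initial global score estimate
top10 = np.argsort(w0_hat)[::-1][:10]      # e.g., indices of the top-10 items
```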

IV-B Achievability Proof of Theorem 1

Let $\hat{w}$ be the final estimate in the second stage. We carefully analyze the $\ell_\infty$ loss of this vector, showing that under the conditions in Theorem 1,

$\frac{\|\hat{w} - w\|_\infty}{w_{\max}} \;\lesssim\; \frac{1}{2\eta - 1} \sqrt{\frac{\log n}{n p L}} \qquad (18)$

holds with probability exceeding $1 - c_4 n^{-c_5}$. This bound together with the following observation completes the proof. Observe that if $\|\hat{w} - w\|_\infty < \frac{w_K - w_{K+1}}{2}$, then for a top-$K$ item $i$ and a non-top-$K$ item $j$,

$\hat{w}_i \;\geq\; w_i - \|\hat{w} - w\|_\infty \;\geq\; w_K - \|\hat{w} - w\|_\infty \qquad (19)$
$\;>\; w_{K+1} + \|\hat{w} - w\|_\infty \;\geq\; w_j + \|\hat{w} - w\|_\infty \;\geq\; \hat{w}_j. \qquad (20)$

This implies that our ranking algorithm outputs the top-$K$ ranked items as desired. Hence, as long as $\frac{\|\hat{w} - w\|_\infty}{w_{\max}} < \frac{\Delta_K}{2}$ holds (which, in view of (18), coincides with the claimed bound in Theorem 1), we can guarantee perfect top-$K$ ranking, which completes the proof of Theorem 1.

The remaining part is the proof of (18). The proof builds upon the analysis made in [10], which demonstrates the relationship between the accuracy of an initial estimate and that of the point-wise MLE in the faithful setting. We establish a new relationship for the arbitrary $\eta$ case, formally stated in the following lemma. We will then use this to prove (18).

Lemma 1.

Fix an item $i \in [n]$ and $\delta > 0$. Consider an estimate $\hat{w}_{-i}$ of the score vector, excluding the $i$-th component, such that it is independent of $y_i$ and satisfies

$|\hat{w}_j - w_j| \;\leq\; \delta\, w_{\max} \quad \text{for all } j \neq i. \qquad (21)$

Let

$w_i^{\mathrm{MLE}} \;:=\; \arg\max_{\tau \in [w_{\min}, w_{\max}]}\; \mathcal{L}\big(w_i = \tau,\, \hat{w}_{-i};\, y_i\big). \qquad (22)$

Then, the pointwise error

$\big| w_i^{\mathrm{MLE}} - w_i \big| \;\leq\; \frac{c_4}{2\eta - 1} \sqrt{\frac{\log n}{n p L}}\; w_{\max} \;+\; \frac{\delta}{2}\, w_{\max} \qquad (23)$

holds with probability at least $1 - c_5 n^{-c_6}$.

Proof.

The relationship in the faithful scenario $\eta = 1$, which was proved in [10], means that the point-wise MLE is close to the ground truth in a component-wise manner, once an initial estimate is accurate enough. Unlike the faithful scenario, in our setting, we have (in general) noisier measurements due to the effect of $\eta$. Nonetheless this lemma reveals that the relationship for the case of $\eta = 1$ is almost the same as that for an arbitrary $\eta$ case only with a slight modification. This implies that a small point-wise loss is still guaranteed as long as we start from a reasonably good estimate. Here the only difference in the relationship is that the multiplicative factor of $\frac{1}{2\eta - 1}$ additionally applies in the upper bound of (23). See Appendix A for the proof. ∎

Obviously, the accuracy of the point-wise MLE reflected in the error $|w_i^{\mathrm{MLE}} - w_i|$ depends crucially on the initial error $\delta$. In fact, Lemma 1 leads to the claimed bound (18) once the initial estimation error is properly chosen as follows:

$\delta \;\asymp\; \frac{1}{2\eta - 1} \sqrt{\frac{\log n}{n p L}}. \qquad (24)$

Here we demonstrate that the desired initial estimation error can indeed be achieved in our problem setting; this is formally stated in Lemma 2 (see below). On the other hand, adapting the analysis in [10], one can verify that with the replacement threshold $\xi_t$ defined in (17), the $\ell_\infty$ loss $\delta_t := \frac{\|\hat{w}^{(t)} - w\|_\infty}{w_{\max}}$ is monotonically decreasing in an order-wise sense, i.e.,

$\delta_{t+1} \;\leq\; \max\!\left\{ \frac{\delta_t}{2},\;\; \frac{c_7}{2\eta - 1}\sqrt{\frac{\log n}{n p L}} \right\}. \qquad (25)$

We are now ready to prove (18) when $\eta > \frac{1}{2}$ and

$n p L \;\gtrsim\; \frac{\log n}{(2\eta - 1)^2}. \qquad (26)$

Lemma 1 asserts that in this regime, the point-wise MLE is expected to satisfy

$\big| w_i^{\mathrm{MLE}} - w_i \big| \;\leq\; \frac{c_4}{2\eta - 1} \sqrt{\frac{\log n}{n p L}}\; w_{\max} \;+\; \frac{\delta_t}{2}\, w_{\max}. \qquad (27)$

Using the analysis in [10], one can show that the choice of $\xi_t$ in (17) enables us to detect outliers (where an estimation error is large) and drag down the corresponding point-wise error, thereby ensuring that the per-iteration $\ell_\infty$ loss contracts as in (25). This together with the fact that

$\delta_0 \;\lesssim\; \frac{1}{2\eta - 1} \sqrt{\frac{\log n}{n p L}} \qquad (28)$

(see (26) above and Lemma 2) gives

$\delta_{t+1} \;\leq\; \frac{\delta_t}{2} \;+\; \frac{c_7}{2\eta - 1} \sqrt{\frac{\log n}{n p L}}. \qquad (29)$

A straightforward computation with this recursion yields (18) if $\delta_0$ is sufficiently small (e.g., as guaranteed by (28)) and $T$, the number of iterations in the second stage of SpectralMLE, is sufficiently large (e.g., $T \gtrsim \log n$).
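To make the “straightforward computation” explicit, one may unroll the recursion (29) (a sketch under our reconstructed form; the contraction factor $\frac{1}{2}$ is illustrative):

```latex
% Let \beta := \frac{c_7}{2\eta - 1}\sqrt{\frac{\log n}{npL}}. Unrolling (29):
\delta_T \;\le\; 2^{-T}\,\delta_0 \;+\; \beta \sum_{t=0}^{T-1} 2^{-t}
         \;\le\; 2^{-T}\,\delta_0 \;+\; 2\beta .
% For T \gtrsim \log n, the first term is negligible and \delta_T is of the
% order of \beta, which is exactly the \ell_\infty bound claimed in (18).
```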

Lemma 2.

Let $p \gtrsim \frac{\log n}{n}$ and $n p L \gtrsim \log n$. Let $\hat{w}^{(0)}$ be an initial estimate: an output of RankCentrality [7] when seeded by $\tilde{y}$. Then,

$\frac{\|\hat{w}^{(0)} - w\|}{\|w\|} \;\lesssim\; \frac{1}{2\eta - 1} \sqrt{\frac{\log n}{n p L}} \qquad (30)$

holds with probability exceeding $1 - c_8 n^{-c_9}$.

Proof.

Here we provide only a sketch of the proof, leaving details to Appendix B. The proof builds upon the analysis structured by Lemma 2 in Negahban et al. [7], which bounds the deviation of the Markov chain w.r.t. the transition matrix $\hat{P}$ after $t$ steps:

$\frac{\|p_t - \pi\|}{\|\pi\|} \;\leq\; \rho^t\, \frac{\|p_0 - \pi\|}{\|\pi\|} \sqrt{\frac{w_{\max}}{w_{\min}}} \;+\; \frac{\|\Delta\|}{1 - \rho} \sqrt{\frac{w_{\max}}{w_{\min}}}, \qquad (31)$

where $p_t$ denotes the distribution w.r.t. $\hat{P}$ at time $t$ seeded by an arbitrary initial distribution $p_0$, the matrix $\Delta := \hat{P} - P$ indicates the fluctuation of the transition probability matrix¹ around its mean $P$, and $\rho := \lambda_2(P) + \|\Delta\|$. Here $\pi$ denotes the stationary distribution of $P$ and $\lambda_i(P)$ indicates the $i$-th eigenvalue of $P$. (¹The notation $\Delta$, a matrix, should not be confused with the scalar normalized score separation $\Delta_K$, defined in (7).)

Unlike the faithful scenario $\eta = 1$, in the arbitrary $\eta$ case, the bound on $\|\Delta\|$ depends on $\eta$:

$\|\Delta\| \;\lesssim\; \frac{1}{2\eta - 1} \sqrt{\frac{\log n}{n p L}}, \qquad (32)$

which will be proved in Appendix B by using various concentration bounds (e.g., Hoeffding's inequality and Tropp's matrix concentration inequalities [27]). Adapting the analysis in [7], one can easily verify that under the conditions in Theorem 1, $1 - \rho = \Omega(1)$. Applying the bounds on $\|\Delta\|$ and $\rho$ to (31) gives the claimed bound, which completes the proof. ∎

V Converse Proof of Theorem 1

As in Chen and Suh's work [10], by Fano's inequality, we see that it suffices for us to upper bound the mutual information between the observations and a set of appropriately chosen rankings of cardinality $M + 1$. More specifically, let $\sigma$ represent a permutation over $[n]$. We also denote by $\sigma(i)$ and $\sigma([K])$ the corresponding index of the $i$-th ranked item and the index set of all top-$K$ items, respectively. We subsequently impose a uniform prior over the rankings $\{\sigma_m\}_{m=0}^{M}$ as follows: If $m = 0$ then

$\sigma_0(i) \;=\; i, \quad \forall i \in [n], \qquad (33)$

and if $m \in \{1, \ldots, M\}$ with $M := n - K$, then

$\sigma_m(K) = K + m, \quad \sigma_m(K + m) = K, \quad \text{and} \quad \sigma_m(i) = i \;\; \text{otherwise}. \qquad (34)$

In words, each alternative hypothesis is generated by swapping only two indices of the hypothesis (ranking) obeying (33). Clearly, the original minimax error probability is lower bounded by the corresponding error probability of this reduced ensemble.

Let the set of observations for the edge $(i,j)$ be denoted as $y_{ij}^{(1:L)} := \{y_{ij}^{(\ell)}\}_{\ell=1}^{L}$. We also find it convenient to introduce an erased version of the observations $z := \{z_{ij}^{(1:L)}\}_{i < j}$ which is related to the true observations as follows,

$z_{ij}^{(1:L)} \;=\; \begin{cases} y_{ij}^{(1:L)} & \text{if } (i,j) \in \mathcal{E}, \\ \mathrm{e} & \text{otherwise}. \end{cases} \qquad (35)$

Here $\mathrm{e}$ is an erasure symbol. Let $\sigma$, a chance variable, be a uniformly distributed ranking in $\{\sigma_m\}_{m=0}^{M}$ (the ensemble of rankings created in (33)–(34)). Let $P_m$ be the distribution of the observations $z$ given that the ranking is $\sigma_m$, where $m \in \{0, 1, \ldots, M\}$, and a similar notation is used for $P_{m'}$ when $m$ is replaced by $m'$. Now, by the convexity of the relative entropy and the fact that the rankings are uniform, the mutual information can be bounded as

$I(\sigma; z) \;=\; \frac{1}{M+1} \sum_{m=0}^{M} D\bigg( P_m \,\bigg\|\, \frac{1}{M+1} \sum_{m'=0}^{M} P_{m'} \bigg) \qquad (36)$
$\;\leq\; \frac{1}{M+1} \sum_{m=0}^{M} \frac{1}{M+1} \sum_{m'=0}^{M} D(P_m \,\|\, P_{m'}) \qquad (37)$
$\;=\; \frac{1}{(M+1)^2} \sum_{m=0}^{M} \sum_{m'=0}^{M} D(P_m \,\|\, P_{m'}) \qquad (38)$
$\;\leq\; \max_{m \neq m'} D(P_m \,\|\, P_{m'}). \qquad (39)$

Assume that under ranking $\sigma_m$, the score vector is $w$, and under ranking $\sigma_{m'}$, the score vector is $(w_{\pi(1)}, \ldots, w_{\pi(n)})$ for some fixed permutation $\pi$. By using the statistical model described in Section II, we know that

$D(P_m \,\|\, P_{m'}) \;=\; p L \sum_{i < j} D\big( \bar{p}_{ij} \,\big\|\, \bar{p}'_{ij} \big), \qquad (40)$

where $D(\cdot \,\|\, \cdot)$ is the binary relative entropy. For brevity, write

$\bar{p}_{ij} \;:=\; \eta\,\frac{w_i}{w_i + w_j} + (1 - \eta)\,\frac{w_j}{w_i + w_j}, \qquad \bar{p}'_{ij} \;:=\; \eta\,\frac{w_{\pi(i)}}{w_{\pi(i)} + w_{\pi(j)}} + (1 - \eta)\,\frac{w_{\pi(j)}}{w_{\pi(i)} + w_{\pi(j)}}. \qquad (41)$

Furthermore, we note that the chi-squared divergence is an upper bound for the relative entropy between two distributions $P$ and $Q$ on the same (countable) alphabet (see e.g. [28, Lemma 6.3]), i.e.,

$D(P \,\|\, Q) \;\leq\; \chi^2(P \,\|\, Q) \;:=\; \sum_{x} \frac{\big(P(x) - Q(x)\big)^2}{Q(x)}. \qquad (42)$

We also use the notation $\chi^2(p \,\|\, q)$ to denote the binary chi-squared divergence, similarly to the binary relative entropy. Now, we may bound (40) using the following computation

$D(\bar{p}_{ij} \,\|\, \bar{p}'_{ij}) \;\leq\; \chi^2(\bar{p}_{ij} \,\|\, \bar{p}'_{ij}) \;=\; \frac{(\bar{p}_{ij} - \bar{p}'_{ij})^2}{\bar{p}'_{ij}(1 - \bar{p}'_{ij})} \qquad (43)$
$\;=\; (2\eta - 1)^2\, \frac{\Big( \frac{w_i}{w_i + w_j} - \frac{w_{\pi(i)}}{w_{\pi(i)} + w_{\pi(j)}} \Big)^2}{\bar{p}'_{ij}(1 - \bar{p}'_{ij})}. \qquad (44)$
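The closed form used in (43) is the standard binary chi-squared identity, verified in one line (our notation):

```latex
\chi^2(p \,\|\, q)
  \;=\; \frac{(p - q)^2}{q} + \frac{\big((1 - p) - (1 - q)\big)^2}{1 - q}
  \;=\; (p - q)^2 \cdot \frac{(1 - q) + q}{q(1 - q)}
  \;=\; \frac{(p - q)^2}{q(1 - q)} .
```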

Now

$\bar{p}'_{ij}\big(1 - \bar{p}'_{ij}\big) \;\geq\; \frac{w_{\min}^2}{(w_{\min} + w_{\max})^2}. \qquad (45)$

Hence, if we consider the case where $w_{\max}/w_{\min} = O(1)$ (which is the regime of interest), uniting (44) and (45) we obtain

$D(\bar{p}_{ij} \,\|\, \bar{p}'_{ij}) \;\lesssim\; (2\eta - 1)^2 \Big( \frac{w_i}{w_i + w_j} - \frac{w_{\pi(i)}}{w_{\pi(i)} + w_{\pi(j)}} \Big)^2. \qquad (46)$

By construction of the hypotheses in (33)–(34), conditional on any two distinct rankings $\sigma_m, \sigma_{m'}$, the distributions of $z$ (namely $P_m$ and $P_{m'}$) are different over at most $O(n)$ locations so

$\max_{m \neq m'} D(P_m \,\|\, P_{m'}) \;\lesssim\; n\, p\, L\, (2\eta - 1)^2\, \Delta_K^2. \qquad (47)$

Thus, plugging this into the bound on the mutual information in (39), we obtain

$I(\sigma; z) \;\lesssim\; n\, p\, L\, (2\eta - 1)^2\, \Delta_K^2. \qquad (48)$

Plugging this into Fano's inequality, and using the fact that $\log M \gtrsim \log n$ (from $M = n - K$), we obtain

$P_e \;\geq\; 1 - \frac{I(\sigma; z) + \log 2}{\log M} \qquad (49)$
$\;\geq\; 1 - \frac{c_{10}\, n p L\, (2\eta - 1)^2 \Delta_K^2 + \log 2}{\log n}. \qquad (50)$

Thus, if $\binom{n}{2} p L \leq c_1 \frac{n \log n}{(2\eta - 1)^2 \Delta_K^2}$ for some small enough but positive $c_1$, we see that

$P_e(\psi) \;\geq\; \frac{1}{2}. \qquad (51)$

Since this is independent of the decoder $\psi$, the converse part is proved.

VI Algorithm and Proof of Theorem 2

VI-A Algorithm Description

The proof of Theorem 2 follows by combining the results of Jain and Oh [15] with the analysis for the case when $\eta$ is known in Theorem 1. Jain and Oh were interested in disambiguating a mixture distribution from samples. This corresponds to our model in (3). They showed using tensor decomposition methods that it is possible to find a globally optimal solution for the mixture weight $\eta$ using a computationally efficient algorithm. They also provided an $\ell_2$ bound on the error of the estimated distributions but, as mentioned, we are more interested in controlling the $\ell_\infty$ error of the score vector, so we estimate it separately. The use of the $\ell_2$ bound in [15] leads to a worse sample complexity for top-$K$ ranking.

Thus, in the first step, we will use the method in [15] to estimate $\eta$ given the data samples (pairwise comparisons) $y$. The estimate is denoted as $\hat{\eta}$. It turns out that one can specialize the result in [15] with suitably parametrized “distribution vectors”

$a \;:=\; \bigg[ \frac{w_i}{w_i + w_j} \bigg]_{(i,j)} \qquad (52)$

and $b := \mathbf{1} - a$, where in (52), $(i,j)$ runs through all values in $\mathcal{E}$. Hence, we are in fact applying [15] to a more restrictive setting where the two probability distributions represented by $a$ and $b$ are “coupled” but this does not preclude the application of the results in [15]. In fact, this assumption makes the calculation of relevant parameters (in Lemma 6) easier. The relevant second and third moments are

$M_2 \;:=\; \eta\, (a \otimes a) \;+\; (1 - \eta)\, (b \otimes b), \qquad (53)$
$M_3 \;:=\; \eta\, a^{\otimes 3} \;+\; (1 - \eta)\, b^{\otimes 3}, \qquad (54)$

where $\otimes$ is the outer product and $a^{\otimes 3} := a \otimes a \otimes a$ is the $3$-fold tensor outer product. If one has the exact $M_2$ and $M_3$, we can obtain the mixture weight $\eta$ exactly. The intuition as to why tensor methods are applicable to problems involving latent variables has been well-documented (e.g. [16]). Essentially, the second- and third-order moments contained in $M_2$ and $M_3$ provide sufficient statistics for identifying and hence estimating all the parameters of an appropriately-defined model with latent variables (whereas second-order information contained in $M_2$ is, in general, not sufficient for reconstructing the parameters). Thus, the problem boils down to analyzing the precision of $\hat{\eta}$ when we only have access to empirical versions of $M_2$ and $M_3$ formed from pairwise comparisons in $y$. As shown in Lemma 5 to follow, there is a tradeoff between the sample size per edge $L$ and the quality of the estimate $\hat{\eta}$ of $\eta$. Hence, this causes a degradation to the overall sample complexity reflected in Theorem 2.
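To see why moments of this kind pin down $\eta$ under the coupled structure $b = \mathbf{1} - a$, the following self-contained sketch estimates $\eta$ by a one-dimensional moment-matching search. This is not the Jain–Oh tensor power method (it uses only first and second moments and a grid search), and all parameters are hypothetical; it merely demonstrates identifiability from low-order moments:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ground truth: d coupled Bernoulli coordinates (a_e vs 1 - a_e).
d, N, eta = 8, 100000, 0.8
a = rng.uniform(0.55, 0.9, size=d)

# Each annotator is faithful w.p. eta and contributes one binary vector.
faithful = rng.random(N) < eta
probs = np.where(faithful[:, None], a[None, :], 1.0 - a[None, :])
Y = (rng.random((N, d)) < probs).astype(float)

m1 = Y.mean(axis=0)      # estimates (2*eta - 1) * a + (1 - eta)
M2 = (Y.T @ Y) / N       # off-diag: eta * a_i a_j + (1 - eta) * b_i b_j

def residual(eta_c):
    """Moment mismatch for candidate eta: invert m1 for a, then predict M2."""
    a_c = (m1 - (1.0 - eta_c)) / (2.0 * eta_c - 1.0)
    b_c = 1.0 - a_c
    M2_pred = eta_c * np.outer(a_c, a_c) + (1.0 - eta_c) * np.outer(b_c, b_c)
    off = ~np.eye(d, dtype=bool)   # the diagonal carries no extra information
    return np.sum((M2 - M2_pred)[off] ** 2)

grid = np.linspace(0.55, 0.95, 401)
eta_hat = grid[np.argmin([residual(e) for e in grid])]
print(f"true eta = {eta}, estimated eta = {eta_hat:.3f}")
```

The actual procedure in [15] additionally exploits the third moment $M_3$ and comes with global optimality guarantees; the grid search above only illustrates why empirical moments of limited accuracy translate into a limited-accuracy $\hat{\eta}$, which is the source of the degradation in Theorem 2.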

Fig. 3: Ranking algorithm for the $\eta$-unknown scenario. The key distinction relative to the $\eta$-known case is that we estimate $\eta$ based on the tensor decomposition method [15, 16] and the estimate $\hat{\eta}$ is employed for shifting and performing the point-wise MLE. This method allows us to get $\hat{\eta} \approx \eta$, which ensures that