Active Ordinal Querying for Tuplewise Similarity Learning
Abstract
Many machine learning tasks such as clustering, classification, and dataset search benefit from embedding data points in a space where distances reflect notions of relative similarity as perceived by humans. A common way to construct such an embedding is to request triplet similarity queries to an oracle, comparing two objects with respect to a reference. This work generalizes triplet queries to tuple queries of arbitrary size that ask an oracle to rank multiple objects against a reference, and introduces an efficient and robust adaptive selection method called InfoTuple that uses a novel approach to mutual information maximization. We show that the performance of InfoTuple at various tuple sizes exceeds that of the state-of-the-art adaptive triplet selection method on synthetic tests and new human response datasets, and empirically demonstrate the significant gains in efficiency and query consistency achieved by querying larger tuples instead of triplets.
1 Introduction
Similarity learning is the process of assigning point coordinates to objects in a dataset such that distances between objects in the learned space are consistent with notions of similarity as perceived by humans. While these objects usually exist in some high-dimensional space (e.g., images, audio), very often the semantic information humans attribute to these objects lies in a low-dimensional space (e.g., items, words). Once this low-dimensional embedding is learned, existing intelligent algorithms [16, 4] can be used to search the dataset with query complexity scaling in the embedding dimension, allowing large datasets to be searched quickly in applications such as task selection for robot learning from demonstration [1], object recognition [10], or image retrieval [34].
To construct such an embedding for a given set of objects, queries that capture the similarity statistics between the objects in question must be made to human experts. While there exist several types of similarity queries that can be made (e.g., relative attributes between objects [25]), we focus on relative similarity queries posed to an oracle comparing objects with respect to a “head” (i.e., reference) object. Relative similarity queries are useful because they gather object similarity information using only object identifiers rather than predetermined features or attributes, allowing similarity learning methods to be applied to any collection of uniquely identifiable objects. In contrast, if a head object were not specified, an oracle would need to use a feature-based criterion for ranking the object set, which is not viable in many applications of interest (e.g., learning human preferences).
Such relative similarity queries typically come in the form of triplet comparisons (i.e., “is object $a$ more similar to object $b$ or $c$?”) [30, 31, 12]. In our first main contribution, we extend these queries to larger rank orderings of tuples of $k$ objects to gather more information at once for similarity learning. This query type takes the form “rank objects $b_1$ through $b_{k-1}$ according to their similarity to object $a$.” To the best of our knowledge, this study is the first attempt to leverage this generalized query type in similarity learning. The use of this query type is motivated by the fact that comparing multiple objects simultaneously provides increased context for a human expert [9], which can increase labeling consistency without a significant increase in human effort per query [19] and has demonstrated benefits in settings such as rank learning [6]. In technical terms, tuplewise queries capture joint dependence between objects that isn’t captured in triplet comparisons (which are often incorrectly modeled as independent queries). To illustrate this point, consider the difference between the triplet query and tuple query presented in Figure 1. In the triplet query, multiple attributes could be used to rank a given query, increasing the ambiguity about which item should be chosen as more similar to the reference. Adding an item to the tuple can provide additional context about the entire dataset to the oracle, clarifying which criterion should be used to rank the tuple and thereby making the query less ambiguous.
While tuple queries are appealing, their use presents two major challenges. First, in a dataset of $N$ objects queried with tuples of size $k$ there are $O(N^k)$ possible tuples. Labeling these individual tuples is prohibitively time consuming for large datasets. Even if uniformly random query selection is used to downsample this set, there is evidence that such a strategy is still prohibitively expensive [15]. Requesting an exhaustive number of queries is also inefficient from an information standpoint, since there is redundancy in the set of all tuple rankings. Second, in many settings of interest, the oracle answering such queries may be stochastic. For example, crowd oracles may aggregate responses from experts with differing similarity judgements [30], and individual oracles can be unreliable over time (especially for queries regarding similar objects).
These issues can be ameliorated in part by leveraging tools from active learning, the goal of which is to minimize the total labeling cost including the number of expert interactions (usually corresponding to monetary cost), aggregate response time, and computational cost needed to dynamically select queries. This is achieved through adaptive approaches that increase learning efficiency by using previous query responses to determine which information about a model is still “missing” as well as model the oracle’s stochasticity. In this framework, unlabeled data points that optimize a measure of informativeness are selected for expert labeling. One such metric, mutual information, is a popular way to assess the reduction in uncertainty a query provides about unknown learning parameters [28, 20, 23]. In active similarity learning, the state of the art is a strategy called “Crowd Kernel Learning” (CKL) that selects triplets that maximize the mutual information between a query response and the embedding coordinates of the head object [30]. However, CKL does not apply to ordinal queries of general tuple sizes ($k > 3$), and its formulation of mutual information only measures the information a query provides about the embedding coordinates of the head object, disregarding information about the locations of the other objects in the query.
In our second main contribution, we address these deficiencies and the lack of an active similarity learning strategy for our new query type by introducing a novel method for efficient and robust adaptive selection of tuplewise queries of arbitrary size. Our method, called InfoTuple, maximizes the mutual information a query response provides about the entire embedding, which is a direct measure of query informativeness that leverages the high degree of coupling between all of the objects in a query. InfoTuple relies on a novel set of simplifying yet reasonable assumptions for tractable mutual information estimation from a single batch of Monte Carlo samples. Our approach accounts for all objects in a query, while avoiding the need to decompose mutual information into a prohibitive number of terms. We demonstrate the performance of this method across datasets, oracle models, and tuple sizes, using both synthetic tests and newly collected large-scale human response datasets. In particular, we empirically show that InfoTuple’s performance exceeds that of CKL and random queries, and furthermore that it benefits significantly from using larger tuples even after normalizing for tuple size. We also demonstrate the utility of our novel query type by showing an increase in query consistency for larger tuples over triplets, and show that these advantages can be gained without excessive labeling-time increases.
2 Related Work
Similarity learning from triplets is increasingly commonplace in modern AI, and popular deep learning architectures have been developed to leverage triplet labels [12]. Frameworks such as that of [21] or tSTE [31] are relatively ubiquitous in the visualization community, and attempt to directly capture a notion of visual similarity close to that observed in psychometrics literature (e.g. [7]). However, for large datasets it is often prohibitively expensive to collect such exhaustive relationship data from labelers, so the development of approximate methods of learning such embeddings is a matter of interest to the AI community.
The bulk of the existing literature on active selection of ordinal queries for constructing these embeddings focuses on the case where distance relationships between objects can be determined with absolute certainty. This deterministic case is well studied, and lower bounds exist on the sample complexity needed to learn high-quality embeddings [15]. In reality, responses are often not deterministic for a number of practical reasons and probabilistic MDS methods have been proposed to model such cases [30]. Analytic results do exist characterizing bounds on prediction error in this setting [14], but determining optimal strategies for query selection in the stochastic setting remains largely an open problem.
Specifically, to the best of our knowledge there have been no previous attempts to adaptively select relative comparisons with respect to a head object for general tuple sizes ($k > 3$) in the context of similarity learning. Prior work [19, 35] develops an active strategy for sampling tuples, but the query task is relative attribute ranking within the tuple according to some prespecified attribute as opposed to comparison against a head object. Other work [27] actively samples the same query type as our study, but in the context of classification via label propagation. The learning scenario in [5] is similar to ours in that it actively samples tuples for relative similarity comparisons to a head for the sake of learning and searching an embedding of objects, but those comparisons are ternary ‘similar’, ‘dissimilar’, or ‘neither’ labels and the methodology differs from the mutual information approach presented here. Similarly, other work [26] actively samples tuplewise queries with binary ‘similar’ or ‘dissimilar’ label responses with respect to a head, but in the context of classification. Finally, the prior work [32] also employs such tuplewise binary queries for active similarity learning, but with randomly selected queries. While no previous study addresses the similarity learning problem that we explore here, the existing literature demonstrates the effectiveness, efficiency, and feasibility of queries involving multiple objects and provides support for the practical use of our proposed query type.
3 Methods
The problem of adaptively selecting a tuplewise query can be formulated as follows: for a dataset $\mathcal{X}$ of $N$ objects, assume that there exists a $d$-dimensional vector of embedding coordinates for each object which are concatenated as columns in matrix $M \in \mathbb{R}^{d \times N}$. The similarity matrix corresponding to $M$ is given by $K = M^T M$, which implies an $N \times N$ matrix $D$ of distances between the objects in $\mathcal{X}$. Specifically, the squared distance between the $i$th and $j$th objects in the dataset is given by $D_{i,j}^2 = K_{i,i} - 2K_{i,j} + K_{j,j}$. These distances are assumed to be consistent in expectation with similarity comparisons from an oracle (e.g., human expert or crowd) such that similar objects are closer and dissimilar objects are farther apart. Since relative similarity comparisons between tuples of objects inform their relative embedding distances rather than their absolute coordinates, our objective is to learn similarity matrix $K$ rather than $M$, which can be recovered from $K$ up to a change in basis [30].
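The relationship between embedding coordinates, the similarity (Gram) matrix, and squared distances can be illustrated with a small sketch. The function names are hypothetical, and for readability the points are stored as rows (one per object) rather than as columns of a coordinate matrix.

```python
def gram_matrix(points):
    """Similarity (Gram) matrix: K[i][j] is the inner product of points i and j."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]


def squared_distances(K):
    """Squared distances implied by a Gram matrix: K[i][i] - 2*K[i][j] + K[j][j]."""
    n = len(K)
    return [[K[i][i] - 2.0 * K[i][j] + K[j][j] for j in range(n)] for i in range(n)]


# Three objects embedded in two dimensions (illustrative values only).
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 0.0)]
K = gram_matrix(points)
D2 = squared_distances(K)
```

Because the squared distances depend on the coordinates only through the Gram matrix, rotating or reflecting all points leaves them unchanged, which is why the similarity matrix rather than the coordinates is the natural learning target.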
A tuplewise oracle query at time step $n$ is composed of a “body” of $k-1$ objects $B^n = \{b^n_1, b^n_2, \dots, b^n_{k-1}\}$ which the oracle ranks by similarity with respect to some “head” object $a_n$. Letting $Q_n = \{a_n\} \cup B^n$ denote the $n$th posed tuple, we denote the oracle’s ranking response as $R(Q_n) = \{R_1(Q_n), R_2(Q_n), \dots, R_{k-1}(Q_n)\}$ which is a permutation of $B^n$ such that $R_1(Q_n) \prec R_2(Q_n) \prec \dots \prec R_{k-1}(Q_n)$ where $b_1 \prec b_2$ indicates that the oracle ranks object $b_1$ as more similar to $a_n$ than object $b_2$. Since the oracle is assumed to be stochastic, $R(Q_n)$ is a random permutation of $B^n$ governed by a distribution that is assumed to depend on $K$. This assumed dependence is natural because oracle consistency is likely coupled with notions of object similarity, and therefore with distances between the objects in $\mathcal{X}$. The actual recorded oracle ranking is a random variate of $R(Q_n)$ denoted as $r(Q_n)$. Letting $r^n = \{r(Q_1), r(Q_2), \dots, r(Q_n)\}$, define $\hat{K}_n$ as an estimate of $K$ learned from previous rankings $r^n$, with corresponding distance matrix $\hat{D}^n$.
Suppose that tuples $Q_1, Q_2, \dots, Q_{n-1}$ have been posed as queries to the oracle with corresponding ranking responses $r^{n-1}$, and consider a Bayes optimal approach where after the $n$th query we estimate the similarity matrix as the maximum a posteriori (MAP) estimator over a similarity matrix posterior distribution given by $f(K \mid r^n)$, i.e. $\hat{K}_n = \arg\max_K f(K \mid r^n)$. To choose the query $Q_n$, a reasonable objective is to select a query that maximizes the achieved posterior value of the resulting MAP estimator (or equivalently one that maximizes the achieved logarithm of the posterior), corresponding to a higher level of confidence in the estimate. However, because the oracle response $r(Q_n)$ is unknown before a query is issued, the resulting maximized posterior value is unknown. Instead, a more reasonable objective is to select a query that maximizes the expected value over the posterior of $R(Q_n)$. This can be stated as
$$\arg\max_{Q_n} \; \mathbb{E}_{R(Q_n) \mid r^{n-1}} \left[ \max_{K} \log f(K \mid R(Q_n), r^{n-1}) \right].$$
In practice, this optimization is infeasible since each expectation involves the calculation of several MAP estimates. Noting that maximization is lower bounded by expectation, this optimization can be relaxed by replacing the maximization over $K$ with an expectation over its posterior distribution given $R(Q_n)$ and $r^{n-1}$, resulting in a feasible maximization of a lower bound given by
$$\arg\max_{Q_n} \; \mathbb{E}_{R(Q_n) \mid r^{n-1}} \left[ \mathbb{E}_{K \mid R(Q_n), r^{n-1}} \left[ \log f(K \mid R(Q_n), r^{n-1}) \right] \right] = \arg\max_{Q_n} \; -h(K \mid R(Q_n), r^{n-1}) \tag{1}$$
where $h(\cdot \mid \cdot)$ denotes conditional differential entropy [8]. Let the mutual information between $K$ and $R(Q_n)$ given $r^{n-1}$ be defined by
$$I(K; R(Q_n) \mid r^{n-1}) = h(K \mid r^{n-1}) - h(K \mid R(Q_n), r^{n-1}),$$
and note that the second term is equal to (1) while the first term does not depend on the choice of $Q_n$. Thus, maximizing (1) over $Q_n$ is equivalent to maximizing $I(K; R(Q_n) \mid r^{n-1})$. Hence, we can adaptively select tuples that maximize mutual information as a means of greedily maximizing a lower bound on the log-posterior achieved by a MAP estimator, corresponding to a high estimator confidence.
However, calculating (1) for a candidate tuple is an expensive procedure that involves estimating the differential entropy of a combinatorially large number of posterior distributions, since the expectation with respect to $R(Q_n)$ is taken over $(k-1)!$ possible rankings. Instead, in the spirit of [13] we leverage the symmetry of mutual information to write the equivalent objective
$$\arg\max_{Q_n} \; H(R(Q_n) \mid r^{n-1}) - H(R(Q_n) \mid K, r^{n-1}) \tag{2}$$
where $H(\cdot \mid \cdot)$ denotes conditional entropy of a discrete random variable. Estimating (2) for a candidate tuple only involves averaging ranking entropy over a single posterior $f(K \mid r^{n-1})$, regardless of the value of $k$. This insight, along with suitable probability models discussed in the next sections, allows us to efficiently estimate mutual information for a candidate tuple over a single batch of Monte Carlo samples, rather than having to sample from $(k-1)!$ posteriors.
Furthermore, by interpreting entropy of discrete random variables as a measure of uncertainty, this form of mutual information maximization has a satisfying qualitative interpretation. The first entropy term in (2) prefers tuples whose rankings are uncertain, preventing queries from being wasted on predictable or redundant responses. Meanwhile, the second term discourages tuples that have high expected uncertainty when conditioned on $K$; this prevents the selection of tuples that, even if $K$ were somehow revealed, would still have uncertain rankings. Such queries are inherently ambiguous, and therefore uninformative to the embedding. Thus, maximizing mutual information optimizes the balance between these two measures of uncertainty and therefore prefers queries that are unknown to the learner but that can still be answered consistently by the oracle.
3.1 Estimating Mutual Information
To tractably estimate the entropy terms in (2) for a candidate tuple, we employ several simplifying assumptions concerning the joint statistics of the query sequence and the embedding that allow for efficient Monte Carlo sampling:

(A1) As is common in active learning settings, we assume that each query response $R(Q_n)$ is statistically independent of previous responses $r^{n-1}$, when conditioned on $K$.

(A2) The distribution of $R(Q_n)$ conditioned on $K$ is only dependent on the distances between $a_n$ and the objects in $B^n$, notated as set $D_{Q_n} = \{D_{a_n, b} : b \in B^n\}$. This direct dependence of tuple ranking probabilities on inter-object distances is rooted in the fact that the distance relationships in the embedding are assumed to capture oracle response behavior, and is a common assumption in ordinal embedding literature [31, 30]. Furthermore, this conditional independence of $R(Q_n)$ from objects $x \notin Q_n$ is prevalent in probabilistic ranking literature [29]. In the next section, we describe a reasonable ranking probability model that satisfies this assumption.

(A3) $K$ is conditionally independent of $r^{n-1}$, given $\hat{D}^{n-1}$. This assumption is reasonable because embedding methods used to estimate $\hat{K}_{n-1}$ (and subsequently $\hat{D}^{n-1}$) are designed such that distances in the estimated embedding preserve the response history contained in $r^{n-1}$. In practice, it is more convenient to model an embedding posterior distribution by conditioning on $\hat{D}^{n-1}$, learned from the previous responses $r^{n-1}$, rather than by conditioning on $r^{n-1}$ itself. This is in the same spirit as CKL, where the current embedding estimate is used to approximate a posterior distribution over points.

(A4) Conditioned on $\hat{D}^{n-1}$, the posterior distribution of $D_{Q_n}$ is normally distributed about the corresponding values in $\hat{D}^{n-1}$, i.e. $D_{a_n, b} \sim \mathcal{N}(\hat{D}^{n-1}_{a_n, b}, \sigma_{n-1}^2)$ for each $b \in B^n$, where $\sigma_{n-1}^2$ is a variance parameter. Imposing Gaussian distributions on inter-object distances is a recent approach to modeling uncertainty in ordinal embeddings [22] that allows us to approximate the distance posterior with a fixed batch of samples from a simple distribution. Furthermore, the combination of this model with (A2) means that we only need to sample from the normal distributions corresponding to the objects in $Q_n$. We choose $\sigma_{n-1}^2$ to be the sample variance of all entries in $\hat{D}^{n-1}$, which is a heuristic that introduces a source of variation that preserves the scale of the embedding.
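Combining these assumptions, with a slight abuse of notation by writing $H(X) = H(p(X))$ for a random variable $X$ with probability mass function $p(X)$, and $\mathcal{N}^{n-1}_{Q_n}$ to represent normal distribution $\mathcal{N}(\hat{D}^{n-1}_{Q_n}, \sigma_{n-1}^2 I)$, we have
$$H(R(Q_n) \mid r^{n-1}) = H\left( \mathbb{E}_{D_{Q_n} \sim \mathcal{N}^{n-1}_{Q_n}} \left[ p(R(Q_n) \mid D_{Q_n}) \right] \right).$$
Similarly, we have
$$H(R(Q_n) \mid K, r^{n-1}) = \mathbb{E}_{D_{Q_n} \sim \mathcal{N}^{n-1}_{Q_n}} \left[ H\left( p(R(Q_n) \mid D_{Q_n}) \right) \right].$$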
This formulation allows a fixed-size batch of samples to be drawn and evaluated over, the size of which can be tuned based on real-time performance specifications. This enables us to separate our computational budget and mutual information estimation accuracy from the size of the tuple query.
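The estimation procedure described above can be sketched as follows: draw a batch of Gaussian samples of the head-to-body distances, evaluate a ranking distribution for each sample, and take the entropy of the averaged distribution minus the average of the per-sample entropies. This is a minimal sketch, not the authors' implementation. The ranking model in `ranking_probs` (a product of triplet probabilities over adjacent ranks, normalized over all permutations), the default `mu`, the sample count, and the clipping of negative sampled distances are all assumptions made for illustration.

```python
import itertools
import math
import random


def ranking_probs(dists, mu=0.05):
    """Distribution over all rankings of the body given head-to-body distances.

    ASSUMED model: product of triplet probabilities over adjacent ranks,
    normalized over all (k-1)! permutations so the result sums to one.
    """
    sq = [d * d for d in dists]
    weights = []
    for perm in itertools.permutations(range(len(dists))):
        w = 1.0
        for near, far in zip(perm, perm[1:]):
            w *= (sq[far] + mu) / (sq[near] + sq[far] + 2.0 * mu)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def mutual_information(est_dists, sigma, n_samples=500, mu=0.05, rng=None):
    """Monte Carlo estimate: H(mean ranking pmf) - mean(H(ranking pmf))."""
    rng = rng or random.Random(0)
    mean_probs = [0.0] * math.factorial(len(est_dists))
    mean_entropy = 0.0
    for _ in range(n_samples):
        # Sample distances around the current estimates; clip at zero
        # because a distance cannot be negative (our choice, not the paper's).
        sample = [max(rng.gauss(d, sigma), 0.0) for d in est_dists]
        p = ranking_probs(sample, mu)
        mean_probs = [m + q / n_samples for m, q in zip(mean_probs, p)]
        mean_entropy += entropy(p) / n_samples
    return entropy(mean_probs) - mean_entropy


# Estimated distances from a head to three body objects (illustrative values).
mi = mutual_information([1.0, 1.2, 3.0], sigma=0.5)
```

The same batch of samples serves both entropy terms, so the cost per candidate tuple is governed by `n_samples` and the number of rankings enumerated, not by a separate posterior per ranking.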
3.2 Embedding Technique
In order to maximize the flexibility of our approach and draw a closer one-to-one comparison to existing methods for similarity learning, we train our embedding on our actively selected tuples by first decomposing a tuple ranking into constituent triplets defined by the set $\{R_i(Q_n) \prec R_{i+1}(Q_n) : 1 \le i \le k-2\}$, and then learning an embedding from these triplets with any triplet ordinal embedding algorithm of choice. Since we compare performance against CKL in our experiments, our proposed embedding technique follows directly from the probabilistic MDS formulation in [30] so as to evaluate the effectiveness of our novel query selection strategy in a controlled setting. We wish to constrain our learned similarity matrix to the set of symmetric unit-length PSD matrices, so we consider the set $S$ of such matrices: $S = \{K \succeq 0 : K_{1,1} = K_{2,2} = \dots = K_{N,N} = 1\}$. We denote the closest matrix in $S$ to $K$ as $\Pi_S(K)$. Projecting to the element in $S$ closest to $K$ is a quadratic program, which we solve by gradient projection descent on $K$. We do this by selecting an initial $K_0$ arbitrarily, and for each iteration $t$ computing $K_{t+1} = \Pi_S(K_t - \eta \nabla \ell(K_t))$ with step size $\eta$ and $\ell$ being the empirical log-loss at iteration $t$, i.e. $\ell(K_t) = \log(1/p)$, and $p$ being the probability that the oracle correctly ordered the constituent triplets of the selected tuples. For the response probability of an individual triplet, we adopt the model in [30] that is reminiscent of Bradley-Terry pairwise score models [3]: for parameter $\mu > 0$, $p(b_1 \prec b_2 \mid a) = \dfrac{D_{a,b_2}^2 + \mu}{D_{a,b_1}^2 + D_{a,b_2}^2 + 2\mu}$.
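The decomposition of a ranking into triplets and the triplet response probability can be sketched as below. This is an illustrative sketch under stated assumptions: the ranking is split into triplets between adjacent ranks, `D2` is taken to be a matrix of squared embedding distances, and the function names and the default value of `mu` are hypothetical. The projection onto the constrained matrix set is omitted.

```python
import math


def constituent_triplets(head, ranking):
    """Split a ranking (most to least similar) into (head, nearer, farther) triplets."""
    return [(head, near, far) for near, far in zip(ranking, ranking[1:])]


def triplet_prob(D2, head, near, far, mu=0.05):
    """Probability that `near` is judged more similar to `head` than `far`."""
    return (D2[head][far] + mu) / (D2[head][near] + D2[head][far] + 2.0 * mu)


def log_loss(D2, triplets, mu=0.05):
    """Average negative log-probability of the observed triplet orderings."""
    total = sum(math.log(triplet_prob(D2, h, b, c, mu)) for h, b, c in triplets)
    return -total / len(triplets)


# Squared distances for three objects on a line at positions 0, 1, and 3.
D2 = [[0.0, 1.0, 9.0], [1.0, 0.0, 4.0], [9.0, 4.0, 0.0]]
triplets = constituent_triplets(head=0, ranking=[1, 2])
loss = log_loss(D2, triplets)
```

A ranking of the $k-1$ body objects yields $k-2$ such adjacent triplets, and a loss of this form is what a gradient step on the similarity matrix would decrease.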
3.3 Tuple Response Model
Our proposed technique is compatible with any tuple ranking model that satisfies the second assumption of Section 3.1, (A2). However, since we use the triplet response model listed above in the probabilistic MDS formulation, combined with the need for a controlled test against CKL, we extend the model of [30] to the tuplewise case as follows: we first decompose an oracle’s ranking into its constituent triplets, and then apply
for parameter $\mu > 0$. This model corresponds to oracle behavior that ranks objects proportionally to the ratio of their distances with respect to $a_n$, such that closer (resp. farther) objects are more (resp. less) likely to be deemed similar. Models of this type are generally held to be similar to the scale-invariant models present in some human perceptual systems [7].
3.4 Adaptive Algorithm
Combining these concepts, we have the following algorithm titled InfoTuple, summarized in Algorithm 1: the algorithm requires that some initial set of randomly selected tuples be labeled to provide a reasonable initialization of the learned similarity matrix. Since the focus of this work is on the effectiveness of various adaptive selection methods, this initialization is standardized across methods considered in our results. Specifically, following established practice [30], a “burn-in” period is used where $T_0$ random triplets are posed for each object $a$ in object set $\mathcal{X}$, with $a$ being the head of each query. Then, for each time step $n$ we learn a similarity matrix $\hat{K}_{n-1}$ on the set of previous responses $r^{n-1}$ by using probabilistic MDS. To make a comparison to CKL, we follow their procedure and subsequently pose a single tuple for each head $a_n \in \mathcal{X}$. However, it is possible to adaptively choose $a_n$ with our method by searching over both head and body objects for a maximally informative tuple. The body of each tuple, given some head $a_n$, is chosen by uniformly downsampling the set of possible bodies and selecting the one that maximizes the mutual information, calculated using the aforementioned probability model in our estimation procedure. This highlights the importance of computational tractability in estimating mutual information, since for a fixed computing budget per selected query, less expensive mutual information estimation allows for more candidate bodies to be considered. For a tuple size of $k$ we denote the run of an algorithm using that tuple size as InfoTuple-$k$.
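The body-selection step can be sketched as a search over uniformly sampled candidate bodies. The sketch below is hypothetical scaffolding: `info_fn` stands in for whatever mutual information estimate is used, and `toy_info`, the candidate count, and the function names are assumptions made only so the example runs on its own.

```python
import random


def select_body(head, objects, k, info_fn, n_candidates=50, rng=None):
    """Sample candidate bodies of k-1 distinct non-head objects; keep the best one."""
    rng = rng or random.Random(0)
    pool = [obj for obj in objects if obj != head]
    best_body, best_score = None, float("-inf")
    for _ in range(n_candidates):
        body = tuple(rng.sample(pool, k - 1))  # uniform downsampling of bodies
        score = info_fn(head, body)
        if score > best_score:
            best_body, best_score = body, score
    return best_body, best_score


def select_round(objects, k, info_fn, n_candidates=50, rng=None):
    """Pose one tuple per head, as in the comparison setup against CKL."""
    rng = rng or random.Random(0)
    return {
        head: select_body(head, objects, k, info_fn, n_candidates, rng)[0]
        for head in objects
    }


def toy_info(head, body):
    """Placeholder score (NOT mutual information): prefers bodies with close ids."""
    return -abs(body[0] - body[1])


objects = list(range(10))
queries = select_round(objects, k=3, info_fn=toy_info)
```

With a fixed time budget per query, a cheaper `info_fn` permits a larger `n_candidates`, which is the tractability trade-off noted above.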
4 Experiments
Our results on synthetic and human response datasets show that InfoTuple’s adaptive selection outperforms both random query selection and that of CKL. This is true even when normalizing for changes in tuple size and when normalizing for labeling effort, showing that the incurred benefit is not only due to the increased information inherently present in larger tuples but also due to our improved adaptive selection. We also show that there are inherent consistency benefits to the use of larger queries, and that human labelers can respond to these query types in practice without undue cost.
4.1 Datasets
To evaluate algorithm performance in a controlled setting, we constructed a synthetic evaluation dataset by generating a point cloud drawn from a $d$-dimensional multivariate normal distribution. To simulate oracle responses for this dataset, we use the popular Plackett-Luce permutation model to sample a ranking for a given head and body [6, 11]. In this response model, each object in a tuple body is assigned a score according to a scoring function, which in our case is based on the distance in the underlying space between each object and the head. For a given subset of body objects, the probability of an object being ranked as most similar to the head is its score divided by the sum of the scores of all objects in that subset, and we generate each simulated oracle response by sequentially sampling objects without replacement from a tuple according to this model. We chose this tested response model to differ from the one we use to estimate mutual information in order to demonstrate the robustness of our method to mismatched noise models, and evaluate an additional Gaussian noise model in the supplement. This dataset was used to compare InfoTuple-3, InfoTuple-4, InfoTuple-5, CKL, Random-3, and Random-5 across noiseless, Gaussian, and Plackett-Luce oracles.
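Sequential sampling without replacement under a Plackett-Luce model can be sketched as follows. The sampler itself follows the description above; the scoring function `inverse_distance_scores` is a hypothetical choice, since the exact scoring function is not specified in this section.

```python
import random


def sample_plackett_luce(scores, rng=None):
    """Draw a ranking: repeatedly pick an item with probability score / sum(remaining scores)."""
    rng = rng or random.Random()
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        weights = [scores[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


def inverse_distance_scores(dists, eps=1e-6):
    """ASSUMED scoring function: closer objects receive larger scores."""
    return [1.0 / (d + eps) for d in dists]


rng = random.Random(0)
scores = inverse_distance_scores([0.5, 1.0, 4.0])
ranking = sample_plackett_luce(scores, rng)  # indices ordered most to least similar
```

All scores must be strictly positive, otherwise the weighted draw has no valid distribution once only zero-score items remain.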
To demonstrate the broader applicability of our work in real-world settings and evaluate our proposed technique on perceptual similarity data, we also collected a large dataset of human responses to tuplewise queries through Amazon Mechanical Turk. Drawing 3000 food images from the Food-10k dataset [33], we presented over 7000 users with a total of 192,000 varying-size tuplewise queries chosen using InfoTuple-3, InfoTuple-5, Random-3, and Random-5 as selection strategies across three repeated runs of each algorithm. Users were evaluated with one repeat query out of 25, and users who responded inconsistently to the repeat query were discarded. Query bodies were always shuffled when presented to minimize the impact of any possible order effect, and we found no significant order effect in the human responses. Initial embeddings for each of these methods were trained on 5,000 triplet queries drawn from [33]. Although experimental costs prevented us from extending the experiments in Figure 2 to larger tuple sizes, in order to verify the feasibility of having humans respond to larger tuples in practice we performed a separate data collection in which we asked users to rank randomly selected tuples up to a size of and recorded the labeling time for each response.
4.2 Evaluation Metrics
In order to directly measure the preservation of object rankings between the ground truth object coordinates and the embedding learned from oracle responses, we use Kendall’s Tau rank correlation coefficient [17]. To get an aggregate measure of quality when comparing an estimated embedding to a ground-truth embedding, we take the mean of Kendall’s Tau across the $N$ total rankings obtained by setting each object as the head and sorting all objects by embedding distance to the head. In our experiments with human respondents it is not possible to use this measure, as the “ground truth” embedding that corresponds to human preferences is not known. In these cases we instead measure the accuracy with respect to a held-out set of queries drawn from the Food-10k dataset [33], which is a common embedding quality metric [31, 33]. The holdout accuracy is the fraction of a held-out set of triplet comparisons that agrees with distances in the final learned embedding. To capture a notion of the internal coherence between a set of oracle responses and an embedding that is learned from them, we measure the mean rank correlation between each response in this set and the ranking over the same objects imputed from the learned embedding; we refer to this as the coherence of a set of tuples.
One issue that naturally arises when comparing results from strategies that select tuples of different size is normalization, as larger tuples will naturally be more informative. In human-response studies normalization is relatively straightforward, as we can simply normalize with respect to the total time spent labeling queries in order to reflect the total labeling cost. While other more comprehensive measures of labeler effort exist, labeling time is a first-order approximation for the cognitive load of a labeling task and is the most salient metric for determining the cost of a large-scale data collection. In the case of synthetic data, we instead compute a normalized query count corresponding to the number of constituent triplet comparisons defining the relation of each body point to the head in the tuple. This is justified since in practice we decompose tuples in this way when feeding them into the embedding algorithm, and corresponds to the size of a tuple’s transitive reduction (a common representation in learning-to-rank literature [18]). Additional experimental details such as hyperparameter selection are available in the supplement.
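The two embedding quality measures described above can be sketched in a few lines. This is an illustrative version that assumes no tied distances and represents held-out triplets as hypothetical `(head, nearer, farther)` tuples; the function names are not from the paper.

```python
def kendall_tau(order_a, order_b):
    """Kendall's tau between two rankings of the same items (assumes no ties)."""
    pos_b = {item: rank for rank, item in enumerate(order_b)}
    n = len(order_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if pos_b[order_a[i]] < pos_b[order_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def ranking_from_head(D, head):
    """Sort all other objects by distance to the head."""
    others = [j for j in range(len(D)) if j != head]
    return sorted(others, key=lambda j: D[head][j])


def mean_kendall_tau(D_true, D_est):
    """Average tau over the rankings obtained with each object as the head."""
    n = len(D_true)
    taus = [
        kendall_tau(ranking_from_head(D_true, h), ranking_from_head(D_est, h))
        for h in range(n)
    ]
    return sum(taus) / n


def holdout_accuracy(D_est, triplets):
    """Fraction of held-out (head, nearer, farther) triplets the embedding agrees with."""
    hits = sum(D_est[h][b] < D_est[h][c] for h, b, c in triplets)
    return hits / len(triplets)


# Distances for three objects on a line at positions 0, 1, and 3.
D = [[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
quality = mean_kendall_tau(D, D)
accuracy = holdout_accuracy(D, [(0, 1, 2), (2, 0, 1)])
```

The same `kendall_tau` function could be applied to each oracle response and the ranking of the same body objects under the learned embedding to obtain the coherence measure.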
4.3 Experimental Results
Using simulated data, we show a direct comparison of embedding quality from using InfoTuple, CKL, and Random queries under a simulated deterministic oracle (Figure 2) and two simulated stochastic oracles (Figure 2), and note that InfoTuple consistently outperformed the other methods. We note two important observations from these results: first, regardless of the oracle used, larger tuple sizes for InfoTuple tended to perform better and converge faster than did smaller tuple sizes even after normalizing for the tuple size, showing the benefit of larger tuples beyond just providing more constituent triplets. Recalling that the Plackett-Luce oracle was not directly modeled in our estimate of mutual information, this lends support to the robustness of our technique to various oracle distributions. Second, results on Random-3, Random-4, and Random-5 are comparable, implying both that the improvements seen in InfoTuple are not solely due to the difference in tuple sizes and that our choice of normalization is appropriate. Note that since random query performance did not change with tuple size, Figure 2 only shows Random-3 for the sake of visual clarity.
Using the Mechanical Turk dataset described previously, we also show that these basic results extend to real data situations when the stochastic response model is not exactly known. This dataset also allows us to examine the complexity of acquiring data with increasing tuple sizes. While larger tuple sizes produce more informative queries, it is possible that the information gained incurs a hidden cost in the complexity or labeler effort involved in acquiring the larger query. Specifically, it can be the case that maximizing query informativeness can produce queries that are more difficult to answer [2]. Fortunately, the results on tuplewise comparisons collected for our Mechanical Turk dataset indicate that this is not an issue for our proposed use case. In particular, Figure 2 shows the accuracy results when predicting the labels from a held-out set of 1200 triplet queries. These results show an increase in the effectiveness of InfoTuple adaptive selection as well as of increasing tuple sizes when plotted against the aggregate query response time. In other words, any increase in query complexity (measured by response time) is more than compensated for by the increased information acquired by the query and the increase in the resulting quality of the learned embedding.
Figure 3 explores this issue further by examining the response times for our additional timing dataset as a function of query size. There are only modest increases in the ranking time cost with increasing tuple size, leading to the significant gains observed in normalized information efficiency in this range of tuple sizes. Complexity cost will continue to increase for larger tuple sizes, the gains in information efficiency are not guaranteed to increase indefinitely, and additional factors may affect the choice of optimal tuple size for a given problem. Nevertheless, we show that up to a modest tuple size it is strictly more useful to ask tuplewise queries than triplet queries.
One possible reason why tuples outperform triplets is that asking a query that contains more objects provides additional context for the oracle about the contents of the dataset, allowing it to more reliably respond to ambiguous comparisons than if these were asked as triplet queries. As a result of this increase in context, oracles tend to respond to larger queries significantly more coherently than they do to smaller ones, as shown in Figure 4. We note that this is not guaranteed to increase indefinitely as larger tuples are considered, but the effect is noticeable for modest increases in tuple sizes and is clear when comparing 5-tuples to triplets.
5 Discussion
In this paper we proposed InfoTuple, an adaptive tuple selection strategy based on maximizing mutual information for relative tuple queries for similarity learning. We introduce the tuple query for similarity learning, present a novel set of assumptions for efficient estimation of mutual information, and through the collection of new user-response datasets, provide new insights into the gains acquired by using larger tuples in learning efficiency and query consistency. After testing on synthetic and real datasets, InfoTuple was found to more effectively learn similarity-based object embeddings than random queries and state-of-the-art triplet queries for both synthetic data (with a typical oracle model) and in a real-world experiment. The performance gains were especially evident for larger tuples and even after normalizing for tuple size, indicating that the proposed selection objective that maximizes the mutual information between the query response and the entire embedding yields information gains that are not simply due to an increase in tuple size. Taken together, these results suggest that large tuples selected with InfoTuple supply richer and more robust embedding information than their triplet and random counterparts.
In practice, larger tuple sizes can provide more context for the oracle, increasing the reliability of the responses without significant increases in labeling effort. In the pathological extreme, the level of effort almost certainly outweighs the benefits of larger tuples, as an oracle would have to provide a ranking over the entire dataset. Despite this downside in extreme tuple sizes, our human study results indicate that performance increases hold up in the real world for moderate tuple sizes. This interesting tradeoff between informativeness per query and real-world oracle behavior merits a more comprehensive study on the psychometric aspects of the problem, in the spirit of [24].
6 Acknowledgements
This work is partially supported by NSF CAREER award number CCF-1350954 and ONR grant number N00014-15-1-2619.
References
 [1] (2009) A survey of robot learning from demonstration. Robotics and autonomous systems 57 (5), pp. 469–483. Cited by: §1.
 [2] (2009) How well does active learning actually work?: time-based evaluation of cost-reduction strategies for language documentation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, pp. 296–305. Cited by: §4.3.
 [3] (1952) Rank analysis of incomplete block designs: i. the method of paired comparisons. Biometrika 39 (3/4), pp. 324–345. Cited by: §3.2.
 [4] (2019) Active embedding search via noisy paired comparisons. In International Conference on Machine Learning, pp. 902–911. Cited by: §1.
 [5] (2015) Facial similarity learning with humans in the loop. Journal of Computer Science and Technology 30 (3), pp. 499–510. Cited by: §2.
 [6] (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, pp. 129–136. Cited by: §1, §4.1.
 [7] (1999) Scale-invariance as a unifying psychological principle. Cognition 69 (3), pp. B17–B24. Cited by: §2, §3.3.
 [8] (2012) Elements of information theory. John Wiley & Sons. Cited by: §3.
 [9] (2015) Learning to rank based on subsequences. In Proceedings ICCV 2015, pp. 2785–2793. Cited by: §1.
 [10] (2004) Simultaneous object recognition and segmentation by image exploration. In Computer Vision - ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech Republic, May 11-14, 2004. Proceedings, Part I, pp. 40–54. ISBN 9783540246701. Cited by: §1.
 [11] (2009) Bayesian inference for Plackett-Luce ranking models. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 377–384. Cited by: §4.1.
 [12] (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Cited by: §1, §2.
 [13] (2012) Collaborative gaussian processes for preference learning. In Advances in neural information processing systems, pp. 2096–2104. Cited by: §3.
 [14] (2016) Finite sample prediction and recovery bounds for ordinal embedding. In Advances In Neural Information Processing Systems, pp. 2711–2719. Cited by: §2.
 [15] (2011) Low-dimensional embedding using adaptively selected ordinal data. In Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on, pp. 1077–1084. Cited by: §1, §2.
 [16] (2011) Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems, pp. 2240–2248. Cited by: §1.
 [17] (1938) A new measure of rank correlation. Biometrika 30 (1/2), pp. 81–93. Cited by: §4.2.
 [18] (2009) Comparing apples and oranges through partial orders: an empirical approach. In American Control Conference, 2009. ACC’09., pp. 5434–5439. Cited by: §4.2.
 [19] (2014) Beyond comparing image pairs: setwise active learning for relative attributes. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 208–215. Cited by: §1, §2.
 [20] (1956) On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pp. 986–1005. Cited by: §1.
 [21] (2012) Metric learning from relative comparisons by minimizing squared residual. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pp. 978–983. Cited by: §2.
 [22] (2019) Uncertainty estimates for ordinal embeddings. arXiv preprint arXiv:1906.11655. Cited by: item 4.
 [23] (1992) Information-based objective functions for active data selection. Neural computation 4 (4), pp. 590–604. Cited by: §1.
 [24] (1956) The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological review 63 (2), pp. 81. Cited by: §5.
 [25] (2011) Relative attributes. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 503–510. Cited by: §1.
 [26] (2015) Tropel: crowdsourcing detectors with minimal training. In HCOMP, pp. 150–159. Cited by: §2.
 [27] (2013) Active learning from relative queries. In IJCAI, pp. 1614–1620. Cited by: §2.
 [28] (2012) Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6 (1), pp. 1–114. Cited by: §1.
 [29] (1990) Models for distributions on permutations. Journal of the American Statistical Association 85 (410), pp. 558–564. Cited by: item 2.
 [30] (2011) Adaptively learning the crowd kernel. arXiv preprint arXiv:1105.1033. Cited by: §1, §1, §1, §2, item 2, §3.2, §3.4, §3.
 [31] (2012) Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pp. 1–6. Cited by: §1, §2, item 2, §4.2.
 [32] (2014) Cost-effective HITs for relative similarity comparisons. In Second AAAI Conference on Human Computation and Crowdsourcing. Cited by: §2.
 [33] (2015) Learning concept embeddings with combined human-machine expertise. In International Conference on Computer Vision (ICCV). Cited by: §4.1, §4.2.
 [34] (2010) A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (1), pp. 30–44. Cited by: §1.
 [35] (2005) SVM selective sampling for ranking with application to data retrieval. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pp. 354–363. Cited by: §2.
Appendix A Supplementary Material
A.1 Experimental Details
For each of the human-subject experiments, was set to 0.1 and was set to 4 per the hyperparameter search shown in Figure 5. The validation set for this search was an additional 500 held-out triplets from the Food10k dataset. In the synthetic experiments provided, was set to 0.5 and d was set to 2 to match the dimensionality of the generating distribution. The stochastic oracle had a high noise level, inverting 33% of tuple responses. Larger tuple sizes were strongly correlated with both higher performance and higher robustness to error (even when normalized by the effective number of pairwise queries), indicating performance gains for InfoTuple that are not simply due to increasing tuple sizes. A heuristic was used to pick a number of samples for the Monte Carlo estimation of the mutual information, with samples being used in practice.
Figure 2 in the paper body shows the empirical performance of the query selection algorithms at predicting labels for held-out triplet queries in the Mechanical Turk dataset described there. Experimental horizons for the human-subject experiments were chosen based on estimates of the initial steps of convergence and had to be limited due to high experimental costs. Turk subjects were presented with queries in batches of 25, with one tuple repeated within the batch as a test for validity. If the user did not answer the repeated query the same way both times it was asked, the batch was discarded. Order effects were controlled for by shuffling queries before presenting them to users for labeling, ensuring that any queries presented to multiple users would appear in different orders and that the test queries would also appear differently each time.
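The batch validity check described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names `make_batch` and `batch_is_valid` are hypothetical, queries are assumed to be hashable tuples of object identifiers, and a batch of 25 is assumed to consist of 24 distinct queries plus one repeat.

```python
import random

def make_batch(queries, rng, batch_size=25):
    """Build one shuffled batch: (batch_size - 1) distinct queries plus one repeat.

    Assumption: the repeated query counts toward the batch size of 25.
    """
    if len(set(queries)) != batch_size - 1:
        raise ValueError("expected batch_size - 1 distinct queries")
    repeat = rng.choice(list(queries))   # query that will be shown twice
    batch = list(queries) + [repeat]
    rng.shuffle(batch)                   # control for order effects
    return batch, repeat

def batch_is_valid(batch, responses, repeat):
    """Keep a batch only if both presentations of the repeated query agree."""
    answers = [r for q, r in zip(batch, responses) if q == repeat]
    return len(answers) == 2 and answers[0] == answers[1]
```

A batch for which `batch_is_valid` returns `False` would be discarded in full. The sketch shuffles only the order of queries within a batch and does not attempt to vary how an individual query is displayed.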
A.2 Oracle Details
Two different models of oracle noise were used in our synthetic experiments: Plackett-Luce noise and Gaussian noise. These models were chosen to be different from the one we use to estimate mutual information in order to demonstrate the robustness of our method. In the body of the paper we describe the selection process used by the Plackett-Luce oracle, which works by assigning latent scores to objects on the basis of their distances in some synthetic “ground truth” embedding space. The Gaussian noise model, instead of applying noise directly at the level of the ranking responses, applies noise at the level of the oracle’s representation of the “ground truth” embedding: Gaussian noise is added to the coordinates of each point drawn from the “ground truth” embedding, and a ranking is then imputed from distances in the oracle’s noisy interpretation of the space. For the Plackett-Luce error model results shown in the paper body, 33% of individual rankings were inverted.
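To make the difference between the two noise models concrete, the following is a minimal sketch of both oracles, not the implementation used in the experiments. The function names, the `sigma` parameter, and in particular the latent score `exp(-distance)` in the Plackett-Luce oracle are assumptions made for illustration; the score actually used is the one defined in the paper body.

```python
import numpy as np

def gaussian_oracle(X, head, bodies, sigma, rng):
    """Rank `bodies` by distance to `head` in a noise-perturbed copy of X.

    Noise is applied to the embedding coordinates, not to the ranking itself.
    """
    noisy = X + rng.normal(scale=sigma, size=X.shape)
    dist = np.linalg.norm(noisy[bodies] - noisy[head], axis=1)
    return [bodies[i] for i in np.argsort(dist)]

def plackett_luce_oracle(X, head, bodies, rng, score=lambda d: np.exp(-d)):
    """Sample a ranking of `bodies` from a Plackett-Luce model.

    Assumption: the latent score decreases with distance to the head via the
    hypothetical default `score`; closer objects tend to be ranked earlier.
    """
    dist = np.linalg.norm(X[bodies] - X[head], axis=1)
    w = score(dist)
    remaining = list(range(len(bodies)))
    ranking = []
    while remaining:
        p = w[remaining] / w[remaining].sum()    # renormalize over what is left
        pick = int(rng.choice(len(remaining), p=p))
        ranking.append(bodies[remaining.pop(pick)])
    return ranking
```

With `sigma = 0` the Gaussian oracle returns the noiseless distance ranking, while the Plackett-Luce oracle stays stochastic for any finite scores. That is the distinction drawn above: one model perturbs the space and ranks deterministically, the other keeps the space fixed and perturbs the response.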