Learning Nearest Neighbor Graphs from Noisy Distance Samples
Abstract
We consider the problem of learning the nearest neighbor graph of a dataset of items. The metric is unknown, but we can query an oracle to obtain a noisy estimate of the distance between any pair of items. This framework applies to problem domains where one wants to learn people’s preferences from responses commonly modeled as noisy distance judgments. In this paper, we propose an active algorithm to find the graph with high probability and analyze its query complexity. In contrast to existing work that forces Euclidean structure, our method is valid for general metrics, assuming only symmetry and the triangle inequality. Furthermore, we demonstrate efficiency of our method empirically and theoretically, needing only queries in favorable settings, where accounts for the effect of noise. Using crowdsourced data collected for a subset of the UT Zappos50K dataset, we apply our algorithm to learn which shoes people believe are most similar and show that it beats both an active baseline and ordinal embedding.
Blake Mason* (University of Wisconsin, Madison, WI 53706, bmason3@wisc.edu), Ardhendu Tripathy* (University of Wisconsin, Madison, WI 53706, astripathy@wisc.edu), Robert Nowak (University of Wisconsin, Madison, WI 53706, rdnowak@wisc.edu). *Authors contributed equally to this paper and are listed alphabetically.
1 Introduction
In modern machine learning applications, we frequently seek to learn proximity/similarity relationships between a set of items given only noisy access to pairwise distances. For instance, practitioners wishing to estimate internet topology frequently collect one-way-delay measurements to estimate the distance between a pair of hosts (Eriksson et al., 2010). Such measurements are affected by physical constraints as well as server load, and are often noisy. Researchers studying movement in hospitals from WiFi localization data likewise contend with noisy distance measurements due to both temporal variability and varying signal strengths inside the building (Booth et al., 2019). Additionally, human judgments are commonly modeled as noisy distances (Shepard, 1962; Kruskal, 1964). As an example, Amazon Discover asks customers their preferences about different products and uses this feedback to recommend new items it believes are similar. We are often primarily interested in the closest or most similar item to a given one, e.g., the closest server, the closest doctor, or the most similar product. The particular item of interest may not be known a priori. Internet traffic can fluctuate, different patients may suddenly need attention, and customers may be looking for different products. To handle this, we must learn the closest/most similar item for each item. This paper introduces the problem of learning the Nearest Neighbor Graph that connects each item to its nearest neighbor from noisy distance measurements.
Problem Statement: Consider a set of points in a metric space. The metric is unknown, but we can query a stochastic oracle for an estimate of any pairwise distance. In as few queries as possible, we seek to learn a nearest neighbor graph of that is correct with probability , where each is a vertex and has a directed edge to its nearest neighbor .
1.1 Related work
Nearest neighbor problems (from noiseless measurements) are well studied and we direct the reader to Bhatia et al. (2010) for a survey. Clarkson (1983), Vaidya (1989), and Sankaranarayanan et al. (2007) all provide theory and algorithms to learn the nearest neighbor graph which apply in the noiseless regime. Note that the problem in the noiseless setting is very different. If noise corrupts measurements, the methods from the noiseless setting can suffer persistent errors. There has been recent interest in introducing noise via subsampling for a variety of distance problems (LeJeune et al., 2019; Bagaria et al., 2017, 2018), though the noise here is not actually part of the data but introduced for efficiency. In our algorithm, we use the triangle inequality to get tighter estimates of noisy distances in a process equivalent to the classical Floyd–Warshall algorithm (Floyd, 1962; Cormen et al., 2009). This has strong connections to the metric repair literature (Brickell et al., 2008; Gilbert and Jain, 2017) where one seeks to alter a set of noisy distance measurements as little as possible to learn a metric satisfying the standard axioms. Singla et al. (2016) similarly use the triangle inequality to bound unknown distances in a related but noiseless setting. In the specific case of noisy distances corresponding to human judgments, a number of algorithms have been proposed to handle related problems, most notably Euclidean embedding techniques, e.g., Jain et al. (2016), Van Der Maaten and Weinberger (2012), and Kruskal (1964). To reduce the load on human subjects, several attempts at an active method for learning Euclidean embeddings have been made but have seen only limited success (Jamieson et al., 2015). Among the culprits is the strict and often unrealistic modeling assumption that the metric be Euclidean and low dimensional.
1.2 Main contributions
In this paper we introduce the problem of identifying the nearest neighbor graph from noisy distance samples and propose ANNTri, an active algorithm, to solve it for general metrics. We empirically and theoretically analyze its complexity to show improved performance over a passive and an active baseline. In favorable settings, such as when the data forms clusters, ANNTri needs only queries, where accounts for the effect of noise. Furthermore, we show that ANNTri achieves superior performance compared to methods which require much stronger assumptions. We highlight two such examples. In Fig. 1(c), for an embedding in , ANNTri outperforms the common technique of triangulation that works by estimating each point’s distance to a set of anchors. In Fig. 2(b), we show that ANNTri likewise outperforms Euclidean embedding for predicting which images are most similar from a set of similarity judgments collected on Amazon Mechanical Turk. The rest of the paper is organized as follows. In Section 2, we further set up the problem. In Sections 3 and 4 we present the algorithm and analyze its theoretical properties. In Section 5 we show ANNTri’s empirical performance on both simulated and real data. In particular, we highlight its efficiency in learning from human judgments.
2 Problem setup and summary of our approach
We denote distances as where is a distance function satisfying the standard axioms and define . Though the distances are unknown, we are able to draw independent samples of its true value according to a stochastic distance oracle, i.e. querying
(1) 
where is a zero-mean sub-Gaussian random variable assumed to have scale parameter . We let denote the empirical mean of the values returned by queries made until time . The number of queries made until time is denoted as . A possible approach to obtain the nearest neighbor graph is to repeatedly query all pairs and report for all . But since we only wish to learn , if , we do not need to query as many times as . To improve our query efficiency, we could instead adaptively sample to focus queries on distances that we estimate are smaller. A simple adaptive method to find the nearest neighbor graph would be to iterate over and use a best-arm identification algorithm to find in the round. (Footnote 1: We could also proceed in a non-iterative manner, by adaptively choosing which among pairs to query next. However, this has worse empirical performance and the same theoretical guarantees as the in-order approach.) However, this procedure treats each round independently, ignoring properties of metric spaces that allow information to be shared between rounds.

Due to symmetry, for any the queries and follow the same law, and we can reuse values of collected in the round while finding in the round.

Using concentration bounds on and from samples of and collected in the round, we can bound via the triangle inequality. As a result, we may be able to state without even querying .
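The two ideas above, reuse of samples through symmetry and bounding an unqueried distance through the triangle inequality, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the paper's implementation: the class `NoisyDistanceOracle`, the helper `triangle_interval`, the Gaussian noise model, and the use of points in the plane are all hypothetical choices made here for concreteness.

```python
import math
import random


class NoisyDistanceOracle:
    """Hypothetical stand-in for a stochastic distance oracle on planar points."""

    def __init__(self, points, sigma=0.1, seed=0):
        self.points = points
        self.sigma = sigma  # assumed Gaussian noise scale
        self.rng = random.Random(seed)
        n = len(points)
        # Symmetric running sums and counts, shared by all rounds.
        self.total = [[0.0] * n for _ in range(n)]
        self.count = [[0] * n for _ in range(n)]

    def query(self, i, j):
        """Return one noisy sample of d(i, j) and record it for (i, j) and (j, i)."""
        if i == j:
            raise ValueError("query a pair of distinct points")
        true_distance = math.dist(self.points[i], self.points[j])
        sample = true_distance + self.rng.gauss(0.0, self.sigma)
        for a, b in ((i, j), (j, i)):
            self.total[a][b] += sample
            self.count[a][b] += 1
        return sample

    def mean(self, i, j):
        """Empirical mean of all samples collected so far for the pair."""
        return self.total[i][j] / self.count[i][j]


def triangle_interval(lo_ik, hi_ik, lo_jk, hi_jk):
    """Interval for d(i, j) implied by intervals on d(i, k) and d(j, k)."""
    upper = hi_ik + hi_jk
    lower = max(0.0, lo_ik - hi_jk, lo_jk - hi_ik)
    return lower, upper
```

A sample recorded while processing one point is immediately available when its partner is processed later, and `triangle_interval` may already exclude a candidate before it is ever queried.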
Our proposed algorithm ANNTri uses all the above ideas to find the nearest neighbor graph of . For general , the sample complexity of ANNTri contains a problem-dependent term that involves the order in which the nearest neighbors are found. For an consisting of sufficiently well separated clusters, this order-dependence for the sample complexity does not exist.
3 Algorithm
Our proposed algorithm (Algorithm 1) ANNTri finds the nearest neighbor graph of with probability . It iterates over in order of their subscript index and finds in the ‘round’. All bounds, counts of samples, and empirical means are stored in symmetric matrices in order to share information between different rounds. We use Python array/Matlab notation to indicate individual entries in the matrices, e.g., . The number of times a pair is queried is stored in the corresponding entry of . Matrices and record upper and lower confidence bounds on . and record the associated triangle inequality bounds. Symmetry is ensured by updating the entry at the same time as the entry for each of the above matrices. In the round, ANNTri finds the correct with probability by calling SETri (Algorithm 2), a modification of the successive elimination algorithm for best-arm identification. In contrast to standard successive elimination, at each time step SETri only samples those points in the active set that have the fewest number of samples.
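The sampling rule that sets SETri apart from standard successive elimination can be illustrated with a short sketch. The function name `least_sampled` and the dictionary of counts are hypothetical and introduced only for this example.

```python
def least_sampled(active, counts):
    """Return the active points that currently have the fewest samples.

    `active` is a set of candidate indices and `counts[j]` is the number of
    samples already held for candidate j (possibly inherited from earlier
    rounds through symmetry).
    """
    fewest = min(counts[j] for j in active)
    return [j for j in sorted(active) if counts[j] == fewest]
```

For instance, with counts `{1: 4, 2: 2, 3: 2}` only candidates 2 and 3 would be queried in the next step, so a candidate that arrives with samples from a previous round is not queried again until the others catch up.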
3.1 Confidence bounds on the distances
Using the sub-Gaussian assumption on the noise random process, we can use Hoeffding’s inequality and a union bound over time to get the following confidence intervals on the distance :
(2) 
which hold for all points at all times with probability , i.e.
(3) 
where and . Even-Dar et al. (2006) use the above procedure to derive the following upper bound for the number of oracle queries used to find :
(4) 
where for any the suboptimality gap characterizes how hard it is to rule out from being the nearest neighbor. We also set . Note that one can use tighter confidence bounds as detailed in Garivier (2013) and Jamieson and Nowak (2014) to obtain sharper bounds on the sample complexity of this subroutine.
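To make the elimination procedure concrete, the following sketch runs plain successive elimination to find the smallest of several noisy distances. It is an illustration under assumptions, not the SETri subroutine: the confidence radius is one standard anytime Hoeffding-style choice and need not match the constants in (2), the noise is taken to be Gaussian, and the names `confidence_radius` and `successive_elimination` are hypothetical.

```python
import math
import random


def confidence_radius(t, sigma, delta):
    """Assumed anytime radius after t samples; valid for all t via a union bound."""
    return math.sqrt(2.0 * sigma**2 * math.log(4.0 * t**2 / delta) / t)


def successive_elimination(true_dists, sigma=0.1, delta=0.05, seed=0,
                           max_rounds=100000):
    """Return (index of the estimated smallest distance, rounds used)."""
    rng = random.Random(seed)
    k = len(true_dists)
    active = set(range(k))
    totals = [0.0] * k
    per_arm_delta = delta / k  # union bound over the candidates
    for t in range(1, max_rounds + 1):
        # Every surviving candidate receives one more noisy sample.
        for j in active:
            totals[j] += true_dists[j] + rng.gauss(0.0, sigma)
        radius = confidence_radius(t, sigma, per_arm_delta)
        best_upper = min(totals[j] / t + radius for j in active)
        # Keep a candidate only if its lower bound is below the best upper bound.
        active = {j for j in active if totals[j] / t - radius < best_upper}
        if len(active) == 1:
            return next(iter(active)), t
    return min(active, key=lambda j: totals[j] / max_rounds), max_rounds
```

Candidates with a large suboptimality gap are dropped after very few samples, while near-ties keep being sampled, which is the behavior the bound in (4) quantifies.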
3.2 Computing the triangle bounds and active set
Since is only computed within SETri, we abuse notation and use its argument to indicate the time counter private to SETri. Thus, the initial active set computed by SETri when called in the round is denoted . During the round, the active set contains all points that have not been eliminated from being the nearest neighbor of at time . We define ’s active set at time as
(5) 
Assuming and are valid lower and upper bounds on respectively, (5) states that point is active if its lower bound is less than the minimum upper bound for over all choices of . Next, for any we construct triangle bounds on the distance . Intuitively, for some reals , if and then , and
(6) 
where . The lower bound can be seen as true by Fig. 7 in the Appendix. Lemma 3.1 uses these ideas to form upper and lower bounds on distances by the triangle inequality.
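The active-set rule described above can be sketched as follows. The function `active_set` and the matrices of bounds are hypothetical; each entry is assumed to already hold the tightest available lower or upper bound, whether it came from concentration or from the triangle inequality.

```python
def active_set(i, lower, upper):
    """Candidates j whose lower bound on d(i, j) is below the smallest upper bound."""
    n = len(lower)
    others = [j for j in range(n) if j != i]
    best_upper = min(upper[i][k] for k in others)
    return {j for j in others if lower[i][j] < best_upper}
```

With bounds [0.9, 1.1] on the distance from point 0 to point 1 and [7.8, 10.2] on the distance from point 0 to point 2, only point 1 remains active for point 0.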
Lemma 3.1.
For all , set . For any define
(7) 
For all , set . For any define
(8) 
where . If all the bounds obtained by SETri in rounds are correct then
The proof is in Appendix B.1. ANNTri has access to two sources of bounds on distances: concentration bounds and triangle inequality bounds, and as can be seen in Lemma 3.1, the former affects the latter. Furthermore, triangle bounds are computed from other triangle bounds, leading to the recursive definitions of and . Because of these facts, triangle bounds are dependent on the order in which ANNTri finds each nearest neighbor. These bounds can be computed using dynamic programming, so a brute-force search over all possible is not necessary. We note that the above recursion is similar to the Floyd–Warshall algorithm for finding shortest paths between all pairs of vertices in a weighted graph (Floyd, 1962; Cormen et al., 2009). The results in Singla et al. (2016) show that the triangle bounds obtained in this manner have the minimum norm between the upper and lower bound matrices.
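The connection to Floyd–Warshall can be illustrated with a generic bound-tightening routine. This sketch is not the round-ordered recursion of Lemma 3.1: it tightens all pairs at once, the function name `tighten_bounds` is hypothetical, and unsampled pairs are assumed to start with a lower bound of 0 and an upper bound of infinity. The upper bounds it returns are shortest-path distances over the initial upper bounds; the lower bounds are valid but not claimed to be the tightest possible.

```python
import math


def tighten_bounds(upper, lower):
    """Tighten symmetric bound matrices using the triangle inequality."""
    n = len(upper)
    U = [row[:] for row in upper]
    L = [row[:] for row in lower]
    # Upper bounds: d(i, j) <= d(i, k) + d(k, j), exactly Floyd-Warshall.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if U[i][k] + U[k][j] < U[i][j]:
                    U[i][j] = U[i][k] + U[k][j]
    # Lower bounds: d(i, j) >= d(i, k) - d(k, j) and d(i, j) >= d(k, j) - d(i, k).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                candidate = max(L[i][k] - U[k][j], L[k][j] - U[i][k])
                if candidate > L[i][j]:
                    L[i][j] = candidate
    return U, L
```

For three points where the pairs (0, 1) and (1, 2) have bounds [0.9, 1.1] and [8.9, 9.1] and the pair (0, 2) has never been queried, the routine yields the interval [7.8, 10.2] for the pair (0, 2), which is enough to rule out point 2 as the nearest neighbor of point 0 without a single query.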
4 Analysis
All omitted proofs of this section can be found in the Appendix Section B.
Theorem 4.1.
ANNTri finds the nearest neighbor for each point in with probability .
4.1 A simplified algorithm
The following Lemma indicates which points must be eliminated initially in the round.
Lemma 4.2.
If , then for ANNTri.
Proof.
∎
Next, we define ANNEasy, a simplified version of ANNTri that is more amenable to analysis. Here, we say that is eliminated in the round of ANNEasy if i) and (symmetry from past samples) or ii) (Lemma 4.2). Therefore, ’s active set for ANNEasy is
(9) 
To define ANNEasy in code, we remove lines 3–6 of ANNTri (Algorithm 1), and call a subroutine SEEasy in place of SETri. SEEasy matches SETri (Algorithm 2) except that lines 1 and 8 are replaced with (9) instead. We provide full pseudocode of both ANNEasy and SEEasy in Appendix A.1.1. Though ANNEasy is a simplification for analysis, we note that it empirically captures much of the same behavior as ANNTri. In Appendix A.1.2 we provide an empirical comparison of the two.
4.2 Complexity of ANNEasy
We now turn our attention to account for the effect of the triangle inequality in ANNEasy.
Lemma 4.3.
For any if the following conditions hold for some , then .
(10) 
The first condition characterizes which ’s must satisfy the condition in Lemma 4.2 for the round. The second guarantees that was sampled in the round, a necessary condition for forming triangle bounds using .
Theorem 4.4.
Conditioned on the event that all confidence bounds are valid at all times, ANNEasy learns the nearest neighbor graph of in the following number of calls to the distance oracle:
(11) 
In the above expression and , if does not satisfy the triangle inequality elimination conditions of (10) , and otherwise.
In Theorem B.6, in the Appendix, we state the sample complexity when triangle inequality bounds are ignored by ANNTri, and this upper bounds (11). Whether a point can be eliminated by the triangle inequality depends both on the underlying distances and the order in which ANNTri finds each nearest neighbor (cf. Lemma 4.3). In general, this dependence on the order is necessary to ensure that past samples exist and may be used to form upper and lower bounds. Furthermore, it is worth noting that even without noise the triangle inequality may not always help. A simple example is any arrangement of points such that . To see this, consider triangle bounds on any distance due to any . Then so . Thus no triangle upper bounds separate from triangle lower bounds so no elimination via the triangle inequality occurs. In such cases, it is necessary to sample all distances. However, in more favorable settings where data may be split into clusters, the sample complexity can be much lower by using the triangle inequality.
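The failure case above can be checked numerically. The sketch below assumes noiseless distances and an arrangement in which every pair of distinct points is at the same distance c; the helper names `equidistant` and `best_triangle_interval` are hypothetical.

```python
def equidistant(n, c=1.0):
    """Distance matrix in which all distinct points are at distance c."""
    return [[0.0 if i == j else c for j in range(n)] for i in range(n)]


def best_triangle_interval(dist, i, j):
    """Tightest interval for d(i, j) using exact distances to every third point."""
    others = [k for k in range(len(dist)) if k not in (i, j)]
    lower = max(abs(dist[i][k] - dist[j][k]) for k in others)
    upper = min(dist[i][k] + dist[j][k] for k in others)
    return lower, upper
```

For the equidistant arrangement every pair receives the interval [0, 2c], so all intervals overlap and nothing can be eliminated. By contrast, for three points on a line at positions 0, 1, and 10, the pair formed by the two endpoints receives the interval [8, 10], which lies entirely above the distance 1 between the first two points.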
4.3 Adaptive gains via the triangle inequality
We highlight two settings where ANNTri provably achieves sample complexity better than independent of the order of the rounds. Consider a dataset containing clusters of points each as in Fig. 0(a). Denote the cluster as and suppose the distances between the points are such that
(12) 
The above condition is ensured if the distance between any two points belonging to different clusters is at least a dependent constant plus twice the diameter of any cluster.
Theorem 4.5.
Consider a dataset of clusters which satisfy the condition in (12). Then ANNEasy learns the correct nearest neighbor graph of with probability at least in
(13) 
queries where is the average number of samples distances between points in the same cluster.
By contrast, random sampling requires where . In fact, the value in (11) can be even lower if unions of clusters also satisfy (12). In this case, the triangle inequality can be used to separate groups of clusters. For example, in Fig. 0(b), if and satisfy (12) along with , then the triangle inequality can separate and . This process can be generalized to consider a dataset that can be split recursively into subclusters following a binary tree of levels. At each level, the clusters are assumed to satisfy (12). We refer to such a dataset as hierarchical in (12).
Theorem 4.6.
Consider a dataset of clusters of size that is hierarchical in (12). Then ANNEasy learns the correct nearest neighbor graph of with probability at least in
(14) 
queries where is the average number of samples distances between points in the same cluster.
Expression (14) matches known lower bounds of on the sample complexity for learning the nearest neighbor graph from noiseless samples (Vaidya, 1989); the additional penalty of is due to the effect of noise in the samples. In Appendix C, we state the sample complexity in the average case, as opposed to the high probability statements above. The analog of the cluster condition (12) there does not involve constants and is solely in terms of pairwise distances (cf. (33)).
5 Experiments
Here we evaluate the performance of ANNTri on simulated and real data. To construct the tightest possible confidence bounds for SETri, we use the law of the iterated logarithm as in Jamieson and Nowak (2014) with parameters and . Our analysis bounds the number of queries made to the oracle. We visualize the performance by tracking the empirical error rate with the number of queries made per point. For a given point , we say that a method makes an error at the sample if it fails to return as the nearest neighbor, that is, . Throughout, we will compare ANNTri against random sampling. Additionally, to highlight the effect of the triangle inequality, we will compare our method against the same active procedure, but ignoring triangle inequality bounds (referred to as ANN in plots). All baselines may reuse samples via symmetry as well. We plot all curves with confidence regions shaded.
5.1 Simulated Experiments
To test the effectiveness of our method, we generate an embedding of clusters of points spread around a circle such that each cluster is separated by at least of its diameter in as shown in Fig. 1(a). We consider Gaussian noise with . In Fig. 1(b), we present average error rates of ANNTri, ANN, and Random plotted on a log scale. ANNTri quickly learns and has lower error with samples due to initial elimination by the triangle inequality. The error curves are averaged over repetitions. All rounds were capped at samples for efficiency.
5.1.1 Comparison to triangulation
An alternative way for a practitioner to obtain the nearest neighbor graph is to estimate distances with respect to a few anchor points and then triangulate to learn the rest. Eriksson et al. (2010) provide a comprehensive example, and we summarize it in Appendix A.2 for completeness. The triangulation method is naïve for two reasons. First, it requires much stronger modeling assumptions than ANNTri, namely that the metric is Euclidean and the points lie in a low-dimensional space of known dimension. Forcing Euclidean structure can lead to unpredictable errors if the underlying metric is not Euclidean, such as in data from human judgments. Second, this procedure may be more noise sensitive because it estimates squared distances. In the example in Section A.2, this leads to the additive noise being sub-exponential rather than sub-Gaussian. In Fig. 1(c), we show that even in a favorable setting where distances are truly sampled from a low-dimensional Euclidean embedding and pairwise distances between anchors are known exactly, triangulation still performs poorly compared to ANNTri. We consider the same dimensional embedding of points as in Fig. 1(a) for a noise variance of and compare ANNTri and triangulation for different numbers of samples.
5.2 Human judgment experiments
5.2.1 Setup
Here we consider the problem of learning from human judgments. For this experiment, we used a set of images of shoes drawn from the UT Zappos50k dataset (Yu and Grauman, 2014, 2017) and seek to learn which shoes are most visually similar. To do this, we consider queries of the form “between , , and , which two are most similar?”. We show example queries in Figs. 4(a) and 4(b) in the Appendix. Each query maps to a pair of triplet judgments of the form “is or more similar to ?”. For instance, if and are chosen, then we may infer the judgments “ is more similar to than to ” and “ is more similar to than to ”. We therefore construct these queries from a set of triplets collected from participants on Mechanical Turk by Heim et al. (2015). The set contains multiple samples of all unique triples so that the probability of any triplet response can be estimated. We expect that is most commonly selected as being more similar to than any third point . We take distance to correspond to the fraction of times that two images , are judged as being more similar to each other than a different pair in a triplet query . Let be the event that the pair are chosen as most similar amongst , , and . Accordingly, we define the ‘distance’ between images and as
where is drawn uniformly from the remaining images in . For a fixed value of ,
where the probabilities are the empirical probabilities of the associated triplets in the dataset. This distance is a quasimetric on our dataset: it does not always satisfy the triangle inequality, but satisfies it with a multiplicative constant: . Relaxing metrics to quasimetrics has a rich history in the classical nearest neighbors literature (Houle and Nett, 2015; Tschopp et al., 2011; Goyal et al., 2008), and ANNTri can be trivially modified to handle quasimetrics. However, we empirically note that of the distances violate the ordinary triangle inequality here so we ignore this point in our evaluation.
5.2.2 Results
When ANNTri or any baseline queries from the oracle, we randomly sample a third point and flip a coin with probability . The resulting sample is an unbiased estimate of the distance between and . In Fig. 2(a), we compare the error rate averaged over trials of ANNTri compared to Random and STE. We also plot associated gains in sample complexity by ANNTri. In particular, we see gains of x over random sampling, and gains up to x relative to ordinal embedding. ANNTri also shows x gains over ANN in sample complexity (see Fig. 6 in Appendix).
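The sampling scheme described above, drawing a third point uniformly and then flipping a biased coin, can be sketched as follows. This is an illustration only: the table `coin_bias`, which maps each candidate third point to the bias of its coin, is hypothetical, its numbers are made up, and how each bias is computed from the triplet responses follows the distance definition in Section 5.2.1 rather than anything in this sketch.

```python
import random


def noisy_distance_sample(coin_bias, rng):
    """Draw a third point uniformly, then return a 0/1 coin flip with its bias."""
    third_points = sorted(coin_bias)
    k = rng.choice(third_points)
    return 1.0 if rng.random() < coin_bias[k] else 0.0


def average_of_samples(coin_bias, num_samples, seed=0):
    """Empirical mean of repeated samples for one fixed pair of images."""
    rng = random.Random(seed)
    draws = [noisy_distance_sample(coin_bias, rng) for _ in range(num_samples)]
    return sum(draws) / num_samples
```

Because the third point is uniform, the expectation of one sample equals the average of the coin biases over all third points, which is why a binary response can serve as an unbiased sample of a distance defined as such an average. With made-up biases `{2: 0.2, 3: 0.5, 4: 0.8}`, the empirical mean concentrates around 0.5.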
Additionally, a standard way of learning from triplet data is to perform ordinal embedding. With a learned embedding, the nearest neighbor graph may easily be computed. In Fig. 2(b), we compare ANNTri against the state-of-the-art STE algorithm (Van Der Maaten and Weinberger, 2012) for estimating Euclidean embeddings from triplets, and select the embedding dimension of via cross validation. To normalize the number of samples, we first perform ANNTri with a given max budget of samples and record the total number needed. Then we select a random set of triplets of the same size and learn an embedding in via STE. We compare both methods on the fraction of nearest neighbors predicted correctly. On the axis, we show the total number of triplets given to each method. For small dataset sizes, there is little difference; for larger dataset sizes, however, ANNTri significantly outperforms STE. Given that ANNTri is active, it is reasonable to wonder if STE would perform better with an actively sampled dataset, such as that of Tamuz et al. (2011). Many of these methods are computationally intensive and lack empirical support (Jamieson et al., 2015), but we can embed using the full set of triplets to mitigate the effect of the subsampling procedure. Doing so, STE achieves error, within the confidence bounds of the largest subsample shown in Fig. 2(b). In particular, more data and more carefully selected datasets may not correct for the bias induced by forcing Euclidean structure.
6 Conclusion
In this paper we solve the nearest neighbor graph problem by adaptively querying distances. Our method makes no assumptions beyond standard metric properties and is empirically shown to achieve sample complexity gains over passive sampling. In the case of clustered data, we show provable gains and achieve optimal rates in favorable settings.
References
 Bagaria et al. [2017] Vivek Bagaria, Govinda M Kamath, Vasilis Ntranos, Martin J Zhang, and David Tse. Medoids in almost linear time via multi-armed bandits. arXiv preprint arXiv:1711.00817, 2017.
 Bagaria et al. [2018] Vivek Bagaria, Govinda M Kamath, and David N Tse. Adaptive Monte-Carlo optimization. arXiv preprint arXiv:1805.08321, 2018.
 Bhatia et al. [2010] Nitin Bhatia et al. Survey of nearest neighbor techniques. arXiv preprint arXiv:1007.0085, 2010.
 Booth et al. [2019] Brandon M Booth, Tiantian Feng, Abhishek Jangalwa, and Shrikanth S Narayanan. Toward robust interpretable human movement pattern analysis in a workplace setting. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7630–7634. IEEE, 2019.
 Brickell et al. [2008] Justin Brickell, Inderjit S Dhillon, Suvrit Sra, and Joel A Tropp. The metric nearness problem. SIAM Journal on Matrix Analysis and Applications, 30(1):375–396, 2008.
 Clarkson [1983] Kenneth L Clarkson. Fast algorithms for the all nearest neighbors problem. In 24th Annual Symposium on Foundations of Computer Science (sfcs 1983), pages 226–232. IEEE, 1983.
 Cormen et al. [2009] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2009.
 Eriksson et al. [2010] Brian Eriksson, Paul Barford, Joel Sommers, and Robert Nowak. A learning-based approach for IP geolocation. In International Conference on Passive and Active Network Measurement, pages 171–180. Springer, 2010.
 Even-Dar et al. [2006] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(Jun):1079–1105, 2006.
 Floyd [1962] Robert W Floyd. Algorithm 97: shortest path. Communications of the ACM, 5(6):345, 1962.
 Garivier [2013] Aurélien Garivier. Informational confidence bounds for self-normalized averages and applications. In 2013 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2013.
 Gilbert and Jain [2017] Anna C Gilbert and Lalit Jain. If it ain’t broke, don’t fix it: Sparse metric repair. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 612–619. IEEE, 2017.
 Goyal et al. [2008] Navin Goyal, Yury Lifshits, and Hinrich Schütze. Disorder inequality: a combinatorial approach to nearest neighbor search. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 25–32. ACM, 2008.
 Heim et al. [2015] Eric Heim, Matthew Berger, Lee Seversky, and Milos Hauskrecht. Active perceptual similarity modeling with auxiliary information. arXiv preprint arXiv:1511.02254, 2015.
 Houle and Nett [2015] Michael E Houle and Michael Nett. Rank-based similarity search: Reducing the dimensional dependence. IEEE transactions on pattern analysis and machine intelligence, 37(1):136–150, 2015.
 Jain et al. [2016] Lalit Jain, Kevin G Jamieson, and Rob Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Advances In Neural Information Processing Systems, pages 2711–2719, 2016.
 Jamieson and Nowak [2014] K. Jamieson and R. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6, March 2014. doi: 10.1109/CISS.2014.6814096.
 Jamieson et al. [2013] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. On finding the largest mean among many. arXiv preprint arXiv:1306.3917, 2013.
 Jamieson et al. [2015] Kevin G Jamieson, Lalit Jain, Chris Fernandez, Nicholas J Glattard, and Rob Nowak. Next: A system for real-world development, evaluation, and application of active learning. In Advances in Neural Information Processing Systems, pages 2656–2664, 2015.
 Kruskal [1964] Joseph B Kruskal. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2):115–129, 1964.
 LeJeune et al. [2019] Daniel LeJeune, Richard G Baraniuk, and Reinhard Heckel. Adaptive estimation for approximate k-nearest-neighbor computations. arXiv preprint arXiv:1902.09465, 2019.
 Sankaranarayanan et al. [2007] Jagan Sankaranarayanan, Hanan Samet, and Amitabh Varshney. A fast all nearest neighbor algorithm for applications involving large point-clouds. Computers & Graphics, 31(2):157–174, 2007.
 Shepard [1962] Roger N Shepard. The analysis of proximities: multidimensional scaling with an unknown distance function. i. Psychometrika, 27(2):125–140, 1962.
 Singla et al. [2016] Adish Singla, Sebastian Tschiatschek, and Andreas Krause. Actively learning hemimetrics with applications to eliciting user preferences. In International Conference on Machine Learning, pages 412–420, 2016.
 Tamuz et al. [2011] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. arXiv preprint arXiv:1105.1033, 2011.
 Tschopp et al. [2011] Dominique Tschopp, Suhas Diggavi, Payam Delgosha, and Soheil Mohajer. Randomized algorithms for comparison-based search. In Advances in Neural Information Processing Systems, pages 2231–2239, 2011.
 Vaidya [1989] Pravin M Vaidya. An O(n log n) algorithm for the all-nearest-neighbors problem. Discrete & Computational Geometry, 4(2):101–115, 1989.
 Van Der Maaten and Weinberger [2012] Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In 2012 IEEE International Workshop on Machine Learning for Signal Processing, pages 1–6. IEEE, 2012.
 Yu and Grauman [2014] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In Computer Vision and Pattern Recognition (CVPR), Jun 2014.
 Yu and Grauman [2017] A. Yu and K. Grauman. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In International Conference on Computer Vision (ICCV), Oct 2017.
Appendix
Appendix A Additional experimental results and details
A.1 Differences between ANNTri and ANNEasy
A.1.1 Pseudocode for ANNEasy and SEEasy
A.1.2 Empirical differences in performance for ANNTri and ANNEasy
In Figure 4 we compare the empirical performance of ANNTri and ANNEasy. We compare their performance in the same setting as Figure 1(a) with clusters of points separated by at least of their diameter. The curves are averaged over independent trials and plotted with confidence regions. As is indicated in the plot, ANNEasy behaves similarly to ANNTri, but achieves slightly worse performance.
A.2 Triangulation
In this section, we provide a brief review of triangulation to estimate Euclidean embeddings, similar to the presentation in Eriksson et al. (2010). The method is summarized as follows. Let be a set of points in Euclidean space and be the associated Euclidean distance matrix where each entry is the square of the associated Euclidean distance. Let be a set of anchor points. Without loss of generality, we take . The is to correct for the fact that Euclidean distance matrices have rank . Let and . Then it can easily be verified that . To learn the entries in (as well as ), sample the distance from each of the points to the anchors as many times as the budget allows and square the results. The empirical mean is a plug-in estimator of the associated entry in and , and we take and to be their unbiased estimates. Therefore is an unbiased estimate of . With , the nearest neighbor graph can easily be computed.
A.3 Additional experimental results for Zappos dataset
In Fig. 5 we show two example queries of the form “which pair are most similar of these three?”. Some queries are more straightforward whereas some are more subjective.
Additionally, in Fig. 6, we show the performance of ANNTri, ANN, and Random in identifying nearest neighbors from the Zappos data. In this setting, there is less of an advantage to using the triangle inequality due to the highly noisy and subjective nature of human judgments. Despite this, we still see a slight advantage to ANNTri over ANN. In particular, for moderate accuracy, there is a gain in sample complexity of around x.
Appendix B Proofs and technical lemmas
B.1 Proof of Lemma 3.1
By symmetry for all , we have existing samples of and and we use bounds based on these samples as well as past triangle inequality upper bounds on and due to and respectively. The upper bound is derived as follows:
Since we may form bounds based on all for which we have both samples of and , we may optimize over to get the tightest possible triangle inequality bounds on .
Lower bounds are derived similarly. Again, intuitively, we may use past samples of both and and associated bounds to derive a lower bound on . The form is slightly more complicated here since we have to worry about both upper and lower bounds on and . These bounds may either be from concentration bounds based on past samples directly or past triangle inequality upper and lower bounds on these distances due to points .
where and , (not necessarily unique) are chosen to optimize the bound. Similar to the upper bound, this holds with respect to any and we optimize over . To ease presentation, let and be the tightest upper and lower bounds for . For the lower bound, note that if the argument of is negative, then any
can be the value of both and as it lies in both their confidence intervals. Then points can possibly be at the same location in the metric space, in which case . On the other hand if the RHS is positive, then and cannot be at the same location as . In fact, the smallest possible value for occurs if are collinear. This can be seen to be true from Figure 7.
We finish with a quick lemma noting what can and cannot be eliminated via the triangle inequality.
Lemma B.1.
Conditioned on the good event that all bounds are correct at all times, the triangle inequality cannot be used to separate the two closest points to any given third point.
Proof.
Consider finding . Let . By the triangle inequality, Clearly, the RHS is no smaller than . Since we are conditioning on all bounds being correct at all times, no upper bound on from the triangle inequality can ever be smaller than . Rearranging the inequality, we see that . The LHS is no larger than , and is the only distance with respect to that is smaller than by assumption. Therefore, no lower bound on due to the triangle inequality is greater than . ∎
B.1.1 Helper Lemmas
Lemma B.2.
Let index the rounds of the procedure SETri in finding . Suppose all confidence intervals are valid, i.e., (3) is true. Then and all ,
(15) 
Proof.
If the good event (3) is true then for any pair and time we have
A similar calculation can be done for as well. ∎
Lemma B.3.
Let , and let be the time when is last sampled in the round and equivalently for . Assume without loss of generality that . If and are such that
(16) 
then SETri can eliminate without sampling it, i.e., .
Proof.
Focusing on the number of queries, we have that
(17) 
the inequality in (17) is due to Lemma B.2, and using the number of queries,
(18) 
The first inequality in (18) is because if then there may have been more queries beyond the number of queries made while finding . Rearranging the equation in the Lemma statement,