Search using queries on indistinguishable items
We investigate the problem of determining a set of indistinguishable integers in the range . The algorithm is allowed to query an integer , and receive a response comparing this integer to an integer randomly chosen from . The algorithm has no control over which element of the query is compared to. We show tight bounds for this problem. In particular, we show that in the natural regime where , the optimal number of queries to attain error probability is . In the regime where , the optimal number of queries is .
Our main technical tools include the use of information theory to derive the lower bounds, and the application of noisy binary search in the spirit of Feige, Raghavan, Peleg, and Upfal (1994). In particular, our lower bound technique is likely to be applicable in other situations that involve search under uncertainty.
This paper investigates the problem of identifying a set of indistinguishable items by repeated queries where we know the range of values the items can take. At every query, we gain information based on our query and some random item from the set we are trying to find (we do not know which item was chosen). The overall simple statement of the problem makes it widely generalizable. The query can be thought of as an experiment in which we apply a measurement on an element of without knowing which element has been measured. The set of items can refer to a set of DNA strands in a “soup” of DNAs, passwords or any item that we might be interested in finding when we know what possible values the item may take. The queries can be viewed as tests on DNA strands, attempts at guessing a password or any trial we may run that will provide some information about one of the items in question. The specific problem we investigate is where the items are integers. Our queries are guesses of integers which return the result of a comparison with a chosen integer from the set we are trying to find.
As far as we know, this problem has not been investigated in the literature. However, it falls into the rich class of noisy search problems. Since we do not know which number was chosen when we query a number, we have to deal with a lack of information in trying to determine the set of numbers. Due to this missing information, it is not immediately obvious that there exists a solution to the problem.
In this paper we give asymptotically tight upper and lower bounds for the number of queries needed to find a set of size of numbers from , where the queries are comparison queries.
We briefly discuss similar problems that have been previously studied. Feige et al. explored the depth of noisy decision trees, where each node can be wrong with some constant probability, in . One of the problems they investigated is binary search where the result of each query is wrong with a constant probability. They presented an algorithm to solve this with running time where is the input set size and is the probability of error of the algorithm. The algorithm we present uses a similar technique to the one used for noisy binary search in .
The Renyi-Ulam game is also a related problem. In one variation of this game, we need to discover a chosen integer. To do this, we query a number and are told whether the number we are trying to find is greater than the number we guessed or not. However, some constant number of lies are allowed. In , one lie is allowed, which means that one of the responses to our queries can be false. Similarly, Pelc discussed in  an algorithm for performing the search when one lie is allowed and concluded that the original question posed by Ulam (finding an integer between one and a million with one lie allowed) requires 25 queries. In ,  and other papers that explore the Renyi-Ulam game, some restriction is placed on the pattern of queries with false results. Ravikumar and Lakshmanan discussed such patterns (and why they are necessary to make the problem solvable) in .
Another related problem is sorting from noisy information. Braverman and Mossel investigated this in . The problem of sorting from noisy information is similar to our problem because in noisy sorting we can make comparisons between the items that need to be sorted, but each comparison may give us false information. This has applications, for example, in ranking sports teams where the comparisons are games between teams (one team wins) but the comparisons are noisy because the better team (which should have a higher rank) does not always win. Klein et al. also investigated this problem in . Apart from noisy sorting, they applied the same model to explore other problems, such as finding the maximum of numbers.
The problem we are investigating is motivated by applications that involve a search for several items by repeated queries where we do not know which item was chosen to be compared with our query (i.e. the items are indistinguishable). One interpretation is where the items represent DNA strands in a mixture that we are trying to identify. We can perform tests that give us some information about one of the DNA strands in the mixture, but we do not know which one. Similarly, instead of trying to identify DNA strands, we might be trying to identify passwords where our queries give us some partial information about one password out of several that a particular user often uses (and switches between).
We note that the applications mentioned do not take the exact form as the problem we explore. The items in our problem are integers and the queries are guesses of an integer that result in the response ‘less than or equal to’ or ‘greater than’. In generalizing the problem to other applications, the form of items or queries may change. For example, the queries in the DNA mixture example may describe a property of a particular nucleotide instead of returning one of two possible answers. Therefore, the algorithm will have to be changed. However, a similar framework can be used which allows information to be gained despite the uncertainty regarding query responses due to the indistinguishability of the items. A solution to the problem we have posed can lead to the development of new methods for identifying a set of items where we know these items can only take on a certain range of values. On the lower-bound side, our results show that information-theoretic quantities are very effective at measuring and upper-bounding information learned from queries, even when such information is only a fraction of one bit. We believe that the information-theoretic lower bound technique will generalize to tight lower bounds in other settings.
We now discuss the results and structure of the paper. In Section 2, we formally introduce the problem we are solving with the restriction that the number of chosen integers is significantly smaller than the range of integers available. We prove a lower bound for the problem in Section 3.1 using information theoretic techniques. This involves constructing the hard instances where we split the possible values the chosen integers can take into consecutive clusters of equal size and place one chosen integer in each such cluster. Intuitively, this forces the search algorithm to find the elements one at a time, which turns out to be costly due to the fact that we don’t control the sample. To formalize this intuition, we calculate the entropy of the random variable representing a particular chosen integer (it may take values of the integers in one of the clusters described above). We then use the mutual information of this random variable and the random variable representing the responses to the queries we make to find the minimum number of queries required to find that chosen integer. After showing that the same minimum number of queries applies to at least half of the chosen integers, we reach a lower bound of , where is the size of the set and the elements of take integer values between and (inclusive). Further, this bound extends to all , using a slightly different set of hard instances. When we obtain a lower bound of . In Section 4, we present an optimal algorithm for solving the problem, proving both its correctness and worst case running time of where is the probability of error. This shows that the lower bound is tight. Moreover, while the lower bound applies to finding even with a constant error probability, we see that the upper bound remains asymptotically the same even if we set the error to be polynomially small.
Our results show that the problem we describe can be solved in practice when the items we are searching for can take a large number of values. This is because the dependence of the running time on grows as . However, the number of items in needs to remain small because the dependence of the running time on grows as .
2 Problem definition
We consider a (multi-)set of distinct integers where each is for . Our goal is to discover the set . The process is to repeat the following three steps:
Query an integer .
An integer is selected from uniformly at random.
We are told whether or .
These three steps are repeated until we know what the integers in are. Our goal is to find the most efficient algorithm for determining . Our model of computation is that queries are the costly operations. Therefore, by finding the most efficient algorithm we mean finding the algorithm that minimizes the number of queries made. We refer to this as ‘the problem’ we are solving. Furthermore, for brevity, we refer to the two possible responses to queries as ‘’ () and ‘’ () and the integers in as ‘the chosen integers’.
In this paper we give a complete characterization of the query complexity of this problem. Note that since the is selected at random from , we cannot hope for a deterministic algorithm, and have to settle for a probabilistic performance guarantee. We focus on the regime where we are required to output the correct set except with some (possibly constant) probability . The answer can be broken down into three main regimes, which will be discussed in the analysis: (1) , e.g. ; (2) ; and (3) . The answer is given by the following main theorem:
The number of queries needed to determine a multi-set of size with a given error is when , and when .
Note that the distinction between and only comes up in the analysis, but (asymptotically) makes no difference in the result.
Because of the way the algorithms work, Theorem 1 remains true even if the comparisons in the query answers are themselves noisy, and output the correct value of correctly only with probability for some constant .
Somewhat surprisingly, same bounds hold for a fairly broad range of error parameters. In particular, the lower bound holds even when the error is constant, while the upper bound holds even for polynomially small errors (the constant in the may depend on the constant in ).
3 The lower bounds
We begin with showing the lower bound. In fact, we break the lower bound into two regimes: and . In the former regime, we use information-theoretic techniques to show the lower bound. In the latter, we give a more straightforward proof of the lower bound when , and when . The lower bound is weaker in general than when , but is equivalent in the regime where .
3.1 The case : an information-theoretic lower bound
The main technical ingredient in the lower bound proof is the Kullback-Leibler divergence and mutual information. We first introduce these terms and the lemmas we will use. For a more thorough introduction to these, see .
The Kullback-Leibler divergence (KL-divergence) measures the difference between two probability distributions:
For discrete random variables and over sample space , the KL-Divergence is defined as:
with the convention that the term in the sum is interpreted as 0 when and when and
We also use mutual information, which we define and arrange into a form we will use:
Mutual information is a measure of the correlation between two random variables. The more independent the variables are, the lower the mutual information is.
Before we rearrange this definition into a form we will use, we first note (from ) that it can also be written in terms of the more familiar Shannon entropy as:
Since , . If entropy is interpreted as the uncertainty regarding a probability distribution, we see that the mutual information between and represents the reduction in uncertainty of by knowing .
We now return to the original definition given for mutual information. Using the definition of the KL-divergence and conditional probability (), we have:
Thus we see that the mutual information is the expectation of the KL-divergence between the probability distribution of and the probability distribution of conditioned on . If these two distributions have a high KL-divergence, then knowing provides us a high amount of information regarding the probability distribution of . This is equivalent to saying that the mutual information of and is high.
We will use the chain rule for mutual information:
For a proof of the above lemma, see . We are now done defining the information theory terms we will need. Lastly, we will need the following lemma which describes the KL-divergence between two Bernoulli random variables with a similar probability of success:
where is a Bernoulli random variable with probability of success , and .
Here we prove the part of the lemma (). The part is nearly identical and is thus excluded.
|Use the inequalities and :|
|since and , :|
We are now ready to begin our proof of the lower bound. The approach taken is to show that the information gain from each query is small compared with the total information required to find a certain chosen integer. This will allow us to show that a certain minimum number of queries is required to find each of the integers.
The lower bound for the number of queries required to find the integers between and in the set with probability , when , is
We choose our input as follows. Split the integers in the range into equally sized clusters. Call these clusters . Let there be one of the chosen integers in each such cluster. This integer is chosen uniformly at random from the integers in the cluster. Note that the number of integers in each cluster is , which, without loss of generality, we will assume is an integer. See Figure 1 for a visualization of this.
We consider individually a cluster where . Let be the random variable that represents the chosen integer in . Since this number is chosen uniformly at random from elements, the probability of each integer being the chosen integer is . Therefore, the entropy of is . We now define to be a Bernoulli random variable representing the response to the query (i.e. either ‘’ or ‘’). We need to make enough queries so that the information gain relevant to is close to the entropy of in order to determine the chosen number in with a high degree of accuracy. This is equivalent to saying that the mutual information between and the queries made is at least a constant times the entropy of . Indeed, in the end, we must have determined the point with probability greater than . Therefore, conditioned on the queries, most of the mass is concentrated on one point and . Therefore, . Thus, we need:
where is the number of queries made. We want to find the minimum for which this is true. First, we use Lemma 6 (chain rule) to write:
Take one of these terms and recall that we can express mutual information in terms of KL-divergence:
where . Thus, we need to find the KL-divergence of and of . We note that since we chose cluster , there are of the chosen integers that are smaller and of the numbers that are bigger than any element of . Therefore, for both probability distributions, the probability that the response is ‘’ is at least and the probability that the response is ‘’ is at least . Therefore, both probability distributions are Bernoulli with probability of success (taking success to be the response ‘’) between and . Thus, the difference in probabilities of success of the two distributions is at most . Then if we let be and let be , we know (because ) and (because this is the maximum difference in probability of success between the two distributions). By lemma 7, . So: and we have:
Returning to equation 2:
From (1), we have so
since . This is the minimum number of queries to find the chosen integer in . This holds in total for of the chosen numbers (this is the number of clusters with in the range we considered). Note that to find the chosen number in , queries made in determining the number within with provide no information for determining the number in (as all queries are either bigger or smaller than all the numbers in ). Then finding of the chosen numbers requires at least time. Therefore, finding all of the chosen numbers requires at least queries. ∎
3.2 The lower bound when
Next we turn our attention to the lower bound in the regime where . We start with the case , as the case is treated very similarly. The multi-set is constructed as follows: we place ’s and ’s in . Partition the rest of the set into bins , , etc. For each bin for , we place exactly one of the elements of in independently and uniformly at random. We now look at the process of determining which element of has been selected using the queries. Note that only the query with carries any information on which element of has been selected. Thus a set of observations can be specified by a set of pairs of numbers where represents the number of times we queried and received the ‘’ answer, and represents the number of times we received the ‘’ answer. The probability of each answer is between and , and varies by depending on whether we selected or in .
When we output the set , we need to make decisions of whether to output or for each . Each of these decisions should depend only on the values of , and should maximize the probability that the output is correct. This can only be done by outputting the maximum likelihood value for each . More precisely, we should output if , and otherwise. We are not particularly concerned with these details, but only with the probability that our output is wrong. Denote by the probability that the maximum-likelihood output given is incorrect. We first claim that to have a probability of to be correct in outputting , we must have a bound on the sum of the ’s.
If given the values the output is correct with probability , then .
Since the events of being correct on each are independent, the probability of being correct on all ’s is given by
which implies the statement of the claim. ∎
Next, let us denote by the a-priori expected number of ‘’ responses on queries, and let be the observed deviation from this expected value. Intuitively, the greater this deviation, the greater is our confidence in the answer. In fact, it is not hard to formalize this intuition:
For each , and , .
Suppose wlog that , and thus we are outputting . Denote and . We have by Bayes’ rule
The second-to last inequality follows from the fact that the breakdown is more likely under the selection of than under the selection of . ∎
Equation (3) implies , for .
Denote , and let . The function is convex, and thus we have
since . This implies the claim. ∎
To finish the proof let denote the random variable representing the value of after queries. Let . At each time step, a query to will on average not change if the element from is not selected for comparison with . If it is selected, it will change by at most . Thus, on average, only grows by at most after each time step. Thus is a supermartingale. Let be the random variable representing the time at which we stop and output . By the optional stopping time theorem, we have , which implies .
If our overall success probability is , it must be the case that with probability the probability of the output being correct conditioned on the observed is . Thus by Claims 9, 10 and 11, we have with probability . Thus,
completing the proof of the lower bound.
The proof in the regime is very similar. The only difference is that there are bins now, and we’d get instead of , and thus .
We will now study the case where .
4 Optimal upper bounds
As discussed in the previous sections, it is not immediately clear how to make use of the information gained from queries because we do not know which of the integers the information corresponds to. In this section, we present an algorithm for solving this problem. The algorithm is optimal when the probability of error required is constant (which means its worst case running time matches the lower bound). Our algorithm finds each of the numbers individually, without attempting to use information gained when finding one integer to find another integer. We first introduce a concept we will use in all our algorithms:
The -position of an integer is the number of integers in that have a value less than or equal to
The general technique of the algorithms is to do a binary search for a chosen integer, but repeat each query of the binary search enough times to know the -position of the queried integer. A straightforward application of binary search with repeated queries would take queries to find the -position of a number, even with a constant error probability. We essentially use the noisy binary search technique of Feige et. al.  to attain the optimal query complexity. We start with the following simple lemma:
We can find the -position of integer by making queries with the probability of being correct being at least .
Let be the -position of . We do queries of to find . For each query , the probability of a response being ‘’ or ‘’ is given simply in terms of :
because is the number of integers in less than or equal to and each such integer is chosen as the for a query with equal probability. We use the analogy that the random variable is a coin with probability of heads (which represents ‘’) being . Given tosses of the coin, of which are heads, we can approximate as: . We need to find the relation between the number of tosses and the probability of error in this approximation. Using standard concentration bounds , we see that coin tosses are needed to guarantee that with error at most (where ).
We need to decide on a value for . Note that is an integer in the range and therefore, can only take on the values . Thus, we need so then we can always round to the closest , where and . Using this in the results from , we see that coin tosses are enough to guarantee that we know the correct value of with probability of error being at most . Given , we have so we have the -position of . ∎
We note that this immediately lets us solve the problem for :
When , there is an algorithm to find all integers in with probability for all constant .
We find the -position of all integers in the range . Given the -position of all integers, we know how many of the numbers have each integer value. If the -position of is and the -position of is , we know there are of the chosen numbers with the value (for . For , we know the number of the chosen integers with this value is equal to the -position of ).
To find the -position of an integer with probability of error at most , we need to perform queries. If we want the probability of error of the algorithm to be a constant, we need the probability of error of finding the -position of each integer to be at most so that applying a union bound gives a total probability of error (since we find the -position of integers). Thus, to find the -position of each integer we need to perform queries. Since we do this for integers, the total number of queries we make is: . ∎
We provide an example to illustrate an approach using what we have so far. Suppose , , and let . We want to find the lowest of the numbers first. We do a binary search where we repeat each query times. Our decision at each stage of the binary search is determined by the -position found for the number of that stage. Therefore, we first do queries of . From this, we calculate (note the probability of error for this statement is ). So there is one of the numbers below or equal to 8. Next, we do queries of and find again that . When we do queries of , we find that . This tells us that none of the numbers are below or equal to 2. Therefore, we do queries of and find that . If one of the numbers is less than or equal to 3, but none of them are less than or equal to 2, we conclude that one of the numbers is 3. We then repeat the same process to find the second of the numbers.
However, this approach is problematic because of the constant error each time we find the -position of a number. This flaw is mentioned for a similar algorithm in . The number of queries we make is . Each group of queries of the same ( of them) give the wrong result with probability . Applying a union bound, our overall probability of error () is . If we want to be a constant, we need and thus, the number of queries we make is actually
To alleviate this problem, we model our algorithm as a random walk on a tree. In using this technique, we follow . In , the random walk approach is taken to do a noisy binary search. We use this technique to find each of the chosen integers, although each step of the random walk is modified to accommodate our lack of information about which of the integers was chosen in a particular query. We use a binary tree where the leaves are (in order) the integers . The internal nodes represent intervals that are the union of the leaves in their subtrees. For example, the root node has the interval and the left child of the root has the interval . The tree height is . Finally, we extend this tree by adding chains of length to each of the leaf nodes, where the nodes in these chains have the same value as the leaf they are attached to. An example tree with is shown in Figure 2 below.
We discuss an algorithm for finding the of the chosen integers. This algorithm is repeated times (once for each of the numbers). Starting at the root, for each node we take the following two steps:
We first check whether the chosen integer is in the range of the node (call it ). To do this, we find the -position of and by doing queries of each of them. If we find that the -position of is at most and the -position of is at least , then the number lies in the range . Otherwise, we backtrack up the tree to the parent node of .
If, according to the first step, the number lies in the range , we do queries of the middle value of the range of the node (call this where ). If is not a leaf (or on a leaf chain) and the -position of is at most , we choose the right child of . If the -position of is at least , we choose the left child of . If is a leaf (or on a leaf chain), we go down the chain further regardless of the result of the queries.
Note that there is a constant probability of error each time we determine the -position of an integer. This leads to a constant probability of choosing the wrong node to go to next. We will analyze this probability shortly.
The algorithm walks for steps and then stops, where . If it stops on an internal node, the algorithm failed. If it stops on one of the leaf chains (or a leaf node), it outputs the value of the leaf (i.e. declares this value to be the value of the of the numbers).
The following theorem summarizes our results:
Our algorithm finds all integers in in time with probability of error at most for
To reach this theorem, we use the following lemma:
The algorithm finds the correct integer in with the probability of error being at most , where is the number of steps in the random walk.
We need to prove that the algorithm’s position on the walk after steps is the correct leaf chain with high probability. Orient all edges of the tree so they are directed towards the correct leaf chain (and within this leaf chain they are directed down). We can do this because the graph is a tree (there is only one path between every two vertices) and there is only one correct leaf. We can now consider the algorithm’s position in the tree as a one dimensional random walk. We let the starting point of the walk be (the root of the tree), the correct leaf be steps to the right and any of the wrong leaves be steps to the left. Note that (height of the tree).
We need to find the probabilities of moving left and right in the random walk. We will show that the probability of moving in the correct direction (to the right) is at least at every node. Furthermore, note that the decision made at any node is independent of the previous steps in the random walk. Let be the probability of going left at any move. This is equivalent to the probability of going along the wrong direction of an edge, which is equivalent to making a mistake somewhere in choosing the next vertex. The probability of incorrectly calculating whether the number is in the range is at most the probability that we incorrectly calculate the -position of either or . Since we do queries of each, by Lemma 14 we know that the probability of error in calculating the -position of each is where . So the probability of incorrectly calculating the -position of either or is at most . Similarly, we do queries of , so the probability of error is where . Thus, the total probability of error at each node is . Therefore, and , where is the probability of going to the right (i.e. the correct direction). Figure 3 illustrates the random walk space.
For the algorithm to be correct, it must be on or to the right of after steps (so it returns the correct integer), otherwise it is wrong. Let be the random variable denoting the number of moves to the right made after moves. Then is the number of moves to the left. Therefore, the algorithm is correct if . This is equivalent to the condition that . Then the probability that the algorithm is correct is and is the probability of error we want to bound. To find , let be an indicator random variable that is 1 if the algorithm moves to the right on the move and 0 otherwise. Note that . Therefore, by linearity of expectation. We want to use a Chernoff bound to bound the probability of error, so we need to find a such that:
Note that because . Since each step of the random walk is independent of the other steps (i.e. is independent of for ), we can use the Chernoff bound ():
Recall that and set , where is a constant. Then . We want to write this as where is a constant. Then . Note that as increases, decreases to some asymptotic value:
Then we have that . Therefore,
Thus, we have bounded the probability of error as required. ∎
We apply Lemma 17 to prove the bound on the full algorithm. Even though our lower bounds works when the error probability is constant, the algorithm applies even when the error is very small (). We are now ready to present the proof for Theorem 16.
We prove separately the cases when and when . In the first case, we set . By Lemma 17, the probability of not finding the correct number is at most . Applying a union bound of this over the numbers we need to find, the probability of error is at most because . Since , the probability of error is bounded as required. So we need in total steps of the random walk algorithm. Recall that each such step takes queries. Therefore, in total, we have a running time of since .
We now consider the case when . Set . The probability of not finding the correct number is at most by Lemma 17 (and that ). Applying a union bound over the numbers we need to find, the overall probability of error is as required. Thus, we need steps in the random walk, where each consists of queries. Therefore, the total running time is . ∎
-  M. Braverman, E. Mossel. Sorting from Noisy Information, CoRR, Vol abs/0910.1191, in SODA’09, 2009.
-  T. Cover, J. Thomas. Elements of Information Theory, New York: John Wiley & Sons, Inc., 1991
-  U. Feige, P. Raghavan, D. Peleg, and E. Upfal. Computing with noisy information, SIAM Journal on Computing, Vol 23, No. 5, pp. 1001–1018, 1994
-  R. Karp, R. Kleinberg. Noisy binary search and its applications, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 881–890, 2007
-  R. Klein, R. Penninger, C. Sohler, D. Woodruff. Tolerant algorithms, Proceedings of the 19th European conference on Algorithms, pp. 736–747, 2011
-  E. Lehman, T. Leighton. Mathematics for Computer Science, Online PDF file, 2004
-  A. Pelc. Solution of Ulam’s problem on searching with a lie, Journal of Combinatorial Theory, Series A, Vol 44, No. 1, pp. 129–140, 1987
-  M. Raginsky. Introduction: What is Statistical Learning Theory?, Online PDF file, 2011
-  B. Ravikumar, K.B. Lakshmanan. Coping with known patterns of lies in a search game, Theoretical Computer Science, Volume 33, Issue 1, pp. 85–94, 1984.
-  J. Spencer. Guess a Number-with Lying, Mathematics Magazine, Vol. 57, No. 2, pp. 105–108, 1984