The capacity of non-identical adaptive group testing

Abstract

We consider the group testing problem, in the case where the items are defective independently but with non-constant probability. We introduce and analyse an algorithm to solve this problem by grouping items together appropriately. We give conditions under which the algorithm performs essentially optimally in the sense of information-theoretic capacity. We use concentration of measure results to bound the probability that this algorithm requires many more tests than the expected number. This has applications to the allocation of spectrum to cognitive radios, in the case where a database gives prior information that a particular band will be occupied.

I Introduction and notation

I-A The Probabilistic group testing problem

Group testing is a sparse inference problem, first introduced by Dorfman [1] in the context of testing for rare diseases. Given a large population of $N$ items, indexed by $\{1, 2, \ldots, N\}$, where some small fraction of the items are interesting in some way, how can we find the interesting items efficiently?

We perform a sequence of pooled tests defined by test sets $\mathcal{T}_1, \mathcal{T}_2, \ldots$, where each $\mathcal{T}_i \subseteq \{1, \ldots, N\}$. We represent the interesting (‘defective’) items by a random vector $\mathbf{U} = (U_1, \ldots, U_N)$, where $U_j$ is the indicator of the event that item $j$ is defective. For each test $i$, we jointly test all the items in $\mathcal{T}_i$, and the outcome is ‘positive’ ($Y_i = 1$) if and only if any item in $\mathcal{T}_i$ is defective. In other words, $Y_i = \max_{j \in \mathcal{T}_i} U_j$, since for simplicity we are considering the noiseless case. Further, in this paper, we restrict our attention to the adaptive case, where we choose test set $\mathcal{T}_i$ based on a knowledge of the sets $\mathcal{T}_1, \ldots, \mathcal{T}_{i-1}$ and outcomes $Y_1, \ldots, Y_{i-1}$. The group testing problem requires us to infer $\mathbf{U}$ with high probability given a low number of tests $T$.

Since Dorfman’s paper [1], there has been considerable work on the question of how to design the test sets $\mathcal{T}_i$ in order to minimise the number of tests required. In this context, we briefly mention so-called combinatorial designs (see [2, 3] for a summary, with [3] giving invaluable references to an extensive body of Russian work in the 1970s and 1980s). Such designs typically aim to ensure that set-theoretic properties known as disjunctness and separability hold. In contrast, for simplicity of analysis, as well as performance of optimal order, it is possible to consider random designs. Here the sets $\mathcal{T}_i$ are chosen at random, either using constructions such as independent Bernoulli designs [4, 5, 6] or more sophisticated random designs based on LDPC codes [7].

Much previous work has focussed on the Combinatorial group testing problem, where there is a fixed number of defectives $K$, and the defectivity vector $\mathbf{U}$ is chosen uniformly among all binary vectors of weight $K$. In contrast, in this paper we study a Probabilistic group testing problem as formulated for example in the work of Li et al. [8], in that we suppose each item $i$ is defective independently with probability $p_i$, or equivalently take the $U_i$ to be independent Bernoulli($p_i$) random variables.

This Probabilistic framework, including non-uniform priors, is natural for many applications of group testing. For example, as described in [9], the cognitive radio problem can be formulated in terms of a population of communication bands in frequency spectra, with some (unknown) occupied bands which we must not utilise. Here, the values of $p_i$ may be chosen based on some database of past spectrum measurements or other prior information. Similarly, as in Dorfman’s original work [1] or more recent research [10] involving screening for genetic conditions, values of $p_i$ might summarise prior information based on a risk profile or family history.

I-B Group testing capacity

It is possible to characterize performance tradeoffs in group testing from an information-theoretic point of view – see for example [4, 11, 6, 12]. These papers have focussed on group testing as a channel coding problem, with [4, 12] explicitly calculating the mutual information. The paper [11] defined the capacity of a Combinatorial group testing procedure, which characterizes the number of bits of information about the defective set which we can learn per test. We give a more general definition here, which covers both the Combinatorial and Probabilistic cases.

Definition I.1

Consider a sequence of group testing problems where the $n$th problem has defectivity vector $\mathbf{U}^{(n)}$, whose entropy we denote by $H^{(n)} = H(\mathbf{U}^{(n)})$, and consider algorithms which are given $T^{(n)}$ tests. We refer to a constant $C$ as the (weak) group testing capacity if for any $\delta > 0$:

  1. any sequence of algorithms with

    $$T^{(n)} \le \frac{H^{(n)}}{C}(1-\delta) \qquad (1)$$

    has success probability bounded away from 1,

  2. and there exists a sequence of algorithms with

    $$T^{(n)} \le \frac{H^{(n)}}{C}(1+\delta) \qquad (2)$$

    with success probability tending to 1.

Remark I.2

In the Combinatorial case of $K$ defective items with all defective sets equally likely, $H^{(n)} = \log_2\binom{N}{K}$, which is the term found in the denominator in [11, Eq. (1) and (2)]. In the Probabilistic case (as in [8]) we know $H^{(n)} = H(\mathbf{p}) = \sum_{i=1}^N h(p_i)$, where $h(p) = -p\log_2 p - (1-p)\log_2(1-p)$ is the binary entropy function.
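As a rough illustration, using the simulation parameters of Section IV ($N = 500$, $\mu = 8$) and assuming, purely for the sake of this calculation, identical marginals $p_i \equiv 0.016$:

$$H(\mathbf{p}) = 500\, h(0.016) = 500\left(0.016\log_2\frac{1}{0.016} + 0.984\log_2\frac{1}{0.984}\right) \approx 500 \times 0.118 \approx 59 \text{ bits},$$

which is comparable to $\log_2\binom{500}{8} \approx 56$ bits for the corresponding Combinatorial problem with $K = 8$ defectives.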

Remark I.3

If, for any sequence of algorithms satisfying (1), the success probability tends to zero, we say that $C$ is the strong group testing capacity, following standard terminology in information theory. Such a result is referred to as a strong converse.

I-C Main results

The principal contribution of [11, Theorem 1.2] was the following result:

Theorem I.4 ([11])

The strong capacity of the adaptive noiseless Combinatorial group testing problem is $C = 1$, in any regime such that $K/N \to 0$.

This argument came in two parts. First, in [11, Theorem 3.1] the authors proved a new upper bound on success probability

$$\mathbb{P}(\mathrm{suc}) \le \frac{2^T}{\binom{N}{K}}, \qquad (3)$$

which implied a strong converse ($\mathbb{P}(\mathrm{suc}) \to 0$ for any sequence of algorithms with $T \le \log_2\binom{N}{K}(1-\delta)$). This was complemented by showing that, in the Combinatorial case, an algorithm based on Hwang’s Generalized Binary Splitting Algorithm (HGBSA) [13], [2] is essentially optimal in the required sense, showing that $C = 1$ is achievable.

It may be useful to characterize the Probabilistic group testing problem in terms of the effective sparsity $\mu^{(n)} = \sum_{i=1}^N p_i$. In particular, if the $p_i$ are (close to) identical, we would expect performance similar to that in the Combinatorial case with $K = \mu^{(n)}$ defectives. As in [11], we focus on asymptotically sparse cases, where $\mu^{(n)} = o(N)$. In contrast, Wadayama [7] considered a model where the $p_i$ are identical and fixed. The main result of the present paper is Theorem III.9, stated and proved in Section III-E below, which implies the following Probabilistic group testing version of Theorem I.4.

Corollary I.5

In the case where $\max_i p_i \to 0$, the weak capacity of the adaptive noiseless Probabilistic group testing problem is $C = 1$, in any regime such that $\mu^{(n)} \to \infty$ and $H(\mathbf{p}^{(n)}) \to \infty$.

Again we prove our main result Theorem III.9 using complementary bounds on both sides. First in Section II-A we recall a universal upper bound on success probability, Theorem II.1, taken from [8], which implies a weak converse. In [8], Li et al. introduce the Laminar Algorithm for Probabilistic group testing. In Section II-C we propose a refined version of this Laminar Algorithm, based on Hwang’s HGBSA [13], which is analysed in Section III-E, and shown to imply performance close to optimal in the sense of capacity.

The bounds of [8] (see Theorem II.3 below) show that roughly $2H(\mathbf{p})$ tests suffice to guarantee that the success probability converges to one. Our contribution is to improve this to $H(\mathbf{p})$ plus an error term, which is optimal up to the size of the error term.

II Algorithms and existing results

II-A Upper bounds on success probability

Firstly, [8, Theorem 1] can be restated to give the following upper bound on success probability:

Theorem II.1

Any Probabilistic group testing algorithm using $T$ tests with noiseless measurements has success probability satisfying
$$\mathbb{P}(\mathrm{suc}) \le \frac{T}{H(\mathbf{p})}.$$

Rephrased in terms of Definition I.1, this tells us that the weak capacity of noiseless Probabilistic group testing is at most 1. The logic is as follows: if the capacity were $C = 1 + \epsilon$ for some $\epsilon > 0$, then there would exist a sequence of algorithms with $T^{(n)} \le H(\mathbf{p}^{(n)})(1+\epsilon/2)/(1+\epsilon)$ with success probability tending to 1. However, by Theorem II.1, any such algorithms have success probability at most $(1+\epsilon/2)/(1+\epsilon) < 1$, meaning that we have established that a weak converse holds.

Remark II.2

It remains an open and interesting problem to prove an equivalent of (3) as in [11, Theorem 3.1]. That is, we hope to find an upper bound on success probability in a form which implies a strong converse, and hence show that the strong capacity of Probabilistic group testing is also equal to 1.

II-B Binary search algorithms

The main contribution of this work is to describe and analyse algorithms that will find the defective items. In brief, we can think of Hwang’s HGBSA algorithm as dividing the population into search sets $S_1, S_2, \ldots$. First, all the items in a search set $S$ are tested together, using the test set $\mathcal{T} = S$. If the result is negative ($Y = 0$), we can be certain that $S$ contains no defectives. However, if the result is positive ($Y = 1$), $S$ must contain at least one defective.

If the test is positive, we are guaranteed to find at least one defective, using the following binary search strategy. We split the set $S$ in two, and test the ‘left-hand’ set, say $S_L$. If this test is positive, then we know that $S_L$ contains at least one defective. If it is negative, then $S_L$ contains no defective, so we can deduce that the ‘right-hand’ set $S \setminus S_L$ contains at least one defective. By repeated use of this strategy, we are guaranteed to find a succession of nested sets which contain at least one defective, until the current set is of size 1, and we have isolated a single defective item.
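The following minimal sketch (in Python, with illustrative function names) shows this binary search step; pool_test stands in for a physical pooled test on a known set of defectives.

# Sketch of the binary search step of Section II-B (illustrative only).
def pool_test(items, defectives):
    """Noiseless group test: positive iff the pool contains a defective."""
    return any(i in defectives for i in items)

def find_one_defective(search_set, defectives):
    """Isolate one defective from a set known to test positive."""
    current = list(search_set)
    while len(current) > 1:
        left = current[: len(current) // 2]
        if pool_test(left, defectives):
            current = left                          # a defective lies in the left half
        else:
            current = current[len(current) // 2:]   # otherwise one lies in the right half
    return current[0]

# Example: items 0..7, with item 5 defective.
print(find_one_defective(range(8), defectives={5}))  # -> 5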

However, this strategy may not find every defective item in $S$. To be specific, it is possible that at some stage both the left-hand and right-hand sets contain a defective. The Laminar Algorithm of [8] essentially deals with this by testing both sets. However, we believe that this is inefficient, since typically both sets will not contain a defective. Nonetheless, the Laminar Algorithm satisfies the following performance guarantees, proved in [8, Theorem 2]:

Theorem II.3

The expected number of tests required by the Laminar Algorithm of [8] is bounded by $2H(\mathbf{p}) + 2\mu$. Under a technical condition (referred to as non-skewedness), the success probability can be bounded below using a number of tests defined implicitly in terms of $H(\mathbf{p})$, $\mu$ and the target error probability.

Ignoring the $\mu$ term, and assuming the non-skewedness condition holds, this implies that (using the methods of [8]) roughly $2H(\mathbf{p})$ tests are required to guarantee convergence to 1 of the success probability. In our language, this implies a lower bound of $C \ge 1/2$ on the capacity. Even ignoring the analysis of error probability, the fact that the expected number of tests is close to $2H(\mathbf{p})$ suggests that we cannot hope to achieve $C = 1$ using the Laminar Algorithm.

II-C Summary of our contribution

Data: A set of $N$ items, of which $\mu = \sum_i p_i$ are defective in expectation, a probability vector $\mathbf{p}$ describing each item’s independent probability of being defective, and a cutoff $p_0$
Result: The set of defective items
Discard items with $p_i < p_0$
Sort the remaining items into bins, collecting together in bin $k$ those items with comparable probabilities (within a factor $\lambda$ of one another)
Sort the items in each bin into sets $S_1, S_2, \ldots$ s.t. the total probability of each set is at most 1
Test each set in turn
if the test is positive then
      Arrange the items in the set on a Shannon-Fano/Huffman tree and recursively search the set for all the defectives it contains
end if

The main contribution of our paper is a refined version of the Laminar Algorithm, summarised above, and an analysis resulting in tighter error bounds as formulated in Proposition III.7 (in terms of expected number of tests) and Theorem III.9 (in terms of error probabilities). The key ideas are:

  1. To partition the population into search sets containing items which have similar probabilities, expressed through the Bounded Ratio Condition 1. This is discussed in Section III-A, and optimised in the proof of Proposition III.7.

  2. The way in which we deal with sets which contain more than one defective, as discussed in Remark III.2 below. Essentially we do not backtrack after each test by testing both left- and right-hand sets, but only backtrack after each defective is found.

  3. To discard items which have probability below a certain threshold, since with high probability none of them will be defective. This is an idea introduced in [8] and discussed in Section III-B, with a new bound given in Lemma III.4.

  4. Careful analysis in Section III-D of the properties of search sets gives Proposition III.7, which shows that the expected number of tests required can be expressed as $H(\mathbf{p})$ plus an error term. In Section III-E, we give an analysis of the error probability using Bernstein’s inequality, Theorem III.8, allowing us to prove Theorem III.9.

II-D Wider context: sparse inference problems

Recent work [14, 12] has shown that many arguments and bounds hold in a common framework of sparse inference which includes group testing and compressive sensing.

Digital communications, audio, images, and text are examples of data sources we can compress. We can do this because these data sources are sparse: they have fewer degrees of freedom than the space they are defined upon. For example, images have a well known expansion in either the Fourier or wavelet bases. The text of an English document will only be comprised of words from the English dictionary, and not all possible strings over the underlying alphabet $\{a, \ldots, z\}$.

Often, once a signal has been acquired it will be compressed. However, the compressive sensing paradigm introduced by [15, 16] shows that this is not necessary. In those papers it was shown that a ‘compressed’ representation of a signal could be obtained from random linear projections of the signal onto some other basis (for example, white Gaussian noise). The question remains: given this representation, how do we recover the original signal? For real signals, a simple linear programme suffices. Much of the work in this area has been couched in terms of the sparsity of the signal and the various bases the signal can be represented in (see for example [15, 16]).

III Analysis and new bounds

III-A Searching a set of bounded ratio

Recall that we have a population of $N$ items to test, each with associated probability of defectiveness $p_i$. The strategy of the proof is to partition the population into search sets $S_1, S_2, \ldots, S_m$, each of which contains items which have comparable values of $p_i$.

Condition 1 (Bounded Ratio Condition)

Given $\lambda \ge 1$, say that a set $S$ satisfies the Bounded Ratio Condition with constant $\lambda$ if

$$\max_{i, j \in S} \frac{p_i}{p_j} \le \lambda. \qquad (4)$$

(For example, clearly if all the $p_i$ are equal, any set $S$ satisfies the condition for any $\lambda \ge 1$.)

Lemma III.1

Consider a set $S$ satisfying the Bounded Ratio Condition with constant $\lambda$ and write $P_S = \sum_{j \in S} p_j$. In a Shannon–Fano tree for the probability distribution $\mathbf{q} = (p_j/P_S)_{j \in S}$, each item $i \in S$ has length $\ell_i$ bounded by

$$\ell_i \le \sum_{j \in S} \frac{p_j}{P_S}\log_2\left(\frac{1}{p_j}\right) + \log_2 P_S + \Lambda + 1, \qquad (5)$$

where we write $\Lambda = \log_2 \lambda$.

Proof:

Under the Bounded Ratio Condition, for any $i$ and $j$ in $S$, we know by taking logs of (4) that $\log_2(1/p_i) \le \log_2(1/p_j) + \Lambda$.

Multiplying by $p_j/P_S$ and summing over all $j \in S$, we obtain that

$$\log_2\left(\frac{1}{p_i}\right) \le \sum_{j \in S} \frac{p_j}{P_S}\log_2\left(\frac{1}{p_j}\right) + \Lambda. \qquad (6)$$

Now, the Shannon–Fano length of the $i$th item is

$$\ell_i = \left\lceil \log_2\left(\frac{P_S}{p_i}\right) \right\rceil \le \log_2 P_S + \log_2\left(\frac{1}{p_i}\right) + 1, \qquad (7)$$

and the result follows by (6). ∎
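The following Python snippet gives a small numerical illustration of the lemma: for a set whose probabilities have ratio at most $\lambda$, the Shannon–Fano lengths for the within-set distribution stay within $\log_2\lambda + 1$ of the set’s average information content.

# Illustrative check of the Shannon-Fano length bound for a bounded-ratio set.
import math

p = [0.08, 0.05, 0.04, 0.03, 0.02]           # max ratio 0.08/0.02 = 4 <= lambda
lam = 4.0
P_S = sum(p)
q = [pi / P_S for pi in p]                   # conditional distribution within the set
lengths = [math.ceil(math.log2(1.0 / qi)) for qi in q]
avg_info = sum(qi * math.log2(1.0 / qi) for qi in q)
bound = avg_info + math.log2(lam) + 1
print(lengths, round(avg_info, 3), round(bound, 3))
assert all(l <= bound for l in lengths)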

Next we describe our search strategy:

Remark III.2

Our version of the algorithm will find every defective in a set $S$. We start as before by testing every item in $S$ together. If this test is negative, we are done. Otherwise, if it is positive, we can perform binary search as in Section II-B to find one defective item, say $i_1$. Now, test every item in $S \setminus \{i_1\}$ together. If this test is negative, we are done; otherwise we repeat the search step on this smaller set, to find another defective item $i_2$, then we test $S \setminus \{i_1, i_2\}$, and so on.

We think of the algorithm as repeatedly searching a binary tree. Clearly, if the tree has depth bounded by $d$, then the search will take at most $d$ tests to find one defective. In total, if the set contains $k$ defectives, we need to repeat $k$ rounds of searching, each preceded by a group test of the remaining items, plus the final test to guarantee that the set contains no more defectives, so will use at most $k(d+1) + 1$ tests.
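A sketch of this repeated search in Python follows (the function names are illustrative, and the pooled tests are counted naively):

# Sketch of Remark III.2: repeatedly test the remaining items together,
# and run a binary search whenever the pooled test is positive.
def pool_test(items, defectives):
    return any(i in defectives for i in items)

def find_one_defective(items, defectives):
    """Binary search on a positive set; returns (defective item, tests used)."""
    current, tests = list(items), 0
    while len(current) > 1:
        left = current[: len(current) // 2]
        tests += 1
        current = left if pool_test(left, defectives) else current[len(current) // 2:]
    return current[0], tests

def find_all_defectives(search_set, defectives):
    """Find every defective in the set, counting the pooled tests used."""
    remaining, found, tests = list(search_set), [], 0
    while True:
        tests += 1                                  # group test of all remaining items
        if not pool_test(remaining, defectives):    # negative: no defectives remain
            return found, tests
        item, t = find_one_defective(remaining, defectives)
        tests += t
        found.append(item)
        remaining.remove(item)

print(find_all_defectives(range(16), defectives={3, 11}))  # -> ([3, 11], 11)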

Lemma III.3

Consider a search set $S$ satisfying the Bounded Ratio Condition with constant $\lambda$, and write $P_S = \sum_{i \in S} p_i$. If (independently) item $i$ is defective with probability $p_i$, we can recover all defective items in the set using $T_S$ tests, where

$$\mathbb{E}\, T_S \le \sum_{i \in S} p_i \log_2\left(\frac{1}{p_i}\right) + P_S \log_2 P_S + (\Lambda + 2) P_S + 1. \qquad (9)$$
Proof:

Using the algorithm of Remark III.2, laid out on the Shannon-Fano tree constructed in Lemma III.1, we are guaranteed to find every defective. The number of tests to find one defective thus corresponds to the depth of the tree at the corresponding leaf, which is bounded by the quantity given in (5).

Recall that we write $U_i$ for the indicator of the event that the $i$th item is defective; we will write $K_S = \sum_{i \in S} U_i$ for the total number of defectives in $S$, and $\ell_i$ for the length of the $i$th word in the Shannon–Fano tree. As discussed in Remark III.2, this search procedure will take

$$T_S \le \sum_{i \in S} U_i(\ell_i + 1) + 1 = \sum_{i \in S} p_i(\ell_i + 1) + \sum_{i \in S} V_i(\ell_i + 1) + 1 \qquad (10)$$

tests.

Here we write $V_i = U_i - p_i$, which has expectation zero; taking expectations in (10) and using the bound on $\ell_i$ given in Lemma III.1, the bound (9) follows. ∎

III-B Discarding low probability items

As in [8], we use a probability threshold $p_0$, and write $P_0 = \{i : p_i \ge p_0\}$ for the population having removed items with $p_i < p_0$. If an item lies in the complement $\overline{P_0}$ we do not test it, and simply mark it as non-defective. This truncation operation gives an error if and only if some item in $\overline{P_0}$ is defective. By the union bound, this truncation operation contributes a total of at most $\sum_{i \in \overline{P_0}} p_i$ to the error probability.
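Concretely, the union bound gives

$$\mathbb{P}\big(\text{some discarded item is defective}\big) \;\le\; \sum_{i \in \overline{P_0}} p_i \;\le\; N p_0,$$

where the final, cruder, inequality simply counts at most $N$ discarded items, each with $p_i < p_0$.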

Lemma III.4

Choosing such that

(11)

ensures that

(12)
Proof:

The approach of [8] is essentially to bound so that . Hence, choosing a threshold of guarantees the required bound on .

We combine this with another bound, constructed using a different function: , so that

so we deduce the result. ∎

III-C Searching the entire set

Having discarded items with $p_i$ below the probability threshold $p_0$, and given bounding ratio $\lambda$, we create a series of bins. We collect together items with probabilities in $(1/\lambda, 1]$ in bin 0, items with probabilities in $(1/\lambda^2, 1/\lambda]$ in bin 1, items with probabilities in $(1/\lambda^3, 1/\lambda^2]$ in bin 2, …, and items with probabilities in $(1/\lambda^{k^*+1}, 1/\lambda^{k^*}]$ in bin $k^*$.

The probability threshold means that there will be a finite number of such bins, with the index $k^*$ of the last bin defined by the fact that $\lambda^{-(k^*+1)} < p_0 \le \lambda^{-k^*}$, meaning that $\lambda^{k^*} \le 1/p_0$, so

$$k^* \le \log_\lambda\left(\frac{1}{p_0}\right) = \frac{\log_2(1/p_0)}{\Lambda}. \qquad (13)$$

We split the items in each bin into search sets $S_1, S_2, \ldots$, motivated by the following definition:

Definition III.5

A set $S$ of items is said to be full if its total probability $P_S = \sum_{i \in S} p_i$ is at least a fixed threshold (chosen as a function of $\lambda$ in the analysis below).

Our splitting procedure is as follows: we create a list of possible sets $S_1, S_2, \ldots$. For $k$ increasing from $0$ to $k^*$, we place items from bin $k$ into sets $S_r, S_{r+1}, \ldots$, for some $r$, continuing from the first set not used by the previous bin. Taking the items from bin $k$, while the current set $S_r$ is not full (in the sense of Definition III.5) we place items into it. Once enough items have been added to fill $S_r$, we proceed in the same way to fill $S_{r+1}$, and so on until all the items in bin $k$ have been assigned to sets, where the last such set may remain not full.
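A Python sketch of the discard/bin/pack steps follows. The $\lambda$-adic probability bands and the fullness threshold of 0.5 are illustrative choices made for this sketch only; the analysis ties the exact fullness threshold to $\lambda$.

# Illustrative sketch of the splitting procedure: discard, bin, then pack.
import math

def build_search_sets(p, p0=1e-4, lam=2.0, full_at=0.5):
    """Return a list of search sets (lists of item indices)."""
    kept = [i for i, pi in enumerate(p) if pi >= p0]      # discard low-probability items
    num_bins = int(math.ceil(math.log(1.0 / p0, lam)))     # roughly log_lam(1/p0) bins suffice
    bins = [[] for _ in range(num_bins + 1)]
    for i in kept:
        k = min(num_bins, max(0, int(math.floor(math.log(1.0 / p[i], lam)))))
        bins[k].append(i)                                   # bin k: p_i roughly in (lam^-(k+1), lam^-k]
    sets = []
    for items in bins:                                      # sets never straddle two bins
        current, mass = [], 0.0
        for i in items:
            current.append(i)
            mass += p[i]
            if mass >= full_at:                             # the set is 'full': close it
                sets.append(current)
                current, mass = [], 0.0
        if current:                                         # the last set of a bin may not be full
            sets.append(current)
    return sets

# Example with a small, skewed probability vector.
p = [0.3, 0.2, 0.05, 0.04, 0.01, 0.004, 0.0005, 0.00005]
print(build_search_sets(p))   # -> [[0], [1], [2, 3], [4], [5], [6]]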

Proposition III.6

This splitting procedure will divide $P_0$ into search sets $S_1, \ldots, S_m$, where the total number of sets $m$ is bounded in terms of $\mu$, $\lambda$ and $k^*$.

Each set satisfies the Bounded Ratio Condition with constant $\lambda$ and has total probability at most 1.

Proof:

First, note that the items from bin 0 each lie in a set on their own. These sets will be full, trivially satisfy the Bounded Ratio Condition 1 with constant $\lambda$, and have total probability at most 1. For each of bins $1, \ldots, k^*$:

  1. For each bin $k$, it is possible that the last set will not be full, but every other set corresponding to that bin will be full. Hence, there are no more than $k^*$ sets which are not full.

  2. For each resulting set $S_r$, the total probability is at most 1 (since just before we add the final item, $S_r$ is not full, so at that stage its total probability lies below the fullness threshold, and each element in bins $1, \ldots, k^*$ has probability at most $1/\lambda$).

  3. Since each set contains items taken from the same bin, it will satisfy the Bounded Ratio Condition with constant $\lambda$.

Note that the number of full sets is at most a constant multiple of $\mu$, since

(14)
(15)
(16)

Since, as discussed in point 1) above, the total number of sets is bounded by the number of full sets plus $k^*$, the result follows using Equation (13). ∎

III-D Bounding the expected number of tests

We allow the algorithm to work until all defectives in $P_0$ are found, and write $T$ for the (random) number of tests this takes.

Proposition III.7

Given a population $P_0$ in which (independently) item $i$ is defective with probability $p_i$, we recover all defective items in $P_0$ in $T$ tests with $\mathbb{E}\, T \le T^*$, where

(17)
Proof:

Given a value of $\lambda$, Proposition III.6 shows that our splitting procedure divides $P_0$ into sets $S_1, \ldots, S_m$, such that each set satisfies the Bounded Ratio Condition with constant $\lambda$ and has total probability at most 1. Using the notation of Lemma III.3, the expected number of tests for each set $S_j$ is bounded as in (9), where $P_{S_j} = \sum_{i \in S_j} p_i$.

Adding this bound over the different sets, since $P_{S_j} \le 1$ means that $\log_2 P_{S_j} \le 0$, we obtain

(18)
(19)

This follows by the bound on the number of sets in Proposition III.6, as well as the fact that $p_i \log_2(1/p_i) \le h(p_i)$ for any $i$, so that $\sum_i p_i \log_2(1/p_i) \le H(\mathbf{p})$.

Finally, we choose $\lambda$ to optimize the second bracketed term in Equation (19). Differentiation identifies the optimal choice of $\lambda$, for which the bracketed term is bounded by an absolute constant,

and the result follows. ∎

III-E Controlling the error probabilities

Although Section III-D proves that $\mathbb{E}\, T \le T^*$, to bound the capacity we need to prove that, with high probability, $T$ is not significantly larger than $T^*$. This can be done using Bernstein’s inequality (see for example Theorem 2.8 of [17]):

Theorem III.8 (Bernstein)

For independent zero-mean random variables $V_1, \ldots, V_n$ which are uniformly bounded by $|V_i| \le M$, if we write $\sigma^2 = \sum_{i=1}^n \mathbb{E}\, V_i^2$ then, for any $t \ge 0$,

$$\mathbb{P}\left(\sum_{i=1}^n V_i \ge t\right) \le \exp\left(-\frac{t^2}{2\left(\sigma^2 + M t / 3\right)}\right). \qquad (20)$$

We deduce the following result:

Theorem III.9

Write $\mu = \sum_{i \in P_0} p_i$, and fix $\delta > 0$. Define a number of tests $T(\delta)$ by

(21)

where $T^*$ is given in (17).

  1. If we terminate our group testing algorithm after $T(\delta)$ tests, the success probability satisfies

    (22)
  2. Hence, in any regime where $H(\mathbf{p}) \to \infty$ and the error terms in (17) and (21) are of smaller order than $H(\mathbf{p})$, our group testing algorithm has (a) success probability tending to 1 for any $\delta > 0$, and (b) $T(\delta)/H(\mathbf{p}) \to 1$, so the capacity $C = 1$.

Proof:

We first prove the success probability bound (22). Recall that our algorithm searches the reduced population set $P_0$ for defectives. This gives two error events: either there are defective items in the discarded set $\overline{P_0}$, or the algorithm does not find all the defectives in $P_0$ using the allowed number of tests. We consider them separately, and control the probability of either happening using the union bound.

Choosing the threshold $p_0$ according to (11) ensures that (by Lemma III.4) the first event has suitably small probability, contributing the first term to (22).

Our analysis of the second error event is based on the random term $\sum_{i} V_i(\ell_i + 1)$ from Equation (10), which we previously averaged over but now wish to bound. There will be an error if the total number of tests exceeds $T(\delta)$, or (rearranging) if this zero-mean sum exceeds the slack which (21) allows above $T^*$.

For brevity, for $i \in P_0$, we write $V_i = U_i - p_i$ and $W_i = V_i(\ell_i + 1)$, where $W_i$ has expectation zero.

We have discarded elements with probability below $p_0$, as given by (11), and by design all the sets have total probability at most 1. Using the length bound of Section III-A we know that the $W_i$ are uniformly bounded by

(23)

Hence, the conditions of Bernstein’s inequality, Theorem III.8, are satisfied. Observe that since all the $p_i \le 1$, the variance term $\sigma^2$ can be bounded in terms of $\mu$ and the maximum word length.

Hence Theorem III.8 gives an exponential bound on the probability that this zero-mean sum exceeds the allowed slack.

Using the union bound, the probability bound (22) follows.

We next consider the capacity bound of part 2). Since $\mathbb{E}\, T \le T^*$, using (17) and (21) we obtain

(24)

which in our regime of interest tends to 1 in the limit, since the error terms are of smaller order than $H(\mathbf{p})$ by assumption. ∎

Proof:

In the case where all the $p_i$ are identical, say $p_i \equiv p$, we have $\mu = Np$ and $H(\mathbf{p}) = N h(p)$, so $\mu/H(\mathbf{p}) = p/h(p) \to 0$ as $p \to 0$. Similarly, the remaining error terms are of smaller order than $H(\mathbf{p})$, so that Corollary I.5 follows from Theorem III.9 as required. ∎

IV Results

The performance of Algorithm 1 (in terms of the sample complexity) was analysed by simulating 500 items, with a mean number of defectives equal to 8, i.e. $N = 500$ and $\mu = \sum_i p_i = 8$.

The probability distribution $\mathbf{p}$ was generated by a Dirichlet distribution with parameter $\alpha$. This produces an output distribution whose uniformity can be controlled via the parameter $\alpha$, as opposed to simply choosing a set of random numbers and normalising by the sum. Consider the case of two random numbers, $(x, y)$, distributed uniformly on the unit square $[0,1]^2$. Normalising by the sum projects the point onto the line $x + y = 1$, and so favours points closer to $(1/2, 1/2)$ than to the endpoints. The Dirichlet distribution avoids this by generating points directly on the simplex.
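A sketch of this set-up in Python follows; scaling the Dirichlet sample by $\mu$ (so that the probabilities sum to the desired expected number of defectives) and clipping at 1 are assumptions made for this sketch.

# Illustrative generation of the non-uniform probability vector p.
import numpy as np

def make_probabilities(N=500, mu=8.0, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(alpha * np.ones(N))   # point on the simplex; alpha controls uniformity
    return np.clip(mu * weights, 0.0, 1.0)        # expected number of defectives ~ mu

p = make_probabilities()
print(p.sum(), p.max())

Larger values of $\alpha$ give a more uniform vector (all items roughly equally likely), while small $\alpha$ concentrates the mass on a few items, which is the regime where the bin structure saves tests.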

We then chose values of the cutoff parameter $p_0$ from 0.0001 to 0.01, and for each value simulated the algorithm 1000 times. We plot the empirical distribution of the number of tests, varying $p_0$ as well as the uniformity/concentration of the probability distribution (via the parameter $\alpha$ of the Dirichlet distribution). We also plot, in Figure 1, the theoretical lower and upper bounds on the number of tests required for successful recovery alongside the empirical number of tests (all as a function of $p_0$).

Fig. 1: Theoretical lower and upper bounds and empirical test frequencies as functions of the cutoff $p_0$

Note that the upper bound is not optimal and there is still some room for improvement. Note also that the lower bound degrades with $p_0$. The lower bound was generated according to Theorem II.1.

Fig. 2: Cumulative success probability distribution curves of the algorithm with $\alpha$ fixed and the cutoff $p_0$ varying
Fig. 3: Cumulative success probability distribution curves for fixed $p_0$ and varying $\alpha$

Figures 2 and 3 show that the performance is relatively insensitive to the cut-off $p_0$, and more sensitive to the uniformity (or otherwise) of the probability distribution $\mathbf{p}$. Heuristically, this is because for distributions which are highly concentrated on a few items, algorithms can make substantial savings on the testing budget by testing those highly likely items first (which is captured in the bin structure of the above algorithm).

The insensitivity to the cutoff is due to items with probability below $p_0$ being overwhelmingly unlikely to be defective, which for small $p_0$ means that few items (relative to the size of the problem) get discarded.

V Discussion

We have introduced and analysed an algorithm for Probabilistic group testing which uses ‘just over’ $H(\mathbf{p})$ tests to recover all the defectives with high probability. Combined with a weak converse taken from [8], this allows us to deduce that the weak capacity of Probabilistic group testing is $C = 1$. These results are illustrated by simulation.

For simplicity, this work has concentrated on establishing a bound in (17) which has leading term $H(\mathbf{p})$, and not on tightening bounds on the coefficient of $\mu$ in (17). For completeness, we mention that this coefficient can be reduced from 3, under a simple further condition:

Remark V.1

For some $\theta$, we assume that all the $p_i \le \theta$, and we alter the definition of ‘fullness’ so that a set is full once its total probability exceeds a level depending on $\theta$. In this case, the additive term in (9) changes accordingly, the bound in (14) improves, and a corresponding term can be added to (19). Overall, the coefficient of $\mu$ becomes a function of $\theta$ and $\lambda$, which we can optimize over $\lambda$ to obtain a value strictly below 3 for small enough $\theta$.

It remains of interest to tighten the upper bound of Theorem II.1, in order to prove a strong converse, and hence confirm that the strong capacity is also equal to 1.

In future work, we hope to explore more realistic models of defectivity, such as those where the defectivity indicators $U_i$ are not necessarily independent, for example by imposing a Markov neighbourhood structure.

Acknowledgments

This work was supported by the Engineering and Physical Sciences Research Council [grant number EP/I028153/1]; Ofcom; and the University of Bristol. The authors would particularly like to thank Gary Clemo of Ofcom for useful discussions.

References

  1. R. Dorfman, “The detection of defective members of large populations,” The Annals of Mathematical Statistics, pp. 436–440, 1943.
  2. D. Du and F. Hwang, Combinatorial Group Testing and Its Applications, ser. Series on Applied Mathematics.   World Scientific, 1993.
  3. M. Malyutov, “Search for sparse active inputs: a review,” in Information Theory, Combinatorics and Search Theory, ser. Lecture notes in Computer Science.   London: Springer, 2013, vol. 7777, pp. 609–647.
  4. G. Atia and V. Saligrama, “Boolean compressed sensing and noisy group testing,” IEEE Trans. Inform. Theory, vol. 58, no. 3, pp. 1880 –1901, March 2012.
  5. D. Sejdinovic and O. T. Johnson, “Note on noisy group testing: Asymptotic bounds and belief propagation reconstruction,” in Proceedings of the 48th Annual Allerton Conference on Communication, Control and Computing, 2010, pp. 998–1003.
  6. M. P. Aldridge, L. Baldassini, and O. T. Johnson, “Group testing algorithms: bounds and simulations,” IEEE Trans. Inform. Theory, vol. 60, no. 6, pp. 3671–3687, 2014.
  7. T. Wadayama, “An analysis on non-adaptive group testing based on sparse pooling graphs,” in 2013 IEEE International Symposium on Information Theory, 2013, pp. 2681–2685.
  8. T. Li, C. L. Chan, W. Huang, T. Kaced, and S. Jaggi, “Group testing with prior statistics,” 2014, see arxiv:1401.3667.
  9. G. Atia, S. Aeron, E. Ermis, and V. Saligrama, “On throughput maximization and interference avoidance in cognitive radios,” in Consumer Communications and Networking Conference, 2008. CCNC 2008. 5th IEEE.   IEEE, 2008, pp. 963–967.
  10. N. Shental, A. Amir, and O. Zuk, “Identification of rare alleles and their carriers using compressed se (que) nsing,” Nucleic acids research, vol. 38, no. 19, pp. e179–e179, 2010.
  11. L. Baldassini, O. T. Johnson, and M. P. Aldridge, “The capacity of adaptive group testing,” in 2013 IEEE International Symposium on Information Theory, Istanbul Turkey, July 2013, 2013, pp. 2676–2680.
  12. V. Tan and G. Atia, “Strong impossibility results for sparse signal processing,” IEEE Signal Processing Letters, vol. 21, no. 3, pp. 260–264, March 2014.
  13. F. K. Hwang, “A method for detecting all defective members in a population by group testing,” Journal of the American Statistical Association, vol. 67, no. 339, pp. 605–608, 1972.
  14. C. Aksoylar, G. Atia, and V. Saligrama, “Sparse signal processing with linear and non-linear observations: A unified Shannon theoretic approach,” in 2013 IEEE Information Theory Workshop (ITW), Sept 2013, pp. 1–5.
  15. E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inform. Theory, vol. 52, no. 2, pp. 489–509, 2006.
  16. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
  17. V. V. Petrov, Limit Theorems of Probability Theory: Sequences of Independent Random Variables.   Oxford: The Clarendon Press, 1995.