Group Testing with Random Pools: Phase Transitions and Optimal Strategy

1 Université Paris-Sud, LPTMS, UMR 8626, Bât. 100, 91405 Orsay cedex, France
2 CNRS, LPTMS, UMR 8626, Bât. 100, 91405 Orsay cedex, France
3 Université Paris VI et VII, LPMA, UMR 7599, 4 Pl. Jussieu, 75005 Paris, France
4 CNRS, LPMA, UMR 7599, 4 Pl. Jussieu, 75005 Paris, France

Abstract

The problem of Group Testing is to identify defective items out of a set of $N$ objects by means of pool queries of the form "Does the pool contain at least one defective?". The aim is of course to perform detection with the fewest possible queries, a problem with relevant practical applications in different fields, including molecular biology and computer science. Here we study GT in the probabilistic setting, focusing on the regime of small defective probability and large number of objects, $p \to 0$ and $N \to \infty$. We construct and analyze one-stage algorithms for which we establish the occurrence of a non-detection/detection phase transition resulting in a sharp threshold, $\bar T$, for the number of tests. By optimizing the pool design we construct algorithms whose detection threshold follows the optimal scaling $\bar T \propto Np|\log p|$. Then we consider two-stage algorithms and analyze their performance for different choices of the first-stage pools. In particular, via a proper random choice of the pools, we construct algorithms which attain the optimal value (previously determined in Ref. [16]) for the mean number of tests required for complete detection. We finally discuss the optimal pool design in the case of finite $p$.

## 1 Introduction

The general problem of Group Testing (GT) is to identify defective items in a set of objects. Each object can be either defective or OK, and we are only allowed to test groups of items via the query "Does the pool contain at least one defective?". The aim is of course to perform detection in the most efficient way, namely with the fewest possible tests.

Apart from the original motivation of performing efficient mass blood testing [1], GT has been also applied in a variety of situations in molecular biology: blood screening for HIV tests [2], screening of clone libraries [3, 4], sequencing by hybridization [5, 6]. Furthermore it has proved relevant for fields other than biology including quality control in product testing [7], searching files in storage systems [8], data compression [9] and more recently in the context of data gathering in sensor networks [10]. We refer to [11, 12] for reviews on the different applications of GT.

The more abstract setting of GT is the following. We have $N$ items, and each one is associated with a binary random variable which takes value $0$ or $1$. We want to detect the value of all variables by performing tests on pools of variables. Each test corresponds to an OR function among the variables of the group, i.e. it returns a binary variable which equals 1 (respectively 0) if at least one variable of the pool equals 1 (respectively if all variables are 0). Here we will only deal with this (very much studied) choice for the tests, often referred to as the gold-standard case. It is however important to keep in mind for future work that in many biological applications one should include the possibility of faulty OR tests [2, 13].
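As a concrete illustration, the gold-standard OR test is straightforward to simulate. The following is a minimal sketch (function and variable names are ours, not from the paper):

```python
import random

def pool_test(status, pool):
    """Gold-standard OR test: returns 1 iff the pool contains at least one defective."""
    return int(any(status[i] for i in pool))

random.seed(0)
N, p = 20, 0.1
status = [int(random.random() < p) for _ in range(N)]   # i.i.d. Bernoulli(p) items
pools = [[0, 1, 2, 3], [4, 5, 6, 7], list(range(10, 20))]
outcomes = [pool_test(status, g) for g in pools]         # one bit per query
```

A single query thus reveals only whether its pool is entirely OK, which is why pool design matters so much.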

In all our study we will focus on probabilistic GT in the Bernoulli p-scheme, i.e. the situation in which the statuses of the items are i.i.d. random variables which take value one with probability $p$ and zero with probability $1-p$. In particular, we will be interested in constructing efficient detection algorithms for this GT problem in the limit of a large number of objects and small defective probability, $N \to \infty$ and $p \to 0$.

In order to summarize our results we first need to introduce some terminology. The construction of any algorithm for GT involves two ingredients: the pool design (the choice of the groups over which tests are performed) and the inference procedure (how to detect the value of the items given the result of the tests). The pool design can be composed of one or more stages of parallel queries. For one-stage (or fully non-adaptive) algorithms all tests are specified in advance: the choice of the pools does not depend on the outcome of the tests. This would in principle be the easiest procedure for several biological applications. Indeed the test procedure can be destructive for the objects, and repeated tests on the same sample require more sophisticated techniques. However, the number of tests required by fully non-adaptive algorithms can be much larger than for adaptive ones. The best compromise for most screening procedures [14] is therefore to consider two-stage algorithms, with a first stage containing a set of predetermined pools (tested in parallel) and a second stage whose pools are chosen depending on the outcomes of the first stage, i.e. after an inference procedure which uses the results of the first stage. Concerning the inference procedure, there exist both exact and approximate algorithms, which lead after the last stage to detecting the value of all variables with certainty or with high probability, respectively.

Here we will construct one-stage approximate algorithms and two-stage exact algorithms. In both cases the pool design for the first stage will involve random pools, and we will focus on the case $N \to \infty$ and $p \to 0$ with $p = N^{-\alpha}$, $\alpha \ge 0$ (the case $\alpha = 0$ stands for $p \to 0$ taken after $N \to \infty$). This choice was first discussed by Berger and Levenshtein in the two-stage setting in [15], where they proved that the minimal number of tests, optimized over all exact two-stage procedures, is proportional to $Np|\log p|$.

In the one-stage case we will establish the occurrence of a phase transition: considering two simple inference algorithms, we identify a threshold $\bar T$ such that the probability of making at least one mistake in the detection goes to one (respectively to zero) as $N \to \infty$ if the number of tests is below (respectively above) $\bar T$. By optimizing over the pool distribution, we will construct algorithms for which the detection threshold shows the optimal scaling $\bar T \propto Np|\log p|$.

Recently, in Ref. [16], the value of the prefactor of this scaling has been determined exactly for two-stage procedures in the limit $N \to \infty$, $p \to 0$. More precisely, the authors have shown that the minimal mean number of tests satisfies $\bar T \simeq Np|\ln p|/(\ln 2)^2$. Here we will discuss the performance of two-stage algorithms for different choices of the first-stage pool design. In particular we will show that the optimal value is obtained on random pools with a properly chosen fixed number of tests per variable and of variables per test (regular-regular case), and also when the number of tests per variable is fixed but the number of variables per test is Poisson distributed (regular-Poisson case). On the other hand we will show that this optimal value can never be attained in the Poisson-regular or in the Poisson-Poisson case. Finally, we discuss the optimal pool design when $N \to \infty$ and $p$ is held fixed.

The paper is organized as follows: In Sec. 2 we introduce the factor graph representation of the problem in the most general case. In Sec. 3 we describe the first simple inference procedure, which allows us to identify the sure variables. In Sec. 4 we analyze one-stage approximate algorithms, while in Sec. 5 we turn to the two-stage exact setting. Finally, in Sec. 6 we give a perspective on our work in view of applications.

## 2 Pool design: factor graph representation and random pools

As we have explained, a GT algorithm can involve one or more stages of parallel tests. The best way to define the pool design of each stage is in terms of a factor graph representation. We build a graph with two types of vertices: each variable is a vertex (variable node) and each test is also a vertex (function node). Variable (function) nodes will be denoted by indices $i, j, \dots$ ($a, b, \dots$) and depicted by a circle (square). Whenever a variable $i$ belongs to test $a$ we set an edge between vertex $i$ and vertex $a$. Thus, if $N$ is the overall number of items and $T$ the number of parallel tests in the stage, we obtain a bipartite graph with $N$ variable nodes and $T$ test nodes, with edges between variables and tests only (Fig. 1 depicts a small example).

We will denote by $v_i$ (by $c_k$) the fraction of variable nodes (function nodes) of degree $i$ (degree $k$) and use a practical representation of these degree profiles, standard in coding theory, in terms of their generating functions $V(x) = \sum_i v_i x^i$ and $C(x) = \sum_k c_k x^k$. The average variable node (resp. function node) degree is given by $V'(1)$ (resp. $C'(1)$).

Both in the one- and two-stage case we will use pool designs with a first stage based on a randomly generated factor graph. We will consider different possible distributions, but in all cases they will be uniform over the set of graphs for a fixed choice of the degree profiles $\{v_i\}$ and $\{c_k\}$. Thus the probabilities $\lambda_i$ (respectively $\rho_k$) that a randomly chosen edge in the graph is adjacent to a variable node (resp. function node) of degree $i$ (degree $k$) are given by

$$\lambda_i = \frac{i\,v_i}{\sum_j j\,v_j} = \frac{i\,v_i}{V'(1)}\,, \qquad \rho_k = \frac{k\,c_k}{\sum_j j\,c_j} = \frac{k\,c_k}{C'(1)}\,, \tag{1}$$

which are derived by noticing that the graph contains $N i v_i$ (resp. $T k c_k$) edges adjacent to variable nodes of degree $i$ (resp. function nodes of degree $k$). We also define the edge perspective degree profiles as $\lambda(x) = \sum_i \lambda_i x^{i-1}$ and $\rho(x) = \sum_k \rho_k x^{k-1}$, namely

$$\lambda(x) = \frac{V'(x)}{V'(1)}\,, \qquad \rho(x) = \frac{C'(x)}{C'(1)}\,. \tag{2}$$

Note that the number of tests can also be written in terms of these sequences: the mean degree of variables, $V'(1)$, and the mean degree of tests, $C'(1)$, are related by $N V'(1) = T C'(1)$, since both sides count the total number of edges. As $\int_0^1 \lambda(x)\,dx = 1/V'(1)$ and $\int_0^1 \rho(x)\,dx = 1/C'(1)$, we get $T = N V'(1)/C'(1)$. Therefore

$$T = N\,\frac{\int_0^1 \rho(x)\,dx}{\int_0^1 \lambda(x)\,dx}\,. \tag{3}$$

## 3 First inference step: sure and isolated variables

After the first stage of tests we will either use an inference procedure to identify the result (in the one-stage case) or choose a new set of pools based on the outcomes of the previous tests (in the two-stage case). In our problem, the prior distribution of the variables $\underline{x} = (x_1, \dots, x_N)$ is Bernoulli: $P(x_i = 1) = p = 1 - P(x_i = 0)$. Given the outputs of the tests, the inference problem consists in finding the configuration $\underline{x}$ which maximizes

$$P(\underline{x}\,|\,\underline{T}) \propto \prod_a \mathbb{1}\!\left[T_a = \mathbb{1}\!\left(\textstyle\sum_{i \in \partial a} x_i \ge 1\right)\right] \prod_i p^{x_i} (1-p)^{1-x_i}\,. \tag{4}$$

Here $T_a$ is the value of test $a$: $T_a = 1$ if $\sum_{i \in \partial a} x_i \ge 1$ and $T_a = 0$ otherwise, where $\partial a$ is the pool of variables connected to $a$.

Since the optimization of the above function is in general a very difficult task, we start by checking whether some variables are identified with certainty by the first stage, and then try to extract information on the remaining variables (see Fig. 1). The first observation is that in order for a variable to be a sure zero it should belong to at least one test with outcome zero. On the other hand, in order to be a sure one it should belong to at least one positive test in which all the other variables are sure zeros. Variables that are neither sure zeros nor sure ones are the undetermined variables.

We start by noticing that if a test contains only zeros, or if it contains at least one sure one, then it does not carry any information on the undetermined variables. We call such a test strippable, as is the case for some of the tests in Fig. 1. It is then immediate to verify that we have no information on a variable if it is undetermined and all the tests to which it belongs are strippable. We call such an undetermined variable isolated, as is the case for one of the variables in Fig. 1. The above terminology is motivated by the fact that all the information on the undetermined variables is encoded in a reduced graph (see the right part of Fig. 1), constructed via the following stripping procedure: erase all variable nodes which correspond to sure variables and all test nodes which are strippable (note that isolated variables are those that are not connected to any test in the reduced graph). Therefore the inference problem corresponding to the optimization of (4) can be rephrased as a Hitting Set problem on the corresponding reduced graph [17].
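The identification of sure variables and the stripping procedure can be sketched in a few lines (a sketch under our own naming; the reduced graph is returned as the list of surviving test indices):

```python
def strip(pools, outcomes):
    """Classify variables as sure zeros / sure ones from OR-test outcomes
    and return the tests that survive the stripping procedure."""
    sure_zero = {i for g, t in zip(pools, outcomes) if t == 0 for i in g}
    sure_one = set()
    for g, t in zip(pools, outcomes):
        others = [i for i in g if i not in sure_zero]
        # positive test whose other members are all sure zeros:
        # the single remaining variable must be defective
        if t == 1 and len(others) == 1:
            sure_one.add(others[0])
    # strippable tests: outcome zero, or containing a sure one
    reduced = [a for a, (g, t) in enumerate(zip(pools, outcomes))
               if t == 1 and not set(g) & sure_one]
    return sure_zero, sure_one, reduced

z, o, reduced = strip([[0, 1], [1, 2], [2, 3]], [0, 1, 1])
# here 0 and 1 are sure zeros, 2 is a sure one, both positive tests are
# strippable, and variable 3 is undetermined and isolated
```

Variables absent from all surviving tests are exactly the isolated ones of the text.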

Given a variable $i$ and a choice of the pools, the probability $p_0(i)$ (respectively $p_1(i)$) that $i$ is a sure zero (resp. a sure one) can be found as follows. Let us denote by $\partial a$ (by $\partial i$) the set of variable nodes connected to test $a$ (resp. the set of test nodes connected to variable $i$). We introduce the indicator $\chi_i^0$ that $i$ is a sure zero, as well as the indicator $\chi_i^1$ that $i$ is a sure one:

$$\chi_i^0 = (1 - x_i)\left[1 - \prod_{a \in \partial i}\left(1 - \psi_a^{(i)}\right)\right], \tag{5}$$

$$\chi_i^1 = x_i\left[1 - \prod_{a \in \partial i}\left(1 - \prod_{j \in \partial a \setminus i} \chi_j^0\right)\right], \tag{6}$$

which are expressed in terms of the indicator that all the variables of test $a$ other than $i$ are zero,

$$\psi_a^{(i)} = \prod_{j \in \partial a \setminus i} (1 - x_j)\,. \tag{7}$$

Then $p_0(i)$ and $p_1(i)$ are given by:

$$p_0(i) = \sum_{\underline{x}} P(\underline{x})\, \chi_i^0(\underline{x})\,, \tag{8}$$

$$p_1(i) = \sum_{\underline{x}} P(\underline{x})\, \chi_i^1(\underline{x})\,, \tag{9}$$

where the sum is over all $\underline{x} \in \{0,1\}^N$ and $P(\underline{x}) = \prod_i p^{x_i}(1-p)^{1-x_i}$.

It is clear that Eq. (8) for $p_0(i)$ involves only the variables at graph distance two from $i$. Thus, if $i$ does not belong to a loop of length four in the factor graph, the $\psi_a^{(i)}$, $a \in \partial i$, are independent variables and the mean over the variable values in (8) can easily be carried out, yielding

$$p_0(i) = (1-p)\left[1 - \prod_{a \in \partial i}\left(1 - (1-p)^{K_a - 1}\right)\right], \tag{10}$$

where $K_a$ is the number of variables which belong to test $a$. Then, for any given choice of the degree profiles of the random factor graph, if the probability that two tests have degrees $k$ and $k'$ factorizes, we can easily perform the mean over the uniform distribution of the factor graphs. This leads to a value which no longer depends on the index $i$ and can be rewritten as $p_0$, with

$$p_0 = (1-p)\left[1 - V\!\left(1 - \rho(1-p)\right)\right]. \tag{11}$$

Formula (9) for $p_1(i)$ involves only variables at distance at most four from $i$. If the ball of radius 4 centered in $i$ does not contain any loop, we can easily perform the mean over the variables in (9) and get $p_1$, with

$$p_1 = p\left[1 - V\!\left(1 - \rho(q)\right)\right], \qquad q = (1-p)\left[1 - \lambda\!\left(1 - \rho(1-p)\right)\right]. \tag{12}$$

The probabilities that a test is strippable, that a variable is an isolated zero, and that a variable is an isolated one are instead given by

(13) |

(14) |

(15) |

In this case, if there is no loop in the ball of radius 6 centered on $i$, we can easily perform the mean over the variables and over the random graph distribution, which yields

(16) |

with

(17) |

## 4 One-stage algorithms

In this section we analyze one-stage algorithms when the number of items, $N$, goes to infinity and the defect probability, $p$, goes to zero as $p = N^{-\alpha}$ with $\alpha > 0$. When constructing the pools we use random graph ensembles of two types: either regular-regular (R-R) graphs (fixed connectivity both for test and variable nodes) or regular-Poisson (R-P) graphs (fixed connectivity for variables, Poisson distribution for the test degree). As for the inference procedure, we will consider two types of algorithms: the Easy Algorithm (EA) and Belief Propagation (BP). We will show that both undergo a non-detection/detection phase transition when one varies the number of tests, $T$: we identify a threshold $\bar T$ such that for $T < \bar T$ the overall detection error goes (as $N \to \infty$) to one, while for $T > \bar T$ it goes to zero. In this regime we can establish analytically the value of $\bar T$, which turns out to be equal for the two algorithms: EA and BP have the same performance in the large $N$ limit. We will explain why this transition is robust, and we will optimize the pool design (i.e. the choice of the parameters of the regular-regular and regular-Poisson graphs) to obtain the smallest possible $\bar T$. The resulting algorithms have a threshold value which satisfies $\bar T \propto Np|\log p|$. This is the same scaling in $N$ and $p$ as for the optimal number of tests in an exact two-stage algorithm, albeit with a different prefactor.

### 4.1 Pool design

Given a random graph ensemble, we denote by $T$ the number of test nodes, by $k$ the mean degree of tests (which also coincides with the degree of each test in the R-R case) and by $\ell$ the degree of each variable, and we let

$$T = \frac{N \ell}{k}\,. \tag{18}$$

The degree profile polynomials are $V(x) = x^{\ell}$ in both cases, with $C(x) = x^{k}$ for R-R graphs and $C(x) = e^{k(x-1)}$ for R-P graphs.

Then, if the hypotheses on the absence of short loops which lead to (11), (12) and (16) are valid, the probabilities introduced above are given in the R-R case by:

(19) |

(20) |

(21) |

(22) |

In the R-P case they are given by:

(23) |

(24) |

(25) |

(26) |

It is easy to verify that, in leading order when $N \to \infty$ and $p \to 0$, the above quantities for the regular-regular and regular-Poisson cases coincide. In particular, with the parametrization introduced below, they are given by

(27) |

if | |||||

if | (28) | ||||

if |

and

if | |||||

if | (29) | ||||

if |

where we set and , for .

Let us discuss in what range of parameters one expects the above asymptotic behaviors to be valid. As explained in section 3, the only hypothesis in their derivation consists in neglecting the presence of some short loops in a proper neighborhood of the chosen variable. In particular, the equation for $p_0$ is valid if we can neglect the presence of loops of length four through a given variable. Consider for definiteness the R-R case. The probability of having at least one loop of length four through a given variable $i$ verifies

which goes to zero for . Thus we are guaranteed that (19) is correct in this regime. By the same type of reasoning, we can show that the formulas for and are valid respectively for and . However through the following heuristic argument, one can expect that the formula for (resp. ) be correct in the larger regimes (resp. ). Indeed, when we evaluate we need to determine whether variables at distance from a variable are sure zeros. We expect the probability of this joint event to be well approximated by the product of the single event probabilities if the number of tests that a variable at distance from shares with the others is and if the number of variables that a test at distance from shares with the others is . Both conditions are satisfied if (the probability that a test at distance belongs to more than one variable at distance goes as and the probability that a variable at distance belongs to more than one test at distance goes as ). For the argument is similar but, since we have a further shell in tests and variables to analyze in order to determine whether a variable is isolated or not, we get an extra factor in the exponents which lead to the validity of the approximations only for .

### 4.2 Easy algorithm (EA)

A straightforward inference procedure is the one that fixes the sure variables to their value and does not analyze the remaining information carried by the tests, thus setting all other variables to zero (since $p$ is small, zero is the most likely value). We call this procedure the Easy Algorithm (EA). By definition, the probability that a variable is set to a wrong value, $P_{err}$, is given by $p - p_1$: only defective variables which are not sure ones are misclassified. In the hypothesis of independent bit errors, i.e. if we suppose that the probability of making at least one mistake satisfies $P_{tot} \simeq 1 - (1 - P_{err})^N$, and if the approximations discussed at the end of the previous section hold, we can apply (4.1), which yields

if | |||||

if | (30) | ||||

if |

both for the R-R and R-P graphs. Therefore EA displays a phase transition in the large $N$ limit: as one varies the number of tests, one passes from a region in which the probability of at least one error, $P_{tot}$, goes to one, to a region where it goes to zero. The threshold of this transition is given by

(31) |

The most efficient pools, within the R-R and R-P families, are obtained by minimizing $\bar T$ with respect to the free parameter. The value of the optimal threshold and the parameter value at which it is attained are

This, together with (18), gives a threshold

(32) |

for the number of tests. Note that the threshold in the case $\alpha = 0$, i.e. if we send $p \to 0$ after $N \to \infty$, is infinite. This corresponds to the fact that, for any choice of the pool parameters, the bit error stays finite when $N \to \infty$ at fixed $p$.

In order to verify the above results, and the approximations on which they are based, we have performed numerical simulations for R-R graphs at several system sizes. The results we obtain confirm that in this regime bit errors can be regarded as independent and that formulas (19)–(22) are valid. The values of the total error as a function of the control parameter are depicted in Fig. 2a for different values of $N$. The value of the threshold connectivity and the form of the finite size corrections for the total error (continuous curves) are in excellent agreement with the above predictions (4.2) and (31). Furthermore, we have verified that outside this regime both the independent bit error approximation and the approximation leading to Eq. (4.2) fail, as expected. This can be seen for example in Fig. 3a, where the numerical results (black dots) differ from the continuous line which corresponds to Eq. (4.2), thus confirming that in this case neither the shape of the finite size corrections nor the position of the threshold can be derived from (4.2).
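The transition can be probed with a small Monte Carlo experiment. The following is a simplified sketch (uniformly random pools of fixed size rather than the exact R-R ensemble, and all parameter values are illustrative, not taken from the paper):

```python
import random

def ea_total_error(N, T, k, p, trials=40, seed=1):
    """Fraction of runs in which the Easy Algorithm makes at least one
    mistake: sure variables get their value, everything else is set to 0."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        x = [int(rng.random() < p) for _ in range(N)]
        pools = [rng.sample(range(N), k) for _ in range(T)]
        outcomes = [int(any(x[i] for i in g)) for g in pools]
        sure_zero = {i for g, t in zip(pools, outcomes) if t == 0 for i in g}
        sure_one = set()
        for g, t in zip(pools, outcomes):
            unk = [i for i in g if i not in sure_zero]
            if t == 1 and len(unk) == 1:
                sure_one.add(unk[0])
        guess = [int(i in sure_one) for i in range(N)]
        errors += guess != x
    return errors / trials
```

At fixed $N$, $p$ and pool size $k$, sweeping $T$ shows the estimated error dropping from near one to near zero as the number of tests grows through the threshold region.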

### 4.3 Belief Propagation (BP)

The algorithm considered in the previous section is very simple, since it does not exploit the information contained in the reduced graph (see section 3). We will now define a different algorithm in order to extract as much information as we can from the first stage. As already explained in section 3, this requires in principle the optimization of Eq. (4). In order to perform this task we will use the Belief Propagation (BP) algorithm to estimate for each variable the value of its marginal probability. Then we will set to one (to zero) the variables whose marginal favors one (zero, respectively). Let us derive the BP equations for the marginal probabilities. We denote by $\partial i$ ($\partial a$) the set of function (variable) nodes connected to the variable node $i$ (respectively to the function node $a$), by the single-site cavity probability the law of the $i$-th variable in the absence of test $a$, and by the joint cavity distribution the law of the variables of a test in the absence of that test. We can then write

where by we denote the vector . Furthermore we make the usual assumption that the joint cavity distributions factorize

which leads to closed equations for the set of single variable cavity probabilities. In order to simplify these equations we define a normalized message from function node to variable node as

and therefore

and

Using the fact that each variable takes values in $\{0,1\}$ and that the messages are normalized, we introduce cavity fields and cavity biases defined as follows

The BP equation for the cavity biases and fields are:

if | ||||

if |

and

Our detection procedure corresponds to initializing the cavity fields and biases to some values and iterating the BP equations above until they converge. Then the marginal probability distribution can be rewritten as

with the full local field satisfying

and the inference procedure is completed by setting $x_i$ to one (to zero) according to the sign of the full local field. Note that on the sure variables the BP algorithm leads to the correct detection. Furthermore, one should expect its performance to be better than EA's: since we also analyze the information which comes from tests which are not strippable, it is possible that some of the undetermined variables, which are all set to zero in EA, are here correctly detected.
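The same message-passing scheme can be written compactly in the probability domain rather than in terms of fields. The following is a sketch under our own naming (messages are stored as $P(x_i = 0)$, and no safeguards are included for contradictory outcomes):

```python
def bp_marginals(pools, outcomes, N, p, iters=50):
    """Sum-product BP for OR-test factors; returns the marginals P(x_i = 1).
    Exact on loop-free factor graphs."""
    tests_of = [[] for _ in range(N)]
    for a, g in enumerate(pools):
        for i in g:
            tests_of[i].append(a)
    msg = {(a, i): 0.5 for a, g in enumerate(pools) for i in g}  # test->var, P(x_i=0)
    for _ in range(iters):
        new = {}
        for a, g in enumerate(pools):
            for i in g:
                if outcomes[a] == 0:
                    new[(a, i)] = 1.0          # a negative test forces x_i = 0
                    continue
                # product of cavity probabilities q_j = P(x_j = 0) without test a
                prod = 1.0
                for j in g:
                    if j == i:
                        continue
                    w0, w1 = 1.0 - p, p
                    for b in tests_of[j]:
                        if b != a:
                            w0 *= msg[(b, j)]
                            w1 *= 1.0 - msg[(b, j)]
                    prod *= w0 / (w0 + w1)
                # positive test: x_i = 0 allowed with weight P(some other j = 1)
                new[(a, i)] = (1.0 - prod) / (2.0 - prod)
        msg = new
    marg = []
    for i in range(N):
        w0, w1 = 1.0 - p, p
        for a in tests_of[i]:
            w0 *= msg[(a, i)]
            w1 *= 1.0 - msg[(a, i)]
        marg.append(w1 / (w0 + w1))
    return marg
```

On the chain of pools $\{0,1\}, \{1,2\}$ with outcomes $1, 0$, the procedure recovers the sure one with certainty: the negative test forces $x_1 = x_2 = 0$, and the positive test then forces $x_0 = 1$.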

In order to test the performance of the BP algorithm we ran the procedure on the regular-regular graph with the same parameters as for EA. The total error probability as a function of the control parameter is reported in Fig. 2b (black squares). As for EA, a non-detection/detection phase transition occurs, and at the same threshold. Thus, even though EA is a much simpler algorithm, the performances of the two coincide in the large $N$ limit, suggesting that for this choice of parameters the reduced graph does not carry any additional information. In Fig. 3a we plot instead the total error of EA and BP in a regime where the loop approximations fail. The data indicate that the BP algorithm performs much better than EA in this case: the reduced graph carries information which is used by BP to improve the detection. We have also verified that the difference between BP and EA performance does not diminish as the size of the graph is increased. In Fig. 3b we plot the results for BP in the same regime but for different values of $N$. The data become sharper as $N$ is increased. Similarly to the previous case, this seems to indicate the presence of a sharp phase transition in the thermodynamic limit.

Let us start by evaluating the non-detection/detection threshold from the BP equations, and then explain why we expect it to coincide with the one for EA, at least in the regime where the approximations of section 4.1 hold.

If we denote the mean over the random graph distribution of the distribution of the full local field on a variable, conditioned on that variable being a zero (respectively a one), the probability of setting the $i$-th variable to a wrong value is here given in terms of

(33) |

(34) |

From the BP equations it is easy to obtain the following 'replica symmetric' cavity equations satisfied by these distributions [18]:

(35) |

(36) |

where

(37) |

(38) |

(39) |

(40) |

It is now easy to verify that these expressions reproduce $p_0$ and $p_1$, the probabilities that a variable is a sure zero and a sure one respectively, given by Eqs. (19) and (20). Furthermore, the following relation holds

where $p_{iso}$ is the probability that a variable is isolated, given in Eq. (22).

By using the above observations, together with the definitions (33) and (34) for the bit error probabilities, one obtains the following inequalities

(41) |

(42) |

We will now show how it is possible to locate the non-detection/detection transition from these inequalities, without the need of evaluating the bit error probabilities.

The leading-order behavior of the quantities $p_0$, $p_1$ and $p_{iso}$ has been evaluated in section 4.1, and the higher-order corrections can be computed in the same regime. Thus

Therefore, under the assumption of independent bit errors, we get

in the two regimes, so that the above bounds on the total error imply the occurrence of a phase transition at the same value found with the EA algorithm (see (31)). Thus the performances of EA and BP coincide if the approximations leading to Eqs. (19), (20) and (22) are correct. By the discussion at the end of section 4.1 we know that these approximations are under full control in a first regime, and we expect them to hold in a wider one. We conclude that in this regime the value of the threshold for the BP transition equals the one for EA (31), as is indeed confirmed by the numerical results that we already discussed (see Fig. 2). We stress that there is no reason for this to be true in the regime where the approximations of neglecting proper loops, which lead to (19), (20) and (22), do not hold. For example, as shown in Figs. 3a and 3b, even if a sharp non-detection/detection phase transition seems to occur in that case, the error probability is certainly not in agreement with (4.2), which for the chosen parameters would yield a different threshold.

Note that in the discussion above we have upper bounded the bit error with the error over all variables that are neither sure nor isolated, and lower bounded it with the error over isolated variables. It is thus immediate to see that the position of the phase transition remains unchanged for all algorithms which set to zero all the isolated variables (which is the best guess, since we have no information on them and $p < 1/2$) and set the sure variables to their correct value (EA is indeed the simplest algorithm of this class). This is due to the fact that the mean number of tests in the reduced graph goes to zero in the detection regime, as can be checked using formula (13) and neglecting loops.

Finally, we would like to stress that even if we have shown that the EA and BP inference procedures are optimal for R-R and R-P pool designs, at least in the regimes discussed above, this does not imply that these pool designs are optimal over all possible designs of the factor graph. However, an indication that they might be optimal comes from the results on two-stage exact algorithms presented in section 5. As a further check, we have evaluated the thresholds for the Poisson-Poisson (P-P) and Poisson-regular (P-R) cases. Using the same technique as above, we found in both cases a non-detection/detection phase transition which occurs at the same threshold for EA and BP. With the analogous parametrization, the threshold value is

(43) |

By optimizing (43) over the choice of the parameters, we get a threshold which is larger than the optimal one for the R-R and R-P cases.

## 5 Two-stage algorithms

In this section we analyze two-stage exact algorithms when the number of items, $N$, goes to infinity and the defect probability, $p$, goes to zero. This setting was first discussed by Berger and Levenshtein in [15], where they proved that the minimal (over all two-stage exact procedures) mean number of tests, $\bar T$, satisfies the bounds

In [16] two of the authors have derived the prefactor of the above scaling,

$$\bar T \simeq \frac{N p\,|\ln p|}{(\ln 2)^2}\,, \tag{44}$$

and constructed a choice of algorithms over which this optimal value is attained. Note that our analysis includes the case $\alpha = 0$, namely the situation in which the limit $p \to 0$ is taken after $N \to \infty$. Note also that the asymptotic result (44) lies above the information theoretic bound $N h(p) \simeq N p\,|\ln p|/\ln 2$. In section 5.1 we give a short account of the derivation of (44) and we construct an optimal algorithm. In section 5.2 we test the performance of algorithms corresponding to different choices of the random pools of the first stage.

### 5.1 Optimal number of tests for $N \to \infty$, $p \to 0$

An exact two-stage algorithm involves a first stage of tests after which all sure variables are identified and set to their value. Then a second stage is performed where all the remaining variables are individually tested. The mean number of tests, $\bar T$, is therefore given by

$$\bar T = T + \sum_{i=1}^{N} \big(1 - p_0(i) - p_1(i)\big)\,, \tag{45}$$

where $T$ is the number of tests of the first stage and $p_0(i)$ and $p_1(i)$ are the probabilities for variable $i$ to be a sure zero and a sure one. The latter are in turn given by Eqs. (8) and (9), with the neighborhoods of tests and variables being those of the first stage.

It is immediate to verify that in the limit $N \to \infty$, $p \to 0$ the number of individual checks over undetected ones is irrelevant, i.e.

(46) |

Furthermore, $p_0(i)$ is always upper bounded by the expression (10) obtained by neglecting loops, as is proven in [16] by using the Fortuin-Kasteleyn-Ginibre inequality [20], together with the observation that the events that at least one variable equals one in each of two (or more) intersecting pools are positively correlated. We define the fraction of sites whose neighborhood contains given numbers of tests of each degree. By using (10) and (46), the optimal number of tests over all two-stage procedures can be lower bounded as

(47) |

where the infimum is over all possible probability distributions with and

(48) |

Minimization over the degree profiles can then be carried out, and leads in the limit to

(49) |

Furthermore, the above minimization procedure shows that the infimum is attained on regular degree profiles. This implies that the lower bound is saturated on the uniform distribution over regular-regular graphs with the corresponding degrees, provided that we can neglect loops in the evaluation of $p_0$. This, as already explained in section 4.1, is true in the regime considered here. Note that the optimal result is also attained if, instead of a random construction of the pools, we fix a regular-regular graph which has no short loops and has the same choice of test and variable degrees as above. The existence of at least one such graph for these choices of degrees is guaranteed by the results in [19]. Thus we have established the result (44) for the optimal number of tests over all exact two-stage procedures, and constructed algorithms based on regular-regular graphs which attain this optimal value.

### 5.2 Testing different pool designs for

We will now check the performance of different pool designs, corresponding to different random distributions for the pools in the first stage. In all cases we will fix the degree profiles $\{v_i\}$ and $\{c_k\}$ and consider a uniform distribution over graphs with these profiles. Using the notation of section 3 and neglecting the presence of loops, the mean number of tests (45) can easily be rewritten as

(50) |

(we suppose that the fraction of both test and variable nodes of degree zero vanishes). As for the one-stage case, we consider four different choices of the connectivity distributions, corresponding to regular-regular (R-R), regular-Poisson (R-P), Poisson-Poisson (P-P) and Poisson-regular (P-R) graphs, and for each choice we have optimized over the parameters of the distribution. The corresponding degree profiles and edge perspectives are given in section 4.1.
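The comparison between first-stage designs can also be carried out by direct simulation of Eq. (45). The following is a sketch, with uniformly random fixed-size pools as a stand-in for the ensembles above and illustrative parameter values:

```python
import random

def two_stage_mean_tests(N, T1, k, p, trials=200, seed=2):
    """Monte Carlo estimate of the mean total number of tests of an exact
    two-stage algorithm: T1 pools of size k, then one individual test for
    every variable that is neither a sure zero nor a sure one."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = [int(rng.random() < p) for _ in range(N)]
        pools = [rng.sample(range(N), k) for _ in range(T1)]
        outcomes = [int(any(x[i] for i in g)) for g in pools]
        sure_zero = {i for g, t in zip(pools, outcomes) if t == 0 for i in g}
        sure_one = set()
        for g, t in zip(pools, outcomes):
            unk = [i for i in g if i not in sure_zero]
            if t == 1 and len(unk) == 1:
                sure_one.add(unk[0])
        total += T1 + N - len(sure_zero) - len(sure_one)
    return total / trials
```

Sweeping the first-stage size $T_1$ and pool size $k$ then locates the design minimizing the estimated mean number of tests, mirroring the optimization over degree profiles described above.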