Secure Group Testing A. Cohen, A. Cohen and O. Gurewitz are with the Department of Communication Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel (e-mail: alejandr@post.bgu.ac.il; coasaf@bgu.ac.il; gurewitz@bgu.ac.il). S. Jaggi is with the Department of Information Engineering, Chinese University of Hong Kong (jaggi@ie.cuhk.edu.hk). Parts of this work were presented at the IEEE International Symposium on Information Theory, ISIT 2016.


Alejandro Cohen              Asaf Cohen               Sidharth Jaggi               Omer Gurewitz
Abstract

The principal goal of Group Testing (GT) is to identify a small subset of “defective” items from a large population, by grouping items into as few test pools as possible. The test outcome of a pool is positive if it contains at least one defective item, and is negative otherwise. GT algorithms are utilized in numerous applications, and in many of them maintaining the privacy of the tested items, namely, keeping secret whether they are defective or not, is critical.

In this paper, we consider a scenario where there is an eavesdropper (Eve) who is able to observe a subset of the GT outcomes (pools). We propose a new non-adaptive Secure Group Testing (SGT) algorithm based on information-theoretic principles, which keeps the eavesdropper ignorant regarding the items’ status. Specifically, when the fraction of tests observed by Eve is $0 \leq \delta < 1$, we prove that the number of tests required for both correct reconstruction at the legitimate user (with high probability) and negligible information leakage to Eve is $\frac{1}{1-\delta}$ times the number of tests required with no secrecy constraint, in the regime where the number of defective items $K$ is fixed. By a matching converse, we completely characterize the Secure GT capacity. Moreover, we consider a computationally efficient decoding algorithm, and prove that for $\delta < 1/2$ the number of tests required, without any constraint on $K$, is at most $\frac{1}{1/2-\delta}$ times the number of tests required with no secrecy constraint.

I Introduction

The classical version of Group Testing (GT) was suggested during World War II in order to identify syphilis-infected draftees while dramatically reducing the number of required tests [1]. Specifically, when the number of infected draftees, $K$, is much smaller than the population size, $N$, instead of examining each blood sample individually, one can conduct a small number of tests on pooled samples. Each pool outcome is negative if it contains no infected sample, and positive if it contains at least one infected sample. The problem is thus to identify the infected draftees via as few pooled tests as possible. Figure 1 (a)-(c) depicts a small example.

Since its exploitation in WWII, GT has been utilized in numerous fields, including biology and chemistry [2, 3], communications [4, 5, 6, 7], sensor networks [8], pattern matching [9] and web services [10]. GT has also found applications in the emerging field of Cyber Security, e.g., detection of significant changes in network traffic [11], Denial of Service attacks [12] and indexing information for data forensics [13].

Many scenarios which utilize GT involve sensitive information which should not be revealed if some of the tests leak (for instance, if one of the several labs to which tests have been distributed for parallel processing is compromised). However, in GT, leakage of even a single pool-test outcome may reveal significant information about the tested items. If the test outcome is negative it indicates that none of the items in the pool is defective; if it is positive, at least one of the items in the pool is defective (see Figure 1 (d) for a short example). Accordingly, it is critical to ensure that a leakage of a fraction of the pool-tests outcomes to undesirable or malicious eavesdroppers does not give them any useful information on the status of the items. It is very important to note that protecting GT is different from protecting the communication between the parties. To protect GT, one should make sure that information about the status of individual items is not revealed if a fraction of the test outcomes leaks. However, in GT, we do not want to assume one entity has access to all pool-tests, and can apply some encoding function before they are exposed. We also do not want to assume a mixer can add a certain substance that will prevent a third party from testing the sample. To protect GT, one should make sure that without altering mixed samples, if a fraction of them leaks, either already tested or not, information is not revealed.

While the current literature includes several works on the privacy in GT algorithms for digital objects [14, 13, 15, 16], these works are based on cryptographic schemes, assume the testing matrix is not known to all parties, impose a high computational burden, and, last but not least, assume the computational power of the eavesdropper is limited [17, 18]. Information theoretic security considered for secure communication [19, 18], on the other hand, if applied appropriately to GT, can offer privacy at the price of additional tests, without keys, obfuscation or assumptions on limited power. Due to the analogy between channel coding and group-testing regardless of security constraints, [20, 21], in Section II we present an extensive survey of the literature on secure communication as well.

Main Contribution

In this work, we formally define Secure Group Testing (SGT), suggest SGT algorithms based on information-theoretic principles and analyse their performance. In the considered model, there is an eavesdropper Eve who might observe part of the vector of pool-test outcomes. The goal of the test designer is to design the tests in a manner such that a legitimate decoder can decode the status of the items (whether the items are defective or not) with an arbitrarily small error probability. It should also be the case that as long as Eve the eavesdropper gains only part of the output vector (a fraction $\delta$; a bound on the value of $\delta$ is known a priori to the test designer, but which specific items are observed is not), Eve cannot (asymptotically, as the number of items being tested grows without bound) gain any significant information on the status of any of the items.

We propose an SGT code and corresponding decoding algorithms which ensure high reliability (with high probability over the test design, the legitimate decoder should be able to estimate the status of each item correctly), as well as strong secrecy conditions (as formally defined in Section III), which ensure that essentially no information about the status of individual items leaks to Eve.

Our first SGT code and corresponding decoding algorithm (based on Maximum Likelihood (ML) decoding) require a number of tests that is essentially information-theoretically optimal in $N$, $K$ and $\delta$ (as demonstrated in Section VI by a corresponding information-theoretic converse that we also show for the problem).

The second code and corresponding decoding algorithm, while requiring a number of tests larger by a constant factor than is information-theoretically necessary (a factor that is a function of $\delta$), is computationally efficient. It maintains the reliability and secrecy guarantees, yet requires only $O(N^2T)$ decoding time, where $T$ is the number of tests.

Fig. 1: Classical group testing: an example of test results, a simple decoding procedure at the legitimate decoder, and the risk of leakage. The example includes 7 items, of which at most one is defective (the second one in this case; unknown to the decoder). Three pooled tests are conducted. Each row dictates in which pooled tests the corresponding item participates. (a) Since the first result is negative, items 1 and 6 are not defective. (b) The second result is positive, hence at least one of items 2 and 4 is defective. (c) Based on the last result, as item 4 cannot be defective, it is clear that item 2 is defective. Note that decoding in this case is simple: any algorithm which simply rules out each item whose row in the matrix is not compatible with the results will rule out all but the second item, due to the first and last test results being negative, thus identifying the defective item easily. (d) An eavesdropper who has access to part of the results (the first two) can still infer useful information. Our goal is to construct a testing matrix such that such an eavesdropper remains ignorant.

We do so by proposing a model which is, in a sense, analogous to a wiretap channel model, as depicted in Figure 2. In this analogy the subset of defective items (unknown a priori to all parties) takes the place of a confidential message. The testing matrix (representing the design of the pools: each row corresponds to the tests participated in by an item, and each column corresponds to a potential test) is a succinct representation of the encoder’s codebook. Rows or disjunctive unions of rows of this testing matrix can be considered as codewords. The decoding algorithm is analogous to a channel decoding process, and the eavesdropped signal is the output of an erasure channel, that is, only part of the signal transmitted from the legitimate source to the legitimate receiver.

In classical non-adaptive group-testing, each row of the testing matrix consists of a length-$T$ binary vector which determines the pool-tests in which the item is tested. In the SGT code constructions proposed in this work, each item instead corresponds to a vector chosen uniformly at random from a pre-specified set of random and independent vectors. Namely, we use stochastic encoding, and each vector corresponds to a different set of pool-tests an item may participate in. For each item $j$, the lab picks one of the vectors in its set (we term the set associated with item $j$ “Bin $j$”) uniformly at random, and the item participates in the pool-tests according to this randomly chosen vector. The set (“Bin”) is known a priori to all parties, but the specific vector chosen by the encoder/mixer is known only to the encoder/mixer, and hence is not a shared key/common randomness in any sense. A schematic description of our procedure is depicted in Figure 4.
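The stochastic encoding described above can be sketched in a few lines. This is only an illustration under stated assumptions: the bin size `bin_size`, the Bernoulli parameter `p` and all function names are hypothetical choices made here, not values taken from the paper, whose actual construction is given in later sections.

```python
import random

def build_bins(num_items, num_tests, bin_size, p, rng):
    """Public step: for each item, draw `bin_size` i.i.d. Bernoulli(p)
    binary vectors of length `num_tests` (the item's "Bin")."""
    return [
        [[1 if rng.random() < p else 0 for _ in range(num_tests)]
         for _ in range(bin_size)]
        for _ in range(num_items)
    ]

def draw_testing_matrix(bins, rng):
    """Private step at the mixer: pick one vector uniformly at random
    from each item's bin; the choice is not shared with anyone."""
    return [rng.choice(item_bin) for item_bin in bins]

rng = random.Random(0)
# Hypothetical toy parameters, chosen only to keep the example small.
bins = build_bins(num_items=6, num_tests=8, bin_size=4, p=0.3, rng=rng)
X = draw_testing_matrix(bins, rng)  # row j says which pools item j joins
```

The bins play the role of the publicly known codebook, while the index drawn inside each bin is the mixer's private randomness.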

Fig. 2: An analogy between a wiretap erasure channel and the corresponding SGT model.

Accordingly, by obtaining a pool-test result, without knowing the specific vectors chosen by the lab for each item, the eavesdropper may gain only negligible information regarding the items themselves. Specifically, we show that by careful design of the testing procedure, even though the pool-tests in which each item participated are chosen randomly and even though the legitimate user does not know a-priori in which pool-tests each item has participated, the legitimate user will, with high probability over the testing procedure, be able to correctly identify the set of defective items, while the eavesdropper, observing only a subset of the pool-test results, will have no significant information regarding the status of the items.

The structure of this work is as follows. In Section II, we present an extensive survey and summarize the related work. In Section III, an SGT model is formally described. Section IV includes our main results, with the direct result proved in Section V and the converse proved in Section VI. Section VII describes a computationally efficient algorithm, and proves an upper bound on its error probability. Section VIII concludes the paper.

II Background and Related Work

II-A Group-testing

Group-testing comes in various flavours, and the literature on these is vast. At the risk of leaving out much, we reference here just some of the models that have been considered in the literature, and specify our focus in this work.

II-A1 Performance Bounds

GT can be non-adaptive, where the testing matrix is designed beforehand, adaptive, where each new test can be designed while taking into account previous test results, or a combination of the two, where testing is adaptive, yet with batches of non-adaptive tests. It is also important to distinguish between exact recovery and a vanishing probability of error.

To date, the best known lower bound on the number of tests required (non-adaptive, exact recovery) is $\Omega\left(\frac{K^2}{\log K}\log N\right)$ [22]. The best known explicit constructions were given in [23], resulting in $O(K^2\log N)$ tests. However, focusing on exact recovery requires more tests, and forces a combinatorial nature on the problem. Settling for high probability reconstructions allows one to reduce the number of tests to the order of $K\log N$. (A simple information-theoretic argument explains a lower bound. There are $K$ defectives out of $N$ items, hence $\binom{N}{K}$ possibilities to cover: $\log\binom{N}{K}$ bits of information. Since each test carries at most one bit, this is the number of tests required. Stirling’s approximation easily shows that for $K \ll N$, the leading factor of that is $K\log(N/K)$.) For example, see the channel-coding analogy given in [20]. A similar analogy to wiretap channels will be at the basis of this work as well. In fact, probabilistic methods with an error probability guarantee appeared in [24], without explicitly mentioning GT, yet showed the $K\log N$ bound. Additional probabilistic methods can be found in [25] for support recovery, or in [26], where an interesting phase transition phenomenon was observed, yielding tight results on the threshold (in terms of the number of tests) between the error probability approaching one or vanishing.
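The counting argument behind the lower bound can be checked numerically. The sketch below is a minimal illustration, writing `n` for the number of items and `k` for the number of defectives (names chosen here); it compares the exact number of bits $\log_2\binom{n}{k}$ with the leading term $k\log_2(n/k)$ and with the standard upper bound $k\log_2(en/k)$.

```python
import math

def counting_bound_bits(n, k):
    """Bits needed to index every possible set of k defectives among n items."""
    return math.log2(math.comb(n, k))

def leading_term_bits(n, k):
    """Leading term k * log2(n / k), which lower-bounds the exact count."""
    return k * math.log2(n / k)

def upper_term_bits(n, k):
    """Standard upper bound k * log2(e * n / k) on log2 C(n, k)."""
    return k * math.log2(math.e * n / k)

# Hypothetical toy parameters: 10 defectives among 10,000 items.
n, k = 10_000, 10
exact = counting_bound_bits(n, k)
lower = leading_term_bits(n, k)
upper = upper_term_bits(n, k)
```

Since each noiseless test yields at most one bit, `exact` is a floor on the number of tests for these parameters, and it is sandwiched between `lower` and `upper`.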

II-A2 A Channel Coding Interpretation

As mentioned, the analogy to channel coding has proved useful [20]. The work in [21] defined the notion of group testing capacity, that is, the value of $\lim_{N\to\infty}\frac{\log\binom{N}{K}}{T}$ under which reliable algorithms exist, yet, over which, no reliable reconstruction is possible. A converse result for the Bernoulli, non-adaptive case was given in [27]. Strong converse results were given in [28, 29], again, building on the channel coding analogy, as well as converses for noisy GT [30]. In [31], adaptive GT was analyzed as a channel coding with feedback problem.

II-A3 Efficient Algorithms

A wide variety of techniques were used to design efficient GT decoders. Results and surveys for early non-adaptive decoding algorithms were given in [32, 33, 34]. Moreover, although most of the works described above mainly targeted fundamental limits, some give efficient algorithms as well. In the context of this work, it is important to mention the recent COMP [35], DD and SCOMP [36] algorithms, concepts from which we will use herein.
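To make the flavour of these decoders concrete, the following is a minimal sketch of the classical COMP rule in the noiseless setting: every item that appears in at least one negative test is cleared, and all remaining items are declared defective. It assumes the decoder knows the testing matrix exactly, so it illustrates COMP [35] itself and not the SGT decoder developed in this paper; the function name and toy matrix are hypothetical.

```python
def comp_decode(X, y):
    """Classical noiseless COMP.

    X: list of N rows, each a 0/1 list of length T (row j = tests of item j).
    y: list of T binary test outcomes.
    Returns the set of item indices declared defective.
    """
    declared = set()
    for j, row in enumerate(X):
        # An item is cleared if it took part in any test that came back negative.
        in_negative_test = any(x == 1 and out == 0 for x, out in zip(row, y))
        if not in_negative_test:
            declared.add(j)
    return declared

# Toy example: 4 items, 3 tests, item 1 is the only defective.
X_toy = [[1, 0, 0],
         [0, 1, 1],
         [1, 0, 1],
         [1, 1, 0]]
y_toy = [0, 1, 1]                      # outcomes produced by item 1 alone
declared = comp_decode(X_toy, y_toy)   # {1}
```

COMP never misses a true defective in the noiseless case, but it may declare extra items (for instance, an item that participates in no test is never cleared).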

II-B Secure communication

It is very important to note that making GT secure is different from making communication secure, as remarked in Section I. Now, we briefly survey the literature in secure communication, since many of the ideas/models/primitives in secure communication will have analogues in secure group-testing.

II-B1 Information-theoretic secrecy

In a secure communication setting, transmitter Alice wishes to send a message $m$ to receiver Bob. To do so, she is allowed to encode $m$ into a (potentially random) function $x=f(m)$, and transmit $x$ over a medium. It is desired that the eavesdropper Eve should glean no information about $m$ from its (potentially noisy) observation $z$. This information leakage is typically measured via the mutual information between $m$ and $z$. The receiver Bob should be able to reconstruct $m$ based on its (also potentially noisy) observation of $x$ (and, potentially, a shared secret that both Bob and Alice know, but Eve is ignorant of).

There are a variety of schemes in the literature for information-theoretically secure communications. (Security in general has many connotations: for instance, in the information-theory literature it can also mean a scheme that is resilient to an active adversary, such as a communication scheme that is resilient to a malicious jammer. In this work we focus our attention on passive eavesdropping adversaries, and aim to ensure secrecy of communications vis-a-vis such adversaries. We shall thus henceforth use the terms security and secrecy interchangeably.) Such schemes typically make one of several assumptions (or combinations of these):

  • Shared secrets/Common randomness/Symmetric-key encryption: The first scheme guaranteed to provide information-theoretic secrecy was by [37], who analyzed the secrecy of one-time pad schemes and showed that they ensure perfect secrecy (no leakage of the transmitted message). He also provided lower bounds on the size of this shared key. The primary disadvantage of such schemes is that they require a shared key that is essentially as large as the amount of information to be conveyed, and that it be continually refreshed for each new communication. These requirements typically make such schemes untenable in practice.

  • Wiretap secrecy/Physical-layer secrecy: Wyner et al. [19, 38] first considered certain communication models in which the communication channel from Alice to Eve is a degraded (noisier) version of the channel from Alice to Bob, and derived the information-theoretic capacity for communication in such settings. These results have been generalized in a variety of directions. See [39, 40, 18] for (relatively) recent results. The primary disadvantage of such schemes is that they require that it be possible to instantiate communication channels from Alice to Bob that are better than the communication channel from Alice to Eve. Further, they require that the channel parameters of both channels be relatively well known to Alice and Bob, since the choice of communication rate depends on these parameters. These assumptions make such schemes also untenable in practice, since on one hand Eve may deliberately situate herself to have a clearer view of Alice’s transmission than Bob has, and on the other hand there are often no clear physically-motivated reasons for Alice and Bob to know the channel parameters of the channel to Eve.

  • Public discussion/Public feedback: A very nice result by Maurer ([41] and subsequent work; see [18] for details) significantly alleviated at least one of the charges levelled against physical-layer security systems, namely that they require the channel to Bob to be “better” than the channel to Eve. Maurer demonstrated that feedback (even public feedback that is noiselessly observable by Eve) and multi-round communication schemes can allow for information-theoretically secure communication from Alice to Bob even if the channel from Alice to Bob is worse than the channel from Alice to Eve. Nonetheless, such public discussion schemes still require some level of knowledge of the channel parameters of the channel to Eve.

II-B2 Cryptographic security

Due to the shortcomings highlighted above, modern communication systems usually back off from demanding information-theoretic security, and instead attempt to instantiate computational security. In these settings, instead of demanding small information leakage to arbitrary eavesdroppers, one instead assumes bounds on the computational power of the eavesdropper (for instance, that it cannot computationally efficiently invert “one-way functions”). Under such assumptions one is then often able to provide conditional security, for instance with a public-key infrastructure  [42, 43]. Such schemes have their own challenges to instantiate. For one, the computational assumptions they rely on are sometimes unfounded and hence sometimes turn out to be prone to attack [17, 18, 44]. For another, the computational burden of implementing cryptographic primitives with strong guarantees can be somewhat high for Alice and Bob [45].

II-C Secure Group-Testing

On the face of it, the connection between secure communication and secure group-testing is perhaps not obvious. We highlight below scenarios that make these connections explicit. Paralleling the classification of secure communication schemes above, one can also conceive of a corresponding classification of secure GT schemes.

II-C1 Information-theoretic schemes

  • Shared secrets/Common randomness/Symmetric-key encryption: A possible scheme to achieve secure group testing is to utilize a shared key between Alice and Bob. For example, consider a scenario in which Alice the nurse has a large number of blood samples that need to be tested for the presence of a disease. She sends them to a lab named Eve to be tested. To minimize the number of tests done via the lab, she pools blood samples appropriately. However, while the lab itself will perform the tests honestly, it can’t be trusted to keep medical records secure, and so Alice keeps secret the identity of the people tested in each pool. (Even in this setting, it can be seen that the number of diseased individuals can still be inferred by Eve. However, this is assumed to be a publicly known/estimable parameter.)

    Given the test outcomes, doctor Bob now desires to identify the set of diseased people. To be able to reconstruct this mapping, a relatively large amount of information (the mapping between individuals’ identities and pools tested) needs to be securely communicated from Alice to Bob. As in the one-time pad secure communication setting, this need for a large amount of common randomness makes such schemes unattractive in practice. Nonetheless, the question is theoretically interesting, and some interesting results have been recently reported in this direction by [14, 13, 15, 16].

  • Wiretap secrecy/Physical-layer secrecy: This is the setting of this paper. Alice does not desire to communicate a large shared key to Bob, and still wishes to maintain secrecy of the identities of the diseased people from “honest but curious” Eve. Alice therefore does the following two things: (i) For some $\delta \in (0,1)$, she chooses $1/\delta$ independent labs, divides the $T$ pools to be tested into $1/\delta$ pool sets of $\delta T$ pools each, and sends each set to a distinct lab. (ii) For each blood pool, she publicly reveals to all parties (Bob, Eve, and anyone else who’s interested) a set $\mathcal{S}(t)$ of possible combinations of individuals whose blood could constitute that specific pool $t$. As to which specific combination from $\mathcal{S}(t)$ the pool actually comprises, only Alice knows a priori: Alice generates this private randomness by herself, and does not leak it to anyone (perhaps by destroying all trace of it from her records). The twofold goal is now for Alice to choose the pool sets and the sets $\mathcal{S}(t)$ for each $t \in \{1,\ldots,T\}$ to ensure that, as long as no more than one lab leaks information, there is sufficient randomness in the sets $\mathcal{S}(t)$ so that essentially no information about the diseased individuals’ identities leaks, but Bob (who has access to the test reports from all the labs) can still accurately estimate (using the publicly available information on $\mathcal{S}(t)$ for each test $t$) the disease status of each individual. This scenario closely parallels the scenario in Wyner’s wiretap channel. Specifically, this corresponds to Alice communicating a sequence of $T$ test outcomes to Bob, whereas Eve can see only a $\delta$ fraction of the test outcomes. To ensure secrecy, Alice injects private randomness (corresponding to which set from $\mathcal{S}(t)$ corresponds to the combination of individuals that was tested in test $t$) into each test; this is the analogue of the coding schemes often used for Wyner’s wiretap channels.

    Remark 1.

    It is a natural theoretical question to consider corresponding generalizations of this scenario with other types of broadcast channels from Alice to Bob/Eve (not just degraded erasure channels), especially since such problems are well-understood in a wiretap security context. However, the physical motivation of such generalizations is not as clear as in the scenario outlined above. So, even though in principle the schemes we present in Section III can be generalized to other broadcast channels, to keep the presentation in this paper clean we do not pursue these generalizations here.

    Remark 2.

    Note that there are other mechanisms via which Alice could use her private randomness. For instance, she could deliberately contaminate some fraction of the tests she sends to each lab with blood from diseased individuals. Doing so might reduce the amount of private randomness required to instantiate secrecy. While this is an intriguing direction for future work, we do not pursue such ideas here.

  • Public discussion/Public feedback: The analogue of a public discussion communication scheme in the secure group-testing context is perhaps a setting in which Alice sends blood pools to labs in multiple rounds, also known as adaptive group testing in the GT literature. Bob, on observing the set of test outcomes in round $i$, then publicly broadcasts (to Alice, Eve, and any other interested parties) some (possibly randomized) function of his observations thus far. This has several potential advantages. Firstly, adaptive group-testing schemes (e.g., [36]) significantly outperform the best-known non-adaptive group-testing schemes (in terms of a smaller number of tests required to identify diseased individuals) in some regimes of $K$ relative to $N$. One can hope for similar gains here. Secondly, as in secure communication with public discussion, one can hope that multi-round GT schemes would enable information-theoretic secrecy even in situations where Eve may potentially have access to more test outcomes than Bob. Finally, such schemes may offer storage/computational complexity advantages over non-adaptive GT schemes. Hence this is an ongoing area of research, but outside the scope of this paper.
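Returning to the wiretap-style scenario in the list above, the way a single compromised lab translates into a bounded fraction of leaked outcomes can be sketched as follows. The round-robin assignment and the names `split_pools_across_labs`, `num_tests` and `num_labs` are illustrative assumptions made here; any even partition of the pools would serve the same purpose.

```python
def split_pools_across_labs(num_tests, num_labs):
    """Partition test indices 0..num_tests-1 into `num_labs` disjoint,
    nearly equal sets (round-robin), one set per lab."""
    if num_labs < 1 or num_labs > num_tests:
        raise ValueError("need 1 <= num_labs <= num_tests")
    return [list(range(lab, num_tests, num_labs)) for lab in range(num_labs)]

# Hypothetical toy setting: 12 pools sent to 4 independent labs.
assignment = split_pools_across_labs(num_tests=12, num_labs=4)

# If exactly one lab leaks, the eavesdropper sees only that lab's share.
leaked_fraction = len(assignment[0]) / 12
```

With four labs, one leaking lab exposes a quarter of the outcomes, which is the kind of partial observation the secrecy requirement is meant to withstand.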

II-C2 Cryptographic secrecy

As in the context of secure communication, the use of cryptographic primitives to keep information about the items being tested secure has also been explored in sparse recovery problems - see, for instance [14, 13, 15, 16]. Schemes based on cryptographic primitives have similar weaknesses in the secure GT context as they do in the communication context, and we do not explore them here.

III Problem Formulation

In SGT, a legitimate user desires to identify a small unknown subset $\mathcal{K}$ of defective items from a larger set $\mathcal{N}$, while minimizing the number of measurements $T$ and keeping the eavesdropper, which is able to observe a subset of the test results, ignorant regarding the status of the $\mathcal{N}$ items. Let $N=|\mathcal{N}|$, $K=|\mathcal{K}|$ denote the total number of items and the number of defective items, respectively. As formally defined below, the legitimate user should (with high probability) be able to correctly estimate the set $\mathcal{K}$; on the other hand, from the eavesdropper’s perspective, this set should be “almost” uniformly distributed over all possible $\binom{N}{K}$ sets. We assume that the number $K$ of defective items is known a priori to all parties; this is a common assumption in the GT literature [3]. (If this is not the case, [46, 47] give methods/bounds on how to “probably approximately” correctly learn the value of $K$ in a single stage with $O(\log N)$ tests.)

Throughout the paper, we use boldface to denote matrices, capital letters to denote random variables, lower case letters to denote their realizations, and calligraphic letters to denote the alphabet. Logarithms are in base $2$ and $h_b(\cdot)$ denotes the binary entropy function.

Figure 3 gives a graphical representation of the model. In general, and regardless of security constraints, non-adaptive GT is defined by a testing matrix

$$\mathbf{X} = [X_1^T; X_2^T; \ldots; X_N^T] \in \{0,1\}^{N\times T},$$

where each row corresponds to a separate item $j \in \{1,\ldots,N\}$, and each column corresponds to a separate pool test $t \in \{1,\ldots,T\}$. For the $j$-th item,

$$X_j^T = \{X_j(1),\ldots,X_j(T)\}$$

is a binary row vector, with the $t$-th entry $X_j(t)=1$ if and only if item $j$ participates in the $t$-th test.

Fig. 3: Noiseless non-adaptive secure group-testing setup.

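If $A_j \in \{0,1\}$ denotes an indicator function for the $j$-th item, determining whether it belongs to the defective set $\mathcal{K}$, i.e., $A_j=1$ if $j\in\mathcal{K}$ and $A_j=0$ otherwise, the (binary) outcome $Y(t)$ of the pool test $t \in \{1,\ldots,T\}$ equals

$$Y(t) = \bigvee_{j=1}^{N} X_j(t)A_j = \bigvee_{d\in\mathcal{K}} X_d(t),$$

where $\bigvee$ is used to denote the boolean OR operation.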
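The outcome rule above (a pool is positive exactly when it contains at least one defective item) can be written directly as code. This is a minimal sketch; the matrix layout (one row per item, one column per test) follows the text, while the function name and toy matrix are hypothetical.

```python
def pool_outcomes(X, defective):
    """Noiseless test outcomes: test t is positive iff some defective
    item participates in it (boolean OR over the defective rows).

    X: list of N rows of 0/1 entries, row j = tests of item j.
    defective: iterable of defective item indices.
    """
    defective = list(defective)
    num_tests = len(X[0])
    return [int(any(X[d][t] for d in defective)) for t in range(num_tests)]

# Toy matrix: 4 items, 3 tests.
X_toy = [[1, 0, 0],
         [0, 1, 1],
         [1, 0, 1],
         [1, 1, 0]]

y_single = pool_outcomes(X_toy, {1})     # [0, 1, 1]
y_pair = pool_outcomes(X_toy, {1, 2})    # [1, 1, 1]
```

Note that with several defectives the outcome vector is the entry-wise OR of their rows, which is why disjunctive unions of rows act as codewords.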

In SGT, we assume an eavesdropper who observes a noisy vector $Z^T=\{Z(1),\ldots,Z(T)\}$, generated from the outcome vector $Y^T=\{Y(1),\ldots,Y(T)\}$. In the erasure case considered in this work, the probability of erasure is $1-\delta$, i.i.d. for each test. That is, on average, $\delta T$ outcomes are not erased and are accessible to the eavesdropper via $Z^T$. Therefore, in the erasure case, if $B_t \in \{1,?\}$ is an erasure indicator for the $t$-th pool test, i.e., $B_t=1$ with probability $\delta$ and $B_t=?$ with probability $1-\delta$, the eavesdropper observes

$$Z(t) = \begin{cases} Y(t), & B_t = 1,\\ ?, & B_t = ?. \end{cases}$$
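The eavesdropper's view can be simulated with a simple i.i.d. erasure model. In this sketch, `delta` denotes the probability that an outcome reaches the eavesdropper unerased (so each outcome is erased with probability `1 - delta`); the erasure symbol `"?"` and the function name are notational assumptions made here.

```python
import random

ERASED = "?"

def eavesdrop(y, delta, rng):
    """Return the eavesdropper's observation of the outcome vector y:
    each outcome is kept with probability delta and erased otherwise,
    independently across tests."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return [out if rng.random() < delta else ERASED for out in y]

rng = random.Random(1)
y = [0, 1, 1, 0, 1, 0, 0, 1]
z = eavesdrop(y, delta=0.25, rng=rng)  # on average a quarter of y survives
```

Every unerased entry of `z` equals the corresponding entry of `y`, so whatever the eavesdropper does see is noiseless; secrecy has to come from the test design, not from corrupting the outcomes.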

Denote by $W \in \mathcal{W} \triangleq \{1,\ldots,\binom{N}{K}\}$ the index of the subset of defective items. We assume $W$ is uniformly distributed, that is, there is no a priori bias towards any specific subset. (This is a common probabilistic model for the set of defectives in group-testing. Another model, called Probabilistic Group Testing, assumes that each item is defective with some fixed probability. Yet another model assumes that any set of size at most $K$ (rather than exactly $K$) instantiates with equal probability. In many group-testing scenarios, results for one probabilistic model for the set of defectives can be translated over to other scenarios, so we focus on the model presented above, where exactly $K$ items are defective.) Further, denote by $\hat{W}(Y^T)$ the index recovered by the legitimate decoder, after observing $Y^T$. In this work, we assume that the mixer may use a randomized testing matrix. In this case, the random bits used are known only to the mixer, and are not assumed to be shared with the decoder. In other words, the “codebook” which consists of all possible testing matrices is known to all parties, Alice, Bob and Eve. However, if the mixer chooses a specific $\mathbf{X}$, the random value is not shared with Bob or Eve. We refer to the codebook consisting of all possible matrices, together with the decoder at Bob’s side, as an SGT algorithm.

We are interested in the asymptotic behavior, i.e., in “capacity style” results, with a focus on the number of tests $T$ (as a function of $N$ and $K$) required to guarantee a high probability of recovery as the number of items $N$ grows without bound. For simplicity, in the first part of this work, we focus primarily on the regime where $K$ is a constant independent of $N$. In Section VII, we give an algorithm which applies to any $K$. (Following the lead of [27], in principle, many of our results in this section can also be extended to regimes where $K$ grows with $N$, but for ease of presentation we do not do so here.) The following definition lays out the goals of SGT algorithms.

Definition 1.

A sequence of SGT algorithms with parameters $N$, $K$ and $T$ is asymptotically (in $N$) reliable and weakly or strongly secure if,

(1) Reliable: The probability (over the index $W$ of the defective set) of incorrect reconstruction of $W$ at the legitimate receiver converges to zero. That is,

$$\lim_{N\to\infty} P\left(\hat{W}(Y^T) \neq W\right) = 0.$$

(2) Weakly secure: One potential security goal is so-called weak information-theoretic security against eavesdropping. Specifically, if the eavesdropper observes $Z^T$, a scheme is said to be weakly secure if

$$\lim_{T\to\infty} \frac{1}{T} I(W;Z^T) = 0.$$

(3) Strongly secure: A stronger notion of security is so-called strong information-theoretic security against eavesdropping. Specifically, if the eavesdropper observes $Z^T$, a scheme is said to be strongly secure if

$$\lim_{T\to\infty} I(W;Z^T) = 0.$$

Remark 3.

Note that strong security implies that in the limit the distribution over $Z^T$ is essentially statistically independent of the distribution over $W$. Specifically, the KL divergence between $p_{Z^T,W}$ and $p_{Z^T}p_{W}$ converges to $0$.

Remark 4.

While weak security is a much weaker notion of security against eavesdropping than strong security, and indeed is implied by strong security, nonetheless we consider it in this work for the following reason. Our impossibility result will show that even guaranteeing weak security requires at least a certain number of tests, and our achievability results will show that essentially the same number of tests suffices to guarantee strong security. Hence both our impossibility and achievability results are with regard to the corresponding “harder to prove” notion of security.
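The implication mentioned in this remark is a one-line consequence of the non-negativity of mutual information. Writing $I(W;Z^T)$ for the mutual information between the index $W$ of the defective set and the eavesdropper's observation $Z^T$ (notation assumed here, with the normalization by $T$ taken as the usual weak-secrecy convention), for every $T \ge 1$ we have

$$0 \;\le\; \frac{1}{T}\, I(W;Z^T) \;\le\; I(W;Z^T),$$

so if $I(W;Z^T) \to 0$ then also $\frac{1}{T} I(W;Z^T) \to 0$. The converse does not hold in general: a leakage growing like $\sqrt{T}$, for example, would vanish after normalization by $T$ while being unbounded in absolute terms.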

To conclude, the goal in this work is to design (for parameters $N$ and $K$) an $N \times T$ measurement matrix (which is possibly randomized) and a decoding algorithm $\hat{W}(Y^T)$, such that on observing $Y^T$, the legitimate decoder can (with high probability over $W$) identify the subset of defective items, and yet, on observing $Z^T$, the eavesdropper learns essentially nothing about the set of defective items.

IV Main Results

Under the model definition given in Section III, our main results are the following sufficiency (direct) and necessity (converse) conditions, characterizing the maximal number of tests required to guarantee both reliability and security. The proofs are deferred to Section V and Section VI.

IV-A Direct (Sufficiency)

The sufficiency part is given by the following theorem.

Theorem 1.

Assume a SGT model with items, out of which are defective. For any , if

(1)

for some independent of and , then there exists a sequence of SGT algorithms which are reliable and secure. That is, as , both the average error probability approaches zero exponentially and an eavesdropper with leakage probability is kept ignorant.

The construction of the SGT algorithm, together with the proofs of reliability and secrecy, is deferred to Section V. In fact, in Section V we actually prove that the maximal error probability decays to 0. However, a few important remarks are in order now.

First, rearranging terms in eq. 1, we have

That is, compared to only a reliability constraint, the number of tests required for both reliability and secrecy is increased by the multiplicative factor , where, again, is the leakage probability at the eavesdropper.

The result given in Theorem 1 uses an ML decoding at the legitimate receiver. The complexity burden in ML, however, prohibits the use of this result for large N. In Theorem 3, we suggest an efficient decoding algorithm, which maintains the reliability and the secrecy results using a much simpler decoding rule, at the price of only slightly more tests.

Using an upper bound on , the maximization in Theorem 1 can be solved easily, leading to a simple bound on with tight scaling and only a moderate constant.

Corollary 1.

For SGT with parameters and , reliability and secrecy can be maintained if

Proof.

Substituting , the maximum over is easily solved. ∎

Note that together with the converse below, this suggests , and, a result for bounded away from .

IV-B Converse (Necessity)

The necessity part is given by the following theorem.

Theorem 2.

Let be the minimum number of tests necessary to identify a defective set of cardinality among a population of size while keeping an eavesdropper, with a leakage probability , ignorant regarding the status of the items. Then, if , one must have:

where , with as .

The lower bound is derived using Fano’s inequality to address reliability, assuming a negligible mutual information at the eavesdropper, thus keeping an eavesdropper with leakage probability ignorant, and information inequalities bounding the rate of the message on the one hand, and the data Eve does not see on the other. Compared with the lower bound without security constraints, it is increased by the multiplicative factor .

IV-C Secrecy capacity in SGT

Returning to the analogy in [21] between channel capacity and group testing, one might define by $C_s$ the (asymptotic) minimal threshold value for , above which no reliable and secure scheme is possible. Under this definition, the results in this paper show that $C_s = (1-\delta)C$, where $C$ is the capacity without the security constraint and $\delta$ is the leakage probability at the eavesdropper. Clearly, this can be written as

$$C_s = C - \delta C,$$

raising the usual interpretation as the difference between the capacity to the legitimate decoder and that to the eavesdropper [18]. Note that as the effective number of tests Eve sees is $\delta T$, her GT capacity is $\delta C$.

IV-D Efficient Algorithms

Under the SGT model definition given in Section III, we further consider a computationally efficient algorithm at the legitimate decoder. Specifically, we analyze the Definite Non-Defective (DND) algorithm (originally called Combinatorial Orthogonal Matching Pursuit (COMP)), considered for the non-secure GT model in the literature [35, 36]. The theorem below states that indeed efficient decoding (with arbitrarily small error probability) and secrecy are possible, at the price of a higher . Interestingly, the theorem applies to any , and not necessarily only to . This is, on top of the reduced complexity, an important benefit of the suggested algorithm.

Theorem 3.

Assume a SGT model with items, out of which are defective. Then, for any , there exists an efficient decoding algorithm, requiring operations, such that if the number of tests satisfies

its error probability is upper bounded by

The construction of the DND GT algorithm, together with the proofs of reliability and secrecy, is deferred to Section VII. Clearly, the benefits of the algorithm above come at the price of additional tests and a smaller range of leakage probabilities it can handle.

V Code Construction and a Proof for Theorem 1

In order to keep the eavesdropper, which obtains only a fraction of the outcomes, ignorant regarding the status of the items, we randomly map the items to the tests. Specifically, as depicted in Figure 4, for each item we generate a bin, containing several rows. The number of such rows corresponds to the number of tests that the eavesdropper can obtain, yet, unlike wiretap channels, it is not identical to Eve’s capacity, and should be normalized by the number of defective items. Then, for the -th item, we randomly select a row from the -th bin. This row will determine in which tests the item will participate.

In order to rigorously describe the construction of the matrices and bins, determine the exact values of the parameters (e.g., bin size), and analyze the reliability and secrecy, we first briefly review the representation of the GT problem as a channel coding problem [20], together with the components required for SGT.

A SGT code consists of an index set , its -th item corresponding to the -th subset ; A discrete memoryless source of randomness , with known alphabet and known statistics ; An encoder,

which maps the index of the defective items to a matrix of codewords, each of its rows corresponding to a different item in the index set , , . The need for a stochastic encoder is similar to most encoders ensuring information theoretic security, as randomness is required to confuse the eavesdropper about the actual information [18]. Hence, we define by the random variable encompassing the randomness required for the defective items, and by the number of rows in each bin. Clearly, .

At this point, an important clarification is in order. The lab, of course, does not know which items are defective. Thus, operationally, it needs to select a row for each item. However, in the analysis, since only the defective items affect the output (that is, only their rows are ORed together to give ), we refer to the “message” as the index of the defective set and refer only to the random variable required to choose the rows in their bins. In other words, unlike the analogous communication problem, in GT, nature performs the actual mapping from to . The mixer only mixes the blood samples according to the (random in this case) testing matrix it has.

A decoder at the legitimate user is a map

The probability of error is . The probability that an outcome test leaks to the eavesdropper is . We assume a memoryless model, i.e., each outcome depends only on the corresponding input , and the eavesdropper observes , generated from according to

We may now turn to the detailed construction and analysis.

Fig. 4: Binning and encoding process for a SGT code.

V-1 Codebook Generation

Choose M such that

for some . will affect the equivocation. Using a distribution , for each item generate independent and identically distributed codewords , . The codebook is depicted in the left hand side of Figure 4. Reveal the codebook to Alice and Bob. We assume Eve may have the codebook as well.

V-2 Testing

For each item , the mixer/lab selects uniformly at random one codeword from the -th bin. Therefore, the SGT matrix contains randomly selected codewords of length , one for each item, defective or not. Amongst is an unknown subset , with the index representing the true defective items. An entry of the -th random codeword is if the -item is a member of the designated pool test and otherwise.
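The codebook generation and testing steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the construction analyzed in the proofs: the function names, the array layout (items × rows per bin × tests), the Bernoulli parameter `p`, and the use of `-1` to mark outcomes Eve does not see are all hypothetical choices made here for illustration. The bin size is taken as a free parameter rather than derived from the condition on M.

```python
import numpy as np

def generate_codebook(n_items, bin_size, n_tests, p, rng):
    """One bin per item; each bin holds bin_size i.i.d. Bernoulli(p) rows of length n_tests."""
    return (rng.random((n_items, bin_size, n_tests)) < p).astype(np.uint8)

def select_rows(codebook, rng):
    """The mixer picks, uniformly at random, one row from each item's bin."""
    n_items, bin_size, _ = codebook.shape
    chosen = rng.integers(0, bin_size, size=n_items)
    testing_matrix = codebook[np.arange(n_items), chosen]
    return testing_matrix, chosen

def run_tests(testing_matrix, defective):
    """Noiseless outcomes: Boolean OR of the rows of the defective items."""
    return testing_matrix[list(defective)].any(axis=0).astype(np.uint8)

def eavesdrop(outcomes, delta, rng):
    """Eve sees each outcome independently with probability delta; -1 marks an unseen outcome."""
    seen = rng.random(outcomes.shape) < delta
    return np.where(seen, outcomes.astype(np.int8), -1)
```

Note that only the rows selected for the defective items influence the outcome vector, which is why the analysis treats the defective set, together with the row choices within its bins, as the "message".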

V-3 Decoding at the Legitimate Receiver

The decoder looks for a collection of codewords , one from each bin, for which is most likely. Namely,

Then, the legitimate user (Bob) declares as the set of bins in which the rows reside.
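For intuition, the decoding rule above can be sketched as a brute-force search. The sketch assumes noiseless outcomes, in which case the likelihood of an outcome vector given a collection of codewords is either 0 or 1, so maximum likelihood reduces to finding a collection (one row from each of the chosen bins) whose Boolean OR equals the observed outcomes. The function name, the argument `k` for the number of defective items, and the tie-breaking rule (return the first consistent subset) are assumptions made here for illustration. The search is exponential and is meant only for tiny instances.

```python
from itertools import combinations, product
import numpy as np

def ml_decode_noiseless(codebook, outcomes, k):
    """Return a set of k bins with one row each whose OR equals outcomes, or None.

    codebook: array of shape (n_items, bin_size, n_tests) with 0/1 entries.
    outcomes: array of shape (n_tests,) with 0/1 entries.
    """
    n_items, bin_size, n_tests = codebook.shape
    target = np.asarray(outcomes).astype(bool)
    for subset in combinations(range(n_items), k):
        for rows in product(range(bin_size), repeat=k):
            pooled = np.zeros(n_tests, dtype=bool)
            for item, row in zip(subset, rows):
                pooled |= codebook[item, row].astype(bool)
            if np.array_equal(pooled, target):
                # Bob declares the bins, not the specific rows.
                return set(subset)
    return None
```

The inner loop over row choices reflects the bin structure: a wrong row inside the right bin still yields the correct declared set.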

V-A Reliability

Let denote a partition of the defective set into disjoint sets and , with cardinalities and , respectively. (Footnote 7: This partition helps decompose the error events into classes, where in class one already knows defective items, and the dominant error event corresponds to missing the other . Thus, it is easier to present the error event as one “codeword” against another. See Appendix A for the details.) Let denote the mutual information between and , under the i.i.d. distribution with which the codebook was generated and remembering that is the output of a Boolean channel. The following lemma is a key step in proving the reliability of the decoding algorithm.

Lemma 1.

If the number of tests satisfies

then, under the codebook above, as the average error probability approaches zero.

Next, we prove Lemma 1, which extends the results in [20] to the codebook required for SGT. Specifically, to obtain a bound on the required number of tests as given in Lemma 1, we first state Lemma 2, which bounds the error probability of the ML decoder using a Gallager-type bound [48].

Definition 2.

An error event in the ML decoder is defined as the event of mistaking the true set for a set which differs from it in exactly $i$ items.

Lemma 2.

The error probability is bounded by

where the error exponent is given by

(2)

In Appendix A we analyze the bound provided in Lemma 2. Note that there are two main differences compared to non-secured GT. First, the decoder has possible subsets of codewords to choose from, for the number of possible bins and for the number of possible rows to take in each bin. Thus, when fixing the error event, there are subsets to confuse the decoder. Moreover, due to the bin structure of the code, there are also many “wrong” codewords which are not the one transmitted on the channel, hence create a decoding error codeword-wise, yet the actual bin decoded may still be the right one.

Proof of Lemma 1.

For this lemma, we follow the derivation in [20]. However, due to the different code construction, the details are different. Specifically, for each item there is a bin of codewords, from which the decoder has to choose. Define

We wish to show that as . Note that is a constant for the fixed regime. Thus for large we have . Since the function is differentiable and has a power series expansion, for a sufficiently small , by Taylor series expansion in the neighborhood of we have

Now,

Hence, if

for some constant , the exponent is positive for large enough and we have as for . Using a union bound one can show that taking the maximum over will ensure a small error probability in total.

The expression in Lemma 1 is critical to understand how many tests are required, yet it is not a function of the problem parameters in any straightforward manner. We now bound it to get a better handle on .

Claim 1.

For large , and under a fixed input distribution for the testing matrix , the mutual information between and is lower bounded by

Proof of Claim 1.

First, note that

where equality (a) follows since the rows of the testing matrix are independent, and (b) follows since is the uncertainty of the legitimate receiver given , thus when observing the noiseless outcomes of all pool tests, this uncertainty is zero. Also, note that the testing matrix is random and i.i.d. with distribution , hence the probability for zeros is .

Then, under a fixed input distribution for the testing matrix and large it is easy to verify that the bounds meet at the two endpoints of and , yet the mutual information is concave in ; thus the bound is obtained. This is demonstrated graphically in Figure 5.

Fig. 5: Mutual Information Bound

Applying Claim 1 to the expression in Lemma 1, we have

Hence, substituting , a sufficient condition for reliability is

Rearranging terms results in

where by reducing and we increase the bound on , and with some constant . Noting that this is for large and , and that is independent of them, achieves the bound on provided in Theorem 1 and reliability is established.

V-B Information Leakage at the Eavesdropper

We now prove the security constraint is met. Hence, we wish to show that , as . Denote by the random codebook and by the set of codewords corresponding to the true defective items. We have,

where as . (a) is since there is a correspondence between and ; (b) is since is independent of and ; (c) is since in this direct result the codebook is defined by the construction and is memoryless, as well as the channel; (d) is since by choosing an i.i.d. distribution for the codebook one easily observes that . Finally, (e) is for the following reason: Given and the codebook, Eve has perfect knowledge regarding the bins from which the codewords were selected. It remains to see whether she can indeed estimate . Note that the channel Eve sees in this case is the following multiple access channel: each of the defective items can be considered as a “user” with messages to transmit. Eve’s goal is to decode the messages from all users. This is possible if the rates at which the users transmit are within the capacity region of this MAC. Indeed, this is a (binary) Boolean MAC channel, followed by a simple erasure channel. The sum capacity cannot be larger than , and this sum capacity is easily achieved by letting one user transmit at a time, or, in our case, where the codebook is randomly i.i.d. distributed, under a fixed input distribution . Since we actually use an input distribution for the testing matrix of , and , for large , each user obtains the same capacity; namely, each user sees a capacity of .

Remark 5.

Under non-secure GT, it is clear that simply adding tests to a given GT code (increasing ) can only improve the performance of the code (in terms of reliability). A legitimate decoder can always disregard the added tests. For SGT, however, the situation is different. Simply adding tests to a given code, while fixing the bin sizes, might make the vector of results vulnerable to eavesdropping. In order to increase reliability, one should, of course, increase , but also increase the bin sizes proportionally, so the secrecy result above will still hold. This will be true for the efficient algorithm suggested in Section VII as well.

Remark 6.

To establish the weak secrecy constraint we set to be , where reliability is achieved without any constraint on . However, in Appendix B, to establish the strong secrecy constraint we require to be .

Remark 7.

Note that since

any finite-length approximation for will give a finite length approximation for the leakage at the eavesdropper. For example, one can use the results in [49], to show that the leakage can be approximated as .

VI Converse (Necessity)

In this section, we derive the necessity bound on the required number of tests. Let denote the random variable corresponding to the tests which are not available to the eavesdropper. Hence, . By Fano’s inequality, if , we have

where as . Moreover, the secrecy constraint implies

(3)

where as . Consequently,

where (a) follows from Fano’s inequality and since , (b) follows from (3), (c) follows from the Markov chain and (d) is since conditioning reduces entropy.

We now evaluate . Denote by the set of tests which are not available to Eve and by the event for some . We have

where the last inequality follows from the Chernoff bound for i.i.d. Bernoulli random variables with parameter and is true for some such that for any .

Thus, we have

That is,

for some such that as . This completes the converse proof.

VII Efficient Algorithms

The achievability result given in Theorem 1 uses a random codebook and ML decoding at the legitimate party. The complexity burden in ML, however, prohibits the use of this result for large $N$. In this section, we derive and analyze an efficient decoding algorithm, which maintains the reliability result using a much simpler decoding rule, at the price of only slightly more tests. The secrecy constraint, as will be clear, is maintained by construction, as the codebook and mixing process do not change compared to the achievability result given before. Moreover, the result in this section will hold for any $K$, including even the case where $K$ grows linearly with $N$.

Specifically, we assume the same codebook generation and testing procedure given in Section V, and analyze the Definite Non-Defective (DND) algorithm, previously considered for non-secure GT in the literature [35, 36]. The decoding algorithm at the legitimate user is as follows. Bob attempts to match the rows of X with the outcome vector. If a particular row of X has the property that every location where it has a 1 also corresponds to a 1 in the outcome vector, then that row can correspond to a defective item. If, however, the row has a 1 at a location where the output has a 0, then it is not possible that the row corresponds to a defective item. The problem, however, when considering the code construction in this paper for SGT, is that the decoder does not know which row from each bin was selected for any given item. Thus, it takes a conservative approach, and declares an item as defective if at least one of the rows in its bin signals it may be so. An item is declared not defective only if all the rows in its bin prevent it from being so.
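The decoding rule just described can be sketched as follows. This is a minimal illustration, assuming the codebook is stored as a 0/1 array of shape (items, rows per bin, tests) and the outcomes as a 0/1 vector; the function name and this layout are hypothetical and chosen here only for the example.

```python
import numpy as np

def dnd_decode(codebook, outcomes):
    """Definite Non-Defective decoding over bins.

    codebook: array of shape (n_items, bin_size, n_tests) with 0/1 entries.
    outcomes: array of shape (n_tests,) with 0/1 entries.
    Returns the sorted list of items declared defective.
    """
    rows = np.asarray(codebook).astype(bool)
    negative_tests = ~np.asarray(outcomes).astype(bool)
    # A row is ruled out if it has a 1 in some test whose outcome is 0.
    ruled_out = (rows & negative_tests).any(axis=2)
    # Conservative rule: an item is declared defective if at least one
    # row in its bin is not ruled out.
    possibly_defective = (~ruled_out).any(axis=1)
    return np.flatnonzero(possibly_defective).tolist()
```

Since the row actually selected for a defective item can never be ruled out, the returned list always contains the true defective set; any extra items are false positives.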

It is clear that this decoding procedure has no false negatives, as a defective item will always be detected. It may have, though, false positives. A false positive may occur if all locations with ones in a row corresponding to a non-defective item are hidden by the ones of other rows corresponding to defective items and selected by the mixer. To calculate the error probability, fix a row of X corresponding to a non-defective item (a row in its bin). Let index the rows of X corresponding to the defective items, and selected by the mixer for these items (that is, the rows which were actually added by the Boolean channel). An error event associated with the fixed row occurs if, at every test where that row has a 1, at least one of the entries also has a 1. The probability for this to happen, per column, is . Hence, the probability that a test result in a fixed row is hidden from the decoder, in the sense that it cannot be declared as non-defective due to a specific column, is

Since this should happen for all columns, the error probability for a fixed row is . Now, to compute the error probability for the entire procedure we take a union bound over all rows corresponding to non-defective items. As a result, we have

(4)

In the above, (a) follows by taking and setting as , for some positive and , to be defined. (b) follows since for small and any integer . In the sequence below, we will use it with , for which it is true. (c) follows since for and any integer . (d) follows by choosing . (e) is by setting and substituting the value for .
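As a worked illustration of the first steps of this calculation, consider the following sketch. It assumes that all codebook entries are i.i.d. Bernoulli with parameter $p$, and writes $N$ for the number of items, $K$ for the number of defective items, $M$ for the number of rows per bin and $T$ for the number of tests; these symbols are introduced here for the example and may differ from the notation used elsewhere. A given column fails to rule out a fixed row of a non-defective item either when the row has a 0 in that column, or when it has a 1 and at least one of the $K$ selected defective rows also has a 1.

```latex
\begin{align*}
\Pr[\text{a given column does not rule out the fixed row}]
  &= (1-p) + p\bigl(1-(1-p)^{K}\bigr) \\
  &= 1 - p(1-p)^{K}, \\
\Pr[\text{the fixed row is not ruled out by any of the } T \text{ columns}]
  &= \bigl(1 - p(1-p)^{K}\bigr)^{T}, \\
P_e &\le (N-K)\, M \,\bigl(1 - p(1-p)^{K}\bigr)^{T}.
\end{align*}
```

The second line uses the independence of the columns, and the last line is the union bound over the $M$ rows in each of the $N-K$ bins of non-defective items.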

The result in (4) can be interpreted as follows. As long as , the leakage probability at the eavesdropper, is smaller than , choosing with a large enough results in an exponentially small error probability. For example, for large enough and , one needs , that is, about tests to have an exponentially small (with ) error probability while using an efficient decoding algorithm. To see the dependence of the error probability on the number of tests, denote

Then, if the number of tests satisfies

one has

Thus, while the results under ML decoding (Theorem 1) show that any value of is possible (with a toll on compared to non-secure GT), the analysis herein suggests that using the efficient algorithm, one can have a small error probability only for , and the toll on is greater than . This is consistent with the fact that this algorithm is known to achieve only half of the capacity for non-secure GT [36]. However, both these results may be due to coarse analysis, and not necessarily due to an inherent deficiency in the algorithm.

Remark 8 (Complexity).

It is easy to see that the algorithm runs over all rows in the codebook, and compares each one to the vector of tests’ results. The length of each row is . There are items, each having about rows in its bin. Since , we have rows in total. Thus, the number of operations is . This should be compared to without any secrecy constraint.

Figure 6 includes simulation results of the secure DND GT algorithm proposed, compared with ML decoding and the upper and lower bounds on the performance of ML.

Fig. 6: Definite Non-Defective and ML simulation results.

VIII Conclusions

In this paper, we proposed a novel non-adaptive SGT algorithm, which with parameters