Private Information Retrieval Through Wiretap Channel II: Privacy Meets Security
Abstract
We consider the problem of private information retrieval through wiretap channel II (PIRWTCII). In PIRWTCII, a user wants to retrieve a single message (file) privately out of messages, which are stored in replicated and non-communicating databases. An external eavesdropper observes a fraction (of its choice) of the traffic exchanged between the th database and the user. In addition to the privacy constraint, the databases should encode the returned answer strings such that the eavesdropper learns absolutely nothing about the contents of the databases. We aim to characterize the capacity of the PIRWTCII under the combined privacy and security constraints. We obtain a general upper bound for the problem in the form of a max-min optimization problem, which extends the converse proof of the PIR problem under asymmetric traffic constraints. We propose an achievability scheme that satisfies the security constraint by encoding a secret key, which is generated securely at each database, into an artificial noise vector using an MDS code. The user and the databases operate at one of the corner points of the achievable scheme for the PIR under asymmetric traffic constraints such that the retrieval rate is maximized under the imposed security constraint. The upper bound and the lower bound match for the case of and messages, for any , and any .
1 Introduction
Private information retrieval (PIR) is a canonical problem which considers the privacy of the content downloaded from public databases. The problem was introduced by Chor et al. [1], and has attracted considerable interest within the computer science community [1, 2, 3, 4, 5]. In the classical PIR model, there are replicated and noncolluding databases, each storing the same set of messages. A user requests to download a single file from the databases privately, i.e., no database can know the identity of the user’s desired file. To that end, the user submits a query to each database that does not leak any information about the identity of the file. Each database responds with an answering string. From all answering strings, the user should be able to decode the desired file reliably. PIR schemes are designed to be more efficient than the trivial scheme of downloading all the files stored in the databases. The efficiency is measured by the retrieval rate, which is the ratio of the number of desired message symbols to the total number of downloaded symbols. PIR is important from a practical point of view as many privacy threats exist in modern networks, in particular, when advanced learning algorithms are employed within social networks and online shopping websites. From a technical standpoint, PIR lies at the intersection of computer science, information theory, coding theory, network coding, and signal processing.
There has been a growing interest in the PIR problem in the information theory community, with early examples [6, 7, 8, 9, 10, 11]. In [12], Sun and Jafar investigate the fundamental limits of the classical PIR problem by introducing the notion of PIR capacity. The PIR capacity is defined as the supremum of PIR rates over all achievable retrieval schemes. Reference [12] determines the exact PIR capacity of the classical model to be . Following [12], the fundamental limits of many interesting variants of the classical PIR problem have been considered, such as: PIR from colluding databases, robust PIR, symmetric PIR, PIR from MDS-coded databases, PIR for arbitrary message lengths, multi-round PIR, multi-message PIR, PIR from Byzantine databases, secure symmetric PIR with adversaries, cache-aided PIR, PIR with private side information (PSI), PIR for functions, storage-constrained PIR, PIR with asymmetric traffic constraints and their several combinations [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40].
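For reference, the classical PIR capacity established in [12] has the well-known closed form 1 / (1 + 1/N + ... + 1/N^(K-1)) for N databases and K messages. The following minimal sketch evaluates it with exact arithmetic; the function and parameter names are illustrative and are not notation taken from this paper.

```python
from fractions import Fraction


def classical_pir_capacity(num_databases: int, num_messages: int) -> Fraction:
    """Classical PIR capacity from [12]: 1 / (1 + 1/N + ... + 1/N^(K-1))."""
    if num_databases < 1 or num_messages < 1:
        raise ValueError("need at least one database and one message")
    # Geometric sum 1 + 1/N + ... + 1/N^(K-1), kept exact with Fraction.
    denominator = sum(Fraction(1, num_databases ** i) for i in range(num_messages))
    return 1 / denominator
```

With a single database the expression reduces to 1/K, which is the rate of the trivial scheme that downloads every message.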
The sole requirement of most of these previous works is to protect the identity of the desired message from the public databases in addition to satisfying the reliability constraint. This protection is ensured by imposing the privacy constraint on the submitted queries. Another interesting dimension to the PIR problem is when the content of the requested message needs to be protected against an external eavesdropper (wiretapper), who wishes to learn about the contents of the databases by observing the queries and answer strings exchanged between the user and the databases. In this paper, we tackle the problem of secure PIR. We impose an extra constraint on the PIR problem, namely, the secrecy constraint in addition to the usual privacy constraint. The secrecy constraint ensures that the queries and the answer strings do not leak any information about the contents of the databases to the eavesdropper. Such systems are relevant in practice. For example, in the stock market, investors need to keep the identity of the records that they are interested in private from the public databases, as revealing such interest in a specific record may change its value. This is a classical PIR application. Now, consider the case when the contents of the records themselves are confidential except for a small subset of authorized investors. Thus, the queries and the answer strings should be designed such that unauthorized entities who wiretap the retrieval process learn absolutely nothing about the contents of these confidential records.
Although there is a vast literature on PIR, only a few works exist on secure PIR: [41] considers the more general problem of information storage and retrieval, guaranteeing that also the process of storing the information is secure in the presence of failing servers. [38] considers a symmetric PIR setting where there is a passive eavesdropper who can tap in on the incoming and outgoing transmissions of any servers. [38] derives the PIR capacity in this setting. Interestingly, the secret key needed for the symmetric retrieval process is used as an encryption key to secure the contents of the databases from the eavesdropper. This requires, as in the underlying symmetric PIR, that databases exchange a secret key of at least a certain size. This problem is investigated further in [39] for the classical PIR problem under privacy constraint for the case of . [39] derives inner and outer bounds for this problem in addition to the minimum amount of common randomness required, which is shared between the databases.
We study the secure PIR problem from a different angle than [41, 38, 39]. We consider a classical PIR setting, where there are replicated databases storing messages. We assume that the contents of the databases are fixed and cannot be coded to satisfy the security constraint during the storage phase, unlike [41]. There are no shared keys in place required for symmetric PIR unlike [38], as we consider classical PIR, not symmetric PIR. We further assume that the eavesdropper observes the queries and the answer strings of all databases through wiretap channels in contrast to observing the noiseless transmission from any of the databases as in [39]. In this work, we investigate the PIR problem through wiretap channel II (PIRWTCII). Ozarow and Wyner [42] introduced the wiretap channel II (WTCII) model, which considers a noiseless main channel and a binary erasure channel to the wiretapper, where the wiretapper is able to select the positions of erasures. In PIRWTCII (see Fig. 1), the user observes the length answer strings through a noiseless channel from the th database. The eavesdropper can observe a fraction from the th answer string. More specifically, the eavesdropper chooses any set of positions to observe from the th answer string, such that . The databases should encode the answer strings such that the eavesdropper learns nothing from observing any fraction of the traffic from the th database. This is in addition to normal privacy and reliability constraints. Naturally, the th database dedicates portion of the answer string to confuse the eavesdropper, constraining the meaningful portion of the answer to be . This fundamentally relates PIRWTCII to the PIR problem under asymmetric traffic constraints [40], as lengths of answer strings can no longer be symmetric. This poses the following questions: How can we design a retrieval code that satisfies the combined privacy and security constraints for the PIRWTCII problem? 
Does PIRWTCII problem necessitate the existence of common randomness between the databases as in [39]? Should the databases share any common randomness with the user (retriever)?
In this paper, we obtain a general upper bound for the PIRWTCII problem, when the eavesdropper can wiretap fractions of the traffic outgoing from every database. We note first that this problem is the first concrete example of a PIR problem under asymmetric traffic constraints in the sense of [40]. We show that this upper bound can be expressed as a max-min problem. The inner minimization problem extends the converse techniques of the PIR problem under asymmetric traffic constraints in [40] to the PIRWTCII problem. The outer problem maximizes the retrieval rate over all possible traffic ratio vectors. For the achievability, we extend the achievable scheme used in [40] to achieve the corner points for the meaningful portions of the queries. In the extension, to satisfy the security constraint, each database generates a secret key with length , encodes it into an artificial noise vector using an MDS code, and encrypts the returned answer strings with the artificial noise vector. Interestingly, our achievable rate does not need any shared randomness among the databases or between the databases and the user. The keys used by the databases are unknown to the user, but are decodable and canceled at the retriever; however, the same keys are not extractable at the wiretapper due to the MDS code used and the existence of WTCII. We express the achievable retrieval rate in terms of the output of a system of difference equations. We present an explicit achievable rate for the problem for the case of databases and any arbitrary . Our upper and lower bounds match for and messages, for any , and any , which conforms with the results of [40].
2 System Model
Consider a classical PIR model, in which there are noncolluding and replicated databases, each storing the same content of messages (or files). The message is represented as a vector of length , whose elements are picked from a finite field with a sufficiently large alphabet. The messages are independent and identically distributed, hence,
(1)  
(2) 
We assume that the messages are uncoded and fixed, i.e., we assume that the contents of the databases cannot be coded to satisfy the security constraint during the storage phase.
In classical PIR, a user wants to retrieve a message from the databases without revealing the identity of the message to any individual database. The user prepares queries, one for each database. The user sends to the th database. Since the user has no knowledge about the realization of , the queries and the messages are statistically independent, i.e.,
(3) 
where . Furthermore, to ensure the privacy of , the user should constrain the query intended to retrieve to be indistinguishable from the query intended to retrieve any other message at any individual database. Thus, the privacy constraint is formalized as,
(4) 
where denotes statistical equivalence.
The th database, after receiving the query , responds with a length answering string . Note that we allow the user and the databases to choose arbitrary lengths for the answer strings such that they maximize the retrieval rate. The answer string is generally a stochastic mapping of the messages and the received query , hence,
(5) 
where is a random variable independent of all other random variables, whose realization is known at the th database only and is not shared with any other database or the user prior to the transmission. We denote the traffic ratio vector by . The traffic ratio at the th database is given by,
(6) 
We assume that the answer strings are transmitted through a WTCII (see Fig. 1). In this case, an external eavesdropper (wiretapper) wishes to learn about the contents of the databases by observing the queries and answer strings exchanged by the user and the databases. In PIRWTCII, the user observes the length answer string from the th database through a noiseless channel. On the other hand, the eavesdropper can observe a fraction from the th answer string. More specifically, the eavesdropper arbitrarily chooses any set of positions to observe from the th answer string, such that , i.e., the output of the eavesdropper channel is given by,
(7) 
We denote the unobserved portion of the answer string by , where , thus, . We write the eavesdropping ratios as a vector . Without loss of generality, we assume that the databases are arranged ascendingly in , i.e., , so that the first database is the least threatened (most secure) and the th database is the most threatened (least secure).
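The observation model above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the function names are hypothetical, the number of observed positions is taken as the floor of the eavesdropping ratio times the answer length, and a random choice of positions merely stands in for the eavesdropper's arbitrary choice.

```python
import random
from fractions import Fraction


def eavesdropper_view(answer, ratio, rng=None):
    """Return (positions, observed symbols) for one answer string.

    The wiretapper may pick any positions within its budget; a random
    pick is used here only as a stand-in for that arbitrary choice.
    """
    rng = rng or random.Random()
    budget = int(ratio * len(answer))  # assumed: floor(ratio * length)
    positions = sorted(rng.sample(range(len(answer)), budget))
    return positions, [answer[i] for i in positions]


def unobserved_part(answer, positions):
    """Symbols of the answer string that the wiretapper did not see."""
    seen = set(positions)
    return [symbol for i, symbol in enumerate(answer) if i not in seen]
```

Passing the ratio as a `Fraction` avoids floating-point rounding when computing the budget.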
When preparing the answer strings, the databases should encode them such that the eavesdropper learns nothing from observing any fraction of the traffic from the th database, even if it also observes the queries submitted by the user. Consequently, we write the security constraint as,
(8) 
Additionally, the user should be able to reconstruct the desired message from the collected answer strings with arbitrarily small probability of error. Using Fano’s inequality, we write the reliability constraint as,
(9) 
where as .
For a fixed , , traffic ratio vector , and eavesdropping ratio vector , a retrieval rate is achievable if there exists a PIR scheme which satisfies the privacy constraint (4), security constraint (8), and the reliability constraint (9) for some message length and answer strings of lengths such that , where the retrieval rate is therefore given by,
(10) 
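As described in the introduction, the retrieval rate is the ratio of the number of desired message symbols to the total number of downloaded symbols. The sketch below computes it for given answer string lengths. The helper names are hypothetical, and the normalization of each traffic ratio by the total download is an assumption made for illustration.

```python
from fractions import Fraction


def retrieval_rate(message_length, answer_lengths):
    """Desired message symbols divided by total downloaded symbols."""
    return Fraction(message_length, sum(answer_lengths))


def traffic_ratios(answer_lengths):
    """Share of the total download contributed by each database.

    Assumed normalization: each answer length over the sum of all lengths.
    """
    total = sum(answer_lengths)
    return [Fraction(length, total) for length in answer_lengths]
```

For example, the trivial scheme that downloads all of 3 messages of length 4 from a single database has `retrieval_rate(4, [12])` equal to 1/3.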
We note that in this problem, the user and the databases can agree on a traffic ratio vector to maximize the retrieval rate, thus, we can express the secure retrieval rate under eavesdropping capabilities , , as,
(11) 
Note that the message lengths can grow arbitrarily large to conform with standard information-theoretic arguments. The capacity of the PIRWTCII problem is defined as the supremum of all achievable retrieval rates over all achievable schemes, i.e., .
3 Main Results and Discussions
In this section, we present the main results of this paper. Our first result characterizes a general upper bound for the PIRWTCII problem for fixed , , and an arbitrary .
Theorem 1 (Upper bound)
For the PIRWTCII problem under eavesdropping capabilities , the capacity is upper bounded by,
(12) 
where .
The proof of this upper bound is given in Section 4. We have the following remarks.
Remark 1
When , i.e., without any security constraints, the upper bound reduces to:
(13)  
(14)  
(15)  
(16) 
where the inner problem in (14) is precisely the upper bound of the PIR problem under asymmetric traffic [40]. From [40], we know that is maximized by adopting symmetric schemes, i.e., , which achieves the PIR capacity in [12].
Remark 2
If the PIRWTCII problem is further constrained by the asymmetric traffic constraints , the corresponding upper bound is given by the inner problem of (12), i.e.,
(17) 
Hence, without the asymmetric traffic constraints, the user and the databases can agree on that maximizes the retrieval rate, which results in the outer maximization over . This is reminiscent of the classical converse proof for the channel coding theorem, where a converse argument is constructed for an arbitrary input distribution of the transmission codebook, and then the converse proof is concluded with a maximization step over all the input distributions.
Remark 3
The upper bound in Theorem 1 can be written as the following linear programming problem:
s.t.  
(18) 
where , i.e., the number of constraints is finite (at most constraints). Hence, the optimal solution of this optimization problem is attained at one of the corner points of the feasible set.
Next, we present a general lower bound on for fixed , .
Theorem 2 (Lower bound)
For PIRWTCII, for a monotone nondecreasing sequence , let , and . Denote to be the number of stages of the achievable scheme that downloads sums from the th database in one repetition of the scheme, such that , and . Let . The number of stages is characterized by the following system of difference equations:
(19) 
where denotes the Kronecker delta function. The initial conditions of (19) are , and for . Consequently, the traffic ratio vector corresponding to the sequence is given by:
(20) 
Then, the achievable rate corresponding to is given by:
(21) 
Consequently, the capacity is lower bounded by:
(22)  
(23) 
Remark 4
For fixed , , the number of the achievable rates in Theorem 2 corresponds to the number of monotone nondecreasing sequences , which is equal to .
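The count in Remark 4 can be checked by enumeration. The sketch below assumes, since the text above does not preserve it, that each sequence has a fixed length and takes values in the set of database indices 1, 2, ..., up to the number of databases. Under that assumption the count is the standard multiset coefficient. The function name is hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb


def nondecreasing_sequences(length, max_value):
    """All monotone nondecreasing sequences of a given length over 1..max_value."""
    # combinations_with_replacement on a sorted range yields sorted tuples,
    # i.e., exactly the nondecreasing sequences.
    return list(combinations_with_replacement(range(1, max_value + 1), length))


def count_nondecreasing_sequences(length, max_value):
    """Closed-form count: C(max_value + length - 1, length)."""
    return comb(max_value + length - 1, length)
```

For sequences of length 2 over three database indices, both the enumeration and the closed form give 6.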
Remark 5
After achieving the corner points in Theorem 2, which achieve , one can perform time-sharing between the corner points to obtain an achievable for any . The highest possible achievable rate can be obtained by maximizing over . However, this is not needed as time-sharing results in a piecewise affine function in . Hence, maximizing over would result in operating directly at one of the corner points.
Remark 6
We note that the core of the achievability scheme is the PIR scheme under asymmetric traffic constraints in [40]. Hence, the recursive structure described by (19) is directly inherited from [40]. Nevertheless, two main differences appear in the final rate expression. First, the answer string length from every database belonging to the same group is different, in contrast to [40]. This is due to the fact that every database experiences a different eavesdropping capability in general; hence, the th database encrypts its responses with a key whose length depends on , and the key lengths are therefore different in general. Second, there is no need for time-sharing over the corner points as shown in Remark 5.
In the following corollary, we settle the capacity for , , and arbitrary .
Corollary 1 (Exact capacity for and messages)
For PIRWTCII, the capacity for , and an arbitrary is given by:
(24) 
Remark 7
The explicit capacity expressions in Corollary 1 can be interpreted using basic circuit theory. To see that for for a given , consider the circuit in Fig. 2. The circuit has a current source of units. The circuit consists of parallel resistors. The th resistor has the value of if , and if . Hence, the capacity is the voltage across the current source. A similar interpretation can be inferred from Fig. 3 for the case of . Interestingly, this interpretation implies that in order to maximize the retrieval rate (the voltage across the equivalent resistance of the circuit), one should pick such that the resistance of each parallel branch is as symmetric as possible. This is due to the fact that the equivalent resistance of parallel resistors is less than the resistance of the smallest resistor.
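The circuit fact used in Remark 7 is easy to verify numerically. The following minimal sketch computes the equivalent resistance of parallel resistors and the voltage across a current source via Ohm's law. The resistor values in the usage note are hypothetical and are not the values from Fig. 2 or Fig. 3.

```python
from fractions import Fraction


def parallel_resistance(resistances):
    """Equivalent resistance of resistors in parallel: 1 / sum(1 / R_i)."""
    return 1 / sum(1 / Fraction(r) for r in resistances)


def source_voltage(current, resistances):
    """Voltage across a current source feeding the parallel network (V = I * R_eq)."""
    return Fraction(current) * parallel_resistance(resistances)
```

For instance, `parallel_resistance([1, 2, 4])` evaluates to 4/7, which is smaller than the smallest branch resistance of 1.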
Finally, in the next corollary, we present an explicit achievable rate for when , and an arbitrary . The proof of the corollary can be found in Section 5.5.
Corollary 2 (Achievable retrieval rate for )
For PIRWTCII with and an arbitrary , let , then the secure PIR capacity is lower bounded by:
(25) 
Remark 8
We note the strong connection between the PIRWTCII problem and the PIR problem under asymmetric traffic constraints in [40]. In PIRWTCII problem, the th database uses a secret key of length to span the entire space of the eavesdropper. This in turn leaves symbols for meaningful queries. Since the eavesdropping vulnerabilities of the databases are different in general (different ), the meaningful queries are naturally constrained, e.g., we expect the first database (the most secure) to support more meaningful queries than the remaining databases. However, the main difference between the two problems is that in the PIR problem under asymmetric traffic constraints [40], the traffic ratio vector is fixed (by the problem formulation) in contrast to the PIRWTCII problem, where the user and the databases can agree on a traffic ratio vector to maximize the retrieval rate under the fixed eavesdropping capabilities .
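The role of the MDS-coded key can be illustrated on a toy example. The sketch below is one possible construction under stated assumptions and is not claimed to be the exact encoder of this paper: it works over the small prime field GF(5), uses a Vandermonde (Reed-Solomon) matrix whose first rows generate the MDS code for the key, and spreads the meaningful symbols using the remaining rows. All names (`encode`, `decode`, `P`) are hypothetical.

```python
from itertools import combinations, product

P = 5  # small prime field GF(5), chosen only for illustration


def vandermonde(points, rows):
    """Matrix with entry (i, j) = points[j] ** i mod P."""
    return [[pow(a, i, P) for a in points] for i in range(rows)]


def vec_mat(v, m):
    """Row vector times matrix over GF(P)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) % P
            for j in range(len(m[0]))]


def solve(m, y):
    """Solve x * m = y over GF(P) for a square invertible m (Gauss-Jordan)."""
    n = len(m)
    # Transposed system with the right-hand side as an augmented column.
    a = [[m[i][j] for i in range(n)] + [y[j]] for j in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] % P)
        a[col], a[pivot] = a[pivot], a[col]
        inv = pow(a[col][col], -1, P)  # modular inverse (Python 3.8+)
        a[col] = [(v * inv) % P for v in a[col]]
        for r in range(n):
            if r != col and a[r][col]:
                f = a[r][col]
                a[r] = [(vr - f * vc) % P for vr, vc in zip(a[r], a[col])]
    return [a[i][n] for i in range(n)]


def encode(key, data, points):
    """Answer = MDS-coded artificial noise + embedded meaningful symbols."""
    g = vandermonde(points, len(points))
    noise = vec_mat(key, g[:len(key)])      # any len(key) columns are invertible
    payload = vec_mat(data, g[len(key):])   # assumed embedding of the data
    return [(n + s) % P for n, s in zip(noise, payload)]


def decode(answer, key_len, points):
    """The user sees the whole answer, recovers key and data, drops the key."""
    full = solve(vandermonde(points, len(answer)), answer)
    return full[key_len:]
```

With distinct evaluation points, a key of length 2, and an answer of length 4, any 2 observed positions are uniformly distributed over the key regardless of the data, while the user, who sees all 4 positions, decodes the data and discards the key without any shared randomness.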
Remark 9
We now compare our model with the PIR model in [38, 39]. In [38, 39], there is an eavesdropper, which observes all communication of out of databases, whose identities are unknown to the user. We restrict the comparison to the case (i.e., no collusion between the databases). In this case, the capacity of the secure PIR problem in [39] (abbreviated as TEPIR problem) is . This requires a common randomness, which is shared between the databases and unknown to the user, of length [39, Theorem 1]. We note that the capacity expression is independent of the number of messages in [39]. For the symmetric version of the problem in [38], the capacity expression is also . Interestingly, in the symmetric version of the problem, the common randomness among the databases is used to satisfy both the database privacy and the security constraints simultaneously.
On the other hand, in our model, the eavesdropper wiretaps all databases according to the given . The user knows the ratio of the traffic which is observed by the eavesdropper from each database, i.e., , but does not know which positions are being observed. Surprisingly, our model does not need any shared randomness among the databases or with the user, i.e., here we are able to achieve nontrivial PIR rates with zero shared randomness rates.
As a concrete example, let , and for a fair comparison, let for all in our model. The rationale for this choice of is that in [39], the eavesdropper has access to a total of observations, where is the length of the answer string from any database in [39]. Now, for symmetric in our model, all answer string lengths need to be symmetric, i.e., for all , and therefore, the eavesdropper accesses a total of observations here as it does in [39]. The capacity for this case in our model, from Corollary 1, is , which is attained with in the corollary. This rate is strictly less than the rate in [39], which is , however, [39] requires a shared randomness between the databases at a rate of at least , while in our case no shared randomness is required.
4 Converse Proof
In this section, we derive a general upper bound for the retrieval rate under the privacy and security constraints (4), (8) for the PIRWTCII problem. Our converse proof extends the techniques of [12] to incorporate the security constraint. In addition, since the eavesdropper observes a different fraction of the traffic from each database, we do not expect the answer strings (and consequently the traffic ratios) from each database to be symmetric in length. Thus, we modify the converse proof in [12] to account for this expected traffic asymmetry along the lines of [40]. However, different from [40], traffic ratios are not given, and must be chosen; the eavesdropping ratios are given here. Our converse proof extends the proof in [40] to account for the imposed security constraint.
In the next lemma, we discuss some consequences of the security constraint in (8). The security constraint introduces some interesting conditional independence properties which simplify the converse proof.
Lemma 1 (Security consequences)
In the PIRWTCII problem, the following implications are true due to the security constraint (8):

Messages are conditionally independent given the observed part of the answer strings at the eavesdropper , i.e.,
(26) 
There is no leakage of from all the queries , the eavesdropper observations , and any subset of messages such that ,
(27)
In particular,
(28) 
The eavesdropper’s observations and the messages are conditionally independent given the queries , i.e., for sets , , such that ,
(29)
In particular,
(30) 
The messages and the queries are conditionally independent given the eavesdropper’s observations, i.e., for sets , , such that ,
(31) 
The messages and the queries for any are conditionally independent given , i.e.,
(32)
Proof:

We have
(44) (45)