The Capacity of Private Information Retrieval
Abstract
In the private information retrieval (PIR) problem a user wishes to retrieve, as efficiently as possible, one out of $K$ messages from $N$ non-communicating databases (each holds all $K$ messages) while revealing nothing about the identity of the desired message index to any individual database. The information theoretic capacity of PIR is the maximum number of bits of desired information that can be privately retrieved per bit of downloaded information. For $K$ messages and $N$ databases, we show that the PIR capacity is $(1 + 1/N + 1/N^2 + \cdots + 1/N^{K-1})^{-1}$. A remarkable feature of the capacity achieving scheme is that if we eliminate any subset of messages (by setting the message symbols to zero), the resulting scheme also achieves the PIR capacity for the remaining subset of messages.
1 Introduction
Marked by paradigm-shifting developments such as big data, cloud computing, and internet of things, the modern information age presents researchers with an unconventional set of challenges. The rapidly evolving research landscape continues to blur traditional boundaries between computer science, communication and information theory, coding and signal processing. For example, the index coding problem, which was introduced by computer scientists in 1998 [1, 2], is now a very active research topic in information theory because of its fundamental connections to a broad range of questions that includes topological interference management [3], network coding [4], distributed storage capacity [5], hat guessing [6], and non-Shannon information inequalities [7]. Evidently, the crossover of problems across fields creates exciting opportunities for fundamental progress through a consolidation of complementary perspectives. The pursuit of such crossovers brings us to the private information retrieval (PIR) problem [8, 9, 10].
Introduced in 1995 by Chor, Kushilevitz, Goldreich and Sudan [11, 12], the private information retrieval (PIR) problem seeks the most efficient way for a user to retrieve a desired message from a set of distributed databases, each of which stores all the messages, without revealing any information about which message is being retrieved to any individual database. The user can hide his interests trivially by requesting all the information, but that could be very inefficient (expensive). The goal of the PIR problem is to find the most efficient solution.
Besides its direct applications, PIR is of broad interest because it shares intimate connections to many other prominent problems. PIR attracted our attention initially in [10] because of its curious similarities to Blind Interference Alignment [13]. PIR protocols are the essential ingredients of oblivious transfer [14], instance hiding [15, 16, 17], multiparty computation [18], secret sharing schemes [19, 20] and locally decodable codes [21]. Through the connection between locally decodable and locally recoverable codes [22], PIR also connects to distributed data storage repair [23], index coding [2] and the entire umbrella of network coding [24] in general. As such, PIR holds tremendous promise as a point of convergence of complementary perspectives. The characterization of the information theoretic capacity of PIR that we undertake in this work is a step in this direction.
The PIR problem is described as follows. We have $N$ non-communicating databases, each of which stores the full set of $K$ independent messages $W_1, \ldots, W_K$. A user wants one of the messages, say $W_i$, $i \in \{1, \ldots, K\}$, but requires each database to learn absolutely nothing (in the information theoretic sense) about the index $i$ of the retrieved message.
For example, suppose we have $N = 2$ databases and $K$ messages. To retrieve $W_1$ privately, the user first generates a random length-$K$ vector $[h_1, h_2, \ldots, h_K]$, where each element is independent and identically distributed uniformly over $\mathbb{F}_2$, i.e., equally likely to be $0$ or $1$. Then the user sends $Q_1 = [h_1, h_2, \ldots, h_K]$ to the first database and $Q_2 = [h_1 + 1, h_2, \ldots, h_K]$ to the second database. Each database uses the query vector as the combining coefficients and produces the corresponding linear combination of message bits as the answer to the query.
(1) $A_1 = \sum_{k=1}^{K} h_k W_k$
(2) $A_2 = \sum_{k=1}^{K} h_k W_k + W_1$
The user obtains $W_1$ by subtracting $A_1$ from $A_2$. Privacy is guaranteed because each query is independent of the desired message index $i$. This is because regardless of the desired message index $i$, each of the query vectors $Q_1, Q_2$ is individually comprised of elements that are i.i.d. uniform over $\mathbb{F}_2$. Thus, each database learns nothing about which message is requested.
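As a minimal sketch of the two-database example above, the following Python snippet simulates the scheme for one-bit messages. The function names (`make_queries`, `answer`, `retrieve`), the 0-based message index, and the sample message values are illustrative assumptions rather than part of the original description; addition over the binary field is implemented as XOR.

```python
import secrets

def make_queries(num_messages, wanted):
    """User side: draw a uniform binary vector h and flip one coordinate for the second database."""
    h = [secrets.randbelow(2) for _ in range(num_messages)]
    q1 = h[:]
    q2 = h[:]
    q2[wanted] ^= 1  # the two queries differ only in the coefficient of the desired message
    return q1, q2

def answer(query, messages):
    """Database side: binary linear combination of the one-bit messages selected by the query."""
    acc = 0
    for coeff, bit in zip(query, messages):
        acc ^= coeff & bit
    return acc

def retrieve(messages, wanted):
    """Run the scheme once and return the recovered bit of message `wanted` (0-based)."""
    q1, q2 = make_queries(len(messages), wanted)
    return answer(q1, messages) ^ answer(q2, messages)

messages = [1, 0, 1, 1, 0]  # K = 5 one-bit messages, values chosen only for illustration
recovered = [retrieve(messages, k) for k in range(len(messages))]
```

Each query, taken on its own, is a uniformly random binary vector, which is the property the privacy argument relies on. The sketch downloads two bits to obtain one desired bit.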
The PIR problem was initially studied in the setting where each message is one bit long [11, 12, 26, 27, 28, 21, 29], where the cost of a PIR scheme is measured by the total amount of communication between the user and the databases, i.e., the sum of lengths of each query string (upload) and each answering string (download).
However, for the traditional Shannon theoretic formulation, where message size is allowed to be arbitrarily large, the upload cost is negligible compared to the download cost [30].
The paper is organized as follows. Section 2 presents the problem statement. The exact capacity of PIR is characterized in Section 3. Section 4 presents a novel PIR scheme, and Section 5 provides the information theoretic converse (i.e., a tight upper bound) to establish its optimality. Section 6 contains a discussion of the results and we conclude in Section 7.
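Notation: For a positive integer $Z$, we use the notation $[Z] = \{1, 2, \ldots, Z\}$. The notation $X \sim Y$ is used to indicate that $X$ and $Y$ are identically distributed. Define the notation $A_{n_1:n_2}$, $n_1, n_2 \in \mathbb{Z}$, as the set $\{A_{n_1}, A_{n_1+1}, \ldots, A_{n_2}\}$ if $n_1 \le n_2$, and as the null set otherwise.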
2 Problem Statement
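Consider $K$ independent messages $W_1, \ldots, W_K$ of size $L$ bits each.
(3) $H(W_1, \ldots, W_K) = H(W_1) + \cdots + H(W_K)$
(4) $H(W_1) = \cdots = H(W_K) = L$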
There are $N$ databases and each database stores all the messages $W_1, \ldots, W_K$. In PIR a user privately generates $\theta \in [K]$ and wishes to retrieve $W_\theta$ while keeping $\theta$ a secret from each database. Depending on $\theta$, there are $K$ strategies that the user could employ to privately retrieve his desired message. For example, if $\theta = k$, then in order to retrieve $W_k$, the user employs $N$ queries $Q_1^{[k]}, \ldots, Q_N^{[k]}$. Since the queries are determined by the user with no knowledge of the realizations of the messages, the queries must be independent of the messages,
(5) $I(W_1, \ldots, W_K; Q_1^{[k]}, \ldots, Q_N^{[k]}) = 0, \quad \forall k \in [K]$
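The user sends query $Q_n^{[k]}$ to the $n$th database. Upon receiving $Q_n^{[k]}$, the $n$th database generates an answering string $A_n^{[k]}$, which is a function of $Q_n^{[k]}$ and the data stored (i.e., all messages $W_1, \ldots, W_K$).
(6) $H(A_n^{[k]} \mid Q_n^{[k]}, W_1, \ldots, W_K) = 0, \quad \forall k \in [K], \forall n \in [N]$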
Each database returns to the user its answer $A_n^{[k]}$. From all the information that is now available to the user, he must be able to decode the desired message $W_k$, with probability of error $P_e$. The probability of error must approach zero as the size of each message $L$ approaches infinity. By Fano's inequality, this implies
(7) $\frac{1}{L} H(W_k \mid A_1^{[k]}, \ldots, A_N^{[k]}, Q_1^{[k]}, \ldots, Q_N^{[k]}) = o(L)$
where $o(L)$ represents any term whose value approaches zero as $L$ approaches infinity.
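To protect the user’s privacy, the $K$ strategies must be indistinguishable (identically distributed) from the perspective of each database, i.e., the following privacy constraint must be satisfied
(8) $(Q_n^{[1]}, A_n^{[1]}, W_1, \ldots, W_K) \sim (Q_n^{[k]}, A_n^{[k]}, W_1, \ldots, W_K), \quad \forall k \in [K], \forall n \in [N]$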
The PIR rate characterizes how many bits of desired information are retrieved per downloaded bit, and is defined as follows.
(9) $R \triangleq \frac{L}{D}$
where $D$ is the expected value (over random queries) of the total number of bits downloaded by the user from all the databases. Note that because of the privacy constraint (8), the expected number of downloaded bits for each message must be the same.
A rate $R$ is said to be $\epsilon$-error achievable if there exists a sequence of PIR schemes, each of rate greater than or equal to $R$, for which $P_e \to 0$ as $L \to \infty$.
3 Main Result: Capacity of Private Information Retrieval
Theorem 1 states the main result.
Theorem 1
For the private information retrieval problem with $K$ messages and $N$ databases, the capacity is
(10) $C = \left(1 + 1/N + 1/N^2 + \cdots + 1/N^{K-1}\right)^{-1}$
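To make the expression in Theorem 1 concrete, the following sketch evaluates the capacity with exact rational arithmetic and compares it with the geometric-series form $(1 - 1/N)/(1 - 1/N^K)$, which is valid for $N > 1$. The function names are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

def pir_capacity(K, N):
    """Capacity of Theorem 1: (1 + 1/N + ... + 1/N^(K-1))^(-1), as an exact fraction."""
    return 1 / sum(Fraction(1, N**k) for k in range(K))

def pir_capacity_closed_form(K, N):
    """Equivalent geometric-series form, valid only for N > 1."""
    return (1 - Fraction(1, N)) / (1 - Fraction(1, N)**K)

# A few sample values: K = N = 2 and K = N = 3.
sample_values = {(K, N): pir_capacity(K, N) for (K, N) in [(2, 2), (3, 3)]}
```

For instance, `pir_capacity(2, 2)` evaluates to `Fraction(2, 3)` and `pir_capacity(3, 3)` to `Fraction(9, 13)`.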
The following observations are in order.

For $N > 1$ databases, the capacity expression can be equivalently expressed as $\left(1 - \frac{1}{N}\right) / \left(1 - \left(\frac{1}{N}\right)^{K}\right)$.

The capacity is strictly higher than the previously best known achievable rate of $1 - \frac{1}{N}$.

The capacity is a strictly decreasing function of the number of messages, $K$, and when the number of messages approaches infinity, the capacity approaches $1 - \frac{1}{N}$.

The capacity is strictly increasing in the number of databases, $N$. As the number of databases approaches infinity, the capacity approaches 1.

Since the download cost is the reciprocal of the rate, Theorem 1 equivalently characterizes the optimal download cost per message bit as $\left(1 + 1/N + 1/N^2 + \cdots + 1/N^{K-1}\right)$ bits.

The achievability proof for Theorem 1, presented in the next section, shows that message size approaching infinity is not necessary to approach capacity. In fact, it suffices to have messages of size equal to any positive integer multiple of $N^K$ bits (or symbols in any finite field) each to achieve a rate exactly equal to capacity, and with zero error.

The upper bound proof will show that no PIR scheme can achieve a rate higher than capacity with $P_e \to 0$ as message size $L \to \infty$. Unbounded message size is essential to the information theoretic formulation of capacity. However, from a practical standpoint, it is natural to ask what this means if the message size is limited. Finding the optimal rate for limited message size remains an open problem in general. However, we note that regardless of message size, the $\epsilon$-error capacity $C_\epsilon$ (and therefore also the zero-error capacity $C_o$) is always an upper bound on zero-error rate. For arbitrary message size $L$, a naive extension of our PIR scheme can be obtained as follows. Pad zeros to each message, rounding up the message size to an integer multiple of $N^K$. Then over each block of $N^K$ symbols per message, directly use the capacity achieving PIR scheme. This achieves the rate $\frac{L}{N^K \lceil L/N^K \rceil}\, C$, which matches capacity exactly if $L$ is a positive integer multiple of $N^K$, and otherwise, approaches capacity for large $L$. It is also clearly suboptimal in general, especially for smaller message sizes where much better schemes are already known. Additional discussion on message size reduction for a capacity achieving PIR scheme is presented in Section 6.
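The effect of zero padding on the rate can be checked numerically. The sketch below assumes that the capacity achieving scheme is applied to every block of $N^K$ padded symbols, so that the download equals the padded length divided by the capacity; the function name `padded_rate` is hypothetical.

```python
from fractions import Fraction

def padded_rate(L, K, N):
    """Rate of the naive extension: pad L symbols up to a multiple of N^K, then run the scheme per block."""
    block = N**K
    padded = block * -(-L // block)  # ceiling of L / block, times block
    capacity = 1 / sum(Fraction(1, N**k) for k in range(K))
    download = padded / capacity     # symbols downloaded for the padded message
    return Fraction(L) / download

# K = N = 2, so blocks have 4 symbols and the capacity is 2/3.
rates = {L: padded_rate(L, 2, 2) for L in (1, 4, 5, 8)}
```

With $K = N = 2$, a one-symbol message padded to four symbols is retrieved at rate $1/6$, below the rate $1/K = 1/2$ of simply downloading everything, which illustrates why the naive extension is suboptimal for small messages.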
4 Theorem 1: Achievability
We present a zero-error PIR scheme for $L = N^K$ bits per message in this section, whose rate is equal to capacity. Note that a zero-error scheme with finite message length can always be repeatedly applied to create a sequence of schemes with message lengths approaching infinity for which the probability of error approaches (is) zero. Thus, the same scheme will suffice as the proof of achievability for both zero-error and $\epsilon$-error capacity.
Let us illustrate the intuition behind the achievable scheme with a few simple examples. Then, based on the examples, we will present an algorithmic description of the achievable scheme for arbitrary number of messages, $K$, and arbitrary number of databases, $N$. We will then revisit the examples in light of the algorithmic formulation. Finally, we will prove that the scheme is both correct and private, and that its rate is equal to the capacity.
4.1 Two Examples to Illustrate the Key Ideas
The capacity achieving PIR scheme has a myopic or greedy character, in that it starts with a narrow focus on the retrieval of the desired message bits from the first database, but grows into a full fledged scheme based on iterative application of three principles:

(1) Enforcing Symmetry Across Databases

(2) Enforcing Message Symmetry within the Query to Each Database

(3) Exploiting Side Information of Undesired Messages to Retrieve New Desired Information
Example 1:
Consider the simplest PIR setting, with $N = 2$ databases, and $K = 2$ messages with $L = N^K = 4$ bits per message. Let $[a_1, a_2, a_3, a_4]$ represent a random permutation of $4$ bits from $W_1$. Similarly, let $[b_1, b_2, b_3, b_4]$ represent an independent random permutation of $4$ bits from $W_2$. These permutations are generated privately and uniformly by the user.
Suppose the desired message is $W_1$, i.e., $\theta = 1$. We start with a query that requests the first bit $a_1$ from the first database (DB1). Applying database symmetry, we simultaneously request $a_2$ from the second database (DB2). Next, we enforce message symmetry, by including queries for $b_1$ and $b_2$ as the counterparts for $a_1$ and $a_2$. Now we have side information of $b_2$ from DB2 to be exploited in an additional query to DB1, which requests a new desired information bit $a_3$ mixed with $b_2$. Finally, applying database symmetry we have the corresponding query $a_4 + b_1$ for DB2. At this point the queries satisfy symmetry across databases, message symmetry within the query to each database, and all undesired side information is exploited, so the construction is complete. The process is summarized below, where the number on each arrow indicates which of the three principles highlighted above is used in that step.
DB1: $a_1$; DB2: (none) $\xrightarrow{(1)}$ DB1: $a_1$; DB2: $a_2$ $\xrightarrow{(2)}$ DB1: $a_1, b_1$; DB2: $a_2, b_2$ $\xrightarrow{(3)}$ DB1: $a_1, b_1, a_3 + b_2$; DB2: $a_2, b_2$ $\xrightarrow{(1)}$ DB1: $a_1, b_1, a_3 + b_2$; DB2: $a_2, b_2, a_4 + b_1$
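Similarly, the queries for $\theta = 2$ are constructed as follows.
DB1: $b_1$; DB2: (none) $\xrightarrow{(1)}$ DB1: $b_1$; DB2: $b_2$ $\xrightarrow{(2)}$ DB1: $a_1, b_1$; DB2: $a_2, b_2$ $\xrightarrow{(3)}$ DB1: $a_1, b_1, a_2 + b_3$; DB2: $a_2, b_2$ $\xrightarrow{(1)}$ DB1: $a_1, b_1, a_2 + b_3$; DB2: $a_2, b_2, a_1 + b_4$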
Privacy is ensured by noting that $[a_1, a_2, a_3, a_4]$ is a random permutation of the bits of $W_1$ and $[b_1, b_2, b_3, b_4]$ is an independent random permutation of the bits of $W_2$. These permutations are only known to the user and not to the databases. Therefore, regardless of the desired message, each database is asked for one randomly chosen bit of each message and a sum of a different pair of randomly chosen bits from each message. Since the permutations are uniform, all possible realizations are equally likely, and privacy is guaranteed.
To verify correctness, note that every desired bit is either downloaded directly or added with known side information which can be subtracted to retrieve the desired bit value. Thus, the desired message bits are successfully recoverable from the downloaded information.
Now, consider the rate of this scheme. The total number of downloaded bits is $6$ and the number of desired bits is $4$. Thus, the rate of this scheme is $4/6 = 2/3$, which matches the capacity $(1 + 1/2)^{-1}$ for this case.
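Example 1 is small enough to simulate end to end. The sketch below is an illustration under stated assumptions: the queries for $\theta = 1$ are those described in the text, the queries for $\theta = 2$ are obtained by applying the same three steps with the roles of the two messages exchanged, and the function name `example1_retrieve` and the bit-list representation of messages are hypothetical. For brevity the database answers are computed inside the same function.

```python
import random

def example1_retrieve(w1, w2, theta, rng):
    """Run the N = 2, K = 2 scheme once; return (decoded desired message, number of bits downloaded)."""
    # Private uniform permutations of the four bit positions of each message.
    pa = rng.sample(range(4), 4)
    pb = rng.sample(range(4), 4)
    a = [w1[i] for i in pa]  # a[0] plays the role of a_1, and so on
    b = [w2[i] for i in pb]
    if theta == 1:
        db1 = [a[0], b[0], a[2] ^ b[1]]  # answers to a1, b1, a3 + b2
        db2 = [a[1], b[1], a[3] ^ b[0]]  # answers to a2, b2, a4 + b1
        # Desired bits a1..a4: two directly, two after removing known side information.
        permuted = [db1[0], db2[0], db1[2] ^ db2[1], db2[2] ^ db1[1]]
        perm = pa
    else:
        db1 = [a[0], b[0], a[1] ^ b[2]]  # answers to a1, b1, a2 + b3
        db2 = [a[1], b[1], a[0] ^ b[3]]  # answers to a2, b2, a1 + b4
        permuted = [db1[1], db2[1], db1[2] ^ db2[0], db2[2] ^ db1[0]]
        perm = pb
    decoded = [None] * 4
    for position, bit in zip(perm, permuted):
        decoded[position] = bit  # undo the private permutation
    return decoded, len(db1) + len(db2)

rng = random.Random(0)
decoded, downloaded = example1_retrieve([1, 0, 1, 1], [0, 1, 1, 0], 1, rng)
```

In every run four desired bits are recovered from six downloaded bits, consistent with the rate $2/3$ computed above.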
Finally, let us represent the structure of the queries (to any database) in the following matrix.
$\begin{array}{|c|} \hline a,\ b \\ \hline a + b \\ \hline \end{array}$
Here $a$ ($b$) represents a placeholder for a distinct element of $a_1, \ldots, a_4$ ($b_1, \ldots, b_4$). The key to the structure is that it is made up of sums (a single variable is also named a (trivial) sum) of message bits, no message bit appears more than once, and all possible assignments of message bits to these placeholders are equally likely. The structure matrix will be useful for the algorithmic description later.
Example 2:
The second example is when $N = 3$, $K = 3$. In this case, all messages have $N^K = 27$ bits. The construction of the optimal PIR scheme for $K = 3$, $N = 3$ is illustrated below, where $[a_1, \ldots, a_{27}]$, $[b_1, \ldots, b_{27}]$, $[c_1, \ldots, c_{27}]$ are three i.i.d. uniform permutations of bits from $W_1, W_2, W_3$, respectively. The construction of the queries from each database when $\theta = 1$ may be visualized as follows.
Similarly, the queries when are as follows.
The structure of the queries is summarized in the following structure matrix. Note again that the structure matrix is made up of sums of placeholders of message bits, no message bit appears more than once, and the assignment of all message bits to these placeholders is equally likely.
The examples illustrated above generalize naturally to arbitrary $N$ and $K$. As we proceed to proofs of privacy and correctness and to calculate the rate for arbitrary parameters, a more formal algorithmic description will be useful.
4.2 Formal Description of Achievable Scheme
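For all $k \in [K]$, define the vector of variables $U_k = [u_k(1), u_k(2), \ldots, u_k(N^K)]$. We use the term $k$-sum for a sum of $k$ distinct variables, each drawn from a different vector $U_j$, and we call the set of indices of the vectors that appear in it the type of the $k$-sum.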
The achievable scheme is comprised of the following elements: 1) a fixed query set structure, 2) an algorithm to generate the query set as a deterministic function of $\theta$, and 3) a random mapping from the variables in $U_k$ to message bits, which will produce the actual queries to be sent to the databases. The random mapping will be privately generated by the user, unknown to the databases. These elements are described next.
A Fixed Query Set Structure
For all DB $\in [N]$, $\theta \in [K]$, let us define ‘query sets’ $Q(\text{DB}, \theta)$, which must satisfy the following structural properties. Each $Q(\text{DB}, \theta)$ must be the union of $K$ disjoint subsets called “blocks”, that are indexed by $k \in [K]$. Block $k$ must contain only $k$-sums. Note that there are only $\binom{K}{k}$ possible “types” of $k$-sums. Block $k$ must contain all of them. We require that block $k$ contains exactly $(N-1)^{k-1}$ distinct instances of each type of $k$-sum. This requirement is chosen following the intuition from the three principles, and as we will prove shortly, it ensures that the resulting scheme is capacity achieving. Thus, the total number of elements contained in block $k$ must be $(N-1)^{k-1}\binom{K}{k}$, and the total number of elements in each query set must be $\sum_{k=1}^{K} (N-1)^{k-1}\binom{K}{k}$. For example, for $N = 3$, $K = 3$, as illustrated previously, there are $\binom{3}{1} = 3$ types of $1$-sums ($a$, $b$, $c$) and we have $(3-1)^{0} = 1$ instance of each; there are $\binom{3}{2} = 3$ types of $2$-sums ($a+b$, $b+c$, $c+a$) and we have $(3-1)^{1} = 2$ instances of each; and there is $\binom{3}{3} = 1$ type of $3$-sum ($a+b+c$) and we have $(3-1)^{2} = 4$ instances of it. The query to each database has this structure. Furthermore, no message symbol can appear more than once in a query set for any given database.
The structure of Block $k$ of the query $Q(\text{DB}, \theta)$, enforced by the constraints described above, is illustrated in Figure 1 through an enumeration of all its elements. In the figure, each $u_j$ represents a placeholder for a distinct element of $U_j$. Note that the structure as represented in Figure 1 is fixed regardless of $\theta$ and DB. All query sets must have the same fixed structure.
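The counts implied by the fixed query structure are easy to tabulate. The following sketch assumes, as in Example 1, that every element of a query set costs one downloaded bit and that each message has $N^K$ bits; the function names `block_sizes` and `scheme_rate` are hypothetical.

```python
from fractions import Fraction
from math import comb

def block_sizes(K, N):
    """Number of k-sums in block k for k = 1..K: (N-1)^(k-1) instances of each of C(K, k) types."""
    return [(N - 1)**(k - 1) * comb(K, k) for k in range(1, K + 1)]

def scheme_rate(K, N):
    """Desired bits (N^K) divided by downloaded bits (N databases times the query set size)."""
    per_database = sum(block_sizes(K, N))
    return Fraction(N**K, N * per_database)

sizes_3_3 = block_sizes(3, 3)  # blocks of 1-sums, 2-sums and 3-sums for N = K = 3
rate_3_3 = scheme_rate(3, 3)
```

For $N = K = 3$ the blocks contain $3$, $6$ and $4$ elements, so each database returns $13$ bits and the resulting rate is $27/39 = 9/13$.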
A Deterministic Algorithm
Next we present the algorithm which will produce $Q(\text{DB}, \theta)$ for all DB $\in [N]$ as function of $\theta$ alone. In particular, this algorithm will determine which $U_k$ variable is assigned to each placeholder value in the query structure described earlier. To present the algorithm we need these definitions.
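For each $k \in [K]$, let $\mathrm{new}(U_k)$ be a function that, starting with $u_k(1)$, returns the “next” variable in $U_k$ each time it is called with $U_k$ as its argument. So, for example, the following sequence of calls to this function: $\mathrm{new}(U_2), \mathrm{new}(U_1), \mathrm{new}(U_1), \mathrm{new}(U_1) + \mathrm{new}(U_2)$ will produce $u_2(1), u_1(1), u_1(2), u_1(3) + u_2(2)$ as the output.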
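The stateful behaviour of the "next variable" function can be sketched with one counter per vector. In this illustration a variable $u_k(i)$ is represented by the pair `(k, i)` and a sum by a tuple of such pairs; both the representation and the helper name `make_new` are assumptions made for the example.

```python
from itertools import count

def make_new():
    """Return a function new(k) that yields u_k(1), u_k(2), ... on successive calls for the same k."""
    counters = {}

    def new(k):
        # Create the counter for U_k on first use, then advance it.
        counters.setdefault(k, count(1))
        return (k, next(counters[k]))

    return new

new = make_new()
# The sequence of calls from the text: new(U_2), new(U_1), new(U_1), new(U_1) + new(U_2).
calls = [new(2), new(1), new(1), (new(1), new(2))]
```

The list `calls` then holds the pairs for $u_2(1)$, $u_1(1)$, $u_1(2)$ and the sum $u_1(3) + u_2(2)$, in that order.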
Let us partition each block $k$ into two subsets: a subset $\mathcal{M}$ that contains the $k$-sums which include a variable from $U_\theta$, and a subset $\mathcal{I}$ which contains all the remaining $k$-sums, which contain no symbols from $U_\theta$.
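Using these definitions the algorithm is presented next. Its labeled assignments are the following, where in (13) the term $q$ ranges over $Q(\text{DB}', \theta, \text{Block } k-1, \mathcal{I})$ for every $\text{DB}' \neq \text{DB}$ and $k \geq 2$.
(11) $Q(\text{DB}, \theta, \text{Block } 1, \mathcal{M}) \leftarrow \{\mathrm{new}(U_\theta)\}$
(12) $Q(\text{DB}, \theta, \text{Block } 1, \mathcal{I}) \leftarrow \bigcup_{k \in [K] \setminus \{\theta\}} \{\mathrm{new}(U_k)\}$
(13) $Q(\text{DB}, \theta, \text{Block } k, \mathcal{M}) \leftarrow Q(\text{DB}, \theta, \text{Block } k, \mathcal{M}) \cup \{\mathrm{new}(U_\theta) + q\}$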
Algorithm 1 realizes the three principles as follows. The for-loop in steps 5 to 14 ensures database symmetry (principle (1)). The for-loop in steps 10 to 13 ensures message symmetry within one database (principle (2)). Steps 7 to 8 retrieve new desired information using existing side information (principle (3)).
The proof that the $Q(\text{DB}, \theta)$ produced by this algorithm indeed satisfy the query structure described before is presented in Lemma 1.
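The construction can be prototyped in a few lines. The sketch below is reconstructed from the prose description in this section and is not the original listing, so its loop order, step numbering and data layout are assumptions: a variable $u_k(i)$ is the pair `(k, i)`, a sum is a tuple of pairs, and each block is stored as a dictionary with the two parts `"M"` and `"I"`. The names `build_query_sets` and `flatten` are hypothetical.

```python
from itertools import combinations, count

def build_query_sets(K, N, theta):
    """Return {db: [block_1, ..., block_K]} with block = {"M": [...], "I": [...]}."""
    counters = {k: count(1) for k in range(1, K + 1)}

    def new(k):
        # Next unused variable of U_k, shared across all databases.
        return (k, next(counters[k]))

    others = [k for k in range(1, K + 1) if k != theta]
    Q = {db: [] for db in range(1, N + 1)}
    # Block 1: every database asks for one fresh variable of every vector.
    for db in Q:
        Q[db].append({"M": [(new(theta),)], "I": [(new(k),) for k in others]})
    for block in range(2, K + 1):
        for db in Q:
            M, I = [], []
            # Pair a fresh desired variable with each undesired sum held by another database.
            for other in Q:
                if other == db:
                    continue
                for q in Q[other][block - 2]["I"]:
                    M.append(tuple(sorted((new(theta),) + q)))
            # Message symmetry: fresh sums of every type that avoids the desired vector.
            for idx in combinations(others, block):
                for _ in range((N - 1)**(block - 1)):
                    I.append(tuple(new(k) for k in idx))
            Q[db].append({"M": M, "I": I})
    return Q

def flatten(blocks):
    """List all sums of one database, block by block, M part before I part."""
    return [term for block in blocks for part in ("M", "I") for term in block[part]]

example1 = build_query_sets(2, 2, 1)
```

For $N = K = 2$ and $\theta = 1$ the sketch yields $a_1, b_1, a_3 + b_2$ for DB1 and $a_2, b_2, a_4 + b_1$ for DB2, in agreement with Example 1.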
Ordered Representation and Mapping to Message Bits to Produce $Q_{\text{DB}}^{[\theta]}$
It is useful at this point to have an ordered vector representation of the query structure, as well as the query set $Q(\text{DB}, \theta)$. For the query structure, let us first order the blocks in increasing order of block index. Then within the $k$th block, $k \in [K]$, arrange the “types” of sums by first sorting the indices into $(i_1, i_2, \ldots, i_k)$ such that $i_1 < i_2 < \cdots < i_k$, and then arranging the $k$-tuples in increasing lexicographic order. For the query set, we have the same arrangement for blocks and types, but then for each given type, we further sort the multiple instances of that type by the index of the term with the smallest value in that type. Let $\vec{Q}(\text{DB}, \theta)$ denote the ordered representation of $Q(\text{DB}, \theta)$. Next we will map the variables to message bits to produce a query vector.
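Suppose each message $W_k$, $k \in [K]$, is represented by the vector $W_k = [w_k(1), w_k(2), \ldots, w_k(N^K)]$, where $w_k(i)$ is the binary random variable representing the $i$th bit of $W_k$. The user privately chooses permutations $\gamma_1, \gamma_2, \ldots, \gamma_K$, uniformly randomly from all possible $(N^K)!$ permutations over the index set $[N^K]$, so that the permutations are independent of each other and of $\theta$. The $U_k$ variables are mapped to the messages $W_k$ through the random permutation $\gamma_k$, $\forall k \in [K]$. Let $\Gamma$ denote an operator that replaces every instance of $u_k(i)$ with $w_k(\gamma_k(i))$, $\forall k \in [K], i \in [N^K]$. For example, $\Gamma(u_1(3) + u_2(2)) = w_1(\gamma_1(3)) + w_2(\gamma_2(2))$. This random mapping, applied to $\vec{Q}(\text{DB}, \theta)$, produces the actual query vector $Q_{\text{DB}}^{[\theta]}$ that is sent to database DB as
(14) $Q_{\text{DB}}^{[\theta]} = \text{“}\Gamma(\vec{Q}(\text{DB}, \theta))\text{”}$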
We use the double-quotes notation around a random variable to represent the query about its realization. For example, while $w_1(1)$ is a random variable, which may take the value $0$ or $1$, in our notation “$w_1(1)$” is not random, because it only represents the question: “what is the value of $w_1(1)$?” This is an important distinction, in light of constraints such as (5) which require that queries must be independent of messages, i.e., message realizations. Note that our queries are indeed independent of message realizations because the queries are generated by the user with no knowledge of message realizations. Also note that the only randomness in $Q_{\text{DB}}^{[\theta]}$ is because of $\theta$ and the random permutations $\gamma_1, \ldots, \gamma_K$.
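The mapping from variables to message bit positions can be sketched as follows. The representation is an assumption made for illustration: a variable $u_k(i)$ is the pair `(k, i)`, a sum is a tuple of pairs, and the output contains only bit positions, never bit values, which mirrors the distinction between a random bit and the question about its value. The function names are hypothetical.

```python
import random

def draw_permutations(K, N, rng):
    """One private uniform permutation of the index set {1, ..., N^K} per message."""
    size = N**K
    return {k: rng.sample(range(1, size + 1), size) for k in range(1, K + 1)}

def apply_gamma(ordered_query, perms):
    """Replace every variable (k, i) by the bit position (k, gamma_k(i)) of message k."""
    return [tuple((k, perms[k][i - 1]) for (k, i) in term) for term in ordered_query]

rng = random.Random(1)
perms = draw_permutations(2, 2, rng)
# Ordered query set of DB1 in Example 1 for theta = 1: a1, b1, a3 + b2.
ordered_query = [((1, 1),), ((2, 1),), ((1, 3), (2, 2))]
sent = apply_gamma(ordered_query, perms)
```

The list `sent` has the same shape as `ordered_query`; only the indices inside each pair are relabeled by the private permutations.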
4.3 The Two Examples Revisited
To illustrate the algorithmic formulation, let us revisit the two examples that were presented previously from an intuitive standpoint.
Example 1:
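Consider the simplest PIR setting, with $N = 2$ databases, and $K = 2$ messages with $N^K = 4$ bits per message. Instead of our usual notation, i.e., $U_1 = [u_1(1), u_1(2), u_1(3), u_1(4)]$, for this example it will be less cumbersome to use the notation $U_1 = [a_1, a_2, a_3, a_4]$. Similarly, $U_2 = [b_1, b_2, b_3, b_4]$. The query structure and the outputs produced by the algorithm for $\theta = 1$ as well as for $\theta = 2$ are as follows. The query structure of each database is $\{a, b\}$ in Block $1$ and $\{a + b\}$ in Block $2$. For $\theta = 1$, the algorithm produces $Q(\text{DB1}, 1) = \{a_1, b_1, a_3 + b_2\}$ and $Q(\text{DB2}, 1) = \{a_2, b_2, a_4 + b_1\}$; for $\theta = 2$, it produces $Q(\text{DB1}, 2) = \{a_1, b_1, a_2 + b_3\}$ and $Q(\text{DB2}, 2) = \{a_2, b_2, a_1 + b_4\}$. In each case the first two terms form Block $1$ and the last term forms Block $2$, and within each block the $\mathcal{M}$ terms are those that contain a desired variable ($a$ for $\theta = 1$, $b$ for $\theta = 2$) while the remaining terms are the $\mathcal{I}$ terms. Note that there are no terms in $\mathcal{I}$ for the last block (Block $2$), because there are no $2$-sums that do not include the $U_\theta$ variables.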
To verify that the scheme is correct, note that whether $\theta = 1$ or $\theta = 2$, every desired bit is either downloaded directly (block 1) or appears with known side information that is available from the other database. To see why privacy holds, recall that the queries are ultimately presented to the database in terms of the message variables $w_k(i)$ and the mapping from $U_k$ to $W_k$ is uniformly random and independent of $\theta$. So, consider an arbitrary realization of the query with (distinct) message bits $w_1(i_1), w_1(i_2)$ from $W_1$ and $w_2(j_1), w_2(j_2)$ from $W_2$.
(15) $\text{“}w_1(i_1),\ w_2(j_1),\ w_1(i_2) + w_2(j_2)\text{”}$
Given this query, the probability that it was generated for $\theta = 1$ is $\left(\frac{1}{4} \cdot \frac{1}{3}\right)^2 = \frac{1}{144}$, which is the same as the probability that it was generated for $\theta = 2$. Thus, the query provides the database no information about $\theta$, and the scheme is private. This argument is presented in detail and generalized to arbitrary $K$ and $N$ in Lemma 3. Finally, consider the rate of this scheme. The total number of downloaded bits is $6$, and the number of desired bits downloaded is $4$, so the rate of this scheme is $4/6 = 2/3$, which matches the capacity for this case.
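For this small example the privacy claim can be checked by brute force. The sketch below enumerates all $4! \times 4!$ pairs of permutations and tabulates the query seen by DB1 in its ordered form; the variable-level queries for $\theta = 1$ and $\theta = 2$ are those of Example 1, and the representation of a query as tuples of `(message, position)` pairs is an assumption made for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

# Ordered query to DB1 in Example 1, as sums of (vector, variable index) pairs.
DB1_QUERY = {
    1: [((1, 1),), ((2, 1),), ((1, 3), (2, 2))],  # theta = 1: a1, b1, a3 + b2
    2: [((1, 1),), ((2, 1),), ((1, 2), (2, 3))],  # theta = 2: a1, b1, a2 + b3
}

def query_distribution(theta):
    """Exact distribution of the realized query at DB1 over all permutation pairs."""
    counts = Counter()
    for g1 in permutations(range(1, 5)):
        for g2 in permutations(range(1, 5)):
            gamma = {1: g1, 2: g2}
            realized = tuple(
                tuple((k, gamma[k][i - 1]) for (k, i) in term)
                for term in DB1_QUERY[theta]
            )
            counts[realized] += 1
    total = sum(counts.values())
    return {query: Fraction(c, total) for query, c in counts.items()}

dist_theta1 = query_distribution(1)
dist_theta2 = query_distribution(2)
```

Both distributions are uniform over the same $144$ realizations, so each realization has probability $1/144$ regardless of $\theta$.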
Example 2:
The second example is when $N = 3$, $K = 3$. In this case, all messages have $N^K = 27$ bits. The query structure and the output of the algorithm for $\theta = 1$ are shown below.
4.4 Proof of Correctness, Privacy and Achieving Capacity
The following lemma confirms that the query set produced by the algorithm satisfies the required structural properties.
Lemma 1
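(Structure of $Q(\text{DB}, \theta)$) For any $\theta \in [K]$ and for any DB $\in [N]$, the $Q(\text{DB}, \theta)$ produced by Algorithm 1 satisfies the following properties.

1. For all $k \in [K]$, block $k$ contains exactly $(N-1)^{k-1}$ instances of $k$-sums of each possible type.

2. No $u_k(i)$ variable appears more than once within $Q(\text{DB}, \theta)$ for any given DB.

3. Exactly $N^{K-1}$ variables for each $U_k$, $k \in [K]$, appear in the query set $Q(\text{DB}, \theta)$.

4. The size of $Q(\text{DB}, \theta)$ is $N^{K-1} + \frac{1}{N-1}\left(N^{K-1} - 1\right)$.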
Proof:

1. Fix any arbitrary $\theta \in [K]$. The proof is based on induction on the claim $S(k)$, defined as follows.
$S(k)$: “Block $k$ contains exactly $(N-1)^{k-1}$ instances of $k$-sums of all possible types.” The basis step is when $k = 1$. This step is easily verified, because a $1$-sum is simply one variable, of which there are $K$ possible types, and from (11), (12) in Algorithm 1, we note that the first block always consists of one variable of each vector $U_k$, $k \in [K]$.
We next proceed to the inductive step. Suppose $S(k)$ is true. Then we wish to prove that $S(k+1)$ must be true as well. Here we have $k \in [K-1]$. First, consider $(k+1)$-sums of type $\{i_1, i_2, \ldots, i_{k+1}\}$ where none of the indices is $\theta$. These belong in $Q(\text{DB}, \theta, \text{Block } k+1, \mathcal{I})$, and from line 11 of the algorithm it is verified that exactly $(N-1)^{k}$ instances are generated of this type. Next, consider the $(k+1)$-sums of type $\{i_1, i_2, \ldots, i_{k+1}\}$ where one of the indices is $\theta$. These belong to $Q(\text{DB}, \theta, \text{Block } k+1, \mathcal{M})$ and are obtained by adding $\mathrm{new}(U_\theta)$ to each of the $k$-sums of type $\{i_1, i_2, \ldots, i_{k+1}\} \setminus \{\theta\}$ that belong to $Q(\text{DB}', \theta, \text{Block } k, \mathcal{I})$ for all $\text{DB}' \neq \text{DB}$. Therefore, the number of instances of $(k+1)$-sums of type $\{i_1, i_2, \ldots, i_{k+1}\}$ in $Q(\text{DB}, \theta, \text{Block } k+1, \mathcal{M})$ must be equal to the product of the number of ‘other’ databases $\text{DB}'$, which is equal to $N - 1$, and the number of instances of type $\{i_1, i_2, \ldots, i_{k+1}\} \setminus \{\theta\}$ in each database $\text{DB}'$, which is equal to $(N-1)^{k-1}$ because $S(k)$ is assumed to be true as the induction hypothesis. Since $(N-1) \times (N-1)^{k-1} = (N-1)^{k}$, we have shown that $S(k+1)$ is true, completing the proof by induction.

2. From (11), (13), we see that for each block, the desired variables, i.e., the $U_\theta$ variables, appear only through the $\mathrm{new}(\cdot)$ function so that each of them only appears once. For the non-desired variables $U_k$, $k \neq \theta$, we see that the only time that they do not appear through the $\mathrm{new}(\cdot)$ function is when they enter through $q$ in (13). However, from (13) we see that these variables come from the $\mathcal{I}$ part of the previous block of other databases, where each of them was only introduced once through a $\mathrm{new}(\cdot)$ function. Moreover, each term from the $\mathcal{I}$ part of the previous block of other databases is used exactly once. Therefore, these variables also appear no more than once in the query set of a given database.

3. Since we have shown that no variable appears more than once, we only need to count the number of times each vector $U_j$ is invoked within $Q(\text{DB}, \theta)$. Consider any particular vector, say $U_j$. The number of possible types of $k$-sums that include index $j$ is $\binom{K-1}{k-1}$. As we have also shown, the $k$th block contains $(N-1)^{k-1}$ instances of $k$-sums of each type. Therefore, the number of instances of vector $U_j$ in block $k$ is $(N-1)^{k-1}\binom{K-1}{k-1}$. Summing over all blocks within $Q(\text{DB}, \theta)$ we find