This work was supported by NSF Grants CNS 13-14733, CCF 14-22111, CCF 14-22129, and CNS 15-26608. A shorter version is submitted to IEEE ISIT 2017.

# Multi-Message Private Information Retrieval: Capacity Results and Near-Optimal Schemes

## Abstract

We consider the problem of multi-message private information retrieval (MPIR) from $N$ non-communicating replicated databases. In MPIR, the user is interested in retrieving $P$ messages out of $M$ stored messages without leaking the identity of the retrieved messages. The information-theoretic sum capacity of MPIR, $C_s^P$, is the maximum number of desired message symbols that can be retrieved privately per downloaded symbol. For the case $P \geq \frac{M}{2}$, we determine the exact sum capacity of MPIR as $C_s^P = \frac{1}{1+\frac{M-P}{PN}}$. The achievable scheme in this case is based on downloading MDS-coded mixtures of all messages. For $P \leq \frac{M}{2}$, we develop lower and upper bounds for all $M$, $P$, $N$. These bounds match if the total number of messages $M$ is an integer multiple of the number of desired messages $P$, i.e., $\frac{M}{P} \in \mathbb{N}$. In this case, $C_s^P = \frac{1-\frac{1}{N}}{1-(\frac{1}{N})^{M/P}}$. The achievable scheme in this case generalizes the single-message capacity achieving scheme to have an unbalanced number of stages per round of download. For all the remaining cases, the difference between the lower and upper bound is at most $0.0082$, which occurs for $M=5$, $P=2$, $N=2$. Our results indicate that joint retrieval of desired messages is more efficient than successive use of single-message retrieval schemes.


## 1 Introduction

The computer science formulation of this problem assumes that the messages are of length one. The metrics in this case are the download cost, i.e., the sum of lengths of the answer strings, and the upload cost, i.e., the size of the queries. Most of this work is computational PIR as it ensures only that a server cannot get any information about user intent unless it solves a certain computationally hard problem [2, 5]. The information-theoretic re-formulation of the problem considers arbitrarily large message sizes, and ignores the upload cost. This formulation provides an absolute, i.e., information-theoretic, guarantee that no server participating in the protocol gets any information about the user intent. Towards that end, recently, [6] has drawn a connection between the PIR problem and the blind interference alignment scheme proposed in [7]. Then, [8] has determined the exact capacity of the classical PIR problem. The retrieval scheme in [8] is based on three principles: message symmetry, symmetry across databases, and exploiting side information from the undesired messages through alignment.

The basic PIR setting has been extended in several interesting directions. The first extension is the coded PIR (CPIR) problem [9, 10, 11]. The contents of the databases in this problem are coded by an $(N,K)$ storage code instead of being replicated. This is a natural extension since most storage systems nowadays are in fact coded to achieve reliability against node failures and erasures with manageable storage cost. In [12], the exact capacity of the MDS-coded PIR is determined. Another interesting extension is PIR with colluding databases (TPIR). In this setting, any $T$ databases can communicate and exchange the queries to identify the desired message. The exact capacity of colluded PIR is determined in [13]. The case of coded colluded PIR is investigated in [14]. The robust PIR problem (RPIR) extension considers the case when some databases are not responsive [13]. Lastly, in the symmetric PIR problem (SPIR) the privacy of the remaining records should be maintained against the user in addition to the usual privacy constraint on the databases, i.e., the user should not learn any messages other than the one it wished to retrieve. The exact capacity of symmetric PIR is determined in [15], and the exact capacity of symmetric PIR from coded databases is determined in [16].

Although there is a vast literature on classical PIR in computer science, only a few works exist on MPIR. [18] proposes a multi-block (multi-message) scheme and observes that if the user requests multiple blocks (messages), it is possible to reuse randomly mixed data blocks (answer strings) across multiple requests (queries). [19] develops a multi-block scheme which further reduces the communication overhead. [20] develops an achievable scheme for the multi-block PIR by designing $k$-safe binary matrices that use XOR operations. [20] also extends the scheme in [1] to multiple blocks. [21] designs an efficient non-trivial multi-query computational PIR protocol and gives a lower bound on the communication of any multi-query information retrieval protocol. These works do not consider determining the information-theoretic capacity.

In this paper, we formulate the MPIR problem with non-colluding replicated databases from an information-theoretic perspective. Our goal is to characterize the sum capacity of the MPIR problem $C_s^P$, which is defined as the maximum ratio of the number of retrieved symbols from the $P$ desired messages to the number of total downloaded symbols. When the number of desired messages $P$ is at least half of the total number of messages $M$, i.e., $P \geq \frac{M}{2}$, we determine the exact sum capacity of MPIR as $C_s^P = \frac{1}{1+\frac{M-P}{PN}}$. We use a novel achievable scheme which downloads MDS-coded mixtures of all messages. We show that joint retrieval of the desired messages strictly outperforms successive use of single-message retrieval $P$ times. Additionally, we present an achievable rate region to characterize the trade-off between the retrieval rates of the desired messages.

For the case of $P \leq \frac{M}{2}$, we derive lower and upper bounds that match if the total number of messages $M$ is an integer multiple of the number of desired messages $P$, i.e., $\frac{M}{P} \in \mathbb{N}$. In this case, the sum capacity is $C_s^P = \frac{1-\frac{1}{N}}{1-(\frac{1}{N})^{M/P}}$. The result resembles the single-message capacity with the number of messages equal to $\frac{M}{P}$. In other cases, although the exact capacity is still an open problem, we show numerically that the gap between the lower and upper bounds is monotonically decreasing in $N$ and is upper bounded by $0.0082$. The achievable scheme when $P \leq \frac{M}{2}$ is inspired by the greedy algorithm in [8], which retrieves all possible combinations of messages. The main difference of our scheme from the scheme in [8] is the number of stages required in each download round. For example, round $M-P+1$ to round $M-1$, which correspond to retrieving the sum of $M-P+1$ to the sum of $M-1$ messages, respectively, are suppressed in our scheme. This is because they do not generate any useful side information for our purposes here, in contrast to [8]. Interestingly, the number of stages for each round is related to the output of a $P$-order IIR filter [22]. Our converse proof generalizes the proof in [8] for $P \geq 1$. The essence of the proof is captured in two lemmas: the first lemma lower bounds the uncertainty of the interference for the case $P \geq \frac{M}{2}$, and the second lemma upper bounds the remaining uncertainty after conditioning on $P$ interfering messages.

## 2 Problem Formulation

Consider a classical PIR setting storing $M$ messages (or files). Each message is a vector $W_i \in \mathbb{F}_q^L$, $i \in \{1, \cdots, M\}$, whose elements are picked uniformly and independently from a sufficiently large field $\mathbb{F}_q$. Denote the contents of message $W_m$ by the vector $[w_m(1), w_m(2), \cdots, w_m(L)]^T$. The messages are independent and identically distributed, and thus,

\begin{align}
H(W_i) &= L, \quad i \in \{1, \cdots, M\} \tag{1} \\
H(W_{1:M}) &= ML \tag{2}
\end{align}

where $W_{1:M} = (W_1, W_2, \cdots, W_M)$. The messages are stored in $N$ non-colluding (non-communicating) databases. Each database stores an identical copy of all $M$ messages, i.e., the databases encode the messages via a repetition storage code [12].

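In the MPIR problem, the user aims to retrieve a subset of messages indexed by the index set $\mathcal{P} = \{i_1, \cdots, i_P\} \subseteq \{1, \cdots, M\}$ out of the available messages, where $|\mathcal{P}| = P$, without leaking the identity of the subset $\mathcal{P}$. We assume that the cardinality of the potential message set, $P$, is known to all databases. To retrieve $W_{\mathcal{P}} = (W_{i_1}, \cdots, W_{i_P})$, the user generates a query $Q_n^{[\mathcal{P}]}$ and sends it to the $n$th database. The user does not have any knowledge about the messages in advance, hence the messages and the queries are statistically independent,

$$I(W_1, \cdots, W_M; Q_1^{[\mathcal{P}]}, \cdots, Q_N^{[\mathcal{P}]}) = I(W_{1:M}; Q_{1:N}^{[\mathcal{P}]}) = 0 \tag{3}$$

The privacy is satisfied by ensuring statistical independence between the queries and the message index set $\mathcal{P} = \{i_1, \cdots, i_P\}$, i.e., the privacy constraint is given by,

$$I(Q_n^{[i_1, \cdots, i_P]}; i_1, \cdots, i_P) = I(Q_n^{[\mathcal{P}]}; \mathcal{P}) = 0, \quad n \in \{1, \cdots, N\} \tag{4}$$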

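The $n$th database responds with an answer string $A_n^{[\mathcal{P}]}$, which is a deterministic function of the queries and the messages, hence

$$H(A_n^{[\mathcal{P}]} | Q_n^{[\mathcal{P}]}, W_{1:M}) = 0 \tag{5}$$

We further note that by the data processing inequality and (4),

$$I(A_n^{[\mathcal{P}]}; \mathcal{P}) = 0, \quad n \in \{1, \cdots, N\} \tag{6}$$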

In addition, the user should be able to reconstruct the messages reliably from the collected answers from all databases given the knowledge of the queries. Thus, we write the reliability constraint as,

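$$H(W_{i_1}, \cdots, W_{i_P} | A_1^{[\mathcal{P}]}, \cdots, A_N^{[\mathcal{P}]}, Q_1^{[\mathcal{P}]}, \cdots, Q_N^{[\mathcal{P}]}) = H(W_{\mathcal{P}} | A_{1:N}^{[\mathcal{P}]}, Q_{1:N}^{[\mathcal{P}]}) = 0 \tag{7}$$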

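We denote the retrieval rate of the $i$th message by $R_i$, where $i \in \mathcal{P}$. The retrieval rate of the $i$th message is the ratio between the length of message $i$ and the total download cost of the message set $\mathcal{P}$ that includes $i$. Hence,

$$R_i = \frac{H(W_i)}{\sum_{n=1}^{N} H(A_n^{[\mathcal{P}]})} \tag{8}$$

The sum retrieval rate of $W_{\mathcal{P}}$ is given by,

$$\sum_{i=1}^{P} R_i = \frac{H(W_{\mathcal{P}})}{\sum_{n=1}^{N} H(A_n^{[\mathcal{P}]})} = \frac{PL}{\sum_{n=1}^{N} H(A_n^{[\mathcal{P}]})} \tag{9}$$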

The sum capacity of the MPIR problem is given by

$$C_s^P = \sup \sum_{i=1}^{P} R_i \tag{10}$$

where the $\sup$ is over all private retrieval schemes.

In this paper, we follow the information-theoretic assumptions of large enough message size and large enough field size, and we ignore the upload cost as in [12, 8, 13, 11]. A formal treatment of the capacity under message and field size constraints for $P=1$ can be found in [23]. We note that the MPIR problem described here reduces to the classical PIR problem when $P=1$, whose capacity is characterized in [8].

## 3 Main Results and Discussions

Our first result is the exact characterization of the sum capacity for the case $P \geq \frac{M}{2}$, i.e., when the user wishes to privately retrieve at least half of the messages stored in the databases.

###### Theorem 1

For the MPIR problem with non-colluding and replicated databases, if the number of desired messages $P$ is at least half of the number of overall stored messages $M$, i.e., if $P \geq \frac{M}{2}$, then the sum capacity is given by,

$$C_s^P = \frac{1}{1+\frac{M-P}{PN}} \tag{11}$$

The achievability proof for Theorem 1 is given in Section 4, and the converse proof is given in Section 6.1. We note that when $P=1$, the constraint of Theorem 1 is equivalent to $M=2$, and the result in (11) reduces to the known result of [8] for $P=1$, $M=2$, which is $\frac{1}{1+\frac{1}{N}}$. We observe that the sum capacity in (11) is a strictly increasing function of $N$, and $C_s^P \rightarrow 1$ as $N \rightarrow \infty$. We also observe that the sum capacity in this regime is a strictly increasing function of $P$, and approaches $1$ as $P \rightarrow M$.

The following corollary compares our result and the rate corresponding to the repeated use of single-message retrieval scheme [8].

###### Corollary 1

For the MPIR problem with $P \geq \frac{M}{2}$, the repetition of the single-message retrieval scheme of [8] $P$ times in a row, which achieves a sum rate of,

$$R_s^{\text{rep}} = \frac{(N-1)(N^{M-1}+P-1)}{N^M-1} \tag{12}$$

is strictly sub-optimal with respect to the exact capacity in (11).

Proof:  In order to use the single-message capacity achieving PIR scheme as an MPIR scheme, the user repeats the single-message achievable scheme for each individual message that belongs to $\mathcal{P}$. We note that at each repetition, the scheme downloads extra decodable symbols from other messages. By this argument, the following rate is achievable using a repetition of the single-message scheme,

$$R_s^{\text{rep}} = C + \Delta(M,P,N) \tag{13}$$

where $C$ is the single-message capacity, which is given by $C = \frac{1-\frac{1}{N}}{1-(\frac{1}{N})^M}$ [8], and $\Delta(M,P,N)$ is the rate of the extra decodable symbols that belong to $\mathcal{P}$. To calculate $\Delta(M,P,N)$, we note that the total download cost $D$ is given by $D = \frac{L}{C}$ by definition. Since $L = N^M$ in the single-message scheme, $D = \frac{N^{M+1}-N}{N-1}$. The single-message scheme downloads one symbol from every message from every database, i.e., the scheme downloads $(P-1)N$ extra symbols from the remaining desired messages that belong to $\mathcal{P}$, thus,

$$\Delta(M,P,N) = \frac{(P-1)N(N-1)}{N^{M+1}-N} = \frac{(P-1)(N-1)}{N^M-1} \tag{14}$$

Using this in (13) gives the expression in (12).

Now, the difference between the capacity in (11) and the achievable rate in (12) is,

\begin{align}
C_s^P - R_s^{\text{rep}} &= \frac{PN}{P(N-1)+M} - \frac{(N-1)(N^{M-1}+P-1)}{N^M-1} \tag{15} \\
&= \frac{\eta(P,M,N)}{(N^M-1)(P(N-1)+M)} \tag{16}
\end{align}

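It suffices to prove that $\eta(P,M,N) > 0$ for all $M$, $P$, when $\frac{M}{2} \leq P \leq M$ and $N \geq 2$. Note,

\begin{align}
\eta(P,M,N) = {} & (2P-M)N^M + (M-P)N^{M-1} - P(P-1)N^2 \notag \\
& + \big((P-1)(2P-M) - P\big)N + (M-P)(P-1) \tag{17}
\end{align}

In the regime $\frac{M}{2} \leq P \leq M$, the coefficients of $N^M$, $N^{M-1}$, and $N^0$ are non-negative. Denote the negative terms in $\eta(P,M,N)$ by $\nu(P,N)$, which is $\nu(P,N) = P(P-1)N^2 + PN$. We note $\nu(P,N) \leq P^2N^2$ when $N \geq 1$, which is the case here. Thus,

\begin{align}
\eta(P,M,N) &\geq (2P-M)N^M + (M-P)N^{M-1} + (P-1)(2P-M)N + (M-P)(P-1) - P^2N^2 \tag{18} \\
&> (2P-M)N^M + (M-P)N^{M-1} - P^2N^2 \tag{19} \\
&= N^2\big((2P-M)N^{M-2} + (M-P)N^{M-3} - P^2\big) \tag{20} \\
&\geq N^2\big((2P-M)2^{M-2} + (M-P)2^{M-3} - P^2\big) \tag{21} \\
&= N^2\big(2^{M-3}(3P-M) - P^2\big) \tag{22} \\
&\geq N^2\Big(2^{M-3} \cdot \frac{M}{2} - M^2\Big) \tag{23} \\
&= MN^2\big(2^{M-4} - M\big) \tag{24}
\end{align}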

where (21) follows from the fact that $(2P-M)N^{M-2} + (M-P)N^{M-3}$ is monotone increasing in $N$ for $M \geq 3$, and (23) follows from $\frac{M}{2} \leq P \leq M$. From (24), we conclude that $\eta(P,M,N) > 0$ for all $M \geq 7$, $P \geq \frac{M}{2}$, and $N \geq 2$. Examining the expression in (17) for the remaining cases manually, i.e., when $M \leq 6$, we note that $\eta(P,M,N) > 0$ in these cases as well. Therefore, $\eta(P,M,N) > 0$ for all possible cases and the MPIR capacity is strictly larger than the rate achieved by repeating the optimum single-message PIR scheme.

For the example in the introduction, where $M=3$, $P=2$, $N=2$, our MPIR scheme achieves a sum capacity of $\frac{4}{5}$ in (11), which is strictly larger than the repeating-based achievable sum rate of $\frac{5}{7}$ in (12).
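
The comparison in Corollary 1 can be spot-checked with exact rational arithmetic. The following Python sketch is only an illustration: the function names `capacity`, `repetition_rate`, and `eta` are ours, and the grid ($3 \leq M \leq 8$, $2 \leq N \leq 6$) is an arbitrary small range, not a substitute for the proof.

```python
from fractions import Fraction

def capacity(M, P, N):
    """Sum capacity (11), valid for M/2 <= P <= M."""
    return Fraction(P * N, P * (N - 1) + M)

def repetition_rate(M, P, N):
    """Sum rate (12) of repeating the single-message scheme P times."""
    return Fraction((N - 1) * (N ** (M - 1) + P - 1), N ** M - 1)

def eta(P, M, N):
    """Polynomial (17)."""
    return ((2 * P - M) * N ** M + (M - P) * N ** (M - 1)
            - P * (P - 1) * N ** 2
            + ((P - 1) * (2 * P - M) - P) * N
            + (M - P) * (P - 1))

gaps = {}
for M in range(3, 9):
    for P in range((M + 1) // 2, M + 1):        # ceil(M/2) <= P <= M
        for N in range(2, 7):
            gap = capacity(M, P, N) - repetition_rate(M, P, N)
            # (15)-(16): the gap equals eta / ((N^M - 1)(P(N - 1) + M)).
            denom = (N ** M - 1) * (P * (N - 1) + M)
            assert gap == Fraction(eta(P, M, N), denom)
            gaps[(M, P, N)] = gap

print(capacity(3, 2, 2), repetition_rate(3, 2, 2))   # 4/5 5/7
print(min(gaps.values()) > 0)                        # True
```

On this grid the gap is strictly positive everywhere, and the identity between (15) and (16)-(17) holds exactly.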

The following corollary gives an achievable rate region for the MPIR problem.

###### Corollary 2

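For the MPIR problem, for the case $P \geq \frac{M}{2}$, the following rate region is achievable,

\begin{align}
\mathcal{C} = \text{conv}\big\{ & (C, \delta, \cdots, \delta), (\delta, C, \cdots, \delta), \cdots, (\delta, \cdots, \delta, C), (C, 0, 0, \cdots, 0), \notag \\
& (0, C, 0, \cdots, 0), \cdots, (0, 0, \cdots, C), (0, 0, \cdots, 0), (C^P, C^P, \cdots, C^P) \big\} \tag{25}
\end{align}

where

$$C = \frac{1-\frac{1}{N}}{1-(\frac{1}{N})^M}, \quad C^P = \frac{C_s^P}{P} = \frac{N}{PN+(M-P)}, \quad \delta = \frac{\Delta(M,P,N)}{P-1} = \frac{N-1}{N^M-1} \tag{26}$$

and where conv denotes the convex hull, and all corner points lie in the $P$-dimensional space.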

Proof:  This is a direct consequence of Theorem 1 and Corollary 1. The corner point $(C, \delta, \cdots, \delta)$ is achievable from the single-message achievable scheme. Due to the symmetry of the problem, any other permutation of the coordinates of this corner point is also achievable by changing the roles of the desired messages. Theorem 1 gives the symmetric sum capacity corner point for the case of $P \geq \frac{M}{2}$, namely $(C^P, C^P, \cdots, C^P)$. By time sharing of these corner points along with the origin, the region in (25) is achievable.

As an example for this achievable region, consider again the example in the introduction, where $M=3$, $P=2$, $N=2$. In this case, we have a two-dimensional rate region with three corner points: $(\frac{4}{7}, \frac{1}{7})$, which corresponds to the single-message capacity achieving point that aims at retrieving $W_1$; $(\frac{1}{7}, \frac{4}{7})$, which corresponds to the single-message capacity achieving point that aims at retrieving $W_2$; and $(\frac{2}{5}, \frac{2}{5})$, which corresponds to the symmetric sum capacity point. The convex hull of these corner points together with the points on the axes gives the achievable region in Fig. 1.
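
The corner points in (25)-(26) are simple to enumerate. The Python sketch below is illustrative; the helper name `corner_points` is ours, and it only lists the extreme points that time sharing would combine, without computing the convex hull itself.

```python
from fractions import Fraction

def corner_points(M, P, N):
    """Corner points of the region (25) for M/2 <= P <= M, as exact fractions."""
    C = Fraction(N - 1, N) / (1 - Fraction(1, N) ** M)   # single-message capacity
    C_P = Fraction(N, P * N + (M - P))                    # symmetric point, C_s^P / P
    delta = Fraction(N - 1, N ** M - 1)                   # rate of each extra desired message
    points = [tuple([Fraction(0)] * P), tuple([C_P] * P)]
    for k in range(P):
        points.append(tuple(C if i == k else delta for i in range(P)))
        points.append(tuple(C if i == k else Fraction(0) for i in range(P)))
    return points

for point in corner_points(3, 2, 2):
    print(tuple(str(x) for x in point))
```

For $M=3$, $P=2$, $N=2$ the list contains $(\frac{4}{7}, \frac{1}{7})$, $(\frac{1}{7}, \frac{4}{7})$, and $(\frac{2}{5}, \frac{2}{5})$, together with the origin and the two axis points.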

For the case $P \leq \frac{M}{2}$, we have the following result, where the lower and upper bounds match if $\frac{M}{P} \in \mathbb{N}$.

###### Theorem 2

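For the MPIR problem with non-colluding and replicated databases, when $P \leq \frac{M}{2}$, the sum capacity is lower and upper bounded as,

$$\underline{R}_s \leq C_s^P \leq \bar{R}_s \tag{27}$$

where the upper bound $\bar{R}_s$ is given by,

\begin{align}
\bar{R}_s &= \frac{1}{1 + \frac{1}{N} + \cdots + \frac{1}{N^{\lfloor M/P \rfloor - 1}} + \left(\frac{M}{P} - \left\lfloor \frac{M}{P} \right\rfloor\right) \frac{1}{N^{\lfloor M/P \rfloor}}} \tag{28} \\
&= \frac{1}{\frac{1-(\frac{1}{N})^{\lfloor M/P \rfloor}}{1-\frac{1}{N}} + \left(\frac{M}{P} - \left\lfloor \frac{M}{P} \right\rfloor\right) \frac{1}{N^{\lfloor M/P \rfloor}}} \tag{29}
\end{align}

For the lower bound, define $r_i$ as,

$$r_i = \frac{e^{j2\pi(i-1)/P}}{N^{1/P} - e^{j2\pi(i-1)/P}}, \quad i = 1, \cdots, P \tag{30}$$

where $j = \sqrt{-1}$, and denote $\gamma_i$, $i = 1, \cdots, P$, to be the solutions of the linear equations $\sum_{i=1}^{P} \gamma_i r_i^{-P} = (N-1)^{M-P}$, and $\sum_{i=1}^{P} \gamma_i r_i^{-k} = 0$, $k = 1, \cdots, P-1$, then $\underline{R}_s$ is given by,

$$\underline{R}_s = \frac{\sum_{i=1}^{P} \gamma_i r_i^{M-P} \left[\left(1+\frac{1}{r_i}\right)^M - \left(1+\frac{1}{r_i}\right)^{M-P}\right]}{\sum_{i=1}^{P} \gamma_i r_i^{M-P} \left[\left(1+\frac{1}{r_i}\right)^M - 1\right]} \tag{31}$$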

The achievability lower bound in Theorem 2 is shown in Section 5 and the upper bound is derived in Section 6.2. The following corollary states that the bounds in Theorem 2 match if the total number of messages is an integer multiple of the number of desired messages.

###### Corollary 3

For the MPIR problem with non-colluding and replicated databases, if $\frac{M}{P}$ is an integer, then the bounds in (27) match, and hence,

$$C_s^P = \frac{1-\frac{1}{N}}{1-(\frac{1}{N})^{M/P}}, \quad \frac{M}{P} \in \mathbb{N} \tag{32}$$

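Proof:  For the upper bound, observe that if $\frac{M}{P} \in \mathbb{N}$, then $\lfloor \frac{M}{P} \rfloor = \frac{M}{P}$. Hence, (28) becomes

$$\bar{R}_s = \frac{1-\frac{1}{N}}{1-(\frac{1}{N})^{M/P}} \tag{33}$$

For the lower bound, consider the case $\frac{M}{P} \in \mathbb{N}$. From (30),

$$\left(1+\frac{1}{r_i}\right)^M = \left(\frac{N^{1/P}}{e^{j2\pi(i-1)/P}}\right)^M = N^{M/P} \tag{34}$$

since $e^{-j2\pi(i-1)M/P} = 1$ for $\frac{M}{P} \in \mathbb{N}$. Similarly, $\left(1+\frac{1}{r_i}\right)^{M-P} = N^{\frac{M}{P}-1}$. Hence, if $\frac{M}{P} \in \mathbb{N}$,

\begin{align}
\underline{R}_s &= \frac{\sum_{i=1}^{P} \gamma_i r_i^{M-P} \left[N^{\frac{M}{P}} - N^{\frac{M}{P}-1}\right]}{\sum_{i=1}^{P} \gamma_i r_i^{M-P} \left[N^{\frac{M}{P}} - 1\right]} \tag{35} \\
&= \frac{N^{\frac{M}{P}} - N^{\frac{M}{P}-1}}{N^{\frac{M}{P}} - 1} \tag{36} \\
&= \frac{1-\frac{1}{N}}{1-(\frac{1}{N})^{M/P}} \tag{37}
\end{align}

Thus, if $\frac{M}{P} \in \mathbb{N}$, $\underline{R}_s = \bar{R}_s$ and we have an exact capacity result in this case.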
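
The bounds of Theorem 2 can also be evaluated numerically. The Python sketch below is illustrative only: the function names and the small Gauss-Jordan helper are ours, the system for $\gamma_i$ is taken to be $\sum_i \gamma_i r_i^{-k} = 0$ for $k = 1, \cdots, P-1$ and $\sum_i \gamma_i r_i^{-P} = (N-1)^{M-P}$, and everything is computed in floating point.

```python
import cmath

def upper_bound(M, P, N):
    """Upper bound (28)."""
    q = M // P
    denom = sum(N ** (-k) for k in range(q)) + (M / P - q) * N ** (-q)
    return 1 / denom

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small complex system."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def lower_bound(M, P, N):
    """Lower bound (31), with r_i from (30) and gamma_i from the assumed linear system."""
    w = [cmath.exp(2j * cmath.pi * i / P) for i in range(P)]
    r = [wi / (N ** (1 / P) - wi) for wi in w]
    # Row k (k = 1..P): sum_i gamma_i r_i^{-k}; right-hand side is 0 except for k = P.
    A = [[ri ** (-k) for ri in r] for k in range(1, P + 1)]
    b = [0] * (P - 1) + [(N - 1) ** (M - P)]
    g = solve(A, b)
    num = sum(gi * ri ** (M - P) * ((1 + 1 / ri) ** M - (1 + 1 / ri) ** (M - P))
              for gi, ri in zip(g, r))
    den = sum(gi * ri ** (M - P) * ((1 + 1 / ri) ** M - 1) for gi, ri in zip(g, r))
    return (num / den).real

print(round(upper_bound(5, 2, 2) - lower_bound(5, 2, 2), 4))   # 0.0082
print(abs(upper_bound(4, 2, 2) - lower_bound(4, 2, 2)) < 1e-9)  # True
```

For $M=5$, $P=2$, $N=2$ the sketch gives $\bar{R}_s = \frac{8}{13}$ and $\underline{R}_s = \frac{17}{28}$, i.e., a gap of $\frac{3}{364} \approx 0.0082$, while for $M=4$, $P=2$, $N=2$ the two bounds coincide as in Corollary 3.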

Examining the result, we observe that when the total number of messages is an integer multiple of the number of desired messages, the sum capacity of the MPIR is the same as the capacity of the single-message PIR with the number of messages equal to $\frac{M}{P}$. Note that, although at first the result may seem as if every $P$ messages can be lumped together as a single message, and the achievable scheme in [8] can be used, this is not the case. The reason is that we need to ensure the privacy constraint for every subset of messages of size $P$. That is why, in this paper, we develop a new achievable scheme.

The state of the results is summarized in Fig. 2: Consider the $(M,P)$ plane, where naturally $M \geq P$. The valid part of the plane is divided into two regions. The first region is confined between the lines $P = \frac{M}{2}$ and $P = M$; the sum capacity in this region is exactly characterized (Theorem 1). The second region is confined between the lines $P = 1$ and $P = \frac{M}{2}$; the sum capacity in this region is characterized only for the cases when $\frac{M}{P} \in \mathbb{N}$ (Corollary 3). The line $P = 1$ corresponds to the previously known result for the single-message PIR [8]. The exact capacity for the rest of the cases is still an open problem; however, the achievable scheme in Theorem 2 yields near-optimal sum rates for all the remaining cases with the largest difference of $0.0082$ from the upper bound, as discussed next.

Fig. 7 shows the difference of the achievable rate $\underline{R}_s$ and the upper bound $\bar{R}_s$ in Theorem 2. The figure shows that the difference decreases as $N$ increases. This difference in all cases is small and is upper bounded by $0.0082$, which occurs when $M=5$, $P=2$, $N=2$. In addition, the difference is zero for the cases $P \geq \frac{M}{2}$ (Theorem 1) or $\frac{M}{P} \in \mathbb{N}$ (Corollary 3).

Fig. 8 shows the effect of changing $M$ for fixed $P$ and $N$. We observe that as $M$ increases, the sum rate monotonically decreases and has a limit of $1-\frac{1}{N}$. In addition, Fig. 9 shows the effect of changing $N$ for fixed $M$ and $P$. We observe that as $N$ increases, the sum rate increases and approaches $1$, as expected.

## 4 Achievability Proof for the Case $P \geq \frac{M}{2}$

In this section, we present the general achievable scheme that attains the upper bound for the case $P \geq \frac{M}{2}$. The scheme applies the concepts of message symmetry, database symmetry, and exploiting side information as in [8]. However, our scheme requires the extra ingredient of MDS coding of the desired symbols and the side information in its second stage.

### 4.1 Motivating Example: M=3, P=2 Messages, N=2 Databases

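We start with a simple motivating example in this sub-section. The scheme operates over message size $L = N^2 = 4$. For the sake of clarity, we assume that the three messages after interleaving their indices are $W_1 = (a_1, \cdots, a_4)^T$, $W_2 = (b_1, \cdots, b_4)^T$, and $W_3 = (c_1, \cdots, c_4)^T$. We use a $2 \times 3$ Reed-Solomon generator matrix over $\mathbb{F}_q$ as

$$G_{2 \times 3} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \tag{38}$$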

The user picks a random permutation for the columns of $G_{2 \times 3}$ from the 6 possible permutations, e.g., in this example we use the permutation $2, 1, 3$. In the first round, the user starts by downloading one symbol from each database and each message, i.e., the user downloads $(a_1, b_1, c_1)$ from the first database, and $(a_2, b_2, c_2)$ from the second database. In the second round, the user encodes the side information from database 2, which is $c_2$, with two new symbols from $W_1$, $W_2$, which are $(a_3, b_3)$, using the permuted generator matrix, i.e., the user downloads two equations from database 1 in the second round,

$$G S_1 \begin{bmatrix} a_3 \\ b_3 \\ c_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_3 \\ b_3 \\ c_2 \end{bmatrix} = \begin{bmatrix} a_3 + b_3 + c_2 \\ 2a_3 + b_3 + 3c_2 \end{bmatrix} \tag{39}$$

The user repeats this operation for the second database with $(a_4, b_4)$ as desired symbols and $c_1$ as the side information from the first database.

For the decodability: The user subtracts out $c_2$ from round two in the first database, then the user can decode $(a_3, b_3)$ from $a_3 + b_3$ and $2a_3 + b_3$. Similarly, by subtracting out $c_1$ from round two in the second database, the user can decode $(a_4, b_4)$ from the two remaining equations in $a_4$ and $b_4$.

For the privacy: Single bit retrievals of $(a_1, b_1, c_1)$ and $(a_2, b_2, c_2)$ from the two databases in the first round satisfy message symmetry and database symmetry, and do not leak any information. In addition, due to the private shuffling of bit indices, the different coefficients of 1, 2 and 3 in front of the bits in the MDS-coded summations in the second round do not leak any information either; see a formal proof in Section 4.3. To see the privacy constraint intuitively from another angle, we note that the user can alter the queries for the second database when the queries for the first database are fixed, when the user wishes to retrieve another set of two messages. For instance, if the user wishes to retrieve $(W_1, W_3)$ instead of $(W_1, W_2)$, it can alter the queries for the second database by changing every $b_2$ in the queries of the second database with $b_3$, $c_2$ with $c_3$, $b_4$ with $b_1$, and $c_1$ with $c_4$.

The query table for this case is shown in Table 1 below. The scheme retrieves $a_1, \cdots, a_4$ and $b_1, \cdots, b_4$, i.e., 8 bits in 10 downloads (5 from each database). Thus, the achievable sum rate for this scheme is $\frac{8}{10} = \frac{4}{5}$. If we use the single-message optimal scheme in [8], which is given in [8, Example 4.3] for this specific case, twice in a row to retrieve two messages, we achieve a sum rate of $\frac{5}{7}$, as discussed in the introduction.
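
The motivating example can be simulated end to end. The Python sketch below is a hedged illustration, not the authors' implementation: it assumes the prime field $\mathbb{F}_5$ (the field size is our choice), reuses the same column permutation $2, 1, 3$ at both databases, and uses helper names (`answers`, `decode_pair`, `retrieve`) that are ours.

```python
import random

Q = 5                       # assumed prime field size (our choice for this sketch)
G = [[1, 1, 1],
     [1, 2, 3]]             # generator matrix (38)
PERM = [1, 0, 2]            # zero-based column order 2, 1, 3, as in (39)
GS = [[row[p] for p in PERM] for row in G]   # permuted generator matrix G S_1

def answers(round_one, new_a, new_b, side_c):
    """Five symbols returned by one database: round one, then two coded sums."""
    mix = [(r[0] * new_a + r[1] * new_b + r[2] * side_c) % Q for r in GS]
    return list(round_one) + mix

def decode_pair(mix, side_c):
    """Cancel the side information and invert the remaining 2x2 system mod Q."""
    y = [(m - r[2] * side_c) % Q for m, r in zip(mix, GS)]
    det = (GS[0][0] * GS[1][1] - GS[0][1] * GS[1][0]) % Q
    det_inv = pow(det, -1, Q)                # modular inverse (Python 3.8+)
    x_a = (GS[1][1] * y[0] - GS[0][1] * y[1]) * det_inv % Q
    x_b = (GS[0][0] * y[1] - GS[1][0] * y[0]) * det_inv % Q
    return x_a, x_b

def retrieve(a, b, c):
    """Run the two-database scheme for desired messages W1 = a and W2 = b."""
    ans1 = answers((a[0], b[0], c[0]), a[2], b[2], c[1])   # database 1
    ans2 = answers((a[1], b[1], c[1]), a[3], b[3], c[0])   # database 2
    a3, b3 = decode_pair(ans1[3:], side_c=ans2[2])         # side information c_2
    a4, b4 = decode_pair(ans2[3:], side_c=ans1[2])         # side information c_1
    downloads = len(ans1) + len(ans2)
    return [ans1[0], ans2[0], a3, a4], [ans1[1], ans2[1], b3, b4], downloads

random.seed(0)
a, b, c = ([random.randrange(Q) for _ in range(4)] for _ in range(3))
a_hat, b_hat, downloads = retrieve(a, b, c)
print(a_hat == a, b_hat == b, downloads)   # True True 10
```

Both desired messages are recovered from 10 downloaded symbols, which matches the sum rate $\frac{8}{10}$ of this example.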

### 4.2 General Achievable Scheme

The scheme requires $L = N^2$, and is completed in two rounds. The main ingredient of the scheme is MDS coding of the desired symbols and side information in the second round. The details of the scheme are as follows.

1. Index preparation: The user interleaves the contents of each message randomly and independently from the remaining messages using a random interleaver $\pi_m(\cdot)$ which is known privately to the user only, i.e.,

$$x_m(i) = w_m(\pi_m(i)), \quad i \in \{1, \cdots, L\} \tag{40}$$

where $X_m = [x_m(1), \cdots, x_m(L)]^T$ is the interleaved message. Thus, the downloaded symbol $x_m(i)$ at any database appears to be chosen at random and independent from the desired message subset $\mathcal{P}$.

2. Round one: As in [8], the user downloads one symbol from every message from every database, i.e., the user downloads $(x_1(n), x_2(n), \cdots, x_M(n))$ from the $n$th database. This implements message symmetry and symmetry across databases, and satisfies the privacy constraint.

3. Round two: The user downloads a coded mixture of new symbols from the desired messages and the undesired symbols downloaded from the other databases. Specifically,

    1. The user picks an MDS generator matrix $G \in \mathbb{F}_q^{P \times M}$, which has the property that every $P \times P$ submatrix is full-rank. This implies that if the user can cancel out any $M-P$ symbols from the mixture, the remaining $P$ symbols can be decoded. One explicit MDS generator matrix is the Reed-Solomon generator matrix over $\mathbb{F}_q$, where $q \geq M$ [24, 25],

    $$G = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & 2 & 3 & \cdots & M \\ 1 & 2^2 & 3^2 & \cdots & M^2 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & 2^{P-1} & 3^{P-1} & \cdots & M^{P-1} \end{bmatrix}_{P \times M} \tag{41}$$

    2. The user picks uniformly and independently at random the permutation matrices $S_1, S_2, \cdots, S_{N-1}$ of size $M \times M$. These matrices shuffle the order of the columns of $G$ to be independent of $\mathcal{P}$.

    3. At the first database, the user downloads an MDS-coded version of $P$ new symbols from the desired set and $M-P$ undesired symbols that are already decoded from the second database in the first round, i.e., the user downloads $P$ equations of the form

    $$G S_1 \begin{bmatrix} x_{i_1}(N+1) & x_{i_2}(N+1) & \cdots & x_{i_P}(N+1) & x_{j_1}(2) & x_{j_2}(2) & \cdots & x_{j_{M-P}}(2) \end{bmatrix}^T \tag{42}$$

    where $i_1, \cdots, i_P$ are the indices of the desired messages and $j_1, \cdots, j_{M-P}$ are the indices of the undesired messages. In this case, the user can cancel out the undesired messages and be left with a $P \times P$ invertible system of equations that it can solve to get $x_{i_1}(N+1), \cdots, x_{i_P}(N+1)$. This implements exploiting side information as in [8].

    4. The user repeats the last step for each set of side information from database 3 to database $N$, each with a different permutation matrix and new desired symbols.

    5. By database symmetry, the user repeats all steps of round two at all other databases.

### 4.3 Decodability, Privacy, and Calculation of the Achievable Rate

Now, we verify that this achievable scheme satisfies the reliability and privacy constraints.

For the reliability: The user gets individual symbols from all databases in the first round, and hence they are all decodable by definition. In the second round, the user can subtract out all the undesired message symbols using the undesired symbols downloaded from all other databases during the first round. Consequently, the user is left with a $P \times P$ system of equations which is guaranteed to be invertible by the MDS property, hence all symbols that belong to $W_{\mathcal{P}}$ are decodable.

For the privacy: At each database, for every message subset $\mathcal{P}$ of size $P$, the achievable scheme retrieves randomly interleaved symbols which are encoded by the following matrix:

$$\mathbf{H}_{\mathcal{P}} = \begin{bmatrix} I_P & 0_P & 0_P & \cdots & 0_P \\ 0_P & G^1_{\mathcal{P}} & 0_P & \cdots & 0_P \\ 0_P & 0_P & G^2_{\mathcal{P}} & \cdots & 0_P \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0_P & 0_P & 0_P & \cdots & G^{N-1}_{\mathcal{P}} \end{bmatrix} \tag{43}$$

where $G^i_{\mathcal{P}}$ are the columns of the encoding matrix that correspond to the message subset $\mathcal{P}$ after applying the random permutation $S_i$. Since the permutation matrices are chosen uniformly and independently from each other, the probability distribution of $\mathbf{H}_{\mathcal{P}}$ is uniform irrespective of $\mathcal{P}$ (the probability of realizing such a matrix is $\left(\frac{(M-P)!}{M!}\right)^{N-1}$). Furthermore, the symbols are chosen randomly and uniformly by applying the random interleaver. Hence, the retrieval scheme is private.
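
The role of the random column permutation in this argument can be illustrated by brute force for a small case. The Python sketch below is a simplified model under our own assumptions: it tracks only which column of $G$ is attached to which message in one coded block, enumerates all $M!$ permutations, and compares the resulting distribution across desired subsets. The function name `coefficient_views` is ours.

```python
from collections import Counter
from itertools import combinations, permutations

def coefficient_views(M, desired):
    """Distribution of message-to-column assignments seen in one coded block."""
    undesired = [m for m in range(M) if m not in desired]
    order = list(desired) + undesired          # symbol order inside the vector of (42)
    views = Counter()
    for perm in permutations(range(M)):        # perm[k]: column of G applied to position k
        views[tuple(sorted(zip(order, perm)))] += 1
    return views

M, P = 3, 2
views = [coefficient_views(M, subset) for subset in combinations(range(M), P)]
print(all(v == views[0] for v in views))       # True
print(len(views[0]))                           # 6
```

In this model every message-to-column assignment is equally likely for every choice of the desired subset, which is the uniformity property used above.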

To calculate the achievable rate: We note that at each database, the user downloads $M$ individual symbols in the first round, which include $P$ desired symbols. The user exploits the side information from the remaining $(N-1)$ databases to generate $P$ equations for each side information set. Each set of $P$ equations in turn generates $P$ desired symbols. Hence, the achievable rate is calculated as,

\begin{align}
\sum_{i=1}^{P} R_i &= \frac{\text{total number of desired symbols}}{\text{total downloaded equations}} \tag{44} \\
&= \frac{N(P + P(N-1))}{N(M + P(N-1))} \tag{45} \\
&= \frac{PN}{(M-P) + PN} \tag{46} \\
&= \frac{1}{1+\frac{M-P}{PN}} \tag{47}
\end{align}
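
The counting behind (44)-(47) can be replayed for the examples of this section. The Python sketch below is illustrative; the function name `two_round_scheme_counts` is ours.

```python
from fractions import Fraction

def two_round_scheme_counts(M, P, N):
    """Total desired symbols and total downloads of the two-round scheme."""
    per_db_downloads = M + P * (N - 1)   # round one, plus P equations per side-information set
    per_db_desired = P + P * (N - 1)     # desired symbols obtained from one database
    return N * per_db_desired, N * per_db_downloads

for M, P, N in [(3, 2, 2), (5, 3, 2), (4, 2, 3)]:
    desired, downloads = two_round_scheme_counts(M, P, N)
    print((M, P, N), desired, downloads, Fraction(desired, downloads))
```

The three cases give 8 symbols in 10 downloads, 12 in 16, and 18 in 24, i.e., sum rates $\frac{4}{5}$, $\frac{3}{4}$, and $\frac{3}{4}$, in agreement with (11).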

### 4.4 Further Examples for the Case $P \geq \frac{M}{2}$

In this section, we illustrate our achievable scheme with two more basic examples. In Section 4.1, we considered the case $M=3$, $P=2$, $N=2$. In the next two sub-sections, we will consider examples with larger $M$, $P$ (Section 4.4.1), and larger $N$ (Section 4.4.2).

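#### 4.4.1 M=5 Messages, P=3 Messages, N=2 Databases

Let $a$, $b$, $c$, $d$, and $e$ denote the contents of $W_1$ to $W_5$, respectively. The achievable scheme is similar to the example in Section 4.1. The difference is the use of a $5 \times 5$ permutation matrix for $S_1$ and a $3 \times 5$ Reed-Solomon generator matrix over $\mathbb{F}_5$ as:

$$G_{3 \times 5} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 5 \\ 1 & 4 & 4 & 1 & 0 \end{bmatrix} \tag{48}$$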

The query table is shown in Table 2 below for one realization of the random column permutation. The reliability and privacy constraints are satisfied due to the MDS property, which implies that any subset of $3$ messages corresponds to a $3 \times 3$ invertible submatrix if the remaining symbols are decodable from the other database. This scheme retrieves $a_1, \cdots, a_4$, $b_1, \cdots, b_4$, and $c_1, \cdots, c_4$, hence 12 bits in 16 downloads (8 from each database). Thus, the achievable sum rate is $\frac{12}{16} = \frac{3}{4}$, which equals the sum capacity in (11). This strictly outperforms the repetition-based achievable sum rate $\frac{18}{31}$ in (12).
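
The MDS property invoked here can be checked directly for the matrix in (48). The Python sketch below is illustrative: it reads the entries modulo 5 (so the entry 5 acts as 0) and tests all ten $3 \times 3$ column submatrices; the helper names `det3_mod` and `is_mds` are ours.

```python
from itertools import combinations

Q = 5
G = [[1, 1, 1, 1, 1],
     [1, 2, 3, 4, 5],
     [1, 4, 4, 1, 0]]    # generator matrix (48); entries are read modulo 5

def det3_mod(m, q):
    """Determinant of a 3x3 integer matrix, reduced modulo q."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return (a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)) % q

def is_mds(G, q):
    """True if every 3x3 column submatrix of the 3-row matrix G is invertible over F_q."""
    for cols in combinations(range(len(G[0])), 3):
        sub = [[row[j] for j in cols] for row in G]
        if det3_mod(sub, q) == 0:
            return False
    return True

print(is_mds(G, Q))   # True
```

Every choice of three columns is a Vandermonde matrix on distinct evaluation points of $\mathbb{F}_5$, so each determinant is nonzero.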

#### 4.4.2 M=4 Messages, P=2 Messages, N=3 Databases

Next, we give an example with a larger $N$. Here, the message size is $N^2 = 9$. The generator matrix is the upper left $2 \times 4$ submatrix of the previous example, and two random permutations (corresponding to $S_1$ and $S_2$) are used. The query table is shown in Table 3 below. This scheme retrieves $a_1, \cdots, a_9$ and $b_1, \cdots, b_9$, hence 18 bits in 24 downloads (8 from each database). Thus, the achievable rate is $\frac{18}{24} = \frac{3}{4}$. This strictly outperforms the repetition-based achievable sum rate $\frac{7}{10}$ in (12).

## 5 Achievability Proof for the Case $P \leq \frac{M}{2}$

In this section, we describe an achievable scheme for the case $P \leq \frac{M}{2}$. We show that this scheme is optimal when the total number of messages $M$ is an integer multiple of the number of desired messages $P$. The scheme incurs a small loss from the upper bound for all other cases. The scheme generalizes the ideas in [8]. Different from [8], our scheme uses an unequal number of stages for each round of download. Interestingly, the number of stages at each round can be thought of as the output of an all-poles IIR filter. Our scheme reduces to [8] if we let $P=1$. In the sequel, we define the $i$th round as the download queries that retrieve a sum of $i$ different symbols. We define the stage as a block of queries that exhausts all $\binom{M}{i}$ combinations of the sum of $i$ symbols in the $i$th round.