# Capacity of Single-Server Single-Message Private Information Retrieval with Coded Side Information

###### Abstract

This paper considers the problem of single-server single-message private information retrieval with coded side information (PIR-CSI). In this problem, there is a server storing a database, and a user who knows a linear combination of a subset of messages in the database as side information. The number of messages contributing to the side information is known to the server, but the indices and the coefficients of these messages are unknown to the server. The user wishes to download a message from the server privately, i.e., without revealing which message it is requesting, while minimizing the download cost. In this work, we consider two different settings for the PIR-CSI problem, depending on whether or not the demanded message is one of the messages contributing to the side information. For each setting, we prove an upper bound on the maximum download rate as a function of the size of the database and the size of the side information, and propose a protocol that achieves this upper bound.

## I Introduction

In the original setting of the private information retrieval (PIR) problem [1], a user wishes to download (with minimum cost) a message belonging to a database of messages privately, i.e., without revealing which message it is requesting, from a single server or multiple servers each storing a copy of the database. In a single-server setting, or in a multiple-server setting where the servers collude, the user must download the whole database in order to achieve privacy in an information-theoretic sense [1]. However, when the database is replicated on multiple non-colluding servers (see, e.g., [2, 3]), or coded versions of the data are stored on the servers (see, e.g., [4, 5]), or the user has some side information about the database (see, e.g., [6, 7, 8, 9, 10]), the user can achieve information-theoretic privacy more efficiently than by downloading the whole database. The multi-message setting of the PIR problem has also been studied in [11, 12], where the user wishes to download multiple messages privately, instead of only one message as in the single-message setting, from a single server or multiple servers.

In this paper, we study the single-server single-message PIR problem when the user knows a linear combination of a subset of messages in the database as side information. This problem generalizes those previously studied in the single-server single-message PIR setting. In particular, we assume that the indices and coefficients of the messages contributing to the user’s side information are unknown to the server, and the user’s demanded message may or may not be one of the messages in the side information. This type of side information can be motivated by several scenarios. For example, the user may have overheard some coded packets over a wireless broadcast channel, or some part of the user’s information, which is locally stored using an erasure code, may be lost and not recoverable locally.

### I-A Main Contributions

In this work, we characterize the capacity of the PIR-CSI problem, defined as the supremum of all achievable download rates, in a single-server single-message setting as a function of the size of the database () and the size of the side information (). In particular, for the setting in which the user’s demand is not one of the messages contributing to its side information, we prove that the capacity is equal to for any . Interestingly, the capacity of PIR with (uncoded) side information [6] is also equal to where is the number of messages available at the user. This shows that there will be no loss in capacity, when compared to the case that the user knows messages separately, even if the user knows only one linear combination of messages. Also, for the setting in which the demanded message is contributing to the user’s side information, we prove that the capacity is equal to for and , and is equal to for any . This is interesting because it shows that, no matter what the size of the side information is, the user can privately retrieve any message contributing to its side information with a download cost at most twice the cost of downloading the message directly. The proof of converse for each setting is based on information-theoretic arguments, and for the achievability proofs, for different cases of , we propose different PIR protocols which are all based on the idea of randomized non-uniform partitioning.

## II Problem Formulation

Let , , , and be integers. Let be a finite field of size , and let be an extension field of of size . Let be the multiplicative group of , i.e., . For a positive integer , denote , and let .

We assume that there is a server storing a set of messages , with each message being independently and uniformly distributed over , i.e., and , where . Also, we assume that there is a user that wishes to retrieve a message from the server for some , and knows a linear combination for some and some , where is the set of all subsets of of size , and is the set of all ordered sets of size (i.e., all sequences of length ) with elements from . We refer to as the demand index, as the demand, as the side information, and as the side information size.

Let , , and be random variables representing , , and , respectively. Denote the probability mass function (pmf) of by , the pmf of by , and the conditional pmf of given by .

We assume that is uniformly distributed over , i.e.,

and is uniformly distributed over , i.e.,

Also, we consider two different models for the conditional pmf of given as follows:

#### Model I

is uniformly distributed over , i.e.,

#### Model II

is uniformly distributed over , i.e.,

(Note that the model I is valid for , and the model II is valid for .)

Let be an indicator function such that if , and if . In the model I, , and in the model II, .

We assume that is known to the server a priori. We also assume that the server knows the size of (i.e., ) and the pmf’s , , and , whereas the realizations , , and are unknown to the server a priori.

For any , , and , in order to retrieve , the user sends to the server a query , which is a (potentially stochastic) function of , , , and , and is independent of any where and , such that , i.e.,

The query must protect the privacy of the user’s demand index from the perspective of the server, i.e.,

This condition is referred to as the privacy condition.
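As a minimal sketch of what the privacy condition asks for, the following Python snippet enumerates a toy instance with two messages, where the side information is a single scaled message other than the demand, and checks that the server's posterior on the demand index equals the uniform prior. The field GF(3), the function name `toy_query_posteriors`, and the toy query (one nonzero coefficient per message) are illustrative assumptions and not the protocol proposed in this paper.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

NONZERO = (1, 2)  # multiplicative group of GF(3); the field choice is an assumption


def toy_query_posteriors():
    """Return {query: {demand index: posterior probability}} for the toy scheme."""
    joint = defaultdict(lambda: defaultdict(Fraction))
    # The demand index w is uniform over {0, 1}; the side information is c * X_s, s != w.
    for w, c, r in product((0, 1), NONZERO, NONZERO):
        s = 1 - w
        query = [0, 0]
        query[s] = c  # coefficient of the side-information message
        query[w] = r  # fresh, uniformly random nonzero coefficient for the demand
        joint[tuple(query)][w] += Fraction(1, 2 * len(NONZERO) ** 2)
    posteriors = {}
    for query, by_w in joint.items():
        total = sum(by_w.values())
        posteriors[query] = {w: p / total for w, p in by_w.items()}
    return posteriors


if __name__ == "__main__":
    for query, post in sorted(toy_query_posteriors().items()):
        print(query, post)
```

In this toy scheme every query is a pair of nonzero coefficients, and each pair is equally likely under either demand index, so observing the query does not change the server's belief about the demand.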

Upon receiving , the server sends to the user an answer , which is a (deterministic) function of the query and the messages in , i.e.,

The answer along with the side information must enable the user to retrieve the demand , i.e.,

This condition is referred to as the recoverability condition.

By the privacy and recoverability conditions, it follows that for any , , and any , there exists for some and some such that , and

If there is no such that is recoverable from and , i.e., for all and all such that , then from the server’s perspective, cannot be the user’s demand index, i.e., , and cannot be private.

For each model (I or II), the problem is to design a query and an answer for any , , and that satisfy the privacy and recoverability conditions. We refer to this problem as single-server single-message Private Information Retrieval (PIR) with Coded Side Information (CSI), or PIR-CSI for short. Specifically, we refer to PIR-CSI under the model I as the PIR-CSI–I problem, and PIR-CSI under the model II as the PIR-CSI–II problem.

A collection of and for all , , and such that or , which satisfy the privacy and recoverability conditions, is referred to as a PIR-CSI–I protocol or a PIR-CSI–II protocol, respectively.

The rate of a PIR-CSI–I or PIR-CSI–II protocol is defined as the ratio of the entropy of a message, i.e., , to the average entropy of the answer, i.e., , where the average is taken over all , , and such that or , respectively. That is, for a PIR-CSI–I or PIR-CSI–II protocol, is given by

where the summation is over all , , and such that or , respectively.

The capacity of the PIR-CSI–I or PIR-CSI–II problem, respectively denoted by or , is defined as the supremum of rates over all PIR-CSI–I or PIR-CSI–II protocols, respectively. (The notations and should not be confused with the notation for set .)

In this work, our goal is to characterize and , and to design a PIR-CSI–I protocol that achieves the capacity and a PIR-CSI–II protocol that achieves the capacity .

## III Main Results

In this section, we present our main results. Theorem 1 characterizes the capacity of the PIR-CSI–I problem, , and Theorem 2 characterizes the capacity of the PIR-CSI–II problem, , for different values of and . The proofs of Theorems 1 and 2 are given in Sections IV and V, respectively.

###### Theorem 1.

The capacity of PIR-CSI–I problem with messages and side information size is given by

The proof consists of two parts. In the first part, we lower bound the average entropy of the answer, , or equivalently, upper bound the rate of any PIR-CSI–I protocol. In the second part, we construct a PIR-CSI–I protocol which achieves this rate upper-bound.

###### Theorem 2.

The capacity of PIR-CSI–II problem with messages and side information size is given by

For the case of , the proof is straightforward. In this case, the user has one (and only one) message in its side information, and it demands the same message. A PIR-CSI–II protocol for this case is to send no query and receive no answer. Since the (average) entropy of the answer is zero, the rate of this protocol is infinite, and so is the capacity.

For each of the other cases of , the proof consists of two parts. In the first part, we provide a lower bound on , or equivalently, an upper bound on the rate of any PIR-CSI–II protocol, for each case. In the second part, we construct a PIR-CSI–II protocol for each case which achieves the corresponding upper-bound on the rate.

## IV The PIR-CSI–I Problem

### IV-A Proof of Converse for Theorem 1

###### Lemma 1.

For , .

###### Proof.

Suppose that the user wishes to retrieve for a given , and it knows for given and such that . The user sends to the server a query , and the server responds to the user by an answer . We need to show that is lower bounded by . Since is the average entropy of the answer, it suffices to show that is lower bounded by . The proof proceeds as follows:

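$$\begin{aligned}
H(A) &\geq H(A|Q,Y) \\
&= H(A,X_W|Q,Y) && (1) \\
&= H(X_W|Q,Y) + H(A|Q,X_W,Y) \\
&= H(X_W) + H(A|Q,X_W,Y) && (2)
\end{aligned}$$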

where (1) holds because , and (by the recoverability condition); and (2) holds since is independent of (noting that ), and .

Now, we lower bound . There are two cases: (i) , and (ii) . In the case (i), , and so, . Since , then (by (2)).

In the case (ii), we arbitrarily choose a message, say , from the set of remaining messages, i.e., . By the privacy and recoverability conditions, there exists for some and some such that and . Since conditioning does not increase the entropy, then . Thus,

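$$\begin{aligned}
H(A|Q,X_W,Y) &\geq H(A|Q,X_W,Y,Y_1) \\
&= H(A,X_{W_1}|Q,X_W,Y,Y_1) && (3) \\
&= H(X_{W_1}|Q,X_W,Y,Y_1) + H(A|Q,X_W,X_{W_1},Y,Y_1) \\
&= H(X_{W_1}) + H(A|Q,X_W,X_{W_1},Y,Y_1) && (4)
\end{aligned}$$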

where (3) holds because , and (by the assumption); and (4) follows from the independence of and (noting that ), and .

Let . Similarly as above, it can be shown that for all there exist , , and (and accordingly, ), where , such that , and

Note that for all . Repeating a similar argument as before,

for all . Putting these lower bounds together,

$$H(A|Q,X_W,Y) \geq \sum_{i=1}^{n-1} H(X_{W_i}) + H(A|Q,X_W,X_{W_1},\dots,X_{W_{n-1}},Y,Y_1,\dots,Y_{n-1}),$$

and subsequently,

$$H(A|Q,X_W,Y) \geq \sum_{i=1}^{n-1} H(X_{W_i}) = (n-1) H(X_W)$$

since $H(X_{W_i}) = H(X_W)$ for all $i$. Putting (2) and the last inequality together,

as was to be shown. ∎
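For reference, the way the two bounds in the proof combine can be written out explicitly. The display below assumes that (2) has the form $H(A) \geq H(X_W) + H(A|Q,X_W,Y)$, which is how the later steps of the proof use it; all symbols are those of the proof above.

```latex
\begin{align*}
H(A) &\geq H(X_W) + H(A \mid Q, X_W, Y) \\
     &\geq H(X_W) + (n-1)\,H(X_W) \\
     &= n\,H(X_W).
\end{align*}
```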

### IV-B Proof of Achievability for Theorem 1

In this section, we propose a PIR-CSI–I protocol for arbitrary and .

Assume, without loss of generality (w.l.o.g.), that and .

#### Randomized Partitioning (RP) Protocol

The RP protocol consists of four steps as follows:

Step 1: The user constructs (ordered) sets of indices in , each of size , and (ordered) sets of elements in , each of size .

For constructing , extra indices are required. The procedure of selecting these extra indices is as follows. First, the user randomly chooses two integers and according to a joint pmf given by

where for and , and

for and ; for all and such that , and otherwise; and is the (unique) solution of the equation where the sum is over all and .

If , the user randomly selects indices from and indices from . If , the user selects the index along with and randomly chosen indices from and , respectively. Denote by the set of selected indices from , and by the set of selected indices from , , and . Note that the probability of any specific realization of is given by

Next, the user creates the set , and assigns all indices in to the set (if exists, i.e., ) and the set (if exists, i.e., ). Then, the user assigns randomly selected indices from (or respectively, ) to (or respectively, ). Next, the user randomly partitions all indices in (if any) into the remaining sets (if exist, i.e., ), each of size . Note that the probability of a specific realization of such a partitioning is given by

For constructing , the user creates the set where is chosen from at random, and it creates each of the sets by randomly choosing elements from .

Step 2: The user randomly rearranges the elements of each set and , and constructs for all . The user then reorders by a randomly chosen permutation , and sends to the server the query .

Step 3: By using , the server computes for all where and , and it sends to the user the answer .

Step 4: Upon receiving the answer from the server, the user retrieves by subtracting off the contribution of its side information from .
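Step 4 can be illustrated with a toy computation over a prime field. The sketch below assumes a prime modulus `P = 7`, integer-valued messages, and hypothetical helper names (`linear_combination`, `recover_demand`). It also assumes that the relevant piece of the answer is a linear combination of the demand and the side-information messages, with the same coefficients on the latter as in the side information, which is how the subtraction in Step 4 is read here.

```python
P = 7  # prime field size; an assumption made for this illustration


def linear_combination(coeffs, messages):
    """Evaluate sum_i coeffs[i] * messages[i] over GF(P); coeffs maps index -> coefficient."""
    return sum(c * messages[i] for i, c in coeffs.items()) % P


def recover_demand(answer_piece, side_info, demand_coeff):
    """Subtract the side information, then divide by the coefficient of the demand."""
    inv = pow(demand_coeff, -1, P)  # modular inverse (Python 3.8+)
    return (answer_piece - side_info) * inv % P


if __name__ == "__main__":
    messages = [3, 5, 1, 6]     # X_0..X_3, known only to the server
    side_coeffs = {1: 2, 3: 4}  # side information Y = 2*X_1 + 4*X_3
    w, c_w = 0, 5               # demand index and its (nonzero) query coefficient
    y = linear_combination(side_coeffs, messages)
    piece = linear_combination({w: c_w, **side_coeffs}, messages)
    print(recover_demand(piece, y, c_w) == messages[w])
```

The coefficient of the demand must be nonzero for the inverse to exist, which is why the coefficients are drawn from the multiplicative group of the field.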

###### Lemma 2.

The RP protocol is a PIR-CSI–I protocol, and achieves the rate .

###### Proof.

In the RP protocol (Step 3), the answer consists of pieces of information , where each is a linear combination of messages in . Since are uniformly and independently distributed over and are linearly independent combinations of over , then are uniformly and independently distributed over . That is, , and . Since for all such that , then the average entropy of the answer over all such that , i.e., , is equal to . Thus, the rate of the RP protocol is equal to .
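The uniformity claim used in this rate computation can be checked by brute force on a small instance. The sketch below assumes GF(3), four messages, each a single field symbol, and two hypothetical combinations with linearly independent coefficient vectors; it counts how often each pair of combination values occurs as the messages range over all possibilities.

```python
from collections import Counter
from itertools import product

Q = 3  # field size; GF(3) is an assumption for this illustration


def joint_counts(combos, num_messages):
    """Count occurrences of each tuple of combination values over all message vectors."""
    counts = Counter()
    for x in product(range(Q), repeat=num_messages):
        values = tuple(sum(c * x[i] for i, c in combo.items()) % Q for combo in combos)
        counts[values] += 1
    return counts


if __name__ == "__main__":
    # Two combinations with overlapping supports but linearly independent coefficients.
    combos = [{0: 1, 1: 2, 2: 1}, {2: 2, 3: 1}]
    counts = joint_counts(combos, num_messages=4)
    # If the pair of values is uniform, all Q**2 pairs occur equally often.
    print(len(counts), set(counts.values()))
```

Equal counts for every value pair mean that the two combinations are jointly uniform, so their joint entropy is the sum of the individual entropies.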

From Step 4 of the RP protocol, it should be obvious that the recoverability condition is satisfied. To prove that the RP protocol satisfies the privacy condition, we need to show that for all . Since the RP protocol does not depend on the contents of the messages , then it is sufficient to prove that for all . By the application of the total probability theorem and Bayes’ rule, to show that the RP protocol satisfies the privacy condition, it suffices to show that is the same for all . Since all possible collections are equiprobable, it suffices to show that is the same for all , where .

Each set consists of two disjoint subsets and where is the set of all indices in that belong to no other set . Note that and . Consider an arbitrary . There are two cases: (i) and for some , , and (ii) for some .

In the case (i),

where , , , and , and and are defined as in the protocol. Note that and are equal to . (Note that .) In the case (ii),

where , and . (Note that .) Define

Note that, in the case (i), for some such that and , and in the case (ii), for some such that . Thus, it should not be hard to see that the privacy condition is met so long as the following equations hold:

(5) |

for all such that ;

(6) |

for all and all such that and , and

(7) |

for all such that . By simple algebra, one can verify that for the choice of specified in the protocol, equations (5)-(7) hold, and hence the RP protocol satisfies the privacy condition. This completes the proof. ∎

## V The PIR-CSI–II Problem

### V-A Proof of Converse for Theorem 2

###### Lemma 3.

For , ; for , , and for , .

###### Proof.

Fix an arbitrary and for arbitrary and such that . Let and . Consider a query and an answer .

For the cases of and , it suffices to show that . Note that , where the equality follows from the recoverability condition and the chain rule of entropy, and , where the inequality follows from the independence of and , and the non-negativity of entropy. Putting these arguments together, .

For the cases of , we need to show that . By the argument above,

(8) |

To lower bound , we arbitrarily choose a message, say , such that . By the privacy and recoverability conditions, there exists for some and some such that and . Since conditioning does not increase the entropy, then . Thus,

(9) | ||||

(10) |

where (9) holds because , and (by the assumption); and (10) follows from the chain rule of entropy.

Since , , , and are linear functions, either (i) , i.e., is independent of