Secure Symmetric Private Information Retrieval from Colluding Databases with Adversaries

Qiwen Wang, and Mikael Skoglund
Email: {qiwenw, skoglund}@kth.se
School of Electrical Engineering, KTH Royal Institute of Technology
Abstract

The problem of symmetric private information retrieval (SPIR) from replicated databases with colluding servers and adversaries is studied. Specifically, the database comprises $K$ files, which are replicated at $N$ servers. A user wants to retrieve one file from the database by communicating with the servers, without revealing the identity of the desired file to any server. Furthermore, the user shall learn nothing about the other files in the database. Any $T$ out of $N$ servers may collude, that is, they may communicate their interactions with the user to guess the identity of the requested file. An adversary in the system can tap in on or even try to corrupt the communication. Three types of adversaries are considered: a Byzantine adversary who can overwrite the transmission of any $B$ servers to the user; a passive eavesdropper who can tap in on the incoming and outgoing transmissions of any $E$ servers; and a combination of both, namely an adversary who can tap in on a set of any $E$ nodes and overwrite the transmission of a set of any $B$ nodes. The problems of SPIR with colluding servers and the three types of adversaries are named T-BSPIR, T-ESPIR and T-BESPIR respectively. The capacity of the problem is defined as the maximum number of information bits of the desired file retrieved per downloaded bit. We show that the information-theoretical capacity of the T-BSPIR problem equals $1-\frac{2B+T}{N}$, if the servers share common randomness (unavailable at the user) with amount at least $\frac{2B+T}{N-2B-T}$ times the file size. Otherwise, the capacity equals zero. The information-theoretical capacity of the T-ESPIR problem is proved to equal $1-\frac{\max(T,E)}{N}$, if the servers share common randomness with amount at least $\frac{\max(T,E)}{N-\max(T,E)}$ times the file size. Finally, for the problem of T-BESPIR, the capacity is proved to be $1-\frac{2B+\max(T,E)}{N}$, where the common randomness shared by the servers should be at least $\frac{2B+\max(T,E)}{N-2B-\max(T,E)}$ times the file size. The results resemble those of secure network coding problems with adversaries and eavesdroppers.

I Introduction

When a user wants to retrieve a file from a remotely stored database, the data may be privacy-sensitive (for example, medical records or stock prices), so that the user does not want to reveal the identity of the data retrieved. This is known as the problem of private information retrieval (PIR). In some cases, the privacy of the database also needs to be preserved. For example, if a user wants to retrieve his/her medical data from a database, the user should obtain no information about other users’ medical records. This is known as the problem of symmetric private information retrieval (SPIR).

The problem of PIR was first studied in the computer science community. It is shown that if the database is stored at a single server, the only possible scheme for the user is to download the entire database to guarantee information-theoretic privacy [1, 2], which is inefficient in practice. It is further shown that the communication cost can be reduced to a sublinear scale by replicating the database at multiple non-colluding servers [2]. To further protect the privacy of the database, the problem of SPIR is introduced [3], such that the user obtains no information regarding the database other than the requested file. In [1, 2, 3], the database is modeled as a bit string, and the user wishes to retrieve a single bit. In these works, the communication cost is measured as the sum of the transmissions in the querying phase from user to servers and in the downloading phase from servers to user.

When the file size is large and the goal is to minimize only the communication cost of the downloading phase, the downloading cost is defined as the number of bits downloaded per bit of the retrieved file, and its reciprocal is called the PIR capacity. A series of recent works derives information-theoretic limits of various versions of the PIR problem [4, 5, 6, 7, 8, 9, 10]. The leading work in the area is by Sun and Jafar [4], where the authors find the capacity of the PIR problem with replicated databases. In subsequent works by Sun and Jafar [5, 6], the PIR capacity with duplicated databases and colluding servers, and the SPIR capacity with duplicated (non-colluding) databases are derived. In [7, 8, 9], Banawan and Ulukus find the capacity of the PIR problem with coded databases, multi-message PIR with replicated databases, and the PIR problem with colluding and Byzantine databases. In our previous work [10], we derive the capacity of the SPIR problem with coded databases.

Another series of works focuses more on the coding structure of the storage system, and studies schemes and information limits for various PIR problems with coded databases [11, 12, 13, 14, 15]. In [11], PIR is achieved by downloading one extra bit in addition to the desired file, given that the number of storage nodes grows with the file size, which can be impractical in some storage systems. In [12], the storage overhead is reduced by increasing the number of storage nodes. In [13], the tradeoff between storage cost and downloading cost is analyzed. Subsequently in [14], explicit schemes which match the tradeoff in [13] are presented. It is worth noting that in [7], the capacity of PIR for coded databases is settled, which improves the results in [13, 14]. Recently in [15], the authors present a framework for PIR from coded databases with colluding servers.

In this work, we study the SPIR version of the problem in [9], that is, SPIR from replicated databases with colluding and Byzantine servers. We also study the SPIR problem with a passive eavesdropper, then generalize to the case with an adversary who can both eavesdrop and corrupt the communication. In analogy to previous works on SPIR [6, 10], in the non-trivial context where the database comprises at least two files, the storage nodes need to share common randomness which is independent of the database and unavailable to the user. Furthermore, in the case with an eavesdropper who can tap in on a set of the nodes and is curious about the database, the utility of the shared common randomness is two-fold, in the sense that it also protects the database from the eavesdropper.

Briefly speaking, in this work, we study the SPIR problem with replicated databases, where a database with $K$ files is replicated at $N$ servers. Any $T$ out of the $N$ servers may collude, that is, they may share their communication with the user to infer the identity of the requested file. The communication in the system is not secure, that is, there is an adversary who can tap in on or even corrupt the transmissions in the system. We consider three types of adversaries: a Byzantine adversary who can overwrite the transmission of any $B$ servers to the user, named T-BSPIR; a passive eavesdropper who can tap in on the incoming and outgoing transmissions of any $E$ servers, named T-ESPIR; and a combination of both, namely an adversary who can tap in on a set of any $E$ nodes and overwrite the transmission of a set of any $B$ nodes (the two sets may overlap), named T-BESPIR. We show that the information-theoretical capacity of the T-BSPIR problem equals $1-\frac{2B+T}{N}$, if the servers share common randomness (unavailable at the user) with amount at least $\frac{2B+T}{N-2B-T}$ times the file size. Otherwise, the capacity equals zero. This is presented in Theorem 1. The information-theoretical capacity of the T-ESPIR problem is proved to equal $1-\frac{\max(T,E)}{N}$, if the servers share common randomness with amount at least $\frac{\max(T,E)}{N-\max(T,E)}$ times the file size. This is presented in Theorem 2. Finally in Section VI-A, we show that for the problem of T-BESPIR, the capacity is $1-\frac{2B+\max(T,E)}{N}$, where the common randomness shared by the servers should be at least $\frac{2B+\max(T,E)}{N-2B-\max(T,E)}$ times the file size. The results resemble the capacity of secure network coding with adversaries [16].

II Model

II-A Notations

Let denote the set for . For the sake of brevity, denote the set of random variables by . The transpose of matrix is denoted by .

II-B Problem Description

Database: A database comprises independent files, denoted by , which are replicated at nodes (servers). Each file consists of symbols drawn independently and uniformly from the finite field . Therefore, for any ,

User queries: A user wants to retrieve a file with index from the database, where the desired file index is uniformly distributed among . Let denote a random variable privately generated by the user, which represents the randomness of the query scheme followed by the user. The random variable is generated before the realizations of the messages or the desired file index. Let the realization of the file index be . Based on the realization of the desired file index and the realization of , the user generates and sends queries to all nodes, where the query received by node- is denoted by . Let denote the complete query scheme, namely, the collection of all queries under all cases of desired file index. We have that .

Node common randomness: Let random variable denote the common randomness shared by all nodes, the realization of which is known to all nodes but unavailable to the user. The common randomness is utilized to protect database-privacy (2) below. For any node , a random variable is generated from , which is used in the answer scheme followed by node . Hence, .

Node answers: The nodes generate answers according to the agreed scheme with the user based on the received query , the stored database, and the random variable generated from the common randomness. The answer generated and sent to the user by node is denoted by .

Adversary: Three types of adversaries are considered in this work. The first type is called Byzantine adversaries, who can overwrite the answers of a set of at most nodes, called corrupted nodes, pretending to send answers to the user from the corrupted nodes to confuse the user. The nodes that are not corrupted by the adversary are called authentic nodes. The user has no knowledge of the identity of the corrupted nodes. The answers overwritten and sent to the user are denoted by . We assume the Byzantine adversary is omniscient, that is, the adversary can tap in on all transmissions and corrupt the answers in a worst-case way that confuses the user the most. The model considered in [9], where there are Byzantine adversarial nodes who send arbitrary or worst-case answers to the user, can be considered as a special case where the adversary can only tap in on the transmissions of the nodes chosen to corrupt, i.e. the adversary has less knowledge. (Footnote 1: For zero-error decodability, the knowledge of the adversary does not affect the result, because the adversary could “happen” to generate the worst-case corrupted answers without knowing the transmissions in the system, in which case the communication scheme should still prevent the user from decoding the desired file incorrectly.)

The second type of adversary considered is called passive eavesdroppers, who can tap in on the incoming and outgoing transmissions of nodes in the system. The eavesdropper is “nice but curious”, in the sense that the goal of the eavesdropper is to obtain some information about the database, without corrupting any transmission. The user has no knowledge of the identity of the nodes tapped on by the eavesdropper.

The third type of adversary considered is a combination of the above two types. The adversary can tap in on the incoming and outgoing communications of any set with nodes, and can overwrite the answers of any set with nodes. The two sets may intersect. In this case, the adversary is not omniscient and does not tap in on the nodes that are in but not in .

T-BSPIR and T-ESPIR: Based on the received answers (for the case with Byzantine adversary, we abuse the notation and let ) and the query scheme , the user shall be able to decode the requested file with zero error. Any set of nodes may collude to guess the requested file index, by communicating their interactions with the user. Two privacy constraints must be satisfied:

  • User-privacy: any colluding nodes shall not be able to obtain any information regarding the identity of the requested file, i.e.,

    (1)
  • Database-privacy: the user shall learn no information regarding other files in the database, that is, defining ,

    (2)

For the case with passive eavesdropper and the case with the combination adversary, one more privacy constraint must be satisfied to protect the database from the eavesdropper. For any node set with at most nodes, and for any :

(3)

We use the same definition as in [10] for rate and capacity of T-BSPIR, T-ESPIR and T-BESPIR schemes. (We state only the definitions in terms of T-BSPIR.)

Definition 1.

The rate of a T-BSPIR scheme is the number of information bits of the requested file retrieved per downloaded answer bit. By symmetry among all files, for any ,

The capacity is the supremum of over all T-BSPIR schemes.

Definition 2.

The secrecy rate is the amount of common randomness shared by the storage nodes relative to the file size, that is

III Main Result

III-A T-BSPIR

When there is only one file in the database, i.e. , database-privacy is guaranteed automatically, because there is no other file to protect from the user in the database. Therefore, the T-BSPIR problem reduces to the T-BPIR problem, and from [9], the capacity is if . In fact, when , user-privacy is also trivial, since there is only one file that the user can request. That is why the parameter does not appear in the capacity . Therefore, the condition can be relaxed: if , the capacity of T-BSPIR when is . If , the user cannot successfully retrieve the file regardless of how much information is downloaded, i.e. the capacity is . When , T-BSPIR is non-trivial and our main result is summarized below.

Theorem 1.

For symmetric private information retrieval from a database with files which are replicated at nodes, where any nodes may collude and a Byzantine adversary can corrupt the answers of any nodes, if , the capacity is

Remark: In [9], the authors show that the T-BPIR capacity is . It can be observed that as the number of files tends to infinity, their T-BPIR capacity approaches our T-BSPIR capacity. The intuition is that, when the number of files increases, the penalty in the downloading rate to protect database-privacy decays. When there are asymptotically infinitely many files, the information rate the user can learn about the database from finite downloaded symbols vanishes.
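
The convergence described in this remark can be checked numerically. The sketch below is only an illustration and rests on two assumptions that are not spelled out in the surrounding text: that the T-BPIR capacity of [9] has the form $\frac{N-2B}{N}\cdot\frac{1-\frac{T}{N-2B}}{1-(\frac{T}{N-2B})^{K}}$, and that the T-BSPIR capacity is $1-\frac{2B+T}{N}$. The function names are hypothetical.

```python
from fractions import Fraction


def tbpir_capacity(n, t, b, k):
    """Assumed T-BPIR capacity of [9] for n nodes, t colluding, b Byzantine, k files.

    Requires t < n - 2*b so that the ratio below is strictly less than one.
    """
    effective = n - 2 * b              # nodes left after discounting 2b for error correction
    ratio = Fraction(t, effective)
    return Fraction(effective, n) * (1 - ratio) / (1 - ratio ** k)


def tbspir_capacity(n, t, b):
    """Assumed T-BSPIR capacity; note that it does not depend on the number of files."""
    return 1 - Fraction(2 * b + t, n)


if __name__ == "__main__":
    n, t, b = 10, 2, 1
    for k in (1, 2, 5, 20):
        gap = tbpir_capacity(n, t, b, k) - tbspir_capacity(n, t, b)
        print(k, float(tbpir_capacity(n, t, b, k)), float(gap))
```

Under these assumptions the gap between the two capacities is positive for every finite number of files and shrinks geometrically as the number of files grows.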

III-B T-ESPIR

When there is only one file in the database, the database-privacy and user-privacy become trivial. The only privacy constraint needed to be guaranteed is that the eavesdropper learns no information of the database (3). It can be easily checked that the capacity equals . When , the capacity of T-ESPIR is summarized below.

Theorem 2.

For symmetric private information retrieval from a database with files which are replicated at nodes, where any nodes may collude and an eavesdropper can tap in on the communication of any nodes, the capacity is

IV T-BSPIR

IV-A Achievability

In this section, we present a general scheme which achieves the maximum T-BSPIR rate when the secrecy rate is . The main concepts of the construction are:

  • The queries received by any set of nodes are mutually independent, and are independent of the desired file index . This is achieved by expanding independent query vectors with an -MDS code.

  • Because the answers received from any nodes might be erroneous, the answers are formalized in a form of -MDS code, such that the user can correct up to errors.
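
Both ingredients rely on the MDS property of (generalized) Reed-Solomon codes. As a minimal sketch, assuming a prime field so that arithmetic is plain modular arithmetic (the construction itself only needs a large enough field), the code below builds a GRS generator matrix from code locators and column multipliers and verifies by brute force that every square submatrix of full height is invertible. All function names are hypothetical.

```python
from itertools import combinations


def grs_generator(k, locators, multipliers, p):
    """k x n GRS generator matrix over GF(p): entry (i, j) is v_j * a_j**i mod p."""
    return [[(v * pow(a, i, p)) % p for a, v in zip(locators, multipliers)]
            for i in range(k)]


def det_mod_p(matrix, p):
    """Determinant of a square matrix over GF(p), p prime, by Gaussian elimination."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = 1
    for c in range(size):
        pivot = next((r for r in range(c, size) if m[r][c] % p), None)
        if pivot is None:
            return 0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det
        det = det * m[c][c] % p
        inv = pow(m[c][c], p - 2, p)   # inverse via Fermat's little theorem
        for r in range(c + 1, size):
            factor = m[r][c] * inv % p
            for j in range(c, size):
                m[r][j] = (m[r][j] - factor * m[c][j]) % p
    return det % p


def is_mds(generator, p):
    """True if every k columns of the k x n generator are linearly independent."""
    k, n = len(generator), len(generator[0])
    return all(
        det_mod_p([[generator[i][j] for j in cols] for i in range(k)], p) != 0
        for cols in combinations(range(n), k)
    )


def correctable_errors(n, k):
    """An (n, k) MDS code has minimum distance n - k + 1, hence corrects (n - k) // 2 errors."""
    return (n - k) // 2


if __name__ == "__main__":
    g = grs_generator(2, [1, 2, 3, 4, 5], [1, 1, 1, 1, 1], 7)
    print(is_mds(g, 7), correctable_errors(5, 3))
```

The brute-force check is exponential in the code length and is meant only for toy parameters; distinct nonzero locators and nonzero multipliers already guarantee the MDS property.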

Assume each file comprises symbols from a large enough field . (Footnote 2: The field size should be large enough that the MDS codes used in the construction exist.) Let the vector represent the database, which is stored at each server. The user wants to retrieve privately.

The user generates the queries following the steps below.

Step 1: Generate independent uniformly random vectors of length over . Let the matrix denote .

Step 2: Let denote the length- unit vector where only the th entry is and all the other entries are ’s. The purpose of is to retrieve the th entry of . Let the matrix denote .

Step 3: Let be distinct nonzero elements from . Let be the generating matrix of an -generalized-Reed-Solomon (GRS) code with code locators and all column multipliers equal to . That is,

(4)

Let be the generating matrix of an -GRS code with code locators and column multipliers . That is,

(5)
(6)
(7)

Step 4: Generate the query vectors by

(8)
(9)

The user sends the query vectors generated from equation (9) to the servers.

All the servers share symbols that are uniformly and independently chosen from , which are unavailable to the user. The servers generate their answers by taking the inner product of the query vectors they receive and the stored data vector, then add on a linear combination of . Specifically,

(10)

There are at most servers corrupted by the Byzantine adversary, who generate arbitrary (or even malicious) answers to confuse the user. Assume the Byzantine adversary generates answers of the same size as the authentic servers, i.e. the size the user expects to receive, otherwise the user can easily identify the erroneous answers.

To see that the user can decode successfully, firstly we look at the correct answers. Denote , where . From (9) and (10), . Hence,

(11)

where

(12)

It can be observed that is the generating matrix of an -GRS code with code locators and all column multipliers equal to . Therefore, when at most symbols out of are wrong, the user can still successfully decode , which includes all symbols of .
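
The decoding step can be illustrated on a toy instance. The sketch below assumes, for illustration only, that the answers form a codeword of an MDS code of length N and dimension N − 2B, so that the minimum distance is 2B + 1 and B errors are correctable. It uses a brute-force nearest-codeword search over a small prime field, which is not the decoder of this paper and is feasible only for tiny parameters. The names `encode` and `decode` are hypothetical.

```python
from itertools import product

P = 7                      # small prime field GF(7), chosen for illustration
N, B = 5, 1                # five nodes, at most one corrupted answer
DIM = N - 2 * B            # assumed code dimension, giving minimum distance 2B + 1
LOCATORS = [1, 2, 3, 4, 5]
G = [[pow(a, i, P) for a in LOCATORS] for i in range(DIM)]


def encode(msg):
    """Codeword msg * G over GF(P): one symbol per node."""
    return [sum(msg[i] * G[i][j] for i in range(DIM)) % P for j in range(N)]


def decode(received):
    """Return the message whose codeword is closest to `received` in Hamming distance."""
    def distance(msg):
        return sum(x != y for x, y in zip(encode(msg), received))
    return min(product(range(P), repeat=DIM), key=distance)


message = (3, 0, 6)
answers = encode(message)
corrupted = answers[:]
corrupted[2] = (corrupted[2] + 4) % P   # the adversary overwrites the answer of one node
decoded = decode(corrupted)

if __name__ == "__main__":
    print(decoded == message)
```

Because any two distinct codewords differ in at least 2B + 1 positions, a received word with at most B errors has a unique closest codeword, which is why the search recovers the message without knowing which node was corrupted.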

Database-privacy is guaranteed because, besides , the user solves only the symbols , . Since are independent uniform symbols drawn from , the user can obtain no information about the linear combinations of the database from the ’s.

To see that user-privacy is guaranteed, from (9), because are independent random vectors and is the generating matrix of an -GRS code, any column vectors of are still independent uniform random vectors. Hence, by adding deterministic column vectors of , any query vectors are still independent uniform random vectors, and are independent from the desired file index. Therefore, any colluding nodes cannot infer the desired file index.
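
The user-privacy argument can be verified exhaustively on a toy instance. The sketch below is a simplification under stated assumptions: it works over GF(5), treats a single coordinate of the query vectors (so each node sees one field symbol), and uses a hypothetical `offset` vector to stand in for the deterministic, file-dependent part of the queries. It checks that the joint view of any T nodes has the same distribution for two different offsets.

```python
from collections import Counter
from itertools import combinations, product

P = 5                       # toy prime field
T, N = 2, 4                 # any T = 2 of the N = 4 nodes may collude
LOCATORS = [1, 2, 3, 4]
G = [[pow(a, i, P) for a in LOCATORS] for i in range(T)]   # T x N GRS generator


def colluding_view(offset, nodes):
    """Distribution of (r * G + offset) restricted to `nodes`, for r uniform over GF(P)^T."""
    counts = Counter()
    for r in product(range(P), repeat=T):
        word = [(sum(r[i] * G[i][j] for i in range(T)) + offset[j]) % P for j in range(N)]
        counts[tuple(word[j] for j in nodes)] += 1
    return counts


offset_a = [0, 0, 0, 0]     # hypothetical deterministic part for one requested file
offset_b = [1, 0, 3, 2]     # hypothetical deterministic part for another requested file
views_match = all(
    colluding_view(offset_a, s) == colluding_view(offset_b, s)
    for s in combinations(range(N), T)
)

if __name__ == "__main__":
    print(views_match)
```

Any T columns of the generator form an invertible matrix, so the random part acts as a one-time pad on those T positions and the deterministic offset is hidden from the colluding set.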

To conclude, the rate achieved by this scheme is with secrecy rate , which matches the capacity.

IV-B Converse

In this section, we prove the converse part of Theorem 1. Lemmas 3-5 below are the versions with colluding servers and replicated databases of Lemmas 2-4 in [10] (and Lemmas 1-2 in [6]). Hence we state the lemmas with proof sketches. For any set of nodes that are not corrupted by the adversary, given their received queries, the answers generated by these nodes do not depend on other queries. This is because, besides the received queries, the answers depend only on the database and the shared common randomness, which are independent of the other queries. Lemma 3 below states that this also holds when conditioned on the requested file.

Lemma 3.

For any set of nodes that are not corrupted by the adversary,

Proof: We first show that , as follows

where holds because the answers are deterministic functions of the database, common randomness, and the queries. In the last step, holds because the queries do not depend on the database and common randomness.

On the other hand, it is immediate that . Therefore, .

Lemma 4.

For any set of nodes with size that are not corrupted by the adversary,

(13)
(14)

Proof: The proof is similar to that of Lemma 1 in [6]. We omit the detailed proof here. The key idea is that since any nodes may collude, the statistical distribution of the queries and answers of any nodes shall be the same regardless of the requested file index, even if the nodes condition on a part of the database, for example . Otherwise, the nodes can differentiate between the cases where is requested and is requested.

Lemma 5.

For any set of nodes with size that are not corrupted by the adversary,

Proof: By database-privacy (2), . For any , because , we have

where equality holds because is independent of the queries, and equality follows by (14).

Lemma 6 below states that the user should be able to decode the desired file from any authentic nodes. This is similar to Lemma 4 in [9], developed from the cut-set bound in the network coding problem [17, 18], and the distributed storage problem [19]. The difference between our Lemma 6 and Lemma 4 in [9] is that instead of arguing that the answers from any authentic nodes must be unique for every realization of the database, we argue that this only needs to hold for any realization of the requested file . For different realizations of the database that differ on files other than , the interference may still be the same, hence the user can successfully decode. We reprise the proof of Lemma 4 in [9] with slight modifications for the proof of Lemma 6 below.

Lemma 6.

For any set of authentic nodes where , for correctly decoding , the answers are unique for every realization of . That is, there cannot exist two realizations of the th file, , such that . Consequently, .

Proof: Divide the nodes into two size- sets, denoted by and . The scheme shall allow the user to correctly decode if any nodes in are corrupted by the Byzantine adversary, with any corrupted answers. Consider the following two cases:

  • Case 1: The true realization of the th file is . The user downloads from the authentic nodes in . The nodes in are also authentic and generate the answers . The nodes in are the corrupted nodes, whose answers, overwritten by the adversary, “happen” to be generated with the agreed scheme but with the th file replaced by , denoted by .

  • Case 2: The true realization of the th file is . The user downloads from the authentic nodes in . The nodes in are also authentic and generate the answers . The nodes in are corrupted, and their answers, overwritten by the adversary, “happen” to be generated with the agreed scheme but with the th file replaced by , hence generating .

If , under both cases, the user downloads the same set of answers from all nodes, i.e., . Hence, the user cannot successfully decode whether the th file is or .

In conclusion, for any two different realizations of , the answers from differ. In other words, the user should be able to successfully decode the desired file from the authentic nodes , i.e., .

IV-B1 The proof for

By Lemma 6, let be a set of honest nodes, ,

In step (a), can be any set of nodes in . Step (b) holds by Lemma 3. Steps (c) and (d) follow by Lemma 5 and Lemma 4 respectively.

Averaging over all with size from , we have that

By Han’s inequality [20],

Hence, , where is an honest node.
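
Han's inequality, used in the averaging step above, states that the normalized average entropy of size-k subsets, $\frac{1}{\binom{n}{k}}\sum_{|S|=k}\frac{H(X_S)}{k}$, is non-increasing in k. The proof uses a conditional version; the sketch below checks only the unconditional statement numerically for a randomly drawn joint distribution of three bits. The function names and the choice of three binary variables are assumptions made for illustration.

```python
import itertools
import math
import random


def subset_entropy(pmf, subset):
    """Entropy in bits of the marginal of `pmf` on the coordinates in `subset`."""
    marginal = {}
    for outcome, prob in pmf.items():
        key = tuple(outcome[i] for i in subset)
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(q * math.log2(q) for q in marginal.values() if q > 0)


def han_averages(pmf, n):
    """For k = 1..n, the average of H(X_S) / k over all subsets S of size k."""
    averages = []
    for k in range(1, n + 1):
        subsets = list(itertools.combinations(range(n), k))
        total = sum(subset_entropy(pmf, s) for s in subsets)
        averages.append(total / (len(subsets) * k))
    return averages


random.seed(0)
n = 3
outcomes = list(itertools.product([0, 1], repeat=n))
weights = [random.random() for _ in outcomes]
norm = sum(weights)
pmf = {o: w / norm for o, w in zip(outcomes, weights)}
avgs = han_averages(pmf, n)

if __name__ == "__main__":
    print(avgs)   # expected to be non-increasing by Han's inequality
```

In the converse, the same monotonicity is what allows a bound on the average over subsets of honest nodes to be turned into a bound on the per-node answer entropy.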

Assume that the corrupted nodes send the same amount of information bits to the user, otherwise the user can easily identify the corrupted nodes. Hence, .

IV-B2 The proof for

By database-privacy,

where step (a) follows from Lemma 6 that the user should be able to decode from . In step (b), can be any set of nodes in . Step (c) holds because the authentic answers are deterministic functions of the queries, the database, and the common randomness.

Averaging over all , and from the proof in Section IV-B1 above,

Hence, .

V T-ESPIR

V-A Achievability

Assume each file comprises symbols from a large enough field . Let the vector represent the database, which is stored at each server. The user wants to retrieve privately.

The queries are generated in the following way. The user first generates independent uniformly random vectors of length over . The user chooses an -GRS code with generating matrix . Let denote the length- unit vector where only the th entry is and all the other entries are ’s. Again, the purpose of is to retrieve the th entry of . The query vectors are generated by

(15)

The nodes share symbols , called common randomness, that are uniformly and independently chosen from . The common randomness is unavailable to the user and the eavesdropper. The servers generate their answers by taking the inner product of the query vector and the stored data vector, then add on a linear combination of the common randomness in the following way,

(16)

where denotes the th column of matrix . Let , where , the answers received by the user are

(17)

where we omit the dimension of the zero matrix and the identity matrix because there is no ambiguity. Because is the generating matrix of an -GRS code, the matrix is invertible. Therefore, the user can solve , hence obtain .

To see that database-privacy is guaranteed, note that besides the symbols of , the user solves , where . Because are independent uniform symbols drawn from , the user can obtain no information about the database.

User-privacy is also guaranteed, because from equation (15), every query vectors are independently and uniformly distributed. Hence every nodes see independent and uniformly distributed query vectors, no matter which file the user requests.

To see that the eavesdropper learns no information about the database, recall that the eavesdropper taps in on the queries and answers of nodes. By the MDS property of GRS codes, any columns of are linearly independent. From equation (16), any answers are protected by independent linear combinations of . That is, for any nodes , ’s are statistically independent and uniformly distributed. Hence, from any query and answer pairs, the eavesdropper obtains no information about the database, i.e. (3) is satisfied.
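
The eavesdropper argument can also be checked by enumeration, together with the fact that the protection stops at the stated number of tapped nodes. The sketch below is a simplified toy model under stated assumptions: each answer is a single symbol of GF(5), equal to a database-dependent term plus a linear combination of E common-randomness symbols whose coefficients are the columns of an E x N GRS generator. The vectors `db_one` and `db_two` are hypothetical database-dependent terms.

```python
from collections import Counter
from itertools import combinations, product

P = 5                       # toy prime field
E, N = 2, 4                 # the eavesdropper taps any E = 2 of the N = 4 nodes
LOCATORS = [1, 2, 3, 4]
G = [[pow(a, i, P) for a in LOCATORS] for i in range(E)]   # E x N GRS generator


def tapped_view(data_terms, nodes):
    """Distribution of the answers seen at `nodes`, over uniform common randomness s."""
    counts = Counter()
    for s in product(range(P), repeat=E):
        view = tuple(
            (data_terms[j] + sum(s[i] * G[i][j] for i in range(E))) % P for j in nodes
        )
        counts[view] += 1
    return counts


db_one = [0, 0, 0, 0]       # hypothetical database-dependent terms for one database
db_two = [0, 0, 1, 3]       # hypothetical terms for a different database
hidden_from_e_nodes = all(
    tapped_view(db_one, s) == tapped_view(db_two, s)
    for s in combinations(range(N), E)
)
hidden_from_more_nodes = any(
    tapped_view(db_one, s) == tapped_view(db_two, s)
    for s in combinations(range(N), E + 1)
)

if __name__ == "__main__":
    print(hidden_from_e_nodes, hidden_from_more_nodes)
```

In this toy instance the views on any E nodes are identically distributed for both databases, whereas on every set of E + 1 nodes they differ, which is consistent with the amount of common randomness being matched to the number of tapped nodes.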

V-B Converse

In this section, we prove the converse part of Theorem 2. We also use Lemmas 3-5 in Section IV-B for the proofs below.

V-B1 The proof for

For any file , , and any set of nodes with size ,

where holds because given the queries , the answers of do not depend on other queries. If , by Lemma 5 and Lemma 4, we have that holds; if , from equation (3), , hence also holds.

Averaging over all with size , we have that

By Han’s inequality [20],