On the Asymptotic Capacity of X-Secure T-Private
Information Retrieval with Graph Based Replicated Storage
The problem of private information retrieval with graph-based replicated storage was recently introduced by Raviv, Tamo and Yaakobi. Its capacity remains open in almost all cases. In this work the asymptotic (large number of messages) capacity of this problem is studied along with its generalizations to include arbitrary $T$-privacy and $X$-security constraints, where the privacy of the user must be protected against any set of up to $T$ colluding servers and the security of the stored data must be protected against any set of up to $X$ colluding servers. A general achievable scheme for arbitrary storage patterns is presented that achieves the rate $(\rho_{\min}-X-T)/N$, where $N$ is the total number of servers, and each message is replicated at least $\rho_{\min}$ times. Notably, the scheme makes use of a special structure inspired by dual Generalized Reed Solomon (GRS) codes. A general converse is also presented. The two bounds are shown to match for many settings, including symmetric storage patterns. Finally, the asymptotic capacity is fully characterized for the case without security constraints $(X=0)$ for arbitrary storage patterns provided that each message is replicated no more than $T+2$ times. As an example of this result, consider PIR with arbitrary graph based storage $(T=1, X=0)$ where every message is replicated at exactly $3$ servers. For this $3$-replicated storage setting, the asymptotic capacity is equal to $2/\nu_2(G)$ where $\nu_2(G)$ is the maximum size of a $2$-matching in a storage graph $G[V,E]$. In this undirected graph, the vertices $V$ correspond to the set of $N$ servers, and there is an edge $uv\in E$ between vertices $u,v$ only if a subset of messages is replicated at both servers $u$ and $v$.
As distributed storage systems become increasingly prevalent, there are mounting concerns regarding user privacy and data security. The problem of $X$-secure and $T$-private information retrieval (XSTPIR) deals with both of these issues [Jia_Sun_Jafar_XSTPIR]. In its basic form, private information retrieval (PIR) involves $K$ datasets (messages) that are replicated at $N$ distributed servers, and a user who wishes to retrieve one of these datasets without revealing any information about the identity of his desired dataset to any of the servers [PIRfirst, PIRfirstjournal]. XSTPIR is a generalization of PIR where the stored data must remain secure as long as the number of colluding servers is not more than $X$, and the user’s privacy must be preserved as long as the number of colluding servers is not more than $T$ [Jia_Sun_Jafar_XSTPIR]. The rate of a PIR scheme is the number of bits of the desired message that are retrieved per bit of total download from all servers. The supremum of achievable rates is called the capacity of PIR [Sun_Jafar_PIR].
The capacity of the basic PIR setting was characterized in [Sun_Jafar_PIR] for arbitrary number of messages replicated across arbitrary number of servers. Following in the footsteps of [Sun_Jafar_PIR] there has been a wave of new results exploring the fundamental limits of PIR under a variety of constraints. This includes PIR with -privacy and replicated storage [Sun_Jafar_TPIR], PIR with MDS coded storage [Tajeddine_Rouayheb, Banawan_Ulukus], PIR with optimal storage and upload cost [Tian_Sun_Chen_Upload], PIR with arbitrary message lengths [Sun_Jafar_PIRL], PIR with restricted collusion patterns [Tajeddine_Gnilke_Karpuk_Etal, Jia_Sun_Jafar], PIR with -privacy and MDS coded storage [FREIJ_HOLLANTI, Sun_Jafar_MDSTPIR], multi-message PIR [Banawan_Ulukus_Multimessage], PIR with asymmetric traffic constraints [Banawan_Ulukus_Traffic], multi-round PIR [Sun_Jafar_MPIR], cache-aided and otherwise storage-constrained PIR [Tandon_CPIR, Wei_Banawan_Ulukus], PIR with side-information [Kadhe_Garcia_Heidarzadeh_Rouayheb_Sprintson, Chen_Wang_Jafar], PIR for computation [Sun_Jafar_PC, Mirmohseni_Maddah, obead2018achievable, David_Karpuk], PIR for security against eavesdroppers [Banawan_Ulukus_Asymmetric, Wang_Sun_Skoglund], PIR with Byzantine adversaries [Banawan_Ulukus_Byzantine, Zhang_Ge_Variant, Tajeddine_Gnilke_Karpuk_Hollanti], symmetrically secure PIR [Sun_Jafar_SPIR, Wang_Skoglund_TSPIR, Wang_Skoglund_SPIRAd], and PIR with secure storage [Yang_Shin_Lee, Jia_Sun_Jafar_XSTPIR].
Most relevant to this work is the recent characterization in [Jia_Sun_Jafar_XSTPIR] of the asymptotic ($K\rightarrow\infty$) capacity of XSTPIR as $1-\frac{X+T}{N}$. Note that the XSTPIR setting includes as a special case the TPIR setting, obtained by setting $X=0$, as well as the original PIR setting, obtained by setting $X=0$ and $T=1$. It is limited, however, by its assumption of fully replicated storage, i.e., all messages are stored by all servers, which can be burdensome for large data sets. Motivated by the preference for simple storage, Raviv, Tamo and Yaakobi in [Raviv_Tamo_Yaakobi] introduced a graph based replicated storage model. Instead of full replication where every message is replicated at every server, graph based replication assumes that each message is replicated only among a subset of servers. This allows a graph representation where the vertices are the servers and each message is represented by a hyperedge comprised of vertices (servers) where this message is replicated. Reference [Raviv_Tamo_Yaakobi] primarily focuses on GTPIR, i.e., PIR with graph based replicated storage and $T$-privacy. An achievable scheme is proposed that achieves the rate as long as $T$ is smaller than the replication factor of each message (the number of servers where the message is replicated), and is shown to be within a factor of from optimality for some special cases. However, optimal GTPIR schemes remain unknown in almost all settings. Understanding the key ideas that constitute optimal PIR schemes under graph based replicated storage is our goal in this paper.
The main contributions of this work are as follows. We study the asymptotic capacity of $T$-private and $X$-secure PIR with graph-based replicated storage, in short GXSTPIR. Recall that asymptotic capacity is quite meaningful for PIR because the number of messages is typically large, and the convergence of capacity to its asymptotic value tends to take place quite rapidly [Jia_Sun_Jafar_XSTPIR]. GXSTPIR includes as special cases the settings of GTPIR [Raviv_Tamo_Yaakobi], XSTPIR [Jia_Sun_Jafar_XSTPIR], TPIR [Sun_Jafar_TPIR] and basic PIR [Sun_Jafar_PIR], and as such it presents a unified view of these settings. Our first result is an achievable scheme for GXSTPIR that achieves the rate $(\rho_{\min}-X-T)/N$ for arbitrary storage patterns provided every message is replicated at least $\rho_{\min}$ times. In addition to ideas like cross-subspace alignment, Reed-Solomon (RS) coded storage and RS coded queries that were previously used for XSTPIR [Jia_Sun_Jafar_XSTPIR], a key novelty of our achievable scheme for GXSTPIR is how it creates and takes advantage of a structure inspired by dual Generalized Reed Solomon (GRS) codes. This is explained intuitively in Section 3.2. Our second contribution is a general converse bound for the asymptotic capacity of GXSTPIR with arbitrary storage patterns. While the asymptotic capacity of GXSTPIR remains open in general, it is remarkable that our converse bound is tight in all settings where we are able to settle the capacity. In particular, the general achievable scheme matches the converse bound when the storage is symmetric, settling the asymptotic capacity for those settings. For several examples with asymmetric storage, it turns out that the achievable scheme can be improved to match the converse bound by applying it only after eliminating certain redundant servers. Thus, the asymptotic capacity for such cases is settled as well.
In general, however, with arbitrary graph based storage, more sophisticated achievable schemes may be obtained by combining our achievable scheme with ideas from private computation [Sun_Jafar_PC]. To illustrate this, we consider the GTPIR problem $(X=0)$ where every message is replicated no more than $T+2$ times. As our final result, for this problem we fully settle the asymptotic capacity for arbitrary storage patterns. The asymptotic capacity depends strongly on the storage graph structure, and requires a private computation scheme on top of our general achievable scheme. As an example of this result, consider GPIR, i.e., PIR with arbitrary graph based storage $(T=1, X=0)$ where every message is replicated at exactly $3$ servers. For this $3$-replicated storage setting, the asymptotic capacity is exactly equal to $2/\nu_2(G)$ where $\nu_2(G)$ is the maximum size of a $2$-matching in a storage graph $G[V,E]$. In this storage graph, the vertices $V$ correspond to the set of $N$ servers, and there is an edge $uv\in E$ between vertices $u,v$ only if a subset of messages is replicated at both servers $u$ and $v$. This is consistent with the intuition that storage graph properties must be essential to the asymptotic capacity of graph-based storage.
Notation: For a positive integer the notation denotes the set . The notation stands for the set . Similarly, for an index set , denotes the set . If is a set of random variables, then by we denote the joint entropy of those random variables. Mutual informations between sets of random variables are similarly defined. For tuples such as we allow set theoretic notions of inclusion. For example, denotes the relationship . Similarly, denotes . The notation is used to indicate that and are identically distributed. When a natural number, say , is used to represent an element of a finite field , it denotes the sum of ones in , i.e., , where the addition is over .
2 Problem Statement
We begin with a description of messages and storage structure. Based on the storage structure we will partition the set of messages into subsets so that the messages in the same subset have the same storage structure. Define where , is comprised of messages,
Messages are independent, and each message is composed of i.i.d. uniform symbols from , i.e.,
in -ary units. There are a total of servers. Corresponding to , let us define
where contains the servers, that store the set of messages . Without loss of generality we will assume that the servers are listed in increasing order in each tuple . The cardinality of is , which will be referred to as the replication factor for the messages in . The minimum replication factor is defined as
It is important to note that the messages may not be directly replicated at the servers. Because of security constraints, each message , is represented by a total of shares (the nomenclature comes from secret-sharing), denoted , such that the share is stored at Server , for all . Messages are independently secured and must be recoverable from their shares, as specified by the following constraints.
The information stored at Server is defined as
Let us also define the index set of that are stored at Server , as
For example, suppose we have message sets (each comprised of messages), stored at servers as shown.
Then for this example,[1] we have,
[1] Incidentally, our results will show that as , for this example , and Server is redundant.
The -secure constraint, , requires that any (or fewer) colluding servers learn nothing about the messages.
represents the setting without security constraints. If , then no secret sharing is needed, so each share of a message is the message itself,
This completes the description of the messages and the storage at the servers. Next, let us describe the private information retrieval aspect.
The user desires the message , where the indices and are chosen privately and uniformly by the user from , respectively. In order to retrieve his desired message, the user generates queries, , and sends the query, to the -th server. The user has no prior knowledge of the message realizations,
A -private scheme, , requires that any (or fewer) colluding servers learn nothing about .
Upon receiving the query , the -th server generates an answer string , which is a function of the query and its stored information .
The correctness constraint guarantees that from all the answers, the user is able to decode the desired message ,
The rate of a GXSTPIR scheme is defined by the number of -ary symbols of desired message that are retrieved per downloaded -ary symbol,
where is the expected total number of -ary symbols downloaded by the user from all servers. The capacity of GXSTPIR, denoted as , is the supremum of across all feasible schemes. In this work we are interested in the setting where each subset of messages is comprised of a large number of messages. Specifically, we wish to characterize the asymptotic capacity, as for all . In order to have approach infinity together for all , let us define,
so that are fixed constants, while approaches infinity. Then the asymptotic capacity is defined as
Note that the number of message sets, , and the storage pattern remain unchanged, while , i.e., the number of messages in each approaches infinity.
Our first result is a general achievability argument that provides us a lower bound on the asymptotic capacity of GXSTPIR.
The asymptotic capacity of GXSTPIR is bounded below as follows,
The proof of Theorem 1 appears in Section LABEL:proof:ach. An interesting aspect of the proof is the use of a structure inspired by dual GRS codes, which is explained intuitively in Section 3.2. Another interesting aspect of Theorem 1 is that applying it to a subset of servers (by eliminating the rest) may produce a higher achievable rate than if all servers were used. Therefore, in order to find the best achievable rate guaranteed by Theorem 1, we must choose the best subset of servers. Example 4 in Section 3.1 illustrates this idea.
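The search over server subsets described above can be sketched by brute force. The snippet below is only an illustration under stated assumptions: it assumes the rate guaranteed by Theorem 1 has the form (rho_min - X - T)/N, evaluated on the servers that are kept, and the function names and the five-server storage pattern are hypothetical and are not taken from the paper's examples.

```python
from fractions import Fraction
from itertools import combinations

def theorem1_rate(storage, servers, X, T):
    """Assumed Theorem 1 rate (rho_min - X - T) / N on the kept servers."""
    kept = set(servers)
    # Replication factor of each message set, counting only kept servers.
    rho_min = min(len(set(r) & kept) for r in storage)
    return max(Fraction(0), Fraction(rho_min - X - T, len(kept)))

def best_rate_over_subsets(storage, num_servers, X, T):
    """Try every non-empty subset of servers; return (best rate, subset)."""
    all_servers = range(1, num_servers + 1)
    best = (Fraction(0), ())
    for size in range(1, num_servers + 1):
        for kept in combinations(all_servers, size):
            rate = theorem1_rate(storage, kept, X, T)
            if rate > best[0]:
                best = (rate, kept)
    return best

# Hypothetical pattern: two message sets, five servers, X = 0, T = 1.
storage = [(1, 2, 3, 4), (1, 2, 3, 5)]
rate_all = theorem1_rate(storage, range(1, 6), X=0, T=1)
rate_best, kept_best = best_rate_over_subsets(storage, 5, X=0, T=1)
print(rate_all, rate_best, kept_best)
```

In this hypothetical pattern, dropping Servers 4 and 5 lowers the minimum replication factor from 4 to 3 but also lowers the server count from 5 to 3, which is why the reduced system gives the larger value. The exhaustive search is exponential in the number of servers and is meant only for small illustrations.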
Our next result is a converse argument that holds for arbitrary storage patterns. Recall that is the normalized download from Server .
The asymptotic capacity of GXSTPIR is bounded above as follows,
and is defined as
The proof of Theorem 2 appears in Section LABEL:proof:converse. Since the asymptotic capacity is zero for , in the remainder of this section we will assume that .
Remark: Note that (27) implies that the total normalized download from any servers in must be at least . A simple averaging argument implies that the total normalized download from all servers in any must be at least .
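The averaging step in the remark is a counting argument that holds for generic values. Suppose every group of k servers out of rho must jointly download at least c (the symbols k, rho and c are placeholders here, since the specific values in the remark did not survive extraction). Each server belongs to the same number of k-subsets, so summing all subset constraints gives a total download of at least c*rho/k. A minimal sketch with hypothetical numbers:

```python
from fractions import Fraction
from itertools import combinations

def averaging_bound(rho, k, c=1):
    """Lower bound c * rho / k on the total download of all rho servers."""
    return Fraction(c * rho, k)

def satisfies_subset_constraints(downloads, k, c=1):
    """True if every k-subset of the per-server downloads sums to >= c."""
    return all(sum(group) >= c for group in combinations(downloads, k))

# Hypothetical case: 3 servers, any 2 must jointly download at least 1.
downloads = [Fraction(1, 2)] * 3
feasible = satisfies_subset_constraints(downloads, k=2, c=1)
total = sum(downloads)
bound = averaging_bound(rho=3, k=2, c=1)
print(feasible, total, bound)
```

The uniform allocation in this example meets the averaged bound with equality, so the bound cannot be improved using only the subset constraints of a single replication set.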
The general lower bound in Theorem 1 is in closed form and the general upper bound in Theorem 2 is essentially a linear program, so for arbitrary settings it is possible to evaluate both to check if they match (provided the parameter values are not too large to be computationally feasible). Conceptually, the condition for them to match may be understood as follows. Consider a hypergraph with the set of vertices representing the servers, and the set of hyperedges such that if and only if such that and . For this graph, hyperedges , with corresponding weights , are said to form a fractional matching if for every vertex the total weight of the edges that include is less than or equal to . The largest possible total weight of a fractional matching is called the fractional matching number of [Schrijver]. As shown in Lemma LABEL:lemma:frac in Appendix LABEL:app:lemmas, the optimal converse bound from Theorem 2 on the total normalized download, i.e., is equal to the fractional matching number of . Thus, the following corollary immediately follows.
Next let us identify some interesting special cases of Corollary 1.
Let be a collection of the sets . We define to be an exact -cover of if for all , and every element of is contained in exactly sets in . It follows that the asymptotic capacity if there exists an exact -cover for some . This is easily seen because for each in we have the bound according to (27). Adding all these bounds we obtain the desired converse bound , i.e., , which is achievable according to Theorem 1.
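The multiplicity condition in the definition of an exact cover is easy to check mechanically. The following sketch tests only the requirement that every element of the ground set lies in exactly k sets of the collection; the additional condition on the individual sets stated in the definition is not reproduced here, and the function name and sample collections are hypothetical.

```python
from collections import Counter

def is_exact_k_cover(collection, ground_set, k):
    """True if each element of ground_set is in exactly k sets of collection."""
    ground = set(ground_set)
    # Count, for every element, how many sets of the collection contain it.
    counts = Counter(x for s in collection for x in set(s))
    if not set(counts) <= ground:
        return False  # some set contains an element outside the ground set
    return all(counts[x] == k for x in ground)

servers = {1, 2, 3, 4}
print(is_exact_k_cover([(1, 2), (3, 4)], servers, 1))
print(is_exact_k_cover([(1, 2), (2, 3), (3, 4), (4, 1)], servers, 2))
print(is_exact_k_cover([(1, 2), (2, 3)], servers, 1))
```

The second collection is the cyclic pattern on four elements, in the spirit of the symmetric settings discussed in the text, where consecutive windows of indices taken modulo the number of servers cover every server equally often.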
As a special case that is of particular interest, define a symmetric storage setting as one where (after some permutation of message and server indices) for all , . Here, and server indices are interpreted modulo , e.g., Server is the same as Server . Furthermore, is an integer value. Then any symmetric storage setting thus defined has asymptotic capacity because the storage sets form an exact -cover.
Based on these observations, here are some examples of storage patterns where the asymptotic capacity is .
which is a symmetric storage setting (forms an exact cover).
which is a symmetric storage setting (forms an exact -cover).
because it forms an exact cover.
for arbitrary because it contains an exact -cover, .
because it contains an exact -cover of in .
While the existence of an exact -cover for some positive integer is sufficient to guarantee that the asymptotic capacity is , it is not a necessary condition. Examples 1 and 2 in Section 3.1 show such settings.
On the other hand, it is also easy to see that the lower bound of Theorem 1 and the upper bound of Theorem 2 do not always match. Remarkably, in all such cases that we have been able to settle so far, it is the upper bound that is tight, and the achievability that needs to be improved. In many cases, such as Example 4 in Section 3.1, an improved achievability result is found easily by eliminating a redundant server before applying Theorem 1. However, more sophisticated achievable schemes may be required in general.
Our final result emphasizes this point by settling the asymptotic capacity of GTPIR, i.e., -private information retrieval with arbitrary graph based storage and no security constraints , provided each message is replicated no more than times. Because this result deals with arbitrary storage patterns, for its precise statement we will need the following definitions that follow the convention of Schrijver [Schrijver].
Define as a simple undirected graph with vertices corresponding to the servers, and with edges if and only if for some .
A set is called a stable set (also called independent set) if there are no edges between any two members of .
For , define as the set of vertices in that are neighbors of vertices in .
Define as the set of edges incident with vertex .
A function is denoted as a vector . A function is similarly denoted as a vector . The size of a vector is defined as the sum of its entries.
For any , and , define .
A -matching in is defined as a vector satisfying for each vertex . The maximum size of a -matching in is defined as .
Define as the set of servers that do not store any messages that are replicated fewer than times.
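For small graphs, the maximum size of a b-matching as defined above can be found by exhaustive search. The sketch below is a hypothetical helper, not part of the paper: it enumerates non-negative integer edge weights and keeps the heaviest assignment in which the total weight on the edges at each vertex does not exceed that vertex's bound.

```python
from itertools import product

def max_b_matching(vertices, edges, b):
    """Brute-force maximum b-matching; returns (size, edge weights)."""
    best_size, best_x = 0, tuple(0 for _ in edges)
    # An edge weight can never exceed the bound of either endpoint.
    ranges = [range(min(b[u], b[v]) + 1) for u, v in edges]
    for x in product(*ranges):
        load = {v: 0 for v in vertices}
        for (u, v), x_e in zip(edges, x):
            load[u] += x_e
            load[v] += x_e
        feasible = all(load[v] <= b[v] for v in vertices)
        if feasible and sum(x) > best_size:
            best_size, best_x = sum(x), x
    return best_size, best_x

# Triangle on three servers with b = 2 at every vertex (a 2-matching).
triangle = [(1, 2), (2, 3), (1, 3)]
size_triangle, _ = max_b_matching([1, 2, 3], triangle, {1: 2, 2: 2, 3: 2})
# Path 1 - 2 - 3 with the same bounds.
size_path, _ = max_b_matching([1, 2, 3], [(1, 2), (2, 3)], {1: 2, 2: 2, 3: 2})
print(size_triangle, size_path)
```

The search is exponential in the number of edges, so it is suitable only for checking small examples by hand. Polynomial-time algorithms for b-matching exist and are treated in Schrijver's text.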
It is worthwhile to recall that from basic results in graph theory (see Chapter 30, Section 30.1 of Schrijver [Schrijver]), it is known that
With this we are ready to state our final result.
The asymptotic capacity of GTPIR with for all , i.e., when each message set is replicated no more than times, is
The proof of Theorem 3 appears in Section LABEL:proof:GTPIR. While the converse bound for Theorem 3 follows directly from the general converse bound in Theorem 2, the achievability goes beyond the scheme of Theorem 1, to involve a limited generalization to private computation that is presented in Section LABEL:sec:GTPC. As an interesting special case of Theorem 3, note that if all messages are replicated, i.e., is an empty set, then the asymptotic capacity is exactly .
Let us consider a few more examples to illustrate our results. For these examples we set for simplicity, but similar examples are easily constructed for as well.
Consider message sets, stored at servers according to the replication pattern , , . Since every message is -replicated, according to Theorem 1 we have . For the converse we note that , , , and adding these bounds gives us . Thus we have for this example. Note that this example does not contain an exact -cover for any positive integer , but the asymptotic capacity for this example is still .
Consider message sets stored at servers according to the replication pattern , so that every message is -replicated, but the storage is not symmetric, nor does it contain an exact -cover. For the converse we note that ; ; ; and combining these bounds gives us the converse bound as . Since , Theorem 1 shows that the rate is achievable, so that for this example.
Consider message sets stored at servers according to the replication pattern , so that messages in are -replicated while those in are only -replicated. For the converse we note that ; while . Adding them up we have the bound , which gives us the converse bound . Since , the lower bound from Theorem 1 is also , so that for this example. Note that we could eliminate any one element from so that messages in are also only -replicated, but that would not change the asymptotic capacity. Or we could add one more element to so that messages in are replicated at every server, and that would also not change the capacity. Thus, this example illustrates redundant storage.
Consider message sets stored at servers according to the replication pattern , , so that each message is -replicated. The converse from Theorem 2 says , but since , Theorem 1 applied directly only proves the achievability of rate which does not match the converse bound. However, note that if we eliminate Server and Server , then we are left with the same[2] message sets stored at servers according to the replication pattern , for which , and Theorem 1 shows that the rate is achievable, which indeed matches the converse bound. Thus, the asymptotic capacity for this example is . The example shows that achievable rates may be improved by eliminating redundant servers.
[2] Note that while some servers may be eliminated (i.e., not used) by an achievable scheme, the message sets cannot be reduced because the achievable scheme must still work for all messages.
Consider message sets stored at servers according to the storage pattern , so that messages in are -replicated, while messages in are -replicated, and . The achievable scheme from Theorem 1 achieves a rate , however Theorem 3 builds upon that scheme to achieve the rate which also matches the converse. Thus, for this setting, the capacity is settled by Theorem 3 as .
Consider message sets stored at servers according to the storage pattern . The capacity for this case is settled by Theorem 3 as . To explicitly see the converse bound, note that in (27) ; ; and . Adding these bounds we have , which implies that asymptotically the total normalized download and the converse bound follows. The graph representation for this setting, is shown in Figure 1. Vertices in are shown with a red border, while vertices in are shown with a black border. The maximum size of a -matching on is , corresponding to the edges shown in red. Alternatively, it corresponds to the choice of in (29). Note that while has neighbors in , i.e., , it has only neighbor in , i.e., . Therefore, . Achievability follows by the scheme presented in the proof of Theorem 3, downloading a symbol from each of , and downloading another symbol from each of according to a private computation scheme described in Section LABEL:sec:GTPC, for a total download of symbols from which desired symbols are retrieved.
3.2 Solution Structure inspired by Dual GRS Codes
The most interesting aspect of the achievable scheme in Theorem 1 is a generalized query and storage structure that is inspired by dual GRS codes. Since the storage and query structure for XSTPIR in [Jia_Sun_Jafar_XSTPIR] was based on RS codes, the generalization to GRS code structure for GXSTPIR is somewhat serendipitous (note that the G in GRS codes is not automatically associated with the G in GXSTPIR, which stands for Graph based replicated storage). It is also surprisingly effective, as explained intuitively in this section.
Before discussing how GRS codes are a part of the solution, let us illustrate the nature of the problem with a simple example. Let us consider a very basic setting, where we have subsets of messages, servers, and , we have , i.e., messages in are stored at all servers except Server . Let be four vectors in , each of size , such that the vector has a zero in its coordinate (reflecting the fact that messages in are not stored at Server ) and all other coordinates are non-zero. Then, as we will explain shortly, the rank of the matrix reflects the number of dimensions occupied by interference, i.e., downloaded symbols that are undesired. For example, suppose we are operating in and we choose,
which has rank . Then this choice corresponds to a scheme where interference occupies out of the dimensions, leaving the remaining dimensions available for retrieving desired message symbols. To see this explicitly, suppose each message is comprised of symbols, in , and the user desires the message . The download from the server is the row of the following vector.
The vectors are two vectors, called demand vectors that help retrieve the desired message symbols. To preserve privacy, the demand vectors must also have zeros in the coordinates where has zeros. The random variables are i.i.d. uniform noise terms added to hide the demand vectors contained in the query sent to each server, thus ensuring privacy of user’s demand. The demand vectors, which carry the desired message symbols must be linearly independent of which carry only interference. To retrieve his desired message, the user projects into the dimensional null space of , where all interference disappears and only the two desired signal dimensions remain, from which the desired symbols are retrieved. The rate achieved by this scheme is which is also the asymptotic capacity for this setting (converse follows from Theorem 2).
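The rank count in this example can be reproduced numerically. The sketch below does not use the matrix from the paper, which did not survive extraction. Instead it builds a hypothetical 4 x 4 matrix over GF(5) with entries alpha_n - alpha_m for distinct alpha values, which has zeros exactly on the diagonal (matching the rule that messages in the m-th set are absent from Server m) and rank 2. The field size, the alpha values and the helper name are all assumptions made for illustration.

```python
def rank_mod_p(matrix, p):
    """Rank of an integer matrix over GF(p), p prime, by Gaussian elimination."""
    rows = [[x % p for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], p - 2, p)  # inverse via Fermat's little theorem
        rows[rank] = [(x * inv) % p for x in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(a - factor * b) % p
                           for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

p = 5
alphas = [1, 2, 3, 4]  # hypothetical distinct field elements, one per server
# Column m plays the role of the m-th interference vector; row n is Server n.
V = [[(a_n - a_m) % p for a_m in alphas] for a_n in alphas]
interference_dims = rank_mod_p(V, p)
signal_dims = len(alphas) - interference_dims
print(V, interference_dims, signal_dims)
```

Every row of this matrix is a combination of the all-ones vector and the vector of alpha values, which is why the rank cannot exceed 2 regardless of how many columns are added. With 2 of the 4 downloaded dimensions free of interference, the sketch is consistent with the two desired signal dimensions described in the text.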
From this example, it is clear that the problem is related to min-rank of the matrix subject to constraints on which terms take zero or non-zero values. These constraints are affected not only by the given storage structure, but also by the possibility of redundant servers[3] as well as privacy and correctness constraints, e.g., because demand vectors must share the same structure to ensure privacy. Evidently, PIR with graph based storage is connected to other problems such as index coding, where also min-rank is important [Birk_Kol]. For arbitrary storage patterns such min-rank problems can be difficult to solve in general. However, now let us consider what happens if every message is replicated the same number of times, for all . As will be shown in the proof of Theorem 3, even if replication factors vary across messages, schemes for such settings may use the constant-replication-factor schemes as their essential building blocks. Thus, the constant-replication-factor setting is of fundamental significance. It is also the setting where we exploit the structure of dual GRS codes.
[3] As illustrated by examples in Section 3.1, the solution may be further optimized on storage structure by ignoring redundant storage.
For simplicity we will only consider a setting with and . Consider such a setting with an arbitrary number of message sets , with servers, constant-replication-factor , and an arbitrary storage pattern reflected in the structure of the following matrix.