Private Information Retrieval From a Cellular Network With Caching at the Edge
Abstract
We consider the problem of downloading content from a cellular network where content is cached at the wireless edge while achieving privacy. In particular, we consider private information retrieval (PIR) of content from a library of files, i.e., the user wishes to download a file and does not want the network to learn any information about which file she is interested in. To reduce the backhaul usage, content is cached at the wireless edge in a number of small-cell base stations (SBSs) using maximum distance separable (MDS) codes. We propose a PIR scheme for this scenario that achieves privacy against a number of spy SBSs that (possibly) collaborate. The proposed PIR scheme is an extension of a recently introduced scheme by Kumar et al. to the case of multiple code rates, suitable for the scenario where files have different popularities. We then derive the backhaul rate and optimize the content placement to minimize it. We prove that uniform content placement is optimal, i.e., all files that are cached should be stored using the same code rate. This is in contrast to the case where no PIR is required. Furthermore, we show numerically that popular content placement is optimal for some scenarios.
I Introduction
Bringing content closer to the end user in wireless networks, the so-called caching at the wireless edge, has emerged as a promising technique to reduce the backhaul usage. The literature on wireless caching is vast. Information-theoretic aspects of caching were studied in [1, 2]. To leverage the potential gains of caching, several papers proposed to cache files in densely deployed small-cell base stations (SBSs) with large storage capacity, see, e.g., [3, 4, 5, 6, 7]. In [5], content is cached in SBSs using maximum distance separable (MDS) codes to reduce the download delay. This scenario was further studied in [7], where the authors optimized the MDS-coded caching to minimize the backhaul rate. Caching content directly in the mobile devices and exploiting device-to-device communication has been considered in, e.g., [8, 9, 10, 11, 12].
Recently, private information retrieval (PIR) has attracted significant interest in the research community [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. In PIR, a user would like to retrieve data from a distributed storage system (DSS) in the presence of spy nodes, without revealing to the spy nodes any information about the piece of data she is interested in. PIR was first studied by Chor et al. [24] for the case where a binary database is replicated among servers (nodes) and the aim is to privately retrieve a single bit from the database in the presence of a single spy node (referred to as the noncolluding case), while minimizing the total communication cost. In the last few years, spurred by the rise of DSSs, research on PIR has been focusing on the more general case where data is stored using a storage code.
The PIR capacity, i.e., the maximum achievable PIR rate, was studied in [18, 19, 21, 22, 23]. In [19, 23], the PIR capacity was derived for the scenario where data is stored in a DSS using a repetition code. In [22], for the noncolluding case, the authors derived the PIR capacity for the scenario where data is stored using a (single) MDS code, referred to as the MDS-PIR capacity. For the case where several spy nodes collaborate with each other, referred to as the colluding case, the MDS-PIR capacity is in general still unknown, except for some special cases [18] (and for repetition codes [23]). PIR protocols for DSSs have been proposed in [14, 16, 17, 20, 21]. In [16], a PIR protocol for MDS-coded DSSs was proposed and shown to achieve the MDS-PIR capacity for the case of noncolluding nodes when the number of files stored in the DSS goes to infinity. PIR protocols for the case where data is stored using non-MDS codes were proposed in [17, 20, 21].
In this paper, we consider PIR of content from a cellular network. In particular, we consider the private retrieval of content from a library of files that have different popularities. We consider a scenario similar to that in [7] where, to reduce the backhaul usage, content is cached in SBSs using MDS codes. We propose a PIR scheme for this scenario that achieves privacy against a number of spy SBSs that possibly collude. The proposed PIR scheme is an extension of Protocol 3 in [21] to the case of multiple code rates, suitable for the scenario where files have different popularities. We also propose an MDS-coded content placement that differs slightly from the one in [7] but is better suited to the PIR case. We show that, for the conventional content retrieval scenario with no privacy, the proposed content placement is equivalent to the one in [7], in the sense that it yields the same average backhaul rate. We then derive the backhaul rate for the PIR case as a function of the content placement. We prove that uniform content placement, i.e., all files that are cached are encoded with the same code rate, is optimal. This is a somewhat surprising result, in contrast to the case where no PIR is considered, where the optimal content placement is far from uniform [7]. We further consider the minimization of a weighted sum of the backhaul rate and the communication rate from the SBSs, relevant for the case where limiting the communication from the SBSs is also important. We finally report numerical results both for the scenario where SBSs are placed regularly in a grid and for a Poisson point process (PPP) deployment model where SBSs are distributed over the plane according to a PPP. We show numerically that popular content placement is optimal for some system parameters. To the best of our knowledge, PIR for the wireless caching scenario has not been considered before.
Notation: We use lower case bold letters to denote vectors, upper case bold letters to denote matrices, and calligraphic upper case letters to denote sets. For example, , , and denote a vector, a matrix, and a set, respectively. We denote a submatrix of that is restricted in columns by the set by . will denote a linear code over the finite field . The multiplicative subgroup of (not containing the zero element) is denoted by . We use the customary code parameters to denote a code of blocklength and dimension . A generator matrix for will be denoted by and a parity-check matrix by . A set of coordinates of , , of size is said to be an information set if and only if is invertible. The Hadamard product of two linear subspaces and , denoted by , is the space generated by the Hadamard products for all pairs , . The inner product of two vectors and is denoted by , while denotes the Hamming weight of . represents the transpose of its argument, while represents the entropy function. With some abuse of language, we sometimes interchangeably refer to binary vectors as erasure patterns under the implicit assumption that the ones represent erasures. An erasure pattern (or binary vector) is said to be correctable by a code if matrix has rank .
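To make the Hadamard product of two linear codes concrete, the following is a minimal sketch over the binary field. It relies on the fact that, by bilinearity, the coordinate-wise products of all pairs of generator rows span the product code. The function names and the two length-3 example codes (a repetition code and a single parity-check code) are illustrative assumptions and are not taken from the paper.

```python
from itertools import product


def gf2_rank(rows):
    """Rank over GF(2) of a list of equal-length 0/1 tuples (Gaussian elimination)."""
    rows = [list(r) for r in rows]
    rank = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        # Find a row at or below the current rank with a one in this column.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Clear this column in every other row.
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def hadamard_generators(gen_a, gen_b):
    """Coordinate-wise products of all pairs of generator rows.

    These vectors span the Hadamard product of the two codes.
    """
    return [tuple(x & y for x, y in zip(a, b)) for a, b in product(gen_a, gen_b)]


# Hypothetical example codes of length 3 over GF(2).
REP = [(1, 1, 1)]             # repetition code, dimension 1
SPC = [(1, 0, 1), (0, 1, 1)]  # single parity-check code, dimension 2

rep_spc = hadamard_generators(REP, SPC)
spc_spc = hadamard_generators(SPC, SPC)
print(gf2_rank(rep_spc), gf2_rank(spc_spc))
```

In this toy case the product of the repetition code with the single parity-check code has dimension 2 (it is the single parity-check code itself), whereas the product of the single parity-check code with itself fills the whole space and is therefore no longer useful as a code.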
II System Model
We consider a cellular network where a macrocell is served by a macro base station (MBS). Mobile users wish to download files from a library of files that is always available at the MBS through a backhaul link. We assume that all files have equal size. (This is without loss of generality, since content can always be divided into chunks of equal size.) In particular, each file consists of bits and is represented by a matrix ,
where the superscript is the file index. Therefore, each file can be seen as divided into stripes of bits each. The file library has popularity distribution , where file is requested with probability . We also assume that SBSs are deployed to serve requests and offload traffic from the MBS whenever possible. To this purpose, each SBS has a cache size equivalent to files. The considered scenario is depicted in Fig. 1.
II-A Content Placement
File is partitioned into packets of size bits and encoded before being cached in the SBSs. In particular, each packet is mapped onto a symbol of the field , with . For simplicity, we assume that is an integer and set . Thus, stripe can be equivalently represented by a stripe , , of symbols over . Each stripe is then encoded using an MDS code over into a codeword , where code symbols , , are over . For later use, we define , , and .
The encoded file can be represented by a matrix . Code symbols are then stored in the th SBS (the ordering is unimportant). Thus, for each file , each SBS caches one coded symbol of each stripe of the file, i.e., a fraction of the th file. As ,
where implies that file is not cached. Note that, to achieve privacy, , i.e., files need to be cached with redundancy. As a result, is not allowed. This is in contrast to the case of no PIR, where (and hence ) is possible.
Since each SBS can cache the equivalent of files, the ’s must satisfy
We define the vector and refer to it as the content placement. Also, we denote by the caching scheme that uses MDS codes according to the content placement . For later use, we define and .
We remark that the content placement above differs slightly from the content placement proposed in [7]. In particular, we assume fixed code length (equal to the number of SBSs, ) and variable , such that, for each file cached, each SBS caches a single symbol from each stripe of the file. In [7], the content placement is done by first dividing each file into symbols and encoding them using an MDS code, where , . Then, (different) symbols of the th file are stored in each SBS and the MBS stores symbols. (This is because the model in [7] assumes that one SBS is always accessible to the user. If this is not the case, the MBS must store all symbols of the file. Here, we consider the case where the MBS must store all symbols because it is a bit more general.) Our formulation is perhaps a bit simpler and more natural from a coding perspective. Furthermore, we will show in Section IV that the proposed content placement is equivalent to the one in [7], in the sense that it yields the same average backhaul rate.
II-B File Request
Mobile devices request files according to the popularity distribution . Without loss of generality, we assume . The user request is initially served by the SBSs within communication range. We denote by the probability that the user is served by SBSs and define . If the user is not able to completely retrieve from the SBSs, the additional required symbols are fetched from the MBS. Using the terminology in [7], the average fraction of files that are downloaded from the MBS is referred to as the backhaul rate, denoted by R, and defined as
Note that for the case of no caching .
As in [7], we assume that the communication is error free.
II-C Private Information Retrieval and Problem Formulation
We assume that some of the SBSs are spy nodes that (potentially) collaborate with each other. On the other hand, we assume that the MBS can be trusted. The users wish to retrieve files from the cellular network, but do not want the spy nodes to learn any information about which file is requested by the user. The goal is to retrieve data from the network privately while minimizing the use of the backhaul link, i.e., while minimizing R. Thus, the goal is to optimize the content placement to minimize R.
III Private Information Retrieval Protocol
In this section, we present a PIR protocol for the caching scenario. The PIR protocol proposed here is an extension of Protocol 3 in [21] to the case of multiple code rates. (Protocol 3 in [21] is based on and improves the protocol in [20], in the sense that it achieves higher PIR rates.)
Assume without loss of generality that the user wants to download file . To retrieve the file, the user generates query matrices, , , where are the queries sent to the SBSs within visibility and the remaining queries are sent to the MBS. Note that is a parameter that needs to be optimized. Each query matrix is of size symbols (from ) and has the following structure,
The query matrix consists of subqueries , , of length symbols each. In response to query matrix , an SBS (or the MBS) sends back to the user a response vector of length , computed as
(1) 
We will denote the th entry of the response vector , i.e., , as the th subresponse of . Each response vector consists of subresponses, each being a linear combination of symbols. Note that the operations are performed over the largest extension field, i.e., , and the subresponses are also over this field, i.e., each subresponse is of size bits and hence each response is of size bits.
The queries and the responses must be such that privacy is ensured and the user is able to recover the requested file. More precisely, informationtheoretic PIR in the context of wireless caching with spy SBSs is defined as follows.
Definition 1.
Consider a wireless caching scenario with SBSs that cache parts of a library of files and in which a set of SBSs act as colluding spies. A user wishes to retrieve the th file and generates queries , . In response to the queries, the SBSs and (potentially) the MBS send back the responses . This scheme achieves perfect information-theoretic PIR if and only if
Privacy:  (2a)  
Recovery:  (2b) 
Condition (2a) means that the spy SBSs gain no additional information about which file is requested from the queries (i.e., the uncertainty about the file requested after observing the queries is identical to the a priori uncertainty determined by the popularity distribution), while Condition (2b) guarantees that the user is able to recover the file from the response vectors.
We define the code , , as the code obtained by puncturing the underlying storage code , and by the code with parameters . (Without loss of generality, and to simplify notation, we assume that the last coordinates of the code are punctured.) For the protocol to work, we require that divides for all , i.e., . This ensures that . Furthermore, we require the codes to be such that . The protocol is characterized by the codes and by two other codes, and . Code (over ) has parameters and characterizes the queries sent to the SBSs and the MBS, while code (defined below) defines the responses sent back to the user from the SBSs and the MBS. The designed protocol achieves PIR against a number of colluding SBSs , where is the minimum Hamming distance of the dual code of .
Iiia Query Construction
The queries must be constructed such that privacy is preserved and the user can retrieve the requested file from the response vectors , . In particular, the protocol is designed such that the subresponses , , corresponding to the subqueries recover unique code symbols of the file .
The queries are constructed as follows. The user chooses codewords , , , independently and uniformly at random. Then, the user constructs vectors,
(3) 
where collects the th coordinates of the codewords , , i.e., .
Assume that the user wants to retrieve file . Then, subquery is constructed as
(4) 
where
(5) 
for some set that will be defined below. Vector , , denotes the th dimensional unit vector, i.e., the length vector with a one in the th coordinate and zeroes in all other coordinates, and denotes the all-zero vector. The meaning of index will become apparent later.
According to (4), each subquery vector is the sum of two vectors, and . The purpose of is to make the subquery appear random and thus ensure privacy (i.e., Condition (2a)). On the other hand, the vectors are deterministic vectors which must be properly constructed such that the user is able to retrieve the requested file from the response vectors (i.e., Condition (2b)). Similar to Protocol 3 in [21], the vectors are constructed from a binary matrix where each row represents a weight erasure pattern that is correctable by and where the weights of its columns are determined from information sets , , of .
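The split of a query into a random part (for privacy) and a deterministic unit-vector part (for recovery) can be illustrated with a deliberately simplified toy. The sketch below is not the protocol of this paper: it assumes a binary database replicated on two noncolluding servers, in the spirit of the classical replicated setting mentioned in the introduction, and all function names are hypothetical. Each server sees a uniformly random binary vector, so its query alone reveals nothing about the requested index, while the two responses differ exactly by the wanted bit.

```python
import secrets


def make_queries(num_bits, wanted):
    """Return one query per server for the bit at index `wanted`."""
    # Random part: a uniformly random binary vector, sent as-is to server 1.
    random_part = [secrets.randbelow(2) for _ in range(num_bits)]
    # Server 2 receives the random part plus the unit vector e_wanted (over GF(2)).
    shifted = random_part.copy()
    shifted[wanted] ^= 1
    return random_part, shifted


def respond(database, query):
    """Server side: inner product of the query with the database over GF(2)."""
    response = 0
    for data_bit, query_bit in zip(database, query):
        response ^= data_bit & query_bit
    return response


def retrieve(database, wanted):
    """User side: the random part cancels, leaving only the wanted bit."""
    query_1, query_2 = make_queries(len(database), wanted)
    return respond(database, query_1) ^ respond(database, query_2)


db = [1, 0, 1, 1, 0]  # hypothetical replicated database
print([retrieve(db, i) for i in range(len(db))])
```

The protocol in this section follows the same principle, but with the random part drawn from codewords of a query code and the responses computed on coded symbols, which is what allows it to handle coded storage and colluding SBSs.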
The construction of is addressed below. We define the set as the index set of information sets that contain the th coordinate of , i.e., . To allow the user to recover the requested file from the response vectors, is constructed such that it satisfies the following conditions.

1) The user should be able to recover unique code symbols of the requested file from the responses to each set of subqueries , . That is, each row of should have exactly ones. We denote by the support of the th row of .

2) The user should be able to recover unique code symbols of the requested file , at least symbols from each stripe. This means that each row , , of should correspond to an erasure pattern that is correctable by .

3) Let , , be the th column vector of . The protocol should be able to recover unique code symbols from the th response vector, which means that it is required that . We call the vector the column weight profile of .
III-B Response Vectors
The th subresponse corresponding to subquery , , is (see (1))
The user collects the subresponses , , in the vector ,
(6) 
where symbol represents the code symbol from file downloaded in the th subresponse from the th response vector. Due to the structure of the queries obtained from , the user retrieves code symbols from the set of subresponses to the th subqueries. Consider a retrieval code of the form
(7) 
where denotes the sum of subspaces and , resulting in the set consisting of all elements for any and , and where follows due to the fact that the Hadamard product is distributive over addition.
The symbols requested by the user are then obtained by solving the system of linear equations defined by
III-C Privacy
For the retrieval, we require to be a valid code, i.e., it must have a code rate strictly less than . For a given number of colluding SBSs , the combination of conditions on and restricts the choice for the underlying storage codes . In the following theorem, we present a family of MDS codes, namely generalized Reed-Solomon (GRS) codes, that work with the protocol. A GRS code over of length and dimension is a weighted polynomial evaluation code of degree defined by some weighting vector and an evaluation vector satisfying for all [25, Ch. 5]. In the sequel, we refer to as the parameters of a GRS code .
Lemma 1.
Given an GRS code , for all , there exists an GRS code that is a subcode of .
Proof:
The canonical generator matrix for an GRS code is given by
(8) 
Clearly, taking the first rows of the leftmost matrix of (8) and multiplying it with the rightmost diagonal matrix generates an subcode of which by itself is an GRS code. Thus, GRS codes are naturally nested, and the result follows. ∎
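The nesting argument can be checked numerically on a small instance. The sketch below builds the canonical GRS generator matrix (a Vandermonde-type matrix times a diagonal weight matrix) over a prime field and verifies that the lower-dimensional generator matrix consists of the first rows of the higher-dimensional one, and that both codes are MDS. The field size, evaluation points, weights, and function names are assumptions chosen for illustration; the paper works over extension fields, which this sketch does not cover. The modular inverse via `pow(x, -1, p)` requires Python 3.8 or later.

```python
from itertools import combinations


def grs_generator(p, k, alphas, weights):
    """Rows i = 0..k-1 with entries weights[j] * alphas[j]**i mod p."""
    return [[(w * pow(a, i, p)) % p for a, w in zip(alphas, weights)]
            for i in range(k)]


def rank_mod_p(matrix, p):
    """Rank of a matrix over GF(p), p prime, by Gaussian elimination."""
    rows = [r[:] for r in matrix]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)  # modular inverse of the pivot
        rows[rank] = [(x * inv) % p for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                factor = rows[i][col]
                rows[i] = [(x - factor * y) % p for x, y in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def is_mds(gen, p):
    """A k x n generator matrix is MDS iff every set of k columns is invertible."""
    k, n = len(gen), len(gen[0])
    return all(
        rank_mod_p([[row[j] for j in cols] for row in gen], p) == k
        for cols in combinations(range(n), k)
    )


# Hypothetical parameters: GF(7), length 5, distinct evaluation points, nonzero weights.
P, ALPHAS, WEIGHTS = 7, [1, 2, 3, 4, 5], [1, 2, 3, 1, 1]
g_big = grs_generator(P, 3, ALPHAS, WEIGHTS)    # dimension 3
g_small = grs_generator(P, 2, ALPHAS, WEIGHTS)  # dimension 2
print(g_small == g_big[:2], is_mds(g_big, P), is_mds(g_small, P))
```

Because the smaller generator matrix is literally a row subset of the larger one, every codeword of the smaller code is a codeword of the larger code, which is the nesting property used in the proof.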
Theorem 1.
Let be a caching scheme with GRS codes of parameters and let be the code obtained by puncturing . Also, let be an GRS code. Then, for and , the protocol achieves PIR against colluding SBSs.
Proof:
The proof is given in the appendix. ∎
Note that the retrieval code depends on the SBSs within visibility that are contacted by the user through its evaluation vector. Finally, we remark that, with some slight modifications, the proposed protocol can be adapted to work with nonMDS codes.
III-D Example
As an example, consider the case of files, and , both of size bits. The first file is stored in the SBSs according to Fig. 2 using an binary repetition code . Similarly, the second file is stored (again according to Fig. 2) using an binary single parity-check code . Assume (i.e., no puncturing) and that none of the SBSs collude, i.e., . Furthermore, we assume that the user wants to retrieve and is able to contact SBSs (i.e., we consider the extreme case where the user is not contacting the MBS). According to Theorem 1, we can choose and . Finally, we choose as an binary repetition code.
According to (7), the retrieval code and can be generated by
Moreover, let
where is an information set of (the submatrix has rank ). Note that satisfies all three conditions of Section III-A and has column weight profile .
Query Construction. The user generates codewords and independently and uniformly at random from . Without loss of generality, let . Next, the subqueries , , are constructed according to (4) and (5) as
where is defined in (3).
File Retrieval. Consider the subresponses , . Then, according to the subresponse expression in Section III-B,
and the code symbol of the file is recovered from
Note that in order to retain privacy across the two files of the library, we need to send subqueries to each SBS, thus generating subresponses from each SBS (even if the first file can be recovered from the subresponses , ).
IV Backhaul Rate Analysis: No PIR Case
In this section, we derive the backhaul rate for the proposed caching scheme for the case of no PIR, i.e., the conventional caching scenario where PIR is not required.
Proposition 1.
The average backhaul rate for the caching scheme in Section II for the case of no PIR is
(9) 
Proof:
To download file , if the user is in communication range of a number of SBSs, , larger than or equal to , the user can retrieve the file from the SBSs and there is no contribution to the backhaul rate. Otherwise, if , the user retrieves a fraction of the file from each of the SBSs, i.e., a total of bits, and downloads the remaining bits from the MBS. Averaging over and (for the files cached) and normalizing by the file size , the contribution to the backhaul rate of the retrieval of files that are cached in the SBSs is
(10) 
On the other hand, the files that are not cached are retrieved completely from the MBS, and their contribution to the backhaul rate is
(11) 
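The averaging argument in the proof can be sketched numerically. The function below follows the verbal description only: a cached file contributes nothing when the number of SBSs in range is at least the code dimension, otherwise the missing fraction is fetched from the MBS, and a file that is not cached is fetched entirely from the MBS. The argument names, the use of `None` to mark a file that is not cached, and the numerical values are illustrative assumptions, and the sketch assumes that each SBS holds a fraction equal to one over the code dimension of each cached file.

```python
def backhaul_rate_no_pir(popularity, code_dims, gamma):
    """Average fraction of a file fetched from the MBS without PIR.

    popularity[i]: probability that file i is requested.
    code_dims[i]:  dimension of the MDS code used for file i, or None if
                   file i is not cached (hypothetical encoding of "not cached").
    gamma[b]:      probability that the user is served by b SBSs.
    """
    rate = 0.0
    for p_i, k_i in zip(popularity, code_dims):
        if k_i is None:
            # File not cached: retrieved completely from the MBS.
            rate += p_i
            continue
        for b, gamma_b in enumerate(gamma):
            if b < k_i:
                # b SBSs supply a fraction b / k_i; the rest comes from the MBS.
                rate += p_i * gamma_b * (1 - b / k_i)
    return rate


# Hypothetical toy numbers: three files, the user sees 0, 1, or 2 SBSs.
popularity = [0.5, 0.3, 0.2]
gamma = [0.1, 0.4, 0.5]
print(backhaul_rate_no_pir(popularity, [1, 2, None], gamma))
print(backhaul_rate_no_pir(popularity, [None, None, None], gamma))
```

With nothing cached, every request is served in full by the MBS, so the function returns a rate of one, in line with the remark on the no-caching case in Section II-B.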
We denote by the minimum backhaul rate resulting from the optimization of the content placement. can be obtained by solving the following optimization problem,
where , as is a valid value for the case where PIR is not required.
In the following lemma, we show that the proposed content placement is equivalent to the one in [7], in the sense that it yields the same average backhaul rate.