Data Exchange Problem with Helpers
Abstract
In this paper we construct a deterministic polynomial-time algorithm for the problem where a set of users is interested in gaining access to a common file, but where each has only partial knowledge of the file. We further assume the existence of another set of terminals in the system, called helpers, who are not interested in the common file, but who are willing to help the users. Given that the collective information of all the terminals is sufficient to allow recovery of the entire file, the goal is to minimize the (weighted) sum of bits that these terminals need to exchange over a noiseless public channel to achieve this recovery. Based on established connections to the multiterminal secrecy problem, our algorithm also implies a polynomial-time method for constructing the largest shared secret key in the presence of an eavesdropper. We consider the following side-information settings: (i) side-information in the form of uncoded packets of the file, where the terminals’ side-information consists of subsets of the file; (ii) side-information in the form of linearly correlated packets, where the terminals have access to linear combinations of the file packets; and (iii) the general setting where the terminals’ side-information has an arbitrary (i.i.d.) correlation structure. We provide a polynomial-time algorithm (in the number of terminals) that finds the optimal rate allocations for these terminals, and then determines an explicit optimal transmission scheme for cases (i) and (ii).
I Introduction
In recent years cellular systems have witnessed significant improvements in terms of data rates, and are nearly approaching the theoretical limits in terms of the physical layer spectral efficiency. At the same time, the rapid growth in the popularity of data-enabled mobile devices, such as smart phones and tablets, and the resulting explosion in demand for more throughput are challenging our abilities even with the current highly efficient cellular systems. One of the major bottlenecks in scaling the throughput with the increasing number of mobile devices is the “last mile” wireless link between the base station and the mobile devices, a resource that is shared among many terminals served within the cell. This motivates the study of paradigms where cell phone devices can cooperate among themselves to get the desired data in a peer-to-peer fashion without solely relying on the base station.
An example of such a setting is shown in Figure 1, where a base station wants to deliver the same file to multiple geographically close users over an unreliable wireless downlink. We assume that some terminals, which are in the range of the base station, are not interested in the file, but due to their proximity to the base station, they are able to overhear some of its transmissions. Moreover, we assume that these terminals are willing to help in distributing the file to the respective users. We will refer to these terminals as helpers. In the example of Figure 1 we assume that the file consists of four equally sized packets belonging to some finite field. Suppose that after a few initial transmission attempts by the base station, the three terminals (including one helper) individually receive only parts of the file (see Figure 1), but collectively have the entire file. Now, if all terminals are in close vicinity and can communicate with each other, then it is much more desirable and efficient, in terms of resource usage, to reconcile the file among users by letting all terminals “talk” to each other without involving the base station. The cooperation among the terminals has the following advantages:

Local communication among terminals has a smaller footprint in terms of interference, thus allowing one to use the shared resources (code, time or frequency) freely without penalizing the base station’s resources, i.e., higher resource reuse factor.

Transmissions within the close group of terminals are much more reliable than those from the base station to any terminal, due to the geographical proximity of the terminals.

This cooperation allows for the file recovery even when the connection to the base station is either unavailable after the initial phase of transmission, or it is too weak to meet the delay requirement.
The problem of reconciling a file among multiple wireless users having parts of it while minimizing the cost in terms of the total number of bits exchanged is known in the literature as the data exchange problem and was introduced by El Rouayheb et al. in [1]. In the formulation of the data exchange problem it is assumed that all the terminals in the system are interested in recovering the entire file, i.e., there are no helpers. For the data exchange problem without helpers, a randomized algorithm was proposed in [2] and [3], while deterministic polynomial-time algorithms were proposed in [4], [5].
In this paper we consider a scenario with helpers and a linear communication cost. With respect to the example considered here, if user , user and the helper transmit and bits, respectively, the data exchange problem with helpers corresponds to minimizing the weighted sum-rate such that, when the communication is over, user and user can recover the entire file. It can be shown that for the case when , the minimum communication cost is and can be achieved by the following coding scheme: user transmits packet , and the helper transmits , where the addition is over the underlying field . This corresponds to the optimal rate allocation symbol in . If there were no helper in the system, it would take a total of transmissions to reconcile the file among the two users. That is, user has to transmit and user transmits and . Thus, the helpers can contribute to lowering the total communication cost in the system.
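To make the role of the helper concrete, the following is a minimal sketch of a two-transmission scheme of the kind described above. The side information assigned to each terminal (`user1`, `user2`, `helper`) and the packet values are hypothetical and are not taken from Figure 1; packets are modelled as small integers and addition over the binary field is bitwise XOR.

```python
# Hypothetical side information (an assumption, not taken from Figure 1).
# Packets are modelled as small integers; addition over GF(2) is bitwise XOR.
file_packets = {"a": 0b1010, "b": 0b0110, "c": 0b1100, "d": 0b0011}

user1 = {k: file_packets[k] for k in ("a", "b", "c")}
user2 = {k: file_packets[k] for k in ("a", "d")}
helper = {k: file_packets[k] for k in ("c", "d")}

# Two transmissions in total: user 1 sends b uncoded, the helper sends c + d.
tx_user1 = user1["b"]
tx_helper = helper["c"] ^ helper["d"]

# User 1 already knows c, so it cancels c from the helper's packet to get d.
recovered1 = dict(user1, d=tx_helper ^ user1["c"])

# User 2 reads b directly and cancels d from the helper's packet to get c.
recovered2 = dict(user2, b=tx_user1, c=tx_helper ^ user2["d"])

both_recovered = recovered1 == file_packets and recovered2 == file_packets
print(both_recovered)
```

In this assumed instance the two users alone would need three transmissions (user 1 is the only holder of b and c, user 2 the only holder of d), so the helper saves one transmission.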
The discussion above considers only a simple form of side-information, where different terminals observe partial uncoded “raw” packets of the original file. Content distribution networks are increasingly using coding, such as Fountain codes or linear network codes, to improve the system efficiency [6]. In such scenarios, the side-information representing the partial knowledge gained by the terminals would be coded and in the form of linear combinations of the original file packets, rather than the raw packets themselves. The previous two cases of side-information (“raw” and coded) can be regarded as special cases of the more general problem where the side-information has arbitrary correlation among the data observed by the different terminals and where the goal is to minimize the weighted total communication cost. In [7] Csiszár and Narayan posed a related security problem referred to as the “multiterminal key agreement” problem. They showed that delivering the file to the users with the minimum number of bits exchanged over the public channel is sufficient to maximize the size of the secret key shared between the users. This result establishes a connection between the multiterminal key agreement problem and the data exchange problem with helpers. The work in [7] solves the key agreement problem by formulating it as a linear program (LP) with an exponential number of rate constraints, corresponding to all possible cut-sets that need to be satisfied.
In this paper, we make the following contributions. First, we provide a deterministic polynomial-time algorithm for finding an optimal rate allocation, with respect to a linear weighted sum-rate cost, needed to deliver the file to all users when all terminals have arbitrarily correlated side-information. For the data exchange problem with helpers, this algorithm computes the optimal rate allocation in polynomial time for the case of linearly coded side-information (including the “raw” packets case) and for general linear cost functions (including the sum-rate case). Second, for the data exchange problem with helpers, with raw or linearly coded side-information, we propose an efficient communication scheme design based on the algebraic network coding framework [8], [9].
II System Model and Preliminaries
In this paper, we consider a setup with terminals, some subset of which is interested in gaining access to a file or a random process. Let , denote the components of a discrete memoryless multiple source (DMMS) with a given joint probability mass function. Each user observes i.i.d. realizations of the corresponding random variable .
Let be the subset of terminals, called users, who are interested in gaining access to the file, i.e., learning the joint process . The remaining terminals serve as helpers, i.e., they are not interested in recovering the file, but they are willing to help users in the set to obtain it. In [7], Csiszár and Narayan showed that interactive communication is not needed to deliver the file to all users in a setup with a general DMMS. As a result, in the sequel we can assume without loss of generality that the transmission of each user is only a function of its own initial observations. Let represent the transmission of the user , where is any desired mapping of the observations . For each user in to recover the entire file, the transmissions , , should satisfy
(1) 
where .
Definition 1.
A rate tuple is an achievable data exchange (DE) rate tuple if there exists a communication scheme with transmitted messages that satisfies (1), and is such that
(2) 
It is easy to show using cut-set bounds that all the achievable DE rate tuples necessarily belong to the following region
(3) 
where . Also, using a random coding argument, it can be shown that the rate region is an achievable rate region [7].
In this work, we aim to design a polynomial complexity algorithm that delivers the file to all users in while simultaneously minimizing a linear communication cost function , where , is an dimensional vector of nonnegative finite weights. We allow ’s to be arbitrary nonnegative constants, to account for the case when communication of some group of terminals is more expensive compared to the others, e.g., setting to be a large value compared to the other weights minimizes the rate allocated to the user . This goal can be formulated as the following linear program:
(4)  
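As an illustration of why the number of constraints grows exponentially, the sketch below enumerates cut-set constraints for the uncoded-packet case and solves a tiny instance by exhaustive search. It assumes constraints of the Csiszár-Narayan type: for every nonempty proper subset of terminals that leaves at least one user outside, the rates inside the subset must cover the packets known only inside it. The function names, the instance, and the restriction to integer rates are assumptions made for illustration; the optimum of the linear program can be fractional in general.

```python
from itertools import combinations, product

def cut_constraints(obs, users):
    """Map each cut S to its lower bound for uncoded packets.

    Assumed constraint form: sum of rates over S >= number of packets known
    only inside S, for every nonempty S that leaves at least one user outside.
    """
    terminals = sorted(obs)
    constraints = {}
    for size in range(1, len(terminals)):
        for subset in combinations(terminals, size):
            s = frozenset(subset)
            if set(users) <= s:
                continue  # every user is inside S, so no user decodes across it
            outside = set().union(*(obs[t] for t in terminals if t not in s))
            inside = set().union(*(obs[t] for t in s))
            constraints[s] = len(inside - outside)
    return constraints

def min_cost_integer(obs, users, weights):
    """Exhaustive search over integer rate vectors (tiny instances only)."""
    terminals = sorted(obs)
    cons = cut_constraints(obs, users)
    n_packets = len(set().union(*obs.values()))
    best = None
    for rates in product(range(n_packets + 1), repeat=len(terminals)):
        r = dict(zip(terminals, rates))
        if all(sum(r[t] for t in s) >= bound for s, bound in cons.items()):
            cost = sum(weights[t] * r[t] for t in terminals)
            if best is None or cost < best[0]:
                best = (cost, r)
    return best

# Hypothetical instance: two users and one helper, unit weights.
obs = {"user1": {"a", "b", "c"}, "user2": {"a", "d"}, "helper": {"c", "d"}}
unit = {t: 1 for t in obs}
with_helper = min_cost_integer(obs, ["user1", "user2"], unit)

no_helper_obs = {t: obs[t] for t in ("user1", "user2")}
without_helper = min_cost_integer(no_helper_obs, ["user1", "user2"], unit)
print(with_helper, without_helper)
```

The enumeration visits every subset of terminals, which is exactly what a polynomial-time method has to avoid.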
II-A Finite Linear Source Model
In general, efficient content distribution networks use coding such as fountain codes or linear network codes. As a result, the terminals’ observations are linear combinations of the original packets forming the file, rather than the uncoded data themselves, as is the case in the conventional data exchange problem. This linearly correlated source model is known in the literature as the finite linear source [10].
Next, we briefly describe the finite linear source model. Let be some power of a prime. Consider the dimensional random vector whose components are independent and uniformly distributed over the elements of the corresponding finite field. Then, in the linear source model, the observation of user is simply given by
(5) 
where is an observation matrix for the user .
It is easy to verify that for the finite linear source model,
(6) 
Henceforth for the finite linear source model we will use the entropy of the observations and the rank of the observation matrix interchangeably.
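The rank-entropy correspondence can be sketched as follows. This is an illustrative implementation that assumes a prime field size `p` (a general prime power would need full finite-field arithmetic) and measures entropy in symbols of the field; the names `rank_mod_p`, `stack`, and `cond_entropy`, as well as the example matrices, are hypothetical.

```python
def rank_mod_p(rows, p):
    """Rank of an integer matrix over GF(p), p prime, by Gaussian elimination."""
    m = [[x % p for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)  # modular inverse, Python 3.8+
        m[rank] = [(x * inv) % p for x in m[rank]]
        for r in range(len(m)):
            if r != rank and m[r][col]:
                factor = m[r][col]
                m[r] = [(x - factor * y) % p for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank

def stack(obs, names):
    """Stack the observation matrices of the named terminals row by row."""
    return [row for name in names for row in obs[name]]

def cond_entropy(obs, group, given, p):
    """H(X_group | X_given) in field symbols: rank of both minus rank of given."""
    joint = rank_mod_p(stack(obs, list(group) + list(given)), p)
    return joint - rank_mod_p(stack(obs, list(given)), p)

# Hypothetical observation matrices over GF(2); columns correspond to (a, b, c).
obs = {
    1: [[1, 0, 0], [0, 1, 0]],   # terminal 1 observes a and b
    2: [[1, 1, 0]],              # terminal 2 observes a + b
    3: [[0, 0, 1], [1, 0, 1]],   # terminal 3 observes c and a + c
}
print(cond_entropy(obs, [2], [1], 2), cond_entropy(obs, [3], [1], 2))
```

Terminal 2 adds nothing once terminal 1 is known, while terminal 3 contributes one new dimension, which matches the intuition behind using rank in place of entropy.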
III Deterministic Algorithm
We begin this section by exploring the case when the set consists of only one user. Then, by using the methodology of [11], we extend our solution to the case when the set has an arbitrary number of users.
III-A Deterministic Algorithm for a Single User
Let the user be the only user interested in the file, i.e., . This is known as the multiterminal Slepian-Wolf problem [12], for which the achievable rate region has the following form:
Hence, the underlying optimization problem has the following form
(7) 
Optimization problem (7) can be solved analytically due to the fact that the set function
(8) 
is supermodular (see [13] for the formal definition). Therefore, optimization problem (7) is over a supermodular polyhedron . From the combinatorial optimization theory it is known that Edmonds’ greedy algorithm [14] renders an analytical solution to this problem (see Algorithm 1).
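A minimal sketch of an Edmonds-style greedy allocation for the single-user case is given below, specialised to uncoded packets so that the conditional entropy of a group reduces to the number of packets held only by that group. It rests on several assumptions: the single user does not transmit, the remaining terminals are processed in order of decreasing weight, and each terminal receives the marginal increase of the supermodular set function. The helper names and the instance are hypothetical.

```python
def known_only_in(obs, group, rest):
    """H(X_group | X_rest) for uncoded packets: packets held only by `group`."""
    inside = set().union(*(obs[t] for t in group))
    outside = set().union(*(obs[t] for t in rest))
    return len(inside - outside)

def greedy_single_user(obs, user, weights):
    """Greedy rates for one decoding user; the user itself sends nothing."""
    # The most expensive sender goes first and gets the smallest marginal value.
    senders = sorted((t for t in obs if t != user), key=lambda t: -weights[t])
    rates, chosen, prev = {user: 0}, [], 0
    for t in senders:
        chosen.append(t)
        rest = [s for s in obs if s not in chosen]  # always contains the user
        value = known_only_in(obs, chosen, rest)
        rates[t] = value - prev                     # marginal increase
        prev = value
    return rates

# Hypothetical instance: user "u" wants the file {a, b, c, d}.
obs = {"u": {"a"}, "t1": {"b", "c"}, "t2": {"c", "d"}}
rates_t2_costly = greedy_single_user(obs, "u", {"t1": 1, "t2": 2})
rates_t1_costly = greedy_single_user(obs, "u", {"t1": 2, "t2": 1})
print(rates_t2_costly, rates_t1_costly)
```

In both runs the total rate is three packets, but the shared packet c is always assigned to the cheaper terminal.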
Example 1.
Consider a system with terminals . For convenience, we express the underlying data vector as , where are independent uniform random variables in . Let us consider the case where each node has the following observations: , , , , , . Let us assume that user is interested in recovering the vector such that underlying communication cost is .
It immediately follows from Algorithm 1 that a solution to this problem is , and . In other words, user is missing linear equations in order to be able to decode all data packets.
III-B Deterministic Algorithm for Multiple Users
In this section we extend the results from the previous section to the case where the set contains an arbitrary number of users. Optimization problem (4) can be written as follows
(9)  
where
The equivalence between the optimization problems (4) and (9) follows from the fact that the transmissions of all terminals in have to be such that all users in can learn . Optimization problem (9) has an exponential number of constraints, which makes it challenging to solve in polynomial time. To obtain a polynomial-time solution we consider the Lagrangian dual of problem (9).
(10)  
where
(11) 
Dual variable in the above problem is represented in matrix form as follows.
(12) 
We denote by and , the column vector and row vector of the matrix , respectively. Moreover, we denote by
(13) 
the rate matrix whose row, here denoted by , represents an optimizer of the problem (11) w.r.t. the weight vector . In order to ensure consistency with the optimization problem (10) observe that , and , .
For any given user , the objective function (11) of the dual problem (10) can be computed analytically using Algorithm 1. The optimization problem (10) is a linear program (LP) with number of constraints, which makes it possible to solve it in polynomial time (w.r.t. number of terminals). To solve the optimization problem (10) we apply a subgradient method, as described below.
Starting with a feasible iterate w.r.t. the optimization problem (10), every subsequent iterate can be recursively represented as a Euclidean projection of the vector
(14) 
onto the hyperplane , where is the column of the rate matrix . The Euclidean projection ensures that every iterate is feasible w.r.t. the optimization problem (10). It is not hard to verify that the following initial choice of is feasible w.r.t. the problem (10).
(15) 
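The projection step of the iteration (14) can be sketched as follows. The sketch assumes that the feasible set for each column of the dual matrix consists of nonnegative entries summing to the corresponding weight, i.e., a scaled simplex; the nonnegativity requirement is an assumption here, and the routine is the standard sort-based simplex projection rather than a procedure taken from this paper. The function name is hypothetical.

```python
def project_to_scaled_simplex(v, total):
    """Euclidean projection of v onto {x : x >= 0, sum(x) == total}, total >= 0."""
    if total == 0:
        return [0.0] * len(v)
    u = sorted(v, reverse=True)
    cumulative, shift = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - total) / j
        if u_j - candidate > 0:
            shift = candidate  # the last index passing the test fixes the shift
    return [max(x - shift, 0.0) for x in v]

# One dual column after a subgradient step, projected back onto weight 1.
projected = project_to_scaled_simplex([2.0, 0.0], 1.0)
print(projected)
```

A point that is already feasible is returned unchanged, so the projection only acts when the subgradient step leaves the feasible set.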
By appropriately choosing the step size in each iteration (14), it is guaranteed that the subgradient method described above converges to the optimal solution of the problem (10). To recover the primal optimal solution from the iterates we use results from [15], where at each iteration of (14), the primal iterate is constructed as follows.
(16) 
where
(17) 
By carefully choosing the step size , in (14) and the convex combination coefficients , , , it is guaranteed that (16) converges to the minimizer of (9), and therefore to the minimizer of the original problem (4). In [15], the authors proposed several choices for and which lead to the primal recovery. Here we list some of them.

, , where , , ,
, , , 
, , where ,
, , .
It now only remains to compute an optimal rate allocation with respect to the problem defined in (4). Let and be the optimal rate vectors of the problems (4) and (9), respectively. As we pointed out earlier , where can be computed from the matrix for a sufficiently large , as follows
(18) 
Pseudocode of the algorithm described in this section is shown below (see Algorithm 2).
III-C Code Construction for the Linear Source Model
In this section we briefly address the question of the optimal code construction for the linear source model. To that end, let us consider the following example.
Example 2.
Let us consider the same source model as in Example 1, where , and the objective function is . Applying the algorithm described above, we obtain
(19) 
This solution suggests that in order to design a scheme that performs optimally, it is necessary to split all the packets into equally sized chunks. In other words, terminals’ observations can be written as , , etc., where all ’s, ’s and ’s belong to . For this “extended” source model we have that the optimal rate allocation is , .
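The packet-splitting step can be sketched as follows: given a rational rate allocation, the number of chunks per packet is taken as the least common multiple of the denominators, and the rates are rescaled to integers. The function names and the example rates are hypothetical and are not the values of Example 2.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def chunks_per_packet(rates):
    """Least common multiple of the denominators of the rational rates."""
    denominators = [Fraction(r).denominator for r in rates]
    return reduce(lambda a, b: a * b // gcd(a, b), denominators, 1)

def integer_rates(rates):
    """Return the split factor and the rates expressed in chunks."""
    n = chunks_per_packet(rates)
    return n, [int(Fraction(r) * n) for r in rates]

# Hypothetical fractional allocation, in packets per terminal.
example_rates = [Fraction(1, 2), Fraction(1, 2), Fraction(1), Fraction(3, 2)]
n_chunks, scaled = integer_rates(example_rates)
print(n_chunks, scaled)
```

After splitting, every rate is an integer number of chunks, which is the form required by the multicast network construction that follows.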
The next question we need to address is how to design the transmissions of each user. Starting from an optimal (integer) rate allocation, we first construct the corresponding multicast network (see Figure 2). In this construction, notice that there are several types of nodes. First, there is a super node that possesses all the packets. Each user in the set plays the role of a transmitter and a receiver, while the helpers act only as transmitters. To model this, we denote to be the “sending” nodes, and , and to be the receiving nodes. To model the side-information at users , and , we introduce links , , of capacity , which are routing the users’ observations to the corresponding receiving nodes. To model the broadcast nature of each transmission, we introduce “dummy” nodes , such that the capacity of the links is the same as link capacity , , and is equal to , .
To solve for the actual transmissions of each terminal, we apply the algebraic network coding approach [8], with an appropriately designed source matrix which corresponds to the side-information of all terminals. Finally, the network code for the data exchange problem with helpers can be constructed in polynomial time using the algorithms provided in [9], which are based on simultaneous transfer matrix completion.
IV Conclusion and Extensions
In this paper we study the data exchange problem with helpers. We provide a deterministic polynomial-time algorithm for minimizing the weighted sum-rate cost of communication. We show that the data exchange problem with only one user and many helpers can be solved analytically using Edmonds’ algorithm. Further, using the single-user solution as a building block, we show how one can solve the more general problem with an arbitrary number of users. Several extensions are of interest. For instance, we can consider a modification of the original data exchange problem where only helpers are allowed to transmit. Starting from the single-user case, it is easy to see that an achievable rate tuple must satisfy all the cut-set constraints over the helper set such that the user is always on the receiving side of the cut. Minimizing the weighted sum-rate cost over all achievable rate tuples can again be done using Edmonds’ algorithm (see Algorithm 1). Finally, the extension to the multiple-user case corresponds to the weighted sum-rate minimization over all rate tuples that are simultaneously achievable for all users. This optimization problem can be solved in polynomial time using the same approach as in Algorithm 2.
References
 [1] S. El Rouayheb, A. Sprintson, and P. Sadeghi, “On coding for cooperative data exchange,” in Proceedings of ITW, 2010.
 [2] A. Sprintson, P. Sadeghi, G. Booker, and S. El Rouayheb, “A randomized algorithm and performance bounds for coded cooperative data exchange,” in Proceedings of ISIT, 2010, pp. 1888–1892.
 [3] D. Ozgul and A. Sprintson, “An algorithm for cooperative data exchange with cost criterion,” in Information Theory and Applications Workshop (ITA), 2011. IEEE, pp. 1–4.
 [4] T. Courtade, B. Xie, and R. Wesel, “Optimal Exchange of Packets for Universal Recovery in Broadcast Networks,” in Proceedings of Military Communications Conference, 2010.
 [5] S. Tajbakhsh, P. Sadeghi, and R. Shams, “A model for packet splitting and fairness analysis in network coded cooperative data exchange.”
 [6] M. Luby, “LT codes,” in Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002, pp. 271–280.
 [7] I. Csiszár and P. Narayan, “Secrecy capacities for multiple terminals,” IEEE Transactions on Information Theory, vol. 50, no. 12, pp. 3047–3061, 2004.
 [8] R. Koetter and M. Medard, “An Algebraic Approach to Network Coding,” IEEE/ACM Transactions on Networking, vol. 11, no. 5, pp. 782 – 795, 2003.
 [9] N. Harvey, D. Karger, and K. Murota, “Deterministic network coding by matrix completion,” in Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2005, pp. 489–498.
 [10] C. Chan, “Generating Secret in a Network,” Ph.D. dissertation, Massachusetts Institute of Technology, 2010.
 [11] D. Lun, N. Ratnakar, M. Médard, R. Koetter, D. Karger, T. Ho, E. Ahmed, and F. Zhao, “Minimumcost multicast over coded packet networks,” Information Theory, IEEE Transactions on, vol. 52, no. 6, pp. 2608–2623, 2006.
 [12] T. Cover and J. Thomas, “Elements of information theory 2nd edition,” 2006.
 [13] S. Fujishige, Submodular functions and optimization. Elsevier Science, 2005.
 [14] J. Edmonds, “Submodular functions, matroids, and certain polyhedra,” Combinatorial structures and their applications, pp. 69–87, 1970.
 [15] H. Sherali and G. Choi, “Recovery of primal solutions when using subgradient optimization methods to solve Lagrangian duals of linear programs,” Operations Research Letters, vol. 19, no. 3, pp. 105–113, 1996.