Privacy-guaranteed Two-Agent Interactions Using Information-Theoretic Mechanisms
This paper introduces a multi-round interaction problem with privacy constraints between two agents that observe correlated data. The data is assumed to have both public and private features, and the goal of the interaction is to share the public data subject to utility constraints (bounds on the distortion of the public features) while ensuring bounds on the information leakage of the private data at the other agent. The agents alternately share data with one another for a total of K rounds such that each agent initiates sharing over K/2 rounds. The interactions are modeled as a collection of random mechanisms (mappings), one for each round. The goal is to jointly design the private mechanisms to determine the set of all achievable distortion-leakage pairs at each agent. Arguing that a mutual information-based leakage metric can be appropriate for streaming data settings, this paper: (i) determines the set of all achievable distortion-leakage tuples; (ii) shows that the mechanisms allow for precisely composing the total privacy budget over rounds without loss; and (iii) develops conditions under which interaction reduces the net leakage at both agents and illustrates it for a specific class of sources. The paper then focuses on log-loss distortion to better understand the effect on leakage of using a utility metric that is common in learning theory. The resulting interaction problem leads to a non-convex sum-leakage-distortion optimization problem that can be viewed as an interactive version of the information bottleneck problem. A new merge-and-search algorithm that extends the classical agglomerative information bottleneck algorithm to the interactive setting is introduced to determine a provably locally optimal solution. Finally, the benefit of interaction under log-loss is illustrated for specific source classes, and the optimality of one-shot data sharing is proved for Gaussian sources under both mean-square and log-loss distortion constraints.
Consider an electric power system in which system operators that manage specific sub-areas of the network share measurements with each other to obtain precise estimates of the underlying system state, i.e., complex voltages. Despite the need for such sharing and the value of high-fidelity state estimates, such sharing is often limited due to privacy considerations; in the process of sharing measurements, the operators do not wish to leak information about a subset of their internal states. However, since the measurements need to be shared, and often multiple times due to the iterative nature of power system state estimation, it is crucial to understand: (a) the effect of applying privacy-preserving mechanisms on both the utility of estimation and the leakage of the private data; and (b) the effect of multiple rounds of interaction and sharing on the net leakage.
Privacy in such a distributed “competitive” context is different from the traditional statistical database privacy setting in which data is published to ensure statistical value while ensuring that the privacy of any individual in the database is not compromised. In this database context, differential privacy with guarantees on the worst-case privacy leakage has emerged as a strong formalism. However, in many data sharing settings, such as the above-mentioned electric power system example as well as other streaming data settings (e.g., sensor networks, IoT, even electronic medical records, etc.), the data stream as a whole has private and public features that need to be hidden and revealed, respectively. In such settings where the privacy threat is primarily about inference, a statistical approach using an information-theoretic privacy framework can capture the correlation between the public and private features. In fact, the current definitions of differential privacy, which focus on protecting individual privacy, cannot be applied easily to study the tradeoffs for multi-feature privacy vs. inference problems.
To this end, we consider a two-way interactive data sharing setting with two agents. Each agent generates an n-length independent and identically distributed (i.i.d.) sequence of public and private data; data at the two agents are assumed to be correlated, as is generally the case in such distributed settings. Each agent wishes to share a function of its public data with the other agent to satisfy a desired measure of utility (e.g., via a distortion function) while ensuring that a mutual information based leakage of its private data is constrained over K rounds of communication.
Formally, an information-theoretic privacy mechanism is a randomizing function that maps the public data from a data source to an output (revealed/released data); any such mapping will achieve a certain utility, quantified via a desired distortion function, and a certain leakage of private data, quantified via average mutual information. In the interactive setting, we allow for a total of K rounds of data sharing (K/2 rounds per agent) and introduce a private interactive mechanism as a collection of K random mappings. From both a theoretical and an application viewpoint, it is of much interest to understand whether interaction reduces privacy leakage or if a single round of data sharing suffices for a fixed privacy budget (leakage constraint).
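To make the notions of leakage and utility concrete, the following minimal sketch evaluates a single-round, single-letter mechanism on a toy binary source. It is only an illustration under stated assumptions: the source (`toy_source`), the binary symmetric mechanism (`bsc`), the Hamming distortion, and all function names are hypothetical choices that do not come from the paper, and the sketch ignores the side information at the receiving agent that the interactive model accounts for.

```python
import numpy as np

def mutual_information(p_ab):
    """Return I(A;B) in bits for a joint pmf stored as p_ab[a, b]."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    prod = p_a * p_b
    mask = p_ab > 0  # skip zero-probability cells (0 log 0 = 0)
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / prod[mask])))

def evaluate_mechanism(p_yx, mechanism, distortion):
    """Leakage I(Y;U) and expected distortion E[d(X,U)] of a mechanism.

    p_yx[y, x]       joint pmf of private feature Y and public feature X
    mechanism[x, u]  P(U = u | X = x); U depends on Y only through X
    distortion[x, u] distortion d(x, u) when u is released for x
    """
    p_yx = np.asarray(p_yx, dtype=float)
    mechanism = np.asarray(mechanism, dtype=float)
    distortion = np.asarray(distortion, dtype=float)
    p_yu = p_yx @ mechanism          # p(y, u) = sum_x p(y, x) P(u | x)
    p_x = p_yx.sum(axis=0)           # marginal of the public feature
    expected_distortion = float(np.sum(p_x[:, None] * mechanism * distortion))
    return mutual_information(p_yu), expected_distortion

def toy_source(noise=0.1):
    """Hypothetical source: X ~ Bern(1/2), Y = X xor N with N ~ Bern(noise)."""
    return 0.5 * np.array([[1 - noise, noise], [noise, 1 - noise]])

def bsc(eps):
    """Hypothetical mechanism: flip the public bit with probability eps."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

hamming = np.array([[0.0, 1.0], [1.0, 0.0]])

if __name__ == "__main__":
    for eps in (0.0, 0.1, 0.3, 0.5):
        leak, dist = evaluate_mechanism(toy_source(), bsc(eps), hamming)
        print(f"eps={eps:.1f}  leakage={leak:.4f} bits  distortion={dist:.2f}")
```

In this toy setting, increasing the flip probability lowers the leakage of the private feature at the cost of a larger distortion of the public feature, which is the tradeoff that the paper characterizes in general.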
I-A Related Work
An information-theoretic formulation of the utility-privacy tradeoff problem was introduced in  for the one-shot data publishing setting and has also been studied in [3, 4]. For the interactive setting,  determines the largest achievable utility-privacy tradeoff region for a two-agent system for a class of Gaussian sources and mean-squared distortion functions at both agents. In contrast, the focus in this paper is on both discrete and Gaussian memoryless sources and appropriate classes of distortion functions.
For a one-way non-interactive setting, in  Makhdoumi et al. introduce an algorithm based on the agglomerative information bottleneck algorithm to compute the risk-distortion tradeoff for logarithmic loss based privacy and distortion functions that they refer to as the privacy funnel problem. We will henceforth refer to the generalization of the privacy funnel problem for the interactive case that we study here as the interactive privacy funnel problem. More recently, in  Vera et al. study the rate-relevance region for an interactive two-agent information bottleneck problem.
It is worth noting that the problem at hand also falls under the purview of multiparty computation; in this context, recently, in  Kairouz et al. prove the optimality of one-shot interactions for binary sources using a differentially private data sharing mechanism. Our information-theoretic approach considers general sources and distortions as well as public and private data for two agents and shows that, in general, a data source can leak less over multiple rounds for a fixed distortion. Furthermore, while secure multiparty computation (SMC) is often considered a recourse to such interactive data sharing setups (see for example, ), the complexity of SMC implementations and the rise of many cloud-based applications with demands for real-time distributed data processing suggest the need for alternative privacy-guaranteeing approaches as motivated in .
We also note that the interactive formulation studied here shares some similarities with interactive source coding problems introduced in  and further studied in  and . However, unlike classical source coding, in our model each source does not ‘code’ its data sequence but rather maps it to an intermediate ‘revealed’ sequence in each round such that at the end of K rounds, the receiver agent ‘reconstructs’ a typical data sequence using all the information at its disposal. We also note that our assumption of memoryless sources leads to single-letter expressions for (mutual) information leakage as a function of the distortion pairs, which bear similarities to the rate-distortion function in the source coding setup. We exploit this similarity to determine the conditions under which interaction benefits leakage in a manner similar to that done for the interactive source coding problem by Ma et al. in . It is crucial to note that, in contrast to the traditional interactive source coding setup with rate and distortion constraints, here the leakage and distortion constraints are on different aspects of the source, namely, the private and public features, respectively. Thus, it is unclear a priori whether multiple rounds of interaction reduce leakage or worsen it.
I-B Our Contributions
In this paper, we consider discrete memoryless correlated sources at the two agents and determine the set of all possible leakage-distortion tuples achievable at both agents over K rounds of interaction (Section II); for jointly Gaussian sources with quadratic distortion constraints we show the optimality of one-shot privacy mechanisms. In this same section, we also highlight how an information-theoretic approach naturally lends itself to composing privacy optimally over multiple interactions without any cumulative loss; we complete this section by illustrating the advantage of interaction for specific two-agent source and distortion models. In Section III, we determine the conditions under which interaction helps. We then focus on a specific class of distortion functions, namely, log-loss distortion (Section IV), which is often used as a utility function in machine learning applications. Our motivation for this model stems from the fact that the intermediate soft decoding characteristic of many interactive systems is well captured by log-loss distortion, in which each agent continually refines its belief of the data to be estimated/inferred with each interaction. We show that the resulting interactive privacy funnel problem is a dual of an interactive information bottleneck problem and, analogously, involves optimization over a non-convex probability space; to this end, we extend the agglomerative information bottleneck algorithm to the two-agent interactive case and call the result the agglomerative interactive privacy algorithm. We show that for Gaussian sources with log-loss distortion, one-shot data sharing is optimal; in contrast, we also prove that, in general, there always exists a pair of distributed correlated (non-Gaussian) sources for which interaction helps under log-loss. We illustrate our results using publicly available census data (Section V) and conclude in Section VI.
Preliminary work on the achievable distortion-leakage region for the two-agent interaction problem with privacy constraints studied here is developed by the authors in . Furthermore, in , the authors present an example to illustrate the advantage of interactions to reduce leakage as well as study the leakage-distortion tradeoffs under log-loss distortion; to this end,  introduces the interactive version of the privacy funnel problem and the related ‘merge-and-search’ algorithm. These above-mentioned details are also covered in this paper with the prime difference being that, unlike , this paper includes the detailed proofs for all theorems and intermediate results. Additionally, this paper also develops the following with detailed proofs: (i) rigorous testable conditions under which interaction helps; (ii) a composition theorem that clarifies how an optimal mechanism composes a net leakage budget amongst rounds of interaction; and finally (iii) a detailed example to illustrate that interaction reduces leakage under log-loss distortion for a large class of binary sources.
Finally, we also briefly comment on the source and mechanism model considered here and place it in the context of related work. We assume that the datasets at each agent are large, i.e., the agents, if needed, could empirically evaluate the source distributions to design the interactive mechanisms. Furthermore, the data sources are assumed to be memoryless, i.e., the data of each user corresponding to a row of the dataset is independent of that of other users; however, the public and private features of each user (in any row) are correlated. We assume that the privacy mechanism over all rounds of interaction is general and not necessarily memoryless; in fact, we use the tools of asymptotic information theory to show that memoryless mechanisms suffice for memoryless sources. Many related privacy approaches implicitly assume memoryless mechanisms [3, 4], thereby modeling the problem as one of “local privacy” wherein each user applies the same mechanism independently to their own data (see for example, ). The implicitness comes from the fact that these works assume that the statistics of the data for every user in the dataset are known and follow the same distribution, thus allowing the use of a single mechanism locally; in contrast, we show this explicitly here.
Notation: We use upper-case letters to denote random variables and lower-case letters to denote realizations of random variables. Superscripts are used to denote the length of a vector. We write Var to denote the variance of a random variable and for conditional variance also use the expectation operator as . We write Bern(p) to denote the Bernoulli distribution with parameter p and write DSBS(p) to denote a doubly symmetric binary source with crossover probability p. We write H(·) and I(·;·) to denote the entropy and mutual information, respectively. We also interchangeably use the notation for . We write to denote the distortion vector .
II System Model and Interactive Mechanism
Our problem consists of a discrete source (e.g., the electric power system) that generates -length i.i.d. sequences with , for all . These sequences are partially observed at two agents that interact with one another as shown in Fig. 1 such that the two agents and observe -length sequences and , respectively. The public data at both agents are denoted by and the correlated private data by . Furthermore, we assume that the private data is hidden and can only be leaked through the public data.
We consider a -round interactive protocol in which, without loss of generality, we assume that agent A initiates the interaction and is even. A -interactive privacy mechanism is given by as a collection of probabilistic mappings such that agent A shares data in the odd rounds beginning with round 1 and agent B shares in the even rounds. A privacy mechanism for agent A used in the -th round, , is a mapping from its public data sequence and all prior sequences revealed from agent B. Thus, in round 1, , where is the revealed set when a sequence is shared via . For the odd rounds , the mechanism used by agent A is
Similarly, agent B in even rounds , uses its public data and the prior data sequences revealed from agent A and maps them via a privacy mechanism
From (1) and (2), we see that our model assumes that the private sequences and are not explicitly involved in the mapping such that in the -th and -th rounds, , respectively, and form Markov chains. This is because any dependence of agent A’s ’s on is captured via the ’s from agent B and vice versa. Thus, in any round, conditioned on all data that an agent has until then, what the agent transmits is independent of the private data at the other agent. Our assumption is motivated by the fact that private data is not, in general, accessible and models inferred features that are not known a priori. Our model is motivated by the example we have alluded to earlier, that of operators sharing measurement data in the electric power grid which could lead to estimation of each other’s (private) system state; more broadly, our model captures any interactive application in which personal habits or preferences not known or observed directly can only be inferred from the data collected () or shared (). Furthermore, since our problem model involves a sequence of random mappings, the model directly includes auxiliary random variables , such that their -length sequences are the outputs of the privacy mechanism over rounds. While these auxiliary variables are required to model the problem, it is unclear a priori what the cardinalities of their support set should be, and thus, one needs to develop bounds on them in the process of determining the largest achievable utility-privacy tradeoff region. To this end, we use classical information-theoretic methods to obtain bounds on the cardinalities. It is worth noting that bounds on the cardinalities of the outputs of privacy mechanisms are also seen in problems involving other privacy mechanisms such as differential privacy (e.g., ).
At the end of rounds, agents A and B reconstruct sequences and , respectively, where and are appropriately chosen reconstruction functions. The set of mechanism pairs is chosen to satisfy
where and are the given distortion measures.
The utility-privacy tradeoff region is the set of all tuples for which a privacy mechanism exists and is given by the following theorem.
For a target distortion pair and a -round interactive privacy mechanism, the utility-privacy tradeoff region is the set of all tuples that satisfy
such that for all , the following Markov chains hold:
with where if is odd and if is even.
The proof details are in Appendix A. We briefly review the steps. Achievability follows from using an i.i.d. mechanism in each round and using strong typicality (defined precisely in Appendix A) to bound the achievable leakage at both agents. The converse, on the other hand, considers a mechanism that achieves (3a)-(3d) and exploits the i.i.d. nature of the correlated sources to obtain single-letter bounds. We also note that the Markov chains in (5) and (6) directly capture the fact that at each transmitting agent the data shared in the next round is independent of the private data at the receiving agent conditioned on the data available at the transmitting agent (including the data from the previous rounds). This Markovity is in turn due to the fact that the private data is not directly used in the random mapping in each round. \qed
For , the leakage term in Theorem 1 can be written as such that, for the equivalent interactive source coding problem in , the source coding sum-rate is simply the excess information that needs to be shared beyond what can be inferred at the receiver via , i.e., , and similarly, .
Note that a one-shot setting is one in which both agents share data independently and simultaneously with each other only once.
Without loss of generality we assume we initiate interaction from agent such that the last round of interaction is from agent to agent . We define a compact subset of a finite Euclidean space as
In addition to the tradeoff region, one can also focus on the net leakage over rounds. From Theorem 1, the sum leakage-distortion function over rounds initiated from agent A is
For the region given by Theorem 1 with target distortions and , one can define a sum leakage over any rounds, . Assuming agent A initiates the interactions, we have
One can similarly define for sum leakage over rounds originating from agent B.
For all ,
The bounds in (10) for all follow from the fact that any -round interactive mechanism starting at one of the agents (e.g., A) can be considered as a special case of a -round interactive mechanism starting at the same agent with for all sequences, i.e., a deterministic sequence (w.l.o.g., with entries ) is sent, thereby conveying no information. The bounds in (11) follow from the fact that any -round interactive mechanism initiated at B (respectively A) can be considered as a special case of a -round interactive mechanism initiated at agent A (respectively B) with (respectively ). \qed
From the inequality (10) in Lemma 1, and are both non-increasing in and bounded from below, and thus their limits exist. Furthermore, from the inequality (11) in Lemma 1, . Thus, taking limits, since both and converge we have that and thus, we can define and compute .
II-A Gaussian Sources: Interactive Mechanism
We now consider the case where the data pairs at each agent are drawn according to bivariate Gaussian distributions, i.e., , , and . For jointly Gaussian sources subject to mean square error distortion constraints, we prove that one round of interaction suffices to achieve the utility-privacy tradeoff.
For the private interactive mechanism, the leakage-distortion region under mean square error distortion constraints consists of all tuples satisfying
where and .
If is jointly Gaussian, we can write , where is a zero mean Gaussian random variable independent of and .
Achievability is established by considering a single-round Gaussian mechanism, i.e., the sequence is chosen such that the ‘test channel’ from to yields , where is a zero-mean Gaussian with variance which is independent of the rest of the random variables. The variance is chosen such that the reconstruction function of , i.e., the minimum mean square estimate (MMSE) of given and , is .
To prove the converse, we have
where (15) follows from expanding the mutual information, (16) from using the chain rule for entropy and the fact that the sources are i.i.d., (17) from the fact that conditioning does not increase entropy, (18) from the fact that the conditional differential entropy is maximized by a Gaussian distribution for a given variance, (19) from the concavity of the entropy function, and (20) from the fact that are jointly Gaussian, and thus, can be written as where is independent of and . The final expression in (21) follows from the following facts: (i) is independent of and , and thus, where is a random function of , and therefore, independent of for all ; (ii) from the definition of the quadratic distortion function, is the minimum mean square estimate of given , and thus, , for all , where is the distortion of the entry of ; we use this in conjunction with the fact that to obtain the first term in the denominator of (21); (iii) since , , and thus, since the sources are memoryless (in fact, each instantiation of the source is independent and identically distributed); and finally, the numerator of (21) follows directly from the fact that the sources are jointly Gaussian distributed. We can similarly prove that . \qed
One can notice in the case that is a Markov chain, we have .
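As a numerical companion to the Gaussian case, the sketch below evaluates a single-round additive Gaussian test channel of the form "released value = public value + independent noise" and reports the conditional leakage of the private feature at the receiver together with the MMSE distortion of the public feature. It is a hedged illustration only: the covariance matrix `BASE_COV`, the helper names, and the noise variances are hypothetical, and the code does not reproduce the closed-form region stated in the theorem above. It relies only on the standard log-determinant expression for the conditional mutual information of jointly Gaussian variables.

```python
import numpy as np

# Hypothetical covariance of (X_A, Y_A, X_B): index 0 = public data at A,
# 1 = private data at A, 2 = public data at B. Values are illustrative only.
BASE_COV = np.array([[1.0, 0.8, 0.5],
                     [0.8, 1.0, 0.4],
                     [0.5, 0.4, 1.0]])

def with_released_variable(base_cov, noise_var):
    """Append U = X_A + Z, Z ~ N(0, noise_var) independent of the sources."""
    full = np.zeros((4, 4))
    full[:3, :3] = base_cov
    full[3, :3] = base_cov[0, :]     # Cov(U, .) = Cov(X_A, .)
    full[:3, 3] = base_cov[0, :]
    full[3, 3] = base_cov[0, 0] + noise_var
    return full

def gaussian_cond_mi(cov, a, b, c):
    """I(A;B|C) in nats for jointly Gaussian variables (lists of indices)."""
    def logdet(idx):
        return np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]
    return 0.5 * (logdet(a + c) + logdet(b + c) - logdet(c) - logdet(a + b + c))

def gaussian_cond_var(cov, target, given):
    """Var(target | given): the MMSE of estimating target from given."""
    s_tg = cov[np.ix_([target], given)]
    s_gg = cov[np.ix_(given, given)]
    return (cov[target, target] - s_tg @ np.linalg.solve(s_gg, s_tg.T)).item()

def leakage_and_distortion(noise_var, base_cov=BASE_COV):
    cov = with_released_variable(base_cov, noise_var)
    leakage = gaussian_cond_mi(cov, [1], [3], [2])   # I(Y_A; U | X_B)
    distortion = gaussian_cond_var(cov, 0, [3, 2])   # Var(X_A | U, X_B)
    return leakage, distortion

if __name__ == "__main__":
    for noise_var in (0.1, 0.5, 1.0, 5.0):
        leak, dist = leakage_and_distortion(noise_var)
        print(f"noise={noise_var:4.1f}  leakage={leak:.4f} nats  mmse={dist:.4f}")
```

Sweeping the noise variance traces out one leakage-distortion curve for this hypothetical covariance; larger noise lowers the leakage and pushes the distortion toward the variance of the public data given the receiver's own observation.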
II-B Composition Rules
When guaranteeing privacy, it is important to understand whether a given total leakage budget can be allocated optimally over multiple rounds such that the sum of the leakages in each round does not exceed this total. Thus, we seek to understand if the net leakage constraint can be “composed” (or alternately decomposed) appropriately over multiple rounds. The following theorem summarizes our results.
For a round interactive data sharing setup between two agents A and B, the total leakage constraint can be (de-)composed into leakages, one for each round, without any loss if in each round the privacy mechanism at each agent is chosen conditioned on all data (received and known a priori) available at each agent.
We now show that the information-theoretic model presented here allows taking a total leakage budget and (de-)composing it into parts. We first observe that at the beginning of round ( odd) from agent A to B, agent B has access to from prior rounds and its own data. The leakage for just this round with mechanism , for all can be easily verified to be ( , Theorem 2). On the other hand, the net leakage at agent B over rounds is , where the even numbered terms are zero since for even is a mapping of and thus conditioning on provides no new information. One can similarly write the expression for leakage at agent A.
Thus, we see that the net leakage of the private information of agent A (B) at agent B (A) is simply a sum of the leakages for each round of communication initiated at A (B) and ending at B (A), i.e., the -round privacy mechanism satisfies the desirable composition property that the net leakage is not greater than the sum of the parts. Such a composition is a direct result of the fact that the privacy mechanism in each round is chosen with knowledge of the side information at the receiver agent. \qed
Note that the above composition rule also holds for a one-sided multi-round model in which only one agent shares data for a fixed number of rounds and a net distortion constraint over all rounds.
We note that composition here focuses on taking a total leakage budget and assigning it optimally to each round; in contrast, composition in differential privacy shows that privacy risk is additive when two different mechanisms are used sequentially on the data. Such a composition rule is generally not straightforward to show for mutual information based metrics.
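The additive split of a leakage budget across rounds rests on the chain rule for conditional mutual information. The following sketch checks that identity numerically on a randomly generated joint distribution; it is not the composition theorem itself, only the identity behind it. The variable roles (private data at A, data at B, and two released variables named `U_FIRST` and `U_SECOND`), the alphabet sizes, and the helper names are hypothetical.

```python
import numpy as np

def entropy(p, axes):
    """Joint entropy in bits of the variables on the given axes of pmf p."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in axes)
    marginal = p.sum(axis=drop) if drop else p
    marginal = np.asarray(marginal).ravel()
    marginal = marginal[marginal > 0]
    return float(-np.sum(marginal * np.log2(marginal)))

def cond_mi(p, a, b, c):
    """I(A;B|C) in bits; a, b, c are lists of axes of the joint pmf p."""
    return (entropy(p, a + c) + entropy(p, b + c)
            - entropy(p, c) - entropy(p, a + b + c))

# Axes of the joint pmf (all alphabets are hypothetical):
# 0 = private data at A, 1 = data at B, 2 = first release, 3 = second release
Y_A, X_B, U_FIRST, U_SECOND = 0, 1, 2, 3

def random_joint(seed=0, shape=(2, 2, 3, 3)):
    rng = np.random.default_rng(seed)
    p = rng.random(shape)
    return p / p.sum()

def leakage_split(p):
    """Total leakage at B and its per-release decomposition."""
    total = cond_mi(p, [Y_A], [U_FIRST, U_SECOND], [X_B])
    first = cond_mi(p, [Y_A], [U_FIRST], [X_B])
    # The second release is charged conditioned on what B already holds.
    second = cond_mi(p, [Y_A], [U_SECOND], [X_B, U_FIRST])
    return total, first, second

if __name__ == "__main__":
    total, first, second = leakage_split(random_joint())
    print(f"total={total:.6f}  first={first:.6f}  second={second:.6f}")
    print(f"sum of parts={first + second:.6f}")
```

The key design point mirrored here is that each round's leakage is measured conditioned on everything the receiver already holds; measuring each round unconditionally would in general not add up to the total.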
II-C Interaction Reduces Leakage: Illustration
A natural question in the interactive setting is whether multiple rounds can reduce leakage of the private variables while achieving the desired distortion. In general, it is unclear whether interaction would reduce leakage relative to a one-shot setting. We now present an example for which interaction helps. To make such a comparison, one could compare the leakage of a specific transmitter agent at the other receiver agent over one round with that over multiple rounds such that in both cases the final round terminates at the same receiver agent, i.e., the agent at which a certain level of leakage and distortion is desired. However, depending on whether one chooses an odd or even number of rounds, the transmitter agent need not be the same for both cases if one were to ensure that the receiver agent is the same. Specifically, if we compare the leakage of a single round from agent A to agent B against the leakage over two rounds, for the one-shot communication agent A initiates the data sharing. On the other hand, for the two-round case the interaction is initiated at agent B such that the second round terminates at agent B, thereby allowing us to compare the one-round leakage of agent A’s private data at agent B with that for the two-round interaction setup. We remark that a similar comparison of rate reduction for interactive function computation is developed by Ma et al. in .
We note that our example is similar to the one in  wherein Ma et al. consider an interactive source coding problem for sources at the two agents, i.e., without private data and with constraints on coding rate in place of leakage. However, it is not clear that the optimal mechanisms for the rate-distortion problem remain optimal when minimizing leakage of . In fact, one needs to evaluate the optimal mechanism for the problem at hand in each round due to the presence of private side information at each agent and the leakage function being minimized; we detail these computations below.
We consider binary random variables , , , such that is modeled as a doubly symmetric binary source with parameter , i.e., , with and . Furthermore, and are correlated as follows: and where for , and and are independent of and , respectively. We let and consider an erasure distortion measure as:
One-round sum leakage : We first compute the sum leakage for a one round interaction starting from agent A. Note that in this case even though B does not share data, by definition, the sum leakage includes the leakage of at A. In Appendix B, we show that
For the classical source coding problem with the same distribution defined above for and functional in (22), the optimal minimizing the Wyner-Ziv rate-distortion function is well known. However, it is not clear a priori that the same transition probability distribution will also minimize the leakage in the presence of private features at both agents. In Appendix B, we prove that is indeed minimized by the same distribution that minimizes . This is also a result of independent interest.
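To give a feel for the one-round quantities in this binary example, the sketch below builds a doubly symmetric binary source for the public bits, attaches a private bit at agent A, and evaluates the conditional leakage of a simple erasure-type release. Several ingredients are assumptions made purely for illustration and are not taken from the paper: the private bit is modeled as the public bit passed through a binary symmetric channel with crossover `q`, the release erases the public bit with probability `e`, and the parameter values are arbitrary. The sketch does not claim that this release is the minimizer derived in Appendix B.

```python
import numpy as np

def binary_entropy(t):
    """Binary entropy h(t) in bits, with h(0) = h(1) = 0."""
    if t in (0.0, 1.0):
        return 0.0
    return float(-t * np.log2(t) - (1 - t) * np.log2(1 - t))

def entropy(p, axes):
    """Joint entropy in bits of the variables on the given axes of pmf p."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in axes)
    marginal = p.sum(axis=drop) if drop else p
    marginal = np.asarray(marginal).ravel()
    marginal = marginal[marginal > 0]
    return float(-np.sum(marginal * np.log2(marginal)))

def cond_mi(p, a, b, c):
    """I(A;B|C) in bits; a, b, c are lists of axes of the joint pmf p."""
    return (entropy(p, a + c) + entropy(p, b + c)
            - entropy(p, c) - entropy(p, a + b + c))

def joint_pmf(p, q, e):
    """Joint pmf over (Y_A, X_A, X_B, U) under the assumptions above.

    (X_A, X_B) is a DSBS with crossover p, Y_A = X_A xor Bern(q) (assumed),
    and U equals X_A with probability 1 - e or the erasure symbol 2 otherwise.
    """
    pmf = np.zeros((2, 2, 2, 3))
    for y in range(2):
        for xa in range(2):
            for xb in range(2):
                base = 0.5
                base *= q if y != xa else 1 - q
                base *= p if xa != xb else 1 - p
                pmf[y, xa, xb, xa] = base * (1 - e)  # bit released as is
                pmf[y, xa, xb, 2] = base * e         # bit erased
    return pmf

def one_round_leakage(p, q, e):
    """I(Y_A; U | X_B): leakage of A's private bit at B for this release."""
    return cond_mi(joint_pmf(p, q, e), [0], [3], [2])

if __name__ == "__main__":
    p, q = 0.1, 0.2  # hypothetical parameters
    for e in (0.0, 0.3, 0.6, 1.0):
        print(f"erasure={e:.1f}  leakage={one_round_leakage(p, q, e):.4f} bits")
```

Under these assumptions the leakage scales linearly with the probability of not erasing, namely (1 - e) times h(p(1 - q) + q(1 - p)) - h(q), which the code reproduces numerically.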
Two-round sum leakage : We now compute the sum leakage for a two-round interaction starting from agent B in round 1 and returning from A to B in round 2. Let denote the output of the mapping in round 1 from B to A and denote the output of the mapping in round 2 from A to B. We will explicitly construct a mechanism pair and which leads to an admissible tuple . Let be a binary symmetric channel with crossover probability , i.e., . We choose the conditional pmf as given in Table I and let .
For a given value for the DSBS parameter, , there are several values of pair such that . For example, for , , and , is
and the corresponding distortion is . By computing and comparing it with (24) for the same distortion, we have . Thus, interaction reduces leakage.
In , using the same , and as described above, Ma et al. show that interaction reduces the sum-rate over two rounds relative to one round for specific values of , , and . However, as discussed earlier, it was not clear whether the same parameters in  also reduce leakage of correlated hidden variables in our problem. We have verified that for different values of and including those in , the two-round sum leakage is smaller than the one-round leakage.
III When Does Interaction Help?
An important question to address in the interactive setting is whether interaction actually reduces leakage relative to a one-round mechanism. In this section, we introduce a test for checking when multiple rounds of interaction help. Our approach is modeled along the lines of the method in  by Ma et al. in which an interactive source coding problem is considered. However, since our source model includes a pair of public and private variables at each agent, we extend the methods in  to the problem setting at hand. The characterization of in (8) does not give us any bounds on the rate of convergence to for a given distribution . Thus, as in , we use the fact that the sum-leakage function depends on the source distribution only via marginal distributions and and characterize the convergence of to for a set of source distributions with the same marginals; this in turn allows us to identify three conditions on the sum-leakage function required for interaction to reduce leakage.
Without loss of generality, let agent A initiate a K-round interaction. The goal is to characterize the family of source distributions for which interaction helps. To this end, we define “leakage reduction” functions and as follows.
The leakage reduction over rounds initiated at agent A is defined as
For a -round interaction initiated at agent B, the corresponding leakage reduction function is
Note that depends on the distributions and . Evaluating is equivalent to evaluating . Definition 2 enables us to characterize the properties of which then gives us . The goal is to determine source distributions for which where is the leakage reduction in the absence of interaction. When , we have and .
For a given source, since it is generally not possible to precisely determine the rate of convergence of to , we focus, as in , on determining the set of source distributions for which is strictly decreasing. This leads us to define the set of structured neighborhoods of , i.e., a collection of all joint distributions that have the same marginals, as follows.
The marginal perturbation set for a given joint distribution is defined as
where is the majorizing operator. One can similarly define .
Note that and are nonempty sets as they contain . Furthermore, for all