Preserving Data-Privacy with Added Noises: Optimal Estimation and Privacy Analysis

Abstract

Networked systems often rely on distributed algorithms to achieve a global computation goal through iterative local information exchanges between neighbor nodes. To preserve data privacy, a node may add a random noise to its original data at each iteration of the information exchange. Nevertheless, a neighbor node can estimate others' original data based on the information it receives. The estimation accuracy and data privacy can be measured in terms of $(\epsilon, \delta)$-data-privacy, defined as: the probability of an $\epsilon$-accurate estimation (one whose difference from the original data is within $\epsilon$) is no larger than $\delta$ (the disclosure probability). How to optimize the estimation and analyze data privacy is a critical and open issue. In this paper, a theoretical framework is developed to investigate how to optimize the estimation of a neighbor's original data using the local information received, named optimal distributed estimation. Then, we study the disclosure probability under the optimal estimation for data privacy analysis. We further apply the developed framework to analyze the data privacy of the privacy-preserving average consensus algorithm and identify the optimal noises for the algorithm.

\begin{IEEEkeywords}
Distributed algorithm, noise adding mechanism, distributed estimation, data privacy, average consensus.
\end{IEEEkeywords}

\IEEEpeerreviewmaketitle

1 Introduction

Without relying on a central controller, distributed algorithms are robust and scalable, and thus they have been widely adopted in networked systems to achieve global computation goals (e.g., the mean and variance of the distributed data) through iterative local information exchanges between neighbor nodes [1, 2, 3]. In many scenarios, e.g., social networks, the nodes' original data may include users' private or sensitive information, e.g., age, income, daily activities, and opinions. Due to privacy concerns, nodes in the network may not be willing to share their real data with others. To preserve data privacy, a typical method is to add random noises to the data to be released in each iteration. With this noise adding procedure, the goal of privacy-preserving distributed algorithms is to ensure data privacy while achieving the global computation goal [4, 6, 5].

Consensus, an efficient distributed computing and control algorithm, has been heavily investigated and widely applied, e.g., in distributed estimation and optimization [8, 9], distributed energy management and scheduling [10, 11], and time synchronization in sensor networks [12, 14, 13]. Recently, the privacy-preserving average consensus problem has attracted attention; the aim is to guarantee that the privacy of the initial states is preserved while average consensus can still be achieved [15, 18, 16, 17, 19]. The main solution is to add variance-decaying and zero-sum random noises during each iteration of the consensus process, as sketched below.
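To make this concrete, the following is a minimal simulation sketch (Python/NumPy; the ring topology, weights, and all parameter choices are illustrative assumptions, not taken from the cited works) in which each node masks its state with the difference of geometrically decaying random terms, one common way to obtain variance-decaying, asymptotically zero-sum noises:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, rho = 10, 200, 0.9            # nodes, iterations, noise decay rate
x = rng.uniform(0, 100, n)          # private initial states x_i(0)
target = x.mean()                   # the global computation goal

# Doubly stochastic weights on a ring (each node averages itself and 2 neighbors)
W = np.eye(n) / 3
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

nu_prev = np.zeros(n)
for k in range(T):
    nu = rho**k * rng.normal(0.0, 1.0, n)  # variance-decaying random term
    theta = nu - nu_prev                   # telescoping noise: sums to ~0 over time
    nu_prev = nu
    x = W @ (x + theta)                    # consensus update on the noisy outputs

print(x.max() - x.min())        # ~0: the states reach consensus
print(abs(x.mean() - target))   # ~0: the exact average is preserved
```

Because the injected noises telescope, their running sum vanishes and the average is recovered in the limit even though every released value is randomized.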

In the literature, differential privacy, a formal mathematical standard, has been defined and applied to quantify to what extent individual privacy in a statistical database is preserved [20]. It aims to provide means to maximize the accuracy of queries from statistical databases while maintaining the indistinguishability of its transcripts. To guarantee differential privacy, a commonly used noise is Laplacian noise [21, 22].

Different from database query problems, for many distributed computing algorithms such as consensus, the key privacy concern is to ensure that other nodes cannot accurately estimate the original data, rather than the indistinguishability of outputs. No matter what type of noise distribution is used, there is a chance that an estimate of the original data is close to the real data, and such a probability cannot be directly measured by differential privacy. To quantify the estimation accuracy and data privacy, we first define $\epsilon$-accurate estimation, i.e., the difference between the estimated value and the original data is no larger than $\epsilon$. We then define $(\epsilon, \delta)$-data-privacy in [6] as the property that the probability of $\epsilon$-accurate estimation is no larger than $\delta$. Using the $(\epsilon, \delta)$-data-privacy definition, in this paper, we develop a theoretical framework to investigate how to optimize the estimation of a neighbor's original data using the local information received, named optimal distributed estimation. Then, we study the disclosure probability under the optimal estimation for data privacy analysis. The main contributions of this work are summarized as follows.

  1. To the best of our knowledge, this is the first work to mathematically formulate and solve the optimal distributed estimation problem and the data privacy problem for distributed algorithms with a general noise adding mechanism. The optimal distributed estimation is defined as the estimation that achieves the highest disclosure probability, $\delta$, of an $\epsilon$-accurate estimation, given the available information set.

  2. A theoretical framework is developed to analyze the optimal distributed estimation and data privacy of a distributed algorithm with a noise adding procedure, where closed-form solutions of both the optimal distributed estimation and the disclosure probability $\delta$ are obtained. The obtained results show how the iteration process and the noise adding sequence affect the estimation accuracy and data privacy, which reveals the relationship among the noise distribution, the estimation, and data privacy.

  3. We apply the obtained theoretical framework to analyze the privacy of a general privacy-preserving average consensus algorithm (PACA), and quantify the $(\epsilon, \delta)$-data-privacy of PACA. We also identify the condition under which the data privacy may be compromised. We further obtain the optimal noise distribution for PACA, under which the disclosure probability of $\epsilon$-accurate estimation is minimized, i.e., the highest data privacy is achieved.

The rest of this paper is organized as follows. Section 2 provides preliminaries and formulates the problem. The optimal distributed estimation and the privacy analysis under different available information sets are discussed in Sections 3 and 4, respectively. In Section 5, we apply the framework to analyze the data privacy of PACA. Concluding remarks and further research issues are given in Section 6.

2 Preliminaries and Problem Formulation

A networked system is abstracted as an undirected and connected graph, denoted by $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. An edge $(i, j) \in E$ exists if and only if (iff) node $i$ can exchange information with node $j$. Let $N_i = \{j \in V : (i, j) \in E\}$ be the neighbor set of node $i$ ($i \notin N_i$). Let $n = |V|$ be the total number of nodes. Each node $i$ in the network has an initial scalar state $x_i(0)$, which can be any type of data, e.g., the sensed or measured data of the node. Let $x(0) = [x_1(0), \dots, x_n(0)]^T$ be the initial state vector.

Symbol Definition
$G$: the network graph
$x_i(0)$: node $i$'s initial state
$x(0)$: the initial state vector of all nodes
$f_i$: the distributed iteration algorithm
$\mathbb{S}_\theta$: the domain of random variable $\theta$
$f_\theta(y)$: the PDF of random variable $\theta$
$\theta_i^k$: the noise input of node $i$ until iteration $k$
$x_i^{k+}$: the information output of node $i$ until iteration $k$
$\hat{x}_i^*(k)$: the optimal distributed estimation of $x_i(0)$ until iteration $k$
$\epsilon$: the measure of estimation accuracy
$\delta$: the disclosure probability
$x_i^{k+}(x)$: the possible output when the initial input is $x$
$\mathcal{I}_i^k$: the information available to a neighbor node to estimate $x_i(0)$ until iteration $k$
Table 1: Important Notations

2.1 Privacy-Preserving Distributed Algorithm

The goal of a distributed algorithm is to obtain the statistics of all nodes' initial states (e.g., the average, maximum, or minimum value, the variance, etc.) in a distributed manner. Nodes in the network use local information exchanges to achieve this goal, and thus each node communicates with its neighbor nodes periodically for data exchange and state update. Due to privacy concerns, each node is not willing to release its real initial state to its neighbor nodes. A widely used approach for privacy preservation is to add random noises at each iteration to the data exchanged locally.

Define the data sent out by node $i$ in iteration $k$ as
\[ x_i^+(k) = x_i(k) + \theta_i(k), \tag{1} \]
where $\theta_i(k)$ is a random variable. When node $i$ receives the information from its neighbor nodes, it updates its state using the following function,

\[ x_i(k+1) = f_i\left( x_i^+(k),\, \{ x_j^+(k) : j \in N_i \} \right), \tag{2} \]

where the state-transition function, $f_i$, depends on $x_i^+(k)$ and $x_j^+(k)$ for $j \in N_i$ only. The above equation defines a distributed iteration algorithm with privacy preservation, since only the neighbor nodes' information is used for the state update in each iteration and the data exchanged have been mixed with random noises to preserve privacy. Hence, (2) is named a privacy-preserving distributed algorithm. Since the initial state is the most important data for each node in the sense of privacy, in this paper we focus on the estimation and privacy analysis of nodes' initial states.
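As an illustration of the interfaces in (1) and (2), here is a minimal sketch (Python; the equal-weight averaging rule and the Laplace noise are arbitrary assumptions for illustration, since the framework leaves $f_i$ and the noise distribution general):

```python
import numpy as np

rng = np.random.default_rng(1)

def send(x_i, noise_draw):
    """Eq. (1): release the noise-masked state x_i^+(k) = x_i(k) + theta_i(k)."""
    return x_i + noise_draw()

def update(f_i, x_plus_i, x_plus_neighbors):
    """Eq. (2): the next state depends only on the noisy outputs of i and N_i."""
    return f_i(x_plus_i, x_plus_neighbors)

# Example instance: f_i as equal-weight averaging, Laplace noise as the mask.
f_avg = lambda xp, xps: (xp + sum(xps)) / (1 + len(xps))
laplace = lambda: rng.laplace(0.0, 1.0)

x = {1: 10.0, 2: 20.0, 3: 30.0}               # private states on a triangle graph
outputs = {i: send(x[i], laplace) for i in x}  # the x_i^+(0) everyone observes
x_next = {i: update(f_avg, outputs[i], [outputs[j] for j in x if j != i]) for i in x}
```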

2.2 Important Notations and Definitions

Define the noise input and state/information output sequences of node $i$ in the privacy-preserving distributed algorithm until iteration $k$ by
\[ \theta_i^k = \left\{ \theta_i(0),\, \theta_i(1),\, \dots,\, \theta_i(k) \right\} \tag{3} \]
and
\[ x_i^{k+} = \left\{ x_i^+(0),\, x_i^+(1),\, \dots,\, x_i^+(k) \right\}, \tag{4} \]

respectively. Note that any neighbor node $j \in N_i$ can not only receive the information output of node $i$, but can also eavesdrop on the information outputs of all their common neighbor nodes, which means that more information may be available for node $j$ to estimate $x_i(0)$ at iteration $k$. Hence, we define
\[ \mathcal{I}_i^k = x_i^{k+} \cup \left\{ x_m^{k+} : m \in N_i \cap N_j \right\} \]
as the available information set/outputs for a neighbor node $j$ to estimate $x_i(0)$ of node $i$ at iteration $k$. Clearly, we have $x_i^{k+} \subseteq \mathcal{I}_i^k$ and $\mathcal{I}_i^k \subseteq \mathcal{I}_i^{k+1}$.

Let $f_\theta(y)$ be the probability density function (PDF) of a random variable $\theta$. Let $\mathbb{S}_\theta$ be the set of the possible values of $\theta$. Clearly, if $\mathbb{S}_\theta = \mathbb{R}$, it means that $\theta$ can be any value in $\mathbb{R}$. Given any function $f(y)$, we define the function $f_\epsilon(y)$ as
\[ f_\epsilon(y) = \int_{y-\epsilon}^{y+\epsilon} f(z)\, dz, \tag{5} \]
and let
\[ \mathbb{S}^0_{f_\epsilon} = \left\{ y : \frac{d f_\epsilon(y)}{d y} = 0 \right\} \tag{6} \]
be the zero-point set of $\frac{d f_\epsilon(y)}{d y}$. Let $\partial \mathbb{S}$ be the boundary point set of a given set $\mathbb{S}$, e.g., $\partial [0, 1] = \{0, 1\}$.

Note that each node can estimate its neighbor nodes' initial states based on all the information it knows, i.e., its available information set. For example, based on $\mathcal{I}_i^0$, node $j$ can use the probability distribution over the noise space $\mathbb{S}_\theta$ to estimate the value of the added noise $\theta_i(0)$, and then infer the initial state of node $i$ using the fact that the difference between $x_i^+(0)$ and the real initial state $x_i(0)$ equals the added noise, i.e., $\theta_i(0) = x_i^+(0) - x_i(0)$. Hence, we give two definitions for the estimation as follows.

Definition 2.1

Let $\hat{x}$ be an estimation of a variable $x$. If $|\hat{x} - x| \leq \epsilon$, where $\epsilon \geq 0$ is a small constant, then we say $\hat{x}$ is an $\epsilon$-accurate estimation.

Note that $x_i^{k+}$ is the information output sequence of node $i$, which is directly related to $x_i(0)$, and this should be considered in the estimation. Since only the local information is available for the estimation, we define the optimal distributed estimation of $x_i(0)$ as follows.

Definition 2.2

Let $x_i^{k+}(x)$ be the possible output sequence given the condition that $x_i(0) = x$, at iteration $k$. Considering $\epsilon$-accurate estimation, under $\mathcal{I}_i^k$,
\[ \hat{x}_i^*(k) = \arg\max_{\hat{x}_i(k)} \Pr\left\{ \left| \hat{x}_i(k) - x_i(0) \right| \leq \epsilon \,\middle|\, \mathcal{I}_i^k \right\} \]
is named the optimal distributed estimation of $x_i(0)$ at iteration $k$. Then, $\hat{x}_i^* = \lim_{k \to \infty} \hat{x}_i^*(k)$ is named the optimal distributed estimation of $x_i(0)$.

In order to quantify the degree of privacy protection of the privacy-preserving distributed algorithm and to construct a relationship between the estimation accuracy and the privacy, we introduce the following $(\epsilon, \delta)$-data-privacy definition.

Definition 2.3

A distributed randomized algorithm is $(\epsilon, \delta)$-data-private, iff
\[ \delta = \Pr\left\{ \left| \hat{x}_i^* - x_i(0) \right| \leq \epsilon \right\}, \tag{7} \]
where $\delta$ is the disclosure probability that the initial state $x_i(0)$ can be successfully estimated by others using the optimal distributed estimation, i.e., that the estimate falls in the given interval $[x_i(0) - \epsilon,\; x_i(0) + \epsilon]$.

In the above definition, $\hat{x}_i^*$ depends on the output sequences, $x_i^{k+}$, which are functions of the random noise inputs and of the neighbors' outputs. All the possible outputs under a privacy-preserving distributed algorithm should be considered to calculate $\delta$, and thus $\hat{x}_i^*$ is a random variable in (7). There are two important parameters in the privacy definition, $\epsilon$ and $\delta$, where $\epsilon$ denotes the estimation accuracy and $\delta$ ($0 \leq \delta \leq 1$) is the disclosure probability denoting the degree of privacy protection. A smaller value of $\epsilon$ corresponds to a higher accuracy requirement, and a smaller value of $\delta$ corresponds to a lower maximum disclosure probability.
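For intuition, the disclosure probability can be approximated by Monte Carlo simulation. The sketch below (Python/NumPy; parameters are illustrative) uses zero-mean Gaussian noise, for which the optimal one-step estimate derived in Section 3 reduces to taking the output at face value (subtracting $\theta^* = 0$), and compares the empirical frequency of $\epsilon$-accurate estimation with the closed-form value:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
x0, eps, sigma, N = 42.0, 0.5, 1.0, 200_000

theta = rng.normal(0.0, sigma, N)    # noise draws theta_i(0), one per trial
x_plus = x0 + theta                  # the released outputs x_i^+(0)
x_hat = x_plus - 0.0                 # optimal estimate subtracts theta* = 0 here

delta_mc = np.mean(np.abs(x_hat - x0) <= eps)   # empirical disclosure probability
delta_exact = erf(eps / (sigma * sqrt(2)))      # Pr{|theta| <= eps} for N(0, sigma^2)
print(delta_mc, delta_exact)                    # both ~0.383
```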

2.3 Problems of Interest

We have the following basic assumptions: i) if there is no information about a variable available for the estimation, then its domain is viewed as $\mathbb{R}$; ii) unless otherwise specified, the global topology information is unknown to each node; iii) the initial states of the nodes in the network are independent of each other, i.e., each node cannot estimate the other nodes' states directly based on its own state, or such an estimation is of low accuracy.

In this paper, we aim to provide a theoretical framework for the optimal distributed estimation and data privacy analysis of the privacy-preserving distributed algorithm (2). Specifically, we are interested in the following three issues: i) how to obtain the optimal distributed estimation and its closed-form expression for the distributed algorithm (2); ii) using the $(\epsilon, \delta)$-data-privacy definition to analyze the privacy of the distributed algorithm (2), i.e., obtaining the closed-form expression of the disclosure probability $\delta$ and its properties; and iii) using the obtained theoretical results to analyze the privacy of the existing privacy-preserving average consensus algorithm, and finding the optimal noise adding process for the algorithm, i.e.,
\[ \min_{f_\theta} \delta \quad \text{s.t.} \quad \lim_{k \to \infty} x_i(k) = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i(0), \;\; \forall i, \tag{8} \]
where $\bar{x}$ is the statistical goal; that is, we aim at minimizing the disclosure probability while obtaining the average value of all initial states.

To solve the above issues, in the following we first consider the case where only the one-step information output $x_i^+(0)$, which depends on the initial state $x_i(0)$ and the one-step noise $\theta_i(0)$, is available, and we obtain the optimal distributed estimation and its privacy properties. This case is suitable for the general one-step random mechanism (e.g., [23, 26]), and the theoretical results provide the foundation of the subsequent analysis. Then, we consider the optimal distributed estimation under the information set $\mathcal{I}_i^1$, which reveals how the iteration process affects the estimation and helps in understanding the optimal distributed estimation under the information set $\mathcal{I}_i^k$ ($k \geq 1$). Based on these observations, we extend the results to the general case where $\mathcal{I}_i^k$ ($k \geq 1$) is available for the estimation. Lastly, we apply the obtained results to the general PACA algorithm for privacy analysis, and discuss the optimal noises for preserving data privacy.

3 Optimal Distributed Estimation and Privacy Analysis under $\mathcal{I}_i^0$

In this section, the optimal distributed estimation of $x_i(0)$ using the information $\mathcal{I}_i^0$ only is investigated, and the disclosure probability $\delta$ under the optimal estimation is derived.

3.1 Optimal Distributed Estimation under $\mathcal{I}_i^0$

Let $\hat{x}_i(0)$ be the estimation of $x_i(0)$ under $\mathcal{I}_i^0$. The optimal distributed estimation of $x_i(0)$ under $\mathcal{I}_i^0$ and its closed-form expression are given in the following theorem.

Theorem 3.1

Considering the distributed algorithm (2), under $\mathcal{I}_i^0$, the optimal distributed estimation of $x_i(0)$ satisfies
\[ \hat{x}_i^*(0) = x_i^+(0) - \theta^*_{x_i^+(0)}, \tag{9} \]
where
\[ \theta^*_{x_i^+(0)} = \arg\max_{y \in \mathbb{S}_\theta:\; x_i^+(0) - y \in \mathbb{S}_x} \int_{y-\epsilon}^{y+\epsilon} f_{\theta_i(0)}(z)\, dz. \tag{10} \]
Specifically, if $\mathbb{S}_x = \mathbb{R}$, then
\[ \hat{x}_i^*(0) = x_i^+(0) - \theta^*, \tag{11} \]
where
\[ \theta^* = \arg\max_{y \in \mathbb{S}_\theta} \int_{y-\epsilon}^{y+\epsilon} f_{\theta_i(0)}(z)\, dz, \tag{12} \]
which is independent of $x_i^+(0)$. {proof} Given $x_i^+(0)$ and an estimation $\hat{x}_i(0)$, we have
\[ \Pr\left\{ |\hat{x}_i(0) - x_i(0)| \leq \epsilon \,\middle|\, x_i^+(0) \right\} = \Pr\left\{ \theta_i(0) \in \left[ x_i^+(0) - \hat{x}_i(0) - \epsilon,\; x_i^+(0) - \hat{x}_i(0) + \epsilon \right] \,\middle|\, x_i^+(0) \right\}. \tag{13} \]

From Definition 2.2, it follows that
\[ \hat{x}_i^*(0) = \arg\max_{\hat{x}_i(0) \in \mathbb{S}_x} \Pr\left\{ \theta_i(0) \in \left[ x_i^+(0) - \hat{x}_i(0) - \epsilon,\; x_i^+(0) - \hat{x}_i(0) + \epsilon \right] \,\middle|\, x_i^+(0) \right\}, \tag{14} \]

which concludes that (9) holds.

If $\mathbb{S}_x = \mathbb{R}$, then for any real-number output $x_i^+(0)$, every $y \in \mathbb{S}_\theta$ satisfies the feasibility constraint $x_i^+(0) - y \in \mathbb{S}_x$ in (10). In this case, we have
\[ \Pr\left\{ \theta_i(0) \in [y - \epsilon,\; y + \epsilon] \right\} = \int_{y-\epsilon}^{y+\epsilon} f_{\theta_i(0)}(z)\, dz. \tag{15} \]

Substituting (15) into (14) gives
\[ \hat{x}_i^*(0) = x_i^+(0) - \arg\max_{y \in \mathbb{S}_\theta} \int_{y-\epsilon}^{y+\epsilon} f_{\theta_i(0)}(z)\, dz, \]

i.e., (11) holds. Thus, we have completed the proof.

In (9), $\theta^*_{x_i^+(0)}$ can be viewed as the estimation of the noise $\theta_i(0)$, i.e., $\hat{\theta}_i(0) = \theta^*_{x_i^+(0)}$. Thus, (9) can be written as
\[ \hat{x}_i^*(0) = x_i^+(0) - \hat{\theta}_i(0), \]
which means that the estimation problem is equivalent to estimating the value of the added noise. From (10), it is noted that $\theta^*_{x_i^+(0)}$ depends on $f_{\theta_i(0)}$, $\mathbb{S}_x$, $\epsilon$, and $x_i^+(0)$. We use Fig. 1(a) as an example to illustrate how to obtain $\theta^*_{x_i^+(0)}$ and $\hat{x}_i^*(0)$ when $\mathbb{S}_x \neq \mathbb{R}$. In the example, the blue curve is $f_{\theta_i(0)}$ (it follows the Gaussian distribution), $\mathbb{S}_x$ is a strict subset of $\mathbb{R}$, and $x_i^+(0)$ is the fixed initial output.

Figure 1: Two examples of the optimal distributed estimation under $\mathcal{I}_i^0$, considering $\mathbb{S}_x \neq \mathbb{R}$ (case (a)) and $\mathbb{S}_x = \mathbb{R}$ (case (b)), respectively.

Given an $\epsilon$ and a point $y$, $\int_{y-\epsilon}^{y+\epsilon} f_{\theta_i(0)}(z)\, dz$ denotes the shaded area of $f_{\theta_i(0)}$ over the interval $[y-\epsilon, y+\epsilon]$, which is named the $\epsilon$-shaded area of $f_{\theta_i(0)}$ at point $y$. Clearly, when $y = \theta^*_{x_i^+(0)}$, $f_{\theta_i(0)}$ has the largest $\epsilon$-shaded area among all feasible points, and the optimal estimation then follows from (9). Meanwhile, we consider the case that $\mathbb{S}_x = \mathbb{R}$, or $\mathbb{S}_x$ is not available to the other nodes, and use Fig. 1(b) as an example for illustration. In this case, we have $\theta^*_{x_i^+(0)} = \theta^*$ for any output $x_i^+(0)$. From the above theorem, we then have the optimal distributed estimation $\hat{x}_i^*(0) = x_i^+(0) - \theta^*$ given any output $x_i^+(0)$.

Next, a general approach is introduced to calculate the value of $\theta^*_{x_i^+(0)}$. Let $f = f_{\theta_i(0)}$ and let $f_\epsilon$ be its $\epsilon$-shaded area function defined in (5). Note that
\[ \frac{d f_\epsilon(y)}{d y} = f(y + \epsilon) - f(y - \epsilon). \]
It is well known that $\frac{d f_\epsilon(y)}{d y} = 0$ is a necessary condition for an interior point $y$ to be an extreme point of $f_\epsilon$. It then follows from (10) that $\theta^*_{x_i^+(0)}$ is either one of the extreme points of $f_\epsilon$ (i.e., a point in $\mathbb{S}^0_{f_\epsilon}$) or one of the boundary points of the feasible set $\mathbb{S} = \{ y \in \mathbb{S}_\theta : x_i^+(0) - y \in \mathbb{S}_x \}$ (i.e., a point in $\partial \mathbb{S}$). Letting $\mathbb{C} = \left( \mathbb{S}^0_{f_\epsilon} \cup \partial \mathbb{S} \right) \cap \mathbb{S}$,
we then have
\[ \theta^*_{x_i^+(0)} = \arg\max_{y \in \mathbb{C}} f_\epsilon(y). \tag{16} \]

Applying the above general approach to the examples of Fig. 1, one can easily obtain the candidate sets $\mathbb{C}$, and hence $\theta^*_{x_i^+(0)}$ and $\theta^*$, for the two cases, respectively. Based on (16), we obtain the same optimal estimations for the two cases as derived above.
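To make (16) concrete, the following numerical sketch (Python; it assumes SciPy is available) locates the maximizer of the $\epsilon$-shaded area by grid search for two illustrative noise laws; these densities are stand-ins for the two cases of Fig. 1, not the exact curves used there:

```python
import numpy as np
from scipy.stats import norm, expon

eps = 0.5

def shaded_area(cdf, y):
    # eps-shaded area: integral of the noise pdf over [y - eps, y + eps]
    return cdf(y + eps) - cdf(y - eps)

# Case 1: Gaussian noise on R -- the maximizer coincides with the mode (0 here),
# so theta* = 0 and the optimal estimate is simply x_i^+(0).
grid = np.linspace(-5.0, 5.0, 20001)
print(grid[np.argmax(shaded_area(norm(0, 1).cdf, grid))])                # ~0.0

# Case 2: exponential noise with support [0, inf) -- the support boundary
# pushes the maximizer to y = eps, away from the pdf's peak at y = 0.
grid_pos = np.linspace(0.0, 5.0, 20001)
print(grid_pos[np.argmax(shaded_area(expon(scale=1.0).cdf, grid_pos))])  # ~0.5
```

The second case illustrates why (16) must check boundary points of the support as well as the zero-points of the derivative.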

Remark 3.2

From the above discussion, it is observed that $\theta^*_{x_i^+(0)}$ is the feasible point that has the largest $\epsilon$-shaded area around it. It should be pointed out that $\theta^*_{x_i^+(0)}$ lies in the feasible set and depends on $x_i^+(0)$, and thus it may not be the point at which $f_{\theta_i(0)}$ attains its maximum value. However, if $\epsilon$ is sufficiently small and $f_{\theta_i(0)}$ is continuous, $f_{\theta_i(0)}$ typically has its largest $\epsilon$-shaded area at the point where it attains its maximum value. Meanwhile, the above examples also show that an unbiased estimation may not be the optimal distributed estimation of $x_i(0)$.
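The last point of the remark can be checked numerically. In the exponential-noise case from the sketch above, the noise mean is $1/\lambda$ while the $\epsilon$-shaded-area maximizer is $\theta^* = \epsilon$, so the unbiased estimator is strictly worse (a Monte Carlo sketch with illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
x0, eps, lam, N = 7.0, 0.5, 1.0, 500_000

theta = rng.exponential(1 / lam, N)   # noise draws with support [0, inf)
x_plus = x0 + theta                   # released outputs

unbiased = x_plus - 1 / lam           # subtract the mean of the noise
optimal  = x_plus - eps               # subtract theta* = eps (see the grid search above)

print(np.mean(np.abs(unbiased - x0) <= eps))  # Pr{theta in [1/lam-eps, 1/lam+eps]} ~0.38
print(np.mean(np.abs(optimal  - x0) <= eps))  # Pr{theta in [0, 2*eps]} ~0.63, larger
```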

3.2 Privacy Analysis under $\mathcal{I}_i^0$

In the above subsection, we have obtained the optimal distributed estimation when $x_i^+(0)$ is fixed. Note that
\[ \Pr\left\{ |\hat{x}_i^*(0) - x_i(0)| \leq \epsilon \right\} = \Pr\left\{ \theta_i(0) \in \left[ \theta^*_{x_i^+(0)} - \epsilon,\; \theta^*_{x_i^+(0)} + \epsilon \right] \right\} \tag{17} \]
when $x_i^+(0)$ is fixed. To analyze the privacy of the distributed algorithm (2) with the $(\epsilon, \delta)$-data-privacy definition, the main goal is to calculate the disclosure probability $\delta$, so that all the possible initial outputs $x_i^+(0)$ and their corresponding optimal distributed estimations should be considered. Considering the outputs for which an $\epsilon$-accurate estimation of $x_i(0)$ can be obtained, we define the set of all the corresponding noises by
\[ \Theta_\epsilon = \left\{ \theta \in \mathbb{S}_\theta : \left| \theta - \theta^*_{x_i(0)+\theta} \right| \leq \epsilon \right\}. \tag{18} \]

For each $\theta_i(0) \in \Theta_\epsilon$, we have $x_i^+(0) = x_i(0) + \theta_i(0)$ and $|\theta_i(0) - \theta^*_{x_i^+(0)}| \leq \epsilon$, i.e., an $\epsilon$-accurate estimation is obtained when $\theta_i(0) \in \Theta_\epsilon$.

Theorem 3.3

Considering the distributed algorithm (2), under $\mathcal{I}_i^0$, the disclosure probability satisfies
\[ \delta = \int_{\Theta_\epsilon} f_{\theta_i(0)}(z)\, dz. \tag{19} \]
Specifically, if $\mathbb{S}_x = \mathbb{R}$, then
\[ \delta = \int_{\theta^* - \epsilon}^{\theta^* + \epsilon} f_{\theta_i(0)}(z)\, dz. \tag{20} \]
{proof}

From (17) and the definition of $\Theta_\epsilon$, we have
\[ \delta = \Pr\left\{ \theta_i(0) \in \Theta_\epsilon \right\} = \int_{\Theta_\epsilon} f_{\theta_i(0)}(z)\, dz. \tag{21} \]

From Theorem 3.1, if $\mathbb{S}_x = \mathbb{R}$, then $\theta^*_{x_i^+(0)} = \theta^*$, which is independent of $x_i^+(0)$. In this case, we have
\[ \Theta_\epsilon = \left[ \theta^* - \epsilon,\; \theta^* + \epsilon \right], \tag{22} \]

i.e., only if $\theta_i(0) \in [\theta^* - \epsilon, \theta^* + \epsilon]$ can we obtain an $\epsilon$-accurate estimation of $x_i(0)$. Then,
\[ \delta = \Pr\left\{ \theta_i(0) \in [\theta^* - \epsilon,\; \theta^* + \epsilon] \right\}. \tag{23} \]

Since $\theta^*$ satisfies (12), $\theta^*$ is the point that has the largest $\epsilon$-shaded area around it, and the domain of $\theta_i(0)$ is $\mathbb{S}_\theta$. It follows that
\[ \delta = \int_{\theta^* - \epsilon}^{\theta^* + \epsilon} f_{\theta_i(0)}(z)\, dz. \tag{24} \]

We thus have completed the proof.

From the above theorem, (19) provides the expression of the disclosure probability $\delta$ under $\mathcal{I}_i^0$. Using (19), the main challenge in calculating $\delta$ is how to obtain the set $\Theta_\epsilon$. Although, based on the definition of $\Theta_\epsilon$, the elements of $\Theta_\epsilon$ can be obtained by comparing all possible values of $x_i^+(0)$ with their corresponding $\theta^*_{x_i^+(0)}$ (how to obtain the value of $\theta^*_{x_i^+(0)}$ was discussed in the previous subsection), this approach is infeasible due to the infinitely many possible values of $x_i^+(0)$. Fortunately, we can exploit the properties of $f_{\theta_i(0)}$ to obtain $\Theta_\epsilon$ quickly in many cases of practical importance. For the example given in Fig. 1(a), since $f_{\theta_i(0)}$ is continuous and concave on the relevant interval, the condition $|\theta - \theta^*_{x_i(0)+\theta}| \leq \epsilon$ in (18) reduces to a set of inequalities in $\theta$. Based on these inequalities, for any given $\epsilon$ and $x_i(0)$, we obtain all the boundary points of $\Theta_\epsilon$ by solving $|\theta - \theta^*_{x_i(0)+\theta}| = \epsilon$, and thus $\Theta_\epsilon$ is obtained.
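For the case $\mathbb{S}_x = \mathbb{R}$, (20) is available in closed form for standard noise laws. The following worked sketch (Python; my own example, not from the paper) evaluates the disclosure probability for zero-mean Gaussian and Laplace noises with matched variance, showing that the noise shape matters even at equal noise power:

```python
from math import erf, exp, sqrt

def delta_gauss(eps, sigma):
    # Eq. (20) specialized to N(0, sigma^2): the eps-shaded area at theta* = 0
    return erf(eps / (sigma * sqrt(2)))

def delta_laplace(eps, b):
    # The same quantity for Laplace(0, b): integral of its pdf over [-eps, eps]
    return 1 - exp(-eps / b)

sigma = 1.0
b = sigma / sqrt(2)              # match the Gaussian's variance (Var = 2 b^2)
for eps in (0.1, 0.5, 1.0):
    print(eps, delta_gauss(eps, sigma), delta_laplace(eps, b))
# At these eps the sharper Laplace peak leaks more: its eps-shaded area at
# theta* exceeds the Gaussian's, i.e., a larger disclosure probability.
```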

4 Optimal Distributed Estimation and Privacy under $\mathcal{I}_i^1$

In this section, we investigate the optimal distributed estimation and privacy under $\mathcal{I}_i^1$, and then extend the results to the general case where $\mathcal{I}_i^k$ is available for the estimation. Let $\hat{x}_i(1)$ be the estimation of $x_i(0)$ under $\mathcal{I}_i^1$.

4.1 Optimal Distributed Estimation under $\mathcal{I}_i^1$

Under $\mathcal{I}_i^1$, there are two outputs of node $i$, $x_i^+(0)$ and $x_i^+(1)$, which can be used for initial state estimation or inference attack. Note that $x_i^+(1) = x_i(1) + \theta_i(1)$, where $x_i(1)$ is given by (2), which means that $x_i^+(1)$ has involved the outputs of node $i$'s neighbors. Hence, under $\mathcal{I}_i^1$, both the optimal distributed estimation and the privacy analysis depend on the outputs of both node $i$ and its neighbor nodes. Suppose that $f_i$ in (2) is available to the estimation in the remainder of this paper.

The following theorem provides the optimal distributed estimation of $x_i(0)$ under $\mathcal{I}_i^1$, which reveals the relationship between the information outputs (which are available to the node for estimation) and the optimal estimation.

Theorem 4.1

Considering the distributed algorithm (2), under $\mathcal{I}_i^1$, the optimal distributed estimation of $x_i(0)$ satisfies
\[ \hat{x}_i^*(1) = x_i^+(0) - \theta^{1*}, \tag{25} \]
where
\[ \theta^{1*} = \arg\max_{y \in \mathbb{S}_\theta:\; x_i^+(0) - y \in \mathbb{S}_x} \Pr\left\{ \theta_i(0) \in [y - \epsilon,\; y + \epsilon] \,\middle|\, x_i^+(0), x_i^+(1) \right\}, \tag{26} \]
in which the conditioning on $x_i^+(1)$ accounts for the information that the second output carries about $\theta_i(0)$. Then, if $\mathbb{S}_x = \mathbb{R}$, we have
\[ \hat{x}_i^*(1) = x_i^+(0) - \arg\max_{y \in \mathbb{S}_\theta} \Pr\left\{ \theta_i(0) \in [y - \epsilon,\; y + \epsilon] \,\middle|\, x_i^+(0), x_i^+(1) \right\}. \tag{27} \]
{proof}

Let $\hat{x}_i(1)$ be an estimation of $x_i(0)$ under $\mathcal{I}_i^1$ at iteration $1$. Given $x_i^+(0)$ and $x_i^+(1)$, we have
\[ \Pr\left\{ |\hat{x}_i(1) - x_i(0)| \leq \epsilon \,\middle|\, \mathcal{I}_i^1 \right\} = \Pr\left\{ \theta_i(0) \in \left[ x_i^+(0) - \hat{x}_i(1) - \epsilon,\; x_i^+(0) - \hat{x}_i(1) + \epsilon \right] \,\middle|\, x_i^+(0), x_i^+(1) \right\}. \]
Note that $x_i^+(0)$ depends on $x_i(0)$ and $\theta_i(0)$ only, while $x_i^+(1)$ depends on $x_i(1)$ and $\theta_i(1)$, where $\theta_i(0)$ and $\theta_i(1)$ are two random variables. It follows that

(28)

where

Using the relationship between the joint distribution and the conditional distribution, one infers that