Hiding in the Crowd: A Massively Distributed Algorithm for Private Averaging with Malicious Adversaries

Pierre Dellenbach (INRIA), Aurélien Bellet (INRIA, corresponding author: first.last@inria.fr), Jan Ramon (INRIA)
Abstract

The amount of personal data collected in our everyday interactions with connected devices offers great opportunities for innovative services fueled by machine learning, but also raises serious concerns for the privacy of individuals. In this paper, we propose a massively distributed protocol for a large set of users to privately compute averages over their joint data, which can then be used to learn predictive models. Our protocol can find a solution of arbitrary accuracy, does not rely on a third party and preserves the privacy of users throughout the execution in both the honest-but-curious and malicious adversary models. Specifically, we prove that the information observed by the adversary (the set of malicious users) does not significantly reduce the uncertainty in its prediction of private values compared to its prior belief. The level of privacy protection depends on a quantity related to the Laplacian matrix of the network graph and generally improves with the size of the graph. Furthermore, we design a verification procedure which offers protection against malicious users joining the service with the goal of manipulating the outcome of the algorithm.

1 Introduction

Through browsing the web, engaging in online social networks and interacting with connected devices, we are producing ever growing amounts of sensitive personal data. This has fueled the massive development of innovative personalized services which extract value from users’ data using machine learning techniques. In today’s dominant approach, users hand over their personal data to the service provider, who stores everything on centralized or tightly coupled systems hosted in data centers. Unfortunately, this poses important risks regarding the privacy of users. To mitigate these risks, some approaches have been proposed to learn from datasets owned by several parties who do not want to disclose their data. However, they typically suffer from some drawbacks: (partially) homomorphic encryption schemes (Paillier, 1999; Graepel et al., 2012; Aslett et al., 2015) require the existence of a trusted third party, secure multi-party computation techniques (Yao, 1982; Lindell and Pinkas, 2009) are generally intractable when the number of parties is large, and exchanging noisy sketches of the data through (local) differential privacy (Dwork, 2006; Duchi et al., 2012) only provides approximate solutions which are quite inaccurate in the highly distributed setting considered here. Furthermore, many of these techniques are not robust to the presence of malicious parties who may try to manipulate the outcome of the algorithm.

In this paper, our goal is to design a massively distributed protocol to collaboratively compute averages over the data of thousands to millions of users (some of them honest-but-curious and some corrupted by a malicious party), with arbitrary accuracy and in a way that preserves their privacy. For machine learning algorithms whose sufficient statistics are averages (e.g., kernel-based algorithms in primal space and decision trees), this could be used as a primitive to privately learn more complex models. The approach we propose is fully decentralized: users keep their own data locally and exchange information asynchronously over a peer-to-peer network (represented as a connected graph), without relying on any third party. Our algorithm (called Gopa: GOssip for Private Averaging) draws inspiration from a randomized gossip protocol for averaging (Boyd et al., 2006), augmented with a first phase involving pairwise exchanges of noise terms so as to mask the private values of users without affecting the global average. We first analyze the correctness of the algorithm, showing that the addition of noise has a mild effect on the convergence rate. We then study the privacy guarantees of Gopa in a Bayesian framework, where the adversary has some prior belief about the private values. Specifically, we give an exact expression for the posterior variance of the adversary after he has observed all the information output by the protocol, and show that the variance loss is negligible compared to the prior. This is equivalent to showing that the uncertainty in the adversary’s predictions of the private values has not been significantly reduced. Interestingly, the proportion of preserved variance depends on the variance of the noise used to mask the values but also on an interpretable quantity related to the Laplacian matrix of the network graph. To the best of our knowledge, we are the first to draw a link between privacy and a graph smoothing operator popular in semi-supervised learning (Zhu and Ghahramani, 2002; Zhou et al., 2003), multi-task learning (Evgeniou and Pontil, 2004) and signal processing (Shuman et al., 2013). We show how this result motivates the use of a random graph model to construct the network graph so as to guarantee strong privacy even under rather large proportions of malicious users, as long as the number of users is big enough. The practical behavior of Gopa is illustrated through numerical simulations. Finally, we further enhance our protocol with a verification procedure where users are asked to publish some values in encrypted form, so that cheaters trying to manipulate the output of the algorithm can be detected with high probability while preserving the aforementioned privacy guarantees.

The rest of this paper is organized as follows. Section 2 describes the problem setting, including our adversary and privacy models. Section 3 presents some background on (private) decentralized averaging and partially homomorphic encryption, along with related work and baseline approaches. Section 4 introduces the Gopa algorithm and studies its convergence as well as privacy guarantees. Section 5 describes our verification procedure to detect cheaters. Finally, Section 6 presents numerical simulations. Proofs can be found in the supplementary material.

2 Preliminaries

We consider a set of users. Each user holds a personal value , which can be thought of as the output of some function applied to the personal data of (e.g., a feature vector describing ). The users want to collaboratively compute the average value while keeping their personal value private. Such a protocol could serve as a building block for privately running machine learning algorithms which interact with the data only through averages, such as linear regression models (ordinary least-squares, ridge regression), decision trees and gradient descent for empirical risk minimization problems. We denote by the vector .
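As a simple illustration (the notation below is introduced only for this example: each user $u$ holds a feature vector $x_u \in \mathbb{R}^d$ and a label $y_u \in \mathbb{R}$, and $\lambda > 0$ is a regularization parameter), ridge regression can be computed from just two averages over the users’ data:

$$ w^\star \;=\; \Big( \tfrac{1}{n} \sum_{u=1}^{n} x_u x_u^\top + \lambda I \Big)^{-1} \Big( \tfrac{1}{n} \sum_{u=1}^{n} x_u y_u \Big), $$

so privately averaging the entries of $x_u x_u^\top$ and $x_u y_u$ is enough to recover the model without revealing any individual example.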

Instead of relying on a central server or any third party to exchange information, users communicate over a peer-to-peer network represented by a connected undirected graph , where indicates that users and are neighbors in and can exchange messages directly. For a given user , we denote by the set of its neighbors. We denote by the adjacency matrix of , by the degree vector, and by its Laplacian matrix.

2.1 Adversary Models

We consider two commonly adopted adversary models for users, which were formalized by Goldreich (1998) and are used in the design of many secure protocols. An honest-but-curious (honest for short) user will follow the protocol specification, but can use all the information received during the execution to infer information about other users. In contrast, a malicious user may deviate from the protocol execution by sending incorrect values at any point (but we assume that they follow the required communication policy; if not, this can be easily detected). Malicious users can collude, and thus will be seen as a single malicious party who has access to all information collected by malicious users. We only restrict the power of attackers by requiring that honest users communicate through secure channels, which means that malicious users only observe information during communications they take part in.

Each user in the network is either honest or malicious, and honest users do not know whether other nodes are honest or malicious. We denote by the set of honest users and by the proportion of malicious users in the network. We also denote by the subgraph of induced by , so that . Throughout the paper, we will rely on the following natural assumption on (we will discuss how to generate such that it holds in practice in Section 4.4).

Assumption 1.

The graph of honest users is connected.

This implies that there exists a path between any two honest users in the full graph which does not go through a malicious node. In the rest of the paper, we will use the term adversary to refer to the set of malicious users (privacy with respect to an honest user can be obtained as a special case).

2.2 Privacy Model

Recall that our goal is to design a protocol which deterministically outputs the exact average (which we argue does not reveal much information about individual values in the large-scale setting we consider). This requirement automatically rules out Differential Privacy (Dwork, 2006) as the latter implies that the output of the protocol has to be randomized.

We take a Bayesian, semantic view of privacy promoted by several recent papers (Kasiviswanathan and Smith, 2014; Li et al., 2013; Kifer and Machanavajjhala, 2014; He et al., 2014). We consider a family of prior distributions which represent the potential background knowledge that the adversary may have about the private values of honest users (since the adversary controls the malicious users, we assume he has exact knowledge of their values). We will denote by the prior belief of the adversary about the private value of some honest user . Given all the information gathered by the adversary during the execution of the protocol, the privacy notion we consider is that the ratio of posterior to prior variance of the private value is lower bounded by for some :

(1)

The case corresponds to the extreme situation where observing removed all uncertainty about . On the other hand, when , all variance was preserved.
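In symbols (writing $X_u$ for the private value of an honest user $u$, $\mathcal{O}$ for the information observed by the adversary and $\varepsilon \in [0,1]$ for the privacy parameter, notation spelled out here for concreteness), requirement (1) asks that the posterior variance retain at least a $(1-\varepsilon)$ fraction of the prior variance:

$$ \mathrm{Var}(X_u \mid \mathcal{O}) \;\geq\; (1 - \varepsilon)\, \mathrm{Var}(X_u). $$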

It is important to note that (1) with close to does not guarantee that the adversary learns almost nothing from the output : in particular, the posterior expectation can largely differ from the prior expectation , especially if the observed global average was very unlikely under the prior belief. This is related to the “no free lunch” theorem in data privacy (Kifer and Machanavajjhala, 2011); see also the discussions in Kasiviswanathan and Smith (2014); Li et al. (2013). What (1) does guarantee, however, is that the squared error that the adversary expects to make by predicting the value of some private after observing is almost the same as it was before observing . In other words, the uncertainty in its prediction has not been significantly reduced by participating in the protocol. This is formalized by the following remark.

Remark 1 (Expected squared error).

Assume that (1) is satisfied. Let be the best prediction (in terms of expected square error) that the adversary can make given its prior belief. After observing the output , the adversary can make a new best guess . We have:

Our results will be valid for Gaussian prior distributions of the form , for all . We use Gaussian distributions for technical reasons; we expect similar results to hold for other families of distributions which behave nicely under conditioning and linear transformations, such as the exponential family. We assume for simplicity that the prior variance is the same for all , but our analysis straightforwardly extends to the more general case where . We use centered Gaussians without loss of generality, since (1) depends only on the variance.

Remark 2 (Privacy axioms).

One can show that our notion of privacy (1) satisfies the axioms that any reliable privacy definition should satisfy according to Kifer and Lin (2012), namely “transformation invariance” and “convexity”.

3 Background and Related Work

3.1 Decentralized Averaging

1:  Input: graph , initial values
2:  for  do
3:     Draw uniformly at random from
4:     Set
5:  end for
Algorithm 1 Randomized gossip (Boyd et al., 2006)

The problem of computing the average value of a set of users in a fully decentralized network without a central coordinator has been extensively studied (see for instance Tsitsiklis, 1984; Kempe et al., 2003; Boyd et al., 2006). In most existing approaches, users iteratively compute the weighted average between their value and those of (a subset of) their neighbors in the network. We focus here on the randomized gossip algorithm proposed by Boyd et al. (2006) as it is simple, asynchronous and exhibits fast convergence. Each user has a clock ticking at the times of a rate 1 Poisson process. When the clock of a user ticks, it chooses a random neighbor and they average their current values. As discussed in Boyd et al. (2006), one can equivalently consider a single clock ticking at the times of a rate Poisson process and a random edge drawn at each iteration. The procedure is shown in Algorithm 1, and one can show that all users converge to at a geometric rate.
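For concreteness, here is a minimal Python sketch of this protocol (a synchronous simulation of the asynchronous clocks, with illustrative variable names):

```python
import random

def randomized_gossip(values, edges, num_iters, seed=0):
    """Simulate Algorithm 1: at each tick a random edge (i, j) is drawn
    and the two endpoints replace their values by their average."""
    rng = random.Random(seed)
    x = list(values)  # current estimates, initialized with the inputs
    for _ in range(num_iters):
        i, j = rng.choice(edges)   # activated pair of neighbors
        avg = (x[i] + x[j]) / 2.0  # pairwise averaging step
        x[i] = x[j] = avg
    return x

# Toy usage: 5 users on a ring; all estimates converge to the global mean 4.0.
vals = [1.0, 4.0, 2.0, 8.0, 5.0]
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(randomized_gossip(vals, ring, num_iters=200))
```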

Although there is no central server collecting all users’ data, the above gossip algorithm is not private since users must share their inputs directly with others. A way to ensure privacy is that each user locally perturbs its own input before starting the algorithm so as to satisfy local differential privacy (Duchi et al., 2012; Kairouz et al., 2016).

Baseline 1.

Assume for some finite . Each user computes a perturbed value , where is a noise value with Laplacian distribution with . This guarantees that is an -differentially private approximation of (see e.g., Dwork, 2008), hence running Algorithm 1 on is also -differentially private.

Unfortunately, this protocol converges to an approximate average , and the associated RMSE can be high even for a large number of users. For instance, for users, and , the RMSE is approximately .
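A quick Python sketch of this baseline, with illustrative parameter choices (values assumed bounded in $[-B, B]$, so adding Laplace noise of scale $2B/\epsilon$ yields $\epsilon$-local differential privacy):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B, eps = 10_000, 1.0, 0.1          # number of users, value bound, privacy budget
x = rng.uniform(-B, B, size=n)        # private values
noise = rng.laplace(scale=2 * B / eps, size=n)  # local perturbation (scale 2B/eps)
x_tilde = x + noise                   # values actually fed to Algorithm 1
# Gossip averaging of x_tilde converges to mean(x_tilde), not mean(x):
print("true mean:", x.mean(), "  noisy mean:", x_tilde.mean())
print("error of this run:", abs(x_tilde.mean() - x.mean()))
```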

Other attempts have been made at designing privacy-preserving decentralized averaging protocols under various privacy and attack models, and additional assumptions on the network topology. For an adversary able to observe all communications in the network, Huang et al. (2012) proposed an -differentially private protocol where users add exponentially decaying Laplacian noise to their value. The protocol converges with probability to a solution with an error radius of order . Manitara and Hadjicostis (2013) instead proposed that each user iteratively adds a finite sequence of noise terms which sum to zero, so that convergence to the exact average can be achieved. The adversary consists of a set of curious nodes which follow the protocol but can share information between themselves. Under some assumption on the position of the curious nodes in the network, their results prevent perfect recovery of private values but do not bound the estimation accuracy that the adversary can achieve. Mo and Murray (2014) proposed to add and subtract exponentially decaying Gaussian noise, also ensuring convergence to the exact average. They consider only honest-but-curious nodes and rely on a strong topological assumption, namely that there is no node whose neighborhood includes another node and its entire neighborhood. Their privacy guarantees are in the form of lower bounds on the covariance matrix of the maximum likelihood estimate of the private inputs that any node can make, which are difficult to interpret and very loose in practice. Finally, Hanzely et al. (2017) introduced variants of Algorithm 1 which intuitively leak less information about the private inputs, but do not necessarily converge to the exact average. Importantly, they do not prove any formal privacy guarantee.

In the above approaches, privacy is typically achieved by asking each user to independently add decaying local noise to its private value, which results in accuracy loss and/or weak privacy guarantees. In contrast, the protocol we propose in this paper relies on sharing zero-sum noise terms across users so as to “dilute” the knowledge of private values into the crowd. This will allow a flexible trade-off between privacy and convergence rate even when a node has a large proportion of malicious neighbors, while ensuring the convergence to the exact average. Furthermore and unlike all the above approaches, we provide a verification procedure to prevent malicious nodes from manipulating the output of the algorithm. This verification procedure relies on partially homomorphic encryption, which we briefly present below.

3.2 Partially Homomorphic Encryption

A standard technique for secure computation is to rely on a partially homomorphic encryption scheme to encrypt the values while allowing certain operations to be carried out directly on the cypher text (without decrypting).

We use the popular Paillier cryptosystem (Paillier, 1999), which is additive. Formally, one generates a public (encryption) key , where for two independent large primes and , and a secret (decryption) key . A message can then be encrypted into a cypher text with

(2)

where is drawn randomly from . With knowledge of the secret key, one can recover from based on the fact that . Denote the decryption operation by . The Paillier scheme satisfies the following homomorphic property for any :

(3)

hence is a valid encryption of . For the purpose of this paper, we will consider the Paillier encryption scheme as perfectly secure (i.e., the computational complexity needed to break the encryption is beyond the reach of any party). We can use the Paillier scheme to design a second simple baseline for private averaging.
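As an illustration of this additive property, here is a toy Python implementation of the Paillier scheme with small hard-coded primes and the common simplification of taking the generator equal to n + 1 (insecure, for exposition only):

```python
import math, random

# Toy Paillier key generation (NOT secure: tiny primes, g = n + 1).
p, q = 1789, 2003
n, n2, g = p * q, (p * q) ** 2, p * q + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # Carmichael lambda(n)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # precomputed decryption constant

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:       # fresh randomness coprime with n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2   # c = g^m * r^n mod n^2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Additive homomorphism: multiplying cypher texts adds the plain texts mod n.
c1, c2 = encrypt(123), encrypt(456)
assert decrypt((c1 * c2) % n2) == (123 + 456) % n
print("decrypted sum:", decrypt((c1 * c2) % n2))
```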

Baseline 2.

Consider and assume that users trust two central honest-but-curious entities: the server and the third party. The server generates a Paillier encryption scheme and broadcasts to the set of users. Each user computes and sends it to the third party. Following (3), the third party then computes and sends it to the server, which can obtain by decrypting the message (and dividing by ).

If the server and the third party are indeed honest, nobody observes any useful information except the outcome . But if they are not honest, various breaches can occur. For instance, if the third party is malicious, it can send an incorrect output to the server. The third party could also send the encrypted values of some users to the server, which can decrypt them using the private key . Similarly, if the server is malicious, it can send to the third party which can then decrypt all of the users’ private values.

In contrast, we will design a protocol which eliminates the need for such central trusted entities and instead distributes the trust across many users in the network (Section 4). We will however rely on homomorphic encryption to detect potential malicious users (Section 5).

4 GOPA: Private Gossip Averaging Protocol

4.1 Protocol Description

We describe our Gopa protocol (GOssip for Private Averaging), which works in two phases. In the first phase (randomization phase), users mask their private value by adding noise terms that are correlated with their neighbors, so that the global average remains unchanged. In the second phase (averaging phase), users average their noisy values. For simplicity of explanation, we abstract away the communication mechanisms, i.e., contacting a neighbor and exchanging noise are considered as atomic operations.

1:  Input: graph , private values , distribution
2:  for all neighbor pairs s.t.  do
3:      and jointly draw a random number from
4:     ,
5:  end for
6:  for all users  do
7:     ,
8:  end for
9:  Output: noisy values
Algorithm 2 Randomization phase of Gopa

Randomization phase. Algorithm 2 describes the first phase of Gopa, during which all neighboring nodes contact each other to exchange noise values. Specifically, they jointly draw a random real number from a probability distribution (a parameter of the algorithm that the community agrees on), that will add to its private value and will subtract. Following the common saying “don’t put all your eggs in one basket”, each user thereby distributes the noise masking his private value across several other users (his direct neighbors but also beyond by transitivity), which will provide some robustness to malicious parties. The idea is reminiscent of one-time pads (see for instance Bonawitz et al., 2017, Section 3 therein), but our subsequent analysis will highlight the key role of the network topology and show that we can tune the magnitude of the noise so as to trade off between privacy on the one hand, and convergence speed as well as the impact of user drop out on the other hand. The result of this randomization phase is a set of noisy values for each user, with the same average value as the private values. Note that each user exchanges noise exactly once with each of his neighbors, hence the noisy value consists of noise terms.
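A minimal Python sketch of the randomization phase, assuming the noise distribution is a centered Gaussian of variance sigma2 (all names are illustrative):

```python
import random

def randomization_phase(private_values, edges, sigma2, seed=0):
    """For each edge (u, v), draw one noise term, add it to u's value and
    subtract it from v's value; since pairwise terms cancel out, the
    global average of the noisy values equals that of the private values."""
    rng = random.Random(seed)
    noisy = dict(private_values)             # user -> current (noisy) value
    for u, v in edges:
        eta = rng.gauss(0.0, sigma2 ** 0.5)  # jointly drawn noise term
        noisy[u] += eta
        noisy[v] -= eta
    return noisy

# Toy check: the average of the noisy values equals the true average 3.75.
priv = {0: 1.0, 1: 4.0, 2: 2.0, 3: 8.0}
E = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
out = randomization_phase(priv, E, sigma2=100.0)
print(sum(out.values()) / len(out), sum(priv.values()) / len(priv))
```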

Averaging phase. In the second phase, the users start from their noisy values obtained in the randomization phase and simply run the standard randomized gossip averaging algorithm (Algorithm 1).

4.2 Correctness and Convergence of Gopa

In this section, we study the correctness and convergence rate of Gopa. Let be the values after iterations of the averaging phase (Algorithm 1) initialized with the noisy values generated by the randomization phase (Algorithm 2). Following the seminal work of Boyd et al. (2006), we will measure the convergence rate in terms of the -averaging time. Given , the -averaging time is the number of iterations needed to guarantee that for any :

(4)

Note that the error in (4) is taken relative to the original set of values to account for the impact of the addition of noise on the convergence rate. We have the following result for the case where all users are honest (we will lift this requirement in Section 5).

Proposition 1 (Correctness and convergence rate).

When all users are honest, the sequence of iterates generated by Gopa satisfies . Furthermore, the -averaging time of Algorithm 1 is:

(5)

where with the second smallest eigenvalue of the Laplacian matrix (Boyd et al., 2006; Colin et al., 2015), is the maximum degree and are upper bounds for the absolute value of private values and noise terms respectively. We use a bounded noise assumption for simplicity; the argument easily extends to boundedness with high probability.

Proposition 1 allows us to quantify the worst-case impact of the randomization phase on the convergence of the averaging phase. The -averaging time of Gopa is only increased by a constant additive factor compared to the non-private averaging phase. Importantly, this additive factor has a mild (logarithmic) dependence on and . This behavior will be confirmed by numerical experiments in Section 6.

4.3 Privacy Guarantees

We now study the privacy guarantees of Gopa. We consider that the knowledge acquired by the adversary (colluding malicious users) during the execution of the protocol contains the following: (i) the noisy values of all users at the end of the randomization phase, (ii) the full network graph (and hence which pairs of honest users exchanged noise), and (iii) for any communication involving a malicious party, the noise value used during this communication. The only unknowns are the private values of honest users, and noise values exchanged between them. Note that (i) implies that the adversary learns the network average.

Remark 3.

Since we assume that the noisy values are known to the adversary, our privacy guarantees will hold even if these values are publicly released. Note however that computing the average with Algorithm 1 provides additional protection as the adversary will observe only a subset of the noisy values of honest nodes (those who communicate with a malicious user at their first iteration). In fact, inferring whether a received communication is the first made by that user is already challenging due to the asynchronous nature of the algorithm.

Our main result exactly quantifies how much variance is left in any private value after the adversary has observed all information in .

Theorem 1 (Privacy guarantees).

Assume that the prior belief on the private values is Gaussian, namely for all honest users , and that the noise variables are also drawn from a Gaussian distribution . Denote by the indicator vector of the -th coordinate and let , where is the Laplacian matrix of the graph . Then we have for all honest user :

(6)

We now provide a detailed interpretation of the above result. First note that the proportion of preserved variance only depends on the interactions between honest users, making the guarantees robust to any adversarial values that malicious users may send. Furthermore, notice that matrix can be seen as a smoothing operator over the graph . Indeed, given some , is the solution to the following optimization problem:

(7)

where . This problem is known as graph smoothing (also graph regularization, or graph filtering) and has been used in the context of semi-supervised learning (see Zhu and Ghahramani, 2002; Zhou et al., 2003), multi-task learning (Evgeniou and Pontil, 2004) and signal processing on graphs (Shuman et al., 2013), among others. The first term in (7) encourages solutions that are close to while the second term is the Laplacian quadratic form which enforces solutions that are smooth over the graph (the larger , the more smoothing). One can show that is a doubly stochastic matrix (Segarra et al., 2015), hence the vector sums to the same quantity as the original vector . In our context, we smooth the indicator vector so : the larger the noise variance and the more densely connected the graph of honest neighbors, the more “mass” is propagated from user to other nodes, hence the closer to the value and in turn the more variance of the private value is preserved. Note that we recover the expected behavior in the two extreme cases: when we have and hence the variance ratio is , while when we have and hence . This irreducible variance loss accounts for the fact that the adversary learns from the average over the set of honest users (by adding their noisy values and subtracting the noise terms he knows). Since we assume the number of users to be very large, this variance loss can be considered negligible. We emphasize that in contrast to the lower bounds on the variance of the maximum likelihood estimate obtained by Mo and Murray (2014), our variance computation (6) is exact, interpretable and holds under non-uniform prior beliefs of the adversary.
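To make the smoothing interpretation concrete, the following Python sketch computes the smoothing operator and shows how the mass of an indicator vector spreads over the graph (alpha below is an assumed smoothing parameter standing in for the noise-variance-dependent constant of Theorem 1):

```python
import numpy as np

def smoothing_operator(adjacency, alpha):
    """Return W = (I + alpha * L)^{-1}, where L = D - A is the graph
    Laplacian; applying W to an indicator vector spreads its unit mass."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.inv(np.eye(len(A)) + alpha * L)

# Toy example: a path graph over 4 honest users, smoothing of e_0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
W = smoothing_operator(A, alpha=10.0)
e0 = np.eye(4)[0]
print(W @ e0)          # mass initially on user 0, spread over the graph
print((W @ e0).sum())  # the smoothed vector still sums to 1
```

The larger alpha (i.e., the larger the noise variance) and the better connected the graph, the more evenly the mass is spread, which in light of Theorem 1 means more preserved variance for each private value.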

It is important to note that smoothing occurs over the entire graph : by transitivity, all honest users in the network contribute to keeping the private values safe. In other words, Gopa turns the massively distributed setting into an advantage for privacy, as we illustrate numerically in Section 6. Note, still, that one can derive a simple (but often loose) lower bound on the preserved variance which depends only on the local neighborhood.

Proposition 2.

Let and denote by the number of honest neighbors of . We have:

Proposition 2 shows that the more honest neighbors, the larger the preserved variance. In particular, if , we have as .

We conclude this subsection with some remarks.

Remark 4 (Composition).

As can be seen from inspecting the proof of Theorem 1, is a Gaussian distribution. This makes our analysis applicable to the setting where Gopa is run several times with the same private value as input, for instance within an iterative machine learning algorithm. We can easily keep track of the preserved variance by recursively applying Theorem 1.

Remark 5 ( not connected).

The results above still hold when Assumption 1 is not satisfied. In this case, the smoothing occurs separately within each connected component of , and the irreducible variance loss is governed by the size of the connected component of that the user belongs to (instead of the total number of honest users ).

Remark 6 (Drop out).

The use of centered Gaussian noise with bounded variance effectively limits the impact of some users dropping out during the randomization phase. In particular, any residual noise term has expected value , and can be bounded with high probability. Alternatively, one may ask each impacted user to remove the noise exchanged with users who have dropped (thereby ensuring exact convergence at the expense of reducing privacy guarantees).

4.4 Robust Strategies for Network Construction

We have seen above that the convergence rate and most importantly the privacy guarantees of Gopa crucially depend on the network topology. In particular, the network graph should be constructed in a robust manner to ensure good connectivity and to guarantee that all honest users have many honest neighbors with high probability. In the following, we assume users have an address list for the set of users and can select some of them as their neighbors. Note that the randomization phase can be conveniently executed when constructing the network.

A simple choice of network topology is the complete graph, which is best in terms of privacy (since each honest user has all other honest users as neighbors) and convergence rate (best connectivity). Yet this is not practical when is very large: beyond the huge number of pairwise communications needed for the randomization phase, each user also needs to create and maintain secure connections, which is costly (Chan et al., 2003).

We propose instead a simple randomized procedure to construct a sparse network graph with the desired properties based on random -out graphs (Bollobás, 2001), also known as random -orientable graphs (Fenner and Frieze, 1982). The idea is to make each (honest) user select other users uniformly at random from the set of all users. Then, the edge is created if selected or selected (or both). This procedure is the basis of a popular key predistribution scheme used to create secure peer-to-peer communication channels in distributed sensor networks (Chan et al., 2003). Fenner and Frieze (1982) show that for any , the probability that the graph is -connected goes to almost surely. This provides robustness against malicious users. Note also that the number of honest neighbors of an honest node follows a hypergeometric distribution and is tightly concentrated around its expected value , where is the proportion of malicious users. It is worth noting that the probability that the graph is connected is actually very close to 1 even for small and . This is highlighted by the results of Yağan and Makowski (2013), who established a rather tight lower bound on the probability of connectivity of random -out network graphs. For instance, the probability is guaranteed to be larger than for and . The algebraic connectivity is also large in practice for random -out graphs, hence ensuring good convergence as per Proposition 1.
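A minimal Python sketch of this construction (networkx is used only to check connectivity of the honest subgraph; the values of n, k and the malicious proportion below are illustrative):

```python
import random
import networkx as nx

def random_k_out_graph(n, k, seed=0):
    """Each user selects k other users uniformly at random; an edge is
    created whenever at least one endpoint selected the other."""
    rng = random.Random(seed)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for u in range(n):
        for v in rng.sample([w for w in range(n) if w != u], k):
            G.add_edge(u, v)
    return G

n, k, rho = 1000, 10, 0.3  # number of users, selections per user, malicious proportion
G = random_k_out_graph(n, k)
malicious = set(random.Random(1).sample(range(n), int(rho * n)))
honest = [u for u in range(n) if u not in malicious]
print("honest subgraph connected:", nx.is_connected(G.subgraph(honest)))
print("average degree:", 2 * G.number_of_edges() / n)  # roughly 2k for k << n
```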

5 Verification Procedure

1:  Input: each user has generated its own Paillier encryption scheme and has published:
2:        Public key , (before the execution of Algorithm 2)
3:        Noise values (when exchanging with during Algorithm 2)
4:        , (at the end of Algorithm 2)
5:  for all user  do
6:     ,
7:     Verify that and (if not, add to cheater list)
8:  end for
9:  for all user  do
10:     Draw random subset of with , ask to publish and for
11:     for all  do
12:        Ask to publish , verify that and (if not, add to cheater list)
13:     end for
14:  end for
Algorithm 3 Verification procedure for the randomization phase of Gopa

We have shown in Section 4.2 that Gopa converges to the appropriate average when users are assumed not to tamper with the protocol by sending incorrect values. In this section, we complement our algorithm with a verification procedure to detect malicious users who try to influence the output of the algorithm (we refer to them as cheaters). While it is impossible to force a user to give the “right” input to the algorithm (no one can force a person to answer honestly to a survey), our goal is to make sure that given the input vector , the protocol will either output or detect cheaters with high probability. Note that range proofs (Camenisch et al., 2008) can be used to check that each input value lies in an appropriate interval without revealing anything else about the value. Our approach is based on asking users to publish some (potentially encrypted) values during the execution of the protocol. We assume that users cannot change their published values after publication; this could be enforced by relying on blockchain structures as in Bitcoin transactions (Nakamoto, 2008). The published information should be publicly accessible so that anyone may verify the validity of the protocol (avoiding the need for a trusted verification entity), but should not threaten privacy.

In order to allow public verification without compromising privacy, we will rely on the Paillier encryption scheme described in Section 3.2. For the purpose of this section, we assume that all quantities (private values and noise terms) are in . This is not a strong restriction since with appropriate and scaling one can represent real numbers from a given interval with any desired finite precision. Each user generates its own Paillier scheme and publishes encrypted values using its encryption operation . These published cypher texts may not be truthful (when posted by a malicious user), hence we denote them by the superscript to distinguish them from a valid encryption of a value (e.g., is the cypher text published by for the quantity ).

Algorithm 3 describes our verification procedure for the randomization phase of Gopa (the averaging phase can be verified in a similar fashion). In a first step, we verify the coherence of the publications of each user , namely that and are satisfied for the published cypher texts of these quantities. To allow reliable equality checks between cypher texts, some extra care is needed (see supplementary material for details). The second step is to verify that during a noise exchange, the user and his neighbor have indeed used opposite values as noise terms. However, each user has his own encryption scheme so one cannot meaningfully compare cypher texts published by two different users. To address this issue, we ask each user to publish in plain text a randomly selected fraction of his noise values (the noise values to reveal are drawn publicly). The following result lower bounds the probability of catching a cheater.

Proposition 3 (Verification).

Let be the number of times a user cheated during the randomization phase. If we apply the verification procedure (Algorithm 3), then the probability that a cheater is detected is at least .

Proposition 3 shows that Algorithm 3 guards against large-scale cheating: the more cheating, the more likely at least one cheater gets caught. Small-scale cheating is less likely to be detected but cannot significantly affect the final output of the algorithm (as values are bounded with high probability and the number of users is large). Of course, publishing a fraction of the noise values in plain text decreases the privacy guarantees: publishing in plain text for corresponds to ignoring the associated edge of in our privacy analysis. If the community agrees on in advance, this effect can be easily compensated by constructing a network graph of larger degree, see Section 4.4.
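A small Monte Carlo sketch of the audit step, under a simplified model introduced here for illustration: each noise exchange is revealed in plain text independently with some probability f, and a cheater is caught as soon as one of its falsified exchanges is revealed.

```python
import random

def detection_probability(num_cheats, f, trials=100_000, seed=0):
    """Estimate the probability that at least one of `num_cheats` falsified
    noise exchanges is among the publicly revealed ones, when each exchange
    is revealed independently with probability f."""
    rng = random.Random(seed)
    caught = 0
    for _ in range(trials):
        if any(rng.random() < f for _ in range(num_cheats)):
            caught += 1
    return caught / trials

# The more a user cheats, the more likely it is that it gets detected.
for k in (1, 5, 20, 100):
    print(k, round(detection_probability(k, f=0.05), 3))
```

Under this simplified model the detection probability is exactly 1 - (1 - f)^k for k falsified exchanges, which grows quickly with the amount of cheating and matches the qualitative message of Proposition 3.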

6 Numerical Experiments

Figure 1: Impact of the noise variance on the convergence of Gopa for a random -out graph with and , with mean and standard deviation over 10 random runs. Panels: (a) relative error w.r.t. the number of iterations; (b) number of iterations to reach a fixed error w.r.t. the noise variance.

Figure 2: Preserved data variance (mean and standard deviation across users) w.r.t. noise variance for several random -out topologies and two proportions of malicious nodes. Panels: (a) 10% of malicious nodes; (b) 50% of malicious nodes.

In this section, we run some simulations to illustrate the practical behavior of Gopa. In particular, we study two aspects: the impact of the noise variance on the convergence and on the proportion of preserved data variance, and the influence of network topology. In all experiments, the private values are drawn from the normal distribution , and the network is a random -out graph (see Section 4.4).

Figure 1 illustrates the impact of the noise variance on the convergence of the averaging phase of Gopa, for users and . We see in Figure 1(a) that larger have a mild effect on the convergence rate. Figure 1(b) confirms that the number of iterations needed to reach a fixed error is logarithmic in , as shown by our analysis (Proposition 1). Figure 2 shows the proportion of preserved data variance for several topologies and proportions of malicious nodes in the network. Recall that in random -out graphs with fixed , the average degree of each user is roughly equal to and hence remains constant with . The results clearly illustrate one of our key results: beyond the noise variance and the number of honest neighbors, the total number of honest users has a strong influence on how much variance is preserved. The more users, the more smoothing on the graph and hence the more privacy, without any additional cost in terms of memory or computation for each individual user since the number of neighbors to interact with remains constant. This effect is even more striking when the proportion of malicious users is large, as in Figure 2(b). Remarkably, for large enough, Gopa can effectively withstand attacks from many malicious users.

7 Conclusion

We proposed and analyzed a massively distributed protocol to privately compute averages over the values of many users, with robustness to malicious parties. Our novel privacy guarantees highlight the benefits of the large-scale setting, allowing users to effectively “hide in the crowd” by distributing the knowledge of their private value across many other parties. We believe this idea to be very promising and hope to extend its scope beyond the problem of averaging. In particular, we would like to study how our protocol may be used as a primitive to learn complex machine learning models in a privacy-preserving manner.

Acknowledgments

This research was partially supported by grant ANR-16-CE23-0016-01 and by a grant from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015-2020. The work was also partially supported by ERC-PoC SOM 713626.

References

  • Aslett et al. (2015) Aslett, L. J. M., Esperança, P. M., and Holmes, C. C. (2015). Encrypted statistical machine learning: new privacy preserving methods. Technical report, arXiv:1508.06845.
  • Bollobás (2001) Bollobás, B. (2001). Random Graphs. Cambridge University Press.
  • Bonawitz et al. (2017) Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. (2017). Practical Secure Aggregation for Privacy-Preserving Machine Learning. In CCS.
  • Boyd et al. (2006) Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. (2006). Randomized Gossip Algorithms. IEEE/ACM Transactions on Networking, 14(SI):2508–2530.
  • Camenisch et al. (2008) Camenisch, J., Chaabouni, R., and Shelat, A. (2008). Efficient protocols for set membership and range proofs. In Proceedings of the 14th International Conference on the Theory and Application of Cryptology and Information Security (ASIACRYPT), pages 234–252.
  • Chan et al. (2003) Chan, H., Perrig, A., and Song, D. X. (2003). Random Key Predistribution Schemes for Sensor Networks. In S&P.
  • Colin et al. (2015) Colin, I., Bellet, A., Salmon, J., and Clémençon, S. (2015). Extending Gossip Algorithms to Distributed Estimation of U-statistics. In NIPS.
  • Duchi et al. (2012) Duchi, J. C., Jordan, M. I., and Wainwright, M. J. (2012). Privacy Aware Learning. In NIPS.
  • Dwork (2006) Dwork, C. (2006). Differential Privacy. In ICALP.
  • Dwork (2008) Dwork, C. (2008). Differential Privacy: A Survey of Results. In TAMC.
  • Evgeniou and Pontil (2004) Evgeniou, T. and Pontil, M. (2004). Regularized multi-task learning. In SIGKDD.
  • Fenner and Frieze (1982) Fenner, T. I. and Frieze, A. M. (1982). On the connectivity of random m-orientable graphs and digraphs. Combinatorica, 2(4):347–359.
  • Goldreich (1998) Goldreich, O. (1998). Secure multi-party computation. Manuscript. Preliminary version.
  • Graepel et al. (2012) Graepel, T., Lauter, K. E., and Naehrig, M. (2012). ML Confidential: Machine Learning on Encrypted Data. In ICISC.
  • Hanzely et al. (2017) Hanzely, F., Konečný, J., Loizou, N., Richtárik, P., and Grishchenko, D. (2017). Privacy Preserving Randomized Gossip Algorithms. Technical report, arXiv:1706.07636.
  • He et al. (2014) He, X., Machanavajjhala, A., and Ding, B. (2014). Blowfish privacy: tuning privacy-utility trade-offs using policies. In SIGMOD.
  • Huang et al. (2012) Huang, Z., Mitra, S., and Dullerud, G. (2012). Differentially private iterative synchronous consensus. In ACM workshop on Privacy in the Electronic Society.
  • Kairouz et al. (2016) Kairouz, P., Oh, S., and Viswanath, P. (2016). Extremal Mechanisms for Local Differential Privacy. Journal of Machine Learning Research, 17:1–51.
  • Kasiviswanathan and Smith (2014) Kasiviswanathan, S. P. and Smith, A. (2014). On the ‘Semantics’ of Differential Privacy: A Bayesian Formulation. Journal of Privacy and Confidentiality, 6(1):1–16.
  • Kempe et al. (2003) Kempe, D., Dobra, A., and Gehrke, J. (2003). Gossip-Based Computation of Aggregate Information. In FOCS.
  • Kifer and Lin (2012) Kifer, D. and Lin, B.-R. (2012). An Axiomatic View of Statistical Privacy and Utility. Journal of Privacy and Confidentiality, 4(1):5–46.
  • Kifer and Machanavajjhala (2011) Kifer, D. and Machanavajjhala, A. (2011). No free lunch in data privacy. In SIGKDD.
  • Kifer and Machanavajjhala (2014) Kifer, D. and Machanavajjhala, A. (2014). Pufferfish: A framework for mathematical privacy definitions. ACM Transactions on Database Systems, 39(1):3:1–3:36.
  • Li et al. (2013) Li, N., Qardaji, W. H., Su, D., Wu, Y., and Yang, W. (2013). Membership privacy: a unifying framework for privacy definitions. In CCS.
  • Lindell and Pinkas (2009) Lindell, Y. and Pinkas, B. (2009). Secure Multiparty Computation for Privacy-Preserving Data Mining. Journal of Privacy and Confidentiality, 1(1):59–98.
  • Manitara and Hadjicostis (2013) Manitara, N. E. and Hadjicostis, C. N. (2013). Privacy-preserving asymptotic average consensus. In ECC.
  • Mo and Murray (2014) Mo, Y. and Murray, R. M. (2014). Privacy preserving average consensus. In CDC.
  • Nakamoto (2008) Nakamoto, S. (2008). Bitcoin: A Peer-to-Peer Electronic Cash System. Available online at http://bitcoin.org/bitcoin.pdf.
  • Paillier (1999) Paillier, P. (1999). Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT.
  • Segarra et al. (2015) Segarra, S., Huang, W., and Ribeiro, A. (2015). Diffusion and Superposition Distances for Signals Supported on Networks. IEEE Transactions on Signal and Information Processing over Networks, 1(1):20–32.
  • Shuman et al. (2013) Shuman, D. I., Narang, S. K., Frossard, P., Ortega, A., and Vandergheynst, P. (2013). The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98.
  • Tsitsiklis (1984) Tsitsiklis, J. N. (1984). Problems in decentralized decision making and computation. PhD thesis, Massachusetts Institute of Technology.
  • Yağan and Makowski (2013) Yağan, O. and Makowski, A. M. (2013). On the Connectivity of Sensor Networks Under Random Pairwise Key Predistribution. IEEE Transactions on Information Theory, 59(9):5754–5762.
  • Yao (1982) Yao, A. C. (1982). Protocols for secure computations. In FOCS.
  • Zhou et al. (2003) Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2003). Learning with Local and Global Consistency. In NIPS.
  • Zhu and Ghahramani (2002) Zhu, X. and Ghahramani, Z. (2002). Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University.

Supplementary Material

Appendix A Proof of Proposition 1

The randomization phase consists of pairs of users adding noise terms which sum to zero. Hence, in the honest-but-curious setting, we have . Therefore, the sum of the user values over the network remains unchanged after the randomization phase:

The first claim then follows from the correctness of the averaging procedure of Boyd et al. (2006) used for the averaging phase.

From Boyd et al. (2006), we know that the -averaging time of the averaging phase applied to the original (non-noisy) values is . Hence, to achieve the same guarantee (4), we need to run the algorithm for at least

(8)

iterations. Due to the bound on the noise values and the amount of noise exchanges a user can make, we have for any :

and hence

We get the second claim by plugging this inequality into (8).∎

Appendix B Proof of Theorem 1

We first introduce some auxiliary notations. Let be the ordered edges between honest users. Let be the vector of the private values of all honest users and be the vector of all values exchanged by these users. Let

be the concatenation of these two vectors. We will index vectors and matrices with elements of . To emphasize that we refer to the random variable rather than its value, we will use square brackets. For , we define and if , and and if .

We formalize the knowledge of the adversary by a set of linear equations, specifying the constraints the elements of must satisfy according to its knowledge:

(9)

where is the noisy value of user minus the sum of all noise terms exchanged with malicious users.

Let be the matrix representing the above set of linear equations, and the vector representing the right hand side, so that we can write (9) as . In particular, is a sparse matrix whose elements are zero except for and and .

The matrix has full rank and can be written as where is the identity matrix and is the oriented incidence matrix of the graph (where the direction of edges is given by ). Let be a full rank matrix such that . Let . As and are non-singular and orthogonal to each other, is also non-singular. Let be a diagonal matrix with and . We know that . is a linear transformation of a multivariate Gaussian, and hence a multivariate Gaussian itself. In particular, it holds that .

We now consider the random vector conditioned on . This again gives a Gaussian distribution, with

Let be some honest user. We now focus on the variance of the private value conditioned on the information obtained by the adversary:

(10)

where is the Laplacian matrix associated with . Finally, for clarity we rewrite

Appendix C Proof of Proposition 2

To better understand Theorem 1, we investigate a fictional scenario where the only noise exchanged in the network is between the user of interest and his neighbors. It is clear that the amount of variance preserved in this scenario gives a lower bound for the general case (where pairs of nodes not involving also exchange noise). Indeed, noise exchanges involving a malicious user have no effect (they are subtracted away in the linear system (9)), while those between honest users can only increase the uncertainty for the adversary.

Without loss of generality, assume . The Laplacian matrix in Theorem 1 (ignoring the nodes which did not exchange noise) is given by

Let be a unitary matrix for which

We have

Indeed, we can easily verify that and . For every vector with , we can check that and hence . It follows that for any , . We know that is orthogonal to and hence . Then, . Therefore, all vectors orthogonal to and are eigenvectors with eigenvalue .

We can now rewrite the matrix inverse in (10) from the proof of Theorem 1 as:

We are interested in user so we focus on the value

Plugging this back into (10), we finally get:

Appendix D Details on Verification Procedure

D.1 Reliable Equality Checks

To allow reliable equality checks between cipher texts, some extra care is needed. For any and , let us denote by , , and the random integers generated (and kept private) by user to respectively encrypt , ,