Differentially-Private Two-Party Egocentric Betweenness Centrality

Leyla Roohi (1), Benjamin I. P. Rubinstein (2), Vanessa Teague (3)
School of Computing and Information Systems, University of Melbourne, Australia
Email: (1) lroohi@student.unimelb.edu.au, (2) brubinstein@unimlb.edu.au, (3) vjteague@unimelb.edu.au
Abstract

We describe a novel protocol for computing the egocentric betweenness centrality of a node when relevant edge information is spread between two mutually distrusting parties such as two telecommunications providers. While each node belongs to one network or the other, its ego network might include edges unknown to its network provider. We develop a protocol of differentially-private mechanisms to hide each network’s internal edge structure from the other; and contribute a new two-stage stratified sampler for exponential improvement to time and space efficiency. Empirical results on several open graph data sets demonstrate practical relative error rates while delivering strong privacy guarantees, such as 16% error on a Facebook data set.

Index Terms: Differential Privacy; Betweenness Centrality

I Introduction

Data sets such as social, communication, and transport networks are graph structured: people are nodes and their interactions edges. Such graph structures are valuable for understanding real-world properties. However, revealing the graph or its statistics can cause privacy disclosure even with anonymisation techniques [1, 2, 3]. Furthermore, for many corporations, customer data is an asset they are reluctant to share. This motivates interest in joint computation over databases with limited exposure to sensitive information.

Differential privacy (DP) [4] guarantees that the output distribution of a release does not change by more than a small multiplicative factor under input data perturbation. We consider edge DP, wherein perturbations correspond to edge flips: the existence of any sensitive edge is not revealed by an edge-DP release.

We envisage two (or more) networks controlled by different corporations, such as telephone or email providers, or two different social networks. The complete list of nodes (i.e., people) is public knowledge, but the individual connections between them are not, so we consider edge DP in order to hide the connection between the nodes in each network. Each service provider knows the connections within its own network, plus the connections between one of its members and the outside (e.g., when they contact someone in a different network), but not the internal connections in other networks.

We are the first to consider differentially-private computation of egocentric betweenness centrality (EBC) [5]. Informally, EBC measures the importance of a node as a link between different parts of the graph. A node that forms a link between otherwise-isolated parts of the network has high betweenness centrality; a node that is simply an easily-bypassed member of an interconnected network has a low betweenness centrality. This is a property of the whole communication graph: one service provider cannot compute it using its partial view of the graph alone.

Betweenness centrality could be used in targeted advertising or customer retention campaigns, as individuals with high EBC have the capacity to transfer information from one community to another. EBC is equally important in understanding and combating the spread of misinformation or “fake news”: individuals with high EBC can be educated to be more discerning about what they spread through the network, thereby mitigating spread of fake news. The difficulty of assessing misleading political content and obstructing its spread has become one of the most important research questions in online social network analysis, to which even the networks themselves are devoting significant research effort (see https://newsroom.fb.com/news/2018/04/new-elections-initiative/).

We enable a network provider to compute the egocentric betweenness centrality of a node, while requiring only differentially-private information about internal connections to be shared between networks. Our main contributions are:

  1. We introduce a privacy-preserving method to compute the egocentric betweenness centrality of nodes in undirected graphs. In our setting each network has its own internal connections, while further connections span the two networks.

  2. We propose a two-stage sampling process that is simple to implement and delivers exponential savings in time and space over naïve sampling from the exponential mechanism directly.

  3. We report on thorough experiments using a Facebook graph data set on 63,000 nodes. The experiments in Section VI show that the error is approximately 16% of the true EBC for reasonable values of privacy level $\epsilon$. Similar results hold for other networks from Enron and PGP email.

First we survey the technical background and give a precise definition of EBC. We then explain why a precise computation would expose individual links between networks. Section V describes our differentially-private mechanism for communicating enough information between networks to permit effective approximation of EBC while preserving strong privacy guarantees. We then present empirical results testing the feasibility of our approach on samples of public data from Facebook, Enron and PGP.

I-A Related Work

$k$-anonymity [6] represents a major, early attempt at preventing node and edge re-identification by graph transformation, but it has been proven to be insufficient [7].

Differential privacy for graph processing was first introduced in [8] and was followed up by further work [9, 10, 11, 12]. Two main privacy models exist when publishing graph-based information under differential privacy: node [10] and edge differential privacy [8]. Hay et al. [8] introduced an algorithm for publishing degree distributions under edge privacy, implicitly permitting private $k$-star counting as well. Projection-based techniques have been proposed to answer degree distribution queries under node differential privacy [13, 14].

Other statistics have been approximated under differential privacy such as frequent patterns of given subgraphs [9, 15, 16]. Bhaskar et al. [15] used the exponential mechanism to publish the (approximately) most frequent patterns with high probability, and the Laplace mechanism to release the noisy frequency of maximising patterns. Karwa et al. [16] proposed a differentially-private algorithm to output answers to subgraph counting queries for $k$-star and $k$-triangle subgraphs while using local sensitivity [17] to overcome high global sensitivity in subgraph counting queries. An approach to finding arbitrary frequent patterns was proposed by Shen & Yu [12]. They utilise the exponential mechanism and Markov chain Monte Carlo sampling to output frequent patterns on graph data sets.

Finding node clusters in a single graph under differential privacy was first proposed by [11] and followed by [18]. These techniques try to find the group of nodes sharing many links with other nodes in the same group but relatively few outside the group. They maintain the privacy of the output clusters under node or edge differential privacy.

Our work differs from previous studies in two key ways. First we focus on the problem of node influence, through the study of ego betweenness centrality. This particular task poses significant technical challenges, made efficient here by adopting two-stage stratified and accept-reject sampling. Second we consider a core graph processing task in a distributed two-party setting. While most existing work on graph mining under differential privacy can adopt a model of trusted computation, and there is some work on privacy for distributed systems [19, 20], these are based on distributed queries that are decomposed into sub-queries, each answered per database. Our setting requires untrusting parties to cooperate on computation without revealing one another’s privacy-sensitive data.

II Preliminaries

II-A Egocentric Betweenness Centrality

First proposed by Everett & Borgatti [21] as an approximation to betweenness centrality [22], egocentric betweenness centrality (EBC) has gained recognition in its own right as a natural measure of a node’s importance as a network bridge [23]. The EBC of a node $a$ is the sum, for all pairs of neighbours of $a$ that aren’t directly connected, of the fraction of 2-edge paths between them that pass through $a$.

Definition 1.

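Egocentric betweenness centrality (EBC) of node $a$ in simple undirected graph $(V, E)$ is defined as

$$\mathrm{EBC}(a) = \sum_{i, j \in N_a :\, A_{ij} = 0,\, j > i} \frac{1}{A^2_{ij}},$$

where $N_a = \{v \in V : \{a, v\} \in E\}$ denotes the neighbourhood or ego network of $a$, $A$ denotes the adjacency matrix induced by $N_a \cup \{a\}$ with $A_{ij} = 1$ if $\{i, j\} \in E$ and $A_{ij} = 0$ otherwise; $A^2_{ij}$ denotes the $ij$-th entry of the matrix square, guaranteed positive for all $i, j \in N_a$ since all such nodes are connected through $a$.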
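As a minimal sketch of Definition 1, the following computes EBC directly from an adjacency structure. The dictionary-of-sets graph representation, the function name `egocentric_betweenness`, and the toy graph are illustrative assumptions, not part of the paper.

```python
from itertools import combinations


def egocentric_betweenness(adj, ego):
    """EBC of `ego` in an undirected graph given as {node: set_of_neighbours}."""
    neighbours = adj[ego]
    ego_net = neighbours | {ego}  # nodes inducing the adjacency matrix A
    total = 0.0
    for i, j in combinations(sorted(neighbours), 2):
        if j in adj[i]:
            continue  # A_ij = 1: directly connected pairs are skipped
        # (A^2)_ij is the number of common neighbours of i and j in the ego network
        two_paths = sum(1 for k in ego_net if k in adj[i] and k in adj[j])
        total += 1.0 / two_paths  # two_paths >= 1 because of the ego itself
    return total


# Hypothetical toy graph: ego 0 with neighbours 1, 2, 3 and one extra edge 1-2.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(egocentric_betweenness(graph, 0))
```

Counting common neighbours inside the ego network is equivalent to reading off the entry of the squared adjacency matrix, which avoids materialising the matrix for this small illustration.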

II-B Differential Privacy on Graphs

Differential privacy was proposed to quantify the indistinguishability of input databases when observing the output of data analysis [4]. With careful selection of which databases are to be indistinguishable—through the so-called neighbouring relation—the protective semantics of differential privacy may be controlled.

As detailed further in Section III, our concern is maintaining the privacy of connections in networks, e.g., who calls whom in a telecommunications network. We therefore use edge privacy [8] and so relate graphs that differ by edges. The adjacency matrix fully represents the edge set of a graph of known nodes (the indices into the adjacency matrix). As such, we focus on databases as sequences of $n$ bits: elements of $\{0,1\}^n$.

Formally, two databases $D, D' \in \{0,1\}^n$ are termed neighbouring (denoted $D \sim D'$) if there exists exactly one $i \in \{1, \ldots, n\}$ such that $D_i \neq D'_i$ and $D_j = D'_j$ for all $j \neq i$. In other words, $\|D - D'\|_1 = 1$.

Definition 2.

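For $\epsilon > 0$, a randomised algorithm on databases, or mechanism, $M$ is said to preserve $\epsilon$-differential privacy if for any two neighbouring databases $D \sim D'$, and for any measurable set $S$ of responses,

$$\Pr(M(D) \in S) \le \exp(\epsilon)\, \Pr(M(D') \in S).$$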

II-B1 Generic Mechanisms for Privacy

We leverage two well-known DP mechanisms in this paper: the Laplace mechanism [4] which applies additive noise to numeric vector-valued analyses, and the exponential mechanism [24] which privately optimises a real-valued objective function bivariate in the database and the decision variable which need not be numeric. Common to most generic mechanisms, and the Laplace and exponential in particular, is the concept of sensitivity-calibrated randomisation: the more sensitive a target function is to input perturbation, the more randomisation is required to attain a level of differential privacy. Both mechanisms leveraged here are calibrated via the same measure of sensitivity, defined next.

Definition 3.

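The $L_1$-global sensitivity of any function $f : \{0,1\}^n \to \mathbb{R}^d$ for any $d \in \mathbb{N}$, is defined as

$$\Delta f = \max_{D \sim D'} \| f(D) - f(D') \|_1 .$$

For functions of additional variables we extend this definition naturally as

$$\Delta f = \sup_{r} \max_{D \sim D'} \| f(D, r) - f(D', r) \|_1 .$$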

We can now define the aforementioned generic mechanisms.

Lemma 4.

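Consider any Euclidean vector-valued deterministic function $f : \{0,1\}^n \to \mathbb{R}^d$ for any $d \in \mathbb{N}$, and any scalar $\epsilon > 0$. Given input $D$, the Laplace mechanism releases responses in $\mathbb{R}^d$ distributed as $f(D) + Z$ where $Z$ is a vector of $d$ i.i.d. zero-mean Laplace r.v.’s with scale $\Delta f / \epsilon$. (The zero-mean scalar Laplace with scale $\lambda$ has PDF $\exp(-|x|/\lambda) / (2\lambda)$.) Then the Laplace mechanism preserves $\epsilon$-differential privacy.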
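The Laplace mechanism can be sketched in a few lines using only the standard library. The function name `laplace_mechanism` and its parameters are hypothetical; the sketch assumes the caller supplies a valid sensitivity bound, and it uses ordinary floating point, so it is an illustration rather than a hardened implementation.

```python
import random


def laplace_mechanism(values, sensitivity, epsilon, rng=random):
    """Add i.i.d. Laplace(sensitivity / epsilon) noise to each coordinate."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is a zero-mean Laplace variable with scale `scale`.
    return [v + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
            for v in values]


# Example: privatise a vector of three counts with sensitivity 2 and epsilon 1.
print(laplace_mechanism([4, 0, 7], sensitivity=2.0, epsilon=1.0))
```

Smaller values of `epsilon` or larger sensitivity bounds increase the noise scale, which is the sensitivity-calibrated randomisation described above.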

Lemma 5.

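Consider any real-valued bivariate quality function $q : \{0,1\}^n \times \mathcal{R} \to \mathbb{R}$, which assigns quality score $q(D, r)$ to candidate response $r \in \mathcal{R}$, on input database $D$. The exponential mechanism approximately maximises $q(D, \cdot)$ by releasing randomised response $r$ with likelihood proportional to $\exp\left(\epsilon\, q(D, r) / (2 \Delta q)\right)$. Then the exponential mechanism preserves $\epsilon$-differential privacy.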
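For a small, explicitly enumerated response set, the exponential mechanism can be sketched as below. This assumes the common calibration in which a response is drawn with probability proportional to exp(epsilon * quality / (2 * sensitivity)); the function name and arguments are illustrative. Enumerating all candidates is only feasible for tiny response sets, which is the limitation the two-stage sampler of Section V is designed to avoid.

```python
import math
import random


def exponential_mechanism(candidates, quality, epsilon, sensitivity, rng=random):
    """Sample r with probability proportional to
    exp(epsilon * quality(r) / (2 * sensitivity))."""
    scores = [quality(r) for r in candidates]
    top = max(scores)
    # Subtracting the maximum score leaves the distribution unchanged
    # but avoids overflow inside exp().
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Example: privately pick the integer closest to 2 from a short list.
print(exponential_mechanism([0, 1, 2, 3], lambda r: -abs(r - 2), 1.0, 1.0))
```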

II-B2 Compositional Calculus

In order to build up more complex privacy-preserving computations, it is necessary to be able to quantify the privacy loss of compositions. Fortunately, differential privacy satisfies sequential composition and transformation invariance [4, 25, 26] among other compositions.

Lemma 6 (Sequential composition).

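For any sequence of randomised mechanisms $M_1, \ldots, M_k$, if each $M_i$ preserves $\epsilon_i$-differential privacy then the compound response on a database $D$, $(M_1(D), \ldots, M_k(D))$, preserves $\left(\sum_{i=1}^{k} \epsilon_i\right)$-differential privacy.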

Lemma 7 (Transformation invariance).

For any mechanism $M$ that is $\epsilon$-differentially private, and any (possibly randomised) mapping $g$ with domain containing the co-domain of $M$, the randomised mechanism $g \circ M$ preserves $\epsilon$-differential privacy.

III Problem Statement

Consider a two-party setting of a telecommunications network with two service providers $A, B$: every customer is represented as a node that belongs to one and only one service provider; pairs of customers who e.g., have called one another are represented as edges in a simple undirected graph on the disjoint union of nodes. Edges can either connect nodes within one party ($A$ or $B$), in which case they are unknown to the other party ($B$ or $A$ respectively), or span both parties, in which case they are known to both. We consider all nodes to be known to both parties, as being addressable within a global addressing system (e.g., a phone book).

Denote by $V_A, V_B$ the nodes of $A, B$ respectively, $E_A, E_B$ the edges (two-element sets) within parties $A, B$ respectively, and $E_{AB}$ the edges spanning $A, B$ as sets with one element each from $V_A, V_B$. The simple undirected graph on the entire network comprises node-set disjoint union $V_A \sqcup V_B$ and edge-set disjoint union $E_A \sqcup E_B \sqcup E_{AB}$. Note we will often equivalently represent edge sets as adjacency matrices (or flattened vectors) with elements in $\{0, 1\}$. Table I shows all of the symbols used in this paper.

We wish to enable one party $A$ (without loss of generality) to compute the ego betweenness centrality (EBC) of one of its nodes $a \in V_A$, while maintaining edge privacy between parties. Before detailing a protocol for accomplishing this task, we must be precise about a privacy model.

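Symbol | Meaning
$A, B$ | The two parties e.g., competing service providers.
$V_A, V_B$ | The nodes per party.
$E_A, E_B$ | Edges entirely within each party.
$E_{AB}$ | Edges spanning both parties.
$a$ | The ego node (assumed WLOG to be in $V_A$).
$N_a$ | The ego network of $a$.
$V_A \setminus \{a\}$ | Party $A$’s nodes excluding $a$.
$R^\star$ | The ego network contained in $V_A \setminus \{a\}$.
$T_{ij}$ | Counts of 2-paths spanning $A, B$.
$S_{AA}, S_{AB}, S_{BB}$ | Partial EBC sums by endpoints.
$R$ | A private, randomised approximation to $R^\star$.
$\mathcal{R}_k$ | For $k \in \{0, \ldots, |V_A \setminus \{a\}|\}$, a partition of $2^{V_A \setminus \{a\}}$.
$\epsilon$ | The differential-privacy budget.
$\Delta$ | A global sensitivity bound.
TABLE I: Glossary of symbols used in this paper.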
Problem 8 (Private Two-Party EBC).

Consider a simple undirected graph $(V_A \sqcup V_B, E_A \sqcup E_B \sqcup E_{AB})$ partitioned by parties $A, B$ as above, and an arbitrary node $a \in V_A$. The problem of private two-party egocentric betweenness centrality is for the parties to collaboratively approximate $\mathrm{EBC}(a)$ under assumptions that:

  1. Both parties know the entire node set $V_A \sqcup V_B$;

  2. Each party knows every edge incident to nodes within their own network. That is, $A$ knows $E_A \sqcup E_{AB}$ while $B$ knows $E_B \sqcup E_{AB}$; and

  3. The computed $\mathrm{EBC}(a)$ needs to be available to $A$ but need not be shared with $B$.

Any solution must not reveal to either party what is not already known to it, except for $A$ discovering $\mathrm{EBC}(a)$ (Assumption 3). We seek solutions under an honest-but-curious adversarial model: while $A, B$ will follow any agreed-upon protocol prescribing computations to take and messages to send to one another, without attempting to manipulate the other party, each party is curious about the other’s edges and may apply arbitrary auxiliary computation and leverage data sources in attempting to discover the other’s edges. Formally, what is revealed by $A$ ($B$) to $B$ (respectively $A$) must preserve $\epsilon$-differential privacy with respect to $E_A$ (respectively $E_B$).

IV Warm-Up: A Non-Private Protocol

We first consider how $A, B$ might cooperate without preserving differential privacy. In particular $A$ cannot itself count 2-paths that are

  • Contained entirely within $V_B$; or

  • Ending in both $V_A, V_B$ with intermediate node in $V_B$.

Any protocol must involve $B$ in aggregating over such paths. But while the first case can be aggregated independently by $B$, the second case requires $A$ to communicate its endpoint neighbours of $a$ to $B$. This significantly complicates the differentially-private solution developed in the next section.

Recall that $N_a$ denotes the ego network of $a$ anywhere in the graph (notably not including $a$ since the graph has no self-loops). Figure 1 summarises the following protocol.

Protocol 9.

Proceeding in sequence:

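  1. [Forward message] $A$ sends to $B$ the set $R^\star = N_a \cap V_A$ of neighbours of $a$ contained within $V_A$;

  2. [Backward message] $B$ computes and sends to $A$, for each $i \in R^\star$ and for each $j \in N_a \cap V_B$ (where $i, j$ are not directly connected), a count $T_{ij}$ of 2-paths with endpoints $i, j$ and intermediate point in $V_B$;

  3. [Backward message] $B$ computes and sends to $A$, the EBC partial sum $S_{BB}$ over endpoint nodes in $N_a \cap V_B$ with intermediate nodes in $N_a \cup \{a\}$. That is, $S_{BB} = \sum_{i, j \in N_a \cap V_B :\, A_{ij} = 0,\, j > i} 1 / A^2_{ij}$;

  4. $A$ increments the received $T_{ij}$ by the number of 2-paths between $i, j$ with intermediate point in $V_A$. It then sets $S_{AB}$ to the sum of their reciprocals;

  5. $A$ computes, over distinct and disconnected endpoint nodes in $R^\star$ with intermediate node in $N_a \cup \{a\}$, the EBC partial sum. That is: $S_{AA} = \sum_{i, j \in R^\star :\, A_{ij} = 0,\, j > i} 1 / A^2_{ij}$; and

  6. $A$ completes computation of $\mathrm{EBC}(a)$ as $S_{AA} + S_{AB} + S_{BB}$.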
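The decomposition behind the protocol above can be checked centrally on a toy graph. The sketch below is not the message exchange itself: it assumes access to the whole graph and simply splits the EBC sum by which party owns each endpoint, giving three partial sums (called AA, AB and BB here). The graph, the node partition and the function name are hypothetical.

```python
from itertools import combinations


def partial_ebc_sums(adj, ego, nodes_a):
    """Split EBC(ego) into (S_AA, S_AB, S_BB) by ownership of the endpoints."""
    ego_net = adj[ego] | {ego}
    sums = {"AA": 0.0, "AB": 0.0, "BB": 0.0}
    for i, j in combinations(sorted(adj[ego]), 2):
        if j in adj[i]:
            continue  # directly connected endpoints contribute nothing
        two_paths = sum(1 for k in ego_net if k in adj[i] and k in adj[j])
        in_a = (i in nodes_a) + (j in nodes_a)  # endpoints owned by party A
        sums[("BB", "AB", "AA")[in_a]] += 1.0 / two_paths
    return sums["AA"], sums["AB"], sums["BB"]


# Hypothetical example: party A owns {0, 1, 2}, party B owns {3, 4}; ego is 0.
graph = {0: {1, 2, 3, 4}, 1: {0, 3, 4}, 2: {0}, 3: {0, 1}, 4: {0, 1}}
s_aa, s_ab, s_bb = partial_ebc_sums(graph, 0, nodes_a={0, 1, 2})
print(s_aa, s_ab, s_bb, s_aa + s_ab + s_bb)
```

In the two-party setting no single party can evaluate all three terms alone, which is why the forward and backward messages are needed.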

Fig. 1: EBC two-party computation protocol between Party $A$ and Party $B$, comprising one forward and two backward messages.
Remark 10.

We now briefly comment on where edge privacy is potentially breached, thereby highlighting challenges faced by any solution to Problem 8. When sends its set of neighbours of , party learns directly of all edges incident to in . When sends its 2-path counts , while counts aggregate exact connectivity at worst this level of aggregation could be very small therefore revealing information about connections to within , and the inter-connections between these nodes. A worst case occurs when there are two nodes in connected to : as soon as receives the vector of counts of 2-paths spanning , it can learn whether these two nodes are connected.

V A Privacy-Preserving Protocol

We now develop our protocol for private EBC which involves a series of differentially-private mechanisms for overcoming the privacy disclosures identified in Remark 10. We relegate proofs to the Appendices.

We use the exponential mechanism (viz. Lemma 5) to release a set $R$ of nodes in $V_A$ that privately approximates $a$’s ego network in $V_A$. This is Protocol 9.i’s ‘forward message’ (Section V-A).

While our application of the exponential mechanism protects edge privacy for $A$, there is still potential for privacy disclosure when $B$ communicates counts to $A$. To overcome this problem, we leverage the Laplace mechanism to privatise vectors of 2-path counts communicated by $B$ within Protocol 9.ii’s ‘backward message’ (Section V-B). The components of this message are indexed (in part) by the approximate $R$ sent in the forward message. A second Laplace mechanism makes private the partial EBC of Protocol 9.iii’s ‘backward message’.

In this way our privacy-preserving protocol follows the broad-brush sequence of steps outlined in Section IV but is made more involved by the addition of differential privacy.

V-A Forward Message

The goal of the forward message is to communicate a privacy-preserving approximation $R$ to $R^\star = N_a \cap V_A$, the ego network of $a$ within $V_A$, chosen from the power set $2^{V_A \setminus \{a\}}$. In order to leverage the exponential mechanism (viz. Lemma 5) we must specify a quality function $q$. That is a mapping from the adjacency matrix for $E_A$ and a candidate response $R$, to a real-valued score reflecting the approximation quality of $R^\star$ by $R$. Since the response set is finite (albeit exponential in the graph size), the exponential mechanism then has normalised response probability mass function,

$$p(R) = \frac{\exp\left(\epsilon\, q(R) / (2\Delta)\right)}{\sum_{R' \subseteq V_A \setminus \{a\}} \exp\left(\epsilon\, q(R') / (2\Delta)\right)}, \qquad (1)$$

with implicit dependency on the fixed adjacency matrix.

In designing an appropriate quality function, we typically want $q(R)$ to be maximised uniquely by the desired non-private output $R^\star$. The function should also be a semantically meaningful ‘distance’ between outputs $R$ and $R^\star$ such that the utility bounds for the exponential mechanism of [24, Lemma 7 and Theorem 8] make meaningful guarantees. The utility guarantee states that with high probability the released random $R$ has score not too much lower than $\max_{R'} q(R')$. And so if $R^\star$ meets this global maximum, then we have that the released set has score not much lower than that of $R^\star$. If responses close in $q$ are also ‘close’ then this guarantees a good approximation to $R^\star$ with high probability.

Remark 11.

A natural choice for quality function is $q(R) = |R \cap R^\star|$ as it is clearly maximised by $R = R^\star$. However it is not uniquely maximised, indeed for any superset $R \supseteq R^\star$ (including the entire set of nodes) we have that $q(R) = q(R^\star)$ also. There are many such sets: $2^{|V_A \setminus \{a\}| - |R^\star|}$, which is not far from the number of all possible responses for modest ego network sizes, in which case the exponential mechanism does not achieve our goal.

Section V-A1 develops a sound choice of quality function in the symmetric set difference.

In the rest of this section we will abuse notation and abbreviate with the meaning understood from context. We will also denote by .

V-A1 Symmetric Set Difference

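We adopt (for minimisation) the symmetric set difference between $R$ and $R^\star$, given by $R \triangle R^\star = (R \setminus R^\star) \cup (R^\star \setminus R)$, as a promising basis for quality function design. Define complements $\bar{R}, \bar{R}^\star$ relative to $V_A \setminus \{a\}$. We have

$$|R \triangle R^\star| = \left| (V_A \setminus \{a\}) \setminus \left( (R \cap R^\star) \cup (\bar{R} \cap \bar{R}^\star) \right) \right| = |V_A \setminus \{a\}| - |R \cap R^\star| - |\bar{R} \cap \bar{R}^\star|,$$

where the second equality follows from a disjoint union in the first equality’s right-hand side.

In minimising the symmetric difference, dismissing the constant $|V_A \setminus \{a\}|$ as redundant to optimisation, we can equivalently maximise the following. (While $q$’s dependence on $R^\star$ is suppressed, it should be implicitly understood.)

$$q(R) = |R \cap R^\star| + |\bar{R} \cap \bar{R}^\star|. \qquad (2)$$

This quality function takes values in $\{0, \ldots, |V_A \setminus \{a\}|\}$ and is uniquely maximised by $R = R^\star$.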
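A small sketch of this quality function follows, using Python sets. The names `quality`, `universe` (standing for party A's nodes excluding the ego) and `true_neighbours` (the non-private ego network within that universe) are illustrative assumptions.

```python
def quality(candidate, true_neighbours, universe):
    """Number of nodes on which `candidate` and `true_neighbours` agree:
    nodes in both sets plus nodes in neither, with complements in `universe`."""
    agree_in = len(candidate & true_neighbours)
    agree_out = len(universe - (candidate | true_neighbours))
    return agree_in + agree_out


universe = {1, 2, 3, 4, 5}
true_neighbours = {1, 2}
for candidate in (set(), {1}, {1, 2}, {1, 2, 3}, universe):
    # The score equals |universe| minus the size of the symmetric difference.
    print(sorted(candidate), quality(candidate, true_neighbours, universe))
```

Unlike the intersection-only score of Remark 11, supersets of the true neighbourhood are penalised here, so the maximiser is unique.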

While the machinery of the exponential mechanism only requires sensitivity of this quality function (Section V-A5) to guarantee differential privacy, a significant challenge is involved in sampling from the mechanism’s response distribution as it is defined over an enormous response space: the power set of $V_A \setminus \{a\}$. Thanks to the amplification of $q$ by the exponential, the distribution’s mass varies an incredible amount even for graphs of modest size.

V-A2 Equi-Quality Responses

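It will be useful to consider the sets of candidate exponential mechanism responses, with equal quality score value $k$,

$$\mathcal{R}_k = \{ R \subseteq V_A \setminus \{a\} : q(R) = k \},$$

where $k \in \{0, \ldots, n\}$ and $n = |V_A \setminus \{a\}|$. It can be shown that the $\mathcal{R}_k$ form a partition of $2^{V_A \setminus \{a\}}$: the sets are pairwise disjoint, and their union is all subsets of $V_A \setminus \{a\}$. It can also be shown that for $k \in \{0, \ldots, n\}$,

$$|\mathcal{R}_k| = \binom{n}{k}.$$

A consequence of this identity is an efficient approach to computing the normalising constant for the exponential mechanism response distribution. The proof can be found in the Appendices.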
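The size of each equal-quality set can be checked by brute force on a tiny node set. This sketch enumerates every subset, so it is only a sanity check for small inputs and not something to run at scale; the function name and the example sets are hypothetical.

```python
from itertools import combinations
from math import comb


def stratum_sizes(universe, true_neighbours):
    """Count, for each score k, the subsets of `universe` whose quality is k."""
    nodes = sorted(universe)
    n = len(nodes)
    sizes = [0] * (n + 1)
    for r in range(n + 1):
        for subset in combinations(nodes, r):
            # quality = n minus the size of the symmetric difference
            k = n - len(set(subset) ^ true_neighbours)
            sizes[k] += 1
    return sizes


sizes = stratum_sizes({1, 2, 3, 4}, {1, 2})
print(sizes, sizes == [comb(4, k) for k in range(5)])
```

Each subset is determined by the nodes on which it disagrees with the true neighbourhood, which is why the counts are binomial coefficients regardless of the true set.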

Corollary 12.

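Consider the normalised exponential mechanism response distribution (1) for the quality function given in (2). Writing $n = |V_A \setminus \{a\}|$, the normalising constant is equivalent to

$$\sum_{k=0}^{n} \binom{n}{k} \exp\left( \frac{\epsilon k}{2\Delta} \right).$$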

Computing this expression takes time and space .

Note that other phases of our protocol, to be specified, also require linear space. By comparison computing naïvely would take time exponential in and constant space.

V-A3 Two-Stage Sampling

We will now outline how to sample from the exponential mechanism response distribution, using a two-stage sampling process that is simple to implement and delivers exponential savings in time and space over naïve sampling from the exponential mechanism.

Remark 13.

Another standard approach to sampling from challenging distributions is acceptance-rejection sampling or the Metropolis algorithm. However no clear surrogate probability mass presents itself that yields acceptable rates of rejection for practical applicability.

Stratified Sampling

To simplify notation, let us consider the problem at hand more generally: let $X$ be a discrete random variable on finite probability space $\Omega$, i.e., a multinomial with $\Pr(X = \omega) = p_\omega$. Suppose there exists a partition of $\Omega$ into the disjoint union of $\Omega_1, \ldots, \Omega_m$, such that for all $k \in \{1, \ldots, m\}$, and all $\omega, \omega' \in \Omega_k$, $p_\omega = p_{\omega'}$, i.e., the probability mass is constant within each part. We can exactly sample from $X$ by (1) drawing a random variable that selects a part in the partition according to the relative part sizes then (2) sampling uniformly from within the chosen part. This is the approach taken by Algorithm 1 ForwardMessage. Denote by $p_k$ the constant probability mass of any $\omega \in \Omega_k$. The following result is proven in the Appendices.

Lemma 14.

Define random variables $Q$ and $Y$ where $\Pr(Q = k) = p_k |\Omega_k|$ for each $k \in \{1, \ldots, m\}$, and $Y \mid Q \sim \mathrm{Unif}(\Omega_Q)$. Then $\Pr(Y = \omega) = \Pr(X = \omega)$ for all $\omega \in \Omega$. Moreover, the probability mass stated for $Q$ is already normalised, i.e., $\sum_{k=1}^{m} p_k |\Omega_k| = 1$.

Corollary 15.

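The following sampling process, implemented in ForwardMessage Algorithm 1, is equivalent to sampling from the exponential mechanism (1), where $n = |V_A \setminus \{a\}|$ and $\mathcal{R}_k$ denotes the set of responses with quality score $k$:

(i) Sample $Q \in \{0, \ldots, n\}$ with log-space probability mass

$$\log \Pr(Q = k) = \log \binom{n}{k} + \frac{\epsilon k}{2\Delta} - n \log\left(1 + \exp\left(\frac{\epsilon}{2\Delta}\right)\right);$$

(ii) Sample $R \mid Q \sim \mathrm{Unif}(\mathcal{R}_Q)$.

Proof:

As the exponential mechanism (1) on $2^{V_A \setminus \{a\}}$, with quality function (2), is a multinomial distribution with strata of constant probability given by the $\mathcal{R}_k$, Lemma 14 establishes that the stratified sampler successfully implements the mechanism. All that remains is to compute the probability of selecting $\mathcal{R}_k$ as the stratum’s cardinality times constant probability (normalised by $Z$ as given in Corollary 12).

$$\Pr(Q = k) = \frac{|\mathcal{R}_k| \exp\left(\frac{\epsilon k}{2\Delta}\right)}{Z} = \frac{\binom{n}{k} \exp\left(\frac{\epsilon k}{2\Delta}\right)}{\sum_{j=0}^{n} \binom{n}{j} \exp\left(\frac{\epsilon j}{2\Delta}\right)} = \frac{\binom{n}{k} \exp\left(\frac{\epsilon k}{2\Delta}\right)}{\left(1 + \exp\left(\frac{\epsilon}{2\Delta}\right)\right)^{n}},$$

where the final equality follows from the observation that the expression for $Z$ is a Binomial expansion. We simplify and convert this expression to log-space as the expression exponentially increases in $n$:

$$\log \Pr(Q = k) = \log \binom{n}{k} + \frac{\epsilon k}{2\Delta} - n \log\left(1 + \exp\left(\frac{\epsilon}{2\Delta}\right)\right),$$

completing the result. ∎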
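The log-space probability mass of the stratum index can be sketched as follows. This assumes the mechanism weights a response with score k by exp(epsilon * k / (2 * sensitivity)), so that the normaliser collapses to a binomial expansion; the function name is hypothetical and ordinary double precision is used instead of an arbitrary-precision library.

```python
import math


def log_stratum_pmf(n, epsilon, sensitivity=1.0):
    """Return [log Pr(Q = k) for k = 0..n] under the assumed mechanism form."""
    t = epsilon / (2.0 * sensitivity)
    # log(1 + e^t), rearranged so that large t does not overflow
    log1pexp = t + math.log1p(math.exp(-t))
    log_pmf = []
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_pmf.append(log_binom + k * t - n * log1pexp)
    return log_pmf


log_pmf = log_stratum_pmf(n=50, epsilon=1.0)
print(sum(math.exp(lp) for lp in log_pmf))  # should be numerically close to 1
```

Only n + 1 values are computed, in contrast to the 2^n responses of the full mechanism.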

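Require: edge set $E_A$; ego node $a$; privacy budget $\epsilon$; sensitivity bound $\Delta$
1:  $R^\star \leftarrow N_a \cap V_A$
2:  $n \leftarrow |V_A \setminus \{a\}|$
3:  $Q \leftarrow$ InverseTransformSampler$(n, \epsilon, \Delta)$
4:  $R \leftarrow$ PickAndFlipSampler$(V_A \setminus \{a\}, R^\star, n - Q)$
5:  return $R$
Algorithm 1 ForwardMessage Two-Stage Sampler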

V-A4 Linear-Complexity Sampling

While Corollary 15 replaces sampling from a large support set with highly skewed probability mass by two simpler sampling steps, we must still address efficient implementation of the two-stage sampling.

0:  cardinality ; // Compute log-space PDF of
1:  
2:  for  do
3:     
4:  end for// Search in CDF for random quantile
5:  
6:  
7:  for  do
8:     if   then
9:        return  
10:     end if
11:     
12:  end for
13:  return  
Algorithm 2 InverseTransformSampler
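Require: node set $V_A \setminus \{a\}$; node set $R^\star$; number of flips $d$
1:  $R \leftarrow R^\star$
2:  $v_1, \ldots, v_d \sim V_A \setminus \{a\}$ without replacement
3:  for $i = 1$ to $d$ do
4:     if $v_i \in R$ then
5:        $R \leftarrow R \setminus \{v_i\}$
6:     else
7:        $R \leftarrow R \cup \{v_i\}$
8:     end if
9:  end for
10:  return $R$
Algorithm 3 PickAndFlipSampler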
Inverse Transform Sampling of $Q$

The sampling of multinomial $Q$ over the much smaller support set $\{0, \ldots, |V_A \setminus \{a\}|\}$ can be accomplished efficiently via inverse transform sampling. Given access to a random variable $X$’s (invertible) CDF $F$, one can sample realisations of $X$ by first sampling $U \sim \mathrm{Unif}[0, 1]$ then releasing quantile $F^{-1}(U)$. For discrete $X$, we can always take $F^{-1}(u) = \min\{x : F(x) \ge u\}$. However highly-skewed distributions can suffer from numeric instability in floating-point computation of the CDF. Though not as severe as for $R$, this remains a problem for $Q$. To combat this we employ a library for arbitrary floating-point precision in our implementation (see Section VI) and we represent $Q$’s probability mass in log space as reported in Corollary 15. In this case, the inverse transform sampler is easily adapted as proved in the Appendices.

Proposition 16.

Consider r.v. defined in Corollary 15. It can be sampled via the InverseTransformSampler (Algorithm 2) in time and space given oracle access to exponential r.v.’s.
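A minimal sketch of inverse transform sampling from a log-space probability mass function is given below. It is not a transcription of Algorithm 2: it works in double precision rather than with an arbitrary-precision library, and the function name `sample_stratum` is hypothetical. It does use the same idea of drawing an exponential variable, since the logarithm of a uniform draw is minus a unit-rate exponential.

```python
import math
import random


def sample_stratum(log_pmf, rng=random):
    """Inverse-transform sample of an index k from a log-space PMF."""
    # log U for U ~ Unif(0, 1) equals minus a unit-rate exponential draw.
    log_u = -rng.expovariate(1.0)
    log_cdf = -math.inf
    for k, lp in enumerate(log_pmf):
        # log-sum-exp update: log_cdf <- log(exp(log_cdf) + exp(lp))
        hi, lo = max(log_cdf, lp), min(log_cdf, lp)
        log_cdf = hi if lo == -math.inf else hi + math.log1p(math.exp(lo - hi))
        if log_u <= log_cdf:
            return k
    return len(log_pmf) - 1  # guard against round-off in the final CDF value


# Example with an explicit three-point distribution.
print(sample_stratum([math.log(0.2), math.log(0.3), math.log(0.5)]))
```

The loop makes a single pass over the support and keeps only the running log-CDF, so time is linear in the support size.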

Pick-and-Flip Sampling of $R$

After sampling $Q$, we must sample $R$ uniformly from within chosen stratum $\mathcal{R}_Q$, a constrained and potentially large subset of $2^{V_A \setminus \{a\}}$. For sampled $Q = k$ we are to have $q(R) = k$, so the size of $R$ and $R^\star$’s symmetric set difference is $|V_A \setminus \{a\}| - k$. Moreover, this describes all candidates for $R$ within $\mathcal{R}_k$.

Proposition 17.

If r.v. is sampled with InverseTransformSampler (Algorithm 2), then Algorithm 3 PickAndFlipSampler yields a sample of r.v. defined in Corollary 15 in time and space .

Proof:

Every time a new node $v$ is sampled from $V_A \setminus \{a\}$, it could be sampled from either $R^\star$ or its complement $\bar{R}^\star$. In both cases, $v$ is added to the set difference between $R$ and $R^\star$. This loop invariant holds until the set difference size reaches $|V_A \setminus \{a\}| - k$ for sampled $Q = k$, establishing $R \in \mathcal{R}_k$, and uniformly so due to the uniform sampling of the $v$. Sampling the nodes (like the rest of the algorithm) can be achieved in linear time/space with Fisher-Yates shuffling. ∎
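The pick-and-flip idea can be sketched with the standard library as follows. The function name and arguments are illustrative: `universe` stands for party A's nodes excluding the ego, `true_neighbours` for the non-private ego network, and `num_flips` for the required symmetric-difference size.

```python
import random


def pick_and_flip(universe, true_neighbours, num_flips, rng=random):
    """Return a uniformly random set whose symmetric difference with
    `true_neighbours` has exactly `num_flips` elements."""
    result = set(true_neighbours)
    # Sample distinct nodes, then toggle the membership of each one.
    for v in rng.sample(sorted(universe), num_flips):
        if v in result:
            result.remove(v)
        else:
            result.add(v)
    return result


universe = set(range(10))
true_neighbours = {1, 2, 3}
released = pick_and_flip(universe, true_neighbours, num_flips=4)
print(sorted(released), len(released ^ true_neighbours))
```

Because the flipped nodes are drawn without replacement, each node is toggled at most once, so the symmetric difference has exactly the requested size.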

V-A5 Quality Function Global Sensitivity

The remaining ingredient for invoking the exponential mechanism to privately release $R$, is bounding $q$’s sensitivity.

Lemma 18.

Consider any fixed $R \subseteq V_A \setminus \{a\}$ and $V_A$-contained ego networks $R^\star, R^{\star\prime}$, induced by neighbouring adjacency matrices on $V_A$ and fixed ego node $a$. Noting explicitly the dependence of the quality function (2) on non-private ego network, $|q(R^\star, R) - q(R^{\star\prime}, R)| \le 1$.

Proof:

Consider the effect of switching an edge within $E_A$ on the symmetric difference cardinality between $R$ and $R^\star$, i.e., the quality function. Adding/removing an edge can impact at most one node being neighbours with ego node $a$; it can therefore only decrease or increase the first or second terms of $q$ by 1, at most. Since these two sets are disjoint, it cannot change both simultaneously. ∎

Theorem 19.

ForwardMessage (Algorithm 1) takes time and space , and when run with , preserves -differential privacy of the edge set within party ’s network.

Proof:

Privacy follows from Lemma 5, Corollary 15 and Lemma 18, complexity from Propositions 16 and 17. ∎

V-B Backward Message

Analogous to the non-private protocol of Section IV (Figure 1), $B$ receives node-set $R$ approximating $a$’s ego network in $V_A$, via ForwardMessage. Subsequently, $B$ must send back: counts of 2-paths spanning $A, B$ with intermediate node in $V_B$, indexed (in part) by $R$; and its partial EBC sum over paths with endpoints in $V_B$ and intermediate node in $V_B$ or $R$. Note this last set is the private approximation to $R^\star = N_a \cap V_A$. We apply in BackwardMessage (Algorithm 4) the Laplace mechanism (Lemma 4) to both backward message components to avoid disclosure of edges in $E_B$.

Note cases in which BackwardMessage need not be run by : If there are no paths incident to with intermediate point in ; or contained entirely within . We may therefore assume within the algorithm that these cases are not present.

0:  ego node ; edge sets ; private node set ; // Count 2-paths of type
1:  for  and  do
2:     
3:     
4:  end for// Partial EBC sum over 2-paths in
5:  
6:  .
7:  for  with and  do
8:     
9:     
10:  end for
11:  
12:  return  
Algorithm 4 BackwardMessage
0:  node sets and ; ; ; ego node ; edge sets // Party runs:
1:   ForwardMessage
2:  Send to party // Party runs:
3:   BackwardMessage
4:  Send to party // Party runs:
5:  
6:  for  and where  do
7:     
8:     if  then
9:        
10:     end if
11:     
12:     
13:  end for
14:  
15:  for  where and  do
16:     
17:     
18:  end for
19:  return   to party
Algorithm 5 PrivateEBC

V-B1 Privately Counting Paths

The first part of the backward message comprises the $T_{ij}$, noisy counts of 2-paths with intermediate node in $V_B$, over $i$ in the given $R$ approximating $R^\star$, and $j \in N_a \cap V_B$. The sensitivity of these counts relates to adding or removing an edge in $E_B$, as follows, with proof given in the Appendices.

Lemma 20.

Let query denote the vector-valued non-private response . The -global sensitivity of is upper-bounded by .

Fig. 8: Experimental results for Facebook, Enron and PGP data sets. Panels: (a) average relative error of 60 random nodes over a range of $\epsilon$, Facebook data set; (b) average relative error of 60 nodes over a range of $\epsilon$, Enron data set; (c) average relative error of 60 nodes over a range of $\epsilon$, PGP data set; (d) average relative error of 60 nodes in three different mechanisms, Facebook data set; (e) time of computing 20 random nodes, Facebook data set; (f) relative error of 100 nodes with different degrees, Facebook data set.

V-B2 Private Partial EBC

The second part of BackwardMessage is a partial EBC sum over 2-paths with end-points in $V_B$ (and intermediate point in either the same $V_B$ or $R$) as in Protocol 9.iii. We again apply the Laplace mechanism to avoid privacy disclosure of edges in $E_B$, which requires bounding sensitivity of the non-private sum as follows. The proof can be found in the Appendices.

Lemma 21.

Let query denote the partial EBC sum over 2-paths with end-points and intermediate node . Then the -global sensitivity of is upper-bounded by .

As both applications of the Laplace mechanism run with privacy budget , Lemma 6 implies overall edge privacy is guaranteed.

Corollary 22.

BackwardMessage (Algorithm 4) takes time and space where , and when run with , preserves -differential privacy of edge set within party ’s network.

V-C PrivateEBC: Putting it All Together

After parties $A$ and $B$ have respectively run ForwardMessage and BackwardMessage, $A$ must complete the computation of the private EBC. As shown in Algorithm 5, PrivateEBC comprises two phases that closely mirror the two components of BackwardMessage: counting 2-paths spanning $A, B$, and counting 2-paths with endpoints in $V_A$.

Within the first stage we incorporate BackwardMessage noisy counts contributed by , which count paths having intermediate nodes in . Party simply increments these values with counts of paths having endpoints in . The sum reciprocals forms . We make one optimisation to utility at no cost to privacy: counts for are discarded.

A straightforward sum over paths with endpoints in $V_A$ and intermediate points anywhere in the ego network completes $S_{AA}$. Finally party $A$ completes PrivateEBC by summing the partial EBCs.

Remark 23.

We opt to use the same privacy budget for both parties (Theorem 19) and (Corollary 22) in PrivateEBC out of symmetry. However the algorithm can operate with separate budgets if desired.

VI Experiments

To empirically validate the effectiveness of PrivateEBC we ran experiments on three graph data sets: a Facebook friendship graph [27] with 63,731 vertices and 817,035 edges; the Enron email network [28] with 36,692 nodes and 183,831 edges; and the Pretty Good Privacy (PGP) [27] data set with 10,680 users as vertices and 24,316 inter-user interactions as edges. We follow a random process to partition the nodes, while the structure of the graph stays intact: nodes are assigned to parties or independently and uniformly at random, while edges are not changed.
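The partitioning step can be sketched as follows. The party labels (`first_party`, `second_party`), the seed argument and the edge classification helper are assumptions made for illustration; only the independent, uniform assignment of nodes and the unchanged edge list come from the description above.

```python
import random


def partition_nodes(nodes, seed=None):
    """Assign each node to one of two parties independently with probability 1/2."""
    rng = random.Random(seed)
    first_party, second_party = set(), set()
    for node in nodes:
        if rng.random() < 0.5:
            first_party.add(node)
        else:
            second_party.add(node)
    return first_party, second_party


def classify_edges(edges, first_party):
    """Split an unchanged edge list by where its endpoints were assigned."""
    internal_first, internal_second, crossing = [], [], []
    for u, v in edges:
        membership = (u in first_party, v in first_party)
        if all(membership):
            internal_first.append((u, v))
        elif not any(membership):
            internal_second.append((u, v))
        else:
            crossing.append((u, v))
    return internal_first, internal_second, crossing
```

Every edge lands in exactly one of the three lists, so the graph structure itself is not altered by the partition.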

The experiments were run on a server with core Xeons (112 threads with hyper-threading) and 1.5 TB RAM, using Python 3.7 without parallel computations for fair comparison. We measure utility by the relative error between true and private EBC: the lower the relative error, the higher the utility. We employed the mpmath arbitrary-precision library and set the precision to 300 bits. Arbitrary precision is vital for implementing inverse transform sampling as described in Section V-A4.
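The utility metric can be sketched as below, assuming relative error is defined as the absolute difference between private and true EBC divided by the true EBC; the text does not spell out the formula, so this definition and the function names are assumptions.

```python
def relative_error(true_value, private_value):
    """Relative error |private - true| / |true| (assumed definition)."""
    if true_value == 0:
        raise ValueError("relative error is undefined for a true value of 0")
    return abs(private_value - true_value) / abs(true_value)


def average_relative_error(pairs):
    """Mean relative error over (true, private) pairs, one pair per ego node."""
    errors = [relative_error(true, private) for true, private in pairs]
    return sum(errors) / len(errors)
```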

VII Results

We first examine the relationship between utility and privacy for PrivateEBC. For 60 ego nodes selected uniformly at random from randomly-partitioned party , we report average relative error (comparing private and true EBC) for a range of privacy levels ε between 0.1 and 7. Figures 8(a) to 8(c) show the results for the three data sets: average relative error decreases dramatically as ε is increased to 1, and stays very small for larger ε. For , average relative error is usually below 50%. At the strong guarantee of the average relative error is 16% (Facebook), 47% (Enron) and 25% (PGP).

As we have employed three different privacy-preserving mechanisms in our proposed protocol, one exponential (Mech1) and two Laplace (Mech2, Mech3), we examine each separately to evaluate how they affect overall relative error. Specifically, we run the PrivateEBC protocol with only one of the privacy-preserving mechanisms intact and use the non-private version for the remaining mechanisms, with each of Mech1–Mech3 taking turns being private. In this way we can isolate the incremental cost to utility of each mechanism. Figure 8(d) reports the results on Facebook, which demonstrate that Mech1 (ForwardMessage) has the least impact on the relative error while Mech3 (the second component of BackwardMessage) has the highest impact. This suggests future work may focus on the third mechanism.

We next report on timing analysis for PrivateEBC as a function of privacy level. Median computation time of 20 random ego nodes for ε from 0.1 to 7 is reported in Figure 8(e) on Facebook data. Here total time is overall decreasing as privacy decreases (increasing ε), while a small increase to runtime can be seen at very high levels of privacy (low but increasing ε). This dual effect is slightly more pronounced on Enron and PGP (see the Appendices), and is likely due to different behaviours in the protocol with increasing ε. When the set difference of and is small, the two-stage sampler generates just small numbers of nodes in faster time. However, faster runtime with lower privacy dominates behaviour overall. Moreover, any effect of privacy is not strong, with at most a change in runtime which across data sets is practical at under 10 min (median) on the larger data sets.

Figure 8(f) shows how the relative error between true and private EBC varies by ego node degree. We report results on , which do not show significant dependence: for node degrees up to , deviation is approximately 7% of the maximum relative error, which is low.

VIII Conclusion and Future Work

In this paper we have developed the PrivateEBC algorithm, which comprises a protocol of differentially-private mechanisms for cooperative 2-party computation of egocentric betweenness centrality. Theoretical and empirical results demonstrate that our approach achieves strong privacy guarantees for both parties while achieving practical levels of utility with efficient time and space complexity. Notably we contribute a novel two-stage sampler that improves upon the exponential mechanism’s time and space complexities exponentially. PrivateEBC should extend naturally to multiple networks; we expect to add to our empirical investigations of efficiency in that case. It would be interesting to extend differential privacy to the case in which the answer needs to be returned by the party whose node is being queried to some untrusted authority.

Acknowledgment

This work is supported by the Australian Research Training Program and the Australian Research Council DE160100584.

References

  • [1] L. Backstrom, C. Dwork, and J. Kleinberg, “Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography,” in WWW’07, 2007, pp. 181–190.
  • [2] A. Narayanan, E. Shi, and B. I. Rubinstein, “Link prediction by de-anonymization: How we won the Kaggle social network challenge,” in IJCNN’11, 2011, pp. 1825–1834.
  • [3] A. Narayanan and V. Shmatikov, “De-anonymizing social networks,” in SP’09, 2009, pp. 173–187.
  • [4] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in TCC’06, 2006, pp. 265–284.
  • [5] K.-I. Goh, E. Oh, B. Kahng, and D. Kim, “Betweenness centrality correlation in social networks,” Physical Review E, vol. 67, no. 1, p. 017101, 2003.
  • [6] L. Sweeney, “k-anonymity: A model for protecting privacy,” Int. J. Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, 2002.
  • [7] C. C. Aggarwal and P. S. Yu, “A general survey of privacy-preserving data mining models and algorithms,” in Privacy-preserving data mining.   Springer, 2008, pp. 11–52.
  • [8] M. Hay, C. Li, G. Miklau, and D. Jensen, “Accurate estimation of the degree distribution of private networks,” in ICDM, 2009, pp. 169–178.
  • [9] J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao, “Private release of graph statistics using ladder functions,” in SIGMOD’15, 2015, pp. 731–745.
  • [10] W.-Y. Day, N. Li, and M. Lyu, “Publishing graph degree distribution with node differential privacy,” in SIGMOD’16, 2016, pp. 123–138.
  • [11] Y. Mülle, C. Clifton, and K. Böhm, “Privacy-integrated graph clustering through differential privacy.” in EDBT/ICDT Workshops, 2015, pp. 247–254.
  • [12] E. Shen and T. Yu, “Mining frequent graph patterns with differential privacy,” in KDD’13, 2013, pp. 545–553.
  • [13] S. P. Kasiviswanathan, K. Nissim, S. Raskhodnikova, and A. Smith, “Analyzing graphs with node differential privacy,” in TCC’13, 2013, pp. 457–476.
  • [14] S. Raskhodnikova and A. Smith, “Efficient Lipschitz extensions for high-dimensional graph statistics and node private degree distributions,” arXiv preprint arXiv:1504.07912, 2015.
  • [15] R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta, “Discovering frequent patterns in sensitive data,” in SIGKDD’10, 2010, pp. 503–512.
  • [16] V. Karwa, S. Raskhodnikova, A. Smith, and G. Yaroslavtsev, “Private analysis of graph structure,” PVLDB, vol. 4, no. 11, pp. 1146–1157, 2011.
  • [17] K. Nissim, S. Raskhodnikova, and A. Smith, “Smooth sensitivity and sampling in private data analysis,” in STOC’07, 2007, pp. 75–84.
  • [18] H. H. Nguyen, A. Imine, and M. Rusinowitch, “Detecting communities under differential privacy,” in Proceedings of the 2016 ACM Workshop on Privacy in the Electronic Society, 2016, pp. 83–93.
  • [19] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor, “Our data, ourselves: Privacy via distributed noise generation,” in EUROCRYPT’06, 2006, pp. 486–503.
  • [20] R. Chen, A. Reznichenko, P. Francis, and J. Gehrke, “Towards statistical queries over distributed private user data.” in NSDI, 2012, pp. 13–13.
  • [21] M. Everett and S. P. Borgatti, “Ego network betweenness,” Social networks, vol. 27, no. 1, pp. 31–38, 2005.
  • [22] L. C. Freeman, “Centrality in social networks conceptual clarification,” Social networks, vol. 1, no. 3, pp. 215–239, 1978.
  • [23] P. V. Marsden, “Egocentric and sociocentric measures of network centrality,” Social networks, vol. 24, no. 4, pp. 407–422, 2002.
  • [24] F. McSherry and K. Talwar, “Mechanism design via differential privacy,” in FOCS’07.   IEEE, 2007, pp. 94–103.
  • [25] D. Kifer and B.-R. Lin, “Towards an axiomatization of statistical privacy and utility,” in SIGMOD’10, 2010, pp. 147–158.
  • [26] F. D. McSherry, “Privacy integrated queries: an extensible platform for privacy-preserving data analysis,” in SIGMOD’09, 2009, pp. 19–30.
  • [27] Institute of Web Science and Technologies at the University of Koblenz–Landau, “The Koblenz network collection,” 2018. [Online]. Available: http://konect.uni-koblenz.de/
  • [28] Stanford University, “Stanford large network dataset collection.” [Online]. Available: https://snap.stanford.edu/data/index.html

Appendix A Supplemental Material

A-A Proof of Corollary 12

Consider the denominator of the exponential response distribution (1):

establishing the result.

A-B Proof of Lemma 14

The proof follows by splitting on , the chain rule of probability, and by definition of the r.v.’s. For any , denote to be such that , then

The penultimate equality follows from

This also establishes that the probability mass is already normalised.

A-C Proof of Proposition 16

The pseudo-inverse of the CDF follows the general case, while it is easy to show that , for , is distributed as . The time complexity corresponds to both computing the CDF (which need not be stored in its entirety) and a linear search for its inversion.
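Inverse transform sampling with a streamed CDF and a linear search can be sketched as follows for a generic discrete distribution. This is not the specific distribution of Proposition 16, and it substitutes Python's exact `fractions.Fraction` arithmetic for the mpmath arbitrary-precision arithmetic used in the experiments; the function names are illustrative assumptions.

```python
import random
from fractions import Fraction


def inverse_transform_sample(weights, u):
    """Return the smallest index k whose cumulative weight exceeds u * total.

    `weights` are non-negative, possibly unnormalised masses and `u`
    lies in [0, 1). The CDF is accumulated on the fly rather than
    stored, and inverted by linear search.
    """
    total = sum(weights, Fraction(0))
    threshold = u * total
    running = Fraction(0)
    for k, weight in enumerate(weights):
        running += weight
        if running > threshold:
            return k
    raise ValueError("u must lie in [0, 1) and weights must have positive sum")


def draw(weights, rng):
    """Draw one index using an exact uniform variate on a 64-bit grid."""
    u = Fraction(rng.getrandbits(64), 2 ** 64)
    return inverse_transform_sample(weights, u)
```

Exact or high-precision arithmetic avoids the rounding that can otherwise merge adjacent CDF values when individual masses are extremely small.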

A-D Proof of Lemma 20

Suppose that graphs and differ in some edge with and both in (that is, the edge would belong to ). Our task is to upper bound the corresponding change to counts resulting from running query on the two graphs. There can be at most choices of endpoint node for forming 2-hop paths affected by the addition/deletion. Similarly there can be at most choices of endpoint for paths affected by the addition/deletion. For each of these paths the addition/deletion can affect by at most 1. This proves the result.

A-E Proof of Lemma 21

There are two ways the addition or removal of an edge can affect . If the edge is the one between endpoints and , then this can change the term by at most 1, (from 0 to 1, in the case that the only other connection between and is via ). If the edge is within , then it can affect a term by at most : the denominator is incremented/decremented by 1 while the denominator must always be at least 1 as a 2-path must go through . This can occur for at most terms in the sum, because they’re paths involving some intermediate node in that is neither nor . So overall the change in resulting from the addition or removal of one edge is at most:

Fig. 11: (a) Time of computing 20 random nodes vs. ε on Enron data set. (b) Time of computing 20 random nodes vs. ε on PGP data set.

A-F Additional Experimental Results

We present the timing analyses for PrivateEBC as a function of privacy level. Figures 11(a) and 11(b) show the results for the Enron and PGP data sets, which present similar behaviour to the Facebook data (cf. Fig. 8(e)).
