Anonymizing Social Graphs via Uncertainty Semantics

Hiep H. Nguyen, Abdessamad Imine, and Michaël Rusinowitch
LORIA/INRIA Nancy-Grand Est, France
Email: {huu-hiep.nguyen,michael.rusinowitch}@inria.fr, abdessamad.imine@loria.fr

Abstract

Rather than anonymizing social graphs by generalizing them to super nodes/edges or adding/removing nodes and edges to satisfy given privacy parameters, recent methods exploit the semantics of uncertain graphs to protect the privacy of participating entities and their relationships. These techniques anonymize a deterministic graph by converting it into an uncertain form. In this paper, we propose a generalized obfuscation model based on uncertain adjacency matrices that keep expected node degrees equal to those in the unanonymized graph. We analyze two recently proposed schemes and show how they fit into the model. We also point out disadvantages in each method and present several elegant techniques to fill the gap between them. Finally, to support fair comparisons, we develop a new framework for quantifying the tradeoff by leveraging the concept of incorrectness from location privacy research. Experiments on large social graphs demonstrate the effectiveness of our schemes.

I Introduction

Graphs represent a rich class of data observed in daily life where entities are described by vertices and their connections are characterized by edges. With the emergence of increasingly complex networks [11], the research community requires large and reliable graph data to conduct in-depth studies. However, this requirement usually conflicts with privacy rights of data contributing entities. Naive approaches like removing user ids from a social graph are not effective, leaving users open to privacy risks, especially re-identification attacks [1] [7]. Therefore, many graph anonymization schemes have been proposed [24, 9, 25, 4, 20, 18].

Given an unlabeled undirected graph, the existing anonymization methods fall into four main categories. The first category includes random addition, deletion and switching of edges to prevent the re-identification of nodes or edges. The methods in the second category provide k-anonymity [17] by deterministic edge additions or deletions, assuming that the attacker has background knowledge about certain properties of its target nodes. The methods in the third category assign edge probabilities to add uncertainty to the true graph. The edge probabilities may be computed explicitly as in [2] or implicitly via random walks [10]. Finally, the fourth class of techniques, generalization, clusters nodes into super nodes of size at least k. Note that the last two classes of schemes induce possible-world models, i.e., we can retrieve sample graphs that are consistent with the anonymized output graph.

The third category is the most recent class of methods, which leverage the semantics of edge probability to inject uncertainty into a given deterministic graph, converting it into an uncertain one. Most schemes in this category are scalable, i.e. they run on million-scale graphs or larger. As an example, Boldi et al. [2] introduced the concept of $(k,\epsilon)$-obfuscation (denoted as $(k,\epsilon)$-obf), where $k \geq 1$ is a desired level of obfuscation and $\epsilon \geq 0$ is a tolerance parameter. However, the pursuit of a minimum standard deviation $\sigma$ in $(k,\epsilon)$-obf has a high impact on node privacy and leads to a high privacy-utility tradeoff. The edge rewiring method based on random walks (denoted as RandWalk) in [10] also introduces uncertainty to edges, as we show in Section IV. This scheme suffers from high lower bounds for utility despite its excellent privacy-utility tradeoff.

Motivated by $(k,\epsilon)$-obf and RandWalk, we propose in this work a generalized model for anonymizing graphs based on edge uncertainty. Both $(k,\epsilon)$-obf and RandWalk fit into the model. We point out the disadvantages of $(k,\epsilon)$-obf and RandWalk and the tradeoff gap between them, and present several elegant techniques to fill this gap. Finally, to support fair comparisons, we develop a new framework for quantifying the tradeoff using the concept of incorrectness from location privacy research [15].

Our contributions are summarized as follows:

  • We propose a generalized model called uncertain adjacency matrix for anonymizing graphs via edge uncertainty semantics (Section IV). The key property of this model is that the expected degrees of all nodes must be unchanged. We show how $(k,\epsilon)$-obf and RandWalk fit into the model and then analyze their disadvantages (Sections III, IV).

  • We introduce the Maximum Variance (MaxVar) scheme (Section V) that satisfies all the properties of the uncertain adjacency matrix. It achieves good privacy-utility tradeoff by using two key observations: nearby potential edges and maximization of total node degree variance via a simple quadratic program.

  • Towards a fair comparison for anonymization schemes on graphs, this paper describes a generic quantifying framework (Section VI) by putting forward the distortion measure (also called incorrectness in [15]) to measure the re-identification risks of nodes. As for the utility score, typical graph metrics [2] [21] are chosen.

  • We conduct a comparative study of aforementioned approaches on three real large graphs and show the effectiveness of our gap-filling solutions (Section VII).

Table I summarizes notations used in this paper.

Symbol Definition
true graph with and
uncertain graph constructed from
sample graph from ,
degree of node in
number of nodes having degree in
neighbors of node in
truncated normal distribution on [0,1]
a sample from the distribution
() probability of edge ()
number of potential edges,
, adjacency matrices of ,
random walk transition matrix of
uncertain adjacency matrix,
walk length
switching matrix
total degree variance
TABLE I: List of notations

II Related Work

II-A Anonymizing Deterministic Graphs

There is a vast literature on graph perturbation that deserves a survey. In this section, we enumerate only several groups of ideas that are related to our proposed schemes.

II-A1 Anonymizing unlabeled vertices for node privacy

In unlabeled graphs, node identifiers are numbered in an arbitrary manner after removing their labels. An attacker aims at re-identifying nodes solely based on their structural information. For this class of graphs, node privacy protection implies link privacy. Edges and nodes can be added and removed either randomly or deterministically. Random perturbation is a naive approach and is usually used as a baseline method. More guided approaches consist of k-neighborhood [24], k-degree [9], k-automorphism [25], k-symmetry [20], k-isomorphism [4] and $k^2$-degree [18]. These schemes provide k-anonymity [17] semantics and usually rely on heuristics to avoid combinatorial intractability. K-automorphism, k-symmetry, and k-isomorphism can resist any structural attack by exploiting the inherent symmetry in the graph. $k^2$-degree addresses the friendship attacks, which are based on the vertex degree pair of an edge. Ying and Wu [21] propose a spectrum-preserving approach which carefully chooses edge pairs to switch so that the spectrum of the adjacency matrix does not vary too much. The clearest disadvantage of the above schemes is that they are inefficient on large-scale graphs.

Apart from the two above categories, perturbation techniques have other categories that settle on possible-world semantics. Hay et al. [7] generalize a network by clustering nodes and publishing a graph summarization of super nodes and super edges. The utility of this scheme is limited. On the other hand, Boldi et al. [2] take the uncertain graph approach. With edge probabilities, the output graph can be used to generate sample graphs by independent edge sampling. Our approach belongs to this class of techniques, with a different formulation and a better privacy-utility tradeoff. Note that in k-symmetry [20], the output sample graphs are also possible worlds of the symmetric intermediate graph.

II-A2 Anonymizing labeled vertices for link privacy

If nodes are labeled, we are only concerned about the link disclosure risk. For example, Mittal et al. [10] employ an edge rewiring method based on random walks to keep the mixing time tunable and prevent link re-identification by Bayesian inference. This method is effective for social network based systems, e.g. Sybil defense, DHT routing. Link privacy is also described in [21] for Random Switch, Random Add/Del. Interestingly, RandWalk [10] can also be used for unlabeled graphs as shown in Section IV.

II-A3 Min entropy, Shannon entropy and incorrectness measure

We now survey commonly used privacy metrics. Min entropy [16] quantifies the largest probability gap between the posterior and prior over all items in the input dataset. K-anonymity has the same semantics as the corresponding min entropy of $\log_2 k$. So we say k-anonymity based perturbation schemes belong to the min entropy family. Shannon entropy, argued for in [3] and [2], is another choice of privacy metric. The third metric, which we use in this paper, is the incorrectness measure from location privacy [15]. Given the prior information (e.g. node degree in the true graph) and the posterior information harvested from the anonymized output, the incorrectness measure is the number of incorrect guesses made by the attacker. This measure gauges the distortion caused by the anonymization algorithm.

II-B Mining Uncertain Graphs

Uncertain graphs pose big challenges to traditional mining techniques. Because of the exponential number of possible worlds, naive enumerations are intractable. Typical graph search operations like k-Nearest neighbor and pattern matching require new approaches [13] [26] [23]. Those methods answer threshold-based queries by using pruning strategies based on Apriori property of frequent patterns.

III Preliminaries

This section starts with definitions and common assumptions on uncertain graphs. It then analyzes vulnerabilities in $(k,\epsilon)$-obf [2].

III-A Uncertain Graph

Let $\mathcal{G} = (V, E, p)$ be an uncertain undirected graph, where $p: E \rightarrow [0,1]$ is the function that gives an existence probability to each edge (see Fig. 1(b)). The common assumption is that edge probabilities are independent. Following the possible-worlds semantics in relational data [5], the uncertain graph $\mathcal{G}$ induces a set $\{G = (V, E_G)\}$ of $2^{|E|}$ deterministic graphs (worlds), each defined by a subset of $E$. The probability of $G = (V, E_G)$ is:

$\Pr(G) = \prod_{e \in E_G} p(e) \prod_{e \in E \setminus E_G} (1 - p(e))$ (1)

Note that deterministic graphs are also uncertain graphs with all edges having probabilities 1.
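
As a minimal sketch of the possible-world semantics (not code from the paper), the following Python snippet enumerates the worlds of a tiny uncertain graph, evaluates the probability of each world as the product in Equation (1), and draws one sample graph by independent edge sampling. The edge list, the probabilities and all function names are illustrative assumptions.

```python
import itertools
import random

def world_probability(edge_probs, world):
    """Probability of one world: p(e) for present edges, 1 - p(e) for absent ones."""
    prob = 1.0
    for edge, p in edge_probs.items():
        prob *= p if edge in world else 1.0 - p
    return prob

def enumerate_worlds(edge_probs):
    """Yield every subset of the uncertain edge set (2^|E| worlds)."""
    edges = list(edge_probs)
    for mask in itertools.product([False, True], repeat=len(edges)):
        yield frozenset(e for e, keep in zip(edges, mask) if keep)

def sample_world(edge_probs, rng):
    """Draw one sample graph by flipping an independent coin per edge."""
    return frozenset(e for e, p in edge_probs.items() if rng.random() < p)

# Hypothetical uncertain graph on four nodes (probabilities are made up).
edge_probs = {(1, 2): 0.9, (1, 3): 0.8, (3, 4): 0.7, (1, 4): 0.3}
worlds = list(enumerate_worlds(edge_probs))
total = sum(world_probability(edge_probs, w) for w in worlds)
sample = sample_world(edge_probs, random.Random(0))
```

Exhaustive enumeration is only practical for a handful of edges; on real graphs one would work with sampled worlds only.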

III-B $(k,\epsilon)$-obf and Its Limitations

In [2], Boldi et al. extend the concept of k-obfuscation developed earlier [3].

Definition III.1

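$(k,\epsilon)$-obf [2]. Let $P$ be a vertex property, $k \geq 1$ be a desired level of obfuscation, and $\epsilon \geq 0$ be a tolerance parameter. The uncertain graph $\mathcal{G}$ is said to k-obfuscate a given vertex $v$ with respect to $P$ if the entropy of the distribution $Y_{P(v)}$ over the vertices of $\mathcal{G}$ is greater than or equal to $\log_2 k$:

$H(Y_{P(v)}) \geq \log_2 k$ (2)

The uncertain graph $\mathcal{G}$ is a $(k,\epsilon)$-obf with respect to property $P$ if it k-obfuscates at least $(1-\epsilon)n$ of the $n$ vertices of the true graph with respect to $P$.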

Fig. 1: (a) True graph (b) An obfuscation with potential edges (dashed) (c) Truncated normal distribution on [0,1] (bold solid curves)

Given the true graph (Fig. 1(a)), the basic idea of $(k,\epsilon)$-obf (Fig. 1(b)) is to transfer the probabilities from existing edges to potential (non-existing) edges to satisfy Definition III.1. Each existing sampled edge $e$ is assigned a probability $1 - r_e$, where $r_e \leftarrow R_\sigma$ (Fig. 1(c)), and each non-existing sampled edge $e'$ is assigned a probability $r_{e'} \leftarrow R_\sigma$.

Table II gives an example of how to compute degree entropy for the uncertain graph in Fig. 1(b). Here the vertex property is the node degree. Each row on the left side is the degree distribution for the corresponding node. For instance, has degree with probability . The right side normalizes values in each column (i.e. in each degree value) to get distributions . The entropy for each degree value is shown in the bottom row. Given , then with true degree 2 and with true degree 1 satisfy (2). Therefore, .

node     degree uncertainty              normalized values
         d=0    d=1    d=2    d=3        d=0    d=1    d=2    d=3
v1       .014   .188   .582   .216       .044   .117   .355   .491
v2       .210   .580   .210   .000       .656   .362   .128   .000
v3       .036   .252   .488   .224       .112   .158   .298   .509
v4       .060   .580   .360   .000       .187   .362   .220   .000
entropy                                  1.40   1.84   1.91   0.99
TABLE II: The degree uncertainty for each node (left) and normalized values for each degree (right)
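
The computation behind Table II can be sketched in a few lines of Python. The snippet below takes the left half of the table as input, normalizes each degree column and computes the Shannon entropy per degree value. The function names and the choice $k = 3$ in the last helper are illustrative assumptions, not values taken from the paper. Because the inputs are the rounded table entries, the results agree with the bottom row only up to rounding.

```python
import math

# Left half of Table II: Pr[deg(v) = d] for d = 0..3.
degree_dist = {
    "v1": [0.014, 0.188, 0.582, 0.216],
    "v2": [0.210, 0.580, 0.210, 0.000],
    "v3": [0.036, 0.252, 0.488, 0.224],
    "v4": [0.060, 0.580, 0.360, 0.000],
}

def column_distribution(dist, d):
    """Normalize column d so that it becomes a distribution over nodes."""
    column = {v: row[d] for v, row in dist.items()}
    total = sum(column.values())
    return {v: x / total for v, x in column.items()}

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

entropies = [entropy(column_distribution(degree_dist, d).values())
             for d in range(4)]

def obfuscated_degrees(entropies, k):
    """Degree values whose entropy reaches log2(k), as required by (2)."""
    return [d for d, h in enumerate(entropies) if h >= math.log2(k)]

# Illustrative threshold (assumed): with k = 3, only degrees 1 and 2 pass.
passing = obfuscated_degrees(entropies, 3)
```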

While $(k,\epsilon)$-obf provides a novel technique to come up with an uncertain version of the graph, the specific approach in [2] has two drawbacks. First, it formulates the problem as the minimization of $\sigma$. With small values of $\sigma$, $r_e$ concentrates highly around zero, so existing sampled edges have probabilities of nearly 1 and non-existing sampled edges are assigned probabilities of almost 0. By a simple rounding technique, the attacker can easily reveal the true graph. Even if the graph owner only publishes sample graphs, the re-identification attacks are still effective, as we show in Section VII. Note that in [2], the values of $\sigma$ found vary over a wide range. Second, the approach in [2] does not consider the locality (subgraph) of nodes in selecting pairs of nodes for establishing potential edges. As shown in [6], subgraph-wise perturbation effectively reduces structural distortion.
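
The rounding attack mentioned above can be illustrated with a short Python sketch. It assumes that the truncated normal distribution is a zero-mean normal restricted to [0,1], which is sampled here by simple rejection; the graph, the value of sigma and all function names are hypothetical.

```python
import random

def sample_truncated_normal(sigma, rng):
    """Rejection-sample |N(0, sigma^2)| restricted to [0, 1] (assumed form of the distribution)."""
    while True:
        r = abs(rng.gauss(0.0, sigma))
        if r <= 1.0:
            return r

def obfuscate(true_edges, potential_edges, sigma, rng):
    """Existing edges get probability 1 - r, potential edges get probability r."""
    probs = {}
    for e in true_edges:
        probs[e] = 1.0 - sample_truncated_normal(sigma, rng)
    for e in potential_edges:
        probs[e] = sample_truncated_normal(sigma, rng)
    return probs

def rounding_attack(edge_probs):
    """Guess that an edge exists whenever its probability is at least 0.5."""
    return {e for e, p in edge_probs.items() if p >= 0.5}

rng = random.Random(42)
true_edges = {(1, 2), (1, 3), (3, 4)}
potential_edges = {(1, 4), (2, 3), (2, 4)}
probs = obfuscate(true_edges, potential_edges, sigma=0.01, rng=rng)
recovered = rounding_attack(probs)
```

With a small sigma such as 0.01, a sampled value above 0.5 would be a fifty-sigma event, so rounding recovers the true edge set in this toy setting.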

IV A Generalized Model for Uncertain Graph

This section introduces a generalized model of graph anonymization via semantics of edge uncertainty. Then we analyze several schemes using this model.

IV-A A Generalized Model: Uncertain Adjacency Matrix

Fig. 2: (a) Semantics of selfloops (left), multi-selfloops (middle) and multiedges (right) in uncertain adjacency matrix (b) Edge switching

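Given the true graph $G_0$, an uncertain graph $\mathcal{G}$ constructed from $G_0$ must have its uncertain adjacency matrix $\mathcal{A}$ satisfying

  1. $\mathcal{A}_{ij} = \mathcal{A}_{ji}$ (symmetry)

  2. $\mathcal{A}_{ij} \in [0,1]$ and $\mathcal{A}_{ii} = 0$. If we relax this constraint to
    (2’) allow $\mathcal{A}_{ii} > 0$ then we have selfloops and allow $\mathcal{A}_{ij} > 1$ then we have multiedges (Fig. 2(a)).

  3. expected degrees of all nodes must be unchanged. It means $\sum_{j} \mathcal{A}_{ij} = d_i(G_0)$ for every node $i$.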
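
As a hedged illustration of the three constraints, the Python sketch below checks a candidate uncertain adjacency matrix against the adjacency matrix of the true graph. It reads constraint 2 as "entries in [0,1] with a zero diagonal" and the relaxed version as "any non-negative entries", which is an interpretation of the text rather than the authors' code. The function name, the tolerance and the example matrices are assumptions.

```python
def satisfies_model(unc, adj, relaxed=False, tol=1e-9):
    """Check symmetry, entry range and unchanged expected degrees (row sums)."""
    n = len(adj)
    cells = [(i, j) for i in range(n) for j in range(n)]
    symmetric = all(abs(unc[i][j] - unc[j][i]) <= tol for i, j in cells)
    if relaxed:
        # Relaxed constraint: selfloops and multiedges allowed, entries only non-negative.
        in_range = all(unc[i][j] >= -tol for i, j in cells)
    else:
        in_range = (all(-tol <= unc[i][j] <= 1 + tol for i, j in cells)
                    and all(abs(unc[i][i]) <= tol for i in range(n)))
    degrees_kept = all(abs(sum(unc[i]) - sum(adj[i])) <= tol for i in range(n))
    return symmetric and in_range and degrees_kept

# True graph: a 4-cycle, so every node has degree 2.
cycle = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
# Candidate 1: spread probability 2/3 over all six node pairs (row sums stay 2).
spread = [[0.0 if i == j else 2 / 3 for j in range(4)] for i in range(4)]
# Candidate 2: scale every edge to 0.9, which lowers the expected degrees.
shrunk = [[0.9 * a for a in row] for row in cycle]
```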

We first define the transition matrix which is right stochastic (i.e. non-negative and row sums equal to 1) as follows (note that we use the short notation )

(3)

The power when is .

We prove two lemmas on properties of the products and where is right stochastic.

Lemma IV.1

For an adjacency matrix and a right stochastic matrix , the product is non-negative and has row sums equal to those of .

  • The non-negativity of is trivial. The sum of row of is

Lemma IV.2

For a deterministic graph possessing adjacency matrix and , the product is also symmetric.

  • We prove the result by induction. The case is trivial. We prove that for any , where is a path of length from to .

    When , , so the result holds. Assuming that the result is correct up to , i.e. . Because , .

    Because is undirected, the set of all is equal to the set of all , so .
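
Lemmas IV.1 and IV.2 can be checked numerically on a small graph. The sketch below assumes the transition matrix is that of the simple random walk (entry $1/d_i$ for each neighbor of node $i$) and uses exact rational arithmetic; the helper names and the example path graph are illustrative assumptions.

```python
from fractions import Fraction

def transition_matrix(adj):
    """Simple random walk: row i of the adjacency matrix divided by the degree of i."""
    return [[Fraction(a, sum(row)) for a in row] for row in adj]

def matmul(x, y):
    """Plain matrix product of two square or compatible list-of-lists matrices."""
    inner = len(y)
    return [[sum(x[i][l] * y[l][j] for l in range(inner))
             for j in range(len(y[0]))] for i in range(len(x))]

def walk_product(adj, t):
    """Return the adjacency matrix multiplied by the t-th power of the transition matrix."""
    b = transition_matrix(adj)
    result = [[Fraction(a) for a in row] for row in adj]
    for _ in range(t):
        result = matmul(result, b)
    return result

def is_symmetric(m):
    return all(m[i][j] == m[j][i] for i in range(len(m)) for j in range(len(m)))

# Path graph v0 - v1 - v2 - v3 (every node has positive degree).
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
degrees = [sum(row) for row in adj]
products = [walk_product(adj, t) for t in range(6)]
```

For every walk length tried, the product keeps the row sums of the adjacency matrix and stays symmetric, which is what the two lemmas state.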

We prove the uniqueness of in the following proposition.

Proposition IV.3

Given a deterministic graph with adjacency matrix , there exists one and only one right stochastic matrix that satisfies for all and is symmetric for all . The unique solution is .

  • Lemma IV.2 shows that satisfies for all and is symmetric for all .

    To prove that this is the unique solution, we repeat the formula in the proof of Lemma IV.2. Let , then where implies the successive node of in . Because has the same number of products as (i.e. the number of paths of length ), is symmetric if and only if corresponding products are equal, i.e. . At , for any path we must have . Along with the requirement that is right stochastic, i.e. , we obtain . This is exactly .

IV-B RandWalk Approach

Now we apply the model of uncertain adjacency matrix to the analysis of RandWalk [10]. Algorithm 1 depicts the steps of RandWalk. As we show below, the trial-and-error condition in Line 6 makes RandWalk hard to analyze (footnote 1). So we modify it by removing the condition and using parameter instead of 1.0 in Line 12 (footnote 2) (see Algorithm 2). When , all edges are assigned with probability 0.5. In RandWalk-mod, we add a check for (Line 8) to keep the total degree of equal to that of , which is missing in RandWalk. Note that RandWalk-mod accepts selfloops and multiedges.

Footnote 1: It also causes edge miss at , e.g. a 2-length walk on edge (Fig. 1(a)) causes the selfloop .
Footnote 2: This line causes errors for degree-1 nodes as shown in RandWalk-mod.

Let be the edge adding matrix defined as

We show that RandWalk-mod can be formulated as an uncertain adjacency matrix , where is the Hadamard product (element-wise). is equivalent to computations in lines 2-6 and is equivalent to computations in lines 7-13. We use instead of due to the fact that when the edge is added to with probability , the edge is also assigned the same probability. We come up with the following theorem.

Theorem IV.4

RandWalk-mod can be formulated as . is symmetric. It satisfies the constraint of unchanged expected degree iff (footnote 3).

Footnote 3: This implies a mistake in Theorem 3 of [10].

  • By Lemmas IV.1 and IV.2, let be , we have symmetric and its row sums are equal to those of . Because and both and are symmetric, is also symmetric.

    Due to the fact that has the same locations of non-zeros as , the condition of unchanged expected degree is satisfied if and only if all non-zeros in are 1. This occurs if and only if .

Algorithm 1 RandWalk() [10]
Input: undirected graph , walk length and maximum loop count
Output: anonymized graph
1:
2: for  in  do
3:     
4:     for  in  do
5:          
6:          while  do
7:               perform hop random walk from
8:                is the terminal node of the random walk
9:               
10:          if  then
11:               if  then
12:                    add to with probability 1.0
13:               else
14:                    add to with probability
15: return

Algorithm 2 RandWalk-mod()
Input: undirected graph , walk length and probability
Output: anonymized graph
1:
2: for  in  do
3:     
4:     for  in  do
5:          perform hop random walk from
6:           is the terminal node of the random walk
7:          if  then
8:               if  then
9:                    add to with probability 0.5
10:               else
11:                    add to with probability
12:          else
13:               add to with probability
14: return

We investigate the limit case when (i.e. ). Correspondingly has . The following theorem quantifies the number of selfloops and multiedges in for power-law (PL) graphs and sparse Erdős-Rényi (ER) random graphs [11].

Theorem IV.5

For power-law graphs with the exponent , the number of selfloops in is , where is the Riemann zeta function defined only for ; the number of multiedges is zero.

For sparse ER random graphs with constant where is the edge probability, the number of selfloops in is ; the number of multiedges is zero.

  • See Appendix A-A.

Remark IV.1

We notice that RandWalk-mod can be done equivalently by the idea in SybilGuard [22]. We first pick a random permutation on neighbors of each node to get pairs of (in-edge, out-edge). Then for any walk reaching node by the in-edge , the out-edge is fixed to . In this formulation, it is straightforward to verify that the transition probability from to a neighbor is .
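
The routing idea in Remark IV.1 can be sketched as follows. Each node fixes a random permutation of its neighbors, which maps every in-edge to one out-edge, and a walk then follows these fixed routes. This is an illustrative reading of the remark, not SybilGuard's implementation; the function names and the example graph are assumptions.

```python
import random

def build_routing_tables(neighbors, rng):
    """For each node, map every in-neighbor to a fixed out-neighbor via a random permutation."""
    tables = {}
    for node, nbrs in neighbors.items():
        shuffled = list(nbrs)
        rng.shuffle(shuffled)
        tables[node] = dict(zip(nbrs, shuffled))
    return tables

def routed_walk(tables, start, first_hop, hops):
    """Follow the fixed routes for `hops` edges, starting with the edge (start, first_hop)."""
    prev, cur = start, first_hop
    path = [start, first_hop]
    for _ in range(hops - 1):
        nxt = tables[cur][prev]  # out-edge paired with the in-edge (prev, cur)
        prev, cur = cur, nxt
        path.append(cur)
    return path

# Hypothetical graph: path 2 - 1 - 3 - 4.
neighbors = {1: [2, 3], 2: [1], 3: [1, 4], 4: [3]}
tables = build_routing_tables(neighbors, random.Random(7))
path = routed_walk(tables, start=2, first_hop=1, hops=3)
```

Because the permutation at each node is uniform, each out-edge is equally likely for a given in-edge before the tables are drawn, while the routes themselves are deterministic afterwards.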

IV-C Edge Switching

In edge switching (EdgeSwitch) approaches (Fig. 2(b)), two edges are chosen and switched to if . This is done in switches. Using the switching matrix , we represent 1-step EdgeSwitch in the form (Equation (4)).

The switching matrix is feasible if and only if . Note that in the full form, is matrix with the remaining elements on diagonal are 1, other off-diagonal are 0. In general, is not right stochastic and this happens only when . For -step EdgeSwitch . If is right stochastic (i.e. we choose edges such that ), then Lemma IV.1 applies.

(4)
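
A single edge switch can be sketched directly on an edge set. The snippet below assumes the standard degree-preserving switch, where edges (u, v) and (w, t) become (u, t) and (w, v) provided no selfloop or duplicate edge results; this feasibility condition and the function names are assumptions made for illustration.

```python
def try_switch(edges, e1, e2):
    """Switch (u, v), (w, t) to (u, t), (w, v) when this creates no selfloop or multiedge."""
    (u, v), (w, t) = e1, e2
    new1, new2 = frozenset((u, t)), frozenset((w, v))
    if len({u, v, w, t}) < 4 or new1 in edges or new2 in edges:
        return edges  # infeasible switch: leave the graph unchanged
    return (edges - {frozenset(e1), frozenset(e2)}) | {new1, new2}

def degrees(edges):
    """Degree of every node that appears in the undirected edge set."""
    deg = {}
    for edge in edges:
        for node in edge:
            deg[node] = deg.get(node, 0) + 1
    return deg

# Hypothetical graph: path 1 - 2 - 3 - 4.
edges = {frozenset((1, 2)), frozenset((2, 3)), frozenset((3, 4))}
switched = try_switch(edges, (1, 2), (4, 3))  # yields edges {1,3} and {4,2}
blocked = try_switch(edges, (1, 2), (3, 4))   # would duplicate edge {2,3}
```

Every node keeps its degree after a feasible switch, which is why edge switching trivially preserves degrees in this model.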

IV-D Direct Construction

Given the deterministic adjacency matrix $A$, we can directly construct $\mathcal{A}$ that satisfies all three constraints (1), (2) and (3) in Section IV-A. $(k,\epsilon)$-obf [2] introduces such an approach. As explained in Section III-B, the expected degrees of nodes in $(k,\epsilon)$-obf are approximately unchanged because the sampled values $r_e$ are nearly zero for small $\sigma$. So $(k,\epsilon)$-obf satisfies constraints (1) and (2) but only approximately satisfies the third constraint.

To remedy this shortcoming, we present the MaxVar approach in Section V. It adds potential edges to , then tries to find the assignment of edge probabilities such that the expected node degrees are unchanged while the total variance is maximized. A comparison among schemes is also shown in the end of Section V-C.

IV-E Mixture Approach

In this section, we present the Mixture approach by the uncertain adjacency matrix parametrized by , with the output sample graph . Given the true graph and an anonymized , every edge is chosen into with probability where

It is straightforward to show that . When applied to generated by RandWalk-mod with , we have and satisfies three constraints (1) (2’) and (3).

If there exists with constraint such that , then Mixture can be simulated by the RandWalk-mod approach with the transition matrix .

IV-F Partition Approach

Another approach that can apply to RandWalk-mod, $(k,\epsilon)$-obf, MaxVar and EdgeSwitch is the Partition approach. Given true graph , this divide-and-conquer strategy first partitions into disjoint subgraphs , then it applies one of the above anonymization schemes on subgraphs to get anonymized subgraphs . Finally, it combines to obtain . Note that the partitioning may cause orphan edges as in MaxVar (Section V). Those edges must be copied to to keep node degrees unchanged.

V Maximum Variance Approach

We start this section with the formulation of MaxVar in the form of quadratic programming based on two key observations. Then we describe the anonymization algorithm.

V-A Formulation

Two key observations underpinning the MaxVar approach are presented as follows.

V-A1 Observation #1: Maximum Degree Variance

We argue that efficient countermeasures against structural attacks should hinge on node degrees. If a node and its neighbors have their degrees changed, the re-identification risk is reduced significantly. Consequently, instead of replicating local structures as in k-anonymity based approaches [24, 9, 25, 4, 20, 18], we can deflect the attacks by changing node degrees probabilistically. For example, node v1 in Fig. 1(a) has degree 2 with probability 1.0, whereas in Fig. 1(b) its degree takes four possible values $\{0, 1, 2, 3\}$ with probabilities $(0.014, 0.188, 0.582, 0.216)$ respectively (the v1 row of Table II). Generally, given the probabilities $p_1, p_2, \ldots, p_k$ of the edges incident to a node $u$, the degree of $u$ is a sum of independent Bernoulli random variables, so its expected value is $\sum_{i=1}^{k} p_i$ and its variance is $\sum_{i=1}^{k} p_i(1-p_i)$. If we naively target the maximum (local) degree variance without any constraints, the solution is $p_i = 0.5$ for all $i$. However, such an assignment distorts the graph structure severely and deteriorates the utility. Instead, by following the model of uncertain adjacency matrix, we impose the constraint $\sum_{i=1}^{k} p_i = d_u$, where $d_u$ is the degree of $u$ in the true graph. Note that the minimum variance of an uncertain graph is 0 and corresponds to the case in which all edges are deterministic, e.g. when the true graph is published unchanged and in switching-edge based approaches. In the following section, we show an interesting result relating the total degree variance to the variance of edit distance.
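
The degree of a node as a sum of independent Bernoulli variables can be computed exactly with a short dynamic program. In the sketch below, the edge probabilities 0.9, 0.8 and 0.3 are an assumption: they are not given in the text, but they reproduce the v1 row of Table II. The function names are illustrative.

```python
def degree_distribution(probs):
    """Exact distribution of a sum of independent Bernoulli variables (one per incident edge)."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # edge absent: degree stays k
            nxt[k + 1] += mass * p       # edge present: degree becomes k + 1
        dist = nxt
    return dist

def expected_degree(probs):
    return sum(probs)

def degree_variance(probs):
    return sum(p * (1.0 - p) for p in probs)

# Assumed probabilities of the three possible edges at v1.
probs_v1 = [0.9, 0.8, 0.3]
dist_v1 = degree_distribution(probs_v1)
mean_v1 = expected_degree(probs_v1)
var_v1 = degree_variance(probs_v1)
```

Under this assumption the expected degree of v1 equals its true degree 2, while the variance is 0.46 instead of 0.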

V-A2 Variance with edit distance

The edit distance between two deterministic graphs $G$ and $G'$ is defined as:

$D(G, G') = |E_G \setminus E_{G'}| + |E_{G'} \setminus E_G|$ (5)

A well-known result about the expected edit distance between the uncertain graph and the deterministic graph is

Correspondingly, the variance of edit distance is

We prove in the following theorem that the variance of edit distance is the sum of all edges’ variance (total degree variance) and it does not depend on the choice of .

Theorem V.1

Assume that has uncertain edges and (i.e. ). The edit distance variance is and does not depend on the choice of .

  • See Appendix A-B.
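
The statement of Theorem V.1 can be checked by brute force on a tiny uncertain graph. The sketch below enumerates all possible worlds, computes the exact mean and variance of the edit distance to a reference graph, and compares the variance with the sum of the per-edge variances. The edge probabilities, the two reference graphs and the function names are illustrative assumptions.

```python
import itertools

def edit_distance(edges_a, edges_b):
    """Number of edges present in exactly one of the two graphs."""
    return len(edges_a ^ edges_b)

def edit_distance_moments(edge_probs, reference):
    """Exact mean and variance of the edit distance to `reference` over all worlds."""
    edges = list(edge_probs)
    mean = second = 0.0
    for mask in itertools.product([0, 1], repeat=len(edges)):
        world = {e for e, bit in zip(edges, mask) if bit}
        prob = 1.0
        for e, bit in zip(edges, mask):
            prob *= edge_probs[e] if bit else 1.0 - edge_probs[e]
        d = edit_distance(world, reference)
        mean += prob * d
        second += prob * d * d
    return mean, second - mean ** 2

edge_probs = {(1, 2): 0.9, (1, 3): 0.8, (3, 4): 0.7, (1, 4): 0.3}
total_variance = sum(p * (1.0 - p) for p in edge_probs.values())
mean_a, var_a = edit_distance_moments(edge_probs, {(1, 2), (1, 3), (3, 4)})
mean_b, var_b = edit_distance_moments(edge_probs, {(1, 4)})
```

The two reference graphs give different expected edit distances but the same variance, equal to the sum of $p_i(1-p_i)$ over all edges.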

V-A3 Observation #2: Nearby Potential Edges

As indicated by Leskovec et al. [8], real graphs reveal two temporal evolution properties: densification power law and shrinking diameters. The Community Guided Attachment (CGA) model [8], which produces densifying graphs, is an example of a hierarchical graph generation model in which the linkage probability between nodes decreases as a function of their relative distance in the hierarchy. With regard to this observation, $(k,\epsilon)$-obf, by heuristically making potential edges solely based on node degree discrepancy, produces many inter-community edges. Shortest-path based statistics will be reduced due to these edges. MaxVar, in contrast, tries to mitigate the structural distortion by proposing only nearby potential edges before assigning edge probabilities. Further evidence comes from [19], where Vazquez analytically proved that the Nearest Neighbor model can explain the power law for degree distribution, clustering coefficient and average degree among the neighbors. Those properties are in very good agreement with the observations made for social graphs. Sala et al. [14] confirmed the consistency of the Nearest Neighbor model in their comparative study on graph models for social networks.

V-B Algorithms

This section describes the steps of MaxVar to convert the input deterministic graph into an uncertain one.

V-B1 Overview

The intuition behind the new approach is to formulate the perturbation problem as a quadratic programming problem. Given the true graph and the number of potential edges allowed to be added , the scheme has three phases. The first phase tries to partition into subgraphs, each one with potential edges connecting nearby nodes (with default distance 2, i.e. friend-of-friend). The second phase formulates a quadratic program for each subgraph with the constraint of unchanged node degrees to produce the uncertain subgraphs with maximum edge variance. The third phase combines the uncertain subgraphs into and publishes several sample graphs. The three phases are illustrated in Fig. 3.

By keeping the degrees of nodes in the perturbed graph, our approach is similar to the edge switching approaches (e.g.[21]) but ours is more subtle as we do it implicitly and the switching occurs not necessarily on pairs of edges.

Fig. 3: MaxVar approach

V-B2 Graph Partitioning

Because of the complexity of exact quadratic programming (Section V-B3), we need a pre-processing phase to divide the true graph into subgraphs and run the optimization on each subgraph. Given the number of subgraphs , we run METIS (http://glaros.dtc.umn.edu/gkhome/views/metis) to get almost equal-sized subgraphs with a minimum number of inter-subgraph edges. Each subgraph has potential edges added before running the quadratic program. This phase is outlined in Algorithm 3.

Algorithm 3 Partition-and-Add-Edges
Input: true graph , number of subgraphs , number of potential edges per subgraph
Output: list of augmented subgraphs
1: METIS().
2: for  in  do
3:     
4:     while  do
5:          randomly pick and with
6:          
7: return
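
The edge-adding step for one subgraph can be sketched as below. The sketch assumes the partition has already been computed (METIS is not called here) and that "nearby" means friend-of-friend pairs, i.e. non-adjacent nodes with a common neighbor. Sampling without replacement from the candidate list is a simplification, and all names are hypothetical.

```python
import random

def friend_of_friend_pairs(neighbors):
    """All non-adjacent node pairs that share at least one common neighbor."""
    pairs = set()
    for nbrs in neighbors.values():
        for u in nbrs:
            for v in nbrs:
                if u < v and v not in neighbors[u]:
                    pairs.add((u, v))
    return pairs

def add_potential_edges(neighbors, count, rng):
    """Pick up to `count` distinct distance-2 pairs as potential edges."""
    candidates = sorted(friend_of_friend_pairs(neighbors))
    return set(rng.sample(candidates, min(count, len(candidates))))

# Hypothetical subgraph: path 1 - 2 - 3 - 4 with an extra leaf 5 attached to node 2.
neighbors = {1: {2}, 2: {1, 3, 5}, 3: {2, 4}, 4: {3}, 5: {2}}
candidates = friend_of_friend_pairs(neighbors)
potential = add_potential_edges(neighbors, count=2, rng=random.Random(1))
```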

V-B3 Quadratic Programming

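By assuming the independence of edges, the total degree variance of $\mathcal{G}$ for edit distance (Theorem V.1) is:

$TV = \sum_{i} p_i(1-p_i) = \sum_{i} p_i - \sum_{i} p_i^2 = m - \sum_{i} p_i^2$ (6)

where the sums range over all edges of the uncertain graph and $m$ is the number of edges in the true graph. The last equality in (6) is due to the constraint that the expected node degrees are unchanged (i.e. $\sum_{v \in N(u)} p_{uv} = d_u$ for every node $u$, where $d_u$ is the true degree of $u$), so $\sum_{i} p_i$ is equal to $m$. By targeting the maximum edge variance, we come up with the following quadratic program.

Minimize $\sum_{i} p_i^2$
Subject to $0 \leq p_i \leq 1$ for every edge $e_i$, and $\sum_{v \in N(u)} p_{uv} = d_u$ for every node $u$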

The objective function reflects the privacy goal (i.e. sample graphs do not highly concentrate around the true graph) while the expected degree constraints aim to preserve the utility.
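
To make the formulation concrete without relying on any particular solver, the sketch below evaluates the objective and checks the constraints on a toy instance: a 4-cycle as the true graph with its two diagonals as potential edges. It compares the trivial feasible point (the true graph itself) with the uniform assignment of 2/3 to all six edges, which by symmetry is the minimizer on this instance. The instance and the function names are illustrative assumptions, not the paper's implementation.

```python
def expected_degrees(edge_probs):
    """Sum of incident edge probabilities for every node."""
    deg = {}
    for (u, v), p in edge_probs.items():
        deg[u] = deg.get(u, 0.0) + p
        deg[v] = deg.get(v, 0.0) + p
    return deg

def is_feasible(edge_probs, true_degrees, tol=1e-9):
    """Box constraints on probabilities plus unchanged expected node degrees."""
    in_box = all(-tol <= p <= 1 + tol for p in edge_probs.values())
    deg = expected_degrees(edge_probs)
    kept = all(abs(deg.get(u, 0.0) - d) <= tol for u, d in true_degrees.items())
    return in_box and kept

def total_variance(edge_probs):
    return sum(p * (1.0 - p) for p in edge_probs.values())

def objective(edge_probs):
    """Sum of squared probabilities, the quantity being minimized."""
    return sum(p * p for p in edge_probs.values())

true_edges = [(1, 2), (2, 3), (3, 4), (1, 4)]
potential_edges = [(1, 3), (2, 4)]
true_degrees = {1: 2, 2: 2, 3: 2, 4: 2}

trivial = {e: 1.0 for e in true_edges}
trivial.update({e: 0.0 for e in potential_edges})
uniform = {e: 2 / 3 for e in true_edges + potential_edges}
```

Both points are feasible, but the trivial point has zero total variance while the uniform point reaches 4/3, which equals the number of true edges minus the objective value.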

By dividing the large input graph into subgraphs, we solve independent quadratic optimization problems. Because each edge belongs to at most one subgraph and the expected node degrees in each subgraph are unchanged, it is straightforward to show that the expected node degrees in are also unchanged. We have a proposition on problem feasibility and an upper bound for the total variance.

Proposition V.2

The quadratic program in MaxVar is always feasible. The total variance is upper bounded by .

  • The feasibility is due to the fact that is a feasible point. Let be the number of potential edges incident to node . By requiring ’s expected degree to be unchanged, we have . Applying Cauchy-Schwarz inequality, we get . Now we take the sum over all nodes to get the following

    where the last equality is again due to Cauchy-Schwarz inequality.

V-C Comparison of schemes

Table III shows the comparison of schemes we investigate in this work. Only MaxVar and EdgeSwitch satisfy all three properties (1), (2) and (3). The next two propositions quantify the TV of $(k,\epsilon)$-obf and RandWalk-mod.

Scheme Prop #1 Prop #2 Prop #3 Uncertain
RandWalk-mod ()
RandWalk
EdgeSwitch
-obf
MaxVar
Mixture depends on the mixed scheme
Partition depends on the scheme used in subgraphs
TABLE III: Comparison of schemes
Proposition V.3

The expected total variance of -obf is . The expressions of are given in (7) and (8).

  • In -obf, existing edges are assigned probabilities while potential edges are assigned probabilities . Therefore, the total variance is where . Take the expectation of , we get .

    has pdf . The normalization constant where erf is the error function. Basic integral computations (change of variable and integration by parts) give us the formulas for and as follows

    (7)
    (8)

Note that for , and , so

(9)
Proposition V.4

The total variance of RandWalk-mod at walk-length is upper bounded by where is the number of non-zeros in .

For power-law graphs with the exponent , . For sparse ER random graphs with constant,

  • The proof uses the same arguments as in Proposition V.2 and Theorem IV.5. We omit it due to space limitation.

Note that the increases with and when is equal to the diameter of , . Therefore, the upper bound of converges very fast to , compatible with the results in the limit cases of PL and ER random graphs.

VI Quantifying Framework

This section describes a generic framework for privacy and utility quantification of anonymization methods.

VI-A Privacy Measurement

We focus on structural re-identification attacks under various models of attacker’s knowledge as shown in [7]. We quantify the privacy of an anonymized graph as the sum of re-identification probabilities of all nodes in the graph. We differentiate closed-world from open-world adversaries. For example, when a closed-world adversary knows that Bob has three neighbors, this fact is exact. An open-world adversary in this case would learn only that Bob has at least three neighbors. We consider the result of structural query on a node as the node signature . Given a query , nodes having the same signatures form an equivalence class. So given the true graph and an output anonymized graph , the privacy is measured as in the following example.

Example VI.1

Assuming that we have signatures of and signatures of as in Table IV, the re-identification probabilities in of nodes 1,2 are , of nodes 4,8 are , of nodes 3,5,6,7 are 0s. And the privacy score of is . In , the privacy score is , equal to the number of equivalence classes.

Graph Equivalence classes
TABLE IV: Example of node signatures
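
One possible reading of the privacy score is sketched below. It assumes that a node is re-identified with probability one over the size of its equivalence class in the anonymized graph when its signature is unchanged, and with probability zero when its signature differs from the one in the true graph. This reading is an assumption based on the example above; the signatures used here are made up and the function name is hypothetical.

```python
from collections import Counter

def privacy_score(true_sig, anon_sig):
    """Sum of per-node re-identification probabilities (assumed definition)."""
    class_size = Counter(anon_sig.values())
    score = 0.0
    for node, sig in true_sig.items():
        if anon_sig.get(node) == sig:
            score += 1.0 / class_size[sig]  # uniform guess within the matching class
    return score

# Made-up signatures for four nodes.
true_sig = {1: "a", 2: "a", 3: "b", 4: "c"}
anon_sig = {1: "a", 2: "b", 3: "b", 4: "d"}

score_true = privacy_score(true_sig, true_sig)
score_anon = privacy_score(true_sig, anon_sig)
```

On the true graph itself the score equals the number of equivalence classes, in line with the example.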

We consider two privacy scores in this paper.

  • The $H1$ score uses the node degree as the node signature, i.e. we assume that the attacker knows a priori the degrees of all nodes.

  • uses the set (not multiset) of degrees of node’s friends as the node signature. For example, if a node has 6 neighbors and the degrees of those neighbors are , then its signature for attack is .
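
The two signatures described above can be computed from adjacency sets in a few lines. The sketch below is illustrative only; the function names and the small example graph are assumptions.

```python
def degree_signature(neighbors):
    """Signature 1: the degree of each node."""
    return {u: len(nbrs) for u, nbrs in neighbors.items()}

def neighbor_degree_set_signature(neighbors):
    """Signature 2: the set (not multiset) of degrees of each node's neighbors."""
    deg = degree_signature(neighbors)
    return {u: frozenset(deg[v] for v in nbrs) for u, nbrs in neighbors.items()}

# Hypothetical graph: triangle 1-2-3 with a leaf 4 attached to node 1.
neighbors = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
sig_degree = degree_signature(neighbors)
sig_set = neighbor_degree_set_signature(neighbors)
```

Nodes 2 and 3 share both signatures here, so they fall into the same equivalence class under either query.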

Higher-order scores like (exact multiset of neighbors’ degrees) or (exact multiset of neighbor-of-neighbors’ degrees) induce much higher privacy scores of the true graph (in the order of ) and represent less meaningful metrics for privacy. The following proposition claims the automorphism-invariant property of structural privacy scores.

Proposition VI.1

All privacy scores based on structural queries [7] are automorphism-invariant, i.e. if we find a non-trivial automorphism of , the signatures of all nodes in are unchanged.

Proof. The claim follows directly from the definition of graph automorphism. We omit the details for lack of space.

VI-B Utility Measurement

Following [2] and [21], we consider three groups of statistics for utility measurement: degree-based statistics, shortest-path based statistics and clustering statistics.

VI-B1 Degree-based statistics

  • Number of edges:

  • Average degree:

  • Maximal degree:

  • Degree variance:

  • Power-law exponent of degree sequence: the estimated exponent, assuming the degree sequence follows a power law.
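The first four degree-based statistics can be computed directly from an adjacency structure, as in the sketch below. The dict-of-sets representation, the function name, and the use of the population variance (division by the number of nodes) are assumptions. The power-law exponent is left out because the estimator used in the paper is not shown in this section.

```python
def degree_statistics(adj):
    """Degree-based utility statistics of a simple undirected graph.

    adj maps each node to the set of its neighbors.
    """
    degrees = [len(nbrs) for nbrs in adj.values()]
    n = len(degrees)
    avg = sum(degrees) / n
    return {
        # Every undirected edge is counted once from each endpoint.
        "num_edges": sum(degrees) // 2,
        "avg_degree": avg,
        "max_degree": max(degrees),
        # Assumption: population variance of the degree sequence.
        "degree_variance": sum((d - avg) ** 2 for d in degrees) / n,
    }

# Hypothetical toy graph with degree sequence 4, 1, 2, 2, 3, 1, 1.
adj = {
    0: {1, 2, 3, 4},
    1: {0},
    2: {0, 3},
    3: {0, 2},
    4: {0, 5, 6},
    5: {4},
    6: {4},
}
stats = degree_statistics(adj)
```

As a sanity check on real data, dblp has 317080 nodes and 1049866 edges, so its average degree is 2 × 1049866 / 317080 ≈ 6.62, which matches the value reported for the true graph in Table V.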

VI-B2 Shortest path-based statistics

  • Average distance: the average distance among all pairs of vertices that are path-connected.

  • Effective diameter: the 90th-percentile distance among all path-connected pairs of vertices.

  • Connectivity length: the harmonic mean of all pairwise distances in the graph.

  • Diameter: the maximum distance among all path-connected pairs of vertices.
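For small graphs, the four shortest-path statistics above can be computed exactly with one breadth-first search per node, as sketched below. This is not the ANF-based estimation used in the experiments. Two conventions in the code are assumptions: the effective diameter is taken as the smallest distance that covers at least 90% of the connected pairs, and disconnected pairs contribute 1/∞ = 0 to the harmonic mean. The graph is assumed to contain at least one edge.

```python
from collections import Counter, deque

def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def shortest_path_statistics(adj):
    n = len(adj)
    hist = Counter()  # distance -> number of ordered connected pairs
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                hist[d] += 1
    connected = sum(hist.values())
    covered = 0
    for d in sorted(hist):
        covered += hist[d]
        if 10 * covered >= 9 * connected:  # at least 90% of pairs covered
            effective_diameter = d
            break
    return {
        "avg_distance": sum(d * c for d, c in hist.items()) / connected,
        "effective_diameter": effective_diameter,
        # Harmonic mean over all ordered pairs; unreachable pairs add 0.
        "connectivity_length": n * (n - 1) / sum(c / d for d, c in hist.items()),
        "diameter": max(hist),
    }

# Hypothetical toy graph: the path 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
sp_stats = shortest_path_statistics(path)
```

The exact computation costs one BFS per node, which is why approximation becomes necessary on graphs with hundreds of thousands of nodes.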

VI-B3 Clustering statistics

  • Clustering coefficient: three times the number of triangles divided by the number of connected triples.
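A minimal sketch of this statistic follows, assuming the common global definition (three times the number of triangles over the number of connected triples) and a dict-of-sets adjacency structure. The function name and the toy graph are hypothetical.

```python
from itertools import combinations

def clustering_coefficient(adj):
    """Global clustering coefficient of a simple undirected graph."""
    closed = 0   # closed triples; each triangle is counted once per corner
    triples = 0  # connected triples, counted at their center node
    for u, nbrs in adj.items():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for v, w in combinations(nbrs, 2):
            if w in adj[v]:
                closed += 1
    # closed already equals 3 * (number of triangles).
    return closed / triples if triples else 0.0

# Hypothetical toy graph: triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cc = clustering_coefficient(adj)
```

In the toy graph there is one triangle and five connected triples, so the coefficient is 3 × 1 / 5 = 0.6.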

All of the above statistics are computed on sample graphs generated from the uncertain output graph. In particular, to estimate shortest-path based measures, we use the Approximate Neighbourhood Function (ANF) [12]. The diameter is lower-bounded by the longest distance found by breadth-first searches to all destinations from 1,000 randomly chosen nodes.
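The sampling step and the diameter lower bound can be sketched as follows. The sketch assumes that the uncertain graph is given as a mapping from node pairs to edge probabilities and that edges are sampled independently, which is the usual possible-world semantics. The function names and the toy input are hypothetical, and ANF itself is not implemented here.

```python
import random
from collections import deque

def sample_graph(uncertain_edges, rng):
    """Draw one sample graph: keep edge (u, v) with probability p."""
    adj = {}
    for (u, v), p in uncertain_edges.items():
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def eccentricity(adj, source):
    """Longest BFS distance from source within its connected component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter_lower_bound(adj, num_sources, rng):
    """Lower bound on the diameter from BFS at randomly chosen sources."""
    nodes = list(adj)
    sources = rng.sample(nodes, min(num_sources, len(nodes)))
    return max(eccentricity(adj, s) for s in sources)

# Hypothetical uncertain graph with probabilities 1.0 and 0.0, so that the
# sample is always the path 0 - 1 - 2 - 3.
uncertain = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (0, 3): 0.0}
rng = random.Random(0)
g = sample_graph(uncertain, rng)
bound = diameter_lower_bound(g, 1000, rng)
```

Any single BFS yields a valid lower bound on the diameter, so using more sources can only tighten the estimate.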

VII Evaluation

In this section, our evaluation aims to show the disadvantages of (k, ε)-obf and RandWalk/RandWalk-mod as well as the gap between them. We then illustrate the effectiveness and efficiency of the gap-filling approaches MaxVar and Mixture. Effectiveness is measured by privacy scores (lower is better) and by the relative error of utility (lower is better). Efficiency is measured by the running time. All algorithms are implemented in Python and run on a desktop PC with a Core i7-4770 @ 3.4 GHz and 16 GB of memory. We use MOSEK (http://mosek.com/) as the quadratic solver.
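The relative error of utility can be illustrated with a short computation. The sketch assumes that rel.err is the unweighted mean of the per-statistic relative errors over the ten utility statistics. The numbers are the dblp true-graph row and the first anonymized dblp row of Table V; the dictionary keys and the function name are hypothetical.

```python
def relative_error(true_stats, anon_stats):
    """Mean of |S(anonymized) - S(true)| / |S(true)| over all statistics."""
    errors = [
        abs(anon_stats[k] - true_stats[k]) / abs(true_stats[k])
        for k in true_stats
    ]
    return sum(errors) / len(errors)

# dblp, true graph (Table V).
true_stats = {
    "edges": 1049866, "avg_degree": 6.62, "max_degree": 343,
    "degree_variance": 100.15, "clustering": 0.306, "power_law": 2.245,
    "avg_distance": 7.69, "eff_diameter": 9, "conn_length": 7.46,
    "diameter": 20,
}
# dblp, first anonymized row of Table V (parameter value 0.001).
anon_stats = {
    "edges": 1048153, "avg_degree": 6.61, "max_degree": 316.0,
    "degree_variance": 97.46, "clustering": 0.303, "power_law": 2.244,
    "avg_distance": 7.74, "eff_diameter": 9.4, "conn_length": 7.50,
    "diameter": 20.0,
}

rel_err = relative_error(true_stats, anon_stats)
print(round(rel_err, 3))  # 0.018
```

Under this assumption the computed value agrees with the 0.018 reported in the rel.err column of that row.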

Three large real-world datasets are used in our experiments (http://snap.stanford.edu/data/index.html). dblp is a co-authorship network where two authors are connected if they have published at least one paper together. amazon is a product co-purchasing network where the graph contains an undirected edge between two products if they are frequently co-purchased. youtube is a video-sharing web site that includes a social network. The graph sizes (number of nodes, number of edges) of dblp, amazon and youtube are (317080, 1049866), (334863, 925872) and (1134890, 2987624), respectively. We partition dblp and amazon into 20 subgraphs each and youtube into 60 subgraphs. The sample size of each test case is 20.

VII-A (k, ε)-obf and RandWalk

We report the performance of (k, ε)-obf in Table V. We keep the number of potential edges at the default value from [2] and vary the parameter listed in the second column of the table. We find that the scheme achieves low relative errors only at small parameter values. However, the privacy scores, especially the neighbor-degree-set score, rise fast (up to 50% of the score of the true graph). This results in a high privacy-utility tradeoff, as confirmed in Table VIII.

Table VI shows that RandWalk and RandWalk-mod perform similarly, except for youtube and for parameter value 2 in amazon. Because RandWalk-mod satisfies the third constraint, several degree-based statistics benefit, while the existence of self-loops and multi-edges has little impact on shortest-path based metrics. RandWalk misses many edges at parameter value 2 (see footnote 1 in Section IV-B). The remarkable characteristics of random-walk schemes are the very low privacy scores and the high relative errors (lower-bounded at around 8 to 10%). Clearly, there is a gap between the high tradeoffs of (k, ε)-obf and the high relative errors of RandWalk, where MaxVar and Mixture may play their roles.

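| Graph | Parameter | Degree score | Neighbor-degree-set score | Edges | Avg. degree | Max. degree | Degree variance | Clustering coeff. | Power-law exp. | Avg. distance | Eff. diameter | Conn. length | Diameter | rel.err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dblp | (true graph) | 199 | 125302 | 1049866 | 6.62 | 343 | 100.15 | 0.306 | 2.245 | 7.69 | 9 | 7.46 | 20 | |
| dblp | 0.001 | 72.9 | 40712.1 | 1048153 | 6.61 | 316.0 | 97.46 | 0.303 | 2.244 | 7.74 | 9.4 | 7.50 | 20.0 | 0.018 |
| dblp | 0.01 | 41.1 | 24618.2 | 1035994 | 6.53 | 186.0 | 86.47 | 0.294 | 2.248 | 7.82 | 9.5 | 7.59 | 19.8 | 0.077 |
| dblp | 0.1 | 19.7 | 7771.4 | 991498 | 6.25 | 164.9 | 64.20 | 0.284 | 2.265 | 8.08 | 10.0 | 7.85 | 20.0 | 0.128 |
| amazon | (true graph) | 153 | 113338 | 925872 | 5.53 | 549 | 33.20 | 0.205 | 2.336 | 12.75 | 16 | 12.10 | 44 | |
| amazon | 0.001 | 55.7 | 55655.9 | 924321 | 5.52 | 479.1 | 31.73 | 0.206 | 2.340 | 12.14 | 15.2 | 11.65 | 33.2 | 0.057 |
| amazon | 0.01 | 34.5 | 39689.8 | 915711 | 5.47 | 299.7 | 27.18 | 0.220 | 2.348 | 12.40 | 15.6 | 11.91 | 32.4 | 0.101 |
| amazon | 0.1 | 19.2 | 16375.4 | 892140 | 5.33 | 253.9 | 21.87 | 0.232 | 2.374 | 12.52 | 15.5 | 12.06 | 31.4 | 0.144 |
| youtube | (true graph) | 978 | 321724 | 2987624 | 5.27 | 28754 | 2576.0 | 0.0062 | 2.429 | 6.07 | 8 | 6.79 | 20 | |
| youtube | 0.001 | 157.2 | 36744.6 | 2982974 | 5.26 | 28438 | 2522.6 | 0.0062 | 2.416 | 6.24 | 8.0 | 6.01 | 19.5 | 0.022 |
| youtube | 0.01 | 80.0 | 22361.7 | 2940310 | 5.18 | 26900 | 2282.6 | 0.0061 | 2.419 | 6.27 | 8.0 | 6.04 | 19.0 | 0.043 |
| youtube | 0.1 | 23.4 | 5806.9 | 2624066 | 4.62 | 16353 | 970.8 | 0.0070 | 2.438 | 6.59 | 8.1 | 6.36 | 20.4 | 0.160 |

TABLE V: (k, ε)-obf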
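| Graph | Scheme | Parameter | Degree score | Neighbor-degree-set score | Edges | Avg. degree | Max. degree | Degree variance | Clustering coeff. | Power-law exp. | Avg. distance | Eff. diameter | Conn. length | Diameter | rel.err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dblp | (true graph) | | 199 | 125302 | 1049866 | 6.62 | 343 | 100.15 | 0.306 | 2.245 | 7.69 | 9 | 7.46 | 20 | |
| dblp | RW | 2 | 10.0 | 4.9 | 1001252 | 6.32 | 309.3 | 86.16 | 0.152 | 2.197 | 7.43 | 9.1 | 7.20 | 19.7 | 0.094 |
| dblp | RW | 3 | 11.8 | 10.9 | 1048129 | 6.61 | 315.4 | 98.04 | 0.107 | 2.155 | 7.08 | 8.7 | 6.88 | 17.8 | 0.110 |
| dblp | RW | 5 | 11.7 | 5.6 | 1049484 | 6.62 | 321.6 | 100.77 | 0.065 | 2.148 | 6.79 | 8.0 | 6.62 | 16.4 | 0.142 |
| dblp | RW | 10 | 11.9 | 2.9 | 1049329 | 6.62 | 329.2 | 103.06 | 0.030 | 2.144 | 6.54 | 8.0 | 6.40 | 14.3 | 0.171 |
| dblp | RW-mod | 2 | 11.8 | 4.5 | 1049921 | 6.62 | 327.0 | 105.3 | 0.093 | 2.110 | 7.75 | 9.7 | 7.48 | 23.0 | 0.109 |
| dblp | RW-mod | 3 | 11.9 | 9.4 | 1049877 | 6.62 | 343.3 | 105.1 | 0.071 | 2.117 | 7.32 | 9.0 | 7.10 | 20.4 | 0.099 |
| dblp | RW-mod | 5 | 12.0 | 5.4 | 1049781 | 6.62 | 340.5 | 105.1 | 0.044 | 2.115 | 6.95 | 8.4 | 6.76 | 18.3 | 0.131 |
| dblp | RW-mod | 10 | 11.9 | 2.6 | 1049902 | 6.62 | 340.0 | 105.3 | 0.021 | 2.116 | 6.59 | 8.0 | 6.44 | 16.0 | 0.164 |
| amazon | (true graph) | | 153 | 113338 | 925872 | 5.53 | 549 | 33.20 | 0.205 | 2.336 | 12.75 | 16 | 12.10 | 44 | |
| amazon | RW | 2 | 5.7 | 5.4 | 861896 | 5.15 | 274.9 | 23.11 | 0.148 | 2.337 | 10.70 | 13.8 | 10.19 | 38.7 | 0.180 |
| amazon | RW | 3 | 10.0 | 16.5 | 923793 | 5.52 | 495.6 | 32.72 | 0.113 | 2.282 | 10.33 | 13.1 | 9.87 | 34.1 | 0.137 |
| amazon | RW | 5 | 10.4 | 8.6 | 925185 | 5.53 | 507.7 | 33.52 | 0.080 | 2.276 | 9.45 | 12.1 | 9.07 | 29.6 | 0.181 |
| amazon | RW | 10 | 10.2 | 4.6 | 925748 | 5.53 | 498.1 | 34.37 | 0.046 | 2.273 | 8.55 | 10.5 | 8.25 | 25.7 | 0.234 |
| amazon | RW-mod | 2 | 9.8 | 3.2 | 925672 | 5.53 | 255.1 | 37.61 | 0.099 | 2.246 | 12.02 | 15.5 | 11.40 | 43.2 | 0.139 |
| amazon | RW-mod | 3 | 9.9 | 11.2 | 925532 | 5.53 | 535.3 | 37.32 | 0.082 | 2.254 | 10.89 | 14.0 | 10.38 | 37.9 | 0.134 |
| amazon | RW-mod | 5 | 9.7 | 6.0 | 926163 | 5.53 | 522.8 | 37.42 | 0.059 | 2.252 | 9.83 | 12.5 | 9.40 | 33.0 | 0.185 |
| amazon | RW-mod | 10 | 9.9 | 3.3 | 925809 | 5.53 | 491.4 | 37.45 | 0.035 | 2.251 | 8.76 | 11.0 | 8.44 | 28.7 | 0.238 |
| youtube | (true graph) | | 978 | 321724 | 2987624 | 5.27 | 28754 | 2576.0 | 0.0062 | 2.429 | 6.07 | 8 | 6.79 | 20 | |
| youtube | RW | 2 | 13.4 | 1.5 | 2636508 | 4.65 | 19253.8 | 1139.7 | 0.022 | 2.191 | 6.18 | 7.9 | 5.93 | 23.5 | 0.403 |
| youtube | RW | 3 | 23.8 | 17.6 | 2982204 | 5.26 | 26803.6 | 2389.6 | 0.004 | 2.108 | 5.73 | 7.0 | 5.52 | 18.0 | 0.103 |
| youtube | RW | 5 | 24.6 | 8.4 | 2985967 | 5.26 | 26018.7 | 2340.0 | 0.005 | 2.106 | 5.55 | 7.0 | 5.38 | 16.3 | 0.120 |
| youtube | RW | 10 | 21.9 | 1.8 | 2984115 | 5.26 | 24695.8 | 2099.4 | 0.009 | 2.100 | 5.49 | 6.9 | 5.33 | 18.7 | 0.145 |
| youtube | RW-mod | 2 | 26.4 | 1.4 | 2987228 | 5.26 | 23829.7 | 2578.5 | 0.018 | 2.053 | 6.27 | 8.0 | 6.02 | 22.1 | 0.245 |
| youtube | RW-mod | 3 | 26.9 | 22.3 | 2988011 | 5.27 | 28611.5 | 2579.7 | 0.005 | 2.077 | 5.75 | 7.2 | 5.54 | 19.0 | 0.081 |
| youtube | RW-mod | 5 | 26.1 | 11.0 | 2987479 | 5.26 | 28619.3 | 2581.4 | 0.005 | 2.076 | 5.61 | 7.0 | 5.44 | 18.3 | 0.090 |
| youtube | RW-mod | 10 | 26.3 | 1.7 | 2987475 | 5.26 | 28432.2 | 2579.9 | 0.008 | 2.073 | 5.58 | 7.0 | 5.41 | 18.8 | 0.099 |

TABLE VI: RandWalk and RandWalk-mod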

VII-B Effectiveness of MaxVar

We assess the privacy and utility of MaxVar by varying the number of potential edges. The results are shown in Table VII. As for privacy scores, increasing the number of potential edges yields better privacy because it allows more edge switches. Due to the expected degree constraints in the quadratic program, all degree-based metrics vary only a little.

We observe the near linear relationships between