Anonymizing Social Graphs via Uncertainty Semantics
Abstract
Rather than anonymizing social graphs by generalizing them to super nodes/edges or by adding/removing nodes and edges to satisfy given privacy parameters, recent methods exploit the semantics of uncertain graphs to protect the privacy of participating entities and their relationships. These techniques anonymize a deterministic graph by converting it into an uncertain form. In this paper, we propose a generalized obfuscation model based on uncertain adjacency matrices that keep expected node degrees equal to those in the unanonymized graph. We analyze two recently proposed schemes and show how they fit into the model. We also point out disadvantages of each method and present several elegant techniques to fill the gap between them. Finally, to support fair comparisons, we develop a new framework for quantifying the privacy-utility tradeoff by leveraging the concept of incorrectness from location privacy research. Experiments on large social graphs demonstrate the effectiveness of our schemes.
I Introduction
Graphs represent a rich class of data observed in daily life, where entities are described by vertices and their connections are characterized by edges. With the emergence of increasingly complex networks [11], the research community requires large and reliable graph data to conduct in-depth studies. However, this requirement usually conflicts with the privacy rights of the entities contributing the data. Naive approaches such as removing user ids from a social graph are not effective and leave users open to privacy risks, especially re-identification attacks [1, 7]. Therefore, many graph anonymization schemes have been proposed [24, 9, 25, 4, 20, 18].
Given an unlabeled undirected graph, the existing anonymization methods fall into four main categories. The first category includes random addition, deletion and switching of edges to prevent the re-identification of nodes or edges. The methods in the second category provide k-anonymity [17] by deterministic edge additions or deletions, assuming the attacker has background knowledge regarding certain properties of its target nodes. The methods in the third category assign edge probabilities to add uncertainty to the true graph. The edge probabilities may be computed explicitly as in [2] or implicitly via random walks [10]. Finally, the fourth class of techniques, generalization, clusters nodes into super nodes of size at least k. Note that the last two classes of schemes induce possible-world models, i.e., we can retrieve sample graphs that are consistent with the anonymized output graph.
The third category is the most recent class of methods, which leverage the semantics of edge probability to inject uncertainty into a given deterministic graph, converting it into an uncertain one. Most schemes in this category are scalable, i.e. runnable on million-scale graphs or larger. As an example, Boldi et al. [2] introduced the concept of $(k,\epsilon)$-obfuscation (denoted as $(k,\epsilon)$-obf), where $k$ is a desired level of obfuscation and $\epsilon$ is a tolerance parameter. However, the pursuit of a minimum standard deviation in $(k,\epsilon)$-obf has a high impact on node privacy and leads to a high privacy-utility tradeoff. The edge rewiring method based on random walks (denoted as RandWalk) in [10] also introduces uncertainty to edges, as we show in Section IV. This scheme suffers from high lower bounds on utility error despite its excellent privacy-utility tradeoff.
Motivated by $(k,\epsilon)$-obf and RandWalk, we propose in this work a generalized model for anonymizing graphs based on edge uncertainty. Both $(k,\epsilon)$-obf and RandWalk fit into this model. We point out the disadvantages of $(k,\epsilon)$-obf and RandWalk as well as the tradeoff gap between them, and present several elegant techniques to fill this gap. Finally, to support fair comparisons, we develop a new framework for quantifying the tradeoff using the concept of incorrectness from location privacy research [15].
Our contributions are summarized as follows:

We propose a generalized model called the uncertain adjacency matrix for anonymizing graphs via edge uncertainty semantics (Section IV). The key property of this model is that the expected degrees of all nodes must be unchanged. We show how $(k,\epsilon)$-obf and RandWalk fit into the model and then analyze their disadvantages (Sections III, IV).

We introduce the Maximum Variance (MaxVar) scheme (Section V), which satisfies all the properties of the uncertain adjacency matrix. It achieves a good privacy-utility tradeoff by using two key observations: nearby potential edges and maximization of total node degree variance via a simple quadratic program.

Towards a fair comparison of anonymization schemes on graphs, this paper describes a generic quantifying framework (Section VI) that puts forward the distortion measure (also called incorrectness in [15]) to measure the re-identification risks of nodes. As for the utility score, typical graph metrics [2, 21] are chosen.

We conduct a comparative study of the aforementioned approaches on three large real-world graphs and show the effectiveness of our gap-filling solutions (Section VII).
Table I summarizes notations used in this paper.
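Symbol  Definition
$G_0 = (V, E_{G_0})$  true graph with $n = |V|$ and $m = |E_{G_0}|$
$\mathcal{G} = (V, E, p)$  uncertain graph constructed from $G_0$
$G = (V, E_G)$  sample graph from $\mathcal{G}$, $G \sqsubseteq \mathcal{G}$
$d_u(G)$  degree of node $u$ in $G$
$\Delta(d)$  number of nodes having degree $d$ in $G$
$N(u)$  neighbors of node $u$ in $G$
$R_\sigma$  truncated normal distribution on [0,1]
$r_e \leftarrow R_\sigma$  a sample from the distribution $R_\sigma$
$p_i$ ($p_{uv}$)  probability of edge $e_i$ ($e_{uv}$)
$n_p$  number of potential edges, $|E| = m + n_p$
$A(G_0)$, $A(G)$  adjacency matrices of $G_0$, $G$
$B$  random walk transition matrix of $G_0$
$\mathcal{A}$  uncertain adjacency matrix, $\mathcal{A}_{ij} = p_{ij}$
$t$  walk length
$S$  switching matrix
$TV$  total degree variance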
II Related Work
II-A Anonymizing Deterministic Graphs
There is a vast literature on graph perturbation that deserves a survey. In this section, we enumerate only several groups of ideas that are related to our proposed schemes.
II-A1 Anonymizing unlabeled vertices for node privacy
In unlabeled graphs, node identifiers are numbered in an arbitrary manner after removing their labels. An attacker aims at re-identifying nodes solely based on their structural information. For this class of graphs, node privacy protection implies link privacy. Edges and nodes can be added or removed randomly or deterministically. Random perturbation is a naive approach and is usually used as a baseline method. More guided approaches include k-neighborhood [24], k-degree [9], k-automorphism [25], k-symmetry [20], k-isomorphism [4] and $k^2$-degree [18]. These schemes provide k-anonymity [17] semantics and usually rely on heuristics to avoid combinatorial intractability. k-automorphism, k-symmetry, and k-isomorphism can resist any structural attack by exploiting the inherent symmetry in the graph. $k^2$-degree addresses friendship attacks, which are based on the vertex degree pair of an edge. Ying and Wu [21] propose a spectrum preserving approach which carefully chooses edge pairs to switch so that the spectrum of the adjacency matrix does not vary too much. The clearest disadvantage of the above schemes is that they are inefficient on large-scale graphs.
Apart from the two categories above, other perturbation techniques settle on possible-world semantics. Hay et al. [7] generalize a network by clustering nodes and publishing a graph summarization of super nodes and super edges. The utility of this scheme is limited. On the other hand, Boldi et al. [2] take the uncertain graph approach. With edge probabilities, the output graph can be used to generate sample graphs by independent edge sampling. Our approach belongs to this class of techniques, with a different formulation and a better privacy-utility tradeoff. Note that in k-symmetry [20], the output sample graphs are also possible worlds of the symmetric intermediate graph.
II-A2 Anonymizing labeled vertices for link privacy
If nodes are labeled, we are only concerned about the link disclosure risk. For example, Mittal et al. [10] employ an edge rewiring method based on random walks to keep the mixing time tunable and prevent link re-identification by Bayesian inference. This method is effective for social-network-based systems, e.g. Sybil defense and DHT routing. Link privacy is also described in [21] for Random Switch and Random Add/Del. Interestingly, RandWalk [10] can also be used for unlabeled graphs, as shown in Section IV.
II-A3 Min entropy, Shannon entropy and incorrectness measure
We now survey commonly used privacy metrics. Min entropy [16] quantifies the largest probability gap between the posterior and prior over all items in the input dataset. k-anonymity has the same semantics as the corresponding min entropy of $\log_2 k$. So we say that k-anonymity based perturbation schemes belong to the min entropy class. Shannon entropy, argued for in [3] and [2], is another choice of privacy metric. The third metric, which we use in this paper, is the incorrectness measure from location privacy [15]. Given the prior information (e.g. node degree in the true graph) and the posterior information harvested from the anonymized output, the incorrectness measure is the number of incorrect guesses made by the attacker. This measure gauges the distortion caused by the anonymization algorithm.
II-B Mining Uncertain Graphs
Uncertain graphs pose big challenges to traditional mining techniques. Because of the exponential number of possible worlds, naive enumeration is intractable. Typical graph search operations like k-nearest neighbor and pattern matching require new approaches [13, 26, 23]. Those methods answer threshold-based queries by using pruning strategies based on the Apriori property of frequent patterns.
III Preliminaries
This section starts with definitions and common assumptions on uncertain graphs. It then analyzes vulnerabilities in $(k,\epsilon)$-obf [2].
III-A Uncertain Graph
Let $\mathcal{G} = (V, E, p)$ be an uncertain undirected graph, where $p: E \rightarrow [0,1]$ is the function that gives an existence probability to each edge (see Fig.0(b)). The common assumption is that edge probabilities are independent. Following the possible-worlds semantics in relational data [5], the uncertain graph $\mathcal{G}$ induces a set $\{G = (V, E_G)\}$ of deterministic graphs (worlds), each defined by a subset $E_G$ of $E$. The probability of $G$ is:
$$\Pr(G) = \prod_{e \in E_G} p(e) \prod_{e \in E \setminus E_G} (1 - p(e)) \qquad (1)$$
Note that deterministic graphs are also uncertain graphs with all edges having probability 1.
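As a minimal sketch of Equation (1) under the independence assumption, the Python snippet below samples one possible world and evaluates world probabilities by brute force. The five edge probabilities are illustrative assumptions, not values stated in the text, and `sample_world`, `world_probability` and `all_worlds` are hypothetical helper names.

```python
import itertools
import random

# Illustrative uncertain graph: edge -> existence probability (assumed values).
edge_prob = {
    (1, 2): 0.3,
    (1, 3): 0.8,
    (1, 4): 0.9,
    (2, 3): 0.7,
    (3, 4): 0.4,
}

def sample_world(edge_prob, rng):
    """Draw one deterministic graph by sampling every edge independently."""
    return {e for e, p in edge_prob.items() if rng.random() < p}

def world_probability(edge_prob, world):
    """Product of p(e) over kept edges and 1 - p(e) over dropped edges."""
    prob = 1.0
    for e, p in edge_prob.items():
        prob *= p if e in world else 1.0 - p
    return prob

def all_worlds(edge_prob):
    """Enumerate every subset of the edge set (only feasible for tiny graphs)."""
    edges = list(edge_prob)
    for r in range(len(edges) + 1):
        for subset in itertools.combinations(edges, r):
            yield set(subset)

world = sample_world(edge_prob, random.Random(0))
total = sum(world_probability(edge_prob, w) for w in all_worlds(edge_prob))
```

The probabilities of the 2^5 = 32 worlds sum to 1, and a deterministic graph is the special case in which every edge has probability 1 so that a single world carries all the mass.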
III-B $(k,\epsilon)$-obf and Its Limitations
Definition III.1
$(k,\epsilon)$-obf [2]. Let $P$ be a vertex property, $k \ge 1$ be a desired level of obfuscation, and $\epsilon \ge 0$ be a tolerance parameter. The uncertain graph $\mathcal{G}$ is said to k-obfuscate a given vertex $v \in G_0$ with respect to $P$ if the entropy of the distribution $Y_{P(v)}$ over the vertices of $\mathcal{G}$ is greater than or equal to $\log_2 k$:
$$H(Y_{P(v)}) \ge \log_2 k \qquad (2)$$
The uncertain graph $\mathcal{G}$ is a $(k,\epsilon)$-obf with respect to property $P$ if it k-obfuscates at least $(1-\epsilon)n$ vertices in $G_0$ with respect to $P$.
Given the true graph $G_0$ (Fig.0(a)), the basic idea of $(k,\epsilon)$-obf (Fig.0(b)) is to transfer probability from existing edges to potential (non-existing) edges so as to satisfy Definition III.1. Each existing sampled edge $e$ is assigned a probability $1 - r_e$, where $r_e \leftarrow R_\sigma$ is a sample from the truncated normal distribution on [0,1] (Fig. 0(c)), and each non-existing sampled edge $e'$ is assigned a probability $r_{e'} \leftarrow R_\sigma$.
Table II gives an example of how to compute degree entropy for the uncertain graph in Fig. 0(b). Here vertex property is the node degree. Each row in the left side is the degree distribution for the corresponding node. For instance, has degree with probability . The right side normalizes values in each column (i.e. in each degree value) to get distributions . The entropy for each degree value is shown in the bottom row. Given , then with true degree 2 and with true degree 1 satisfy (2). Therefore, .
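Node degree uncertainty: degree distribution of each node (left four columns) and column-normalized values for each degree (right four columns)
node  d=0  d=1  d=2  d=3  d=0  d=1  d=2  d=3
v1  .014  .188  .582  .216  .044  .117  .355  .491
v2  .210  .580  .210  .000  .656  .362  .128  .000
v3  .036  .252  .488  .224  .112  .158  .298  .509
v4  .060  .580  .360  .000  .187  .362  .220  .000
entropy (right side)  1.40  1.84  1.91  0.99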
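The computation behind Table II can be sketched as follows. The five edge probabilities are an assumption: they are not stated in the text and were chosen because they reproduce the left side of Table II. Likewise, `degree_pmf` is a hypothetical helper name, the true degrees are inferred by treating edges with probability above 0.5 as existing, and $k = 3$ is an assumed value used only for the final check.

```python
import math

# Assumed edge probabilities (chosen to reproduce the left side of Table II).
edge_prob = {(1, 2): 0.3, (1, 3): 0.8, (1, 4): 0.9, (2, 3): 0.7, (3, 4): 0.4}
nodes = [1, 2, 3, 4]
max_degree = 3

def degree_pmf(probs):
    """Distribution of a sum of independent Bernoulli variables (dynamic programming)."""
    pmf = [1.0]
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for d, mass in enumerate(pmf):
            new[d] += mass * (1.0 - p)   # edge absent: degree stays d
            new[d + 1] += mass * p       # edge present: degree becomes d + 1
        pmf = new
    return pmf

# Left side of Table II: one degree distribution per node, padded to d = 0..3.
left = {}
for u in nodes:
    pmf = degree_pmf([p for e, p in edge_prob.items() if u in e])
    left[u] = pmf + [0.0] * (max_degree + 1 - len(pmf))

# Right side: normalize each column (degree value) over the nodes, then take its entropy.
entropies = []
for d in range(max_degree + 1):
    column = [left[u][d] for u in nodes]
    total = sum(column)
    y = [x / total for x in column]
    entropies.append(-sum(q * math.log2(q) for q in y if q > 0))

k = 3                                    # assumed obfuscation level
true_degree = {1: 2, 2: 1, 3: 2, 4: 1}   # assumed: edges with probability > 0.5 exist
obfuscated = [u for u in nodes if entropies[true_degree[u]] >= math.log2(k)]
```

Under these assumptions all four nodes are k-obfuscated, because the entropies of the columns for degrees 1 and 2 both exceed $\log_2 3 \approx 1.58$.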
While $(k,\epsilon)$-obf provides a novel technique to come up with an uncertain version of the graph, the specific approach in [2] has two drawbacks. First, it formulates the problem as the minimization of $\sigma$. With small values of $\sigma$, $r_e$ concentrates heavily around zero, so existing sampled edges have probabilities of nearly 1 and non-existing sampled edges are assigned probabilities of almost 0. By a simple rounding technique, the attacker can easily reveal the true graph. Even if the graph owner only publishes sample graphs, the re-identification attacks are still effective, as we show in Section VII. Note that in [2], the values of $\sigma$ found vary over a wide range. Second, the approach in [2] does not consider the locality (subgraph) of nodes when selecting pairs of nodes for establishing potential edges. As shown in [6], subgraph-wise perturbation effectively reduces structural distortion.
IV A Generalized Model for Uncertain Graph
This section introduces a generalized model of graph anonymization via semantics of edge uncertainty. Then we analyze several schemes using this model.
IV-A A Generalized Model: Uncertain Adjacency Matrix
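Given the true graph $G_0$, an uncertain graph $\mathcal{G}$ constructed from $G_0$ must have its uncertain adjacency matrix $\mathcal{A}$ satisfying

(1) symmetry: $\mathcal{A}_{ij} = \mathcal{A}_{ji}$

(2) $\mathcal{A}_{ij} \in [0,1]$ and $\mathcal{A}_{ii} = 0$. If we relax this constraint to (2’), allowing $\mathcal{A}_{ii} > 0$ gives self-loops and allowing $\mathcal{A}_{ij} > 1$ gives multiedges (Fig. 1(a)).

(3) expected degrees of all nodes must be unchanged, i.e. $\sum_{j} \mathcal{A}_{ij} = d_i(G_0)$ for every node $i$.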
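We first define the transition matrix $B$, which is right stochastic (i.e. non-negative with row sums equal to 1), as follows (note that we use the short notation $d_i$ for $d_i(G_0)$)
$$B_{ij} = \begin{cases} 1/d_i & \text{if } (i,j) \in E_{G_0} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
The power $B^t$ when $t = 0$ is the identity matrix $I$.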
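We prove two lemmas on properties of the products $AM$ and $AB^t$, where $M$ is right stochastic.
Lemma IV.1
For an adjacency matrix $A$ and a right stochastic matrix $M$, the product $AM$ is non-negative and has row sums equal to those of $A$.

Proof. The non-negativity of $AM$ is trivial. The sum of row $i$ of $AM$ is
$$\sum_j (AM)_{ij} = \sum_j \sum_k A_{ik} M_{kj} = \sum_k A_{ik} \sum_j M_{kj} = \sum_k A_{ik}$$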
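Lemma IV.2
For a deterministic graph $G_0$ possessing adjacency matrix $A$ and transition matrix $B$ as defined in (3), the product $AB^t$ is also symmetric.

Proof. We prove the result by induction. The case $t = 0$ is trivial because $AB^0 = A$. We prove that for any $t \ge 1$, $(AB^t)_{ij} = \sum_{P_{ij}} \prod_{l=1}^{t} \frac{1}{d_{v_l}}$, where $P_{ij} = (i = v_0, v_1, \dots, v_t, v_{t+1} = j)$ is a path of length $t+1$ from $i$ to $j$.
When $t = 1$, $(AB)_{ij} = \sum_k A_{ik} B_{kj} = \sum_k A_{ik} A_{kj} / d_k$, so the result holds. Assume that the result is correct up to $t-1$, i.e. $(AB^{t-1})_{ik}$ sums $\prod_{l=1}^{t-1} 1/d_{v_l}$ over all paths of length $t$ from $i$ to $k$. Because $AB^t = (AB^{t-1})B$, $(AB^t)_{ij} = \sum_k (AB^{t-1})_{ik} A_{kj} / d_k$, which extends every such path by the step from $v_t = k$ to $j$ and multiplies its product by $1/d_k$.
Because $G_0$ is undirected, reversing each path shows that the set of all $P_{ij}$ is equal to the set of all $P_{ji}$ with the same intermediate nodes, so $(AB^t)_{ij} = (AB^t)_{ji}$.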
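The two lemmas can be checked numerically on a small graph. The sketch below is only an illustration under stated assumptions: the four-node example graph is hypothetical, $B$ is taken to be the standard random-walk transition matrix with entries $1/d_i$ on the neighbors of $i$, and `walk_matrix` and `is_symmetric` are helper names introduced here.

```python
# Assumed example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

def adjacency(n, edges):
    a = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    return a

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]

A = adjacency(n, edges)
degree = [sum(row) for row in A]
# Random-walk transition matrix: B[i][j] = 1/d_i for each neighbor j of i.
B = [[A[i][j] / degree[i] for j in range(n)] for i in range(n)]

def walk_matrix(A, B, t):
    """Return the product A * B^t."""
    m = A
    for _ in range(t):
        m = matmul(m, B)
    return m

def is_symmetric(m, tol=1e-12):
    return all(abs(m[i][j] - m[j][i]) <= tol
               for i in range(len(m)) for j in range(len(m)))

checks = []
for t in range(1, 6):
    m = walk_matrix(A, B, t)
    row_sums_ok = all(abs(sum(m[i]) - degree[i]) <= 1e-12 for i in range(n))
    checks.append((t, is_symmetric(m), row_sums_ok))
```

Note that $B$ itself is not symmetric here, while every product $AB^t$ is. The diagonal of $AB$ is already positive, which corresponds to the self-loops permitted by the relaxed constraint (2’).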
We prove the uniqueness of in the following proposition.
Proposition IV.3
Given a deterministic graph with adjacency matrix , there exists one and only one right stochastic matrix that satisfies for all and is symmetric for all . The unique solution is .

Lemma IV.2 shows that satisfies for all and is symmetric for all .
To prove that this is the unique solution, we repeat the formula in the proof of Lemma IV.2. Let , then where implies the successive node of in . Because has the same number of products as (i.e. the number of paths of length ), is symmetric if and only if corresponding products are equal, i.e. . At , for any path we must have . Along with the requirement that is right stochastic, i.e. , we obtain . This is exactly .
IV-B RandWalk Approach
Now we apply the model of uncertain adjacency matrix to the analysis of RandWalk [10]. Algorithm 1 depicts the steps of RandWalk. As we show below, the trialanderror condition in Line 6 makes RandWalk hard to analyze ^{1}^{1}1It also causes edge miss at , e.g. a 2length walk on edge (Fig. 0(a)) causes the selfloop .. So we modify it by removing the condition and using parameter instead of 1.0 in Line 12 ^{2}^{2}2This line causes errors for degree1 nodes as shown in RandWalkmod. (see Algorithm 2). When , all edges are assigned with probability 0.5. In RandWalkmod, we add a checking for (Line 8) to keep the total degree of equal to that of , which is missing in RandWalk. Note that RandWalkmod accepts selfloops and multiedges.
Let be the edge adding matrix defined as
We show that RandWalkmod can be formulated as an uncertain adjacency matrix , where is the Hadamard product (elementwise). is equivalent to computations in lines 26 and is equivalent to computations in lines 713. We use instead of due to the fact that when the edge is added to with probability , the edge is also assigned the same probability. We come up with the following theorem.
Theorem IV.4
RandWalkmod can be formulated as . is symmetric. It satisfies the constraint of unchanged expected degree iff ^{3}^{3}3This implies a mistake in Theorem 3 of [10].

By Lemmas IV.1 and IV.2, let be , we have symmetric and its row sums are equal to those of . Because and both and are symmetric, is also symmetric.
Due to the fact that has the same locations of nonzeros as , the condition of unchanged expected degree is satisfied if and only if all nonzeros in are 1. This occurs if and only if .
We investigate the limit case when (i.e. ). Correspondingly has . The following theorem quantifies the number of selfloops and multiedges in for powerlaw (PL) graphs and sparse ErdösRenyi (ER) random graphs [11].
Theorem IV.5
For powerlaw graphs with the exponent , the number of selfloops in is , where is the Riemann zeta function defined only for ; the number of multiedges is zero.
For sparse ER random graphs with constant where is the edge probability, the number of selfloops in is ; the number of multiedges is zero.

See Appendix A-A.
Remark IV.1
We notice that RandWalkmod can be done equivalently by the idea in SybilGuard [22]. We first pick a random permutation on neighbors of each node to get pairs of (inedge, outedge). Then for any walk reaching node by the inedge , the outedge is fixed to . In this formulation, it is straightforward to verify that the transition probability from to a neighbor is .
IV-C Edge Switching
In edge switching (EdgeSwitch) approaches (Fig. 1(b)), two edges are chosen and switched to if . This is done in switches. Using the switching matrix , we represent 1step EdgeSwitch in the form (Equation (4)).
The switching matrix is feasible if and only if . Note that in the full form, is matrix with the remaining elements on diagonal are 1, other offdiagonal are 0. In general, is not right stochastic and this happens only when . For step EdgeSwitch . If is right stochastic (i.e. we choose edges such that ), then Lemma IV.1 applies.
(4) 
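A single switching step can be sketched in Python as follows. This is a hedged illustration rather than the paper's algorithm: the acceptance condition used here (the four endpoints are distinct and neither target edge already exists) is an assumption about the elided condition in the text, and `try_switch` and `degrees` are hypothetical helper names.

```python
def try_switch(edges, e1, e2):
    """Switch (u,v),(w,t) to (u,t),(w,v) in place if the graph stays simple."""
    (u, v), (w, t) = e1, e2
    if len({u, v, w, t}) < 4:
        return False                      # shared endpoint: would create a self-loop or no-op
    new1, new2 = frozenset((u, t)), frozenset((w, v))
    if new1 in edges or new2 in edges:
        return False                      # target edges must not exist yet
    edges.difference_update({frozenset(e1), frozenset(e2)})
    edges.update({new1, new2})
    return True

def degrees(edges):
    deg = {}
    for e in edges:
        for x in e:
            deg[x] = deg.get(x, 0) + 1
    return deg

# Assumed toy graph with edges stored as frozensets (undirected).
graph = {frozenset(e) for e in [(1, 2), (3, 4), (1, 3)]}
before = degrees(graph)
rejected = try_switch(graph, (1, 2), (4, 3))   # would need edge (1,3), which already exists
accepted = try_switch(graph, (1, 2), (3, 4))   # yields edges (1,4) and (3,2)
after = degrees(graph)
```

Every accepted switch leaves all node degrees unchanged, which is why the resulting graph keeps the degree sequence exactly rather than only in expectation.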
IV-D Direct Construction
Given the deterministic adjacency matrix $A$, we can directly construct $\mathcal{A}$ that satisfies all three constraints (1), (2) and (3) in Section IV-A. $(k,\epsilon)$-obf [2] introduces such an approach. As explained in Section III-B, the expected degrees of nodes in $(k,\epsilon)$-obf are only approximately unchanged, because the sampled values $r_e$ are nearly zero for small $\sigma$. So $(k,\epsilon)$-obf satisfies constraints (1) and (2) but only approximately satisfies the third constraint.
To remedy this shortcoming, we present the MaxVar approach in Section V. It adds potential edges to $G_0$, then tries to find the assignment of edge probabilities such that the expected node degrees are unchanged while the total variance is maximized. A comparison among schemes is also shown at the end of Section V-C.
IV-E Mixture Approach
In this section, we present the Mixture approach by the uncertain adjacency matrix parametrized by , with the output sample graph . Given the true graph and an anonymized , every edge is chosen into with probability where
It is straightforward to show that . When applied to generated by RandWalkmod with , we have and satisfies three constraints (1) (2’) and (3).
If there exists with constraint such that , then Mixture can be simulated by the RandWalkmod approach with the transition matrix .
IV-F Partition Approach
Another approach that can be applied to RandWalk-mod, $(k,\epsilon)$-obf, MaxVar and EdgeSwitch is the Partition approach. Given the true graph $G_0$, this divide-and-conquer strategy first partitions $G_0$ into disjoint subgraphs, then applies one of the above anonymization schemes to each subgraph to get anonymized subgraphs. Finally, it combines the anonymized subgraphs to obtain the output graph. Note that the partitioning may cause orphan edges, as in MaxVar (Section V). Those edges must be copied to the output graph to keep node degrees unchanged.
V Maximum Variance Approach
We start this section with the formulation of MaxVar in the form of quadratic programming based on two key observations. Then we describe the anonymization algorithm.
V-A Formulation
Two key observations underpinning the MaxVar approach are presented as follows.
V-A1 Observation #1: Maximum Degree Variance
We argue that efficient countermeasures against structural attacks should hinge on node degrees. If a node and its neighbors have their degrees changed, the re-identification risk is reduced significantly. Consequently, instead of replicating local structures as in k-anonymity based approaches [24, 9, 25, 4, 20, 18], we can deflect the attacks by changing node degrees probabilistically. For example, node v1 in Fig.0(a) has degree 2 with probability 1.0 whereas in Fig.0(b), its degree takes the four possible values $0, 1, 2, 3$ with probabilities $0.014, 0.188, 0.582, 0.216$ respectively (first row of Table II). Generally, given edge probabilities of node $u$ as $p_1, p_2, \dots$, the degree of $u$ is a sum of independent Bernoulli random variables, so its expected value is $\sum_i p_i$ and its variance is $\sum_i p_i (1 - p_i)$. If we naively target the maximum (local) degree variance without any constraints, the naive solution is at $p_i = 0.5$ for all $i$. However, such an assignment distorts graph structure severely and deteriorates the utility. Instead, by following the model of uncertain adjacency matrix, we have the constraint $\sum_i p_i = d_u(G_0)$. Note that the minimum variance of an uncertain graph is 0 and corresponds to the case where all edges are deterministic, e.g. when $\mathcal{G} = G_0$ and in switching-edge based approaches. In the following section, we show an interesting result relating the total degree variance with the variance of edit distance.
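To make the role of the degree constraint concrete, the short derivation below shows why the unconstrained optimum sits at one half and how the variance behaves once the expected degree is fixed. The three edge probabilities $0.9$, $0.8$, $0.3$ are an illustrative assumption.

```latex
\begin{align*}
\mathrm{Var}[d_u] &= \sum_i p_i (1 - p_i), \qquad
\frac{\partial}{\partial p_i}\, p_i (1 - p_i) = 1 - 2 p_i = 0
\;\Rightarrow\; p_i = \tfrac{1}{2}. \\[4pt]
\text{Assumed } (p_1, p_2, p_3) = (0.9, 0.8, 0.3): \quad
\mathrm{E}[d_u] &= 0.9 + 0.8 + 0.3 = 2.0, \\
\mathrm{Var}[d_u] &= 0.09 + 0.16 + 0.21 = 0.46. \\[4pt]
\text{Same expected degree, } p_i = \tfrac{2}{3}: \quad
\mathrm{Var}[d_u] &= 3 \cdot \tfrac{2}{3} \cdot \tfrac{1}{3} = \tfrac{2}{3} \approx 0.67.
\end{align*}
```

For a fixed sum $\sum_i p_i$, the variance equals $\sum_i p_i - \sum_i p_i^2$, so it is largest when the probabilities are as equal as possible. This is the effect the quadratic program exploits.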
V-A2 Variance with edit distance
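The edit distance between two deterministic graphs $G$ and $G'$ is defined as:
$$D(G, G') = |E_G \setminus E_{G'}| + |E_{G'} \setminus E_G| \qquad (5)$$
A well-known result about the expected edit distance between the uncertain graph $\mathcal{G}$ and the deterministic graph $G \sqsubseteq \mathcal{G}$ is
$$E[D(\mathcal{G}, G)] = \sum_{G' \sqsubseteq \mathcal{G}} \Pr(G') D(G, G') = \sum_{e_i \in E_G} (1 - p_i) + \sum_{e_i \notin E_G} p_i$$
Correspondingly, the variance of edit distance is
$$Var[D(\mathcal{G}, G)] = \sum_{G' \sqsubseteq \mathcal{G}} \Pr(G') \left[ D(G, G') - E[D(\mathcal{G}, G)] \right]^2$$
We prove in the following theorem that the variance of edit distance is the sum of all edges’ variances (total degree variance) and that it does not depend on the choice of $G$.
Theorem V.1
Assume that $\mathcal{G} = (V, E, p)$ has $k$ uncertain edges $e_1, \dots, e_k$ and $G \sqsubseteq \mathcal{G}$ (i.e. $E_G \subseteq E$). The edit distance variance is $Var[D(\mathcal{G}, G)] = \sum_{i=1}^{k} p_i (1 - p_i)$ and does not depend on the choice of $G$.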

See Appendix A-B.
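The statement of Theorem V.1 can be checked by brute force on a tiny uncertain graph. The sketch below assumes five illustrative edge probabilities (not taken from the paper) and enumerates all possible worlds; `worlds`, `edit_distance` and `edit_distance_moments` are hypothetical helper names.

```python
import itertools

# Assumed edge probabilities of a small uncertain graph.
edge_prob = {(1, 2): 0.3, (1, 3): 0.8, (1, 4): 0.9, (2, 3): 0.7, (3, 4): 0.4}

def worlds(edge_prob):
    """Yield (world, probability) for every subset of the uncertain edges."""
    edges = list(edge_prob)
    for mask in itertools.product([False, True], repeat=len(edges)):
        world = {e for e, keep in zip(edges, mask) if keep}
        prob = 1.0
        for e, keep in zip(edges, mask):
            prob *= edge_prob[e] if keep else 1.0 - edge_prob[e]
        yield world, prob

def edit_distance(g1, g2):
    """Number of edges present in exactly one of the two edge sets."""
    return len(g1 ^ g2)

def edit_distance_moments(edge_prob, reference):
    """Exact mean and variance of the edit distance to a reference graph."""
    mean = sum(p * edit_distance(w, reference) for w, p in worlds(edge_prob))
    var = sum(p * (edit_distance(w, reference) - mean) ** 2
              for w, p in worlds(edge_prob))
    return mean, var

total_variance = sum(p * (1.0 - p) for p in edge_prob.values())
ref_a = {(1, 3), (1, 4), (2, 3)}   # one reference graph
ref_b = {(1, 2), (3, 4)}           # a very different reference graph
mean_a, var_a = edit_distance_moments(edge_prob, ref_a)
mean_b, var_b = edit_distance_moments(edge_prob, ref_b)
```

The two reference graphs give different expected distances but the same variance, which equals the sum of the per-edge variances $p_i(1 - p_i)$.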
V-A3 Observation #2: Nearby Potential Edges
As indicated by Leskovec et al. [8], real graphs exhibit two temporal evolution properties: a densification power law and shrinking diameters. The Community Guided Attachment (CGA) model [8], which produces densifying graphs, is an example of a hierarchical graph generation model in which the linkage probability between nodes decreases as a function of their relative distance in the hierarchy. With regard to this observation, $(k,\epsilon)$-obf, which heuristically creates potential edges solely based on node degree discrepancy, produces many inter-community edges. Shortest-path based statistics are reduced by these edges. MaxVar, in contrast, tries to mitigate the structural distortion by proposing only nearby potential edges before assigning edge probabilities. Further evidence comes from [19], where Vazquez analytically proved that the Nearest Neighbor model can explain the power law for degree distribution, clustering coefficient and average degree among the neighbors. Those properties are in very good agreement with the observations made for social graphs. Sala et al. [14] confirmed the consistency of the Nearest Neighbor model in their comparative study on graph models for social networks.
V-B Algorithms
This section describes the steps of MaxVar to convert the input deterministic graph into an uncertain one.
V-B1 Overview
The intuition behind the new approach is to formulate the perturbation problem as a quadratic programming problem. Given the true graph $G_0$ and the number of potential edges allowed to be added $n_p$, the scheme has three phases. The first phase partitions $G_0$ into subgraphs, each one with potential edges connecting nearby nodes (with default distance 2, i.e. friend-of-friend). The second phase formulates a quadratic program for each subgraph with the constraint of unchanged expected node degrees to produce uncertain subgraphs with maximum edge variance. The third phase combines the uncertain subgraphs into $\mathcal{G}$ and publishes several sample graphs. The three phases are illustrated in Fig. 3.
By preserving the expected degrees of nodes in the perturbed graph, our approach is similar to the edge switching approaches (e.g. [21]), but ours is more subtle as we do it implicitly and the switching does not necessarily occur on pairs of edges.
V-B2 Graph Partitioning
Because of the complexity of exact quadratic programming (Section V-B3), we need a preprocessing phase to divide the true graph into subgraphs and run the optimization on each subgraph. Given the number of subgraphs, we run METIS (http://glaros.dtc.umn.edu/gkhome/views/metis) to get almost equal-sized subgraphs with a minimum number of inter-subgraph edges. Each subgraph has potential edges added before running the quadratic program. This phase is outlined in Algorithm 3.
V-B3 Quadratic Programming
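By assuming the independence of edges, the total degree variance of $\mathcal{G}$ for edit distance (Theorem V.1) is:
$$TV = \sum_i p_i (1 - p_i) = |E_{G_0}| - \sum_i p_i^2 \qquad (6)$$
where the sums run over all edges of $\mathcal{G}$ (existing and potential). The last equality in (6) is due to the constraint that the expected node degrees are unchanged (i.e. $\sum_{v \in N(u)} p_{uv} = d_u(G_0)$ for every node $u$), so $\sum_i p_i$ is equal to $|E_{G_0}|$. By targeting the maximum edge variance, we come up with the following quadratic program.
Minimize  $\sum_i p_i^2$
Subject to  $0 \le p_i \le 1$ for all $i$, and $\sum_{v \in N(u)} p_{uv} = d_u(G_0)$ for all $u$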
The objective function reflects the privacy goal (i.e. sample graphs do not highly concentrate around the true graph) while the expected degree constraints aim to preserve the utility.
By dividing the large input graph into subgraphs, we solve independent quadratic optimization problems, one per subgraph. Because each edge belongs to at most one subgraph and the expected node degrees in each subgraph are unchanged, it is straightforward to show that the expected node degrees in $\mathcal{G}$ are also unchanged. We have a proposition on problem feasibility and an upper bound for the total variance.
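The structure of the quadratic program can be illustrated without a solver by evaluating candidate assignments on a toy instance. Everything in the sketch below is an assumption made for illustration: the true graph is a 4-cycle, its two diagonals are the potential edges, and `is_feasible`, `objective` and `total_variance` are hypothetical helper names. It does not reproduce the solver-based procedure used in the experiments.

```python
# Assumed toy instance: 4-cycle 1-2-3-4-1 as true graph, diagonals as potential edges.
true_edges = [(1, 2), (2, 3), (3, 4), (1, 4)]
potential_edges = [(1, 3), (2, 4)]
all_edges = true_edges + potential_edges

def expected_degrees(p):
    deg = {}
    for (u, v), prob in p.items():
        deg[u] = deg.get(u, 0.0) + prob
        deg[v] = deg.get(v, 0.0) + prob
    return deg

def is_feasible(p, true_edges, tol=1e-9):
    """Check 0 <= p_i <= 1 and that every expected degree equals the true degree."""
    if any(prob < -tol or prob > 1.0 + tol for prob in p.values()):
        return False
    target = expected_degrees({e: 1.0 for e in true_edges})
    deg = expected_degrees(p)
    return all(abs(deg.get(u, 0.0) - target.get(u, 0.0)) <= tol
               for u in set(target) | set(deg))

def objective(p):
    """Quantity minimized by the quadratic program: sum of squared probabilities."""
    return sum(prob ** 2 for prob in p.values())

def total_variance(p):
    return sum(prob * (1.0 - prob) for prob in p.values())

# Trivial feasible point: the true graph itself (variance 0).
trivial = {e: (1.0 if e in true_edges else 0.0) for e in all_edges}
# Symmetric candidate: every edge of the resulting K4 gets probability 2/3.
uniform = {e: 2.0 / 3.0 for e in all_edges}
# Infeasible candidate: expected degrees drop to 1.5.
halves = {e: 0.5 for e in all_edges}
```

For this instance the uniform assignment is the optimum: the objective is strictly convex and the assignment satisfies the optimality conditions with equal multipliers on all four degree constraints. Its total variance, $4 - 8/3 = 4/3$, matches Equation (6).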
Proposition V.2
The quadratic program in MaxVar is always feasible. The total variance is upper bounded by .

The feasibility is due to the fact that is a feasible point. Let be the number of potential edges incident to node . By requiring ’s expected degree to be unchanged, we have . Applying CauchySchwarz inequality, we get . Now we take the sum over all nodes to get the following
where the last equality is again due to CauchySchwarz inequality.
V-C Comparison of schemes
Table III shows the comparison of the schemes we investigate in this work. Only MaxVar and EdgeSwitch satisfy all three properties (1), (2) and (3). The next two propositions quantify the TV of $(k,\epsilon)$-obf and RandWalk-mod.
Scheme  Prop #1  Prop #2  Prop #3  Uncertain 

RandWalkmod  ()  
RandWalk  
EdgeSwitch  
obf  
MaxVar  
Mixture  depends on the mixed scheme  
Partition  depends on the scheme used in subgraphs 
Proposition V.3

In obf, existing edges are assigned probabilities while potential edges are assigned probabilities . Therefore, the total variance is where . Take the expectation of , we get .
has pdf . The normalization constant where erf is the error function. Basic integral computations (change of variable and integration by parts) give us the formulas for and as follows
(7) (8)
Note that for , and , so
(9) 
Proposition V.4
The total variance of RandWalkmod at walklength is upper bounded by where is the number of nonzeros in .
For powerlaw graphs with the exponent , . For sparse ER random graphs with constant,
Note that the increases with and when is equal to the diameter of , . Therefore, the upper bound of converges very fast to , compatible with the results in the limit cases of PL and ER random graphs.
VI Quantifying Framework
This section describes a generic framework for privacy and utility quantification of anonymization methods.
VI-A Privacy Measurement
We focus on structural re-identification attacks under various models of the attacker’s knowledge, as shown in [7]. We quantify the privacy of an anonymized graph as the sum of re-identification probabilities of all nodes in the graph. We differentiate closed-world from open-world adversaries. For example, when a closed-world adversary knows that Bob has three neighbors, this fact is exact. An open-world adversary in this case would learn only that Bob has at least three neighbors. We consider the result of a structural query on a node as the node signature. Given a query, nodes having the same signature form an equivalence class. So given the true graph and an output anonymized graph, the privacy is measured as in the following example.
Example VI.1
Assuming that we have signatures of and signatures of as in Table IV, the reidentification probabilities in of nodes 1,2 are , of nodes 4,8 are , of nodes 3,5,6,7 are 0s. And the privacy score of is . In , the privacy score is , equal to the number of equivalence classes.
Graph  Equivalence classes 

We consider two privacy scores in this paper.

score uses node degree as the node signature, i.e. we assume that the attacker know apriori degrees of all nodes.

uses the set (not multiset) of degrees of node’s friends as the node signature. For example, if a node has 6 neighbors and the degrees of those neighbors are , then its signature for attack is .
Higherorder scores like (exact multiset of neighbors’ degrees) or (exact multiset of neighborofneighbors’ degrees) induce much higher privacy scores of the true graph (in the order of ) and represent less meaningful metrics for privacy. The following proposition claims the automorphisminvariant property of structural privacy scores.
Proposition VI.1
All privacy scores based on structural queries [7] are automorphism-invariant, i.e. if we find a non-trivial automorphism of the true graph, the signatures of all nodes in the resulting graph are unchanged.

The proof follows directly from the definition of graph automorphism. We omit it due to lack of space.
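The privacy score described in this section can be sketched as follows. The sketch rests on an assumed reading of the measure: the attacker knows a node's signature in the true graph, looks up all nodes with that signature in the anonymized graph, and guesses uniformly among them. The signature data below are hypothetical and chosen only to be consistent with the pattern described in Example VI.1, and the names `H1`-style and `H2open`-style in the comments are labels assumed here for the two signatures.

```python
from collections import defaultdict

def degree_signature(adj, u):
    """H1-style signature (assumed name): the degree of u."""
    return len(adj[u])

def neighbor_degree_set_signature(adj, u):
    """H2open-style signature (assumed name): the set of the neighbors' degrees."""
    return frozenset(len(adj[v]) for v in adj[u])

def privacy_score(true_sig, anon_sig):
    """Sum over nodes of the chance that a uniform guess among matching nodes is right."""
    classes = defaultdict(set)
    for node, sig in anon_sig.items():
        classes[sig].add(node)
    score = 0.0
    for node, sig in true_sig.items():
        candidates = classes.get(sig, set())
        if node in candidates:
            score += 1.0 / len(candidates)
    return score

# Hypothetical signatures of eight nodes in the true and the anonymized graph.
true_sig = {1: "s1", 2: "s1", 3: "s1", 4: "s2", 5: "s2", 6: "s3", 7: "s3", 8: "s3"}
anon_sig = {1: "s1", 2: "s1", 6: "s1", 4: "s2", 7: "s2", 3: "s3", 8: "s3", 5: "s4"}
score_true = privacy_score(true_sig, true_sig)
score_anon = privacy_score(true_sig, anon_sig)

# Signatures computed from an assumed small graph (triangle plus a pendant node).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
h1 = {u: degree_signature(adj, u) for u in adj}
h2open = {u: neighbor_degree_set_signature(adj, u) for u in adj}
```

With these hypothetical data the true graph scores 3, the number of its equivalence classes, while the anonymized graph scores 5/3: nodes 1 and 2 contribute 1/3 each, nodes 4 and 8 contribute 1/2 each, and the remaining nodes contribute 0.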
VI-B Utility Measurement
Following [2] and [21], we consider three groups of statistics for utility measurement: degree-based statistics, shortest-path based statistics and clustering statistics.
VI-B1 Degree-based statistics

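Number of edges: $S_{NE} = \frac{1}{2} \sum_{v \in V} d_v$

Average degree: $S_{AD} = \frac{1}{n} \sum_{v \in V} d_v$

Maximal degree: $S_{MD} = \max_{v \in V} d_v$

Degree variance: $S_{DV} = \frac{1}{n} \sum_{v \in V} (d_v - S_{AD})^2$

Power-law exponent of degree sequence: $S_{PL}$ is the estimate of $\gamma$ assuming the degree sequence follows a power law $\Delta(d) \sim d^{-\gamma}$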
VI-B2 Shortest-path based statistics

Average distance: $S_{APD}$ is the average distance among all pairs of vertices that are path-connected.

Effective diameter: $S_{EDiam}$ is the 90th percentile distance among all path-connected pairs of vertices.

Connectivity length: $S_{CL}$ is defined as the harmonic mean of all pairwise distances in the graph.

Diameter: $S_{Diam}$ is the maximum distance among all path-connected pairs of vertices.
VI-B3 Clustering statistics

Clustering coefficient: $S_{CC} = \frac{3 N_\Delta}{N_3}$ where $N_\Delta$ is the number of triangles and $N_3$ is the number of connected triples.
All of the above statistics are computed on sample graphs generated from the uncertain output $\mathcal{G}$. In particular, to estimate shortest-path based measures, we use the Approximate Neighbourhood Function (ANF) [12]. The diameter is lower bounded by the longest distance found by all-destination breadth-first searches from 1,000 randomly chosen nodes.
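The degree-based and clustering statistics can be computed on a sample graph as sketched below. The definitions used are the standard ones and are an assumption about the exact formulas: the variance is the population variance (dividing by the number of nodes) and the clustering coefficient is three times the number of triangles over the number of connected triples. The power-law exponent and the shortest-path statistics are left out, and `utility_stats` is a hypothetical helper name.

```python
from itertools import combinations

def utility_stats(n, edges):
    """Degree-based and clustering statistics of a simple undirected graph on nodes 0..n-1."""
    adj = {u: set() for u in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = [len(adj[u]) for u in range(n)]
    avg = sum(deg) / n
    # Each triangle is found once from each of its three corners.
    triangles = sum(1 for u in adj
                    for v, w in combinations(sorted(adj[u]), 2)
                    if w in adj[v]) // 3
    # A connected triple is a path of length two centered at some node.
    triples = sum(d * (d - 1) // 2 for d in deg)
    return {
        "edges": sum(deg) // 2,
        "avg_degree": avg,
        "max_degree": max(deg),
        "degree_variance": sum((d - avg) ** 2 for d in deg) / n,
        "clustering": 3 * triangles / triples if triples else 0.0,
    }

# Assumed example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
stats = utility_stats(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
```

In practice these values would be averaged over several sample graphs drawn from the uncertain output and compared with the values of the true graph.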
VII Evaluation
In this section, our evaluation aims to show the disadvantages of $(k,\epsilon)$-obf and RandWalk/RandWalk-mod as well as the gap between them. We then illustrate the effectiveness and efficiency of the gap-filling approaches MaxVar and Mixture. The effectiveness is measured by privacy scores (lower is better) and the relative error of utility (lower is better). The efficiency is measured by the running time. All algorithms are implemented in Python and run on a desktop PC with a Core i7-4770 @ 3.4 GHz and 16 GB of memory. We use MOSEK (http://mosek.com/) as the quadratic solver.
Three large real-world datasets are used in our experiments (http://snap.stanford.edu/data/index.html). dblp is a co-authorship network where two authors are connected if they have published at least one paper together. amazon is a product co-purchasing network where the graph contains an undirected edge from $i$ to $j$ if product $i$ is frequently co-purchased with product $j$. youtube is a video-sharing web site that includes a social network. The graph sizes (number of nodes, number of edges) of dblp, amazon and youtube are (317080, 1049866), (334863, 925872) and (1134890, 2987624) respectively. We partition dblp and amazon into 20 subgraphs and youtube into 60 subgraphs. The sample size of each test case is 20.
VII-A obf and RandWalk
We report the performance of obf in Table V. We keep the number of potential edges at the default value used in [2] and vary the obfuscation parameter listed in the first column of the table. The scheme achieves low relative errors only at small values of this parameter. However, the privacy scores, especially the second one, then rise quickly (up to 50% of the scores of the true graph). This leads to a high privacy-utility tradeoff, as confirmed in Table VIII.
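As a reading aid for the rel.err column, the sketch below recomputes it for two dblp rows of Table V. It assumes that rel.err is the unweighted mean of the per-statistic relative errors over the ten utility columns that follow the two privacy scores. That reading is an assumption checked only against these two rows, and the function name mean_relative_error is hypothetical. The inputs are the rounded values printed in the table.

```python
def mean_relative_error(true_stats, anonymized_stats):
    """Unweighted mean of |S(anonymized) - S(true)| / S(true) over all statistics."""
    errors = [abs(a - t) / t for t, a in zip(true_stats, anonymized_stats)]
    return sum(errors) / len(errors)


# dblp rows of Table V: the ten utility columns after the two privacy scores.
dblp_true = [1049866, 6.62, 343, 100.15, 0.306, 2.245, 7.69, 9, 7.46, 20]
dblp_obf_0001 = [1048153, 6.61, 316.0, 97.46, 0.303, 2.244, 7.74, 9.4, 7.50, 20.0]
dblp_obf_01 = [991498, 6.25, 164.9, 64.20, 0.284, 2.265, 8.08, 10.0, 7.85, 20.0]

err_small = mean_relative_error(dblp_true, dblp_obf_0001)
err_large = mean_relative_error(dblp_true, dblp_obf_01)
```

Under this assumption the two values come out at about 0.0175 and 0.128, which agree with the reported 0.018 and 0.128 after rounding. In the second row, most of the error comes from the maximal degree and the degree variance.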
Table VI shows that RandWalk and RandWalkmod perform similarly, except on youtube and at walk length 2 on amazon. Because RandWalkmod satisfies the third constraint, it benefits several degree-based statistics, while the existence of self-loops and multi-edges has little impact on shortest-path-based metrics. RandWalk misses many edges at walk length 2 (see footnote 1 in Section IV-B). The remarkable characteristics of the random-walk schemes are their very low privacy scores and their high relative errors (lower-bounded at around 8 to 10%). Clearly, there is a gap between the high tradeoffs of obf and the high relative errors of RandWalk, which MaxVar and Mixture may fill.
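Table V: Performance of obf (rows labeled with a dataset name give the values for the true graph)
graph / parameter  privacy score 1  privacy score 2  edges  avg. degree  max. degree  degree variance  clustering coeff.  power-law exp.  avg. distance  eff. diameter  conn. length  diameter  rel.err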
dblp  199  125302  1049866  6.62  343  100.15  0.306  2.245  7.69  9  7.46  20  
0.001  72.9  40712.1  1048153  6.61  316.0  97.46  0.303  2.244  7.74  9.4  7.50  20.0  0.018 
0.01  41.1  24618.2  1035994  6.53  186.0  86.47  0.294  2.248  7.82  9.5  7.59  19.8  0.077 
0.1  19.7  7771.4  991498  6.25  164.9  64.20  0.284  2.265  8.08  10.0  7.85  20.0  0.128 
amazon  153  113338  925872  5.53  549  33.20  0.205  2.336  12.75  16  12.10  44  
0.001  55.7  55655.9  924321  5.52  479.1  31.73  0.206  2.340  12.14  15.2  11.65  33.2  0.057 
0.01  34.5  39689.8  915711  5.47  299.7  27.18  0.220  2.348  12.40  15.6  11.91  32.4  0.101 
0.1  19.2  16375.4  892140  5.33  253.9  21.87  0.232  2.374  12.52  15.5  12.06  31.4  0.144 
youtube  978  321724  2987624  5.27  28754  2576.0  0.0062  2.429  6.07  8  6.79  20  
0.001  157.2  36744.6  2982974  5.26  28438  2522.6  0.0062  2.416  6.24  8.0  6.01  19.5  0.022 
0.01  80.0  22361.7  2940310  5.18  26900  2282.6  0.0061  2.419  6.27  8.0  6.04  19.0  0.043 
0.1  23.4  5806.9  2624066  4.62  16353  970.8  0.0070  2.438  6.59  8.1  6.36  20.4  0.160 
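Table VI: Performance of RandWalk (RW) and RandWalkmod (RWmod) (rows labeled with a dataset name give the values for the true graph)
graph / walk length  privacy score 1  privacy score 2  edges  avg. degree  max. degree  degree variance  clustering coeff.  power-law exp.  avg. distance  eff. diameter  conn. length  diameter  rel.err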
dblp  199  125302  1049866  6.62  343  100.15  0.306  2.245  7.69  9  7.46  20  
(RW) 2  10.0  4.9  1001252  6.32  309.3  86.16  0.152  2.197  7.43  9.1  7.20  19.7  0.094 
3  11.8  10.9  1048129  6.61  315.4  98.04  0.107  2.155  7.08  8.7  6.88  17.8  0.110 
5  11.7  5.6  1049484  6.62  321.6  100.77  0.065  2.148  6.79  8.0  6.62  16.4  0.142 
10  11.9  2.9  1049329  6.62  329.2  103.06  0.030  2.144  6.54  8.0  6.40  14.3  0.171 
(RWmod) 2  11.8  4.5  1049921  6.62  327.0  105.3  0.093  2.110  7.75  9.7  7.48  23.0  0.109 
3  11.9  9.4  1049877  6.62  343.3  105.1  0.071  2.117  7.32  9.0  7.10  20.4  0.099 
5  12.0  5.4  1049781  6.62  340.5  105.1  0.044  2.115  6.95  8.4  6.76  18.3  0.131 
10  11.9  2.6  1049902  6.62  340.0  105.3  0.021  2.116  6.59  8.0  6.44  16.0  0.164 
amazon  153  113338  925872  5.53  549  33.20  0.205  2.336  12.75  16  12.10  44  
(RW) 2  5.7  5.4  861896  5.15  274.9  23.11  0.148  2.337  10.70  13.8  10.19  38.7  0.180 
3  10.0  16.5  923793  5.52  495.6  32.72  0.113  2.282  10.33  13.1  9.87  34.1  0.137 
5  10.4  8.6  925185  5.53  507.7  33.52  0.080  2.276  9.45  12.1  9.07  29.6  0.181 
10  10.2  4.6  925748  5.53  498.1  34.37  0.046  2.273  8.55  10.5  8.25  25.7  0.234 
(RWmod) 2  9.8  3.2  925672  5.53  255.1  37.61  0.099  2.246  12.02  15.5  11.40  43.2  0.139 
3  9.9  11.2  925532  5.53  535.3  37.32  0.082  2.254  10.89  14.0  10.38  37.9  0.134 
5  9.7  6.0  926163  5.53  522.8  37.42  0.059  2.252  9.83  12.5  9.40  33.0  0.185 
10  9.9  3.3  925809  5.53  491.4  37.45  0.035  2.251  8.76  11.0  8.44  28.7  0.238 
youtube  978  321724  2987624  5.27  28754  2576.0  0.0062  2.429  6.07  8  6.79  20  
(RW) 2  13.4  1.5  2636508  4.65  19253.8  1139.7  0.022  2.191  6.18  7.9  5.93  23.5  0.403 
3  23.8  17.6  2982204  5.26  26803.6  2389.6  0.004  2.108  5.73  7.0  5.52  18.0  0.103 
5  24.6  8.4  2985967  5.26  26018.7  2340.0  0.005  2.106  5.55  7.0  5.38  16.3  0.120 
10  21.9  1.8  2984115  5.26  24695.8  2099.4  0.009  2.100  5.49  6.9  5.33  18.7  0.145 
(RWmod) 2  26.4  1.4  2987228  5.26  23829.7  2578.5  0.018  2.053  6.27  8.0  6.02  22.1  0.245 
3  26.9  22.3  2988011  5.27  28611.5  2579.7  0.005  2.077  5.75  7.2  5.54  19.0  0.081 
5  26.1  11.0  2987479  5.26  28619.3  2581.4  0.005  2.076  5.61  7.0  5.44  18.3  0.090 
10  26.3  1.7  2987475  5.26  28432.2  2579.9  0.008  2.073  5.58  7.0  5.41  18.8  0.099 
VII-B Effectiveness of MaxVar
We assess the privacy and utility of MaxVar by varying the number of potential edges. The results are shown in Table VII. As for privacy scores, increasing the number of potential edges yields better privacy because more edge switches are allowed. Due to the expected-degree constraints in the quadratic program, all degree-based metrics vary only slightly.
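The effect of the expected-degree constraints can be illustrated on a toy example. The sketch below is not the MaxVar quadratic program. It only checks the constraint itself: that the edge probabilities incident to each node sum to that node's degree in the true graph. The edge probabilities and the function name expected_degrees are hypothetical.

```python
def expected_degrees(uncertain_edges, n_nodes):
    """Expected degree of each node: the sum of probabilities of its incident edges."""
    degrees = [0.0] * n_nodes
    for (u, v), p in uncertain_edges.items():
        degrees[u] += p
        degrees[v] += p
    return degrees


# True graph: two disjoint edges, so every node has degree 1.
true_edges = [(0, 1), (2, 3)]
true_degrees = [0] * 4
for u, v in true_edges:
    true_degrees[u] += 1
    true_degrees[v] += 1

# Hypothetical uncertain graph: half of the probability mass of each true edge
# is moved to the potential edges (0, 2) and (1, 3), as in an edge switch.
uncertain = {(0, 1): 0.5, (2, 3): 0.5, (0, 2): 0.5, (1, 3): 0.5}
exp_deg = expected_degrees(uncertain, 4)
```

Here every expected degree is 1.0, the same as in the true graph, and the expected number of edges (the sum of all probabilities) is still 2. This is consistent with degree-based metrics changing little while the probability mass is spread over more potential edges.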
We observe the near linear relationships between