
We introduce the notion of restricted sensitivity as an alternative to global and smooth sensitivity to improve accuracy in differentially private data analysis. The definition of restricted sensitivity is similar to that of global sensitivity except that instead of quantifying over all possible datasets, we take advantage of any beliefs about the dataset that a querier may have, to quantify over a restricted class of datasets. Specifically, given a query and a hypothesis about the structure of a dataset , we show generically how to transform into a new query whose global sensitivity (over all datasets including those that do not satisfy ) matches the restricted sensitivity of the query . Moreover, if the belief of the querier is correct (i.e., ) then . If the belief is incorrect, then may be inaccurate.

We demonstrate the usefulness of this notion by considering the task of answering queries regarding social-networks, which we model as a combination of a graph and a labeling of its vertices. In particular, while our generic procedure is computationally inefficient, for the specific definition of as graphs of bounded degree, we exhibit efficient ways of constructing using different projection-based techniques. We then analyze two important query classes: subgraph counting queries (e.g., number of triangles) and local profile queries (e.g., number of people who know a spy and a computer-scientist who know each other). We demonstrate that the restricted sensitivity of such queries can be significantly lower than their smooth sensitivity. Thus, using restricted sensitivity we can maintain privacy whether or not , while providing more accurate results in the event that holds true.

1 Introduction

The social networks we inhabit have grown significantly in recent decades with digital technology enabling the rise of networks like Facebook that now connect over million people and house vast repositories of personal information. At the same time, the study of various characteristics of social networks has emerged as an active research area [10]. Yet the fact that the data in a social network might be used to infer sensitive details about an individual, like sexual orientation [15], is a growing concern among social networks’ participants. Even in an ‘anonymized’ unlabeled graph it is possible to identify people based on graph structures [3]. In this paper, we study the feasibility of and design efficient algorithms to release statistics about social networks (modeled as graphs with vertices labeled with attributes) while satisfying the semantic definition of differential privacy [8, 9].

A differentially private mechanism guarantees that any two neighboring data sets (i.e., data sets that differ only on the information about a single individual) induce similar distributions over the statistics released. For social networks, we consider two notions of neighboring or adjacent networks: (1) edge adjacency stipulating that adjacent graphs differ in just one edge or in the attributes of just one vertex; and (2) vertex adjacency stipulating that adjacent networks differ on just one vertex—its attributes or any number of edges incident to it.

For any given statistic or query, its global sensitivity measures the maximum difference in the answer to that query over all pairs of neighboring data sets [9]; global sensitivity provides an upper bound on the amount of noise that has to be added to the actual statistic in order to preserve differential privacy. Since the global sensitivity of certain types of queries can be quite high, the notion of smooth sensitivity was introduced to reduce the amount of noise that needs to be added while still preserving differential privacy [18].

However, a key challenge in the differentially private analysis of social networks is that for many natural queries, both global and smooth sensitivity can be very large. In the vertex adjacency model, consider the query “How many people in are doctors or are friends with a doctor?” Even if the answer is (e.g., there are no doctors in the social network) there is a neighboring social network in which the answer is (e.g., pick an arbitrary person from , relabel him as a doctor, and connect him to everyone). Even in the edge adjacency model, the sensitivity of queries may be high. Consider the query “How many people in are friends with two doctors who are also friends with each other?” In the answer may be even if there are two doctors that everyone else is friends with (e.g., the doctors are not friends with each other), but the answer jumps to in a neighboring graph (e.g., if we simply connect the doctors to each other). In fact, even the first query can have high sensitivity in the edge-adjacency model if we just relabel a high-degree vertex as a doctor.

Yet, while these examples respect the mathematical definitions of neighboring graphs and networks, we note that in a real social network no single individual is likely to be directly connected with everyone else. Suppose that in fact a querier has some such belief about the given network ( is a subset of all possible networks) such that its query has low sensitivity restricted only to inputs and deviations within . For example, the querier may believe the following hypothesis : the maximum degree of any node in the network is at most (e.g., after reading a study on the anatomy of Facebook [21]). Can one in that case provide accurate answers in the event that indeed and yet preserve privacy no matter what (even if is not satisfied)?

In this work, we provide a positive answer to this question. We do so by introducing the notion of restricted sensitivity, which represents the sensitivity of the query over only the given subset , and providing procedures that map a query to an alternative query s.t. and identify over the inputs in , yet the global sensitivity of is comparable to just the restricted sensitivity of . Therefore, the mechanism that answers according to and adds Laplace random noise preserves privacy for all inputs, while giving good estimations of for inputs in .

While our general scheme for devising such is inefficient and requires that we construct a separate for each query , we also design a complementary projection-based approach. A projection of is a function mapping all possible inputs (e.g., all possible -node social networks) to inputs in with the property that any input in is mapped to itself. Therefore, a projection allows us to define for any , simply by composing . Moreover, if this projection satisfies certain smoothness properties, which we define in Section 4, then this function will have its global sensitivity—or at least its smooth sensitivity over inputs in —comparable to only the restricted sensitivity of . In particular, for the case (the assumption that the network has degree at most ), we show we can efficiently construct projections satisfying these conditions, therefore allowing us to efficiently take advantage of low restricted sensitivity. These results are given in Section 4 and summarized in Table 1.

The next natural question is: how much advantage does restricted sensitivity provide, compared to global or smooth sensitivity, for natural query classes and natural sets ? In Section 5 we consider two natural classes of queries: local profile queries and subgraph counting queries. A local profile query asks how many nodes in a graph satisfy a property which depends only on the immediate neighborhood of (e.g., queries relating to clustering coefficients and bridges [10], or queries such as “how many people know two spies who don’t know each other?”). A subgraph counting query asks how many copies of a particular subgraph are contained in the network (e.g., number of triangles involving at least one spy). For the case for we show that the restricted sensitivity of these classes of queries can indeed be much lower than the smooth sensitivity. These results, presented in Section 5, are summarized in Table 2.

| | Adjacency | Hypothesis | Query | Sensitivity | Efficient |
|---|---|---|---|---|---|
| Theorem 9 | Any | Any | Any | | No |
| Theorem 14 | Edge | | Any | | Yes |
| Theorem 18 | Vertex | | Any | | Yes |

Table 1: Summary of Results. global sensitivity, restricted sensitivity, and smooth bound of local sensitivity.

| Adjacency | Subgraph Counting Query: Smooth | Subgraph Counting Query: Restricted | Local Profile Query: Smooth | Local Profile Query: Restricted |
|---|---|---|---|---|
| Edge | | | | |
| Vertex | | | | |

Table 2: Worst Case Smooth Sensitivity over vs. Restricted Sensitivity .

1.1 Related Work

Easley and Kleinberg provide an excellent summary of the rich literature on social networks [10]. Previous literature on differentially-private analysis of social networks has primarily focused on the edge adjacency model in unlabeled graphs where sensitivity is manageable. Triangle counting queries can be answered in the edge adjacency model by efficiently computing the smooth sensitivity [18], and this result can be extended to answer other counting queries [16]. [14] shows how to privately approximate the degree distribution in the edge adjacency model. The Johnson-Lindenstrauss transform can be used to answer all cut queries in the edge adjacency model [5].

The approach taken in the work of Rastogi et al. [19] on answering subgraph counting queries is the most similar to ours. They consider a Bayesian adversary whose prior (background knowledge) is drawn from a distribution. Leveraging an assumption about the adversary’s prior, they compute a high-probability upper bound on the local sensitivity of the data and then answer by adding noise proportional to that bound. Loosely, they assume that the presence of an edge does not make the presence of other edges more likely. In the specific context of a social network this assumption is widely believed to be false (e.g., two people are more likely to become friends if they already have common friends [10]). The privacy guarantees of [19] only hold if these assumptions about the adversary’s prior are true. By contrast, we always guarantee privacy even if the assumptions are incorrect.

A relevant approach that deals with preserving differential privacy while providing better utility guarantees for nice instances is detailed in the work of Nissim et al. [18], who define the notion of smooth sensitivity. In their framework, the amount of random noise that the mechanism adds to a query’s true answer depends on the extent to which the input database is “nice”, i.e., has small local sensitivity. As we discuss later, in social networks many natural queries (e.g., local profile queries) have high local and smooth sensitivity.

2 Preliminaries

2.1 Differential Privacy

We adopt the framework of differential privacy. We use to denote the set of all possible datasets. Intuitively, we say two datasets are neighbors if they differ on the details of a single individual. (See further discussion in Definitions 6 and 7.) We denote the fact that is a neighbor of using . We define the distance between two databases as the minimal non-negative integer s.t. there exists a path where , and for every we have that . Given a subset we denote the distance of a database to as .

Definition 1.

[8] A mechanism is -differentially private if for every pair of neighboring datasets and every subset we have that

Intuitively differential privacy guarantees that an adversary has a very limited ability to distinguish between the output of and the output of . A query is a function mapping the dataset to a real number.

Definition 2.

The local sensitivity of a query $f$ at a dataset $D$ is $LS_f(D) = \max_{D' \sim D} |f(D) - f(D')|$.

Definition 3.

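The global sensitivity of a query $f$ is $GS_f = \max_{D \sim D'} |f(D) - f(D')|$, where the maximum is taken over all pairs of neighboring datasets.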

The Laplace mechanism preserves -differential privacy [9]. This mechanism provides useful answers to queries with low global sensitivity. The primary challenge in the differentially private analysis of social networks is the high global sensitivity of many queries. The local sensitivity may be significantly lower than the global sensitivity . However, adding noise proportional to does not preserve differential privacy because the noise level itself may leak information. A clever way to circumvent this problem is to smooth out the noise level [18].
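To make the role of global sensitivity concrete, the following is a minimal sketch of the Laplace mechanism in its standard form, where the noise scale is the global sensitivity divided by the privacy parameter. The function names `laplace_noise` and `laplace_mechanism` and the parameter `epsilon` are illustrative choices, not notation from this paper, and the sampler is for illustration only rather than a hardened implementation.

```python
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential variables."""
    if scale == 0:
        return 0.0
    # If X, Y are i.i.d. exponential with mean `scale`, X - Y is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, global_sensitivity, epsilon, rng=random):
    """Release the true answer plus Laplace noise of scale GS / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_answer + laplace_noise(global_sensitivity / epsilon, rng)
```

The expected error grows linearly with the global sensitivity, which is why queries whose sensitivity scales with the size of the network are answered poorly by this mechanism.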

Definition 4.

[18] A -smooth upper bound on the local sensitivity of a query is a function which satisfies (i) , and (ii) it holds that .

It is possible to preserve privacy while adding noise proportional to a -smooth upper bound on the sensitivity of a query. For example, the mechanism with preserves -differential privacy [18]. To evaluate efficiently one must present an algorithm to efficiently compute the -smooth upper bound , a task which is by itself often non-trivial.

2.2 Graphs and Social Networks

Our work is motivated by the challenges posed by differentially private analysis of social networks. As always, a graph is a pair of a set of vertices and a set of edges . We often just denote a graph as , referring to its vertex-set or edge-set as or resp. A key aspect of our work is modeling a social network as a labeled graph.

Definition 5.

A social network is a graph with labeling function . The set of all social networks is denoted .

The labeling function allows us to encode information about the nodes (e.g., age, gender, occupation). For convenience, we assume all social networks are over the same set of vertices, which is denoted and has size , and so we assume is public knowledge. Therefore, the graph structures of two social networks are equal if their edge-sets are identical. Similarly, we also fix the dimension of our labeling.

Defining differential privacy over the labeled graphs requires care. What does it mean for two labeled graphs to be neighbors? There are two natural notions: edge-adjacency and vertex adjacency.

Definition 6 (Edge-adjacency).

We say that two social networks and are neighbors if either (i) and there exists a vertex such that whereas for every other we have or (ii) and the symmetric difference contains a single edge.

In the context of a social network, differential-privacy w.r.t edge-adjacency can, for instance, guarantee that an adversary will not be able to distinguish whether a particular individual has friended some specific pop-singer on Facebook. However, such guarantees do not allow a person to pretend to listen only to high-end indie rock bands, should that person have friended numerous pop-singers on Facebook. This motivates the stronger vertex-adjacency neighborhood model.

Definition 7 (Vertex-adjacency).

We say that two social networks and are neighbors if there exists a vertex such that and for every .

where for a graph and a vertex we denote as the result of removing every edge in that touches .
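The two adjacency notions can be sketched as simple predicates. This is an illustrative reading of Definitions 6 and 7, assuming a network is given as a list of vertex pairs plus a dictionary of labels over a fixed vertex set; the function names are hypothetical, and the choice to treat a network as not being its own neighbor is an assumption made here.

```python
def edge_set(edges):
    """Normalize an iterable of vertex pairs into a set of undirected edges."""
    return {frozenset(e) for e in edges}

def is_edge_adjacent(edges1, labels1, edges2, labels2):
    """Either one vertex label differs, or exactly one edge differs."""
    e1, e2 = edge_set(edges1), edge_set(edges2)
    changed = [v for v in labels1 if labels1[v] != labels2[v]]
    if e1 == e2:
        return len(changed) == 1
    return not changed and len(e1 ^ e2) == 1

def is_vertex_adjacent(edges1, labels1, edges2, labels2):
    """Some vertex v explains every difference: its label and its incident edges."""
    e1, e2 = edge_set(edges1), edge_set(edges2)
    if e1 == e2 and labels1 == labels2:
        return False  # assumption: a network is not its own neighbor
    for v in labels1:
        others_match = all(labels1[u] == labels2[u] for u in labels1 if u != v)
        rest1 = {e for e in e1 if v not in e}  # edges that do not touch v
        rest2 = {e for e in e2 if v not in e}
        if others_match and rest1 == rest2:
            return True
    return False
```

For example, a star graph and the empty graph on the same vertices are vertex-adjacent (the center explains every differing edge) but not edge-adjacent once the star has more than one edge.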

It is evident that any two social networks that are edge-adjacent are also vertex-adjacent. Preserving differential privacy while guaranteeing good utility bounds w.r.t vertex-adjacency is a much harder task than w.r.t edge-adjacency.

Distance. Given two social networks and , recall that their distance is the minimal s.t. one can form a path of length , starting with and ending at , with the property that every two consecutive social-networks on this path are adjacent. Given the above two definitions of adjacency, we would like to give an alternative characterization of this distance.

First of all, the set dictates steps that we must take in order to transition from to . It is left to determine how many adjacent social-networks we need to transition through until we have . To that end, we construct the difference-graph whose edges are the symmetric difference of and . Clearly, to transition from to , we need to alter every edge in the difference graph. In the edge-adjacency model, a pair of adjacent social networks covers precisely a single edge, and so it is clear that the distance . In the vertex-adjacency model, a single vertex can cover all the edges that touch it, and so the distance between the graphs and is precisely the size of a minimum vertex cover of the difference graph. Denoting this vertex cover as we have that . It is evident that computing the distance between any two social-networks in the vertex-adjacency model is an NP-hard problem.
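The characterization above can be illustrated for the graph structure alone. The sketch below ignores labels (an assumption made to keep the example short) and uses hypothetical function names. The edge-adjacency distance is the number of edges in the difference graph, and the vertex-adjacency distance is found by brute-force search for a minimum vertex cover, which is exponential and only suitable for tiny examples, consistent with the NP-hardness noted above.

```python
from itertools import combinations

def difference_graph(edges1, edges2):
    """Edges present in exactly one of the two graphs."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    return e1 ^ e2

def edge_distance(edges1, edges2):
    """Edge-adjacency distance between two unlabeled graphs."""
    return len(difference_graph(edges1, edges2))

def vertex_distance(edges1, edges2):
    """Vertex-adjacency distance: size of a minimum vertex cover of the difference graph."""
    diff = difference_graph(edges1, edges2)
    vertices = sorted(set().union(*diff)) if diff else []
    for size in range(len(vertices) + 1):
        for cover in combinations(vertices, size):
            chosen = set(cover)
            if all(e & chosen for e in diff):  # every differing edge touches the cover
                return size
    return len(vertices)
```

On a star with three leaves versus the empty graph, the edge distance is 3 while the vertex distance is 1, since the center alone covers every differing edge.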

To avoid cumbersome notation, from this point on we omit the differentiation between graphs and social networks, and denote networks as graphs .

3 Restricted Sensitivity

We now introduce the notion of restricted sensitivity, using a hypothesis about the dataset to restrict the sensitivity of a query. A hypothesis is a subset of the set of all possible datasets (so in the context of social networks, is a set of labeled graphs). We say that is true if the true dataset . Because the hypothesis may not be a convex set we must consider all pairs of datasets in instead of all pairs of adjacent datasets as in the definition of global sensitivity.

Definition 8.

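For a given notion of adjacency among datasets, the restricted sensitivity of $f$ over a hypothesis $\mathcal{H}$ is

$$RS_f(\mathcal{H}) = \max_{D_1, D_2 \in \mathcal{H},\ D_1 \neq D_2} \frac{|f(D_1) - f(D_2)|}{d(D_1, D_2)}.$$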

To be clear, denotes the length of the shortest-path in between and (not restricting the path to only use ) using the given notion of adjacency (e.g., edge-adjacency or vertex-adjacency). That is, we restrict the set of databases for which we compute the sensitivity, but we do not re-define the distances.
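A brute-force computation on a tiny universe may help build intuition. The sketch below assumes the restricted sensitivity is the largest ratio of answer difference to distance over pairs of datasets in the hypothesis, with distances measured in the full space as described above. The universe (all graphs on 4 vertices), the query (number of 2-stars), and all function names are illustrative choices, not taken from the paper.

```python
from itertools import combinations

def all_graphs(n):
    """Every graph on vertices 0..n-1, each as a frozenset of edges (u, v) with u < v."""
    pairs = list(combinations(range(n), 2))
    return [frozenset(p for i, p in enumerate(pairs) if mask >> i & 1)
            for mask in range(1 << len(pairs))]

def degrees(graph, n):
    deg = [0] * n
    for u, v in graph:
        deg[u] += 1
        deg[v] += 1
    return deg

def max_degree(graph, n):
    return max(degrees(graph, n))

def two_stars(graph, n):
    """Number of paths of length 2, i.e. sum over vertices of C(deg, 2)."""
    return sum(d * (d - 1) // 2 for d in degrees(graph, n))

def edge_distance(g1, g2):
    """Edge-adjacency distance between unlabeled graphs on the same vertex set."""
    return len(g1 ^ g2)

def restricted_sensitivity(f, hypothesis, dist):
    """Largest |f(a) - f(b)| / dist(a, b) over distinct pairs in the hypothesis."""
    return max((abs(f(a) - f(b)) / dist(a, b)
                for a, b in combinations(hypothesis, 2)), default=0.0)
```

Taking the hypothesis to be all graphs recovers the global sensitivity of the 2-star count, while restricting to graphs of maximum degree 1 gives restricted sensitivity 0, because no such graph contains a 2-star.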

Observe that may be smaller than for some if has a neighbor . In fact we often have . As an immediate corollary, in such cases will be significantly lower than , a -smooth upper bound on .

4 Using Restricted Sensitivity to Reduce Noise

To achieve differential privacy while adding noise proportional to we must be willing to sacrifice accuracy guarantees for datasets . Our goal is to create a new query such that for every ( is accurate when the hypothesis is correct) and either has low global sensitivity or low -smooth sensitivity over datasets . In this section, we first give a non-efficient generic construction of such , showing that it is always possible to devise whose global sensitivity equals exactly the restricted sensitivity of over . We then show how for the case of social networks and for the hypothesis that the network has bounded degree, we can construct functions having approximately this property, efficiently.

4.1 A General Construction

We now show how given to generically (but not efficiently) construct whose global sensitivity exactly equals the restricted sensitivity of over .

Theorem 9.

Given any query and any hypothesis we can construct a query such that

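  1. for every dataset satisfying the hypothesis, the new query agrees with the original query, and

  2. the global sensitivity of the new query equals the restricted sensitivity of the original query over the hypothesis.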

Proof.

For each set . Now fix an arbitrary ordering of the set , and denote its elements as , where is the size of the set. For every we define the value of inductively. Denote . Initially, we are given the values of every . Given , we denote . We now prove one can pick the value in a way that preserves the invariant that . By applying the induction times we conclude that

Fix . Observe that

so to preserve the invariant it suffices to find any value of that satisfies that for every we have . Suppose for contradiction that no value exists. Then there must be two intervals

which don’t intersect. This would imply that

which contradicts the fact that is the restricted sensitivity of . ∎
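The inductive argument can be followed on a small example. The sketch below is an illustrative rendering under stated assumptions, not the paper's own procedure: the universe is all graphs on 4 vertices under edge adjacency, the hypothesis is maximum degree at most 1, the query counts edges plus 2-stars, and each new value is set to the lower end of the intersection of the allowed intervals (any point of the intersection would do). All names are hypothetical.

```python
from itertools import combinations

N = 4  # illustrative universe: all graphs on 4 labeled vertices

def all_graphs(n):
    pairs = list(combinations(range(n), 2))
    return [frozenset(p for i, p in enumerate(pairs) if mask >> i & 1)
            for mask in range(1 << len(pairs))]

def degrees(graph):
    deg = [0] * N
    for u, v in graph:
        deg[u] += 1
        deg[v] += 1
    return deg

def query(graph):
    """Illustrative query: number of edges plus number of 2-stars."""
    return len(graph) + sum(d * (d - 1) // 2 for d in degrees(graph))

def edge_distance(g1, g2):
    return len(g1 ^ g2)

def extend_query(f, hypothesis, universe, dist):
    """Extend f from a non-empty hypothesis to the universe, one dataset at a time."""
    rs = max((abs(f(a) - f(b)) / dist(a, b)
              for a, b in combinations(hypothesis, 2)), default=0.0)
    values = {x: float(f(x)) for x in hypothesis}
    for x in universe:
        if x in values:
            continue
        # Each already-defined y allows the interval [v - rs*d(x,y), v + rs*d(x,y)].
        lo = max(v - rs * dist(x, y) for y, v in values.items())
        hi = min(v + rs * dist(x, y) for y, v in values.items())
        assert lo <= hi + 1e-9  # the intervals intersect, as the proof argues
        values[x] = lo
    return values, rs
```

In this example the original query changes by as much as 5 between neighboring graphs, whereas the extended query changes by at most the restricted sensitivity while agreeing with the original on every graph of maximum degree at most 1.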

4.2 Efficient Procedures for via Projection Schemes

Unfortunately, the construction of Theorem 9 is highly inefficient. Furthermore, this construction deals with one query at a time. We would like to a-priori have a way to efficiently devise for any . In this section, the way we devise is by constructing a projection – a function with the property that for every . Such allows us to canonically convert any into using the naïve definition . Below we discuss various properties of projections that allow us to derive “good” -s. Following each property, we exhibit the existence of such projections for the specific case of social networks and , the class of graphs of degree at most .

Definition 10.

The class is defined as the set .

In many labeled graphs, it is reasonable to believe that holds for because the degree distributions follow a power law. For example, the number of telephone numbers receiving calls in a day is proportional to , and the number of web pages with incoming links is proportional to [17, 7, 10]. For these networks it would suffice to set . The number of papers that receive citations is proportional to so we could set [10]. While the degrees on Facebook don’t seem to follow a power law, the upper bound seems reasonable [21]. By contrast, Facebook had approximately users in June, 2012 [1].

Smooth Projection

The first property we discuss is perhaps the simplest and most coveted property a projection can have: smoothness. Smoothness dictates that there exists a global bound on the distance between any two mappings of two neighboring databases.

Definition 11.

A projection is called -smooth if for any two neighboring databases we have that .

Lemma 12.

Let be a -smooth projection (i.e., for every we have ). Then for every query , the function satisfies that

Proof.

As we now show, for and for distances defined via the edge-adjacency model, we can devise an efficient smooth projection.

Claim 13.

In the edge-adjacency model, there exists an efficiently computable -smooth projection to .

The proof of the claim is deferred to the appendix. The high-level idea is to fix a canonical ordering over all edges and then define to delete an edge if and only if there is a vertex such that (1) is incident to and (2) is not one of the first edges incident to . This is then used to achieve the smoothness guarantee. An immediate corollary of Lemma 12 and Claim 13 is the following theorem.
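The high-level idea can be sketched as follows. This is a minimal illustration, assuming edges are pairs of comparable vertex identifiers, the canonical ordering is lexicographic order on sorted pairs, and `k` is the degree bound; the function name is hypothetical. Whether an edge is deleted is decided from the original graph only, not iteratively.

```python
def project_to_bounded_degree(edges, k):
    """Keep an edge only if it is among the first k edges at both of its endpoints."""
    ordered = sorted({tuple(sorted(e)) for e in edges})  # canonical ordering
    seen = {}        # number of incident edges encountered so far, per vertex
    flagged = set()  # edges that some endpoint marks for deletion
    for e in ordered:
        for v in e:
            seen[v] = seen.get(v, 0) + 1
            if seen[v] > k:  # e is not among the first k edges incident to v
                flagged.add(e)
    return {e for e in ordered if e not in flagged}
```

Every vertex keeps at most its first k incident edges, so the output has maximum degree at most k, and a graph that already satisfies the degree bound is returned unchanged. Adding a single edge to the input can change the deletion status of that edge and of at most one other edge at each of its endpoints, which is the source of the smoothness guarantee.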

Theorem 14.

(Privacy wrt Edge Changes) Given any query for social networks , the mechanism that uses the projection from Claim 13, and answers the query using preserves privacy for any graph .

Now, it is evident that this mechanism has the guarantee that for every it holds that . Furthermore, if the querier “lucked out” to ask a query for which and are close (say, identical), then the same guarantee holds for such as well. Note however that we cannot reveal to the querier whether and are indeed close, as such information might leak privacy.

Projections and Smooth Distance Estimators

Unfortunately, smooth projections do not always exist, as the following toy-example demonstrates. Fix graphs, where for , and let . Because and , then there must exist some value such that , thus every cannot be -smooth for .

Note that smooth projections have the property that they also provide a -approximation of the distance of to . Meaning, for every we have that . In the vertex adjacency model, however, it is evident that we cannot have a -smooth projection since, as we show in the appendix, it is NP-hard to approximate (see Claim 23), but there does exist an efficient approximation scheme (see Claim 24) of the distance. Yet, we show that it is possible to devise a somewhat relaxed projection s.t. the distance between a database and its mapped image is a smooth function. To that end, we slightly relax the definition of projection, allowing it to map instances to some predefined .

Definition 15.

Fix . Let be a projection of , so is a mapping that maps every element of to itself ( we have that ). A -smooth distance estimator is a function that satisfies all of the following. (1) For every it is defined as . (2) It is lower bounded by the distance of to its projection: . (3) Its value over neighboring databases changes by at most : .

It is simple to verify that for every we have that (using induction on ). We omit the subscript when is specified.

The following lemma suggests that a smooth distance estimator allows us to devise a good smooth-upper bound on the local-sensitivity, thus allowing us to apply the smooth-sensitivity scheme of [18].

Lemma 16.

Fix and let be a projection of . Let be an efficiently computable -smooth distance estimator. Then for every query , we can define the composition and define the function

Then is an efficiently computable -smooth upper bound on the local sensitivity of . Furthermore, define as the function . Then for every it holds that

The proof of Lemma 16 is deferred to the appendix. Like in the edge-adjacency model, we now exhibit a projection and a smooth distance estimator for the vertex-adjacency model.

Claim 17.

In the vertex-adjacency model, there exists a projection and a -smooth distance estimator , both of which are efficiently computable.

To construct and we start with the linear program that determines a “fractional distance” from a graph to . This LP has variables: which intuitively represents whether ought to be removed from the graph or not, and which represents whether the edge between and remains in the projected graph or not. We also use the notation , where if the edge is in ; otherwise .

To convert our fractional solution to a graph we define to be the graph we get by removing every edge for which either endpoint has weight or . We define our distance estimator as . In the appendix we show that and satisfy the conditions of Claim 17.

As before, combining Lemma 16 with Claim 17 gives the following theorem as an immediate corollary.

Theorem 18.

(Privacy wrt Vertex Adjacency) Given any query for social networks , the mechanism that uses the projection from Claim 17 and the -smooth upper bound of Lemma 16, and answers the query using preserves privacy for any graph .

Again, it is evident from the definition that the algorithm has the guarantee that for every it holds that .

5 Restricted Sensitivity and

Now that we have constructed the machinery of restricted sensitivity, we compare the restricted sensitivity over with smooth sensitivity for specific types of queries, in order to demonstrate the benefits of our approach. In a nutshell, restricted sensitivity offers a significant advantage over smooth sensitivity whenever . I.e., we show that there are queries s.t. for some it holds that .

We now define two types of queries. First, let us introduce some notation. A profile is a function that maps a vertex in a social network to . Given a set of vertices , we denote by the social network derived by restricting and to these vertices. We use to denote the social network derived by restricting and to and its neighbors. A local profile satisfies the constraint .

Definition 19.

A (local) profile query

sums the (local) profile across all nodes.

Local profile queries are a natural extension of predicates to social networks, which can be used to study many interesting properties of a social network like clustering coefficients [22, 17, 4], local bridges [10, 12] and -betweenness [11]. Further discussion can be found in Section C in the appendix. Claim 20 bounds the restricted sensitivity of a local profile query over (e.g., in the vertex adjacency model a node can at worst affect the local profiles of itself, its old neighbors and its new neighbors). A formal proof of Claim 20 is deferred to the appendix.

Claim 20.

For any local profile query , we have that in the vertex adjacency model, and in the edge adjacency model.

By contrast the smooth sensitivity of a local profile query may be as large as even for graphs in . Consider the local profile query “how many people are friends with a spy?” The -star graph in which a spy is friends with everyone is adjacent to the empty graph . Therefore, any smooth upper bound must have . It is also worth observing that the assumption does not necessarily shrink the range of possible answers to a local profile query (e.g., there are graphs in which everyone is friends with a spy).
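The spy example can be reproduced directly. The sketch below uses a simplified local profile that sees only a vertex, its neighbor set, and the labels, which suffices for this query but omits edges among neighbors that a general local profile may inspect. The names `local_profile_query` and `knows_a_spy` are illustrative.

```python
def local_profile_query(edges, labels, profile):
    """Sum a per-vertex profile over all vertices of the network."""
    neighbors = {v: set() for v in labels}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return sum(profile(v, neighbors[v], labels) for v in labels)

def knows_a_spy(v, nbrs, labels):
    """Profile for 'is friends with a spy': 1 if some neighbor is labeled spy."""
    return 1 if any(labels[u] == "spy" for u in nbrs) else 0
```

With one spy at the center of a star, every other vertex contributes 1 to the answer, while on the empty graph with the same labels the answer is 0, so a single vertex change moves the answer by the number of non-center vertices.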

Subgraph counting queries allow us to ask questions such as “how many triplets of people are all friends when two of them are doctors and the other is a pop-singer?” or “how many paths of length 2 are there connecting a spy and a pop-singer over ?” The average clustering coefficient of a graph can be computed from the number of triangles and -stars in a graph.
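As an illustration of the first question, the sketch below counts unordered triples of mutually adjacent vertices whose labels match a requested multiset. It assumes each vertex carries a single occupation label and that triples are counted without regard to order; both are simplifying assumptions made here, and the function name is hypothetical.

```python
from itertools import combinations

def count_labeled_triangles(edges, labels, wanted):
    """Count triangles whose vertex labels, as a multiset, equal `wanted`."""
    adjacent = {frozenset(e) for e in edges}
    target = sorted(wanted)
    count = 0
    for trio in combinations(sorted(labels), 3):
        is_triangle = all(frozenset(p) in adjacent for p in combinations(trio, 2))
        if is_triangle and sorted(labels[v] for v in trio) == target:
            count += 1
    return count
```

The loop examines every triple of vertices, so it runs in cubic time and is meant only to make the query concrete on small inputs.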

Definition 21.

A subgraph counting query is given by a connected graph over vertices and predicates . Given a social network , the answer to is the size of the set

The smooth sensitivity of a subgraph counting query may be as high as in the vertex adjacency model. Let be a subgraph counting query where is a -star and each predicate is identically true. Let be a -star (). Then in the vertex adjacency model there is a neighboring graph with no edges (). We have that . Observe that . In the appendix we show that the smooth sensitivity of is always greater than when each predicate is identically true (see Claim 25). By contrast, Claim 22 bounds the restricted sensitivity of subgraph counting queries. The proof is deferred to the appendix.

Claim 22.

Let be a subgraph counting query and let . Then in the edge adjacency model and in the vertex adjacency model.

While the assumption may shrink the range of a subgraph counting query , the restricted sensitivity of will typically be much smaller than this reduced range. For example, if counts the number of triangles in then for any , while .

6 Future Questions/Directions

Efficient Mappings: While we can show that there doesn’t exist an efficiently computable -smooth projection , we don’t know whether the construction of Claim 17 can be improved. Meaning, there could be a mapping for some , where the solution itself, the set of vertices that dominate the removed edges, is smooth. In other words, is there an efficiently computable mapping which satisfies for some constant ?

Multiple Queries: We primarily focus on improving the accuracy of a single query . Could the notion of restricted sensitivity be used in conjunction with other mechanisms (e.g., BLR [6], Private Multiplicative Weights mechanism [13], etc.) to accurately answer an entire class of queries?

Alternate Hypotheses: We focused on the specific hypothesis . What other natural hypotheses could be used to restrict sensitivity in private data analysis? Given such a hypothesis can we efficiently construct a query with low global sensitivity or with low smooth sensitivity over datasets ?

Appendix A Missing Proofs

Reminder of Claim 13. In the edge-adjacency model, there exists an efficient way to compute a -smooth projection to .

Proof of Claim 13. We construct our smooth-projection by first fixing a canonical ordering over all possible edges. Let denote the edges incident to in canonical order. For each edge we delete if and only if (i) for or (ii) for . (Intuitively, for each with we keep the first edges incident to and flag the other edges for deletion.) If then no edges are deleted, so . Suppose that are neighbors differing on one edge (wlog, say that is in ). Observe that for every , the same set of edges incident to will be deleted from both and . In fact, if does not contain then . Otherwise, if is not deleted then there may be at most one edge (incident to ) and at most one edge (incident to ) that were deleted from but not from . Hence, .

Reminder of Lemma 16. Fix and let be a projection of . Let be an efficiently computable -smooth distance estimator. Then for every query , we can define the composition and define the function

Then is an efficiently computable -smooth upper bound on the local sensitivity of . Furthermore, define as the function . Then for every it holds that

Proof of Lemma 16. First, we show that indeed is an upper bound on the local sensitivity of . Fix any and indeed

Next we prove that is -smooth. Let and be two neighboring databases, and wlog assume . Then

Let be the value of on which the maximum of numerator is obtained. Then

where the last inequality uses the smoothness property, i.e. that .

Finally, we wish to prove the global upper bound on , i.e., that for every

Fix and define , so that . Taking the derivative of we have

which means that is maximized at . In the case that (i.e. for