Models and Algorithms for Graph Watermarking
We introduce models and algorithmic foundations for graph watermarking. Our frameworks include security definitions and proofs, as well as characterizations when graph watermarking is algorithmically feasible, in spite of the fact that the general problem is NP-complete by simple reductions from the subgraph isomorphism or graph edit distance problems. In the digital watermarking of many types of files, an implicit step in the recovery of a watermark is the mapping of individual pieces of data, such as image pixels or movie frames, from one object to another. In graphs, this step corresponds to approximately matching vertices of one graph to another based on graph invariants such as vertex degree. Our approach is based on characterizing the feasibility of graph watermarking in terms of keygen, marking, and identification functions defined over graph families with known distributions. We demonstrate the strength of this approach with exemplary watermarking schemes for two random graph models, the classic Erdős-Rényi model and a random power-law graph model, both of which are used to model real-world networks.
In the classic media watermarking problem, we are given a digital representation, , for some media object, , such as a piece of music, a video, or an image, such that there is a rich space, , of possible representations for besides that are all more-or-less equivalent. Informally, a digital watermarking scheme for is a function that maps and a reasonably short random message, , to an alternative representation, , for in . The verification of such a marking scheme takes and a presumably-marked representation, (which was possibly altered by an adversary), along with the set of messages previously used for marking, and it either identifies the message from this set that was assigned to or it indicates a failure. Ideally, it should be difficult for an adversary to transform a representation, (which he was given), into another representation in , that causes the identification function to fail. Some example applications of such digital watermarking schemes include steganographic communication and marking digital works for copyright protection (e.g., see [16, 25, 50]).
With respect to digital representations of media objects that are intended to be rendered for human performances, such as music, videos, and images, there is a well-established literature on digital watermarking schemes and even well-developed models for such schemes (e.g., see Hopper et al. ). Typically, such watermarking schemes take advantage of the fact that rendered works have many possible representations with almost imperceptibly different renderings from the perspective of a human viewer or listener.
In this paper, we are inspired by recent systems work on graph watermarking by Zhao et al. [56, 55], who propose a digital watermarking scheme for graphs, such as social networks, protein-interaction graphs, etc., which are to be used for commercial, entertainment, or scientific purposes. This work by Zhao et al. presents a system and experimental results for their particular method for performing graph watermarking, but it is lacking in formal security and algorithmic foundations. For example, Zhao et al. do not provide formal proofs for circumstances under which graph watermarking is undetectable or when it is computationally feasible. Thus, as complementary work to the systems results of Zhao et al., we are interested in the present paper in providing models and algorithms for graph watermarking, in the spirit of the watermarking model provided by Hopper et al.  for media files. In particular, we are interested in providing a framework for identifying when graph watermarking is secure and computationally feasible.
1.1 Additional Related Work
Under the term “graph watermarking,” there is some additional work, although it is not actually for the problem of graph watermarking as we are defining it. For instance, there is a line of research involving software watermarking using graph-theoretic concepts and encodings. In this case, the object being marked is a piece of software and the goal of a “graph watermarking” scheme is to create a graph, , from a message, , and then embed into the control flow of a piece of software, , to mark . Examples of such work include pioneering work by Collberg and Thomborson , as well as subsequent work by Venkatesan, Vazirani, and Sinha  and Collberg et al. . (See also Chen et al.  and Bento et al. , as well as a survey by Hamilton and Danicic .) This work on software watermarking differs from the graph watermarking problem we study in the present paper, however, because in our problem an input graph is provided and we want to alter it to add a mark. In the graph-based software watermarking problem, a graph is instead created from a message to have a specific, known structure, such as being a permutation graph, and then that graph is embedded into the control flow of the piece of software.
A line of research that is more related to the graph watermarking problem we study is anonymization and de-anonymization for social networks (e.g., see [3, 57, 23, 26, 37, 43, 53]). One of the closest examples of such prior work is by Backstrom, Dwork, and Kleinberg , who show how to introduce a small set of “rogue” vertices into a social network and connect them to each other and to other vertices so that if that same network is approximately replicated in another setting it is easy to match the two copies. Such work differs from graph watermarking, however, because the set of rogue vertices is designed to “stand out” from the rest of the graph rather than “blend in,” and it may in some cases be relatively easy for an adversary to identify and remove such rogue vertices. Also, we would ideally prefer graph watermarking schemes that make small changes to the adjacencies of existing vertices rather than mark a graph by introducing new vertices, since in some applications it may not be possible to introduce new vertices into a graph that we wish to watermark. In addition to this work, also of note is work by Narayanan and Shmatikov , who study the problem of approximately matching two social networks without marking, as well as the work of Khanna and Zane  for watermarking road networks by perturbing vertex positions (which is a marking method outside the scope of our approach).
Our approach to graph watermarking is also necessarily related to the problem of graph isomorphism and its approximation (e.g., see [1, 2, 17, 27, 30, 46]). In the graph isomorphism problem, we are given two -vertex graphs, and , and asked if there is a mapping, , of vertices in to vertices in such that is an edge in if and only if is an edge in . While the graph isomorphism problem is “famous” for having an uncertain (but unlikely) status with respect to being NP-complete, extensions to subgraph isomorphism and graph edit distance are known to be NP-complete (e.g., see ).
1.2 Our Results
In this paper, we introduce a general graph watermarking framework that is based on the use of key generation, marking, and identification functions, as well as a hypothetical watermarking security experiment (which would be performed by an adversary). We define these functions in terms of graphs taken over random families of graphs, which allows us to quantify situations in which graph watermarking is provably feasible.
We also provide some graph watermarking schemes as examples of our framework, defined in terms of the classic Erdős-Rényi random-graph model and a random power-law graph model. Our schemes extend and build upon previous results on graph isomorphism for these graph families, which may be of independent interest. In particular, we design simple marking schemes for these random graph families based on simple edge-flipping strategies involving high- and medium-degree vertices. Analyzing the correctness of our schemes is quite nontrivial, however, and our analysis and proofs involve intricate probabilistic arguments. We provide an analysis of our scheme against adversaries that can themselves flip edges in order to defeat our mark identification algorithms. In addition, we provide experimental validation of our algorithms, showing that our edge-flipping scheme can succeed for a graph without specific knowledge of the parameters of its deriving graph family. We also conducted experiments to fit real-world networks to the random power-law graph model, which gave results that showed that the model was generally a good fit for the networks tested but the learned values did not fall into the range needed for our scheme.
2 Our Watermarking Framework
We begin by presenting a general framework for graph watermarking, which differs from the general model of Hopper et al. , but is similar in spirit.
Suppose we are given an undirected graph, , that we wish to mark. To define the security of a watermarking scheme for , must come from a family of graphs with some degree of entropy . We formalize this by assuming a probability distribution over the family of graphs from which is taken.
A graph watermarking scheme is a tuple over a set, , of graphs where
is a private key generation function, such that is a list of (pseudo-)random graph elements, such as vertices and/or vertex pairs, defined over a graph of vertices. These candidate locations for marking are defined independent of a specific graph; that is, vertices in are identified simply by the numbering from to . For example, could be a small random graph, , and some random edges to connect to a larger input graph , or could be a set of vertex pairs in an input graph that form candidate locations for marking.
takes a private key generated by , and a specific graph from , and returns a pair, , such that is a unique identifier for and is the graph obtained by adding the mark determined by to in the location determined by the private key . is called every time a different marked copy needs to be produced, with the -th copy being denoted by . Therefore, the unique identifiers should be thought of as being generated randomly. To associate a marked graph with the user who receives it, the watermarking scheme can be augmented with a table storing user names and unique identifiers. Alternatively, the identifiers can be generated pseudo-randomly as a hash of a private key provided by the user.
takes a private key from , the original graph, , identifiers of previously-marked copies of , and a test graph, , and it returns the identifier, , of the watermarked graph that it is identifying as a match for . It may also return , as an indication of failure, if it does not identify any of the graphs as a match for .
In addition, in order for a watermarking scheme to be effective, we require that with high probability (or “whp,” that is, with probability at least , for some ) over the graphs from and output pairs, of , for any , we have .
Algorithm 1 shows a hypothetical security experiment for a watermarking scheme with respect to an adversary, , who is trying to defeat the scheme. Intuitively, in the hypothetical experiment, we generate a key , choose a graph , from family according to distribution (as discussed above), and then generate marked graphs according to our scheme (for some set of messages). Next, we randomly choose one of the marked graphs, , and communicate it to an adversary. The adversary then outputs a graph that is similar to where his goal is to cause our identification algorithm to fail on .
In order to characterize differences between graphs, we assume a similarity measure , defining the distance between graphs in family . We also include a similarity threshold , that defines the advantage of an adversary performing the experiment in Algorithm 1. Specifically, the advantage of an adversary, who is trying to defeat our watermarking scheme is
The watermarking scheme is -secure against adversary if the similarity threshold is and ’s advantage is polynomially negligible (i.e., is for some ).
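The experiment described above can be sketched in code. The following is a minimal illustration, not the paper's Algorithm 1: every function and parameter name is a hypothetical stand-in, and the rule that an attack only counts when the attacked graph stays strictly within the similarity threshold is an assumption made for the sketch.

```python
import random

def run_security_experiment(keygen, sample_graph, mark, identify, adversary,
                            distance, threshold, num_copies, rng):
    """Return True if the adversary wins one run of the experiment.

    All callables are hypothetical stand-ins for the scheme's components:
      keygen(rng) -> key
      sample_graph(rng) -> graph drawn from the assumed distribution
      mark(key, graph, rng) -> (identifier, marked_graph)
      identify(key, graph, identifiers, test_graph) -> identifier or None
      adversary(marked_graph) -> attacked graph
      distance(g, h) -> number measuring how far apart two graphs are
    """
    key = keygen(rng)
    graph = sample_graph(rng)
    copies = [mark(key, graph, rng) for _ in range(num_copies)]
    identifiers = [ident for ident, _ in copies]
    # Hand one randomly chosen marked copy to the adversary.
    target_id, target_graph = rng.choice(copies)
    attacked = adversary(target_graph)
    # Assumption: attacks that move too far from the marked copy do not count.
    if distance(target_graph, attacked) >= threshold:
        return False
    # The adversary wins if identification fails or names the wrong copy.
    return identify(key, graph, identifiers, attacked) != target_id

def estimate_advantage(trials, seed=0, **experiment):
    """Empirical win rate of the adversary over repeated independent runs."""
    rng = random.Random(seed)
    wins = sum(run_security_experiment(rng=rng, **experiment) for _ in range(trials))
    return wins / trials
```

An empirical win rate of this kind only estimates the advantage for one concrete adversary; it does not replace the formal definition.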
Examples of adversaries could include the following:
Arbitrary edge-flipping adversary: a malicious adversary who can arbitrarily flip edges in the graph. That is, the adversary adds an edge if it is not already there, and removes it otherwise.
Random edge-flipping adversary: an adversary who independently flips each edge with a given probability.
Arbitrary adversary: a malicious adversary who can arbitrarily add and/or remove vertices and flip edges in the graph.
Random adversary: an adversary who independently adds and/or removes vertices with a given probability and independently flips each edge with a given probability.
One could also imagine other types of adversaries, as well, such as a random adversary who is limited in terms of the numbers or types of edges or vertices that he can change.
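As an illustration of the simplest of these adversaries, the following sketch implements a random edge-flipping adversary on a graph stored as a set of two-element frozensets. The function names and the representation are assumptions made for the example, and "each edge" is interpreted here as each of the possible vertex pairs.

```python
import random
from itertools import combinations

def flip_edge(edges, u, v):
    """Flip the pair {u, v}: add the edge if absent, remove it if present."""
    e = frozenset((u, v))
    if e in edges:
        edges.remove(e)
    else:
        edges.add(e)

def random_edge_flipping_adversary(n, edges, q, rng=None):
    """Return a copy of the edge set in which each vertex pair of an
    n-vertex graph is flipped independently with probability q."""
    rng = rng or random.Random()
    attacked = set(edges)
    for u, v in combinations(range(n), 2):
        if rng.random() < q:
            flip_edge(attacked, u, v)
    return attacked
```

An arbitrary edge-flipping adversary would call `flip_edge` on pairs of its own choosing instead of sampling them.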
2.1 Random graph models
As defined above, a graph watermarking scheme requires that graphs to be marked come from some distribution. In this paper, we consider two families of random graphs—the classic Erdős-Rényi model and a random power-law graph model—which should capture large classes of applications where graph watermarking would be of interest.
Definition 2 (The Erdős-Rényi model).
A random graph $G(n,p)$ is a graph with $n$ vertices, where each of the $\binom{n}{2}$ possible edges appears in the graph independently with probability $p$.
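A minimal sampler for this model might look as follows. The edge-set representation and the function name are choices made for this sketch, with `n` the number of vertices and `p` the edge probability.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng=None):
    """Sample a graph on vertices 0..n-1 in which each of the n*(n-1)/2
    vertex pairs becomes an edge independently with probability p."""
    rng = rng or random.Random()
    return {frozenset(pair)
            for pair in combinations(range(n), 2)
            if rng.random() < p}
```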
Definition 3 (The random power-law graph model, §5.3 of ).
Given a sequence , such that , the general random graph is defined by labeling the vertices through and choosing each edge independently from the others with probability , where .
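The general random graph can be sampled in the same way as the Erdős-Rényi model, except that the edge probability depends on the endpoints. The sketch below assumes the standard form of this definition: each vertex `i` carries a weight `w[i]`, the pair `(i, j)` is an edge with probability `w[i] * w[j] * rho`, and `rho` is the reciprocal of the sum of all weights. It also assumes that the condition on the sequence is that the largest squared weight is smaller than the sum of the weights, which keeps every probability below 1.

```python
import random
from itertools import combinations

def general_random_graph(weights, rng=None):
    """Sample a graph in which pair (i, j) is an edge independently with
    probability weights[i] * weights[j] * rho, where rho = 1 / sum(weights)."""
    rng = rng or random.Random()
    total = sum(weights)
    # Assumed validity condition: max weight squared is below the total weight.
    if max(weights) ** 2 >= total:
        raise ValueError("need max(w)^2 < sum(w) so that probabilities are below 1")
    rho = 1.0 / total
    return {frozenset((i, j))
            for i, j in combinations(range(len(weights)), 2)
            if rng.random() < weights[i] * weights[j] * rho}
```

With this parameterization, the expected degree of vertex `i` is close to `w[i]`, which is why the weights are commonly read as target degrees.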
We define a random power-law graph parameterized by the maximum degree and average degree . Let for values of in the range between and , where
This definition implies that each edge appears with probability
As we show in the following proposition, this model does indeed have a power-law degree distribution.
In the random power-law graph , the expected number of vertices with degree is between and where .
The function relating the index of a vertex to its expected degree is convex and decreasing. By the mean value theorem, the number of indices such that satisfies
Now the derivative of is . Noting that is the expected number of vertices of degree , the result is proven. ∎
2.2 Graph watermarking algorithms
We discuss some instantiations of the graph watermarking framework defined above. Unlike previous watermarking or de-anonymization schemes that add vertices [3, 56], we describe an effective and efficient scheme based solely on edge flipping. Such an approach would be especially useful for applications where it could be infeasible to add vertices as part of a watermark.
Our scheme does not require adding labels to the vertices or additional objects stored in the graph for identification purposes. Instead, we simply rely on the structural properties of graphs for the purposes of marking. In particular, we focus on the use of vertex degrees, that is, the number of edges incident on each vertex. We identify high and medium degree vertices as candidates for finding edges that can be flipped in the course of marking. The specific degree thresholds for what we mean by “high-degree” and “medium-degree” depend on the graph family, however, so we postpone defining these notions precisely until our analysis sections.
Algorithms providing an example implementation of our graph watermarking scheme are shown in Algorithm 2. The algorithm randomly selects a set of candidate vertex pairs for flipping, from among the high- and medium-degree vertices, with no vertex being incident to more than a parameter of candidate pairs. We introduce a procedure, , which labels high-degree vertices by their degree ranks and each medium-degree vertex, , by a bit vector identifying its high-degree adjacencies. This bit vector has a bit for each high-degree vertex, which is for neighbors of and for non-neighbors. The algorithm , takes a random set of candidate edges and a graph, , and it flips the corresponding edges in according to a resampling of the edges using the distribution . The algorithm, approximate-isomorphism, returns a mapping of the high- and medium-degree vertices in to matching high- and medium-degree vertices in , if possible. The algorithm, , uses the approximate isomorphism algorithm to match up high- and medium-degree vertices in and , and then it extracts the bit-vector from this matching using .
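The labeling step can be illustrated with a short sketch. It is not the paper's Algorithm 2: the function name and the parameter `k` (the number of high-degree vertices) are hypothetical, every vertex that is not high-degree is treated as medium-degree, and the bit value 1 is assumed to mark a neighbor and 0 a non-neighbor.

```python
def label_vertices(adj, k):
    """Label the k highest-degree vertices by degree rank and every other
    vertex by a bit vector over those k vertices.

    adj maps each vertex to the set of its neighbours.
    Returns (high_labels, medium_labels).
    """
    by_degree = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    high = by_degree[:k]
    # Rank 0 is the highest-degree vertex.
    high_labels = {v: rank for rank, v in enumerate(high)}
    # Assumed convention: bit 1 for a high-degree neighbour, 0 otherwise.
    medium_labels = {
        v: tuple(1 if h in adj[v] else 0 for h in high)
        for v in by_degree[k:]
    }
    return high_labels, medium_labels
```

Two vertices that receive the same bit vector cannot be told apart by this labeling, which is why the analysis requires the vectors to be well separated.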
As mentioned above, we also need a notion of distance for graphs. We use two different such notions. The first is the graph edit distance, which is the minimum number of edges that must be flipped to go from one graph to another. The second is vertex distance, which intuitively is an edge-flipping metric localized to vertices.
Definition 5 (Graph distances).
Let be the set of graphs on vertices. If , define as the set of bijections between the vertex sets and . Define the graph edit distance as
where is the symmetric difference of the two edge sets under correspondence . Define the vertex distance as
where is the set of edges incident to .
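For very small graphs, both distances can be computed by brute force over all vertex bijections. The sketch below assumes that the vertex distance is the minimum, over bijections, of the largest number of differing edges incident to any single vertex; the function names are illustrative, and the running time is factorial in the number of vertices.

```python
from itertools import permutations

def _sym_diff(edges_g, edges_h, perm):
    """Edges on which H and the relabelled copy of G disagree."""
    mapped = {frozenset(perm[u] for u in e) for e in edges_g}
    return mapped ^ edges_h

def graph_edit_distance(n, edges_g, edges_h):
    """Minimum number of edge flips over all bijections of n vertices."""
    return min(len(_sym_diff(edges_g, edges_h, p))
               for p in permutations(range(n)))

def vertex_distance(n, edges_g, edges_h):
    """Minimum over bijections of the largest number of differing edges
    incident to a single vertex (assumed form of the definition)."""
    best = None
    for p in permutations(range(n)):
        diff = _sym_diff(edges_g, edges_h, p)
        worst = max((sum(1 for e in diff if v in e) for v in range(n)),
                    default=0)
        best = worst if best is None else min(best, worst)
    return best
```

This is only practical for a handful of vertices, consistent with the hardness of the general problem noted earlier.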
3 Identifying High- and Medium-Degree Vertices
We begin analyzing our proposed graph watermarking scheme by showing how high- and medium-degree vertices can be identified under our two random graph distributions. We begin with some technical results related to graph isomorphism that form the basis of our watermarking approach, with the goal of determining the conditions under which a vertex of a random graph can be identified with high probability, either by its degree (if the degree is high) or by its set of high-degree neighbors (if it has medium degree). We ignore low-degree vertices: their information content and distinguishability are low, and they are not used by our example scheme. Because our results on vertex identifiability are used in our graph watermarking scheme, we also determine how robust these identifications are, based on how well-separated the vertices are by their degrees.
We first find a threshold number such that the vertices with highest degree are likely to have distinct and well-separated degree values. We call these vertices the high-degree vertices. Next, we look among the remaining vertices for those that are well-separated in terms of their high-degree neighbors. Specifically, the (high-degree) neighborhood distance between two vertices is the number of high-degree vertices which are connected to exactly one of the two vertices. Note that we will omit the term “high-degree” in “high-degree neighborhood distance” from now on, as it will always be implied.
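The neighborhood distance is simple to compute once the high-degree vertices are known. In this sketch, `adj` maps each vertex to its neighbor set and `high` is the collection of high-degree vertices; both names are chosen for the example.

```python
def neighborhood_distance(adj, high, u, v):
    """Number of high-degree vertices adjacent to exactly one of u and v."""
    return sum(1 for h in high if (h in adj[u]) != (h in adj[v]))
```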
In the Erdős-Rényi model, we show that all vertices that are not high-degree nevertheless have well-separated high-degree neighborhoods whp. In the random power-law graph model, however, there will be many lower-degree vertices whose high-degree neighborhoods cannot be separated. Those that have well-separated high-degree neighborhoods with high probability form the medium-degree vertices, and the rest are the low-degree vertices.
For completeness, we include the following well-known Chernoff concentration bound, which we will refer to time and again.
Lemma 6 (Chernoff inequality ).
Let be independent random variables with
We consider the sum , with expectation . Then
3.1 Vertex separation in the Erdős-Rényi model
Let us next consider vertex separation results for the classic Erdős-Rényi random-graph model. Recall that in this model, each edge is chosen independently with probability .
Index vertices in non-increasing order by degree. Let represent the -th highest degree in the graph. Given , we say that a vertex is high-degree with respect to if it has degree at least . Otherwise, we say that the vertex is medium-degree. We just say high-degree when the value of is understood from context.
Note that in this random-graph model, there are no low-degree vertices.
A graph is -separated if all high-degree vertices differ in their degree by at least and all medium-degree vertices are neighborhood distance apart.
Note: this definition depends on how high-degree or medium-degree vertices are defined and will therefore be different for the random power-law graph model.
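A direct check of this separation property for the Erdős-Rényi setting, where every vertex is either high- or medium-degree, could be sketched as follows. The parameter names `k`, `deg_gap`, and `nbr_gap` are hypothetical stand-ins for the number of high-degree vertices and the two separation amounts.

```python
from itertools import combinations

def is_separated(adj, k, deg_gap, nbr_gap):
    """Check that the k highest-degree vertices differ pairwise in degree by
    at least deg_gap, and that all remaining vertices are pairwise at
    neighborhood distance at least nbr_gap."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    high, medium = order[:k], order[k:]
    degs = [len(adj[v]) for v in high]
    # Degrees are sorted, so checking consecutive gaps covers all pairs.
    high_ok = all(a - b >= deg_gap for a, b in zip(degs, degs[1:]))

    def nbr_dist(u, v):
        return sum(1 for h in high if (h in adj[u]) != (h in adj[v]))

    medium_ok = all(nbr_dist(u, v) >= nbr_gap
                    for u, v in combinations(medium, 2))
    return high_ok and medium_ok
```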
Lemma 9 (Extension of Theorem 3.15 in ).
Suppose , , and . Then with probability
is such that
We quantify and extend the probability analysis of a proof from . Let
The event of the result fails if or if there is such that .
The statement of Theorem 3.12 of  still holds when the words “a.e. satisfies” are replaced by “ satisfies with probability greater than ”. This can be seen directly from the part of the proof where Chebyshev’s inequality is applied.
By this result, the probability that is . The probability that for a given is . ∎
Lemma 10 (Vertex separation in the Erdős-Rényi model).
Let , , , . Suppose is such that . Then is -separated with probability .
We prove the theorem with probability at least . Let and . By Lemma 9, the probability that for some is at most .
Let be the expected neighborhood distance between two vertices . We have
so that, if ,
Since the high-degree vertices are separated by more than two degrees, the fact that they are high-degree vertices is independent of whether they are neighbors of and . Consequently, we can apply a Chernoff bound (Lemma 6). Then, by the union bound, the probability that for some medium-degree is less than . ∎
Thus, high-degree vertices are well-separated with high probability in the Erdős-Rényi model, and the medium-degree vertices are distinguished with high probability by their high-degree neighborhoods.
3.2 Vertex separation in the random power-law graph model
We next study vertex separation for a random power-law graph model, which can match the degree distributions of many graphs that naturally occur in social networking and science. For more information about power-law graphs and their applications, see e.g. [6, 40, 44].
In the random power-law graph model, vertex indices are used to define edge weights and therefore do not necessarily start at 1. The lowest index that corresponds to an actual vertex is denoted . So vertex indices range from to . Additionally, there are two other special indices and , which we define in this section, that separate the three classes of vertices.
The vertices ranging from to are the high-degree vertices, those that range from to are the medium-degree vertices, and those beyond are the low-degree vertices.
In this model, the value of is constrained by the requirement that . When , this constraint is not actually restrictive. However, when , must be asymptotically greater than . The constraints on also constrain the value of the maximal and average degree of the graph.
We define and to be independent of , but dependent on parameters that control the amount and probability of separation at each level. The constraints that and translate into corresponding restrictions on the valid values of , namely that and . We define in the following lemma.
Lemma 12 (Separation of high-degree vertices).
In the model, let . Then,
Moreover, for all satisfying and , the probability that
is at least .
The first statement follows from the fact that is a convex function of and from taking its derivative at and .
For the second statement, let and let . We will show that if , then
Since , the right hand side is bounded above by . This proves the result.
For simplicity, we often use the following observation.
Rewriting to show its dependence on , we have
For the graph model to make sense, the high-degree threshold must be asymptotically greater than the lowest index. In other words, we must have that . Since , this implies that .
We next define , the degree threshold for medium-degree vertices, in the following lemma.
Lemma 14 (Separation of medium-degree vertices).
Let and let
We claim that if , then
If we choose , we have that , so that Eq. 8 applies to all such that . Moreover, since
our choice of implies that . By applying the union bound to Eq. 8, we have
which establishes the lemma.
Let us now prove the claim. Observe that is the sum over the high-degree vertices , of indicator variables for the event that vertex is connected to exactly one of the vertices and . For fixed and , these are independent random variables. Therefore, we can apply a Chernoff bound. The probability that is
Since , for sufficiently large , this expression is bounded below by , and
Therefore, applying the Chernoff bound (Lemma 6) to the for fixed and and all high-degree vertices proves the claim. ∎
We would have the undesirable situation that whenever , or equivalently when . In fact, in order for , we must have .
We illustrate the breakpoints for high-, medium-, and low-degree vertices in Fig. 1.
The next lemma summarizes the above discussion and provides the forms of and that we use in our analysis.
Lemma 16 (Vertex separation in the power-law model).
Let . Fix . Let and where and . Let
For sufficiently large , the probability that a graph is not -separated is at most .
So for sufficiently large , we have . For all , the average degrees of consecutive vertices are at least apart. So for two high-degree vertices to be within of each other, at least one of the two must have degree at least away from its expected degree. By Lemma 12, the probability that some high-degree vertex satisfies is at most .
By Lemma 14, the probability that there are two medium-degree vertices with neighborhood distance less than is at most . ∎
Thus, our marking scheme for the random power-law graph model is effective.
4 Adversary Tolerance
In this section, we study the degree to which our exemplary graph watermarking scheme can tolerate an arbitrary edge-flipping adversary. To measure success, we use the notions of security and adversary advantage, which are formally defined in Section 2. We quantify the number of edge flips that can be tolerated under the Erdős-Rényi model and the random power-law graph model.
Theorem 17 (Security against an arbitrary edge-flipping adversary in the Erdős-Rényi model).
Let , , and such that . Let be sufficiently large so that
Suppose the similarity measure is the vertex distance , the similarity threshold is , we have a number of watermarked copies, and their identifiers are generated using bits. Suppose also that the identifiers map to sets of edges of a graph constrained by the fact that no more than edges can be incident to any vertex. The watermarking scheme defined in Algorithm 2 is -secure against any deterministic adversary.
The proof of this theorem relies on two lemmas. Lemma 18 identifies conditions under which a set of bit vectors with bits independently set to 1 is unlikely to have two close bit vectors. Lemma 19 states that a deterministic adversary’s ability to guess the location of the watermark is limited. Informally, this is because the watermarked graph was obtained through a random process, so that there are many likely original graphs that could have produced it.
Lemma 18 (Separation of IDs).
Consider random bit strings of length , where each bit is independently set to 1, and the i-th bit is 1 with probability satisfying for a fixed value . The probability that at least two of these strings are within Hamming distance of each other is at most if .
The expected distance between two such strings is at least Applying Lemma 6 with , we have that the probability that their Hamming distance is less than is at most . Therefore, the probability that at least two out of strings are within Hamming distance of each other is at most . ∎
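The behavior described by Lemma 18 can be explored empirically. The following sketch generates random identifiers with per-bit probabilities `probs` and reports the smallest pairwise Hamming distance; the function names are chosen for the example, and the simulation illustrates the statement rather than proving it.

```python
import random
from itertools import combinations

def random_ids(count, probs, rng):
    """Generate `count` bit strings; bit i is 1 with probability probs[i]."""
    return [tuple(1 if rng.random() < p else 0 for p in probs)
            for _ in range(count)]

def hamming(x, y):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(x, y))

def min_pairwise_hamming(ids):
    """Smallest Hamming distance between any two identifiers."""
    return min(hamming(x, y) for x, y in combinations(ids, 2))
```

Running `min_pairwise_hamming(random_ids(count, probs, random.Random(seed)))` for growing string lengths gives a rough picture of how quickly the identifiers separate.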
Lemma 19 (Guessing power of adversary).
Consider a complete graph on vertices, and let of its edges be red. Let be a sample of edges chosen uniformly at random among those that satisfy the constraint that no more than edges of the sample can be incident to any one vertex. Suppose also that and are non-decreasing functions of such that
For sufficiently large , the probability that contains at least red edges is bounded by . Moreover, if , then the probability that contains at least red edge is bounded by , for some and for sufficiently large .
In the process of selecting edges without replacement, let be the event that the sample contains at least red edges, and let be the event that the sample satisfies the degree constraint. The event whose probability we want to bound is equal to
Let us first show that can be lower bounded by a constant. To prove this, we select vertices with replacement uniformly at random, and pair consecutive vertices to obtain edges. Choosing vertices uniformly in this way will simplify showing that the degree constraint is satisfied. Of course we want to avoid “self-loops”, or edges where both end vertices are the same. Let denote the event that there is a vertex that is incident to more than edges of the sample. Also, let denote the event that the sample contains no self-loops and no duplicate edges. Then
Now, the probability of encountering a self-loop is and the probability of an edge being a duplicate of another is at most . Therefore,
By Eq. 10, . So is bounded away from 0. Moreover, since the edges now consist of pairs of independently chosen vertices, we can approximate the number of edges incident to each vertex by independent Poisson random variables with parameter as follows:
where the middle factor is a bound on the probability that one Poisson variable is at least (Theorem 5.4 of ), and the last factor is an adjustment factor for this approximation (Corollary 5.9 of ). This expression is bounded by a constant factor times the expression on the left-hand side of Eq. 10. Consequently, converges to 0, and for sufficiently large , , as was to be shown.
Now we find an upper bound for . To do this, we select edges with replacement uniformly at random. Because is relatively small when compared to , it is unlikely that the sample will contain any duplicates. Formally, let be the event that the sample contains at least red edges, and be the event that the sample consists of distinct edges. We have
The probability that two selected edges are the same edge is . So
So for large enough , is bounded below by .
Finally, we bound . The expected number of red edges in this sample is which is bounded below by and bounded above by . So using these bounds and a Chernoff bound (Lemma 6), where we set equal to , we have that
If as , set equal to :
for some constant . Putting it all together, we have that for large enough , and is bounded above by times one of the two bounds for . This proves the result. ∎
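The pairing construction used in the first part of this proof can be written out concretely. In the sketch below, `m` is the number of sampled edges and `max_incident` is the per-vertex limit; these names, and the helper functions themselves, are assumptions made for illustration.

```python
import random
from collections import Counter

def sample_edges_by_pairing(n, m, rng):
    """Draw 2*m vertices uniformly with replacement from 0..n-1 and pair
    consecutive draws to obtain m candidate edges."""
    draws = [rng.randrange(n) for _ in range(2 * m)]
    return [(draws[2 * i], draws[2 * i + 1]) for i in range(m)]

def is_valid_sample(pairs, max_incident):
    """True if the sample has no self-loops, no duplicate edges, and no
    vertex incident to more than max_incident sampled edges."""
    if any(u == v for u, v in pairs):
        return False
    if len({frozenset(p) for p in pairs}) != len(pairs):
        return False
    load = Counter(v for p in pairs for v in p)
    return max(load.values(), default=0) <= max_incident
```

Repeating `sample_edges_by_pairing` until `is_valid_sample` accepts yields a sample that satisfies the degree constraint; the proof argues that acceptance happens with probability bounded away from 0.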
An upper bound on the advantage of any deterministic adversary on graphs on vertices is given by the conditional probability
where the parameters passed to are defined according to the experiment in Algorithm 1. We show that this quantity is polynomially negligible.
For to be successfully identified, it is sufficient for the following three conditions to hold:
the original graph is -separated;
the Hamming distance between any two and involved in a pair in is at least ;
changes no edges of the watermark.
These are sufficient conditions because we only test graphs whose vertices had at most incident edges modified by the adversary, and another incident edges modified by the watermarking. So for original graphs that are -separated, the labeling of the vertices can be successfully recovered. Finally, if the adversary does not modify any potential edge that is part of the watermark, the of the graph is intact and can be recovered from the labeling.
Finally, for graphs in which an adversary makes fewer than modifications per vertex, the total number of edges the adversary can modify is . Since all vertices are high- and medium-degree vertices in this model, . Therefore, . Equation 9 guarantees that the hypothesis given by Eq. 10 of Lemma 19 is satisfied. Consequently, the probability that changes one or more adversary edges is for some constant .
This proves that each of the three conditions listed above fails with polynomially negligible probability, which implies that the conditional probability is also polynomially negligible. ∎
Theorem 20 (Security against an arbitrary edge-flipping adversary in the random power-law graph model).
Let , , and where , and .
Let . Suppose the similarity measure is a vector of distances , that the corresponding similarity threshold is the vector where is the maximum number of edges the adversary can flip in total, and the maximum number of edges it can flip per vertex. Suppose that we have watermarked copies of the graph, that we use to watermark a graph.
Suppose also that the identifiers map to sets of edges of a graph constrained by the fact that no more than edges can be incident to any vertex. Then the watermarking scheme defined in Algorithm 2 is -secure against any deterministic adversary.
The proof is similar to the proof of Theorem 17. An upper bound on the advantage of any deterministic adversary on graphs on vertices is given by the conditional probability
where the parameters passed to are defined according to the experiment in Algorithm 1. We show that this quantity is polynomially negligible.
For to be successfully identified, it is sufficient for the following three conditions to hold:
the original graph is