Adversarial Frontier Stitching for Remote Neural Network Watermarking
Abstract
The state-of-the-art performance of deep learning models comes at a high cost for companies and institutions, due to tedious data collection and heavy processing requirements. Recently, [Uchida:2017] proposed to watermark convolutional neural networks by embedding information into their weights. While this is clear progress towards model protection, this technique only allows extracting the watermark from a network that one accesses locally and entirely. This is a clear impediment, as leaked models can be reused privately, and thus not released publicly for ownership inspection.
Instead, we aim at allowing the extraction of the watermark from a neural network (or any other machine learning model) that is operated remotely, and available through a service API. To this end, we propose to operate on the model’s action itself, slightly tweaking its decision frontiers so that a set of specific queries conveys the desired information.
In the present paper, we formally introduce the problem and propose a novel zerobit watermarking algorithm that makes use of adversarial model examples (called adversaries for short). While limiting the loss of performance of the protected model, this algorithm allows subsequent extraction of the watermark using only a few remote queries. We evaluate this approach on the MNIST dataset with three types of neural networks, demonstrating that e.g., watermarking with images incurs a slight accuracy degradation, while being resilient to most removal attacks.
1 Introduction
Recent years have witnessed a competition to design and train top-notch deep neural networks. The industrial advantage gained from possessing a state-of-the-art model is now widely leveraged, which has started to motivate attacks for stealing those models (see [stealing]). Since it is now widely acknowledged that machine learning models will play a central role in IT development in the years to come, the necessity of protecting those models becomes more salient.
[Uchida:2017] published the first method for watermarking a neural network that might be publicly shared and thus for which traceability through ownership extraction is important. The watermarked object is here a neural network and its trained parameters. We are interested in a related though different problem, namely zerobit watermarking of neural networks (or any machine learning models) that are only remotely accessible through an API. The extraction test of a zerobit watermark in a given model refers to the presence or not of the mark in that model. This type of watermark, along with the required key to extract it, is sufficient for an entity that suspects a non legitimate usage of the watermarked model to confirm it or not.
In stark contrast to [Uchida:2017]’s approach, we seek a watermarking approach that allows extraction to be conducted remotely, without access to the model itself. More precisely, the extraction test of the proposed watermark consists of a set of requests to the machine learning service. This allows the detection of (leaked) models not only when the model’s parameters are directly accessible, but also when the model is simply exposed as a service.
Rationale.
We thus aim at embedding into models zerobit watermarks that can be extracted remotely. In this setup, we can only rely on interactions with the model through the remote API, e.g., on object recognition queries in the case of an image classification model. The inputs, e.g., images, must thus convey a means to embed identification information into the model (zerobit watermarking step) and to extract, or not, the identification information from the remote model (watermark extraction step), see Fig. 1. Our algorithm’s rationale is that the embedded watermark is a slight modification of the original model’s decision frontiers around a set of specific inputs that form the hidden key. Answers of the remote model to these inputs are compared to those of the marked model. A strong match must indicate the presence of the watermark in the remote model with a high probability.
The inputs in the key must be crafted in such a way that watermarking the model of interest does not significantly degrade its performance. To this end, we leverage adversarial perturbations of training examples ([Goodfellow:2015]) that produce new examples (the “adversaries”) very close to the model’s decision frontiers. As such adversaries seem to generalize across models, notably across different neural network architectures for visual recognition, see e.g., [Rozsa:2016], this frontier tweaking should resist model manipulation and yield only few false positives (wrong identification of unmarked models).
Contributions.
The contributions of this paper are: 1) A formalization of the problem of zerobit watermarking a model for remote identification, and associated requirements (Section 2); 2) A practical algorithm, the frontier stitching algorithm based on adversaries, to address this problem (Section 3); 3) Experiments with three different types of neural networks on the MNIST dataset, validating the approach with regard to the specified requirements (Section 4).
2 Model watermarking for Remote Extraction
Considered scenario
The scenario that motivates our work is as follows: An entity, having designed and trained a machine learning model, notably a neural network, wants to zerobit watermark it (top action on Figure 1). That model could then be placed in production for applications and services. In case a leak is suspected in that application (the model has been stolen), the entity suspecting a given online service of reusing that leaked model can query that remote service to resolve its doubts (bottom action).
As in classic media watermarking methods ([771066]), our proposal is composed of the operations of embedding (the zerobit watermark in the model) and extraction (where the entity verifies the presence or not of its watermark), and of a study of possible attacks (actions performed in order to remove the watermark from the model).
Modeling Requirements
The requirements for watermarking and extracting the watermark from the weights of a neural network that is available locally for inspection are listed by [Uchida:2017]. Those requirements are based on previous work for watermarking in the multimedia domain ([771066]). Since our aim of remote extraction makes them inapplicable, we now specify new requirements adapted to our setup.
We consider the problem of zerobit watermarking a generic classifier, for remote watermark extraction. Let $d$ be the dimensionality of the input space (raw signal space for neural nets or handcrafted feature space for linear and nonlinear SVMs), and $\mathcal{Y}$ the finite set of target labels. Let $k: \mathbb{R}^d \to \mathcal{Y}$ be the perfect classifier for the problem (i.e., $k(x)$ is always the correct answer). Let $\hat{k}: \mathbb{R}^d \to \mathcal{Y}$ be the trained classifier to be watermarked, and $\mathcal{F}$ be the space of possible such classifiers. Our aim is to find a zerobit watermarked version of $\hat{k}$ (hereafter denoted $\hat{k}_w$) along with a set $K \subset \mathbb{R}^d$ of specific inputs, named the key, and their labels $\{\hat{k}_w(x), x \in K\}$. The purpose is to query with the key a remote model that can be either $\hat{k}_w$ or another unmarked model $k_r \in \mathcal{F}$. The key, which is thus composed of $|K|$ “objects” to be classified, is used to embed the watermark into $\hat{k}$.
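Here are listed the requirements of an ideal watermarked model and key couple, $(\hat{k}_w, K)$:
Loyal. The watermark embedding does not hinder the performance of the original classifier:
$\forall x \in \mathbb{R}^d \setminus K,\ \hat{k}(x) = \hat{k}_w(x).$ (1)
Efficient. The key is as short as possible, as accessing the watermark requires $|K|$ requests.
Effective. The embedding allows unique identification of $\hat{k}_w$ using $K$ (zerobit watermarking):
$\forall k_r \in \mathcal{F},\ k_r \neq \hat{k}_w \Rightarrow \exists x \in K \text{ s.t. } k_r(x) \neq \hat{k}_w(x).$ (2)
Robust. Attacks (such as finetuning or compression) to $\hat{k}_w$ do not remove the watermark^{1}:
$\forall x \in K,\ (\hat{k}_w + \varepsilon)(x) = \hat{k}_w(x).$ (3)
Secure. There should be no efficient algorithm to detect the presence of a watermark in a model by an unauthorized party.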
Note that effectiveness is a new requirement as compared to the list of Uchida et al. Also, Uchida et al.’s capacity requirement, i.e., the amount of information that can be embedded by a method, is not part of ours, as our goal is to decide whether a watermarked model is used or not (zerobit watermark extraction).
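One can observe the conflicting nature of effectiveness and robustness: If, for instance, $(\hat{k}_w + \varepsilon) \in \mathcal{F}$, then this function violates one of the two, since robustness asks it to agree with $\hat{k}_w$ on the whole key while effectiveness asks it to differ on at least one key input. In order to allow for a practical setup for the problem, we rely on a measure of the matching between two classifiers $k_1, k_2 \in \mathcal{F}$:
$m_K(k_1, k_2) = \sum_{x \in K} \big(1 - \delta(k_1(x), k_2(x))\big),$ (4)
where $\delta$ is the Kronecker delta. One can observe that $m_K(k_1, k_2)$ is simply the Hamming distance between the vectors $k_1(K)$ and $k_2(K)$, thus based on elements in $K$. With this focus on distance, our two requirements can now be recast in a nonconflicting way:
Robustness: $\forall \varepsilon \approx 0,\ m_K(\hat{k}_w, \hat{k}_w + \varepsilon) \approx 0$
Effectiveness: $\forall k_r \in \mathcal{F},\ m_K(\hat{k}_w, k_r) \approx |K|$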
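The matching measure of Eq. 4 can be sketched in a few lines. This is only an illustration: the classifiers are assumed to be plain Python callables returning a label, and the function name `key_distance` and the toy threshold classifiers are hypothetical, not taken from the paper.

```python
def key_distance(k1, k2, key):
    """Eq. 4: number of key inputs on which classifiers k1 and k2 disagree
    (the Hamming distance between their label vectors on the key)."""
    return sum(1 for x in key if k1(x) != k2(x))


# Toy 1-D classifiers (hypothetical stand-ins for the marked and remote models).
marked = lambda x: int(x > 0.0)
other = lambda x: int(x > 0.5)
key = [-1.0, 0.2, 0.4, 0.7, 1.5]

print(key_distance(marked, marked, key))  # 0: a model always matches itself
print(key_distance(marked, other, key))   # 2: the models disagree on 0.2 and 0.4
```

A distance close to 0 corresponds to the robustness target, and a distance close to the key size corresponds to the effectiveness target.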
Having presented those requirements, we are ready to propose a practical zerobit model watermarking algorithm that permits remote extraction through requests to an API.
3 The Frontier Stitching Algorithm for zerobit watermarking
We now present our approach and its underlying intuition. Our aim is to output a zerobit watermarked model $\hat{k}_w$, which can for instance be placed into production for use by consumers, together with a watermark key $K$ to be used in case of model leak suspicion. Figure 2 illustrates the approach in the setting of a binary classifier.
As we use input points for watermarking the owned model and subsequently to query a suspected remote model, the choice of those inputs is crucial. A non-watermarking-based solution, based simply on arbitrarily choosing training examples (along with their correct labels), is very unlikely to succeed in identifying a specific valuable model: classifying those points correctly should be easy for highly accurate classifiers, which will then provide similar results, ruining the effectiveness (Fig. 2(a)). On the other hand (Fig. 2(b)), the opposite strategy of selecting arbitrary examples and finetuning $\hat{k}$ so that it changes the way they are classified (e.g., $\forall x \in K,\ \hat{k}_w(x) \neq \hat{k}(x)$) is an option to modify the model’s behavior in an identifiable way. However, finetuning on even a few examples that are possibly far from class frontiers will significantly alter the performance of $\hat{k}$: the produced solution will not be loyal.
Together, those observations lead to the conclusion that the selected points should be close to the original model’s decision frontier, that is, their classification is not trivial and depends heavily on the model (Fig. 2(c)). Finding and manipulating such inputs is the purpose of adversarial perturbations [Goodfellow:2015]. Given a trained model, any well classified example can be modified in a very slight and simple way such that it is now misclassified with high probability. For instance, a natural image that is correctly recognized by a given model can be modified in an imperceptible way so as to be assigned a wrong class. Such modified samples are called “adversarial examples”, or adversaries for short.
The proposed frontier stitching algorithm, presented in Algorithm 1, makes use of such adversaries, selected to “clamp” the frontier in a unique, yet harmless way. It proceeds in two steps to watermark the model. The first step is to select a small key set $K$ of specific input points, which is composed of two types of adversaries. It first contains classic adversaries, which we call true adversaries, that are misclassified by $\hat{k}$ although each is very close to a well classified example. It also contains false adversaries, each obtained by applying an adversarial perturbation to a well classified example without ruining its classification. In practice, the “fast gradient sign method” proposed in [Goodfellow:2015] is used with a small step $\epsilon$ to create potential adversaries of both types from training examples.
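The key selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: a two-input logistic model stands in for the neural network, the fast gradient sign step is computed in closed form for that model, and the names `build_key`, `fgsm`, the toy weights and the value `eps=0.25` are all hypothetical choices made for the example.

```python
import math


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model (stand-in for the network)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    return int(predict_proba(w, b, x) >= 0.5)


def fgsm(w, b, x, y, eps):
    """Fast gradient sign method: move x by eps along sign(d loss / d x).
    For the logistic cross-entropy loss, d loss / d x = (p - y) * w."""
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


def build_key(w, b, examples, eps, key_size):
    """Collect key_size/2 true adversaries (now misclassified) and key_size/2
    false adversaries (perturbed but still well classified), with true labels."""
    half = key_size // 2
    true_adv, false_adv = [], []
    for x, y in examples:
        if predict(w, b, x) != y:
            continue  # only perturb examples that are well classified
        x_adv = fgsm(w, b, x, y, eps)
        bucket = true_adv if predict(w, b, x_adv) != y else false_adv
        if len(bucket) < half:
            bucket.append((x_adv, y))
    if len(true_adv) < half or len(false_adv) < half:
        raise ValueError("not enough adversaries of each type for this eps")
    return true_adv + false_adv


# Toy model and labelled examples (assumed values, for illustration only).
w, b = [1.0, -1.0], 0.0
examples = [([0.1, 0.0], 1), ([2.0, 0.0], 1), ([0.0, 0.2], 0), ([-1.0, 1.0], 0)]
key = build_key(w, b, examples, eps=0.25, key_size=4)
for x_adv, y in key:
    print(x_adv, "label:", y, "model answer:", predict(w, b, x_adv))
```

Examples near the frontier become true adversaries, while examples further away survive the same perturbation and become false adversaries; both kinds are kept with their correct labels.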
These frontier clamping inputs are then used to watermark the model. The model $\hat{k}$ is finetuned into $\hat{k}_w$ such that all points in $K$ are now well classified:
$\forall x \in K,\ \hat{k}_w(x) = k(x).$ (5)
In other words, the true adversaries of $\hat{k}$ in $K$ become false adversaries of the marked model, and false adversaries remain as such. The role of the false adversaries is to strongly limit the amount of change that the decision frontiers will undergo when getting true adversaries back to the right classes.
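The embedding step can be illustrated with a toy finetuning loop that stops as soon as Eq. 5 holds. The sketch assumes a one-dimensional logistic model in place of the network, plain gradient descent on the key alone, and hypothetical names and values (`embed_key`, `lr`, `max_epochs`, the toy key); the actual training procedure of the paper may differ.

```python
import math


def predict(w, b, x):
    # sigmoid(w * x + b) >= 0.5 is equivalent to w * x + b >= 0
    return int(w * x + b >= 0.0)


def embed_key(w, b, key, lr=0.1, max_epochs=1000):
    """Finetune (w, b) on the key until every key input gets its key label."""
    for epoch in range(max_epochs):
        if all(predict(w, b, x) == y for x, y in key):
            return w, b, epoch  # Eq. 5 holds: the key is embedded
        grad_w = grad_b = 0.0
        for x, y in key:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x  # gradient of the cross-entropy loss
            grad_b += p - y
        w -= lr * grad_w
        b -= lr * grad_b
    raise RuntimeError("key not embedded within max_epochs; draw another key")


# Toy key: two true adversaries (label 1 but just left of the frontier at 0)
# and two false adversaries (already well classified).
key = [(-0.1, 1), (-0.2, 1), (1.0, 1), (-1.0, 0)]
w_marked, b_marked, epochs = embed_key(1.0, 0.0, key)
print("epochs needed:", epochs)
print("answers far from the frontier:", predict(w_marked, b_marked, -2.0),
      predict(w_marked, b_marked, 2.0))
```

In this toy run the frontier only shifts slightly to the left, so inputs far from it keep their original answers, which is the intent of the loyalty requirement. Raising an error when the loop does not converge mirrors the idea of discarding keys that are too costly to embed.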
Statistical watermark extraction
The watermarking step is thus the embedding of such a crafted key in the original model, while the watermark extraction consists in asking the remote model to classify the inputs in key $K$, to assess the presence or absence of the zerobit watermark, as presented in Algorithm 2. We now analyze this detection problem statistically.
As discussed in Section 2, the key quantity at extraction time is the Hamming distance $m_K$ (Eq. 4) between the remote model’s answers to the key and the expected answers. The stitching algorithm produces deterministic results with respect to the imprinting of the key: the marked model perfectly matches the key, i.e., its distance to the key labels is $0$. However, as the leaked model may undergo arbitrary attacks (e.g., for watermark removal, leading to a modified model), one should expect some deviation in the answers of such a model at watermark extraction (a distance greater than $0$). On the other hand, other unmarked models might also partly match key labels, and thus have a positive, non-maximal distance too. As an extreme example, even a strawman model that answers a label uniformly at random produces $|K|/|\mathcal{Y}|$ matches in expectation when classifying over $|\mathcal{Y}|$ classes. Consequently, two questions are central to the frontier stitching approach: How large is the deviation one should tolerate from the original watermark in order to declare a successful zerobit watermark extraction? And, dependently, how large should the key be, so that the tolerance is increased?
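The strawman figure follows from linearity of expectation. The short derivation below uses $\mathcal{Y}$ for the label set and $K$ for the key; writing $R(x)$ for the uniformly random answer and $\ell(x)$ for the key label of $x$ is notation introduced here for illustration only.

```latex
\mathbb{E}[\text{matches}]
  = \sum_{x \in K} \Pr\big[R(x) = \ell(x)\big]
  = \sum_{x \in K} \frac{1}{|\mathcal{Y}|}
  = \frac{|K|}{|\mathcal{Y}|},
\qquad
\mathbb{E}[\text{mismatches}]
  = |K|\left(1 - \frac{1}{|\mathcal{Y}|}\right).
```

For the ten MNIST classes and a key of 100 inputs, this gives 10 expected matches and 90 expected mismatches.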
We propose to rely on a probabilistic approach. We model the probability of a (non-watermarked) model to produce correct answers to requests from objects in the key, i.e., to have a low distance $m_K$ to the key labels. While providing an analysis that would both be precise and cover all model behaviors is unrealistic, we rely on a null model that assumes that when considering inputs in the key, they are so close to the frontier that, at this “resolution”, the frontier only delimits two classes (the other classes being too far from the considered key inputs), and that the probability of each of the two classes is $1/2$. This is all the more plausible since we leverage adversaries especially designed to cause misclassification.
More formally, let $k_\emptyset$ be the null model. Then $\forall x \in K,\ \mathbb{P}[k_\emptyset(x) = \hat{k}_w(x)] = 1/2$. Having such a null model allows applying a $p$-value approach to the decision criterion. Indeed, let $Z = m_K(\hat{k}_w, k_r)$ be the random variable representing the distance between the key and the remote model $k_r$ we are querying, that is, the number of mismatching labels among the $|K|$ request answers to the key. Assuming that the remote model is the null model, the probability of having exactly $z$ errors in the key is $\mathbb{P}[Z = z] = 2^{-|K|}\binom{|K|}{z}$, that is, $Z$ follows the binomial distribution $B(|K|, 1/2)$. Let $\theta$ be the maximum number of errors tolerated on $k_r$’s answers to decide whether or not the watermark extraction is successful. To safely ($p$-value $< 0.05$) reject the hypothesis that $k_r$ is a model behaving like our null model, we need $\mathbb{P}[Z < \theta] < 0.05$, that is, $2^{-|K|}\sum_{z=0}^{\theta-1}\binom{|K|}{z} < 0.05$. For instance, for a key size of $|K| = 100$ and a $p$-value of $0.05$, the maximum number of tolerated errors is $\theta = 42$. We thus consider the zerobit watermark extraction from the remote model successful if the number of errors is below that computed threshold $\theta$, as presented in Algorithm 2. The next section includes an experimental study of false positives when extracting the watermark with this probabilistic approach.
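The threshold computation and the resulting extraction decision can be sketched with exact integer arithmetic. The function names `error_threshold` and `watermark_detected` are hypothetical, and the default significance level of 0.05 is an assumption of this sketch; the code only illustrates the binomial null model described above, not the paper's Algorithm 2 verbatim.

```python
from math import comb


def error_threshold(key_size, p_value=0.05):
    """Largest theta such that P[Z < theta] < p_value, Z ~ Binomial(key_size, 1/2)."""
    total = 2 ** key_size  # number of equally likely match/mismatch patterns
    cumulative = 0         # patterns with at most z mismatches
    theta = 0
    for z in range(key_size + 1):
        cumulative += comb(key_size, z)
        if cumulative / total >= p_value:
            break          # allowing z mismatches is no longer significant
        theta = z + 1      # P[Z < z + 1] = P[Z <= z] is still below p_value
    return theta


def watermark_detected(remote_labels, key_labels, p_value=0.05):
    """Extraction succeeds when the number of mismatches is below the threshold."""
    errors = sum(1 for r, k in zip(remote_labels, key_labels) if r != k)
    return errors < error_threshold(len(key_labels), p_value)


key_labels = [1] * 100
answers_41_errors = [0] * 41 + [1] * 59
answers_42_errors = [0] * 42 + [1] * 58
print(error_threshold(100))                               # 42
print(watermark_detected(answers_41_errors, key_labels))  # True
print(watermark_detected(answers_42_errors, key_labels))  # False
```

Larger keys tolerate proportionally more errors before the null hypothesis can no longer be rejected, which is the trade-off with the efficiency requirement.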
4 Experiments
We now conduct experiments to evaluate the proposed approach in the light of the requirements stated in Section 2.
Network  #Parameters  Details        Accuracy
CNN                   mnist_cnn.py   0.993 (10 epochs)
IRNN                  mnist_irnn.py  0.9918 (900 epochs)
MLP                   mnist_mlp.py   0.984 (10 epochs)
MNIST classifiers.
We perform our experiments on the MNIST dataset, using the Keras backend
Generating adversaries for watermark key.
We use the Cleverhans Python library by [papernot2017cleverhans] to generate the adversaries (function GENERATE_ADVERSARIES() in Algorithm 1). It implements the “fast gradient sign method” by [Goodfellow:2015]. Alternative methods, such as the “Jacobian-based saliency map” ([2015arXiv151107528P]), may also be used. We set to the parameter controlling the intensity of the adversarial perturbation. [Goodfellow:2015] report a classification error rate of for a shallow softmax classifier, for that value of $\epsilon$ and the MNIST test set. As explained in Section 3, we also need to access the images that are not misclassified despite this adversarial perturbation (the false adversaries). Along with the true adversaries we select in $K$, they will be used through finetuning to “clamp” the decision frontiers of the watermarked model, with the expected result of not significantly degrading the performance of the original model.
Impact of watermarking (fidelity requirement).
This experiment considers the impact on fidelity of the watermark embedding, of sizes and , in the three networks. We generated multiple keys for this experiment and the following ones (see Algorithm 1), and kept those which required fewer than epochs for embedding in the models ( for IRNN). The following experiments are thus the results of multiple runs over about generated keys per network, which allows computing standard deviations.
The cumulative distribution function (CDF) in Fig. 3 shows the accuracy for the 3 networks after embedding keys of the two sizes. As one can expect, embedding the larger key causes more accuracy loss, but the two curves are close for both CNN and IRNN cases.
False positives in remote watermark extraction (effectiveness requirement).
We now evaluate the effectiveness of the watermark extraction. When the remote model is queried and Algorithm 2 returns True, it is important to have a low false positive rate. To measure this, we ran the extraction Algorithm 2 on non-watermarked, retrained networks of each type, with the keys used to watermark the three original networks. Ideally, the output should always be negative. Results in Fig. 4 show that the remote network is not wrongly accused in seven out of nine cases. First, we observe that the keys obtained from Algorithm 1 on a trained network do not cause false alarms on a non-watermarked, retrained network of the same type.
The two false positives (red square cases) stem from watermarks generated on the IRNN model, causing erroneous positive watermark extraction when facing the MLP or CNN architectures. Note that the key size does not result in significantly different distances. We propose a way of lowering those false positive cases in the discussion in Section 6.
Attacking the watermarks of a leaked model (robustness requirement).
We now address the robustness of the stitching algorithm. We present two types of attacks: Model compression and overwriting via finetuning.
We first introduce the notion of a plausible attack, where represents the floor accuracy to which an attacker of a leaked model is ready to degrade the model in the hope of removing the zerobit watermark. As in the multimedia watermarking case, one can always hope to remove a watermark at the cost of an important, possibly catastrophic, degradation of the model. Such attacks are not to be considered. In our setup, reusing a leaked model that has been significantly degraded does not make sense, as the attacker could probably use a less precise, legitimate model instead. Since the three networks in our experiments have accuracy above , we consider only plausible attacks with for the rest of the attack experiments.
We remark that due to the nature of our watermarking method, an attacker (who does not possess the secret key) will not know whether or not her attacks removed the watermark from the leaked model.
Compression attack via pruning. As done by [Uchida:2017], we study the effect of compression through parameter pruning, where 25% to 85% of the model weights with the lowest absolute values are set to zero. Results are presented in Table 2. Among all plausible attacks, none but one (50% pruning of IRNN) prevents perfect extraction of the watermark. We note that the MLP is prone to important degradation of accuracy when pruned, while at the same time the average number of erased key elements from the model is way below the decision threshold of 42. Regarding the CNN, even 85% of pruned parameters are not enough to reach that same threshold.
Network  Pruning rate  Avg elts removed  Stdev  Extraction rate  Acc. after  Acc. stdev
CNN      0.25          0.053/100         0.229  1.000            0.983       0.003
         0.50          0.263/100         0.562  1.000            0.984       0.002
         0.75          3.579/100         2.479  1.000            0.983       0.003
         0.85          34.000/100        9.298  0.789            0.936       0.030
IRNN     0.25          14.038/100        3.873  1.000            0.991       0.001
         0.50          59.920/100        6.782  0.000            0.987       0.001
         0.75          84.400/100        4.093  0.000            0.148       0.021
MLP      0.25          0.360/100         0.700  1.000            0.951       0.018
         0.50          0.704/100         0.724  1.000            0.947       0.021
         0.75          9.615/100         4.392  1.000            0.915       0.031
         0.80          24.438/100        5.501  1.000            0.877       0.042
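The pruning attack evaluated above zeroes the weights with the lowest absolute values. A minimal sketch follows, assuming the weights are given as one flat list and pruned globally; whether the experiments prune per layer or globally is not stated here, and the name `prune_lowest_magnitude` and the toy weights are hypothetical.

```python
def prune_lowest_magnitude(weights, rate):
    """Return a copy of weights where the `rate` fraction of entries with the
    smallest absolute values is set to zero."""
    n_prune = int(rate * len(weights))
    # Indices ordered by increasing magnitude; the first n_prune get zeroed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.6, -0.4, 0.02]
print(prune_lowest_magnitude(weights, 0.5))
# [0.8, 0.0, 0.0, -1.2, 0.0, 0.6, -0.4, 0.0]
```

After pruning, the attacked model would be queried with the key and its number of mismatches compared with the decision threshold, as in the extraction test.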
Overwriting attack via adversarial finetuning. Since we leverage adversaries in the key to embed the watermark in the model, a plausible attack is to try overwriting this watermark via adversarial finetuning of the leaked model. This relates to “adversarial training”, a form of specialized data augmentation that can be used as generic regularization ([Goodfellow:2015]) or to improve model resilience to adversarial attacks ([papernot2017cleverhans]). In this experiment, we turn images from the MNIST test set into adversaries and use them to finetune the model (the test set thus now consists of the remaining images). The results of the overwriting attacks are presented in Table 3. An adversarial finetuning of size uses times more adversaries than the watermarking key (as , with true adversaries). We see perfect watermark extractions (no false negatives) for CNN and MLP, while there are a few extraction failures from the attacked IRNN architecture.
Network  Avg elts removed  Stdev  Extraction rate  Acc. after  Acc. stdev
CNN      17.842            3.594  1.000            0.983       0.001
IRNN     37.423            3.931  0.884            0.989       0.001
MLP      27.741            5.749  1.000            0.972       0.001
About the security and efficiency requirements
Efficiency.
The efficiency requirement deals with the computational cost of querying a suspected remote service with the $|K|$ queries from the watermarking key $K$. Given typical pricing of current online machine learning services
Security.
The frontier stitching algorithm deforms the decision frontiers slightly and locally, based on the labelled samples in key $K$. To ensure security, this key must be kept secret by the entity that watermarked the model (otherwise, one might devise a simple overwriting procedure that reverts these deformations). Decision frontier deformation through finetuning is a complex process (see work by [DBLP:journals/corr/Berg16]) which seems very difficult to revert in the absence of information on the key. Could a method detect specific local frontier configurations that are due to the embedded watermark? The existence of such an algorithm, related to steganalysis in the domain of multimedia, would indeed be a challenge for neural network watermarking at large, but seems unlikely.
5 Related Work
Watermarking aims at embedding information into “objects” that one can manipulate locally. Watermarking multimedia content in particular is a rich and still active research field, with two decades of sustained interest ([771066]). The extension to neural networks is very recent, following the need to protect the valuable assets of today’s state-of-the-art machine learning techniques. [Uchida:2017] thus propose the watermarking of neural networks, by embedding information in the learned weights. The authors show, in the case of convolutional architectures, that this embedding does not significantly change the distribution of parameters in the model. This approach requires a local copy of the neural network to inspect, as the extraction of the watermark requires reading the weights of convolution kernels. It is motivated by the voluntary sharing of already trained models, in the case of transfer learning, such as in [transfer]’s work for instance.
Since more and more models and algorithms might only be accessed through API operations (being run as a component of a remote online service), there is a growing body of research interested in leveraging the restricted set of operations offered by those APIs to gain knowledge about the remote system internals. [stealing] demonstrate that it is possible to extract an indistinguishable copy of a remotely executed model from some online machine learning APIs. [Papernot:2017:PBA:3052973.3053009] have shown attacks on remote models to be feasible, yielding erroneous model outputs. In the present work, we propose a watermarking algorithm that is compliant with APIs, since it only relies on the basic classification query to the remote service.
6 Discussion and Conclusion
This paper introduces the “frontier stitching algorithm” to extract zerobit watermarks from leaked models that might be used as part of remote online services.
Experiments have shown good performance of the algorithm with regard to the general requirements we proposed for the problem. Yet, the false positives witnessed for watermark extraction corresponding to the target IRNN model require further explanation. Since the key size has no impact on this phenomenon, the remaining variable that one can leverage is $\epsilon$. We now discuss it.
The impact of the gradient step $\epsilon$.
In this paper, we used $\epsilon$ set at ([Goodfellow:2015]). We re-execute the experiment for the effectiveness requirement (as initially presented in Figure 4), with , and varying values of $\epsilon$, with watermarking trials using keys per network. We now observe in Figure 5 that the false positives are intuitively also occurring for lower values and . False positives disappear for . This indicates that the model owner has to select a high $\epsilon$ value, depending on her dataset, so that the generated adversaries are powerful enough to prevent accurate classification by the remote inspected model. We remark that is an extreme value for the MNIST test set we use for generation of adversaries, as this value allows generating at most and false adversaries for the CNN and MLP networks respectively (over possible in total).
Future work.
We have seen that, in particular, the IRNN model requires precise setting of $\epsilon$, and was prone to the compression attack for the pruning rate of 50% of parameters. This underlines the probable increased structural resistance of some specific architectures. Future work thus includes a characterization of those structural properties versus their watermarking capacities, in different application contexts.
We stress that the introduced remote watermark extraction is a difficult problem, due to accessing the model only through standard API operations. We proposed to base the extraction decision on a $p$-value statistical argument. The null model used assumes that an object that is crafted to be very close to a decision frontier is randomly assigned to one of the two classes on either side of the frontier. This allows a generalization over the remote unmarked models that one suspects and thus queries. There is certainly a need to design a better null model, which could provide more precise identification means, yet probably at the cost of generality.
The watermark information is currently embedded using the binary answers to the query made on each object in the key: Whether or not this object is classified by the remote model as expected in the key label. One might wish to design a more powerful watermarking technique, leveraging not only those binary answers, but also the actual classifications issued by the remote model (or even the probability vectors), as a means to e.g., embed more information with the same watermark size.
Finally, we think that the use of the recent concept of universal adversarial perturbations ([MoosaviDezfooli:2017]) might be leveraged to build efficient and robust watermarking algorithms. Indeed, this method allows the generation of adversaries that can fool multiple classifiers. Relying on such adversaries in an extension of our framework might give rise to new, improved watermarking methods for neural networks that are only remotely accessed.
References
Footnotes
 “$\hat{k}_w + \varepsilon$” stands for a small modification of the parameters of $\hat{k}_w$ that preserves the value of the model, i.e., that does not deteriorate significantly its performance.
 https://keras.io/backend/
 https://www.tensorflow.org/
 https://github.com/fchollet/keras/blob/master/examples/
 Note that if not retrained, the normalized Hamming distance with the same non-watermarked network is $0.5$, as a key contains half true adversaries and half false adversaries.
 Amazon’s Machine Learning, for instance, charges per classification requests as per Oct. 2017