PRADA: Protecting against DNN Model Stealing Attacks


Mika Juuti
Aalto University,
Sebastian Szyller
Aalto University,
Alexey Dmitrenko
Aalto University,
Samuel Marchal
Aalto University,
N. Asokan
Aalto University

As machine learning (ML) applications become increasingly prevalent, protecting the confidentiality of ML models becomes paramount for two reasons: (a) models may constitute a business advantage to their owner, and (b) an adversary may use a stolen model to find transferable adversarial examples that can be used to evade classification by the original model. One way to protect model confidentiality is to limit access to the model only via well-defined prediction APIs. This is common not only in machine-learning-as-a-service (MLaaS) settings where the model is remote, but also in scenarios like autonomous driving where the model is local but direct access to it is protected, for example, by hardware security mechanisms. Nevertheless, prediction APIs still leak information so that it is possible to mount model extraction attacks by an adversary who repeatedly queries the model via the prediction API.

In this paper, we describe a new model extraction attack by combining a novel approach for generating synthetic queries together with recent advances in training deep neural networks. This attack outperforms state-of-the-art model extraction techniques in terms of transferability of targeted adversarial examples generated using the extracted model (+15-30 percentage points, pp), and in prediction accuracy (+15-20 pp) on two datasets.

We then propose the first generic approach to effectively detect model extraction attacks: PRADA. It analyzes how the distribution of consecutive queries to the model evolves over time and raises an alarm when there are abrupt deviations. We show that PRADA can detect all known model extraction attacks with a 100% success rate and no false positives. PRADA is particularly suited for detecting extraction attacks against local models.

adversarial machine learning; model extraction; model stealing; Deep Neural Network


1 Introduction

Recent advances in deep neural networks (DNN) have drastically improved the performance and reliability of machine learning (ML)-based decision making. DNN models are being deployed in real-world systems for solving complex decision problems such as image and speech recognition [17], autonomous systems or automated detection of attacks [37]. New business models like Machine-Learning-as-a-Service (MLaaS) have emerged where the model itself is hosted in a secure cloud service, allowing clients to query the model via a cloud-based prediction API. Model owners can monetize their models by, e.g., having clients pay to use the prediction API. In such settings, the ML model represents business value underscoring the need to keep it confidential.

Increasing adoption of ML in various applications is also accompanied by an increase in attacks targeting ML-based systems, a.k.a. adversarial machine learning [21, 41]. One such attack is forging adversarial examples, which are samples specifically crafted to deceive a target ML model [16]. To date, there are no effective defenses against such attacks [4, 8], but one mitigation is to protect the confidentiality of the ML model.

In MLaaS settings, where models are remote, confidentiality can be ensured by isolating the model behind a firewall and restricting access to it only via a suitably protected well-defined prediction API [2, 33, 23]. In scenarios like autonomous driving the model necessarily needs to be local due to real-time requirements or connectivity constraints. Similar access control can be enforced using hardware-assisted security mechanisms [39, 20].

However, prediction APIs necessarily leak information. This leakage of information is exploited by model extraction attacks [51, 40]. In model extraction, the adversary only has access to the prediction API of a target model which he can use as an oracle for returning predictions for the samples he submits. The adversary queries the target model iteratively using “natural” or synthetic samples that are specifically crafted to maximize the extraction of information about the model internals from the predictions returned by the model. The adversary uses this information to gradually train a substitute model. The substitute model itself may be used in constructing future queries whose responses are used to further refine the substitute model. The goal of the adversary is to use the substitute model to (a) obtain predictions in the future, bypassing the original model, and thus depriving its owner of their business advantage, and/or (b) construct transferable adversarial examples [45] that he can later use to deceive the original model into making incorrect predictions. The success of the adversary can thus be measured in terms of (a) prediction accuracy, and (b) transferability of adversarial samples obtained from the substitute model.

To better protect ML-based systems, we need to understand the nature of model stealing and the extent of the threat it poses to real-world systems. Prior extraction attacks are either narrowly scoped [40] (targeting transferability of a specific type of adversarial examples), or have been demonstrated only on simple models [51] rarely used in practice. Model extraction attacks are most useful when targeting complex models, e.g., DNNs, that an adversary is not able to train on his own due to lack of suitable training data or expertise. We are not aware of any prior work describing effective generic techniques to detect/prevent model extraction.

Goal and contributions. Our goal is twofold: (1) to demonstrate the feasibility of model extraction attacks on DNN models by introducing a model extraction attack that outperforms prior work in terms of both adversarial goals discussed above, and (2) to develop an effective generic defense against model extraction. By “generic”, we mean applicability to models with any type of input data and any learning algorithm. We claim the following contributions:

  • a novel model extraction attack (Sect. 3), which, unlike previous proposals, leverages recent advances in DNN training methods and a newly introduced synthetic data generation approach. It outperforms prior attacks in transferability of targeted adversarial examples (+15-30 pp) and prediction accuracy (+15-20 pp) (Sect. 4).

  • a new technique, PRADA, to detect model extraction, which models the evolution of the distribution of successive queries from a client and identifies abrupt deviations (Sect. 5.1). We show that it is effective (100% detection rate and no false positives on all known model extraction attacks) (Sect. 5.2). To the best of our knowledge PRADA is the first generic technique for detecting model extraction.

The source code for our attacks and our defense is available on request for research use.

2 DNN Model Extraction

2.1 Deep Neural Network (DNN)

A deep neural network (DNN) is a function F producing an output y = F(x) on an input x, where F is a hierarchical composition of L parametric functions f_i (i ∈ {1, …, L}), each of which is a layer of neurons that apply activation functions to the weighted output of the previous layer f_{i−1}. Each layer i is parametrized by a weight matrix W_i, a bias b_i and an activation function σ_i: f_i(x) = σ_i(W_i · x + b_i). Consequently, a DNN can be formulated as follows:

F(x) = σ_L(W_L · σ_{L−1}(W_{L−1} ⋯ σ_1(W_1 · x + b_1) ⋯ + b_{L−1}) + b_L).    (1)
In this paper we focus on predictive DNN models used as m-class classifiers. The output F(x) is an m-dimensional vector containing the probabilities p_j that x belongs to each class c_j for j ∈ {1, …, m}. This output is typically computed by applying the softmax function to the output of the last layer of the network, also called the logits. A final prediction class ĉ(x) (we use the hat notation ˆ to denote predictions) can be given to an input x by applying the argmax function to F(x): ĉ(x) = argmax_j F(x)_j.    (2)
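As a concrete illustration, the layered forward pass, softmax and argmax prediction described above can be sketched in a few lines of numpy; the layer sizes and random weights below are arbitrary toy values, not the models used in the paper:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def dnn_forward(x, layers):
    # Hierarchical composition f_L(... f_2(f_1(x))): each layer applies
    # an activation to the weighted output of the previous layer.
    for W, b, act in layers:
        x = act(W @ x + b)
    return x

relu = lambda z: np.maximum(z, 0)
identity = lambda z: z

rng = np.random.default_rng(0)
# Toy two-layer network: 4 inputs -> 5 hidden units -> 3 logits.
layers = [(rng.standard_normal((5, 4)), np.zeros(5), relu),
          (rng.standard_normal((3, 5)), np.zeros(3), identity)]

x = rng.standard_normal(4)
logits = dnn_forward(x, layers)          # output of the last layer
probs = softmax(logits)                  # m-dimensional vector F(x)
predicted_class = int(np.argmax(probs))  # prediction class
```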

2.2 Attack Description

The adversary’s objective is to train a substitute model F′ capable of mimicking a target model F, i.e., F′(x) ≈ F(x). The method relies on making iterative prediction queries to F to label a set of samples. The labeled samples (x, F(x)) are then used as training data to learn the substitute model F′. Since it is reasonable to assume that the adversary does not have enough “natural” samples, model extraction typically requires the generation and querying of synthetic samples crafted to extract maximal information from F.

This attack pattern is realistic given the emerging MLaaS paradigm where one party (model provider) uses a private training set, domain expertise and computational power to train a model which is then made available to other parties (clients) via a prediction API on cloud-based platforms, e.g., AmazonML [2], AzureML [33], which charge fees from clients requesting predictions from the model, thereby generating revenue for the model provider. Alternatively, the model provider may send the model to client devices, relying on platform security mechanisms on the client device, e.g., a trusted execution environment (TEE) such as Intel SGX [1], to protect model confidentiality.

In this paper we focus on the extraction of model parameters from deep neural networks. Our aim is to estimate approximations of the weights and biases that define F. This objective is complementary to recent work [38, 52] targeting the extraction of DNN hyperparameters such as model architecture, training method, etc. Both hyperparameters and parameters must be inferred for an extraction attack to succeed.

Adversaries are incentivized to extract models for two main purposes (Sect. 1):

  • Transfer of adversarial examples. Forging adversarial examples consists of finding minimal modifications δ for an input x of class c such that x + δ is classified as c′ ≠ c by a model F. In this paper we focus on targeted adversarial examples [45], meaning that the target class c′ is selected by the adversary. (In contrast, prior work has focused on transferability of untargeted adversarial examples [16, 40], where the adversary does not care what c′ is as long as c′ ≠ c. Untargeted adversarial examples for multi-class prediction models can be easily crafted using random noise [4], thereby obviating the need for the adversary to bother with model extraction in the first place; see Appendix A for further discussion.) Efficient techniques for forging adversarial examples require knowledge of the internals of the target model [9, 16]. The substitute model F′ can be used to craft adversarial examples x + δ such that F′(x + δ) = c′. Such an adversarial example is transferable to the target model if F(x + δ) = c′.

  • Reproduction of predictive behavior. The purpose of the substitute model F′ is to reproduce as faithfully as possible the predictions of F for a known subspace S of the whole input space X, i.e., F′(x) ≈ F(x) for x ∈ S. The adversary wants to build F′ by performing a minimum number of prediction requests to F. F′ can later be used to obtain an unlimited number of predictions.

2.3 Adversary Model

Attack surface. We consider any scenario where the target model is isolated from clients by some means. This can be remote isolation, where the model is hosted on a server, or local isolation on a personal device (e.g., smartphone) or an autonomous system (e.g., self-driving car, autonomous drone). We assume local and remote isolation provide the same confidentiality guarantees and that physical access does not help the adversary overcome the isolation. Such guarantees can typically be enforced by hardware-assisted TEEs [13]. The increasing availability of lightweight hardware enhanced for DNNs [24] and the rise of federated learning will push machine learning computation to the edge [27, 44]. Local isolation will become increasingly adopted to protect these models.

Capabilities. The adversary has black-box access to the isolated target model. He only knows the shape of the input (x) and output (F(x)) of the model. He can submit samples x to be processed by the model and gets the output prediction F(x), which is an m-dimensional vector containing the probabilities p_j that x belongs to each class c_j for j ∈ {1, …, m}. Classes are meaningful to the adversary, e.g., they correspond to digits, vehicle types, prescription drugs, etc. Thus, the adversary can assume what the input to the model looks like even though he may not know the exact distribution of the data used to train the target model. (This is similar to [16] but differs from [51], which assumes that the adversary has no information about the purpose of classification, i.e., no intuition about what the input must look like. We consider that classes must be meaningful to provide utility to a client and that the adversary has access to a few natural samples for each class.)

Goal. The goal of the adversary is to train a substitute model F′ with maximal performance with respect to the adversary goals identified in Sect. 2.2: transfer of targeted adversarial examples or reproduction of predictive behavior. He seeks to minimize the number of prediction queries to F in order to: (1) avoid detection and prevention of the attack by, e.g., a query rate limiting process, (2) limit the amount of money spent on predictions, in the case of MLaaS prediction APIs, and (3) minimize the number of natural samples required to query the model.

2.4 Systematized Model Extraction Pipeline

We present a systematized pipeline for extracting confidential DNN models. Assuming a target model F, we want to learn a substitute model F′ to mimic F. We train F′ using a minimum number of labeled training samples, as defined by a maximum prediction query budget to F. We start this process with a randomly initialized substitute model F′ and an initially empty set of labeled training samples D = ∅. We follow a model extraction process in six steps as follows:

  1. Selection of model hyperparameters. We select a DNN architecture and hyperparameters to use for our substitute model F′. The extraction of DNN hyperparameters can be performed in a black-box setting as described in our adversary model using recent attacks [38, 52] to infer the number and type of layers as well as activation functions.

  2. Initial data collection. We compose an initial set of unlabeled samples (seed samples) that forms the basis of the extraction attack. These are selected in accordance with the adversary’s capabilities: knowledge of the model input shape and of what the prediction classes mean.

    Duplication round (iterative steps)


  3. Target model prediction queries. All, or part, of the unlabeled samples are queried to the target model F to get predictions F(x). The result is a set of new labeled samples (x, F(x)) that is added to D.

  4. Training of substitute model. Labeled samples from D are used to train F′ using a specific training strategy. For instance, the same training strategy as the target model’s can be inferred using recent black-box attacks [38].

  5. Synthetic sample generation. We increase our coverage of the input space in areas relevant to our adversary goals by generating synthetic samples. This generation typically leverages the knowledge of the target model acquired so far: the labeled training samples D and the current substitute model F′. The synthetic data is allocated to the set of unlabeled samples, which is used again in step (3).

  6. Stopping criteria. We repeat the duplication round (steps 3-5) until the prediction query budget is consumed or until we reach a criterion that assesses the success of the extraction attack. The outcome is a substitute model F′ with maximum performance in adversarial example transferability or in reproduction of predictive behavior.
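The six steps above can be sketched as a generic extraction loop. In the sketch below, `target_predict`, `train` and `synthesize` are caller-supplied stand-ins for steps (3)-(5); the one-dimensional toy target and trivial "training" function in the usage example are illustrative placeholders, not actual attack components:

```python
def extract_model(target_predict, seed_samples, budget, rounds,
                  train, synthesize):
    """Sketch of the six-step extraction pipeline (steps 2-6)."""
    D = []                               # labeled training set, initially empty
    queries_left = budget                # maximum prediction query budget
    to_query = list(seed_samples)        # step (2): initial data collection
    substitute = None
    for _ in range(rounds):              # duplication rounds
        batch = to_query[:queries_left]  # step (3): query the target model
        D += [(x, target_predict(x)) for x in batch]
        queries_left -= len(batch)
        substitute = train(D)            # step (4): train the substitute
        if queries_left <= 0:            # step (6): stopping criterion
            break
        to_query = synthesize(substitute, D)  # step (5): synthetic samples
    return substitute

# Toy usage: a 1-D threshold "model" stands in for the target.
target = lambda x: int(x > 0.5)
train = lambda D: (lambda x: int(x > 0.5))   # trivial stand-in for training
synth = lambda model, D: [x + 0.1 for x, _ in D]
sub = extract_model(target, [0.2, 0.7], budget=10, rounds=3,
                    train=train, synthesize=synth)
```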

2.5 Prior Model Stealing Attacks

We present two main techniques that have been introduced to date for model extraction. These are used as a baseline to compare the performance of our novel extraction attack (Sect. 4) and to evaluate our detection approach (Sect. 5).

2.5.1 Tramer attack [51]

Tramer et al. introduced several attacks to extract simple ML models, including logistic regression models with an equation solving attack and decision trees with a path finding attack. Both are highly efficient and require only a few prediction queries, but they are limited to the simple models mentioned. The authors also introduce an extraction method targeting DNNs, which we present below according to our pipeline.

The same model architecture, hyperparameters and training strategy as in F are used (1, 4). The initial data consists of a set of uniformly selected random points of the input space (2). No natural samples are used.

The main contribution lies in the target model prediction queries (3), where they introduce three strategies for querying additional data points from F. The first strategy selects these samples randomly. The second is called line-search retraining and selects new points closest to the decision boundary of the current substitute model using a line search technique. The last strategy is adaptive retraining, which follows the same intuition of querying samples close to the decision boundary. However, it employs techniques from active learning [11] to select these samples.

The duplication round is repeated until the prediction query budget is consumed (6), with a fixed number of samples requested from the target model during each iteration.

2.5.2 Papernot attack [40]

Papernot et al. introduced a model extraction attack specifically designed to forge transferable untargeted adversarial examples for DNNs. We present this technique according to our pipeline.

Expert knowledge is used to select a model architecture “appropriate” for the classification task of the target model (1). This selection does not require knowledge of hyperparameters. The initial data consists of a small set of natural samples (2). Seed samples are balanced (same number of samples per class) and their required number increases with the model input dimensionality. Two strategies are proposed to query the target model (3): one queries the whole set, while the other, called reservoir sampling, queries a random subset of samples from it. Unselected samples are discarded. No specific training strategy is recommended, beyond noting that retraining requires only a small number of epochs (10) (4).

They introduce the Jacobian-based Dataset Augmentation (JbDA) technique for generating synthetic samples (5). It relies on computing the Jacobian matrix of the current substitute model F′ evaluated on the already labeled samples in D. Each sample x is modified by adding the sign of the Jacobian matrix dimension corresponding to the label assigned to x by the target model F: x′ = x + λ · sign(J_{F′}[F(x)]). The resulting set of synthetic samples has the same size as D, which means that the number of samples doubles at each iteration.

The duplication round is repeated for a predefined number of iterations called substitute training epochs (6).
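For illustration, a minimal numpy sketch of JbDA-style augmentation is given below. It estimates the Jacobian sign by finite differences on a toy two-class softmax model rather than by backpropagation through a real DNN; the weight matrix and step size λ are arbitrary example values:

```python
import numpy as np

def jacobian_sign(f, x, cls, eps=1e-5):
    # Finite-difference estimate of sign(d f(x)[cls] / dx),
    # i.e. the sign of one row of the Jacobian matrix of f.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d)[cls] - f(x - d)[cls]) / (2 * eps)
    return np.sign(g)

def jbda_augment(f, samples, labels, lam=0.1):
    # Each labeled sample spawns one synthetic sample x' = x + lam*sign(J),
    # so the labeled set doubles at each iteration.
    return np.array([x + lam * jacobian_sign(f, x, c)
                     for x, c in zip(samples, labels)])

# Toy substitute model: softmax over a fixed linear map.
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
def f(x):
    z = W @ x
    e = np.exp(z - z.max())
    return e / e.sum()

X = np.array([[0.2, 0.8], [0.9, 0.1]])
y = [int(f(x).argmax()) for x in X]   # labels as assigned to each sample
X_syn = jbda_augment(f, X, y)
```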

3 Generic DNN Model Extraction

Techniques proposed to date [51, 40] are narrowly scoped and explore solutions for only some of the required steps (cf. Sect. 2.4). We propose a novel approach for DNN model extraction and investigate several strategies for two main steps of the model extraction process: training of the substitute model (4) and synthetic sample generation (5). In addition, in Sect. 4.2 we explore the impact of natural sample availability during initial data collection (2).

We propose two non-exclusive strategies for training the substitute model: dropout and optimization with prediction probabilities. We introduce novel synthetic sample generation approaches (5) by generalizing the Jacobian-based Dataset Augmentation (JbDA) [40]. This method creates synthetic samples using a technique for crafting adversarial examples called the Fast Gradient Sign Method (FGSM) [16]. We extend this work with two methods: the Jb-topk strategy explores directions towards the k most likely other classes, while Jb-self explores the direction of the sample’s own class (the opposite of JbDA) to generate synthetic samples.

3.1 Jacobian-based Synthetic Sample Generation

We generate synthetic samples by modifying already labeled samples x ∈ D. Our goal is to explore the space around these samples in certain directions of interest. These directions differ according to the two strategies we introduce, Jb-topk and Jb-self, as illustrated in Fig. 1. We craft our synthetic samples using the Jacobian matrix J_{F′} of the substitute model. J_{F′} is updated at each duplication round as F′ is retrained, improving the quality of our synthetic samples.

Figure 1: Simplistic illustration of the synthetic sample generation strategies Jb-topk and Jb-self compared to JbDA.

Jb-topk

The intuition behind this generation method is to provide an accurate approximation of the boundaries between classes that are spatially close to each other. To perfectly reproduce the decisions of F, F′ must have the same boundaries. Exploring all boundaries between every pair of classes may be expensive, as it requires generating many synthetic samples, and unnecessary, as two classes may not be spatially proximate (cf. Fig. 1). This is the rationale for developing Jb-topk, in which we modify a sample in k different ways to produce k synthetic samples, each being closer to one of its k nearest classes. Figure 1 illustrates Jb-topk.

We use a technique for crafting targeted adversarial examples in order to generate our synthetic samples. Targeted adversarial examples probe the decision boundary between two given classes c and c_t. They can be crafted using the optimization equation:

x′ = x + δ_{c_t},    (3)

where x characterizes the base sample, x′ is the synthetic sample, and δ_{c_t} characterizes a perturbation in the direction of the target class c_t, computed using the Jacobian matrix dimension corresponding to the class c_t.

We select the target classes for synthetic samples generation using the following principle:

  1. calculate the classification probabilities p_j for all classes c_j and all samples x ∈ D using F′.

  2. exclude the class ĉ(x) of each sample and rank the remaining classes by decreasing probability p_j.

  3. select the top-k classes (in terms of p_j) as targets and create k synthetic samples, one for each target class.

The number of generated samples increases by a factor of k during each duplication round. Jb-topk creates samples that F′ expects to be misclassified. This strategy lets us refine the accuracy of F′ in the areas of the input space that are the most difficult to separate.
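The target-class selection above amounts to ranking the substitute model’s output probabilities and excluding the predicted class; a minimal sketch, using an arbitrary example probability vector:

```python
import numpy as np

def topk_targets(probs, k):
    """Return the k most likely classes for one sample, excluding the
    predicted class itself (steps 1-3 of the selection principle)."""
    pred = int(np.argmax(probs))                  # class c-hat(x) to exclude
    ranked = [int(c) for c in np.argsort(probs)[::-1]]  # descending p_j
    return [c for c in ranked if c != pred][:k]

probs = np.array([0.05, 0.55, 0.25, 0.10, 0.05])  # example F'(x)
targets = topk_targets(probs, k=2)                # two nearest other classes
```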


Jb-self

The intuition behind this generation method is to find the centroids of classes, i.e., where the data used for training the target model lies in the input space. We posit that the most effective way to reproduce a model’s decisions is to obtain the same training data and apply the same training strategy. This is the rationale for Jb-self, in which we modify each sample to get a new sample closer to the centroid of its class by moving in the direction of its own (self-) class c.

This can be characterized by the optimization equation:

x′ = x + δ_c,    (4)

where x characterizes the base sample, x′ is the synthetic sample and δ_c characterizes a perturbation in the general direction of the sample’s own class c, computed using the Jacobian matrix dimension corresponding to c. Critically, the generated samples are “non-adversarial examples”: the optimization results in the production of a prototypical sample of class c.

Jacobian direction calculation

We use Projected Gradient Descent (PGD) [31] to compute our synthetic samples from the introduced Eq. (3) and (4). We use the following PGD definition:

x^(i+1) = clip_{x,ε}( x^(i) + λ · sign( J_{F′}[c](x^(i)) ) ),    (5)

where J_{F′}[c] denotes the Jacobian operator with regard to class c, i.e., the gradient of the class-c output of F′ with respect to the input. The perturbation is calculated over multiple iterations and the overall perturbation is bounded by the infinity norm ε. By default, we use 40 iterations to calculate adversarial examples. PGD provides good control over perturbation sizes and is fast to compute.

3.2 Using oracle information leakage

Prior work on model extraction did not reach a consensus on the usefulness of prediction probabilities provided by the target model. Some found them to be a game changer that greatly improves attack efficacy [51], while others state that they are virtually meaningless to explore [40]. We study their effect as part of our optimization strategy during the training of the substitute model.

Depending on what F outputs, an adversary can optimize:

  1. either the negative log-likelihood loss (NLL) [35] for matching classes

  2. or the Kullback-Leibler divergence (KLDiv) [35] for matching probabilities

Matching probabilities using KLDiv is achieved by minimizing the expression:

L = Σ_j p_j ( log p_j − log softmax(z)_j ),    (6)

where z_j is the logit for class c_j (assigned by the local model to input sample x), and p_j is the corresponding probability assigned by F. NLL is a special case of KLDiv in Eq. (6) where a single class has probability 1 and the other classes have probability 0. Thus it is clear that if the server reveals probabilities, clients learn more information about the co-dependencies of classes.
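The KLDiv objective and its NLL special case can be checked numerically; the probability vector and logits below are arbitrary example values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kldiv_loss(p_target, logits):
    # KL(p || softmax(z)) = sum_j p_j (log p_j - log q_j); terms with
    # p_j = 0 contribute nothing, which is why a one-hot p_target
    # reduces this to the negative log-likelihood (NLL).
    q = softmax(logits)
    mask = p_target > 0
    return float(np.sum(p_target[mask] *
                        (np.log(p_target[mask]) - np.log(q[mask]))))

p = np.array([0.7, 0.2, 0.1])        # probabilities returned by the target
logits = np.array([2.0, 0.5, 0.0])   # substitute-model logits
loss = kldiv_loss(p, logits)

one_hot = np.array([1.0, 0.0, 0.0])  # class label only (no probabilities)
nll = kldiv_loss(one_hot, logits)    # reduces to -log q_0
```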

3.3 Training strategy

While prior work took a fixed approach [51, 40] for training the substitute model, different training strategies drastically affect the performance of DNNs [35]. We want to evaluate the impact of different training strategies on the performance of the substitute model. In particular, since model extraction relates to model training with a limited number of labeled samples (prediction queries to the target model), we explore the use of a regularization method, dropout, that can provide better generalizability to the model.


Dropout

Dropout gets its name from randomly dropping out connections during training of the DNN. It is seen as a regularizer that improves network performance by averaging out classification results over several deformations of the network. Connections are not dropped during the prediction phase [35].
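A minimal sketch of (inverted) dropout for one fully connected layer; the layer sizes and dropout rate are arbitrary example values. Units are dropped only during training, and surviving activations are rescaled so that prediction needs no correction:

```python
import numpy as np

rng = np.random.default_rng(1)

def dense(x, W, drop_p=0.0, training=False):
    # One fully connected layer with ReLU activation.
    h = np.maximum(W @ x, 0)
    if training and drop_p > 0:
        # Drop each unit with probability drop_p; rescale the survivors
        # by 1/(1-p) ("inverted dropout") so inference is unchanged.
        mask = rng.random(h.shape) >= drop_p
        h = h * mask / (1.0 - drop_p)
    return h

W = rng.standard_normal((8, 4))
x = rng.standard_normal(4)
h_train = dense(x, W, drop_p=0.5, training=True)   # stochastic
h_infer = dense(x, W, drop_p=0.5, training=False)  # deterministic
```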

We combine the usage of prediction probabilities (Sect. 3.2) and dropout (Sect. 3.3) to create four training strategies:

  • plain: plain training

  • p: training using prediction probabilities

  • d: training with dropout

  • p+d: training using prediction probabilities and dropout

4 DNN Model Extraction: Evaluation

We investigate several properties of model extraction attacks and construct systematized tests to understand their effect. We first evaluate the co-dependencies between the available number of seed samples, our proposed training strategies and extraction effectiveness. Then we evaluate the performance of our attacks and show they outperform the state-of-the-art. Finally we study the impact of the complexity of the target model on extraction effectiveness.

4.1 Experiment Setup

Datasets and target model description

MNIST      GTSRB
maxpool2   maxpool2
conv2-64   conv2-64
maxpool2   maxpool2
FC-200     FC-200
FC-10      FC-100

Table 1: Model architectures for the MNIST (left) and GTSRB (right) target models (ReLU activations between blocks).

We evaluate two datasets: MNIST [29] for digit recognition and GTSRB [46] for traffic sign recognition. MNIST contains 70,000 images of 28×28 grayscale digits (10 classes). GTSRB contains approximately 36,000 images in the training set, and approximately 12,000 images in the test set. Images in GTSRB have different shapes (15×15 to 215×215); we normalize them to 32×32. We additionally scale feature values for both datasets to a common range.

We use the model architectures depicted in Tab. 1 for training our target models. For MNIST, we use 55,000 samples for target model training. We reserve 5,000 samples as the adversary set and 10,000 samples as an independent test set. For GTSRB, we use 30,000 images for training and 6,000 for validation of the target model. We reserve the 12,000 test images as the adversary set, and use the 30,000 training images as the test set for comparing classification agreement. We train both models for 500 epochs using Adam [15] with a learning rate of 0.01. We use the validation set to select the best GTSRB model, while we select the model trained at the last epoch for MNIST.

(a) MNIST target model
(b) GTSRB target model
Figure 2: Median F-agreement (solid lines) and transferability (dashed lines) for different number of seed samples. Shading represents 25th to 75th percentiles. p=probabilities, d=dropout, plain=neither.
Performance metrics

We evaluate the performance in reproduction of predictive behavior using the F-agreement metric. It corresponds to computing the macro-averaged F-score for the substitute model predictions, taking the target model’s predictions as ground truth. This metric faithfully reports the effectiveness of an attack even when classes are imbalanced (the MNIST dataset is balanced while GTSRB is imbalanced). F-agreement is computed using the MNIST and GTSRB test sets.
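F-agreement can be computed directly from the two models’ predictions; a minimal numpy implementation with a small synthetic example (the prediction arrays are illustrative, not experimental data):

```python
import numpy as np

def f_agreement(target_preds, substitute_preds, n_classes):
    """Macro-averaged F-score of substitute predictions, with the target
    model's predictions taken as the ground truth."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((substitute_preds == c) & (target_preds == c))
        fp = np.sum((substitute_preds == c) & (target_preds != c))
        fn = np.sum((substitute_preds != c) & (target_preds == c))
        if tp == 0:
            scores.append(0.0)
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        scores.append(2 * prec * rec / (prec + rec))
    # The macro average weights every class equally, so minority classes
    # count as much as majority ones (relevant for imbalanced GTSRB).
    return float(np.mean(scores))

t = np.array([0, 0, 1, 1, 2, 2])   # target model predictions
s = np.array([0, 0, 1, 2, 2, 2])   # substitute model predictions
score = f_agreement(t, s, n_classes=3)
```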

We measure transferability of adversarial examples as follows:

  1. select samples from each class in the MNIST adversary set and in the GTSRB test set, respectively.

  2. modify these samples using the PGD algorithm for crafting adversarial examples [31] applied to our substitute model F′. For each seed sample of class c, generate targeted adversarial examples, each targeting one of the classes different from c, with a dataset-specific maximum perturbation ε for MNIST and GTSRB.

  3. query F with the adversarial samples, regardless of their success at fooling F′.

  4. count a transferability success if an adversarial sample is classified by F as its intended target class.

  5. compute the transferability success rate as the fraction of successes over all generated adversarial examples.

Since GTSRB contains a large number of classes (43), we define macro-classes for evaluating transferability success. We group the original road signs by shape and color, resulting in 8 macro-classes: (1) warning signs, (2) yield, (3) stop, (4) priority, (5) red circle, (6) blue circle, (7) gray circle and (8) no entry. This means, for example, that transferability for an adversarial example targeting “bumpy road”, forged using a stop sign as a basis, is considered successful if its macro-class is recognized by the target model. We record a successful transfer if the target model recognizes, e.g., “general warning”, as both belong to the same macro-class: warning signs.
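The macro-class matching rule can be sketched with a hypothetical label-to-macro-class mapping; the label names below are illustrative stand-ins, not the exact GTSRB class names, and only a few of the 43 classes are shown:

```python
# Hypothetical mapping from a handful of GTSRB-style sign labels to the
# paper's 8 macro-classes (shape/color groups).
MACRO = {"stop": "stop", "yield": "yield",
         "bumpy_road": "warning", "general_warning": "warning",
         "speed_30": "red_circle", "speed_50": "red_circle"}

def transfer_success(target_label, intended_target):
    # A transfer counts as successful when the target model's prediction
    # falls in the same macro-class as the intended adversarial target.
    return MACRO[target_label] == MACRO[intended_target]

hits = [transfer_success("general_warning", "bumpy_road"),  # same macro-class
        transfer_success("stop", "bumpy_road")]             # different
rate = sum(hits) / len(hits)   # transferability success rate
```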

4.2 Availability of seed samples

We initially explore the connection between the number of seed samples, training strategy and model extraction efficiency. No synthetic data is generated nor queried. We train substitute models (for 200 epochs) using an increasing number of natural samples from MNIST adversary set and GTSRB adversary set. These sets have no overlap with target model’s training set. We test the four training strategies: plain, dropout (d), probabilities (p) and p+d.

F-agreement is measured on a test set that the attacker does not have access to: the test set in MNIST and target’s training set on GTSRB. We chose this setup because GTSRB images are not truly independently distributed: the smallest classes have only 3 objects photographed at 20 different distances.

We show the median performance (for 5 runs) as solid lines, and denote uncertainty by marking out 25%-75% percentile regions. Transferability is shown with dashed lines (For clarity, transferability with random data is a black dashed line).

Fig. 2(a) shows the dependency between the number of seed samples and model extraction agreement for MNIST. Agreement is shown with solid lines and transferability with dashed lines.

Intuitively, both model extraction agreement and transferability should increase as the number of seed samples increases. Agreement in prediction results increases logarithmically: the attacker gains most of the benefit already with very few seed samples. We see approximately 75% agreement at 50 samples and 90% agreement at 200 samples. The best training method in terms of agreement is training with dropout and probabilities.

However, the increase in transferability is very slow or nonexistent compared to the increase in agreement: an increase in agreement does not imply an increase in transferability. We believe this is due to the non-overlapping training sets of the target and the attacker. We see that access to probabilities is critical for high targeted transferability on MNIST. Dropout is unhelpful for crafting adversarial examples on MNIST. This may be due to the special structure of the dataset: non-adversarial examples only occupy the center of the image, whereas adversarial examples typically make use of the edges of the image, which do not contain legitimate data. The use of dropout may interfere with finding the correct structure of the model. Recall that our focus here is on transferability of targeted adversarial examples. Crafting untargeted adversarial examples is possible with all our trained models (see Appendix A).

(a) MNIST target model
(b) GTSRB target model
Figure 3: Median F-agreement (solid lines) and transferability (dashed lines) w.r.t. number of queries (MNIST: 50 seeds + rest synthetic, GTSRB: 430 seeds + rest synthetic). Shading represents 25th to 75th percentiles. p=probabilities, d=dropout, Jb-topk/Jb-self/Papernot/Tramer are query strategies.

Fig. 2(b) shows similar metrics for GTSRB. We use 8 macro-classes for transferability: a successful transfer occurs if both models agree on the macro-class (Section 4.1). The rise in agreement is somewhat different on GTSRB compared to MNIST. All training methods reach an agreement of approximately 25-35% at 430 seed samples. The agreement is considerably lower than on MNIST. This may be due to dissimilarity between the GTSRB test set (adversary’s set) and training set (target’s set).

Transferability is highest for dropout+probabilities training. Noticeably, transferability does not markedly increase with more seed samples. GTSRB has more natural variation in the dataset, which we believe makes crafting more difficult. The attacker model may have several false patterns to follow, e.g. low-contrast images and greenery backgrounds. In fact, the use of dropout during training seems necessary to reach higher transferability, as it may help in discarding some of the pattern alternatives. With more data (1290 – 2150 seeds), the correct patterns become more visible, which decreases the usefulness of dropout in comparison to plain training.

4.3 Synthetic sample generation

We assess our two model extraction attack variants Jb-topk and Jb-self, and compare them to Papernot attack and Tramer attack (adaptive retraining variant). For Papernot attack, we reproduce the exact setting reported in [40], except that the number of natural seed samples is changed as follows. We select 50 (5 per class, MNIST) and 430 (10 per class, GTSRB) natural seed samples because these are optimal values (the “knees” of the curves in Figs. 2(a) and 2(b)). The natural samples are randomly picked from the adversary datasets. We test four training strategies: plain, dropout (d), probabilities (p) and p+d, and train the substitute models for 200 epochs at each duplication round.

We use two different settings for the maximum perturbation in synthetic sample generation. For MNIST, we use a fixed maximum perturbation. For GTSRB, we explore a different strategy for setting the perturbation, since we found that a large maximum perturbation performed poorly on complex datasets.
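As a concrete illustration, a single Jacobian-based augmentation step of the kind used to generate synthetic samples (a perturbation of fixed magnitude in the sign direction of the substitute model's gradient) can be sketched as follows. This is a minimal sketch: the function name, toy inputs and step size are ours, and the gradient is assumed to be supplied by the substitute model.

```python
import numpy as np

def jbda_step(x, grad, lam=0.1):
    """One Jacobian-based augmentation step: perturb the seed sample x
    in the sign direction of the substitute model's gradient, with a
    maximum perturbation `lam`, clipped back to the valid pixel range.
    `grad` stands in for the Jacobian row of the predicted class."""
    x_new = x + lam * np.sign(grad)
    return np.clip(x_new, 0.0, 1.0)

# Toy example: a 4-pixel "image" and an arbitrary gradient direction.
x = np.array([0.5, 0.2, 0.9, 0.0])
g = np.array([1.0, -2.0, 3.0, 0.0])
print(jbda_step(x, g, lam=0.1))
```

Each synthetic query thus lies at a fixed distance from its seed, which is also what makes these samples detectable later (Sect. 5).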

Fig. 3(a) shows the results for MNIST. Median performances are illustrated as solid lines and transferability as dashed lines (5 repetitions); the 25th and 75th percentiles are shown as shaded regions. We show the mean performance of Tramer attack (adaptive) and Papernot attack (substitute model) for completeness. We train the substitute models using the same architecture as the target model. Tramer attack performs worst, and we allow more queries for this method (20,000). Its mean agreement is illustrated with a green ball, and transferability with an x. On average, Tramer attack reaches 16.7% agreement, while Papernot attack reaches 70.7% agreement using 3200 queries. We train the substitute model according to the description in [40]. Neither model is suited for subsequent targeted transferability attacks: when we construct targeted adversarial examples using PGD, we reach 17% targeted transferability with the model extracted by Papernot attack and 2.3% with the model extracted by Tramer attack.

However, with our methods, we observe a steady increase in agreement as more synthetic data is used: it starts at almost 75% agreement (50 seeds) and on average reaches 84.5–88.8% agreement after 3200 queries (50 seeds + 3150 synthetic queries) using various training methods and Jb-topk synthetic samples. Agreement is 14–18 pp better compared to previous methods [40]. We report a few values of k for Jb-topk, but all parameterizations of Jb-topk behaved similarly.

Synthetic queries that make use of probabilities perform best in terms of transferability. We observe that probability-based adversarial examples increase targeted adversarial transferability substantially: the final transferability is 64.64% for Jb-top7 p (up from 15% using seed samples only). Even without access to probabilities, Jb-top3 and Jb-top7 reach 49.10% and 35.20% targeted transferability, respectively, at 3200 queries, which is 18 – 32 pp better than prior work [40].

Jb-topk performs well in terms of transferability and agreement, whereas Jb-self performs poorly on both metrics. We can thus conclude that the benefit from synthetic samples comes from their ability to probe the decision boundary rather than from increasing the attacker’s training set size.

Fig. 3(b) shows results for GTSRB. We observe an increase during the first few duplication rounds, but the agreement does not significantly improve after that. We observe a 15 percentage point increase from 30-35% to 45-50% using synthetic data queries.

On GTSRB, we do not see an increase in transferability with more synthetic queries. We hypothesize that this is due to the low maximum perturbation used for crafting synthetic samples. We observe several interesting facts in Fig. 3(b): 1) transferability remains at random levels if the attacker does not have access to probabilities; 2) even then, attackers need to regularize the training of the network in order to benefit from probabilities. We see this as a direct consequence of the data in GTSRB: many images resemble images of other classes, e.g. dark images and green backgrounds. Consequently, the DNN may construct incorrect hypotheses of what data from a particular class looks like. We can conclude that targeted transferable attacks are difficult with low/no overlap between attacker and target training sets, and that in the absence of enough training data, regularization (e.g. dropout) is necessary to reach good transferable performance.

Unsurprisingly, our attack is better than Tramer attack and Papernot attack already before the first duplication round, i.e., it is not necessary to train the substitute model with synthetic data to reach high transferability or model extraction agreement. However, we demonstrate that it is important to control the learning of the substitute model to reach better results with a model extraction pipeline.

We can see that high transferability requires high agreement. Synthetic samples can help in mapping decision boundaries, which improves targeted transferability. However, natural samples are much more valuable to the attacker than synthetic samples for reaching high agreement: the attacker needs 3200 queries (50 natural + 3150 synthetic) to reach the same level of agreement as is possible with 200 natural queries.

4.4 Complexity of target model

Prior research has reported conflicting results regarding the attacker’s advantage when oracle confidence values are revealed. Access to oracle classification confidences has been shown to be beneficial in model extraction attacks [51]. However, the benefit over labels-only access is shown to have diminishing returns with more complex models.

In DNNs, access to class probabilities is ruled out as out-of-scope in [40], since the authors assume that the server deploys gradient masking as a defense. Gradient masking is not widely deployed in practice, because service providers wish to provide plausible classification confidences with predictions (gradient masking destroys this utility).

We evaluated the advantage that the attacker gains from using classification probabilities across different network complexities (measured by the number of layers and by the number of parameters in the network). We evaluate the results on MNIST using our techniques. The alternative network structures are reported in Tab. 3 (Appendix). To measure the impact of parameters, we additionally constructed 4 alternative networks of “4 layers” by multiplying the number of feature maps and fully connected units by a scaling factor. Note that network structures like “1 layer” and “2 layers” are used in support vector machines, logistic regression, extreme learning machines and multilayer perceptrons, and are very common in industry. We assume that the attacker uses the same model architecture as the target. Each target model is trained for 50 epochs.

Figure 4: MNIST: mean targeted transferability with 50 natural samples after 2 duplication rounds (+750 synthetic queries). Top image: impact of different network depths. Bottom: impact of number of parameters in network.

We show our results in Fig. 4. Bar heights report the mean targeted transferability using 50 seed samples and two duplication rounds with Jb-top3. Results are obtained by repeating each experiment 5 times.

We can see several patterns in Fig. 4. Attack effectiveness increases with synthetic data, similarly to Section 4.3. Attack effectiveness decreases as the number of layers grows. This coincides with the folklore that added nonlinearity increases the difficulty of transferability attacks on networks. The simpler the network, the easier it is to create adversarial examples that succeed at transferability attacks (e.g. 98% effectiveness on “1 layer”). Deeper networks have more nonlinear dependencies, which are difficult to model with a small amount of training data.

We see that wider networks are more vulnerable to adversarial examples: the more parameters the “4 layers” DNN has, the easier it is to attack. The results suggest that thin, but deep, models are more resilient to adversarial examples.

4.5 Takeaways

Seed samples: Natural seed samples are necessary to extract a substitute DNN model that reproduces the target’s predictive behavior well; the more natural samples the adversary has, the better the performance of the model. However, an increasing number of natural samples has little impact on the success of transferability for targeted adversarial examples.

Synthetic sample generation: A relevant synthetic sample generation method improves the transferability of adversarial examples severalfold. The best strategy consists of exploring directions towards other close classes (Jb-topk). Synthetic samples also significantly improve the reproduction of predictive behavior, while remaining less effective than natural samples.

Training strategy: The use of prediction probabilities rather than class labels improves both the reproduction of predictive behavior and the transferability of targeted adversarial examples in any setup. Regularization methods like dropout increase transferability in any setup. They have a negative impact on reproduction of predictive behavior, though, when dealing with a scattered data distribution inside a single class (GTSRB).

Target model complexity: The use of deeper DNN models (several hidden layers) degrades the adversary’s ability to craft transferable adversarial examples. We found that thinner models (lower number of parameters) are more resilient to adversarial transfer. Model depth and number of parameters have a marginal effect on the reproduction of predictive behavior.

5 Detecting Model Extraction

We present PRADA, a generic approach to detect model extraction attacks. Unlike prior work on adversarial machine learning defenses, e.g., for detecting adversarial examples [18, 32], our goal is not deciding whether individual queries are malicious but rather detecting attacks that span several queries. Thus, we do not rely on modeling what queries (benign or otherwise) look like but rather on how successive queries relate to each other. PRADA is generic in that it makes no assumptions about the model or its training data.

5.1 Detection approach

 1: Let F denote the target model, S the stream of samples queried by a given client, G_c the growing set, D_c the set of minimum distances d_min, T_c the threshold value (the sets are kept separately for each class, denoted by the index c), W the fixed window size, δ the detection parameter.
 2: G_c ← ∅, D_c ← ∅, T_c ← 0 for every class c
 3: for each new sample x in S do
 4:     c ← F(x)
 5:     if G_c = ∅ then # set and threshold initialization
 6:         G_c ← {x}
 7:     else
 8:         for all y ∈ G_c do # pairwise distance calculation
 9:             compute d(x, y)
10:         end for
11:         d_min ← min over y ∈ G_c of d(x, y) # distance to closest element
12:         if d_min > T_c then # set and threshold update
13:             G_c ← G_c ∪ {x}, D_c ← D_c ∪ {d_min}, update T_c from D_c
14:         end if
15:     end if
16:     if |S| ≥ 2W then # gradient evolution of |G|
17:         R ← ratio of change of |G| over the two last windows # see Eq. (7)
18:         if R > δ then # attack detection test
19:             flag client as attacker
20:         else
21:             continue with the next sample
22:         end if
23:     end if
24: end for
Algorithm 1 PRADA’s detection of model extraction

We start by observing that (1) model extraction requires making several queries to the target model and (2) queried samples are specifically generated and/or selected to extract maximal information. Successive samples submitted by an adversary are expected to have a characteristic distribution that evolves in a more unstable manner than the distribution of samples submitted in benign queries. Thus, PRADA’s detection method is based on detecting abrupt changes in the distribution of samples submitted by a given client.

Figure 5: Drop in the minimum distance after the switch from natural to synthetic samples in Papernot attack (for visual clarity, we show only four MNIST classes chosen at random).
Figure 6: Evolution in size of the growing set G for Papernot attack, Jb-topk attack and benign queried samples. Attacks exhibit abrupt changes.
Figure 7: MNIST target model
Figure 8: GTSRB target model
Figure 9: FP rate w.r.t. speed of attack detection for different combinations of window size and detection threshold. Vertical line = switch from natural to synthetic sample queries. The black line represents the Pareto front [34].

Consider the stream of samples queried by a single client from the target model. For each newly queried sample, we calculate the minimum distance to the previously queried samples of the same class. By doing so, we want to identify a characteristic deviation in this distance that highlights a change in the distribution of queried samples, i.e., samples abnormally close to or far from any previously queried sample. For efficiency, we do not keep track of all past queries, but incrementally build a growing set for each class, which consists only of samples whose minimum distance is above a class-specific threshold value. We define this threshold as the mean minus the standard deviation of the minimum distances between elements already in the growing set. The distance for a new sample is computed w.r.t. only the elements in the growing set of its class.

Figure 5 illustrates the intuition behind why our approach can effectively detect model extraction. It depicts the evolution of the minimum distance for samples generated during a Papernot attack on our MNIST target model. A clear drop in the minimum distance can be seen for all MNIST classes after five queried samples per class, which corresponds to the start of synthetic sample generation in this attack scenario. Such trends are typical for all known attacks.

Our attack detection criterion is based on the evolution in size of the overall growing set w.r.t. the number of queried samples. Algorithm 1 describes PRADA’s detection technique in detail. We look for an abrupt change in the gradient of the discrete function mapping the number of queries to the size of the growing set, which we detect by computing its derivative numerically. Figure 6 depicts this phenomenon for the Jb-topk and Papernot attacks on our MNIST target model and contrasts them with queries from a benign client. The growing set for a legitimate client has a relatively steady evolution. In contrast, when the attacks start generating synthetic samples (after 50 queries), they exhibit abrupt changes visible as pronounced elbow shapes in the curves. Papernot attack regenerates some synthetic samples that are distant enough from previous samples after two duplication rounds, which explains the restart of steady growth of the set after 150 samples. The detection process starts once a client has queried at least twice the window size’s worth of samples. We compute the derivative numerically over the current and the previous window and take the ratio of change between the two derivatives, as given in Eq. (7), where a small constant is added for numerical stability. If this ratio is above a detection threshold, PRADA flags an extraction attack.


PRADA requires the defender to set two parameters: the window size and the allowed ratio of change. It also needs a domain-specific distance metric to compute distances between inputs. We use the L2 (Euclidean) norm for image samples.
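The detection loop described above and in Algorithm 1 can be sketched as follows. This is a minimal illustrative sketch under stated assumptions: per-class growing sets, a mean-minus-standard-deviation threshold floored at zero (so exact duplicates are never re-added), and a reconstructed ratio-of-change test standing in for Eq. (7); class names and parameter names (`w`, `delta`, `eps`) are ours.

```python
import numpy as np

class PradaDetector:
    """Sketch of Algorithm 1: per-class growing sets, distance-based
    thresholds, and an alarm when the growth rate of the overall
    growing set |G| changes abruptly between consecutive windows."""

    def __init__(self, w=100, delta=0.95, eps=1e-6):
        self.w, self.delta, self.eps = w, delta, eps
        self.G = {}        # class -> kept samples (growing set G_c)
        self.D = {}        # class -> kept minimum distances D_c
        self.num_queries = 0
        self.growth = []   # |G| recorded after each query

    def query(self, x, c):
        """Process one queried sample x with predicted class c.
        Returns True if an extraction attack is flagged."""
        self.num_queries += 1
        g = self.G.setdefault(c, [])
        d = self.D.setdefault(c, [])
        if not g:
            g.append(x)                                   # initialize G_c
        else:
            d_min = min(np.linalg.norm(x - y) for y in g) # distance to closest element
            # Threshold T_c = mean - std of stored minimum distances,
            # floored at 0 (our simplification).
            thr = max(float(np.mean(d) - np.std(d)), 0.0) if d else 0.0
            if d_min > thr:                               # sample is "new enough"
                g.append(x)
                d.append(d_min)
        self.growth.append(sum(len(s) for s in self.G.values()))
        return self._check()

    def _check(self):
        n, w = self.num_queries, self.w
        if n < 2 * w:                                     # need two full windows
            return False
        cur = self.growth[n - 1] - self.growth[n - w]         # growth, current window
        prev = self.growth[n - w - 1] - self.growth[n - 2 * w] # growth, previous window
        ratio = abs(cur - prev) / (prev + self.eps)       # reconstructed ratio of change
        return ratio > self.delta
```

A client whose growing set stops expanding (e.g. after switching to tightly clustered synthetic queries) makes the current-window growth collapse relative to the previous window, pushing the ratio above the threshold.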

5.2 Evaluation

We evaluate PRADA in terms of success and speed. Speed refers to the number of samples queried by an adversary until we detect the attack. It correlates with the amount of information extracted and must be minimized. We also evaluate the false positive rate (FPR): the ratio of false alarms raised by our detection method to all query sequences from benign clients.

To evaluate success, we assess PRADA’s ability to detect attacks against the two target models previously trained in Sect. 4.1 for the MNIST and GTSRB datasets. We subject these models to four different attacks: Papernot attack, Tramer attack, and our two new attacks Jb-topk and Jb-self. We use the samples generated while evaluating the performance of these attacks in Sect. 4 and query the prediction model with them one by one (in the order they were generated). PRADA’s detection algorithm runs for each newly queried sample. When an attack is detected, we record the number of samples queried by the adversary until then to evaluate the speed of detection.

To evaluate FPR, we use natural data from the MNIST and GTSRB datasets. To demonstrate that PRADA is independent of a specific data distribution, we also use the U.S. Postal Service (USPS) [28] and Belgian traffic signs (BTS) [48] datasets. The USPS and BTS datasets contain input data similar to MNIST and GTSRB respectively, but from different distributions. We reshaped the samples to fit the input size of the MNIST and GTSRB models. We simulate a benign client by randomly picking 5,000 samples from a given dataset and successively querying them against the appropriate model: MNIST/USPS against the MNIST model, GTSRB/BTS against the GTSRB model. We simulate five legitimate clients per dataset (20 clients in total). To evaluate FPR, we split each sequence of queries into 100 chunks of 50 queries and count a false positive if at least one alert is triggered by PRADA in a chunk.
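The chunk-based FPR computation described above can be sketched as follows; `fpr_from_alerts` and the toy alert stream are our own illustrative names, not from the paper.

```python
def fpr_from_alerts(alerts, chunk_size=50):
    """Split a benign client's per-query alert stream into fixed-size
    chunks and count a false positive for every chunk that contains at
    least one alert. `alerts` is a list of booleans, one per query."""
    chunks = [alerts[i:i + chunk_size]
              for i in range(0, len(alerts), chunk_size)]
    false_positives = sum(1 for ch in chunks if any(ch))
    return false_positives / len(chunks)

# 5,000 benign queries with a single spurious alert:
# 100 chunks of 50 queries, one chunk flagged.
alerts = [False] * 5000
alerts[1234] = True
print(fpr_from_alerts(alerts))  # 0.01
```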

Figures 7 and 8 depict the trade-off between detection speed and FPR according to the window size and detection threshold. Tramer attack is omitted for readability, because its results are outside the depicted range. The vertical line marks the point at which the attacker starts querying with synthetic samples (50 for MNIST and 430 for GTSRB). Regardless of the combination of parameters, PRADA can detect an attack shortly after queries with synthetic samples begin. In practice, this means that the first window containing these samples is sufficient for our derivative-based method, as long as we can observe a pronounced “elbow” in the evolution of the growing set size. Performance for the Jb-topk and Jb-self attacks is similar because they use the same PGD algorithm to generate synthetic samples: while the synthetic samples explore different directions, they are located at similar distances from their seed samples. Papernot attack uses FGSM, which explains the slight difference in detection performance. It is worth noting that the different training strategies for the Jb-topk and Jb-self attacks do not impact detection performance. We also experimented with different values of k for Jb-topk and obtained consistent, stable detection performance. Parameter tuning is important from the perspective of FPR, as too small values might trigger alarms for benign queries. We identified optimal parameter values by running a grid search over combinations of window size and threshold. We determined a window size and threshold combination that provides the best performance, allowing quick detection while avoiding false positives for legitimate queries across all tested datasets (MNIST, USPS, GTSRB, BTS) and all attacks (Tramer, Papernot, Jb-topk, Jb-self). Table 2 presents detailed detection speed results for the selected parameters.

Attack Tramer Papernot Jb-topk Jb-self
MNIST model 3041 70 71 71
GTSRB model 1049 449 448 452
Table 2: Attack queries made until detection

Note that despite the vastly different nature of Tramer attack, PRADA is able to detect it. While the detection is slower, this is not a concern, since Tramer attack is itself slow at extracting DNNs (cf. Sect. 4.3). This demonstrates that PRADA is effective at protecting against all model extraction attacks developed to date.

To estimate the overhead of PRADA, we computed the memory required to store the growing set, which is a subset of all queries. Samples for the MNIST and GTSRB models have an average size of 561 B and 9.3 kB respectively. The average memory required before detecting an attack is around 24 kB for MNIST and 3.7 MB for GTSRB. Legitimate clients generate a larger growing set, since its growth is not stopped by a detection. However, this growth naturally slows down as a client makes more queries. As an estimate, storing the growing set of a MNIST model client submitting 3,000 queries required 1.1 MB, and storing that of a GTSRB model client submitting 5,000 queries required 23.7 MB.

5.3 Discussion

Evasion of Detection: An adversary can attempt to evade detection by making dummy queries that are not useful for building the substitute model but would maintain a stable growth of the growing set. A drawback of dummy queries is that they consume the query budget faster. Moreover, dummy queries have to be generated using a specific strategy: random generation and other naive strategies would likely be ineffective because of how we build our growing set. The minimum distance is computed relative to the samples in the growing set that are of the same class as the queried sample according to the target model. The adversary cannot know the class of a new sample before querying the target model with it. Thus he cannot know which previously submitted samples will be used for computing the minimum distance, which leaves him with little control over the growth of the growing set.

Since PRADA analyzes samples queried by a single client, an adversary can attempt to distribute its queries among several clients to avoid detection (Sybil attack). For our attacks and Papernot attack, distributing the queries for samples generated in the same duplication round would not circumvent detection. We saw in experiments that an attack can be detected as soon as there is a switch from natural to synthetic samples; all malicious clients would exhibit this characteristic. However, distributing the sequential stages of the attack among different clients may delay detection. As observed for Tramer attack, where no natural samples are used, detection requires analyzing a larger number of samples (around 1,000-3,000 queries). The attack would eventually be detected unless the attacker sets an upper bound on the number of queries a single client makes. Distributed attacks apply only to remote isolation of multi-client models; they do not apply to local isolation, where all queries are tracked under a single client profile.

Countermeasures: Once PRADA detects an attack, we must resort to effective mitigation. Blocking requests from the adversary would be a straightforward prevention. This would be effective for single-client models protected by local isolation; however, for multi-client models with an adversary capable of generating Sybils, it is not. A better defense is to keep providing altered predictions once an attack is detected, in order to degrade the substitute model learned by the adversary. Returning random predictions may tip off the adversary, who would get very inconsistent results. A better strategy is to return the class with the second or third highest likelihood according to the target model’s prediction. This would mislead the adversary into thinking he has crossed a class boundary when he has not, and effectively degrade his substitute model in a seamless manner.
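The altered-prediction countermeasure described above (returning the second or third most likely class once an attack is detected) could look like the following; this is a hypothetical sketch of the idea, not the paper's implementation, and the function name and example probabilities are ours.

```python
import numpy as np

def deceptive_prediction(probs, detected, rank=2):
    """Return the honest top-1 class for benign clients; once an
    attack is detected, return the class with the `rank`-th highest
    likelihood (2nd by default, 3rd with rank=3) instead."""
    order = np.argsort(probs)[::-1]  # classes sorted by descending likelihood
    return int(order[rank - 1] if detected else order[0])

probs = np.array([0.1, 0.6, 0.25, 0.05])
print(deceptive_prediction(probs, detected=False))  # 1 (honest top-1)
print(deceptive_prediction(probs, detected=True))   # 2 (second most likely)
```

Because the returned class is still plausible for the input, the adversary receives labels that look consistent while his substitute model's decision boundaries are silently skewed.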

Generalizability: We are confident that PRADA is applicable to any type of data and ML model without alteration, since its design is independent of these considerations and relies only on identifying adversarial querying behavior. The only data-dependent aspect is finding an appropriate distance metric to compute differences between input samples of a certain type; e.g., we chose the L2 norm for image inputs.

By design, PRADA can also be effective at detecting other adversarial machine learning attacks that rely on making numerous queries. For instance, black-box attacks for forging an adversarial example [10, 22] require thousands to millions of queries. PRADA can likely detect such attacks.

Storage overhead and scalability: PRADA requires keeping track of several clients’ queries and thus has a significant overhead in terms of memory. It is worth noting that we presented results for the extreme case of image classification models, which use high-dimensional inputs. Nevertheless, the amount of memory required per client was estimated at a few megabytes (1-20 MB), which is reasonable. For local models used by single clients, the storage requirements are thus minor. Multi-client remote models serving up to a few hundred clients simultaneously would require a few gigabytes of memory in total. This is reasonable for a cloud-based setup where the model is hosted on a powerful server.

6 Related Work

6.1 Model Extraction Attacks

Model extraction is conceptually close to concept learning [3, 6], in which the goal is to learn a model for a concept using membership queries. The differences are that the concepts to learn are not ML models and that concept learning does not assume adversarial settings. Nevertheless, methods based on concept learning have been designed for adversarial machine learning and for evading binary classifiers [30, 36]. Model evasion is a theoretically simpler task than model extraction [47], and the efficiency of these attacks has not been demonstrated on DNNs. The extraction of information from DNNs has also been addressed in non-adversarial settings, to compress a DNN into a simpler representation [7, 19] or to obtain interpretable decisions from DNNs [12, 49]. These works do not apply to adversarial settings, since they require white-box access to the target model and its training data.

We already presented the two closest works to ours in Sect. 2.5. Tramer et al. [51] introduced several methods for extracting ML models exposed via online prediction APIs with a minimum number of queries. They exploit the confidence values of predictions in a systematic equation-solving approach to infer exact model parameters. In contrast to our work, this method addresses only the extraction of simple models such as logistic regression. This technique is ineffective at extracting DNN models despite using a larger number of queries than our method (cf. Sect. 4.3). Papernot et al. [40] introduced a method for extracting a substitute DNN model for the specific purpose of computing transferable untargeted adversarial examples. Their main contribution is the JbDA technique for generating synthetic samples (cf. Sect. 2.5). They extended this work by showing that knowledge of the target model architecture is not necessary, since any ML model can be extracted using a more complex one, e.g., a DNN [43].

In contrast to these works, we introduce a generic method for extracting DNNs. It is multipurpose and has higher performance in both transfer of targeted adversarial examples and reproduction of predictive behavior. This is achieved by proposing different training strategies and by introducing novel synthetic sample generation techniques that provide a significant improvement in substitute model performance.

A recent line of work targets the extraction of model hyperparameters. Joon et al. [38] train a supervised classifier taking as input the prediction values rendered by a classifier for a fixed set of reference samples. Using this technique, they infer with significant confidence the architecture, optimization method, training data split and size, etc. of a confidential target model. Another work [52], which considers a stronger adversary model (access to model architecture and training data), introduces a technique for computing the value of the hyperparameter that weights the regularization term used during training. These works are complementary to our attack and can be used in a first stage to select the hyperparameters for the substitute model. They do not provide solutions for approximating model parameters as we do.

6.2 Defenses to Model Extraction

A first defense against model extraction is to round prediction probabilities in order to reduce the amount of information given to an adversary [51]. This has the drawback of also degrading the service provided to legitimate clients, and we showed that model extraction attacks are effective even without using prediction probabilities (Sect. 4), making this defense ineffective. A method for detecting model extraction attacks relies on recording all requests made by a client and computing the feature space explored by the aggregated requests [26]. When the explored space exceeds a pre-fixed threshold, an extraction attack is detected. This technique has several limitations, since it requires prediction classes to be linearly separable in the input space in order to evaluate the space explored by an attacker. Thus it applies neither to high-dimensional input spaces nor to DNN models, which build highly non-linear decision boundaries in this space. The false alarm rate of this technique is not evaluated and is likely high, since a legitimate client can genuinely explore large areas of the input space. For instance, our Jb-topk and Jb-self attacks explore a smaller area of the input space than a legitimate client (cf. Sect. 5.2, size of the growing set).

In contrast, PRADA applies to any input data dimension and ML model. It is effective at detecting any attack developed to date and does not degrade the prediction service provided to legitimate clients.

Alternatively, methods for detecting adversarial examples can help detect the synthetically generated samples of Papernot attack and ours. The main approaches rely on retraining the model with adversarial samples [18, 25], modifying the learning process, e.g., using defensive distillation [42], randomizing the decision process [14], or analyzing the distribution of inputs [32]. A drawback of these techniques is that they assume a specific distribution of the legitimate inputs to the prediction model, i.e., the same distribution as the training data. Consequently, they potentially raise a high number of false alarms if benign clients submit natural samples distributed differently from the training data.

In contrast, PRADA does not rely on any assumption about the training data, but only studies the evolution of the distribution of samples submitted by a given client. This explains why we observe no false positives even when analyzing legitimate data from diverse distributions. Methods for detecting adversarial examples would also be ineffective against the Tramer class of attacks [51]: its synthetic samples are generated randomly and do not rely on methods for crafting adversarial examples. Our Jb-self attack variant would also be hard to detect, since it modifies samples in a way that does not move them closer to a decision boundary.

7 Conclusion

We have systematically explored approaches for model extraction. We evaluated four attacks on different DNN models and showed that several criteria influence the success of model extraction attacks. Insights from this experience point to ways in which the risk of DNN model extraction can be reduced. Increasing the depth of a DNN and reducing its number of parameters reduces the ability of an adversary to forge transferable adversarial examples. Limiting the adversary’s access to natural seed samples, in scenarios where this is possible, will also blunt the effectiveness of model stealing.

Recent research has shown that ML models, especially DNNs, are exposed to different types of vulnerabilities. In particular, white-box access to ML models allows an adversary to mount various sophisticated attacks for which no effective defense has been developed to date. Consequently, protecting the confidentiality of models is a useful mitigation. In this black-box scenario, an attacker is forced into repeated interactions with the model. We demonstrated that model extraction can be effectively detected by making ML prediction APIs stateful. This defense has several advantages, since it requires no knowledge about the ML model, nor about the data used to train it. Model confidentiality combined with a stateful defense strategy is a promising avenue for effectively protecting ML models against a large range of adversarial machine learning attacks. One example we are currently exploring is defending against black-box attacks for forging adversarial examples (without resorting to building substitute models via model stealing; see Appendix C). Such attacks usually require thousands of queries to forge one adversarial example. A stateful prediction API like the one described in this paper with PRADA appears to be a promising defense.
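A stateful prediction API of the kind described above could be structured as follows. This is our own illustrative sketch, not the PRADA implementation: the wrapper keeps per-client detector state and refuses service once a client is flagged, and the toy duplicate-burst detector stands in for a real detector such as PRADA.

```python
import numpy as np

class DuplicateBurstDetector:
    """Toy per-client detector: alarm once the same (rounded) query
    has been submitted more than `limit` times."""
    def __init__(self, limit=3):
        self.counts = {}
        self.limit = limit

    def observe(self, x):
        key = tuple(np.round(np.asarray(x, dtype=float), 4))
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] > self.limit

class StatefulPredictionAPI:
    """Sketch of a stateful prediction API: one detector instance per
    client; flagged clients are blocked from further queries."""
    def __init__(self, model_fn, detector_factory):
        self.model_fn = model_fn
        self.detector_factory = detector_factory
        self.detectors = {}   # client_id -> detector state
        self.blocked = set()

    def predict(self, client_id, x):
        if client_id in self.blocked:
            raise PermissionError("suspected model extraction")
        detector = self.detectors.setdefault(client_id, self.detector_factory())
        if detector.observe(x):
            self.blocked.add(client_id)
            raise PermissionError("suspected model extraction")
        return self.model_fn(x)
```

The design keeps the underlying model untouched: statefulness lives entirely in the API layer, which is why the defense needs no knowledge of the model or its training data.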

References
  • [1] Intel software guard extensions programming reference. Technical report, 2014.
  • [2] Amazon. Amazon Machine Learning. Last accessed 26/04/2018.
  • [3] D. Angluin. Queries and concept learning. Machine learning, 2(4):319–342, 1988.
  • [4] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
  • [5] B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning. arXiv preprint arXiv:1712.03141, 2017.
  • [6] N. H. Bshouty, R. Cleve, R. Gavaldà, S. Kannan, and C. Tamon. Oracles and queries that are sufficient for exact learning. Journal of Computer and System Sciences, 52(3):421–433, 1996.
  • [7] C. Bucilua, R. Caruana, and A. Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541. ACM, 2006.
  • [8] N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14. ACM, 2017.
  • [9] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.
  • [10] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
  • [11] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine learning, 15(2):201–221, 1994.
  • [12] M. Craven and J. W. Shavlik. Extracting tree-structured representations of trained networks. In Advances in neural information processing systems, pages 24–30, 1996.
  • [13] J. Ekberg, K. Kostiainen, and N. Asokan. The untapped potential of trusted execution environments on mobile devices. IEEE Security & Privacy, 12(4):29–37, 2014.
  • [14] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.
  • [15] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning, volume 1. 2016.
  • [16] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [17] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pages 6645–6649. IEEE, 2013.
  • [18] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.
  • [19] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [20] W. Hua, Z. Zhang, and G. E. Suh. Reverse engineering convolutional neural networks through side-channel information leaks. In ACM Design Automation Conference, 2018.
  • [21] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. Tygar. Adversarial machine learning. In Proceedings of the 4th ACM workshop on Security and artificial intelligence, pages 43–58. ACM, 2011.
  • [22] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin. Query-efficient black-box adversarial examples. arXiv preprint arXiv:1712.07113, 2017.
  • [23] BigML Inc. BigML: Machine learning made easy. Last accessed 26/04/2018.
  • [24] Intel Inc. Movidius Myriad X VPU. Last accessed 01/05/2018.
  • [25] H. Kannan, A. Kurakin, and I. Goodfellow. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.
  • [26] M. Kesarwani, B. Mukhoty, V. Arya, and S. Mehta. Model extraction warning in mlaas paradigm. CoRR, abs/1711.07221, 2017.
  • [27] J. Konecný, B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. CoRR, abs/1610.05492, 2016.
  • [28] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.
  • [29] Y. LeCun, C. Cortes, and C. Burges. Mnist handwritten digit database. AT&T Labs [Online]. Available: http://yann. lecun. com/exdb/mnist, 2010.
  • [30] D. Lowd and C. Meek. Adversarial learning. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 641–647. ACM, 2005.
  • [31] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • [32] D. Meng and H. Chen. Magnet: A two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 135–147. ACM, 2017.
  • [33] Microsoft. Azure Machine Learning. Last accessed 26/04/2018.
  • [34] K. Miettinen. Nonlinear Multiobjective Optimization, volume 12 of International Series in Operations Research and Management Science. Kluwer Academic Publishers, Dordrecht, 1999.
  • [35] K. Murphy. Machine learning: a probabilistic approach. Massachusetts Institute of Technology, pages 1–21, 2012.
  • [36] B. Nelson, B. I. Rubinstein, L. Huang, A. D. Joseph, S. J. Lee, S. Rao, and J. Tygar. Query strategies for evading convex-inducing classifiers. Journal of Machine Learning Research, 13(May):1293–1332, 2012.
  • [37] T. D. Nguyen, S. Marchal, M. Miettinen, M. Hoang Dang, N. Asokan, and A.-R. Sadeghi. Diot: A crowdsourced self-learning approach for detecting compromised iot devices. ArXiv e-prints, 2018.
  • [38] S. J. Oh, M. Augustin, M. Fritz, and B. Schiele. Towards reverse-engineering black-box neural networks. In International Conference on Learning Representations, 2018.
  • [39] O. Ohrimenko, F. Schuster, C. Fournet, A. Mehta, S. Nowozin, K. Vaswani, and M. Costa. Oblivious multi-party machine learning on trusted processors. In USENIX Security Symposium, pages 619–636, 2016.
  • [40] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pages 506–519. ACM, 2017.
  • [41] N. Papernot, P. McDaniel, A. Sinha, and M. Wellman. Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814, 2016.
  • [42] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE, 2016.
  • [43] N. Papernot, P. D. McDaniel, and I. J. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR, abs/1605.07277, 2016.
  • [44] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar. Federated multi-task learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 4427–4437, 2017.
  • [45] N. Šrndic and P. Laskov. Practical evasion of a learning-based classifier: A case study. In Proceedings of the 2014 IEEE Symposium on Security and Privacy, pages 197–211. IEEE Computer Society, 2014.
  • [46] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The german traffic sign recognition benchmark: a multi-class classification competition. In Neural Networks (IJCNN), The 2011 International Joint Conference on. IEEE, 2011.
  • [47] D. Stevens and D. Lowd. On the hardness of evading combinations of linear classifiers. In Proceedings of the 2013 ACM workshop on Artificial intelligence and security, pages 77–86. ACM, 2013.
  • [48] R. Timofte, K. Zimmermann, and L. van Gool. Multi-view traffic sign detection, recognition, and 3d localisation. In IEEE Computer Society Workshop on Application of Computer Vision, 2009.
  • [49] G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine learning, 13(1):71–101, 1993.
  • [50] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
  • [51] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart. Stealing machine learning models via prediction apis. In USENIX Security Symposium, pages 601–618, 2016.
  • [52] B. Wang and N. Zhenqiang Gong. Stealing Hyperparameters in Machine Learning. In 39th IEEE Symposium on Security and Privacy, pages 1–19, 2018.

Appendix A Transferability of untargeted adversarial examples

Figure 10: Median F-agreement (solid lines) and untargeted transferability (dashed lines) w.r.t. number of queries (MNIST: 50 seeds + rest synthetic; GTSRB: 430 seeds + rest synthetic). Panels: (a) MNIST target model, (b) GTSRB target model. Shading represents 25th to 75th percentiles. p=probabilities, d=dropout; Jb-topk/Jb-self/Papernot/Tramer are query strategies.

In the comparative evaluation of model extraction techniques in terms of transferability of adversarial examples (Sect. 4.3), we limited the analysis to targeted adversarial examples because they are more difficult to transfer. Our model extraction attacks perform well in terms of transferability of untargeted adversarial examples as well. Untargeted adversarial examples are samples that only seek to be misclassified: any class other than the original will do, in contrast to the targeted case where the target class is selected by the adversary. Objectively, this is a subclass of, and a simpler task than, forging targeted adversarial examples.
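The distinction can be made concrete with the fast gradient sign method [16]: an untargeted perturbation ascends the loss of the original class, while a targeted one descends the loss of the adversary-chosen class. A minimal sketch (the function names and toy gradients below are our own illustration):

```python
import numpy as np

def untargeted_fgsm_step(grad_loss_true, eps):
    """Perturbation that increases the loss on the input's true class:
    any resulting misclassification counts as success."""
    return eps * np.sign(grad_loss_true)

def targeted_fgsm_step(grad_loss_target, eps):
    """Perturbation that decreases the loss on an adversary-chosen class:
    success requires landing in that specific class."""
    return -eps * np.sign(grad_loss_target)
```

The two perturbations point in opposite directions along each coordinate of the respective loss gradient; the targeted variant is harder because a specific class region must be reached rather than merely left.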

Papernot et al. [40] limited their analysis to transferability of untargeted adversarial examples when they evaluated the original Papernot attack. We use the exact same evaluation setup as described in Sect. 4.3 for the comparative analysis below.

Fig. 10(a) and 10(b) depict the performance of several setups of our attacks Jb-topk and Jb-self compared to the Papernot attack and the Tramer attack. The PGD attacks constrain the perturbation size ε, with separate bounds for MNIST and GTSRB. We present the performance of a randomly perturbed image for comparison. Random perturbations can analytically reach untargeted transferability (c-1)/c, where c is the number of classes (10 in MNIST and 43 in GTSRB). Random perturbations are not bounded by a maximum perturbation ε, and therefore have higher untargeted transferability than PGD.
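The random-perturbation baseline can be checked numerically: if the perturbed sample lands in an essentially uniform random class, it is misclassified whenever that class differs from the original, which happens with probability (c-1)/c for c classes. A quick Monte-Carlo sanity check (our own; fixing the original class to 0 loses no generality):

```python
import random

def random_flip_rate(c, trials=200_000, seed=0):
    """Estimate how often a uniformly random predicted class differs
    from the original one, for c classes."""
    rng = random.Random(seed)
    flips = sum(rng.randrange(c) != 0 for _ in range(trials))
    return flips / trials

# Analytical baseline (c - 1) / c:
# ~0.90 for MNIST's 10 classes, ~0.977 for GTSRB's 43 classes.
```

This matches the intuition that untargeted transferability numbers must always be read against a class-count-dependent random baseline.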

We see that our attack Jb-topk also outperforms previous work for this adversarial goal. We obtain a performance increase of 30 pp when targeting the MNIST model and 12 pp when targeting the GTSRB model.

Appendix B Impact of model complexity

| 1 layer | 2 layers | 3 layers  | 4 layers | 5 layers  |
|---------|----------|-----------|----------|-----------|
|         |          |           | conv2-32 | conv2-64  |
|         |          |           | maxpool2 | maxpool2  |
|         |          | conv2-32  | conv2-64 | conv2-128 |
|         |          | maxpool2  | maxpool2 | maxpool2  |
|         | FC-200   | FC-200    | FC-200   | FC-200    |
| FC-10   | FC-10    | FC-10     | FC-10    | FC-10     |
| 7,851   | 159,011  | 1,090,171 | 486,011  | 488,571   |

Table 3: Target model architectures. ReLU activations are used between blocks of layers. The number of parameters in each network is reported at the bottom.

We show the alternative DNN architectures for complexity analysis (Section 4.4) in Table 3.
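The parameter totals in Table 3 follow from standard layer-size arithmetic. The helper functions below (our own) count weights plus biases per layer; exact totals depend on padding, stride, and input conventions, so they may differ slightly from the table's figures.

```python
def fc_params(n_in, n_out):
    """Fully connected layer: one weight per input-output pair, plus biases."""
    return n_in * n_out + n_out

def conv_params(k, c_in, c_out):
    """k x k convolution: k*k weights per input-output channel pair, plus biases."""
    return k * k * c_in * c_out + c_out

# The 1-layer model is a single FC-10 on flattened 28x28 MNIST inputs:
print(fc_params(28 * 28, 10))  # -> 7850
```

This arithmetic also explains the trend exploited in Sect. 4.4: most parameters sit in the first fully connected layer, so adding conv/pool blocks in front of it shrinks its input and can reduce the total count even as depth grows.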

Appendix C Adversarial examples and black box attacks

Often, the adversary does not have perfect knowledge of the model he is trying to attack: he may have some knowledge about the training dataset, for instance a small subset of the training data, and he may know the structure of the target network (the limited-knowledge adversary [5]). There are two main lines of work for creating black-box adversarial examples for DNNs:

Substitute model-based attacks

Papernot et al. [40] showed it is practical to create transferable untargeted adversarial examples by training substitute models, relying on limited knowledge of the attacked model and dataset. They assume the adversary has access to representative training data. Tramèr et al. [50] showed it is possible to create highly transferable targeted attacks against DNNs by ensembling several alternative models trained on the same training data as the target model.
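The synthetic-query step at the heart of the substitute-model approach [40] is Jacobian-based dataset augmentation: each seed is pushed a small step along the sign of the substitute model's gradient for its predicted class, and the target model is then queried on the result to label it. A sketch (the `model_grad` callback is an assumed interface, not an API from the paper):

```python
import numpy as np

def jacobian_augment(model_grad, samples, lam=0.1):
    """Jacobian-based dataset augmentation in the spirit of Papernot et al. [40].
    `model_grad(x)` is assumed to return the gradient of the substitute's
    output for x's predicted class, with respect to the input x."""
    return [x + lam * np.sign(model_grad(x)) for x in samples]

# Toy usage: for a linear substitute f(x) = w . x, the gradient is constant.
w = np.array([1.0, -2.0, 0.0])
new_queries = jacobian_augment(lambda x: w, [np.zeros(3)], lam=0.1)
```

Each round doubles the query set toward regions where the substitute is least certain, which is what makes the attack query-efficient compared to random sampling.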

Zero-order optimization-based attacks

Ilyas et al. [22] and Chen et al. [10] showed it is possible to create targeted adversarial examples for DNNs without substitute models. These attacks are very effective but inefficient: both require thousands to millions of queries per sample and are therefore easily detectable. They do not extract models, and they require access to the target model's class probabilities.
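The query cost of zeroth-order attacks follows directly from how they estimate gradients: a symmetric finite difference needs two queries per input coordinate, so an image with thousands of pixels costs thousands of queries per gradient step. A minimal sketch of the estimator (full coordinate-wise sweep, omitting the stochastic coordinate sampling and other optimizations used by ZOO [10]):

```python
import numpy as np

def zoo_gradient(f, x, h=1e-4):
    """Zeroth-order gradient estimate of a black-box scalar function f
    at x, via symmetric finite differences: 2 queries per coordinate."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(g)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)  # central difference
    return g
```

A stateful defense benefits from exactly this structure: the probe pairs x+he_i and x-he_i are nearly identical queries, the kind of bursty, low-diversity pattern a detector like PRADA is designed to flag.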
