Extraction of Complex DNN Models: Real Threat or Boogeyman?


Abstract

Recently, machine learning (ML) has introduced advanced solutions to many domains. Since ML models provide business advantage to model owners, protecting the intellectual property of ML models has emerged as an important consideration. Confidentiality of ML models can be protected by exposing them to clients only via prediction APIs. However, model extraction attacks can steal the functionality of ML models using the information leaked to clients through the results returned via the API. In this work, we question whether model extraction is a serious threat to complex, real-life ML models. We evaluate the current state-of-the-art model extraction attack (Knockoff nets) against complex models. We reproduce and confirm the results in the original paper. However, we also show that the performance of this attack can be limited by several factors, including the ML model architecture and the granularity of the API response. Furthermore, we introduce a defense based on distinguishing queries used for Knockoff nets from benign queries. Despite the limitations of Knockoff nets, we show that a more realistic adversary can effectively steal complex ML models and evade known defenses.


1 Introduction

In recent years, machine learning (ML) has been applied to many areas with impressive results. Use of ML models is now ubiquitous. Major enterprises (Google, Apple, Facebook) utilize them in their products [28]. Companies gain business advantage by collecting proprietary data and training high quality models. Hence, protecting the intellectual property embodied in ML models is necessary to preserve the business advantage of model owners.

Increased adoption of ML models and the popularity of centrally hosted services led to the emergence of Prediction-As-a-Service platforms. Rather than distributing ML models to users, it is easier to run them on centralized servers having powerful computational resources and to expose them via prediction APIs. Prediction APIs are used to protect the confidentiality of ML models and allow for widespread availability of ML-based services that require users only to have an internet connection. Even though users only have access to a prediction API, each response necessarily leaks some information about the model. A model extraction attack [30] is one where an adversary (a malicious client) extracts information from a victim model by frequently querying the model’s prediction API. Queries and API responses are used to build a surrogate model with comparable functionality and effectiveness. Deploying surrogate models deprives the model owner of its business advantage. Many extraction attacks are effective against simple ML models [23, 8] and defenses have been proposed against these simple attacks [17, 24]. However, extraction of complex ML models has received little attention to date. Whether model extraction is a serious and realistic threat to real-life systems remains an open question.

Recently, a novel model extraction attack, Knockoff nets [21], was proposed against complex deep neural networks (DNNs). The paper reported empirical evaluations showing that Knockoff nets is effective at stealing any image classification model. This attack assumes that the adversary has access to (a) pre-trained image classification models that are used as the basis for constructing the surrogate model, (b) unlimited natural samples that are not drawn from the same distribution as the training data of the victim model and (c) the full probability vector as the output of the prediction API. Knockoff nets does not require the adversary to have any knowledge about the victim model, its training data or its classification task (class semantics). Although other and more recent model extraction attacks have been proposed [2, 18, 7], Knockoff nets remains the most effective one against complex DNN models under the weakest adversary model. Moreover, there is no detection mechanism specifically tailored against model extraction attacks leveraging unlabeled natural data. Therefore, the natural question is whether Knockoff nets constitutes a realistic threat, which we investigate through extensive evaluation.

Goals and contributions: Our goals in this paper are twofold. First, we want to understand the conditions under which Knockoff nets constitutes a realistic threat. Hence, we empirically evaluate the attack under different adversary models. Second, we want to explore whether and under which conditions Knockoff nets can be mitigated or detected. We claim the following contributions:

  • reproduce the empirical evaluation of Knockoff nets under its original adversary model to confirm that it can extract surrogate models exhibiting reasonable accuracy (53.5-94.8%) for all five complex victim DNN models we built (Section 3.2).

  • introduce a defense, within the same adversary model, to detect Knockoff nets by differentiating in- and out-of-distribution queries (attacker’s queries). This defense correctly detects up to 99% of adversarial queries (Section 4).

  • revisit the original adversary model to investigate how the attack effectiveness changes with more realistic adversaries and victims (Section 5.1). The attack effectiveness deteriorates when

    • the adversary uses a model architecture for the surrogate that is different from that of the victim.

    • the granularity of the victim’s prediction API output is reduced (returning predicted class instead of a probability vector).

    • the diversity of adversary queries is reduced.

    On the other hand, the attack effectiveness can increase when the adversary has access to natural samples drawn from the same distribution as the victim’s training data. In this case, all existing attack detection techniques, including our own, are no longer applicable (Section 5.4).

2 Background

2.1 Deep Neural Networks

A DNN is a function $F: \mathbb{R}^n \rightarrow [0,1]^m$, where $n$ is the number of input features and $m$ is the number of output classes in a classification task. $F(x)$ gives a vector of length $m$ containing the probabilities $p_j$ that input $x$ belongs to class $c_j$ for each $j \in \{1, \dots, m\}$. The predicted class, denoted $\hat{F}(x)$, is obtained by applying the argmax function: $\hat{F}(x) = \operatorname{argmax}_j F(x)_j$. $F$ tries to approximate a perfect oracle function $\mathcal{O}$ which gives the true class $c$ for any input $x$. The test accuracy expresses the degree to which $F$ approximates $\mathcal{O}$.
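To make the notation concrete, the following minimal sketch (the linear model and random input are placeholders, not the models used in this paper) shows how the probability vector $F(x)$ and the predicted class $\hat{F}(x)$ are obtained in PyTorch:

```python
import torch

# Placeholder classifier returning raw logits for m = 10 classes
# from n = 3*32*32 input features.
model = torch.nn.Linear(3 * 32 * 32, 10)
x = torch.rand(1, 3 * 32 * 32)                     # a single input sample

logits = model(x)
probs = torch.softmax(logits, dim=1)               # F(x): vector of m class probabilities
predicted_class = probs.argmax(dim=1)              # F_hat(x) = argmax_j F(x)_j

print(probs.sum().item(), predicted_class.item())  # probabilities sum to 1
```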

2.2 Model Extraction Attacks

In a model extraction attack, the goal of an adversary $\mathcal{A}$ is to build a surrogate model $F_\mathcal{A}$ that imitates the model $F_\mathcal{V}$ of a victim $\mathcal{V}$. $\mathcal{A}$ wants to find an $F_\mathcal{A}$ having test accuracy as close as possible to that of $F_\mathcal{V}$ on a test set. $\mathcal{A}$ builds its own dataset $X_\mathcal{A}$ and implements the attack by sending queries to the prediction API of $F_\mathcal{V}$ and obtaining a prediction $F_\mathcal{V}(x)$ for each query $x$, where $x \in X_\mathcal{A}$. $\mathcal{A}$ uses the transfer set $\{(x, F_\mathcal{V}(x)) : x \in X_\mathcal{A}\}$ to train the surrogate model $F_\mathcal{A}$.

According to prior work on model extraction [23, 8], we can divide $\mathcal{A}$’s capabilities into three categories: victim model knowledge, data access, and querying strategy.

Victim model knowledge: Model extraction attacks operate in a black-box setting. $\mathcal{A}$ does not have access to the model parameters of $F_\mathcal{V}$ but can query the prediction API without any limitation on the number of queries. $\mathcal{A}$ might know the exact architecture of $F_\mathcal{V}$, its hyperparameters or its training process. Given the purpose of the API (e.g., image recognition) and the expected complexity of the task, $\mathcal{A}$ may attempt to guess the architecture of $F_\mathcal{V}$ [23]. $\mathcal{V}$’s prediction API may return one of the following: the full probability vector, the top-k labels with confidence scores, or only the predicted class.

Data access: The possible capabilities of $\mathcal{A}$ for data access vary in different model extraction attacks. $\mathcal{A}$ can have access to a small subset of natural samples from $\mathcal{V}$’s training dataset [23, 8]. $\mathcal{A}$ may not have access to $\mathcal{V}$’s training dataset but may know the “domain” of the data and have access to natural samples that are close to $\mathcal{V}$’s training data distribution (e.g., images of dogs in the task of identifying dog breeds) [2]. $\mathcal{A}$ can use widely available natural samples that are different from $\mathcal{V}$’s training data distribution [21]. Finally, $\mathcal{A}$ can construct $X_\mathcal{A}$ with only synthetically crafted samples [30].

Querying strategy: Querying is the process of submitting a sample to the prediction API. If $\mathcal{A}$ relies on synthetic data, it crafts samples that would help it train $F_\mathcal{A}$ iteratively. Otherwise, $\mathcal{A}$ first collects its samples $X_\mathcal{A}$, queries the prediction API with the complete $X_\mathcal{A}$, and then trains the surrogate model $F_\mathcal{A}$ with the transfer set $\{(x, F_\mathcal{V}(x)) : x \in X_\mathcal{A}\}$.
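As an illustration of the querying strategy with natural data, the sketch below assembles a transfer set and trains $F_\mathcal{A}$ on the victim’s probability vectors. The `prediction_api` callable stands in for $\mathcal{V}$’s remote API and the surrogate architecture is left abstract; both are hypothetical placeholders and the loop is deliberately schematic:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_transfer_set(samples, prediction_api):
    """Query the victim's prediction API once per collected sample and pair
    each query x with the returned probability vector F_V(x)."""
    outputs = [prediction_api(x) for x in samples]
    return TensorDataset(torch.stack(samples), torch.stack(outputs))

def train_surrogate(surrogate, transfer_set, epochs=10, lr=0.01):
    """Train F_A to match the victim's outputs (soft labels) by minimizing
    cross-entropy against the returned probability vectors."""
    loader = DataLoader(transfer_set, batch_size=64, shuffle=True)
    opt = torch.optim.SGD(surrogate.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, victim_probs in loader:
            opt.zero_grad()
            log_probs = torch.log_softmax(surrogate(x), dim=1)
            loss = -(victim_probs * log_probs).sum(dim=1).mean()
            loss.backward()
            opt.step()
    return surrogate
```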

3 Knockoff Nets Model Extraction Attack

In this section, we study the Knockoff nets model extraction attack [21], which achieves state-of-the-art performance against complex DNN models. Knockoff nets works without access to $F_\mathcal{V}$’s training data distribution, model architecture or classification task.

3.1 Attack Description

Adversary model

The goal of $\mathcal{A}$ is model functionality stealing [21]: $\mathcal{A}$ wants to train a surrogate model $F_\mathcal{A}$ that performs similarly on the classification task for which $\mathcal{V}$’s prediction API was designed. $\mathcal{A}$ has no information about $F_\mathcal{V}$, including its model architecture, internal parameters and hyperparameters. Moreover, $\mathcal{A}$ does not have access to $\mathcal{V}$’s training data, the purpose of the prediction API or the semantics of its output classes. Due to these assumptions, $\mathcal{A}$ is a weaker adversary than in previous work [23, 8]. However, $\mathcal{A}$ can collect an unlimited amount of varied real-world data from online databases and can query the prediction API without any constraint on the number of queries. The API always returns a complete probability vector as output for each legitimate query. $\mathcal{A}$ is not constrained in memory or computational capabilities and uses publicly available pre-trained complex DNN models as a basis for $F_\mathcal{A}$ [12].

Attack strategy

$\mathcal{A}$ first collects natural data from online databases to construct an unlabeled dataset $X_\mathcal{A}$. For each query $x \in X_\mathcal{A}$, $\mathcal{A}$ obtains a complete probability vector $F_\mathcal{V}(x)$ from the prediction API. $\mathcal{A}$ uses this transfer set to repurpose the learned features of a complex pre-trained model with transfer learning [13]. In the Knockoff nets setting, $\mathcal{V}$ offers image classification and $\mathcal{A}$ constructs $X_\mathcal{A}$ by sampling a subset of the ImageNet dataset [3].

3.2 Knockoff Nets: Evaluation

We first implement Knockoff nets under the original adversary model explained in Section 3.1. We use the datasets and experimental setup described in [21] for constructing both $F_\mathcal{V}$ and $F_\mathcal{A}$. We also evaluate two additional datasets to contrast our results with previous work.

Datasets

We use the Caltech [4], CUBS [31] and Diabetic Retinopathy (Diabetic5) [9] datasets as in [21] for training $F_\mathcal{V}$’s and reproduce experiments where Knockoff nets was successful. Caltech is composed of various images belonging to 256 different categories. CUBS contains images of 200 bird species and is used for fine-grained image classification tasks. Diabetic5 contains high-resolution retina images labeled with five different classes indicating the presence of diabetic retinopathy. We augment Diabetic5 using preprocessing techniques recommended in footnote 1 to address the class imbalance problem. For constructing $X_\mathcal{A}$, we use a subset of ImageNet, which contains 1.2M images belonging to 1000 different categories. $X_\mathcal{A}$ includes 100,000 images randomly sampled from ImageNet, 100 images per class. 42% of the labels in Caltech and 1% in CUBS are also present in ImageNet. There is no overlap between Diabetic5 and ImageNet labels.
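A minimal sketch of how such a 100-images-per-class ImageNet subset can be assembled, assuming a local ImageNet copy laid out in the standard one-directory-per-class format (the path is a hypothetical placeholder):

```python
import os
import random

def sample_imagenet_subset(imagenet_dir, per_class=100, seed=0):
    """Randomly pick `per_class` image paths from each of the 1000
    ImageNet class directories (100 x 1000 = 100,000 images)."""
    rng = random.Random(seed)
    selected = []
    for class_dir in sorted(os.listdir(imagenet_dir)):
        class_path = os.path.join(imagenet_dir, class_dir)
        images = sorted(os.listdir(class_path))
        selected.extend(os.path.join(class_path, f)
                        for f in rng.sample(images, min(per_class, len(images))))
    return selected

# transfer_paths = sample_imagenet_subset("/data/imagenet/train")  # hypothetical path
```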

Additionally, we use CIFAR10 [14], depicting animals and vehicles divided into 10 classes, and GTSRB [26], a traffic sign dataset with 43 classes. CIFAR10 contains broad, high-level classes while GTSRB contains domain-specific and detailed classes. These datasets do not overlap with ImageNet labels and they were partly used in prior model extraction work [23, 8]. We resize images with bilinear interpolation, where applicable.

All datasets are divided into training and test sets and summarized in Table 1. All images in both training and test sets are normalized with mean and standard deviation statistics specific to ImageNet.

Dataset     Image size        Classes   Train    Test
Caltech     224x224           256       23,703   6,904
CUBS        224x224           200       5,994    5,794
Diabetic5   224x224           5         85,108   21,278
GTSRB       32x32 / 224x224   43        39,209   12,630
CIFAR10     32x32 / 224x224   10        50,000   10,000
Table 1: Image datasets used to evaluate Knockoff nets (number of samples in the train and test splits). GTSRB and CIFAR10 are resized with bilinear interpolation before training pre-trained classifiers.
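The resizing noted in Table 1 and the ImageNet normalization described above correspond to a standard torchvision preprocessing pipeline; a sketch (the exact augmentation used in our experiments may differ):

```python
from torchvision import transforms

# ImageNet channel statistics used to normalize every image.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

preprocess = transforms.Compose([
    # Bilinear resize to the input size expected by the pre-trained models.
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
```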

Training victim models

To obtain complex victim models, we fine-tune the weights of a pre-trained ResNet34 [5] model. We train 5 complex victim models using the datasets summarized in Table 1 and name these victim models {Dataset name}-RN34. In training, we use an SGD optimizer with an initial learning rate of 0.1 that is decreased by a factor of 10 every 60 epochs, over 200 epochs in total.
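A sketch of this fine-tuning setup in PyTorch (data loading is omitted, and the momentum value is an assumption on our part):

```python
import torch
import torch.nn as nn
from torchvision import models

def make_victim(num_classes):
    """Start from an ImageNet pre-trained ResNet34 and replace the
    classification head for the victim's task."""
    model = models.resnet34(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def train_victim(model, train_loader, epochs=200):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # Decrease the learning rate by a factor of 10 every 60 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
    for _ in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```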

Training surrogate models

To build surrogate models, we fine-tune the weights of a pre-trained ResNet34 [5] model. We query $F_\mathcal{V}$’s prediction API with samples from $X_\mathcal{A}$ and obtain the transfer set $\{(x, F_\mathcal{V}(x)) : x \in X_\mathcal{A}\}$. We train surrogate models using an SGD optimizer with an initial learning rate of 0.01 that is decreased by a factor of 10 every 60 epochs, over 100 epochs in total.
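The surrogate is trained on the transfer set with the same soft-label objective sketched in Section 2.2, but with the optimizer settings above; a condensed sketch in which `transfer_loader` is assumed to yield (image, probability vector) pairs:

```python
import torch
from torchvision import models

def make_surrogate(num_classes):
    model = models.resnet34(pretrained=True)
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    return model

def train_surrogate(model, transfer_loader, epochs=100):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
    for _ in range(epochs):
        for x, victim_probs in transfer_loader:      # victim_probs = F_V(x)
            optimizer.zero_grad()
            log_probs = torch.log_softmax(model(x), dim=1)
            loss = -(victim_probs * log_probs).sum(dim=1).mean()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```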

Experimental results

Table 2 presents the test accuracy of $F_\mathcal{V}$ and $F_\mathcal{A}$ in our reproduction as well as the experimental results reported in the original paper. The attack effectiveness against the Caltech-RN34 and CUBS-RN34 models is consistent with the corresponding values reported in [21]. We found that $F_\mathcal{A}$ against Diabetic5-RN34 does not recover the same degree of performance. This inconsistency is a result of different transfer sets labeled by two different $F_\mathcal{V}$’s. As shown in Table 2, Knockoff nets is effective against pre-trained complex DNN models. Knockoff nets can imitate the functionality of $F_\mathcal{V}$ via $\mathcal{A}$’s transfer set, even though $X_\mathcal{A}$ is completely different from $F_\mathcal{V}$’s training data. We discuss the effect of the transfer set in more detail in Section 5.3.

                 Ours                        Reported in [21]
Victim model     Acc(F_V)   Acc(F_A)         Acc(F_V)   Acc(F_A)
Caltech-RN34     74.6%      72.2% (0.97×)    78.8%      75.4% (0.96×)
CUBS-RN34        77.2%      70.9% (0.91×)    76.5%      68.0% (0.89×)
Diabetic5-RN34   71.1%      53.5% (0.75×)    58.1%      47.7% (0.82×)
GTSRB-RN34       98.1%      94.8% (0.97×)    -          -
CIFAR10-RN34     94.6%      88.2% (0.93×)    -          -
Table 2: Test accuracy of $F_\mathcal{V}$ and $F_\mathcal{A}$ in our reproduction and as reported by [21]. Performance recovery, Acc($F_\mathcal{A}$)/Acc($F_\mathcal{V}$), is given in parentheses.

4 Detection of Knockoff Nets Attack

In this section, we present a method designed to detect queries used for Knockoff nets. We analyze attack effectiveness w.r.t. the capacity of the model used for detection and the overlap between $\mathcal{A}$’s and $\mathcal{V}$’s training data distributions. Finally, we investigate attack effectiveness when $\mathcal{A}$’s queries are detected and additional countermeasures are taken.

4.1 Goals and Overview

DNNs are trained using datasets that come from a specific distribution $D$. Many benchmark datasets display distinct characteristics that make them identifiable (e.g., cars in CIFAR10 vs. ImageNet) as opposed to being representative of real-world data [29]. A DNN trained using such data might be overconfident, i.e., it gives wrong predictions with high confidence scores, when it is evaluated with samples drawn from a different distribution $D'$. Predictive uncertainty is unavoidable when a DNN model is deployed for use via a prediction API. In this case, estimating predictive uncertainty is crucial to reduce over-confidence and provide better generalization for unseen samples. Several methods were introduced [6, 19, 16] to measure predictive uncertainty by detecting out-of-distribution samples in the domain of image recognition. The Baseline [6] and ODIN [19] methods analyze the softmax probability distribution of the DNN to identify out-of-distribution samples. A recent state-of-the-art method [16] detects out-of-distribution samples based on their Mahalanobis distance [1] to the closest class-conditional distribution. Although these methods were tested against adversarial samples in evasion attacks, their detection performance against Knockoff nets is unknown. Moreover, their performance relies heavily on the choice of a threshold value, which corresponds to the rate of correctly identified in-distribution samples (true negative rate, TNR).

Our goal is to detect queries that do not correspond to the main classification task of $\mathcal{V}$’s model. In the case of Knockoff nets, this translates to identifying inputs that come from a different distribution than $F_\mathcal{V}$’s training set. Queries containing such images follow from the distinctive aspects of the adversary model in Knockoff nets: 1) availability of a large amount of unlabeled data and 2) limited information about the purpose of the API. To achieve this, we propose a binary classifier (or one-and-a-half classifier) based on the ResNet architecture. It differentiates inputs that are inside and outside of $\mathcal{V}$’s data distribution. Our solution can be used as a filter placed in front of the prediction API.
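Conceptually, the defense wraps the prediction API with such a filter; a minimal sketch in which `victim_model`, `detector` and the threshold are placeholders (the detector is the binary classifier trained in Section 4.2, and the degraded response for flagged queries anticipates the countermeasure evaluated in Section 4.3):

```python
import torch

class FilteredPredictionAPI:
    """Place a binary in-/out-of-distribution detector in front of the
    victim model; queries flagged as out-of-distribution get a degraded
    (randomly shuffled) probability vector instead of the true prediction."""

    def __init__(self, victim_model, detector, threshold=0.5):
        self.victim_model = victim_model
        self.detector = detector
        self.threshold = threshold

    @torch.no_grad()
    def predict(self, x):
        probs = torch.softmax(self.victim_model(x), dim=1)
        ood_score = torch.sigmoid(self.detector(x)).item()    # single-query score
        if ood_score > self.threshold:
            probs = probs[:, torch.randperm(probs.shape[1])]  # shuffle class scores
        return probs
```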

4.2 Training Setup

Datasets

When evaluating our method, we consider all $F_\mathcal{V}$’s we built in Section 3.2 and use $\mathcal{V}$’s training data as in-distribution samples to train our binary classifier. We additionally use 90,000 randomly selected samples from the 100,000-image ImageNet subset (Section 3.2) as out-of-distribution samples for training. In our experimental setup, $\mathcal{A}$ either uses the remaining 10,000 ImageNet samples or 20,000 uniformly sampled images from the OpenImages dataset [15] as its transfer set $X_\mathcal{A}$.

Training binary classifier

In our experiments, we examine two types of models: 1) ResNet models trained from scratch and 2) pre-trained ResNet models with frozen weights, where we replace the final layer with binary logistic regression. In this section, we refer to different ResNet models as RN followed by a number indicating the number of layers, e.g., RN34; we further use the LR suffix to highlight pre-trained models with a logistic regression layer.

We combine the 90,000 ImageNet samples with $\mathcal{V}$’s corresponding training dataset and assign binary labels that indicate whether an image comes from $\mathcal{V}$’s distribution: samples from $\mathcal{V}$’s dataset are labeled as in-distribution and all ImageNet samples as out-of-distribution. All images are normalized according to ImageNet-derived mean and standard deviation. We apply the same labeling and normalization procedure to $\mathcal{A}$’s transfer sets (both the 10,000 ImageNet and the 20,000 OpenImages samples). To train models from scratch (models RN18 and RN34), we use the Adam optimizer [11] with an initial learning rate of 0.001 for the first 100 epochs and 0.0005 for the remaining 100 (200 in total). Additionally, we repeat the same training procedure while removing images whose class labels overlap with ImageNet from $\mathcal{V}$’s dataset (models RN18*, RN34*, RN18*LR, RN34*LR, RN101*LR, RN152*LR). This minimizes the risk of false positives and simulates the scenario with no overlap between the datasets. To train models with the logistic regression layer (models RN18*LR, RN34*LR, RN101*LR, RN152*LR), we take ResNet models pre-trained on ImageNet. We replace the last layer with a logistic regression model and freeze the remaining layers. We train the logistic regression using the LBFGS solver [33] with L2 regularization and use 10-fold cross-validation to find the optimal value of the regularization parameter.
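A sketch of the *LR variant, using a frozen ResNet to extract embeddings once and scikit-learn’s `LogisticRegressionCV` for the cross-validated search over the L2 regularization strength (the data loaders are hypothetical and we show RN34 for brevity):

```python
import numpy as np
import torch
from torchvision import models
from sklearn.linear_model import LogisticRegressionCV

@torch.no_grad()
def extract_embeddings(loader):
    """Compute frozen ImageNet features once; the final fc layer is replaced
    by identity so the penultimate representation is returned."""
    backbone = models.resnet34(pretrained=True)
    backbone.fc = torch.nn.Identity()
    backbone.eval()
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# X_train, y_train = extract_embeddings(train_loader)   # hypothetical loader
# clf = LogisticRegressionCV(Cs=10, cv=10, penalty="l2",
#                            solver="lbfgs", max_iter=1000).fit(X_train, y_train)
```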

4.3 Experimental Results

We divide our experiments into two phases. In the first phase, we select CUBS and train binary classifiers with different architectures in order to identify the optimal classifier. We assess the results using the rate of correctly detected in- (true negative rate, TNR) and out-of-distribution samples (true positive rate, TPR). In the second phase, we evaluate the performance of the selected optimal architecture using all datasets in Section 3.2 and assess it based on the achieved TPR and TNR.
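TPR and TNR are computed from the detector’s binary decisions in the standard way, treating out-of-distribution (attacker) samples as the positive class; a small helper (a sketch, not our exact evaluation code):

```python
import numpy as np

def tpr_tnr(y_true, y_pred):
    """y_true / y_pred: 1 for out-of-distribution (attacker) samples,
    0 for in-distribution (benign) samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)   # correctly detected OOD samples
    tnr = np.mean(y_pred[y_true == 0] == 0)   # correctly accepted in-distribution samples
    return tpr, tnr

print(tpr_tnr([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))  # -> (0.666..., 1.0)
```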

Model      RN18        RN34        RN18*       RN34*
TPR/TNR    86% / 83%   94% / 80%   90% / 83%   95% / 82%

Model      RN18*LR     RN34*LR     RN101*LR    RN152*LR
TPR/TNR    84% / 84%   93% / 89%   93% / 93%   93% / 93%
Table 3: Distinguishing $\mathcal{A}$’s ImageNet transfer set (TPR) from in-distribution samples corresponding to the CUBS test set (TNR). Results are reported for models trained from scratch (RN18, RN34), trained from scratch excluding overlapping classes (RN18*, RN34*) and for pre-trained models with a logistic regression layer (RN18*LR, RN34*LR, RN101*LR, RN152*LR). The best results are achieved by RN101*LR and RN152*LR (93% / 93%).
             Ours           Baseline/ODIN/Mahal.     Baseline/ODIN/Mahal.
Dataset      TPR    TNR     TPR (at TNR of Ours)     TPR (at TNR 95%)
Caltech      63%    56%     87% / 88% / 59%          13% / 11% / 5%
CUBS         93%    93%     48% / 54% / 19%          39% / 43% / 12%
Diabetic5    99%    99%     1% / 25% / 98%           5% / 49% / 99%
GTSRB        99%    99%     42% / 56% / 71%          77% / 94% / 89%
CIFAR10      96%    96%     28% / 54% / 89%          33% / 60% / 91%
(a) Using ImageNet as $\mathcal{A}$’s transfer set.

             Ours           Baseline/ODIN/Mahal.     Baseline/ODIN/Mahal.
Dataset      TPR    TNR     TPR (at TNR of Ours)     TPR (at TNR 95%)
Caltech      61%    59%     83% / 83% / 6%           11% / 11% / 6%
CUBS         93%    93%     47% / 50% / 14%          37% / 44% / 14%
Diabetic5    99%    99%     1% / 21% / 99%           4% / 44% / 99%
GTSRB        99%    99%     44% / 64% / 75%          76% / 93% / 87%
CIFAR10      96%    96%     27% / 56% / 92%          33% / 62% / 95%
(b) Using OpenImages as $\mathcal{A}$’s transfer set.
Table 4: Distinguishing in-distribution test samples from $\mathcal{A}$’s transfer set (out-of-distribution samples). Comparison of our method with Baseline [6], ODIN [19] and Mahalanobis [16] w.r.t. TPR (correctly detected out-of-distribution samples) and TNR (correctly detected in-distribution samples).

As presented in Table 3, we find that the optimal architecture is RN101*LR: a pre-trained ResNet101 model with logistic regression replacing the final layer. Table 3 also shows that increasing model capacity improves detection accuracy. For the remaining experiments we use RN101*LR since it achieves the same TPR and TNR as RN152*LR while being faster in inference.

Prior work [32, 13] has shown that pre-trained DNN features transfer better when tasks are similar. In our case, half of the task is identical to the pre-training task (recognizing ImageNet images). Thus, it might be ideal to avoid modifying network parameters and keep the pre-trained model parameters frozen, replacing only the last layer with a logistic regression. Another benefit of using logistic regression over complete fine-tuning is that pre-trained embeddings can be pre-calculated once at negligible cost, after which training can proceed without performance penalties on a CPU in a matter of minutes. Thus, model owners can cheaply train an effective model extraction defense. Such a defense can have wide applicability for small-scale and medium-scale model owners. Finally, since our defense mechanism is stateless, it does not depend on prior queries made by the adversary nor does it keep other state. It handles each query in isolation; therefore, it cannot be circumvented by Sybil attacks.

Maintaining a high TNR is important for usability reasons. Table 4 showcases results for the remaining datasets. We compare our approach with existing state-of-the-art methods for detecting out-of-distribution samples (footnote 2) when they are also deployed to identify $\mathcal{A}$’s queries. Note that the other methods are threshold-based detectors; they require setting the TNR to a fixed value before detecting $\mathcal{A}$’s queries. Our method achieves high TPR (93% or more) on all $F_\mathcal{V}$’s but Caltech-RN34 and very high TPR (99%) for GTSRB-RN34 and Diabetic5-RN34. Furthermore, our method outperforms other state-of-the-art approaches when detecting $\mathcal{A}$’s queries. These results are consistent considering the overlap between $\mathcal{V}$’s training dataset and our subsets of ImageNet and OpenImages ($\mathcal{A}$’s transfer set). GTSRB and Diabetic5 have no overlap with ImageNet or OpenImages. On the other hand, CUBS, CIFAR10 and Caltech contain images that represent either the same classes or families of classes (as in CIFAR10) as ImageNet and OpenImages. This phenomenon is particularly pronounced in the case of Caltech, which has strong similarities to ImageNet and OpenImages. While the TPR remains significantly above the random 50%, such a model is not suitable for deployment. Although other methods can achieve a higher TPR on Caltech (87-88%), we measured this value with the TNR fixed at 56%. All models fail to discriminate Caltech samples from $\mathcal{A}$’s queries when constrained to a more reasonable TNR of 95%. We find that our defense method works better with prediction APIs that have specific tasks (such as traffic sign recognition), as opposed to general purpose classifiers that can classify thousands of fine-grained classes. We will discuss how a more realistic $\mathcal{A}$ can evade these detection mechanisms in Section 5.4.

Next, we briefly examine the attack effectiveness in the case that our detection mechanism is deployed inside the prediction API and all of $\mathcal{A}$’s queries are detected. As a basic means of protection, the prediction API returns incorrect outputs for queries labeled as out-of-distribution. In our case, incorrect API outputs are constructed via random shuffling of the complete probability vector. In the Knockoff nets adversary model, $\mathcal{A}$ has no prior expectation regarding the response obtained from $F_\mathcal{V}$; therefore, $\mathcal{A}$ obliviously uses incorrect API outputs to train $F_\mathcal{A}$. Table 5 shows that our defense method combined with incorrect API outputs significantly degrades the attack performance.

                            Correct outputs      Incorrect outputs
Victim model     Acc(F_V)   Acc(F_A)             Acc(F_A)
Caltech-RN34     74.6%      72.7% (0.97×)        29.6% (0.40×)
CUBS-RN34        77.2%      70.9% (0.91×)        20.1% (0.26×)
Diabetic5-RN34   71.1%      53.5% (0.75×)        28.0% (0.40×)
GTSRB-RN34       98.1%      94.8% (0.97×)        18.8% (0.19×)
CIFAR10-RN34     94.6%      88.2% (0.93×)        2.88% (0.03×)
Table 5: Test accuracy of $F_\mathcal{V}$ and $F_\mathcal{A}$, and the performance recovery of surrogate models, when $\mathcal{A}$ obtains the correct prediction vector and when $\mathcal{A}$ gets an incorrect (shuffled) prediction vector from the API.

5 Revisiting the Adversary Model

We aim to identify the capabilities and limitations of Knockoff nets under different experimental setups with more realistic assumptions. We evaluate Knockoff nets when 1) $F_\mathcal{A}$ and $F_\mathcal{V}$ have completely different architectures, 2) the granularity of $F_\mathcal{V}$’s prediction API output changes, and 3) $\mathcal{A}$ can access data closer to $F_\mathcal{V}$’s training data distribution. We also discuss the effect of $X_\mathcal{A}$ on the surrogate model performance.

5.1 Victim Model Architecture

We measure the performance of Knockoff nets when $F_\mathcal{V}$ does not use a pre-trained DNN model but is trained from scratch with a completely different architecture for its task. We use the 5-layer GTSRB-5L and 9-layer CIFAR10-9L $F_\mathcal{V}$’s as described in previous model extraction work [8]. These models are trained using the Adam optimizer with a learning rate of 0.001 that is decreased to 0.0005 after 100 epochs, over 200 epochs in total. The training procedure of the surrogate models is the same as in Section 3.2. Thus, GTSRB-5L and CIFAR10-9L have different architectures and optimization algorithms than those used by $\mathcal{A}$. As shown in Table 6, Knockoff nets performs well when both $F_\mathcal{V}$ and $F_\mathcal{A}$ use pre-trained models, even if $\mathcal{A}$ uses a different pre-trained model architecture (VGG16 [25]). However, the attack effectiveness decreases when $F_\mathcal{V}$ is specifically designed for the given task and does not base its performance on any pre-trained model.

                            Surrogate: RN34      Surrogate: VGG16
Victim model     Acc(F_V)   Acc(F_A)             Acc(F_A)
GTSRB-RN34       98.1%      94.8% (0.97×)        90.1% (0.92×)
GTSRB-5L         91.5%      54.5% (0.59×)        56.2% (0.61×)
CIFAR10-RN34     94.6%      88.2% (0.93×)        82.9% (0.87×)
CIFAR10-9L       84.5%      61.4% (0.73×)        64.7% (0.76×)
Table 6: Test accuracy of $F_\mathcal{V}$ and $F_\mathcal{A}$, and the performance recovery of surrogate models, when $\mathcal{A}$ uses ResNet34 and when $\mathcal{A}$ uses VGG16 as the surrogate model architecture.

5.2 Granularity of Prediction API Output

If $F_\mathcal{V}$’s prediction API gives only the predicted class or truncated results, such as the top-k predictions or a rounded version of the full probability vector for each query, the performance of the surrogate model degrades. Table 7 shows this limitation, where the prediction API gives the complete probability vector to one adversary and only the predicted class to another. Table 7 also demonstrates that the amount of degradation is related to the number of classes in $F_\mathcal{V}$’s task, since $\mathcal{A}$ obtains comparatively less information if the actual number of classes is high and the granularity of the response is low. For example, the degradation is severe when Knockoff nets is implemented against Caltech-RN34 and CUBS-RN34, which have at least 200 classes. However, degradation is low or zero when Knockoff nets is implemented against the other models (Diabetic5-RN34, GTSRB-RN34, CIFAR10-RN34).

                                        Probability vector   Predicted class
Victim model                 Acc(F_V)   Acc(F_A)             Acc(F_A)
Caltech-RN34 (256 classes)   74.6%      68.5% (0.92×)        41.9% (0.56×)
CUBS-RN34 (200 classes)      77.2%      54.8% (0.71×)        18.0% (0.23×)
Diabetic5-RN34 (5 classes)   71.1%      59.3% (0.83×)        54.7% (0.77×)
GTSRB-RN34 (43 classes)      98.1%      92.4% (0.94×)        91.6% (0.93×)
CIFAR10-RN34 (10 classes)    94.6%      71.1% (0.75×)        53.6% (0.57×)
Table 7: Test accuracy of $F_\mathcal{V}$ and $F_\mathcal{A}$, and the performance recovery of surrogate models, when $\mathcal{A}$ receives the complete probability vector and when $\mathcal{A}$ receives only the predicted class from the prediction API.

Many commercial prediction APIs return top-k outputs for queries (Clarifai returns the top-10 outputs and Google Cloud Vision returns up to the top-20 outputs from more than 10,000 labels). Therefore, the attack effectiveness will likely degrade when it is implemented against such real-world prediction APIs.
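For reference, truncating the victim’s output to a top-k response, as such commercial APIs do, is straightforward on the API side; a sketch with a hypothetical 5-class probability vector:

```python
import numpy as np

def topk_response(probs, k=10, class_names=None):
    """Return only the k highest-confidence (label, score) pairs,
    as commercial prediction APIs typically do."""
    idx = np.argsort(probs)[::-1][:k]
    names = class_names if class_names is not None else [str(i) for i in range(len(probs))]
    return [(names[i], float(probs[i])) for i in idx]

p = np.array([0.05, 0.60, 0.10, 0.20, 0.05])
print(topk_response(p, k=2))   # [('1', 0.6), ('3', 0.2)]
```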

5.3 Transfer Set Construction

When constructing $X_\mathcal{A}$, $\mathcal{A}$ might collect images that are irrelevant to the learning task or not close to $F_\mathcal{V}$’s training data distribution. Moreover, $\mathcal{A}$ might end up having an imbalanced transfer set, where observations for each class are disproportionate. In this case, the per-class accuracy of $F_\mathcal{A}$ might be much lower than that of $F_\mathcal{V}$ for classes with few observations. Figure 1 shows this phenomenon when $F_\mathcal{V}$ is CIFAR10-RN34. For example, the accuracy of $F_\mathcal{A}$ is much lower than that of $F_\mathcal{V}$ for the “deer” and “horse” classes. When the histogram of the transfer set is checked, the number of queries resulting in these prediction classes is low compared with other classes. We conjecture that a realistic $\mathcal{A}$ might try to balance the transfer set by adding more observations for underrepresented classes or by removing some training samples with low confidence values.

Class name    Acc(F_V)   Acc(F_A)
Airplane      95%        88% (0.92×)
Automobile    97%        95% (0.97×)
Bird          92%        87% (0.94×)
Cat           89%        86% (0.96×)
Deer          95%        84% (0.88×)
Dog           88%        84% (0.95×)
Frog          97%        90% (0.92×)
Horse         96%        79% (0.82×)
Ship          96%        92% (0.95×)
Truck         96%        92% (0.95×)
Figure 1: Histogram of $\mathcal{A}$’s transfer set constructed by querying the CIFAR10-RN34 victim model with 100,000 ImageNet samples, and per-class test accuracy for the victim and surrogate models. The largest differences in per-class accuracy are for the “deer” and “horse” classes.

We further investigate the effect of a poorly chosen $X_\mathcal{A}$ by performing Knockoff nets against all $F_\mathcal{V}$’s (aside from Diabetic5-RN34) using Diabetic5 as $X_\mathcal{A}$. We measure Acc($F_\mathcal{A}$) to be between 3.9% and 41.9%. The performance degradation in this experiment supports our argument that $X_\mathcal{A}$ should be chosen carefully by $\mathcal{A}$.
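Checking the balance of a transfer set (the histogram in Figure 1) amounts to counting, per victim class, how many queries the API maps to that class; a sketch in which `probs` stands in for the array of probability vectors returned by the prediction API:

```python
import numpy as np
from collections import Counter

def transfer_set_histogram(transfer_probs):
    """Count how many transfer-set queries the victim assigns to each class.
    Under-represented classes are candidates for additional querying."""
    predicted = np.argmax(transfer_probs, axis=1)
    return Counter(predicted.tolist())

# Example with a hypothetical 10-class victim and 6 queries.
probs = np.random.default_rng(0).dirichlet(np.ones(10), size=6)
print(transfer_set_histogram(probs))
```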

5.4 Access to In-distribution Data

A realistic $\mathcal{A}$ might know the task of the prediction API and could collect natural samples related to this task. By doing so, $\mathcal{A}$ can improve its surrogate model by constructing an $X_\mathcal{A}$ that approximates $\mathcal{V}$’s training data distribution well, without being detected.

In Section 4, we observed that the higher the similarity between $\mathcal{A}$’s and $\mathcal{V}$’s training data distributions, the less effective our method becomes. In the worst case, $\mathcal{A}$ has access to a large amount of unlabeled data that does not significantly deviate from $\mathcal{V}$’s training data distribution. In such a scenario, the TPR values in Table 4 would drop towards the chance level of 50%. We argue that this limitation is inherent to all detection methods that try to identify out-of-distribution samples.

Publicly available datasets designed for ML research, as well as vast databases accessible through search engines and from data vendors (e.g., Quandl, DataSift, Acxiom), make it possible to obtain a substantial amount of unlabeled data from any domain. Therefore, making assumptions about $\mathcal{A}$’s access to natural data (or lack thereof) is not realistic. This corresponds to the most capable, and yet plausible, adversary model: one in which $\mathcal{A}$ has approximate knowledge of $\mathcal{V}$’s training data distribution and access to a large, unlabeled dataset. In such a scenario, $\mathcal{A}$’s queries are not going to differ from those of a benign client, rendering detection techniques ineffective. Therefore, we conclude that against a strong, but realistic adversary, the current business model of prediction APIs, which allows a large number of inexpensive queries, cannot thwart model extraction attacks.

Even if model extraction attacks can be detected through stateful analysis, highly distributed Sybil attacks are unlikely to be detected. In theory, vendors could charge their customers upfront for a significant number of queries (over 100k), making Sybil attacks cost-ineffective. However, this reduces utility for benign users and restricts access to those who can afford to pay. Using customer authentication or rate-limiting techniques also reduces utility for benign users. These methods only slow down the attack and ultimately fail to prevent model extraction.

6 Related Work

Several methods have been proposed to detect or deter model extraction attacks. In certain cases, altering the predictions returned to API clients has been shown to significantly deteriorate model extraction attacks: predictions can be restricted to class labels [30, 22] or adversarially modified to degrade the performance of the surrogate model [17, 22]. However, it has been shown that such defenses do not work against all attacks. Model extraction attacks against simple DNNs [8, 22, 21] are still effective when using only the predicted class. While these defenses may increase the training time of $F_\mathcal{A}$, they ultimately do not prevent Knockoff nets.

Other works have argued that model extraction defenses alone are not sufficient and additional countermeasures are necessary. In DAWN [27], the authors propose that the victim can poison $\mathcal{A}$’s training process by occasionally returning false predictions, and thus embed a watermark in $F_\mathcal{A}$. If $\mathcal{A}$ later makes the surrogate model publicly available for queries, the victim can claim ownership using the embedded watermark. DAWN is effective at watermarking surrogate models obtained using Knockoff nets, but it requires that $\mathcal{A}$’s model be publicly available for queries and it does not protect against the model extraction itself. Moreover, returning false predictions for the purpose of embedding watermarks may be unacceptable in certain deployments, e.g., malware detection. Therefore, accurate detection of model extraction may be seen as a necessary condition for watermarking.

Prior work found that distances between queries made during model extraction attacks follow a different distribution than legitimate ones [8]. Thus, attacks could be detected using density estimation methods, where $\mathcal{A}$’s inputs produce a highly skewed distribution. This technique protects DNN models against specific attacks using synthetic queries and does not generalize to other attacks, e.g., Knockoff nets. Other methods are designed to detect queries that explore an abnormally large region of the input space [10] or attempt to identify queries that get increasingly close to the classes’ decision boundaries [24]. However, these techniques are limited in application to decision trees and they are ineffective against complex DNNs that are targeted by Knockoff nets.

In this work, we aim to detect queries that significantly deviate from the distribution of the victim’s dataset without affecting the prediction API’s performance. As such, our approach is closest to the PRADA [8] defense. However, we aim to detect Knockoff nets, which PRADA is not designed for. Our defense exploits the fact that Knockoff nets uses natural images sampled from public databases constructed for a general task. It is an inexpensive, yet effective defense against Knockoff nets, and may have wide practical applicability. However, we believe that ML-based detection schemes open up the possibility of evasion, which we aim to investigate in future work.

7 Conclusion

We evaluated the effectiveness of Knockoff nets, a state-of-the-art model extraction attack, in several real-life scenarios. We showed that under its original adversary model described in [21], it is possible to detect an adversary mounting Knockoff nets attacks by distinguishing between in- and out-of-distribution queries. While we confirm the results reported in [21], we also showed that more realistic assumptions about the capabilities of $\mathcal{A}$ can have both positive and negative implications for attack effectiveness. On the one hand, the performance of Knockoff nets is reduced against more realistic prediction APIs that do not return the complete probability vector. On the other hand, if $\mathcal{A}$ knows the task of the victim model and has access to sufficient unlabeled data drawn from the same distribution as $\mathcal{V}$’s training data, the attack can be not only very effective but also virtually undetectable. We therefore conclude that a strong, but realistic adversary can extract complex real-life DNN models effectively, without being detected. Given this conclusion, we believe that deterrence techniques like watermarking [27] and fingerprinting [20] deserve further study – while they cannot prevent model extraction, they can reduce the incentive for it by making large-scale exploitation of extracted models detectable.

Acknowledgements

This work was supported in part by Intel (ICRI-CARS). We would like to thank the Aalto Science-IT project for computational resources.

Footnotes

  1. https://github.com/gregwchase/dsi-capstone
  2. https://github.com/pokaxpoka/deep_Mahalanobis_detector

References

  1. C. M. Bishop (2006) Pattern recognition and machine learning. springer. Cited by: §4.1.
  2. J. R. Correia-Silva, R. F. Berriel, C. Badue, A. F. de Souza and T. Oliveira-Santos (2018) Copycat cnn: stealing knowledge by persuading confession with random non-labeled data. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §1, §2.2.
  3. J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-fei (2009) Imagenet: a large-scale hierarchical image database. In In CVPR, Cited by: §3.1.
  4. G. S. Griffin, A. Holub and P. Perona (2007) Caltech-256 object category dataset. Cited by: §3.2.
  5. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2, §3.2.
  6. D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. Proceedings of International Conference on Learning Representations. Cited by: §4.1, Table 4.
  7. M. Jagielski, N. Carlini, D. Berthelot, A. Kurakin and N. Papernot (2019) High-fidelity extraction of neural network models. arXiv preprint arXiv:1909.01838. Cited by: §1.
  8. M. Juuti, S. Szyller, S. Marchal and N. Asokan (2019) PRADA: protecting against DNN model stealing attacks. In to appear in IEEE European Symposium on Security and Privacy (EuroS&P), pp. 1–16. Cited by: §1, §2.2, §2.2, §3.1, §3.2, §5.1, §6, §6, §6.
  9. Kaggle (2015) Diabetic retinopathy detection. EyePACS. Note: https://www.kaggle.com/c/diabetic-retinopathy-detection/overview/description Cited by: §3.2.
  10. M. Kesarwani, B. Mukhoty, V. Arya and S. Mehta (2018) Model extraction warning in MLaaS paradigm. In Proceedings of the 34th Annual Computer Security Applications Conference, Cited by: §6.
  11. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  12. S. Kornblith, J. Shlens and Q. V. Le (2018) Do better imagenet models transfer better?. arXiv preprint arXiv:1805.08974. Cited by: §3.1.
  13. S. Kornblith, J. Shlens and Q. V. Le (2019) Do better imagenet models transfer better?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2661–2671. Cited by: §3.1, §4.3.
  14. A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: §3.2.
  15. A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci and T. Duerig (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §4.2.
  16. K. Lee, K. Lee, H. Lee and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177. Cited by: §4.1, Table 4.
  17. T. Lee, B. Edwards, I. Molloy and D. Su (2018) Defending against model stealing attacks using deceptive perturbations. arXiv preprint arXiv:1806.00054. Cited by: §1, §6.
  18. P. Li, J. Yi and L. Zhang (2018) Query-efficient black-box attack by active learning. arXiv preprint arXiv:1809.04913. Cited by: §1.
  19. S. Liang, Y. Li and R. Srikant (2017) Principled detection of out-of-distribution examples in neural networks. arXiv preprint arXiv:1706.02690. Cited by: §4.1, Table 4.
  20. N. Lukas, Y. Zhang and F. Kerschbaum (2019) Deep neural network fingerprinting by conferrable adversarial examples. arXiv preprint arXiv:1912.00888. Cited by: §7.
  21. T. Orekondy, B. Schiele and M. Fritz (2019) Knockoff nets: stealing functionality of black-box models. In CVPR, Cited by: §1, §2.2, §3.1, §3.2, §3.2, §3.2, Table 2, §3, §6, §7.
  22. T. Orekondy, B. Schiele and M. Fritz (2019) Prediction poisoning: utility-constrained defenses against model stealing attacks. CoRR abs/1906.10908. External Links: Link, 1906.10908 Cited by: §6.
  23. N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506–519. Cited by: §1, §2.2, §2.2, §2.2, §3.1, §3.2.
  24. E. Quiring, D. Arp and K. Rieck (2018) Forgotten siblings: unifying attacks on machine learning and digital watermarking. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 488–502. Cited by: §1, §6.
  25. K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §5.1.
  26. J. Stallkamp, M. Schlipsing, J. Salmen and C. Igel (2011) The german traffic sign recognition benchmark: a multi-class classification competition. In IEEE International Joint Conference on Neural Networks, Cited by: §3.2.
  27. S. Szyller, B. G. Atli, S. Marchal and N. Asokan (2019) DAWN: dynamic adversarial watermarking of neural networks. CoRR abs/1906.00830. External Links: Link, 1906.00830 Cited by: §6, §7.
  28. TechWorld (2018) How tech giants are investing in artificial intelligence. Note: https://www.techworld.com/picture-gallery/data/tech-giants-investing-in-artificial-intelligence-3629737. Online; accessed 9 May 2019. Cited by: §1.
  29. A. Torralba and A. A. Efros (2011-06) Unbiased look at dataset bias. In CVPR 2011, Vol. , pp. 1521–1528. External Links: Document, ISSN 1063-6919 Cited by: §4.1.
  30. F. Tramèr, F. Zhang, A. Juels, M. K. Reiter and T. Ristenpart (2016) Stealing machine learning models via prediction apis. In 25th USENIX Security Symposium (USENIX Security 16), pp. 601–618. Cited by: §1, §2.2, §6.
  31. P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie and P. Perona (2010) Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: §3.2.
  32. J. Yosinski, J. Clune, Y. Bengio and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in neural information processing systems, pp. 3320–3328. Cited by: §4.3.
  33. C. Zhu, R. H. Byrd, P. Lu and J. Nocedal (1994) L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization. Technical report, ACM Trans. Math. Software. Cited by: §4.2.