High Accuracy and High Fidelity Extraction of Neural Networks
Abstract
In a model extraction attack, an adversary steals a copy of a remotely deployed machine learning model, given oracle prediction access. We taxonomize model extraction attacks around two objectives: accuracy, i.e., performing well on the underlying learning task, and fidelity, i.e., matching the predictions of the remote victim classifier on any input.
To extract a high-accuracy model, we develop a learning-based attack exploiting the victim to supervise the training of an extracted model. Through analytical and empirical arguments, we then explain the inherent limitations that prevent any learning-based strategy from extracting a truly high-fidelity model, i.e., a functionally-equivalent model whose predictions are identical to those of the victim model on all possible inputs. Addressing these limitations, we expand on prior work to develop the first practical functionally-equivalent extraction attack for direct extraction (i.e., without training) of a model’s weights.
We perform experiments both on academic datasets and a state-of-the-art image classifier trained with 1 billion proprietary images. In addition to broadening the scope of model extraction research, our work demonstrates the practicality of model extraction attacks against production-grade systems.
1 Introduction
Machine learning, and neural networks in particular, are widely deployed in industry settings. Models are often deployed as prediction services or otherwise exposed to potential adversaries. Despite this fact, the trained models themselves are often proprietary and are closely guarded.
There are two reasons models are often seen as sensitive. First, they are expensive to obtain. Not only is it expensive to train the final model [48] (e.g., Google recently trained a model with 340 million parameters on hardware costing 61,000 USD per training run [56]), but the work of identifying the best model architecture, training algorithm, and hyperparameters often eclipses the cost of training the final model. Training these models also requires investing in an expensive collection process to obtain the training datasets necessary for an accurate classifier [16, 12, 49, 53]. Second, revealing trained models to potential adversaries raises security [39, 28] and privacy [44, 41] concerns.
Concerningly, prior work found that an adversary with query access to a model can steal it, obtaining a copy that largely agrees with the remote victim model [28, 51, 37, 9, 36, 38, 10]. These extraction attacks are therefore important to consider.
In this paper, we systematize the space of model extraction around two adversarial objectives: accuracy and fidelity. Accuracy measures the correctness of predictions made by the extracted model on the test distribution. Fidelity, in contrast, measures the general agreement between the extracted and victim models on any input. Both of these objectives are desirable, but they are in conflict for imperfect victim models: a high-fidelity extraction should replicate the errors of the victim, whereas a high-accuracy model should instead try to make an accurate prediction. At the high-fidelity limit is functionally-equivalent model extraction: the two models agree on all inputs, both on and off the underlying data distribution.
While most prior work considers accuracy [51, 39, 9], we argue that fidelity is often equally important. When using model extraction to mount black-box adversarial example attacks [39], fidelity makes the attack more effective because more adversarial examples transfer from the extracted model to the victim. Membership inference [44, 41] benefits from the extracted model closely replicating the confidence of predictions made by the victim. Finally, a functionally-equivalent extraction enables the adversary to inspect whether internal representations reveal unintended attributes of the input that are statistically uncorrelated with the training objective, allowing the adversary to benefit from overlearning [46].
We design one attack for each objective. The first is a learning-based attack, which uses the victim to generate labels for training the extracted model. While existing techniques already achieve high accuracy, our attacks are more query-efficient and scale to larger models. We perform experiments that surface inherent limitations of learning-based extraction attacks and argue that learning-based strategies are ill-suited to achieve high-fidelity extraction. Then, we develop the first practical functionally-equivalent attack, which directly recovers a two-layer neural network’s weights exactly given access to double-precision model inference. Compared to prior work, which required a high-precision power side-channel [18] or access to model gradients [32], our attack only requires input-output access to the model, while simultaneously scaling to larger networks than either of the prior methods.
We make the following contributions:

We taxonomize the space of model extraction attacks by exploring the objectives of accuracy and fidelity.

We improve the query efficiency of learning attacks for accuracy extraction and make them practical for millions-of-parameter models trained on billions of images.

We achieve high-fidelity extraction by developing the first practical functionally-equivalent model extraction.

We mix the proposed methods to obtain a hybrid method which improves both accuracy and fidelity extraction.
2 Preliminaries
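We consider classifiers with domain $\mathcal{X} \subseteq \mathbb{R}^d$ and range $\mathcal{Y} \subseteq \mathbb{R}^K$; the output of the classifier is a distribution over $K$ class labels. The class assigned to an input $x$ by a classifier $f$ is $\arg\max_{i \in [K]} f(x)_i$ (for $n \in \mathbb{Z}$, we write $[n] = \{1, 2, \dots, n\}$). In order to satisfy the constraint that a classifier’s output is a distribution, a softmax $\sigma(\cdot)$ is typically applied to the output of an arbitrary function $f_L : \mathcal{X} \to \mathbb{R}^K$:
$$\sigma(f_L(x))_i = \frac{\exp(f_L(x)_i)}{\sum_{j \in [K]} \exp(f_L(x)_j)}$$
We call the function $f_L(\cdot)$ the logit function for a classifier $f$. To convert a class label into a probability vector, it is common to use one-hot encoding: for a value $j \in [K]$, the one-hot encoding $\mathrm{OH}(j; K)$ is a vector in $\mathbb{R}^K$ with $\mathrm{OH}(j; K)_i = \mathbb{1}(i = j)$; that is, it is 1 only at index $j$, and 0 elsewhere.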
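As a minimal sketch of the softmax and one-hot encoding just described, the following pure-Python helpers may be useful. The function names are illustrative assumptions, and class indices are 0-based here, whereas the text counts classes from 1.

```python
import math

def softmax(logits):
    """Map a list of real-valued logits to a probability vector."""
    m = max(logits)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(j, num_classes):
    """One-hot encoding of class index j (0-based in this sketch)."""
    return [1.0 if i == j else 0.0 for i in range(num_classes)]

def predicted_class(probs):
    """Index of the most likely class (argmax)."""
    return max(range(len(probs)), key=lambda i: probs[i])

probs = softmax([2.0, 1.0, 0.1])
```

Here `probs` sums to 1 and `predicted_class(probs)` picks the class with the largest logit, since softmax preserves the ordering of its inputs.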
Model extraction concerns reproducing a victim model, or oracle, which we write $O : \mathcal{X} \to \mathcal{Y}$. The model extraction adversary will run an extraction algorithm $\mathcal{A}(O)$, which outputs the extracted model $\hat{O}$. We will sometimes parameterize the oracle (resp. extracted model) as $O_\theta$ (resp. $\hat{O}_\theta$) to denote that it has model parameters $\theta$; we will omit this when unnecessary or apparent from context.
In this work, we consider $O$ and $\hat{O}$ to both be neural networks. A neural network is a sequence of operations, alternating between linear and nonlinear operations; a pair of linear and nonlinear operations is called a layer. Each linear operation projects onto some space $\mathbb{R}^h$; the dimensionality $h$ of this space is referred to as the width of the layer. The number of layers is the depth of the network. The nonlinear operations are typically fixed, while the linear operations have parameters which are learned during training. The function computed by layer $i$, $f_i(a)$, is therefore computed as $f_i(a) = g_i(A^{(i)} a + B^{(i)})$, where $g_i$ is the $i$th nonlinear function, and $A^{(i)}, B^{(i)}$ are the parameters of layer $i$ ($A^{(i)}$ is the weights, $B^{(i)}$ the biases). A common choice of activation is the rectified linear unit, or ReLU, which sets $\mathrm{ReLU}(x) = \max(0, x)$. Introduced to improve the convergence of optimization when training neural networks, the ReLU activation has established itself as an effective default choice for practitioners [33]. Thus, we consider primarily ReLU networks in this work.
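The layer structure above can be sketched in a few lines of plain Python. This is an illustrative toy, not the networks used in the paper: the weights are made up, and the sketch assumes the final layer returns raw logits (no ReLU), which would then be passed to a softmax.

```python
def relu(v):
    """Rectified linear unit applied elementwise."""
    return [max(0.0, x) for x in v]

def dense(weights, biases, a):
    """Linear operation: weight matrix (a list of rows) times a, plus biases."""
    return [sum(w * x for w, x in zip(row, a)) + b for row, b in zip(weights, biases)]

def forward(layers, x):
    """Apply each (weights, biases) pair, with ReLU after every layer but the last."""
    a = x
    for idx, (weights, biases) in enumerate(layers):
        a = dense(weights, biases, a)
        if idx < len(layers) - 1:
            a = relu(a)
    return a  # logits

# Hypothetical two-layer network: input dimension 2, hidden width 3, 2 output logits.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, -0.25, 0.1]),
    ([[1.0, 2.0, -1.0], [0.0, -1.0, 1.0]], [0.05, 0.0]),
]
logits = forward(layers, [1.0, 2.0])
```

The width of the hidden layer is the number of rows in the first weight matrix, and the depth is `len(layers)`.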
The network structure described here is called fully connected because each linear operation “connects” every input node to every output node. In many domains, such as computer vision, this is more structure than necessary. A neuron computing edge detection, for example, only needs to use information from a small region of the image. Convolutional networks were developed to combat this inefficiency—the linear functions become filters, which are still linear, but are only applied to a small (e.g., 3x3 or 5x5) window of the input. They are applied to every window using the same weights, making convolutions require far fewer parameters than fully connected networks.
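Neural networks are trained by empirical risk minimization. Given a dataset of $n$ samples $D = \{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathcal{X} \times \mathcal{Y}$, training involves minimizing a loss function $L$ on the dataset with respect to the parameters of the network $f$. A common loss function is the cross-entropy loss $H$ for a sample $(x, y)$: $H(y, f(x)) = -\sum_{k \in [K]} y_k \log(f(x)_k)$, where $y$ is the probability (or one-hot) vector for the true class. The cross-entropy loss on the full dataset is then
$$L(D; f) = \frac{1}{n} \sum_{i=1}^{n} H(y_i, f(x_i)) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k \in [K]} y_{i,k} \log(f(x_i)_k)$$
The loss is minimized with some form of gradient descent, often stochastic gradient descent (SGD). In SGD, gradients of parameters $\theta$ are computed over a randomly sampled batch $B$, averaged, and scaled by a learning rate $\eta$:
$$\theta_{t+1} = \theta_t - \frac{\eta}{|B|} \sum_{i \in B} \nabla_\theta H(y_i, f(x_i))$$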
Other optimizers [34, 14, 22] use gradient statistics to reduce the variance of updates which can result in better performance.
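To make the cross-entropy loss and the SGD update concrete, here is a small sketch on a softmax linear model. The model, the toy dataset, the learning rate, and the batch size are all assumptions chosen for illustration. The sketch relies on the standard fact that the gradient of the cross-entropy with respect to the logits is the predicted probability vector minus the target.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(y, p):
    """H(y, p) = -sum_k y_k log p_k for a target vector y and a prediction p."""
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p) if yk > 0)

def predict(W, b, x):
    return softmax([sum(w * xj for w, xj in zip(row, x)) + bk for row, bk in zip(W, b)])

def sgd_step(W, b, batch, lr):
    """One SGD update: average per-sample gradients over the batch, scale by lr."""
    gW = [[0.0] * len(W[0]) for _ in W]
    gb = [0.0] * len(b)
    for x, y in batch:
        p = predict(W, b, x)
        for k in range(len(W)):
            err = (p[k] - y[k]) / len(batch)  # gradient of H w.r.t. logit k, averaged
            gb[k] += err
            for j in range(len(x)):
                gW[k][j] += err * x[j]
    W = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, gW)]
    b = [bk - lr * g for bk, g in zip(b, gb)]
    return W, b

def dataset_loss(W, b, data):
    return sum(cross_entropy(y, predict(W, b, x)) for x, y in data) / len(data)

# Toy dataset with one-hot targets (illustrative only).
data = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0]),
        ([0.9, 0.2], [1.0, 0.0]), ([0.1, 0.8], [0.0, 1.0])]
W, b = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
rng = random.Random(0)
loss_before = dataset_loss(W, b, data)
for _ in range(200):
    W, b = sgd_step(W, b, rng.sample(data, 2), lr=0.5)
loss_after = dataset_loss(W, b, data)
```

With all-zero initial parameters the model predicts a uniform distribution, so the initial loss is log 2; the updates then drive the loss down on this separable toy problem.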
A less common setting, but one which is important for our work, is when the target values which are used to train the network are not one-hot values, but are probability vectors output by a different model $g(x; T)$. When training using the dataset $D_g = \{(x_i, g(x_i; T)^{1/T})\}_{i=1}^{n}$, we say the trained model is distilled from $g$ with temperature $T$, referring to the process of distillation introduced in Hinton et al. [17]. Note that the values of $g(x_i; T)^{1/T}$ are always scaled to sum to 1.
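A short sketch of how such soft targets might be formed, assuming the targets are obtained by raising the other model's probabilities to the power 1/T and renormalizing so that they sum to 1. The helper name `temper` and the example probabilities are hypothetical.

```python
def temper(probs, temperature):
    """Raise each probability to the power 1/T, then rescale to sum to 1."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

teacher_probs = [0.9, 0.09, 0.01]   # hypothetical output of the other model
soft = temper(teacher_probs, temperature=2.0)
unchanged = temper(teacher_probs, temperature=1.0)
```

A temperature above 1 flattens the distribution while preserving the ranking of the classes; a temperature of 1 leaves the probabilities unchanged.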
3 Taxonomy of Threat Models
We now address the spectrum of adversaries interested in extracting neural networks. As illustrated in Table 1, we taxonomize the space of possible adversaries around two overarching goals—theft and reconnaissance. We detail why extraction is not always practically realizable by constructing models that are impossible to extract, or require a large number of queries to extract. We conclude our threat model with a discussion of how adversarial capabilities (e.g., prior knowledge of model architecture or information returned by queries) affect the strategies an adversary may consider.
Table 1: Existing model extraction attacks, and the attacks developed in this work, placed in our taxonomy. Model types: LM (linear model), NN (neural network), DT (decision tree), CNN (convolutional neural network).

| Attack | Type | Model type | Goal | Query Output |
|---|---|---|---|---|
| Lowd & Meek [28] | Direct Recovery | LM | Functionally Equivalent | Labels |
| Tramèr et al. [51] | (Active) Learning | LM, NN | Task Accuracy, Fidelity | Probabilities, labels |
| Tramèr et al. [51] | Path finding | DT | Functionally Equivalent | Probabilities, labels |
| Milli et al. [32] (theoretical) | Direct Recovery | NN (2 layer) | Functionally Equivalent | Gradients, logits |
| Milli et al. [32] | Learning | LM, NN | Task Accuracy | Gradients |
| Pal et al. [38] | Active learning | NN | Fidelity | Probabilities, labels |
| Chandrasekaran et al. [9] | Active learning | LM | Functionally Equivalent | Labels |
| Copycat CNN [10] | Learning | CNN | Task Accuracy, Fidelity | Labels |
| Papernot et al. [39] | Active learning | NN | Fidelity | Labels |
| CSI NN [5] | Direct Recovery | NN | Functionally Equivalent | Power Side Channel |
| Knockoff Nets [37] | Learning | NN | Task Accuracy | Probabilities |
| Functionally equivalent (this work) | Direct Recovery | NN (2 layer) | Functionally Equivalent | Probabilities, logits |
| Efficient learning (this work) | Learning | NN | Task Accuracy, Fidelity | Probabilities |
3.1 Adversarial Motivations
Model extraction attacks target the confidentiality of a victim model deployed on a remote service. A model refers here to both the architecture and its parameters. Architectural details include the learning hypothesis (i.e., neural network in our case) and corresponding details (e.g., number of layers and activation functions for neural networks). Parameter values are the result of training.
First, we consider theft adversaries, motivated by economic incentives. Generally, the defender went through an expensive process to design the model’s architecture and train it to set parameter values. Here, the model can be viewed as intellectual property that the adversary is trying to steal. A line of work has in fact referred to this as “model stealing” [51].
Second, we consider reconnaissance adversaries, who perform extraction to later mount attacks targeting other security properties of the learning system: e.g., its integrity with adversarial examples [39], or privacy with training data membership inference [44, 41]. Model extraction enables an adversary previously operating in a black-box threat model to mount attacks against the extracted model in a white-box threat model. The adversary has, by design, access to the extracted model’s parameters. In the limit, this adversary would expect to extract an exact copy of the oracle.
The goal of exact extraction is to produce $\hat{O}_\theta = O_\theta$, so that the model’s architecture and all of its weights are identical to the oracle. This definition is purely a strawman: it is the strongest possible attack, but it is fundamentally impossible for many classes of neural networks, including ReLU networks, because any individual model belongs to a large equivalence class of networks which are indistinguishable from input-output behavior. For example, we can scale an arbitrary neuron’s input weights and biases by some $c > 0$, and scale its output weights and biases by $c^{-1}$; the resulting model’s behavior is unchanged. Alternatively, in any intermediate layer of a ReLU network, we may add a dead neuron which never contributes to the output, or permute the (arbitrary) order of neurons internally. Given access to input-output behavior, the best we can do is identify the equivalence class the oracle belongs to.
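The scaling argument can be checked numerically. The sketch below uses a made-up two-layer ReLU network with a scalar output: it multiplies one hidden neuron's input weights and bias by a positive constant and divides that neuron's output weight by the same constant. All weights and inputs are illustrative assumptions.

```python
def two_layer(x, w_in, b_in, w_out, b_out):
    """Scalar-output network: sum_i w_out[i] * ReLU(w_in[i] . x + b_in[i]) + b_out."""
    hidden = [max(0.0, sum(w * xj for w, xj in zip(row, x)) + b)
              for row, b in zip(w_in, b_in)]
    return sum(wo * h for wo, h in zip(w_out, hidden)) + b_out

w_in = [[1.0, -2.0], [0.5, 0.25]]
b_in = [0.5, -1.0]
w_out = [2.0, -3.0]
b_out = 0.125

c = 4.0  # any c > 0 works; a power of two keeps the floating-point arithmetic exact
w_in_scaled = [[c * w for w in w_in[0]], w_in[1]]   # rescale neuron 0 only
b_in_scaled = [c * b_in[0], b_in[1]]
w_out_scaled = [w_out[0] / c, w_out[1]]

inputs = [[0.0, 0.0], [1.0, 0.25], [-2.0, 3.0], [4.0, 1.5]]
original = [two_layer(x, w_in, b_in, w_out, b_out) for x in inputs]
rescaled = [two_layer(x, w_in_scaled, b_in_scaled, w_out_scaled, b_out) for x in inputs]
```

The two parameter sets differ, yet `original` and `rescaled` agree on every input, because ReLU(c z) = c ReLU(z) for c > 0. This is why input-output access can at best identify an equivalence class of networks.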
3.2 Adversarial Goals
This perspective yields a natural spectrum of realistic adversarial goals characterizing decreasingly precise extractions.
Functionally Equivalent Extraction. The goal of functionally equivalent extraction is to construct an $\hat{O}$ such that $\forall x \in \mathcal{X}$, $\hat{O}(x) = O(x)$. This is a tractable weakening of the exact extraction definition from earlier: it is the hardest possible goal using only input-output pairs. The adversary obtains a member of the oracle’s equivalence class. This goal enables a number of downstream attacks, including those involving inspection of the model’s internal representations like overlearning [46], to operate in the white-box threat model.
Fidelity Extraction. Given some target distribution $\mathcal{D}_F$ over $\mathcal{X}$, and goal similarity function $S(p_1, p_2)$, the goal of fidelity extraction is to construct an $\hat{O}$ that maximizes $\Pr_{x \sim \mathcal{D}_F}[S(\hat{O}(x), O(x))]$. In this work, we consider only label agreement, where $S(p_1, p_2) = \mathbb{1}(\arg\max(p_1) = \arg\max(p_2))$; we leave exploration of other similarity functions to future work.
A natural distribution of interest is the data distribution itself: the adversary wants to make sure the mistakes and correct labels are the same between the two models. A reconnaissance attack for constructing adversarial examples would care about a perturbed data distribution; mistakes might be more important to the adversary in this setting. Membership inference would use the natural data distribution, including any outliers. These distributions tend to be concentrated on a low-dimension manifold of $\mathcal{X}$, making fidelity extraction significantly easier than functionally equivalent extraction. Indeed, functionally equivalent extraction achieves a perfect fidelity of 1 on all distributions and all similarity functions.
Task Accuracy Extraction. For the true task distribution $\mathcal{D}_A$ over $\mathcal{X} \times \mathcal{Y}$, the goal of task accuracy extraction is to construct an $\hat{O}$ maximizing $\Pr_{(x, y) \sim \mathcal{D}_A}[\arg\max(\hat{O}(x)) = y]$. This goal is to match (or exceed) the accuracy of the target model, which is the easiest goal to consider in this taxonomy (because it doesn’t need to match the mistakes of $O$).
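The difference between the fidelity and task accuracy goals can be illustrated with empirical estimates on a finite sample. This is a sketch with toy stand-in models; `fidelity` uses label agreement as the similarity function, and all names and data are assumptions made for illustration.

```python
def argmax(probs):
    return max(range(len(probs)), key=lambda i: probs[i])

def fidelity(extracted, oracle, inputs):
    """Fraction of inputs on which the two models assign the same label."""
    return sum(argmax(extracted(x)) == argmax(oracle(x)) for x in inputs) / len(inputs)

def accuracy(model, labeled):
    """Fraction of (x, y) pairs on which the model predicts the true label y."""
    return sum(argmax(model(x)) == y for x, y in labeled) / len(labeled)

# Toy models on 1-D inputs (purely illustrative).
oracle = lambda x: [0.8, 0.2] if x < 0.5 else [0.3, 0.7]
extracted = lambda x: [0.6, 0.4] if x < 0.6 else [0.1, 0.9]
labeled = [(0.1, 0), (0.4, 0), (0.55, 0), (0.7, 1), (0.9, 1)]
inputs = [x for x, _ in labeled]

fid = fidelity(extracted, oracle, inputs)
oracle_acc = accuracy(oracle, labeled)
extracted_acc = accuracy(extracted, labeled)
```

In this toy case the extracted model is more accurate than the oracle but has imperfect fidelity, because it does not reproduce the oracle's mistake at `x = 0.55`. This is the conflict between the two objectives for an imperfect victim.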
Existing Attacks. In Table 1, we fit previous model extraction work into this taxonomy and discuss their techniques. Functionally equivalent extraction has been considered for linear models [28, 9], decision trees [51], both given probabilities, and neural networks [32, 5], given extra access. Task accuracy extraction has been considered for linear models [51] and neural networks [32, 10, 37], and fidelity extraction has also been considered for linear models [51] and neural networks [38, 39]. Notably, functionally equivalent attacks require model-specific techniques, while task accuracy and fidelity typically use generic learning-based approaches.
3.3 Model Extraction is Hard
Before we consider adversarial capabilities in Section 3.4 and potential corresponding approaches to model extraction, we must understand how successful we can hope to be. Here, we present arguments that will serve to bound our expectations. First, we will identify some limitations of functionally equivalent extraction by constructing networks which require arbitrarily many queries to extract. Second, we will present another class of networks that cannot be extracted with fidelity without querying a number of times exponential in its depth. We provide intuition in this section and later prove these statements in Appendix A.
Exponential hardness of functionally equivalent attacks. In order to show that functionally equivalent extraction is intractable in the worst case, we construct a class of neural networks that are hard to extract without making exponentially many queries in the network’s width.
Theorem 1.
There exists a class of width $3k$ and depth 2 neural networks on domain $[0, 1]^d$ (with precision $p$ numbers) with $d \ge k$ that require, given logit access to the networks, $\Theta(p^k)$ queries to extract.
The precision $p$ is the number of possible values a feature can take from $[0, 1]$. In images with 8-bit pixels, we have $p = 256$. The intuition for this theorem is that a width $3k$ network can implement a function that returns a non-zero value on at most a $p^{-k}$ fraction of the space. In the worst case, $p^k$ queries are necessary to find this fraction of the space.
Note that this result assumes the adversary can only observe the input-output behavior of the oracle. If this assumption is broken, then functionally equivalent extraction becomes practical. For example, Batina et al. [5] perform functionally equivalent extraction by performing a side channel attack (specifically, differential power analysis [23]) on a microprocessor evaluating the neural network.
We also observe in Theorem 2 that, given white-box access to two neural networks, it is NP-hard in general to test if they are functionally equivalent. We do this by constructing two networks that differ only in coordinates satisfying a subset sum instance. Then testing functional equivalence for these networks is as hard as finding the satisfying subset.
Theorem 2 (Informal).
Given their weights, it is NP-hard to test whether two neural networks are functionally equivalent.
Any attack which can claim to perform functionally equivalent extraction efficiently (both in number of queries used and in running time) must make some assumptions to avoid these pathologies. In Section 6, we will present and discuss the assumptions of a functionally equivalent extraction attack for two-layer neural network models.
Learning approaches struggle with fidelity. A final difficulty for model extraction comes from recent work in learnability [11]. Das et al. prove that, for deep random networks with input dimension $d$ and depth $h$, model extraction approaches that can be written as Statistical Query (SQ) learning algorithms require $\exp(O(h))$ samples for fidelity extraction. SQ algorithms are a restricted form of learning algorithm which only access the data with noisy aggregate statistics; many learning algorithms, such as (stochastic) gradient descent and PCA, are examples. As a result, most learning-based approaches to model extraction will inherit this inefficiency. A sample-efficient approach therefore must either make assumptions about the model to be extracted (to distinguish it from a deep random network), or must access its dataset without statistical queries.
Theorem 3 (Informal [11]).
Random networks with domain $\{0, 1\}^d$ and range $\{0, 1\}$ and depth $h$ require $\exp(O(h))$ samples to learn in the SQ learning model.
3.4 Adversarial Capabilities
We organize an adversary’s prior knowledge about the oracle and its training data into three categories—domain knowledge, deployment knowledge, and model access.
Domain Knowledge
Domain knowledge describes what the adversary knows about the task the model is designed for. For example, if the model is an image classifier, then the model output should not change under standard image data augmentations, such as shifts, rotations, or crops. Usually, the adversary should be assumed to have as much domain knowledge as the oracle’s designer.
In some domains, it is reasonable to assume the adversary has access to public task-relevant pretrained models or datasets. This is often the case for learning-based model extraction, which we develop in Section 4. We consider an adversary using part of a public dataset of 1.3 million images [12] as unlabeled data to mount an attack against a model trained on a proprietary dataset of 1 billion labeled images [30].
Learning-based extraction is hard without natural data. In learning-based extraction, we assume that the adversary is able to collect public unlabeled data to mount their attack. This is a natural assumption for a theft-motivated adversary who wishes to steal the oracle for local use: the adversary has data they want to learn the labels of without querying the model! For other adversaries, progress in generative modeling is likely to offer ways to remove this assumption [31]. We leave this to future work because our overarching aim in this paper is to characterize the model extraction attacker space around the notions of accuracy and fidelity. All progress achieved by our approaches is complementary to possible progress in synthetic data generation.
Deployment Knowledge
Deployment knowledge describes what the adversary knows about the oracle itself, including the model architecture, training procedure, and training dataset. The adversary may have access to public artifacts of the oracle—a distilled version of the oracle may be available (such as for OpenAI GPT [40]) or the oracle may be transfer learned from a public pretrained model (such as many image classifiers [43] or language models like BERT [13]).
In addition, the adversary may not even know the features (the exact inputs to the model) or the labels (the classes the model may output). While the latter can generally be inferred by interacting with the model (e.g., making queries and observing the labels predicted by the model), inferring the former is usually more difficult. Our preliminary investigations suggest that these are not limiting assumptions, but we leave proper treatment of these constraints to future work.
Model Access
Model access describes the information the adversary obtains from the oracle, including bounds on how many queries the adversary may make as well as the oracle’s response:

label: only the label of the most-likely class is revealed.

label and score: in addition to the most-likely label, the confidence score of the model in its prediction for this label is revealed.

top-$k$ scores: the labels and confidence scores for the $k$ classes whose confidence is highest are revealed.

scores: confidence scores for all labels are revealed.

logits: raw logit values for all labels are revealed.
In general, the more access an adversary is given, the more effective they should be in accomplishing their goal. We instantiate practical attacks under several of these assumptions. Limiting model access has also been discussed as a defensive measure, as we elaborate in Section 8.
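The access levels listed above can be pictured as successively less restrictive views of the same model output. The following is a hypothetical sketch: the function `respond`, the access-level strings, and the default `k` are illustrative names, not part of any real prediction API.

```python
def respond(probs, access, k=5, logits=None):
    """Restrict an oracle's output to what a given access level reveals."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if access == "label":
        return ranked[0]
    if access == "label_and_score":
        return ranked[0], probs[ranked[0]]
    if access == "top_k_scores":
        return [(i, probs[i]) for i in ranked[:k]]
    if access == "scores":
        return list(probs)
    if access == "logits":
        if logits is None:
            raise ValueError("logit access requires the raw logits")
        return list(logits)
    raise ValueError(f"unknown access level: {access}")

probs = [0.1, 0.6, 0.3]
label_only = respond(probs, "label")
top_two = respond(probs, "top_k_scores", k=2)
```

Each level reveals at least as much as the one before it: the label can be recovered from the label and score, which can be recovered from the top scores, and so on.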
4 Learning-based Model Extraction
We present our first attack strategy where the victim model serves as a labeling oracle for the adversary. While many attack variants exist [51, 39], they generally stage an iterative interaction between the adversary and the oracle, where the adversary collects labels for a set of points from the oracle and uses them as a training set for the extracted model. These algorithms are typically designed for accuracy extraction; in this section, we will demonstrate improved algorithms for accuracy extraction, using task-relevant unlabeled data.
We realistically simulate large-scale model extraction by considering an oracle that was trained on 1 billion Instagram images [30] to obtain (at the time of the experiment) state-of-the-art performance on the standard image classification benchmark, ImageNet [12]. The oracle, with 193 million parameters, obtained 84.2% top-1 accuracy and 97.2% top-5 accuracy on the 1000-class benchmark; we refer to the model as the “WSL model”, abbreviating the paper title. We give the adversary access to the public ImageNet dataset. The adversary’s goal is to use the WSL model as a labeling oracle to train an ImageNet classifier that performs better than if we trained the model directly on ImageNet. The attack is successful if access to the WSL model (trained on 1 billion proprietary images inaccessible to the adversary) enables the adversary to extract a model that outperforms a baseline model trained directly with ImageNet labels. This is accuracy extraction for the ImageNet distribution, given unlabeled ImageNet training data.
We consider two variants of the attack: one where the adversary selects 10% of the training set (i.e., about 130,000 points) and the other where the adversary keeps the entire training set (i.e., about 1.3 million points). To put this number in perspective, recall that each image has a dimension of 224x224 pixels and 3 color channels, giving us $224 \times 224 \times 3 = 150{,}528$ total input features. Each image belongs to one of 1,000 classes. Although ImageNet data is labeled, we always treat it as unlabeled to simulate a realistic adversary.
4.1 Fully-supervised model extraction
The first attack is fully supervised, as proposed by prior work [51]. It serves to compare our subsequent attacks to prior work, and to validate our hypothesis that labels from the oracle are more informative than dataset labels.
The adversary needs to obtain a label for each of the points it intends to train the extracted model with. Then it queries the oracle to label its training points with the oracle’s predictions. The oracle reveals labels and scores (in the threat model from Section 3) when queried.
The adversary then trains its model to match these labels using the cross-entropy loss. We chose the distillation temperature $T$ used in our experiments by random search. Our experiments use two architectures known to perform well on image classification: ResNet_v2_50 and ResNet_v2_200.
Results. We present results in Table 2. For instance, the adversary improves the accuracy of their model for both ResNet_v2_50 and ResNet_v2_200 after querying the oracle for only 10% of the ImageNet data (compare the ImageNet and WSL columns). Recall that the task has 1,000 labels, making these improvements significant. The gains we are able to achieve as an adversary are in line with progress that has been made by the computer vision community on the ImageNet benchmark over recent years, where the research community improved the state-of-the-art top-1 accuracy by about one percentage point per year.
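The fully supervised procedure (query the oracle on each training point, then fit to its outputs with cross-entropy) can be sketched end to end on a toy problem. Everything below is an assumption made for illustration: the victim is a small softmax linear model rather than an image classifier, the adversary uses the same model class, and training uses full-batch gradient descent instead of SGD to keep the sketch short.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def linear_model(W, b):
    """Return a classifier x -> softmax(W x + b)."""
    return lambda x: softmax([sum(w * xj for w, xj in zip(row, x)) + bk
                              for row, bk in zip(W, b)])

def argmax(p):
    return max(range(len(p)), key=lambda i: p[i])

# Hypothetical victim: in a real attack the adversary never sees these weights.
oracle = linear_model([[1.0, -0.5], [-1.0, 0.5]], [0.1, -0.1])

rng = random.Random(0)
unlabeled = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]

# Step 1: query the oracle once per point and keep its scores as training targets.
substitute = [(x, oracle(x)) for x in unlabeled]

# Step 2: fit the extracted model to the oracle's outputs with cross-entropy.
W, b = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
for _ in range(300):
    model = linear_model(W, b)
    gW = [[0.0, 0.0], [0.0, 0.0]]
    gb = [0.0, 0.0]
    for x, target in substitute:
        p = model(x)
        for k in range(2):
            err = (p[k] - target[k]) / len(substitute)  # averaged logit gradient
            gb[k] += err
            for j in range(2):
                gW[k][j] += err * x[j]
    W = [[w - 1.0 * g for w, g in zip(rw, rg)] for rw, rg in zip(W, gW)]
    b = [bk - 1.0 * g for bk, g in zip(b, gb)]

extracted = linear_model(W, b)
test_points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(1000)]
fidelity = sum(argmax(extracted(x)) == argmax(oracle(x))
               for x in test_points) / len(test_points)
```

The number of oracle queries equals the size of the substitute dataset, which is the quantity the semi-supervised techniques in the next subsection try to reduce.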
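Table 2: Extraction of the WSL model on ImageNet. Each cell is (accuracy/fidelity); columns marked "+ Rot" add the rotation loss.

| Architecture | Data Fraction | ImageNet | WSL | WSL-5 | ImageNet + Rot | WSL + Rot | WSL-5 + Rot |
|---|---|---|---|---|---|---|---|
| ResNet_v2_50 | 10% | (81.86/82.95) | (82.71/84.18) | (82.97/84.52) | (82.27/84.14) | (82.76/84.73) | (82.84/84.59) |
| ResNet_v2_200 | 10% | (83.50/84.96) | (84.81/86.36) | (85.00/86.67) | (85.10/86.29) | (86.17/88.16) | (86.11/87.54) |
| ResNet_v2_50 | 100% | (92.45/93.93) | (93.00/94.64) | (93.12/94.87) | N/A | N/A | N/A |
| ResNet_v2_200 | 100% | (93.70/95.11) | (94.26/96.24) | (94.21/95.85) | N/A | N/A | N/A |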
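Table 3: Fully supervised (FS) and MixMatch (MM) extraction on SVHN and CIFAR10. Each cell is (accuracy/fidelity).

| Dataset | Algorithm | 250 Queries | 1000 Queries | 4000 Queries |
|---|---|---|---|---|
| SVHN | FS | (79.25/79.48) | (89.47/89.87) | (94.25/94.71) |
| SVHN | MM | (95.82/96.38) | (96.87/97.45) | (97.07/97.61) |
| CIFAR10 | FS | (53.35/53.61) | (73.47/73.96) | (86.51/87.37) |
| CIFAR10 | MM | (87.98/88.79) | (90.63/91.39) | (93.29/93.99) |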
4.2 Unlabeled data improves query efficiency
For adversaries interested in theft, a learning-based strategy should minimize the number of queries required to achieve a given level of accuracy. A natural approach towards this end is to take advantage of advances in label-efficient ML, including active learning [2] and semi-supervised learning [7].
Active learning allows a learner to query the labels of arbitrary points; the goal is to query the best set of points to learn a model with. Semi-supervised learning considers a learner with some labeled data, but much more unlabeled data; the learner seeks to leverage the unlabeled data (for example, by training on guessed labels) to improve classification performance. Active and semi-supervised learning are complementary techniques [47, 45]; it is possible to pick the best subset of data to train on, while also using the rest of the unlabeled data without labels.
The connection between label-efficient learning and learning-based model extraction attacks is not new [51, 9, 38], but has focused on active learning. We show that, assuming access to unlabeled task-specific data, semi-supervised learning can be used to improve model extraction attacks. This could potentially be improved further by leveraging active learning, as in prior work, but our improvements are overall complementary to approaches considered in prior work. We explore two semi-supervised learning techniques: rotation loss [57] and MixMatch [6].
Rotation loss. We leverage the current state-of-the-art semi-supervised learning approach on ImageNet, which augments the model with a rotation loss [57]. The model contains two linear classifiers from the second-to-last layer of the model: the classifier for the image classification task, and a rotation predictor. The goal of the rotation classifier is to predict the rotation applied to an input: each input is fed in four times per batch, rotated by $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$. The classifier should output one-hot encodings $\{\mathrm{OH}(1; 4), \mathrm{OH}(2; 4), \mathrm{OH}(3; 4), \mathrm{OH}(4; 4)\}$, respectively, for these rotated images. Then, the rotation loss is written:
$$L_R(X; f_\theta) = \frac{1}{4N} \sum_{i=1}^{N} \sum_{j=1}^{4} H(\mathrm{OH}(j; 4), f_\theta(R_j(x_i)))$$
where $N$ is the number of inputs in $X$, $R_j$ is the $j$th rotation, $H$ is cross-entropy loss, and $f_\theta$ is the model’s probability outputs for the rotation task. Inputs need not be labeled, hence we compute this loss on unlabeled data for which the adversary did not query the model. That is, we train the model on both unlabeled data (with rotation loss), and labeled data (with standard classification loss), and both contribute towards learning a good representation for all of the data, including the unlabeled data.
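A minimal sketch of the rotation loss on tiny images represented as nested lists. It assumes the loss is the average cross-entropy between the true rotation index and the rotation head's prediction over all four rotations of every input. The rotation head `uniform_guess` is a placeholder rather than a trained model, and rotation indices are 0-based in the code.

```python
import math

def rotate90(image):
    """Rotate a 2-D list of pixels by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotations(image):
    """The four inputs per image: rotated by 0, 90, 180 and 270 degrees."""
    out = [image]
    for _ in range(3):
        out.append(rotate90(out[-1]))
    return out

def rotation_loss(images, rotation_probs):
    """Average cross-entropy between the true rotation index and the prediction."""
    total = 0.0
    for image in images:
        for j, rotated in enumerate(rotations(image)):
            # With a one-hot target at index j, cross-entropy reduces to -log p_j.
            total += -math.log(rotation_probs(rotated)[j])
    return total / (4 * len(images))

uniform_guess = lambda image: [0.25, 0.25, 0.25, 0.25]  # placeholder rotation head
images = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
loss = rotation_loss(images, uniform_guess)
```

No class labels appear anywhere in the computation, which is why this loss can be evaluated on points the adversary never sent to the oracle.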
We compare the accuracy of models trained with the rotation loss on data labeled by the oracle and data with ImageNet labels (Table 2). Our best-performing extracted model is trained with the rotation loss on oracle labels; it outperforms the baseline trained on ImageNet labels both with and without the rotation loss. This demonstrates the cumulative benefit of adding a rotation loss to the objective and training on oracle labels for a theft-motivated adversary.
We expect that as semi-supervised learning techniques on ImageNet mature, further gains should be reflected in the performance of model extraction attacks.
MixMatch. To validate this hypothesis, we turn to smaller datasets where semi-supervised learning has made significant progress. We investigate a technique called MixMatch [6] on two datasets: SVHN [35] and CIFAR10 [24]. MixMatch uses a combination of techniques, including training on “guessed” labels, regularization, and image augmentations.
For both datasets, inputs are color images of 32x32 pixels belonging to one of 10 classes. The training set of SVHN contains 73257 images and the test set contains 26032 images. The training set of CIFAR10 contains 50000 images and the test set contains 10000 images. We train the oracle with a WideResNet-28-2 architecture on the labeled training set. The oracles achieve 97.36% accuracy on SVHN and 95.75% accuracy on CIFAR10.
The adversary is given access to the same training set but without knowledge of the labels. Our goal is to validate the effectiveness of semi-supervised learning by demonstrating that the adversary only needs to query the oracle on a small subset of these training points to extract a model whose accuracy on the task is comparable to the oracle’s. To this end, we run 5 trials of fully supervised extraction (no use of unlabeled data), and 5 trials of MixMatch, reporting for each trial the median accuracy of the 20 latest checkpoints, as done in [6].
Results. In Table 3, we find that with only 250 queries (293x smaller label set than the SVHN oracle and 200x smaller for CIFAR10), MixMatch reaches 95.82% test accuracy on SVHN and 87.98% accuracy on CIFAR10. This is higher than fully supervised training that uses 4000 queries. With 4000 queries, MixMatch is within 0.29% of the accuracy of the oracle on SVHN, and 2.46% on CIFAR10. The variance of MixMatch is slightly higher than that of fully supervised training, but is much smaller than the performance gap. These gains come from the prior MixMatch is able to build using the unlabeled data, making it effective at exploiting few labels. We observe similar gains in test set fidelity.
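The ratios and gaps quoted above follow directly from the training set sizes and accuracies stated in this section, as this short check shows (variable names are illustrative):

```python
train_sizes = {"SVHN": 73257, "CIFAR10": 50000}          # oracle's labeled training sets
oracle_acc = {"SVHN": 97.36, "CIFAR10": 95.75}           # oracle test accuracy (%)
mixmatch_acc_4000 = {"SVHN": 97.07, "CIFAR10": 93.29}    # MixMatch with 4000 queries (%)
queries = 250

# How many times smaller the adversary's label set is than the oracle's.
ratios = {name: size / queries for name, size in train_sizes.items()}

# Accuracy gap to the oracle at 4000 queries, in percentage points.
gaps = {name: round(oracle_acc[name] - mixmatch_acc_4000[name], 2) for name in oracle_acc}
```

`ratios` comes out at roughly 293 for SVHN and exactly 200 for CIFAR10, and `gaps` at 0.29 and 2.46 percentage points.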
5 Limitations of LearningBased Extraction
Learning-based approaches have several sources of nondeterminism: the random initializations of the model parameters, the order in which data is assembled to form batches for SGD, and even nondeterminism in GPU instructions [42, 25]. Nondeterminism impacts the model parameter values obtained from training. Therefore, even an adversary with full access to the oracle’s training data, hyperparameters, etc., would still need to reproduce all of the learner’s nondeterminism to achieve the functionally equivalent extraction goal described in Section 3. In this section, we attempt to quantify this: for a strong adversary, with access to the exact details of the training setup, we present an experiment to determine the limits of learning-based algorithms in achieving high-fidelity extraction.
We perform the following experiment. We query an oracle to obtain a labeled substitute dataset $D$. We use $D$ for a learning-based extraction attack which produces a model $f_1$. We run the learning-based attack a second time using $D$, but with different sources of nondeterminism, to obtain a new set of parameters and hence a new model $f_2$. If there are points $x$ such that $f_1(x) \neq f_2(x)$, then the prediction on $x$ depends not on the oracle, but on the nondeterminism of the learning-based attack strategy, and we are unable to guarantee fidelity.
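The agreement measurement in this experiment can be sketched as follows. This is a minimal illustration, assuming the two extraction runs have already produced predicted labels on a shared query set; the function names and the label arrays are hypothetical and not taken from our implementation.

```python
import numpy as np

def fidelity(labels_a, labels_b):
    """Fraction of query points on which two models predict the same label."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    if labels_a.shape != labels_b.shape:
        raise ValueError("label arrays must have the same shape")
    return float(np.mean(labels_a == labels_b))

def disagreement_points(labels_a, labels_b):
    """Indices of query points whose label depends on training nondeterminism."""
    return np.flatnonzero(np.asarray(labels_a) != np.asarray(labels_b))

# Hypothetical predicted labels from two extraction runs on the same query set.
run_1 = np.array([0, 3, 3, 7, 1, 9, 4, 4])
run_2 = np.array([0, 3, 5, 7, 1, 9, 4, 2])

print(fidelity(run_1, run_2))             # 0.75
print(disagreement_points(run_1, run_2))  # [2 7]
```

Any index returned by `disagreement_points` is a point on which at least one of the two runs cannot match the oracle, whatever the oracle predicts there.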
We independently control the initialization randomness and batch randomness during training on FashionMNIST [55] with fully supervised SGD (we use FashionMNIST for training speed). We repeat each run 10 times and measure agreement between the ten obtained models on three query sets: the test set, adversarial examples generated by running FGSM with the oracle model and the test set, and uniformly random inputs. The oracle uses initialization seed 0 and SGD seed 0; we also use two different initialization and SGD seeds.
Even when both training and initialization randomness are fixed (so that only GPU nondeterminism remains), fidelity peaks at 93.7% on the test set (see Table 4). With no randomness fixed, extraction achieves 93.4% fidelity on the test set. (Agreement on the test set should be considered in reference to the base test accuracy of 90%.) Hence, even an adversary who has the victim model’s exact training set will be unable to exceed ~93.4% fidelity. Using prototypicality metrics, as investigated in Carlini et al. [8], we notice that the test points where fidelity is easiest to achieve are also the most prototypical (i.e., most representative of the class they are labeled as). This connection is explored further in Appendix B. The experiment of this section is also related to uncertainty estimation using deep ensembles [25]; we believe a deeper connection may exist between the fidelity of learning-based approaches and uncertainty estimation. Also relevant is the work mentioned earlier in Section 3, which shows that random networks are hard for learning-based approaches to extract. Here, we find that learning-based approaches have limits even for trained networks, on some portion of the input space.
Query Set  Init & SGD  Same SGD  Same Init  Different 

Test  93.7%  93.2%  93.1%  93.4% 
Adv Ex  73.6%  65.4%  65.3%  67.1% 
Uniform  65.7%  60.2%  59.0%  60.2% 
It follows from these arguments that nondeterminism of both the victim and extracted model’s learning procedures potentially compound, limiting the effectiveness of using a learningbased approach to reaching high fidelity.
6 Functionally Equivalent Extraction
Having identified fundamental limitations that prevent learningbased approaches from perfectly matching the oracle’s mistakes, we now turn to a different approach where the adversary extracts the oracle’s weights directly, seeking to achieve functionallyequivalent extraction.
This attack can be seen as an extension of two prior works.

Milli et al. [32] introduce an attack to extract neural network weights under the assumption that the adversary is able to make gradient queries. That is, each query the adversary makes reveals not only the prediction of the neural network, but also the gradient of the neural network with respect to the query. To the best of our knowledge this is the only functionallyequivalent extraction attack on neural networks with one hidden layer, although it was not actually implemented in practice.

Batina et al. [5], at USENIX Security 2019, develop a side-channel attack that extracts neural network weights through monitoring the power use of a microprocessor evaluating the neural network. This is a much more powerful threat model than that assumed by any of the other model extraction papers. To the best of our knowledge this is the only practical direct model extraction result: they manage to extract essentially arbitrary depth networks.
In this section we introduce an attack which only requires standard queries (i.e., that return the model’s prediction instead of its gradients) and does not require any sidechannel leakages, yet still manages to achieve higher fidelity extraction than the sidechannel extraction work for twolayer networks, assuming doubleprecision inference.
Attack Algorithm Intuition. As in [32], our attack is tailored to work on neural networks with the ReLU activation function (the ReLU is an effective default choice of activation function [33]). This makes the neural network a piecewise linear function. Two samples are within the same linear region if all ReLU units have the same sign, illustrated in Figure 2.
By finding adjacent linear regions, and computing the difference between them, we force a single ReLU to change signs. Doing this, it is possible to almost completely determine the weight vector going into that ReLU unit. Repeating this attack for all ReLU units lets us recover the first weight matrix completely. (We say almost here, because we must do some work to recover the sign of the weight vector.) Once the first layer of the twolayer neural network has been determined, the second layer can be uniquely solved for algebraically through least squares. This attack is optimal up to a constant factor—the query complexity is discussed in Appendix D.
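To make the notion of a linear region concrete, the following sketch checks whether two inputs share the same ReLU sign pattern. It is only an illustration: the weight names `A0` and `B0` and their toy values are assumptions, and the check uses the true first layer, which the adversary does not have.

```python
import numpy as np

def activation_pattern(x, A0, B0):
    """Boolean vector marking which hidden ReLU units are active at input x."""
    return (A0 @ x + B0) > 0

def same_linear_region(x1, x2, A0, B0):
    """Two inputs lie in the same linear region if every ReLU has the same sign."""
    return bool(np.array_equal(activation_pattern(x1, A0, B0),
                               activation_pattern(x2, A0, B0)))

# Toy first layer (hypothetical values): 2 hidden units, 2 input features.
A0 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
B0 = np.array([0.0, -1.0])

print(same_linear_region(np.array([0.5, 0.5]), np.array([2.0, 0.2]), A0, B0))  # True
print(same_linear_region(np.array([0.5, 0.5]), np.array([0.5, 1.5]), A0, B0))  # False
```

In the second call only the second unit changes sign, so the two points lie in adjacent regions, which is exactly the situation the attack looks for.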
6.1 Notation and Assumptions
As in [32], we only aim to extract neural networks with one hidden layer using the ReLU activation function. We denote the model weights by $A^{(0)}, A^{(1)}$ and biases by $B^{(0)}, B^{(1)}$. Here, $d$, $h$, and $K$ respectively refer to the input dimensionality, the size of the hidden layer, and the number of classes. This notation is summarized in Table 5.
Symbol  Definition 

$d$  Input dimensionality 
$h$  Hidden layer dimensionality ($h < d$) 
$K$  Number of classes 
$A^{(0)}$  Input layer weights 
$B^{(0)}$  Input layer bias 
$A^{(1)}$  Logit layer weights 
$B^{(1)}$  Logit layer bias 
We say that $\mathrm{ReLU}(x)$ is at a critical point if $x = 0$; this is the location at which the unit’s gradient changes from $0$ to $1$. We assume the adversary is able to observe the raw logit outputs as 64-bit floating point values. We will use the notation $\mathcal{O}_L$ to denote the logit oracle. Our attack implicitly assumes that the rows of $A^{(0)}$ are linearly independent. Because the dimension of the input space is larger than the hidden space by at least 100, it is exceedingly unlikely for the rows to be linearly dependent (and we find this holds true in practice).
Note that our attack is not an SQ algorithm, which would only allow us to look at aggregate statistics of our dataset. Instead, our algorithm is very particular in its analysis of the network: computing the differences between linear regions, for example, cannot be done with aggregate statistics. This structure allows us to avoid the pathologies of Section 3.3.
6.2 Attack Overview
The algorithm is broken into four phases:

Critical point search identifies inputs $\{x_i\}$ to the neural network so that exactly one of the ReLU units is at a critical point (i.e., has input identically $0$).

Weight recovery takes an input $x_i$ which causes the $i$th neuron to be at a critical point. We use this point to compute the difference between the two adjacent linear regions induced by the critical point, and thus the weight vector row $A^{(0)}_i$. By repeating this process for each ReLU we obtain the complete matrix $A^{(0)}$. Due to technical reasons discussed below, we can only recover each row vector $A^{(0)}_i$ up to sign.

Sign recovery determines the sign of each row vector $A^{(0)}_j$ for all $j$ using global information about $A^{(0)}$.

Final layer extraction uses algebraic techniques (least squares) to solve for the second layer of the network.
6.3 Critical Point Search
For a two-layer network, observe that the logit function is given by the equation $\mathcal{O}_L(x) = A^{(1)}\,\mathrm{ReLU}(A^{(0)}x + B^{(0)}) + B^{(1)}$. To find a critical point for every ReLU, we sample two random vectors $u, v \in \mathbb{R}^{d}$, and consider the function
$$L(t; u, v, \mathcal{O}_L) = \mathcal{O}_L(u + tv)$$
for $t$ varying between a small and large appropriately selected value (discussed below). This amounts to drawing a line in the inputs of the network; passed through ReLUs, this line becomes the piecewise linear function $L(\cdot)$. The points where $L(\cdot)$ is nondifferentiable are exactly locations where some $\mathrm{ReLU}_i$ is changing signs (i.e., some ReLU is at a critical point). Figure 3 shows an example of what this sweep looks like on a trained MNIST model.
Furthermore, notice that given a pair $u, v$, there is exactly one value of $t$ for which each ReLU is at a critical point, and that if $t$ is allowed to grow arbitrarily large or small, every ReLU unit will switch sign exactly once. Intuitively, the reason this is true is that each ReLU’s input (say $wx + b$ for some $w, b$) is a monotone function of $t$ (namely $wu + t\,wv + b$). Thus, by varying $t$, we can identify an input $x_i$ that sets the $i$th ReLU to 0 for every ReLU in the network. This assumes we are not moving parallel to any of the rows (where $wv = 0$), and that we vary $t$ within a sufficiently large interval (so the $t\,wv$ term may overpower the constant term). The analysis of [32] suggests that these concerns can be resolved with high probability by varying $t$ over a sufficiently wide range.
While in theory it would be possible to sweep all values of $t$ to identify the critical points, this would require a large number of queries. Thus, to efficiently search for the locations of critical points, we introduce a refined search algorithm which improves on the binary search as used in [32]. Standard binary search requires $O(n)$ model queries to obtain $n$ bits of precision. Therefore, we propose a refined technique which does not have this restriction and requires just $O(1)$ queries to obtain high (20+ bits) precision. The key observation we make is that if we are searching between two values $[t_1, t_2]$ and there is exactly one discontinuity in the derivative within this range, we can precisely identify the location of that discontinuity efficiently.
An intuitive diagram for this algorithm can be found in Figure 4 and the algorithm can be found in Algorithm 1. The property this leverages is that the function is piecewise linear: if we know the range is composed of two linear segments, we can identify the linear segments and compute their intersection. In Algorithm 1, lines 1-3 describe computing the two linear regions’ slopes and intercepts. Lines 4 and 5 compute the intersection of the two lines (also shown in the red dotted line of Figure 4). The remainder of the algorithm performs the correctness check, also illustrated in Figure 4; if there are more than 2 linear components, it is unlikely that the true function value will match the function value computed in line 5, and we can detect that the algorithm has failed.
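The intersect-and-check idea can be sketched in a few lines. This is a minimal illustration rather than a transcription of Algorithm 1: the function name, the finite-difference step `eps`, the tolerance `tol`, and the two toy functions are all assumptions made for the example.

```python
def find_critical_point(f, t1, t2, eps=1e-6, tol=1e-6):
    """Locate the single kink of a piecewise linear scalar function f on [t1, t2].

    Returns the kink location, or None when the two-segment assumption
    appears to be violated. Uses five queries to f.
    """
    y1, y2 = f(t1), f(t2)
    # Slopes of the linear pieces at the left and right ends of the range.
    m1 = (f(t1 + eps) - y1) / eps
    m2 = (y2 - f(t2 - eps)) / eps
    if abs(m1 - m2) < tol:
        return None  # same slope at both ends: no single kink to intersect
    # Intersection of the lines y1 + m1 (t - t1) and y2 + m2 (t - t2).
    t_star = t1 + (y2 - y1 - (t2 - t1) * m2) / (m1 - m2)
    if not (t1 < t_star < t2):
        return None
    # Correctness check: the queried value must match the predicted one.
    y_pred = y1 + m1 * (t_star - t1)
    if abs(f(t_star) - y_pred) > tol:
        return None  # more than two linear pieces in the range
    return t_star

def relu(v):
    return max(v, 0.0)

# Toy piecewise linear functions (hypothetical), standing in for one logit of L(t).
one_kink = lambda t: 0.5 * t + 2.0 * relu(t - 0.3)
two_kinks = lambda t: relu(t - 0.3) + relu(t - 0.6)

print(find_critical_point(one_kink, 0.0, 1.0))   # close to 0.3
print(find_critical_point(two_kinks, 0.0, 1.0))  # None
```

When the check fails, one natural fallback (again an assumption of this sketch) is to split the range in half and retry on each half until each sub-range contains a single kink.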
6.4 Weight Recovery
After running critical point search we obtain a set $\{x_i\}_{i=1}^{h}$, where each critical point corresponds to a point where a single ReLU flips sign. In order to use this information to learn the weight matrix $A^{(0)}$ we measure the second derivative of $\mathcal{O}_L$ in each input direction at the points $x_i$. Taking the second derivative here corresponds to measuring the difference between the linear regions on either side of the ReLU. Recall that prior work assumed direct access to gradient queries, and thus did not require any of the analysis in this section.
Absolute Value Recovery
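To formalize the intuition of comparing adjacent hyperplanes, observe that for the oracle $\mathcal{O}_L$ and for a critical point $x_i$ (corresponding to $\mathrm{ReLU}_i$ being zero) and for a random input-space direction $e_j$ we have
$$\begin{aligned}
\left.\frac{\partial^2 \mathcal{O}_L}{\partial e_j^2}\right|_{x_i}
&= \left.\frac{\partial \mathcal{O}_L}{\partial e_j}\right|_{x_i + c\,e_j} - \left.\frac{\partial \mathcal{O}_L}{\partial e_j}\right|_{x_i - c\,e_j} \\
&= \sum_k A^{(1)}_k \mathbb{1}\!\left(A^{(0)}_k (x_i + c\,e_j) + B^{(0)}_k > 0\right) A^{(0)}_{kj} - \sum_k A^{(1)}_k \mathbb{1}\!\left(A^{(0)}_k (x_i - c\,e_j) + B^{(0)}_k > 0\right) A^{(0)}_{kj} \\
&= A^{(1)}_i \left(\mathbb{1}\!\left(A^{(0)}_{ij} > 0\right) - \mathbb{1}\!\left(-A^{(0)}_{ij} > 0\right)\right) A^{(0)}_{ij} \\
&= \pm A^{(0)}_{ij} A^{(1)}_i
\end{aligned}$$
for a $c > 0$ small enough so that $x_i \pm c\,e_j$ does not flip any other ReLU. Here $A^{(0)}_k$ is the $k$th row of $A^{(0)}$ and $A^{(1)}_k$ holds the weights out of hidden unit $k$. Because $x_i$ is a critical point and $c$ is small, the sums in the second line differ only in the contribution of $\mathrm{ReLU}_i$. However at this point we only have a product involving both weight matrices. We now show this information is useful.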
If we compute $|A^{(0)}_{i1} A^{(1)}_i|$ and $|A^{(0)}_{i2} A^{(1)}_i|$ by querying along directions $e_1$ and $e_2$, we can divide these quantities to obtain the value $|A^{(0)}_{i1} / A^{(0)}_{i2}|$, the ratio of the two weights. By repeating the above process for each input direction we can, for all $k$, obtain the pairwise ratios $|A^{(0)}_{i1} / A^{(0)}_{ik}|$.
Recall from Section 3 that obtaining the ratios of weights is the theoretically optimal result we could hope to achieve. It is always possible to multiply all of the weights into a ReLU by a constant $c > 0$ and then multiply all of the weights out of the ReLU by $c^{-1}$. Thus, without loss of generality, we can assign $A^{(0)}_{i1} = 1$ and scale the remaining entries accordingly. Unfortunately, we have lost a small amount of information here. We have only learned the absolute value of the ratio, and not the value itself.
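The absolute-value recovery can be illustrated on a toy network. The sketch below is an assumption-laden example, not our implementation: the weights are hypothetical, the critical point is constructed by hand rather than found by search, and a central second difference of the logit stands in for the difference of one-sided derivatives.

```python
import numpy as np

# Toy oracle with known weights, used only to check the recovery (hypothetical values).
A0 = np.array([[1.0, -2.0, 0.5],
               [0.5, 1.0, -1.0]])   # rows are the weights into each hidden unit
B0 = np.array([0.2, -0.3])
A1 = np.array([[1.5, -0.7]])        # a single logit
B1 = np.array([0.1])

def oracle(x):
    """Logit oracle of the toy two-layer ReLU network."""
    return A1 @ np.maximum(A0 @ x + B0, 0.0) + B1

def second_difference(x, direction, c=1e-3):
    """Central second difference of the oracle at x along `direction`, scaled by 1/c."""
    return (oracle(x + c * direction) - 2.0 * oracle(x) + oracle(x - c * direction)) / c

def abs_weight_ratios(x_crit, c=1e-3):
    """Magnitudes of one row of the first layer, normalized so the first entry is 1."""
    d = x_crit.shape[0]
    basis = np.eye(d)
    mags = np.array([abs(second_difference(x_crit, basis[j], c)[0]) for j in range(d)])
    return mags / mags[0]

# At this input unit 0 has pre-activation exactly 0, while unit 1 sits at -0.4.
x_crit = np.array([-0.2, 0.0, 0.0])
ratios = abs_weight_ratios(x_crit)
print(ratios)  # approximately [1.0, 2.0, 0.5]
```

The output matches the magnitudes of the first row of `A0` up to scale, and carries no sign information, as discussed above. The sketch assumes the first entry of the row is nonzero and that `c` is small enough that no other unit changes sign.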
Weight Sign Recovery
Once we reconstruct the values $|A^{(0)}_{i1} / A^{(0)}_{ik}|$ for all $k$ we need to recover the sign of these values. To do this we consider the following quantity:
$$\left.\frac{\partial^2 \mathcal{O}_L}{\partial (e_j + e_k)^2}\right|_{x_i} = \pm\left(A^{(0)}_{ij} + A^{(0)}_{ik}\right) A^{(1)}_i.$$
That is, we consider what would happen if we take the second partial derivative in the direction $(e_j + e_k)$. Their contributions to the gradient will either cancel out, indicating $A^{(0)}_{ij}$ and $A^{(0)}_{ik}$ are of opposite sign, or they will compound on each other, indicating they have the same sign. Thus, to recover signs, we can perform this comparison along each direction $(e_1 + e_k)$.
Here we encounter one final difficulty. There are a total of $d$ signs we need to recover, but because we compute the signs by comparing ratios along different directions, we can only obtain $d - 1$ relations. That is, we now know the correct signed value of $A^{(0)}_{i1}, \ldots, A^{(0)}_{id}$ up to a single sign for the entire row.
It turns out this is to be expected. What we have computed is the normal direction to the hyperplane, but because any given hyperplane can be described by an infinite number of normal vectors differing by a constant scalar, we cannot hope to use local information to recover this final sign bit.
Put differently, while it is possible to push a constant through from the first layer to the second layer, it is not possible to do this for negative constants, because the ReLU function is not symmetric. Therefore, it is necessary to learn the sign of this row.
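The cancel-or-compound comparison can be sketched on the same kind of toy network. As before, the weights, the hand-built critical point, and the use of a central second difference are assumptions of the example; the function returns each entry's sign relative to the first entry of the row, which is all that local information can provide.

```python
import numpy as np

# Toy oracle with known weights (hypothetical values), as a stand-in for the victim.
A0 = np.array([[1.0, -2.0, 0.5],
               [0.5, 1.0, -1.0]])
B0 = np.array([0.2, -0.3])
A1 = np.array([[1.5, -0.7]])
B1 = np.array([0.1])

def oracle(x):
    return A1 @ np.maximum(A0 @ x + B0, 0.0) + B1

def kink_magnitude(x, direction, c=1e-3):
    """Size of the slope change of the first logit at x along `direction`."""
    diff = oracle(x + c * direction) - 2.0 * oracle(x) + oracle(x - c * direction)
    return abs(diff[0]) / c

def relative_signs(x_crit, c=1e-3):
    """Sign of each weight in the row, relative to the sign of its first entry."""
    d = x_crit.shape[0]
    basis = np.eye(d)
    s1 = kink_magnitude(x_crit, basis[0], c)
    signs = [1]
    for k in range(1, d):
        sk = kink_magnitude(x_crit, basis[k], c)
        s1k = kink_magnitude(x_crit, basis[0] + basis[k], c)
        compound = abs(s1k - (s1 + sk))     # small if the two weights share a sign
        cancel = abs(s1k - abs(s1 - sk))    # small if the two weights differ in sign
        signs.append(1 if compound < cancel else -1)
    return signs

x_crit = np.array([-0.2, 0.0, 0.0])  # unit 0 is at its critical point
print(relative_signs(x_crit))  # [1, -1, 1]
```

Multiplying these relative signs into the absolute ratios gives the row up to one overall sign, which is resolved in the next subsection.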
6.5 Global Sign Recovery
Once we have recovered the input vector’s weights, we still don’t know the sign for the given inputs: we only measure the difference between linear functions at each critical point, but do not know which side is the positive side of the ReLU [32]. Now, we need to leverage global information in order to reconcile all of the inputs’ signs.
Notice that recovering the extracted row $\hat{A}^{(0)}_i$ allows us to obtain $\hat{B}^{(0)}_i$ by using the fact that $\hat{A}^{(0)}_i \cdot x_i + \hat{B}^{(0)}_i = 0$. Then we can compute $\hat{B}^{(0)}_i$ up to the same global sign as is applied to $\hat{A}^{(0)}_i$.
Now, to begin recovering sign, we search for a vector $z$ that is in the null space of $\hat{A}^{(0)}$, that is, $\hat{A}^{(0)} z = \vec{0}$. Because the neural network has $h < d$, the null space is nonzero, and we can find many such vectors using least squares. Then, for each $\mathrm{ReLU}_i$, we search for a vector $v_i$ such that $\hat{A}^{(0)} v_i = e_i$, where here $e_i$ is the $i$th basis vector in the hidden space. That is, moving along the direction $v_i$ only changes $\mathrm{ReLU}_i$’s input value. Again we can search for this through least squares.
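Given $z$ and these $v_i$ we query the neural network for the values of $\mathcal{O}_L(z)$, $\mathcal{O}_L(z + v_i)$, and $\mathcal{O}_L(z - v_i)$. On each of these three queries, all hidden units are $0$ except for $\mathrm{ReLU}_i$, which receives as input either $0$, $1$, or $-1$ by the construction of $v_i$. However, notice that the output of $\mathrm{ReLU}_i$ can only be either $0$ or $1$, and the two cases $\{-1, 0\}$ collapse to just output $0$. Therefore, if $\mathcal{O}_L(z + v_i) = \mathcal{O}_L(z)$, we know that $A^{(0)}_i \cdot v_i < 0$. Otherwise, we will find $\mathcal{O}_L(z - v_i) = \mathcal{O}_L(z)$ and $A^{(0)}_i \cdot v_i > 0$. This allows us to recover the sign bit for $\mathrm{ReLU}_i$.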
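The three-query test can be sketched as follows. This is an illustration under stated assumptions: the toy weights are hypothetical, and the base point `z` is chosen so that every extracted pre-activation (weights times input plus bias) is zero, which is the condition the three-query argument relies on. Both `z` and each `v_i` are obtained with minimum-norm least squares via `np.linalg.lstsq`.

```python
import numpy as np

# True (victim) weights, hypothetical values used only to simulate the oracle.
A0 = np.array([[1.0, -2.0, 0.5],
               [0.5, 1.0, -1.0]])
B0 = np.array([0.2, -0.3])
A1 = np.array([[1.5, -0.7]])
B1 = np.array([0.1])

def oracle(x):
    return A1 @ np.maximum(A0 @ x + B0, 0.0) + B1

# Extracted first layer: row 0 (and its bias) carries the wrong sign.
A0_hat = np.array([[-1.0, 2.0, -0.5],
                   [0.5, 1.0, -1.0]])
B0_hat = np.array([-0.2, -0.3])

def recover_row_signs(oracle, A0_hat, B0_hat):
    """Return +1 or -1 per hidden unit: the factor that corrects each extracted row."""
    h = A0_hat.shape[0]
    # Base point where every extracted pre-activation is zero.
    z = np.linalg.lstsq(A0_hat, -B0_hat, rcond=None)[0]
    base = oracle(z)
    signs = np.ones(h)
    for i in range(h):
        e_i = np.zeros(h)
        e_i[i] = 1.0
        # Direction that changes only the i-th extracted pre-activation, by +1.
        v_i = np.linalg.lstsq(A0_hat, e_i, rcond=None)[0]
        moved_plus = np.linalg.norm(oracle(z + v_i) - base)
        moved_minus = np.linalg.norm(oracle(z - v_i) - base)
        # The output moves only on the side where the true ReLU becomes active.
        signs[i] = 1.0 if moved_plus > moved_minus else -1.0
    return signs

signs = recover_row_signs(oracle, A0_hat, B0_hat)
print(signs)                      # [-1.  1.]
print(signs[:, None] * A0_hat)    # matches A0
```

The test cannot decide the sign of a unit whose outgoing weights are all zero, since neither query then moves the output; the sketch ignores that degenerate case.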
6.6 Last Layer Extraction
Given the completely extracted first layer, the logit function of the network is just a linear transformation which we can recover with least squares, through making queries where each ReLU is active at least once. In practice, we use the critical points discovered in the previous section so that we do not need to make additional neural network queries.
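A minimal sketch of this final step is given below. It assumes the first layer has been recovered exactly and, for simplicity, uses random query points rather than the critical points found earlier; the toy weights and function names are hypothetical.

```python
import numpy as np

# True (victim) weights, hypothetical values used only to simulate the oracle.
A0 = np.array([[1.0, -2.0, 0.5],
               [0.5, 1.0, -1.0]])
B0 = np.array([0.2, -0.3])
A1 = np.array([[1.5, -0.7],
               [0.3, 2.0]])   # two logits
B1 = np.array([0.1, -0.4])

def oracle_batch(X):
    """Logits for a batch of inputs, one input per row of X."""
    return np.maximum(X @ A0.T + B0, 0.0) @ A1.T + B1

def solve_last_layer(X, Y, A0_hat, B0_hat):
    """Least-squares fit of the logit layer given the extracted first layer."""
    H = np.maximum(X @ A0_hat.T + B0_hat, 0.0)        # hidden activations, shape (n, h)
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])     # extra column for the bias
    W, *_ = np.linalg.lstsq(H1, Y, rcond=None)        # shape (h + 1, K)
    return W[:-1].T, W[-1]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))       # query points; each ReLU must be active at least once
Y = oracle_batch(X)
A1_hat, B1_hat = solve_last_layer(X, Y, A0, B0)
print(np.allclose(A1_hat, A1), np.allclose(B1_hat, B1))
```

Because the hidden activations are fixed once the first layer is known, the logits are linear in the unknowns and a single least-squares solve suffices.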
6.7 Results
Setup. We train several one-layer fully-connected neural networks with between 16 and 512 hidden units (for roughly 12,500 and 400,000 trainable parameters on MNIST, respectively) on the MNIST [26] and CIFAR10 datasets [24]. We train the models with the Adam [22] optimizer for 20 epochs at batch size 128 until they converge. We train five networks of each size to obtain higher statistical significance. Accuracies of these networks can be found in the supplement in Appendix C. In Section 4, we used 140,000 queries for ImageNet model extraction. This is comparable to the number of queries used to extract the smallest MNIST model in this section, highlighting the advantages of both approaches.
MNIST Extraction. We implement the functionallyequivalent extraction attack in JAX [15] and run it on each trained oracle. We measure the fidelity of the extracted model, comparing predicted labels, on the MNIST test set.
Results are summarized in Table 6. For smaller networks, we achieve 100% fidelity on the test set: every single one of the test examples is predicted the same. As the network size increases, low-probability errors we encounter become more common, but the extracted neural network still disagrees with the oracle on only a small fraction of the examples (fidelity is 99.98% for the 100,000-parameter model).
Inspecting the weight matrix that we extract and comparing it to the weight matrix of the oracle classifier, we find that we manage to reconstruct the first weight matrix to an average precision of 23 bits—we provide more results in Appendix C.
CIFAR10 Extraction. Because this attack is data-independent, the underlying task is unimportant for how well the attack works; only the number of parameters matters. The results for CIFAR10 are thus identical to MNIST when controlling for model size: we achieve 100% test set agreement on the smaller models and greater than 99% test set agreement on larger models.
Comparison to Prior Work. To the best of our knowledge, this is by orders of magnitude the highest fidelity extraction of neural network weights.
The only fully-implemented neural network extraction attack we are aware of is the work of Batina et al. [5], who use an electromagnetic side channel and differential power analysis to recover the weights of an MNIST neural network with an average error of 0.0025. In comparison, we are able to achieve an average error in the first weight matrix for a similarly sized neural network of just 0.0000009, over two thousand times more precise. To the best of our knowledge no functionally-equivalent CIFAR10 models have been extracted in the past.
We are unable to make a comparison between the fidelity of our extraction attack and the fidelity of the attack presented in Batina et al. because they do not report on this number: they only report the accuracy of the extracted model and show it is similar to the original model. We believe this strengthens our observation that comparing across accuracy and fidelity is not currently widely accepted as best practice.
Investigating Errors. We observe that as the number of parameters that must be extracted increases, the fidelity of the model decreases. We investigated why this happens and discovered that a small fraction of the time (roughly 1 in 10,000) the gradient estimation procedure obtains an incorrect estimate of the gradient, and therefore one of the extracted weights is incorrect by a significant margin.
Introducing an error into just one of the weights of the first matrix should not induce significant further errors. However, because of this error, when we solve for the bias vector, the extracted bias will have error proportional to the error of that weight. And when the bias is wrong, it impacts every calculation, even those where this edge is not in use.
Resolving this issue completely requires either reducing the failure rate of gradient estimation from 1 in 10,000 to practically 0, or a complex error-recovery procedure. Instead, we will introduce in the following section an improvement which almost completely solves this issue.
Difficulties Extending the Attack. The attack is specific to two layer neural networks; deeper networks pose multiple difficulties. In deep networks, the critical point search step of Section 6.3 will result in critical points from many different layers, and determining which layer a critical point is on is nontrivial. Without knowing which layer a critical point is on, we cannot control inputs to the neuron, which we need to do to recover the weights in Section 6.4. Even given knowledge of what layer a critical point is on, the inputs of any neuron past layer 1 are the outputs of other neurons, so we only have indirect control over their inputs. Finally, even with the ability to recover these weights, small numerical errors occur in the first layer extraction. These cause errors in every finite differences computation in further layers, causing the second layer to have even larger numerical errors than the first (and so on). Therefore, extending the attack to deeper networks will require at least solving each of the following: producing critical points belonging to a specific layer, recovering weights for those neurons without direct control of their inputs, and significantly reducing numerical errors in these algorithms.
# of Parameters  12,500  25,000  50,000  100,000 

Fidelity  100%  100%  100%  99.98% 
Queries 
7 Hybrid Strategies
Until now the strategies we have developed for extraction have been pure and focused entirely on learning or entirely on direct extraction. We now show that there is a continuous spectrum from which we can draw attack strategies, and these hybrid strategies can leverage both the query efficiency of learning extraction, and the fidelity of direct extraction.
7.1 LearningBased Extraction with Gradient Matching
Milli et al. demonstrate that gradient matching helps extraction by optimizing the objective function
$$\sum_{i=1}^{n} \ell\big(f(x_i), \mathcal{O}(x_i)\big) + \alpha \left\lVert \nabla_x f(x_i) - \nabla_x \mathcal{O}(x_i) \right\rVert_2^2,$$
assuming the adversary can query the model for $\nabla_x \mathcal{O}(x)$. This is more model access than we permit our adversary, but is an example of using intuition from direct recovery to improve extraction. We found in preliminary experiments that this technique can improve fidelity on small datasets (increasing fidelity from 95% to 96.5% on FashionMNIST), but we leave scaling and removing the model access assumption of this technique to future work. Next, we will show another combination of learning and direct recovery, using learning to alleviate some of the limitations of the previous functionally-equivalent extraction attack.
7.2 Error Recovery through Learning
Recall from earlier that the functionallyequivalent extraction attack fidelity degrades as the model size increases. This is a result of lowprobability errors in the first weight matrix inducing incorrect biases on the first layer, which in turn propagates and causes worse errors in the second layer.
We now introduce a method for performing a learningbased error recovery routine. While performing a fullylearningbased attack leaves too many free variables so that functionallyequivalent extraction is not possible, if we fix many of the variables to the values extracted through the direct recovery attack, we now show it is possible to learn the remainder of the variables.
Formally, let $\hat{A}^{(0)}$ be the extracted weight matrix for the first layer and $\hat{B}^{(0)}$ be the extracted bias vector for the first layer. Previously, we used least squares to directly solve for $\hat{A}^{(1)}$ and $\hat{B}^{(1)}$ assuming we had extracted the first layer perfectly. Here, we relax this assumption. Instead, we perform gradient descent optimizing for parameters $W_0, W_1, W_2$ that minimize
$$\mathbb{E}_{x \in \mathcal{D}} \left\lVert \mathcal{O}_L(x) - \left(W_1\,\mathrm{ReLU}(\hat{A}^{(0)} x + \hat{B}^{(0)} + W_0) + W_2\right) \right\rVert.$$
That is, we use a single trainable parameter $W_0$ to adjust the bias term of the first layer, and then solve (via gradient descent with training data) for the remaining weights $W_1, W_2$ accordingly.
This hybrid strategy increases the fidelity of the extracted model substantially, detailed in Table 8. In the worstperforming example from earlier (with only direct extraction) the extracted 128neuron network had fidelity agreement with the victim model. When performing learningbased recovery, the fidelity agreement jumps all the way to .
# of Parameters  50,000  100,000  200,000  400,000 

Fidelity  100%  100%  99.95%  99.31% 
Queries 
Transferability
Adversarial examples transfer: an adversarial example [50] generated on one model often fools different models, too. Transferability is higher when the models are more similar [39].
We should therefore expect that we can generate adversarial examples on our extracted model, and that these will fool the remote oracle nearly always. In order to measure transferability, we run 20 iterations of PGD [29] with distortion set to the value most often used in the literature: for MNIST: , and for CIFAR10: .
The attack achieves functionally equivalent extraction (modulo floating point precision errors in the extracted weights), so we expect it to have high adversarial example transferability. Indeed, we find we achieve a 100% transferability success rate for all extracted models.
# of Parameters  50,000  100,000  200,000  400,000 

Transferability  100%  100%  100%  100% 
8 Related Work
Defenses for model extraction have fallen into two camps: limiting the information gained per query, and differentiating extraction adversaries from benign users. Approaches to limiting information include perturbing the probabilities returned by the model [51, 9, 27], removing the probabilities for some of the model’s classes [51], or returning only the class output [51, 9]. Another proposal has considered sampling from a distribution over model parameters [1, 9]. The other camp, differentiating benign from malicious users, has focused on analyzing query patterns [19, 21]. Nonadaptive attacks (such as supervised or MixMatch extraction) bypass query patternbased detection, and are weakened by information limiting. We demonstrate the impact of removing complete access to probability values by considering only access to top 5 probabilities from WSL in Table 2. Our functionallyequivalent attack is broken by all of these measures. We leave consideration of defenseaware attacks to future work.
Queries to a model can also reveal hyperparameters [54] or architectural information [36]. Adversaries can use side channel attacks to do the same [5, 18]. These are orthogonal to, but compatible with, our work—information about a model, such as assumptions made in Section 6, empowers extraction.
Watermarking neural networks has been proposed [58, 52] to identify extracted models. Model extraction calls into question the utility of cryptographic protocols used to protect model weights. One unrealized approach is obfuscation [3], where an equivalent program could be released and queried as many times as desired. A practical approach is secure multiparty computation, where each query is computed by running a protocol between the model owner and querier [4].
9 Conclusion
This paper characterizes and explores the space of model extraction attacks on neural networks. We focus this paper specifically around the objectives of accuracy, to measure the success of a theftmotivated adversary, and fidelity, an oftenoverlooked measure which compares the agreement between models to reflect the success of a reconmotivated adversary.
Our learning-based methods can effectively attack a model with several millions of parameters trained on a billion images, and allow the attacker to reduce the error rate of their model by 10%. This attack does not match perfect fidelity with the victim model due to what we show are inherent limitations of learning-based approaches: nondeterminism (including only the nondeterminism on the GPU) prohibits training identical models. In contrast, our direct functionally-equivalent extraction returns a neural network agreeing with the victim model on nearly all of the test samples and having 100% fidelity on transferred adversarial examples.
We then propose a hybrid method which unifies these two attacks, using learningbased approaches to recover from numerical instability errors when performing the functionallyequivalent extraction attack.
Our work highlights many remaining open problems in model extraction, such as reducing the capabilities required by our attacks and scaling functionallyequivalent extraction.
Acknowledgements
We would like to thank Ilya Mironov for lengthy and fruitful discussions regarding the functionally equivalent extraction attack. We also thank Úlfar Erlingsson for helpful discussions on positioning the work, and Florian Tramèr for his comments on an early draft of this paper.
Appendix A Formal Statements for Section 3.3
Here, we give the formal arguments for the difficulty of model extraction to support informal statements from Section 3.3.
Theorem 1.
There exists a class of width $3k$ and depth 2 neural networks on domain $[0,1]^d$ (with precision $p$ numbers) with $d \geq k$ that require, given logit access to the networks, $\Theta(p^k)$ queries to extract.
In order to prove Theorem 1, we introduce a family of functions we call rectangle bounded functions, which we will show satisfies this property.
Definition A.1.
A function $f$ on domain $[0,1]^d$ with range $\mathbb{R}$ is a rectangle bounded function if there exist two vectors $a, b$ such that $f(x) \neq 0 \implies a \preceq x \preceq b$, where $\preceq$ denotes element-wise comparison. The function $f$ is a $k$-rectangle bounded function if there are $k$ indices $i$ such that $a_i \neq 0$ or $b_i \neq 1$.
Intuitively, a $k$-rectangle function only outputs a nonzero value on a multidimensional rectangle that is constrained in only $k$ coordinates. We begin by showing that we can implement $k$-rectangle functions for any $a, b$ using a ReLU network of width $3k$ and depth 2.
Lemma 1.
For any $a, b$ with $k$ indices $i$ such that $a_i \neq 0$ or $b_i \neq 1$, we can construct a $k$-rectangle bounded function for $a, b$ with a ReLU network of width $3k$ and depth 2.
Proof.
We will start by constructing a 3ReLU gadget with output only when . We will then show how to compose of these gadgets, one for each index of the rectangle, to construct the rectangle bounded function.
The 3ReLU gadget only depends on , so weights for all other ReLUs will be set to 0. Observe that the function is nonzero only on the interval . This is easier to see when it is written as
The function with looks like a sigmoid, and has the following form:
Now, has range for any value of . Then the function
is rectangle bounded for vectors . To see why, we need that no input not satisfying has . This is simply because each term , so unless all such terms are , the inequality cannot hold. ∎
Now that we know how to construct a rectangle bounded function, we will introduce a set of disjoint rectangle bounded functions, and then show that any one requires queries to extract when the others are also possible functions.
Lemma 2.
There exists a family of $k$-rectangle bounded functions $\mathcal{F}$ such that extracting an element of $\mathcal{F}$ requires $\Theta(p^k)$ queries in the worst case.
Here, $p$ is the feature precision; images with 8-bit pixels have $p = 256$.
Proof.
We begin by constructing . The following ranges are clearly pairwise disjoint: . Then pick any indices, and we can construct distinct rectangle bounded functions  one for each element in the Cartesian product of each index’s set of ranges. Call this set .
The set of inputs with nonzero output is distinct for each function, because their rectangles are distinct. Now consider the information gained from any query. If the query returns a nonzero value, the function is learned. If not, at most one function from is ruled out  the function whose rectangle was queried. Then any sequence of queries to an oracle can rule out at most of the functions of , so that at least queries are required in the worst case. ∎
Putting Lemma 1 and 2 together gives us Theorem 1.
Theorem 2.
Checking whether two networks with domains are functionally equivalent is NPhard.
Proof.
We prove this by reduction from subset sum. A similar reduction (reducing from 3SAT instead of subset sum) for a different statement appears in [20].
Suppose we receive a subset sum instance  the set is , the target sum is , and the problem’s precision is . We will construct networks and such that checking if and are functionally equivalent is equivalent to solving the subset sum instance. We start by setting  it never returns a nonzero value. We now construct a network that has nonzero output only if the subset sum instance can be solved (and finding an input with nonzero output reveals the satisfying subset).
The network has three hidden units in the first layer with incoming weight for the th feature equal to . This means the dot product of the input with weights will be the sum of the subset . We want to force this to accept iff there is an input where this sum is . To do so, we use the same 3ReLU gadget as in the proof of Theorem 1:
As before, this will only be nonzero in the range , and we are done.
∎
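The reduction can be made concrete on a toy instance. The following is a minimal sketch under several assumptions that are not taken from the proof: inputs are restricted to binary vectors, the set elements are integers, and `bump` is a generic three-ReLU triangular bump with a hypothetical half-width `delta`, which may differ from the exact gadget used in the proof of Theorem 1. The names `values`, `target`, and `find_witness` are likewise illustrative.

```python
from itertools import product


def relu(z):
    return max(z, 0.0)


def bump(s, target, delta=0.5):
    """Three-ReLU bump: nonzero only for target - delta < s < target + delta."""
    return (relu(s - (target - delta))
            - 2.0 * relu(s - target)
            + relu(s - (target + delta)))


def network(x, values, target):
    """All three hidden units share the weights `values` and differ only in bias."""
    s = sum(v * xi for v, xi in zip(values, x))
    return bump(s, target)


def find_witness(values, target):
    """Brute-force search for an input with nonzero output.

    Returns None exactly when the network agrees with the all-zero network on
    every binary input. The search is exponential in len(values).
    """
    for x in product((0, 1), repeat=len(values)):
        if network(x, values, target) != 0.0:
            return x
    return None


values = [3, 34, 4, 12, 5, 2]
print(find_witness(values, 9))   # a selection of elements summing to 9
print(find_witness(values, 30))  # None: no subset sums to 30
```

With integer elements and `delta = 0.5`, the output is nonzero only when the selected elements sum exactly to the target, so deciding equivalence with the zero network decides the subset sum instance.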
Appendix B Prototypicality and Fidelity
We know from Section 5 that learning strategies struggle to achieve perfect fidelity due to nondeterminism inherent in learning. What remains to be understood is whether fidelity is harder to achieve on some samples than on others. We investigate using recent work on identifying prototypical data points. Using each metric developed in Carlini et al. [8], we rank the Fashion-MNIST test set in order of increasing prototypicality. Binning the prototypicality ranking into percentiles, we measure how many of the 90 models we trained for Section 5 agree with the oracle’s prediction. The intuition here is that more prototypical examples should be more consistently learnable, whereas more outlying points may be harder to consistently classify. Indeed, we find that this is the case: all metrics find a correlation between prototypicality and model agreement (fidelity), as seen in Figure 5. Interestingly, the metrics which do not use ensembles of models (adversarial distance and holdout-retraining) have the best correlation with the model agreement metric: roughly the top 50% of prototypical examples by these metrics are classified the same by nearly all 90 models.
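The binning procedure described above can be sketched as follows. This is an illustrative sketch rather than the evaluation code used for Figure 5: the function name, the argument layout, and the use of equal-size rank bins are assumptions, and the number of samples is assumed to be at least `num_bins`.

```python
def agreement_by_percentile(scores, model_preds, oracle_preds, num_bins=10):
    """Mean fraction of models agreeing with the oracle, per prototypicality bin.

    scores:       one prototypicality score per sample (higher = more prototypical)
    model_preds:  one list of predicted labels per extracted model
    oracle_preds: the oracle's predicted label for each sample
    """
    n = len(scores)
    # Rank samples in order of increasing prototypicality.
    order = sorted(range(n), key=lambda i: scores[i])
    # Per-sample fraction of models whose prediction matches the oracle.
    agree = [
        sum(preds[i] == oracle_preds[i] for preds in model_preds) / len(model_preds)
        for i in range(n)
    ]
    ranked = [agree[i] for i in order]
    # Split the ranking into equal-size bins and average within each bin.
    result = []
    for b in range(num_bins):
        lo, hi = b * n // num_bins, (b + 1) * n // num_bins
        chunk = ranked[lo:hi]
        result.append(sum(chunk) / len(chunk))
    return result


scores = [0.1, 0.9, 0.5, 0.3]
oracle = [0, 1, 1, 0]
models = [[1, 1, 1, 0], [1, 1, 0, 0]]
print(agreement_by_percentile(scores, models, oracle, num_bins=2))  # [0.5, 0.75]
```

In the toy call, the less prototypical half of the samples has lower agreement than the more prototypical half, which is the kind of trend the appendix reports.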
Appendix C Supplement for Section 6
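| MNIST Parameters | MNIST Accuracy | CIFAR-10 Parameters | CIFAR-10 Accuracy |
| --- | --- | --- | --- |
| 12,500 | 94.3% | 49,000 | 29.2% |
| 25,000 | 95.6% | 98,000 | 34.2% |
| 50,000 | 97.2% | 196,000 | 40.3% |
| 100,000 | 97.7% | 393,000 | 42.6% |
| 200,000 | 98.0% | 786,000 | 43.1% |
| 400,000 | 98.3% | 1,572,000 | 45.9% |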
Figure 6 shows a distribution over the bits of precision in the difference between the logits (i.e., pre-softmax prediction) of the 16-neuron oracle neural network and the extracted network. Formally, we measure the magnitude of the gap . Notice that this is a different (and typically stronger) measure of fidelity than used elsewhere in the paper.
Appendix D Query Complexity of Functionally Equivalent Extraction
In this section, we briefly analyze the query complexity of the attack from Section 6. We assume that a simulated partial derivative requires queries using finite differences.

Critical Point Search. This step is the hardest to analyze, but fortunately it was addressed in [32]. They found this step requires gradient queries, which we simulate with model queries.

Weight Recovery. This piece is significantly complicated by not having access to gradient queries. For each ReLU, absolute value recovery requires queries and weight sign recovery requires an additional , making this step take queries total.

Global Sign Recovery. For each ReLU, we require only three queries. Then this step is .

Last Layer Extraction. This step requires queries to make the system of linear equations full rank (although in practice we reuse previous queries here, making this step require 0 queries).
Overall, the algorithm requires queries. Extraction requires queries without auxiliary information, as there are parameters in the model. Then the algorithm is query-optimal up to a constant factor, removing logarithmic factors from Milli et al. [32].
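The assumption that gradient queries can be simulated with model queries can be illustrated with a forward finite-difference sketch. The code below is an assumption-laden illustration, not the attack implementation: `CountingOracle`, `partial_derivative`, `gradient`, the step size `eps`, and the toy model are all hypothetical, and the estimate is only accurate when the input and its shifted copy lie in the same linear region of the ReLU network.

```python
class CountingOracle:
    """Wraps a model so that every prediction query is counted."""

    def __init__(self, f):
        self.f = f
        self.queries = 0

    def __call__(self, x):
        self.queries += 1
        return self.f(x)


def partial_derivative(oracle, x, i, eps=1e-4):
    """Forward-difference estimate of the i-th partial derivative (two queries)."""
    shifted = list(x)
    shifted[i] += eps
    return (oracle(shifted) - oracle(x)) / eps


def gradient(oracle, x, eps=1e-4):
    """Estimate the full gradient with len(x) + 1 queries by reusing oracle(x)."""
    base = oracle(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += eps
        grad.append((oracle(shifted) - base) / eps)
    return grad


def toy_model(x):
    """Toy network with two ReLU units; only the first is active near (1, 1)."""
    relu = lambda z: max(z, 0.0)
    return relu(2.0 * x[0] - x[1] + 0.5) + 3.0 * relu(x[0] + x[1] - 10.0)


oracle = CountingOracle(toy_model)
print(gradient(oracle, [1.0, 1.0]), oracle.queries)
```

On the toy model the estimated gradient is approximately (2, -1) and costs three queries, so each simulated partial derivative adds only a constant number of model queries in this sketch.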
Footnotes
 https://paperswithcode.com/sota/image-classification-on-imagenet
References
[1] (2014) Adding robustness to support vector machines against adversarial reverse engineering. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 231–240. Cited by: §8.
[2] (1988) Queries and concept learning. Machine Learning 2 (4), pp. 319–342. Cited by: §4.2.
[3] (2001) On the (im)possibility of obfuscating programs. In Annual International Cryptology Conference, pp. 1–18. Cited by: §8.
[4] (2006) A privacy-preserving protocol for neural-network-based computation. In Proceedings of the 8th Workshop on Multimedia and Security, pp. 146–151. Cited by: §8.
[5] (2018) CSI neural network: using side-channels to recover your artificial neural network information. arXiv preprint arXiv:1810.09076. Cited by: §3.2, §3.3, Table 1, 2nd item, §6.7, §8.
[6] (2019) MixMatch: a holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249. Cited by: §4.2, §4.2, §4.2.
[7] (1998) Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100. Cited by: §4.2.
[8] (2019) Prototypical examples in deep learning: metrics, characteristics, and utility. Cited by: Appendix B, §5.
[9] (2018) Model extraction and active learning. CoRR abs/1811.02054. Cited by: §1, §1, §3.2, Table 1, §4.2, §8.
[10] (2018) Copycat CNN: stealing knowledge by persuading confession with random non-labeled data. In 2018 International Joint Conference on Neural Networks (IJCNN). Cited by: §1, §3.2, Table 1.
[11] (2019) On the learnability of deep random networks. CoRR abs/1904.03866. Cited by: §3.3, Theorem 3.
[12] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §1, §3.4.1, §4.
[13] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.4.2.
[14] (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §2.
[15] (2019) JAX. GitHub. https://github.com/google/jax Cited by: §6.7.
[16] (2009) The unreasonable effectiveness of data. Cited by: §1.
[17] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.
[18] (2018) Security analysis of deep neural networks operating in the presence of cache side-channel attacks. arXiv preprint arXiv:1810.03487. Cited by: §1, §8.
[19] (2018) PRADA: protecting against DNN model stealing attacks. arXiv preprint arXiv:1805.02628. Cited by: §8.
[20] (2017) Reluplex: an efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97–117. Cited by: Appendix A.
[21] (2018) Model extraction warning in MLaaS paradigm. In Proceedings of the 34th Annual Computer Security Applications Conference, pp. 371–380. Cited by: §8.
[22] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2, §6.7.
[23] (1999) Differential power analysis. In Annual International Cryptology Conference, pp. 388–397. Cited by: §3.3.
[24] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer. Cited by: §4.2, §6.7.
[25] (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413. Cited by: §5, §5.
[26] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §6.7.
[27] (2018) Defending against model stealing attacks using deceptive perturbations. arXiv preprint arXiv:1806.00054. Cited by: §8.
[28] (2005) Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 641–647. Cited by: §1, §1, §3.2, Table 1.
[29] (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §7.2.1.
[30] (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196. Cited by: §3.4.1, Table 2, §4.
[31] (2019) Zero-shot knowledge transfer via adversarial belief matching. arXiv preprint arXiv:1905.09768. Cited by: §3.4.1.
[32] (2018) Model reconstruction from model explanations. arXiv preprint arXiv:1807.05185. Cited by: item 1, Appendix D, §1, §3.2, Table 1, 1st item, §6.1, §6, §6.3, §6.3, §6.5.
[33] (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814. Cited by: §2, §6.
[34] (1983) A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, Vol. 269, pp. 543–547. Cited by: §2.
[35] (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §4.2.
[36] (2017) Towards reverse-engineering black-box neural networks. arXiv preprint arXiv:1711.01768. Cited by: §1, §8.
[37] (2019) Knockoff nets: stealing functionality of black-box models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4954–4963. Cited by: §1, §3.2, Table 1.
[38] (2019) A framework for the extraction of deep neural networks by leveraging public data. CoRR abs/1905.09165. Cited by: §1, §3.2, Table 1, §4.2.
[39] (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. Cited by: §1, §1, §3.1, §3.2, Table 1, §4, §7.2.1.
[40] (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §3.4.2.
[41] (2018) ML-Leaks: model and data independent membership inference attacks and defenses on machine learning models. arXiv preprint arXiv:1806.01246. Cited by: §1, §1, §3.1.
[42] (2015) Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems, pp. 2503–2511. Cited by: §5.
[43] (2014) CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813. Cited by: §3.4.2.
[44] (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. Cited by: §1, §1, §3.1.
[45] (2020) Rethinking deep active learning: using unlabeled data at model training. Cited by: §4.2.
[46] (2019) Overlearning reveals sensitive attributes. arXiv preprint arXiv:1905.11742. Cited by: §1, §3.2.
[47] (2020) Combining MixMatch and active learning for better accuracy with fewer labels. Cited by: §4.2.
[48] (2019) Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243. Cited by: §1.
[49] (2014) Sequence to sequence learning with neural networks. In Neural Information Processing Systems, pp. 3104–3112. Cited by: §1.
[50] (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §7.2.1.
[51] (2016) Stealing machine learning models via prediction APIs. In 25th USENIX Security Symposium (USENIX Security 16), pp. 601–618. Cited by: §1, §1, §3.1, §3.2, Table 1, §4.1, §4.2, §4, §8.
[52] (2017) Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 269–277. Cited by: §8.
[53] (2016) WaveNet: a generative model for raw audio. SSW 125. Cited by: §1.
[54] (2018) Stealing hyperparameters in machine learning. In 2018 IEEE Symposium on Security and Privacy (SP), pp. 36–52. Cited by: §8.
[55] (2017-08-28) (Website) External Links: cs.LG/1708.07747 Cited by: §5.
[56] (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pp. 5754–5764. Cited by: §1.
[57] (2019) S4L: self-supervised semi-supervised learning. arXiv preprint arXiv:1905.03670. Cited by: §4.2, §4.2.
[58] (2018) Protecting intellectual property of deep neural networks with watermarking. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 159–172. Cited by: §8.