DeepSafe: A Data-Driven Approach for Checking Adversarial Robustness in Neural Networks
Abstract
Deep neural networks have become widely used, obtaining remarkable results in domains such as computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, and bioinformatics, where they have produced results comparable to human experts. However, these networks can be easily “fooled” by adversarial perturbations: minimal changes to correctly-classified inputs that cause the network to misclassify them. This phenomenon represents a concern for both safety and security, but it is currently unclear how to measure a network’s robustness against such perturbations. Existing techniques are limited to checking robustness around a few individual input points, providing only very limited guarantees. We propose a novel approach for automatically identifying safe regions of the input space, within which the network is robust against adversarial perturbations. The approach is data-guided, relying on clustering to identify well-defined geometric regions as candidate safe regions. We then utilize verification techniques to confirm that these regions are safe or to provide counterexamples showing that they are not safe. We also introduce the notion of targeted robustness which, for a given target label and region, ensures that a NN does not map any input in the region to the target label. We evaluated our technique on the MNIST dataset and on a neural network implementation of a controller for the next-generation Airborne Collision Avoidance System for unmanned aircraft (ACAS Xu). For these networks, our approach identified multiple regions which were completely safe as well as some which were only safe for specific labels. It also discovered several adversarial perturbations of interest.
1 Introduction
In recent years, advances in deep neural networks (NN) have enabled the representation and modeling of complex nonlinear relationships. In this paper, we study a common use of NN as classifiers that take in complex, high-dimensional input, pass it through multiple layers of transformations, and finally assign to it a specific output label or class. Such classifiers have been used in a variety of applications, including pattern analysis, image classification, speech and audio recognition, and self-driving cars; it is expected that this trend will continue and intensify, with neural networks also being integrated into safety-critical systems which require high assurance guarantees.
While the usefulness of neural networks is evident, it has been observed that state-of-the-art networks are highly vulnerable to adversarial perturbations: given a correctly-classified input $x$, it is possible to find a new input $x'$ that is very similar to $x$ but is assigned a different label [20]. For instance, in image-recognition networks it is possible to add a small amount of noise (undetectable by the human eye) to an image and change how it is classified by the network.
Worse still, adversarial examples have also been found to transfer across networks, making it possible to attack networks in a black-box fashion, without access to their weights. Recent work has demonstrated that such attacks can be carried out in practice [13]. Vulnerability of neural networks to adversarial perturbations is thus a safety and security concern, and it is essential to explore systematic methods for evaluating and improving the robustness of neural networks against such attacks.
To date, researchers have mostly focused on efficiently finding adversarial perturbations around select individual input points. The problem is typically cast as an optimization problem: for a given network $F$ and an input $x$, find an $x'$ for which $F(x') \neq F(x)$ while minimizing $\|x - x'\|$. In other words, the goal is to find an input $x'$ as close as possible to $x$ such that $x$ and $x'$ are labeled differently. Finding the optimal solution for this optimization problem is computationally difficult, and so various approximation approaches have been proposed. Some approaches are gradient based [20, 7, 6], whereas others use optimization techniques [3]. There are also techniques that focus on generating targeted attacks: adversarial perturbations that result in the network classifying the perturbed input with a specific target label [20, 7, 6].
These various approaches for finding adversarial perturbations have successfully demonstrated the weakness of many state-of-the-art networks; however, because these approaches operate on individual input points, it is unclear how to apply them to large input domains, unless one does a brute-force enumeration of all input values, which is infeasible for most input domains. Furthermore, because they are inherently incomplete, these techniques do not provide any robustness guarantees when they fail to find an adversarial input. Orthogonal approaches have also been proposed for training networks that are robust against adversarial perturbations, but these, too, provide no formal assurances [16].
Formal methods offer a promising way of providing such guarantees. Recent approaches tackle neural network verification [8, 11] by casting it as an SMT solving problem. Although typically slower than the aforementioned techniques, verification can provide sound assurances that no adversarial examples exist within a given input domain. Still, these techniques do not provide any guidance on how to select meaningful regions within which the network is expected to behave consistently. And although it is possible to formulate a naive notion of global robustness, $\forall x, x'.\ \|x - x'\| \leq \epsilon \Rightarrow F(x) = F(x')$, which checks that any two points that are similar (within a small acceptable $\epsilon$) have the same label, this is not only inefficient to check but also fails to hold for points on legitimate boundaries between regions.
Recent work uses Reluplex to check a more refined version of local and global robustness, where the confidence score of the labels of close inputs is checked to be within an acceptable parameter [12]. While this can potentially handle situations where the inputs are on the boundaries, it still requires manually finding the regions in which this check is likely to hold.
Our approach.
We propose a novel technique, DeepSafe, for formally evaluating the robustness of deep neural networks. The key notion underlying our approach is the use of a data-guided methodology to determine regions that are likely to be safe (instead of focusing on individual points). This enables characterizing the behavior of the network over partitions of the input space, which in turn makes the network’s behavior amenable to analysis and verification.
Our technique can automatically perform the following steps: (i) we propose a novel clustering algorithm to automatically partition the input domain into regions in which points are likely to have the same true label; (ii) these regions are then checked for robustness using an existing verification tool; (iii) the verification checks targeted robustness which, given a specific incorrect label, guarantees that no input in the region is mapped by the NN to that label; (iv) for each region, the result of the targeted verification is either that the NN is safe (i.e. no points within the region are mapped to the target label), or, if it is not, an adversarial example demonstrating that it is unsafe; (v) robustness against all target labels (other than the correct label) indicates that the region is completely safe and all points within the region are mapped to the same correct label.
Thus, we decompose the robustness requirement for a NN into a number of local proof obligations, one set for each region. Our approach provides several benefits: we discover input regions that are likely to be robust (this is akin to finding likely invariants in program analysis), which are then candidates for the safety checks; if a region is found to be safe, we provide guarantees w.r.t. all points within that region, not just for individual points as in previous techniques; and the discovered regions can improve the scalability of formal NN verification by focusing the search for adversarial examples only on the input space defined by a region. Note also that the regions can be used to improve the scalability of the above-mentioned approximate techniques (e.g., gradient descent methods) by similarly restricting the search to the input points confined by the region.
As the usual notion of safety might be too strong for many NNs, we introduce the concept of targeted robustness, analogous to targeted adversarial perturbations [20, 7, 6]. A region of the input space that is safe w.r.t. a specific target label indicates that within that region, the network is guaranteed to be robust against misclassification to the specific target label. Therefore, even if in that region the network is not completely robust against adversarial perturbations, we give guarantees that it is safe against specific targeted attacks. As a simple example consider a NN used for perception in an autonomous car that classifies the images of a semaphore as red, green or yellow. We may want to guarantee that the NN will never classify the image of a green light as a red light and vice versa but it may be tolerable to misclassify a green light as yellow, while still avoiding traffic violations.
The contributions of our approach are as follows:

Label-guided clustering: We present a clustering technique which is guided by the labels of the training data. This technique is an extension of the standard, unsupervised clustering algorithm k-Means [10], which we modified for our purposes. The output of the technique is a set of dense clusters within the input space, each of which contains training inputs that are close to each other (with respect to a given distance metric) and are known to share the same output label.

Well-defined safe regions: Each cluster defines a subset of the inputs, a finite number of which belong to the training data. The key point is that the network is expected to display consistent behavior (i.e., assign the same label) over the entire cluster. Hence, the clusters can be considered as safe regions in which adversarial perturbations should not exist, backed by a data-guided rationale. Therefore, searching for adversarial perturbations within the cluster-based safe regions has a higher chance of producing valid examples, i.e. inputs whose misclassification is considered erroneous network behavior. The size, location and boundaries of the clusters, which depend on the distribution of the training data, can help guide the search: for instance, it often does not make sense to search for adversarial perturbations of inputs that lie on the boundaries between clusters belonging to different labels, as adversarial perturbations in those areas are likely to constitute acceptable network behavior.

Scalable verification: Within each cluster, we use formal verification to prove that the network is robust or find adversarial perturbations. The verification of even simple neural networks is an NP-complete problem [11], and is very difficult in practice. Focusing on clusters means that verification is applied to small input domains, making it more feasible and rendering the approach as a whole more scalable. Further, the verification of separate clusters can be done in parallel, increasing scalability even further.

Targeted robustness: Our approach focuses on determining targeted safe regions, analogous to targeted adversarial perturbations. A region of the input space that is safe w.r.t. a specific target label indicates that within that region, the network is guaranteed to not map any input to the specific target label. Therefore, even if the network is not completely robust against adversarial perturbations, we are able to give guarantees that it is safe against specific targeted attacks.
Our proposed basic approach can have additional, interesting applications. For example, clusters can be used in additional forms of analysis, e.g. in determining whether the network is particularly susceptible to specific kinds of perturbations in certain regions of the input space. Also, the clustering approach is black-box in the sense that it relies solely on the data distribution to determine the safe regions (and not on the network parameters). Consequently, it is applicable to a wide range of networks with various topologies and architectures. Finally, as we later demonstrate, it is straightforward to incorporate into the approach user-specific information (domain-specific constraints) regarding which adversarial perturbations can be encountered in practice. This helps guide the search towards finding valid perturbations.
The remainder of this paper is organized as follows. In Section 2, we provide the needed background on clustering, neural networks, and neural network verification. In Section 3, we describe in detail the steps of our clustering-based approach, followed by an evaluation in Section 4. The possible limitations of our approach are discussed in Section 5. Related work is then discussed in Section 6, and we conclude in Section 7.
2 Background
2.1 Clustering
Clustering is an approach used to divide a population of datapoints into groups called clusters, such that the datapoints in each cluster are more similar (with respect to some metric) to other points in the same cluster than to the rest of the datapoints.
Here we focus on a particularly popular clustering algorithm called k-Means [10] (although our approach could be implemented using different clustering algorithms as well). Given a set of $n$ datapoints and $k$ as the desired number of clusters, the algorithm partitions the points into $k$ clusters, such that the variance (also referred to as “within cluster sum of squares”) within each cluster is minimal. The metric used to calculate the distance between points is customizable, and is typically the Euclidean distance ($L_2$ norm) or the Manhattan distance ($L_1$ norm). For points $x = (x_1, \ldots, x_d)$ and $x' = (x'_1, \ldots, x'_d)$ these are defined as:
$$\|x - x'\|_{L_1} = \sum_{i=1}^{d} |x_i - x'_i|, \qquad \|x - x'\|_{L_2} = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2} \tag{1}$$
The k-Means clustering is an iterative refinement algorithm which starts with $k$ random points considered as the means (the centroids) of $k$ clusters. Each iteration then consists mainly of two steps: (i) assign each datapoint to the cluster whose centroid is closest to it with respect to the chosen distance metric; and (ii) recalculate the new means of the clusters, which will serve as the new centroids. The iterations continue until the assignment of datapoints to clusters does not change. This indicates that the clusters satisfy the constraint that the variance within the cluster is minimal, and that the datapoints within each cluster are closer to each other than to points outside the cluster.
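The two-step iteration described above can be sketched in a few lines. The following is a minimal illustration in plain Python (not the MATLAB `kmeans` routine used later in the paper); the function names `l1`, `l2`, `mean` and `kmeans`, the fixed random seed and the iteration cap are assumptions of this sketch.

```python
import random

def l2(p, q):
    """Euclidean (L2) distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def l1(p, q):
    """Manhattan (L1) distance between two equal-length points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, k, dist=l2, seed=0, max_iter=100):
    """Iterative refinement; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # k datapoints as initial means
    assignment = None
    for _ in range(max_iter):
        # Step (i): assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:   # assignment unchanged: stop
            break
        assignment = new_assignment
        # Step (ii): recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep the old centroid if empty
                centroids[c] = mean(members)
    return centroids, assignment
```

For two well-separated groups such as `[(0, 0), (0, 1), (1, 0)]` and `[(10, 10), (10, 11), (11, 10)]`, calling `kmeans(points, 2)` places each group in its own cluster.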
2.2 Neural Networks
Neural networks and deep belief networks have been used in pattern analysis, image classification, speech/audio recognition, perception modules in self-driving cars, and so on. Typically, the objects in such domains are high dimensional and the number of classes that the objects need to be classified into is also high, and so the classification functions tend to be highly nonlinear over the input space. Deep learning operates with the underlying rationale that groups of input parameters could be merged to derive higher level abstract features, which enable the discovery of a more linear and continuous classification function. Neural networks are often used as classifiers, meaning they assign to each input an output label/class. A neural network can thus be regarded as a function $F$ that assigns to input $x$ an output label $y$, denoted as $F(x) = y$.
Internally, a neural network is composed of multiple layers of nodes called neurons, where each node refines and extracts information from values computed by nodes in the previous layer. A typical 3-layer neural network would consist of the following. The first layer is the input layer, which takes in the input variables (also called features) $x_1, \ldots, x_d$. The second layer is a hidden layer: each of its neurons computes a weighted sum of the input variables using a unique weight vector and a bias value, and then applies a nonlinear activation function to the result. The last layer is the output layer, which uses the softmax function to make a decision on the class for the input, based on the values computed by the previous layer.
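As an illustration of the three-layer structure just described, the sketch below evaluates such a network on one input. It assumes ReLU as the nonlinear activation and uses hypothetical parameter names (`W1`, `b1`, `W2`, `b2`); it is not tied to any network evaluated in the paper.

```python
import math

def relu(v):
    """Nonlinear activation applied element-wise."""
    return [max(0.0, a) for a in v]

def affine(W, b, v):
    """Weighted sums: one row of W (plus one bias) per neuron."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def softmax(z):
    """Turn output-layer values into a probability distribution."""
    m = max(z)                        # subtract the max for numerical stability
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]

def classify(x, W1, b1, W2, b2):
    """Input layer -> hidden layer (ReLU) -> output layer (softmax)."""
    hidden = relu(affine(W1, b1, x))
    probs = softmax(affine(W2, b2, hidden))
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs
```

Because softmax is monotonic, the assigned label is the same whether it is read from the softmax output or from the values fed into it.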
2.3 Neural Network Verification
Neural networks are trained and tested on finite sets of inputs and outputs and are then expected to generalize well to previously-unseen inputs. While this seems to work in many cases, when the network in question is designed to be part of a safety-critical system, we may wish to verify formally that certain properties hold for any possible input. Traditional verification techniques often cannot directly be applied to neural networks, and this has sparked a line of work focused on transforming the problem into a format more amenable to existing tools such as LP and SMT solvers [4, 8, 18, 19].
While our approach is general in the sense that it could be coupled with any verification technique, for evaluation purposes we used the recently-proposed Reluplex approach [11]. Reluplex is a sound and complete simplex-based verification procedure, specifically tailored to achieve scalability on deep neural networks. Intuitively, the algorithm operates by eagerly solving the linear constraints posed by the neural network’s weighted sums, while attempting to satisfy the nonlinear constraints posed by its activation functions in a lazy manner. This often allows Reluplex to safely disregard many of these nonlinear constraints, which is where the bulk of the problem’s complexity stems from. Reluplex has been used in evaluating techniques for finding and defending against adversarial perturbations [2], and it has also been successfully applied to a real-world family of deep neural networks, designed to operate as controllers in the next-generation Airborne Collision Avoidance System for unmanned aircraft (ACAS Xu) [11].
3 The DeepSafe Approach
In this section we describe in greater detail the steps of our proposed approach: (i) clustering of training inputs; (ii) cluster analysis; (iii) cluster verification; and (iv) processing of possible adversarial examples.
3.1 Clustering of Training Inputs
function rep = next_fun_acas(p)
    % One refinement pass over all clusters that still contain mixed labels.
    % rep is set to 1 if another pass is needed.
    movefile('count.csv', 'countin.csv');
    n1in = importdata("countin.csv");   % rows: [cluster index, #unique labels]
    m = size(n1in);
    rep = 0;

    % Rename the pending cluster files so new ones can reuse the names.
    for sn = 1 : m(1)
        num = n1in(sn,1);
        loc = char("cluster" + num2str(num) + ".csv");
        lnc = char("clusterin" + num2str(num) + ".csv");
        movefile(loc, lnc);
    end

    cnt = 0;
    for yn = 1 : m(1)
        X = importdata("clusterin" + num2str(n1in(yn,1)) + ".csv");
        % Split with k set to the number of unique labels in this cluster.
        [idx, C] = kmeans(X, n1in(yn,2), 'Distance', 'sqeuclidean');
        Xn = [X, idx];                  % column 6: label, column 7: cluster id
        Yn = sortrows(Xn, 7);
        Un = unique(Yn(:,7));
        nn = size(Un);

        for xn = 1 : nn(1)
            Zn = Yn(Yn(:,7) == Un(xn,1), 1:6);
            Bn = unique(Zn(:,6));       % labels present in this sub-cluster
            n0n = size(Bn);
            wn = xn + cnt;
            if (n0n(1,1) > 1)
                % Mixed labels: queue the sub-cluster for another pass.
                rep = 1;
                csvwrite("cluster" + num2str(wn) + ".csv", Zn);
                n1n(1,1) = wn;
                n1n(1,2) = n0n(1,1);
                dlmwrite("count.csv", n1n, '-append', 'delimiter', ',');
            else
                % Single label: write the centroid, then the member points.
                sn = Un(xn,1);
                fname = "clusterFinal" + num2str(wn) + "_" + num2str(p) + ".csv";
                csvwrite(fname, C(sn,:));
                dlmwrite(fname, Zn, '-append', 'delimiter', ',');
            end
        end
        cnt = wn;
    end
end
In our approach, we use the k-Means clustering algorithm (see Section 2.1) to perform clustering over the training inputs. By training inputs, we mean all inputs whose correct labels or output classes are known: this includes the training, validation, and test sets. We note that our technique works even in the absence of the training data, e.g. by applying the clustering to a set of randomly generated inputs that are labeled according to a given trained network. The user will then need to check that the labels are valid.
The k-Means approach is typically an unsupervised technique, meaning that clustering is based purely on the similarity of the datapoints themselves, and does not depend on their labels. Here, however, we use the labels to guide the clustering algorithm into generating clusters that have consistent labeling (in addition to containing points that are similar to each other). The modified clustering algorithm starts by setting the number of clusters $k$, which is an input to the k-Means algorithm, to be equal to the number of unique labels. Once the clusters are obtained, we check whether each cluster contains only inputs with the same label. k-Means is then applied again on each cluster that is found to contain multiple labels, with $k$ set to the number of unique labels within that cluster. This effectively breaks the “problematic” cluster into multiple subclusters. The process is repeated until all clusters contain inputs which share a single label. The number of clusters, which is an input parameter of the k-Means algorithm, is often chosen arbitrarily. In our approach, we take guidance from the training data to customize the number of clusters to the domain under consideration. The MATLAB implementation of the algorithm appears in Fig. 1. In every iteration, count.csv maps the index of each input dataset, clusterINDEX.csv, to the number of unique labels its instances correspond to. Clusters whose instances correspond to the same label are named clusterFinalINDEX.csv.
Let us consider a toy example with training data labeled as either stars or circles. Each training data point is characterised by two dimensions/attributes (x,y). The original k-Means algorithm, with $k = 2$, will partition the training inputs into 2 groups, purely based on proximity w.r.t. the 2 attributes (Fig. 2a). However, this groups stars and circles together. Our modified algorithm creates the same partitions in its first iteration. However, since each cluster does not satisfy the invariant that it only contains training inputs with the same label, the algorithm proceeds to iteratively divide each cluster into two subclusters until this invariant is satisfied. This creates 5 clusters as shown in Fig. 2b: 3 with label star and 2 with label circle. This example is typical for domains such as image classification, where even a small change in some attribute values for certain inputs could change their label.
Our modified clustering algorithm typically produces small, dense clusters of consistently-labeled inputs. The underlying assumption of our approach is that each such cluster therefore constitutes a safe region in which all inputs (and not just the training points) should be labeled consistently. For instance, in our toy example, the algorithm creates relatively small clusters of stars and circles, separating them from each other. Searching within these clusters, which represent small neighborhoods of inputs belonging to the same label, may yield more meaningful results than searching regions within an arbitrary distance of each input.
Each of the clusters generated by k-Means is characterized by a centroid cen and a radius $R$, indicating the maximum distance of any instance in the cluster from its centroid. While inputs deep within each cluster are expected to be labeled consistently, the boundaries of the clusters may lie in low-density regions, and could have different labels. In order to improve the accuracy of our approach, we shrink the clusters by replacing $R$ with the average distance of any instance from the centroid, denoted $r$. This increases the likelihood that all points within the cluster should be consistently labeled, and that any deviation from this would constitute a valid adversarial perturbation. The training inputs within a cluster have the same label, which we refer to as the label of the cluster, $l$. To summarize, the main hypothesis behind our approach is:
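The label-guided refinement loop can be sketched as follows. This is a plain-Python illustration rather than a port of the paper's MATLAB code: the helper `_kmeans`, the function `label_guided_clusters`, the fixed seed, and the fallback of splitting by label when k-Means fails to separate a cluster (for example on duplicate points) are all assumptions of this sketch.

```python
import random

def _dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def _kmeans(points, k, rng, max_iter=100):
    """Plain k-Means; returns one cluster index per point."""
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: _dist(p, centroids[c]))
               for p in points]
        if new == assignment:
            break
        assignment = new
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return assignment

def label_guided_clusters(points, labels, seed=0):
    """Split clusters until each holds one label; returns (points, label) pairs."""
    rng = random.Random(seed)
    pending = [list(zip(points, labels))]
    final = []
    while pending:
        cluster = pending.pop()
        distinct = {lab for _, lab in cluster}
        if len(distinct) == 1:            # invariant holds: single label
            final.append(([p for p, _ in cluster], distinct.pop()))
            continue
        # k is the number of unique labels inside this cluster.
        assignment = _kmeans([p for p, _ in cluster], len(distinct), rng)
        parts = {}
        for item, a in zip(cluster, assignment):
            parts.setdefault(a, []).append(item)
        if len(parts) == 1:
            # Sketch-only fallback: k-Means made no progress, split by label.
            parts = {}
            for item in cluster:
                parts.setdefault(item[1], []).append(item)
        pending.extend(parts.values())
    return final
```

Each split produces strictly smaller parts, so the loop terminates, and every returned cluster contains inputs with a single label, which is the invariant the toy example relies on.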
Hypothesis 1
For a given cluster $C$, with centroid cen and radius $r$, any input $x$ within distance $r$ from cen has the same true label $l$ as that of the cluster: $\|x - \mathit{cen}\| \leq r \Rightarrow \mathit{label}(x) = l$.
It immediately follows from this hypothesis that any point in the cluster which is assigned a different label by the network constitutes an adversarial perturbation. An adversarial input for our toy example could arise as follows: a hypothetical NN, in its process of input transformation, may incorrectly bring some of the inputs within a cluster of stars close to circles, thereby classifying them as circles.
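To make the region definition concrete, the sketch below computes a cluster's centroid, its maximum-distance radius and its shrunk average-distance radius, and then filters a set of sample inputs for ones that contradict the hypothesis. The names `cluster_region` and `violations` are hypothetical, the L1 metric is just a default, and testing samples is only an illustration: it gives no guarantee, unlike the verification step of Section 3.3.

```python
def l1(p, q):
    """Manhattan (L1) distance."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cluster_region(points, dist=l1):
    """Return (cen, R, r): centroid, maximum and average distance to it."""
    n = len(points)
    cen = tuple(sum(c) / n for c in zip(*points))
    distances = [dist(p, cen) for p in points]
    return cen, max(distances), sum(distances) / n

def violations(classify, samples, cen, r, label, dist=l1):
    """Samples inside the shrunk region whose predicted label differs."""
    return [x for x in samples
            if dist(x, cen) <= r and classify(x) != label]
```

For the points `(0, 0)`, `(4, 0)` and `(2, 0)` the centroid is `(2, 0)`, the maximum distance is 2 and the average distance is 4/3, so the shrunk region is strictly smaller than the original cluster.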
Distance metric.
The similarity of the inputs within a cluster is determined by the distance metric used for calculating the proximity of the inputs. Therefore, it is important to choose a distance metric that generates acceptable levels of similarity for the domain under consideration. Our approach assumes that every input, characterized by $d$ attributes, can be considered as a point in Euclidean space. The Euclidean distance (Eq. 1) is a commonly used metric for measuring the proximity of points in Euclidean space. However, recent studies indicate that the usefulness of Euclidean distance in determining the proximity between points diminishes as the dimensionality increases [1]. The Manhattan distance (Eq. 1) has been found to capture proximity more accurately at high dimensions. Therefore, in our experiments, we set the distance metric depending on the dimensionality of the input space (Section 4). Other distance metrics can be easily accommodated by our approach.
3.2 Cluster Analysis
The clusters obtained as a result of the previous step characterize the behavior of the network over large chunks of the input space. Analysis of these clusters can provide useful insights regarding the behavior and accuracy of the network. Listed below are cluster properties that we use in our approach.

Density: We define the density of a cluster to be the number of instances per unit distance within the cluster. For a cluster with $n$ points and average distance $r$ from the points to the centroid, the cluster’s density is defined to be $n/r$. A cluster with high density contains a large number of instances at close proximity to each other (and because of our modified clustering algorithm, these instances are also labeled consistently). We assume that in such clusters Hypothesis 1 holds, since it seems undesirable that the network should assign a different label to inputs that lie within the average distance from the centroid. However, this cannot be said regarding a cluster with low density, which encompasses fewer inputs belonging to the same label and which are spread out. Therefore, for the purpose of the robustness check, we disregard clusters with low density, in order to increase the chances of examining only valid safe regions, and hence of detecting only valid adversarial perturbations.

Centroid behavior: The centroid of a cluster can be considered as a representative for the behavior of the network over that cluster — especially when the cluster’s density is high. A classifier neural network assigns to each input a score for each possible label, representing the network’s level of confidence that the input should have that label (the label with the highest confidence is the one then assigned to the input). Intuitively, targeted adversarial perturbations are more likely to exist, e.g., for the label whose level of confidence at the centroid is the second highest, than for the least likely label. In our approach, we use the label scores to look for targeted safe regions: regions of the input space within which the network is guaranteed to be robust against misclassification to a specific target label.
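The density-based selection can be sketched as below, taking density to be the number of points divided by the average distance to the centroid. The cut-off `min_density` is a hypothetical parameter (the text does not fix a threshold), and treating a zero average distance as infinite density is an assumption of this sketch.

```python
def density(n_points, avg_radius):
    """Instances per unit distance: n / r."""
    if avg_radius == 0:
        return float("inf")   # sketch-only convention for degenerate clusters
    return n_points / avg_radius

def dense_clusters(clusters, min_density):
    """clusters: iterable of (cluster_id, n_points, avg_radius) triples."""
    return [cid for cid, n, r in clusters if density(n, r) >= min_density]
```

With a cut-off of 10, a cluster of 50 points at average distance 0.5 (density 100) would be kept, while a cluster of 3 points at average distance 2.0 (density 1.5) would be disregarded.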
3.3 Cluster Verification
Having identified and analyzed the clusters, we next use the Reluplex tool [11] to verify a formula representing the negation of Hypothesis 1. This is done in a targeted manner or on a per label basis. If the negated hypothesis is shown not to hold, the region is indeed safe with respect to that label (targeted safe region); and otherwise, Reluplex provides a satisfying assignment, which constitutes a valid adversarial perturbation. The encoding is shown in Eq. 2:
$$\exists x.\ \|x - \mathit{cen}\| \leq r \ \wedge\ \mathit{score}(x, l') \geq \mathit{score}(x, l) \tag{2}$$
Here, $x$ represents an input point, and cen, $r$ and $l$ represent the centroid, radius and label of the cluster, respectively. $l'$ represents a label other than $l$. Reluplex models the network without the final softmax layer, and so the network’s outputs correspond to the levels of confidence that the network assigns to each of the possible labels; we use $\mathit{score}(x, y)$ to denote the level of confidence assigned to label $y$ at point $x$. Intuitively, the formula holds for a given $l'$ if and only if there exists a point $x$ within distance at most $r$ from cen, for which $l'$ is assigned a confidence at least as high as that of $l$. Consequently, if the property does not hold, then for every $x$ within the cluster $l$ has its score higher than $l'$. This ensures targeted robustness of the network for label $l'$: the network is guaranteed not to misclassify any input within the region to the target label $l'$. Note that for overlapping clusters with different labels, there is uncertainty regarding the desired labels for the clusters’ intersection. Our method of reducing the clusters’ radius can serve to exclude such regions.
The property in Eq. 2 is checked sequentially for every possible $l' \in L \setminus \{l\}$, where $L$ denotes the set of all possible labels. If the property is unsatisfiable for all $l'$, it ensures complete robustness of the network for inputs within the cluster; i.e., the network is guaranteed not to misclassify any input within the region to any other label. This can be expressed more formally as shown below:
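$$\forall x.\ \|x - \mathit{cen}\| \leq r \ \Rightarrow\ \forall l' \in L \setminus \{l\}.\ \mathit{score}(x, l) > \mathit{score}(x, l') \tag{3}$$
from which it follows that Hypothesis 1 holds, i.e. that:
$$\forall x.\ \|x - \mathit{cen}\| \leq r \ \Rightarrow\ \mathit{label}(x) = l \tag{4}$$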
As is the case with many SMT-based solvers, Reluplex typically solves satisfiable queries more quickly than unsatisfiable ones. Therefore, in order to optimize performance we test the possible target labels in descending order of the scores that they are assigned at the centroid, $\mathit{score}(\mathit{cen}, l')$. Intuitively, this is because the label with the second-highest score is more likely to yield a satisfiable query, etc.
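The ordering heuristic can be sketched as follows. The function `query` stands in for a call to a verifier such as Reluplex and is purely hypothetical: it is assumed to return a counterexample when the targeted query is satisfiable and `None` when it is unsatisfiable. The sketch also assumes that a higher score means higher confidence.

```python
def target_order(centroid_scores, cluster_label):
    """Target labels sorted by descending score at the centroid."""
    targets = [l for l in range(len(centroid_scores)) if l != cluster_label]
    return sorted(targets, key=lambda l: centroid_scores[l], reverse=True)

def check_cluster(centroid_scores, cluster_label, query):
    """Run one targeted query per label, most likely SAT first.

    Returns {target: counterexample or None}; None marks a targeted
    safe region for that label.
    """
    results = {}
    for target in target_order(centroid_scores, cluster_label):
        results[target] = query(target)
    return results
```

A cluster is completely safe in this sketch exactly when every value in the returned dictionary is `None`.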
Distances in Reluplex.
Reluplex takes as input a conjunction of linear equations and certain piecewise-linear constraints. Consequently, it is straightforward to model the neural network itself and the query in Eq. 2. Our ability to encode the distance constraint from that equation, $\|x - \mathit{cen}\| \leq r$, depends on the distance metric being used. While $L_1$ is piecewise linear and can be encoded, unfortunately $L_2$ cannot.
When dealing with domains where $L_2$ distance is a better measure of proximity, we thus use the following approximation. We perform the clustering phase using the $L_2$ distance metric as described before, and for each cluster obtain the radius $r$. When verifying the property in Eq. 2, however, we use the $L_1$ norm. Because $\|x - \mathit{cen}\|_{L_1} \geq \|x - \mathit{cen}\|_{L_2}$, it is guaranteed that the verification is conducted within the cluster under consideration, and any adversarial perturbation discovered will thus be valid. If the verification shows that the network is robust, however, then this holds only for the portion of the cluster that was checked. This limitation can be lifted by using a verification technique that directly supports $L_2$, or by enhancing Reluplex to support it.
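The containment argument used here rests on a standard inequality between the two norms, which can be spelled out briefly. Writing $v = x - \mathit{cen}$ with components $v_1, \ldots, v_d$ (the symbol $v$ is introduced only for this derivation):

```latex
\[
\|v\|_{L_1}^2
  = \Big(\sum_{i=1}^{d} |v_i|\Big)^2
  = \sum_{i=1}^{d} v_i^2 + 2\sum_{i<j} |v_i|\,|v_j|
  \;\geq\; \sum_{i=1}^{d} v_i^2
  = \|v\|_{L_2}^2 ,
\]
\[
\text{hence}\qquad
\{\, x : \|x - \mathit{cen}\|_{L_1} \leq r \,\}
  \;\subseteq\;
\{\, x : \|x - \mathit{cen}\|_{L_2} \leq r \,\}.
\]
```

Any point found inside the $L_1$ ball of radius $r$ therefore also lies inside the $L_2$ cluster of the same radius, while points of the $L_2$ cluster outside the $L_1$ ball remain unchecked.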
Clusters and scalability.
The main source of computational complexity in neural network verification is the presence of nonlinear, nonconvex activation functions. However, when restricted to a small subdomain of the input space, these functions may present purely linear behavior, in which case they can be disregarded and replaced with a linear constraint, which greatly simplifies the problem. Consequently, performing verification within small domains is beneficial, as many activation functions can often be disregarded. Our approach naturally involves verification queries on small clusters, which tends to be very helpful in this regard. Reluplex has built-in bound tightening [11] functionality to detect such cases, and we leverage this functionality by computing lower and upper bounds for each of the input variables within the cluster, and providing these as part of our query to Reluplex.
Our approach lends itself to more scalable verification also through parallelization. Because each cluster involves stand-alone verification queries, their verification can be performed in parallel. Also, because Eq. 2 is checked independently for every $l'$, these queries can also be performed in parallel, expediting the process even further [12].
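The effect of small input domains on activation functions can be illustrated with simple interval arithmetic. The sketch below is a generic illustration, not Reluplex's actual bound-tightening procedure: it assumes an L1 region of radius `r` around the centroid (so each coordinate lies within `r` of the centroid's coordinate), a ReLU activation, and hypothetical function names.

```python
def input_bounds(cen, r):
    """Per-variable bounds: |x_i - cen_i| <= r for any x in the L1 ball."""
    return [(c - r, c + r) for c in cen]

def preactivation_bounds(weights, bias, bounds):
    """Interval of a neuron's weighted sum over the given input box."""
    lo = hi = bias
    for w, (l, u) in zip(weights, bounds):
        if w >= 0:
            lo += w * l
            hi += w * u
        else:
            lo += w * u
            hi += w * l
    return lo, hi

def relu_phase(lo, hi):
    """A ReLU is linear on the region if its input never changes sign."""
    if lo >= 0:
        return "active"      # behaves as the identity
    if hi <= 0:
        return "inactive"    # behaves as the constant zero
    return "unstable"        # both phases remain possible
```

The smaller the radius, the tighter the intervals, and the more neurons fall into the "active" or "inactive" case where the nonlinearity can be replaced by a linear constraint.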
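Since the queries are independent, dispatching them in parallel is straightforward. The sketch below enumerates one job per (cluster, target label) pair and runs the jobs on a thread pool; `run_query` is a hypothetical callable standing in for an invocation of the verifier, and the use of threads (for example to wait on external solver processes) is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def verify_all(clusters, labels, run_query, max_workers=4):
    """clusters: list of (cluster_id, cluster_label) pairs.

    Runs run_query(cluster_id, target) for every target label other than
    the cluster's own label and returns {(cluster_id, target): result}.
    """
    jobs = [(cid, t) for cid, own in clusters for t in labels if t != own]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda job: run_query(*job), jobs))
    return dict(zip(jobs, results))
```

`pool.map` returns results in job order, so each result can be paired back with the cluster and target label that produced it.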
3.4 Processing Possible Adversarial Perturbations
A SAT solution to Eq. 2 for a given target label indicates the presence of an input within the region for which the network assigns a higher score to the target label than to the expected label. Note that this does not mean that the target label has the highest score; i.e., the input need not be a targeted adversarial example for that label. The validity of the adversarial example needs to be checked by the user or a domain expert. In some cases, there may be specific constraints on inputs that can be considered valid adversarial examples. We have been able to successfully model such domain-specific constraints for ACAS Xu to generate valid adversarial perturbations (Section 4).
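The distinction between the two notions can be made concrete with a small sketch. The score vector below is invented, and the code follows the wording of this section, in which a higher score is preferred; for a network where the lowest score wins, the comparisons would be reversed.

```python
def violates_target(scores, expected, target):
    """True if the target label scores higher than the expected label."""
    return scores[target] > scores[expected]


def is_targeted_adversarial(scores, target):
    """True only if the target label has the highest score of all labels."""
    best = max(range(len(scores)), key=lambda label: scores[label])
    return best == target


# Invented scores for three labels at some input in the region.
scores = [0.30, 0.35, 0.90]
expected, target = 0, 1

counterexample_for_target = violates_target(scores, expected, target)
targeted = is_targeted_adversarial(scores, target)
```

Here the input is a counterexample for target label 1, since label 1 outscores the expected label 0, yet the network would actually output label 2.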
4 Case Studies
We implemented DeepSafe using MATLAB R2017a for the clustering algorithm and Reluplex v1.0 for verification. The runs were dispatched on an 8-core 64GB server running Ubuntu 16.04. We evaluated DeepSafe on two case studies. The first network is part of a real-world controller for the next-generation Airborne Collision Avoidance System for unmanned aircraft (ACAS Xu), a highly safety-critical system. The second network is a digit classifier over the popular MNIST image dataset.
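Table 1. Summary of results for the ACAS Xu network.

| property | # clusters | min radius | time (hours) | # queries |
|---|---|---|---|---|
| safe | 125 | 0.084 | 4 | 11.8 |
| targeted safe | 52 | 0.135 | 7.6 | 14.4 |
| time out | 33 | NA | 12 | NA |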
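
Table 2. Per-label results for selected ACAS Xu clusters.

| cluster# | expected label | safe for label | radius | # queries | time (min) | slice (Y/N) |
|---|---|---|---|---|---|---|
| 5282 | 0 | 1 | 0.04 | 1 | 5.45 | N |
| 5282 | 0 | 2 | 0.04 | 1 | 3.91 | N |
| 5282 | 0 | 3 | 0.04 | 1 | 3.57 | N |
| 5282 | 0 | 4 | 0.04 | 1 | 4.01 | N |
| 1783 | 0 | 1 | 0.16 | 4 | 1.28 | Y |
| 1783 | 0 | 2 | 0.17 | 1 | 279 | N |
| 1783 | 0 | 3 | 0.17 | 1 | 236 | N |
| 1783 | 0 | 4 | 0.17 | 1 | 223 | N |
| 2072 | 1 | 0 | 0.06 | 1 | 11.51 | N |
| 2072 | 1 | 2 | 0.014 | 9 | 0.98 | N |
| 2072 | 1 | 3 | 0.011 | 7 | 0.71 | N |
| 2072 | 1 | 4 | 0.012 | 5 | 0.58 | N |
| 6138 | 0 | 1 | 0.089 | 9 | 103.2 | N |
| 6138 | 0 | 2 | 0.11 | 4 | 2.86 | N |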
4.1 ACAS Xu
ACAS X is a family of collision avoidance systems for aircraft which is currently under development by the Federal Aviation Administration (FAA) [9]. ACAS Xu is the version for unmanned aircraft control. It is intended to be airborne and receive sensor information regarding the drone (the ownship) and any nearby intruder drones, and then issue horizontal turning advisories aimed at preventing collisions. The input sensor data includes: (i) $\rho$: distance from ownship to intruder; (ii) $\theta$: angle of intruder relative to ownship heading direction; (iii) $\psi$: heading angle of intruder relative to ownship heading direction; (iv) $v_{own}$: speed of ownship; (v) $v_{int}$: speed of intruder; (vi) $\tau$: time until loss of vertical separation; and (vii) $a_{prev}$: previous advisory. The five possible output actions are as follows: Clear-of-Conflict (COC), Weak Right, Weak Left, Strong Right, and Strong Left. Each advisory is assigned a score, with the lowest score corresponding to the best action. The FAA is currently exploring an implementation of ACAS Xu that uses an array of 45 deep neural networks. These networks were obtained by discretizing the two parameters $\tau$ and $a_{prev}$, and so each network contains five input dimensions and treats $\tau$ and $a_{prev}$ as constants. Each network has 6 hidden layers and a total of 300 hidden ReLU activation nodes.
We applied our approach to several of the ACAS Xu networks. We describe here in detail the results for one network. Each input consists of 5 dimensions and is assigned one of 5 possible output labels, corresponding to the 5 possible turning advisories for the drone (0: COC, 1: Weak Right, 2: Weak Left, 3: Strong Right, and 4: Strong Left). We were supplied a set of cutpoints, representing valid important values for each dimension, by the domain experts [9]. We generated 2662704 inputs (the Cartesian product of the values for all the dimensions). The network was executed on these inputs and the output advisories (labels) were verified. These were considered as the inputs with known labels for our experiments.
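The input-generation step can be sketched as a Cartesian product over per-dimension cutpoints. The dimension names and cutpoint values below are invented for illustration; the actual cutpoints were supplied by the domain experts and are not reproduced here.

```python
from itertools import product

# Hypothetical cutpoints for five input dimensions (values are placeholders).
cutpoints = {
    "distance": [0.0, 1000.0, 5000.0],
    "angle_to_intruder": [-3.0, 0.0, 3.0],
    "intruder_heading": [-1.5, 1.5],
    "ownship_speed": [100.0, 300.0],
    "intruder_speed": [0.0, 200.0],
}

# One input per combination of cutpoint values, in the order of the keys above.
inputs = list(product(*cutpoints.values()))
```

The number of generated inputs is the product of the number of cutpoints per dimension, which is how a modest number of cutpoints per dimension can yield millions of labeled inputs.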
The label-guided clustering algorithm was applied on the inputs using the $L_2$ distance metric. Clustering yielded 6145 clusters with more than one input and 321 single-input clusters. The clustering took 7 hours. For each cluster we computed a region, characterized by a centroid (computed by kMeans), a radius (the average distance of every cluster instance from the centroid), and the expected label (the label of all the cluster instances).
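The region computation for a single cluster can be sketched as follows. This is an illustration under stated assumptions rather than the MATLAB implementation: the centroid is taken to be the mean of the cluster members (which is what k-means reports for a converged cluster), Euclidean distance is assumed, and the sample points are invented.

```python
import math


def centroid_of(points):
    """Component-wise mean of the cluster members."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def region_of(points, labels):
    """Return (centroid, radius, expected_label) for one same-label cluster."""
    if len(set(labels)) != 1:
        raise ValueError("all members of a cluster must share one label")
    cen = centroid_of(points)
    # Radius as defined in the text: average distance of members from the centroid.
    radius = sum(l2_distance(p, cen) for p in points) / len(points)
    return cen, radius, labels[0]


# Invented two-dimensional cluster whose members all carry label 3.
cen, radius, expected = region_of(
    [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]], [3, 3, 3, 3]
)
```

Because the radius is an average rather than a maximum, some cluster members may lie outside the resulting region.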
We first evaluated the network on all the centroids, as they are considered representative of the entire cluster and should ideally have the expected label. The network assigned the expected label for the centroids of 5116 clusters (83% of the total number of clusters). For the remaining 1029 clusters, we found that they contained few labeled instances spread out over large areas. Therefore, we considered these clusters imprecise and our analysis of them inconclusive. For singleton clusters, we fall back to checking local robustness using previous techniques [11]. These standalone points serve to identify portions of the input space which require more training data and are thus potentially more vulnerable to adversarial perturbations.
Among the remaining 5116 clusters, we randomly picked 210 clusters to illustrate our technique. These clusters contain 659315 labeled inputs (24% of the total inputs with known labels). For the region corresponding to each cluster, we applied DeepSafe to check Eq. 2 for every label. The distance metric used was $L_1$, since $L_2$ cannot be handled by Reluplex (Section 3 explains why this is still safe). The results are presented in Tables 1 and 2. The min radius in Table 1 refers to the average minimum radius around the centroid of each region for which the safety guarantee applies (averaged over the total number of regions for that safety type). The # queries refers to the number of times the solver had to be invoked until an UNSAT was obtained, averaged over all the regions for that property.
DeepSafe was able to identify 125 regions which are completely safe, i.e., the network yields a label consistent with the neighboring labeled inputs within the region. A further 52 regions are targeted safe: the network is safe against misclassifying inputs to certain labels. For instance, the inputs within region 6138 (Table 2), with an expected label of 0 (COC), were safe against misclassification only to labels 1 (Weak Right) and 2 (Weak Left). The solver timed out without returning any result for the remaining labels. For 33 clusters, the analysis timed out without returning a concrete result for any label. A timeout does not yield a proof for the region, although the likely answer is safe (generally, solvers take much longer when there is no solution).
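The repeated solver invocations counted in the # queries column can be pictured as a shrink-and-retry loop. The sketch below is an assumption-laden illustration: the update rule (shrinking the radius to just inside the counterexample's distance from the centroid), the `solve` callback, and the toy solver are all hypothetical, and the actual reduction rule is the one defined in Section 3.

```python
def shrink_until_unsat(solve, radius, max_queries=20, shrink=0.99):
    """Query the solver, shrinking the radius after each SAT answer.

    `solve(radius)` is a hypothetical callback returning one of
    ("UNSAT", None), ("SAT", distance_of_counterexample) or ("TIMEOUT", None).
    Returns (verdict, final_radius, number_of_queries).
    """
    for queries in range(1, max_queries + 1):
        status, cex_distance = solve(radius)
        if status == "UNSAT":
            return "safe", radius, queries
        if status == "TIMEOUT":
            return "time out", None, queries
        # Assumed rule: exclude the counterexample by shrinking just inside it.
        radius = shrink * cex_distance
    return "time out", None, max_queries


def toy_solver(radius, known=(0.12, 0.08, 0.05)):
    """Toy stand-in: adversarial inputs exist at these distances from the centroid."""
    inside = [d for d in known if d <= radius]
    return ("SAT", max(inside)) if inside else ("UNSAT", None)


verdict, final_radius, num_queries = shrink_until_unsat(toy_solver, radius=0.2)
```

In this toy run the solver is invoked four times: three SAT answers shrink the radius past each counterexample, and the fourth query returns UNSAT.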
4.2 MNIST Image Dataset
The MNIST database is a large collection of handwritten digits that is commonly used for training various image processing systems [14]. The dataset has 60,000 training input images, each characterized by 784 attributes and belonging to one of 10 labels. We used a network comprising 3 layers, each with 10 ReLU activation nodes. Clustering was applied using the distance metric. It yielded 6654 clusters with more than one input and 5681 single-input clusters. The clustering took 10 hours. A separate verification process was spawned for each cluster, with a timeout of 12 hours.
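Table 3. Summary of results for the MNIST network.

| property | # clusters | min radius | time (hours) | # queries |
|---|---|---|---|---|
| safe | 7 | 2.46 | 11.27 | 2.85 |
| targeted safe | 63 | 5.19 | 11.02 | 4.87 |
| time out | 10 | NA | 12 | NA |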
For the singleton clusters, as is the case with ACAS Xu, we performed local robustness checking as in previous approaches.
Table 3 shows the summary of the results for the 80 clusters that we selected for evaluation. Past studies have shown that MNIST classifiers, even state-of-the-art networks, are extremely vulnerable to misclassification under adversarial perturbations [3]. Therefore, as expected, SAT solutions are easy to find and were discovered very quickly (within a minute). However, proving safety is very time consuming; the verification time is much higher than for the ACAS Xu application, mainly because of the large number of input variables (784 attributes). We would like to highlight that our work is the first to successfully identify safety regions for MNIST, even on a fairly vulnerable network.
For 7 clusters, the solver returned UNSAT for all labels within 12 hours. For 30 clusters, the solver returned UNSAT for only a few labels and timed out before returning any solution for the other labels. These have been included under the targeted safe property in the table. Additionally, based on the nature of this domain, we consider it safe to assume that if the solver does not return a SAT solution for a label within 10 hours, then the region is safe w.r.t. that label, even if unsatisfiability is not proved within this time. This was the case for 33 clusters, where the solver could not find a solution for a specific target label despite executing for more than 10 hours. These have been included under the targeted safe property as well. For the remaining 10 clusters, the solver kept finding adversarial examples despite iterative reductions of the radius, and the timeout occurred before the radius was reduced to 0. These have been included as time out in the table, since we cannot determine for sure whether the region should be marked unsafe for the specific labels.
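The bookkeeping described in this paragraph can be summarized in a short sketch. The outcome names and the handling of mixed cases are assumptions made for illustration; in particular, the text does not say how a cluster with both proved labels and still-failing labels was counted, and this sketch counts it as targeted safe.

```python
PROVED = "UNSAT"              # solver proved no adversarial input for this label
NO_SAT_FOUND = "NO_SAT"       # no SAT solution within the time window, no proof
STILL_SAT = "SAT_AT_TIMEOUT"  # adversarial inputs were still being found at timeout


def categorize(per_label):
    """Map per-target-label outcomes of one cluster to a table category."""
    outcomes = list(per_label.values())
    if all(o == PROVED for o in outcomes):
        return "safe"
    if any(o in (PROVED, NO_SAT_FOUND) for o in outcomes):
        return "targeted safe"
    return "time out"


# Invented outcome maps keyed by target label.
fully_proved = categorize({1: PROVED, 2: PROVED, 3: PROVED})
partly_proved = categorize({1: PROVED, 2: NO_SAT_FOUND, 3: NO_SAT_FOUND})
never_settled = categorize({1: STILL_SAT, 2: STILL_SAT})
```

Keeping `NO_SAT_FOUND` distinct from `PROVED` preserves the difference between a formal guarantee and the 10-hour heuristic, even though both contribute to the targeted safe count.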
5 Threats to Validity
We discuss below possible threats to the validity of our approach and the experiments.

Invalid adversarial examples: The validity of an adversarial perturbation depends on the accuracy of the cluster in identifying regions of the input space which should ideally be given the same label. Clustering as used in our approach typically generates small dense groups or small neighborhoods around known inputs. The NN, on the other hand, attempts to abstract the input by focusing on certain features more than the others in order to be able to assign a unique label to it. This abstraction also enables generalization of the network to other inputs not part of the training data. However, this process tends to make the network inaccurate on inputs in close neighborhoods to known inputs. Therefore, although clustering cannot be considered as an alternative classifier to NN for any input, it can be considered to be accurate or an oracle in close neighborhoods of known inputs. This can however be impacted by the presence of noise in the input space due to irrelevant attributes, which the NN sieves out.

Invalid safety regions: There could be a scenario where both the cluster and the network agree on the labels for all inputs within a region, but ideally some of those inputs should be classified with a different label. This can happen when the training data is not representative enough.

Generalization of experimental results: The current implementation of the solver Reluplex used in our prototype tool only supports piecewise-linear activation functions. This could limit the generalization of the experimental results to networks which use other types of activation functions.
6 Related Work
The vulnerability of neural networks to adversarial perturbations was first discovered by Szegedy et al. in 2013 [20]. They model the problem of finding an adversarial example as a constrained minimization problem. Goodfellow et al. [7] introduced the Fast Gradient Sign Method for crafting adversarial perturbations using the derivative of the model's loss function with respect to the input feature vector. They show that NNs trained for the MNIST and CIFAR-10 classification tasks can be fooled with a high success rate. An extension of this approach applies the technique in an iterative manner [5]. The Jacobian-based Saliency Map Attack (JSMA) [17] is a method for targeted misclassification that exploits the forward derivative of a NN to find an adversarial perturbation that will force the model to misclassify into a specific target class. Carlini et al. [3] recently proposed an approach that could not be resisted by state-of-the-art networks such as those using defensive distillation. Their optimization algorithm uses better loss functions and parameters (empirically determined) and uses three different distance metrics.
The DeepFool [15] technique simplifies the domain by considering the network to be completely linear. They compute adversarial inputs on the tangent plane (orthogonal projection) of a point on the classifier function. They then introduce nonlinearity to the model, and repeat this process until a true adversarial example is found.
Deep Learning Verification (DLV) [8] is an approach that defines a region of safety around a known input and applies SMT solving for checking robustness. They consider the input space to be discretized and alter the input using manipulations until it is at a minimal distance from the original, to generate possibly adversarial inputs. They can only guarantee freedom from adversarial perturbations within the discrete points that are explored. Our clustering approach can potentially improve the technique by constraining the discrete search within regions.
7 Conclusion
This paper presents a novel, data-guided technique to search for adversarial perturbations (or prove they cannot occur) within well-defined geometric regions in the input space that correspond to clusters of similar inputs known to share the same label. In doing so, the approach identifies and provides proof for regions of safety in the input space within which the network is robust with respect to target labels. Preliminary experiments on the ACAS Xu and MNIST datasets highlight the potential of the approach in providing formal guarantees about the robustness of neural networks in a scalable manner.
In the future, we plan to investigate the following directions:

Retraining: If there are a number of single-input clusters or clusters with low cardinality, it could indicate two cases: (i) if the density is low, there is not enough training data in that region; or (ii) if the density is high, there is a lot of noise or a number of redundant attributes, which leads to repeated splitting of clusters. The first case could act as feedback for retraining. The second case is an indicator that the clustering should probably be carried out at a higher layer of abstraction, with a smaller number of more relevant attributes.

Input to other techniques: The boundaries of the cluster spheres formed by kMeans lie in low density areas. Therefore, the network could be assumed to have low accuracy around these boundaries. Thus, adversarial robustness checks around instances that are closer to the edge of the clusters could exhibit a high number of adversarial perturbations. This analysis could help identify potential inputs to which other techniques for assessing local robustness could be applied.

Other Solvers: While in our implementation we have used Reluplex to perform the verification, our approach is general and can use other tools as a backend solver. As checking robustness for deep neural networks is an active area of research, we plan to investigate and integrate other solvers, as they become available. We also plan to investigate testing, guided by the computed regions, as an alternative to verification, for increased scalability, but at the price of losing the formal guarantees.
Acknowledgements. This work was partially supported by grants from NASA, NSF, FAA and Intel.
References
[1] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional spaces. In Proc. 8th Int. Conf. on Database Theory (ICDT), pages 420–434, 2001.
[2] N. Carlini, G. Katz, C. Barrett, and D. Dill. Ground-Truth Adversarial Examples, 2017. Technical Report. http://arxiv.org/abs/1709.10207.
[3] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In Proc. 38th IEEE Symposium on Security and Privacy, 2017.
[4] R. Ehlers. Formal verification of piecewise linear feed-forward neural networks. In Proc. 15th Int. Symp. on Automated Technology for Verification and Analysis (ATVA), 2017.
[5] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner. Adversarial machine learning at scale, 2016. Technical Report. http://arxiv.org/abs/1611.01236.
[6] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner. Detecting adversarial samples from artifacts, 2017. Technical Report. http://arxiv.org/abs/1703.00410.
[7] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples, 2014. Technical Report. http://arxiv.org/abs/1412.6572.
[8] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu. Safety verification of deep neural networks. In Proc. 29th Int. Conf. on Computer Aided Verification (CAV), pages 3–29, 2017.
[9] K. Julian, J. Lopez, J. Brush, M. Owen, and M. Kochenderfer. Policy compression for aircraft collision avoidance systems. In Proc. 35th Digital Avionics System Conf. (DASC), pages 1–10, 2016.
[10] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881–892, 2002.
[11] G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In Proc. 29th Int. Conf. on Computer Aided Verification (CAV), pages 97–117, 2017.
[12] G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer. Towards proving the adversarial robustness of deep neural networks. In Proc. 1st Workshop on Formal Verification of Autonomous Vehicles (FVAV), pages 19–26, 2017.
[13] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial Examples in the Physical World, 2016. Technical Report. http://arxiv.org/abs/1607.02533.
[14] Y. LeCun, C. Cortes, and C. J. C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[15] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2574–2582, 2016.
[16] N. Papernot and P. D. McDaniel. On the effectiveness of defensive distillation, 2016. Technical Report. http://arxiv.org/abs/1607.05113.
[17] N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In Proc. 1st IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387, 2016.
[18] L. Pulina and A. Tacchella. An abstraction-refinement approach to verification of artificial neural networks. In Proc. 22nd Int. Conf. on Computer Aided Verification (CAV), pages 243–257, 2010.
[19] L. Pulina and A. Tacchella. Challenging SMT solvers to verify neural networks. AI Communications, 25(2):117–135, 2012.
[20] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks, 2013. Technical Report. http://arxiv.org/abs/1312.6199.