
# On the Decision Boundaries of Deep Neural Networks: A Tropical Geometry Perspective

## Abstract

This work tackles the problem of characterizing and understanding the decision boundaries of neural networks with piecewise linear activations. We use tropical geometry, a new development in the area of algebraic geometry, to characterize the decision boundaries of a simple neural network of the form (Affine, ReLU, Affine). Our main finding is that the decision boundaries are a subset of a tropical hypersurface, which is intimately related to a polytope formed by the convex hull of two zonotopes. The generators of these zonotopes are functions of the neural network parameters. This geometric characterization provides a new perspective on three tasks. Specifically, we propose a new tropical perspective to the lottery ticket hypothesis, where we see the effect of different initializations on the tropical geometric representation of a network's decision boundaries. Moreover, we use this characterization to propose a new set of tropical regularizers, which directly deal with the decision boundaries of a network. We investigate the use of these regularizers in neural network pruning (by removing network parameters that do not contribute to the tropical geometric representation of the decision boundaries) and in generating adversarial input attacks (by producing input perturbations that explicitly perturb the decision boundaries' geometry and ultimately change the network's prediction).


## 1 Introduction

Deep Neural Networks (DNNs) have demonstrated outstanding performance across a variety of research domains, including computer vision (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012), natural language processing (Bahdanau et al., 2015; Devlin et al., 2018), quantum chemistry (Schütt et al., 2017), and healthcare (Ardila et al., 2019; Zhou et al., 2019), to name a few (LeCun et al., 2015). Nevertheless, a rigorous interpretation of their success remains elusive (Shalev-Shwartz and Ben-David, 2014). For instance, and in an attempt to uncover the expressive power of DNNs, the work of Montufar et al. (2014) studied the complexity of functions computable by DNNs that have piecewise linear activations. They derived a lower bound on the maximum number of linear regions. Several other works have followed to improve such estimates under certain assumptions (Arora et al., 2018). In addition, and in an attempt to understand some of the subtle behaviours DNNs exhibit, e.g. the sensitive reaction of DNNs to small input perturbations, several works directly investigated the decision boundaries induced by a DNN for classification. The work of Frossard (2019) showed that the smoothness of these decision boundaries and their curvature can play a vital role in network robustness. Moreover, the expressiveness of these decision boundaries at perturbed inputs was studied in He et al. (2018a), where it was shown that these boundaries do not resemble the boundaries around benign inputs. The work of Li et al. (2018) showed that, under certain assumptions, the decision boundaries of the last fully connected layer of DNNs will converge to a linear SVM. Also, Beise et al. (2018) showed that the decision regions of DNNs with width smaller than the input dimension are unbounded.

More recently, and due to the popularity of the piecewise linear ReLU as an activation function, there has been a surge in the number of works that study this class of DNNs in particular. As a result, this has incited significant interest in new mathematical tools that help analyze piecewise linear functions, such as tropical geometry. While tropical geometry has shown its potential in many applications such as dynamic programming (Joswig and Schröter, 2019), linear programming (Allamigeon et al., 2015), multi-objective discrete optimization (Joswig and Loho, 2019), enumerative geometry (Mikhalkin, 2004), and economics (Akian et al., 2009; Mai Tran and Yu, 2015), it has only recently been used to analyze DNNs. For instance, the work of Zhang et al. (2018) showed an equivalency between the family of DNNs with piecewise linear activations and integer weight matrices and the family of tropical rational maps, i.e. ratios between two multi-variate polynomials in tropical algebra. This study was mostly concerned with characterizing the complexity of a DNN by counting the number of linear regions into which the function represented by the DNN can divide the input space. This was done by counting the number of vertices of a polytope representation, recovering the results of Montufar et al. (2014) with a much simpler analysis.

Contributions. In this paper, we take the results of Zhang et al. (2018) several steps further and present a novel perspective on the decision boundaries of DNNs using tropical geometry. To that end, our contributions are three-fold. (i) We derive a geometric representation (convex hull between two zonotopes) for a superset to the decision boundaries of a DNN in the form (Affine, ReLU, Affine). (ii) We demonstrate support for the lottery ticket hypothesis (Frankle and Carbin, 2019) using a geometric perspective. (iii) We leverage the geometric representation of the decision boundaries, referred to as the decision boundaries polytope, in two interesting applications: network pruning and adversarial attacks. For tropical pruning, we design a geometrically inspired optimization problem to prune the parameters of a given network such that the decision boundaries polytope of the pruned network does not deviate too much from the decision boundaries polytope of the original one. We conduct extensive experiments with AlexNet (Krizhevsky et al., 2012) and VGG16 (Simonyan and Zisserman, 2014) on SVHN (Netzer et al., 2011), CIFAR10, and CIFAR100 (Krizhevsky and Hinton, 2009) datasets, in which high pruning rates can be achieved with only a marginal drop in testing accuracy. For tropical adversarial attacks, we show that one can construct input adversaries that can change network predictions by perturbing the decision boundaries polytope.

## 2 Preliminaries to Tropical Geometry

For completeness, we first provide preliminaries to tropical geometry. For a detailed review, we refer readers to the work of Itenberg et al. (2009); Maclagan and Sturmfels (2015).

###### Definition 1.

(Tropical Semiring) The tropical semiring is the triplet (T, ⊕, ⊙), where T = R ∪ {−∞}, and ⊕ and ⊙ define tropical addition and tropical multiplication, respectively. They are denoted as:

 x ⊕ y = max{x, y},  x ⊙ y = x + y,  ∀ x, y ∈ T.

It can be readily shown that −∞ is the additive identity and 0 is the multiplicative identity.

Given the previous definition, a tropical power can be formulated as x^⊙a = x ⊙ x ⊙ ⋯ ⊙ x = a·x, for x ∈ T and a ∈ N, where a·x is standard multiplication. Moreover, the tropical quotient can be defined as x ⊘ y = x − y, where x − y is standard subtraction. For ease of notation, we write x^⊙a as x^a. Now, we are in a position to define tropical polynomials, their solution sets, and tropical rationals.
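These operations are simple to state in code. The following is a minimal sketch (function names are ours, not from the paper) of the tropical semiring operations and the derived power and quotient:

```python
# A minimal sketch of the tropical semiring (T = R ∪ {−∞}, ⊕, ⊙).
# All function names here are illustrative, not from the paper.
NEG_INF = float("-inf")  # additive identity of the tropical semiring

def trop_add(x, y):
    """Tropical addition: x ⊕ y = max{x, y}."""
    return max(x, y)

def trop_mul(x, y):
    """Tropical multiplication: x ⊙ y = x + y (standard addition)."""
    return x + y

def trop_pow(x, a):
    """Tropical power: x^⊙a = a · x (standard multiplication)."""
    return a * x

def trop_div(x, y):
    """Tropical quotient: x ⊘ y = x − y (standard subtraction)."""
    return x - y

# −∞ is the additive identity and 0 the multiplicative identity:
assert trop_add(NEG_INF, 3.0) == 3.0
assert trop_mul(0.0, 3.0) == 3.0
```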

###### Definition 2.

(Tropical Polynomials) For x ∈ T^d, c_i ∈ R, and a_i ∈ N^d, a d-variable tropical polynomial with n monomials can be expressed as:

 f(x) = (c_1 ⊙ x^a_1) ⊕ (c_2 ⊙ x^a_2) ⊕ ⋯ ⊕ (c_n ⊙ x^a_n),  ∀ a_i ≠ a_j when i ≠ j.

We use the more compact vector notation f(x) = ⊕_{i=1}^n (c_i ⊙ x^a_i). Moreover, and for ease of notation, we will denote c_i ⊙ x^a_i as c_i x^a_i throughout the paper.
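A tropical polynomial is thus simply a maximum of affine terms; a hedged sketch of its evaluation (the helper `trop_poly_eval` is illustrative):

```python
# Evaluating a d-variable tropical polynomial
# f(x) = ⊕_i (c_i ⊙ x^a_i) = max_i (c_i + <a_i, x>).
# The function name is ours, not the paper's.

def trop_poly_eval(coeffs, exponents, x):
    """coeffs: list of c_i; exponents: list of a_i, each the same length as x."""
    return max(c + sum(a_k * x_k for a_k, x_k in zip(a, x))
               for c, a in zip(coeffs, exponents))

# Example: f(x1, x2) = (1 ⊙ x1) ⊕ (0 ⊙ x2) ⊕ 2 = max(1 + x1, x2, 2)
f = lambda x: trop_poly_eval([1.0, 0.0, 2.0],
                             [(1, 0), (0, 1), (0, 0)], x)
assert f((0.0, 0.0)) == 2.0   # the constant monomial wins
assert f((5.0, 0.0)) == 6.0   # the 1 + x1 monomial wins
```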

###### Definition 3.

(Tropical Rational Functions) A tropical rational is a standard difference, or equivalently a tropical quotient, of two tropical polynomials: f(x) ⊘ g(x) = f(x) − g(x).

Algebraic curves or hypersurfaces in algebraic geometry, which are the solution sets to polynomials, can be analogously extended to tropical polynomials too.

###### Definition 4.

(Tropical Hypersurfaces) A tropical hypersurface of a tropical polynomial f(x) = (c_1 ⊙ x^a_1) ⊕ ⋯ ⊕ (c_n ⊙ x^a_n) is the set of points x where the maximum f(x) is attained by two or more monomials in f, i.e.

 T(f) := {x ∈ R^d : c_i x^a_i = c_j x^a_j = f(x), for some a_i ≠ a_j}.

Tropical hypersurfaces divide the domain of f into convex regions, where f is linear in each region. Also, every tropical polynomial can be associated with a Newton polytope.
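The hypersurface condition — the maximum being attained by two or more monomials — can be checked numerically; a small sketch under our own naming:

```python
def on_tropical_hypersurface(coeffs, exponents, x, tol=1e-9):
    """x lies on T(f) iff the max of f is attained by >= 2 monomials."""
    vals = [c + sum(a_k * x_k for a_k, x_k in zip(a, x))
            for c, a in zip(coeffs, exponents)]
    m = max(vals)
    return sum(1 for v in vals if abs(v - m) <= tol) >= 2

# Example: f(x) = x ⊕ 0 = max(x, 0); its hypersurface is the point x = 0,
# precisely where the two monomials tie.
coeffs, exps = [0.0, 0.0], [(1,), (0,)]
assert on_tropical_hypersurface(coeffs, exps, (0.0,))
assert not on_tropical_hypersurface(coeffs, exps, (1.0,))
```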

###### Definition 5.

(Newton Polytopes) The Newton polytope of a tropical polynomial f(x) = (c_1 ⊙ x^a_1) ⊕ ⋯ ⊕ (c_n ⊙ x^a_n) is the convex hull of the exponents a_i regarded as points in N^d, i.e.

 Δ(f) := ConvHull{a_i ∈ R^d : i = 1, …, n and c_i ≠ −∞}.

A tropical polynomial f determines a dual subdivision, which can be constructed by projecting the collection of upper faces (UF) of the polytope P(f) := ConvHull{(a_i, c_i) ∈ R^d × R : i = 1, …, n} onto R^d. That is to say, the dual subdivision determined by f is given as δ(f) := {π(p) ⊂ R^d : p ∈ UF(P(f))}, where π : R^{d+1} → R^d is the projection that drops the last coordinate. It has been shown by Maclagan and Sturmfels (2015) that the tropical hypersurface T(f) is the (d−1)-skeleton of the polyhedral complex dual to δ(f). First, this implies that each vertex of the dual subdivision δ(f) corresponds to one region in R^d where f is linear, as determined by the tropical hypersurface T(f). This is exemplified in Figure 1 with three tropical polynomials, where the numbers of regions in which f is linear for the three examples are 3, 6 and 10, respectively. Second, the tropical hypersurface is parallel to the normals of the edges of the dual subdivision polytope. The former property in particular will be essential for the remainder of the paper. Further details and standard results are summarized by Brugallé and Shaw (2014). Moreover, Zhang et al. (2018) showed an equivalency between tropical rational maps and any neural network with piecewise linear activations and integer weights through the following theorem.

###### Theorem 1.

(Tropical Characterization of Neural Networks, Zhang et al. (2018)). A feedforward neural network with integer weights, real biases, and piecewise linear activation functions is a function f : R^n → R^k, whose coordinates are tropical rational functions of the input, i.e.,

 f(x) = H(x) ⊘ Q(x) = H(x) − Q(x),

where H and Q are tropical polynomials.

While this result is new in the context of tropical geometry, it is not surprising, since any piecewise linear function can be represented as a difference of two max functions over a set of hyperplanes (Melzer, 1986). Mathematically, if f : R^n → R is a piecewise linear function, it can be written as f(x) = max_i {a_i^⊤x + b_i} − max_j {c_j^⊤x + d_j}. Thus, it is clear that each of the two maxima above is a tropical polynomial, recovering Theorem 1.

## 3 Decision Boundaries of Deep Neural Networks as Polytopes

In this section, we analyze the decision boundaries of a network in the form (Affine, ReLU, Affine) using tropical geometry. For ease, we use ReLUs as the non-linear activation, but any other piecewise linear function can also be used. The functional form of this network is f(x) = B max(Ax + c_1, 0) + c_2, where max(·, 0) is an element-wise operator. The outputs of the network f are the logit scores. Throughout this section, we assume that A ∈ Z^{p×n} and B ∈ Z^{2×p}, with real biases c_1 and c_2. For ease of notation, we only consider networks with two outputs, i.e. B has two rows, where the extension to a multi-class output follows naturally and is discussed in the appendix. Now, since f is a piecewise linear function, each output can be expressed as a tropical rational as per Theorem 1. If f_1 and f_2 refer to the first and second outputs respectively, we have f_1(x) = H_1(x) ⊘ Q_1(x) and f_2(x) = H_2(x) ⊘ Q_2(x), where H_1, H_2, Q_1 and Q_2 are tropical polynomials. In what follows, and for ease of presentation, we present our main results where the network has no biases, i.e. c_1 = 0 and c_2 = 0, and we leave the generalization to the appendix.
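To make the functional form concrete, here is a minimal NumPy sketch of a bias-free (Affine, ReLU, Affine) network with two outputs, following the shapes above (the random parameters are ours, for illustration only):

```python
import numpy as np

# Bias-free (Affine, ReLU, Affine) network f(x) = B max(Ax, 0),
# with A of shape (p, n) and B of shape (2, p) as in the section.
rng = np.random.default_rng(0)
n, p = 3, 5
A = rng.integers(-2, 3, size=(p, n)).astype(float)
B = rng.integers(-2, 3, size=(2, p)).astype(float)

def f(x):
    """Two logit scores of the (Affine, ReLU, Affine) network."""
    return B @ np.maximum(A @ x, 0.0)

x = rng.standard_normal(n)
logits = f(x)
assert logits.shape == (2,)
# The decision boundary is the set {x : f1(x) = f2(x)};
# the predicted class is whichever logit is larger.
decision = int(np.argmax(logits))
assert decision in (0, 1)
```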

###### Theorem 2.

For a bias-free neural network in the form f(x) = B max(Ax, 0), where A ∈ Z^{p×n} and B ∈ Z^{2×p}, let R(x) = H_1(x) ⊙ Q_2(x) ⊕ H_2(x) ⊙ Q_1(x) be a tropical polynomial. Then:

Let B := {x ∈ R^n : f_1(x) = f_2(x)} define the decision boundaries of f. Then B ⊆ T(R(x)).

δ(R(x)) = ConvHull(Z_{G_1}, Z_{G_2}). Z_{G_1} is a zonotope in R^n with line segments {(B^+(1,j) + B^−(2,j))[A^+(j,:), A^−(j,:)]}_{j=1}^p and shift (B^−(1,:) + B^+(2,:))A^−. Z_{G_2} is a zonotope in R^n with line segments {(B^−(1,j) + B^+(2,j))[A^+(j,:), A^−(j,:)]}_{j=1}^p and shift (B^+(1,:) + B^−(2,:))A^−. Note that A^+ = max(A, 0) and A^− = max(−A, 0). The line segment (B^+(1,j) + B^−(2,j))[A^+(j,:), A^−(j,:)] has end points A^+(j,:) and A^−(j,:) in R^n, scaled by (B^+(1,j) + B^−(2,j)).

The proof for Theorem 2 is left for the appendix.

Digesting Theorem 2. This theorem aims at characterizing the decision boundaries of a bias-free neural network of the form (Affine, ReLU, Affine) through the lens of tropical geometry. In particular, the first result of Theorem 2 states that the tropical hypersurface T(R(x)) of the tropical polynomial R(x) is a superset to the set of points forming the decision boundaries, i.e. B ⊆ T(R(x)). Just as discussed earlier and exemplified in Figure 1, tropical hypersurfaces are associated with a corresponding dual subdivision polytope. Based on this, the second result of Theorem 2 states that this dual subdivision δ(R(x)) is precisely the convex hull of two zonotopes denoted as Z_{G_1} and Z_{G_2}, where each zonotope is only a function of the network parameters A and B.

Theorem 2 bridges the gap between the behaviour of the decision boundaries B, through the superset T(R(x)), and the polytope δ(R(x)), which is the convex hull of two zonotopes. It is worthwhile to mention that Zhang et al. (2018) discussed a special case of the first part of Theorem 2 for a neural network with a single output and a score function to classify the output. To the best of our knowledge, this work is the first to propose a tropical geometric formulation of a superset containing the decision boundaries of a multi-class classification neural network. In particular, the first result of Theorem 2 states that one can perhaps study the decision boundaries, B, directly by studying their superset T(R(x)). While studying T(R(x)) can be equally difficult, the second result of Theorem 2 comes in handy. First, note that, since the network is bias-free, the projection π becomes an identity mapping, and thus the dual subdivision δ(R(x)), which coincides with the Newton polytope in this case, becomes a well-structured geometric object that can be exploited to preserve decision boundaries as per the second part of Theorem 2. Now, based on the results of Maclagan and Sturmfels (2015) (Proposition 3.1.6), and as discussed in Figure 1, the normals to the edges of the polytope δ(R(x)) (the convex hull of two zonotopes) are in one-to-one correspondence with the tropical hypersurface T(R(x)). Therefore, one can study the decision boundaries, or at least their superset T(R(x)), by studying the orientation of the dual subdivision δ(R(x)). Before any further discussion, we recap the definition of zonotopes.

###### Definition 6.

Let u^1, …, u^L ∈ R^n. The zonotope formed by u^1, …, u^L is defined as Z(u^1, …, u^L) := {Σ_{i=1}^L x_i u^i : 0 ≤ x_i ≤ 1}. Equivalently, Z can be expressed with respect to the generator matrix U ∈ R^{L×n}, where U(i,:) = (u^i)^⊤, as Z_U := {U^⊤x : ∀ x ∈ [0,1]^L}.

Another common definition for a zonotope is the Minkowski sum of a set of line segments in R^n that start from the origin (refer to appendix). It is well known that the number of vertices of a zonotope is polynomial in the number of line segments, i.e. O(L^{n−1}) (Gritzmann and Sturmfels, 1993).
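Definition 6 suggests a direct, if brute-force, construction: every vertex of Z_U is attained at a corner of [0,1]^L, so for a small number of generators one can simply enumerate the 2^L corners. A sketch (helper name is ours):

```python
import numpy as np
from itertools import product

# Brute-force sketch of Definition 6: the zonotope Z_U = {U^T x : x in [0,1]^L}.
# Every vertex is attained at a corner x in {0,1}^L, so for small L we can
# enumerate all 2^L corner images (a superset of the vertex set).
def zonotope_corner_points(U):
    L = U.shape[0]
    return np.array([U.T @ np.array(corner, dtype=float)
                     for corner in product([0.0, 1.0], repeat=L)])

U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])  # 3 generators in R^2
pts = zonotope_corner_points(U)
assert pts.shape == (8, 2)
# Summing all generators gives one extreme point of the zonotope:
assert (pts.max(axis=0) == np.array([2.0, 2.0])).all()
```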

While Theorem 2 presents a strong relation between a polytope (convex hull of two zonotopes) and the decision boundaries, it remains unclear how such a polytope can be efficiently constructed. Although the number of vertices of a zonotope is polynomial in the number of its generating line segments, fast algorithms for enumerating these vertices are still restricted to zonotopes with line segments starting at the origin (Stinson et al., 2016). Since the line segments generating the zonotopes in Theorem 2 have arbitrary end points, we present the next result that transforms these line segments into a generator matrix of line segments starting from the origin as in Definition 6. This result is essential for an efficient computation of the zonotopes in Theorem 2.

###### Proposition 1.

The zonotope formed by L line segments in R^n with arbitrary end points {[u_1^i, u_2^i]}_{i=1}^L is equivalent to the zonotope formed by the line segments {[u_1^i − u_2^i, 0]}_{i=1}^L with a shift of Σ_{i=1}^L u_2^i.

The proof is left for the appendix. As per Proposition 1, the generator matrices of the zonotopes Z_{G_1} and Z_{G_2} in Theorem 2 can be defined as G_1 = Diag[B^+(1,:) + B^−(2,:)](A^+ − A^−) and G_2 = Diag[B^−(1,:) + B^+(2,:)](A^+ − A^−), with the corresponding shifts given in Theorem 2, where Diag(v) rearranges the elements of v in a diagonal matrix.
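The positive/negative decomposition behind these generators is easy to sketch numerically; note that the exact scaling below follows our reading of Theorem 2 and should be treated as illustrative:

```python
import numpy as np

# Illustrative construction of the zonotope generator matrices from (A, B):
# A+ = max(A, 0) and A- = max(-A, 0) element-wise (similarly for B), and each
# row of G1/G2 is a line segment through the origin scaled by entries of B+/B-.
rng = np.random.default_rng(1)
p, n = 4, 2
A = rng.standard_normal((p, n))
B = rng.standard_normal((2, p))

A_pos, A_neg = np.maximum(A, 0.0), np.maximum(-A, 0.0)
B_pos, B_neg = np.maximum(B, 0.0), np.maximum(-B, 0.0)

# Generators of the two zonotopes (one row per hidden neuron):
G1 = np.diag(B_pos[0] + B_neg[1]) @ (A_pos - A_neg)
G2 = np.diag(B_neg[0] + B_pos[1]) @ (A_pos - A_neg)

assert G1.shape == (p, n) and G2.shape == (p, n)
# Sanity check: the decomposition recovers A itself.
assert np.allclose(A_pos - A_neg, A)
```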

In what follows, we show several applications for Theorem 2. We begin by leveraging the geometric structure to help in reaffirming the behaviour of the lottery ticket hypothesis.

## 4 Tropical Perspective to the Lottery Ticket Hypothesis

The lottery ticket hypothesis was recently proposed by Frankle and Carbin (2019), in which the authors surmise the existence of sparse trainable sub-networks of dense, randomly-initialized, feed-forward networks that when trained in isolation perform as well as the original network in a similar number of iterations.

To find such sub-networks, Frankle and Carbin (2019) propose the following simple algorithm: perform standard network pruning, initialize the pruned network with the same initialization that was used in the original training setting, and train with the same number of epochs. They hypothesize that this should result in a smaller network with a similar accuracy to the larger dense network. In other words, a subnetwork can have decision boundaries similar to those of the original network. While in this section we do not provide a theoretical reason for why this proposed pruning algorithm performs favorably, we utilize the geometric structure that arises from Theorem 2 to reaffirm such behaviour. In particular, we show that the orientation of the dual subdivision δ(R(x)) (referred to as the decision boundaries polytope from now onwards), whose edge normals are parallel to the tropical hypersurface T(R(x)) that is a superset to the decision boundaries, is preserved after pruning with the proposed initialization algorithm of Frankle and Carbin (2019). On the other hand, pruning routines with a different initialization at each pruning iteration result in a severe variation in the orientation of the decision boundaries polytope. This leads to a large change in the orientation of the decision boundaries themselves, which tends to hinder accuracy.

To this end, we train a neural network with 2 inputs (n = 2), 2 outputs, and a single hidden layer with 40 nodes (p = 40). We then prune the network by removing the weights with the smallest magnitudes. The pruned network is then trained using different initializations: (i) the same initialization as the original network (Frankle and Carbin, 2019), (ii) Xavier (Glorot and Bengio, 2010), (iii) standard Gaussian, and (iv) zero-mean Gaussian with variance 0.1. Figure 3 shows the decision boundaries polytope, i.e. δ(R(x)), as we perform more pruning (increasing the pruning rate) with different initializations. First, we show the decision boundaries by sampling and classifying points in a grid with the trained network (first subfigure). We then plot the decision boundaries polytope as per the second part of Theorem 2, denoted as the original polytope (second subfigure). While there are many overlapping vertices in the original polytope, the normals to some of the edges (the major visible edges) are parallel in direction to the decision boundaries shown in the first subfigure of Figure 3. We later show the decision boundaries polytope for the same network with varying levels of pruning. Observe that, for all other initialization schemes, the orientation of the polytope deviates from the decision boundaries polytope of the original unpruned network far more than it does under the lottery ticket initialization. This gives an indication that the lottery ticket initialization indeed preserves the decision boundaries, since it preserves the orientation of the decision boundaries polytope throughout the evolution of pruning. Several other examples are left for the appendix. Another approach to investigating the lottery ticket hypothesis could be to observe the polytopes representing the functional form of the network directly, i.e. the dual subdivisions of the tropical polynomials H_{1,2} and Q_{1,2}, in lieu of the decision boundaries polytope.
However, this does not provide conclusive answers to the lottery ticket hypothesis, since there can exist multiple functional forms, and correspondingly multiple such polytopes, for networks with the same decision boundaries. This is why we explicitly focus our analysis on δ(R(x)), which is directly related to the decision boundaries of the network. Further discussions and experiments are left for the appendix.

## 5 Tropical Network Pruning

Network pruning has been identified as an effective approach for reducing the computational cost and memory usage during network inference. While it dates back to the work of LeCun et al. (1990) and Hassibi and Stork (1993), network pruning has recently gained more attention. This is due to the fact that most neural networks over-parameterize commonly used datasets. In network pruning, the task is to find a smaller subset of the network parameters such that the resulting smaller network has decision boundaries similar (and thus supposedly similar accuracy) to the original over-parameterized network. In this section, we show a new geometric approach towards network pruning. In particular, as indicated by Theorem 2, preserving the polytope δ(R(x)) preserves a superset to the decision boundaries B, and thus supposedly the decision boundaries themselves.

Motivational Insight.   For a single hidden layer neural network, the dual subdivision to the decision boundaries is the polytope that is the convex hull of two zonotopes, where each is formed by taking the Minkowski sum of line segments (Theorem 2). Figure 4 shows an example, where pruning a neuron in the neural network has no effect on the dual subdivision polytope and, equivalently, no effect on the accuracy. This is because the orientation of the decision boundaries polytope does not change, thus preserving the tropical hypersurface and keeping the decision boundaries of both networks the same.

Problem Formulation.   In light of the motivational insight, a natural question arises: given an over-parameterized binary-output neural network f(x) = B max(Ax, 0), can one construct a new neural network, parameterized by some sparser weight matrices ~A and ~B, such that this smaller network has a dual subdivision that preserves the decision boundaries of the original network?

To address this question, we propose the following general optimization problem to compute ~A and ~B:

 min_{~A,~B} d(δ(~R(x)), δ(R(x))) = min_{~A,~B} d(ConvHull(Z_{~G_1}, Z_{~G_2}), ConvHull(Z_{G_1}, Z_{G_2})). (1)

The function d(·,·) defines a distance between two geometric objects. Since the generators ~G_1 and ~G_2 are functions of ~A and ~B (as per Theorem 2), this optimization problem can be challenging to solve. However, for pruning purposes, one can observe from Theorem 2 that if the generators ~G_1 and ~G_2 have a fewer number of line segments (rows), this corresponds to a fewer number of rows in the weight matrix ~A (sparser weights). Moreover, we observe that if ~G_1 ≈ G_1 and ~G_2 ≈ G_2, then δ(~R(x)) ≈ δ(R(x)), and thus the decision boundaries tend to be preserved as a consequence. Therefore, we propose the following optimization problem as a surrogate to Problem (1):

 min_{~A,~B} (1/2)(‖~G_1 − G_1‖_F² + ‖~G_2 − G_2‖_F²) + λ_1‖~G_1‖_{2,1} + λ_2‖~G_2‖_{2,1}. (2)

The matrix mixed norm for C ∈ R^{p×n} is defined as ‖C‖_{2,1} = Σ_{i=1}^p ‖C(i,:)‖_2, which encourages the matrix C to be row sparse, i.e. complete rows of C are zero. Note that ~G_1 and ~G_2 are constructed from ~A and ~B exactly as G_1 and G_2 are from A and B. We solve Problem (2) through alternating optimization over the variables ~A and ~B, where each sub-problem can be solved in closed form. Details of the optimization and the extension to the multi-class case are left for the appendix.
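The closed-form sub-problems owe to the fact that the ℓ_{2,1} term has a well-known proximal operator, row-wise group soft-thresholding; a hedged sketch (the function name is ours, and this is only the prox step, not the full alternating scheme):

```python
import numpy as np

# Row-wise group soft-thresholding: the proximal operator of lam * ||G||_{2,1},
# i.e. the minimizer of 0.5 * ||G_tilde - G||_F^2 + lam * ||G_tilde||_{2,1}.
# Rows with norm below lam are zeroed out entirely (pruned).
def prox_l21(G, lam):
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return scale * G

G = np.array([[3.0, 4.0],    # row norm 5  -> shrunk by factor (1 - 1/5)
              [0.3, 0.4]])   # row norm 0.5 -> zeroed when lam >= 0.5
out = prox_l21(G, 1.0)
assert np.allclose(out[0], [2.4, 3.2])      # 0.8 * [3, 4]
assert np.allclose(out[1], [0.0, 0.0])      # whole row pruned
```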

Extension to Deeper Networks. For deeper networks, one can still apply the aforementioned optimization for consecutive blocks. In particular, we prune each consecutive block of the form (Affine,ReLU,Affine) starting from the input and ending at the output of the network.

Experiments on Tropical Pruning. Here, we evaluate the performance of the proposed pruning approach as compared to several classical approaches on several architectures and datasets. In particular, we compare our tropical pruning approach against Class Blind (CB), Class Uniform (CU) and Class Distribution (CD) (Han et al., 2015; See et al., 2016). In Class Blind, all the parameters across all nodes of a layer are sorted by magnitude, and those with the smallest magnitudes are pruned. In contrast, Class Uniform prunes the parameters with the smallest magnitudes per node in a layer. Lastly, Class Distribution performs pruning of all parameters for each node in the layer, just as in Class Uniform, but the parameters are pruned based on the standard deviation of the magnitude of the parameters per node. Since fully connected layers in deep neural networks tend to have much higher memory complexity than convolutional layers, we restrict our focus to pruning fully connected layers. We train AlexNet and VGG16 on SVHN, CIFAR10, and CIFAR100 datasets. We observe that we can prune more than 90% of the classifier parameters for both networks without affecting the accuracy. Moreover, we demonstrate experimentally that our approach can outperform all other methods even when all parameters or when only the biases are fine-tuned after pruning (these experiments, in addition to many others, are left for the appendix).
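The two simplest baselines can be sketched in a few lines; the thresholding details below are our illustrative reading of magnitude pruning, not the exact implementation of Han et al. (2015):

```python
import numpy as np

# Sketch of magnitude-pruning baselines for one layer W (rows = nodes):
# Class Blind sorts all weights of the layer globally; Class Uniform prunes
# the same fraction per node (row).
def class_blind(W, frac):
    k = int(frac * W.size)
    thresh = np.sort(np.abs(W).ravel())[k]
    return np.where(np.abs(W) >= thresh, W, 0.0)

def class_uniform(W, frac):
    out = W.copy()
    k = int(frac * W.shape[1])
    for i in range(W.shape[0]):
        thresh = np.sort(np.abs(W[i]))[k]
        out[i] = np.where(np.abs(W[i]) >= thresh, W[i], 0.0)
    return out

W = np.array([[0.1, 5.0, 4.0, 3.0],
              [0.2, 0.3, 0.4, 6.0]])
# Pruning 50% globally removes the four smallest entries overall:
assert np.count_nonzero(class_blind(W, 0.5)) == 4
# Pruning 50% per row removes the two smallest entries of each row:
assert np.count_nonzero(class_uniform(W, 0.5)) == 4
```

Note how the surviving weights differ between the two schemes even at the same pruning rate, which is exactly the distinction the baselines draw.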

Setup. We adapt the architectures of AlexNet and VGG16, since they were originally trained on ImageNet (Deng et al., 2009), to account for the discrepancy in the input resolution. The fully connected layers of AlexNet and VGG16 have sizes of (256, 512, 10) and (512, 512, 10), respectively, on SVHN and CIFAR10, with the last layer replaced by 100 outputs for CIFAR100. All networks were trained to their baseline test accuracies on SVHN, CIFAR10 and CIFAR100 for both AlexNet and VGG16. To evaluate the performance of pruning, and following previous work (Han et al., 2015), we report the area under the curve (AUC) of the pruning-accuracy plot. The higher the AUC, the better the trade-off between pruning rate and accuracy. For efficiency purposes, we run the optimization in Problem (2) for a single alternating iteration to identify the rows in ~A and elements of ~B that will be pruned, since an exact pruning solution might not be necessary. The algorithm and the parameter setup for solving Problem (2) are left for the appendix.

Results. Figure 4 shows the comparison between our tropical approach and the three popular pruning schemes on both AlexNet and VGG16 over the different datasets. Our proposed approach can indeed prune out as much as 90% of the parameters of the classifier without sacrificing much of the accuracy. For AlexNet, we achieve much better performance in pruning as compared to other methods. In particular, we attain a higher AUC than the other pruning methods on SVHN, CIFAR10 and CIFAR100. This indicates that the decision boundaries can indeed be preserved by preserving the dual subdivision polytope. For VGG16, we perform similarly well on both SVHN and CIFAR10 and slightly worse on CIFAR100. While the performance achieved here is comparable to the other pruning schemes, if not better, we emphasize that our contribution does not lie in outperforming state-of-the-art pruning methods, but in giving a new geometry-based perspective to network pruning. We conduct more experiments in which only the biases of the network or only the classifier are fine-tuned after pruning. Retraining only the biases can be sufficient, as biases do not contribute to the orientation of the decision boundaries polytope (and effectively the decision boundaries) but only to its translation. Discussion on biases and more results are left for the appendix.

## 6 Tropical Adversarial Attacks

DNNs are notorious for being susceptible to adversarial attacks. In fact, adding small imperceptible noise, referred to as adversarial attacks, to the input of these networks can hinder their performance. Several works investigated the decision boundaries of neural networks in the presence of adversarial attacks. For instance, Khoury and Hadfield-Menell (2018) analyzed the high dimensional geometry of adversarial examples by means of manifold reconstruction. Also, He et al. (2018b) crafted adversarial attacks by estimating the distance to the decision boundaries using random search directions. In this work, we provide a tropical geometric view to this task, where we show how Theorem 2 can be leveraged to construct a tropical geometry-based targeted adversarial attack.

Dual View to Adversarial Attacks.   For a classifier f : R^n → R^k and input x_0 classified as c, a standard formulation for targeted adversarial attacks flips the prediction to a particular target class t and is usually defined as

 min_η D(η)  s.t.  argmax_i f_i(x_0 + η) = t ≠ c. (3)

This objective aims to compute the lowest energy input noise η (measured by D) such that the new sample (x_0 + η) crosses the decision boundaries of f to a new classification region. Here, we present a dual view to adversarial attacks. Instead of designing a sample noise η such that (x_0 + η) belongs to a new decision region, one can instead fix x_0 and perturb the network parameters to move the decision boundaries in a way that x_0 appears in a new classification region. In particular, let A_1 be the first linear layer of f, such that f(x_0) = g(A_1x_0). One can now perturb A_1 to alter the decision boundaries and relate this parameter perturbation to the input perturbation as follows:

 g((A_1 + ξ_{A_1})x_0) = g(A_1x_0 + ξ_{A_1}x_0) = g(A_1x_0 + A_1η) = f(x_0 + η). (4)
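The equivalence above rests on the linear system A_1η = ξ_{A_1}x_0; given a parameter perturbation, the corresponding input perturbation can be recovered by least squares. A hedged NumPy sketch (shapes and random values are ours):

```python
import numpy as np

# Dual view sketch: given a perturbation xi of the first linear layer A1 and a
# fixed input x0, the equivalent input perturbation eta solves A1 @ eta = xi @ x0
# (here in the least-squares sense, since A1 need not be square).
rng = np.random.default_rng(2)
p, n = 6, 4
A1 = rng.standard_normal((p, n))
x0 = rng.standard_normal(n)
xi = 0.01 * rng.standard_normal((p, n))

rhs = xi @ x0
eta, *_ = np.linalg.lstsq(A1, rhs, rcond=None)

assert eta.shape == (n,)
# Optimality of the least-squares solution via the normal equations:
grad = A1.T @ (A1 @ eta - rhs)
assert np.linalg.norm(grad) < 1e-8
```

When rhs lies in the column space of A_1, the residual itself vanishes and the two formulations coincide exactly, which is the condition discussed in the text.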

From this dual view, we observe that traditional adversarial attacks are intimately related to perturbing the parameters of the first linear layer through the linear system ξ_{A_1}x_0 = A_1η. The two views and formulations are identical under such a condition. With this analysis, Theorem 2 provides explicit means to geometrically construct adversarial attacks by perturbing the decision boundaries. In particular, since the normals to the dual subdivision polytope δ(R(x)) of a given neural network represent the tropical hypersurface T(R(x)), which is a superset to the decision boundaries set B, the perturbation ξ_{A_1} can be designed to result in a minimal perturbation to the dual subdivision that is sufficient to change the network prediction of x_0 to the targeted class t. Based on this observation, we formulate the problem as follows:

 min_{η, ξ_{A_1}} D_1(η) + D_2(ξ_{A_1}) (5)
 s.t. −loss(g(A_1(x_0 + η)), t) ≤ −1;
 −loss(g((A_1 + ξ_{A_1})x_0), t) ≤ −1;
 (x_0 + η) ∈ [0,1]^n, ‖η‖_∞ ≤ ε_1;
 ‖ξ_{A_1}‖_{∞,∞} ≤ ε_2, A_1η = ξ_{A_1}x_0.

Here, loss is the standard cross-entropy loss. The first row of constraints ensures that the network prediction is the desired target class t when the input is perturbed by η, and equivalently when the first linear layer is perturbed by ξ_{A_1}. This is similar to the objective proposed by Carlini and Wagner (2016). Moreover, the third and fourth constraints guarantee that the perturbed input is feasible and that the perturbation is bounded, respectively. The fifth constraint limits the maximum perturbation on the first linear layer, while the last constraint enforces the dual equivalence between input perturbation and parameter perturbation. The function D_2 captures the perturbation of the dual subdivision polytope upon perturbing the first linear layer by ξ_{A_1}. For a single hidden layer neural network parameterized by A_1 and B for the first and second layers respectively, D_2 can capture the perturbations in each of the two zonotopes discussed in Theorem 2, and we define it as:

 D_2(ξ_{A_1}) = (1/2) Σ_{j=1}^{2} ‖Diag(B^+(j,:))ξ_{A_1}‖_F² + ‖Diag(B^−(j,:))ξ_{A_1}‖_F². (6)

The derivation, discussion, and extension of (6) to multi-class neural networks is left for the appendix. We solve Problem (5) with a penalty method on the linear equality constraints, where each penalty step is solved with ADMM Boyd et al. (2011) in a similar fashion to the work of Xu et al. (2018). The details of the algorithm are left for the appendix.
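As a sanity check on Eq. (6), the penalty D_2 can be evaluated directly; a minimal sketch with illustrative values:

```python
import numpy as np

# Direct evaluation of the polytope-perturbation penalty D2 in Eq. (6):
# a weighted Frobenius penalty on xi, with the rows of B+ and B- as
# diagonal weights.
def D2(xi, B):
    B_pos, B_neg = np.maximum(B, 0.0), np.maximum(-B, 0.0)
    total = 0.0
    for j in range(2):
        total += np.linalg.norm(np.diag(B_pos[j]) @ xi, "fro") ** 2
        total += np.linalg.norm(np.diag(B_neg[j]) @ xi, "fro") ** 2
    return 0.5 * total

B = np.array([[1.0, -2.0],
              [0.5, 1.0]])
xi = np.ones((2, 3))
val = D2(xi, B)
assert val > 0.0
# Zero perturbation of the layer costs nothing:
assert D2(np.zeros((2, 3)), B) == 0.0
```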

Motivational Insight to the Dual View. Here, we train a single hidden layer neural network, with an input size of 2, 50 hidden nodes, and 2 outputs, on a simple dataset as shown in Figure 6. We then solve Problem (5) for a given x_0, shown in black. We show the decision boundaries for the network with and without the perturbation ξ_{A_1} at the first linear layer. Figure 6 shows that perturbing an edge of the dual subdivision polytope, by perturbing the first linear layer, indeed corresponds to perturbing the decision boundaries and results in the misclassification of x_0. Interestingly, and as expected, perturbing different decision boundaries corresponds to perturbing different edges of the dual subdivision. Furthermore, we conduct extensive experiments on MNIST images, which show that successful adversarial attacks can be designed by solving Problem (5). Due to space constraints, we leave these results for the appendix.

## 7 Conclusion

We leverage tropical geometry to characterize the decision boundaries of neural networks in the form (Affine, ReLU, Affine) and relate it to geometric objects such as zonotopes. We then provide a tropical perspective to support the lottery ticket hypothesis, prune networks, and design adversarial attacks. A natural extension is a compact derivation for the characterization of the decision boundaries of convolutional neural networks and graphical convolutional networks.

Acknowledgments. This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research.

## Appendix A Preliminaries and Definitions.

###### Fact 1.

P ~+ Q = {p + q : p ∈ P, q ∈ Q} is the Minkowski sum between two sets P and Q.
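For intuition, the Minkowski sum of two finite point sets can be computed by brute force. A small sketch (our own helper, assuming points are represented as tuples):

```python
def minkowski_sum(P, Q):
    """Minkowski sum of two finite point sets: {p + q : p in P, q in Q}."""
    return {tuple(pi + qi for pi, qi in zip(p, q)) for p in P for q in Q}
```

For example, summing the endpoint sets of the segments [(0,0),(1,0)] and [(0,0),(0,1)] yields the four corners of the unit square, the simplest two-dimensional zonotope.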

###### Fact 2.

Let f be a tropical polynomial and let a ∈ ℕ. Then

 P(f^a) = a P(f).

###### Fact 3.

Let f and g both be tropical polynomials. Then

 P(f⊙g) = P(f) ~+ P(g). (7)

###### Fact 4.

 P(f⊕g) = ConvexHull(V(P(f)) ∪ V(P(g))). (8)

Note that V(P) is the set of vertices of the polytope P.

## Appendix B Proof Of Theorem 2

###### Theorem 2.

For a bias-free neural network in the form f(x) = B max(Ax, 0), where A ∈ ℤ^{p×n} and B ∈ ℤ^{2×p}, let R(x) = (H1(x) ⊙ Q2(x)) ⊕ (H2(x) ⊙ Q1(x)) be a tropical polynomial. Then:

• Let B define the decision boundaries of f; then B ⊆ T(R(x)).

• δ(R(x)) = ConvexHull(Z_{G1}, Z_{G2}). Z_{G1} is a zonotope in R^n with line segments {(B+(1,j) + B−(2,j))[A+(j,:), A−(j,:)]}_{j=1}^{p} and shift (B−(1,:) + B+(2,:))A−. Z_{G2} is a zonotope in R^n with line segments {(B−(1,j) + B+(2,j))[A+(j,:), A−(j,:)]}_{j=1}^{p} and shift (B+(1,:) + B−(2,:))A−. Note that A+ = max(A, 0) and A− = max(−A, 0), where the max(·) is element-wise. The line segment (B+(1,j) + B−(2,j))[A+(j,:), A−(j,:)] has end points A+(j,:) and A−(j,:) in R^n and is scaled by the constant B+(1,j) + B−(2,j).
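To make the statement concrete, the generators and shift of each zonotope can be read directly off the network weights. Below is a minimal NumPy sketch under our reading of the theorem (the helper name `zonotope_generators` is ours):

```python
import numpy as np

def zonotope_generators(A, B):
    """For a bias-free network f(x) = B max(Ax, 0), with B of shape (2, p)
    and A of shape (p, n), return for each zonotope its p line segments
    (as scaled endpoint pairs) and its shift point."""
    A_pos, A_neg = np.maximum(A, 0), np.maximum(-A, 0)
    B_pos, B_neg = np.maximum(B, 0), np.maximum(-B, 0)

    def zonotope(scale, shift_row):
        # scale[j] multiplies the segment with endpoints A+(j,:), A-(j,:)
        segments = [(scale[j] * A_pos[j], scale[j] * A_neg[j])
                    for j in range(A.shape[0])]
        return segments, shift_row @ A_neg  # shift: a single point in R^n

    Z_G1 = zonotope(B_pos[0] + B_neg[1], B_neg[0] + B_pos[1])
    Z_G2 = zonotope(B_neg[0] + B_pos[1], B_pos[0] + B_neg[1])
    return Z_G1, Z_G2
```

The Minkowski sum of the returned segments, translated by the shift point, traces out the corresponding zonotope.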

###### Proof.

For the first part, recall from Theorem 1 that both f1 and f2 are tropical rationals, and hence

 f1(x) = H1(x) − Q1(x),  f2(x) = H2(x) − Q2(x).

Thus

 B = {x ∈ R^n : f1(x) = f2(x)} = {x ∈ R^n : H1(x) − Q1(x) = H2(x) − Q2(x)}
  = {x ∈ R^n : H1(x) + Q2(x) = H2(x) + Q1(x)}
  = {x ∈ R^n : H1(x) ⊙ Q2(x) = H2(x) ⊙ Q1(x)}

Recall that a tropical hypersurface is defined as the set of x where the maximum is attained by two or more monomials. Therefore, the tropical hypersurface of R(x) is the set of x where the maximum is attained by two or more monomials in H1(x) ⊙ Q2(x), or by two or more monomials in H2(x) ⊙ Q1(x), or by monomials in both of them at the same time; the last case is exactly the decision boundaries. Hence, we can rewrite T(R(x)) as

 T(R(x))=T(H1(x)⊙Q2(x))∪T(H2(x)⊙Q1(x))∪B.

Therefore B ⊆ T(R(x)). For the second part of the Theorem, we first use the decomposition proposed by Zhang et al. (2018) and Berrada et al. (2016) to show that a network f(x) = B max(Ax, 0) can be decomposed as a tropical rational as follows

 f(x) = (B+ − B−)(max(A+x, A−x) − A−x)
  = [B+ max(A+x, A−x) + B− A−x] − [B− max(A+x, A−x) + B+ A−x].
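This decomposition is easy to verify numerically. The following sketch checks, for random weights, that f(x) = B max(Ax, 0) equals the difference of the two bracketed terms above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # first layer, p = 5, n = 3
B = rng.standard_normal((2, 5))   # second layer
x = rng.standard_normal(3)

A_pos, A_neg = np.maximum(A, 0), np.maximum(-A, 0)
B_pos, B_neg = np.maximum(B, 0), np.maximum(-B, 0)

# Network output f(x) = B max(Ax, 0)
f = B @ np.maximum(A @ x, 0)

# Tropical rational: H(x) - Q(x) with
# H = B+ max(A+x, A-x) + B- A- x,  Q = B- max(A+x, A-x) + B+ A- x
m = np.maximum(A_pos @ x, A_neg @ x)
H = B_pos @ m + B_neg @ (A_neg @ x)
Q = B_neg @ m + B_pos @ (A_neg @ x)

assert np.allclose(f, H - Q)
```

The check rests on the identity max(Ax, 0) = max(A+x, A−x) − A−x, which holds element-wise since A = A+ − A−.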

Therefore, we have that

 H1(x)+Q2(x) =(B+(1,:)+B−(2,:))max(A+x,A−x) +(B−(1,:)+B+(2,:))A−x
 H2(x)+Q1(x) =(B−(1,:)+B+(2,:))max(A+x,A−x) +(B+(1,:)+B−(2,:))A−x.

Therefore, note that:

 δ(R(x)) = δ((H1(x) ⊙ Q2(x)) ⊕ (H2(x) ⊙ Q1(x))) = ConvexHull(δ(H1(x)) ~+ δ(Q2(x)), δ(H2(x)) ~+ δ(Q1(x))),

where the second equality follows from Facts 3 and 4.

Now observe that, tropically, max(A+(j,:)x, A−(j,:)x) is given as x^{A+(j,:)} ⊕ x^{A−(j,:)}; thus we have that:

 δ(H1(x)) = (B+(1,1) + B−(2,1)) ConvexHull(A+(1,:), A−(1,:)) ~+ …

The operator ~+ indicates a Minkowski sum between sets. Note that ConvexHull(A+(j,:), A−(j,:)) is the convex hull of two points, which is a line segment in R^n with end points A+(j,:) and A−(j,:), scaled by (B+(1,j) + B−(2,j)). Observe that this Minkowski sum of line segments is a zonotope. Moreover, note that, tropically, the term (B−(1,:) + B+(2,:))A−x is given as x^{(B−(1,:)+B+(2,:))A−}, and its dual subdivision is the Minkowski sum of the points {(B−(1,j) + B+(2,j))A−(j,:)}_{j=1}^{p} (which is a standard sum), resulting in a single point. Lastly, δ(H1(x)) is a Minkowski sum between a zonotope and a single point, which corresponds to a shifted zonotope. A similar symmetric argument can be applied for the second part δ(H2(x)) ~+ δ(Q1(x)). ∎

It is also worth mentioning that the extension to networks with multi-class output is straightforward. In that case, all of the analysis applies exactly to the decision boundary between any two classes, and the rest of the proof remains the same.

## Appendix C Handling Biases

In this section, we show that the results of Theorem 2 are unaltered in the presence of biases (c1 and c2). To do so, we derive the dual subdivision of the output of the first Affine layer, the ReLU, and the last Affine layer, consecutively.

As for the output of the first affine layer, note that for an input x, the output of the first affine layer can be represented tropically per coordinate as:

 z1i = A+(i,:)x + c1(i) − A−(i,:)x = (c1(i) ⊙ x^{A+(i,:)}) ⊘ x^{A−(i,:)} = H1i ⊘ Q1i. (9)
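Since H1i is a single tropical monomial, it evaluates to the affine function c1(i) + A+(i,:)x, and the tropical quotient in (9) is an ordinary difference. A quick numeric check of this representation (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))   # first-layer weights
c1 = rng.standard_normal(4)       # first-layer bias
x = rng.standard_normal(3)

A_pos, A_neg = np.maximum(A, 0), np.maximum(-A, 0)
H1 = c1 + A_pos @ x   # c1(i) ⊙ x^{A+(i,:)} evaluates to c1(i) + A+(i,:)x
Q1 = A_neg @ x        # x^{A-(i,:)} evaluates to A-(i,:)x
assert np.allclose(H1 - Q1, A @ x + c1)
```

The assertion confirms, coordinate by coordinate, that the tropical rational H1i ⊘ Q1i reproduces the biased affine output.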

To construct the dual subdivision for each of the tropical polynomials H1i and Q1i, one first needs to construct the tropical newton polytope in R^{n+1}, as defined in Definition 5. Since both H1i and Q1i are tropical polynomials with a single monomial, their newton polytopes are the points (A+(i,:), c1(i)) and (A−(i,:), 0) in R^{n+1}, respectively. To construct the dual subdivision of each tropical polynomial, one needs to project the newton polytope to R^n through the operator π, which will again result in a dual subdivision identical to the one obtained had biases not been introduced.

As for the output of the ReLU layer, note that for an input x, it can be represented tropically per coordinate as follows

 z2i = max(z1i, 0) = max(H1i − Q1i, 0) = max(H1i, Q1i) − Q1i = (H1i ⊕ Q1i) ⊘ Q1i = H2i ⊘ Q2i.

Following Fact 4, the newton polytope of H2i is a line segment, i.e. ConvexHull((A+(i,:), c1(i)), (A−(i,:), 0)), with end points (A+(i,:), c1(i)) and (A−(i,:), 0). Constructing the dual subdivision by projecting both end points to R^n recovers a dual subdivision identical to that of a bias-free network. Similarly for Q2i, which is a point in R^{n+1} with coordinate (A−(i,:), 0); applying the projection π to construct the dual subdivision recovers the point A−(i,:) in R^n.

Lastly, the output of the second affine layer per coordinate can be expressed as:

 z3i = B(i,:)z2 + c2(i) = (B+(i,:) − B−(i,:))(H2i − Q2i) + c2(i)
  = (B+(i,:)H2i + B−(i,:)Q2i + c2(i)) − (B−(i,:)H2i + B+(i,:)Q2i)
  = H3i ⊘ Q3i.

Following Facts 2 and 3, the newton polytope of H3i is the Minkowski sum of the newton polytopes of H2i and Q2i, weighted by B+(i,:) and B−(i,:) respectively, just as before, but in R^{n+1}. Note that the first term, i.e.