On the Margin Theory of Feedforward Neural Networks
Abstract
Past works have shown that, somewhat surprisingly, overparametrization can help generalization in neural networks. Towards explaining this phenomenon, we adopt a margin-based perspective. We establish: 1) for multi-layer feedforward relu networks, the global minimizer of a weakly-regularized cross-entropy loss has the maximum normalized margin among all networks, 2) as a result, increasing the overparametrization improves the normalized margin and generalization error bounds for two-layer networks. In particular, an infinite-size neural network enjoys the best generalization guarantees. The typical infinite feature methods are kernel methods; we compare the neural net margin with that of kernel methods and construct natural instances where kernel methods have much weaker generalization guarantees. We validate this gap between the two approaches empirically. Finally, this infinite-neuron viewpoint is also fruitful for analyzing optimization. We show that a perturbed gradient flow on infinite-size networks finds a global optimizer in polynomial time.
1 Introduction
In deep learning, overparametrization refers to the widely adopted technique of using more parameters than necessary (Krizhevsky et al., 2012; Livni et al., 2014). Both computationally and statistically, overparametrization is crucial for learning neural nets. Controlled experiments demonstrate that overparametrization eases optimization by smoothing the non-convex loss surface (Livni et al., 2014; Sagun et al., 2017). Statistically, increasing model size without any regularization still improves generalization even after the model interpolates the data perfectly (Neyshabur et al., 2017b). This is surprising given the conventional wisdom on the trade-off between model capacity and generalization.
In the absence of an explicit regularizer, algorithmic regularization is likely the key contributor to good generalization. Recent works have shown that gradient descent finds the minimum norm solution fitting the data for problems including logistic regression, linearized neural networks, and matrix factorization (Soudry et al., 2018; Gunasekar et al., 2018b; Li et al., 2018; Gunasekar et al., 2018a; Ji & Telgarsky, 2018). Many of these proofs require a delicate analysis of the algorithm’s dynamics, and some are not fully rigorous due to assumptions on the iterates. To the best of our knowledge, it is an open question to prove analogous results for even two-layer relu networks. (For example, the technique of Li et al. (2018) on two-layer neural nets with quadratic activations still falls within the realm of linear algebraic tools, which apparently do not suffice for other activations.)
We propose a different route towards understanding generalization: making the regularization explicit. The motivations are: 1) with an explicit regularizer, we can analyze generalization without fully understanding optimization; 2) it is unknown whether gradient descent provides additional implicit regularization beyond what regularization already offers; 3) on the other hand, with a sufficiently weak regularizer, we can prove stronger results that apply to multilayer neural nets with relu activations. Additionally, explicit regularization is perhaps more relevant because regularization is typically used in practice.
Concretely, we add a norm-based regularizer to the cross-entropy loss of a multi-layer feedforward neural network with relu activations. We show that the global minimizer of the regularized objective achieves the maximum normalized margin among all the models with the same architecture, if the regularizer is sufficiently weak (Theorem 1). Informally, for models with norm 1 that perfectly classify the data, the margin is the smallest difference across all datapoints between the classifier score for the true label and the next best score. We are interested in normalized margin because its inverse bounds the generalization error (see recent work (Bartlett et al., 2017; Neyshabur et al., 2017a, 2018) and our Theorem 1). Our work explains why optimizing the training loss can lead to parameters with a large margin and thus, better generalization error.
At first glance, it might seem counterintuitive that decreasing the regularizer is the right approach. At a high level, we show that the regularizer only serves as a tiebreaker to steer the model towards choosing the largest normalized margin. Our proofs are simple, oblivious to the optimization procedure, and apply to any norm-based regularizer. We also show that an exact global minimum is unnecessary: if we approximate the minimum loss within a constant factor, we obtain the max-margin within a constant factor (Theorem 2).
We further study the margin of two-layer networks: let $\gamma^{\star,m}$ be the max normalized margin of a neural net with $m$ hidden units (formally defined in Section 3.1). Let $\gamma^{\star,\infty}$ be the largest possible margin of an infinite two-layer network. We will show three properties of the margins:

In Theorem 2, we show that the optimal normalized margin of two-layer networks is non-decreasing as the width of the architecture grows, so the generalization error bound only improves with a wider network. Thus, even if the dataset is already separable, it could still be useful to increase the width to achieve larger margin and better generalization. More formally, let $n$ be the number of training examples and write $\gamma^{\star,m}$ for the max normalized margin at width $m$. We additionally attain the maximum possible margin, that of an infinite-width network $\gamma^{\star,\infty}$, after overparameterizing with $n+1$ neurons: $\gamma^{\star,1} \le \gamma^{\star,2} \le \cdots \le \gamma^{\star,n+1} = \gamma^{\star,\infty}$.

We compare the neural net margin to the standard margin for the kernel SVM on the same features. We design a simple data distribution (Figure 1) where neural net margin is large but the kernel margin is small. This translates to an factor gap between the generalization error bounds for the two approaches and demonstrates the power of neural nets compared to kernel methods. We experimentally confirm that a gap does indeed exist.
In the context of bullet 2, our work is closely related to that of Rosset et al. (2007) and Neyshabur et al. (2014), who show that optimizing the loss over the parameters of a two-layer relu network is equivalent to optimizing the loss of a “convex neural net” parametrized by a distribution over hidden units. We go one step further and connect the weakly regularized training loss to the SVM.
We will also adopt this view of infinite-size neural networks to study how overparametrization helps optimization. Prior works (Mei et al., 2018; Chizat & Bach, 2018; Sirignano & Spiliopoulos, 2018) show that gradient descent on two-layer networks becomes Wasserstein gradient flow over parameter distributions in the limit of infinite neurons. For this setting, we prove that perturbed Wasserstein gradient flow finds a global optimizer in polynomial time.
Finally, we empirically validate several of the claims made in this paper. First, we train a two-layer network on a one-dimensional classification task that is simple to visualize. In one dimension, it is possible to brute-force approximate the maximum neural network margin, and we show that training with a progressively smaller regularizer results in convergence to this margin. Second, we compare the generalization performance of neural networks and kernel methods and confirm that neural networks do achieve better generalization, as our theory predicts.
1.1 Additional Related Work
Zhang et al. (2016) and Neyshabur et al. (2017b) show that neural network generalization defies conventional explanations and requires new ones. One proposed explanation is the inductive bias of the training algorithm. Recent papers (Hardt et al., 2015; Brutzkus et al., 2017; Chaudhari et al., 2016) study inductive bias through training time and sharpness of local minima. Neyshabur et al. (2015a) propose a new steepest descent algorithm in a geometry invariant to weight rescaling and show that this improves generalization. Morcos et al. (2018) relate generalization in deep nets to the number of “directions” in the neurons. Other papers (Gunasekar et al., 2017; Soudry et al., 2018; Nacson et al., 2018; Gunasekar et al., 2018b; Li et al., 2018; Gunasekar et al., 2018a) study implicit regularization towards a specific solution. Ma et al. (2017) show that implicit regularization can help gradient descent avoid overshooting optima. Rosset et al. (2004a, b) study logistic regression with a weak regularization and show convergence to the max margin solution. We adopt their techniques and extend their results.
Recent works have also derived tighter Rademacher complexity bounds for deep neural networks (Neyshabur et al., 2015b; Bartlett et al., 2017; Neyshabur et al., 2017a; Golowich et al., 2017) and new compression-based generalization properties (Arora et al., 2018b). Dziugaite & Roy (2017) manage to compute non-vacuous generalization bounds from PAC-Bayes bounds. Neyshabur et al. (2018) investigate the Rademacher complexity of two-layer networks. Liang & Rakhlin (2018) and Belkin et al. (2018) study the generalization of kernel methods.
On the optimization side, Soudry & Carmon (2016) explain why overparametrization can remove bad local minima. Safran & Shamir (2016) show that overparametrization can improve the quality of the random initialization. Haeffele & Vidal (2015), Nguyen & Hein (2017), and Venturi et al. (2018) show that for sufficiently overparametrized networks, all local minima are global, but do not show how to find these minima via gradient descent. Du & Lee (2018) show that for two-layer networks with quadratic activations, all second-order stationary points are global minimizers. Arora et al. (2018a) interpret overparametrization as a means of implicit acceleration during optimization. Mei et al. (2018), Chizat & Bach (2018), and Sirignano & Spiliopoulos (2018) take a distributional view of overparametrized networks. Chizat & Bach (2018) show that Wasserstein gradient flow converges to global optimizers under structural assumptions. We extend this to a polynomial-time result.
1.2 Notation
Let denote the set of real numbers. We will use to indicate a general norm, with denoting the norms on finite dimensional vectors, respectively, and denoting the Frobenius norm on a matrix. In general, we use on top of a symbol to denote a unit vector: when applicable, , where the norm will be clear from context. Let be the unit sphere in dimensions. Let be the space of functions on for which the th power of the absolute value is Lebesgue integrable. For , we overload notation and write . Additionally, for and or , we can define . Furthermore, we will use .
Throughout this paper, we reserve the symbol to denote the collection of datapoints (as a matrix), and to denote labels. We use to denote the dimension of our data. We often use to denote the parameters of a prediction function , and to denote the prediction of on datapoint .
We will use the notation $\lesssim$, $\gtrsim$ to mean less than or greater than up to a universal constant, respectively. Unless stated otherwise, we use $O$, $\Omega$ as placeholders for some universal constant in upper and lower bounds, respectively. We will use poly to denote some universal constant-degree polynomial in the arguments.
2 Weak Regularizer Guarantees Max Margin Solutions
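In this section, we will show that when we add a weak regularizer to cross-entropy loss with a positive-homogeneous prediction function, the normalized margin of the optimum converges to some max-margin solution. As a concrete example, feedforward relu networks are positive-homogeneous.
Let $l$ be the number of labels, so the $i$-th example has label $y_i \in [l]$. We work with a family of prediction functions $f(\cdot;\Theta) : \mathbb{R}^d \to \mathbb{R}^l$ that are $a$-positive-homogeneous in their parameters for some $a > 0$: $f(x; c\Theta) = c^a f(x;\Theta)$ for all $c > 0$. We additionally require that $f$ is continuous in $\Theta$. For some general norm $\|\cdot\|$, we study the $\lambda$-regularized cross-entropy loss $L_\lambda$, defined as
$$L_\lambda(\Theta) := \frac{1}{n}\sum_{i=1}^{n} -\log \frac{\exp(f_{y_i}(x_i;\Theta))}{\sum_{j=1}^{l} \exp(f_j(x_i;\Theta))} + \lambda \|\Theta\|^r \tag{2.1}$$
for fixed $r > 0$. Let $\Theta_\lambda \in \arg\min_\Theta L_\lambda(\Theta)$. We define the normalized margin of $\Theta_\lambda$, with $\bar{\Theta}_\lambda := \Theta_\lambda / \|\Theta_\lambda\|$, as
$$\gamma_\lambda := \min_i \Big( f_{y_i}(x_i;\bar{\Theta}_\lambda) - \max_{j \ne y_i} f_j(x_i;\bar{\Theta}_\lambda) \Big). \tag{2.2}$$
Define the max normalized margin as
$$\gamma^\star := \max_{\|\Theta\| \le 1} \Big[ \min_i \Big( f_{y_i}(x_i;\Theta) - \max_{j \ne y_i} f_j(x_i;\Theta) \Big) \Big]$$
and let $\Theta^\star$ be a parameter achieving this maximum. We show that with sufficiently small regularization level $\lambda$, the normalized margin $\gamma_\lambda$ approaches the maximum margin $\gamma^\star$. Our theorem and proof are inspired by the result of Rosset et al. (2004a, b), who analyze the special case when $f$ is a linear predictor. In contrast, our result can be applied to nonlinear $f$ as long as $f$ is homogeneous.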
Theorem 1.
Assume the training data is separable by a network with an optimal normalized margin $\gamma^\star > 0$. Then, the normalized margin of the global optimum of the weakly-regularized objective (equation 2.1) converges to $\gamma^\star$ as the strength of the regularizer goes to zero. Mathematically, let $\gamma_\lambda$ be defined in equation 2.2. Then
$$\gamma_\lambda \to \gamma^\star \quad \text{as } \lambda \to 0.$$
An intuitive explanation for our result is as follows: because of the homogeneity, the loss $L_\lambda(\Theta_\lambda)$ roughly satisfies the following (for small $\lambda$, and ignoring problem parameters such as $n$):
$$L_\lambda(\Theta_\lambda) \approx \exp\big(-\|\Theta_\lambda\|^a \gamma_\lambda\big) + \lambda \|\Theta_\lambda\|^r.$$
Thus, the loss focuses on choosing parameters with larger margin, and the regularization term biases the loss to select parameters with a smaller norm. The full proof of the theorem is deferred to Section A.1.
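As a small numerical illustration of this tiebreaker effect, the sketch below uses the simplest homogeneous predictor, a linear one (the special case analyzed by Rosset et al.), with logistic loss and a squared Euclidean regularizer. The two-point dataset, the damped Newton solver, and all function names are assumptions made for this example and are not taken from the paper. For this dataset the max normalized margin can be computed by hand as $2/\sqrt{5}$, and the normalized margin of the regularized minimizer should move towards it as $\lambda$ shrinks.

```python
import numpy as np

def fit_regularized_logistic(X, y, lam, iters=200, tol=1e-13):
    """Minimize mean(log(1 + exp(-y * Xw))) + lam * ||w||^2 with damped Newton.

    Illustrative solver only; any accurate convex optimizer would do.
    """
    n, d = X.shape
    Z = X * y[:, None]            # rows are y_i * x_i, so margins are Z @ w
    w = np.zeros(d)

    def objective(v):
        return np.mean(np.logaddexp(0.0, -(Z @ v))) + lam * (v @ v)

    for _ in range(iters):
        m = Z @ w
        s = np.exp(-np.logaddexp(0.0, m))          # sigmoid(-m), stable for large m
        grad = -(Z.T @ s) / n + 2.0 * lam * w
        if np.linalg.norm(grad) < tol:
            break
        hess = (Z.T * (s * (1.0 - s))) @ Z / n + 2.0 * lam * np.eye(d)
        step = np.linalg.solve(hess, grad)
        t, f0 = 1.0, objective(w)
        # Backtrack until the objective does not increase.
        while objective(w - t * step) > f0 and t > 1e-12:
            t *= 0.5
        w = w - t * step
    return w

def normalized_margin(X, y, w):
    """Margin of the unit-norm rescaling of w (the predictor is 1-homogeneous)."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)

# Hypothetical toy data: both points are labeled +1.
X = np.array([[2.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0])
max_margin = 2.0 / np.sqrt(5.0)   # maximizes min(2*w1, w2) over the unit circle

lams = [1.0, 1e-2, 1e-4, 1e-8]
margins = [normalized_margin(X, y, fit_regularized_logistic(X, y, lam)) for lam in lams]
for lam, g in zip(lams, margins):
    print(f"lambda={lam:g}  normalized margin={g:.4f}  (max margin {max_margin:.4f})")
```

The convergence is slow in $\lambda$ (the gap shrinks roughly like an inverse logarithm), which is consistent with the norm of the minimizer growing only logarithmically as the regularizer weakens.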
We can also provide an analogue of Theorem 1 for the binary classification setting. For this setting, our prediction is now a single real output and we train using logistic loss. We provide formal definitions and results in Section A.2. Our theory for two-layer neural networks (see Section 3) is based in this setting.
2.1 Optimization Accuracy
Since the regularized loss is typically hard to optimize exactly for neural nets, it would be ideal to relax the condition that the parameters minimize it exactly. Thus, we ask, how accurately do we need to optimize the loss to obtain a margin that approximates the max margin up to a constant? The following theorem shows that it suffices to find parameters achieving a constant factor multiplicative approximation of the minimum loss, where the regularization strength is some sufficiently small polynomial in the problem parameters. Though our theorem is stated for the general multi-class setting, our result applies for binary classification as well. We provide the proof in Section A.3.
Theorem 2.
In the setting of Theorem 1, suppose that we choose for sufficiently large (that only depends on ). Let denote a 2approximate
.
3 Margins of Overparameterized Twolayer Homogeneous Neural Nets
In Section 2 we showed that a weakly-regularized logistic loss leads to the maximum normalized margin. In this section, we analyze the properties of the max-margin of neural nets more closely. We will contrast neural networks with kernel methods, for which margins have already been extensively studied. Towards a first-cut understanding, we focus on two-layer networks for binary classification.
First, in Section 3.1 we provide a bound stating that the generalization error is roughly linear in the inverse of the margin, establishing that a larger margin implies better generalization. In Section 3.2, we show that the maximum normalized margin is nondecreasing with the hidden layer size and stays constant as soon as there are more hidden units than data points. This suggests that increasing the size of the network improves the generalization of the solution.
Second, in Section 3.3, we draw an analogy to classical kernel methods by proving that the maximum normalized margin of an overparameterized neural net is equal to half the maximum possible normalized margin of linear functionals on a lifted feature space. In other words, we establish an equivalence between neural networks and the $\ell_1$-norm SVM (Zhu et al., 2004) on the lifted features. These features are constructed by applying the activation function on all possible hidden layer weights.
Third, continuing this analogy, we will compare the generalization power of a two-layer neural network to that of a kernel method on the lifted space. This kernel method corresponds to fixing random weights for the hidden layer and solving an $\ell_2$-norm max-margin problem on the top layer weights. We demonstrate instances where two-layer neural networks give better generalization error guarantees than the kernel method.
3.1 Setup and Marginbased Generalization Error
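In the rest of the paper, we work with two-layer neural networks with a single output for binary classification. We use $m$ to denote the number of hidden units, $u_1, \dots, u_m \in \mathbb{R}^d$ for the weight vectors on the first layer, and $w_1, \dots, w_m \in \mathbb{R}$ for the weights on the second layer. We let $U = (u_1, \dots, u_m) \in \mathbb{R}^{d \times m}$, and we use $\Theta = (U, w_1, \dots, w_m)$ to denote the collection of all the parameters. We assume in this section that the activation $\phi$ is 1-homogeneous and 1-Lipschitz. The network thus computes a single score
$$f(x;\Theta) := \sum_{j=1}^{m} w_j \, \phi(u_j^\top x).$$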
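We consider $\ell_2$ regularization from here on. The regularized logistic loss of the architecture with $m$ hidden units is therefore
$$L_{\lambda,m}(\Theta) := \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + \exp(-y_i f(x_i;\Theta))\big) + \lambda \|\Theta\|_2^2 \tag{3.1}$$
where $\|\Theta\|_2$ denotes the Euclidean norm of all the parameters in $\Theta$. We note that $f$ and the regularizer are both 2-homogeneous in $\Theta$, so the results of Section 2 apply to $L_{\lambda,m}$.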
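The 2-homogeneity noted above is easy to check numerically. The following sketch assumes the two-layer score takes the form $\sum_j w_j \,\mathrm{relu}(u_j^\top x)$ with relu as the 1-homogeneous activation; the function names, the random data, and the row-wise data layout (one datapoint per row, unlike the column convention used in the text) are choices made for this illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_layer_score(X, U, w):
    """Score sum_j w_j * relu(u_j^T x) for every row x of X.

    X: (n, d) datapoints as rows, U: (m, d) first-layer weights, w: (m,) second layer.
    """
    return relu(X @ U.T) @ w

def normalized_margin(X, y, U, w):
    """min_i y_i * f(x_i; Theta / ||Theta||_2), with Theta = (U, w) flattened."""
    norm = np.sqrt(np.sum(U ** 2) + np.sum(w ** 2))
    return np.min(y * two_layer_score(X, U / norm, w / norm))

# Hypothetical random instance, used only to exercise the identities.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.sign(rng.normal(size=5))
U = rng.normal(size=(4, 3))
w = rng.normal(size=4)

c = 3.0
scaled = two_layer_score(X, c * U, c * w)      # equals c**2 * f(x; Theta)
print(np.allclose(scaled, c ** 2 * two_layer_score(X, U, w)))
```

Because scaling all parameters by $c$ scales every score by $c^2$, the normalized margin depends only on the direction of the parameter vector, which is why it is the right quantity to compare across solutions of different norm.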
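Following our conventions from Section 2, we denote the optimizer of $L_{\lambda,m}$ by $\Theta_{\lambda,m}$, the normalized margin of $\Theta_{\lambda,m}$ by $\gamma_{\lambda,m}$, the max-margin solution by $\Theta^{\star,m}$, and the max-margin by $\gamma^{\star,m}$. We emphasize the size of the network in our notation. Since our classifier now predicts a single real value, we need to redefine
$$\gamma_{\lambda,m} := \min_i \, y_i f(x_i;\bar{\Theta}_{\lambda,m}), \qquad \gamma^{\star,m} := \max_{\|\Theta\|_2 \le 1} \min_i \, y_i f(x_i;\Theta).$$
When the data is not separable by an $m$-unit neural net, $\gamma^{\star,m}$ is zero by definition.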
Recall that denotes the matrix with all the data points as columns, and denotes the labels. We sample and i.i.d. from the data generating distribution , which is supported on . We can define the population 01 loss and the training 01 loss of the network as
We will let be the average norm squared of the data and be an upper bound on the norm of a single datapoint. The following theorem shows that the generalization error only depends on the parameters through the inverse of the margin on the training data. We provide a proof in Section C.1.
Theorem 1.
Suppose is 1Lipschitz and 1homogeneous. Then for any that separates the data with margin , with probability at least over the draw of ,
(3.2) 
where . Note that is typically small, and thus the above bound mainly scales with . As a corollary, with probability ,
(3.3) 
Above we implicitly assume , since otherwise the right hand side of the bound is vacuous.
One consequence of the above theorem and Theorem 2 is that if is polynomially small in and , we only need to optimize up to a constant multiplicative factor to obtain parameters with generalization bounds roughly as good as those for .
3.2 The max margin is nondecreasing in the hidden layer size
Now we show that the maximum normalized margin is nondecreasing with the hidden layer size and stays constant once we have more hidden units than examples.
Theorem 2.
In the setting of Section 3.1, recall that $\gamma^{\star,m}$ denotes the max normalized margin of a two-layer neural network with hidden layer size $m$, and let $\gamma^{\star,\infty}$ denote the largest possible margin of an infinite-width network. Then,
$$\gamma^{\star,1} \le \gamma^{\star,2} \le \cdots \le \gamma^{\star,n+1} = \gamma^{\star,\infty}. \tag{3.4}$$
We note that $\gamma^{\star,\infty}$ will be positive when the activation is sufficiently powerful, such as relu or sigmoid, and the data points are not repetitive, so the neural network can fit any function of the data. We prove Theorem 2 in Section B. Theorem 2 can explain why additional overparametrization has been observed to improve generalization in two-layer networks (Neyshabur et al., 2017b). Our margin does not decrease with a larger network size, and therefore Theorem 1 gives a better generalization bound. We precisely characterize the value of $\gamma^{\star,\infty}$ in the following section.
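The monotonicity half of this statement has a one-line justification: any network of a given width can be embedded in a wider one by appending a hidden unit with all-zero weights, which changes neither the scores nor the parameter norm. The sketch below checks this numerically. It assumes a relu two-layer score of the form $\sum_j w_j \,\mathrm{relu}(u_j^\top x)$; the helper names and random data are hypothetical.

```python
import numpy as np

def normalized_margin(X, y, U, w):
    """min_i y_i f(x_i; Theta / ||Theta||_2) for f = sum_j w_j relu(u_j^T x)."""
    norm = np.sqrt(np.sum(U ** 2) + np.sum(w ** 2))
    scores = np.maximum(X @ (U / norm).T, 0.0) @ (w / norm)
    return np.min(y * scores)

def pad_with_zero_neuron(U, w):
    """Return a network with one extra hidden unit whose weights are all zero."""
    return np.vstack([U, np.zeros((1, U.shape[1]))]), np.append(w, 0.0)

# Hypothetical random instance with 4 hidden units and 6 datapoints (rows of X).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = np.sign(rng.normal(size=6))
U = rng.normal(size=(4, 3))
w = rng.normal(size=4)

U_wide, w_wide = pad_with_zero_neuron(U, w)
margin_m = normalized_margin(X, y, U, w)
margin_m_plus_1 = normalized_margin(X, y, U_wide, w_wide)
print(margin_m, margin_m_plus_1)   # identical: width m+1 can always match width m
```

Since every margin attainable at one width is attainable at the next, the maximum over the wider class cannot be smaller. The claim that the margin stops improving beyond $n+1$ units is the nontrivial part and is not demonstrated by this check.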
3.3 The max margin of neural nets is equivalent to SVM in lifted space
We link infinite-size neural networks to the SVM over a lifted space, defined via a lifting function mapping data to an infinite feature vector:
(3.5) 
We look at the margin of linear functionals corresponding to . The $\ell_1$-norm SVM over the lifted feature solves for the maximum margin:
(3.6) 
where we rely on the inner product and $\ell_1$-norm defined in Section 1.2. A priori, it is unclear how to optimize this since the kernel trick does not work for the $\ell_1$ norm. Here we will show that optimizing two-layer neural networks with weak regularization is equivalent to solving equation 3.6.
3.4 Comparison to kernel methods
We compare the SVM margin, attainable by a finite neural network, to the margin attainable via kernel methods. Following the setup of Section 3.3, we define the kernel problem over :
(3.8) 
where . (We scale by to make the lemma statement below cleaner.) First, can be used to obtain a standard upper bound on the generalization error of the kernel SVM. Following the notation of Section 3.1, we will let denote the 01 population classification error for the optimizer of equation 3.8.
Lemma 4.
The bound above follows from standard techniques (Bartlett & Mendelson, 2002), and we provide a full proof in Section C.1. We construct a data distribution for which this lemma does not give a good bound for kernel methods, but Theorem 1 does imply good generalization for twolayer networks.
Theorem 5.
There exists a data distribution such that the SVM with relu features has a good margin:
and with probability over the choice of i.i.d. samples from , obtains generalization error
where is typically a lower order term. Meanwhile, with high probability the SVM has a small margin:
and therefore the generalization upper bound from Lemma 4 is at least
Proof sketch for Theorem 5.
We base on the distribution of examples described below. Here is the ith standard basis vector and we use to represent the coordinate of (since the subscript is reserved to index training examples).
Figure 1 shows samples from when there are 3 dimensions. From the visualization, it is clear that there is no linear separator for . As Lemma 1 shows, a relu network with four neurons can fit this relatively complicated decision boundary. On the other hand, for kernel methods, we prove that the symmetries in induce cancellation in feature space. The following lemmas, proved in Section D.1, formalize this cancellation and show that it results in a small margin for kernel methods.
Lemma 6 (Margin upper bound tool).
In the setting of Theorem 5, we have
Lemma 7.
Combining these lemmas gives us the desired bound on .
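To make the "four neurons suffice" claim in the proof sketch concrete, here is a small check on an XOR-style configuration. The exact distribution used in Theorem 5 is not reproduced in this text, so the four points below are a hypothetical stand-in that shares the two properties the sketch relies on: the label-weighted points cancel, so no linear function through the origin separates them, and a four-unit relu network does.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical XOR-style data: label +1 when the two coordinates agree in sign.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Four hidden units computing |x1 + x2| - |x1 - x2|.
U = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
w = np.array([1.0, 1.0, -1.0, -1.0])

scores = relu(X @ U.T) @ w
signed = y * scores                       # every entry equals 2

# The network is 2-homogeneous, so normalizing Theta divides scores by ||Theta||^2.
theta_sq_norm = np.sum(U ** 2) + np.sum(w ** 2)
net_margin = signed.min() / theta_sq_norm

# Symmetry: sum_i y_i x_i = 0, so sum_i y_i <v, x_i> = 0 for every direction v,
# and the y_i <v, x_i> cannot all be positive.
label_weighted_mean = (y[:, None] * X).sum(axis=0)

print(signed, net_margin, label_weighted_mean)
```

The cancellation in the last quantity is a toy version of the symmetry argument: for a linear predictor the signed scores sum to zero, whereas the relu features break the symmetry and leave a strictly positive normalized margin.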
Gap in regression setting:
We are able to prove an even larger gap between neural networks and kernel methods in the regression setting where we wish to interpolate continuous labels. Analogously to the classification setting, optimizing a regularized squared error loss on neural networks is equivalent to solving a minimum $\ell_1$-norm regression problem (see Theorem 3). Furthermore, kernel methods correspond to a minimum $\ell_2$-norm problem. We construct distributions where the $\ell_1$-norm solution will have a generalization error bound of , whereas the $\ell_2$-norm solution will have a generalization error bound that is and thus vacuous. In Section D.2, we define the $\ell_1$-norm and $\ell_2$-norm regression problems. In Theorem 6 we formalize our construction.
4 Perturbed Wasserstein gradient flow finds global optimizers in polynomial time
In the prior section, we studied the limiting behavior of the generalization of a twolayer network as its width goes to infinity. In this section, we will now study the limiting behavior of the optimization algorithm, gradient descent. Prior work (Mei et al., 2018; Chizat & Bach, 2018) has shown that as the hidden layer size grows to infinity, gradient descent for a finite neural network approaches the Wasserstein gradient flow over distributions of hidden units (defined in equation 4.1). Chizat & Bach (2018) also prove that Wasserstein gradient flow converges to a global optimizer in this setting but do not specify a convergence rate.
We show that a perturbed version of Wasserstein gradient flow converges in polynomial time. The informal takeaway of this section is that a perturbed version of gradient descent converges in polynomial time on infinite-size neural networks (for the right notion of infinite size).
Formally, we optimize the following functional over distributions on :
where , , and . In this work, we consider 2homogeneous and . We will additionally require that is nonnegative and is positive on the unit sphere. Finally, we need standard regularity assumptions on , and :
Assumption 1 (Regularity conditions on , , ).
and are differentiable as well as upper bounded and Lipschitz on the unit sphere. is Lipschitz and its Hessian has bounded operator norm.
We provide more details on the specific parameters (for boundedness, Lipschitzness, etc.) in Section E.1. We note that relu networks satisfy every condition but differentiability of .
Example 2 (Logistic loss for neural networks).
We interpret as a distribution over the parameters of the network. Let and for . In this case, is a distributional neural network that computes an output for each of the training examples (like a standard neural network, it also computes a weighted sum over hidden units). We can compute the distributional version of the regularized logistic loss in equation 3.1 by setting and .
We will define with and . Informally, is the gradient of with respect to , and is the induced velocity field. For the standard Wasserstein gradient flow dynamics, evolves according to
(4.1) 
where denotes the divergence of a vector field. For neural networks, these dynamics formally define continuous-time gradient descent when the hidden layer has infinite size (see Theorem 2.6 of Chizat & Bach (2018), for instance).
We propose the following modification of the Wasserstein gradient flow dynamics:
(4.2) 
where is the uniform distribution on . In our perturbed dynamics, we add uniform noise over . For infinitesize neural networks, one can informally interpret this as reinitializing a very small fraction of the neurons at every step of gradient descent. We prove convergence to a global optimizer in time polynomial in , and the regularity parameters.
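The informal reading of the perturbed dynamics (gradient descent plus re-initialization of a small fraction of neurons) can be sketched with a finite set of particles. Everything below is an assumption made for illustration: the particle discretization, the relu unit, the logistic loss with a squared-norm penalty, the step size, and the re-initialization rate are not taken from the paper, and a finite particle system is only a heuristic stand-in for the infinite-size flow that the theorem analyzes.

```python
import numpy as np

def loss_and_velocity(theta, X, y, lam):
    """theta: (m, d+1) particles, each row (u_j, w_j) with unit output w_j * relu(u_j^T x).

    Returns the mean-field loss and, per particle, the gradient of the first
    variation (m times the ordinary Euclidean gradient of the loss).
    """
    U, w = theta[:, :-1], theta[:, -1]
    pre = X @ U.T                               # (n, m) pre-activations
    act = np.maximum(pre, 0.0)
    f = act @ w / len(w)                        # average over particles
    loss = np.mean(np.logaddexp(0.0, -y * f)) + lam * np.mean(np.sum(theta ** 2, axis=1))
    g = -y * np.exp(-np.logaddexp(0.0, y * f)) / len(y)   # d(logistic risk)/d f_i
    grad_w = act.T @ g
    grad_U = ((pre > 0) * g[:, None]).T @ X * w[:, None]
    grad = np.hstack([grad_U, grad_w[:, None]]) + 2.0 * lam * theta
    return loss, grad

def perturbed_step(theta, X, y, lam, lr, reinit_frac, rng):
    """One gradient step, then resample a random subset of particles on the unit sphere."""
    _, grad = loss_and_velocity(theta, X, y, lam)
    theta = theta - lr * grad
    mask = rng.random(len(theta)) < reinit_frac
    fresh = rng.normal(size=(int(mask.sum()), theta.shape[1]))
    fresh /= np.linalg.norm(fresh, axis=1, keepdims=True)   # uniform on the sphere
    theta[mask] = fresh
    return theta, mask

# Hypothetical toy problem: XOR-like labels in two dimensions, 50 particles.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sign(X[:, 0] * X[:, 1])
theta0 = rng.normal(size=(50, 3))

theta, losses = theta0.copy(), []
for step in range(100):
    loss, _ = loss_and_velocity(theta, X, y, lam=1e-3)
    losses.append(loss)
    theta, _ = perturbed_step(theta, X, y, lam=1e-3, lr=0.1, reinit_frac=0.02, rng=rng)
print(losses[0], losses[-1])
```

The resampling step plays the role of the uniform noise term: it keeps some mass in every direction of the sphere, so a descent direction, if one exists, is never entirely unpopulated.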
Theorem 3 (Theorem 4 with regularity parameters omitted).
Suppose that and are 2homogeneous and the regularity conditions of Assumption 1 are satisfied. Also assume that from starting distribution , a solution to the dynamics in equation 4.2 exists. Define . Let be a desired error threshold and choose and , where the regularity parameters for , , and are hidden in the . Then, perturbed Wasserstein gradient flow converges to an approximate global minimum in time:
We provide a theorem statement that includes regularity parameters in Section E.1. We prove the theorem in Section E.2.
As a technical detail, Theorem 3 requires that a solution to the dynamics exists. We can remove this assumption by analyzing a discrete-time version of equation 4.2:
and additionally assuming and have Lipschitz gradients. In this setting, a polynomial time convergence result also holds. We state the result in Section E.3.
5 Simulations
We first verify the normalized margin convergence on a two-layer network with one-dimensional input. A single hidden unit computes the following: . We add regularization to , and and compare the resulting normalized margin to that of an approximate solution of the $\ell_1$ SVM problem with features for . Writing this feature vector is intractable, so we solve an approximate version by choosing 1000 evenly spaced values of . Our theory predicts that with decreasing regularization, the margin of the neural network converges to the SVM objective. In Figure 2, we plot this margin convergence and visualize the final networks and ground truth labels. The network margin approaches the ideal one as $\lambda \to 0$, and the visualization shows that the network and SVM functions are extremely similar.
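The discretized SVM baseline described above can be posed as a linear program. The sketch below is one way to do so, not the authors' code: it assumes features of the form $\mathrm{relu}(x\cos t + \sin t)$ on an evenly spaced grid of angles $t$ (a guess at the parametrization, since the exact feature map is not reproduced in this text), a hypothetical four-point dataset, and SciPy's `linprog` as the solver.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm_margin(Phi, y):
    """Solve max_t s.t. y_i * (Phi @ alpha)_i >= t for all i and ||alpha||_1 <= 1.

    Phi: (n, k) feature matrix, y: (n,) labels in {-1, +1}.
    Uses the standard split alpha = p - q with p, q >= 0.
    """
    n, k = Phi.shape
    Z = Phi * y[:, None]
    c = np.zeros(2 * k + 1)
    c[-1] = -1.0                                            # minimize -t
    A_margin = np.hstack([-Z, Z, np.ones((n, 1))])          # t - y_i Phi_i (p - q) <= 0
    A_norm = np.hstack([np.ones((1, 2 * k)), np.zeros((1, 1))])  # sum(p) + sum(q) <= 1
    A_ub = np.vstack([A_margin, A_norm])
    b_ub = np.concatenate([np.zeros(n), [1.0]])
    bounds = [(0, None)] * (2 * k) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    alpha = res.x[:k] - res.x[k:2 * k]
    return float(res.x[-1]), alpha

# Hypothetical 1-D dataset: positive labels far from the origin, negative near it.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1.0, -1.0, -1.0, 1.0])

# Assumed discretization: 1000 evenly spaced angles for (weight, bias) on the unit circle.
angles = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
Phi = np.maximum(np.outer(x, np.cos(angles)) + np.sin(angles), 0.0)

margin_1d, alpha_1d = l1_svm_margin(Phi, y)
print(margin_1d, np.count_nonzero(np.abs(alpha_1d) > 1e-9))
```

An LP solver typically returns a sparse solution here, which matches the picture of a small number of active hidden units attaining the max margin.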
Next, we experiment on synthetic data in a higherdimensional setting. For classification and regression, we compare the generalization error and predicted generalization upper bounds
The plots in Figure 3 show that two-layer networks clearly outperform kernel methods in test error as grows. However, there seems to be looseness in the upper bounds for kernel methods: the kernel generalization bound appears to stay constant with (as predicted by our theory for regression), but the kernel test error decreases. There is also some variance in the neural network generalization bound for classification. This likely occurred because we did not tune the learning rate and training time, so the optimization failed to find the best margin.


In Section F, we include additional experiments training modified WideResNet architectures on CIFAR-10 and CIFAR-100. Although ResNet is not homogeneous, we still report interesting increases in generalization performance from annealing the weight decay during training, versus staying at a fixed decay rate.
6 Conclusion
We have made the case that maximizing margin is one of the inductive biases of relu networks with cross-entropy loss. We show that we can obtain a maximum normalized margin by training with a weak regularizer. We also prove that larger normalized margin indicates better generalization for two-layer nets. Our work leaves open the question of how the normalized margin relates to generalization in much deeper neural networks. This is a fascinating theoretical and empirical question for future work. On the optimization side, we make progress towards understanding overparametrized gradient descent by analyzing infinite-size neural networks. A natural direction for future work is to apply our theory to optimize the margin of finite-sized neural networks.
Acknowledgments
JDL acknowledges support of the ARO under MURI Award W911NF1110303. This is part of the collaboration between US DOD, UK MOD and UK Engineering and Physical Research Council (EPSRC) under the Multidisciplinary University Research Initiative. We also thank Nati Srebro and Suriya Gunasekar for helpful discussions in the early stage of this work.
Appendix A Missing Proofs in Section 2
We first show that does indeed have a global minimizer.
Proof.
We will argue in the setting of Theorem 1 where is the multiclass cross entropy loss, because the logistic loss case is analogous. We first note that is continuous in because is continuous in and the term inside the logarithm is always positive. Next, define . Then we note that for , we must have . It follows that . However, there must be a value which attains , because is a compact set and is continuous. Thus, is attained by some . ∎
A.1 Missing Proofs for Multiclass Setting
Towards proving Theorem 1, we first show as we decrease , the norm of the solution grows.
Lemma 2.
In the setting of Theorem 1, as , we have .
To prove Theorem 1, we rely on the exponential scaling of the cross entropy: can be lower bounded roughly by , but also has an upper bound that scales with . By Lemma 2, we can take large so the gap vanishes. This proof technique is inspired by that of Rosset et al. (2004a).
Proof of Theorem 1.
For any and with ,
(by the homogeneity of )  
(A.1)  
(A.2) 
We can also apply in order to lower bound equation A.1 and obtain
(A.3) 
Applying equation A.2 with and , noting that , we have:
(A.4) 
Next we lower bound by applying equation A.3,
(A.5) 
Combining equation A.4 and equation A.5 with the fact that (by the global optimality of ), we have
Recall that by Lemma 2, as , we have . Therefore, . Thus, we can apply Taylor expansion to the equation above with respect to and . If , then we obtain
We claim this implies that . If not, we have , which implies that the equation above is violated with sufficiently large ( would suffice). By Lemma 2, as and therefore we get a contradiction.
Finally, we have by definition of . Hence, exists and equals . ∎
Now we fill in the proof of Lemma 2.
Proof of Lemma 2.
For the sake of contradiction, we assume that there exists $C > 0$ such that for any $\lambda_0 > 0$, there exists $0 < \lambda < \lambda_0$ with $\|\Theta_\lambda\| \le C$. We will determine the choice of $\lambda_0$ later and pick $\lambda$ such that $\|\Theta_\lambda\| \le C$. Then the logits (the prediction before softmax) are bounded in absolute value by some constant (that depends on $C$), and therefore the loss function for every example is bounded from below by some positive constant (depending on $C$ but not $\lambda$).
Let , we have that
(by the optimality of )  
Taking a sufficiently small , we obtain a contradiction and complete the proof. ∎
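A.2 Full Binary Classification Setting
For completeness, we state and prove our max-margin results for the setting where we fit binary labels $y_i \in \{-1, +1\}$ (as opposed to indices in $[l]$) and redefine $f(\cdot;\Theta)$ to assign a single real-valued score (as opposed to a score for each label). This lets us work with the simpler $\lambda$-regularized logistic loss:
$$L_\lambda(\Theta) := \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + \exp(-y_i f(x_i;\Theta))\big) + \lambda \|\Theta\|^r.$$
As before, let $\Theta_\lambda \in \arg\min_\Theta L_\lambda(\Theta)$, and define the normalized margin $\gamma_\lambda$ by $\gamma_\lambda := \min_i y_i f(x_i;\bar{\Theta}_\lambda)$. Define the maximum possible normalized margin
$$\gamma^\star := \max_{\|\Theta\| \le 1} \min_i \, y_i f(x_i;\Theta). \tag{A.6}$$
Theorem 3.
Assume $\gamma^\star > 0$ in the binary classification setting with logistic loss. Then as $\lambda \to 0$, $\gamma_\lambda \to \gamma^\star$.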
The proof follows via simple reduction to the multiclass case.
Proof of Theorem 3.
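We prove this theorem via reduction to the multi-class case with $l = 2$. Construct $\tilde{f} : \mathbb{R}^d \to \mathbb{R}^2$ with $\tilde{f}_1(x_i;\Theta) = -\tfrac{1}{2} f(x_i;\Theta)$ and $\tilde{f}_2(x_i;\Theta) = \tfrac{1}{2} f(x_i;\Theta)$. Define new labels $\tilde{y}_i = 1$ if $y_i = -1$ and $\tilde{y}_i = 2$ if $y_i = 1$. Now note that $\tilde{f}_{\tilde{y}_i}(x_i;\Theta) - \tilde{f}_{j \ne \tilde{y}_i}(x_i;\Theta) = y_i f(x_i;\Theta)$, so the multi-class margin for $\Theta$ under $\tilde{f}$ is the same as the binary margin for $\Theta$ under $f$. Furthermore, defining
$$\tilde{L}_\lambda(\Theta) := \frac{1}{n}\sum_{i=1}^{n} -\log \frac{\exp(\tilde{f}_{\tilde{y}_i}(x_i;\Theta))}{\sum_{j=1}^{2} \exp(\tilde{f}_j(x_i;\Theta))} + \lambda \|\Theta\|^r$$
we get that $\tilde{L}_\lambda(\Theta) = L_\lambda(\Theta)$, and in particular, $\tilde{L}_\lambda$ and $L_\lambda$ have the same set of minimizers. Therefore we can apply Theorem 1 for the multi-class setting and conclude $\gamma_\lambda \to \gamma^\star$ in the binary classification setting. ∎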
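The identity behind this reduction can be checked numerically. The sketch below assumes the two-class scores are taken to be $-f/2$ and $f/2$ (a standard choice that makes the score difference equal to $y f$); the function names are illustrative, and classes are indexed from 0 rather than 1 to follow Python conventions.

```python
import numpy as np

def logistic_loss(f, y):
    """Mean of log(1 + exp(-y_i f_i)) for labels y_i in {-1, +1}."""
    return np.mean(np.logaddexp(0.0, -y * f))

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy; labels are 0-based class indices."""
    shift = logits.max(axis=1)
    log_norm = shift + np.log(np.sum(np.exp(logits - shift[:, None]), axis=1))
    return np.mean(log_norm - logits[np.arange(len(labels)), labels])

# Hypothetical real-valued scores and binary labels.
rng = np.random.default_rng(0)
f = rng.normal(scale=3.0, size=8)
y = np.sign(rng.normal(size=8))

logits = np.stack([-0.5 * f, 0.5 * f], axis=1)   # two-class scores built from f
labels = (y > 0).astype(int)                     # y = -1 -> class 0, y = +1 -> class 1

binary_margins = y * f
rows = np.arange(len(f))
multiclass_margins = logits[rows, labels] - logits[rows, 1 - labels]

print(logistic_loss(f, y), cross_entropy(logits, labels))
```

Both the unregularized loss and the per-example margins agree, and the regularizer is untouched by the construction, so the two objectives share minimizers as claimed.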