On the Margin Theory of Feedforward Neural Networks

Abstract

Past works have shown that, somewhat surprisingly, over-parametrization can help generalization in neural networks. Towards explaining this phenomenon, we adopt a margin-based perspective. We establish: 1) for multi-layer feedforward relu networks, the global minimizer of a weakly-regularized cross-entropy loss has the maximum normalized margin among all networks, 2) as a result, increasing the over-parametrization improves the normalized margin and generalization error bounds for two-layer networks. In particular, an infinite-size neural network enjoys the best generalization guarantees. The typical infinite feature methods are kernel methods; we compare the neural net margin with that of kernel methods and construct natural instances where kernel methods have much weaker generalization guarantees. We validate this gap between the two approaches empirically. Finally, this infinite-neuron viewpoint is also fruitful for analyzing optimization. We show that a perturbed gradient flow on infinite-size networks finds a global optimizer in polynomial time.

1 Introduction

In deep learning, over-parametrization refers to the widely-adopted technique of using more parameters than necessary (Krizhevsky et al., 2012; Livni et al., 2014). Both computationally and statistically, over-parametrization is crucial for learning neural nets. Controlled experiments demonstrate that over-parametrization eases optimization by smoothing the non-convex loss surface (Livni et al., 2014; Sagun et al., 2017). Statistically, increasing model size without any regularization still improves generalization even after the model interpolates the data perfectly (Neyshabur et al., 2017b). This is surprising given the conventional wisdom on the trade-off between model capacity and generalization.

In the absence of an explicit regularizer, algorithmic regularization is likely the key contributor to good generalization. Recent works have shown that gradient descent finds the minimum norm solution fitting the data for problems including logistic regression, linearized neural networks, and matrix factorization (Soudry et al., 2018; Gunasekar et al., 2018b; Li et al., 2018; Gunasekar et al., 2018a; Ji & Telgarsky, 2018). Many of these proofs require a delicate analysis of the algorithm’s dynamics, and some are not fully rigorous due to assumptions on the iterates. To the best of our knowledge, it is an open question to prove analogous results for even two-layer relu networks. (For example, the technique of Li et al. (2018) on two-layer neural nets with quadratic activations still falls within the realm of linear algebraic tools, which apparently do not suffice for other activations.)

We propose a different route towards understanding generalization: making the regularization explicit. The motivations are: 1) with an explicit regularizer, we can analyze generalization without fully understanding optimization; 2) it is unknown whether gradient descent provides additional implicit regularization beyond what regularization already offers; 3) on the other hand, with a sufficiently weak regularizer, we can prove stronger results that apply to multi-layer neural nets with relu activations. Additionally, explicit regularization is perhaps more relevant because regularization is typically used in practice.

Concretely, we add a norm-based regularizer to the cross-entropy loss of a multi-layer feedforward neural network with relu activations. We show that the global minimizer of the regularized objective achieves the maximum normalized margin among all models with the same architecture, provided the regularizer is sufficiently weak (Theorem 1). Informally, for models with unit parameter norm that perfectly classify the data, the margin is the smallest difference across all datapoints between the classifier score for the true label and the next best score. We are interested in the normalized margin because its inverse bounds the generalization error (see recent work (Bartlett et al., 2017; Neyshabur et al., 2017a, 2018) and our Theorem 1). Our work explains why optimizing the training loss can lead to parameters with a large margin and thus better generalization error.

At first glance, it might seem counterintuitive that decreasing the regularizer is the right approach. At a high level, we show that the regularizer only serves as a tiebreaker that steers the model towards the largest normalized margin. Our proofs are simple, oblivious to the optimization procedure, and apply to any norm-based regularizer. We also show that an exact global minimum is unnecessary: if we approximate the minimum loss within a constant factor, we obtain the max-margin within a constant factor (Theorem 2).

We further study the margin of two-layer networks: let γ*_m be the max normalized margin of a neural net with m hidden units (formally defined in Section 3.1). Let γ*_∞ be the largest possible margin of an infinite-size two-layer network. We will show three properties of these margins:

  • In Theorem 2, we show that the optimal normalized margin γ*_m of two-layer networks is non-decreasing as the width m of the architecture grows, so the generalization error bound only improves with a wider network. Thus, even if the dataset is already separable, it could still be useful to increase the width to achieve a larger margin and better generalization. More formally, let n be the number of training examples. We additionally approach the maximum possible margin γ*_∞ after over-parameterizing: γ*_m = γ*_∞ for every m ≥ n.

  • The max-margin of infinite-size nets, γ*_∞, equals half the margin of the 1-norm SVM (Zhu et al., 2004) over the lifted feature space defined by the activation function applied to all possible hidden units. (See Theorem 3.)

  • We compare the neural net margin to the standard margin of the kernel SVM on the same features. We design a simple data distribution (Figure 1) where the neural net margin is large but the kernel margin is small. This translates to a large multiplicative gap between the generalization error bounds for the two approaches and demonstrates the power of neural nets compared to kernel methods. We experimentally confirm that a gap does indeed exist.

In the context of bullet 2, our work is closely related to that of Rosset et al. (2007) and Neyshabur et al. (2014), who show that optimizing the loss over the parameters of a two-layer relu network is equivalent to optimizing the loss of a “convex neural net” parametrized by a distribution over hidden units. We go one step further and connect the weakly regularized training loss to the SVM.

We will also adopt this view of infinite-size neural networks to study how over-parametrization helps optimization. Prior works (Mei et al., 2018; Chizat & Bach, 2018; Sirignano & Spiliopoulos, 2018) show that gradient descent on two-layer networks becomes Wasserstein gradient flow over parameter distributions in the limit of infinite neurons. For this setting, we prove that perturbed Wasserstein gradient flow finds a global optimizer in polynomial time.

Finally, we empirically validate several of the claims made in this paper. First, we train a two-layer network on a one-dimensional classification task that is simple to visualize. In one dimension, it is possible to approximate the maximum neural network margin by brute force, and we show that training with a progressively smaller regularizer results in convergence to this margin. Second, we compare the generalization performance of neural networks and kernel methods and confirm that neural networks do achieve better generalization, as our theory predicts.

1.1 Additional Related Work

Zhang et al. (2016) and Neyshabur et al. (2017b) show that neural network generalization defies conventional explanations and requires new ones. One proposed explanation is the inductive bias of the training algorithm. Recent papers (Hardt et al., 2015; Brutzkus et al., 2017; Chaudhari et al., 2016) study inductive bias through training time and sharpness of local minima. Neyshabur et al. (2015a) propose a new steepest descent algorithm in a geometry invariant to weight rescaling and show that this improves generalization. Morcos et al. (2018) relate generalization in deep nets to the number of “directions” in the neurons. Other papers (Gunasekar et al., 2017; Soudry et al., 2018; Nacson et al., 2018; Gunasekar et al., 2018b; Li et al., 2018; Gunasekar et al., 2018a) study implicit regularization towards a specific solution. Ma et al. (2017) show that implicit regularization can help gradient descent avoid overshooting optima. Rosset et al. (2004a, b) study logistic regression with a weak regularization and show convergence to the max margin solution. We adopt their techniques and extend their results.

Recent works have also derived tighter Rademacher complexity bounds for deep neural networks (Neyshabur et al., 2015b; Bartlett et al., 2017; Neyshabur et al., 2017a; Golowich et al., 2017) and new compression based generalization properties (Arora et al., 2018b). Dziugaite & Roy (2017) manage to compute non-vacuous generalization bounds from PAC-Bayes bounds. Neyshabur et al. (2018) investigate the Rademacher complexity of two-layer networks. Liang & Rakhlin (2018) and Belkin et al. (2018) study the generalization of kernel methods.

On the optimization side, Soudry & Carmon (2016) explain why over-parametrization can remove bad local minima. Safran & Shamir (2016) show that over-parametrization can improve the quality of the random initialization. Haeffele & Vidal (2015), Nguyen & Hein (2017), and Venturi et al. (2018) show that for sufficiently overparametrized networks, all local minima are global, but do not show how to find these minima via gradient descent. Du & Lee (2018) show that for two-layer networks with quadratic activations, all second-order stationary points are global minimizers. Arora et al. (2018a) interpret over-parametrization as a means of implicit acceleration during optimization. Mei et al. (2018), Chizat & Bach (2018), and Sirignano & Spiliopoulos (2018) take a distributional view of over-parametrized networks. Chizat & Bach (2018) show that Wasserstein gradient flow converges to global optimizers under structural assumptions. We extend this to a polynomial-time result.

1.2 Notation

Let ℝ denote the set of real numbers. We will use ‖·‖ to indicate a general norm, with ‖·‖_1, ‖·‖_2, ‖·‖_∞ denoting the ℓ1, ℓ2, and ℓ∞ norms on finite-dimensional vectors, respectively, and ‖·‖_F denoting the Frobenius norm on a matrix. In general, we use a bar on top of a symbol to denote a unit vector: when applicable, ū ≜ u/‖u‖, where the norm ‖·‖ will be clear from context. Let S^{d−1} be the unit sphere in d dimensions. Let L^p(S^{d−1}) be the space of functions on S^{d−1} for which the p-th power of the absolute value is Lebesgue integrable. For α, β ∈ L^2(S^{d−1}), we overload notation and write ⟨α, β⟩ ≜ ∫_{S^{d−1}} α(ū)β(ū) dū. Additionally, for α ∈ L^p(S^{d−1}) and p = 1 or 2, we can define ‖α‖_p ≜ (∫_{S^{d−1}} |α(ū)|^p dū)^{1/p}. Furthermore, we will use ‖α‖_∞ ≜ ess sup_{ū ∈ S^{d−1}} |α(ū)|.

Throughout this paper, we reserve the symbol X to denote the collection of datapoints (as a matrix whose columns are the datapoints), and Y to denote the labels. We use d to denote the dimension of our data. We often use Θ to denote the parameters of a prediction function f, and f(Θ; x) to denote the prediction of f on the datapoint x.

We will use the notation ≲ and ≳ to mean less than or greater than up to a universal constant, respectively. Unless stated otherwise, we use C and c as placeholders for universal constants in upper and lower bounds, respectively. We will use poly(·) to denote some universal constant-degree polynomial in its arguments.

2 Weak Regularizer Guarantees Max Margin Solutions

In this section, we will show that when we add a weak regularizer to cross-entropy loss with a positive-homogeneous prediction function, the normalized margin of the optimum converges to some max-margin solution. As a concrete example, feedforward relu networks are positive-homogeneous.

Let l be the number of labels, so the i-th example has label y_i ∈ [l] ≜ {1, …, l}. We work with a family of prediction functions x ↦ f(Θ; x) ∈ ℝ^l that are a-positive-homogeneous in their parameters Θ for some a > 0: f(cΘ; x) = c^a f(Θ; x) for all c > 0. We additionally require that f is continuous in Θ. For some general norm ‖·‖, we study the λ-regularized cross-entropy loss L_λ, defined as

L_λ(Θ) ≜ (1/n) Σ_{i=1}^n −log( exp(f_{y_i}(Θ; x_i)) / Σ_{j=1}^l exp(f_j(Θ; x_i)) ) + λ‖Θ‖^r   (2.1)

for some fixed r > 0. Let Θ_λ ∈ argmin_Θ L_λ(Θ). We define the normalized margin of Θ_λ as:

γ_λ ≜ min_i ( f_{y_i}(Θ̄_λ; x_i) − max_{j ≠ y_i} f_j(Θ̄_λ; x_i) ),  where Θ̄_λ ≜ Θ_λ/‖Θ_λ‖.   (2.2)

Define the ‖·‖-max normalized margin as

γ* ≜ max_{‖Θ‖ ≤ 1} min_i ( f_{y_i}(Θ; x_i) − max_{j ≠ y_i} f_j(Θ; x_i) ),

and let Θ* be a parameter achieving this maximum. We show that with a sufficiently small regularization level λ, the normalized margin γ_λ approaches the maximum margin γ*. Our theorem and proof are inspired by the result of Rosset et al. (2004a, b), who analyze the special case where f is a linear predictor. In contrast, our result can be applied to non-linear f as long as f is homogeneous.
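To make these definitions concrete, here is a small sketch (our own illustration, not code from the paper) that instantiates f as a two-layer relu network, which is 2-positive-homogeneous in its parameters, and evaluates the regularized loss of equation 2.1 and the normalized margin of equation 2.2. The variable names (W1, W2, lam, r) and the choice of architecture are ours.

```python
import numpy as np

# A minimal sketch (ours, not the paper's code): a two-layer relu network as a concrete
# 2-positive-homogeneous predictor f(Theta; x), together with the weakly regularized
# cross-entropy loss of equation 2.1 and the normalized margin of equation 2.2.
# Theta = (W1, W2); all names and shapes are illustrative assumptions.

def f(theta, x):
    """Multi-class scores; 2-positive-homogeneous in theta = (W1, W2)."""
    W1, W2 = theta                        # W1: (m, d), W2: (l, m)
    return W2 @ np.maximum(W1 @ x, 0.0)   # shape (l,)

def regularized_loss(theta, X, y, lam, r=2):
    """Equation 2.1: average cross-entropy plus lam * ||theta||^r."""
    W1, W2 = theta
    total = 0.0
    for xi, yi in zip(X.T, y):            # X stores datapoints as columns
        s = f(theta, xi)
        total += -s[yi] + np.log(np.sum(np.exp(s)))   # -log softmax_{y_i}
    norm = np.sqrt(np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return total / X.shape[1] + lam * norm ** r

def normalized_margin(theta, X, y):
    """Equation 2.2: margin of theta rescaled to unit Euclidean norm."""
    W1, W2 = theta
    norm = np.sqrt(np.sum(W1 ** 2) + np.sum(W2 ** 2))
    theta_bar = (W1 / norm, W2 / norm)
    margins = []
    for xi, yi in zip(X.T, y):
        s = f(theta_bar, xi)
        margins.append(s[yi] - np.max(np.delete(s, yi)))
    return min(margins)
```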

Theorem 1.

Assume the training data is separable by a network with an optimal normalized margin γ* > 0. Then, the normalized margin of the global optimum of the weakly-regularized objective (equation 2.1) converges to γ* as the strength λ of the regularizer goes to zero. Mathematically, let γ_λ be defined in equation 2.2. Then

lim_{λ → 0} γ_λ = γ*.

An intuitive explanation for our result is as follows: because of the homogeneity, for a direction Θ̄ with unit norm and normalized margin γ(Θ̄), the loss at the scaled parameter cΘ̄ roughly satisfies the following (for small λ, large scale c, and ignoring problem parameters such as n and l):

L_λ(cΘ̄) ≈ exp(−c^a γ(Θ̄)) + λ c^r.

Thus, the loss focuses on choosing parameters with larger margin, and the regularization term biases the loss towards selecting parameters with a smaller norm. The full proof of the theorem is deferred to Section A.1.
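To see why this scaling favors the largest normalized margin, one can heuristically minimize the right-hand side over the scale c. The following back-of-the-envelope computation is ours, not a step quoted from the proof; γ denotes the normalized margin of the chosen direction.

```latex
% Heuristic: minimize the approximate loss  exp(-c^a * gamma) + lambda * c^r  over c > 0.
% Setting the derivative to zero gives  c^a ~ (1/gamma) * log(1/lambda)  to leading order, so
\[
\min_{c > 0}\Big\{ e^{-c^{a}\gamma} + \lambda c^{r} \Big\}
\;\approx\; \lambda\left(\frac{\log(1/\lambda)}{\gamma}\right)^{r/a},
\]
% which is decreasing in \gamma: for small \lambda, directions with a larger normalized margin
% attain a strictly smaller loss, and the regularizer only breaks ties among them.
```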

We can also provide an analogue of Theorem 1 for the binary classification setting. For this setting, our prediction is now a single real output and we train using logistic loss. We provide formal definitions and results in Section A.2. Our theory for two-layer neural networks (see Section 3) is based in this setting.

2.1 Optimization Accuracy

Since L_λ is typically hard to optimize exactly for neural nets, it would be ideal to relax the condition that Θ_λ exactly minimizes L_λ. Thus, we ask: how accurately do we need to optimize L_λ to obtain a margin that approximates γ* up to a constant? The following theorem shows that it suffices to find parameters achieving a constant-factor multiplicative approximation of the minimum of L_λ, provided λ is chosen to be a sufficiently small polynomial in the problem parameters. Though our theorem is stated for the general multi-class setting, the result applies to binary classification as well. We provide the proof in Section A.3.

Theorem 2.

In the setting of Theorem 1, suppose that we choose λ sufficiently small (in a manner that depends only on the problem parameters). Let Θ′ denote a 2-approximate minimizer of L_λ, so that L_λ(Θ′) ≤ 2 min_Θ L_λ(Θ). Denote the normalized margin of Θ′ by γ′. Then

γ′ ≥ γ*/C for some constant C > 1 that does not depend on λ.

3 Margins of Over-parameterized Two-layer Homogeneous Neural Nets

In Section 2 we showed that a weakly-regularized logistic loss leads to the maximum normalized margin. In this section, we analyze the properties of the max-margin of neural nets more closely. We will contrast neural networks with kernel methods, for which margins have already been extensively studied. Towards a first-cut understanding, we focus on two-layer networks for binary classification.

First, in Section 3.1 we provide a bound stating that the generalization error is roughly linear in the inverse of the margin, establishing that a larger margin implies better generalization. In Section 3.2, we show that the maximum normalized margin is non-decreasing with the hidden layer size and stays constant as soon as there are more hidden units than data points. This suggests that increasing the size of the network improves the generalization of the solution.

Second, in Section 3.3, we draw an analogy to classical kernel methods by proving that the maximum ℓ2-normalized margin of an over-parameterized neural net is equal to half the maximum possible ℓ1-normalized margin of linear functionals on a lifted feature space. In other words, we establish an equivalence between neural networks and the 1-norm SVM (Zhu et al., 2004) on the lifted features. These features are constructed by applying the activation function to all possible hidden-layer weights.

Third, continuing this analogy, we will compare the generalization power of a two-layer neural network to that of a kernel method on the lifted space. This kernel method corresponds to fixing random weights for the hidden layer and solving a 2-norm max-margin problem on the top layer weights. We demonstrate instances where two layer neural networks give better generalization error guarantees than the kernel method.

3.1 Setup and Margin-based Generalization Error

In the rest of the paper, we work with two-layer neural networks with a single output for binary classification. We use m to denote the number of hidden units, u_1, …, u_m ∈ ℝ^d for the weight vectors in the first layer, and w_1, …, w_m ∈ ℝ for the weights in the second layer. We let U ≜ [u_1, …, u_m] and w ≜ (w_1, …, w_m), and we use Θ ≜ (U, w) to denote the collection of all the parameters. We assume in this section that the activation φ is 1-homogeneous and 1-Lipschitz. The network thus computes a single score

f(Θ; x) ≜ Σ_{j=1}^m w_j φ(u_j^⊤ x).

We consider ℓ2 regularization from here on. The regularized logistic loss of the architecture with m hidden units is therefore

L_{λ,m}(Θ) ≜ (1/n) Σ_{i=1}^n log(1 + exp(−y_i f(Θ; x_i))) + λ‖Θ‖^2,   (3.1)

where ‖Θ‖ denotes the Euclidean norm of all the parameters in Θ. We note that f and the regularizer are both 2-homogeneous in Θ, so the results of Section 2 apply to L_{λ,m}.

Following our conventions from Section 2, we denote the optimizer of L_{λ,m} by Θ_{λ,m}, the normalized margin of Θ_{λ,m} by γ_{λ,m}, the max-margin solution by Θ*_m, and the max-margin by γ*_m. We emphasize the size m of the network in our notation. Since our classifier now predicts a single real value, we need to redefine

γ*_m ≜ max_{‖Θ‖ ≤ 1} min_i y_i f(Θ; x_i).

When the data is not separable by an m-unit neural net, γ*_m is zero by definition.

Recall that X denotes the matrix with all the data points as columns, and Y denotes the labels. We sample X and Y i.i.d. from the data-generating distribution P, which is supported on ℝ^d × {−1, +1}. We can define the population 0-1 loss and the training 0-1 loss of the network as

L(Θ) ≜ Pr_{(x,y)∼P}[y f(Θ; x) ≤ 0]  and  L̂(Θ) ≜ (1/n) Σ_{i=1}^n 1[y_i f(Θ; x_i) ≤ 0].

We will let be the average norm squared of the data and be an upper bound on the norm of a single datapoint. The following theorem shows that the generalization error only depends on the parameters through the inverse of the margin on the training data. We provide a proof in Section C.1.

Theorem 1.

Suppose φ is 1-Lipschitz and 1-homogeneous. Then for any Θ that separates the data with margin γ > 0, with probability at least 1 − δ over the draw of the training data,

(3.2)

where the remaining term is a typically small lower-order quantity, so the above bound mainly scales with the leading term. As a corollary, with probability 1 − δ,

(3.3)

Above we implicitly assume , since otherwise the right hand side of the bound is vacuous.

One consequence of the above theorem and Theorem 2 is that if λ is polynomially small in the relevant problem parameters, we only need to optimize L_{λ,m} up to a constant multiplicative factor to obtain parameters with generalization bounds roughly as good as those for the max-margin solution Θ*_m.

3.2 The max margin is non-decreasing in the hidden layer size

Now we show that the maximum normalized margin is nondecreasing with the hidden layer size and stays constant once we have more hidden units than examples.

Theorem 2.

In the setting of Section 3.1, recall that γ*_m denotes the max normalized margin of a two-layer neural network with hidden layer size m, and let n denote the number of training examples. Then,

γ*_1 ≤ γ*_2 ≤ ⋯ ≤ γ*_n = γ*_{n+1} = ⋯ = γ*_∞.   (3.4)

We note that γ*_n will be positive when φ is a sufficiently powerful activation such as relu or sigmoid and the data points are distinct, so the neural network can fit any function of the data points. We prove Theorem 2 in Section B. Theorem 2 can explain why additional over-parametrization has been observed to improve generalization in two-layer networks (Neyshabur et al., 2017b). Our margin does not decrease with a larger network size, and therefore Theorem 1 gives a better generalization bound. We precisely characterize the value of γ*_∞ in the following section.
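As a quick sanity check on the monotonicity in equation 3.4 (our own one-line argument; the full proof, including the equality once m is at least the number of data points, is in Section B), take a maximizer Θ*_m with ‖Θ*_m‖ ≤ 1 and pad it with one dead unit u_{m+1} = 0, w_{m+1} = 0. The padded parameter has the same norm and computes the same function, so

```latex
\[
\gamma^*_{m+1}
\;\ge\; \min_i\, y_i\left[\sum_{j=1}^{m} w^*_j\,\phi\big((u^*_j)^\top x_i\big)
        + 0\cdot\phi\big(0^\top x_i\big)\right]
\;=\; \min_i\, y_i\, f(\Theta^*_m;\, x_i)
\;=\; \gamma^*_m .
\]
```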

3.3 The max margin of neural nets is equivalent to SVM in lifted space

We link infinite-size neural networks to the SVM over a lifted space, defined via a lifting function Φ mapping a datapoint to an infinite-dimensional feature vector indexed by the unit sphere:

Φ(x)[ū] ≜ φ(ū^⊤ x)  for ū ∈ S^{d−1}.   (3.5)

We look at the margin of linear functionals α corresponding to these lifted features. The 1-norm SVM over the lifted features solves for the maximum margin:

γ_{ℓ1} ≜ max_{‖α‖_1 ≤ 1} min_i y_i ⟨α, Φ(x_i)⟩,   (3.6)

where we rely on the inner product and 1-norm defined in Section 1.2. A priori, it is unclear how to optimize this, since the kernel trick does not work for the 1-norm. Here we will show that optimizing two-layer neural networks with weak regularization is equivalent to solving equation 3.6.

Theorem 3.

Let γ_{ℓ1} be defined in equation 3.6, and γ*_m be defined in Section 3.1. For any m ≥ n,

γ*_m = γ_{ℓ1}/2.   (3.7)

Rosset et al. (2007) and Neyshabur et al. (2014) show a similar equivalence, but between a lifted logistic regression problem and equation 3.1. In contrast, the above theorem, proved in Section B, shows the equivalence between equation 3.1 and the 1-norm SVM when the regularizer is small.
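Because the kernel trick is unavailable for the 1-norm objective, one simple way to approximate equation 3.6 in practice is to discretize the sphere of hidden-unit directions and solve the resulting linear program, much as Section 5 does with a grid of directions in one dimension. The sketch below is our construction, not the paper's code; the relu activation, the random directions, the count K, and scipy's linprog are all our choices.

```python
import numpy as np
from scipy.optimize import linprog

# A sketch (ours) of the 1-norm SVM of equation 3.6, approximated by discretizing the unit
# sphere into K random relu directions and solving a linear program.
# Variables: alpha = alpha_pos - alpha_neg with alpha_pos, alpha_neg >= 0, plus the margin g.

def approx_l1_svm_margin(X, y, K=2000, seed=0):
    rng = np.random.default_rng(seed)
    d, n = X.shape                                     # datapoints are columns of X
    U = rng.normal(size=(K, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)      # K unit directions u_k
    Phi = np.maximum(U @ X, 0.0)                       # lifted relu features, shape (K, n)

    # maximize g  s.t.  y_i * <alpha, Phi[:, i]> >= g  and  ||alpha||_1 <= 1.
    # linprog minimizes, so minimize -g over x = (alpha_pos, alpha_neg, g).
    c = np.concatenate([np.zeros(2 * K), [-1.0]])
    yPhi = (y * Phi).T                                 # row i equals y_i * Phi[:, i]
    A_margin = np.hstack([-yPhi, yPhi, np.ones((n, 1))])   # -y_i<alpha, phi_i> + g <= 0
    A_l1 = np.concatenate([np.ones(2 * K), [0.0]])[None, :]
    A_ub = np.vstack([A_margin, A_l1])
    b_ub = np.concatenate([np.zeros(n), [1.0]])
    bounds = [(0, None)] * (2 * K) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return -res.fun                                    # approximate gamma_{l1}
```

For the one-dimensional experiment of Section 5, the random directions would be replaced by evenly spaced ones.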

3.4 Comparison to kernel methods

Figure 1: A visualization of 60 sampled points from D in 3 dimensions. Red points denote negative examples and blue points denote positive examples.

We compare the SVM margin, attainable by a finite neural network, to the margin attainable via kernel methods. Following the setup of Section 3.3, we define the kernel problem over :

(3.8)

where . (We scale by to make the lemma statement below cleaner.) First, can be used to obtain a standard upper bound on the generalization error of the kernel SVM. Following the notation of Section 3.1, we will let denote the 0-1 population classification error for the optimizer of equation 3.8.
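As a counterpart to the sketch after Theorem 3, the following snippet (ours, and only a rough approximation: it uses a large soft-margin constant, includes an intercept, and ignores the feature scaling mentioned above) fixes K random relu features and solves the 2-norm max-margin problem over the top-layer weights, which is the kernel baseline described in this section.

```python
import numpy as np
from sklearn.svm import SVC

# A rough sketch (ours) of the kernel baseline: fix K random relu features relu(u_k . x)
# with u_k uniform on the sphere, then solve an (approximately hard-margin) 2-norm SVM
# over the top-layer weights. A large C mimics the hard-margin problem; the intercept and
# the missing feature scaling of equation 3.8 are simplifications.

def approx_l2_svm_margin(X, y, K=2000, seed=0):
    rng = np.random.default_rng(seed)
    d, n = X.shape
    U = rng.normal(size=(K, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    Phi = np.maximum(U @ X, 0.0)                       # (K, n) lifted features
    clf = SVC(kernel="linear", C=1e6).fit(Phi.T, y)
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    scores = Phi.T @ w + b
    return np.min(y * scores) / np.linalg.norm(w)      # geometric 2-norm margin
```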

Lemma 4.

In the setting of Theorem 1, with probability at least 1 − δ, the generalization error of the standard kernel SVM with relu features (defined in equation 3.8) is bounded by

(3.9)

where is typically a lower-order term.

The bound above follows from standard techniques (Bartlett & Mendelson, 2002), and we provide a full proof in Section C.1. We construct a data distribution for which this lemma does not give a good bound for kernel methods, but Theorem 1 does imply good generalization for two-layer networks.

Theorem 5.

There exists a data distribution D such that the 1-norm SVM with relu features has a good margin:

and with probability over the choice of i.i.d. samples from , obtains generalization error

where the additional term is typically of lower order. Meanwhile, with high probability, the kernel SVM has a small margin:

and therefore the generalization upper bound from Lemma 4 is at least

We briefly overview the construction of D here and defer the full proof of Theorem 5 to Section D.1.

Proof sketch for Theorem 5.

We base D on the distribution of examples described below. Here e_i is the i-th standard basis vector, and we use x[j] to represent the j-th coordinate of the vector x (since the subscript x_i is reserved to index training examples).

Figure 1 shows samples from when there are 3 dimensions. From the visualization, it is clear that there is no linear separator for . As Lemma 1 shows, a relu network with four neurons can fit this relatively complicated decision boundary. On the other hand, for kernel methods, we prove that the symmetries in induce cancellation in feature space. The following lemmas, proved in Section D.1, formalize this cancellation and show that it results in a small margin for kernel methods.

Lemma 6 (Margin upper bound tool).

In the setting of Theorem 5, we have

Lemma 7.

In the setting of Theorem 5, let be i.i.d samples and corresponding labels from . Let be defined in equation 3.5 with . With high probability (at least ), we have

Combining these lemmas gives us the desired bound on .

Gap in regression setting:

We are able to prove an even larger gap between neural networks and kernel methods in the regression setting where we wish to interpolate continuous labels. Analogously to the classification setting, optimizing a regularized squared error loss on neural networks is equivalent to solving a minimum 1-norm regression problem (see Theorem 3). Furthermore, kernel methods correspond to a minimum 2-norm problem. We construct distributions where the 1-norm solution will have a generalization error bound of , whereas the 2-norm solution will have a generalization error bound that is and thus vacuous. In Section D.2, we define the 1-norm and 2-norm regression problems. In Theorem 6 we formalize our construction.

4 Perturbed Wasserstein gradient flow finds global optimizers in polynomial time

In the prior section, we studied the limiting behavior of the generalization of a two-layer network as its width goes to infinity. In this section, we will now study the limiting behavior of the optimization algorithm, gradient descent. Prior work (Mei et al., 2018; Chizat & Bach, 2018) has shown that as the hidden layer size grows to infinity, gradient descent for a finite neural network approaches the Wasserstein gradient flow over distributions of hidden units (defined in equation 4.1). Chizat & Bach (2018) also prove that Wasserstein gradient flow converges to a global optimizer in this setting but do not specify a convergence rate.

We show that a perturbed version of Wasserstein gradient flow converges in polynomial time. The informal take-away of this section is that a perturbed version of gradient descent converges in polynomial time on infinite-size neural networks (for the right notion of infinite-size.)

Formally, we optimize the following functional over distributions on :

where , , and . In this work, we consider 2-homogeneous and . We will additionally require that is nonnegative and is positive on the unit sphere. Finally, we need standard regularity assumptions on , and :

Assumption 1 (Regularity conditions on , , ).

and are differentiable as well as upper bounded and Lipschitz on the unit sphere. is Lipschitz and its Hessian has bounded operator norm.

We provide more details on the specific parameters (for boundedness, Lipschitzness, etc.) in Section E.1. We note that relu networks satisfy every condition except differentiability. We can fit a neural network under our framework as follows:

Example 2 (Logistic loss for neural networks).

We interpret as a distribution over the parameters of the network. Let and for . In this case, is a distributional neural network that computes an output for each of the training examples (like a standard neural network, it also computes a weighted sum over hidden units). We can compute the distributional version of the regularized logistic loss in equation 3.1 by setting and .

We will define Ψ[ρ], the first variation of L with respect to ρ, together with the induced velocity field v[ρ] ≜ −∇_θ Ψ[ρ]. Informally, Ψ is the gradient of L with respect to ρ, and v is the induced velocity field. For the standard Wasserstein gradient flow dynamics, ρ_t evolves according to

∂ρ_t/∂t = −div(ρ_t v[ρ_t]),   (4.1)

where div denotes the divergence of a vector field. For neural networks, these dynamics formally define continuous-time gradient descent when the hidden layer has infinite size (see Theorem 2.6 of Chizat & Bach (2018), for instance).

We propose the following modification of the Wasserstein gradient flow dynamics:

∂ρ_t/∂t = −div(ρ_t v[ρ_t]) + σ(U − ρ_t),   (4.2)

where U is the uniform distribution on the unit sphere and σ > 0 is a small noise level. In our perturbed dynamics, we add uniform noise over the sphere. For infinite-size neural networks, one can informally interpret this as re-initializing a very small fraction of the neurons at every step of gradient descent. We prove convergence to a global optimizer in time polynomial in the problem parameters and the regularity parameters.
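To make equation 4.2 concrete, here is an informal finite-particle, discrete-time caricature (our own sketch, not the scheme analyzed in the paper): each hidden unit is a particle moving along the velocity field induced by the distributional logistic loss of Example 2, and a small random fraction of particles is re-initialized uniformly at every step, playing the role of the injected uniform noise. Every constant below is an illustrative assumption.

```python
import numpy as np

# An informal finite-particle caricature (ours) of the perturbed dynamics in equation 4.2:
# each particle theta_j = (w_j, u_j) is one hidden unit, particles follow the velocity field
# induced by the distributional logistic loss of Example 2, and roughly a sigma*dt fraction
# of particles is re-initialized uniformly on the sphere at every step, mimicking the
# injected uniform noise. Step size, particle count, sigma, lam, and the relu subgradient
# convention are all illustrative assumptions, not the paper's choices.

def perturbed_particle_flow(X, y, num_particles=1000, steps=5000,
                            dt=0.05, sigma=1e-3, lam=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    d, n = X.shape                                       # datapoints are columns of X
    theta = rng.normal(size=(num_particles, d + 1)) / np.sqrt(d + 1)

    def uniform_sphere(k):
        v = rng.normal(size=(k, d + 1))
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    for _ in range(steps):
        w, Uw = theta[:, 0], theta[:, 1:]                # second- and first-layer weights
        acts = np.maximum(Uw @ X, 0.0)                   # (N, n): relu(u_j . x_i)
        out = (w @ acts) / num_particles                 # network output on each example
        g = -y * 0.5 * (1.0 - np.tanh(y * out / 2.0)) / n   # dloss/dout (stable sigmoid)
        # gradient of the per-particle potential (relu subgradient at 0 taken to be 0)
        grad_w = acts @ g + 2.0 * lam * w
        grad_U = ((w[:, None] * (acts > 0)) * g) @ X.T + 2.0 * lam * Uw
        theta = theta - dt * np.column_stack([grad_w, grad_U])
        # replace roughly a sigma*dt fraction of particles with fresh uniform ones
        k = rng.binomial(num_particles, sigma * dt)
        if k > 0:
            idx = rng.choice(num_particles, size=k, replace=False)
            theta[idx] = uniform_sphere(k)
    return theta
```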

Theorem 3 (Theorem 4 with regularity parameters omitted).

Suppose that and are 2-homogeneous and the regularity conditions of Assumption 1 are satisfied. Also assume that from starting distribution , a solution to the dynamics in equation 4.2 exists. Define . Let be a desired error threshold and choose and , where the regularity parameters for , , and are hidden in the . Then, perturbed Wasserstein gradient flow converges to an -approximate global minimum in time:

We provide a theorem statement that includes regularity parameters in Section E.1. We prove the theorem in Section E.2.

As a technical detail, Theorem 3 requires that a solution to the dynamics exists. We can remove this assumption by analyzing a discrete-time version of equation 4.2:

and additionally assuming and have Lipschitz gradients. In this setting, a polynomial time convergence result also holds. We state the result in Section E.3.

5 Simulations

We first verify the normalized margin convergence on a two-layer network with one-dimensional input. A single hidden unit computes x ↦ w·relu(ux + b). We add ℓ2-regularization to w, u, and b, and compare the resulting normalized margin to that of an approximate solution of the SVM problem whose features are given by all possible hidden units on the unit circle. Writing out this infinite feature vector is intractable, so we solve an approximate version by choosing 1000 evenly spaced hidden units. Our theory predicts that with decreasing regularization, the margin of the neural network converges to the SVM objective. In Figure 2, we plot this margin convergence and visualize the final networks and ground-truth labels. The network margin approaches the ideal one as λ → 0, and the visualization shows that the network and SVM functions are extremely similar.

Figure 2: Neural network with input dimension 1. Left: Normalized margin as we decrease . Right: Visualization of the normalized functions computed by the neural network and SVM solution for .
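The following sketch (ours) mirrors the one-dimensional experiment: it trains the two-layer relu network with ℓ2 regularization by plain gradient descent and reports the normalized margin, which should increase towards the discretized SVM value as λ shrinks. The parametrization w·relu(ux + b), the data, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

# A sketch (ours) of the one-dimensional experiment: train the two-layer relu network
# w . relu(u*x + b) with l2 regularization and return the normalized margin
#   min_i y_i f(Theta; x_i) / ||Theta||_2^2,
# which equals the margin of the unit-norm parameters since f is 2-homogeneous.

def train_margin(x, y, m=100, lam=1e-3, steps=20000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    u, b, w = 0.1 * rng.normal(size=(3, m))
    n = len(x)
    for _ in range(steps):
        pre = u[:, None] * x[None, :] + b[:, None]          # (m, n)
        act = np.maximum(pre, 0.0)
        out = w @ act                                        # (n,)
        g = -y * 0.5 * (1.0 - np.tanh(y * out / 2.0)) / n    # dloss/dout (stable sigmoid)
        mask = (pre > 0).astype(float)
        grad_w = act @ g + 2.0 * lam * w
        grad_u = (w[:, None] * mask * g) @ x + 2.0 * lam * u
        grad_b = (w[:, None] * mask) @ g + 2.0 * lam * b
        u, b, w = u - lr * grad_u, b - lr * grad_b, w - lr * grad_w
    out = w @ np.maximum(u[:, None] * x[None, :] + b[:, None], 0.0)
    norm_sq = np.sum(u ** 2) + np.sum(b ** 2) + np.sum(w ** 2)
    return np.min(y * out) / norm_sq

# As lambda shrinks, the returned margin should increase towards the SVM value.
x = np.linspace(-0.9, 0.9, 20)
y = np.sign(np.sin(3 * np.pi * x))
for lam in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(lam, train_margin(x, y, lam=lam))
```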

Next, we experiment on synthetic data in a higher-dimensional setting. For classification and regression, we compare the generalization error and the predicted generalization upper bounds (from Theorem 1 and Lemmas 4 and 5) of a trained neural network against a kernel SVM with relu features as we vary the number of training samples n. For classification we plot 0-1 error, whereas for regression we plot squared error. Our ground truth comes from a random neural network with 6 hidden units. For classification, we used rejection sampling to obtain datapoints with unnormalized margin of at least 0.1 on the ground-truth network. We use a fixed input dimension. For all experiments, we train the network for 20000 steps and average over 100 trials for each plot point.

The plots in Figure 3 show that two-layer networks clearly outperform kernel methods in test error as n grows. However, there seems to be looseness in the upper bounds for kernel methods: the kernel generalization bound appears to stay constant with n (as predicted by our theory for regression), but the kernel test error decreases. There is also some variance in the neural network generalization bound for classification. This likely occurred because we did not tune the learning rate and training time, so the optimization failed to find the best margin.

Figure 3: Comparing neural networks and kernel methods. Left: Classification. Right: Regression.

In Section F, we include additional experiments training modified WideResNet architectures on CIFAR10 and CIFAR100. Although ResNet is not homogeneous, we still report interesting increases in generalization performance from annealing the weight decay during training, versus staying at a fixed decay rate.

6 Conclusion

We have made the case that maximizing the margin is one of the inductive biases of relu networks trained with cross-entropy loss. We show that we can obtain a maximum normalized margin by training with a weak regularizer. We also prove that a larger ℓ2-normalized margin implies better generalization for two-layer nets. Our work leaves open the question of how the normalized margin relates to generalization in much deeper neural networks. This is a fascinating theoretical and empirical question for future work. On the optimization side, we make progress towards understanding over-parametrized gradient descent by analyzing infinite-size neural networks. A natural direction for future work is to apply our theory to optimize the margin of finite-sized neural networks.

Acknowledgments

JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303. This is part of the collaboration between US DOD, UK MOD, and the UK Engineering and Physical Sciences Research Council (EPSRC) under the Multidisciplinary University Research Initiative. We also thank Nati Srebro and Suriya Gunasekar for helpful discussions in the early stage of this work.

Appendix A Missing Proofs in Section 2

We first show that L_λ does indeed have a global minimizer.

Claim 1.

In the setting of Theorems 1 and 3, a minimizer Θ_λ ∈ argmin_Θ L_λ(Θ) exists.

Proof.

We will argue in the setting of Theorem 1, where the loss is the multi-class cross-entropy loss, because the logistic loss case is analogous. We first note that L_λ is continuous in Θ, because f is continuous in Θ and the term inside the logarithm is always positive. Next, define M ≜ L_λ(0). Then we note that for ‖Θ‖ > (M/λ)^{1/r}, we must have L_λ(Θ) ≥ λ‖Θ‖^r > M. It follows that inf_Θ L_λ(Θ) = inf_{‖Θ‖ ≤ (M/λ)^{1/r}} L_λ(Θ). However, there must be a value which attains this infimum, because {Θ : ‖Θ‖ ≤ (M/λ)^{1/r}} is a compact set and L_λ is continuous. Thus, the minimum of L_λ is attained by some Θ_λ. ∎

A.1 Missing Proofs for Multi-class Setting

Towards proving Theorem 1, we first show that as we decrease λ, the norm of the solution Θ_λ grows.

Lemma 2.

In the setting of Theorem 1, as λ → 0, we have ‖Θ_λ‖ → ∞.

To prove Theorem 1, we rely on the exponential scaling of the cross-entropy: the per-example loss can be lower bounded roughly by the exponential of the negative margin, and it also has an upper bound that scales with the same exponential. By Lemma 2, we can take ‖Θ_λ‖ large so that the gap between these bounds vanishes. This proof technique is inspired by that of Rosset et al. (2004a).
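For concreteness, the two-sided bound we have in mind is the following standard computation (our restatement; write f_j for f_j(Θ; x_i) and let m_i ≜ f_{y_i} − max_{j≠y_i} f_j denote the margin of example i):

```latex
\[
-\log\frac{e^{f_{y_i}}}{\sum_{j=1}^{l} e^{f_j}}
= \log\Big(1+\sum_{j\neq y_i} e^{-(f_{y_i}-f_j)}\Big),
\qquad
\log\big(1+e^{-m_i}\big)
\;\le\; \log\Big(1+\sum_{j\neq y_i} e^{-(f_{y_i}-f_j)}\Big)
\;\le\; (l-1)\,e^{-m_i}.
\]
```

Roughly speaking, as ‖Θ_λ‖ → ∞ the margins of the rescaled parameters grow, and the multiplicative factor l − 1 between the two sides becomes negligible after taking logarithms.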

Proof of Theorem 1.

For any and with ,

(by the homogeneity of )
(A.1)
(A.2)

We can also apply in order to lower bound equation A.1 and obtain

(A.3)

Applying equation A.2 with and , noting that , we have:

(A.4)

Next we lower bound by applying equation A.3,

(A.5)

Combining equation A.4 and equation A.5 with the fact that (by the global optimality of ), we have

Recall that by Lemma 2, as , we have . Therefore, . Thus, we can apply Taylor expansion to the equation above with respect to and . If , then we obtain

We claim this implies that . If not, we have , which implies that the equation above is violated with sufficiently large ( would suffice). By Lemma 2, as and therefore we get a contradiction.

Finally, we have by definition of . Hence, exists and equals . ∎

Now we fill in the proof of Lemma 2.

Proof of Lemma 2.

For the sake of contradiction, we assume that such that for any , there exists with . We will determine the choice of later and pick such that . Then the logits (the prediction before softmax) are bounded in absolute value by some constant (that depends on ), and therefore the loss function for every example is bounded from below by some constant (depending on but not .)

Let , we have that

(by the optimality of )

Taking a sufficiently small , we obtain a contradiction and complete the proof. ∎

A.2 Full Binary Classification Setting

For completeness, we state and prove our max-margin results for the setting where we fit binary labels y_i ∈ {−1, +1} (as opposed to indices in [l]) and redefine f to assign a single real-valued score (as opposed to a score for each label). This lets us work with the simpler λ-regularized logistic loss:

L_λ(Θ) ≜ (1/n) Σ_{i=1}^n log(1 + exp(−y_i f(Θ; x_i))) + λ‖Θ‖^r.

As before, let Θ_λ ∈ argmin_Θ L_λ(Θ), and define the normalized margin by γ_λ ≜ min_i y_i f(Θ̄_λ; x_i), where Θ̄_λ ≜ Θ_λ/‖Θ_λ‖. Define the maximum possible normalized margin

γ* ≜ max_{‖Θ‖ ≤ 1} min_i y_i f(Θ; x_i).   (A.6)
Theorem 3.

Assume γ* > 0 in the binary classification setting with logistic loss. Then as λ → 0, γ_λ → γ*.

The proof follows via simple reduction to the multi-class case.

Proof of Theorem 3.

We prove this theorem via reduction to the multi-class case with . Construct with and . Define new labels if and if . Now note that , so the multi-class margin for under is the same as binary margin for under . Furthermore, defining

we get that , and in particular, and have the same set of minimizers. Therefore we can apply Theorem 1 for the multi-class setting and conclude in the binary classification setting. ∎

A.3 Missing Proof for Optimization Accuracy

Proof of Theorem 2.

Choose . We can upper bound by computing

(by equation A.2)

Furthermore, it holds that . Now we note that

for sufficiently large . Now using the fact that , we additionally have the lower bound . Since , we can rearrange to get