# A Tunable Loss Function for Classification

Tyler Sypherd*, Mario Diaz, Harshit Laddha*,
Lalitha Sankar*, Peter Kairouz, Gautam Dasarathy*
Centro de Investigación en Matemáticas A.C., diaztorres@cimat.mx
###### Abstract

Recently, a parametrized class of loss functions called $\alpha$-loss, $\alpha \in [1,\infty]$, has been introduced for classification. This family, which includes the log-loss and the 0-1 loss as special cases, comes with compelling properties including an equivalent margin-based form which is classification-calibrated for all $\alpha$. We introduce a generalization of this family to the entire range of $\alpha \in (0,\infty]$ and establish how the parameter $\alpha$ enables the practitioner to choose among a host of operating conditions that are important in modern machine learning tasks. We prove that smaller $\alpha$ values are more conducive to faster optimization; in fact, $\alpha$-loss is convex for $\alpha \le 1$ and quasi-convex for $\alpha > 1$. Moreover, we establish bounds to quantify the degradation of the local-quasi-convexity of the optimization landscape as $\alpha$ increases; we show that this directly translates to a computational slowdown. On the other hand, our theoretical results also suggest that larger $\alpha$ values lead to better generalization performance. This is a consequence of the ability of the $\alpha$-loss to limit the effect of less likely data as $\alpha$ increases from 1, thereby facilitating robustness to outliers and noise in the training data. We provide strong evidence supporting this assertion with several experiments on benchmark datasets that establish the efficacy of $\alpha$-loss for $\alpha > 1$ in robustness to errors in the training data. Of equal interest is the fact that, for $\alpha < 1$, our experiments show that the decreased robustness seems to counteract class imbalances in training data.


## 1 Introduction

The performance of a classification algorithm in terms of accuracy, tractability, and convergence guarantees depends crucially on the choice of the loss function. Consider a feature vector $X \in \mathcal{X}$, an unknown finite-valued label $Y \in \mathcal{Y}$, and a hypothesis $h: \mathcal{X} \to \mathcal{Y}$. The canonical 0-1 loss, given by $\mathbb{1}[h(X) \neq Y]$, is considered an ideal loss function that captures the probability of incorrectly guessing the true label $Y$ using $h(X)$. However, since the 0-1 loss is neither continuous nor differentiable, its applicability in state-of-the-art learning algorithms is highly restricted. As a result, there has been much interest in identifying surrogate loss functions that approximate the 0-1 loss [1, 2, 3, 4, 5, 6]. Common surrogate loss functions include log-loss, squared loss, and hinge loss. However, as classification is applied to broader contexts including scenarios where data may be noisy (e.g., incorrect labels) or have imbalanced classes (e.g., very few samples capturing anomalous events that are crucial to detect), there is a need for good surrogate loss functions that are also robust to these practical challenges.

We propose to address these challenges via a recently introduced class of loss functions, $\alpha$-loss. This class was originally conceived in a privacy setting in [7] by Liao et al., who introduced $\alpha$-loss, parameterized by $\alpha \in [1,\infty]$, to quantify information leakage for a class of adversarial threat models. The viability of $\alpha$-loss as a possible family of surrogate loss functions for binary classification was suggested in [6]. In fact, the authors proved that $\alpha$-loss can be written in a margin-based form and is classification-calibrated [6]. Focusing on the classification context, we present a generalization of $\alpha$-loss over a larger range given by $\alpha \in (0,\infty]$. By introducing this larger range, we establish how the parameter $\alpha$ enables the practitioner to choose among a host of operating conditions that are important in modern machine learning tasks.

Our Contributions: Our main contribution is a family of loss functions that, in addition to including the oft-used log-loss and highly desirable 0-1 loss, captures the inherent trade-offs between classification accuracy, convexity of the optimization landscape, and robustness to noise and outliers. We provide both theoretical and experimental evidence for these trade-offs as summarized below:
(1) We show that the margin-based form of $\alpha$-loss is classification-calibrated over the entire range of $\alpha$. Further, we prove that the margin-based $\alpha$-loss is convex for $\alpha \le 1$ and quasi-convex for $\alpha > 1$. In the context of logistic regression, we exploit this margin-based form and its Lipschitz property to obtain generalization results using Rademacher complexity.
(2) For logistic regression, we show that the true risk is Strictly-Locally-Quasi-Convex (SLQC) (as defined by Hazan et al. in [8]) for all $\alpha \in (0,1]$. Furthermore, we provide bounds on the degradation of the SLQC parameters as $\alpha$ increases, thereby highlighting the effect of quasi-convexity (for $\alpha > 1$). More broadly, we present a lemma that introduces an equivalent condition for the direction of descent in the definition of SLQC; this can be of independent interest.
(3) For logistic regression, we also provide a uniform generalization result for every $\alpha \in [1,\infty]$. More specifically, we show that increasing $\alpha$ makes the empirical $\alpha$-risk more closely resemble the probability of error for every hypothesis.
(4) Finally, we illustrate our results using well-studied datasets such as MNIST, Fashion MNIST and CIFAR-10 and neural networks including one and two layer CNNs. We explore three contexts to compare the accuracy of $\alpha$-loss against that of log-loss: original datasets, noisy datasets (with a fraction of labels flipped), and class imbalanced datasets (reduced samples in specific classes of the above-mentioned datasets). In particular, in the noisy labels and imbalanced class settings, we show that tuning $\alpha$ away from 1 (log-loss) can enhance both accuracy and robustness.

### 1.1 Related Work

Surrogate loss functions for the 0-1 loss have been of wide interest to the machine learning and statistics communities [1, 2, 3, 9, 4, 10, 11, 12, 13, 14]. Recently, there has been renewed interest in alternative losses for classification [15, 11, 14, 10, 7] other than the oft-used log-loss. During the inception of the field, convex losses were widely considered optimal [1, 4, 9, 3]. However, more recent works propose the use of non-convex losses as a means to moderate the behavior of an algorithm [10, 6, 11, 2]. In particular, motivated by the superior robustness and classification accuracy of non-convex losses, Mei et al. [10] studied the empirical landscape of such functions. Along these lines, Hazan et al. [8] introduce SLQC as a means to study the unimodality of quasi-convex functions and their optimization characteristics. For SLQC functions, they introduce the Normalized Gradient Descent algorithm and prove its convergence guarantees. Using their methodology, we consider $\alpha$-loss under logistic regression for binary classification and derive intuition about the operating characteristics of $\alpha$-loss.

Our work is similar to [11], wherein Nguyen et al. present a tunable sigmoid loss which can be made arbitrarily close to the 0-1 loss. In essence, their loss moves from a smooth to a non-smooth loss. Our loss is always smooth and moves from convex to quasi-convex. We find that in the setting of deep neural networks, some quasi-convexity of the $\alpha$-loss smooths the empirical landscape. In particular, we provide strong experimental evidence for a narrow range of $\alpha$ to be used in practice (a limit on the amount of convexity and quasi-convexity of the loss), which significantly reduces the range of hyperparameter tuning induced by $\alpha$-loss. Increasing the degree of convexity of the optimization landscape is conducive to faster optimization. Hence, our approach could serve as an alternative to other approaches whose objective is to accelerate the optimization process, e.g., the activation function tuning in [16, 17, 18] and references therein.

## 2 α-loss and Binary Classification

We begin by generalizing the definition of $\alpha$-loss, introduced by Liao et al. [7], to the entire range $\alpha \in (0,\infty]$.

###### Definition 1.

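Let $\mathcal{P}(\mathcal{Y})$ be the set of probability distributions over $\mathcal{Y}$. For $\alpha \in (0,\infty]$, we define $\alpha$-loss as

$$\ell^{\alpha}(y, P_Y) := \frac{\alpha}{\alpha-1}\left[1 - P_Y(y)^{1-1/\alpha}\right], \quad \alpha \in (0,1)\cup(1,\infty), \tag{1}$$

and, by continuous extension, $\ell^{1}(y, P_Y) := -\log P_Y(y)$ and $\ell^{\infty}(y, P_Y) := 1 - P_Y(y)$.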
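A minimal numerical sketch of Definition 1 may help make the two continuous extensions concrete. The function name `alpha_loss` and its argument `p_true` (the probability that the estimate assigns to the true label) are illustrative choices and do not come from the paper.

```python
import numpy as np

def alpha_loss(p_true, alpha):
    """alpha-loss of Definition 1, given the probability assigned to the true label."""
    if alpha <= 0:
        raise ValueError("alpha must lie in (0, inf]")
    p = np.asarray(p_true, dtype=float)
    if alpha == 1:                # continuous extension at alpha = 1: log-loss
        return -np.log(p)
    if np.isinf(alpha):           # continuous extension at alpha = inf: 1 - P_Y(y)
        return 1.0 - p
    return alpha / (alpha - 1.0) * (1.0 - p ** (1.0 - 1.0 / alpha))
```

For a fixed low-probability outcome such as `p_true = 0.1`, the loss shrinks as $\alpha$ grows (9.0 at $\alpha = 0.5$, about 2.30 at $\alpha = 1$, and 0.9 at $\alpha = \infty$), which is the weighting behaviour the definition is meant to capture.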

The above definition of $\alpha$-loss presents a class of loss functions that values the probabilistic estimate of the label differently for different $\alpha$. For $\alpha = \infty$, minimizing the corresponding risk leads to making a single guess on the most likely label; on the other hand, for $\alpha = 1$, such a risk minimization involves minimizing the average log-loss, and therefore, refining a posterior belief over all $y$ for a given observation $x$. As $\alpha$ increases from 1 to $\infty$, the loss function increasingly limits the effect of the low probability outcomes; on the other hand, as $\alpha$ decreases from 1 towards 0, the loss function places increasingly higher weights on the low probability outcomes until at $\alpha = 0$, by continuous extension of (1), we have $\ell^{0} = \infty$, i.e., the loss function pays an infinite cost by ignoring the training data distribution completely. This characteristic property of $\alpha$-loss is highlighted in Figure 1(a) for a binomial random variable generated by a fair coin over 20 flips. Note that $\alpha$ quantifies the level of certainty placed on the posterior distribution. Thus, larger $\alpha$ indicates increasing certainty over a smaller set of $Y$ while smaller $\alpha$ distributes the uncertainty over more (and eventually, all) possible values of $Y$. Indeed, for $\alpha = \infty$, the distribution becomes the hard-decoding MAP rule.

Suppose that the feature-label variable pair $(X,Y) \sim P_{X,Y}$. Observing $X$, the goal in classification is to construct an estimate $\hat{Y}$ of $Y$. Upon inspecting (1), one may observe that the expected $\alpha$-loss $\mathbb{E}_{X,Y}[\ell^{\alpha}(Y, P_{\hat{Y}|X})]$, henceforth referred to as $\alpha$-risk, quantifies the effectiveness of the estimated posterior $P_{\hat{Y}|X}$. The following proposition provides an explicit characterization of the optimal risk-minimizing posterior under $\alpha$-loss.

###### Proposition 1 ([7, Lemma 1]).

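For each $\alpha \in (0,1)\cup(1,\infty)$, the minimal $\alpha$-risk is

$$\min_{P_{\hat{Y}|X}} \mathbb{E}_{X,Y}\left[\ell^{\alpha}(Y, P_{\hat{Y}|X})\right] = \frac{\alpha}{\alpha-1}\left(1 - e^{\frac{1-\alpha}{\alpha} H^{A}_{\alpha}(Y|X)}\right), \tag{2}$$

where $H^{A}_{\alpha}(Y|X)$ is the Arimoto conditional entropy of order $\alpha$ [19]. The resulting unique minimizer, $P^{*}_{\hat{Y}|X}$, is the $\alpha$-tilted true posterior

$$P^{*}_{\hat{Y}|X}(y|x) = \frac{P_{Y|X}(y|x)^{\alpha}}{\sum_{y} P_{Y|X}(y|x)^{\alpha}}. \tag{3}$$

The proof of Proposition 1 can be found in [7] and is easily extended to the case where $\alpha \in (0,1)$.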
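The tilting in (3) can be checked numerically on a small example. The sketch below is only illustrative: `tilted_posterior` and `alpha_risk` are hypothetical helper names, the three-point posterior is made up, and the $\alpha = \infty$ branch implements the limiting MAP rule described in the text (ties split uniformly, which is an assumption).

```python
import numpy as np

def tilted_posterior(posterior, alpha):
    """alpha-tilted distribution of equation (3); alpha = inf gives the MAP rule."""
    p = np.asarray(posterior, dtype=float)
    if np.isinf(alpha):
        # Limit of (3): all mass on the most likely label(s).
        mask = (p == p.max()).astype(float)
        return mask / mask.sum()
    w = p ** alpha
    return w / w.sum()

def alpha_risk(posterior, estimate, alpha):
    """Expected alpha-loss of `estimate` when labels follow `posterior`.

    Only the generic case alpha in (0,1) or (1,inf) is handled here.
    """
    p = np.asarray(posterior, dtype=float)
    q = np.asarray(estimate, dtype=float)
    loss = alpha / (alpha - 1.0) * (1.0 - q ** (1.0 - 1.0 / alpha))
    return float(np.sum(p * loss))

p = np.array([0.7, 0.2, 0.1])        # made-up true posterior for one observation x
q_star = tilted_posterior(p, 2.0)    # proportional to p squared
```

With these numbers, reporting the tilted posterior `q_star` gives a strictly smaller $\alpha$-risk at $\alpha = 2$ than reporting the true posterior `p` itself, in line with Proposition 1.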

For binary classification, where $\mathcal{Y} = \{-1,+1\}$, it is common to use classification functions of the form $f: \mathcal{X} \to \mathbb{R}$ such that the classifier, for any given $x$, outputs the hypothesis (hard decision) $h(x) = \mathrm{sign}(f(x))$ [1, 3, 9, 2, 6]. A classification function corresponds to the certainty of an algorithm's prediction (e.g., SVM). Examples of loss functions that act on classification functions include logistic loss, hinge loss, and square loss.

In addition to classification functions, soft classifiers may also be output by binary classification algorithms. A soft classifier $g(x) = P_{\hat{Y}|X}(1|x)$ corresponds to the distribution $P_{\hat{Y}|X}(\cdot|x)$. Log-loss, and by extension $\alpha$-loss, are examples of loss functions which act on soft classifiers. In practice, a soft classifier can be obtained by composing commonly used classification functions with a sigmoid function $\sigma: \mathbb{R} \to [0,1]$, given by

$$\sigma(z) = \frac{1}{1+e^{-z}}. \tag{4}$$

A large family of loss functions under binary classification are margin-based loss functions [1, 3, 2, 9, 15]. A loss function is said to be margin-based if, for all $x \in \mathcal{X}$ and $y \in \{-1,+1\}$, the risk associated to a pair $(y, f(x))$ is given by $\tilde{\ell}(y f(x))$ for some function $\tilde{\ell}: \mathbb{R} \to \mathbb{R}_{+}$. In this case, the risk of the pair $(y, f(x))$ only depends on the product $y f(x)$, where the product is called the margin. Observe that a negative margin corresponds to a mismatch between the signs of $f(x)$ and $y$, i.e., a classification error by $f$. Similarly, a positive margin corresponds to a match between the signs of $f(x)$ and $y$, i.e., a correct classification by $f$.

We now show that $\alpha$-loss is margin-based over the entire range of $\alpha$; this is also illustrated in Figure 1(b).

###### Definition 2.

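The margin-based $\alpha$-loss, $\tilde{\ell}^{\alpha}: \mathbb{R} \to \mathbb{R}_{+}$, $\alpha \in (0,\infty]$, is

$$\tilde{\ell}^{\alpha}(z) := \frac{\alpha}{\alpha-1}\left(1 - \sigma(z)^{1-1/\alpha}\right), \quad \alpha \in (0,1)\cup(1,\infty), \tag{5}$$

with $\tilde{\ell}^{1}(z) = \log(1+e^{-z})$ and $\tilde{\ell}^{\infty}(z) = 1 - \sigma(z)$ by continuous extension.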

The relationship between a classification function and a soft classifier under $\alpha$-loss is articulated by the following proposition. It generalizes the result of [6] for the entire range of $\alpha$.

###### Proposition 2.

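Consider a soft classifier $g(x) = P_{\hat{Y}|X}(1|x)$. If $f(x) = \sigma^{-1}(g(x))$, then, for every $\alpha \in (0,\infty]$,

$$\ell^{\alpha}(y, g(x)) = \tilde{\ell}^{\alpha}(y f(x)). \tag{6}$$

Conversely, if $f$ is a classification function, then the set of beliefs associated to $g(x) := \sigma(f(x))$ satisfies (6). In particular, for every $\alpha \in (0,\infty]$,

$$\min_{g} \mathbb{E}_{X,Y}\left(\ell^{\alpha}(Y, g(X))\right) = \min_{f} \mathbb{E}_{X,Y}\left(\tilde{\ell}^{\alpha}(Y f(X))\right). \tag{7}$$

The proof of Proposition 2 can be found in [6] and is easily extended to the case where $\alpha \in (0,1)$.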
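The identity (6) is easy to confirm numerically. The following sketch assumes the convention that a soft classifier output `g` is the probability of the label $+1$, so the label $-1$ receives probability `1 - g`; the helper names `soft_loss`, `margin_loss` and `_alpha_form` are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def _alpha_form(p, alpha):
    # Shared expression alpha/(alpha-1) * (1 - p^(1-1/alpha)) with its two limits.
    if alpha == 1:
        return -np.log(p)
    if np.isinf(alpha):
        return 1.0 - p
    return alpha / (alpha - 1.0) * (1.0 - p ** (1.0 - 1.0 / alpha))

def soft_loss(y, g, alpha):
    """alpha-loss of a soft classifier output g = P(Y_hat = 1 | x), label y in {-1, +1}."""
    return _alpha_form(g if y == 1 else 1.0 - g, alpha)

def margin_loss(z, alpha):
    """Margin-based alpha-loss of equation (5), evaluated at the margin z = y * f(x)."""
    return _alpha_form(sigmoid(z), alpha)
```

Evaluating `soft_loss(y, sigmoid(f), alpha)` and `margin_loss(y * f, alpha)` for any real `f`, label `y` and admissible `alpha` should give the same number up to floating-point error, because $1 - \sigma(f) = \sigma(-f)$.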

While the previous result establishes an equivalent margin-based form for $\alpha$-loss, our next result establishes some of its basic optimization characteristics.

###### Proposition 3.

$\tilde{\ell}^{\alpha}$ is convex for $\alpha \in (0,1]$ and quasi-convex for $\alpha > 1$.

Finally, we conclude this section with another basic property of $\alpha$-loss that highlights its suitability for classification. For binary classification and margin-based losses, Bartlett et al. in [1] introduce classification-calibration as a means to compare the performance of a loss function relative to the 0-1 loss. A margin-based loss function $\tilde{\ell}$ is classification-calibrated if its minimum conditional risk given $X = x$ is attained by a $f^{*}(x)$ such that $\mathrm{sign}(f^{*}(x)) = \mathrm{sign}(2\eta(x) - 1)$, where $\eta(x) = P_{Y|X}(1|x)$ is the true posterior. Building upon such a result in [6] for $\alpha$-loss, the following proposition shows that $\tilde{\ell}^{\alpha}$ is classification-calibrated for all $\alpha \in (0,\infty]$.

###### Proposition 4.

For every $\alpha \in (0,\infty]$, the margin-based $\alpha$-loss, $\tilde{\ell}^{\alpha}$, is classification-calibrated.

The proof of Proposition 4 is given in Appendix A.

## 3 Main Results

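In this section we establish some theoretical properties regarding the performance of $\alpha$-loss in a logistic regression setting. Towards this end, let $X$ be the feature vector, $Y \in \{-1,+1\}$ the label and $S_n = \{(X_i, Y_i)\}_{i=1}^{n}$ the training dataset where, for each $i$, the samples $(X_i, Y_i)$ are independently drawn according to an unknown distribution $P_{X,Y}$. We consider the family of soft classifiers $\{g_\theta : \theta \in \mathbb{B}_d(r)\}$ where $r > 0$ and

$$g_\theta(x) = \sigma(\langle \theta, x \rangle), \tag{8}$$

with $\sigma$ given as in (4). The $\alpha$-loss can now be written as

$$\ell^{\alpha}(y, g_\theta(x)) = \frac{\alpha}{\alpha-1}\left[1 - \frac{1+y}{2}\, g_\theta(x)^{1-1/\alpha} - \frac{1-y}{2}\, (1-g_\theta(x))^{1-1/\alpha}\right]. \tag{9}$$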

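A straightforward computation shows that

$$\frac{\partial}{\partial \theta_j}\ell^{\alpha}(y, g_\theta(x)) = \left[\frac{1-y}{2}\, g_\theta(x)\,(1-g_\theta(x))^{1-1/\alpha} - \frac{1+y}{2}\, g_\theta(x)^{1-1/\alpha}\,(1-g_\theta(x))\right] x_j, \tag{10}$$

where $\theta_j, x_j$ denote the $j$-th components of $\theta$ and $x$, respectively. Thus, the gradient of $\alpha$-loss is

$$\nabla_\theta \ell^{\alpha}(Y, g_\theta(X)) = F_1(\alpha,\theta,X,Y)\, X, \tag{11}$$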

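where $F_1(\alpha,\theta,x,y)$ is the expression within brackets in (10). Finally, we define the $\alpha$-risk $R_\alpha(\theta)$ as the risk of the $\alpha$-loss in (9), i.e., $R_\alpha(\theta) := \mathbb{E}_{X,Y}[\ell^{\alpha}(Y, g_\theta(X))]$. Observe that, for all $\theta \in \mathbb{B}_d(r)$,

$$R_\infty(\theta) := \mathbb{E}_{X,Y}\left[\ell^{\infty}(Y, g_\theta(X))\right] = \mathbb{P}\left[Y \neq \hat{Y}_\theta\right], \tag{12}$$

where $\hat{Y}_\theta$ is a random variable such that $\mathbb{P}[\hat{Y}_\theta = 1 \mid X = x] = g_\theta(x)$ for all $x$. We define the empirical $\alpha$-risk by

$$\hat{R}_\alpha(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell^{\alpha}(Y_i, g_\theta(X_i)). \tag{13}$$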
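The quantities in (9), (10), (11) and (13) translate directly into a few lines of NumPy. The sketch below is illustrative rather than the authors' implementation: the function names are hypothetical, and features are stored row-wise in an array `X` with labels `Y` in $\{-1,+1\}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def alpha_loss_logistic(theta, X, Y, alpha):
    """Per-sample alpha-loss (9) for g_theta(x) = sigmoid(<theta, x>)."""
    g = sigmoid(X @ theta)
    p = np.where(Y == 1, g, 1.0 - g)        # probability assigned to the observed label
    if alpha == 1:
        return -np.log(p)
    if np.isinf(alpha):
        return 1.0 - p
    return alpha / (alpha - 1.0) * (1.0 - p ** (1.0 - 1.0 / alpha))

def F1(theta, X, Y, alpha):
    """Bracketed scalar factor of (10); the per-sample gradient is F1 * x."""
    g = sigmoid(X @ theta)
    b = 1.0 - 1.0 / alpha                    # exponent 1 - 1/alpha (equals 1 at alpha = inf)
    return np.where(Y == 1, -(g ** b) * (1.0 - g), g * (1.0 - g) ** b)

def empirical_risk(theta, X, Y, alpha):
    """Empirical alpha-risk (13)."""
    return float(np.mean(alpha_loss_logistic(theta, X, Y, alpha)))

def empirical_risk_grad(theta, X, Y, alpha):
    """Gradient of (13), obtained by averaging (11) over the sample."""
    return np.mean(F1(theta, X, Y, alpha)[:, None] * X, axis=0)
```

A quick sanity check is to compare `empirical_risk_grad` with a central finite difference of `empirical_risk` on random data; at $\alpha = 1$ the factor `F1` reduces to the familiar logistic-regression gradient factor.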

### 3.1 The Optimization Landscape of α-loss via SLQC

Next we provide some insight regarding the convexity degradation of the optimization landscape as $\alpha$ increases. In order to do so, we rely on a relaxed form of convexity called Strict-Local-Quasi-Convexity. We begin by recalling the definition of this notion introduced by Hazan et al. in [8]. For $\theta_0 \in \mathbb{R}^d$ and $r > 0$, we let $\mathbb{B}_d(\theta_0, r) := \{\theta \in \mathbb{R}^d : \|\theta - \theta_0\| \le r\}$.

###### Definition 3.

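Let $\epsilon, \kappa > 0$ and $\theta_0 \in \mathbb{R}^d$. We say that $f: \mathbb{R}^d \to \mathbb{R}$ is $(\epsilon,\kappa,\theta_0)$-Strictly-Locally-Quasi-Convex (SLQC) in $\theta$, if at least one of the following applies:

1. $f(\theta) - f(\theta_0) \le \epsilon$.
2. $\|\nabla f(\theta)\| > 0$, and for every $\theta' \in \mathbb{B}_d(\theta_0, \epsilon/\kappa)$ it holds that $\langle \nabla f(\theta), \theta' - \theta \rangle \le 0$.

Intuitively, when $\theta_0$ is a minimizer of an $(\epsilon,\kappa,\theta_0)$-SLQC function $f$, then, for every $\theta$, either $f(\theta)$ is $\epsilon$-close to optimal or descending with the gradient leads to the 'right' direction. This relaxed notion of convexity comes with a natural adaptation of the Gradient Descent (GD) algorithm: Normalized Gradient Descent (NGD) [8] summarized in Algorithm 1.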
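Normalized gradient descent, as described in [8], steps along the direction of the negative gradient with a fixed step length and returns the best iterate seen. The sketch below is a minimal illustration under that reading; the function name `ngd`, its arguments, and the stopping rule at a zero gradient are assumptions made for this example.

```python
import numpy as np

def ngd(f, grad, theta0, eta, T):
    """Normalized gradient descent: fixed-length steps, best iterate returned.

    f, grad : callables returning the objective value and its gradient
    eta     : step length (each update moves exactly eta in Euclidean norm)
    T       : number of iterations
    """
    theta = np.array(theta0, dtype=float)
    best_theta, best_val = theta.copy(), f(theta)
    for _ in range(T):
        g = grad(theta)
        norm = np.linalg.norm(g)
        if norm == 0.0:                      # critical point reached, nothing to normalize
            break
        theta = theta - eta * g / norm       # only the gradient direction is used
        val = f(theta)
        if val < best_val:
            best_theta, best_val = theta.copy(), val
    return best_theta, best_val
```

Because every step has length `eta`, the iterates cannot settle closer than roughly `eta` to a minimizer, which is why the best iterate (rather than the last one) is returned.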

Similar to the convergence guarantees for GD on convex problems, NGD comes with natural convergence guarantees for SLQC problems as shown in [8, Theorem 4.1].

###### Proposition 5.

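Fix $\epsilon > 0$, let $f: \mathbb{R}^d \to \mathbb{R}$, and $\theta^* = \arg\min_{\theta \in \mathbb{R}^d} f(\theta)$. If $f$ is $(\epsilon,\kappa,\theta^*)$-SLQC in every $\theta \in \mathbb{R}^d$, then running Algorithm 1 with $T \ge \kappa^2 \|\theta_1 - \theta^*\|^2 / \epsilon^2$, and $\eta = \epsilon/\kappa$, we have $\min_{t=1,\ldots,T} f(\theta_t) - f(\theta^*) \le \epsilon$.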

It is important to note that, for an $(\epsilon,\kappa,\theta_0)$-SLQC function, a smaller $\epsilon$ provides better optimality guarantees. Given $\epsilon$, smaller $\kappa$ leads to faster optimization as the number of required iterations increases with $\kappa$. It should also be noted that, by using projections, NGD can be easily adapted to work over convex and closed sets including $\mathbb{B}_d(r)$ (see, for example, [8]).

We are now in a position to establish bounds on the SLQC constants of $R_\alpha$, the $\alpha$-risk, in the logistic model. The following theorem identifies the SLQC constants for the $\alpha$-risk in the regime $\alpha \in (0,1]$. In this context, we assume that we are in a unimodal realizable setting, i.e., $R_\alpha$ has a unique critical point and

$$\theta^* = \arg\min_{\theta \in \mathbb{B}_d(r)} R_\alpha(\theta). \tag{14}$$

The proof, which can be found in Appendix B, relies on three main facts: (i) $R_\alpha$ is quasi-convex, (ii) $R_\alpha$ is $C_{r,\alpha}$-Lipschitz (see Lemma 2 in Appendix B), and (iii) $R_\alpha$ has a unique critical point due to the unimodality assumption.

###### Theorem 1.

If and , then, for every , the -risk is -SLQC in where .

We establish the following lemma which might be of general interest; the proof is given in Appendix C. Indeed, it provides further insight into the theory of Hazan et al. with regards to SLQC functions.

###### Lemma 1.

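Assume that $f: \mathbb{R}^d \to \mathbb{R}$ is differentiable, $\theta_0 \in \mathbb{R}^d$ and $\gamma > 0$. If $\theta \in \mathbb{R}^d$ is such that $\|\theta - \theta_0\| > \gamma$, then the following are equivalent:

• $\langle -\nabla f(\theta), \theta' - \theta \rangle \ge 0$ for all $\theta' \in \mathbb{B}_d(\theta_0, \gamma)$,

• $\langle -\nabla f(\theta), \theta_0 - \theta \rangle \ge \gamma \|\nabla f(\theta)\|$.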

We are now in a position to state our second result, which provides some insight into the degradation of the SLQC constants for the $\alpha$-risk as $\alpha$ increases from 1. For ease of notation, let

$$K_r := \frac{(r+\log 2)\,\sigma(r)^{1-1/(2\sigma(r)-1)}}{(2\sigma(r)-1)^2} \quad \text{and} \quad L_r := \frac{(r+\log 2)^2}{2}. \tag{15}$$
###### Theorem 2.

Let and . If is -SLQC in and , then is -SLQC in with and .

The proof of this theorem relies on a handful of lemmas whose proofs are given in Appendix D. The theorem intuitively states that increasing the value of increases the value of by a factor of and decreases the ball of radius by a factor of , both of which hinder the optimization process and increase the required number of iterations.

### 3.2 Generalization and Control on the Probability of Error

In this section, we focus on the setting $\alpha \in [1,\infty]$ to highlight two aspects of $\alpha$-loss: (i) generalization properties, and (ii) uniform guarantees on accuracy. We begin with generalization.

###### Theorem 3.

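If $\alpha \in [1,\infty]$, then, with probability at least $1-\delta$, for all $\theta \in \mathbb{B}_d(r)$,

$$R_\alpha(\theta) - \hat{R}_\alpha(\theta) \le C_\alpha \frac{2r}{\sqrt{n}} + 4 D_\alpha \sqrt{\frac{2\ln(4/\delta)}{n}}, \tag{16}$$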

where and , for .

Note that and are both monotonically decreasing with , and are continuously extended such that , , , and . The proof of Theorem 3 relies on classical results in Rademacher complexity; all details are in Appendix E.

Note that this generalization guarantee has significantly better dependence on $n$ than the one derived by Sypherd et al. in [6]. The authors there, building on [10], are only able to provide a guarantee at the critical points of the landscape. Instead, we use classical results in Rademacher complexity for our setting to achieve a uniform generalization guarantee of order $O(1/\sqrt{n})$. Furthermore, our constants are explicitly specified in terms of parameters critical to the landscape.

The next corollary stems from Theorem 3. It explicitly bounds the accuracy gain using .

###### Corollary 1.

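If $\alpha \in [1,\infty]$, then, with probability at least $1-\delta$, for all $\theta \in \mathbb{B}_d(r)$,

$$R_\infty(\theta) - \hat{R}_\alpha(\theta) \le \frac{r}{2\sqrt{n}} + 4\sqrt{\frac{2\ln(4/\delta)}{n}} + \frac{L_r}{\alpha}, \tag{17}$$

where $L_r$ is given in (15).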

Thus, we see that for any $\theta \in \mathbb{B}_d(r)$, increasing $\alpha$ implies that $\hat{R}_\alpha(\theta)$ gets closer to $R_\infty(\theta)$, which by (12) is the probability of error of the estimator $\hat{Y}_\theta$. One can view the third term in (17) as a penalty term which, for a given $\alpha$, arises from the discrepancy between the (true) $\alpha$-risk and the probability of error. Indeed, (17) collapses to Theorem 3 for $\alpha = \infty$.
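To get a feel for how the penalty term behaves, the right-hand side of the bound in Corollary 1 can be evaluated directly. The sketch below assumes the bound reads $\frac{r}{2\sqrt{n}} + 4\sqrt{2\ln(4/\delta)/n} + L_r/\alpha$ with $L_r = (r+\log 2)^2/2$; the function name and the sample values of `n`, `r` and `delta` are illustrative.

```python
import math

def corollary1_bound(n, r, alpha, delta):
    """Right-hand side of the bound on R_inf(theta) - R_hat_alpha(theta).

    n     : number of training samples
    r     : radius of the parameter ball
    alpha : loss parameter, assumed to lie in [1, inf]
    delta : failure probability
    """
    L_r = (r + math.log(2.0)) ** 2 / 2.0
    penalty = 0.0 if math.isinf(alpha) else L_r / alpha
    estimation = r / (2.0 * math.sqrt(n)) + 4.0 * math.sqrt(2.0 * math.log(4.0 / delta) / n)
    return estimation + penalty

# Illustrative values only: the penalty L_r / alpha shrinks as alpha grows.
bounds = [corollary1_bound(10_000, 1.0, a, 0.05) for a in (1.0, 10.0, math.inf)]
```

For fixed `n`, `r` and `delta`, the first two terms do not depend on $\alpha$, so the whole dependence on $\alpha$ sits in the penalty term, which vanishes at $\alpha = \infty$.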

## 4 Experimental Illustration of Results

In this section, we provide experimental illustration of our theoretical results. In particular, we use the following datasets: MNIST [20], Fashion MNIST [21], and CIFAR-10 [22]. These datasets come with predefined training and test sets. We divide the original training set into training and validation sets using a 90%-10% breakup, respectively; all training is done in batches of 256. We use a variety of neural network (NN) architectures, as appropriate for the datasets. These include:
(i) logistic regression (LR) using a two layer NN with Softmax activation;
(ii) three layer NN (3-NN) with a 128-neuron hidden layer combined with ReLU activation and a Softmax activated output layer;
(iii) CNN-1: One convolutional (conv) layer with 32 kernels of size 3x3, 1 pooling layer with 2x2 kernels, a 512-size fully connected layer with ReLU activation (dense layer) and an output layer;
(iv) CNN-2: uses blocks similar to CNN-1 such as pool and dense and has layers in the following order: conv (32 kernels of 3x3), (2x2) Pool, Conv (64 kernels of 3x3), (2x2) Pool, dense, and output.

For a chosen network, we train each of the NN architectures described above for a range of $\alpha$ values, each for 100 epochs. For a particular $\alpha$, the optimal epoch hyperparameter is the one at which the validation loss is minimized; finally, we choose the optimal $\alpha$ as the one achieving the best validation accuracy. We use the resulting model parameters (including hyperparameters $\alpha$, optimal epoch value, and learning rate) to report accuracy on the test set. At the outset, we tested both SGD and Adam [23] to compare $\alpha$-loss and log-loss; for the noisy and class imbalanced datasets we only use Adam.

We evaluate the efficacy of $\alpha$-loss in three scenarios as detailed below. For each, our metric of comparison is the relative gain in accuracy (Acc.) from using $\alpha$-loss ($\alpha$-L) over log-loss (LL) computed as ($\alpha$-L Acc. - LL Acc.)/(LL Acc.).
(i) Performance of $\alpha$-loss relative to log-loss for each dataset: For each dataset, Table 1 compares the best $\alpha$-loss accuracy to that of log-loss over all architectures and optimizers considered. Results for all architectures and optimizer choices are collected in Table 4 of Appendix F; these suggest that $\alpha$-loss loses nothing in accuracy, and more often than not, offers improvements in accuracy.
(ii) Resilience to errors in the training set: We test the robustness of the loss function to noisy data by flipping a percentage of the labels uniformly at random (from the remaining 9 classes for each label). We vary the flipping percentage from 5-25% in steps of 5%. The results for 0, 10 and 20% are summarized in Table 2 and highlight that $\alpha > 1$ significantly improves accuracy (see also Table 5 in Appendix F). One can view this as the resiliency of $\alpha$-loss to training under data poisoning attacks.
(iii) Robustness to class imbalances: We perform this for binary-CIFAR and CIFAR-10 datasets. We create binary-CIFAR datasets by considering two class pairs: (i) automobile and truck; and (ii) cat and dog. For each such binary dataset, we reduce the size of one class to 10% such that the affected class is now 9% of the whole training and validation set. The test set is unchanged. Both overall accuracy and that for the less-frequent imbalanced class are summarized in Table 3 for CNN-1. Results for CNN-2 are in Appendix F, Table 6; these results suggest that, in general, $\alpha < 1$ achieves significantly better classification (the rare cases for which $\alpha > 1$ hint at a trade-off between accuracy and imbalance). Results for 10% class imbalance for just one CIFAR-10 class with CNN-1 and CNN-2 are summarized in Table 7 of Appendix F; these results reveal significant accuracy gains in detecting the imbalanced class for $\alpha < 1$.
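The label-noise and class-imbalance protocols, together with the relative-gain metric, can be sketched as follows. This is one plausible way to implement them and not the authors' preprocessing code: the helper names `flip_labels`, `subsample_class` and `relative_gain` are hypothetical, and labels are assumed to be integers in `0..num_classes-1`.

```python
import numpy as np

def flip_labels(labels, fraction, num_classes, rng):
    """Flip a fraction of labels uniformly at random to one of the other classes."""
    labels = np.asarray(labels).copy()
    n = labels.shape[0]
    idx = rng.choice(n, size=int(round(fraction * n)), replace=False)
    # An offset in 1..num_classes-1 guarantees the new label differs from the old one
    # and is uniform over the remaining classes.
    offsets = rng.integers(1, num_classes, size=idx.size)
    labels[idx] = (labels[idx] + offsets) % num_classes
    return labels, idx

def subsample_class(labels, target_class, keep_fraction, rng):
    """Indices of a dataset in which only keep_fraction of target_class is retained."""
    labels = np.asarray(labels)
    target_idx = np.flatnonzero(labels == target_class)
    n_keep = int(round(keep_fraction * target_idx.size))
    kept = rng.choice(target_idx, size=n_keep, replace=False)
    others = np.flatnonzero(labels != target_class)
    return np.sort(np.concatenate([others, kept]))

def relative_gain(alpha_acc, ll_acc):
    """Relative accuracy gain of alpha-loss over log-loss."""
    return (alpha_acc - ll_acc) / ll_acc
```

In a balanced binary dataset, keeping 10% of one class leaves that class with 1 in 11 of the remaining samples, i.e. roughly 9%, which agrees with the proportion quoted in item (iii).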

Our experimental results also help evaluate the trade-off between accuracy and convergence speed. We illustrate this in Figure 3 in Appendix F for the CIFAR-10 dataset using CNN-2. In particular, we plot the validation accuracy as a function of the number of epochs for four different values of $\alpha$. These curves, albeit noisy, support our theoretical results suggesting that the number of iterations to converge increases with $\alpha$.

In summary, we find that the best value of $\alpha$ is dependent on the setting. Since larger $\alpha$ more closely resembles the probability of error, we find that in the noisy label setting, the robustness and the quasi-convexity of $\alpha$-loss for $\alpha > 1$ and its faster generalization capabilities make it more appropriate than log-loss ($\alpha = 1$). For the datasets with class imbalances, we see that $\alpha$-loss for $\alpha < 1$ performs best; this is because the convex losses place more weight on the outliers (see Figure 1). In practice, we find that a narrow range of $\alpha$ is most representative of the entire set. For such a range, it is not difficult to do a careful grid search to find the best value of $\alpha$ for a given setting. The significant performance gains we observe here for noisy and imbalanced datasets could outweigh the cost of such (limited range) tuning of an additional ($\alpha$) hyperparameter.

## 5 Concluding Remarks

We have presented convexity-robustness trade-offs for a class of surrogate loss functions, $\alpha$-loss, $\alpha \in (0,\infty]$. In addition to strong surrogate properties relative to 0-1 loss, our theoretical and experimental results suggest that this class has the potential to address many emerging challenges including class imbalance and noisy data. Precise characterization of the accuracy gains of $\alpha$-loss, its empirical landscape, and robustness to class imbalance are promising avenues for future work. Finally, we expect that our equivalent conditions for local quasi-convexity may be of broad interest, and we leave an exploration of this to future work.

## References

• [1] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, classification, and risk bounds,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, 2006.
• [2] H. Masnadi-Shirazi and N. Vasconcelos, “On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost,” in Advances in neural information processing systems, 2009, pp. 1049–1056.
• [3] Y. Lin, “A note on margin-based loss functions in classification,” Statistics & Probability Letters, vol. 68, no. 1, pp. 73–82, 2004.
• [4] L. Rosasco, E. D. Vito, A. Caponnetto, M. Piana, and A. Verri, “Are loss functions all the same?” Neural Computation, vol. 16, no. 5, pp. 1063–1076, 2004.
• [5] R. Nock and F. Nielsen, “On the efficient minimization of classification calibrated surrogates,” in Advances in neural information processing systems, 2009, pp. 1201–1208.
• [6] T. Sypherd, M. Diaz, L. Sankar, and P. Kairouz, “A tunable loss function for binary classification,” CoRR, vol. abs/1902.04639, 2019. [Online]. Available: http://arxiv.org/abs/1902.04639
• [7] J. Liao, O. Kosut, L. Sankar, and F. P. Calmon, “A tunable measure for information leakage,” in 2018 IEEE International Symposium on Information Theory (ISIT).    IEEE, 2018, pp. 701–705.
• [8] E. Hazan, K. Levy, and S. Shalev-Shwartz, “Beyond convexity: Stochastic quasi-convex optimization,” in Advances in Neural Information Processing Systems, 2015, pp. 1594–1602.
• [9] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “On surrogate loss functions and f-divergences,” The Annals of Statistics, vol. 37, no. 2, pp. 876–904, 2009.
• [10] S. Mei, Y. Bai, and A. Montanari, “The landscape of empirical risk for nonconvex losses,” The Annals of Statistics, vol. 46, no. 6A, pp. 2747–2774, 2018.
• [11] T. Nguyen and S. Sanner, “Algorithms for direct 0–1 loss optimization in binary classification,” in International Conference on Machine Learning, 2013, pp. 1085–1093.
• [12] A. Singh and J. C. Principe, “A loss function for classification based on a robust similarity metric,” in The 2010 International Joint Conference on Neural Networks (IJCNN).    IEEE, 2010, pp. 1–6.
• [13] A. Tewari and P. L. Bartlett, “On the consistency of multiclass classification methods,” Journal of Machine Learning Research, vol. 8, no. May, pp. 1007–1025, 2007.
• [14] L. Zhao, M. Mammadov, and J. Yearwood, “From convex to nonconvex: a loss function analysis for binary classification,” in 2010 IEEE International Conference on Data Mining Workshops.    IEEE, 2010, pp. 1281–1288.
• [15] K. Janocha and W. M. Czarnecki, “On loss functions for deep neural networks in classification,” arXiv preprint arXiv:1702.05659, 2017.
• [16] L. Benigni and S. Péché, “Eigenvalue distribution of nonlinear models of random matrices,” arXiv preprint arXiv:1904.03090, 2019.
• [17] L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington, “Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks,” arXiv preprint arXiv:1806.05393, 2018.
• [18] J. Pennington and P. Worah, “Nonlinear random matrix theory for deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 2637–2646.
• [19] S. Arimoto, “Information measures and capacity of order α for discrete memoryless channels,” Topics in information theory, 1977.
• [20] Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/index.html.
• [21] H. Xiao, K. Rasul, and R. Vollgraf. (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms.
• [22] A. Krizhevsky, V. Nair, and G. Hinton, “The CIFAR-10 dataset,” online: http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
• [23] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
• [24] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms.    Cambridge University Press, 2014.
• [25] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. 13th Intl. Conf. on AI and Statistics, Sardinia, Italy, 13–15 May 2010, pp. 249–256.

## Appendix A Proofs of Margin-Based α-loss Propositions

In this section, we provide proofs of the propositions provided in Section 2.

### A.1 Proof of Proposition 3

###### Proof.

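The proof of the case where $\alpha \ge 1$ can be found in [6]. Now consider $\alpha \in (0,1)$; in this case

$$\frac{d^2}{dz^2}\tilde{\ell}^{\alpha}(z) = \frac{(e^{-z}+1)^{1/\alpha}\, e^{z}\,(\alpha e^{z} - \alpha + 1)}{\alpha\,(e^{z}+1)^3}. \tag{18}$$

Observe that the second derivative in (18) is greater than zero for all $z \in \mathbb{R}$, so $\tilde{\ell}^{\alpha}$ is a convex function in the margin.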
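The sign of the second derivative can be checked numerically from the closed form $\frac{(e^{-z}+1)^{1/\alpha} e^{z} (\alpha e^{z} - \alpha + 1)}{\alpha (e^{z}+1)^{3}}$. The sketch below evaluates it on a grid; the function name and the grid are illustrative choices. The only factor that can change sign is $\alpha e^{z} - \alpha + 1$, which stays positive for every $z$ exactly when $\alpha \le 1$.

```python
import numpy as np

def margin_loss_second_derivative(z, alpha):
    """Closed-form second derivative of the margin-based alpha-loss in z."""
    ez = np.exp(z)
    numerator = (np.exp(-z) + 1.0) ** (1.0 / alpha) * ez * (alpha * ez - alpha + 1.0)
    return numerator / (alpha * (ez + 1.0) ** 3)

z = np.linspace(-10.0, 10.0, 2001)
# alpha = 0.5: positive everywhere on the grid, consistent with convexity.
convex_for_half = bool(np.all(margin_loss_second_derivative(z, 0.5) > 0))
# alpha = 2: negative for z < log((alpha - 1) / alpha), so the loss is not convex.
convex_for_two = bool(np.all(margin_loss_second_derivative(z, 2.0) > 0))
```

For $\alpha = 2$ the sign change occurs at $z = \log(1/2)$, so the loss is concave for strongly negative margins while remaining monotonically decreasing, which is the quasi-convex regime.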

### A.2 Proof of Proposition 4

The case where $\alpha \ge 1$ is given in [6].

Before stating the proof of Proposition 4, we provide the full definition of classification-calibration given in [1] for the sake of completeness.

###### Definition 4 ([1, Definition 1]).

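A margin-based loss function $\tilde{\ell}$ is said to be classification-calibrated if, for every $\eta \neq 1/2$,

$$\inf_{f:\, f(2\eta-1)\le 0}\left(\eta\,\tilde{\ell}(f) + (1-\eta)\,\tilde{\ell}(-f)\right) > \inf_{f\in\mathbb{R}}\left(\eta\,\tilde{\ell}(f) + (1-\eta)\,\tilde{\ell}(-f)\right). \tag{19}$$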
###### Proof.

For $\alpha \in (0,1)$, we rely on the following theorem proved by Bartlett et al. in [1].

###### Proposition 6 ([1, Theorem 6]).

Let $\tilde{\ell}$ denote a margin-based loss function and suppose it is a convex function in the margin. Then $\tilde{\ell}$ is classification-calibrated if and only if it is differentiable at 0 and $\tilde{\ell}'(0) < 0$.

Observe that $\tilde{\ell}^{\alpha}$ for $\alpha \in (0,1)$ is a convex function of the margin as shown by Proposition 3 and, since the loss is differentiable and monotonically decreasing, it satisfies the conditions of Proposition 6. Thus, $\tilde{\ell}^{\alpha}$ is classification-calibrated for $\alpha \in (0,1)$. The optimal classifier in this region can be found by considering the $\alpha$-tilted distribution (3) in conjunction with Proposition 2.

## Appendix B Proof of Theorem 1

We begin by establishing that the $\alpha$-risk is Lipschitz.

###### Lemma 2.

If , then is -Lipschitz in where , i.e.,

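$$|R_\alpha(\theta) - R_\alpha(\theta')| \le C_{r,\alpha}\,\|\theta - \theta'\|, \quad \forall\, \theta, \theta' \in \mathbb{B}_d(r). \tag{20}$$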
###### Proof.

We show that is -Lipschitz in by showing that the norm of its gradient is uniformly bounded on by . Since both and are bounded, differentiation under the integral sign leads to

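\begin{align}
\nabla_\theta R_\alpha(\theta) &= \nabla_\theta\, \mathbb{E}_{X,Y}\left[\ell^{\alpha}(Y, g_\theta(X))\right] \tag{21}\\
&= \mathbb{E}_{X,Y}\left[\nabla_\theta\, \ell^{\alpha}(Y, g_\theta(X))\right] \tag{22}\\
&= \mathbb{E}_{X,Y}\left[F_1(\alpha,\theta,X,Y)\, X\right]. \tag{23}
\end{align}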

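Since $\|X\| \le 1$, we obtain that

\begin{align}
\|\nabla_\theta R_\alpha(\theta)\| = \left\|\mathbb{E}_{X,Y}\left[F_1(\alpha,\theta,X,Y)\, X\right]\right\| &\le \mathbb{E}_{X,Y}\left[|F_1(\alpha,\theta,X,Y)|\,\|X\|\right] \tag{24}\\
&\le \mathbb{E}_{X,Y}\left[|F_1(\alpha,\theta,X,Y)|\right]. \tag{25}
\end{align}

By definition, $\mathbb{E}_{X,Y}[|F_1(\alpha,\theta,X,Y)|]$ equals

$$\mathbb{E}_{X,Y}\left[\left|\frac{1-Y}{2}\, g_\theta(X)\,(1-g_\theta(X))^{1-1/\alpha} - \frac{1+Y}{2}\, g_\theta(X)^{1-1/\alpha}\,(1-g_\theta(X))\right|\right]. \tag{26}$$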

Using the fact , it can be shown that . Since

$$g_\theta(z)^{1-1/\alpha}\,(1-g_\theta(z)) = g_\theta(-z)\,(1-g_\theta(-z))^{1-1/\alpha} \tag{27}$$

for all $z$, (26) is easily upper bounded by $C_{r,\alpha}$. ∎

We use the following proposition given by Hazan et al. in [8] to prove the first theorem.

###### Proposition 7 ([8]).

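If $f$ is $G$-Lipschitz and a strictly-quasi-convex function, then $\forall\, \theta, \theta_0 \in \mathbb{R}^d$, $\forall\, \epsilon > 0$, it holds that $f$ is $(\epsilon, G, \theta_0)$-SLQC in $\theta$.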

Now, we present the proof for Theorem 1.

###### Proof.

We will prove Theorem 1 by applying Proposition 7. As shown in Lemma 2, $R_\alpha$ is $C_{r,\alpha}$-Lipschitz for all $\alpha \in (0,1]$. Next we show quasi-convexity of $R_\alpha$ by level-sets. Since $\ell^{\alpha}(y, g_\theta(x))$ is a convex function in $\theta$ for $\alpha \in (0,1]$ and since expectation is a linear operator, we have that $R_\alpha$ is quasi-convex for $\alpha \in (0,1]$. With regard to strictness, we follow the requirement of Hazan et al. [8]. We say that $f$ is strictly-quasi-convex if it is quasi-convex and its gradients vanish only at the global minima, that is, for all $\theta$ such that $f(\theta) > \min_{\theta'} f(\theta')$, it holds that $\|\nabla f(\theta)\| > 0$. Since we assume that we are in the realizable setting where $\theta^*$ is the only point for which $\nabla R_\alpha(\theta^*) = 0$, $R_\alpha$ is strictly-quasi-convex. ∎

## Appendix C Proof of Lemma 1

The following lemma helps to quantify the range of acceptable gradient angles for an SLQC function.

###### Lemma 3.

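Let $\theta_0 \in \mathbb{R}^d$ and $\gamma > 0$. Suppose that $f$ satisfies that, for all $\theta$ with $\|\theta - \theta_0\| > \gamma$,

$$\langle -\nabla_\theta f(\theta), \theta' - \theta \rangle \ge 0, \quad \forall\, \theta' \in \mathbb{B}_d(\theta_0, \gamma). \tag{28}$$

Pick $\theta$ such that $\|\theta - \theta_0\| > \gamma$. If $\psi_\theta$ denotes the angle between $-\nabla_\theta f(\theta)$ and $\theta_0 - \theta$, then

$$\psi_\theta \le \cos^{-1}\left(\frac{\gamma}{\|\theta_0 - \theta\|}\right). \tag{29}$$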
###### Proof.

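Consider a supporting hyperplane of the ball $\mathbb{B}_d(\theta_0, \gamma)$ which contains $\theta$, as shown in Figure 2. Let $\theta'$ be the intersection between the hyperplane and the ball. Note that for this $\theta'$, $\langle -\nabla_\theta f(\theta), \theta' - \theta \rangle \ge 0$ by assumption.

Next, consider the right triangle generated by $\theta$, $\theta'$ and $\theta_0$, and let $\phi$ be the acute angle between the perpendicular hyperplane to $\theta' - \theta$ and $\theta_0 - \theta$. By the similarity of the angles, it can be seen that this $\phi$ is the angle at $\theta_0$ of the right triangle. Since the side adjacent to this angle has length $\gamma$ and the hypotenuse has length $\|\theta_0 - \theta\|$,

$$\phi = \cos^{-1}\left(\frac{\gamma}{\|\theta_0 - \theta\|}\right). \tag{30}$$

By assumption (28), it can be shown that $\psi_\theta \le \phi$. Furthermore, $\theta'$ as chosen here minimizes $\phi$, i.e., any other $\theta'$ would impose a less restrictive constraint on the magnitude of $\psi_\theta$ by making $\phi$ larger.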

We now provide the proof of Lemma 1.

###### Proof.

Let and as in the proof of the previous lemma. It follows that and . By the definition of inner product, it follows that

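$$\cos(\psi)\,\|\theta - \theta_0\|\,\|\nabla f(\theta)\| = \langle -\nabla f(\theta), \theta_0 - \theta \rangle. \tag{31}$$

Rearranging we get

$$\psi = \cos^{-1}\left(\frac{\langle -\nabla f(\theta), \theta_0 - \theta \rangle}{\|\theta - \theta_0\|\,\|\nabla f(\theta)\|}\right) \in [0,\pi]. \tag{32}$$

Further, $\psi \le \cos^{-1}\left(\frac{\gamma}{\|\theta - \theta_0\|}\right)$ by the previous lemma. This requirement is equivalent to

$$\cos^{-1}\left(\frac{\langle -\nabla f(\theta), \theta_0 - \theta \rangle}{\|\theta - \theta_0\|\,\|\nabla f(\theta)\|}\right) \le \cos^{-1}\left(\frac{\gamma}{\|\theta - \theta_0\|}\right). \tag{33}$$

Since $\cos^{-1}$ is a monotonically decreasing function, the previous condition is equivalent to

$$\langle -\nabla f(\theta), \theta_0 - \theta \rangle \ge \gamma\,\|\nabla f(\theta)\| \ge 0. \tag{34}$$

Notice that the steps are reversible so we have both directions. ∎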

## Appendix D Proof of Theorem 2

We begin by proving a set of lemmas which are necessary for the proof of Theorem 2.

The first lemma shows that the $\alpha$-risk is $L_r$-Lipschitz as a function of $\alpha$.

###### Lemma 4.

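If $\alpha, \alpha' \ge 1$, then $|R_\alpha(\theta) - R_{\alpha'}(\theta)| \le L_r\,|\alpha - \alpha'|$, where $L_r = \frac{(r+\log 2)^2}{2}$.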

###### Proof.

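To show that $R_\alpha(\theta)$ is $L_r$-Lipschitz in $\alpha$, it suffices to show that $\left|\frac{d}{d\alpha}R_\alpha(\theta)\right| \le L_r$. Observe that

$$\frac{d}{d\alpha}R_\alpha(\theta) = \frac{d}{d\alpha}\,\mathbb{E}\left[\ell^{\alpha}(Y, g_\theta(X))\right] = \mathbb{E}\left[\frac{d}{d\alpha}\,\ell^{\alpha}(Y, g_\theta(X))\right], \tag{35}$$

where the second equality follows since we assume well-behaved integrals. We may rewrite this expression as

$$\mathbb{E}\left[\frac{d}{d\alpha}\,\ell^{\alpha}(Y, g_\theta(X))\right] = P_1\,\mathbb{E}_{X|Y=1}\left[\frac{d}{d\alpha}\,\ell^{\alpha}(1, g_\theta(X))\right] + P_{-1}\,\mathbb{E}_{X|Y=-1}\left[\frac{d}{d\alpha}\,\ell^{\alpha}(-1, g_\theta(X))\right]. \tag{36}$$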

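Consider without loss of generality the expression in the first brackets for a fixed $x$. We denote this expression as

$$f(\alpha,\theta,x) = \frac{d}{d\alpha}\,\frac{\alpha}{\alpha-1}\left[1 - g_\theta(x)^{1-1/\alpha}\right]. \tag{37}$$

It can be shown that

$$f(\alpha,\theta,x) = \frac{1 - g_\theta(x)^{1-1/\alpha}}{\alpha-1} - \frac{(\log g_\theta(x))\, g_\theta(x)^{1-1/\alpha}}{\alpha(\alpha-1)} - \frac{\alpha\left(1 - g_\theta(x)^{1-1/\alpha}\right)}{(\alpha-1)^2}. \tag{38}$$

By L'Hôpital's rule, we have that

$$f(1,\theta,x) = -\frac{(\log g_\theta(x))^2}{2}. \tag{39}$$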

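Further, it can be shown that $|f(\alpha,\theta,x)|$ is monotonically decreasing in $\alpha$. Thus,

$$|f(\alpha,\theta,x)| \le |f(1,\theta,x)| \le \left|\frac{(\log g_\theta(x))^2}{2}\right| \le \frac{(r+\log 2)^2}{2}. \tag{40}$$

Upon plugging (40) into (36), the probabilities $P_1$ and $P_{-1}$ sum to 1 and we achieve the desired bound on $\left|\frac{d}{d\alpha}R_\alpha(\theta)\right|$. ∎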
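The limit obtained from L'Hôpital's rule can be sanity-checked numerically. The sketch below evaluates the three-term expression for the derivative of $\frac{\alpha}{\alpha-1}\left[1 - g^{1-1/\alpha}\right]$ with respect to $\alpha$ close to $\alpha = 1$ and compares it with $-(\log g)^2/2$; the function names and the test values of $g$ are illustrative.

```python
import math

def dloss_dalpha(g, alpha):
    """Derivative in alpha of alpha/(alpha-1) * (1 - g^(1-1/alpha)), for alpha != 1."""
    gb = g ** (1.0 - 1.0 / alpha)
    return ((1.0 - gb) / (alpha - 1.0)
            - math.log(g) * gb / (alpha * (alpha - 1.0))
            - alpha * (1.0 - gb) / (alpha - 1.0) ** 2)

def limit_at_one(g):
    """Value of the derivative at alpha = 1 given by L'Hopital's rule."""
    return -(math.log(g) ** 2) / 2.0

# Gap between the formula just above alpha = 1 and the claimed limit, for a few g.
gaps = [abs(dloss_dalpha(g, 1.0 + 1e-4) - limit_at_one(g)) for g in (0.1, 0.3, 0.9)]
```

The gaps are of the order of the offset from $\alpha = 1$, as expected for a smooth function; evaluating much closer to 1 is not advisable in floating point because the three terms cancel almost exactly.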

The second lemma shows that the gradient of the $\alpha$-risk with respect to the parameter vector $\theta$ is $K_r$-Lipschitz as a function of $\alpha$.

###### Lemma 5.

If , then where

$$K_r = \frac{(r+\log 2)\,\sigma(r)^{1-1/(2\sigma(r)-1)}}{(2\sigma(r)-1)^2}. \tag{41}$$
###### Proof.

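By the linearity of the expectation we can bound $\|\nabla R_\alpha(\theta) - \nabla R_{\alpha'}(\theta)\|$ as

\begin{align}
\|\nabla R_\alpha(\theta) - \nabla R_{\alpha'}(\theta)\| &= \left\|\mathbb{E}\left[\left(F_1(\alpha,\theta,X,Y) - F_1(\alpha',\theta,X,Y)\right) X\right]\right\| \tag{42}\\
&\le \mathbb{E}\left[\left|F_1(\alpha,\theta,X,Y) - F_1(\alpha',\theta,X,Y)\right|\,\|X\|\right] \tag{43}\\
&= \mathbb{E}\left[\left|F_1(\alpha,\theta,X,Y) - F_1(\alpha',\theta,X,Y)\right|\right], \tag{44}
\end{align}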

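since $X$ has bounded support. Here, we explore the Lipschitzianity of $F_1$ as a function of $\alpha$. To do so, we calculate the maximum value of $\left|\frac{d}{d\alpha}F_1\right|$. Consider

$$F_1(\alpha,\theta,x,y) = \frac{1-y}{2}\, g_\theta(x)\,(1-g_\theta(x))^{1-1/\alpha} - \frac{1+y}{2}\, g_\theta(x)^{1-1/\alpha}\,(1-g_\theta(x)). \tag{45}$$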

For any pair , only one term is active, so let without loss of generality. Thus,

 F1(α,θ,x,−1)=−g1−1/αθ(1−gθ(x)). (46)

In particular, we have that

 ddαF1(α,θ,x,−1) =−g1−1/αθ1α2log(gθ(x))