A Tunable Loss Function for Classification
Abstract
Recently, a parameterized class of loss functions called α-loss, α ∈ [1, ∞], has been introduced for classification. This family, which includes the log-loss and the 0-1 loss as special cases, comes with compelling properties including an equivalent margin-based form which is classification-calibrated for all α. We introduce a generalization of this family to the entire range α ∈ (0, ∞] and establish how the parameter α enables the practitioner to choose among a host of operating conditions that are important in modern machine learning tasks. We prove that smaller α values are more conducive to faster optimization; in fact, α-loss is convex for α ≤ 1 and quasi-convex for α > 1. Moreover, we establish bounds to quantify the degradation of the local quasi-convexity of the optimization landscape as α increases; we show that this directly translates to a computational slowdown. On the other hand, our theoretical results also suggest that larger α values lead to better generalization performance. This is a consequence of the ability of the α-loss to limit the effect of less likely data as α increases from 1, thereby facilitating robustness to outliers and noise in the training data. We provide strong evidence supporting this assertion with several experiments on benchmark datasets that establish the efficacy of α-loss for α > 1 in robustness to errors in the training data. Of equal interest is the fact that, for α < 1, our experiments show that the decreased robustness seems to counteract class imbalances in training data.
1 Introduction
The performance of a classification algorithm in terms of accuracy, tractability, and convergence guarantees depends crucially on the choice of the loss function. Consider a feature vector X, an unknown finite-valued label Y, and a hypothesis h. The canonical 0-1 loss, given by the indicator 1[h(X) ≠ Y], is considered an ideal loss function that captures the probability of incorrectly guessing the true label Y using h(X). However, since the 0-1 loss is neither continuous nor differentiable, its applicability in state-of-the-art learning algorithms is highly restricted. As a result, there has been much interest in identifying surrogate loss functions that approximate the 0-1 loss [1, 2, 3, 4, 5, 6]. Common surrogate loss functions include log-loss, squared loss, and hinge loss. However, as classification is applied to broader contexts, including scenarios where data may be noisy (e.g., incorrect labels) or have imbalanced classes (e.g., very few samples capturing anomalous events that are crucial to detect), there is a need for good surrogate loss functions that are also robust to these practical challenges.
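For concreteness, the contrast between the 0-1 loss and a smooth surrogate can be sketched as follows (illustrative code of our own; the function names are not from the paper):

```python
import math

def zero_one_loss(y_true, y_pred):
    """Canonical 0-1 loss: 1 for a wrong hard decision, 0 otherwise.
    Piecewise constant, so its gradient is zero almost everywhere."""
    return 0.0 if y_true == y_pred else 1.0

def log_loss(p_true):
    """Log-loss surrogate: the cost paid on the probability assigned to the
    true label; smooth and differentiable, unlike the 0-1 loss."""
    return -math.log(p_true)
```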
We propose to address these challenges via a recently introduced class of loss functions, α-loss. This class was originally conceived in a privacy setting in [7] by Liao et al., who introduced α-loss, parameterized by α ∈ [1, ∞], to quantify information leakage for a class of adversarial threat models. The viability of α-loss as a possible family of surrogate loss functions for binary classification was suggested in [6]. In fact, the authors proved that α-loss can be written in a margin-based form and is classification-calibrated [6]. Focusing on the classification context, we present a generalization of α-loss over the larger range α ∈ (0, ∞]. By introducing this larger range, we establish how the parameter α enables the practitioner to choose among a host of operating conditions that are important in modern machine learning tasks.
Our Contributions:
Our main contribution is a family of loss functions that, in addition to including the oft-used log-loss and highly desirable 0-1 loss, captures the inherent trade-offs between classification accuracy, convexity of the optimization landscape, and robustness to noise and outliers. We provide both theoretical and experimental evidence for these trade-offs as summarized below:
(1) We show that the margin-based form of α-loss is classification-calibrated over the entire range α ∈ (0, ∞]. Further, we prove that the margin-based α-loss is convex for α ≤ 1 and quasi-convex for α > 1. In the context of logistic regression, we exploit this margin-based form and its Lipschitz property to obtain generalization results using Rademacher complexity.
(2) For logistic regression, we show that the true risk is Strictly-Locally-Quasi-Convex (SLQC) (as defined by Hazan et al. in [8]) for all α. Furthermore, we provide bounds on the degradation of the SLQC parameters as α increases, thereby highlighting the effect of quasi-convexity (for α > 1). More broadly, we present a lemma that introduces an equivalent condition for the direction of descent in the definition of SLQC; this can be of independent interest.
(3) For logistic regression, we also provide a uniform generalization result for every α ∈ [1, ∞]. More specifically, we show that increasing α makes the empirical risk more closely resemble the probability of error for every hypothesis.
(4) Finally, we illustrate our results using well-studied datasets such as MNIST, Fashion MNIST and CIFAR-10 and neural networks including one- and two-layer CNNs. We explore three contexts to compare the accuracy of α-loss against that of log-loss: original datasets, noisy datasets (with a fraction of labels flipped), and class-imbalanced datasets (reduced samples in specific classes of the above-mentioned datasets). In particular, in the noisy-labels and imbalanced-class settings, we show that tuning α away from 1 (log-loss) can enhance both accuracy and robustness.
1.1 Related Work
Surrogate loss functions for the 0-1 loss have been widely of interest to the machine learning and statistics communities [1, 2, 3, 9, 4, 10, 11, 12, 13, 14]. Recently, there has been renewed interest in alternative losses for classification [15, 11, 14, 10, 7] other than the oft-used log-loss. During the inception of the field, convex losses were widely considered optimal [1, 4, 9, 3]. However, more recent works propose the use of non-convex losses as a means to moderate the behavior of an algorithm [10, 6, 11, 2]. In particular, motivated by superior robustness and classification accuracy of non-convex losses, Mei et al. [10] studied the empirical landscape of such functions. Along these lines, Hazan et al. [8] introduce SLQC as a means to study the unimodality of quasi-convex functions and their optimization characteristics. For SLQC functions, they introduce the Normalized Gradient Descent algorithm and prove its convergence guarantees. Using their methodology, we consider α-loss under logistic regression for binary classification and derive intuition about the operating characteristics of α-loss.
Our work is similar to [11], wherein Nguyen et al. present a tunable sigmoid loss which can be made arbitrarily close to the 0-1 loss. In essence, their loss moves from a smooth to a non-smooth loss. Our loss is always smooth and moves from convex to quasi-convex. We find that, in the setting of deep neural networks, some quasi-convexity of the loss smooths the empirical landscape. In particular, we provide strong experimental evidence for a narrow range of α to be used in practice (a limit on the amount of convexity and quasi-convexity of the loss), which significantly reduces the range of hyperparameter tuning induced by α-loss. Increasing the degree of convexity of the optimization landscape is conducive to faster optimization. Hence, our approach could serve as an alternative to other approaches whose objective is to accelerate the optimization process, e.g., the activation function tuning in [16, 17, 18] and references therein.
2 α-loss and Binary Classification
We begin by generalizing the definition of α-loss, introduced by Liao et al. [7], to the entire range α ∈ (0, ∞].
Definition 1.
Let P(Y) be the set of probability distributions over the label set Y. For α ∈ (0, 1) ∪ (1, ∞), we define α-loss ℓ^α : Y × P(Y) → R+ as

(1) ℓ^α(y, P) = (α/(α−1)) [1 − P(y)^((α−1)/α)],

and, by continuous extension, ℓ^1(y, P) = −log P(y) and ℓ^∞(y, P) = 1 − P(y).
The above definition of α-loss presents a class of loss functions that values the probabilistic estimate of the label differently for different α. For α = ∞, minimizing the corresponding risk leads to making a single guess on the most likely label; on the other hand, for α = 1, such a risk minimization involves minimizing the average log-loss, and therefore, refining a posterior belief over all y for a given observation x. As α increases from 1 to ∞, the loss function increasingly limits the effect of the low-probability outcomes; on the other hand, as α decreases from 1 towards 0, the loss function places increasingly higher weights on the low-probability outcomes until, at α = 0, by continuous extension of (1), the loss function pays an infinite cost by ignoring the training data distribution completely. This characteristic property of α-loss is highlighted in Figure 1(a) for a binomial random variable generated by a fair coin over 20 flips. Note that α quantifies the level of certainty placed on the posterior distribution. Thus, larger α indicates increasing certainty over a smaller set of labels, while smaller α distributes the uncertainty over more (and eventually, all) possible values of y. Indeed, for α = ∞, the tilted distribution becomes the hard-decoding MAP rule.
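The behavior described above can be checked numerically. The sketch below (our own code, with an illustrative API) implements Definition 1 in the form ℓ^α(y, P) = (α/(α−1))[1 − P(y)^((α−1)/α)], together with the stated continuous extensions at α = 1 and α = ∞:

```python
import math

def alpha_loss(p, alpha):
    """alpha-loss paid on the probability p in (0, 1] assigned to the true
    label; alpha = 1 (log-loss) and alpha = inf (probability of error)
    are the continuous extensions."""
    if alpha == 1.0:
        return -math.log(p)
    if math.isinf(alpha):
        return 1.0 - p
    return (alpha / (alpha - 1.0)) * (1.0 - p ** ((alpha - 1.0) / alpha))

# Larger alpha limits the penalty on low-probability outcomes, smaller
# alpha amplifies it:  alpha_loss(0.1, 4.0) < alpha_loss(0.1, 1.0)
#                      < alpha_loss(0.1, 0.5)
```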
Suppose that the feature-label pair (X, Y) ∼ P_{X,Y}. Observing X, the goal in classification is to construct an estimate P̂_{Y|X} of the true posterior P_{Y|X}. Upon inspecting (1), one may observe that the expected loss E[ℓ^α(Y, P̂_{Y|X}(·|X))], henceforth referred to as α-risk, quantifies the effectiveness of the estimated posterior P̂_{Y|X}. The following proposition provides an explicit characterization of the optimal risk-minimizing posterior under α-loss.
Proposition 1 ([7, Lemma 1]).
For each α ∈ (0, ∞], the minimal α-risk is

(2) min E[ℓ^α(Y, P̂_{Y|X}(·|X))] = (α/(α−1)) [1 − exp( ((1−α)/α) H^A_α(Y|X) )],
where H^A_α(Y|X) is the Arimoto conditional entropy of order α [19]. The resulting unique minimizer, P̂*_{Y|X}, is the α-tilted true posterior

(3) P̂*_{Y|X}(y|x) = P_{Y|X}(y|x)^α / Σ_{y′} P_{Y|X}(y′|x)^α.
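As a sketch of the tilting in (3) (our own illustrative code, assuming the minimizer is proportional to P(y|x)^α), one can verify that α = 1 leaves the posterior unchanged, α > 1 sharpens it, α < 1 flattens it, and α = ∞ recovers the hard MAP rule:

```python
import numpy as np

def tilted_posterior(p, alpha):
    """alpha-tilted distribution: proportional to p(y)**alpha."""
    p = np.asarray(p, dtype=float)
    if np.isinf(alpha):
        tilted = (p == p.max()).astype(float)  # hard-decoding MAP rule
    else:
        tilted = p ** alpha
    return tilted / tilted.sum()
```

For a posterior (0.7, 0.3), α = 2 sharpens it to roughly (0.84, 0.16), while α = 0.5 flattens it to roughly (0.60, 0.40).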
For binary classification, where Y = {−1, +1}, it is common to use classification functions of the form f : X → R such that the classifier, for any given x, outputs the hypothesis (hard decision) sign(f(x)) [1, 3, 9, 2, 6]. The value of a classification function corresponds to the certainty of an algorithm’s prediction (e.g., SVM). Examples of loss functions that act on classification functions include logistic loss, hinge loss, and square loss.
In addition to classification functions, soft classifiers may also be output by binary classification algorithms. A soft classifier corresponds to an estimate of the distribution P_{Y|X}. Log-loss, and by extension α-loss, are examples of loss functions which act on soft classifiers. In practice, a soft classifier can be obtained by composing commonly used classification functions with the sigmoid function σ : R → (0, 1), given by

(4) σ(z) = 1 / (1 + e^{−z}).
A large family of loss functions under binary classification are margin-based loss functions [1, 3, 2, 9, 15]. A loss function is said to be margin-based if, for all x ∈ X and y ∈ {−1, +1}, the loss associated to a pair (y, f(x)) is given by ℓ̃(y f(x)) for some function ℓ̃ : R → R+. In this case, the loss of the pair only depends on the product y f(x), which is called the margin. Observe that a negative margin corresponds to a mismatch between the signs of f(x) and y, i.e., a classification error by f. Similarly, a positive margin corresponds to a match between the signs of f(x) and y, i.e., a correct classification by f.
We now show that α-loss is margin-based over the entire range α ∈ (0, ∞]; this is also illustrated in Figure 1(b).
Definition 2.
The margin-based α-loss ℓ̃^α : R → R+, α ∈ (0, 1) ∪ (1, ∞), is

(5) ℓ̃^α(z) = (α/(α−1)) [1 − σ(z)^((α−1)/α)],

with ℓ̃^1(z) = log(1 + e^{−z}) and ℓ̃^∞(z) = σ(−z) by continuous extension.
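A minimal sketch of Definition 2 (our own code; the α = 1 and α = ∞ branches implement the stated continuous extensions):

```python
import math

def sigmoid(z):
    """Sigmoid function of (4)."""
    return 1.0 / (1.0 + math.exp(-z))

def margin_alpha_loss(z, alpha):
    """Margin-based alpha-loss at margin z = y*f(x), per Definition 2."""
    if alpha == 1.0:
        return math.log(1.0 + math.exp(-z))   # logistic loss
    if math.isinf(alpha):
        return sigmoid(-z)                    # sigmoid loss
    return (alpha / (alpha - 1.0)) * (1.0 - sigmoid(z) ** ((alpha - 1.0) / alpha))
```

At zero margin, for example, the α = 1 branch evaluates to log 2, and at a negative margin the loss decreases as α grows, reflecting the damping of low-probability outcomes.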
The relationship between a classification function and a soft classifier under α-loss is articulated by the following proposition. It generalizes the result of [6] to the entire range α ∈ (0, ∞].
Proposition 2.
Consider a soft classifier P̂_{Y|X}. If f(x) = σ^{−1}(P̂_{Y|X}(1|x)), then, for every y ∈ {−1, +1},

(6) ℓ^α(y, P̂_{Y|X}(·|x)) = ℓ̃^α(y f(x)).

Conversely, if f is a classification function, then the soft classifier P̂_{Y|X}(y|x) = σ(y f(x)) associated to f satisfies (6). In particular, for every x ∈ X,

(7) f(x) = log( P̂_{Y|X}(1|x) / P̂_{Y|X}(−1|x) ).
While the previous result establishes an equivalent margin-based form for α-loss, our next result establishes some of its basic optimization characteristics.
Proposition 3.
ℓ̃^α is convex for α ∈ (0, 1] and quasi-convex for α ∈ (1, ∞].
Finally, we conclude this section with another basic property of α-loss that highlights its suitability for classification. For binary classification and margin-based losses, Bartlett et al. in [1] introduce classification-calibration as a means to compare the performance of a loss function relative to the 0-1 loss. A margin-based loss function is classification-calibrated if, for every x, its minimum conditional risk is attained by an f whose sign agrees with that of 2 P_{Y|X}(1|x) − 1, where P_{Y|X} is the true posterior. Building upon such a result in [6] for α ∈ [1, ∞], the following proposition shows that ℓ̃^α is classification-calibrated for all α ∈ (0, ∞].
Proposition 4.
For every α ∈ (0, ∞], the margin-based α-loss ℓ̃^α is classification-calibrated.
3 Main Results
In this section we establish some theoretical properties regarding the performance of α-loss in a logistic regression setting. Towards this end, let X ∈ R^d be the feature vector, Y ∈ {−1, +1} the label, and S_n = {(X_1, Y_1), …, (X_n, Y_n)} the training dataset where, for each i, the samples are independently drawn according to an unknown distribution P_{X,Y}. We consider the family of soft classifiers g_θ, where θ ∈ B_d(r) and

(8) g_θ(x) = σ(⟨θ, x⟩),

with σ given as in (4). The α-loss can now be written as

(9) ℓ^α(y, g_θ(x)) = ℓ̃^α(y ⟨θ, x⟩) = (α/(α−1)) [1 − σ(y ⟨θ, x⟩)^((α−1)/α)].
A straightforward computation shows that

(10) ∂/∂θ_i ℓ̃^α(y ⟨θ, x⟩) = −[σ(−y ⟨θ, x⟩) σ(y ⟨θ, x⟩)^((α−1)/α)] y x_i,

where θ_i and x_i denote the i-th components of θ and x, respectively. Thus, the gradient of α-loss is

(11) ∇_θ ℓ̃^α(y ⟨θ, x⟩) = −F^α(θ; x, y) y x,
where F^α(θ; x, y) is the expression within brackets in (10). Finally, we define the α-risk R_α as the risk of the α-loss in (9), i.e., R_α(θ) := E[ℓ̃^α(Y ⟨θ, X⟩)]. Observe that, for all θ,

(12) R_∞(θ) = P(Y ≠ Ŷ),

where Ŷ is a random variable such that P(Ŷ = y | X = x) = σ(y ⟨θ, x⟩) for all y ∈ {−1, +1}. We define the empirical α-risk by

(13) R̂_α(θ) := (1/n) Σ_{i=1}^n ℓ̃^α(Y_i ⟨θ, X_i⟩).
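The per-sample gradient and the empirical α-risk can be sketched as follows (our own code with illustrative names; the closed-form gradient below follows from differentiating the margin form, and is sanity-checked against finite differences):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def margin_alpha_loss(z, alpha):
    """Margin-based alpha-loss at margin z; alpha in (1, inf) assumed here."""
    return (alpha / (alpha - 1.0)) * (1.0 - sigmoid(z) ** ((alpha - 1.0) / alpha))

def grad_theta(theta, x, y, alpha):
    """Per-sample gradient: differentiating the margin form gives
    -sigmoid(-z) * sigmoid(z)**((alpha-1)/alpha) * y * x at z = y*<theta, x>."""
    z = y * np.dot(theta, x)
    return -sigmoid(-z) * sigmoid(z) ** ((alpha - 1.0) / alpha) * y * x

def empirical_alpha_risk(theta, X, y, alpha):
    """Empirical alpha-risk: the sample average of the margin loss."""
    return float(np.mean(margin_alpha_loss(y * (X @ theta), alpha)))

# Finite-difference sanity check of the closed-form gradient (y = 1).
rng = np.random.default_rng(0)
theta, x, alpha = rng.normal(size=3), rng.normal(size=3), 2.5
eps = 1e-6
numeric = np.array([
    (margin_alpha_loss((theta + eps * e) @ x, alpha) -
     margin_alpha_loss((theta - eps * e) @ x, alpha)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(numeric, grad_theta(theta, x, 1.0, alpha), atol=1e-6)
```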
3.1 The Optimization Landscape of α-loss via SLQC
Next we provide some insight regarding the convexity degradation of the optimization landscape as α increases. In order to do so, we rely on a relaxed form of convexity called Strict-Local-Quasi-Convexity. We begin by recalling the definition of this notion introduced by Hazan et al. in [8]. For θ ∈ R^d and r > 0, we let B(θ, r) := {θ′ ∈ R^d : ‖θ − θ′‖ ≤ r}.
Definition 3.
Let ε, κ > 0 and θ* ∈ R^d. We say that f : R^d → R is (ε, κ, θ*)-Strictly-Locally-Quasi-Convex (SLQC) in θ if at least one of the following applies:

1. f(θ) − f(θ*) ≤ ε;

2. ‖∇f(θ)‖ > 0, and for every y ∈ B(θ*, ε/κ) it holds that ⟨∇f(θ), y − θ⟩ ≤ 0.
Intuitively, when θ* is a minimizer of an (ε, κ, θ*)-SLQC function f, then, for every θ, either θ is ε-close to optimal or descending against the gradient leads in the ‘right’ direction, namely, towards a small ball around θ*. This relaxed notion of convexity comes with a natural adaptation of the Gradient Descent (GD) algorithm: Normalized Gradient Descent (NGD) [8], summarized in Algorithm 1.
Similar to the convergence guarantees for GD on convex problems, NGD comes with natural convergence guarantees for SLQC problems as shown in [8, Theorem 4.1].
Proposition 5.
Fix ε > 0, let f : R^d → R, and θ* = argmin_θ f(θ). If f is (ε, κ, θ*)-SLQC in every θ, then running Algorithm 1 with T ≥ κ²‖θ_1 − θ*‖²/ε² iterations and learning rate η = ε/κ, we have min_{t=1,…,T} f(θ_t) − f(θ*) ≤ ε.
It is important to note that, for an SLQC function, a smaller ε provides better optimality guarantees. Given ε, a smaller κ leads to faster optimization, as the number of required iterations T increases with κ. It should also be noted that, by using projections, NGD can be easily adapted to work over convex and closed sets such as B_d(r) (see, for example, [8]).
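A minimal sketch of NGD (following the description in [8]; this is not the authors’ code, and Algorithm 1 may differ in details such as which iterate is returned):

```python
import numpy as np

def ngd(grad_fn, theta0, eta, T):
    """Normalized Gradient Descent: step along the gradient *direction* only,
    discarding its magnitude. Returns all iterates; for SLQC objectives one
    reports the best iterate rather than the last."""
    theta = np.asarray(theta0, dtype=float).copy()
    iterates = [theta.copy()]
    for _ in range(T):
        g = np.asarray(grad_fn(theta), dtype=float)
        norm = np.linalg.norm(g)
        if norm == 0.0:                 # critical point reached
            break
        theta = theta - eta * g / norm
        iterates.append(theta.copy())
    return iterates
```

On the convex test objective f(θ) = ‖θ‖² from θ₀ = (3, 4) with η = 0.5, the best iterate lands within η of the minimizer, mirroring the role of η = ε/κ in Proposition 5.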
We are now in a position to establish bounds on the SLQC constants of the α-risk R_α in the logistic model. The following theorem identifies the SLQC constants of R_α in the regime α ∈ (1, ∞). In this context, we assume that we are in a unimodal realizable setting, i.e., R_α has a unique critical point θ* and

(14) R_α(θ*) = min_θ R_α(θ).

The proof, which can be found in Appendix B, relies on three main facts: (i) R_α is quasi-convex, (ii) R_α is L_α-Lipschitz for an explicit constant L_α, and (iii) R_α has a unique critical point due to the unimodality assumption.
Theorem 1.
If α ∈ (1, ∞) and ε > 0, then, for every θ, the α-risk R_α is (ε, L_α, θ*)-SLQC in θ, where L_α is the Lipschitz constant of R_α established in Appendix B.
We establish the following lemma, which might be of general interest; the proof is given in Appendix C. Indeed, it provides further insight into the theory of Hazan et al. with regard to SLQC functions.
Lemma 1.
Assume that f : R^d → R is differentiable, ε, κ > 0, and θ* ∈ R^d. If θ ∈ R^d is such that ‖∇f(θ)‖ > 0, then the following are equivalent:

(i) ⟨∇f(θ), y − θ⟩ ≤ 0 for all y ∈ B(θ*, ε/κ);

(ii) ⟨∇f(θ)/‖∇f(θ)‖, θ − θ*⟩ ≥ ε/κ.
We are now in a position to state our second result, which provides some insight into the degradation of the SLQC constants of the α-risk as α increases from 1. For ease of notation, let
(15) 
Theorem 2.
Let 1 ≤ α_1 < α_2 and ε > 0. If R_{α_1} is (ε, κ_1, θ*)-SLQC in θ, then R_{α_2} is (ε, κ_2, θ*)-SLQC in θ for some κ_2 ≥ κ_1 quantified via (15).

The proof of this theorem relies on a handful of lemmas whose proofs are given in Appendix D. The theorem intuitively states that increasing the value of α from α_1 to α_2 increases the value of κ by a multiplicative factor and correspondingly shrinks the ball of radius ε/κ around θ*, both of which hinder the optimization process and increase the required number of iterations.
3.2 Generalization and Control on the Probability of Error
In this section, we focus on the regime α ∈ [1, ∞] to highlight two aspects of α-loss: (i) generalization properties, and (ii) uniform guarantees on accuracy. We begin with generalization.
Theorem 3.
If α ∈ [1, ∞], then, with probability at least 1 − δ, for all θ ∈ B_d(r),
(16) 
where and , for .
Note that the constants in (16) are monotonically decreasing in α and are continuously extended at α = ∞. The proof of Theorem 3 relies on classical results in Rademacher complexity; all details are in Appendix E.
Note that this generalization guarantee has significantly better dependence on n than the one derived by Sypherd et al. in [6]. The authors there, building on [10], are only able to provide a guarantee at the critical points of the landscape. Instead, we use classical results in Rademacher complexity to achieve a uniform generalization guarantee over all of B_d(r). Furthermore, our constants are explicitly specified in terms of parameters critical to the landscape.
The next corollary stems from Theorem 3; it explicitly bounds the gap between the empirical α-risk and the probability of error.
Corollary 1.
Thus, we see that for any α ∈ [1, ∞), increasing α implies that R̂_α gets closer to R̂_∞, which is the empirical probability of error of the associated randomized estimator. One can view the third term in (17) as a penalty term which, for a given α, arises from the discrepancy between the (true) α-risk and the probability of error. Indeed, (17) collapses to Theorem 3 for α = ∞.
4 Experimental Illustration of Results
In this section, we provide experimental illustration of our theoretical results. In particular, we use the following datasets: MNIST [20], Fashion MNIST [21], and CIFAR-10 [22]. These datasets come with predefined training and test sets. We divide the original training set into training and validation sets using a 90%/10% split, respectively; all training is done in batches of 256.
We use a variety of neural network (NN) architectures, as appropriate for the datasets.
These include:
(i) logistic regression (LR) using a two-layer NN with softmax activation;
(ii) three-layer NN (3NN) with a 128-neuron hidden layer with ReLU activation and a softmax-activated output layer;
(iii) CNN1: one convolutional (conv) layer with 32 kernels of size 3x3, one pooling layer with 2x2 kernels, a 512-unit fully connected layer with ReLU activation (dense layer), and an output layer;
(iv) CNN2: uses blocks similar to CNN1 (conv, pool, and dense) and has layers in the following order: conv (32 kernels of 3x3), pool (2x2), conv (64 kernels of 3x3), pool (2x2), dense, and output.
For each dataset, we train each of the NN architectures described above over a grid of α values, each for 100 epochs. For a particular α, the optimal epoch hyperparameter is the one at which the validation loss is minimized; finally, we choose the optimal α as the one achieving the best validation accuracy. We use the resulting model parameters (including the hyperparameter α, optimal epoch value, and learning rate) to report accuracy on the test set. At the outset, we tested both SGD and Adam [23] to compare α-loss and log-loss; for the noisy and class-imbalanced datasets we only use Adam.
We evaluate the efficacy of α-loss in three scenarios as detailed below. For each, our metric of comparison is the relative gain in accuracy (Acc.) from using α-loss (αL) over log-loss (LL), computed as (αL Acc. − LL Acc.)/(LL Acc.).
(i) Performance of α-loss relative to log-loss for each dataset: For each dataset, Table 1 compares the best α-loss accuracy to that of log-loss over all architectures and optimizers considered. Results for all architectures and optimizer choices are collected in Table 4 of Appendix F; these suggest that α-loss loses nothing in accuracy and, more often than not, offers improvements in accuracy.
(ii) Resilience to errors in the training set: We test the robustness of the loss function to noisy data by flipping a percentage of the labels uniformly at random (to one of the remaining 9 classes for each label). We vary the flipping percentage from 5–25% in steps of 5%. The results for 0, 10 and 20% are summarized in Table 2 and highlight that α > 1 significantly improves accuracy (see also Table 5 in Appendix F). One can view this as the resiliency of α-loss to training under data poisoning attacks.
(iii) Robustness to class imbalances: We perform this for binary-CIFAR and CIFAR-10 datasets. We create binary-CIFAR datasets by considering two class pairs: (i) automobile and truck; and (ii) cat and dog. For each such binary dataset, we reduce the size of one class to 10% of its original size, so that the affected class makes up about 9% of the whole training and validation set. The test set is unchanged. Both the overall accuracy and that for the less-frequent (imbalanced) class are summarized in Table 3 for CNN1. Results for CNN2 are in Appendix F, Table 6; these results suggest that, in general, α < 1 achieves significantly better classification (the rare cases for which it does not hint at a trade-off between accuracy and imbalance). Results for a 10% class imbalance in just one CIFAR-10 class with CNN1 and CNN2 are summarized in Table 7 of Appendix F; these results reveal significant accuracy gains in detecting the imbalanced class for α < 1.
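The two dataset corruptions used in scenarios (ii) and (iii) can be sketched as follows (our own illustrative code, not the authors’ experimental pipeline):

```python
import numpy as np

def flip_labels(y, frac, n_classes, seed=0):
    """Flip a fraction of labels uniformly at random to a *different* class,
    as in the noisy-label experiments."""
    rng = np.random.default_rng(seed)
    y = np.array(y)
    flip_idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    for i in flip_idx:
        y[i] = rng.choice([c for c in range(n_classes) if c != y[i]])
    return y

def imbalance_class(X, y, cls, keep_frac, seed=0):
    """Keep only keep_frac of the examples of class `cls`, leaving the other
    classes intact, as in the class-imbalance experiments."""
    rng = np.random.default_rng(seed)
    cls_idx = np.flatnonzero(y == cls)
    kept = rng.choice(cls_idx, size=int(keep_frac * len(cls_idx)), replace=False)
    sel = np.sort(np.concatenate([kept, np.flatnonzero(y != cls)]))
    return X[sel], y[sel]
```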
Table 1: Best α-loss accuracy vs. log-loss accuracy for each dataset.

Dataset | Optimizer | Architecture | LL Acc. | αL Acc. | α | Rel. Gain %
MNIST | SGD | 3NN | 0.981 | 0.981 | 1.2 | 0.038
Fashion MNIST | Adam | 3NN | 0.886 | 0.891 | 1.5 | 0.561
CIFAR-10 | Adam | CNN2 | 0.724 | 0.729 | 0.95 | 0.648
Table 2: Accuracy under noisy (flipped) training labels.

Dataset | Architecture | Label Flip % | LL Acc. | αL Acc. | α | Rel. Gain %
MNIST | LR | 0 | 0.927 | 0.934 | 2.0 | 0.735
MNIST | LR | 10 | 0.913 | 0.933 | 2.5 | 2.190
MNIST | LR | 20 | 0.907 | 0.931 | 2.0 | 2.706
CIFAR-10 | CNN2 | 0 | 0.724 | 0.729 | 0.95 | 0.648
CIFAR-10 | CNN2 | 10 | 0.693 | 0.713 | 1.7 | 2.906
CIFAR-10 | CNN2 | 20 | 0.672 | 0.696 | 2.0 | 3.477
Table 3: Accuracy under class imbalance (CNN1); C1 is the imbalanced class.

Architecture | C1 | C2 | LL Acc. (Imb.) | LL Acc. (Overall) | αL Acc. (Imb.) | αL Acc. (Overall) | Rel. Gain % (Imb.) | Rel. Gain % (Overall) | α
CNN1 | Auto | Truck | 0.351 | 0.666 | 0.416 | 0.696 | 18.518 | 4.501 | 0.95
CNN1 | Truck | Auto | 0.348 | 0.664 | 0.404 | 0.688 | 16.091 | 3.686 | 0.83
CNN1 | Cat | Dog | 0.111 | 0.549 | 0.174 | 0.573 | 56.756 | 4.458 | 0.9
CNN1 | Dog | Cat | 0.055 | 0.523 | 0.074 | 0.53 | 34.545 | 1.337 | 0.99
Our experimental results also help evaluate the trade-off between accuracy and convergence speed. We illustrate this in Figure 3 in Appendix F for the CIFAR-10 dataset using CNN2. In particular, we plot the validation accuracy as a function of the number of epochs for four different values of α. These curves, albeit noisy, buttress our theoretical results suggesting that the number of iterations to converge increases with α.
In summary, we find that the best value of α is dependent on the setting. Since the α-risk for larger α more closely resembles the probability of error, we find that, in the noisy-label setting, the robustness and quasi-convexity of α-loss for α > 1, together with its generalization properties, make it more appropriate than log-loss (α = 1). For the datasets with class imbalances, we see that α-loss for α < 1 performs best; this is because the more convex losses place more weight on the outliers (see Figure 1). In practice, we find that a limited range of α is most representative of the entire set. For such a range, it is not difficult to do a careful grid search to find the best value of α for a given setting. The significant performance gains we observe here for noisy and imbalanced datasets could outweigh such (limited-range) tuning of an additional (α) hyperparameter.
5 Concluding Remarks
We have presented convexity-robustness trade-offs for a class of surrogate loss functions, α-loss, α ∈ (0, ∞]. In addition to strong surrogate properties relative to the 0-1 loss, our theoretical and experimental results suggest that this class has the potential to address many emerging challenges, including class imbalance and noisy data. Precise characterization of accuracy gains as a function of α, the empirical landscape of α-loss, and its robustness to class imbalance are promising avenues for future work. Finally, we expect that our equivalent conditions for local quasi-convexity may be of broad interest, and we leave a deeper exploration of this to future work.
References
 [1] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, classification, and risk bounds,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, 2006.
 [2] H. MasnadiShirazi and N. Vasconcelos, “On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost,” in Advances in neural information processing systems, 2009, pp. 1049–1056.
 [3] Y. Lin, “A note on marginbased loss functions in classification,” Statistical & Probability Letters, vol. 68, no. 1, pp. 73–82, 2004.
 [4] L. Rosasco, E. D. Vito, A. Caponnetto, M. Piana, and A. Verri, “Are loss functions all the same?” Neural Computation, vol. 16, no. 5, pp. 1063–1076, 2004.
 [5] R. Nock and F. Nielsen, “On the efficient minimization of classification calibrated surrogates,” in Advances in neural information processing systems, 2009, pp. 1201–1208.
 [6] T. Sypherd, M. Diaz, L. Sankar, and P. Kairouz, “A tunable loss function for binary classification,” CoRR, vol. abs/1902.04639, 2019. [Online]. Available: http://arxiv.org/abs/1902.04639
 [7] J. Liao, O. Kosut, L. Sankar, and F. P. Calmon, “A tunable measure for information leakage,” in 2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2018, pp. 701–705.
 [8] E. Hazan, K. Levy, and S. ShalevShwartz, “Beyond convexity: Stochastic quasiconvex optimization,” in Advances in Neural Information Processing Systems, 2015, pp. 1594–1602.
 [9] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “On surrogate loss functions and f-divergences,” The Annals of Statistics, vol. 37, no. 2, pp. 876–904, Apr. 2009.
 [10] S. Mei, Y. Bai, and A. Montanari, “The landscape of empirical risk for nonconvex losses,” The Annals of Statistics, vol. 46, no. 6A, pp. 2747–2774, 2018.
 [11] T. Nguyen and S. Sanner, “Algorithms for direct 0–1 loss optimization in binary classification,” in International Conference on Machine Learning, 2013, pp. 1085–1093.
 [12] A. Singh and J. C. Principe, “A loss function for classification based on a robust similarity metric,” in The 2010 International Joint Conference on Neural Networks (IJCNN). IEEE, 2010, pp. 1–6.
 [13] A. Tewari and P. L. Bartlett, “On the consistency of multiclass classification methods,” Journal of Machine Learning Research, vol. 8, no. May, pp. 1007–1025, 2007.
 [14] L. Zhao, M. Mammadov, and J. Yearwood, “From convex to nonconvex: a loss function analysis for binary classification,” in 2010 IEEE International Conference on Data Mining Workshops. IEEE, 2010, pp. 1281–1288.
 [15] K. Janocha and W. M. Czarnecki, “On loss functions for deep neural networks in classification,” arXiv preprint arXiv:1702.05659, 2017.
 [16] L. Benigni and S. Péché, “Eigenvalue distribution of nonlinear models of random matrices,” arXiv preprint arXiv:1904.03090, 2019.
 [17] L. Xiao, Y. Bahri, J. SohlDickstein, S. S. Schoenholz, and J. Pennington, “Dynamical isometry and a mean field theory of cnns: How to train 10,000layer vanilla convolutional neural networks,” arXiv preprint arXiv:1806.05393, 2018.
 [18] J. Pennington and P. Worah, “Nonlinear random matrix theory for deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 2637–2646.
 [19] S. Arimoto, “Information measures and capacity of order α for discrete memoryless channels,” Topics in Information Theory, 1977.
 [20] Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/index.html.
 [21] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
 [22] A. Krizhevsky, V. Nair, and G. Hinton, “The CIFAR-10 dataset,” online: http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
 [23] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [24] S. ShalevShwartz and S. BenDavid, Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
 [25] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. 13th Intl. Conf. on AI and Statistics, Sardinia, Italy, 13–15 May 2010, pp. 249–256.
Appendix A Proofs of Margin-Based α-loss Propositions
In this section, we provide proofs of the propositions stated in Section 2.
A.1 Proof of Proposition 3
A.2 Proof of Proposition 4
The case α ∈ [1, ∞] is given in [6].
Before stating the proof of Proposition 4, we provide the full definition of classification-calibration given in [1] for the sake of completeness.
Definition 4 ([1, Definition 1]).
A margin-based loss function ℓ̃ is said to be classification-calibrated if, for every η ≠ 1/2,
(19) 
Proof.
For α ∈ (0, 1], we rely on the following theorem proved by Bartlett et al. in [1].
Proposition 6 ([1, Theorem 6]).
Let ℓ̃ denote a margin-based loss function and suppose it is a convex function of the margin. Then ℓ̃ is classification-calibrated if and only if it is differentiable at 0 and ℓ̃′(0) < 0.
Observe that ℓ̃^α for α ∈ (0, 1] is a convex function of the margin, as shown by Proposition 3, and, since the loss is monotonically decreasing, it satisfies the conditions of Proposition 6. Thus, ℓ̃^α is classification-calibrated for α ∈ (0, 1]. The optimal classifier in this regime can be found by considering the tilted distribution (3) in conjunction with Proposition 2. ∎
Appendix B Proof of Theorem 1
We begin by establishing that the α-risk is Lipschitz.
Lemma 2.
If α ∈ (1, ∞), then R_α is L_α-Lipschitz in B_d(r), i.e.,
(20) 
Proof.
We show that R_α is Lipschitz in B_d(r) by showing that the norm of its gradient is uniformly bounded on B_d(r) by L_α. Since both the loss and its gradient are bounded, differentiation under the integral sign leads to
(21)  
(22)  
(23) 
Since , we obtain that
(24)  
(25) 
By definition, equals
(26) 
Using the fact , it can be shown that . Since
(27) 
for all , (26) is easily upper bounded by . ∎
We use the following proposition given by Hazan et al. in [8] to prove the first theorem.
Proposition 7 ([8]).
If f is G-Lipschitz and a strictly-quasi-convex function, then, for every ε > 0 and θ* ∈ R^d, it holds that f is (ε, G, θ*)-SLQC in every θ.
Now, we present the proof for Theorem 1.
Proof.
We prove Theorem 1 by applying Proposition 7. As shown in Lemma 2, R_α is L_α-Lipschitz for all α ∈ (1, ∞). Next we show quasi-convexity of R_α via its level sets. Since ℓ̃^α(⟨θ, x⟩) is a quasi-convex function of θ for α > 1 and, in our setting, the level sets of R_α are convex, we have that R_α is quasi-convex for α > 1. With regards to strictness, we follow the requirement of Hazan et al. [8]: we say that f is strictly quasi-convex if it is quasi-convex and its gradients vanish only at the global minima, that is, for all θ such that ‖∇f(θ)‖ > 0, it holds that f(θ) > min_{θ′} f(θ′). Since we assume that we are in the realizable setting where θ* is the only point for which ∇R_α(θ*) = 0, R_α is strictly quasi-convex. ∎
Appendix C Proof of Lemma 1
The following lemma helps to quantify the range of acceptable gradient angles for an SLQC function.
Lemma 3.
Let ε, κ > 0 and θ* ∈ R^d. Suppose that f satisfies that, for all y ∈ B(θ*, ε/κ),
(28) 
Pick θ such that ‖∇f(θ)‖ > 0. If φ denotes the angle between ∇f(θ) and θ − θ*, then,
(29) 
Proof.
Consider a supporting hyperplane of the ball B(θ*, ε/κ) which contains θ, as shown in Figure 2. Let y′ be the intersection between the hyperplane and the ball; note that (28) holds for this y′ by assumption.
Next, consider the right triangle generated by θ, θ*, and y′, and let ψ be the acute angle between the hyperplane perpendicular to ∇f(θ) and the segment joining θ and θ*. By the similarity of the angles, it can be seen that this is the angle at θ of the right triangle. So, by Pythagoras’ theorem,
(30) 
By assumption (28), it can be shown that the claimed bound on the angle holds. Furthermore, y′ as chosen here minimizes the bound, i.e., any other y would impose a less restrictive constraint on the magnitude of the angle by making the corresponding quantity larger. ∎
We now provide the proof of Lemma 1.
Proof.
Let φ and y′ be as in the proof of the previous lemma, so that ‖∇f(θ)‖ > 0. By the definition of the inner product, it follows that
(31) 
Rearranging we get
(32) 
Further, φ is bounded as in the previous lemma. This requirement is equivalent to
(33) 
Since the cosine is a monotonically decreasing function on [0, π], the previous condition is equivalent to
(34) 
Notice that the steps are reversible, so we have both directions. ∎
Appendix D Proof of Theorem 2
We begin by proving a set of lemmas which are necessary for the proof of Theorem 2.
The first lemma shows that the α-risk is Lipschitz as a function of α.
Lemma 4.
If , then , where .
Proof.
To show that R_α is Lipschitz in α, it suffices to show that its partial derivative with respect to α is uniformly bounded. Observe that
(35) 
where the second equality follows since we assume well-behaved integrals. We may rewrite this expression as
(36) 
Consider, without loss of generality, the expression in the first brackets for a fixed pair (x, y). We denote this expression as
(37) 
It can be shown that
(38) 
By L’Hôpital’s rule, we have that
(39) 
Further, it can be shown that this expression is monotonically decreasing in α. Thus,
(40) 
Upon plugging (40) into (36), the expectation sums to 1, and we achieve the desired bound on the partial derivative of R_α with respect to α. ∎
The second lemma shows that the gradient of the α-risk with respect to the parameter vector θ is Lipschitz as a function of α.
Lemma 5.
If , then where
(41) 
Proof.
By the linearity of the expectation, we can bound the difference of gradients as
(42)  
(43)  
(44) 
since X has bounded support. Here, we explore the Lipschitz continuity, in α, of the bracketed factor in the gradient (10). To do so, we calculate the maximum value of its partial derivative with respect to α. Consider
(45) 
For any pair (x, y), only one term is active, so let y = 1 without loss of generality. Thus,
(46) 
In particular, we have that