A New Smooth Approximation to the Zero One Loss with a Probabilistic Interpretation
Abstract
We examine a new form of smooth approximation to the zero-one loss in which learning is performed using a reformulation of the widely used logistic function. Our approach is based on using the posterior mean of a novel generalized Beta-Bernoulli formulation. This leads to a generalized logistic function that approximates the zero-one loss, but retains a probabilistic formulation conferring a number of useful properties. The approach is easily generalized to kernel logistic regression and easily integrated into methods for structured prediction. We present experiments in which we learn such models using an optimization method that combines gradient descent with coordinate descent using localized grid search so as to escape from local minima. Our experiments indicate that optimization quality is improved when learning meta-parameters are themselves optimized using a validation set. Our experiments show improved performance relative to widely used logistic and hinge loss methods on a wide variety of problems ranging from standard UC Irvine and libSVM evaluation datasets to product review predictions and a visual information extraction task. We observe that the approach: 1) is more robust to outliers than the logistic and hinge losses; 2) outperforms comparable logistic and max-margin models on larger scale benchmark problems; 3) when combined with a Gaussian-Laplacian mixture prior on parameters, the kernelized version of our formulation yields sparser solutions than Support Vector Machine classifiers; and 4) when integrated into a probabilistic structured prediction technique, our approach provides more accurate probabilities, yielding improved inference and increasing information extraction performance.
1 Introduction
Loss function minimization is a standard way of solving many important learning problems. In the classical statistical literature, this is known as Empirical Risk Minimization (ERM) [18], where learning is performed by minimizing the average risk or loss over the training data. Formally, this is represented as
(1)  $f^{*} = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^{n} L(f(\mathbf{x}_i), t_i)$
where $f \in F$ is a model, $\mathbf{x}_i$ is the $i$-th input feature vector with label $t_i$, there are $n$ pairs of features and labels, and $L(f(\mathbf{x}_i), t_i)$ is the loss for the model output $f(\mathbf{x}_i)$. Let us focus for the moment on the standard binary linear classification task in which we encode the target class label as $t_i \in \{-1, +1\}$ and the model parameter vector as $\mathbf{w}$. Letting $z_i = t_i \mathbf{w}^T \mathbf{x}_i$, we can define the logistic, hinge, and 0-1 loss as
(2)  $L_{\log}(z_i) = \log[1 + \exp(-z_i)]$
(3)  $L_{\text{hinge}}(z_i) = \max(0, 1 - z_i)$
(4)  $L_{01}(z_i) = I[z_i \le 0]$
where $I[\cdot]$ is the indicator function which takes the value of 1 when its argument is true and 0 when its argument is false. Of course, loss functions can be more complex, for example defined and learned through a linear combination of simpler basis loss functions [20], but we focus on the widely used losses above for now.
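To make the comparison between these three losses concrete, the following minimal sketch evaluates each of them as a function of the margin. It assumes the conventions used in this section, namely labels in {-1, +1} and a margin equal to the label times the linear score; the function names are illustrative and do not come from the original work.

```python
import math

def margin(w, x, t):
    """Margin z = t * w^T x for a label t in {-1, +1}."""
    return t * sum(wj * xj for wj, xj in zip(w, x))

def logistic_loss(z):
    """log(1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def hinge_loss(z):
    """max(0, 1 - z)."""
    return max(0.0, 1.0 - z)

def zero_one_loss(z):
    """1 when the example is misclassified (z <= 0), else 0."""
    return 1.0 if z <= 0 else 0.0

# Badly misclassified points (very negative z) are penalised roughly
# linearly by the logistic and hinge losses, but only by 1 under 0-1.
for z in (-5.0, -1.0, 0.5, 3.0):
    print(z, logistic_loss(z), hinge_loss(z), zero_one_loss(z))
```

The loop illustrates the point made in the text: the two convex losses keep growing as a point moves further onto the wrong side of the boundary, while the 0-1 loss stays flat.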
Different loss functions characterize the classification problem differently. The log logistic loss and the hinge loss are very similar in their shape, which can be verified from Figure 1. Logistic regression models involve optimizing the log logistic loss, while optimizing a hinge loss is at the heart of Support Vector Machines (SVMs). While seemingly a sensible objective for a classification problem, empirical risk minimization with the 0-1 loss function is known to be an NP-hard problem [7].
Both the log logistic loss and the hinge loss are convex and therefore lead to optimization problems with a global minimum. However, both the log logistic loss and the hinge loss penalize a model heavily when data points are classified incorrectly and are far away from the decision boundary. As can be seen in Figure 1, their penalties can be much larger than that of the zero-one loss. The zero-one loss captures the intuitive goal of simply minimizing classification errors, and recent research has been directed to learning models using a smoothed zero-one loss approximation [24, 14]. Previous work has shown that both the hinge loss [24] and more recently the 0-1 loss [14] can be efficiently and effectively optimized directly using smooth approximations. The work in [14] also underscored the robustness advantages of the 0-1 loss to outliers. While the 0-1 loss is not convex, the current flurry of activity in the area of deep neural networks as well as the award winning work on 0-1 loss approximations in [2] have highlighted numerous other advantages to the use of non-convex loss functions. In our work here, we are interested in constructing a probabilistically formulated smooth approximation to the 0-1 loss.
Let us first compare the widely used log logistic loss with the hinge loss and the 0-1 loss in a little more detail. The log logistic loss from the well known logistic regression model arises from the form of negative log likelihood defined by the model. More specifically, this logistic loss arises from a sigmoid function parametrizing probabilities and is easily recovered by rearranging (2) to obtain a probability model of the form $p(t_i \mid \mathbf{x}_i, \mathbf{w}) = [1 + \exp(-z_i)]^{-1}$. In our work here, we take this familiar logistic function and transform it to create a new functional form. The sequence of curves starting with the blue curve in Figure 2 (top) gives an intuitive visualization of the way in which we alter the traditional log logistic loss. We call our new loss function the generalized Beta-Bernoulli logistic loss and use the acronym $BB\gamma$ when referring to it. We give it this name as it arises from the combined use of a Beta-Bernoulli distribution and a generalized logistic parametrization.
We give the Bayesian motivations for our Beta-Bernoulli construction in section 3. To gain some additional intuition about the effect of our construction from a practical perspective, consider the following analysis. When viewing the negative log likelihood of the traditional logistic regression parametrization as a loss function, one might pose the following question: (1) what alternative functional form for the underlying probability would lead to a loss function exhibiting a plateau similar to the 0-1 loss for incorrectly classified examples? One might also pose a second question: (2) is it possible to construct a simple parametrization in which a single parameter controls the sharpness of the smooth approximation to the 0-1 loss? The intuition for an answer to the first question is that the traditional logistic parametrization converges to zero probability for small (i.e. large negative) values of its argument. This in turn leads to a loss function that grows linearly for small values of $z_i$, as shown in Figure 1. In contrast, our new loss function is defined in such a way that for small values of $z_i$, the function converges to a non-zero probability. This effect manifests itself as the desired plateau, which can be seen clearly in the loss functions defined by our model in Figure 2 (top). The answer to our second question is indeed yes; more specifically, to control the sharpness of our approximation, we use a $\gamma$ factor reminiscent of a technique used in previous work which has created smooth approximations to the hinge loss [24] as well as smooth approximations of the 0-1 loss [14]. We show the intuitive effect of our construction for increasing values of $\gamma$ in Figure 2 and define it more formally below.
To compare and contrast our loss function with other common loss functions such as those in equations (2)-(4) and others reviewed below, we express our loss here using $z_i$ and $\gamma$ as arguments. For $t_i = 1$, the $BB\gamma$ loss can be expressed as
(5)  $L_{BB\gamma}(z_i, \gamma) = -\log\left(a + b\,[1 + \exp(-\gamma z_i)]^{-1}\right)$
while for $t_i = -1$ it can be expressed as
(6)  $L_{BB\gamma}(z_i, \gamma) = -\log\left(1 - \left(a + b\,[1 + \exp(\gamma z_i)]^{-1}\right)\right)$
We show in section 3 that the constants $a$ and $b$ have well defined interpretations in terms of the standard $\alpha$, $\beta$, and $n$ parameters of the Beta distribution. Their impact on our proposed generalized Beta-Bernoulli loss arises from applying a fuller Bayesian analysis to the formulation of a logistic function.
The visualization of our proposed loss in Figure 2 corresponds to the use of a weak noninformative prior such as and and . In Figure 2, we show the probability given by the model as a function of at the right and the negative log probability or the loss on the left as is varied over the integer powers in the interval . We see that the logistic function transition becomes more abrupt as increases. The loss function behaves like the usual logistic loss for close to 1, but provides an increasingly more accurate smooth approximation to the zero one loss with larger values of . Intuitively, the location of the plateau of the smooth log logistic loss approximation on the yaxis is controlled by our choice of , and . The effect of the weak uniform prior is to add a small minimum probability to the model, which can be imperceptible in terms of the impact on the sigmoid function log space, but leads to the plateau in the negative log loss function. By contrast, the use of a strong prior for the losses in Figure 5 (left) leads to minimum and maximum probabilities that can be much further from zero and one.
Our work makes a number of contributions which we enumerate here: (1) The primary contribution of our work is a new probabilistically formulated approximation to the 0-1 loss based on a generalized logistic function and the use of the Beta-Bernoulli distribution. The result is a generalized sigmoid function in both probability and negative log probability space. (2) A second key contribution of our work is that we present and explore an adapted version of the optimization algorithm proposed in [14] in which we optimize the meta-parameters of learning using validation sets. We present a series of experiments in which we optimize the loss using the basic algorithm from [14] and our modified version. For linear models, we show that our complete approach outperforms the widely used techniques of logistic regression and linear support vector machines. As expected, our experiments indicate that the relative performance of the approach further increases when noisy outliers are present in the data. (3) We go on to present a number of experiments with larger scale data sets demonstrating that our method also outperforms widely used logistic regression and SVM techniques despite the fact that the underlying models involved are linear. (4) We apply our model in a structured prediction task formulated for mining faces in Wikipedia biography pages. Our proposed method is well adapted to this setting and we find that the improved probabilistic modeling capabilities of our approach yield improved results for visual information extraction through improved probabilistic structured prediction. (5) We show how this approach is easily adapted to create a novel form of kernel logistic regression based on our generalized Beta-Bernoulli Logistic Regression (BBLR) framework. We find that the kernelized version of our method, Kernel BBLR (KBBLR), outperforms non-linear support vector machines.
As expected, the regularized KBBLR does not yield sparse solutions; however, (6) since we have developed a robust method for optimizing a non-convex loss, we propose and explore a novel non-convex sparsity encouraging prior based on a mixture of a Gaussian and a Laplacian. Sparse KBBLR typically yields sparser solutions than SVMs with comparable prediction performance, and the degree of sparsity scales much more favorably compared to SVMs.
The remainder of this paper is structured as follows. In section 2, we present a review of some relevant recent work in the area of 0-1 loss approximation. In section 3, we present the underlying Bayesian motivations for our proposed loss function. In section 4, we provide the details of optimization and algorithms. In section 5, we present experimental results using protocols that both facilitate comparisons with prior work and evaluate our method on some large scale and structured prediction problems. We provide a final discussion and conclusions in section 6.
2 Relevant Recent Work
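It has been shown in [24] that it is possible to define a generalized logistic loss and produce a smooth approximation to the hinge loss using the following formulation
(7)  $L_{\text{glog}}(t_i, \mathbf{x}_i; \mathbf{w}, \gamma) = \frac{1}{\gamma} \log\left[1 + \exp\left(\gamma (1 - t_i \mathbf{w}^T \mathbf{x}_i)\right)\right]$
(8)  $\qquad = \gamma^{-1} \log\left[1 + \exp\left(\gamma (1 - z_i)\right)\right]$
such that $\lim_{\gamma \to \infty} L_{\text{glog}} = L_{\text{hinge}}$. This approximation is achieved using a $\gamma$ factor and a shifted version of the usual logistic loss. We illustrate the way in which this construction can be used to approximate the hinge loss in Figure 3 (left).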
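The way a sharpness factor turns a shifted logistic loss into a hinge approximation can be checked numerically. The sketch below assumes the commonly used form (1/gamma) * log(1 + exp(gamma * (1 - z))) for the generalized logistic loss; the function names are illustrative.

```python
import math

def softplus(u):
    """log(1 + exp(u)), written to avoid overflow for large u."""
    if u > 0:
        return u + math.log1p(math.exp(-u))
    return math.log1p(math.exp(u))

def smooth_hinge(z, gamma):
    """Generalized logistic loss: softplus(gamma * (1 - z)) / gamma."""
    return softplus(gamma * (1.0 - z)) / gamma

def hinge(z):
    return max(0.0, 1.0 - z)

# The gap to the hinge loss is largest at z = 1, where it equals
# log(2) / gamma, so it shrinks as gamma grows.
for gamma in (1.0, 4.0, 16.0, 64.0):
    print(gamma, smooth_hinge(1.0, gamma) - hinge(1.0))
```

The printed gaps decrease in proportion to 1/gamma, which is the sense in which the smooth loss converges to the hinge loss.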
The maximum margin Bayesian network formulation in [16] also employs a smooth differentiable hinge loss inspired by the Huber loss, having a similar shape to the generalized logistic loss above. The sparse probabilistic classifier approach in [10] truncates the logistic loss, leading to sparse kernel logistic regression models. [15] proposed a technique for learning support vector classifiers based on arbitrary loss functions composed of a combination of a hyperbolic tangent loss function and a polynomial loss function.
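Other recent work [14] has created a smooth approximation to the 0-1 loss by directly defining the loss as a modified sigmoid. They used the following function
(9)  $L_{\text{sig}}(t_i, \mathbf{x}_i; \mathbf{w}, \gamma) = \frac{1}{1 + \exp(\gamma\, t_i \mathbf{w}^T \mathbf{x}_i)}$
(10)  $\qquad = [1 + \exp(\gamma z_i)]^{-1}$
In a way similar to the smooth approximation to the hinge loss, here $\lim_{\gamma \to \infty} L_{\text{sig}} = L_{01}$. We illustrate the way in which this construction approximates the 0-1 loss in Figure 3 (right).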
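As a small illustration of a sigmoid-shaped approximation to the 0-1 loss, the sketch below assumes the form 1 / (1 + exp(gamma * z)), where z is the margin. The function name is illustrative, and the two-branch evaluation is only a numerical-stability choice.

```python
import math

def sigmoid_loss(z, gamma):
    """1 / (1 + exp(gamma * z)), evaluated without overflow."""
    u = gamma * z
    if u >= 0:
        e = math.exp(-u)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(u))

# As gamma grows, the loss approaches 1 for negative margins and
# 0 for positive margins, i.e. the shape of the 0-1 loss.
for gamma in (1.0, 10.0, 100.0):
    print(gamma, sigmoid_loss(-0.5, gamma), sigmoid_loss(0.5, gamma))
```

Unlike the losses obtained from a negative log probability, this loss is bounded between zero and one for every value of gamma.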
Another important aspect of [14] is that they compared a variety of algorithms for directly optimizing the 0-1 loss with a novel algorithm for optimizing the sigmoid loss. They call their algorithm SLA, for Smooth 0-1 Loss Approximation. The direct 0-1 loss optimization algorithms they compared are: (1) a Branch and Bound (BnB) [11] technique, (2) a Prioritized Combinatorial Search (PCS) technique and (3) an algorithm referred to as a Combinatorial Search Approximation (CSA); the latter two are presented in more detail in [14]. They compared these methods with the use of their SLA algorithm to optimize the sigmoidal approximation to the 0-1 loss.
To evaluate and compare the quality of the non-convex optimization results produced by BnB, PCS and CSA with their SLA algorithm for the sigmoid loss, [14] also presents training set errors for a number of standard evaluation datasets. We provide an excerpt of their results in Table 1 as we will perform similar comparisons in our experimental work. These results indicated that the SLA algorithm consistently yielded superior performance at finding good minima of the underlying non-convex problem. Furthermore, [14] also provides an analysis of the runtime performance of each of the algorithms. Their experiments indicated that the SLA technique was significantly faster than the alternative algorithms for non-convex optimization. Based on these results we build upon the SLA approach in our work here.
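Table 1: Training set errors on standard evaluation datasets (excerpt of the results in [14]).
Dataset  LR   SVM  PCS  CSA  BnB  SLA

Breast   19   18   19   13   10   13
Heart    39   39   33   31   25   27
Liver    99   99   91   91   95   89
Pima     166  166  159  157  161  156
Sum      323  322  302  292  291  285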
The award winning work of [2] produced an approximation to the 0-1 loss by creating a ramp loss, obtained by combining the traditional hinge loss with a shifted and inverted hinge loss as illustrated in Figure 4. They showed how to optimize the ramp loss using the Concave-Convex Procedure (CCCP) of [23] and that this yields faster training times compared to traditional SVMs. Other more recent work has proposed an alternative online SVM learning algorithm for the ramp loss [6]. [22] explored a similar ramp loss which they refer to as a robust truncated hinge loss. More recent work [3] has explored a similar ramp-like construction which they refer to as the slant loss. Interestingly, the ramp loss formulation has also been generalized to structured predictions [4, 8].
Although the smoothed zero-one loss has attracted much attention recently, older references to similar research can be found. Losses resembling the zero-one loss have been used in machine learning for some time, especially by the boosting [13] and neural network [19] communities. Vincent [19] argues that a loss defined through a functional of the hyperbolic tangent is more robust, as it does not penalize outliers as excessively as the log logistic, hinge, and squared losses. This loss has the interesting properties of being continuous while behaving like the zero-one loss. A variant of this loss has been used in boosting algorithms [13]. Other work [19] has also shown that parametrizing the squared error loss with a hyperbolic tangent makes it behave more like the hyperbolic tangent loss.
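The construction of a ramp loss from two hinge terms can be sketched in a few lines. The version below subtracts a hinge shifted to a point `s` from the ordinary hinge; the parameter name `s` and its default value are assumptions made for illustration, since the text does not state the shift.

```python
def hinge_at(z, s):
    """Hinge with its corner at s: max(0, s - z)."""
    return max(0.0, s - z)

def ramp_loss(z, s=-1.0):
    """Ordinary hinge minus a hinge shifted to s (assumes s < 1).

    The result is 0 for z >= 1, grows linearly between s and 1,
    and is clipped at the constant 1 - s for z <= s.
    """
    return hinge_at(z, 1.0) - hinge_at(z, s)

for z in (-5.0, -1.0, 0.0, 1.0, 2.0):
    print(z, ramp_loss(z))
```

Because the loss is clipped for very negative margins, a single outlier can contribute at most `1 - s` to the objective, which is the robustness property that motivates ramp-like losses.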
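3 Our Approach: Generalized Beta-Bernoulli Logistic Classification
We now derive a novel form of logistic regression based on formulating a generalized sigmoid function arising from an underlying Bernoulli model with a Beta prior. We also use a $\gamma$ scaling factor to increase the sharpness of our approximation. Consider first the traditional and widely used formulation of logistic regression, which can be derived from a probabilistic model based on the Bernoulli distribution. The Bernoulli probabilistic model has the form:
(11)  $P(y \mid \theta) = \theta^{y} (1 - \theta)^{1 - y}$
where $y \in \{0, 1\}$ is the class label, and $\theta$ is the parameter of the model. The Bernoulli distribution can be re-expressed in standard exponential family form as
(12)  $P(y \mid \theta) = \exp\left\{ y \log\left(\frac{\theta}{1 - \theta}\right) + \log(1 - \theta) \right\}$
where the natural parameter $\eta$ is given by
(13)  $\eta = \log\left(\frac{\theta}{1 - \theta}\right)$
In traditional logistic regression, we let the natural parameter $\eta = \mathbf{w}^T \mathbf{x}$, which leads to a model where $\theta = \theta_{ML}$, in which the following parametrization is used
(14)  $\theta_{ML} = \mu_{ML}(\mathbf{w}, \mathbf{x}) = \frac{1}{1 + \exp(-\eta)} = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$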
The conjugate distribution to the Bernoulli is the Beta distribution
(15)  $\text{Beta}(\theta \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$
where $\alpha$ and $\beta$ have the intuitive interpretation as the equivalent pseudo counts for observations for the two classes of the model and $B(\alpha, \beta)$ is the beta function. When we use the Beta distribution as the prior over the parameters of the Bernoulli distribution, the posterior mean of the Beta-Bernoulli model is easily computed due to the fact that the posterior is also a Beta distribution. This property also leads to an intuitive form for the posterior mean or expected value in a Beta-Bernoulli model, which consists of a simple weighted average of the prior mean $\theta_B$ and the traditional maximum likelihood estimate, $\theta_{ML}$, such that
(16)  $\mu_{BB} = w\, \theta_B + (1 - w)\, \theta_{ML}$
where
$w = \frac{\alpha + \beta}{\alpha + \beta + n}, \qquad \theta_B = \frac{\alpha}{\alpha + \beta}$
and where $n$ is the number of examples used to estimate $\theta_{ML}$. Consider now the task of making a prediction using a Beta posterior and the predictive distribution. It is easy to show that the mean or expected value of the posterior predictive distribution is equivalent to plugging the posterior mean parameters of the Beta distribution into the Bernoulli distribution, i.e.
(17)  $p(y \mid D) = \int p(y \mid \theta)\, p(\theta \mid D)\, d\theta = \mu_{BB}^{\,y} (1 - \mu_{BB})^{1 - y}$
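The weighted-average form of the Beta-Bernoulli posterior mean can be verified with a short numerical check. In the sketch below, `k` denotes the number of positive observations among `n`, so that the maximum likelihood estimate is `k / n`; the function names and the example counts are illustrative assumptions.

```python
def posterior_mean(alpha, beta, k, n):
    """Mean of the Beta(alpha + k, beta + n - k) posterior."""
    return (alpha + k) / (alpha + beta + n)

def weighted_average(alpha, beta, k, n):
    """Same quantity written as w * prior_mean + (1 - w) * ML estimate."""
    w = (alpha + beta) / (alpha + beta + n)
    prior_mean = alpha / (alpha + beta)
    theta_ml = k / n
    return w * prior_mean + (1.0 - w) * theta_ml

# Hypothetical counts: a uniform Beta(1, 1) prior and 7 positives out of 10.
print(posterior_mean(1.0, 1.0, 7, 10), weighted_average(1.0, 1.0, 7, 10))
```

Both expressions give the same value, and the weight on the prior mean shrinks towards zero as `n` grows relative to `alpha + beta`.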
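Given these observations, we thus propose here to replace the traditional sigmoidal function used in logistic regression with the function given by the posterior mean of the Beta-Bernoulli model such that
(18)  $\mu_{BB}(\mathbf{w}, \mathbf{x}) = w \left(\frac{\alpha}{\alpha + \beta}\right) + (1 - w)\, \mu_{ML}(\mathbf{w}, \mathbf{x})$
Further, to increase our model's ability to approximate the zero-one loss, we shall also use a generalized form of the Beta-Bernoulli model above where we set the natural parameter of $\mu_{ML}$ so that $\eta = \gamma \mathbf{w}^T \mathbf{x}$. This leads to our complete model based on a generalized Beta-Bernoulli formulation
(19)  $\mu_{BB\gamma}(\mathbf{w}, \mathbf{x}) = w \left(\frac{\alpha}{\alpha + \beta}\right) + (1 - w)\, \frac{1}{1 + \exp(-\gamma \mathbf{w}^T \mathbf{x})}$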
It is useful to remind the reader at this point that we have used the BetaBernoulli construction to define our function, not to define a prior over the parameter of a random variable as is frequently done with the Beta distribution. Furthermore, in traditional Bayesian approaches to logistic regression, a prior is placed on the parameters and used for MAP parameter estimation or more fully Bayesian methods in which one integrates over the uncertainty in the parameters.
In our formulation here, we have placed a prior on the function as is commonly done with Gaussian processes. Our approach might be seen as a pragmatic alternative to working with the fully Bayesian posterior distributions over functions given data. The more fully Bayesian procedure would be to use the posterior predictive distribution to make predictions using
(20) 
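Let us consider again the negative log logistic loss function defined by our generalized Beta-Bernoulli formulation, where we let $z = t\, \mathbf{w}^T \mathbf{x}$ and we use our $t \in \{-1, +1\}$ encoding for class labels. For $t = 1$ this leads to
(21)  $-\log p(t = 1 \mid \mathbf{x}, \mathbf{w}) = -\log\left[a + b\,(1 + \exp(-\gamma z))^{-1}\right]$
while for the case when $t = -1$, the negative log probability is simply
(22)  $-\log p(t = -1 \mid \mathbf{x}, \mathbf{w}) = -\log\left[1 - \left(a + b\,(1 + \exp(\gamma z))^{-1}\right)\right]$
where $a = w\, \alpha / (\alpha + \beta)$ and $b = 1 - w$ for the formulation of the corresponding loss given earlier in equations (5) and (6).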
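The plateau behaviour of this loss can be illustrated with a minimal sketch. It assumes that the two constants are the prior weight times the prior mean and one minus the prior weight, as implied by the posterior-mean construction, and it works with the raw linear score rather than the margin. The prior setting `alpha = beta = 1`, `n = 100` is a hypothetical example chosen here, not a value taken from the text.

```python
import math

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def bb_constants(alpha, beta, n):
    """Constants a = w * alpha / (alpha + beta) and b = 1 - w."""
    w = (alpha + beta) / (alpha + beta + n)
    return w * alpha / (alpha + beta), 1.0 - w

def bb_probability(score, gamma, a, b):
    """Generalized Beta-Bernoulli sigmoid of the linear score w^T x."""
    return a + b * sigmoid(gamma * score)

def bb_loss(score, t, gamma, a, b):
    """Negative log probability of label t in {-1, +1}."""
    p = bb_probability(score, gamma, a, b)
    return -math.log(p) if t == 1 else -math.log(1.0 - p)

a, b = bb_constants(1.0, 1.0, 100.0)  # hypothetical weak prior
for score in (-50.0, -5.0, 0.0, 5.0):
    print(score, bb_loss(score, 1, 1.0, a, b))
```

For a positive example with a very negative score, the loss saturates at `-log(a)` instead of growing without bound, which is the plateau described in the text.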
In Figure 2 we showed how setting this scalar parameter to larger values, i.e allows our generalized BetaBernoulli model to more closely approximate the zero one loss. We show the loss with and in Figure 5 (left) which corresponds to a stronger Beta prior and as we can see, this leads to an approximation with a range of values that are even closer to the 01 loss. As one might imagine, with a little analysis of the form and asymptotics of this function, one can also see that for given a setting of and , a corresponding scaling factor and linear translation can be found so as to transform the range of the loss into the interval such that . However, when as shown in Figure 5 (right), the loss function is asymmetric and in the limit of large gamma this corresponds to different losses for true positives, false positives, true negatives and false negatives. For these and other reasons we believe that this formulation has many attractive and useful properties.
3.1 Parameter Estimation and Gradients
We now turn to the problem of estimating the parameters $\mathbf{w}$ from data using our model. As we have defined a probabilistic model, we simply write the probability defined by our model and then optimize the parameters by maximizing the log probability or, equivalently, minimizing the negative log probability. As we shall discuss in more detail in section 4, we use a modified form of the SLA optimization algorithm in which we slowly increase $\gamma$ and interleave gradient descent steps with coordinate descent implemented as a grid search. For the gradient descent part of the optimization we need the gradients of our loss function, and we therefore give them below.
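Consider first the usual formulation of the conditional probability used in logistic regression
(23)  $p(y \mid \mathbf{x}, \mathbf{w}) = \mu^{y} (1 - \mu)^{1 - y}$
Here, in place of the usual $\mu = \mu_{ML}(\mathbf{w}, \mathbf{x})$, in our generalized Beta-Bernoulli formulation we now have $\mu = \mu_{BB\gamma}(\mathbf{w}, \mathbf{x}) = a + b\,[1 + \exp(-\gamma \mathbf{w}^T \mathbf{x})]^{-1}$, where $a = w\, \alpha / (\alpha + \beta)$ and $b = 1 - w$. Given a data set $D = \{(y_i, \mathbf{x}_i)\}$ consisting of label and feature vector pairs, this yields a log-likelihood given by
(24)  $\mathcal{L}(D \mid \mathbf{w}) = \sum_{i} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right]$
where the gradient of this function is given by
(25)  $\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \sum_{i} \left( \frac{y_i}{\mu_i} - \frac{1 - y_i}{1 - \mu_i} \right) \frac{\partial \mu_i}{\partial \mathbf{w}}$
with
(26)  $\frac{\partial \mu_i}{\partial \mathbf{w}} = (1 - w)\, \frac{\gamma \exp(-\gamma \mathbf{w}^T \mathbf{x}_i)}{\left[1 + \exp(-\gamma \mathbf{w}^T \mathbf{x}_i)\right]^{2}}\, \mathbf{x}_i$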
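A gradient of this kind is easy to get wrong, so it is worth checking against finite differences. The sketch below assumes labels in {0, 1}, a probability of the form `a + b * sigmoid(gamma * w^T x)`, and a Bernoulli log-likelihood; it derives the gradient by the chain rule and compares it with a central-difference estimate. All names and the toy data are illustrative.

```python
import math

def sigmoid(u):
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def log_likelihood(w, X, y, gamma, a, b):
    total = 0.0
    for xi, yi in zip(X, y):
        mu = a + b * sigmoid(gamma * dot(w, xi))
        total += yi * math.log(mu) + (1 - yi) * math.log(1.0 - mu)
    return total

def gradient(w, X, y, gamma, a, b):
    """Chain rule: dL/dw = sum_i (y/mu - (1-y)/(1-mu)) * dmu/dw."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        s = sigmoid(gamma * dot(w, xi))
        mu = a + b * s
        dmu = b * gamma * s * (1.0 - s)  # derivative of mu w.r.t. w^T x
        coeff = (yi / mu - (1 - yi) / (1.0 - mu)) * dmu
        for j, xj in enumerate(xi):
            grad[j] += coeff * xj
    return grad

def numerical_gradient(w, X, y, gamma, a, b, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += eps
        down[j] -= eps
        diff = log_likelihood(up, X, y, gamma, a, b) - log_likelihood(down, X, y, gamma, a, b)
        grad.append(diff / (2.0 * eps))
    return grad

# Toy data (hypothetical): first feature acts as a bias term.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [1, 0, 1]
w = [0.1, -0.3]
analytic = gradient(w, X, y, 2.0, 0.01, 0.98)
numeric = numerical_gradient(w, X, y, 2.0, 0.01, 0.98)
print(analytic, numeric)
```

Note that `s * (1 - s)` is the same quantity as `exp(-u) / (1 + exp(-u))^2`, written in a form that avoids overflow.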
3.2 Some Asymptotic Analysis
As we have stated at the beginning of our discussion on parameter estimation, at the end of our optimization we will have a model with a large $\gamma$. With a sufficiently large $\gamma$, all predictions will be given the maximum or minimum probabilities possible under the model. Defining the $t = 1$ class as the positive class, if we set the maximum probability under the model equal to the True Positive Rate (TPR) (e.g. on training and/or validation data) and the maximum probability for the negative class equal to the True Negative Rate (TNR), then writing $\theta_B = \alpha / (\alpha + \beta)$ for the prior mean we have
(27)  $w\, \theta_B + (1 - w) = \text{TPR}$
(28)  $1 - w\, \theta_B = \text{TNR}$
which allows us to conclude that this would equivalently correspond to setting
(29)  $w = 2 - (\text{TPR} + \text{TNR})$
(30)  $\theta_B = \frac{1 - \text{TNR}}{2 - (\text{TPR} + \text{TNR})}$
This analysis gives us a good idea of the expected behavior of the model if we optimize $w$ and $\theta_B$ on a training set. It also suggests that an even better strategy for tuning $w$ and $\theta_B$ would be to use a validation set.
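The asymptotic argument can be turned into a small calculation. The sketch below assumes that, for large sharpness, the largest probability for the positive class is `w * theta + (1 - w)` and the largest probability for the negative class is `1 - w * theta`, where `w` is the prior weight and `theta` the prior mean; solving these two equations for given rates gives the values returned here. The helper `pseudo_counts` and the example rates are illustrative additions.

```python
def prior_from_rates(tpr, tnr):
    """Solve w*theta + (1 - w) = TPR and 1 - w*theta = TNR for (w, theta)."""
    w = 2.0 - tpr - tnr
    if w <= 0.0:
        raise ValueError("TPR + TNR must be below 2 for a non-degenerate prior")
    theta = (1.0 - tnr) / w
    return w, theta

def pseudo_counts(w, theta, n):
    """Recover (alpha, beta) from w = (alpha+beta)/(alpha+beta+n) and theta."""
    strength = w * n / (1.0 - w)  # alpha + beta
    return theta * strength, (1.0 - theta) * strength

w, theta = prior_from_rates(0.9, 0.8)  # hypothetical rates
alpha, beta = pseudo_counts(w, theta, 70.0)
print(w, theta, alpha, beta)
```

Both returned values lie in the unit interval whenever the rates are valid and not both perfect, which is convenient if they are later tuned by a bounded reparametrization.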
3.3 Learning hyperparameters
We have provided an asymptotic analysis of the expected values for $w$ and $\theta_B = \alpha / (\alpha + \beta)$ in the previous section. In the experiment section, we provide BBLR results using the asymptotic values of these two parameters along with cross-validated values for the other hyperparameters, which include the regularization parameter $\lambda$ described in Section 4. It is however also possible to learn these hyperparameters using the training set, the validation set or both. Below, we provide the partial derivatives of the likelihood function (24) with respect to these hyperparameters.
(31) 
with
(32) 
The partial derivatives with respect to $w$ and $\theta_B$ are as follows
(33) 
(34) 
with
(35) 
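3.4 Kernel Beta-Bernoulli Classification
It is possible to transform the traditional logistic regression technique discussed above into a kernel logistic regression (KLR) by replacing the linear discriminant function, $\mathbf{w}^T \mathbf{x}$, with
(36)  $f(\mathbf{c}, \mathbf{x}) = \sum_{j} c_j\, K(\mathbf{x}, \mathbf{x}_j)$
where $K(\mathbf{x}, \mathbf{x}_j)$ is a kernel function and $j$ is used as an index in the sum over all training examples.
To create our generalized Beta-Bernoulli KLR model we take a similar path; however, in this case we let $\eta = \gamma f(\mathbf{c}, \mathbf{x})$. Thus, our Kernel Beta-Bernoulli model can be written as:
(37)  $\mu_{KBB}(\mathbf{c}, \mathbf{x}) = w \left(\frac{\alpha}{\alpha + \beta}\right) + (1 - w)\, \frac{1}{1 + \exp(-\gamma f(\mathbf{c}, \mathbf{x}))}$
If we write $f(\mathbf{c}, \mathbf{x}) = \mathbf{c}^T \mathbf{k}(\mathbf{x})$, where $\mathbf{k}(\mathbf{x})$ is a vector of kernel values, then the gradient of the corresponding KBBLR log likelihood obtained by setting $\mu = \mu_{KBB}$ in (24) is
(38)  $\frac{\partial \mathcal{L}}{\partial \mathbf{c}} = \sum_{i} \left( \frac{y_i}{\mu_i} - \frac{1 - y_i}{1 - \mu_i} \right) (1 - w)\, \frac{\gamma \exp(-\gamma\, \mathbf{c}^T \mathbf{k}(\mathbf{x}_i))}{\left[1 + \exp(-\gamma\, \mathbf{c}^T \mathbf{k}(\mathbf{x}_i))\right]^{2}}\, \mathbf{k}(\mathbf{x}_i)$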
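The kernelized prediction rule can be sketched as follows. The example replaces the linear score with a weighted sum of kernel evaluations against the training points and then applies the same bounded sigmoid as in the linear case. The Gaussian (RBF) kernel, the coefficient name `coef`, and the constants `a` and `b` are all assumptions made for illustration; the text does not fix a particular kernel.

```python
import math

def sigmoid(u):
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def rbf_kernel(x, xj, width=1.0):
    """Gaussian kernel exp(-||x - xj||^2 / (2 * width^2))."""
    sq = sum((p - q) ** 2 for p, q in zip(x, xj))
    return math.exp(-sq / (2.0 * width ** 2))

def kernel_score(coef, x, train_X, kernel=rbf_kernel):
    """Discriminant f(x) = sum_j coef_j * K(x, x_j)."""
    return sum(cj * kernel(x, xj) for cj, xj in zip(coef, train_X))

def kbblr_probability(coef, x, train_X, gamma, a, b):
    """Bounded sigmoid applied to the scaled kernel score."""
    return a + b * sigmoid(gamma * kernel_score(coef, x, train_X))

train_X = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical training points
a, b = 0.01, 0.98                   # hypothetical constants
print(kbblr_probability([5.0, -5.0], [0.0, 0.0], train_X, 1.0, a, b))
```

A coefficient of exactly zero removes the corresponding training point from the sum, which is why sparsity-inducing priors on the coefficients translate into fewer kernel evaluations at prediction time.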
4 Optimization and Algorithms
As we have discussed in the relevant recent work section above, the work of [14] has shown that their SLA algorithm applied to the sigmoid loss outperformed a number of other techniques in terms of both true 0-1 loss minimization performance and run time. As our generalized Beta-Bernoulli loss is another type of smooth approximation to the 0-1 loss, we use a variation of their SLA algorithm to optimize it. If one compares our generalized Beta-Bernoulli logistic loss with the directly defined sigmoidal loss used in the SLA work of [14], it becomes apparent that the BBLR formulation has three additional hyperparameters, $\{\alpha, \beta, n\}$. These additional parameters control the locations of the plateaus of our function, and these plateaus have well defined interpretations in terms of probabilities. In contrast, the plateaus of the sigmoidal loss in [14] are located at zero and one. Additionally, in practice one is interested in optimizing the regularized loss, where some form of prior or regularization is used for parameters. In our experiments here, we follow the widely used practice of using a Gaussian prior for parameters. The corresponding regularized loss arising from the negative log likelihood with the additional regularization term gives us our complete objective function
(39) 
where the parameter $\lambda$ controls the strength of the regularization. With these additional hyperparameters, the original SLA algorithm is not directly applicable to our formulation. However, if we hold these hyperparameters fixed, we are able to use the general idea of their approach and perform a Modified SLA optimization as given in Algorithms 1 and 2. In our experiments below, we use that strategy in the BBLR series of experiments. To deal with the issue of how to jointly learn the weights $\mathbf{w}$ as well as the hyperparameters, in our BBLR series of experiments we learn these hyperparameters by gradient descent on the training set. More precisely, we learn the prior weight $w$ and the prior mean $\theta_B = \alpha / (\alpha + \beta)$ (as opposed to learning $\alpha$ and $\beta$), as this permits the parameters to be easily reparametrized so that they both lie within $[0, 1]$.
Very importantly, our initial experiments indicated that the basic SLA formulation required considerable hand tuning of learning parameters for each new data set. This was the case even using the simplest smooth loss function without the additional degrees of freedom afforded by our formulation. This led us to develop a meta-optimization procedure for learning algorithm parameters. The BBLR series of experiments below use this learning meta-parameter optimization procedure. Our initial and formal experiments here indicate that this meta-optimization of learning parameters is in fact essential in practice. We therefore present it in more detail below.
4.1 Our SLA Algorithm Meta-optimization (SLAM)
Here we present our meta-optimization extension and various other modifications to the SLA approach of [14]. The SLA algorithm proposed in [14] can be decomposed into two parts: an outer loop that initializes a model and then slowly increases the $\gamma$ factor of their sigmoidal loss, repeatedly calling an inner algorithm they refer to as Range Optimization for SLA or Gradient Descent in Range. The Range Optimization part consists of two stages. Stage 1 is a standard gradient descent optimization with a decreasing learning rate (using the new $\gamma$ factor). Stage 2 probes each parameter within a radius using a one dimensional grid search to determine if the loss can be further reduced, thus implementing a coordinate descent on a set of grid points. We provide a slightly modified form of the outer loop of their algorithm in Algorithm 1, where we have expressed the initial parameters given to the model as explicit parameters given to the algorithm. In their approach they hard code the initial parameter estimates as the result of an SVM run on their data. We provide a compressed version of their inner Range Optimization technique in Algorithm 2.
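The two-stage structure described above can be sketched in code. This is not the SLA algorithm of [14] or the modified version used in this work; it is a minimal illustration under stated assumptions: labels in {-1, +1}, a loss of the form `-log(a + b * sigmoid(gamma * w^T x))`, an L2 penalty `0.5 * lam * ||w||^2`, a doubling schedule for the sharpness factor, and arbitrary choices for the radius, grid step and learning rate. All names are hypothetical.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def sigmoid(u):
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def objective(w, X, t, gamma, a, b, lam):
    """Regularised negative log-likelihood for labels t in {-1, +1}."""
    total = 0.5 * lam * dot(w, w)
    for xi, ti in zip(X, t):
        mu = a + b * sigmoid(gamma * dot(w, xi))
        total -= math.log(mu if ti == 1 else 1.0 - mu)
    return total

def gradient(w, X, t, gamma, a, b, lam):
    grad = [lam * wj for wj in w]
    for xi, ti in zip(X, t):
        s = sigmoid(gamma * dot(w, xi))
        mu = a + b * s
        dmu = b * gamma * s * (1.0 - s)
        coeff = -dmu / mu if ti == 1 else dmu / (1.0 - mu)
        for j, xj in enumerate(xi):
            grad[j] += coeff * xj
    return grad

def range_optimization(w, X, t, gamma, a, b, lam,
                       radius=1.0, step=0.25, lr=0.1, iters=50):
    args = (X, t, gamma, a, b, lam)
    best, best_val = list(w), objective(w, *args)
    # Stage 1: gradient descent with a decreasing learning rate,
    # keeping a step only if it lowers the objective.
    for k in range(iters):
        g = gradient(best, *args)
        cand = [wj - lr / (1 + k) * gj for wj, gj in zip(best, g)]
        val = objective(cand, *args)
        if val < best_val:
            best, best_val = cand, val
    # Stage 2: coordinate descent on a 1-D grid around each parameter.
    n_steps = int(radius / step)
    for j in range(len(best)):
        centre = best[j]
        for m in range(-n_steps, n_steps + 1):
            cand = list(best)
            cand[j] = centre + m * step
            val = objective(cand, *args)
            if val < best_val:
                best, best_val = cand, val
    return best

def sla_sketch(w0, X, t, a, b, lam, gammas=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Outer loop: sharpen the loss gradually, warm-starting each stage."""
    w = list(w0)
    for gamma in gammas:
        w = range_optimization(w, X, t, gamma, a, b, lam)
    return w

# Toy separable data (hypothetical); the first feature is a bias term.
X = [[1.0, 2.0], [1.0, 1.5], [1.0, -1.0], [1.0, -2.0]]
t = [1, 1, -1, -1]
a, b = 1.0 / 102.0, 100.0 / 102.0
w_hat = sla_sketch([0.0, 0.0], X, t, a, b, lam=0.01)
print(w_hat)
```

Starting with a small sharpness factor keeps the early objective close to ordinary logistic regression, and the grid search gives the optimizer a way to step out of flat regions where the gradient is nearly zero.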
The first minor difference between the SLA optimization algorithm of [14] and our extension to it is the selection of the initial parameter vector from which the SLA algorithm starts optimizing. While the original SLA algorithm uses the SVM solution as its initial solution, our modified SLA algorithm uses hyperparameter values obtained from experiments using a validation set defined within the training data to initialize the gradient based optimization technique. The idea here is to search for the hyperparameter values that produce a reasonable initial solution for the SLA algorithm to start with; one of these hyperparameters is $\lambda$, the weight associated with the Gaussian prior leading to the L2 penalty added to (24).
Our meta-optimization procedure consists of the following. For some of the algorithm parameters, we use the values suggested in the original SLA algorithm [14]. For the others, we use a cross validation run using the same modified SLA algorithm to fine-tune algorithm parameters.
One of these parameters is chosen through a grid search, while two others are chosen by a bracket search algorithm. In our experience, these model parameters change from problem (dataset) to problem, and hence must be fine-tuned for the best results.
5 Experimental Setup and Results
Below, we present results for three different groups of benchmark problems: (1) a selection from the University of California Irvine (UCI) repository, (2) some larger and higher dimensionality text processing tasks from the LibSVM evaluation archive (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), and (3) the product review sentiment prediction datasets used in [5]. We then present results on a structured prediction problem formulated for the task of visual information extraction from Wikipedia biography pages. Finally we explore the kernelized version of our classifier.
In all experiments, unless otherwise stated, we use a Gaussian prior on parameters leading to an $L_2$ penalty term. We explore four experimental configurations for our BBLR approach: (1) BBLR$_1$, where we use our modified SLA algorithm with the BBLR prior parameters held fixed. This corresponds to a minor modification to the traditional negative log logistic loss, but yields a probabilistically well defined smooth sigmoid shaped loss (e.g. as we have seen in Figure 2); (2) BBLR$_2$, where we use values for $\alpha$, $\beta$ and $n$ corresponding to the empirical counts of positives, negatives and the total number of examples from the training set, which corresponds to a simplistic heuristic, partially justified by Bayesian reasoning; (3) BBLR$_3$, in which an outer meta-optimization of learning parameters is performed on top of (2), i.e. SLAM; and (4) BBLR$_4$, in which the outer meta-optimization of learning parameters is performed and the model hyperparameters are optimized by gradient descent using the training set, with $w$ and $\theta_B = \alpha / (\alpha + \beta)$ initialized using the values given by our asymptotic analysis using a hard threshold for classifications. At each iteration of this optimization step, as parameters get updated, the complementary SLAM hyperparameters are adjusted by using the same meta-optimization procedure (SLAM) with a subset of the training data as a validation set.
Consequently, models produced by the BBLR$_3$ series of experiments explore the ability of our improved SLA learning parameter meta-optimization method (SLAM) to effectively minimize a smooth approximation to the zero-one loss, while the BBLR$_4$ series of experiments delve the deepest into the ability of our BBLR formulation and SLAM optimization to make more accurate probabilistic predictions.
5.1 Binary Classification Tasks
5.1.1 Experiments with UCI Benchmarks
We evaluate our technique on the following datasets from the University of California Irvine (UCI) Machine Learning Repository [1]: Breast, Heart, Liver and Pima. We use these datasets in part so as to compare directly with results in [14], to understand the behaviour of our novel logistic function formulation and to explore the behaviour of our learning parameter optimization procedure. Table 2 shows some brief details of these datasets.
Table 2: Details of the UCI datasets used in our experiments.
Dataset  # Examples  # Dimensions  Description

Breast   683         10            Breast Cancer Diagnosis [12]
Heart    270         13            Statlog
Liver    345         6             Liver Disorders
Pima     768         8             Pima Indians Diabetes
To facilitate comparisons with previous results presented in [14], such as those summarized in Table 3 of our literature review in Section 2, we provide a small set of initial experiments here following their experimental protocols. In these experiments we compare our BBLRs with the following models: our own L2 Logistic Regression (LR) implementation, a linear SVM using the same implementation (liblinear) that was used in [14], and the optimization of the sigmoid loss of [14] using the SLA algorithm and the code distributed on the web site associated with [14] (indicated by SLA in our tables).
Despite the fact that we used the code distributed on the website associated with [14], we found that the SLA algorithm applied to their sigmoid loss gave errors that are slightly higher than those given in [14]. We use the term SLA in Table 3 and subsequent tables to denote experiments performed using both the sigmoidal loss explored in [14] and their algorithm for minimizing it. Applying the SLA algorithm to our loss yielded slightly superior results to the sigmoidal loss when the empirical counts from the training set for , and are used and slightly worse results when we used , and .
Analyzing the ability of different loss formulations and algorithms to minimize the 0-1 loss on different datasets using a common model class (i.e. linear models) can reveal differences in optimization performance across different models and algorithms. However, we are certainly more interested in evaluating the ability of different loss functions and optimization techniques to learn models that generalize to new data. We therefore provide the next set of experiments using traditional training, validation and testing splits, again following the protocols used in [14]; however, as we shall soon see, these experiments underscored the importance of extending the original SLA algorithm to automate the adjustment of learning parameters.
In Tables 4 and 5, we create 10 random splits of the data and perform a traditional 5 fold evaluation using cross validation within each training set to tune hyperparameters. In Table 4, we present the sum of the 0-1 loss over each of the 10 splits as well as the total 0-1 loss across all experiments for each algorithm. This analysis allows us to make some intuitive comparisons with the results in Table 1, which represents an empirically derived lower bound on the 0-1 loss. In Table 5, we present the traditional mean accuracy across these same experiments. Examining columns SLA vs. BBLR in Table 4, we see that our reformulated logistic loss is able to outperform the sigmoidal loss, but that only with the additional tuning of parameters during the optimization in column BBLR are we able to improve upon the overall zero one loss yielded by the logistic regression and SVM baseline methods. However, it is important to remember that all of these methods are based on an underlying linear model, and that these are comparatively small datasets consisting of relatively low dimensional input feature vectors. As such, we do not necessarily expect there to be any statistically significant differences in test set performance due to zero one loss minimization performance. The same observation was made in [14] and it motivated their own exploration of learning with noisy feature vectors. We follow a similar path below, but then go further to explore datasets that are much larger and of much higher dimension in our subsequent experimental work.
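The tallying behind a summed 0-1 loss and a mean error rate can be sketched as follows. This is only an illustrative sketch, not the evaluation code used for the experiments: the function names and the toy labels are hypothetical, and labels are assumed to be coded as -1/+1.

```python
from typing import List, Sequence, Tuple


def zero_one_loss(y_true: Sequence[int], y_pred: Sequence[int]) -> int:
    """Count misclassified examples (the unnormalized 0-1 loss)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have equal length")
    return sum(1 for t, p in zip(y_true, y_pred) if t != p)


def summarize_splits(
    splits: List[Tuple[Sequence[int], Sequence[int]]]
) -> Tuple[int, float]:
    """Return (total 0-1 loss, mean error rate in percent) over test splits.

    Each element of `splits` is a (y_true, y_pred) pair for one test split.
    """
    losses = [zero_one_loss(t, p) for t, p in splits]
    rates = [100.0 * loss / len(t) for loss, (t, _) in zip(losses, splits)]
    return sum(losses), sum(rates) / len(rates)


# Hypothetical toy data standing in for two test splits.
toy_splits = [
    ([1, -1, 1, 1], [1, 1, 1, -1]),     # 2 errors out of 4
    ([-1, -1, 1, 1], [-1, -1, 1, -1]),  # 1 error out of 4
]
total_loss, mean_error = summarize_splits(toy_splits)
```

Summing raw error counts, as in the first returned value, makes results directly comparable to a count-based lower bound, while the averaged rate is the more familiar per-split summary.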
Dataset  LR  SVM  BBLR  SLA  BBLR 

Breast  21  19  11  14  12 
Heart  39  40  42  39  26 
Liver  102  100  102  90  90 
Pima  167  167  169  157  166 
Sum  329  326  324  300  294 
LR  SVM  SLA  BBLR  BBLR  

Breast  22  21  23  22  21 
Heart  45  45  48  50  43 
Liver  109  110  114  105  105 
Pima  172  172  184  176  171 
Total  348  348  368  354  340 
LR  SVM  SLA  BBLR  BBLR  BBLR  

Breast  3.2  3.1  3.6  3.2  3.1  3.0 
Heart  16.8  16.6  17.7  18.6  15.9  15.7 
Liver  31.5  31.8  32.9  30.6  30.4  30.5 
Pima  22.3  22.4  23.9  23.0  22.2  22.2 
LR  SVM  SLA  BBLR  BBLR  Impr.  
Breast  36  34  26  26  25  26% 
Heart  44  44  49  47  42  4% 
Liver  150  149  149  149  117  21% 
Pima  192  199  239  185  174  12% 
Total  422  425  463  374  359  16% 
LR  SVM  SLA  BBLR  BBLR  BBLR  

Breast  5.2  5.0  3.8  3.9  3.7  3.4 
Heart  16.4  16.2  18.1  17.3  15.5  15.2 
Liver  43.5  43.1  43.3  33.8  34.1  34.0 
Pima  25.0  25.9  31.1  24.0  22.7  22.5 
In Table 6, we present the sum of the mean 0-1 loss over 10 repetitions of a 5 fold leave one out experiment where 10% noise has been added to the data following the protocol given in [14]. Here again, our BBLR achieved a moderate gain over the SLA algorithm, whereas the gain of BBLR over the other models is noticeable. In this table, we also show the percentage improvement of our best model over the linear SVM. In Table 7, we show the average errors () for these 10% noise added experiments. We see here that the advantages of more directly approximating the zero one loss are more pronounced. However, the SLA approach failed to outperform the LR and SVM baselines in our experiments, whereas in a similar experiment in [14] the SLA algorithm and sigmoidal loss did outperform these methods; this leads us to believe that per dataset tuning of learning algorithm parameters is a significant issue. We nevertheless observe that our BBLR experiment which used the original SLA optimization algorithm outperformed the sigmoidal loss function optimized using the SLA algorithm. These results support the notion that our proposed BetaBernoulli logistic loss is in itself a superior approach to approximating the zero one loss from an empirical perspective. Moreover, our results in column BBLR indicate that the combined use of our novel logistic loss and learning parameter optimization yields the most substantial improvements to zero one loss minimization, or correspondingly improvements to accuracy.
5.1.2 Pooled McNemar Tests
We performed McNemar tests for the four UCI benchmarks comparing BBLR with LR and linear SVMs. As none of these benchmarks has a large number of test instances, it is difficult to establish statistical significance on any single split. Therefore, we performed pooled McNemar tests by considering each split of our 5 fold leave one out experiments as independent tests and collectively performing the significance tests as a whole. The results of this pooled McNemar test are given in Table 8. Interestingly, for our noisy dataset experiments, the improvement of our BBLR over both the LR and SVM models was found to be statistically significant with .
BBLR vs. LR  BBLR vs. SVM  

cleanUCI  3.17  0.69 
noisyUCI  4.33  3.7 
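The pooling idea can be sketched as below. This is a minimal sketch under stated assumptions: it uses the standard chi-square form of McNemar's test with a continuity correction, which may differ from the exact statistic reported in Table 8, and the function name and the per-split counts are hypothetical.

```python
import math
from typing import Iterable, Tuple


def mcnemar_pooled(discordant_pairs: Iterable[Tuple[int, int]]) -> Tuple[float, float]:
    """Pooled McNemar test with continuity correction.

    discordant_pairs: one (b, c) pair per split, where
      b = test examples model A classified correctly and model B did not,
      c = test examples model B classified correctly and model A did not.
    Returns (chi-square statistic, p value).
    """
    pairs = list(discordant_pairs)
    b = sum(p[0] for p in pairs)
    c = sum(p[1] for p in pairs)
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical discordant counts for five splits (not experimental data).
counts = [(7, 2), (5, 3), (6, 1), (8, 4), (4, 2)]
stat, p_value = mcnemar_pooled(counts)
```

Pooling sums the discordant counts before computing the statistic, so splits that are individually too small to reach significance can still contribute evidence; it does assume that the splits can be treated as independent.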
5.1.3 Experiments with LibSVM Benchmarks
In this section, we present classification results using two much larger datasets: web8 and webspamunigrams. These datasets have predefined training and testing splits, which are distributed on the web site accompanying [25] (http://users.cecs.anu.edu.au/~xzhang/data/). These benchmarks are also distributed through the LibSVM binary data collection (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). The webspam unigrams data originally came from the study in [21] (http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html). Table 9 compiles some details of these datasets.
Dataset  # Examples  # Dim.  Sparsity (%)  

web8  59,245  300  4.24  0.03 
webspamuni  350,000  254  33.8  1.54 
For these experiments we do not add additional noise to the feature vectors. In Table 10, we present classification results, and one can see that for both cases our BBLR approach shows improved performance over the LR and the linear SVM baselines. As in our earlier small scale experiments, we used our own LR implementation and the liblinear SVM for these large scale experiments.
Data set  LR  SVM  BBLR 

web8  1.11  1.13  0.98 
webspamunigrams  7.26  7.42 
We performed McNemar’s statistical tests comparing our BBLR with LR and linear SVM models for these two datasets. The results are found to be statistically significant with a value 0.01 for all cases. Given that no noise has been added to these widely used benchmark problems and that each method compared here is fundamentally based on a linear model, the fact that these experiments show statistically significant improvements for BBLR over these two widely used methods is quite interesting.
5.1.4 Experiments with Product Reviews
The goal of these tasks is to predict whether a product review is positive or negative. For this set of experiments, we used the count based unigram features for four databases from the website associated with [5]. For each database, there are 1,000 positive and 1,000 negative product reviews. Table 11 compiles the feature dimension size of these sparse databases.
Dataset  Database size  Feature dimensions 

Books  2000  28,234 
DVDs  2000  28,310 
Electronics  2000  14,943 
Kitchen  2000  12,130 
We present results in Table 12 using a ten fold cross validation setup as performed by [5]. Here again we do not add noise to the data.
Books  DVDs  Electronics  Kitchen  

LR  19.75  18.05  16.4  13.5 
SVM  20.45  21.4  17.75  14.6 
BBLR  18.38  17.5  16.29  13.0 
BBLR  18.15  16.8  15.21  13.0 
For all four databases, our BBLR and BBLR models outperformed both the LR and linear SVM. To further analyze these results, we also performed a McNemar’s test. For the Books and the DVDs databases, the improvements of our BBLR and BBLR models over both the LR and linear SVM are found to be statistically significant with a value . BBLR tended to outperform BBLR, but not in a statistically significant way. However, since the primary advantage of the BBLR configuration is that it yields more accurate probabilities, we do not necessarily expect it to have dramatically superior performance compared to BBLR for classification. For this reason we explore the problem of using such models in the context of structured prediction in the next set of experiments. When BBLR models are used to make structured predictions, our hypothesis is that the benefits of providing a more accurate probabilistic prediction should be apparent through improved joint inference.
5.2 Structured Prediction Experiments
One of the advantages of our BetaBernoulli logistic loss is that it allows a model to produce more accurate probabilistic estimates. Intuitively, the controllable nature of the plateaus in the log probability view of our formulation allows probabilistic predictions to take on values that are more representative of an appropriate confidence level for a classification. In simple terms, predictions for feature vectors far from a decision boundary need not take on values that are near probability zero or probability one when the BetaBernoulli logistic model is used. If such models are used as components of larger systems which use probabilistic inference for more complex reasoning tasks, the additional flexibility could be a significant advantage over the traditional logistic function formulation. The following experiments explore this hypothesis.
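The plateau behaviour can be illustrated numerically. The sketch below assumes a generalized logistic written as a weighted average of a prior mean and an ordinary sigmoid; this functional form, the function names, and all parameter values are assumptions made for illustration, and the precise definition used in the paper is the one given in its earlier sections.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def generalized_logistic(score: float, alpha: float, beta: float,
                         n: float, gamma: float = 1.0) -> float:
    """Assumed form: convex combination of a prior mean and a sigmoid.

    alpha, beta act as pseudo-counts, n as an effective sample size and
    gamma as a sharpness factor (all names are hypothetical).
    """
    w = (alpha + beta) / (alpha + beta + n)  # weight placed on the prior
    prior_mean = alpha / (alpha + beta)
    return w * prior_mean + (1.0 - w) * sigmoid(gamma * score)


# Scores far from the decision boundary, with hypothetical settings.
far_negative = generalized_logistic(-50.0, alpha=1.0, beta=1.0, n=8.0)
far_positive = generalized_logistic(50.0, alpha=1.0, beta=1.0, n=8.0)
plain_far_positive = sigmoid(50.0)
```

With these settings the generalized function levels off at 0.1 and 0.9, whereas the plain sigmoid is numerically indistinguishable from 1 at the same score; the plateau heights are controlled by the pseudo-counts and the effective sample size.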
In [9], we performed a set of face mining experiments from Wikipedia biography pages using a technique that relies on probabilistic inference in a joint probability model. For a given identity, our mining technique dynamically creates probabilistic models to disambiguate the faces that correspond to the identity of interest. These models integrate uncertain information extracted throughout a document arising from three different modalities: text, meta data and images. Information from text and metadata is integrated into the larger model using multiple logistic regression based components.
The images, face detection results as bounding boxes, and some of the text and meta information extracted for one Wikipedia identity, Mr. Richard Parks, are shown in the top panel of Figure 6. In the bottom panel, we show an instance of our mining model and give a summary of the variables used in our technique. The model is a dynamically instantiated Bayesian network. Using the Bayesian network illustrated in Figure 6, the processing of information is intuitive. Text and metadata features are taken as input to the bottom layer of random variables , which influence binary (target or not target) indicator variables for each detected face through logistic regression based subcomponents. The results of visual comparisons between all faces detected in different images are encoded in the variables .
[Figure 6, top panel: text and metadata extracted for two images]
  Image 1  Image 2 
Image source  Infobox  Body text 
File name  Richard_Parks.jpg  737 Challenge.jpg 
Caption text  NULL  Richard Parks celebrating the end of the 737 Challenge at the National Assembly for Wales on 19 July 2011 
[Figure 6, bottom panel: mining model and variable descriptions]
Variables Description : Visual similarity for a pair of faces, and , across different images. Binary target vs. not target label for face, . : Constraint variable for image . Local features for a face. 
Both text and metadata are transformed into feature vectors associated with each detected instance of a face. For text analysis, we use information such as image file names and image captions. The location of an image in the page is an example of what we refer to as metadata. We also treat other information about the image that is not directly involved in facial comparisons as metadata, e.g. the relative size of a face to other faces detected in an image. The bottom layer or set of random variables in Figure 6 is used to encode these features, and we discuss the precise nature and definition of these features in more detail in [9]. is therefore the local feature vector for a face, , where is the feature for face index for image index . These features are used as the input to the part of our model responsible for producing the probability that a given instance of a face belongs to the identity of interest, encoded by the random variables in Figure 6. is therefore a set of binary target vs. not target indicator variables corresponding to each face, . Inferring these variables jointly corresponds to the goal of our mining model. The joint conditional distribution defined by the general case of our model is given by
(40) 
Apart from comparing faces across images, , the joint model uses predictive scores from per face local binary classifiers, . As mentioned above and discussed in more detail in [9], in our previous work we used Maximum Entropy Models (MEMs) or Logistic Regression models, operating on multimedia features, for these local binary predictions.
Here, we compare the result of replacing the logistic regression components in the model discussed above with our BBLR formulation. We examine the impact of this change in terms of making predictions based solely on independent models taking text and metadata features as input as well as the impact of this difference when LR vs BBLR models are used as subcomponents in the joint structured prediction model. Our hypothesis here is that the BBLR method might improve results due to its robustness to outliers (which we have already seen in our binary classification experiments) and that the method is potentially able to make more accurate probabilistic predictions, which could in turn lead to more precise joint inference.
Textonly features  Joint model with aligned faces  

MEM  63.4  76.0 
BBLR  67.8  78.2 
BBLR  70.2  81.5 
For this particular experiment, we use the biographies with 27 faces. Table 13 shows results comparing the MaxEnt model with our BBLR model. The results are for a fivefold leave one out of the Wikipedia dataset. One can see that we do indeed obtain superior performance with the independent BBLR models over the Maximum Entropy models. We also see improved performance when BBLR models are used in the coupled model where joint inference is used for predictions.
In the row labelled BBLR, we optimized in addition to other model parameters using the technique explained in Section 3.3. This produced statistically significant results compared to the maximum entropy model with . For this significance test, we used the McNemar test, as in our earlier sets of experiments.
5.3 Kernel Logistic Regression with the Generalized BetaBernoulli Loss
In Table 14 we compare BetaBernoulli logistic regression with an SVM and Kernel BetaBernoulli logistic regression (KBBLR). We see that our proposed approach compares favorably to the SVM result, which is widely considered a strong, state of the art baseline.
Dataset  BBLR  SVM  KBBLR 

Breast  
Heart  
Liver  
Pima 
5.4 Sparse Kernel BBLR
As shown in [2], one of the advantages of using the ramp loss for kernel based classification is that it can yield models that are even sparser than traditional SVMs based on the hinge loss. It is well known that L2 based regularization does not typically yield sparse solutions when used with traditional kernel logistic regression. Our analysis of the previous experiments reveals that the L2 regularized smooth zero one loss approximation approach proposed here does not in general lead to sparse models either. The well known L1 or lasso regularization method can yield sparse solutions, but often at the cost of prediction performance. Recently the so called elastic net regularization approach [26], based on a weighted combination of L1 and L2 regularization, has been shown to be more effective at encouraging sparsity with a less negative impact on performance. The elastic net approach can of course be viewed as a prior consisting of the product of a Gaussian and a Laplacian distribution. However, part of the motivation for the use of these methods is that they yield convex optimization problems when combined with the log logistic loss. Since we have developed a robust approach for optimizing a nonconvex objective function above, this opens the door to the use of nonconvex sparsity encouraging regularizers. Correspondingly, we propose and explore below a prior on parameters, or equivalently, a novel regularization approach based on a mixture of a Gaussian and a Laplacian distribution. This formulation can behave like a smooth approximation to an L0 counting “norm” prior on parameters in the limit as the Laplacian scale parameter goes to zero and the Gaussian variance goes to infinity.
With a (marginalized) GaussianLaplace mixture prior, our KBBLR loglikelihood becomes
(41)  
where is our kernel BetaBernoulli model as defined in section 3.4, equation (37). For each , its prior is modeled through a mixture of a zero mean Gaussian with variance and a Laplacian distribution , located at zero with shape parameter . For convenience we give the relevant partial derivatives for this prior in Appendix B. In our approach we also optimize the hyperparameters of this prior using hard assignment Expectation Maximization steps that are performed after step 3 of Algorithm 2. For precision we outline the steps of the modified range optimization for Kernel BBLR (KBBLR) in Algorithm 3 found in Appendix C.
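A per-weight Gaussian and Laplacian mixture prior of this kind, together with its derivative, can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: the mixing weight `pi_gauss`, the parameter names `sigma` and `b`, and the numerical values are assumptions, and the expressions used in the paper are those referred to in Appendix B.

```python
import math


def _components(a: float, sigma: float, b: float):
    """Gaussian N(a; 0, sigma^2) and Laplace(a; 0, b) density values."""
    gauss = math.exp(-0.5 * (a / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    laplace = math.exp(-abs(a) / b) / (2.0 * b)
    return gauss, laplace


def log_mixture_prior(a: float, sigma: float, b: float, pi_gauss: float) -> float:
    """Log of pi_gauss * Gaussian + (1 - pi_gauss) * Laplace at weight a."""
    gauss, laplace = _components(a, sigma, b)
    return math.log(pi_gauss * gauss + (1.0 - pi_gauss) * laplace)


def grad_log_mixture_prior(a: float, sigma: float, b: float, pi_gauss: float) -> float:
    """Derivative of the log prior with respect to a.

    The Laplace term is not differentiable at a = 0; its contribution
    there is set to 0 by convention.
    """
    gauss, laplace = _components(a, sigma, b)
    sign = (a > 0) - (a < 0)
    d_gauss = -a / sigma ** 2 * gauss
    d_laplace = -sign / b * laplace
    mix = pi_gauss * gauss + (1.0 - pi_gauss) * laplace
    return (pi_gauss * d_gauss + (1.0 - pi_gauss) * d_laplace) / mix


# Finite-difference check at a hypothetical point and hyperparameters.
a0, sigma0, b0, pi0 = 0.7, 2.0, 0.5, 0.3
eps = 1e-6
numeric = (log_mixture_prior(a0 + eps, sigma0, b0, pi0)
           - log_mixture_prior(a0 - eps, sigma0, b0, pi0)) / (2.0 * eps)
analytic = grad_log_mixture_prior(a0, sigma0, b0, pi0)
```

Because the mixture is a sum rather than a product of the two densities, the resulting penalty is nonconvex, which is why it pairs naturally with an optimizer already designed for a nonconvex loss.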
In Table 15, we compare sparse KBBLR and the SVM using a Radial Basis Function (RBF) kernel. The SVM free parameters were tuned by a cross validation run over the training data. For a sparse KBBLR solution, we used a mixture of a Gaussian and a Laplacian prior on the kernel weight parameters as presented above.
Dataset  SVM  Avg. Support Vectors  Sparse KBBLR  Avg. Support Vectors 

Breast  107  127  
Heart  148  85  
Liver  269  111  
Pima  548  269 
Table 15 compares sparse Kernel BBLR with SVMs on the standard UCI datasets. Figure 7 shows trends in the sparsity curves for an increase in the number of training instances comparing KBBLR with SVMs for one of the product review databases. We can see that KBBLR scales up well compared to an SVM solution when training data size increases. Support vectors for SVMs increase almost linearly for an increase in the database size, an effect that has been confirmed in a number of other studies [17, 2]. In comparison we can see that KBBLR with a GaussianLaplacian mixture prior produces a logarithmic curve for an increase in the database size. The right panel of the same figure also shows the weight distribution before and after the KBBLR optimization with a GaussianLaplacian mixture prior which yields the observed sparse solution.
6 Discussion and Conclusions
We have presented a novel formulation for learning with an approximation to the zero one loss. Through our generalized BetaBernoulli formulation, we have provided both a new smooth 0-1 loss approximation method and a new class of probabilistic classifiers. Our experimental results indicate that our generalized BetaBernoulli formulation is capable of yielding superior performance to traditional logistic regression and maximum margin linear SVMs for binary classification. Like other ramp like loss functions, one of the principal advantages of our approach is that it is more robust to outliers than traditional convex loss functions. Our modified SLA algorithm, which adds a learning hyperparameter optimization step, shows improved performance over the original SLA optimization algorithm in [14].
We have also presented and explored a kernelized version of our approach which yields performance competitive with nonlinear SVMs for binary classification. Furthermore, with a GaussianLaplacian mixture prior on parameters our kernel BetaBernoulli model is able to yield sparser solutions than SVMs while retaining competitive classification performance. Interestingly, for an increase in training database size, our approach exhibited logarithmic scaling properties which compare favourably to the linear scaling properties of SVMs. To the best of our knowledge this is the first exploration of a GaussLaplace mixture prior for parameters, certainly in combination with our novel smooth zero one loss formulation. The ability of this prior to behave like a smooth approximation to a counting prior is similar to an approach known as bridge regression in statistics. However, our mixture formulation has more flexibility compared to the simpler functional form of bridge regression. Interestingly, the combination of our generalized BetaBernoulli loss with a GaussianLaplacian parameter prior can be thought of as a smooth relaxation to learning with a zero one loss and an L0 counting prior or regularization, a formulation for classification that is intuitively attractive, but has remained elusive in practice until now.
We also tested our generalized BetaBernoulli models for a structured prediction task arising from the problem of face mining in Wikipedia biographies. Here also our model showed better performance than traditional logistic regression based approaches, both when they were tested as independent models, and when they were compared as subparts of a Bayesian network based structured prediction framework. This experiment shows signs that the model and optimization approach proposed here may have further potential to be used in complex structured prediction tasks.
Acknowledgements
We thank the NSERC Discovery Grants program and Google for a Faculty Research Award which helped support this work.
References
 [1] K. Bache and M. Lichman. UCI machine learning repository, 2013.
 [2] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In Proceedings of the 23rd international conference on Machine learning, pages 201–208. ACM, 2006.
 [3] A. Cotter, S. Shalev-Shwartz, and N. Srebro. Learning optimally sparse support vector machines. In Proceedings of the 30th International Conference on Machine Learning (ICML13), pages 266–274, 2013.
 [4] C. B. Do, Q. Le, C. H. Teo, O. Chapelle, and A. Smola. Tighter bounds for structured estimation. In Proc. of NIPS, 2008.
 [5] M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In Proceedings of the 25th international conference on Machine learning, pages 264–271. ACM New York, NY, USA, 2008.
 [6] S. Ertekin, L. Bottou, and C. L. Giles. Nonconvex online support vector machines. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(2):368–381, 2011.
 [7] V. Feldman, V. Guruswami, P. Raghavendra, and Y. Wu. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558–1590, 2012.
 [8] K. Gimpel and N. A. Smith. Structured ramp loss minimization for machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 221–231. Association for Computational Linguistics, 2012.
 [9] M. K. Hasan and C. Pal. Experiments on visual information extraction with the faces of wikipedia. 2014. AAAI Conference on Artificial Intelligence (AI).
 [10] R. Hérault and Y. Grandvalet. Sparse probabilistic classifiers. In Proceedings of the 24th international conference on Machine learning, pages 337–344. ACM, 2007.
 [11] A. H. Land and A. G. Doig. An automatic method of solving discrete programming problems. Econometrica, 28(3):497–520, 1960.
 [12] O. L. Mangasarian, W. N. Street, and W. H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4):570–577, 1995.
 [13] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent in function space. NIPS, 1999.
 [14] T. Nguyen and S. Sanner. Algorithms for direct 0–1 loss optimization in binary classification. In Proceedings of the 30th International Conference on Machine Learning (ICML13), pages 1085–1093, 2013.
 [15] F. Pérez-Cruz, A. Navia-Vázquez, A. R. Figueiras-Vidal, and A. Artes-Rodriguez. Empirical risk minimization for support vector classifiers. Neural Networks, IEEE Transactions on, 14(2):296–303, 2003.
 [16] F. Pernkopf, M. Wohlmayr, and S. Tschiatschek. Maximum margin bayesian network classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(3):521–532, 2012.
 [17] I. Steinwart. Sparseness of support vector machines. The Journal of Machine Learning Research, 4:1071–1105, 2003.
 [18] V. Vapnik. The nature of statistical learning theory. Springer, 2000.
 [19] P. Vincent. Modèles à noyaux à structure locale. Citeseer, 2004.
 [20] P. Vincent and Y. Bengio. Kernel matching pursuit. Machine Learning, 48(1-3):165–187, 2002.
 [21] D. Wang, D. Irani, and C. Pu. Evolutionary study of web spam: Webb spam corpus 2011 versus webb spam corpus 2006. In Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2012 8th International Conference on, pages 40–49. IEEE, 2012.
 [22] Y. Wu and Y. Liu. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479), 2007.
 [23] A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.
 [24] T. Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Information retrieval, 4(1):5–31, 2001.
 [25] X. Zhang, A. Saha, and S. Vishwanathan. Smoothing multivariate performance measures. Journal of Machine Learning Research, 10:1–55, 2011.
 [26] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
Appendix A Experimental Details
In the interests of reproducibility, we also list below the algorithm parameters and the recommended settings as given in [14]:

, a search radius reduction factor;

, the initial search radius;

, a grid spacing reduction factor;

, the initial grid spacing for 1D search;

, the gamma parameter reduction factor;

, the starting point for the search over ;

, the end point for the search over .
As a part of the Range Optimization procedure there is also a standard gradient descent procedure using a slowly reduced learning rate. The procedure has the following specified and unspecified default values for the constants defined below:

, a learning rate reduction factor;

, the initial learning rate;

, the minimal learning rate;

, used for a while loop stopping criterion based on the smallest change in the likelihood;

, used for outer stopping criterion based on magnitude of gradient
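A gradient descent loop with a slowly reduced learning rate and the two stopping criteria listed above might look like the following sketch. The function and argument names are hypothetical, and every default value is a placeholder chosen for illustration rather than one of the settings used in the experiments.

```python
from typing import Callable, List, Sequence


def gradient_descent(
    grad: Callable[[Sequence[float]], List[float]],
    loss: Callable[[Sequence[float]], float],
    w0: Sequence[float],
    lr0: float = 0.1,        # initial learning rate (placeholder)
    lr_decay: float = 0.999, # learning rate reduction factor (placeholder)
    lr_min: float = 0.01,    # minimal learning rate (placeholder)
    tol_loss: float = 1e-9,  # smallest change in the objective (placeholder)
    tol_grad: float = 1e-6,  # gradient magnitude threshold (placeholder)
    max_iter: int = 10000,
) -> List[float]:
    """Gradient descent with a geometrically reduced learning rate."""
    w = list(w0)
    lr = lr0
    prev = loss(w)
    for _ in range(max_iter):
        g = grad(w)
        if max(abs(gi) for gi in g) < tol_grad:  # gradient-based stop
            break
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        cur = loss(w)
        if abs(prev - cur) < tol_loss:           # objective-change stop
            break
        prev = cur
        lr = max(lr * lr_decay, lr_min)          # slowly reduce the step size
    return w


# Toy convex objective used only to exercise the loop.
def toy_loss(w: Sequence[float]) -> float:
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


def toy_grad(w: Sequence[float]) -> List[float]:
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


w_star = gradient_descent(toy_grad, toy_loss, [0.0, 0.0])
```

Flooring the learning rate at a minimum value keeps the updates from stalling before either stopping criterion is met, which matters when the decay factor is applied over many iterations.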
Appendix B Gradients for a GaussianLaplacian Mixture Prior
The gradient of the KBBLR likelihood is given in section 3.4. Below we provide the gradient of the log GaussianLaplace mixture prior or regularization term,
(42) 
(43) 
(44) 