A New Smooth Approximation to the Zero One Loss with a Probabilistic Interpretation
We examine a new form of smooth approximation to the zero one loss in which learning is performed using a reformulation of the widely used logistic function. Our approach is based on using the posterior mean of a novel generalized Beta-Bernoulli formulation. This leads to a generalized logistic function that approximates the zero one loss, but retains a probabilistic formulation conferring a number of useful properties. The approach is easily generalized to kernel logistic regression and easily integrated into methods for structured prediction. We present experiments in which we learn such models using an optimization method consisting of a combination of gradient descent and coordinate descent using localized grid search so as to escape from local minima. Our experiments indicate that optimization quality is improved when learning meta-parameters are themselves optimized using a validation set. Our experiments show improved performance relative to widely used logistic and hinge loss methods on a wide variety of problems ranging from standard UC Irvine and libSVM evaluation datasets to product review predictions and a visual information extraction task. We observe that the approach: 1) is more robust to outliers compared to the logistic and hinge losses; 2) outperforms comparable logistic and max margin models on larger scale benchmark problems; 3) when combined with a Gaussian-Laplacian mixture prior on parameters, the kernelized version of our formulation yields sparser solutions than Support Vector Machine classifiers; and 4) when integrated into a probabilistic structured prediction technique, our approach provides more accurate probabilities, yielding improved inference and increasing information extraction performance.
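Loss function minimization is a standard way of solving many important learning problems. In the classical statistical literature, this is known as Empirical Risk Minimization (ERM), where learning is performed by minimizing the average risk or loss over the training data. Formally, this is represented as
$$f^{*} = \arg\min_{f} \frac{1}{n} \sum_{i=1}^{n} L(f(\mathbf{x}_i), t_i), \tag{1}$$
where $f$ is a model, $\mathbf{x}_i$ is the $i$-th input feature vector with label $t_i$, there are $n$ pairs of features and labels, and $L(f(\mathbf{x}_i), t_i)$ is the loss for the model output $f(\mathbf{x}_i)$. Let us focus for the moment on the standard binary linear classification task in which we encode the target class label as $t_i \in \{-1, +1\}$ and the model parameter vector as $\mathbf{w}$. Letting $z_i = t_i \mathbf{w}^\top \mathbf{x}_i$, we can define the logistic, hinge, and 0-1 loss as
$$L_{\log}(z_i) = \log[1 + \exp(-z_i)], \tag{2}$$
$$L_{\mathrm{hinge}}(z_i) = \max(0, 1 - z_i), \tag{3}$$
$$L_{01}(z_i) = \mathbb{I}[z_i \le 0], \tag{4}$$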
where $\mathbb{I}[\cdot]$ is the indicator function, which takes the value of 1 when its argument is true and 0 when its argument is false. Of course, loss functions can be more complex, for example defined and learned through a linear combination of simpler basis loss functions, but we focus on the widely used losses above for now.
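The three losses above are easy to compare numerically. The following is a minimal sketch, assuming the margin convention $z = t\,\mathbf{w}^\top\mathbf{x}$ with labels in $\{-1, +1\}$; the function names and the sample margins are illustrative and not taken from the original.

```python
import numpy as np

def logistic_loss(z):
    """log(1 + exp(-z)), evaluated stably via logaddexp."""
    return np.logaddexp(0.0, -np.asarray(z, dtype=float))

def hinge_loss(z):
    """max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - np.asarray(z, dtype=float))

def zero_one_loss(z):
    """1 when the margin is non-positive (an error), else 0."""
    return (np.asarray(z, dtype=float) <= 0.0).astype(float)

# Hypothetical margins z = t * w^T x for four examples.
z = np.array([-5.0, -1.0, 0.5, 3.0])
for name, fn in [("logistic", logistic_loss),
                 ("hinge", hinge_loss),
                 ("0-1", zero_one_loss)]:
    print(name, np.round(fn(z), 3))
```

For the badly misclassified example (margin of -5) both convex losses grow roughly linearly with the distance from the boundary, while the 0-1 loss stays at 1.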
Different loss functions characterize the classification problem differently. The log logistic loss and the hinge loss are very similar in their shape, which can be verified from Figure 1. Logistic regression models involve optimizing the log logistic loss, while optimizing a hinge loss is the heart of Support Vector Machines (SVMs). While seemingly a sensible objective for a classification problem, empirical risk minimization with the 0-1 loss function is known to be an NP-hard problem.
Both the log logistic loss and the hinge loss are convex and therefore lead to optimization problems with a global minimum. However, both the log logistic loss and the hinge loss penalize a model heavily when data points are classified incorrectly and are far away from the decision boundary. As can be seen in Figure 1, their penalties can be much larger than that of the zero-one loss. The zero-one loss captures the intuitive goal of simply minimizing classification errors, and recent research has been directed to learning models using a smoothed zero-one loss approximation [24, 14]. Previous work has shown that both the hinge loss and, more recently, the 0-1 loss can be efficiently and effectively optimized directly using smooth approximations. That line of work also underscored the robustness advantages of the 0-1 loss to outliers. While the 0-1 loss is not convex, the current flurry of activity in the area of deep neural networks, as well as award-winning work on 0-1 loss approximations, has highlighted numerous other advantages to the use of non-convex loss functions. In our work here, we are interested in constructing a probabilistically formulated smooth approximation to the 0-1 loss.
Let us first compare the widely used log logistic loss with the hinge loss and the 0-1 loss in a little more detail. The log logistic loss from the well known logistic regression model arises from the form of the negative log likelihood defined by the model. More specifically, this logistic loss arises from a sigmoid function parametrizing probabilities and is easily recovered by re-arranging (2) to obtain a probability model of the form $p(t \mid \mathbf{x}, \mathbf{w}) = [1 + \exp(-t\,\mathbf{w}^\top\mathbf{x})]^{-1}$. In our work here, we take this familiar logistic function and transform it to create a new functional form. The sequence of curves starting with the blue curve in Figure 2 (top) gives an intuitive visualization of the way in which we alter the traditional log logistic loss. We call our new loss function the generalized Beta-Bernoulli logistic loss and use the acronym $BB\gamma$ when referring to it. We give it this name as it arises from the combined use of a Beta-Bernoulli distribution and a generalized logistic parametrization.
We give the Bayesian motivations for our Beta-Bernoulli construction in section 3. To gain some additional intuitions about the effect of our construction from a practical perspective, consider the following analysis. When viewing the negative log likelihood of the traditional logistic regression parametrization as a loss function, one might pose the following question: (1) what alternative functional form for the underlying probability would lead to a loss function exhibiting a plateau similar to the 0-1 loss for incorrectly classified examples? One might also pose a second question: (2) is it possible to construct a simple parametrization in which a single parameter controls the sharpness of the smooth approximation to the 0-1 loss? The intuition for an answer to the first question is that the traditional logistic parametrization converges to zero probability for small values of its argument. This in turn leads to a loss function that increases linearly for small values of its argument, as shown in Figure 1. In contrast, our new loss function is defined in such a way that for small values of its argument, the function converges to a non-zero probability. This effect manifests itself as the desired plateau, which can be seen clearly in the loss functions defined by our model in Figure 2 (top). The answer to our second question is indeed yes; more specifically, to control the sharpness of our approximation, we use a $\gamma$ factor reminiscent of a technique used in previous work that has created smooth approximations to the hinge loss as well as smooth approximations of the 0-1 loss. We show the intuitive effect of our construction for increasing values of $\gamma$ in Figure 2 and define it more formally below.
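To compare and contrast our loss function with other common loss functions such as those in equations (2-4) and others reviewed below, we express our loss here using the score $s_i = \mathbf{w}^\top\mathbf{x}_i$ and $\gamma$ as arguments. For $t_i = 1$, the $BB\gamma$ loss can be expressed as
$$L_{BB\gamma}(s_i, \gamma) = -\log\left(a + b\,[1 + \exp(-\gamma s_i)]^{-1}\right),$$
while for $t_i = -1$ it can be expressed as
$$L_{BB\gamma}(s_i, \gamma) = -\log\left(1 - \left(a + b\,[1 + \exp(-\gamma s_i)]^{-1}\right)\right).$$
We show in section 3 that the constants $a$ and $b$ have well defined interpretations in terms of the standard $\alpha$, $\beta$, and $n$ parameters of the Beta-Bernoulli model. Their impact on our proposed generalized Beta-Bernoulli loss arises from applying a fuller Bayesian analysis to the formulation of a logistic function.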
The visualization of our proposed loss in Figure 2 corresponds to the use of a weak non-informative prior. In Figure 2, we show the probability given by the model as a function of its argument at the right and the negative log probability, or the loss, on the left as $\gamma$ is varied over a set of integer powers. We see that the logistic function transition becomes more abrupt as $\gamma$ increases. The loss function behaves like the usual logistic loss for $\gamma$ close to 1, but provides an increasingly accurate smooth approximation to the zero one loss with larger values of $\gamma$. Intuitively, the location of the plateau of the smooth log logistic loss approximation on the y-axis is controlled by our choice of $\alpha$, $\beta$ and $n$. The effect of the weak uniform prior is to add a small minimum probability to the model, which can be imperceptible in the sigmoid function itself, but leads to the plateau in the negative log loss function. By contrast, the use of a strong prior for the losses in Figure 5 (left) leads to minimum and maximum probabilities that can be much further from zero and one.
Our work makes a number of contributions which we enumerate here: (1) The primary contribution of our work is a new probabilistically formulated approximation to the 0-1 loss based on a generalized logistic function and the use of the Beta-Bernoulli distribution. The result is a generalized sigmoid function in both probability and negative log probability space. (2) A second key contribution of our work is that we present and explore an adapted version of the SLA optimization algorithm in which we optimize the meta-parameters of learning using validation sets. We present a series of experiments in which we optimize the loss using the basic SLA algorithm and our modified version. For linear models, we show that our complete approach outperforms the widely used techniques of logistic regression and linear support vector machines. As expected, our experiments indicate that the relative performance of the approach further increases when noisy outliers are present in the data. (3) We go on to present a number of experiments with larger scale data sets demonstrating that our method also outperforms widely used logistic regression and SVM techniques despite the fact that the underlying models involved are linear. (4) We apply our model in a structured prediction task formulated for mining faces in Wikipedia biography pages. Our proposed method is well adapted to this setting and we find that the improved probabilistic modeling capabilities of our approach yield improved results for visual information extraction through improved probabilistic structured prediction. (5) We also show how this approach is easily adapted to create a novel form of kernel logistic regression based on our generalized Beta-Bernoulli Logistic Regression (BBLR) framework. We find that the kernelized version of our method, Kernel BBLR (KBBLR), outperforms non-linear support vector machines.
As expected, the regularized KBBLR does not yield sparse solutions; however, (6) since we have developed a robust method for optimizing a non-convex loss, we propose and explore a novel non-convex sparsity encouraging prior based on a mixture of a Gaussian and a Laplacian. Sparse KBBLR typically yields sparser solutions than SVMs with comparable prediction performance, and the degree of sparsity scales much more favorably compared to SVMs.
The remainder of this paper is structured as follows. In section 2, we present a review of some relevant recent work in the area of 0-1 loss approximation. In section 3, we present the underlying Bayesian motivations for our proposed loss function. In section 4, we provide the details of optimization and algorithms. In section 5, we present experimental results using protocols that both facilitate comparisons with prior work and evaluate our method on some large scale and structured prediction problems. We provide a final discussion and conclusions in section 6.
2 Relevant Recent Work
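It has been shown that it is possible to define a generalized logistic loss and produce a smooth approximation to the hinge loss using the following formulation
$$L_{\mathrm{glog}}(z_i, \gamma) = \frac{1}{\gamma}\log\left[1 + \exp(\gamma(1 - z_i))\right],$$
such that $\lim_{\gamma \to \infty} L_{\mathrm{glog}} = L_{\mathrm{hinge}}$, where $z_i = t_i\mathbf{w}^\top\mathbf{x}_i$ is the signed margin. This approximation is achieved using a $\gamma$ factor and a shifted version of the usual logistic loss. We illustrate the way in which this construction can be used to approximate the hinge loss in Figure 3 (left).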
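The convergence to the hinge loss can be checked numerically. The sketch below assumes the form $\frac{1}{\gamma}\log[1 + \exp(\gamma(1 - z))]$ for the generalized logistic loss; the grid of margins and the chosen values of $\gamma$ are illustrative only.

```python
import numpy as np

def smooth_hinge(z, gamma):
    """(1/gamma) * log(1 + exp(gamma * (1 - z))), computed stably."""
    return np.logaddexp(0.0, gamma * (1.0 - z)) / gamma

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

# Margins on a grid that includes the kink of the hinge at z = 1.
z = np.linspace(-3.0, 3.0, 13)

# Largest gap between the smooth loss and the hinge for growing gamma.
gaps = [float(np.max(np.abs(smooth_hinge(z, g) - hinge(z))))
        for g in (1.0, 4.0, 16.0, 64.0)]
print(gaps)
```

The worst-case gap sits at the kink $z = 1$, where the smooth loss equals $\log(2)/\gamma$, so the approximation error shrinks in proportion to $1/\gamma$.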
The maximum margin Bayesian network formulation in prior work also employs a smooth differentiable hinge loss inspired by the Huber loss, having a similar shape to the smooth approximation above. The sparse probabilistic classifier approach truncates the logistic loss, leading to sparse kernel logistic regression models. Other work proposed a technique for learning support vector classifiers based on arbitrary loss functions composed of a combination of a hyperbolic tangent loss function and a polynomial loss function.
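Other recent work has created a smooth approximation to the 0-1 loss by directly defining the loss as a modified sigmoid. They used the following function
$$L_{\mathrm{sig}}(z_i, \gamma) = \frac{1}{1 + \exp(\gamma z_i)},$$
where $z_i = t_i\mathbf{w}^\top\mathbf{x}_i$. In a way similar to the smooth approximation to the hinge loss, here $\lim_{\gamma \to \infty} L_{\mathrm{sig}} = L_{01}$. We illustrate the way in which this construction approximates the 0-1 loss in Figure 3 (right).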
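A small numerical sketch of this sigmoid loss follows, assuming the form $1/(1 + \exp(\gamma z))$ with $z$ the signed margin; the function name and test margins are illustrative.

```python
import numpy as np

def sigmoid_loss(z, gamma):
    """1 / (1 + exp(gamma * z)), written to avoid overflow for large gamma * z."""
    return np.exp(-np.logaddexp(0.0, gamma * np.asarray(z, dtype=float)))

z = np.array([-2.0, -0.1, 0.1, 2.0])
for gamma in (1.0, 10.0, 100.0):
    print(gamma, np.round(sigmoid_loss(z, gamma), 4))
```

As $\gamma$ grows, the loss for negative margins approaches 1 and the loss for positive margins approaches 0. At exactly $z = 0$ the value stays at 0.5 for every $\gamma$, so the limit matches the 0-1 loss everywhere except on the decision boundary itself.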
Another important aspect of that work is that they compared a variety of algorithms for directly optimizing the 0-1 loss with a novel algorithm for optimizing the sigmoid loss. They call their algorithm SLA, for Smooth 0-1 Loss Approximation. The compared direct 0-1 loss optimization algorithms are: (1) a Branch and Bound (BnB) technique, (2) a Prioritized Combinatorial Search (PCS) technique and (3) an algorithm referred to as a Combinatorial Search Approximation (CSA); the latter two are presented in more detail elsewhere. They compared these methods with the use of their SLA algorithm to optimize the sigmoidal approximation to the 0-1 loss.
To evaluate and compare the quality of the non-convex optimization results produced by BnB, PCS and CSA with their SLA algorithm for the sigmoid loss, the authors also present training set errors for a number of standard evaluation datasets. We provide an excerpt of their results in Table 1, as we will perform similar comparisons in our experimental work. These results indicated that the SLA algorithm consistently yielded superior performance at finding a good minimum of the underlying non-convex problem. Furthermore, they also provide an analysis of the run-time performance for each of the algorithms. Their experiments indicated that the SLA technique was significantly faster than the alternative algorithms for non-convex optimization. Based on these results we build upon the SLA approach in our work here.
Award-winning prior work produced an approximation to the 0-1 loss by creating a ramp loss, obtained by combining the traditional hinge loss with a shifted and inverted hinge loss as illustrated in Figure 4. The authors showed how to optimize the ramp loss using the Concave-Convex Procedure (CCCP) and that this yields faster training times compared to traditional SVMs. Other more recent work has proposed an alternative online SVM learning algorithm for the ramp loss. Other authors explored a similar ramp loss, which they refer to as a robust truncated hinge loss. More recent work has explored a similar ramp-like construction, which they refer to as the slant loss. Interestingly, the ramp loss formulation has also been generalized to structured predictions [4, 8].
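The ramp construction described above can be sketched as the difference of two hinge functions. The parametrization below, $H_1(z) - H_s(z)$ with $H_a(z) = \max(0, a - z)$ and a shift $s < 1$, is a common way of writing it; the default $s = -1$ is an assumption made here for illustration and is not taken from the original.

```python
import numpy as np

def shifted_hinge(z, a):
    """H_a(z) = max(0, a - z); a = 1 gives the usual hinge loss."""
    return np.maximum(0.0, a - np.asarray(z, dtype=float))

def ramp_loss(z, s=-1.0):
    """Hinge loss plus an inverted hinge shifted to s (assumed s < 1)."""
    return shifted_hinge(z, 1.0) - shifted_hinge(z, s)

z = np.array([-10.0, -1.0, 0.0, 0.5, 1.0, 2.0])
print(ramp_loss(z))
```

The loss is flat at $1 - s$ for margins below $s$, decreases linearly between $s$ and 1, and is zero beyond 1, which is what bounds the penalty given to badly misclassified outliers.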
Although the smoothed zero-one loss has attracted much attention recently, we can find older references to similar research. Zero-one-like functional losses have been used in machine learning, especially by the boosting and neural network communities. Vincent argues that a loss defined through a functional of the hyperbolic tangent is more robust, as it does not penalize outliers as heavily as the log logistic loss, the hinge loss, and the squared loss. This loss has the interesting properties of being continuous while behaving like the zero-one loss. A variant of this loss has been used in boosting algorithms. Other work has also shown that a hyperbolic tangent parametrized squared error loss transforms the squared error loss to behave more like the hyperbolic tangent loss.
3 Our Approach: Generalized Beta-Bernoulli Logistic Classification
We now derive a novel form of logistic regression based on formulating a generalized sigmoid function arising from an underlying Bernoulli model with a Beta prior. We also use a $\gamma$ scaling factor to increase the sharpness of our approximation. Consider first the traditional and widely used formulation of logistic regression, which can be derived from a probabilistic model based on the Bernoulli distribution. The Bernoulli probabilistic model has the form:
$$P(y \mid \theta) = \theta^{y}(1 - \theta)^{1 - y},$$
where $y \in \{0, 1\}$ is the class label, and $\theta$ is the parameter of the model. The Bernoulli distribution can be re-expressed in standard exponential family form as
$$P(y \mid \theta) = \exp\left\{ y \log\frac{\theta}{1 - \theta} + \log(1 - \theta) \right\},$$
where the natural parameter $\eta$ is given by
$$\eta = \log\frac{\theta}{1 - \theta}.$$
In traditional logistic regression, we let the natural parameter $\eta = \mathbf{w}^\top\mathbf{x}$, which leads to a model where $\theta = \theta_{ML}$, in which the following parametrization is used
$$\theta_{ML} = \frac{1}{1 + \exp(-\mathbf{w}^\top\mathbf{x})}.$$
The conjugate distribution to the Bernoulli is the Beta distribution
$$\mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\,\theta^{\alpha - 1}(1 - \theta)^{\beta - 1},$$
where $\alpha$ and $\beta$ have the intuitive interpretation as the equivalent pseudo counts for observations for the two classes of the model and $B(\alpha, \beta)$ is the beta function. When we use the Beta distribution as the prior over the parameters of the Bernoulli distribution, the posterior mean of the Beta-Bernoulli model is easily computed due to the fact that the posterior is also a Beta distribution. This property also leads to an intuitive form for the posterior mean or expected value in a Beta-Bernoulli model, which consists of a simple weighted average of the prior mean and the traditional maximum likelihood estimate, $\theta_{ML}$, such that
$$\theta_{BB} = w\,\frac{\alpha}{\alpha + \beta} + (1 - w)\,\theta_{ML}, \qquad w = \frac{\alpha + \beta}{\alpha + \beta + n},$$
and where $n$ is the number of examples used to estimate $\theta_{ML}$. Consider now the task of making a prediction using a Beta posterior and the predictive distribution. It is easy to show that the mean or expected value of the posterior predictive distribution is equivalent to plugging the posterior mean parameters of the Beta distribution into the Bernoulli distribution, i.e.
$$P(y \mid \mathcal{D}) = \int P(y \mid \theta)\,P(\theta \mid \mathcal{D})\,d\theta = \mathrm{Ber}(y \mid \theta_{BB}).$$
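The weighted-average form of the Beta-Bernoulli posterior mean is a standard identity and can be verified directly. In the sketch below, the function names and the example counts are illustrative; `n_pos` denotes the number of positive observations among `n`.

```python
import math

def posterior_mean(alpha, beta, n_pos, n):
    """Mean of the Beta(alpha + n_pos, beta + n - n_pos) posterior."""
    return (alpha + n_pos) / (alpha + beta + n)

def weighted_average_form(alpha, beta, n_pos, n):
    """Same quantity as a mix of the prior mean and the ML estimate."""
    w = (alpha + beta) / (alpha + beta + n)   # weight on the prior mean
    prior_mean = alpha / (alpha + beta)
    theta_ml = n_pos / n                      # maximum likelihood estimate
    return w * prior_mean + (1.0 - w) * theta_ml

# Hypothetical pseudo-counts and data: 7 positives out of 10 observations.
direct = posterior_mean(2.0, 3.0, 7, 10)
mixed = weighted_average_form(2.0, 3.0, 7, 10)
print(direct, mixed)
```

The weight on the prior mean shrinks as $n$ grows relative to $\alpha + \beta$, which is why a weak prior only perturbs the maximum likelihood estimate slightly.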
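Given these observations, we thus propose here to replace the traditional sigmoidal function used in logistic regression with the function given by the posterior mean of the Beta-Bernoulli model such that
$$\mu_{BB}(\mathbf{w}, \mathbf{x}) = w\,\theta_B + (1 - w)\,\frac{1}{1 + \exp(-\mathbf{w}^\top\mathbf{x})},$$
where $\theta_B = \alpha/(\alpha + \beta)$ is the prior mean and $w = (\alpha + \beta)/(\alpha + \beta + n)$ is the weight placed on it. Further, to increase our model's ability to approximate the zero one loss, we shall also use a generalized form of the Beta-Bernoulli model above where we set the natural parameter so that $\eta = \gamma\,\mathbf{w}^\top\mathbf{x}$. This leads to our complete model based on a generalized Beta-Bernoulli formulation
$$\mu_{BB\gamma}(\mathbf{w}, \mathbf{x}) = w\,\theta_B + (1 - w)\,\frac{1}{1 + \exp(-\gamma\,\mathbf{w}^\top\mathbf{x})}.$$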
It is useful to remind the reader at this point that we have used the Beta-Bernoulli construction to define our function, not to define a prior over the parameter of a random variable as is frequently done with the Beta distribution. Furthermore, in traditional Bayesian approaches to logistic regression, a prior is placed on the parameters and used for MAP parameter estimation or more fully Bayesian methods in which one integrates over the uncertainty in the parameters.
In our formulation here, we have placed a prior on the function, as is commonly done with Gaussian processes. Our approach might be seen as a pragmatic alternative to working with the fully Bayesian posterior distributions over functions given data. The more fully Bayesian procedure would be to use the posterior predictive distribution to make predictions.
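Let us consider again the negative log logistic loss function defined by our generalized Beta-Bernoulli formulation, where we let $s = \mathbf{w}^\top\mathbf{x}$ and we use our $t \in \{-1, +1\}$ encoding for class labels. Writing $a = w\,\theta_B$ and $b = 1 - w$ for the constants arising from the posterior-mean construction, for $t = 1$ this leads to
$$-\log P(t = 1 \mid \mathbf{x}, \mathbf{w}) = -\log\left(a + b\,[1 + \exp(-\gamma s)]^{-1}\right),$$
while for the case when $t = -1$, the negative log probability is simply
$$-\log P(t = -1 \mid \mathbf{x}, \mathbf{w}) = -\log\left(1 - \left(a + b\,[1 + \exp(-\gamma s)]^{-1}\right)\right).$$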
In Figure 2 we showed how setting the scalar parameter $\gamma$ to larger values allows our generalized Beta-Bernoulli model to more closely approximate the zero one loss. We show the loss under a stronger Beta prior in Figure 5 (left) and, as we can see, this leads to an approximation with a range of values that are even closer to the 0-1 loss. As one might imagine, with a little analysis of the form and asymptotics of this function, one can also see that for a given setting of the prior parameters, a corresponding scaling factor and linear translation can be found so as to transform the range of the loss into the interval $[0, 1]$, so that the transformed loss approaches the 0-1 loss as $\gamma$ grows. However, when the prior is asymmetric ($\alpha \neq \beta$), as shown in Figure 5 (right), the loss function is asymmetric, and in the limit of large $\gamma$ this corresponds to different losses for true positives, false positives, true negatives and false negatives. For these and other reasons we believe that this formulation has many attractive and useful properties.
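The plateau behaviour discussed above can be reproduced with a few lines of code. This is a sketch only: it assumes the model probability takes the posterior-mean form $w\,\theta_B + (1 - w)\,\sigma(\gamma\,\mathbf{w}^\top\mathbf{x})$ described in this section, and the default values `alpha=1`, `beta=1`, `n=100` are hypothetical choices for a weak prior, not values taken from the original.

```python
import numpy as np

def bb_gamma_prob(s, gamma, alpha=1.0, beta=1.0, n=100.0):
    """P(t = +1 | x) for score s = w^T x under the assumed BB-gamma form."""
    w_mix = (alpha + beta) / (alpha + beta + n)    # weight on the prior mean
    theta_b = alpha / (alpha + beta)               # prior mean
    sig = np.exp(-np.logaddexp(0.0, -gamma * np.asarray(s, dtype=float)))
    return w_mix * theta_b + (1.0 - w_mix) * sig

def bb_gamma_loss(s, t, gamma, **prior):
    """Negative log probability of label t in {-1, +1}."""
    p = bb_gamma_prob(s, gamma, **prior)
    return np.where(t == 1, -np.log(p), -np.log1p(-p))

# Loss for a positive example at several scores and sharpness values.
scores = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
for gamma in (1.0, 4.0, 16.0):
    print(gamma, np.round(bb_gamma_loss(scores, 1, gamma), 3))
```

With these hypothetical prior values the minimum probability is $w\,\theta_B = 1/102$, so the loss for a badly misclassified example saturates at $\log 102$ instead of growing without bound, and a symmetric prior gives the same plateau for both classes.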
3.1 Parameter Estimation and Gradients
We now turn to the problem of estimating the parameters $\mathbf{w}$, given data in the form of label and feature vector pairs, using our model. As we have defined a probabilistic model, as usual we shall simply write the probability defined by our model and then optimize the parameters by maximizing the log probability or minimizing the negative log probability. As we shall discuss in more detail in section 4, we use a modified form of the SLA optimization algorithm in which we slowly increase $\gamma$ and interleave gradient descent steps with coordinate descent implemented as a grid search. For the gradient descent part of the optimization we shall need the gradients of our loss function and we therefore give them below.
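Consider first the usual formulation of the conditional probability used in logistic regression
$$P(y \mid \mathbf{x}, \mathbf{w}) = \mu^{y}(1 - \mu)^{1 - y},$$
here in place of the usual $\mu = \sigma(\mathbf{w}^\top\mathbf{x})$, in our generalized Beta-Bernoulli formulation we now have $\mu = \mu_{BB\gamma}(\mathbf{w}, \mathbf{x}) = w\,\theta_B + (1 - w)\,\sigma(\gamma\,\mathbf{w}^\top\mathbf{x})$, where $\sigma(u) = [1 + \exp(-u)]^{-1}$. Given a data set consisting of label and feature vector pairs $(y_i, \mathbf{x}_i)$ with $y_i \in \{0, 1\}$, this yields a log-likelihood given by
$$\ell(\mathbf{w}) = \sum_{i} \left[ y_i \log \mu_i + (1 - y_i)\log(1 - \mu_i) \right], \tag{24}$$
where the gradient of this function is given by
$$\nabla_{\mathbf{w}}\,\ell = \sum_{i} \left[ \frac{y_i}{\mu_i} - \frac{1 - y_i}{1 - \mu_i} \right] (1 - w)\,\gamma\,\sigma_i (1 - \sigma_i)\,\mathbf{x}_i, \qquad \sigma_i = \sigma(\gamma\,\mathbf{w}^\top\mathbf{x}_i).$$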
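A gradient of this kind is easy to get wrong, so it is worth checking against finite differences. The sketch below assumes the model probability $\mu = a + b\,\sigma(\gamma\,\mathbf{w}^\top\mathbf{x})$ with constants $a = w\,\theta_B$ and $b = 1 - w$, and labels encoded in $\{0, 1\}$; the data, the constants and the function names are all illustrative.

```python
import numpy as np

def sigmoid(u):
    return np.exp(-np.logaddexp(0.0, -u))

def log_lik(w, X, y, gamma, a, b):
    """Sum of y*log(mu) + (1-y)*log(1-mu) with mu = a + b*sigmoid(gamma*Xw)."""
    mu = a + b * sigmoid(gamma * (X @ w))
    return np.sum(y * np.log(mu) + (1.0 - y) * np.log1p(-mu))

def grad_log_lik(w, X, y, gamma, a, b):
    """Analytic gradient of log_lik with respect to w (chain rule)."""
    s = sigmoid(gamma * (X @ w))
    mu = a + b * s
    coeff = (y / mu - (1.0 - y) / (1.0 - mu)) * b * gamma * s * (1.0 - s)
    return X.T @ coeff

# Small synthetic problem (hypothetical values throughout).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=3)
gamma, a, b = 2.0, 0.01, 0.98

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (log_lik(w + eps * e, X, y, gamma, a, b)
     - log_lik(w - eps * e, X, y, gamma, a, b)) / (2.0 * eps)
    for e in np.eye(3)
])
analytic = grad_log_lik(w, X, y, gamma, a, b)
print(np.max(np.abs(numeric - analytic)))
```

Because $a > 0$ and $a + b < 1$, the probability never reaches 0 or 1, so both logarithms stay finite even for large $\gamma$.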
3.2 Some Asymptotic Analysis
As we have stated at the beginning of our discussion on parameter estimation, at the end of our optimization we will have a model with a large $\gamma$. With a sufficiently large $\gamma$, all predictions will be given the maximum or minimum probabilities possible under the model. Defining the $y = 1$ class as the positive class, if we set the maximum probability under the model equal to the True Positive Rate (TPR) (e.g. on training and/or validation data) and the maximum probability for the negative class equal to the True Negative Rate (TNR) we have
$$w\,\theta_B + (1 - w) = \mathrm{TPR}, \qquad 1 - w\,\theta_B = \mathrm{TNR},$$
which allows us to conclude that this would equivalently correspond to setting
$$w = 2 - (\mathrm{TPR} + \mathrm{TNR}), \qquad \theta_B = \frac{1 - \mathrm{TNR}}{2 - (\mathrm{TPR} + \mathrm{TNR})}.$$
This analysis gives us a good idea of the expected behavior of the model if we optimize $w$ and $\theta_B$ on a training set. It also suggests that an even better strategy for tuning $w$ and $\theta_B$ would be to use a validation set.
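The asymptotic argument can be turned into a small helper. The sketch assumes the maximum probability for the positive class is $w\,\theta_B + (1 - w)$ and the maximum probability for the negative class is $1 - w\,\theta_B$, as implied by the posterior-mean form of the model; the function name and the example rates are hypothetical.

```python
import math

def asymptotic_prior(tpr, tnr):
    """Solve w*theta_b + (1 - w) = TPR and 1 - w*theta_b = TNR."""
    w_mix = 2.0 - (tpr + tnr)
    if w_mix <= 0.0:
        # A perfect classifier (TPR = TNR = 1) leaves theta_b undefined.
        raise ValueError("TPR + TNR must be below 2")
    theta_b = (1.0 - tnr) / w_mix
    return w_mix, theta_b

# Hypothetical rates measured with a hard threshold on held-out data.
w_mix, theta_b = asymptotic_prior(tpr=0.9, tnr=0.8)
max_pos = w_mix * theta_b + (1.0 - w_mix)   # recovers the TPR
max_neg = 1.0 - w_mix * theta_b             # recovers the TNR
print(w_mix, theta_b, max_pos, max_neg)
```

Both returned values lie in the unit interval whenever the rates are themselves valid probabilities and the classifier is not perfect.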
3.3 Learning hyper-parameters
We have provided an asymptotic analysis of the expected values for $w$ and $\theta_B$ in the previous section. In the experiment section, we provide BBLR results using the asymptotic values of these two parameters along with cross-validated values for the other hyper-parameters, including $\lambda$, the regularization parameter described in Section 4. It is however also possible to learn these hyper-parameters using the training set, the validation set or both. Below, we provide the partial derivatives of the log-likelihood (24) with respect to these hyper-parameters.
The partial derivatives with respect to $\theta_B$ and $w$ are as follows
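Under the assumption that the model probability has the posterior-mean form $\mu_i = w\,\theta_B + (1 - w)\,\sigma_i$, the chain rule gives the following expressions. The symbols $w$ (mixing weight), $\theta_B$ (prior mean), $r_i$ and $\sigma_i$ are notation assumed here for illustration; the derivative with respect to $\gamma$ is included for completeness.

```latex
\begin{align*}
\sigma_i &= \frac{1}{1 + \exp(-\gamma\,\mathbf{w}^\top\mathbf{x}_i)}, \qquad
\mu_i = w\,\theta_B + (1 - w)\,\sigma_i, \qquad
r_i = \frac{y_i}{\mu_i} - \frac{1 - y_i}{1 - \mu_i}, \\[4pt]
\frac{\partial \ell}{\partial \theta_B} &= \sum_i r_i\,\frac{\partial \mu_i}{\partial \theta_B}
  = \sum_i r_i\, w, \\[4pt]
\frac{\partial \ell}{\partial w} &= \sum_i r_i\,\frac{\partial \mu_i}{\partial w}
  = \sum_i r_i\,(\theta_B - \sigma_i), \\[4pt]
\frac{\partial \ell}{\partial \gamma} &= \sum_i r_i\,\frac{\partial \mu_i}{\partial \gamma}
  = \sum_i r_i\,(1 - w)\,\sigma_i (1 - \sigma_i)\,\mathbf{w}^\top\mathbf{x}_i .
\end{align*}
```

Each line is the common factor $r_i = \partial \ell_i / \partial \mu_i$ multiplied by the derivative of $\mu_i$ with respect to the hyper-parameter in question.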
3.4 Kernel Beta-Bernoulli Classification
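It is possible to transform the traditional logistic regression technique discussed above into a kernel logistic regression (KLR) by replacing the linear discriminant function, $\mathbf{w}^\top\mathbf{x}$, with
$$f(\mathbf{x}) = \sum_{j} c_j\,K(\mathbf{x}, \mathbf{x}_j),$$
where $K$ is a kernel function, $c_j$ are the kernel expansion coefficients, and $j$ is used as an index in the sum over all training examples.
To create our generalized Beta-Bernoulli KLR model we take a similar path; however, in this case we let $\eta = \gamma f(\mathbf{x})$. Thus, our Kernel Beta-Bernoulli model can be written as:
$$\mu_{KBB\gamma}(\mathbf{c}, \mathbf{x}) = w\,\theta_B + (1 - w)\,\frac{1}{1 + \exp\left(-\gamma \sum_{j} c_j\,K(\mathbf{x}, \mathbf{x}_j)\right)}.$$
If we write $f(\mathbf{x}) = \mathbf{c}^\top\mathbf{k}(\mathbf{x})$, where $\mathbf{k}(\mathbf{x})$ is a vector of kernel values, then the gradient of the corresponding KBBLR log likelihood obtained by setting $\mu_i = \mu_{KBB\gamma}(\mathbf{c}, \mathbf{x}_i)$ in (24) is
$$\nabla_{\mathbf{c}}\,\ell = \sum_{i} \left[ \frac{y_i}{\mu_i} - \frac{1 - y_i}{1 - \mu_i} \right] (1 - w)\,\gamma\,\sigma_i (1 - \sigma_i)\,\mathbf{k}(\mathbf{x}_i), \qquad \sigma_i = \sigma(\gamma\,\mathbf{c}^\top\mathbf{k}(\mathbf{x}_i)).$$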
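The kernelized gradient has the same structure as the linear one, with the rows of the kernel matrix playing the role of the feature vectors. The following sketch assumes an RBF kernel, the probability form $a + b\,\sigma(\gamma\,\mathbf{c}^\top\mathbf{k}(\mathbf{x}))$, and labels in $\{0, 1\}$; the kernel choice, the coefficient name `coef` and all numeric values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(u):
    return np.exp(-np.logaddexp(0.0, -u))

def rbf_kernel(A, B, width=1.0):
    """Matrix of exp(-||a - b||^2 / (2 * width^2)) for rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def kbblr_log_lik(coef, K, y, gamma, a, b):
    mu = a + b * sigmoid(gamma * (K @ coef))
    return np.sum(y * np.log(mu) + (1.0 - y) * np.log1p(-mu))

def kbblr_grad(coef, K, y, gamma, a, b):
    """Gradient w.r.t. the kernel coefficients; row i of K is k(x_i)."""
    s = sigmoid(gamma * (K @ coef))
    mu = a + b * s
    r = (y / mu - (1.0 - y) / (1.0 - mu)) * b * gamma * s * (1.0 - s)
    return K.T @ r

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))
y = (X[:, 0] > 0).astype(float)
K = rbf_kernel(X, X)
coef = rng.normal(scale=0.1, size=15)
gamma, a, b = 2.0, 0.01, 0.98

g = kbblr_grad(coef, K, y, gamma, a, b)
print(g.shape)
```

In practice a regularizer on the coefficients would be added to this gradient; it is left out here to keep the sketch focused on the likelihood term.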
4 Optimization and Algorithms
As we have discussed in the relevant recent work section above, prior work has shown that the SLA algorithm applied to the sigmoid loss outperformed a number of other techniques in terms of both true 0-1 loss minimization performance and run time. As our generalized Beta-Bernoulli loss is another type of smooth approximation to the 0-1 loss, we use a variation of the SLA algorithm to optimize it. Recall that if one compares our generalized Beta-Bernoulli logistic loss with the directly defined sigmoidal loss used in the SLA work, it becomes apparent that the BBLR formulation has three additional hyper-parameters, $\{\alpha, \beta, n\}$. These additional parameters control the locations of the plateaus of our function, and these plateaus have well defined interpretations in terms of probabilities. In contrast, the plateaus of the sigmoidal loss are located at zero and one. Additionally, in practice one is interested in optimizing the regularized loss, where some form of prior or regularization is used for parameters. In our experiments here, we follow the widely used practice of using a Gaussian prior for parameters. The corresponding regularized loss arising from the negative log likelihood with the additional regularization term gives us our complete objective function
$$\Omega(\mathbf{w}) = -\ell(\mathbf{w}) + \frac{\lambda}{2}\,\|\mathbf{w}\|_2^2,$$
where the parameter $\lambda$ controls the strength of the regularization. With these additional hyper-parameters, the original SLA algorithm is not directly applicable to our formulation. However, if we hold these hyper-parameters fixed, we are able to use the general idea of the SLA approach and perform a Modified SLA optimization as given in Algorithms 1 and 2. In our experiments below, we use that strategy in the BBLR series of experiments. To deal with the issue of how to jointly learn the weights as well as the hyper-parameters, in our BBLR series of experiments we learn these hyper-parameters by gradient descent on the training set. More precisely, we learn $\theta_B$ and $w$ (as opposed to learning $\alpha$ and $\beta$), as this permits the parameters to be easily re-parametrized so that they both lie within $[0, 1]$.
Very importantly, our initial experiments indicated that the basic SLA formulation required considerable hand tuning of learning parameters for each new data set. This was the case even using the simplest smooth loss function without the additional degrees of freedom afforded by our formulation. This led us to develop a meta-optimization procedure for learning algorithm parameters. The BBLR series of experiments below use this learning meta-parameter optimization procedure. Our initial and formal experiments here indicate that this meta-optimization of learning parameters is in fact essential in practice. We therefore present it in more detail below.
4.1 Our SLA Algorithm Meta-optimization (SLAM)
Here we present our meta-optimization extension and various other modifications to the SLA approach. The original SLA algorithm can be decomposed into two parts: an outer loop that initializes a model and then slowly increases the $\gamma$ factor of the sigmoidal loss, repeatedly calling an algorithm referred to as Range Optimization for SLA or Gradient Descent in Range. The Range Optimization part consists of two stages. Stage 1 is a standard gradient descent optimization with a decreasing learning rate (using the new $\gamma$ factor). Stage 2 probes each parameter within a radius using a one dimensional grid search to determine if the loss can be further reduced, thus implementing a coordinate descent on a set of grid points. We provide a slightly modified form of the outer loop of their algorithm in Algorithm 1, where we have expressed the initial parameters given to the model as explicit parameters given to the algorithm. In their approach they hard code the initial parameter estimates as the result of an SVM run on their data. We provide a compressed version of their inner Range Optimization technique in Algorithm 2.
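The two-part structure described above can be sketched as follows. This is an illustrative reading of the prose description only, not a reproduction of Algorithms 1 and 2: the sigmoid loss form, the step-size schedule, the grid, the $\gamma$ schedule, the zero initialization and the toy data are all assumptions made for the example.

```python
import numpy as np

def sigmoid_loss(w, X, t, gamma):
    """Mean of 1 / (1 + exp(gamma * z)) with z = t * Xw."""
    return np.mean(np.exp(-np.logaddexp(0.0, gamma * t * (X @ w))))

def sigmoid_loss_grad(w, X, t, gamma):
    s = np.exp(-np.logaddexp(0.0, gamma * t * (X @ w)))
    # d/dz of 1/(1+exp(gamma*z)) is -gamma * s * (1 - s); dz/dw = t * x.
    return X.T @ (-gamma * s * (1.0 - s) * t) / len(t)

def range_optimization(w, X, t, gamma, lr=0.5, steps=50, radius=1.0, grid=11):
    # Stage 1: gradient descent with a decreasing learning rate.
    for k in range(steps):
        w = w - (lr / (1.0 + k)) * sigmoid_loss_grad(w, X, t, gamma)
    # Stage 2: coordinate-wise grid search within the given radius.
    best = sigmoid_loss(w, X, t, gamma)
    for j in range(len(w)):
        base = w[j]
        for delta in np.linspace(-radius, radius, grid):
            cand = w.copy()
            cand[j] = base + delta
            val = sigmoid_loss(cand, X, t, gamma)
            if val < best:          # keep only strict improvements
                w, best = cand, val
    return w

def sla_like(w0, X, t, gamma0=1.0, gamma_max=64.0, rate=2.0):
    """Outer loop: slowly sharpen the loss, re-optimizing at each gamma."""
    w, gamma = w0.copy(), gamma0
    while gamma <= gamma_max:
        w = range_optimization(w, X, t, gamma)
        gamma *= rate
    return w

# Toy data: two Gaussian blobs plus a constant bias feature.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, size=(50, 2)),
               rng.normal(-2.0, 1.0, size=(50, 2))])
X = np.hstack([X, np.ones((100, 1))])
t = np.concatenate([np.ones(50), -np.ones(50)])

w_hat = sla_like(np.zeros(3), X, t)
error_rate = float(np.mean(t * (X @ w_hat) <= 0))
print(error_rate)
```

Starting from a small $\gamma$ keeps the early landscape smooth, and the grid search in stage 2 gives the procedure a way to step out of flat regions where the gradient of a sharp sigmoid is close to zero.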
The first minor difference between the SLA optimization algorithm and our extension to it is the selection of the initial parameter vector from which the SLA algorithm starts optimizing. While the original SLA algorithm uses the SVM solution as its initial solution, our modified SLA algorithm initializes the gradient based optimization using hyper-parameter values obtained from experiments on a validation set defined within the training data. The idea here is to search for the best hyper-parameter values, including the weight $\lambda$ associated with the Gaussian prior leading to the L2 penalty added to (24), that produce a reasonable initial solution for the SLA algorithm to start with.
Our meta-optimization procedure consists of the following. We use the suggested values in the original SLA algorithm  for the parameters , and . For the others, we use a cross validation run using the same modified SLA algorithm to fine-tune algorithm parameters.
Parameter is chosen through a grid search, while and are chosen by a bracket search algorithm. In our experience, these model parameters change from problem (dataset) to problem, and hence must be fine-tuned for the best results.
5 Experimental Setup and Results
Below, we present results for three different groups of benchmark problems: (1) a selection from the University of California Irvine (UCI) repository, (2) some larger and higher dimensionality text processing tasks from the LibSVM evaluation archive (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), and (3) the product review sentiment prediction datasets used in prior work. We then present results on a structured prediction problem formulated for the task of visual information extraction from Wikipedia biography pages. Finally we explore the kernelized version of our classifier.
In all experiments, unless otherwise stated, we use a Gaussian prior on parameters leading to an $L_2$ penalty term. We explore four experimental configurations for our BBLR approach: (1) BBLR$_1$, where we use our modified SLA algorithm with the BBLR prior parameters held fixed. This corresponds to a minor modification to the traditional negative log logistic loss, but yields a probabilistically well defined smooth sigmoid shaped loss (e.g. as we have seen in Figure 2); (2) BBLR$_2$, where we use values for $\alpha$, $\beta$ and $n$ corresponding to the empirical counts of positives, negatives and the total number of examples from the training set, which corresponds to a simplistic heuristic, partially justified by Bayesian reasoning; (3) BBLR$_3$, in which an outer meta-optimization of learning parameters is performed on top of (2), i.e. SLAM; and (4) BBLR$_4$, in which the outer meta-optimization of learning parameters is performed and the hyper-parameters are optimized by gradient descent using the training set, with the two prior parameters initialized using the values given by our asymptotic analysis using a hard threshold for classifications. At each iteration of this optimization step, as parameters get updated, the complementary SLAM hyper-parameters are adjusted using the same meta-optimization procedure (SLAM) and a subset of the training data as a validation set.
Consequently, models produced by the third configuration (BBLR$_3$) explore the ability of our improved SLA learning parameter meta-optimization method (SLAM) to effectively minimize a smooth approximation to the zero one loss, while the fourth configuration (BBLR$_4$) delves the deepest into the ability of our BBLR formulation and SLAM optimization to make more accurate probabilistic predictions.
5.1 Binary Classification Tasks
5.1.1 Experiments with UCI Benchmarks
We evaluate our technique on the following datasets from the University of California Irvine (UCI) Machine Learning Repository: Breast, Heart, Liver and Pima. We use these datasets in part so as to compare directly with results in prior work, to understand the behaviour of our novel logistic function formulation and to explore the behavior of our learning parameter optimization procedure. Table 2 shows some brief details of these databases.
| Dataset | # Examples | # Dimensions | Description |
|---|---|---|---|
| Breast | 683 | 10 | Breast Cancer Diagnosis |
| Pima | 768 | 8 | Pima Indians Diabetes |
To facilitate comparisons with previously presented results, such as those summarized in Table 1 of our literature review in Section 2, we provide a small set of initial experiments here following the same experimental protocols. In our experiments here we compare our BBLRs with the following models: our own L2 Logistic Regression (LR) implementation, a linear SVM using the same implementation (liblinear) that was used in that prior work, and the optimization of the sigmoid loss using the SLA algorithm and the code distributed on the web site associated with it (indicated by SLA in our tables).
Despite the fact that we used the code distributed on the website associated with  we found that the SLA algorithm applied to their sigmoid loss, gave errors that are slightly higher than those given in . We use the term SLA in Table 3 and subsequent tables to denote experiments performed using both the sigmoidal loss explored in  and their algorithm for minimizing it. Applying the SLA algorithm to our loss yielded slightly superior results to the sigmoidal loss when the empirical counts from the training set for , and are used and slightly worse results when we used , and .
Analyzing the ability of different loss formulations and algorithms to minimize the 0-1 loss on different datasets using a common model class (i.e. linear models) can reveal differences in optimization performance across different models and algorithms. However, we are certainly more interested in evaluating the ability of different loss functions and optimization techniques to learn models that can be generalized to new data. We therefore provide the next set of experiments using traditional training, validation and testing splits, again following the protocols used in ; however, as we shall soon see, these experiments underscored the importance of extending the original SLA algorithm to automate the adjustment of learning parameters.
In Tables 4 and 5, we create 10 random splits of the data and perform a traditional 5 fold evaluation using cross validation within each training set to tune hyper-parameters. In Table 4, we present the sum of the 0-1 loss over each of the 10 splits as well as the total 0-1 loss across all experiments for each algorithm. This analysis allows us to make some intuitive comparisons with the results in Table 1, which represents an empirically derived lower bound on the 0-1 loss. In Table 5, we present the traditional mean accuracy across these same experiments. Examining columns SLA vs. BBLR in Table 4, we see that our re-formulated logistic loss is able to outperform the sigmoidal loss, but that only with the additional tuning of parameters during the optimization in column BBLR are we able to improve upon the overall zero-one loss yielded by the logistic regression and SVM baseline methods. However, it is important to remember that all of these methods are based on an underlying linear model and that these are comparatively small datasets consisting of relatively low dimensional input feature vectors. As such, we do not necessarily expect there to be any statistically significant differences in test set performance due to zero-one loss minimization performance. The same observation was made in  and it motivated their own exploration of learning with noisy feature vectors. We follow a similar path below, but then go on to explore datasets that are much larger and of much higher dimension in our subsequent experimental work.
In Table 6, we present the sum of the mean 0-1 loss over 10 repetitions of a 5 fold leave one out experiment where 10% noise has been added to the data following the protocol given in . Here again, our BBLR achieved a moderate gain over the SLA algorithm, whereas the gain of BBLR over other models is noticeable. In this table, we also show the percentage of improvement for our best model over the linear SVM. In Table 7, we show the average errors () for these 10% noise added experiments. We see here that the advantages of more directly approximating the zero one loss are more pronounced. However, the SLA approach failed to outperform the LR and SVM baselines in our experiments here, whereas in a similar experiment in  the SLA algorithm and sigmoidal loss did outperform these methods. This leads us to believe that per-dataset tuning of learning algorithm parameters is a significant issue. Nevertheless, we observe that our BBLR experiment which used the original SLA optimization algorithm outperformed the sigmoidal loss function optimized using the SLA algorithm. These results support the notion that our proposed Beta-Bernoulli logistic loss is in itself a superior approach to approximating the zero-one loss from an empirical perspective. Moreover, our results in column BBLR indicate that the combined use of our novel logistic loss and learning parameter optimization yields the most substantial improvements to zero-one loss minimization, or correspondingly improvements to accuracy.
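The two summary statistics described above (a summed 0-1 loss and a mean accuracy over test splits) can be sketched as follows. This is a minimal illustration rather than the evaluation code used for the tables; the function names and the toy labels are assumptions introduced here.

```python
import numpy as np

def zero_one_loss_count(y_true, y_pred):
    """Number of misclassified examples (the unnormalized 0-1 loss)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return int(np.sum(y_true != y_pred))

def summarize_splits(split_results):
    """split_results: list of (y_true, y_pred) pairs, one per test split.

    Returns the total 0-1 loss summed over splits and the mean accuracy
    averaged over splits.
    """
    losses = [zero_one_loss_count(t, p) for t, p in split_results]
    accs = [1.0 - loss / len(t) for loss, (t, _) in zip(losses, split_results)]
    return sum(losses), float(np.mean(accs))

# Toy example with two hypothetical splits (labels in {-1, +1}).
splits = [
    ([1, -1, 1, 1], [1, 1, 1, 1]),       # one error
    ([-1, -1, 1, 1], [-1, -1, 1, -1]),   # one error
]
total_loss, mean_acc = summarize_splits(splits)
```

Reporting the summed loss alongside the mean accuracy makes small absolute differences between methods visible on datasets of this size.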
5.1.2 Pooled McNemar Tests
We performed McNemar tests for the four UCI benchmarks comparing BBLR with LR and linear SVMs. As we do not have a significant number of test instances for any of these benchmarks, it became difficult to statistically justify and compare results. Therefore, we performed pooled McNemar tests by treating each split of our 5-fold leave one out experiments as an independent test and performing the significance test on the pooled results. The results of this pooled McNemar test are given in Table 8. Interestingly, for our noisy dataset experiments, the improvement of our BBLR over both the LR and SVM models was found to be statistically significant with .
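One way to carry out such a pooled test is to sum the discordant counts over all splits and apply McNemar's test once to the totals. The sketch below assumes the exact (binomial) two-sided form of the test; the text does not state which variant was used, and the function names and toy predictions are hypothetical.

```python
from math import comb

def discordant_counts(y_true, pred_a, pred_b):
    """Count cases where exactly one of the two classifiers is correct."""
    b = c = 0
    for t, a, bb in zip(y_true, pred_a, pred_b):
        if a == t and bb != t:
            b += 1          # A right, B wrong
        elif a != t and bb == t:
            c += 1          # A wrong, B right
    return b, c

def exact_mcnemar_p(b, c):
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

def pooled_mcnemar(splits):
    """splits: iterable of (y_true, pred_a, pred_b), one triple per test split."""
    b_tot = c_tot = 0
    for y, pa, pb in splits:
        b, c = discordant_counts(y, pa, pb)
        b_tot += b
        c_tot += c
    return b_tot, c_tot, exact_mcnemar_p(b_tot, c_tot)

# Toy example with two hypothetical splits.
splits = [
    ([1, 1, 0, 0], [1, 1, 0, 1], [0, 1, 0, 1]),
    ([1, 0, 1], [1, 0, 1], [0, 1, 1]),
]
b_tot, c_tot, p_value = pooled_mcnemar(splits)
```

Pooling increases the number of discordant pairs available to the test, at the cost of assuming that the splits can be treated as independent.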
|BBLR vs. LR||BBLR vs. SVM|
5.1.3 Experiments with LibSVM Benchmarks
In this section, we present classification results using two much larger datasets: the web8, and the webspam-unigrams. These datasets have predefined training and testing splits, which are distributed on the web site accompanying  (http://users.cecs.anu.edu.au/ xzhang/data/). These benchmarks are also distributed through the LibSVM binary data collection (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html). The webspam unigrams data originally came from the study in  (http://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html). Table 9 compiles some details of these databases.
|Dataset||# Examples||# Dim.||Sparsity (%)|
For these experiments we do not add additional noise to the feature vectors. In Table 10, we present classification results, and one can see that for both cases our BBLR approach shows improved performance over the LR and the linear SVM baselines. As in our earlier small scale experiments, we used our own LR implementation and the liblinear SVM for these large scale experiments.
We performed McNemar’s statistical tests comparing our BBLR with LR and linear SVM models for these two datasets. The results are found to be statistically significant with a value 0.01 for all cases. Given that no noise has been added to these widely used benchmark problems and that each method compared here is fundamentally based on a linear model, the fact that these experiments show statistically significant improvements for BBLR over these two widely used methods is quite interesting.
5.1.4 Experiments with Product Reviews
The goal of these tasks is to predict whether a product review is positive or negative. For this set of experiments, we used the count based unigram features for four databases from the website associated with . For each database, there are 1,000 positive and 1,000 negative product reviews. Table 11 compiles the feature dimension size of these sparse databases.
|Dataset||Database size||Feature dimensions|
For all four databases, our BBLR and BBLR models outperformed both the LR and linear SVM. To further analyze these results, we also performed a McNemar’s test. For the Books and the DVDs databases, the results of our BBLR and BBLR models were found to be statistically significant over both the LR and linear SVM with a value . BBLR tended to outperform BBLR, but not in a statistically significant way. However, since the primary advantage of the BBLR configuration is that it yields more accurate probabilities, we do not necessarily expect it to have dramatically superior performance compared to BBLR for classification. For this reason we explore the problem of using such models in the context of structured prediction in the next set of experiments. When BBLR models are used to make structured predictions, our hypothesis is that the benefits of providing a more accurate probabilistic prediction should be apparent through improved joint inference.
5.2 Structured Prediction Experiments
One of the advantages of our Beta-Bernoulli logistic loss is that it allows a model to produce more accurate probabilistic estimates. Intuitively, the controllable nature of the plateaus in the log probability view of our formulation allows probabilistic predictions to take on values that are more representative of an appropriate confidence level for a classification. In simple terms, predictions for feature vectors far from a decision boundary need not take on values that are near probability zero or probability one when the Beta-Bernoulli logistic model is used. If such models are used as components of larger systems that use probabilistic inference for more complex reasoning tasks, the additional flexibility could be a significant advantage over the traditional logistic function formulation. The following experiments explore this hypothesis.
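The plateau effect described above can be illustrated numerically. The sketch below assumes a posterior-mean style generalized logistic of the form (alpha + n * sigmoid(gamma * z)) / (alpha + beta + n); this functional form, the function name `bb_sigmoid`, and the values alpha = beta = 1, n = 8 are assumptions made for illustration and should be checked against the definition given in Section 3.

```python
import numpy as np

def bb_sigmoid(z, alpha, beta, n, gamma=1.0):
    """Assumed Beta-Bernoulli style generalized logistic (illustrative form).

    Bounded below by alpha / (alpha + beta + n) and above by
    (alpha + n) / (alpha + beta + n), so it plateaus away from 0 and 1.
    """
    z = np.asarray(z, dtype=float)
    s = 1.0 / (1.0 + np.exp(-gamma * z))   # standard logistic of the scaled score
    return (alpha + n * s) / (alpha + beta + n)

# Scores far below, at, and far above the decision boundary.
z = np.array([-50.0, 0.0, 50.0])
p_logistic = 1.0 / (1.0 + np.exp(-z))                          # saturates at 0 and 1
p_bb = bb_sigmoid(z, alpha=1.0, beta=1.0, n=8.0, gamma=1.0)    # saturates at 0.1 and 0.9
```

Under these assumed values the standard logistic assigns essentially zero or one to the extreme scores, while the generalized form stays within [0.1, 0.9], which is the kind of tempered confidence the paragraph above refers to.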
In , we performed a set of face mining experiments from Wikipedia biography pages using a technique that relies on probabilistic inference in a joint probability model. For a given identity, our mining technique dynamically creates probabilistic models to disambiguate the faces that correspond to the identity of interest. These models integrate uncertain information extracted throughout a document arising from three different modalities: text, meta data and images. Information from text and metadata is integrated into the larger model using multiple logistic regression based components.
The images, face detection results as bounding boxes, some text and meta information extracted for one Wikipedia identity, Mr. Richard Parks, are shown in the top panel of Figure 6. In the bottom panel, we show an instance of our mining model and give a summary of the variables used in our technique. The model is a dynamically instantiated Bayesian network. Using the Bayesian network illustrated in Figure 6, the processing of information is intuitive. Text and meta-data features are taken as input to the bottom layer of random variables , which influence binary (target or not target) indicator variables for each detected face through logistic regression based sub-components. The results of visual comparisons between all faces detected in different images are encoded in the variables .
|Image 1||Image 2|
Image source: Info-box
|File name : Richard_Parks.jpg||737 Challenge.jpg|
|Caption text : NULL||Richard Parks celebrating the end of the 737 Challenge at the National Assembly for Wales on 19 July 2011|
Variables Description : Visual similarity for a pair of faces, and , across different images. Binary target vs. not target label for face, . : Constraint variable for image . Local features for a face.
Both text and meta data are transformed into feature vectors associated with each detected instance of a face. For text analysis, we use information such as: image file names and image captions. The location of an image in the page is an example of what we refer to as meta-data. We also treat other information about the image that is not directly involved in facial comparisons as meta-data, e.g., the relative size of a face to other faces detected in an image. The bottom layer or set of random variables in Figure 6 are used to encode these features, and we discuss the precise nature and definition of these features in more detail in . is therefore the local feature vector for a face, , where is the feature for face index for image index . These features are used as the input to the part of our model responsible for producing the probability that a given instance of a face belongs to the identity of interest, encoded by the random variables in Figure 6. is therefore a set of binary target vs. not target indicator variables corresponding to each face, . Inferring these variables jointly corresponds to the goal of our mining model. The joint conditional distribution defined by the general case of our model is given by
Apart from comparing faces across images, , the joint model uses predictive scores from per face local binary classifiers, . As mentioned above and discussed in more detail in , we used Maximum Entropy Models (MEMs) or Logistic Regression models for these local binary predictions working on multimedia features in our previous work.
Here, we compare the result of replacing the logistic regression components in the model discussed above with our BBLR formulation. We examine the impact of this change in terms of making predictions based solely on independent models taking text and meta-data features as input as well as the impact of this difference when LR vs BBLR models are used as sub-components in the joint structured prediction model. Our hypothesis here is that the BBLR method might improve results due to its robustness to outliers (which we have already seen in our binary classification experiments) and that the method is potentially able to make more accurate probabilistic predictions, which could in turn lead to more precise joint inference.
|Text-only features||Joint model with aligned faces|
For this particular experiment, we use the biographies with 2-7 faces. Table 13 shows results comparing the MaxEnt model with our BBLR model. The results are for a five-fold leave one out evaluation on the Wikipedia dataset. One can see that we do indeed obtain superior performance with the independent BBLR models over the Maximum Entropy models. We also see improved performance when BBLR models are used in the coupled model where joint inference is used for predictions.
In the row labelled BBLR, we optimized in addition to other model parameters using the technique explained in Section 3.3. This produced statistically significant results compared to the maximum entropy model with . For this significance test, we used the McNemar test, as in our earlier sets of experiments.
5.3 Kernel Logistic Regression with the Generalized Beta-Bernoulli Loss
In Table 14 we compare Beta-Bernoulli logistic regression with an SVM and Kernel Beta-Bernoulli logistic regression (KBBLR). We see that our proposed approach compares favorably to the SVM result, which is widely considered a strong, state of the art baseline.
5.4 Sparse Kernel BBLR
As shown in , one of the advantages of using the ramp loss for kernel based classification is that it can yield models that are even sparser than traditional SVMs based on the hinge loss. It is well known that L2 based regularization does not typically yield sparse solutions when used with traditional kernel logistic regression. Our analysis of the previous experiments reveals that the regularized smooth zero one loss approximation approach proposed here does not in general lead to sparse models either. The well known L1 or lasso regularization method can yield sparse solutions, but often at the cost of prediction performance. Recently the so called elastic net regularization approach  based on a weighted combination of L1 and L2 regularization has been shown to be more effective at encouraging sparsity with a less negative impact on performance. The elastic net approach of course can be viewed as a prior consisting of the product of a Gaussian and a Laplacian distribution. However, part of the motivation for the use of these methods is that they yield convex optimization problems when combined with the log logistic loss. Since we have developed a robust approach for optimizing a non-convex objective function above, this opens the door to the use of non-convex sparsity encouraging regularizers. Correspondingly, we propose and explore below a prior on parameters, or equivalently, a novel regularization approach based on a mixture of a Gaussian and a Laplacian distribution. This formulation can behave like a smooth approximation to an L0 counting “norm” prior on parameters in the limit as the Laplacian scale parameter goes to zero and the Gaussian variance goes to infinity.
With a (marginalized) Gaussian-Laplace mixture prior, our KBBLR log-likelihood becomes
where is our kernel Beta-Bernoulli model as defined in section 3.4, equation (37). For each , its prior is modeled through a mixture of a zero mean Gaussian with variance and a Laplacian distribution , located at zero with shape parameter . For convenience we give the relevant partial derivatives for this prior in Appendix B. In our approach we also optimize the hyper-parameters of this prior using hard assignment Expectation Maximization steps that are performed after step 3 of Algorithm 2. For precision we outline the steps of the modified range-optimization for Kernel BBLR (KBBLR) in Algorithm 3 found in Appendix C.
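To make the shape of such a prior concrete, the following sketch evaluates the log of a two-component Gaussian-Laplace mixture density over a weight vector, together with its gradient away from zero. The parameter names (`pi` for the mixing weight, `sigma2` for the Gaussian variance, `b` for the Laplace scale) and the toy values are assumptions introduced here; the hard assignment EM updates mentioned above are not included.

```python
import numpy as np

def _components(a, pi, sigma2, b):
    """Weighted Gaussian and Laplace densities evaluated elementwise."""
    gauss = pi * np.exp(-0.5 * a ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)
    lap = (1.0 - pi) * np.exp(-np.abs(a) / b) / (2.0 * b)
    return gauss, lap

def mixture_log_prior(a, pi, sigma2, b):
    """Sum over weights of log[pi * N(a; 0, sigma2) + (1 - pi) * Laplace(a; 0, b)]."""
    a = np.asarray(a, dtype=float)
    gauss, lap = _components(a, pi, sigma2, b)
    return float(np.sum(np.log(gauss + lap)))

def mixture_log_prior_grad(a, pi, sigma2, b):
    """Elementwise gradient of the log prior (valid for a != 0)."""
    a = np.asarray(a, dtype=float)
    gauss, lap = _components(a, pi, sigma2, b)
    num = gauss * (-a / sigma2) + lap * (-np.sign(a) / b)
    return num / (gauss + lap)

# Toy weights and hypothetical hyper-parameters: a broad Gaussian and a
# narrow Laplace component that pulls small weights toward zero.
weights = np.array([0.3, -1.2, 2.0])
log_p = mixture_log_prior(weights, pi=0.5, sigma2=4.0, b=0.1)
grad = mixture_log_prior_grad(weights, pi=0.5, sigma2=4.0, b=0.1)
```

With a small Laplace scale and a large Gaussian variance, the log density has a sharp peak at zero and is nearly flat elsewhere, which is the behavior the limiting argument above appeals to.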
In Table 15, we compare sparse KBBLR and the SVM using a Radial Basis Function (RBF) kernel. The SVM free parameters were tuned by a cross validation run over the training data. For a sparse KBBLR solution, we used a mixture of a Gaussian and a Laplacian prior on the kernel weight parameters as presented above.
|Dataset||SVM||Avg. Support Vectors||Sparse KBBLR||Avg. Support Vectors|
Table 15 compares sparse Kernel BBLR with SVMs on the standard UCI datasets. Figure 7 shows trends in the sparsity curves for an increase in the number of training instances comparing KBBLR with SVMs for one of the product review databases. We can see that KBBLR scales up well compared to an SVM solution when training data size increases. Support vectors for SVMs increase almost linearly for an increase in the database size, an effect that has been confirmed in a number of other studies [17, 2]. In comparison we can see that KBBLR with a Gaussian-Laplacian mixture prior produces a logarithmic curve for an increase in the database size. The right panel of the same figure also shows the weight distribution before and after the KBBLR optimization with a Gaussian-Laplacian mixture prior which yields the observed sparse solution.
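A simple way to compare sparsity between kernel models is to count the training points whose kernel weight is non-negligible. The sketch below shows an RBF kernel matrix and such a count; the threshold `tol`, the function names, and the toy data are assumptions, and the text does not specify how the reported support vector counts were computed.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2)."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Z ** 2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clip tiny negative round-off

def count_active(kernel_weights, tol=1e-6):
    """Number of training points with non-negligible kernel weight."""
    return int(np.sum(np.abs(np.asarray(kernel_weights)) > tol))

# Toy example: five training points, two of which carry weight.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
a = np.array([0.0, 1e-9, -0.7, 0.0, 1.3])
K = rbf_kernel(X_train, X_train, gamma=0.5)
scores = K @ a                      # kernel expansion evaluated at the training points
n_active = count_active(a)
```

Only the active points need to be stored and evaluated at prediction time, which is why the growth rate of this count with training set size matters in practice.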
6 Discussion and Conclusions
We have presented a novel formulation for learning with an approximation to the zero one loss. Through our generalized Beta-Bernoulli formulation, we have provided both a new smooth 0-1 loss approximation method and a new class of probabilistic classifiers. Our experimental results indicate that our generalized Beta-Bernoulli formulation is capable of yielding superior performance to traditional logistic regression and maximum margin linear SVMs for binary classification. Like other ramp like loss functions, one of the principal advantages of our approach is that it is more robust to outliers than traditional convex loss functions. Our modified SLA algorithm, which adds a learning hyper-parameter optimization step, shows improved performance over the original SLA optimization algorithm in .
We have also presented and explored a kernelized version of our approach which yields performance competitive with non-linear SVMs for binary classification. Furthermore, with a Gaussian-Laplacian mixture prior on parameters our kernel Beta-Bernoulli model is able to yield sparser solutions than SVMs while retaining competitive classification performance. Interestingly, for an increase in training database size, our approach exhibited logarithmic scaling properties which compare favorably to the linear scaling properties of SVMs. To the best of our knowledge this is the first exploration of a Gauss-Laplace mixture prior for parameters – certainly in combination with our novel smooth zero-one loss formulation. The ability of this prior to behave like a smooth approximation to a counting prior is similar to an approach known as bridge regression in statistics. However, our mixture formulation has more flexibility compared to the simpler functional form of bridge regression. Interestingly, the combination of our generalized Beta-Bernoulli loss with a Gaussian-Laplacian parameter prior can be thought of as a smooth relaxation to learning with a zero one loss and an L0 counting prior or regularization – a formulation for classification that is intuitively attractive, but has remained elusive in practice until now.
We also tested our generalized Beta-Bernoulli models for a structured prediction task arising from the problem of face mining in Wikipedia biographies. Here also our model showed better performance than traditional logistic regression based approaches, both when they were tested as independent models, and when they were compared as sub-parts of a Bayesian network based structured prediction framework. This experiment shows signs that the model and optimization approach proposed here may have further potential to be used in complex structured prediction tasks.
We thank the NSERC Discovery Grants program and Google for a Faculty Research Award which helped support this work.
-  K. Bache and M. Lichman. UCI machine learning repository, 2013.
-  R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In Proceedings of the 23rd international conference on Machine learning, pages 201–208. ACM, 2006.
-  A. Cotter, S. Shalev-Shwartz, and N. Srebro. Learning optimally sparse support vector machines. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 266–274, 2013.
-  C. B. Do, Q. Le, C. H. Teo, O. Chapelle, and A. Smola. Tighter bounds for structured estimation. In Proc. of NIPS, 2008.
-  M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In Proceedings of the 25th international conference on Machine learning, pages 264–271. ACM New York, NY, USA, 2008.
-  S. Ertekin, L. Bottou, and C. L. Giles. Nonconvex online support vector machines. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(2):368–381, 2011.
-  V. Feldman, V. Guruswami, P. Raghavendra, and Y. Wu. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558–1590, 2012.
-  K. Gimpel and N. A. Smith. Structured ramp loss minimization for machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 221–231. Association for Computational Linguistics, 2012.
-  M. K. Hasan and C. Pal. Experiments on visual information extraction with the faces of wikipedia. 2014. AAAI Conference on Artificial Intelligence (AI).
-  R. Hérault and Y. Grandvalet. Sparse probabilistic classifiers. In Proceedings of the 24th international conference on Machine learning, pages 337–344. ACM, 2007.
-  A. H. Land and A. G. Doig. An automatic method of solving discrete programming problems. Econometrica, 28(3):497–520, 1960.
-  O. L. Mangasarian, W. N. Street, and W. H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4):570–577, 1995.
-  L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent in function space. NIPS, 1999.
-  T. Nguyen and S. Sanner. Algorithms for direct 0–1 loss optimization in binary classification. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1085–1093, 2013.
-  F. Pérez-Cruz, A. Navia-Vázquez, A. R. Figueiras-Vidal, and A. Artes-Rodriguez. Empirical risk minimization for support vector classifiers. Neural Networks, IEEE Transactions on, 14(2):296–303, 2003.
-  F. Pernkopf, M. Wohlmayr, and S. Tschiatschek. Maximum margin bayesian network classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(3):521–532, 2012.
-  I. Steinwart. Sparseness of support vector machines. The Journal of Machine Learning Research, 4:1071–1105, 2003.
-  V. Vapnik. The nature of statistical learning theory. Springer, 2000.
-  P. Vincent. Modèles à noyaux à structure locale. Citeseer, 2004.
-  P. Vincent and Y. Bengio. Kernel matching pursuit. Machine Learning, 48(1-3):165–187, 2002.
-  D. Wang, D. Irani, and C. Pu. Evolutionary study of web spam: Webb spam corpus 2011 versus webb spam corpus 2006. In Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2012 8th International Conference on, pages 40–49. IEEE, 2012.
-  Y. Wu and Y. Liu. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479), 2007.
-  A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.
-  T. Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Information retrieval, 4(1):5–31, 2001.
-  X. Zhang, A. Saha, and S. Vishwanathan. Smoothing multivariate performance measures. Journal of Machine Learning Research, 10:1–55, 2011.
-  H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
Appendix A Experimental Details
In the interests of reproducibility, we also list below the algorithm parameters and the recommended settings as given in  :
, a search radius reduction factor;
, the initial search radius;
, a grid spacing reduction factor;
, the initial grid spacing for 1-D search;
, the gamma parameter reduction factor;
, the starting point for the search over ;
, the end point for the search over .
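The role of the search radius and grid spacing parameters listed above can be illustrated with a generic coordinate-wise grid search whose radius and spacing shrink after each pass. This is a simplified sketch and not the SLA or range optimization procedure itself; all names and default values below are hypothetical, since the recommended settings are not reproduced in the list.

```python
import numpy as np

def coordinate_grid_search(loss, w0, radius=1.0, spacing=0.25,
                           radius_decay=0.5, spacing_decay=0.5, rounds=5):
    """Derivative-free local search: try grid offsets along each coordinate,
    keep any improvement, then shrink the radius and the grid spacing."""
    w = np.array(w0, dtype=float)
    best = loss(w)
    for _ in range(rounds):
        for j in range(w.size):
            base = w.copy()                       # grid is centered on the current point
            for d in np.arange(-radius, radius + 1e-12, spacing):
                cand = base.copy()
                cand[j] = base[j] + d
                val = loss(cand)
                if val < best:
                    best, w = val, cand
        radius *= radius_decay                    # search radius reduction factor
        spacing *= spacing_decay                  # grid spacing reduction factor
    return w, best

# Toy objective with a known minimum at (1, -0.5).
def toy_loss(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 0.5) ** 2

w_opt, best_val = coordinate_grid_search(toy_loss, [0.0, 0.0])
```

Because the search only compares loss values, it can step across flat or discontinuous regions of a non-convex objective where a gradient step would stall.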
As a part of the Range Optimization procedure there is also a standard gradient descent procedure using a slowly reduced learning rate. The procedure has the following specified and unspecified default values for the constants defined below:
, a learning rate reduction factor;
, the initial learning rate;
, the minimal learning rate;
, used for a while loop stopping criterion based on the smallest change in the likelihood;
, used for outer stopping criterion based on magnitude of gradient
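The gradient descent procedure with a slowly reduced learning rate and the two stopping criteria listed above might look like the following sketch. The parameter names and default values are hypothetical placeholders, not the settings used in the experiments, and the objective here is minimized rather than a likelihood maximized.

```python
import numpy as np

def gradient_descent(f, grad, w0, lr0=0.1, lr_decay=0.99, lr_min=1e-4,
                     tol_change=1e-8, tol_grad=1e-6, max_iter=10000):
    """Gradient descent with a geometrically decayed, lower-bounded learning rate."""
    w = np.array(w0, dtype=float)
    lr = lr0
    prev = f(w)
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol_grad:      # stop on small gradient magnitude
            break
        w = w - lr * g
        cur = f(w)
        if abs(prev - cur) < tol_change:      # stop on small change in the objective
            break
        prev = cur
        lr = max(lr_min, lr * lr_decay)       # slowly reduce, never below lr_min
    return w

# Toy quadratic objective with minimum at (1, -2).
target = np.array([1.0, -2.0])
f = lambda w: float(np.sum((w - target) ** 2))
grad_f = lambda w: 2.0 * (w - target)
w_hat = gradient_descent(f, grad_f, [0.0, 0.0])
```

The lower bound on the learning rate keeps late iterations from stalling entirely, while the two tolerances separate convergence in objective value from convergence in gradient norm.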
Appendix B Gradients for a Gaussian-Laplacian Mixture Prior
The gradient of the KBBLR likelihood is given in section 3.4. Below we provide the gradient of the log Gaussian-Laplace mixture prior or regularization term,