A scalable stage-wise approach to large-margin multi-class loss based boosting
We present a scalable and effective method for training multi-class boosting classifiers. Shen and Hao introduced a direct formulation of multi-class boosting in the sense that it directly maximizes the multi-class margin. The major problem of their approach is its high computational complexity for training, which hampers its application to real-world problems. In this work, we propose a scalable and simple stage-wise multi-class boosting method, which also directly maximizes the multi-class margin. Our approach offers a few advantages: 1) It is simple and computationally efficient to train; it can reduce the training time by more than two orders of magnitude without sacrificing classification accuracy. 2) Like traditional AdaBoost, it is less sensitive to the choice of parameters and empirically demonstrates excellent generalization performance. Experimental results on challenging multi-class machine learning and vision tasks demonstrate that the proposed approach substantially improves the convergence rate and accuracy of the final visual detector at no additional computational cost compared to existing multi-class boosting.
Multi-class classification is one of the fundamental problems in machine learning and computer vision, as many real-world problems involve predictions which require an instance to be assigned to one of a number of classes. Well-known problems include handwritten character recognition, object recognition, and scene classification. Compared to the well-studied binary form of the classification problem, multi-class problems are considered more difficult to solve, especially as the number of classes increases.
In recent years a substantial body of work related to multi-class boosting has arisen in the literature. Many of these works attempt to achieve multi-class boosting by reducing or reformulating the task into a series of binary boosting problems. Often this is done through the use of output coding matrices. The common 1-vs-all and 1-vs-1 schemes are particular examples of this approach, in which the coding matrices are predefined. The drawback of coding-based approaches is that they do not rapidly converge to low training errors on difficult data sets, so many weak classifiers need to be learned (as is shown below in our experiments). As a result, these algorithms fail to deliver the level of performance required to process large data sets or to achieve real-time data processing.
The aim of this paper is to develop a more direct boosting algorithm applicable to multi-class problems that will achieve the effectiveness and efficiency of previously proposed methods for binary classification. To achieve our goal we exploit the efficiency of coordinate descent algorithms, e.g., AdaBoost, along with a more effective and direct formulation of multi-class boosting known as MultiBoost. Our proposed approach is simpler than coding-based multi-class boosting since we do not need to learn output coding matrices. The approach is also fast to train, less sensitive to the choice of parameters, and has a convergence rate comparable to MultiBoost. Furthermore, the approach shares a property with $\ell_1$-constrained maximum margin classifiers, in that it converges asymptotically to the $\ell_1$-constrained solution.
Our approach is based on a novel stage-wise multi-class form of boosting which bypasses error correcting codes by directly learning base classifiers and weak classifiers’ coefficients. The final decision function is a weighted average of multiple weak classifiers. The work we present in this paper intersects with several successful practical works, such as multi-class support vector machines , AdaBoost  and column generation based boosting .
Our main contributions are as follows:
Our approach is the first greedy stage-wise multi-class boosting algorithm which does not rely on codewords and which directly optimizes the boosting objective function. In addition, our approach converges asymptotically to the $\ell_1$-constrained solution;
We show that our minimization problem shares a connection with those derived from coordinate descent methods. In addition, our approach is less prone to over-fitting as techniques, such as shrinkage, can be easily adopted;
Empirical results demonstrate that the approach exhibits the same classification performance as the state-of-the-art multi-class boosting classifier , but is significantly faster to train, and orders of magnitude more scalable. We have made the source code of the proposed boosting methods accessible at:
The remainder of the paper is organized as follows. Section II reviews related works on multi-class boosting. Section III describes the details of our proposed approach, including its computational complexity, and discusses various aspects related to its convergence and generalization performance. Experimental results on machine learning and computer vision data sets are presented in Section IV. Section V concludes the paper with directions for possible future work.
II Related work
There exist a variety of multi-class boosting algorithms in the literature. Many of them solve multi-class learning tasks by reducing multi-class problems to multiple binary classification problems. We briefly review some well known boosting algorithms here in order to illustrate the novelty of the proposed approach.
Coding-based boosting was one of the earliest multi-class boosting approaches proposed (see, for example, AdaBoost.MH, AdaBoost.MO, AdaBoost.ECC, AdaBoost.SIP and JointBoost). Coding-based approaches perform multi-class classification by combining the outputs of a set of binary classifiers. This includes popular methods such as 1-vs-all and 1-vs-1. Typically, a coding matrix, , is constructed (where is the length of a codeword and is the number of classes). The algorithm learns a binary classifier, , corresponding to a single column of in a stage-wise manner (each column of defines a binary partition of classes over data). Here is a function that maps an input to . A test instance is classified as belonging to the class associated with the codeword closest in Hamming distance to the sequence of predictions generated by . The final decision function for a test datum is where is the weight vector and the entry of is . Clearly, the performance of the algorithm is largely influenced by the quality of the coding matrices. Finding optimum coding matrices, which means identifying classes which should be grouped together, is often non-trivial. Several algorithms, e.g., max-cut and random-half, have been proposed to build optimal binary partitions for coding matrices. Max-cut finds the binary partitions that maximize the error-correcting ability of the coding matrix, while random-half randomly splits the classes into two groups. Li points out that random-half usually performs better than max-cut because the binary problems formed by max-cut are usually too hard for base classifiers to learn. Nonetheless, neither max-cut nor random-half achieves the best performance, as they do not consider the ability of base classifiers in the optimization of coding matrices. In contrast, our proposed approach bypasses the learning of output coding by learning base classifiers and weak classifiers’ coefficients directly.
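The Hamming-distance decoding step described above can be illustrated with a minimal sketch. The function name `hamming_decode` and the 1-vs-all code table are hypothetical and chosen only for illustration; codewords are stored one per class (the rows of the coding matrix), and ties are broken in favor of the lowest class index, which is an assumption rather than something the text specifies.

```python
def hamming_decode(codewords, predictions):
    """Return the index of the class whose codeword is closest to `predictions`.

    codewords:   one codeword per class, each a list of +1/-1 entries
    predictions: +1/-1 outputs of the learned binary classifiers
    Ties are resolved in favor of the lowest class index (an assumption).
    """
    def hamming(a, b):
        # Number of positions at which the two sequences disagree.
        return sum(1 for u, v in zip(a, b) if u != v)

    distances = [hamming(code, predictions) for code in codewords]
    return min(range(len(codewords)), key=lambda r: distances[r])


# Hypothetical 1-vs-all code for three classes: class r has +1 only in slot r.
ONE_VS_ALL = [
    [+1, -1, -1],
    [-1, +1, -1],
    [-1, -1, +1],
]

predicted_class = hamming_decode(ONE_VS_ALL, [-1, +1, -1])
```

With a predefined code such as 1-vs-all the table is fixed in advance; the coding-based methods discussed above differ mainly in how this table is constructed.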
Another related approach, which trains a similar decision function, is the multi-class boosting of Duchi and Singer known as GradBoost. The main difference between GradBoost and the method we propose here is that GradBoost does not directly optimize the boosting objective function. GradBoost bounds the original non-smooth optimization problem by a quadratic function, and it is not clear how well the surrogate approximates the original objective function. In contrast, our approach optimizes the original loss function, as do AdaBoost and LogitBoost. Shen and Hao have introduced a direct formulation of multi-class boosting in the sense that it directly maximizes the multi-class margin. By deriving a meaningful Lagrange dual problem, column generation is used to design a fully corrective boosting method. The main issue with their method is its extremely heavy computational burden, which hampers its application to real data sets. Unlike their work, the proposed approach learns a classification model in a stage-wise manner: only the coefficients of the latest weak classifier need to be updated. As a result, our approach is significantly more computationally efficient and more robust to the regularization parameter value chosen. Compared to their method, at each boosting iteration our approach only needs to solve for variables instead of variables, where is the number of classes and is the number of current boosting iterations. This significant reduction in the size of the problem to be solved at each iteration is responsible for the orders of magnitude reduction in training time required.
Bold lower-case letters, e.g., , denote column vectors and bold upper-case letters, e.g., , denote matrices. Given a matrix , we write the -th row of as and the -th column as . The entry of is . Let be the set of training data, where represents an instance, and the corresponding class label (where is the number of training samples and is the number of classes). We denote by a set of all possible outputs of weak classifiers where the size of can be infinite. Let denote a binary weak classifier which projects an instance to . By assuming that we learn a total of weak classifiers, the output of weak learners can be represented as , where is the label predicted by weak classifier on the training data . Each row of the matrix represents the output of all weak classifiers when applied to a single training instance . We build a classifier of the form,
where . Each column of , , contains coefficients of the linear classifier for class and each row of , , consists of the coefficients for the weak classifier for all class labels. The predicted label is the index of the column of attaining the highest sum.
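As a minimal sketch of the decision rule just described, the snippet below scores one instance against every class and returns the index of the column with the highest sum. The function name `predict` and the nested-list layout of the coefficient matrix (`W[j][r]` is the coefficient of weak classifier `j` for class `r`) are illustrative assumptions, not notation taken from the paper.

```python
def predict(h_row, W):
    """Predict a class label from weak classifier outputs.

    h_row: outputs (+1 or -1) of the n weak classifiers on one instance
    W:     n x k coefficient matrix as a list of rows; W[j][r] is the
           coefficient of weak classifier j for class r (assumed layout)
    Returns the class index whose column of W attains the highest score.
    """
    k = len(W[0])
    scores = [sum(h * row[r] for h, row in zip(h_row, W)) for r in range(k)]
    return max(range(k), key=lambda r: scores[r])


# Two weak classifiers, three classes (hypothetical coefficients).
W_example = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
]
label = predict([+1, +1], W_example)
```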
III Our approach
In order to classify an example correctly, must be greater than , for any . In this paper, we define a set of margins associated with a training example as,
The training example is correctly classified only when . In boosting, we train a linear combination of basis functions (weak classifiers) which minimizes a given loss function over predefined training samples. This is achieved by searching for the dimension which gives the steepest descent in the loss and assigning its coefficient accordingly at each iteration. Commonly applied loss functions are exponential loss of AdaBoost  and binomial log-likelihood loss of LogitBoost . They are:
The two losses behave similarly for positive margins but differently for negative margins. The logistic loss has been reported to be more robust against outliers and misspecified data than the exponential loss. In the rest of this section, we present a coordinate descent based multi-class boosting as an approximate $\ell_1$-regularized fitting. We then illustrate the similarity between both approaches. Finally, we discuss various strategies that can be adopted to prevent over-fitting.
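The behavior of the two losses can be checked numerically. The sketch below assumes the standard forms, exp(-ρ) for the exponential loss and log(1 + exp(-ρ)) for the binomial log-likelihood loss, and assumes that the margin of an example with respect to class r is the score of the true class minus the score of class r; the equations themselves are not reproduced in this copy, so both are stated here as assumptions. All function names are hypothetical.

```python
import math


def class_scores(h_row, W):
    """Score of each class: sum_j h_row[j] * W[j][r] (assumed decision function)."""
    return [sum(h * row[r] for h, row in zip(h_row, W)) for r in range(len(W[0]))]


def margins(h_row, W, y):
    """Margins of one example with true label y against every other class r."""
    s = class_scores(h_row, W)
    return {r: s[y] - s[r] for r in range(len(s)) if r != y}


def exp_loss(rho):
    """Exponential loss (AdaBoost), standard form."""
    return math.exp(-rho)


def log_loss(rho):
    """Binomial log-likelihood loss (LogitBoost), standard form."""
    return math.log1p(math.exp(-rho))


W_example = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
rho = margins([+1, +1], W_example, y=1)  # all margins positive: correctly classified
```

For a strongly negative margin such as ρ = -5, `exp_loss` is roughly 148 whereas `log_loss` is roughly 5, which is one way to see why the logistic loss penalizes outliers less severely.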
III-A Stage-wise multi-class boosting
In this section, we design an efficient learning algorithm which maximizes the margin of our training examples, . The general $\ell_1$-regularized optimization problem we want to solve is
Here can be any convex loss function and parameter controls the trade-off between model complexity and small error penalty. Although (3) is $\ell_1$-norm regularized, it is possible to design our algorithm with other $\ell_p$-norm regularizers. We first derive the Lagrange dual problems of the optimization with both exponential loss and logistic loss, and then propose our new stage-wise multi-class boosting.
The learning problem for an exponential loss can be written as,
We introduce auxiliary variables, , and rewrite the primal problem as,
where represents the joint index through all of the data and all of the classes. Here we work on the logarithmic version of the original cost function. Since $\log(\cdot)$ is strictly monotonically increasing, this does not change the original optimization problem. Note that the regularization parameters in these two problems should have different values. Here we introduce the auxiliary variable in order to arrive at the dual problem that we need. The Lagrangian of (5) can be written as,
with . To derive the dual, we have
where . At the optimum, the first derivative of the Lagrangian with respect to each row of must be zero, i.e., , and therefore
where denotes the indicator operator such that if and , otherwise. The convex conjugate of the log-sum-exp function is the negative entropy function. Namely, the convex conjugate of is if and ; otherwise . The Lagrange dual problem can be derived as,
Note that the objective function of the dual encourages the dual variables, , to be uniform.
The learning problem of logistic loss can be expressed as,
The Lagrangian of (7) can be written as,
with . Following the above derivation and using the fact that the conjugate of logistic loss is , if ; otherwise , the Lagrange dual (note that the sign of has been reversed) can be written as,
Since both (5) and (7) are convex and feasible, and Slater’s condition is satisfied, the duality gap between the primal, (5) and (7), and the dual, (6) and (8), is zero. Therefore, the solutions of (5) and (6), and of (7) and (8), must be the same. Although (6) and (8) have identical constraints, we will show later that their solutions (selected weak classifiers and coefficients) are different.
Finding weak classifiers
From the dual, the set of constraints can be infinitely large, i.e.,
For decision stumps, the size of is the number of features times the number of samples. For decision trees, the size of would grow exponentially with the tree depth. Similar to LPBoost, we apply a technique known as column generation to identify an optimal set of constraints. (Note that constraints in the dual correspond to variables in the primal; an optimal set of constraints in the dual would correspond to a set of variables in the primal that we are interested in.) The high-level idea of column generation is to only consider a small subset of the variables in the primal, i.e., only a subset of is considered. The problem solved using this subset is called the restricted master problem (RMP). At each iteration, one column, which corresponds to a variable in the primal or a constraint in the dual, is added and the restricted master problem is solved to obtain both primal and dual variables. We then identify any violated constraints which we have not added to the dual problem. These violated constraints correspond to variables in the primal that are not in the RMP. If no constraint is violated, then we stop, since we have found the optimal dual solution to the original problem and we have the optimal primal/dual pair. In other words, solving the restricted problem is equivalent to solving the original problem. Otherwise, we append this column to the restricted master problem and the entire procedure is iterated. Note that any column that violates dual feasibility can be added. However, in order to speed up the convergence, we add the most violated constraint at each iteration. In our case, the most violated constraint corresponds to:
Solving this subproblem is identical to finding a weak classifier with minimal weighted error in AdaBoost (since dual variables, , can be viewed as sample weights). At each iteration, we add the most violated constraint into the dual problem. The process continues until we cannot find any violated constraints.
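To make the AdaBoost analogy concrete, the following sketch finds the decision stump with minimal weighted error on a single feature by sorting the feature values and scanning the candidate thresholds once. It is the binary analogue only: labels are +1/-1 and `weights` plays the role of the dual variables, whereas the multi-class subproblem in the text also searches over classes. The function name and the return convention are hypothetical.

```python
def best_stump(values, labels, weights):
    """Weighted-error-minimizing stump on one feature (binary analogue).

    The stump predicts polarity * (+1 if x > threshold else -1).
    Returns (weighted_error, threshold, polarity).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])  # O(m log m)
    total = sum(weights)

    # Threshold below every value: polarity +1 predicts +1 everywhere.
    err = sum(w for y, w in zip(labels, weights) if y != +1)
    if err <= total - err:
        best = (err, float("-inf"), +1)
    else:
        best = (total - err, float("-inf"), -1)

    for pos, i in enumerate(order):  # single O(m) scan
        # Moving the threshold past sample i flips its prediction to -1.
        err += weights[i] if labels[i] == +1 else -weights[i]
        # A threshold cannot separate samples with identical feature values.
        if pos + 1 < len(order) and values[order[pos + 1]] == values[i]:
            continue
        if err <= total - err:
            candidate = (err, values[i], +1)
        else:
            candidate = (total - err, values[i], -1)
        if candidate[0] < best[0]:
            best = candidate
    return best


stump = best_stump([1, 2, 3, 4], [-1, -1, +1, +1], [0.25, 0.25, 0.25, 0.25])
```

Running this over every feature and keeping the best result gives the weak learner for one boosting iteration in the binary setting.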
Through the Karush-Kuhn-Tucker (KKT) optimality conditions, the gradient of the Lagrangian over the primal variable, , and the dual variable, , must vanish at the optimum. Let and be any primal and dual optimal points with zero duality gap. One of the KKT conditions tells us that and . We can obtain the relationship between the optimal primal and dual variables as,
Optimizing weak learners’ coefficients
Weak learners’ coefficients can be calculated in a totally corrective manner as in MultiBoost. However, the drawback of this approach is that training is often slow when the number of training samples and classes is large, because the primal variable, , needs to be updated at every boosting iteration. In this paper, we propose a more efficient approach based on a stage-wise algorithm similar to those derived in AdaBoost. The advantages of our approach compared to MultiBoost are 1) it is computationally efficient, as we only update the weak learner’s coefficients at the current iteration, and 2) it is less sensitive to the choice of the regularization parameters and, as a result, training is much simpler since we no longer have to cross-validate these parameters. We will show later in our experiments that the regularization parameter only needs to be set to a sufficiently small value to ensure good classification accuracy. By inspecting the primal problems, (5) and (7), the optimal can be calculated analytically as follows. At iteration , where , we fix the value of , , , . So is the only variable to be optimized. The primal cost function for exponential loss can then be written as,
where and . Here we drop the terms that are irrelevant to and is initialized to . At each iteration, we compute and cache the value for the next iteration. Similarly, the cost function for logistic loss is,
The above primal problems, (13) or (14), can be solved using an efficient Quasi-Newton method like L-BFGS-B, and the dual variables can be obtained using the KKT condition, (11) or (12). The details of our multi-class stage-wise boosting algorithm are given in Algorithm 1.
In order to appreciate the performance gain, we briefly analyze the complexity of the new approach and MultiBoost. The time-consuming steps in Algorithm 1 are step ① (weak classifier learning) and step ④ (calculating coefficients). In step ①, we train a weak learner by solving the subproblem (10). For simplicity, we use decision stumps as weak learners. The fastest way to train the decision stump is to sort feature values and scan through all possible threshold values sequentially to update (10). The algorithm takes for sorting and for scanning classes. At each iteration, we need to train decision stumps (since ). Hence, this step takes at each iteration. In step ④, we solve variables at each iteration. Let us assume the computational complexity of L-BFGS is roughly . The algorithm spends at each iteration. Hence, the total time complexity for boosting iterations is . Roughly, the first term dominates when the number of samples is large and the last term dominates when the number of classes is large.
We also analyze the computational complexity during training of MultiBoost. The time complexity to learn weak classifiers in their approach would be the same as ours. However, in step ④, they would need to solve variables (since the algorithm is fully corrective). The time complexity for this step is (we have and ). Hence, the total time complexity for MultiBoost is . Clearly, the last term will dominate when the number of iterations is large. For example, training a multi-class classifier with samples, features, classes for iterations using our approach would require while MultiBoost would require . For this simple scenario, our approach already speeds up training by three orders of magnitude.
Here we briefly point out the connection between our multi-class formulation and binary classification algorithms such as AdaBoost. We note that AdaBoost sets the regularization parameter, in (4), to be zero  and it minimizes the exponential loss function. The stage-wise optimization strategy of AdaBoost implicitly enforces the regularization on the coefficients of weak learners. See details in [17, 18]. We can simplify our exponential loss learning problem, (4), for a binary case () as,
where if , if and . AdaBoost minimizes the exponential loss function via coordinate descent. At iteration , AdaBoost fixes the value of and solves for . So (15) can be simplified to,
where and . By setting the first derivative of (16) to zero, a closed-form solution of is: where . corresponds to a weighted error rate with respect to the distribution of dual variables. By replacing step ④ in Algorithm 1 with (15), our approach would yield an identical solution to AdaBoost.
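For reference, the closed-form coefficient of binary AdaBoost can be sketched as follows. The formula used, one half of log((1 - ε) / ε) with ε the weighted error rate, is the standard AdaBoost expression and is assumed here to be the one the text refers to, since the symbols are missing from this copy. The function name is hypothetical.

```python
import math


def adaboost_coefficient(predictions, labels, weights):
    """Standard AdaBoost coefficient for one weak classifier.

    predictions, labels: sequences of +1/-1
    weights:             non-negative sample weights (the dual variables)
    Returns (coefficient, weighted_error).
    """
    total = sum(weights)
    eps = sum(w for p, y, w in zip(predictions, labels, weights) if p != y) / total
    if eps <= 0.0 or eps >= 1.0:
        # The closed form is undefined for a perfect or perfectly wrong learner.
        raise ValueError("weighted error must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - eps) / eps), eps


coef, eps = adaboost_coefficient([+1, +1, -1, -1], [+1, +1, -1, +1], [0.25] * 4)
```

A weak classifier with weighted error below one half receives a positive coefficient, and the coefficient grows as the error approaches zero.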
The $\ell_1$-constrained classifier and our boosting
Rosset et al. pointed out that by setting the coefficient value to be small, gradient-based boosting tends to follow the solution of the $\ell_1$-constrained maximum margin classifier, (17), as a function of under some mild conditions:
We conducted a similar experiment on our multi-class boosting to illustrate the similarity between our forward stage-wise boosting and the optimal solution of (17) on the USPS data set. The data set consists of pixels. We randomly select samples from classes (, and ). For ease of visualization and interpretation, we limit the number of available decision stumps to . We first solve (17) using the CVX package (note that to solve (17), the algorithm must access all weak classifiers a priori). We then train our stage-wise boosting as discussed previously. However, instead of solving (13) or (14), we set the weak learner’s coefficient of the selected class in (10) to be and the weak learner’s coefficient of other classes to be . The learning algorithm is run for boosting iterations. The coefficient paths of each class are plotted in the second row in Fig. 1 (the first three columns correspond to exponential loss and the last three correspond to logistic loss). We compare the coefficient paths for our boosting and $\ell_1$-constrained exponential loss and logistic loss in Fig. 1. We observe that both algorithms give very similar coefficients. This experimental evidence leads us to the connection between the solution of our multi-class boosting and the solution of (17). Rosset et al. have also pointed this out for a binary classification problem. The authors incrementally increase the coefficient of the selected weak classifiers by a very small value and demonstrate that the final coefficient paths follow the $\ell_1$-regularized path. In this section, we have demonstrated empirically that our multi-class boosting also asymptotically converges to the optimal $\ell_1$-regularized solution (17).
Shrinkage and bounded step-size
In order to minimize over-fitting, strategies such as shrinkage  and bounded step-size  can also be adopted here. We briefly discuss each method and how they can be applied to our approach. As discussed in previous section, at iteration , we solve,
The alternative approach, as suggested by , is to shrink all coefficients to small values. Shrinkage is simply another form of regularization. The algorithm replaces with where . Since decreases the step-size, can be viewed as a learning rate parameter. The smaller the value of , the higher the overall accuracy, as long as there are enough iterations. Having enough iterations means that we can keep selecting the same weak classifier repeatedly if it remains optimal. It is observed in that shrinkage often produces better generalization performance than line search algorithms. Similar to shrinkage, bounded step-size can also be applied. It caps by a small value, i.e., where is often small. The method decreases the step-size in order to provide better generalization performance.
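The two strategies can be sketched in a few lines. The snippet assumes that the coefficients of the newly added weak classifier form a vector with one non-negative entry per class, that shrinkage multiplies this vector by a learning rate between 0 and 1, and that the bounded step-size caps each entry at a small constant; the parameter names `nu` and `cap` are placeholders, not the paper's symbols.

```python
def shrink(coefficients, nu):
    """Shrinkage: scale the new weak classifier's coefficients by 0 < nu <= 1."""
    if not 0.0 < nu <= 1.0:
        raise ValueError("nu must lie in (0, 1]")
    return [nu * w for w in coefficients]


def bound_step(coefficients, cap):
    """Bounded step-size: cap every coefficient at a small positive value."""
    return [min(w, cap) for w in coefficients]


# Hypothetical coefficients returned by the stage-wise solver for 2 classes.
raw = [1.0, 0.5]
shrunk = shrink(raw, nu=0.5)
capped = bound_step([0.02, 0.5], cap=0.1)
```

Either transformation is applied before the coefficients are cached for the next iteration, so a smaller step typically requires more boosting iterations to reach the same training loss.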
IV-A Regularization parameters and shrinkage
In this experiment, we evaluate the performance of our algorithms on different shrinkage parameters, and regularization parameters, in (3). We investigate how shrinkage helps improve the generalization performance. We use benchmark multi-class data sets. We choose random samples from each class and randomly split the data into two groups: for training and the rest for evaluation. We set the maximum number of boosting iterations to . All experiments are repeated times. We vary the value of between and . Experimental results are reported in Table II. From the table, we observe a slight increase in generalization performances in all data sets when shrinkage is applied.
In the next experiment, we evaluate how affects the final classification accuracy. We experiment with in , , , , using both exponential loss and logistic loss. Note that fixing is equivalent to selecting the maximum number of weak learners. The iteration in our boosting algorithm continues until the algorithm can no longer find a violated constraint, i.e., the optimal solution has been found, or the maximum number of iterations is reached. Table I reports final classification errors. From the table, we observe similar classification accuracy when is set to a small value (). In this experiment, we do not observe over-fitting even when we set to . This is because the number of iterations serves as the regularization in our problem. For large (), we observe that classification errors increase as the number of classes increases. Our conjecture is that as the classification problem becomes harder (i.e., more classes), the optimal obtained in (10) would fail to satisfy the stopping criterion for large (Step ② in Algorithm 1). As a result, the algorithm terminates prematurely and poor performance is observed. These experimental results demonstrate that choosing a specific combination of and might not have a strong influence on the final performance ( and is sufficiently small). However, one can cross-validate these parameters to achieve optimal results. In the rest of our experiments, we apply a shrinkage value of and set to be .
|data||MCBoost-||MCBoost-||MultiBoost ||Speedup factor|
|( classes/ dims)||Error||Time||Error||Time||Error||Time||Exp||Log|
IV-B Comparison to MultiBoost
In this experiment, we compare our algorithm to MultiBoost, a totally corrective multi-class boosting proposed in . We compare both the classification accuracy and the coefficient calculation time (training time) of our approach and MultiBoost. For simplicity, we use decision stumps as the weak classifier. For MultiBoost, we use the logistic loss and choose the regularization parameter from , , , , , , by cross-validation. For our algorithm, we set to and to . All experiments are repeated times using the same regularization parameter. All algorithms are implemented in MATLAB using a single processor. The weak learner training (decision stump) is written in C and compiled as a MATLAB MEX file. We use the MATLAB interface for L-BFGS-B to solve (13) and (14). The maximum number of L-BFGS-B iterations is set to . The iteration stops when the projected gradient is less than or the difference between the objective value of the current iteration and the previous iteration is less than . We use the data set letter from the UCI repository and vary the number of classes and the number of training samples. Experimental results are shown in Table III and Fig. 2. We observe that our approach performs comparably to MultiBoost while requiring a fraction of the training time. In the next experiment, we statistically compare the proposed approach with MultiBoost using the nonparametric Wilcoxon signed-rank test (WSRT) on several UCI data sets.
In this experiment, we compare the proposed approach with MultiBoost on UCI data sets. For each data set, we randomly choose samples from each class and randomly split the data into training and test sets at a ratio of :. We repeat our experiments times. For data sets with a large number of dimensions, we perform dimensionality reduction using PCA. Our PCA projected data captures of the original data variance. We set the number of boosting iterations to . Table IV reports average test errors and the time it takes to compute of different algorithms. Based on our results, all methods perform very similarly. MultiBoost has a better generalization performance than other algorithms on data sets while MCBoost and MCBoost perform better than other algorithms on and data sets, respectively. We then statistically compare all three algorithms using the non-parametric Wilcoxon signed-rank test (WSRT). WSRT tests the median performance difference between a pair of classifiers. In this test, we set the significance level to be . The null hypothesis declares that there is no difference between the median performance of both algorithms at the significance level. In other words, a pair of algorithms perform equally well in a statistical sense. According to the table of exact critical values for the Wilcoxon’s test, for a confidence level of and data sets, the difference between the classifiers is significant if the smaller of the rank sums is less than or equal to . For MCBoost and MultiBoost, the signed rank statistic result is and, for MCBoost and MultiBoost, the result is . Since both results exceed the critical value, WSRT indicates a failure to reject the null hypothesis at the significance level. In other words, the test does not detect a significant difference between stage-wise boosting and totally-corrective boosting. In terms of training time, both MCBoost and MCBoost are much faster to train compared to MultiBoost.
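The signed-rank statistic used in this comparison can be computed as in the sketch below. It follows one common convention, stated here as an assumption: data sets on which the two classifiers tie are discarded, tied absolute differences receive their average rank, and the statistic is the smaller of the positive and negative rank sums. The function name and the error values in the example are made up for illustration and are not results from the paper.

```python
def wilcoxon_statistic(errors_a, errors_b):
    """Smaller signed-rank sum for paired per-data-set errors.

    Zero differences are dropped and tied absolute differences share
    their average rank (one common convention, assumed here).
    Returns (statistic, number_of_nonzero_differences).
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)

    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next absolute difference is tied.
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0  # ranks are 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = average_rank
        pos = end + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus), len(diffs)


# Made-up error counts on five data sets for two classifiers.
statistic, n_used = wilcoxon_statistic([10, 12, 9, 15, 11], [8, 13, 9, 11, 8])
```

The returned statistic is then compared with the tabulated critical value for the number of data sets that remain after discarding ties.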
We have observed a significant speed-up factor (at least two orders of magnitude) depending on the complexity of the optimization problem and the number of classes.
|Algorithm||Evaluation function||Coefficients||Test time (msecs)|
|Coding based (single-label)||,|
|e.g., Ada.MH , Ada.ECC |
|Coding based (multi-label)||, ,|
|e.g., AdaBoost.MO |
|A matrix of coefficient e.g., MCBoost-,|
|MCBoost-, MultiBoost , GradBoost |
IV-C Multi-class boosting on UCI data sets
Next we compare our approaches against some well-known multi-class boosting algorithms: SAMME, AdaBoost.MH, AdaBoost.ECC, AdaBoost.MO, GradBoost and MultiBoost. For AdaBoost.ECC, we perform binary partition using the random-half method. For GradBoost, we implement /-regularized multi-class boosting and choose the regularization parameter from , , , , , . All experiments are repeated times. The maximum number of boosting iterations is set to . Average test errors of different algorithms and their standard deviations (shown in ) are reported in Table V. On UCI data sets, we observe that the performance of most methods is comparable. However, MCBoost has a better generalization performance than other multi-class boosting algorithms on out of data sets evaluated. In addition, directly maximizing the multi-class margin (as in MCBoost and MultiBoost) often leads to better generalization performance in our experiments (especially on data sets in which the number of classes is larger than ). Note that similar findings have also been reported in , where the authors theoretically compare different multi-class classification algorithms. They concluded that learning a matrix of coefficients, i.e., the multi-class formulation of , should be preferred to other multi-class learning methods.
We also plot average test errors versus number of weak classifiers on a logarithmic scale in Fig. 3. From the figure, AdaBoost.MO has the fastest convergence rate, followed by MultiBoost and our proposed approach. AdaBoost.MO converges fastest since it trains weak classifiers at each iteration, while other multi-class algorithms train weak classifier at each iteration. For example, on the USPS digit data set, the AdaBoost.MO model would have a total of weak classifiers ( boosting iterations) while all other multi-class classifiers would only have weak classifiers. A comparison of evaluation functions of different multi-class classifiers is shown in Table VI. We also illustrate the evaluation time of different functions on classes pendigits data set. AdaBoost.MO is much slower than other algorithms during test time. From Fig. 3, the convergence rate of MultiBoost is slightly faster than our approach since MultiBoost adjusts classifier weights () in each iteration. Our algorithm does not have this property and converges more slowly than MultiBoost. However, both algorithms achieve similar classification accuracy when converged. Note that our experiments are performed on an Intel core i- CPU with GB memory.
IV-D MNIST handwritten digits
Next we evaluate our approach on well known handwritten digit data sets. We first resize the original image to a resolution of pixels and apply a de-skew pre-processing. We then apply a spatial pyramid and extract