Forward Stagewise Additive Model for Collaborative Multiview Boosting

Abstract

Multiview assisted learning has gained significant attention in recent years in the supervised learning genre. The availability of high performance computing devices enables learning algorithms to search simultaneously over multiple views or feature spaces to obtain an optimum classification performance. This paper is a pioneering attempt at formulating a mathematical foundation for realizing a multiview aided collaborative boosting architecture for multiclass classification. Most present algorithms apply multiview learning heuristically, without exploring the fundamental mathematical changes imposed on traditional boosting. Also, most algorithms are restricted to a two-class or two-view setting. Our proposed mathematical framework enables collaborative boosting across any finite dimensional view spaces for multiclass learning. The boosting framework is based on a forward stagewise additive model which minimizes a novel exponential loss function. We show that the exponential loss function essentially captures the difficulty of a training sample instead of the traditional ‘1/0’ loss. The new algorithm restricts a weak view from over-learning and thereby prevents overfitting. The model is inspired by our earlier attempt [1] at collaborative boosting, which was devoid of mathematical justification. The proposed algorithm is shown to converge much nearer to the global minimum in the exponential loss space and thus supersedes our previous algorithm. The paper also presents analytical and numerical analyses of convergence and margin bounds for multiview boosting algorithms, and we show that our proposed ensemble learning manifests a lower error bound and higher margin compared to our previous model. The proposed model is also compared with traditional boosting and recent multiview boosting algorithms. In the majority of instances the new algorithm manifests a faster rate of convergence of training set error and simultaneously offers better generalization performance. Kappa-error diagram analysis reveals the robustness of the proposed boosting framework to labeling noise.

Index Terms

multiview learning, AdaBoost, collaborative learning, kappa-error diagram, neural net ensemble

1 Introduction

Multiview supervised learning has received significant attention from machine learning practitioners in recent times. On today’s Big Data platforms it is quite common that a single learning objective is represented over multiple feature spaces. To appreciate this, consider the KDD Network Intrusion Challenge [2]. In this challenge, domain experts identified four major variants of network intrusion and characterized them over three feature spaces, viz. TCP components, content features and traffic features. Another motivating example is the ‘100 Leaves Dataset’ [3], where the objective is to classify one hundred classes of leaves. Each leaf is characterized by shape, margin and texture features. Multiview representation of the objective function is also common in other disciplines such as drug discovery [4], medical image processing [5], dialogue classification [6], etc.

One intuitive method is to combine all the features and then train a classifier on a reduced dimensional feature space. But dimensionality reduction has its own demerits. Usually features are engineered by experts and each feature has its own physical significance. Projecting the features onto a reduced dimensional space usually obscures the physical interpretation of the reduced feature space. Another problem with dimensionality reduction is that subtle features are lost during the projection process; such features have been shown to foster better discriminative capability in the presence of noisy data [7, 8]. Training by the above method is sometimes referred to as Early Fusion. Another paradigm of multiview learning is Late Fusion, where the objective is to separately learn classifiers on each feature space and finally conglomerate the classifiers by majority voting [9]. The major issue is that these algorithms do not incorporate collaborative learning across views. We feel it is an interesting strategy to communicate classification performance across views and to model the weight distribution over the sample space according to this communication. Also, the performance of fusion techniques is problem specific, and thus the optimum fusion strategy is unknown a priori [10].

Multiview learning has been an established genre of research in semi-supervised learning, where manual annotation labor is reduced by stochastically learning over labeled and unlabeled training examples. Query-by-committee [11] and co-training [12] were the two pioneering efforts in this direction. For these algorithms, the objective function is represented over two mutually independent and sufficient view spaces. Independent classifiers are trained on each view space using the small number of labeled examples. The remaining unlabeled instance space is annotated by iterative majority voting of the classifiers trained on the two views. More recently, co-training by committee [13] obviates the constraint of mutual orthogonality of the views. The significant success of multiview learning in the semi-supervised setting has been the primary motivation for our work.

The paper presents the following notable contributions:

  1. To the best of our knowledge this is the pioneering attempt at formulating an additive-model-based mathematical framework for multiview collaborative boosting. It is to be noted that the primary significance of our current work is to mathematically bolster our previous attempt at multiview learning, MA-AdaBoost [1], which was based on intuitive cues.

  2. Stagewise modeling of boosting requires a loss function, and in this regard we propose a novel exponential multiview weighted loss function to grade the degree of ‘difficulty’ of an example. Using this loss function, we derive a multiview weight update criterion similar to that used in [1]; this signifies the aptness of our present analytical approach and the correctness of our previous intuitive modeling.

  3. We devise a two-step optimization framework that converges much nearer to the global minimum of the proposed exponential loss space than our previous attempt, MA-AdaBoost.

  4. Analytical expressions are derived for upper bounding the training set error and margin distribution under the multiview boosting setting. We numerically study the variations of these bounds and show that the proposed framework is superior to MA-AdaBoost.

  5. Extensive simulations are performed on challenging datasets such as 100-Leaves [3], eye classification [14], MNIST handwritten character recognition and 11 different real-world datasets from the UCI database [15]. We compare our model with traditional and state-of-the-art multiclass boosting algorithms.

  6. Kappa-error visualization is studied to demonstrate the robustness of the proposed SAMA-AdaBoost to labeling noise.

The rest of the paper is organized as follows. Section II gives a brief overview of traditional AdaBoost and its variants. Section III presents some recent work on multiview boosting algorithms and describes how our work addresses some of the shortcomings of existing algorithms. Section IV formally describes our collaborative boosting framework, followed by convergence and margin analysis in Section V. Experimental analyses are presented in Section VI. Finally, Section VII concludes the paper with a brief discussion and future extensions of the proposed work.

2 Brief Overview on Adaptive Boosting

In this section we present a brief overview of the traditional adaptive boosting algorithm [16] and the recent variants of AdaBoost. We also discuss some of the mathematical viewpoints which bolster the principle of AdaBoost. Suppose we are provided with a training set of labeled examples, where each example consists of a multi-dimensional input vector and its class label. The fundamental concept of AdaBoost is to formulate a weak classifier in each round of boosting and ultimately conglomerate the weak classifiers into a superior meta-classifier. AdaBoost initially maintains a uniform weight distribution over the training set and builds a weak classifier. For the next boosting round, the weights of misclassified examples are increased while the weights of correctly classified examples are reduced. Such a modified weight distribution aids the next weak classifier to focus more on misclassified examples, and the process continues iteratively. The final classifier is formed by a linear weighted combination of the weak classifiers. AdaBoost.MH [17] is usually used for multiclass classification using a ‘one-versus-all’ strategy.
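To make the mechanics above concrete, the following is a minimal sketch of a generic binary, discrete AdaBoost loop with decision stumps as weak learners. It is illustrative only (it is not the multiview algorithm proposed later in this paper), and the stump learner, variable names and clipping constant are our choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, n_rounds=20):
    """X: (n, d) features, y: labels in {-1, +1}."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                       # uniform initial weight distribution
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # weak learner trained on current weights
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)   # weighted training error
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this weak learner
        w *= np.exp(-alpha * y * pred)            # raise weights of mistakes, lower the rest
        w /= w.sum()                              # renormalize the distribution
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    score = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(score)                         # linear weighted vote of weak learners
```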

Traditional AdaBoost has undergone a plethora of modifications owing to active interest in the machine learning community. WNS-Boost [18] uses a weighted novelty sampling algorithm to extract the most discriminative subset of the training sample set. The algorithm then runs AdaBoost on the reduced sample space and thereby enhances training speed with minimal loss of accuracy. SampleBoost [19] is aimed at handling early termination of multiclass AdaBoost and at destabilizing weak learners which repeatedly misclassify the same set of training examples. Zhang et al. [20] introduce a correction factor in the reweighting scheme of traditional AdaBoost for enhanced generalization accuracy.

Researchers have used margin analysis theory [21, 22] to explain the working principle of AdaBoost. Another viewpoint for explaining AdaBoost is functional gradient descent [23]. A more recent way of explaining AdaBoost is as a forward stagewise additive model which minimizes an exponential loss function [24]. Inspired by the model in [24], Zhu et al. proposed SAMME [25] for multiclass boosting using a Fisher-consistent exponential loss function.

We wish to acknowledge that [24, 25] have been instrumental in our thought process for the proposed algorithm, but we differ in several aspects. To the best of our knowledge this is the first attempt to formulate a mathematical model for multiview boosting using stagewise modeling. Also, the existing mathematical frameworks which explain boosting lack the scope for scalable collaborative learning.

3 Related Works on Multiview Boosting

The current work is motivated by our previous successful attempt at multiview assisted adaptive boosting, MA-AdaBoost [1]. MA-AdaBoost is the first attempt to grade the difficulty of a training example instead of using the traditional ‘1/0’ loss usually practised in the boosting genre. We have successfully used MA-AdaBoost in computer vision applications [14] and on other real-world datasets. But MA-AdaBoost is primarily based on heuristics. The objective of this paper is to understand and enhance the performance of MA-AdaBoost by formulating a thorough mathematical justification.

Recently, researchers have proposed different algorithms for group based learning. 2-Boost [26] and Co-AdaBoost [27] are closely related to each other. Both of these algorithms maintain a single weight distribution over the feature spaces, and the weight update depends on ensemble performance. Our algorithm is considerably different from these two algorithms. The proposed algorithm is scalable to any finite dimensional view and label space, while 2-Boost and Co-AdaBoost are restricted to the two-class, two-view setting. 2-Boost additionally requires that the two views be learnt by different baseline algorithms. Moreover, these two algorithms formulate the final hypothesis by majority voting. In contrast, our model uses a novel scheme of reward-penalty based voting. Share-Boost [28] has some similarities with 2-Boost, except that after each round of boosting Share-Boost discards all weak learners except the globally optimum learner (the classifier with the least weighted error).

AdaBoost.Group [29] was proposed for group based learning, in which the authors assumed that the sample space can be categorized into discriminative groups. Boosting was performed at the group level, and independent classifiers were optimally trained by maximizing the F-score on individual views. The final classifier was formed by majority voting over all the groups. Separately training independent classifiers inhibits AdaBoost.Group from inter-view collaboration. Also, optimizing classifiers over each local view space does not ensure that the final global classifier is optimized.

Mumbo is an elegant example of a multiview assisted boosting algorithm [30, 31]. The fundamental idea of Mumbo is to remove an arduous example from the view space of weak learners and simultaneously increase the weight of that example in the view space of strong learners. A variant of Mumbo has been used by Kwak et al. [5] for tissue segmentation. Mumbo maintains a cost matrix on each view space, whose entries represent the cost of assigning a training example of one class to another class on that view. The total space requirement of Mumbo therefore scales with the total number of views, the number of classes and the number of training samples. Such a space requirement is debatable in the case of large datasets. Our proposed algorithm is free of such space requirements. Moreover, Mumbo requires that at least one view be ‘strong’ and be aided by the other ‘weak’ views. Selection of a strong view in the case of a large dataset is not a trivial task. Our proposed algorithm adaptively assigns importance to a view space at run time, so end users need not manually specify a strong view.

4 Collaborative Boosting Framework

In this section we formally introduce our proposed framework, the stagewise additive multiview assisted boosting algorithm, SAMA-AdaBoost. We consider the most general case where an example is represented over an arbitrary number of views together with its corresponding class label.

4.1 Formulation of Exponential Loss Function

If an example belongs to a particular class, we assign it a corresponding one-hot label vector in which the element at the position of the true class is one and all remaining elements are zero. We denote the weak hypothesis vector learnt on a given view space after a given number of boosting rounds analogously, together with its individual elements. Before we delve into the formulation of the exponential loss function, we need to pre-process the hypothesis vectors. Specifically, each element of a hypothesis vector is modified by the following equation:

(1)

where the indicator term equals one only for a misclassified example and is zero elsewhere. The first part of Eq. (1) is triggered for misclassified hypothesis vectors. The second function in the first part of Eq. (1) is triggered only if the corresponding entry of either the label vector or the hypothesis vector is ‘1’, and in those cases the power term transforms the element to ‘-1’. The second part of Eq. (1) is triggered for a correctly classified vector and keeps the vector intact. Table 1 delineates this transformation process, where we consider an example belonging to class 1.

Label vector   Correct hypothesis   Transformed (correct)   Incorrect hypothesis   Transformed (incorrect)
1              1                    1                       0                      -1
0              0                    0                       0                       0
0              0                    0                       1                      -1
0              0                    0                       0                       0
...            ...                  ...                     ...                    ...
Table 1: An illustration of the transformation of hypothesis vectors. For illustration purposes we show an example using only two views; the two hypothesis vectors are the correct and the incorrect hypothesis vectors, respectively, for an example of class 1.
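The following sketch gives one plausible reading of the pre-processing step of Eq. (1) as illustrated in Table 1: a correctly classified one-hot hypothesis vector is left intact, while for a misclassified vector the entries at the true-class and predicted-class positions are flipped to -1. The function name and array encoding are ours, not the authors' exact formulation.

```python
import numpy as np

def transform_hypothesis(y_onehot, h_onehot):
    """y_onehot, h_onehot: 1-D 0/1 arrays with a single 1 each."""
    if np.array_equal(y_onehot, h_onehot):
        return h_onehot.copy()                     # correct classification: vector kept intact
    t = h_onehot.astype(int).copy()
    t[(y_onehot == 1) | (h_onehot == 1)] = -1      # mark true and predicted classes with -1
    return t

# Reproducing the two transformed columns of Table 1 for a class-1 example (4 classes shown):
y  = np.array([1, 0, 0, 0])
hA = np.array([1, 0, 0, 0])                        # correct hypothesis   -> ( 1, 0,  0, 0)
hB = np.array([0, 0, 1, 0])                        # incorrect hypothesis -> (-1, 0, -1, 0)
print(transform_hypothesis(y, hA), transform_hypothesis(y, hB))
```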

We define an exponential loss function

(2)

where the summation runs over the total number of feature spaces or views. From Table 1 we see that the contribution of a view to the loss depends on whether its hypothesis vector is correct or incorrect. If an example is misclassified by the weak learners on a certain number of views, then

(3)
(4)

We argue that the term in Eq. (4) manifests the difficulty of an example. Weak classifiers over all the views are trying to learn the same example, so it makes sense to judge its difficulty in terms of the total number of misclassified views and to incorporate this graded difficulty in the loss function, which will eventually govern the boosting network.
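As a small illustration of the graded-difficulty idea, the sketch below counts, for every training example, the number of views whose weak learners misclassify it and feeds that count through an exponential. The exponential form and the scale parameter are illustrative stand-ins for Eqs. (2)-(4), whose exact constants are not reproduced here.

```python
import numpy as np

def view_difficulty(y_true, view_predictions):
    """view_predictions: (M, n) array of predicted labels, one row per view."""
    miscls = (view_predictions != y_true[None, :])    # (M, n) boolean error matrix
    v = miscls.sum(axis=0)                            # number of views misclassifying each example
    return v / view_predictions.shape[0]              # graded difficulty in [0, 1]

def graded_exponential_loss(difficulty, beta=1.0):
    return np.exp(beta * difficulty)                  # grows with the number of failed views
```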

4.2 Forward Stagewise Model for SAMA-AdaBoost

In this section we present a forward stagewise additive model to understand the working principle of our proposed SAMA-AdaBoost. We opt for a greedy approach in which, at each step, we optimize one more weak classifier and add it to the existing ensemble space. Specifically, the approach can be viewed as stagewise learning of additive models [24] with the initial ensemble space as the null space. We define the additive model learnt over the boosting rounds on a particular view space as:

(5)

where the coefficient denotes the learning rate. Our goal is to learn the meta-model, which represents the overall additive model learnt over all boosting rounds on all views.

(6)

So, after an arbitrary number of boosting rounds over the total number of views, we can write:

(7)
(8)
(9)

The first part of Eq. (9) represents the part of our model which has already been learnt and hence cannot be modified. Our aim is to optimize the second part of Eq. (9). Here, we make use of our proposed exponential loss function as given in Eq. (2). The solution for the next best set of weak classifiers and the learning rate on the current boosting round can be written as:

(10)
(11)

where,

(12)

where the local hypothesis vector is the one learnt in the current round on the given view space, and the sum runs over the number of training examples. The quantity defined in Eq. (12) can be considered as the weight of a training example at the current stage of boosting. Since it depends only on the model learnt so far, it is a constant for the optimization problem at the current iteration. Following Eq. (12) we can write:

(13)
(14)

Eq. (14) is the weight update rule for our proposed SAMA-AdaBoost algorithm. Specifically, if an example has been misclassified on a given number of views, then following the steps of Eq. (4) it can easily be shown that the weight update rule is given by:

(15)

We now return to our optimization objective as stated in Eq.(11). For simplicity we consider,

(16)

For illustration purposes, suppose that an example is misclassified on a certain number of views. Considering Eq. (16) only for that example, we get

(17)
(18)

In general, if we consider this approach for all training examples, then we can rewrite Eq. (16) as follows:

(19)

Note that one of the terms is identically zero because its index runs over weak learners which have correctly classified the corresponding examples. Thus, Eq. (19) reduces to:

(20)

To minimize the objective in Eq. (20), we need a set of weak learners for which the exponentially weighted error term is minimal. Thus, the optimal set consists of the weak learners which manifest the least possible exponential weighted error given by Eq. (20). With this optimal set of weak learners we now aim to evaluate the optimum value of the learning rate. Rewriting Eq. (20) we get:

(21)

Differentiating with respect to the learning rate and setting the derivative to zero yields:

(22)

We solve for the learning rate numerically by optimizing the objective of Eq. (21). The exact procedure to determine it is illustrated in the next section. The resulting optimum set of weak learners and optimum learning rate are the quantities with which the additive model is updated at the current iteration.
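Since the loss of Eq. (21) is convex in the learning rate, a simple one-dimensional numerical search suffices for the second optimization step. In the sketch below, `loss_of_beta` stands in for the weighted exponential objective of Eq. (21) evaluated at a candidate learning rate; the search interval and the toy surrogate in the usage line are assumptions made purely for illustration.

```python
import math

def optimal_beta(loss_of_beta, lo=0.0, hi=10.0, tol=1e-6):
    """Ternary search for the minimizer of a convex one-dimensional function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loss_of_beta(m1) < loss_of_beta(m2):
            hi = m2              # the minimizer lies in [lo, m2]
        else:
            lo = m1              # the minimizer lies in [m1, hi]
    return 0.5 * (lo + hi)

# Toy convex surrogate for illustration only; the real objective follows Eq. (21).
beta_star = optimal_beta(lambda b: 0.3 * math.exp(b) + 0.7 * math.exp(-b))
```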

4.3 Implementation of SAMA-AdaBoost

In the previous subsection we presented the mathematical framework of the multiview assisted forward stagewise additive model of the proposed SAMA-AdaBoost. We now explain the steps for implementing SAMA-AdaBoost for a real-life classification task.

Initial parameters

  • Training examples, each represented over all the view spaces together with its class label;

  • Total number of view/feature spaces;

  • A weak hypothesis learnt on each view space in each boosting round;

  • Total number of boosting rounds;

  • Initial weight distribution: uniform, i.e., each of the N training examples initially receives weight 1/N.

Communication across views and grading difficulty of training example

After a boosting round, weak learners across the views share their classification results. Let an example be misclassified on a certain number of views. Following the arguments in Eq. (4), the difficulty of the example at that boosting round is asserted by:

(23)

Weight update rule

  • The learning rate is set to the value which optimizes Eq. (21) in the current boosting round.

  • Weight update rule

    (24)

It is noteworthy that when an example has been misclassified on every view, Eq. (24) reduces to

(25)

which is the usual weight update rule of traditional AdaBoost when an example has been misclassified. Similarly, when an example has been correctly classified on every view,

(26)

which is the usual weight update rule of traditional AdaBoost when an example has been correctly classified. Thus our proposed algorithm is a generalization of AdaBoost and asserts a graded degree of difficulty over the sample space instead of a ‘1/0’ loss. The proposed weight update rule thereby helps the learning algorithm to dynamically assign more importance to a relatively ”tougher” misclassified example than to an ”easier” one.
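Eq. (24) is not reproduced above, but the sketch below shows one update rule that is consistent with the two limiting cases just stated: when every view misclassifies an example the multiplicative factor becomes exp(+beta), the classical AdaBoost "misclassified" update, and when every view is correct it becomes exp(-beta). The linear interpolation used for intermediate cases is our illustrative assumption, not the paper's exact rule.

```python
import numpy as np

def multiview_weight_update(w, v, M, beta):
    """w: current weights, v: number of views misclassifying each example, M: total views."""
    factor = np.exp(beta * (2.0 * v / M - 1.0))   # +beta when v == M, -beta when v == 0
    w_new = w * factor
    return w_new / w_new.sum()                    # keep a normalized weight distribution
```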

Fitness measure of local weak learners

We first determine the fitness of a local weak learner.

  • Define the set of training examples which the weak learner classifies correctly, such that,

    (27)
  • The correct classification rate of the weak learner is then given by,

    (28)

We argue that the correct classification rate alone is not an appropriate fitness metric for a weak learner. We found during experiments that a weak learner may have a low correct classification rate and yet tend to correctly classify ”tougher” examples. So, the fitness of a weak learner should be evaluated based not only on its correct classification rate but also on the difficulty of the portion of the sample space which it classifies correctly.

  • The reward of the weak learner is determined as follows,

(29)
  • Finally, the fitness of the weak learner is given as follows,

(30)

Eq. (30) highly rewards classifiers which correctly classify ”tougher” examples with high confidence, while heavily penalizing weak learners which misclassify ”easier” examples with high conviction. Steps 2-4 are repeated for the total number of boosting rounds.

Conglomerating local weak learners

  • SAMA-AdaBoost.V1: In this version the final meta-classifier is given by,

(31)

where the rounding operator represents the nearest integer to its argument.

  • SAMA-AdaBoost.V2: In this version the final meta-classifier is given by,

(32)

where the summand denotes the prediction confidence for class p.
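Putting the steps of this subsection together, the following self-contained sketch mimics the overall control flow of SAMA-AdaBoost: per-view weak ANNs, cross-view communication of classification results, difficulty-driven reweighting, and a weighted vote of all weak learners. The difficulty grade, the fixed learning rate, the resampling used in place of sample weights and the accuracy-only fitness are simplifications of Eqs. (21)-(32), so this is a schematic reading rather than the authors' exact algorithm.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def sama_like_train(views, y, n_rounds=10, beta=0.5, seed=0):
    """views: list of (n, d_m) arrays, one per view; y: (n,) integer labels in {0..K-1}."""
    rng = np.random.default_rng(seed)
    n, M = len(y), len(views)
    w = np.full(n, 1.0 / n)                          # uniform initial weight distribution
    ensemble = []                                    # entries: (view index, learner, vote weight)
    for _ in range(n_rounds):
        preds, round_learners = np.zeros((M, n), dtype=int), []
        for m, Xm in enumerate(views):               # one weak ANN per view space
            idx = rng.choice(n, size=n, p=w)         # resampling stands in for sample weights
            clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=300)
            clf.fit(Xm[idx], y[idx])
            preds[m] = clf.predict(Xm)
            round_learners.append(clf)
        v = (preds != y[None, :]).sum(axis=0)        # cross-view communication: views in error
        w = w * np.exp(beta * (2.0 * v / M - 1.0))   # harder examples gain weight
        w = w / w.sum()
        for m, clf in enumerate(round_learners):
            fitness = float((preds[m] == y).mean())  # simplified fitness (accuracy only)
            ensemble.append((m, clf, fitness))
    return ensemble

def sama_like_predict(ensemble, views, n_classes):
    n = views[0].shape[0]
    votes = np.zeros((n, n_classes))
    for view_idx, clf, weight in ensemble:
        pred = clf.predict(views[view_idx])
        votes[np.arange(n), pred] += weight          # fitness-weighted voting (simplified)
    return votes.argmax(axis=1)
```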

5 Study on Convergence Properties

5.1 Error Bound on Training Set

In this section we derive an analytical expression which upper bounds the training set error of multiview boosting, and later we empirically compare the variation of this bound for SAMA-AdaBoost and MA-AdaBoost at different levels of boosting. Without loss of generality, the analysis is performed for binary classification, and we consider simplified versions of SAMA-AdaBoost and MA-AdaBoost which fuse weak multiview learners by simple majority voting instead of reward-penalty based voting. The motivation for the second simplification is to appreciate the differences in the core boosting mechanisms of the competing paradigms. The final boosted classifier learned over all views after the boosting rounds is given by:

(33)

We define as,

(34)

A normalized version of the weight update rule for SAMA-AdaBoost can be written as,

(35)

where the normalization factor is given by,

(36)

The recursive nature of Eq. (35) enables us to write the final weight of a training example as:

(37)
(38)
(39)
(40)

Now, the training set error incurred by the final classifier can be represented as:

(41)
(42)
(43)
(44)

Eq. (44) provides an upper bound for multiview boosting paradigms such as SAMA-AdaBoost and MA-AdaBoost. It is to be remembered that, although Eq. (44) holds for both SAMA-AdaBoost and MA-AdaBoost, the learning rates and normalization factors, and thus the bounds themselves, are different for the two algorithms. In Table 2 we report the upper bounds calculated for SAMA-AdaBoost and MA-AdaBoost at different levels of boosting on the eye classification task (refer to Section 6.2 for dataset and implementation details). A lower error bound is an indication that the ensemble has learnt the examples of the training set and is less susceptible to training set misclassification. The exact values in Table 2 are not important, but the scales of the magnitudes are worth noticing. We see that the ensemble space of the proposed SAMA-AdaBoost is able to learn much faster than that of MA-AdaBoost: the error bound decreases markedly faster with each round of boosting for SAMA-AdaBoost. In particular, the error bound for SAMA-AdaBoost suffers a steep drop in order of magnitude when training is increased from 15 to 20 rounds of boosting, whereas the error bound of MA-AdaBoost reduces only insignificantly. Table 2 is a strong indication that systematic optimization of SAMA-AdaBoost’s loss function fosters a faster convergence rate on the training set.

Boosting Rounds: T SAMA-AdaBoost:Proposed MA-AdaBoost
5 7.1* 7.6*
10 3.2* 0.9*
15 8.3* 4.2*
20 5.0* 1.8*
25 2.5* 1.1*
Table 2: Comparison of training set error bounds (Eq. (44)) after different levels of boosting (T). A lower value of the error bound signifies that an ensemble is prone to make fewer errors on the training set.
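For intuition on how such a bound is tracked in practice: in the classical AdaBoost analysis the training-set error is upper-bounded by the product of the per-round weight-normalization factors, and Eq. (44) plays the analogous role here. The helper below simply accumulates that product from per-round normalizers which are assumed to have been recorded during training; it does not re-derive the multiview bound itself.

```python
def training_error_bound(normalizers):
    """normalizers: iterable of per-round normalization factors Z_t (each > 0)."""
    bound = 1.0
    for z in normalizers:
        bound *= z                 # the bound shrinks only if the Z_t stay below one
    return bound

# e.g. training_error_bound([0.9, 0.85, 0.8]) -> 0.612
```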
Figure 1: Margin distribution graphs on the 100-Leaves classification task after 5, 10 and 15 rounds of boosting. The vertical axis denotes the fraction of the training sample space having margin at most the value on the horizontal axis.

5.2 Generalization Error and Margin Distribution Analysis

Visualizing Margin Distribution

Training set performance reveals only part of the explanation of the generalization properties of boosting. It has been shown in [22] that more confidence on the training set explicitly improves generalization performance. Frequently, the margin on the training set is taken as the metric of confidence of the boosted ensemble. In the context of boosting, the margin is defined as follows. Suppose that the final boosted classifier is a convex combination of base/weak learners. The weight placed on a particular class for a training example is taken as the sum of the convex weights of the base learners voting for that class. The margin of the example is computed as the difference between the weight assigned to its correct label and the highest weight assigned to any incorrect label. Thus, the margin spans the range [-1, 1]. It is easy to see that the margin is positive for a correctly classified example and negative in the case of misclassification. A significantly high positive margin manifests greater confidence of prediction. It has been shown in [22, 21] that for high generalization accuracy it is mandatory to have a minimal fraction of training examples with small margin. Margin distribution graphs are usually studied in this regard. A margin distribution graph is a plot of the fraction of training examples with margin at most a given threshold, as a function of that threshold. In Fig. 1 we analyze the margin distribution graphs of SAMA-AdaBoost and MA-AdaBoost. We use the same simulation setup on the 100-Leaves classification task as will be discussed in Section 6.1.

Consistently, we find that the margin distribution graph of SAMA-AdaBoost lies below that of MA-AdaBoost. Such a distribution means that, for a given margin threshold, SAMA-AdaBoost always tends to have fewer examples with margin at most that threshold compared to MA-AdaBoost. This explicitly makes the ensemble space of SAMA-AdaBoost more confident on the training set, thereby manifesting superior performance on the test set.
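A minimal sketch of how the margins and the margin distribution curve plotted in Fig. 1 can be computed is given below. It assumes a matrix of normalized per-class votes of the boosted ensemble (one row per training example); the variable names are ours.

```python
import numpy as np

def margins(vote_matrix, y_true):
    """vote_matrix: (n, K) rows summing to 1; y_true: (n,) integer labels."""
    n = vote_matrix.shape[0]
    true_score = vote_matrix[np.arange(n), y_true]
    wrong = vote_matrix.astype(float)
    wrong[np.arange(n), y_true] = -np.inf          # mask out the true class
    return true_score - wrong.max(axis=1)          # in [-1, 1]; positive iff correctly classified

def margin_distribution(margin_values, thetas):
    """Fraction of training examples with margin at most theta, for each theta."""
    m = np.asarray(margin_values)
    return np.array([(m <= t).mean() for t in thetas])
```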

Bound on Margin Distribution

In this section we provide an analytical expression (in a similar spirit to [22]) for the upper bound of the margin distribution of an ensemble space created by SAMA-AdaBoost and MA-AdaBoost. Later, we show through numerical simulations that boosting inherently encourages a decrease in the fraction of training examples with low margin as the number of boosting rounds increases. Let the instance and label spaces be fixed, and let training examples be generated according to some unknown but fixed distribution over their product. The training set consists of ordered instance-label pairs chosen according to that same distribution. Probabilities and expectations are taken with respect to an example drawn at random, either from the underlying distribution or from the training set; under unambiguous context the two are used interchangeably. The final classifier is defined as the convex combination of the boosted base learners.

(45)

Given a margin threshold, we are interested in finding an upper bound on

(46)

If we assume, , it implies,

(47)
(48)

(49)
(50)
(51)

By an argument similar to that presented in Eq. (40), we can write:

(52)
Figure 2: Variation of the upper bound of the margin distribution on the eye classification dataset after different rounds of boosting. The bound represents the upper bound on the probability of sampling a training example with margin at most a given threshold.

Eq. (52) gives an upper bound on the probability of sampling training examples with small margin. Intuitively, we want this probability to be small because that aids margin maximization. In Fig. 2 we illustrate the variation of this bound at different levels of boosting for SAMA-AdaBoost and MA-AdaBoost on the eye classification dataset (refer to Section 6.2). In the figure, the bound represents the upper bound on the probability of sampling a training example with margin at most a given threshold. For both SAMA-AdaBoost and MA-AdaBoost, at a given number of boosting rounds, we observe that the bound decreases as the margin threshold decreases. This indicates that the ensembles discourage the presence of training examples with low margin. Also, for a given threshold, the bound decreases as the number of boosting rounds increases; this observation indicates that additional rounds of boosting implicitly reduce the prevalence of low-margin examples. A significant observation is that the decay rate of the upper bound is appreciably higher for SAMA-AdaBoost than for MA-AdaBoost, both after 5 and after 25 rounds of boosting. The analysis of this section thereby bolsters our claim that the learning rate of the proposed SAMA-AdaBoost algorithm is much faster than that of MA-AdaBoost. The ensemble space of SAMA-AdaBoost manifests a significantly lower probability of possessing low-margin examples compared to that of MA-AdaBoost. Such an observation indicates better generalization capability for SAMA-AdaBoost. Empirical results in Section 6.2 will further strengthen our claim.

6 Experimental Analysis

In this section we compare our proposed SAMA-AdaBoost on challenging real-world datasets with recent state-of-the-art collaborative boosting algorithms and with variants of non-collaborative traditional boosting. It was shown in [1] that the ”.V2” version of MA-AdaBoost performs slightly better than the ”.V1” version, and thus here we present results using SAMA-AdaBoost.V2 and MA-AdaBoost.V2.

Figure 3: A pictorial representation of our proposed multiview learning. On the extreme left we show extracted leaf segments of five out of one hundred classes of leaves from ”100 leaves database [3]”. Each extracted leaf segment is represented and learnt over three feature spaces with 2-layer ANN. Small bubbles in each rectangular box are drawn to mimic a 2-layer ANN architecture. Bidirectional pink arrows indicate communication across views and thereby performing collaborative learning. Finally, weak learners over different view spaces are combined by reward-penalty based voting.

6.1 100 Leaves Dataset [3]

This is a challenging dataset where the task is to classify 100 classes of leaves based on shape, margin and texture features. Each feature space is 16-dimensional with 16 examples per class. Such a heterogeneous feature set is apt for any multiview learning algorithm. For simulation purposes we take a 2-layer ANN with 5 units in the hidden layer as the baseline learner in each boosting round on each view space. The dataset is randomly shuffled and then split 60:20:20 for training, validation and testing respectively. The regularization parameter is selected by 5-fold cross-validation. In Fig. 3 we pictorially represent the setting of our proposed multiview learning framework.
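A minimal sketch of this per-view setup is shown below: a 2-layer ANN with 5 hidden units fitted on each 16-D view after a random 60:20:20 split. The scikit-learn estimator, the split utility and the iteration cap are our choices; the paper's own implementation details (e.g. the 5-fold selection of the regularization parameter) are not reproduced.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def fit_view_learners(views, labels, seed=0):
    """views: dict of name -> (n, 16) feature array; labels: (n,) class labels."""
    learners = {}
    for name, X in views.items():
        X_train, X_tmp, y_train, y_tmp = train_test_split(
            X, labels, train_size=0.6, random_state=seed, stratify=labels)
        X_val, X_test, y_val, y_test = train_test_split(
            X_tmp, y_tmp, train_size=0.5, random_state=seed, stratify=y_tmp)
        clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=500)
        clf.fit(X_train, y_train)                   # one weak learner per view space
        learners[name] = (clf, clf.score(X_val, y_val))
    return learners

# usage: fit_view_learners({"shape": Xs, "margin": Xm, "texture": Xt}, y)
```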

Figure 4: Plot of the objective function of Eq. (21) (y-axis) versus the learning rate (x-axis) for the ”100 leaves dataset [3]”. Figs. (a), (b), (c) and (d) represent networks trained over 2, 5, 7 and 10 rounds of boosting respectively. In each case we vary the number of ANN training iterations per boosting round, as indicated by the colored lines. We see that our previous algorithm, MA-AdaBoost, fails to attain the global minimum, whereas the proposed framework is able to localize very close to the global minimum.

Determining the optimum value of the learning rate

In this section we illustrate the procedure for computing the optimum learning rate that minimizes the objective in Eq. (21). It has been shown by Schapire [32] that the exponential loss incorporated in AdaBoost is strictly convex in nature and devoid of local minima. In Fig. 4 we plot the objective versus the learning rate. As can be seen, the functional variation of the objective with respect to the learning rate is indeed convex in nature, and thus we apply gradient descent and select the value of the learning rate for which the gradient of the function is close to zero. The absence of local minima guarantees that we will converge near the (single) global minimum. In [1], we naively evaluated the learning rate as:

(53)

We mark the optimum locations evaluated by our algorithm with red stars in Fig. 4. We also mark the corresponding optimal points (green rhombi) evaluated using our previously proposed MA-AdaBoost [1], and it is evident that MA-AdaBoost fails to attain the global minimum. In Table 3 we report the ratio of the optimum indicated by SAMA-AdaBoost and by MA-AdaBoost to the actual global minimum of the loss space. After two, five and ten rounds of boosting, the average ratios for SAMA-AdaBoost are 1.07, 1.02 and 1.03 respectively, while the average ratios for MA-AdaBoost are 2.9, 1.6 and 3.5 respectively. We thus argue that MA-AdaBoost fails to localize the global minimum by a significant margin compared to SAMA-AdaBoost, and as a consequence SAMA-AdaBoost has a faster training set error convergence rate than MA-AdaBoost, as we shall see shortly. A similar nature of dependency is observed on other datasets.

T Iterations SAMA-AdaBoost (Proposed) MA-AdaBoost[1]
50 1.05 4.00
2 100 1.14 2.32
150 1.02 1.30
50 1.06 1.21
5 100 1.01 1.45
150 1.02 2.81
50 1.01 4.12
10 100 1.05 2.52
150 1.08 1.26
Table 3: Ratio of global minimum evaluated by competing algorithms to the actual global minimum of space on ”100 leaves dataset”. The closer the ratio is to unity the better. T: total boosting rounds. Iterations: number of times ANNs are trained per boosting round. We note that the proposed SAMA-AdaBoost converges much closer to actual minimum compared to our previous work of MA-AdaBoost.

Comparison of classification performances

In this section we report the training and generalization performances of several boosted classifiers. For comparison with other boosting algorithms using an ANN as baseline, we used the boosting framework proposed in [33]. For comparison with [20] we take the sample fraction as 0.5 and the correction factor as 4, as indicated by the authors. We cannot compare our results with [27, 26] because these algorithms only support 2-class problems.

Figure 5: Comparison of rates of convergence of the training set error of different boosted classifiers on the 100 Leaves dataset. From the graph it is evident that our proposed SAMA-AdaBoost has the fastest convergence rate. We start by training baseline ANNs with 50 iterations/round and increment up to 200 iterations/round in steps of 20, measuring the misclassification rate at each step. In this figure we report the average results.
T Iterations SAMA-AdaBoost:(Proposed) MA-AdaBoost [1] Mumbo [6] SAMME [25] AdaBoost [17] Zhang [20] WNS [18]
50 74.3 71.2 71.0 75.1 68.1 76.1 67.1
2 100 75.3 73.4 72.4 76.8 70.2 77.2 69.2
150 76.9 75.4 73.9 77.4 71.2 78.2 70.4
50 86.2 83.1 82.8 80.4 77.4 79.8 76.1
5 100 90.3 87.2 85.3 82.3 79.8 82.0 78.4
150 93.2 91.0 88.2 84.8 81.2 83.9 79.8
50 97.2 95.9 93.2 89.1 87.4 90.8 86.4
10 100 98.4 96.2 94.8 90.2 89.8 92.3 87.2
150 99.6 98.1 96.2 93.4 91.3 94.6 89.4
Table 4: Test set accuracy percentages on the ”100 leaves dataset” of competing boosted classifiers. T: total boosting rounds. Iterations: number of back-propagation passes for training an ANN per boosting round.

In Fig. 5 we compare the rates of convergence of the training set error of the competing algorithms. Fig. 5 bolsters the boosting nature of our proposed algorithm because the training set error rate decreases as the number of boosting rounds increases. It is interesting to note that collaborative algorithms such as SAMA-AdaBoost, Mumbo and MA-AdaBoost perform worse than SAMME at low boosting rounds. The weak learner on each view space in a collaborative algorithm is provided with only a subset of the entire feature space, so at low boosting rounds the weak learners are poorly trained and the overall group performance suffers. Conversely, SAMME is trained on the entire concatenated feature space, and even with few boosting rounds the weak learners of SAMME are superior to those of the collaborative algorithms. With an increase in boosting rounds, the performances of the collaborative algorithms improve relative to the non-collaborative boosting frameworks. It is to be noted that the rate of convergence of the training set error of SAMA-AdaBoost is faster than that of MA-AdaBoost, and this is attributed to the proper localization of the minimum of the loss space by SAMA-AdaBoost. On average, SAMA-AdaBoost outperforms MA-AdaBoost, SAMME, Mumbo, AdaBoost, Zhang et al. and WNS-AdaBoost by margins of 3.8%, 7.8%, 4.3%, 10.2%, 9.8% and 11.2% respectively.

Next, in Table 4 we report the generalization performance of the competing boosted classifiers. Our proposed algorithm achieves a classification accuracy of 99.6% after 10 rounds of boosting with 150 iterations of ANN training per boosting round. The previously reported best result was 99.3% by [3] using probabilistic k-NN. On average, the proposed SAMA-AdaBoost outperforms MA-AdaBoost, Mumbo, SAMME, AdaBoost, Zhang et al. and WNS-AdaBoost by margins of 2.3%, 4.2%, 5.3%, 8.7%, 4.8% and 9.4% respectively.

6.2 Discriminating Between Eye and Non Eye Samples

In this section we evaluate our algorithm on a 2-class visual recognition problem. The task is to discriminate human eye samples from non-eye samples [14]. For simulation purposes we manually extracted 32×32 eye and non-eye templates from randomly chosen human faces from the web. A few examples are shown in Fig. 6. A training example is represented over two view spaces; we utilize the two-view representation as illustrated in [14]. The feature spaces are:

  • Features from SVD-HSV space: 96D

  • Features from SVD-Haar space: 48D

Figure 6: Eye and non eye templates extracted from human face for 2-class classification problem.

Under this 2-view setting we can compare SAMA-AdaBoost with Co-AdaBoost [27], 2-Boost [26] and AdaBoost.Group [29], which support only 2-class, 2-view problems. For simulation purposes we use a 2-layer ANN with 5 units in the hidden layer. Keeping few hidden nodes keeps our baseline hypothesis ‘weak’. In Fig. 7 we compare the classification accuracy rates of the different boosted classifiers.

Figure 7: Comparison of generalization accuracy rate of different ensemble classifiers for human eye classification. T: total boosting rounds. # iterations: number of back propagation trainings per boosting rounds.

Boost-Early refers to boosting on the entire 144-D feature space obtained by concatenating the features of the SVD-Haar and SVD-HSV spaces. Boost-Late refers to separately boosting on each individual feature space and forming the final decision by majority voting. We use a pruned decision tree as the second baseline, on the SVD-Haar space, for 2-Boost. Co-AdaBoost tends to outperform the other 2-class multiview boosting algorithms, and we therefore report Co-AdaBoost’s performance in Fig. 7. We see that, at a fixed number of boosting rounds, the rate of improvement of the accuracy with an increase in the number of ANN training iterations is significantly higher for SAMA-AdaBoost than for the competing algorithms. On average, over ten rounds of boosting at 40 training iterations per round, the accuracy rate of SAMA-AdaBoost is higher than that of MA-AdaBoost, Mumbo, Co-AdaBoost, 2-Boost, AdaBoost.Group, Boost-Early and Boost-Late by 2.3%, 5.1%, 6.2%, 6.4%, 6.9%, 4% and 10.1% respectively. A ROC curve is a plot of the true positive rate (TPR) at a given false positive rate (FPR). It is desirable that an ensemble classifier manifests a high TPR at a low FPR. For a good classifier the area under the ROC curve (AUC) is close to unity.

Algorithm                    T    AUC    F-Score
SAMME [25]                   5    0.87   0.83
Boost-Late                   5    0.82   0.77
Boost-Early                  5    0.85   0.81
WNS [18]                     5    0.83   0.78
Zhang et al. [20]            5    0.86   0.82
Co-AdaBoost [27]             5    0.86   0.84
2-Boost [26]                 5    0.86   0.81
AdaBoost.Group [29]          5    0.83   0.81
Mumbo [30]                   5    0.88   0.87
MA-AdaBoost [1]              5    0.90   0.88
SAMA-AdaBoost (Proposed)     5    0.93   0.91
SAMME                        20   0.91   0.92
Boost-Late                   20   0.88   0.85
Boost-Early                  20   0.92   0.88
WNS                          20   0.90   0.87
Zhang et al.                 20   0.93   0.89
Co-AdaBoost                  20   0.92   0.90
2-Boost                      20   0.90   0.89
AdaBoost.Group               20   0.92   0.91
Mumbo                        20   0.93   0.92
MA-AdaBoost                  20   0.96   0.95
SAMA-AdaBoost (Proposed)     20   0.98   0.97
Table 5: Comparison of the area under the ROC curve (AUC) and F-Score of different boosted classifiers for the eye classification task after various rounds of boosting (T). In each boosting round the baseline ANNs are trained for 40 iterations. The proposed SAMA-AdaBoost yields a higher AUC and F-Score than the competing ensemble classifiers, thereby creating an ensemble space with better generalization capability.

The F-Score is given by:

(54)   F-Score = (2 · Precision · Recall) / (Precision + Recall)

A high precision requirement mandates a compromise on recall and vice versa, so precision or recall alone is not apt for quantifying the performance of a classifier. The F-Score mitigates this difficulty by taking the harmonic mean of precision and recall, and a high F-Score is desirable. From Table 5 we see that, at a given round of boosting, the AUC and F-Score of SAMA-AdaBoost are higher than those of the other competing algorithms. Table 5 bolsters our claim that the ensemble space created by SAMA-AdaBoost fosters a faster rate of convergence of the generalization error than its competing counterparts.
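As a worked illustration of Eq. (54), the helper below computes precision, recall and the F-Score (their harmonic mean) from raw true-positive, false-positive and false-negative counts; the function name and the zero-division guards are ours.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# e.g. precision_recall_f1(tp=90, fp=10, fn=20) -> (0.9, 0.818..., 0.857...)
```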

Finally, in Table 6, we compare the performance of SAMA-AdaBoost with state-of-the-art techniques from other paradigms, namely AlexNet [34], a popular CNN architecture, and SVM-2K [35], a state-of-the-art SVM algorithm for training on two views of a dataset. AlexNet was trained for 50 epochs (the error saturated after this) with a batch size of 100 using stochastic gradient descent. SAMA-AdaBoost was trained for 10 boosting rounds with 40 epochs per round. We see that SAMA-AdaBoost outperforms SVM-2K and yields results comparable to AlexNet. However, the training time of SAMA-AdaBoost is only 13 minutes, compared to 19 and 45 minutes for SVM-2K and AlexNet respectively.

Algorithm Training Time (mins) Accuracy Rate
SAMA-AdaBoost 13 99.1
(Proposed)
AlexNet [34] 45 99.6
SVM-2K [35] 19 95.2
Table 6: Comparison of training time and classification accuracy rate on the eye classification dataset.

6.3 Simulation on UCI Datasets

Comparison of Generalization Accuracy Rates

In this section we evaluate our proposed boosting algorithm on benchmark UCI datasets, which comprise real-world data pertaining to financial credit rating, medical diagnosis, game playing, etc. The details of the eleven datasets chosen for simulation are shown in Table 7. We randomly partition each homogeneous dataset into two subspaces for the multiview algorithms and report the best results. We use a 2-layer ANN with 3 hidden units as the baseline learner on each view. In each boosting round, the ANNs are trained by back-propagation 30 times. We cannot test multiclass datasets such as ‘Glass’, ‘Connect-4’, ‘Car Evaluate’ and ‘Balance’ with [27, 26] and [29] because these algorithms only support 2-class problems. We also report the average training time per boosting round for each dataset using SAMA-AdaBoost with Matlab-2013 on an Intel i5 processor with 4 GB RAM at 3.2 GHz. In Table 8 we report the generalization accuracy rates of the different boosted classifiers after 5, 10 and 20 rounds of boosting.

Dataset # of instances # of attributes # classes
Glass 214 10 7
Connect-4 67557 42 3
Car Evaluate 1728 6 4
Balance Scale 625 4 3
Breast Cancer 699 10 2
Bank Note 1372 5 2
Credit Approval 690 15 2
Heart Disease 303 75 2
Lung Cancer 32 56 2
SPECT Heart 267 22 2
Statlog Heart 270 13 2
Table 7: UCI Datasets selected for simulation purpose.
Algorithms  T  Glass  Connect-4  Car  Balance  Breast  Bank  Credit  Heart  Lung  SPECT  Statlog
SAMME [25] 72.1 68.1 75.4 81.2 80.1 78.2 80.1 69.1 66.5 78.2 72.1
WNS [18] 70.1 68.0 73.5 80.2 77.1 67.2 78.4 68.1 65.9 77.0 70.8
Boost-Late 70.0 67.4 73.0 78.6 76.1 68.2 79.1 68.0 65.0 75.3 70.1
Zhang et al. [20] 73.2 70.4 75.3 80.9 82.1 80.7 80.1 72.3 69.8 80.0 75.4
Co-AdaBoost [27] 5 - - - - 83.1 81.0 81.2 72.0 68.1 81.1 76.0
2-Boost [26] - - - - 84.0 81.3 80.9 73.1 68.0 81.2 77.2
AdaBoost.Group [29] - - - - 83.8 81.0 81.2 72.9 70.8 78.2 75.3
Mumbo [30] 74.3 75.4 74.3 78.2 84.3 75.1 83.2 74.3 75.4 81.2 78.1
MA-AdaBoost [1] 75.1 76.2 75.9 80.8 85.2 82.1 84.1 77.2 78.9 83.1 80.9
SAMA-AdaBoost (Proposed) 75.4 77.4 76.2 80.8 85.6 83.2 84.7 78.0 78.0 83.1 81.2

SAMME 79.8 76.4 81.2 84.2 85.4 84.1 85.7 74.3 74.2 86.3 78.6
WNS 74.3 73.8 77.2 81.2 83.2 78.2 84.2 70.9 71.1 82.3 74.3
Boost-Late 73.3 70.9 77.0 80.6 82.9 76.2 83.2 72.1 71.9 82.9 75.1
Zhang et al. 73.2 70.4 75.3 80.9 85.4 86.4 87.9 77.6 76.9 87.0 81.9
Co-AdaBoost 10 - - - - 83.9 83.4 86.1 73.9 73.9 85.4 78.9
2-Boost - - - - 87.0 87.3 86.7 75.4 73.6 86.1 78.7
AdaBoost.Group - - - - 86.5 84.3 86.9 75.4 74.9 87.5 79.8
Mumbo 81.6 83.2 78.3 80.2 89.3 80.1 89.2 81.7 82.3 85.2 81.9
MA-AdaBoost 86.5 85.9 82.3 86.7 89.2 86.1 89.8 84.2 86.9 88.3 85.9
SAMA-AdaBoost (Proposed) 87.3 87.1 84.3 88.0 91.2 87.2 91.2 87.2 88.1 91.1 87.9

SAMME 91.3 90.9 91.2 92.1 94.3 92.1 90.0 92.8 86.8 93.5 91.9
WNS 90.0 88.8 90.5 91.7 92.6 91.2 89.0 92.0 84.3 92.1 91.0
Boost-Late 90.0 88.0 89.3 91.8 92.1 91.0 87.8 90.0 83.2 90.7 89.9
Zhang et al. 93.2 92.1 92.9 93.5 95.4 94.2 91.0 93.2 89.3 94.3 92.9
Co-AdaBoost 20 - - - - 93.7 92.0 89.8 92.0 85.1 92.1 90.8
2-Boost - - - - 93.0 92.8 90.0 93.5 88.1 94.1 92.1
AdaBoost.Group - - - - 93.2 91.8 90.2 91.9 87.2 92.5 92.0
Mumbo 95.2 94.0 89.2 90.2 98.0 90.2 95.4 94.1 95.8 96.9 95.0
MA-AdaBoost 97.0 95.2 92.9 95.4 98.3 95.4 97.8 95.8 97.2 98.0 96.1
SAMA-AdaBoost (Proposed) 98.3 97.3 94.3 96.5 99.1 95.2 99.0 97.4 98.1 99.2 97.9
Table 8: Comparison of generalization accuracy rates on the selected UCI datasets by different ensemble classifiers after various rounds (T) of boosting. In the majority of instances the proposed SAMA-AdaBoost achieves higher accuracy rates than the competing algorithms. We can compare against [27, 26, 29] only on datasets involving 2-class classification.

We can see from Table 8 that our proposed SAMA-AdaBoost outperforms the competing boosted classifiers in the majority of instances. It is interesting to note that, although Mumbo performs comparably to SAMA-AdaBoost on most datasets, the performance of Mumbo degrades on the ‘Balance’, ‘Car’ and ‘Bank’ datasets. These datasets are represented over a very low dimensional feature space, and disintegrating this low dimensional feature space into two subspaces fails to provide Mumbo with a ‘strong’ view. As mentioned before, the success of Mumbo depends on the presence of a ‘strong’ view which is aided by ‘weak’ views. Co-AdaBoost and 2-Boost offer comparable performance on the datasets and tend to outperform SAMME in the majority of instances. The performance of WNS is slightly worse than SAMME because WNS boosts on a subset of the entire sample space without any correction factor to compensate for the reduced cardinality of the sample space. Zhang et al., however, incorporate such a correction factor, and their performance is usually superior to that of SAMME.

Kappa-Error diversity analysis

It is desirable that the individual members of an ideal ensemble classifier be highly accurate and, at the same time, that the members disagree with each other on a majority of instances [36]. So there is a trade-off between the accuracy and the diversity of an ensemble classifier space. The Kappa-Error diagram [37] is a visualization of the error-diversity pattern of an ensemble classifier space. For any two members of the ensemble space we compute the average of their generalization error rates and the degree of agreement between them. Define a coincidence matrix C such that entry C(i, j) denotes the number of examples classified by the first member to class i and by the second member to class j. The kappa agreement coefficient is then defined as:

(55)   κ = (θ1 − θ2) / (1 − θ2),   where   θ1 = Σ_i C(i, i) / m   and   θ2 = Σ_i [ (Σ_j C(i, j) / m) · (Σ_j C(j, i) / m) ]

where the indices run over the total number of classes and m is the total number of examples classified by the pair. κ = 1 signifies that the two members agree on all instances, κ = 0 means that they agree only as much as expected by chance, while κ < 0 signifies agreement less than expected by chance. A Kappa-Error diagram is a scatter plot of the average pairwise error versus κ for all pairwise combinations of ensemble members. Ideally, the scatter cloud should be centered near the lower-left portion of the graph. Fig. 8 shows the Kappa-Error plots on three UCI datasets at different levels of labeling noise. We randomly perturb a certain fraction of the training labels and train the classifiers on the artificially tampered datasets. The plots are for classifiers trained over 15 rounds of boosting with 40 iterations of ANN training per round, yielding a large number of pairwise combinations of member learners. In Fig. 8 we plot only the centroids of the scatter clouds of the different classifiers because the scatter clouds are highly overlapping. Fig. 8 reveals some interesting observations.

Figure 8: Kappa-Error diversity plots on UCI datasets for different boosted classifiers. For every possible pairwise combination of member hypotheses within an ensemble space we calculate the kappa agreement coefficient and the mean generalization error rate. X axis: centroid of the kappa values. Y axis: centroid of the mean error values. Noise level indicates the fraction of original training labels that were perturbed before training the classifiers.

1. Scatter clouds of the proposed SAMA-AdaBoost usually occupy the lowermost regions of the plots. This signifies that the average misclassification errors of members within the SAMA-AdaBoost ensemble space are lower than those of the competing ensemble spaces. The presence of such accurate members within SAMA-AdaBoost’s ensemble space aids its classification prowess.

2. The scatter clouds of Mumbo on the ‘Bank Note’ dataset tend to sit higher than on most other datasets. A relatively high position in a Kappa-Error plot signifies an ensemble space consisting mainly of inaccurate members. This observation also explains the degraded performance of Mumbo on the ‘Bank Note’ dataset reported in Table 8.

3. The addition of labeling noise shifts the error clouds to the left, thereby enhancing diversity among the members. Simultaneously, the average error rates of the ensemble spaces also increase; this observation again highlights the error-diversity trade-off.

4. The upward shift of the error clouds of SAMA-AdaBoost due to the addition of labeling noise is relatively small compared to that of the other ensemble spaces. Thus, SAMA-AdaBoost is more immune to labeling noise.

5. WNS-Boost is the most affected by labeling noise, as indicated by its error clouds occupying the topmost positions in the plots.

6. Zhang et al. introduced a sampling correction factor to account for training boosted classifiers on a subset of the original sample space. The correction factor aids them in achieving better generalization capability than SAMME and, as expected, much better capability than WNS-Boost, which lacks such a correction factor.
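For completeness, the sketch below computes the pairwise kappa agreement coefficient of Eq. (55) from the coincidence matrix C defined above; this is the standard computation used to build kappa-error diagrams, with variable names of our choosing.

```python
import numpy as np

def kappa(C):
    """C: (K, K) coincidence matrix of two classifiers over the same examples."""
    C = np.asarray(C, dtype=float)
    m = C.sum()                                                  # total number of examples
    observed = np.trace(C) / m                                   # fraction on which the pair agrees
    expected = np.sum(C.sum(axis=1) * C.sum(axis=0)) / (m * m)   # agreement expected by chance
    return (observed - expected) / (1.0 - expected)

# kappa == 1 -> full agreement; kappa == 0 -> agreement expected by chance;
# kappa  < 0 -> less agreement than expected by chance.
```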

6.4 Performance on MNIST dataset

In this section we evaluate our algorithm on the well-known MNIST handwritten character recognition dataset, which consists of 60,000 training and 10,000 test images. For multiscale feature extraction we follow the procedures of [38]. Initially, images are resized to 28×28. Next we extract a three-level hierarchy of Histogram of Oriented Gradients (HOG) features with 50% block overlap. The respective block sizes at each level are 4×4, 7×7 and 14×14, and the corresponding feature dimensions of the levels are 1564, 484 and 124. The features from each level serve as a separate view space for our algorithm. In Table 9 we compare the performance of SAMA-AdaBoost with MA-AdaBoost, SAMME, Mumbo and Boost-Early. We use a single-hidden-layer neural network whose number of hidden nodes scales with the feature dimensionality. In each boosting round, a network is trained for 30 epochs.
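A sketch of the multiscale HOG view construction is given below, using skimage's hog(). The cell sizes mimic the three block scales mentioned (4×4, 7×7 and 14×14), but the orientation count, block normalization and overlap handling are assumptions, so the resulting feature dimensionalities need not match the 1564/484/124-D figures quoted above.

```python
import numpy as np
from skimage.feature import hog

def mnist_hog_views(images):
    """images: (n, 28, 28) array of grayscale digits; returns one feature matrix per scale."""
    views = []
    for cell in [(4, 4), (7, 7), (14, 14)]:        # finest to coarsest scale
        feats = [hog(img, orientations=9, pixels_per_cell=cell,
                     cells_per_block=(2, 2)) for img in images]
        views.append(np.vstack(feats))             # each scale acts as a separate view space
    return views
```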

T    SAMA-AdaBoost (Proposed)   MA-AdaBoost [1]   SAMME [25]   Mumbo [6]   Boost-Early
5    2.03                       2.20              2.38         2.35        2.41
10   1.10                       1.21              1.49         1.39        1.52
20   0.80                       0.88              1.10         1.02        1.17
Table 9: Test set error rates on the MNIST dataset. T: number of boosting rounds. For reference, an error rate of 0.7 is achieved using AlexNet [34] after 200 epochs.

It can be seen that the proposed SAMA-AdaBoost fosters faster convergence of the generalization error rate. This observation further bolsters our thesis that multiview collaborative boosting is a prudent paradigm for multi-feature-space learning.

7 Computational Complexity

In this section we present a brief analysis of the computational complexity of the proposed SAMA-AdaBoost with a neural network as the base learner. The analysis is based on the findings of [39]. In Table 10 we list the network-specific variables used in the complexity analysis.

Symbol Representation
Cardinality of the training space
Total number of layers
Layer index
Total number of activation nodes in a layer
Number of nodes in the output layer
Total number of weights
Dimensionality of the residual Jacobian
Weight connecting a node of one layer with a node of the next layer
Activation of a node of a layer for a given example
Activation function of a node in a layer
Cost of calculating all activations
Cost of calculating all derivatives
Weight matrix connecting two consecutive layers
Table 10: Parameters of the neural network.

We identify the key steps in both the feed-forward and backward passes and analyze their complexity individually. Refer to [39] for a detailed explanation.

7.1 Feed Forward

Step 1: Complexity of Feeding Inputs to a Node
The cumulative input to a node of a layer, for a given training example, is given by:

(56)

Step 2: Non Linear Activation of Node
A node of a layer is activated, for a given training example, as:

(57)

Step 3: Output Error Evaluation
With the ground-truth label of a training example, the squared error loss is defined as:

(58)
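The three feed-forward steps above can be summarized in a short sketch for a fully connected network, with comments noting the per-layer cost of each step. Layer handling and the squared-error loss follow the generic description; the bias-free formulation and the tanh activation are our simplifying assumptions.

```python
import numpy as np

def feed_forward(x, weights, activation=np.tanh):
    """weights: list of (n_prev, n_curr) matrices; x: input vector for one example."""
    a = x
    for W in weights:
        net = a @ W                    # Step 1: cumulative inputs, O(n_prev * n_curr) multiply-adds
        a = activation(net)            # Step 2: non-linear activation, O(n_curr)
    return a

def squared_error(output, target):
    return 0.5 * np.sum((target - output) ** 2)   # Step 3: output error for one example
```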

7.2 Backward Pass

Step 4: Node Sensitivity Evaluation
At an output node, the sensitivity for a training example is given by: