Forward Stagewise Additive Model for Collaborative Multiview Boosting
Abstract
Multiview-assisted learning has gained significant attention in recent years in the supervised learning genre. The availability of high-performance computing devices enables learning algorithms to search simultaneously over multiple views or feature spaces to obtain optimum classification performance. This paper is a pioneering attempt at formulating a mathematical foundation for realizing a multiview-aided collaborative boosting architecture for multiclass classification. Most present algorithms apply multiview learning heuristically, without exploring the fundamental mathematical changes imposed on traditional boosting; moreover, most are restricted to two-class or two-view settings. Our proposed mathematical framework enables collaborative boosting across any finite number of view spaces for multiclass learning. The boosting framework is based on a forward stagewise additive model which minimizes a novel exponential loss function. We show that this exponential loss function captures the difficulty of a training sample instead of the traditional '1/0' loss. The new algorithm restricts a weak view from over-learning and thereby prevents overfitting. The model is inspired by our earlier attempt [1] at collaborative boosting, which was devoid of mathematical justification. The proposed algorithm is shown to converge much nearer to the global minimum of the exponential loss space and thus supersedes our previous algorithm. The paper also presents analytical and numerical analyses of convergence and margin bounds for multiview boosting algorithms, and we show that the proposed ensemble learner manifests a lower error bound and higher margin compared to our previous model. The proposed model is also compared with traditional boosting and recent multiview boosting algorithms. In the majority of instances the new algorithm manifests a faster rate of convergence of training set error while simultaneously offering better generalization performance.
Kappa-error diagram analysis reveals the robustness of the proposed boosting framework to labeling noise.
Keywords: multiview learning, AdaBoost, collaborative learning, kappa-error diagram, neural net ensemble
1 Introduction
Multiview supervised learning has gained significant attention among machine learning practitioners in recent times. On today's Big Data platforms it is quite common for a single learning objective to be represented over multiple feature spaces. To appreciate this, consider the KDD Network Intrusion Challenge [2], in which domain experts identified four major variants of network intrusion and characterized them over three feature spaces, viz. TCP components, content features and traffic features. Another motivating example is the '100 Leaves Dataset' [3], where the objective is to classify one hundred classes of leaves, each leaf characterized by shape, margin and texture features. Multiview representation of the objective function is also common in other disciplines such as drug discovery [4], medical image processing [5], dialogue classification [6], etc.
One intuitive method is to combine all the features and then train a classifier on a reduced-dimensional feature space. But dimensionality reduction has its own demerits. Features are usually engineered by experts and each feature has its own physical significance; projecting the features onto a reduced-dimensional space usually obscures this physical interpretation. Another problem is that subtle features are lost during the projection process, and such features have been shown to foster better discriminative capability in the presence of noisy data [7, 8]. Training by the above method is sometimes referred to as Early Fusion. Another paradigm of multiview learning is Late Fusion, where the objective is to learn classifiers separately on each feature space and finally conglomerate the classifiers by majority voting [9]. The major issue is that neither of these paradigms incorporates collaborative learning across views. We feel it is an interesting strategy to communicate classification performance across views and to model the weight distribution over the sample space according to this communication. Also, the performance of fusion techniques is problem-specific, and thus the optimum fusion strategy is unknown a priori [10].
Multiview learning has been an established genre of research in semi-supervised learning, where manual annotation labor is reduced by stochastically learning over labeled and unlabeled training examples. Query-by-committee [11] and co-training [12] were the two pioneering efforts in this direction. For these algorithms, the objective function is represented over two mutually independent and sufficient view spaces. Independent classifiers are trained on each view space using the small number of labeled examples, and the remaining unlabeled instances are annotated by iterative majority voting of the classifiers trained on the two views. More recently, co-training by committee [13] obviated the constraint of mutual orthogonality of the views. The significant success of multiview learning in semi-supervised learning has been the primary motivation of our work.
The paper presents the following notable contributions:

To the best of our knowledge, this is the pioneering attempt at formulating an additive-model-based mathematical framework for multiview collaborative boosting. It is to be noted that the primary significance of our current work is to mathematically bolster our previous attempt at multiview learning, MA-AdaBoost [1], which was based on intuitive cues.

Stagewise modeling of boosting requires a loss function, and in this regard we propose a novel exponential multiview weighted loss function to grade the degree of 'difficulty' of an example. Using this loss function, we were able to derive a multiview weight update criterion similar to the one used in [1]; this signifies the aptness of our present analytical approach and the correctness of our previous intuitive modeling.

We devise a two-step optimization framework which converges much nearer to the global minimum of the proposed exponential loss space compared to our previous attempt, MA-AdaBoost.

Analytical expressions are derived for upper-bounding training set error and margin distribution under the multiview boosting setting. We numerically study the variations of these bounds and show that the proposed framework is superior to MA-AdaBoost.

Extensive simulations are performed on challenging datasets such as 100-Leaves [3], eye classification [14], MNIST handwritten character recognition and 11 different real-world datasets from the UCI database [15]. We compare our model with traditional and state-of-the-art multiclass boosting algorithms.

Kappa-error visualization is studied to manifest the robustness of the proposed SAMA-AdaBoost to labeling noise.
The rest of the paper is organized as follows. Section II gives a brief overview of traditional AdaBoost and its variants. Section III presents some recent works on multiview boosting algorithms and explains how our work addresses some of the shortcomings of existing algorithms. Section IV formally describes our collaborative boosting framework, followed by convergence and margin analysis in Section V. Experimental analysis is presented in Section VI. Finally, Section VII concludes the paper with a brief discussion and future extensions of the proposed work.
2 Brief Overview on Adaptive Boosting
In this section we present a brief overview of the traditional adaptive boosting algorithm [16] and recent variants of AdaBoost. We also discuss some of the mathematical viewpoints which bolster the principle of AdaBoost. Suppose we are provided with a training set {(x_1, y_1), ..., (x_N, y_N)}, where x_i denotes a d-dimensional input variable and y_i is the class label. The fundamental concept of AdaBoost is to formulate a weak classifier in each round of boosting and ultimately conglomerate the weak classifiers into a superior metaclassifier. AdaBoost initially maintains a uniform weight distribution over the training set and builds a weak classifier. For the next boosting round, weights of misclassified examples are increased while weights of correctly classified examples are reduced. Such a modified weight distribution aids the next weak classifier to focus more on misclassified examples, and the process continues iteratively. The final classifier is formed by a linear weighted combination of the weak classifiers. AdaBoost.MH [17] is usually used for multiclass classification using a 'one-versus-all' strategy.
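The reweighting loop described above can be sketched in a few lines (a minimal binary AdaBoost with decision stumps; the stump learner, variable names and toy data are illustrative, not from the paper):

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively pick the (feature, threshold, polarity) stump with
    least weighted error.  Labels y are in {-1, +1}."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, j] <= thr, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, (j, thr, pol))
    return best

def adaboost(X, y, T=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # uniform initial weights
    ensemble = []
    for _ in range(T):
        err, (j, thr, pol) = fit_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # clamp away from 0 and 1
        beta = 0.5 * np.log((1 - err) / err)   # classifier weight
        pred = pol * np.where(X[:, j] <= thr, 1, -1)
        w *= np.exp(-beta * y * pred)          # up-weight the mistakes
        w /= w.sum()                           # renormalize the distribution
        ensemble.append((beta, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    """Linear weighted combination of the weak stumps."""
    agg = sum(b * p * np.where(X[:, j] <= t, 1, -1)
              for b, j, t, p in ensemble)
    return np.sign(agg)
```

On linearly separable toy data a few rounds suffice for the voted ensemble to fit the training set exactly.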
Traditional AdaBoost has undergone a plethora of modifications owing to active interest in the machine learning community. WNS-Boost [18] uses a weighted novelty sampling algorithm to extract the most discriminative subset from the training sample set; the algorithm then runs AdaBoost on the reduced sample space and thereby enhances training speed with minimal loss of accuracy. SampleBoost [19] aims to handle early termination of multiclass AdaBoost and to destabilize weak learners which repeatedly misclassify the same set of training examples. Zhang et al. [20] introduce a correction factor in the reweighting scheme of traditional AdaBoost for enhanced generalization accuracy.
Researchers have used margin analysis theory [21, 22] to explain the working principle of AdaBoost. Another viewpoint is functional gradient descent [23]. A more recent way of explaining AdaBoost is as a forward stagewise additive model which minimizes an exponential loss function [24]. Inspired by the model in [24], Zhu et al. proposed SAMME [25] for multiclass boosting using a Fisher-consistent exponential loss function.
We wish to acknowledge that [24, 25] have been instrumental in our thought process for the proposed algorithm, but we differ in several aspects. To the best of our knowledge, this is the first attempt to formulate a mathematical model for multiview boosting using stagewise modeling. Also, the existing mathematical frameworks which explain boosting lack the scope of scalable collaborative learning.
3 Related Works on Multiview Boosting
The current work is motivated by our previous successful attempt at multiview-assisted adaptive boosting, MA-AdaBoost [1]. MA-AdaBoost is the first attempt to grade the difficulty of a training example instead of using the traditional '1/0' loss usually practiced in the boosting genre. We have successfully used MA-AdaBoost in computer vision applications [14] and on other real-world datasets. But MA-AdaBoost is primarily based on heuristics. The objective of this paper is to understand and enhance the performance of MA-AdaBoost by formulating a thorough mathematical justification.
Recently, researchers have proposed different algorithms for group-based learning. 2-Boost [26] and Co-AdaBoost [27] are closely related to each other. Both algorithms maintain a single weight distribution over the feature spaces, and the weight update depends on ensemble performance. Our algorithm differs considerably from these two. The proposed algorithm is scalable to any finite-dimensional view and label space, while 2-Boost and Co-AdaBoost are restricted to the two-class, two-view setting. 2-Boost additionally requires that the two views be learnt by different baseline algorithms. Moreover, these two algorithms formulate the final hypothesis by majority voting; in contrast, our model uses a novel scheme of reward-penalty based voting. ShareBoost [28] shares some similarities with 2-Boost, except that after each round of boosting ShareBoost discards all weak learners except the globally optimum learner (the classifier with least weighted error).
AdaBoost.Group [29] was proposed for group-based learning, in which the authors assumed that the sample space can be categorized into discriminative groups. Boosting was performed at the group level, and independent classifiers were trained by maximizing the F-score on individual views. The final classifier was reported using majority voting over all the groups. Separately training independent classifiers inhibits AdaBoost.Group from inter-view collaboration. Also, optimizing classifiers over each local view space does not guarantee an optimal final global classifier.
Mumbo is an elegant example of a multiview-assisted boosting algorithm [30, 31]. The fundamental idea of Mumbo is to remove an arduous example from the view space of weak learners and simultaneously increase the weight of that example in the view space of strong learners. A variant of Mumbo has been used by Kwak et al. [5] for tissue segmentation. Mumbo maintains a cost matrix on each view space, whose entries represent the cost of classifying a training example of one class into another class on that view. The total space requirement for Mumbo is therefore proportional to the product of the number of views, the number of classes and the number of training samples. Such a space requirement is debatable in the case of large datasets; our proposed algorithm is free of such requirements. Moreover, Mumbo requires that at least one view be 'strong', aided by the other 'weak' views. Selecting a strong view for a large dataset is not a trivial task. Our proposed algorithm adaptively assigns importance to a view space at run time, so end users need not manually specify a strong view.
4 Collaborative Boosting Framework
In this section we formally introduce our proposed framework, the stagewise additive multiview-assisted boosting algorithm SAMA-AdaBoost. We consider the most general case where an example is represented over V total views, with its corresponding class label drawn from K classes.
4.1 Formulation of Exponential Loss Function
If an example belongs to class c, we assign a corresponding label vector Y with K-1 zeroes and a '1' in the c-th element. We denote the weak hypothesis vector learnt on view space v after t boosting rounds as h_v^t, whose p-th element gives the response for class p. Before delving into the formulation of the exponential loss function, we need to preprocess the hypothesis vectors. Specifically, each element of the hypothesis vector is modified by the following equation;
(1) 
where the indicator function equals 1 only when its argument is 0, and is zero elsewhere. The first part of Eq.(1) triggers for misclassified vectors, because in the case of misclassification the corresponding term is 0. The second function in the first part of Eq.(1) is triggered only if the corresponding entry of either the label or hypothesis vector is '1', and in those cases the power term transforms the elements to '1'. The second part of Eq.(1) triggers for a correctly classified vector but keeps the vector intact. Table 1 delineates the above transformation process, where we consider an example belonging to class 1.
Transformed :  Transformed :  
1  1  1  0  1 
0  0  0  0  0 
0  0  0  1  1 
0  0  0  0  0 
.  .  .  .  . 
.  .  .  .  . 
We define an exponential loss function
(2) 
where V is the total number of feature spaces or views. From Table 1 we see that the loss takes its smaller value for a correct classification vector and its larger value otherwise. If an example is misclassified by the weak learners on k of the V views, then
(3) 
(4) 
We argue that the term in Eq.(4) manifests the difficulty of an example. Weak classifiers over all views are trying to learn the same example, so it makes sense to judge its difficulty in terms of the total number of misclassified views and to incorporate this graded difficulty into the loss function which will eventually govern the boosting network.
4.2 Forward Stagewise Model for SAMAAdaBoost
In this section we present a forward stagewise additive model to explain the working principle of our proposed SAMA-AdaBoost. We opt for a greedy approach where in each step we optimize one more weak classifier and add it to the existing ensemble space. Specifically, the approach can be viewed as stagewise learning of additive models [24] with the initial ensemble space as the null space. We define F_v^T as the additive model learnt over T boosting rounds on a particular view space v:
(5) $F_v^T(x) = \sum_{t=1}^{T} \beta_t\, h_v^t(x)$
where β_t denotes the learning rate of round t. Our goal is to learn the metamodel G^T, which represents the overall additive model learnt over T boosting rounds on all V views.
(6) $G^T(x) = \sum_{v=1}^{V} F_v^T(x)$
So, after any arbitrary (boosting rounds) and (total number of views) we can write,
(7) 
(8) 
(9) 
The first part of Eq.(9) represents the part of our model which has already been learnt, and hence we cannot modify it. Our aim is to optimize the second part of Eq.(9). Here we make use of our proposed exponential loss function of Eq.(2). The solution for the next best set of weak classifiers and the learning rate at boosting round t can be written as:
(10) 
(11) 
where,
(12) 
where h_v^t is the local hypothesis vector at round t on view space v and N is the number of training examples. The quantity w_i^t can be considered as the weight of example x_i at stage t of boosting. Since w_i^t depends only on the ensemble learnt up to the previous round, it is a constant for the optimization problem at the t-th iteration. Following Eq.(12) we can write,
(13) 
(14) 
Eq.(14) is the weight update rule for our proposed SAMA-AdaBoost algorithm. Specifically, if an example has been misclassified on k of the V views, then following the steps of Eq.(4) it can easily be shown that the weight update rule is given by,
(15) 
We now return to our optimization objective as stated in Eq.(11). For simplicity we consider,
(16) 
For illustration, suppose that an example x_i is misclassified on k of the V views. Considering Eq.(16) only for x_i, we get
(17) 
(18) 
In general if we consider this approach for all then we can rewrite Eq.(16) as follows,
(19) 
Note that the second summation is identically zero because its index runs over the weak learners which have correctly classified the example. Thus, Eq.(19) reduces to,
(20) 
To minimize the loss in Eq.(20), we need a set of weak learners for which the exponential weighted error is minimal. Thus the optimal set consists of the weak learners which manifest the least possible exponential weighted error given by Eq.(20). With this optimal set of weak learners, we now aim to evaluate the optimum value of the learning rate β_t. Rewriting Eq.(20) we get,
(21) 
Differentiating w.r.t. β_t and setting the derivative to zero yields,
(22) 
We solve numerically for β_t by optimizing Eq.(21); the exact procedure is illustrated in the next section. Thus, the selected weak learners and β_t represent the optimum set of weak learners and learning rate to be added to the additive model at the t-th iteration.
4.3 Implementation of SAMAAdaBoost
In the previous subsection we presented the mathematical framework of the multiview-assisted forward stagewise additive model of the proposed SAMA-AdaBoost. We now explain the steps for implementing SAMA-AdaBoost for any real-life classification task.
Initial parameters

Training examples (x_i, y_i), i = 1, ..., N;

V: total view/feature spaces

h_v^t: weak hypothesis on view space v at boosting round t

T: total boosting rounds

Initial weight distribution: w_i^1 = 1/N
Communication across views and grading difficulty of training example
After a boosting round t, weak learners across views share their classification results. Let an example x_i be misclassified over k of the V views. Following the arguments of Eq.(4), the difficulty of x_i at boosting round t is asserted by,
(23) 
Weight update rule

The learning rate is set to the value β_t which optimizes Eq.(21) after t boosting rounds.

Weight update rule
(24)
It is noteworthy that if k = V (the example is misclassified on every view), then Eq.(24) reduces to,
(25) $w_i^{t+1} = w_i^t\, e^{\beta_t}$
which is the usual weight update rule of traditional AdaBoost when x_i has been misclassified. Similarly, when k = 0,
(26) $w_i^{t+1} = w_i^t\, e^{-\beta_t}$
which is the usual weight update rule of traditional AdaBoost when x_i has been correctly classified. Thus our proposed algorithm is a generalization of AdaBoost and aids in asserting the degree of difficulty of the sample space instead of the '1/0' loss. The proposed weight update rule thus helps the learning algorithm dynamically assign more importance to a relatively 'tougher' misclassified example than to an 'easier' one.
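A minimal sketch of this generalized update, assuming the natural linear interpolation between the two limits above (k = V recovers e^{β} and k = 0 recovers e^{-β}); the exact form of Eq.(24) in the paper may differ:

```python
import math

def multiview_weight_update(w, k, V, beta):
    """Update the weight of one example misclassified on k of its V views.

    Assumed interpolation (not the paper's exact Eq.(24)): the exponent
    scales linearly from -beta (k = 0, all views correct) to +beta
    (k = V, all views wrong), so 'tougher' examples end up with larger
    weights than 'easier' mistakes.
    """
    return w * math.exp(beta * (2.0 * k / V - 1.0))
```

The update is monotone in k, so an example misclassified on more views is always re-weighted more aggressively.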
Fitness measure of local weak learners
We first determine the fitness of a local weak learner h_v^t.

Define the set of training examples correctly classified by h_v^t such that,
(27) 
The correct classification rate of h_v^t is given by,
(28)
We argue that the correct classification rate alone is not an appropriate fitness metric for h_v^t. We found during experiments that there can be a weak learner whose correct classification rate is low but which tends to correctly classify 'tougher' examples. So, the fitness of h_v^t should be evaluated not only on its classification rate but also on the difficulty of the sample space which it correctly classifies.

The reward of h_v^t is determined as follows,
(29) 

Finally, the fitness of h_v^t is given by,
(30) 
Eq.(30) highly rewards classifiers which correctly classify 'tougher' examples with high confidence, while heavily penalizing weak learners which misclassify 'easier' examples with high conviction. Steps 2-4 are repeated for T boosting rounds.
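The fitness computation above can be sketched as follows. This is an illustrative stand-in: it blends the correct classification rate with the weight mass of the correctly classified ('tougher') examples, mirroring the ingredients of Eqs.(28)-(30) without claiming their exact form:

```python
import numpy as np

def fitness(pred, y, w):
    """Score a weak learner by accuracy *and* difficulty of what it gets right.

    pred, y : predicted and true labels; w : current example weights
    (heavier weight = 'tougher' example).  Illustrative formula, not the
    paper's exact Eq.(30).
    """
    correct = (pred == y)
    rate = correct.mean()                 # plain correct-classification rate
    reward = w[correct].sum() / w.sum()   # weighted share of 'tough' catches
    return rate * reward
```

Two learners with identical accuracy are thus ranked by which one handles the harder (more heavily weighted) examples.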
Conglomerating local weak learners

SAMA-AdaBoost.V1: In this version the final metaclassifier is given by,
(31) 
where round(x) represents the nearest integer to x.

SAMA-AdaBoost.V2: In this version the final metaclassifier is given by,
(32) 
where the voting weight is the prediction confidence for class p.
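A sketch of the final conglomeration step in the spirit of the '.V2' rule: per-view class confidences are summed and the argmax class is returned (the precise confidence weighting of Eq.(32) is assumed, not reproduced):

```python
import numpy as np

def metaclassify(view_scores):
    """view_scores: list of length-K arrays, one per view, each holding
    that view ensemble's confidence for the K classes.  The metaclassifier
    sums the confidences across views and predicts the argmax class."""
    total = np.sum(view_scores, axis=0)
    return int(np.argmax(total))
```

Note that a view with low confidence contributes little to the final vote, which is the intent of reward-penalty based voting as opposed to flat majority voting.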
5 Study on Convergence Properties
5.1 Error Bound on Training Set
In this section we derive an analytical expression which upper-bounds the training set error of multiview boosting; later we empirically compare the bounds of SAMA-AdaBoost and MA-AdaBoost at different levels of boosting. Without loss of generality, the analysis is performed on binary classification, and we consider simpler versions of SAMA- and MA-AdaBoost which fuse weak multiview learners by simple majority voting instead of reward-penalty based voting. The motivation for the second simplification is to appreciate the differences in the core boosting mechanisms of the competing paradigms. The final boosted classifier learnt on V views after T boosting rounds is given by,
(33) 
We define as,
(34) 
A normalized version of weight update rule for SAMAAdaBoost can be written as,
(35) 
where the normalization factor Z_t is given by,
(36) 
The recursive nature of Eq.(35) enables us to write the final weight on x_i as,
(37) 
(38) 
(39) 
(40) 
Now, the training set error incurred by the final classifier can be represented as,
(41) 
(42) 
(43) 
(44) 
Eq.(44) provides an upper bound for multiview boosting paradigms such as SAMA- and MA-AdaBoost. It is to be remembered that though Eq.(44) holds for both SAMA-AdaBoost and MA-AdaBoost, the weight updates, and thus the normalization factors Z_t, are different for the two algorithms. In Table 2 we report the upper bounds calculated for SAMA-AdaBoost and MA-AdaBoost at different levels of boosting on the eye classification task (refer to Section 6.2 for dataset and implementation details). A lower error bound is an indication that the ensemble has learnt the training set and is less susceptible to training set misclassification. The exact values in Table 2 are not important, but the scales of the magnitudes are worth noticing. We see that the ensemble space of the proposed SAMA-AdaBoost learns much faster than MA-AdaBoost: its error bound decreases aggressively with each round of boosting. The error bound for SAMA-AdaBoost suffers a steep drop of several orders of magnitude when training is increased from 15 to 20 rounds of boosting; in contrast, the error bound of MA-AdaBoost reduces insignificantly. Table 2 is a strong indication that the systematic optimization of SAMA-AdaBoost's loss function fosters a faster convergence rate on the training set.
Boosting Rounds: T  SAMA-AdaBoost (Proposed)  MA-AdaBoost

5  7.1*  7.6* 
10  3.2*  0.9* 
15  8.3*  4.2* 
20  5.0*  1.8* 
25  2.5*  1.1* 
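The bound of Eq.(44) has the familiar product-of-normalizers form: the training error of the voted classifier never exceeds the product of the per-round normalization factors Z_t. The following sketch verifies this numerically for the standard single-view reweighting (illustrative; the multiview normalizers differ only through the weight update):

```python
import numpy as np

def boost_and_bound(margins_per_round):
    """Each entry of margins_per_round is an array of y_i * h_t(x_i) in
    {-1, +1} for one weak learner.  Returns the training error of the
    weighted majority vote and the product of the per-round normalizers
    Z_t, which upper-bounds that error."""
    n = len(margins_per_round[0])
    w = np.full(n, 1.0 / n)
    agg = np.zeros(n)                        # running weighted vote margin
    Z_prod = 1.0
    for yh in margins_per_round:
        err = w[yh < 0].sum()                # weighted error of this learner
        beta = 0.5 * np.log((1 - err) / err)  # assumes 0 < err < 1
        agg += beta * yh
        u = w * np.exp(-beta * yh)
        Z = u.sum()                          # per-round normalizer Z_t
        Z_prod *= Z
        w = u / Z
    return np.mean(agg <= 0), Z_prod
```

With three weak learners that each err on a different example, the voted classifier is perfect while the bound is a strict fraction below one.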
5.2 Generalization Error and Margin Distribution Analysis
Visualizing Margin Distribution
For understanding the generalization property of boosting, training set performance reveals only a part of the explanation. It has been shown in [22] that more confidence on the training set explicitly improves generalization performance. Frequently, the margin on the training set is taken as the metric of confidence of a boosted ensemble. In the context of boosting, margin is defined as follows. Suppose the final boosted classifier is a convex combination of base/weak learners. The weight assigned to a particular class for a training example is the sum of the convex weights of the base learners voting for that class. The margin of an example is the difference between the weight assigned to its correct label and the highest weight assigned to any incorrect label; the margin therefore spans the range [-1, +1]. It is easy to see that for a correctly classified example the margin is positive, while it is negative in case of misclassification. A significantly high positive margin manifests greater confidence of prediction. It has been shown in [21, 22] that for high generalization accuracy it is mandatory to have a minimal fraction of training examples with small margin. Margin distribution graphs are usually studied in this regard: a margin distribution graph plots the fraction of training examples with margin at most θ as a function of θ. In Fig. 1 we analyze the margin distribution graphs of SAMA-AdaBoost and MA-AdaBoost, using the same simulation setup on the 100-Leaves classification task as discussed in Section 6.1.
Consistently, we find that the margin distribution graph of SAMA-AdaBoost lies below that of MA-AdaBoost. Such a distribution means that for any given margin θ, SAMA-AdaBoost tends to have fewer examples with margin at most θ than MA-AdaBoost. This makes the ensemble space of SAMA-AdaBoost more confident on the training set, which in turn manifests as superior performance on the test set.
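The margin statistic and the margin distribution graph described above are straightforward to compute for a multiclass voted ensemble (a sketch, assuming vote weights are normalized to sum to one per example):

```python
import numpy as np

def margins(class_votes, y):
    """class_votes: (n, K) array of per-class vote weights, each row
    summing to 1; y: true labels.  Margin = vote on the true class minus
    the largest vote on any wrong class, so each margin lies in [-1, +1]."""
    n = class_votes.shape[0]
    true_v = class_votes[np.arange(n), y]
    masked = class_votes.astype(float).copy()
    masked[np.arange(n), y] = -np.inf        # exclude the true class
    return true_v - masked.max(axis=1)

def margin_cdf(m, theta):
    """Fraction of training examples with margin at most theta
    (the y-value of the margin distribution graph at theta)."""
    return np.mean(m <= theta)
```

Evaluating margin_cdf over a grid of theta values reproduces the margin distribution graph of Fig. 1.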
Bound on Margin Distribution
In this section we provide an analytical expression (in a similar spirit to [22]) for the upper bound of the margin distribution of an ensemble space created by SAMA-AdaBoost or MA-AdaBoost. Later, we show through numerical simulations that boosting inherently decreases the fraction of training examples with low margin as the number of boosting rounds increases. Let the instance and label spaces be denoted as usual, with training examples generated according to some unknown but fixed distribution over their product, and let the training set consist of ordered pairs chosen according to that same distribution. We write P[·] for the probability of an event when an example is drawn at random as above and, under unambiguous context, use the empirical and true probabilities interchangeably; E[·] refers to the corresponding expected value. The final classifier is defined as the convex combination of the boosted base learners.
(45) 
Given a margin threshold θ, we are interested in an upper bound on,
(46) 
If we assume the margin is at most θ, it implies,
(47) 
(48) 
(49) 
(50) 
(51) 
By an argument similar to that leading to Eq.(40), we can write,
(52) 
Eq.(52) gives an upper bound on the probability of sampling a training example with margin at most θ. Intuitively, we want this probability to be small because that aids margin maximization. In Fig. 2 we illustrate the variation of this bound at different levels of boosting for SAMA-AdaBoost and MA-AdaBoost on the eye classification dataset (refer to Section 6.2). In the figure, the bound represents the upper bound on the probability of sampling a training example with margin at most θ. For both algorithms, at a given number of boosting rounds T, we observe that the bound decreases as θ decreases, indicating that the ensembles discourage training examples with low margin. Also, for a given θ, the bound decreases as T increases; increasing the rounds of boosting implicitly reduces the prevalence of low-margin examples. A significant observation is that the decay rate of the upper bound with T is appreciably higher for SAMA-AdaBoost than for MA-AdaBoost, both after 5 and after 25 rounds of boosting. The analysis of this section thereby bolsters our claim that the learning rate of the proposed SAMA-AdaBoost algorithm is much faster than MA-AdaBoost's. The ensemble space of SAMA-AdaBoost manifests a significantly lower probability of possessing low-margin examples than that of MA-AdaBoost, which guarantees better generalization capability. Empirical results in Section 6.2 will further strengthen this claim.
6 Experimental Analysis
In this section we compare our proposed SAMA-AdaBoost on challenging real-world datasets with recent state-of-the-art collaborative boosting algorithms and with variants of non-collaborative traditional boosting. It has been shown in [1] that the '.V2' version of MA-AdaBoost performs slightly better than the '.V1' version, and thus here we present results using SAMA-AdaBoost.V2 and MA-AdaBoost.V2.
6.1 100 Leaves Dataset [3]
This is a challenging dataset where the task is to classify 100 classes of leaves based on shape, margin and texture features. Each feature space is 16-dimensional, with 16 examples per class. Such a heterogeneous feature set is well suited to any multiview learning algorithm. For simulation we have taken a 2-layer ANN with 5 units in the hidden layer as the baseline learner in each boosting round over each view space. The dataset is randomly shuffled and then split 60:20:20 for training, validation and testing respectively. The regularization parameter is selected by 5-fold validation. In Fig. 3 we pictorially represent the setting of our proposed multiview learning framework.
Determining the optimum value of the learning rate
In this section we illustrate the procedure for computing the optimum β_t which minimizes the loss in Eq.(21). It has been shown by Schapire [32] that the exponential loss incorporated in AdaBoost is strictly convex and devoid of local minima. In Fig. 4 we plot the loss of Eq.(21) versus β. As can be seen, the functional variation of the loss w.r.t. β is indeed convex, and thus we apply gradient descent and select the value of β for which the gradient of the function is close to zero. The absence of local minima guarantees that we converge near the (single) global minimum. In [1], we naively evaluated the learning rate as,
(53) 
We mark the optimum locations evaluated by our algorithm with red stars in Fig. 4. We also mark the corresponding optimal points (green rhombi) evaluated using our previously proposed MA-AdaBoost [1]; it is evident that MA-AdaBoost fails to attain the global minimum. In Table 3 we report the ratio of the optimum indicated by SAMA-AdaBoost and by MA-AdaBoost to the actual global minimum of the loss space. After two, five and ten rounds of boosting, the average ratios for SAMA-AdaBoost are 1.07, 1.02 and 1.03 respectively, while the average ratios for MA-AdaBoost are 2.9, 1.6 and 3.5 respectively. We thus argue that MA-AdaBoost fails to localize the global minimum by a significant margin compared to SAMA-AdaBoost; as a consequence, SAMA-AdaBoost has a faster training set error convergence rate, as we shall see shortly. A similar dependency is observed on other datasets.
T  Iterations  SAMA-AdaBoost (Proposed)  MA-AdaBoost [1]

50  1.05  4.00  
2  100  1.14  2.32 
150  1.02  1.30  
50  1.06  1.21  
5  100  1.01  1.45 
150  1.02  2.81  
50  1.01  4.12  
10  100  1.05  2.52 
150  1.08  1.26 
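Because the loss is strictly convex in β, a simple one-dimensional gradient descent suffices. The sketch below uses an illustrative loss of the same shape, L(β) = a·e^β + b·e^{-β} with a the misclassified weight mass and b the correctly classified mass (our stand-in for Eq.(21), not the paper's exact expression); its closed-form minimum 0.5·ln(b/a) lets us check the descent:

```python
import math

def optimize_beta(a, b, lr=0.05, tol=1e-8, max_iter=10000):
    """Minimize L(beta) = a*exp(beta) + b*exp(-beta) by gradient descent.

    Convexity (a, b > 0) guarantees a single minimum, analytically at
    beta* = 0.5 * ln(b / a); the descent stops once the gradient
    magnitude falls below tol.
    """
    beta = 0.0
    for _ in range(max_iter):
        grad = a * math.exp(beta) - b * math.exp(-beta)  # dL/dbeta
        beta -= lr * grad
        if abs(grad) < tol:
            break
    return beta
```

For a = 0.2, b = 0.8 the descent settles at β ≈ 0.5·ln(4), matching the closed-form minimizer, which mirrors the convergence to the single minimum claimed for Eq.(21).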
Comparison of classification performances
In this section we report the training and generalization performances of several boosted classifiers. For comparison with other boosting algorithms with ANN as baseline, we used the boosting framework proposed in [33]. For comparison with [20] we have taken the sample fraction = 0.5 and the correction factor equal to 4 as indicated by the authors. We cannot compare our results with [27, 26] because these algorithms only support 2-class problems.
T  Iterations  SAMA-AdaBoost (Proposed)  MA-AdaBoost [1]  Mumbo [6]  SAMME [25]  AdaBoost [17]  Zhang [20]  WNS [18]

50  74.3  71.2  71.0  75.1  68.1  76.1  67.1  
2  100  75.3  73.4  72.4  76.8  70.2  77.2  69.2 
150  76.9  75.4  73.9  77.4  71.2  78.2  70.4  
50  86.2  83.1  82.8  80.4  77.4  79.8  76.1  
5  100  90.3  87.2  85.3  82.3  79.8  82.0  78.4 
150  93.2  91.0  88.2  84.8  81.2  83.9  79.8  
50  97.2  95.9  93.2  89.1  87.4  90.8  86.4  
10  100  98.4  96.2  94.8  90.2  89.8  92.3  87.2 
150  99.6  98.1  96.2  93.4  91.3  94.6  89.4 
In Fig. 5 we compare the rate of convergence of training set error for the competing algorithms. Fig. 5 bolsters the boosting nature of our proposed algorithm, as the training set error decreases with the number of boosting rounds. It is interesting to note that collaborative algorithms such as SAMA-AdaBoost, Mumbo and MA-AdaBoost perform worse than SAMME at low boosting rounds. The weak learner on each view space in a collaborative algorithm is provided with only a subset of the entire feature space, so at low boosting rounds the weak learners are poorly trained and the overall group performance suffers. Conversely, SAMME is trained on the entire concatenated feature space, and even with few boosting rounds its weak learners are superior to those of the collaborative algorithms. With increasing boosting rounds, the performances of the collaborative algorithms improve relative to the non-collaborative boosting frameworks. It is to be noted that the convergence of training set error of SAMA-AdaBoost is faster than MA-AdaBoost's, which is attributed to proper localization of the minimum of the loss space by SAMA-AdaBoost. On average, SAMA-AdaBoost outperforms MA-AdaBoost, SAMME, Mumbo, AdaBoost, Zhang et al. and WNS-AdaBoost by margins of 3.8%, 7.8%, 4.3%, 10.2%, 9.8% and 11.2% respectively.
Next, in Table 2 we report the generalization accuracy rates of the competing boosted classifiers. Our proposed algorithm achieves a classification accuracy rate of 99.6% after 10 rounds of boosting with 150 iterations of ANN training per boosting round. The previously reported best result was 99.3% by [3] using probabilistic kNN. On average, the proposed SAMA-AdaBoost outperforms MA-AdaBoost, Mumbo, SAMME, AdaBoost, Zhang et al. and WNS-AdaBoost by margins of 2.3%, 4.2%, 5.3%, 8.7%, 4.8% and 9.4% respectively.
6.2 Discriminating Between Eye and Non-Eye Samples
In this section we compare our algorithm on a 2-class visual recognition problem. The task is to discriminate human eye samples from non-eye samples [14]. For simulation purposes we manually extracted 32×32 eye and non-eye templates from randomly chosen human faces from the web. A few examples are shown in Fig. 6. A training example is represented over two view spaces; we utilize the two-view representation illustrated in [14]. The feature spaces are:

- Features from the SVD-HSV space: 96-D
- Features from the SVD-Haar space: 48-D
Under this 2-view setting we can compare SAMA-AdaBoost with Co-AdaBoost [27], 2-Boost [26] and AdaBoost.Group [29], which support only 2-class, 2-view problems. For simulation purposes we use a 2-layer ANN with 5 units in the hidden layer. Keeping few hidden nodes makes our baseline hypothesis 'weak'. In Fig. 7 we compare the classification accuracy rates of the different boosted classifiers.
BoostEarly refers to boosting on the entire 144-D feature space formed by concatenating the features of the SVD-Haar and SVD-HSV spaces. BoostLate refers to boosting separately on each feature space, with the final decision taken by majority voting. We use a pruned decision tree as another baseline on the SVD-Haar space for 2-Boost. Co-AdaBoost tends to outperform the other 2-class multiview boosting algorithms, and we therefore report Co-AdaBoost's performance in Fig. 7. We see that at a fixed number of boosting rounds T, the rate of improvement of accuracy with increasing ANN training iterations is significantly higher for SAMA-AdaBoost than for the competing algorithms. On average, over ten rounds of boosting at 40 training iterations per round, the accuracy rate of SAMA-AdaBoost is higher than that of MA-AdaBoost, Mumbo, Co-AdaBoost, 2-Boost, AdaBoost.Group, BoostEarly and BoostLate by 2.3%, 5.1%, 6.2%, 6.4%, 6.9%, 4% and 10.1% respectively. A ROC curve is a plot of true positive rate (TPR) against false positive rate (FPR). It is desirable that an ensemble classifier manifests a high TPR at a low FPR; for a good classifier, the area under the ROC curve (AUC) is close to unity.
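As an illustrative sketch (not from the paper's implementation), the ROC quantities just defined can be computed by sweeping a decision threshold over classifier scores; the toy scores and labels below are made up:

```python
# Illustrative sketch: ROC curve and AUC from classifier scores.

def roc_points(scores, labels):
    """Return (fpr, tpr) lists for thresholds at every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by descending score; lowering the threshold admits one sample at a time.
    pairs = sorted(zip(scores, labels), reverse=True)
    fpr, tpr, tp, fp = [0.0], [0.0], 0, 0
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
               for i in range(len(fpr) - 1))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # hypothetical ensemble outputs
labels = [1,   1,   0,   1,   0,   0]    # 1 = eye, 0 = non-eye
f, t = roc_points(scores, labels)
print(round(auc(f, t), 3))  # → 0.889
```

A perfect ranking of positives above negatives would yield an AUC of exactly 1, matching the "close to unity" criterion above.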
Algorithm                  T   AUC   F-Score
SAMME [25]                 5   0.87  0.83
BoostLate                  5   0.82  0.77
BoostEarly                 5   0.85  0.81
WNS [18]                   5   0.83  0.78
Zhang et al. [20]          5   0.86  0.82
Co-AdaBoost [27]           5   0.86  0.84
2-Boost [26]               5   0.86  0.81
AdaBoost.Group [29]        5   0.83  0.81
Mumbo [30]                 5   0.88  0.87
MA-AdaBoost [1]            5   0.90  0.88
SAMA-AdaBoost (Proposed)   5   0.93  0.91
SAMME                      20  0.91  0.92
BoostLate                  20  0.88  0.85
BoostEarly                 20  0.92  0.88
WNS                        20  0.90  0.87
Zhang et al.               20  0.93  0.89
Co-AdaBoost                20  0.92  0.90
2-Boost                    20  0.90  0.89
AdaBoost.Group             20  0.92  0.91
Mumbo                      20  0.93  0.92
MA-AdaBoost                20  0.96  0.95
SAMA-AdaBoost (Proposed)   20  0.98  0.97
F-Score, $F$, is given by,

$F = \dfrac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$    (54)
A high precision requirement mandates that we compromise on recall and vice versa; thus precision or recall alone is not apt for quantifying the performance of a classifier. F-Score mitigates this difficulty by taking the harmonic mean of precision and recall, and it is desirable for a classifier to attain a high F-Score. From Table 5 we see that at a given round of boosting, both the AUC and the F-Score of SAMA-AdaBoost are higher than those of the competing algorithms. Table 5 bolsters our claim that the ensemble space created by SAMA-AdaBoost fosters a faster rate of convergence of generalization error than its competing counterparts.
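The harmonic-mean behavior of Eq. (54) can be sketched directly from confusion-matrix counts; the counts below are hypothetical:

```python
# Toy sketch of precision, recall and F-Score (harmonic mean), as in Eq. (54).

def f_score(tp, fp, fn):
    precision = tp / (tp + fp)   # fraction of predicted positives that are correct
    recall = tp / (tp + fn)      # fraction of true positives that are recovered
    return 2 * precision * recall / (precision + recall)

# e.g. 8 true positives, 2 false positives, 4 false negatives:
print(round(f_score(8, 2, 4), 3))  # → 0.727
```

Because the harmonic mean is dominated by the smaller of the two quantities, a classifier cannot obtain a high F-Score by inflating precision at the cost of recall, or vice versa.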
Finally, in Table LABEL:table_alexsvm, we compare the performance of SAMA-AdaBoost with state-of-the-art techniques from other paradigms, such as AlexNet [34].
6.3 Simulation on UCI Datasets
Comparison of Generalization Accuracy Rates
In this section we evaluate our proposed boosting algorithm on benchmark UCI datasets, which comprise real world data pertaining to financial credit rating, medical diagnosis, game playing, etc. The details of the eleven datasets chosen for simulation are shown in Table 7. We randomly partition each homogeneous dataset into two feature subspaces for the multiview algorithms and report the best results. We use a 2-layer ANN with 3 hidden units as the baseline learner on each view. In each boosting round, the ANNs are trained by back propagation for 30 iterations. We cannot test multiclass datasets such as 'Glass', 'Connect-4', 'Car Evaluate' and 'Balance' with [27, 26] and [29] because these algorithms only support 2-class problems. We also report the average training time per boosting round for each dataset using SAMA-AdaBoost, measured in MATLAB 2013 on an Intel i5 processor @ 3.2 GHz with 4 GB RAM. In Table 8 we report the generalization accuracy rates of the different boosted classifiers after T = 5, 10 and 20 rounds of boosting.
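A minimal sketch of the random two-view partition described above (the helper name and fixed seed are hypothetical; the paper does not specify the splitting code):

```python
# Hypothetical sketch: shuffle the feature indices of a homogeneous dataset
# and split them in half to form two disjoint "views".
import random

def split_views(num_features, seed=0):
    idx = list(range(num_features))
    random.Random(seed).shuffle(idx)     # seeded so the partition is reproducible
    half = num_features // 2
    return idx[:half], idx[half:]

view1, view2 = split_views(10)
assert not set(view1) & set(view2)                # views are disjoint
assert sorted(view1 + view2) == list(range(10))  # and jointly cover all features
```

Each view's index list would then select the corresponding columns of the data matrix for its per-view weak learner.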
Dataset          # instances  # attributes  # classes
Glass            214          10            7
Connect-4        67557        42            3
Car Evaluate     1728         6             4
Balance Scale    625          4             3
Breast Cancer    699          10            2
Bank Note        1372         5             2
Credit Approval  690          15            2
Heart Disease    303          75            2
Lung Cancer      32           56            2
SPECT Heart      267          22            2
Statlog Heart    270          13            2
Algorithm                 T   Glass  Connect-4  Car   Balance  Breast  Bank  Credit  Heart  Lung  SPECT  Statlog
SAMME [25]                5   72.1   68.1       75.4  81.2     80.1    78.2  80.1    69.1   66.5  78.2   72.1
WNS [18]                  5   70.1   68.0       73.5  80.2     77.1    67.2  78.4    68.1   65.9  77.0   70.8
BoostLate                 5   70.0   67.4       73.0  78.6     76.1    68.2  79.1    68.0   65.0  75.3   70.1
Zhang et al. [20]         5   73.2   70.4       75.3  80.9     82.1    80.7  80.1    72.3   69.8  80.0   75.4
Co-AdaBoost [27]          5   -      -          -     -        83.1    81.0  81.2    72.0   68.1  81.1   76.0
2-Boost [26]              5   -      -          -     -        84.0    81.3  80.9    73.1   68.0  81.2   77.2
AdaBoost.Group [29]       5   -      -          -     -        83.8    81.0  81.2    72.9   70.8  78.2   75.3
Mumbo [30]                5   74.3   75.4       74.3  78.2     84.3    75.1  83.2    74.3   75.4  81.2   78.1
MA-AdaBoost [1]           5   75.1   76.2       75.9  80.8     85.2    82.1  84.1    77.2   78.9  83.1   80.9
SAMA-AdaBoost (Proposed)  5   75.4   77.4       76.2  80.8     85.6    83.2  84.7    78.0   78.0  83.1   81.2
SAMME                     10  79.8   76.4       81.2  84.2     85.4    84.1  85.7    74.3   74.2  86.3   78.6
WNS                       10  74.3   73.8       77.2  81.2     83.2    78.2  84.2    70.9   71.1  82.3   74.3
BoostLate                 10  73.3   70.9       77.0  80.6     82.9    76.2  83.2    72.1   71.9  82.9   75.1
Zhang et al.              10  73.2   70.4       75.3  80.9     85.4    86.4  87.9    77.6   76.9  87.0   81.9
Co-AdaBoost               10  -      -          -     -        83.9    83.4  86.1    73.9   73.9  85.4   78.9
2-Boost                   10  -      -          -     -        87.0    87.3  86.7    75.4   73.6  86.1   78.7
AdaBoost.Group            10  -      -          -     -        86.5    84.3  86.9    75.4   74.9  87.5   79.8
Mumbo                     10  81.6   83.2       78.3  80.2     89.3    80.1  89.2    81.7   82.3  85.2   81.9
MA-AdaBoost               10  86.5   85.9       82.3  86.7     89.2    86.1  89.8    84.2   86.9  88.3   85.9
SAMA-AdaBoost (Proposed)  10  87.3   87.1       84.3  88.0     91.2    87.2  91.2    87.2   88.1  91.1   87.9
SAMME                     20  91.3   90.9       91.2  92.1     94.3    92.1  90.0    92.8   86.8  93.5   91.9
WNS                       20  90.0   88.8       90.5  91.7     92.6    91.2  89.0    92.0   84.3  92.1   91.0
BoostLate                 20  90.0   88.0       89.3  91.8     92.1    91.0  87.8    90.0   83.2  90.7   89.9
Zhang et al.              20  93.2   92.1       92.9  93.5     95.4    94.2  91.0    93.2   89.3  94.3   92.9
Co-AdaBoost               20  -      -          -     -        93.7    92.0  89.8    92.0   85.1  92.1   90.8
2-Boost                   20  -      -          -     -        93.0    92.8  90.0    93.5   88.1  94.1   92.1
AdaBoost.Group            20  -      -          -     -        93.2    91.8  90.2    91.9   87.2  92.5   92.0
Mumbo                     20  95.2   94.0       89.2  90.2     98.0    90.2  95.4    94.1   95.8  96.9   95.0
MA-AdaBoost               20  97.0   95.2       92.9  95.4     98.3    95.4  97.8    95.8   97.2  98.0   96.1
SAMA-AdaBoost (Proposed)  20  98.3   97.3       94.3  96.5     99.1    95.2  99.0    97.4   98.1  99.2   97.9
We can see from Table 8 that our proposed SAMA-AdaBoost outperforms the competing boosted classifiers in the majority of instances. It is interesting to note that although Mumbo performs comparably to SAMA-AdaBoost on most datasets, its performance degrades on the 'Balance', 'Car' and 'Bank' datasets. These datasets are represented over a very low dimensional feature space, and disintegrating such a space into two subspaces fails to provide Mumbo with a 'strong' view. As mentioned before, the success of Mumbo depends on the presence of a 'strong' view aided by 'weak' views. Co-AdaBoost and 2-Boost offer comparable performance on the datasets and tend to outperform SAMME in the majority of instances. The performance of WNS is slightly worse than SAMME because WNS boosts on a subset of the entire sample space without any correction factor to compensate for the reduced cardinality of the sample space. Zhang et al., in contrast, incorporate the correction factor, and their performance is usually superior to SAMME.
Kappa-Error Diversity Analysis
It is desirable that the individual members of an ideal ensemble classifier be highly accurate while at the same time disagreeing with each other in the majority of instances [36]. So there is a tradeoff between accuracy and diversity of an ensemble classifier space. The Kappa-Error diagram [37] is a visualization of the error-diversity pattern of an ensemble classifier space. For any two members $h_i$ and $h_j$ of the ensemble space, $E_{i,j}$ represents the average generalization error rate of $h_i$ and $h_j$, and $\kappa_{i,j}$ denotes the degree of agreement between $h_i$ and $h_j$. Define a coincidence matrix $C$ such that $C_{p,q}$ denotes the number of examples classified by $h_i$ and $h_j$ to classes $p$ and $q$ respectively. The kappa agreement coefficient is then defined as,

$\kappa_{i,j} = \dfrac{\Theta_1 - \Theta_2}{1 - \Theta_2}$, with $\Theta_1 = \dfrac{1}{N}\sum_{p=1}^{m} C_{p,p}$ and $\Theta_2 = \sum_{p=1}^{m}\Big(\dfrac{1}{N}\sum_{q=1}^{m} C_{p,q}\Big)\Big(\dfrac{1}{N}\sum_{q=1}^{m} C_{q,p}\Big)$    (55)

where $m$ is the total number of classes and $N$ is the total number of examples. $\kappa = 1$ signifies that $h_i$ and $h_j$ agree on all instances, $\kappa = 0$ means $h_i$ and $h_j$ agree by chance, while $\kappa < 0$ signifies agreement less than expected by chance. The Kappa-Error diagram is a scatter plot of $E_{i,j}$ (vertical axis) versus $\kappa_{i,j}$ (horizontal axis) for all pairwise combinations of $h_i$ and $h_j$. Ideally, the scatter cloud should be centered near the lower left portion of the graph. Fig. 8 shows the Kappa-Error plots on three UCI datasets at different levels of labeling noise. We randomly perturb a certain fraction of the training labels and train the classifiers on the artificially tampered datasets. The plots are for classifiers trained over 15 rounds of boosting with 40 iterations of ANN training per round, giving all pairwise combinations of member learners. In Fig. 8 we plot only the centroids of the scatter clouds of the different classifiers because the clouds are highly overlapping. Fig. 8 reveals some interesting observations.
1: Scatter clouds of the proposed SAMA-AdaBoost usually occupy the lowermost regions of the plots. This signifies that the average misclassification error of the members within the SAMA-AdaBoost ensemble space is lower than in the competing ensemble spaces; the presence of such veracious members aids SAMA-AdaBoost's enhanced classification prowess.
2: The scatter clouds of Mumbo on the 'Bank Note' dataset tend to lie higher than on the majority of other datasets. A relatively high position in a Kappa-Error plot signifies an ensemble space consisting mainly of incorrect members. This observation also explains the degraded performance of Mumbo on the 'Bank Note' dataset reported in Table 8.
3: Addition of labeling noise shifts the error clouds to the left, thereby enhancing diversity among the members. Simultaneously, the average error rates of the ensemble spaces also increase; this observation again highlights the error-diversity tradeoff.
4: The upward shift of the error clouds of SAMA-AdaBoost due to labeling noise is relatively small compared to the error clouds of the other ensemble spaces. Thus, SAMA-AdaBoost is more immune to labeling noise.
5: WNS-Boost is most affected by labeling noise, as indicated by its error clouds occupying the topmost positions in the plots.
6: Zhang et al. introduced a sampling correction factor to account for training boosted classifiers on a subset of the original sample space. The correction factor helps them achieve better generalization than SAMME, and much better than WNS-Boost, which lacks such a correction factor.
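As an illustrative sketch (not the authors' code), the kappa coefficient of Eq. (55) can be computed from a coincidence matrix as follows; the example matrices are hypothetical:

```python
# Sketch of the kappa agreement coefficient of Eq. (55), where C[p][q] counts
# examples assigned to class p by one ensemble member and class q by the other.

def kappa(C):
    m = len(C)                                   # number of classes
    n = sum(sum(row) for row in C)               # total number of examples
    theta1 = sum(C[p][p] for p in range(m)) / n  # observed agreement
    theta2 = sum((sum(C[p]) / n) * (sum(C[q][p] for q in range(m)) / n)
                 for p in range(m))              # agreement expected by chance
    return (theta1 - theta2) / (1 - theta2)

# Perfect agreement on a 2-class problem gives kappa = 1,
# while chance-level agreement gives kappa = 0:
print(kappa([[5, 0], [0, 5]]))    # → 1.0
print(kappa([[25, 25], [25, 25]]))  # → 0.0
```

Pairing this kappa value with the members' average error rate yields one point of the Kappa-Error scatter plot described above.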
6.4 Performance on MNIST dataset
In this section we compare our algorithm on the well known MNIST handwritten character recognition dataset, which consists of 60,000 training and 10,000 test images. For multiscale feature extraction we follow the procedures of [38]. Initially, images are resized to 28×28. Next, we extract a three level hierarchy of Histogram of Oriented Gradients (HOG) features with 50% block overlap. The respective block sizes at each level are 4×4, 7×7 and 14×14, and the corresponding feature dimensions of the levels are 1564, 484 and 124. The features from each level serve as a separate view space for our algorithm. In Table 9 we compare the performance of SAMA-AdaBoost with MA-AdaBoost, SAMME, Mumbo and BoostEarly. We use a single hidden layer neural network whose number of hidden nodes is set according to the feature dimensionality of the view. In each boosting round, a network is trained for 30 epochs.
T   SAMA-AdaBoost (Proposed)  MA-AdaBoost [1]  SAMME [25]  Mumbo [6]  BoostEarly
5   2.03                      2.20             2.38        2.35       2.41
10  1.10                      1.21             1.49        1.39       1.52
20  0.80*                     0.88             1.10        1.02       1.17

* For comparison, we achieve an error rate of 0.7 using AlexNet [34] after 200 epochs.
It can be seen that the proposed SAMA-AdaBoost fosters faster convergence of the generalization error rate. This observation further bolsters our thesis that multiview collaborative boosting is a prudent paradigm for learning over multiple feature spaces.
7 Computational Complexity
In this section we present a brief analysis of the computational complexity of the proposed SAMA-AdaBoost with a neural network as base learner. The analysis is based on the findings of [39]. In Table 10 we list the network-specific variables used in the complexity analysis.
Symbol           Representation
$N$              cardinality of the training space
$L$              total number of layers
$l$              layer number
$n_l$            total activation nodes of layer $l$
$n_L$            number of nodes in the output layer
$W$              total number of weights
$J$              dimensionality of the residual Jacobian
$w_{ij}^l$       weight connection between node $i$ (layer $l-1$) and node $j$ (layer $l$)
$a_j^l(x_n)$     activation of node $j$ of layer $l$ for example $x_n$
$f_j^l$          activation function of node $j$ in layer $l$
$A$              cost of calculating all activations
$A'$             cost of calculating all derivatives
$W^l$            weight matrix connecting layer $l-1$ with layer $l$; total elements = $n_{l-1} \cdot n_l$
We identify the key steps in both the feed-forward and backward passes and analyze the complexity of each individually. Refer to [39] for a detailed explanation.
7.1 Feed Forward
Step 1: Complexity of Feeding Inputs to a Node
The cumulative input to node $j$ of layer $l$ for example $x_n$ is given by:

$z_j^l(x_n) = \sum_{i=1}^{n_{l-1}} w_{ij}^l \, a_i^{l-1}(x_n)$    (56)
Step 2: Nonlinear Activation of a Node
Node $j$ of layer $l$ is activated for $x_n$ as:

$a_j^l(x_n) = f_j^l\big(z_j^l(x_n)\big)$    (57)
Step 3: Output Error Evaluation
With $y(x_n)$ as the ground truth label for $x_n$, the squared error loss is defined as:

$E = \dfrac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{n_L} \big(y_k(x_n) - a_k^L(x_n)\big)^2$    (58)
7.2 Backward Pass
Step 4: Node Sensitivity Evaluation
At an output node $k$, the sensitivity $\delta_k^L$ for example $x_n$ is given by: