SPOCC: Scalable POssibilistic Classifier Combination - toward robust aggregation of classifiers

Abstract

We investigate a problem in which each member of a group of learners is trained separately to solve the same classification task. Each learner has access to its own training dataset (which may overlap with those of other learners), and each trained classifier can be evaluated on a common validation dataset.

We propose a new approach to aggregate the learner predictions in the possibility theory framework. For each classifier prediction, we build a possibility distribution assessing how likely it is that the classifier prediction is correct, using frequentist probabilities estimated on the validation set. The possibility distributions are aggregated using an adaptive t-norm that can accommodate dependency and poor accuracy of the classifier predictions. We prove that the proposed approach possesses a number of desirable classifier combination robustness properties. Moreover, the method is agnostic to the base learners, scales well with the number of aggregated classifiers and is incremental, as a new classifier can be appended to the ensemble by building upon previously computed parameters and structures. A python implementation can be downloaded at this link.

keywords: robust classifier combination, agnostic aggregation, information fusion, classification, possibility theory

1 Introduction

Classification is a supervised machine learning task consisting of assigning objects (inputs) to discrete categories (classes). When several predictors have been trained to solve the same classification task, a second-level algorithmic procedure is necessary to reconcile the classifier predictions and deliver a single one. Such a procedure is known as classifier combination, fusion or aggregation. When each individual classifier is trained using the same training algorithm (but under different circumstances), the aggregation procedure is referred to as an ensemble method. When each classifier may be generated by a different training algorithm, the aggregation procedure is referred to as a multiple classifier system. In both cases, the set of individual classifiers is called a classifier ensemble.

Classifier combination either results from a choice of the practitioner or is imposed by context. In the first case, combination is meant to improve classification performance by either increasing the learning capacity or mitigating overfitting. For instance, boosting [43] and bagging [7] can be regarded as such approaches. In the second case, it is not possible to learn a single classifier. A typical situation of this kind occurs when the dataset is distributed across several machines in a network and sequential learning (such as mini-batch gradient descent) is not possible, either to limit network load or for privacy or intellectual property reasons. In this decentralized learning setting, a set of classifiers is trained locally and the ensemble is later aggregated by a meta-learner.

In this article, we address classifier aggregation in a perspective that is in line with the decentralized setting, assuming that the meta-learner has access to evaluations of the individual classifiers on a fraction of each local dataset which is not used for training. We do not make any assumption on the base learner models and we do not assume base learners are trained on i.i.d. samples; however, it is assumed that the union of the fractions of (local) datasets used by the meta-learner is an i.i.d. sample. We introduce a number of desirable robustness properties for the aggregation procedure in this context. We investigate fault tolerance (the ability to discard classifiers whose predictions are noise), robustness to adversarial classifiers (the ability to thwart classifiers with abnormal error rates) and robustness to redundant information (when classifier predictions are highly dependent).

We introduce an aggregation procedure in the framework of possibility theory. We prove that the stated robustness properties are verified asymptotically (when the size of the validation set is large) for this new approach. The mechanism governing the aggregation essentially relies on estimates of probabilities of class labels, classifier predictions or classifier correct predictions. There are many related works [23, 28, 30] dealing with classifier combination using similar information. We believe we are the first to do so in the framework of possibility theory; more importantly, the works referenced above come with no proven robustness guarantees. An asymptotic optimality property is verified by an approach from [3]. This property is stronger than most of the properties that we state, except for robustness to classifier dependency. Also, two technical conditions are necessary for that property to hold, while our results require no such conditions. Similar remarks hold w.r.t. [6], which shares some ideas with [3]. Another piece of work with strong properties (oracle inequalities) is exponential weight aggregation [40], but the properties are non-exact and hold in expectation or with high probability, while our properties rely on almost sure convergence. Also, exponential weight aggregation is a linear combination model while our method is non-linear.

In addition, the form of aggregation robustness achieved by our method does not jeopardize other important aspects of aggregation such as scalability and incrementalism, two aspects that [3] fails to possess. Our approach relies on a parametric model involving a number of parameters that is linear in the number of classifiers. Incrementalism is also preserved in the sense that a new classifier can be appended to the ensemble without re-estimating the previously obtained parameter values or structures.

In the next section, we recall the classifier aggregation problem and formally define the robustness properties that we seek. In section 3, we introduce a new aggregation technique in the framework of possibility theory and we show that the desired properties hold asymptotically for this technique. Section 4 contains numerical experiments illustrating our results.

2 Problem statement

2.1 Classification

Let $\mathcal{Y} = \{c_1, \dots, c_\ell\}$ denote a set of $\ell$ class labels. Let $\mathbf{x}$ denote an input example with $d$ entries. Most of the time, $\mathbf{x}$ is a vector and lives in $\mathbb{R}^d$, but sometimes some of its entries are categorical data and $\mathbf{x}$ lives in an abstract space which does not necessarily have a vector space structure. Without loss of generality, we suppose that $\mathbf{x}$ is a vector in the rest of this article.

A classification task consists in determining a prediction function $h$ that maps any input $\mathbf{x}$ to its actual class $y$. This function is obtained from a training set $\mathcal{T}$ which contains pairs $(\mathbf{x}_i, y_i)$ where $y_i$ is the class label of example $\mathbf{x}_i$. Given $m$ classifiers $h_1, \dots, h_m$ (each of them trained by one base learner), the label assigned by the $k$-th classifier to the input $\mathbf{x}$ is denoted by $h_k(\mathbf{x})$.

From a statistical point of view, training examples are instances of a random vector $X$ whose distribution is unknown. Likewise, class labels are instances of a random variable $Y$ whose distribution is also unknown. The training set is often alleged to contain i.i.d. samples of the joint distribution of $(X, Y)$.

2.2 Classifier performance estimates

The ultimate goal of machine learning is to obtain predictors that generalize well (w.r.t. data unseen at training time). Mathematically speaking, this means achieving the lowest possible expected loss between predictions and true values. When misclassification errors do not have different costs, the 0-1 loss function is the standard choice:

$L\big(y, h(\mathbf{x})\big) = \mathbb{1}\big(y \neq h(\mathbf{x})\big).$

In this case, the expected loss is the misclassification error rate of $h$. It is well known that the error rate minimizer is the Bayes classifier $h^*$:

$h^*(\mathbf{x}) = \underset{c \in \mathcal{Y}}{\arg\max}\; p(Y = c \mid X = \mathbf{x}).$

Obviously, since the conditional distributions of $Y$ given $X = \mathbf{x}$ are unknown, we must try to find proxies of the Bayes classifier. The error rate (or risk) of classifier $h_k$ is denoted by $\epsilon_k$.

Although our goal is to achieve the lowest possible error rate, the performances of a classifier are not, in general, constant across true class labels and predicted ones. This finer grained information will be instrumental to elicit our possibilistic ensemble of classifiers. This information is contained in the confusion matrix $\mathbf{C}^{(k)}$ of classifier $h_k$. Each entry of this matrix reads

$\mathbf{C}^{(k)}_{ij} = \sum_{(\mathbf{x}, y) \in \mathcal{V}} \mathbb{1}(y = c_i)\, \mathbb{1}\big(h_k(\mathbf{x}) = c_j\big), \qquad (1)$

where $\mathbb{1}$ denotes the indicator function. It is important to compute the confusion matrices using a validation set $\mathcal{V}$ disjoint from the training set, otherwise the estimates drawn from the matrix are biased. Actually, if $n_v$ is the size of the validation set, then $\mathbf{C}^{(k)}_{ij} / n_v$ is the maximum likelihood estimate of the joint probability $p\big(Y = c_i, h_k(X) = c_j\big)$. Also, the sum of the non-diagonal entries of $\mathbf{C}^{(k)}$ over $n_v$ is an unbiased estimate of the error rate of $h_k$. Many other performance criterion estimates can be derived from a confusion matrix.

The classifier combination that we introduce in section 3 essentially relies on the information contained in those matrices. Computing those matrices can thus be regarded as the training phase of the combination method.
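
For illustration purposes, this estimation step can be sketched in a few lines of Python (the helper below is ours, not part of the released implementation; NumPy is assumed available and labels are assumed integer-encoded):

import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # C[i, j] counts validation examples whose true label is i and whose
    # prediction is j, as in (1); labels are integers in 0..n_classes-1.
    C = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1.0
    return C

# C / len(y_true) then estimates the joint distribution of (Y, h_k(X)), and
# 1.0 - np.trace(C) / len(y_true) estimates the error rate of classifier k.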

2.3 Agnostic combination of classifiers and position of the problem

Let $H(X) = \big(h_1(X), \dots, h_m(X)\big)$ denote the random vector spanned by plugging $X$ into the ensemble of classifiers. A realization of this random vector is denoted by $\mathbf{e}$, or $\mathbf{e}(\mathbf{x})$ whenever the dependence on inputs must be made explicit. We place ourselves in the context where the vectors $\mathbf{e}$ can be pictured as a new (learned) representation of inputs and we must be agnostic, i.e. we have no control on the base classifiers. In this context, the best aggregate classifier [3] based on observed data is thus

$h^*_{\mathrm{agg}}(\mathbf{x}) = \underset{c \in \mathcal{Y}}{\arg\max}\; p\big(Y = c \mid H(X) = \mathbf{e}(\mathbf{x})\big). \qquad (2)$

Again, the distributions involved in the above definition are unknown. Since the prediction vectors live in the discrete space $\mathcal{Y}^m$, it is possible to try to learn these distributions, but this leads to very hard inference problems [28, 3, 32] and such statistical learning approaches do not scale well w.r.t. either the number of classifiers $m$ or the number of classes $\ell$. The smallest memory complexity among these references is achieved by [32], who introduce a mixture model relying on tensor decomposition. If $K$ denotes the number of components in the decomposition, the number of parameters to learn is linear in $K$; linearity in $m$ can thus only be claimed if $K$ does not grow with the ensemble, which cannot always be assumed. In addition, the Bayesian solutions introduced in these references do not allow an incremental aggregation algorithm to be obtained, an attribute that we believe is highly desirable.

Generally speaking, classifier combination consists in finding a function $f$ capturing the relation between prediction vectors $\mathbf{e}$ and class labels whose performance is as close as possible to that of $h^*_{\mathrm{agg}}$. In the approach presented in the next section, we leverage the flexibility of possibility theory to find one such function. The proposed possibilistic approach visits several functions, i.e. aggregation strategies, and selects the one maximizing the accuracy obtained on the validation dataset. The strategies in question are generalizations of logical rules, as opposed to probabilistic approaches which resort to calculus rules. In this regard, the proposed solution relies both on learning and on reasoning principles.

Besides, we have chosen to place ourselves in a framework in which each classifier can only deliver one piece of information, i.e. a predicted class label. This allows us to be completely agnostic to the nature of the base learners. Indeed, depending on the training algorithm and model employed by a learner, the latter may be able to deliver a score vector (usually in the form of a probability distribution). In this case, the above analysis no longer applies and the optimal aggregation consists in inferring posterior predictive probabilities of class labels given the scores. Examples of approaches in this line of work are reviewed in [47]. It is possible to remain relatively agnostic to the nature of the base classifiers by resorting to a statistical calibration step that allows prediction probabilities to be obtained from non-probabilistic classifiers, as done in [4]. Calibration, however, consumes a significant portion of the validation set, leaving a smaller amount to train the aggregation technique. Consequently, score based aggregation is out of the scope of this paper; it is rather a natural follow-up of the approach introduced in the next section, as mentioned in the concluding remarks of section 5.

2.4 Desirable properties for classifier combination

In terms of purely error rate related performance, the most desirable property for an aggregation function $f$ is asymptotic optimality:

$\epsilon_f \xrightarrow[n_v \to \infty]{\text{a.s.}} \epsilon_{h^*_{\mathrm{agg}}}, \qquad (3)$

i.e. the error rate of $f$ converges almost surely to that of the best aggregate classifier (2) as the validation set grows.

The aggregation technique studied in [3] achieves a result of this kind (under two technical assumptions). Indeed this technique, which elaborates on [23], amounts to computing maximum likelihood estimates of the probabilities involved in (2). But classifier aggregation can also bring other types of guarantees, which we refer to as robustness. Robustness is understood here as a form of fault tolerance, i.e. the ability to maintain a good level of predictions in several circumstances involving malfunctioning individual classifiers. There may be different causes behind malfunctioning classifiers, e.g. hardware failures or malicious hacks.

Among other possibilities, we have identified the following desirable properties in this scope:

  • robustness to random guess: if the error rate of classifier $h_k$ is $\frac{\ell - 1}{\ell}$, then $f(\mathbf{e}) = f(\mathbf{e}_{-k})$ where $\mathbf{e}_{-k}$ is the same vector as $\mathbf{e}$ but with its $k$-th entry deleted.

    Property (a) means that if the predictions of $h_k$ are on average no better than random guess, then $h_k$ has no influence on the aggregated classifier.

  • robustness to adversarial classifiers: if $h_k$ has an error rate larger than random guess, i.e. $\epsilon_k > \frac{\ell - 1}{\ell}$, then there is a classifier $h'_k$ with an error rate lower than random guess such that $f(\mathbf{e}) = f(\tilde{\mathbf{e}})$ where $\tilde{\mathbf{e}}_j = \mathbf{e}_j$ for any $j \neq k$ and $\tilde{\mathbf{e}}_k = h'_k(\mathbf{x})$.

    Property (b) means that we can somewhat rectify the incorrect predictions of classifier $h_k$ so that the aggregated classifier is identical to the one obtained in a non-adversarial situation.

  • robustness to redundant information: if there are two individual classifiers $h_k$ and $h_{k'}$ such that $h_k(\mathbf{x}) = h_{k'}(\mathbf{x})$ for any input $\mathbf{x}$, then $f(\mathbf{e}) = f(\mathbf{e}_{-k'})$.

    Property (c) means that copies of classifiers have no influence on the aggregated classifier.

In the above, we assume that the aggregation function $f$ is produced by a given algorithmic procedure and that this procedure applies for any number of classifiers. So $f(\mathbf{e}_{-k})$ is not the restriction of $f$ but is computed by another function learned from the same algorithm by omitting classifier $h_k$.

Obviously, one can think of other properties or reshape these ones in different ways. For instance, a soft version of property (c) would be preferable in the sense that an ensemble rarely contains identical copies of a predictor but very often contains highly dependent ones. This is a first attempt to formalize desirable robustness properties for classifier combination and we hope that more advanced declinations of these will be proposed in the future.

For the time being, our goal in this paper is to introduce an aggregation procedure that is compliant with properties (a) to (c). We will prove that these properties hold for the possibilistic approach introduced in the next section, at least asymptotically for some of them. Observe that (3) asymptotically implies properties (a) to (c), so the added value of our approach as compared to [3] lies in its incremental aspect, as shown in 3.8, and in its scalability w.r.t. the ensemble size, as the numerical experiments in section 4 will illustrate.

2.5 Other related works

So far, we have mentioned only closely related works that fall squarely within the same setting as ours, i.e. performing the same type of aggregation based on the same information. We recall that this paper is focused on a classifier aggregation paradigm in which one must be agnostic to the base learners. As explained before, this immediately rules out a large number of methods, such as those relying on classifier scores and ensemble methods. Score based combination most often assumes that base learners exploit a given training algorithm. For instance, [34] introduced an aggregation method tailored to SVMs while [21] applied another one to combine deep nets. By definition, so do ensemble methods such as [33]. Other score based algorithms require that training algorithms belong to the same class of models, typically probabilistic learners as in [22].

It must also be made clear that the paradigm addressed in this paper is not federated learning. In federated learning, there are no base learners: a group of remotely connected clients each have access to a local dataset. Clients compute a parameter update based on their local data and send this update to a meta-learner ([51, 31, 25]) to collaboratively train a single model. It is thus parameter updates that are aggregated, not classifier predictions, so this is not a classifier aggregation problem.

Finally, a most important and original aspect of the method introduced in this paper is its robustness w.r.t. noise, adversaries and information redundancy. To the best of our knowledge, there is no prior art in classifier aggregation that jointly addresses these aspects for agnostic aggregation of predicted class labels. Robustness to noise is investigated in [29], but in a very different setting, namely deep fusion, i.e. deep learning from multiple inputs. Other references in the literature focus on aspects other than robustness, such as security ([35]). Robustness is nonetheless a hot topic in the supervised learning paradigm, with important consequences in deep learning [36], but obtaining robust base learners does not ensure that the aggregation itself is robust.

3 Robust combination in the possibilistic framework

In this section, we introduce a new classifier combination approach in the possibility theory framework. Possibility theory [52, 15] is an uncertainty representation framework. It has strong connections with belief functions [14, 27], random sets [20, 38, 41], imprecise probabilities [16, 9] and propositional logic [5]. For a concise but thorough overview of possibility theory, the reader is referred to [17]; Appendix B also provides deeper insights into this framework and into why it is particularly relevant for classifier aggregation tasks.

Possibility theory is a widely used framework in symbolic artificial intelligence. It has allowed the derivation of new propositional and/or modal logics in which the level of uncertainty of logical propositions can be assessed ([39, 18]). This has applications in logic programming ([1]), automated reasoning ([13]) and expert systems ([46]). A popular class of non-probabilistic graphical models relying on ordinal conditional functions can also be revisited as a possibilistic model; see [2] for recent developments in this field. Possibility theory has also been used in other branches of artificial intelligence such as information fusion ([11]) and machine learning ([44, 24]).

In this paper, we adopt a knowledge based system view of this theory. In this regard, classifier predictions are pieces of expert knowledge to which degrees of belief are attached in the form of possibility distributions. Following a normative approach, the experts are reconciled by designing a conjunctive rule that must obey the desirable properties presented in the previous section.

3.1 Possibility theory basics

A possibility distribution $\pi$ maps each element of $\mathcal{Y}$ to the unit interval, whose extreme values correspond respectively to impossible and totally possible epistemic states. If $\pi(c) = 1$ then class label $c$ is totally possible (meaning that we have no evidence against $c$). If $\pi(c) = 0$ then $c$ is ruled out as a possible class label.

Given a subset $A$ of $\mathcal{Y}$, a possibility measure is given by:

$\Pi(A) = \max_{c \in A} \pi(c), \qquad (4)$

which means that the possibility of a subset $A$ is equal to the maximum possibility degree in this subset. A possibility measure is thus maxitive: $\Pi(A \cup B) = \max\big(\Pi(A), \Pi(B)\big)$, as opposed to probability measures which are summative. Observe that this property accounts for the fact that the possibility distribution is enough information to compute the possibility measure of any subset.

3.2 From classifier confusion matrices to possibility distributions

If one normalizes the $j$-th column of the confusion matrix $\mathbf{C}^{(k)}$, then we obtain an estimate of the conditional probability distribution of $Y$ given that classifier $h_k$ predicts $c_j$. So, if the classifier predicts $c_j$ for some input $\mathbf{x}$, we can adopt these frequentist probabilities as our beliefs on the class label of $\mathbf{x}$. But unless unrealistic conditional independence assumptions are made, probabilistic calculus rules will not easily allow beliefs arising from several classifiers to be combined.

As an alternative to this approach, we propose to build a possibility distribution from these conditional probability estimates, as information fusion in the possibilistic framework can mitigate dependency issues and does not lead to intractable computations. To cast the problem in the possibilistic framework, we use the Dubois and Prade transform (DPT) [14]. For some arbitrary probability distribution $p$ on $\mathcal{Y}$, let $\sigma$ denote a permutation of $\{1, \dots, \ell\}$ such that the probability masses of $p$ are sorted in descending order, i.e. $p(c_{\sigma(1)}) \geq p(c_{\sigma(2)}) \geq \dots \geq p(c_{\sigma(\ell)})$. The (unique) possibility distribution $\pi$ arising from $p$ through the DPT is given by

$\pi(c_{\sigma(i)}) = \sum_{j=i}^{\ell} p(c_{\sigma(j)}). \qquad (5)$

If $p_{k,j}$ denotes the distribution of class labels when classifier $h_k$ predicts $c_j$, the corresponding possibility distribution is denoted by $\pi_{k,j}$. For each input $\mathbf{x}$, the classifier predictions are thus turned into expert opinions in the form of the possibility distributions $\pi_{k, h_k(\mathbf{x})}$, $1 \leq k \leq m$.
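
For illustration, a minimal NumPy sketch of the DPT is given below (our own helper, not the released implementation; ties between probability masses receive no special treatment here):

import numpy as np

def dpt(p):
    # Dubois-Prade transform: the class with the largest mass gets
    # possibility 1 and every class gets the tail sum of all masses
    # smaller than or equal to its own, as in (5).
    p = np.asarray(p, dtype=float)
    order = np.argsort(-p)                         # masses in descending order
    pi = np.empty_like(p)
    pi[order] = np.cumsum(p[order][::-1])[::-1]    # tail sums
    return pi

# dpt([0.5, 0.3, 0.2]) -> array([1. , 0.5, 0.2])

Applied to each normalized column of a confusion matrix, this routine yields the distributions $\pi_{k,j}$ used in the sequel.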

3.3 Aggregation of possibility distributions

Formally speaking, any $m$-ary operator on the set of possibility distributions is an admissible combination operator. Triangular norms, or t-norms, are instrumental in yielding well-defined aggregation operators for possibility distributions. A t-norm $\top$ is a commutative and associative mapping from $[0,1]^2$ to $[0,1]$; it is therefore easy to build an $m$-ary version of it using successive pairwise operations: $\top(a_1, \dots, a_m) = \top\big(a_1, \top(a_2, \dots \top(a_{m-1}, a_m))\big)$ for any $a_1, \dots, a_m \in [0,1]$.

Moreover, a t-norm has 1 as neutral element, 0 as absorbing element and possesses the following monotonicity property: for any $a, b, c, d \in [0,1]$ such that $a \leq c$ and $b \leq d$, we have $\top(a, b) \leq \top(c, d)$. Finally, a t-norm is upper bounded by the minimum of its operands.

To combine possibility distributions using a t-norm, we can simply apply the t-norm elementwise. For instance, if $\pi_{12}$ is the aggregated possibility distribution obtained by applying a t-norm $\top$ to distributions $\pi_1$ and $\pi_2$, then $\pi_{12}(c) = \top\big(\pi_1(c), \pi_2(c)\big)$ for every $c \in \mathcal{Y}$. We will use the same t-norm symbol to stand for the overall combination of possibility distributions and write $\pi_{12} = \top(\pi_1, \pi_2)$. Examples of t-norms are the elementwise multiplication $\top_{\mathrm{prod}}$ and the elementwise minimum $\top_{\min}$.
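
Elementwise combination is straightforward to implement; a minimal sketch (helper names are ours) is:

import numpy as np

def aggregate(pis, tnorm=np.minimum):
    # Elementwise t-norm combination of possibility distributions; chaining
    # pairwise operations is valid because t-norms are associative.
    out = pis[0]
    for pi in pis[1:]:
        out = tnorm(out, pi)
    return out

# aggregate([pi1, pi2, pi3])               -> elementwise minimum
# aggregate([pi1, pi2, pi3], np.multiply)  -> elementwise product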

Decision making based on maximum expected utility is also justified under non-additive measures (capacities) [19] such as possibility measures. Consequently, the possibilistic aggregated classifier, denoted $h_{\mathrm{spocc}}$, is given by

$\pi_{\mathrm{agg}} = \top\big(\pi_{1, h_1(\mathbf{x})}, \dots, \pi_{m, h_m(\mathbf{x})}\big), \qquad (6)$

$h_{\mathrm{spocc}}(\mathbf{x}) = \underset{c \in \mathcal{Y}}{\arg\max}\; \pi_{\mathrm{agg}}(c). \qquad (7)$

Algorithm 1 explains what computations should be anticipated as part of a training phase and Algorithm 2 summarizes how the class label of an input is predicted at test time. The procedure corresponding to these algorithms is referred to as Scalable POssibilistic Classifier Combination (SPOCC). Note that there may be several class labels maximizing the aggregated possibility distribution, in which case the aggregated classifier prediction is set-valued. Working with set-valued predictions is out of the scope of this paper and will be considered in future work. In the event of a class label tie, and for any probabilistic, possibilistic or deterministic aggregation approach, one of these labels is chosen at random.

Data: validation set $\mathcal{V}$, classifiers $h_1, \dots, h_m$, number of class labels $\ell$.
for $k = 1$ to $m$ do
       Compute confusion matrix $\mathbf{C}^{(k)}$ as in (1).
       for $j = 1$ to $\ell$ do
             Compute conditional probability estimates by normalizing the $j$-th column of $\mathbf{C}^{(k)}$.
             Compute possibility distribution $\pi_{k,j}$ using (5).
       end for
end for
Return possibility distributions $\pi_{k,j}$, $1 \leq k \leq m$, $1 \leq j \leq \ell$.
Algorithm 1 SPOCC - training phase
Data: input $\mathbf{x}$, classifiers $h_1, \dots, h_m$, possibility distributions $\pi_{k,j}$ and t-norm $\top$.
for $k = 1$ to $m$ do
       Compute individual classifier prediction $h_k(\mathbf{x})$.
end for
Compute $\pi_{\mathrm{agg}} = \top\big(\pi_{1, h_1(\mathbf{x})}, \dots, \pi_{m, h_m(\mathbf{x})}\big)$. Return $\arg\max_{c \in \mathcal{Y}} \pi_{\mathrm{agg}}(c)$.
Algorithm 2 SPOCC - test phase
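
Putting the two algorithms together, and reusing the hypothetical helpers confusion_matrix, dpt and aggregate sketched above, a compact end-to-end implementation could read:

import numpy as np

def spocc_train(y_true, preds, n_classes):
    # preds[k]: validation-set predictions of classifier k (Algorithm 1).
    # Returns pi[k][j], the possibility distribution attached to the event
    # "classifier k predicts class j".
    pi = []
    for y_pred in preds:
        C = confusion_matrix(y_true, y_pred, n_classes) + 1.0  # add-one smoothing
        cols = C / C.sum(axis=0)                               # column-normalized
        pi.append([dpt(cols[:, j]) for j in range(n_classes)])
    return pi

def spocc_predict(x_preds, pi, tnorm=np.minimum):
    # x_preds[k]: label predicted by classifier k for one input (Algorithm 2);
    # np.argmax breaks possibility ties by picking the first maximizer.
    agg = aggregate([pi[k][j] for k, j in enumerate(x_preds)], tnorm)
    return int(np.argmax(agg))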

3.4 Adaptive aggregation w.r.t. dependency

The predictions of an ensemble of individual classifiers are usually significantly dependent because the classifiers are trained to capture the same link between inputs and class labels. So if the classifiers are at least weak learners, they will often produce identical predictions. More importantly, from an information fusion standpoint, if a majority of the classifiers are highly dependent and have a larger error rate than the remaining ones, they are likely to drag the ensemble toward their level of performance, making classifier fusion counter-productive.

In the approach introduced in this paper, it is possible to mitigate the negative impact of dependency by choosing an idempotent t-norm such as the elementwise minimum $\top_{\min}$. Indeed, in the worst dependency case, one classifier is a copy of another, and together they have an unjustified weight in the ensemble predictions. But if two individual classifiers are identical, they will also yield identical possibility distributions, and if the latter are combined using $\top_{\min}$, these two classifiers will be counted as one. This is exactly the spirit of property (c).

Two difficulties arise from this quest for robustness w.r.t. classifier redundancy:

  • It is not recommended to systematically use an idempotent combination mechanism, because two weakly dependent classifiers may also happen to yield identical possibility distributions, in which case it appears justified that their common prediction impacts the ensemble decision.

  • There are different levels of dependency among subsets of individual classifiers; therefore, using a single t-norm to jointly aggregate them is not the best option.

To address the first issue, we propose to use the following parametric family of t-norms:

$\top_\lambda(a, b) = \exp\left( -\left( (-\ln a)^\lambda + (-\ln b)^\lambda \right)^{1/\lambda} \right), \qquad \lambda \in (0, +\infty). \qquad (8)$

This family is known as the Aczel-Alsina t-norms and is such that $\top_1 = \top_{\mathrm{prod}}$ and $\top_\lambda \to \top_{\min}$ as $\lambda \to +\infty$. We can thus tune $\lambda$ all the higher as the level of dependence between classifiers is high.
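
A minimal sketch of this family (our own helper; the clipping threshold is an arbitrary implementation choice to keep the logarithms finite) is:

import numpy as np

def aczel_alsina(a, b, lam):
    # Aczel-Alsina t-norm (8): lam = 1 recovers the product t-norm and
    # a large lam approaches the minimum t-norm.
    a = np.clip(a, 1e-12, 1.0)
    b = np.clip(b, 1e-12, 1.0)
    return np.exp(-((-np.log(a)) ** lam + (-np.log(b)) ** lam) ** (1.0 / lam))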

To assess the level of dependence between two classifiers $h_k$ and $h_{k'}$, we use the following definition

$\mathrm{dep}(k, k') = 1 - \Lambda_{k,k'}, \qquad (9)$

where $\Lambda_{k,k'} = \mathcal{L}_{\mathrm{ind}} / \mathcal{L}_{\mathrm{joint}}$ is the likelihood ratio of the independence model over the joint model. These likelihoods are given by

$\mathcal{L}_{\mathrm{ind}} = \prod_{(\mathbf{x}, y) \in \mathcal{V}} \hat{p}\big(h_k(X) = h_k(\mathbf{x})\big)\, \hat{p}\big(h_{k'}(X) = h_{k'}(\mathbf{x})\big), \qquad (10)$

$\mathcal{L}_{\mathrm{joint}} = \prod_{(\mathbf{x}, y) \in \mathcal{V}} \hat{p}\big(h_k(X) = h_k(\mathbf{x}),\, h_{k'}(X) = h_{k'}(\mathbf{x})\big). \qquad (11)$

These likelihoods are computed using all examples contained in the validation set $\mathcal{V}$. The probabilities involved in the computation of $\mathcal{L}_{\mathrm{ind}}$ are the maximum likelihood estimates of the parameters of the multinomial marginal distributions of $h_k(X)$ and $h_{k'}(X)$ respectively. The probabilities involved in the computation of $\mathcal{L}_{\mathrm{joint}}$ are the maximum likelihood estimates of the parameters of the multinomial joint distribution of $\big(h_k(X), h_{k'}(X)\big)$.
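
Under the reconstruction of (9)-(11) given above, the dependence level can be sketched as follows (helper names are ours; the computation is carried out in log space to avoid underflow on large validation sets):

import numpy as np

def dependence(pred_k, pred_l, n_classes):
    # Dependence level of two classifiers from their validation predictions:
    # one minus the likelihood ratio of the independence model over the
    # joint model. Only observed prediction pairs enter the logarithms.
    n = len(pred_k)
    joint = np.zeros((n_classes, n_classes))
    for a, b in zip(pred_k, pred_l):
        joint[a, b] += 1.0
    joint /= n
    pk, pl = joint.sum(axis=1), joint.sum(axis=0)     # marginal MLEs
    log_ratio = sum(np.log(pk[a]) + np.log(pl[b]) - np.log(joint[a, b])
                    for a, b in zip(pred_k, pred_l))
    return 1.0 - np.exp(log_ratio)   # the dissimilarity is exp(log_ratio)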

The definition of the dependence level can be extended to more than two classifiers, but this will turn out to be unnecessary because we will use hierarchical agglomerative clustering (HAC) [48] to address issue (ii). HAC produces a dendrogram $\mathcal{G}$, i.e. a t-norm computation binary tree. Each leaf of this tree is in bijective correspondence with one of the possibility distributions induced by a classifier; there are thus $m$ leaves in $\mathcal{G}$. Furthermore, each non-leaf node in the tree stands for a t-norm operation involving two operands only. Consequently, each non-leaf node has exactly two children and there are $m - 1$ such nodes, one of them being the root node. Figure 1 gives an illustrative example of a dependence dendrogram allowing the aggregated possibility distribution to be computed.

Figure 1: Example of a dendrogram for $m = 5$ classifiers. Leaf nodes are at the bottom. For each of the four non-leaf nodes, a specific dependence level $\lambda$ must be determined to compute the aggregated possibility distribution.

HAC relies on a classifier dissimilarity matrix $\mathbf{D}$. In our case, the entries of this matrix are simply given by $\mathbf{D}_{k,k'} = 1 - \mathrm{dep}(k, k')$.

The t-norm based possibility distribution aggregation method described in the above paragraphs is meant to replace the penultimate step of Algorithm 2, but most of the computations pertaining to this dependency adaptive aggregation can be done at training time. Indeed, the computation of the pairwise dependence levels and of the dendrogram does not depend on the unseen example that we will try to classify at test time. For a minimal test phase computation time, we need to assign the appropriate dependence level to each non-leaf node of $\mathcal{G}$ (as illustrated in Figure 1) during the training phase. The corresponding array is denoted by $\boldsymbol{\lambda}$. The function $\top_{\mathcal{G}, \boldsymbol{\lambda}}$ maps the set of possibility distributions to the aggregated distribution by executing the computation graph $\mathcal{G}$ using the dependence levels contained in $\boldsymbol{\lambda}$. There are $m - 1$ hyperparameters in array $\boldsymbol{\lambda}$ that need to be tuned. They will be automatically set to appropriate values by heuristic search; see A for a presentation of this grid search based heuristic. It is noteworthy that this heuristic uses HAC to define clusters of classifiers, thereby reducing the number of hyperparameters to tune, as this amounts to merging some nodes of the dendrogram and applying a t-norm to more than two possibility distributions.
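
Assuming SciPy is available, the dendrogram itself can be obtained along the following lines (the average linkage criterion is our own illustrative choice; m, preds and n_classes are assumed from the context above, and the appendix heuristic is not reproduced here):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Pairwise dissimilarities between the m classifiers (zero diagonal).
D = np.zeros((m, m))
for k in range(m):
    for l in range(k + 1, m):
        D[k, l] = D[l, k] = 1.0 - dependence(preds[k], preds[l], n_classes)

# Each row of Z encodes one merge, i.e. one non-leaf node of the t-norm
# computation tree, to which one Aczel-Alsina parameter is attached.
Z = linkage(squareform(D), method='average')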

3.5 Adaptive aggregation w.r.t. informational content

When the predictions delivered by a classifier $h_k$ are poorer than those of another classifier, it is sensible to reduce the impact of $h_k$ on the decisions issued by the ensemble. Regardless of the formal definition behind what are called "poor predictions", we propose to use the following mechanism to gradually fade classifier $h_k$ out of the ensemble: for a given scalar $\tau_k \in [0, 1]$, we update all conditional possibility distributions related to $h_k$ as follows:

$\pi_{k,j} \leftarrow \tau_k\, \pi_{k,j} + 1 - \tau_k, \qquad 1 \leq j \leq \ell. \qquad (12)$

This mechanism is equivalent to an operation known as discounting [45]. When $\tau_k = 1$, the influence of classifier $h_k$ on the ensemble is not reduced. When $\tau_k = 0$, classifier $h_k$ is discarded from the ensemble since we obtain constant-one possibility distributions, which are the neutral element of t-norms and of the t-norm based aggregation method introduced in the previous subsection.

Obviously, we need to find a value of the discounting coefficient tailored to each classifier and in line with what poor predictions are meant to be. Again, it is tempting to set these hyperparameters using grid search, but the corresponding complexity calls for a subtler strategy. As for the dependency hyperparameters, we resort to a heuristic search.

Among other possibilities, our solution consists in binding the discounting rates together using the following formula:

$\tau_k = \left( \frac{1 - \hat{\epsilon}_k}{1 - \min_{k'} \hat{\epsilon}_{k'}} \right)^{\eta}, \qquad (13)$

where $\hat{\epsilon}_k$ is the estimated error rate of $h_k$ on the validation set and $\eta \geq 0$ is a hyperparameter to tune by grid search. Using the above equation, the best base classifier is not discounted and we have $\tau_k \in (0, 1]$ for every $k$.
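
A minimal sketch of this discounting step, under the reconstructed form of (13) given above (helper names are ours), is:

import numpy as np

def discount(pi, err, eta):
    # pi[k][j]: possibility distributions of classifier k; err[k]: its
    # estimated error rate. Implements (12) with the rates bound together
    # as in (13): the most accurate classifier keeps tau = 1 and is
    # therefore not discounted.
    tau = ((1.0 - np.asarray(err)) / (1.0 - min(err))) ** eta
    return [[t * p + 1.0 - t for p in rows] for t, rows in zip(tau, pi)]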

3.6 Fully adaptive aggregation

The fully adaptive version (w.r.t. both dependence and informational content) of SPOCC is referred to as adaSPOCC. The corresponding training and test phases are described in Algorithms 3 and 4 respectively. A python implementation can be downloaded at this link.

Data: validation set $\mathcal{V}$, classifiers $h_1, \dots, h_m$, number of class labels $\ell$.
Execute SPOCC - training phase (Algorithm 1). for $k = 1$ to $m$ do
       for $k' = k + 1$ to $m$ do
             Compute the dissimilarity $\mathbf{D}_{k,k'}$ using (9). Assign $\mathbf{D}_{k',k} = \mathbf{D}_{k,k'}$.
       end for
end for
Obtain dendrogram $\mathcal{G}$ by applying HAC to dissimilarity matrix $\mathbf{D}$. Apply heuristic to set parameters $\boldsymbol{\lambda}$ (see A). Compute parameters $\tau_k$ as in (13). Update all conditional possibility distributions as in (12). Return possibility distributions $\pi_{k,j}$, dendrogram $\mathcal{G}$, array $\boldsymbol{\lambda}$.
Algorithm 3 adaSPOCC - training phase
Data: input $\mathbf{x}$, classifiers $h_1, \dots, h_m$, possibility distributions $\pi_{k,j}$, dendrogram $\mathcal{G}$, array $\boldsymbol{\lambda}$.
for $k = 1$ to $m$ do
       Compute individual classifier prediction $h_k(\mathbf{x})$.
end for
Compute $\pi_{\mathrm{agg}} = \top_{\mathcal{G}, \boldsymbol{\lambda}}\big(\pi_{1, h_1(\mathbf{x})}, \dots, \pi_{m, h_m(\mathbf{x})}\big)$ (computation graph execution). Return $\arg\max_{c \in \mathcal{Y}} \pi_{\mathrm{agg}}(c)$.
Algorithm 4 adaSPOCC - test phase

3.7 Properties of the possibilistic ensemble

In this paper, we adopt a normative approach for the selection of a classifier decision aggregation mechanism. In this subsection, we give sketches of proofs showing that robustness properties (a) to (c) hold for adaSPOCC asymptotically:

  • Property (a): if $h_k$ is a random classifier, then as $n_v \to \infty$, each conditional distribution $\hat{p}(Y \mid h_k(X) = c_j)$ converges to a uniform distribution, so the DPT turns it into a constant-one possibility distribution, which is the neutral element of $\top$.

  • Property (b): let $h_k$ denote an adversarial classifier, i.e. $\epsilon_k > \frac{\ell - 1}{\ell}$. (ada)SPOCC implicitly uses the rectified classifier $h'_k$ defined as

    $h'_k(\mathbf{x}) = \underset{c \in \mathcal{Y}}{\arg\max}\; p\big(Y = c \mid h_k(X) = h_k(\mathbf{x})\big). \qquad (14)$

    We have

    $1 - \epsilon_{h'_k} = p\big(Y = h'_k(X)\big). \qquad (15)$

    Moreover, we can write

    $p\big(Y = h'_k(X)\big) = \sum_{j=1}^{\ell} p\big(Y = h'_k(X),\, h_k(X) = c_j\big). \qquad (16)$

    Given the definition of $h'_k$, its output is constant on each event $\{h_k(X) = c_j\}$. The definition also gives

    $p\big(Y = h'_k(X),\, h_k(X) = c_j\big) = \max_{c \in \mathcal{Y}} p\big(Y = c \mid h_k(X) = c_j\big)\, p\big(h_k(X) = c_j\big). \qquad (17)$

    The maximal probability value of a discrete variable on $\ell$ outcomes is always greater than or equal to $\frac{1}{\ell}$, therefore

    $1 - \epsilon_{h'_k} \geq \frac{1}{\ell} \sum_{j=1}^{\ell} p\big(h_k(X) = c_j\big) \qquad (18)$
    $= \frac{1}{\ell}. \qquad (19)$

    In other words, $\epsilon_{h'_k} \leq \frac{\ell - 1}{\ell}$. Since $h_k$ is not the random classifier, at least one of the conditional distributions $p(Y \mid h_k(X) = c_j)$ is not uniform, in which case the inequality is strict. We thus obtain $\epsilon_{h'_k} < \frac{\ell - 1}{\ell}$.

    Finally, when $n_v \to \infty$, if $h'_k$ maps the prediction $c_j$ of $h_k$ to $c_i$, the $i$-th column of the confusion matrix of $h'_k$ will be identical to the $j$-th column of the confusion matrix of $h_k$, so both will be mapped to identical possibility distributions.

  • Property (c): when $n_v \to \infty$, the likelihood ratio appearing in (9) writes

    $\Lambda_{k,k'} = \prod_{(\mathbf{x}, y) \in \mathcal{V}} \frac{p\big(h_k(X) = h_k(\mathbf{x})\big)\, p\big(h_{k'}(X) = h_{k'}(\mathbf{x})\big)}{p\big(h_k(X) = h_k(\mathbf{x}),\, h_{k'}(X) = h_{k'}(\mathbf{x})\big)}. \qquad (20)$

    If $h_{k'}$ is a copy of $h_k$ then $p\big(h_k(X) = h_k(\mathbf{x}),\, h_{k'}(X) = h_{k'}(\mathbf{x})\big) = p\big(h_k(X) = h_k(\mathbf{x})\big)$ and

    $\Lambda_{k,k'} = \prod_{(\mathbf{x}, y) \in \mathcal{V}} p\big(h_{k'}(X) = h_{k'}(\mathbf{x})\big). \qquad (21)$

    If $h_{k'}$ is not a constant function, then the probabilities in (21) are smaller than one and $\Lambda_{k,k'}$ vanishes as the validation set grows. The pair of classifiers will thus be detected as maximally dependent by HAC and they will be aggregated using $\top_{\min}$, hence property (c) holds in this case.

    When $h_{k'}$ is a constant function, then both $h_k$ and $h_{k'}$ will yield identical possibility distributions that are a Dirac function. In this case, the output of the t-norm will also be this Dirac function, meaning that idempotence always holds in these circumstances. Note, however, that the procedure described in 3.4 will fail to detect the dependency between $h_k$ and $h_{k'}$. There are plenty of ways to thwart this issue, as constant classifiers are not difficult to detect. In practice, we use add-one Laplace smoothing to estimate probabilities, so we never obtain a Dirac function as a possibility distribution.

Properties (a) to (c) rely on asymptotic estimates of multinomial distribution parameters which, by the strong law of large numbers, converge almost surely to their exact values. Consequently, the properties do not hold merely in expectation or with high probability but systematically (when $n_v$ is large).

Although properties (a) to (c) are not as strong as (3), adaSPOCC is a scalable aggregation technique, as the number of parameters it requires to learn from the validation set is linear in $m$, while the number of parameters to learn in [3] is exponential in $m$; the latter approach is therefore doomed to overfit when $m$ is large.

3.8 Incremental aggregation

When a new classifier must be appended to the ensemble, it suffices to compute its corresponding confusion matrix to be able to run SPOCC. All previously estimated parameters (confusion matrices and possibility distributions of the other classifiers) can be readily re-used.

Going incremental for adaSPOCC is not as straightforward as for SPOCC. A new discounting coefficient needs to be computed, but it is deduced from the estimated error rate of the new classifier, so this is not an issue. However, the dissimilarity matrix $\mathbf{D}$ also needs to be updated by appending a new row and a new column to it, which makes $m$ new entries to compute because $\mathbf{D}$ is symmetric and its diagonal elements are irrelevant. Then, HAC must be re-run. To increase the level of incrementalism of adaSPOCC in this regard, it is possible to use an incremental clustering algorithm such as [37]. The newly arrived classifier will either be appended to an existing cluster or a new cluster solely containing it will be created. In the first case, the hyperparameters can be left unchanged. In the second case, there is an additional node in the dendrogram and one additional hyperparameter must be estimated by grid search. Since we perform grid search for only one such parameter, this is obviously faster than the heuristic search described by Algorithm 5. In conclusion, adaSPOCC is also an incremental aggregation algorithm and all previously estimated parameters are re-used without needing to be updated.

4 Experimental results

In this section, we present a number of experimental results assessing the robustness of SPOCC and adaSPOCC as compared to other aggregation techniques. The section starts with results obtained when the base classifiers are trained on a synthetic dataset; these are meant to highlight performance discrepancies in simple situations where robustness is required. Another set of experiments on real datasets is also presented to show that the method is not only meaningful on toy examples.

4.1 General setup

Designing numerical experiments allowing aggregation methods to be compared is not a trivial task. A crucial aspect consists in training a set of base classifiers that achieve a form of diversity [50] so that the fusion of their predictions has a significant impact on performance. Among other possibilities [8], we chose to induce diversity by feeding the base classifiers with different disjoint subsets of data points at training time. The subsets are not chosen at random but in a deterministic way allowing each base classifier to focus on some regions of the input space and thus learn significantly different decision frontiers.

Because the union of the validation sets is an i.i.d. sample of $(X, Y)$ and aggregation methods have access to the predictions of each learner on this set union, well designed aggregation methods are able to restore high levels of performance even if the base learners are trained from non-i.i.d. samples. This allows us to test whether aggregation methods are relatively agnostic to the quality of the data used to train the base learners.

Each aggregation technique is fed only with the predictions of the base classifiers on the validation set in order to tune hyperparameters or learn the combination itself. Consequently, SPOCC and adaSPOCC are only compared to well established methods that use the same level of information. The benchmarked aggregation techniques are:

  • classifier selection based on estimated accuracies of the base classifiers,

  • weighted vote aggregation based on estimated accuracies of the base classifiers,

  • exponentially weighted vote aggregation based on estimated accuracies of the base classifiers,

  • naive Bayes aggregation,

  • Bayes aggregation,

  • stacking.

In the exponentially weighted vote aggregation, accuracies are not directly used as vote weights (as in standard weighted vote aggregation) but are mapped to weights using a softmax function. This function has a positive scaling hyperparameter that regulates the assignment of weights: when this parameter is zero, we retrieve unweighted vote aggregation, whereas when it is very large, we retrieve classifier selection.
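
For illustration, this accuracy-to-weight mapping can be sketched as follows (a hypothetical helper; the name beta for the scaling parameter is ours):

import numpy as np

def exp_weights(acc, beta):
    # Softmax mapping from validation accuracies to vote weights: beta = 0
    # gives uniform weights (plain voting) while a very large beta puts all
    # the weight on the most accurate classifier (classifier selection).
    w = np.exp(beta * (np.asarray(acc) - np.max(acc)))   # shifted for stability
    return w / w.sum()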

Bayes aggregation relies on (2). The conditional distributions involved in this equation are learned from the validation set. Naive Bayes aggregation uses conditional independence assumptions that allow the probabilities to be factorized as

$p(H = \mathbf{e} \mid Y = c) = \prod_{k=1}^{m} p\big(h_k(X) = \mathbf{e}_k \mid Y = c\big). \qquad (22)$

The conditional independence assumptions are not realistic but yield a model with far fewer parameters to learn as compared to Bayes aggregation.

For each of the estimated probabilities involved in the mechanisms behind SPOCC, adaSPOCC, Bayes or naive Bayes aggregation, we perform add-one Laplace smoothing to avoid computational issues related to zero probabilities. The chosen t-norm for SPOCC is .

Finally, we also train a softmax regression to map classifier predictions to the true class labels. This approach belongs to a methodology known as stacking [49]. An $\ell_2$ regularization term is added to the cross-entropy loss. A positive hyperparameter regulates the relative importance of the regularization term.

All hyperparameters are tuned automatically using a cross-validated grid search on the validation set. For each hyperparameter, the grid contains 100 points. When the hyperparameter is unbounded, a logarithmic scale is used to design the grid.

The statistical significance of the reported results is given in terms of confidence intervals estimated from bootstrap sampling. When the accuracies of two aggregation methods have overlapping confidence intervals, the performance discrepancy is not significant. A companion python implementation of adaSPOCC and of the benchmark methods can be downloaded at this link.

4.2 Synthetic Data

In this subsection, we use a very simple generating process to obtain example/label pairs. Data points are sampled from four isotropic Gaussian distributions. The centers of these Gaussian distributions are located at the corners of a centered square in a 2D input space. The standard deviation of each distribution is 1. There are two possible class labels. Points whose two coordinates are both positive belong to the first class, and so do points whose two coordinates are both negative. All the other points belong to the second class. Figure 2 shows one such dataset obtained from this generating process.

Figure 2: Synthetic dataset obtained from four Gaussian distributions. Examples belonging to the first class are in blue while those belonging to the second class are in cyan. Optimal decision frontiers are in magenta.

In this series of experiments, the dataset is divided into four overlapping subsets as depicted in Figure 3. Then a randomly selected portion of each such subset is used to train one of the base classifiers; the remaining points are used for the validation set. Each base classifier sends the corresponding set of prediction/label pairs to the aggregation method.

Since we have access to the data generating process, the test set is created dynamically. We sample test points until the observed accuracies of all tested methods are, with high probability, within their respective Clopper-Pearson confidence intervals of a prescribed half-width. The whole procedure is repeated 100 times. The averaged test errors are thus a good approximation of the generalization errors of the tested methods.

Figure 3: Subsets of the data seen respectively by the four base classifiers.

Given the shape of optimal decision frontiers, the base classifiers trained in this subsection are decision trees with a maximal depth of two.

Robustness w.r.t. adversaries

Among other possibilities, adversarial predictions are simulated by sampling from a Bernoulli distribution. When the sampled value is 1, the prediction of a base classifier is replaced with another (arbitrarily selected) class label that does not coincide with the classifier prediction. When the sampled value is 0, the classifier prediction is unchanged. Consequently, an adversarial classifier built in this way from a base classifier with an error rate lower than random guess will achieve an error rate greater than random guess when the Bernoulli parameter is large enough.

The evolution of the classification accuracy of the benchmarked aggregation methods as the number of adversaries grows can be seen in Figure 4. For simplicity, all adversaries are built from the same base classifier. Two methods cannot maintain the same level of performance when the number of adversaries increases: the weighted vote ensemble and Bayes aggregation. For the weighted vote ensemble, this is explained by the fact that the misleading classifiers eventually outnumber the legitimate ones and start to obtain a majority of votes. For Bayes aggregation, the performance degrades simply because of overfitting. Indeed, Bayes aggregation has a number of parameters to learn that is exponential in the number of classifiers while SPOCC and the other methods have a number of parameters that is at most linear in it.

Figure 4: Evolution of accuracy distributions (violin plots) for several aggregation methods w.r.t. the number of adversaries. SPOCC and adaSPOCC are in orange while other methods are in blue.

Robustness w.r.t. faults

Erroneous predictions are simulated by sampling from a Bernoulli distribution. When the sampled value is 1, the prediction of a base classifier is replaced with a class label selected at random (which may coincide with the classifier prediction only by chance). When the sampled value is 0, the classifier prediction is unchanged. Consequently, a noisy classifier built in this way from a base classifier will achieve an error rate equal to that of random guess as the Bernoulli parameter tends to 1.

The evolution of the classification accuracy of the benchmarked aggregation methods as the number of noisy classifiers grows can be seen in Figure 5. For simplicity, all noisy classifiers are built from the same base classifier. As with adversarial classifiers, the weighted vote ensemble and Bayes aggregation cannot maintain the same level of performance when the number of perturbed classifiers increases. The same reasons are behind these performance decays (a majority of incorrect classifiers for the weighted vote ensemble and overfitting for Bayes aggregation).

Figure 5: Evolution of accuracy distributions (violin plots) for several aggregation methods w.r.t. the number of noisy classifiers. SPOCC and adaSPOCC are in orange while other methods are in blue.

Robustness w.r.t. informational redundancy

Redundancy in classifier predictions is simulated by adding several copies of one of the base classifiers. As shown in Figure 6, this very simple setting leads to severe performance decays for the weighted vote ensemble, the exponentially weighted vote ensemble and the naive Bayes aggregation. Vote based ensembles are very sensitive to changes of majority. Naive Bayes aggregation is also sensitive to this phenomenon and suffers from its inability to capture dependency relations between the base classifiers.

Unlike in the previous experiment, it can be noted that Bayes aggregation maintains the same level of performance as the number of clones increases. Because clones always produce predictions identical to those of the duplicated classifier, training the Bayes aggregation in these conditions is equivalent to learning from the initial ensemble regardless of how many copies are added. However, should these copies be slightly perturbed, then we would observe the same overfitting issues as in the previous experiments.

Figure 6: Evolution of accuracy distributions (violin plots) for several aggregation methods w.r.t. the number of copies of . SPOCC and adaSPOCC are in orange while other methods are in blue.

Summarizing synthetic data experiment robustness results

In the previous paragraphs, we have seen which methods are tolerant to adversaries, faults and redundancy and scale well w.r.t. the ensemble size. Only SPOCC, adaSPOCC and stacking seem to be robust w.r.t. each of these forms of difficulty. Besides robustness, their absolute performances also matter. Average performances are reported in Table 1 for each experiment, as well as the global average over all experiments. We also provide as references the optimal (Bayes) classifier accuracy as well as the performance of the best base classifier, i.e. optimal selection.

In terms of accuracy, SPOCC and adaSPOCC are always among the top two aggregation approaches. While the naive Bayes aggregation is slightly better than SPOCC or adaSPOCC in the first two series of experiments, it performs very significantly worse in the last one and is outperformed on global average. Stacking and all other methods obtain worse (or sometimes comparable) results as compared to (ada)SPOCC. Moreover, observe that adaSPOCC achieves the smallest variance, meaning that its performances are more stable across dataset draws. Normalized confusion matrices corresponding to the global average over all experiments (last column of Table 1) are shown in Figure 7; they show the distribution of error rates in terms of type I and type II errors. Setting aside classifier selection (which achieves poor general performances anyway), adaSPOCC is the aggregation method with the smallest type I / type II error discrepancy.

Method Adversaries Faults Redundancy Global Average
Clf. Selection
std. std. std. std.
Weighted Vote
std. std. std. std.
Exp. Weighted Vote
std. std. std. std.
Stacking
std. std. std. std.
Naive Bayes Agg.
std. std. std. std.
Bayes Agg.
std. std. std. std.
SPOCC
std. std. std. std.
adaSPOCC
std. std. std. std.
Best base Clf.
std. std. std. std.
Optimal Clf.
std. std. std. std.
Table 1: Average performances of aggregation methods on the synthetic data. The first figure is the average accuracy, followed by the half-width of its confidence interval and the average standard deviation. Best accuracies (and those not statistically significantly different from them) are in bold characters.
Figure 7: Normalized averaged confusion matrices for the reported results in the last column of Table 1.

Other experimental aspects

The main goal of this experimental section is to illustrate the robustness properties (a) to (c) that adaSPOCC possesses. The results reported in the previous paragraphs match this purpose. However, other aspects are also interesting to examine. In the following paragraphs, we investigate the behavior of the tested aggregation methods under two different circumstances:

  • when the base classifiers are heterogeneous,

  • when the dataset is imbalanced, meaning that the class label distribution is not uniform.

Heterogeneous ensemble of classifiers An important advantage of the class label agnostic aggregation setting over others is that no assumptions at all are made on the base classifiers, and therefore any training algorithm can be used to derive them. To illustrate this ability, we reproduce the same experiment as in 4.2.3 with different base classifiers. Now, the first classifier is trained using logistic regression, the second is a nearest neighbor classifier, the third is an SVM with a radial basis function kernel while the fourth is a decision tree as before. The logistic regression classifier achieves smaller accuracy because a linear decision function underfits the data in this case. Since this classifier is the one that is progressively duplicated, the experiment also checks the ability of the methods to cope with increasingly many less accurate classifiers.

Figure 8: Evolution of accuracy distributions (violin plots) on an heterogeneous ensemble of 4 classifiers for several aggregation methods w.r.t. the number of copies of . SPOCC and adaSPOCC are in orange while other methods are in blue.

The corresponding results are displayed in Figure 8. Since at least one member of the ensemble performs more poorly than in 4.2.3, all methods have their accuracy distributions eroded. However, the conclusions from the previous experiments are confirmed, as the same methods achieve robustness to information redundancy and corruption. It is also made clear that the ability of the aggregation to overcome these difficulties does not depend on the training algorithms employed to obtain the base classifiers.

Class label imbalance In practice, it is common that the generative process underlying our data is such that the class label probability distribution is not uniform. In the previous set of experiments, such an imbalance was in place for the base learners but not for the aggregation methods. We now modify the generative process so that the class prior is no longer uniform, and the level of imbalance is progressively reduced as the probability of the first class label increases toward the balanced setting.

Figure 9: Evolution of accuracy distributions (violin plots) with class label imbalance for several aggregation methods w.r.t. (first class label probability). SPOCC and adaSPOCC are in orange while other methods are in blue.

The corresponding results are displayed in Figure 9. For any value of the first class label probability, this setting is very favorable to majority based methods. Indeed, if the decision tree training went well, we should obtain approximately the following predictors

We see that, for any input, 3 classifiers out of 4 yield correct predictions; therefore majority voting based aggregations are expected to perform very well. In addition, the accuracies of the base classifiers are close to one another, so the weighted vote and exponentially weighted vote aggregations are nearly equivalent to majority voting. As we can see, these two methods perform very well in mildly imbalanced regimes and can tolerate imbalance up to a certain level. Below this level, there are too few points in the minority class for the decision tree to learn meaningful prediction rules, and thus their aggregation (regardless of the method) is not meaningful either. Stacking can also easily learn a combination rule that mimics majority vote. It thus compares favorably to vote based methods.

The naive Bayes aggregation also works very well in this setting because the conditional probability estimates involved in (22) clearly single out the classifier yielding an incorrect prediction, which the aggregation then easily rules out. We can see that this method achieves performances comparable to those of the vote based ones.

The Bayes aggregation is a gold standard because it infers the optimal decision rule as explained in 2.3. However, it still has a comparatively large number of parameters to learn from the 40 data points of the validation set and will thus slightly overfit. It thus achieves worse accuracies than naive Bayes or vote based aggregation but seems to better handle extreme imbalance.

Because indicator functions (or Dirac masses) are fixed points of the DPT and zero is the absorbing element of t-norms, SPOCC and adaSPOCC can enjoy the same type of information as the naive Bayes aggregation does. They indeed perform well in mildly imbalanced regimes. However, they exhibit a higher sensitivity to imbalance than the other methods. As often, adaSPOCC appears to be more robust than SPOCC. We believe this is due to the fact that they are discriminative aggregation models, in the sense that they do not rely on the whole data distribution but solely operate on the conditional distributions of the class label given each prediction. As the experiments on real data will show, adaSPOCC already works quite well on several imbalanced datasets; possible fixes for this limitation will be investigated in future works and are discussed in section 5.

4.3 Real Data

To appraise the ability of the benchmarked methods to be deployed in more realistic situations (such as decentralized learning), we also need to test them on real data. Since this is essentially useful in a big data context, we chose eight moderate to large public datasets. The specifications of these datasets are reported in Table 2.

Name Size Dim. Nbr. of classes Data type Class imbalance Source
20newsgroup (after red.) 20 text yes sklearn
MNIST image no sklearn
Satellite image features yes UCI repo. (Statlog)
Wine (binarized) chemical features yes UCI repo. (Wine Quality)
Spam text yes UCI repo. (Spam)
Avila (binarized) layout features yes UCI repo. (Avila)
Drive current statistics no UCI repo. (Sensorless Drive Diagnosis)
Particle signal yes UCI repo. (MiniBooNE particle identification)
Table 2: Real dataset specifications

Example entries from the 20newsgroup dataset are word counts obtained using the term frequency - inverse document frequency statistics. We reduced the dimensionality of the inputs using latent semantic analysis [10], which is a standard practice for text data; we kept 100 dimensions. Also, as recommended, we stripped each text of its headers, footers and quotes, which otherwise lead to overfitting. Besides, for the Wine and Avila datasets, the number of class labels is originally 10 and 12 respectively. We binarized these classification tasks because some classes have very small cardinalities, which is problematic for our experimental design in which datasets are divided into several distinct subsets. Indeed, some subsets might possess no example at all of some classes, which leads to imprecise labeling, a situation beyond the scope of this paper. To circumvent these acute class imbalance issues, classes were merged as follows:

  • In the Wine data set, class labels are wine quality scores. Two classes are obtained by comparing scores to a threshold of 5.

  • In the Avila dataset, class labels are middle age bible copyist identities. The first five copyists are grouped into one class and the remaining ones into the other class.

Unlike with synthetic datasets, we need to separate the original dataset into a train set and a test set. To avoid any dependency of the reported performances on the train/test split, we perform 2-fold cross validation (CV). Also, we shuffled the examples at random and repeated the training and test phases multiple times.

To induce diversity in the base classifiers, we separated the training data into 6 distinct pieces using the following procedure (a code sketch is given after the next paragraph): for each dataset, for each class,

  1. apply principal component analysis to the corresponding data,

  2. project this data on the dimension with the highest eigenvalue,

  3. sort the projected values and split them into subsets whose cardinalities are proportional to the fraction of examples belonging to the class.

We argue that this way of splitting data leads to challenging fusion tasks because some base classifiers may see data that are a lot easier to separate than they should be and will consequently not generalize very well. Actually, the training data to which each classifier has access is a non-i.i.d. sample of the distribution of $(X, Y)$.
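
A minimal sketch of this splitting procedure (our own helper, relying on scikit-learn's PCA; names are illustrative) is:

import numpy as np
from sklearn.decomposition import PCA

def split_by_pca(X, y, n_pieces=6):
    # Deterministic non-i.i.d. split: within each class, sort the examples
    # along the first principal axis and cut the sorted sequence into
    # n_pieces contiguous chunks, one per base learner.
    pieces = [[] for _ in range(n_pieces)]
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        proj = PCA(n_components=1).fit_transform(X[idx]).ravel()
        for p, chunk in enumerate(np.array_split(idx[np.argsort(proj)], n_pieces)):
            pieces[p].extend(chunk.tolist())
    return pieces   # index lists, one per base learner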

We used softmax regression with an $\ell_2$ regularization term to train the base classifiers. The regularization hyperparameter is set to its default value (i.e. 1.0).

To make sure that the robustness observations from the previous subsection are confirmed on real data, we also add two noisy classifiers to the ensemble. Both noisy classifiers are built from the same base classifier, so they are perturbed copies of it.

Method 20newsgroup MNIST Satellite Wine Spam Avila Drive Particle
Clf. Selection
std. std. std. std. std. std. std. std.
Weighted Vote
std. std. std. std. std. std. std. std.
Exp. Weighted Vote
std. std. std. std. std. std. std. std.
Stacking
std. std. std. std. std. std. std. std.
Naive Bayes Agg.
std. std. std. std. std. std. std. std.
Bayes Agg. Intract. Intract. Intract.
std. std. std. std. std.
SPOCC
std. std. std. std. std. std. std.
adaSPOCC
std. std. std. std. std.