autoBagging: Learning to Rank Bagging Workflows with Metalearning
Abstract
Machine Learning (ML) has been successfully applied to a wide range of domains and applications. One of the techniques behind most of these successful applications is Ensemble Learning (EL), the field of ML that gave birth to methods such as Random Forests or Boosting. The complexity of applying these techniques, together with the market scarcity of ML experts, has created the need for systems that enable a fast and easy drop-in replacement for ML libraries. Automated machine learning (autoML) is the field of ML that attempts to answer these needs. Typically, these systems rely on optimization techniques such as Bayesian optimization to lead the search for the best model. Our approach differs from these systems by making use of the most recent advances in metalearning and a learning to rank approach to learn from metadata. We propose autoBagging, an autoML system that automatically ranks 63 bagging workflows by exploiting past performance and dataset characterization. Results on 140 classification datasets from the OpenML platform show that autoBagging can yield better performance than the Average Rank method and achieve results that are not statistically different from an ideal model that systematically selects the best workflow for each dataset. For the purpose of reproducibility and generalizability, autoBagging is publicly available as an R package on CRAN.
Fábio Pinto, Vítor Cerqueira, Carlos Soares and João Mendes-Moreira
Keywords: automated machine learning, metalearning, bagging, classification
1 Introduction
Ensemble learning (EL) has proven itself as one of the most powerful techniques in Machine Learning (ML), leading to state-of-the-art results across several domains (fernandez2014we). Methods such as bagging, boosting or Random Forests are considered some of the favourite algorithms among data science practitioners. However, getting the most out of these techniques still requires significant expertise and is often a complex and time-consuming task. Furthermore, since the number of ML applications is growing exponentially, there is a need for tools that boost the data scientist’s productivity (mediumArticle).
The research field that aims to answer these needs is Automated Machine Learning (autoML). It merges ideas and techniques from several ML and optimization topics, such as Bayesian optimization, metalearning (MtL) and algorithm selection. The past few years have seen important innovations in the field, which enable data science practitioners, including non-experts, to efficiently create fine-tuned predictive models with minimum intervention.
In this paper we address the problem of how to automatically tune an EL algorithm, covering all components within it: generation (how to generate the models and how many), pruning (which technique should be used to prune the ensemble and how many models should be discarded) and integration (which model(s) should be selected and combined for each prediction). We focus specifically on the bagging algorithm (breiman1996bagging) and four components of the algorithm: 1) the number of models that should be generated; 2) the pruning method; 3) how many models should be pruned; and 4) which dynamic integration method should be used. For the remainder of this paper, we call a set of these four elements a bagging workflow.
Our proposal is autoBagging, a system that combines a learning to rank approach with metalearning to tackle the problem of automatically generating bagging workflows. Ranking is a common task in information retrieval. For instance, to answer the query of a user, a search engine ranks a plethora of documents according to their relevance. In our case, the query is replaced by a new dataset and autoBagging acts as the ranking engine.
Figure 1 shows an overall schema of the proposed system. We leverage the historical predictive performance of each workflow in several datasets, where each dataset is characterised by a set of metafeatures. This metadata is then used to generate a metamodel, using a learning to rank approach. Given a new dataset, we are able to collect metafeatures from it and feed them to the metamodel. Finally, the metamodel outputs an ordered list of the workflows, specifically tuned to the characteristics of the new dataset.
We tested the approach on 140 classification datasets from the OpenML platform for collaborative ML (vanschoren2014openml) and 63 bagging workflows, which include two pruning techniques and two dynamic selection techniques. We give details on these workflows in Section 4. Results show that autoBagging performs better than two strong baselines, bagging with 100 trees and the average rank. Furthermore, testing the top 5 workflows recommended by autoBagging guarantees an outcome that is not statistically different from the Oracle, an idealistic method that always selects the best workflow for each dataset.
This paper is organized as follows. Section 2 describes the state of the art regarding autoML and metalearning, with particular emphasis on the approaches most similar to ours. In Section 3 we introduce the concept of bagging workflows and describe the components they are built from. Section 4 presents autoBagging from a more formal perspective. Section 5 presents the experiments carried out to evaluate our approach. Finally, Section 6 concludes the paper and sets directions for future work.
For the purpose of reproducibility and generalizability, autoBagging is publicly available as an R package (https://github.com/fhpinto/autoBagging).
2 Related Work
In this Section we provide a brief overview of systems designed with the intent of generating automatic recommendations or rankings of ML algorithms or workflows. After a careful analysis of the state of the art, we split it into three categories: 1) systems that use only a metalearning approach, without any kind of optimization component; 2) systems that make use of optimization procedures, such as Bayesian optimization; and 3) systems that leverage both metalearning and optimization procedures.
Finally, in the last subsection, we discuss how some autoML systems recommend ensemble learning algorithms to the user and how our approach differs from previous ones regarding this feature.
2.1 Metalearning based
The first automated framework proposed to support machine learning processes was the Data Mining Advisor (DMA) (giraud2005data). The system used an instance-based learning approach to relate the performance of the learning algorithms with simple and statistical metafeatures computed from the datasets (brazdil2003ranking).
This line of research was later overtaken by the characterization of datasets through landmarkers (pfahringer2000tell), such as learning curves (leite2005predicting) or pairwise meta-rules (sun2013pairwise).
2.2 Optimization based
Evaluating ML algorithms and/or ML workflows is typically very time-consuming and computationally expensive. In practice, it is not feasible to evaluate all learning algorithms with 10-fold cross-validation for a given dataset (particularly if the dataset is of high dimensionality) and choose the one that minimizes the error measure. Therefore, researchers have been developing search procedures and optimization algorithms that can do this in reasonable time.
Bayesian optimization is the field within optimization that has had the most success carrying out this type of task. One of the algorithms responsible for this success is SMAC (hutter2011sequential), an approach that constructs explicit regression models to describe the relationship between the target algorithm performance and the hyperparameters. The ability of SMAC to deal with both categorical and continuous hyperparameters is one of the reasons behind its success.
The development of Bayesian optimization algorithms for algorithm configuration has led to the emergence of systems such as Auto-WEKA (thornton2013auto), which makes use of SMAC and the machine learning library WEKA to automatically generate workflows for classification datasets.
More recently, the Hyperband (li2016hyperband) method was proposed as an alternative to Bayesian optimization algorithms. The method uses a pure-exploration algorithm for multi-armed bandits to exploit the iterative algorithms of machine learning. Essentially, the authors approach the automatic model selection problem as an automatic model evaluation problem. By exploring this perspective, they report better results than the Bayesian optimization methods.
2.3 Optimization plus Metalearning
Some autoML systems combine Bayesian optimization with metalearning, which is particularly useful as a warm start for the optimization procedure. An example of such a system is auto-sklearn (feurer2015efficient). Given a new dataset, the system starts by comparing the characteristics of that dataset with past performance of ML workflows on similar datasets (using a set of simple, statistical and information-theoretic metafeatures and kNN). After this warm start, the optimization procedure is carried out by SMAC. Finally, the system also has the ability to form ensembles from models evaluated during the optimization.
2.4 Ensemble focused autoML
Some attempts have been made to create autoML systems that are able to suggest ensembles. Probably the most notable one is, again, auto-sklearn, as described above. Another proposal on this matter is made in (lacoste2014sequential), where the authors optimize ensembles by bootstrapping the validation sets to simulate multiple independent hyperparameter optimization processes and combine the results with the agnostic Bayesian combination method.
One of the problems with the two approaches described is that the generation of the ensemble is rather ad hoc. That is, it does not take into account important properties that are known to affect the performance of ensembles, specifically the complementarity among models and the overall diversity of the ensemble. We argue that ensemble generation must take these concepts into account and that we should avoid a simple averaging of predictions from several models. For instance, in (levesque2016bayesian), the authors use Bayesian optimization directly to estimate which prediction model is the best candidate to be added to the ensemble. In this paper, we use as basis a well known and studied ensemble learning algorithm (bagging) and we generate ensembles that make use of several EL techniques proposed in the literature.
3 Bagging Workflows
The Ensemble Learning (EL) literature can be split into three main topics: ensemble generation, ensemble pruning and ensemble integration (mendes2012ensemble). It can also be seen as a process of three phases: 1) generating an accurate and diverse set of models; 2) pruning the ensemble in order to decrease its size and attempt to improve its generalization ability; and finally, 3) selecting a function to aggregate the predictions of each single model of the ensemble. This aggregation can be achieved by a static method (one that does not take into account the characteristics of the test instance, such as stacking) or a dynamic method (one that chooses different subsets of models according to the characteristics of the test instance).
Bagging, one of the most popular EL algorithms, can also be decomposed in the light of the structure described above (breiman1996bagging). Generically, given a training data set, a sample with replacement (a bootstrap sample) of the training instances is generated. The process is repeated k times, yielding k samples of the training instances. Then, from each sample, a model is generated by applying a learning algorithm. To aggregate the outputs of the base learners and build the ensemble, bagging typically uses one of the two most common methods: voting for classification (the most voted label is the final prediction) and averaging for regression (the predictions of all the base learners are averaged to form the ensemble prediction).
In this paper, we introduce the concept of a bagging workflow, which can also be decomposed into three components: generation, pruning and integration. The following subsections describe some of the methods that can be used within each of these components.
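The generic procedure just described (k bootstrap samples, one model per sample, majority voting) can be sketched in a few lines of Python. This is only an illustrative sketch: the base learner `fit_stump` is a hypothetical one-feature threshold classifier for labels 0/1, chosen to keep the example self-contained, and is not the decision tree learner used in the experiments.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) instances with replacement (one bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Hypothetical base learner for 1-D inputs with labels 0/1.

    Returns a threshold classifier that minimises errors on the sample.
    """
    labels = {y for _, y in sample}
    xs = sorted({x for x, _ in sample})
    if len(labels) == 1 or len(xs) < 2:
        # Degenerate sample: fall back to a constant (majority) model.
        majority = Counter(y for _, y in sample).most_common(1)[0][0]
        return lambda x: majority
    best = None
    for t in ((a + b) / 2 for a, b in zip(xs, xs[1:])):
        for above in (0, 1):  # label predicted when x > t
            errors = sum((above if x > t else 1 - above) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return lambda x: above if x > t else 1 - above

def bagging_fit(data, k, seed=0):
    """Generate k models, each trained on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(k)]

def bagging_predict(models, x):
    """Classification by voting: the most voted label is the prediction."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

train = [(0.1, 0), (0.4, 0), (0.5, 0), (1.2, 1), (1.5, 1), (1.9, 1)]
ensemble = bagging_fit(train, k=50)
```

For regression, `bagging_predict` would average the model outputs instead of counting votes.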
3.1 Generation
As mentioned before, typically, bagging algorithms generate ensembles by applying a learning algorithm to bootstrap samples of the training data. However, there are some hyperparameters that can be taken into account to exploit the versatility of bagging, such as:

the sampling strategy. Although bootstrap sampling is by far the most common sampling strategy in bagging, there are also reports of interesting results using subsampling without replacement and sampling of random subspaces (ho1998random).

the learning algorithm used to generate the models. Decision trees and neural networks are among the favourites, given their unstable learning property (breiman1996bagging).

how many models to generate. In the seminal paper in which bagging was introduced (breiman1996bagging), the author claimed that 50 or 100 single models should be enough to achieve good results. However, more recent studies showed that this number is highly dataset dependent (hernandez2013large).
3.2 Pruning
Given its widespread use among data science practitioners, bagging is also one of the most studied algorithms (bauer1999empirical). One of the discoveries made by researchers is that an efficient pruning of a bagging ensemble could lead to a smaller ensemble size and also to generalization improvements (zhou2002ensembling). This has led to a stream of research focused specifically on pruning techniques for bagging ensembles. Since a detailed overview of these techniques would be out of the scope of this paper, we refer the reader to some important papers in the field (martinez2009analysis; qian2015pareto). Essentially, these techniques combine the concepts of accuracy and diversity in ensemble learning to search for a subset of models that guarantees the same performance as the full ensemble or even improves on it. This search procedure is often led by some heuristic or an optimization algorithm.
Therefore, in the ensemble pruning phase of constructing bagging workflows, two hyperparameters must be considered:

the pruning method to be used.

the percentage of models that should be pruned. Again, studies have shown that this is a highly dataset-dependent hyperparameter (hernandez2013large).
3.3 Integration
As mentioned before, the methods for ensemble integration are split into two groups: static and dynamic. In the former, the weights assigned to each model in the ensemble are constant; in the latter, the weights vary according to the instance to be predicted. In the dynamic group, we distinguish between methods for selection (when a single model is selected) and combination of models (when more than one model can be selected).
Regarding static methods, the most well known is stacking (wolpert1992stacked). Regarding dynamic methods, again, research has shown that this hyperparameter is highly problem dependent (britto2014dynamic). A large empirical comparison of these techniques can be found in (pinto2016chade). A full description of these techniques is out of scope of this paper so we refer the reader to the original papers.
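To make the notion of dynamic selection concrete, the sketch below picks, for each test instance, the single model with the best accuracy on the nearest validation instances. It follows the general idea of local-accuracy-based selection; the function name `select_by_local_accuracy`, the 1-D distance and the choice of `k` are assumptions made for illustration, not the exact procedure of any cited method.

```python
def select_by_local_accuracy(models, validation, x, k=3):
    """Dynamic selection sketch for 1-D inputs.

    Finds the k validation instances closest to x, scores every model on
    that neighbourhood only, and returns the prediction of the best one.
    """
    neighbours = sorted(validation, key=lambda inst: abs(inst[0] - x))[:k]

    def local_accuracy(model):
        return sum(model(xv) == yv for xv, yv in neighbours) / len(neighbours)

    best_model = max(models, key=local_accuracy)
    return best_model(x)

# Two deliberately one-sided models: each is only right in one region.
always_zero = lambda x: 0
always_one = lambda x: 1

validation = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
left = select_by_local_accuracy([always_zero, always_one], validation, 0.5)
right = select_by_local_accuracy([always_zero, always_one], validation, 9.5)
```

A static vote over these two models would tie on every instance, whereas the dynamic rule uses a different model in each region of the input space.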
4 autoBagging: Learning to Rank Bagging Workflows
In this Section we present autoBagging. Although for this paper we focused on providing rankings of bagging workflows, we believe that the approach is generic for algorithm/workflow ranking in ML. Therefore, we describe the method from a generic perspective and we provide more specific details on the application to bagging workflows in Section 5.
We recall Figure 1 for a brief overview of the method. We start by describing the learning approach and then how we collected the metadata, both metafeatures and metatarget, to be able to learn at the meta-level.
4.1 Learning Approach
We approach the problem of algorithm selection as a learning to rank problem (liu2009learning). Let us take $D$ as the dataset set and $A$ as the algorithm set. $Y = \{1, 2, \dots, l\}$ is the label set, where each value represents a relevance score, which represents the relative performance of a given algorithm. Therefore, $l \succ l-1 \succ \dots \succ 1$, where $\succ$ represents an order relationship.
Furthermore, $D_m = \{d_1, d_2, \dots, d_m\}$ is the set of datasets for training and $d_i$ is the $i$th dataset, $A_i = \{a_{i,1}, a_{i,2}, \dots, a_{i,n_i}\}$ is the set of algorithms associated with dataset $d_i$ and $y_i = \{y_{i,1}, y_{i,2}, \dots, y_{i,n_i}\}$ is the set of labels associated with dataset $d_i$, where $n_i$ represents the sizes of $A_i$ and $y_i$; $a_{i,j}$ represents the $j$th algorithm in $A_i$; and $y_{i,j} \in Y$ represents the $j$th label in $y_i$, representing the relevance score of $a_{i,j}$ with respect to $d_i$. Finally, the metadataset is denoted as $S = \{((d_i, A_i), y_i)\}_{i=1}^{m}$.
We use metalearning to generate the metafeature vectors $x_{i,j} = \phi(d_i, a_{i,j})$ for each dataset-algorithm pair, where $i = 1, 2, \dots, m$; $j = 1, 2, \dots, n_i$; and $\phi$ represents the metafeatures extraction functions. These metafeatures can describe $d_i$, $a_{i,j}$ or even the relationship between both. Therefore, taking $x_i = \{x_{i,1}, x_{i,2}, \dots, x_{i,n_i}\}$ we can represent the metadataset as $S' = \{(x_i, y_i)\}_{i=1}^{m}$.
Our goal is to train a meta ranking model $f(d, a) = f(x)$ that is able to assign a relevance score to a given new dataset-algorithm pair $d$ and $a$, given $x$.
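As an illustration of how such a metadataset might be laid out in practice, the sketch below builds one row per dataset-algorithm pair by concatenating dataset metafeatures with algorithm descriptors, and records how many rows belong to each dataset (learning to rank implementations typically need this grouping). All names and numbers here (`build_metadataset`, the feature values, the relevance scores) are hypothetical.

```python
def build_metadataset(dataset_features, algorithm_features, relevance):
    """Flatten {(dataset, algorithm): relevance} into rows grouped by dataset.

    Returns (rows, labels, groups): one metafeature vector and one relevance
    label per pair, plus the number of pairs contributed by each dataset.
    """
    rows, labels, groups = [], [], []
    for d in sorted(dataset_features):
        algorithms = sorted(a for (dd, a) in relevance if dd == d)
        for a in algorithms:
            # Metafeature vector of the pair: dataset part + algorithm part.
            rows.append(dataset_features[d] + algorithm_features[a])
            labels.append(relevance[(d, a)])
        groups.append(len(algorithms))
    return rows, labels, groups

# Toy metadata: 2 datasets, 3 algorithms (all values are made up).
dataset_features = {"d1": [300.0, 2.0], "d2": [5000.0, 10.0]}
algorithm_features = {"w1": [50.0, 0.0], "w2": [100.0, 0.25], "w3": [200.0, 0.5]}
relevance = {
    ("d1", "w1"): 3, ("d1", "w2"): 1, ("d1", "w3"): 2,
    ("d2", "w1"): 1, ("d2", "w2"): 2, ("d2", "w3"): 3,
}

rows, labels, groups = build_metadataset(dataset_features, algorithm_features, relevance)
```

Keeping the rows of each dataset contiguous means the ranking loss only compares algorithms evaluated on the same dataset.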
4.2 Metafeatures
We approach the problem of generating metafeatures to characterize datasets and algorithms with the aid of a framework for systematic metafeatures generation (pinto2016towards). Essentially, this framework regards a metafeature as a combination of three components: a metafunction, a set of input objects and a post-processing function. The framework establishes how to systematically generate metafeatures from all possible combinations of object and post-processing alternatives that are compatible with a given metafunction. Thus, the development of metafeatures for a MtL approach simply consists of selecting a set of metafunctions (e.g. entropy, mutual information and correlation) and the framework systematically generates the set of metafeatures that represent all the information that can be obtained with those metafunctions from the data.
For this task in particular, we selected a set of metafunctions that are able to characterize, as completely as possible, the datasets (measuring information regarding the target variable, the categorical and numerical features, etc.), the algorithms, and the relationship between the datasets and the algorithms (which can be seen as landmarkers (pfahringer2000tell)). Therefore, the set of metafunctions used is:

Skewness

Pearson’s correlation

Maximal Information Coefficient (MIC (reshef2011detecting))

Entropy

Mutual Information

Eta squared (from ANOVA test)

R value of class overlap (oh2011new)

Rank of each algorithm (brazdil2008metalearning)
Each metafunction is used to systematically measure information from all possible combinations of input objects available for this task. We defined the input objects available as:

discrete descriptive data of the datasets

continuous descriptive data of the datasets

discrete output data of the datasets

five sets of predictions (discrete predicted data) for each dataset (naive Bayes, decision trees with depth 1, 2 and 3, and majority class)
For instance, if we take the example of using Entropy as metafunction, it is possible to measure information in discrete descriptive data, discrete output data and discrete predicted data (if the base-level problem is a classification task). After computing the entropy of all these objects, it might be necessary to aggregate the information in order to keep the tabular form of the data. Take, for example, the aggregation required for the entropy values computed for each discrete attribute. Therefore, we chose a palette of aggregation functions to capture several dimensions of these values and minimize the loss of information by aggregation. In that sense, the post-processing functions chosen were:

average

maximum

minimum

standard deviation

variance

histogram binning
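The combination of a metafunction, a set of input objects and post-processing functions can be illustrated with entropy applied to the discrete attributes of a dataset. The sketch below is an assumption-laden illustration: the feature names (`entropy.mean`, etc.) are hypothetical, and histogram binning is left out for brevity.

```python
import math
import statistics
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

# Post-processing functions that aggregate one value per attribute into a
# fixed number of metafeatures (histogram binning omitted in this sketch).
POST_PROCESSING = {
    "mean": statistics.fmean,
    "max": max,
    "min": min,
    "sd": statistics.stdev,
    "var": statistics.variance,
}

def entropy_metafeatures(discrete_columns):
    """Apply the metafunction to every column, then aggregate the results."""
    per_attribute = [entropy(column) for column in discrete_columns]
    return {
        f"entropy.{name}": aggregate(per_attribute)
        for name, aggregate in POST_PROCESSING.items()
    }

columns = [
    ["a", "a", "b", "b"],  # two equally likely values: 1 bit
    ["x", "x", "x", "x"],  # constant attribute: 0 bits
    ["p", "q", "r", "s"],  # four equally likely values: 2 bits
]
features = entropy_metafeatures(columns)
```

Whatever the number of discrete attributes, the dataset is always described by the same fixed-length set of values, which keeps the metadata tabular.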
Given these metafunctions, the available input objects and postprocessing functions, we are able to generate a set of 146 metafeatures. To this set we add eight metafeatures: the number of examples of the dataset, the number of attributes and the number of classes of the target variable; and five landmarkers (the ones already described above) estimated using accuracy as error measure. Furthermore, we add four metafeatures to describe the components of each workflow: the number of trees, the pruning method, the pruning cut point and the dynamic selection method. In total, autoBagging uses a set of 158 metafeatures.
4.3 Metatarget
In order to be able to learn a ranking metamodel $f(d, a)$, we need to compute a metatarget that is able to assign a score $z$ to each dataset-algorithm pair $(d, a)$, so that:
$$F : D \times A \rightarrow Z \qquad (1)$$
where $F$ is the ranking metamodels set and $Z$ is the metatarget set.
To compute $z$, we use a cross-validation error estimation methodology (we use 4-fold cross-validation in the experiments reported in this paper, Section 5), in which we estimate the performance of each bagging workflow for each dataset using Cohen’s kappa score (cohen1960coefficient). On top of the estimated kappa score, for each dataset, we rank the bagging workflows. This ranking is the final form of the metatarget and it is then used for learning the metamodel.
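The two steps behind the metatarget (estimating Cohen's kappa for each workflow and ranking the workflows within a dataset) can be sketched as follows. The kappa formula is the standard one, $\kappa = (p_o - p_e)/(1 - p_e)$; the handling of ties by average rank is an assumption of this sketch, since the text does not say how ties are resolved.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels, corrected for chance."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_expected == 1.0:
        return 0.0  # degenerate case: a single class everywhere
    return (p_observed - p_expected) / (1.0 - p_expected)

def rank_workflows(kappa_by_workflow):
    """Rank 1 is the highest kappa; tied workflows share the average rank."""
    ordered = sorted(kappa_by_workflow.items(), key=lambda item: -item[1])
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for name, _ in ordered[i:j + 1]:
            ranks[name] = average_rank
        i = j + 1
    return ranks

kappa = cohen_kappa([1, 1, 0, 0], [1, 1, 0, 1])
ranks = rank_workflows({"w1": 0.8, "w2": 0.6, "w3": 0.8})
```

In a full pipeline the kappa values passed to `rank_workflows` would be the cross-validated estimates for one dataset, and the procedure would be repeated for every dataset in the metadata.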
5 Experiments
In this Section we describe the experiments performed to understand and evaluate autoBagging. We also provide a brief exploratory analysis of the metadata collected from the experiments, which is particularly useful for understanding some of the EL methods used.
5.1 Experimental Setup
Our experimental setup comprises 140 classification datasets extracted from the OpenML platform for collaborative machine learning (vanschoren2014openml). We limited the datasets extracted to a maximum of 5000 instances, a minimum of 300 instances and a maximum of 1000 attributes, in order to speed up the experiments and exclude datasets that could be too small for some of the bagging workflows that we wanted to test.
Regarding bagging workflows, taking into account all the hyperparameters described in Section 3 would result in a computational cost too large for our resources. Therefore, we limited the hyperparameters of the bagging workflows to four: number of models generated, pruning method, pruning cut point and dynamic selection method. Specifically, each hyperparameter could take the following values:

Number of models: 50, 100 or 200. Decision trees were chosen as the learning algorithm.

Pruning method: Margin Distance Minimization (MDSQ) (martinez2009analysis), Boosting-Based Pruning (BB) (martinez2009analysis) or none.

Pruning cut point: 25%, 50% or 75%.

Dynamic integration method: Overall Local Accuracy (OLA), a dynamic selection method (woods1997combination); K-nearest-oracles-eliminate (KNORA-E) (ko2008dynamic), a dynamic combination method; and none.
All the values of the hyperparameters described above generated 63 valid combinations. We tested these bagging workflows on the datasets extracted from OpenML with 4-fold cross-validation, using Cohen’s kappa as the evaluation metric.
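One reading of the grid that is consistent with the count of 63 is that the pruning cut point only applies when a pruning method is selected: 3 ensemble sizes, 7 pruning configurations (no pruning, plus 2 methods with 3 cut points each) and 3 integration options. The sketch below enumerates the grid under that assumption; the dictionary keys are hypothetical names.

```python
from itertools import product

N_MODELS = (50, 100, 200)
PRUNING_METHODS = ("MDSQ", "BB")
CUT_POINTS = (0.25, 0.50, 0.75)
INTEGRATION = ("OLA", "KNORA-E", None)  # None = no dynamic integration

def enumerate_workflows():
    """List every valid combination of the four hyperparameters.

    Assumption: a cut point is only meaningful together with a pruning
    method, so "no pruning" contributes a single configuration.
    """
    pruning_options = [(None, None)] + list(product(PRUNING_METHODS, CUT_POINTS))
    return [
        {"n_models": n, "pruning": method, "cut_point": cut, "integration": integ}
        for n, (method, cut), integ in product(N_MODELS, pruning_options, INTEGRATION)
    ]

workflows = enumerate_workflows()
```

The plain bagging baseline with 100 trees corresponds to the entry with `n_models=100` and no pruning or dynamic integration.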
We used the XGBoost learning to rank implementation for gradient boosting of decision trees (Chen2016) to learn the metamodel as described in Section 4. The decision tree implementation from this library has a very elegant way of dealing with missing values. Essentially, the tree splitting functionality assigns an instance with missing values to a default direction and then learns from the data the optimal default direction. This is particularly important for metalearning since the number of missing values is often quite high in these datasets (e.g., attribute correlation cannot be measured in a dataset without numeric attributes, which results in a missing value).
As baselines, at the base-level, we use 1) bagging with 100 decision trees; 2) the average rank method, a model that always predicts the bagging workflow with the best average rank in the meta training set; and 3) the oracle, an ideal model that always selects the best bagging workflow for each dataset. At the meta-level, we use the average rank method as baseline.
As evaluation methodology, we use an approach similar to the leave-one-out methodology. However, each test fold consists of all the algorithm-dataset pairs associated with the test dataset. The remaining examples are used for training purposes. The evaluation metric at the meta-level is the Mean Average Precision at 10 (MAP@10) and at the base-level, as mentioned before, we use Cohen’s kappa. The methodology proposed by (demvsar2006statistical) was used for statistical validation of the results.
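The evaluation protocol can be sketched as follows: each fold holds out every pair of one dataset, and the predicted ranking for that dataset is scored with average precision at k. The text does not state which workflows count as relevant, so the `relevant` set below (for example, the workflows that are truly among the best for the dataset) is an assumption, as is the normalisation by `min(k, len(relevant))`, which is one common variant of the metric.

```python
def leave_one_dataset_out(pairs):
    """Yield (held_out, train_pairs, test_pairs) for (dataset, workflow) pairs.

    Every pair belonging to the held-out dataset goes to the test fold.
    """
    for held_out in sorted({dataset for dataset, _ in pairs}):
        train = [p for p in pairs if p[0] != held_out]
        test = [p for p in pairs if p[0] == held_out]
        yield held_out, train, test

def average_precision_at_k(predicted, relevant, k=10):
    """Average precision over the first k positions of a predicted ranking."""
    hits, score = 0, 0.0
    for position, item in enumerate(predicted[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / position  # precision at this position
    denominator = min(k, len(relevant))
    return score / denominator if denominator else 0.0

pairs = [("d1", "w1"), ("d1", "w2"), ("d2", "w1"), ("d2", "w2")]
folds = list(leave_one_dataset_out(pairs))
ap = average_precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=3)
```

MAP@10 is then the mean of `average_precision_at_k(..., k=10)` over all held-out datasets.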
For the purpose of reproducibility and generalizability, autoBagging is publicly available as an R package.
5.2 Exploratory Metadata Analysis
Given the rich metadata collected from the experiments that we carried out, we proceed to draw some insights about the datasets and the workflows that we experimented with.
We can see by analysing Figure 2 that the range of kappa values for each dataset varies a lot. This is expected given the No Free Lunch theorem, which states that there is no one model that works best for every problem and “two algorithms are equivalent when their performance is averaged across all possible problems” (wolpert1996lack). Even though all the models that we experimented with belong to the same family (bagging of decision trees), the pruning and dynamic integration components make it possible to generate very different predictive models. This indicates that ranking these bagging workflows for each dataset is not an easy learning task.
Figure 3 shows the boxplots of the ranking scores collected for each dataset, ordered by average ranking. We can draw some insights about the performance of the bagging workflows from this graph:

on average, the bagging workflows that make use of BB pruning and KNORA-E as dynamic integration method seem to achieve better results

in terms of pruning cut point, it seems that BB pruning works better with a large pruning cut point (e.g., 75%) than MDSQ

the bagging workflows that do not make use of any kind of dynamic integration method are worse on average than the ones that do

both the top and the worst bagging workflows are outliers for some datasets in terms of performance
5.3 Results
Figure 4 shows a loss curve, relating the average loss in terms of performance with the number of workflows tested following the ranking suggested by each method. The loss is calculated as the difference between the performance of the best algorithm ranked by the method in comparison with the ground truth ranking. The loss for all datasets is then averaged for aggregation purposes. We can see, as expected, that the average loss decreases for both methods as the number of workflows tested increases.
Comparing autoBagging with the Average Rank method, autoBagging shows a superior performance for every number of workflows tested. Interestingly, this result is particularly noticeable in the first tests. For instance, if we test only the top 1 workflow recommended by autoBagging, on average, the kappa loss is half of the one we should expect from the suggestion made by the average rank method.
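The loss curve discussed above can be computed as in the following sketch: for each dataset, the loss after testing the first n recommended workflows is the gap between the best kappa achievable on that dataset and the best kappa among those n workflows; the curves are then averaged over datasets. Function names and the toy kappa values are hypothetical.

```python
def loss_curve(recommended, kappa):
    """Loss after testing the top 1, 2, ..., n workflows of a ranking.

    `recommended` is the ranking produced by a method (best first) and
    `kappa` maps each workflow to its true performance on the dataset.
    """
    best_possible = max(kappa.values())
    curve, best_so_far = [], float("-inf")
    for workflow in recommended:
        best_so_far = max(best_so_far, kappa[workflow])
        curve.append(best_possible - best_so_far)
    return curve

def average_loss_curve(curves):
    """Point-wise mean of per-dataset loss curves of equal length."""
    return [sum(c[i] for c in curves) / len(curves) for i in range(len(curves[0]))]

kappa_d1 = {"w1": 0.60, "w2": 0.75, "w3": 0.70}
kappa_d2 = {"w1": 0.90, "w2": 0.80, "w3": 0.85}
curve_d1 = loss_curve(["w3", "w1", "w2"], kappa_d1)
curve_d2 = loss_curve(["w1", "w3", "w2"], kappa_d2)
mean_curve = average_loss_curve([curve_d1, curve_d2])
```

By construction each curve is non-increasing and reaches zero once the truly best workflow has been tested, which matches the decreasing trend reported for both methods.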
We evaluated these results to assess their statistical significance using Demšar’s methodology. Figures 5 and 6 show the Critical Difference (CD) diagrams for both the meta-level and the base-level.
At the meta-level, using MAP@10 as evaluation metric, autoBagging presents a clearly superior performance in comparison with the Average Rank. The difference is statistically significant, as one can see in the CD diagram. This result is in accordance with the performance visualized in Figure 4 for both methods.
At the base-level, we compared autoBagging with three baselines, as mentioned before: bagging with 100 decision trees, the Average Rank method and the oracle. We test three versions of autoBagging, taking the top 1, 3 and 5 bagging workflows ranked by the metamodel. For instance, in autoBagging@3, we test the top 3 bagging workflows ranked by the metamodel and we choose the best.
Starting from the tail of the CD diagram, both the Average Rank method and autoBagging@1 perform better than bagging with 100 decision trees. Furthermore, autoBagging@1 also performs better than the Average Rank method. This result confirms the indications visualized in Figure 4.
The CD diagram also shows that autoBagging@3 and autoBagging@5 have a similar performance. However, and we must highlight this result, autoBagging@5 shows a performance that is not statistically different from the oracle. This is extremely promising since it shows that the performance of autoBagging excels if the user is able to test the top 5 bagging workflows ranked by the system.
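For readers unfamiliar with CD diagrams, the width of the critical difference can be computed with the Nemenyi post-hoc formula from Demšar's methodology, $CD = q_\alpha \sqrt{k(k+1)/(6N)}$, where $k$ is the number of methods and $N$ the number of datasets. The sketch below assumes the Nemenyi test at a significance level of 0.05; the text does not state the level used, so the numbers it produces are illustrative only.

```python
import math

# Critical values q_alpha of the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of compared methods (as tabulated by Demsar, 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}

def critical_difference(n_methods, n_datasets, q_table=Q_ALPHA_005):
    """Minimum gap in average rank for two methods to differ significantly."""
    q = q_table[n_methods]
    return q * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))

# Meta-level: autoBagging vs. Average Rank on 140 datasets.
cd_meta = critical_difference(n_methods=2, n_datasets=140)
# Base-level: bagging, Average Rank, oracle and autoBagging@1/@3/@5.
cd_base = critical_difference(n_methods=6, n_datasets=140)
```

Under these assumptions the critical difference is roughly 0.17 average-rank units at the meta-level and 0.64 at the base-level; two methods whose average ranks differ by less than this are connected in the diagram.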
5.4 Discussion
We decided not to include time in the experiments since autoBagging execution time only depends on the computation of metafeatures. Given the nature of these metafeatures, such as entropy or mutual information, the computation is extremely fast (no more than a couple of minutes for the largest datasets used in the experiments).
Figure 7 shows the relative importance of the top 30 most important metafeatures. It is clear that the most informative metafeatures are the ones generated using the rank of each workflow in the metatraining set as metafunction. Given that these metafeatures do not vary that much from dataset to dataset, we can assume that they are very important to characterize the bagging workflows. On the other hand, the remaining metafeatures are critical for the ability of the metamodel to generalize for all datasets. Metafeatures such as class.entropy, dstump.landmarker_d1.entropy and r_value.hist1 are also among the most informative metafeatures.
6 Conclusion and Future Work
This paper presents autoBagging, an autoML system that makes use of a learning to rank approach and metalearning to automatically suggest a bagging ensemble specifically designed for a given dataset. We tested the approach on 140 classification datasets and the results show that autoBagging is clearly better than the baselines to which it was compared. In fact, if the top five workflows suggested by autoBagging are tested, results show that the system achieves a performance that is not statistically different from the oracle, a method that systematically selects the best workflow for each dataset. For the purpose of reproducibility and generalizability, autoBagging is publicly available as an R package.
As future work, we plan to further improve the experimental setup of autoBagging by comparing it with state-of-the-art systems such as auto-sklearn and the Hyperband method. Furthermore, we plan to study how we can use Bayesian optimization to further improve the final ensemble, always taking into account concepts such as diversity and complementarity between models to design the final ensemble.