Towards better understanding of meta-features contributions
Meta learning is a difficult problem as the expected performance of a model is affected by various aspects such as selected hyperparameters, dataset properties and landmarkers. Existing approaches are focused on searching for the best model but do not explain how these different aspects contribute to the performance of a model. To build a new generation of meta-models we need a deeper understanding of relative importance of meta-features and construction of better meta-features. In this paper we (1) introduce a method that can be used to understand how meta-features influence a model performance, (2) discuss the relative importance of different groups of meta-features, (3) analyse in detail the most informative hyperparameters that may result in insights for selection of empirical priors of engineering for hyperparameters. To our knowledge this is the first paper that uses techniques developed for eXplainable Artificial Intelligence (XAI) to examine the behaviour of a meta model.
Meta-learning, or learning to learn, is a wide area of machine learning focused on extracting knowledge about building pipelines of predictive models. Transferring this experience are close to model selection problem and optimization of performance measure. Investigation involved structure of dependency between given dataset and performance of selected machine learning algorithm presumably enables choosing better configuration of model resulting in higher performance measure. This process involves advanced techniques Vanschoren (2018) and requires meta-data about analysed problems to insight into algorithm-tasks synergy. The effectiveness of meta-learning is largely dependent on type and quality of meta-features so it is crucial to consider broad range of variables.
The primary sources of meta-features that feasibly predict model performance, are dataset characteristics. They are important because of better understanding of structure of data but also potential application in assessment similarity of tasks. Comprehensive summary of commonly used meta-features is provided in Vanschoren (2018). These characteristics are available for datasets uploaded to OpenML. In this overview, depending on definitions, features are grouped in five clusters: simple, statistical, information-theoretic, complexity, model-based, and landmarkers. Detailed description of these groups may be found in Rivolli et al. (2018). In Bilalli et al. (2017) is shown that selection of the most relevant data characteristics is related to definition of meta-model, especially to meta-response which is preferred performance measure. This study was deficient as concentrating only on statistical and information-theoretic properties. Due to high computational cost of extracting, model-based and landmarkers meta-data were neglected.
The idea of landmarking is an approach to represent dataset in meta-space by a vector of performance measures of simple and effective models. Rather than trust in statistical characterizations landmarkers, models helps to uncover more sophisticated structure of task. This method was introduced by Pfahringer et al. (2000). Various models has been proposed as landmarkers Balte (2014) but one of requirements are diverse architectures of algorithms capturing different variables interrelationship.
Except for dataset properties highly informative for meta-model may be detailed model configuration - settings of hyperparameters. It is observed that some algorithms are prone to tunability of specified parameters Probst et al. (2018).For that reason automatic optimization of hyperparametrs in models are popular topic in research. One of approach to this automation is building surrogate model on historical datasets, their properties and model hyperparameters space (Wistuba et al., 2015, 2018). These paper are focused on yielding from that meta-model the best hyperparameters configuration for the new task rather than analysing the impact of individual meta-features.
1.1 Our contributions
This paper will be divided into three parts (see Figure 1). Section 2 appeal to collecting meta-features to meta-learning problem. Next part is concentrated on selection of meta-model. In section 4 we present reservoir of eXplainable AI (XAI) tools to examine built surrogate model structure. We draw conclusions about importance of three types meta-features: data properties, landmarkers and hyperparameters.
Methods proposed in third section are universal and can be applied to diverse definition of meta-learning problem with different definition of meta-fetures space. That work may be perceived as sort of case study of application of these methods. For the case study we used the OpenML 100 database.
Meta-learning is multi-stage process. In our study we distinguish two main steps:
establishment of meta-space for meta-learning problem:
for every dataset 61 statistical properties have been computed in OpenML,
gradient boosting algorithm has determined space of 5 considered hyperparameters,
4 landmarkers from baseline models.
meta-response: predictive score for every pair of selected machine learning model and dataset
training and selection of meta-learning surrogate model using one-dataset-out cross-validation.
|Hyperparameters for GBM model||Landmarks|
|fract.||min. node||knn||glmnet||ranger||random forest|
|banknote authentication (1462)||0.03||1||8429||0.52||12||6.47||4.99||5.56||6.02|
|blood transfusion service center (1464)||0.01||1||394||0.21||10||0.64||1.23||0.86||0.71|
|climate model simulation crashes (1467)||0.00||2||654||0.26||15||0.35||1.36||1.12||1.15|
2.1 Datasets and their characterizations
In this study we have focused on OpenML100 database Bischl et al. (2017) so that considered datasets are real-world tasks. On this platform every uploaded tabular data has provided automated computing of 61 characteristics. Summary of available dataset properties is included in Vanschoren (2018). As follow in (Bilalli et al., 2017), some of statistics may be generated only for datasets which contain continuous variables, some for datasets with at least one categorical column and third group are calculated for any type of data. We have decided to analyse binary classification problem and selected 20 data sets whose all predictive variables are continuous. For these datasets are 38 available statistical and information-theoretic properties. See the Table 1 for the list of all selected datasets.
2.2 Algorithms and hyperparameters space
We have decided to explore gradient boosting classifiers (gbm) sensitivity to tunability so very algorithm was fitted on each dataset with various parameters. 100 random configurations of hyperparameters have been drawn to examine their influence model predictive power. The region, which hyperparameters are sampled, is according to Probst et al. (2018). Additionally, one of considered configurations is set of default values in R software R Core Team (2019). In this study we consider following hyperparameters for gbm model:
2.3 Model and dataset predictive power
For every pair of algorithm with hyperparameters configuration and dataset we have specified 20 train and test data splits. Model has been fitted on each train subset and afterwards AUC was computed on the test frame.
On the base of AUC results, for individual dataset we have created ranking of models - the higher AUC, the higher position in ranking. Ratings are scaled to [0,1] interval. Every configurations for each algorithms appear in list 20 times beacuse of train-test splits. To aggregate this to one value for every model we have computed average rating for model.
We have used landmarkers technique to distinct approach to characterise learning problem of given dataset. As landmarkers models we have considered five machine learning algorithms with default hyperparameter configurations:
generalized linear regression with regularization,
k nearest neighbours,
two random forest implementations: randomForest and ranger.
To evaluate their predictive power we have also applied 20 train/test split method and computed AUC score. These five algorithms have been ranked according to methodology in section 2.3. Because we are mainly concentrated on analysing gradient boosting models we have computed landmarkers as ratio of models rankings (knn, glmnet, ranger and randomForest) to default gbm configuration. In a result, we have four landmarkers meta-features.
Meta-dataset with all meta-features can be found in github repository anonymous URL.
3 Selection of the surrogate meta-model
We consider three types of surrogate model: gradient boosting with interaction depth equals to 2 (named: gbm shallow), gradient boosting with interaction depth equals to 10 (named: gbm deep) and glmnet with LASSO penalty (named: glm) Tibshirani (1994). These algorithms have different structure and employ various aspects of variables relationship in this meta-learning problem. Glmnet with LASSO penalty parameter finds linear dependence and simultaneously prevent overfitting. Gradient boosting capture different levels of interactions depending on interaction depth.
To compare the quality of meta-models we apply one-dataset-out crossvalidation: every models is trained on 19 datasets and then is tested on remaining data frame by calculating mean square error (MSE) and ratio model mean square error to MSE for model with predicted mean ranking (RATIO MSE). For test dataset we also calculate Spearman correlation between predicted rankings and actual meta-response. In Table 2 we sum up performance metrics on leave one dataset out validation. Presented values are average across all validation datasets.
|meta-model||mse||ratio mse||Spearman cor|
Gradient boosting deep has similar mean value of MSE like gradient boosting shallow but relative improvement to MSE with predicted mean ranking is significant higher. Spearman correlation between predictions and predicted values is also the highest for gbm deep. According to above results we assume that the best meta-model is gradient boosting with interaction depth equals 10, i.e. model that may be rich in interactions between meta-features.
During the process of surrogate model selection and validation we build 20 different meta-learning gbm models, in each of them one of datasets was out of train data frame. To further analyse we pick one of these models in which dataset eeg-eye-state (1471) was used as test dataset. In following sections we mark this meta-model as meta-gbm.
4 Explanatory meta-analysis of gbm meta-model
In this part we present main contribution of this paper. We focus on understanding of the structure of meta-gbm model and insight into informative power of meta-features and their interactions. To comprehensive analyse we introduce techniques developed for eXplainable Artificial Intelligence. For more details see Biecek and Burzykowski (2020); Molnar (2019) .
4.1 Meta-features importance
First aspect is examination of influence of individual meta-variables and ordering them according to this. This investigation helps to identify presumptive noisy aspects and may be significant in deliberation to exclude these meta-features from new generations meta-models. We have used permutation based measures of variable importance Fisher et al. (2018). In Figure 3 are presented the most influential meta-features. On the x-axis there is dropout values of root mean square error score caused by permutation of consecutive variable.
The most important meta-data are shrinkage, interaction depth and number of trees. These hyperparameters are followed by landmarkers for k nearest neighbours and ranger. Datasets properties play minor role in this ranking, significant permutational feature importance is obtained by MajorityClassSize, NumberOfInstances and MinorityClassSize. These meta-features are related to dimension and balancing in dataset.
Considered meta-features form three groups because of different approach to creating them. Figure 3 presents the assessment of groups influence. As we see, the most important class of meta-features is hyperparameters and this conclusion is consistent with importance measure for individual variables. Landmarkers and dataset characteristics has similar dropout values.
4.2 Grouping meta-features
In previous section we analyse importance of three groups related to definitions of meta-features. We also apply automatically group meta-features based on meta-features correlation. Resulted dendrogram is presented in Figure 4.
Some of dataset statistical characteristcs are closely related and pairwise have high correlation. For example quartile of means and standard deviation create correlated cluster. Hyperparameters are uncorrelated with landmarkers and dataset properties. Moreover they have close to zero correlation within their group. This observation agrees with definition since hyperparameters were sampled independently. Landmarkers for ranger and randomForest are strongly correlated because this algorithms different implementation of the same algorithm.
4.3 Hyperparameters informativeness
So far, we investigate overall impact of explanatory variables which can be estimated as single value. Now we focus on the type of effect of selected meta-predictor and how meta-model response is sensitive to changes in this feature. In this analysis we apply Individual Conditional Expectations (ICE) Goldstein et al. (2015) and its R implementation called Ceteris Paribus (CP) (Biecek, 2018).
ICE/CP analysis for selected hyperparameters is presented in Figure 5. In view of distribution of hyperparameters we apply loarithmic scale in x-axis. On each plot we present ceteris paribus profiles for all datasets, grey lines correspond to datasets which are in train subset of meta-data in this particular fold of one-dataset-out crossvalidation and black dotted line is related to eeg-eye-state test dataset.
For every dataset in meta-data we have 100 observations because of varying gbm hyperparameters but profiles for instances of selected dataset have similar shape and are nearly parallel. We are especially interested in curve outline since the translation may be caused by other meta-features values. So we present profiles for averaged meta-features for each dataset. In fact, we averaged only hyperparameters because dataset characteristics and landmarkers has constant values for all dataset observations.
To recall, the higher ranking, the better predictive power of given for gradient boosting model. So observing increasing lines in Ceteris Paribus profiles indicate the better rating of considered gbm models.
We start with the most important variables correspondingly to Figure 3. For shrinkage, interaction depth and number of trees hyperparameters we detect three patterns of profiles in train datasets subset. We apply hierarchical clustering for profiles and for each of three groups aggregated profiles are shown in plots 5. These groups are indicated with distinctive colors and termed as A, B and C. It is worth emphasizing that groups indicate as A for two hyperparameters may consist of different datasets because clustering was executed independently.
For each of these three hyperparameters group A has increasing CP profile with very strong trend for interior values of variables. Profiles stabilise on high prediction for larger meta-features values. Similar behaviour we can observe in group B, with this difference, that maximum predicted rating is achieved for smaller values. Group C are strictly different and the highest prediction of rankings are obtained by gbm models with rather lower number of trees, shallow trees. Because of different types of profiles for above meta-features we infer that for different dataset meta-model predict various hyperparameters as optimal.
Next meta-variable from importance ranking is relative landmark for k nearest neighbours. Corresponding CP profile is presented in Figure 6. In this case all profiles has similar shape and this is difficult to distinguish any clusters.
On every plot we observe that Ceteris Paribus profile for test dataset has similar shape to training datasets curves. So according to low MSE score on test dataset we can conclude that meta-gbm is accurate and generalise meta-problem.
Meta modeling is a promising approach to AutoML. In this work we have shown how to use XAI techniques to not only build an effective meta model, but also to extract knowledge about the importance of particular meta features in the meta model.
This contributions can be seen on two levels. First is an universal and generic approach to explainable analysis of any meta-learning model presented in Figure 1. In case of employing different types of meta-features, for instance different landmarkers models, we provided comprehensive method to investigate meta-model. This approach can be reproduced on a repository of datasets from a specific domain, or datasets of a selected size or complexity.
Second level is the application of this process to a specific investigation of predictive structure of meta-gbm model trained on OpenML datasets. Proposed meta-gbm model is generally well fitted to meta-problem what is confirmed by small value of MSE (see Table 2). Meta-model learns to propose different hyperparameters configuration as optimal. Figure 5 shows that individual CP profiles have maximum for different configurations of hyperparameters. For selected OpenML datasets and considered meta-features we conclude that statistical properties of datasets are relatively less important for meta-gbm model in presence of hyperparameters (see Figure 3). As significant dataset characteristics may be treated only this related to size of dataset. This conclusion may change when researcher define meta-problem in different way. It does not mean that no other dataset characteristic is important, surely we can use this technique to assess the predictive meta-importance of new attempts to extraction of dataset characteristics.
- Meta-Learning With Landmarking: A Survey. Technical report Technical Report 8, Vol. 105. Cited by: §1.
- Explanatory Model Analysis. External Links: Cited by: §4.
- DALEX: Explainers for Complex Predictive Models in R. Journal of Machine Learning Research 19 (84), pp. 1–5. External Links: Cited by: §4.3.
- On the predictive power of meta-features in OpenML. International Journal of Applied Mathematics and Computer Science 27 (4), pp. 697–712. External Links: Cited by: §1, §2.1.
- OpenML Benchmarking Suites. External Links: Cited by: §2.1.
- All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously. External Links: Cited by: §4.1.
- Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation. Journal of Computational and Graphical Statistics 24 (1), pp. 44–65. External Links: Cited by: §4.3.
- Interpretable Machine Learning. External Links: Cited by: §4.
- Meta-Learning by Landmarking Various Learning Algorithms. Proceedings of the Seventeenth International Conference on Machine Learning ICML2000 951 (2000), pp. 743–750. External Links: Cited by: §1.
- Tunability: Importance of Hyperparameters of Machine Learning Algorithms. Technical report Cited by: §1, §2.2.
- R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. External Links: Cited by: §2.2.
- Towards Reproducible Empirical Research in Meta-learning. Technical report External Links: Cited by: §1.
- Regression Shrinkage and Selection Via the Lasso. JOURNAL OF THE ROYAL STATISTICAL SOCIETY, SERIES B 58, pp. 267–288. Cited by: §3.
- Meta-Learning: A Survey. pp. 1–29. External Links: Cited by: §1, §1, §2.1.
- Learning hyperparameter optimization initializations. In Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015, External Links: Cited by: §1.
- Scalable Gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning 107 (1), pp. 43–78. External Links: Cited by: §1.