Towards better understanding of meta-features contributions

Towards better understanding of meta-features contributions

Abstract

Meta learning is a difficult problem as the expected performance of a model is affected by various aspects such as selected hyperparameters, dataset properties and landmarkers. Existing approaches are focused on searching for the best model but do not explain how these different aspects contribute to the performance of a model. To build a new generation of meta-models we need a deeper understanding of relative importance of meta-features and construction of better meta-features. In this paper we (1) introduce a method that can be used to understand how meta-features influence a model performance, (2) discuss the relative importance of different groups of meta-features, (3) analyse in detail the most informative hyperparameters that may result in insights for selection of empirical priors of engineering for hyperparameters. To our knowledge this is the first paper that uses techniques developed for eXplainable Artificial Intelligence (XAI) to examine the behaviour of a meta model.

Figure 1: Process of meta-model exploration. First we gather meta-features about selected datasets from OpenML repository. Then we calculate model performance on these datasets for selected configurations of hyperparameters. Second we assemble a surrogate meta-model, here the gradient boosting model. Third we use XAI techniques to extract information about relative importance of meta-features and their marginal responses.

1 Introduction

Meta-learning, or learning to learn, is a wide area of machine learning focused on extracting knowledge about building pipelines of predictive models. Transferring this experience are close to model selection problem and optimization of performance measure. Investigation involved structure of dependency between given dataset and performance of selected machine learning algorithm presumably enables choosing better configuration of model resulting in higher performance measure. This process involves advanced techniques Vanschoren (2018) and requires meta-data about analysed problems to insight into algorithm-tasks synergy. The effectiveness of meta-learning is largely dependent on type and quality of meta-features so it is crucial to consider broad range of variables.

The primary sources of meta-features that feasibly predict model performance, are dataset characteristics. They are important because of better understanding of structure of data but also potential application in assessment similarity of tasks. Comprehensive summary of commonly used meta-features is provided in Vanschoren (2018). These characteristics are available for datasets uploaded to OpenML. In this overview, depending on definitions, features are grouped in five clusters: simple, statistical, information-theoretic, complexity, model-based, and landmarkers. Detailed description of these groups may be found in Rivolli et al. (2018). In Bilalli et al. (2017) is shown that selection of the most relevant data characteristics is related to definition of meta-model, especially to meta-response which is preferred performance measure. This study was deficient as concentrating only on statistical and information-theoretic properties. Due to high computational cost of extracting, model-based and landmarkers meta-data were neglected.

The idea of landmarking is an approach to represent dataset in meta-space by a vector of performance measures of simple and effective models. Rather than trust in statistical characterizations landmarkers, models helps to uncover more sophisticated structure of task. This method was introduced by Pfahringer et al. (2000). Various models has been proposed as landmarkers Balte (2014) but one of requirements are diverse architectures of algorithms capturing different variables interrelationship.

Except for dataset properties highly informative for meta-model may be detailed model configuration - settings of hyperparameters. It is observed that some algorithms are prone to tunability of specified parameters Probst et al. (2018).For that reason automatic optimization of hyperparametrs in models are popular topic in research. One of approach to this automation is building surrogate model on historical datasets, their properties and model hyperparameters space (Wistuba et al., 2015, 2018). These paper are focused on yielding from that meta-model the best hyperparameters configuration for the new task rather than analysing the impact of individual meta-features.

1.1 Our contributions

This paper will be divided into three parts (see Figure 1). Section 2 appeal to collecting meta-features to meta-learning problem. Next part is concentrated on selection of meta-model. In section 4 we present reservoir of eXplainable AI (XAI) tools to examine built surrogate model structure. We draw conclusions about importance of three types meta-features: data properties, landmarkers and hyperparameters.

Methods proposed in third section are universal and can be applied to diverse definition of meta-learning problem with different definition of meta-fetures space. That work may be perceived as sort of case study of application of these methods. For the case study we used the OpenML 100 database.

2 Meta-learning

Meta-learning is multi-stage process. In our study we distinguish two main steps:

  1. establishment of meta-space for meta-learning problem:

    • meta-features:

      • for every dataset 61 statistical properties have been computed in OpenML,

      • gradient boosting algorithm has determined space of 5 considered hyperparameters,

      • 4 landmarkers from baseline models.

    • meta-response: predictive score for every pair of selected machine learning model and dataset

  2. training and selection of meta-learning surrogate model using one-dataset-out cross-validation.


Hyperparameters for GBM model Landmarks
dataset (id) shrink. inter.
depth n.trees bag
fract. min. node knn glmnet ranger random forest
diabetes (37) 0.00 4 1480 0.69 7 1.10 2.25 2.36 2.30
spambase (44) 0.04 5 1414 0.98 16 2.97 4.78 7.57 7.74
ada_agnostic (1043) 0.05 5 333 0.90 7 2.31 5.41 6.28 5.37
mozilla4 (1046) 0.09 4 1567 0.54 7 1.71 0.39 2.62 2.84
pc4 (1049) 0.01 5 2367 0.75 12 1.11 2.13 3.46 3.57
pc3 (1050) 0.04 1 949 0.90 11 0.86 1.34 1.85 1.88
kc2 (1063) 0.00 2 273 1.00 21 0.81 0.73 0.90 0.98
kc (1067) 0.00 3 5630 0.26 3 0.23 1.16 1.49 1.66
pc1 (1068) 0.00 4 6058 0.21 14 1.12 0.27 1.92 1.84
banknote authentication (1462) 0.03 1 8429 0.52 12 6.47 4.99 5.56 6.02
blood transfusion service center (1464) 0.01 1 394 0.21 10 0.64 1.23 0.86 0.71
climate model simulation crashes (1467) 0.00 2 654 0.26 15 0.35 1.36 1.12 1.15
eeg-eye-state (1471) 0.08 5 2604 0.28 14 2.48 0.93 3.34 4.16
hill-valley (1479) 0.08 5 2604 0.28 14 1.78 0.43 2.04 2.24
madelon (1485) 0.05 5 333 0.90 7 0.34 0.48 1.61 1.58
ozone-level-8hr (1487) 0.00 3 8868 0.33 11 1.41 3.40 4.43 4.36
phoneme (1489) 0.02 5 5107 0.60 18 4.89 2.90 7.12 8.13
qsar-biodeg (1494) 0.05 5 333 0.90 7 4.16 5.34 6.74 6.81
wdbc (1510) 0.04 1 949 0.90 11 1.56 0.63 1.88 1.91
wilt (1570) 0.00 4 6058 0.21 14 2.73 4.80 7.15 7.28
Table 1: Meta-features for selected datasets from OpenML repository. Four landmarks (relative performance to default gbm model) and five hyperparameteres for gbm model are presented. Dataset characteristics are omitted for the brevity. Only optimal hyperparameteres are preseneted for each individual dataset.

2.1 Datasets and their characterizations

In this study we have focused on OpenML100 database Bischl et al. (2017) so that considered datasets are real-world tasks. On this platform every uploaded tabular data has provided automated computing of 61 characteristics. Summary of available dataset properties is included in Vanschoren (2018). As follow in (Bilalli et al., 2017), some of statistics may be generated only for datasets which contain continuous variables, some for datasets with at least one categorical column and third group are calculated for any type of data. We have decided to analyse binary classification problem and selected 20 data sets whose all predictive variables are continuous. For these datasets are 38 available statistical and information-theoretic properties. See the Table 1 for the list of all selected datasets.

2.2 Algorithms and hyperparameters space

We have decided to explore gradient boosting classifiers (gbm) sensitivity to tunability so very algorithm was fitted on each dataset with various parameters. 100 random configurations of hyperparameters have been drawn to examine their influence model predictive power. The region, which hyperparameters are sampled, is according to Probst et al. (2018). Additionally, one of considered configurations is set of default values in R software R Core Team (2019). In this study we consider following hyperparameters for gbm model:

  • n.trees,

  • interaction.depth,

  • n.minobsinnode,

  • shrinkage,

  • bag.fraction.

2.3 Model and dataset predictive power

For every pair of algorithm with hyperparameters configuration and dataset we have specified 20 train and test data splits. Model has been fitted on each train subset and afterwards AUC was computed on the test frame.

On the base of AUC results, for individual dataset we have created ranking of models - the higher AUC, the higher position in ranking. Ratings are scaled to [0,1] interval. Every configurations for each algorithms appear in list 20 times beacuse of train-test splits. To aggregate this to one value for every model we have computed average rating for model.

2.4 Landmarkers

We have used landmarkers technique to distinct approach to characterise learning problem of given dataset. As landmarkers models we have considered five machine learning algorithms with default hyperparameter configurations:

  • generalized linear regression with regularization,

  • gradient boosting,

  • k nearest neighbours,

  • two random forest implementations: randomForest and ranger.

To evaluate their predictive power we have also applied 20 train/test split method and computed AUC score. These five algorithms have been ranked according to methodology in section 2.3. Because we are mainly concentrated on analysing gradient boosting models we have computed landmarkers as ratio of models rankings (knn, glmnet, ranger and randomForest) to default gbm configuration. In a result, we have four landmarkers meta-features.

Meta-dataset with all meta-features can be found in github repository anonymous URL.

3 Selection of the surrogate meta-model

We consider three types of surrogate model: gradient boosting with interaction depth equals to 2 (named: gbm shallow), gradient boosting with interaction depth equals to 10 (named: gbm deep) and glmnet with LASSO penalty (named: glm) Tibshirani (1994). These algorithms have different structure and employ various aspects of variables relationship in this meta-learning problem. Glmnet with LASSO penalty parameter finds linear dependence and simultaneously prevent overfitting. Gradient boosting capture different levels of interactions depending on interaction depth.

To compare the quality of meta-models we apply one-dataset-out crossvalidation: every models is trained on 19 datasets and then is tested on remaining data frame by calculating mean square error (MSE) and ratio model mean square error to MSE for model with predicted mean ranking (RATIO MSE). For test dataset we also calculate Spearman correlation between predicted rankings and actual meta-response. In Table 2 we sum up performance metrics on leave one dataset out validation. Presented values are average across all validation datasets.


meta-model mse ratio mse Spearman cor
gbm deep 0.017 -0.488 0.662
gbm shallow 0.024 -0.246 0.523
glm 0.031 0.130 0.435
Table 2: Summary of meta-model performance measures. MSE stands for mean square error (the lower the better), RATIO MSE stands for ratio model mean square error to MSE for model with predicted mean ranking (the lower the better)

Gradient boosting deep has similar mean value of MSE like gradient boosting shallow but relative improvement to MSE with predicted mean ranking is significant higher. Spearman correlation between predictions and predicted values is also the highest for gbm deep. According to above results we assume that the best meta-model is gradient boosting with interaction depth equals 10, i.e. model that may be rich in interactions between meta-features.

During the process of surrogate model selection and validation we build 20 different meta-learning gbm models, in each of them one of datasets was out of train data frame. To further analyse we pick one of these models in which dataset eeg-eye-state (1471) was used as test dataset. In following sections we mark this meta-model as meta-gbm.

4 Explanatory meta-analysis of gbm meta-model

In this part we present main contribution of this paper. We focus on understanding of the structure of meta-gbm model and insight into informative power of meta-features and their interactions. To comprehensive analyse we introduce techniques developed for eXplainable Artificial Intelligence. For more details see Biecek and Burzykowski (2020); Molnar (2019) .

4.1 Meta-features importance

First aspect is examination of influence of individual meta-variables and ordering them according to this. This investigation helps to identify presumptive noisy aspects and may be significant in deliberation to exclude these meta-features from new generations meta-models. We have used permutation based measures of variable importance Fisher et al. (2018). In Figure 3 are presented the most influential meta-features. On the x-axis there is dropout values of root mean square error score caused by permutation of consecutive variable.

Figure 2: Top 15 most important meta-features in GBM meta-model. Lengths of the bar correspond to the permutational feature importance.
Figure 3: Meta-features importance groups
Figure 2: Top 15 most important meta-features in GBM meta-model. Lengths of the bar correspond to the permutational feature importance.

The most important meta-data are shrinkage, interaction depth and number of trees. These hyperparameters are followed by landmarkers for k nearest neighbours and ranger. Datasets properties play minor role in this ranking, significant permutational feature importance is obtained by MajorityClassSize, NumberOfInstances and MinorityClassSize. These meta-features are related to dimension and balancing in dataset.

Considered meta-features form three groups because of different approach to creating them. Figure 3 presents the assessment of groups influence. As we see, the most important class of meta-features is hyperparameters and this conclusion is consistent with importance measure for individual variables. Landmarkers and dataset characteristics has similar dropout values.

4.2 Grouping meta-features

In previous section we analyse importance of three groups related to definitions of meta-features. We also apply automatically group meta-features based on meta-features correlation. Resulted dendrogram is presented in Figure 4.

Figure 4: Groups of meta-features based on meta-features correlation. Colors denote the group of meta-features.

Some of dataset statistical characteristcs are closely related and pairwise have high correlation. For example quartile of means and standard deviation create correlated cluster. Hyperparameters are uncorrelated with landmarkers and dataset properties. Moreover they have close to zero correlation within their group. This observation agrees with definition since hyperparameters were sampled independently. Landmarkers for ranger and randomForest are strongly correlated because this algorithms different implementation of the same algorithm.

4.3 Hyperparameters informativeness

So far, we investigate overall impact of explanatory variables which can be estimated as single value. Now we focus on the type of effect of selected meta-predictor and how meta-model response is sensitive to changes in this feature. In this analysis we apply Individual Conditional Expectations (ICE) Goldstein et al. (2015) and its R implementation called Ceteris Paribus (CP) (Biecek, 2018).

Figure 5: Ceteris paribus for hyperparameters. Grey solid lines correspond to profiles for dataset in train meta-data, black dotted line are profile for test dataset. Thick colored lines are aggreated profiles for datasets clusters. Color indicates groups.

ICE/CP analysis for selected hyperparameters is presented in Figure 5. In view of distribution of hyperparameters we apply loarithmic scale in x-axis. On each plot we present ceteris paribus profiles for all datasets, grey lines correspond to datasets which are in train subset of meta-data in this particular fold of one-dataset-out crossvalidation and black dotted line is related to eeg-eye-state test dataset.

For every dataset in meta-data we have 100 observations because of varying gbm hyperparameters but profiles for instances of selected dataset have similar shape and are nearly parallel. We are especially interested in curve outline since the translation may be caused by other meta-features values. So we present profiles for averaged meta-features for each dataset. In fact, we averaged only hyperparameters because dataset characteristics and landmarkers has constant values for all dataset observations.

To recall, the higher ranking, the better predictive power of given for gradient boosting model. So observing increasing lines in Ceteris Paribus profiles indicate the better rating of considered gbm models.

We start with the most important variables correspondingly to Figure 3. For shrinkage, interaction depth and number of trees hyperparameters we detect three patterns of profiles in train datasets subset. We apply hierarchical clustering for profiles and for each of three groups aggregated profiles are shown in plots 5. These groups are indicated with distinctive colors and termed as A, B and C. It is worth emphasizing that groups indicate as A for two hyperparameters may consist of different datasets because clustering was executed independently.

For each of these three hyperparameters group A has increasing CP profile with very strong trend for interior values of variables. Profiles stabilise on high prediction for larger meta-features values. Similar behaviour we can observe in group B, with this difference, that maximum predicted rating is achieved for smaller values. Group C are strictly different and the highest prediction of rankings are obtained by gbm models with rather lower number of trees, shallow trees. Because of different types of profiles for above meta-features we infer that for different dataset meta-model predict various hyperparameters as optimal.

Next meta-variable from importance ranking is relative landmark for k nearest neighbours. Corresponding CP profile is presented in Figure 6. In this case all profiles has similar shape and this is difficult to distinguish any clusters.

Figure 6: Ceteris paribus of kknn landmark Grey solid lines correspond to profiles for dataset in train meta-data, Thick red lines are Partial Dependency Plot (PDP) - aggreated profile for all datasets.

On every plot we observe that Ceteris Paribus profile for test dataset has similar shape to training datasets curves. So according to low MSE score on test dataset we can conclude that meta-gbm is accurate and generalise meta-problem.

5 Conclusions

Meta modeling is a promising approach to AutoML. In this work we have shown how to use XAI techniques to not only build an effective meta model, but also to extract knowledge about the importance of particular meta features in the meta model.

This contributions can be seen on two levels. First is an universal and generic approach to explainable analysis of any meta-learning model presented in Figure 1. In case of employing different types of meta-features, for instance different landmarkers models, we provided comprehensive method to investigate meta-model. This approach can be reproduced on a repository of datasets from a specific domain, or datasets of a selected size or complexity.

Second level is the application of this process to a specific investigation of predictive structure of meta-gbm model trained on OpenML datasets. Proposed meta-gbm model is generally well fitted to meta-problem what is confirmed by small value of MSE (see Table 2). Meta-model learns to propose different hyperparameters configuration as optimal. Figure 5 shows that individual CP profiles have maximum for different configurations of hyperparameters. For selected OpenML datasets and considered meta-features we conclude that statistical properties of datasets are relatively less important for meta-gbm model in presence of hyperparameters (see Figure 3). As significant dataset characteristics may be treated only this related to size of dataset. This conclusion may change when researcher define meta-problem in different way. It does not mean that no other dataset characteristic is important, surely we can use this technique to assess the predictive meta-importance of new attempts to extraction of dataset characteristics.

References

  1. Meta-Learning With Landmarking: A Survey. Technical report Technical Report 8, Vol. 105. Cited by: §1.
  2. Explanatory Model Analysis. External Links: Link Cited by: §4.
  3. DALEX: Explainers for Complex Predictive Models in R. Journal of Machine Learning Research 19 (84), pp. 1–5. External Links: Link Cited by: §4.3.
  4. On the predictive power of meta-features in OpenML. International Journal of Applied Mathematics and Computer Science 27 (4), pp. 697–712. External Links: Document, ISSN 20838492 Cited by: §1, §2.1.
  5. OpenML Benchmarking Suites. External Links: Link Cited by: §2.1.
  6. All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously. External Links: Link Cited by: §4.1.
  7. Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation. Journal of Computational and Graphical Statistics 24 (1), pp. 44–65. External Links: Link, Document, ISSN 1061-8600 Cited by: §4.3.
  8. Interpretable Machine Learning. External Links: Link Cited by: §4.
  9. Meta-Learning by Landmarking Various Learning Algorithms. Proceedings of the Seventeenth International Conference on Machine Learning ICML2000 951 (2000), pp. 743–750. External Links: Link, ISBN 1558607072, ISSN 00071536 Cited by: §1.
  10. Tunability: Importance of Hyperparameters of Machine Learning Algorithms. Technical report Cited by: §1, §2.2.
  11. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. External Links: Link Cited by: §2.2.
  12. Towards Reproducible Empirical Research in Meta-learning. Technical report External Links: Link Cited by: §1.
  13. Regression Shrinkage and Selection Via the Lasso. JOURNAL OF THE ROYAL STATISTICAL SOCIETY, SERIES B 58, pp. 267–288. Cited by: §3.
  14. Meta-Learning: A Survey. pp. 1–29. External Links: Link Cited by: §1, §1, §2.1.
  15. Learning hyperparameter optimization initializations. In Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015, External Links: ISBN 9781467382731, Document Cited by: §1.
  16. Scalable Gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning 107 (1), pp. 43–78. External Links: Link, Document, ISSN 15730565 Cited by: §1.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
407654
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description