Oboe: Collaborative Filtering for AutoML Initialization
Algorithm selection and hyperparameter tuning remain two of the most challenging tasks in machine learning. The number of machine learning applications is growing much faster than the number of machine learning experts, hence we see an increasing demand for efficient automation of learning processes. Here, we introduce Oboe, an algorithm for time-constrained model selection and hyperparameter tuning. Taking advantage of similarity between datasets, Oboe finds promising algorithm and hyperparameter configurations through collaborative filtering. Our system explores these models under time constraints, so that rapid initializations can be provided to warm-start more fine-grained optimization methods. One novel aspect of our approach is a new heuristic for active learning in time-constrained matrix completion based on optimal experiment design. Our experiments demonstrate that Oboe delivers state-of-the-art performance faster than competing approaches on a test bed of supervised learning problems.
Machine learning and data science experts find it difficult to select algorithms and hyperparameter settings suitable for a given dataset; for novices, the challenge is even greater. The large number of algorithms, and the sensitivity of these methods to hyperparameter values, makes it practically infeasible to enumerate all possible configurations. To surmount these challenges, the field of Automated Machine Learning (AutoML) seeks to efficiently automate the selection of model configurations, and has attracted increasing attention in recent years.
We propose an algorithmic system, Oboe, that (like its orchestral counterpart) provides an initial tuning for AutoML. Oboe complements existing AutoML techniques by selecting among algorithm types and providing promising initializations for hyperparameters, all within a fixed time budget. Oboe predicts the performance of model configurations on a dataset based on the performance of these models on similar datasets: it is a recommender system for algorithms. To ensure a good solution is found within the required time, Oboe predicts which algorithm will perform the best within a fixed (often quite short) time. As subproblems, the system predicts the runtime of each algorithm on a new dataset, and quantifies the information to be gained about a dataset by running a particular algorithm. In contrast to other work in AutoML, Oboe dedicates the entire time budget to exploring promising models, rather than computing dataset meta-features balte2014meta (), e.g., skewness or kurtosis. With this approach, Oboe is able to select well-performing models from a (customizable) large set within a short period of time. Our system is compatible with, and speeds up, any hyperparameter tuning method that further exploits the initialization, such as scalable Gaussian processes gardner2018product (). More broadly, the challenge of time-constrained collaborative filtering has not been well studied, and may be of interest for applications outside of AutoML.
Our method begins by constructing an error matrix, typically with several hundred rows and columns, for some base set of algorithms (columns) and datasets (rows). Each entry records the performance of one machine learning model (an algorithm together with its hyperparameters) on one dataset; for example, one entry might be the cross-validated error obtained by using a 2-nearest-neighbor classifier to make predictions for a particular dataset. Each row, called a performance vector, records the performance of every model on a specific dataset. We also record how long it takes to run each model on each dataset, and fit a model to predict runtime based on the number of samples and features in the dataset.
We view a new dataset as a new row in this error matrix. To find the best model for this new dataset, we run models corresponding to an informative subset of the columns on the new dataset. By choosing informative models predicted to run quickly, we can ensure this step is as fast as required. We then predict the missing entries in the row, which correspond to models whose performance has not been evaluated on the current dataset. This procedure is promising in two respects. First, it streamlines the process of choosing hyperparameters of any type (categorical, boolean, numerical, etc.). Second, it infers the performance of a vast number of hyperparameter settings without running most models, or even computing meta-features of the dataset. Hence Oboe can provide a reasonable model within the specified time constraint.
2 Related work
This work addresses two important problems in AutoML: we consider 1) time-constrained initialization: how to choose a promising model class for a given problem, together with initial values for hyperparameters, within a short time budget, and 2) active learning: how to improve the model given further computational resources.
The first subproblem is important because the spaces and landscapes of different hyperparameters have heterogeneous shapes and are poorly understood. The “no free lunch” theorem wolpert1996lack () for AutoML states that it is impossible to select a model that always performs well; there is no efficiently computable metric that predicts model performance.
One solution to the problem is to adopt a collaborative filtering approach bardenet2013collaborative (); stern2010collaborative (); yogatama2014efficient (); misir2017alors (): we suppose we have a collection of datasets, called training datasets, for which additional information (e.g., model performance) is available. Model performance on a new dataset, called the test dataset, can be inferred using its similarity to the training set.
It is important, in the collaborative filtering setting, to characterize dataset similarity. We hope that, for some similarity metric, training datasets similar to the test dataset will faithfully predict model performance. One line of work makes use of dataset meta-features pfahringer2000meta (); feurer2014using (); feurer2015efficient (): simple, statistical or landmarking metrics for dataset characterization balte2014meta (). Other approaches wistuba2015learning () avoid meta-features and instead use measures based on similarities in model performance, such as the Kendall rank correlation coefficient kendall1938new (). Our approach builds on both of these lines of research by treating low rank representations of model performance vectors as latent meta-features, and thereby eliminates the need to compute meta-features: our approach relies exclusively on model performance to compute dataset similarity.
The active learning subproblem is to select algorithms and hyperparameters to evaluate, with the hope of selecting the best model or gaining the most information to guide model selection in subsequent steps. The crux of this problem is to make accurate predictions of model performance on new datasets. Most approaches to this problem choose a function class which is hoped to capture the dependence of model performance on hyperparameters; they fit a surrogate model to observed performance and use it to choose new models (together with hyperparameters) to sample. Gaussian processes williams2006gaussian (); snoek2012practical () are widely used as a surrogate model, together with a variety of criteria which guide the selection of the next sample point, including Expected Improvement bergstra2011algorithms (); fusi2017probabilistic (), information-theoretic criteria (such as entropy sebastiani2000maximum (); herbrich2003fast () and information gain mackay1992information ()) and bandit regret srinivas2009gaussian (); other surrogate models of note include sparse Boolean functions hazan2017hyperparameter (), and decision trees bartz2004tuning (); hutter2011sequential (). Our Oboe system uses the set of low rank matrices as its surrogate model: a flexible and rather general model class that has the advantage of simplicity and enjoys the speed of well-developed numerical linear algebra algorithms.
One key component of our system is a model which predicts the runtime of supervised learning algorithms on new training datasets. Many authors have previously studied algorithm runtime prediction as a function of dataset features and algorithm hyperparameters hutter2014algorithm (), via ridge regression huang2010predicting (), neural networks smith2011discovering (), Gaussian processes hutter2006performance (), etc. Several measures have been proposed to trade-off between accuracy and runtime leite2012selecting (); bischl2017mlrmbo (). For the purposes of this paper, we use a particularly simple (but surprisingly effective) model, trained independently for each supervised learning algorithm, which predicts algorithm runtime as a function of the number of samples and features in the dataset.
In a linear model, the classical experiment design approach wald1943efficient (); mood1946hotelling (); kiefer1960equivalence (); john1975d (); pukelsheim1993optimal (); boyd2004convex () selects features that minimize the variance of the parameter estimate. Constraints on the features can be added, making the approach suitable for problems with budget constraints krause2008near (); zhang2016flash (). Here, we adopt this approach to select promising machine learning models whose training on the new dataset is predicted to finish within the time budget.
3 Notation and terminology
Meta-learning refers to the process of learning across individual datasets or problems, which are subsystems on which standard machine learning is performed lemke2015metalearning (). Just as standard machine learning must carefully structure computations to avoid overfitting, experiments testing the performance of an AutoML system must avoid meta-overfitting! Hence we begin our experiments by dividing our set of datasets into a meta-training set, a meta-validation set, and a meta-test set, and report results on the meta-test set, as shown in Figure 2. Each of the three phases in meta-learning — meta-training, meta-validation and meta-test — is a (standard) learning process that includes training, validation and test.
All vectors in this paper are column vectors. Given a matrix $A$, $A_{i,:}$ and $A_{:,j}$ denote the $i$th row and $j$th column of $A$, respectively. We define $[n] = \{1, 2, \ldots, n\}$ for $n \in \mathbb{N}$. Given an ordered set $S = \{s_1, \ldots, s_m\} \subseteq [n]$, we write $A_{:,S} = [A_{:,s_1}, \ldots, A_{:,s_m}]$.
A model is a specific algorithm-hyperparameter combination, e.g., $k$-NN with a fixed number of neighbors $k$. For a specific dataset $D$, the error of a model on $D$ denotes the expected cross-validation error of that model on $D$, in terms of the error metric in use. The expectation is with respect to the way we split the dataset into different folds. The best model on $D$ is the model in our collection of models that achieves minimum error on $D$. A model is said to be observed on $D$ if we actually compute its cross-validation error on $D$.
We distinguish among the following three categories of parameters:
Parameters: Model parameters of machine learning algorithms, obtained from training models with fixed hyperparameters.
Hyperparameters: Hyperparameters of machine learning algorithms that influence the way the algorithm learns parameters from training samples, e.g. the number of nearest neighbors, $k$, in the $k$-nearest neighbors algorithm.
Hyper-hyperparameters: Settings of the Oboe algorithm, e.g. the rank used to approximate the error matrix.
4.1 Unsupervised inference of model performance
While the distance between meta-feature vectors is an informative measure of dataset similarity, it is often difficult to determine a priori which meta-features to use, and the computation of meta-features can be expensive. To infer how models will perform on a dataset without computing any meta-feature that takes longer than reading in the dataset (exceptions include, e.g., the number of data points and the number of features), we use collaborative filtering.
We construct an error matrix $E \in \mathbb{R}^{n \times m}$ over $n$ datasets and $m$ models, where each row is indexed by a dataset and each column by a model. Empirically, $E$ has approximately low rank: Figure 2 shows the singular values $\sigma_i$ as a function of the index $i$. Each entry of $E$ is generated using 5-fold cross-validation. This is an unbiased estimate of the true error matrix of expected cross-validation errors, and motivates the use of low rank factorization as a denoising method.
Hence, we approximate its entries as $E_{ij} \approx x_i^\top y_j$, where the $x_i$ and $y_j$ are the minimizers of $\sum_{i=1}^{n} \sum_{j=1}^{m} (E_{ij} - x_i^\top y_j)^2$ with $x_i \in \mathbb{R}^k$ for $i \in [n]$ and $y_j \in \mathbb{R}^k$ for $j \in [m]$; the solution is given by Principal Component Analysis (PCA). In the context of collaborative filtering, each $x_i$ can be interpreted as the latent meta-features of dataset $i$, while each $y_j$ can be interpreted as the latent meta-features of model $j$.
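As a concrete sketch (not the paper's implementation): with a synthetic matrix standing in for the real error matrix, and a plain truncated SVD standing in for PCA on centered data, the rank-$k$ factorization can be computed as follows.

```python
import numpy as np

def low_rank_factors(E, k):
    """Rank-k factorization E ~ X^T Y via truncated SVD: column i of X holds
    the latent meta-features of dataset i, column j of Y those of model j."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    X = (U[:, :k] * np.sqrt(s[:k])).T          # k x n
    Y = np.sqrt(s[:k])[:, None] * Vt[:k]       # k x m
    return X, Y

# synthetic "error matrix" of exact rank 4 over 20 datasets and 30 models
rng = np.random.default_rng(0)
E = rng.random((20, 4)) @ rng.random((4, 30))
X, Y = low_rank_factors(E, k=4)
err = np.linalg.norm(E - X.T @ Y) / np.linalg.norm(E)
```

For a matrix of exact rank $k$ the reconstruction is exact; on the real error matrix, which is only approximately low rank, the truncation instead acts as a denoiser.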
Given a meta-test dataset, we choose a subset $S \subseteq [m]$ of models and observe the performance $e_j$ of model $j$ for each $j \in S$, where $S$ is determined by properties of the dataset and the runtime budget. We then infer the latent meta-features of that dataset by solving the least squares problem $\hat{x} = \mathop{\mathrm{argmin}}_{x \in \mathbb{R}^k} \sum_{j \in S} (e_j - x^\top y_j)^2$. For all unobserved models, we predict their performance as $\hat{e}_j = \hat{x}^\top y_j$ for $j \notin S$.
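A minimal self-contained sketch of this inference step, with synthetic latent model features $y_j$ (generated here for illustration) in place of those learned from the real error matrix:

```python
import numpy as np

def infer_row(Y, observed_idx, observed_errs):
    """Estimate latent meta-features x_hat of a new dataset from a few
    observed model errors, then predict the whole error row."""
    Ys = Y[:, observed_idx]                          # k x |S|
    x_hat, *_ = np.linalg.lstsq(Ys.T, observed_errs, rcond=None)
    return x_hat, x_hat @ Y                          # x_hat and predicted row

# toy setup: latent model features Y and a new dataset with latent x_true
rng = np.random.default_rng(1)
k, m = 3, 12
Y = rng.standard_normal((k, m))
x_true = rng.standard_normal(k)
e_true = x_true @ Y                                  # noiseless "observations"
S = [0, 4, 7, 9]                                     # models we actually ran
x_hat, e_hat = infer_row(Y, S, e_true[S])
```

With noiseless observations and enough observed models ($|S| \ge k$ in general position), the least squares solution recovers the latent features exactly.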
The remaining challenge in designing our algorithm is to choose $S$ and $k$. The subset of models that we choose to observe, $S$, must capture as much information about the meta-test dataset as possible while obeying time constraints; we describe the algorithm for selecting $S$ in Section 4.3. The approximate rank $k$ of the error matrix is a hyper-hyperparameter; we describe how to select it in Section 5.2.2.
4.2 Runtime prediction
When operating under time constraints, it is essential to estimate the runtime of each model. These estimates allow us to balance the tradeoff between choosing to observe slow, informative models and fast, less informative models. However, it is challenging to estimate the runtime of a specific model on a particular dataset: the runtime depends not only on the theoretical time complexity of the algorithm, but also other factors (such as the distribution of data points) that are not easy to calculate.
However, we observe that we are able to predict the runtime of half of the machine learning models within a factor of two on more than 75% of OpenML classification datasets with between 150 and 10,000 data points and with no missing entries, as shown in Table 1 and visualized in Appendix B, Figures 5 and 6. Our method uses polynomial regression: the theoretical complexities of the machine learning algorithms we use here are polynomial in $n_D$ and $p_D$, the number of data points and the number of features in dataset $D$, respectively. Hence we fit an independent polynomial regression model for each model $j$:
$$f_j = \mathop{\mathrm{argmin}}_{f \in \mathcal{F}} \sum_{D} \big(f(n_D, p_D) - t_{D,j}\big)^2,$$
where $t_{D,j}$ is the runtime of model $j$ on dataset $D$, and $\mathcal{F}$ is the set of all polynomials of order no more than 3. We denote this procedure by fit_runtime.
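A sketch of this runtime model on synthetic data, assuming a plain least-squares fit over all monomials of total degree at most 3 (the paper's exact feature construction and fitting details may differ); the column normalization is only for numerical conditioning:

```python
import numpy as np

def poly_features(n, p, degree=3):
    """All monomials n^a * p^b with a + b <= degree, including the constant."""
    cols = [np.ones_like(n)]
    for d in range(1, degree + 1):
        for a in range(d + 1):
            cols.append(n**a * p**(d - a))
    return np.column_stack(cols)

def fit_runtime(n, p, t):
    """Least-squares fit of observed runtimes t against polynomial features
    of (number of data points n, number of features p)."""
    A = poly_features(n, p)
    scale = np.linalg.norm(A, axis=0)       # normalize columns for conditioning
    coef, *_ = np.linalg.lstsq(A / scale, t, rcond=None)
    coef = coef / scale
    return lambda nn, pp: poly_features(nn, pp) @ coef

# synthetic datasets and runtimes following a low-order polynomial law
rng = np.random.default_rng(2)
n = rng.integers(150, 10_000, size=60).astype(float)
p = rng.integers(2, 100, size=60).astype(float)
t = 1e-8 * n**2 + 3e-7 * n * p + 0.05       # hypothetical runtimes in seconds
predict = fit_runtime(n, p, t)
```

On data generated by such a polynomial, the fitted predictor recovers runtimes well within the factor-of-two accuracy the paper reports.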
Runtime prediction accuracy:

| Algorithm type | within factor of 2 | within factor of 4 |
|----------------|--------------------|--------------------|
| Gaussian naive Bayes | 89.6% | 96.7% |
4.3 Time-constrained information gathering
To select the models to run on a new (meta-test) dataset within a time budget $\tau$, we adopt an approach that builds on classical ideas in experiment design: we suppose fitting each model returns a linear measurement of the true latent meta-features of the dataset, corrupted by Gaussian noise. The D-optimal design chooses which models we should fit by defining an indicator vector $v \in \{0,1\}^m$, where entry $v_j$ indicates whether or not to fit model $j$, and minimizing a scalarization of the covariance of the estimated meta-features of the new dataset subject to constraints on runtime wald1943efficient (); mood1946hotelling (); kiefer1960equivalence (); john1975d (); pukelsheim1993optimal (); boyd2004convex (). Let $t_j$ denote the predicted runtime of model $j$ on the meta-test dataset, and let $y_j$ denote its latent meta-features, for $j \in [m]$. Now we relax $v$ to allow for non-Boolean values and solve the optimization problem
$$\begin{aligned} \text{minimize} \quad & \log\det\Big(\sum_{j=1}^{m} v_j\, y_j y_j^\top\Big)^{-1} \\ \text{subject to} \quad & \sum_{j=1}^{m} v_j t_j \le \tau, \qquad 0 \le v_j \le 1, \; j \in [m]. \end{aligned}$$
This relaxed problem is a convex optimization problem, and we obtain an approximate solution to the D-optimal design problem via rounding: let $S$ be the set of indices $j$ such that $v_j$ rounds to 1; these are the models we choose to observe.
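The paper solves the convex relaxation above and rounds. As a lightweight illustration of the same log-det objective (an alternative heuristic, not the paper's solver), the following greedy sketch adds, at each step, the model whose rank-one update of the information matrix $\sum_{j \in S} y_j y_j^\top$ yields the largest log-det gain per unit of predicted runtime:

```python
import numpy as np

def greedy_d_optimal(Y, runtimes, budget, ridge=1e-6):
    """Greedy stand-in for time-constrained D-optimal design."""
    k, m = Y.shape
    M = ridge * np.eye(k)                 # small ridge keeps log-det finite
    chosen, spent = [], 0.0
    while True:
        _, logdet = np.linalg.slogdet(M)
        best, best_gain = None, 0.0
        for j in range(m):
            if j in chosen or spent + runtimes[j] > budget:
                continue
            gain = (np.linalg.slogdet(M + np.outer(Y[:, j], Y[:, j]))[1]
                    - logdet) / runtimes[j]
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:                  # nothing informative fits the budget
            return chosen
        chosen.append(best)
        M += np.outer(Y[:, best], Y[:, best])
        spent += runtimes[best]

# hypothetical latent model features and predicted runtimes
rng = np.random.default_rng(3)
Y = rng.standard_normal((3, 10))
t = rng.uniform(1.0, 5.0, size=10)
S = greedy_d_optimal(Y, t, budget=8.0)
```

The greedy loop always respects the budget constraint and returns a set of distinct model indices.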
The Oboe system can be divided into two stages: offline and online. The offline stage is executed only once and stores information on meta-training datasets. Time taken on this stage does not affect the prediction of Oboe on a new dataset; the runtime experienced by the user is that of the online stage.
5.1 Offline stage
Error matrix generation
The $(i, j)$th entry of the error matrix $E$, denoted $E_{ij}$, records the performance of the $j$th model on the $i$th meta-training dataset. The metric we use here to characterize model performance is the balanced error rate, which is the average of false positive and false negative rates across different classes. This balances the influence of classes with different sizes. Our experiments have shown that the error matrix under this metric is approximately low rank, which serves as the foundation of our algorithm.
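For concreteness, this metric can be computed as follows; the sketch uses one common definition (one minus balanced accuracy, i.e. the average of the per-class error rates), which for binary labels coincides with averaging the false positive and false negative rates:

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Average over classes of the error rate restricted to each class
    (equivalently, one minus balanced accuracy)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] != c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# a trivial classifier that always predicts the majority class still has a
# balanced error rate of 0.5, however imbalanced the data
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)
```

This is exactly why the metric is preferred here over plain accuracy: a majority-class predictor cannot score well on imbalanced classes.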
We also record the runtime of fitting each model on each dataset in a runtime matrix $T$, whose entry $T_{ij}$ represents the runtime of the $j$th model on the $i$th dataset. This matrix is used to fit the runtime predictor described in Section 4.2. Pseudocode for the offline stage is shown as Algorithm 1.
5.2 Online stage
5.2.1 Ensemble selection
We first predict the runtime of each model on the test dataset using the runtime predictors computed in the offline stage. Then we use experiment design to select a subset of entries of $e$, the error vector of the test dataset, to observe. The observed entries are used to compute $\hat{x}$, an estimate of the latent meta-features of the test dataset, which we then use to predict every entry of $e$.
We next build an ensemble out of models predicted to perform well, to facilitate comparison with other methods for AutoML feurer2015efficient (). This step outputs a classifier that gives predictions of labels on individual data points, and can be placed after further fine-grained hyperparameter tuning. We use a standard method for building an ensemble, proposed by Caruana et al. caruana2004ensemble (); caruana2006getting (), which starts from a few base learners and greedily adds more if that improves training accuracy. We denote this subroutine as ensemble_selection, which takes as input the set of base learners together with their cross-validation errors and predicted labels, and outputs the ensemble learner. These steps are presented in Algorithm 2 as the fit function.
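A minimal sketch of this greedy forward selection for binary labels, using majority voting to combine base learners; the full procedure of Caruana et al. includes refinements (e.g., bagged selection over subsets of models) omitted here:

```python
import numpy as np

def ensemble_selection(preds, y_val, n_rounds=10):
    """Greedy forward selection (with replacement): each round, add the base
    learner whose inclusion most improves majority-vote validation accuracy."""
    ensemble, best_acc = [], -1.0
    for _ in range(n_rounds):
        round_best, round_acc = None, best_acc
        for j in range(len(preds)):
            votes = np.mean([preds[i] for i in ensemble + [j]], axis=0)
            acc = np.mean((votes >= 0.5).astype(int) == y_val)
            if acc > round_acc:
                round_best, round_acc = j, acc
        if round_best is None:            # no addition improves accuracy
            break
        ensemble.append(round_best)
        best_acc = round_acc
    return ensemble

# toy base learners: model 0 is perfect, the others are not
y_val = np.array([0, 1, 1, 0, 1])
preds = [y_val.copy(), 1 - y_val, np.ones(5, dtype=int)]
ens = ensemble_selection(preds, y_val)
```

With a perfect base learner available, the greedy loop selects it immediately and stops once no further addition strictly improves validation accuracy.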
5.2.2 Budget doubling
To select the rank $k$, Oboe starts with a small initial rank along with a small time budget, and then doubles the time budget for the above fit subroutine until it reaches half of the total budget. The rank increments by 1 if the validation error of the ensemble learner decreases after doubling the budget, and otherwise does not change. Since the factors returned by PCA with rank $k$ are submatrices of those returned by PCA with any rank $k' > k$, we can compute the rank-$k$ factors as submatrices of the factors returned by PCA with full rank. The pseudocode is shown as Algorithm 3.
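The doubling loop can be sketched as follows; the initial budget fraction and the mock fit subroutine below are illustrative assumptions, not values from the paper:

```python
def budget_doubling(fit, total_budget, initial_rank=2, initial_frac=1 / 16):
    """Sketch of the doubling loop: call `fit` with time budget t and rank k,
    doubling t until it exceeds half the total budget; the rank grows by one
    whenever the previous round improved the validation error."""
    t, k = total_budget * initial_frac, initial_rank
    prev_err, ensemble = float("inf"), None
    while t <= total_budget / 2:
        ensemble, err = fit(time_budget=t, rank=k)
        if err < prev_err:
            k += 1
        prev_err = err
        t *= 2
    return ensemble

# mock fit: error shrinks as the budget grows; tags the ensemble with its rank
mock_fit = lambda time_budget, rank: (f"rank{rank}", 1.0 / time_budget)
result = budget_doubling(mock_fit, total_budget=16.0)
```

With this mock, each round improves the error, so the rank increments every round and the final ensemble is the one fit in the last round within half the total budget.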
6 Experimental evaluation
We test different AutoML systems on OpenML OpenML2013 () and UCI Dua:2017 () classification datasets with between 150 and 10,000 data points and with no missing entries, which we call the selected OpenML and UCI datasets in what follows. Since data pre-processing is not our focus, we pre-process all datasets in the same way: we one-hot encode categorical features, and then standardize all features to have zero mean and unit variance. These pre-processed datasets are used in all of the following experiments.
Numerical results demonstrate the robust performance of hyper-hyperparameter choices, including cold-start method and error metric. In terms of actual performance, we compare Oboe with a state-of-the-art AutoML system which selects among different algorithm types and tunes hyperparameters within a fixed time budget, auto-sklearn feurer2015efficient (), and with one random baseline, which replaces the selection of models in Oboe with a random selection of models that can be trained within the same time budget.
6.1 Hyper-hyperparameter choice
6.1.1 Cold-start functionality
Oboe uses D-optimal experiment design as the cold-start method to select models to evaluate on new datasets. In Figure 3 and Table 2, we compare this approach with A- and E-optimal experiment design and with nonlinear regression using Alors misir2017alors (), by means of leave-one-out cross-validation on the selected OpenML datasets. We consider performance measured by the relative RMSE of the predicted performance vector and by the number of correctly predicted best models, both averaged across datasets. The approximate rank of the error matrix is set to the number of singular values larger than 1% of the largest, which is 38 here. The time limit in the experiment design implementation is set to 4 seconds; the meta-features used in the Alors implementation are listed in Appendix C; the nonlinear regressor used in the Alors implementation is the default RandomForestRegressor in scikit-learn 0.19.1 scikit-learn ().
We observe that the experiment design approach robustly outperforms nonlinear regression in predicting the best models.
6.1.2 Error metric
Oboe uses the balanced error rate to construct the error matrix, and works on the premise that the error matrix can be approximated by a low rank matrix. However, there is nothing special about the balanced error rate: indeed, most metrics used to measure errors result in an approximately low rank error matrix. For example, when using the AUC metric to measure error, the 418-by-219 error matrix from the selected OpenML datasets has only 38 singular values greater than 1% of the largest, and 12 greater than 3%.
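This approximate-rank computation is easy to reproduce; on a synthetic noisy low rank matrix with a controlled spectrum (a stand-in for a real error matrix), it recovers the planted rank even though the numerical rank is full:

```python
import numpy as np

def approximate_rank(E, threshold=0.01):
    """Number of singular values above `threshold` times the largest one."""
    s = np.linalg.svd(E, compute_uv=False)
    return int(np.sum(s > threshold * s[0]))

# planted rank-5 matrix with singular values 10, 8, 6, 4, 2, plus small noise
rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((100, 5)))
V, _ = np.linalg.qr(rng.standard_normal((80, 5)))
E = (U * [10.0, 8.0, 6.0, 4.0, 2.0]) @ V.T + 1e-4 * rng.standard_normal((100, 80))
```

The noise singular values sit far below 1% of the largest, so only the five planted directions survive the threshold.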
6.2 Performance comparison
For our implementation of Oboe, we use the algorithm types and hyperparameter ranges listed in Appendix A, Table 3 as columns in the error matrix. We chose to vary the hyperparameters people usually tune, and picked their ranges to contain the values people usually use; we have not optimized over them. The datasets we selected as rows are 418 pre-processed OpenML datasets with between 150 and 10,000 data points, with no missing entries, and for which the total runtime of all models is less than 60,000 seconds. The (rapidly decaying) singular values of this error matrix are shown in Figure 2.
We compare with auto-sklearn+meta-learning+ensemble, using the method autosklearn.classification.AutoSklearnClassifier in auto-sklearn 0.3.0 feurer2015efficient ().
As a random baseline, we replace the experiment design subroutine of Oboe with a time-constrained random selection method: we randomly select a subset of models that we predict will complete within a particular time limit. We assign half of the budget in each budget doubling round to this random selection subroutine, during which we sequentially pick models that are predicted to finish within the remaining budget. These models are observed and used to infer the rest of the error vector of the test dataset.
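A sketch of this time-constrained random selection, with hypothetical predicted runtimes:

```python
import numpy as np

def random_time_constrained(pred_runtimes, budget, rng):
    """Random baseline: scan models in random order, keeping each one whose
    predicted runtime still fits within the remaining budget."""
    order = rng.permutation(len(pred_runtimes))
    chosen, left = [], budget
    for j in order:
        if pred_runtimes[j] <= left:
            chosen.append(int(j))
            left -= pred_runtimes[j]
    return chosen

t = np.array([3.0, 1.0, 2.0, 5.0])      # hypothetical predicted runtimes
S = random_time_constrained(t, budget=4.0, rng=np.random.default_rng(5))
```

Like the experiment design subroutine it replaces, the baseline never exceeds its budget; it simply ignores how informative each model is.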
6.3 Experimental setup
We limited the types of algorithms each system can use to Adaboost, Gaussian naive Bayes, extra trees, gradient boosting, linear and kernel SVMs, random forest, k-nearest neighbors and decision trees. Our system does not perform further hyperparameter optimization after selecting a subset of models and forming an ensemble, while we do allow the auto-sklearn package to perform hyperparameter optimization after model selection to ensure it makes full use of the time budget. This choice gives auto-sklearn a slight advantage over Oboe.
We ran all experiments on a server with 128 Intel® Xeon® E7-4850 v4 2.10GHz CPU cores and 1056GB memory. The process of running each system on a specific dataset is limited to a single CPU core. Figure 4 shows the percentile and ranking (1 is best and 3 is worst) changes of prediction errors as runtime repeatedly doubles. Until the first time when the system can produce a model, we classify every data point with the most common class label.
We make several observations on the results of these experiments:
The quality of the initial models computed by Oboe and by auto-sklearn are comparable; but Oboe computes its first nontrivial model more than 8× faster than auto-sklearn (Figures 3(a) and 3(b)). In contrast, auto-sklearn must first compute meta-features for each dataset, which requires substantial computational time, as shown in Appendix D.
In general, the test error of Oboe decreases as the time budget increases. Interestingly, the Oboe models improve with time faster than the auto-sklearn models do. This observation indicates that increased computational time may be better spent fitting more models than running standard hyperparameter optimization methods, such as Gaussian processes, to which auto-sklearn devotes the remaining time.
Experiment design leads to better results than random selection in almost all cases.
Oboe is an AutoML system that uses ideas from collaborative filtering and optimal experiment design to exploit dataset and algorithm similarity to predict the performance of machine learning models. By fitting a few selected models on the test dataset, this system transfers knowledge from the training datasets to select a good set of models. Oboe naturally handles different types of hyperparameters and can match the performance of state-of-the-art AutoML systems much more quickly than competing approaches.
This work demonstrates the promise of collaborative filtering approaches to AutoML. However, many improvements are possible. Future work to adapt Oboe to different loss metrics, budget types, sparse error matrices and a wider range of machine learning algorithms, as well as to augment the initializations given by Oboe with fine tuning by any state-of-the-art hyperparameter optimization method, may yield substantial practical improvements. Furthermore, we look forward to seeing more approaches to the challenge of choosing hyper-hyperparameter settings subject to limited computation and data. With continuing efforts by the AutoML community, we look forward to a world in which domain experts seeking to use machine learning can focus on issues of data quality and problem formulation, rather than on tasks — such as algorithm selection and hyperparameter tuning — which are suitable for automation.
This work was supported in part by DARPA Award FA8750-17-2-0101. The authors thank Christophe Giraud-Carrier, Ameet Talwalkar, Raul Astudillo Marban, Matthew Zalesak, Lijun Ding and Davis Wertheimer for helpful discussions, and thank Jack Dunn for a script to parse UCI Machine Learning Repository datasets.
-  Ashvini Balte, Nitin Pise, and Parag Kulkarni. Meta-learning with landmarking: A survey. International Journal of Computer Applications, 105(8), 2014.
-  Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. Collaborative hyperparameter tuning. In International Conference on Machine Learning, pages 199–207, 2013.
-  Thomas Bartz-Beielstein and Sandor Markon. Tuning search algorithms for real-world applications: A regression tree based approach. In Evolutionary Computation, 2004. CEC2004. Congress on, volume 1, pages 1111–1118. IEEE, 2004.
-  James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.
-  Bernd Bischl, Jakob Richter, Jakob Bossek, Daniel Horn, Janek Thomas, and Michel Lang. mlrMBO: A modular framework for model-based optimization of expensive black-box functions. arXiv preprint arXiv:1703.03373, 2017.
-  Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
-  Rich Caruana, Art Munson, and Alexandru Niculescu-Mizil. Getting the most out of ensemble selection. In Data Mining, 2006. ICDM’06. Sixth International Conference on, pages 828–833. IEEE, 2006.
-  Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In Proceedings of the Twenty-first International Conference on Machine Learning, page 18. ACM, 2004.
-  Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
-  Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015.
-  Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Using meta-learning to initialize Bayesian optimization of hyperparameters. In Proceedings of the 2014 International Conference on Meta-learning and Algorithm Selection-Volume 1201, pages 3–10. Citeseer, 2014.
-  Nicolo Fusi and Huseyn Melih Elibol. Probabilistic matrix factorization for automated machine learning. arXiv preprint arXiv:1705.05355, 2017.
-  Jacob R Gardner, Geoff Pleiss, Ruihan Wu, Kilian Q Weinberger, and Andrew Gordon Wilson. Product Kernel Interpolation for Scalable Gaussian Processes. In AISTATS, 2018.
-  Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter Optimization: A Spectral Approach. arXiv preprint arXiv:1706.00764, 2017.
-  Ralf Herbrich, Neil D Lawrence, and Matthias Seeger. Fast sparse Gaussian process methods: The informative vector machine. In Advances in Neural Information Processing Systems, pages 625–632, 2003.
-  Ling Huang, Jinzhu Jia, Bin Yu, Byung-Gon Chun, Petros Maniatis, and Mayur Naik. Predicting execution time of computer programs using sparse polynomial regression. In Advances in Neural Information Processing Systems, pages 883–891, 2010.
-  Frank Hutter, Youssef Hamadi, Holger H Hoos, and Kevin Leyton-Brown. Performance prediction and automated tuning of randomized and parametric algorithms. In International Conference on Principles and Practice of Constraint Programming, pages 213–228. Springer, 2006.
-  Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration. LION, 5:507–523, 2011.
-  Frank Hutter, Lin Xu, Holger H Hoos, and Kevin Leyton-Brown. Algorithm runtime prediction: Methods & evaluation. Artificial Intelligence, 206:79–111, 2014.
-  RC St John and Norman R Draper. D-optimality for regression designs: a review. Technometrics, 17(1):15–23, 1975.
-  Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
-  Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
-  Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235–284, 2008.
-  Rui Leite, Pavel Brazdil, and Joaquin Vanschoren. Selecting classification algorithms with active testing. In International workshop on machine learning and data mining in pattern recognition, pages 117–131. Springer, 2012.
-  Christiane Lemke, Marcin Budka, and Bogdan Gabrys. Metalearning: a survey of trends and technologies. Artificial intelligence review, 44(1):117–130, 2015.
-  David JC MacKay. Information-based objective functions for active data selection. Neural computation, 4(4):590–604, 1992.
-  Mustafa Mısır and Michèle Sebag. Alors: An algorithm recommender system. Artificial Intelligence, 244:291–314, 2017.
-  Alexander M Mood et al. On Hotelling’s weighing problem. The Annals of Mathematical Statistics, 17(4):432–446, 1946.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
-  Bernhard Pfahringer, Hilan Bensusan, and Christophe G Giraud-Carrier. Meta-Learning by Landmarking Various Learning Algorithms. In ICML, pages 743–750, 2000.
-  Friedrich Pukelsheim. Optimal design of experiments, volume 50. SIAM, 1993.
-  Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. The MIT Press, 2006.
-  Paola Sebastiani and Henry P Wynn. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):145–157, 2000.
-  Kate Smith-Miles and Jano van Hemert. Discovering the suitability of optimisation algorithms by learning from evolved instances. Annals of Mathematics and Artificial Intelligence, 61(2):87–104, 2011.
-  Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
-  Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
-  David H Stern, Horst Samulowitz, Ralf Herbrich, Thore Graepel, Luca Pulina, and Armando Tacchella. Collaborative Expert Portfolio Management. In AAAI, pages 179–184, 2010.
-  Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked Science in Machine Learning. SIGKDD Explorations, 15(2):49–60, 2013.
-  Abraham Wald. On the efficient design of statistical investigations. The Annals of Mathematical Statistics, 14(2):134–140, 1943.
-  M. Wistuba, N. Schilling, and L. Schmidt-Thieme. Learning hyperparameter optimization initializations. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10, Oct 2015.
-  David H Wolpert. The lack of a priori distinctions between learning algorithms. Neural computation, 8(7):1341–1390, 1996.
-  Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Artificial Intelligence and Statistics, pages 1077–1085, 2014.
-  Yuyu Zhang, Mohammad Taha Bahadori, Hang Su, and Jimeng Sun. FLASH: fast Bayesian optimization for data analytic pipelines. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2065–2074. ACM, 2016.
Appendix A Selected models in error matrix generation
Table 1 lists all the algorithms (in alphabetical order; the same below), together with the hyperparameter settings we have considered to date. We run these algorithms on datasets using scikit-learn 0.19.1. Hyperparameters not listed in the table are set to their default values in the scikit-learn library; hyperparameter names in Table 1 match the corresponding scikit-learn classifier arguments.
| Algorithm type | Hyperparameter names (values) |
|---|---|
| Adaboost | n_estimators (50, 100), learning_rate (1.0, 1.5, 2.0, 2.5, 3) |
| Decision tree | min_samples_split (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 0.01, 0.001, 0.0001, 1e-05) |
| Extra trees | min_samples_split (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 0.01, 0.001, 0.0001, 1e-05), criterion (gini, entropy) |
| Gradient boosting | learning_rate (0.001, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5), max_depth (3, 6), max_features (null, log2) |
| Gaussian naive Bayes | - |
| kNN | n_neighbors (1, 3, 5, 7, 9, 11, 13, 15), p (1, 2) |
| Logistic regression | C (0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4), solver (liblinear, saga), penalty (l1, l2) |
| Multilayer perceptron | learning_rate_init (0.0001, 0.001, 0.01), learning_rate (adaptive), solver (sgd, adam), alpha (0.0001, 0.01) |
| Random forest | min_samples_split (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 0.01, 0.001, 0.0001, 1e-05), criterion (gini, entropy) |
| Kernel SVM | C (0.125, 0.25, 0.5, 0.75, 1, 2, 4, 8, 16), kernel (rbf, poly), coef0 (0, 10) |
| Linear SVM | C (0.125, 0.25, 0.5, 0.75, 1, 2, 4, 8, 16) |
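Each row of the table above defines a grid of model configurations, one per column of the error matrix. As a minimal illustrative sketch (not Oboe's actual code), the kNN row can be enumerated with scikit-learn's `ParameterGrid`; unlisted hyperparameters keep their scikit-learn defaults:

```python
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier

# The kNN row of Table 1: 8 neighbor counts x 2 Minkowski exponents.
knn_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15], "p": [1, 2]}

# Each (n_neighbors, p) pair becomes one model configuration;
# all other hyperparameters are left at their scikit-learn defaults.
models = [KNeighborsClassifier(**params) for params in ParameterGrid(knn_grid)]
print(len(models))  # 16 kNN configurations
```

Running every such configuration on the training datasets, with cross-validated error recorded for each, fills in one row of the error matrix per dataset.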
Appendix B Runtime prediction performance on individual machine learning algorithms
Appendix C Dataset meta-features
The dataset meta-features used in Figure 5, and also used by auto-sklearn, are listed below.
| Meta-feature | Description |
|---|---|
| number of instances | number of data points in the dataset |
| log number of instances | the (natural) logarithm of the number of instances |
| number of classes | |
| number of features | |
| log number of features | the (natural) logarithm of the number of features |
| number of instances with missing values | |
| percentage of instances with missing values | |
| number of features with missing values | |
| percentage of features with missing values | |
| number of missing values | |
| percentage of missing values | |
| number of numeric features | |
| number of categorical features | |
| ratio numerical to nominal | the ratio of the number of numerical features to the number of categorical features |
| ratio nominal to numerical | |
| dataset ratio | the ratio of the number of features to the number of data points |
| log dataset ratio | the natural logarithm of the dataset ratio |
| inverse dataset ratio | |
| log inverse dataset ratio | |
| class probability (min, max, mean, std) | the (min, max, mean, std) of the ratios of data points in each class |
| symbols (min, max, mean, std, sum) | the (min, max, mean, std, sum) of the numbers of symbols in all categorical features |
| kurtosis (min, max, mean, std) | |
| skewness (min, max, mean, std) | |
| class entropy | the entropy of the distribution of class labels (logarithm base 2) |

Landmarking meta-features:

| Meta-feature | Description |
|---|---|
| decision tree | decision tree classifier with 10-fold cross validation |
| decision node learner | 10-fold cross-validated decision tree classifier with criterion="entropy", max_depth=1, min_samples_split=2, min_samples_leaf=1, max_features=None |
| random node learner | 10-fold cross-validated decision tree classifier with max_features=1, and otherwise the same settings as above |
| PCA fraction of components for 95% variance | the fraction of principal components that account for 95% of variance |
| PCA kurtosis first PC | kurtosis of the dimensionality-reduced data matrix along the first principal component |
| PCA skewness first PC | skewness of the dimensionality-reduced data matrix along the first principal component |
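Several of the simpler meta-features above are direct functions of the data matrix and label vector. The sketch below (hypothetical helper, not Oboe's code; function name `basic_meta_features` is our own) computes a few of them with NumPy, including the base-2 class entropy from the table:

```python
import numpy as np

def basic_meta_features(X, y):
    """Compute a handful of the simple meta-features listed above."""
    n, d = X.shape
    # Class probabilities: fraction of data points in each class.
    _, counts = np.unique(y, return_counts=True)
    probs = counts / n
    return {
        "number of instances": n,
        "log number of instances": np.log(n),
        "number of features": d,
        "number of classes": len(counts),
        "dataset ratio": d / n,
        "log dataset ratio": np.log(d / n),
        "class probability min": probs.min(),
        "class probability max": probs.max(),
        # Class entropy uses logarithm base 2, as noted in the table.
        "class entropy": -np.sum(probs * np.log2(probs)),
    }

X = np.random.rand(100, 5)
y = np.array([0] * 50 + [1] * 50)
mf = basic_meta_features(X, y)
print(mf["class entropy"])  # balanced binary labels -> 1.0
```

The landmarking meta-features, by contrast, require fitting small models (cross-validated decision stumps and PCA), which is precisely the meta-feature computation cost that Oboe avoids spending its time budget on.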