Oboe: Collaborative Filtering for AutoML Initialization
Abstract
Algorithm selection and hyperparameter tuning remain two of the most challenging tasks in machine learning. The number of machine learning applications is growing much faster than the number of machine learning experts, hence we see an increasing demand for efficient automation of learning processes. Here, we introduce Oboe, an algorithm for time-constrained model selection and hyperparameter tuning. Taking advantage of similarity between datasets, Oboe finds promising algorithm and hyperparameter configurations through collaborative filtering. Our system explores these models under time constraints, so that rapid initializations can be provided to warm-start more fine-grained optimization methods. One novel aspect of our approach is a new heuristic for active learning in time-constrained matrix completion based on optimal experiment design. Our experiments demonstrate that Oboe delivers state-of-the-art performance faster than competing approaches on a test bed of supervised learning problems.
1 Introduction
Machine learning and data science experts find it difficult to select algorithms and hyperparameter settings suitable for a given dataset; for novices, the challenge is even greater. The large number of algorithms and the sensitivity of these methods to hyperparameter values make it practically infeasible to enumerate all possible configurations. To surmount these challenges, the field of Automated Machine Learning (AutoML) seeks to efficiently automate the selection of model configurations, and has attracted increasing attention in recent years.
We propose an algorithmic system, Oboe, that (like its orchestral counterpart) provides an initial tuning for AutoML. Oboe complements existing AutoML techniques by selecting among algorithm types and providing promising initializations for hyperparameters, all within a fixed time budget. Oboe predicts the performance of model configurations on a dataset based on the performance of these models on similar datasets: it is a recommender system for algorithms. To ensure a good solution is found within the required time, Oboe predicts which algorithm will perform the best within a fixed (often quite short) time. As subproblems, the system predicts the runtime of each algorithm on a new dataset, and quantifies the information to be gained about a dataset by running a particular algorithm. In contrast to other work in AutoML, Oboe dedicates the entire time budget to exploring promising models, rather than computing dataset meta-features [1], e.g., skewness or kurtosis. With this approach, Oboe is able to select well-performing models from a (customizable) large set within a short period of time. Our system is compatible with, and speeds up, any hyperparameter tuning method that further exploits the initialization, such as scalable Gaussian processes [13]. More broadly, the challenge of time-constrained collaborative filtering has not been well studied, and may be of interest for applications outside of AutoML.
Our method begins by constructing an error matrix, typically with several hundred rows and columns, for some base set of algorithms (columns) and datasets (rows). Each entry records the performance of one machine learning model (algorithm together with hyperparameters) on one dataset. For example, one entry in this matrix would correspond to, e.g., the cross-validated error obtained by using a 2-nearest neighbor algorithm to make predictions for a particular dataset. Each row, called a performance vector, records the performance of each model on a specific dataset. We also record how long it takes to run each model on each dataset, and fit a model to predict runtime based on the number of samples and features in the dataset.
We view a new dataset as a new row in this error matrix. To find the best model for this new dataset, we run models corresponding to an informative subset of the columns on the new dataset. By choosing informative models predicted to run quickly, we can ensure this step is as fast as required. We then predict the missing entries in the row, which correspond to models whose performance has not been evaluated on the current dataset. This procedure is promising in two respects. First, it streamlines the process of choosing hyperparameters of any type (categorical, boolean, numerical, etc.). Second, it infers the performance of a vast number of hyperparameter settings without running most models, or even computing metafeatures of the dataset. Hence Oboe can provide a reasonable model within the specified time constraint.
2 Related work
This work addresses two important problems in AutoML: we consider 1) time-constrained initialization: how to choose a promising model class for a given problem, together with initial values for hyperparameters, within a short time budget, and 2) active learning: how to improve the model given further computational resources.
The first subproblem is important because the spaces and landscapes of different hyperparameters have heterogeneous shapes and are poorly studied. The “no free lunch” theorem [wolpert1996lack] for AutoML states that it is impossible to select a model that always performs well; there is no efficiently computable metric that predicts model performance.
One solution to the problem is to adopt a collaborative filtering approach [2, 37, yogatama2014efficient, 27]: we suppose we have a collection of datasets, called training datasets, for which additional information (e.g., model performance) is available. Model performance on a new dataset, called the test dataset, can be inferred using its similarity to the training datasets.
It is important, in the collaborative filtering setting, to characterize dataset similarity. We hope that, for some similarity metric, training datasets similar to the test dataset will faithfully predict model performance. One line of work makes use of dataset meta-features [30, 11, 10]: simple, statistical or landmarking metrics for dataset characterization [1]. Other approaches [40] avoid meta-features and instead use measures based on similarities in model performance, such as the Kendall rank correlation coefficient [21]. Our approach builds on both of these lines of research by treating low rank representations of model performance vectors as latent meta-features, thereby eliminating the need to compute meta-features: our approach relies exclusively on model performance to compute dataset similarity.
The active learning subproblem is to select algorithms and hyperparameters to evaluate, with the hope of selecting the best model or gaining the most information to guide model selection in subsequent steps. The crux of this problem is to make accurate predictions of model performance on new datasets. Most approaches to this problem choose a function class which is hoped to capture the dependence of model performance on hyperparameters; they fit a surrogate model to observed performance and use it to choose new models (together with hyperparameters) to sample. Gaussian processes [32, 35] are widely used as a surrogate model, together with a variety of criteria that guide the selection of the next sample point, including expected improvement [4, 12], information-theoretic criteria (such as entropy [33, 15] and information gain [26]) and bandit regret [36]; other surrogate models of note include sparse Boolean functions [14] and decision trees [3, 18]. Our Oboe system uses the set of low rank matrices as its surrogate model: a flexible and rather general model class that has the advantage of simplicity and enjoys the speed of well-developed numerical linear algebra algorithms.
One key component of our system is a model which predicts the runtime of supervised learning algorithms on new training datasets. Many authors have previously studied algorithm runtime prediction as a function of dataset features and algorithm hyperparameters [19], via ridge regression [16], neural networks [34], Gaussian processes [17], etc. Several measures have been proposed to trade off between accuracy and runtime [24, 5]. For the purposes of this paper, we use a particularly simple (but surprisingly effective) model, trained independently for each supervised learning algorithm, which predicts algorithm runtime as a function of the number of samples and features in the dataset.
In a linear model, the classical experiment design approach [39, 28, 22, 20, 31, 6] selects features that minimize the variance of the parameter estimate. Constraints on the features can be added, making the approach suitable for problems with budget constraints [23, zhang2016flash]. Here, we adopt this approach to select promising machine learning models whose training on the new dataset is predicted to finish within the time budget.
3 Notation and terminology
Meta-learning
Meta-learning refers to the process of learning across individual datasets or problems, which are subsystems on which standard machine learning is performed [25]. Just as standard machine learning must carefully structure computations to avoid overfitting, experiments testing the performance of an AutoML system must avoid meta-overfitting! Hence we begin our experiments by dividing our set of datasets into a meta-training set, a meta-validation set, and a meta-test set, and report results on the meta-test set, as shown in Figure 2. Each of the three phases in meta-learning (meta-training, meta-validation and meta-test) is a (standard) learning process that includes training, validation and test.
Indexing
All vectors in this paper are column vectors. Given a matrix A, A_{i:} and A_{:j} denote the i-th row and j-th column of A, respectively. We define [n] = {1, 2, …, n} for any positive integer n. Given an ordered set S = {s_1, …, s_k} ⊆ [n], we write A_{:S} = [A_{:s_1}, …, A_{:s_k}].
Algorithm performance
A model is a specific algorithm-hyperparameter combination, e.g. k-NN with a particular value of k. For a specific dataset D and model m, e(D, m) denotes the expected cross-validation error of m on D, in terms of the error metric in use. The expectation is with respect to the way we split the dataset into different folds. The best model on D is the model in our collection that achieves minimum error on D. A model m is said to be observed on D if we actually calculate e(D, m).
Parametric hierarchy
We distinguish among the following three categories of parameters:

Parameters: Model parameters of machine learning algorithms, obtained from training models with fixed hyperparameters.

Hyperparameters: Hyperparameters of machine learning algorithms that influence the way the algorithm learns parameters from training samples, e.g. the number of nearest neighbors, k, in the k-nearest neighbors algorithm.

Hyper-hyperparameters: Settings of the Oboe algorithm, e.g. the rank k used to approximate the error matrix.
4 Methodology
4.1 Unsupervised inference of model performance
While the distance between meta-feature vectors is an informative measure of dataset similarity, it is often difficult to determine a priori which meta-features to use, and the computation of meta-features can be expensive. To infer how new models will perform on a dataset without any meta-feature calculation that takes more time than it takes to read in the dataset (exceptions include, e.g., the number of data points n and the number of features p), we use collaborative filtering.
We construct an error matrix E, where each row is indexed by a dataset and each column is indexed by a model. Empirically, E has approximately low rank: Figure 2 shows how the singular values of E decay as a function of their index. Each entry of E is generated using 5-fold cross-validation, and is an unbiased estimate of the corresponding entry of the true matrix of expected cross-validation errors; this motivates the use of low rank factorization as a denoising method.
Hence, we approximate its entries as E_{ij} ≈ x_i^T y_j, where the vectors x_i and y_j, of dimension k for i ∈ [m] and j ∈ [n], are the minimizers of Σ_{i,j} (E_{ij} − x_i^T y_j)^2; the solution is given by Principal Component Analysis (PCA). In the context of collaborative filtering, each x_i can be interpreted as the latent meta-features of dataset i, while each y_j can be interpreted as the latent meta-features of model j.
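The factorization step can be sketched with a truncated SVD, which yields the PCA solution. This is a minimal illustration of the idea, not the released Oboe implementation; the function name `factorize_error_matrix` is ours:

```python
import numpy as np

def factorize_error_matrix(E, k):
    """Rank-k factorization E ~= X @ Y via truncated SVD (the PCA solution).
    Rows of X are latent dataset meta-features x_i; columns of Y are latent
    model meta-features y_j."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    X = U[:, :k] * s[:k]   # m x k: one row per dataset
    Y = Vt[:k, :]          # k x n: one column per model
    return X, Y

# Toy error matrix: 6 datasets x 5 models, rank 2 plus small noise.
rng = np.random.default_rng(0)
E = rng.random((6, 2)) @ rng.random((2, 5)) + 0.01 * rng.random((6, 5))
X, Y = factorize_error_matrix(E, k=2)
rel_err = np.linalg.norm(E - X @ Y) / np.linalg.norm(E)  # small: noise level
```

Because the SVD orders singular values, the rank-k factors are nested: they are sub-blocks of the factors computed at any higher rank, a fact the budget-doubling scheme of Section 5.2.2 exploits.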
Given a meta-test dataset i, we choose a subset S of models and observe the performance E_{ij} of model j for each j ∈ S, where S is determined by properties of the dataset and the runtime budget. We then infer the latent meta-features of that dataset by solving the least squares problem minimize_x Σ_{j∈S} (E_{ij} − x^T y_j)^2 over x ∈ R^k, whose solution we denote x̂. For all unobserved models, we predict their performance as Ê_{ij} = x̂^T y_j for j ∉ S.
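The inference step is an ordinary least squares solve against the observed columns of Y. A minimal numpy sketch (function name ours, not from the paper's code):

```python
import numpy as np

def infer_new_row(Y, observed_idx, observed_errors):
    """Estimate a new dataset's latent meta-features x_hat by least squares
    from a few observed model errors, then predict the full error row."""
    Ys = Y[:, observed_idx].T  # |S| x k design matrix of model meta-features
    x_hat, *_ = np.linalg.lstsq(Ys, observed_errors, rcond=None)
    return x_hat, x_hat @ Y    # x_hat and predicted errors for all models
```

With a well-chosen S of size at least k, the design matrix has full column rank and the recovered row matches the low rank model exactly.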
The remaining challenge in designing our algorithm is to choose S and k. The subset S of models that we choose to observe must capture as much information about the meta-test dataset as possible while obeying time constraints. We describe the algorithm for selecting S in Section 4.3. The approximate rank k of the error matrix is a hyper-hyperparameter; we describe how to select it in Section 5.2.2.
4.2 Runtime prediction
When operating under time constraints, it is essential to estimate the runtime of each model. These estimates allow us to balance the tradeoff between choosing to observe slow, informative models and fast, less informative models. However, it is challenging to estimate the runtime of a specific model on a particular dataset: the runtime depends not only on the theoretical time complexity of the algorithm, but also on other factors (such as the distribution of data points) that are not easy to calculate.
However, we observe that we are able to predict the runtime of half of the machine learning models within a factor of two on more than 75% of OpenML classification datasets with between 150 and 10,000 data points and with no missing entries, as shown in Table 1 and visualized in Appendix B, Figures 5 and 6. Our method uses polynomial regression: we observe that the theoretical complexities of the machine learning algorithms we use here are polynomial in n and p, where n and p are the number of data points and the number of features in the dataset, respectively. Hence, for each model j, we fit an independent regression f_j(n, p) to predict its runtime, where f_j is chosen from the set of polynomials in n and p of degree no more than 3. We denote this procedure by fit_runtime.
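As an illustration of this fit_runtime procedure, one can regress runtime on all monomials n^a p^b of total degree at most three and solve by least squares. The sketch below is our own simplification under those assumptions, not the released implementation:

```python
import numpy as np

def poly_features(n, p, degree=3):
    """All monomials n**a * p**b with a + b <= degree, constant included."""
    n, p = np.asarray(n, float), np.asarray(p, float)
    return np.stack([n**a * p**b
                     for a in range(degree + 1)
                     for b in range(degree + 1 - a)], axis=-1)

def fit_runtime(sizes, runtimes, degree=3):
    """Least-squares fit of one model's runtime as a polynomial in the
    number of data points n and the number of features p."""
    A = poly_features(sizes[:, 0], sizes[:, 1], degree)
    coef, *_ = np.linalg.lstsq(A, runtimes, rcond=None)
    return lambda n, p: float(poly_features(n, p, degree) @ coef)
```

One such predictor is fit per algorithm, so each captures that algorithm's own scaling in n and p.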
Algorithm type          Runtime prediction accuracy
                        within factor of 2    within factor of 4
Adaboost                83.6%                 94.3%
Decision tree           76.7%                 88.1%
Extra trees             96.6%                 99.5%
Gradient boosting       53.9%                 84.3%
Gaussian naive Bayes    89.6%                 96.7%
kNN                     85.2%                 88.2%
Logistic regression     41.1%                 76.0%
Multilayer perceptron   78.9%                 96.0%
Perceptron              75.4%                 94.3%
Random forest           94.4%                 98.2%
Kernel SVM              59.9%                 86.7%
Linear SVM              30.1%                 73.2%
4.3 Time-constrained information gathering
To select the models to run on a new (meta-test) dataset within a time budget τ, we adopt an approach that builds on classical ideas in experiment design: we suppose fitting each model j returns a linear measurement of the true latent meta-features of the dataset, corrupted by Gaussian noise. The D-optimal design chooses which models we should fit by defining an indicator vector v, where entry v_j indicates whether or not to fit model j, and minimizing a scalarization of the covariance of the estimated meta-features of the new dataset subject to constraints on runtime [39, 28, 22, 20, 31, 6]. Let t̂_j denote the predicted runtime of model j on the meta-test dataset, and let y_j denote its latent meta-features, for j ∈ [n]. Now we relax v to allow for non-Boolean values and solve the optimization problem

    minimize    log det ( Σ_{j=1}^n v_j y_j y_j^T )^{−1}
    subject to  Σ_{j=1}^n v_j t̂_j ≤ τ,
                0 ≤ v_j ≤ 1,  j ∈ [n].

This relaxed problem is a convex optimization problem, and we obtain an approximate solution to the D-optimal design problem via rounding: the set S of models we choose to observe is the set of indices j such that v_j rounds to 1.
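The relaxed problem can be handed to an off-the-shelf convex solver. As a dependency-free illustration of the selection step, the sketch below uses a greedy heuristic instead: it is our own stand-in for the relaxation-plus-rounding procedure, repeatedly adding the model with the largest gain in log det(Σ_{j∈S} y_j y_j^T) per unit of predicted runtime while the budget τ allows.

```python
import numpy as np

def select_models_greedy(Y, t_hat, tau, eps=1e-6):
    """Greedy heuristic for the D-optimal selection step. Y is k x n
    (column j = latent meta-features y_j), t_hat[j] is model j's predicted
    runtime, tau is the time budget. eps*I regularizes the information
    matrix so its determinant starts positive."""
    k, n = Y.shape
    S, spent = [], 0.0
    M = eps * np.eye(k)
    while True:
        _, logdet = np.linalg.slogdet(M)
        best, best_gain = None, 0.0
        for j in range(n):
            if j in S or spent + t_hat[j] > tau:
                continue
            _, ld = np.linalg.slogdet(M + np.outer(Y[:, j], Y[:, j]))
            gain = (ld - logdet) / t_hat[j]   # information gained per second
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            return S
        S.append(best)
        spent += t_hat[best]
        M += np.outer(Y[:, best], Y[:, best])
```

On a toy instance where models 0–1 measure one latent direction and models 2–3 the other, the heuristic spends a unit budget on each direction rather than measuring one direction twice.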
5 Oboe
The Oboe system can be divided into two stages: offline and online. The offline stage is executed only once: it computes and stores information about the meta-training datasets. Time spent in this stage does not affect Oboe's predictions on a new dataset; the runtime experienced by the user is that of the online stage.
5.1 Offline stage
Error matrix generation
The (i, j)-th entry of the error matrix E, denoted E_{ij}, records the performance of the j-th model on the i-th meta-training dataset. The metric we use here to characterize model performance is the balanced error rate, which is the average of false positive and false negative rates across different classes. This balances the influence of classes with different sizes. Our experiments have shown that the error matrix under this metric is approximately low rank, which serves as the foundation of our algorithm.
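For binary problems this metric averages the false positive and false negative rates; a numpy sketch of the natural multiclass version (per-class error rates averaged with equal weight; our own illustration):

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Mean over classes of the fraction of that class's points that were
    misclassified. Each class contributes equally, regardless of size."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    errs = [np.mean(y_pred[y_true == c] != c) for c in classes]
    return float(np.mean(errs))
```

For example, predicting the majority class on a 4-to-1 imbalanced binary problem has plain error 0.2 but balanced error 0.5, so trivial classifiers are not rewarded.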
Runtime fitting
We also record the runtime of fitting each model on each dataset in a runtime matrix T, whose (i, j)-th entry T_{ij} records the runtime of the j-th model on the i-th dataset. This matrix is used to fit the runtime predictors described in Section 4.2. Pseudocode for the offline stage is shown as Algorithm 1.
5.2 Online stage
5.2.1 Ensemble selection
We first predict the runtime of each model on the test dataset using the runtime predictors computed in the offline stage. Then we use experiment design to select a subset of entries of e, the error vector of the test dataset, to observe. The observed entries are used to compute x̂, an estimate of the latent meta-features of the test dataset, which we then use to predict every entry of e.
We next build an ensemble out of models predicted to perform well, to facilitate comparison with other methods for AutoML [10]. This step outputs a classifier that gives predictions of labels on individual data points, and can be placed after further fine-grained hyperparameter tuning. We use a standard method for building an ensemble, proposed by Caruana et al. [8, 7], which starts from a few base learners and greedily adds more if doing so improves training accuracy. We denote this subroutine by ensemble_selection; it takes as input the set of base learners, together with their cross-validation errors and predicted labels, and outputs the ensemble learner. These steps are presented in Algorithm 2 as the fit function.
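A compact sketch of ensemble_selection in the style of Caruana et al., using majority-vote accuracy on held-out labels as the selection criterion (our own simplification; the paper's subroutine also receives cross-validation errors):

```python
import numpy as np

def ensemble_selection(predictions, y_valid, n_rounds=10):
    """Greedy forward selection: repeatedly add (with replacement) the base
    learner whose inclusion most improves majority-vote accuracy on the
    validation labels. `predictions` is a (models x points) int array of
    predicted class labels; returns the chosen indices, repeats allowed."""
    def vote_acc(members):
        votes = predictions[members]  # |members| x n_points
        maj = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
        return np.mean(maj == y_valid)

    chosen, best_acc = [], -1.0
    for _ in range(n_rounds):
        scores = [vote_acc(chosen + [j]) for j in range(len(predictions))]
        j_best = int(np.argmax(scores))
        if scores[j_best] < best_acc:   # stop once no candidate helps
            break
        chosen.append(j_best)
        best_acc = scores[j_best]
    return chosen
```

Selecting with replacement is deliberate: adding a strong learner twice increases its voting weight, which is how the original method weights ensemble members.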
5.2.2 Budget doubling
To select the rank k, Oboe starts with a small initial rank and a small time budget, then doubles the time budget for the above fit subroutine until it reaches half of the total budget. The rank increments by 1 if the validation error of the ensemble learner decreases after a doubling, and otherwise does not change. Since the factors returned by PCA with rank k are submatrices of those returned by PCA with any rank k′ > k, we can compute the rank-k factors as submatrices of those returned by PCA with full rank. The pseudocode is shown as Algorithm 3.
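The doubling loop can be sketched as follows. This is our reading of the scheme above, not Algorithm 3 itself: the hypothetical callback `fit(budget, rank)` stands for the fit subroutine and is assumed to return a validation error.

```python
def budget_doubling_fit(fit, total_budget, k0=1, t0=4.0):
    """Double the budget for `fit` until it reaches half of `total_budget`;
    bump the rank by 1 whenever a doubling lowered the validation error.
    Returns the final rank and the best validation error seen."""
    t, k = t0, k0
    err = fit(t, k)
    while 2 * t <= total_budget / 2:
        t *= 2
        new_err = fit(t, k)
        if new_err < err:   # doubling helped: allow a richer rank next round
            k += 1
        err = min(err, new_err)
    return k, err
```

The remaining half of the total budget is left for the final fit and ensemble construction, consistent with the description above.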
6 Experimental evaluation
The code for the Oboe system is at https://github.com/udellgroup/oboe; the code for related experimental evaluation is at https://github.com/udellgroup/oboetesting.
We test different AutoML systems on OpenML [38] and UCI [9] classification datasets with between 150 and 10,000 data points and with no missing entries, which we call the selected OpenML and UCI datasets in what follows. Since data preprocessing is not our focus, we preprocess all datasets in the same way: we one-hot encode categorical features, and then standardize all features to have zero mean and unit variance. These preprocessed datasets are used in all of the following experiments.
Numerical results demonstrate that performance is robust to hyper-hyperparameter choices, including the cold-start method and the error metric. In terms of end-to-end performance, we compare Oboe with a state-of-the-art AutoML system that selects among different algorithm types and tunes hyperparameters within a fixed time budget, auto-sklearn [10], and with a random baseline, which replaces Oboe's model selection with a random selection of models that can be trained within the same time budget.
6.1 Hyper-hyperparameter choice
6.1.1 Cold-start functionality
Oboe uses D-optimal experiment design as the cold-start method to select models to evaluate on new datasets. In Figure 3 and Table 2, we compare this approach with A- and E-optimal experiment design and with nonlinear regression using Alors [27], by means of leave-one-out cross-validation on the selected OpenML datasets. We consider performance measured by the relative RMSE of the predicted performance vector and by the number of correctly predicted best models, both averaged across datasets. The approximate rank of the error matrix is set to the number of singular values larger than 1% of the largest, which is 38 here. The time limit in the experiment design implementation is set to 4 seconds; the meta-features used in the Alors implementation are listed in Appendix C; the nonlinear regressor used in the Alors implementation is the default RandomForestRegressor in scikit-learn 0.19.1 [29].
It can be observed that the experiment design approach robustly outperforms nonlinear regression in terms of prediction accuracy on best models.
6.1.2 Error metric
Oboe uses the balanced error rate to construct the error matrix, and works on the premise that the error matrix can be approximated by a low rank matrix. However, there is nothing special about the balanced error rate: indeed, most metrics used to measure errors result in an approximately low rank error matrix. For example, when using the AUC metric to measure error, the 418-by-219 error matrix from the selected OpenML datasets has only 38 singular values greater than 1% of the largest, and 12 greater than 3%.
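This notion of approximate rank (the number of singular values above a fraction of the largest) takes a few lines to compute; a numpy sketch, with the function name our own:

```python
import numpy as np

def approximate_rank(E, threshold=0.01):
    """Number of singular values of E exceeding `threshold` times the
    largest singular value."""
    s = np.linalg.svd(E, compute_uv=False)  # singular values, descending
    return int(np.sum(s > threshold * s[0]))
```

On a matrix that is exactly rank 3 plus small noise, the count recovers 3 at the 1% threshold.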
6.2 Performance comparison
Oboe
For our implementation of Oboe, we use the algorithm types and hyperparameter ranges listed in Appendix A, Table 3 as columns of the error matrix. We chose to vary the hyperparameters people usually tune, and picked their ranges to contain the values people usually use; we have not optimized over these choices. The datasets we selected as rows are 418 preprocessed OpenML datasets with between 150 and 10,000 data points, with no missing entries, and for which the total runtime of all models is less than 60,000 seconds. The (decaying) singular values of this error matrix are shown in Figure 2.
autosklearn
We compare with auto-sklearn with meta-learning and ensemble selection enabled, using the method autosklearn.classification.AutoSklearnClassifier in auto-sklearn 0.3.0 [10].
Random
As a random baseline, we replace the experiment design subroutine of Oboe with a time-constrained random selection method: we randomly select a subset of models that we predict will complete within a particular time limit. We assign half of the budget in each budget-doubling round to this random selection subroutine, during which we sequentially pick models predicted to finish within the remaining budget. These models are observed and used to infer the remaining entries of the test dataset's error vector.
6.3 Experimental setup
We limited the types of algorithms each system can use to Adaboost, Gaussian naive Bayes, extra trees, gradient boosting, linear and kernel SVMs, random forest, k-nearest neighbors and decision trees. Our system does not perform further hyperparameter optimization after selecting a subset of models and forming an ensemble, while we do allow the auto-sklearn package to perform hyperparameter optimization after model selection to ensure it makes full use of the time budget. This choice gives auto-sklearn a slight advantage over Oboe.
We ran all experiments on a server with 128 Intel Xeon E7-4850 v4 2.10GHz CPU cores and 1056GB memory. Each system running on a specific dataset is limited to a single CPU core. Figure 4 shows how the percentile and ranking (1 is best and 3 is worst) of prediction errors change as the runtime repeatedly doubles. Until the first time a system can produce a model, we classify every data point with the most common class label.
We make several observations on the results of these experiments:

The quality of the initial models computed by Oboe and by auto-sklearn is comparable, but Oboe computes its first nontrivial model more than 8× faster than auto-sklearn (Figures 3(a) and 3(b)). In contrast, auto-sklearn must first compute meta-features for each dataset, which requires substantial computational time, as shown in Appendix D.

In general, the test error of Oboe decreases as the time budget increases. Interestingly, Oboe's models improve with time faster than auto-sklearn's models do. This observation indicates that increased computational time may be better spent fitting more models than running standard hyperparameter optimization methods, such as Gaussian processes, to which auto-sklearn devotes the remaining time.

Experiment design leads to better results than random selection in almost all cases.
7 Summary
Oboe is an AutoML system that uses ideas from collaborative filtering and optimal experiment design to exploit dataset and algorithm similarity and predict the performance of machine learning models. By fitting a few selected models on the test dataset, the system transfers knowledge from the training datasets to select a good set of models. Oboe naturally handles different types of hyperparameters and can match the performance of state-of-the-art AutoML systems much more quickly than competing approaches.
This work demonstrates the promise of collaborative filtering approaches to AutoML. However, many improvements are possible. Future work to adapt Oboe to different loss metrics, budget types, sparse error matrices and a wider range of machine learning algorithms, as well as to augment the initializations given by Oboe with fine-tuning by any state-of-the-art hyperparameter optimization method, may yield substantial practical improvements. Furthermore, we look forward to seeing more approaches to the challenge of choosing hyper-hyperparameter settings subject to limited computation and data. With continuing efforts by the AutoML community, we look forward to a world in which domain experts seeking to use machine learning can focus on issues of data quality and problem formulation, rather than on tasks, such as algorithm selection and hyperparameter tuning, that are suitable for automation.
Acknowledgements
This work was supported in part by DARPA Award FA8750-17-2-0101. The authors thank Christophe Giraud-Carrier, Ameet Talwalkar, Raul Astudillo Marban, Matthew Zalesak, Lijun Ding and Davis Wertheimer for helpful discussions, and thank Jack Dunn for a script to parse UCI Machine Learning Repository datasets.
References
 [1] Ashvini Balte, Nitin Pise, and Parag Kulkarni. Meta-learning with landmarking: A survey. International Journal of Computer Applications, 105(8), 2014.
 [2] Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. Collaborative hyperparameter tuning. In International Conference on Machine Learning, pages 199–207, 2013.
 [3] Thomas Bartz-Beielstein and Sandor Markon. Tuning search algorithms for real-world applications: A regression tree based approach. In Evolutionary Computation, 2004. CEC2004. Congress on, volume 1, pages 1111–1118. IEEE, 2004.
 [4] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.
 [5] Bernd Bischl, Jakob Richter, Jakob Bossek, Daniel Horn, Janek Thomas, and Michel Lang. mlrMBO: A modular framework for model-based optimization of expensive black-box functions. arXiv preprint arXiv:1703.03373, 2017.
 [6] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
 [7] Rich Caruana, Art Munson, and Alexandru Niculescu-Mizil. Getting the most out of ensemble selection. In Data Mining, 2006. ICDM’06. Sixth International Conference on, pages 828–833. IEEE, 2006.
 [8] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In Proceedings of the Twenty-first International Conference on Machine Learning, page 18. ACM, 2004.
 [9] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
 [10] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015.
 [11] Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Using meta-learning to initialize Bayesian optimization of hyperparameters. In Proceedings of the 2014 International Conference on Meta-learning and Algorithm Selection - Volume 1201, pages 3–10. Citeseer, 2014.
 [12] Nicolo Fusi and Huseyn Melih Elibol. Probabilistic matrix factorization for automated machine learning. arXiv preprint arXiv:1705.05355, 2017.
 [13] Jacob R Gardner, Geoff Pleiss, Ruihan Wu, Kilian Q Weinberger, and Andrew Gordon Wilson. Product Kernel Interpolation for Scalable Gaussian Processes. In AISTATS, 2018.
 [14] Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter Optimization: A Spectral Approach. arXiv preprint arXiv:1706.00764, 2017.
 [15] Ralf Herbrich, Neil D Lawrence, and Matthias Seeger. Fast sparse Gaussian process methods: The informative vector machine. In Advances in Neural Information Processing Systems, pages 625–632, 2003.
 [16] Ling Huang, Jinzhu Jia, Bin Yu, Byung-Gon Chun, Petros Maniatis, and Mayur Naik. Predicting execution time of computer programs using sparse polynomial regression. In Advances in Neural Information Processing Systems, pages 883–891, 2010.
 [17] Frank Hutter, Youssef Hamadi, Holger H Hoos, and Kevin Leyton-Brown. Performance prediction and automated tuning of randomized and parametric algorithms. In International Conference on Principles and Practice of Constraint Programming, pages 213–228. Springer, 2006.
 [18] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration. LION, 5:507–523, 2011.
 [19] Frank Hutter, Lin Xu, Holger H Hoos, and Kevin Leyton-Brown. Algorithm runtime prediction: Methods & evaluation. Artificial Intelligence, 206:79–111, 2014.
 [20] RC St John and Norman R Draper. Doptimality for regression designs: a review. Technometrics, 17(1):15–23, 1975.
 [21] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
 [22] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
 [23] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235–284, 2008.
 [24] Rui Leite, Pavel Brazdil, and Joaquin Vanschoren. Selecting classification algorithms with active testing. In International workshop on machine learning and data mining in pattern recognition, pages 117–131. Springer, 2012.
 [25] Christiane Lemke, Marcin Budka, and Bogdan Gabrys. Meta-learning: a survey of trends and technologies. Artificial Intelligence Review, 44(1):117–130, 2015.
 [26] David JC MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.
 [27] Mustafa Mısır and Michèle Sebag. Alors: An algorithm recommender system. Artificial Intelligence, 244:291–314, 2017.
 [28] Alexander M Mood et al. On Hotelling’s weighing problem. The Annals of Mathematical Statistics, 17(4):432–446, 1946.
 [29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 [30] Bernhard Pfahringer, Hilan Bensusan, and Christophe G Giraud-Carrier. Meta-Learning by Landmarking Various Learning Algorithms. In ICML, pages 743–750, 2000.
 [31] Friedrich Pukelsheim. Optimal design of experiments, volume 50. SIAM, 1993.
 [32] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. the MIT Press, 2006.
 [33] Paola Sebastiani and Henry P Wynn. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):145–157, 2000.
 [34] Kate Smith-Miles and Jano van Hemert. Discovering the suitability of optimisation algorithms by learning from evolved instances. Annals of Mathematics and Artificial Intelligence, 61(2):87–104, 2011.
 [35] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
 [36] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
 [37] David H Stern, Horst Samulowitz, Ralf Herbrich, Thore Graepel, Luca Pulina, and Armando Tacchella. Collaborative Expert Portfolio Management. In AAAI, pages 179–184, 2010.
 [38] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked Science in Machine Learning. SIGKDD Explorations, 15(2):49–60, 2013.
 [39] Abraham Wald. On the efficient design of statistical investigations. The Annals of Mathematical Statistics, 14(2):134–140, 1943.
 [40] M. Wistuba, N. Schilling, and L. Schmidt-Thieme. Learning hyperparameter optimization initializations. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10, Oct 2015.
 [41] David H Wolpert. The lack of a priori distinctions between learning algorithms. Neural computation, 8(7):1341–1390, 1996.
 [42] Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Artificial Intelligence and Statistics, pages 1077–1085, 2014.
 [43] Yuyu Zhang, Mohammad Taha Bahadori, Hang Su, and Jimeng Sun. FLASH: fast Bayesian optimization for data analytic pipelines. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2065–2074. ACM, 2016.
Appendix A Selected models in error matrix generation
Table 1 shows all the algorithms (in alphabetical order; the same below) together with the hyperparameter settings that we have considered to date. We run these algorithms on datasets using scikit-learn 0.19.1 [29]. Hyperparameter settings not listed in this table are set to their default values in the scikit-learn library. Hyperparameter names in Table 1 are consistent with scikit-learn classifier arguments.
Algorithm type        | Hyperparameter names (values)
----------------------|------------------------------
Adaboost              | n_estimators (50, 100), learning_rate (1.0, 1.5, 2.0, 2.5, 3)
Decision tree         | min_samples_split (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 0.01, 0.001, 0.0001, 1e-05)
Extra trees           | min_samples_split (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 0.01, 0.001, 0.0001, 1e-05), criterion (gini, entropy)
Gradient boosting     | learning_rate (0.001, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5), max_depth (3, 6), max_features (null, log2)
Gaussian naive Bayes  | (none; defaults)
kNN                   | n_neighbors (1, 3, 5, 7, 9, 11, 13, 15), p (1, 2)
Logistic regression   | C (0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4), solver (liblinear, saga), penalty (l1, l2)
Multilayer perceptron | learning_rate_init (0.0001, 0.001, 0.01), learning_rate (adaptive), solver (sgd, adam), alpha (0.0001, 0.01)
Perceptron            | (none; defaults)
Random forest         | min_samples_split (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 0.01, 0.001, 0.0001, 1e-05), criterion (gini, entropy)
Kernel SVM            | C (0.125, 0.25, 0.5, 0.75, 1, 2, 4, 8, 16), kernel (rbf, poly), coef0 (0, 10)
Linear SVM            | C (0.125, 0.25, 0.5, 0.75, 1, 2, 4, 8, 16)
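For concreteness, the full set of model configurations is the Cartesian product of each algorithm's hyperparameter value lists. The sketch below is illustrative only: `GRID` is a hypothetical excerpt of Table 1, and `enumerate_configs` is not part of Oboe's actual code.

```python
from itertools import product

# Hypothetical excerpt of the grid in Table 1; names match
# scikit-learn classifier arguments, values come from the table.
GRID = {
    "KNeighborsClassifier": {"n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15],
                             "p": [1, 2]},
    "AdaBoostClassifier": {"n_estimators": [50, 100],
                           "learning_rate": [1.0, 1.5, 2.0, 2.5, 3.0]},
    "GaussianNB": {},  # no hyperparameters varied; defaults only
}

def enumerate_configs(grid):
    """Yield (algorithm name, hyperparameter dict) for every combination."""
    for algo, params in grid.items():
        if not params:
            yield algo, {}
            continue
        names = sorted(params)
        for values in product(*(params[n] for n in names)):
            yield algo, dict(zip(names, values))

configs = list(enumerate_configs(GRID))
```

Each configuration corresponds to one column of the error matrix; kNN alone contributes 8 × 2 = 16 columns in this excerpt.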
Appendix B Runtime prediction performance on individual machine learning algorithms
The runtime prediction accuracies on the selected OpenML datasets shown in Table 1 are visualized in Figures 5 and 6.
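Oboe predicts each algorithm's runtime on a new dataset before deciding which models to run. As an illustrative sketch only (the exact feature set Oboe uses may differ), one can fit log-runtime by least squares on polynomial features of the dataset shape, in the spirit of runtime-prediction work such as [16, 19]:

```python
import numpy as np

def fit_runtime_model(sizes, runtimes):
    """Least-squares fit of log(runtime) against polynomial features of
    the dataset shape (n data points, p features). Illustrative sketch;
    not the exact feature set used by Oboe."""
    X = np.array([[1.0, np.log(n), np.log(p), np.log(n) * np.log(p)]
                  for n, p in sizes])
    y = np.log(np.asarray(runtimes, dtype=float))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_runtime(coef, n, p):
    """Predicted runtime (seconds) for an n-by-p dataset."""
    x = np.array([1.0, np.log(n), np.log(p), np.log(n) * np.log(p)])
    return float(np.exp(x @ coef))

# Usage: fit on observed (n, p) -> runtime pairs, predict on a new shape.
sizes = [(100, 10), (100, 100), (1000, 10), (1000, 100)]
runtimes = [1e-6 * n * p for n, p in sizes]  # synthetic training data
coef = fit_runtime_model(sizes, runtimes)
```

Working on log-runtimes keeps the regression well-conditioned when runtimes span several orders of magnitude across dataset sizes.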
Appendix C Dataset meta-features
Dataset meta-features (used in Figure 5 and in auto-sklearn [10]) are listed below.
Meta-feature name                           | Explanation
--------------------------------------------|------------
number of instances                         | number of data points in the dataset
log number of instances                     | natural logarithm of the number of instances
number of classes                           |
number of features                          |
log number of features                      | natural logarithm of the number of features
number of instances with missing values     |
percentage of instances with missing values |
number of features with missing values      |
percentage of features with missing values  |
number of missing values                    |
percentage of missing values                |
number of numeric features                  |
number of categorical features              |
ratio numerical to nominal                  | ratio of the number of numerical features to the number of categorical features
ratio nominal to numerical                  | ratio of the number of categorical features to the number of numerical features
dataset ratio                               | ratio of the number of features to the number of data points
log dataset ratio                           | natural logarithm of the dataset ratio
inverse dataset ratio                       | ratio of the number of data points to the number of features
log inverse dataset ratio                   | natural logarithm of the inverse dataset ratio
class probability (min, max, mean, std)     | (min, max, mean, std) of the ratios of data points in each class
symbols (min, max, mean, std, sum)          | (min, max, mean, std, sum) of the numbers of symbols in all categorical features
kurtosis (min, max, mean, std)              | (min, max, mean, std) of per-feature kurtosis
skewness (min, max, mean, std)              | (min, max, mean, std) of per-feature skewness
class entropy                               | entropy of the distribution of class labels (logarithm base 2)
landmarking [30] meta-features:             |
LDA                                         | 10-fold cross-validated linear discriminant analysis classifier
decision tree                               | decision tree classifier with 10-fold cross-validation
decision node learner                       | 10-fold cross-validated decision tree classifier with criterion="entropy", max_depth=1, min_samples_split=2, min_samples_leaf=1, max_features=None
random node learner                         | 10-fold cross-validated decision tree classifier with max_features=1 and the same other settings as above
1-NN                                        | 10-fold cross-validated 1-nearest-neighbor classifier
PCA fraction of components for 95% variance | fraction of principal components that account for 95% of variance
PCA kurtosis first PC                       | kurtosis of the dimensionality-reduced data matrix along the first principal component
PCA skewness first PC                       | skewness of the dimensionality-reduced data matrix along the first principal component
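As an illustration (not Oboe's or auto-sklearn's actual implementation), a few of the simpler meta-features above can be computed directly with NumPy:

```python
import numpy as np

def class_entropy(labels):
    """Entropy (base 2) of the class-label distribution."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def skewness_stats(X):
    """(min, max, mean, std) over per-feature skewness values."""
    X = np.asarray(X, dtype=float)
    m, s = X.mean(axis=0), X.std(axis=0)
    s = np.where(s == 0, 1.0, s)  # constant features get skewness 0
    skews = ((X - m) ** 3).mean(axis=0) / s ** 3
    return skews.min(), skews.max(), skews.mean(), skews.std()

def dataset_ratio(X):
    """Number of features divided by the number of data points."""
    n, p = np.asarray(X).shape
    return p / n
```

For example, a perfectly balanced two-class label vector has class entropy 1.0 bit, and a 10-by-5 data matrix has dataset ratio 0.5.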

Appendix D Meta-feature calculation time
The time taken to calculate the meta-features in Appendix C is not negligible; it is shown in Figure 7.