Oboe: Collaborative Filtering for AutoML Model Selection
Abstract.
Algorithm selection and hyperparameter tuning remain two of the most challenging tasks in machine learning. Automated machine learning (AutoML) seeks to automate these tasks to enable widespread use of machine learning by non-experts. This paper introduces Oboe, a collaborative filtering method for time-constrained model selection and hyperparameter tuning. Oboe forms a matrix of the cross-validated errors of a large number of supervised learning models (algorithms together with hyperparameters) on a large number of datasets, and fits a low rank model to learn the low-dimensional feature vectors for the models and datasets that best predict the cross-validated errors. To find promising models for a new dataset, Oboe runs a set of fast but informative algorithms on the new dataset and uses their cross-validated errors to infer the feature vector for the new dataset. Oboe can find good models under constraints on the number of models fit or the total time budget. To this end, this paper develops a new heuristic for active learning in time-constrained matrix completion based on optimal experiment design. Our experiments demonstrate that Oboe delivers state-of-the-art performance faster than competing approaches on a test bed of supervised learning problems. Moreover, the success of the bilinear model used by Oboe suggests that AutoML may be simpler than was previously understood.
1. Introduction
It is often difficult to find the best algorithm and hyperparameter settings for a new dataset, even for experts in machine learning or data science. The large number of machine learning algorithms and their sensitivity to hyperparameter values make it practically infeasible to enumerate all configurations. Automated machine learning (AutoML) seeks to efficiently automate the selection of model (e.g., (Feurer et al., 2015; Chen et al., 2018; Fusi et al., 2018)) or pipeline (e.g., (Drori et al., 2018)) configurations, and has become more important as the number of machine learning applications increases.
We propose an algorithmic system, Oboe (the eponymous musical instrument plays the initial note to tune an orchestra), that provides an initial tuning for AutoML: it selects a good algorithm and hyperparameter combination from a discrete set of options. The resulting model can be used directly, or the hyperparameters can be tuned further. Briefly, Oboe operates as follows.
During an offline training phase, it forms a matrix of the cross-validated errors of a large number of supervised learning models (algorithms together with hyperparameters) on a large number of datasets. It then fits a low rank model to this matrix to learn latent low-dimensional meta-features for the models and datasets. Our optimization procedure ensures these latent meta-features best predict the cross-validated errors, among all bilinear models.
To find promising models for a new dataset, Oboe chooses a set of fast but informative models to run on the new dataset and uses their cross-validated errors to infer the latent meta-features of the new dataset. Given more time, Oboe repeats this procedure using a higher rank to find higher-dimensional (and more expressive) latent features. Using a low rank model for the error matrix is a very strong structural prior.
This system addresses two important problems: (1) Time-constrained initialization: how to choose a promising initial model under time constraints. Oboe adapts easily to short times by using a very low rank and by restricting its experiments to models that will run very fast on the new dataset. (2) Active learning: how to improve on the initial guess given further computational resources. Oboe uses extra time by allowing higher ranks and more expensive computational experiments, accumulating its knowledge of the new dataset to produce more accurate (and higher-dimensional) estimates of its latent meta-features.
Oboe uses collaborative filtering for AutoML, selecting models that have worked well on similar datasets, as have many previous methods including (Bardenet et al., 2013; Stern et al., 2010; Yogatama and Mann, 2014; Feurer et al., 2015; Mısır and Sebag, 2017; Cunha et al., 2018). In collaborative filtering, the critical question is how to characterize dataset similarity so that training datasets “similar” to the test dataset faithfully predict model performance. One line of work uses dataset meta-features — simple, statistical, or landmarking metrics — to characterize datasets (Pfahringer et al., 2000; Feurer et al., 2014; Feurer et al., 2015; Fusi et al., 2018; Cunha et al., 2018). Other approaches (e.g., (Wistuba et al., 2015)) avoid meta-features. Our approach builds on both of these lines of work. Oboe relies on model performance to characterize datasets, and the low rank representations it learns for each dataset may be seen (and used) as latent meta-features. Compared to AutoML systems that compute meta-features of the dataset before running any models, the flow of information in Oboe is exactly opposite: Oboe uses only the performance of various models on the datasets to compute lower-dimensional latent meta-features for models and datasets.
The active learning subproblem is to gain the most information to guide further model selection. Some approaches choose a function class to capture the dependence of model performance on hyperparameters; examples are Gaussian processes (Rasmussen and Williams, 2006; Snoek et al., 2012; Bergstra et al., 2011; Fusi et al., 2018; Sebastiani and Wynn, 2000; Herbrich et al., 2003; MacKay, 1992; Srinivas et al., 2010), sparse Boolean functions (Hazan et al., 2018) and decision trees (Bartz-Beielstein and Markon, 2004; Hutter et al., 2011). Oboe chooses the set of bilinear models as its function class: predicted performance is linear in each of the latent model and dataset meta-features.
Bilinearity seems like a rather strong assumption, but it confers several advantages. Computations are fast and easy: we can find the global minimizer by PCA, and can infer the latent meta-features for a new dataset using least squares. Moreover, recent theoretical work suggests that this model class is more general than it appears: roughly, and under a few mild technical assumptions, any m-by-n matrix with independent rows and columns whose entries are generated according to a fixed function (here, the function computed by training the model on the dataset) has an approximate rank that grows as O(log(m + n)) (Udell and Townsend, 2019). Hence large data matrices tend to look low rank.
Originally, the authors conceived of Oboe as a system to produce a good set of initial models, to be refined by other local search methods, such as Bayesian optimization. However, in our experiments, we find that Oboe’s performance, refined by fitting models of ever higher rank with ever more data, actually improves faster than competing methods that use local search methods more heavily.
One key component of our system is the prediction of model runtime on new datasets. Many authors have previously studied algorithm runtime prediction using a variety of dataset features (Hutter et al., 2014), via ridge regression (Huang et al., 2010), neural networks (Smith-Miles and van Hemert, 2011), Gaussian processes (Hutter et al., 2006), and more. Several measures have been proposed to trade off between accuracy and runtime (Leite et al., 2012; Bischl et al., 2017). We predict algorithm runtime using only the number of samples and features in the dataset. This model is particularly simple but surprisingly effective.
Classical experiment design (ED) (Wald, 1943; Mood et al., 1946; John and Draper, 1975; Pukelsheim, 1993; Boyd and Vandenberghe, 2004) selects features to observe to minimize the variance of the parameter estimate, assuming that features depend on the parameters according to known, linear, functions. Oboe’s bilinear model fits this paradigm, and so ED can be used to select informative models. Budget constraints can be added, as we do here, to select a small number of promising machine learning models or a set predicted to finish within a short time budget (Krause et al., 2008; Zhang et al., 2016).
2. Notation and Terminology
Meta-learning. Meta-learning is the process of learning across individual datasets or problems, which are subsystems on which standard learning is performed (Lemke et al., 2015). Just as standard machine learning must avoid overfitting, experiments testing AutoML systems must avoid meta-overfitting! We divide our set of datasets into meta-training, meta-validation and meta-test sets, and report results on the meta-test set. Each of the three phases in meta-learning — meta-training, meta-validation and meta-test — is a standard learning process that includes training, validation and test.
Indexing. Throughout this paper, all vectors are column vectors. Given a matrix A, A_{i,:} and A_{:,j} denote the i-th row and j-th column of A, respectively. The index i runs over datasets, and the index j runs over models. We define [n] = {1, …, n} for n ∈ ℕ. Given an ordered set S = {s_1, …, s_k} ⊆ [n] with s_1 < ⋯ < s_k, we write A_{:,S} = [A_{:,s_1} ⋯ A_{:,s_k}].
Algorithm performance. A model is a specific algorithm–hyperparameter combination (e.g., k-NN with a particular choice of k). We denote by E_{ij} the expected cross-validation error of model j on dataset i, where the expectation is with respect to the cross-validation splits. We refer to the model in our collection that achieves minimal error on dataset i as the best model for i. A model j is said to be observed on dataset i if we have calculated E_{ij} by fitting (and cross-validating) the model. The performance vector e_i of dataset i concatenates E_{ij} for each model j in our collection.
Meta-features. We discuss two types of meta-features in this paper. Meta-features are metrics used to characterize datasets or models. For example, the number of data points in a dataset, or the performance of simple models on it, can serve as meta-features of the dataset. As an example, we list the meta-features used by the AutoML framework auto-sklearn in Appendix B, Table 3. In contrast to standard meta-features, we use the term latent meta-features to refer to characterizations learned from matrix factorization.
Parametric hierarchy. We distinguish between three kinds of parameters:

Parameters of a model (e.g., the splits in a decision tree) are obtained by training the model.

Hyperparameters of an algorithm (e.g., the maximum depth of a decision tree) govern the training procedure. We use the word model to refer to an algorithm together with a particular choice of hyperparameters.

Hyper-hyperparameters of a meta-learning method (e.g., the total time budget for Oboe) govern meta-training.
Time target and time budget. The time target is the anticipated time to be spent running models in order to infer latent features at each fixed dimension; it can be exceeded, although the actual runtime does not usually deviate much from the target since our model runtime prediction works well. The time budget is the total time limit for Oboe and is never exceeded.
3. Methodology
3.1. Model Performance Prediction
It can be difficult to determine a priori which meta-features to use so that algorithms perform similarly well on datasets with similar meta-features. Also, computing meta-features can be expensive (see Appendix C, Figure 11). To infer model performance on a dataset without any expensive meta-feature calculations, we use collaborative filtering to infer latent meta-features for datasets.
As shown in Figure 2, we construct an empirical error matrix E ∈ R^{m×n}, where entry E_{ij} records the cross-validated error of model j on dataset i. Empirically, E has approximately low rank: Figure 3 shows that its singular values decay rapidly with their index. This observation serves as the foundation of our algorithm, and is analyzed in greater detail in Section 5.2. The value E_{ij} provides a noisy but unbiased estimate of the true performance of model j on dataset i.
To denoise this estimate, we approximate E ≈ XᵀY, where X ∈ R^{k×m} and Y ∈ R^{k×n} minimize ‖E − XᵀY‖_F²; the solution is given by PCA. Thus x_i (the i-th column of X) and y_j (the j-th column of Y) are the latent meta-features of dataset i and model j, respectively. The rank k controls model fidelity: small k gives coarse approximations, while large k may overfit. We use a doubling scheme to choose k within the time budget; see Section 4.2 for details.
Given a new meta-test dataset, we choose a subset S ⊆ [n] of models and observe the performance e_j of model j for each j ∈ S. A good choice of S balances information gain against the time needed to run the models; we discuss how to choose S in Section 3.3. We then infer latent meta-features for the new dataset by solving the least squares problem minimize_x Σ_{j∈S} (e_j − xᵀy_j)², with solution x̂. For all unobserved models, we predict performance as ê_j = x̂ᵀy_j for j ∉ S.
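The two steps above (offline PCA factorization, online least-squares inference) can be sketched in a few lines of NumPy. The synthetic error matrix and the choice of observed models S below are illustrative assumptions, not Oboe's actual data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the offline error matrix E (m datasets x n models):
# approximately low rank, like the empirical error matrices Oboe observes.
m, n, k = 50, 40, 5
E = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 0.01 * rng.standard_normal((m, n))

# Offline: rank-k factorization E ~ X^T Y via truncated SVD (PCA on E).
U, s, Vt = np.linalg.svd(E, full_matrices=False)
X = (U[:, :k] * s[:k]).T    # k x m: latent meta-features of datasets
Y = Vt[:k, :]               # k x n: latent meta-features of models

# Online: the "new" dataset's true performance vector (here fabricated to lie
# in the row space of Y), of which we observe only the models in S.
e_true = X[:, 0] @ Y
S = [1, 4, 9, 17, 22, 28, 33, 38]   # observed models (more than k of them)
x_hat, *_ = np.linalg.lstsq(Y[:, S].T, e_true[S], rcond=None)

# Predict the performance of every unobserved model.
e_hat = x_hat @ Y
```

Because the fabricated performance vector lies exactly in the row space of Y, the least squares step recovers it from only eight observed entries; with real, noisy errors the prediction is approximate.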
3.2. Runtime Prediction
Estimating model runtime allows us to trade off between running slow, informative models and fast, less informative models. We use a simple method to estimate runtimes: polynomial regression on n_i and p_i, the numbers of data points and features in dataset i, and on their logarithms, since the theoretical complexities of the machine learning algorithms we use are polynomial in these quantities. We fit an independent polynomial regression model for each model j:
f̂_j = argmin_{f ∈ F} Σ_i ( f(n_i, p_i, log n_i, log p_i) − t_{ij} )²,

where t_{ij} is the runtime of machine learning model j on dataset i, and F is the set of polynomials of degree at most 3 in these variables. We denote this procedure by fit_runtime.
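A minimal version of fit_runtime can be written as ordinary least squares over the monomial features in (n, p, log n, log p). The basis construction and the fabricated runtimes below are illustrative assumptions, not Oboe's training data:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(n, p, degree=3):
    """All monomials of degree <= `degree` in (n, p, log n, log p)."""
    base = [n, p, np.log(n), np.log(p)]
    feats = [np.ones_like(n)]
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(4), d):
            term = np.ones_like(n)
            for idx in combo:
                term = term * base[idx]
            feats.append(term)
    return np.stack(feats, axis=1)

rng = np.random.default_rng(0)
n = rng.integers(100, 10_000, size=200).astype(float)   # dataset sizes
p = rng.integers(5, 100, size=200).astype(float)        # feature counts
runtime = 1e-6 * n * p + 1e-4 * n * np.log(n)           # fabricated runtimes

# Fit one model's runtime predictor; normalize columns since the monomials
# span many orders of magnitude and would otherwise be badly conditioned.
A = poly_features(n, p)
scales = np.linalg.norm(A, axis=0)
coef, *_ = np.linalg.lstsq(A / scales, runtime, rcond=None)
coef = coef / scales

# Predict runtime on a new dataset with 5000 points and 20 features.
pred = poly_features(np.array([5000.0]), np.array([20.0])) @ coef
```

Oboe fits one such regression per model in its collection, using the recorded offline runtimes as the targets.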
3.3. Time-Constrained Information Gathering
To select a subset S of models to observe, we adopt an approach that builds on classical experiment design: we suppose that fitting each machine learning model returns a linear measurement of the new dataset's latent meta-features x, corrupted by Gaussian noise. To estimate x, we would like to choose a set of observations that spans R^k and forms a well-conditioned submatrix, but that corresponds to models which are fast to run. In passing, we note that the pivoted QR algorithm on the matrix Y (heuristically) finds a well-conditioned set of columns of Y. However, we would like to find a method that is runtime-aware.
Our experiment design (ED) procedure minimizes a scalarization of the covariance of the estimated meta-features of the new dataset subject to runtime constraints (Wald, 1943; Mood et al., 1946; John and Draper, 1975; Pukelsheim, 1993; Boyd and Vandenberghe, 2004). Formally, define an indicator vector v ∈ {0,1}^n, where entry v_j indicates whether to fit model j. Let t̂_j denote the predicted runtime of model j on the meta-test dataset, and let y_j denote its latent meta-features, for j ∈ [n]. Now relax v to lie in [0,1]^n and solve the optimization problem
(1)   minimize   log det ( Σ_{j=1}^{n} v_j y_j y_jᵀ )⁻¹
      subject to  Σ_{j=1}^{n} v_j t̂_j ≤ τ,
                  0 ≤ v_j ≤ 1,  j ∈ [n],
with variable v ∈ R^n. We call this method ED (time). Scalarizing the covariance by minimizing its determinant is called D-optimal design. Several other scalarizations can also be used, including the covariance norm (E-optimal) or trace (A-optimal). Replacing each t̂_j by 1 gives an alternative heuristic that bounds the number of models fit by τ; we call this method ED (number).
Problem (1) is a convex optimization problem, and we obtain an approximate Boolean solution by rounding the largest entries of v up to 1 until the selected models reach the time limit τ. Let S be the set of indices of models that we choose to observe, i.e., the set of j for which v_j rounds to 1.
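To illustrate the runtime-aware selection idea, here is a greedy D-optimal heuristic in NumPy. It is a simplified stand-in for rounding the convex relaxation in Problem (1), and the latent features, predicted runtimes, and budget are synthetic:

```python
import numpy as np

def greedy_d_optimal(Y, t_hat, tau, ridge=1e-6):
    """Greedy, budgeted D-optimal selection: repeatedly add the model whose
    latent meta-features y_j most increase log det(sum_j y_j y_j^T), skipping
    models whose predicted runtime would overrun the remaining budget.
    A heuristic stand-in for rounding the convex relaxation, not Oboe's code."""
    k, n = Y.shape
    A = ridge * np.eye(k)          # small ridge keeps the determinant finite
    chosen, spent = [], 0.0
    while True:
        best_j, best_gain = None, 0.0
        for j in range(n):
            if j in chosen or spent + t_hat[j] > tau:
                continue
            # Matrix determinant lemma: det(A + y y^T) = det(A) (1 + y^T A^{-1} y)
            gain = np.log1p(Y[:, j] @ np.linalg.solve(A, Y[:, j]))
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:         # budget exhausted or all models chosen
            return chosen
        A = A + np.outer(Y[:, best_j], Y[:, best_j])
        chosen.append(best_j)
        spent += t_hat[best_j]

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 30))        # latent meta-features of 30 models
t_hat = rng.uniform(0.5, 5.0, size=30)  # predicted runtimes (seconds)
S = greedy_d_optimal(Y, t_hat, tau=10.0)
```

The matrix determinant lemma makes each candidate evaluation cheap, so the greedy pass is fast relative to fitting even one machine learning model.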
4. The Oboe system
Shown in Figure 4, the Oboe system divides into offline and online stages. The offline stage is executed only once and explores the space of model performance on meta-training datasets. Time spent in this stage does not affect the runtime of Oboe on a new dataset; the runtime experienced by the user is that of the online stage.
One advantage of Oboe is that the vast majority of the time in the online phase is spent training standard machine learning models, while very little time is required to decide which models to sample. Training these standard machine learning models requires running algorithms on datasets with thousands of data points and features, while the meta-learning task — deciding which models to sample — requires only solving a small least-squares problem.
4.1. Offline Stage
The (i, j)-th entry of the error matrix E, denoted E_{ij}, records the performance of the j-th model on the i-th meta-training dataset. We generate the error matrix using the balanced error rate metric, the average of false positive and false negative rates across the different classes. At the same time, we record the runtimes of the machine learning models on these datasets, which are used to fit the runtime predictors described in Section 3.2. Pseudocode for the offline stage is shown as Algorithm 1.
4.2. Online Stage
Recall that we repeatedly double the time target of each round until we use up the total time budget. Each round is thus a subroutine of the entire online stage, shown as Algorithm 2, fit_one_round.

Time-constrained model selection (fit_one_round). Our active learning procedure selects a fast and informative collection of models to run on the meta-test dataset. Oboe uses the results of these fits to estimate the performance of all other models as accurately as possible. The procedure is as follows. First, predict model runtimes on the meta-test dataset using the fitted runtime predictors. Then use experiment design to select a subset S of entries of e, the performance vector of the test dataset, to observe. The observed entries are used to compute x̂, an estimate of the latent meta-features of the test dataset, which in turn is used to predict every entry of e. We build an ensemble out of models predicted to perform well within the time target by means of greedy forward selection (Caruana et al., 2004; Caruana et al., 2006). We denote this subroutine by ensemble_selection; it takes as input a set of base learners together with their cross-validation errors and predicted labels, and outputs an ensemble learner. The hyperparameters used by models in the ensemble can be tuned further, but in our experiments we did not observe substantial improvements from further hyperparameter tuning.
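The greedy forward selection step can be sketched as follows. This is a simplified illustration of ensemble_selection for binary classification, with fabricated base-learner predictions; it is not Oboe's exact subroutine:

```python
import numpy as np

def ensemble_selection(preds, y_val, max_size=10):
    """Greedy forward selection (Caruana et al., 2004): repeatedly add the base
    learner that most reduces the validation error of the vote-averaged
    ensemble, stopping when no candidate improves it. A simplified sketch."""
    n_models = len(preds)
    chosen, best_err = [], np.inf
    for _ in range(max_size):
        step_best, step_err = None, best_err
        for j in range(n_models):
            votes = np.mean([preds[i] for i in chosen + [j]], axis=0)
            err = float(np.mean((votes >= 0.5).astype(int) != y_val))
            if err < step_err:
                step_best, step_err = j, err
        if step_best is None:      # no candidate improves validation error
            break
        chosen.append(step_best)
        best_err = step_err
    return chosen, best_err

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=100)
# Fabricated base learners: each flips ~30% of the true labels at random.
preds = [np.where(rng.random(100) < 0.3, 1 - y_val, y_val) for _ in range(8)]
ensemble, err = ensemble_selection(preds, y_val)
```

Because the first pick is the single best base learner and later picks must strictly improve, the resulting ensemble never validates worse than its best member.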

Time target doubling. To select the rank k, Oboe starts with a small initial rank and a small time target, and doubles the time target for fit_one_round until the elapsed time reaches half of the total budget. The rank increments by 1 if the validation error of the ensemble learner decreases after a doubling, and otherwise does not change. Since the factors returned by PCA with rank k are submatrices of those returned by PCA with any rank k′ > k, we can compute all of these factors as submatrices of the factors returned by a single full-rank PCA (Golub and Van Loan, 2012). The pseudocode is shown as Algorithm 3.
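The doubling loop can be sketched as follows; fit_one_round here is a toy stand-in assumed to return a validation error, and the budget values are illustrative:

```python
import time

def online_stage(total_budget, initial_rank=2, initial_target=1.0,
                 fit_one_round=None):
    """Time-target doubling (a sketch of Algorithm 3): double the per-round
    time target until half the total budget has elapsed, incrementing the rank
    k whenever a round lowers the validation error. `fit_one_round` is a
    caller-supplied stand-in for Oboe's subroutine."""
    k, target = initial_rank, initial_target
    start = time.time()
    best_err = float("inf")
    while time.time() - start + target <= total_budget / 2:
        err = fit_one_round(k, target)
        if err < best_err:
            best_err = err
            k += 1                  # more expressive latent features next round
        target *= 2                 # double the time target
    return k, best_err

# Toy run: validation error shrinks as the rank grows.
k_final, best = online_stage(
    total_budget=2.0, initial_rank=2, initial_target=0.1,
    fit_one_round=lambda k, target: 1.0 / (k + 1))
```

Reserving the second half of the budget (the `total_budget / 2` test) mirrors the text: the remaining time goes to fitting the final round and building the ensemble.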
5. Experimental evaluation
We ran all experiments on a server with 128 Intel^{®} Xeon^{®} E74850 v4 2.10GHz CPU cores. The process of running each system on a specific dataset is limited to a single CPU core. Code for the Oboe system is at https://github.com/udellgroup/oboe; code for experiments is at https://github.com/udellgroup/oboetesting.
We test different AutoML systems on midsize OpenML and UCI datasets, using the standard machine learning models shown in Appendix A, Table 2. Since data preprocessing is not our focus, we preprocess all datasets in the same way: one-hot encode categorical features and then standardize all features to have zero mean and unit variance. These preprocessed datasets are used in all the experiments.
5.1. Performance Comparison across AutoML Systems
We compare AutoML systems that are able to select among different algorithm types under time constraints: Oboe (with the error matrix generated from midsize OpenML datasets), auto-sklearn (Feurer et al., 2015), probabilistic matrix factorization (PMF) (Fusi et al., 2018), and a time-constrained random baseline. The random baseline selects models to observe at random from those predicted to take less time than the remaining time budget, until the time limit is reached.
5.1.1. Comparison with PMF
PMF and Oboe differ in the surrogate models they use to explore the model space: PMF incrementally picks models to observe using Bayesian optimization, with model latent meta-features from probabilistic matrix factorization as features, while Oboe models algorithm performance as bilinear in model and dataset meta-features.
PMF does not limit runtime, hence we compare it to Oboe using either QR or ED (number) to decide the set of models to observe (see Section 3.3). Figure 5 compares the performance of PMF and Oboe on our collected error matrix to see which is best able to predict the smallest entry in each row. We show the regret: the difference between the minimal entry in each row and the one found by the AutoML method. In PMF, models chosen from the best algorithms on similar datasets (according to the dataset meta-features shown in Appendix B, Table 3) are used to warm-start Bayesian optimization, which then searches for the next model to observe. Oboe does not require this initial information before beginning its exploration. However, for a fair comparison, we show both “warm” and “cold” versions. The warm version observes both the models chosen by meta-features and those chosen by QR or ED; the number of observed entries in Figure 5 is the total of all observed models. The cold version starts from scratch and observes only the models chosen by QR or ED.
Figure 5 shows the surprising effectiveness of the low rank model used by Oboe:

Meta-features are of marginal value in choosing new models to observe. For QR, using models chosen by meta-features helps when the number of observed entries is small. For ED, there is no benefit to using models chosen by meta-features.

The low rank structure used by QR and ED seems to provide a better guide to which models will be informative than the Gaussian process prior used by PMF: the regret of PMF does not decrease as fast as that of Oboe using either QR or ED.
5.1.2. Comparison with auto-sklearn
The comparison with PMF assumes we can use the labels for every point in the entire dataset for model selection, so we can compare the performance of every model selected and pick the one with lowest error. In contrast, our comparison with auto-sklearn takes place in a more challenging, realistic setting: when cross-validating on the meta-test dataset, we do not know the labels of the validation fold until we evaluate the performance of the ensemble we built, within time constraints, on the training fold.
Figure 6 shows the error rate and ranking of each AutoML method as the runtime repeatedly doubles. Again, Oboe’s simple bilinear model performs surprisingly well² (²auto-sklearn’s GitHub Issue #537 says “Do not start auto-sklearn for time limits less than 60s”. These plots should not be taken as criticisms of auto-sklearn, but demonstrate Oboe’s ability to select a model within a short time.):

The quality of the initial models computed by Oboe and by auto-sklearn is comparable, but Oboe computes its first nontrivial model more than 8× faster than auto-sklearn (Figures 5(a) and 5(b)). In contrast, auto-sklearn must first compute meta-features for each dataset, which requires substantial computational time, as shown in Appendix C, Figure 11.

Interestingly, Oboe’s models also improve with time faster than auto-sklearn’s: the improvement Oboe makes before 16s matches that of auto-sklearn from 16s to 64s. This suggests that a large time budget may be better spent fitting more models than optimizing over hyperparameters, to which auto-sklearn devotes the remaining time.

Experiment design leads to better results than random selection in almost all cases.
5.2. Why does Oboe Work?
Oboe performs well in comparison with other AutoML methods despite making a rather strong assumption about the structure of model performance across datasets: namely, bilinearity. It also requires effective predictions for model runtime. In this section, we perform additional experiments on components of the Oboe system to elucidate why the method works, whether our assumptions are warranted, and how they depend on detailed modeling choices.
Low rank under different metrics. Oboe uses the balanced error rate to construct the error matrix, and works on the premise that the error matrix can be approximated by a low rank matrix. However, there is nothing special about the balanced error rate metric: most metrics result in an approximately low rank error matrix. For example, when using the AUC metric to measure error, the 418-by-219 error matrix from midsize OpenML datasets has only 38 singular values greater than 1% of the largest, and 12 greater than 3%.
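The approximate-rank measure used here (the number of singular values at least a given fraction of the largest) is easy to compute. A toy check on a synthetic low-rank-plus-noise matrix, standing in for a real error matrix:

```python
import numpy as np

def approximate_rank(E, frac=0.01):
    """Number of singular values of E no smaller than `frac` of the largest;
    the approximate-rank measure used in the text (illustrative helper)."""
    s = np.linalg.svd(E, compute_uv=False)   # singular values, descending
    return int(np.sum(s >= frac * s[0]))

# Synthetic stand-in for an error matrix: rank 5 plus small noise.
rng = np.random.default_rng(0)
E = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80)) \
    + 1e-3 * rng.standard_normal((100, 80))
r = approximate_rank(E, frac=0.01)
```

On the real 418-by-219 error matrix the same computation yields the counts quoted above.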
(Nonnegative) low rank structure of the error matrix. The features computed by PCA are dense and in general difficult to interpret. In contrast, nonnegative matrix factorization (NMF) produces sparse, nonnegative feature vectors and is thus widely used for clustering and interpretability (Xu et al., 2003; Kim and Park, 2008; Türkmen, 2015). We perform NMF on the error matrix to find nonnegative factors W and H with E ≈ WH. The cluster membership of each model is given by the largest entry in its corresponding column of H.
Figure 8 shows the heatmap of algorithm cluster memberships when the number of clusters equals the number of singular values no smaller than 3% of the largest one. Algorithm types are sparse across clusters: each cluster contains at most 3 types of algorithm. Also, models of the same algorithm type tend to aggregate in the same clusters: for example, Clusters 1 and 4 consist mainly of tree-based models, Cluster 10 of linear models, and Cluster 12 of neighborhood models.
Runtime prediction performance. Runtimes of linear models are among the most difficult to predict, since they depend strongly on the conditioning of the problem. Our runtime prediction accuracy on midsize OpenML datasets is shown in Table 1 and Figure 7. Our empirical prediction of model runtime is roughly unbiased, so the sum of predicted runtimes across several models is a reasonably accurate estimate of their total runtime.
Algorithm type  Runtime prediction accuracy  

within factor of 2  within factor of 4  
Adaboost  83.6%  94.3% 
Decision tree  76.7%  88.1% 
Extra trees  96.6%  99.5% 
Gradient boosting  53.9%  84.3% 
Gaussian naive Bayes  89.6%  96.7% 
kNN  85.2%  88.2% 
Logistic regression  41.1%  76.0% 
Multilayer perceptron  78.9%  96.0% 
Perceptron  75.4%  94.3% 
Random Forest  94.4%  98.2% 
Kernel SVM  59.9%  86.7% 
Linear SVM  30.1%  73.2% 
Cold-start. Oboe uses D-optimal experiment design to cold-start model selection. In Figure 10, we compare this choice with A- and E-optimal design and with the nonlinear regression used in Alors (Mısır and Sebag, 2017), by means of leave-one-out cross-validation on midsize OpenML datasets. We measure performance by the relative RMSE of the predicted performance vector and by the number of correctly predicted best models, both averaged across datasets. The approximate rank of the error matrix is set to the number of singular values larger than 1% of the largest, which is 38 here. The time limit in the experiment design implementation is set to 4 seconds; the nonlinear regressor in the Alors implementation is the default RandomForestRegressor in scikit-learn 0.19.2 (Pedregosa et al., 2011).
In this comparison, the horizontal axis is the number of models selected and the vertical axis is the percentage of best-ranked models shared between the true and predicted performance vectors. D-optimal design robustly outperforms the alternatives.
Ensemble size. As shown in Figure 10, more than 70% of the ensembles constructed on midsize OpenML datasets have no more than 5 base learners. This parsimony makes our ensembles easy to implement and interpret.
6. Summary
Oboe is an AutoML system that uses collaborative filtering and optimal experiment design to predict performance of machine learning models. By fitting a few models on the meta-test dataset, this system transfers knowledge from meta-training datasets to select a promising set of models. Oboe naturally handles different algorithm and hyperparameter types and can match state-of-the-art performance of AutoML systems much more quickly than competing approaches.
This work demonstrates the promise of collaborative filtering approaches to AutoML. However, there is much more left to do. Future work is needed to adapt Oboe to different loss metrics, budget types, sparsely observed error matrices, and a wider range of machine learning algorithms. Adapting a collaborative filtering approach to search for good machine learning pipelines, rather than individual algorithms, presents a more substantial challenge. We also hope to see more approaches to the challenge of choosing hyper-hyperparameter settings subject to limited computation and data: meta-learning is generally data(set)-constrained. With continuing efforts by the AutoML community, we look forward to a world in which domain experts seeking to use machine learning can focus on data quality and problem formulation, rather than on tasks — such as algorithm selection and hyperparameter tuning — which are suitable for automation.
Acknowledgements.
This work was supported in part by DARPA Award FA8750-17-2-0101. The authors thank Christophe Giraud-Carrier, Ameet Talwalkar, Raul Astudillo Marban, Matthew Zalesak, Lijun Ding and Davis Wertheimer for helpful discussions, thank Jack Dunn for a script to parse UCI Machine Learning Repository datasets, and thank several anonymous reviewers for useful comments.

References
 Bardenet et al. (2013) Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. 2013. Collaborative hyperparameter tuning. In ICML. 199–207.
 Bartz-Beielstein and Markon (2004) Thomas Bartz-Beielstein and Sandor Markon. 2004. Tuning search algorithms for real-world applications: A regression tree based approach. In Congress on Evolutionary Computation, Vol. 1. IEEE, 1111–1118.
 Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems. 2546–2554.
 Bischl et al. (2017) Bernd Bischl, Jakob Richter, Jakob Bossek, Daniel Horn, Janek Thomas, and Michel Lang. 2017. mlrMBO: A modular framework for model-based optimization of expensive black-box functions. arXiv preprint arXiv:1703.03373 (2017).
 Boyd and Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge University Press.
 Caruana et al. (2006) Rich Caruana, Art Munson, and Alexandru Niculescu-Mizil. 2006. Getting the most out of ensemble selection. In ICDM. IEEE, 828–833.
 Caruana et al. (2004) Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. 2004. Ensemble selection from libraries of models. In ICML. ACM, 18.
 Chen et al. (2018) Boyuan Chen, Harvey Wu, Warren Mo, Ishanu Chattopadhyay, and Hod Lipson. 2018. Autostacker: A compositional evolutionary learning system. In Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 402–409.
 Cunha et al. (2018) Tiago Cunha, Carlos Soares, and André C. P. L. F. de Carvalho. 2018. CF4CF: Recommending Collaborative Filtering Algorithms Using Collaborative Filtering. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys ’18). ACM, New York, NY, USA, 357–361. https://doi.org/10.1145/3240323.3240378
 Dheeru and Karra Taniskidou (2017) Dua Dheeru and Efi Karra Taniskidou. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
 Drori et al. (2018) Iddo Drori, Yamuna Krishnamurthy, Remi Rampin, Raoni de Paula Lourenco, Jorge Piazentin Ono, Kyunghyun Cho, Claudio Silva, and Juliana Freire. 2018. AlphaD3M: Machine learning pipeline synthesis. In AutoML Workshop at ICML.
 Feurer et al. (2015) Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems. 2962–2970.
 Feurer et al. (2014) Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. 2014. Using meta-learning to initialize Bayesian optimization of hyperparameters. In International Conference on Meta-learning and Algorithm Selection. Citeseer, 3–10.
 Fusi et al. (2018) Nicolo Fusi, Rishit Sheth, and Melih Elibol. 2018. Probabilistic matrix factorization for automated machine learning. In Advances in Neural Information Processing Systems. 3352–3361.
 Golub and Van Loan (2012) Gene H Golub and Charles F Van Loan. 2012. Matrix computations. JHU Press.
 Hazan et al. (2018) Elad Hazan, Adam Klivans, and Yang Yuan. 2018. Hyperparameter optimization: a spectral approach. In ICLR. https://openreview.net/forum?id=H1zriGeCZ
 Herbrich et al. (2003) Ralf Herbrich, Neil D Lawrence, and Matthias Seeger. 2003. Fast sparse Gaussian process methods: The informative vector machine. In Advances in Neural Information Processing Systems. 625–632.
 Huang et al. (2010) Ling Huang, Jinzhu Jia, Bin Yu, Byung-Gon Chun, Petros Maniatis, and Mayur Naik. 2010. Predicting execution time of computer programs using sparse polynomial regression. In Advances in Neural Information Processing Systems. 883–891.
 Hutter et al. (2006) Frank Hutter, Youssef Hamadi, Holger H Hoos, and Kevin Leyton-Brown. 2006. Performance prediction and automated tuning of randomized and parametric algorithms. In International Conference on Principles and Practice of Constraint Programming. Springer, 213–228.
 Hutter et al. (2011) Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2011. Sequential Model-Based Optimization for General Algorithm Configuration. LION 5 (2011), 507–523.
 Hutter et al. (2014) Frank Hutter, Lin Xu, Holger H Hoos, and Kevin Leyton-Brown. 2014. Algorithm runtime prediction: Methods & evaluation. Artificial Intelligence 206 (2014), 79–111.
 John and Draper (1975) RC St John and Norman R Draper. 1975. D-optimality for regression designs: a review. Technometrics 17, 1 (1975), 15–23.
 Kim and Park (2008) Jingu Kim and Haesun Park. 2008. Sparse nonnegative matrix factorization for clustering. Technical Report. Georgia Institute of Technology.
 Krause et al. (2008) Andreas Krause, Ajit Singh, and Carlos Guestrin. 2008. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research 9, Feb (2008), 235–284.
 Leite et al. (2012) Rui Leite, Pavel Brazdil, and Joaquin Vanschoren. 2012. Selecting classification algorithms with active testing. In International Workshop on Machine Learning and Data Mining in Pattern Recognition. Springer, 117–131.
 Lemke et al. (2015) Christiane Lemke, Marcin Budka, and Bogdan Gabrys. 2015. Metalearning: a survey of trends and technologies. Artificial Intelligence Review 44, 1 (2015), 117–130.
 MacKay (1992) David JC MacKay. 1992. Information-based objective functions for active data selection. Neural Computation 4, 4 (1992), 590–604.
 Mısır and Sebag (2017) Mustafa Mısır and Michèle Sebag. 2017. Alors: An algorithm recommender system. Artificial Intelligence 244 (2017), 291–314.
 Mood et al. (1946) Alexander M Mood et al. 1946. On Hotelling’s weighing problem. The Annals of Mathematical Statistics 17, 4 (1946), 432–446.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
 Pfahringer et al. (2000) Bernhard Pfahringer, Hilan Bensusan, and Christophe G Giraud-Carrier. 2000. Meta-Learning by Landmarking Various Learning Algorithms. In ICML. 743–750.
 Pukelsheim (1993) Friedrich Pukelsheim. 1993. Optimal design of experiments. Vol. 50. SIAM.
 Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher KI Williams. 2006. Gaussian processes for machine learning. the MIT Press.
 Sebastiani and Wynn (2000) Paola Sebastiani and Henry P Wynn. 2000. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62, 1 (2000), 145–157.
 Smith-Miles and van Hemert (2011) Kate Smith-Miles and Jano van Hemert. 2011. Discovering the suitability of optimisation algorithms by learning from evolved instances. Annals of Mathematics and Artificial Intelligence 61, 2 (2011), 87–104.
 Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems. 2951–2959.
 Srinivas et al. (2010) Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In ICML. 1015–1022.
 Stern et al. (2010) David H Stern, Horst Samulowitz, Ralf Herbrich, Thore Graepel, Luca Pulina, and Armando Tacchella. 2010. Collaborative Expert Portfolio Management. In AAAI. 179–184.
 Türkmen (2015) Ali Caner Türkmen. 2015. A review of nonnegative matrix factorization methods for clustering. arXiv preprint arXiv:1507.03194 (2015).
 Udell and Townsend (2019) Madeleine Udell and Alex Townsend. 2019. Why Are Big Data Matrices Approximately Low Rank? SIAM Journal on Mathematics of Data Science 1, 1 (2019), 144–160.
 Vanschoren et al. (2013) Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15, 2 (2013), 49–60. https://doi.org/10.1145/2641190.2641198
 Wald (1943) Abraham Wald. 1943. On the efficient design of statistical investigations. The Annals of Mathematical Statistics 14, 2 (1943), 134–140.
 Wistuba et al. (2015) M. Wistuba, N. Schilling, and L. Schmidt-Thieme. 2015. Learning hyperparameter optimization initializations. In IEEE International Conference on Data Science and Advanced Analytics. 1–10. https://doi.org/10.1109/DSAA.2015.7344817
 Xu et al. (2003) Wei Xu, Xin Liu, and Yihong Gong. 2003. Document clustering based on nonnegative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 267–273.
 Yogatama and Mann (2014) Dani Yogatama and Gideon Mann. 2014. Efficient transfer learning method for automatic hyperparameter tuning. In Artificial Intelligence and Statistics. 1077–1085.
 Zhang et al. (2016) Yuyu Zhang, Mohammad Taha Bahadori, Hang Su, and Jimeng Sun. 2016. FLASH: fast Bayesian optimization for data analytic pipelines. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2065–2074.
Algorithm type  Hyperparameter names (values) 

Adaboost  n_estimators (50,100), learning_rate (1.0,1.5,2.0,2.5,3) 
Decision tree  min_samples_split (2,4,8,16,32,64,128,256,512,1024,0.01,0.001,0.0001,1e-05) 
Extra trees  min_samples_split (2,4,8,16,32,64,128,256,512,1024,0.01,0.001,0.0001,1e-05), criterion (gini,entropy) 
Gradient boosting  learning_rate (0.001,0.01,0.025,0.05,0.1,0.25,0.5), max_depth (3, 6), max_features (null,log2) 
Gaussian naive Bayes   
kNN  n_neighbors (1,3,5,7,9,11,13,15), p (1,2) 
Logistic regression  C (0.25,0.5,0.75,1,1.5,2,3,4), solver (liblinear,saga), penalty (l1,l2) 
Multilayer perceptron  learning_rate_init (0.0001,0.001,0.01), learning_rate (adaptive), solver (sgd,adam), alpha (0.0001, 0.01) 
Perceptron   
Random forest  min_samples_split (2,4,8,16,32,64,128,256,512,1024,0.01,0.001,0.0001,1e-05), criterion (gini,entropy) 
Kernel SVM  C (0.125,0.25,0.5,0.75,1,2,4,8,16), kernel (rbf,poly), coef0 (0,10) 
Linear SVM  C (0.125,0.25,0.5,0.75,1,2,4,8,16) 
For reproducibility, please refer to our GitHub repositories (the Oboe system: https://github.com/udellgroup/oboe; experiments: https://github.com/udellgroup/oboe-testing). Additional information is as follows.
Appendix A Machine Learning Models
The machine learning algorithms and their hyperparameter settings are shown in Table 2; the hyperparameter names are the same as those in scikit-learn 0.19.2.
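As an illustration of how one entry of the error matrix can be generated (a hypothetical sketch, not the actual Oboe code), the snippet below enumerates the kNN grid from Table 2 and computes the cross-validated error of each resulting model, i.e., an algorithm together with one hyperparameter setting; the iris dataset stands in for an arbitrary training set.

```python
# Hypothetical sketch: cross-validated errors for every model in the
# kNN row of Table 2 (an algorithm plus one hyperparameter setting).
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# The kNN hyperparameter grid from Table 2.
grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 13, 15], "p": [1, 2]}

X, y = load_iris(return_X_y=True)  # placeholder dataset

errors = {}
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    model = KNeighborsClassifier(**params)
    # Cross-validated error = 1 - mean cross-validated accuracy.
    errors[tuple(params.items())] = 1 - cross_val_score(model, X, y, cv=5).mean()

best = min(errors, key=errors.get)
print(dict(best), errors[best])
```

Repeating this over every algorithm type in Table 2 and every dataset in the training corpus yields the columns and rows, respectively, of the error matrix.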
Appendix B Dataset Meta-features
Dataset meta-features used throughout the experiments are listed in Table 3 (next page).
Meta-feature name  Explanation 
number of instances  number of data points in the dataset 
log number of instances  the (natural) logarithm of number of instances 
number of classes  
number of features  
log number of features  the (natural) logarithm of number of features 
number of instances with missing values  
percentage of instances with missing values  
number of features with missing values  
percentage of features with missing values  
number of missing values  
percentage of missing values  
number of numeric features  
number of categorical features  
ratio numerical to nominal  the ratio of number of numerical features to the number of categorical features 
ratio nominal to numerical  the ratio of number of categorical features to the number of numerical features 
dataset ratio  the ratio of number of features to the number of data points 
log dataset ratio  the natural logarithm of dataset ratio 
inverse dataset ratio  
log inverse dataset ratio  
class probability (min, max, mean, std)  the (min, max, mean, std) of ratios of data points in each class 
symbols (min, max, mean, std, sum)  the (min, max, mean, std, sum) of the numbers of symbols in all categorical features 
kurtosis (min, max, mean, std)  
skewness (min, max, mean, std)  
class entropy  the entropy of the distribution of class labels (logarithm base 2) 
landmarking (Pfahringer et al., 2000) meta-features  
LDA  
decision tree  decision tree classifier with 10-fold cross-validation 
decision node learner  10-fold cross-validated decision tree classifier with criterion="entropy", max_depth=1, min_samples_split=2, min_samples_leaf=1, max_features=None 
random node learner  10-fold cross-validated decision tree classifier with max_features=1 and the same settings as above for the rest 
1-NN  
PCA fraction of components for 95% variance  the fraction of components that account for 95% of variance 
PCA kurtosis first PC  kurtosis of the dimensionality-reduced data matrix along the first principal component 
PCA skewness first PC  skewness of the dimensionality-reduced data matrix along the first principal component 
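The statistical meta-features in Table 3 are simple functions of the data and label matrices. As an illustrative sketch (the dictionary keys follow Table 3, but the code is ours, not the Oboe codebase), the following computes a handful of them with NumPy, SciPy, and scikit-learn:

```python
# Illustrative sketch: a few of the Table 3 meta-features on a sample dataset.
import numpy as np
from scipy.stats import entropy, kurtosis, skew
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)  # placeholder dataset

class_probs = np.bincount(y) / len(y)  # fraction of points in each class
meta = {
    "number of instances": X.shape[0],
    "log number of instances": np.log(X.shape[0]),
    "number of classes": len(np.unique(y)),
    "number of features": X.shape[1],
    "dataset ratio": X.shape[1] / X.shape[0],
    "class probability min": class_probs.min(),
    # entropy of the label distribution, logarithm base 2
    "class entropy": entropy(class_probs, base=2),
    "kurtosis mean": kurtosis(X, axis=0).mean(),
    "skewness mean": skew(X, axis=0).mean(),
}

# Fraction of principal components needed to explain 95% of the variance.
pca = PCA().fit(X)
n_components = np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95) + 1
meta["PCA fraction of components for 95% variance"] = n_components / X.shape[1]
```

The landmarking meta-features are computed differently: each one is the cross-validated accuracy of a cheap classifier (e.g., a decision stump or 1-NN) fit to the dataset itself.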
Appendix C Meta-feature Calculation Time
Even on datasets that are not very large, the time taken to calculate the meta-features in the previous section is already non-negligible, as shown in Figure 11. Each dot represents one midsize OpenML dataset.
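Most of this cost comes from the landmarking meta-features, which each require fitting a model with cross-validation. A rough sketch of the kind of comparison behind this observation (our own code, with the decision node learner configuration taken from Table 3):

```python
# Rough sketch: timing cheap statistical meta-features against one
# landmarking meta-feature (a 10-fold cross-validated decision stump).
import time

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)  # placeholder midsize dataset

start = time.perf_counter()
simple = {"log number of instances": np.log(X.shape[0]),
          "dataset ratio": X.shape[1] / X.shape[0]}
simple_time = time.perf_counter() - start

start = time.perf_counter()
stump = DecisionTreeClassifier(criterion="entropy", max_depth=1,
                               min_samples_split=2, min_samples_leaf=1,
                               max_features=None)
landmark = cross_val_score(stump, X, y, cv=10).mean()  # decision node learner
landmark_time = time.perf_counter() - start

# The landmarking meta-feature requires 10 model fits, so it dominates.
print(f"simple: {simple_time:.2e}s, landmarking: {landmark_time:.2e}s")
```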