Oboe: Collaborative Filtering for AutoML Model Selection

Chengrun Yang, Yuji Akimoto, Dae Won Kim, Madeleine Udell
Cornell University
{cy438, ya242, dk444, udell}@cornell.edu
Abstract.

Algorithm selection and hyperparameter tuning remain two of the most challenging tasks in machine learning. Automated machine learning (AutoML) seeks to automate these tasks to enable widespread use of machine learning by non-experts. This paper introduces Oboe, a collaborative filtering method for time-constrained model selection and hyperparameter tuning. Oboe forms a matrix of the cross-validated errors of a large number of supervised learning models (algorithms together with hyperparameters) on a large number of datasets, and fits a low rank model to learn the low-dimensional feature vectors for the models and datasets that best predict the cross-validated errors. To find promising models for a new dataset, Oboe runs a set of fast but informative algorithms on the new dataset and uses their cross-validated errors to infer the feature vector for the new dataset. Oboe can find good models under constraints on the number of models fit or the total time budget. To this end, this paper develops a new heuristic for active learning in time-constrained matrix completion based on optimal experiment design. Our experiments demonstrate that Oboe delivers state-of-the-art performance faster than competing approaches on a test bed of supervised learning problems. Moreover, the success of the bilinear model used by Oboe suggests that AutoML may be simpler than was previously understood.

AutoML, meta-learning, time-constrained, model selection, collaborative filtering

1. Introduction

It is often difficult to find the best algorithm and hyperparameter settings for a new dataset, even for experts in machine learning or data science. The large number of machine learning algorithms and their sensitivity to hyperparameter values make it practically infeasible to enumerate all configurations. Automated machine learning (AutoML) seeks to efficiently automate the selection of model (e.g., (Feurer et al., 2015; Chen et al., 2018; Fusi et al., 2018)) or pipeline (e.g., (Drori et al., 2018)) configurations, and has become more important as the number of machine learning applications increases.

We propose an algorithmic system, Oboe (the eponymous musical instrument plays the initial note to tune an orchestra), that provides an initial tuning for AutoML: it selects a good algorithm and hyperparameter combination from a discrete set of options. The resulting model can be used directly, or the hyperparameters can be tuned further. Briefly, Oboe operates as follows.

During an offline training phase, it forms a matrix of the cross-validated errors of a large number of supervised-learning models (algorithms together with hyperparameters) on a large number of datasets. It then fits a low rank model to this matrix to learn latent low-dimensional meta-features for the models and datasets. Our optimization procedure ensures these latent meta-features best predict the cross-validated errors, among all bilinear models.

To find promising models for a new dataset, Oboe chooses a set of fast but informative models to run on the new dataset and uses their cross-validated errors to infer the latent meta-features of the new dataset. Given more time, Oboe repeats this procedure using a higher rank to find higher-dimensional (and more expressive) latent features. Using a low rank model for the error matrix is a very strong structural prior.

This system addresses two important problems: 1) Time-constrained initialization: how to choose a promising initial model under time constraints. Oboe adapts easily to short times by using a very low rank and by restricting its experiments to models that will run very fast on the new dataset. 2) Active learning: how to improve on the initial guess given further computational resources. Oboe uses extra time by allowing higher ranks and more expensive computational experiments, accumulating its knowledge of the new dataset to produce more accurate (and higher-dimensional) estimates of its latent meta-features.

Oboe uses collaborative filtering for AutoML, selecting models that have worked well on similar datasets, as have many previous methods including (Bardenet et al., 2013; Stern et al., 2010; Yogatama and Mann, 2014; Feurer et al., 2015; Mısır and Sebag, 2017; Cunha et al., 2018). In collaborative filtering, the critical question is how to characterize dataset similarity so that training datasets “similar” to the test dataset faithfully predict model performance. One line of work uses dataset meta-features — simple, statistical or landmarking metrics — to characterize datasets (Pfahringer et al., 2000; Feurer et al., 2014; Feurer et al., 2015; Fusi et al., 2018; Cunha et al., 2018). Other approaches (e.g., (Wistuba et al., 2015)) avoid meta-features. Our approach builds on both of these lines of work. Oboe relies on model performance to characterize datasets, and the low rank representations it learns for each dataset may be seen (and used) as latent meta-features. Compared to AutoML systems that compute meta-features of the dataset before running any models, the flow of information in Oboe is exactly opposite: Oboe uses only the performance of various models on the datasets to compute lower dimensional latent meta-features for models and datasets.

The active learning subproblem is to gain the most information to guide further model selection. Some approaches choose a function class to capture the dependence of model performance on hyperparameters; examples are Gaussian processes (Rasmussen and Williams, 2006; Snoek et al., 2012; Bergstra et al., 2011; Fusi et al., 2018; Sebastiani and Wynn, 2000; Herbrich et al., 2003; MacKay, 1992; Srinivas et al., 2010), sparse Boolean functions (Hazan et al., 2018) and decision trees (Bartz-Beielstein and Markon, 2004; Hutter et al., 2011). Oboe chooses the set of bilinear models as its function class: predicted performance is linear in each of the latent model and dataset meta-features.

Bilinearity seems like a rather strong assumption, but confers several advantages. Computations are fast and easy: we can find the global minimizer by PCA, and can infer the latent meta-features for a new dataset using least squares. Moreover, recent theoretical work suggests that this model class is more general than it appears: roughly, and under a few mild technical assumptions, any matrix with independent rows and columns whose entries are generated according to a fixed function (here, the function computed by training the model on the dataset) has an approximate rank that grows only logarithmically in the matrix dimensions (Udell and Townsend, 2019). Hence large data matrices tend to look low rank.

Originally, the authors conceived of Oboe as a system to produce a good set of initial models, to be refined by other local search methods, such as Bayesian optimization. However, in our experiments, we find that Oboe’s performance, refined by fitting models of ever higher rank with ever more data, actually improves faster than competing methods that use local search methods more heavily.

One key component of our system is the prediction of model runtime on new datasets. Many authors have previously studied algorithm runtime prediction using a variety of dataset features (Hutter et al., 2014), via ridge regression (Huang et al., 2010), neural networks (Smith-Miles and van Hemert, 2011), Gaussian processes (Hutter et al., 2006), and more. Several measures have been proposed to trade off accuracy against runtime (Leite et al., 2012; Bischl et al., 2017). We predict algorithm runtime using only the number of samples and features in the dataset. This model is particularly simple but surprisingly effective.

Classical experiment design (ED) (Wald, 1943; Mood et al., 1946; John and Draper, 1975; Pukelsheim, 1993; Boyd and Vandenberghe, 2004) selects features to observe to minimize the variance of the parameter estimate, assuming that features depend on the parameters according to known, linear, functions. Oboe’s bilinear model fits this paradigm, and so ED can be used to select informative models. Budget constraints can be added, as we do here, to select a small number of promising machine learning models or a set predicted to finish within a short time budget (Krause et al., 2008; Zhang et al., 2016).

This paper is organized as follows. Section 2 introduces notation and terminology. Section 3 describes the main ideas we use in Oboe. Section 4 presents Oboe in detail. Section 5 shows experiments.

2. Notation and Terminology

Meta-learning.  Meta-learning is the process of learning across individual datasets or problems, which are subsystems on which standard learning is performed (Lemke et al., 2015). Just as standard machine learning must avoid overfitting, experiments testing AutoML systems must avoid meta-overfitting! We divide our set of datasets into meta-training, meta-validation and meta-test sets, and report results on the meta-test set. Each of the three phases in meta-learning — meta-training, meta-validation and meta-test — is a standard learning process that includes training, validation and test.

Figure 1. Standard learning vs. meta-learning: (a) learning; (b) meta-learning.

Indexing.  Throughout this paper, all vectors are column vectors. Given a matrix $A$, we write $A_{i:}$ and $A_{:j}$ for the $i$th row and $j$th column of $A$, respectively. The index $i \in [M]$ ranges over datasets and the index $j \in [N]$ over models. We define $[n] = \{1, \ldots, n\}$ for $n \in \mathbb{N}$. Given an ordered set $S = \{s_1, \ldots, s_k\} \subseteq [n]$, we write $A_{:S} = [A_{:s_1} \cdots A_{:s_k}]$ for the submatrix of $A$ whose columns are indexed by $S$.

Algorithm performance.  A model is a specific algorithm-hyperparameter combination, e.g., $k$-NN with a particular choice of $k$. We denote by $E_{ij}$ the expected cross-validation error of model $j$ on dataset $i$, where the expectation is with respect to the cross-validation splits. We refer to the model in our collection that achieves minimal error on dataset $i$ as the best model for $i$. A model $j$ is said to be observed on dataset $i$ if we have calculated $E_{ij}$ by fitting (and cross-validating) the model. The performance vector $e_i \in \mathbb{R}^N$ of dataset $i$ concatenates $E_{ij}$ for each model $j$ in our collection.

Meta-features.  We discuss two types of meta-features in this paper. Meta-features refer to metrics used to characterize datasets or models. For example, the number of data points or the performance of simple models on a dataset can serve as meta-features of the dataset. As an example, we list the meta-features used in the AutoML framework auto-sklearn in Appendix B, Table 3. In contrast to standard meta-features, we use the term latent meta-features to refer to characterizations learned from matrix factorization.

Parametric hierarchy.  We distinguish between three kinds of parameters:

  • Parameters of a model (e.g., the splits in a decision tree) are obtained by training the model.

  • Hyperparameters of an algorithm (e.g., the maximum depth of a decision tree) govern the training procedure. We use the word model to refer to an algorithm together with a particular choice of hyperparameters.

  • Hyper-hyperparameters of a meta-learning method (e.g., the total time budget for Oboe) govern meta-training.

Time target and time budget.  The time target refers to the anticipated time spent running models to infer latent features at each fixed rank; it can be exceeded, but the actual runtime does not usually deviate much from the target because our model runtime prediction works well. The time budget refers to the total time limit for Oboe and is never exceeded.

Midsize OpenML and UCI datasets.  Our experiments use OpenML (Vanschoren et al., 2013) and UCI (Dheeru and Karra Taniskidou, 2017) classification datasets with between 150 and 10,000 data points and with no missing entries.

3. Methodology

3.1. Model Performance Prediction

It can be difficult to determine a priori which meta-features to use so that algorithms perform similarly well on datasets with similar meta-features. Also, the computation of meta-features can be expensive (see Appendix C, Figure 11). To infer model performance on a dataset without any expensive meta-feature calculations, we use collaborative filtering to infer latent meta-features for datasets.

Figure 2. Illustration of model performance prediction via the error matrix (yellow blocks). Perform PCA on the error matrix (offline) to compute dataset ($x_i$) and model ($y_j$) latent meta-features (orange blocks). Given a new dataset (row with white and blue blocks), pick a subset of models to observe (blue blocks). Use the model latent meta-features together with the observed errors to impute the performance of the unobserved models on the new dataset (white blocks).

As shown in Figure 2, we construct an empirical error matrix $E \in \mathbb{R}^{M \times N}$, where every entry $E_{ij}$ records the cross-validated error of model $j$ on dataset $i$. Empirically, $E$ has approximately low rank: Figure 3 shows that the singular values decay rapidly as a function of their index. This observation serves as the foundation of our algorithm and is analyzed in greater detail in Section 5.2. The entry $E_{ij}$ provides a noisy but unbiased estimate of the true expected performance of model $j$ on dataset $i$.

To denoise this estimate, we approximate $E_{ij} \approx x_i^\top y_j$, where the dataset latent meta-features $x_i \in \mathbb{R}^k$ (for $i \in [M]$) and model latent meta-features $y_j \in \mathbb{R}^k$ (for $j \in [N]$) minimize $\sum_{i,j} (E_{ij} - x_i^\top y_j)^2$ among all bilinear models; the solution is given by PCA. The rank $k$ controls model fidelity: small $k$ gives coarse approximations, while large $k$ may overfit. We use a doubling scheme to choose $k$ within the time budget; see Section 4.2 for details.

Figure 3. Singular value decay of an error matrix. The entries are calculated by 5-fold cross validation of the machine learning models (listed in Appendix A, Table 2) on midsize OpenML datasets.

Given a new meta-test dataset, we choose a subset $S \subseteq [N]$ of models and observe the performance $e_j$ of model $j$ for each $j \in S$. A good choice of $S$ balances information gain against the time needed to run the models; we discuss how to choose $S$ in Section 3.3. We then infer latent meta-features $\hat{x} \in \mathbb{R}^k$ for the new dataset by solving the least squares problem $\min_{x \in \mathbb{R}^k} \sum_{j \in S} (x^\top y_j - e_j)^2$. For every unobserved model, we predict its performance as $\hat{e}_j = \hat{x}^\top y_j$ for $j \notin S$.
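To make this concrete, the sketch below is a minimal numpy illustration in our notation (not the production Oboe code; for simplicity it skips the mean-centering a full PCA would include). It factors the error matrix offline, then imputes the performance vector of a new dataset from a handful of observed entries.

```python
import numpy as np

def fit_latent_features(E, k):
    """Rank-k factorization of the error matrix E (datasets x models) via SVD.
    Returns dataset latent meta-features X (M x k) and model latent
    meta-features Y (k x N), so that E is approximately X @ Y."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    X = U[:, :k] * s[:k]          # dataset latent meta-features (rows)
    Y = Vt[:k, :]                 # model latent meta-features (columns)
    return X, Y

def impute_new_dataset(Y, observed_idx, observed_errors):
    """Infer latent meta-features of a new dataset from the cross-validated
    errors of a few observed models, then predict all other entries."""
    # Least squares: find x minimizing sum_{j in S} (x^T y_j - e_j)^2.
    x_hat, *_ = np.linalg.lstsq(Y[:, observed_idx].T, observed_errors, rcond=None)
    e_hat = x_hat @ Y             # predicted errors for every model
    return x_hat, e_hat

# Toy usage on a synthetic error matrix; in practice the observed errors
# come from actually running the selected models on the new dataset.
rng = np.random.default_rng(0)
E = rng.random((50, 200))                 # 50 datasets x 200 models
X, Y = fit_latent_features(E, k=5)
S = [3, 17, 42, 101, 150]                 # models observed on the new dataset
x_hat, e_hat = impute_new_dataset(Y, S, E[0, S])   # pretend row 0 is new
```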

3.2. Runtime Prediction

Estimating model runtime allows us to trade off between running slow, informative models and fast, less informative models. We use a simple method to estimate runtimes: polynomial regression on $n_i$ and $p_i$, the numbers of data points and features in dataset $i$, and on their logarithms, since the theoretical complexities of the machine learning algorithms we use are polynomial in these quantities. Hence we fit an independent regression for each model $j$:

$$\hat{f}_j = \operatorname*{argmin}_{f \in \mathcal{F}} \; \sum_{i} \big( f(n_i, p_i, \log n_i, \log p_i) - T_{ij} \big)^2,$$

where $T_{ij}$ is the runtime of machine learning model $j$ on dataset $i$, and $\mathcal{F}$ is the set of all polynomials of degree no more than 3. We denote this procedure by fit_runtime.

We observe that this model predicts runtime within a factor of two for half of the machine learning models on more than 75% of midsize OpenML datasets, and within a factor of four for nearly all models, as shown in Section 5.2 and visualized in Figure 7.
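For illustration, a minimal scikit-learn version of fit_runtime might look as follows; the specific feature set (a degree-3 polynomial in $n$, $p$, $\log n$, $\log p$) is our reading of the description above and may differ in detail from the released implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def fit_runtime(n_points, n_features, runtimes):
    """Fit a degree-3 polynomial regression of one model's runtime
    on (n, p, log n, log p) over the meta-training datasets."""
    Z = np.column_stack([n_points, n_features,
                         np.log(n_points), np.log(n_features)])
    model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
    model.fit(Z, runtimes)
    return model

def predict_runtime(model, n, p):
    """Predict this model's runtime on a new dataset with n points and p features."""
    return float(model.predict([[n, p, np.log(n), np.log(p)]])[0])
```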

3.3. Time-Constrained Information Gathering

To select a subset of models to observe, we adopt an approach that builds on classical experiment design: we suppose that fitting each machine learning model returns a linear measurement of the dataset's latent meta-features $x$, corrupted by Gaussian noise. To estimate $x$, we would like to choose a set of observations that spans $\mathbb{R}^k$ and forms a well-conditioned submatrix, but that corresponds to models which are fast to run. In passing, we note that the pivoted QR algorithm on the matrix $Y = [y_1 \cdots y_N]$ (heuristically) finds a well-conditioned set of columns of $Y$. However, we would like a method that is runtime-aware.

Our experiment design (ED) procedure minimizes a scalarization of the covariance of the estimated meta-features of the new dataset subject to runtime constraints (Wald, 1943; Mood et al., 1946; John and Draper, 1975; Pukelsheim, 1993; Boyd and Vandenberghe, 2004). Formally, define an indicator vector $v \in \{0, 1\}^N$, where entry $v_j$ indicates whether to fit model $j$. Let $\hat{t}_j$ denote the predicted runtime of model $j$ on the meta-test dataset, and let $y_j$ denote its latent meta-features, for $j \in [N]$. Now relax to allow $v \in [0, 1]^N$ and solve the optimization problem

(1)   minimize   $\log\det \Big( \sum_{j=1}^{N} v_j \, y_j y_j^\top \Big)^{-1}$
      subject to $\sum_{j=1}^{N} v_j \hat{t}_j \le \tau, \quad 0 \le v_j \le 1 \ \text{for} \ j \in [N]$

with variable $v \in \mathbb{R}^N$ and time target $\tau$. We call this method ED (time). Scalarizing the covariance by minimizing the determinant is called D-optimal design. Several other scalarizations can also be used, including the covariance norm (E-optimal) or trace (A-optimal). Replacing $\hat{t}_j$ by 1 in the constraint gives an alternative heuristic that bounds the number of models fit rather than their total runtime; we call this method ED (number).

Problem (1) is a convex optimization problem, and we obtain an approximate solution by rounding the largest entries of $v$ up to 1 until the selected models would exceed the time limit $\tau$. Let $S$ be the set of indices of the models we choose to observe, i.e., the set of $j$ for which $v_j$ rounds to 1. We refer to this subroutine as ED in Algorithm 2.
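The relaxation and rounding can be written in a few lines with cvxpy; the sketch below is an assumed minimal implementation of ED (time), not the exact Oboe code.

```python
import cvxpy as cp
import numpy as np

def ed_time(Y, t_hat, tau):
    """Relaxed D-optimal design (Problem (1)): choose which models to observe.

    Y:     k x N array whose columns are model latent meta-features y_j.
    t_hat: length-N array of predicted model runtimes on the new dataset.
    tau:   time target for this round.
    Returns the relaxed indicator vector v in [0, 1]^N.
    """
    N = Y.shape[1]
    v = cp.Variable(N)
    # Minimizing log det of the covariance in Problem (1) is equivalent to
    # maximizing log det of the information matrix sum_j v_j y_j y_j^T.
    objective = cp.Maximize(cp.log_det(Y @ cp.diag(v) @ Y.T))
    constraints = [v >= 0, v <= 1, t_hat @ v <= tau]
    cp.Problem(objective, constraints).solve()
    return v.value

def round_within_budget(v, t_hat, tau):
    """Round the largest entries of v up to 1 until the time target is hit."""
    S, elapsed = [], 0.0
    for j in np.argsort(-v):
        if elapsed + t_hat[j] > tau:
            break
        S.append(int(j))
        elapsed += t_hat[j]
    return sorted(S)
```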

4. The Oboe system

Figure 4. Diagram of data processing flow in the Oboe system.

As shown in Figure 4, the Oboe system can be divided into offline and online stages. The offline stage is executed only once and explores the space of model performance on meta-training datasets. Time taken on this stage does not affect the runtime of Oboe on a new dataset; the runtime experienced by the user is that of the online stage.

One advantage of Oboe is that the vast majority of the time in the online phase is spent training standard machine learning models, while very little time is required to decide which models to sample. Training these standard machine learning models requires running algorithms on datasets with thousands of data points and features, while the meta-learning task — deciding which models to sample — requires only solving a small least-squares problem.

4.1. Offline Stage

The $(i, j)$th entry of the error matrix $E$, denoted $E_{ij}$, records the performance of the $j$th model on the $i$th meta-training dataset. We generate the error matrix using the balanced error rate metric: the average of the false positive and false negative rates across classes. At the same time we record the runtime of each machine learning model on each dataset, which is used to fit the runtime predictors described in Section 3.2. Pseudocode for the offline stage is shown as Algorithm 1.

1: meta-training datasets $D_1, \ldots, D_M$, models $A_1, \ldots, A_N$, algorithm performance metric $\mathcal{M}$
2: error matrix $E$, runtime matrix $T$, fitted runtime predictors $\hat{f}_1, \ldots, \hat{f}_N$
3: for $i = 1, \ldots, M$ do
4:     $n_i, p_i \leftarrow$ number of data points and features in $D_i$
5:     for $j = 1, \ldots, N$ do
6:         $E_{ij} \leftarrow$ error of model $A_j$ on dataset $D_i$ according to metric $\mathcal{M}$
7:         $T_{ij} \leftarrow$ observed runtime for model $A_j$ on dataset $D_i$
8:     end for
9: end for
10: for $j = 1, \ldots, N$ do
11:     $\hat{f}_j \leftarrow$ fit_runtime$(\{(n_i, p_i, T_{ij})\}_{i=1}^{M})$
12: end for
Algorithm 1 Offline Stage

4.2. Online Stage

Recall that we repeatedly double the time target of each round until we use up the total time budget. Each round is thus a subroutine of the entire online stage and is shown as Algorithm 2, fit_one_round.

  • Time-constrained model selection (fit_one_round) Our active learning procedure selects a fast and informative collection of models to run on the meta-test dataset. Oboe uses the results of these fits to estimate the performance of all other models as accurately as possible. The procedure is as follows. First predict model runtimes on the meta-test dataset using the fitted runtime predictors. Then use experiment design to select a subset $S$ of entries of $e$, the performance vector of the meta-test dataset, to observe. The observed entries are used to compute $\hat{x}$, an estimate of the latent meta-features of the meta-test dataset, which in turn is used to predict every entry of $e$. We build an ensemble out of the models predicted to perform well within the time target by means of greedy forward selection (Caruana et al., 2004; Caruana et al., 2006). We denote this subroutine by ensemble_selection (a sketch appears after Algorithm 3); it takes as input a set of base learners together with their cross-validation errors and predicted labels, and outputs an ensemble learner. The hyperparameters used by models in the ensemble can be tuned further, but in our experiments we did not observe substantial improvements from further hyperparameter tuning.

    1: model latent meta-features $y_1, \ldots, y_N$, fitted runtime predictors $\hat{f}_1, \ldots, \hat{f}_N$, training fold $D^{\text{tr}}$ of the meta-test dataset, number $N'$ of best models to select from the estimated performance vector, time target $\tau$ for this round
    2: ensemble learner $\hat{A}$
    3: for $j = 1, \ldots, N$ do
    4:     $\hat{t}_j \leftarrow \hat{f}_j(n_{D^{\text{tr}}}, p_{D^{\text{tr}}})$
    5: end for
    6: $S \leftarrow$ ED$(y_{1:N}, \hat{t}_{1:N}, \tau)$
    7: for $j \in S$ do
    8:     $e_j \leftarrow$ cross-validation error of model $A_j$ on $D^{\text{tr}}$
    9: end for
    10: $\hat{x} \leftarrow \operatorname{argmin}_x \sum_{j \in S} (x^\top y_j - e_j)^2$
    11: $\hat{e}_j \leftarrow \hat{x}^\top y_j$ for $j \notin S$
    12: $\mathcal{T} \leftarrow$ the $N'$ models with lowest predicted errors in $\hat{e}$
    13: for $j \in \mathcal{T}$ do
    14:     $e_j \leftarrow$ cross-validation error of model $A_j$ on $D^{\text{tr}}$
    15: end for
    16: $\hat{A} \leftarrow$ ensemble_selection$(\mathcal{T}, \{e_j\}_{j \in \mathcal{T}})$
    Algorithm 2 fit_one_round
  • Time target doubling To select the rank $k$, Oboe starts with a small initial rank and a small time target, then doubles the time target for fit_one_round until the elapsed time reaches half of the total budget. The rank increments by 1 if the validation error of the ensemble learner decreases after doubling the time target, and otherwise does not change. Since the factors returned by PCA with rank $k$ are subarrays of those returned by PCA with any larger rank (Golub and Van Loan, 2012), we compute the rank-$k$ factors as subarrays of the factors returned by full-rank PCA. The pseudocode is shown as Algorithm 3.

1: error matrix $E$, runtime matrix $T$, meta-test dataset $D$, total time budget $\tau_{\text{budget}}$, fitted runtime predictors $\hat{f}_1, \ldots, \hat{f}_N$, initial time target $\tau_0$, initial approximate rank $k_0$
2: ensemble learner $\hat{A}$
3: $x_i$ for $i \in [M]$, $y_j$ for $j \in [N]$ $\leftarrow$ full-rank PCA of $E$
4: $D^{\text{tr}}, D^{\text{val}}, D^{\text{te}} \leftarrow$ training, validation and test folds of $D$
5: $\tau \leftarrow \tau_0$
6: $k \leftarrow k_0$
7: while the total elapsed time is less than $\tau_{\text{budget}} / 2$ do
8:     $y_1^{(k)}, \ldots, y_N^{(k)} \leftarrow$ $k$-dimensional subvectors of $y_1, \ldots, y_N$
9:     $\hat{A} \leftarrow$ fit_one_round$(y_{1:N}^{(k)}, \hat{f}_{1:N}, D^{\text{tr}}, \tau)$
10:     $\epsilon \leftarrow$ validation error of $\hat{A}$ on $D^{\text{val}}$
11:     if $\epsilon$ is smaller than in the previous round then
12:         $k \leftarrow k + 1$
13:     end if
14:     record $\epsilon$ for comparison in the next round
15:     $\tau \leftarrow 2\tau$
16: end while
Algorithm 3 Online Stage
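The ensemble_selection subroutine invoked on line 16 of Algorithm 2 can be sketched as follows. For brevity this version uses a plain majority vote and unweighted error rate rather than the balanced error rate, so it should be read as an illustration of greedy forward selection (Caruana et al., 2004), not the exact Oboe implementation.

```python
import numpy as np

def ensemble_selection(base_preds, y_true, max_rounds=10):
    """Greedy forward ensemble selection in the spirit of Caruana et al. (2004).

    base_preds: dict mapping model id -> labels predicted on a validation fold
                (labels assumed to be nonnegative integers).
    y_true:     true labels on that validation fold.
    Returns a (multi)set of model ids; the ensemble predicts by majority vote.
    """
    def error(members):
        votes = np.stack([base_preds[m] for m in members])       # members x samples
        majority = np.array([np.bincount(col).argmax() for col in votes.T])
        return float(np.mean(majority != y_true))

    ensemble = [min(base_preds, key=lambda m: error([m]))]       # best single model
    for _ in range(max_rounds - 1):
        candidate = min(base_preds, key=lambda m: error(ensemble + [m]))
        if error(ensemble + [candidate]) >= error(ensemble):
            break                          # stop once no addition improves the ensemble
        ensemble.append(candidate)         # selection with replacement
    return ensemble
```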

5. Experimental evaluation

We ran all experiments on a server with 128 Intel® Xeon® E7-4850 v4 2.10GHz CPU cores. The process of running each system on a specific dataset is limited to a single CPU core. Code for the Oboe system is at https://github.com/udellgroup/oboe; code for experiments is at https://github.com/udellgroup/oboe-testing.

We test different AutoML systems on midsize OpenML and UCI datasets, using standard machine learning models shown in Appendix A, Table 2. Since data pre-processing is not our focus, we pre-process all datasets in the same way: one-hot encode categorical features and then standardize all features to have zero mean and unit variance. These pre-processed datasets are used in all the experiments.

5.1. Performance Comparison across AutoML Systems

We compare AutoML systems that are able to select among different algorithm types under time constraints: Oboe (with error matrix generated from midsize OpenML datasets), auto-sklearn (Feurer et al., 2015), probabilistic matrix factorization (PMF) (Fusi et al., 2018), and a time-constrained random baseline. The baseline selects models to observe uniformly at random from among those predicted to run within the remaining time budget, until the time limit is reached.

5.1.1. Comparison with PMF

PMF and Oboe differ in the surrogate models they use to explore the model space: PMF incrementally picks models to observe using Bayesian optimization, with model latent meta-features from probabilistic matrix factorization as features, while Oboe models algorithm performance as bilinear in model and dataset meta-features.

PMF does not limit runtime, hence we compare it to Oboe using either QR or ED (number) to decide the set of models to observe (see Section 3.3). Figure 5 compares the performance of PMF and these two variants of Oboe on our collected error matrix, measuring which is best able to predict the smallest entry in each row. We show the regret: the difference between the minimal entry in each row and the one found by the AutoML method. In PMF, the models that performed best on similar datasets (similarity measured by the dataset meta-features shown in Appendix B, Table 3) are used to warm-start Bayesian optimization, which then searches for the next model to observe. Oboe does not require this initial information before beginning its exploration. However, for a fair comparison, we show both "warm" and "cold" versions. The warm version observes both the models chosen by meta-features and those chosen by QR or ED; the number of observed entries in Figure 5 is the sum of both. The cold version starts from scratch and observes only models chosen by QR or ED.

(Standard ED also performs well; see Appendix D, Figure 12.)

Figure 5. Comparison of sampling schemes (QR or ED) in Oboe and PMF. "QR" denotes QR decomposition with column pivoting; "ED (number)" denotes experiment design with the number of observed entries constrained. The left plot shows the regret of each AutoML method as a function of the number of observed entries; the right shows the relative rank of each AutoML method in the regret plot (1 is best and 5 is worst).

Figure 5 shows the surprising effectiveness of the low rank model used by Oboe:

  1. Meta-features are of marginal value in choosing new models to observe. For QR, using models chosen by meta-features helps when the number of observed entries is small. For ED, there is no benefit to using models chosen by meta-features.

  2. The low rank structure used by QR and ED seems to provide a better guide to which models will be informative than the Gaussian process prior used by PMF: the regret of PMF does not decrease as fast as Oboe using either QR or ED.

5.1.2. Comparison with auto-sklearn

The comparison with PMF assumes we can use the labels for every point in the entire dataset for model selection, so we can compare the performance of every model selected and pick the one with lowest error. In contrast, our comparison with auto-sklearn takes place in a more challenging, realistic setting: when doing cross-validation on the meta-test dataset, we do not know the labels of the validation fold until we evaluate performance of the ensemble we built within time constraints on the training fold.

Figure 6 shows the error rate and ranking of each AutoML method as the runtime repeatedly doubles. Again, Oboe's simple bilinear model performs surprisingly well (auto-sklearn's GitHub Issue #537 advises "Do not start auto-sklearn for time limits less than 60s"; these plots should not be taken as criticism of auto-sklearn, but demonstrate Oboe's ability to select a model within a short time):

(a) OpenML (meta-LOOCV)
(b) UCI (meta-test)
(c) OpenML (meta-LOOCV)
(d) UCI (meta-test)
Figure 6. Comparison of AutoML systems in a time-constrained setting, including Oboe with experiment design (red), auto-sklearn (blue), and Oboe with time-constrained random initializations (green). OpenML and UCI denote midsize OpenML and UCI datasets; "meta-LOOCV" denotes leave-one-out cross-validation across datasets. In 6(a) and 6(b), solid lines represent medians; shaded areas with corresponding colors represent the regions between the 25th and 75th percentiles. Until the first time the system can produce a model, we classify every data point with the most common class label. Figures 6(c) and 6(d) show system rankings (1 is best and 3 is worst).
  1. Oboe on average performs as well as or better than auto-sklearn (Figures 6(c) and 6(d)).

  2. The quality of the initial models computed by Oboe and by auto-sklearn is comparable, but Oboe computes its first nontrivial model more than 8× faster than auto-sklearn (Figures 6(a) and 6(b)). In contrast, auto-sklearn must first compute meta-features for each dataset, which requires substantial computational time, as shown in Appendix C, Figure 11.

  3. Interestingly, the rate at which Oboe's models improve with time is also faster than that of auto-sklearn: the improvement Oboe makes before 16s matches that of auto-sklearn from 16s to 64s. This suggests that a large time budget may be better spent fitting more models than optimizing over hyperparameters, to which auto-sklearn devotes its remaining time.

  4. Experiment design leads to better results than random selection in almost all cases.

5.2. Why does Oboe Work?

Figure 7. Runtime prediction performance on different machine learning algorithms, on midsize OpenML datasets.

Oboe performs well in comparison with other AutoML methods despite making a rather strong assumption about the structure of model performance across datasets: namely, bilinearity. It also requires effective predictions for model runtime. In this section, we perform additional experiments on components of the Oboe system to elucidate why the method works, whether our assumptions are warranted, and how they depend on detailed modeling choices.

Low rank under different metrics.  Oboe uses the balanced error rate to construct the error matrix, and works on the premise that the error matrix can be approximated by a low rank matrix. However, there is nothing special about the balanced error rate metric: most metrics result in an approximately low rank error matrix. For example, when using the AUC metric to measure error, the 418-by-219 error matrix from midsize OpenML datasets has only 38 singular values greater than 1% of the largest, and 12 greater than 3%.
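This kind of approximate-rank check is easy to reproduce; assuming the error matrix is stored as a numpy array, counting singular values above a relative threshold takes a few lines.

```python
import numpy as np

def approximate_rank(E, threshold=0.03):
    """Number of singular values of E at least `threshold` times the largest."""
    s = np.linalg.svd(E, compute_uv=False)   # singular values, descending
    return int(np.sum(s >= threshold * s[0]))
```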

(Nonnegative) low rank structure of the error matrix.  The features computed by PCA are dense and in general difficult to interpret. In contrast, nonnegative matrix factorization (NMF) produces sparse, nonnegative feature vectors and is thus widely used for clustering and interpretability (Xu et al., 2003; Kim and Park, 2008; Türkmen, 2015). We perform NMF on the error matrix to find nonnegative factors $W$ and $H$ so that $E \approx WH$. The cluster membership of each model is given by the largest entry in its corresponding column of $H$.
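A minimal version of this clustering step, assuming scikit-learn's NMF rather than whichever NMF variant was used to produce Figure 8, is:

```python
import numpy as np
from sklearn.decomposition import NMF

def cluster_models(E, n_clusters):
    """Cluster models (columns of the error matrix) via NMF.

    Factor E (datasets x models, nonnegative) as W @ H, then assign each
    model to the cluster given by the largest entry of its column of H."""
    nmf = NMF(n_components=n_clusters, init="nndsvd", max_iter=500)
    W = nmf.fit_transform(E)      # datasets x n_clusters
    H = nmf.components_           # n_clusters x models
    return np.argmax(H, axis=0)   # cluster label for each model
```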

Figure 8 shows the heatmap of algorithms in clusters when the number of clusters is set to the number of singular values no smaller than 3% of the largest. Algorithm types are sparse across clusters: each cluster contains at most 3 types of algorithm. Also, models of the same algorithm type tend to aggregate in the same clusters: for example, Clusters 1 and 4 mainly consist of tree-based models, Cluster 10 of linear models, and Cluster 12 of neighborhood models.

Figure 8. Algorithm heatmap in clusters. Each block is colored by the number of models of the corresponding algorithm type in that cluster. Numbers next to the scale bar refer to the numbers of models.

Runtime prediction performance.  Runtimes of linear models are among the most difficult to predict, since they depend strongly on the conditioning of the problem. Our runtime prediction accuracy on midsize OpenML datasets is shown in Table 1 and in Figure 7. We can see that our empirical prediction of model runtime is roughly unbiased, so the sum of predicted runtimes across multiple models is a reasonably good estimate of their total runtime.

Algorithm type    Runtime predicted within factor of 2    within factor of 4
Adaboost 83.6% 94.3%
Decision tree 76.7% 88.1%
Extra trees 96.6% 99.5%
Gradient boosting 53.9% 84.3%
Gaussian naive Bayes 89.6% 96.7%
kNN 85.2% 88.2%
Logistic regression 41.1% 76.0%
Multilayer perceptron 78.9% 96.0%
Perceptron 75.4% 94.3%
Random Forest 94.4% 98.2%
Kernel SVM 59.9% 86.7%
Linear SVM 30.1% 73.2%
Table 1. Runtime prediction accuracy on OpenML datasets

Cold-start.  Oboe uses D-optimal experiment design to cold-start model selection. In Figure 9, we compare this choice with A- and E-optimal design and with the nonlinear regression used in Alors (Mısır and Sebag, 2017), by means of leave-one-out cross-validation on midsize OpenML datasets. We measure performance by the relative RMSE of the predicted performance vector and by the number of correctly predicted best models, both averaged across datasets. The approximate rank of the error matrix is set to the number of singular values larger than 1% of the largest, which is 38 here. The time limit in the experiment design implementation is set to 4 seconds; the nonlinear regressor used in the Alors implementation is the default RandomForestRegressor in scikit-learn 0.19.2 (Pedregosa et al., 2011).

In Figure 9, the horizontal axis is the number of models selected; the vertical axis is the percentage of best-ranked models shared between the true and predicted performance vectors.

Figure 9. Comparison of cold-start methods.
Figure 10. Histogram of Oboe ensemble size. The ensembles were built in executions on midsize OpenML datasets in Section 5.1.2.

D-optimal design robustly outperforms the alternative cold-start methods.

Ensemble size.  As shown in Figure 10, more than 70% of the ensembles constructed on midsize OpenML datasets have no more than 5 base learners. This parsimony makes our ensembles easy to implement and interpret.

6. Summary

Oboe is an AutoML system that uses collaborative filtering and optimal experiment design to predict performance of machine learning models. By fitting a few models on the meta-test dataset, this system transfers knowledge from meta-training datasets to select a promising set of models. Oboe naturally handles different algorithm and hyperparameter types and can match state-of-the-art performance of AutoML systems much more quickly than competing approaches.

This work demonstrates the promise of collaborative filtering approaches to AutoML. However, there is much more left to do. Future work is needed to adapt Oboe to different loss metrics, budget types, sparsely observed error matrices, and a wider range of machine learning algorithms. Adapting a collaborative filtering approach to search for good machine learning pipelines, rather than individual algorithms, presents a more substantial challenge. We also hope to see more approaches to the challenge of choosing hyper-hyperparameter settings subject to limited computation and data: meta-learning is generally data(set)-constrained. With continuing efforts by the AutoML community, we look forward to a world in which domain experts seeking to use machine learning can focus on data quality and problem formulation, rather than on tasks — such as algorithm selection and hyperparameter tuning — which are suitable for automation.

Acknowledgements.
This work was supported in part by DARPA Award FA8750-17-2-0101. The authors thank Christophe Giraud-Carrier, Ameet Talwalkar, Raul Astudillo Marban, Matthew Zalesak, Lijun Ding and Davis Wertheimer for helpful discussions, thank Jack Dunn for a script to parse UCI Machine Learning Repository datasets, and also thank several anonymous reviewers for useful comments.

References

  • Bardenet et al. (2013) Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. 2013. Collaborative hyperparameter tuning. In ICML. 199–207.
  • Bartz-Beielstein and Markon (2004) Thomas Bartz-Beielstein and Sandor Markon. 2004. Tuning search algorithms for real-world applications: A regression tree based approach. In Congress on Evolutionary Computation, Vol. 1. IEEE, 1111–1118.
  • Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems. 2546–2554.
  • Bischl et al. (2017) Bernd Bischl, Jakob Richter, Jakob Bossek, Daniel Horn, Janek Thomas, and Michel Lang. 2017. mlrMBO: A modular framework for model-based optimization of expensive black-box functions. arXiv preprint arXiv:1703.03373 (2017).
  • Boyd and Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge University Press.
  • Caruana et al. (2006) Rich Caruana, Art Munson, and Alexandru Niculescu-Mizil. 2006. Getting the most out of ensemble selection. In ICDM. IEEE, 828–833.
  • Caruana et al. (2004) Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. 2004. Ensemble selection from libraries of models. In ICML. ACM, 18.
  • Chen et al. (2018) Boyuan Chen, Harvey Wu, Warren Mo, Ishanu Chattopadhyay, and Hod Lipson. 2018. Autostacker: A compositional evolutionary learning system. In Proceedings of the Genetic and Evolutionary Computation Conference. ACM, 402–409.
  • Cunha et al. (2018) Tiago Cunha, Carlos Soares, and André C. P. L. F. de Carvalho. 2018. CF4CF: Recommending Collaborative Filtering Algorithms Using Collaborative Filtering. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys ’18). ACM, New York, NY, USA, 357–361. https://doi.org/10.1145/3240323.3240378
  • Dheeru and Karra Taniskidou (2017) Dua Dheeru and Efi Karra Taniskidou. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
  • Drori et al. (2018) Iddo Drori, Yamuna Krishnamurthy, Remi Rampin, Raoni de Paula Lourenco, Jorge Piazentin Ono, Kyunghyun Cho, Claudio Silva, and Juliana Freire. 2018. AlphaD3M: Machine learning pipeline synthesis. In AutoML Workshop at ICML.
  • Feurer et al. (2015) Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems. 2962–2970.
  • Feurer et al. (2014) Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. 2014. Using meta-learning to initialize Bayesian optimization of hyperparameters. In International Conference on Meta-learning and Algorithm Selection. Citeseer, 3–10.
  • Fusi et al. (2018) Nicolo Fusi, Rishit Sheth, and Melih Elibol. 2018. Probabilistic matrix factorization for automated machine learning. In Advances in Neural Information Processing Systems. 3352–3361.
  • Golub and Van Loan (2012) Gene H Golub and Charles F Van Loan. 2012. Matrix computations. JHU Press.
  • Hazan et al. (2018) Elad Hazan, Adam Klivans, and Yang Yuan. 2018. Hyperparameter optimization: a spectral approach. In ICLR. https://openreview.net/forum?id=H1zriGeCZ
  • Herbrich et al. (2003) Ralf Herbrich, Neil D Lawrence, and Matthias Seeger. 2003. Fast sparse Gaussian process methods: The informative vector machine. In Advances in Neural Information Processing Systems. 625–632.
  • Huang et al. (2010) Ling Huang, Jinzhu Jia, Bin Yu, Byung-Gon Chun, Petros Maniatis, and Mayur Naik. 2010. Predicting execution time of computer programs using sparse polynomial regression. In Advances in Neural Information Processing Systems. 883–891.
  • Hutter et al. (2006) Frank Hutter, Youssef Hamadi, Holger H Hoos, and Kevin Leyton-Brown. 2006. Performance prediction and automated tuning of randomized and parametric algorithms. In International Conference on Principles and Practice of Constraint Programming. Springer, 213–228.
  • Hutter et al. (2011) Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2011. Sequential Model-Based Optimization for General Algorithm Configuration. LION 5 (2011), 507–523.
  • Hutter et al. (2014) Frank Hutter, Lin Xu, Holger H Hoos, and Kevin Leyton-Brown. 2014. Algorithm runtime prediction: Methods & evaluation. Artificial Intelligence 206 (2014), 79–111.
  • John and Draper (1975) RC St John and Norman R Draper. 1975. D-optimality for regression designs: a review. Technometrics 17, 1 (1975), 15–23.
  • Kim and Park (2008) Jingu Kim and Haesun Park. 2008. Sparse nonnegative matrix factorization for clustering. Technical Report. Georgia Institute of Technology.
  • Krause et al. (2008) Andreas Krause, Ajit Singh, and Carlos Guestrin. 2008. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research 9, Feb (2008), 235–284.
  • Leite et al. (2012) Rui Leite, Pavel Brazdil, and Joaquin Vanschoren. 2012. Selecting classification algorithms with active testing. In International Workshop on Machine Learning and Data Mining in Pattern Recognition. Springer, 117–131.
  • Lemke et al. (2015) Christiane Lemke, Marcin Budka, and Bogdan Gabrys. 2015. Metalearning: a survey of trends and technologies. Artificial Intelligence Review 44, 1 (2015), 117–130.
  • MacKay (1992) David JC MacKay. 1992. Information-based objective functions for active data selection. Neural Computation 4, 4 (1992), 590–604.
  • Mısır and Sebag (2017) Mustafa Mısır and Michèle Sebag. 2017. Alors: An algorithm recommender system. Artificial Intelligence 244 (2017), 291–314.
  • Mood et al. (1946) Alexander M Mood et al. 1946. On Hotelling’s weighing problem. The Annals of Mathematical Statistics 17, 4 (1946), 432–446.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  • Pfahringer et al. (2000) Bernhard Pfahringer, Hilan Bensusan, and Christophe G Giraud-Carrier. 2000. Meta-Learning by Landmarking Various Learning Algorithms. In ICML. 743–750.
  • Pukelsheim (1993) Friedrich Pukelsheim. 1993. Optimal design of experiments. Vol. 50. SIAM.
  • Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher KI Williams. 2006. Gaussian processes for machine learning. the MIT Press.
  • Sebastiani and Wynn (2000) Paola Sebastiani and Henry P Wynn. 2000. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62, 1 (2000), 145–157.
  • Smith-Miles and van Hemert (2011) Kate Smith-Miles and Jano van Hemert. 2011. Discovering the suitability of optimisation algorithms by learning from evolved instances. Annals of Mathematics and Artificial Intelligence 61, 2 (2011), 87–104.
  • Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems. 2951–2959.
  • Srinivas et al. (2010) Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In ICML. 1015–1022.
  • Stern et al. (2010) David H Stern, Horst Samulowitz, Ralf Herbrich, Thore Graepel, Luca Pulina, and Armando Tacchella. 2010. Collaborative Expert Portfolio Management. In AAAI. 179–184.
  • Türkmen (2015) Ali Caner Türkmen. 2015. A review of nonnegative matrix factorization methods for clustering. arXiv preprint arXiv:1507.03194 (2015).
  • Udell and Townsend (2019) Madeleine Udell and Alex Townsend. 2019. Why Are Big Data Matrices Approximately Low Rank? SIAM Journal on Mathematics of Data Science 1, 1 (2019), 144–160.
  • Vanschoren et al. (2013) Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. 2013. OpenML: Networked Science in Machine Learning. SIGKDD Explorations 15, 2 (2013), 49–60. https://doi.org/10.1145/2641190.2641198
  • Wald (1943) Abraham Wald. 1943. On the efficient design of statistical investigations. The Annals of Mathematical Statistics 14, 2 (1943), 134–140.
  • Wistuba et al. (2015) M. Wistuba, N. Schilling, and L. Schmidt-Thieme. 2015. Learning hyperparameter optimization initializations. In IEEE International Conference on Data Science and Advanced Analytics. 1–10. https://doi.org/10.1109/DSAA.2015.7344817
  • Xu et al. (2003) Wei Xu, Xin Liu, and Yihong Gong. 2003. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval. ACM, 267–273.
  • Yogatama and Mann (2014) Dani Yogatama and Gideon Mann. 2014. Efficient transfer learning method for automatic hyperparameter tuning. In Artificial Intelligence and Statistics. 1077–1085.
  • Zhang et al. (2016) Yuyu Zhang, Mohammad Taha Bahadori, Hang Su, and Jimeng Sun. 2016. FLASH: fast Bayesian optimization for data analytic pipelines. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2065–2074.
Algorithm type Hyperparameter names (values)
Adaboost n_estimators (50,100), learning_rate (1.0,1.5,2.0,2.5,3)
Decision tree min_samples_split (2,4,8,16,32,64,128,256,512,1024,0.01,0.001,0.0001,1e-05)
Extra trees min_samples_split (2,4,8,16,32,64,128,256,512,1024,0.01,0.001,0.0001,1e-05), criterion (gini,entropy)
Gradient boosting learning_rate (0.001,0.01,0.025,0.05,0.1,0.25,0.5), max_depth (3, 6), max_features (null,log2)
Gaussian naive Bayes -
kNN n_neighbors (1,3,5,7,9,11,13,15), p (1,2)
Logistic regression C (0.25,0.5,0.75,1,1.5,2,3,4), solver (liblinear,saga), penalty (l1,l2)
Multilayer perceptron learning_rate_init (0.0001,0.001,0.01), learning_rate (adaptive), solver (sgd,adam), alpha (0.0001, 0.01)
Perceptron -
Random forest min_samples_split (2,4,8,16,32,64,128,256,512,1024,0.01,0.001,0.0001,1e-05), criterion (gini,entropy)
Kernel SVM C (0.125,0.25,0.5,0.75,1,2,4,8,16), kernel (rbf,poly), coef0 (0,10)
Linear SVM C (0.125,0.25,0.5,0.75,1,2,4,8,16)
Table 2. Base Algorithm and Hyperparameter Settings

For reproducibility, please refer to our GitHub repositories (the Oboe system: https://github.com/udellgroup/oboe; experiments: https://github.com/udellgroup/oboe-testing). Additional information is as follows.

Appendix A Machine Learning Models

The machine learning models are shown in Table 2; the hyperparameter names are the same as in scikit-learn 0.19.2.

Appendix B Dataset meta-features

Dataset meta-features used throughout the experiments are listed in Table 3 (next page).

Meta-feature name Explanation
number of instances number of data points in the dataset
log number of instances the (natural) logarithm of number of instances
number of classes
number of features
log number of features the (natural) logarithm of number of features
number of instances with missing values
percentage of instances with missing values
number of features with missing values
percentage of features with missing values
number of missing values
percentage of missing values
number of numeric features
number of categorical features
ratio numerical to nominal the ratio of the number of numerical features to the number of categorical features
ratio nominal to numerical the ratio of the number of categorical features to the number of numerical features
dataset ratio the ratio of number of features to the number of data points
log dataset ratio the natural logarithm of dataset ratio
inverse dataset ratio the ratio of the number of data points to the number of features
log inverse dataset ratio the natural logarithm of inverse dataset ratio
class probability (min, max, mean, std) the (min, max, mean, std) of ratios of data points in each class
symbols (min, max, mean, std, sum) the (min, max, mean, std, sum) of the numbers of symbols in all categorical features
kurtosis (min, max, mean, std)
skewness (min, max, mean, std)
class entropy the entropy of the distribution of class labels (logarithm base 2)
landmarking (Pfahringer et al., 2000) meta-features
LDA
decision tree decision tree classifier with 10-fold cross validation
decision node learner 10-fold cross-validated decision tree classifier with criterion="entropy", max_depth=1, min_samples_split=2, min_samples_leaf=1, max_features=None
random node learner 10-fold cross-validated decision tree classifier with max_features=1 and the same above for the rest
1-NN
PCA fraction of components for 95% variance the fraction of components that account for 95% of variance
PCA kurtosis first PC kurtosis of the dimensionality-reduced data matrix along the first principal component
PCA skewness first PC skewness of the dimensionality-reduced data matrix along the first principal component
Table 3. Dataset Meta-features

Appendix C Meta-feature calculation time

Even on datasets that are not very large, the time taken to calculate the meta-features listed in the previous section is already non-negligible, as shown in Figure 11. Each dot represents one midsize OpenML dataset.

Figure 11. Meta-feature calculation time and corresponding dataset sizes of the midsize OpenML datasets. The collection of meta-features is the same as that used by auto-sklearn (Feurer et al., 2015). We can see some calculation times are not negligible.

Appendix D Comparison of experiment design with different constraints

In Section 5.1.1, our experiments compare QR and PMF to a variant of experiment design (ED) with a constraint on the number of observed entries, since QR and PMF admit a similar constraint. Figure 12 shows that the regret of ED with a runtime constraint (Equation 1) is not too much larger.

Figure 12. Comparison of different versions of ED with PMF. "ED (time)" denotes ED with the runtime constraint, with the time limit set to 10% of the total runtime of all available models; "ED (number)" denotes ED with the number of observed entries constrained.