Taking Human out of Learning Applications: A Survey on Automated Machine Learning
Machine learning techniques are deeply rooted in our everyday life. However, since pursuing good learning performance is knowledge- and labor-intensive, human experts are heavily engaged in every aspect of machine learning. In order to make machine learning techniques easier to apply and to reduce the demand for experienced human experts, automatic machine learning (AutoML) has emerged as a hot topic in both industry and academia. In this paper, we provide a survey on existing AutoML works. First, we introduce and define the AutoML problem, with inspiration from both the realms of automation and machine learning. Then, we propose a general AutoML framework that not only covers almost all existing approaches but also guides the design of new methods. Afterward, we categorize and review the existing works from two aspects, i.e., the problem setup and the employed techniques. Finally, we provide a detailed analysis of AutoML approaches and explain the reasons behind their successful applications. We hope this survey can serve not only as an insightful guideline for AutoML beginners but also as an inspiration for future research.
At the beginning of Tom Mitchell's famous machine learning textbook [tom1997machine], he wrote: “Ever since computers were invented, we have wondered whether they might be made to learn. If we could understand how to program them to learn - to improve automatically with experience - the impact would be dramatic”. This question gave birth to a new research area in computer science decades ago, i.e., machine learning, which tries to program computers to improve with experience. By now, machine learning techniques are deeply rooted in our everyday life, such as recommendation when we are reading news and handwriting recognition when we are using our cell phones. Furthermore, machine learning has also achieved significant results in academia, especially in recent years. For example, AlphaGO [silver2016mastering] beat the human champion in the game of GO, ResNet [he2016deep] surpassed human performance in image recognition, and Microsoft’s speech system [xiong2017toward] approached human-level performance in speech transcription.
However, these successful applications of machine learning are far from automatic. Since no algorithm can achieve good performance on all possible learning problems with equal importance, every aspect of a machine learning application, such as feature engineering, model selection, and algorithm selection (Figure 1), needs to be carefully configured, which usually involves heavy engagement of human experts. As these experts are a rare resource, and their designs and choices in the above aspects are not replicable, the above successes come at a great price. Thus, automatic machine learning is not only an academic dream described in Mitchell's book, but also of great practical value. If we can take humans out of these machine learning applications, we can deploy machine learning solutions quickly across organizations, quickly validate and benchmark the performance of deployed solutions, and let humans focus on problems that depend on applications and business. These would make machine learning much more available for real-world use, leading to new levels of competence and customization, of which the impact can indeed be dramatic.
Motivated by the above academic dream and practical needs, automated machine learning (AutoML) has emerged in recent years as a new sub-area of machine learning. Specifically, as illustrated in Figure 1, AutoML attempts to reduce human assistance in the design, selection and implementation of the various machine learning tools used in an application's pipeline. It has attracted increasing attention not only in machine learning but also in computer vision, data mining and natural language processing. Besides, AutoML has already been successfully applied to many important problems (Table I).
| AutoML problem | example | references |
| automatic model selection | Auto-sklearn | [feurer2015efficient, kotthoff2017auto] |
| neural architecture search | Google's Cloud | [zoph2017neural, liu2018progressive] |
| automatic feature engineering | FeatureLab | [katz2016explorekit, kanter2015deep] |
The first example is Auto-sklearn [feurer2015efficient]. As different classifiers are suitable for different learning problems [tom1997machine], it is natural to try a collection of classifiers on a new problem, and then obtain final predictions from an ensemble of them. However, picking the right classifiers and setting up their hyper-parameters are tedious tasks, which usually require human involvement. Built on top of the popular scikit-learn machine learning library [scikit-learn], Auto-sklearn can automatically find good models among the out-of-the-box machine learning tools for classification. It searches for proper models and optimizes their corresponding hyper-parameters. Thus, it frees humans from these tedious tasks and allows them to focus on the real problem.
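To make the automated selection concrete, the following toy sketch mimics what Auto-sklearn automates: trying several candidate configurations on a held-out split and keeping the best one. The data, the hand-rolled k-NN classifier, and the search space (just the neighbor count k) are illustrative assumptions, not Auto-sklearn's actual components.

```python
# Toy model selection: evaluate candidate configurations (values of k
# for a simple 1-D k-NN classifier) and keep the best by held-out accuracy.
import random

def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in neighbors]
    return max(set(votes), key=votes.count)

def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

random.seed(0)
# Synthetic 1-D data: class 0 centered at 0.0, class 1 centered at 1.0.
data = [(random.gauss(y, 0.3), y) for y in [0, 1] * 50]
train, test = data[:60], data[60:]

# "Model selection": search over the hyper-parameter k.
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, test, k))
print("selected k =", best_k, "accuracy =", accuracy(train, test, best_k))
```

In Auto-sklearn the candidate pool spans many scikit-learn classifiers and their hyper-parameters, and the surviving models are further combined into an ensemble; the loop above only illustrates the select-by-validation idea.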
The second example is neural architecture search (NAS) [zoph2017neural, xie2017genetic, baker2017designing]. Since the success of AlexNet [krizhevsky2012imagenet] on image classification of the ImageNet dataset [deng2009imagenet], changes in the design of neural architectures have been the main driving force for improving learning performance. Examples are VGGNet [simonyan2014very], GoogleNet [szegedy2015going], ResNet [he2016deep] and DenseNet [huang2017densely]. Hence, a natural question is whether neural architectures can be designed automatically so that good learning performance is obtained on the given tasks. Many researchers have been working on this problem, and reinforcement learning [sutton1998reinforcement] has been developed as a powerful and promising tool for NAS [baker2017designing, zoph2017neural, bello2017neural, zoph2017learning]. Besides, NAS has been used in Google's Cloud AutoML, which frees customers from the difficult and time-consuming process of architecture design.
The last example is automatic feature engineering. In traditional machine learning methods, the modeling performance depends greatly on the quality of the features [tom1997machine]. Hence, most machine learning practices treat feature engineering as a vital preprocessing step, in which useful features are generated or selected. Such operations were, in the past, usually carried out manually by data scientists with in-depth domain knowledge, in a trial-and-error manner. Automatic feature engineering [kanter2015deep, katz2016explorekit] aims to construct a new feature set with which the performance of subsequent machine learning tools can be improved. By this means, intensive human knowledge and labor can be spared. Existing works on this topic include Data Science Machine (DSM) [kanter2015deep], ExploreKit [katz2016explorekit] and FeatureHub [smith2017featurehub]. Besides, we have also seen commercial products such as FeatureLabs [kanter2015deep].
Thus, with such rapid development of AutoML in both research and industry, we feel it necessary to summarize existing works in a survey at this time. First, we define the AutoML problem. Then, we propose a general framework that summarizes how existing approaches work towards AutoML. This framework further motivates taxonomies of existing works based on what (by problem setup) and how (by techniques) to automate. Specifically, the problem setup clarifies what learning tools we want to use, while the techniques provide the technical methods and details for addressing the AutoML problem under the corresponding setup. Based on these taxonomies, we further give guidance on how AutoML approaches can be used and designed. (In this survey we focus on the usage of existing techniques in AutoML; for individual reviews of related topics, please refer to [vanschoren2010understanding, lemke2015metalearning, hyper2018meta] for meta-learning, [pan2009survey] for transfer learning, [hyper2018book] for hyper-parameter optimization and [hyper2018nas] for neural architecture search.)
Below, we summarize our contributions in this survey:
We are the first to formally define the AutoML problem. The definition is general enough to cover all existing AutoML problems, yet specific enough to clarify the goal of AutoML. Such a definition is helpful for setting future research targets in the AutoML area.
We propose a general framework for existing AutoML approaches. This framework is not only helpful for setting up taxonomies of existing works, but also gives insights into the problems existing approaches aim to solve. It can act as a guide for developing new approaches.
We systematically categorize the existing AutoML works based on “what” and “how”. The problem setup takes the “what” perspective, clarifying which learning tools we want to automate; the techniques take the “how” perspective, being the methods used to solve AutoML problems. For each category, we present detailed application scenarios as references.
We provide a detailed analysis of the techniques behind existing approaches. Compared with existing AutoML-related surveys [elsken2018neural], we not only investigate a more comprehensive set of existing works but also summarize the insights behind each technique. This can serve as a good guideline not only for beginners but also for future research.
We suggest four promising future research directions in the field of AutoML, in terms of computational efficiency, problem settings, solution techniques and applications. For each direction, we provide a thorough analysis of the shortcomings of current work and of what can be explored next.
The survey is organized as follows. Section 2 gives an overview: the definition of AutoML, the proposed framework for AutoML approaches, and taxonomies of existing works by problem setup and techniques. Section 3 describes the taxonomy by problem setup, and techniques are detailed in Sections 4-6. The three application examples listed in Table I are detailed in Section 7. Finally, we end this survey with a brief history of AutoML, future works and a conclusion in Section LABEL:sec:summary.
In the sequel, a machine learning tool is a method that can solve some of the learning problems in Figure 1, i.e., feature engineering, model selection and/or algorithm selection. We use the term configuration to denote all factors, other than the model parameters, that influence the performance of a learning tool. Examples of configurations are the hypothesis class of the model, the features utilized by the model, the hyper-parameters that control the training procedure, and the architecture of the network. Finally, we denote a machine learning tool as F(θ; λ), where θ is the model parameters learned by training and λ contains the configurations of the learning tool.
| | classical machine learning | AutoML |
| feature engineering | humans design and construct features from the data; humans process features to make them more informative | automated by the computer program |
| model selection | humans design or pick machine learning tools based on professional knowledge; humans adjust the hyper-parameters of machine learning tools based on performance evaluation | automated by the computer program |
| algorithm selection | humans pick optimization algorithms to find model parameters | automated by the computer program |
| computational budgets | not a main concern | executes within a limited computational budget |
| summary | human experts are involved in every aspect of machine learning applications | the program can be directly reused on other learning problems |
In Section 1, we showed why AutoML is needed, owing to both an academic dream and industrial needs. In this section, we first define the AutoML problem in Section 2.1. Then, in Section 2.2, we propose a framework for how AutoML problems can be solved in general. Finally, taxonomies of existing works based on “what to automate” and “how to automate” are presented in Section 2.3.
2.1 Problem Definition
Here, we define the AutoML problem, drawing inspiration from automation and machine learning. Based on the definition, we also explain the core goals of AutoML.
2.1.1 AutoML from two Perspectives
From its name, we can see that AutoML is naturally the intersection of automation and machine learning. While automation has a long history, dating back even to antiquity [guarnieri2010roots], machine learning was invented only decades ago [tom1997machine]. The combination of these two areas has become a hot research topic only in recent years. The key ideas from these two fields and their impacts on AutoML are as follows.
Machine learning, as in Definition 1, is specified by E, T and P, i.e., it tries to improve its performance, measured by P, on task T when receiving training data E.
Definition 1 (Machine learning [tom1997machine]).
A computer program is said to learn from experience E with respect to some classes of tasks T and performance measure P if its performance on T, as measured by P, improves with experience E.
From this perspective, AutoML itself can also be seen as a very powerful machine learning algorithm that has good generalization performance (i.e., P) on the input data (i.e., E) and given tasks (i.e., T). However, traditional machine learning focuses more on inventing and analyzing learning tools; it does not care much about how easily these tools can be used. One such example is the recent trend from simple to deep models, which offer much better performance but are also much harder to configure [goodfellow2016deep]. In contrast, AutoML emphasizes how easily learning tools can be used. This idea is illustrated in Figure 2.
On the other hand, automation is the use of various control systems for operating underlying building blocks [rifkin1995end]. In pursuit of better predictive performance, the configurations of machine learning tools should be adapted to the task and input data, which is often done manually. As shown in Figure 3, the goal of AutoML from this perspective is to construct high-level control over the underlying learning tools, so that proper configurations can be found without human assistance.
These two perspectives are the main motivations for our definition of AutoML in the sequel.
2.1.2 The Definition of AutoML
From Section 2.1.1, we can see that AutoML not only wants good learning performance (from machine learning's perspective) but also requires that such performance be achieved without human assistance (from automation's perspective). Thus, an informal and intuitive description of AutoML is: achieving good learning performance, without human assistance, across varied data and tasks.
Put more formally, we describe AutoML in Definition 2. The definition is inspired by Definition 1 and by the fact that AutoML itself can be seen as another machine learning approach (Figure 2).
Definition 2 (AutoML).
AutoML attempts to construct machine learning programs (specified by E, T and P in Definition 1) without human assistance and within limited computational budgets.
A comparison of classical machine learning and AutoML is given in Table II. Basically, in classical machine learning, humans are heavily involved in the search for learning tools: they perform feature engineering, model selection, hyper-parameter tuning and network architecture design. As a result, humans take on the most labor- and knowledge-intensive jobs in machine learning practice. In AutoML, all of this can be done by computer programs. To understand Definition 2 better, let us look back at the three examples in Table I:
Automatic model selection: Here, E denotes the input training data, T is a classification task and P is the accuracy on the given task. When features are given, Auto-sklearn can choose proper classifiers and find their corresponding hyper-parameters without human assistance.
Neural architecture search: When we try to solve image classification problems with the help of NAS, E is the collection of images, T is the image classification problem, and P is the accuracy on testing images. NAS automatically searches for a neural architecture, i.e., a classifier based on neural networks, that has good performance on the given task.
Automatic feature engineering: As the input features may not be informative enough, we may want to construct more features to enhance the learning performance. In this case, E is the raw features, T is the construction of features, and P is the performance of models learned with the constructed features. DSM [kanter2015deep] and ExploreKit [katz2016explorekit] remove human assistance by automatically constructing new features based on interactions among the input features.
Finally, note that Definition 2 is general enough to cover most machine learning approaches that can be considered automatic. Under this definition, even a machine learning pipeline with fixed configurations, which does not adapt to different E, T and P, counts as automatic. Approaches of this kind, though they require no human assistance, are rather limited in their default performance and application scope. Thus, they are of little interest and will not be pursued further in the sequel.
2.1.3 Goals of AutoML
From the above discussion, we can see that while good learning performance is always desired, AutoML requires that such performance be obtained in a special manner, i.e., without human assistance and within limited computational budgets. This sets up three main goals for AutoML (Remark 2.1).
Remark 2.1 (Core goals).
The three goals underneath AutoML:
Better Performance: good generalization performance across various input data and learning tasks;
No Assistance from humans: configurations of machine learning tools can be set automatically; and
Lower Computational budgets: the program can return an output within a limited budget.
Once the above three goals are realized, we can deploy machine learning solutions quickly across organizations, quickly validate and benchmark the performance of deployed solutions, and let humans focus on the problems that really require human engagement, i.e., problem definition, data collection and deployment in Figure 1. All of these make machine learning easier to apply and more accessible to everyone.
2.2 Basic Framework
2.2.1 Human Tuning Process
Before presenting the framework, let us look at how configurations are tuned by humans. The process is shown in Figure 4. Once a learning problem is defined, we need to find some learning tools to solve it. These tools, placed in the right part of Figure 4, can target different parts of the pipeline, i.e., features, models or optimization in Figure 1. To obtain good learning performance, we first set a configuration using personal experience or intuition about the underlying data and tools. Then, based on feedback about how the learning tools perform, we adjust the configuration, hoping to improve the performance. This trial-and-error process terminates once the desired performance is achieved or the computational budget runs out.
2.2.2 Proposed AutoML Framework
Motivated by the human-involved process above and by feedback control in automation [phillips1995feedback], we propose a framework for AutoML, shown in Figure 6. Compared with Figure 4, an AutoML controller takes the place of the human in finding proper configurations for the learning tools. There are two key ingredients inside the controller, i.e., the optimizer and the evaluator. Their interactions with the other components in Figure 6 are as follows:
Evaluator: The duty of the evaluator is to measure the performance of the learning tools under configurations provided by the optimizer, and then generate feedback for the optimizer. Usually, measuring the performance of a given configuration requires training the model on the input data, which can be time-consuming. Alternatively, the evaluator can directly estimate the performance based on external knowledge, which mimics human experience; such estimation is fast but may not be accurate. Thus, the evaluator needs to be both fast and accurate in measuring the performance of learning tools.
Optimizer: The duty of the optimizer is to update or generate configurations for the learning tools. Its search space is defined by the learning tools, and newly generated configurations are expected to perform better than previous ones. The feedback offered by the evaluator is not necessarily used by the optimizer; this depends on the type of optimization algorithm employed. Finally, as the optimizer operates on the search space, we wish the search space to be simple and compact, so that the optimizer can identify a good configuration after generating only a few candidates.
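The optimizer–evaluator interaction described above can be sketched minimally as follows. Here the optimizer is plain random search over a two-dimensional configuration space, and the evaluator's score function is a hypothetical analytic surrogate standing in for actually training and validating a learning tool; the names `lr` and `depth` and their ranges are illustrative assumptions.

```python
# Minimal controller sketch: a random-search optimizer proposes
# configurations, the evaluator scores them, and the best configuration
# found within a fixed budget is returned.
import random

def evaluator(config):
    # Stand-in for: train the learning tool with `config` on the input
    # data, then return validation performance (higher is better).
    lr, depth = config["lr"], config["depth"]
    return -((lr - 0.1) ** 2) - 0.01 * (depth - 4) ** 2

def optimizer(budget, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):  # limited computational budget
        config = {"lr": rng.uniform(0.001, 1.0), "depth": rng.randint(1, 10)}
        score = evaluator(config)  # feedback from the evaluator
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

config, score = optimizer(budget=200)
print(config, score)
```

Swapping the random sampler for Bayesian optimization or reinforcement learning, and the surrogate for real training runs, recovers the full framework; the control flow stays the same.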
In Table X, we use the examples from Table I to demonstrate how the proposed framework covers existing works; details of the examples are in Section 7. As can be seen, the framework is general enough to cover nearly all existing works (e.g., [thornton2013auto, bergstra2013making, feurer2015efficient, maclaurin2015gradient, sparks2015automating, kanter2015deep, van2015fast, katz2016explorekit, yu2016derivative, zoph2017neural, bello2017neural, brock2018smash, liu2018darts]), yet precise enough to help us set up taxonomies for AutoML approaches in Section 2.3. Furthermore, it motivates the future directions discussed in Section LABEL:sec:fuworks.
2.3 Taxonomies of AutoML Approaches
In this section, we give taxonomies of existing AutoML approaches based on what and how to automate.
2.3.1 “What to automate”: by Problem Setup
The choice of learning tools inspires the taxonomy by problem setup in Figure 5(a), which defines “what” AutoML is to automate.
Basically, for general learning problems, we need to perform feature engineering, model selection and algorithm selection. These three parts together make up the full scope of general machine learning applications (Figure 1). We also list neural architecture search (NAS) as a very important special case, since NAS targets deep models, where features, models and algorithms are configured simultaneously. The focus and challenges of the AutoML problem under each setup are detailed in Section 3.
2.3.2 “How to automate”: by Techniques
Figure 5(b) presents the taxonomy by AutoML techniques. These techniques are used by the controller, and they categorize “how” an AutoML problem is solved. In general, we divide existing techniques into basic and advanced ones:
Basic techniques: As the controller has two ingredients, i.e., the optimizer and the evaluator, we categorize basic techniques by the ingredient they operate on. The optimizer focuses on searching for and optimizing configurations, and many methods can be used, from simple ones such as grid search and random search [bergstra2012random] to complex ones such as reinforcement learning [zoph2017neural] and automatic differentiation [Baydin2017Auto]. For the evaluator, which mainly measures the performance of learning tools under the current configuration by determining their parameters, there are not many methods that can be taken as basic ones.
Advanced techniques: The difference between basic and advanced techniques is that advanced ones cannot be used alone to search for configurations in Figure 5(b); they usually need to be combined with basic ones. Two main methods fall into this category, i.e., meta-learning [lemke2015metalearning, vilalta2002perspective] and transfer learning [pan2009survey]; both try to make use of external knowledge to enhance the basic techniques for the optimizer and the evaluator.
Note that, as E, T and P are also involved in the definition of AutoML (Definition 2), taxonomies of machine learning, e.g., supervised, semi-supervised and unsupervised learning, can also be applied to AutoML. However, such taxonomies do not necessarily connect with removing human assistance in finding configurations (Figure 4). Thus, our taxonomies are instead based on the proposed framework in Figure 6. Finally, we focus on supervised AutoML approaches in this survey, as all existing AutoML works are supervised ones.
2.4 Working Flow based on Taxonomies
In the sequel, basic techniques and the core issues they need to solve are introduced in Sections 4 and 5 for the optimizer and evaluator, respectively. After that, advanced techniques are described in Section 6. The workflow of designing an AutoML approach is summarized in Figure 7, which also acts as a guide through this survey.
3 Problem Settings
| problem setup | learning tools | search space | references |
| feature engineering | subsequent classifiers | feature sets | [smith2005genetic, kanter2015deep, katz2016explorekit, tran2016genetic, nargesian2017learning] |
| | | selection of methods and their hyper-parameters | [thornton2013auto, feurer2015efficient, kotthoff2017auto] |
| model selection | classifiers | selection of classifiers and their hyper-parameters | [escalante2009particle, calcagno2010glmulti, thornton2013auto, feurer2015efficient, sparks2015automating] |
| algorithm selection | optimization algorithms | selection of algorithms and their hyper-parameters | [merz1996dynamical, kadioglu2011algorithm, hutter2014algorithm, van2015fast, bischl2016aslib] |
| full scope | general | a union of the above three aspects | [thornton2013auto, feurer2015efficient, kotthoff2017auto, zoph2017learning] |
| NAS | neural networks | design of networks (e.g., network structure, learning rate) | [domhan2015speeding, ha2016hypernetworks, mendoza2016towards, zoph2017neural, bello2017neural, fernando2017pathnet, elsken2017simple, zhong2017practical, deng2017peephole, brock2018smash, jin2018efficient, cai2018efficient] |
In this section, we give details on the categorization by problem setup (Figure 5(a)). Basically, it clarifies what is to be automated. AutoML approaches need not cover the full machine learning pipeline in Figure 1; they can also focus on some parts of the learning process. The common questions to be asked for each setup are:
What learning tools can be designed and used? What are their corresponding configurations?
By asking these questions, we can then define the search space for AutoML approaches. An overview is given in Table III. In the sequel, we briefly summarize the existing learning tools for each setup and their corresponding search spaces.
3.1 Feature Engineering
The quality of features is perhaps the most important factor in the performance of subsequent learning models. Its importance is further confirmed by the success of deep learning models, which can directly learn feature representations from the original data [bengio2013representation]. The problem of AutoML for feature engineering is to automatically construct features from the data so that subsequent learning tools can achieve good performance. This goal can be further divided into two sub-problems, i.e., creating features from the data and enhancing the features' discriminative ability.
However, the first problem depends heavily on application scenarios and human expertise; there are no common or principled methods for creating features from data. AutoML has made only limited progress in this direction, so we take it as a future direction and discuss it in Section LABEL:ssec:creatf. Here, we focus on feature-enhancing methods.
3.1.1 Feature Enhancing Tools
In many cases, the original features from the data may not be good enough, e.g., their dimension may be too high or samples may not be discriminable in the feature space. Thus, we may want to do some post-processing on these features. Fortunately, while human knowledge and assistance are still required, there are common methods and principle ways to enhance features. They are listed as follows:
Dimension reduction: This is the process of reducing the number of random variables under consideration by obtaining a set of principal variables, which is useful when there is much redundancy among features or the feature dimension is too high. It can be divided into feature selection and feature projection. Feature selection tries to select a subset of the original features; popular methods are greedy search and the lasso. Feature projection transforms the original features into a new space of much smaller dimension, e.g., PCA [pearson1901liii], LDA [fisher1936use] and the recently developed auto-encoders [vincent2008extracting].
Feature generation: As the original features are designed by humans, there are usually unexplored interactions among them that can significantly improve learning performance. Feature generation constructs new features from the original ones based on some pre-defined operations [ref], e.g., multiplication of two features or standard normalization. It is usually modeled as a search problem in the space spanned by operations on the original features. Many search algorithms have been applied, e.g., greedy search [katz2016explorekit] and evolutionary algorithms [vafaie1992genetic, smith2005genetic].
Feature coding: The last category is feature coding, which re-interprets the original features based on dictionaries learned from the data. After coding, samples are usually lifted into another feature space of much higher dimension than the original one. Since the dictionary can capture the collaborative representation in the training data, samples that are not discriminable in the original space become separable in the new space. Popular examples are sparse coding [elad2006image] (and its convolutional variants [zeiler2010deconvolutional]) and locally linear coding [yu2009nonlinear]. Kernel methods can also be seen as feature coding, where the basis functions form the dictionary; however, kernel methods have to be used with SVMs, and the basis functions are designed by hand rather than driven by data.
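To illustrate the search view of feature generation described above, the sketch below enumerates one step of the space spanned by applying pre-defined operations (plus, minus, times) to pairs of original features; the sample values and candidate-naming scheme are illustrative assumptions.

```python
# Enumerate candidate features obtained by applying one pre-defined
# operation (plus, minus, times) to each pair of original features.
# Which candidates to keep would then be decided by evaluating a
# downstream learner, e.g., with greedy search as in ExploreKit.
from itertools import combinations
from operator import add, sub, mul

OPS = {"plus": add, "minus": sub, "times": mul}

def generate_candidates(row):
    """One search step: every (feature pair, operation) combination."""
    return {
        f"x{i}_{name}_x{j}": op(row[i], row[j])
        for i, j in combinations(range(len(row)), 2)
        for name, op in OPS.items()
    }

sample = [2.0, 3.0, 5.0]
candidates = generate_candidates(sample)
print(candidates)
```

With d original features and 3 operations, one step already yields 3·d·(d-1)/2 candidates, which is why feature generation is treated as a search problem rather than exhaustive enumeration.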
None of the above tools is automatic. While there are practical suggestions for applying these feature-enhancing tools, when facing a new task we still need to try and test them.
3.1.2 Search Space
There are two types of search space for the above feature-enhancing tools. The first is made up of the hyper-parameters of these tools, and a configuration refers exactly to these hyper-parameters; this covers dimension reduction and feature coding methods (e.g., [thornton2013auto, feurer2015efficient, kotthoff2017auto]). For example, we need to determine the feature dimension after PCA and the level of sparsity for sparse coding. The second type of search space comes from feature generation (e.g., [smith2005genetic, kanter2015deep, katz2016explorekit, tran2016genetic, nargesian2017learning]); this space is spanned by combinations of the pre-defined operations with the original features. An example of new features generated by plus, minus and times operations is shown in Figure 8. For these methods, a configuration is a choice of features in the search space.
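A configuration of the second type (a choice of features) can be explored, for instance, by the greedy search mentioned earlier. The toy sketch below performs greedy forward selection, using a correlation-with-label score as a hypothetical stand-in for validating subsequent learning tools; the synthetic data is an illustrative assumption.

```python
# Greedy forward search over feature subsets: repeatedly add the
# feature with the largest relevance score (a stand-in for evaluating
# a downstream model on the candidate subset).
import random

def corr(xs, ys):
    """Pearson correlation; returns 0 for constant inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def greedy_select(rows, labels, k):
    chosen, remaining = [], list(range(len(rows[0])))
    while remaining and len(chosen) < k:
        best = max(remaining,
                   key=lambda j: abs(corr([r[j] for r in rows], labels)))
        chosen.append(best)
        remaining.remove(best)
    return chosen

random.seed(1)
labels = [i % 2 for i in range(40)]
# Feature 0 tracks the label, feature 1 is noise, feature 2 is constant.
rows = [[y + random.gauss(0, 0.2), random.gauss(0, 1.0), 3.0] for y in labels]
print(greedy_select(rows, labels, k=2))
```

The informative feature is picked first and the constant feature last, mirroring how a real evaluator would rank candidate configurations.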
3.2 Model Selection
Once features have been obtained, we need a model to predict the labels. Model selection contains two components, i.e., picking classifiers and setting their corresponding hyper-parameters. The problem of AutoML here is to automatically select classifiers and set their hyper-parameters so that good learning performance is obtained.
3.2.1 Classification Tools
Many classification tools have been proposed in the literature, e.g., tree classifiers, linear classifiers, kernel machines and, more recently, deep networks. Each classifier has its own strengths and weaknesses in modeling the underlying data. For example, tree classifiers generally outperform linear ones; however, when the feature dimension becomes high, training tree classifiers becomes extremely expensive and difficult, and linear classifiers are preferred. Some out-of-the-box classifiers implemented in scikit-learn are listed in Table IV. Traditionally, the choice among different classifiers is made by humans based on experience, in a trial-and-error manner.
| classifier | number of hyper-parameters | discrete | continuous |
| Bernoulli naive Bayes | 2 | 1 | 1 |
As can be seen from Table IV, each classifier also has associated hyper-parameters. They are usually determined by grid search, which is the standard practice in the machine learning community. However, the size of the search grid grows exponentially with the number of hyper-parameters, so grid search is not a good choice for complex models. In this case, again, the importance of the hyper-parameters is first assessed by humans, and unimportant or insensitive ones are fixed in advance. This process still requires human assistance and may lead to sub-optimal performance, as the possible settings of hyper-parameters are not sufficiently explored.
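The exponential growth mentioned above is easy to quantify: with v candidate values per hyper-parameter, a grid over d hyper-parameters contains v**d configurations, each requiring a model to be trained and evaluated.

```python
# Grid size versus number of hyper-parameters: with 5 candidate values
# each, the grid grows from 5 points at d=1 to 390625 at d=8.
from itertools import product

def grid_size(values_per_param, num_params):
    grid = product(*[range(values_per_param)] * num_params)
    return sum(1 for _ in grid)

for d in (1, 2, 4, 8):
    print(d, grid_size(5, d))  # 5, 25, 625, 390625
```

This is why random search and model-based optimizers, which sample the space rather than enumerate it, are preferred once more than a handful of hyper-parameters are involved.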
3.2.2 Search Space
From the above, we can see that configurations in the context of model selection are choices of classifiers and their hyper-parameters (e.g., [escalante2009particle, calcagno2010glmulti, thornton2013auto, feurer2015efficient, sparks2015automating]). These two parts make up the search space here. The choice of a classifier can simply be modeled as a discrete variable, where 1 indicates using the classifier and 0 stands for not using it. Properties of hyper-parameters depend on the design and implementation of models, e.g., the number of nearest neighbors is discrete while the penalty parameter for logistic regression is continuous.
3.3 Algorithm Selection
The last and most time-consuming step of machine learning is to find the parameters of learning tools, where optimization tools are usually involved. Traditionally, as learning tools were usually very simple, optimization was not a concern: the performance obtained from various optimization tools was nearly the same, and efficiency was the main focus in the choice of optimization tools. However, as learning tools get increasingly complex, optimization is not only the main consumer of computational budgets but also has a great impact on the learning performance. Thus, the goal of AutoML here is to automatically find an optimization tool so that efficiency and performance can be balanced.
3.3.1 Optimization Tools
For each learning tool, many algorithms can be used. Some popular tools for minimizing smooth objective functions, such as that of logistic regression, are summarized in Table V. While GD does not involve any extra parameters, it suffers from slow convergence and expensive per-iteration complexity. L-BFGS is more expensive per iteration but converges faster, and each iteration of SGD is very cheap but many iterations are needed before convergence.
|optimization tool||total hyper-parameters||discrete||continuous|
|gradient descent (GD)||0||0||0|
|limited-memory BFGS (L-BFGS)||1||1||0|
|stochastic gradient descent (SGD)||4||1||3|
3.3.2 Search Space
Traditionally, both the choice of optimization tools and their hyper-parameters are made by humans, again based on their understanding of the learning tools and observations of the training data. The search space is determined by configurations of optimization tools, which contain the choice of the optimization tool and the values of its hyper-parameters (e.g., [merz1996dynamical, kadioglu2011algorithm, hutter2014algorithm, van2015fast, bischl2016aslib]).
3.4 Full Scope
Finally, we discuss the full pipeline in Figure 1. We divide it into two cases. The first one is the general case, where learning tools are simply the union of those discussed in Sections 3.1-3.3. The resulting search space is also a combination of the previous ones. However, the search can be done in two manners: either by reusing methods for each setup separately and then combining them, or by directly searching through the space spanned by all configurations. Note that there can be some hierarchical structure in the space (e.g., [thornton2013auto, feurer2015efficient, kotthoff2017auto, zoph2017learning]), as the choice of optimization tools depends on which classifier is used.
The second one is network architecture search (NAS), which targets searching for good architectures of deep networks (e.g., [domhan2015speeding, ha2016hypernetworks, mendoza2016towards, zoph2017neural, bello2017neural, fernando2017pathnet, elsken2017simple, zhong2017practical, deng2017peephole, brock2018smash, jin2018efficient, cai2018efficient, cai2018path]). There are two main reasons why we place it here in parallel to the full scope. First, NAS itself is an extremely hot research topic on which many papers have been published. Second, deep networks are very strong learning tools that can learn directly from data, and SGD is the main choice for their optimization.
3.4.1 Network Architecture Search
Before describing the search space of NAS, let us look at what a typical CNN architecture is. As in Figure 9, a CNN is basically made up of two parts, i.e., a series of convolutional layers and a fully connected layer at the end.
The performance of a CNN is mostly influenced by the design of its convolutional layers [yosinski2014transferable]. Within one layer, some common design choices are listed in Figure 10. More importantly, compared with model selection in Section 3.2, the model complexity of a deep network can also be taken into the search, i.e., we can add one more layer or a skip connection between layers. This key difference is the motivation for using reinforcement learning [sutton1998reinforcement] in NAS. Thus, the search space is made up of the above design choices and the hyper-parameters of SGD. One configuration for NAS is one point in such a search space.
In this survey, we focus on CNNs; the ideas presented here can be similarly applied to other deep architectures, such as long short-term memory (LSTM) [Hochreiter1997] and deep sparse networks [glorot2011deep].
4 Basic Techniques for Optimizer
Once the search space is defined, as in the proposed framework (Figure 6), we need to find an optimizer to guide the search in the space. In this section, we discuss the basic techniques for the optimizer.
Three important questions here are
what kind of search space can the optimizer operate on?
what kind of feedback does it need?
how many configurations does it need to generate/update before a good one can be found?
The first two questions determine which type of techniques can be used for the optimizer, and the last one clarifies the efficiency of those techniques. While efficiency is also a major pursuit in AutoML (see Remark 2.1), we do not categorize existing techniques based on it. This is because the search space is so complex that convergence rates for each technique are hard to analyze, and advanced techniques (Section 6) can accelerate basic ones in various ways. Thus, in the sequel, we divide these techniques into three categories based on the first two questions, i.e., simple search approaches, optimization from samples, and gradient descent. An overview comparing these techniques is given in Table VI.
|type||method||examples||feedback in examples|
|simple search||random search, grid search||[bergstra2012random]||none|
|optimization from samples||evolutionary algorithms||[xie2017genetic]||accuracy on validation set|
|optimization from samples||Bayesian optimization||[thornton2013auto]||accuracy on validation set|
|optimization from samples||reinforcement learning||[bello2017neural]||accuracy on validation set (reward) and a sequence of configurations (state)|
|gradient descent||reversible||[maclaurin2015gradient]||accuracy on validation set and gradients w.r.t. hyper-parameters|
4.1 Simple Search Approaches
Simple search approaches are naive: they make no assumptions about the search space and need no feedback from the evaluator, so each configuration in the search space can be evaluated independently. Simple search approaches such as grid search and random search are widely used in configuration optimization. Grid search (brute force) is the most traditional way of finding hyper-parameters. To get the optimal hyper-parameter setting, grid search has to enumerate every possible configuration in the search space; thus, discretization is necessary when the search space is continuous. Random search [bergstra2012random] can better explore the search space as more distinct positions will be evaluated (Figure 11). However, both suffer from the curse of dimensionality as the dimensionality increases.
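The contrast between the two approaches can be sketched in a few lines. The objective below is a toy stand-in with hypothetical hyper-parameters `C` and `gamma`; in a real AutoML setting, `evaluate` would train a model and report validation accuracy.

```python
import itertools
import random

def grid_search(evaluate, grid):
    """Enumerate every configuration on the grid; no feedback is
    needed between evaluations, so all runs are independent."""
    best = None
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = evaluate(cfg)
        if best is None or score > best[1]:
            best = (cfg, score)
    return best

def random_search(evaluate, space, n_trials, seed=0):
    """Sample each hyper-parameter independently from its range,
    which tries more distinct values per dimension than a grid."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        cfg = {k: rng.uniform(lo, hi) for k, (lo, hi) in space.items()}
        score = evaluate(cfg)
        if best is None or score > best[1]:
            best = (cfg, score)
    return best

# toy objective peaking at C=1.0, gamma=0.1 (hypothetical parameters)
def evaluate(cfg):
    return -((cfg["C"] - 1.0) ** 2 + (cfg["gamma"] - 0.1) ** 2)

best_grid = grid_search(evaluate, {"C": [0.1, 1.0, 10.0],
                                   "gamma": [0.01, 0.1, 1.0]})
best_rand = random_search(evaluate, {"C": (0.1, 10.0),
                                     "gamma": (0.01, 1.0)}, n_trials=50)
```

Note that neither method uses the score of one configuration to choose the next, which is exactly why they parallelize trivially but scale poorly with dimension.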
4.2 Optimization from Samples
Optimization from samples [conn2009introduction] is a smarter kind of search compared with the simple approaches in Section 4.1. It iteratively generates new configurations based on previous ones, and is thus generally more efficient than simple search methods. Besides, it does not make specific assumptions about the objective either.
In the sequel, according to different optimization strategies, we divide existing approaches into three categories, i.e., heuristic search, model-based derivative-free optimization and reinforcement learning.
4.2.1 Heuristic Search
Heuristic search methods are often inspired by biological behaviors and phenomena. They are widely used on optimization problems that are nonconvex, nonsmooth, or even discontinuous. They are all population-based optimization methods, and differ from each other in how they generate and select populations. Some popular heuristic search methods are listed as follows:
Particle swarm optimization (PSO) [escalante2009particle]: PSO is inspired by the social behavior of bird flocking or fish schooling. It optimizes by searching the local area around the best sample. PSO has few hyper-parameters itself and is easy to parallelize.
Evolutionary algorithms [B1998Evolutionary]: Evolutionary algorithms are inspired by biological evolution, with mutation and selection as the main components. State-of-the-art methods such as CMA-ES [Hansen2003reducing] have been applied to various sophisticated optimization problems.
The above methods have been applied in AutoML (e.g., [yao1999evolving, stanley2002evolving, zhang2000particle, smith2005genetic, escalante2009particle, real2017large, real2018regularized]), and they all follow the framework in Figure 12. Basically, a set of configurations (the population) is maintained. In each iteration, new configurations are first generated by crossover or mutation; then these configurations are measured using feedback from the evaluator, and only a few are kept for the next iteration.
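The loop in Figure 12 can be sketched as follows. This is a minimal generic evolutionary search, not the algorithm of any particular cited work; the evaluator here is just a toy objective to be minimized, standing in for validation error.

```python
import random

def evolutionary_search(evaluate, dim, pop_size=20, n_gens=30,
                        sigma=0.1, seed=0):
    """Minimal evolutionary loop: maintain a population of
    configurations, generate offspring by crossover and Gaussian
    mutation, and keep only the fittest for the next iteration."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0.0, 1.0) for _ in range(dim)]
           for _ in range(pop_size)]
    for _ in range(n_gens):
        offspring = []
        for _ in range(pop_size):
            # crossover: mix coordinates of two random parents,
            # then mutate each coordinate with Gaussian noise
            p1, p2 = rng.sample(pop, 2)
            child = [rng.choice(pair) + rng.gauss(0.0, sigma)
                     for pair in zip(p1, p2)]
            offspring.append(child)
        # selection: evaluator feedback decides who survives
        pop = sorted(pop + offspring, key=evaluate)[:pop_size]
    return pop[0]

# toy objective (stand-in for validation error), minimum at (0.5, 0.5)
f = lambda cfg: sum((x - 0.5) ** 2 for x in cfg)
best = evolutionary_search(f, dim=2)
```

Because parents and offspring compete jointly for survival, the best configuration found so far is never lost (elitism), which is a common design choice in AutoML uses of evolutionary search.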
4.2.2 Model-Based Derivative-Free Optimization
Model-based derivative-free optimization (Figure 13) builds a model based on visited samples, which helps to generate more promising new samples. The popular methods are Bayesian optimization, classification-based optimization and optimistic optimization:
Bayesian optimization [nickson2014automated, klein2016fast]: Bayesian optimization builds a probabilistic surrogate of the objective over the search space, using a Gaussian process or other models (e.g., tree-structured density estimators [bergstra2011algorithms], random forests [hutter2011sequential]). Then, it chooses the next sample by optimizing an acquisition function based on the surrogate. Because of its excellent performance on expensive optimization problems, Bayesian optimization is popularly used in AutoML.
Classification-based optimization [yu2016derivative, hu2017sequential]: Based on previous samples, classification-based optimization models the search space by learning a classifier, which divides the search space into positive and negative areas. New samples are then generated in the positive area, where better samples are more likely to be found. The model is simple and effective; thus, classification-based optimization enjoys high efficiency and good scalability.
Simultaneous optimistic optimization (SOO) [munos2011optimistic]: SOO applies a tree structure to balance exploration and exploitation over the search space. SOO can reach the global optimum when the objective function is locally Lipschitz continuous, but it also suffers from the curse of dimensionality, since the tree becomes extremely expensive to grow when the dimensionality of the objective function is high.
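To make the model-based loop of Figure 13 concrete, below is a heavily simplified sketch in the spirit of classification-based optimization: the best samples so far are labeled positive, an axis-aligned box (a deliberately trivial "classifier") is fit around them, and new samples are drawn from inside that positive region. This is our own illustration under those assumptions, not the actual algorithm of [yu2016derivative] or [hu2017sequential].

```python
import random

def classification_based_search(evaluate, bounds, n_iters=30,
                                batch=20, pos_frac=0.2, seed=0):
    """Sketch: iteratively learn a positive region from the best
    samples and sample new configurations from inside it."""
    rng = random.Random(seed)
    dim = len(bounds)
    region = list(bounds)      # current positive region, (lo, hi) per dim
    samples = []               # all (score, point) pairs seen so far
    for _ in range(n_iters):
        pts = [tuple(rng.uniform(lo, hi) for lo, hi in region)
               for _ in range(batch)]
        samples += [(evaluate(p), p) for p in pts]
        samples.sort(key=lambda t: t[0])   # minimization
        n_pos = max(1, int(len(samples) * pos_frac))
        positives = [p for _, p in samples[:n_pos]]
        # the learned "positive area": bounding box of positive samples
        region = [(min(p[d] for p in positives),
                   max(p[d] for p in positives)) for d in range(dim)]
    return samples[0]

# toy objective with minimum at (0.3, 0.7)
best_score, best_cfg = classification_based_search(
    lambda p: (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2,
    bounds=[(0.0, 1.0), (0.0, 1.0)])
```

The real methods use more careful region learning and re-exploration to avoid the premature collapse this naive box rule can suffer from, but the generate-evaluate-refit cycle is the same.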
4.2.3 Reinforcement Learning
Reinforcement learning (RL) [sutton1998reinforcement] is a very general and powerful optimization framework, which can solve problems with delayed feedback. Its general framework when used in AutoML is shown in Figure 14. Basically, the policy in RL acts as the generator, and its actual performance in the environment is measured by the evaluator. However, unlike in previous methods, the feedback (i.e., reward and state) does not need to be returned immediately after an action is taken; it can be returned after performing a sequence of actions.
Due to this unique property, RL is popularly used in NAS (e.g., [zoph2017neural, zoph2017learning, baker2017designing, bello2017neural, li2017hyperband, pham2018faster]). The reason is that a CNN can be built layer by layer, and the design of one layer can be seen as one action given by the generator; thus, iterative architecture generation naturally fits the RL setting (see details in Section LABEL:ssec:egnas). However, again due to delayed feedback, AutoML with reinforcement learning is highly resource-consuming, and more efficient methods need to be explored.
4.3 Gradient descent
The optimization problems in AutoML are in general very complex, perhaps not differentiable or even not continuous; thus, gradient descent is hard to use as a general-purpose optimizer. However, when focusing on differentiable loss functions [bengio2000gradient], e.g., the squared loss and logistic loss, continuous hyper-parameters can be optimized by gradient descent. Compared with the above methods, gradients offer the most accurate information about where better configurations locate.
For this type of method, unlike traditional optimization problems whose gradients can be explicitly derived from the objective, the gradients need to be computed implicitly. Usually, this can be done by finite differences [bengio2000gradient]. Another way of computing the exact gradients is through reversible learning [maclaurin2015gradient, franceschi2017forward, Baydin2017Auto], which has been applied to hyper-parameter search for deep learning. For traditional machine learning, approximate gradients have been proposed to search continuous hyper-parameters [pedregosa2016hyper]; with such inexact gradients, hyper-parameters can be updated before the model parameters have converged.
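The finite-difference route can be illustrated on ridge regression, where training to convergence has a closed form, so the validation loss is an implicit function of the regularization strength. The example below, our own sketch rather than the method of any cited paper, runs gradient descent on log(lambda) using a central finite difference as the hyper-gradient.

```python
import numpy as np

def val_loss(lam, X_tr, y_tr, X_val, y_val):
    """Train ridge regression to convergence (closed form) for a
    given regularization lam, then report the validation loss."""
    d = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
    return float(np.mean((X_val @ w - y_val) ** 2))

def tune_by_gradient(X_tr, y_tr, X_val, y_val, lam=1.0, lr=0.1,
                     eps=1e-4, n_steps=100):
    """Gradient descent on log(lam); the hyper-gradient is obtained
    implicitly via finite differences of the validation loss."""
    log_lam = np.log(lam)
    for _ in range(n_steps):
        f = lambda t: val_loss(np.exp(t), X_tr, y_tr, X_val, y_val)
        grad = (f(log_lam + eps) - f(log_lam - eps)) / (2 * eps)
        log_lam -= lr * grad
    return float(np.exp(log_lam))

# synthetic regression data for demonstration
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=60)
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]
lam_star = tune_by_gradient(X_tr, y_tr, X_val, y_val)
```

Each hyper-gradient step here retrains the model twice, which is exactly the cost the approximate-gradient methods cited above try to avoid by updating hyper-parameters before the inner training has converged.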
Finally, in this section, we briefly discuss methods that target structured search spaces or that change the landscape of the search space. Such techniques are used and developed case by case. For example, greedy search is used in [katz2016explorekit] for the space spanned by feature combinations (Section 3.1.2), mainly motivated by the prohibitively large search space. Then, in NAS, as some design choices are discrete and hence not differentiable, a softmax is used in [liu2018darts] to relax the search space into a differentiable one, which enables the use of gradient descent instead of RL.
5 Basic Techniques for Evaluator
Previously, in Section 4, we discussed how to choose a proper basic technique for the optimizer. In this section, we talk about techniques for the other component, i.e., the evaluator in Figure 6. For learning tools, once their configurations are updated, the evaluator needs to measure the corresponding performance on the validation set. Such a process is usually very time-consuming, as it involves parameter training.
Three important questions here are:
Can the technique provide fast evaluation?
Can the technique provide accurate evaluation?
What kind of feedback needs to be offered by the evaluator?
There is a trade-off between the first two questions: fast evaluation usually leads to worse evaluation, i.e., less accurate with larger variance. This is illustrated in Figure 15. Thus, evaluator techniques should ideally lie in the top left of Figure 15 with small variance.
The last question in Remark 5.1 is a design choice, which has to be combined with the choice of the optimizer. For example, as in Table VI, grid search and random search need no feedback for the optimizer, as each configuration is run independently. However, for gradient descent methods, we need not only to report the obtained performance on the validation set, but also to compute the gradient w.r.t. the current configuration.
Unlike basic techniques for the optimizer, there are not many techniques that can be used as basic ones for the evaluator. We list them as follows:
Direct evaluation: This is the simplest method, where the parameters are obtained by directly training on the training set, and then the performance is measured on the validation set. It is the most expensive method, but also offers the most accurate evaluation;
Sub-sampling: As the training time depends heavily on the amount of training data, an intuitive method to make evaluation faster is to train parameters with a subset of the training data. This can be done by either using a subset of samples or a subset of features;
Early stop: For some bad configurations, even with enough training time, the performance will not eventually get much better. Empirically, such configurations can usually be easily identified by their performance on the validation set at the very beginning of training [hutter2011sequential, klein2016learning, deng2017peephole]. In this case, we can terminate the training early and let the evaluator offer a bad feedback;
Parameter reusing: Another technique is to use parameters obtained from the evaluation of previous configurations to warm-start the parameters that need to be trained for the current evaluation. Intuitively, parameters for similar configurations can be close to each other. Thus, this technique can be useful when the change between the previous configuration and the current one is not big;
Surrogate evaluator: For configurations that can be readily quantified, one straightforward method to cut down the evaluation cost is to build a model that predicts the performance of given configurations from the experience of past evaluations [eggensperger2015efficient, domhan2015speeding, van2015fast, klein2016learning]. Such models, serving as surrogate evaluators, spare the computationally expensive model training and significantly accelerate AutoML. The surrogate evaluator is also an accelerating approach that trades evaluation accuracy for efficiency; it can be used to predict the running time, parameters, and predictive performance of learning tools. The use of surrogate models is comparatively limited for configurations other than hyper-parameters, since these are hard to quantify. In Section 6.1, we will introduce meta-learning techniques that are promising to address this problem.
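The early-stop idea above can be sketched as a wrapper around direct evaluation: train incrementally, check the validation performance at checkpoints, and abort when it stops improving. The toy learning curve below stands in for real training and is purely illustrative; the function names and thresholds are our own.

```python
def early_stop_evaluate(train_step, validate, n_steps=100,
                        check_every=10, patience=2, min_gain=1e-3):
    """Sketch of an early-stopping evaluator: train step by step,
    and terminate once validation performance plateaus, returning
    the (possibly pessimistic) feedback for the optimizer."""
    best, bad_checks = float("-inf"), 0
    for step in range(1, n_steps + 1):
        train_step()
        if step % check_every == 0:
            acc = validate()
            if acc > best + min_gain:
                best, bad_checks = acc, 0
            else:
                bad_checks += 1
                if bad_checks >= patience:
                    break  # likely a bad configuration: stop training
    return best

# toy "training": a validation curve that plateaus at 0.5
state = {"steps": 0}
def train_step():
    state["steps"] += 1
def validate():
    return min(0.5, 0.02 * state["steps"])

score = early_stop_evaluate(train_step, validate)
# training halts well before n_steps once the curve flattens
```

Sub-sampling composes naturally with this wrapper: `train_step` can simply operate on a subset of samples or features, stacking both speed-ups at the cost of extra evaluation noise.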
As a summary, due to its simplicity and reliability, direct evaluation is perhaps the most commonly used basic technique for the evaluator. Sub-sampling, early stop, and parameter reusing enhance direct evaluation in various directions, and they can be combined for faster and more accurate evaluation. However, the impact of these three techniques depends on the AutoML problem and data, and it is hard to conclude to what extent they improve upon direct evaluation. Note that, while a surrogate is also used in Section 4.2.2 (i.e., the "Model" in Figure 13), there it is used to generate new configurations that are more likely to have good performance; the surrogate evaluator here instead determines the real performance of a given configuration and offers feedback to the optimizer for subsequent updates. Finally, while basic techniques are few here, various ones can be designed based on, e.g., transfer learning and meta-learning, which will be discussed in Section 6.
6 Advanced Techniques
In previous sections, we discussed the general framework to automatically construct learning tools for given learning problems. The framework features a search procedure that comprises configuration generation and evaluation. In this section, we review advanced techniques that can improve the efficiency and performance of AutoML, by putting them into our proposed framework. Two major topics of this section are: 1) meta-learning, where meta-knowledge about learning is extracted and meta-learner is trained to guide learning; 2) transfer learning, where transferable knowledge is brought from past experiences to help upcoming learning practices.
|application||purpose||meta-knowledge: learning problem||meta-knowledge: learning tool||meta-knowledge: other||meta-learner input||meta-learner output|
|configuration evaluation (evaluator)||model evaluation||meta-features of data||meta-features of models (optional)||performance, or applicability, or ranking of models||meta-features of data and models||performance, or applicability, or ranking of models|
|configuration evaluation (evaluator)||general configuration evaluation||meta-features of data (optional)||configurations, or meta-features of configurations (optional)||performance of configurations||meta-features of data, and configurations or meta-features of configurations||performance of configurations|
|configuration generation (optimizer)||promising configuration generation||meta-features of data||-||well-performing configurations||meta-features of data||promising configurations|
|configuration generation (optimizer)||warm-starting configuration generation||meta-features of data||-||well-performing configurations||meta-features of data||promising initial configurations|
|configuration generation (optimizer)||search space refining||-||configurations||importance of configurations, or promising search regions||configurations of learning tools||refined search space|
|dynamic configuration adaptation||concept drift detection||statistics of data, or attributes||-||indicator (whether concept drift is present, optional)||statistics of data, or attributes||indicator, or indicating attributes|
|dynamic configuration adaptation||dynamic configuration adaptation||meta-features of data||-||well-performing configurations||meta-features of current data||promising configuration|
6.1 Meta-Learning
Though defined in various ways, meta-learning in general learns how specific learning tools perform on given problems from past experience, with the aim of recommending or constructing promising learning tools for upcoming problems. Meta-learning is closely related to AutoML since they share the same objects of study, namely learning tools and learning problems. In this section, we first briefly introduce the general framework of meta-learning and explain why and how meta-learning can help AutoML. Then, we review existing meta-learning techniques by categorizing them into three general classes based on their applications in AutoML: 1) meta-learning for the configuration evaluator; 2) meta-learning for the configuration optimizer; and 3) meta-learning for dynamic configuration adaptation.
6.1.1 General Meta-learning Framework
Meta-learning satisfies the definition of machine learning (Definition 1). It is, however, significantly different from classical machine learning since it aims at totally different tasks and, consequently, learns from different experiences. Table VIII provides an analogy between meta-learning and classical machine learning, indicating both their similarities and differences.
| ||classical machine learning||meta-learning|
|tasks||to learn and use knowledge about instances||to learn and use knowledge about learning problems and tools|
|experiences||about instances||about learning problems and tools|
|method||to train learners with experiences and apply them on future tasks (common to both)|
Like classical machine learning, meta-learning is achieved by extracting knowledge from experience, training learners based on the knowledge, and applying the learners on upcoming problems. Figure 16 illustrates the general framework of meta-learning. First, learning problems and tools are characterized. Such characteristics (e.g., statistical properties of the dataset, hyperparameters of learning tools) are often named meta-features, as thoroughly reviewed in [smith2009cross, vanschoren2010understanding, lemke2015metalearning]. Then, meta-knowledge is extracted from past experiences. In addition to meta-features, empirical knowledge about the goal of meta-learning, such as performance of learning tools and the promising tools for specific problems, is also required. Afterwards, meta-learners are trained with the meta-knowledge. Most existing machine learning techniques, as well as simple statistical methods, can serve to generate the meta-learners. The trained meta-learner can be applied on upcoming, characterized learning problems to make predictions of interest.
Meta-learning helps AutoML, on the one hand, by characterizing learning problems and tools. Such characteristics can reveal important information about the problems and tools, for example, whether there is concept drift in the data, or whether a model is compatible with particular machine learning tasks. Furthermore, with these characteristics, similarities among different tasks and tools can be evaluated, which enables knowledge reuse and transfer between different problems. A simple but widely-used approach is to recommend a configuration for a new task using the empirically best configuration in a neighborhood of this task in the meta-feature space. On the other hand, the meta-learner encodes past experience and acts as a guide for solving future problems. Once trained, meta-learners can quickly evaluate configurations of learning tools, sparing the computationally expensive training and evaluation of models. They can also generate promising configurations, which can directly specify a learning tool, serve as good initializations of the search, or suggest effective search strategies. Hence, meta-learning can greatly improve the efficiency of AutoML approaches.
In order to apply meta-learning in AutoML, we need to figure out the purpose of meta-learning, and the corresponding meta-knowledge and meta-learners, as noted in Remark 6.1. Table VII summarizes the meta-knowledge and meta-learners that should be extracted and trained for different purposes, according to existing works in the literature.
To apply meta-learning in AutoML, we should determine:
what is the purpose of applying meta-learning;
what meta-knowledge should be extracted to achieve the purpose;
what meta-learners should be trained to achieve the purpose.
6.1.2 Configuration Evaluation (evaluator)
The most computation-intensive step in AutoML is configuration evaluation, due to the cost of model training and validation. Meta-learners can be trained as surrogate evaluators to predict the performance, applicability, or ranking of configurations. We summarize representative applications of meta-learning in configuration evaluation as follows:
Model evaluation: the task is to predict, given a learning problem (often specified by the data set), whether or how well a learning algorithm is applicable, or a ranking of candidate algorithms, so that the most suitable and promising algorithm can be selected. The meta-knowledge includes the meta-features of learning problems and the empirical performance of different models, and optionally the meta-features of models. The meta-learner is trained to map the meta-features to the performance [gama1995characterization, merz1996dynamical, sohn1999meta], applicability [taylor1994machine, brazdil1994characterizing], or ranking [soares2000zoomed, berrer2000evaluation, alexandros2001model, brazdil2003ranking] of models. More recent research on this topic includes active testing [leite2010active, leite2012selecting], runtime prediction [reif2011prediction, hutter2014algorithm], and more sophisticated measurements for models [jankowski2011universal, jankowski2013complexity]. A more complete review of this topic can be found in [smith2009cross].
General configuration evaluation: the evaluation for other kinds of configurations can equip meta-learning in similar ways: in ExploreKit [katz2016explorekit], ranking classifiers are trained to rank candidate features; in [soares2004meta], meta-regressor is trained to score kernel widths as hyperparameters for support vector machines.
In short, with the purpose of accelerating configuration evaluation, meta-learners are trained to predict the performance or suitability of configurations. When used in the configuration generation procedure, such meta-learners can significantly cut down the number of actual model training runs. Furthermore, in the configuration selection setting, where all possible choices have been enumerated, the best configurations can be directly selected according to the scores and rankings predicted by the meta-learner.
6.1.3 Configuration Generation (optimizer)
Meta-learning can also facilitate configuration generation by learning, e.g., configurations for specific learning tasks, strategies to generate or select configurations, or refined search spaces. These approaches, in general, can improve the efficiency of AutoML:
Promising configuration generation: the purpose is to directly generate well-performing configurations for a given learning problem. For this purpose, meta-knowledge indicating empirically good configurations is extracted, and the meta-learner takes the characteristics of the learning problem as input and predicts promising configurations, such as kernels [ali2006meta] or adaptive network architectures [finn2017model, bender2018understanding].
Warm-starting configuration generation: the meta-knowledge utilized in promising configuration generation can also be exploited to better initialize configuration search. The basic approach is, given a new learning task, to find the past tasks closest to it in the meta-feature space and use their best-performing configurations to initialize the search. Most work of this kind focuses on hyper-parameter tuning, with particle swarm optimization [gomes2012combining, de2012experimental], evolutionary algorithms [reif2012meta], and sequential model-based optimization [feurer2015initializing, feurer2015efficient, lindauer2017warmstarting].
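The basic warm-starting approach can be sketched directly. The meta-features and stored configurations below are hypothetical placeholders (the dictionary keys and values are ours), but the mechanism of ranking past tasks by meta-feature distance and reusing their best configurations is the one described above.

```python
import math

def warm_start_configs(new_meta, past_tasks, k=2):
    """Sketch of warm-starting: find the k past tasks closest to the
    new task in meta-feature space, and return their best
    configurations as the initial design of the search."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(past_tasks, key=lambda t: dist(new_meta, t["meta"]))
    return [t["best_config"] for t in ranked[:k]]

# hypothetical meta-features: (log #samples, log #features, class entropy)
past = [
    {"meta": (4.0, 1.0, 0.9), "best_config": {"C": 1.0}},
    {"meta": (6.0, 3.0, 0.2), "best_config": {"C": 10.0}},
    {"meta": (4.2, 1.2, 0.8), "best_config": {"C": 0.5}},
]
inits = warm_start_configs((4.05, 1.1, 0.85), past, k=2)
```

In practice the distance is computed over many more, carefully normalized meta-features, and the returned configurations seed the population of PSO, an evolutionary algorithm, or the initial design of sequential model-based optimization.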
Search space refining: meta-learning can accelerate configuration search by refining the search space. Existing works along this line evaluate the importance of configurations [hoos2014efficient, van2018hyperparameter] or identify promising regions in the search space [wistuba2015hyperparameter].
6.1.4 Dynamic Configuration Adaptation
So far we have focused on the differences among learning problems and tools, which raise the need for AutoML. However, in real life, the data distribution can vary even within a single data set, especially in data streams. Such change in data distribution is often termed "concept drift". In classical machine learning practice, concept drift is often assumed a priori or detected a posteriori, followed by specific designs so that the learning tool can adapt to the drift. Meta-learning can help to automate this procedure by detecting concept drift and dynamically adapting learning tools to it:
Concept drift detection: with statistics of the data or its attributes, we can detect whether concept drift is present in a learning problem. In [widmer1997tracking], attributes that might provide contextual clues, which indicate changes in concept, are identified based on meta-learning. In [kifer2004detecting], a non-parametric approach is proposed to detect concept drift: a new class of distance measures is designed to indicate changes in data distribution, and concept drift is detected by monitoring the change of distribution in a data stream.
Dynamic configuration adaptation: once concept drift is detected, configuration adaptation can be carried out by predicting promising configurations for the current part of the data [klinkenberg2005meta, rossi2012meta, rossi2014metastream]. Such approaches are similar to those in promising configuration generation.
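The monitoring step above can be sketched with a very simple statistic: compare the mean of the most recent window of a stream against a reference window and flag drift when they differ by many standard errors. This plain z-test is our own illustrative stand-in, much cruder than the distance measures of [kifer2004detecting]; the window size and threshold are arbitrary choices.

```python
import math

def detect_drift(stream, window=50, threshold=3.0):
    """Sketch of drift detection by monitoring a distribution
    statistic: flag drift when the recent window's mean deviates
    from the reference window's mean by > threshold standard errors."""
    if len(stream) < 2 * window:
        return False
    ref = stream[-2 * window:-window]
    recent = stream[-window:]
    mean_r = sum(ref) / window
    mean_c = sum(recent) / window
    var_r = sum((x - mean_r) ** 2 for x in ref) / (window - 1)
    var_c = sum((x - mean_c) ** 2 for x in recent) / (window - 1)
    se = math.sqrt(var_r / window + var_c / window) or 1e-12
    return abs(mean_c - mean_r) / se > threshold

stable = [float(i % 2) for i in range(100)]   # stationary 0/1 signal
drifted = [0.0] * 50 + [1.0] * 50             # mean jumps from 0 to 1
```

Once such a detector fires, the dynamic-adaptation step recomputes the meta-features of the current data and queries the meta-learner for a promising new configuration.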
Summary. We have so far reviewed the major meta-learning techniques in the context of AutoML. However, applying meta-knowledge requires certain effort, as will be discussed in Section LABEL:ssec:tech_future.
6.2 Transfer Learning
|purpose||source||target||transferred knowledge|
|surrogate model transfer||past hyperparameter optimization||current hyperparameter optimization||surrogate model, or model components|
|network block transfer||past network architecture search||current network architecture search||network building blocks|
|model parameter transfer||past architecture evaluation||current architecture evaluation||model parameters|
|function-preserving transformation||past architecture evaluation||current architecture evaluation||the function represented by the network|
Transfer learning, according to the definition in [pan2009survey], tries to improve learning on the target domain and task by using knowledge from the source domain and task. In the context of AutoML, the source and target of transfer are either configuration generations or configuration evaluations, where the former setting transfers knowledge among AutoML practices and the latter transfers knowledge within an AutoML practice. On the other hand, transferable knowledge that has been hitherto exploited in AutoML includes but is not limited to: 1) learned models or their parameters; 2) configurations of learning tools; 3) strategies to search for promising learning tools. Figure 17 illustrates how transfer learning works in AutoML. Remark 6.2 points out the key issues in applying transfer learning, and Table IX summarizes the different sources, targets, and transferable knowledge involved in transfer learning in the existing AutoML literature.
To apply transfer learning in AutoML, we need to determine:
what is the purpose of knowledge transfer;
what are the source and target of knowledge transfer;
what knowledge is to be transferred.
In the remainder of this section, we review the transfer learning techniques that have been employed to help: 1) configuration generation (targeting the optimizer), and 2) configuration evaluation (targeting the evaluator).
6.2.1 Configuration Generation (optimizer)
In AutoML, the search for good configurations is often computationally expensive due to costly evaluations and extensive search spaces. Transfer learning has been exploited to reuse trained surrogate models or promising search strategies from past AutoML searches (source) and improve efficiency on the current AutoML task (target):
Surrogate model transfer: sequential model-based optimization (SMBO) for hyperparameters suffers from the cold-start problem, since it is expensive to initialize the surrogate model from scratch for every AutoML problem. Transfer learning techniques have hence been proposed to reuse knowledge gained from past experience, by transferring the surrogate model [golovin2017google] or its components, such as the kernel function [yogatama2014efficient].
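To make the warm-start idea concrete, here is a minimal sketch (not the actual method of [golovin2017google] or [yogatama2014efficient]): a deliberately simple nearest-neighbor surrogate stands in for the usual Gaussian-process model, and it is seeded with observations carried over from past runs before the new search begins. The objective function, candidate grid, and past scores below are all hypothetical.

```python
class NearestNeighborSurrogate:
    """A toy stand-in for a Bayesian-optimization surrogate: it predicts a
    configuration's score as the score of the closest observed configuration."""

    def __init__(self, observations=None):
        # Warm start: seed the surrogate with (config, score) pairs carried
        # over from past AutoML runs instead of starting from scratch.
        self.observations = list(observations or [])

    def update(self, config, score):
        self.observations.append((config, score))

    def predict(self, config):
        nearest = min(self.observations, key=lambda obs: abs(obs[0] - config))
        return nearest[1]

def model_based_search(surrogate, candidates, budget, evaluate):
    """Greedily evaluate the unseen candidate the surrogate likes best,
    then refit the surrogate with the observed score."""
    for _ in range(budget):
        seen = {c for c, _ in surrogate.observations}
        pool = [c for c in candidates if c not in seen] or candidates
        best = max(pool, key=surrogate.predict)
        surrogate.update(best, evaluate(best))
    return max(surrogate.observations, key=lambda obs: obs[1])

# Hypothetical objective: validation accuracy peaks at config = 0.3.
evaluate = lambda c: 1.0 - (c - 0.3) ** 2
past_runs = [(0.1, evaluate(0.1)), (0.8, evaluate(0.8))]  # from earlier tasks
surrogate = NearestNeighborSurrogate(past_runs)
config, score = model_based_search(
    surrogate, [i / 10 for i in range(11)], budget=5, evaluate=evaluate)
```

Because the surrogate starts with past observations near the optimum, the search concentrates on the promising region instead of exploring blindly from an empty model.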
Network block transfer: transfer learning is especially widely used in the realm of network architecture search due to the transferability of networks. In [zoph2017learning, zhong2017practical], the NAS problem is recast as a search for architecture building blocks. Such blocks can be learned at low cost on small datasets and then transferred to larger ones.
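The block-transfer idea can be sketched in a few lines: the searched cell is a small spec that is reused, with more repetitions, to assemble a deeper network for the target task. The cell spec and operation names below are hypothetical placeholders; a real implementation would map each op name to an actual layer.

```python
def build_network(cell_spec, num_cells):
    """Assemble a network by stacking `num_cells` copies of a searched cell."""
    return [op for _ in range(num_cells) for op in cell_spec]

# Hypothetical building block found by searching on a small proxy dataset.
cell = ["conv3x3", "relu", "conv3x3", "skip_add"]
proxy_net = build_network(cell, num_cells=2)   # cheap network used during search
target_net = build_network(cell, num_cells=8)  # deeper network reuses the block
```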
It should be noted that multi-task learning, a topic closely related to transfer learning, has also been employed to help configuration generation. In [swersky2013multi], Bayesian optimization is equipped with multi-task Gaussian process models so that knowledge gained from past tuning tasks can be transferred to warm-start the search. In [wong2018transfer], a multi-task neural AutoML controller is trained to learn hyperparameters for neural networks.
6.2.2 Configuration Evaluation (evaluator)
In the search for promising learning tools, a great number of candidate configurations need to be evaluated. In common approaches, such evaluation involves expensive model training. By transferring knowledge from previous configuration evaluations, we can avoid training models from scratch for upcoming evaluations and significantly improve efficiency. Based on the well-recognized and proven transferability of neural networks, transfer learning techniques have been widely employed in NAS approaches to accelerate the evaluation of candidate architectures:
Model parameter transfer: the most straightforward method is to transfer parameters from trained architectures to initialize new ones. According to [yosinski2014transferable], initializing a network with transferred feature layers, followed by fine-tuning, improves deep neural network performance. Following this idea, in [pham2018faster], child networks are forced to share weights so that training costs can be significantly reduced.
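A minimal sketch of parameter transfer, assuming networks are represented as dictionaries of named weight matrices (the layer names and shapes below are made up): layers whose name and shape match the trained network are copied, while the rest are freshly initialized and left to fine-tuning.

```python
import random

def init_layer(shape, rng):
    """Randomly initialize a weight matrix of the given (rows, cols) shape."""
    rows, cols = shape
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def transfer_parameters(trained, target_shapes, rng):
    """Initialize a new network from a trained one: copy every layer whose
    name and shape match, freshly initialize the rest (to be fine-tuned)."""
    new_net, copied = {}, []
    for name, shape in target_shapes.items():
        source = trained.get(name)
        if source is not None and (len(source), len(source[0])) == shape:
            new_net[name] = [row[:] for row in source]  # transferred weights
            copied.append(name)
        else:
            new_net[name] = init_layer(shape, rng)      # trained from scratch
    return new_net, copied

rng = random.Random(0)
trained = {"conv1": [[1.0, 2.0]], "fc": [[3.0], [4.0]]}
target_shapes = {"conv1": (1, 2), "fc": (3, 1)}  # "fc" changed shape
new_net, copied = transfer_parameters(trained, target_shapes, rng)
# "conv1" is copied; "fc" is re-initialized because its shape differs
```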
Function-preserving transformation: another line of research focuses on function-preserving transformations, first proposed in Net2Net [chen2015net2net], where a new network is initialized to represent the same function as a given trained model. This approach has been shown to significantly accelerate the training of the new network. Function-preserving transformations have also inspired new strategies to explore the architecture space in recent approaches [cai2018efficient, cai2018path].
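The Net2WiderNet transformation from [chen2015net2net] can be illustrated on a tiny two-layer network. This pure-Python sketch (with made-up weights, not the paper's implementation) widens the hidden layer by replicating units and rescaling their outgoing weights, so the network's output is exactly preserved:

```python
import random

def linear(W, b, x):
    # y_i = sum_j W[i][j] * x[j] + b[i]
    return [sum(w * xj for w, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def net2wider(W1, b1, W2, new_width, rng):
    """Widen the hidden layer to `new_width` units while preserving the
    network's function (a Net2WiderNet-style transformation)."""
    old_width = len(W1)
    # g maps each new hidden unit to an old one; the extra units
    # replicate randomly chosen old units.
    g = list(range(old_width)) + [rng.randrange(old_width)
                                  for _ in range(new_width - old_width)]
    counts = [g.count(i) for i in range(old_width)]
    W1_new = [list(W1[g[j]]) for j in range(new_width)]
    b1_new = [b1[g[j]] for j in range(new_width)]
    # Divide each outgoing weight by the replication count so the summed
    # contribution of the duplicated units stays unchanged.
    W2_new = [[W2[i][g[j]] / counts[g[j]] for j in range(new_width)]
              for i in range(len(W2))]
    return W1_new, b1_new, W2_new

def forward(W1, b1, W2, b2, x):
    hidden = [max(0.0, v) for v in linear(W1, b1, x)]  # ReLU
    return linear(W2, b2, hidden)

rng = random.Random(0)
W1, b1 = [[1.0, -2.0], [0.5, 3.0]], [0.1, -0.2]
W2, b2 = [[2.0, -1.0]], [0.3]
x = [0.7, 0.4]
y_before = forward(W1, b1, W2, b2, x)
W1n, b1n, W2n = net2wider(W1, b1, W2, new_width=4, rng=rng)
y_after = forward(W1n, b1n, W2n, b2, x)  # same output, wider network
```

The transformation survives the ReLU because duplicated hidden units have identical pre-activations; the wider network can then be trained further without first re-learning the original function.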
Summary. As we can observe, the application of transfer learning in AutoML is still relatively limited. Most approaches focus on the neural architecture search problem, and the transferability of knowledge is not yet addressed in an automatic manner, which motivates the discussion in Section LABEL:ssec:tech_future.
|Approach||Optimizer||Evaluator||Search space|
|Auto-sklearn [feurer2015efficient]||SMAC algorithm [hutter2011sequential] (warm-started by meta-learning)||direct evaluation (train model parameters with optimization algorithms)||out-of-the-box classifiers|
|NASNet [bello2017neural]||recurrent neural network (trained with the REINFORCE algorithm [williams1992simple])||direct evaluation (train child networks with stochastic gradient descent)||convolutional neural networks|
|ExploreKit [katz2016explorekit]||greedy search algorithm||classifiers trained with meta-features||subsequent learning models|
7.1 Model Selection using Auto-sklearn
As each learning problem has its own preference over learning tools [tom1997machine], when we are dealing with a new classification problem it is natural to try a collection of classifiers and then obtain final predictions from an ensemble of them. This is a very typical application scenario of AutoML for model selection (Section 3.2), and the key issue here is how to automatically select the best classifiers and set up their hyper-parameters.
Using Scikit-Learn [scikit-learn] as an example, some popularly used classifiers and their hyper-parameters are listed in Table IV. In [thornton2013auto, feurer2015efficient], the above issue is formulated as a CASH problem (Example 1). The ensemble construction is cast as (LABEL:eq:cashpro), an optimization problem that minimizes the loss on a validation set over both parameters and hyper-parameters.
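As a toy illustration of the CASH search, the sketch below replaces the SMAC optimizer used in [thornton2013auto, feurer2015efficient] with plain random search over two hypothetical one-hyper-parameter models (a decision stump and a k-nearest-neighbor classifier) on synthetic 1-D data; every trial jointly samples a model and a hyper-parameter and scores it by average validation error over K folds. Everything here is illustrative rather than Auto-sklearn's actual implementation.

```python
import random

def make_data(n=60, seed=0):
    rng = random.Random(seed)
    # Noiseless 1-D toy task: the label is 1 iff x > 0.1.
    return [(x, 1 if x > 0.1 else 0)
            for x in (rng.uniform(-1, 1) for _ in range(n))]

def stump_error(train, valid, t):
    # Decision stump: predict 1 iff x > t (hyper-parameter: threshold t).
    return sum((1 if x > t else 0) != y for x, y in valid) / len(valid)

def knn_error(train, valid, k):
    # k-nearest-neighbor majority vote (hyper-parameter: k).
    errors = 0
    for x, y in valid:
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        pred = 1 if 2 * sum(p[1] for p in nearest) > k else 0
        errors += pred != y
    return errors / len(valid)

# Search space: each model paired with a sampler for its hyper-parameter.
MODELS = {
    "stump": (stump_error, lambda rng: rng.uniform(-1, 1)),
    "knn":   (knn_error,   lambda rng: rng.randrange(1, 10)),
}

def cash_random_search(data, n_trials=50, n_folds=3, seed=1):
    rng = random.Random(seed)
    folds = [data[i::n_folds] for i in range(n_folds)]
    best_loss, best_model, best_theta = float("inf"), None, None
    for _ in range(n_trials):
        name = rng.choice(sorted(MODELS))        # sample a model...
        error_fn, sample = MODELS[name]
        theta = sample(rng)                      # ...and its hyper-parameter
        # The CASH objective: average validation loss over the K folds.
        loss = sum(
            error_fn([p for j, f in enumerate(folds) if j != i for p in f],
                     folds[i], theta)
            for i in range(n_folds)) / n_folds
        if loss < best_loss:
            best_loss, best_model, best_theta = loss, name, theta
    return best_loss, best_model, best_theta

loss, model, theta = cash_random_search(make_data())
```

A real system would replace the random sampling with a model-based optimizer such as SMAC and the toy models with full Scikit-Learn pipelines, but the objective being minimized is the same cross-validated loss.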
Example 1 (CASH Problem [thornton2013auto, feurer2015efficient]).
Let $\mathcal{F} = \{F^{(1)}, \dots, F^{(R)}\}$ be a set of learning models, where each model $F^{(j)}$ has hyper-parameter $\theta_j$ with domain $\Theta_j$, and let $\mathcal{D}_{\text{train}}$ be a training set which is split into $K$ cross-validation folds $\{\mathcal{D}_{\text{valid}}^{(1)}, \dots, \mathcal{D}_{\text{valid}}^{(K)}\}$ and $\{\mathcal{D}_{\text{train}}^{(1)}, \dots, \mathcal{D}_{\text{train}}^{(K)}\}$ with $\mathcal{D}_{\text{train}}^{(i)} = \mathcal{D}_{\text{train}} \setminus \mathcal{D}_{\text{valid}}^{(i)}$ for $i = 1, \dots, K$. Then, the Combined Algorithm Selection and Hyper-parameter (CASH) optimization problem is defined as