Automatic Configuration of Deep Neural Networks with EGO
Abstract
Designing the architecture of an artificial neural network is a cumbersome task because of the numerous parameters to configure, including activation functions, layer types, and hyperparameters. Given the large number of parameters of most networks nowadays, it is intractable to find a good configuration for a given task by hand. In this paper, an Efficient Global Optimization (EGO) algorithm is adapted to automatically optimize and configure convolutional neural network architectures. A configurable neural network architecture based solely on convolutional layers is proposed for the optimization. Without using any knowledge of the target problem and without any data augmentation techniques, it is shown that on several image classification tasks this approach is able to find network architectures that are competitive in prediction accuracy with the best handcrafted ones in the literature. In addition, a very small training budget (200 evaluations and 10 epochs in training) is spent on each optimized architecture, in contrast to the usual long training time of handcrafted networks. Moreover, instead of the standard sequential evaluation in EGO, several candidate architectures are proposed and evaluated in parallel, which reduces the execution overhead significantly and leads to an efficient automation of deep neural network design.
1 Introduction
Deep Artificial Neural Networks, and in particular Convolutional Neural Networks (CNNs), have demonstrated great performance on a wide range of difficult computer vision, classification and regression tasks. One of the most promising aspects of using deep neural networks is that feature extraction and feature engineering, which were mostly done by hand so far, are now completely taken care of by the networks themselves. Unfortunately, the design and configuration of artificial neural networks is still done by hand, using an educated guess, popularity (reusing an architecture from previous literature), or a grid of different architectures and parameters from which the best-performing network is chosen. Since the number of choices for a network architecture and its parameters can become quite large, an optimal deep neural network for a given problem is very unlikely to be obtained by this handcrafted procedure.
The challenges in configuring CNNs are: 1) the search space is usually high-dimensional and heterogeneous, resulting from a large number of structure choices (e.g., number of layers, layer type, etc.) and real parameters; 2) the computational time becomes the bottleneck when fitting a deep network structure to a relatively large data set. These difficulties hinder the applicability of traditional nonlinear black-box optimizers, for instance Evolutionary Algorithms Stanley & Miikkulainen (2002). Instead, it is proposed here to adopt the so-called Efficient Global Optimization (EGO) algorithm Močkus (1975, 2012); Jones et al. (1998) as the network configurator. The standard EGO algorithm is a sequential strategy designed for the expensive-evaluation scenario, where a single candidate configuration is provided in each iteration. It is proposed to adapt the EGO algorithm to yield several candidate configurations in each iteration, so that the resulting configurations can be evaluated in parallel.
This paper is organized as follows. In section 2, related approaches to network configuration are discussed. In section 3, we introduce the All-CNN configuration framework, using only convolutional layers, and the EGO-based configurator is explained in section 4. The proposed method is validated and tested in sections 5 and 6, followed by the demonstration of an application to a real-world problem.
2 Related Research
The optimization of hyperparameters is a well-known challenge and has been addressed in many works. For example, Bergstra & Bengio (2012) show that randomly chosen trials are more efficient than grid search for hyperparameter optimization. Obviously, both random and grid search are far from optimal, and more sophisticated methods are required to search the very large and complex space involved in optimizing deep artificial neural networks. More recent work of the same author Bergstra et al. (2013) shows that automatic hyperparameter tuning can yield state-of-the-art results. In these papers, architectures are used that are known to work on a specific problem and are then fine-tuned by hyperparameter optimization. Other sophisticated algorithms for parameter tuning and automated machine learning configuration are Bayesian Optimization Snoek et al. (2012); Jones et al. (1998), Evolutionary Algorithms Loshchilov & Hutter (2016) and SMAC Hutter et al. (2011a), which try to quickly converge to practical, well-performing hyperparameters for a given machine learning algorithm.
Unfortunately, even with these sophisticated algorithms, optimizing the deep neural network architecture itself, in addition to its hyperparameters, is a very challenging task. This is caused by the time complexity and computational effort required to train these networks, in combination with the size of the hyperparameter search space of such networks. Automatically optimizing the structure of an artificial neural network is not an entirely new idea though: as early as 1989, genetic algorithms were proposed to optimize the links between a predefined number of nodes Miller et al. (1989). A bit later, an evolutionary program (GNARL) was proposed to evolve the structure of recurrent neural networks Angeline et al. (1994). In more recent work Ritchie et al. (2003), Genetic Programming (GP) is used for the automatic construction of neural network architectures.
One of the main bottlenecks of these previously proposed methods is that a single evaluation of an artificial neural network can take several hours, even on a modern GPU system. This makes it infeasible to apply these algorithms with a large evaluation budget or on a large problem instance. Unfortunately, such algorithms usually require a large evaluation budget to find well-performing network configurations for a specific problem. Another challenge is to define a bounded search space that still covers most of the possibilities needed to find the optimum. When dealing with neural network structures this is far from simple. The number of layers, for example, can be a problematic parameter to vary, since each layer comes with its own set of hyperparameters.
To alleviate this problem, a generic configurable deep neural network architecture is proposed in this paper. This architecture is highly configurable, with a large number of parameters, and can represent very shallow to very deep convolutional neural networks. The configurable architecture has a fixed number of hyperparameters and is therefore very suitable for optimization. To tackle this optimization task, the well-known Efficient Global Optimization algorithm Močkus (1975, 2012); Jones et al. (1998) is adopted with several important improvements, enabling the parallel training of different network candidates. The main advantages of the proposed approach are:

Small optimization time: it requires far fewer real evaluations (training of candidate networks) than other approaches.

Parallelism: several candidate networks are suggested in each iteration, facilitating parallel execution over multiple GPUs.
3 A Configurable All-Convolutional Neural Network
In order to optimize the structure and hyperparameters of a deep neural network, a few modeling decisions are required to set the boundaries of the search space. The complexity of the search space is mostly due to a large number of different layer types, activation functions and regularization methods, each coming with their own set of hyperparameters.
In order to reduce the complexity of the search space without making too many modeling assumptions, a generic configurable convolutional neural network designed for any image classification problem is proposed here.
According to Springenberg et al. (2014), using only convolutional layers can give the same or better performance than the often-used structure of convolutional layers followed by a pooling layer. Therefore, for our generic configurable network structure, we have chosen to use only convolutional layers, with the exception of the final layer.
The configurable network architecture is shown in Table 1, where each of the stacks has the architecture shown in Figure 1. The network consists of multiple such stacks, each consisting of a number of convolutional layers, a convolutional layer with strides (Conv2DOut) to allow for pooling, and a dropout layer. The last part of the network either uses global average pooling or not, and ends in a dense layer whose size equals the number of classes one wants to predict. Each stack has independently configurable parameters as well as shared parameters that can be optimized. The convolutional layers in a stack are parameterized by the number of filters, the kernel size and the regularization factor for the weights, and the Conv2DOut layer additionally by its strides. The activation function is configurable as well, and every dropout layer has its own dropout probability. The last dense layer has its own configurable parameters, and the size of each stack is configurable too, allowing for very shallow to very deep neural network architectures. All hyperparameters that are not taken into account for the configuration are set to the values recommended in the literature, and the padding of each convolutional layer is set to ‘same’ in order to avoid negative dimensions.
Layer Type       Parameters

Dropout
Conv2D

Stacks (repeated):
  Conv2D
  Conv2DOut
  Dropout

Head:
  GlobalPooling  boolean
  Dense
Next to the parameters of the configurable network itself, there are the learning rate and the decay rate of the backpropagation optimizer. Depending on the available resources and the classification task at hand, the ranges of the parameters can be determined by the user; the ranges used for this paper can be found in Table 2. The optimizer used for backpropagation is the well-known stochastic gradient descent (SGD), as provided by the Keras Chollet et al. (2015) Python library.
Parameter            Range

activation function  {elu, relu, tanh, selu, sigmoid}
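To make the composition concrete, the following sketch expands one configuration into the layer sequence of Table 1 as plain Python data; the parameter names (n_stacks, depth, filters, etc.) are illustrative stand-ins for the paper's symbols, not its exact notation.

```python
# Sketch: expand a configuration into the layer sequence of Table 1.
# All parameter names are illustrative stand-ins, not the paper's symbols.

def build_layer_spec(cfg):
    """Return the network as a list of (layer_type, params) tuples."""
    spec = [("Dropout", {"rate": cfg["input_dropout"]}),
            ("Conv2D", {"filters": cfg["stem_filters"],
                        "kernel_size": cfg["stem_kernel"],
                        "activation": cfg["activation"],
                        "padding": "same"})]
    for s in range(cfg["n_stacks"]):
        for _ in range(cfg["depth"][s]):          # configurable stack size
            spec.append(("Conv2D", {"filters": cfg["filters"][s],
                                    "kernel_size": cfg["kernels"][s],
                                    "activation": cfg["activation"],
                                    "padding": "same"}))
        # strided convolution replaces pooling (the Conv2DOut layer)
        spec.append(("Conv2DOut", {"filters": cfg["filters"][s],
                                   "strides": cfg["strides"][s],
                                   "padding": "same"}))
        spec.append(("Dropout", {"rate": cfg["dropout"][s]}))
    if cfg["global_pooling"]:                     # optional head pooling
        spec.append(("GlobalPooling", {}))
    spec.append(("Dense", {"units": cfg["n_classes"],
                           "activation": "softmax"}))
    return spec
```

Such a specification can then be translated one-to-one into Keras layers; keeping it as plain data makes the architecture easy to log, compare and mutate during optimization.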
4 Efficient Global Optimization Based Configurator
The search space of the All-CNN framework is heterogeneous and high-dimensional. For the integer parameters, in the case of three stacks, there are seven for the number of filters, seven for the kernel size, three for the strides and three for the number of layers in the stacks, and thus 20 in total. For the discrete parameters, there are two activation functions (one for the stacks and one for the head); for the real parameters, there are four dropout rates, one regularization factor and one learning rate. In addition, there is one boolean variable to control the global pooling. Therefore, this search space can be represented as:
a mixed product of integer, discrete, real and boolean domains. The convolutional neural network can be instantiated by drawing samples from this space. Given a data set, the problem is to find the optimal configuration with respect to a predefined, real-valued performance metric of the neural network (for instance, an error measure for regression tasks and precision for classification problems). In the following discussion it is assumed, without loss of generality, that the performance metric is subject to minimization (a maximization problem can easily be converted). The challenge in optimizing this metric is its evaluation time, which becomes extremely expensive when training a large network structure on a huge data set. Consequently, it is recommended to use efficient optimization algorithms that save as many evaluations as possible. The Efficient Global Optimization (EGO) algorithm Močkus (1975, 2012); Jones et al. (1998) is a suitable candidate for this task. It is a sequential optimization strategy that does not require derivatives of the objective function and is designed to tackle expensive global optimization problems. Compared to alternative optimization algorithms (or other design-of-experiment methods), the distinctive feature of this method is the use of a metamodel, which gives a predictive distribution over the (partially) unknown function.
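The mixed search space described above can be sketched as a simple dictionary of typed parameter ranges; the bounds below are illustrative assumptions for one convolutional block, not the exact ranges of Table 2.

```python
import random

# Sketch of a mixed-integer search space; bounds are illustrative
# assumptions, not the paper's exact ranges.
SEARCH_SPACE = {
    "filters":        ("int",  10, 600),
    "kernel_size":    ("int",  1, 8),
    "strides":        ("int",  1, 5),
    "stack_depth":    ("int",  1, 10),
    "activation":     ("cat",  ["elu", "relu", "tanh", "selu", "sigmoid"]),
    "dropout":        ("real", 0.0, 0.9),
    "l2_reg":         ("real", 1e-4, 1e-2),
    "lr":             ("real", 1e-4, 1e-1),
    "global_pooling": ("bool",),
}

def sample_configuration(rng=random):
    """Draw one configuration uniformly at random from the space."""
    cfg = {}
    for name, spec in SEARCH_SPACE.items():
        kind = spec[0]
        if kind == "int":
            cfg[name] = rng.randint(spec[1], spec[2])
        elif kind == "cat":
            cfg[name] = rng.choice(spec[1])
        elif kind == "real":
            cfg[name] = rng.uniform(spec[1], spec[2])
        else:  # bool
            cfg[name] = rng.random() < 0.5
    return cfg
```

Uniform sampling like this is the baseline that the model-based configurator improves upon; the same space definition can also feed the initial design and the mutation operators used later.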
Briefly, this optimization method iteratively proposes new candidate configurations based on the metamodel, taking both the prediction and the model uncertainty into account. After the evaluation of the new candidate configurations, the metamodel is retrained.
4.1 Initial Design and Metamodeling
To construct the metamodel, some initial samples in the configuration space are generated via Latin hypercube sampling (LHS) McKay et al. (1979). The corresponding performance metric values are obtained by instantiating the networks and validating their performance on the data set. Note that the evaluation of the initial design can easily be parallelized. As for the choice of metamodel, although Gaussian process regression Sacks et al. (1989); Santner et al. (2003) (referred to as Kriging in geostatistics Krige (1951)) is frequently used in EGO, we adopt a random forest instead, as it is more suitable for a mixed-integer configuration domain Hutter et al. (2011b). In addition to the prediction of a configuration's performance, the empirical variance of the prediction is calculated from the forest, which quantifies the prediction uncertainty.
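The stratified structure of a Latin hypercube design can be sketched in a few lines of plain Python: every axis of the unit hypercube is divided into n strata, and each stratum receives exactly one sample. (This is only the design in the unit cube; the points must still be mapped to the mixed parameter ranges, and the forest metamodel is fitted on the evaluated results afterwards.)

```python
import random

def latin_hypercube(n, d, rng=random):
    """n samples in [0, 1)^d with exactly one sample per axis stratum."""
    cols = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)                        # random stratum order
        # one uniform draw inside each stratum [i/n, (i+1)/n)
        cols.append([(i + rng.random()) / n for i in strata])
    # pair the d columns up into d-dimensional points
    return [tuple(col[k] for col in cols) for k in range(n)]
```

Compared to plain uniform sampling, this guarantees that every parameter is probed across its whole range even for a small initial budget.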
4.2 Infill Criterion
To propose potentially good configurations in each iteration, a so-called infill criterion is used to quantify the quality of candidate configurations. Informally, infill criteria balance the predicted values from the metamodel against the prediction uncertainty. A lot of research effort has gone into exploring various infill criteria over the last decades, e.g., Expected Improvement Močkus (1975); Jones et al. (1998), Probability of Improvement (Jones, 2001; Žilinskas, 1992) and the Upper Confidence Bound Auer (2002); Srinivas et al. (2010). In this contribution, we adopt the so-called Moment-Generating Function (MGF) based infill criterion proposed in Wang et al. (2017). This criterion allows for explicitly balancing exploitation and exploration, has a closed form, and can be expressed as:
(1)  
In this expression, the current best performance over all evaluated configurations and the cumulative distribution function of the standard normal distribution appear. The infill criterion introduces an additional real parameter (the “temperature”) to explicitly control the balance between exploration and exploitation. As explained in Wang et al. (2017), when the temperature goes up, the criterion tends to reward configurations with high uncertainty. On the contrary, when the temperature is decreased, the criterion puts more weight on the predicted performance value. It is then possible to set the temperature according to the budget of the configuration task: with a larger budget of function evaluations, it can be set to a relatively high value, leading to a slow but global search process, and vice versa for a smaller budget.
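The exact closed form of the MGF criterion is given in Wang et al. (2017). As an illustration of how such infill criteria trade the predicted value off against the model uncertainty, the following sketch implements the closed form of the Expected Improvement criterion named above, for minimization; mu and sigma denote the metamodel prediction and its standard deviation.

```python
import math

def expected_improvement(mu, sigma, f_min):
    """EI for minimization: E[max(f_min - Y, 0)] with Y ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(f_min - mu, 0.0)            # no model uncertainty left
    z = (f_min - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (f_min - mu) * cdf + sigma * pdf
```

Note how both terms contribute: the first rewards configurations predicted to improve on the incumbent (exploitation), while the second grows with sigma, so uncertain regions stay attractive even when the predicted value is mediocre (exploration).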
4.3 Parallel execution
Due to the typically long execution time of instantiated network structures, it is also proposed here to parallelize the execution. This requires generating more than one candidate configuration in each iteration. Many methods have been developed for this purpose, including multi-point Expected Improvement Ginsbourger et al. (2010) and niching techniques Wang et al. (2018). Here, we adopt the approach of Hutter et al. (2012), where several different temperatures are sampled from a log-normal distribution and the corresponding infill criteria are instantiated with these temperatures. The candidate configurations are then obtained by maximizing those infill criteria. On one hand, as the log-normal is a long-tailed distribution, most of the sampled temperatures are relatively small and thus the model prediction is well exploited. On the other hand, a few samples will be relatively large and will therefore lead to very explorative search behavior.
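A minimal sketch of this batch-proposal idea follows. Since the MGF criterion itself is not reproduced here, a simple lower-confidence-bound criterion (mu - t·sigma, an illustrative stand-in, suitable for minimization) plays its role, and the inner maximization over the search space is replaced by a scan over a finite candidate pool; in the actual configurator each selected candidate would then be trained on its own GPU.

```python
import random

def propose_batch(candidates, predict, q=4, rng=random):
    """Pick q candidates, one per temperature sampled from a log-normal.

    predict(x) -> (mu, sigma). Small temperatures exploit the model
    prediction; large (rare) temperatures reward uncertain regions.
    """
    temperatures = [rng.lognormvariate(0.0, 1.0) for _ in range(q)]
    batch = []
    for t in temperatures:
        # stand-in criterion: minimize the lower confidence bound mu - t*sigma
        best = min(candidates,
                   key=lambda x: predict(x)[0] - t * predict(x)[1])
        batch.append(best)
    return temperatures, batch
```

Because the temperatures differ, the q selected configurations differ as well, which is exactly what makes the batch worth evaluating in parallel.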
To maximize the infill criterion on the mixed-integer search domain, we adopt the so-called Mixed-Integer Evolution Strategy (MIES) Li et al. (2013). The proposed Bayesian configurator is summarized in Algorithm 1.
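A MIES-style mutation over such a mixed space could be sketched as follows. This is a simplification: the actual MIES of Li et al. (2013) self-adapts its step sizes, whereas fixed step sizes and reset probabilities are used here for brevity, and the space format (name -> (kind, bounds...)) is an illustrative assumption.

```python
import random

def mies_mutate(cfg, space, rng=random):
    """One simplified MIES-style mutation: Gaussian steps for reals,
    rounded Gaussian steps for integers, and random resets (prob. 1/4)
    for categorical and boolean parameters."""
    child = dict(cfg)
    for name, spec in space.items():
        kind = spec[0]
        if kind == "real":
            lo, hi = spec[1], spec[2]
            step = 0.1 * (hi - lo)                 # fixed step size (sketch)
            child[name] = min(hi, max(lo, cfg[name] + rng.gauss(0.0, step)))
        elif kind == "int":
            lo, hi = spec[1], spec[2]
            child[name] = min(hi, max(lo, cfg[name] + round(rng.gauss(0.0, 1.0))))
        elif kind == "cat" and rng.random() < 0.25:
            child[name] = rng.choice(spec[1])
        elif kind == "bool" and rng.random() < 0.25:
            child[name] = not cfg[name]
    return child
```

Iterating such mutations inside a (mu, lambda) selection loop, with the infill criterion as fitness, yields the inner optimizer that proposes each candidate configuration.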
5 Experiments
To test our algorithm, two very popular and common classification tasks have been performed using the proposed configurator and a configurable network with multiple stacks. These are the MNIST dataset LeCun et al. (1998), containing 60,000 training samples and a test set of 10,000 examples, all 28x28 greyscale images, and the CIFAR-10 dataset Krizhevsky & Hinton (2009), containing 60,000 32x32 colour images in 10 classes, with 6,000 images per class; in this case, there are 50,000 training images and 10,000 test images.
In the optimization procedure of the neural network on the MNIST dataset, each evaluation is run for only a small number of epochs with a fixed batch size. For the CIFAR-10 dataset, the number of epochs is increased, but remains much smaller than the number of epochs used in most recent literature. An early stopping criterion is used to stop the evaluation of a particular configuration after several epochs without improvement. No data augmentation is used.
The Bayesian mixed-integer configurator is set to evaluate several network configurations per step in parallel using NVIDIA K80 GPUs, where the first steps are used for the initial LHS design. The test set accuracy is returned after each evaluation as the fitness value for the optimizer.
6 MNIST and CIFAR-10 Results
In Figure 2, the results of the automatic configuration of the All-CNN networks are shown. In both cases, a well-performing network configuration is obtained after relatively few evaluations. Both classification tasks used exactly the same initial configuration; the only difference is the number of epochs for each network evaluation.
The best-performing configurations compete with the state of the art, as shown in Table 3 and Table 4, and can possibly be improved further when trained using more epochs. It should be noted that the number of epochs used to obtain these results is significantly lower than the number of epochs in state-of-the-art solutions from the literature. The advantage of such a small number of epochs is that it speeds up the entire optimization process; well-performing configurations can then be tuned with a larger number of epochs in a second optimization step, most likely resulting in increased performance. Using the automatic configurator, we obtained neural network architectures that compete with state-of-the-art results using only a small number of epochs per evaluation, without any manual tuning, reconfiguration or upfront knowledge of the specific problem instances. Handcrafted network configurations are not only trained using many more epochs for the final reported architecture, they also require a huge amount of time to be constructed by reconfiguring and fine-tuning the architecture. Therefore, the handcrafted networks essentially consume many more epochs until the final architecture is reached.
7 Real World Problem: Tata Steel
The proposed algorithm is applied to the real-world problem of classifying defects during the hot rolling of steel. This industrial process is very complex, with many conditions and parameters that influence the final product. It is also a process that changes over the years and requires dealing with concept shift and concept drift. One of the main objectives for Tata Steel is to automatically classify and predict surface defects using material properties and machine parameters as input data. To achieve this objective, a deep neural network architecture was first designed by hand to classify these defects.
The Tata Steel data set consists of various material measurements and machine parameters. Most of the measurements are taken over the complete length of each coil but not over its width (since the width is much smaller). The temperature measurements, however, are also taken over several tracks across the width of the coil. Due to this spatial difference, it was decided to design two concatenated network architectures. One part of the architecture is based purely on the temperature data, allowing for the application of convolution layers in both the width and the length direction of the coil. The second component models the remaining measurements and machine parameters, where the convolution filters only work along the length of the coil. At the end of the design process, these two parts are merged into one final fully-connected output layer.
The initial design process of these architectures was mainly based on trial and error and recommendations from the literature. The design process started with a small, relatively simple, two-layer multilayer perceptron, after which additional dense and convolution layers were added in order to increase the final accuracy. Dropout is applied to prevent overfitting; after several manual iterations, a particular dropout rate seemed to work best.
Next, we applied a slightly modified version of the proposed configurable All-CNN network (with a separate stack for the temperature data before concatenating it to the main model) and automatically optimized the configuration. The optimal configuration obtained with our optimization procedure significantly improves the classification accuracy. It also allows for easy retraining and validation on future data, since almost no knowledge of the actual dataset is required to train and optimize the network architecture.
The test set accuracy of the hand-designed classifier and the optimized classifier for this real-world application is shown in Figure 3. It can be observed that the optimized classifier has a significantly improved accuracy on this specific defect type, with an almost perfect true positive rate at only a very small false positive rate. This shows that the optimization procedure and configurable network architecture have great potential for industrial applications.
8 Conclusions and Outlook
A novel approach based on the Efficient Global Optimization algorithm is proposed to automatically configure neural network architectures. On some well-known image classification tasks, it is observed that the proposed optimization approach is capable of generating well-performing networks within a limited number of optimization iterations. In addition, the resulting optimized neural networks are highly competitive with the state-of-the-art manually designed ones on the MNIST and CIFAR-10 classification tasks. Note that this performance of the optimized networks is achieved with a very small number of training epochs for both MNIST and CIFAR-10, without any knowledge of the classification task or data augmentation techniques.
As for the real-world problem, we have applied the proposed approach to the challenge of steel surface defect detection. The outcome clearly illustrates that the proposed configuration approach also works extremely well there: the accuracy of the optimized network that detects surface defects for Tata Steel is significantly higher than that of the network designed by hand with manual fine-tuning.
For the next steps, there are several possibilities. First, the proposed approach will be applied and tested on various modeling tasks and real-world problems. Second, the actual training time of the candidate networks will be taken into account explicitly; the trade-off between training time and accuracy can be controlled by optimizing the number of epochs and the batch size. It is also interesting to formulate this as a bi-criteria decision-making problem, with one objective being the accuracy of the network and the other the required training time. Third, we will investigate how to extend the current configurable network, which has a linear topology, to more general topological structures. In this case, it will be very challenging to search efficiently in the complex configuration space with multiple dependencies.
Acknowledgment
The authors acknowledge support by NWO (Netherlands Organization for Scientific Research) PROMIMOOC project (project number: 650.002.001).
References
 Angeline, Peter J, Saunders, Gregory M, and Pollack, Jordan B. An evolutionary algorithm that constructs recurrent neural networks. IEEE transactions on Neural Networks, 5(1):54–65, 1994.
 Auer, Peter. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
 Bergstra, James and Bengio, Yoshua. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
 Bergstra, James, Yamins, Daniel, and Cox, David. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pp. 115–123, 2013.
 Chollet, François et al. Keras. https://github.com/keras-team/keras, 2015.
 Ciregan, Dan, Meier, Ueli, and Schmidhuber, Jürgen. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3642–3649. IEEE, 2012.
 Ginsbourger, David, Le Riche, Rodolphe, and Carraro, Laurent. Kriging Is Well-Suited to Parallelize Optimization, pp. 131–162. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. ISBN 978-3-642-10701-6. doi: 10.1007/978-3-642-10701-6_6. URL https://doi.org/10.1007/978-3-642-10701-6_6.
 Graham, Benjamin. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
 Hutter, Frank, Hoos, Holger H., and Leyton-Brown, Kevin. Sequential Model-Based Optimization for General Algorithm Configuration, pp. 507–523. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011a. ISBN 978-3-642-25566-3. doi: 10.1007/978-3-642-25566-3_40.
 Hutter, Frank, Hoos, Holger H, and LeytonBrown, Kevin. Sequential modelbased optimization for general algorithm configuration. LION, 5:507–523, 2011b.
 Hutter, Frank, Hoos, Holger, and LeytonBrown, Kevin. Parallel algorithm configuration. Learning and Intelligent Optimization, pp. 55–70, 2012.
 Jones, Donald R. A taxonomy of global optimization methods based on response surfaces. Journal of global optimization, 21(4):345–383, 2001.
 Jones, Donald R, Schonlau, Matthias, and Welch, William J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
 Krige, Daniel G. A Statistical Approach to Some Basic Mine Valuation Problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa, 52(6):119–139, December 1951.
 Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.
 LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Li, Rui, Emmerich, Michael TM, Eggermont, Jeroen, Bäck, Thomas, Schütz, Martin, Dijkstra, Jouke, and Reiber, Johan HC. Mixed integer evolution strategies for parameter optimization. Evolutionary computation, 21(1):29–64, 2013.
 Loshchilov, Ilya and Hutter, Frank. CMA-ES for hyperparameter optimization of deep neural networks. arXiv preprint arXiv:1604.07269, 2016.
 McKay, M. D., Beckman, R. J., and Conover, W. J. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245, 1979. ISSN 00401706. URL http://www.jstor.org/stable/1268522.
 Miller, Geoffrey F, Todd, Peter M, and Hegde, Shailesh U. Designing neural networks using genetic algorithms. In ICGA, volume 89, pp. 379–384, 1989.
 Močkus, J. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pp. 400–404. Springer, 1975.
 Močkus, Jonas. Bayesian approach to global optimization: theory and applications, volume 37. Springer Science & Business Media, 2012.
 Ritchie, Marylyn D, White, Bill C, Parker, Joel S, Hahn, Lance W, and Moore, Jason H. Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics, 4(1):28, 2003.
 Sacks, Jerome, Welch, William J., Mitchell, Toby J., and Wynn, Henry P. Design and Analysis of Computer Experiments. Statistical Science, 4(4):409–423, 1989.
 Santner, T.J., Williams, B.J., and Notz, W. The Design and Analysis of Computer Experiments. Springer Series in Statistics. Springer, 2003. ISBN 9780387954202.
 Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
 Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
 Srinivas, Niranjan, Krause, Andreas, Kakade, Sham M., and Seeger, Matthias. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 1015–1022, 2010. ISSN 00189448. doi: 10.1109/TIT.2011.2182033. URL http://arxiv.org/abs/0912.3995.
 Stanley, Kenneth O and Miikkulainen, Risto. Evolving neural networks through augmenting topologies. Evolutionary computation, 10(2):99–127, 2002.
 Wang, H., van Stein, B., Emmerich, M., and Bäck, T. A new acquisition function for Bayesian optimization based on the moment-generating function. In 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 507–512, Oct 2017. doi: 10.1109/SMC.2017.8122656.
 Wang, Hao, Bäck, Thomas, and Emmerich, Michael T. M. Multi-point efficient global optimization using niching evolution strategy. In Tantar, Alexandru-Adrian, Tantar, Emilia, Emmerich, Michael, Legrand, Pierrick, Alboaie, Lenuta, and Luchian, Henri (eds.), EVOLVE - A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation VI, pp. 146–162, Cham, 2018. Springer International Publishing. ISBN 978-3-319-69710-9.
 Yang, Zichao, Moczulski, Marcin, Denil, Misha, de Freitas, Nando, Smola, Alex, Song, Le, and Wang, Ziyu. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1476–1483, 2015.
 Zeiler, Matthew D and Fergus, Rob. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.
 Žilinskas, Antanas. A review of statistical models for global optimization. Journal of Global Optimization, 2(2):145–153, 1992.