A Heuristic For Efficient Reduction In Hidden Layer Combinations For Feedforward Neural Networks
In this paper, we describe the hyperparameter search problem in machine learning and present a heuristic approach to tackle it. In most learning algorithms, a set of hyperparameters must be determined before training commences. The choice of hyperparameters can significantly affect the final model’s performance, yet determining a good choice is in most cases complex and consumes a large amount of computing resources. We show the differences between an exhaustive search of hyperparameters and a heuristic search, and show that the heuristic significantly reduces the time taken to obtain the resulting model, with only marginal differences in evaluation metrics when compared to the benchmark case.
Keywords: Heuristic, Combinatorics, Neural Networks, Hyperparameter Optimization
Much research has been done in the field of hyperparameter optimization [1, 2, 3], with approaches such as grid search, random search, Bayesian optimization and gradient-based optimization. Grid search and manual search are the most widely used strategies for hyperparameter optimization. These approaches leave much to be desired in terms of reproducibility and are impractical when there are a large number of hyperparameters. Thus, automated hyperparameter search is increasingly being researched, and automated approaches have already been shown by numerous researchers to outperform manual search across several problems.
A multilayer perceptron (MLP) is a class of feedforward artificial neural network. It consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. With the exception of the input layer, each node in a layer is a neuron that utilizes a nonlinear activation function. The MLP is trained using backpropagation, a supervised learning technique. In our experiments, we only have test cases consisting of one to three hidden layers, each consisting of up to neurons. The reasons for this number are that our objective is to illustrate the effects of the heuristic using a small toy example that does not take too long to run in the test cases, and that, for the dataset used, the best results from the grid search involved fewer than 10 neurons.
2 Experiment Setting and Datasets
2.1 Programs Employed
We made use of Scikit-Learn, a free software machine learning library for the Python programming language. Python 3.6.4 was used to formulate and run the algorithms, plot the results and preprocess the data.
2.2 Resources Utilized
All experiments were conducted on the university’s High Performance Computing (HPC) machines (see https://nusit.nus.edu.sg/services/hpc/about-hpc/ for more details about the HPC), where we dedicated 12 CPU cores and 5GB of RAM in the job script. All jobs were submitted via SSH through the atlas8 host, which has the following specifications: HP Xeon two sockets 12-Core 64-bit Linux cluster, CentOS 6. We utilized Snakemake, a workflow management tool, to conduct our experiments.
We load and use the Boston house-prices dataset (see https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html for the documentation) from Scikit-Learn’s sklearn.datasets package. This package contains a few small standard datasets that do not require downloading any files from an external website.
2.4 Notations and Test Cases
We perform our experiments on feedforward neural networks with one, two and three hidden layers, using Scikit-Learn’s MLPRegressor from the sklearn.neural_network package (see https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor for further documentation). This model optimizes the squared loss using LBFGS (an optimizer in the family of quasi-Newton methods) or stochastic gradient descent. Our maximum number of neurons in any hidden layer is set at , as preliminary experiments show that the best cross-validation score is obtained when the number of neurons at any hidden layer is under .
Define to be the minimum fraction in improvement required for each iteration of the algorithm. We also define to be the hidden layer combination at iteration . is the starting hidden layer combination used as input. Let be the number of hidden layers, to be the maximum number of neurons across all hidden layers in and to be the best hidden layer combination obtained from fitting with GridSearchCV (see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html for the documentation) from Scikit-Learn’s sklearn.model_selection package. For example, if , it means that there are neurons in the first and third hidden layer and neurons in the second layer. Then and . We also define as the set containing all previously fitted hidden layer combinations. is then the set containing the best combination at any iteration of the algorithm, i.e. . Scikit-Learn’s sklearn.preprocessing.StandardScaler (see https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html for further documentation) was also used to standardize features by removing the mean and scaling to unit variance.
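As an illustration of the standardization step only, the following minimal sketch reproduces the arithmetic of removing the mean and scaling to unit variance for a single feature column in plain Python. The function name standardize and the sample values are hypothetical and are not part of our experiments; the population standard deviation is used, which matches the default behaviour of StandardScaler.

```python
import statistics


def standardize(column):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation (ddof = 0)
    if std == 0:
        # Constant feature: only centre it, to avoid dividing by zero.
        return [x - mean for x in column]
    return [(x - mean) / std for x in column]


# Hypothetical feature values with mean 5 and population std 2.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(feature)
```

After this transformation the column has zero mean and unit variance, so no single feature dominates the squared loss merely because of its scale.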
We also denote the Root Mean Square Error (RMSE) from fitting the model, at the end of the current iteration and from the previous iteration, as and respectively. In our experiments, . We also set to be an arbitrarily large value, for the purpose of passing the first iteration of the loop. Next, we define as a function that returns the set of all possible hidden layers and as a function that removes all common elements in the set .
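For concreteness, the RMSE referred to above can be computed as in the sketch below. The names rmse, rmse_prev and rmse_curr are hypothetical stand-ins for the notation, and the sample targets and predictions are made up for illustration. Initialising the previous RMSE to infinity is one simple way to realise an "arbitrarily large" starting value.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Arbitrarily large starting value, so the first comparison always passes.
rmse_prev = math.inf
# Hypothetical targets and predictions: squared errors are 1, 0 and 4.
rmse_curr = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```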
3 Methods Employed
3.1 Method 1 - Benchmark
In this method, all possible hidden-layer sizes (with repetition allowed) are used as hyperparameters. Let denote the set of all possible hidden layers. Then, for example, if there are 2 hidden layers and each layer can have between 1 and 3 neurons, this set consists of the nine combinations (1, 1), (1, 2), (1, 3), (2, 1), ..., (3, 3).
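The benchmark grid can be enumerated as a Cartesian product, as in the sketch below. The function name all_hidden_layer_combinations and its parameters are hypothetical; only the standard library is used.

```python
from itertools import product


def all_hidden_layer_combinations(n_layers, max_neurons, min_neurons=1):
    """List every tuple of layer sizes, with repetition allowed."""
    sizes = range(min_neurons, max_neurons + 1)
    return list(product(sizes, repeat=n_layers))


# Two hidden layers, each with between 1 and 3 neurons: 3 ** 2 = 9 tuples.
grid = all_hidden_layer_combinations(n_layers=2, max_neurons=3)
```

The size of this grid is the number of allowed sizes raised to the power of the number of hidden layers, which is why the exhaustive search becomes expensive quickly as layers are added. A list of this form could, for instance, be supplied as the hidden_layer_sizes entry of the parameter grid passed to GridSearchCV.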
3.2 Method 2 - Heuristic
In this method, a heuristic is used to iteratively explore the hidden-layer combination, subject to the condition that the absolute change in RMSE is greater than or equal to and that . In our experiments, we obtain the input by performing a grid search on the hidden-layer combinations of the form: and the ‘best’ hidden-layer combination will be assigned as . The heuristic can be formulated as follows:
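The sketch below is only an illustration of an iterative search of this general kind, not a reproduction of the exact formulation. Several parts are assumptions made for the sake of a runnable example: the candidate generator neighbours (which varies each layer by at most one neuron), the stopping test (improvement in RMSE of at least alpha times the previous RMSE), the bound max_neurons, and the toy scoring function toy_rmse, which stands in for fitting a model and measuring its cross-validated RMSE.

```python
from itertools import product


def neighbours(combo, max_neurons):
    """Hypothetical generator: vary each layer by -1, 0 or +1 neuron."""
    options = [
        [s for s in (size - 1, size, size + 1) if 1 <= s <= max_neurons]
        for size in combo
    ]
    return set(product(*options))


def heuristic_search(start, evaluate, alpha, max_neurons):
    """Greedy local search over hidden-layer combinations (assumed scheme)."""
    fitted = {start: evaluate(start)}  # all combinations fitted so far
    best, best_rmse = start, fitted[start]
    while True:
        # Skip every combination that has already been fitted.
        candidates = sorted(
            c for c in neighbours(best, max_neurons) if c not in fitted
        )
        if not candidates:
            break
        for combo in candidates:
            fitted[combo] = evaluate(combo)
        challenger = min(candidates, key=fitted.get)
        # Stop once the improvement falls below the required fraction.
        if best_rmse - fitted[challenger] < alpha * best_rmse:
            break
        best, best_rmse = challenger, fitted[challenger]
    return best, best_rmse, len(fitted)


def toy_rmse(combo):
    """Made-up score with its minimum at 4 neurons per layer."""
    return 1.0 + sum((size - 4) ** 2 for size in combo)


best, best_rmse, n_fitted = heuristic_search(
    start=(1, 1), evaluate=toy_rmse, alpha=0.05, max_neurons=10
)
```

On this toy score the search reaches (4, 4) after evaluating 19 combinations, compared with the 100 combinations an exhaustive grid over two layers of up to 10 neurons would require. Real cross-validation scores are noisier than this smooth toy function, so a local search of this kind may stop at a combination that is not the global optimum.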
4 Experiment Results
We illustrate the results of Method 1 (Benchmark) and Method 2 (Heuristic) for each side-by-side, then show the overall results in a table.
For Method 1:
For Method 2:
Summary of Results
|Median Time Elapsed (s)||9.11||116.20||597.59|
|Median Time Elapsed (s)||9.09||22.59||147.90|
|Median Time Elapsed (s)||9.03||23.65||42.57|
|Median Time Elapsed (s)||9.09||22.35||42.79|
5 Conclusion and Future Work
The main takeaway from the results is that using the heuristic in Method 2 significantly reduces the median time taken to run a test case, while giving a spread of score and RMSE similar to that of the benchmark case. To the best of our knowledge, such a heuristic has not been properly documented and experimented with, though it is quite possible that others have formulated and implemented it, given its simple and seemingly naive nature.
The heuristic can be generalized and applied to other hyperparameters in a similar fashion, and other models may be used as well. We use the MLPRegressor model in Scikit-Learn as we find that it illustrates the underlying idea of the algorithm most clearly. Due to time constraints, we were not able to run experiments for other models and alphas, but we strongly encourage others to explore other models and variants of the heuristic.
- [1] Marc Claesen and Bart De Moor. Hyperparameter search in machine learning. CoRR, abs/1502.02127, 2015.
- [2] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2962–2970. Curran Associates, Inc., 2015.
- [3] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1):281–305, February 2012.
- [4] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, 2nd edition, 2009.
- [5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [6] Johannes Köster and Sven Rahmann. Snakemake: a scalable bioinformatics workflow engine. Bioinformatics, 28(19):2520–2522, August 2012.