A Heuristic For Efficient Reduction In Hidden Layer Combinations For Feedforward Neural Networks

W. H. Khoong
Department of Statistics and Applied Probability
National University of Singapore
khoongweihao@u.nus.edu
Wei Hao is currently a graduate student at the Department of Statistics and Applied Probability, National University of Singapore.
Abstract

In this paper, we describe the hyperparameter search problem in the field of machine learning and present a heuristic approach for tackling it. In most learning algorithms, a set of hyperparameters must be determined before training commences. The choice of hyperparameters can affect the final model's performance significantly, yet determining a good choice is in most cases complex and consumes a large amount of computing resources. We show the differences between an exhaustive search of hyperparameters and a heuristic search, and show that the heuristic yields a significant reduction in the time taken to obtain the resulting model, with only marginal differences in evaluation metrics when compared to the benchmark case.

Keywords: Heuristic, Combinatorics, Neural Networks, Hyperparameter Optimization

1 Preliminaries

Much research has been done in the field of hyperparameter optimization [1, 2, 3], with approaches such as grid search, random search, Bayesian optimization and gradient-based optimization. Grid search and manual search are the most widely used strategies for hyperparameter optimization [3]. These approaches offer limited reproducibility and become impractical when there are a large number of hyperparameters. Thus, automating hyperparameter search is an increasingly active area of research, and automated approaches have already been shown to outperform manual search across several problems [3].

A multilayer perceptron (MLP) [4] is a class of feedforward artificial neural network. It consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. With the exception of the input layer, each node is a neuron that applies a nonlinear activation function. The MLP is trained with backpropagation, a supervised learning technique. In our experiments, we only have test cases consisting of one to three hidden layers, each consisting of up to 10 neurons. We chose this scale for two reasons: our objective is to illustrate the effects of the heuristic using a small toy example that does not take too long to run, and we found that for the dataset used, the best results from the grid search involved fewer than 10 neurons.

2 Experiment Setting and Datasets

2.1 Programs Employed

We made use of Scikit-Learn [5], a free machine learning library for the Python programming language. Python 3.6.4 was used to formulate and run the algorithms, plot the results and preprocess the data.

2.2 Resources Utilized

All experiments were conducted on the university's High Performance Computing (HPC) machines (see https://nusit.nus.edu.sg/services/hpc/about-hpc/ for more details about the HPC), with 12 CPU cores and 5GB of RAM dedicated in the job script. All jobs were submitted via SSH through the atlas8 host, an HP Xeon two-socket 12-core 64-bit Linux cluster running CentOS 6. We utilized Snakemake [6], a workflow management tool, to conduct our experiments.

2.3 Data

We use the boston house-prices dataset from Scikit-Learn's sklearn.datasets package (see https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html for the documentation). This package contains a few small standard datasets that do not require downloading any files from an external website.

2.4 Notations and Test Cases

We perform our experiments on feedforward neural networks with one, two and three hidden layers, using Scikit-Learn's MLPRegressor from the sklearn.neural_network package (see https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor for further documentation). This model optimizes the squared loss using LBFGS (an optimizer in the family of quasi-Newton methods) or stochastic gradient descent. The maximum number of neurons in any hidden layer is set at 10, as preliminary experiments show that the best cross-validation score is obtained when the number of neurons in any hidden layer is under 10.

Define α to be the minimum fraction of improvement required for each iteration of the algorithm. We also define c_i to be the hidden-layer combination at iteration i, with c_0 the starting hidden-layer combination used as input. Let L be the number of hidden layers, N_max the maximum number of neurons across all hidden layers in c_i, and c* the best hidden-layer combination obtained from fitting with GridSearchCV (see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html for the documentation) from Scikit-Learn's sklearn.model_selection package. For example, if c = (10, 5, 10), there are 10 neurons in the first and third hidden layers and 5 neurons in the second layer; then L = 3 and N_max = 10. We also define S as the set containing all previously fitted hidden-layer combinations. B is then the set containing the best combination at any iteration of the algorithm, i.e. B = {c*}. Scikit-Learn's sklearn.preprocessing.StandardScaler (see https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html for further documentation) was also used to standardize features by removing the mean and scaling to unit variance.
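As a concrete illustration of the fitting step, the sketch below runs GridSearchCV over a small set of hidden-layer combinations with MLPRegressor and StandardScaler. The synthetic data, the candidate widths {2, 4, 6} and the solver settings are our own assumptions, chosen so the example is fast and self-contained; the paper's experiments use the boston house-prices data instead.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the boston data (hypothetical; kept tiny so the
# grid search finishes quickly).
rng = np.random.RandomState(0)
X = rng.randn(80, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.randn(80)

# Standardize features to zero mean and unit variance, as in the paper.
X = StandardScaler().fit_transform(X)

# Candidate combinations c: every 1- and 2-hidden-layer tuple over {2, 4, 6}.
sizes = [c for n_layers in (1, 2) for c in product([2, 4, 6], repeat=n_layers)]

search = GridSearchCV(
    MLPRegressor(solver="lbfgs", max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": sizes},
    cv=3,
)
search.fit(X, y)
print(search.best_params_["hidden_layer_sizes"])  # the best combination c*
```

The best combination reported by search.best_params_ plays the role of c* in the notation above.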

We also denote the root mean square error (RMSE) from fitting the model at the end of the current iteration and from the previous iteration as RMSE_i and RMSE_{i-1} respectively. In our experiments, α is fixed before the search begins. We also set RMSE_0 to an arbitrarily large value, for the purpose of passing the first iteration of the loop. Next, we define a function that returns the set of all possible hidden-layer combinations generated from the current combination, and a function that removes from this candidate set all elements already contained in S.

3 Methods Employed

3.1 Method 1 - Benchmark

In this method, all possible hidden-layer sizes (with repetition allowed) are used as the hyperparameter grid. Let H denote the set of all possible hidden-layer combinations. For example, if there are 2 hidden layers and each layer can have between 1 and 3 neurons, then H = {(1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), (3,3)}.
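The set H in this example can be enumerated directly with itertools; the function name below is our own illustration, not taken from the paper's code.

```python
from itertools import product

def all_hidden_layers(n_layers, max_neurons):
    """Return every hidden-layer combination with exactly n_layers layers,
    each layer holding between 1 and max_neurons neurons (repetition allowed)."""
    return list(product(range(1, max_neurons + 1), repeat=n_layers))

H = all_hidden_layers(2, 3)
print(H)  # [(1, 1), (1, 2), (1, 3), (2, 1), ..., (3, 3)], 9 combinations
```

For L layers with up to N neurons each this grid has N^L entries, which is why the exhaustive benchmark grows expensive quickly.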

3.2 Method 2 - Heuristic

In this method, a heuristic is used to iteratively explore hidden-layer combinations, subject to the conditions that the absolute change in RMSE is greater than or equal to the improvement threshold α and that the current combination has not been fitted before (c_i ∉ S). In our experiments, we obtain the input c_0 by performing an initial grid search over a restricted family of hidden-layer combinations, and the 'best' hidden-layer combination found is assigned as c_0. The heuristic can be formulated as follows:

Input : c_0, α
Output : B
1 while |RMSE_{i-1} - RMSE_i| ≥ α · RMSE_{i-1} and c_i ∉ S do
2       S ← S ∪ {c_i}
3       fit c_i with GridSearchCV and record RMSE_i
4       if RMSE_i improves on the RMSE of the combination in B then
5             B ← {c_i}
6       end if
7       generate the candidate combinations from c_i and remove those already in S
8       if no candidates remain then
9             Break
10      end if
11      set c_{i+1} to the best remaining candidate; i ← i + 1
12 end while
Return B
Algorithm 1 Heuristic
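The loop above can be sketched in plain Python. The RMSE oracle fit_rmse below is a toy surrogate (an assumption made so the example is runnable; the paper refits GridSearchCV on the boston data at this step), and neighbours generates candidates by widening or narrowing one layer at a time, which is one plausible choice of candidate generator.

```python
def fit_rmse(c):
    # Toy surrogate: RMSE is minimised at (5, 5); real code would train an MLP.
    return 1.0 + sum((n - 5) ** 2 for n in c) / 10.0

def neighbours(c, max_neurons=10):
    # All combinations reachable by changing one layer's width by +/- 1 neuron.
    out = set()
    for i in range(len(c)):
        for d in (-1, 1):
            n = c[i] + d
            if 1 <= n <= max_neurons:
                out.add(c[:i] + (n,) + c[i + 1:])
    return out

def heuristic(c0, alpha=0.01, max_iter=100):
    S = {c0}                            # previously fitted combinations
    best, best_rmse = c0, fit_rmse(c0)  # B = {best}
    prev_rmse = float("inf")            # RMSE_0: arbitrarily large to pass iteration 1
    c, rmse = c0, best_rmse
    for _ in range(max_iter):
        if abs(prev_rmse - rmse) < alpha * rmse:
            break                       # improvement fell below the threshold alpha
        cand = neighbours(c) - S        # remove already-visited combinations
        if not cand:
            break
        prev_rmse = rmse
        c = min(cand, key=fit_rmse)     # greedily move to the best candidate
        S.add(c)
        rmse = fit_rmse(c)
        if rmse < best_rmse:
            best, best_rmse = c, rmse   # B <- {c}
    return best, best_rmse

print(heuristic((2, 2)))
```

Starting from (2, 2), the greedy walk descends the surrogate RMSE surface to (5, 5) and stops once the relative improvement drops below α, visiting far fewer combinations than the exhaustive grid.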

4 Experiment Results

We illustrate the results of Method 1 (Benchmark) and Method 2 (Heuristic) for each value of α side by side, then summarize the overall results in tables.

For Method 1:

Figure 1: Method 1 results, showing (a) Score, (b) Test RMSE, (c) Time Elapsed.

For Method 2 (one figure per value of α):

Figure 2: Method 2 results, showing (a) Score, (b) Test RMSE, (c) Time Elapsed.
Figure 3: Method 2 results, showing (a) Score, (b) Test RMSE, (c) Time Elapsed.
Figure 4: Method 2 results, showing (a) Score, (b) Test RMSE, (c) Time Elapsed.

Summary of Results

Hidden layers             1       2       3
Median Score              0.83    0.85    0.85
Median RMSE               3.51    3.81    3.63
Median Time Elapsed (s)   9.11    116.20  597.59
Table 1: Summary of Results for Method 1

Hidden layers             1       2       3
Median Score              0.83    0.83    0.84
Median RMSE               3.87    3.50    3.69
Median Time Elapsed (s)   9.09    22.59   147.90
Table 2: Summary of Results for Method 2 (first value of α)

Hidden layers             1       2       3
Median Score              0.819   0.83    0.83
Median RMSE               3.45    3.63    3.81
Median Time Elapsed (s)   9.03    23.65   42.57
Table 3: Summary of Results for Method 2 (second value of α)

Hidden layers             1       2       3
Median Score              0.83    0.84    0.84
Median RMSE               3.87    3.71    3.69
Median Time Elapsed (s)   9.09    22.35   42.79
Table 4: Summary of Results for Method 2 (third value of α)

5 Conclusion and Future Work

The main takeaway from the results obtained is the significant reduction in median time taken to run a test case when the heuristic is used in Method 2, with a spread of score and RMSE similar to the benchmark case. To the best of our knowledge, such a heuristic has not been properly documented or experimented with, though it is quite possible that it has been formulated and implemented by others, given its simple and seemingly naive nature.

The heuristic can be generalized and applied to other hyperparameters in similar fashion, and other models may be used as well. We use the MLPRegressor model in Scikit-Learn as we find it illustrates the underlying idea of the algorithm most clearly. Due to time constraints, we were not able to run other models and values of α, but we strongly encourage others to explore other models and variants of the heuristic.

References

  • [1] Marc Claesen and Bart De Moor. Hyperparameter search in machine learning. CoRR, abs/1502.02127, 2015.
  • [2] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2962–2970. Curran Associates, Inc., 2015.
  • [3] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13(1):281–305, February 2012.
  • [4] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, 2 edition, 2009.
  • [5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [6] Johannes Köster and Sven Rahmann. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics, 28(19):2520–2522, 08 2012.