A Metaheuristic-Driven Approach to Fine-Tune Deep Boltzmann Machines
Abstract
Deep learning techniques, such as Deep Boltzmann Machines (DBMs), have received considerable attention over the past years due to their outstanding results across a wide range of domains. One of the main shortcomings of these techniques involves the choice of their hyperparameters, since they have a significant impact on the final results. This work addresses the issue of fine-tuning the hyperparameters of Deep Boltzmann Machines using metaheuristic optimization techniques with different backgrounds, such as swarm intelligence, memory-based, and evolutionary-based approaches. Experiments conducted on three public datasets for binary image reconstruction show that metaheuristic techniques can obtain reasonable results.
keywords:
Deep Boltzmann Machine, Metaheuristic Optimization, Machine Learning
1 Introduction
Restricted Boltzmann Machines (RBMs) [Hinton:02, passosTESE:2018] are probabilistic models that employ a layer of hidden binary units, also known as latent units, to model the distribution of the input data (visible layer). Such models have been applied to problems involving images [larochelle2007empirical], text [salakhutdinov2009semantic], detection of malicious content [fiore2013network, SilvaIJCSIS:16], and the diagnosis of several diseases [pereiraCAIP:2017, khojastehCBM:2019, passosJVCIR:2019], just to cite a few. Moreover, RBMs are also used as building blocks of deep learning architectures, such as Deep Belief Networks (DBNs) [hinton2006fast] and Deep Boltzmann Machines (DBMs) [salakhutdinov2009deep, passosNPL:2018], whose main difference lies in the interaction among the layers of RBMs.
Deep Learning techniques have been extensively used for tasks related to signal processing and computer vision, such as feature selection [ruangkanokmas2016deep, SohnNIPS:15], face recognition [taigman2014deepface, DuongCVPR:15], image reconstruction [dong2014learning], multimodal learning [srivastava2012multimodal], and topic modeling [hinton2009replicated], among others. Despite the outstanding results obtained by these models, an intrinsic constraint of deep architectures is their complexity, which can become an insurmountable problem due to the high number of hyperparameters one must deal with. The present work focuses on this problem.
Some works have recently modeled the issue of hyperparameter fine-tuning as a metaheuristic optimization task. Such techniques are an interesting alternative for this task since they do not require computing derivatives of hundreds of parameters, as usually happens with standard optimization techniques, which is not recommended for high-dimensional spaces. Papa et al. [PapaGECCO:15], Rosa et al. [rosa2016learning], and Passos et al. [passosiRBM:2017] were among the first to introduce metaheuristic-driven optimization in the context of RBM, DBN, and Infinity Restricted Boltzmann Machine (iRBM) hyperparameter fine-tuning, obtaining more precise results than the ones achieved by some well-known optimization libraries in the literature.
Recently, Passos et al. [passosSACI:2018] proposed to employ metaheuristic approaches in the context of DBM hyperparameter optimization. However, that work deals only with the Harmony Search [Geem:09] and Particle Swarm Optimization [Kennedy:01] techniques. Moreover, the paper presents a shallow discussion of the experimental results. Therefore, in this work, we consider DBM hyperparameter fine-tuning in the context of music-inspired, swarm-based, and differential-evolution algorithms, employing seven different techniques: Improved Harmony Search (IHS) [mahdavi2007improved], Adaptive Inertia Weight Particle Swarm Optimization (AIWPSO) [yu2009adaptive], Cuckoo Search (CS) [yang2009cuckoo], Firefly Algorithm (FA) [Yang:2010ffa], Backtracking Search Optimization Algorithm (BSA) [civicioglu:13], Adaptive Differential Evolution (JADE) [zhang:09], and the Differential Evolution Based on Covariance Matrix Learning and Bimodal Distribution Parameter Setting Algorithm (CoBiDE) [wang:14]. Furthermore, all techniques are compared against a random search for experimental purposes. Additionally, this work provides a more detailed experimental section, including a statistical-similarity and time-consumption comparison. Finally, the application addressed in this paper concerns the task of binary image reconstruction, and for that purpose, we considered three public datasets.
In a nutshell, the main contribution of this paper is to introduce a detailed analysis of metaheuristic optimization in the context of DBM hyperparameter fine-tuning, as well as to foster research in this area. Additionally, we provide an extensive experimental evaluation with distinct learning algorithms over different numbers of layers. To the best of our knowledge, we have not observed any study with such a level of detail. The remainder of this paper is organized as follows. Section 2 presents the theoretical background related to RBMs, DBNs, and DBMs. Section 3 introduces the main foundations of the metaheuristic optimization techniques employed in this work. Sections 4 and 5 present the methodology and experiments, respectively, and Section 6 states conclusions and future work.
2 Theoretical Background
2.1 Restricted Boltzmann Machines
Restricted Boltzmann Machines are stochastic models composed of a visible and a hidden layer of neurons, whose learning procedure is based on the minimization of an energy function. A vanilla architecture of a Restricted Boltzmann Machine is depicted in Figure 1, which comprises a visible layer \mathbf{v} with m units and a hidden layer \mathbf{h} with n units. Furthermore, \mathbf{W}_{m \times n} stands for a real-valued matrix that models the weights between both layers, where w_{ij} stands for the weight of the connection between visible unit v_i and hidden unit h_j.
Assuming both \mathbf{v} and \mathbf{h} as binary-valued units, i.e., \mathbf{v} \in \{0,1\}^m and \mathbf{h} \in \{0,1\}^n, the energy function of such models is given by:
E(\mathbf{v},\mathbf{h}) = -\sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m}\sum_{j=1}^{n} v_i h_j w_{ij},   (1)
where \mathbf{a} and \mathbf{b} stand for the biases of visible and hidden units, respectively.
Since the RBM is a bipartite graph, the activations of both visible and hidden units are mutually independent, thus leading to the following conditional probabilities:
P(\mathbf{v}|\mathbf{h}) = \prod_{i=1}^{m} P(v_i|\mathbf{h}),   (2)
and
P(\mathbf{h}|\mathbf{v}) = \prod_{j=1}^{n} P(h_j|\mathbf{v}),   (3)
where
P(v_i = 1|\mathbf{h}) = \sigma\left(\sum_{j=1}^{n} w_{ij} h_j + a_i\right),   (4)
and
P(h_j = 1|\mathbf{v}) = \sigma\left(\sum_{i=1}^{m} w_{ij} v_i + b_j\right),   (5)
where \sigma(\cdot) represents the logistic sigmoid function, i.e., \sigma(x) = 1/(1 + e^{-x}).
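As an illustration, Equations 2-5 amount to sampling each layer given the other with independent Bernoulli draws. Below is a minimal NumPy sketch of this Gibbs-sampling step; all function names and the tiny dimensions are hypothetical, chosen only for the example.

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid: sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b, rng):
    # P(h_j = 1 | v) = sigma(sum_i w_ij v_i + b_j) -- Equation (5)
    p_h = sigmoid(v @ W + b)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_visible(h, W, a, rng):
    # P(v_i = 1 | h) = sigma(sum_j w_ij h_j + a_i) -- Equation (4)
    p_v = sigmoid(h @ W.T + a)
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)

rng = np.random.default_rng(0)
m, n = 6, 4                       # visible / hidden units (toy sizes)
W = rng.normal(0, 0.1, (m, n))    # weight matrix W_{m x n}
a, b = np.zeros(m), np.zeros(n)   # visible / hidden biases
v = rng.integers(0, 2, m).astype(float)
p_h, h = sample_hidden(v, W, b, rng)
```

Since the two layers are conditionally independent given each other, each unit can be sampled in parallel, which is exactly what the vectorized `@` products express.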
The set of RBM parameters \Theta = (\mathbf{W}, \mathbf{a}, \mathbf{b}) can be learned through a training algorithm that aims at maximizing the product of probabilities over all the available training data \mathcal{V}, as follows:
\Theta^* = \arg\max_{\Theta} \prod_{\mathbf{v} \in \mathcal{V}} P(\mathbf{v}).   (6)
The aforementioned equation can be solved using the following gradient-based update rules over the matrix of weights \mathbf{W} and the biases \mathbf{a} and \mathbf{b} at iteration t:
\mathbf{W}^{t+1} = \mathbf{W}^{t} + \eta\left(P(\mathbf{h}|\mathbf{v})\mathbf{v}^T - P(\tilde{\mathbf{h}}|\tilde{\mathbf{v}})\tilde{\mathbf{v}}^T\right) + \Phi,   (7)
\mathbf{a}^{t+1} = \mathbf{a}^{t} + \eta(\mathbf{v} - \tilde{\mathbf{v}}) + \alpha\Delta\mathbf{a}^{t-1},   (8)
and
\mathbf{b}^{t+1} = \mathbf{b}^{t} + \eta\left(P(\mathbf{h}|\mathbf{v}) - P(\tilde{\mathbf{h}}|\tilde{\mathbf{v}})\right) + \alpha\Delta\mathbf{b}^{t-1},   (9)
where \alpha denotes the momentum and \eta stands for the learning rate. To obtain the sampled estimates \tilde{\mathbf{v}} and \tilde{\mathbf{h}}, one can perform the Contrastive Divergence [Hinton:02] technique, which basically ends up performing Gibbs sampling using the training data as the visible units. In short, Equations 7, 8 and 9 employ the well-known Gradient Descent as the optimization algorithm. The additional term \Phi in Equation 7 is used to control the values of matrix \mathbf{W} during the convergence process, and it is formulated as follows:
\Phi = -\lambda\mathbf{W}^{t} + \alpha\Delta\mathbf{W}^{t-1},   (10)
where \lambda stands for the weight decay.
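The update rules in Equations 7-10 can be sketched as a single CD-1 step in NumPy. This is an illustrative sketch only (one Gibbs step, one sample, hypothetical function names and default values for \eta, \lambda, and \alpha), not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v, W, a, b, eta=0.1, lam=1e-4, alpha=0.5, prev=None, rng=None):
    """One Contrastive Divergence (CD-1) update of (W, a, b), Equations 7-10."""
    rng = rng or np.random.default_rng(0)
    p_h = sigmoid(v @ W + b)                        # P(h|v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + a)                      # reconstruction probabilities
    v_t = (rng.random(p_v.shape) < p_v).astype(float)   # sampled ~v
    p_ht = sigmoid(v_t @ W + b)                     # P(h|~v)
    dW_prev, da_prev, db_prev = prev if prev else (0.0, 0.0, 0.0)
    # Eq. 7 with Phi = -lam*W + alpha*dW_prev (Eq. 10); Eqs. 8 and 9 below.
    dW = eta * (np.outer(v, p_h) - np.outer(v_t, p_ht)) - lam * W + alpha * dW_prev
    da = eta * (v - v_t) + alpha * da_prev
    db = eta * (p_h - p_ht) + alpha * db_prev
    return W + dW, a + da, b + db, (dW, da, db)

rng = np.random.default_rng(1)
m, n = 5, 3
W = rng.normal(0, 0.1, (m, n))
a, b = np.zeros(m), np.zeros(n)
v = rng.integers(0, 2, m).astype(float)
W2, a2, b2, deltas = cd1_step(v, W, a, b, rng=rng)
```

Returning the deltas lets the caller feed them back as `prev` on the next call, which is how the momentum terms \alpha\Delta\mathbf{W}^{t-1}, \alpha\Delta\mathbf{a}^{t-1}, and \alpha\Delta\mathbf{b}^{t-1} accumulate across iterations.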
2.2 Deep Belief Networks
In a nutshell, DBNs are deep architectures composed of a set of stacked RBMs, which are trained in a greedy fashion using the learning algorithms presented in Section 2.1, i.e., CD and PCD. In other words, an RBM does not consider the other layers' unit states while training the model at a certain layer, except that the hidden units at layer i become the input units of layer i+1. Suppose we have a DBN composed of L layers, with \mathbf{W}^{(i)} being the weight matrix of the RBM at layer i.
Hinton [HintonNC:06] proposed a fine-tuning step as the final stage of DBN training, aiming to adjust the matrices \mathbf{W}^{(i)}, i = 1, \dots, L. The procedure is performed using the Backpropagation or Gradient Descent algorithms. The idea is to minimize some error measure considering the output of an additional layer placed at the top of the DBN after the training procedure. The aforementioned layer is often composed of logistic units, a softmax layer, or even some supervised technique.
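The greedy layer-wise procedure above can be sketched as follows: train one RBM, pass its hidden activations forward, and train the next RBM on them. This is a toy sketch (hypothetical names, mean-field CD-1 inside each RBM, no final fine-tuning step), not a production DBN trainer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, eta=0.1, rng=None):
    """Hypothetical single-RBM trainer (CD-1 with a mean-field reconstruction)."""
    rng = rng or np.random.default_rng(0)
    m = data.shape[1]
    W = rng.normal(0, 0.01, (m, n_hidden))
    a, b = np.zeros(m), np.zeros(n_hidden)
    for _ in range(epochs):
        for v in data:
            p_h = sigmoid(v @ W + b)
            h = (rng.random(n_hidden) < p_h).astype(float)
            v_t = sigmoid(h @ W.T + a)           # mean-field reconstruction
            p_ht = sigmoid(v_t @ W + b)
            W += eta * (np.outer(v, p_h) - np.outer(v_t, p_ht))
            a += eta * (v - v_t)
            b += eta * (p_h - p_ht)
    return W, a, b

def train_stack(data, layer_sizes):
    """Greedy layer-wise training: layer i's hidden activations feed layer i+1."""
    stack, x = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden)
        stack.append((W, a, b))
        x = sigmoid(x @ W + b)                   # input for the next layer
    return stack

data = (np.random.default_rng(2).random((20, 8)) > 0.5).astype(float)
stack = train_stack(data, [6, 3])
```

Note that each RBM sees only the layer below it, which is exactly the independence assumption described in the text.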
2.3 Deep Boltzmann Machines
Like DBNs, DBMs are deep architectures able to learn more complex and intrinsic representations of the input data by employing stacked RBMs. Figure 2 depicts the architecture of a standard DBM, whose formulation has mild differences from the DBN one.
The energy of a DBM with two hidden layers, where \mathbf{h}^{(1)} and \mathbf{h}^{(2)} stand for the hidden units in the first and second layers, respectively, and \mathbf{v} stands for the visible ones, can be computed as follows:
E(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)}) = -\sum_{i=1}^{m}\sum_{j=1}^{n_1} v_i w^{(1)}_{ij} h^{(1)}_j - \sum_{j=1}^{n_1}\sum_{l=1}^{n_2} h^{(1)}_j w^{(2)}_{jl} h^{(2)}_l,   (11)
where m stands for the number of visible units, and n_1 and n_2 stand for the number of hidden units in the first and second layers, respectively. Furthermore, the weight matrices \mathbf{W}^{(1)} and \mathbf{W}^{(2)} encode the weights of the connections between vectors \mathbf{v} and \mathbf{h}^{(1)}, and vectors \mathbf{h}^{(1)} and \mathbf{h}^{(2)}, respectively. The bias terms are dropped for simplification purposes.
Due to its complexity, calculating the derivatives of RBM-based models becomes a prohibitive task. To deal with such a constraint, one can employ the Contrastive Divergence algorithm and sample an estimated state of the visible and hidden units. Thus, the conditional probabilities over the visible and the two hidden layers are given as follows:
P(h^{(1)}_j = 1|\mathbf{v},\mathbf{h}^{(2)}) = \sigma\left(\sum_{i=1}^{m} w^{(1)}_{ij} v_i + \sum_{l=1}^{n_2} w^{(2)}_{jl} h^{(2)}_l\right),   (12)
P(h^{(2)}_l = 1|\mathbf{h}^{(1)}) = \sigma\left(\sum_{j=1}^{n_1} w^{(2)}_{jl} h^{(1)}_j\right),   (13)
and
P(v_i = 1|\mathbf{h}^{(1)}) = \sigma\left(\sum_{j=1}^{n_1} w^{(1)}_{ij} h^{(1)}_j\right).   (14)
Finally, the generative model can be written as follows:
P(\mathbf{v}) = \frac{1}{Z}\sum_{\mathbf{h}^{(1)},\mathbf{h}^{(2)}} e^{-E(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)})},   (15)
where Z stands for the partition function. Further, we shall proceed with the learning process of the second RBM, which then replaces the prior over \mathbf{h}^{(1)} by P(\mathbf{h}^{(1)}|\mathbf{h}^{(2)},\mathbf{W}^{(2)}). Roughly speaking, using such a procedure, the conditional probabilities given by Equations 12-14, and Contrastive Divergence, one can learn DBM parameters one layer at a time [Salakhutdinov:12]. Later, one can apply mean-field-based learning to obtain a more accurate model.
3 DBM FineTuning as an Optimization Problem
In general, Restricted Boltzmann Machines demand a proper selection of four main hyperparameters: the number of hidden units n, the learning rate \eta, the weight decay \lambda, and the momentum \alpha. Since Deep Boltzmann Machines stack RBMs on top of each other, if one has L layers, then the optimization encodes 4L variables. However, as the training procedure of DBMs is greedy-wise (we are not considering mean-field-based learning in this work), which means each layer is trained independently, only 4 variables are optimized per layer.
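The 4-variables-per-layer encoding can be sketched as a small mapping from a raw agent position to a valid hyperparameter set. The bounds below are purely illustrative placeholders, not the ranges actually used in the paper.

```python
# Each layer's candidate solution: (n, eta, lambda, alpha), clipped to bounds.
BOUNDS = {            # illustrative ranges only, not the paper's exact values
    "n_hidden": (5, 100),
    "eta": (0.01, 0.9),
    "lam": (1e-5, 1e-1),
    "alpha": (0.1, 0.9),
}

def clip_candidate(x):
    """Map a raw 4-dimensional agent position to valid hyperparameters."""
    keys = ["n_hidden", "eta", "lam", "alpha"]
    out = {}
    for val, key in zip(x, keys):
        lo, hi = BOUNDS[key]
        out[key] = min(max(val, lo), hi)    # clip to the search range
    out["n_hidden"] = int(round(out["n_hidden"]))   # integral hidden units
    return out
```

Clipping keeps every agent inside the feasible region, so any metaheuristic can move agents freely in R^4 and still produce trainable configurations.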
In short, the idea is to initialize all optimization techniques at random, and then the algorithm takes place, with each hyperparameter constrained to a predefined search range.
Figure 4 presents an overall idea of the pipeline used in this work to perform DBM hyperparameter fine-tuning. Roughly speaking, the optimization technique selects the set of hyperparameters that minimizes the MSE over the training set, considering a dataset of binary images as input to the model. After learning the hyperparameters, one can proceed to the reconstruction step concerning the testing images, whose MSE is the one used to finally evaluate the metaheuristic techniques considered in this work.
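The pipeline reduces to minimizing a fitness function that maps a candidate hyperparameter vector to a training-set reconstruction MSE. The sketch below illustrates it with the RS baseline; the fitness is a toy stand-in (a quadratic), since actually training a DBM per evaluation is out of scope here, and all names and bounds are hypothetical.

```python
import numpy as np

def fitness(candidate):
    """Hypothetical stand-in for 'train a DBM layer with these hyperparameters
    and return the training-set reconstruction MSE' (toy quadratic here)."""
    n, eta, lam, alpha = candidate
    return (eta - 0.3) ** 2 + (alpha - 0.5) ** 2 + lam + 1.0 / n

def random_search(fitness_fn, bounds, n_trials=200, seed=42):
    """RS baseline: sample candidates uniformly, keep the best (lowest MSE)."""
    rng = np.random.default_rng(seed)
    best, best_fit = None, np.inf
    for _ in range(n_trials):
        cand = [rng.uniform(lo, hi) for lo, hi in bounds]
        cand[0] = round(cand[0])          # number of hidden units is integral
        f = fitness_fn(cand)
        if f < best_fit:
            best, best_fit = cand, f
    return best, best_fit

bounds = [(5, 100), (0.0, 1.0), (0.0, 0.2), (0.0, 1.0)]  # illustrative ranges
best, best_fit = random_search(fitness, bounds)
```

Every metaheuristic in Section 3.1 plugs into the same interface: only the way new candidates are proposed changes, while the fitness evaluation (train, reconstruct, measure MSE) stays identical.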
3.1 Optimization Techniques
Below, we present a brief description of the metaheuristic techniques employed in this paper:

IHS: a variant of HS, which models the problem of function minimization based on the way musicians create their songs in search of optimal harmonies. This approach uses dynamic values for both the Harmony Memory Considering Rate (HMCR), which is responsible for creating new solutions based on the previous experience of the music player, and the Pitch Adjusting Rate (PAR), which is in charge of applying some disruption to the solution created with the HMCR in order to avoid the pitfalls of local optima. Both parameters are updated at each iteration with new values within the ranges [HMCR_min, HMCR_max] and [PAR_min, PAR_max], respectively. Concerning the PAR computation, a bandwidth variable is used, whose value must lie within a predefined range.

AIWPSO: a variant of PSO, which considers any possible solution as a particle (agent) in a swarm. Each agent has a position and a velocity vector in the search space, as well as two acceleration constants c_1 and c_2. A fitness value is associated with each position, and after some iterations the global best position is selected as the best solution to the problem. AIWPSO was proposed to balance the global exploration and local exploitation abilities of PSO: at each iteration, every particle chooses an appropriate inertia weight along the search space by dynamically adjusting it.

CS: Cuckoo Search [yang2009cuckoo, yang2010engineering] employs a combination of the Lévy flight, which may be defined as a bird-flight-inspired random walk whose step lengths follow a Lévy distribution over a Markov chain, together with the parasitic behavior of some cuckoo species. The model follows three basic ideas: (i) each cuckoo lays one egg at a time in a randomly chosen nest; (ii) the host bird discovers the cuckoo's egg with a probability p and either discards the egg or abandons the nest and builds a new one (a new solution is created); and (iii) the nests with the best eggs carry over to the next generations.

FA: it is derived from the fireflies' flashing behavior, used to attract mating partners and potential prey. Basically, the attractiveness of a firefly is computed from its position relative to the other fireflies in the swarm, while its brightness is determined by the value of the objective function at that position. Furthermore, the attractiveness depends on each firefly's light absorption coefficient \gamma. In order to avoid local optima, the system is exposed to a random perturbation, and the best firefly performs a random walk across the search space.

BSA: it is a simple, effective, and fast evolutionary algorithm developed to deal with problems characterized by slow computation and excessive sensitivity to control parameters. In a nutshell, it employs crossover and mutation operations together with a random selection of stored memories to generate a new population of individuals based on past experiences. BSA requires a proper selection of two parameters: the mixing rate (mixrate), which controls the number of elements of each individual that will mutate in the population, as well as the F parameter, which controls the amplitude of the search-direction matrix.

JADE: a differential-evolution-based algorithm that implements the "DE/current-to-pbest" mutation strategy, which employs only the best agents in the mutation process. Additionally, JADE uses an optional archive of historical information, as well as adaptive updating of the control parameters. JADE requires the selection of the parameter c, which stands for the rate of parameter adaptation, and p (greediness), which determines the greediness of the mutation strategy.

CoBiDE: it is also a differential-evolution-based technique, which employs a covariance matrix for a better representation of the system's coordinates during the crossover process. Additionally, mutation and crossover are controlled using a bimodal distribution to achieve a good trade-off between exploration and exploitation. The probability of executing the differential evolution according to the covariance matrix is defined by the parameter pb, while the proportion of individuals chosen from the current population to compute the covariance matrix is denoted by ps.
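To make the shared structure of these techniques concrete, the sketch below implements a minimal, plain (non-improved) Harmony Search on a toy objective: keep a memory of candidate solutions, improvise one new candidate per iteration from memory (HMCR) with occasional pitch adjustment (PAR), and replace the worst memory entry when the newcomer improves on it. Parameter values and the fixed bandwidth are illustrative, not IHS's dynamic ones.

```python
import numpy as np

def harmony_search(fitness, bounds, hm_size=10, hmcr=0.9, par=0.3,
                   iters=300, seed=0):
    """Minimal Harmony Search sketch: one new harmony per iteration,
    replacing the worst memory entry when it improves on it."""
    rng = np.random.default_rng(seed)
    hm = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(hm_size)]
    fits = [fitness(h) for h in hm]
    for _ in range(iters):
        new = []
        for d, (lo, hi) in enumerate(bounds):
            if rng.random() < hmcr:                  # draw from memory...
                val = hm[rng.integers(hm_size)][d]
                if rng.random() < par:               # ...with pitch adjustment
                    val += rng.uniform(-1, 1) * 0.05 * (hi - lo)
            else:                                    # ...or purely at random
                val = rng.uniform(lo, hi)
            new.append(min(max(val, lo), hi))
        f = fitness(new)
        worst = int(np.argmax(fits))
        if f < fits[worst]:
            hm[worst], fits[worst] = new, f
    best = int(np.argmin(fits))
    return hm[best], fits[best]

sphere = lambda x: sum(v * v for v in x)             # toy objective
best, best_fit = harmony_search(sphere, [(-5, 5)] * 3)
```

Swarm- and evolution-based methods differ mainly in the improvisation rule (velocity updates, Lévy flights, mutation/crossover), while the memory-plus-replacement skeleton above stays recognizable across them.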
4 Methodology
This section provides a brief introduction to the concept of data reconstruction, as well as the description of the datasets and the experimental setup employed in this work.
4.1 Data reconstruction
Although the literature is filled with methods that employ image reconstruction for specific tasks, such as super-resolution image reconstruction [nguyen2001computationally, dong2014learning], tomographies [liu1999optimization], and denoising and deblurring [puetter2005digital], including RBM [pires2017robust] and DBM [pires2017deep] approaches, data reconstruction in the context of this paper stands for an intrinsic process of the learning step of RBM-based models, whose error is monitored for optimization purposes [hinton2012practical], rather than a practical application itself. In other words, such models try to represent the input data in the hidden layers given their probability distribution. Such a representation is supposedly capable of reconstructing a similar input through stochastic Gibbs sampling. Afterward, the representation mentioned above can be employed in a vast range of applications, such as classification [larochelle2008classification], dimensionality reduction [hinton2006reducing], and modeling human motion [taylor2007modeling], among others.
4.2 Datasets
We validate the DBM fine-tuning on the task of binary image reconstruction over three public datasets, described below:

MNIST dataset: it is composed of images of handwritten digits. The original version contains a training set with images of the digits '0' to '9', as well as a test set. Due to the high computational burden of RBM model selection, we decided to employ the original test set together with a reduced version of the training set.
Semeion Handwritten Digit Data Set: composed of binary images of manuscript digits, this dataset contains 1,593 images written by several persons. The whole dataset was employed in the experimental section, with one part used for training purposes and the remaining for testing.
CalTech 101 Silhouettes Data Set: it is based on the former Caltech 101 dataset, and it comprises silhouettes of images from 101 classes. We have used only the training and test sets, since our optimization model aims at minimizing the MSE over the training set.
Figure 5 displays some training examples from the aforementioned datasets, which were partitioned into 2% for the training set and 98% for the test set.
4.3 Parameter Settingup
One of the main shortcomings of using RBM-based models, such as DBMs and DBNs, concerns the hyperparameter fine-tuning task, which aims at selecting a suitable set of hyperparameters in such a way that the reconstruction error is minimized. In this work, we considered IHS, FA, CS, AIWPSO, BSA, JADE, and CoBiDE against RS for DBM hyperparameter fine-tuning. We also evaluated the robustness of the proposed approach using three distinct DBN and DBM models: one layer (1L), two layers (2L), and three layers (3L). Finally, Table 1 presents the parameters used for each optimization technique.
Technique  Parameters 

IHS  , 
,  
AIWPSO  , 
CS  , , 
, ,  
FA  , , 
BSA  , 
JADE  , 
CoBiDE  , 
We conducted a cross-validation approach over several runs, with a fixed number of iterations for the learning procedure of each RBM and fixed-size mini-batches. In addition, we also considered two learning algorithms: Contrastive Divergence (CD) [Hinton:02] and Persistent Contrastive Divergence (PCD) [TielemanICML:08]. Finally, the Wilcoxon signed-rank test [Wilcoxon:45] was used for statistical validation purposes.
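For readers who want to reproduce the statistical validation, the Wilcoxon signed-rank test can be sketched in plain Python as below. This simplified version uses the normal approximation for the two-sided p-value (no continuity or zero-tie corrections), so it is a didactic sketch rather than a replacement for a vetted library implementation.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Wilcoxon signed-rank test: drop zero differences, average-rank ties
    in |d|, return (W, two-sided p) via the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j < n and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j + 1) / 2.0          # ranks are 1-based
        for k in range(i, j):
            ranks[order[k]] = avg_rank
        i = j
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = min(w_plus, w_minus)
    mu = n * (n + 1) / 4.0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided p-value
    return w, min(p, 1.0)

# Toy paired MSEs: one technique consistently worse by a small margin.
x = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10, 0.12, 0.11, 0.09, 0.14]
y = [v + 0.01 for v in x]
W_stat, p_value = wilcoxon_signed_rank(x, y)
```

Because the test is paired, x and y must be the per-run errors of two techniques on the same runs, which matches the cross-validation protocol above.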
The source code used to reproduce the experiments of this paper is available on GitHub.
5 Experiments
In this section, we present the experimental results concerning DBM and DBN hyperparameter optimization for the task of binary image reconstruction. Both techniques were compared using two different learning algorithms, i.e., Contrastive Divergence and Persistent Contrastive Divergence. Also, seven optimization methods (plus a random search) were employed. Additionally, three distinct models were used for comparison purposes: one layer (1L), two layers (2L), and three layers (3L).
5.1 Experimental Results
Table 2 presents the average values of the mean squared error over the MNIST dataset, with the values in bold standing for the best results considering the Wilcoxon signed-rank test. One can observe that the metaheuristic techniques obtained the best results, with special attention to IHS, JADE, and CoBiDE for both DBN and DBM models. Also, one cannot observe a considerable difference between shallow and deep models, since we limited the number of iterations for convergence and did not employ fine-tuning of the DBN and DBM connection weights as a final step. The main reasons for limiting the number of iterations are related to time constraints, as well as the convergence process itself. As a matter of fact, if one has unlimited resources in terms of computational load, a standard random search may obtain results as good as the ones obtained by metaheuristic techniques, since it will have enough time for convergence. However, we would like to emphasize that DBM hyperparameter fine-tuning is quite useful when time is a limited and serious constraint.
1L  2L  3L  
DBN  DBM  DBN  DBM  DBN  DBM  
Technique  Statistics  CD  PCD  CD  PCD  CD  PCD  CD  PCD  CD  PCD  CD  PCD 
IHS  Mean  0.08758  0.08762  0.08744  0.08766  0.08762  0.08762  0.08761  0.08761  0.08762  0.08762  0.08760  0.08761 
Std.  8.102e05  7.581e05  3.702e04  4.686e04  5.203e05  6.018e05  5.063e05  3.834e05  6.971e05  5.941e05  5.845e05  5.885e05  
AIWPSO  Mean  0.08764  0.08761  0.08765  0.08771  0.08763  0.08762  0.08762  0.08761  0.08762  0.08762  0.08759  0.08760 
Std.  5.694e05  4.728e05  4.793e04  3.744e04  5.965e05  4.879e05  5.207e05  5.250e05  4.299e05  4.643e05  5.280e05  5.505e05  
CS  Mean  0.08763  0.08764  0.08767  0.08770  0.08764  0.08765  0.08760  0.08760  0.08764  0.08765  0.08762  0.08761 
Std.  5.393e05  6.722e05  7.988e05  2.713e04  6.906e05  5.766e05  5.771e05  6.122e05  6.611e05  8.424e05  5.541e05  5.356e05  
FA  Mean  0.08763  0.08764  0.08766  0.08762  0.08763  0.08763  0.08761  0.08763  0.08763  0.08763  0.08761  0.08761 
Std.  6.749e05  6.271e05  1.113e04  2.673e04  5.923e05  6.488e05  4.780e05  8.191e05  6.342e05  5.658e05  3.951e05  6.131e05  
BSA  Mean  0.08762  0.08762  0.08774  0.08766  0.08762  0.08763  0.08761  0.08762  0.08763  0.08762  0.08762  0.08762 
Std.  5.231e05  6.697e05  4.135e04  3.242e04  6.697e05  6.555e05  4.072e05  5.870e05  6.176e05  6.785e05  5.416e05  5.175e05  
JADE  Mean  0.08760  0.08763  0.08754  0.08749  0.08763  0.08764  0.08761  0.08761  0.08763  0.08763  0.08761  0.08761 
Std.  6.780e05  5.644e05  4.131e04  3.256e04  6.264e05  5.967e05  6.284e05  5.491e05  6.546e05  6.696e05  5.662e05  5.356e05  
CoBiDE  Mean  0.08763  0.08762  0.08757  0.08765  0.08763  0.08764  0.08762  0.08760  0.08763  0.08762  0.08761  0.08760 
Std.  6.249e05  7.203e05  4.104e04  3.460e04  6.053e05  5.312e05  6.786e05  5.359e05  6.022e05  6.219e05  5.222e05  4.868e05  
RS  Mean  0.08762  0.08763  0.08780  0.08782  0.08762  0.08763  0.08761  0.08760  0.08763  0.08763  0.08761  0.08761 
Std.  5.699e05  5.495e05  3.965e04  5.091e04  4.355e05  4.765e05  4.657e05  5.008e05  6.911e05  5.740e05  5.125e05  5.979e05 
Table 3 presents the results concerning the CalTech 101 Silhouettes dataset. In this case, the best results were achieved by the DBN with one layer only. Caltech poses a greater challenge, since it has more classes than MNIST, which leads us to believe that more iterations for convergence would be required for DBM learning, since it is a more complex model than the DBN. Also, the best results were obtained by means of Improved Harmony Search, BSA, JADE, and CoBiDE.
1L  2L  3L  
DBN  DBM  DBN  DBM  DBN  DBM  
Technique  Statistics  CD  PCD  CD  PCD  CD  PCD  CD  PCD  CD  PCD  CD  PCD 
IHS  Mean  0.15554  0.15731  0.15983  0.15980  0.16057  0.16054  0.16055  0.16055  0.16059  0.16058  0.16057  0.16056 
Std.  2.107e03  1.584e03  1.064e03  7.218e04  1.980e04  2.922e04  1.958e04  1.852e04  2.162e04  2.078e04  2.041e04  2.150e04  
AIWPSO  Mean  0.15641  0.15825  0.16006  0.16014  0.16056  0.16060  0.16056  0.16061  0.16058  0.16057  0.16057  0.16057 
Std.  2.414e03  2.310e03  8.199e04  7.570e04  2.010e04  2.224e04  1.914e04  2.291e04  2.192e04  2.129e04  1.890e04  2.124e04  
CS  Mean  0.15923  0.15992  0.16023  0.16024  0.16057  0.16062  0.16057  0.16056  0.16059  0.16061  0.16055  0.16057 
Std.  1.707e03  1.030e03  4.329e04  3.538e04  1.855e04  2.275e04  2.071e04  2.107e04  2.123e04  2.034e04  1.941e04  2.123e04  
FA  Mean  0.16002  0.15956  0.16051  0.16034  0.16060  0.16058  0.16069  0.16056  0.16060  0.16058  0.16055  0.16055 
Std.  1.555e03  1.176e03  5.541e04  6.887e04  2.120e04  2.130e04  6.536e04  2.147e04  2.327e04  2.098e04  2.174e04  2.029e04  
BSA  Mean  0.15599  0.15775  0.15992  0.15983  0.16056  0.16056  0.16052  0.16054  0.16057  0.16058  0.16057  0.16055 
Std.  1.542e03  1.511e03  8.302e03  6.978e04  2.016e04  2.174e04  1.770e04  1.985e04  2.063e04  1.981e04  1.878e04  2.004e04  
JADE  Mean  0.15608  0.15790  0.15945  0.15988  0.16058  0.16057  0.16055  0.16058  0.16059  0.16057  0.16058  0.16054 
Std.  1.835e03  1.351e03  6.426e04  6.015e04  2.037e04  2.001e04  1.876e04  1.784e04  1.933e04  2.131e04  2.126e04  2.000e04  
CoBiDE  Mean  0.15638  0.15800  0.15982  0.15982  0.16059  0.16057  0.16059  0.16056  0.16060  0.16059  0.16056  0.16054 
Std.  1.912e03  1.209e03  6.181e04  8.848e04  2.298e04  2.204e04  3.093e04  1.652e04  2.090e04  2.023e04  1.739e04  2.060e04  
RS  Mean  0.15676  0.15845  0.15967  0.15976  0.16060  0.16062  0.16059  0.16057  0.16057  0.16056  0.16056  0.16056 
Std.  1.623e03  1.220e03  7.164e04  7.133e04  1.998e04  1.915e04  1.974e04  1.993e04  1.998e04  2.173e04  1.853e04  1.868e04 
Table 4 presents the results obtained over the Semeion Handwritten Digit dataset, with IHS and JADE being the most accurate techniques. The best results concerning the MNIST and Semeion Handwritten Digit datasets, as can be clearly seen in Tables 2 and 4, were obtained using the DBM. The DBN, however, had the best results considering the CalTech 101 Silhouettes dataset, as presented in Table 3. Some interesting conclusions can be extracted from a closer look at these results: (i) metaheuristic-based optimization allows more accurate results than a random search, as already argued by the works of Papa et al. [PapaGECCO:15, PapaJoCS:15, PapaASC:15]; (ii) DBMs seem to produce more accurate results than DBNs; (iii) the number of layers does not seem to influence the results when one fine-tunes the hyperparameters; (iv) IHS achieved the best results in all datasets (concerning both DBN and DBM), but with results statistically similar to other metaheuristic techniques as well; and (v) we could not observe a significant difference between CD and PCD, since we employed a limited number of learning iterations. Actually, PCD is expected to work better, but at the price of a longer convergence process.
1L  2L  3L  
DBN  DBM  DBN  DBM  DBN  DBM  
Technique  Statistics  CD  PCD  CD  PCD  CD  PCD  CD  PCD  CD  PCD  CD  PCD 
IHS  Mean  0.19359  0.20009  0.19025  0.19078  0.20961  0.20961  0.20956  0.20956  0.20961  0.20963  0.20958  0.20958 
Std.  1.367e03  1.965e03  8.901e04  1.367e03  3.669e04  3.637e04  3.571e04  3.438e04  3.731e04  3.772e04  3.609e04  3.417e04  
AIWPSO  Mean  0.20044  0.20274  0.19679  0.19426  0.20959  0.20961  0.20958  0.20956  0.20964  0.20961  0.20959  0.20959 
Std.  6.856e03  3.994  7.995e03  7.044e03  3.521e04  3.853e04  3.644e04  3.619e04  3.773e04  3.584e04  3.784e04  3.664e04  
CS  Mean  0.20528  0.20647  0.20728  0.20651  0.20965  0.20960  0.20957  0.20959  0.20964  0.20963  0.20960  0.20960 
Std.  4.948e03  3.556e03  2.894e03  2.352e03  4.034e04  3.554e04  3.616e04  3.722e04  3.696e04  3.572e04  3.612e04  3.430e04  
FA  Mean  0.20638  0.20894  0.20649  0.20319  0.20966  0.20965  0.20960  0.20960  0.20964  0.20965  0.20960  0.20928 
Std.  4.922e03  2.085e03  5.630e03  7.548e03  4.098e04  3.855e04  3.609e04  3.605e04  3.499e04  4.117e04  3.387e04  1.555e03  
BSA  Mean  0.19571  0.20002  0.19221  0.19325  0.20961  0.20959  0.20960  0.20958  0.20962  0.20962  0.20960  0.20956 
Std.  3.648e03  2.544e03  2.879e03  2.419e03  3.591e04  3.683e04  3.783e04  3.480e04  3.722e04  3.847e04  3.716e04  3.600e04  
JADE  Mean  0.19893  0.20165  0.19152  0.19170  0.20962  0.20960  0.20957  0.20958  0.20964  0.20959  0.20956  0.20961 
Std.  7.890e03  5.316e03  4.213e03  4.410e03  3.554e04  3.602e04  3.501e04  3.755e04  3.579e04  3.708e04  3.524e04  3.899e04  
CoBiDE  Mean  0.19328  0.19896  0.19190  0.19138  0.20962  0.20961  0.20959  0.20958  0.20960  0.20961  0.20958  0.20959 
Std.  1.332e03  1.478e03  1.821e03  1.556e03  3.505e04  3.631e04  3.678e04  3.550e04  3.579e04  4.119e04  3.664e04  3.593e04  
RS  Mean  0.19710  0.20361  0.19458  0.19463  0.20962  0.20959  0.20960  0.20957  0.20960  0.20960  0.20960  0.20959 
Std.  3.133e03  1.837e03  3.891e03  3.909e03  3.621e04  3.494e04  3.864e04  3.680e04  3.677e04  3.563e04  3.538e04  3.410e04 
Figures 6 and 7 display the convergence process regarding the mean squared error (MSE) and the logarithm of the pseudo-likelihood (PL) values obtained during the learning step for the DBM and DBN, respectively, trained with CD over the MNIST dataset. We used the mean values of the first layer for all optimization algorithms. One can observe that the DBM obtained a better approximation of the model during all iterations, and both ended up with similar log PL values (iteration #10). However, it is important to highlight that the main contribution of this paper is not to show that DBMs may learn better models than DBNs, but to stress that metaheuristic techniques are suitable to fine-tune DBM hyperparameters as well.
Although one can observe an oscillating behavior of the optimization techniques, all of them obtained better models at the last iteration (i.e., a higher log PL) than RS, except for the nature-inspired algorithms, which achieved similar results in most of the experiments, probably due to their demand for more iterations until convergence. These results imply that using metaheuristic techniques to fine-tune DBMs is reasonable. DBMs optimized by metaheuristic-based techniques also obtained the best results considering all datasets used in this work.
5.2 Statistical Analysis
In this section, we detail the Wilcoxon signed-rank test results obtained through a pairwise comparison among the techniques, used to assess the statistical similarity among the best results obtained by each technique, i.e., considering both the number of layers and the learning algorithm. Tables 5, 6 and 7 present the statistical evaluation concerning the MNIST, CalTech 101 Silhouettes, and Semeion datasets, respectively.
IHS  AIWPSO  CS  FA  BSA  JADE  CoBiDE  RS  

IHS  
AIWPSO  
CS  
FA  
BSA  
JADE  
CoBiDE  
RS  

IHS  AIWPSO  CS  FA  BSA  JADE  CoBiDE  RS  

IHS  
AIWPSO  
CS  
FA  
BSA  
JADE  
CoBiDE  
RS 
It is interesting to point out that memory-based (IHS) and evolutionary-based (BSA, JADE, and CoBiDE) techniques obtained the best results for all datasets, outperforming swarm-based approaches (AIWPSO, FA, and CS). Regarding the evolutionary techniques, mutation and crossover operators may move solutions far apart from each other (i.e., they favor exploration), which can be interesting in the context of DBM/DBN hyperparameter fine-tuning. Usually, the hyperparameters we are optimizing (i.e., the learning rate, number of hidden units, weight decay, and momentum) do not lead to different reconstruction errors within small intervals, i.e., the fitness landscape features some flat zones that can trap optimization techniques.
IHS  AIWPSO  CS  FA  BSA  JADE  CoBiDE  RS  

IHS  
AIWPSO  
CS  
FA  
BSA  
JADE  
CoBiDE  
RS  

Regarding the relatively good results obtained by the Random Search, one may question the contribution of employing metaheuristic techniques for DBM hyperparameter optimization. Despite the statistical similarity among the optimization techniques, the random search did not obtain the best results for any dataset.
5.3 Time Analysis
Tables 8, 9, and 10 present an analysis of the computational load required by the optimization tasks regarding the MNIST, CalTech 101 Silhouettes, and Semeion datasets, respectively. The results in bold stand for the fastest approaches for each model.
Table 8: Computational load concerning the MNIST dataset.

          1L: DBN      1L: DBM      2L: DBN      2L: DBM      3L: DBN      3L: DBM
          CD    PCD    CD    PCD    CD    PCD    CD    PCD    CD    PCD    CD    PCD
IHS       0.35  0.25   0.45  0.46   0.60  0.52   0.57  0.55   0.54  0.56   0.82  0.53
AIWPSO    2.21  2.28   2.64  2.41   3.39  2.68   3.89  3.62   4.31  4.73   5.67  4.28
CS        0.30  0.45   0.53  0.56   0.49  0.45   0.44  0.80   0.47  0.29   0.84  0.97
FA        0.75  1.49   1.81  1.06   1.37  1.30   1.95  2.41   2.23  2.22   2.52  1.29
BSA       1.28  1.31   0.98  1.21   1.12  0.71   2.67  1.61   1.48  1.43   2.65  3.74
JADE      1.00  1.63   0.79  0.88   1.93  1.76   2.12  1.81   1.34  1.69   3.17  2.34
CoBiDE    1.25  1.29   1.11  1.11   1.50  1.67   2.13  2.22   2.29  1.60   2.92  2.26
One can notice that, in general, IHS was the fastest technique, followed by CS, which is somewhat expected due to their updating mechanisms: IHS evaluates a single solution per iteration, while CS evaluates a reduced number of solutions, determined by its probability parameter.
Table 9: Computational load concerning the CalTech 101 Silhouettes dataset.

          1L: DBN      1L: DBM       2L: DBN      2L: DBM       3L: DBN      3L: DBM
          CD    PCD    CD     PCD    CD    PCD    CD     PCD    CD     PCD   CD     PCD
IHS       1.64  1.47   1.81   1.62   1.28  1.37   1.84   2.26   1.13   1.06  1.98   1.58
AIWPSO    8.87  9.44   10.54  11.50  9.41  7.79   12.30  12.34  11.17  7.95  13.50  13.82
CS        1.55  1.01   1.86   1.63   0.93  1.76   2.45   2.17   1.46   1.35  2.00   0.80
FA        3.38  5.27   6.03   3.00   6.25  3.27   7.26   2.62   3.58   8.08  6.55   8.83
BSA       6.40  5.08   6.55   8.30   6.04  5.60   9.19   8.42   4.23   4.53  7.95   9.90
JADE      8.24  4.31   9.22   7.90   7.71  4.10   11.15  7.40   8.25   4.57  9.43   8.29
CoBiDE    5.64  5.28   7.48   7.02   5.64  5.36   7.52   7.61   4.47   5.38  6.63   8.70
Likewise, one can expect BSA, JADE, and CoBiDE to behave similarly regarding the computational load, since they are evolutionary-based techniques and the number of new solutions to be evaluated (i.e., the ones that undergo mutation and crossover operations) depends upon a probability.
Table 10: Computational load concerning the Semeion dataset.

          1L: DBN      1L: DBM      2L: DBN      2L: DBM      3L: DBN      3L: DBM
          CD    PCD    CD    PCD    CD    PCD    CD    PCD    CD    PCD    CD    PCD
IHS       0.16  0.19   0.22  0.25   0.23  0.20   0.22  0.31   0.28  0.28   0.35  0.38
AIWPSO    1.14  1.00   1.49  1.44   1.61  1.41   2.04  1.98   2.15  1.80   2.51  2.45
CS        0.26  0.18   0.31  0.26   0.26  0.24   0.31  0.20   0.28  0.23   0.40  0.38
FA        0.49  0.74   0.82  0.42   0.62  0.98   0.82  0.53   0.84  1.14   0.90  0.76
BSA       0.68  0.65   0.57  0.88   0.54  0.44   0.57  1.16   0.83  0.92   1.30  1.51
JADE      0.54  0.22   0.92  1.18   0.25  0.80   0.92  1.60   0.37  1.29   1.91  2.09
CoBiDE    0.71  0.58   0.74  0.91   0.49  0.89   0.96  1.03   0.68  1.01   1.52  1.30
One shortcoming of FA and AIWPSO concerns their computational burden, since every agent in the swarm generates a new solution to be evaluated at each iteration. IHS, in contrast, creates a single solution per iteration (i.e., it evaluates the fitness function only once per iteration), which makes it much faster than the swarm-based techniques, although it may converge more slowly as well.
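The per-iteration cost argument can be made concrete with a minimal Improved Harmony Search sketch (function name, defaults, and the PAR/bandwidth schedules below are illustrative assumptions, not the paper's implementation): exactly one new harmony, and hence one fitness evaluation, is improvised per iteration.

```python
import math
import random

def ihs_minimize(fitness, dim, iters=200, hms=10, hmcr=0.9,
                 par_min=0.1, par_max=0.9, bw_min=0.001, bw_max=0.1,
                 bounds=(0.0, 1.0), memory=None):
    """Improved Harmony Search sketch: one fitness evaluation per iteration."""
    lo, hi = bounds
    if memory is None:
        memory = [[random.uniform(lo, hi) for _ in range(dim)]
                  for _ in range(hms)]
    else:
        memory = [list(h) for h in memory]
        hms = len(memory)
    scores = [fitness(h) for h in memory]
    for t in range(iters):
        # IHS schedules: PAR grows linearly, bandwidth decays geometrically.
        frac = t / max(1, iters - 1)
        par = par_min + (par_max - par_min) * frac
        bw = bw_max * math.exp(math.log(bw_min / bw_max) * frac)
        new = []
        for j in range(dim):
            if random.random() < hmcr:
                x = random.choice(memory)[j]      # memory consideration
                if random.random() < par:         # pitch adjustment
                    x += random.uniform(-bw, bw)
            else:                                 # random consideration
                x = random.uniform(lo, hi)
            new.append(min(hi, max(lo, x)))
        s = fitness(new)                          # the single evaluation
        worst = max(range(hms), key=lambda i: scores[i])
        if s < scores[worst]:                     # replace the worst harmony
            memory[worst], scores[worst] = new, s
    best = min(range(hms), key=lambda i: scores[i])
    return memory[best], scores[best]
```

A swarm technique with N agents performs roughly N evaluations per iteration instead, which matches the gap between IHS and AIWPSO in Tables 8 through 10.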
6 Conclusions
In this work, we dealt with the problem of fine-tuning Deep Boltzmann Machines by means of metaheuristic-driven optimization techniques to reconstruct binary images. The experimental results over three public datasets showed the validity of using such techniques to optimize DBMs when compared against a random search. Also, we showed that DBMs can learn more accurate models than DBNs in two out of three datasets. Moreover, we provided a detailed analysis of the statistical similarity among the optimization techniques using the Wilcoxon signed-rank test, as well as the trade-off between the computational load demanded by each metaheuristic and its effectiveness.
Even though all techniques obtained close results, we observed that evolutionary- and memory-based approaches may be more suitable for DBM/DBN hyperparameter fine-tuning. Since we are coping with hyperparameters that, within small intervals, do not influence the learning step (i.e., the reconstruction error), the evolutionary operators and the process of creating new harmonies seem to introduce some sort of perturbation that moves candidate solutions far apart from each other. Regarding future works, we aim to validate the proposed approach to reconstruct and also classify grayscale images.
Acknowledgments
The authors are grateful to FAPESP grants #2013/07375-0, #2014/12236-1, and #2016/19403-6, and CNPq grants #306166/2014-3 and #307066/2017-7. This material is based upon work supported in part by funds provided by the Intel® AI Academy program under Fundunesp Grant No. 2597.2017. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
Footnotes
 Notice that, in this context, hyperparameter fine-tuning stands for a proper selection of the network’s input values, such as the number of hidden units and the learning rate, among others, rather than optimizing the biases and weights of the model.
 The main difference stands in the top-down feedback used to approximate the inference procedure. Moreover, the DBM has entirely undirected connections, while the DBN has undirected connections in the top two layers only, as well as directed connections at the lower layers.
 The ranges used for each parameter were empirically selected based on values commonly adopted in the literature papaQUATERNION:17 (); rosa2016learning (); PapaGECCO:15 (); RodriguesBook:16 (); passosiRBM:2017 ()
 http://yann.lecun.com/exdb/mnist/
 The original training set was reduced to of its former size, which corresponds to images.
 https://archive.ics.uci.edu/ml/datasets/Semeion+Handwritten+Digit
 https://people.cs.umass.edu/~marlin/data.shtml
 Parameters were empirically selected based on each technique author’s suggestions, as well as the values commonly adopted in the literature papaQUATERNION:17 (); rosa2016learning (); PapaGECCO:15 (); RodriguesBook:16 (); passosiRBM:2017 ()
 The selected number of agents and iterations for convergence were empirically chosen based on values commonly adopted in the literature papaQUATERNION:17 (); rosa2016learning (); PapaGECCO:15 ()
 LibOPF: https://github.com/jppbsi/LibOPF
 LibDEEP: https://github.com/jppbsi/LibDEEP
 LibDEV: https://github.com/jppbsi/LibDEV
 LibOPT PapaLIBOPT:17 (): https://github.com/jppbsi/LibOPT
References
 G. E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14 (8) (2002) 1771–1800.
 L. A. Passos, J. P. Papa, On the training algorithms for restricted boltzmann machinebased models, Ph.D. thesis, Universidade Federal de São Carlos (2018).
 H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio, An empirical evaluation of deep architectures on problems with many factors of variation, in: Proceedings of the 24th international conference on Machine learning, ACM, 2007, pp. 473–480.
 R. Salakhutdinov, G. E. Hinton, Semantic hashing, International Journal of Approximate Reasoning 50 (7) (2009) 969–978.
 U. Fiore, F. Palmieri, A. Castiglione, A. De Santis, Network anomaly detection with the restricted boltzmann machine, Neurocomputing 122 (2013) 13–23.
 L. A. Silva, K. A. P. Costa, P. B. Ribeiro, G. H. Rosa, J. P. Papa, Learning spam features using restricted boltzmann machines, IADIS International Journal on Computer Science and Information Systems 11 (1) (2016) 99–114.
 C. R. Pereira, L. A. Passos, R. R. Lopes, S. A. Weber, C. Hook, J. P. Papa, Parkinson's disease identification using restricted boltzmann machines, in: International Conference on Computer Analysis of Images and Patterns, Springer, 2017, pp. 70–80.
 P. Khojasteh, L. A. Passos, T. Carvalho, E. Rezende, B. Aliahmad, J. P. Papa, D. K. Kumar, Exudate detection in fundus images using deeplylearnable features, Computers in biology and medicine 104 (2019) 62–69.
 L. A. Passos, L. A. de Souza Jr, R. Mendel, A. Ebigbo, A. Probst, H. Messmann, C. Palm, J. P. Papa, Barrett's esophagus analysis using infinity restricted boltzmann machines, Journal of Visual Communication and Image Representation 59 (2019) 475–485.
 G. E. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets, Neural computation 18 (7) (2006) 1527–1554.
 R. Salakhutdinov, G. E. Hinton, Deep boltzmann machines, in: AISTATS, Vol. 1, 2009, p. 3.
 L. A. Passos, J. P. Papa, Temperaturebased deep boltzmann machines, Neural Processing Letters 48 (1) (2018) 95–107.
 P. Ruangkanokmas, T. Achalakul, K. Akkarajitsakul, Deep belief networks with feature selection for sentiment classification, in: 7th International Conference on Intelligent Systems, Modelling and Simulation, 2016.
 K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, in: C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems 28, Curran Associates, Inc., 2015, pp. 3465–3473.
 Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Deepface: Closing the gap to humanlevel performance in face verification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.
 C.-N. Duong, K. Luu, K. G. Quach, T. D. Bui, Beyond principal components: Deep Boltzmann machines for face modeling, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '15, 2015, pp. 4786–4794.
 C. Dong, C. C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution, in: European Conference on Computer Vision, Springer, 2014, pp. 184–199.
 N. Srivastava, R. Salakhutdinov, Multimodal learning with deep boltzmann machines, in: Advances in neural information processing systems, 2012, pp. 2222–2230.
 G. E. Hinton, R. Salakhutdinov, Replicated softmax: an undirected topic model, in: Advances in neural information processing systems, 2009, pp. 1607–1614.
 J. P. Papa, G. H. Rosa, K. A. P. Costa, A. N. Marana, W. Scheirer, D. D. Cox, On the model selection of bernoulli restricted boltzmann machines through harmony search, in: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’15, ACM, New York, USA, 2015, pp. 1449–1450.
 G. Rosa, J. P. Papa, K. Costa, L. A. Passos, C. Pereira, X.S. Yang, Learning parameters in deep belief networks through firefly algorithm, in: IAPR Workshop on Artificial Neural Networks in Pattern Recognition, Springer, 2016, pp. 138–149.
 L. A. Passos, J. P. Papa, Finetuning infinity restricted boltzmann machines, in: Graphics, Patterns and Images (SIBGRAPI), 2017 30th SIBGRAPI Conference on, IEEE, 2017, pp. 63–70.
 L. A. Passos, D. R. Rodrigues, J. P. Papa, Fine tuning deep boltzmann machines through metaheuristic approaches, in: 2018 IEEE 12th International Symposium on Applied Computational Intelligence and Informatics (SACI), IEEE, 2018, pp. 000419–000424.
 Z. W. Geem, MusicInspired Harmony Search Algorithm: Theory and Applications, 1st Edition, Springer Publishing Company, Incorporated, 2009.
 J. Kennedy, R. C. Eberhart, Swarm Intelligence, Morgan Kaufmann Publishers Inc., San Francisco, USA, 2001.
 M. Mahdavi, M. Fesanghary, E. Damangir, An improved harmony search algorithm for solving optimization problems, Applied mathematics and computation 188 (2) (2007) 1567–1579.
 X. Yu, J. Liu, H. Li, An adaptive inertia weight particle swarm optimization algorithm for IIR digital filter, in: 2009 International Conference on Artificial Intelligence and Computational Intelligence (AICI), Vol. 1, IEEE, 2009, pp. 114–118.
 X.-S. Yang, S. Deb, Cuckoo search via Lévy flights, in: 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC), IEEE, 2009, pp. 210–214.
 X.-S. Yang, Firefly algorithm, stochastic test functions and design optimisation, International Journal of Bio-Inspired Computation 2 (2) (2010) 78–84.
 P. Civicioglu, Backtracking search optimization algorithm for numerical optimization problems, Applied Mathematics and Computation 219 (15) (2013) 8121–8144.
 J. Zhang, A. C. Sanderson, Jade: adaptive differential evolution with optional external archive, IEEE Transactions on evolutionary computation 13 (5) (2009) 945–958.
 Y. Wang, H.X. Li, T. Huang, L. Li, Differential evolution based on covariance matrix learning and bimodal distribution parameter setting, Applied Soft Computing 18 (2014) 232–247.
 R. Salakhutdinov, G. E. Hinton, An efficient learning procedure for deep boltzmann machines, Neural Computation 24 (8) (2012) 1967–2006. doi:10.1162/NECO_a_00311.
 J. P. Papa, G. H. Rosa, D. R. Pereira, X.S. Yang, Quaternionbased deep belief networks finetuning, Applied Soft Computing 60 (2017) 328–335.
 D. Rodrigues, X. S. Yang, J. P. Papa, Finetuning deep belief networks using cuckoo search, in: X. S. Yang, J. P. Papa (Eds.), BioInspired Computation and Applications in Image Processing, Academic Press, 2016, pp. 47–59.
 J. P. Papa, W. Scheirer, D. D. Cox, Finetuning deep belief networks using harmony search, Applied Soft Computing 46 (2016) 875–885.
 X.S. Yang, S. Deb, Engineering optimisation by cuckoo search, International Journal of Mathematical Modelling and Numerical Optimisation 1 (4) (2010) 330–343.
 N. Nguyen, P. Milanfar, G. Golub, A computationally efficient super-resolution image reconstruction algorithm, IEEE Transactions on Image Processing 10 (4) (2001) 573–583.
 S. Liu, L. Fu, W. Yang, Optimization of an iterative image reconstruction algorithm for electrical capacitance tomography, Measurement Science and Technology 10 (7) (1999) L37.
 R. Puetter, T. Gosnell, A. Yahil, Digital image reconstruction: Deblurring and denoising, Annu. Rev. Astron. Astrophys. 43 (2005) 139–194.
 R. G. Pires, D. F. S. Santos, L. A. M. Pereira, G. B. De Souza, A. L. M. Levada, J. P. Papa, A robust restricted boltzmann machine for binary image denoising, in: 2017 30th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), IEEE, 2017, pp. 390–396.
 R. G. Pires, D. S. Santos, G. B. Souza, A. N. Marana, A. L. Levada, J. P. Papa, A deep boltzmann machinebased approach for robust image denoising, in: Iberoamerican Congress on Pattern Recognition, Springer, 2017, pp. 525–533.
 G. E. Hinton, A practical guide to training restricted boltzmann machines, in: Neural networks: Tricks of the trade, Springer, 2012, pp. 599–619.
 H. Larochelle, Y. Bengio, Classification using discriminative restricted boltzmann machines, in: Proceedings of the 25th international conference on Machine learning, ACM, 2008, pp. 536–543.
 G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, science 313 (5786) (2006) 504–507.
 G. W. Taylor, G. E. Hinton, S. T. Roweis, Modeling human motion using binary latent variables, in: Advances in neural information processing systems, 2007, pp. 1345–1352.
 T. Tieleman, Training restricted boltzmann machines using approximations to the likelihood gradient, in: Proceedings of the 25th International Conference on Machine Learning, ACM, New York, USA, 2008, pp. 1064–1071.
 F. Wilcoxon, Individual comparisons by ranking methods, Biometrics Bulletin 1 (6) (1945) 80–83.
 J. P. Papa, G. H. Rosa, D. Rodrigues, X.-S. Yang, LibOPT: An open-source platform for fast prototyping soft optimization techniques, arXiv preprint arXiv:1704.05174.
 J. P. Papa, G. H. Rosa, A. N. Marana, W. Scheirer, D. D. Cox, Model selection for discriminative restricted boltzmann machines through metaheuristic techniques, Journal of Computational Science 9 (2015) 14–18.