Scalability of using Restricted Boltzmann Machines for Combinatorial Optimization
Abstract.
Estimation of Distribution Algorithms (EDAs) require flexible probability models that can be efficiently learned and sampled. Restricted Boltzmann Machines (RBMs) are generative neural networks with these desired properties. We integrate an RBM into an EDA and evaluate the performance of this system in solving combinatorial optimization problems with a single objective. We assess how the number of fitness evaluations and the CPU time scale with problem size and problem complexity. The results are compared to the Bayesian Optimization Algorithm (BOA), a state-of-the-art EDA. Although RBM-EDA requires larger population sizes and a larger number of fitness evaluations, it outperforms BOA in terms of CPU time, in particular if the problem is large or complex. RBM-EDA requires less time for model building than BOA. These results highlight the potential of using generative neural networks for combinatorial optimization.
Key words and phrases:
Combinatorial Optimization, Heuristics, Evolutionary Computation, Estimation of Distribution Algorithms, Neural Networks

1. Introduction
Estimation of Distribution Algorithms (EDA, Mühlenbein and Paaß, 1996; Larrañaga and Lozano, 2002) are metaheuristics for combinatorial and continuous non-linear optimization. They maintain a population of solutions which they improve over consecutive generations. Unlike other heuristic methods, EDAs do not improve solutions with mutation, recombination, or local search. Instead, they estimate how likely it is that decisions are part of an optimal solution, and try to uncover the dependency structure between the decisions. This information is obtained from the population by the estimation of a probabilistic model. If a probabilistic model generalizes the population well, random samples drawn from the model have a structure and solution quality that is similar to the population itself. Repeated model estimation, sampling, and selection steps can solve difficult optimization problems in theory (Mühlenbein and Mahnig, 1999) and in practice (Lozano and Larrañaga, 2006). It is important to empirically assess the efficiency of using probability models in EDAs. Simple models, such as factorizations of univariate frequencies, can be quickly estimated from a population, but they cannot represent interactions between the decision variables. As a consequence, EDAs using univariate frequencies cannot efficiently solve complex problems. Using flexible probability models such as Bayesian networks allows complex problems to be solved, but fitting the model to a population and sampling new solutions can be very time-consuming.
A central goal of EDA research is the identification of probabilistic models that are flexible and can quickly be estimated and sampled. This is also a central topic in the field of machine learning. A recent focus in machine learning is the development of feasible unsupervised learning algorithms for generative neural networks. These algorithms and models can learn complex patterns from high-dimensional datasets. Moreover, generative neural networks can sample new data based on the associations that they have learned so far, and they can be fit to data (e.g., to a population of solutions) in an unsupervised manner. This makes them potentially useful for EDAs. Some of these models can also be "stacked" on several layers and be used as building blocks for "deep learning".
In this paper, we focus on Restricted Boltzmann Machines (RBM, Smolensky, 1986; Hinton, 2002). RBMs are a basic, yet powerful, type of generative neural network where the connections between the neurons form a bipartite graph (Section 3 explains RBMs in detail). Due to a recent breakthrough (Hinton et al., 2006), training RBMs is computationally tractable. They show impressive performance in classic machine learning tasks such as image or voice recognition (Dahl et al., 2012).
Given these successes, it is not surprising that researchers have integrated RBMs and similar models into EDAs and studied how these systems perform in optimization tasks. Zhang and Shin (2000) used a Helmholtz Machine in an EDA. Helmholtz Machines are predecessors of RBMs; due to their limited performance, they have largely been abandoned. Zhang and Shin (2000) evaluated their EDA by comparing it to a simple Genetic Algorithm (Goldberg, 1989). They did not study the scaling behavior for problems of different sizes and complexity. In a series of recent papers, Huajin et al. (2010); Shim et al. (2010); Shim and Tan (2012) and Shim et al. (2013) studied EDAs that use RBMs. These works are similar to ours in that an RBM is used inside an EDA. An important difference is that they considered problems with multiple objectives. Also, they hybridized the EDA with particle swarm optimization. Thus, it is unknown whether using an RBM in an EDA leads to competitive performance in single-objective combinatorial optimization.
Therefore, in this paper, we raise the following questions:

How efficient are EDAs that use RBMs for single-objective combinatorial optimization?

How does the runtime scale with problem size and problem difficulty?

Is the performance competitive with the state-of-the-art?
To answer these questions, we integrated an RBM into an EDA (the RBM-EDA) and conducted a thorough experimental scalability study. We systematically varied the difficulty of two tunably complex single-objective test problems (concatenated trap functions and NK landscapes), and we computed how the runtime and the number of fitness evaluations scaled with problem size. We then compared the results to those obtained for the Bayesian Optimization Algorithm (BOA, Pelikan et al., 1999a; Pelikan, 2005). BOA is a state-of-the-art EDA.
RBM-EDA solved the test problems in time polynomial in the problem size. Indeed, concatenated trap functions can only be solved in polynomial time by decomposing the overall problem into smaller parts and then solving the parts independently. RBM-EDA's polynomial scalability suggests that RBM-EDA recognized the structure of the problem correctly and that it solved the subproblems independently from one another. The hidden neurons of the RBM (its latent variables) captured the building blocks of the optimization problem. The runtime of RBM-EDA scaled better than that of BOA on trap functions of high order and on NK landscapes with large $k$. RBM-EDA hence appears to be useful for complex problems. It was mostly faster than BOA if instances were large.
The paper is structured as follows: In Section 2, we introduce Estimation of Distribution Algorithms and the Bayesian Optimization Algorithm. In Section 3, we introduce Restricted Boltzmann Machines, show how an RBM samples new data, and describe how an RBM is fit to given data. Section 3.4 describes RBM-EDA. The test functions, the experimental design, and the results are presented in Section 4. Section 5 concludes the paper.
2. Estimation of Distribution Algorithms
We introduce Estimation of Distribution Algorithms (Section 2.1) and the Bayesian Optimization Algorithm (Section 2.2).
2.1. Estimation of Distribution Algorithms
EDAs are population-based metaheuristics (Mühlenbein and Paaß, 1996; Mühlenbein and Mahnig, 1999; Pelikan et al., 1999b; Larrañaga et al., 1999). Similar to Genetic Algorithms (GA, Holland, 1975; Goldberg, 1989), they evolve a population of solutions over a number of generations by means of selection and variation.
Algorithm 1 outlines the basic functionality of an EDA. After initializing a population $P$ of solutions, the EDA runs for multiple generations. In each generation, a selection operator selects a subset $S$ of high-quality solutions from $P$. $S$ is then used as input for the variation step. In contrast to a GA, which creates new individuals using recombination and mutation, an EDA builds a probabilistic model $M$ from $S$, often by estimating the (joint) probability distribution of the solutions in $S$. Then, the EDA draws samples from $M$ to obtain new candidate solutions. Together with $S$, these candidate solutions constitute $P$ for the next generation. The algorithm stops after the population has converged or another termination criterion is met.
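As an illustration of this loop, the following sketch implements a minimal EDA with truncation selection and a univariate model. All names and parameter values are our own illustrative choices, not the algorithm variants discussed in this paper.

```python
import random

def estimate_univariate(selected):
    # Model M: one activation probability per decision variable.
    n = len(selected[0])
    return [sum(x[i] for x in selected) / len(selected) for i in range(n)]

def sample(model, rng):
    return [1 if rng.random() < p else 0 for p in model]

def eda(fitness, n, pop_size=100, generations=60, seed=3):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        selected = pop[: pop_size // 2]          # selection step (S)
        model = estimate_univariate(selected)    # model estimation (M)
        pop = selected + [sample(model, rng)     # variation: sample from M
                          for _ in range(pop_size - len(selected))]
    return max(pop, key=fitness)

best = eda(sum, n=20)  # onemax: fitness is the number of ones
```

Because the selected solutions are carried over into the next population, the best fitness never decreases over generations.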
EDA variants mainly differ in their probabilistic models $M$. The models describe dependency structures between the decision variables with different types of probability distributions. Consider a binary solution space with $n$ decision variables. A naive EDA could attempt to store a probability for each solution; $M$ would then contain $2^n$ probabilities. This could be required if all variables depended on each other. However, storing $2^n$ probabilities is computationally intractable for large $n$. If some decision variables are independent from other variables, however, the joint distribution can be factorized into products of marginal distributions and the space required for storing $M$ shrinks. If all variables are independent, only $n$ probabilities have to be stored. In most problems, some variables are independent of other variables, but the structure of the dependencies is unknown to those who want to solve the problem. Hence, model building consists of finding a network structure that matches the problem structure and estimating the model's parameters.
Simple models like the Breeder Genetic Algorithm (Mühlenbein and Paaß, 1996) or population-based incremental learning (Baluja, 1994) use univariate, fully factorized probability distributions with a vector of activation probabilities for the variables and choose to ignore dependencies between decision variables. Slightly more complex approaches like the bivariate marginal distribution algorithm use bivariate probability distributions which model pairwise dependencies between variables as trees or forests (Pelikan and Mühlenbein, 1999). More complex dependencies between variables can be captured by models with multivariate interactions, like the Bayesian Optimization Algorithm (Pelikan et al., 1999a, see Section 2.2) or the extended compact GA (Harik, 1999). Such multivariate models are better suited for complex optimization problems; univariate models can lead to an exponential growth of the required number of fitness evaluations (Pelikan et al., 1999a; Pelikan, 2005). However, the computational effort to build a model increases with its complexity and representational power. Many algorithms use probabilistic graphical models with directed edges, i.e., Bayesian networks, or undirected edges, i.e., Markov random fields (Larrañaga et al., 2012).
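For instance, population-based incremental learning maintains such a vector of activation probabilities and shifts it toward good samples. A minimal sketch of this update rule, with the learning rate as an illustrative parameter:

```python
def pbil_update(prob_vector, best_sample, lr=0.1):
    # Shift each activation probability toward the best sample's bits.
    return [(1 - lr) * p + lr * b for p, b in zip(prob_vector, best_sample)]

p = pbil_update([0.5, 0.5, 0.5], [1, 0, 1])
# probabilities move toward 1 where the sample has a 1, toward 0 otherwise
```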
2.2. Bayesian Optimization Algorithm
The Bayesian Optimization Algorithm is the state-of-the-art EDA for discrete optimization problems. It was proposed by Pelikan et al. (1999a) and has been heavily used and researched since then (Pelikan and Goldberg, 2003; Pelikan, 2008; Abdollahzadeh et al., 2012).
BOA uses a Bayesian network for modeling dependencies between variables. Decision variables correspond to nodes and dependencies between variables correspond to directed edges. As the number of possible network topologies grows exponentially with the number of nodes, BOA uses a greedy construction heuristic to find a network structure to model the training data. Starting from an unconnected (empty) network, BOA evaluates all possible additional edges, adds the one that maximally increases the fit between the model and selected individuals, and repeats this process until no more edges can be added. The fit between selected individuals and the model is measured by the Bayesian Information Criterion (BIC, Schwarz, 1978). BIC is based on the conditional entropy of nodes given their parent nodes and correction terms penalizing complex models. BIC assigns each network a scalar score
(1) $\mathrm{BIC}(B) = \sum_{i=1}^{n} \Bigl( -H(X_i \mid \Pi_i)\, N - 2^{|\Pi_i|}\, \frac{\log_2 N}{2} \Bigr)$
where $n$ is the number of decision variables, $N$ is the sample size (i.e., the number of selected individuals), $\Pi_i$ are the predecessors of node $i$ in the Bayesian network ($X_i$'s parents), and $|\Pi_i|$ is the number of parents of node $i$. The term $H(X_i \mid \Pi_i)$ is the conditional entropy of the $i$'th decision variable $X_i$ given its parental nodes $\Pi_i$, defined as
(2) $H(X_i \mid \Pi_i) = -\sum_{x_i,\, \pi_i} p(x_i, \pi_i)\, \log_2 p(x_i \mid \pi_i)$
where $p(x_i, \pi_i)$ is the observed probability of instances where $X_i = x_i$ and $\Pi_i = \pi_i$, and $p(x_i \mid \pi_i)$ is the conditional probability of instances where $X_i = x_i$ given that $\Pi_i = \pi_i$. The sum in (2) runs over all possible configurations of $x_i$ and $\pi_i$. The BIC score depends only on the conditional entropy of a node and its parents. Therefore, it can be calculated independently for all nodes. If an edge is added to the Bayesian network, the change of the BIC score can be computed quickly. The term $-2^{|\Pi_i|} \frac{\log_2 N}{2}$ in (1) penalizes model complexity. BOA's greedy network construction algorithm adds the edge with the largest gain in $\mathrm{BIC}(B)$ until no more edges can be added. Edge additions resulting in cycles are not considered.
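A minimal sketch of the node-wise BIC computation from Equations (1) and (2) might look as follows; the function names and the representation of samples as tuples of bits are our own illustrative choices, not BOA's implementation.

```python
import math
from collections import Counter

def cond_entropy(data, i, parents):
    # H(X_i | Pi_i) estimated from binary samples, in bits (Eq. 2).
    n = len(data)
    joint = Counter((tuple(x[p] for p in parents), x[i]) for x in data)
    parent_counts = Counter(tuple(x[p] for p in parents) for x in data)
    h = 0.0
    for (pa, xi), count in joint.items():
        # p(x_i, pi_i) * log2 p(x_i | pi_i)
        h -= (count / n) * math.log2(count / parent_counts[pa])
    return h

def bic_node(data, i, parents):
    # Node-wise BIC term: entropy cost plus complexity penalty (Eq. 1).
    N = len(data)
    return -cond_entropy(data, i, parents) * N - 2 ** len(parents) * math.log2(N) / 2
```

If one variable deterministically copies another, adding the corresponding edge removes all conditional entropy and raises the node's BIC score despite the complexity penalty, so the greedy heuristic would add it.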
After the network structure has been learned, BOA calculates the conditional activation probability tables for each node. Once the model structure and conditional activation probabilities are available, BOA can produce new candidate solutions by drawing random values for all nodes in topological order.
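Sampling in topological order (ancestral sampling) can be sketched as follows; the data structures for nodes and conditional probability tables are illustrative assumptions, not BOA's actual representation.

```python
import random

def sample_bn(nodes, rng):
    # nodes: (name, parents, cpt) triples in topological order, where cpt
    # maps a tuple of parent values to P(x = 1 | parent values).
    x = {}
    for name, parents, cpt in nodes:
        p_one = cpt[tuple(x[p] for p in parents)]
        x[name] = 1 if rng.random() < p_one else 0
    return x

# Two-node chain: "a" is always 1, and "b" is the negation of "a".
nodes = [("a", (), {(): 1.0}),
         ("b", ("a",), {(0,): 1.0, (1,): 0.0})]
```

Because each node is sampled only after all of its parents, the required parent values are always available when its conditional table is consulted.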
3. Restricted Boltzmann Machines and the RBMEDA
Restricted Boltzmann Machines (Smolensky, 1986) are stochastic neural networks that are successful in areas such as image classification, natural language processing, or collaborative filtering (Dahl et al., 2010; Hinton et al., 2006; Salakhutdinov et al., 2007). In this section, we describe the structure of Restricted Boltzmann Machines (Section 3.1), show how an RBM can sample new data (Section 3.2), and describe how contrastive divergence learning is used to model the probability distribution of given data (Section 3.3). Finally, we describe RBM-EDA, an EDA that uses an RBM as its probabilistic model (Section 3.4).
3.1. Structure of RBMs
Figure 1 illustrates the structure of an RBM. We denote by $V$ the input (or "visible") layer. $V$ holds the input data, represented by the binary variables $v_i$. The binary neurons $h_j$ of the hidden layer $H$ are called feature detectors, as they are able to model patterns in the data. A weight matrix $W$ holds the weights $w_{ij}$ between all neurons $v_i$ and $h_j$. Together, $V$, $H$, and $W$ form a bipartite graph. The weights are undirected. An RBM forms a Markov random field.
In the sampling and training phase, each neuron in $V$ and $H$ makes stochastic decisions about whether it is active (its value then becomes 1) or not (its value then becomes 0). To do so, it collects inputs from all neurons to which it is directly connected; $W$ determines the strengths of these inputs.
An RBM encodes a joint probability distribution $P(v, h)$. In the sampling phase, a configuration of $v$ and $h$ is thus sampled with probability $P(v, h)$ (Smolensky, 1986). In the training phase, the weights are adapted such that the marginal probability $P(v)$ approximates the probability distribution of the training data. Training and sampling are tractable because of the bipartite structure of the RBM. Hence, it is not necessary to know the problem structure beforehand.
3.2. Sampling
The goal of the sampling phase is to generate new values for the neurons in the visible layer according to $P(v)$. This is straightforward if the activations $h$ of the neurons in the hidden layer are known. In this case, all $v_i$ are independent of each other and the conditional probability that $v_i$ is active is simple to compute. The conditional probability that the visible neuron $v_i$ is active, given the hidden layer $h$, is calculated as
(3) $P(v_i = 1 \mid h) = \sigma\Bigl( \sum_{j} w_{ij} h_j + b_i \Bigr)$
where $\sigma(x) = 1 / (1 + e^{-x})$ is the logistic function, and $b_i$ and $c_j$ denote the biases of visible neuron $i$ and hidden neuron $j$, respectively. Analogously, the conditional probability that the hidden neuron $h_j$ is active, given the visible layer $v$, is
(4) $P(h_j = 1 \mid v) = \sigma\Bigl( \sum_{i} w_{ij} v_i + c_j \Bigr)$
Although the two conditional distributions $P(v \mid h)$ and $P(h \mid v)$ are simple to compute, sampling from the joint probability distribution $P(v, h)$ is much more difficult, as it usually requires integrating over one of the conditional distributions. An alternative is to use Gibbs sampling, which approximates the joint distribution from the conditional distributions. Gibbs sampling starts by assigning random values to the visible neurons. Then, it iteratively samples from $P(h \mid v)$ and $P(v \mid h)$, respectively, while assigning the result of the previous sampling step to the non-sampled variable. Sampling alternately from $P(h \mid v)$ and $P(v \mid h)$ in this manner forms a Markov chain. Its stationary distribution is identical to the joint probability distribution $P(v, h)$ (Geman and Geman, 1984). The quality of the approximation increases with the number of sampling steps. If Gibbs sampling is started with a $v$ that has a high probability under the stationary distribution, only a few sampling steps are necessary to obtain a good approximation.
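A minimal sketch of this alternating Gibbs chain, using Equations (3) and (4) with the biases omitted for brevity; all names are illustrative choices, not the paper's implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, W, rng):
    # h_j ~ P(h_j = 1 | v), cf. Eq. (4); biases omitted for brevity.
    return [1 if rng.random() < sigmoid(sum(W[i][j] * v[i] for i in range(len(v)))) else 0
            for j in range(len(W[0]))]

def sample_visible(h, W, rng):
    # v_i ~ P(v_i = 1 | h), cf. Eq. (3); biases omitted for brevity.
    return [1 if rng.random() < sigmoid(sum(W[i][j] * h[j] for j in range(len(h)))) else 0
            for i in range(len(W))]

def gibbs(v, W, steps, rng):
    # Alternate h ~ P(h|v) and v ~ P(v|h); the chain's stationary
    # distribution is the RBM's joint distribution P(v, h).
    for _ in range(steps):
        h = sample_hidden(v, W, rng)
        v = sample_visible(h, W, rng)
    return v
```

With a single strongly positive weight, an active visible neuron keeps reinforcing itself through the hidden neuron, so the chain stays in that configuration with high probability.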
3.3. Training
In the training phase, the RBM adapts the weights such that $P(v)$ approximates the distribution of the training data. An effective approach for adjusting the weights of an RBM is contrastive divergence (CD) learning (Hinton, 2002). CD learning maximizes the log-likelihood $\log P(v)$ of the training data under the model by performing a stochastic gradient ascent. The main element of CD learning is Gibbs sampling.
For each point $v$ in the training data, CD learning updates each weight $w_{ij}$ in the direction of $\partial \log P(v) / \partial w_{ij}$. This partial derivative is the difference of two terms usually referred to as the positive and the negative gradient, $\langle v_i h_j \rangle_{\mathrm{data}}$ and $\langle v_i h_j \rangle_{\mathrm{model}}$ (Hinton, 2002). The total gradient is
(5) $\dfrac{\partial \log P(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}$
where $\langle v_i h_j \rangle_{\mathrm{model}}$ is the expected value of $v_i h_j$ under the model distribution. $\langle v_i h_j \rangle_{\mathrm{data}}$ is the expected value of $v_i h_j$ when setting the visible layer to a data vector $v$ from the training set and sampling $h$ according to (4). The positive gradient $\langle v_i h_j \rangle_{\mathrm{data}}$ increases the marginal probability of the data point $v$. In contrast, $\langle v_i h_j \rangle_{\mathrm{model}}$ is the expected value for a configuration sampled from the joint probability distribution $P(v, h)$, which is approximated by Gibbs sampling. If $P(v)$ is equal to the distribution of the training data, the positive and negative gradients equal each other in expectation and the total gradient becomes zero.
Calculating $\langle v_i h_j \rangle_{\mathrm{model}}$ exactly is infeasible since a high number of Gibbs sampling steps is required until the RBM samples from its stationary distribution. Therefore, CD estimates $\langle v_i h_j \rangle_{\mathrm{model}}$ using two approximations. First, CD initializes the Markov chain with a data vector from the training set, rather than using an unbiased, random starting point. Second, only a small number of sampling steps is used. We denote CD using $z$ sampling steps as CD-$z$. CD-$z$ approximates the negative gradient by initializing the sampling chain to the same data point that is used for the calculation of the positive gradient. Subsequently, it performs $z$ Gibbs sampling steps. Note that the first half-step has, in practice, already been performed during the calculation of the positive gradient. Despite these two approximations, CD usually works well (Hinton et al., 2006).
Algorithm 2 describes contrastive divergence for $z = 1$ (CD-1). For each training vector, $v$ is initialized with the training vector and $h$ is sampled according to (4). This allows the calculation of the positive gradient as $\langle v_i h_j \rangle_{\mathrm{data}}$. Following this, two additional sampling steps are carried out: First, we calculate the "reconstruction" $\tilde{v}$ of the training vector according to (3). Subsequently, we calculate the corresponding hidden probabilities $P(h_j = 1 \mid \tilde{v})$. Now, we can approximate the negative gradient $\langle v_i h_j \rangle_{\mathrm{model}}$ as $\tilde{v}_i \, P(h_j = 1 \mid \tilde{v})$ and obtain the total gradient. Finally, we update the weights as
(6) $w_{ij} \leftarrow w_{ij} + \alpha \bigl( \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}} \bigr)$
where $\alpha$ is a learning rate defined by the user. This procedure is repeated for several epochs, i.e., passes through the training set. Usually, CD is implemented in a mini-batch fashion. That is, we calculate the gradient in (6) for multiple training examples at the same time, and subsequently use the average gradient to update $W$. This reduces sampling noise and makes the gradient more stable (Bishop, 2006; Hinton et al., 2006).
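Putting the pieces together, a CD-1 update for a single training vector might be sketched as follows; biases are omitted for brevity, and all names are illustrative choices, not the paper's implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, W, rng):
    # h_j ~ P(h_j = 1 | v), cf. Eq. (4); biases omitted for brevity.
    return [1 if rng.random() < sigmoid(sum(W[i][j] * v[i] for i in range(len(v)))) else 0
            for j in range(len(W[0]))]

def sample_visible(h, W, rng):
    # v_i ~ P(v_i = 1 | h), cf. Eq. (3); biases omitted for brevity.
    return [1 if rng.random() < sigmoid(sum(W[i][j] * h[j] for j in range(len(h)))) else 0
            for i in range(len(W))]

def cd1_update(v0, W, alpha, rng):
    # One CD-1 step for a single training vector (cf. Algorithm 2 / Eq. 6).
    h0 = sample_hidden(v0, W, rng)            # positive phase
    v1 = sample_visible(h0, W, rng)           # the "reconstruction"
    h1_prob = [sigmoid(sum(W[i][j] * v1[i] for i in range(len(v1))))
               for j in range(len(W[0]))]     # hidden probabilities for v1
    for i in range(len(v0)):
        for j in range(len(W[0])):
            positive = v0[i] * h0[j]          # estimates <v_i h_j>_data
            negative = v1[i] * h1_prob[j]     # approximates <v_i h_j>_model
            W[i][j] += alpha * (positive - negative)
    return W
```

A mini-batch version would accumulate `positive - negative` over several training vectors and apply the averaged gradient once, as described above.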
3.4. Restricted Boltzmann EDA
This section describes how we used an RBM in an EDA. The RBM should model the properties of promising solutions and then be used to sample new candidate solutions. In each generation of the EDA, we trained the RBM to model the probability distribution of the solutions which survived the selection process. Then, we sampled candidate solutions from the RBM and evaluated their fitness.
We chose to use the following parameters: The number of hidden neurons was set to be half the number of visible neurons (i.e. half the problem size). Standard values were used for the parameters that control the training and sampling phase (Hinton, 2010).
Sampling parameters
When sampling new candidate solutions from the RBM, we used the selected individuals in $S$ to initialize the visible neurons close to the stationary distribution. Subsequently, we performed 25 full Gibbs sampling steps.
Training parameters
The learning rate $\alpha$ was set to 0.05 for the weights and to 0.5 for the biases. We applied standard momentum (Qian, 1999), which adds a fraction $\mu$ of the last parameter update to the current update, making the update less prone to fluctuations caused by noise and dampening oscillations. We started with a small momentum and increased it during an EDA run (see below).
We used weight decay (L2 regularization) to improve generalization. Weight decay adds the penalty term $\frac{\lambda}{2} \sum_{i,j} w_{ij}^2$ to the RBM's optimization objective. The partial gradient in Equation (5) thus includes the additional term $-\lambda w_{ij}$, which decays large weights and thereby reduces overfitting. The weight cost $\lambda$ determines the strength of the decay. We chose a small value for $\lambda$ (Hinton, 2010).
We used contrastive divergence learning with one Gibbs sampling step (CD-1) and a mini-batch size of 100. Furthermore, we initialized the bias of each visible neuron $i$ with $\log(p_i / (1 - p_i))$, where $p_i$ is the probability that visible neuron $i$ is set to one in the training set (see Hinton, 2010). This initialization speeds up the training process.
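The bias initialization can be sketched as follows; the clipping constant `eps` is our own illustrative safeguard against neurons that are always zero or always one in the training set.

```python
import math

def init_visible_biases(training_set, eps=1e-6):
    # b_i = log(p_i / (1 - p_i)), where p_i is the empirical activation
    # frequency of visible neuron i in the training set (Hinton, 2010).
    n = len(training_set[0])
    biases = []
    for i in range(n):
        p = sum(x[i] for x in training_set) / len(training_set)
        p = min(max(p, eps), 1 - eps)  # avoid log(0) for constant neurons
        biases.append(math.log(p / (1 - p)))
    return biases
```

A neuron active in half the training set starts with a zero bias, while an (almost) always-active neuron starts with a large positive bias.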
Parameter control
We applied a simple parameter control scheme for the learning rate $\alpha$, the momentum $\mu$, and the number of epochs.
The scheme was based on the reconstruction error $e$, i.e., the difference between a training vector and its reconstruction after a single step of Gibbs sampling (see lines 2 and 7 in Algorithm 2). We measured the decrease $\Delta e$ of the reconstruction error over the last 25% of all epochs. $\Delta e$ was then used to adjust the learning parameters. The learning rate $\alpha$ was initialized with a relatively large value in the first epoch; as soon as $\Delta e$ fell below a threshold, $\alpha$ was decreased. Analogously, the momentum $\mu$ was set to a small value in the first epoch and was increased as soon as $\Delta e$ fell below a threshold.
We used two self-adaptive termination criteria for stopping the training phase. First, we stopped training as soon as $\Delta e$ became very small. The rationale behind this was that the RBM had then learned the relevant dependencies between the variables, and further training was unlikely to improve the model considerably. Second, we stopped the training if the RBM was overfitting, i.e., learning noise instead of problem structure. To detect overfitting, we split the original training set into a training set $T$ containing 90% of all samples and a validation set $V$ containing the remaining 10%. We trained the RBM only on the solutions in $T$ and, after each epoch, calculated the reconstruction errors $e_T$ and $e_V$ for the training set and the validation set, respectively. We stopped the training phase as soon as $|e_T - e_V| > 2\%$, i.e., as soon as the absolute difference between the two reconstruction errors was larger than 2%.
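A sketch of this hold-out scheme; the helper names and the shuffling seed are illustrative choices, not the paper's implementation.

```python
import random

def split_train_valid(selected, valid_frac=0.1, seed=0):
    # Hold out a fraction of the selected individuals to detect overfitting.
    rng = random.Random(seed)
    pool = list(selected)
    rng.shuffle(pool)
    cut = int(len(pool) * (1 - valid_frac))
    return pool[:cut], pool[cut:]

def should_stop(err_train, err_valid, gap=0.02):
    # Stop once the two reconstruction errors diverge by more than 2%.
    return abs(err_train - err_valid) > gap
```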
4. Experiments
We describe the test problems (Section 4.1) and our experimental design (Section 4.2). The results are presented in Section 4.3.
4.1. Test Problems
We evaluated the performance of RBM-EDA on onemax, concatenated deceptive traps (Ackley, 1987), and NK landscapes (Kauffman and Weinberger, 1989). All three are standard benchmark problems. Their difficulty depends on the problem size, i.e., problems with more decision variables are more difficult. Furthermore, the difficulty of concatenated deceptive trap functions and NK landscapes is tunable by a parameter. All three problems are defined on binary strings of fixed length.
The onemax problem assigns a binary solution $x$ of length $n$ the fitness value $f(x) = \sum_{i=1}^{n} x_i$, i.e., the fitness of $x$ is the number of ones in $x$. The onemax function is rather simple. It is unimodal and can be solved by a deterministic hill climber.
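As a small illustration, a deterministic single-bit hill climber indeed solves onemax; the names are illustrative.

```python
def onemax(x):
    # Fitness = number of ones in the bitstring.
    return sum(x)

def hill_climb(x, fitness):
    # Deterministic first-improvement hill climber on single-bit flips.
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            y = x[:]
            y[i] = 1 - y[i]
            if fitness(y) > fitness(x):
                x, improved = y, True
    return x
```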
A trap function of order $k$ is defined on a binary solution of length $k$. It assigns a solution the fitness $f_{\mathrm{trap}}(u) = k$ if $u = k$, and $f_{\mathrm{trap}}(u) = k - 1 - u$ otherwise, where $u$ is the number of ones in the solution.
The optimal solution consists of all ones. Trap functions are difficult to solve because the second-best solution consists of all zeros and the structure of the function is deceptive: it leads search methods away from the global optimum towards the second-best solution.
A concatenated trap function places $m$ trap functions of order $k$ on a single bitstring of length $n = m \cdot k$. Its fitness is calculated as the sum of the $m$ trap functions (Ackley, 1987). The problem difficulty increases with $n$ as well as with $k$. Concatenated traps are decomposable: dependencies exist between the variables within a trap function, but not between variables in different trap functions (Deb and Goldberg, 1991).
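The concatenated trap fitness can be sketched as follows; the names are illustrative.

```python
def trap(u, k):
    # Deceptive trap of order k; u is the number of ones in the block.
    return k if u == k else k - 1 - u

def concatenated_traps(x, k):
    # Sum of non-overlapping order-k traps over a bitstring of length m * k.
    return sum(trap(sum(x[i:i + k]), k) for i in range(0, len(x), k))
```

Note the deception: for $k = 5$, a block that is one bit away from the optimum scores 0, while the all-zeros block scores 4, so local search is pulled toward the second-best solution.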
NK landscapes are defined by two parameters $n$ and $k$ and by $n$ fitness components $f_i$ (Kauffman and Weinberger, 1989). A solution vector $x$ consists of $n$ bits. The bits are assigned to $n$ overlapping subsets, each of size $k + 1$. The fitness of a solution is the sum of the $n$ fitness components. Each component $f_i$ depends on the value of the corresponding variable $x_i$ as well as on $k$ other variables. Each $f_i$ maps each possible configuration of its $k + 1$ variables to a fitness value. The overall fitness function is therefore $F(x) = \sum_{i=1}^{n} f_i(x_i, x_{i_1}, \ldots, x_{i_k})$.
Each decision variable usually influences several components $f_i$. These dependencies between the subsets make NK landscapes non-decomposable. The problem difficulty increases with $k$. The case $k = 0$ is special: all decision variables are independent and the problem reduces to onemax.
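A sketch of a randomly generated NK instance under the definition above; the random choice of neighbors and the flat table representation are illustrative assumptions, not the construction used in the experiments.

```python
import random

def random_nk(n, k, seed=0):
    # Random NK instance: each f_i depends on x_i and k random neighbors.
    rng = random.Random(seed)
    neighbors = [sorted(rng.sample([j for j in range(n) if j != i], k))
                 for i in range(n)]
    # One fitness value per configuration of the k + 1 involved bits.
    tables = [[rng.random() for _ in range(2 ** (k + 1))] for _ in range(n)]
    return neighbors, tables

def nk_fitness(x, neighbors, tables):
    total = 0.0
    for i in range(len(x)):
        bits = [x[i]] + [x[j] for j in neighbors[i]]
        idx = int("".join(map(str, bits)), 2)  # index of this configuration
        total += tables[i][idx]
    return total
```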
4.2. Experimental Setup
We adopted an experimental setup similar to Pelikan (2008). We studied the performance of RBM-EDA on the onemax function, on concatenated deceptive traps with trap sizes $k = 4, 5, 6$, and on NK landscapes with $k = 2, 3, 4, 5$. We report results for BOA so that RBM-EDA can be compared to the state-of-the-art.
Both EDAs used tournament selection without replacement of size two (Miller and Goldberg, 1995). We used bisection to determine the smallest population size for which a method solved a problem to optimality. For onemax and deceptive traps, we required each method to find the optimal solution in 30 out of 30 independent runs. For NK landscapes, we used 25 randomly chosen problem instances per size and determined the population size that solved five out of five independent runs of each instance.
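The bisection procedure for the minimal population size can be sketched as follows; the starting bounds and the `solves` predicate are illustrative placeholders for "all required runs find the optimum with this population size".

```python
def bisect_pop_size(solves, low=2, high=4096):
    # Grow `high` until it suffices, then bisect down to the smallest
    # population size for which `solves` still succeeds.
    while not solves(high):
        low, high = high, high * 2
    while high - low > 1:
        mid = (low + high) // 2
        if solves(mid):
            high = mid
        else:
            low = mid
    return high
```

This needs only a logarithmic number of (expensive) EDA runs in the size of the search interval, which matters because each probe involves many full optimization runs.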
We report the average number of fitness evaluations that were necessary to solve the problem to optimality. In addition, we report average CPU running times. The CPU running time includes the time required for fitness calculation, the time required for model building (either building the Bayesian network or training the RBM), the time required for sampling new solutions, and the time required for selection. We deliberately report CPU times, even though previous EDA research mostly ignored the time required for model building and sampling.
We implemented RBMEDA and BOA in Java. The experiments were conducted on a single core of an AMD Opteron 6272 processor with 2,100 MHz. The JBLAS library was used for the linear algebra operations of the RBM.
4.3. Results
We report the performance of RBM-EDA and BOA on onemax (Figure 2), concatenated traps (Figure 3), and NK landscapes (Figure 4). The figures have a log-log scale, so straight lines indicate polynomial scalability. Each figure shows the average number of fitness evaluations (left-hand side) and the overall CPU time (right-hand side) required until the optimal solution was found. The figures also show regression lines obtained from fitting a polynomial to the raw results (details are in Table 1).
First, we study the number of fitness evaluations required until the optimal solution was found (Figures 2-4, left-hand side). For the onemax problem, RBM-EDA needed fewer fitness evaluations than BOA (Figure 2, left-hand side) and had a slightly lower complexity. For concatenated traps, BOA needed fewer fitness evaluations (Figure 3, left-hand side). As the problem difficulty increased (larger $k$), the performance of the two approaches became more similar. The complexity of BOA was slightly lower than that of RBM-EDA for larger problems. The situation was similar for NK landscapes, where BOA needed fewer fitness evaluations and scaled better than RBM-EDA (Figure 4, left-hand side).
We now discuss average total CPU times (Figures 2-4, right-hand side). Besides the time required for fitness evaluation, this includes the time spent on model building, sampling, and selection. For the onemax problem, RBM-EDA was faster than BOA and had a lower time complexity (Figure 2, right-hand side). For deceptive trap functions, BOA was faster on small problem instances, but its time complexity was larger than that of RBM-EDA (Figure 3, right-hand side). Hence, the absolute difference in computation time became smaller for larger and more difficult problems. For traps of size $k = 5$, RBM-EDA was faster for problems with more than 180 decision variables (36 concatenated traps). For traps of size $k = 6$, RBM-EDA was already faster for problems with more than 60 decision variables (10 concatenated traps). BOA's time complexity increased slightly with higher $k$. In contrast, the time complexity of RBM-EDA remained about the same.
The results for NK landscapes were qualitatively similar (Figure 4, right-hand side). BOA was faster than RBM-EDA, but its computational effort increased more strongly with $n$. Therefore, the computation times became similar for difficult and large problems (cf. the results for large $k$).










The results indicate that although RBM-EDA needed a higher number of fitness evaluations, its overall computational effort was similar to or even lower than that of BOA, especially for difficult problems. For this reason, we analyze in greater detail the differences in computational effort (CPU time) between BOA and RBM-EDA. Figure 5 shows the absolute CPU times for model building (left-hand side) and the relative CPU times necessary for model building, sampling, and fitness evaluation (right-hand side) for onemax, concatenated 5-traps, and NK landscapes for various problem sizes.
First, we study the absolute CPU times for model building (Figure 5, left-hand side). For all three problem types, RBM-EDA's model building had a lower CPU time complexity than BOA's. Model building was faster for RBM-EDA than for BOA for all onemax problems, as well as for concatenated 5-traps and NK problems beyond moderate problem sizes. Note that this was despite the fact that the population size, i.e., the training set, was usually bigger for RBM-EDA.
Second, we analyze the relative amount of CPU time that each EDA component used (Figure 5, right-hand side). For BOA, model building dominated the CPU usage with well over 85% of the total CPU time for all problems. Furthermore, this share increased monotonically with growing problem size (95.2-99.5% for onemax, 97.4-99.8% for 5-traps, 85.6-97.2% for NK landscapes). The relative times for sampling and fitness evaluation were very low, and their shares of the total time decreased with growing problem size, despite the growing population sizes.
In contrast, the RBM spent a significantly smaller fraction of the total time on model building. Moreover, with growing problem size, this fraction decreased or stayed relatively constant (85.7-88.3% for onemax, from 92.2% down to 64.3% for 5-traps, from 78.8% down to 49.2% for NK landscapes). Correspondingly, the CPU time fractions for sampling and fitness evaluations increased with growing problem size. Hence, the total CPU time for RBM-EDA was much less dominated by model building.
In summary, we found that the performance of RBM-EDA was competitive with the state-of-the-art, especially for large and difficult problem instances. This may be surprising, as the number of fitness evaluations necessary to solve a problem was, for most problems, higher for RBM-EDA than for BOA. Moreover, the computational effort in terms of fitness evaluations grew faster with the problem size $n$. The higher number of fitness evaluations used by RBM-EDA indicates that the statistical model created during training in RBM-EDA was less accurate than the statistical model in BOA. However, from a computational perspective, building the accurate model in BOA was much more expensive than learning the less accurate model used in RBM-EDA. Furthermore, the time necessary for building the model increased more slowly with the problem size than for BOA. Thus, the lower time effort necessary for learning a less accurate statistical model in RBM-EDA overcompensated for the higher number of fitness evaluations that were necessary to find optimal solutions. As a result, the CPU times for RBM-EDA were not only lower than those of BOA; they also increased more slowly with growing problem sizes. This was true especially for difficult and large problems.
[Table: scalability of the number of fitness evaluations and the CPU time with problem size for BOA and RBM-EDA on ONEMAX, 4-, 5-, and 6-traps, and NK landscapes with k = 2, 3, 4, 5; the numeric entries are not recoverable.]
5. Summary and Conclusions
We carried out an in-depth experimental analysis of using a Restricted Boltzmann Machine within an Estimation of Distribution Algorithm for combinatorial optimization. We tested RBM-EDA on standard binary benchmark problems: one-max, concatenated deceptive traps, and NK landscapes of different sizes. We carried out a scalability analysis for the number of fitness evaluations and the computation times required to solve the problems to optimality. We compared our results to those obtained from the Bayesian Optimization Algorithm, a state-of-the-art method.
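For reference, the benchmark fitness functions can be sketched in a few lines of Python. This is an illustration, not code from the experiments; the function names and the table-based NK representation are our own.

```python
import numpy as np

def onemax(x):
    """one-max: fitness is the number of ones in the bit string."""
    return int(np.sum(x))

def k_traps(x, k=5):
    """Concatenated deceptive traps: the string is split into blocks of
    k bits; a block with u ones scores k if u == k, else k - 1 - u, so
    the gradient inside a block points away from the block optimum
    (Deb and Goldberg, 1991)."""
    blocks = np.asarray(x).reshape(-1, k)
    u = blocks.sum(axis=1)
    return int(np.where(u == k, k, k - 1 - u).sum())

def nk_landscape(x, tables, neighbors):
    """NK landscape: each bit i contributes a random subfunction of
    itself and its k neighbors; tables[i] holds the 2**(k+1) random
    values and neighbors[i] the indices involved (Kauffman and
    Weinberger, 1989)."""
    x = np.asarray(x)
    total = 0.0
    for i, nbr in enumerate(neighbors):
        idx = int("".join(str(int(b)) for b in x[nbr]), 2)
        total += tables[i][idx]
    return total
```

A trap block of all ones scores k, while the all-zeros block scores k - 1, which is what makes the problem deceptive for algorithms that exploit univariate statistics.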
Our experimental results suggest that RBM-EDA is competitive with the state of the art. It was less efficient in terms of fitness evaluations, both in absolute numbers and in how this number scaled with problem size. However, the estimated complexity of probabilistic model building in RBM-EDA was lower than in BOA: RBM-EDA built its probabilistic models much faster than BOA on large problems. This led to lower total runtimes for RBM-EDA than for BOA, especially for large and difficult problem instances (cf. the results for the larger concatenated traps and NK landscapes). In sum, RBM-EDA can be an alternative if problems are large and difficult and the computational effort for fitness evaluations is low. This highlights the potential of using generative neural networks for combinatorial optimization.
Another advantage of using neural networks in EDAs is that RBMs can be parallelized without many of the problems that occur when parallelizing other EDAs. Parallelizing RBM-EDA on a Graphics Processing Unit leads to massive speedups, by a factor of 200 and more compared to optimized CPU code (Probst et al., 2014).
There are multiple directions for further research. We demonstrated that RBMs are useful in EDAs, but fine-tuning the parameters could improve the RBM's model quality, possibly leading to further performance improvements. It might also be beneficial to stack multiple RBM layers and to use such deep systems for solving hierarchical problems.
Footnotes
 In addition, all neurons are connected to a special "bias" neuron, which is always active and acts like an offset to a neuron's input. Bias weights are treated like normal weights during learning. For brevity, we omit the bias terms throughout the paper. For details, see Hinton et al. (2006).
 CD learning does not minimize the reconstruction error but maximizes the log-likelihood of the training data, which cannot be calculated tractably. As an alternative, the reconstruction error is usually a good proxy for how well the model can reproduce the training data.
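To make the footnote concrete, a minimal CD-1 update without bias terms (as in the paper) might look as follows. This is a sketch under our own naming conventions and learning rate, not the implementation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, lr=0.1):
    """One CD-1 update for an RBM with weight matrix W (bias terms
    omitted). Returns the updated weights and the reconstruction error,
    which serves as a cheap proxy for model quality."""
    # upward pass: hidden probabilities and a binary hidden sample
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # downward pass: reconstruct the visible layer
    v1_prob = sigmoid(h0 @ W.T)
    v1 = (rng.random(v1_prob.shape) < v1_prob).astype(float)
    # second upward pass for the negative phase of the gradient
    h1_prob = sigmoid(v1 @ W)
    # contrastive divergence weight update (positive minus negative phase)
    W = W + lr * (np.outer(v0, h0_prob) - np.outer(v1, h1_prob))
    recon_error = float(np.sum((v0 - v1_prob) ** 2))
    return W, recon_error
```

The returned reconstruction error is the squared distance between the training vector and its reconstruction, which is the quantity the footnote describes as a practical substitute for the intractable likelihood.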
 We omitted the relative CPU times for selection in order to improve the readability of Figure c. By definition, they are proportional to the CPU times for fitness evaluations and, in absolute numbers, negligible.
 The results for concatenated traps of size and NK landscapes of size were qualitatively similar.
 Note that in this comparison, each algorithm used the population sizes from its own bisection run, i.e., the population size for BOA was usually smaller. If both algorithms used the same population sizes, we would expect the time dominance of model building in BOA to be even greater.
References
 Abdollahzadeh, A., Reynolds, A., Christie, M., Corne, D. W., Davies, B. J., Williams, G. J. J., et al., 2012. Bayesian optimization algorithm applied to uncertainty quantification. SPE Journal 17 (03), 865–873.
 Ackley, D. H., 1987. A connectionist machine for genetic hill climbing. Kluwer Academic, Boston.
 Baluja, S., 1994. Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Tech. Rep. CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, PA.
 Bishop, C. M., 2006. Pattern recognition and machine learning. Information Science and Statistics. Springer.
 Dahl, G. E., Ranzato, M. A., Mohamed, A., Hinton, G. E., 2010. Phone recognition with the mean-covariance restricted Boltzmann machine. Advances in Neural Information Processing Systems 23.
 Dahl, G. E., Yu, D., Deng, L., Acero, A., 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing 20 (1), 30–42.
 Deb, K., Goldberg, D. E., 1991. Analyzing deception in trap functions. University of Illinois, Department of General Engineering.
 Geman, S., Geman, D., 1984. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6 (6), 721–741.
 Goldberg, D. E., 1989. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Professional.
 Harik, G., 1999. Linkage learning via probabilistic modeling in the ECGA. IlliGAL Report No. 99010, University of Illinois at Urbana-Champaign, Urbana, IL.
 Hinton, G. E., 2002. Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation 14, 1771–1800.
 Hinton, G. E., 2010. A practical guide to training restricted Boltzmann machines. Tech. Rep. UTML TR 2010-003, Department of Computer Science, University of Toronto.
 Hinton, G. E., Osindero, S., Teh, Y.-W., 2006. A fast learning algorithm for deep belief nets. Neural Computation 18, 1527–1554.
 Holland, J. H., 1975. Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor.
 Tang, H., Shim, V. A., Tan, K. C., Chia, J. Y., 2010. Restricted Boltzmann machine based algorithm for multi-objective optimization. In: IEEE Congress on Evolutionary Computation (CEC). pp. 1–8.
 Kauffman, S. A., Weinberger, E. D., 1989. The NK model of rugged fitness landscapes and its application to maturation of the immune response. Journal of theoretical biology 141 (2), 211–245.
 Larrañaga, P., Etxeberria, R., Lozano, J. A., Peña, J. M., 1999. Optimization by learning and simulation of Bayesian and Gaussian networks. Tech. Rep. EHU-KZAA-IK-4/99, Intelligent Systems Group, Dept. of Computer Science and Artificial Intelligence, University of the Basque Country.
 Larrañaga, P., Karshenas, H., Bielza, C., Santana, R., 2012. A review on probabilistic graphical models in evolutionary computation. Journal of Heuristics 18 (5), 795–819.
 Larrañaga, P., Lozano, J. A., 2002. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Genetic Algorithms and Evolutionary Computation, 2. Kluwer Academic Pub.
 Lozano, J. A., Larrañaga, P., 2006. Towards a New Evolutionary Computation: Advances on Estimation of Distribution Algorithms. Studies in Fuzziness and Soft Computing. Springer London, Limited.
 Miller, B. L., Goldberg, D. E., 1995. Genetic algorithms, tournament selection, and the effects of noise. Complex Systems 9, 193–212.
 Mühlenbein, H., Mahnig, T., 1999. FDA - a scalable evolutionary algorithm for the optimization of additively decomposed functions. Evolutionary Computation 7 (4), 353–376.
 Mühlenbein, H., Paaß, G., 1996. From Recombination of Genes to the Estimation of Distributions I. Binary Parameters. In: Voigt, H.-M., Ebeling, W., Rechenberg, I., Schwefel, H.-P. (Eds.), Parallel Problem Solving from Nature - PPSN IV. Vol. 1141 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 178–187.
 Pelikan, M., 2005. Bayesian Optimization Algorithm. In: Hierarchical Bayesian Optimization Algorithm. Vol. 170 of Studies in Fuzziness and Soft Computing. Springer Berlin / Heidelberg, pp. 31–48.
 Pelikan, M., January 2008. Analysis of estimation of distribution algorithms and genetic algorithms on NK landscapes. Tech. Rep. 2008001, Missouri Estimation of Distribution Algorithms Laboratory (MEDAL), instances can be found at http://medallab.org/files/nkinstances.tar.gz.
 Pelikan, M., Goldberg, D. E., 2003. Hierarchical BOA solves Ising spin glasses and MAXSAT. In: Genetic and Evolutionary Computation - GECCO 2003. Springer, pp. 1271–1282.
 Pelikan, M., Goldberg, D. E., Cantú-Paz, E., 1999a. BOA: The Bayesian Optimization Algorithm. In: Genetic and Evolutionary Computation Conference. pp. 525–532.
 Pelikan, M., Goldberg, D. E., Lobo, F., 1999b. A survey of optimization by building and using probabilistic models. IlliGAL Report No. 99018, University of Illinois at UrbanaChampaign, Urbana, IL.
 Pelikan, M., Mühlenbein, H., 1999. The Bivariate Marginal Distribution Algorithm. Advances in Soft Computing: Engineering Design and Manufacturing, 521–535.
 Probst, M., Rothlauf, F., Grahl, J., 2014. An implicitly parallel EDA based on Restricted Boltzmann Machines. In: Proceedings of the 2014 Conference on Genetic and Evolutionary Computation. GECCO '14. ACM, New York, NY, USA, pp. 1055–1062. URL http://doi.acm.org/10.1145/2576768.2598273
 Qian, N., 1999. On the Momentum Term in Gradient Descent Learning Algorithms. Neural Networks 12 (1), 145–151.
 Salakhutdinov, R., Mnih, A., Hinton, G. E., 2007. Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML 2007). ACM, pp. 791–798.
 Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6 (2), 461–464.
 Shim, V. A., Tan, K. C., 2012. Probabilistic graphical approaches for learning, modeling, and sampling in evolutionary multi-objective optimization. In: Liu, J., Alippi, C., Bouchon-Meunier, B., Greenwood, G., Abbass, H. (Eds.), Advances in Computational Intelligence. Vol. 7311 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 122–144.
 Shim, V. A., Tan, K. C., Chia, J. Y., 2010. Probabilistic based evolutionary optimizers in bi-objective travelling salesman problem. In: Deb, K., Bhattacharya, A., Chakraborti, N., Chakroborty, P., Das, S., Dutta, J., Gupta, S., Jain, A., Aggarwal, V., Branke, J., Louis, S., Tan, K. (Eds.), Simulated Evolution and Learning. Vol. 6457 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 588–592.
 Shim, V. A., Tan, K. C., Chia, J. Y., Al Mamun, A., Mar. 2013. Multi-objective optimization with estimation of distribution algorithm in a noisy environment. Evolutionary Computation 21 (1), 149–177.
 Smolensky, P., 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1. MIT Press, Cambridge, MA, USA, Ch. Information Processing in Dynamical Systems: Foundations of Harmony Theory, pp. 194–281.
 Zhang, B.-T., Shin, S.-Y., 2000. Bayesian evolutionary optimization using Helmholtz machines. Parallel Problem Solving from Nature - PPSN VI, 827–836.