Effective linkage learning using
loworder statistics and clustering^{1}^{1}1Submitted to IEEE Transactions on Evolutionary Computation
Leonardo Emmendorfer, Aurora Pozo
Numerical Methods in Egnineering, PhD program
Department of Computer Science
Federal University of Paraná, Brazil
{leonardo,aurora}@inf.ufpr.br
Abstract
The adoption of probabilistic models for the best individuals found so far is a powerful approach for evolutionary computation. Increasingly more complex models have been used by estimation of distribution algorithms (EDAs), which often result better effectiveness on finding the global optima for hard optimization problems. Supervised and unsupervised learning of Bayesian networks are very effective options, since those models are able to capture interactions of high order among the variables of a problem. Diversity preservation, through niching techniques, has also shown to be very important to allow the identification of the problem structure as much as for keeping several global optima. Recently, clustering was evaluated as an effective niching technique for EDAs, but the performance of simpler loworder EDAs was not shown to be much improved by clustering, except for some simple multimodal problems. This work proposes and evaluates a combination operator guided by a measure from information theory which allows a clustered loworder EDA to effectively solve a comprehensive range of benchmark optimization problems.
Keywords: clustering, evolutionary computation, linkage learning, niching.
1 Introduction
Evolutionary algorithms solve optimization problems by evolving successive populations of solutions until convergence occurs. Two steps are usually present at each generation: selection of promising solutions and creation of new solutions in order to obtain a new population.
Combination of genetic information is a major concern in evolutionary computation. In the simple genetic algorithm (sGA) [1] this mechanism is implemented as the crossover operator, which creates a new individual from two parents by combining portions of both strings. Recently, estimation of distribution algorithms (EDAs) started a novel approach for learning information from the best individuals, which involves inferring a probabilistic model and sampling from this model in order to generate the next population. Combination of information is achieved in EDAs since a single model is built from several good individuals. Unfortunately, combining different individuals may lead to poor results if the model adopted is not expressive enough.
Simpler order1 EDAs adopt probabilistic models which assume independence among genes. This class of EDA is known for its simplicity and computational efficiency in model learning, since no search for model structure needs to be performed [2]. Further, the simple conception and implementation should make those algorithms very attractive. Their low effectiveness on harder benchmark problems, however, is unacceptable. This is a major drawback, since genetic and evolutionary algorithms are known for their wide applicability and robustness [3].
High order EDAs, on the other hand, are based on learning the linkage among genes by inferring expressive probabilistic models based on searching for a factorization, which captures the dependencies among genes. Good results are reported for several problems in the literature whereas a high computational cost associated to the model induction stage is imposed in this class of EDAs. Finding a factorization can be a computationally expensive process and the resulting graph is often a suboptimal solution [4][5].
One of the most important efforts to make EDAs more effective is to adopt clustering as a strong niching approach, inducing the preservation of diversity in the population. Niching is crutial for evolutionary computation in general [6] and for EDAs in particular [7]. It improves the identification of the problem structure as much as enhances the chances of finding a higher number of global optima on multimodal problems. Kmeans clustering algorithm [8] has recently been applied as a niching technique based on grouping genotypically similar solutions together. The performance of simpler loworder EDAs, however, was not show to be much improved by clustering except for some simple unstructured multimodal problems. Loworder clustered EDAs have not been able to solve hard deceptive structured problems [9].
The main contribution of this paper is to show that a simple loworder EDA aided by clustering the population and guided by information measures is able to perform linkage learning and, therefore, solve a representative set of benchmark problems. This work extends a previous paper [10], where some of the ideas and results reported here were first presented. Here the foundations of the algorithm and operators proposed are discussed in a more detailed fashion, and a wider set of experiments is presented.
The main difference between the new approach and other similar clustered loworder EDAs is on how to deal with the combination of information. Clustering, or similar unsupervised learning approaches, are usually applied to the population of EDAs in order to prevent combination of different niches (clusters). Conversely, the new operator trusts in the combination of information from different clusters as an effective method for exploring the search space. This combination, however, must be carefully performed, respecting the information carried out by each cluster. The new operator captures relevant information about the problem structure from the probabilistic models of the clusters and builds a combined probabilistic model from two different parent clusters, attempting to maintain the most informative part of each parent intact, hence preserving the problem structure.
Experiments illustrate the effectiveness of this combination operator on detecting the problem structure. When the operator is turned off or replaced by a random recombination, the algorithm is not able to find the global optima of any benchmark problem tested. The problems adopted here illustrate several aspects which have been recently considered as tricky for EDAs, such as deception [11], symmetry [9], hierarchy [12], global multimodality [3] and the presence of overlapping building blocks [13]. The structured fashion of those and other classes of problems makes them hard for loworder and even for highorder EDAs.
This work is organized as follows. Section 2 discusses the main concepts related to the field of estimation of distribution algorithms. Section 3 reviews the most closely related work concerning to the application of clustering in estimation of distribution algorithms. A discussion about how a generic clustered order 1 EDA could detect higher order of interactions is presented in Section 4, where a new operator guided by information measures is proposed.
Section 5 presents the algorithm PBIL, which implements the ideas discussed in Section 4. A default parameter setting for the algorithm is discussed in Section 6. Section 7 describes the benchmark optimization problems which were adopted in this paper. Section 8 presents empirical verifications of the performance of PBIL, comparing it to stateoftheart compentent EDAs. Finally, Section 9 discusses results and implications of this work.
2 Linkage Learning and Estimation of Distribution Algorithms
Interacting variables are more informative together than alone. Interaction among variables is an irreducible whole; a dependency which cannot be broken [14]. A classic example from machine learning literature related to the importance of interaction among variables is the XOR problem [15] which is not linearly separable and cannot be solved without capturing interactions.
In genetic and evolutionary computation, the identification and preservation of important interactions among genes is called linkage learning and this topic gained increasing attention from the community. Most linkage learning techniques aim at the identification of substructures which should be conserved during combination [16]. A similar concept in genomics, called genetic linkage, is defined as the association of genes on the same chromosome. When two genes are independent, the Mendelian law of independent assortment states that the segregation of one gene is independent of the segregation of the other [17].
Simple genetic algorithm (sGA) with onepoint crossover relies on the ordering of the genes in the codification of the problem. In order to achieve success it is required that interacting variables should be coded as nearby genes in the chromosome in such a way that crossover would less probably cause the disruption of substructures. More recently other approaches adopted alternative schemes ranging from the simple reordering of the genes to more complex mechanisms like subspecification and superspecification of solutions, as in Messy GA [18]. This algorithm adopts a twostage evolutionary process where substructures are identified in a first stage and subsequently combined. This ensures that substructures were all correctly identified before going on and exploring the combinations among them.
A different situation occurs in Estimation of Distribution Algorithms (EDAs), where probabilistic modeling is applied as the learning mechanism for the evolutionary process. Algorithm 1 is the general description of an EDA.
An EDA solves a problem by building probabilistic models of the best solutions from the population. New solutions (individuals) are sampled from the model, so the traditional operators of crossover and mutation, which operate at the level of the individuals, are not suited for EDAs.
The most effective EDAs adopt fairly complex models which should capture the whole structure of dependence among the genes of the problem. The major motivation for learning complex and powerful models is to acquire and maintain information about interactions among the variables of the problem. Some earlier EDAs, however, learn probabilistic models which assume independence among all genes, as
(1) 
where  is the joint probabilistic model 

is the binary vector of an individual  
is the marginal model of gene i 
For a binary codification, the marginal model for each gene is the binomial proportion of 1s in that gene for the selected individuals. Algorithms based on this kind of model are not able to detect any interactions among the genes without additional support, therefore they present a similar behavior as a GA with uniform crossover [2]. Since no interactions are considered and only order 1 statistics are used, those algorithms should be called order 1 EDAs. One of the most important member of this class is PBIL (Population Based Incremental Learning) [19].
In PBIL the population is represented by a probability vector , as in (1), where represents the probability of an individual to possess a 1 in gene . At each generation, M individuals are generated and the best N are selected to update the model.
On the other hand, an EDA based on Bayesian networks can capture the interactions among the genes:
(2) 
where  is the joint probabilistic model 

is the binary vector of an individual  
is the marginal model of gene i  
is the set of parents of 
The set comprises the variables on which depends. This factorization is only possible because, for each , is assumed to be independent of its nondescendants, given its parents [20]. Figure 2 illustrates two possible factorization assumptions and figure 1 illustrates a Bayesian network learned at some stage of an evolutionary process for the concatenated trap4 problem [11] with 4 subproblems. EDAs based on higher order statistics (typically higher than order 2) should be called higherorder EDAs, in contrast to loworder EDAs which use statistical models where no interactions or, at most, only pairwise interactions are considered. Two very representative higherorder EDAs which adopt Bayesian Networks are the Bayesian Optimization Algorithm (BOA) [11] and the Estimation of Bayesian Network Algorithm (EBNA) [21]. The difference among them is on the metric used to evaluate factorizations when searching for the model structure.
3 Related work
It was already shown that clustering improves the performance of some EDAs for certain classes of problems. For globally multimodal problems, for instance, clustering has revealed a promising approach [3]. Such problems present several global optima or, in other words, several optimal solutions with the same fitness. Those problems may be very tricky for an evolutionary algorithm to solve, since slow convergence to one of the optima (genetic drift) often occurs. The retarded convergence is explained by the combination of solutions coming from different regions of the search space, which often results in poor solutions. The application of clustering in EDAs when solving globally multimodal problems can be described as the separation of the population in subpopulations (one for each cluster), and the subsequent learning of different probabilistic models for each subpopulation. Breeding generally involves no combination among different subpopulations. Clustering improves the performance of the evolutionary algorithm by such an interbreeding avoidance. However, this approach leads to a smaller exploration of the space, since combination from distant regions could be favorable.
Clustering is adopted, for the first time, as the niching technique of an EDA in [9]. Only binary codifications are considered, and kmeans clustering was applied to obtain subpopulations at each generation. Representative EDAs and widely known globally multimodal test problems were used in the experiments. The most noticeable result is the major improvement in effectiveness and convergence speed of UMDA – a singleorder EDA – when clustering is applied. The clustered UMDA outperforms a simple UMDA for multimodal problems without a complex structure like twomax. The number of global optima found and stably maintained is increased with clustering. Those results are explained by the speciation of each cluster in a different peak.
However, for more structured problems, clustering does not help UMDA very much. Actually, the literature does not report any loworder EDA which is able to solve hard deceptive structured problems, even when aided by clustering. Singleorder clustered EDAs are only expected to perform well on globally multimodal optimization problems because combination of solutions from different basins is avoided [3]. Actually, the beneficial effect claimed by [9] is to control combination among different subpopulations since interbreeding is considered as harmful for the evolutionary process. Exploration of the search space is done within each cluster by an evolutionary algorithm in a nearly parallel fashion.
The authors of [9] conclude by stating that more sophisticated operators, besides just clustering, should be used for an efficient scaleup behavior on loworder EDAs on difficult problems.
A similar approach to clustering is the parallelization of EDAs by adopting multiple subpopulations and migration. In [22], a simple recombination operator called PVwise uniform crossover is proposed, which is similar to GA uniform crossover. After two parent PVs were selected for combination a new temporary PV is built by randomly selecting, for each gene, from which of the two parents should the binomial proportion be taken from. A new individual is sampled from this PV.
It is possible to derive a general framework for a very simple EDA based on single order statistics and aided by some sort of clustering, as in [9]. Incrementally, or at each generation, the best individuals would be clustered by similarity of their genotype and, for each cluster, independent models for each gene would be learned. Assuming a fixed number of clusters and a fixed number of genes, a total of independent binomial proportions would be kept updated during the evolutionary process. New individuals would be generated by sampling from the probability vector of a single cluster, or combination among clusters (interbreeding), similarly as in [22].
Another perspective for the application of clustering in EDAs can be found in [3]. The unsupervised estimation of Bayesian network algorithm (UEBNA) uses a modelbased approach for clustering, based on the unsupervised learning of Bayesian networks at each generation. An unobserved variable is included in the model, which represents the unknown cluster label. The Bayesian network represents a joint probabilistic model for the best individuals, considering a factorization which includes all genes and the cluster random variable .
The structure of a Bayesian network describes a factorization as a directed acyclic graph, as illustrated in figure 2b. The nodes denote random variables and the edges represent dependencies among those variables. Additionally, a set of parameters of the network specify conditional distributions given the possible values of the corresponding parents of each variable . UEBNA adopts unsupervised learning of Bayesian networks since the cluster variable is unspecified. The structure of this kind of Bayesian network is illustrated in figure 3a.
The algorithm is expected to detect the correct setting of diverse global optima to diverse subpopulations and capture dependencies among the variables of the problem through learning Bayesian networks at successive generations. Experimental validation confirms the effectiveness of the algorithm for a set of globally multimodal structured problems. The most relevant problem tested was graph bisection, which is a hard problem even for higherorder EDAs like EBNA or BOA. Actually, UEBNA outperforms EBNA achieving a higher number of global optima especially for greater instances of graph bisection, which validates the relevance of clustering for evolutionary computation.
Unfortunately, the Bayesian networkbased model adopted by EBNA, UEBNA, BOA and other higherorder EDAs requires a computationally complex learning stage, which must be run at each generation. Actually, finding the best structure of a Bayesian network is known to be a NPhard optimization problem itself. In practice, a greedy algorithm is executed to perform the search, which can be based on penalized maximum likelihood. Penalization avoids excessively complex network structures.
It was already recognized that learning the structure of a Bayesian network at each generation may become a bottleneck for EDAs and some attempts to overcome this problem have been proposed [4]. On the other hand, incorporating even more elements into the model, as in [3] where clustering labels are used, seems to be worthy and should be considered, but this approach results models of higher complexity.
It would be very worthy, however, if a simpler EDA could reach similar performance as more complex EDAs do. As suggested by [9], clustering itself does not solve the problem, but further mechanisms could also be usefully incorporated. In the next sections a combination operator is evaluated, which allows a loworder clustered EDA to achieve success on the linkage learning task. This operator explores information gained from clustering itself and from single order statistics from the subpopulations. The proposed approach allows an effective exploration of the search space by combining the most relevant information. Experiments illustrate that the algorithm preserves the structure of complex problems which present high interactions among variables, despite only adopting statistics of low order.
Two major motivations for the application of clustering in EDAs can be derived form the works reviewed above: (i) to increase preservation of the diversity in the population and (ii) to avoid combination from different peaks in multimodal optimization. Another motivation for clustering is (iii) to allow the identification of a mixture of distributions in continuous optimization. This last motivation is often described in the literature [23][7], but it is not directly related to the work herein presented.
The relevance of motivation (i) is clear since diversity maintenance prevents premature convergence to local optima and allows for a proper exploration of the search space. This topic is related to the niching methods of simple genetic algorithms, and was widely studied. Motivation (ii) is, however, conflictive with the first. The combination of information from distant regions is a strong mechanism for exploring the space, since it could potentially result in better solutions. Even combination from different local optima is useful, since those optima may result from different substructures, which were already found and should be combined. However, this aspect was not well studied yet and is explored later in this paper.
4 Using an information measure to guide combination
Loworder EDAs have not been effective on solving several classes of problems, particularlly when high order of interactions among the variables occur. Clustering was applied as a niching technique for EDAs, as reviewed in Section 3. Individuals are grouped by similarity of their genotype, and combination among different groups is usually avoided. However, the performance of simple loworder EDAs was not much increased by using clustering.
One of the reasons which motivate the application of niching is to avoid premature convergence to local optima. Diversity preservation, however, is also important even when a single optimum exists [6] since substructures should be explored and recombined before the global optima is correctly identified.
The existence of substructures in the problem is also an important factor which motivates niching. When an evolutionary algorithm is relatively away from convergence, we can conjecture that the main source of diversity in the population is due to the diversity of substructures, since global optima are not clear yet. Thus, at some stage of the process, a clustering algorithm would group together individuals possessing the same substructures or, in other words, different substructures would characterize each subpopulation.
The concepts which characterize each cluster are, thus, relevant information which should be explored. Actually, the combination of important parts of promising solutions found so far should guide GAs and EDAs during the exploration of the search space [9]. Therefore, combination from different clusters should not be avoided, but carefully performed.
This Section describes a combination mechanism for clustered EDAs which carefully chooses how to combine information from clusters. Section 3 reviewed a random naive combination mechanism called PVwise crossover [22]. However, it is shown later that this operator is not effective for the exploration of the search space and is not able to combine valuable information from parents, similarly as the GA with uniform crossover.
A more sophisticated but still very computationally efficient operator can be built using ideas from information theory. It also combines information from different clusters by building a temporary PV from two randomly selected parent PVs. The difference is on how to choose from which parent to take each binomial proportion. A measure from information theory guides the choice of the best parent for each gene. It aims for a careful combination of relevant information from two PVs, attempting to maintain the most informative part of each parent. This operator is called conceptguided combination or, for brevity, the cgcombination.
During a cgcombination, two parent clusters – and – are, randomly and proportionally to the mean fitness, selected. A temporary PV, from which a single new individual will be created, is obtained by taking proportions for each position from either or . A measure from information theory is used to guide that choice in order to select always the most informative parent for each gene. Let
(3) 
be the entropy of the distribution of gene , where is the proportion of individuals possessing the value for gene in the whole population (). Similarly, let
(4) 
be the entropy of the distribution of the same gene without taking cluster into account in the estimated proportions. The proportion is computed as .
The measure of how informative is a cluster to a gene is therefore the difference in the entropy of the distribution of the gene before and after observing cluster :
(5) 
Thus the decision of how to build the temporary PV becomes simple: choose, for each gene , the parent with the greatest among all s. Each position of the temporary PV is defined as
(6) 
After computation of a new individual is generated by sampling from each position of independently. Figure 4 shows a scenario with clusters and 12 variables. Figure 5 illustrates the creation of after an interbreeding for a small problem with four variables.
The cgcombination relies that clustering could group together individuals possessing the same substructure, and attepts to extract the information about that substructure from and . A given cluster is expected to be informative for all genes which define a building block if most individuals in that cluster possess that building block. Overlapping building blocks do not represent a problem: two clusters would be informative to the same gene and this would still preserve substructures. This behavior is illustrated in Section 8.
In addition, highfitness similar individuals resulting form combinations of lowerlevel building blocks would also be grouped together. Thus, relevant substructures are combined as the process goes on.
The next Section presents a loworder EDA that effectively adopts the cgcombination as the interbreeding mechanism.
5 PBIL: A Clusteringbased Evolutionary Algorithm
The previous Section presented an operator called cgcombination, which supports the combination of information from different clusters and, potentially, allows a loworder clustered EDA to effectively perform linkage learning. This operator can be used in the “interbreed” step. Each cluster defines a probabilistic model and genes are assumed conditionally independent given the cluster label, therefore only dependencies between a gene and the cluster label are considered. The resulting model is, therefore, limited to order 2 statistics.
The algorithm PBIL (Conceptguided PopulationBased Incremental Learning) [10] is a loworder EDA which follows the proposed approach and performs interbreeding. Algorithm 2 shows the pseudocode of PBIL.
PBIL follows an incremental architecture where a single individual is generated at each iteration without the adoption of a succession of “generations” as other EDAs often do. A fixed number of clusters are maintained and continuously updated. Whenever a new individual is generated, it replaces the worst individual in the population (if the new individual is not even worse) and, subsequently, clustering hypothesis and probabilistic models for each cluster are just updated through a single typical kmeans step, and not fully relearned. Each cluster defines a subpopulation and, since only binary variables are allowed, then the probabilistic models for the subpopulations are just the corresponding binomial proportions which denote the proportions of individuals with the value 1 for each gene on each cluster .
Sampling from one of the probability vectors (PVs) will generate a new individual, similarly to other clustered EDAs. A PV is chosen randomly and proportionally to the mean fitness of the individuals of each corresponding cluster.
Sampling from a single PV, however, is not the only option for generating a new individual in PBIL, since the combination of PVs is also considered. Actually, the cgcombination is chosen as the default interbreeding mechanism for PBIL. Two interbreeding mechanisms were discussed in the previous Section. Both propose to combine two PVs, obtain a temporary PV and sample from to generate a new individual. Experiments in the next sections illustrate how the PV uniform crossover fails on combining relevant information from two clusters, while the cgcombination succeeds.
Two features were added in order to increase the performance of the algorithm. The first feature was motivated by the need for local search. The maximum likelihood estimator (MLE) of a binomial proportion is the simple mean of successes. In our case, it should be interpreted as the proportion of ones at each gene, or
(7) 
where is the number of individuals in cluster possessing the value at gene and is the number of individuals in cluster . Unfortunately, when all individuals in a cluster have the same value (all 0’s, or all 1’s) in a certain locus, this estimator saturates at one of the extremes ( or ). Sampling from those extreme proportions wipes out the chance of the alternative value, 1 or 0, to be generated and this behavior is harmful for local search.
Parameter  Description  Default 

value  
initial population size  –  
working population size  –  
number of clusters  –  
probability of interbreeding  50%  
probability of an old  50%  
clustering hyposthesis to be used  
probability of the Wilson  50%  
estimator to be used  

A perturbation mechanism was added which changes slightly the binomial proportions estimated and, therefore, allows for an allele to be generated even if all individual possess the complementary allele. The Wilson estimator reviewed in [24] incorporates a degree of uncertainty by estimating the binomial proportion as
(8) 
is used instead of (7), which is the maximum likelihood estimator (sample mean).
Thus, for instance, even if all 100 individuals in a cluster possess the value 1 in a locus there still remains a probability of 2% of a new individual to be generated with a 0 in that same locus. Some kind of mutation operator could also be applied, probably with similar effects on local search. Bayesian inference could also be applied since it properly express the uncertainty about the value of the parameter through the specification of a prior distribution for the parameter.
Another feature was added in order to improve recombination for overlapping and hierarchical building blocks. When smaller BBs start blending, new BBs emerge. This causes the loss of older clustering hypotheses, consequently inhibiting those initial BBs in recent individuals. This is a consequence of the well known race between selection and innovation [25]. For overlapping and/or hierarchical building blocks, this constitutes a great problem. A potential solution is the maintenance of some old BBs (therefore, old clustering hypotheses), which can be used to generate new individuals. Both the current and an old clustering hypothesis compete when breeding. This mechanism works as follows: both and matrixes are stored, for some old wellperforming clustering hypotheses. The cgcombination selects randomly from one of two clustering hypotheses: the old set and the new, recently updated set . It also computes performance records about both sets: if the new individual, generated by one of the sets, is selected, then the performance variable of the corresponding set is updated (increased by one). If the new set overperforms the old one, then the old set and its performance variable are updated, and the performance variable of the new set is set to zero.
All parameters for PBIL are described in table 1. Some default values for most parameters were set after empirical investigation, which is shown in Section 6. (probability of interbreeding), (probability of an old clustering hypothesis to be used during breeding instead of the current one) and (probability of the Wilson estimator to be used instead of the sample mean) are all set to by default.
Termination criterion of the algorithm was set to be the loss of diversity inside PVs: the algorithm finishes when all s saturate (reach some value above or below ). This is said to be the condition for the convergence.
The user must set the values for some parameters for which no default values are provided: initial population size , working population size and the number of clusters . The working population exists after the initialization and is created from the best individuals of the initial population as described in algorithm 2.
The next Section describes an empirical investigation performed to find a default parameter setting for the algorithm.
6 Parameter setting
This Section shows an exploration in the parameter space using three representative test problems: shuffled HIFF [12], concatenated trap5 [11] and graph bisection [3]. The experiment aims to find default values for some parameters of PBIL. We start from a hypothesis that , and should be all set to . The problem instances, which are descrbed in the next section) are Pshuff64, Ptrapfive50 and Pcatring42, all three relatively small (64, 50 and 42 variables, respectively), but are shown to be enough for revealing an influence of the parameters on the performance of the algorithm. Pcatring42 and Pshuff64 are globally multimodal, possessing 6 and 2 global optima respectively, while Ptrapfive50 has a single global optimum.
The other three parameters are maintained fixed for all runs of each problem in this experiment: , and for Pcatring42 and , and for Pshuff64 and Ptrapfive50. Those values were previously verified to be near the minimal values at which the algorithm becomes able to find all global solutions for each of the three instances considered here in this evaluation.
Figure 6 shows the results of this investigation. Each graph shows the variation of one of the three parameters considered, where the other two stay fixed. When nothing else is stated, the value for the remaining two considered parameters is .
Changing one of the parameters can dramatically affect the behavior of the algorithm for one or more of the problems. Setting to or causes a reduction on the performance on Pshuff64 since both old and new clustering hypothesis store information about the subsequent levels of the problem structure. The performance on other problems is, however, not so influenced by , except when is at , where performance on Ptrapfive50 also drops down. Using only old clustering hypothesis seems to be harmful for the algorithm.
Setting the probability of interbreeding – – to zero affects the performance of the algorithm for a wider range of problems, since this mechanism is responsible for recombination. Increasing slows down the convergence for all instances tested.
The parameter is also negatively related to the convergence speed, but lower values for should be avoided, at least for HIFF, since no global solution for Pshuff64 is found for . The effectiveness on other problems was not affected by this parameter.
After analyzing all results and recognizing that nearly extreme values for all of the parameters tested are, often, undesirable, therefore we set , and all to .
7 Some benchmark optimization problems
This section discusses and revises the benchmark problems used in the empirical evaluation Section. The reader is referred to [3][12][11][13] for a more detailed description of the problems.
A representative set of benchmark problems was chosen, which are known to be hard for a GA and most EDAs to solve. The existence of building blocks, or a structure in the problem, generally prevents GA and loworder EDAs from achieving success, since genes are assumed independent and structure is not preserved. Simple GA with onepoint crossover and a proper encoding performs well for some structured problems, but the best encoding requires previous knowledge about the problem structure, something that is not available in general.
Three general classes of structured problems are considered. The simpler class is that of additively decomposable functions (ADFs). In ADFs there is no dependence structure among the subproblems since the contribution of each substructure to the overall fitness of the solution is independent of the value of the remaining variables, therefore all subproblems can be solved independently. The concatenated ktrap function [11], which is better described below, is an example of an ADF with subproblems, where is the size of the problem and is the size of each subproblem. The highest order of dependence in the concatenated ktrap is, therefore, . Most ADFs proposed in the literature, including trap5, are deceptive. This means that lower order statistics would lead the algorithm away from finding the correct building block, and motivates the use of higher order statistics.
A slight variation of ADFs is obtained when overlapping building blocks exist but the problem structure is still defined by additive functions. When building blocks overlap, some genes are present in more than one subproblem simultaneously and the problem structure learning task is harder. This class of problems should be called overlapping additively decomposable functions (OADFs). [13] discusses how to design combination strategies for GA when overlapping building blocks occur, in order to minimize the number of building blocks disrupted.
Another class of GAhard problems comprises Hierarchically decomposable functions (HDF). They were proposed after a criticism to ADFs, which present no dependence structure among the subproblems. HDFs, in turn, are much harder to optimize since the substructures interact. The fitness contribution of two substructures, when combined, is different from the sum of their individual contributions. One of the most important HDFs is HIFF [12] which presents interactions up to the size of the problem. HDFs illustrate a more general class of problems than ADFs and demand a more complex mechanism for combination. Substructures of each level must generally be first identified and recombined before the next level starts emerging.
Apart from the taxonomy above, two other important aspects of the benchmark problems recently approached in the literature should be mentioned. Problems that present several global optima, called globally multimodal problems, represent a source of difficulties for GAs and EDAs and require efficient niching mechanisms. Also, a class of symmetrical multimodal problems has recently been attracting attention. Symmetry occurs when some regularities on the landscape lead to the existence of complementary solutions with identical fitness and, consequently, to complementary global solutions. Globally multimodal and/or symmetrical problems can be considered hard for most evolutionary algorithms because combining solutions from two symmetric local optima will probably not be useful to find the global optima.
Next, the problems used in this work are briefly stated.
Twomax
Twomax is a simple function of binary variables:
(9) 
There are two global maxima: and , both with fitness equal to .
Concatenated trap5
The concatenated trap5 [11] is an additively decomposable function which possess a single global optima (the string fulfilled with 1s), with a fitness equal to the size of the problem. The input string is partitioned into disjoint groups of five bits each. A 5bit trap function is applied to each of the groups and the fitness of the individual is the sum of the contributions of each 5bit group. The algorithm should be able to learn and preserve that structure in order to achieve success. The 5bit trap function is:
(10) 
where u is sum of the bits in the group.
For each contiguous nonoverlapping block of five variables, two building blocks can be identified: and , which contribute with and to the overall fitness, respectively. Single order (or any lower than 5order) statistics may lead apart from the optimal value for each gene, what makes this problem deceptive.
Notice that several combinations of optimal and suboptimal building blocks may lead to multiple suboptimal solutions. Actually, since there are 2 building blocks for a partition size of 5, then for a problem instance with size p, there are local optima and only one global optima, whose fitness is very close to that of the second best optima. An instance of size 50 (Ptrapfive50) is considered later.
Overlapping concatenated trap5
This problem is an overlapping additive decomposable function, with a fixed overlapping length. The overlapping scheme is circular as figure 7 illustrates for a problem with 6 BBs. Our test instance has 60 genes and overlapping length 2 (therefore 20 overlapping building blocks) called Poverfive60, which is also considered in [13]. The circular nature of the overlapping scheme with length 2 means that every building block shares 4 genes (2 with each neighbor) and the first building block is the neighbor of the last. In figure 7, for instance, building block 1 shares 2 genes with each neighbor, building blocks 6 and 2.
This problem is considered harder than the previous, concatenated trap5, because in the overlapping version subproblems are not separable.
Hierarchical ifandonlyif
HIFF [12] is an hierarchically decomposable function. A binary string of size represents a solution, where is the number of levels in the hierarchy. The fitness of a solution is given by
(11) 
where B is a block of bits , is the size of the block, is the ith element in block. and are the left and right halves of the block B . The evaluation starts with the chromosome as a block.
Since tight linkage occurs, HIFF is relatively simple for a GA with onepoint crossover to solve. In the shuffled version, however, the variables are randomly rearranged in order to avoid tight linkage. This version is much harder for a GA and some mechanism for detecting the problem structure is recommended. An experiment with the shuffled version of HIFF with 64 variables (Pshuff64) is described later. For EDAs in general the ordering of the genes in the codification is not relevant, since the probabilistic models do not take any advantadge from this information, as opposed to GA which actually may depend on an adequate codification.
Graph Bisection
The objective of the graph bisection problem is to obtain two equally sized sets of nodes from the original graph such that the number of vertices linking both sets is minimal. The fitness of a given solution is the number of nodes minus the number of links between both sets, making it a maximization problem. The codification adopted is based on a binary vector of size n where the ith gene represents the label of the ith node.
The set of possible solutions is restricted to the condition of equally sized sets; therefore, some individuals may appear which are unfeasible. A repair operator was adopted here and in [3] which randomly inverts the value of some genes. Iteratively a randomly selected gene in the majority is inverted until a feasible solution is obtained.
All instances of the graph bisection considered in [3] are also studied here. Figure 8 illustrates three of those instances: Pgrid36 and Pcatring42 and Pcatring84, which possess 2, 6 and 6 global optima, respectively. The other instances considered are Pgrid16, Pgrid64, Pcat28, Pcat42, Pcat56, Pcatring28, Pcatring56 and Pcatring84, with the number of global optima ranging from 2 to 6.
8 Empirical Evaluation
PBIL using  PBIL using PV  
problem (size)  conceptguided  uniform 
combination  crossover  
Success %  Success %  
Shuffled HIFF (128)  97%  0% 
Concatenated trap5 (100)  100%  0% 
Overlapping c. trap5 (60)  100%  0% 
Twomax (100)  100%  100% 
PBIL using  PBIL using PV  
problem (size)  conceptguided  uniform  
combination  crossover  
Eval.  sd  Eval.  sd  
Shuffled HIFF (128)  105134  11612  81646  13474 
Concatenated trap5 (100)  90474  7203  49391  7366 
Overlapping c. trap5 (60)  55649  4207  23729  1048 
Twomax (100)  4825  216  4867  318 
This Section illustrates the performance of the cgcombination and the PBIL algorithm for a representative set of benchmark problems. Those problems summarize most of the hurdles that have been reported for benchmark problems for EDAs: deception, multimodallity, overlapping building blocks, global multimodallity and symmetry.
Initially, the two interbreeding mechanisms which were described in section 4 are compared.
Next, a comparison is performed between PBIL and UEBNA [3], which is a stateoftheart EDA, both solving several instances of the graph bisection problem.
Comparing two interbreeding mechanisms
These first experiments illustrates the behavior of the cgcombination, which is compared to the PVwise uniform crossover [22]. This later operator randomly mixes two building blocks during an interbreeding, while the cgcombination carefully chooses from which parent to take each position of the probability vector, as described in Section 4. In this experiment, the algorithm PBIL was ran with its default parameters. Additionally, with clusters and population sizes are .
Before a quantitative comparison, the typical behavior of both operators is illustrated for a single run for the concatenated trap5 problem with size 15. The problem is decomposable, and for each subproblem there is an optimal building block. The three optimal building blocks – BB0, BB1 and BB2 – must be found and subsequently combined in order to reach the global solution.
Finding all optimal building blocks is not difficult when the PVwise uniform crossover is used, as illustrated in figure 9; the difficult thing is to mix them. It is noticeable that each cluster specializes in a specific pattern. Cluster 0, for instance, started attracting some individuals possessing BB2. As the evolutionary process continues, the proportion of individuals possessing BB2 which are attracted to cluster 0 grows, until around 500 fitness evaluations, when 100% of the individuals of cluster 0 possess BB2. Similarly, cluster 2 specializes on BB0 and cluster 3 on BB1. A different setting may occur if another run is performed for a different random seed, but the general behavior is similar. The global solution, however, was not found.
Running PBIL with the cgcombination operator for the same random seed (therefore the same initial population) leads to a much better behavior, as figure 10 shows. In the first steps of the evolutionary process (roughly until 400 fitness evaluations) speciation is clear. Subsequently, as the number of successful combinations grows, clusters attract individuals possessing two or even three optimal building blocks simultaneously. Cluster , finally, specializes on the global optima of the problem after around fitness evaluations. At that moment the algorithm converges, since all positions of all PVs saturate above or below . It is interesting to note that some individuals possessing only one correct building block are still in the population even after the combination process is at advanced stages.
A more quantitative comparison between both interbreeding operators is shown in table 2. Four representative problems were chosen: shuffled HIFF, concatenated trap5, overlapping concatenated trap5 and twomax, with problem sizes 128, 100, 60 and 100 respectively. Table 2 reports the fraction of success (Succ. %), which is the fraction of runs where at least one global optima is found, and statistics for the number of fitness evaluation (Eval.) until convergence of PBIL.
Using the cgcombination, at least one global optimum is found in 97% of the runs for HIFF and in 100% of the runs for the other problems. Actually, when this operator is used, all global optima were found in 93% of the runs for HIFF and in 100% of the runs for the other problems. The PVwise uniform crossover was not able to reach any global optima for any of the problems tested, except for twomax, which is a very simple problem without a complex structure. The results are not surprising as it is well known that blind crossover operators that don’t respect the structure of the problem are condemned to fail on these types of problems. A very similar behavior is observed when no interbreeding is performed, but the results are not reported in the table.
Multimodal optimization and symmetry
Problem  EDA  Optima  sd  Eval.  sd 
Pgrid16  UEBNA  2.0  0.0  51400  2366 
(2 peaks)  PBIL  2.0  0.0  10126  606 
Pgrid36  UEBNA  2.0  0.0  85600  8462 
(2 peaks)  PBIL  2.0  0.0  28963  10754 
Pgrid64  UEBNA  2.0  0.0  124900  3479 
(2 peaks)  PBIL  2.0  0.0  64245  10999 
Pcat28  UEBNA  2.0  0.0  57100  2846 
(2 peaks)  PBIL  2.0  0.0  14311  1299 
Pcat42  UEBNA  2.0  0.0  73900  1449 
(2 peaks)  PBIL  2.0  0.0  29714  3644 
Pcat56  UEBNA  2.0  0.0  96400  2366 
(2 peaks)  PBIL  2.0  0.0  46151  4362 
Pcatring28  UEBNA  4.0  0.0  54700  949 
(4 peaks)  PBIL  4.0  0.0  12694  853 
Pcatring56  UEBNA  3.8  0.4  96400  1897 
(4 peaks)  PBIL  3.9  0.3  48837  12115 
Pcatring42  UEBNA  5.9  0.3  75700  3302 
(6 peaks)  PBIL  6.0  0.0  32361  1513 
Pcatring84  UEBNA  4.8  0.8  121000  3162 
(6 peaks)  PBIL  5.7  0.7  84539  9300 

This experiment validates PBIL on multimodal optimization problems, some of which present interactions among genes. Those features require good niching and linkage learning capabilities from the evolutionary algorithm. A through comparison between PBIL and UEBNA is described. The latter was already shown to be competent to solve globally multimodal optimization problems. In [3] UEBNA is compared to other EDAs when solving of instances globally multimodal problems. Experiments reported in that work clearly show that UEBNA achieves much better efficiency and efficacy when compared to EBNA, which is based on the supervised induction of Bayesian networks and does not incorporate cluster labels.
The graph partitioning problem was used in this comparison. Some instances of this problem are hard for EDAs. Particularly, larger “ring” topologies (Pcatring56, Pcatring42 and Pcatring84) are challenging because they clearly present a structure of interactions among genes, besides multimodality and symmetry. In a binary codification of one gene per node there are small groups of 7 nodes which should be identified and combined until all global solutions are found. Figure 8 illustrates some of those instances.
All results for UEBNA presented here were extracted from [3]. Ten independent runs of each algorithm are performed for each problem instance. A fixed set of parameters for PBIL is used for all instances: , and (using all , which are the default parameters, resulted slightly inferior results). The initial population size is the same for PBIL and UEBNA. Additionally, the working population size of PBIL is set as . The only free parameter adopted here is , the number of clusters. The value of which empirically maximizes the performance of the algorithm is chosen, for PBIL, from the interval and, for UEBNA, from the results reported in [3]. A similar comparison to UEBNA was already reported in [10], but PBIL was run with default parameter set and the repair operator for the problem of graph partitioning was not used there.
Table 3 summarizes the results. The problems and algorithms are organized in rows and the data columns show mean standard deviation of the number of global optima found (Optimas.d.) and for mean standard deviation (Eval.s.d.) of the number of fitness evaluations performed until convergence, evaluated over 10 runs. All runs of both algorithms found at least one global optima for all instances tested.
It is clear from results that PBIL attains convergence for all global optima using less fitness evaluations. For smaller problem instances both algorithms find all global optima in all runs but, for greater problem instances, PBIL presents better efficiency. The mean number of global optima found is better than for UEBNA, noticeably for Pcatring56, Pcatring42 and Pcatring84. For the later, which presents global optima, PBIL found global optima, while UEBNA found global optima. This may suggest that PBIL presents a good scalability behavior. Further scalability tests were also shown in [10].
PBIL can attain good results with relatively small populations. It is also important to note that computational complexity of the learning step of PBIL is related to a kmeans update, what is less timeconsuming than learning Bayesian networks. Howerver, when comparing computational complexity of PBIL to other algorthms one should consider that the model is is updated after the introduction of a single individual, while most algorithms learn a model for each new population.
9 Conclusion
This work presents PBIL: an evolutionary algorithm based on the application of an interbreeding operator, which is guided by the information obtained by clustering the population of an EDA. Identification of substructures is performed by clustering and combination among substructures can be achieved by combining information from different clusters. The core algorithm is a simple clustered EDA that adopts order2 probabilistic models, which leads to a very parsimonious approach when compared to other competent EDAs. Its simple incremental approach is also noticeable, where individuals are successively generated from continuously updated binomial probabilistic models.
Probabilistic models used by PBIL are much simpler to infer when compared to Bayesian networks, which are currently the most effective models for EDAs. Actually, PBIL keeps probabilistic models updated after each new individual is generated and selected at the cost of a single kmeans update.
The most sophisticated part of the algorithm is the interbreeding mechanism, which respects the problem structure and relies on the diversity of those substructures as one of the causes for the existence of clusters in the population. Experiments show some evidence that this mechanism is very important for solving problems that present interactions among the variables. Literature generally recommends avoiding the combination from different clusters, since this is often unproductive. PBIL attempts to minimize bad results by carefully choosing how to combine the most informative portions.
An empirical search for default values for some parameters was performed, but a more formal investigation should also be done. Furthermore, it is important to understand how to set the population size and number of clusters for each kind and size of problem. Some work on this topic has already been done for other evolutionary algorithms [26] and the methodology could be adapted for our algorithm. The relation between some of the parameters should also be better explored. For instance, the number of clusters should be related to the population size . Actually, some empirical investigation (not shown) has pointed out that should be of the order of magnitude of . Actually, or was enough for most problems verified.
Future work will check for relevant features, like scalability, for a wider range of problems including HDFs like HIFF [12]. Further empirical investigation on the behavior of the interbreeding mechanism should also be performed in order to better explain how exploration is actually done for several classes of problems.
10 Acknowledgements
The authors would like to thank Fernando Lobo for his valuable contribution to improve this paper.
References
 [1] J. H. Holland, Adaptation in natural and artificial systems. Ann Arbor, MI: University of Michigan Press, 1975.
 [2] G. R. Harik, F. G. Lobo, and D. E. Goldberg, “The compact genetic algorithm,” IEEEEC, vol. 3, no. 4, p. 287, November 1999. [Online]. Available: citeseer.ist.psu.edu/harik98compact.html
 [3] J. Peña, J. Lozano, and P. Larrañaga, “Globally multimodal problem optimization via an estimation of distribution algorithm based on unsupervised learning of bayesian networks,” Evol. Comput., vol. 13, no. 1, pp. 43–66, 2005.
 [4] M. Pelikan, K. Saltry, and D. E. Goldberg, “Sporadic model building for for efficiency enhancement of hBOA,” University of Illinois at UrbanaChampaign, Urbana IL, Tech. Rep. ILLIGAL Report No. 2005026, 2005.
 [5] B. Yuan and M. Gallagher, “On the importance of diversity maintenance in estimation of distribution algorithms,” in GECCO ’05: Proceedings of the 2005 conference on Genetic and evolutionary computation. New York, NY, USA: ACM Press, 2005, pp. 719–726.
 [6] S. W. Mahfoud, “Niching methods for genetic algorithms,” Doctoral dissertation, University of Illinois at UrbanaChampaign,, Urbana, USA, 1995.
 [7] C. W. Ahn and R. S. Ramakrishna, “Clusteringbased probabilistic model fitting in estimation of distribution algorithms,” IEICE  Trans. Inf. Syst., vol. E89D, no. 1, pp. 381–383, 2006.
 [8] J. McQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkley symposium on Mathematics, Statistics and Probability, 1967, pp. 281–296.
 [9] M. Pelikan and D. E. Goldberg, “Genetic algorithms, clustering, and the breaking of symmetry,” in Parallel Problem Solving from Nature VI, 2000, pp. 385–394.
 [10] L. Emmendorfer and A. T. R. Pozo, “An incremental approach for niching and building block detection via clustering,” in Proceedings of the Seventh International Conference on Intelligent Systems Design and Applications, 2007, to be published.
 [11] M. Pelikan, D. E. Goldberg, and E.CantuPaz, “BOA: The Bayesian optimization algorithm,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO99), 1999, pp. 525–532.
 [12] R. A. Watson, G. Hornby, and J. B. Pollack, “Modeling buildingblock interdependency,” in PPSN V: Proceedings of the 5th International Conference on Parallel Problem Solving from Nature. London, UK: SpringerVerlag, 1998, pp. 97–108.
 [13] T.L. Yu, K. Sastry, , D. E. Goldberg, and D. D. Johnson, “Linkage learning, overlapping building blocks, and systematic strategy for recombination,” in GECCO ’05: Proceedings of the 2005 conference on Genetic and evolutionary computation. New York, NY, USA: ACM Press, 2005, pp. 1217–1224.
 [14] A. Jakulin and I. Bratko, “Testing the significance of attribute interactions,” in ICML ’04: Proceedings of the twentyfirst international conference on Machine learning. New York, NY, USA: ACM Press, 2004, p. 52.
 [15] M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry. Cambridge, Mass.: MIT Press, 1969.
 [16] G. R. Harik, “Learning gene linkage to efficiently solve problems of bounded difficulty using genetic algorithms,” Ph.D. dissertation, University of Michigan, 1997.
 [17] B.H. Liu, Statistical Genomics: Linkage, Mapping, and QTL Analysis. CRC Press, 1998.
 [18] D. E. Goldberg, G. Korb, and G. Deb, “Messy genetic algorithms: Motivation, analysis , and first results.” Complex Systems, vol. 3, pp. 493–530, 1989.
 [19] S. Baluja and R. Caruana, “Removing the genetics from the standard genetic algorithm,” in Int. Conf. on Machine Learning, 1995, pp. 38–46.
 [20] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1988.
 [21] R. Etxeberria and P. Larranaga, “Global optimization using bayesian networks,” in Second Symposium on Artificial Intelligence (CIMAF99), 1999, pp. 332–339.
 [22] C. Ahn, D. E. Goldberg, and R. S. Ramakhrishna, “Multipledeme parallel estimation of distribution algorithms: basic framework and application,” in Proceedings of Parallel Processing and Applied Mathematics. Lecture Notes in Computer Science, 2004, pp. 544–551.
 [23] P. A. N. Bosman and D. Thierens, “Advancing continuous IDEAs with mixture distributions and factorization selection metrics,” in Optimization by Building and Using Probabilistic Models (OBUPM) 2001, San Francisco, California, USA, 7 2001, pp. 208–212.
 [24] A. Agresti and B. Coull, “Approximate is better than exact for interval estimation of binomial proportions,” The American Statistician, vol. 58, pp. 119–126, 1998.
 [25] D. E. Goldberg, “The race, the hurdle, and the sweet spot: Lessons from genetic algorithms for the automation of design innovation and creativity,” University of Illinois at UrbanaChampaign, Tech. Rep. ILLIGAL Report No. 98007, 1998.
 [26] G.Harik and F. Lobo, “A parameterless genetic algorithm,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO99), 1999.