On the Limitations of the Univariate Marginal Distribution Algorithm to Deception and Where Bivariate EDAs might help1footnote 11footnote 1Preliminary version of this work will appear in the Proceedings of 15th ACM/SIGEVO Workshop on Foundations of Genetic Algorithms (FOGA XV), Potsdam, Germany

On the Limitations of the Univariate Marginal Distribution Algorithm to Deception and Where Bivariate EDAs might help111Preliminary version of this work will appear in the Proceedings of 15th ACM/SIGEVO Workshop on Foundations of Genetic Algorithms (FOGA XV), Potsdam, Germany

Per Kristian Lehre & Phan Trung Hai Nguyen
School of Computer Science
University of Birmingham
Birmingham B15 2TT, United Kingdom
Abstract

We introduce a new benchmark problem called Deceptive Leading Blocks (DLB) to rigorously study the runtime of the Univariate Marginal Distribution Algorithm (UMDA) in the presence of epistasis and deception. We show that simple Evolutionary Algorithms (EAs) outperform the UMDA unless the selective pressure is extremely high, where and are the parent and offspring population sizes, respectively. More precisely, we show that the UMDA with a parent population size of has an expected runtime of on the DLB problem assuming any selective pressure , as opposed to the expected runtime of for the non-elitist with . These results illustrate inherent limitations of univariate EDAs against deception and epistasis, which are common characteristics of real-world problems. In contrast, empirical evidence reveals the efficiency of the bi-variate MIMIC algorithm on the DLB problem. Our results suggest that one should consider EDAs with more complex probabilistic models when optimising problems with some degree of epistasis and deception.

1 Introduction

Estimation of distribution algorithms (EDAs) [42, 43, 33] are a class of randomised search heuristics with many real-world applications (see [27] and references therein). Unlike traditional EAs, which define implicit models of promising solutions via genetic operations such as crossover and mutation, EDAs optimise objective functions by constructing and sampling explicit probabilistic models to generate offspring for the next iteration. The workflow of EDAs is an iterative process, where the initial model is a uniform distribution over the search space. The starting population consists of individuals sampled from the uniform distribution. A fitness function then scores each individual, and the algorithm selects the fittest individuals to update the model (where ). The procedure is repeated until some termination condition is fulfilled, which is usually a threshold on the number of iterations or on the quality of the fittest offspring [27, 19].

Many variants of EDAs have been proposed over the last decades. They differ in the way their models are represented, updated as well as sampled over iterations. In general, EDAs are categorised into two main classes: univariate and multivariate. Univariate EDAs take advantage of first-order statistics (i.e. the mean) to build a probability vector-based model and assume independence between decision variables. The probabilistic model is represented as an -vector, where each component is called a marginal (also frequency) and is the problem instance size. Typical univariate EDAs are compact Genetic Algorithm (cGA [25]), Univariate Marginal Distribution Algorithm (UMDA [42]) and Population-Based Incremental Learning (PBIL [3]). In contrast, multivariate EDAs apply higher-order statistics to model the correlations between decision variables of the addressed problems.

The cGA is the simplest EDA, which operates on a population of two individuals and updates the probabilistic model additively via a parameter , which is often referred to as the hypothetical population size of a genetic algorithm that the cGA is supposed to model. The two individuals are compared in terms of fitness to find the winner, and an increase of takes place at bit positions where individuals have different values. The algorithm also restricts the marginals to be within an interval , where the values and are called the lower and upper borders (or margins), respectively, in order to prevent the marginals from fixing at trivial borders, which may cause the algorithm to converge prematurely. Such an algorithm is referred to as a cGA with margins. In contrast, the UMDA is another univariate EDA that has a larger population of individuals. In each so-called iteration, the marginal is renewed/updated to the frequency of 1-bit among the fittest individuals at each bit position (also called empirical frequency). Unlike the cGA, whose marginals can only get increased by an amount of , those of the UMDA might jump between the upper and lower borders. A generalisation of the UMDA is the PBIL, where each marginal is updated following a convex combination of the current marginal and the empirical frequency via a so-called smoothing parameter. The marginals cannot change by a large amount, and it is then less likely that genetic drift [47] happens soon after the algorithm starts.

The theory of evolutionary computation literature provides rigorous analyses giving insights into the runtime (synonymously, optimisation time), that is the number of function evaluations of the studied algorithm until an optimal solution is sampled for the first time. In other words, theoretical work usually addresses the unlimited case when we consider the run of the algorithm as an infinite process. These analyses provide performance guarantees of the algorithm for a wide range of problem instance size.

Although EDAs were introduced several decades ago and have since shown strong potential as a global optimiser via many practical applications [27], the theoretical understanding of EDAs is very limited. There had been only a handful of runtime analyses of EDAs by 2015. Recently, they have drawn more attention from the community [11, 22, 32, 12, 47, 53, 51, 23, 36, 8, 37, 52, 40, 15, 26]. While rigorous runtime analyses provide deep insights into the performance of randomised search heuristics, it is highly challenging even for simple algorithms on toy functions. Most current runtime results merely concern univariate EDAs on functions like OneMax [32, 51, 36, 53, 40], LeadingOnes [15, 22, 37, 53, 38], BinVal [52, 37] and Jump [26, 11, 12], hoping that this provides valuable insights into the development of new techniques for analysing multivariate variants of EDAs and the behaviour of such algorithms on easy parts of more complex problem spaces [13]. There are two main reasons accounted for this. Firstly, the working principle of these algorithms, which are not originally designed to support the theoretical analyses, is complicated, involving lots of randomnesses, and the interplay of decision variables usually has a huge impact on the overall runtime [13, 46]. Secondly, we are lacking in the state-of-the-art tools in algorithmics [13]. There are seemingly two currently popular techniques used to analyse EDAs. The first tool is drift theorems [28, 16, 31], while the other tool is the level-based theorem, first proposed in [35] and constantly improved upon [6, 14], in the context of non-elitist population-based EAs.

The fact that the UMDA never attempts to learn variable interactions leads some to conjecture that the algorithm will not perform well in environments with epistasis and deception. More specifically, epistasis corresponds to the maximum number of other variables each of the decision variables depends on [9, 24]. We can take as an example the LeadingOnes function, which has a maximum epistasis level of and is often used to study the ability of EAs to cope with variable dependency. Previous studies [7, 37, 38] show an expected runtime for the UMDA and the PBIL on this function, which seems to contradict the above-mentioned claim. However, Lehre and Nguyen [38] recently showed that univariate EDAs based on probability vectors have certain limitations to epistasis: if the selective pressure is larger than roughly (as previously required in [7, 37]) and the parent population is sufficiently large, the UMDA fails to optimise the LeadingOnes function in polynomial expected runtime.

Regarding deception, an example was already mentioned in [27], where the UMDA gets stuck in the concatenated trap of order 5 (called Trap-5 [1], where the original trap function is applied to each of blocks of five consecutive bits) since marginals are deceived to hit the lower border. This excellent example demonstrates the limitations of the univariate model as statistics of low order is misleading [27]. However, this function might be as difficult for the UMDA as it is for the EAs [24]. We believe not only will the UMDA fail on this function, but it will also fail on some other problem with a milder degree of deception. To this end, a function where the UMDA takes an exponential expected runtime, while simple EAs has a polynomial expected runtime is still missing222We define the terms “polynomial” and “exponential” as and , respectively.. On the other hand, a function where the UMDA outperforms simple EAs does exist. Chen et al. [5] proposed the Substring function to point out the advantage of the probability vector-based model. More specifically, the needs a runtime of with probability , whereas the UMDA with , for any constant , and optimises the function in polynomial runtime with probability . We note that this result is very limited in many senses that the population size is large, the selective pressure is fixed to and the considered UMDA does not have borders.

Motivated by this, we introduce a new benchmark problem which has a maximum epistasis level of and is mildly deceptive. The fitness depends on the number of leading 11s, and reaching a unique global optimum requires overcoming many mild traps. Generally speaking, this function is harder than the LeadingOnes function, but still much easier compared to the Trap-5 function. The problem, which we call Deceptive Leading Blocks (DLB), can be formally defined over a finite binary search space as follows.

where

(1)

denotes the number of leading 11s, which is identical to in [29, Definition 13]. The global optimum is the all-ones bitstring. The bitstring is partitioned into independent blocks of consecutive bits. The difficulty of the problem is determined by the width of each block, that can be altered to increase the level of deception. Here, we consider the smallest width , as this suffices to prove an exponential gap between the runtimes of the UMDA and the (,) EA. DLB is similar to the LeadingOnes function except it attempts to deceive the algorithm by assigning tricky weights to different settings of the leftmost non-11 block, which we call the active block. Each leading 11 contributes a value of two, while a value of one is awarded if the active block is a 00, and no reward is given for any other blocks. The all-ones bitstring has a fitness of since there are blocks, assuming that is a multiple of .

By studying this problem, we first show that simple EAs, including elitist and non-elitist variants like , and , and some Genetic Algorithms, can optimise the DLB function within an expected runtime. We then show that the UMDA fails to optimise this function in polynomial expected runtime, assuming that the selective pressure is , and the parent population size is sufficiently large. More specifically, the expected runtime is when , while a lower bound of is achieved for for some constants . In the latter case, we say the UMDA is fooled by deceptive fitness. On the other hand, if the selective pressure is in the order of (i.e., extremely high), we obtain an upper bound of on the expected runtime of the UMDA on the DLB function. Intuitively speaking, under this extreme selective pressure the UMDA selects few fittest individuals to update the probabilistic model, and we believe (with some empirical evidence in Section 5) that the UMDA degenerates into the , where only the fittest offspring is selected to the next generation. Table 1 summarises the main results in this paper.

Algorithm Sel. Pressure Pop. Size Expected Runtime
UMDA
Table 1: Expected runtime of some simple EAs and the UMDA (with borders) on the DLB function.

Last but not least, many algorithms similar to the UMDA with fitness proportional selection have a wide range of applications in bioinformatics [2]. The algorithms relate to the notion of linkage equilibrium [45] – a popular model assumption in population genetics. Studying the UMDA might solidify our understanding of population dynamics. Based on results from this paper, we believe that the UMDA and other univariate model-based algorithms must be tuned carefully when optimising objective functions with epistasis and might not perform well in deceptive environments.

The paper is structured as follows. Section 2 introduces the algorithms, including the UMDA, EAs and the MIMIC algorithm. Section 3 provides analyses of the expected runtime of simple EAs and GAs on the DLB function, followed by a detailed runtime analysis for the UMDA on the DLB function. Next, we illustrate the efficiency of the MIMIC algorithm on the DLB function in Section 5 via a small empirical study. Finally, Section 6 gives some concluding remarks and suggests future work.

2 Preliminaries

This section briefly describes the algorithms studied in this paper. Recall that . Each individual (or bitstring) is represented as . The population of individuals in an iteration is denoted as . We consider in this paper the maximisation of an objective function . Denote .

2.1 Evolutionary algorithms

The is a mutation-only EA, which operates on a population of individuals. In each iteration, the algorithm generates new offspring by flipping bits in each of individuals, chosen uniformly at random from the population, independently with mutation rate . Afterwards, the fittest individuals out of a pool of individuals are selected to form the new population. The algorithm is elitist since the best fitness discovered thus far is guaranteed to never decrease. Popular variants of the algorithm are [17, Algorithm 1], [30, Algorithm 1] and [50, Definition 1]. For comparison, we also consider the non-elitist , defined in Algorithm 1 [39], with mutation rate for any constant . In this algorithm, the next population consists of the fittest individuals chosen from a set of offspring produced by mutation.

1 uniformly at random from repeat
2       for  do
3             select uniformly at random create by flipping each bit in independently with probability
4      sort such that , where ties are broken uniformly at random
until termination condition is fulfilled
Algorithm 1 with mutation rate

2.2 Univariate marginal distribution algorithm

The UMDA, defined in Algorithm 2, maintains a univariate model in each iteration that is represented as an -vector , where each marginal (or frequency) for each is the probability of sampling a 1 at the -th bit position of the offspring. The probability of sampling from the model is

The starting model is the uniform distribution . In each so-called iteration, the algorithm samples a population of individuals and sorts them in descending order according to fitness. Let denote the number of 1s in bit position among the fittest individuals. The marginals are updated using the component-wise formula: for all . Recall that is the selective pressure of the algorithm. The algorithm also restricts the marginals to be within to avoid premature convergence.

1 initialise repeat
2       for  do
3             sample for each
4      sort such that , where ties are broken uniformly at random for  do
5            
6      
until termination condition is fulfilled
Algorithm 2 UMDA with margins

2.3 Mutual-information-maximising input cluster algorithm

The MIMIC [10] is a well-known bivariate EDA, which takes advantage of second-order statistics in an attempt to model the correlations between decision variables. More specifically, let denote the true distribution underlying the selected individuals, where . It is often intractable to learn the true distribution , so an approximation to the distribution is often preferred. Following this approach, the MIMIC approximates by a chain-structured model , which is easy to learn and sample. Given a permutation of a set , the chain-structured model is defined as follows:

where the is called the root of the chain. The parameters of the model are then derived by minimising the Kullback-Leibler divergence between models and , which is equivalent to minimising an alternative cost function

where and are the entropy and conditional entropy, respectively [10].

The model is constructed in each iteration as follows. The variable with the smallest entropy is selected to become the root of the chain. The variable among the remaining variables that has the smallest conditional entropy on the root is then added to the root. This procedure is repeated until all variables are added to the chain (see steps 5–7). Sampling from the chain-structured model is even simpler. The root is sampled first using its marginal probability. The next variable in the chain is then sampled using the conditional probability on the preceding variable. This is repeated until the end of the chain is reached (see steps 8–11). The procedure is known as ancestral sampling [4]. Denote the permutation in an iteration as , where denotes the root.

1 uniformly at random from repeat
2       sort such that for  do
3             where
4      for  do
5             for  do
6                  
7            
8      
until termination condition is fulfilled
Algorithm 3 MIMIC with margins, where ties occurring in sorting and selection are broken uniformly at random.

We note that the description of the MIMIC algorithm in [10] is very ambiguous. In particular, when constructing the model, we need to calculate many conditional entropies, which in turn require the calculation of many conditional probabilities. We can take as an example the conditional probability for , which by the definition can be written as

Note that all probabilities on the right-hand side will be directly measured from the sampled population (via counting). And, this is where the problem comes from since very often that the event may not happen, leading to the probability estimate which renders the conditional probability above undefined. To avoid this problem, we make use of a function to ensure that the probability is always positive, and never less than . Note that this bound is in line with the way marginal probabilities are bounded in other EDAs, such as the UMDA. We think that this small change is essential to ensure that the pseudo-code of MIMIC is well-defined. Algorithm 3 gives a full description of the MIMIC (with margins) [44, Chapter 13].

2.4 Level-based analysis

First proposed in [35], the level-based theorem is a general tool that provides upper bounds on the expected runtime of many non-elitist population-based algorithms on a wide range of optimisation problems [6, 14, 7, 8, 36, 37, 38]. The theorem assumes that the studied algorithm can be described in the form of Algorithm 4, which never assumes specific fitness functions, selection mechanisms, or generic operators like mutation and crossover. The search space is partitioned into disjoint subsets , which we call levels, and the last level consists of global optima of the objective function. Denote .

1 ; create initial population repeat
2       for  do
3             sample
4      
until termination condition is fulfilled
Algorithm 4 Non-elitist population-based algorithm
Theorem 1 ([6]).

Given a partition of , define where for all , is the population of Algorithm 4 in iteration . Denote . If there exist , and such that for any population ,

  • for each level , if then

  • for each level and all , if and then

  • and the population size satisfies

    where , then

2.5 Useful Tools from Probability Theory

We will use the following well-known tail bounds [41, 18].

Lemma 1 (Chernoff Bound).

Let . Let be independent random variables (not necessarily i.i.d.), that take values in . Let , and . Then for any ,

and

Lemma 2 (Chernoff-Hoeffding Bound).

Let be independent random variables (not necessarily i.i.d.), where takes values in . Let , and let . Then .

Lemma 3 ([49]).

.

3 EAs optimise DLB efficiently

We start by showing that simple EAs optimise the DLB function in polynomial expected runtime. We consider both elitist and non-elitist EAs, namely , and , in addition to Genetic Algorithms. Although these EAs are simple, we analyse them here to emphasise their efficiency in dealing with epistasis and mild deception. Speaking of proving techniques, we will use the fitness-level method [48] and the level-based theorem (see Theorem 1). In doing so, the search space is first partitioned into non-empty disjoint subsets (called levels, where ) such that

(2)

where is defined in (1), and contains the all-ones bitstring. We now give runtime bounds on the DLB function for the EAs; the proofs are straightforward.

Theorem 2.

The expected runtime of the  EA on the DLB function is .

Proof.

Levels are defined as in (2). The probability of leaving the current level is lower bounded by , and thus not leaving it happens with probability at most . In each iteration, the  EA samples individuals by mutating the current bitstring. At least one among individuals leaves the current level with probability at least Note that if , then this probability is at least ; otherwise, it is at least . Putting everything together, the expected runtime guaranteed by the fitness-level method is

Theorem 3.

The expected runtime of the  EA on the DLB function is .

Proof.

Levels are defined as in (2). It suffices to correct block to leave the current level. Following [50], we define a fraction . Given copies of the best individual, another one is created with probability . Thus, the expected time for a fraction of the population to be in level is given by

Now given individuals in level , the event of leaving this level occurs with probability This probability is at least

The expected runtime of the algorithm on DLB is

Theorem 4.

The expected runtime of the  EA with for some sufficiently large constant and for any constant and a mutation rate constant on the DLB function is .

Proof.

Since is non-elitist, Theorem 1 guarantees an upper bound on the expected runtime as long as the three conditions (G1), (G2) and (G3) are fully verified. Choose . The levels are defined as in (2), and is assumed to be the current level.

Condition (G1) requires a lower bound on the probability of sampling an offspring in , where , given . During the selection step, if we choose an individual in , then the step is successful if the mutation operator correctly flips two of the bits in the active block while keeping others unchanged. Thus, the probability of a successful sampling is at least

Next, condition (G2) assumes that at least individuals have at least leading 11s. It suffices to pick one of the fittest individuals and flip none of the bits; the probability is at least

if for any constant .

Putting everything into condition (G3) yields for a sufficiently large constant . Having fully verified three conditions, the expected runtime of the on DLB is

So far we have considered only mutation-based EAs, the following theorem shows that Genetic Algorithms, defined in [6, Algorithm 2], with crossover rate using any crossover operator also take a polynomial expected runtime to optimise the DLB function.

Theorem 5.

Genetic Algorithms with crossover rate using any crossover operator, the bitwise mutation operator with mutation rate for any fixed constant and one of the following selection mechanisms: -tournament selection, -selection, linear or exponential ranking selection, with their parameters , and being set to no less than where being any constant, take an expected runtime on the DLB function, where for some sufficiently large constant .

Proof.

The results for -tournament, -selection and linear ranking follow by applying [35, Lemmas 5–7], while the result for exponential ranking can be seen in [6, Lemma 3]. ∎

4 Why is UMDA inefficient on DLB?

Before we get to analysing the UMDA on the DLB function, we introduce some notation. Recall that there are blocks. We then let for each denote the number of individuals having at least leading 11s in iteration , and denote the number of individuals having leading 11s, followed by a 00 at the -th block. For the special case of , consists of those with the first block being a 00. We also let denote the number of individuals having leading 11s, followed by a 10 at the -th block, and again consists of those having the first block being a 10.

Once the population has been sampled, the algorithm invokes truncation selection to select the fittest individuals to update the probability vector. We take this -cutoff into account by defining a random variable

(3)

which tells us how many consecutive marginals, counting from position one, are set to the upper border in iteration . We also define another random variable

(4)

to be the number of leading 11s of the fittest individual(s). For readability, we often leave out the indices of random variables like when we write instead of , if values of the indices are clear from the context. Furthermore, let be a filtration induced from the population , and we often write and .

4.1 On the distributions of and

We apply the principle of deferred decisions [41] and imagine that the algorithm first samples the values of the first block for individuals. Once this is finished, it moves on to the second block and so on until the whole population is obtained.

We note that selection prefers individuals with the first block being a 11 to those with a 00, which in turn is more preferred to those with a 10 or 01 (due to deceptive fitness). The number of 11s in the first block follows a binomial distribution with parameters and , that is, . Having sampled 11s, there are other blocks in block in the current population. is also binomially distributed with parameters and by the definition of conditional probability since the event of sampling a 11 is excluded. Similarly having sampled 11s and 00s, is binomially distributed with trials and success probability since again the event of sampling either a 11 or a 00 is excluded. Finally, the number of 01s is .

Having sampled the first block for individuals, and note that the bias due to selection in the second block comes into play only if the first block is a 11. Among the fittest individuals, those with a 11 in the second block will be ranked first, followed by those with a 00, and finally with a 10 or 01. Conditioned on the first block being a 11, the number of 11s in the second block is binomially distributed with parameters and , i.e., , and the number of 00s also follows a binomial distribution with trials and success probability . Similarly, is binomially distributed with parameters and , and finally the number of 01s equals . Unlike the first block, we also have remaining individuals, and since there is no bias in the second block among these individuals, the numbers of 1s sampled at the two bit positions are binomially distributed with trials and success probabilities and , respectively.

We now consider an arbitrary block . By induction, we observe that the number of individuals having at least leading 11s follows a binomial distribution with parameters and , that is,

(5)

Similarly,

(6)

and

(7)

Finally, the number of individuals with leading 11s followed by a 01 in the block is . For the remaining individuals, the numbers of 1s sampled in the bit positions and follow a and a , respectively. We note in particular that by the end of this alternative view on the sampling process, we obtain the population of individuals sorted in descending order according to fitness, where ties are broken uniformly at random. The following lemma provides the expectations of these random variables.

Lemma 4.

For all , if then

Otherwise, if , then

(8)
(9)
(10)
Proof.

For the special case of , the expectations are trivial since random variables and are all binomially distributed with trials. In the remainder of the proof, we will consider the case of . By (5), the tower rule [21] and noting that is -measurable, we get

By (6) and (8), we also get

and similarly by (7), (8) and (9), we finally obtain

4.2 In the initial population

An initial observation is that the all-ones bitstring cannot be sampled in the initial population with high probability since the probability of sampling it from the uniform distribution is , then by the union bound [41] it appears in the initial population of individuals with probability at most since we only consider the offspring population of size at most polynomial in the problem instance size . The following lemma states the expectations of the random variables and (defined in (3) and (4), respectively) in the iteration .

Lemma 5.

, and .

Proof.

Recall that and the definition of the function in (1). The probability of sampling an individual with leading 11s (where ) is then . The event implies that the individuals all have at most leading 11s, i.e.,

and . Since the random variable is integer-valued and by , we get

which proves the first claim.

For the second claim, we take an alternative view that the random variable denotes the number of leading 11s of the fittest individual(s) in a smaller population of the remaining individuals (i.e., all but the fittest individuals in the initial population), sampled from a uniform distribution. The same line of arguments above immediately yields

4.3 In an arbitrary iteration

By the definition of the random variable