# Theoretical Analysis of Stochastic Search Algorithms

Per Kristian Lehre
School of Computer Science,
University of Birmingham,
Birmingham, UK
Pietro S. Oliveto
Department of Computer Science,
University of Sheffield,
Sheffield, UK
###### Abstract

Theoretical analyses of stochastic search algorithms, albeit few, have existed ever since these algorithms became popular. Starting in the nineties, a systematic approach to analysing the performance of stochastic search heuristics has been put in place. This quickly growing body of results nowadays allows the analysis of sophisticated algorithms such as population-based evolutionary algorithms, ant colony optimisation and artificial immune systems. Results are available concerning problems from various domains including classical combinatorial and continuous optimisation, single and multi-objective optimisation, and noisy and dynamic optimisation. This chapter introduces the mathematical techniques that are most commonly used in the runtime analysis of stochastic search heuristics. Careful attention is given to the very popular artificial fitness levels and drift analysis techniques, for which several variants are presented. To aid the reader’s comprehension of the presented mathematical methods, these are applied to the analysis of simple evolutionary algorithms for artificial example functions. The chapter concludes with references to more complex applications and further extensions of the techniques that yield advanced results.

## 1 Introduction

Stochastic search algorithms, also called randomised search heuristics, are general purpose optimisation algorithms that are often used when it is not possible to design a specific algorithm for the problem at hand. Common reasons are the lack of available resources (e.g., enough money and/or time) or because of an insufficient knowledge of the complex optimisation problem which has not been studied extensively before. Other times, the only way of acquiring knowledge about the problem is by evaluating the quality of candidate solutions.

Well-known stochastic search algorithms are random local search and simulated annealing. Other more complicated approaches are inspired by processes observed in nature. Popular examples are evolutionary algorithms (EAs) inspired by the concept of natural evolution, ant colony optimisation (ACO) inspired by ant foraging behaviour and artificial immune systems (AIS) inspired by the immune system of vertebrates.

The main advantage of stochastic search heuristics is that, being general purpose algorithms, they can be applied to a wide range of applications while requiring hardly any knowledge of the problem at hand. Also, the simplicity with which they can be applied implies that practitioners can use them to find high quality solutions to a wide variety of problems without needing skills and knowledge of algorithm design. Indeed, numerous applications report high performance results, which makes them widely used in practice. However, through experimental work and applications alone it is difficult to understand the reasons for these successes. In particular, given a stochastic search algorithm, it is unclear on which kind of problems it will achieve good performance and on which it will perform poorly. Even more crucial is the lack of understanding of how the parameter settings influence the performance of the algorithms. The goal of a rigorous theoretical foundation of stochastic search algorithms is to answer questions of this nature by explaining the success or the failure of these methods in practical applications. The benefits of a theoretical understanding are threefold: (a) guiding the choice of the best algorithm for the problem at hand, (b) determining the optimal parameter settings, and (c) aiding the algorithm design, ultimately leading to better algorithms.

Theoretical studies of stochastic optimisation methods, albeit few, have existed ever since these algorithms became popular. In particular, the increasing popularity gained by evolutionary and genetic algorithms in the seventies led to various attempts at building a theory for these algorithms. However, such initial studies attempted to provide insights into the behaviour of evolutionary algorithms rather than estimating their performance. The most popular of these theoretical frameworks was probably the schema theory introduced by Holland [13] and made popular by Goldberg [11]. In the early nineties a very different approach to the analysis of evolutionary algorithms, and consequently of randomised search heuristics in general, appeared. It was driven by the insight that these heuristics are indeed randomised algorithms, albeit general-purpose ones, and as such they should be analysed in a similar spirit to that of classical randomised algorithms [25]. For the last 25 years this field has kept growing considerably, and nowadays several advanced and powerful tools have been devised that allow the analysis of the performance of involved stochastic search algorithms for problems from various domains. These include problems from classical combinatorial and continuous optimisation, dynamic optimisation and noisy optimisation. The generality of the developed techniques has allowed their application to the analysis of several families of stochastic search algorithms including evolutionary algorithms, local search, the Metropolis algorithm, simulated annealing, ant colony optimisation, artificial immune systems, particle swarm optimisation and estimation of distribution algorithms, amongst others.

The aim of this chapter is to introduce the reader to the most common and powerful tools used in the performance analysis of randomised search heuristics. Since the main focus is the understanding of the methods, these will be applied to the analysis of very simple evolutionary algorithms for artificial example functions. The hope is that the structure of the functions and the behaviour of the algorithms are easy to grasp, so that the attention of the reader may be mostly focused on the mathematical techniques that will be presented. At the end of the chapter, references to more complex applications of the techniques that yield advanced results are given for further reading.

## 2 Computational Complexity of Stochastic Search Algorithms

From the perspective of computer science, stochastic search heuristics are randomised algorithms although more general than problem specific ones. Hence, it is natural to analyse their performance in the classical way as done in computer science. From this perspective an algorithm should be correct, i.e., for every instance of the problem (the input) the algorithm halts with the correct solution (i.e., the correct output) and it should be efficient in terms of its computational complexity i.e., the algorithm uses the computational resources wisely. The resources usually considered are the number of basic computations to find the solution (i.e., time) and the amount of memory required (i.e., space).

Differently from problem-specific algorithms, the goal behind general-purpose algorithms such as stochastic search heuristics is to deliver good performance independently of the problem at hand. In other words, a general-purpose algorithm is “correct” if it visits the optimal solution of any problem in finite time. If the optimum is never lost afterwards, then a stochastic search algorithm is said to converge to the optimal solution. In a formal sense the latter condition for convergence is required because most search heuristics are not capable of recognising when an optimal solution has been found (i.e., they do not halt). However, it suffices to keep track of the best found solution during the run of the algorithm, hence the condition is trivial to satisfy. What is particularly relevant, and can make a huge difference to the usefulness of a stochastic search heuristic for a given problem, is its time complexity. In each iteration the evaluation of the quality of a solution is generally far more expensive than the other algorithmic steps. As a result, it is very common to measure time as the number of evaluations of the fitness function (also called objective function) rather than counting the number of basic computations. Since randomised algorithms make random choices during their execution, the runtime of a stochastic search heuristic $A$ to optimise a function $f$ is a random variable $T_{A,f}$. The main measure of interest is:

1. The expected runtime $E[T_{A,f}]$: the expected number of fitness function evaluations until the optimum of $f$ is found.

For skewed runtime distributions, the expected runtime may be a deceiving measure of algorithm performance. The following measure therefore provides additional information.

2. The success probability in $t$ steps, $\Pr(T_{A,f} \le t)$: the probability that the optimum is found within $t$ steps.

Just like in the classical theory of efficient algorithms, the time is analysed in relation to growing input length and usually described using asymptotic notation [3]. A search heuristic is said to be efficient for a function (class) if the runtime grows as a polynomial function of the instance size. On the other hand, if the runtime grows as an exponential function, then the heuristic is said to be inefficient. See Figure 1 for an illustrative distinction.

## 3 Evolutionary Algorithms

A general framework of an evolutionary algorithm is the ($\mu$+$\lambda$) EA defined in Algorithm 1. The algorithm evolves a population of $\mu$ candidate solutions, generally called the parent population. At each generation an offspring population of $\lambda$ individuals is created by selecting individuals from the parent population uniformly at random and by applying a mutation operator to them. The generation is concluded by selecting the $\mu$ fittest individuals out of the $\mu$ parents and $\lambda$ offspring.

In order to apply the algorithm for the optimisation of a fitness function $f$, some parameters need to be set: the population size $\mu$, the offspring population size $\lambda$ and the mutation rate $p$. Generally $p = 1/n$, where $n$ is the length of the bitstring, is considered a good setting for the mutation rate. Also, in practical applications a stopping criterion has to be defined since the algorithm does not halt. A fixed number of generations or a fixed number of fitness function evaluations is usually decided in advance. Since the objective of the analysis is to calculate the time required to reach the optimal (or an approximate) solution for the first time, no stopping condition is required, and one can assume that the algorithms are allowed to run forever. The + symbol in the algorithm’s name indicates that elitist truncation selection is applied. This means that the whole population consisting of both parents and offspring is sorted according to fitness and the best $\mu$ are retained for the next generation. Some criterion needs to be decided in case the best $\mu$ individuals are not uniquely defined. Ties between solutions of equal fitness may be broken uniformly at random. Often offspring are preferred over parents of equal fitness. In the latter case, if $\mu = \lambda = 1$ is set, then the standard (1+1) EA is obtained, a very simple and well studied evolutionary algorithm. On the other hand, if some stochastic selection mechanism were used instead of the elitist mechanism and a crossover operator were added as a variation mechanism, then Algorithm 1 would become a genetic algorithm (GA) [11]. Given the importance of the (1+1) EA in this chapter, a formal definition is given in Algorithm 2.
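The description above can be turned into a short Python sketch. This is only an illustration under stated assumptions, not the chapter's Algorithm 1: the function name `mu_plus_lambda_ea`, the fixed number of generations used as stopping criterion, and the tie-breaking rule (offspring preferred over parents of equal fitness) are choices made here.

```python
import random


def onemax(x):
    """Example fitness function: number of one-bits."""
    return sum(x)


def mu_plus_lambda_ea(f, n, mu, lam, generations, rng):
    """Hedged sketch of a (mu+lambda) EA on bitstrings of length n."""
    p = 1.0 / n  # standard mutation rate
    parents = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            parent = rng.choice(parents)  # uniform parent selection
            # flip each bit independently with probability p
            child = [b ^ 1 if rng.random() < p else b for b in parent]
            offspring.append(child)
        # Elitist truncation selection. Offspring are listed first and Python's
        # sort is stable, so offspring win ties against parents of equal fitness.
        # (Fitness is recomputed inside the sort for brevity, not efficiency.)
        merged = offspring + parents
        merged.sort(key=f, reverse=True)
        parents = merged[:mu]
    return parents


rng = random.Random(1)
final = mu_plus_lambda_ea(onemax, n=20, mu=3, lam=5, generations=2000, rng=rng)
print(max(onemax(x) for x in final))
```

Setting `mu=1` and `lam=1` in this sketch gives the behaviour of the (1+1) EA.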

The algorithm is initialised with a random bitstring. At each generation a new candidate solution is obtained by flipping each bit with probability $p = 1/n$. The number of bits that flip can be represented by a binomial random variable $X \sim \mathrm{Bin}(n, p)$, where $n$ is the number of bits (i.e., the number of trials) and $p = 1/n$ is the probability of a success (i.e., a bit actually flips), while $1 - p = 1 - 1/n$ is the probability of a failure (i.e., the bit does not flip). Then, the expected number of bits that flip in one generation is given by the expectation of the binomial random variable, $E[X] = n \cdot p = n \cdot 1/n = 1$.

The algorithm behaves in a very different way compared to the random local search (RLS) algorithm that flips exactly one bit per iteration. Although the (1+1) EA flips exactly one bit in expectation per iteration, many more bits may flip or even none at all. In particular, the (1+1) EA is a global optimiser because there is a positive probability that any point in the search space is reached in each generation. As a consequence, the algorithm will find the global optimum in finite time. On the other hand, RLS is a local optimiser since it gets stuck once it reaches a local optimum because it only flips one bit per iteration.
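The difference between the two algorithms lies only in the mutation step, which the following sketch makes explicit. It is a minimal illustration, not the chapter's Algorithm 2: the helper names (`mutate_ea`, `mutate_rls`, `run`) are hypothetical, and the loop stops once a known optimal fitness value is reached, which is an assumption made purely so that the runtime can be measured.

```python
import random


def onemax(x):
    return sum(x)


def mutate_ea(x, rng):
    """(1+1) EA mutation: flip each bit independently with probability 1/n."""
    n = len(x)
    return [b ^ 1 if rng.random() < 1.0 / n else b for b in x]


def mutate_rls(x, rng):
    """RLS mutation: flip exactly one bit chosen uniformly at random."""
    y = list(x)
    y[rng.randrange(len(x))] ^= 1
    return y


def run(mutate, f, n, optimum_value, rng, max_evals=10**6):
    """Elitist single-individual search; returns the final point and evaluations used."""
    x = [rng.randint(0, 1) for _ in range(n)]
    fx, evals = f(x), 1
    while fx < optimum_value and evals < max_evals:
        y = mutate(x, rng)
        fy = f(y)
        evals += 1
        if fy >= fx:  # accept offspring of equal or better fitness
            x, fx = y, fy
    return x, evals


rng = random.Random(0)
x_ea, evals_ea = run(mutate_ea, onemax, 30, 30, rng)
x_rls, evals_rls = run(mutate_rls, onemax, 30, 30, rng)
print(evals_ea, evals_rls)
```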

The probability that a binomial random variable $X$ takes value $j$ (i.e., $j$ bits flip) is

$$\Pr(X=j) = \binom{n}{j} p^j (1-p)^{n-j}.$$

Hence, the probability that the (1+1) EA flips exactly one bit is

$$\Pr(X=1) = \binom{n}{1}\cdot\frac{1}{n}\cdot\left(1-\frac{1}{n}\right)^{n-1} = \left(1-\frac{1}{n}\right)^{n-1} \ge 1/e \approx 0.37$$

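So the outcome of one generation of the (1+1) EA is similar to that of RLS in only slightly more than 1/3 of the generations. The probability that two bits flip is exactly half the probability that one flips:

$$\Pr(X=2) = \binom{n}{2}\left(\frac{1}{n}\right)^{2}\left(1-\frac{1}{n}\right)^{n-2} = \frac{n(n-1)}{2}\left(\frac{1}{n}\right)^{2}\left(1-\frac{1}{n}\right)^{n-2} = \frac{1}{2}\left(1-\frac{1}{n}\right)^{n-1} \approx 1/(2e)$$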

On the other hand the probability no bits flip at all is:

$$\Pr(X=0) = \binom{n}{0}\left(\frac{1}{n}\right)^{0}\cdot\left(1-\frac{1}{n}\right)^{n} \approx 1/e$$

The latter result implies that in more than 1/3 of the iterations no bits flip. This should be taken into account when evaluating the fitness of the offspring, especially for expensive fitness functions.

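In general, the probability that $i$ bits flip decreases exponentially with $i$:

$$\Pr(X=i) = \binom{n}{i}\cdot\frac{1}{n^{i}}\cdot\left(1-\frac{1}{n}\right)^{n-i} \le \frac{1}{i!}\cdot\left(1-\frac{1}{n}\right)^{n-i} \approx \frac{1}{i!\,e}$$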

In the worst case all the bits may need to flip to reach the optimum in one step. This event has probability $1/n^n$. Since this is always a lower bound on the probability of reaching the optimum in each generation, by a simple waiting time argument an upper bound of $n^n$ may be derived for the expected runtime of the (1+1) EA on any pseudo-Boolean function $f: \{0,1\}^n \to \mathbb{R}$. It is simple to design an example trap function for which the algorithm actually requires $\Theta(n^n)$ expected steps to reach the optimum [10]. This simple result further motivates why it is fundamental to gain a foundational understanding of how the runtime of stochastic search heuristics depends on the parameters of the problem and on the parameters of the algorithms.
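The flip probabilities discussed in this section are easy to check numerically. The sketch below evaluates the binomial formula directly for mutation rate $1/n$; the function name `flip_prob` is introduced here for illustration only.

```python
from math import comb, e


def flip_prob(n, i):
    """Probability that exactly i of n bits flip when each flips with probability 1/n."""
    return comb(n, i) * (1 / n) ** i * (1 - 1 / n) ** (n - i)


n = 100
for i in range(4):
    print(i, flip_prob(n, i))
# Probability that all n bits flip at once, i.e. 1/n^n.
print("all bits:", flip_prob(n, n))
print("1/e:", 1 / e)
```

For $n = 100$ the values for zero and one flipped bits lie just below and just above $1/e$ respectively, and the value for two bits is half the value for one bit.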

## 4 Test Functions

Test functions are artificially designed to analyse the performance of stochastic search algorithms when they face optimisation problems with particular characteristics. These functions are used to highlight characteristics of function classes which may make the optimisation process easy or hard for a given algorithm. For this reason they are often referred to as toy problems. The analysis of test functions with a simple and well understood structure has allowed the development of several general analysis techniques. These techniques have subsequently made it possible to analyse the same algorithms on more complicated problems with practical applications, such as classical combinatorial optimisation problems. Furthermore, in recent years several standard techniques originally developed for simple algorithms have been extended to allow the analysis of more realistic algorithms. In this section the test functions that will be used as example functions throughout the chapter are introduced.

The most popular test function is $\textsc{Onemax}(x) := \sum_{i=1}^{n} x_i$, which simply counts the number of one-bits in the bitstring. The global optimum is a bitstring of only one-bits. Onemax is the easiest function with unique global optimum for the (1+1) EA [5].

A particularly difficult test function for stochastic search algorithms is the needle-in-a-haystack function. $\textsc{Needle}(x) := \prod_{i=1}^{n} x_i$ consists of a huge plateau of fitness value zero apart from only one optimal point of fitness value one represented by the bitstring of only one-bits. This function is hard for search heuristics because all the search points apart from the optimum have the same fitness. As a consequence, the algorithms cannot gather any information about where the needle is by sampling search points.

Both Onemax and Needle (as defined above) have the property that the function values only depend on the number of ones in the bitstring. The class of functions with this property is called functions of unitation:

$$\textsc{Unitation}(x) := f\left(\sum_{i=1}^{n} x_i\right)$$

Throughout this chapter, functions of unitation will be used as a general example class to demonstrate the use of the techniques that will be introduced. For simplicity of the analysis, the optimum is assumed to be the bitstring of only one-bits.
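The test functions above translate directly into code. The following sketch represents bitstrings as lists of 0/1 integers; the helper `unitation`, which wraps a function `g` of the number of ones, is a name chosen here for illustration.

```python
def onemax(x):
    """Number of one-bits in the bitstring."""
    return sum(x)


def needle(x):
    """1 for the all-ones bitstring, 0 everywhere else."""
    return 1 if all(b == 1 for b in x) else 0


def unitation(g):
    """Build a function of unitation from g, a function of the number of ones."""
    return lambda x: g(sum(x))


n = 5
onemax_as_unitation = unitation(lambda ones: ones)
needle_as_unitation = unitation(lambda ones: 1 if ones == n else 0)
print(onemax([1, 0, 1, 1, 0]), needle([1, 1, 1, 1, 1]))
```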

For the analysis the function of unitation will be divided into three different kinds of sub-blocks: linear blocks, gap blocks and plateau blocks. Each block will be defined by its length parameter $m$ (i.e., the number of bits in the block) and by its position parameter $k$ (i.e., each block starts at bitstrings with $m+k$ zeroes and ends at bitstrings with $k$ zeroes). In the definitions below, $|x|$ denotes the number of ones in the bitstring $x$. A given unitation function is divided into sub-blocks proceeding from left to right, from the all-zeroes bitstring towards the all-ones bitstring. If the fitness increases with the number of ones, then a linear block is created. The linear block ends when the function value stops increasing with the number of ones.

$$\textsc{Linear}(|x|) = \begin{cases} a|x| + b & \text{if } k < n - |x| \le k+m \\ 0 & \text{otherwise.} \end{cases}$$

See Figure 2 for an illustration.

If the fitness function decreases with the number of ones, then a gap block is created. The gap block ends when the fitness value reaches for the first time a higher value than the value at the beginning of the block.

$$\textsc{Gap}(|x|) = \begin{cases} a & \text{if } n - |x| = k+m \\ 0 & \text{otherwise.} \end{cases}$$

See Figure 3 for an illustration.

If the fitness remains the same as the number of ones in the bitstrings increases, then a plateau block is created. The block ends at the first point where the fitness value changes.

$$\textsc{Plateau}(|x|) = \begin{cases} a & \text{if } k < n - |x| \le k+m \\ 0 & \text{otherwise.} \end{cases}$$

See Figure 4 for an illustration.

By proceeding from left to right the whole search space is subdivided into blocks. See Figure 5 for an illustration.

Let the unitation function be subdivided into $r$ sub-functions $f_1, \dots, f_r$, and let $T_i$ be the runtime for an elitist search heuristic to optimise each sub-function $f_i$. Then by linearity of expectation, an upper bound on the expected runtime of an elitist stochastic search heuristic for the unitation function is:

$$E[T] \le E\left[\sum_{i=1}^{r} T_i\right] = \sum_{i=1}^{r} E[T_i].$$

Hence, an upper bound on the total runtime for the unitation function may be achieved by calculating upper bounds on the runtime for each block separately. Once these are obtained, summing all the bounds yields an upper bound on the total runtime. Care is needed when calculating upper bounds on the runtime to overcome a plateau block that is followed by a gap block, because points straight after the end of the plateau will have lower fitness values and hence will not be accepted. In these special cases, the upper bound for the Plateau block needs to be multiplied by the upper bound for the Gap block to achieve a correct upper bound on the runtime to overcome both blocks. In the remainder of the chapter upper and lower bounds for each type of block will be derived as example applications of the presented runtime analysis techniques. The reader will then be able to calculate the runtime of the (1+1) EA and other evolutionary algorithms for any such unitation function.

By simply using waiting time arguments it is possible to derive upper and lower bounds on the runtime of the (1+1) EA for the Gap block. Assuming that the algorithm is at the beginning of the gap block, to reach the end it is sufficient to flip $m$ zero-bits into one-bits and leave the other bits unchanged. On the other hand, it is a necessary condition to flip at least $m$ zero-bits, because all search points reached by flipping fewer than $m$ zero-bits have a fitness value of zero and would not be accepted by selection. Given that there are $m+k$ zero-bits available at the beginning of the block, the following upper and lower bounds on the probability $p$ of reaching the end of the block follow:

$$\left(\frac{m+k}{nm}\right)^{m}\frac{1}{e} \le \binom{m+k}{m}\left(\frac{1}{n}\right)^{m}\frac{1}{e} \le p \le \binom{m+k}{m}\left(\frac{1}{n}\right)^{m} \le \left(\frac{(m+k)e}{nm}\right)^{m}.$$

Here the outer inequalities are achieved by using $(N/K)^K \le \binom{N}{K} \le (eN/K)^K$ for $1 \le K \le N$. Then by simple waiting time arguments, the expected time for the (1+1) EA to optimise a Gap block of length $m$ and position $k$ is upper and lower bounded by

$$\left(\frac{nm}{(m+k)e}\right)^{m} \le \binom{m+k}{m}^{-1} n^{m} \le E[T] \le e\,n^{m}\binom{m+k}{m}^{-1} \le e\left(\frac{nm}{m+k}\right)^{m}.$$
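As a numerical sanity check, the chain of bounds on the expected time to cross a Gap block can be evaluated for concrete parameters. The function name `gap_runtime_bounds` and the dictionary keys are illustrative choices made here, not notation from the chapter.

```python
from math import comb, e


def gap_runtime_bounds(n, m, k):
    """Bounds on E[T] for a Gap block of length m at position k (bitstring length n)."""
    c = comb(m + k, m)  # ways to choose which m of the m+k zero-bits flip
    tight_lower = n ** m / c       # from p <= C(m+k, m) / n^m
    tight_upper = e * n ** m / c   # from p >= C(m+k, m) / (e n^m)
    simple_lower = (n * m / ((m + k) * e)) ** m
    simple_upper = e * (n * m / (m + k)) ** m
    return {"tight": (tight_lower, tight_upper),
            "simple": (simple_lower, simple_upper)}


bounds = gap_runtime_bounds(n=50, m=2, k=3)
print(bounds)
```

For $n = 50$, $m = 2$ and $k = 3$ the binomial coefficient is 10, so the tighter pair of bounds evaluates to $250$ and $250e$.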

## 5 Tail Inequalities

The runtime of a stochastic search algorithm $A$ for a function (class) $f$ is a random variable $T_{A,f}$, and the main goal of a runtime analysis is to calculate its expectation $E[T_{A,f}]$. Sometimes the expected runtime may be particularly large, but there may also be a high probability that the actual optimisation time is significantly lower. In these cases a result about the success probability within $t$ steps considerably helps the understanding of the algorithm’s performance. On other occasions it may be interesting simply to gain knowledge about the probability that the actual optimisation time deviates from the expected runtime. In such circumstances tail inequalities turn out to be very useful tools, since they yield bounds on the runtime that hold with high probability. An example of the expectation of a random variable and its probability distribution is given in Figure 6.

Given the expectation of a random variable, which often may be estimated easily, tail inequalities give bounds on the probability that the actual random variable deviates from its expectation [25, 24]. The simplest tail inequality is Markov’s inequality. Many strong tail inequalities are derived from Markov’s inequality.

###### Theorem 1 (Markov’s Inequality).

Let $X$ be a random variable assuming only non-negative values. Then for all $t > 0$,

$$\Pr(X \ge t) \le \frac{E[X]}{t}.$$

The power of the inequality is that no knowledge about the random variable is required apart from it being non-negative.

Let $X$ be a random variable indicating the number of bits flipped in one iteration of the (1+1) EA. As seen in the previous section, one bit is flipped per iteration in expectation, i.e., $E[X] = 1$. One may wonder what the probability is that more than one bit is flipped in one time step. A straightforward application of Markov’s Inequality reveals that in at least half of the iterations either one bit is flipped or none:

$$\Pr(X \ge 2) \le \frac{E[X]}{2} = \frac{1}{2}$$

Similarly, one may want to gain some information on how many ones are contained in the bitstring at initialisation, given that in expectation there are $E[X] = n/2$ (here $X$ is a binomial random variable with parameters $n$ and $p = 1/2$). An application of Markov’s inequality yields that the probability of having more than $(2/3)n$ ones at initialisation is bounded by

$$\Pr(X \ge (2/3)n) \le \frac{E[X]}{(2/3)n} = \frac{n/2}{(2/3)n} = \frac{3}{4} \tag{1}$$

Since $X$ is binomially distributed it is reasonable to expect that, for large enough $n$, the actual number of ones obtained at initialisation would be more concentrated around the expected value. In particular, while the bound is obviously correct, the probability that the initial bitstring has more than $(2/3)n$ ones is much smaller than $3/4$. However, to achieve such a result the tail inequality needs to use more information about the random variable (i.e., that it is binomially distributed). An important class of tail inequalities used in the analysis of stochastic search heuristics are Chernoff bounds.
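How loose Markov's inequality is in this setting can be seen by comparing it with the exact binomial tail. The sketch below does this for a uniformly random initial bitstring; the function names are illustrative, and the exact tail is computed with integer arithmetic to avoid rounding issues.

```python
from math import comb


def binomial_tail_at_least(n, threshold):
    """Exact Pr(X >= threshold) for X ~ Bin(n, 1/2)."""
    return sum(comb(n, j) for j in range(threshold, n + 1)) / 2 ** n


def markov_bound(expectation, t):
    """Markov's inequality: Pr(X >= t) <= E[X] / t for non-negative X."""
    return expectation / t


n = 90
t = 2 * n // 3  # (2/3) n ones
print("Markov bound:", markov_bound(n / 2, t))
print("exact tail:  ", binomial_tail_at_least(n, t))
```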

###### Theorem 2 (Chernoff Bounds).

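Let $X_1, X_2, \dots, X_n$ be independent random variables taking values in $\{0,1\}$. Define $X = \sum_{i=1}^{n} X_i$, which has expectation $E[X] = \sum_{i=1}^{n} \Pr(X_i = 1)$.

(a) $\Pr(X \le (1-\delta)E[X]) \le e^{-E[X]\delta^2/2}$ for $0 \le \delta \le 1$.

(b) $\Pr(X > (1+\delta)E[X]) \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{E[X]}$ for $\delta > 0$.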

An application of Chernoff bounds reveals that the probability that the initial bitstring has more than $(2/3)n$ one-bits is exponentially small in the length of the bitstring. Let $X = \sum_{i=1}^{n} X_i$ be the random variable summing up the random values $X_i$ of each of the $n$ bits. Since each bit is initialised to one with probability $1/2$, it holds that $\Pr(X_i = 1) = 1/2$ and $E[X] = n/2$. By fixing $\delta = 1/3$ it follows that $(1+\delta)E[X] = (2/3)n$ and finally, by applying inequality (b),

$$\Pr(X > (2/3)n) \le \left(\frac{e^{1/3}}{(4/3)^{4/3}}\right)^{n/2} < \left(\frac{29}{30}\right)^{n/2}$$

In fact, an exponentially small probability of deviating from $n/2$ by $cn$ bits, for any constant $c > 0$, may easily be obtained by Chernoff bounds.
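The Chernoff estimate can be compared numerically with the exact tail of the binomial distribution and with the simplified $(29/30)^{n/2}$ expression. This is a small sketch with illustrative function names, assuming a uniformly random initial bitstring.

```python
from math import comb, exp


def chernoff_upper_tail(mean, delta):
    """Upper-tail Chernoff bound (e^delta / (1+delta)^(1+delta))^mean."""
    return (exp(delta) / (1 + delta) ** (1 + delta)) ** mean


def exact_tail_greater(n, threshold):
    """Exact Pr(X > threshold) for X ~ Bin(n, 1/2), via integer arithmetic."""
    return sum(comb(n, j) for j in range(threshold + 1, n + 1)) / 2 ** n


n = 300
bound = chernoff_upper_tail(n / 2, 1 / 3)
exact = exact_tail_greater(n, 2 * n // 3)
simplified = (29 / 30) ** (n / 2)
print(exact, bound, simplified)
```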

## 6 Artificial Fitness Levels (AFL)

The artificial fitness levels technique is a very simple method to achieve upper bounds on the runtime of elitist stochastic optimisation algorithms. Despite its simplicity, it often achieves very good bounds on the runtime.

The idea behind the method is to divide the search space of size $2^n$ into $m$ disjoint fitness-based partitions $A_1, \dots, A_m$ of increasing fitness such that $f(A_i) < f(A_j)$ for all $i < j$. The union of these partitions should cover the whole search space and the level of highest fitness $A_m$ should contain the global optimum (or all global optima if there is more than one).

###### Definition 3.

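A tuple $(A_1, \dots, A_m)$ is an $f$-based partition of $\mathcal{X}$ if

1. $A_1 \cup \dots \cup A_m = \mathcal{X}$,
2. $A_i \cap A_j = \emptyset$ for $i \neq j$,
3. $f(A_1) < f(A_2) < \dots < f(A_m)$,
4. $f(A_m) = \max_{x \in \mathcal{X}} f(x)$.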

For functions of unitation, a natural way of defining a fitness-based partition is to divide the search space into $n+1$ levels, each defined by the number of ones in the bitstring. For the Onemax function, where fitness increases with the number of ones in the bitstring, the fitness levels would be naturally defined as $A_i := \{x \in \{0,1\}^n \mid \textsc{Onemax}(x) = i\}$ for $0 \le i \le n$.

### 6.1 AFL - Upper Bounds

Given a fitness-based partition of the search space, it is obvious that an elitist algorithm using only one individual will only accept points of the search space that belong to levels of higher or equal fitness to the current level. Once a new fitness level has been reached, the algorithm will never return to previous levels. This implies that each fitness level has to be left at most once by the algorithm. Since in the worst case all fitness levels are visited, the sum of the expected times to leave all levels is an upper bound on the expected time to reach the global optimum. The artificial fitness levels method simplifies this idea by only requiring a lower bound on the probability of leaving each level rather than asking for the exact probabilities to leave each level.

###### Theorem 4 (Artificial Fitness Levels).

Let $f: \mathcal{X} \to \mathbb{R}$ be a fitness function, $A_1, \dots, A_m$ a fitness-based partition of $f$ and $s_1, \dots, s_{m-1}$ be lower bounds on the corresponding probabilities of leaving the respective fitness levels for a level of better fitness. Then the expected runtime of an elitist algorithm using a single individual is $E[T_{A,f}] \le \sum_{i=1}^{m-1} 1/s_i$.

The artificial fitness level method will now be applied to derive an upper bound on the expected runtime of (1+1) EA for the Onemax function. Afterwards, the bound will be generalised to general linear blocks of unitation.

###### Theorem 5.

The expected runtime of the (1+1) EA on Onemax is $O(n \ln n)$.

###### Proof.

The artificial fitness levels method will be applied to the partitions defined by the number of ones in the bitstring, i.e., $A_i := \{x \in \{0,1\}^n \mid \textsc{Onemax}(x) = i\}$. This means that all bitstrings with $i$ ones and $n-i$ zeroes belong to fitness level $A_i$. For each level $A_i$, the method requires a lower bound on the probability of reaching any level $A_j$ where $j > i$. To reach a level of higher fitness it is necessary to increase the number of ones in the bitstring. However, it is sufficient to flip a zero into a one and leave the remaining bits unchanged. Since the probability of flipping a bit is $1/n$ and there are $n-i$ zeroes that may be flipped, a lower bound on the probability to reach a level of higher fitness from level $A_i$ is:

$$s_i \ge (n-i)\cdot\frac{1}{n}\cdot\left(1-\frac{1}{n}\right)^{n-1} \ge \frac{n-i}{en}$$

where $(1-1/n)^{n-1}$ is the probability of leaving $n-1$ bits unchanged and the inequality follows because $(1-1/n)^{n-1} \ge 1/e$ for all $n \ge 1$.

Then by the artificial fitness levels method (Theorem 4),

$$E[T_{(1+1)\text{ EA},\textsc{Onemax}}] \le \sum_{i=0}^{n-1}\frac{1}{s_i} \le \sum_{i=0}^{n-1}\frac{en}{n-i} = en\sum_{i=1}^{n}\frac{1}{i} = O(n\ln n).$$
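The fitness-level bound is simply a sum of expected waiting times, which can be evaluated directly. The sketch below does so for Onemax using the level bounds $s_i \ge (n-i)/(en)$ from the proof; the function names are hypothetical.

```python
from math import e, log


def afl_upper_bound(s):
    """Sum of the waiting times 1/s_i over all non-optimal fitness levels."""
    return sum(1 / si for si in s)


def onemax_level_bounds(n):
    """Lower bounds s_i = (n - i) / (e n) for levels i = 0, ..., n-1."""
    return [(n - i) / (e * n) for i in range(n)]


n = 100
bound = afl_upper_bound(onemax_level_bounds(n))
print(bound, e * n * (log(n) + 1))
```

The computed value equals $en$ times the $n$-th harmonic number, which is at most $en(\ln n + 1)$.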

###### Theorem 6.

The expected runtime of the (1+1)-EA for a linear block of length $m$ ending at position $k$ is $O(n \ln((m+k)/k))$.

###### Proof.

Apply the artificial fitness levels method where each partition $A_i$ consists of the bitstrings in the block with $i$ zeroes. Then the probability of leaving a fitness level is bounded by $s_i \ge i/(en)$. Given that at most $m$ fitness levels need to be left and that the block starts at position $m+k$ and ends at position $k$, by Theorem 4 the expected runtime is:

$$E[T] \le \sum_{i=k+1}^{k+m}\frac{en}{i} \le en\sum_{i=k+1}^{k+m}\frac{1}{i} \le en\left(\sum_{i=1}^{k+m}\frac{1}{i} - \sum_{i=1}^{k}\frac{1}{i}\right) \le en\ln\left(\frac{m+k}{k}\right)$$
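Two instances of the bound $en\ln((m+k)/k)$ illustrate how strongly it depends on the position of the block. The parameter choices below are examples picked here for illustration, assuming $n$ is even in the second case.

$$k = 1,\ m = n-1: \quad E[T] \le en\ln\left(\frac{n}{1}\right) = en\ln n,$$

$$k = m = n/2: \quad E[T] \le en\ln\left(\frac{n}{n/2}\right) = en\ln 2 = O(n).$$

A linear block that ends one bit away from the all-ones bitstring therefore accounts for the full $O(n \ln n)$ term, while a block of the same order of length that ends far from it contributes only linear time.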

### 6.2 AFL - Lower Bounds

Recently Sudholt introduced an artificial fitness levels method to obtain lower bounds on the runtime of stochastic search algorithms [35]. Since lower bounds are aimed for, apart from the probabilities of leaving each fitness level, the method needs to also take into account the probability that some levels may be skipped by the algorithm.

###### Theorem 7.

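Consider a fitness function $f: \mathcal{X} \to \mathbb{R}$ and a fitness-based partition $A_1, \dots, A_m$ of $f$. Let $u_i$ be the probability of starting in level $A_i$, $s_i$ be an upper bound on the probability of leaving $A_i$ and $p_{i,j}$ be an upper bound on the probability of jumping from level $A_i$ to level $A_j$. If there exists some $0 \le \chi \le 1$ such that for all $j > i$

$$p_{i,j} \ge \chi\cdot\sum_{k=j}^{m-1} p_{i,k},$$

then the expected runtime of an elitist algorithm using a single individual is

$$E[T_{A,f}] \ge \chi\cdot\sum_{i=1}^{m-1} u_i \sum_{j=i}^{m-1}\frac{1}{s_j}$$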

The method will first be illustrated for the (1+1) EA on the Onemax function. Afterwards, the result will be generalised to general linear blocks of unitation.

###### Theorem 8.

The expected runtime of the (1+1) EA on Onemax is $\Omega(n \ln n)$.

###### Proof.

Apply the artificial fitness levels method on the partitions defined by the number of ones in the bitstring, i.e., $A_i := \{x \in \{0,1\}^n \mid \textsc{Onemax}(x) = i\}$. This means that all bitstrings with $i$ ones and $n-i$ zeroes belong to fitness level $A_i$. To apply the artificial fitness levels method, bounds on $s_i$ and $\chi$ need to be derived. An upper bound on the probability of leaving fitness level $A_i$ is simply $s_i = (n-i)/n$ because it is a necessary condition that at least one zero flips to reach a better fitness level. The bound follows because each bit flips with probability $1/n$ and there are $n-i$ zeroes available to be flipped. In order to obtain a suitable value of $\chi$, the method requires a lower bound on $p_{i,j}$ and an upper bound on $\sum_{k=j}^{n-1} p_{i,k}$. For the lower bound on $p_{i,j}$ notice that in order to reach level $A_j$, it is sufficient to flip $j-i$ zeroes out of the $n-i$ zeroes available and leave all the other bits unchanged. Hence the following bound is obtained:

$$p_{i,j} \ge \binom{n-i}{j-i}\left(\frac{1}{n}\right)^{j-i}\left(1-\frac{1}{n}\right)^{n-(j-i)}$$

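For an upper bound on the sum, notice that to reach any level $A_k$ with $k \ge j$ from level $A_i$ it is necessary to flip at least $j-i$ zeroes out of the $n-i$ available zeroes. So,

$$\sum_{k=j}^{n-1} p_{i,k} \le \binom{n-i}{j-i}\left(\frac{1}{n}\right)^{j-i}$$

and for $\chi = 1/e$ the condition of Theorem 7 is satisfied as follows:

$$p_{i,j} \ge \left(1-\frac{1}{n}\right)^{n-(j-i)}\cdot\sum_{k=j}^{n-1} p_{i,k} \ge \chi\cdot\sum_{k=j}^{n-1} p_{i,k}$$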

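By Eq. (1), the probability that the initial search point has fewer than $(2/3)n$ 1-bits is at least

$$\sum_{i=1}^{(2/3)n} u_i \ge 1 - \frac{3}{4}$$

The statement of Theorem 7 now yields

$$\begin{aligned}
E[T_{A,f}] &\ge \frac{1}{e}\cdot\sum_{i=1}^{n-1} u_i \sum_{j=i}^{n-1}\frac{1}{s_j} \\
&> \frac{1}{e}\cdot\left(\sum_{i=1}^{(2/3)n} u_i\right)\sum_{j=(2/3)n}^{n-1}\frac{1}{s_j} \\
&\ge \frac{1}{e}\cdot\left(1-\frac{3}{4}\right)\sum_{j=(2/3)n}^{n-1}\frac{n}{n-j} \\
&\ge \frac{n}{4e}\cdot\sum_{j=1}^{n/3}\frac{1}{j}.
\end{aligned}$$

It now follows that $E[T_{A,f}] = \Omega(n \ln n)$. ∎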

Similarly, the following result may also be proved for linear blocks of unitation functions by defining each fitness partition $A_i$ as the set of bitstrings in the block with $i$ zeroes.

###### Theorem 9.

The expected runtime of the (1+1)-EA for a linear block of length $m$ ending at position $k$ is $\Omega(n \ln((m+k)/k))$.
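For Onemax, the explicit upper and lower bounds derived in the two proofs can be compared with an empirical estimate of the expected runtime. The sketch below is a small experiment under assumptions made here: the function names, the problem size $n = 30$, the number of runs and the random seed are arbitrary choices, and the average over a finite number of runs only approximates the expectation.

```python
import random
from math import e


def harmonic(k):
    return sum(1 / i for i in range(1, k + 1))


def onemax_runtime(n, rng):
    """Fitness evaluations used by the (1+1) EA until the all-ones string is found."""
    x = [rng.randint(0, 1) for _ in range(n)]
    evals = 1
    while sum(x) < n:
        y = [b ^ 1 if rng.random() < 1 / n else b for b in x]
        evals += 1
        if sum(y) >= sum(x):  # elitist acceptance
            x = y
    return evals


n, runs = 30, 200
rng = random.Random(42)
mean_runtime = sum(onemax_runtime(n, rng) for _ in range(runs)) / runs
lower = n / (4 * e) * harmonic(n // 3)  # final expression in the lower-bound proof
upper = e * n * harmonic(n)             # final sum in the upper-bound proof
print(lower, mean_runtime, upper)
```

The constants of the two bounds differ considerably, but both grow as $n \ln n$, which is what the asymptotic statements capture.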

### 6.3 Level-based analysis of non-elitist populations

A weakness of the classical artificial fitness level technique is that it is limited to search heuristics that only keep one solution, such as the (1+1) EA, and that it relies heavily on the selection mechanism being elitist. [4] recently introduced the so-called level-based analysis, a generalisation of fitness level theorems to non-elitist evolutionary algorithms, which is also applicable to search heuristics that use populations and higher-arity operators such as crossover.

Their theorem applies to any algorithm that can be expressed in the form of Algorithm 3, such as genetic algorithms [4] and the UMDA estimation of distribution algorithm [7]. The main component of the algorithm is a random operator $D$ which, given the current population $P_t$, returns a probability distribution $D(P_t)$ over the search space $\mathcal{X}$. The next population $P_{t+1}$ is obtained by sampling $\lambda$ individuals independently from this distribution.

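In contrast to classical fitness-level theorems, the level-based theorem (Theorem 10) only assumes a partition $(A_1, \dots, A_{m+1})$ of the search space $\mathcal{X}$, and not an $f$-based partition (see Definition 3). Each of the sets $A_j$ is called a level, and the symbol $A_{\ge j} := \bigcup_{i \ge j} A_i$ denotes the set of search points in level $A_j$ or above. Given a constant $\gamma_0 \in (0,1)$, a population $P \in \mathcal{X}^{\lambda}$ is considered to be at level $A_j$ with respect to $\gamma_0$ if $|P \cap A_{\ge j}| \ge \gamma_0\lambda$ and $|P \cap A_{\ge j+1}| < \gamma_0\lambda$, meaning that at least a $\gamma_0$ fraction of the population is in level $A_j$ or higher.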

###### Theorem 10 ([4]).

Given any partition of a finite set into non-overlapping subsets , define to be the first point in time that elements of appear in of Algorithm 3. If there exist parameters , , and a constant such that for all , , and it holds

(C1)

(C2)

, and

(C3)

with , and

then

 $$E[T] \le \frac{2}{c\varepsilon}\left(m\lambda\bigl(1+\ln(1+c\lambda)\bigr) + \sum_{j=1}^{m}\frac{1}{z_j}\right).$$

The theorem provides an upper bound on the expected optimisation time of Algorithm 3 if it is possible to find a partition of the search space and accompanying parameters such that conditions (C1), (C2), and (C3) are satisfied. Condition (C1) requires a non-zero probability of creating an individual in level or higher if there are already at least individuals in level or higher. In typical applications, this imposes some conditions on the variation operator. The condition is analogous to the probability in the artificial fitness level technique. Condition (C2) requires that if in addition there are individuals at level or better, then the probability of producing an individual in level or better should be larger than by a multiplicative factor . In typical applications, this imposes some conditions on the strength of the selective pressure in the algorithm. Finally, condition (C3) imposes minimal requirements on the population size in terms of the parameters above.
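Once suitable parameters have been found, evaluating the bound of the level-based theorem is a mechanical step. The sketch below only computes the right-hand side of the bound; it does not check conditions (C1) to (C3). The function name `level_based_bound` is hypothetical, and `c` and `eps` are taken as plain inputs because their definitions belong to the conditions of the theorem.

```python
import math
from typing import Sequence


def level_based_bound(z: Sequence[float], lam: int, c: float, eps: float) -> float:
    """Evaluate (2 / (c * eps)) * (m * lam * (1 + ln(1 + c * lam)) + sum_j 1 / z_j).

    z   : per-level lower bounds z_1, ..., z_m on the upgrade probability
    lam : population size (lambda)
    c, eps : constants of the theorem, assumed to be supplied by the caller
    """
    if not z or any(not (0.0 < zj <= 1.0) for zj in z):
        raise ValueError("every z_j must lie in (0, 1]")
    if lam <= 0 or c <= 0 or eps <= 0:
        raise ValueError("lam, c and eps must be positive")
    m = len(z)
    population_term = m * lam * (1.0 + math.log(1.0 + c * lam))
    level_term = sum(1.0 / zj for zj in z)
    return 2.0 / (c * eps) * (population_term + level_term)


# Illustrative values only; they are not taken from the chapter.
print(level_based_bound([0.5, 0.25], lam=10, c=0.1, eps=0.5))
```

The two summands make the trade-off visible: the first grows with the population size, the second with the difficulty of leaving the hardest levels.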

As an example application of the level-based theorem, the EA is analysed, which is the non-elitist variant of the EA shown in Algorithm 1. The two algorithms differ in the selection step (line 8) where the new population in EA is chosen as the best individuals out of and breaking ties uniformly at random. While the EA always retains the best individuals in the population (hence the name elitist), the EA always discards the old individuals .

At first sight, it may appear as if the EA cannot be expressed in the form of Algorithm 3. The individuals that are kept in each generation are not independent due to the inherent sorting of the offspring. However, taking a different perspective, the population of the algorithm at time could also be interpreted as the offspring . In this alternative interpretation, the new population is now created by sampling uniformly at random among the best individuals in the population, and applying the mutation operator. The operator in Algorithm 3 can now be defined as in Algorithm 4.
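The alternative interpretation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a transcription of Algorithms 3 and 4: the names `sample_offspring` and `run_comma_ea` are hypothetical, the fitness function is fixed to Onemax, and bitwise mutation flips each bit independently with probability $\chi/n$.

```python
import random
from typing import List, Optional

Bitstring = List[int]


def onemax(x: Bitstring) -> int:
    return sum(x)


def sample_offspring(population: List[Bitstring], mu: int, chi: float,
                     rng: random.Random) -> Bitstring:
    """Draw one individual: pick uniformly among the mu best, then mutate."""
    shuffled = list(population)
    rng.shuffle(shuffled)  # random order first, so the stable sort breaks ties uniformly
    best = sorted(shuffled, key=onemax, reverse=True)[:mu]
    parent = rng.choice(best)
    n = len(parent)
    return [1 - bit if rng.random() < chi / n else bit for bit in parent]


def run_comma_ea(n: int, mu: int, lam: int, chi: float = 1.0,
                 max_generations: int = 10_000, seed: int = 0) -> Optional[int]:
    """Return the first generation whose population contains the optimum, or None."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(lam)]
    for generation in range(max_generations):
        if any(onemax(x) == n for x in population):
            return generation
        # The next population consists of lam independent samples.
        population = [sample_offspring(population, mu, chi, rng) for _ in range(lam)]
    return None


print(run_comma_ea(n=10, mu=10, lam=40))
```

Sorting inside every call is wasteful but keeps each sample an independent draw from the same distribution, which is the property the level-based theorem relies on.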

The following lemma will be useful when estimating the probability that the mutation operator does not flip any bit positions.

###### Lemma 11.

For any $\delta \in (0,1)$ and $\chi > 0$, if $n \ge (\chi+\delta)(\chi/\delta)$ then

 $$\left(1-\frac{\chi}{n}\right)^{n} \ge (1-\delta)e^{-\chi}.$$
###### Proof.

Note first that $\ln(1-\delta) \le -\delta$, hence

 $$\left(\frac{n}{\chi}-1\right)\bigl(\chi-\ln(1-\delta)\bigr) \ge n + \frac{n\delta}{\chi} - (\chi+\delta) \ge n.$$

By making use of the fact that $(1-1/x)^{x-1} \ge 1/e$ for all $x > 1$ and simplifying the exponent as above

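 $$\left(1-\frac{\chi}{n}\right)^{n} \ge \left[\left(1-\frac{\chi}{n}\right)^{(n/\chi)-1}\right]^{\chi-\ln(1-\delta)} \ge (1-\delta)e^{-\chi}.$$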
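The inequality of Lemma 11 is easy to check numerically. The sketch below assumes the side condition $n \ge (\chi+\delta)\chi/\delta$, which is what the second inequality of the proof requires; the helper names `lemma11_holds` and `min_n` are hypothetical.

```python
import math


def lemma11_holds(n: int, chi: float, delta: float) -> bool:
    """Check (1 - chi/n)^n >= (1 - delta) * exp(-chi) for concrete values."""
    return (1.0 - chi / n) ** n >= (1.0 - delta) * math.exp(-chi)


def min_n(chi: float, delta: float) -> int:
    """Smallest integer n satisfying the assumed condition n >= (chi + delta) * chi / delta."""
    return math.ceil((chi + delta) * chi / delta)


for chi in (0.5, 1.0, 2.0):
    for delta in (0.1, 0.5, 0.9):
        n0 = min_n(chi, delta)
        ok = all(lemma11_holds(n, chi, delta) for n in range(n0, n0 + 200))
        print(f"chi={chi}, delta={delta}, n0={n0}, holds for n0..n0+199: {ok}")
```

A smaller $\delta$ gives a tighter estimate of $e^{-\chi}$ but requires a larger problem size $n$ before the lemma applies.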

The expected optimisation time of the  EA on Onemax can now be expressed in terms of the mutation rate and the problem size assuming some constraints on the population sizes and . The theorem is valid for a wide range of mutation rates . In the classical setting of , the expected optimisation time reduces to .

###### Theorem 12.

The expected optimisation time of the EA with bitwise mutation rate where , and population sizes and satisfying for any constant

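 $$\frac{\lambda}{\mu} \ge \left(\frac{1+\delta}{1-\delta}\right)e^{\chi}, \quad\text{and}\quad \lambda \ge \frac{4}{\delta^{2}e^{\chi}}\ln\left(\frac{24576\,n(n+1)}{\delta^{7}\chi}\right)$$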

on Onemax is for any no more than

 $$\frac{1536\,n}{\delta^{5}}\left(\lambda\ln(\lambda) + \frac{e^{\chi}\ln(n+2)}{\chi(1-\delta)}\right) + O(n\lambda).$$
###### Proof.

Apply the level-based theorem with the same partitions as in the proof of Theorem 8. Since the parameter is assumed to be some constant , it also holds that the parameters and are positive constants. The parameters and will be chosen later.

To verify that conditions (C1) and (C2) hold for any , it is necessary to estimate the probability that operator produces a search point with one-bits when applied to a population containing at least individuals, each having at least one-bits (formally ). Such an event is called a successful sample.

Condition (C1) asks for bounds for each on the probability that the search point returned by Algorithm 4 contains one-bits. First choose the parameter setting . This parameter setting is convenient, because the selection step in Algorithm 4 always picks an individual among the best individuals in the population. By the assumption that , the algorithm will always select an individual containing at least one-bits for some non-negative integer .

Assume without loss of generality, that the first bit-positions in the selected individual are one-bits, and let , be any of the other bit positions. If there is a zero-bit in position or if , then a successful sample occurs if the mutation operator flips only bit position . If there is a one-bit in position , and if , then the step is still successful if the mutation operator flips none of the bit positions. Since the probability of not flipping a position is higher than the probability of flipping a position, i.e., , the probability of a successful sample is therefore in both cases at least

 $$(n-j)(\chi/n)(1-\chi/n)^{n-1}. \tag{2}$$

By Lemma 11, the probability above is at least The parameter is chosen to be the minimal among these probabilities, i.e. .

Condition (C2) assumes in addition that individuals have fitness or higher. In this case, it suffices that the selection mechanism picks one of the best individuals among the individuals, and that none of the bits are mutated in the selected individual. The probability of this event is at least

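 $$\frac{\gamma\lambda}{\mu}(1-\chi/n)^{n} \ge \frac{\gamma\lambda}{\mu}e^{-\chi}(1-\delta)$$

Hence, to satisfy condition (C2), it suffices to require that

 $$\frac{\gamma\lambda}{\mu}\exp(-\chi)(1-\delta) \ge \gamma(1+\delta),$$

which is true whenever

 $$\frac{\lambda}{\mu} \ge \left(\frac{1+\delta}{1-\delta}\right)e^{\chi}.$$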

To check condition (C3), notice that , and , hence

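 $$a = \frac{\delta^{2}(\lambda/\mu)}{2(1+\delta)} \ge \frac{\delta^{2}e^{\chi}}{2(1-\delta)}, \quad\text{and}\quad a c\varepsilon z_{*} \ge \frac{\delta^{2}\chi}{2n}\,c\varepsilon = \frac{\delta^{7}\chi}{1536\,n}$$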

Condition (C3) is now satisfied, because the population size is required to fulfil

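 $$\frac{2}{a}\ln\left(\frac{16m}{a c\varepsilon z_{*}}\right) \le \frac{4(1-\delta)}{\delta^{2}e^{\chi}}\ln\left(\frac{24576\,mn}{\delta^{7}\chi}\right) \le \lambda$$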

All conditions are satisfied, and the theorem follows. ∎
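The constants in the check of condition (C3) can be traced step by step. The following worked calculation is an explanatory sketch rather than part of the original proof; it uses only the two bounds on $a$ and on $a c\varepsilon z_{*}$ derived in the proof.

$$\begin{aligned}
\frac{2}{a} &\le 2\cdot\frac{2(1-\delta)}{\delta^{2}e^{\chi}} = \frac{4(1-\delta)}{\delta^{2}e^{\chi}}, \\
\frac{16m}{a c\varepsilon z_{*}} &\le 16m\cdot\frac{1536\,n}{\delta^{7}\chi} = \frac{24576\,mn}{\delta^{7}\chi}.
\end{aligned}$$

Multiplying the first bound by the logarithm of the second gives the middle expression in the final display of the proof, and the identity $\frac{\delta^{2}\chi}{2n}\,c\varepsilon = \frac{\delta^{7}\chi}{1536\,n}$ implies $c\varepsilon = \delta^{5}/768$, which explains the factor $2/(c\varepsilon) = 1536/\delta^{5}$ in the bound of Theorem 12.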

### 6.4 Conclusions

The artificial fitness levels method was first described by Wegener [37]. The original method was designed for the achievement of upper bounds on the runtime of stochastic search heuristics using only one individual such as the (1+1) EA. Since then, several extensions of the method have been devised for the analysis of more sophisticated algorithms. Sudholt introduced the method presented in Section 7 for the obtainment of lower bounds on the runtime [35]. In an early study, [38] used a potential function that generalises the fitness level argument of [37] to analyse the (+1) EA. His analysis achieved tight upper bounds on the runtime of the (+1) EA on LeadingOnes and Onemax by waiting for a sufficient amount of individuals of the population to take over a given fitness level before calculating the probability to reach a fitness level of higher fitness. Chen et al. extended the analysis to offspring populations by analysing the (+) EA, also taking into account the take over process [2]. Lehre introduced a general fitness-level method for arbitrary population-based EAs with non-elitist selection mechanisms and unary variation operators [20]. This technique was later generalised further into the level-based method presented in Section 6.3 [4]. The method allows the analysis of sophisticated non-elitist heuristics such as genetic algorithms equipped with mutation, crossover and stochastic selection mechanisms, both for classical as well as noisy and uncertain optimisation [6].

## 7 Drift Analysis

Drift analysis is a very flexible and powerful tool that is widely used in the analysis of stochastic search algorithms. The high level idea is to predict the long term behaviour of a stochastic process by measuring the expected progress towards a target in a single step. Naturally, a measure of progress needs to be introduced, which is generally called a distance function. Given a random variable representing the current state of the process at step , over a finite set of states , a distance function is defined such that if and only if is a target point (e.g., the global optimum). Drift analysis aims at deriving the expected time to reach the target by analysing the decrease in distance in each step, i.e., . The expected value of this decrease in distance, is called the drift. See Figure 8 for an illustration. If the initial distance from the target is and a bound on the drift (i.e., the expected improvement in each step) is known, then bounds on the expected runtime to reach the target may be derived.

The additive drift theorem was introduced to the field of evolutionary computation by He and Yao [12]. It allows both upper and lower bounds on the runtime of stochastic search algorithms to be derived. Consider a distance function indicating the current distance, at time $t$, of the stochastic process from the optimum. The theorem simply states that if, at each time step $t$, the drift is at least some positive value (i.e., the process is expected to move closer to the target), then the expected number of steps to reach the target is at most the initial distance divided by that value. Conversely, if the drift in each step is at most some positive value, then the expected number of steps to reach the target is at least the initial distance divided by that value.

###### Theorem 13 (Additive Drift Theorem).

Given a stochastic process over an interval and a distance function such that if and only if contains the target. Let for all , define , and assume .
Verify the following conditions:

Then,

1. If (C1+) holds for an , then .

2. If (C1) holds for an , then .

An example application of the additive drift theorem follows concerning the (1+1) EA for plateau blocks of functions of unitation of length positioned such that .

###### Theorem 14.

The expected runtime of the (1+1)-EA for a plateau block of length ending at position is .

###### Proof.

The additive drift theorem will be applied to derive both upper and lower bounds on the expected runtime. The starting point is a bitstring with zeroes and the target point is a bitstring with zeroes. Choose to use the natural distance function that counts the number of zeroes in the bitstring. Subtract from the distance such that target points with zeroes have distance 0 and the initial point has distance . As long as points on the plateau are generated, they will be accepted because all plateau points have equal fitness. Given that each bit flips with probability , and at each step the current search point has zeroes and ones, the drift is

 $$\Delta_t := E[Y_t - Y_{t+1} \mid Y_t > 0] = \frac{Y_t}{n} - \frac{n-Y_t}{n} = \frac{2\cdot Y_t}{n} - 1$$

A lower bound on the drift is obtained by considering that as long as the end of the plateau has not been reached there are always at least zeroes that may be flipped (i.e., ). Accordingly for an upper bound, at most zeroes may be available to be flipped (i.e., ). Hence,

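 $$\frac{2k}{n} - 1 \le \Delta_t \le \frac{2(m+k)}{n} - 1$$

Then by additive drift analysis (Theorem 13),

 $$E[T \mid Y_0] \le \frac{m}{(2k)/n - 1} = \frac{mn}{2k-n} = O(m)$$

and

 $$E[T \mid Y_0] \ge \frac{m}{2(m+k)/n - 1} = \frac{mn}{2(m+k)-n} = \Omega(m)$$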

where the last equalities hold as long as . ∎
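The two bounds of the proof can be evaluated for concrete parameters. The sketch below is illustrative only: the function name `plateau_bounds` is hypothetical, and the requirement $2k > n$ is an assumption added so that both drift bounds are positive.

```python
from typing import Tuple


def plateau_bounds(n: int, m: int, k: int) -> Tuple[float, float]:
    """Return (lower, upper) bounds m*n/(2(m+k)-n) and m*n/(2k-n) on E[T | Y_0].

    n : bitstring length
    m : plateau length (initial distance to the target)
    k : number of zeroes at the end of the plateau
    """
    if 2 * k <= n:
        raise ValueError("the drift lower bound 2k/n - 1 must be positive")
    lower = m * n / (2 * (m + k) - n)  # from the upper bound on the drift
    upper = m * n / (2 * k - n)        # from the lower bound on the drift
    return lower, upper


# Illustrative values only.
print(plateau_bounds(n=100, m=10, k=70))
```

Both bounds are linear in $m$ for fixed $k/n$, in line with the $O(m)$ and $\Omega(m)$ statements.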

Note again that if the plateau block is followed by a gap block, then an upper bound on the expected time to optimise both blocks is achieved by multiplying the upper bounds obtained for each block. This is necessary because points in the gap will not be accepted by the (1+1) EA.

### 7.2 Multiplicative Drift Theorem

In the additive drift theorem the worst case decrease in distance is considered. If the expected decrease in distance changes considerably in different areas of the search space, then the estimate on the drift may be too pessimistic for the obtainment of tight bounds on the expected runtime.

Drift analysis of the (1+1) EA for the classical Onemax function will serve as an example of this problem. Since the global optimum is the all-ones bitstring and the fitness increases with the number of ones, a natural distance function is which simply counts the number of zeroes in the current search point. Then the distance will be zero once the optimum is found. Points with fewer one-bits than the current search point will not be accepted by the algorithm because of their lower fitness. So the drift is always positive, i.e., and the amount of progress is the expected number of ones gained in each step. In order to find an upper bound on the runtime, a lower bound on the drift is needed (i.e., the worst case improvement). Such a worst case occurs when the current search point is optimal except for one 0-bit. In this case the maximum decrease in distance that may be achieved in a step is 1, and to achieve such progress it is necessary that the algorithm flips the zero into a one and leaves the other bits unchanged. Hence, the drift is

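 $$\Delta_t \ge 1\cdot\frac{1}{n}\left(1-\frac{1}{n}\right)^{n-1} \ge \frac{1}{en} := \varepsilon$$

Since the expected initial distance is $n/2$ due to random initialisation, the drift theorem yields

 $$E[T \mid Y_0] \le \frac{E[Y_0]}{\varepsilon} = \frac{n/2}{1/(en)} = e/2\cdot n^{2} = O(n^{2})$$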

In Section 6 it was proven that the runtime of the (1+1) EA for Onemax is $\Theta(n \ln n)$, hence a bound of $O(n^{2})$ is not tight. The reason is that on functions such as Onemax the amount of progress made by the algorithm depends crucially on the distance from the optimum. For Onemax in particular, larger progress per step is achieved when the current search point has many zeroes that may be flipped. As the algorithm approaches the optimal solution, the expected progress in each step becomes smaller because search points have increasingly more one-bits than zero-bits in the bitstring. In such cases a distance function that takes these properties of the objective function into account needs to be used. For Onemax a correct bound is achieved by using a distance function that is logarithmic in the number of zeroes $i$, i.e., $\ln(i+1)$, where a 1 is added to $i$ in the argument of the logarithm such that the global optimum has distance zero (i.e., $\ln(1) = 0$). With such a distance measure, the decrease in distance when flipping a zero and leaving the rest of the bitstring unchanged is

 $$\ln(i+1) - \ln(i) = \ln\left(1+\frac{1}{i}\right) \ge \frac{1}{2i}$$

where the last inequality holds for all $i \ge 1$. Since it is sufficient to flip a zero and leave everything else unchanged to obtain an improvement, the drift is

 $$\Delta_t \ge \frac{i}{en}\cdot\frac{1}{2i} = \frac{1}{2en} := \varepsilon$$

Given that the maximum possible distance is $\ln(n+1)$, the drift theorem yields

 $$E[T] \le \frac{Y_0}{1/(2en)} = 2en\cdot\ln(n+1) = O(n\ln n).$$
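The gap between the two drift bounds can be compared with an empirical average. The sketch below assumes the standard (1+1) EA with bitwise mutation probability $1/n$ and acceptance of offspring that are at least as fit as the parent; the function names, the seed and the choice $n = 50$ are illustrative and not taken from the chapter.

```python
import math
import random
from statistics import mean
from typing import Tuple


def one_plus_one_ea_onemax(n: int, rng: random.Random) -> int:
    """Run the (1+1) EA on Onemax and return the number of iterations used."""
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = sum(x)
    steps = 0
    while fx < n:
        # Flip each bit independently with probability 1/n.
        y = [1 - bit if rng.random() < 1.0 / n else bit for bit in x]
        fy = sum(y)
        if fy >= fx:  # elitist acceptance
            x, fx = y, fy
        steps += 1
    return steps


def drift_bounds(n: int) -> Tuple[float, float]:
    """Return the bounds (e/2) * n^2 and 2 * e * n * ln(n + 1) derived above."""
    return math.e / 2 * n ** 2, 2 * math.e * n * math.log(n + 1)


rng = random.Random(42)
n = 50
average = mean(one_plus_one_ea_onemax(n, rng) for _ in range(30))
quadratic, logarithmic = drift_bounds(n)
print(f"average={average:.1f}, linear distance bound={quadratic:.1f}, "
      f"logarithmic distance bound={logarithmic:.1f}")
```

Both values are upper bounds on the expected runtime, so the empirical average should lie below them; the bound obtained with the logarithmic distance is much closer to it.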

The multiplicative drift theorem was introduced as a handy tool to deal with situations as the one described above where the amount of progress depends on the distance from the target.

###### Theorem 15 (Multiplicative Drift Theorem [8]).

Let be random variables describing a Markov process over a finite state space . Let be the random variable that denotes the earliest point in time such that . If there exist such that for all ,

1. and

2. ,

then

 $$E[T] \le \frac{2}{\delta}\cdot\ln\left(1+\frac{c_{\max}}{c_{\min}}\right)$$
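The bound of Theorem 15 is simple to evaluate. In the sketch below, the function name `multiplicative_drift_bound` is hypothetical, and `c_min` and `c_max` are assumed to be the smallest and largest non-zero distance values of the process. With $\delta = 1/(en)$, $c_{\min} = 1$ and $c_{\max} = n$, as for the number of zeroes on Onemax, the value coincides with the bound $2en\ln(n+1)$ obtained above with the logarithmic distance function.

```python
import math


def multiplicative_drift_bound(delta: float, c_min: float, c_max: float) -> float:
    """Evaluate (2 / delta) * ln(1 + c_max / c_min)."""
    if delta <= 0 or c_min <= 0 or c_max < c_min:
        raise ValueError("need delta > 0 and 0 < c_min <= c_max")
    return 2.0 / delta * math.log(1.0 + c_max / c_min)


n = 100
print(multiplicative_drift_bound(delta=1.0 / (math.e * n), c_min=1.0, c_max=float(n)))
```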

The following derivation of an upper bound on the runtime of the (1+1) EA for linear blocks illustrates the multiplicative drift theorem.

###### Theorem 16.

The expected time for the (1+1)-EA to optimise a linear unitation block of length ending at position is

###### Proof.

Let be the number of zero-bits in the bitstring at time step , representing the distance from the end of the linear block. By remembering that increases in distance are not accepted due to elitism, the expected decrease in distance at time step can be bounded by

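 $$E[X_{t+1} \mid X_t] \le X_t - 1\cdot\frac{X_t}{en} = X_t\left(1-\frac{1}{en}\right)$$

simply by considering that if a zero-bit is flipped and nothing else then the distance decreases by 1. Then the drift is:

 $$E[X_t - X_{t+1} \mid X_t] \ge X_t - X_t\left(1-\frac{1}{en}\right) = \frac{1}{en}X_t := \delta X_t$$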

By fixing the multiplicative drift theorem yields

 E[T]≤2δ