Runtime Analysis for Self-adaptive Mutation Rates
Extended version of a paper appearing at the Genetic and Evolutionary Computation Conference 2018 [DWY18]. This version contains all proofs, most of which for reasons of space did not fit into the conference version. In this version, the main result is valid for all λ ≥ C ln n with C a sufficiently large constant, whereas the conference version needed a stronger lower bound on λ.
We propose and analyze a self-adaptive version of the (1,λ) evolutionary algorithm in which the current mutation rate is part of the individual and thus also subject to mutation. A rigorous runtime analysis on the OneMax benchmark function reveals that a simple local mutation scheme for the rate leads to an expected optimization time (number of fitness evaluations) of O(nλ/log λ + n log n) when λ is at least C ln n for some constant C > 0. For all such values of λ, this performance is asymptotically best possible among all λ-parallel mutation-based unbiased black-box algorithms.
Our result shows that self-adaptation in evolutionary computation can find complex optimal parameter settings on the fly. At the same time, it proves that a relatively complicated self-adjusting scheme for the mutation rate proposed by Doerr, Gießen, Witt, and Yang (GECCO 2017) can be replaced by our simple endogenous scheme.
On the technical side, the paper contributes new tools for the analysis of two-dimensional drift processes arising in the analysis of dynamic parameter choices in EAs, including bounds on occupation probabilities in processes with non-constant drift.
Evolutionary algorithms are a class of heuristic algorithms that can be applied to solve optimization problems if no problem-specific algorithm is available. For example, this may be the case if the structure of the underlying problem is poorly understood or one is faced with a so-called black-box scenario, in which the quality of a solution can only be determined by calling an implementation of the objective function. This implementation may be implicitly given by, e. g., the outcome of a simulation without revealing structural relationships between search point and function value.
An approach to understand the working principles of evolutionary algorithms is to analyze the underlying stochastic process and its first hitting time of the set of optimal or approximate solutions. The runtime analysis community in evolutionary computation (see, e. g., [AD11, Jan13, NW10] for an introduction to the subject) follows this approach by partly using methods known from the analysis of classical randomized algorithms and, more recently and increasingly often, using and adapting tools from the theory of stochastic processes to obtain bounds on the hitting time of optimal solutions for different classes of evolutionary algorithms and problems. Such bounds will typically depend on problem size, problem type, evolutionary algorithm and choice of the parameters that these heuristic algorithms come with.
One of the core difficulties when using evolutionary algorithms is in fact finding suitable values for their parameters. It is well known, and supported by ample experimental and some theoretical evidence, that already small changes of the parameters can have a crucial influence on the efficiency of the algorithm.
One elegant way to overcome this difficulty, and in addition the difficulty that the optimal parameter values may change during a run of the algorithm, is to let the algorithm optimize the parameters on the fly. Formally speaking, this is an even more complicated task, because instead of a single good parameter value now a suitable functional dependence of the parameter on the search history needs to be provided. Fortunately, a number of natural heuristics like the 1/5-th rule have proven to be effective in certain cases. In a sense, these are all exogenous parameter control mechanisms which are added to the evolutionary system.
An even more elegant way is to incorporate the parameter control mechanism into the evolutionary process, that is, to attach the parameter value to the individual, to modify it via (extended) variation operators, and to use the fitness-based selection mechanisms of the algorithm to ensure that good parameter values become dominant in the population. This self-adaptation of the parameter values has two main advantages: (i) It is generic, that is, the adaptation mechanism is provided by the algorithm; only the representation of the parameter in the individual and the extension of the variation operators have to be provided by the user. (ii) It allows one to reuse existing algorithms and much of the existing code.
Despite these advantages, self-adaptation is not used a lot in discrete evolutionary optimization. From the theory side, some advice exists on how to set up such a self-adaptive system, but a real proof for its usefulness is still missing. This is the point we aim to make some progress on.
1.1 Our Results
The main contribution of this work is a version of the (1,λ) evolutionary algorithm (EA) with a natural self-adaptive choice of the mutation rate. For λ ≥ C ln n, C a sufficiently large constant, we prove that it optimizes the classic OneMax benchmark problem in a runtime that is asymptotically optimal among all λ-parallel black-box optimization algorithms and that is better than the known runtimes of the (1,λ) EA and the (1+λ) EA for all static choices of the mutation rate. Compared to the (also asymptotically optimal) (1+λ) EA with fitness-dependent mutation rate of Badkobeh, Lehre, and Sudholt [BLS14] and the (1+λ) EA with self-adjusting (exogenous) mutation rate of Doerr, Gießen, Witt, and Yang [DGWY17], the good news of our result is that this optimal runtime is obtained in a generic manner. Note that both the fitness-dependent mutation rate of [BLS14] and the self-adjusting rate of [DGWY17] with its mix of random and greedy rate adjustments would have been hard to find without a deeper understanding of the mathematics of these algorithms.
Not surprisingly, the proof of our main result has some similarity to the analysis of the self-adjusting (1+λ) EA of [DGWY17]. In particular, we also estimate the expected progress in one iteration and use variable drift analysis. Also, we need a careful probabilistic analysis of the progress obtained from different mutation rates to estimate which rate is encoded in the new parent individual (unfortunately, we cannot reuse the analysis of [DGWY17] since it is not always strong enough for our purposes). The reason, and this is also the main technical challenge in this work, is that the (1,λ) EA can lose fitness in one iteration. This happens almost surely when the mutation rate is too high. For this reason, we need to argue more carefully that such events do not happen regularly. To do so, among several new arguments, we also need a stronger version of the occupation probability result [KLW15, Theorem 7] since (i) we need sharper probability estimates for the case that movements away from the target are highly unlikely and (ii) for our process, the changes per time step cannot be bounded by a small constant. We expect our new results (Lemmas 6 and 7) to find other applications in the theory of evolutionary algorithms in the future. Note that for the (1+λ) EA, an excursion into unfavorable rate regions is less of a problem as long as one can show that the mutation rate returns into the good region after a reasonable time. The fact that the (1,λ) EA can lose fitness also makes it more difficult to cut the analysis into regimes defined by fitness levels, since it is now possible that the EA returns into a previous regime.
In this work, we also gained two insights which might be useful in the design of future self-adaptive algorithms.
Need for non-elitism: Given the previous works, it would be natural to try a self-adaptive version of the (1+λ) EA. However, this is risky. While the self-adjusting (1+λ) EA of [DGWY17] copes well with the situation that the current mutation rate is far from the ideal one and then provably quickly changes the rate to an efficient setting, a self-adaptive algorithm cannot do so. Since the mutation rate is encoded in the individual, a change of the rate can only occur if an offspring is accepted. For an elitist algorithm like the (1+λ) EA, this is only possible when an offspring is generated that is good enough to compete with the parent(s). Consequently, if the parent individual in a self-adaptive (1+λ) EA has a high fitness, but a detrimental (that is, too large) mutation rate, then the algorithm is stuck with this individual for a long time. Already for the simple OneMax function, such a situation can lead to an exponential runtime.
Needless to say, when using a comma strategy we have to choose λ sufficiently large to avoid losing the current-best solution too quickly. This phenomenon has been observed earlier; e.g., in [RS14] it is shown that λ ≥ log_{e/(e−1)} n (up to lower order terms) is necessary for the (1,λ) EA with mutation rate 1/n to have a polynomial runtime on any function with unique optimum. We shall not specify a precise leading constant for our setting, but also require that λ ≥ C ln n for a sufficiently large constant C.
Tie-breaking towards lower mutation rates: To prove our result, we need that the algorithm, in the case of several offspring having equal fitness, prefers those with the smaller mutation rate. Given that the usual recommendation for the mutation rate is small, namely 1/n, and that it is well-known that large rates can be very detrimental, it is natural to prefer smaller rates in case of ties (where, loosely speaking, the offspring population gives no hint which rate is preferable).
This choice is similar to the classic tie-breaking rule of preferring offspring over parents in case of equal fitness. Here, again, the fitness indicates no preference, but the simple fact that one has possibly already been working for quite some time with this parent suggests rather preferring the new individual.
1.2 Previous Works
This being a theoretical paper, for reasons of space we shall mostly review the relevant theory literature, and also this with a certain brevity. For a broader account of previous works, we refer to the survey [KHE15]. For a detailed description of the state of the art in theory of dynamic parameter choices, we refer to the survey [DD18b]. We note that the use of self-adaptation in genetic algorithms was proposed in the seminal paper [Bäc92] by Bäck. Also, we completely disregard evolutionary optimization in continuous search spaces due to the very different nature of optimization there (visible, e.g., from the fact that dynamic parameter changes, including self-adaptive choices, are very common and in fact necessary to allow the algorithms to approach the optimum with arbitrary precision).
The theoretical analysis of dynamic parameter choices started slowly. A first paper [JW06] on this topic in 2006 demonstrated the theoretical superiority of dynamic parameter choices by giving an artificial example problem for which any static choice of the mutation rate leads to an exponential runtime, whereas a suitable time-dependent choice leads to a polynomial runtime. Four years later [BDN10], it was shown that a fitness-dependent choice of the mutation rate can give a constant-factor speed-up when optimizing the LeadingOnes benchmark function (see [Doe18a, Section 2.3] for a simplified proof giving a more general result). The first super-constant speed-up on a classic benchmark function obtained from a fitness-dependent parameter choice was shown in [DDE13], soon to be followed by the paper [BLS14], which is highly relevant for this work. In [BLS14], the (1+λ) EA with fitness-dependent mutation rate was analyzed. For a slightly complicated fitness-dependent mutation rate, an optimization time of O(nλ/log λ + n log n) was obtained. Also, it was shown that no λ-parallel mutation-based unbiased black-box algorithm can have an asymptotically better optimization time.
Around that time, several successful self-adjusting (“on-the-fly”) parameter choices were found and analyzed with mathematical means. In [LS11], a success-based multiplicative update of the population size λ in the (1+λ) EA is proposed and it is shown that this can lead to a reduction of the parallel runtime. A multiplicative update inspired by the 1/5-th success rule from evolution strategies automatically finds parameter settings [DD15] leading to the same performance as the fitness-dependent choice in [DDE13]. Similar multiplicative update rules have been used to control the mutation strength for multi-valued decision variables [DDK18] and the time interval for which a selected heuristic is used in [DLOW18]. A learning-based approach was used in [DDY16a] to automatically adjust the mutation strength and obtain the performance of the fitness-dependent choice of [DDY16b]. Again a different approach was proposed in [DGWY17], where the mutation rate for the (1+λ) EA was determined on the fly by creating half the offspring with a smaller and half the offspring with a larger mutation rate than the value currently thought to be optimal. As the new mutation rate, with probability 1/2 the rate which produced the best offspring was chosen, and with probability 1/2 a random one of the two rates was chosen. The three different exogenous approaches used in these works indicate that a generic approach towards self-adjusting parameter choices, such as self-adaptation, would ease the design of such algorithms significantly.
Surprisingly, prior to this work only a single runtime analysis paper for self-adapting parameter choices appeared. In [DL16b], Dang and Lehre show several positive and negative results on the performance of a simple class of self-adapting evolutionary algorithms having the choice between several mutation rates. Among them, they show that such an algorithm having the choice between an appropriate and a destructively high mutation rate can optimize the LeadingOnes benchmark function in the usual quadratic time, whereas the analogous algorithm using a random one of the two mutation rates (and hence in half the cases the right rate) fails badly and needs an exponential time. As a second remarkable result, they give an example setting where any constant mutation rate leads to an exponential runtime, whereas the self-adapting algorithm succeeds in polynomial time. As for almost all such examples, also this one is slightly artificial and needs quite some assumptions, for example, that all individuals are initialized with the 1-point local optimum. Nevertheless, this result makes clear that self-adaptation can outperform static parameter choices. In the light of this result, the main value of our result is to show that asymptotic runtime advantages from self-adaptation can also be obtained in less constructed examples (of course, at the price that the runtime gap is not exponential).
To complete the picture on previous work relevant to ours, we finally quickly describe what is known about the performance of the most common mutation-based algorithms on the OneMax benchmark function. For the simple (1+1) EA, the expected runtime of Θ(n log n) was determined in [Müh92] (upper bound) and [DJW02] (lower bound; this result was announced already in 1998). For the (1+λ) EA with mutation rate c/n, c a constant, an expected runtime (number of fitness evaluations) of Θ(nλ · (log log λ)/log λ + n log n) was shown.
The earliest runtime analysis of the (1,λ) EA with mutation rate 1/n on OneMax is due to Jägersküpper and Storch [JS07], who prove a phase transition from exponential to polynomial runtime in the regime λ = Θ(log n), leaving a multiplicative gap of at least 21 between the largest λ in the exponential regime and the smallest λ in the polynomial regime. This result was improved by Rowe and Sudholt [RS14], who determined the phase transition point to be the above-mentioned function log_{e/(e−1)} n, up to lower order terms. Jägersküpper and Storch [JS07] also obtain a useful coupling result: if λ ≥ C log n for a sufficiently large constant C, the stochastic behaviors of the (1+λ) EA and the (1,λ) EA are with high probability identical for a certain polynomial (with degree depending on C) number of steps, allowing the above-mentioned results about the (1+λ) EA to be transferred to the (1,λ) EA.
One of the technical difficulties in our analysis is that our self-adaptive (1,λ) EA can easily lose fitness when the rate parameter is set to an unsuitable value. For this reason, we cannot use the general approach of the analysis of the self-adjusting (1+λ) EA in [DGWY17], which separated the analysis of the rate and the fitness by, in very simple words, first waiting until the rate is in the desired range and then waiting for a fitness improvement (of course, taking care of the fact that the rate could leave the desired range). To analyze the joint process of fitness and rate with its intricate interactions, we in particular use drift analysis with a two-dimensional distance function, that is, we map (e.g., in Lemma 22) the joint space of fitness and rate suitably into the non-negative integers in a way that the expected value of this mapping decreases in each iteration. This allows us to use well-known drift theorems.
The use of two-dimensional potential functions is not new in the analysis of evolutionary algorithms. However, so far only very few analyses exist that use this technique with dynamic parameter values, and among these results, we feel that ours, in particular Lemma 22, are relatively easy to use. Again in very simple words, the distance function defined in the proof of Lemma 22 is the fitness distance plus a pessimistic estimate for the fitness loss that could be caused by the current rate if this is maladjusted. We thus hope that this work eases future analyses of dynamic parameter choices by suggesting ways to suitably measure the progress in the joint space of solution quality and parameter value.
To allow the reader to compare our two-dimensional drift approach with existing works using similar arguments, we briefly review the main works that use two- or more-dimensional potential functions. Ignoring that the artificial fitness functions used in [DJW02, DJW12, DG13, Wit13] could also be interpreted as multi-dimensional potential functions, the possibly first explicit use of a two-dimensional potential function in the runtime analysis of randomized search heuristics can be found in [Weg05, proof of Theorem 4], a work analyzing how simulated annealing and the Metropolis algorithm compute minimum spanning trees in a line of connected triangles. In such optimization processes, a solution candidate (which is a subset of the edge set of the graph) can have two undesirable properties. (i) The solution contains a complete triangle, so one of these three edges has to be removed on the way to the optimal solution. (ii) The solution contains two edges of a triangle, but not the two with smallest weight. This case, called a bad triangle, is the less desirable one, as here one edge of the solution has to be replaced by the missing edge and hence the status of two edges has to be changed. It turns out that a simple potential function can take care of these two issues, namely twice the number of bad triangles plus the number of complete triangles.
When analyzing non-trivial parent populations, it often does not suffice to measure the quality of the current state via the maximum fitness in the population; the number of individuals having this best fitness also has to be taken into account. This was first done in the analysis of the (μ+1) EA in [Wit06]. Since in a run of this algorithm the population never worsens (in a strong sense), the progress could be analyzed conveniently via arguments similar to the fitness level method. Consequently, it was not necessary to define an explicit potential function. In a similar fashion, the EA of [CHS09] and the (μ+λ) EA [ADFH18] were analyzed by regarding the maximum fitness and the number of individuals having this fitness.
In [LY12], a vaguely similar approach was taken for non-elitist population-based algorithms. However, the fact that these algorithms may lose the current-best solution required a number of non-trivial modifications, most notably, (i) that the potential is based on the maximum fitness such that at least a proportion of γ of the individuals have at least this fitness (for a suitable constant γ) instead of the maximum fitness among all individuals, and (ii) that the arguments resembling the fitness level method had to be replaced by a true drift argument. This approach was extended in [DL16a] to give a general “level-based” runtime analysis result. A simplified version of this level theorem was recently given in [CDEL18].
What comes closest to our work with respect to the use of two-dimensional potential functions is [DDK18], where a self-adjusting bit-wise mutation strength for functions defined not over bit strings, but over search spaces of multi-valued decision variables, is discussed. The potential function defined in (6) in [DDK18, Section 7] is too complicated to be described here in detail, but it also follows the pattern used in this work, namely that the potential (to be minimized) is the sum of the fitness distance and a penalty for mutation strengths deviating from their currently ideal value. This potential function, however, does not admit an easy interpretation of the type “fitness distance plus expected damage from improper mutation strength” as in our work. Consequently, the proof that indeed the desired progress is obtained with respect to this potential function is a lengthy (more than 4 pages) case distinction. Apparently unaware of the conference version [DDK16], a similar approach, also with a slightly complicated potential function, was developed in [AAG18] to analyze the (1+1) ES with 1/5 success rule.
A very general approach was recently published in [Row18]. When a process admits several distance functions d_1, …, d_m such that, for all i, the i-th distance satisfies E[d_i(X_{t+1}) | X_t] ≤ Σ_j A_{ij} d_j(X_t) for a given matrix A, then under some natural conditions the first time until all distances are zero can be bounded in terms of a suitable eigenvalue of A. The assumptions on the distance functions and the matrix are non-trivial, but [Row18] provides a broad selection of applications of this method. For our problem, we would expect that this method can be employed as well; however, this would also need an insight similar to the main insight of our approach, namely that the expected new fitness can be estimated in a linear fashion from the current fitness and the distance of the current rate from the ideal value.
1.4 Organization of This Work
This paper is structured as follows. In Section 2, we define the self-adaptive (1,λ) EA proposed in this work. In Section 3 we provide the technical tools needed in our analysis, among them two new results on occupation probabilities. Section 4 presents the main theorem. Its proof considers two main regions of different fitness, which are dealt with in separate subsections. We finish with some conclusions.
2 The (1,λ) EA With Self-Adapting Mutation Rate
We now define precisely the (1,λ) EA with self-adaptive mutation rate proposed in this work. This algorithm, formulated for the minimization of pseudo-Boolean functions f: {0,1}^n → R, is stated in pseudocode in Algorithm 1.
To encode the mutation rate into the individual, we extend the individual representation by adding the rate parameter. Hence the extended individuals are pairs (x, r) consisting of a search point x ∈ {0,1}^n and the rate parameter r, which shall indicate that r/n is the mutation rate this individual was created with.
The extended mutation operator first changes the rate to either r/F or rF with equal probability (1/2 each). It then performs standard bit mutation with the new rate.
In the selection step, we choose from the offspring population an individual with best fitness. If there are several such individuals, we prefer individuals having the smaller rate, breaking still existing ties randomly. In this winning individual, we replace the rate by F if it was smaller than F, to ensure that in the next iteration the lower of the two rates is at least 1. We replace the rate by the largest power of F not exceeding n/(2F) if it was larger than this number. This ensures that in the next iteration the larger of the two rates is not larger than n/2 and that the rate remains a power of F despite the cap.
We formulate the algorithm to start with any initial mutation rate r_init such that F ≤ r_init ≤ n/(2F) and r_init is a power of F. For the result we shall show in this work, the initial rate is not important, but without such prior knowledge we would strongly recommend to start with the smallest possible rate. Due to the multiplicative rate adaptation, the rate can quickly grow if this is profitable. On the other hand, a too large initial rate might lead to an erratic initial behavior of the algorithm.
For the adaptation parameter, we shall use F = 32 in our runtime analysis. Having such a large adaptation parameter eases the already technical analysis, because the two competing rates r/F and rF are different enough to lead to a significantly different performance. For a practical application, we suspect that a smaller value of F is preferable as it leads to a more stable optimization process. The choice of the offspring population size λ depends mostly on the degree of parallelism one wants to obtain. Clearly, λ should be at least logarithmic in n to prevent a too quick loss of the current-best solution. For our theoretical analysis, we require λ ≥ C ln n for a sufficiently large constant C.
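To make the algorithm description concrete, the following is a minimal Python sketch of the self-adaptive (1,λ) EA as described above, formulated here as maximization of the number of one-bits (which is equivalent to the minimization setting of the paper). All names are ours; for simplicity the rate is capped into an interval rather than to the largest admissible power of F, and the default parameter values are illustrative, not the ones used in the analysis.

```python
import random

def self_adaptive_one_comma_lambda_ea(n, lam, F=32, r_init=None, max_gens=100_000):
    """Sketch of the self-adaptive (1,lambda) EA maximizing OneMax.

    Each individual carries its own rate parameter r; mutation first
    multiplies or divides r by F (each with probability 1/2), then flips
    each bit independently with probability r/n.  Ties in fitness are
    broken in favor of the smaller rate.
    """
    onemax = lambda x: sum(x)
    parent = [random.randint(0, 1) for _ in range(n)]
    r = r_init if r_init is not None else F  # start with a small rate
    for gen in range(1, max_gens + 1):
        offspring = []
        for _ in range(lam):
            # self-adaptation: perturb the rate before mutating the bits
            r_child = r / F if random.random() < 0.5 else r * F
            # simplified cap of the rate (the text caps to a power of F)
            r_child = max(F, min(r_child, n / (2 * F)))
            p = r_child / n
            child = [b ^ (random.random() < p) for b in parent]
            offspring.append((onemax(child), -r_child, child))
        # best fitness first; among equal fitness, the smaller rate wins
        fit, neg_r, best = max(offspring)
        parent, r = best, -neg_r
        if fit == n:
            return gen
    return None
```

Note that the parent never competes with its offspring (comma selection), so a generation can lose fitness, exactly the difficulty discussed in the introduction.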
The main result of this work is a mathematical runtime analysis of the performance of the algorithm proposed above on the classic benchmark function OneMax defined by OneMax(x) = x_1 + ⋯ + x_n for all x = (x_1, …, x_n) ∈ {0,1}^n. Since such runtime analyses are by now a well-established way of understanding the performance of evolutionary algorithms, we only briefly give the most important details and refer the reader to the textbook [Jan13].
The aim of runtime analysis is predicting how long an evolutionary algorithm takes to find the optimum or a solution of sufficient quality. As an implementation-independent performance measure, usually the number of fitness evaluations performed in a run of the algorithm is taken. More precisely, the optimization time of an algorithm on some problem is the number of fitness evaluations performed until for the first time an optimal solution is evaluated. Obviously, for a (1,λ) EA, the optimization time is essentially λ times the number of iterations performed until an optimum is generated.
As in classic algorithm analysis, our main goal is an asymptotic understanding of how the optimization time depends on the problem size n. Hence all asymptotic notation in this paper will be with respect to n tending to infinity.
3 Technical Tools
In this section, we list several tools which are used in our work. Most of them are standard tools in the runtime analysis of evolutionary algorithms; however, we also prove two new results on occupation probabilities at the end of this section.
3.1 Elementary Estimates
We shall frequently use the following estimates.
For all x ∈ R, 1 + x ≤ e^x.
For all n ≥ 1, (1 − 1/n)^n ≤ 1/e. Moreover, for all n ≥ 1, (1 − 1/n)^{n−1} ≥ 1/e.
Weierstrass product inequality: For all p_1, …, p_m ∈ [0, 1], ∏_{i=1}^m (1 − p_i) ≥ 1 − Σ_{i=1}^m p_i.
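The following Python snippet numerically checks the standard forms of these elementary estimates (1 + x ≤ e^x, the bounds (1 − 1/n)^n ≤ 1/e ≤ (1 − 1/n)^{n−1}, and the Weierstrass product inequality). This is a sanity check under the assumption that these are the intended statements, not a proof.

```python
import math
import random

# 1 + x <= e^x for all real x
for x in [-5.0, -1.0, -0.1, 0.0, 0.1, 1.0, 5.0]:
    assert 1 + x <= math.exp(x)

# (1 - 1/n)^n <= 1/e <= (1 - 1/n)^(n-1) for n >= 2
for n in range(2, 200):
    assert (1 - 1 / n) ** n <= 1 / math.e <= (1 - 1 / n) ** (n - 1)

# Weierstrass product inequality: prod(1 - p_i) >= 1 - sum(p_i)
random.seed(0)
for _ in range(100):
    ps = [random.random() for _ in range(10)]
    assert math.prod(1 - p for p in ps) >= 1 - sum(ps)
```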
3.2 Probabilistic Tools
In our analysis, we use several standard probabilistic tools including Chernoff bounds. All these can be found in many textbooks or in the book chapter [Doe18c]. We mention the following variance-based Chernoff bound due to Bernstein [Ber24], which is less common in this field (but can be found as well in [Doe18c]).
Let X_1, …, X_m be independent random variables. Let b be such that X_i ≤ E[X_i] + b for all i = 1, …, m. Let X = Σ_{i=1}^m X_i and σ² = Σ_{i=1}^m Var[X_i]. Then for all Λ ≥ 0,
Pr[X ≥ E[X] + Λ] ≤ exp(−Λ² / (2(σ² + bΛ/3))).
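As a sanity check of the Bernstein bound, the following snippet compares it with the exact upper tail of a binomial distribution, i.e., a sum of independent Bernoulli(p) variables, for which b = 1 − p works. Parameter choices are illustrative.

```python
import math

def binom_upper_tail(n, p, k):
    """Exact Pr[X >= k] for X ~ Bin(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def bernstein_bound(n, p, lam):
    """Bernstein: Pr[X >= E[X] + lam] <= exp(-lam^2 / (2(sigma^2 + b*lam/3)))."""
    sigma2 = n * p * (1 - p)   # sum of the variances
    b = 1 - p                  # X_i <= E[X_i] + b for Bernoulli(p) variables
    return math.exp(-lam**2 / (2 * (sigma2 + b * lam / 3)))

n, p = 100, 0.3
mean = n * p
for lam in [5, 10, 15, 20]:
    exact = binom_upper_tail(n, p, math.ceil(mean + lam))
    assert exact <= bernstein_bound(n, p, lam)
```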
We shall follow the common approach of estimating the expected progress and translating this via so-called drift theorems into an estimate for the expected optimization time. We use the variable drift theorem independently found in [Joh10, MRC09] in slightly generalized form.
Theorem 3 (Variable Drift, Upper Bound).
Given a stochastic process, let (X_t)_{t≥0} be a sequence of random variables obtained from mapping the random state at time t to a finite set S ⊆ R_{≥0}, where x_min := min(S \ {0}). Let T be the random variable that denotes the earliest point in time t ≥ 0 such that X_t = 0. If there exists a monotone increasing function h: [x_min, ∞) → R_{>0} such that for all t ≥ 0 and all x ∈ S \ {0} with Pr[X_t = x] > 0 we have E[X_t − X_{t+1} | X_t = x] ≥ h(x),
then for all x_0 ∈ S with Pr[X_0 = x_0] > 0,
E[T | X_0 = x_0] ≤ x_min/h(x_min) + ∫_{x_min}^{x_0} 1/h(x) dx.
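To illustrate how the variable drift theorem is applied, consider a toy process on {0, …, N} (our own example) that moves from state x to x − 1 with probability x/N and otherwise stays put; its drift at x is h(x) = x/N, which is monotone increasing. The exact expected hitting time of 0 can be computed directly and compared with the bound of the theorem.

```python
import math

# Toy process on {0,...,N}: from state x >= 1 move to x-1 with probability
# q(x) = x/N, else stay.  Drift at x is h(x) = x/N, monotone increasing,
# and x_min = 1.
N = 1000
h = lambda x: x / N

# Exact expected hitting time of 0 from x0 = N: each state x = N,...,1 is
# visited once, with a geometric waiting time of expectation 1/q(x).
exact = sum(1 / h(x) for x in range(1, N + 1))

# Variable drift bound: x_min/h(x_min) + integral_1^N 1/h(x) dx = N + N*ln(N)
bound = 1 / h(1) + N * math.log(N)

assert exact <= bound
```

Here exact = N·H_N ≈ N(ln N + 0.577) while the bound is N(1 + ln N), so the theorem is tight up to the constant in the lower-order term for this example.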
Finally, we mention an elementary fact which we shall use as well. See [DD18a, Lemma 1] for a proof.
Let X ∼ Bin(n, p) and k ∈ {0, …, n}. Then Pr[X ≥ k] ≤ C(n, k)·p^k, where C(n, k) denotes the binomial coefficient.
3.3 Occupation Probabilities
To analyze the combined process of fitness and rate in the parent individual, we need a tool that translates a local statement, that is, how the process changes from one time step to the next, into a global statement on the occupation probabilities of the process. Since in our application the local process has a strong drift to the target, Theorem 7 from [KLW15] is too weak. Also, we cannot assume that the process in each step moves at most some constant distance. For that reason, we need the following stronger statement.
Theorem 5 (Theorem 2.3 in [Haj82]).
Suppose that (F_k)_{k≥0} is an increasing family of sub-σ-fields of F and that (Y_k)_{k≥0} is adapted to (F_k). If for some constants η > 0, ρ < 1 and D > 0 we have E[e^{η(Y_{k+1} − Y_k)}; Y_k > a | F_k] ≤ ρ and E[e^{η(Y_{k+1} − a)}; Y_k ≤ a | F_k] ≤ D, then for all b and all k ≥ 0,
Pr[Y_k ≥ b | F_0] ≤ ρ^k e^{η(Y_0 − b)} + ((1 − ρ^k)/(1 − ρ)) · D e^{η(a − b)}.
We apply this theorem in the following lemma, which fits the setting considered in this paper.
Consider a stochastic process , , on such that for some the transition probabilities for all satisfy for all as well as for all . If then for all and it holds that
We aim at applying Theorem 5. We distinguish two cases: for , using the monotonicity with respect to , we obtain
using the assumption that for all , it follows that
and for , using the monotonicity with respect to , we have
using the assumption that for all , it follows that
Using such that , we have
Theorem 2.3, inequality (2.8) in [Haj82] yields with and that
For the simpler case of a random process that runs on the positive integers and that has a strong drift to the left, we have the following estimate for the occupation probabilities.
Consider a random process defined on the positive integers 1, 2, …. Assume that from each state i different from 1, only the two neighboring states i − 1 and i + 1 can be reached (and there is no self-loop on state i). From state 1, only state 2 can be reached and the process can stay on state 1. Let p_i be an upper bound for the transition probability from state i to state i + 1 (valid in each iteration regardless of the past). Assume that
holds for all i. Assume that the process starts in state 1. Then at all times, the probability to be in state i is at most
where as usual we read the empty product as 1.
The claimed bound on the occupation probabilities is clearly true at the start of the process. Assume that it is true at some time. By this assumption and the assumptions on the process, the probability to be in state i after one step is at most
Trivially, the probability to be in state 1 after one step is at most 1. Hence, by induction over time, we see that the claimed bound is an upper bound for the probability to be in state i at all times. ∎
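The induction in this proof can be checked numerically. The following sketch evolves the exact state distribution of a birth-death chain of the kind described in the lemma, with our own choice of up-probabilities p_i ≤ 1/4, and verifies a geometric-style occupation bound of the form ∏_j 2p_j over the states below i; this concrete bound is our reconstruction of the induction step, not necessarily the exact constant of the lemma.

```python
# Exact occupation probabilities of a birth-death chain on states 1..K
# (0-indexed below): from state i we move up with probability p[i]
# (p[i] <= 1/4 here) and otherwise down; the lowest state stays instead
# of moving down, and the top state always falls back.
K = 12
p = [0.25 * 0.5**i for i in range(K)]   # up-probability from state i

dist = [1.0] + [0.0] * (K - 1)          # the process starts in state 1
for _ in range(10_000):
    new = [0.0] * K
    new[0] += dist[0] * (1 - p[0])      # stay in the lowest state
    new[1] += dist[0] * p[0]
    for i in range(1, K):
        if i + 1 < K:
            new[i + 1] += dist[i] * p[i]
            new[i - 1] += dist[i] * (1 - p[i])
        else:
            new[i - 1] += dist[i]       # top state always moves down
    dist = new

# Occupation of state i is at most the product of 2*p[j] over states j < i:
# the induction step needs p[i] <= 1/4 so that the two incoming terms
# (from below, at most bound/2, and from above, at most 2*p[i]*bound) fit.
bound = [1.0]
for i in range(1, K):
    bound.append(bound[-1] * 2 * p[i - 1])
for i in range(K):
    assert dist[i] <= bound[i] + 1e-12
```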
4 Main Result and Proof
We can now state precisely our main result and prove it.
Let λ ≥ C ln n for a sufficiently large constant C. Let F = 32. Then the expected number of generations the self-adapting (1,λ) EA takes to optimize OneMax is O(n/log λ + (n log n)/λ).
This corresponds to an expected number of O(nλ/log λ + n log n) fitness evaluations.
The proof of this theorem is based on a careful, technically demanding drift analysis of both the current OneMax-value (which is also the fitness distance; recall that our goal is the minimization of the objective function) and the current rate of the parent. In very rough terms, a division of the run similar to that in [DGWY18] into a region of large OneMax-value, the far region (Section 4.1), and a region of small OneMax-value, the near region (Section 4.2), is made. The middle region considered in [DGWY18] is subsumed under the far region here.
In the remainder of our analysis, we assume that n is sufficiently large, that λ ≥ C ln n with a sufficiently large constant C, and that F = 32.
4.1 The Far Region
In this section, we analyze the optimization behavior of our self-adaptive (1,λ) EA in the regime where the fitness distance is still large. Due to our assumption λ ≥ C ln n, it is very likely that at least one copy of the parent is among the λ offspring when the rate is small. Thus the (1,λ) EA behaves almost like the (1+λ) EA when the rate is small, but can lose fitness in general. The following lemma is crucial in order to analyze the drift of the rate depending on the fitness distance, which follows a similar scheme as for the self-adjusting (1+λ) EA proposed in [DGWY18].
Roughly speaking, the rate leading to optimal fitness progress is large when the fitness distance is of order n, decreases with the fitness distance, and quickly drops to constant order once the fitness distance is small.
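The qualitative statement above can be illustrated with a simple computation (ours, not from the paper). The function below evaluates, for a single offspring, the probability that standard bit mutation with rate r/n flips at least one incorrect bit and no correct bit. This is only a crude proxy for the lemma (which concerns the best of λ offspring and general multi-bit exchanges, where large λ favors even larger rates), but it already shows the progress-maximizing rate shrinking as the fitness distance shrinks.

```python
def improvement_prob(n, d, r):
    """Probability that standard bit mutation with rate r/n flips at least
    one of the d incorrect bits and none of the n - d correct bits.
    Single-offspring proxy for making progress."""
    p = r / n
    return (1 - (1 - p) ** d) * (1 - p) ** (n - d)

n = 1000
rates = [0.1 * i for i in range(1, 51)]          # grid of rates r in (0, 5]
best_rate_far = max(rates, key=lambda r: improvement_prob(n, n // 2, r))
best_rate_near = max(rates, key=lambda r: improvement_prob(n, 10, r))

# the rate maximizing this progress proxy shrinks with the fitness distance
assert best_rate_far > best_rate_near
```

For d = n/2 the maximizer is near 2 ln 2 ≈ 1.39, while for d = 10 it is near 1, matching the intuition that small fitness distance calls for small rates.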
To ease the presentation, we first define two fitness-dependent bounds.
Let and . We define and .
According to the definition, both bounds monotonically increase when the fitness distance increases.
Let . Consider an iteration of the self-adaptive (1,λ) EA with current fitness distance and current rate .
If and , the probability that all best offspring have been created with rate is at least .
If and , then the probability that all best offspring have been created with rate is at least .
Let and be the probability that standard bit mutation with mutation rate creates from a parent with fitness distance an offspring with fitness distance exactly and at most , respectively. Then
and . We aim at finding such that while . Then we use these to bound the probability that at least one offspring using rate obtains a progress of or more while at the same time all offspring using rate obtain less than progress. Let be the largest such that . Using the fact that for all , we notice that . By the assumption that , we obtain . Thus . We also notice that for . Thus for we can bound by
Let . We prove by distinguishing between two cases according to which argument maximizes .
If , then and . Referring to inequality (2) and using the fact that , , and , we obtain
and thus .
If , then since is a constant. Using , we obtain which is equivalent to . Furthermore, since . Thus
Since is decreasing in , we obtain . Using a Chernoff bound and recalling that the expected number of flipped bits is bounded by , we notice that . This upper bound will be used to estimate in the following part of the proof.
Therefore , where in the first inequality, we use the fact that . To prove , we first show . Then we use this to bound according to the definition of . Finally we obtain . It remains to bound . We show that the majority of are from the first terms in the summation of equation (1). Let denote the -th term in equation (1). Then
If , then , and thus
We notice that
using the fact that for all , we compute
Since , , and , we obtain
Consequently we have and
So finally due to the definition of , and
A simple union bound shows that with probability , no offspring of rate manages to obtain a progress of or more. However, the probability that an offspring has rate and obtains at least progress is . Thus the probability that no offspring generated with rate achieves a progress of at least is at most . This proves the first statement of the lemma. ∎
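The progress probabilities manipulated in this proof can be computed exactly by summing over the number of flipped incorrect and correct bits. The following sketch (our own notation, not the paper's) does this and checks two basic identities.

```python
import math

def progress_pmf(n, d, r):
    """Exact distribution of the progress (decrease in fitness distance)
    under standard bit mutation with rate p = r/n, for a parent at fitness
    distance d.  Flipping i of the d incorrect bits and k of the n - d
    correct bits gives progress j = i - k."""
    p = r / n
    pmf = {}
    for i in range(d + 1):                     # flips among incorrect bits
        pi = math.comb(d, i) * p**i * (1 - p) ** (d - i)
        for k in range(n - d + 1):             # flips among correct bits
            pk = math.comb(n - d, k) * p**k * (1 - p) ** (n - d - k)
            pmf[i - k] = pmf.get(i - k, 0.0) + pi * pk
    return pmf

pmf = progress_pmf(n=30, d=10, r=2)
# it is a probability distribution
assert abs(sum(pmf.values()) - 1.0) < 1e-9
# by linearity, the expected progress is d*p - (n-d)*p = (2d - n) * r/n
assert abs(sum(j * q for j, q in pmf.items()) - (2 * 10 - 30) * (2 / 30)) < 1e-9
```

Note that for d < n/2 the expected progress of a single offspring is negative; only the selection among λ offspring makes the parent's drift positive, which is why the tail of this distribution, rather than its mean, drives the analysis.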
For let the random variable denote the number of flipped bits among the one-bits and denote the number of flipped bits among the zero-bits when applying standard bit mutation with probability . Let denote the improvement in fitness. Let denote the minimal among all offspring which apply rate . Our aim is to find a such that while , and use this to obtain a high value for .
Let . We notice that since the median of binomial distribution is