Interpolating Local and Global Search by Controlling the Variance of Standard Bit Mutation

This work was supported by the Chinese scholarship council (CSC No. 201706310143), the Paris Ile-de-France Region, COST Action CA15140, and by a public grant as part of the Investissement d'avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH, in a joint call with the Gaspard Monge Program for optimization, operations research and their interactions with data sciences.
A key property underlying the success of evolutionary algorithms (EAs) is their global search behavior, which allows the algorithms to "jump" from a current state to other parts of the search space, thereby avoiding getting stuck in local optima. This property is obtained through a random choice of the radius at which offspring are sampled from previously evaluated solutions. It is well known that, thanks to this global search behavior, the probability that an EA using standard bit mutation finds a global optimum of an arbitrary function tends to one as the number of function evaluations grows. This advantage over heuristics using a fixed search radius, however, comes at the cost of using non-optimal step sizes also in those regimes in which the optimal rate is stable for a long time. This downside results in significant performance losses for many standard benchmark problems.
We introduce in this work a simple way to interpolate between the random global search of EAs and their deterministic counterparts which sample from a fixed radius only. To this end, we introduce normalized standard bit mutation, in which the binomial choice of the search radius is replaced by a normal distribution. Normalized standard bit mutation offers a straightforward way to control its variance, and hence the degree of randomness involved. We experiment with a self-adjusting choice of this variance, and demonstrate its effectiveness for the two classic benchmark problems LeadingOnes and OneMax. Our work thereby also touches a largely ignored question in discrete evolutionary computation: multi-dimensional parameter control.
Among the most successfully applied iterative optimization heuristics are local search variants and evolutionary algorithms (EAs). While the former sample at a fixed radius around previously evaluated solutions, most evolutionary algorithms classify as global search algorithms, which can escape local optima by creating offspring at larger distances. In the context of optimizing pseudo-Boolean functions $f:\{0,1\}^n \to \mathbb{R}$, for example, the most commonly found variation operator in EAs is standard bit mutation. Standard bit mutation creates a new solution $y$ by flipping each bit of the parent individual $x$ with some probability $p$, independently for each position. The probability to sample a specific offspring $y$ at distance $\ell$ from $x$ thus equals $p^{\ell}(1-p)^{n-\ell}$, where $\ell := H(x,y)$ denotes the Hamming distance of $x$ and $y$. This probability is strictly positive for all $y$, thus showing that the probability that an EA using standard bit mutation will have sampled a global optimum of $f$ converges to one as the number of iterations increases. In contrast to pure random search, however, the distance at which the offspring is sampled follows a binomial distribution, $\mathrm{Bin}(n,p)$, and is thus concentrated around its mean $pn$.
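To make the per-bit view concrete, the following sketch (illustrative Python, not code from the paper) implements standard bit mutation by independent bit flips; the Hamming distance between parent and offspring is then $\mathrm{Bin}(n,p)$-distributed and concentrated around $pn$.

```python
import random

def standard_bit_mutation(parent, p, rng):
    """Flip each bit independently with probability p."""
    return [1 - b if rng.random() < p else b for b in parent]

rng = random.Random(2024)
n = 200
parent = [0] * n
child = standard_bit_mutation(parent, 1 / n, rng)
# The Hamming distance H(parent, child) follows Bin(n, 1/n),
# hence it is concentrated around its mean 1.
distance = sum(a != b for a, b in zip(parent, child))
```

Note that this direct implementation never exposes the search radius explicitly; the equivalent radius-based formulation used later in the paper makes the radius an explicit random variable.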
The ability to escape local optima comes at the price of frequent uses of non-optimal search radii even in those regimes in which the latter are stable for a long time. The inability of standard bit mutation to adjust to such situations results in significant performance losses on almost all classical benchmark functions, which often exhibit large parts of the optimization process in which flipping a certain number of bits is required. A convenient way to control the degree of randomness in the choice of the search radius would therefore be highly desirable.
In this work we introduce such an interpolation. It allows us to calibrate between deterministic and pure random search, while encompassing standard bit mutation as one specification. More precisely, we investigate normalized standard bit mutation, in which the mutation strength (i.e., the search radius) is sampled from a normal distribution $N(\mu,\sigma^2)$. By choosing $\sigma=0$ one obtains a deterministic choice of the search radius, and the "degree of randomness" increases with increasing $\sigma$. By the central limit theorem, we recover a distribution that is very similar to that of standard bit mutation with mutation rate $p$ by setting $\mu=pn$ and $\sigma^2=pn(1-p)$.
Apart from conceptual advantages, normalized standard bit mutation offers the advantage of separating the variance from the mean, which makes it easy to control both parameters independently during the optimization process. While multi-dimensional parameter control for discrete EAs is still in its infancy, cf. comments in [KHE15, DD19], we demonstrate in this work a simple, yet efficient way to control mean and variance of normalized standard bit mutation. As test case to investigate the benefits of normalized standard bit mutation we have chosen the 2-rate $(1+\lambda)$ EA from [DGWY17]. The choice of this reference algorithm is based on our previous work [DYvR18], in which we observed, via a detailed fixed-target analysis of several EAs, that for the two benchmark problems OneMax and LeadingOnes this algorithm performs significantly better than the plain $(1+\lambda)$ EA for a large range of initial target values. For both functions flipping one bit is optimal for a large fraction of the optimization process, cf. Figure 2. In these regimes the 2-rate $(1+\lambda)$ EA drastically loses performance due to sampling half the offspring with a mutation rate that is four times as large as the optimal one. Controlling the variance of this distribution therefore seems promising.
On the way towards a $(1+\lambda)$ EA variant with self-adjusting choice of mean and variance we discover that already replacing the 2-rate sampling strategy of this algorithm by a normalized choice of the mutation strength significantly improves its performance. Controlling the variance then yields additional performance gains on the tested OneMax instances (we consider a range of problem dimensions). On LeadingOnes, the variance control improves performance for small values of $\lambda$. Contrary to what one might first expect, for this test function the average optimization time (i.e., the number of search points evaluated until an optimal solution is evaluated for the first time) of the $\lambda=50$ variants of the $(1+\lambda)$ EA is better than that of their $\lambda=2$ counterparts, which is an observation of independent interest.
1.1 Related Work
We are not aware of any other work replacing the binomial search radius distribution of standard bit mutation by a normal distribution. We are also not aware of any work directly controlling the variance of the mutation strength distribution. As mentioned above, controlling more than one parameter simultaneously is a largely ignored question in discrete evolutionary computation (EC).
A recently developed algorithm that also addresses the idea to sample the search radius from a different distribution than the binomial one is the fast-GA introduced in [DLMN17]. It samples the mutation strength from a power-law distribution, thus essentially shifting probability mass from small mutation strengths to larger ones. It is shown in [DLMN17] that the fast-GA is very efficient on so-called Jump functions, which require flipping several bits simultaneously to jump from a local to the global optimum. It is furthermore discussed in [DLMN17] that the advantages of the fast-GA do not sacrifice performance too drastically on uni-modal benchmark functions such as OneMax and LeadingOnes. This work has already received considerable attention in the literature [MB17, FQW18, FGQW18, COY18, Len18]. However, only static distributions have been considered so far, and it is very likely that a control mechanism similar to the ones proposed in this work would be beneficial. We will comment on this in Section 6.
As reasoned above, normalized standard bit mutation offers an elegant way to interpolate between deterministic mutation strengths and regular standard bit mutation, thus showing that Randomized Local Search (RLS) variants with their deterministic search radii and the (1+1) EA with mutation rate $p$ are essentially just different instantiations of the same meta-algorithm. Similar results also extend to population-based EAs. Note that normalized standard bit mutation also allows other degrees of randomization, thereby offering a wide range for further experimentation. In this context we note that for the special case of standard RLS (i.e., the greedy (1+1) hill climber that flips in each iteration exactly one uniformly chosen bit) a similar meta-model that allows interpolating between the (1+1) EA and RLS is the (1+1) EA variant introduced in [JZ11, CD18]. This model, however, is much less flexible, and does not allow, for example, deterministic search radii greater than one.
1.2 Experimental Setup
Unless stated otherwise, all numbers reported in this work are based on 100 independent runs of the respective algorithms. To ease readability, we only display average values. All raw data as well as detailed summaries with quantiles, standard deviations, etc. are available at https://github.com/FurongYe/Fixed-Target-Results. Selected statistical results can be found in Tables 1 and 2, respectively. These summaries have been created with IOHprofiler, our recently announced benchmarking and data analysis tool [DWY18].
2 Previous Observations for the Two-Rate EA and the Two Benchmark Problems
A starting point of our work are the results presented in [DYvR18]. In that work we observed that the evolutionary algorithm with success-based self-adjusting mutation rate proposed in [DGWY17] outperforms the plain $(1+\lambda)$ EA for a large range of sub-optimal targets. It then drastically loses performance in the later parts of the optimization process, which results in an overall poor optimization time on OneMax and LeadingOnes functions of moderate problem dimensions. The optimal asymptotic behavior on OneMax proven in [DGWY17] can thus not be observed for these dimensions.
We briefly summarize in this section the algorithm from [DGWY17] and the results presented in [DYvR18]. We also discuss a few basic properties of the two benchmark problems, which explain the choices made in subsequent sections.
2.1 The Two-Rate EA
The algorithm introduced in [DGWY17], which we named the two-rate $(1+\lambda)$ EA in [DYvR18], is a $(1+\lambda)$ EA which applies in each iteration two different mutation rates. Half of the offspring population is generated with mutation rate $r/(2n)$, the other half with mutation rate $2r/n$. The parameter $r$ is the current best mutation strength, which is updated after each iteration, with a bias towards the rate by which the best of the offspring has been sampled, cf. Algorithm LABEL:alg:DoerrGWY17 for details.
Note that here and in the following we make use of the fact that standard bit mutation, which is traditionally defined by flipping each bit in a length-$n$ bit string with some probability $p$ (independently of all other decisions), can be equivalently described by first sampling a radius $\ell$ from the binomial distribution $\mathrm{Bin}(n,p)$ and then applying the $\mathrm{flip}_{\ell}$ operator, which flips $\ell$ pairwise different bits that are chosen from the index set $[n]:=\{1,\ldots,n\}$ uniformly at random.
Following the discussions and the notation introduced in [CD18, DW18, DYvR18] we enforce in this work that all offspring differ from their parents by at least one bit. We therefore require in lines LABEL:line:half and LABEL:line:double that the mutation strength $\ell$ is at least one. This is achieved by re-sampling if needed, or, equivalently, by sampling $\ell$ from the conditional binomial distribution $\mathrm{Bin}_{>0}(n,p)$, which assigns to each value $k\in[n]$ a probability of $\binom{n}{k}p^k(1-p)^{n-k}/(1-(1-p)^n)$.
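A minimal sketch of this radius-based view (illustrative Python; the rejection loop implements the conditional distribution $\mathrm{Bin}_{>0}(n,p)$ described above, and $\mathrm{flip}_{\ell}$ flips exactly $\ell$ distinct, uniformly chosen positions):

```python
import random

def sample_radius_geq1(n, p, rng):
    """Draw ell from Bin(n, p) conditioned on ell >= 1, by re-sampling."""
    while True:
        ell = sum(rng.random() < p for _ in range(n))
        if ell >= 1:
            return ell

def flip_ell(x, ell, rng):
    """flip_ell operator: flip ell pairwise different, uniform positions."""
    y = x[:]
    for i in rng.sample(range(len(x)), ell):
        y[i] = 1 - y[i]
    return y

rng = random.Random(7)
n = 100
x = [0] * n
ell = sample_radius_geq1(n, 1 / n, rng)
y = flip_ell(x, ell, rng)
```

Rejection sampling is the simplest way to realize the conditional distribution; an equivalent alternative is to sample directly from the re-normalized probabilities given above.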
In [DYvR18] we compared the fixed-target performance of the two-rate $(1+\lambda)$ EA (i.e., the algorithm using the conditional sampling rule introduced above) and the plain $(1+\lambda)$ EA on OneMax and LeadingOnes. These two classic optimization problems ask to maximize the functions $\mathrm{OM}:\{0,1\}^n\to\mathbb{R},\ x\mapsto\sum_{i=1}^{n}x_i$ and $\mathrm{LO}:\{0,1\}^n\to\mathbb{R},\ x\mapsto\max\{i\in[n]\mid\forall j\le i: x_j=1\}$, respectively. In Figure 1 we report similar empirical results for OneMax and LeadingOnes (the other results in the two figures will be addressed below). We observed in [DYvR18] that for both functions the two-rate EA from [DGWY17] performs well for small target values, but drastically loses performance in the later stages of the optimization process.
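The two-rate mechanism described above can be sketched as follows (illustrative Python on OneMax; the probability-1/2 randomized update and the bounds $[2, n/4]$ on $r$ are our reading of the description in [DGWY17], not code taken from that paper):

```python
import random

def mutate(x, rate, rng):
    """Standard bit mutation, re-sampling until the offspring differs."""
    while True:
        y = [1 - b if rng.random() < rate else b for b in x]
        if y != x:
            return y

def two_rate_step(x, f, r, lam, rng):
    """One iteration: half the offspring with rate r/(2n), half with 2r/n."""
    n = len(x)
    offspring = []
    for i in range(lam):
        strength = r / 2 if i < lam // 2 else 2 * r
        offspring.append((mutate(x, strength / n, rng), strength))
    best, best_strength = max(offspring, key=lambda t: f(t[0]))
    # Bias r towards the strength that produced the best offspring:
    # adopt it with probability 1/2, else pick r/2 or 2r uniformly (assumed).
    r = best_strength if rng.random() < 0.5 else rng.choice([r / 2, 2 * r])
    r = min(max(r, 2), n / 4)  # assumed bounds on the mutation strength
    if f(best) >= f(x):       # elitist selection
        x = best
    return x, r

rng = random.Random(11)
onemax = sum
x, r = [0] * 30, 2
for _ in range(60):
    x, r = two_rate_step(x, onemax, r, 10, rng)
```

The sketch makes the weakness discussed below visible: once $r$ has reached its minimum, half the offspring are still produced with a mutation rate four times the optimal one.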
2.2 Properties of the Benchmark Problems
Both OneMax and LeadingOnes have a long period during the optimization run in which flipping one bit is optimal.
For OneMax flipping one bit is widely assumed to be optimal as soon as $\mathrm{OM}(x)\ge 2n/3$. Quite interestingly, however, this conjecture has not been rigorously proven to date. It is only known that drift-maximizing mutation strengths are almost optimal [DDY16], in the sense that the overall expected optimization time of the elitist (1+1) algorithm using these strengths in each step cannot be worse than that of the best-possible unary unbiased algorithm for OneMax by more than an additive lower-order term [DDY16]. But even for the drift maximizer the statement that flipping one bit is optimal when $\mathrm{OM}(x)\ge 2n/3$ has only been shown for an approximation, not for the actual drift maximizer. Numerical evaluations for the tested problem dimensions nevertheless confirm that 1-bit flips are optimal when the OneMax-value exceeds $2n/3$.
For LeadingOnes, on the other hand, it is well known that flipping one bit is optimal as soon as $\mathrm{LO}(x)\ge n/2$ [Doe18a].
We display in Figure 2, which is adjusted from [Doe18b], the optimal and the drift-maximizing mutation strengths for LeadingOnes and OneMax, respectively. We also display in the same figure the expected time needed by the elitist (1+1) algorithms using in each step these mutation strengths. We see that these algorithms spend a large fraction of their time in the regime where flipping one bit is (almost) optimal. These numbers are based on an exact computation for LeadingOnes and on an empirical evaluation of independent runs for OneMax.
2.3 Implications for the Two-Rate EA
Assume that in the regime of optimal one-bit flips the two-rate $(1+\lambda)$ EA has correctly identified that flipping one bit is optimal. It will hence use the smallest possible value for $r$, which is $r=2$. In this case, half the offspring are sampled with the (for this algorithm optimal) mutation rate $r/(2n)=1/n$, while the other half of the offspring population is sampled with mutation rate $2r/n=4/n$, thus flipping on average more than four times the optimal number of bits. It is therefore unsurprising that in this regime (and already before it) the gradient of the average fixed-target running time curves in Figure 1 is much worse for the two-rate $(1+\lambda)$ EA than for the plain $(1+\lambda)$ EA.
3 Creating Half the Offspring with Optimal Mutation Rate
The observations made in the last section inspire our first algorithm, the $(1+\lambda)$ EA variant defined via Algorithm LABEL:alg:sigma. This algorithm samples half the offspring using as deterministic mutation strength the best mutation strength of the last iteration. The other offspring are sampled with a mutation rate chosen uniformly at random from an interval around this best mutation strength.
As we can see in Figure 1, this algorithm significantly improves the performance in the later parts of the optimization process. Normalized total optimization times for various problem dimensions are provided in Figures 3 and 4, respectively. We display data for one choice of the interval width only. We note that smaller interval widths would give better results. The same effect would be observable when replacing the factor two in the two-rate $(1+\lambda)$ EA, i.e., when using a two-rate update rule with a smaller factor. A detailed discussion of this effect is omitted here for reasons of space.
It is remarkable that on LeadingOnes this $(1+\lambda)$ EA variant performs better than Randomized Local Search (RLS), the elitist (1+1) algorithm flipping in each iteration exactly one uniformly chosen bit. The slightly worse gradients for large target values (which are a consequence of randomly sampling the mutation rate instead of deterministically using mutation strength one) are compensated for by the gains made in the initial phase of the optimization process, where the EA variants benefit from larger mutation rates.
On OneMax the performance of this variant is better than that of the plain $(1+\lambda)$ EA for both tested values $\lambda=2$ and $\lambda=50$.
We recall that it is well known that, both for OneMax and LeadingOnes, the optimal offspring population size in the regular $(1+\lambda)$ EA is $\lambda=1$ [JDW05]. A monotonic dependence of the average optimization time on $\lambda$ is conjectured (and empirically observed), but not formally proven. While for OneMax the impact of $\lambda$ is significant, the dependency on $\lambda$ is much less pronounced for LeadingOnes. Empirical results for both functions and a theoretical running time analysis for LeadingOnes can be found in [DYvR18]. For OneMax, [GW17] offers a precise running time analysis of the $(1+\lambda)$ EA for broad ranges of offspring population sizes and mutation rates. In light of the fact that the theoretical considerations in [DGWY17] required a sufficiently large $\lambda$, it is worthwhile to note that for all tested problem dimensions the $\lambda=2$ variants perform better on OneMax than the $\lambda=50$ ones. Note, however, that the inverse holds for LeadingOnes, cf. Figure 4. For this function it seems to be important that the number of offspring allows a better estimation of the better mutation rate. We will observe the same phenomenon for all other algorithms introduced below.
4 Normalized Standard Bit Mutation
In light of the results presented in the previous section, one may wonder if splitting the population into two halves is needed after all. We investigate this question by introducing a $(1+\lambda)$ EA variant which in each iteration and for each offspring samples the mutation strength from the normal distribution around the best mutation strength $r$ of the previous iteration, rounding the sampled value to the closest integer. The reasons for replacing the uniform distribution of the previous section by a normal one will be addressed below. As before, we enforce a mutation strength of at least one by re-sampling if needed, thus effectively sampling the mutation strength from the conditional normal distribution. Algorithm LABEL:alg:normal summarizes this algorithm.
Note that the variance $r(1-r/n)$ of the unconditional normal distribution $N(r, r(1-r/n))$ is identical to that of the unconditional binomial distribution $\mathrm{Bin}(n, r/n)$. We use the normal distribution here for reasons that will be explained in the next section. Note, however, that very similar results would be obtained when replacing in line LABEL:line:mutnormal of Algorithm LABEL:alg:normal the normal distribution by the binomial one, $\mathrm{Bin}_{>0}(n, r/n)$. We briefly recall that, by the central limit theorem, the (unconditional) binomial distribution converges to the (unconditional) normal distribution.
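The sampling step can be sketched as follows (illustrative Python, assuming mean $r$ and variance $r(1-r/n)$, the variance of $\mathrm{Bin}(n, r/n)$; rejecting rounded values above $n$ is our boundary-handling assumption):

```python
import random

def normalized_radius(r, n, rng):
    """Sample ell ~ N(r, r(1 - r/n)), rounded; re-sample until 1 <= ell <= n."""
    variance = r * (1 - r / n)  # matches the variance of Bin(n, r/n)
    while True:
        ell = round(rng.gauss(r, variance ** 0.5))
        if 1 <= ell <= n:
            return ell

rng = random.Random(5)
samples = [normalized_radius(2, 100, rng) for _ in range(500)]
```

Because the mean and the variance are separate arguments of the normal distribution, the variance can later be adjusted without touching the mean, which is exactly what the self-adjusting variant below exploits.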
The empirical performance of this normalized $(1+\lambda)$ EA is comparable to that of the variant from the previous section for both problems and all tested problem dimensions, cf. Figures 3 and 4. Note, however, that for $\lambda=2$ it performs worse than the two-rate $(1+\lambda)$ EA.
4.1 Interpolating Local and Global Search
As discussed above, all $(1+\lambda)$ EA variants mentioned so far suffer from the variance of the random selection of the mutation rate, in particular in the long final part of the optimization process, in which the optimal mutation strength is one. We therefore analyze a simple way to reduce this variance on the fly. To this end, we build upon the normalized $(1+\lambda)$ EA and introduce a counter $c$, which is initialized at zero. In each iteration, we check if the value of $r$ changes. If so, the counter is re-set to zero. It is increased by one otherwise, i.e., if the value of $r$ remains the same. We use this counter to self-adjust the variance of the normal distribution. To this end, we replace in line LABEL:line:mutnormal of Algorithm LABEL:alg:normal the conditional normal distribution $N_{>0}(r, r(1-r/n))$ by the conditional normal distribution $N_{>0}(r, F^{c}\,r(1-r/n))$, where $F<1$ is a constant discount factor. Algorithm LABEL:alg:normalada summarizes this $(1+\lambda)$ EA variant with normalized standard bit mutation and a self-adjusting choice of mean and variance.
Choice of $F$: We use $F=0.98$ in all reported experiments. Preliminary tests suggest that significantly smaller values are not advisable, since the algorithm may get stuck with sub-optimal mutation rates. This could be avoided by introducing a lower bound for the variance and/or by mechanisms taking into account whether or not an iteration has been successful, i.e., whether it has produced a strictly better offspring.
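The counter and discount mechanism can be sketched as follows (illustrative Python; the default $F=0.98$ is used here as a representative discount factor below one):

```python
def update_counter(c, r_new, r_old):
    """Reset the counter when the best strength changes, else increment it."""
    return 0 if r_new != r_old else c + 1

def adjusted_variance(r, n, c, F=0.98):
    """Variance F**c * r * (1 - r/n): shrinks while r stays unchanged."""
    return (F ** c) * r * (1 - r / n)

# The longer r is stable, the more concentrated the sampling distribution:
v0 = adjusted_variance(2, 100, 0)    # full binomial-matching variance
v50 = adjusted_variance(2, 100, 50)  # after 50 iterations without a change
```

With $\sigma^2 \to 0$ the mutation becomes (almost) deterministic, which is exactly the local-search behavior desired in the long one-bit-flip regime.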
The empirical comparison suggests that the self-adjusting choice of the variance improves the performance on OneMax further, cf. also Figure 5 for average fixed-target results. The average performance is comparable to, but slightly worse than, that of RLS. For LeadingOnes, the variance-controlled variant is comparable in performance to the normalized $(1+\lambda)$ EA, but we observe that for $\lambda=2$ the variance-controlled variant performs better. It is the only one among all tested EAs for which decreasing $\lambda$ from 50 to 2 does not result in a significantly increased running time.
5 A Meta-Algorithm with Normalized Standard Bit Mutation
In the self-adjusting $(1+\lambda)$ EA we make use of the fact that a small variance in line LABEL:line:mutnormaladap of Algorithm LABEL:alg:normalada results in a more concentrated distribution. The variance adjustment is thus an efficient way to steer the degree of randomness in the selection of the mutation rate. It allows us to interpolate between deterministic and random mutation rates. In our experimentation we do not go beyond the variance of the binomial distribution, but in principle there is no reason not to regard larger variances as well. The question of how to best determine the degree of randomness in the choice of the mutation rate has, to the best of our knowledge, not previously been addressed in the EC literature. We believe that this idea carries good potential, since it demonstrates that local search with its deterministic search radius and evolutionary algorithms with their global search radii are merely two different configurations of the same meta-algorithm, and not two different algorithms, as the general perception might indicate. To make this point very explicit, we introduce with Algorithm LABEL:alg:meta a general meta-algorithm, of which local search with deterministic mutation strengths and EAs are special instantiations.
Note that in this meta-model we use static parameter values; variants with adaptive mutation rates can be obtained by applying the usual parameter control techniques, as demonstrated above. Of course, the same normalization can be done for similar EAs; the technique is not restricted to elitist $(1+\lambda)$-type algorithms. Likewise, the condition to flip at least one bit can be omitted, i.e., one can replace the conditional normal distribution in line LABEL:line:mutmeta by the unconditional one.
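As an illustration of the meta-algorithm's radius-sampling step (a sketch under the assumptions above; the choice of $\sigma$ and the boundary handling are configuration choices, not prescribed by the paper), a single routine covers both extremes:

```python
import random

def meta_radius(n, mean, sigma, rng):
    """sigma = 0 gives a deterministic radius (RLS-like); sigma equal to the
    standard deviation of Bin(n, mean/n) is close to standard bit mutation."""
    if sigma == 0:
        return max(1, round(mean))
    while True:  # conditional sampling: enforce 1 <= ell <= n
        ell = round(rng.gauss(mean, sigma))
        if 1 <= ell <= n:
            return ell

rng = random.Random(0)
rls_radius = meta_radius(100, 1, 0, rng)  # local search: always flip one bit
p = 1 / 100
ea_sigma = (100 * p * (1 - p)) ** 0.5
ea_radius = meta_radius(100, 100 * p, ea_sigma, rng)  # EA-like randomness
```

Tuning `sigma` between these two settings, or beyond them, is precisely the interpolation between local and global search discussed in this section.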
6 Discussion and Outlook
We have introduced in this work normalized standard bit mutation, which replaces the binomial choice of the mutation strength in standard bit mutation by a normal distribution. This normalization allows a straightforward way to control the variance of the distribution, which can now be adjusted independently of the mean. We have demonstrated that such an approach can be beneficial when optimizing classic benchmark problems such as LeadingOnes and OneMax. In future work, we plan to validate our approach for the fast-GA proposed in [DLMN17]. We are confident that variance control should be beneficial for that algorithm as well.
Our work has concentrated on OneMax and LeadingOnes, as two examples where the optimal mutation rate is stable for a long time. When applied in practice—where abrupt changes of the optimal mutation strengths may occur—our variance control mechanism needs to be modified so that the variance is increased if no strict progress has been observed for a sufficiently long period. We plan to investigate this question by studying concatenated jump-functions, i.e., functions for which one mutation strength is optimal for some significant number of iterations, followed by a situation in which a much larger number of bits need to be flipped in order to make progress.
Related to the point made in the last paragraph, we also note that the parameter control technique which we applied to adjust the mean of the sampling distribution for the mutation strength has an extremely short learning period, since we simply use the best mutation strength of the last iteration as mean for the sampling distribution of the next iteration. For more rugged fitness landscapes a proper learning, which takes into account several iterations, should be preferable.
We recall that multi-dimensional parameter control has not received much attention in the EC literature for discrete optimization problems [KHE15, DD19]. Our work falls into this category, and we have demonstrated a simple way to separate the control of the mean from that of the variance of the mutation strength distribution. We hope that our work inspires more research in this direction, since practical EAs tend to have many different parameters that need to be adjusted during the optimization process.
Finally, another avenue for further work is provided by the meta-algorithm presented in Section 5, which demonstrates that Randomized Local Search and evolutionary algorithms can be seen as two configurations of the meta-algorithm. Parameter control, or, in this context possibly more suitably referred to as online algorithm configuration, offers the possibility to interpolate between these algorithms (and even more drastically randomized heuristics). Given the significant advances in the context of algorithm configuration witnessed by the EC and machine learning communities, we believe that such meta-models carry significant potential to exploit and profit from advantages of different heuristics. Note here that the configuration of meta-algorithms offers much more flexibility than the algorithm selection approach classically taken in EC, e.g., in most works on hyper-heuristics.
- [CD18] Eduardo Carvalho Pinto and Carola Doerr, A simple proof for the usefulness of crossover in black-box optimization, Proc. of Parallel Problem Solving from Nature (PPSN’18), Lecture Notes in Computer Science, vol. 11102, Springer, 2018, Full version available at http://arxiv.org/abs/1812.00493, pp. 29–41.
- [COY18] Dogan Corus, Pietro Simone Oliveto, and Donya Yazdani, Fast artificial immune systems, Proc. of Parallel Problem Solving from Nature (PPSN’18), Lecture Notes in Computer Science, vol. 11101, Springer, 2018, pp. 67–78.
- [DD19] Benjamin Doerr and Carola Doerr, Theory of parameter control mechanisms for discrete black-box optimization: Provable performance gains through dynamic parameter choices, Theory of Randomized Search Heuristics in Discrete Search Spaces (Benjamin Doerr and Frank Neumann, eds.), Springer, 2019, To appear. Available online at https://arxiv.org/abs/1804.05650.
- [DDY16] Benjamin Doerr, Carola Doerr, and Jing Yang, Optimal parameter choices via precise black-box analysis, Proc. of Genetic and Evolutionary Computation Conference (GECCO’16), ACM, 2016, pp. 1123–1130.
- [DGWY17] Benjamin Doerr, Christian Gießen, Carsten Witt, and Jing Yang, The (1+λ) evolutionary algorithm with self-adjusting mutation rate, Proc. of Genetic and Evolutionary Computation Conference (GECCO’17), ACM, 2017, pp. 1351–1358.
- [DLMN17] Benjamin Doerr, Huu Phuoc Le, Régis Makhmara, and Ta Duy Nguyen, Fast genetic algorithms, Proc. of Genetic and Evolutionary Computation Conference (GECCO’17), ACM, 2017, Full version available at http://arxiv.org/abs/1703.03334, pp. 1115–1122.
- [Doe18a] Benjamin Doerr, Better runtime guarantees via stochastic domination, Proc. of Evolutionary Computation in Combinatorial Optimization (EvoCOP’18), Lecture Notes in Computer Science, vol. 10782, Springer, 2018, Full paper available at https://arxiv.org/abs/1801.04487, pp. 1–17.
- [Doe18b] Carola Doerr, Dynamic parameter choices in evolutionary computation, Proc. of Genetic and Evolutionary Computation Conference (GECCO’18), Companion Material, ACM, 2018, pp. 800–830.
- [DW18] Carola Doerr and Markus Wagner, On the effectiveness of simple success-based parameter selection mechanisms for two classical discrete black-box optimization benchmark problems, Proc. of Genetic and Evolutionary Computation Conference (GECCO’18), ACM, 2018, pp. 943–950.
- [DWY18] Carola Doerr, Hao Wang, Furong Ye, Sander van Rijn, and Thomas Bäck, IOHprofiler: A Benchmarking and Profiling Tool for Iterative Optimization Heuristics, arXiv e-prints:1810.05281 (2018), IOHprofiler is available at https://github.com/IOHprofiler.
- [DYvR18] Carola Doerr, Furong Ye, Sander van Rijn, Hao Wang, and Thomas Bäck, Towards a theory-guided benchmarking suite for discrete black-box optimization heuristics: profiling (1+λ) EA variants on OneMax and LeadingOnes, Proc. of Genetic and Evolutionary Computation Conference (GECCO’18), ACM, 2018, pp. 951–958.
- [FGQW18] Tobias Friedrich, Andreas Göbel, Francesco Quinzan, and Markus Wagner, Heavy-tailed mutation operators in single-objective combinatorial optimization, Proc. of Parallel Problem Solving from Nature (PPSN’18), Lecture Notes in Computer Science, vol. 11101, Springer, 2018, pp. 134–145.
- [FQW18] Tobias Friedrich, Francesco Quinzan, and Markus Wagner, Escaping large deceptive basins of attraction with heavy-tailed mutation operators, Proc. of Genetic and Evolutionary Computation Conference (GECCO’18), ACM, 2018, pp. 293–300.
- [GW17] Christian Gießen and Carsten Witt, The interplay of population size and mutation probability in the (1+λ) EA on OneMax, Algorithmica 78 (2017), no. 2, 587–609.
- [JDW05] Thomas Jansen, Kenneth A. De Jong, and Ingo Wegener, On the choice of the offspring population size in evolutionary algorithms, Evolutionary Computation 13 (2005), 413–440.
- [JZ11] Thomas Jansen and Christine Zarges, Analysis of evolutionary algorithms: from computational complexity analysis to algorithm engineering, Proc. of Foundations of Genetic Algorithms (FOGA’11), ACM, 2011, pp. 1–14.
- [KHE15] G. Karafotias, M. Hoogendoorn, and A.E. Eiben, Parameter control in evolutionary algorithms: Trends and challenges, IEEE Transactions on Evolutionary Computation 19 (2015), 167–187.
- [Len18] Johannes Lengler, A general dichotomy of evolutionary algorithms on monotone functions, Proc. of Parallel Problem Solving from Nature (PPSN’18), Lecture Notes in Computer Science, vol. 11101, Springer, 2018, pp. 3–15.
- [MB17] Vladimir Mironovich and Maxim Buzdalov, Evaluation of heavy-tailed mutation operator on maximum flow test generation problem, Proc. of Genetic and Evolutionary Computation Conference (GECCO’17), Companion Material, ACM, 2017, pp. 1423–1426.