# Analysis of Noisy Evolutionary Optimization When Sampling Fails

## Abstract

In noisy evolutionary optimization, sampling is a common strategy to deal with noise. By the sampling strategy, the fitness of a solution is evaluated multiple times (called sample size) independently, and its true fitness is then approximated by the average of these evaluations. Previous studies on sampling are mainly empirical. In this paper, we first investigate the effect of sample size from a theoretical perspective. By analyzing the (1+1)-EA on the noisy LeadingOnes problem, we show that as the sample size increases, the running time can reduce from exponential to polynomial, but then return to exponential. This suggests that a proper sample size is crucial in practice. Then, we investigate what strategies can work when sampling with any fixed sample size fails. By two illustrative examples, we prove that using parent or offspring populations can be better. Finally, we construct an artificial noisy example to show that when using neither sampling nor populations is effective, adaptive sampling (i.e., sampling with an adaptive sample size) can work. This, for the first time, provides a theoretical support for the use of adaptive sampling.

Noisy optimization, evolutionary algorithms, sampling, population, running time analysis.

## I Introduction

Evolutionary algorithms (EAs) are a type of general-purpose randomized optimization algorithms, inspired by natural evolution. They have been widely applied to solve real-world optimization problems, which are often subject to noise. Sampling is a popular strategy for dealing with noise: to estimate the fitness of a solution, it evaluates the fitness multiple ($m$) times (called sample size) independently and then uses the sample average to approximate the true fitness. Sampling reduces the variance of noise by a factor of $m$, but also increases the computation time for the fitness estimation of a solution by $m$ times. Previous studies mainly focused on the empirical design of efficient sampling methods, e.g., adaptive sampling [4, 5], which dynamically decides the sample size for each solution in each generation. The theoretical analysis of sampling has rarely been touched.

Because EAs mimic natural phenomena and exhibit sophisticated behaviors, their theoretical analysis is difficult. Much effort has thus been devoted to understanding the behavior of EAs from a theoretical viewpoint [2, 17], but most of this work focuses on noise-free optimization. The presence of noise further increases the randomness of optimization, and thus also increases the difficulty of analysis.

For running time analysis (one essential theoretical aspect) in noisy evolutionary optimization, only a few results have been reported. The classic (1+1)-EA algorithm was first studied on the OneMax and LeadingOnes problems under various noise models [3, 7, 10, 14, 22, 27]. The results showed that the (1+1)-EA is efficient only under low noise levels, e.g., for the (1+1)-EA solving OneMax in the presence of one-bit noise, the maximal noise level allowing a polynomial running time is , where the noise level is characterized by the noise probability $p$ and $n$ is the problem size. Later studies mainly proved the robustness of different strategies to noise, including using populations [6, 7, 14, 21, 27], sampling [22, 23] and threshold selection [24]. For example, the ($\mu$+1)-EA with  [14], the (1+$\lambda$)-EA with  [14], the (1+1)-EA using sampling with  [23] or the (1+1)-EA using threshold selection with threshold  [24] can solve OneMax in polynomial time even if the probability of one-bit noise reaches . Note that there was also a sequence of papers analyzing the running time of the compact genetic algorithm [13] and ant colony optimization algorithms [8, 11, 12, 26] solving noisy problems, including OneMax as well as the combinatorial optimization problem single destination shortest paths.

The very few running time analyses involving sampling [22, 23] mainly showed the effectiveness of sampling with a large enough fixed sample size . For example, for the (1+1)-EA solving OneMax under one-bit noise with , using sampling with can reduce the running time exponentially. In addition, Akimoto et al. [1] proved that using sampling with a large enough can make optimization under additive unbiased noise behave as noiseless optimization. However, there are still many fundamental theoretical issues that have not been addressed, e.g., how the sample size can affect the effectiveness of sampling, and what strategies can work when sampling fails.

In this paper, we first theoretically investigate the effect of sample size. It may be believed that once the sample size reaches an effective value, the running time will always be polynomial as continues to increase. We give a counterexample, i.e., the (1+1)-EA solving LeadingOnes under one-bit noise with . Qian et al. [22] have shown that the running time will reduce from exponential to polynomial when . We prove that the running time will return to exponential when . Our analysis suggests that the selection of sample size should be careful in practice.

Then, we theoretically compare the two strategies of using populations and sampling on the robustness to noise. Previous studies have shown that both of them are effective for solving OneMax under one-bit noise [14, 22, 23], while using sampling is better for solving OneMax under additive Gaussian noise [23]. Here, we complement this comparison by constructing two specific noisy OneMax problems. For one of them, using parent populations is better than using sampling, while for the other, using offspring populations is better. In both cases, we prove that the employed parent and offspring population sizes are almost tight. We also give an artificial noisy OneMax problem where using neither populations nor sampling is effective. For this case, we further prove that using adaptive sampling can reduce the running time exponentially, which provides some theoretical justification for the good empirical performance of adaptive sampling [28, 32].

This paper extends our preliminary work [25]. When comparing sampling with populations, we only considered parent populations in [25]. To get a complete understanding, we add the analysis of using offspring populations. We construct a new noisy example to show that using offspring populations can be better than using sampling (i.e., Theorems 9 and 10 in Section V). For the noisy example in Section VI, where we previously proved that using neither sampling nor parent populations is effective while adaptive sampling can work, we now prove that using offspring populations is also ineffective (i.e., Theorem 14 in Section VI). To show that using parent populations is better than using sampling, we only gave an effective parent population size in [25]. We now add the analysis of the tightness of the effective parent population size (i.e., Theorem 8 in Section IV) as well as the effective offspring population size (i.e., Theorem 11 in Section V).

The rest of this paper is organized as follows. Section II introduces some preliminaries. Section III analyzes the effect of sample size. The effectiveness of using parent and offspring populations when sampling fails is proved in Sections IV and V, respectively. Section VI then shows that when using neither sampling nor populations is effective, adaptive sampling can work. Finally, Section VII concludes the paper.

## II Preliminaries

In this section, we first introduce the EAs and the sampling strategy, and then present the analysis tools that will be used in this paper.

### II-A Evolutionary Algorithms

The (1+1)-EA (i.e., Algorithm 1) maintains only one solution, and iteratively tries to produce one better solution by bit-wise mutation and selection. The ($\mu$+1)-EA (i.e., Algorithm 2) uses a parent population size $\mu$. In each iteration, it also generates one new solution $x'$, and then uses $x'$ to replace the worst solution in the population if $x'$ is not worse. The (1+$\lambda$)-EA (i.e., Algorithm 3) uses an offspring population size $\lambda$. In each iteration, it generates $\lambda$ offspring solutions independently by mutating the parent solution $x$, and then uses the best offspring solution to replace the parent solution if it is not worse. When $\mu = 1$ and $\lambda = 1$, both the ($\mu$+1)-EA and (1+$\lambda$)-EA degenerate to the (1+1)-EA. Note that for the ($\mu$+1)-EA, a slightly different updating rule is also used [13, 30]: $x'$ is simply added into $P$ and then the worst solution in $P$ is deleted. Our results about the ($\mu$+1)-EA derived in the paper also apply to this setting.

In noisy optimization, only a noisy fitness value instead of the exact one can be accessed. Note that in our analysis, the algorithms are assumed to use the reevaluation strategy as in [8, 10, 14]. That is, besides evaluating the noisy fitness of offspring solutions, the noisy fitness values of parent solutions will be reevaluated in each iteration. The running time of EAs is usually measured by the number of fitness evaluations until finding an optimal solution w.r.t. the true fitness function for the first time [1, 10, 14].
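As a rough illustration of the reevaluation strategy, the following Python sketch runs a (1+1)-EA in which the parent and the offspring are both evaluated afresh in every iteration. The function names, the `max_iters` cap and the convention of passing a random generator around are assumptions made for this example; it is not a reproduction of Algorithm 1.

```python
import random


def bitwise_mutation(x, rng):
    """Flip each bit independently with probability 1/n."""
    n = len(x)
    return [b ^ 1 if rng.random() < 1.0 / n else b for b in x]


def one_plus_one_ea(n, noisy_fitness, true_fitness, optimum_value, max_iters, rng):
    """(1+1)-EA with reevaluation; returns the final solution and the
    number of noisy fitness evaluations spent."""
    x = [rng.randint(0, 1) for _ in range(n)]
    evaluations = 1  # cost charged for evaluating the initial solution
    for _ in range(max_iters):
        # Stopping is judged by the true fitness, as in the running time definition.
        if true_fitness(x) == optimum_value:
            break
        y = bitwise_mutation(x, rng)
        fx = noisy_fitness(x, rng)  # parent is reevaluated, not cached
        fy = noisy_fitness(y, rng)
        evaluations += 2
        if fy >= fx:  # accept the offspring if it does not appear worse
            x = y
    return x, evaluations
```

With a noise-free `noisy_fitness` the sketch reduces to the usual (1+1)-EA, which is a convenient sanity check before plugging in a noise model.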

### II-B Sampling

Sampling as described in Definition 1 is a common strategy to deal with noise. It approximates the true fitness using the average of a number of random evaluations. The number of random evaluations is called the sample size. Note that $m = 1$ implies that sampling is not used. Qian et al. [22, 23] have theoretically shown the robustness of sampling to noise. Particularly, they proved that by using sampling with some fixed sample size, the running time of the (1+1)-EA for solving OneMax and LeadingOnes under noise can reduce from exponential to polynomial.

###### Definition 1 (Sampling).

Sampling first evaluates the fitness of a solution $x$ $m$ times independently and obtains the noisy fitness values $f^{\mathrm{n}}_1(x), \ldots, f^{\mathrm{n}}_m(x)$, and then outputs their average, i.e.,

$$\hat{f}(x)=\frac{1}{m}\sum_{i=1}^{m} f^{\mathrm{n}}_i(x).$$

Adaptive sampling dynamically decides the sample size for each solution in the optimization process, instead of using a fixed size. For example, one popular strategy [4, 5] is to first estimate the fitness of two solutions by a small number of samples, and then sequentially increase samples until the difference can be significantly discriminated. It has been found useful in many applications [28, 32], but there has been no theoretical work supporting its effectiveness.
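The fixed-size sampling of Definition 1 amounts to averaging independent noisy evaluations. A minimal Python sketch is given below; the name `sampled_fitness` and the `noisy_fitness(x, rng)` calling convention are assumptions of this example rather than notation from the paper.

```python
import random


def sampled_fitness(noisy_fitness, x, m, rng):
    """Average of m independent noisy evaluations of x (sample size m)."""
    if m < 1:
        raise ValueError("sample size m must be at least 1")
    return sum(noisy_fitness(x, rng) for _ in range(m)) / m
```

Setting `m = 1` returns a single noisy evaluation, i.e., no sampling. Each call costs `m` fitness evaluations, which is why the sample size enters the running time as a multiplicative factor.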

### II-C Analysis Tools

EAs often generate offspring solutions only based on the current population, thus, an EA can be modeled as a Markov chain (e.g., in [16, 31]) by taking the EA’s population space as the chain’s state space (i.e., ) and taking the set of all optimal populations as the chain’s target state space. Note that the population space consists of all possible populations, and an optimal population contains at least one optimal solution.

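Given a Markov chain $\{\xi_t\}_{t=0}^{+\infty}$ and an initial state $\xi_0$, we define its first hitting time $\tau$ as the number of steps until the chain reaches the target state space for the first time. The expectation of $\tau$, $E(\tau \mid \xi_0)$, is called the expected first hitting time (EFHT). If $\xi_0$ is drawn from a distribution $\pi_0$, $E(\tau \mid \xi_0 \sim \pi_0)$ is called the EFHT of the chain over the initial distribution $\pi_0$. Thus, the expected running time of the ($\mu$+1)-EA starting from $\xi_0 \sim \pi_0$ is $\mu + (\mu+1) \cdot E(\tau \mid \xi_0 \sim \pi_0)$, where the first $\mu$ is the cost of evaluating the initial population, and $\mu+1$ is the cost of one iteration, in which it needs to evaluate the offspring solution and reevaluate the $\mu$ parent solutions. Similarly, the expected running time of the (1+$\lambda$)-EA starting from $\xi_0 \sim \pi_0$ is $1 + (1+\lambda) \cdot E(\tau \mid \xi_0 \sim \pi_0)$, where the first 1 is the cost of evaluating the initial solution, and $1+\lambda$ is the cost of one iteration, in which it needs to evaluate the $\lambda$ offspring solutions and reevaluate the parent solution. For the (1+1)-EA, the expected running time is calculated by setting $\mu = 1$ or $\lambda = 1$, i.e., $1 + 2 \cdot E(\tau \mid \xi_0 \sim \pi_0)$. For the (1+1)-EA with sampling, it becomes $m + 2m \cdot E(\tau \mid \xi_0 \sim \pi_0)$, because the fitness estimation of a solution needs $m$ independent evaluations. Note that in this paper, we consider the expected running time of an EA starting from a uniform initial distribution.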

Then, we introduce several drift theorems which will be used to analyze the EFHT of Markov chains in this paper. The multiplicative drift theorem (i.e., Theorem 1) [9] is for deriving upper bounds on the EFHT. First, a distance function $V$ needs to be designed to measure the distance of a state to the target state space, such that $V$ equals 0 on target states and is positive on all other states. Then, we need to analyze the drift towards the target state space in each step, i.e., $E(V(\xi_t)-V(\xi_{t+1}) \mid \xi_t)$. If the drift in each step is roughly proportional to the current distance to the optimum, we can derive an upper bound on the EFHT accordingly.

###### Theorem 1 (Multiplicative Drift [9]).

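Given a Markov chain $\{\xi_t\}_{t=0}^{+\infty}$ and a distance function $V$ over its state space, if for any $t \geq 0$ and any $\xi_t$ with $V(\xi_t) > 0$, there exists a real number $c > 0$ such that

$$E(V(\xi_t)-V(\xi_{t+1}) \mid \xi_t) \geq c \cdot V(\xi_t),$$

then it holds that $E(\tau \mid \xi_0) \leq \frac{1+\log(V(\xi_0)/V_{\min})}{c}$, where $V_{\min}$ denotes the minimum among all possible positive values of $V$.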
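To make the use of Theorem 1 concrete, the sketch below evaluates the multiplicative drift bound in its usual form, $(1+\ln(V(\xi_0)/V_{\min}))/c$. The function name and the sample parameters are assumptions for illustration only; the numerical example plugs in a drift factor of $c = 1/(10 n \log n)$ with $V(\xi_0) \leq n$ and $V_{\min} = 1$, the values that appear in the proof of Theorem 7.

```python
import math


def multiplicative_drift_bound(c, v0, v_min):
    """Upper bound (1 + ln(v0 / v_min)) / c on the expected first hitting time."""
    if c <= 0 or v_min <= 0 or v0 < v_min:
        raise ValueError("need c > 0 and 0 < v_min <= v0")
    return (1.0 + math.log(v0 / v_min)) / c


# Hypothetical instance: n = 100, c = 1 / (10 n ln n), V(xi_0) <= n, V_min = 1.
n = 100
bound = multiplicative_drift_bound(1.0 / (10 * n * math.log(n)), n, 1)
```

The resulting value equals $10 n \log n (1 + \log n)$, i.e., a bound of order $n \log^2 n$ on the number of iterations.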

The simplified negative drift theorem (i.e., Theorem 2) [18, 19] is for proving exponential lower bounds on the EFHT of Markov chains, where $X_t$ is often represented by a mapping of $\xi_t$. From Theorem 2, we can see that two conditions are required: (1) a constant negative drift and (2) exponentially decaying probabilities of jumping towards or away from the target state. By relating the jumping distance to the length of the drift interval, a more general theorem, simplified negative drift with scaling [20], has been proposed, as presented in Theorem 3. Theorem 4 gives the original negative drift theorem [15], which is stronger because both simplified versions are proved by using this original theorem.

###### Theorem 2 (Simplified Negative Drift [18, 19]).

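Let $X_t$, $t \geq 0$, be real-valued random variables describing a stochastic process over some state space. Suppose there exists an interval $[a,b] \subseteq \mathbb{R}$, two constants $\delta, \epsilon > 0$ and, possibly depending on $l := b-a$, a function $r(l)$ satisfying $1 \leq r(l) = o(l/\log(l))$ such that for all $t \geq 0$:

$$\begin{aligned} &(1)\quad E(X_t - X_{t+1} \mid a < X_t < b) \leq -\epsilon,\\ &(2)\quad P(|X_{t+1}-X_t| \geq j \mid X_t > a) \leq \frac{r(l)}{(1+\delta)^j} \quad \text{for } j \in \mathbb{N}_0. \end{aligned}$$

Then there exists a constant $c > 0$ such that for $T := \min\{t \geq 0 : X_t \leq a \mid X_0 \geq b\}$ it holds $P(T \leq 2^{cl/r(l)}) = 2^{-\Omega(l/r(l))}$.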

###### Theorem 3 (Simplified Negative Drift with Scaling [20]).

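Let $X_t$, $t \geq 0$, be real-valued random variables describing a stochastic process over some state space. Suppose there exists an interval $[a,b] \subseteq \mathbb{R}$ and, possibly depending on $l := b-a$, a drift bound $\epsilon := \epsilon(l) > 0$ as well as a scaling factor $r := r(l)$ such that for all $t \geq 0$:

$$\begin{aligned} &(1)\quad E(X_t - X_{t+1} \mid a < X_t < b) \leq -\epsilon,\\ &(2)\quad P(|X_{t+1}-X_t| \geq jr \mid X_t > a) \leq e^{-j} \quad \text{for } j \in \mathbb{N}_0,\\ &(3)\quad 1 \leq r \leq \min\left\{\epsilon^2 l, \sqrt{\epsilon l/(132\log(\epsilon l))}\right\}. \end{aligned}$$

Then it holds for the first hitting time $T := \min\{t \geq 0 : X_t \leq a \mid X_0 \geq b\}$ that $P(T \leq e^{\epsilon l/(132r^2)}) = O(e^{-\epsilon l/(132r^2)})$.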

###### Theorem 4 (Negative Drift [15]).

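Let $X_t$, $t \geq 0$, be real-valued random variables describing a stochastic process over some state space. Pick two real numbers $a(l)$ and $b(l)$ depending on a parameter $l$ such that $a(l) < b(l)$ holds. Let $T(l)$ be the random variable denoting the earliest time $t \geq 0$ such that $X_t \leq a(l)$ holds. Suppose there exists $\lambda(l) > 0$ and $p(l) > 0$ such that for all $t \geq 0$:

$$E\left(e^{-\lambda(l)\cdot(X_{t+1}-X_t)} \mid a(l) < X_t < b(l)\right) \leq 1-\frac{1}{p(l)}.$$

Then it holds that for all time bounds $L(l) \geq 0$,

$$P(T(l) \leq L(l) \mid X_0 \geq b(l)) \leq e^{-\lambda(l)\cdot(b(l)-a(l))} \cdot L(l) \cdot D(l) \cdot p(l),$$

where $D(l) = \max\left\{1, E\left(e^{-\lambda(l)\cdot(X_{t+1}-b(l))} \mid X_t \geq b(l)\right)\right\}$.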

## III The Effect of Sample Size

Previous studies [22, 23] have shown that for noisy evolutionary optimization, sampling with some fixed sample size can decrease the running time exponentially in some situations. For example, for the (1+1)-EA solving the OneMax problem under one-bit noise with the noise probability , the expected running time is super-polynomial [10]; while by using sampling with , the running time reduces to polynomial [22]. Then, a natural question is that whether the running time will always be polynomial by using any polynomially bounded sample size larger than the effective . It may be believed that the answer is yes, since the sample size has been effective and using a larger sample size will make the fitness estimation more accurate. For example, for the (1+1)-EA solving OneMax under one-bit noise, it is easy to see from Lemma 3 in [22] that using a larger sample size than will make the probability of accepting a true worse solution in the comparison continue to decrease and the running time will obviously stay polynomial. In this section, we give a counterexample by considering the (1+1)-EA solving the LeadingOnes problem under one-bit noise, which suggests that the selection of sample size should be careful in practice.

As presented in Definition 2, the goal of the LeadingOnes problem is to maximize the number of consecutive 1-bits counting from the left of a solution. We can easily see that the optimal solution is the string with all 1s (denoted as ). As presented in Definition 3, the one-bit noise model flips a random bit of a solution before evaluation with probability . When , it was known [22] that the expected running time of the (1+1)-EA is exponential, while the running time will reduce to polynomial by using sampling with . We prove in Theorem 5 that the running time of the (1+1)-EA will return to exponential if .

###### Definition 2 (LeadingOnes).

The LeadingOnes Problem is to find a binary string that maximises

$$f(x)=\sum_{i=1}^{n}\prod_{j=1}^{i}x_j.$$

###### Definition 3 (One-bit Noise).

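Given a parameter $p \in [0,1]$, let $f^{\mathrm{n}}(x)$ and $f(x)$ denote the noisy and true fitness of a solution $x$, respectively, then

$$f^{\mathrm{n}}(x)=\begin{cases} f(x) & \text{with prob. } 1-p,\\ f(x') & \text{with prob. } p,\end{cases}$$

where $x'$ is generated by flipping a randomly chosen bit of $x$.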
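The two definitions above translate directly into code. The Python sketch below is one possible rendering; the names `leading_ones` and `one_bit_noisy` and the choice of representing solutions as lists of 0/1 integers are assumptions of this example.

```python
import random


def leading_ones(x):
    """Number of consecutive 1-bits counted from the left of x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def one_bit_noisy(fitness, x, p, rng):
    """One-bit noise: with probability p, evaluate a copy of x in which one
    uniformly chosen bit is flipped; otherwise evaluate x itself."""
    if rng.random() < p:
        i = rng.randrange(len(x))
        flipped = list(x)  # the solution itself is left unchanged
        flipped[i] ^= 1
        return fitness(flipped)
    return fitness(x)
```

Note that the noise perturbs only the evaluation: the solution kept by the algorithm is never modified by `one_bit_noisy`.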

From Lemma 6 in [22], we can find the reason why sampling is effective only with a moderate sample size. In most cases, if , the expected gap between and is positive, which implies that a larger sample size is better since it will decrease . However, when and is close to the optimum , the expectation of can be negative, which implies that a larger sample size is worse since it will increase . Thus, neither a small sample size nor a large sample size is effective. The sample size of just makes a good tradeoff, which can lead to a not too large probability of and a sufficiently small probability of for two solutions and with and .

###### Theorem 5.

For the (1+1)-EA solving LeadingOnes under one-bit noise with , the expected running time is exponential [22]; if using sampling with , the expected running time is polynomial [22]; if using sampling with , the expected running time is exponential.

###### Proof.

We only need to prove the case . Our main idea is to show that before reaching the optimal solution , the algorithm will first find the solution or with a probability of at least ; while the probability of leaving or is exponentially small. Combining these two points, the theorem holds.

Let a Markov chain model the analyzed evolutionary process. Let denote the true number of leading 1-bits of a solution . For any , let denote the event that at time , the (1+1)-EA finds a solution with at least leading 1-bits for the first time, i.e., and ; let and denote the subsets of , which require that and , respectively. Thus, before reaching the optimal solution , the (1+1)-EA can find a solution in with probability at least .

We then show that . Assume that , where . Let denote the probability that is mutated to by bit-wise mutation. Then,

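\begin{align}
P(A_t \mid C_t) = \big(&P_{\mathrm{mut}}(x,1^{n-1}0)\cdot P(\hat{f}(1^{n-1}0) \geq \hat{f}(x)) \tag{4}\\
&+ P_{\mathrm{mut}}(x,1^{n-2}01)\cdot P(\hat{f}(1^{n-2}01) \geq \hat{f}(x))\big)/P(C_t). \tag{5}
\end{align}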

For and , we apply Hoeffding’s inequality to get a lower bound . By the definition of one-bit noise, we get, for ,

$$E(f^{\mathrm{n}}(1^k01^{n-k-1}))=\sum_{j=1}^{k}\frac{1}{n}\cdot(j-1)+\frac{1}{n}\cdot n+\frac{n-k-1}{n}\cdot k. \tag{6}$$

Then, we have, for ,

 (7)

Thus, for , . Since and , we have

$$E(f^{\mathrm{n}}(1^{n-1}0))-E(f^{\mathrm{n}}(x)) \geq 1/n.$$

Let . Since the value by sampling is the average of independent evaluations, . Then, we have

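\begin{align}
P(\hat{f}(x) \geq \hat{f}(1^{n-1}0)) &= P(\hat{f}(x)-\hat{f}(1^{n-1}0)-r \geq -r) \notag\\
&\leq \exp\left(-\frac{2m^2r^2}{m(2n)^2}\right) \leq e^{-n/2}, \tag{8}
\end{align}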

where the first inequality is by Hoeffding’s inequality and , and the last is by and . It is easy to see from Eq. (7) that . Thus, we can similarly get

$$P(\hat{f}(x) \geq \hat{f}(1^{n-2}01)) \leq e^{-n/2}. \tag{9}$$

By applying Eqs. (8) and (9) to Eq. (4), we get

$$P(A_t \mid C_t) \geq (1-e^{-n/2})\cdot\frac{P_{\mathrm{mut}}(x,1^{n-1}0)+P_{\mathrm{mut}}(x,1^{n-2}01)}{P(C_t)}. \tag{10}$$

Since ,

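$$\frac{P(A_t \mid C_t)}{P(B_t \mid C_t)} \geq (1-e^{-n/2})\cdot\frac{P_{\mathrm{mut}}(x,1^{n-1}0)+P_{\mathrm{mut}}(x,1^{n-2}01)}{P_{\mathrm{mut}}(x,1^{n})+P_{\mathrm{mut}}(x,1^{n-2}0^2)}. \tag{11}$$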

If or ,

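$$\frac{P(A_t \mid C_t)}{P(B_t \mid C_t)} \geq (1-e^{-n/2})\cdot\frac{\frac{1}{n}\left(1-\frac{1}{n}\right)+\frac{1}{n}\left(1-\frac{1}{n}\right)}{\frac{1}{n^2}+\left(1-\frac{1}{n}\right)^2} \geq \frac{1}{n}. \tag{12}$$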

If , we can similarly derive that . Since , our claim that holds.

Thus, the probability that the (1+1)-EA first finds a solution in before reaching the optimum is at least

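$$\sum_{t=1}^{+\infty}P(A_t \mid C_t)\cdot P(C_t) \geq \frac{1}{n+1}\cdot\sum_{t=1}^{+\infty}P(C_t) = \frac{1}{n+1}\cdot P(LO(\xi_0) < n-2) = \frac{1}{n+1}\cdot\left(1-\frac{1}{2^{n-2}}\right), \tag{13}$$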

where the first equality is because the union of the events with implies that the time of finding a solution with at least leading 1-bits is at least 1, which is equivalent to that the initial solution has less than leading 1-bits; and the last equality is due to the uniform initial distribution.

We then show that after finding or , the probability of the (1+1)-EA leaving this state in each iteration is exponentially small. From Eqs. (8) and (9), we know that for any with and , . For and , it is easy to verify that . Using the same analysis as Eq. (8), we can get, for and , . Combining the above two cases, we get, for and , . Thus, our claim that the probability of leaving in each step is exponentially small holds. ∎
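Eq. (6) can be checked by brute force. The sketch below enumerates all $n$ single-bit flips of $1^k01^{n-k-1}$ and compares the exact average of the resulting LeadingOnes values with the closed form. It assumes that the noise flips a bit in every evaluation (noise probability 1), which is how Eq. (6) reads, since it contains no term for an unperturbed evaluation; the helper names are hypothetical.

```python
from fractions import Fraction


def leading_ones(x):
    """Number of consecutive 1-bits counted from the left of x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count


def expected_noisy_lo(x):
    """Exact expectation of LeadingOnes after flipping one uniformly chosen bit."""
    n = len(x)
    total = Fraction(0)
    for i in range(n):
        y = list(x)
        y[i] ^= 1
        total += Fraction(leading_ones(y), n)
    return total


def eq6_closed_form(n, k):
    """Right-hand side of Eq. (6) for the solution 1^k 0 1^(n-k-1)."""
    first_k = sum(Fraction(j - 1, n) for j in range(1, k + 1))  # a leading 1 is flipped
    zero_bit = Fraction(n, n)                                   # the single 0 is flipped
    tail = Fraction((n - k - 1) * k, n)                         # a trailing 1 is flipped
    return first_k + zero_bit + tail


def check_eq6(max_n):
    """True if enumeration and closed form agree for all 2 <= n <= max_n."""
    return all(
        expected_noisy_lo([1] * k + [0] + [1] * (n - k - 1)) == eq6_closed_form(n, k)
        for n in range(2, max_n + 1)
        for k in range(n)
    )
```

Exact rational arithmetic is used so that the comparison is not affected by floating-point rounding.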

## IV Parent Populations Can Work on Some Tasks Where Sampling Fails

Previous studies [14, 22, 23] have shown that both using populations and sampling can bring robustness to noise. For example, for the OneMax problem under one-bit noise with , the (1+1)-EA needs exponential time to find the optimum [10], while using a parent population size  [14], an offspring population size  [14] or a sample size  [22] can all reduce the running time to polynomial. Then, a natural question is that whether there exist cases where only one of these two strategies (i.e., populations and sampling) is effective. This question has been partially addressed. For the OneMax problem under additive Gaussian noise with large variances, it was shown that the (+1)-EA with needs super-polynomial time to find the optimum [13], while the (1+1)-EA using sampling can find the optimum in polynomial time [23]. Now, we try to solve the other part of this question. That is, we are to prove that using populations can be better than using sampling.

In this section, we show that compared with using sampling, using parent populations can be more robust to noise. Particularly, we compare the (1+1)-EA using sampling with the ($\mu$+1)-EA for solving OneMax under symmetric noise. As presented in Definition 4, the goal of the OneMax problem is to maximize the number of 1-bits, and the optimal solution is $1^n$. As presented in Definition 5, symmetric noise returns a false fitness $2n-f(x)$ with probability $1/2$. It is easy to see that under this noise model, the distribution of $f^{\mathrm{n}}(x)$ for any $x$ is symmetric about $n$.

###### Definition 4 (OneMax).

The OneMax Problem is to find a binary string that maximises

$$f(x)=\sum_{i=1}^{n}x_i.$$

###### Definition 5 (Symmetric Noise).

Let $f^{\mathrm{n}}(x)$ and $f(x)$ denote the noisy and true fitness of a solution $x$, respectively, then

$$f^{\mathrm{n}}(x)=\begin{cases} f(x) & \text{with prob. } 1/2,\\ 2n-f(x) & \text{with prob. } 1/2.\end{cases}$$
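The following Python sketch implements OneMax with symmetric noise and estimates how often a truly worse offspring appears at least as good as its parent when both are evaluated by averaging several samples. All names (`symmetric_noisy_onemax`, `acceptance_rate`, and so on) and the parameter values in the usage note are assumptions for illustration.

```python
import random


def one_max(x):
    """Number of 1-bits of x."""
    return sum(x)


def symmetric_noisy_onemax(x, rng):
    """Return f(x) or 2n - f(x), each with probability 1/2."""
    n, f = len(x), one_max(x)
    return f if rng.random() < 0.5 else 2 * n - f


def sampled(x, m, rng):
    """Average of m independent noisy evaluations of x."""
    return sum(symmetric_noisy_onemax(x, rng) for _ in range(m)) / m


def acceptance_rate(parent, child, m, trials, rng):
    """Fraction of trials in which the sampled fitness of child is at least
    the sampled fitness of parent."""
    accepted = sum(
        sampled(child, m, rng) >= sampled(parent, m, rng) for _ in range(trials)
    )
    return accepted / trials
```

For instance, calling `acceptance_rate` with a parent that has 8 one-bits out of 10 and a child that has 7 would be expected to give a rate of roughly one half or more for any sample size, since the difference of the two sample averages is symmetric about 0.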

We prove in Theorem 6 that the expected running time of the (1+1)-EA using sampling with any sample size is exponential. From the proof, we can find the reason why using sampling fails. Under symmetric noise, the distribution of $f^{\mathrm{n}}(x)$ for any $x$ is symmetric about $n$. Thus, for any two solutions $x$ and $y$, the distribution of $f^{\mathrm{n}}(x)-f^{\mathrm{n}}(y)$ is symmetric about 0. By sampling, the distribution of $\hat{f}(x)-\hat{f}(y)$ is still symmetric about 0, which implies that the offspring solution will always be accepted with probability at least $1/2$ in each iteration of the (1+1)-EA. Such a behavior is analogous to a random walk, and thus the optimization is inefficient.

###### Theorem 6.

For the (1+1)-EA solving OneMax under symmetric noise, if using sampling, the expected running time is exponential.

###### Proof.

We apply the simplified negative drift theorem (i.e., Theorem 2) to prove it. Let denote the number of 0-bits of the solution maintained by the (1+1)-EA after running iterations. We consider the interval , i.e., the parameters and in Theorem 2.

We then analyze for . The drift is divided into two parts: and . That is,

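\begin{align}
E(X_t-X_{t+1} \mid X_t=i) &= E^+ - E^-, \quad \text{where} \tag{14}\\
E^+ &= \sum_{x':|x'|_0<i}P_{\mathrm{mut}}(x,x')\cdot P(\hat{f}(x') \geq \hat{f}(x))\cdot(i-|x'|_0), \tag{15}\\
E^- &= \sum_{x':|x'|_0>i}P_{\mathrm{mut}}(x,x')\cdot P(\hat{f}(x') \geq \hat{f}(x))\cdot(|x'|_0-i). \tag{16}
\end{align}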

To analyze , we use a trivial upper bound 1 on . Then, we have

$$E^+ \leq \sum_{x':|x'|_0<i}P_{\mathrm{mut}}(x,x')\cdot(i-|x'|_0) \leq \frac{i}{n},$$

where the last inequality is directly derived from Eq. (17) in the proof of Theorem 9 [22]. For , we have to consider that the number of 0-bits is increased. We analyze the cases where only one 1-bit is flipped (i.e., ), whose probability is . Let . By the definition of symmetric noise, the value of can be , , and , each with probability . It is easy to see that the distribution of is symmetric about 0, i.e., has the same distribution as . Since is the average of independent random variables, which have the same distribution as , the distribution of is also symmetric about 0, and thus . Then,

$$E^- \geq \frac{n-i}{en}\cdot\frac{1}{2}\cdot(i+1-i)=\frac{n-i}{2en}.$$

By calculating , we get

$$E(X_t-X_{t+1} \mid X_t=i) \leq \frac{i}{n}-\frac{n-i}{2en} \leq -0.05,$$

where the last inequality is by . Thus, condition (1) of Theorem 2 holds with .

To make , it is necessary to flip at least bits of . Thus, we get

$$P(|X_{t+1}-X_t| \geq j \mid X_t \geq 1) \leq \binom{n}{j}\frac{1}{n^j} \leq \frac{1}{j!} \leq 2\cdot\frac{1}{2^j}.$$

That is, condition (2) of Theorem 2 holds with and . Note that . By Theorem 2, we can conclude that the expected running time is exponential. ∎
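The chain of inequalities used for condition (2), $\binom{n}{j}/n^j \leq 1/j! \leq 2/2^j$, can be checked exactly for small values. The sketch below does so with rational arithmetic; the function names and the tested range are assumptions of this example.

```python
from fractions import Fraction
from math import comb, factorial


def jump_bounds(n, j):
    """Return the three terms C(n, j) / n^j, 1 / j! and 2 / 2^j as exact fractions."""
    return (
        Fraction(comb(n, j), n ** j),
        Fraction(1, factorial(j)),
        Fraction(2, 2 ** j),
    )


def chain_holds(max_n):
    """True if the chain of inequalities holds for all 1 <= n <= max_n, 0 <= j <= n."""
    for n in range(1, max_n + 1):
        for j in range(n + 1):
            flip_bound, inv_factorial, geometric = jump_bounds(n, j)
            if not (flip_bound <= inv_factorial <= geometric):
                return False
    return True
```

The first term is the union bound on the probability that bit-wise mutation flips at least $j$ bits, and the last term has the geometric form required by Theorem 2.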

We prove in Theorem 7 that the (+1)-EA with can find the optimum in time. The reason for the effectiveness of using parent populations is that the true best solution will be discarded only if it appears worse than all the other solutions in the population, the probability of which can be very small by using a logarithmic parent population size. Note that this finding is consistent with that in [14].

###### Theorem 7.

For the (+1)-EA solving OneMax under symmetric noise, if , the expected running time is .

###### Proof.

We apply the multiplicative drift theorem (i.e., Theorem 1) to prove it. Note that the state of the corresponding Markov chain is currently a population, i.e., a set of solutions. We first design a distance function : for any population , , i.e., the minimum number of 0-bits of the solution in . It is easy to see that iff , i.e., contains the optimum .

Then, we investigate for any with (i.e., ). Assume that currently , where . We also divide the drift into two parts:

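$$\begin{aligned} E(V(\xi_t)-V(\xi_{t+1}) \mid \xi_t=P) &= E^+ - E^-, \quad \text{where}\\ E^+ &= \sum_{P':V(P')<i}P(\xi_{t+1}=P' \mid \xi_t=P)\cdot(i-V(P')),\\ E^- &= \sum_{P':V(P')>i}P(\xi_{t+1}=P' \mid \xi_t=P)\cdot(V(P')-i). \end{aligned} \tag{18}$$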

For , we need to consider that the best solution in is improved. Let , then . In one iteration of the (+1)-EA, a solution with can be generated by selecting and flipping only one 0-bit in mutation, whose probability is . If is not added into , it must hold that for any , which happens with probability since iff . Thus, the probability that is added into (which implies that ) is . We then get

$$E^+ \geq \frac{i}{e\mu n}\cdot\left(1-\frac{1}{2^{\mu}}\right)\cdot(i-(i-1))=\frac{i}{e\mu n}\left(1-\frac{1}{2^{\mu}}\right). \tag{19}$$

For , if there are at least two solutions in such that , it obviously holds that . Otherwise, implies that for the unique best solution in and any , , which happens with probability since iff . Thus, . Furthermore, can increase by at most . Thus, . By calculating , we get

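$$\begin{aligned} E(V(\xi_t)-V(\xi_{t+1}) \mid \xi_t) &\geq \frac{i}{e\mu n}-\frac{i}{e\mu n 2^{\mu}}-\frac{n-i}{2^{\mu-1}}\\ &\geq \frac{i}{10n\log n}=\frac{1}{10n\log n}\cdot V(\xi_t), \end{aligned} \tag{20}$$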

where the second inequality holds with large enough . Note that . Thus, by Theorem 1,

$$E(\tau \mid \xi_0) \leq 10n\log n(1+\log n)=O(n\log^2 n),$$

which implies that the expected running time is , since the algorithm needs to evaluate the offspring solution and reevaluate the parent solutions in each iteration. ∎
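For experimentation, a ($\mu$+1)-EA with reevaluation under symmetric noise can be sketched as follows. This is a simplified reading of the algorithm as described in the text, not a reproduction of Algorithm 2: choosing the parent uniformly at random, breaking ties for the worst solution by index, and the `max_iters` cap are assumptions of this example, as are all function names.

```python
import random


def one_max(x):
    """Number of 1-bits of x."""
    return sum(x)


def symmetric_noisy_onemax(x, rng):
    """Return f(x) or 2n - f(x), each with probability 1/2."""
    n, f = len(x), one_max(x)
    return f if rng.random() < 0.5 else 2 * n - f


def mu_plus_one_ea(n, mu, max_iters, rng):
    """(mu+1)-EA on OneMax under symmetric noise, with reevaluation.
    Returns the final population and the number of fitness evaluations."""
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    evaluations = mu  # cost charged for evaluating the initial population
    for _ in range(max_iters):
        # Stop once the population contains the true optimum 1^n.
        if any(one_max(x) == n for x in population):
            break
        parent = rng.choice(population)  # assumption: uniform parent selection
        child = [b ^ 1 if rng.random() < 1.0 / n else b for b in parent]
        noisy_parents = [symmetric_noisy_onemax(x, rng) for x in population]
        noisy_child = symmetric_noisy_onemax(child, rng)
        evaluations += mu + 1  # mu reevaluations plus one offspring evaluation
        worst = min(range(mu), key=noisy_parents.__getitem__)
        if noisy_child >= noisy_parents[worst]:
            population[worst] = child  # replace the apparently worst solution
    return population, evaluations
```

The evaluation count follows the accounting used in the text: $\mu$ evaluations for the initial population plus $\mu+1$ per iteration.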

In the following, we show that the parent population size is almost tight for making the (+1)-EA efficient. Particularly, we prove that is insufficient. Note that the proof is finished by applying the original negative drift theorem (i.e., Theorem 4) instead of the simplified versions (i.e., Theorems 2 and 3). To apply the simplified negative drift theorems, we have to show that the probability of jumping towards and away from the target is exponentially decaying. However, the probability of jumping away from the target is at least a constant in this studied case. To jump away from the target, it is sufficient that one non-best solution in the current population is cloned by mutation and then the best solution is deleted in the process of updating the population. The former event happens with probability , and the latter happens with probability , which is for . The original negative drift theorem is stronger than the simplified ones, and can be applied here to prove the exponential running time.

###### Theorem 8.

For the (+1)-EA solving OneMax under symmetric noise, if , the expected running time is exponential.

###### Proof.

We apply the original negative drift theorem (i.e., Theorem 4) to prove it. Let , where denotes the minimum number of 0-bits of the solution in the population after iterations of the (+1)-EA, denotes the number of solutions in that have the minimum 0-bits , and for , with . Note that , and iff , i.e., contains at least one optimum . We set , and consider the interval , where