# Significance-based Estimation-of-Distribution Algorithms

## Abstract

Estimation-of-distribution algorithms (EDAs) are randomized search heuristics that maintain a stochastic model of the solution space. This model is updated from iteration to iteration based on the quality of the solutions sampled according to the model. As previous works show, this short-term perspective can lead to erratic updates of the model, in particular, to bit-frequencies approaching a random boundary value. This can lead to significant performance losses.

In order to overcome this problem, we propose a new EDA that takes into account a longer history of samples and updates its model only with respect to information which it classifies as statistically significant. We prove that this significance-based compact genetic algorithm (sig-cGA) optimizes the common benchmark functions OneMax and LeadingOnes both in O(n log n) time, a result shown for no other EDA or evolutionary algorithm so far. For the recently proposed scGA – an EDA that tries to prevent erratic model updates by imposing a bias to the uniformly distributed model – we prove that it optimizes OneMax only in a time exponential in its hypothetical population size.


## 1 Introduction

Estimation-of-distribution algorithms (EDAs; [23]) are a special class of evolutionary algorithms (EAs). They optimize a function by evolving a stochastic model of the solution space. In an iterative fashion, an EDA uses its stochastic model to generate samples and then updates it with respect to observations made from these samples. An algorithm-specific parameter determines how drastic the changes to the model in each iteration are. In order for an EDA to succeed in optimization, it is important that the stochastic model is changed over time such that better solutions are sampled more frequently. However, due to the randomness in sampling, the model should not be changed too drastically in a single iteration in order to prevent wrong updates from having a long-lasting impact.

The theory of EDAs has recently gained momentum [24, 16, 26, 17, 12, 4, 13, 27] and is mainly concerned with the aforementioned trade-off of the convergence speed of an EDA to a near-optimal model while making sure that the model does not prematurely converge to suboptimal models. This trade-off is very visible in the results of Sudholt and Witt [24] and Krejca and Witt [16], who prove lower bounds of the expected run times of three common EDAs on the benchmark function OneMax. In simple words, these bounds show that if the parameter for updating the model is too large, the model converges too quickly and very likely to a wrong model; in consequence, it then takes a long time to find the optimum (usually by first reverting to a better fitting model). On the other hand, if the parameter is too small, then the model does converge to the correct model, but it does so slowly.

The problem of how to choose the parameter has also been discussed by Friedrich et al. [12]. They consider a class of EDAs that all current theoretical results fall into: n-Bernoulli-λ-EDAs optimizing functions over bit strings of length n. The stochastic model of such EDAs uses one variable per bit of a bit string, resulting in a vector of probabilities of length n called the frequency vector. In each iteration, a bit string is sampled bit-wise independently and independent of any other sample such that bit i is 1 with probability p_i (its frequency) and 0 otherwise. Thus, the stochastic model used by such EDAs is a Poisson-binomial distribution. Friedrich et al. [12] consider two different properties of such EDAs: balanced and stable. Intuitively, in expectation, a balanced EDA does not change a frequency if the fitness function has no bias toward 0s or 1s at the respective position i. A stable EDA keeps a frequency, in such a scenario, close to 1/2. Friedrich et al. [12] then prove that an EDA cannot be both balanced and stable. This means that the frequencies will always move toward 0 or 1, even if there is no bias from the objective function (fitness function). They also prove that all commonly theoretically analyzed EDAs are balanced.

The results of Friedrich et al. [12], Sudholt and Witt [24], and Krejca and Witt [16] draw the following picture: for a balanced EDA, there exists some inherent noise in the update. Thus, if the parameter responsible for the update of the stochastic model is large and the speed of convergence high, the algorithm only uses a few samples before it converges. During this time, the noise introduced by the balanced-property may not be overcome, resulting in the stochastic model converging to an incorrect one, as the algorithms are not stable. Hence, the parameter should be smaller in order to guarantee convergence to the correct model, resulting in a slower optimization time.

The core problem in this dilemma lies in what information the EDAs use in order to perform an update: the current samples and the current frequency vector – no time-dependent information. Thus, the algorithms are forced to make an on-the-spot decision with respect to how to update their frequency vector. This entails that they will most likely make a change to their model although this change may be harmful.1 Thus, Friedrich et al. [12] propose an EDA (called scGA) that is stable (but not balanced) in order to converge quicker to a correct frequency vector by introducing an artificial bias into the update process that should counteract the bias of a balanced EDA. However, this approach fails on the standard benchmark function OneMax, as we prove in this paper (Thm. 4).

We propose a new approach that tries to eliminate the aforementioned problems by introducing a new EDA that is aware of its history of samples: the significance-based compact genetic algorithm (sig-cGA). For each position, it has access to a part of the history of bits sampled so far. If it detects that either statistically significantly more 1s than 0s or vice versa were sampled, it changes the corresponding frequency, otherwise not. Thus, the sig-cGA only performs an update when it has proof that it makes sense. This sets it apart from the other EDAs analyzed so far. We prove that the sig-cGA is able to optimize OneMax and LeadingOnes in O(n log n) fitness evaluations in expectation and with high probability (Thms. 3 and 2), which has not been proven before for any other EDA or classical EA (for further details, see Table LABEL:tab:runTimeComparison). Further, we prove that the scGA, which is known to also have an expected run time of O(n log n) on LeadingOnes [12], is not able to optimize OneMax in that time.

Our paper is structured as follows: Section 2 establishes some notation and the setting we consider. In Section 3, we introduce and discuss our new algorithm, the sig-cGA. We also go into detail on how the extra information of the sig-cGA can be implemented efficiently such that the additional overhead is small. Further, we prove that the sig-cGA optimizes LeadingOnes and OneMax in O(n log n) in expectation and with high probability (Thms. 2 and 3, respectively). In Section 4, we briefly discuss the scGA and prove that it optimizes OneMax only in a time exponential in its hypothetical population size, an algorithm-specific parameter (Thm. 4). We conclude our paper in Section 5.

## 2 Preliminaries

In this work, we consider the maximization of pseudo-Boolean functions f: {0, 1}^n → ℝ, where n is a positive integer (fixed for the remainder of this work). We call f a fitness function, an element x ∈ {0, 1}^n an individual, and, for an i ∈ [n], we denote the i-th bit of x by x_i. When talking about run time, we always mean the number of fitness function evaluations of an algorithm until an optimum is sampled for the first time.

In our analysis, we regard the two classic benchmark functions OneMax and LeadingOnes defined by

 OneMax(x) = ∑_{i∈[n]} x_i , (1)
 LeadingOnes(x) = ∑_{i∈[n]} ∏_{j∈[i]} x_j . (2)

In other words, OneMax returns the number of 1s of an individual, whereas LeadingOnes returns the length of the longest prefix of consecutive 1s of an individual, starting from the left. Note that the all-1s bit string is the unique global optimum of both functions.
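As a concrete illustration, the two benchmark functions can be written in a few lines of Python (the function names are ours, not the paper's):

```python
def onemax(x):
    """OneMax: the number of 1-bits in the bit string x."""
    return sum(x)

def leadingones(x):
    """LeadingOnes: the length of the longest prefix of x consisting only of 1s."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count
```

For example, onemax([1, 0, 1, 1]) evaluates to 3, whereas leadingones([1, 0, 1, 1]) evaluates to 1, since the prefix of 1s ends at the first 0.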

We state in Table LABEL:tab:runtimes the asymptotic run times of a few algorithms on these benchmark functions. We note that (i) the black-box complexity of OneMax is Θ(n/log n), see [11, 3], and (ii) the black-box complexity of LeadingOnes is Θ(n log log n), see [2]; however, all black-box algorithms witnessing these run times are highly artificial. Consequently, O(n log n) appears to be the best run time to aim for on these two benchmark problems.

Since random bit strings with independently sampled entries occur frequently in this work, we shall regularly use the following well-known variance-based additive Chernoff bounds (see, e.g., the respective Chernoff bound in [5]).

###### Theorem 1 (Variance-based Additive Chernoff Bounds).

Let X_1, …, X_k be independent random variables such that, for all i ∈ [k], |X_i − E[X_i]| ≤ 1. Further, let X = ∑_{i∈[k]} X_i and σ² = Var[X]. Then, for all λ ≥ 0, abbreviating m = min{λ²/σ², λ},

 Pr[X ≥ E[X] + λ] ≤ e^{−m/3} and Pr[X ≤ E[X] − λ] ≤ e^{−m/3} .

Further, we say that an event E occurs with high probability if there is a c > 0 such that Pr[E] = 1 − O(n^{−c}).

Last, we use the ∘ operator to denote string concatenation. For a bit string H ∈ {0, 1}*, let |H| denote its length, ‖H‖₁ its number of 1s, ‖H‖₀ its number of 0s, and, for a k ∈ ℕ, let H[k] denote the last min{k, |H|} bits of H. In addition to that, ϵ denotes the empty string.

## 3 The Significance-based Compact Genetic Algorithm

**Algorithm 3:** The sig-cGA with parameter ε and significance function sig (eq. (3)), optimizing f

    for i ∈ [n] do τ_i ← 1/2 and H_i ← ϵ;
    repeat
        x, y ← two offspring sampled with respect to (τ_i)_{i∈[n]};
        z ← winner of x and y with respect to f (ties broken uniformly at random);
        for i ∈ [n] do
            H_i ← H_i ∘ z_i;
            if sig(τ_i, H_i) = up then τ_i ← 1 − 1/n;
            else if sig(τ_i, H_i) = down then τ_i ← 1/n;
            else τ_i remains unchanged;
            if τ_i changed then H_i ← ϵ;
    until termination criterion met;
Before presenting our algorithm sig-cGA in detail in Section 3.1, we provide more information about the compact genetic algorithm (cGA [14]), which the sig-cGA as well as the scGA are based on.

The cGA is an estimation-of-distribution algorithm (EDA [23]). That is, it optimizes a fitness function by evolving a stochastic model of the search space {0, 1}^n. The cGA assumes independence of the bits in the search space, which makes it a univariate EDA. As such, it keeps a vector (τ_i)_{i∈[n]} of probabilities, often called the frequency vector. In each iteration, two individuals (offspring) are sampled in the following way with respect to the frequency vector: for an individual x, we have x_i = 1 with probability τ_i and x_i = 0 with probability 1 − τ_i, independently of any x_j with j ≠ i. Thus, the stochastic model of the cGA is a Poisson-binomial distribution.

After sampling, the frequency vector is updated with respect to a fitness-based ranking of the offspring. The process of choosing how the offspring are ranked is called selection. Let x and y denote both offspring of the cGA during an iteration. Given a fitness function f, we rank x above y if f(x) > f(y) (as we maximize), and we rank y above x if f(y) > f(x). If f(x) = f(y), we rank them randomly. The higher-ranked individual is called the winner, the other individual the loser. Assume that x is the winner. The cGA then changes each frequency τ_i with respect to the difference x_i − y_i by a value of 1/K (where K is usually referred to as the population size). Hence, no update is performed if the bit values are identical, and otherwise the frequency is moved by 1/K toward the bit value of the winner. In order to prevent a frequency from getting stuck at 0 or 1,2 the cGA caps its frequencies to the range [1/n, 1 − 1/n], as is common practice. This way, a frequency can get close to 0 or 1, but it is always possible to sample 1s and 0s.

Consider a position i and any two individuals x and y that are identical except for position i. Assume that x_i = 1 and y_i = 0. If the probability that x is the winner of the selection is higher than the probability that y is the winner, we speak of a bias in selection (for 1s) at position i.
Analogously, we speak of a bias for 0s if the probability that y wins is higher than the probability that x wins. Usually, a fitness function introduces a bias into the selection and thus into the update.
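The sampling-and-update step of the cGA described above can be sketched as follows in Python. This is a minimal illustration, not the implementation analyzed in this paper; the function name and signature are ours:

```python
import random

def cga_iteration(freq, f, K):
    """One cGA iteration (sketch): sample two offspring from the frequency
    vector `freq`, rank them by the fitness function `f` (ties broken
    uniformly at random), and move each frequency by 1/K toward the winner's
    bit wherever the two offspring disagree. Frequencies are capped to the
    range [1/n, 1 - 1/n]."""
    n = len(freq)
    x = [1 if random.random() < p else 0 for p in freq]
    y = [1 if random.random() < p else 0 for p in freq]
    if f(x) > f(y):
        winner, loser = x, y
    elif f(y) > f(x):
        winner, loser = y, x
    else:  # equal fitness: rank the two offspring randomly
        winner, loser = random.choice([(x, y), (y, x)])
    for i in range(n):
        if winner[i] != loser[i]:
            step = 1.0 / K if winner[i] == 1 else -1.0 / K
            freq[i] = min(1 - 1.0 / n, max(1.0 / n, freq[i] + step))
    return freq
```

Note that a frequency only moves when the two offspring disagree at that position, which is exactly the source of the update noise discussed in the introduction: even without any bias in selection, disagreeing positions trigger a random ±1/K step.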

### 3.1 Detailed Description of the sig-cGA

Our new algorithm – the significance-based compact genetic algorithm (sig-cGA; Alg. 3) – also samples two offspring each iteration. However, in contrast to the cGA, it keeps a history of bit values for each position and only performs an update when a statistical significance within a history occurs. This approach aligns far better with the intuitive reasoning that an update should only be performed if there is valid evidence for a different frequency being better suited for sampling good individuals.

In more detail, for each bit position i ∈ [n], the sig-cGA keeps a history H_i ∈ {0, 1}* of all the bits sampled by the winner of each iteration since the last time τ_i changed – the last bit denoting the latest entry. Observe that if there is no bias in selection at position i, the bits saved in H_i follow a binomial distribution with a success probability of τ_i and |H_i| trials. We call this our hypothesis. Now, if we happen to find a sequence (starting from the latest entry) in H_i that significantly deviates from the hypothesis, we update τ_i with respect to the bit value that occurred significantly, and we reset the history. We only use the following three frequency values:

• 1/2: starting value;

• 1 − 1/n: significance for 1s was detected;

• 1/n: significance for 0s was detected.

We formalize significance by defining the following threshold, for all ε, μ > 0, where μ is the expected value of our hypothesis and ε is an algorithm-specific parameter:

 s(ε, μ) = ε · max{√(μ ln n), ln n} .

We say, for an ε > 0, that a binomially distributed random variable X deviates significantly from a hypothesis Y ∼ Bin(k, p), where k ∈ ℕ and p ∈ [0, 1], if there exists a c > 0 such that

 Pr[|X − E[Y]| ≤ s(ε, E[Y])] ≤ n^{−c} .

We now state our significance function sig, which scans a history for a significance. However, it does not scan the entire history but multiple subsequences of a history (always starting from the latest entry). This is done in order to quickly notice a change from an insignificant history to a significant one. Further, we only check subsequences whose lengths are powers of 2, as this is faster than checking each subsequence, and we can be off from any length of a subsequence by a constant factor of at most 2. More formally, for all τ ∈ {1/n, 1/2, 1 − 1/n} and all H ∈ {0, 1}*, we define sig as follows, with ε being a parameter of the sig-cGA. Recall that H[k] denotes the last k bits of H.

 sig(1/2, H) = up, if ∃ m ∈ ℕ: ‖H[2^m]‖₁ ≥ 2^m/2 + s(ε, 2^m/2);
 sig(1/2, H) = down, if ∃ m ∈ ℕ: ‖H[2^m]‖₀ ≥ 2^m/2 + s(ε, 2^m/2); stay, otherwise.
 sig(1 − 1/n, H) = down, if ∃ m ∈ ℕ: ‖H[2^m]‖₀ ≥ 2^m/n + s(ε, 2^m/n); stay, otherwise.
 sig(1/n, H) = up, if ∃ m ∈ ℕ: ‖H[2^m]‖₁ ≥ 2^m/n + s(ε, 2^m/n); stay, otherwise. (3)

We stop at the first (minimum) length 2^m that yields a significance. Thus, we check a history H in each iteration at most 2⌈log₂ |H|⌉ times.
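The threshold s and the significance check of eq. (3) can be sketched directly in Python. This is an uncondensed illustration (function names and the explicit n parameter are ours); the efficient implementation discussed in Section 3.2 avoids rescanning the raw history:

```python
import math

def s(eps, mu, n):
    """Significance threshold s(eps, mu) = eps * max{sqrt(mu * ln n), ln n}."""
    return eps * max(math.sqrt(mu * math.log(n)), math.log(n))

def sig(tau, history, eps, n):
    """Scan the suffixes of `history` (latest entries last) whose lengths are
    powers of two, stopping at the first length showing a significant surplus,
    as in eq. (3). Returns 'up', 'down', or 'stay'."""
    m = 1
    while m <= len(history):
        suffix = history[-m:]
        ones = sum(suffix)
        zeros = m - ones
        if tau == 0.5:
            if ones >= m / 2 + s(eps, m / 2, n):
                return 'up'
            if zeros >= m / 2 + s(eps, m / 2, n):
                return 'down'
        elif tau > 0.5:  # tau = 1 - 1/n: look for significantly many 0s
            if zeros >= m / n + s(eps, m / n, n):
                return 'down'
        else:            # tau = 1/n: look for significantly many 1s
            if ones >= m / n + s(eps, m / n, n):
                return 'up'
        m *= 2
    return 'stay'
```

For instance, with n = 100 and eps = 1, an all-1s history of length 64 at frequency 1/2 already triggers 'up' (at suffix length 16), whereas a perfectly alternating history never deviates from the hypothesis and yields 'stay'.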

We now prove that the probability of detecting a significance at a position when there is no bias in selection (i.e., a false significance) is small. We use this lemma in our proofs in order to argue that no false significances are detected with high probability.

###### Lemma 1.

For the sig-cGA (Alg. 3), let ε ≥ 1. Consider a position i ∈ [n] of the sig-cGA and an iteration such that the distribution of 1s of H_i follows a binomial distribution with |H_i| trials and success probability τ_i, i.e., there is no bias in selection at position i. Then the probability that τ_i changes in this iteration is at most 2⌈log₂ |H_i|⌉ · n^{−ε/3}.

###### Proof.

In order for τ_i to change, the number of 1s or 0s in H_i needs to deviate significantly from the hypothesis, which follows the same distribution as the actual history by assumption. We are going to use Theorem 1 in order to show that, in such a scenario, the respective number of bits will deviate significantly from its expected value only with a probability of at most n^{−ε/3}, for any number of trials at most |H_i|.

Let k′ ≤ |H_i| denote the length of a sub-history under consideration. Note that, in order for τ_i to change, a significance of values sampled with probability τ′_i needs to be sampled, where τ′_i = 1/2 if τ_i = 1/2, and τ′_i = 1/n otherwise. That is, for τ_i = 1/2, either a significant amount of 1s or 0s needs to occur; for τ_i = 1 − 1/n, a significant amount of 0s needs to occur; and, for τ_i = 1/n, a significant amount of 1s needs to occur. Further, let X′ denote the number of values we are looking for a significance of within the k′ trials. That is, if τ_i = 1/2, X′ is either the number of 1s or of 0s; if τ_i = 1 − 1/n, X′ is the number of 0s; and if τ_i = 1/n, X′ is the number of 1s.

Given the definition of X′, we see that E[X′] = k′τ′_i and Var[X′] = k′τ′_i(1 − τ′_i) ≤ k′τ′_i. Since we want to apply Theorem 1, let λ = s(ε, k′τ′_i) and m = min{λ²/Var[X′], λ}.

First, consider the case that √(k′τ′_i ln n) ≥ ln n, i.e., that s(ε, k′τ′_i) = ε√(k′τ′_i ln n), which is equivalent to k′τ′_i ≥ ln n. Note that λ²/Var[X′] ≥ ε² ln n ≥ ε ln n, as ε ≥ 1, and that λ = ε√(k′τ′_i ln n) ≥ ε ln n, as k′τ′_i ≥ ln n. Thus, m ≥ ε ln n.

Now consider the case that √(k′τ′_i ln n) < ln n, i.e., that s(ε, k′τ′_i) = ε ln n, which is equivalent to k′τ′_i < ln n. We see that λ = ε ln n and λ²/Var[X′] ≥ ε²(ln n)²/(k′τ′_i) > ε² ln n. Hence, as before, we get m ≥ ε ln n.

Combining both cases and applying Theorem 1, we get

 Pr[X′ ≥ k′τ′_i + s(ε, k′τ′_i)] = Pr[X′ ≥ E[X′] + λ] ≤ e^{−m/3} ≤ e^{−(ε/3)·ln n} = n^{−ε/3} .

That is, the probability of detecting a (false) significance during k′ trials is at most n^{−ε/3}. Since we look for a significance a total of at most 2⌈log₂ |H_i|⌉ times during an iteration, we get by a union bound that the probability of detecting a significance within a history of length |H_i| is at most 2⌈log₂ |H_i|⌉ · n^{−ε/3}. ∎

Lemma 1 bounds the probability of detecting a false significance within a single iteration if there is no bias in selection. The following corollary trivially bounds the probability of detecting a false significance within any number of iterations.

###### Corollary 1.

Consider the sig-cGA (Alg. 3) with ε ≥ 1, running for t iterations such that, during each iteration, for each i ∈ [n], a 1 is added to H_i with probability τ_i. Then the probability that at least one frequency changes during these t iterations is at most 2tn⌈log₂ t⌉ · n^{−ε/3}.

###### Proof.

For any i ∈ [n] during any of the t iterations, by Lemma 1, the probability that τ_i changes is at most 2⌈log₂ |H_i|⌉ · n^{−ε/3} ≤ 2⌈log₂ t⌉ · n^{−ε/3}, as a history gains at most one bit per iteration. Via a union bound over all t iterations and all n frequencies, the statement follows. ∎

### 3.2 Efficient Implementation of the sig/̄cGA

In order to reduce the number of operations performed (computational cost) by the sig-cGA, we only check significance in historic data whose length is a power of 2. By saving the whole history but precomputing the number of 1s in the power-of-two intervals, a significance check can be done in time logarithmic in the history length; the necessary updates of this data structure can be done in logarithmic time (per bit position) as well. With this implementation, one iteration of the main loop of the sig-cGA has a computational cost of O(n) times a factor logarithmic in the history length. Since the histories are never longer than the run time (number of fitness evaluations; twice the number of iterations), we see that the computational cost of a run is at most O(Tn log T) when the run time is T. Since for most EAs working on bit string representations of length n the computational cost is larger than the run time by at least a factor of n, we see that our significance approach is not overly costly in terms of computational cost.

What appears unfavorable, though, is the memory usage caused by storing the full history. For this reason, we now sketch a way to condense the history so that it only uses space logarithmic in the length of the full history. This approach will not allow to access exactly the number of 1s (or 0s) in all power-of-two length histories. It will allow, however, for each k, to access the number of 1s in some interval of length ℓ(k) with ℓ(k) ∈ [k, 2k]. For reasons of readability, we shall in the subsequent analyses nevertheless regard the original sig-cGA, but it is quite clear that the mildly different accessibility of the history in the now-proposed condensed implementation will not change the asymptotic run times shown in this work.

For our condensed storage of the history, we keep a list of blocks, each storing the number of 1s in some discrete interval of length equal to a power of two (including 2⁰ = 1). When a new item has to be stored, we append a block of size 1 to the list. Then, traversing the list in backward direction, we check if there are three consecutive blocks of the same size, and if so, we merge the two earliest ones into a new block of twice the size. By this, we always maintain a list of blocks such that, for a certain power P, there are between one and two blocks of length 2^m for each m up to P. This structural property implies both that we only have a logarithmic number of blocks (as we have at most 2(P + 1) of them) and that we can (in amortized constant time) access all historic intervals consisting of full blocks, which in particular implies that we can access an interval with length ℓ(k) ∈ [k, 2k] for each k up to the history length.
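The block-merging scheme just described can be sketched as follows; this is a minimal Python illustration under the stated merging rule (class and method names are ours, and the sketch rescans the list rather than implementing the amortized-constant-time traversal):

```python
class CondensedHistory:
    """Space-efficient history sketch: a list of (length, number_of_1s)
    blocks, lengths being powers of two, oldest block first. Whenever three
    consecutive blocks of equal length exist, the two earliest of them are
    merged into one block of twice the length."""

    def __init__(self):
        self.blocks = []

    def append(self, bit):
        self.blocks.append((1, bit))
        merged = True
        while merged:
            merged = False
            # traverse backward and merge the two earliest blocks of any
            # triple of equal-length consecutive blocks
            for i in range(len(self.blocks) - 1, 1, -1):
                if self.blocks[i][0] == self.blocks[i - 1][0] == self.blocks[i - 2][0]:
                    (length, a), (_, b) = self.blocks[i - 2], self.blocks[i - 1]
                    self.blocks[i - 2:i] = [(2 * length, a + b)]
                    merged = True
                    break

    def total_length(self):
        return sum(length for length, _ in self.blocks)
```

After any number of appends, the list contains at most two blocks per power-of-two length, so the space usage is logarithmic in the history length while the exact total count of 1s is preserved.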

### 3.3 Run Time Results for LeadingOnes and OneMax

We now prove our main results, that is, upper bounds of O(n log n) for the expected run time of the sig-cGA on LeadingOnes and OneMax. Note that the sig-cGA samples two offspring each iteration. Thus, up to a constant factor of 2, the expected run time is equal to the expected number of iterations until an optimum is sampled. In our proofs, we only consider the number of iterations.

We mention briefly that the sig-cGA is unbiased in the sense of Lehre and Witt [18], that is, it treats bit values and bit positions in a symmetric fashion. Consequently, all of our results hold not only for OneMax and LeadingOnes as defined in eqs. (1) and (2) but also for any similar function where an x_i may be changed to 1 − x_i or swapped with an x_j (with j ≠ i), as the sig-cGA has no bias for 1s or 0s, nor does it prefer certain positions over other positions. (In fact, it treats all positions exactly the same.)

In our proofs, we use the following lemma to bound probabilities split up by the law of total probability.

###### Lemma 2.

Let α, β, x, y ∈ ℝ such that α ≥ β and x ≥ y. Then

 αx + (1 − α)y ≥ βx + (1 − β)y .

###### Theorem 2.

Consider the sig-cGA (Alg. 3) with ε being a sufficiently large constant. Its run time on LeadingOnes is O(n log n) with high probability and in expectation.

###### Proof.

We split this proof into two parts: we first show that the run time bound of O(n log n) holds with high probability, and then we prove the bound on the expected run time.

Run time with high probability. For the first part of the proof, we consider the first t ∈ Θ(n log n) iterations of the sig-cGA and condition on the event that no frequency decreases during this time, i.e., that no (false) significance of 0s is detected. Note that, for any position i, the probability of saving a 1 in H_i is at least τ_i, as the selection with respect to LeadingOnes has a bias for 1s. Thus, by Corollary 1, the probability that at least one frequency decreases during these t iterations is at most 2tn⌈log₂ t⌉ · n^{−ε/3}, which is, as ε is a sufficiently large constant, in O(n^{−ε′}) for an ε′ > 0. Thus, with high probability, no frequency decreases during the first t iterations.

The main idea now is to show that the leftmost frequency that is different from 1 − 1/n has a surplus of 1s in its history strong enough so that, after a logarithmic number of iterations, we change such a frequency from its initial value of 1/2 to 1 − 1/n. For the second part of the proof, we will use a similar argument, but the frequency will be at 1/n, and it will take O(n log n) iterations to get to 1 − 1/n. Since the calculations for both scenarios are very similar, we combine them in the following.

In order to make this idea precise, we now consider an iteration such that there is a frequency τ_i ≠ 1 − 1/n such that, for all j < i, τ_j = 1 − 1/n. We lower-bound the probability of saving a 1 in H_i in order to get an upper bound on the expected time until we detect the significance necessary to update τ_i to 1 − 1/n. When considering position i, we assume an empty history although it is most likely not empty. We can do so, since the sig-cGA checks for a significance in different sub-histories of H_i (starting from the latest entry). Thus, we only consider sub-histories that reach back at most to the point in time when all frequencies at indices less than i were at 1 − 1/n.

Let O denote the event that we save a 1 this iteration, and let A denote the event that at least one of the two offspring during this iteration has a 0 at a position in [i − 1]. Note that event A means that the bit at position i of the winning individual is not relevant for selection. Hence, if A occurs, we save a 1 with probability τ_i. Otherwise, that is, if the bit at position i is relevant for selection, we save a 1 with probability 1 − (1 − τ_i)² (i.e., if we do not sample two 0s). Formally,

 Pr[O] = Pr[A]·τ_i + Pr[Ā]·(1 − (1 − τ_i)²) ,

which is a convex combination of τ_i and 1 − (1 − τ_i)². Thus, according to Lemma 2, we get a lower bound on Pr[O] if we decrease the factor of the larger term, namely Pr[Ā]. The event Ā occurs if and only if both offspring have only 1s at the positions 1 through i − 1:

 Pr[Ā] = ∏_{j∈[i−1]} τ_j² = (1 − 1/n)^{2(i−1)} ,

as we assumed that all frequencies at indices less than i are already at 1 − 1/n. Note that this term is minimal for i = n. Thus, we get Pr[Ā] ≥ (1 − 1/n)^{2(n−1)} ≥ e^{−2} by using the well-known inequality (1 − 1/n)^{n−1} ≥ e^{−1}. Overall, we get, noting that 1 − (1 − τ_i)² − τ_i = τ_i(1 − τ_i) and that 1 − τ_i ≥ 1/2 for τ_i ∈ {1/n, 1/2},

 Pr[O] ≥ (1 − e^{−2})·τ_i + e^{−2}·(1 − (1 − τ_i)²) = τ_i + e^{−2}·τ_i(1 − τ_i) ≥ (1 + e^{−2}/2)·τ_i .

Let X denote a random variable that is stochastically dominated by the real process of saving 1s at position i, namely, let X follow a binomial distribution with k trials and success probability (1 + e^{−2}/2)τ_i. In order to get a bound on the number of iterations that we need for detecting a significance of 1s, we bound the probability of a significance not occurring in a history of length k, i.e., the probability that we save fewer than kτ_i + s(ε, kτ_i) 1s:

 Pr[X ≤ kτ_i + s(ε, kτ_i)] = Pr[X ≤ E[X] − (k e^{−2}τ_i/2 − s(ε, kτ_i))] ,

where the minuend k e^{−2}τ_i/2 − s(ε, kτ_i) is positive if k e^{−2}τ_i/2 > s(ε, kτ_i), which is the case for k ≥ 16e⁴ε²(ln n)/τ_i, since we assume that ε is a constant. Let λ = k e^{−2}τ_i/4. For k ≥ 16e⁴ε²(ln n)/τ_i, we get that k e^{−2}τ_i/2 − s(ε, kτ_i) ≥ λ. By applying Theorem 1 for such a k and noting that λ ≤ Var[X] and, thus, m = λ²/Var[X], we get, using Var[X] ≤ kτ_i,

 Pr[X ≤ E[X] − λ] ≤ e^{−(1/3)·λ²/Var[X]} ≤ e^{−(1/3)·(k² e^{−4} τ_i²/16)/(kτ_i)} = e^{−(1/3)·k e^{−4} τ_i/16} ≤ n^{−ε²/3} .

Thus, with probability at least 1 − n^{−ε²/3}, the frequency τ_i will be set to 1 − 1/n after O((log n)/τ_i) iterations. Further, via a union bound over all n frequencies, the probability of any such frequency not being updated to 1 − 1/n after that many iterations is at most n^{1−ε²/3}, which is in O(n^{−c}) for a c > 0, as ε is a sufficiently large constant. Hence, with high probability, all frequencies will be set to 1 − 1/n.

For the first part of this proof, that is, assuming that no frequency is at 1/n, taking together the results of all frequencies being updated to 1 − 1/n, each in time O(log n), and of no frequency at 1/2 or 1 − 1/n decreasing, all with high probability, yields that all frequencies are at 1 − 1/n within O(n log n) iterations. Then the optimum is sampled with probability at least (1 − 1/n)^n = Ω(1), i.e., with constant probability. Hence, we have to wait additional O(log n) iterations in order to sample the optimum with high probability.

Expected run time. For the second part of this proof, that is, for the expected run time, we are left to bound the expected time in the case that a frequency decreases during the initial Θ(n log n) iterations, which only happens with a probability of O(n^{−ε′}) for an ε′ > 0, as we discussed at the beginning of the first part. Due to Corollary 1, for any polynomial number of iterations, no frequency decreases during an interval of length Θ(n² log n) with high probability, provided that ε is a sufficiently large constant.

By using the result calculated in the first part, we see that a leftmost frequency τ_i at 1/n is increased to 1 − 1/n within O(n log n) iterations with high probability. Thus, overall, the sig-cGA finds the optimum during an interval of length t′ = Θ(n² log n) with high probability, as at most n frequencies need to be increased to 1 − 1/n. We pessimistically assume that the optimum is only found with a probability of at least 1/2 during t′ iterations. Hence, the expected run time in this case is O(t′) = O(n² log n).

Last, we assume that we did not find the optimum during Θ(n⁴ log n) iterations, which, by the preceding discussion, only happens with a probability of at most 2^{−Ω(n²)}. Then the expected run time is at most O(n^n), by pessimistically assuming that all frequencies are at 1/n.

Combining all of the three different regimes we just discussed, we see that we can upper bound the expected run time by

 O(n log n) + O(n^{−ε′})·O(n² log n) + 2^{−Ω(n²)}·n^n = O(n log n) ,

which concludes the proof. ∎

For our next result, we make use of the following lemma based on a well-known estimate of binomial coefficients close to the center. A proof was given by, e.g., Doerr and Winzen [9]. We use it to show how likely it is that two individuals sampled from the sig-cGA have the same OneMax value.

###### Lemma 3.

For k ∈ ℕ and a constant c > 0, let X ∼ Bin(k, 1/2) and let t ∈ ℕ with |t − k/2| ≤ c√k. Then

 Pr[X = t] = Θ(1/√k) ,

where the implicit constants depend on c.

The next theorem shows that the sig-cGA is also able to optimize OneMax within the same asymptotic time as many other EAs.

###### Theorem 3.

Consider the sig-cGA (Alg. 3) with ε being a sufficiently large constant. Its run time on OneMax is O(n log n) with high probability and in expectation.

###### Proof.

We first show that the run time bound holds with high probability. Then we prove the bound on the expected run time.

Run time with high probability. We consider the first Θ(n log n) iterations and condition on the event that no frequency decreases during that time. This can be argued in the same way as at the beginning of the proof of Theorem 2.

The main idea now is to show that, for any frequency at 1/2, O(n log n) iterations are enough in order to detect a significance of 1s. This happens in parallel for all frequencies. For our argument to hold, it is only important that all the other frequencies are at 1/2 or 1 − 1/n, which we condition on.

Similar to the proof of Theorem 2, when proving the expected run time, we will use that, if all frequencies start at 1/n, they are set to 1 − 1/n with high probability within O(n log n) iterations in parallel. Thus, we combine both cases in the following argumentation.

Let τ denote the starting value of a frequency, i.e., τ = 1/2 for the first part and τ = 1/n for the second. Formally, during any of the iterations, let ℓ denote the number of frequencies at 1/2. Then the other n − ℓ frequencies are at 1/n or 1 − 1/n. Further, consider a position i with τ_i = τ. We show that such a position will sample 1s significantly more often than the hypothesis, by a factor of 1 + Ω(1/√ℓ). Then τ_i will be updated to 1 − 1/n within O(ℓ(log n)/τ_i) iterations.

In order to show that 1s are significantly more often saved than assumed, we proceed as follows: we consider that all bits but bit i of both offspring during any iteration have been sampled. If the numbers of 1s of both offspring differ by more than one, bit i cannot change the outcome of the selection process – bit i of the winner will be 1 with probability τ_i. However, if the numbers of 1s differ by at most one, then the outcome of bit i in both offspring has an influence on whether a 1 is saved or not, i.e., this introduces a bias toward saving a significant amount of 1s.

Let O denote the event to save a 1 at position i this iteration, and let A denote the event that the numbers of 1s (excluding position i) of both offspring differ by at least two during that iteration. Then the probability to save a 1, conditioned on A, is τ_i.

In the case of Ā, we make a case distinction with respect to the absolute difference of the numbers of 1s of both offspring, excluding position i. If the difference is zero, then a 1 will be saved if not both offspring sample a 0 at position i, which happens with probability 1 − (1 − τ_i)². If the absolute difference is one, then a 1 will be saved if the winner (with respect to all bits but bit i) samples a 1 (with probability τ_i) or if it samples a 0, the loser samples a 1, and the loser is chosen during selection, which happens with probability τ_i(1 − τ_i)/2. Overall, the probability that a 1 is saved is at least (5/4)τ_i in the case of Ā, as (5/4)τ_i is at most either of the two case probabilities for τ_i ≤ 1/2.

Combining both cases, we see that

 Pr[O] ≥ Pr[A]·τ_i + Pr[Ā]·(5/4)τ_i ,

which we lower-bound by determining a lower bound for Pr[Ā], according to Lemma 2.

With respect to Pr[Ā], we first note that the probability that the frequencies at 1/n will all sample a 0 for both offspring is Ω(1), as there are at most n such frequencies, each sampling a 0 with probability 1 − 1/n. Similarly, all frequencies at 1 − 1/n (except possibly τ_i) will sample a 1 for both offspring with a probability of Ω(1), too.

Now we only consider the difference of the numbers of 1s sampled at the (at most ℓ) positions with frequencies at 1/2, i.e., at all remaining positions but i that we did not consider so far. Since all of these frequencies are at 1/2, the expected number of 1s sampled at them is half their number. Due to Theorem 1 (or, alternatively, Chebyshev's inequality), the probability of deviating from this value by more than c√ℓ, for a constant c > 0, is at most a constant less than 1. Conditional on sampling a number of 1s in the range of this expected value ± c√ℓ, the probability to sample any specific number of 1s in this range is, due to Lemma 3, Θ(1/√ℓ), since all these frequencies are at 1/2. Thus, by the law of total probability, the probability that both offspring have the same number of 1s or differ only by one, i.e., Pr[Ā], is, for a constant c′ > 0, at least c′/√ℓ. Hence, we get, for a sufficiently small constant d′ > 0, factoring in the probability of Ω(1) of the number of 1s being concentrated around its expected value and of the remaining positions only sampling their majority value,

 Pr[O] ≥ τ_i + (1/4)·τ_i·Pr[Ā] ≥ τ_i·(1 + d′/√ℓ) .

This means that the sig-cGA expects 1s to occur with probability τ_i, but they occur with a probability of at least τ_i(1 + d′/√ℓ). Note that, for the case ℓ = 0, i.e., τ_i = 1/n, we have Pr[Ā] = Ω(1) (e.g., conditional on the remaining positions only sampling their majority value), and hence Pr[O] ≥ τ_i(1 + d′). Thus, we use τ_i(1 + d′/√ℓ), reading ℓ as max{ℓ, 1}, as a lower bound for Pr[O] in all cases, for an appropriately chosen constant d′ > 0.

Analogous to the proof of Theorem 2, let X denote a random variable that is stochastically dominated by the real process of saving 1s at position i, namely, let X follow a binomial distribution with k trials and success probability τ_i(1 + d′/√ℓ). We bound the probability of not detecting a significance of 1s after k iterations, i.e., the probability that we save fewer than kτ_i + s(ε, kτ_i) 1s:

 Pr[X ≤ kτ_i + s(ε, kτ_i)] = Pr[X ≤ E[X] − (kτ_i d′/√ℓ − s(ε, kτ_i))] .

Let k = 16ε²ℓ(ln n)/(d′²τ_i) and λ = k d′τ_i/(2√ℓ). Then kτ_i d′/√ℓ − s(ε, kτ_i) ≥ λ. By noting that λ ≤ Var[X] for d′ sufficiently small and, thus, m = λ²/Var[X], we get, by applying Theorem 1 and using Var[X] ≤ kτ_i,

 Pr[X ≤ E[X] − λ] ≤ e^{−(1/3)·λ²/Var[X]} ≤ e^{−(1/3)·(k²d′²τ_i²/(4ℓ))/(kτ_i)} = e^{−(1/3)·kd′²τ_i/(4ℓ)} = n^{−(4/3)ε²} .

Thus, with a probability of at least 1 − n^{−(4/3)ε²}, frequency τ_i will be set to 1 − 1/n after k = O(ℓ(log n)/τ_i) iterations. Further, via a union bound over all at most n such frequencies, the probability of any such frequency not being updated to 1 − 1/n after that many iterations is at most n^{1−(4/3)ε²}, which is in O(n^{−c}) for a c > 0, as (4/3)ε² > 1 for ε ≥ 1. Hence, with high probability, all these frequencies will be set to 1 − 1/n.

Since our argument for position i was made for an arbitrary i and independently of the other positions, and since all n frequencies start at 1/2, we have to wait at most O(n log n) iterations until all frequencies are set to 1 − 1/n with high probability. Then, with a probability of at least (1 − 1/n)^n ≥ 1/4 (for n ≥ 2), the optimum will be sampled in a single iteration. Hence, after O(log n) additional iterations, the optimum will be sampled with high probability.

Expected run time. The expected run time can be proven similarly as argued in the second part of the proof of Theorem 2. The main difference here is that, assuming all frequencies are at 1/2, with high probability, all frequencies will increase to 1 − 1/n during O(n log n) iterations (in parallel, not sequentially), as we just discussed. Further, no frequency will decrease during an interval of such length with high probability. ∎

Note that although the expected run time of the sig-cGA is asymptotically the same on LeadingOnes and OneMax, the reason is quite different: for LeadingOnes, the sig-cGA sets its frequencies quickly and consecutively to 1 − 1/n, as it only needs O(log n) iterations per frequency in expectation. This is due to the bias for saving 1s being very large (constant, in fact) when all frequencies to the left are at 1 − 1/n, i.e., when it is very likely that bit i is relevant for selection. Friedrich et al. [12] exploit this fact heavily in the analysis (and design) of the scGA, which is why it, too, has an expected run time of O(n log n) on LeadingOnes. However, when not all frequencies to the left of a position are at 1 − 1/n, the bias is almost negligible, as all bits to the left that are sampled with frequencies of at most 1/2 have to sample a 1 in both offspring. Thus, in this case, the probability of the bias taking effect declines exponentially in the number of frequencies to the left that are not at 1 − 1/n.

For OneMax, the situation is different. The bias in selection only gets strong (i.e., increases by a constant additive term) when only a constant number of frequencies is left at 1/2 and has not reached 1 − 1/n. More generally, when ℓ frequencies are still at 1/2, the bias only adds a term of roughly 1/√ℓ. Thus, it takes longer in expectation to detect a significance for a position. However, the bias is constantly there and, even for large ℓ, very large when compared to the bias for LeadingOnes for a position whose frequencies to the left are not all at 1 − 1/n. Hence, for OneMax, the frequencies can be increased in parallel. This is the major difference to LeadingOnes, where the frequencies are increased sequentially.
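The 1/√ℓ scaling of the OneMax bias can be observed in a small experiment. The sketch below is purely our illustration (the helper `winner_bit_bias` is hypothetical, not part of any algorithm discussed): it samples two offspring uniformly at random, as if all n frequencies were at 1/2, and estimates by how much the winner's bit at a fixed position is biased toward 1:

```python
# Monte Carlo estimate of the per-position selection bias on OneMax when all
# n frequencies are at 1/2: sample two uniform offspring, pick the one with
# more 1s as winner (ties broken uniformly at random), and measure how far
# Pr[winner has a 1 at a fixed position] exceeds 1/2.
import random

def winner_bit_bias(n: int, samples: int, seed: int = 42) -> float:
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x = rng.getrandbits(n)
        y = rng.getrandbits(n)
        ones_x, ones_y = bin(x).count("1"), bin(y).count("1")
        if ones_x > ones_y:
            winner = x
        elif ones_y > ones_x:
            winner = y
        else:
            winner = x if rng.random() < 0.5 else y
        hits += winner & 1  # value of the winner at the fixed position
    return hits / samples - 0.5

print(winner_bit_bias(25, 100_000))   # clearly positive
print(winner_bit_bias(400, 100_000))  # smaller: the bias shrinks like 1/sqrt(n)
```

The bias is visible at both sizes but shrinks with n, which is why significance detection takes longer when many frequencies are still at 1/2, yet still works for all positions in parallel.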

## 4 Run Time Analysis for the scGA

The closest competitor to the sig-cGA, in that it also optimizes LeadingOnes in O(n log n) in expectation, is the stable compact genetic algorithm (scGA; Alg. 4), a variant of the cGA [14] that was introduced by Friedrich et al. [12] in order to present an EDA that optimizes LeadingOnes in time O(n log n). It works very similarly to the cGA; however, it introduces a bias in the update that favors frequencies moving toward 1/2. For this purpose, the scGA has, next to the parameter ρ of the cGA, another parameter a, which works in the following way: when a frequency above 1/2 is decreased, it decreases by (1 + a)ρ, not only by ρ as in the case of the cGA. However, a frequency above 1/2 is still only increased by ρ. For a frequency below 1/2, this is done analogously.

Further, the scGA has a third parameter d, which marks the borders for a frequency that are sufficient in order to set it to one of its extreme values, i.e., 0 or 1. If a frequency is greater than or equal to 1 − d, it is updated to 1 and can then never be changed again, as all bits at position i will be 1s. Symmetrically, if a frequency is less than or equal to d, it is updated to 0. Intuitively, the parameter d describes a significance value that is sufficient for the algorithm to fully commit to a bit value.

The intention of the scGA is that each frequency stays around 1/2 as long as there is no strong bias toward either bit value for its respective position. Once the bias is strong enough, the algorithm is willing to fix the bits for that position. While this approach works well when there is a strong, i.e., constant, bias in a position (as in LeadingOnes [12]), it fails when the bias is only weak (as in OneMax; Thm. 4).

{algorithm2e}

[t] The scGA [12] with parameters ρ, a, and d, optimizing f
t ← 0;
for i ∈ {1, …, n} do τ_{i,t} ← 1/2;

repeat
x, y ← offspring sampled with respect to (τ_{i,t})_{i ∈ {1,…,n}};
(x, y) ← (winner, loser) of x and y with respect to f;
for i ∈ {1, …, n} do
if x_i > y_i then
if τ_{i,t} ≥ 1 − d then τ_{i,t+1} ← 1;
else if τ_{i,t} < 1/2 then τ_{i,t+1} ← τ_{i,t} + (1 + a)ρ;
else τ_{i,t+1} ← τ_{i,t} + ρ;

else if x_i < y_i then
if τ_{i,t} ≤ d then τ_{i,t+1} ← 0;
else if τ_{i,t} > 1/2 then τ_{i,t+1} ← τ_{i,t} − (1 + a)ρ;
else τ_{i,t+1} ← τ_{i,t} − ρ;

else τ_{i,t+1} ← τ_{i,t};

t ← t + 1;
until termination criterion met;
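For concreteness, the update rule described above can be sketched in Python as follows. This is a minimal sketch based on our reading of the textual description (with the assumed parameter roles ρ, a, and d); the function and argument names are ours:

```python
# One scGA frequency update, following the textual description: steps toward
# 1/2 are amplified by the factor (1 + a), and a frequency crossing the
# border d (resp. 1 - d) is fixed to 0 (resp. 1) forever.
def update_frequency(tau: float, winner_bit: int, loser_bit: int,
                     rho: float, a: float, d: float) -> float:
    if winner_bit > loser_bit:            # evidence for a 1 at this position
        if tau >= 1 - d:
            return 1.0                    # commit to 1
        return tau + ((1 + a) * rho if tau < 0.5 else rho)
    if winner_bit < loser_bit:            # evidence for a 0
        if tau <= d:
            return 0.0                    # commit to 0
        return tau - ((1 + a) * rho if tau > 0.5 else rho)
    return tau                            # equal bits: no change

# Equally many up- and down-steps pull a frequency above 1/2 back toward 1/2,
# since each down-step is (1 + a) times as large as an up-step.
tau = 0.55
for _ in range(10):
    tau = update_frequency(tau, 1, 0, rho=0.01, a=0.5, d=0.1)
    tau = update_frequency(tau, 0, 1, rho=0.01, a=0.5, d=0.1)
print(round(tau, 3))  # drifted from 0.55 down to 0.5
```

The small demo at the end illustrates the "stability" the scGA aims for: without a consistent bias in selection, a frequency is pulled back to 1/2.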
We prove that the scGA is not able to optimize OneMax as fast as the sig-cGA, as it is not able to detect the comparably small bias of Θ(1/√n) for OneMax when compared to the strong constant bias for LeadingOnes for a frequency whose frequencies to the left are all at 1. Note that the assumptions in Theorem 4 for ρ, a, and d are similar to the ones used by Friedrich et al. [12] in order to prove the expected run time of O(n log n) of the scGA on LeadingOnes. Our assumption for one of the parameters is more restrictive than the corresponding assumption of Friedrich et al. [12]; however, for another parameter, we allow a wider range of values than they do.
###### Theorem 4.

Let be a constant. Consider the scGA (Alg. 4) with , , and with . Its run time on OneMax is in expectation and with high probability for a constant .

Before we prove the theorem, we state two other results that we are going to use in the proof. The first bounds the probability of a randomly sampled bit string having a given number of 1s. We use it in order to bound the probability of both offspring having the same number of 1s. Note that the values 1/6 and 5/6 in the lemma are somewhat arbitrary and can be exchanged for any constants in (0, 1/2] and [1/2, 1), respectively.
###### Lemma 4 ([24]).

Let X denote the sum of ℓ independent Poisson trials with probabilities p_1, …, p_ℓ such that, for all i ∈ {1, …, ℓ}, 1/6 ≤ p_i ≤ 5/6. Then, for all k ∈ {0, …, ℓ},
\[\Pr[X = k] = O\!\left(\frac{1}{\sqrt{\ell}}\right).\]

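The O(1/√ℓ) point-probability bound of the lemma is easy to check numerically in the symmetric case p_i = 1/2 for all i (the sketch below is our own illustration; in this case the largest point probability approaches √(2/(πℓ))):

```python
# For X ~ Bin(l, 1/2), the most likely outcome k = l/2 has probability
# C(l, l/2) / 2^l ~ sqrt(2 / (pi * l)); no single value of X is likelier.
from math import comb, sqrt

def max_point_probability(l: int) -> float:
    # exact big-integer arithmetic, one final division
    return comb(l, l // 2) / 2**l

for l in [100, 400, 1600]:
    m = max_point_probability(l)
    print(l, round(m, 4), round(m * sqrt(l), 4))  # m * sqrt(l) near sqrt(2/pi) ~ 0.798
```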
The next theorem provides an upper bound on the probability of a random process reaching a certain state within a certain time. We use it in order to show that it is unlikely for a frequency of the scGA, when optimizing OneMax, to get close to 1 within a subexponential number of iterations.
###### Theorem 5 (Negative Drift; [21, 22]).

Let (X_t)_{t ≥ 0} be real-valued random variables describing a stochastic process over some state space, with X_0 ≥ b. Suppose there exist an interval [a, b], two constants δ, ε > 0, and, possibly depending on ℓ := b − a, a function r(ℓ) satisfying 1 ≤ r(ℓ) = o(ℓ/log ℓ) such that, for all t ≥ 0, the following two conditions hold:

1. E[X_{t+1} − X_t | X_0, …, X_t; a < X_t < b] ≥ ε, and

2. for all j ≥ 0, Pr[|X_{t+1} − X_t| ≥ j | X_0, …, X_t; a < X_t] ≤ r(ℓ)/(1 + δ)^j.

Then there is a constant c > 0 such that, for T := min{t ≥ 0 : X_t ≤ a}, it holds that
\[\Pr\!\left[T \le 2^{c\ell/r(\ell)}\right] = 2^{-\Omega(\ell/r(\ell))}.\]

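The statement can be illustrated with a toy process (purely our illustration, unrelated to the scGA's exact dynamics): a random walk with constant drift away from the target a needs exponentially many steps to reach it, so even a large number of iterations leaves it far above a:

```python
# Toy negative-drift process: a walk that moves up with probability 0.6 and
# down with probability 0.4, started at b = 50. Reaching a = 0 requires a run
# against the drift whose probability decays exponentially in b - a, so even
# 10^5 steps leave the walk well above 0.
import random

def min_value_reached(start: int, steps: int, p_up: float, seed: int) -> int:
    rng = random.Random(seed)
    x = low = start
    for _ in range(steps):
        x += 1 if rng.random() < p_up else -1
        low = min(low, x)
    return low

print(min_value_reached(start=50, steps=100_000, p_up=0.6, seed=1))
```

With overwhelming probability the minimum stays close to the starting point, matching the 2^{−Ω(ℓ/r(ℓ))} bound of the theorem.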
We can now prove our result.
###### Proof of Thm. 4.

We only show that the claimed lower bound on the run time holds with high probability. The statement for the expected run time follows by lower-bounding the terms that occur with a probability of o(1) with 0.

We first prove the lower bound for the run time. We do so by showing that each frequency will stay in the non-empty interval [1/6, 5/6] with high probability. Although OneMax introduces a bias into updating a frequency, it is too small to compensate the strong drift toward 1/2 in the update.

We lower-bound the expected time it takes the scGA to optimize OneMax by upper-bounding the probability that a single frequency leaves the interval [1/6, 5/6] within a given number of iterations. Thus, we condition during the entire proof implicitly on the event that all frequencies are in the interval [1/6, 5/6]. Note that, in this scenario, the probability to sample the optimum during an iteration is at most 2 · (5/6)^n, which is exponentially small, even for a polynomial number of iterations.

Consider an index i with τ_i = 1/2. We only upper-bound the probability that τ_i reaches 5/6 within a given number of iterations. Note that the probability of reaching 1/6 is at most that large, as OneMax introduces a bias for 1s into the selection process. Hence, we could argue optimistically for reaching 1/6 as we do for reaching 5/6 by swapping 0s for 1s and considering 1 − τ_i instead.

Let T denote the first point in time t such that τ_{i,t} ≥ 5/6. We want to apply Theorem 5 and show that it is unlikely for τ_i to reach 5/6 within an exponential number of iterations. Hence, we define the following potential function g:

\[g(\tau_i) = \frac{1}{\rho}\,(1 - \tau_i)\,,\]

which we will use for our frequencies. Note that, at the beginning, τ_i is at 1/2, i.e.,