Attribute-Efficient Evolvability of Linear Functions
Abstract
In a seminal paper, Valiant (2006) introduced a computational model for evolution to address the question of complexity that can arise through Darwinian mechanisms. Valiant views evolution as a restricted form of computational learning, where the goal is to evolve a hypothesis that is close to the ideal function. Feldman (2008) showed that (correlational) statistical query learning algorithms could be framed as evolutionary mechanisms in Valiant’s model. P. Valiant (2012) considered evolvability of real-valued functions and also showed that weak-optimization algorithms that use weak-evaluation oracles could be converted to evolutionary mechanisms.
In this work, we focus on the complexity of representations of evolutionary mechanisms. In general, the reductions of Feldman and P. Valiant may result in intermediate representations that are arbitrarily complex (polynomial-sized circuits). We argue that biological constraints often dictate that the representations have low complexity, such as constant depth and fan-in circuits. We give mechanisms for evolving sparse linear functions under a large class of smooth distributions. These evolutionary algorithms are attribute-efficient in the sense that the size of the representations and the number of generations required depend only on the sparsity of the target function and the accuracy parameter, but have no dependence on the total number of attributes.
1 Introduction
Darwin’s theory of evolution through natural selection has been a cornerstone of biology for over a century and a half. Yet, a quantitative theory of complexity that could arise through Darwinian mechanisms has remained virtually unexplored. To address this question, Valiant introduced a computational model of evolution [28]. In his model, an organism is an entity that computes a function of its environment. There is a (possibly hypothetical) ideal function indicating the best behavior in every possible environment. The performance of the organism is measured by how close the function it computes is to the ideal. An organism produces a set of offspring that may have mutations that alter the function computed. The performance (fitness) measure acting on a population of mutants forms the basis of natural selection. The resources allowed are the most generous while remaining feasible: the mutation mechanism may be any efficient randomized Turing machine, and the function represented by the organism may be arbitrary as long as it is computable by an efficient Turing machine.
Formulated this way, the question of evolvability can be asked in the language
of computational learning theory. For what classes of ideal functions, $C$, can
one expect to find an evolutionary mechanism that gets arbitrarily close to the
ideal, within feasible computational resources? Darwinian selection is
restrictive in the sense that the only feedback received is aggregate over life
experiences. Valiant observed that any feasible evolutionary mechanism could be
simulated in the statistical query framework of Kearns [19]. In a
remarkable result, Feldman showed that in fact, evolvable concept classes are
exactly captured by a restriction of Kearns’ model, where the learning
algorithm is only allowed to make performance queries, i.e., it produces a
hypothesis and then makes a query to an oracle that returns the (approximate)
performance of that hypothesis under the
distribution [9].
Direct evolutionary mechanisms, not invoking the general reductions of Feldman and P. Valiant, have been proposed for certain classes in restricted settings. Valiant showed that the class of disjunctions is evolvable using a simple set of mutations under the uniform distribution [28]. Kanade, Valiant and Vaughan proposed a simple mechanism for evolving homogeneous linear separators under radially symmetric distributions [17]. Feldman considered a model where the ideal function is boolean but the representation can be real-valued, allowing for more detailed feedback. He presents an algorithm for evolving large margin linear separators for a large class of convex loss functions [11]. P. Valiant also showed that with very simple mutations, the class of fixed-degree polynomials can be evolved with respect to the squared loss [29].
Current understanding of biology (or lack thereof) makes it difficult to formalize a notion of naturalness for mutations in these frameworks; in particular, it is not well understood how mutations to DNA relate to functional changes in an organism. That said, the more direct algorithms are appealing due to the simplicity of their mutations. Also, the “chemical computers” of organisms may be slow, and hence, representations that have low complexity are attractive. In general, Feldman’s generic reduction from statistical query algorithms may use arbitrarily complex representations (polynomial-sized circuits), depending on the specific algorithm used. In the remainder of the introduction, we first describe a particular class of biological circuits, transcription networks, that motivates our study. We then frame the evolutionary question in the language of computational learning theory, summarize our contributions and discuss related work.
1.1 Representation in Biology
Biological systems appear to function successfully with greatly restricted representation classes. The nature of circuits found in biological systems may vary, but some aspects – such as sparsity – are common. Specifically, the interacting components in many biological circuits are sparsely connected. Biological circuits are often represented as networks or graphs, where the vertices correspond to entities such as neurons or molecules and the edges to connections or interactions between pairs of entities. For example, both neural networks [31] and networks of metabolic reactions in the cell [30, 14] have been described by “small-world” models, where a few “hub” nodes have many edges but most nodes have few edges (and consequently, the corresponding graphs have small diameter). An associated property observed in biological networks is modularity: a larger network of interacting entities is composed of smaller modules of (functionally related) entities [12]. Both the “small-world” description and modularity of biological networks are consistent with the more general theme of sparsity.
We focus on transcription networks, which are a specific class of networks of
interacting genes and proteins that are involved in the production of new
protein. Alon provides an accessible and mathematical introduction to
transcription networks and other biological circuits [1]; below and
in Figure 1(a), we present a simplified account that motivates this
work. Genes are transcribed to produce mRNA, which is then
translated into sequences of amino acids that ultimately fold into
proteins.
The number of transcription factors varies from hundreds in a bacterium to thousands in a human cell. Some transcription factors are always present in the cell and can be thought of as representing a snapshot of the environment [1]. For example, the presence of sugar molecules in the environment may cause specific transcription factors to be activated, enabling them to regulate the production of other proteins. One of these proteins could be an end-product, such as an enzyme that catalyzes a metabolic reaction involving the sugar. Alternatively, the transcription factor could regulate another transcription factor that itself regulates other genes – we view this as intermediate computation – and may participate in further “computation” to produce the desired end-result.
While transcription networks may include cycles (loops), here for simplicity we
focus on systems that are directed acyclic graphs, and the resulting computation
can be viewed as a circuit. We illustrate a small, real transcription network in
Figure 1(b). These circuits are by necessity shallow due to
a temporal constraint, that the time required for sufficient quantities of
protein to be produced is of the same order of magnitude as cell-division
time.
1.2 Our Contributions
First, our contribution is conceptual. We believe that the study of evolvability from a computational standpoint will benefit from understanding the representation complexity required to evolve a certain concept class. Motivated by the previous discussion, in the case of transcription networks, it appears essential that the representation used be a constant depth and fan-in (boolean or arithmetic) circuit. Of course, any function that can be represented by such a circuit can depend only on a constant number of input variables. We therefore ask: when attention is restricted to functions in a given class that depend only on a constant number of variables, when can evolution succeed?
Second, we show that the class of sparse linear functions, those that depend only on a constant number of variables, under a large class of smooth distributions, can be evolved using sparse linear functions as representations, when the performance is measured using squared error. The number of variables used by the representations is larger than the number of variables in the ideal function and depends on the smoothness parameter of the distribution. According to our notion of smooth nice distributions (Defn. 2), the density function of a smooth distribution is obtained by convolution of an arbitrary density with a product measure on $\mathbb{R}^n$ (alternatively, drawing a point from the smooth distribution is equivalent to drawing a point from an arbitrary distribution and adding a (noise) vector from a product distribution).
A linear function is represented by a weighted arithmetic circuit with only one
addition gate (alternatively, by a depth-two circuit with a layer of
multiplication gates and some constant inputs).
Valiant also proposed a stronger selection mechanism, called evolution by optimization, in which natural selection aggressively selects the (almost) best mutation rather than merely a beneficial one. Our second result requires a much stronger distributional assumption: a bound on the correlation between distinct variables that depends on $k$, the sparsity of the target linear function (see Defn. 3). Under such distributions, we show that under evolution by optimization, sparse linear functions can be evolved by representations with the same sparsity. The mechanism we propose and its analysis are inspired by the greedy orthogonal matching pursuit algorithms in signal processing [7, 27]. Unlike the previous evolutionary algorithm, this one requires initialization, i.e., the evolutionary process begins with the $0$ function. As in the previous case, the number of generations required depends polynomially on the sparsity $k$ of the target linear function and the inverse of the accuracy parameter $\epsilon$, but has no dependence on the total number of attributes $n$. The precise statement appears as Theorem 2 in Section 3.2.
Related Work
The question of proper vs. improper learning has been studied in computational
learning theory. A separation between the two kinds is known, unless $\mathrm{NP} = \mathrm{RP}$. However, most interesting PAC-learnable classes can be learned using
thresholds of low-degree polynomials, and do not seem to require the full
generality of polynomial-sized circuits.
The problem of learning sparse linear functions has been studied under various names in several fields for many applications, e.g., recovering sparse solutions to (underdetermined) linear systems of equations [4], or recovering sparse representations with a redundant dictionary [22, 8]; compressive sampling or compressed sensing for sparse signal reconstruction [5]; optimization with regularization or sparsity-inducing penalties in machine learning [2]; sparse coding for learning an overcomplete basis [25], or for denoising in image and video processing [8]. This area is too vast to review here; Bruckstein et al. have an excellent survey [4]. Learning the sparsest linear function is equivalent to finding the sparsest solution to a system of linear equations (assuming there is no noise in the data). In general, this problem is hard and the currently best-known approximation factor depends on the norm of the pseudo-inverse of the matrix [24]. Thus, some assumption on the distribution seems necessary. Our algorithm for evolution based on optimization (Section 3.2) is essentially the greedy orthogonal matching pursuit algorithm of Tropp [27] and Donoho et al. [7], cast in the language of evolvability; these algorithms are also known in statistical modeling as forward stepwise regression [6, 13].
Finally, the question of attribute-efficient regression in the PAC (or SQ) model is a natural one. Here, the goal would be to design a polynomial time algorithm for producing an accurate linear function, with sample complexity that is polynomial in the sparsity of the target function and the inverse of the target accuracy $\epsilon$, and only polylogarithmic in $n$, the total number of attributes. Under mild boundedness assumptions on the distribution, this can be achieved by setting up an $\ell_1$-regularized optimization problem; the output classifier may not be sparse in light of the hardness result mentioned above. We note that under the distributional assumption made in this paper, finding the sparsest linear function that fits the data is also easy in the PAC/SQ setting, since the solution to the optimization problem in this case is unique. The focus in our work is different, namely showing that simple evolutionary mechanisms can succeed, while using representations that are themselves sparse linear functions at all times.
Organization
2 Model and Preliminaries
We first provide an overview of the evolvability framework of Valiant [28]. The description here differs slightly from Valiant’s original formulation and includes some subsequent extensions (for more details the reader is referred to [28, 9, 10, 29, 16]).
2.1 Valiant’s Evolvability Framework
Let $X$ denote a set of instances, e.g., $X = \mathbb{R}^n$ or $X = \{0,1\}^n$. We assume that the representation length of each $x \in X$ is captured by the parameter $n$. To avoid excessive notation, we will keep this size parameter implicit in our description of the model. Let $D$ be a distribution over $X$. Each $x \in X$ can be thought of as the description of an environmental setting, the inputs to any circuit of an organism. $D$ denotes the distribution over the possible environmental settings an organism may experience in a lifetime. Let $f : X \to Y$ (typically $Y = \mathbb{R}$ or $Y = \{0,1\}$) denote the ideal function, the best behavior in each possible environmental setting.
Representations
A creature is a string representation that encodes an efficiently computable function $r : X \to Y$, i.e., there is an efficient Turing machine that, given the description string $\langle r \rangle$ and $x \in X$, outputs $r(x)$.
In this work, our focus is on characterizing different evolutionary mechanisms based on the complexity of the representations used. The complexity of a representation is measured by the function it computes. Let $H$ be a class of functions from $X$ to $Y$. For $R \subseteq \{0,1\}^*$, we say that $R$ represents $H$ if there is a map $\sigma : R \to H$, and if there exists an efficient Turing machine that, given input $r \in R$ and $x \in X$, outputs $(\sigma(r))(x)$. Henceforth, by abuse of notation we will use $r$ to denote both the representation and the function it computes, $\sigma(r)$.
Evolutionary Algorithms
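The performance of a representation $r$ is measured using a loss function $\ell : Y \times Y \to \mathbb{R}^+$, such that $\ell(y, y) = 0$. For a function
$r : X \to Y$, define the expected loss with respect to the ideal
function $f : X \to Y$, under distribution $D$, as $L_{f,D}(r) = \mathbb{E}_{x \sim D}[\ell(r(x), f(x))]$.
Mutator: A mutator $\mathrm{Mut}(r, \epsilon)$, for a set of representations $R$, is
a polynomial-time randomized Turing machine that takes as input a representation
$r \in R$ and accuracy parameter $\epsilon$ and outputs a multiset $\mathrm{Neigh}(r, \epsilon) \subseteq R$. The running time requirement on $\mathrm{Mut}$ also ensures that
$|\mathrm{Neigh}(r, \epsilon)|$ is polynomially bounded.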
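Selection: (Natural) Selection is based on the empirical performance
of each representation. Let $s : R \times \mathbb{R}^+ \to \mathbb{N}$ be a
sample size function. First, the mutation algorithm, $\mathrm{Mut}(r, \epsilon)$, is run
to produce the multiset $\mathrm{Neigh}(r, \epsilon)$. Then, an i.i.d. sample $\langle x^i \rangle_{i=1}^{s}$ is drawn from the distribution $D$ over $X$, where $s = s(r, \epsilon)$. Denote the empirical performance of each $r' \in \mathrm{Neigh}(r, \epsilon) \cup \{r\}$ as
$\hat{L}_{f,D}(r') = \frac{1}{s} \sum_{i=1}^{s} \ell(r'(x^i), f(x^i))$.
Finally, let $t : R \times \mathbb{R}^+ \to \mathbb{R}$ be a tolerance function. Two possible selection mechanisms are considered.

Selection based on beneficial and neutral mutations (BN-Sel): Let $\mathrm{Bene} = \{ r' \in \mathrm{Neigh}(r, \epsilon) \mid \hat{L}_{f,D}(r') \le \hat{L}_{f,D}(r) - t(r, \epsilon) \}$
denote the set of beneficial mutations and let $\mathrm{Neut} = \{ r' \in \mathrm{Neigh}(r, \epsilon) \mid |\hat{L}_{f,D}(r') - \hat{L}_{f,D}(r)| < t(r, \epsilon) \}$
denote the neutral mutations, with respect to tolerance function $t$. Both $\mathrm{Bene}$ and $\mathrm{Neut}$ are treated as multisets (the multiplicity of any representation is the same as that in $\mathrm{Neigh}(r, \epsilon)$). Selection operates as follows: if $\mathrm{Bene} \neq \emptyset$, $r'$ is randomly selected from $\mathrm{Bene}$ as the surviving creature at the next generation. If $\mathrm{Bene} = \emptyset$ and $\mathrm{Neut} \neq \emptyset$, then $r'$ is selected randomly from $\mathrm{Neut}$ as the surviving creature at the next generation. Otherwise, $\perp$ is produced signifying failure of evolution.

Selection based on optimization (Opt-Sel): Let $\hat{L}^* = \min_{r' \in \mathrm{Neigh}(r, \epsilon)} \hat{L}_{f,D}(r')$. If $\hat{L}^* > \hat{L}_{f,D}(r) + t(r, \epsilon)$, then $\perp$ is produced signifying failure of evolution. Otherwise, consider the multiset $\mathrm{Best} = \{ r' \in \mathrm{Neigh}(r, \epsilon) \mid \hat{L}_{f,D}(r') \le \hat{L}^* + t(r, \epsilon) \}$, and then $r'$ is chosen from $\mathrm{Best}$ randomly as the surviving creature at the next generation.
Thus, while the selection rule BN-Sel only chooses some beneficial (or at least neutral) mutation, Opt-Sel aggressively picks the (almost) best mutation from the available pool.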
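The two selection rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `bn_sel` and `opt_sel` are hypothetical, empirical losses are assumed to be precomputed, failure of evolution is modeled as `None`, and the exact strictness of the inequalities is an assumption rather than something taken from the text.

```python
import random

def bn_sel(current_loss, neighbor_losses, tolerance, rng=random):
    """Selection by beneficial/neutral mutations.

    Returns the index (into neighbor_losses) of the surviving mutation,
    or None to signal failure of evolution. Indexing by position keeps
    the multiset multiplicities of the neighborhood.
    """
    beneficial = [i for i, loss in enumerate(neighbor_losses)
                  if loss <= current_loss - tolerance]
    neutral = [i for i, loss in enumerate(neighbor_losses)
               if abs(loss - current_loss) < tolerance]
    if beneficial:
        return rng.choice(beneficial)
    if neutral:
        return rng.choice(neutral)
    return None

def opt_sel(current_loss, neighbor_losses, tolerance, rng=random):
    """Selection by optimization: pick among the (almost) best mutations."""
    if not neighbor_losses:
        return None
    best = min(neighbor_losses)
    if best > current_loss + tolerance:
        return None  # even the best mutation is clearly worse than the parent
    near_best = [i for i, loss in enumerate(neighbor_losses)
                 if loss <= best + tolerance]
    return rng.choice(near_best)
```

With a parent loss of 0.40 and neighbor losses `[0.50, 0.30, 0.42, 0.31]`, `bn_sel` with tolerance 0.05 may return either index 1 or 3, whereas `opt_sel` with a small enough tolerance always returns index 1.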
We denote by the fact
that is the surviving creature in the next generation after one
mutation and selection operation on the representation and accuracy
parameter . Here, may be one of the two selection rules
described above. For to be feasible we require that the size function
is polynomially bounded (in and ) and that the inverse of the tolerance
function is polynomially sandwiched, i.e., there exist polynomials and such that for every and .
Evolutionary Algorithm: An evolutionary algorithm is a
tuple . When is run starting from
with respect to distribution over , ideal function ,
loss function and parameter , a sequence is produced, where . If for some , we consider evolution as halted and
for . We say that succeeds at generation , if
is the smallest index for which the expected loss .
Definition 1 (Evolvability [28]).
We say that a concept class is evolvable with respect to loss function and selection rule , under a class of distributions using a representation class , if there exists a representation scheme , such that represents , and there exists an evolutionary algorithm , such that for every , every , every , and every , with probability at least , run starting from with respect to , produces for which . Furthermore, the number of generations required for evolution to succeed should be bounded by a polynomial in and .
Remark 1.
If the evolutionary algorithm succeeds only for a specific starting representation $r_0$, we say $C$ is evolvable with initialization.
Remark 2.
If the functions in concept class $C$ depend only on $k$ variables, we say the evolutionary algorithm is attribute-efficient if the size function, $s$, is polylogarithmic in $n$ and polynomial in $k$ and $1/\epsilon$, and the number of generations, $g$, is polynomial in $k$ and $1/\epsilon$, but does not depend on $n$.
The definition presented above varies slightly from the definition of Valiant,
in the sense that we explicitly focus on the complexity of representations used
by the evolutionary algorithm. As discussed in the introduction, we focus on
concept classes where each function depends on few (constant) input
variables.
2.2 Sparse Linear Functions
Our main result in this paper concerns the class of sparse linear functions. We represent a linear function from by a vector , where . For a vector , is the number of nonzero elements of .
For any and integer , define the class of linear functions:
Thus, is the class of sparse linear functions, where the
“influence” of each variable is upper and lower bounded.
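Let $D$ be a distribution over $\mathbb{R}^n$. For $w, w' \in \mathbb{R}^n$, define the inner product $\langle w, w' \rangle = \mathbb{E}_{x \sim D}[(w \cdot x)(w' \cdot x)]$, where $w \cdot x = \sum_{i=1}^{n} w_i x_i$ denotes the standard dot product in $\mathbb{R}^n$. In this paper, we use $\|w\|$ to denote $\sqrt{\langle w, w \rangle}$ (and not $\sqrt{\sum_i w_i^2}$). To avoid confusion, whenever necessary, we will refer to the quantity $\sqrt{\sum_i w_i^2}$ explicitly if we mean the standard Euclidean norm.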
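As a small numerical illustration, the sketch below estimates the distribution-dependent inner product and norm from a sample. It assumes the inner product of two weight vectors is the expectation, under the distribution, of the product of the two linear functions they represent; the names `dot`, `inner_product` and `dist_norm`, and the toy sample, are illustrative and not from the paper.

```python
import math

def dot(w, x):
    """Standard dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def inner_product(w, w_prime, sample):
    """Sample average of (w . x)(w' . x), an estimate of the inner product under D."""
    return sum(dot(w, x) * dot(w_prime, x) for x in sample) / len(sample)

def dist_norm(w, sample):
    """Norm induced by the inner product above (not the Euclidean norm)."""
    return math.sqrt(inner_product(w, w, sample))

# Toy sample standing in for draws from D: four axis-aligned points.
sample = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
w = (2.0, 0.0)
print(inner_product(w, (1.0, 3.0), sample))  # estimate of <w, w'>
print(dist_norm(w, sample))                  # differs from the Euclidean norm 2.0
```

On this toy sample the induced norm of `(2, 0)` is the square root of 2, while its Euclidean norm is 2, which is why the two notions need to be kept apart.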
Distribution Classes
We use two classes of distributions for our results in this paper. We define them formally here.
Smooth Bounded Distributions: We consider the class of smooth bounded distributions over $\mathbb{R}^n$. The concept of smoothed analysis of algorithms was introduced by Spielman and Teng [26] and recently the idea has been used in learning theory [15, 18]. We consider distributions that are bounded and have zero mean. Formally, the distributions we consider are defined as follows:
Definition 2 (Smooth Nice Distribution).
A distribution is a smooth nice
distribution if it is obtained as follows. Let be some distribution
over , and let denote the uniform distribution over . Then is obtained by the convolution of
with .


For all ,

For every in the support of ,
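The convolution in Defn. 2 has a simple sampling interpretation: draw a point from the base distribution and add independent uniform noise to every coordinate. The following sketch illustrates this under assumptions: `sample_smooth`, `two_point_base` and the noise half-width are hypothetical names and values, and the zero-mean and boundedness conditions are left to the choice of base distribution.

```python
import random

def sample_smooth(base_sampler, half_width, rng):
    """Draw one point: a base point plus independent uniform noise per coordinate."""
    base_point = base_sampler(rng)
    return [xi + rng.uniform(-half_width, half_width) for xi in base_point]

def two_point_base(rng):
    """Hypothetical base distribution: one of two fixed points (zero mean overall)."""
    return rng.choice([[1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]])

rng = random.Random(0)
points = [sample_smooth(two_point_base, 0.1, rng) for _ in range(5)]
for p in points:
    print(p)
```

Even though the base distribution here is supported on two points, the smoothed distribution has a density, which is the property the analysis relies on.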
Incoherent Distributions: We also consider incoherent
distributions.
Definition 3 (Incoherent Nice Distribution).
A distribution is a incoherent nice distribution if the following hold:


For all ,

For all , ,

For all in the support of ,
We say a linear function represented by is bounded if . We use the notation . Suppose are bounded linear functions, and distribution is such that for every in the support of , . We consider the squared loss function, which for is . Then, for any in the support of , . Thus, standard Hoeffding bounds imply that if is an i.i.d. sample drawn from , then
(1) 
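To make the role of the Hoeffding bound concrete, the sketch below computes a sufficient sample size for estimating an expected loss to within a given accuracy. It uses the textbook two-sided Hoeffding inequality for a loss taking values in an interval of length `loss_bound`, together with a union bound over `num_hypotheses` representations; the function name and the constants are assumptions and need not match those in (1).

```python
import math

def hoeffding_sample_size(loss_bound, tau, delta, num_hypotheses=1):
    """Smallest s with 2 * exp(-2 * s * tau^2 / loss_bound^2) <= delta / num_hypotheses.

    loss_bound: length of the interval containing every per-example loss.
    tau: desired accuracy of each empirical loss estimate.
    delta: total allowed failure probability (split by a union bound).
    """
    per_hypothesis_delta = delta / num_hypotheses
    return math.ceil(
        loss_bound ** 2 * math.log(2.0 / per_hypothesis_delta) / (2.0 * tau ** 2)
    )

print(hoeffding_sample_size(1.0, 0.1, 0.05))        # one representation
print(hoeffding_sample_size(1.0, 0.1, 0.05, 1000))  # union bound over 1000
```

The sample size grows only logarithmically in the number of representations covered by the union bound, which is the mechanism behind sample sizes that are polylogarithmic in the number of attributes.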
Finally, for linear functions (), let denote the nonzero variables in , so . Then, we have the following Lemma. The proof appears in Appendix A.1.
Lemma 1.
Let be a smooth nice distribution (Defn. 2), let be a vector and consider the corresponding linear function, . Then the following are true:

For any , .

There exists an such that .
3 Evolving Sparse Linear Functions
In this section, we describe two evolutionary algorithms for evolving sparse
linear functions. The first evolves the class under the class of
smooth nice distributions (Defn. 2), using the
selection rule . The second evolves the class under the
more restricted class of incoherent nice distributions
(Defn. 3), using the selection rule . We
first define the notation used in the rest of this section.
Notation: denotes the target distribution over , denotes the ideal (target) function. The inner product and norm of functions are with respect to the distribution . denotes the set . For , denotes the best linear approximation of using the variables in the set ; formally,
Finally, recall that for , and . A vector represents a linear function, . The vector has in coordinate and elsewhere and corresponds to the linear function . Thus, in this notation, . The accuracy parameter is denoted by .
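The best linear approximation using only the variables in a set can be estimated from data by least squares restricted to those coordinates. The sketch below does this via the normal equations in plain Python; `solve_linear_system` and `best_linear_approximation` are hypothetical helper names, sample averages stand in for expectations under the target distribution, and the restricted Gram matrix is assumed to be non-singular.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, k))
        z[r] = (m[r][k] - tail) / m[r][r]
    return z

def best_linear_approximation(xs, ys, subset, n):
    """Least-squares weights supported on `subset`, as a length-n vector."""
    subset = sorted(subset)
    s = len(xs)
    # Empirical second moments and correlations with the target values.
    gram = [[sum(x[i] * x[j] for x in xs) / s for j in subset] for i in subset]
    corr = [sum(x[i] * y for x, y in zip(xs, ys)) / s for i in subset]
    coeffs = solve_linear_system(gram, corr)
    w = [0.0] * n
    for i, c in zip(subset, coeffs):
        w[i] = c
    return w

xs = [[1, 0, 0, 1], [0, 1, 0, 2], [0, 0, 1, 3], [1, 1, 1, 0], [2, 0, 1, 1]]
ys = [2 * x[0] - x[2] for x in xs]  # target uses variables 0 and 2 only
print(best_linear_approximation(xs, ys, {0, 2}, 4))
```

When the chosen set contains every variable the target uses, as in this toy data, the restricted least-squares solution recovers the target weights up to floating-point error.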
3.1 Evolving Sparse Linear Functions Using BN-Sel
We present a simple mechanism that evolves the class of sparse linear functions with respect to smooth nice distributions (see Defn. 2). The representation class also consists of sparse linear functions, but with a greater number of nonzero entries than the ideal function. We also assume that a linear function is represented by , where each is a real number. (Handling the issues of finite precision is standard and is avoided in favor of simplicity.) Define the parameters and . Formally, the representation class is:
The important point to note is that the parameters and do not depend on , the total number of variables.
Next, we define the mutator. Recall that the mutator is a randomized algorithm that takes as input an element and accuracy parameter , and outputs a multiset . Here, is populated by independent draws from the following procedure, where will be specified later (see the proof of Theorem 1). Starting from , define the mutated representation , output by the mutator, as:

Scaling: With probability , choose uniformly at random and let .

Adjusting: With probability , do the following. Pick uniformly at random. Let denote the mutated representation, where for , and choose uniformly at random.

With the remaining probability, do the following:

Swapping: If , choose uniformly at random. Then, choose uniformly at random. Let be the mutated representation, where for . Set and choose uniformly at random. In this case, with probability , and hence .

Adding: If , choose uniformly at random. Let be the mutated representation, where for , and choose uniformly at random.
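The four mutation types above can be sketched as a single randomized procedure on a sparse weight vector. This is a minimal illustration, not the paper's mutator: the probabilities `p_scale` and `p_adjust`, the weight bound, and the sparsity cap `max_nonzero` (assumed to be at least 1) are placeholder parameters, and the list of unused coordinates is built explicitly only for readability.

```python
import random

def mutate(w, n, max_nonzero, bound, rng, p_scale=0.25, p_adjust=0.25):
    """Return one mutated copy of the sparse weight vector `w`.

    `w` maps coordinate index -> weight and stores only the non-zero entries.
    All numeric parameters are placeholders, not the values used in the paper.
    """
    support = sorted(w)
    u = rng.random()
    if u < p_scale:
        # Scaling: multiply every weight by a common random factor in [-1, 1].
        gamma = rng.uniform(-1.0, 1.0)
        return {i: gamma * v for i, v in w.items()}
    mutated = dict(w)
    if u < p_scale + p_adjust and support:
        # Adjusting: redraw the weight of one variable already in use.
        i = rng.choice(support)
        mutated[i] = rng.uniform(-bound, bound)
        return mutated
    unused = [j for j in range(n) if j not in w]
    if not unused:
        return mutated
    if len(support) >= max_nonzero:
        # Swapping: drop one used variable to make room for a new one.
        del mutated[rng.choice(support)]
    # Adding (or the second half of swapping): bring in a fresh variable.
    mutated[rng.choice(unused)] = rng.uniform(-bound, bound)
    return mutated

rng = random.Random(0)
parent = {0: 0.5, 3: -0.2}
neighborhood = [mutate(parent, 10, 2, 1.0, rng) for _ in range(5)]
for child in neighborhood:
    print(child)
```

Every mutation keeps at most `max_nonzero` variables in use, so the representation never leaves the sparse class, and the parent is never modified in place.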

Recall that denotes the ideal (target) function and is the underlying distribution that is smooth nice (see Defn. 2). Since we are working with the squared loss metric, , the expected loss for any is given by . We will show that for any , if , with nonnegligible (inverse polynomial) probability, the above procedure produces a mutation that decreases the expected loss by at least some inverse polynomial amount. Thus, by setting the size of the neighborhood large enough, we can guarantee that with high probability there will always exist a beneficial mutation.
To simplify notation, let . Recall that denotes the best approximation to using variables in the set ; thus, . At a high level, the argument for proving the success of our evolutionary mechanism is as follows: If is large, then a mutation of the type “scaling” or “adjusting” will get closer to , reducing the expected loss. (The role of “scaling” mutations is primarily to ensure that the representations remain bounded.) If is small and , there must be a variable in , that when added to (possibly by swapping), reduces the expected loss. Thus, as long as the representation is far from the evolutionary target, a beneficial mutation is produced with high probability.
More formally, let denote a random mutation produced as a result of the procedure described above. We will establish the desired result by proving the following claims.
Claim 1.
If , then with probability at least , . In particular, a “scaling” type mutation achieves this.
Claim 2.
When , then with probability at least , . In particular, an “adjusting” type mutation achieves this.
Claim 3.
When , but , then with probability at least , . In particular, a mutation of type “swapping” or “adding” achieves this.
Note that when , then . Thus, in this case when , the evolutionary algorithm has succeeded.
The proofs of the above Claims are provided in Appendix A.2. We now prove our main result using the above claims.
Theorem 1.
Let be the class of smooth nice distributions over (Defn. 2). Then the class is evolvable with respect to , using the representation class , where and , using the mutation algorithm described in this section, and the selection rule . Furthermore, the following are true:

The number of generations required is polynomial in , , , and is independent of , the total number of attributes.

The size function , the number of points used to calculate empirical losses, depends polylogarithmically on , and polynomially on the remaining parameters.
Proof.
The mutator is as described in this section. Let
and let
Now, by Claims 1, 2 and 3, if , then the mutator outputs a mutation that decreases the squared loss by with probability at least .
Recall that and . Now, let (recall that is the bound on for in the support of the distribution). We will show that evolution succeeds in at most generations. Note that has no dependence on , the number of attributes, and polynomial dependence on the remaining parameters. Define , and at each time step we have that . Note that together with the observation above, this implies that except with probability , for , if is the representation at time step , contains a mutation that decreases the loss by at least , if .
Now, let be the tolerance function, set and let be the size function. Note that for (this also holds for , since and ). If is an i.i.d. sample drawn from , for each of the representations that may be considered in the neighborhoods for the first time steps, using (1), it holds that simultaneously except with probability (by a union bound). Thus, allowing for failure probability , we assume that we are in the case when the neighborhood always has a mutation that decreases the expected loss by (whenever the expected loss of the current representation is at least ) and that all empirical expected losses are close to the true expected losses.
Now let be the representation at some generation such that , let such that . Then, it is the case that (when ). Hence, for tolerance function , for the selection rule using , . Consequently . Hence, the representation at the next generation is chosen from . Let be the chosen representation. It must be the case that . Thus, we have . Hence, the expected loss decreases at least by .
Note that at no point can the expected loss be greater than for any representation in . Hence, in at most generations, evolution reaches a representation with expected loss at most . Note the only parameter introduced which has an inverse polynomial dependence on is . This implies that only has polylogarithmic dependence on . This concludes the proof of the theorem. ∎
Remark 3.
We note that the same evolutionary mechanism works when evolving the class , as long as the sparsity of the representation class is allowed polynomial dependence on $1/\epsilon$, the inverse of the accuracy parameter. This is consistent with the notion of attribute-efficiency, where the goal is that the information complexity should be polylogarithmic in the number of attributes, but may depend polynomially on $1/\epsilon$.
3.2 Evolving Sparse Linear Functions Using Opt-Sel
In this section, we present a different evolutionary mechanism for evolving
sparse linear functions. This algorithm is essentially an adaptation of a greedy
algorithm commonly known as orthogonal matching pursuit (OMP) in the signal
processing literature (see [7, 27]). Our
analysis requires stronger properties of the distribution: we show that
sparse linear functions can be evolved with respect to incoherent
nice distributions (Defn. 3). Here, the
selection rule used is selection using optimization
(Opt-Sel).
Recall that is the ideal (target)
function.
where . Now, starting from , define the action of the mutator as follows (we will define the parameters and later in the proof of Theorem 2):

Adding: With probability , do the following. Recall that denotes the nonzero entries of . If , choose uniformly at random. Let be such that for , and choose uniformly at random. If , let . Then, the multiset is populated by independent draws from the procedure just described.

With probability , do the following:

Identical: With probability , output .

Scaling: With probability , choose uniformly at random and let .

Adjusting: With probability , do the following. Pick uniformly at random. Let be such that for , and choose uniformly at random.
Then, the multiset is populated by independent draws from the procedure just described.

One thing to note in the above definition is that the mutations produced by the mutator at any given time are correlated, i.e., they are all either of the kind that add a new variable, or all of the kind that just manipulate existing variables. At a high level, we prove the success of this mechanism as follows:

Using mutations of type “scaling” or “adjusting,” a representation that is close to the best in the space, i.e., , is evolved.

When the representation is (close to) the best possible using current variables, adding one of the variables that is present in the ideal function, but not in the current representation, results in the greatest reduction of expected loss. Thus, selection based on optimization would always add a variable in . By tuning appropriately, it is ensured that with high probability, candidate mutations that add new variables are not chosen until evolution has had time to approach the best representation using existing variables.
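For comparison, the classical (non-evolutionary) orthogonal matching pursuit loop that this mechanism mirrors can be sketched as follows. The sketch is the textbook greedy algorithm on a finite sample, not the mutator itself: it scans all coordinates deterministically instead of sampling candidate mutations, and the helper names and the stopping threshold are assumptions.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, k))
        z[r] = (m[r][k] - tail) / m[r][r]
    return z

def least_squares_on(xs, ys, support):
    """Least-squares coefficients using only the coordinates in `support`."""
    s = len(xs)
    gram = [[sum(x[i] * x[j] for x in xs) / s for j in support] for i in support]
    corr = [sum(x[i] * y for x, y in zip(xs, ys)) / s for i in support]
    return solve_linear_system(gram, corr)

def omp(xs, ys, sparsity):
    """Greedy orthogonal matching pursuit; returns a length-n weight vector."""
    n, s = len(xs[0]), len(xs)
    support, coeffs = [], []
    for _ in range(sparsity):
        # Residual of the current best fit on the chosen variables.
        residual = [y - sum(c * x[i] for i, c in zip(support, coeffs))
                    for x, y in zip(xs, ys)]
        # Score every unused variable by its correlation with the residual.
        scores = {j: abs(sum(x[j] * r for x, r in zip(xs, residual))) / s
                  for j in range(n) if j not in support}
        if not scores:
            break
        best = max(scores, key=scores.get)
        if scores[best] < 1e-12:
            break  # nothing left to explain
        support.append(best)
        coeffs = least_squares_on(xs, ys, support)  # refit on enlarged support
    w = [0.0] * n
    for i, c in zip(support, coeffs):
        w[i] = c
    return w
```

The refit after each added variable plays the role of the "scaling" and "adjusting" generations, and the correlation-based choice of the next variable plays the role of selection by optimization among "adding" mutations.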
To complete the proof we establish the following claims.
Claim 4.
If , then if , there exists and , such that for any , and for any , , . Furthermore, .
Claim 5.
Conditioned on the mutator not outputting mutations that add a new variable, with probability at least , there exists a mutation that reduces the squared loss by at least .
The proofs of Claims 4 and 5 are not difficult and are provided in Appendix A.3. Based on the above claims we can prove the following theorem:
Theorem 2.
Let be the class of incoherent nice distributions over (Defn. 3). Then, the class is evolvable with respect to by an evolutionary algorithm, using the mutation algorithm described in this section, selection rule , and the representation class , where . Furthermore, the following are true:

The number of generations is polynomial in , , , but independent of the dimension .

The size function , the number of points used to calculate the empirical losses, depends polylogarithmically on and polynomially on the remaining parameters.
Proof.
The proof is straightforward, although a bit heavy on notation; we provide a sketch here. The mutator is as described in this section. Let
and let
Also, let and let be the tolerance function.
First, we show that between the “rare” time steps when the mutator outputs mutations that add a new variable, evolution has enough time to stabilize (come close to a local optimum) using existing variables. To see this, consider a sequence of coin tosses, where the probability of heads is and the probability of tails is . Let be the number of tails between the and heads. Except with probability , by a simple union bound. Also, by Markov’s inequality, except with probability , . Thus, except with probability , we have for . Let . This ensures that, except with probability , after time steps, at least time steps where the mutator outputs mutations of type “adding” have occurred, and the first of these occurrences are all separated by at least time steps of other types of mutations.
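The union-bound step of the coin-toss argument can be checked with a short computation. The sketch below uses hypothetical names (`p_heads` for the probability of an "adding" step, `t` for the required separation) and is meant only as an illustration: the chance that fewer than `t` tails precede the next heads is exactly 1 - (1 - p_heads)^t, which is at most `p_heads * t`.

```python
import random

def short_gap_probability(p_heads, t):
    """Exact chance that fewer than t tails precede the next heads."""
    return 1.0 - (1.0 - p_heads) ** t

def tails_between_heads(p_heads, n_heads, rng):
    """Simulate tosses until n_heads heads occur; return the run of tails before each."""
    gaps, run = [], 0
    while len(gaps) < n_heads:
        if rng.random() < p_heads:
            gaps.append(run)   # a heads closes the current run of tails
            run = 0
        else:
            run += 1
    return gaps

p_heads, t = 0.01, 10
exact = short_gap_probability(p_heads, t)   # chance of a gap shorter than t
union_bound = p_heads * t                   # the simpler upper bound
gaps = tails_between_heads(p_heads, 50, random.Random(0))
```

Since a typical gap is on the order of `1 / p_heads` tosses, choosing `p_heads` small relative to `1 / t` makes short gaps rare. This is the sense in which evolution has time to stabilize between "adding" steps.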
Also, let and let be the size function. These values ensure that for generations, except with probability , the mutator always produces a mutation that had probability at least of being produced (conditioned on the type of mutations output by the mutator at that time step), and that for all the representations concerned, , where . Thus, allowing the process to fail with probability , we assume that none of the undesirable events have occurred.
We will show that the steps with mutations other than “adding” are sufficient to ensure that evolution reaches the (almost) best possible target with the variables available to it. In particular, if the set of available variables is , the representation reached by evolution will satisfy . For now, suppose that this is the case.
We claim by induction that evolution never adds a “wrong” variable, i.e., one that is not present in the target function . The base case is trivially true, since the starting representation is . Now suppose, just before a “heads” step, the representation is , such that and . The current step is assumed to be a “heads” step, thus the mutator has produced mutations by adding a new variable. Then, using Claim 4, we know that there is a mutation in such that (obtained by adding a correct variable). Since and , it must be the case that . This ensures that the set , for selection rule is not empty. Furthermore, we claim that no mutation that adds an irrelevant variable can be in . Suppose is a mutation that adds an irrelevant variable; according to Claim 5, , and hence . This ensures that every representation in corresponds to a mutation that adds some relevant variable. Thus, the evolutionary algorithm never adds any irrelevant variable.
Finally, note that during a “tails” step (when the mutator produces mutations of types other than “adding”), as long as , there exists a mutation that reduces the expected loss by at least . This implies that the set is nonempty and for the values of tolerance and , any mutation from the set reduces the expected loss by at least . (This argument is identical to the one in Theorem 1.) Since the maximum loss is at most for the class of distributions and a representation from the set ; in at most steps, a representation satisfying must be reached. Note that once such a representation is reached, it is ensured that the loss does not increase substantially, since with probability at least , the mutator outputs the same representation. Hence, it is guaranteed that there is always a neutral mutation. Thus, before the next “heads” step, it must be the case that . If is set to , the evolutionary algorithm using the selection rule succeeds.
It is readily verified that the values of and satisfy the claims in the statement of the theorem. ∎
4 Conclusion and Future Work
In this work, we provided simple evolutionary mechanisms for evolving sparse linear functions under a large class of distributions. These evolutionary algorithms have the desirable properties that the representations used are themselves sparse linear functions, and that they are attribute-efficient in the sense that the number of generations required for evolution to succeed is independent of the total number of attributes.
Strong negative results are known for distribution-independent evolvability of Boolean functions, e.g., even the class of conjunctions is not evolvable [11]. However, along the lines of this work, it is interesting to study whether under restricted classes of distributions, evolution is possible for simple concept classes, using representations of low complexity. Currently, even under (biased) product distributions, no evolutionary mechanism is known for the class of disjunctions, except via Feldman’s general reduction from CSQ algorithms. Even if the queries made by the CSQ algorithm are simple, Feldman’s reduction uses intermediate representations that randomly combine queries made by the algorithm, making the representations quite complex.
A natural extension of our current results would be to study fixed-degree sparse polynomials. Another interesting direction is to study circuits with sigmoidal or other non-linear filters on the gates, which arise naturally in molecular systems. The suitable class of Boolean functions to study is low-weight threshold functions, which includes disjunctions and conjunctions. The class of smooth bounded distributions may be an appropriate starting place for studying evolvability of these classes. For example, is the class of low-weight threshold functions evolvable under smooth distributions, or at least log-concave distributions?
Acknowledgments
We would like to thank Leslie Valiant for helpful discussions and comments on an earlier version of this paper. We are grateful to Frank Solomon for discussing biological aspects related to this work.
Appendix A Omitted Proofs
A.1 Proofs from Section 2.2
Proof of Lemma 1.
Note that for any , we can write , where is drawn from some smooth bounded distribution, and is drawn from the uniform distribution over . Note that and are independent, and all components of are independent. First, we observe that . Now, consider the following:
The conclusions of the Lemma follow easily by looking at the above expression. ∎
A.2 Proofs from Section 3.1
Proof of Claim 1.
We show that in this case, a “scaling” mutation achieves the desired result. Restricted to the direction , the best approximation to is . We have that
Hence, if , for (and similarly if for ), we have that