Attribute-Efficient Evolvability of Linear Functions


Abstract

In a seminal paper, Valiant (2006) introduced a computational model for evolution to address the question of how complexity can arise through Darwinian mechanisms. Valiant views evolution as a restricted form of computational learning, where the goal is to evolve a hypothesis that is close to the ideal function. Feldman (2008) showed that (correlational) statistical query learning algorithms could be framed as evolutionary mechanisms in Valiant’s model. P. Valiant (2012) considered evolvability of real-valued functions and also showed that weak-optimization algorithms that use weak-evaluation oracles could be converted to evolutionary mechanisms.

In this work, we focus on the complexity of representations of evolutionary mechanisms. In general, the reductions of Feldman and P. Valiant may result in intermediate representations that are arbitrarily complex (polynomial-sized circuits). We argue that biological constraints often dictate that the representations have low complexity, such as circuits of constant depth and fan-in. We give mechanisms for evolving sparse linear functions under a large class of smooth distributions. These evolutionary algorithms are attribute-efficient in the sense that the size of the representations and the number of generations required depend only on the sparsity of the target function and the accuracy parameter, and have no dependence on the total number of attributes.

1 Introduction

Darwin’s theory of evolution through natural selection has been a cornerstone of biology for over a century and a half. Yet, a quantitative theory of the complexity that could arise through Darwinian mechanisms has remained virtually unexplored. To address this question, Valiant introduced a computational model of evolution [28]. In his model, an organism is an entity that computes a function of its environment. There is a (possibly hypothetical) ideal function indicating the best behavior in every possible environment. The performance of the organism is measured by how close the function it computes is to the ideal. An organism produces a set of offspring that may have mutations altering the function computed. The performance (fitness) measure acting on a population of mutants forms the basis of natural selection. The resources allowed are the most generous while remaining feasible: the mutation mechanism may be any efficient randomized Turing machine, and the function represented by the organism may be arbitrary as long as it is computable by an efficient Turing machine.

Formulated this way, the question of evolvability can be asked in the language of computational learning theory. For what classes of ideal functions $C$ can one expect to find an evolutionary mechanism that gets arbitrarily close to the ideal, within feasible computational resources? Darwinian selection is restrictive in the sense that the only feedback received is aggregate over life experiences. Valiant observed that any feasible evolutionary mechanism can be simulated in the statistical query framework of Kearns [19]. In a remarkable result, Feldman showed that, in fact, evolvable concept classes are exactly captured by a restriction of Kearns’ model, where the learning algorithm is only allowed to make performance queries, i.e., it produces a hypothesis and then makes a query to an oracle that returns the (approximate) performance of that hypothesis under the distribution [9]. P. Valiant studied the evolvability of real-valued functions and showed that whenever the corresponding weak optimization problem, i.e., approximately minimizing the expected loss, can be solved using a weak evaluation oracle, such an algorithm can be converted into an evolutionary mechanism [29]. This implies that a large class of functions, fixed-degree real polynomials, can be evolved with respect to any convex loss function.

Direct evolutionary mechanisms, not invoking the general reductions of Feldman and P. Valiant, have been proposed for certain classes in restricted settings. Valiant showed that the class of disjunctions is evolvable using a simple set of mutations under the uniform distribution [28]. Kanade, Valiant and Vaughan proposed a simple mechanism for evolving homogeneous linear separators under radially symmetric distributions [17]. Feldman considered a model where the ideal function is boolean but the representation can be real-valued, allowing for more detailed feedback; he presented an algorithm for evolving large-margin linear separators for a large class of convex loss functions [11]. P. Valiant also showed that with very simple mutations, the class of fixed-degree polynomials can be evolved with respect to the squared loss [29].

Current understanding of biology (or lack thereof) makes it difficult to formalize a notion of naturalness for mutations in these frameworks; in particular, it is not well understood how mutations to DNA relate to functional changes in an organism. That said, the more direct algorithms are appealing due to the simplicity of their mutations. Also, the “chemical computers” of organisms may be slow, and hence, representations that have low complexity are attractive. In general, Feldman’s generic reduction from statistical query algorithms may use arbitrarily complex representations (polynomial-sized circuits), depending on the specific algorithm used. In the remainder of the introduction, we first describe a particular class of biological circuits, transcription networks, that motivate our study. We then frame the evolutionary question in the language of computational learning theory, summarize our contributions and discuss related work.

1.1 Representation in Biology

Biological systems appear to function successfully with greatly restricted representation classes. The nature of circuits found in biological systems may vary, but some aspects – such as sparsity – are common. Specifically, the interacting components in many biological circuits are sparsely connected. Biological circuits are often represented as networks or graphs, where the vertices correspond to entities such as neurons or molecules and the edges to connections or interactions between pairs of entities. For example, both neural networks [31] and networks of metabolic reactions in the cell [30, 14] have been described by “small-world” models, where a few “hub” nodes have many edges but most nodes have few edges (and consequently, the corresponding graphs have small diameter). An associated property observed in biological networks is modularity: a larger network of interacting entities is composed of smaller modules of (functionally related) entities [12]. Both the “small-world” description and modularity of biological networks are consistent with the more general theme of sparsity.

Figure 1: (a) Schematic of transcription (top) and translation (bottom). Here, a transcription factor (TF) binds to DNA close to a gene in a way that increases gene expression by encouraging RNA polymerase (RNAp) to transcribe the gene and so produce mRNA. The mRNA is then translated by ribosomes to produce sequences of amino acids that ultimately fold into proteins. Only a small number of transcription factors directly regulate any gene. Note that a transcription factor’s action can also decrease gene expression. For a more complete picture, see e.g., [1]. (b) Topology of the transcription network of respiration and redox reactions in yeast. An edge from $X$ to $Y$ represents that transcription factor $X$ regulates the expression of gene $Y$. Note that this real network has cycles. Adapted from [23].

We focus on transcription networks, which are a specific class of networks of interacting genes and proteins that are involved in the production of new protein. Alon provides an accessible and mathematical introduction to transcription networks and other biological circuits [1]; below and in Figure 1(a), we present a simplified account that motivates this work. Genes are transcribed to produce mRNA, which is then translated into sequences of amino acids that ultimately fold into proteins. In a transcription network, a gene’s transcription may be regulated by a set of proteins called transcription factors. These transcription factors may increase or decrease a gene’s transcription by physically binding to regions of DNA that are typically close to the gene. In natural systems, only a small number of transcription factors regulate any single gene, and so transcription networks are sparsely connected. For example, Balaji et al. studied a yeast transcription network of 157 transcription factors regulating 4,410 genes. They observed this network to have 12,873 interactions (edges), where each gene was regulated on average by about 2.9 transcription factors, the distribution of in-degrees was well-described by an exponential fit, and only about 45 genes had an in-degree of 15 or greater [3].

The number of transcription factors varies from hundreds in a bacterium to thousands in a human cell. Some transcription factors are always present in the cell and can be thought of as representing a snapshot of the environment [1]. For example, the presence of sugar molecules in the environment may cause specific transcription factors to be activated, enabling them to regulate the production of other proteins. One of these proteins could be an end-product, such as an enzyme that catalyzes a metabolic reaction involving the sugar. Alternatively, the transcription factor could regulate another transcription factor that itself regulates other genes – we view this as intermediate computation – and may participate in further “computation” to produce the desired end-result.

While transcription networks may include cycles (loops), here for simplicity we focus on systems that are directed acyclic graphs, so the resulting computation can be viewed as a circuit. We illustrate a small, real transcription network in Figure 1(b). These circuits are by necessity shallow due to a temporal constraint: the time required for sufficient quantities of protein to be produced is of the same order of magnitude as cell-division time. For example, Luscombe et al. measured the shortest path length (in number of intermediate nodes) between transcription factors and regulated genes corresponding to terminal nodes (leaves) in a yeast transcription network. In the static network, the mean such path length was 4.7 and the longest path involved 12 intermediate transcription factors [21].

1.2 Our Contributions

Our first contribution is conceptual. We believe that the study of evolvability from a computational standpoint will benefit from understanding the representation complexity required to evolve a given concept class. Motivated by the previous discussion, in the case of transcription networks, it appears essential that the representation used be a constant depth and fan-in (boolean or arithmetic) circuit. Of course, any function that can be represented by such a circuit can depend only on a constant number of input variables. We therefore ask: when we restrict attention to functions in a given class that depend only on a constant number of variables, when can evolution succeed?

Second, we show that the class of sparse linear functions, those that depend only on a constant number of variables, can be evolved under a large class of smooth distributions using sparse linear functions as representations, when performance is measured using squared error. The number of variables used by the representations is larger than the number of variables in the ideal function and depends on the smoothness parameter of the distribution. According to our notion of $\Delta$-smooth $B$-nice distributions (Defn. 2), the density function of a smooth distribution is obtained by convolution of an arbitrary density with a product measure on $[-\Delta, \Delta]^n$ (alternatively, drawing a point from the smooth distribution is equivalent to drawing a point from an arbitrary distribution and adding a (noise) vector from a product distribution).

A linear function is represented by a weighted arithmetic circuit with only one addition gate (alternatively, by a depth-two circuit with a layer of multiplication gates and some constant inputs). Also, the number of generations required for evolution to succeed depends polynomially on the sparsity of the target linear function, the smoothness parameter of the distribution and the inverse of the target accuracy $\epsilon$, and has no dependence on $n$, the dimension of the input space. The number of mutations explored at each generation is also polynomially bounded. Thus, our result shows attribute-efficient evolvability of sparse linear functions, in the sense of Littlestone [20]. For the precise statement, see Theorem 1 in Section 3.1.

Valiant also proposed a stronger selection mechanism, called evolution by optimization, in which natural selection aggressively selects the (almost) best mutation, rather than merely a beneficial one. Our second result requires a much stronger distributional assumption, incoherence: the correlation between any two distinct attributes must be small as a function of $k$, where $k$ is the sparsity of the target linear function (see Defn. 3). Under such distributions, we show that under evolution by optimization, sparse linear functions can be evolved by representations with the same sparsity. The mechanism we propose and its analysis are inspired by the greedy orthogonal matching pursuit algorithms in signal processing [7, 27]. Unlike the previous evolutionary algorithm, this one requires initialization, i.e., the evolutionary process begins with the zero function. As in the previous case, the number of generations required depends polynomially on the sparsity of the target linear function and the inverse of the accuracy parameter $\epsilon$, but has no dependence on the total number of attributes $n$. The precise statement appears as Theorem 2 in Section 3.2.

Related Work

The question of proper vs. improper learning has been studied in computational learning theory. A separation between the two is known, unless $\mathsf{NP} = \mathsf{RP}$. However, most interesting PAC-learnable classes can be learned using thresholds of low-degree polynomials, and do not seem to require the full generality of polynomial-sized circuits. In this context, Valiant’s disjunction algorithm under the uniform distribution [28], Kanade et al.’s algorithm for homogeneous half-spaces under radially symmetric distributions [17], and P. Valiant’s algorithm for linear (polynomial) functions using squared loss [29] are proper evolutionary mechanisms, i.e., the representation used is from the same class as the ideal function. In the first two cases, it is straightforward to show that if the target depends only on a constant number of variables, the evolutionary mechanism also succeeds using representations that depend only on a constant number of variables. Thus, attribute-efficient evolution can be achieved.

The problem of learning sparse linear functions has been studied under various names in several fields for many applications, e.g., recovering sparse solutions to (underdetermined) linear systems of equations [4], or recovering sparse representations with a redundant dictionary [22, 8]; compressive sampling or compressed sensing for sparse signal reconstruction [5]; optimization with regularization or sparsity-inducing penalties in machine learning [2]; sparse coding for learning an overcomplete basis [25], or for denoising in image and video processing [8]. This area is too vast to review here; Bruckstein et al. have an excellent survey [4]. Learning the sparsest linear function is equivalent to finding the sparsest solution to a system of linear equations (assuming there is no noise in the data). In general, this problem is NP-hard, and the best-known approximation factor currently depends on the norm of the pseudo-inverse of the matrix [24]. Thus, some assumption on the distribution seems necessary. Our evolution-by-optimization algorithm (Section 3.2) is essentially the greedy orthogonal matching pursuit algorithm of Tropp [27] and Donoho et al. [7], cast in the language of evolvability; these algorithms are also known in statistical modeling as forward stepwise regression [6, 13].
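To make the greedy scheme concrete, the following is a minimal sketch of orthogonal matching pursuit (our own illustration in Python with NumPy; function and variable names are ours, not from the references): it repeatedly selects the attribute most correlated with the current residual and refits by least squares on the selected support.

```python
import numpy as np

def omp(X, y, k):
    """Greedy recovery of a k-sparse w with X @ w ~= y (orthogonal matching pursuit).

    X: (s, n) matrix of sample points, y: (s,) observed values, k: sparsity budget.
    """
    s, n = X.shape
    support = []                               # indices selected so far
    residual = y.astype(float).copy()
    coef = np.zeros(0)
    for _ in range(k):
        correlations = np.abs(X.T @ residual)  # alignment of each column with the residual
        correlations[support] = -np.inf        # never re-pick an already-selected index
        support.append(int(np.argmax(correlations)))
        # Refit by least squares restricted to the current support.
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    w = np.zeros(n)
    w[support] = coef
    return w
```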

Finally, the question of attribute-efficient regression in the PAC (or SQ) model is a natural one. Here, the goal would be to design a polynomial time algorithm for producing an $\epsilon$-accurate linear function, with sample complexity that is polynomial in the sparsity $k$ of the target function and the inverse of the target accuracy $1/\epsilon$, and only polylogarithmic in $n$, the total number of attributes. Under mild boundedness assumptions on the distribution, this can be achieved by setting up an $\ell_1$-regularized optimization problem; the output classifier may not be sparse, in light of the NP-hardness result mentioned above. We note that under the distributional assumption made in this paper, finding the sparsest linear function that fits the data is also easy in the PAC/SQ setting, since the solution to the optimization problem in this case is unique. The focus in our work is different, namely showing that simple evolutionary mechanisms can succeed, while using representations that are themselves sparse linear functions at all times.

Organization

In Section 2, we give an overview of Valiant’s evolution model and describe the concept classes and class of distributions considered in this paper. Section 3 contains the mechanisms for evolving sparse linear functions. We conclude in Section 4 with some discussion and directions for future work.

2 Model and Preliminaries

We first provide an overview of the evolvability framework of Valiant [28]. The description here differs slightly from Valiant’s original formulation and includes some subsequent extensions (for more details the reader is referred to [28, 9, 10, 29, 16]).

2.1 Valiant’s Evolvability Framework

Let $X$ denote a set of instances, e.g., $\{0,1\}^n$ or $\mathbb{R}^n$. We assume that the representation length of each $x \in X$ is captured by the parameter $n$. To avoid excessive notation, we will keep this size parameter implicit in our description of the model. Let $D$ be a distribution over $X$. Each $x \in X$ can be thought of as the description of an environmental setting, the inputs to any circuit of an organism. $D$ denotes the distribution over the possible environmental settings an organism may experience in a lifetime. Let $f: X \to Y$ (typically $Y = \{-1, 1\}$ or $Y \subseteq \mathbb{R}$) denote the ideal function, the best behavior in each possible environmental setting.

Representations

A creature is a string representation $r$ that encodes an efficiently computable function $r: X \to Y$, i.e., there is an efficient Turing machine that, given the description string $r$ and $x \in X$, outputs $r(x)$.

In this work, our focus is characterizing different evolutionary mechanisms based on the complexity of representations used. The complexity of a representation is measured by the function it computes. Let $C$ be a class of functions. For a representation class $R$, we say that $R$ represents $C$ if there is a map $\sigma: R \to C$, and if there exists an efficient Turing machine that, given input $r \in R$ and $x \in X$, outputs $\sigma(r)(x)$. Henceforth, by abuse of notation we will use $r$ to denote both the representation and the function it computes, $\sigma(r)$.

Evolutionary Algorithms

The performance of a representation is measured using a loss function $\ell: Y \times Y \to \mathbb{R}^+$, such that $\ell(y, y) = 0$. For a function $r$, define the expected loss with respect to the ideal function $f$, under distribution $D$, as $L_D(r) = \mathbb{E}_{x \sim D}[\ell(r(x), f(x))]$. The goal of evolution is to reach some representation $r$ such that $L_D(r) \le \epsilon$. In the following discussion, we use the notation: $f$ is the ideal function, $\epsilon$ the target accuracy, $D$ the target distribution over $X$, and $L_D$ the expected loss function.
Mutator: A mutator $\mathrm{Mut}$, for a set of representations $R$, is a polynomial-time randomized Turing machine that takes as input a representation $r \in R$ and accuracy parameter $\epsilon$, and outputs a multiset $\mathrm{Neigh}(r, \epsilon) \subseteq R$. The polynomial running time requirement on $\mathrm{Mut}$ also ensures that $|\mathrm{Neigh}(r, \epsilon)|$ is polynomially bounded.
Selection: (Natural) Selection is based on the empirical performance of each representation. Let $s$ be a sample size function. First, the mutation algorithm $\mathrm{Mut}$ is run to produce the multiset $\mathrm{Neigh}(r, \epsilon)$. Then, an i.i.d. sample $\langle x_1, \ldots, x_s \rangle$ is drawn from the distribution $D$ over $X$, where $s = s(n, 1/\epsilon)$. Denote the empirical performance of each $r' \in \mathrm{Neigh}(r, \epsilon) \cup \{r\}$ as

$$\hat{L}_D(r') = \frac{1}{s} \sum_{i=1}^{s} \ell(r'(x_i), f(x_i)).$$
Finally, let $t(n, 1/\epsilon)$ be a tolerance function. Two possible selection mechanisms are considered.

  1. Selection based on beneficial and neutral mutations ($\mathrm{SelNB}$): Let

$$\mathrm{Bene} = \{ r' \in \mathrm{Neigh}(r, \epsilon) \mid \hat{L}_D(r') \le \hat{L}_D(r) - t(n, 1/\epsilon) \}$$

    denote the set of beneficial mutations and let

$$\mathrm{Neut} = \{ r' \in \mathrm{Neigh}(r, \epsilon) \mid |\hat{L}_D(r') - \hat{L}_D(r)| < t(n, 1/\epsilon) \}$$

    denote the neutral mutations, with respect to tolerance function $t$. Both $\mathrm{Bene}$ and $\mathrm{Neut}$ are treated as multisets (the multiplicity of any representation is the same as that in $\mathrm{Neigh}(r, \epsilon)$). Selection operates as follows: if $\mathrm{Bene} \neq \emptyset$, $r'$ is randomly selected from $\mathrm{Bene}$ as the surviving creature at the next generation. If $\mathrm{Bene} = \emptyset$ and $\mathrm{Neut} \neq \emptyset$, then $r'$ is selected randomly from $\mathrm{Neut}$ as the surviving creature at the next generation. Otherwise, $\perp$ is produced, signifying failure of evolution.

  2. Selection based on optimization ($\mathrm{SelOpt}$): Let $\hat{L}^* = \min_{r' \in \mathrm{Neigh}(r, \epsilon)} \hat{L}_D(r')$. If $\hat{L}^* > \hat{L}_D(r) + t(n, 1/\epsilon)$, then $\perp$ is produced, signifying failure of evolution. Otherwise, consider the multiset $\mathrm{Opt} = \{ r' \in \mathrm{Neigh}(r, \epsilon) \mid \hat{L}_D(r') \le \hat{L}^* + t(n, 1/\epsilon) \}$, and then $r'$ is chosen from $\mathrm{Opt}$ randomly as the surviving creature at the next generation.

Thus, while the selection rule $\mathrm{SelNB}$ only chooses some beneficial (or at least neutral) mutation, $\mathrm{SelOpt}$ aggressively picks the (almost) best mutation from the available pool.
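For concreteness, the two selection rules can be summarized in code. The sketch below is our own illustration (names and the failure value `None` standing in for $\perp$ are ours): `sel_nb` corresponds to selection based on beneficial and neutral mutations, and `sel_opt` to selection based on optimization.

```python
import random

def empirical_loss(r, sample, ideal, loss):
    """Average loss of representation r on an i.i.d. sample."""
    return sum(loss(r(x), ideal(x)) for x in sample) / len(sample)

def sel_nb(r, neigh, sample, ideal, loss, t):
    """Selection based on beneficial and neutral mutations (SelNB)."""
    base = empirical_loss(r, sample, ideal, loss)
    scored = [(r2, empirical_loss(r2, sample, ideal, loss)) for r2 in neigh]
    bene = [r2 for r2, v in scored if v <= base - t]
    neut = [r2 for r2, v in scored if abs(v - base) < t]
    if bene:
        return random.choice(bene)   # some beneficial mutation survives
    if neut:
        return random.choice(neut)   # otherwise, a neutral one
    return None                      # failure of evolution

def sel_opt(r, neigh, sample, ideal, loss, t):
    """Selection based on optimization (SelOpt): near-best mutations only."""
    base = empirical_loss(r, sample, ideal, loss)
    scored = [(r2, empirical_loss(r2, sample, ideal, loss)) for r2 in neigh]
    best = min(v for _, v in scored)
    if best > base + t:
        return None                  # failure of evolution
    return random.choice([r2 for r2, v in scored if v <= best + t])
```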

We denote by $r' \leftarrow \mathrm{Sel}[\mathrm{Mut}](r, \epsilon)$ the fact that $r'$ is the surviving creature in the next generation after one mutation and selection operation on the representation $r$ with accuracy parameter $\epsilon$. Here, $\mathrm{Sel}$ may be one of the two selection rules described above. For $\mathrm{Sel}$ to be feasible we require that the size function $s$ is polynomially bounded (in $n$ and $1/\epsilon$) and that the inverse of the tolerance function is polynomially sandwiched, i.e., there exist polynomials $p_1$ and $p_2$ such that $1/p_1(n, 1/\epsilon) \le t(n, 1/\epsilon) \le 1/p_2(n, 1/\epsilon)$ for every $n$ and $\epsilon$.
Evolutionary Algorithm: An evolutionary algorithm is a tuple $\mathcal{EA} = (R, \mathrm{Mut}, s, t, \mathrm{Sel})$. When $\mathcal{EA}$ is run starting from $r_0$ with respect to distribution $D$ over $X$, ideal function $f$, loss function $\ell$ and parameter $\epsilon$, a sequence $r_0, r_1, r_2, \ldots$ is produced, where $r_{i+1} \leftarrow \mathrm{Sel}[\mathrm{Mut}](r_i, \epsilon)$. If $r_{i+1} = \perp$ for some $i$, we consider evolution as halted and $r_j = \perp$ for all $j > i$. We say that $\mathcal{EA}$ succeeds at generation $g$, if $g$ is the smallest index for which the expected loss $L_D(r_g) \le \epsilon$.

Definition 1 (Evolvability [28]).

We say that a concept class $C$ is evolvable with respect to loss function $\ell$ and selection rule $\mathrm{Sel}$, under a class of distributions $\mathcal{D}$, using a representation class $R$, if there exists a representation scheme $\sigma$, such that $R$ represents $C$, and there exists an evolutionary algorithm $\mathcal{EA} = (R, \mathrm{Mut}, s, t, \mathrm{Sel})$, such that for every $\epsilon > 0$, every $D \in \mathcal{D}$, every $f \in C$, and every $r_0 \in R$, with probability at least $1 - \epsilon$, $\mathcal{EA}$ run starting from $r_0$ with respect to $D$, $f$, $\ell$ and $\epsilon$, produces an $r_g$ for which $L_D(r_g) \le \epsilon$. Furthermore, the number of generations $g$ required for evolution to succeed should be bounded by a polynomial in $n$ and $1/\epsilon$.

Remark 1.

If the evolutionary algorithm succeeds only for a specific starting representation $r_0$, we say $C$ is evolvable with initialization.

Remark 2.

If the functions in concept class $C$ depend only on $k$ variables, we say the evolutionary algorithm is attribute-efficient if the size function $s$ is polylogarithmic in $n$ and polynomial in $k$ and $1/\epsilon$, and the number of generations $g$ is polynomial in $k$ and $1/\epsilon$, but does not depend on $n$.

The definition presented above varies slightly from the definition of Valiant, in the sense that we explicitly focus on the complexity of representations used by the evolutionary algorithm. As discussed in the introduction, we focus on concept classes where each function depends only on a constant number of input variables.

2.2 Sparse Linear Functions

Our main result in this paper concerns the class of sparse linear functions. We represent a linear function from $\mathbb{R}^n$ to $\mathbb{R}$ by a vector $w \in \mathbb{R}^n$, where the corresponding function is $x \mapsto w \cdot x$. For a vector $w$, $\|w\|_0$ is the number of non-zero elements of $w$.

For any $0 < l \le u$ and integer $k$, define the class of linear functions:

$$\mathrm{Lin}^k_{l,u} = \{ x \mapsto w \cdot x \mid \|w\|_0 \le k;\ w_i \ne 0 \Rightarrow l \le |w_i| \le u \}$$

Thus, $\mathrm{Lin}^k_{l,u}$ is the class of $k$-sparse linear functions, where the “influence” of each variable is upper and lower bounded.

Let $D$ be a distribution over $\mathbb{R}^n$. For $w, w' \in \mathbb{R}^n$, define the inner product $\langle w, w' \rangle = \mathbb{E}_{x \sim D}[(w \cdot x)(w' \cdot x)]$, where $\cdot$ denotes the standard dot product in $\mathbb{R}^n$. In this paper, we use $\|w\|$ to denote $\sqrt{\langle w, w \rangle}$ (and not $\sqrt{w \cdot w}$). To avoid confusion, whenever necessary, we will refer to the quantity $\|w\|_2 = \sqrt{w \cdot w}$ explicitly if we mean the standard Euclidean norm.

Distribution Classes

We use two classes of distributions for our results in this paper. We define them formally here.

Smooth Bounded Distributions: We consider the class of smooth bounded distributions over $\mathbb{R}^n$. The concept of smoothed analysis of algorithms was introduced by Spielman and Teng [26], and recently the idea has been used in learning theory [15, 18]. We consider distributions that are bounded and have zero mean. Formally, the distributions we consider are defined as:

Definition 2 (-Smooth -Nice Distribution).

A distribution $D$ over $\mathbb{R}^n$ is a $\Delta$-smooth $B$-nice distribution if it is obtained as follows. Let $D'$ be some distribution over $\mathbb{R}^n$, and let $U^n_\Delta$ denote the uniform distribution over $[-\Delta, \Delta]^n$. Then $D$ is obtained by the convolution of $D'$ with $U^n_\Delta$. Furthermore, $D$ satisfies the following:

  1. For all $i$, $\mathbb{E}_{x \sim D}[x_i] = 0$.

  2. For every $x$ in the support of $D$, $\|x\|_\infty \le B$.
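As an aside, drawing a point from a $\Delta$-smooth distribution is straightforward to simulate: draw from the base distribution and add independent uniform noise in every coordinate. A minimal sketch (ours; the two-point base distribution in the usage example is an arbitrary placeholder):

```python
import numpy as np

def sample_smooth(sample_base, delta, size):
    """Draw `size` points from the Delta-smooth version of a base distribution:
    a base draw plus independent Uniform[-delta, delta] noise in every
    coordinate (the convolution in Defn. 2)."""
    base = sample_base(size)                          # shape (size, n)
    noise = np.random.uniform(-delta, delta, size=base.shape)
    return base + noise

# Usage: a zero-mean, 2-point base distribution in n = 5 dimensions, smoothed.
n = 5
corners = np.array([np.ones(n), -np.ones(n)])
draws = sample_smooth(lambda s: corners[np.random.randint(2, size=s)],
                      delta=0.5, size=1000)
```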

Incoherent Distributions: We also consider incoherent distributions. For a distribution $D$ over $\mathbb{R}^n$, the coherence is defined as $\mu(D) = \max_{i \ne j} |\rho_{ij}|$, where $\rho_{ij}$ is the correlation between $x_i$ and $x_j$. Again, we consider bounded distributions with zero mean. We also require the variance to be upper and lower bounded in each dimension. Formally, the distributions we consider are defined as:

Definition 3 (-Incoherent -Nice Distribution).

A distribution $D$ over $\mathbb{R}^n$ is a $\mu$-incoherent $B$-nice distribution if the following hold:

  1. For all $i$, $\mathbb{E}_{x \sim D}[x_i] = 0$, and $\mathbb{E}_{x \sim D}[x_i^2]$ is bounded above and below by positive constants.

  2. For all $i \ne j$, $|\rho_{ij}| \le \mu$.

  3. For all $x$ in the support of $D$, $\|x\|_\infty \le B$.

We say a linear function represented by $w$ is $W$-bounded if $\|w\|_1 \le W$, where $\|w\|_1 = \sum_{i=1}^n |w_i|$. Suppose $w, w'$ are $W$-bounded linear functions, and distribution $D$ is such that for every $x$ in the support of $D$, $\|x\|_\infty \le B$. We consider the squared loss function, which for $y, y' \in \mathbb{R}$ is $\ell(y, y') = (y - y')^2$. Then, for any $x$ in the support of $D$, $\ell(w \cdot x, w' \cdot x) \le 4W^2B^2$. Thus, standard Hoeffding bounds imply that if $\langle x_1, \ldots, x_s \rangle$ is an i.i.d. sample drawn from $D$, then

$$\Pr\left[ \left| \frac{1}{s} \sum_{i=1}^{s} \ell(w \cdot x_i, w' \cdot x_i) - \mathbb{E}_{x \sim D}[\ell(w \cdot x, w' \cdot x)] \right| \ge \tau \right] \le 2 \exp\left( -\frac{s \tau^2}{8 W^4 B^4} \right) \qquad (1)$$

Finally, for a linear function represented by $w$, let $\mathrm{NZ}(w)$ denote the set of non-zero variables in $w$, so $|\mathrm{NZ}(w)| = \|w\|_0$. Then, we have the following Lemma. The proof appears in Appendix A.1.

Lemma 1.

Let $D$ be a $\Delta$-smooth $B$-nice distribution (Defn. 2), let $w \in \mathbb{R}^n$ be a vector and consider the corresponding linear function, $x \mapsto w \cdot x$. Then the following are true:

  1. $\|w\|^2 \ge \frac{\Delta^2}{3} \|w\|_2^2$.

  2. There exists an $i \in \mathrm{NZ}(w)$ such that $w_i \langle w, e_i \rangle \ge \frac{\Delta^2}{3} w_i^2$.

3 Evolving Sparse Linear Functions

In this section, we describe two evolutionary algorithms for evolving sparse linear functions. The first evolves the class $\mathrm{Lin}^k_{l,u}$ under the class of $\Delta$-smooth $B$-nice distributions (Defn. 2), using the selection rule $\mathrm{SelNB}$. The second evolves the class $\mathrm{Lin}^k_{l,u}$ under the more restricted class of $\mu$-incoherent $B$-nice distributions (Defn. 3), using the selection rule $\mathrm{SelOpt}$. We first define the notation used in the rest of this section.

Notation: $D$ denotes the target distribution over $\mathbb{R}^n$; $f$ denotes the ideal (target) function. The inner product and norm of linear functions are with respect to the distribution $D$. $[n]$ denotes the set $\{1, \ldots, n\}$. For $V \subseteq [n]$, $\Pi_V f$ denotes the best linear approximation of $f$ using the variables in the set $V$; formally,

$$\Pi_V f = \operatorname*{argmin}_{w \,:\, \mathrm{NZ}(w) \subseteq V} \|w - f\|$$

Finally, recall that for $w \in \mathbb{R}^n$, $\|w\|_0$ is the number of non-zero entries of $w$ and $\mathrm{NZ}(w)$ is the set of non-zero variables. A vector $w$ represents a linear function, $x \mapsto w \cdot x$. The vector $e_i$ has $1$ in coordinate $i$ and $0$ elsewhere, and corresponds to the linear function $x \mapsto x_i$. Thus, in this notation, $w = \sum_{i \in \mathrm{NZ}(w)} w_i e_i$. The accuracy parameter is denoted by $\epsilon$.
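Given a sample of pairs $(x, f(x))$, the projection $\Pi_V f$ can be approximated by least squares restricted to the coordinates in $V$; the following small sketch (our own illustration, with names that are ours) computes this empirical version.

```python
import numpy as np

def project_onto_variables(X, fx, V):
    """Approximate Pi_V f: the best linear fit to f using only variables in V,
    computed by least squares on a sample (X, fx).

    X: (s, n) sample points, fx: (s,) values f(x), V: iterable of coordinate indices.
    """
    V = list(V)
    coef, *_ = np.linalg.lstsq(X[:, V], fx, rcond=None)
    w = np.zeros(X.shape[1])
    w[V] = coef                     # zero outside V, fitted coefficients inside
    return w
```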

3.1 Evolving Sparse Linear Functions Using

We present a simple mechanism that evolves the class $\mathrm{Lin}^k_{l,u}$ of sparse linear functions with respect to $\Delta$-smooth $B$-nice distributions (see Defn. 2). The representation class also consists of sparse linear functions, but with a greater number of non-zero entries than the ideal function. We also assume that a linear function is represented by the vector $w$ itself, where each $w_i$ is a real number. (Handling the issues of finite precision is standard and is avoided in favor of simplicity.) Define the parameters $K$ (a sparsity bound) and $W$ (a bound on the $\ell_1$-norm), both polynomial in $k$, $u$, $B$, $1/\Delta$ and $1/\epsilon$. Formally, the representation class is:

$$R = \{ w \in \mathbb{R}^n \mid \|w\|_0 \le K,\ \|w\|_1 \le W \}$$

The important point to note is that the parameters $K$ and $W$ do not depend on $n$, the total number of variables.

Next, we define the mutator. Recall that the mutator is a randomized algorithm that takes as input an element $w \in R$ and accuracy parameter $\epsilon$, and outputs a multiset $\mathrm{Neigh}(w, \epsilon)$. Here, $\mathrm{Neigh}(w, \epsilon)$ is populated by $m$ independent draws from the following procedure, where $m$ will be specified later (see the proof of Theorem 1). Starting from $w$, define the mutated representation $w'$, output by the mutator, as follows (a code sketch appears after the list):

  1. Scaling: With probability $1/3$, choose $\gamma$ uniformly at random from a suitable bounded interval and let $w' = \gamma w$.

  2. Adjusting: With probability $1/3$, do the following. Pick $i \in \mathrm{NZ}(w)$ uniformly at random. Let $w'$ denote the mutated representation, where $w'_j = w_j$ for $j \ne i$, and $w'_i$ is chosen uniformly at random from a suitable bounded interval.

  3. With the remaining probability, do the following:

    1. Swapping: If $\|w\|_0 = K$, choose $i \in \mathrm{NZ}(w)$ uniformly at random. Then, choose $j \notin \mathrm{NZ}(w)$ uniformly at random. Let $w'$ be the mutated representation, where $w'_l = w_l$ for $l \notin \{i, j\}$. Set $w'_i = 0$ and choose $w'_j$ uniformly at random from a suitable bounded interval. In this case, $\|w'\|_0 \le \|w\|_0$, and hence $w' \in R$.

    2. Adding: If $\|w\|_0 < K$, choose $i \notin \mathrm{NZ}(w)$ uniformly at random. Let $w'$ be the mutated representation, where $w'_j = w_j$ for $j \ne i$, and $w'_i$ is chosen uniformly at random from a suitable bounded interval.
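A compact sketch of this mutator follows (our own illustration; the mixing probabilities and the interval $[-c, c]$ from which fresh coefficients are drawn are placeholders, not the tuned values from the analysis).

```python
import numpy as np

def mutate(w, K, c=1.0, rng=None):
    """One draw of the Section 3.1 mutation procedure (illustrative parameters).

    Assumes K < n so that there is always an empty coordinate to add or swap in.
    """
    rng = rng if rng is not None else np.random.default_rng()
    w2 = w.copy()
    nz = np.flatnonzero(w2)                   # NZ(w): current support
    u = rng.random()
    if u < 1/3 and len(nz) > 0:               # scaling: w' = gamma * w
        w2 *= rng.uniform(0.0, 2.0)
    elif u < 2/3 and len(nz) > 0:             # adjusting: resample one active weight
        i = rng.choice(nz)
        w2[i] = rng.uniform(-c, c)
    else:
        zeros = np.flatnonzero(w2 == 0)
        if len(zeros) == 0:
            return w2
        if len(nz) == K:                      # swapping: drop one variable, add another
            i, j = rng.choice(nz), rng.choice(zeros)
            w2[i] = 0.0
            w2[j] = rng.uniform(-c, c)
        else:                                 # adding: bring in a fresh variable
            j = rng.choice(zeros)
            w2[j] = rng.uniform(-c, c)
    return w2
```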

Recall that $f$ denotes the ideal (target) function and $D$ is the underlying distribution, which is $\Delta$-smooth $B$-nice (see Defn. 2). Since we are working with the squared loss metric, the expected loss for any $w \in R$ is given by $L_D(w) = \|w - f\|^2$. We will show that for any $w \in R$, if $\|w - f\|^2 > \epsilon$, then with non-negligible (inverse polynomial) probability, the above procedure produces a mutation that decreases the expected loss by at least some inverse polynomial amount. Thus, by setting the size $m$ of the neighborhood large enough, we can guarantee that with high probability there will always exist a beneficial mutation.

To simplify notation, let $V = \mathrm{NZ}(w)$. Recall that $\Pi_V f$ denotes the best approximation to $f$ using variables in the set $V$; thus, $\|w - f\|^2 = \|w - \Pi_V f\|^2 + \|\Pi_V f - f\|^2$. At a high level, the argument for proving the success of our evolutionary mechanism is as follows: If $\|w - \Pi_V f\|$ is large, then a mutation of the type “scaling” or “adjusting” will move $w$ closer to $\Pi_V f$, reducing the expected loss. (The role of “scaling” mutations is primarily to ensure that the representations remain bounded.) If $\|w - \Pi_V f\|$ is small and $\|w - f\|^2 > \epsilon$, there must be a variable in $\mathrm{NZ}(f) \setminus V$ that, when added to $w$ (possibly by swapping), reduces the expected loss. Thus, as long as the representation is far from the evolutionary target, a beneficial mutation is produced with high probability.

More formally, let $w'$ denote a random mutation produced as a result of the procedure described above. We will establish the desired result by proving the following claims.

Claim 1.

If $|\langle w - f, w \rangle|$ is at least an inverse-polynomial threshold, then with probability bounded below by an inverse polynomial in the relevant parameters, $L_D(w') \le L_D(w) - \delta_1$, for some inverse-polynomial $\delta_1$. In particular, a “scaling” type mutation achieves this.

Claim 2.

When $\|w - \Pi_V f\|^2$ is at least an inverse-polynomial threshold, then with probability bounded below by an inverse polynomial in the relevant parameters, $L_D(w') \le L_D(w) - \delta_2$, for some inverse-polynomial $\delta_2$. In particular, an “adjusting” type mutation achieves this.

Claim 3.

When $\|w - \Pi_V f\|^2$ is small, but $L_D(w) = \|w - f\|^2 > \epsilon$, then with probability bounded below by an inverse polynomial in the relevant parameters, $L_D(w') \le L_D(w) - \delta_3$, for some inverse-polynomial $\delta_3$. In particular, a mutation of type “swapping” or “adding” achieves this.

Note that when $\mathrm{NZ}(f) \subseteq V$, then $\Pi_V f = f$. Thus, in this case when $\|w - \Pi_V f\|^2 \le \epsilon$, the evolutionary algorithm has succeeded.

The proofs of the above Claims are provided in Appendix A.2. We now prove our main result using the above claims.

Theorem 1.

Let $\mathcal{D}$ be the class of $\Delta$-smooth $B$-nice distributions over $\mathbb{R}^n$ (Defn. 2). Then the class $\mathrm{Lin}^k_{l,u}$ is evolvable with respect to $\mathcal{D}$ under the squared loss, using the representation class $R = \{ w \mid \|w\|_0 \le K, \|w\|_1 \le W \}$, where $K$ and $W$ are polynomial in $k$, $u$, $B$, $1/\Delta$ and $1/\epsilon$, using the mutation algorithm described in this section, and the selection rule $\mathrm{SelNB}$. Furthermore, the following are true:

  1. The number of generations required is polynomial in $k$, $B$, $1/\Delta$ and $1/\epsilon$, and is independent of $n$, the total number of attributes.

  2. The size function $s$, the number of points used to calculate empirical losses, depends polylogarithmically on $n$, and polynomially on the remaining parameters.

Proof.

The mutator is as described in this section. Let

$$\delta = \min\{\delta_1, \delta_2, \delta_3\}$$

denote the smallest of the (inverse-polynomial) loss decreases guaranteed by Claims 1, 2 and 3, and let

$$p$$

denote the smallest of the corresponding (inverse-polynomial) probabilities. Now, by Claims 1, 2 and 3, if $L_D(w) = \|w - f\|^2 > \epsilon$, then a single draw of the mutation procedure outputs a mutation that decreases the squared loss by $\delta$, with probability at least $p$.

Recall that representations in $R$ are $W$-bounded and that $\|x\|_\infty \le B$ for $x$ in the support of the distribution, so the expected loss of any representation in $R$ is at most $4W^2B^2$. We will show that evolution succeeds in at most $g = \lceil 24 W^2 B^2 / \delta \rceil$ generations. Note that $g$ has no dependence on $n$, the number of attributes, and polynomial dependence on the remaining parameters. Define $m = \lceil (1/p) \ln(2g/\epsilon) \rceil$, and at each time step populate $\mathrm{Neigh}(w, \epsilon)$ with $m$ independent draws of the mutation procedure. Together with the observation above, this implies that except with probability $\epsilon/2$, for all $i \le g$, if $w^i$ is the representation at time step $i$, $\mathrm{Neigh}(w^i, \epsilon)$ contains a mutation that decreases the loss by at least $\delta$, if $L_D(w^i) > \epsilon$.

Now, let $t = \delta/2$ be the tolerance function, set $\tau = \delta/6$, and let $s$ be a size function large enough that, by (1), a sample of size $s$ gives $\tau$-accurate empirical losses for a single representation except with probability $\epsilon / (2g(m+1))$; since $\delta$ is inverse polynomial, $t$ is polynomially sandwiched. If $\langle x_1, \ldots, x_s \rangle$ is an i.i.d. sample drawn from $D$, then for each of the at most $g(m+1)$ representations that may be considered in the neighborhoods for the first $g$ time steps, it holds that $|\hat{L}_D(w) - L_D(w)| \le \tau$ simultaneously, except with probability $\epsilon/2$ (by a union bound). Thus, allowing for failure probability $\epsilon$, we assume that we are in the case when the neighborhood always has a mutation that decreases the expected loss by $\delta$ (whenever the expected loss of the current representation is at least $\epsilon$) and that all empirical expected losses are $\tau$-close to the true expected losses.

Now let $w$ be the representation at some generation such that $L_D(w) > \epsilon$, and let $w^* \in \mathrm{Neigh}(w, \epsilon)$ be such that $L_D(w^*) \le L_D(w) - \delta$. Then, it is the case that $\hat{L}_D(w^*) \le \hat{L}_D(w) - \delta + 2\tau = \hat{L}_D(w) - 2\delta/3$ (when all empirical estimates are $\tau$-accurate). Hence, for tolerance function $t = \delta/2$, for the selection rule $\mathrm{SelNB}$, $w^* \in \mathrm{Bene}$. Consequently $\mathrm{Bene} \neq \emptyset$. Hence, the representation at the next generation is chosen from $\mathrm{Bene}$. Let $w'$ be the chosen representation. It must be the case that $\hat{L}_D(w') \le \hat{L}_D(w) - \delta/2$. Thus, we have $L_D(w') \le L_D(w) - \delta/2 + 2\tau = L_D(w) - \delta/6$. Hence, the expected loss decreases by at least $\delta/6$.

Note that at no point can the expected loss be greater than $4W^2B^2$ for any representation in $R$. Hence, in at most $g = \lceil 24 W^2 B^2 / \delta \rceil$ generations, evolution reaches a representation with expected loss at most $\epsilon$. Note that the only parameter introduced which has an inverse polynomial dependence on $n$ is $p$ (and hence $m$), through the random choice of a variable to add or swap. This implies that $s$ only has polylogarithmic dependence on $n$. This concludes the proof of the theorem. ∎
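In code, the proof corresponds to a simple simulation loop; the toy driver below (ours) wires the `mutate` sketch above to beneficial/neutral selection on weight vectors, with an illustrative tolerance in place of the tuned one.

```python
import numpy as np

def evolve(sample_x, f, n, K, epsilon, generations=2000, m=50, rng=None):
    """Toy driver: mutation plus SelNB-style selection on weight vectors.

    sample_x(): returns a fresh (s, n) i.i.d. sample each call; f: ideal
    weight vector. Tolerance and neighborhood size are illustrative.
    """
    rng = rng if rng is not None else np.random.default_rng()
    w = np.zeros(n)
    t = epsilon / 10                                   # illustrative tolerance
    for _ in range(generations):
        X = sample_x()
        emp = lambda v: np.mean((X @ v - X @ f) ** 2)  # empirical squared loss
        base = emp(w)
        if base <= epsilon:
            return w                                   # evolution succeeded
        neigh = [mutate(w, K, rng=rng) for _ in range(m)]
        bene = [v for v in neigh if emp(v) <= base - t]
        neut = [v for v in neigh if abs(emp(v) - base) < t]
        pool = bene if bene else neut
        if not pool:
            return None                                # failure of evolution
        w = pool[rng.integers(len(pool))]
    return w
```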

Remark 3.

We note that the same evolutionary mechanism works when evolving the broader class of $k$-sparse linear functions without a lower bound on the magnitudes of the non-zero weights, as long as the sparsity of the representation class is allowed polynomial dependence on $1/\epsilon$, the inverse of the accuracy parameter. This is consistent with the notion of attribute-efficiency, where the goal is that the information complexity should be polylogarithmic in the number of attributes, but may depend polynomially on $1/\epsilon$.

3.2 Evolving Sparse Linear Functions Using

In this section, we present a different evolutionary mechanism for evolving sparse linear functions. This algorithm is essentially an adaptation of a greedy algorithm commonly known as orthogonal matching pursuit (OMP) in the signal processing literature (see [7, 27]). Our analysis requires stronger properties of the distribution: we show that $k$-sparse linear functions can be evolved with respect to $\mu$-incoherent $B$-nice distributions (Defn. 3), for $\mu$ sufficiently small as a function of $k$. Here, the selection rule used is selection using optimization ($\mathrm{SelOpt}$). Also, the algorithm is guaranteed to succeed only with initialization from the zero function. Nevertheless, this evolutionary algorithm is appealing due to its simplicity and because it never uses a representation that is not a $k$-sparse linear function.

Recall that $f$ is the ideal (target) function. Let

$$R = \{ w \in \mathbb{R}^n \mid \|w\|_0 \le k,\ \|w\|_1 \le W \}$$

where $W$ is polynomial in $k$, $u$ and $B$. Now, starting from $w$, define the action of the mutator as follows (we will define the parameters $q$ and $m$ later in the proof of Theorem 2; a code sketch appears after the list):

  1. Adding: With probability $q$, do the following. Recall that $\mathrm{NZ}(w)$ denotes the set of non-zero entries of $w$. If $\|w\|_0 < k$, choose $i \notin \mathrm{NZ}(w)$ uniformly at random. Let $w'$ be such that $w'_j = w_j$ for $j \ne i$, and $w'_i$ is chosen uniformly at random from a suitable bounded interval. If $\|w\|_0 = k$, let $w' = w$. Then, the multiset $\mathrm{Neigh}(w, \epsilon)$ is populated by $m$ independent draws from the procedure just described.

  2. With probability $1 - q$, do the following:

    1. Identical: With probability $1/3$, output $w' = w$.

    2. Scaling: With probability $1/3$, choose $\gamma$ uniformly at random from a suitable bounded interval and let $w' = \gamma w$.

    3. Adjusting: With probability $1/3$, do the following. Pick $i \in \mathrm{NZ}(w)$ uniformly at random. Let $w'$ be such that $w'_j = w_j$ for $j \ne i$, and $w'_i$ is chosen uniformly at random from a suitable bounded interval.

    Then, the multiset $\mathrm{Neigh}(w, \epsilon)$ is populated by $m$ independent draws from the procedure just described.
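A sketch of this mutator follows (our own illustration; the probability `q` of an “adding” round, the within-round split, and the coefficient interval are placeholders). Note that it reproduces the correlated structure described next: every draw in a round is of the same kind.

```python
import numpy as np

def mutate_opt(w, k, m, q=0.05, c=1.0, rng=None):
    """Return the multiset Neigh(w, eps) for the Section 3.2 mutator:
    m draws, all of one round type (illustrative parameters)."""
    rng = rng if rng is not None else np.random.default_rng()
    nz = np.flatnonzero(w)
    neigh = []
    if rng.random() < q:                          # "adding" round
        for _ in range(m):
            w2 = w.copy()
            if len(nz) < k:
                j = rng.choice(np.flatnonzero(w2 == 0))
                w2[j] = rng.uniform(-c, c)        # bring in a fresh variable
            neigh.append(w2)                      # if ||w||_0 = k, w' = w
    else:                                         # identical / scaling / adjusting round
        for _ in range(m):
            w2 = w.copy()
            u = rng.random()
            if u < 1/3:
                pass                              # identical: w' = w
            elif u < 2/3:
                w2 *= rng.uniform(0.0, 2.0)       # scaling: w' = gamma * w
            elif len(nz) > 0:
                i = rng.choice(nz)                # adjusting: resample one weight
                w2[i] = rng.uniform(-c, c)
            neigh.append(w2)
    return neigh
```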

One thing to note in the above definition is that the mutations produced by the mutator at any given time are correlated, i.e., they are all either of the kind that add a new variable, or all of the kind that just manipulate existing variables. At a high level, we prove the success of this mechanism as follows:

  1. Using mutations of type “scaling” or “adjusting,” a representation that is close to the best in the current variable space is evolved, i.e., $w$ approaches $\Pi_V f$ for $V = \mathrm{NZ}(w)$.

  2. When the representation is (close to) the best possible using the current variables, adding one of the variables that is present in the ideal function, but not in the current representation, results in the greatest reduction of expected loss. Thus, selection based on optimization would always add a variable in $\mathrm{NZ}(f)$. By tuning $q$ appropriately, it is ensured that with high probability, candidate mutations that add new variables are not chosen until evolution has had time to approach the best representation using existing variables.

To complete the proof we establish the following claims.

Claim 4.

Suppose $V = \mathrm{NZ}(w) \subseteq \mathrm{NZ}(f)$ and $\|w - \Pi_V f\|^2$ is sufficiently small. Then, if $\mathrm{NZ}(f) \not\subseteq V$, there exist quantities $c_1 > c_2$, whose gap $c_1 - c_2$ is inverse polynomial in the relevant parameters, such that for any mutation $w'$ that adds some variable in $\mathrm{NZ}(f) \setminus V$ with a suitable coefficient, $L_D(w') \le L_D(w) - c_1$, and for any mutation $w''$ that adds a variable not in $\mathrm{NZ}(f)$, $L_D(w'') \ge L_D(w) - c_2$. Furthermore, a mutation of the former kind is produced with probability bounded below by an inverse polynomial.

Claim 5.

Conditioned on the mutator not outputting mutations that add a new variable, if $\|w - \Pi_V f\|^2$ is not already sufficiently small, then with probability bounded below by an inverse polynomial, there exists a mutation in $\mathrm{Neigh}(w, \epsilon)$ that reduces the squared loss by at least some inverse-polynomial amount $\delta$.

The proofs of Claims 4 and 5 are not difficult and are provided in Appendix A.3. Based on the above claims we can prove the following theorem:

Theorem 2.

Let $\mathcal{D}$ be the class of $\mu$-incoherent $B$-nice distributions over $\mathbb{R}^n$ (Defn. 3), for $\mu$ sufficiently small as a function of $k$. Then, the class $\mathrm{Lin}^k_{l,u}$ is evolvable with respect to $\mathcal{D}$ under the squared loss by an evolutionary algorithm, using the mutation algorithm described in this section, selection rule $\mathrm{SelOpt}$, and the representation class $R = \{ w \mid \|w\|_0 \le k, \|w\|_1 \le W \}$, where $W$ is polynomial in $k$, $u$ and $B$. Furthermore, the following are true:

  1. The number of generations is polynomial in $k$, $B$ and $1/\epsilon$, but independent of the dimension $n$.

  2. The size function $s$, the number of points used to calculate the empirical losses, depends polylogarithmically on $n$ and polynomially on the remaining parameters.

Proof.

The proof is straightforward, although a bit heavy on notation; we provide a sketch here. The mutator is as described in this section. Let $\delta$ denote the inverse-polynomial loss decrease guaranteed by Claim 5 for mutations of types other than “adding,” and let $c_1$ and $c_2$ be the quantities from Claim 4. Also, let $\tau$ be the accuracy to which empirical losses are estimated, and let $t$ be the tolerance function; both are chosen inverse polynomial, and small compared to $\delta$ and to the gap $c_1 - c_2$.

First, we show that between the “rare” time steps when the mutator outputs mutations that add a new variable, evolution has enough time to stabilize (reach close to local optimality) using existing variables. To see this, consider a sequence of coin tosses, where the probability of heads is $q$ and the probability of tails is $1 - q$. Let $T_i$ be the number of tails between the $(i-1)$-th and $i$-th heads. For a suitably small inverse-polynomial choice of $q$, a simple union bound together with Markov’s inequality shows that, except with small failure probability, after polynomially many time steps, at least $k$ time steps where the mutator outputs mutations of type “adding” have occurred, and the first $k$ of these occurrences are all separated by at least $N$ time steps of other types of mutations, for a suitably large polynomial $N$.

Also, let $m$ be the neighborhood size and let $s$ be the size function, both polynomially large. These values ensure that, for the first $g$ generations, except with small failure probability, the mutator always produces every mutation that had at least an inverse-polynomial probability of being produced (conditioned on the type of mutations output by the mutator at that time step), and that for all the representations concerned, $|\hat{L}_D(w) - L_D(w)| \le \tau$, where $\tau$ is as above. Thus, allowing the process to fail with probability $\epsilon$, we assume that none of the undesirable events have occurred.

We will show that the steps with mutations other than “adding” are sufficient to ensure that evolution reaches the (almost) best possible target with the variables available to it. In particular, if the set of available variables is $V$, the representation $w$ reached by evolution will satisfy $\|w - \Pi_V f\|^2 \le \epsilon'$, for a suitably chosen $\epsilon'$. For now, suppose that this is the case.

We claim by induction that evolution never adds a “wrong” variable, i.e., one that is not present in the target function $f$. The base case is trivially true, since the starting representation is $0$. Now suppose, just before a “heads” step, the representation is $w$, such that $V = \mathrm{NZ}(w) \subseteq \mathrm{NZ}(f)$ and $\|w - \Pi_V f\|^2 \le \epsilon'$. The current step is assumed to be a “heads” step, thus the mutator has produced mutations by adding a new variable. Then, using Claim 4, we know that there is a mutation $w'$ in $\mathrm{Neigh}(w, \epsilon)$ such that $L_D(w') \le L_D(w) - c_1$ (obtained by adding a correct variable). Since the empirical estimates are $\tau$-accurate and $\tau$ is small, it must be the case that $\hat{L}_D(w') \le \hat{L}_D(w) + t$. This ensures that the set $\mathrm{Opt}$, for selection rule $\mathrm{SelOpt}$, is not empty. Furthermore, we claim that no mutation that adds an irrelevant variable can be in $\mathrm{Opt}$. Suppose $w''$ is a mutation that adds an irrelevant variable; according to Claim 4, $L_D(w'') \ge L_D(w) - c_2$, and hence $\hat{L}_D(w'') \ge \hat{L}^* + (c_1 - c_2) - 2\tau > \hat{L}^* + t$ for the chosen values of $\tau$ and $t$. This ensures that every representation in $\mathrm{Opt}$ corresponds to a mutation that adds some relevant variable. Thus, the evolutionary algorithm never adds any irrelevant variable.

Finally, note that during a “tails” step (when the mutator produces mutations of types other than “adding”), as long as $\|w - \Pi_V f\|^2 > \epsilon'$, there exists a mutation that reduces the expected loss by at least $\delta$. This implies that the set $\mathrm{Opt}$ is non-empty and, for the chosen values of tolerance $t$ and accuracy $\tau$, any mutation from the set $\mathrm{Opt}$ reduces the expected loss by at least $\delta/6$. (This argument is identical to the one in Theorem 1.) Since the maximum loss is at most $4W^2B^2$ for the class of distributions and representations considered, in at most $O(W^2B^2/\delta)$ steps, a representation satisfying $\|w - \Pi_V f\|^2 \le \epsilon'$ must be reached. Note that once such a representation is reached, it is ensured that the loss does not increase substantially, since with probability at least $1/3$ the mutator outputs the same representation; hence, it is guaranteed that there is always a neutral mutation. Thus, before the next “heads” step, it must be the case that $\|w - \Pi_V f\|^2 \le \epsilon'$. If $\epsilon'$ is set to $\epsilon$, the evolutionary algorithm using the selection rule $\mathrm{SelOpt}$ succeeds.

It is readily verified that the values of $g$, $m$, $s$ and $t$ satisfy the claims in the statement of the theorem. ∎

4 Conclusion and Future Work

In this work, we provided simple evolutionary mechanisms for evolving sparse linear functions, under a large class of distributions. These evolutionary algorithms have the desirable properties that the representations used are themselves sparse linear functions, and that they are attribute-efficient in the sense that the number of generations required for evolution to succeed is independent of the total number of attributes.

Strong negative results are known for distribution-independent evolvability of boolean functions, e.g., even the class of conjunctions is not evolvable [11]. However, along the lines of this work, it is interesting to study whether under restricted classes of distributions, evolution is possible for simple concept classes, using representations of low complexity. Currently, even under (biased) product distributions, no evolutionary mechanism is known for the class of disjunctions, except via Feldman’s general reduction from CSQ algorithms. Even if the queries made by the CSQ algorithm are simple, Feldman’s reduction uses intermediate representations that randomly combine queries made by the algorithm, making the representations quite complex.

A natural extension of our current results would be to study fixed-degree sparse polynomials. Another interesting direction is to study circuits with sigmoidal or other non-linear filters on the gates, which arise naturally in molecular systems. The suitable class of boolean functions to study is low-weight threshold functions, which includes disjunctions and conjunctions. The class of smooth bounded distributions may be an appropriate starting place for studying evolvability of these classes. For example, is the class of low-weight threshold functions evolvable under smooth distributions, or at least log-concave distributions?

Acknowledgments

We would like to thank Leslie Valiant for helpful discussions and comments on an earlier version of this paper. We are grateful to Frank Solomon for discussing biological aspects related to this work.

Appendix A Omitted Proofs

A.1 Proofs from Section 2.2

Proof of Lemma 1.

Note that for any $x$ drawn from $D$, we can write $x = y + z$, where $y$ is drawn from some distribution $D'$, and $z$ is drawn from the uniform distribution over $[-\Delta, \Delta]^n$. Note that $y$ and $z$ are independent, and all components of $z$ are independent, with $\mathbb{E}[z_i] = 0$ and $\mathbb{E}[z_i^2] = \Delta^2/3$. First, we observe that $\|w\|^2 = \sum_i w_i \langle w, e_i \rangle$. Now, consider the following:

$$\|w\|^2 = \mathbb{E}_{x \sim D}[(w \cdot x)^2] = \mathbb{E}[(w \cdot y)^2] + \mathbb{E}[(w \cdot z)^2] = \mathbb{E}[(w \cdot y)^2] + \frac{\Delta^2}{3} \|w\|_2^2$$

The conclusions of the Lemma follow easily by looking at the above expression. ∎

A.2 Proofs from Section 3.1

Proof of Claim 1.

We show that in this case, a “scaling” mutation achieves the desired result. Restricted to the direction $w$, the best approximation to $f$ is $\frac{\langle f, w \rangle}{\|w\|^2} w$. We have that

$$\|\gamma w - f\|^2 = \gamma^2 \|w\|^2 - 2\gamma \langle f, w \rangle + \|f\|^2$$

Hence, if $\frac{\langle f, w \rangle}{\|w\|^2} < 1$, for $\gamma < 1$ (and similarly if $\frac{\langle f, w \rangle}{\|w\|^2} > 1$ for $\gamma > 1$), we have that