$\epsilon$-Lexicase selection: a probabilistic and multi-objective analysis of lexicase selection in continuous domains
Abstract
Lexicase selection is a parent selection method that considers training cases individually, rather than in aggregate, when performing parent selection. Whereas previous work has demonstrated the ability of lexicase selection to solve difficult problems, the central goal of this paper is to develop the theoretical underpinnings that explain its performance. To this end, we derive an analytical formula that gives the expected probabilities of selection under lexicase selection, given a population and its behavior. In addition, we expand upon the relation of lexicase selection to many-objective optimization methods to describe the behavior of lexicase selection, which is to select individuals on the boundaries of Pareto fronts in high-dimensional space. We show analytically why lexicase selection performs more poorly for certain sizes of population and training cases, and why it has been shown to perform more poorly in continuous error spaces. To address this last concern, we introduce $\epsilon$-lexicase selection, which modifies the pass condition in lexicase selection to allow near-elite individuals to pass cases, thereby improving selection performance with continuous errors. We show that $\epsilon$-lexicase selection outperforms several diversity-maintenance strategies on a number of real-world and synthetic regression problems.
William La Cava lacava@upenn.edu
Institute for Bioinformatics, University of Pennsylvania, Philadelphia, PA, 19104, USA
1 Introduction
Evolutionary computation (EC) traditionally assigns scalar fitness values to candidate solutions to determine how to guide search. In the case of genetic programming (GP), this fitness value summarizes how closely, on average, the behavior of a candidate program matches the desired behavior. Take for example the task of symbolic regression, in which we attempt to find a model using a set of training examples, i.e. cases. A typical fitness measure is the mean squared error (MSE), which averages the squared differences between the model’s outputs, $\hat{y}$, and the target outputs, $y$. The effect of this averaging is to reduce a rich set of information comparing the model’s output and the desired output to a single scalar value. As noted by Krawiec (2016), the relationship of $\hat{y}$ to $y$ can only be represented crudely by this fitness value. The fitness score thereby restricts the information conveyed to the search process about candidate programs relative to the description of their behavior available in the raw comparisons of the output to the target, information which could help guide the search (Krawiec et al., 2015; Krawiec and Liskowski, 2015). This observation has led to increased interest in the development of methods that can leverage the program outputs directly to drive search more effectively (Vanneschi et al., 2014).
In addition to reducing information, averaging test performance assumes all tests are equally informative, leading to the potential loss of individuals who perform poorly on average even if they are the best on a training case that is difficult for most of the population to solve. This is particularly relevant for problems that require different modes of behavior to produce an adequate solution to the problem (Spector, 2012). The underlying assumption of traditional selection methods is that selection pressure should be applied evenly with respect to training cases. In practice, cases that comprise the problem are unlikely to be uniformly difficult. As a result, the search is likely to benefit if it can take into account the difficulty of specific cases by recognizing individuals that perform well on harder parts of the problem. Underlying this last point is the assumption that GP solves problems by identifying, propagating and recombining partial solutions (i.e. building blocks) to the task at hand (Poli and Langdon, 1998). As a result, a program that performs well on unique subsets of the problem may imply a partial solution to our task.
Several methods have been proposed to reward individuals with uniquely good test performance, such as implicit fitness sharing (IFS) (McKay, 2001), historically assessed hardness (Klein and Spector, 2008), and co-solvability (Krawiec and Lichocki, 2010), all of which assign greater weight to fitness cases that are judged to be more difficult in view of the population performance. Perhaps the most effective parent selection method designed to account for case hardness is lexicase selection (Spector, 2012). In particular, “global pool, uniform random sequence, elitist lexicase selection” (Spector, 2012), which we refer to simply as lexicase selection, has outperformed other similarly motivated methods in recent studies (Helmuth et al., 2014; Helmuth and Spector, 2015; Liskowski et al., 2015). Despite these gains, it fails to produce such benefits when applied to continuous symbolic regression problems, due to its method of selecting individuals based on training case elitism. For this reason we recently proposed (La Cava et al., 2016b) modulating the case pass conditions in lexicase selection using an automatically defined $\epsilon$ threshold, allowing the benefits of lexicase selection to be achieved in continuous domains.
To date, lexicase selection and $\epsilon$-lexicase selection have mostly been analyzed via empirical studies, rather than algorithmic analysis. In particular, previous work has not explicitly described the probabilities of selection under lexicase selection compared to other selection methods, nor how lexicase selection relates to the multi-objective literature. Therefore, the foremost purpose of this paper is to describe analytically how lexicase selection and $\epsilon$-lexicase selection operate on a given population compared to other approaches. With this in mind, in §6 we derive an equation that describes the expected probability of selection for individuals in a given population based on their behavior on the training cases, for all variants of lexicase selection. Then in §7, we analyze lexicase and $\epsilon$-lexicase selection from a multi-objective viewpoint, in which we imagine each training case to be an objective. We prove that individuals selected by lexicase selection exist at the boundaries of the Pareto front defined by the program error vectors. We show via an illustrative example population in §8 how the probabilities of selection differ under tournament, lexicase, and $\epsilon$-lexicase selection.
The second purpose of this paper is to empirically assess the use of $\epsilon$-lexicase selection in continuous domains. In §4, we define two new variants of $\epsilon$-lexicase selection, semi-dynamic and dynamic, which are shown to improve the method compared to the original static implementation. A set of experiments compares variants of $\epsilon$-lexicase selection to several existing selection techniques on a set of real-world benchmark problems. The results show the ability of $\epsilon$-lexicase selection to improve the predictive accuracy of models on these problems. We examine in detail the diversity of programs during these runs, as well as the number of cases used in selection events, to validate our hypothesis that $\epsilon$-lexicase selection allows for more cases to be used when selecting individuals compared to lexicase selection. Lastly, the time complexity of $\epsilon$-lexicase selection is experimentally analyzed as a function of population size.
2 Preliminaries
In symbolic regression, we attempt to find a model that maps variables to a target output using a set of training examples $\mathcal{T} = \{t_i = (y_i, \mathbf{x}_i)\}_{i=1}^{T}$, where $\mathbf{x}$ is a $d$-dimensional vector of variables, i.e. features, and $y$ is the desired output. We refer to elements of $\mathcal{T}$ as “cases”. GP poses the problem as
$$p^* = \arg\min_{p \in \mathfrak{P}} f(p, \mathcal{T}) \tag{1}$$
where $\mathfrak{P}$ is the space of possible programs and $f$ denotes a minimized fitness function. GP attempts to solve the symbolic regression task by optimizing a population of $N$ programs $\mathcal{N} = \{n_i\}_{i=1}^{N}$, each of which encodes a model of the process and produces an estimate $\hat{y}_t(n)$ when evaluated on case $t$. We refer to $\hat{y}(n)$ as the semantics of program $n$, omitting $\mathbf{x}$ for brevity. We denote the squared differences between $\hat{y}$ and $y$ (i.e., the errors) as $e_t(n) = (y_t - \hat{y}_t(n))^2$. We use $\mathbf{e}_t \in \mathbb{R}^{N}$ to refer to the errors of all programs in the population on training case $t$. The lowest error in $\mathbf{e}_t$ is referred to as $e_t^*$.
A typical fitness measure ($f$) is the mean squared error, $MSE(n, \mathcal{T}) = \frac{1}{T}\sum_{t \in \mathcal{T}} e_t(n)$, which we use to compare our results in §10. For the purposes of our discussion, it is irrelevant whether the MSE or the mean absolute error, i.e. MAE, is used, and so we use MAE to simplify a few examples throughout the paper. With lexicase selection and its variants, $e_t(n)$ is used directly during selection rather than averaging over cases. Nevertheless, in keeping with the problem statement in Eqn. 1, the final program returned in our experiments is that which minimizes the MSE.
3 Lexicase Selection
Lexicase selection is a parent selection technique based on lexicographic ordering of training (i.e. fitness) cases. The lexicase selection algorithm for a single selection event is presented below:
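Algorithm 3.1: Lexicase Selection

Selection($\mathcal{N}$, $\mathcal{T}$, ns) :
    $\mathcal{P} \leftarrow \emptyset$    // set of selected parents
    do ns times:    // ns is the number of selection events
        $\mathcal{P} \leftarrow \mathcal{P} \cup$ GetParent($\mathcal{N}$, $\mathcal{T}$)    // add selected program to $\mathcal{P}$
GetParent($\mathcal{N}$, $\mathcal{T}$) :
    $\mathcal{T}' \leftarrow \mathcal{T}$    // training cases
    $\mathcal{S} \leftarrow \mathcal{N}$    // initial selection pool is the population
    while $|\mathcal{T}'| > 0$ and $|\mathcal{S}| > 1$:    // main loop
        $t \leftarrow$ random choice from $\mathcal{T}'$    // consider a random case
        elite $\leftarrow$ best fitness in $\mathcal{S}$ on $t$    // determine elite fitness
        $\mathcal{S} \leftarrow \{n \in \mathcal{S} :$ fitness($n$, $t$) = elite$\}$    // reduce selection pool to elites
        $\mathcal{T}' \leftarrow \mathcal{T}' \setminus \{t\}$    // reduce remaining cases
    return random choice from $\mathcal{S}$    // return parent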
Algorithm 3.1 is very simple to implement because it consists of just a few steps: 1) choosing a case, 2) filtering the selection pool based on that case, and 3) repeating until the cases are exhausted or the selection pool is reduced to one individual. If the selection pool has not been reduced to a single individual by the time every case has been considered, an individual is chosen randomly from the remaining pool, $\mathcal{S}$.
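As an illustration of the three steps above, the following is a minimal Python sketch of a single lexicase selection event. It is not the authors' implementation: the function name `lexicase_select`, the list-of-lists error layout, and the optional `rng` argument are assumptions made for this example. The population is the one listed in Table 1.

```python
import random

def lexicase_select(errors, rng=random):
    """Return the index of one selected parent.

    errors[i][t] is the error of program i on training case t (lower is better).
    """
    pool = list(range(len(errors)))       # initial selection pool is the population
    cases = list(range(len(errors[0])))   # all training cases
    rng.shuffle(cases)                    # uniform random case ordering
    for t in cases:
        if len(pool) <= 1:                # stop once a single individual remains
            break
        elite = min(errors[i][t] for i in pool)             # best error in the pool on case t
        pool = [i for i in pool if errors[i][t] == elite]   # keep only the elites
    return rng.choice(pool)               # random choice among the survivors

# Example population from Table 1 (rows are programs, columns are cases).
TABLE1 = [
    [2, 2, 4, 2],
    [1, 2, 4, 3],
    [2, 2, 3, 4],
    [0, 2, 5, 5],
    [0, 3, 5, 2],
]

rng = random.Random(0)
counts = [0] * len(TABLE1)
for _ in range(10000):
    counts[lexicase_select(TABLE1, rng)] += 1
print([c / 10000 for c in counts])        # empirical selection frequencies
```

Shuffling the cases once and walking through them in order is equivalent to drawing a random remaining case at each step. In this population the second program is never selected: it is elite only on the second case, and every case that can follow removes it from the pool.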
Under lexicase selection, cases in $\mathcal{T}$ can be thought of as filters that reduce the selection pool to the individuals in the pool that are best on that case. Each parent selection event constructs a new path through these filters. We refer to individuals as “passing” a case if they remain in the selection pool when the case is considered. The filtering strength of a case is affected by two main factors: its difficulty as defined by the number of individuals that the case filters from the selection pool, and its order in the selection event, which varies from selection to selection. These two factors are interwoven in lexicase selection because a case performs its filtering on a subset of the population created by a randomized sequence of cases that come before it. In other words, the difficulty of a case depends not only on the problem definition, but on the ordering of the case in the selection event, which is randomized for each selection.
Regarding case difficulty, consider two extreme examples: if a case is passed by the whole population, then it will perform no filtering, resulting in no selection pressure; on the other hand, if a case is passed by a single individual only, then that individual will be selected every time the case is considered for a selection pool containing the individual that passes it. This mechanism allows selective pressure to continually shift to individuals that are elite on cases that are not widely solved in $\mathcal{N}$. Because cases appear in various orderings during selection, there is selective pressure for individuals to solve unique subsets of cases. Lexicase selection thereby accounts not only for the difficulty of individual cases but also for the difficulty of solving arbitrarily sized subsets of cases. This selection pressure leads to the preservation of high behavioral diversity during evolution (Helmuth et al., 2015; La Cava et al., 2016b).
The worst-case complexity of selecting $N$ parents per generation with $T$ test cases is $O(TN^2)$. This running time stems from the fact that to select a single individual, lexicase selection may have to consider the error value of every individual on every test case. In contrast, tournament selection only needs to consider the precomputed fitnesses of a constant tournament size number of individuals; thus selecting a single parent can be done in constant time. Since errors need to be calculated and summed for every test case on every individual, tournament selection requires $O(TN)$ time to select $N$ parents. Normally, due to differential performance across the population and due to lexicase selection’s tendency to promote diversity, a lexicase selection event will use many fewer test cases than $T$; the selection pool typically winnows below $N$ as well, meaning the actual running time tends to be better than the worst-case complexity (Helmuth et al., 2014; La Cava et al., 2016b).
We use an example population originally presented in (Spector, 2012) to illustrate some aspects of standard lexicase selection in the following sections. The population, shown in Table 1, consists of five individuals and four training cases with discrete errors. A graphical example of the filtering mechanism of selection is presented for this example in Figure 1. Each lexicase selection event can be visualized as a randomized depth-first pass through the training cases. Figure 1 shows four example selection events represented by different line types. The populations are winnowed at each case to the elites until single individuals, shown with diamond-shaped nodes, are selected.
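Program  |  $e_1$  |  $e_2$  |  $e_3$  |  $e_4$  |  Elite Cases  |  MAE  |  $P_{lex}$  |  $P_t$
$n_1$  |  2  |  2  |  4  |  2  |  $\{t_2, t_4\}$  |  2.5  |  0.25  |  0.28
$n_2$  |  1  |  2  |  4  |  3  |  $\{t_2\}$  |  2.5  |  0.00  |  0.28
$n_3$  |  2  |  2  |  3  |  4  |  $\{t_2, t_3\}$  |  2.75  |  0.33  |  0.12
$n_4$  |  0  |  2  |  5  |  5  |  $\{t_1, t_2\}$  |  3.0  |  0.21  |  0.04
$n_5$  |  0  |  3  |  5  |  2  |  $\{t_1, t_4\}$  |  2.5  |  0.21  |  0.28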
4 $\epsilon$-Lexicase Selection
Lexicase selection has been shown to be effective in discrete error spaces, both for multi-modal problems (Spector, 2012) and for problems for which every case must be solved exactly to be considered a solution (Helmuth et al., 2014; Helmuth and Spector, 2015). In continuous error spaces, however, the requirement for individuals to be exactly equal to the elite error in the selection pool to pass a case turns out to be overly stringent (La Cava et al., 2016b). In continuous error spaces and especially for symbolic regression with noisy datasets, it is unlikely for two individuals to have exactly the same error on any training case unless they are (or reduce to) equivalent models. As a result, lexicase selection is prone to conducting selection based on single cases, for which the selected individual satisfies $e_t(n) = e_t^*$, the minimum error on $t$ among $\mathcal{N}$. Selecting on single cases limits the ability of lexicase selection to leverage case information on subsets of test cases effectively, and can lead to poorer performance than traditional selection methods (La Cava et al., 2016b).
These observations led to the development of $\epsilon$-lexicase selection (La Cava et al., 2016b), which modifies lexicase selection by modulating case filtering conditions using an $\epsilon$ threshold criterion. Hand-tuned and automatic variants of $\epsilon$ were proposed and tested. The best performance was achieved by a “parameterless” version that defines $\epsilon$ according to the dispersion of errors in the population on each training case using the median absolute deviation statistic:
$$\epsilon_t = \lambda(\mathbf{e}_t) = \text{median}\left(\left|\mathbf{e}_t - \text{median}(\mathbf{e}_t)\right|\right) \tag{2}$$
Defining $\epsilon$ according to Eqn. 2 allows the threshold to adapt to changing performance of the population on each training case. As performance across the population improves for a training case, $\epsilon$ shrinks, thereby modulating the selectivity of a case based on how difficult it is. We choose the median absolute deviation in lieu of the standard deviation statistic for calculating $\epsilon$ because it is more robust to outliers (Pham-Gia and Hung, 2001).
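The median absolute deviation described above can be sketched in a few lines of Python. The function name `mad` and the error values are assumptions chosen for illustration; they are not taken from the experiments.

```python
from statistics import median, pstdev

def mad(values):
    """Median absolute deviation of the errors on one training case."""
    m = median(values)
    return median(abs(v - m) for v in values)

# Hypothetical errors of five programs on one case, with a single outlier.
case_errors = [1, 2, 3, 4, 100]

eps = mad(case_errors)        # threshold for this case
spread = pstdev(case_errors)  # population standard deviation, for comparison
print(eps, spread)
```

With these values the outlier inflates the standard deviation far beyond the spread of the other four errors, whereas the median absolute deviation stays at the scale of the typical differences, which is the robustness property the threshold relies on.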
We study three implementations of $\epsilon$-lexicase selection in this paper: static, which is the version originally proposed (La Cava et al., 2016a); semi-dynamic, in which the elite error is defined relative to the current selection pool; and dynamic, in which both the elite error and $\epsilon$ are defined relative to the current selection pool.
Static $\epsilon$-lexicase selection can be viewed as a preprocessing step added to lexicase selection in which the program errors are converted to pass/fail based on an $\epsilon$ threshold. This threshold is defined relative to $e_t^*$, the lowest error on test case $t$ over the entire population. We call this static $\epsilon$-lexicase selection because the elite error and $\epsilon$ are only calculated once per generation, instead of relative to the current selection pool, as described in Algorithm 4.1.
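Algorithm 4.1: Static $\epsilon$-Lexicase Selection

Selection($\mathcal{N}$, $\mathcal{T}$, ns) :
    $\mathcal{P} \leftarrow \emptyset$    // set of selected parents
    $\epsilon_t \leftarrow \lambda(\mathbf{e}_t)$ for $t \in \mathcal{T}$    // get $\epsilon$ for each case across population
    fitness($n$, $t$) $\leftarrow \mathbb{I}(e_t(n) > e_t^* + \epsilon_t)$ for $t \in \mathcal{T}$ and $n \in \mathcal{N}$    // convert fitness using within-$\epsilon$ pass condition
    do ns times:    // ns is the number of selection events
        $\mathcal{P} \leftarrow \mathcal{P} \cup$ GetParent($\mathcal{N}$, $\mathcal{T}$, fitness)    // add selected program to $\mathcal{P}$
GetParent($\mathcal{N}$, $\mathcal{T}$, fitness) :
    $\mathcal{T}' \leftarrow \mathcal{T}$    // training cases
    $\mathcal{S} \leftarrow \mathcal{N}$    // initial selection pool is the population
    while $|\mathcal{T}'| > 0$ and $|\mathcal{S}| > 1$:    // main loop
        $t \leftarrow$ random choice from $\mathcal{T}'$    // consider a random case
        elite $\leftarrow$ lowest fitness in $\mathcal{S}$ on $t$    // determine elite fitness
        $\mathcal{S} \leftarrow \{n \in \mathcal{S} :$ fitness($n$, $t$) = elite$\}$    // reduce selection pool
        $\mathcal{T}' \leftarrow \mathcal{T}' \setminus \{t\}$    // reduce remaining cases
    return random choice from $\mathcal{S}$    // return parent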
The indicator function $\mathbb{I}(\cdot)$ used above is zero when false and one when true. Semi-dynamic $\epsilon$-lexicase selection differs from static $\epsilon$-lexicase selection in that the pass condition is defined relative to the best error among the pool $\mathcal{S}$ rather than among the entire population $\mathcal{N}$. In this way it behaves more similarly to standard lexicase selection (Algorithm 3.1), except that individuals are filtered out only if their error exceeds the elite error in the pool by more than $\epsilon_t$. It is defined below:
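Algorithm 4.2: Semi-dynamic $\epsilon$-Lexicase Selection

Selection($\mathcal{N}$, $\mathcal{T}$, ns) :
    $\mathcal{P} \leftarrow \emptyset$    // set of selected parents
    $\epsilon_t \leftarrow \lambda(\mathbf{e}_t)$ for $t \in \mathcal{T}$    // get $\epsilon$ for each case across population
    do ns times:    // ns is the number of selection events
        $\mathcal{P} \leftarrow \mathcal{P} \cup$ GetParent($\mathcal{N}$, $\mathcal{T}$, $\epsilon$)    // add selected program to $\mathcal{P}$
GetParent($\mathcal{N}$, $\mathcal{T}$, $\epsilon$) :
    $\mathcal{T}' \leftarrow \mathcal{T}$    // training cases
    $\mathcal{S} \leftarrow \mathcal{N}$    // initial selection pool is the population
    while $|\mathcal{T}'| > 0$ and $|\mathcal{S}| > 1$:    // main loop
        $t \leftarrow$ random choice from $\mathcal{T}'$    // consider a random case
        elite $\leftarrow$ lowest error in $\mathcal{S}$ on $t$    // determine elite fitness
        $\mathcal{S} \leftarrow \{n \in \mathcal{S} : e_t(n) \le$ elite $+ \epsilon_t\}$    // reduce selection pool
        $\mathcal{T}' \leftarrow \mathcal{T}' \setminus \{t\}$    // reduce remaining cases
    return random choice from $\mathcal{S}$    // return parent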
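The final variant of $\epsilon$-lexicase selection is dynamic $\epsilon$-lexicase selection, in which both the elite error and $\epsilon$ are defined among the current selection pool. In this case, $\epsilon$ is defined as
$$\epsilon_t(\mathcal{S}) = \lambda(\mathbf{e}_t(\mathcal{S})) \tag{3}$$
where $\mathbf{e}_t(\mathcal{S})$ is the vector of errors for case $t$ among the current selection pool $\mathcal{S}$. The dynamic $\epsilon$-lexicase selection algorithm is presented below: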
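Algorithm 4.3: Dynamic $\epsilon$-Lexicase Selection

Selection($\mathcal{N}$, $\mathcal{T}$, ns) :
    $\mathcal{P} \leftarrow \emptyset$    // set of selected parents
    do ns times:    // ns is the number of selection events
        $\mathcal{P} \leftarrow \mathcal{P} \cup$ GetParent($\mathcal{N}$, $\mathcal{T}$)    // add selected program to $\mathcal{P}$
GetParent($\mathcal{N}$, $\mathcal{T}$) :
    $\mathcal{T}' \leftarrow \mathcal{T}$    // training cases
    $\mathcal{S} \leftarrow \mathcal{N}$    // initial selection pool is the population
    while $|\mathcal{T}'| > 0$ and $|\mathcal{S}| > 1$:    // main loop
        $t \leftarrow$ random choice from $\mathcal{T}'$    // consider a random case
        elite $\leftarrow$ lowest error in $\mathcal{S}$ on $t$    // determine elite fitness
        $\epsilon_t \leftarrow \lambda(\mathbf{e}_t(\mathcal{S}))$    // determine $\epsilon$ for case $t$
        $\mathcal{S} \leftarrow \{n \in \mathcal{S} : e_t(n) \le$ elite $+ \epsilon_t\}$    // reduce selection pool
        $\mathcal{T}' \leftarrow \mathcal{T}' \setminus \{t\}$    // reduce remaining cases
    return random choice from $\mathcal{S}$    // return parent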
Since calculating $\epsilon$ according to Eqn. 2 is $O(N)$ for a single test case, the three $\epsilon$-lexicase selection algorithms share a worst-case complexity with lexicase selection of $O(TN^2)$ to select $N$ parents. As discussed in §3, these worst-case time complexities are rare, and empirical results have confirmed $\epsilon$-lexicase selection to run within the same time frame as tournament selection (La Cava et al., 2016b). We assess the effect of population size on wall-clock times in §9.
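The difference between the static, semi-dynamic and dynamic variants comes down to where the elite error and the threshold are measured. The Python sketch below puts the three pass conditions side by side. It is an illustration under stated assumptions, not the authors' code: the names `epsilon_lexicase_select`, `mad` and `variant` are hypothetical, the population-wide quantities are recomputed on every call for brevity (a practical implementation would compute them once per generation), and the three-program population is made up.

```python
import random
from statistics import median

def mad(values):
    """Median absolute deviation, used here as the per-case threshold."""
    m = median(values)
    return median(abs(v - m) for v in values)

def epsilon_lexicase_select(errors, variant="semi-dynamic", rng=random):
    """Return the index of one parent; errors[i][t] is the error of program i on case t."""
    n_cases = len(errors[0])
    population = range(len(errors))
    # Population-wide threshold and elite error for each case.
    eps_pop = [mad([errors[i][t] for i in population]) for t in range(n_cases)]
    best_pop = [min(errors[i][t] for i in population) for t in range(n_cases)]

    pool = list(population)
    cases = list(range(n_cases))
    rng.shuffle(cases)
    for t in cases:
        if len(pool) <= 1:
            break
        pool_errors = [errors[i][t] for i in pool]
        if variant == "static":          # elite and threshold from the whole population
            elite, eps = best_pop[t], eps_pop[t]
        elif variant == "semi-dynamic":  # elite from the pool, threshold from the population
            elite, eps = min(pool_errors), eps_pop[t]
        elif variant == "dynamic":       # elite and threshold both from the pool
            elite, eps = min(pool_errors), mad(pool_errors)
        else:
            raise ValueError("unknown variant: " + variant)
        survivors = [i for i in pool if errors[i][t] <= elite + eps]
        # In the static variant nobody in the pool may pass; the pool is then unchanged.
        pool = survivors or pool
    return rng.choice(pool)

# Hypothetical population: two near-elite programs and one poor program, one case.
errors = [[1.0], [1.5], [9.0]]
rng = random.Random(0)
picked = {
    v: sorted({epsilon_lexicase_select(errors, v, rng) for _ in range(200)})
    for v in ("static", "semi-dynamic", "dynamic")
}
print(picked)
```

In this toy population the threshold on the single case is 0.5, so the second program passes alongside the elite one under all three variants, whereas standard lexicase selection would only ever return the first program.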
5 Related Work
Lexicase selection belongs to a class of GP systems that incorporate a program’s full semantics directly into the search process, and as such shares a general motivation with recently proposed methods such as Geometric Semantic GP (Moraglio et al., 2012) and Behavioral GP (Krawiec and O’Reilly, 2014), despite differing strongly in approach. Instead of incorporating the full semantics, a number of GP methods alter the fitness metric by weighting training cases based on population performance. In non-binary Implicit Fitness Sharing (IFS) (Krawiec and Nawrocki, 2013), for example, the fitness proportion of a case is scaled by the performance of other individuals on that case. Similarly, historically assessed hardness scales error on each training case by the success rate of the population (Klein and Spector, 2008). Discovery of objectives by clustering (DOC) (Krawiec and Liskowski, 2015) clusters training cases by population performance, and thereby reduces training cases into a set of objectives used in multi-objective optimization. Both IFS and DOC were outperformed by lexicase selection on program synthesis and boolean problems in previous studies (Helmuth and Spector, 2015; Liskowski et al., 2015). Other methods attempt to sample a subset of $\mathcal{T}$ to reduce computation time or improve performance, such as dynamic subset selection (Gathercole and Ross, 1994), interleaved sampling (Gonçalves and Silva, 2013), and co-evolved fitness predictors (Schmidt and Lipson, 2008). Unlike these methods, lexicase selection begins each selection with the full set of training cases, and allows selection to adapt to program performance on them.
Although to an extent the ideas of multi-objective optimization apply to multiple training cases, they are qualitatively different and commonly operate at different scales. Symbolic regression often involves one or two objectives (e.g. accuracy and model conciseness) and hundreds or thousands of training cases. One example of using training cases explicitly as objectives occurs in Langdon (1995), in which small numbers of training cases (in this case 6) are used as multiple objectives in a Pareto selection scheme. Other multi-objective approaches such as NSGA-II (Deb et al., 2000), SPEA2 (Zitzler et al., 2001) and ParetoGP (Smits and Kotanchek, 2005) are commonly used with a small set of objectives in symbolic regression. The “curse of dimensionality” prevents the use of objectives at the scale of typical training case sizes, since most individuals become non-dominated, leading to selection based mostly on diversity measures rather than performance. Scaling issues in many-objective optimization are reviewed by Ishibuchi et al. (2008). The connection between lexicase selection and multi-objective methods is discussed in depth in §7.
The conversion of a model’s real-valued fitness into discrete values based on an $\epsilon$ threshold has been explored in other research; for example, Novelty Search GP (Martínez et al., 2013) uses a reduced error vector to define the behavioral representation of individuals in the population. La Cava et al. (2016b) used it for the first time as a solution to applying lexicase selection effectively to regression.
Recent work has empirically studied and extended lexicase selection. Helmuth et al. (2016b) found that extreme selection events in lexicase selection were not central to its performance improvements, and that lexicase selection, unlike tournament selection, could re-diversify less-diverse populations (Helmuth et al., 2016a). A survival-based version of $\epsilon$-lexicase selection has also been proposed (La Cava and Moore, 2017a, b) for maintaining uncorrelated populations in an ensemble learning context.
6 Expected Probabilities of Selection
The literature on lexicase selection has yet to analytically address the question: what is the probability of an individual being selected by lexicase selection, given its performance in a population on a set of training cases? Put into words, the probability of $n_i$ being selected is the probability that a case $n_i$ passes is selected and: 1) $n_i$ is the only individual that passes the case; or 2) no more cases remain and $n_i$ is selected among the set of individuals that pass the selected case; or 3) $n_i$ is selected via the selection of another case that $n_i$ passes (repeating the process).
Formally, let $P_{lex}(n_i \mid \mathcal{N}, \mathcal{T})$ be the probability of $n_i \in \mathcal{N}$ being selected by lexicase selection. Let $\mathcal{K}_i(\mathcal{T}, \mathcal{N}) \subseteq \mathcal{T}$ be the training cases from $\mathcal{T}$ for which individual $n_i$ is elite among $\mathcal{N}$. We will use $\mathcal{K}_i$ for brevity. Then the probability of selection under lexicase can be represented as a piecewise recursive function:
$$P_{lex}(n_i \mid \mathcal{N}, \mathcal{T}) = \begin{cases} 1 & \text{if } |\mathcal{N}| = 1; \\ 1/|\mathcal{N}| & \text{if } |\mathcal{T}| = 0; \\ \frac{1}{|\mathcal{T}|}\sum_{k_s \in \mathcal{K}_i} P_{lex}\left(n_i \mid \mathcal{N}(m \mid k_s \in \mathcal{K}_m),\ \mathcal{T} \setminus \{k_s\}\right) & \text{otherwise} \end{cases} \tag{4}$$
The first two elements of $P_{lex}$ follow from the lexicase algorithm: if there is one individual in $\mathcal{N}$, then it is selected; otherwise, if there are no more cases in $\mathcal{T}$, then $n_i$ has a probability of selection split among the individuals in $\mathcal{N}$, i.e., $1/|\mathcal{N}|$. If neither of these conditions is met, the remaining probability of selection is $1/|\mathcal{T}|$ times the summation of $P_{lex}$ over $n_i$’s elite cases. Each case in $\mathcal{K}_i$ has a probability of $1/|\mathcal{T}|$ of being selected. For each potential selection $k_s$, the probability of $n_i$ being selected as a result of this case being chosen is dependent on the number of individuals that are also elite on this case, represented by $\mathcal{N}(m \mid k_s \in \mathcal{K}_m)$, and the cases that are left to be traversed, represented by $\mathcal{T} \setminus \{k_s\}$.
Eqn. 4 also describes the probability of selection under $\epsilon$-lexicase selection, with the condition that elitism on a case is defined as being within $\epsilon$ of the best error on that case, where the best error is defined among the whole population (static) or among the current selection pool (semi-dynamic and dynamic) and $\epsilon$ is defined according to Eqn. 2 or Eqn. 3.
According to Eqn. 4, when fitness values across the population are unique, selection probability is $P_{lex}(n_i) = \frac{1}{|\mathcal{T}|}\sum_{k_s \in \mathcal{K}_i} 1 = \frac{|\mathcal{K}_i|}{|\mathcal{T}|}$, since filtering the population according to any case for which $n_i$ is elite will result in $n_i$ being selected. Conversely, if the population semantics are completely homogeneous such that every individual is elite on every case, the selection will be uniformly random, giving the selection probability $1/N$. This property of uniformity in selection for identical performance holds true each time a case is considered; a case only impacts selection if there is differential performance on it in the selection pool. The same conclusion can be gleaned from Algorithm 3.1: any case that every individual passes provides no selective pressure because the selection pool does not change when that case is considered.
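Although it is tempting to pair Eqn. 4 with roulette wheel selection as an alternative to lexicase selection, an analysis of its complexity discourages such use. Evaluated directly, the recursion in Eqn. 4 has a prohibitive worst-case complexity, which is exhibited when all individuals are elite on every case in $\mathcal{T}$, because no case filters the pool and every ordering of the cases must then be expanded.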
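For small populations the recursion can nevertheless be evaluated exactly. The Python sketch below follows the three branches of the piecewise definition in this section: a pool of one returns 1, an empty case set returns one over the pool size, and otherwise the result averages over the remaining cases, recursing only on those for which the individual is elite in the current pool. The function names, the use of `Fraction` for exact arithmetic, and the memoization via `lru_cache` are choices made for this example rather than part of the original analysis.

```python
from fractions import Fraction
from functools import lru_cache

def lexicase_probabilities(errors):
    """Exact lexicase selection probability for each program in `errors`.

    errors[i][t] is the error of program i on case t (lower is better).
    """
    @lru_cache(maxsize=None)
    def p_lex(i, pool, cases):
        # pool and cases are frozensets of indices; i is always a member of pool.
        if len(pool) == 1:
            return Fraction(1)                 # a single remaining individual is selected
        if not cases:
            return Fraction(1, len(pool))      # no cases left: uniform over the pool
        total = Fraction(0)
        for t in cases:
            elite = min(errors[m][t] for m in pool)
            if errors[i][t] == elite:          # t is one of i's elite cases in this pool
                survivors = frozenset(m for m in pool if errors[m][t] == elite)
                total += p_lex(i, survivors, cases - {t})
        return total / len(cases)              # each remaining case is equally likely

    pool = frozenset(range(len(errors)))
    cases = frozenset(range(len(errors[0])))
    return [p_lex(i, pool, cases) for i in range(len(errors))]

# Example population from Table 1 (rows are programs, columns are cases).
TABLE1 = [
    [2, 2, 4, 2],
    [1, 2, 4, 3],
    [2, 2, 3, 4],
    [0, 2, 5, 5],
    [0, 3, 5, 2],
]

probs = lexicase_probabilities(TABLE1)
print([str(p) for p in probs])  # ['1/4', '0', '1/3', '5/24', '5/24']
```

For the Table 1 population this gives 1/4, 0, 1/3, 5/24 and 5/24, which round to the lexicase probabilities listed in the table. Memoization avoids recomputing identical sub-problems but does not remove the combinatorial growth in the worst case.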
6.1 Effect of Population and Training Set Size
Previous studies have suggested that the performance of lexicase selection is sensitive to the number of training cases (Liskowski et al., 2015). In this section we develop the relation of population size and number of training cases to the performance of lexicase selection as a search driver. In part, this behavior is inherent to the design of the algorithm. However, this behavior is also linked to the fidelity with which lexicase selection samples the expected probabilities of selection for each individual in the population.
The effectiveness of lexicase selection is expected to suffer when there are few training cases. When $T$ is small, there are very few ways in which individuals can be selected. In an extreme case, if $T = 2$, an individual must be elite on one of these two cases to be selected. In fact, in this case individuals with at most 2 different error vectors will be selected. For continuous errors in which few individuals are elite, this means that very few individuals are likely to produce all of the children for the subsequent generation, leading to hyperselection (Helmuth et al., 2016b) and diversity loss. On the other hand, if many individuals solve both cases, selection becomes increasingly random.
The population size is tied to selection behavior because it determines the number of selection events (ns in Algorithms 3.1 and 4.1 to 4.3). In our implementation, ns = $N$, whereas other implementations may use a different number of selection events per generation. This implies that the value of $N$ determines the fidelity with which $P_{lex}$ is approximated via the sampling of the population by parent selection. Smaller populations will therefore produce poorer approximations of $P_{lex}$. Of course, this problem is not unique to lexicase selection; tournament selection also samples from an expected distribution and is affected by the number of tournaments (Xie et al., 2007).
Both $N$ and $T$ affect how well the expected probabilities of selection derived from Eqn. 4 are approximated by lexicase selection. Consider the probability of a case $t$ being in at least one selection event in a generation, which is one minus the probability of it not appearing, yielding
$$\Pr(t \in \text{selection events}) = 1 - \prod_{i=1}^{N} \frac{T - d_i}{T}$$
Here, the case depth $d_i$ is the number of cases used to select a parent for selection event $i$. Because the case depth varies from selection to selection based on population semantics, this case probability is difficult to analyze. However, it can be simplified to consider the scenario in which a case appears first in selection. In fact, Eqn. 4 implies that a case in $\mathcal{K}_i$ influences the probability of selection of $n_i$ most heavily when it occurs first in a selection event. There are two reasons: first, the case has the potential to filter $N - 1$ individuals, which is the strongest selection pressure it can apply. Second, a case’s effect size is highest when selected first because it is not conditioned on the probability of selection of any other cases. Each subsequent case selection has a reduced effect on $P_{lex}$, scaled by a further factor of $\frac{1}{T - d}$, where $d$ is the case depth. These observations also highlight the importance of the relative sizes of $N$ and $T$ because they affect the probability that a case will be observed at the top of a selection event in a given generation, which affects how closely Eqn. 4 is approximated. Let $P_{first}$ be the probability that a case will come first in a selection event at least once in a generation. Then
$$P_{first} = 1 - \left(\frac{T - 1}{T}\right)^{N} \tag{5}$$
assuming $N$ selection events. This function is plotted for various values of $N$ and $T$ in Figure 2, and illustrates that the probability of a case appearing first in selection drops as $T$ grows and as $N$ shrinks. For example, when and . We therefore expect the observed probabilities of selection for $n_i \in \mathcal{N}$ to differ from $P_{lex}(n_i)$ when $N \ll T$, due to insufficient sampling of the cases. In the case of $N \gg T$, we expect most cases to appear first and therefore the probability predictions made by Eqn. 4 to be more accurate to the actual selections.
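The dependence of this probability on the two sizes is easy to tabulate. The Python sketch below assumes that each selection event draws its first case uniformly from the training cases, so that a given case fails to come first in one event with probability one minus one over the number of cases. The function name `p_first` and the sizes used are illustrative assumptions, not values taken from Figure 2.

```python
def p_first(n_selections, n_cases):
    """Probability that a given case comes first in at least one selection event."""
    return 1.0 - (1.0 - 1.0 / n_cases) ** n_selections

# Illustrative population and training set sizes.
for n in (100, 1000):
    for t in (100, 1000):
        print(f"N={n:5d}  T={t:5d}  P_first={p_first(n, t):.3f}")
```

When the two sizes are equal and large the value approaches $1 - e^{-1} \approx 0.63$; with ten times more cases than selection events it falls below 0.1, and with ten times more selection events than cases it is close to 1.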
6.2 Probabilities under tournament selection
We compare the probability of selection under lexicase selection to that using tournament selection with an identical population and fitness structure. To do so we must first formulate the probability of selection for an individual undergoing tournament selection with size $r$ tournaments. The fitness ranks of $\mathcal{N}$ can be calculated, for example using MAE as fitness, with lower rank indicating better fitness. Let $S_i$ be the individuals in $\mathcal{N}$ with a fitness rank of $i$, and let $Q$ be the number of unique fitness ranks. Xie et al. (2007) showed that the probability of selecting an individual with rank $j$ in a single tournament is
$$P_t = \frac{1}{|S_j|}\left(\left(\frac{\sum_{i=j}^{Q}|S_i|}{N}\right)^{r} - \left(\frac{\sum_{i=j+1}^{Q}|S_i|}{N}\right)^{r}\right) \tag{6}$$
In Table 1, the selection probabilities for the example population are shown according to lexicase selection (Eqn. 4) and tournament selection (Eqn. 6). Note that the tournament probabilities are proportional to the aggregate fitness, whereas lexicase probabilities reflect more subtle but intuitive performance differences as discussed by Spector (2012). In §8 we present a more detailed population example with continuous errors and compare probabilities of selection using lexicase, $\epsilon$-lexicase and tournament selection.
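The tournament probabilities can be checked with a short calculation. The Python sketch below implements the rank-based reasoning described in this section: an individual of a given rank wins a tournament when every sampled competitor is at least as bad as that rank but not all are strictly worse, and the win is shared equally among individuals of the same rank. Sampling with replacement is assumed. The function name is hypothetical, and the tournament size of 2 is an assumption; it is the value for which the calculation reproduces the tournament column of Table 1.

```python
from collections import Counter
from fractions import Fraction

def tournament_probabilities(fitness, r):
    """Per-individual probability of winning one tournament of size r.

    fitness[i] is the aggregate error of individual i (lower is better).
    """
    n = len(fitness)
    group_sizes = Counter(fitness)            # number of individuals sharing each fitness value
    probs = {}
    for f, size in group_sizes.items():
        # Individuals whose fitness is equal to or worse than f.
        as_bad_or_worse = sum(c for g, c in group_sizes.items() if g >= f)
        strictly_worse = as_bad_or_worse - size
        win = Fraction(as_bad_or_worse, n) ** r - Fraction(strictly_worse, n) ** r
        probs[f] = win / size                  # split equally within the rank
    return [probs[f] for f in fitness]

# MAE column of Table 1; tournament size 2 is assumed.
mae = [2.5, 2.5, 2.75, 3.0, 2.5]
p_t = tournament_probabilities(mae, r=2)
print([float(p) for p in p_t])  # [0.28, 0.28, 0.12, 0.04, 0.28]
```

The three programs tied on MAE receive identical tournament probabilities, including the second program, which lexicase selection never selects.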
7 Multiobjective Interpretation of Lexicase Selection
Objectives and training cases are fundamentally different entities: objectives define the goals of the task being learned, whereas cases are the units by which progress towards those objectives is measured. By this criterion, lexicase selection and multi-objective optimization have historically been differentiated (Helmuth, 2015), although there is clearly a “multi-objective” interpretation of the behavior of lexicase selection with respect to the training cases. Let us assume for the remainder of this section that individual fitness cases are objectives to solve. The symbolic regression task then becomes a high-dimensional, many-objective optimization problem. At this scale, the most popular multi-objective methods (e.g. NSGA-II and SPEA2) tend to perform poorly, a behavior that has been explained in the literature (Wagner et al., 2007; Farina and Amato, 2002). Farina and Amato (2002) point out two shortcomings of these multi-objective methods when many objectives are considered:
the Pareto definition of optimality in a multicriteria decision making problem can be unsatisfactory due to essentially two reasons: the number of improved or equal objective values is not taken into account, the (normalized) size of improvements is not taken into account.
As we describe in §6, lexicase selection takes into account the number of improved or equal objectives (i.e. cases) by increasing the probability of selection for individuals who solve more cases (consider the summation in the third part of Eqn. 4). The increase per case is proportional to the difficulty of that case, as defined by the selection pool’s performance. Regarding Farina and Amato’s second point, the size of the improvements is taken into account by ε-lexicase selection through the automated thresholding performed by ε, which rewards individuals for being within an acceptable range of the best performance on the case. We develop the relationship between lexicase selection and Pareto optimization in the remainder of this section.
It has been noted that lexicase selection guarantees the return of individuals that are on the Pareto front with respect to the fitness cases (La Cava et al., 2016b). However, this is a necessary but not sufficient condition for selection. As we show below, lexicase selection only selects those individuals in the “corners” or boundaries of the Pareto front, meaning they are on the front and elite, globally, with respect to at least one fitness case. Below, we define these Pareto relations with respect to the training cases.
Definition 7.1.
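$n_1$ dominates $n_2$, i.e., $n_1 \prec n_2$, if $e_t(n_1) \le e_t(n_2)\ \forall t \in \mathcal{T}$ and $\exists t \in \mathcal{T}$ for which $e_t(n_1) < e_t(n_2)$, where $e_t(n)$ is the error of individual $n$ on training case $t$ and $\mathcal{T}$ is the set of training cases.
Definition 7.2.
The Pareto set of $\mathcal{N}$ is the subset of $\mathcal{N}$ that is non-dominated with respect to $\mathcal{N}$; i.e., $n \in \mathcal{N}$ is in the Pareto set if $n' \nprec n\ \forall n' \in \mathcal{N}$.
Definition 7.3.
$n \in \mathcal{N}$ is a Pareto set boundary if $n \in$ Pareto set of $\mathcal{N}$ and $\exists t \in \mathcal{T}$ for which $e_t(n) \le e_t(m)\ \forall m \in \mathcal{N}$.
With these definitions in mind, we show that individuals selected by lexicase are Pareto set boundaries.
Theorem 7.4.
If individuals from a population $\mathcal{N}$ are selected by lexicase selection, those individuals are Pareto set boundaries of $\mathcal{N}$ with respect to $\mathcal{T}$.
Proof.
Let $n_1 \in \mathcal{N}$ be any individual and let $n_2 \in \mathcal{N}$ be an individual selected from $\mathcal{N}$ by lexicase selection. Suppose $n_1 \prec n_2$. Then $e_t(n_1) \le e_t(n_2)\ \forall t \in \mathcal{T}$ and $\exists t \in \mathcal{T}$ for which $e_t(n_1) < e_t(n_2)$. Therefore $n_1$ remains in the selection pool for every case that $n_2$ does, yet $\exists t \in \mathcal{T}$ for which $n_2$ is removed from every selection event due to $n_1$. Hence, $n_2$ cannot be selected by lexicase selection and the supposition is false. Therefore $n_2$ must be in the Pareto set of $\mathcal{N}$.
Next, Algorithm 3.1 shows that $n_2$ must be elite on at least one test case; therefore $\exists t \in \mathcal{T}$ for which $e_t(n_2) \le e_t(m)\ \forall m \in \mathcal{N}$. Therefore, since $n_2$ is in the Pareto set of $\mathcal{N}$, according to Definition 7.3, $n_2$ is a Pareto set boundary of $\mathcal{N}$. ∎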
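Definitions 7.1 to 7.3 translate directly into code. The sketch below is illustrative only: errors are given as one tuple per individual, lower is better, and the function names and the five-point example are assumptions rather than anything taken from the paper.

```python
def dominates(a, b):
    """True if error vector a Pareto-dominates error vector b."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_set(errors):
    """Indices of individuals that no other individual dominates."""
    return [i for i, a in enumerate(errors)
            if not any(dominates(b, a) for b in errors)]


def pareto_set_boundaries(errors):
    """Pareto-set members that are globally elite on at least one case."""
    best = [min(col) for col in zip(*errors)]
    return [i for i in pareto_set(errors)
            if any(e == b for e, b in zip(errors[i], best))]


# Hypothetical population evaluated on two cases.
population_errors = [(0, 3), (1, 1), (3, 0), (2, 2), (0, 4)]
front = pareto_set(population_errors)
boundaries = pareto_set_boundaries(population_errors)
```

In this example the last individual is elite on the first case but is dominated by the first individual, so it is not a Pareto set boundary; lexicase selection would filter it out as soon as the second case is considered.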
7.1 Extension to ε-lexicase selection
We can extend our multiobjective analysis to ε-lexicase selection for conditions in which ε is predefined for each fitness case (Eqn. 2), which is true for static and semi-dynamic ε-lexicase selection. However when ε is recalculated for each selection pool, the theorem is not as easily extended due to the need to account for the dependency of ε on the current selection pool. We first define ε-elitism in terms of a relaxed dominance relation and a relaxed Pareto set. We define the dominance relation with respect to ε as follows:
Definition 7.5.
dominates , i.e., , if and for which , where is defined according to Eqn. 2.
This definition of dominance differs from a previous dominance definition used by Laumanns et al. (2002) (cf. Eqn. (6)) in which if
Definition 7.5 is more strict, requiring for at least one in analogous fashion to Definition 7.1. In order to extend Theorem 7.4, this definition must be more strict since a useful dominance relation needs to capture the ability of an individual to preclude the selection of another individual under lexicase selection.
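Definition 7.6.
The ε-Pareto set of $\mathcal{N}$ is the subset of $\mathcal{N}$ that is non-ε-dominated with respect to $\mathcal{N}$; i.e., $n \in \mathcal{N}$ is in the ε-Pareto set if no $n' \in \mathcal{N}$ ε-dominates $n$ in the sense of Definition 7.5.
Definition 7.7.
$n \in \mathcal{N}$ is an ε-Pareto set boundary if $n$ is in the ε-Pareto set of $\mathcal{N}$ and $\exists t \in \mathcal{T}$ for which $e_t(n) \le e_t(m) + \epsilon_t\ \forall m \in \mathcal{N}$, where $\epsilon_t$ is defined according to Eqn. 2.
Theorem 7.8.
If ε is defined according to Eqn. 2, and if individuals are selected from a population $\mathcal{N}$ by ε-lexicase selection, then those individuals are ε-Pareto set boundaries of $\mathcal{N}$.
Proof.
Let $n_1 \in \mathcal{N}$ be any individual and let $n_2 \in \mathcal{N}$ be an individual selected from $\mathcal{N}$ by static or semi-dynamic ε-lexicase selection. Suppose $n_1$ ε-dominates $n_2$ (Definition 7.5). Therefore $n_1$ remains in the selection pool for every case that $n_2$ does, yet $\exists t \in \mathcal{T}$ for which $n_2$ is removed from every selection event due to $n_1$. Hence, $n_2$ cannot be selected by ε-lexicase selection and the supposition is false. Therefore $n_1$ does not ε-dominate $n_2$, and $n_2$ must be in the ε-Pareto set of $\mathcal{N}$ to be selected.
Next, by definition of Algorithm 4.1 or 4.2, $n_2$ must be within $\epsilon_t$ of elite on at least one test case; i.e. $\exists t \in \mathcal{T}$ for which $e_t(n_2) \le e_t(m) + \epsilon_t\ \forall m \in \mathcal{N}$. Therefore, since $n_2$ is in the ε-Pareto set of $\mathcal{N}$, according to Definition 7.7, $n_2$ must be an ε-Pareto set boundary of $\mathcal{N}$. ∎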
To illustrate how lexicase selection only selects Pareto set boundaries, we plot an example selection from a population evaluated on two test cases in the left plot of Figure 3. Each point in the plot represents an individual, and the squares are the Pareto set. Under a lexicase selection event in which the case plotted on the horizontal axis is considered first, individuals would first be filtered to the two leftmost individuals that are elite on that case, and then to the individual among those two that is best on the other case, i.e. the selected square individual. Note that the selected individual is a Pareto set boundary. The individual on the other end of the Pareto set shown as a black square would be selected using the opposite order of cases.
Consider the analogous case for semi-dynamic ε-lexicase selection illustrated in the right plot of Figure 3. In this case the squares are the ε-Pareto set. Under a semi-dynamic ε-lexicase selection event with the same case order, the population would first be filtered to the four leftmost individuals that are within ε of the elite error on the first case, and then the indicated square would be selected by being the only individual within ε of the elite error on the second case among the current pool. Note that although the selected individual is an ε-Pareto set boundary by Definition 7.7, it is not a boundary of the Pareto set. Theorem 7.8 only guarantees that the selected individual is within ε of the best error for at least one case, which in this scenario is the first case.
Comparing the left and right plots of Figure 3 illustrates the effect of introducing ε to the filter conditions in lexicase: it reduces the selectivity of each case, ultimately resulting in the selection of individuals that are not as extremely positioned in objective space. Regarding the position of solutions in this space, it’s worth noting the significance of boundary solutions (and near boundary solutions) in the context of multiobjective optimization. Boundary solutions are known to contribute significantly to hypervolume measures (Deb et al., 2005), where the hypervolume is a measure of how well-covered the objective space is by the current set of solutions. The boundary solutions have an infinite score according to NSGA-II’s crowding measure (Deb et al., 2000), with higher being better, meaning they are the first non-dominated solutions to be preserved by selection when the population size is reduced. Nonetheless, multiobjective literature appears divided on how these boundary solutions drive search when the goal of the algorithm is to approximate the optimal Pareto front (Wagner et al., 2007). The goal of GP, in contrast, is to preserve points in the search space that, when combined and varied, yield a single best solution. So while the descriptions above lend insight to the function of lexicase and ε-lexicase selection, the different goals of search and the high dimensionality of training cases must be remembered when comparing to multiobjective optimization.
As a last note, when considered as objectives, the worstcase complexity of lexicase selection matches that of NSGAII: . Interestingly, the worst case complexity of the crowding distance assignment portion of NSGAII, , occurs when all individuals are nondominated, which is expected in high dimensions (Farina and Amato, 2002; Wagner et al., 2007). Under lexicase selection, a nondominated population that is semantically unique will have a worstcase complexity of .
8 Illustrative Example
Here we apply the concepts from §6 to consider the probabilities of selection under different methods on an example population. The goal of this section is to interweave the analyses of §6 and §7 to give an intuitive explanation of the differences between lexicase selection and the ε-lexicase selection variants.
An example population is presented in Table 2 featuring floating point errors, in contrast to Table 1. In this case, the population semantics are completely unique, although they result in the same mean error across the training cases, as shown in the “Mean” column. As a result, tournament selection picks uniformly from among these individuals, resulting in equivalent probabilities of selection. As mentioned in §6, with unique populations, lexicase selection is proportional to the number of cases for which an individual is elite. This leads lexicase selection to pick from among the four individuals that are elite on cases, i.e. $n_1$, $n_4$, $n_5$, and $n_9$ (the first, fourth, fifth and ninth rows of Table 2), with respective probabilities 0.2, 0.2, 0.2, and 0.4, according to Eqn. 4. Note these four individuals are Pareto set boundaries.
      Cases                          Probability of Selection
Ind.  1    2    3    4    5    Mean  tourn  lex    ε-lex static  ε-lex semi  ε-lex dyn
n1    0.0  1.1  2.2  3.0  5.0  2.26  0.111  0.200  0.000         0.067       0.033
n2    0.1  1.2  2.0  2.0  6.0  2.26  0.111  0.000  0.150         0.117       0.200
n3    0.2  1.0  2.1  1.0  7.0  2.26  0.111  0.000  0.150         0.117       0.117
n4    1.0  2.1  0.2  0.0  8.0  2.26  0.111  0.200  0.300         0.200       0.167
n5    1.1  2.2  0.0  4.0  4.0  2.26  0.111  0.200  0.000         0.050       0.050
n6    1.2  2.0  0.1  5.0  3.0  2.26  0.111  0.000  0.000         0.050       0.033
n7    2.0  0.1  1.2  6.0  2.0  2.26  0.111  0.000  0.133         0.133       0.133
n8    2.1  0.2  1.0  7.0  1.0  2.26  0.111  0.000  0.133         0.133       0.217
n9    2.2  0.0  1.1  8.0  0.0  2.26  0.111  0.400  0.133         0.133       0.050
ε     0.9  0.9  0.9  2.0  2.0
Due to its strict definition of elitism, lexicase selection does not account for the fact that other individuals are very close to being elite on these cases as well; for example $n_2$ and $n_3$ (the second and third rows of Table 2) are close to the elite error on case 1. The ε-lexicase variants address this as noted by the smoother distribution of selection probabilities among this population. We focus first on static ε-lexicase selection. Applying the ε threshold to the errors yields the following discrete fitnesses:
Ind.  1  2  3  4  5
n1    0  1  1  1  1
n2    0  1  1  0  1
n3    0  1  1  0  1
n4    1  1  0  0  1
n5    1  1  0  1  1
n6    1  1  0  1  1
n7    1  0  1  1  0
n8    1  0  1  1  0
n9    1  0  1  1  0
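The thresholding step can be reproduced with a few lines of Python. This is a sketch under two assumptions: that ε for each case is the median absolute deviation (MAD) of the population's errors on that case, which reproduces the ε row of Table 2, and that an individual passes a case (fitness 0) when its error is within ε of the best error. Variable names are illustrative.

```python
from fractions import Fraction
from statistics import median

# Errors of the nine individuals (rows) on the five cases (columns), Table 2.
ROWS = [
    "0.0 1.1 2.2 3.0 5.0", "0.1 1.2 2.0 2.0 6.0", "0.2 1.0 2.1 1.0 7.0",
    "1.0 2.1 0.2 0.0 8.0", "1.1 2.2 0.0 4.0 4.0", "1.2 2.0 0.1 5.0 3.0",
    "2.0 0.1 1.2 6.0 2.0", "2.1 0.2 1.0 7.0 1.0", "2.2 0.0 1.1 8.0 0.0",
]
# Exact rationals avoid floating-point surprises at the threshold.
errors = [[Fraction(v) for v in row.split()] for row in ROWS]


def mad(values):
    """Median absolute deviation of a sequence of errors."""
    center = median(values)
    return median([abs(v - center) for v in values])


columns = list(zip(*errors))
eps = [mad(col) for col in columns]      # one threshold per case
best = [min(col) for col in columns]     # elite error per case
# 0 = within eps of the elite error (pass), 1 = fail.
fail = [[int(e > b + t) for e, b, t in zip(row, best, eps)] for row in errors]
```

Under these assumptions `eps` evaluates to 0.9, 0.9, 0.9, 2.0, 2.0 and `fail` matches the discrete fitness matrix above.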
The selection probabilities for static ε-lexicase selection are equivalent to the selection probabilities of lexicase selection on this converted error matrix. Note that $n_1$, $n_5$ and $n_6$ (the first, fifth and sixth rows) have selection probabilities of zero because they are dominated in the converted error space. Despite elitism on case 1, $n_1$ is not selected since $n_2$ and $n_3$ are elite on this case in addition to case 4. The same effect makes $n_5$ and $n_6$ unselectable due to $n_4$. Consider $n_4$, which has a higher probability of selection under static ε-lexicase selection than lexicase selection. This is due to $n_4$ being elite on a unique combination of cases: 3 and 4. Lastly, $n_9$ is selected in equal proportions to $n_7$ and $n_8$ because all three are within ε of the elite error on the same cases.
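For a population this small, the lexicase probabilities can be computed exactly by enumerating all 5! case orderings instead of applying the closed form. The sketch below is a brute-force illustration, not the paper's Eqn. 4; the function name is an assumption, and remaining ties are assumed to be broken uniformly at random.

```python
from fractions import Fraction
from itertools import permutations

# Raw errors from Table 2 and the converted (0 = pass, 1 = fail) matrix.
RAW = [
    [0.0, 1.1, 2.2, 3.0, 5.0], [0.1, 1.2, 2.0, 2.0, 6.0],
    [0.2, 1.0, 2.1, 1.0, 7.0], [1.0, 2.1, 0.2, 0.0, 8.0],
    [1.1, 2.2, 0.0, 4.0, 4.0], [1.2, 2.0, 0.1, 5.0, 3.0],
    [2.0, 0.1, 1.2, 6.0, 2.0], [2.1, 0.2, 1.0, 7.0, 1.0],
    [2.2, 0.0, 1.1, 8.0, 0.0],
]
CONVERTED = [
    [0, 1, 1, 1, 1], [0, 1, 1, 0, 1], [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1], [1, 1, 0, 1, 1], [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0], [1, 0, 1, 1, 0], [1, 0, 1, 1, 0],
]


def lexicase_probabilities(errors):
    """Exact selection probabilities, averaging over every case ordering."""
    n, n_cases = len(errors), len(errors[0])
    orders = list(permutations(range(n_cases)))
    probs = [Fraction(0)] * n
    for order in orders:
        pool = list(range(n))
        for t in order:
            best = min(errors[i][t] for i in pool)
            pool = [i for i in pool if errors[i][t] == best]  # keep elites
            if len(pool) == 1:
                break
        for i in pool:  # survivors are chosen uniformly at random
            probs[i] += Fraction(1, len(pool) * len(orders))
    return probs


lex = lexicase_probabilities(RAW)
static = lexicase_probabilities(CONVERTED)
```

Here `lex` comes out as 1/5 for the first, fourth and fifth individuals and 2/5 for the ninth, and `static` as 3/20, 3/20, 3/10 and three values of 2/15 for the selectable individuals, in agreement with the "lex" and static columns of Table 2.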
Semi-dynamic ε-lexicase selection allows for all nine individuals to be selected with varying proportions that are similar to those derived for static ε-lexicase selection. Selection probabilities for $n_1$ (the first row of Table 2) illustrate the differences in the static and semi-dynamic variants: $n_1$ has a chance for selection in the semi-dynamic case because when case 1 is selected as the first case, $n_1$ is within ε of the best case errors among the pool, i.e. $n_1$, $n_2$, $n_3$, for any subsequent order of cases. The probabilities of selection for $n_5$ and $n_6$ follow the same pattern.
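The semi-dynamic behavior can be checked the same way. The sketch below assumes the pass condition "error is at most the best error in the current pool plus the fixed, population-wide ε", with ε taken from the last row of Table 2; the function name is illustrative.

```python
from fractions import Fraction
from itertools import permutations

ROWS = [
    "0.0 1.1 2.2 3.0 5.0", "0.1 1.2 2.0 2.0 6.0", "0.2 1.0 2.1 1.0 7.0",
    "1.0 2.1 0.2 0.0 8.0", "1.1 2.2 0.0 4.0 4.0", "1.2 2.0 0.1 5.0 3.0",
    "2.0 0.1 1.2 6.0 2.0", "2.1 0.2 1.0 7.0 1.0", "2.2 0.0 1.1 8.0 0.0",
]
ERRORS = [[Fraction(v) for v in row.split()] for row in ROWS]
EPS = [Fraction(v) for v in "0.9 0.9 0.9 2.0 2.0".split()]  # fixed per case


def semidynamic_probabilities(errors, eps):
    """Exact probabilities: local elite error, global (fixed) epsilon."""
    n, n_cases = len(errors), len(errors[0])
    orders = list(permutations(range(n_cases)))
    probs = [Fraction(0)] * n
    for order in orders:
        pool = list(range(n))
        for t in order:
            best = min(errors[i][t] for i in pool)  # elite within the pool
            pool = [i for i in pool if errors[i][t] <= best + eps[t]]
        for i in pool:  # survivors are chosen uniformly at random
            probs[i] += Fraction(1, len(pool) * len(orders))
    return probs


semi = semidynamic_probabilities(ERRORS, EPS)
rounded = [round(float(p), 3) for p in semi]
```

Under these assumptions the first individual receives probability 1/15: it survives only when case 1 comes first, and then shares the pool with the second and third individuals for every remaining ordering.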
Dynamic ε-lexicase selection produces the most differentiated selection pressure for this example. Consider individual $n_8$ (the eighth row of Table 2), which is the most likely to be selected for this example. It is selected more often than $n_7$ or $n_9$ due to the adaptations to ε as the selection pool is winnowed. For example, $n_8$ is selected by the case sequence that begins with cases 2, 1 and 3, for which the selection pool takes the following form after each case: $\{n_7, n_8, n_9\}$, $\{n_7, n_8\}$, $\{n_8\}$. Conversely, under semi-dynamic ε-lexicase selection, $n_7$ and $n_9$ would not be removed by these cases because ε is fixed for that variant.
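A brute-force sketch of the dynamic variant follows. It assumes that, at each case, both the elite error and ε (taken here to be the median absolute deviation) are recomputed over the current pool, and that an individual passes when its error is at most the pool's best error plus that ε. These assumptions and the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from statistics import median

ROWS = [
    "0.0 1.1 2.2 3.0 5.0", "0.1 1.2 2.0 2.0 6.0", "0.2 1.0 2.1 1.0 7.0",
    "1.0 2.1 0.2 0.0 8.0", "1.1 2.2 0.0 4.0 4.0", "1.2 2.0 0.1 5.0 3.0",
    "2.0 0.1 1.2 6.0 2.0", "2.1 0.2 1.0 7.0 1.0", "2.2 0.0 1.1 8.0 0.0",
]
ERRORS = [[Fraction(v) for v in row.split()] for row in ROWS]


def mad(values):
    """Median absolute deviation."""
    center = median(values)
    return median([abs(v - center) for v in values])


def dynamic_probabilities(errors):
    """Exact probabilities with epsilon recomputed on the current pool."""
    n, n_cases = len(errors), len(errors[0])
    orders = list(permutations(range(n_cases)))
    probs = [Fraction(0)] * n
    for order in orders:
        pool = list(range(n))
        for t in order:
            if len(pool) == 1:
                break
            pool_errors = [errors[i][t] for i in pool]
            threshold = min(pool_errors) + mad(pool_errors)
            pool = [i for i in pool if errors[i][t] <= threshold]
        for i in pool:  # survivors are chosen uniformly at random
            probs[i] += Fraction(1, len(pool) * len(orders))
    return probs


dyn = dynamic_probabilities(ERRORS)
```

With these assumptions the eighth individual receives probability 13/60 (about 0.217), the largest in the population, because ε shrinks to the spread of the three-member pool once the first case has been applied.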
9 Experimental Analysis
In this section we empirically test the variants of ε-lexicase selection introduced in §4. The problems studied in this section are listed in Table 3. We benchmark nine methods using eight different datasets. Six of the problems are available from the UCI repository (Lichman, 2013). The UBall5D problem (also known as Vladislavleva-4) is a simulated equation which has the form
$$f(\mathbf{x}) = \frac{10}{5 + \sum_{i=1}^{5}(x_i - 3)^2}$$
The Tower problem and UBall5D were chosen from the benchmark suite suggested by White et al. (2012).
Table 3
Problem   Dimension  Samples
Airfoil   5          1503
Concrete  8          1030
ENC       8          768
ENH       8          768
Housing   14         506
Tower     25         3135
UBall5D   5          6024
Yacht     6          309

Table 4
Setting                         Value
Population size                 1000
Crossover / mutation            60/40%
Program length limits           [3, 50]
ERC range                       [1,1]
Generation limit                1000
Trials                          50
Terminal Set                    {, ERC, , , , , , , , }
Elitism                         keep best
Fitness (non-lexicase methods)  MSE
We compare eight different selection methods: random selection, tournament selection, lexicase selection, age-fitness Pareto optimization (Schmidt and Lipson, 2011), deterministic crowding (Mahfoud, 1995), and the three ε-lexicase selection methods presented in §4. In addition to the selection methods that are benchmarked, we include a comparison to regularized linear regression using Lasso (Tibshirani, 1996). These methods are described briefly below, along with their abbreviations used in the results.

Random Selection (rand): selection for parents is uniform random.

Tournament Selection (tourn): size two tournaments are conducted for choosing parents.

Lexicase Selection (lex): see Algorithm 3.1.

Age-fitness Pareto optimization (afp): this method introduces a new individual each generation with an age of 0. Each generation, individuals are assigned an age equal to the number of generations since their oldest ancestor entered the population. Parents are selected randomly to create children. The children and parents then compete in survival tournaments of size two, in which an individual is culled from the population if it is dominated in terms of age and fitness by its competitor.

Deterministic crowding (dc): A generational form of this niching method is used in which parents are selected randomly for variation and the child competes to replace the parent with which it is most similar. Similarity is determined based on the Levenshtein distance of the parent’s equation forms, using a universal symbol for coefficients. A child replaces its parent in the population only if it has a better fitness.

Static ε-lexicase selection (eplexs): See Algorithm 4.1.

Semi-dynamic ε-lexicase selection (eplexsd): See Algorithm 4.2.

Dynamic ε-lexicase selection (eplexd): See Algorithm 4.3.

Lasso (lasso): this method incorporates a regularization penalty into least squares regression using an $\ell_1$ measure of the model coefficients and uses a tuning parameter to specify the weight of this regularization. We use a least angle regression (Efron et al., 2004) implementation of Lasso that automatically chooses this parameter using cross validation.
The settings for the GP system (available from https://epistasislab.github.io/ellyn/) are shown in Table 4. We conduct 50 trials of each method by training on a random partition of 70% of the dataset and comparing the prediction error of the best model from each method on the other 30% of the dataset. In addition to test error, we compare the training convergence of the GP-based methods, the semantic diversity of the populations during the run, and the number of cases used for selection for the lexicase methods. We calculate population diversity as the fraction of unique semantics in the population. To compare the number of cases used in selection for the lexicase methods, we save the median number of cases used in selection events, i.e. the case depth, each generation.
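The two population statistics described here are simple to compute. The sketch below assumes that an individual's semantics is its vector of outputs on the training cases and that two semantics are "the same" only when the vectors are exactly equal; the function names and sample data are hypothetical.

```python
from statistics import median


def semantic_diversity(outputs):
    """Fraction of unique semantics (output vectors) in the population."""
    unique = {tuple(vector) for vector in outputs}
    return len(unique) / len(outputs)


def median_case_depth(depths):
    """Median number of cases consumed per selection event in a generation."""
    return median(depths)


# Hypothetical generation: four individuals, two of them behaviorally identical.
outputs = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
diversity = semantic_diversity(outputs)

# Hypothetical case depths recorded for five selection events.
depth = median_case_depth([1, 3, 2, 7, 2])
```

Exact equality is a strict notion of uniqueness for floating-point outputs; a tolerance-based comparison would be a reasonable alternative if numerical noise is a concern.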
9.1 Runtime analysis
In order to get an empirical sense of the time scaling of lexicase selection in comparison to other selection methods, we run a set of experiments in which the population size is varied between 50 and 2000 while using a fixed training set of 100 samples from the UBall5D problem. We run 10 trials of each population size setting and compare the eight GP methods listed above. We use the results to estimate the time complexity of the lexicase selection variants as a function of population size.
10 Results
The boxplots in Figure 4 show the test set MSE for each method on each problem. In the final subplot, we summarize the mean rankings of the methods on each trial of each problem to give a general comparison of performance. Ranks are calculated for each trial, and then averaged over all trials and problems to give an overall ranking comparison. In general we find that the ε-lexicase methods produce models with the best generalization performance across the tested problems. Random selection and Lasso tend to perform the worst on these problems. It is interesting to note the performance of Lasso on the Tower problem, which is better than on other datasets; eplexsd and eplexd are the only GP variants to significantly outperform it. For every problem, a variant of ε-lexicase selection performs the best, and the three variants of it tend to perform similarly. In accordance with previous results (La Cava et al., 2016b), lexicase selection performs worse than tournament selection for these continuous valued problems. In contrast with previous findings (Schmidt and Lipson, 2011), dc tends to outperform afp, although both methods perform better than tournament selection.
The ε-lexicase methods show a marked advantage in converging on a low training error in fewer generations compared to all other methods, as evidenced in Figure 5. Note Figure 5 reports the normalized MSE values on the training set for the best individual in the population each generation. Again we observe very little difference between the ε-lexicase variants.
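The ranking summary can be sketched as follows. This is an illustration under an assumption the text does not state, namely that tied methods share the average of the ranks they span; the function names and MSE values are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1 for the lowest value; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mean_ranks(trials):
    """Average each method's rank over all trials (rows = trials)."""
    per_trial = [average_ranks(trial) for trial in trials]
    return [sum(col) / len(col) for col in zip(*per_trial)]


# Hypothetical test MSE for three methods over two trials.
trials = [[0.3, 0.1, 0.2], [0.1, 0.2, 0.3]]
summary = mean_ranks(trials)
```

Ranking within each trial before averaging keeps problems with very different error scales from dominating the comparison.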
We analyze the statistical significance of the test MSE results in Tables 5 and 6. Table 5 shows pairwise Wilcoxon rank-sum tests for each method in comparison to eplexsd. There are significant differences in performance for all problems between eplexsd and all non-lexicase methods, with the exception of the comparison to dc on the housing and tower datasets. Analysis of variance of the method rankings across all problems indicates significant differences ($p <$ 2e-16). A post-hoc statistical analysis shown in Table 6 indicates that this difference is due to significant differences in rankings across all problems for eplexsd and eplexd in pairwise comparison to all other non-lexicase methods. The three variants of ε-lexicase do not differ significantly from each other according to this test.
Figure 6 shows the semantic diversity of the populations for each generation using different selection methods. The ε-lexicase variants, dc, and lexicase selection all produce the highest population diversity, as expected due to their diversity maintenance design. Interestingly, they all produce more diverse semantics than random selection, suggesting that the preservation of useful diversity is an important feature of the observed performance improvements. Surprisingly, afp is found to produce low semantic diversity, despite its incorporation of age and random restarts each generation. Given that afp has no explicit semantic diversity mechanism, it’s possible that age is not an adequate surrogate for behavioral diversity on these problems.
One of the motivations for introducing an ε threshold into lexicase selection is to allow selection to make use of more cases in continuous domains when selecting parents. Figure 7 demonstrates that ε-lexicase methods achieve this goal. As we noted at the beginning of §4, lexicase selection likely only uses one case per selection event in continuous domains, leading to poor performance. We observe this phenomenon in the median case depth measurements. Among the ε-lexicase variants, eplexsd uses the most cases in general, followed by eplexs and eplexd. Intuitively this result makes sense: ε is likely to be largest when computed across the population, and because eplexsd uses the global ε (Eqn. 2) and a local error threshold, it is likely to keep the most individuals at each case filtering. These results also suggest that ε shrinks substantially when calculated among the pool after each case (Eqn. 3) in eplexd.
Table 5
          lasso     rand      tourn     lex       afp       dc        eplexs    eplexd
airfoil   2.54e-16  2.54e-16  2.54e-16  2.54e-16  2.55e-15  1.59e-14  0.57      0.57
concrete  2.54e-16  2.54e-16  6.24e-13  4.25e-16  2.74e-08  1.66e-04  0.1       0.057
enc       5.15e-16  2.54e-16  4.12e-14  2.57e-15  1.67e-12  1.61e-03  1         0.49
enh       2.54e-16  2.54e-16  2.67e-16  2.54e-16  1.41e-15  2.00e-14  1.21e-04  1.28e-02
housing   1.51e-05  6.20e-13  8.12e-04  3.40e-07  1.57e-02  0.22      1         1
tower     6.38e-03  2.54e-16  1.57e-15  6.39e-15  7.63e-15  3.67e-14  6.38e-03  0.066
uball5d   2.54e-16  2.54e-16  4.80e-15  1.04e-13  6.96e-16  1.55e-11  1         1
yacht     2.54e-16  5.46e-16  1.52e-07  7.86e-07  4.93e-06  1         1         0.053

Table 6
          lasso     rand      tourn     lex       afp       dc        eplexs    eplexsd
eplexs    1.55e-11  1.53e-11  1.36e-09  6.19e-11  6.32e-07  0.066
eplexsd   1.54e-11  1.53e-11  4.00e-11  1.63e-11  1.17e-08  3.59e-03  0.98
eplexd    1.54e-11  1.53e-11  1.05e-10  1.86e-11  4.32e-08  1.00e-02  1         1
10.1 Runtime analysis
The results of the time complexity experiment are shown in Figure 8 as a log-log plot with wall-clock times on the y-axis and the population size on the x-axis. We estimate the time scaling as a function of population size $N$ by fitting a linear model to the log-transformed results, as $\log(\text{time}) = a + b \log(N)$, which gives $\text{time} \propto N^{b}$. The linear models are shown in Figure 8 for the ε-lexicase selection methods, which estimate the exponent of the complexity model, $b$, to be between 0.935 and 0.944. Therefore on average over these settings, the runtime of ε-lexicase selection as a function of $N$ is approximately $O(N^{0.94})$. This suggests a much lower time complexity with respect to $N$ in practice than the worst-case complexity (see §3). In general, the ε-lexicase methods fall between deterministic crowding and tournament selection in terms of wall clock times, with afp achieving the lowest times at higher population sizes. All runtime differences between methods are well within an order of magnitude.
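The log-log fit amounts to ordinary least squares on log-transformed data, where the slope is the estimated exponent. The sketch below uses synthetic timings generated from a known power law, not the paper's measurements; `fit_power_law` and the variable names are illustrative.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of log(time) = a + b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic wall-clock times following time = 0.01 * N ** 0.94 exactly.
sizes = [50, 100, 250, 500, 1000, 2000]
times = [0.01 * n ** 0.94 for n in sizes]
a, b = fit_power_law(sizes, times)
```

On real timings the points scatter around the line, so the fitted exponent is an average scaling over the tested range rather than an asymptotic guarantee.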
11 Discussion
The experimental results show that ε-lexicase selection performs well on the symbolic regression problems compared to other methods. ε-lexicase leads to quicker learning on the training set (Figure 5) and better test set performance (Figure 4). The improvement in performance compared to traditional selection methods appears to be tied to the high semantic diversity that ε-lexicase selection maintains throughout training (Figure 6), and its preservation of individuals that perform well on unique portions of the training cases. ε-lexicase selection shows a categorical improvement over lexicase selection for these continuous valued problems. Although lexicase selection also maintains diverse semantics, its inferior performance can be explained by its underutilization of training cases for selection (Figure 7) and its property of selecting only among strictly elite individuals (see the example from §8), a property that is relaxed through the introduction of ε thresholds in ε-lexicase selection.
Two new variants of ε-lexicase selection, semi-dynamic and dynamic, perform the best overall in our experiments. However, the variants of ε-lexicase do not differ significantly across all tested problems, which suggests that the foundations of the method are robust to different definitions of ε as long as they result in higher leverage of case information during selection compared to normal lexicase selection, which underperforms on regression problems. In view of the results, we suggest semi-dynamic ε-lexicase (eplexsd, Algorithm 4.2) as the default implementation of ε-lexicase selection since it has the lowest mean test ranking and appears to utilize the most case information according to Figure 7.
ε-lexicase selection is a global pool, uniform random sequence, non-elitist version of lexicase selection (Spector, 2012). Compared to traditional lexicase selection, which is elitist, ε-lexicase selection represents a relaxed version of lexicase selection; other potential relaxations could show similar benefits. “Global pool” means that each selection event begins with the entire population; however it is possible that smaller pools, perhaps defined geographically (Spector and Klein, 2006), could improve performance on certain problems that respond well to relaxed selection pressure. Future work could also consider alternative orderings of test cases that may perform better than the “uniform random sequence” ordering that has been the focus of work thus far. Liskowski et al. (2015) attempted to use derived objective clusters as cases in lexicase selection, but found that this actually decreased performance, possibly due to the small number of resultant objectives. Burks and Punch (2016) found that biasing case orderings in terms of performance yielded mixed results. Nevertheless, there may be a form of ordering or case reduction that improves lexicase selection’s performance over random shuffling.
The ordering of the training cases that produce a given parent also contains potentially useful information about the parent that could be used by the search operators in GP. Helmuth and Spector (2015) observed that lexicase selection creates large numbers of distinct behavioral clusters in the population (an observation supported by Figure 6). In that regard, it may be advantageous, for instance, to perform crossover on individuals selected by differing orders of cases such that their offspring are more likely to inherit subprograms with unique partial solutions to a given task. Recent work has highlighted the usefulness of semantically diverse parents when conducting geometric semantic crossover in geometric semantic GP (Chen et al., 2017).
Based on the observations in §6, when the training set is much larger than the population size, some cases are likely to go unused. In these scenarios it may be advantageous to reduce program evaluations by lazily evaluating programs on cases as they appear in selection events. Indeed, Eqn. 5 could be used as a guide for determining when a lazy evaluation strategy would lead to significant computational savings.
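One way such a lazy strategy might look is sketched below for plain lexicase selection: errors are computed only when a selection event actually reaches a case, and are cached for reuse. The names `lazy_lexicase_select` and `error_fn`, the cache layout, and the toy population are hypothetical and do not describe the paper's implementation.

```python
import random


def lazy_lexicase_select(population, cases, error_fn, cache, rng):
    """Lexicase selection that evaluates (individual, case) pairs on demand."""
    pool = list(range(len(population)))
    order = list(range(len(cases)))
    rng.shuffle(order)
    for t in order:
        if len(pool) == 1:
            break  # remaining cases never need to be evaluated
        for i in pool:
            if (i, t) not in cache:  # evaluate only what selection touches
                cache[(i, t)] = error_fn(population[i], cases[t])
        best = min(cache[(i, t)] for i in pool)
        pool = [i for i in pool if cache[(i, t)] == best]
    return rng.choice(pool)


# Toy setup: "programs" are constants and every case has target value 1.0.
population = list(range(10))
cases = [1.0] * 50
cache = {}
winner = lazy_lexicase_select(
    population, cases, lambda model, target: abs(model - target),
    cache, random.Random(0),
)
```

In this toy setting a single case already isolates one elite individual, so only 10 of the 500 possible evaluations are performed; the savings on real problems depend on how quickly pools shrink.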
12 Conclusions
In this paper we present a probabilistic and multiobjective analysis of lexicase selection and ε-lexicase selection. We develop the expected probabilities of selection under lexicase selection variants, and show the impact of population size and training set size on probabilities of selection. For the first time, the connection between lexicase selection and multiobjective optimization is analyzed, showing that individuals selected by lexicase selection occupy the boundaries or near boundaries of the Pareto front in the high-dimensional space spanned by the population errors.
In addition, we experimentally validate ε-lexicase selection, including the new semi-dynamic and dynamic variants, on a set of real-world and synthetic symbolic regression problems. The results suggest that ε-lexicase selection strongly improves the ability of GP to find accurate models. Further analysis of these runs shows that ε-lexicase variants maintain exceptionally high diversity during evolution, and that ε-lexicase variants consider more cases per selection event than standard lexicase selection. The results validate our motivation for creating this variant of lexicase for continuous domains, and suggest the adoption of ε-lexicase selection and variants of it in similar domains.
13 Acknowledgments
This work was supported by the Warren Center for Network and Data Science at the University of Pennsylvania, as well as NIH grants P30ES013508, AI116794 and LM009012. This material is based upon work supported by the National Science Foundation under Grants No. 1617087, 1129139 and 1331283. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References
 Burks and Punch (2016) Burks, A. R. and Punch, W. F. (2016). An investigation of hybrid structural and behavioral diversity methods in genetic programming. In Genetic Programming Theory and Practice XIV, Genetic and Evolutionary Computation, Ann Arbor, USA. Springer.
 Chen et al. (2017) Chen, Q., Xue, B., Mei, Y., and Zhang, M. (2017). Geometric Semantic Crossover with an Angle-Aware Mating Scheme in Genetic Programming for Symbolic Regression, pages 229–245. Springer International Publishing, Cham.
 Deb et al. (2000) Deb, K., Agrawal, S., Pratap, A., and Meyarivan, T. (2000). A Fast Elitist Non-dominated Sorting Genetic Algorithm for Multi-objective Optimization: NSGA-II. In Schoenauer, M., Deb, K., Rudolph, G., Yao, X., Lutton, E., Merelo, J. J., and Schwefel, H.-P., editors, Parallel Problem Solving from Nature PPSN VI, volume 1917, pages 849–858. Springer Berlin Heidelberg, Berlin, Heidelberg.
 Deb et al. (2005) Deb, K., Mohan, M., and Mishra, S. (2005). Evaluating the ε-Domination Based Multi-Objective Evolutionary Algorithm for a Quick Computation of Pareto-Optimal Solutions. Evolutionary Computation, 13(4):501–525.
 Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., and others (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.
 Farina and Amato (2002) Farina, M. and Amato, P. (2002). On the optimal solution definition for many-criteria optimization problems. In Fuzzy Information Processing Society, 2002. Proceedings. NAFIPS. 2002 Annual Meeting of the North American, pages 233–238. IEEE.
 Gathercole and Ross (1994) Gathercole, C. and Ross, P. (1994). Dynamic training subset selection for supervised learning in Genetic Programming. In Davidor, Y., Schwefel, H.-P., and Männer, R., editors, Parallel Problem Solving from Nature – PPSN III, number 866 in Lecture Notes in Computer Science, pages 312–321. Springer Berlin Heidelberg. DOI: 10.1007/3-540-58484-6_275.
 Gonçalves and Silva (2013) Gonçalves, I. and Silva, S. (2013). Balancing learning and overfitting in genetic programming with interleaved sampling of training data. Springer.
 Helmuth (2015) Helmuth, T. (2015). General Program Synthesis from Examples Using Genetic Programming with Parent Selection Based on Random Lexicographic Orderings of Test Cases. Doctoral Dissertations May 2014  current.
 Helmuth et al. (2015) Helmuth, T., McPhee, N. F., and Spector, L. (2015). Lexicase selection for program synthesis: A diversity analysis. In Riolo, R., Worzel, W. P., Kotanchek, M., and Kordon, A., editors, Genetic Programming Theory and Practice XIII, Genetic and Evolutionary Computation, Ann Arbor, USA. Springer.
 Helmuth et al. (2016a) Helmuth, T., McPhee, N. F., and Spector, L. (2016a). Effects of Lexicase and Tournament Selection on Diversity Recovery and Maintenance. In Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion, pages 983–990. ACM.
 Helmuth et al. (2016b) Helmuth, T., McPhee, N. F., and Spector, L. (2016b). The impact of hyperselection on lexicase selection. In Proceedings of the 2016 on Genetic and Evolutionary Computation Conference, pages 717–724. ACM.
 Helmuth and Spector (2015) Helmuth, T. and Spector, L. (2015). General Program Synthesis Benchmark Suite. pages 1039–1046. ACM Press.
 Helmuth et al. (2014) Helmuth, T., Spector, L., and Matheson, J. (2014). Solving Uncompromising Problems with Lexicase Selection. IEEE Transactions on Evolutionary Computation, PP(99):1–1.
 Ishibuchi et al. (2008) Ishibuchi, H., Tsukamoto, N., and Nojima, Y. (2008). Evolutionary manyobjective optimization: A short review. In IEEE congress on evolutionary computation, pages 2419–2426. Citeseer.
 Klein and Spector (2008) Klein, J. and Spector, L. (2008). Genetic programming with historically assessed hardness. Genetic Programming Theory and Practice VI, pages 61–75.
 Krawiec (2016) Krawiec, K. (2016). Behavioral program synthesis with genetic programming, volume 618. Springer.
 Krawiec and Lichocki (2010) Krawiec, K. and Lichocki, P. (2010). Using Cosolvability to Model and Exploit Synergetic Effects in Evolution. In Schaefer, R., Cotta, C., Kołodziej, J., and Rudolph, G., editors, Parallel Problem Solving from Nature, PPSN XI, pages 492–501. Springer Berlin Heidelberg, Berlin, Heidelberg.
 Krawiec and Liskowski (2015) Krawiec, K. and Liskowski, P. (2015). Automatic derivation of search objectives for test-based genetic programming. In Genetic Programming, pages 53–65. Springer.
 Krawiec and Nawrocki (2013) Krawiec, K. and Nawrocki, M. (2013). Implicit fitness sharing for evolutionary synthesis of license plate detectors. Springer.
 Krawiec and O’Reilly (2014) Krawiec, K. and O’Reilly, U.-M. (2014). Behavioral programming: a broader and more detailed take on semantic GP. In Proceedings of the 2014 conference on Genetic and evolutionary computation, pages 935–942. ACM Press.
 Krawiec et al. (2015) Krawiec, K., Swan, J., and O’Reilly, U.-M. (2015). Behavioral program synthesis: Insights and prospects. In Riolo, R., Worzel, W. P., Kotanchek, M., and Kordon, A., editors, Genetic Programming Theory and Practice XIII, Genetic and Evolutionary Computation. Springer, Ann Arbor, USA.
 La Cava et al. (2016a) La Cava, W., Danai, K., Spector, L., Fleming, P., Wright, A., and Lackner, M. (2016a). Automatic identification of wind turbine models using evolutionary multiobjective optimization. Renewable Energy, 87, Part 2:892–902.
 La Cava and Moore (2017a) La Cava, W. and Moore, J. (2017a). A General Feature Engineering Wrapper for Machine Learning Using Lexicase Survival. In Genetic Programming, pages 80–95. Springer, Cham. DOI: 10.1007/978-3-319-55696-3_6.
 La Cava and Moore (2017b) La Cava, W. and Moore, J. H. (2017b). Ensemble representation learning: an analysis of fitness and survival for wrapper-based genetic programming methods. arXiv preprint arXiv:1703.06934. To appear in GECCO ’17: Proceedings of the 2017 Genetic and Evolutionary Computation Conference.
 La Cava et al. (2016b) La Cava, W., Spector, L., and Danai, K. (2016b). Epsilon-Lexicase Selection for Regression. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO ’16, pages 741–748, New York, NY, USA. ACM.
 Langdon (1995) Langdon, W. B. (1995). Evolving Data Structures with Genetic Programming. In ICGA, pages 295–302.
 Laumanns et al. (2002) Laumanns, M., Thiele, L., Zitzler, E., and Deb, K. (2002). Archiving with guaranteed convergence and diversity in multiobjective optimization. In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation, pages 439–447. Morgan Kaufmann Publishers Inc.
 Lichman (2013) Lichman, M. (2013). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
 Liskowski et al. (2015) Liskowski, P., Krawiec, K., Helmuth, T., and Spector, L. (2015). Comparison of Semantic-aware Selection Methods in Genetic Programming. In Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO Companion ’15, pages 1301–1307, New York, NY, USA. ACM.
 Mahfoud (1995) Mahfoud, S. W. (1995). Niching methods for genetic algorithms. PhD thesis.
 Martínez et al. (2013) Martínez, Y., Naredo, E., Trujillo, L., and Galván-López, E. (2013). Searching for novel regression functions. In Evolutionary Computation (CEC), 2013 IEEE Congress on, pages 16–23. IEEE.
 McKay (2001) McKay, R. I. B. (2001). An Investigation of Fitness Sharing in Genetic Programming. The Australian Journal of Intelligent Information Processing Systems, 7(1/2):43–51.
 Moraglio et al. (2012) Moraglio, A., Krawiec, K., and Johnson, C. G. (2012). Geometric semantic genetic programming. In Parallel Problem Solving from Nature - PPSN XII, pages 21–31. Springer.
 Pham-Gia and Hung (2001) Pham-Gia, T. and Hung, T. L. (2001). The mean and median absolute deviations. Mathematical and Computer Modelling, 34(7–8):921–936.
 Poli and Langdon (1998) Poli, R. and Langdon, W. B. (1998). Schema theory for genetic programming with one-point crossover and point mutation. Evolutionary Computation, 6(3):231–252.
 Schmidt and Lipson (2008) Schmidt, M. and Lipson, H. (2008). Coevolution of Fitness Predictors. IEEE Transactions on Evolutionary Computation, 12(6):736–749.
 Schmidt and Lipson (2011) Schmidt, M. and Lipson, H. (2011). Age-fitness Pareto optimization. In Genetic Programming Theory and Practice VIII, pages 129–146. Springer.
 Smits and Kotanchek (2005) Smits, G. F. and Kotanchek, M. (2005). Pareto-front exploitation in symbolic regression. In Genetic Programming Theory and Practice II, pages 283–299. Springer.
 Spector (2012) Spector, L. (2012). Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference companion, pages 401–408.
 Spector and Klein (2006) Spector, L. and Klein, J. (2006). Trivial geography in genetic programming. In Genetic programming theory and practice III, pages 109–123. Springer.
 Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
 Vanneschi et al. (2014) Vanneschi, L., Castelli, M., and Silva, S. (2014). A survey of semantic methods in genetic programming. Genetic Programming and Evolvable Machines, 15(2):195–214.
 Wagner et al. (2007) Wagner, T., Beume, N., and Naujoks, B. (2007). Pareto-, Aggregation-, and Indicator-Based Methods in Many-Objective Optimization. In Evolutionary Multi-Criterion Optimization, pages 742–756. Springer, Berlin, Heidelberg. DOI: 10.1007/978-3-540-70928-2_56.
 White et al. (2012) White, D. R., McDermott, J., Castelli, M., Manzoni, L., Goldman, B. W., Kronberger, G., Jaśkowski, W., O’Reilly, U.-M., and Luke, S. (2012). Better GP benchmarks: community survey results and proposals. Genetic Programming and Evolvable Machines, 14(1):3–29.
 Xie et al. (2007) Xie, H., Zhang, M., and Andreae, P. (2007). Another investigation on tournament selection: modelling and visualisation. In Proceedings of the 9th annual conference on Genetic and evolutionary computation, pages 1468–1475. ACM.
 Zitzler et al. (2001) Zitzler, E., Laumanns, M., and Thiele, L. (2001). SPEA2: Improving the strength Pareto evolutionary algorithm. Eidgenössische Technische Hochschule Zürich (ETH), Institut für Technische Informatik und Kommunikationsnetze (TIK).