ε-Lexicase Selection for Regression
Abstract
Lexicase selection is a parent selection method that considers test cases separately, rather than in aggregate, when performing parent selection. It performs well in discrete error spaces but not on the continuous-valued problems that compose most system identification tasks. In this paper, we develop a new form of lexicase selection for symbolic regression, named ε-lexicase selection, that redefines the pass condition for individuals on each test case in a more effective way. We run a series of experiments on real-world and synthetic problems with several treatments of ε and quantify how ε affects parent selection and model performance. ε-lexicase selection is shown to be effective for regression, producing better fit models compared to other techniques such as tournament selection and age-fitness Pareto optimization. We demonstrate that ε can be adapted automatically for individual test cases based on the population performance distribution. Our experiments show that ε-lexicase selection with automatic ε produces the most accurate models across tested problems with negligible computational overhead. We show that behavioral diversity is exceptionally high in ε-lexicase selection treatments, and that ε-lexicase selection makes use of more fitness cases when selecting parents than lexicase selection, which helps explain the performance improvement.

Note: this is a corrected version of the original GECCO '16 conference paper. Equations 2 and 5 have been corrected to indicate that the pass conditions for individuals in ε-lexicase selection are defined relative to the best error in the population on that training case, not in the selection pool.
William La Cava 
Department of Mechanical and Industrial Engineering 
University of Massachusetts 
Amherst, MA 01003 
wlacava@umass.edu 
Lee Spector 
School of Cognitive Science 
Hampshire College 
Amherst, MA 01002 
lspector@hampshire.edu 
Kourosh Danai 
Department of Mechanical and Industrial Engineering 
University of Massachusetts 
Amherst, MA 01003 
danai@engin.umass.edu 
Keywords: genetic programming, system identification, regression, parent selection
Genetic programming (GP) traditionally tests programs on many test cases and then reduces the performance into a single value that is used to select parents for the next generation. Typically the fitness of an individual is quantified as its aggregate performance over the training set T, using e.g. the mean absolute error (MAE), which is quantified for individual program n as:
f(n) = (1/T) Σ_{t=1}^{T} |y_t − ŷ_t(n)|    (1)

where x_t represents the variables or features, y_t is the target output, and ŷ_t(n) is the program's output on case t. As a result of the aggregation of the absolute error vector e(n) in Eq. (1), the relationship of the program's behavior to the individual training cases is represented crudely when choosing models to propagate. As others have pointed out [?], aggregate fitnesses strongly reduce the information conveyed to GP about n relative to the description of behavior available in e(n), thereby underutilizing information that could help guide the search. In addition, many forms of aggregation assume all tests are equally informative (although there are exceptions, including implicit fitness sharing, which is discussed below). Therefore individuals that are elite (i.e. have the lowest error in the population) for portions of the training set are not selected if they perform poorly in other regions and therefore have a higher f(n). By applying equivalent selection pressure across all test cases, GP misses the opportunity to identify programs that perform especially well in certain regions of the problem, most importantly those portions of the problem that are more difficult for the process to solve. We expect GP to solve problems through the induction, propagation and recombination of building blocks (i.e. subprograms) that provide partial solutions to our desired task. Hence we wish to select those programs that imply a partial solution by performing uniquely well on subsets of the problem.
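As a concrete illustration of the aggregation in Eq. (1), the following sketch (our own illustrative Python, not from the paper) computes the absolute error vector and its MAE reduction for two hypothetical programs, showing how aggregation can hide elitism on subsets of cases:

```python
# Two hypothetical programs evaluated on the same four training cases.
# Program A is elite (zero error) on half the cases but poor elsewhere;
# program B is mediocre everywhere. Their aggregate fitnesses are equal,
# so aggregate selection cannot distinguish them.
y = [1.0, 2.0, 3.0, 4.0]          # target outputs y_t
yhat_a = [1.0, 2.0, 5.0, 6.0]     # elite on cases 0 and 1
yhat_b = [2.0, 3.0, 4.0, 5.0]     # uniformly off by 1

def error_vector(y, yhat):
    """Absolute error on each training case: e_t(n) = |y_t - yhat_t(n)|."""
    return [abs(t - p) for t, p in zip(y, yhat)]

def mae(y, yhat):
    """Aggregate fitness of Eq. (1): mean absolute error."""
    e = error_vector(y, yhat)
    return sum(e) / len(e)

print(error_vector(y, yhat_a))         # [0.0, 0.0, 2.0, 2.0]
print(error_vector(y, yhat_b))         # [1.0, 1.0, 1.0, 1.0]
print(mae(y, yhat_a), mae(y, yhat_b))  # 1.0 1.0 -- indistinguishable in aggregate
```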
Several methods have been proposed to reward individuals with uniquely good test performance, such as implicit fitness sharing (IFS) [?], historically assessed hardness [?], and co-solvability [?], all of which assign greater weight to fitness cases that are judged to be more difficult in view of the population performance. Perhaps the most effective parent selection method recently proposed is lexicase selection [?, ?]. In particular, "global pool, uniform random sequence, elitist lexicase selection" [?], which we refer to simply as lexicase selection, has outperformed other similarly motivated methods in recent studies [?, ?]. Despite these gains, it fails to produce such benefits when applied to continuous symbolic regression problems, due to its method of selecting individuals based on test case elitism. We demonstrate in this paper that by redefining the test case pass condition in lexicase selection using an ε threshold, the benefits of lexicase selection can be achieved in continuous domains.
We begin by describing the ε-lexicase selection algorithm and discussing how it differs from standard lexicase selection. Several definitions of ε are proposed. We then briefly review related work and describe the relation between ε-lexicase selection and multi-objective methods. The experimental analysis follows, beginning with a parameter variation study of ε and ending with a comparison of several GP methods on a set of real-world and synthetic symbolic regression problems. Given the results, we propose future research directions and summarize our findings.
Lexicase selection is a parent selection technique based on lexicographic ordering of test (i.e. fitness) cases. Each parent selection event proceeds as follows:

1. The entire population is added to the selection pool.

2. The fitness cases are shuffled.

3. Individuals in the pool with a fitness worse than the best fitness on this case among the pool are removed.

4. If more than one individual remains in the pool, the first case is removed and step 3 is repeated with the next case. If only one individual remains, it is the chosen parent. If no more fitness cases are left, a parent is chosen randomly from the remaining individuals.
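The steps above can be sketched as follows (a minimal Python sketch; the function and variable names are our own, not taken from the authors' implementation):

```python
import random

def lexicase_select(pop_errors, rng=random):
    """One parent selection event of elitist lexicase selection.
    pop_errors[i][t] is the error of individual i on training case t.
    Returns the index of the selected parent."""
    pool = list(range(len(pop_errors)))          # step 1: global pool
    cases = list(range(len(pop_errors[0])))
    rng.shuffle(cases)                           # step 2: shuffle the cases
    while cases and len(pool) > 1:
        t = cases.pop(0)                         # consider the first remaining case
        best = min(pop_errors[i][t] for i in pool)
        pool = [i for i in pool if pop_errors[i][t] == best]  # step 3: keep pool elites
    return rng.choice(pool)                      # step 4: random tie-break
```

For example, with error vectors [[0, 1], [1, 0], [1, 1]], the third individual is never selected: whichever case comes first, it is filtered out by the elite on that case.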
As evidenced above, the algorithm is quite simple to implement. In this procedure, test cases act as filters, and a randomized path through these filters is constructed each time a parent is selected. Each parent selection event returns a parent that is elite on at least the first test case used to select it. In turn, the filtering capacity of a test case is directly proportional to its difficulty since it culls the individuals from the pool that do not do the best on it. Therefore selective pressure continually shifts to individuals that are elite on cases that are not widely solved in the population. Because each parent is selected via a randomized ordering of test cases and these cases perform filtering proportional to their difficulty, individuals are pressured to perform well on unique combinations of test cases, which promotes individuals with diverse performance, leading to increased diversity observed during evolutionary runs [?].
Lexicase selection was originally applied to multimodal [?] and "uncompromising" [?] problems. An uncompromising problem is one in which only exact solutions to every test case produce a satisfactory program. For those types of problems, using each case as a way to select only elite individuals is well motivated, since each test case must be solved exactly. In regression, exact solutions to test cases can only be expected for synthetic problems, whereas real-world problems are subject to noise and measurement error. With respect to the lexicase selection process, continuously-valued errors are problematic because individuals in the population are unlikely to share elitism on any particular case unless they are identical equations. As a result, on regression problems the standard lexicase procedure typically uses only one case for each parent selection, resulting in poor performance.
We hypothesize that lexicase selection performs poorly on continuous errors because the case passing criterion is too stringent in continuous error spaces. For individual n to pass case t, lexicase requires that e_t(n) = e_t*, where e_t* is the best error on that test case in the pool. To remedy this shortcoming, we introduce ε-lexicase selection, which modulates the pass condition on test cases via a parameter ε, such that only individuals outside of a predefined ε are filtered in step 3 of lexicase selection. We experiment with four different definitions of ε in this paper. The first two, ε_e and ε_y, are absolute thresholds that define the pass condition of program n on test case t as follows:
h_t(n) = I( e_t(n) < e_t* + ε )    (2)

h_t(n) = I( |y_t − ŷ_t(n)| < ε )    (3)

Here I(·) is the indicator function that returns 1 if its argument is true and 0 if false, and e_t* is the best error on case t in the population P. As shown in Eq. (2), ε_e defines ε relative to e_t*, and therefore case t is always passed by at least one individual in P. Conversely, ε_y (Eq. (3)) defines ε relative to the target value y_t, meaning that ŷ_t(n) must be within ε of y_t to pass case t. In this way ε_y provides no selection pressure on a case if there is not an individual in the population within adequate range of the true value for that case.
ε_e and ε_y are the simplest definitions of ε-lexicase selection, but they have two distinct disadvantages: 1) they must be specified by the user, and 2) their optimal values are problem dependent. An absolute ε is unable to provide a desired amount of filtering in each selection event since it is blind to the population's performance. Ideally, ε should adapt automatically to take into account the values of the error across P on case t, denoted e_t, so that it can modulate its selectivity based on the difficulty of t. A common estimate of difficulty in performance on a fitness case is variance [?]; in this regard ε could be defined according to the standard deviation of e_t, i.e. σ(e_t). Given the high sensitivity of σ to outliers, however, we opt for a more robust estimation of variability by using the median absolute deviation (MAD) [?] of e_t, defined as
λ(e_t) = median( |e_t − median(e_t)| )    (4)

We use Eq. (4) in the definition of two ε values, ε_λe and ε_λy, that are defined analogously to ε_e and ε_y as:
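For illustration, the MAD of Eq. (4) can be computed as follows (a short sketch using the Python standard library), alongside the standard deviation to show its robustness to a single outlier:

```python
import statistics

def mad(x):
    """Median absolute deviation: lambda(e) = median(|e - median(e)|)."""
    m = statistics.median(x)
    return statistics.median(abs(v - m) for v in x)

# One large outlier barely moves the MAD but dominates the standard deviation.
errors = [0.1, 0.2, 0.2, 0.3, 100.0]
print(mad(errors))               # ~0.1
print(statistics.stdev(errors))  # ~44.6, dominated by the outlier
```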
h_t(n) = I( e_t(n) < e_t* + λ(e_t) )    (5)

h_t(n) = I( |y_t − ŷ_t(n)| < λ(e_t) )    (6)

An important consideration in parent selection is the time complexity of the selection procedure. Lexicase selection has a theoretical worst-case time complexity of O(N²T) for a population of size N and T training cases, compared to a time complexity of O(NT) for tournament selection. Although clearly undesirable, this worst-case complexity is only reached if every individual passes every test case during selection; in practice [?], lexicase selection normally uses a small number of cases for each selection and therefore incurs only a small amount of overhead. We quantify the wall clock times for our variants of lexicase compared to other methods in the experimental analysis below.
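Following Eq. (5), step 3 of lexicase selection can be modified to use the λ-based pass condition, as sketched below (our own Python sketch: the handling of a case that no pool member passes is our assumption, and in practice e_t* and λ(e_t) would be computed once per generation rather than per selection event):

```python
import random
import statistics

def mad(x):
    """Median absolute deviation (Eq. (4))."""
    m = statistics.median(x)
    return statistics.median(abs(v - m) for v in x)

def eps_lexicase_select(pop_errors, rng=random):
    """One eps-lexicase parent selection event with the pass condition of
    Eq. (5): individual i passes case t iff e_t(i) < e_t* + lambda(e_t),
    where e_t* and lambda(e_t) are computed over the whole population."""
    n_cases = len(pop_errors[0])
    e_star = [min(e[t] for e in pop_errors) for t in range(n_cases)]
    lam = [mad([e[t] for e in pop_errors]) for t in range(n_cases)]
    pool = list(range(len(pop_errors)))
    cases = list(range(n_cases))
    rng.shuffle(cases)
    while cases and len(pool) > 1:
        t = cases.pop(0)
        passers = [i for i in pool if pop_errors[i][t] < e_star[t] + lam[t]]
        if passers:           # if no pool member passes, the case filters nothing
            pool = passers
    return rng.choice(pool)
```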
Although to an extent the ideas of multi-objective optimization apply to multiple test cases, they are qualitatively different: objectives are the defined goals of a task, whereas test cases are tools for estimating progress towards those objectives. Objectives and test cases therefore commonly exist at different scales: symbolic regression often involves one or two objectives (e.g. accuracy and model conciseness) and hundreds or thousands of test cases. One example of using test cases explicitly as objectives occurs in Langdon's work on data structures [?], in which small numbers of test cases (in this case 6) are used as multiple objectives in a Pareto selection scheme. Other multi-objective approaches such as NSGA-II [?], SPEA2 [?] and ParetoGP [?] are commonly used with a small set of objectives in symbolic regression. The "curse of dimensionality" prevents the use of objectives at the scale of typical test case sizes, since most individuals become non-dominated (program n₁ dominates n₂ if e_t(n₁) ≤ e_t(n₂) for all t and e_t(n₁) < e_t(n₂) for at least one t, where e is minimized), leading to selection based mostly on expensive diversity measures rather than performance. Scaling issues in many-objective optimization are reviewed in [?]. In lexicase selection, parents are guaranteed to be non-dominated with respect to the fitness cases. Pareto strength in SPEA2 promotes individuals based on how many individuals they dominate; similarly, lexicase selection increases the probability of selection for individuals who solve more cases and harder cases (i.e. cases that are not solved by other individuals), and decreases it for individuals who solve fewer or easier cases.
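The dominance relation described above can be written directly (a minimal sketch; minimization of errors is assumed):

```python
def dominates(e1, e2):
    """True if error vector e1 dominates e2: no worse on every case and
    strictly better on at least one (errors are minimized)."""
    return all(a <= b for a, b in zip(e1, e2)) and \
           any(a < b for a, b in zip(e1, e2))

# With many cases, most pairs of individuals are mutually non-dominated:
assert dominates([0.0, 1.0], [1.0, 1.0])
assert not dominates([0.0, 2.0], [1.0, 1.0])  # incomparable pair
assert not dominates([1.0, 1.0], [1.0, 1.0])  # equal vectors do not dominate
```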
A number of GP methods attempt to affect selection by weighting test cases based on population performance. In non-binary implicit fitness sharing (IFS) [?], the fitness proportion of a case is scaled by the performance of other individuals on that case. Similarly, historically assessed hardness scales error on each test case by the success rate of the population [?]. Discovery of objectives by clustering (DOC) [?] clusters test cases by population performance, thereby reducing test cases into a set of objectives for search. Both IFS and DOC were outperformed by lexicase selection on program synthesis and boolean problems in previous studies [?, ?]. Other methods attempt to sample a subset of the training cases to reduce computation time or improve performance, such as dynamic subset selection [?], interleaved sampling [?], and co-evolved fitness predictors [?]. Unlike these methods, lexicase selection begins each selection event with the full set of training cases, and allows selection to adapt to program performance on them.
The conversion of a model's real-valued fitness into discrete values based on an ε threshold has been explored in other research; for example, Novelty Search GP [?] uses a reduced error vector to define the behavioral representation of individuals in the population. This paper proposes it for the first time as a solution to applying lexicase selection effectively to regression.
As a behavior-based search driver, lexicase selection belongs to a class of GP systems that attempt to incorporate a program's behavior explicitly into the search process, and as such shares a general motivation with recently proposed methods such as semantic GP [?] and behavioral GP [?], despite differing strongly in approach. Although lexicase selection is designed with behavioral diversity in mind, recent studies suggest that structural diversity can also significantly affect GP performance [?].
Here we define the problems used to assess ε-lexicase selection, as well as a set of existing GP methods used for comparison. We then analyze and tune the values of ε_e and ε_y on an example problem and discuss the results. Finally we test all of the methods on each problem and summarize the findings.
Three synthetic and three real-world problems were chosen for benchmarking the different GP methods. The first problem is the housing data set [?] that seeks a model to estimate Boston housing prices. The second problem is the Tower problem (http://symbolicregression.com/?q=towerProblem), which consists of 15-minute averaged time series data taken from a chemical distillation tower, with the goal of predicting propylene concentration. The third problem, referred to as the Wind problem [?], features data collected from the Controls and Advanced Research Turbine, a 600 kW wind turbine operated by the National Wind Technology Center. The data set consists of time-series measurements of wind speed, control actions, and acceleration measurements that are used to predict the bending moment measured at the base of the wind turbine. In this case solutions are formulated as first-order discrete-time dynamic models of the form ŷ(t+1) = f(x(t), ŷ(t)). The fourth and fifth problem tasks are to estimate the energy efficiency of heating (ENH) and cooling (ENC) requirements for various simulated buildings [?]. The last problem is the UBall5D problem (also known as Vladislavleva-4), which has the form

y = 10 / (5 + Σ_{i=1}^{5} (x_i − 3)²)
The Tower problem and UBall5D were chosen from the benchmark suite suggested by White et al. [?]. The dimensions of all data sets are shown in Table 1. Aside from UBall5D, which has a predefined test set [?], the problems were divided 70/30 into training and testing sets. These sets were normalized to zero mean and unit variance and randomly partitioned for each trial.
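The preprocessing described above can be sketched as follows (assuming numpy; the function name is illustrative, and statistics are computed on the full set here for brevity, though fitting the scaler on the training partition alone is also common practice):

```python
import numpy as np

def split_and_normalize(X, y, train_frac=0.7, seed=0):
    """z-score features and target to zero mean / unit variance,
    then randomly partition into 70/30 training and testing sets."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = (y - y.mean()) / y.std()
    idx = np.random.default_rng(seed).permutation(len(y))
    n_train = int(train_frac * len(y))
    tr, te = idx[:n_train], idx[n_train:]
    return X[tr], y[tr], X[te], y[te]
```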
Our four definitions of ε yield four methods which we analyze in our experiments, abbreviated as Lex ε_e, Lex ε_y, Lex ε_λe, and Lex ε_λy. We compare these variants to standard lexicase selection (denoted simply Lex) and standard tournament selection of size 2 (denoted Tourn). To control for the effect of selection in GP, we also compare these methods to random parent selection, denoted Rand Sel.
In addition to these methods, many state-of-the-art symbolic regression tools leverage Pareto optimization [?, ?, ?] and/or age layering [?] to improve symbolic regression performance. With this in mind, we also compare ε-lexicase selection to age-fitness Pareto survival (AFP) [?], in which each individual is assigned an age equal to the number of generations since its oldest ancestor was created. Each generation, a new individual is introduced to the population as a means of random restart. Selection for breeding is random, and during breeding a number of children are created equal to the overall population size. Survival is conducted according to the environmental selection algorithm in SPEA2 [?], as in [?].
Every method uses subtree crossover and point mutation as search operators. For each method, we include a parameter hill climbing step each generation that perturbs the constants in each equation with Gaussian noise and saves those changes that improve the model's MAE (Eq. (1)). The complete code for these tests is available online at https://www.github.com/lacava/ellen.
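The constant hill climbing step can be sketched as follows (a minimal Python sketch; `evaluate_mae` and the perturbation scale `sigma` are illustrative assumptions, not details of the authors' implementation):

```python
import random

def hill_climb_constants(constants, evaluate_mae, sigma=0.1, rng=random):
    """One epoch of parameter hill climbing: perturb each constant with
    Gaussian noise and keep only changes that lower the training MAE.
    `evaluate_mae` maps a constant list to the model's MAE (Eq. (1))."""
    best = evaluate_mae(constants)
    for i in range(len(constants)):
        old = constants[i]
        constants[i] = old + rng.gauss(0.0, sigma)
        trial = evaluate_mae(constants)
        if trial < best:
            best = trial          # keep the improving perturbation
        else:
            constants[i] = old    # revert
    return constants, best
```

By construction the returned MAE is never worse than the starting MAE, so the step can only improve (or preserve) model fit each generation.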
Setting                Value
Population size        1000
Crossover / mutation   80/20%
Program length limits  [3, 50]
ERC range              [-1, 1]
Generation limit       1000
Trials                 30
Terminal set           {x, ERC, +, −, *, /, sin, cos, exp, log}
Elitism                keep best

Problem    Dimension  Training Cases  Test Cases
Housing    14         354             152
Tower      25         2195            940
Wind       6          4200            1800
ENH        8          538             230
ENC        8          538             230
UBall5D    5          1024            5000

Table 1: Symbolic regression problem settings.

In the cases of ε_e and ε_y, the user must specify fixed parameter values. For both cases we tested the set of parameter values {0.01, 0.05, 0.10, 0.50, 1.0, 5.0, 10.0} over 15 trials. For ε_y, these values mean that an individual's output ŷ_t must be within 1% to 1000% of one (normalized) standard deviation of y_t to pass test t. For ε_e, e_t(n) must be within that range of e_t*. The parameter study was conducted on the Tower symbolic regression problem, the details of which are shown in Table 1. It is important to note that the optimal value of these parameters is problem dependent; nevertheless, the best values from the parameter tuning experiment were used for all problems in the subsequent sections.
The test fitness results for different values of ε_e are shown in the accompanying figures, along with the number of cases used during selection. The best results are obtained for ε = 5.0. The case usage results match our intuition about the sensitivity of case usage to ε: larger tolerances for error use more cases in each selection event. At the largest tested values of ε we observe steady growth in case usage over the run, suggesting population convergence. The diversity of the population's behavior also grows with ε, as shown by the number of unique output vectors, i.e. unique ŷ(n). For subsequent experiments we use ε_e = 5.0, which corresponds to the lowest median test fitness for the Tower problem.
The test fitness results for different values of ε_y are shown in the accompanying figure. At the larger tested values, we note that Lex ε_y uses all fitness cases for nearly every selection event, causing long runtimes and suggesting that selection has become random. As we show in the comparisons below, Rand Sel performs similarly to Lex ε_y on this problem, further supporting the idea of selection pressure loss. For the subsequent experiments we set ε_y to the tested value corresponding to the lowest median test fitness.
We summarize the experimental results in Table 2, which reports the median best fitness of all runs on the test sets, the mean ranking of each method across problems, and the total runtime to conduct the trials for each method. The distributions of MAE (Eq. (1)) on the test sets for the best-of-run models are shown in the accompanying figures, as is the best fitness on the training set at each generation.
Most of the differences in learning on the training set occur in the first 250 generations, although in all cases Lex ε_λe or Lex ε_λy maintains the lowest final training set error. Across all problems, the median best fitness on the test sets is obtained by either Lex ε_λe or Lex ε_λy. According to the pairwise tests annotated in Table 2, Lex ε_λe or Lex ε_λy perform significantly better than Rand Sel, Tourn, Lex and Lex ε_y on 6/6 problems, better than AFP on 5/6 problems, and better than Lex ε_e on one problem. In terms of the mean ranking across tests (Table 2), Lex ε_λe and Lex ε_λy rank the best, followed by Lex ε_e, AFP, Lex ε_y, Tourn, Lex, and Rand Sel, in that order. We conduct a Friedman's test of the mean rankings across problems; the resulting comparison intervals are shown in the accompanying figure. This comparison indicates that the performance improvement of Lex ε_λe and Lex ε_λy relative to Tourn, Lex, and Rand Sel is significant across all tested problems. The intervals show partial overlap with respect to AFP and Lex ε_e that may warrant further experiments.
The median total trial times reported in Table 2 indicate that ε-lexicase selection takes nearly the same time to finish as tournament selection in practice, despite its higher theoretical worst-case time complexity. On average, Lex ε_e, ε_y, ε_λe, and ε_λy report wall clock times that are 96%, 82%, 120%, and 120% of the duration of tournament selection, respectively, giving a negligible average of 105%. Rand Sel finishes the fastest due to the absence of selection, and AFP finishes the slowest, most likely due to the overhead of computing dominance relations and, in the case of non-dominated populations, densities in objective space [?]. It is possible that the tournament-based version of AFP [?] would have a lower runtime, although it may have difficulty scaling to more than two objectives [?].
Standard Lex takes only 41% of the time of tournament selection to finish, which is explained by its case usage. The number of fitness cases used by the lexicase selection variants on four of the problems is shown in the accompanying figure. Note that Lex uses only one test case during parent selection due to the rarity of elitism in continuous error space; the fact that parents are chosen based on single cases also explains its poor performance. On the Tower and Wind problems, the ε-lexicase variants show small increases in case usage over the course of evolution. The variant that uses the highest number of cases on Tower and Wind uses the lowest number of cases on ENH and ENC among the ε variants. On ENH and ENC, a higher percentage of the total cases is used during selection compared to Tower and Wind, indicating the problem-dependent nature of GP population structures and performance. Lex ε_λe and Lex ε_λy use nearly the same numbers of cases for selection on each problem, which suggests that the performance of λ-based ε is robust to being defined relative to e_t* or y_t (Eqs. (5) and (6)). Lex ε_e and ε_y, on the other hand, vary strongly across problems in terms of their case usage, again indicating their parametric sensitivity.
We observe exceptionally high population diversity for the ε-lexicase methods, which supports observations in [?]. We measure diversity as the percentage of unique output vectors ŷ(n) among programs in the population, plotted as an average across problems in the accompanying figure. Interestingly, diversity is higher using ε-lexicase selection than random selection, which indicates ε-lexicase selection's ability to exploit behavioral differences to increase diversity beyond the effects of the search operators. The differential performance between Rand Sel and ε-lexicase selection shown in Table 2 demonstrates that the gains afforded by ε-lexicase selection are not due simply to increased randomization, but rather to the promotion of individuals with exceptionally good performance on diverse orderings of test cases.
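The diversity measure used here, the fraction of unique output vectors, can be computed as follows (a sketch; rounding of floating-point outputs before comparison may be needed in practice and is omitted here):

```python
def behavioral_diversity(outputs):
    """Fraction of unique output (semantics) vectors yhat(n) in the population."""
    return len({tuple(yhat) for yhat in outputs}) / len(outputs)

# Four programs, two of which behave identically on the training cases:
print(behavioral_diversity([[1, 2], [1, 2], [3, 4], [5, 6]]))  # 0.75
```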
Method     Housing  Tower   Wind    ENH     ENC     UBall5D  Mean Rank  Median Total Trial Time (hr:min:s)
Rand Sel   0.469    0.458   0.463   0.288   0.272   0.128    7.83       00:07:25
Tourn      0.408    0.402   0.397   0.207   0.236   0.113    6.33       00:24:37
AFP        0.354    0.319   0.381   0.138   0.171   0.094    4.00       01:09:18
Lex        0.402    0.355   0.419   0.210   0.237   0.142    6.83       00:10:11
Lex ε_e    0.325    0.260   0.386   0.113   0.150   0.079    3.17       00:23:44
Lex ε_y    0.386    0.263   0.386   0.165   0.193   0.082    4.50       00:20:24
Lex ε_λe   0.321    0.239   0.378*  0.101*  0.137*  0.080    1.67       00:29:26
Lex ε_λy   0.309*   0.233*  0.381   0.106   0.141   0.078*   1.67       00:29:37

Table 2: Comparison of median best-of-run MAE on the test sets and total trial time. The best fitness results are marked with an asterisk. Significant improvements with respect to each method are denoted in the original by superscripts corresponding to the method labels. Significance is defined as p < 0.05 according to a pairwise Wilcoxon rank-sum test with Holm correction. The median total time to run 30 trials of each algorithm is shown on the right.

ε-lexicase selection is a global pool, uniform random sequence, non-elitist version of lexicase selection [?] that performs well on symbolic regression problems according to the experimental analysis presented in the last section. "Global pool" refers to the fact that each selection event begins with the whole population (step 1 of the algorithm). Smaller pool sizes have yet to be tried, but could potentially improve performance on certain problems that historically respond well to relaxed selection pressure. Pools could also be defined geographically [?]. "Uniform random sequence" refers to the shuffling procedure for cases in step 2, and, as is the case with pool size, other orderings of test cases have yet to be reported in the literature. One could consider biasing the ordering of cases in ways that select parents with certain desired properties.
In [?], Liskowski attempted to use derived objective clusters as cases in lexicase selection, but found that this actually decreased performance. Still, there may be a form of ordering or case reduction that improves lexicase selection’s performance over random shuffling.
The ordering of the test cases that produce a given parent also contains potentially useful information that could be used by the search operators in GP. Helmuth [?] observed that lexicase selection creates large numbers of distinct behavioral clusters in the population (an observation supported by our diversity measurements). In that regard, it may be advantageous, for instance, to perform crossover on individuals selected by differing orders of cases such that their offspring are more likely to inherit subprograms with unique partial solutions to a given task. On the other hand, one could argue for pairing individuals based on similar selection cases, to promote niching and to minimize the destructive nature of subtree crossover.
We find that ε-lexicase selection, especially with automatic threshold adaptation (ε_λe and ε_λy), performs the best on the regression problems studied here in comparison to the other GP methods considered. The performance in terms of test fitness is promising, as are the measured wall clock times, which are comparable to tournament selection. In addition to introducing a non-elitist version of lexicase selection that defines test case pass conditions using an ε threshold, we demonstrated that ε can be set automatically based on the dispersion of error across the population on a test case. We observed that the performance of this automatic threshold is insensitive to whether it is defined relative to the elite error e_t* or the target value y_t. The results should motivate the use of ε-lexicase selection as a parent selection technique for symbolic regression, and should motivate further research using non-elitist lexicase selection methods for continuous-valued problems in GP.
The authors would like to thank Thomas Helmuth, Nic McPhee and Bill Tozier for their feedback, as well as members of the Computational Intelligence Laboratory at Hampshire College. This work is partially supported by NSF Grant Nos. 1068864, 1129139 and 1331283. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by NSF grant number ACI-1053575 [?].
 [1] A. R. Burks and W. F. Punch. An Efficient Structural Diversity Technique for Genetic Programming. In GECCO, pages 991–998. ACM Press, 2015.
 [2] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan. A Fast Elitist Non-dominated Sorting Genetic Algorithm for Multi-objective Optimization: NSGA-II. In PPSN VI, volume 1917, pages 849–858. Springer Berlin Heidelberg, Berlin, Heidelberg, 2000.
 [3] C. Gathercole and P. Ross. Dynamic training subset selection for supervised learning in Genetic Programming. In PPSN III, number 866 in Lecture Notes in Computer Science, pages 312–321. Springer Berlin Heidelberg, Oct. 1994.
 [4] I. Gonçalves and S. Silva. Balancing learning and overfitting in genetic programming with interleaved sampling of training data. In EuroGP 2013, pages 73–84, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
 [5] D. Harrison and D. L. Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81–102, 1978.
 [6] T. Helmuth. General Program Synthesis from Examples Using Genetic Programming with Parent Selection Based on Random Lexicographic Orderings of Test Cases. PhD thesis, UMass Amherst, Jan. 2015.
 [7] T. Helmuth, L. Spector, and J. Matheson. Solving Uncompromising Problems with Lexicase Selection. IEEE Transactions on Evolutionary Computation, PP(99):1–1, 2014.
 [8] G. S. Hornby. ALPS: The Agelayered Population Structure for Reducing the Problem of Premature Convergence. In GECCO, pages 815–822, New York, NY, USA, 2006. ACM.
 [9] H. Ishibuchi, N. Tsukamoto, and Y. Nojima. Evolutionary manyobjective optimization: A short review. In IEEE CEC 2008, pages 2419–2426. Citeseer, 2008.
 [10] J. Klein and L. Spector. Genetic programming with historically assessed hardness. GPTP VI, pages 61–75, 2008.
 [11] K. Krawiec and P. Lichocki. Using Cosolvability to Model and Exploit Synergetic Effects in Evolution. In PPSN XI, pages 492–501. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
 [12] K. Krawiec and P. Liskowski. Automatic derivation of search objectives for testbased genetic programming. In Genetic Programming, pages 53–65. Springer, 2015.
 [13] K. Krawiec and M. Nawrocki. Implicit fitness sharing for evolutionary synthesis of license plate detectors. Springer, 2013.
 [14] K. Krawiec and U.M. O’Reilly. Behavioral programming: a broader and more detailed take on semantic GP. In GECCO, pages 935–942. ACM Press, 2014.
 [15] W. La Cava, K. Danai, L. Spector, P. Fleming, A. Wright, and M. Lackner. Automatic identification of wind turbine models using evolutionary multiobjective optimization. Renewable Energy, 87, Part 2:892–902, Mar. 2016.
 [16] W. B. Langdon. Evolving Data Structures with Genetic Programming. In ICGA, pages 295–302, 1995.
 [17] P. Liskowski, K. Krawiec, T. Helmuth, and L. Spector. Comparison of Semanticaware Selection Methods in Genetic Programming. In GECCO Companion, pages 1301–1307, New York, NY, USA, 2015. ACM.
 [18] Y. Martínez, E. Naredo, L. Trujillo, and E. Galván-López. Searching for novel regression functions. In IEEE CEC 2013, pages 16–23. IEEE, 2013.
 [19] R. I. B. McKay. An Investigation of Fitness Sharing in Genetic Programming. The Australian Journal of Intelligent Information Processing Systems, 7(1/2):43–51, July 2001.
 [20] A. Moraglio, K. Krawiec, and C. G. Johnson. Geometric semantic genetic programming. In PPSN XII, pages 21–31. Springer, 2012.
 [21] T. Pham-Gia and T. L. Hung. The mean and median absolute deviations. Mathematical and Computer Modelling, 34(7–8):921–936, Oct. 2001.
 [22] M. Schmidt and H. Lipson. Coevolution of Fitness Predictors. IEEE Transactions on Evolutionary Computation, 12(6):736–749, Dec. 2008.
 [23] M. Schmidt and H. Lipson. Distilling freeform natural laws from experimental data. Science, 324(5923):81–85, 2009.
 [24] M. Schmidt and H. Lipson. Agefitness pareto optimization. In GPTP VIII, pages 129–146. Springer, 2011.
 [25] M. D. Schmidt. Machine Science: Automated Modeling of Deterministic and Stochastic Dynamical Systems. PhD thesis, Cornell University, Ithaca, NY, USA, 2011. AAI3484909.
 [26] G. F. Smits and M. Kotanchek. Paretofront exploitation in symbolic regression. In GPTP II, pages 283–299. Springer, 2005.
 [27] L. Spector. Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In GECCO, pages 401–408, 2012.
 [28] L. Spector and J. Klein. Trivial geography in genetic programming. In GPTP III, pages 109–123. Springer, 2006.
 [29] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. WilkensDiehr. XSEDE: Accelerating Scientific Discovery. Computing in Science and Engineering, 16(5):62–74, 2014.
 [30] A. Tsanas and A. Xifara. Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49:560–567, 2012.
 [31] E. Vladislavleva, G. Smits, and D. den Hertog. Order of Nonlinearity as a Complexity Measure for Models Generated by Symbolic Regression via Pareto Genetic Programming. IEEE Transactions on Evolutionary Computation, 13(2):333–349, 2009.
 [32] D. R. White, J. McDermott, M. Castelli, L. Manzoni, B. W. Goldman, G. Kronberger, W. Jaśkowski, U.M. O’Reilly, and S. Luke. Better GP benchmarks: community survey results and proposals. Genetic Programming and Evolvable Machines, 14(1):3–29, Dec. 2012.
 [33] E. Zitzler, M. Laumanns, and L. Thiele. SPEA2: Improving the strength Pareto evolutionary algorithm. ETH Zürich, Institut für Technische Informatik und Kommunikationsnetze (TIK), 2001.
