Semantic variation operators for multidimensional genetic programming

Semantic variation operators for multidimensional genetic programming

William La Cava 1234-5678-9012University of Pennsylvania3700 Hamilton WalkPhiladelphiaPA19104  and  Jason H. Moore University of Pennsylvania3700 Hamilton WalkPhiladelphiaPA19104

Multidimensional genetic programming represents candidate solutions as sets of programs, and thereby provides an interesting framework for exploiting building block identification. Towards this goal, we investigate the use of machine learning as a way to bias which components of programs are promoted, and propose two semantic operators to choose where useful building blocks are placed during crossover. A forward stagewise crossover operator we propose leads to significant improvements on a set of regression problems, and produces state-of-the-art results in a large benchmark study. We discuss this architecture and others in terms of their propensity for allowing heuristic search to utilize information during the evolutionary process. Finally, we look at the collinearity and complexity of the data representations that result from these architectures, with a view towards disentangling factors of variation in application.

representation learning, feature construction, variation, regression
journalyear: 2019copyright: acmlicensedconference: Genetic and Evolutionary Computation Conference; July 13–17, 2019; Prague, Czech Republicbooktitle: Genetic and Evolutionary Computation Conference (GECCO ’19), July 13–17, 2019, Prague, Czech Republicprice: 15.00doi: 10.1145/3321707.3321776isbn: 978-1-4503-6111-8/19/07

1. Introduction

A central theme in genetic programming (GP) is how to identify, propagate, and properly compose the components of programs that contribute to good solutions. In the context of classification and regression, these building blocks fill the role of “feature engineering”. That is to say, building blocks of GP solutions are meant to explain the underlying factors of variation that produce the observed response. The task of optimizing a set of explanatory features for a problem is known as representation learning, especially in the larger machine learning (ML) community (Bengio et al., 2013). Representation learning is a fundamental challenge in ML due to its computational complexity and the role the representation plays in model accuracy and interpretation. Interestingly, a variant of GP known as multidimensional GP (MGP) makes this relationship between building block discovery and representation learning explicit by optimizing a set of programs, each of which is an independent feature in the ML model. Our goal in this paper is to introduce semantic variation methods to MGP, with the goal of improving the representations it produces.

What makes a representation good? At the minimum, a good representation produces a model with better generalization than a model trained only on the raw data attributes. In addition, a good representation teases apart the factors of variation in the data into independent components. Finally, an ideal representation is succinct so as to promote intelligibility. In other words, a representation should only have as many features as there are independent factors in the process. Our discussion centers around these three motivations.

In the following section, we attempt to summarize the large body of work concerning feature construction / representation learning in GP, especially those methods that use ML to promote building blocks. This provides context for the MGP family of methods. We then describe our main contribution: the proposed methods of crossover in Section 3. We conduct an experiment at first on 8 regression problems, considering full hyperparameter tuning, and analyze the representations that are produced with and without the new crossover methods. Finally, we benchmark the new methods against many ML and GP methods on more than 100 open source regression problems. We find that the new methods of crossover lead to state-of-the-art results for regression. Our discussion points to further directions for improving representation quality within this framework.

2. Background

Feature construction / representation learning has been a consistent theme in the GP community and has been studied with various architectures. Without major changes, GP can be applied to the task of identifying single features for regression and classification (or multiple features in the multiclass case (Neshatian et al., 2012)), and this approach has been explored in several works, summarized in (Muharram and Smith, 2005). These filter approaches generally use an information-theoretic measure to determine how good a program is likely to be as a feature. Optimizing single features requires basically no changes to GP’s methodology, but lacks the power to optimize features for the multivariate context in which they are typically used.

Another approach that has been studied is to treat each individual in the population as a feature, and to optimize an ensemble model of the entire population (De Melo, 2014; Arnaldo et al., 2015; Veloso de Melo and Banzhaf, 2017; La Cava and Moore, 2017a, b). Only a single regression model is trained per generation, which demands minimal overhead. However, it is not well known how to properly select and vary features evolved by such a process. Since each individual is a feature, its fitness is heavily dependent on the current population. Furthermore, a desirable set of features should be essentially orthogonal in order to create a well-conditioned representation, and convergent evolutionary processes aim in the opposite direction. To overcome issues of collinearity and a convergent search process, the following ideas have been proposed. In evolutionary feature synthesis (EFS) (Arnaldo et al., 2015), features are selected proportionally to their coefficient in a regularized linear model; in order to prevent multicollinearity, correlation thresholds are implemented during variation to keep children different from their parents. In the feature engineering wrapper (FEW) (La Cava and Moore, 2017b, a), multicollinearity is selected against by using a survival version of -lexicase selection to choose features. In Kaizen GP (De Melo, 2014; Veloso de Melo and Banzhaf, 2017), individuals are only added to the model if they pass a significance test, in a hill climbing fashion. Another option is to not use an evolutionary updating scheme at all, but rather to create a large set of random features and fit an ML model to this, as in FFX (McConaghy, 2011). More recently, Vanneschi et. al. explored one step linear combinations of random programs (Vanneschi et al., 2017), experimentally showing that they often lead to overfitting.

Rather than building a model from the entire population, one could apply an ML method to the entire program trace as a means of identifying building blocks (Krawiec, 2016). Multiple regression GP (MRGP) (Arnaldo et al., 2014) defines a program’s behavior as the Lasso (Tibshirani, 1996) estimate generated over the entire program’s trace. One downside of this approach is the likely presence of highly correlated features in the program trace, leading to an ill-conditioned regression matrix. In a similar vein to MRGP, Behavioral GP (Krawiec and O’Reilly, 2014) extracts information from the entire program trace, this time using a decision tree algorithm to identify important building blocks, which are stored in an archive for re-use. In both algorithms, the key insight is to use ML with program traces to undo the complex masking effect that program execution has on the behavior of building blocks that are downstream from other operations in the program (for further discussion on the topic of program traces see (Krawiec, 2012)).

MGP is a framework that is, in some sense, in between the ensemble techniques and the program trace techniques described above. Programs are represented as sets of separate subprograms, usually trees. Unlike population-wide models, the fitness of each individual is directly related to its model predictions, and individuals in the population benefit from typical evolutionary optimization processes. Unlike program trace-based methods, by using multi-output programs, MGP exposes independent components of the total program behavior to the ML process that produces the model. As a result, building blocks are easier to isolate and share among the population in direct ways.

Examples of MGP include Krawiec’s method (Krawiec, 2002), M2GP (Ingalalli et al., 2014), M3GP (Muñoz et al., 2015), e-M3GP (Silva et al., 2015), M4GP (La Cava et al., 2018b), and FEAT (La Cava et al., 2019). In all of these methods, individuals in the population produce a set of corresponding outputs that are then fed into a deterministic ML method to produce the program’s regression or classification estimates. In the case of M2GP, M3GP, and M4GP, classification proceeds using a nearest centroid classifier (Tibshirani et al., 2002), whereas linear regression methods are used for regression with M3GP (Muñoz et al., 2018) and FEAT. Although a number of methods have been proposed in the MGP paradigm, they have not made much use of the semantics of independent building blocks in each program that this architecture creates. An exception is (La Cava et al., 2019), in which the authors use the coefficient magnitude to weight probabilities of mutation. The main contribution of this study is the development of semantic crossover schemes to leverage the architecture of MGP to a larger degree than these previous studies.

3. Methods

In this paper we focus on the task of regression, with the goal of building a predictive model using paired examples . The regression model associates the inputs with a real-valued output . The goal of feature engineering / representation learning is to find a new representation of via a -dimensional feature mapping , such that the model outperforms the model by some pre-defined metric (see above).

In MGP, each individual in the population is a candidate representation, , consisting of a set of programs . As an example, the individual

would encode a representation with three features: , , and . Throughout the paper, we refer to these subprograms as features, and use the word attribute to refer to the independent variables in .

MGP methods share this representation in common, and differ in terms of 1) the ML method used to generate the model prediction, i.e. , 2) the crossover and mutation operators used, and 3) the selection process used. The crossover operators proposed in this section will work with any MGP method, but are designed with a linear ML pairing in mind.

3.1. Feature Engineering Automation Tool

We study a recent MGP method named the Feature Engineering Automation Tool (FEAT) (La Cava et al., 2019), in which candidate features are parameterized by weights, , and used to fit a linear model


The coefficients are determined using ridge regression (Hoerl and Kennard, 1970). Note that each is normalized to zero mean, unit variance before ridge regression is applied. The parameters are attached to the edges of differentiable operators and updated each generation via gradient descent. The fitness of each individual in FEAT is its mean squared error (MSE) on the training set.


In order to promote building blocks, FEAT uses feedback from the ML process to bias the variation step. In a nutshell, the probability of a feature in being mutated or replaced by crossover is inversely related the magnitude of its coefficient in Eqn. 1. Let . The normalized coefficient magnitudes are used to define softmax-normalized probabilities. The probability of mutation for feature in program is denoted , and defined as follows:


Here, is a parameter that controls the amount of feedback from the weights that is used to bias the selection of feature for mutation. In our experiments, we tune , and also test whether the softmax normalization of is useful.


In the initial work introducing FEAT (La Cava et al., 2019), the authors compared several optimization schemes, including NSGA-II, simulated annealing, random search, -lexicase selection (La Cava et al., 2016), and a hybrid of -lexicase selection with NSGA-II’s survival scheme. The final combination achieved the best results in the analysis, and is used in our work here. The original work attempted to address the issue of disentanglement by measuring the multicollinearity of representations and setting this metric as an additional objective during survival. However, none of the objectives tried resulted in less correlated final representations, and actually made them slightly worse. One of the goals of this paper is to explore a second hypothesis, that disentangled representations can be encouraged through the use of semantic variation operators, rather than through additional objectives.


Semantic variation operators have not yet been proposed for MGP frameworks. The most recent MGP techniques (M3GP, M4GP and FEAT) adopt special variation operators that vary programs at the feature level. Feature crossover (called “root crossover” in (Muñoz et al., 2015)) swaps features between two parent representations. Feature mutation may delete a feature or add a random feature. M3GP, M4GP and FEAT also use standard subtree crossover and point mutation operators, and each of these operators occur with equal probability. In FEAT, however, the probabilities of each feature being chosen for mutation or crossover is determined by Eqn. 3.1.

In the following section, we describe two new semantic crossover operators for MGP that attempt to maintain orthogonality in the representations while moving towards models with lower residuals.

3.2. Semantic Crossover

The following two crossover methods are called semantic because they use information about the program’s outputs to determine the recombination that occurs to produce a child from two parents. Both operators are based on the following observations. We have two parent representations, and , with corresponding model outputs and that are linear combinations of their respective representations, as in Eqn. 1. We want to produce the best combination of and for the child representation . Basically we can treat this as a feature selection problem, where we have features and we want to pick the best. On one hand we could simply concatenate the feature sets, and generate a new model , which is the linear model fit to all features of both parents. This approach would lead to exponential growth in offspring, which would run against our goal of lowering complexity.

In lieu of that approach, we propose here what are essentially regularized versions of semantic crossover that constrain the number of features in the offspring to be of equal cardinality to , i.e. . The first operator, best residual fit crossover (ResXO), chooses a feature from to be replaced, and then chooses the feature in that best approximates the residual of the model after removing this feature. The second operator, stagewise crossover (StageXO), uses forward stagewise regression (James et al., 2013) as a feature selection method to iteratively construct the offspring.

3.3. Best residual fit crossover (ResXO)

Given parents and , ResXO swaps a feature in with the feature in that most closely approximates the residual error of with the selected feature removed. The child representation is denoted as . The steps are as follows:

  1. pick from using probabilities given by Eqn. 3.1.

  2. calculate the residual of without :

  3. choose from , which is the feature most correlated with .

  4. with replaced by .

ResXO is a semantic backpropagation operator (Ffrancon and Schoenauer, 2015; Pawlak et al., 2015; Graff et al., 2015), since it seeks to replace a component of the parent program with a subprogram most closely matching the desired semantics, given by . Within the MGP framework, this backpropagation is very simple, and does not require complex inversion operations to be introduced. We expect that ResXO will also lead to lower correlations between features in than in . To understand why, consider that

Therefore should have low correlation with the rest of the ’s representation. Assuming the replacement feature from closely matches , it should also be uncorrelated with . Note that ResXO may produce an individual with higher squared error than its parents, since may be more correlated with than .

3.4. Forward stagewise crossover (StageXO)

Rather than restricting crossover to the replacement of a single feature, the crossover operator can be used to compile the set of features that iteratively reduce the target error using a forward stagewise crossover method we call StageXO. The procedure is as follows:

  1. set the initial residual equal to the target: . Center means around zero for all .

  2. set to be all subprograms in and .

  3. while :

    1. pick from which is most correlated with .

    2. compute the least squares coefficient for fit to .

    3. update

    4. add to .

    5. remove from .

Unlike feature selection methods like forward/backward stepwise selection, forward stagewise selection only calculates the weight of a single feature at a time, and is thus more lightweight. The downside of this approach in the context of regression is that it generally takes more iterations to reach the least squares coefficients of the complete model (Friedman et al., 2001). In our case this is unimportant, since we are only interested in quickly choosing the most important features, which are then used to fit a multiple linear regression model. We expect the child representation returned by StageXO to contain uncorrelated features since the residual is updated each iteration to remove the portion of the response explained by previous features.

Forward stagewise regression, and therefore the StageXO operator, is closely related to boosting (Freund and Schapire, 1995). In both cases the residual is iteratively reduced by adding model components (weak learners in the case of boosting, and features/building blocks in our case). The relationship between forward stagewise regression, boosting, and regularized linear models is expounded upon in (Friedman et al., 2001). The stagewise additive modeling paradigm is also used by a recent GP technique called Wave (Medernach et al., 2016), in which GP runs are iteratively trained on residuals of previous runs. The insight here is that the unique representation of programs in MGP allows the same general methodology to be exploited for combining partial solutions during crossover, rather than as a post-run ensemble method.

4. Experiment

Our experiment consists of two stages. First, we conduct an extensive study of FEAT with and without the semantic crossover operators introduced in Section 3. In this study we vary the hyperparameters related to variation and analyze the results in detail for 8 regression problems. In the second study, we apply FEAT, FEATResXO, and FEATStageXO to more than 100 problems from the PMLB regression benchmark (Orzechowski et al., 2018). These variants are compared to state-of-the-art symbolic regression and ML methods.

4.1. Hyperparameter study

Despite several MGP methods having been proposed, there has not been a systematic study of the effect of variation operators on the performance of this family of methods. To fill this gap, and to properly analyze the new methods introduced in this paper, we performed a grid search of variation hyperparameters on 8 regression problems. The hyperparameters that were varied are shown in Table 1.

Hyperparameter Values
probability of crossover (complement: mutation) [0,0.25,0.5, 0.75,1.0]
feedback (, Eqn. 3.1) [0, 0.25, 0.5, 0.75, 1.0]
type of feature crossover [Standard, ResXO, StageXO]
probability of feature crossover (complement: subtree crossover) [0.5, 0.75, 1.0]
feedback softmax normalization [On, Off]
population size 500
generations 100 (200 for PMLB)
max depth 6
maximum dimensionality min(50, )
iterations of gradient descent 10
Table 1. Hyperparameter values for FEAT in the experiments. The bottom values are fixed.

Feedback softmax normalization refers to the softmax transformation in Eqn. 3.1; we tested for whether this normalization, which assumes a multinomial distribution of probabilities, was useful. The eight comparison problems are listed in Table 2.

Problem Dimension Samples
Airfoil 5 1503
Concrete 8 1030
ENC 8 768
ENH 8 768
Housing 14 506
Tower 25 3135
UBall5D 5 6024
Yacht 6 309
Table 2. Regression problems used for method comparisons.

4.2. Benchmark comparison

In the second study, we compared FEAT with each crossover variant to 15 other methods: 5 GP methods (Castelli et al., 2015; Arnaldo et al., 2014; Schmidt and Lipson, 2011; La Cava et al., 2018a) and 10 ML methods from scikit-learn (Pedregosa et al., 2011). The 5 GP methods we compared to are:

  • Geometric Semantic GP (GSGP) (Castelli et al., 2015)

  • MRGP (Arnaldo et al., 2014)

  • Age-fitness Pareto Optimization (AFP) (Schmidt and Lipson, 2011)

  • -lexicase selection (EPLEX) (La Cava et al., 2018a)

  • -lexicase selection with 1 million evaluations (EPLEX-1M) (La Cava et al., 2018a)

These methods were benchmarked on 94 open-source datasets collected in the Penn ML Benchmark (Olson et al., 2017). We used results from Orzechowski et. al.’s benchmark analysis (Orzechowski et al., 2018) as a comparison, and followed the same validation procedure. Each comparison method underwent hyperparameter tuning using 5-fold cross validation on a 75% split of the training set, and was then tested on a 25% test fold. The hyperparameters are detailed in Table 1 of the original work (Orzechowski et al., 2018). This process was repeated for 10 trials. GP methods were given 100,000 evaluations, apart from EPLEX-1M which used 1 million. For FEAT, we did not re-tune the hyperparameters, instead using the values determined from the hyperparameter tuning experiment.

We also extended the PMLB comparison to larger datasets because we were interested in 1) how FEAT handled larger datasets (the original study was restricted to smaller datasets, c.f. Fig. 1 of (Orzechowski et al., 2018)) and also how the size of the final models compared to other top-ranking algorithms such as XGBoost (Chen and Guestrin, 2016) and multilayer perceptron (MLP). The extended analysis included 111 datasets, whose properties are shown in Fig. 1. These datasets are used to evaluate the complexity of the final models.

Figure 1. Properties of the benchmarks used from PMLB (Olson et al., 2017).

4.3. Metrics

As mentioned earlier, we consider there to be three over-arching goals when learning a representation. The first is that leads to a model with a low generalization error. To measure this, we compare the mean squared error (MSE) and coefficient of determination () of each model output on the test set. We also wish to minimize the complexity of the representation. To measure the complexity of solutions in FEAT, we count the total number of nodes in the final representation. For comparison to XGBoost, we count the number of nodes in the trees, and for comparison to MLP, we count the number of nodes in the network. Finally, we want a representation that is “disentangled”, meaning that each feature of is as orthogonal to the others as possible. One such measure is the pairwise Pearson correlations of features in , written as


We use this equation to compare the final representations across selection methods.

5. Results

The hyperparameter tuning results are presented first. In addition to test score reporting, we plot various views of the data with respect to different hyperparameters, and also look at representation correlations in the resultant models and statistical comparisons. In the subsequent section, the PMLB comparison results are shown, including score comparisons, runtime comparisons, statistical tests and comparisons of the final model sizes for FEAT, XGBoost, and MLP learners.

5.1. Hyperparameter tuning

Prediction comparisons for each crossover method are shown in Fig. 2 for the 8 tuning problems. The plot shows the mean test fold value for the tuned estimator, summarized across trials. In general one can see that StageXO produces the most accurate results, followed by ResXO. Across the 8 problems, StageXO significantly outperforms standard crossover (0.035); the pairwise statistical comparisons are given in Table 3.

We also looked at the correlation of the representations produced by the different crossover methods, shown in Fig. 3. We confirmed our hypotheses that ResXO and StageXO would produce less correlated representations than the traditional crossover operator.

The best values for each tuned parameter is shown in Table 4. We found that softmax normalization did not improve the feedback probabilities. Across problems, the best crossover/mutation fraction was found to be 0.75 (Fig. 4), with a feature crossover rate of 0.75 for Feat and 0.5 for ResXO and StageXO. The best feedback value was problem dependent, as shown in Fig. 5. Since the feedback essentially controls the amount of exploration versus exploitation, it stands to reason that the ideal setting of this parameter would be problem dependent. Feedback levels of 0.25 were best for FEAT and FEATStageXO, and no feedback was best for FEATResXO. For the ResXO operator, this corresponds to choosing the feature to swap out of the parent at random.

Figure 2. Mean 5-fold CV performance for different crossover operators on the 8 tuning problems.
Figure 3. Average pairwise representation correlation (Eqn. 3) for different crossover operators on the 8 tuning problems.
Figure 4. Test rankings for different crossover probabilities on the 8 tuning problems.
Figure 5. Mean 5-fold CV performance for different levels of feedback on the 8 tuning problems.
Feat FeatResXO
FeatResXO 4.6e-01
FeatStageXO 3.5e-02 1.2e-01
Table 3. Bonferroni-adjusted -values using a Wilcoxon signed rank test of scores for the methods across all tuning problems. Bold: 0.05.
Hyperparameter FEAT FEAT-ResXO FEAT-StageXO
probability of crossover 0.75 0.75 0.75
feedback 0.25 0.0 0.25
probability of feature crossover 0.75 0.5 0.5
feedback softmax normalization Off Off Off
Table 4. Best hyperparameter values for FEAT across the 8 tuning problems.

5.2. Benchmark comparison

The comparisons of FEAT to 15 other methods is shown in Fig. 6. Across problems, FEATStageXO achieves a nearly identical ranking to EPLEX-1M, which is -lexicase selection run for 1 million evaluations. In this case, FEATStageXO achieves these similar results using 100,000 evaluations. However, the additional complexity of fitting ML models to each individual makes the evaluation of each individual in FEAT more costly. The wall clock times shown in Fig. 7 reflect this, as the FEAT wall clock times sit somewhere between the methods that ran for 100,000 evaluations (GSGP, AFP, MRGP, EPLEX) and EPLEX-1M.

A Friedman test of the MSE rankings across problems indicates significant differences. Table 5 shows post-hoc pairwise Wilcoxon signed rank tests of the results. FEAT, FEATResXO, and FEATStageXO all significantly outperform the other GP-based methods run for 100,00 evaluations. There is no significant difference found between the FEAT variants, EPLEX-1M, XGBoost, GradBoost, or MLP. The FEAT variants significantly outperform all other ML methods across the benchmark problems.

We extend the comparison of FEAT to XGBoost, MLP and ElasticNet on 111 of the datasets in PMLB in order to evaluate the complexity of the final representations. The size comparisons are shown in Fig. 8. In general FEAT produces representations about 1.5 orders of magnitude smaller than XGBoost and MLP. We found that StageXO led to slightly larger models than Feat or FeatResXO.

Figure 6. Median test MSE rankings on the PMLB datasets. The box shows the quartiles of the rankings with whiskers showing the rest of the distribution excluding outliers. Algorithms are ordered left to right by best (lowest) to worst (highest) median ranking.
Figure 7. Runtime comparisons on the PMLB datasets. Runtime is the wall clock time for a single training instance. Algorithms are ordered to match Fig. 6.
Figure 8. Number of nodes in solutions on the PMLB datasets.
AFP 1.2e-02
EPLEX 1.0e+00 7.3e-08
EPLEX-1M 8.2e-09 9.4e-12 3.4e-09
FEAT 2.6e-08 5.8e-10 1.9e-06 1.0e+00
FEATResXO 1.8e-04 2.2e-06 5.8e-04 5.0e-02 1.7e-02
FEATStageXO 9.1e-08 8.2e-10 4.1e-06 1.0e+00 1.0e+00 5.8e-03
GradBoost 1.5e-08 2.3e-08 5.6e-03 3.7e-02 1.0e+00 1.0e+00 1.0e+00
GSGP 8.4e-11 2.6e-07 4.1e-12 1.5e-14 8.1e-14 5.1e-13 2.3e-14 2.6e-14
KernelRidge 1.0e+00 1.5e-03 1.0e+00 8.1e-04 2.1e-02 1.0e+00 1.1e-02 9.9e-01 1.4e-14
Lasso 8.3e-04 1.0e+00 1.9e-06 4.4e-12 4.5e-10 3.5e-07 4.8e-10 1.8e-08 2.1e-02 8.6e-08
LinReg 1.1e-04 7.0e-03 2.3e-07 3.7e-12 4.5e-11 4.5e-09 6.1e-11 1.7e-08 1.0e+00 3.0e-09 1.0e+00
LinSVR 1.1e-03 8.6e-02 2.3e-06 4.6e-12 1.3e-09 5.7e-07 9.0e-10 2.6e-08 3.9e-01 7.9e-09 1.0e+00 1.0e+00
MLP 2.6e-02 2.6e-06 1.0e+00 1.0e+00 1.0e+00 1.0e+00 1.0e+00 1.0e+00 9.6e-15 7.4e-01 6.8e-07 2.0e-08 1.3e-07
MRGP 1.0e+00 1.0e+00 1.0e+00 1.1e-04 7.0e-04 2.8e-01 3.6e-04 2.1e-01 2.1e-05 1.0e+00 2.3e-01 2.5e-02 1.5e-01 1.0e+00
RF 7.9e-05 1.2e-04 1.0e+00 7.4e-06 1.0e-04 2.0e-01 7.6e-05 2.3e-06 2.1e-13 1.0e+00 4.7e-06 4.4e-06 1.3e-06 1.0e+00 1.0e+00
SGD 2.0e-06 6.4e-05 1.9e-09 9.6e-13 2.7e-12 1.5e-09 1.3e-11 4.1e-11 1.0e+00 2.6e-09 5.5e-02 1.0e+00 1.0e+00 1.8e-09 4.5e-03 2.8e-09
XGBoost 2.1e-08 6.9e-11 5.3e-05 1.0e+00 1.0e+00 1.0e+00 1.0e+00 8.9e-03 6.0e-15 6.9e-03 1.3e-10 6.4e-10 2.2e-10 1.0e+00 2.8e-03 3.6e-08 5.8e-12
Table 5. Bonferroni-adjusted -values using a Wilcoxon signed rank test of MSE scores for the methods across all benchmarks. Bold: 0.05.

6. Discussion & Conclusion

This paper proposes semantic crossover operators for multidimensional genetic programming. We contend that MGP provides an interesting framework for developing semantic variation methods, due to the unique way that program semantics are easily teased apart in the multi-output architecture. The most successful of these is forward stagewise crossover (StageXO), which mimics forward stagewise selection to incrementally build an offspring from the representations of its parents. A lightweight version of semantic backpropagation crossover, ResXO, is also proposed. StageXO achieves better performance on 8 regression problems than naive crossover operators, taking into account hyperparameter tuning. We find that the resultant representations are also less “entangled”, i.e. less correlated when using these crossover methods.

We consider this to be a first foray into semantic methods for MGP, and consider the potential for more powerful techniques to be high. One can imagine many more semantic operators that take advantage of this architecture. Three extensions come to mind. The first is to consider the features of many programs during crossover, rather than just two parents, as is done with population wide semantic crossover (Vanneschi et al., 2017), behavioral GP, or in recent work by Fine et. al. (Fine et al., 2018). The second extension would be to explicitly design variation operators that improve the condition of the representation, for example by pruning highly correlated features. A third extension is to implement smarter termination conditions for stagewise crossover that take into account the fitness of the parents or an error tolerance.

The current methods appear to be competitive with state-of-the-art regression techniques. In particular, FEAT is able to achieve top ranking results with less computational complexity than the previous top GP method on the PMLB benchmark. It is also able to produce much smaller representations than XGBoost and MLP. These promising results should motivate further research within the MGP framework.

7. Supplementary Material

Code to reproduce the experiments can be found at

8. Acknowledgments

This work was supported by NIH grants AI116794 and LM012601, as well as the PA CURE grant from the Pennsylvania Department of Health. Special thanks to Tilak Raj Singh and members of the Computational Genetics Lab.


  • (1)
  • Arnaldo et al. (2014) Ignacio Arnaldo, Krzysztof Krawiec, and Una-May O’Reilly. 2014. Multiple regression genetic programming. In Proceedings of the 2014 conference on Genetic and evolutionary computation. ACM Press, 879–886.
  • Arnaldo et al. (2015) Ignacio Arnaldo, Una-May O’Reilly, and Kalyan Veeramachaneni. 2015. Building Predictive Models via Feature Synthesis. ACM Press, 983–990.
  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828.
  • Castelli et al. (2015) Mauro Castelli, Sara Silva, and Leonardo Vanneschi. 2015. A C++ framework for geometric semantic genetic programming. Genetic Programming and Evolvable Machines 16, 1 (March 2015), 73–81.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA, 785–794.
  • De Melo (2014) Vinícius Veloso De Melo. 2014. Kaizen programming. ACM Press, 895–902.
  • Ffrancon and Schoenauer (2015) Robyn Ffrancon and Marc Schoenauer. 2015. Memetic Semantic Genetic Programming. ACM Press, 1023–1030.
  • Fine et al. (2018) Steven B. Fine, Erik Hemberg, Krzysztof Krawiec, and Una-May O’Reilly. 2018. Exploiting Subprograms in Genetic Programming. In Genetic Programming Theory and Practice XV (Genetic and Evolutionary Computation), Wolfgang Banzhaf, Randal S. Olson, William Tozier, and Rick Riolo (Eds.). Springer International Publishing, 1–16.
  • Freund and Schapire (1995) Yoav Freund and Robert E. Schapire. 1995. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory. Springer, 23–37.
  • Friedman et al. (2001) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Vol. 1. Springer series in statistics Springer, Berlin.
  • Graff et al. (2015) Mario Graff, Eric Sadit Tellez, Elio Villaseñor, and Sabino Miranda. 2015. Semantic Genetic Programming Operators Based on Projections in the Phenotype Space. (2015), 13.
  • Hoerl and Kennard (1970) Arthur E. Hoerl and Robert W. Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 1 (1970), 55–67.
  • Ingalalli et al. (2014) Vijay Ingalalli, Sara Silva, Mauro Castelli, and Leonardo Vanneschi. 2014. A Multi-dimensional Genetic Programming Approach for Multi-class Classification Problems. In Genetic Programming. Springer, 48–60.
  • James et al. (2013) Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Springer Texts in Statistics, Vol. 103. Springer New York, New York, NY.
  • Krawiec (2002) Krzysztof Krawiec. 2002. Genetic programming-based construction of features for machine learning and knowledge discovery tasks. Genetic Programming and Evolvable Machines 3, 4 (2002), 329–343.
  • Krawiec (2012) Krzysztof Krawiec. 2012. On relationships between semantic diversity, complexity and modularity of programming tasks. In Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference. ACM, 783–790.
  • Krawiec (2016) Krzysztof Krawiec. 2016. Behavioral program synthesis with genetic programming. Vol. 618. Springer.
  • Krawiec and O’Reilly (2014) Krzysztof Krawiec and Una-May O’Reilly. 2014. Behavioral programming: a broader and more detailed take on semantic GP. In Proceedings of the 2014 conference on Genetic and evolutionary computation. ACM Press, 935–942.
  • La Cava et al. (2018a) William La Cava, Thomas Helmuth, Lee Spector, and Jason H. Moore. 2018a. A probabilistic and multi-objective analysis of lexicase selection and -lexicase selection. Evolutionary Computation (May 2018), 1–28.
  • La Cava and Moore (2017a) William La Cava and Jason Moore. 2017a. A General Feature Engineering Wrapper for Machine Learning Using \epsilon -Lexicase Survival. In Genetic Programming. Springer, Cham, 80–95.
  • La Cava and Moore (2017b) William La Cava and Jason H Moore. 2017b. Ensemble representation learning: an analysis of fitness and survival for wrapper-based genetic programming methods. In GECCO ’17: Proceedings of the 2017 Genetic and Evolutionary Computation Conference. ACM, Berlin, Germany, 961–968.
  • La Cava et al. (2018b) William La Cava, Sara Silva, Kourosh Danai, Lee Spector, Leonardo Vanneschi, and Jason H. Moore. 2018b. Multidimensional genetic programming for multiclass classification. Swarm and Evolutionary Computation (April 2018).
  • La Cava et al. (2019) William La Cava, Tilak Raj Singh, James Taggart, Srinivas Suri, and Jason H. Moore. 2019. Learning concise representations for regression by evolving networks of trees. In International Conference on Learning Representations (ICLR). In Press.
  • La Cava et al. (2016) William La Cava, Lee Spector, and Kourosh Danai. 2016. Epsilon-Lexicase Selection for Regression. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (GECCO ’16). ACM, New York, NY, USA, 741–748.
  • McConaghy (2011) Trent McConaghy. 2011. FFX: Fast, scalable, deterministic symbolic regression technology. In Genetic Programming Theory and Practice IX. Springer, 235–260.
  • Medernach et al. (2016) David Medernach, Jeannie Fitzgerald, R. Muhammad Atif Azad, and Conor Ryan. 2016. A New Wave: A Dynamic Approach to Genetic Programming. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (GECCO ’16). ACM, New York, NY, USA, 757–764.
  • Muharram and Smith (2005) Mohammed Muharram and George D. Smith. 2005. Evolutionary constructive induction. IEEE Transactions on Knowledge and Data Engineering 17, 11 (2005), 1518–1528.
  • Muñoz et al. (2015) Luis Muñoz, Sara Silva, and Leonardo Trujillo. 2015. M3GP–Multiclass Classification with GP. In Genetic Programming. Springer, 78–91.
  • Muñoz et al. (2018) Luis Muñoz, Leonardo Trujillo, Sara Silva, Mauro Castelli, and Leonardo Vanneschi. 2018. Evolving multidimensional transformations for symbolic regression with M3GP. Memetic Computing (Sept. 2018).
  • Neshatian et al. (2012) Kourosh Neshatian, Mengjie Zhang, and Peter Andreae. 2012. A filter approach to multiple feature construction for symbolic learning classifiers using genetic programming. IEEE Transactions on Evolutionary Computation 16, 5 (2012), 645–661.
  • Olson et al. (2017) Randal S. Olson, William La Cava, Patryk Orzechowski, Ryan J. Urbanowicz, and Jason H. Moore. 2017. PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison. BioData Mining (2017). arXiv preprint arXiv:1703.00512.
  • Orzechowski et al. (2018) Patryk Orzechowski, William La Cava, and Jason H. Moore. 2018. Where are we now? A large benchmark study of recent symbolic regression methods. arXiv:1804.09331 [cs] (April 2018). arXiv: 1804.09331.
  • Pawlak et al. (2015) Tomasz P. Pawlak, Bartosz Wieloch, and Krzysztof Krawiec. 2015. Semantic backpropagation for designing search operators in genetic programming. IEEE Transactions on Evolutionary Computation 19, 3 (2015), 326–340.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830.
  • Schmidt and Lipson (2011) Michael Schmidt and Hod Lipson. 2011. Age-fitness pareto optimization. In Genetic Programming Theory and Practice VIII. Springer, 129–146.
  • Silva et al. (2015) Sara Silva, Luis Munoz, Leonardo Trujillo, Vijay Ingalalli, Mauro Castelli, and Leonardo Vanneschi. 2015. Multiclass Classification Through Multidimensional Clustering. In Genetic Programming Theory and Practice XIII. Vol. 13. Springer, Ann Arbor, MI.
  • Tibshirani (1996) Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996), 267–288.
  • Tibshirani et al. (2002) Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. 2002. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences 99, 10 (May 2002), 6567–6572.
  • Vanneschi et al. (2017) Leonardo Vanneschi, Mauro Castelli, Luca Manzoni, Krzysztof Krawiec, Alberto Moraglio, Sara Silva, and Ivo Gonçalves. 2017. PSXO: population-wide semantic crossover. In Proceedings of the Genetic and Evolutionary Computation Conference Companion. ACM, 257–258.
  • Veloso de Melo and Banzhaf (2017) Vinícius Veloso de Melo and Wolfgang Banzhaf. 2017. Automatic Feature Engineering for Regression Models with Machine Learning: an Evolutionary Computation and Statistics Hybrid. Information Sciences (2017).
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description