Selection Heuristics on Semantic Genetic Programming for Classification Problems

Claudia N. Sánchez    Mario Graff
INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, Circuito Tecnopolo Sur No 112, Fracc. Tecnopolo Pocitos II, Aguascalientes 20313, México
Facultad de Ingeniería. Universidad Panamericana, Aguascalientes, México
CONACyT Consejo Nacional de Ciencia y Tecnología, Dirección de Cátedras, Insurgentes Sur 1582, Crédito Constructor, Ciudad de México 03940 México 
This work has been submitted to the IEEE Transactions on Evolutionary Computation for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

In a steady-state evolution, tournament selection traditionally uses the fitness function to select the parents, and negative selection chooses an individual to be replaced with an offspring. This contribution analyzes the behavior, in terms of performance, of different heuristics when they are used instead of the fitness function in tournament selection. The heuristics analyzed measure the similarity of the individuals in the semantic space. In addition, the analysis includes random selection and traditional tournament selection. These selection functions were implemented on our Semantic Genetic Programming system, namely EvoDAG, which is inspired by the geometric genetic operators, and tested on 30 classification problems with a variable number of samples, variables, and classes. The results indicate that selecting the parents with accuracy and using random selection in the negative tournament produces the best combination, and the difference in performance between this combination and tournament selection is statistically significant. Furthermore, we compare EvoDAG’s performance using the selection heuristics against 18 classifiers that include traditional approaches as well as auto-machine-learning techniques. The results indicate that our proposal is competitive with state-of-the-art classifiers. Finally, it is worth mentioning that EvoDAG is available as open source software.

1 Introduction

Classification is a supervised learning problem that consists in finding a function that learns a relation between inputs and outputs, where the outputs are a set of labels. The starting point is the training set composed of input-output pairs, i.e., $\mathcal{X} = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)\}$. The training set is used to find a function, $f$, that minimizes a loss function, $L$; that is, $f$ is the function that minimizes $\sum_{(\vec{x}, y) \in \mathcal{X}} L(f(\vec{x}), y)$, where the ideal scenario would be $f(\vec{x}) = y$ for all $(\vec{x}, y) \in \mathcal{X}$, and that also accurately predicts the labels of unseen inputs. By fixing, a priori, an order on $\mathcal{X}$, one can use a notation normally adopted in Semantic Genetic Programming (GP) (e.g., [1]), which is to represent the target behavior as $\vec{t} = (y_1, \ldots, y_n)$ and the behavior of function $f$ as $\vec{f} = (f(\vec{x}_1), \ldots, f(\vec{x}_n))$. Using this notation, the function sought is the one whose behavior $\vec{f}$ is as close as possible to $\vec{t}$, where the closeness is measured using the loss function, referred to in GP as the fitness function.

GP is an evolutionary algorithm (EA) traditionally used to tackle symbolic regression problems, that is, to search for the function $f$ described previously. Consequently, GP has been used on a variety of classification tasks (see [2]). It has been used as a pre-processing technique in [3, 4, 5], to improve the performance of a decision tree [6, 7], to evolve kernel functions for Support Vector Machines [8], and as a full model selection algorithm, as done in TPOT (Tree-based Pipeline Optimization Tool) [9]. These research works use a classification-related fitness function to guide the evolutionary process. Traditionally, the fitness function is used to select the genetic operators’ parent(s) and to decide which individuals constitute the next population (see [10]).

Nonetheless, there have been approaches that investigate the behavior of EAs when the fitness function is replaced with a heuristic that does not consider the target behavior. The extreme case is to replace the fitness function in the whole evolutionary process, as done in Novelty Search [11], where the fitness of an individual is related to its novelty. The novelty of an individual is computed as the average distance between its behavior and the behavior of its $k$-nearest neighbors. GP with Novelty Search has been used in classification problems (see [12]), where the novelty is computed with the 0-1 loss function, i.e., it counts the differences between the behavior of any pair of individuals.
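The novelty computation described above can be sketched as follows. This is only an illustration under stated assumptions: the function names (`zero_one_distance`, `novelty`), the toy behaviors, and the choice of two neighbors are hypothetical and are not taken from [11] or [12].

```python
def zero_one_distance(a, b):
    """Number of positions where two label vectors disagree (0-1 loss)."""
    return sum(1 for u, v in zip(a, b) if u != v)


def novelty(index, behaviors, k=2):
    """Average distance from behaviors[index] to its k nearest neighbors."""
    target = behaviors[index]
    distances = sorted(
        zero_one_distance(target, other)
        for j, other in enumerate(behaviors)
        if j != index
    )
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Toy behaviors: predicted labels of four individuals on four samples.
behaviors = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
scores = [novelty(i, behaviors, k=2) for i in range(len(behaviors))]
most_novel = max(range(len(scores)), key=scores.__getitem__)
```

In this toy population the third individual disagrees with every other one on most samples, so it receives the highest novelty score regardless of how well it matches the target.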

Our proposal is related to Novelty Search (NS) GP [12]; we replace the fitness function with heuristics that do not consider the target behavior. However, there are significant differences. Firstly, the similarity is measured between the first parent and the rest of the parents, removing the need for applying $k$-nearest neighbors. In addition, our approach evolves a classifier, whereas GP with NS evolves a transformation that is then used by a classifier. The heuristics analyzed measure the similarity between the behavior of two individuals. The first one is the absolute value of the cosine similarity between two individuals, and the second one is the accuracy. Besides, we incorporate random selection in our analysis to provide a complete picture of the effects that the heuristics have on the performance. These heuristics are analyzed using a steady-state GP system with tournament selection and semantic crossover; in particular, we decided to use our previous GP system, namely EvoDAG [13, 14], which has been successfully applied to a variety of text classification problems [15].

The fitness function replacement takes place at different stages of the evolutionary process. In the selection of the parents, tournament selection is replaced with either random selection or our heuristics, which select the first parent randomly and choose the rest of the parents using the similarities analyzed, taking the behavior of the first parent as reference. On the other hand, in the selection of the individual being replaced, the negative tournament selection is replaced with random selection. In total, we analyze the performance of eight different GP systems obtained by combining the different heuristics, random selection, and tournament selection. To provide a complete picture of the performance of the GP classifiers, we decided to compare them against state-of-the-art classifiers; in total, eighteen classifiers were selected: sixteen implemented in the Python library scikit-learn [16], and two auto-machine learning algorithms, namely autosklearn [17] and TPOT [9]. The comparison is performed on thirty classification problems taken from the UCI repository [18].

The results show that the replacement of the tournament selection in EvoDAG produces competitive classifiers. Firstly, considering only EvoDAG with different heuristics, it is observed that EvoDAG with accuracy to select the parents and random selection as the negative selection has the lowest (best) rank. In addition, this combination of selection schemes outperforms EvoDAG with tournament selection, and the difference in performance is statistically significant. Furthermore, in comparison with state-of-the-art classifiers, EvoDAG obtained the second-lowest rank, the first being TPOT, although the difference in performance between these systems, in terms of macro-F1, is not statistically significant.

The rest of the manuscript is organized as follows: Section 2 presents the related work. EvoDAG is described in Section 3. The proposed selection heuristics are presented in Section 4. Section 5 presents the experiments and results. Finally, Section 6 concludes this research and highlights some future work.

2 Related work

Let us recall that Semantic GP uses the information in the target behavior, i.e., $\vec{t}$, to guide the search. Notably, Krawiec [19] affirmed that semantic-aware methods make search algorithms better informed. For example, Nguyen et al. [20] proposed Fitness Sharing, a technique that promotes dispersion and diversity of individuals. Their proposal consists of calculating a shared fitness for each individual, in which the raw fitness is adjusted by a value approximately equal to the number of individuals that behave similarly to that individual.

Some crossover and mutation operators have been developed with the use of semantics. Beadle and Johnson [21] proposed a crossover operator that measures the semantic equivalence between parents and offspring, and rejects the offspring that is semantically equivalent to its parents. Quang Uy et al. [22] proposed a semantic crossover and mutation. The crossover operator searches for a crossover point in each parent in such a way that the subtrees are semantically similar, and the mutation operator allows the replacement of an individual subtree only if the new subtree is semantically similar. Hara et al. [23] proposed the Semantic Control Crossover, which uses the semantics to combine individuals so that a global search is performed in the first generations and a local search in the last ones. Graff et al. used subtree semantics and partial derivatives to propose crossover [24, 25] and mutation [26] operators.

Moraglio et al. [27, 28] proposed Geometric Semantic Genetic Programming (GSGP). Their work drew the attention of the GP scientific community because the crossover operator produces an offspring that stands in the segment joining the parents’ semantics. Therefore, the offspring’s fitness cannot be worse than the worst fitness of the parents. Given two parents $p_1$ and $p_2$, the crossover operator generates an offspring as $o = r \cdot p_1 + (1 - r) \cdot p_2$, where $r$ is a real value between $0$ and $1$. This property transforms the fitness landscape into a cone. Unfortunately, the offspring is always bigger than the sum of the sizes of its parents; this makes the operator unusable in practice. Later, some operators appeared intending to improve Moraglio’s GSGP, for example, Approximately Geometric Semantic Crossover (SX) [29], Deterministic Geometric Semantic Crossover [23], Locally Geometric Crossover (LGX) [30, 31], Approximated Geometric Crossover (AGX) [32], Semantic Crossover and Mutation based on projections [33, 34], and Subtree Semantic Geometric Crossover (SSGX) [35].
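The geometric property can be checked numerically on the semantics alone. The sketch below assumes that fitness is the Euclidean distance to the target (an assumption made here for illustration; any convex distance would behave the same way), and the vectors and helper names are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two semantic vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def geometric_crossover(p1, p2, r):
    """Offspring semantics r * p1 + (1 - r) * p2, with r in [0, 1]."""
    return [r * u + (1 - r) * v for u, v in zip(p1, p2)]


rng = random.Random(0)
target = [1.0, -1.0, 0.5, 2.0]   # toy target behavior
p1 = [0.0, 0.0, 0.0, 0.0]        # semantics of the first parent
p2 = [2.0, -3.0, 1.0, 1.0]       # semantics of the second parent

worst_parent = max(euclidean(p1, target), euclidean(p2, target))
offspring_errors = []
for _ in range(100):
    r = rng.random()  # real value in [0, 1)
    child = geometric_crossover(p1, p2, r)
    offspring_errors.append(euclidean(child, target))

# Every offspring lies on the segment joining the parents, so its error
# never exceeds the error of the worst parent.
never_worse = all(e <= worst_parent + 1e-12 for e in offspring_errors)
```

The sketch operates on semantic vectors only; it does not build the offspring's program, which is where the growth in size mentioned above arises.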

Pawlak et al. [32] proposed the Random Desired Operator (RDO). It propagates the target semantics to calculate the desired semantics in the node selected as mutation point. This desired behavior is used to search in a procedures library for the most similar subtree. Finally, it swaps the mutated node with the subtree. RDO was extended by Szubert et al. [36] introducing the Forward Propagation Mutation (FPM) which uses a combination of forward and back-propagation to find a combination of unitary and binary functions that is the most similar to the desired behavior.

According to Vanneschi et al. [37], one way to promote diversity in GP is by the use of different selection schemes. Galvan-Lopez et al. [38] applied crossover only to those individuals whose difference in behavior is greater than a defined threshold for every element of the training set. Ruberto et al. [39] defined the error vector and the error space. The error vector of an individual $p$ is defined as $\vec{e}_p = \vec{p} - \vec{t}$, where $\vec{p}$ is the semantics of individual $p$ and $\vec{t}$ is the target behavior. As can be seen, the error space contains the individuals’ error vectors, and the target corresponds to the origin. The proposal is to search, in the error space, for two or three aligned individuals, instead of using the fitness function; the rationale comes from the fact that, given the aligned individuals, there is a straightforward procedure to compute the optimal solution. Chu et al. [40, 41] also use the error vectors, together with the Wilcoxon signed-rank test, to decide whether to select the fittest individual, the smaller one, or the one with the worst fitness. Their results show that the proposed techniques enhance the semantic diversity and reduce the code bloat in GP. As mentioned above, Hara et al. proposed Deterministic Geometric Semantic Crossover [23], and later they [42] proposed to select the parents in such a way that the line connecting them is close to the target in the semantic space.

Chen et al. [43] proposed the Angle-Driven Selection (ADS), where the first parent is selected using fitness and the second is selected using an angle-distance measured between the parents relative to the target behavior. One of our selection heuristics is similar to ADS; however, there are significant differences: the first parent is randomly selected, whereas the second parent is selected using an equivalent similarity, with the difference that the target behavior is not considered in our approach.

Loveard and Ciesielski [44] proposed different techniques for representing classification problems in GP; one of them assigns the class based on a range, with as many intervals as classes. Muni et al. [45] proposed to evolve a tree for each class, following a strategy equivalent to the one-vs-all approach. Jaben and Baig [46] developed a two-stage method: the first stage evolves a classifier for each class, and the second stage combines these classifiers.

Ingalalli et al. [3] introduced a GP framework called Multi-dimensional Multi-class Genetic Programming (M2GP). The main idea is to transform the original space into another one using functions evolved with GP; then, a centroid is calculated for each class, and the vectors are assigned to the class that corresponds to the nearest centroid using the Mahalanobis distance. M2GP takes as an argument the dimension of the transformed space; this parameter is evolved in M3GP [4] by including specialized search operators that can increase or decrease the number of feature dimensions produced by each tree. M3GP was later extended into M4GP [5], which uses a stack-based representation in addition to new selection methods, namely lexicase selection and age-fitness Pareto survival.

Naredo et al. [12] use NS for evolving genetic programming classifiers based on M3GP, where the difference is the procedure used to compute the fitness. Each GP individual is represented as a binary vector whose length is the training set size, and each vector element is set to 1 if the classifier assigns the class label correctly and 0 otherwise. Then, they use these binary vectors to measure the sparseness among individuals: the higher the sparseness, the higher the fitness value. Their results show that all their NS variants achieve competitive results relative to the traditional objective-based search.

Auto machine learning consists of automatically obtaining a classifier (or regressor), which includes the steps of preprocessing, feature selection, classifier selection, and hyperparameter tuning. Feurer et al. [17] developed a robust automated machine learning (AutoML) technique using Bayesian optimization methods. It is based on scikit-learn [16], using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters. Olson et al. [9] proposed the use of GP to develop a powerful algorithm that automatically constructs and optimizes machine learning pipelines through a Tree-based Pipeline Optimization Tool (TPOT). On classification, the objective consists of maximizing the accuracy score by searching over the combinations of 14 preprocessors, five feature selectors, and 11 classifiers, all of them implemented in scikit-learn [16].

3 EvoDAG

EvoDAG [13, 14] is a Python library that implements a steady-state GP system with tournament selection. EvoDAG is inspired by the implementation of GSGP performed by Castelli et al. [47], where the main idea is to keep track of all the individuals and their behavior, leading to an efficient evaluation of the offspring whose complexity depends only on the number of fitness cases.

Let us recall that the offspring, in the geometric semantic crossover, is $o = r \cdot p_1 + (1 - r) \cdot p_2$, where $p_1$ and $p_2$ are the parents and $r$ is a random function or a constant. In [33], we decided to extend this operation by allowing the offspring to be a linear combination of the parents, that is, $o = \theta_1 p_1 + \theta_2 p_2$, where $\theta_1$ and $\theta_2$ are obtained using ordinary least squares (OLS). Continuing with this line of research, in [13], we investigated the case where the offspring is a linear combination of more than two parents and also included the possibility that the parents are combined using a function randomly selected from the function set.
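For two parents, the OLS coefficients have a closed form given by the 2x2 normal equations. The following sketch solves them directly; the function name `ols_two_parents` and the toy vectors are hypothetical, and the code is not EvoDAG's implementation.

```python
def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(u * v for u, v in zip(a, b))


def ols_two_parents(p1, p2, target):
    """Coefficients minimizing ||theta1 * p1 + theta2 * p2 - target||^2."""
    a11, a12, a22 = dot(p1, p1), dot(p1, p2), dot(p2, p2)
    b1, b2 = dot(p1, target), dot(p2, target)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        # Collinear parents make the normal equations singular.
        raise ValueError("parents are collinear; the system is singular")
    theta1 = (b1 * a22 - b2 * a12) / det
    theta2 = (a11 * b2 - a12 * b1) / det
    return theta1, theta2


# Toy semantics; the target is exactly 2 * p1 - 3 * p2.
p1 = [1.0, 0.0, 1.0]
p2 = [0.0, 1.0, 1.0]
target = [2.0, -3.0, -1.0]
theta1, theta2 = ols_two_parents(p1, p2, target)
offspring = [theta1 * u + theta2 * v for u, v in zip(p1, p2)]
```

The singular case in the sketch shows why collinear parents add nothing to a linear combination: the second parent spans no direction that the first one does not already cover.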

EvoDAG, as customary, uses a function set, , , , , , , , , , , , , , , and a terminal set, , to create the individuals. The functions in the function set are traditional operations, where the subscript indicates the number of arguments. The function set also includes classifiers, namely Naive Bayes with Gaussian distribution (), Naive Bayes with Multinomial distribution (), and Nearest Centroid ().

The initial population starts with $\mathcal{P} = \{\theta_1 x_1, \ldots, \theta_m x_m\}$, where $x_i$ is the $i$-th input and $\theta_i$ is obtained using OLS. In case the number of individuals in $\mathcal{P}$ is lower than the population size, the process continues by including an individual created by randomly selecting a function from the function set, with its arguments drawn from the current population $\mathcal{P}$. For example, let $f$ be the selected function, and let the first and second arguments be $p_1$ and $p_2$, two individuals in $\mathcal{P}$. Then the individual inserted into $\mathcal{P}$ is $\theta f(p_1, p_2)$, where $\theta$ is obtained using OLS. This process continues until the population size is reached; EvoDAG sets population size of .

EvoDAG uses a steady-state evolution; consequently, the population $\mathcal{P}$ is updated by replacing a current individual, selected using a negative selection, with an offspring, which can be selected as a parent just after being inserted in $\mathcal{P}$. The evolution process is similar to the one used to create the initial population; the difference is in the procedure used to select the arguments. That is, a function $f$ is selected from the function set, the arguments are selected from $\mathcal{P}$ using tournament selection or any of the heuristics analyzed here, and finally, the parameters associated with $f$ are optimized using either OLS or the procedure used by the classifiers. The addition is defined as $\sum_i \theta_i p_i$, where $p_i$ is an individual in $\mathcal{P}$. The rest of the arithmetic functions, trigonometric functions, and are defined as $\theta f(p_1, p_2, \ldots)$, where $f$ is the function at hand and $p_i$ is an individual in $\mathcal{P}$. The process continues until the stopping criteria are met.

At this point, it is worth mentioning that EvoDAG uses a one-vs-rest scheme on classification problems. That is, a problem with $k$ different classes is converted into $k$ problems, each of which separates the current class from the other labels. Instead of evolving one tree per problem, as done, for example, in [45], we decided to use only one tree and optimize $k$ different parameters, one for each class. The result is that each node outputs $k$ values, and the class is the one with the highest value. In the case of the classifiers used, the output is the log-likelihood.
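A minimal sketch of the one-vs-rest decision rule follows. The +1/-1 coding of the per-class targets is an assumption made for illustration (the values used by EvoDAG are not stated here), and the function names and toy outputs are hypothetical.

```python
def one_vs_rest_targets(labels, classes):
    """One target vector per class; the +1/-1 coding is an assumption."""
    return {c: [1.0 if y == c else -1.0 for y in labels] for c in classes}


def predict(outputs, classes):
    """Pick, for every sample, the class with the highest node output.

    outputs[i][j] is the value the node assigns to classes[j] on sample i.
    """
    return [
        classes[max(range(len(classes)), key=row.__getitem__)]
        for row in outputs
    ]


classes = ["a", "b", "c"]
labels = ["a", "b", "c", "a"]
targets = one_vs_rest_targets(labels, classes)

# Toy node outputs: one row per sample, one column per class.
outputs = [
    [0.9, -0.2, -0.7],
    [-0.5, 0.1, 0.4],
]
predicted = predict(outputs, classes)
```

Because every node produces one value per class, the same argmax rule can be applied at any node of the model, not only at the output.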

EvoDAG stops the evolutionary process using early stopping. That is, the training set is split into a smaller training set (50% reduction) and a validation set containing the remaining elements. The training set is used to calculate the fitness and to optimize the parameters. The validation set is used to perform the early stopping and to keep the individual with the best performance on this set. The evolution stops when the best individual on the validation set has not been updated in a defined number of evaluations; EvoDAG sets this as . The final model corresponds to the best individual, on the validation set, found during the whole evolutionary process.

Figure 1: A model evolved by EvoDAG on the Iris dataset. The inputs are at the bottom of the figure and the output is at the top.

In order to provide an idea of the type of models produced by EvoDAG, Figure 1 presents a model of the Iris dataset. The inputs are at the bottom of the figure. The computation flows from bottom to top, the output being the node at the top of the figure, i.e., Naive Bayes using Gaussian distribution. The figure helps to understand the role of optimizing the set of parameters, one for each class, where each node outputs one value per class; consequently, each node is a classifier.

It is well known that in evolutionary algorithms, there are runs that do not produce an acceptable result, so to improve the stability and also the accuracy we decided to use Bagging [48] in our approach. We implemented Bagging utilizing the characteristic that a bagging estimator can be expected to perform similarly by either drawing elements from the training set with-replacement or selecting elements without-replacement (see [49]). In total, we create models by using different seeds in the random function, and the final prediction is the average of the individual predictions.
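The averaging step of the ensemble can be sketched as follows, assuming each model returns one output per class for every sample; the function name and the toy outputs are hypothetical and do not reproduce EvoDAG's code.

```python
def bagging_predict(per_model_outputs):
    """Average the per-class outputs of several models and pick a class.

    per_model_outputs[m][i][c] is the output of model m for sample i
    and class c. Returns the index of the winning class per sample.
    """
    n_models = len(per_model_outputs)
    n_samples = len(per_model_outputs[0])
    n_classes = len(per_model_outputs[0][0])
    labels = []
    for i in range(n_samples):
        mean = [
            sum(model[i][c] for model in per_model_outputs) / n_models
            for c in range(n_classes)
        ]
        labels.append(max(range(n_classes), key=mean.__getitem__))
    return labels


# Three toy models, two samples, two classes.
model_outputs = [
    [[0.6, 0.4], [0.2, 0.8]],
    [[0.4, 0.6], [0.3, 0.7]],
    [[0.7, 0.3], [0.6, 0.4]],
]
ensemble_labels = bagging_predict(model_outputs)
```

In the toy data, individual models disagree on both samples, yet the averaged outputs give a single decision per sample, which is the stabilizing effect sought with Bagging.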

4 Selection heuristics

Let us recall that in a steady-state evolution there are two stages where selection takes place: on the one hand, selection is used to choose the parents, and on the other hand, selection is applied to decide which individual in the current population is replaced with the offspring. We analyzed the behavior of EvoDAG when different selection schemes are used to choose the parents: the first one uses the absolute value of the cosine similarity (sim), the second one uses the accuracy (acc), and, for comparison purposes, the third is the traditional tournament selection (fit) and the fourth is random selection (rnd). Regarding the negative selection, two schemes are analyzed: the traditional negative tournament selection (fit) and random selection (rnd).

The selection heuristics proposed here complement the heuristics used in the related work. Novelty Search (NS) [11] measures novelty with a similarity between the k-nearest neighbors, GP with NS [12] uses accuracy, and the Angle-Driven GP [43] uses the relative angle between the parents and the target behavior. Our first heuristic uses the angle between parents without considering the target behavior, unlike Angle-Driven GP; our second heuristic computes the accuracy between parents without resorting to the k-nearest neighbors, unlike GP with NS.

The selection mechanism used in the first two heuristics (sim and acc) is the following. The first parent is selected using random selection. The rest of the parents are chosen using tournament selection (tournament size equal to 2), where the fitness function is replaced with either cosine similarity or accuracy. The objective is to minimize the similarity between the parent being selected and the first parent. Furthermore, we analyzed this procedure in two scenarios. In the first one, it is only applied to a subset of the functions of the function set; these are , and, for the rest of the functions, random selection is applied. In the second scenario, the procedure is applied to all the functions except those with one argument.
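The mechanism can be sketched as below. It is a simplified illustration, not EvoDAG's code: individuals are represented only by their behavior vectors, the similarity is passed in as a function, tournament participants are drawn without excluding the first parent, and all names are hypothetical.

```python
import random


def select_parents(population, n_parents, similarity, rng):
    """First parent at random; the rest by tournaments of size 2 that
    keep the candidate least similar to the first parent."""
    first = rng.choice(population)
    parents = [first]
    while len(parents) < n_parents:
        a, b = rng.sample(population, 2)
        winner = a if similarity(first, a) <= similarity(first, b) else b
        parents.append(winner)
    return parents


def label_agreement(a, b):
    """Fraction of samples on which two label vectors coincide."""
    return sum(1 for u, v in zip(a, b) if u == v) / len(a)


# Toy population: predicted labels of four individuals on four samples.
population = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
parents = select_parents(population, 3, label_agreement, random.Random(0))
```

Note that the target behavior never appears in the procedure: only the behavior of the first parent is used as a reference.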

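The cosine similarity between vectors $\vec{a}$ and $\vec{b}$ is defined as $\mathrm{CS}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert}$. The range of the function is $[-1, 1]$, where $1$ corresponds to an angle of $0^\circ$, $0$ is $90^\circ$, and $-1$ is $180^\circ$. The idea of using the absolute value is to avoid, as far as possible, the inclusion of collinear parents, which are not useful on the subset of functions selected.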
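A small sketch of the absolute cosine similarity shows why the absolute value matters: two vectors pointing in opposite directions are as collinear as two vectors pointing in the same direction. The handling of zero vectors is an assumption added here for robustness, and the function name is hypothetical.

```python
import math


def abs_cosine_similarity(a, b):
    """Absolute value of the cosine of the angle between a and b."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as maximally dissimilar.
        return 0.0
    return abs(dot / (norm_a * norm_b))


same_direction = abs_cosine_similarity([1.0, 2.0], [2.0, 4.0])
opposite_direction = abs_cosine_similarity([1.0, 2.0], [-1.0, -2.0])
orthogonal = abs_cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Both collinear pairs score approximately 1 and would therefore lose a tournament against the orthogonal candidate, which scores 0.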

The second heuristic consists of selecting individuals based on the labels they predict. The similarity used is the accuracy, which normally counts the number of correct predictions of the classifier with respect to the target. Here, however, the accuracy is measured between the first parent, acting as the target, and each of the remaining candidate parents. The idea is to choose those parents that differ the most from the first one.
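As an illustration of this heuristic, the sketch below computes the accuracy of two candidates with the first parent's predictions playing the role of the target, and keeps the candidate that agrees the least. The label vectors and names are hypothetical.

```python
def accuracy_between(first_parent_labels, candidate_labels):
    """Accuracy of a candidate when the first parent acts as the target."""
    same = sum(
        1 for u, v in zip(first_parent_labels, candidate_labels) if u == v
    )
    return same / len(first_parent_labels)


first_parent = [0, 1, 2, 1]   # labels predicted by the first parent
candidate_a = [0, 1, 2, 2]    # agrees on three of four samples
candidate_b = [1, 1, 0, 0]    # agrees on one of four samples

acc_a = accuracy_between(first_parent, candidate_a)
acc_b = accuracy_between(first_parent, candidate_b)

# The tournament keeps the candidate that differs most from the first parent.
winner = candidate_a if acc_a <= acc_b else candidate_b
```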

5 Experiments and Results

This section analyzes the performance of the different selection heuristics proposed and compares them with state-of-the-art classifiers. The classification problems used as benchmarks are 30 datasets taken from the UCI repository [18]. Table 1 shows the dataset information. It can be seen that the datasets are heterogeneous in terms of the number of samples, variables, and classes. Additionally, some of the classification problems are balanced, and others are imbalanced. The table includes Shannon’s entropy to indicate the degree of class imbalance in the problem, where 1 indicates a perfectly balanced problem.

Dataset Samples Variables Classes Entropy
ad 3277 1557 2 0.58
adult 48840 14 2 0.8
agaricus-lepiota 8122 22 7 0.81
aps-failure 75998 170 2 0.12
banknote 1370 4 2 0.99
bank 45209 16 2 0.52
biodeg 1053 41 2 0.91
car 1726 6 4 0.6
census-income 299283 41 2 0.34
cmc 1471 9 3 0.98
dota2 102942 116 2 1.0
drug-consumption 1883 30 7 0.44
fertility 97 9 2 0.43
IndianLiverPatient 580 10 2 0.85
iris 150 4 3 1.0
krkopt 28054 6 18 0.84
letter-recognition 19998 16 26 1.0
magic04 19018 10 2 0.93
ml-prove 6116 56 2 0.98
musk1 474 166 2 0.99
musk2 6596 166 2 0.61
optdigits 5618 64 10 1.0
page-blocks 5471 10 5 0.27
parkinsons 192 22 2 0.79
pendigits 10990 16 10 1.0
segmentation 2308 19 7 1.0
sensorless 58507 48 11 1.0
tae 148 5 3 0.99
wine 174 13 3 0.99
yeast 1482 9 10 0.76
Table 1: Datasets used to compare the performance of the algorithms. These problems are taken from the UCI repository. The table includes Shannon’s entropy to indicate the degree of the class-imbalance.
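The entropy column can be reproduced with a computation along the following lines. The sketch assumes that the entropy is normalized by the logarithm of the number of classes, which is consistent with a value of 1 denoting perfect balance; the exact normalization used for the table is not stated, and the function name is hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(labels):
    """Shannon entropy of the class distribution, scaled to [0, 1]."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0  # a single class carries no uncertainty
    n = len(labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Assumption: divide by the maximum entropy, log(number of classes).
    return entropy / math.log(len(counts))


balanced = normalized_entropy([0] * 50 + [1] * 50 + [2] * 50)
imbalanced = normalized_entropy([0] * 90 + [1] * 10)
```

Under this normalization, three classes with 50 samples each give a value of 1 up to rounding, while a 90/10 split between two classes gives a value below 0.5.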

The performance is measured on a test set. In the repository, some of the problems are already split into a training set and a test set. For those problems for which this partition is not present, we performed a hold-out cross-validation; that is, the dataset is split using 70% for the training set and 30% for the test set. The performance measure used is macro-F1, which corresponds to the average of the F1 score per class.
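For reference, macro-F1 can be computed as in the sketch below. The set of classes is taken as the union of true and predicted labels, and a class with no true or predicted samples contributes an F1 of 0; both are choices made here for illustration, and the function name is hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted average of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2 * tp / (2 * tp + fp + fn); defined as 0 when denom is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
```

Since every class has the same weight in the average, a classifier that ignores a minority class is penalized even when its overall accuracy is high, which suits the imbalanced problems in Table 1.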

In order to improve the reading of tables and figures, we use the following notation. The abbreviation of the scheme used for selecting the parents comes first, followed by the symbol “-” and then the abbreviation of the negative selection scheme. The abbreviations used for selecting parents are sim, acc, fit, and rnd, which represent selection based on the absolute value of the cosine similarity, selection based on accuracy, tournament selection, and random selection, respectively. Furthermore, the superscript * is used to indicate those systems where the proposed heuristics (sim and acc) are used in all the functions with more than one argument. In addition, the prefix “EvoDAG” is used when it is compared with other state-of-the-art techniques.

5.1 Comparison of the different selection schemes

Table 2 presents the performance, in terms of macro-F1, of EvoDAG with different selection schemes. The systems are arranged column-wise and sorted by the average rank to facilitate the reading. Each row presents the performance on a classification problem, and the best performance is in boldface. It can be seen that the system with the lowest average rank (lower is better) is the one that uses accuracy to select the parents and random selection as the negative selection (acc-rnd); this system also presents the highest average macro-F1. Comparing the performance of acc-rnd against all other selection schemes, using the Wilcoxon signed-rank test [50] and adjusting the $p$-values with the Holm-Bonferroni method [51] to account for the multiple comparisons, a statistically significant difference (95% confidence) is observed with sim-fit, sim-rnd, fit-fit, fit-rnd, acc*-fit and acc*-rnd; interestingly, fit-fit corresponds to tournament selection with a negative tournament, as normally performed in a steady-state evolution. Additionally, it can be observed that acc-rnd is not statistically better than the system using random selection in the two stages of selection, i.e., rnd-rnd. Furthermore, rnd-rnd is in the third position based on average rank and second using average macro-F1, being outperformed only by systems that use accuracy to select the parents.
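The Holm-Bonferroni adjustment used for the multiple comparisons can be sketched as follows. The $p$-values in the example are hypothetical placeholders, not results from this study; in practice they would come from the Wilcoxon signed-rank test applied to the paired per-dataset scores, for instance through an implementation such as `scipy.stats.wilcoxon`.

```python
def holm_bonferroni(p_values):
    """Holm-adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        value = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, value)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values for four pairwise comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
```

In this toy example, only the first and fourth comparisons remain significant at the 95% confidence level after the adjustment, although all four raw values are below 0.05.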

Comparing the average rank of the selection schemes used to choose the parents, it can be seen that the traditional tournament selection comes in ninth position; in addition, all of our selection heuristics, when applied only to the subset of functions, have a better rank than tournament selection. On the other hand, the heuristics applied only to a subset of the function set (i.e., ) obtained a better rank than the counterpart systems that apply them to all functions with arity greater than one; moreover, the worst systems correspond to the use of accuracy in this latter configuration. It is also observed that the systems using the absolute cosine similarity are less affected by the choice between applying the heuristic to a subset of functions or to all the functions, whereas this decision affects the use of accuracy the most.

Dataset acc-rnd acc-fit rnd-rnd sim-fit rnd-fit sim*-fit sim-rnd sim*-rnd fit-fit fit-rnd acc*-fit acc*-rnd
ad 0.940 0.937 0.932 0.933 0.945 0.934 0.934 0.923 0.934 0.936 0.548 0.554
adult 0.793 0.792 0.792 0.791 0.791 0.791 0.791 0.791 0.791 0.792 0.685 0.685
agaricus-lepiota 0.684 0.684 0.682 0.684 0.675 0.684 0.682 0.682 0.677 0.677 0.037 0.037
aps-failure 0.828 0.840 0.864 0.839 0.852 0.839 0.825 0.836 0.847 0.837 0.736 0.736
banknote 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.821 0.821
bank 0.762 0.760 0.765 0.763 0.762 0.758 0.762 0.762 0.756 0.757 0.713 0.713
biodeg 0.841 0.849 0.851 0.842 0.850 0.842 0.838 0.834 0.829 0.815 0.626 0.626
car 0.866 0.853 0.860 0.829 0.840 0.885 0.830 0.839 0.843 0.810 0.294 0.294
census-income 0.767 0.510 0.770 0.767 0.513 0.767 0.765 0.767 0.753 0.755 0.419 0.419
cmc 0.536 0.538 0.526 0.537 0.525 0.547 0.534 0.541 0.531 0.523 0.467 0.467
dota2 0.595 0.594 0.594 0.595 0.595 0.594 0.594 0.594 0.595 0.595 0.475 0.474
drug-consumption 0.227 0.224 0.197 0.224 0.197 0.211 0.223 0.210 0.196 0.192 0.203 0.203
fertility 0.453 0.453 0.453 0.453 0.442 0.453 0.453 0.453 0.442 0.442 0.396 0.396
IndianLiverPatient 0.661 0.641 0.694 0.649 0.693 0.657 0.642 0.647 0.642 0.634 0.631 0.631
iris 0.980 0.980 0.980 0.980 0.980 0.960 0.980 0.960 0.980 0.980 0.960 0.960
krkopt 0.188 0.183 0.198 0.143 0.197 0.145 0.142 0.145 0.147 0.147 0.124 0.124
letter-recognition 0.653 0.652 0.658 0.646 0.659 0.646 0.646 0.646 0.646 0.646 0.646 0.646
magic04 0.848 0.849 0.846 0.847 0.844 0.839 0.846 0.840 0.834 0.834 0.655 0.655
ml-prove 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.999 0.704 0.704
musk1 0.883 0.876 0.867 0.877 0.868 0.868 0.891 0.897 0.862 0.876 0.366 0.366
musk2 0.943 0.939 0.940 0.942 0.942 0.938 0.948 0.936 0.951 0.943 0.455 0.455
optdigits 0.953 0.957 0.955 0.946 0.954 0.943 0.946 0.944 0.945 0.945 0.733 0.733
page-blocks 0.825 0.820 0.761 0.772 0.768 0.765 0.773 0.766 0.773 0.767 0.757 0.757
parkinsons 0.747 0.730 0.730 0.734 0.730 0.701 0.718 0.718 0.667 0.640 0.672 0.672
pendigits 0.945 0.948 0.940 0.933 0.941 0.915 0.933 0.918 0.937 0.928 0.802 0.802
segmentation 0.910 0.903 0.909 0.896 0.906 0.889 0.897 0.890 0.902 0.899 0.816 0.816
sensorless 0.959 0.966 0.953 0.953 0.954 0.958 0.948 0.952 0.965 0.964 0.796 0.796
tae 0.361 0.356 0.321 0.419 0.321 0.383 0.299 0.471 0.438 0.378 0.321 0.321
wine 0.982 0.982 0.982 0.982 0.982 0.982 0.982 1.000 0.982 0.982 0.982 0.982
yeast 0.468 0.464 0.455 0.457 0.448 0.460 0.444 0.445 0.456 0.452 0.457 0.457
Average macro-F1 0.753 0.743 0.749 0.748 0.739 0.745 0.742 0.747 0.744 0.738 0.577 0.577
Average rank 2.9 3.6 4.2 4.4 4.7 5.4 5.7 6.0 6.0 6.7 9.9 9.9
Table 2: Comparison of EvoDAG’s performance using different selection schemes for selecting the parents and for the negative selection. The columns are ordered based on the macro-F1 average rank. The symbol * indicates that the selection heuristic was applied to all functions with arity greater than one. The best performance in each problem is indicated in boldface.

Figure 2 shows the evolution of the best individuals found during the evolution on the training and validation sets. We use the agaricus-lepiota dataset as an example. The performance, in terms of macro-F1, of the best individual is recorded during the evolution of thirty independent executions, and these are presented as boxplots as a function of the number of evaluated individuals. It can be seen that, in all cases, the performance of the best individual on the training set is higher than the one obtained on the validation set. Furthermore, it can be observed that the parents’ selection scheme based on the accuracy (acc) has slightly higher values in the first evaluations than tournament selection; this is reflected as outliers in the boxplot. This trend continues during the whole evolution and is reflected in both the training and validation sets.

Figure 2: Boxplot of the best individual (30 executions) during the training phase on the agaricus-lepiota dataset. Blue boxplots represent the best individual on the training set and red boxplots depict the best individual on validation set.

5.2 Comparison of EvoDAG with other state-of-the-art classifiers

After analyzing the performance of the different selection schemes, we compare EvoDAG, under each selection scheme, against state-of-the-art classifiers. We compare against sixteen classifiers, all of them using their default parameters as implemented in the scikit-learn Python library [16]: Perceptron, MLPClassifier, BernoulliNB, GaussianNB, KNeighborsClassifier, NearestCentroid, LogisticRegression, LinearSVC, SVC, SGDClassifier, PassiveAggressiveClassifier, DecisionTreeClassifier, ExtraTreesClassifier, RandomForestClassifier, AdaBoostClassifier, and GradientBoostingClassifier. The comparison also includes two auto-machine learning libraries: autosklearn [17] and TPOT [9].
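As a rough illustration of this setup, the sketch below instantiates the sixteen scikit-learn classifiers with their default parameters and scores each one with macro-F1. The wine data loader, the stratified 70/30 split, the random seed, and the helper name `macro_f1_scores` are assumptions made for the example; they are not the evaluation protocol used in the experiments.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import (LogisticRegression,
                                  PassiveAggressiveClassifier, Perceptron,
                                  SGDClassifier)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# The sixteen classifiers named in the text.
CLASSIFIERS = [
    Perceptron, MLPClassifier, BernoulliNB, GaussianNB,
    KNeighborsClassifier, NearestCentroid, LogisticRegression, LinearSVC,
    SVC, SGDClassifier, PassiveAggressiveClassifier, DecisionTreeClassifier,
    ExtraTreesClassifier, RandomForestClassifier, AdaBoostClassifier,
    GradientBoostingClassifier,
]


def macro_f1_scores(X_train, y_train, X_test, y_test):
    """Fit every classifier with default parameters; return macro-F1 by name."""
    scores = {}
    for cls in CLASSIFIERS:
        model = cls()  # default parameters, no tuning
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        scores[cls.__name__] = f1_score(y_test, predictions, average="macro")
    return scores


# Assumed data and split, chosen only to keep the example small.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
scores = macro_f1_scores(X_train, y_train, X_test, y_test)
for name, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:30s} {value:.3f}")
```

Some of these models may emit convergence warnings on unscaled data when run with default parameters; the sketch leaves them as they are, since no tuning is applied.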

Figure 3 presents a boxplot of the ranks (using macro-F1 as the performance measure) of the state-of-the-art classifiers and of EvoDAG with the different selection schemes. To facilitate reading, the boxplots are ordered by average rank. The figure shows that TPOT is the system with the lowest (best) average rank, followed by EvoDAG with accuracy and random selection, which in turn is followed by autosklearn. Comparing the performance of TPOT against the rest of the classifiers, TPOT is not statistically different from any of the EvoDAG systems nor from the classifiers that have a better average rank than LogisticRegression.

As can be seen from Figure 3, only two classifiers are better than EvoDAG with random selection: TPOT and autosklearn. It is essential to note that both are auto-machine learning classifiers. Furthermore, consider all the classifiers that have a better rank than EvoDAG fit-rnd, which is the EvoDAG system in the lowest position. These are TPOT, autosklearn, GradientBoosting, and ExtraTrees. These classifiers have in common the use of decision trees at some point; that is, they are either a variant of decision trees or include them in their search space. Conversely, EvoDAG does not use any form of decision trees.

Figure 3: Boxplots of the ranks (measured using macro-F1) of the different classifiers. Classifiers are sorted by average rank, and blue boxplots represent EvoDAG systems.
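The average rank used to order the systems can be sketched as follows: within each dataset the systems are ranked by macro-F1 (rank 1 is the best, ties share the average rank), and the ranks are then averaged over datasets. The scores below are the first four columns of three rows of Table 2; the function names and the tie-handling rule are assumptions for illustration, since the text does not state how ties are resolved.

```python
def ranks_desc(scores):
    """Rank scores so the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend j over the run of systems tied with the one at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def average_ranks(table):
    """Average the per-dataset ranks of each column (system)."""
    per_dataset = [ranks_desc(row) for row in table.values()]
    return [sum(column) / len(per_dataset) for column in zip(*per_dataset)]


# First four columns of three rows of Table 2 (macro-F1).
table = {
    "parkinsons": [0.747, 0.730, 0.730, 0.734],
    "tae": [0.361, 0.356, 0.321, 0.419],
    "yeast": [0.468, 0.464, 0.455, 0.457],
}
avg = average_ranks(table)
print([round(value, 2) for value in avg])
```

With only three datasets and four columns the resulting numbers differ from the average ranks reported in the table, which are computed over all problems and all systems.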

Besides measuring performance with macro-F1, Figure 4 presents boxplots of the time required by the training phase of the different algorithms. The boxplots use a log scale, given the differences in time between the algorithms, and report time per sample to account for the varying training set sizes of the datasets. It is not surprising that the systems obtaining the best performance are also the slowest. As can be seen from the figure, TPOT is the most time-consuming system, followed by autosklearn and then the EvoDAG systems. On average, TPOT uses 38.7 seconds per sample, autosklearn requires 7.8 seconds per sample, and EvoDAG uses less than one second per sample. Among the EvoDAG systems, the selection schemes, from slowest to fastest, are accuracy, absolute cosine similarity, tournament selection, and random selection. This behavior is expected given their algorithmic complexity. Accuracy and cosine similarity require additional operations every time a parent is selected; in addition, these systems still compute the fitness to perform early stopping or negative selection. On the other hand, tournament and random selection need only the operations required to complete the selection itself, although tournament selection must create the tournament and random selection does not.

Figure 4: Boxplots of the time required by the classifiers’ training phase. The boxplots use a log scale and the time is measured per sample. Classifiers are sorted by the average time per sample (numbers on the left), and blue boxplots represent EvoDAG systems.
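The time-per-sample measure can be sketched as the wall-clock training time divided by the number of training samples. The helper `seconds_per_sample` and the toy `fit` function below are hypothetical stand-ins; the text does not specify the timing code that was actually used.

```python
import time


def seconds_per_sample(fit, X_train, y_train):
    """Return the training time of `fit` divided by the training set size."""
    start = time.perf_counter()
    fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    return elapsed / len(X_train)


def toy_fit(X, y):
    """Stand-in for a classifier's training routine (illustration only)."""
    return sum(sum(row) for row in X) / max(len(y), 1)


X_train = [[float(i), float(i + 1)] for i in range(1000)]
y_train = [i % 2 for i in range(1000)]
per_sample = seconds_per_sample(toy_fit, X_train, y_train)
print(f"{per_sample:.2e} seconds per sample")
```

Normalizing by the training set size makes datasets of different sizes comparable, which is why the figure reports time per sample rather than total time.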

One can combine the information presented in Figures 3 and 4 by performing a Pareto analysis. The classifiers on the Pareto frontier are TPOT, EvoDAG with acc-rnd, EvoDAG with rnd-rnd, GradientBoosting, ExtraTrees, and DecisionTree. From the figures, it can be inferred that the system closest to the elbow is GradientBoosting.
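A minimal sketch of such a Pareto analysis is given below, assuming that both objectives, average rank and time per sample, are minimized. The system names and values are made up for illustration and do not correspond to the measurements in Figures 3 and 4.

```python
def dominates(p, q):
    """True if p is at least as good as q in every objective and better in one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def pareto_frontier(points):
    """Return the names of the systems not dominated by any other system.

    `points` maps a name to (average rank, seconds per sample); both are
    minimized.
    """
    return [
        name for name, p in points.items()
        if not any(dominates(q, p)
                   for other, q in points.items() if other != name)
    ]


# Hypothetical systems: (average rank, seconds per sample).
points = {
    "A": (3.0, 40.0),   # best rank, slowest
    "B": (4.0, 0.5),
    "C": (5.0, 8.0),    # dominated by B: worse rank and slower
    "D": (9.0, 0.01),   # worst rank, fastest
}
frontier = pareto_frontier(points)
print(frontier)
```

A system stays on the frontier as long as no other system is both at least as accurate and at least as fast, so the frontier naturally contains the best-ranked and the fastest systems along with the trade-offs in between.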

6 Conclusion

We presented the impact that different selection heuristics have on the performance of a steady-state semantic Genetic Programming system, namely EvoDAG. The selection process takes place at two points during the evolution: when selecting the parents and when choosing an individual to be replaced. The heuristics studied for parent selection are the absolute value of the cosine similarity, accuracy, tournament selection, and random selection; for replacement, negative tournament selection and random selection are analyzed. The results show that our heuristics, cosine similarity and accuracy, outperform EvoDAG using tournament selection, i.e., selection based on fitness. Moreover, the heuristic that obtained the best performance was accuracy. It is interesting to note that random selection is competitive, achieving the third position among the combinations studied.

The performance of EvoDAG with the selection heuristics is analyzed on 30 classification problems taken from the UCI repository. EvoDAG is also compared with 18 state-of-the-art classifiers: 16 implemented in the scikit-learn Python library and two auto-machine learning algorithms. The results show that EvoDAG using accuracy and random selection is competitive. In terms of average rank (measured with macro-F1), it obtained the second position; the best system is TPOT, which is an auto-machine learning algorithm, and the third position went to autosklearn. Interestingly, EvoDAG’s performance is statistically equivalent to that of the two auto-machine learning algorithms considered in this comparison. However, EvoDAG uses neither a feature selection algorithm nor any form of decision trees, as the auto-machine learning approaches do. The comparison also includes the time required by the training phase of the classifiers. The auto-machine learning algorithms were the slowest, followed by EvoDAG. Nonetheless, the difference in time is considerable: on average, TPOT uses more than 30 seconds per sample, autosklearn more than 7, and EvoDAG less than one second per sample.


  • [1] Leonardo Vanneschi. An Introduction to Geometric Semantic Genetic Programming. In Oliver Schütze, Leonardo Trujillo, Pierrick Legrand, and Yazmin Maldonado, editors, Neo 2015, pages 3–42. Springer, Cham, 2017.
  • [2] P.G. Espejo, S. Ventura, and F. Herrera. A Survey on the Application of Genetic Programming to Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(2):121–144, 3 2010.
  • [3] Vijay Ingalalli, Sara Silva, Mauro Castelli, and Leonardo Vanneschi. A Multi-dimensional Genetic Programming Approach for Multi-class Classification Problems. pages 48–60. Springer, Berlin, Heidelberg, 2014.
  • [4] Luis Muñoz, Sara Silva, and Leonardo Trujillo. M3GP – Multiclass Classification with GP. pages 78–91. Springer, Cham, 2015.
  • [5] William La Cava, Sara Silva, Kourosh Danai, Lee Spector, Leonardo Vanneschi, and Jason H. Moore. Multidimensional genetic programming for multiclass classification. Swarm and Evolutionary Computation, 44:260–272, 2 2019.
  • [6] S. Haruyama and Qiangfu Zhao. Designing smaller decision trees using multiple objective optimization based GPs. In IEEE International Conference on Systems, Man and Cybernetics, volume 6, page 5. IEEE, 2002.
  • [7] Chan-Sheng Kuo, Tzung-Pei Hong, and Chuen-Lung Chen. Applying genetic programming technique in classification trees. Soft Computing, 11(12):1165–1172, 8 2007.
  • [8] Keith M. Sullivan and Sean Luke. Evolving kernels for support vector machine classification. In Proceedings of the 9th annual conference on Genetic and evolutionary computation - GECCO ’07, page 1702, New York, New York, USA, 2007. ACM Press.
  • [9] Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore. Automating Biomedical Data Science Through Tree-Based Pipeline Optimization. pages 123–137. Springer, Cham, 2016.
  • [10] Riccardo Poli, William B. Langdon, and Nicholas F. McPhee. A field guide to genetic programming. 2008.
  • [11] Joel Lehman and Kenneth O. Stanley. Abandoning Objectives: Evolution Through the Search for Novelty Alone. Evolutionary Computation, 19(2):189–223, 6 2011.
  • [12] Enrique Naredo, Leonardo Trujillo, Pierrick Legrand, Sara Silva, and Luis Muñoz. Evolving genetic programming classifiers with novelty search. Information Sciences, 369:347–367, 11 2016.
  • [13] Mario Graff, Eric S. Tellez, Sabino Miranda-Jimenez, and Hugo Jair Escalante. EvoDAG: A semantic Genetic Programming Python library. In 2016 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), pages 1–6. IEEE, 11 2016.
  • [14] Mario Graff, Eric S. Tellez, Hugo Jair Escalante, and Sabino Miranda-Jiménez. Semantic Genetic Programming for Sentiment Analysis. pages 43–65. Springer, Cham, 2017.
  • [15] Mario Graff, Sabino Miranda-Jiménez, Eric S. Tellez, and Daniela Moctezuma. EvoMSA: A Multilingual Evolutionary Approach for Sentiment Analysis. 11 2018.
  • [16] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
  • [17] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and Robust Automated Machine Learning, 2015.
  • [18] Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017.
  • [19] Krzysztof Krawiec. Semantic Genetic Programming. In Behavioral Program Synthesis with Genetic Programming, pages 55–66. Springer, Cham, 2016.
  • [20] Quang Uy Nguyen, Xuan Hoai Nguyen, Michael O’Neill, and Alexandros Agapitos. An Investigation of Fitness Sharing with Semantic and Syntactic Distance Metrics. pages 109–120. Springer, Berlin, Heidelberg, 2012.
  • [21] Lawrence Beadle and Colin G. Johnson. Semantically driven crossover in genetic programming. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pages 111–116. IEEE, 6 2008.
  • [22] Nguyen Quang Uy, Nguyen Xuan Hoai, Michael O’Neill, R. I. McKay, and Edgar Galván-López. Semantically-based crossover in genetic programming: application to real-valued symbolic regression. Genetic Programming and Evolvable Machines, 12(2):91–119, 6 2011.
  • [23] Akira Hara, Yoshimasa Ueno, and Tetsuyuki Takahama. New crossover operator based on semantic distance between subtrees in Genetic Programming. In 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 721–726. IEEE, 10 2012.
  • [24] Mario Graff, Ariel Graff-Guerrero, and Jaime Cerda-Jacobo. Semantic Crossover Based on the Partial Derivative Error. pages 37–47. Springer, Berlin, Heidelberg, 2014.
  • [25] Ranyart R Suárez, Mario Graff, and Juan J Flores. Semantic Crossover Operator for GP based on the Second Partial Derivative of the Error Function. pages 87–96. Research in Computing Science 94, 2015.
  • [26] Mario Graff, Juan J. Flores, and Jose Ortiz Bejar. Genetic Programming: Semantic point mutation operator based on the partial derivative error. In 2014 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), pages 1–6. IEEE, 11 2014.
  • [27] Alberto Moraglio and Riccardo Poli. Topological Interpretation of Crossover. pages 1377–1388. Springer, Berlin, Heidelberg, 2004.
  • [28] Alberto Moraglio, Krzysztof Krawiec, and Colin G. Johnson. Geometric Semantic Genetic Programming. pages 21–31. Springer, Berlin, Heidelberg, 2012.
  • [29] Krzysztof Krawiec and Pawel Lichocki. Approximating geometric crossover in semantic space. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation - GECCO ’09, page 987, New York, New York, USA, 2009. ACM Press.
  • [30] Krzysztof Krawiec and Tomasz Pawlak. Locally geometric semantic crossover. In Proceedings of the fourteenth international conference on Genetic and evolutionary computation conference companion - GECCO Companion ’12, page 1487, New York, New York, USA, 2012. ACM Press.
  • [31] Krzysztof Krawiec and Tomasz Pawlak. Locally geometric semantic crossover: a study on the roles of semantics and homology in recombination operators. Genetic Programming and Evolvable Machines, 14(1):31–63, 3 2013.
  • [32] Tomasz P. Pawlak, Bartosz Wieloch, and Krzysztof Krawiec. Semantic Backpropagation for Designing Search Operators in Genetic Programming. IEEE Transactions on Evolutionary Computation, 19(3):326–340, 6 2015.
  • [33] Mario Graff, Eric Sadit Tellez, Elio Villaseñor, and Sabino Miranda-Jiménez. Semantic Genetic Programming Operators Based on Projections in the Phenotype Space. In Research in Computing Science, pages 73–85, 2015.
  • [34] Mario Graff, Eric S. Tellez, Hugo Jair Escalante, and Jose Ortiz-Bejar. Memetic Genetic Programming based on orthogonal projections in the phenotype space. In 2015 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), pages 1–6. IEEE, 11 2015.
  • [35] Quang Uy Nguyen, Tuan Anh Pham, Xuan Hoai Nguyen, and James McDermott. Subtree semantic geometric crossover for genetic programming. Genetic Programming and Evolvable Machines, 17(1):25–53, 3 2016.
  • [36] Marcin Szubert, Anuradha Kodali, Sangram Ganguly, Kamalika Das, and Josh C. Bongard. Semantic Forward Propagation for Symbolic Regression. pages 364–374. Springer, Cham, 2016.
  • [37] Leonardo Vanneschi, Mauro Castelli, and Sara Silva. A survey of semantic methods in genetic programming. Genetic Programming and Evolvable Machines, 15(2):195–214, 6 2014.
  • [38] Edgar Galvan-Lopez, Brendan Cody-Kenny, Leonardo Trujillo, and Ahmed Kattan. Using semantics in the selection mechanism in Genetic Programming: A simple method for promoting semantic diversity. In 2013 IEEE Congress on Evolutionary Computation, pages 2972–2979. IEEE, 6 2013.
  • [39] Stefano Ruberto, Leonardo Vanneschi, Mauro Castelli, and Sara Silva. ESAGP – A Semantic GP Framework Based on Alignment in the Error Space. pages 150–161. Springer, Berlin, Heidelberg, 2014.
  • [40] Thi Huong Chu, Quang Uy Nguyen, and Michael O’Neill. Tournament Selection Based on Statistical Test in Genetic Programming. pages 303–312. Springer, Cham, 2016.
  • [41] Thi Huong Chu, Quang Uy Nguyen, and Michael O’Neill. Semantic tournament selection for genetic programming based on statistical analysis of error vectors. Information Sciences, 436-437:352–366, 4 2018.
  • [42] Akira Hara, Jun-ichi Kushida, and Tetsuyuki Takahama. Deterministic Geometric Semantic Genetic Programming with Optimal Mate Selection. In 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 003387–003392. IEEE, 10 2016.
  • [43] Qi Chen, Bing Xue, and Mengjie Zhang. Improving Generalization of Genetic Programming for Symbolic Regression With Angle-Driven Geometric Semantic Operators. IEEE Transactions on Evolutionary Computation, 23(3):488–502, 6 2019.
  • [44] T. Loveard and V. Ciesielski. Representing classification problems in genetic programming. In Proceedings of the 2001 Congress on Evolutionary Computation (IEEE Cat. No.01TH8546), volume 2, pages 1070–1077. IEEE, 2001.
  • [45] D.P. Muni, N.R. Pal, and J. Das. A Novel Approach to Design Classifiers Using Genetic Programming. IEEE Transactions on Evolutionary Computation, 8(2):183–196, 4 2004.
  • [46] Hajira Jabeen and Abdul Rauf Baig. Two-stage learning for multi-class classification using genetic programming. Neurocomputing, 116:311–316, 9 2013.
  • [47] Mauro Castelli, Sara Silva, and Leonardo Vanneschi. A C++ framework for geometric semantic genetic programming. Genetic Programming and Evolvable Machines, 16(1):73–81, 3 2015.
  • [48] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 8 1996.
  • [49] Jerome H. Friedman and Peter Hall. On bagging and nonlinear estimation. Journal of Statistical Planning and Inference, 137(3):669–683, 3 2007.
  • [50] Frank Wilcoxon. Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6):80–83, 12 1945.
  • [51] Sture Holm. A Simple Sequentially Rejective Multiple Test Procedure, 1979.