GP-RVM: Genetic Programming-based Symbolic Regression Using Relevance Vector Machine
Abstract
This paper proposes a hybrid basis function construction method (GP-RVM) for the symbolic regression (SR) problem, which combines an extended version of Genetic Programming called Kaizen Programming with the Relevance Vector Machine to evolve an optimal set of basis functions. Different from traditional evolutionary algorithms, where a single individual is a complete solution, our method builds a solution as a linear combination of basis functions constructed from individuals during the evolutionary process. RVM, a sparse Bayesian kernel method, selects suitable functions to constitute the basis: it determines the posterior weight of a function by evaluating its quality and sparsity. The solution produced by GP-RVM is a sparse Bayesian linear model over the coefficients of many nonlinear functions. Our hybrid approach focuses on nonlinear white-box models, selecting the right combination of functions to build robust predictions without prior knowledge about the data. Experimental results show that GP-RVM outperforms conventional methods, which suggests that it is an efficient and accurate technique for solving SR. The computational complexity of GP-RVM scales as O(M³), where M is the number of functions in the basis set, and M is typically much smaller than the number N of training patterns.
I Introduction
In this paper, we address a learning task called symbolic regression (SR) [13]. Suppose that there is an unknown real-valued function f whose input can be viewed as a D-dimensional vector x = (x₁, …, x_D)ᵀ, and that we are given a set of N observed input-output pairs {(xₙ, tₙ)}, n = 1, …, N, where tₙ is the observed output for xₙ. Regression is defined as a set of statistical processes for approximating the relationships between input-output pairs. Symbolic regression is performed by methods which minimize various error metrics while searching the space of mathematical expressions to estimate the accurate and simple model that best fits the observed dataset [12]. Conventional learning methods such as Artificial Neural Networks (ANNs) [5, 6] and Support Vector Machines (SVMs) [7] have been widely used for solving regression problems. The solutions given by these techniques are known as black boxes, meaning that the mechanisms behind them are difficult to understand and analyze. SR, on the contrary, tries to produce white-box models that are easy to interpret. The goal of an SR exercise is to determine which of the inputs are the most effective in predicting the outputs and to identify the input-output relationship [21]. When solving SR, not only are the unknown coefficients optimized, but the exact formulae that explain the data are also determined [8-10]. Linear and nonlinear regression methods try to fit parameters to a given form of an equation. SR methods, however, search both the parameter space and the space of equation forms at the same time, aiming to express the solution as a mathematical equation in an easy-to-interpret format. The main reason for choosing SR over other linear and nonlinear regression methods is its interpretability: such a formula may not be discovered by other machine learning methods, since their primary target is coefficient optimization.
SR is one of the real-world applications of Genetic Programming (GP) [12, 13, 21, 22]. Inspired by Darwin's theory of evolution, GP tries to solve optimization problems by simulating the evolutionary procedure to evolve computer programs that perform well on a given task. GP is an evolutionary algorithm in which many programs are represented as a cluster of genes that are then evolved by an evolutionary mechanism. It starts by randomly initializing a population of programs. The population is then repeatedly updated under fitness-based selection, applying genetic operators until the desired solution is found. GP has been utilized to tackle various problems in different disciplines such as decision making, pattern recognition, robotic networks, bioinformatics, data mining, and finance [16-18]. Despite its successful application in several domains, there are some known issues with GP and related algorithms. For instance, in practice, conventional GP suffers severely from problems like bloat, empirical hyperparameter tuning, slow speed (due to the large population size), and low success rate [16-18].

Why Kaizen Programming?
A controversial issue of GP is its nondeterminacy under random search, where the only guide is the pressure on individuals towards better fitness. Because the solutions are modified randomly, the algorithm cannot ensure that the next iteration improves, or even that it avoids deterioration. Motivated by developing an SR method which requires a relatively lower number of individuals (to accelerate the search) while still maintaining the quality and interpretability of the solutions, de Melo proposed Kaizen Programming (KP) [14] in 2014. KP is a cooperative coevolutionary method which performs automatic feature engineering, using a statistical approach to build models from features generated by GP. KP builds linear models over GP individuals to obtain deterministic results. By continuously updating a set of best regressors, KP ensures the improvement of solution quality. To improve the solution quality, KP employs feature selection via null hypothesis significance testing (NHST) based on the p-value. KP has an advantage over other similar methods, as it relies not only on random evolution but also on a more deterministic and efficient approach to build the model. Rather than focusing on finding a single complete solution, KP concentrates on how important each part of the solution is, and guides the search by dividing the entire task into small ones to achieve high efficiency. The first significant limitation of KP is the introduction of NHST, which raises several controversial issues; for example, the proper selection of a threshold on the significance level ultimately demands prior knowledge. The employment of the linear regression model raises the singularity problem: singularity happens when duplicated features are used to build the model, and it causes the data variance matrix to be non-invertible.
Motivated by these analytical results, in this paper we extend the feature selection module of KP based on RVM. The proposed method works without performing a hypothesis test, thus requiring no prior knowledge to set a threshold, and it deals with singularity automatically. We implemented the sequential sparse Bayesian learning algorithm to accelerate the training speed of RVM. The proposed method is compared to KP and GP on several benchmark functions. Results show that GP-RVM outperforms the other methods, providing more accurate models with fewer function evaluations. The remainder of the paper is organized as follows. Section II presents an overview of recent related work on solving SR tasks and the basic Kaizen Programming workflow. Section III describes the method proposed in this work. Experimental results and related topics are discussed in Section IV. Finally, Section V summarizes this paper and covers future work. For detailed information about the application of Relevance Vector Machines (RVM) in our work, see [32], Chapter 7.
II Previous Works
II-A Kaizen Programming
Kaizen Programming is a hybrid method for solving SR based on the Kaizen event [14] with the Plan-Do-Check-Act (PDCA) methodology. In the real world, a Kaizen event is an event where experts propose their ideas and test them to tackle a business issue. Many ideas are then combined to form a complete solution to the issue, which is known as the standard. During the Kaizen process, the PDCA methodology is employed to improve the quality of the current standard, where adjustments are planned, executed, checked, and acted upon. The PDCA cycle is iterated until the business issue is solved. Since the contribution of each action at each period towards resolving the issue is analyzed and evaluated based on specific criteria, more knowledge on the topic is acquired. As a result, the experts learn from the improved information, avoid useless or harmful actions, and control the search process, thus producing a better solution.

KP is organized into three modules:

- Feature generation module, which proposes ideas in the shape of regression features.
- Feature selection module, which evaluates and selects contributive features.
- Model generation module, which creates a complete solution using the generated features.
For solving a regression problem, a partial solution can be a mathematical expression (for instance, a term such as x² or sin(x)), which is equivalent to a feature used in a linear regression model. In the feature selection module, new ideas are evaluated to generate an actual partial solution, which can be treated as an N-dimensional vector, where N is the number of training data. These partial solutions are combined with the old ones existing in the current standard, resulting in a final complete solution composed of partial solutions. Each partial solution is then evaluated to indicate its contribution to solving the problem. Due to the relationships among those partial solutions, a measurement that considers not only the independent quality of each partial solution but also their dependency is used in this step. The major difference between KP and conventional evolutionary algorithms is that the latter work with individuals where each one is a complete solution. In KP, individuals collaborate with each other, so the population size is smaller than in conventional evolutionary algorithms, where individuals must compete with each other to survive.
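To make this concrete, the following minimal sketch (with hypothetical feature functions and a synthetic target, not taken from the paper's benchmarks) evaluates each partial solution into an N-dimensional vector, stacks these vectors into a design matrix, and fits the linear coefficients by least squares:

```python
import numpy as np

# Hypothetical partial solutions ("ideas"): each is a feature function of x.
features = [lambda x: x, lambda x: x**2, lambda x: np.sin(x)]

# Synthetic training data for an assumed target t = 2x + 0.5 sin(x).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 50)
t = 2.0 * x + 0.5 * np.sin(x)

# Evaluating each partial solution on the N inputs gives an N-dimensional
# vector; stacking the vectors column-wise gives the design matrix.
Phi = np.column_stack([f(x) for f in features])

# MLE of the linear coefficients under Gaussian noise = least squares.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```

Since the target lies exactly in the span of the features, the fit should recover the combination 2·x + 0·x² + 0.5·sin(x); KP additionally scores each column's contribution before deciding which features survive into the standard.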
In the model generation module, the complete solution is generated based on the results of the selection module. For example, after the most significant partial solutions have been selected from the structure over several iterations, the coefficients are fitted to the data by performing Maximum Likelihood Estimation (MLE), which is equivalent to minimizing the Root Mean Square Error (RMSE). The quality of the complete solution must also be evaluated; one of the reasons is to detect whether the algorithm is stuck in a local optimum.
II-A1 KP: The Advantages and Disadvantages
The PDCA cycle plays the role of encouraging experts to propose ideas which are meaningful and useful for solving the problem. It is critical because the criterion for evaluating the contribution takes the place of fitness on the entire solution. Therefore, the methods employed in the second and third modules should be logically connected to each other: the same criteria should be applied to choose the features that construct the model, or more precisely, the method used to solve the problem and the method employed to evaluate the contribution of the partial solutions should be equivalent. If the feature selection module and the model generation module were unrelated, then ideas meaningful to the procedure employed in the feature selection module would not be useful to the method that finally solves the problem. KP guides the search process towards the goal based on continuously increasing knowledge of the problem. As a result, machine learning techniques or statistical approaches must be employed efficiently in the hybrid structure of KP; without them, KP reduces to a conventional EC algorithm, where the search depends only on the selection process. We believe that KP has several limitations requiring further improvement. Firstly, when MLE solves the linear regression problem, the singularity issue of the data variance matrix arises and must be treated appropriately. The singularity is caused by similar or identical features used to construct the model, and it makes it difficult to estimate the parameters. Secondly, the hypothesis test requires a properly adjusted value for the significance level. If the threshold is too high, the complexity of the solution increases excessively, and the redundancy in the solution hinders our ability to acquire inner knowledge of it. On the other hand, an overly low threshold makes it hard to find a good approximation to the target. Although, in the first case, we could control the excessive growth of complexity by limiting the number of features used in the standard, it is still problematic to select a proper limit given that we have no prior knowledge of the problem.
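The singularity issue can be seen in a few lines: a duplicated feature makes the moment matrix ΦᵀΦ rank-deficient, so plain MLE has no unique solution, while a per-weight prior precision (the Bayesian generalization adopted by the proposed method) keeps the posterior covariance well defined. A minimal sketch with assumed precision values:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 30)

# Two identical features (e.g. the same subtree proposed twice) give a
# rank-deficient design matrix, so Phi^T Phi is singular and non-invertible.
Phi = np.column_stack([x, x])
gram = Phi.T @ Phi

# A per-weight prior precision (assumed values here) regularizes the matrix:
# the posterior covariance (A + beta * Phi^T Phi)^{-1} always exists.
alpha = np.ones(2)   # prior precisions, one per weight
beta = 100.0         # assumed noise precision (1 / sigma^2)
posterior_cov = np.linalg.inv(np.diag(alpha) + beta * gram)
```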
II-B Other Genetic Programming-based Methods for Symbolic Regression
II-B1 Multiple Regression Genetic Programming
Arnaldo et al. [28] proposed Multiple Regression Genetic Programming (MRGP), a method which overcomes a limitation of conventional GP by eliminating the direct comparison between the output of the complete individual and the target values of the training data. The method is motivated by the fact that in GP the fitness function is not applied directly to the genotype of individuals, but rather to the complete phenotype, which results in the lack of a clear and strong focus on advancing the evolution of basic structures. In their experiments, MRGP consistently generated solutions fitter than the results of GP and linear regression.
II-B2 Geometric Semantic Genetic Programming (GSGP)
GSGP [31] concentrates on introducing semantic awareness into genetic operators, that is, awareness of the actual outcome after the operators are applied to individuals. To do so, the authors formally define the semantic space as an N-dimensional Euclidean space, where N is the number of training IO pairs, and each individual is encoded in this metric space as a single point. They propose several geometric semantic operators which directly search the semantic space, producing offspring with fitness guarantees. Geometric semantic crossover mixes two parents in the population by constructing an offspring expressed as a (weighted) average of the parents, which is guaranteed to be at least as fit as the less fit of the parents. Geometric semantic mutation produces an offspring which is guaranteed to lie in an N-dimensional ball of prespecified radius centered at the parent, so it avoids causing large migrations in the semantic space. With geometric operators employed, the landscape of the fitness function is unimodal and thus easy to search for the global optimum. The primary advantage of GSGP is its relatively faster convergence compared to conventional GP. However, the trade-off is the exponentially increasing complexity of the individuals, which results in solutions that are theoretically difficult to analyze (black-box) and computationally expensive to evaluate.
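The crossover guarantee follows from convexity: the distance to the target point in semantic space is a convex function, so a convex combination of the parents cannot be farther from the target than the worse parent. A toy sketch with made-up semantics vectors:

```python
import numpy as np

# Semantics of an individual = its output vector on the N training inputs.
target = np.array([2.0, 1.0, 2.0])
p1 = np.array([1.0, 2.0, 3.0])   # parent 1 semantics (made-up values)
p2 = np.array([3.0, 0.0, 1.0])   # parent 2 semantics (made-up values)

def distance(sem):
    # Fitness in semantic space: Euclidean distance to the target (lower is better).
    return np.linalg.norm(sem - target)

# Geometric semantic crossover: a convex combination of the parents, so the
# offspring lies on the segment joining them in semantic space.
r = 0.5
child = r * p1 + (1.0 - r) * p2

# By convexity, the child is at least as fit as the less fit parent.
assert distance(child) <= max(distance(p1), distance(p2))
```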
II-C Relevance Vector Machine for Symbolic Regression
The Support Vector Machine (SVM) has been widely used in classification and regression applications. However, machine learning techniques based on support vector machines have some limitations, several of which are explained in [32], Chapter 7. The output of an SVM represents decisions rather than posterior probabilities, and the predictions are expressed as linear combinations of kernel functions centered on the training data. The Relevance Vector Machine (RVM) [15] is a sparse Bayesian kernel method with applications in classification and regression. RVM shares many qualities with SVM while avoiding its basic limitations and typically resulting in much sparser models; consequently, RVM tends to give higher performance on test data. Using RVM for regression, we seek a linear model which results in sparse solutions [32] (Chapter 3). Given a training set of input-target pairs {(xₙ, tₙ)}, n = 1, …, N, and considering scalar-valued target functions only, we follow the standard probabilistic formulation and assume that the targets are samples from the model with additive noise:
(1)  tₙ = y(xₙ; w) + εₙ
where the εₙ are independent samples from a noise process assumed to be zero-mean Gaussian with variance σ² [32] (Chapter 7.2), and y(x; w) = Σᵢ wᵢ φᵢ(x) = wᵀφ(x), where the φᵢ(x), i = 1, …, M, are fixed nonlinear basis functions. Thus the model defines a conditional distribution for a real-valued target variable t, given an input vector x, which takes the form:
(2)  p(t | x, w, σ²) = N(t | y(x; w), σ²)
As we assume the targets tₙ are independent, the likelihood of the complete dataset t = (t₁, …, t_N)ᵀ can be written as:
(3)  p(t | X, w, σ²) = Πₙ N(tₙ | wᵀφ(xₙ), σ²)
Here, RVM adopts a Bayesian perspective and constrains the parameters by introducing a separate hyperparameter αᵢ for each of the weight parameters wᵢ, instead of a single shared hyperparameter [32] (Chapter 7.2). Thus the weight prior takes the form:
(4)  p(w | α) = Πᵢ N(wᵢ | 0, αᵢ⁻¹)
where α = (α₁, …, α_M)ᵀ and αᵢ stands for the precision of wᵢ. The introduction of an individual hyperparameter for every weight is the key feature of the model: when the evidence is maximized, a proportion of the αᵢ are driven to large values, and the corresponding weight parameters take posterior distributions that are concentrated at 0. Consequently, the basis functions associated with these parameters do not contribute to the model predictions, which results in a sparse model. The posterior over the weights is then obtained from Bayes' rule:
(5)  p(w | t, X, α, σ²) = N(w | m, Σ)
with the posterior mean:
(6)  m = σ⁻² Σ Φᵀ t
and posterior covariance:
(7)  Σ = (A + σ⁻² ΦᵀΦ)⁻¹,  A = diag(α₁, …, α_M)
where the design matrix Φ is the N×M matrix with elements Φₙᵢ = φᵢ(xₙ) [32] (Chapter 7.2, Eq. 7.81). Training the relevance vector machine requires searching for the hyperparameter values that maximize the marginal likelihood function [32] (Chapter 7.2, Eq. 7.84):
(8)  p(t | X, α, σ²) = ∫ p(t | X, w, σ²) p(w | α) dw
This integral involves a convolution of two Gaussian functions, so it can be evaluated by completing the square, giving the log marginal likelihood:
(9)  ln p(t | X, α, σ²) = −(1/2) [ N ln(2π) + ln |C| + tᵀ C⁻¹ t ]
where
(10)  C = σ² I + Φ A⁻¹ Φᵀ
We maximize (9) with respect to α and σ². Setting the derivatives of the marginal likelihood to zero yields the re-estimation equations:
(11)  αᵢⁿᵉʷ = γᵢ / mᵢ²
(12)  (σ²)ⁿᵉʷ = ‖t − Φm‖² / (N − Σᵢ γᵢ)
where mᵢ is the i-th element of the posterior mean defined by (6). [32] (Section 3.5.3) defines γᵢ as a measure of how well the parameter wᵢ is determined by the data:
(13)  γᵢ = 1 − αᵢ Σᵢᵢ
where Σᵢᵢ is the i-th diagonal element of the posterior covariance Σ defined in (7).
The learning algorithm starts by initializing α and σ², and then calculating the mean and covariance of the posterior using (6) and (7), respectively. It proceeds by alternating the re-estimations (11) and (12) with updates of the posterior statistics, until the maximum number of iterations is reached or the changes in the precision parameters αᵢ fall below a lower bound [32] (Section 7.2.2). In practice, the optimization drives a proportion of the hyperparameters to large (theoretically infinite) values; as a result, the weight parameters related to those hyperparameters have posterior distributions with mean and variance both zero. Accordingly, those parameters, as well as the corresponding basis functions, are eliminated from the model and play no role in making predictions for new inputs [15].
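The loop described above can be sketched as follows. This is a minimal illustration of equations (6), (7) and (11)-(13) with assumed initial values and pruning bound; it is not the faster sequential algorithm [19] used in the proposed method.

```python
import numpy as np

def rvm_fit(Phi, t, n_iter=100, alpha_bound=1e8):
    """Sketch of RVM training by re-estimation; prunes diverging precisions."""
    N, M = Phi.shape
    alpha = np.ones(M)            # per-weight precisions, initialised to 1
    beta = 1.0 / np.var(t)        # noise precision beta = 1 / sigma^2
    keep = np.arange(M)           # indices of surviving basis functions
    for _ in range(n_iter):
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)       # Eq. (7)
        m = beta * Sigma @ Phi.T @ t                                     # Eq. (6)
        gamma = 1.0 - alpha * np.diag(Sigma)                             # Eq. (13)
        alpha = gamma / (m ** 2 + 1e-12)                                 # Eq. (11)
        beta = (N - gamma.sum()) / (np.sum((t - Phi @ m) ** 2) + 1e-12)  # Eq. (12)
        # Weights whose precision diverges are effectively zero: prune them.
        mask = alpha < alpha_bound
        Phi, alpha, keep = Phi[:, mask], alpha[mask], keep[mask]
    # Final posterior over the surviving basis functions.
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return keep, m
```

On a toy problem where the target depends on only one of several candidate features, the loop should drive the other precisions towards infinity and return a sparse model.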
III Proposed Method
III-A An Efficient Hybrid GP-based Approach Using Relevance Vector Machine
This section introduces a new method for SR based on the sparse Bayesian kernel method, which integrates a GP-based search of tree structures with a Bayesian parameter estimation employing automatic relevance determination (ARD). To overcome GP's major difficulties, we extend the work of KP in [14] by improving the feature selection with ARD, using RVM. Because RVM requires no prior knowledge to set a threshold and deals with singularity automatically, the proposed method also overcomes the limitations of KP. Figure 1 shows the basic flow chart of the proposed method. We explain three key steps, which are shown with a star mark, in the subsequent parts of this section. To overcome the limitations of KP, we first analytically introduce regularization terms over the data variance matrix to avoid singularity. The regularized matrix can be considered as a posterior covariance matrix; this means the singularity problem can be avoided if we generalize the linear regression model into a Bayesian linear regression model with a distributed precision parameter for each linear coefficient. In other words, the generalized model is equivalent to the well-known RVM [15]. Moreover, the generalized model no longer depends on a hypothesis test, so it does not require any prior knowledge to set a proper threshold for feature selection.
Considering these analytical results, in this paper we introduce a hybrid method for solving the SR problem based on RVM. RVM is widely used as a kernel technique, although there is no restriction on applying it to any model expressed as a linear combination of feature functions. In our proposed method, GP individuals are used for generating features. A preprocessing step extracts all subtree functions from each tree-formed GP individual, to ensure the maximum usage of the provided resources. These functions are then used to build linear models together with the features existing in the current model. Different from the statistical methods used in KP, RVM is used to select proper features. A sequential training algorithm [19] is employed to perform the optimization task of RVM, which improves the training speed significantly. Optimized results are finally checked under the fitness criterion to ensure the improvement of model quality and to avoid getting stuck in a local optimum. We use the adjusted R-squared (Adj. R²) [20] rather than RMSE as the fitness criterion for model comparison and for reducing overfitting by controlling the complexity of the selected model.
III-B Feature Generation from GP Individuals
A sufficient number of candidate features must be produced in the first place to ensure the quality of the regression model. In the experiments, we found that the variety of features strongly influences the performance of the proposed method: a lack of candidate features slows the improvement of fitness and increases the risk of getting stuck in a local optimum. As we no longer consider the individuals as solutions, the primary duty of the GP population is to generate feature functions. Empirically, we found that a proper feature function usually exists in a subtree of an individual. However, as the mapping from a tree to a function is complicated, the final behavior of the tree can be very different from that of its subtrees. If we only use the entire tree as one feature function, then all the subtrees are wasted, and it is difficult for them to reappear in future generations. One approach is to increase the number of individuals, but that wastes even more subtrees. The proper solution is therefore to traverse every single tree and extract all of its subtree functions. If duplicated functions are kept, the training of RVM repeatedly struggles to converge, or even fails, due to singularity. As a result, in practice, the Python computational algebra library SymPy is used to transform a primitive tree into a symbolic expression, which simplifies the function (thus reducing the evaluation cost) and allows checking the uniqueness of the candidates. As an expression can appear frequently along the process, an LRU (Least Recently Used) cache is implemented to save the evaluation result of each evaluated expression.
In short, it takes two steps to construct candidate functions from a GP primitive tree. First, the entire tree is traversed to yield all of its subtree functions. Second, all the subtrees are transformed into symbolic expressions and filtered, keeping only distinct functions.
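The two steps can be sketched on a toy tuple-based tree (a stand-in for DEAP's PrimitiveTree; plain structural equality replaces the SymPy simplification used in the actual implementation, so it is a weaker uniqueness check), together with an LRU cache over expression evaluation:

```python
import math
from functools import lru_cache

# A primitive tree as nested tuples: (operator, child, ...) or the leaf 'x'.
tree = ('add', ('mul', 'x', 'x'), ('sin', 'x'))

def subtrees(node):
    """Step 1: traverse the tree, yielding every subtree (itself included)."""
    yield node
    if isinstance(node, tuple):
        for child in node[1:]:
            yield from subtrees(child)

# Step 2: filter duplicates via a canonical form (repr here; SymPy would
# also simplify algebraically, catching e.g. x*x == x**2).
unique = {repr(s): s for s in subtrees(tree)}

OPS = {'add': lambda a, b: a + b,
       'mul': lambda a, b: a * b,
       'sin': math.sin}

@lru_cache(maxsize=1024)
def evaluate(key, x):
    """Evaluate a unique subtree at x, caching results by canonical form."""
    node = unique[key]
    if node == 'x':
        return x
    op, *children = node
    return OPS[op](*(evaluate(repr(c), x) for c in children))
```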
III-C Sequential Sparse Bayesian Learning Algorithm
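The sequential algorithm [19] maintains the model incrementally: for each candidate basis function φ it computes a "sparsity factor" s = φᵀC⁻¹φ and a "quality factor" q = φᵀC⁻¹t from the current marginal covariance C of (10), and including the function raises the marginal likelihood exactly when q² > s, in which case the optimal precision has a closed form. The sketch below covers a basis not currently in the model (the in-model case needs a correction for the function's own contribution to C, omitted here), with assumed values in the toy check:

```python
import numpy as np

def candidate_decision(phi, C_inv, t):
    """Decide whether adding basis phi raises the marginal likelihood.

    C_inv is the inverse of the current marginal covariance C (Eq. 10);
    phi must not already be part of the model.
    """
    s = phi @ C_inv @ phi   # sparsity factor
    q = phi @ C_inv @ t     # quality factor
    if q ** 2 > s:
        # Closed-form optimal precision for the new basis function.
        return 'add', s ** 2 / (q ** 2 - s)
    return 'skip', np.inf

# Toy check: with an empty model, C = sigma^2 * I. A basis aligned with the
# target should be added; one orthogonal to it should be skipped.
x = np.linspace(-1.0, 1.0, 51)
C_inv = np.eye(51) / 0.01        # sigma^2 = 0.01 (assumed)
t = 2.0 * x
decision_good, _ = candidate_decision(x, C_inv, t)
decision_bad, _ = candidate_decision(x ** 2, C_inv, t)
```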
III-D Model Selection
The principal disadvantage of RVM is that training involves optimizing a non-convex function. During the training phase, many distinct models with different numbers of active functions are generated. In general, a model with more active functions behaves better on the training set; however, a simpler one is preferred if the difference in training error between two models is small. The complexity of RVM scales with the number of active basis functions, and in the experiments we found that the speed of the proposed method is significantly influenced by the number of functions kept in the record. As a result, it is essential to compare the models produced in the training phase and select the simplest model among those which exceed a quality threshold. As in [14], the adjusted R² is used for model comparison, since it considers not only the fitness value but also the number of active functions relative to the amount of training data. The adjusted R² is calculated using the following formula:
(15)  Adj. R² = 1 − (1 − R²) (N − 1) / (N − p − 1)
where N is the number of training patterns, p is the number of active functions in a model, and R² is the coefficient of determination, which is given by:
(16)  R² = 1 − Σₙ (tₙ − yₙ)² / Σₙ (tₙ − t̄)²
where t̄ is the mean of the targets tₙ, and yₙ is the n-th entry of the output vector of a model. In the proposed method, the Adj. R² is used as the criterion to select the best model from the converged models. Compared to RMSE, the advantage of Adj. R² is that it is scale-free with respect to the absolute target values of the problem, and the statistic lies in the range (0, 1), where 1.0 means a perfect fit, 0.99 a high-quality model, and 0.95 a medium-quality model.
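A direct implementation of (15) and (16), with the convention that y is the model's output vector and p the number of active basis functions:

```python
import numpy as np

def adjusted_r2(t, y, p):
    """Adjusted R^2 of a model with p active functions on N training patterns."""
    n = len(t)
    ss_res = np.sum((t - y) ** 2)                # residual sum of squares
    ss_tot = np.sum((t - np.mean(t)) ** 2)       # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                   # Eq. (16)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)  # Eq. (15)
```

For a fixed residual, increasing the number of active functions lowers the adjusted R², which is how the criterion penalizes model complexity relative to the number of training patterns.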
IV Experimental Results
This section presents a comparison of our proposed method against conventional GP and the KP results shown in the literature [14] on symbolic regression benchmark functions [33, 34]. The experiments demonstrate that our method can outperform them, providing high-quality solutions on both training and testing sets.
IV-A Benchmarks
The benchmarks in Table I [33] were chosen for the first experiment and those in Table II [34] for the second experiment, in order to compare our method with the results presented in [14]. Following [14], U[a, b, c] denotes c points randomly sampled from the uniform distribution confined to the range [a, b], and E[a, b, c] denotes points confined to the range [a, b] taken at successive equal intervals of c.
IV-B Configuration
The Distributed Evolutionary Algorithms in Python (DEAP) [35] library, under Python 3.6.4, is used to implement the GP-related parts of our method, namely the population generation based on genetic operators. The Python library for symbolic mathematics (SymPy) is used as the computational algebra library, and the Python library for numeric computations (NumPy) is used for the numerical calculations in RVM and for the evaluation of functions. Table III shows the configuration and parameter settings for the experiments, and Table IV shows the configuration and settings for the benchmarks. Missing terms of Table IV can be found in Table III.
Function  Training data  Testing data 

E[-1,1,0.1]  E[-1,1,0.001]  
E[-2,2,0.1]  E[-2,2,0.001]  
E[-3,3,0.1]  E[-3,3,0.001]  
E[1,50,1]  E[1,120,1]  
E[1,100,1]  E[1,100,0.1]  
E[1,100,1]  E[1,100,0.1]  
E[1,100,1]  E[1,100,0.1] 
Function  Train / Test 

U[-1,1,20]  
U[-1,1,20]  
U[-1,1,20]  
U[-1,1,20]  
U[-1,1,20]  
U[-1,1,20]  
U[0,2,20]  
U[0,4,40]  
U[-1,1,100]  
U[-1,1,100]
For comparison, the proposed method and the other methods from the KP study [14] were configured as shown in Table III. The configurations of KP, GP50, and GP500 are referenced from [14]. The maximum numbers of generations are set so as to balance the total number of individuals used among all methods. In the first experiment, each run terminates when the maximum generation is reached or the current global solution has fitness higher than 0.99999. In the second experiment, the termination condition is reaching the maximum number of node evaluations; a run is counted as a failure if, for any one of the training IO cases, the absolute error is greater than 0.01. Our method is executed 50 times per benchmark in the first experiment and 100 times in the second.
Parameter  Value 

GP-RVM / KP / GP50 / GP500  
Population size  8/ 8/ 50/ 500 
Max. generations  2000/ 2000/ 500/ 50 
Crossover probability  1.0/ 1.0/ 0.9/ 0.9 
Crossover operator  Onepoint 
Mutation probability  1.0/ 1.0/ 0.05/ 0.05 
Mutation operator  90% Uniform & 10% ERC/ 
Uniform/ Uniform  
Max. depth  15/ 2/ 15/ 15 
Nonterminals  
Terminals  
Fitness  
Stopping criteria  Max. gen. or fitness>0.99999 
Runs  50/ 50/ 50/ 50 
Parameter  Value 

GP-RVM / KP / GP50 / GP500  
Population size  4/ 8/ 50/ 500 
Max. node eval.  2000 
Nonterminals  
Terminals  ( to , 
and  
Runs  100/ 100/ 100/ 100 
IV-C Results
The first experiment, on seven benchmarks, compares model quality and computational complexity. The training results are shown in Table V, and the testing results in Table VI. The number of function evaluations (NFEs) is reported for the comparison of computational complexity; the statistics are collected over 50 trials. The second experiment, on ten benchmark functions, uses the same interval for training and testing, but the sets of points are distinct. The objective is to make the absolute error of each training IO pair smaller than 0.01 within a finite number of node evaluations. The results of our method are shown in Table VII, and the comparison with other methods is referenced and shown in Figure 2; the statistics are collected over 100 trials.
F  Stat.  GP-RVM  KP  GP500  

Adj. R²  RMSE  NFEs  Adj. R²  RMSE  NFEs  Adj. R²  RMSE  NFEs  
Min  1.00E+00  8.18E-05  532  1.00E+00  1.93E-06  328  4.12E-01  8.00E-02  26700  
Med.  1.00E+00  4.14E-05  5311  1.00E+00  2.21E-04  9492  4.14E-01  8.97E-02  30450  
Max  1.00E+00  2.06E-05  11214  1.00E+00  2.28E-02  32260  5.34E-01  8.99E-02  32170  
Min  9.95E-01  8.49E-04  1052  1.25E-01  5.42E-04  2880  5.50E-02  2.17E-01  28460  
Med.  9.98E-01  9.04E-03  6654  8.99E-01  6.75E-02  32190  1.39E-01  2.23E-01  30550  
Max  1.00E+00  1.09E-02  16588  1.00E+00  2.19E-01  32380  1.84E-01  2.34E-01  32310  
Min  9.56E-01  3.89E-02  11130  1.26E-01  9.98E-02  32100  3.26E-02  3.35E-01  28300  
Med.  9.81E-01  9.75E-02  14811  3.83E-01  2.65E-01  32170  7.05E-02  3.51E-01  30300  
Max  9.92E-01  2.42E-01  17870  9.13E-01  3.36E-01  32260  1.53E-01  3.58E-01  31950  
Min  1.00E+00  4.26E-04  7  1.00E+00  6.99E-16  56  1.00E+00  9.90E-01  501  
Med.  1.00E+00  4.26E-04  19  1.00E+00  9.22E-16  56  1.00E+00  9.90E-01  501  
Max  1.00E+00  1.27E-04  104  1.00E+00  1.18E-14  14510  1.00E+00  9.90E-01  501  
Min  1.00E+00  1.32E-03  7  1.00E+00  2.96E-06  56  9.96E-01  1.05E-01  28590  
Med.  1.00E+00  2.26E-03  8  1.00E+00  4.61E-04  80  9.98E-01  1.81E-01  30660  
Max  1.00E+00  6.83E-03  43  1.00E+00  1.78E-02  224  9.99E-01  2.51E-01  32480  
Min  1.00E+00  1.03E-06  7  1.00E+00  7.91E-16  56  1.00E+00  0.00E+00  501  
Med.  1.00E+00  5.61E-05  7  1.00E+00  3.50E-15  56  1.00E+00  0.00E+00  501  
Max  1.00E+00  6.08E-03  38  1.00E+00  1.34E-02  160  1.00E+00  0.00E+00  501  
Min  1.00E+00  8.08E-04  7  1.00E+00  9.77E-06  56  9.69E-01  1.12E-01  25500  
Med.  1.00E+00  5.41E-03  7  1.00E+00  1.59E-03  88  9.99E-01  1.35E-01  30300  
Max  1.00E+00  9.19E-03  22  1.00E+00  2.91E-02  248  9.99E-01  7.70E-01  32050 
Func.  Stat.  GP-RVM  KP  GP500 
min  7.10E-03  1.07E-05  7.88E-02  
median  2.56E-02  4.36E-04  8.20E-02  
max  1.00E-02  5.74E+06  1.30E+00  
min  9.77E-03  6.86E-04  2.09E-01  
median  6.76E-02  6.85E-02  2.22E-01  
max  1.57E+00  8.23E+00  1.00E+00  
min  1.10E-02  9.67E-02  3.41E-01  
median  5.02E-01  2.07E+00  3.53E-01  
max  1.30E+01  9.98E+01  2.71E+00  
min  6.87E-03  5.49E-16  9.96E-01  
median  9.71E-03  8.81E-16  9.96E-01  
max  2.22E-02  8.03E-14  9.96E-01  
min  1.59E-03  3.89E-04  1.06E-01  
median  1.83E-03  1.27E-02  1.71E-01  
max  4.42E-02  4.49E+02  5.50E+08  
min  1.03E-06  1.23E-15  0  
median  1.78E-04  6.08E-15  0  
max  2.68E-03  1.63E+00  0  
min  8.18E-04  4.74E-03  1.39E-01  
median  4.55E-03  1.03E-01  1.51E-01  
max  9.64E-03  3.82E+09 
Problem  Max. Error  Raw Fitness  RMSE Training  Func. Eval.  Node Eval.  RMSE Testing  Succ. Runs 

0.00283  0.01823  0.00112  21.14  20.63  0.05112  100  
0.00479  0.02357  0.00135  32.63  27.56  0.10625  100  
0.00399  0.02711  0.00172  45.73  36.68  0.83062  100  
0.00406  0.02746  0.00174  33.41  54.32  0.49868  100  
0.00057  0.00396  0.00025  14.39  26.62  0.36413  100  
0.00264  0.0188  0.00118  18.17  29.14  0.099  100  
0.0022  0.01473  0.00093  8.07  14.04  0.28685  100  
0.00145  0.00948  0.0006  17.26  29.66  0.1638  100  
0.00352  0.09064  0.00114  75.29  177.51  0.70064  100  
0.00442  0.10199  0.00133  129.54  311.68  0.20869  100 
IV-D Discussion
The first experiment verifies whether our method can outperform regular GP and its predecessor, KP. For the seven functions, the training results show that our approach achieved the highest values of fitness for two functions (functions 2 and 3), with five ties with KP (functions 1, 6, 7, 8, 9) and two ties with GP (functions 6 and 8). However, the RMSE value of our method is larger in the tied cases. Our method found high-quality models (Adj. R² fitness > 0.99) for six functions, whereas KP found five, and GP (50 and 500) found models of such quality for only four functions. From these results, we learn that our method performs as fast as KP and GP on simple problems and performs better when dealing with complex functions (functions 2 and 3), where KP can only find models of low or poor quality and GP can only find poor results. For these functions, our method requires far fewer NFEs than the other methods. It should be noted that, when dealing with easy targets (functions 6, 7, 8, 9), the models found by our method are sparse and rapidly meet the stopping criterion (fitness > 0.99999), which explains why the RMSE of our method is larger in those cases: the criterion rewards sparse models. Concerning the testing results, we see that the min and median RMSE of our method are somewhat larger than KP's (functions 1, 6, 7, 8, 9). However, the maximum errors of GP-RVM are far smaller than KP's on functions 1, 2, 3, 7, 8, 9, which shows that the results of our method are more stable than the others, in other words, more robust on the testing set. In practice, robustness on the testing set is much more important than a small difference in error.
The second experiment tests curve-fitting tasks on benchmark functions where the maximum absolute error on any of the fitness cases must be lower than 0.01. Table VII shows the averaged results of our method. Like KP, GPRVM succeeds in fitting the target curves, as the training and testing RMSE remain low for every test. Regarding efficiency, the numbers of objective-function and node evaluations also remain low for every test, which means that our method finds the correct result faster. Part of the reason is the cache implementation, but more importantly, the sparse kernel method employed by our method results in fast convergence. Figure 2 compares the results of our method with KP and other techniques on the same curve-fitting problem. The performance of our method is as accurate as that of KP, both achieving 100% successful runs; both GPRVM and KP outperform the other methods.
V Conclusion
V-A Summary
In this paper we focus on symbolic regression (SR), which uses symbols of functions and variables to construct solutions. The expanded search space and the reduced need for prior knowledge are the main reasons for choosing SR over other machine learning models. Genetic Programming (GP), a population-based evolutionary algorithm, is a representative method for performing SR. Limitations of GP, such as its blind search, have motivated many works to address them. Kaizen Programming (KP) is a successful example, combining linear regression modeling with a hill-climbing approach to guide the search. Despite its success, problems such as threshold setting and singularity arise. We found analytically that the singularity problem can be solved by generalizing the linear model of KP. Moreover, if we take a Bayesian treatment to estimate the generalized model, we can avoid using hypothesis testing to perform feature selection, solving the threshold-setting problem at the same time. The generalized model, found in the literature, is known as the relevance vector machine (RVM). Motivated by these analytical results, in this work we extended the feature selection module of KP based on RVM. The proposed method works without performing hypothesis tests, thus requiring no prior knowledge to set the threshold, and it handles singularity automatically. We implemented the sequential sparse Bayesian learning algorithm to accelerate the training of RVM. The proposed method was compared to KP and GP on several benchmark functions. Results showed that it outperforms the other methods, providing more accurate models with fewer function evaluations.
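The sparse Bayesian selection step that replaces KP's hypothesis testing can be illustrated with a minimal sketch. Note that this is a simplified batch re-estimation of RVM hyperparameters (Tipping, 2001) with a fixed noise precision, not the sequential algorithm used in the paper; the function name `rvm_fit` and the pruning threshold are illustrative assumptions.

```python
import numpy as np

def rvm_fit(Phi, t, beta=100.0, n_iter=200, prune=1e6):
    """Simplified batch RVM regression (illustrative, not the paper's
    sequential algorithm).  Phi is the N x M design matrix of candidate
    basis functions; beta is an assumed fixed noise precision.
    Returns posterior mean weights (zero for pruned functions) and a
    boolean mask of the basis functions that remain relevant."""
    N, M = Phi.shape
    alpha = np.ones(M)                 # one precision hyperparameter per basis fn
    active = np.ones(M, dtype=bool)
    mu_full = np.zeros(M)
    for _ in range(n_iter):
        P = Phi[:, active]
        A = np.diag(alpha[active])
        Sigma = np.linalg.inv(A + beta * P.T @ P)     # posterior covariance
        mu = beta * Sigma @ P.T @ t                   # posterior mean weights
        gamma = 1.0 - alpha[active] * np.diag(Sigma)  # "well-determinedness"
        alpha[active] = gamma / (mu ** 2 + 1e-12)     # re-estimate precisions
        # prune basis functions whose precision diverges (weight -> 0)
        keep = alpha[active] < prune
        idx = np.where(active)[0]
        active[idx[~keep]] = False
        mu_full[:] = 0.0
        mu_full[idx[keep]] = mu[keep]
    return mu_full, active
```

Irrelevant basis functions see their precision hyperparameter diverge, so their weights are driven to zero and they are dropped from the model, which is exactly the mechanism that removes the need for a user-set significance threshold.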
V-B Future Work
Although GPRVM is more efficient than previous methods, its speed is relatively low for large datasets. Since training involves a nonconvex optimization problem, for a model with M feature functions RVM requires the inversion of an M x M matrix, which in general has O(M^3) computational complexity. During the execution of our method, each generation requires at least one run of the sequential training algorithm to return a model. To increase model-building speed, the authors are pursuing a data-parallel approach that runs several learner modules on smaller portions of large datasets. The dataset will be randomly partitioned into subsets, and each subset will be trained simultaneously. This reduces the cost of executing each individual learner and shortens the execution time of model building. The predicted models will then be combined with fusion techniques.
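The data-parallel idea above can be sketched as follows. This is only an illustration of the partition-train-fuse structure: the `fit_ls` helper uses ordinary least squares as a stand-in for the RVM learner, and averaging the weights is one simple fusion choice among several possible techniques.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fit_ls(X, y):
    # Stand-in learner: least squares instead of full RVM training.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def parallel_ensemble(X, y, n_subsets=4, seed=0):
    """Randomly partition the data into subsets, train one learner per
    subset in parallel, and fuse the models by averaging their weights."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    parts = np.array_split(idx, n_subsets)
    with ThreadPoolExecutor() as ex:
        weights = list(ex.map(lambda p: fit_ls(X[p], y[p]), parts))
    return np.mean(weights, axis=0)   # simple fusion: average the models
```

Each learner now inverts a matrix built from N/k patterns instead of N, so the per-learner cost drops while the subsets train concurrently.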
GPRVM can be extended to other machine learning tasks, such as classification. There have been several successful applications of KP, such as [23], where the authors used KP to predict high-performance concrete compressive strength. In [1, 35], the authors generalized KP to classification tasks. Since RVM is a probabilistic model whose framework extends to different kinds of machine learning tasks, it is theoretically practicable to generalize our method to overcome the limitations of the statistically based KP in classification tasks.
References
 [1] de Melo, V.V. and Banzhaf, W. 2016. Kaizen programming for feature construction for classification. In Genetic Programming Theory and Practice XIII, pp. 39-57. Springer.
 [2] Peng, Y., Yuan, C., Qin, X., Huang, J. and Shi, Y. 2014. An improved gene expression programming approach for symbolic regression problems. Neurocomputing, 137, pp. 293-301.
 [3] Tang, F., et al. 2015. Discovery scientific laws by hybrid evolutionary model. Neurocomputing, 148, pp. 143-149.
 [4] Icke, I. and Bongard, J.C. 2013, June. Improving genetic programming based symbolic regression using deterministic machine learning. In IEEE Congress on Evolutionary Computation, pp. 1763-1770.
 [5] Cristianini, N. and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
 [6] Shafaei, M. and Kisi, O. 2017. Predicting river daily flow using wavelet-artificial neural networks based on regression analyses in comparison with artificial neural networks and support vector machine models. Neural Computing and Applications, 28(1), pp. 15-28.
 [7] Deklel, A.K., Saleh, M.A., Hamdy, A.M. and Saad, E.M. 2017, March. Transfer learning with long term artificial neural network memory (LTANN-MEM) and neural symbolization algorithm (NSA) for solving high dimensional multi-objective symbolic regression problems. In Radio Science Conference, 2017 34th National, pp. 343-352. IEEE.
 [8] Austel, V., et al. 2017. Globally optimal symbolic regression. arXiv preprint arXiv:1710.10720.
 [9] Schmidt, M. and Lipson, H. 2009. Distilling free-form natural laws from experimental data. Science, 324(5923), pp. 81-85.
 [10] Schmidt, M. and Lipson, H. 2013. Eureqa software (version 0.98 beta). Nutonian, Somerville, Mass., USA.
 [11] Schmidt, M. and Lipson, H. 2013. Symbolic regression of implicit equations. In Genetic Programming Theory and Practice VII, Genetic and Evolutionary Computation. Springer, Boston, MA.
 [12] Koza, J.R. 1994. Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2), pp. 87-112.
 [13] Banzhaf, W., Nordin, P., Keller, R.E. and Francone, F.D. 1998. Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Applications. Morgan Kaufmann Publishers, Inc., Heidelberg and San Francisco, CA.
 [14] de Melo, V.V. 2014, July. Kaizen programming. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pp. 895-902. ACM.
 [15] Tipping, M.E. 2001. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1(Jun), pp. 211-244.
 [16] Fulcher, J. 2008. Computational intelligence: an introduction. In Computational Intelligence: A Compendium, pp. 3-78. Springer, Berlin, Heidelberg.
 [17] Harvey, D.Y. and Todd, M.D. 2015. Automated feature design for numeric sequence classification by genetic programming. IEEE Transactions on Evolutionary Computation, 19(4), pp. 474-489.
 [18] Chatterjee, A. and Rong, M. 2018. Efficiency analysis of genetic algorithm and genetic programming in data mining and image processing. In Computer Vision: Concepts, Methodologies, Tools, and Applications, p. 246.
 [19] Tipping, M.E. and Faul, A.C. 2003. Fast marginal likelihood maximisation for sparse Bayesian models. AISTATS.
 [20] Model, R.R., Square Adjusted R Square Std. Error of the Estimate, 1.
 [21] Stijven, S., Vladislavleva, E., Kordon, A., Willem, L. and Kotanchek, M.E. 2016. Prime-time: symbolic regression takes its place in the real world. In Genetic Programming Theory and Practice XIII, pp. 241-260. Springer, Cham.
 [22] Bautu, E., Bautu, A. and Luchian, H. 2005. Symbolic regression on noisy data with genetic and gene expression programming. In Seventh International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC 2005).
 [23] de Melo, V.V. and Banzhaf, W. 2015. Predicting high-performance concrete compressive strength using features constructed by Kaizen programming. In Intelligent Systems, pp. 80-85.
 [24] de Melo, V.V. and Banzhaf, W. 2017. Improving the prediction of material properties of concrete using Kaizen Programming with Simulated Annealing. Neurocomputing, 246, pp. 25-44.
 [25] de Melo, V.V. 2016. Breast cancer detection with logistic regression improved by features constructed by Kaizen programming in a hybrid approach. In IEEE Congress on Evolutionary Computation, pp. 16-23.
 [26] Brameier, M.F. and Banzhaf, W. 2007. Linear Genetic Programming. Springer Science & Business Media.
 [27] Moraglio, A., Krawiec, K. and Johnson, C.G. 2012. Geometric semantic genetic programming. In International Conference on Parallel Problem Solving from Nature, pp. 21-31. Springer, Berlin, Heidelberg.
 [28] Arnaldo, I., Krawiec, K. and O'Reilly, U.M. 2014. Multiple regression genetic programming. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pp. 879-886.
 [29] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. 2004. Least angle regression. The Annals of Statistics, 32(2), pp. 407-499.
 [30] Moraglio, A., Krawiec, K. and Johnson, C.G. 2012. Geometric semantic genetic programming. In International Conference on Parallel Problem Solving from Nature, pp. 21-31. Springer, Berlin, Heidelberg.
 [31] Castelli, M., Trujillo, L., Vanneschi, L. and Silva, S. 2015. Geometric semantic genetic programming with local search. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 999-1006.
 [32] Bishop, C.M. 2009 (8th printing). Pattern Recognition and Machine Learning. Springer-Verlag New York.
 [33] McDermott, J., et al. 2012. Genetic programming needs better benchmarks. In Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation, pp. 791-798.
 [34] Karaboga, D., Ozturk, C., Karaboga, N. and Gorkemli, B. 2012. Artificial bee colony programming for symbolic regression. Information Sciences, 209, pp. 1-15.
 [35] de Melo, V.V. and Banzhaf, W. 2016. Improving logistic regression classification of credit approval with features constructed by Kaizen programming. In Genetic and Evolutionary Computation Conference Companion, pp. 61-62.
 [36] dal Piccol Sotto, L.F. and de Melo, V.V. 2016. Studying bloat control and maintenance of effective code in linear genetic programming for symbolic regression. Neurocomputing, 180, pp. 79-93.
 [37] Sotto, L.F., Coelho, R.C. and de Melo, V.V. 2016, July. Classification of cardiac arrhythmia by random forests with features constructed by Kaizen programming with linear genetic programming. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 813-820. ACM.