GP-RVM: Genetic Programming-based Symbolic Regression Using Relevance Vector Machine


Hossein Izadi Rad ICE, IST
The University of Tokyo
Tokyo, Japan
izad@iba.t.u-tokyo.ac.jp
   Ji Feng ICE, IST
The University of Tokyo
Tokyo, Japan
feng@iba.t.u-tokyo.ac.jp
   Hitoshi Iba ICE, IST
The University of Tokyo
Tokyo, Japan
iba@iba.t.u-tokyo.ac.jp
Abstract

This paper proposes a hybrid basis function construction method (GP-RVM) for the symbolic regression (SR) problem, which combines an extended version of Genetic Programming called Kaizen Programming with the Relevance Vector Machine to evolve an optimal set of basis functions. Unlike traditional evolutionary algorithms, where a single individual is a complete solution, our method builds a solution as a linear combination of basis functions constructed from individuals during the evolutionary process. RVM, a sparse Bayesian kernel method, selects suitable functions to constitute the basis, determining the posterior weight of each function by evaluating its quality and sparsity. The solution produced by GP-RVM is a sparse Bayesian linear model over the coefficients of many non-linear functions. Our hybrid approach focuses on nonlinear white-box models, selecting the right combination of functions to build robust predictions without prior knowledge about the data. Experimental results show that GP-RVM outperforms conventional methods, which suggests that it is an efficient and accurate technique for solving SR. The computational complexity of GP-RVM scales as O(M^3), where M is the number of functions in the basis set and is typically much smaller than the number of training patterns.

Evolutionary Computing, Machine learning, Genetic Programming, Symbolic Regression, Relevance Vector Machine, White-box Optimization, Kaizen Programming

I Introduction

In this paper, we address a learning task called symbolic regression (SR) [1-3]. Suppose that there is an unknown real-valued function f whose input can be viewed as a d-dimensional vector x = (x_1, ..., x_d), and we are given a set of N observed input-output pairs {(x_n, t_n)}, n = 1, ..., N, where t_n is the observed output corresponding to x_n. Regression is defined as a set of statistical processes for approximating the relationships between input-output pairs. Symbolic regression is performed by methods which minimize various error metrics while searching the space of mathematical expressions to estimate the accurate and simple model that best fits the observed dataset [12]. Conventional learning methods such as Artificial Neural Networks (ANNs) [5,6] and Support Vector Machines (SVMs) [7] have been widely used for solving regression problems. The solutions given by these techniques are known as black boxes, which means the mechanisms behind them are difficult to understand and analyze. SR, on the contrary, tries to produce white-box models that are easy to interpret. The goal of an SR exercise is to determine which of the inputs are the most effective in predicting the outputs and to identify the input-output relationship [21]. When solving SR, not only are the unknown coefficients optimized, but the exact formulae that explain the data are also determined [8-10]. Linear and nonlinear regression methods try to fit parameters to a given form of an equation. SR methods, however, search both the parameter space and the space of equation forms at the same time, aiming to express solutions as mathematical equations in an easy-to-interpret format. The main reason for choosing SR over other linear and nonlinear regression methods is its interpretability. Such a formula may not be discovered by other machine learning methods, since their primary target is coefficient optimization.

SR is one of the real-world applications of Genetic Programming (GP) [12, 13, 21, 22]. Inspired by Darwin's theory of evolution, GP tries to solve optimization problems by simulating the evolutionary procedure to evolve computer programs that perform well on a given task. GP is an evolutionary algorithm in which many programs are represented as a cluster of genes and then evolved through an evolutionary mechanism. It starts by randomly initializing a population of programs. The population is then repeatedly updated under fitness-based selection, by applying genetic operators, until the desired solution is found. GP has been utilized to tackle various problems in different disciplines such as decision making [16-18], pattern recognition, robotic networks [16-18], bioinformatics, data mining, and finance [16-18]. Despite its successful application in several domains, there are some known issues with GP and related algorithms. For instance, in practice, conventional GP suffers severely from problems such as bloat, empirical hyper-parameter tuning, slow speed (due to large population sizes), and low success rates [16-18].

  • Why Kaizen Programming?

A controversial issue of GP is its non-determinism under random search, where the only guide is the pressure on individuals towards better fitness. The algorithm cannot ensure that the next iteration will improve, or even avoid deteriorating, because the solutions are modified at random. Motivated by the goal of developing an SR method which requires a relatively lower number of individuals (to accelerate the search) while still maintaining the quality and interpretability of the solutions, de Melo proposed Kaizen Programming (KP) [14] in 2014. KP is a cooperative co-evolutionary method which performs automatic feature engineering by using a statistical approach to build models from features generated by GP. KP builds linear models over GP individuals to obtain deterministic results. By continuously updating a set of best regressors, KP ensures the improvement of solution quality. To improve the solution quality, KP employs feature selection via null hypothesis significance testing (NHST) based on the p-value. KP has an advantage over other similar methods because it relies not only on random evolution but also on a more deterministic and efficient approach to build the model. Rather than focusing on finding a single complete solution, KP concentrates on how important each part of the solution is and guides the search by dividing the entire task into small ones to achieve high efficiency. The first significant limitation of KP is the introduction of NHST, which raises several controversial issues; for example, the proper selection of a threshold on the significance level ultimately demands prior knowledge. The second limitation is that the employment of the linear regression model raises the singularity problem. Singularity happens when duplicated features are used to build the model, and it causes the data variance matrix to become non-invertible.

Motivated by these analytical results, in this paper we extend the feature selection module of KP based on RVM. The proposed method works without performing a hypothesis test, thus requiring no prior knowledge to set a threshold. It also deals with singularity automatically. We implemented the sequential sparse Bayesian learning algorithm to accelerate the training speed of RVM. The proposed method is compared to KP and GP on several benchmark functions. Results show that GP-RVM outperforms the other methods, providing more accurate models with fewer function evaluations. The remainder of the paper is organized as follows. Section II presents an overview of recent related work on solving SR tasks and the basic Kaizen Programming workflow. Section III describes the method proposed in this work. Experimental results and related topics are discussed in Section IV. Finally, Section V summarizes this paper and covers future work. For detailed information about the application of Relevance Vector Machines (RVM) in our work, see [32], Chapter 7.

II Previous Works

II-A Kaizen Programming

Kaizen Programming is a hybrid method for solving SR based on the Kaizen event concept [14] with the Plan-Do-Check-Act (PDCA) methodology. In the real world, a Kaizen event is an event where experts propose their ideas and test them to tackle a business issue. Many ideas are then combined to form a complete solution to the issue, which is known as the standard. During the Kaizen process, the PDCA methodology is employed to improve the quality of the current standard, where adjustments are planned, executed, checked, and acted upon. The PDCA cycle is iterated until the business issue is solved. Since the contribution of each action at each period towards resolving the issue is analyzed and evaluated based on specific criteria, more knowledge on the topic is acquired. As a result, the experts learn from the improved information, avoid useless or harmful actions, and control the search process, thus resulting in a better solution. KP implements this workflow with three modules:

  • Feature generation module, which is to propose ideas in the shape of regression features.

  • Feature selection module, which is to evaluate and select contributive features.

  • Model generation module, which is to create a complete solution using generated features.

For solving a regression problem, an idea can be a mathematical expression, which is equivalent to a feature used in a linear regression model. In the feature selection module, new ideas are evaluated to generate actual partial solutions, each of which can be treated as an N-dimensional vector, where N is the number of training data points. These partial solutions are combined with old ones existing in the current standard, resulting in a final complete solution composed of partial solutions. Each partial solution is then evaluated to indicate its contribution to solving the problem. Because of the relationships among those partial solutions, a measurement that considers not only the independent quality of each partial solution but also their interdependencies is used in this step. The major difference between KP and conventional evolutionary algorithms is that the latter work with individuals where each one is a complete solution. In KP, by contrast, individuals collaborate with each other, so the population can be smaller than in conventional evolutionary algorithms, where individuals must compete with each other to survive.

In the model generation module, the complete solution is generated based on the results of the selection module. For example, after the most significant partial solutions have been selected, their coefficients are fitted to the data by Maximum Likelihood Estimation (MLE), which is equivalent to minimizing the Root Mean Square Error (RMSE). The quality of the complete solution also needs to be evaluated; one reason is to detect whether the algorithm is stuck in a local optimum.
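As an illustration of this fitting step (not code from [14]), the following NumPy sketch fits the coefficients of a few hypothetical partial solutions by ordinary least squares, which coincides with MLE under Gaussian noise and with minimizing the RMSE; the features and data below are purely illustrative.

```python
import numpy as np

# Illustrative training data: N points from an unknown target plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
t = np.sin(x) + x ** 2 + rng.normal(scale=0.05, size=x.size)

# Partial solutions (features) evaluated on the inputs: each column is one feature.
Phi = np.column_stack([np.ones_like(x), np.sin(x), x ** 2])

# MLE under Gaussian noise = ordinary least squares over the linear coefficients,
# i.e. the coefficients that minimize the RMSE of the linear combination.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

rmse = np.sqrt(np.mean((t - Phi @ w) ** 2))
print("coefficients:", w, "training RMSE:", rmse)
```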

II-A1 KP: The Advantages and Disadvantages

The PDCA cycle plays the role of encouraging experts to propose ideas which are meaningful and useful for solving the problem. This is critical because the criterion for evaluating contributions takes the place of a fitness measure over the entire solution. Therefore, the methods employed in the second and third modules should be logically connected to each other; for instance, the same criterion can be used both to choose the features and to construct the model, or, more precisely, the method used to solve the problem and the method employed to evaluate the contributions of the partial solutions should be equivalent. If the feature selection module and the model generation module were unrelated, then ideas meaningful to the procedure employed in the feature selection module would not be useful to the method that finally solves the problem. KP guides the search process towards the goal based on continuously increasing knowledge of the problem. As a result, machine learning techniques or statistical approaches must be employed efficiently within the hybrid structure of KP. Without such techniques, KP performance reduces to that of a conventional EC algorithm, where the search depends only on the selection process. We believe that KP has several limitations requiring further improvement. Firstly, when MLE is used to solve the linear regression problem, the singularity issue of the data variance matrix arises and must be treated appropriately. The singularity is caused by similar or identical features used to construct the model, and it makes the parameters difficult to estimate. Secondly, the hypothesis test requires a properly adjusted value for the significance level. If the threshold is too high, the complexity of the solution increases excessively, and the redundancy in the solution hinders our ability to acquire inner knowledge of it. On the other hand, an overly low threshold makes it hard to find a good approximation to the target. Although, in the first case, we could control the excessive growth of complexity by limiting the number of features used in the standard, it is still problematic to select a proper limit given that we have no prior knowledge of the problem.

II-B Other Genetic Programming-based Methods for Symbolic Regression

II-B1 Multiple Regression Genetic Programming

Arnaldo et al. [28] proposed Multiple Regression Genetic Programming (MRGP), a method which overcomes the limitations of conventional GP by eliminating the direct comparison between the output of a complete individual and the target values of the training data. The motivation is that in GP the fitness function is not applied directly to the genotype of individuals but to the complete phenotype, which results in the lack of a clear and strong focus on advancing the evolution of basic structures. In their experiments, MRGP consistently generated solutions fitter than the results of GP and linear regression.

II-B2 Geometric Semantic Genetic Programming (GSGP)

GSGP [31] concentrates on introducing semantic awareness into the genetic operators, meaning awareness of the actual outcome after the operators are applied to individuals. To do so, the authors formally defined the semantic space as an N-dimensional Euclidean space, where N is the number of training I-O pairs. Each individual is encoded as a single point in this multidimensional metric space. They proposed several geometric semantic operators which directly search the semantic space, producing offspring that are guaranteed to be at least as fit as the parents. Geometric semantic crossover mixes two parents by constructing an offspring expressed as a (weighted) average of the parents, which is guaranteed to be at least as fit as the less fit of the parents. Geometric semantic mutation produces an offspring which is guaranteed to lie in an N-dimensional ball of pre-specified radius centered on the parent; therefore, it avoids large jumps in the semantic space. With geometric operators employed, the landscape of the fitness function is unimodal, which makes it easy to reach the global optimum. The primary advantage of GSGP is its relatively faster convergence compared to conventional GP. However, the trade-off is the exponentially increasing complexity of the individuals, which results in solutions that are theoretically difficult to analyze (black-box) and computationally expensive to evaluate.
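To illustrate the geometric property described above (our own toy example, not code from the GSGP authors), the semantics of an individual can be represented as the vector of its outputs on the N training inputs; a convex combination of two parents then yields offspring semantics lying on the segment between them, so the offspring's Euclidean error cannot exceed that of the worse parent.

```python
import numpy as np

# Semantics of two parent programs: their output vectors on N = 4 training inputs.
p1 = np.array([0.2, -1.0, 0.5, 1.3])      # parent 1 semantics (illustrative)
p2 = np.array([0.6,  0.4, 0.1, 0.9])      # parent 2 semantics (illustrative)
target = np.array([0.5, 0.0, 0.3, 1.0])   # target semantics (training outputs)

r = 0.4                                   # random crossover weight in [0, 1]
offspring = r * p1 + (1 - r) * p2         # geometric semantic crossover in semantic space

err = lambda s: np.linalg.norm(s - target)  # Euclidean (semantic) error
print(err(p1), err(p2), err(offspring))     # offspring error <= max(parent errors)
```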

II-C Relevance Vector Machine for Symbolic Regression

The Support Vector Machine (SVM) has been widely used in classification and regression applications. However, machine learning techniques based on support vector machines have some limitations, several of which are explained in [32], Chapter 7. The output of an SVM represents decisions rather than posterior probabilities, and its predictions are expressed as linear combinations of kernel functions centered on the training data. The Relevance Vector Machine (RVM) [15] is a sparse Bayesian kernel method with applications in classification and regression. RVM has many qualities similar to SVM, but RVM-based solutions avoid basic limitations of SVM while resulting in much sparser models; consequently, RVM often achieves higher performance on test data. Using RVM for regression, we seek a linear model which results in sparse solutions [32] (Chapter 3). Given a training set of input-target pairs {(x_n, t_n)}, n = 1, ..., N, and considering scalar-valued target functions only, we follow the standard probabilistic formulation and assume that the targets are samples from the model with additive noise:

t_n = y(x_n; w) + ε_n,   where   y(x; w) = Σ_{m=1}^{M} w_m φ_m(x)   (1)

where the ε_n are independent samples from a noise process assumed to be zero-mean Gaussian with variance σ² = β⁻¹ [32] (Chapter 7.2), and the φ_m(x) are fixed nonlinear basis functions. Thus the model defines a conditional distribution for the real-valued target variable t, given an input vector x, which takes the form:

p(t | x, w, β) = N(t | y(x; w), β⁻¹)   (2)

Assuming the targets t_n are drawn independently, the likelihood of the complete dataset can be written as:

p(t | X, w, β) = Π_{n=1}^{N} N(t_n | y(x_n; w), β⁻¹)   (3)

Here, RVM adopts a Bayesian perspective and constrains the parameters by introducing a separate hyper-parameter α_m for each of the weight parameters w_m, instead of a single shared hyper-parameter [32] (Chapter 7.2). Thus the weight prior takes the form:

p(w | α) = Π_{m=1}^{M} N(w_m | 0, α_m⁻¹)   (4)

where α = (α_1, ..., α_M)ᵀ and α_m stands for the precision of the corresponding weight parameter w_m. The introduction of an individual hyper-parameter for every weight is the key feature of the model. For many of the hyper-parameters, optimization drives them to large values, and the corresponding weight parameters then have posterior distributions concentrated at zero. Consequently, the basis functions associated with these parameters do not contribute to the model predictions, which results in a sparse model. The posterior over the weights is then obtained from Bayes' rule:

p(w | t, X, α, β) = N(w | m, Σ)   (5)

with the posterior mean:

m = β Σ Φᵀ t   (6)

and posterior covariance:

Σ = (A + β Φᵀ Φ)⁻¹,   A = diag(α_1, ..., α_M)   (7)

where the design matrix Φ is the N × M matrix with elements Φ_{nm} = φ_m(x_n) [32] (Chapter 7.2, Eq. 7.81). Training the relevance vector machine requires searching for the hyper-parameters that maximize the marginal likelihood function [32] (Chapter 7.2, Eq. 7.84):

p(t | X, α, β) = ∫ p(t | X, w, β) p(w | α) dw   (8)

Since this integration involves a convolution of two Gaussian functions, it can be evaluated by completing the square, giving the log marginal likelihood:

ln p(t | X, α, β) = −(1/2) { N ln(2π) + ln |C| + tᵀ C⁻¹ t }   (9)

where

C = β⁻¹ I + Φ A⁻¹ Φᵀ   (10)

We maximize (9) with respect to the variables α and β. Setting the derivatives of the marginal likelihood to zero, we obtain the re-estimation equations:

α_i^{new} = γ_i / m_i²   (11)
(β^{new})⁻¹ = ‖t − Φ m‖² / (N − Σ_i γ_i)   (12)

where m_i is the i-th element of the posterior mean defined by (6). [32] (Section 3.5.3) defines γ_i as a measure of how well the corresponding parameter w_i is determined by the data:

γ_i = 1 − α_i Σ_{ii}   (13)

where Σ_{ii} is the i-th diagonal element of the posterior covariance defined in (7).

The learning algorithm starts by initializing α and β, and then calculating the mean and covariance of the posterior using (6) and (7), respectively. It proceeds by repeatedly applying the re-estimation equations (11) and (12), together with updating the posterior statistics, until either the maximum number of iterations is reached or the changes in all of the precision parameters α_i fall below a lower bound [32] (Section 7.2.2). In practice, the optimization drives a proportion of the hyper-parameters towards large (theoretically infinite) values; as a result, the weight parameters related to those hyper-parameters have posterior distributions whose mean and variance are both zero. Accordingly, those parameters, as well as the corresponding basis functions, are eliminated from the model and play no role in making predictions for new inputs [15].
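A minimal NumPy sketch of this iterative re-estimation loop, written by us from equations (6), (7), and (11)-(13); the basis functions and data are illustrative, and a production implementation would prune large-α functions and use the faster sequential algorithm described in Section III-C.

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=200, alpha_init=1.0, beta_init=100.0, prune=1e9):
    """Type-II maximum likelihood for RVM hyper-parameters via Eqs. (6), (7), (11)-(13)."""
    N, M = Phi.shape
    alpha = np.full(M, alpha_init)      # one precision per weight
    beta = beta_init                    # noise precision
    for _ in range(n_iter):
        active = alpha < prune                           # functions still in the model
        Phi_a = Phi[:, active]
        A = np.diag(alpha[active])
        Sigma = np.linalg.inv(A + beta * Phi_a.T @ Phi_a)        # posterior covariance (7)
        m = beta * Sigma @ Phi_a.T @ t                           # posterior mean (6)
        gamma = 1.0 - alpha[active] * np.diag(Sigma)             # well-determinedness (13)
        alpha[active] = gamma / (m ** 2 + 1e-12)                 # re-estimate alpha (11)
        beta = (N - gamma.sum()) / np.sum((t - Phi_a @ m) ** 2)  # re-estimate beta (12)
    return alpha, beta, m, active

# Illustrative usage: recover a sparse combination from redundant candidate functions.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 60)
t = np.sin(np.pi * x) + 0.05 * rng.normal(size=x.size)
Phi = np.column_stack([np.ones_like(x), x, x ** 2, np.sin(np.pi * x), np.cos(x)])
alpha, beta, m, active = rvm_regression(Phi, t)
print("kept basis functions:", np.flatnonzero(active), "posterior mean weights:", m)
```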

III Proposed Method

III-A An Efficient Hybrid GP-based Approach Using Relevance Vector Machine

This section introduces a new method for SR, based on a sparse Bayesian kernel method, which integrates a GP-based search of tree structures and a Bayesian parameter estimation employing automatic relevance determination (ARD). To overcome GP's major difficulty, we extend the work on KP in [14] by improving the feature selection with ARD, using RVM. Because RVM requires no prior knowledge to set a threshold and deals with singularity automatically, the proposed method also overcomes the limitations of KP. Figure 1 shows the basic flow chart of the proposed method. We explain the three key steps, marked with a star in the figure, in the subsequent parts of this section. To overcome the limitations of KP, we first analytically introduce regularization terms over the data variance matrix to avoid the singularity. The regularized matrix can be considered as a posterior covariance matrix; in other words, the singularity problem can be avoided if we generalize the linear regression model into a Bayesian linear regression model. In that case, each linear coefficient has its own precision parameter, and the generalized model is equivalent to the well-known RVM [15]. Moreover, the generalized model no longer depends on hypothesis testing, and therefore does not require any prior knowledge to set a proper threshold for feature selection.

Considering these analytical results, in this paper we introduce a hybrid method for solving the SR problem based on RVM. RVM is widely used as a kernel technique, although there is no restriction on applying it to any model expressed as a linear combination of feature functions. In our proposed method, GP individuals are used for generating features. A preprocessing step is applied to extract all subtree functions from each tree-formed GP individual, to ensure the maximum usage of the provided resources. These functions are then used to build linear models together with the features existing in the current model. Different from the statistical methods used in KP, RVM is used to select proper features. A sequential training algorithm [19] is employed to perform the optimization task of RVM, which improves training speed significantly. The optimized results are finally checked under the fitness criterion to ensure the improvement of model quality and to avoid becoming stuck in a local optimum. We use the adjusted R-squared (Adj. R²) [20] rather than RMSE as the fitness criterion, both for model comparison and for reducing overfitting by controlling the complexity of the selected model.

Fig. 1: Flow chart of proposed method.

III-B Feature Generation from GP Individuals

A sufficient number of candidate features must be produced in the first place to ensure the quality of the regression model. In the experiments, we found that the variety of features strongly influences the performance of the proposed method; a lack of candidate features slows the growth of fitness and increases the risk of becoming stuck in a local optimum. As we no longer consider the individuals as solutions, the primary duty of the GP individuals is to generate feature functions. Empirically, we found that a proper feature function usually exists in a subtree of an individual. However, as the mapping from a tree to a function is complicated, the final behavior of the tree can be very different from that of its subtrees. If we only use the entire tree as one feature function, then all of its subtrees are wasted, and it is difficult for them to reappear in future generations. One approach is to increase the number of individuals, but that only produces more subtrees to be wasted. The proper solution is therefore to traverse every tree and extract all of its subtree functions. However, if duplicated functions are passed to the model, the training of RVM will repeatedly struggle to converge, or even fail, due to singularity. As a result, in practice, the Python computational algebra library (SymPy) is used to transform a primitive tree into a symbolic expression, which simplifies the function (thus reducing subsequent evaluation complexity) and allows checking its uniqueness. As an expression can appear frequently during the process, an LRU (Least Recently Used) cache is implemented to save the evaluation result of an evaluated expression.

In short, it takes two steps for a GP primitive tree to yield candidate functions. First, the entire tree is traversed to collect all of its subtree functions. Second, all the subtrees are transformed into symbolic expressions and filtered, resulting in only distinct functions.
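A minimal sketch of these two steps, written by us: it assumes a GP individual is available as an expression string, and uses SymPy's traversal and simplification plus an LRU cache in place of the preprocessing described above; all helper names are ours.

```python
import functools
import numpy as np
import sympy as sp

x = sp.Symbol('x')

def subtree_expressions(expr):
    """Traverse an expression tree and yield each distinct, simplified subtree as a feature."""
    seen = set()
    for sub in sp.preorder_traversal(expr):
        simplified = sp.simplify(sub)            # canonical form reduces duplicates
        if simplified.free_symbols and simplified not in seen:
            seen.add(simplified)
            yield simplified

@functools.lru_cache(maxsize=1024)
def evaluate(expr_str, xs_key):
    """Cache the evaluation of an expression string on the (hashable) training inputs."""
    f = sp.lambdify(x, sp.sympify(expr_str), 'numpy')
    return f(np.array(xs_key))

# Illustrative GP individual, given as an expression string.
individual = "sin(x) * (x + x**2) + cos(x)"
xs = tuple(np.linspace(-1, 1, 20))
features = [evaluate(str(e), xs) for e in subtree_expressions(sp.sympify(individual))]
print(len(features), "distinct candidate feature functions")
```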

III-C Sequential Sparse Bayesian Learning Algorithm

After the construction of the candidate functions, the next step is to pass them to RVM for training [32] [19]. In practice, we use the sequential (fast marginal likelihood) approach of [19] to solve the optimization problem, which improves training speed significantly. The problem is to determine the hyper-parameters α and β:

(α*, β*) = arg max_{α, β} ln p(t | X, α, β)   (14)
1. Initialize β; select a single candidate function φ_i, set its α_i to a finite value (e.g., as in [19]), and set all other α_m = ∞.
2. Evaluate the posterior mean and covariance, along with the sparsity factor s_m and quality factor q_m, for all basis functions.
3. Select a candidate function φ_i.
4. If q_i² > s_i and α_i < ∞, re-estimate α_i = s_i² / (q_i² − s_i).
5. If q_i² > s_i and α_i = ∞, include φ_i in the model and initialize its α_i.
6. If q_i² ≤ s_i and α_i < ∞, remove φ_i from the model and set α_i = ∞.
7. Update β. Repeat from step 2 until convergence.
Algorithm 1 Sequential Sparse Bayesian Learning Algorithm
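The following condensed sketch (our own, based on the sparsity/quality factors of the fast marginal-likelihood maximisation in [19]) shows the decision logic of one pass of Algorithm 1; the noise update and the incremental recomputation of the statistics after each change are omitted for brevity.

```python
import numpy as np

def sequential_rvm_pass(Phi, t, alpha, beta):
    """One add / re-estimate / delete pass over all candidate basis functions."""
    N, M = Phi.shape
    active = np.isfinite(alpha)                       # alpha_m = inf means "excluded"
    Phi_a, A = Phi[:, active], np.diag(alpha[active])
    # Marginal covariance C = beta^{-1} I + Phi_a A^{-1} Phi_a^T  (Eq. 10)
    C_inv = np.linalg.inv(np.eye(N) / beta + Phi_a @ np.linalg.inv(A) @ Phi_a.T)
    for i in range(M):
        phi = Phi[:, i]
        S, Q = phi @ C_inv @ phi, phi @ C_inv @ t     # "sparsity" and "quality" statistics
        if np.isfinite(alpha[i]):                     # function currently in the model
            s, q = alpha[i] * S / (alpha[i] - S), alpha[i] * Q / (alpha[i] - S)
        else:                                         # function currently excluded
            s, q = S, Q
        if q ** 2 > s:                                # contributes: add or re-estimate
            alpha[i] = s ** 2 / (q ** 2 - s)
        elif np.isfinite(alpha[i]):                   # no longer contributes: delete
            alpha[i] = np.inf
    return alpha
```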

III-D Model Selection

The principal disadvantage of RVM is that training involves optimizing a non-convex function. During the training phase, many distinct models with different numbers of active functions are generated. In general, a model with more active functions behaves better on the training set; however, the simpler one is preferred if the difference in training error between two models is small. The complexity of RVM scales with the number of active basis functions, and in the experiments we found that the speed of the proposed method is significantly influenced by the number of functions kept in the record. As a result, it is essential to compare the models produced in the training phase and select the most straightforward model among those which exceed a given measurement of quality. As in [14], the adjusted R² is used for model comparison, as it considers not only the fitness value but also the number of active functions relative to the amount of training data. The adjusted R² is calculated using the following formula:

Adj. R² = 1 − (1 − R²)(N − 1) / (N − p − 1)   (15)

where N is the number of training patterns, p is the number of active functions in the model, and R² is the coefficient of determination, which is given by:

R² = 1 − Σ_n (t_n − y_n)² / Σ_n (t_n − t̄)²   (16)

where t̄ is the mean of the targets t_n, and y_n is the n-th entry of the output vector of the model. In the proposed method, the Adj. R² is used as the criterion to select the best model from the converged models. Compared to RMSE, the advantage of Adj. R² is that it is scale-free with respect to the absolute target values of the problem, and the value of this statistic lies in the range (0, 1), where 1.0 means a perfect fit, 0.99 indicates a high-quality model, and 0.95 a medium-quality model.
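A small helper illustrating the computation of the criterion in (15) and (16) for an already-fitted model; the targets, outputs, and function counts below are illustrative.

```python
import numpy as np

def adjusted_r2(t, y, p):
    """Adjusted R^2 for N targets t, model outputs y, and p active basis functions."""
    N = len(t)
    ss_res = np.sum((t - y) ** 2)
    ss_tot = np.sum((t - np.mean(t)) ** 2)
    r2 = 1.0 - ss_res / ss_tot                        # Eq. (16)
    return 1.0 - (1.0 - r2) * (N - 1) / (N - p - 1)   # Eq. (15)

# Example: the same fit is penalized more when it uses more active functions.
t = np.array([0.0, 0.5, 1.1, 1.9, 3.2, 4.8, 6.9, 9.1])
y = np.array([0.1, 0.6, 1.0, 2.0, 3.1, 4.9, 7.0, 9.0])
print(adjusted_r2(t, y, p=2), adjusted_r2(t, y, p=5))
```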

IV Experimental Results

This section presents a comparison of our proposed method against conventional GP and KP, as reported in the literature [14], on symbolic regression benchmark functions [33, 34]. The experiments demonstrate that our method can outperform them, providing high-quality solutions on both the training and testing sets.

IV-A Benchmarks

The benchmarks in Table I [33] are used for the first experiment, and the benchmarks in Table II [34] for the second experiment; they were chosen to compare our method with the results presented in [14]. Similar to [14], n points randomly sampled from the uniform distribution in the range [a, b] are denoted U[a, b, n], and points in the range [a, b] taken at successive equal intervals of size c are denoted E[a, b, c].

IV-B Configuration

The Distributed Evolutionary Algorithms in Python (DEAP) library [35], under Python 3.6.4, is used to implement the GP-related parts of our method, namely the population generation based on genetic operators. The Python library for symbolic mathematics (SymPy) is used as the computational algebra library, and the Python library for numeric computations (NumPy) is used for the numerical calculations in RVM and the evaluation of functions. Table III shows the configuration and parameter settings for the first experiment, and Table IV shows the configuration and settings for the second set of benchmarks; missing terms in Table IV can be found in Table III.

Function Training data Testing data
E[-1,1,0.1] E[-1,1,0.001]
E[-2,2,0.1] E[-2,2,0.001]
E[-3,3,0.1] E[-3,3,0.001]
E[1,50,1] E[1,120,1]
E[1,100,1] E[1,100,0.1]
E[1,100,1] E[1,100,0.1]
E[1,100,1] E[1,100,0.1]
TABLE I: Benchmark Functions
Function Train / Test
U[-1,1,20]
U[-1,1,20]
U[-1,1,20]
U[-1,1,20]
U[-1,1,20]
U[-1,1,20]
U[0,2,20]
U[0,4,40]
U[-1,1,100]
U[-1,1,100]
TABLE II: Benchmark Functions

For comparison, the proposed method and the other methods from the KP study [14] were configured as shown in Tables III and IV. The configurations for KP, GP50, and GP500 are taken from [14]. The maximum number of generations is set to balance the total number of individuals used among all methods. In the first experiment, each run is terminated when the maximum generation is reached or the current global solution has a fitness higher than 0.99999. In the second experiment, the termination condition is reaching the maximum number of node evaluations; a run is considered a failure if, for any one of the training I-O cases, the absolute error is greater than 0.01. Our method is executed 50 times in the first experiment and 100 times in the second experiment on each benchmark.

Parameter Value
GP-RVM / KP/ GP50/ GP500
Population size 8/ 8/ 50/ 500
Max. generations 2000/ 2000/ 500/ 50
Crossover probability 1.0/ 1.0/ 0.9/ 0.9
Crossover operator One-point
Mutation probability 1.0/ 1.0/ 0.05/ 0.05
Mutation operator 90% Uniform & 10% ERC/
Uniform/ Uniform
Max. depth 15/ 2/ 15/ 15
Non-terminals
Terminals
Fitness
Stopping criteria Max. gen. or fitness>0.99999
Runs 50/ 50/ 50/ 50
TABLE III: Configuration and Parameter Setting for Benchmark
Parameter Value
GP-RVM/ KP/ GP50/ GP500
Population size 4/ 8/ 50/ 500
Max. node eval. 2000
Non-terminals
Terminals ( to ,
and
Runs 100/ 100/ 100/ 100
TABLE IV: Configuration and Parameter Setting for Benchmark

IV-C Results

The first experiment, on the Table I benchmarks, compares model quality and computational complexity. The training results are shown in Table V, and the testing results in Table VI. The number of function evaluations (NFEs) is reported for the comparison of computational complexity. The statistical results are collected over 50 trials. The second experiment, on the Table II (Nguyen) benchmark functions, uses the same interval for training and testing, but the sets of points are distinct. The objective is to reduce the absolute error of every training I-O pair below 0.01 within a limited number of node evaluations. The results of our method are shown in Table VII, and the comparison with other methods is referenced and shown in Figure 2. The statistical results are collected over 100 trials.

F Stat. GP-RVM KP GP500
Adj. R² RMSE NFEs Adj. R² RMSE NFEs Adj. R² RMSE NFEs
Min 1.00E+00 8.18E-05 532 1.00E+00 1.93E-06 328 4.12E-01 8.00E-02 26700
Med. 1.00E+00 4.14E-05 5311 1.00E+00 2.21E-04 9492 4.14E-01 8.97E-02 30450
Max 1.00E+00 2.06E-05 11214 1.00E+00 2.28E-02 32260 5.34E-01 8.99E-02 32170
Min 9.95E-01 8.49E-04 1052 1.25E-01 5.42E-04 2880 5.50E-02 2.17E-01 28460
Med. 9.98E-01 9.04E-03 6654 8.99E-01 6.75E-02 32190 1.39E-01 2.23E-01 30550
Max 1.00E+00 1.09E-02 16588 1.00E+00 2.19E-01 32380 1.84E-01 2.34E-01 32310
Min 9.56E-01 3.89E-02 11130 1.26E-01 9.98E-02 32100 3.26E-02 3.35E-01 28300
Med. 9.81E-01 9.75E-02 14811 3.83E-01 2.65E-01 32170 7.05E-02 3.51E-01 30300
Max 9.92E-01 2.42E-01 17870 9.13E-01 3.36E-01 32260 1.53E-01 3.58E-01 31950
Min 1.00E+00 4.26E-04 7 1.00E+00 6.99E-16 56 1.00E+00 9.90E-01 501
Med. 1.00E+00 4.26E-04 19 1.00E+00 9.22E-16 56 1.00E+00 9.90E-01 501
Max 1.00E+00 1.27E-04 104 1.00E+00 1.18E-14 14510 1.00E+00 9.90E-01 501
Min 1.00E+00 1.32E-03 7 1.00E+00 2.96E-06 56 9.96E-01 1.05E-01 28590
Med. 1.00E+00 2.26E-03 8 1.00E+00 4.61E-04 80 9.98E-01 1.81E-01 30660
Max 1.00E+00 6.83E-03 43 1.00E+00 1.78E-02 224 9.99E-01 2.51E-01 32480
Min 1.00E+00 1.03E-06 7 1.00E+00 7.91E-16 56 1.00E+00 0.00E+00 501
Med. 1.00E+00 5.61E-05 7 1.00E+00 3.50E-15 56 1.00E+00 0.00E+00 501
Max 1.00E+00 6.08E-03 38 1.00E+00 1.34E-02 160 1.00E+00 0.00E+00 501
Min 1.00E+00 8.08E-04 7 1.00E+00 9.77E-06 56 9.69E-01 1.12E-01 25500
Med. 1.00E+00 5.41E-03 7 1.00E+00 1.59E-03 88 9.99E-01 1.35E-01 30300
Max 1.00E+00 9.19E-03 22 1.00E+00 2.91E-02 248 9.99E-01 7.70E-01 32050
TABLE V: Training Result on Benchmark
Func. Stat. GP-RVM KP GP 500
min 7.10E-03 1.07E-05 7.88E-02
median 2.56E-02 4.36E-04 8.20E-02
max 1.00E-02 5.74E+06 1.30E+00
min 9.77E-03 6.86E-04 2.09E-01
median 6.76E-02 6.85E-02 2.22E-01
max 1.57E+00 8.23E+00 1.00E+00
min 1.10E-02 9.67E-02 3.41E-01
median 5.02E-01 2.07E+00 3.53E-01
max 1.30E+01 9.98E+01 2.71E+00
min 6.87E-03 5.49E-16 9.96E-01
median 9.71E-03 8.81E-16 9.96E-01
max 2.22E-02 8.03E-14 9.96E-01
min 1.59E-03 3.89E-04 1.06E-01
median 1.83E-03 1.27E-02 1.71E-01
max 4.42E-02 4.49E+02 5.50E+08
min 1.03E-06 1.23E-15 0
median 1.78E-04 6.08E-15 0
max 2.68E-03 1.63E+00 0
min 8.18E-04 4.74E-03 1.39E-01
median 4.55E-03 1.03E-01 1.51E-01
max 9.64E-03 3.82E+09
TABLE VI: RMSE Testing Result on Benchmark
Problem Max. Error Raw Fitness RMSE Training Func. Eval. Node Eval. RMSE Testing Succ. Runs
0.00283 0.01823 0.00112 21.14 20.63 0.05112 100
0.00479 0.02357 0.00135 32.63 27.56 0.10625 100
0.00399 0.02711 0.00172 45.73 36.68 0.83062 100
0.00406 0.02746 0.00174 33.41 54.32 0.49868 100
0.00057 0.00396 0.00025 14.39 26.62 0.36413 100
0.00264 0.0188 0.00118 18.17 29.14 0.099 100
0.0022 0.01473 0.00093 8.07 14.04 0.28685 100
0.00145 0.00948 0.0006 17.26 29.66 0.1638 100
0.00352 0.09064 0.00114 75.29 177.51 0.70064 100
0.00442 0.10199 0.00133 129.54 311.68 0.20869 100
TABLE VII: Results (average of 100 trials) of benchmark functions. Raw fitness is sum of absolute error on all fitness cases.
Fig. 2: Number of successful runs using Nguyen benchmark functions (F1-F10)

IV-D Discussion

The first experiment verifies whether our method can outperform regular GP and its predecessor, KP. For the seven functions, the training results show that our approach achieved the highest fitness values for two functions (functions 2 and 3), with five ties with KP (functions 1, 6, 7, 8, 9) and two ties with GP (functions 6 and 8). However, the RMSE of our method is larger in the tied cases. Our method found high-quality models (fitness > 0.99) for six functions, whereas KP found five and GP (50 and 500) found models with such quality for only four functions. These results indicate that our method performs as fast as KP and GP on simple problems and performs better when dealing with complex functions (functions 2 and 3), where KP can find only models of low or poor quality and GP finds only poor results. For these functions, our method requires a much smaller number of function evaluations (NFEs) than the other methods. It should be noted that, when dealing with easy targets (functions 6, 7, 8, 9), the models found by our method are sparse and rapidly meet the stopping criterion (fitness > 0.99999), which explains why the RMSE of our method is larger in these ties, as the criterion rewards sparse models. Concerning the testing results, we see that the minimum and median RMSE of our method are somewhat larger than those of KP (functions 1, 6, 7, 8, 9). However, the maximum errors of GP-RVM are far smaller than those of KP on functions 1, 2, 3, 7, 8, 9, which shows that the results of our method are more stable than the others, in other words, more robust on the testing set. In practice, robustness on the testing set is much more important than a small difference in error.

The second experiment tests curve-fitting tasks on the benchmark functions, where the maximum absolute error on any of the fitness cases should be lower than 0.01. Table VII shows the averaged results of our method. Like KP, GP-RVM succeeds in fitting the target curves, as the RMSE for training and testing remains low for every test. Regarding efficiency, we notice that the numbers of objective function and node evaluations also remain low for every test, which means that our method is faster in finding the correct result. Part of the reason is the implementation of the cache, but more importantly, the sparse kernel method employed by our method results in fast convergence. Figure 2 compares the results of our method with KP and other techniques working on the same curve-fitting problem. The performance of our method is as accurate as KP, with both achieving 100% successful runs, and both GP-RVM and KP outperform the other methods.

V Conclusion

V-A Summary

In this paper we focused on the topic of symbolic regression (SR), using symbols of functions and variables to construct solutions. The expanded search space and the reduced amount of required prior knowledge are the main reasons for choosing SR over other machine learning models. Genetic Programming (GP), a population-based evolutionary algorithm, is a representative method for performing SR. Limitations of GP, including its blind search, have attracted many works aiming to remove them. Kaizen Programming (KP) is a successful one among those works; it combines linear regression modeling and a hill-climbing approach to guide the search. Despite its success, problems including threshold setting and singularity arise. We analytically found that the singularity problem can be solved by generalizing the linear model of KP. Moreover, if we take a Bayesian treatment to estimate the generalized model, we can avoid using hypothesis testing to perform feature selection, which solves the threshold setting problem at the same time. The generalized model, found in the literature, is known as the relevance vector machine (RVM). Motivated by these analytical results, in this work we extended the feature selection module of KP based on RVM. The proposed method works without performing a hypothesis test, thus requiring no prior knowledge to set the threshold, and it deals with singularity automatically. We implemented the sequential sparse Bayesian learning algorithm to accelerate the training speed of RVM. The proposed method was compared to KP and GP on several benchmark functions. The results showed that it outperforms the other methods, providing more accurate models with fewer function evaluations.

V-B Future Work

Although GP-RVM is more efficient than previous methods, it is relatively slow for large datasets. Since the training is a non-convex optimization problem, for a model with M feature functions RVM requires the inversion of an M × M matrix, which in general has O(M³) computational complexity. During the execution of our method, each generation requires at least one run of the sequential training algorithm to return a model. To increase the model-building speed, the authors are pursuing a data-parallel approach for large datasets, running several learner modules on smaller sets of data. The dataset will be randomly divided into subsets, and each subset will be trained simultaneously. This idea reduces the cost of executing each individual learner and shortens the model-building time. The predicted models will then be combined with fusion techniques.

GP-RVM can be extended to other machine learning tasks, such as classification. There have been several successful applications proposed for KP, such as [23], where the authors used KP to predict high-performance concrete compressive strength. In [1, 35], the authors generalized KP to classification tasks. Since RVM is a probabilistic model and the framework is extendable to different kinds of machine learning tasks, it is theoretically feasible to generalize our method to overcome the limitations of the statistically based KP in classification tasks.

References

  • [1] de Melo, V.V. and Banzhaf, W. 2016. Kaizen programming for feature construction for classification. In Genetic Programming Theory and Practice XIII, pp. 39-57, Springer.
  • [2] Peng, Y., Yuan, C., Qin, X., Huang, J. and Shi, Y. 2014. An improved gene expression programming approach for symbolic regression problems. Neurocomputing, 137, pp.293-301.
  • [3] Tang, F., et. al. 2015. Discovery scientific laws by hybrid evolutionary model. Neurocomputing, 148, pp.143-149.
  • [4] Icke, I. and Bongard, J.C. 2013, June. Improving genetic programming based symbolic regression using deterministic machine learning. In Evolutionary Computation IEEE Congress on pp. 1763-1770.
  • [5] Cristianini, N. and Shawe-Taylor, J., 2000. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press.
  • [6] Shafaei, M. and Kisi, O.,2017. Predicting river daily flow using wavelet-artificial neural networks based on regression analyses in comparison with artificial neural networks and support vector machine models. Neural Computing and Applications, 28(1), pp.15-28.
  • [7] Deklel, A.K., Saleh, M.A., Hamdy, A.M. and Saad, E.M. 2017, March. Transfer learning with long term artificial neural network memory (LTANN-MEM) and neural symbolization algorithm (NSA) for solving high dimensional multi-objective symbolic regression problems. In Radio Science Conference, 2017 34th National (pp. 343-352). IEEE.
  • [8] Austel, V., et. al. 2017. Globally Optimal Symbolic Regression. arXiv preprint arXiv:1710.10720.
  • [9] Michael S., and Hod L. 2009. Distilling Free-Form Natural Laws from Experimental Data. Science Col. 324, Issue 5923, pp.81-85.
  • [10] Michael S., and Hod L. 2013. Eureqa software (version 0.98 beta). Nutonian, Somerville, Mass, USA.
  • [11] Michael S., and Hod L. 2013. Symbolic Regression of Implicit Equations. Genetic Programming Theory and Practice VII. Genetic and Evolutionary Computation. Springer, Boston, MA.
  • [12] Koza, J.R., 1994. Genetic programming as a means for programming computers by natural selection. Statistics and computing, 4(2), pp.87-112.
  • [13] Banzhaf, W., Nordin, P., Keller, R.E. and Francone, F.D. 1998. Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Applications. Morgan Kaufmann Publishers. Inc., Heidelberg and San Francisco CA.
  • [14] de Melo, V.V. 2014, July. Kaizen programming. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation (pp. 895-902). ACM.
  • [15] Tipping, M.E. 2001. Sparse Bayesian learning and the relevance vector machine. Journal of machine learning research, 1(Jun), pp.211-244.
  • [16] Fulcher, J., 2008. Computational intelligence: an introduction. In Computational intelligence: a compendium (pp. 3-78). Springer, Berlin, Heidelberg.
  • [17] Harvey, D.Y. and Todd, M.D. 2015. Automated feature design for numeric sequence classification by genetic programming. IEEE Transactions on Evolutionary Computation, 19(4), pp.474-489.
  • [18] Chatterjee, A. and Rong, M. 2018. Efficiency Analysis of Genetic Algorithm and Genetic Programming in Data Mining and Image Processing. Computer Vision: Concepts, Methodologies, Tools, and Applications. p.246.
  • [19] Tipping, M.E. and Faul, A.C. 2003. Fast marginal likelihood maximisation for sparse Bayesian models., AISTATS.
  • [20] Model, R.R., Square Adjusted R Square Std. Error of the Estimate, 1.
  • [21] Stijven, S., Vladislavleva, E., Kordon, A., Willem, L. and Kotanchek, M.E. 2016. Prime-Time: Symbolic Regression Takes Its Place in the Real World. In Genetic Programming Theory and Practice XIII (pp. 241-260). Springer, Cham.
  • [22] Bautu, E., Bautu, A. and Luchian, H. 2005. Symbolic regression on noisy data with genetic and gene expression programming. In Symbolic and Numeric Algorithms for Scientific Computing, 2005. SYNASC 2005. Seventh International Symposium.
  • [23] de Melo, V.V. and Banzhaf, W. 2015. Predicting high-performance concrete compressive strength using features constructed by kaizen programming. In Intelligent Systems pp. 80-85.
  • [24] de Melo, V.V. and Banzhaf, W. 2017. Improving the prediction of material properties of concrete using Kaizen Programming with Simulated Annealing. Neurocomputing, 246, pp.25-44.
  • [25] de Melo, V.V., 2016. Breast cancer detection with logistic regression improved by features constructed by Kaizen programming in a hybrid approach. In IEEE Evolutionary Computation Congress, pp. 16-23.
  • [26] Brameier, M.F. and Banzhaf, W., 2007. Linear genetic programming. Springer Science & Business Media.
  • [27] Moraglio, A., Krawiec, K. and Johnson, C.G. 2012. Geometric semantic genetic programming. In International Conference on Parallel Problem Solving from Nature, Springer, Berlin, Heidelberg, pp. 21-31.
  • [28] Arnaldo, I., Krawiec, K. and O’Reilly, U.M. 2014. Multiple regression genetic programming. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pp. 879-886.
  • [29] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. 2004. Least angle regression. The Annals of statistics, 32(2), pp.407-499.
  • [30] Moraglio, A., Krawiec, K. and Johnson, C.G. 2012. Geometric semantic genetic programming. In International Conference on Parallel Problem Solving from Nature, pp. 21-31, Springer, Berlin, Heidelberg.
  • [31] Castelli, M., Trujillo, L., Vanneschi, L. and Silva, S. 2015. Geometric semantic genetic programming with local search. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 999-1006.
  • [32] Bishop, C.M. 2006 (8th printing, 2009). Pattern Recognition and Machine Learning. Springer-Verlag New York.
  • [33] McDermott, J., et. al. ,2012. Genetic programming needs better benchmarks. In Proceedings of the 14th annual conference on Genetic and evolutionary computation pp. 791-798.
  • [34] Karaboga, D., Ozturk, C., Karaboga, N. and Gorkemli, B. 2012. Artificial bee colony programming for symbolic regression. Information Sciences, 209, pp.1-15.
  • [35] de Melo, V.V. and Banzhaf, W. 2016. Improving logistic regression classification of credit approval with features constructed by Kaizen programming. Genetic and Evolutionary Computation Conference Companion pp. 61-62.
  • [36] dal Piccol Sotto, L.F. and de Melo, V.V., 2016. Studying bloat control and maintenance of effective code in linear genetic programming for symbolic regression. Neurocomputing, 180, pp.79-93.
  • [37] Sotto, L.F., Coelho, R.C. and de Melo, V.V., 2016, July. Classification of Cardiac Arrhythmia by Random Forests with Features Constructed by Kaizen Programming with Linear Genetic Programming. In Proceedings of the Genetic and Evolutionary Computation Conference 2016 (pp. 813-820). ACM.