Feature selection with optimal coordinate ascent (OCA)
In machine learning, feature selection (FS) is a major part of an efficient algorithm. It fuels the algorithm and is the starting block for our prediction. In this paper, we present a new method, called Optimal Coordinate Ascent (OCA), that allows us to select features among block and individual features. OCA relies on coordinate ascent to find an optimal solution for the score of gradient boosting methods (the number of correctly classified samples). OCA takes into account the notion of dependencies between variables forming blocks in our optimization. The coordinate ascent optimization addresses the NP-hard original problem, where the number of combinations rapidly explodes, making a grid search infeasible. It considerably reduces the number of iterations, replacing the exhaustive search of this NP-hard problem with a polynomial one. OCA brings substantial differences and improvements compared to the previous coordinate ascent feature selection method: we group variables into block and individual variables instead of using a binary selection. Our initial guess is based on the k-best group variables, making our initial point more robust. We also introduce new stopping criteria, making our optimization faster. We compare these two methods on our data set and find that our method outperforms the initial one. We also compare our method to the Recursive Feature Elimination (RFE) method and find that OCA leads to the minimum feature set with the highest score. This is a nice byproduct of our method, as it empirically provides the most compact data set with optimal performance.
Keywords: feature selection, coordinate ascent, gradient boosting method
MSC: 68T01, 68T05
1 Introduction
Feature selection is also known as variable or attribute selection. It is the selection of the subset of attributes in our data that are most relevant to our predictive modeling problem. It has been an active and fruitful field of research and development for decades in statistical learning. It has proven to be effective and useful in both theory and practice for many reasons: enhanced learning efficiency and increased predictive accuracy (see Mitra et al. (2002)), model simplification to ease interpretation and improve performance (see Almuallim and Dietterich (1994), Koller and Sahami (1996) and Blum and Langley (1997)), shorter training time (see Mitra et al. (2002)), avoidance of the curse of dimensionality, enhanced generalization with reduced overfitting, and implied variance reduction. Both Hastie et al. (2009) and Guyon and Elisseeff (2003) are good references for an overview of the various methods for feature selection. The approaches followed vary. Briefly speaking, the methods can be sorted into three main categories: filter methods, wrapper methods and embedded methods. We describe these three categories in the following sections.
1.1 Feature selection methods
1.1.1 Filter methods
Filter methods select variables regardless of the model. These methods suppress the least interesting variables by using a ranking technique as the criterion to select the variables. Once the ranking is done, a threshold is determined in order to select the features above it. These methods are very efficient in terms of computation time and robust to overfitting. By construction, filter methods may select redundant variables, as they do not consider the relationships between variables. To stress this last point, we can present one of the best-known criteria, the Pearson correlation coefficient, which is simply the ratio between the covariance and the square root of the product of the two variances:
$$\rho_i = \frac{\mathrm{Cov}(x_i, y)}{\sqrt{\mathrm{Var}(x_i) \, \mathrm{Var}(y)}}$$
with $x_i$ the $i$-th feature in the model and $y$ the associated label. It is well known that this correlation ranking can only detect linear dependencies between the features and the target label.
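As an illustration of the two limitations mentioned above, the following minimal sketch ranks three toy features by absolute Pearson correlation with a label. The helper names (`pearson`, `filter_select`), the toy data and the threshold of 0.5 are assumptions made for this example only. The sketch keeps two perfectly redundant features and discards a feature that fully determines the label through a non-linear relation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance over the square root of the product of variances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant columns
    return cov / math.sqrt(var_x * var_y)

def filter_select(features, y, threshold):
    """Hypothetical helper: keep features whose |correlation| with y exceeds the threshold."""
    scores = {name: abs(pearson(col, y)) for name, col in features.items()}
    kept = [name for name in features if scores[name] > threshold]
    kept.sort(key=lambda name: -scores[name])  # best ranked first
    return kept, scores

# Toy data (assumption): the label is the square of t.
t = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v * v for v in t]
features = {
    "u": [0.5 * v + 1.0 for v in y],       # linear in the label
    "u_copy": [0.5 * v + 1.0 for v in y],  # redundant duplicate of "u"
    "t": t,                                # determines y, but not linearly
}
selected, scores = filter_select(features, y, threshold=0.5)
print(selected, scores)
```

Both `u` and `u_copy` pass the threshold although one of them is redundant, while `t` gets a correlation of zero even though the label is a deterministic function of it.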
1.1.2 Wrapper methods
Wrapper methods evaluate subsets of variables. They thus allow detecting possible interactions between variables. In wrapper methods, a model must be trained to test each candidate feature subset. Consequently, these methods are iterative and computationally expensive. However, these methods can identify the best performing feature set for that specific modeling algorithm. Some known examples of wrapper methods are forward and backward feature selection.
Backward elimination starts with all features and progressively removes them. Conversely, forward selection starts with an empty set and progressively adds features.
If we have $N$ features, we need to train $N$ classifiers for the first step, then $N-1$ classifiers for the second step and so on. We then have $N(N+1)/2$ training steps for both methods. However, forward selection starts with small feature subsets, so it can be computationally cheaper if the stopping condition is satisfied early. One of the state-of-the-art wrapper methods is Recursive Feature Elimination (RFE) (see for instance Mangal and Holm (2018) for more details). It first fits a model and then removes features until a pre-determined number of features is reached. Features are ranked through an external model that assigns a weight to each feature, and RFE recursively eliminates the features with the least weight at each iteration. One of the main limitations of RFE is that it requires the number of features to keep. This is hard to guess a priori, and one may need to iterate much more than the desired number of features to find an optimal feature set.
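To make the cost of a wrapper method concrete, here is a minimal sketch of greedy forward selection run until every feature has been added. The callable `score` is a hypothetical stand-in for "train a classifier on this feature subset and return its accuracy", and the toy weights are assumptions for illustration. Counting the calls shows that with $N$ features the procedure needs $N + (N-1) + \dots + 1 = N(N+1)/2$ trainings.

```python
def forward_selection(n_features, score):
    """Greedy forward selection, run until all features are added.

    `score` is a hypothetical callable standing in for one model training
    on the given list of feature indices.
    Returns the order in which features were added and the number of trainings.
    """
    selected, remaining = [], list(range(n_features))
    trainings = 0
    while remaining:
        best_feature, best_score = None, float("-inf")
        for candidate in remaining:
            trainings += 1  # one model fit per candidate subset
            s = score(selected + [candidate])
            if s > best_score:
                best_feature, best_score = candidate, s
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected, trainings

# Toy score (assumption): each feature contributes a fixed weight.
weights = [0.1, 0.5, 0.3, 0.2]
order, trainings = forward_selection(
    len(weights), lambda subset: sum(weights[i] for i in subset)
)
print(order, trainings)
```

In practice the loop would stop as soon as the stopping condition is met, which is why forward selection can be cheaper than backward elimination when few features suffice.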
1.1.3 Embedded methods
Embedded methods perform feature selection as part of the modeling algorithm's execution. Many hybrid methods have been developed to combine the advantages of wrapper and filter methods.
2 Result of convergence
To motivate our method, which relies on coordinate ascent, we recall some theoretical results about the convergence of coordinate ascent optimization. The theory is well understood for the convex case, see Wright (2015). The non-convex case without gradient, which is our setting, is however much harder, as we face local minima and the mathematical assumptions are too weak to prove convergence. Nevertheless, convergence results under strong convexity provide some hint about the efficiency of this method and its convergence rate, which is linear. Our proof, provided in the appendix, is inspired by Nesterov (2012) with a slight modification, as we start from the critical point condition. We also provide the various building-block lemmas needed to reach this proof rapidly. To obtain a meaningful result, we need to make some assumptions on the function to be minimized. Even if our final problem is a maximization, it is trivial to turn a maximization program into a minimization one by taking the opposite of the objective function. In this section, we stick to the traditional presentation and examine minimization to make the proof easier to read. We examine the following optimization program:
$$\min_{x \in \mathbb{R}^n} f(x)$$
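We denote by $e_i$ the traditional vector with $0$ for any coordinate except for coordinate $i$, where it is $1$. It is the $i$-th vector of the canonical basis.
We assume our function $f$ is twice differentiable and strongly convex with respect to the Euclidean norm, that is, for some $\sigma > 0$ and any $x, y \in \mathbb{R}^n$:
$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{\sigma}{2} \| y - x \|_2^2$$
We also assume that each gradient's coordinate is uniformly Lipschitz, that is, there exists a constant $L_i$ such that for any $x \in \mathbb{R}^n$ and $t \in \mathbb{R}$:
$$\left| [\nabla f(x + t e_i)]_i - [\nabla f(x)]_i \right| \leq L_i |t|$$
We denote by $L_{\max}$ the maximum of these Lipschitz coefficients:
$$L_{\max} = \max_{i = 1, \dots, n} L_i$$
We assume that the minimum of $f$, denoted by $f^*$, is attained at some point $x^*$ and that the sublevel set of $f$ with respect to our initial starting point $x_0$ is bounded, that is
$$\max_{x} \left\{ \| x - x^* \| : f(x) \leq f(x_0) \right\} \leq R_0$$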
Under these assumptions, coordinate descent converges to the minimum at a linear rate (Proposition 2.1). The proof is quite simple and given in Appendix A.1.
Our function to be maximized is obviously not convex. However, a linear rate in the convex case is rather a good performance for the ascent optimization method. Provided the result generalizes, which is still under research, this convergence rate is a good hint of the efficiency of the method.
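The behavior discussed in this section can be observed on a small example. The sketch below runs randomized coordinate descent on a strongly convex quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$. The matrix `A`, the vector `b`, the seed and the step count are arbitrary choices made for illustration. For this quadratic, the coordinate-wise Lipschitz constant of the gradient along coordinate $i$ is `A[i][i]`, and a step of size `1 / A[i][i]` minimizes $f$ exactly along that coordinate.

```python
import random

# Toy strongly convex quadratic (assumption): A is symmetric positive definite.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

def f(x):
    """Objective 0.5 * x^T A x - b^T x."""
    n = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(n))

def partial(x, i):
    """i-th coordinate of the gradient A x - b."""
    return sum(A[i][j] * x[j] for j in range(len(x))) - b[i]

def coordinate_descent(x0, n_steps, seed=0):
    """Randomized coordinate descent with step 1 / A[i][i] on coordinate i."""
    rng = random.Random(seed)
    x = list(x0)
    values = [f(x)]
    for _ in range(n_steps):
        i = rng.randrange(len(x))        # pick one coordinate at random
        x[i] -= partial(x, i) / A[i][i]  # move along that coordinate only
        values.append(f(x))
    return x, values

x_min, values = coordinate_descent([0.0, 0.0], n_steps=200)
print(x_min, values[-1])
```

The iterates approach the solution of $A x = b$, here $(1/11, 7/11)$, and the recorded objective values never increase. This toy problem satisfies the convexity assumptions above. The feature selection score does not, which is why the convergence result is only a hint in that setting.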
3 Method developed
In many applications, we can group features into families. We call these features block variables. A typical example is to group variables that are observations of the same physical quantity at different times (such as the wind speed measured at different hours for an energy prediction problem, the price of a stock in an algorithmic trading strategy for financial markets, or the temperature or heartbeat of a patient at different times). Formally, we can group our variables into two sets:
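- the first set encompasses $B_1, \dots, B_m$. These are called block variables, of possibly different lengths $L_1, \dots, L_m$. Mathematically, the block variables are denoted by $B_i = (B_{i,1}, \dots, B_{i,L_i})$, with $B_i$ taking values in $\mathbb{R}^{L_i}$;
- the second set is denoted $S = (S_1, \dots, S_p)$ and is a block of $p$ single variables.
Graphically, our variables look like this:
$$\underbrace{B_{1,1}, \dots, B_{1,L_1}}_{B_1}, \; \dots, \; \underbrace{B_{m,1}, \dots, B_{m,L_m}}_{B_m}, \; \underbrace{S_1, \dots, S_p}_{S}$$
In addition, we have $N$ variables split between block variables and single variables, hence $N = N_B + p$ with $N_B = \sum_{i=1}^{m} L_i$.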
Our algorithm works as follows. We first fit our classification model to obtain a ranking of feature importance. The importance is computed with the Gini index for each variable. We then keep the first $k$ best ranked features for each block in order to find the best initial guess for our coordinate ascent algorithm. Notice that the set of single variables is not modified during this first step of the procedure. The objective function is the number of correctly classified samples at each iteration. We then enter the main loop of the algorithm. Starting with the $k$ best features of each block and all single variables as the initial guess, we perform our coordinate ascent optimization in order to find the set with the optimal score and the minimum number of features. The coordinate ascent loop stops whenever we either reach the maximum number of iterations or the current optimal solution has not moved between two steps.
We summarize the algorithm in pseudo code 1. The algorithm uses a tolerance for the convergence stopping condition. To control early stopping, it uses a precision variable and two maximum iteration counts, all initialized before starting the algorithm. The initial score is the accuracy score of our classifier with each block of variables retaining its $k$ best variables and with all single variables retained.
The originality of this coordinate ascent optimization is to group variables by block, which reduces the number of iterations compared to Binary Coordinate Ascent (BCA) as presented in Zarshenas and Suzuki (2016). The stopping condition can be changed to accommodate other stopping criteria.
There are many variants to this algorithm. It can be modified by using a randomized coordinate ascent. In this case, we choose the index randomly at each step instead of using the provided order. The pseudo code is listed below:
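As an illustration only, the loop described in this section can be sketched in Python as follows. This is one possible reading of the method, not the authors' reference implementation. The function name `coordinate_ascent`, its parameters, the tie-breaking rule (prefer fewer features at equal score) and the toy `score` are all assumptions. In a real use, `score` would train the gradient boosting classifier on the selected columns and return the number of correctly classified samples. The randomized variant is approximated here by shuffling the coordinate order at each sweep.

```python
import random

def coordinate_ascent(score, block_sizes, n_single, k, max_sweeps=10,
                      randomized=False, seed=0):
    """Block coordinate ascent sketch (hypothetical interface).

    State: for each block, the number of top-ranked features kept,
    followed by a 0/1 flag for each single variable.
    """
    rng = random.Random(seed)
    # Initial guess: k best features per block, all single variables kept.
    state = [min(k, size) for size in block_sizes] + [1] * n_single
    candidates = [range(size + 1) for size in block_sizes] + [range(2)] * n_single
    best = score(state)
    evaluations = 1
    for _ in range(max_sweeps):
        order = list(range(len(state)))
        if randomized:
            rng.shuffle(order)  # random coordinate order for this sweep
        moved = False
        for i in order:
            current = state[i]
            for value in candidates[i]:
                if value == current:
                    continue
                trial = state[:i] + [value] + state[i + 1:]
                s = score(trial)
                evaluations += 1
                # Accept a better score, or an equal score with fewer features.
                if s > best or (s == best and value < state[i]):
                    state, best, moved = trial, s, True
        if not moved:  # solution unchanged over a full sweep: stop
            break
    return state, best, evaluations

# Toy score (assumption): best with 2 and 1 features in the two blocks,
# the first single variable kept and the second one dropped.
target = [2, 1, 1, 0]
def toy_score(state):
    return 10 - sum(abs(s - t) for s, t in zip(state, target))

state, best, evaluations = coordinate_ascent(toy_score, [4, 3], 2, k=3)
state_r, best_r, _ = coordinate_ascent(toy_score, [4, 3], 2, k=3, randomized=True)
print(state, best, evaluations)
```

Because each block contributes a single integer coordinate rather than one binary coordinate per feature, a sweep visits far fewer coordinates than a binary scheme over all features, at the price of relying on the initial importance ranking within each block.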
The specificity of our method is to keep the best representative features for each feature class, as opposed to other methods that only select one representative feature from each group, ignoring the strong similarities between the features of a given variable block. In particular, this takes the opposite view of feature selection with ensembles, artificial variables, and redundancy elimination as developed in Tuv et al. (2009).
4 Numerical Results
4.1 Data set
We use this algorithm to perform a supervised classification of a data set of financial market trades that we want to classify according to some a priori features. We are given 1500 trades with 135 features that can be grouped into 5 blocks of 20 variables, 1 block of 30 variables and 5 single variables. We know for each trade whether it is a good or a bad trade. The idea is to use the minimum number of features to classify this data set a priori. We use cross validation with 70% of the data for the training set and 30% for the test set. For full reproducibility, the full data set and the corresponding Python code for this algorithm are publicly available on GitHub. The authors may further update the code to reflect improvements or fix typos if required. This code is provided as is. The authors do not grant any warranty nor assume any liability for the content thereof.
We compare our method to two other methods that are considered state of the art for feature selection, namely RFE and BCA. Our new method achieves a score of 62.80% with 16% of the features used, compared to RFE, which achieves 62.80% with 19% of the features used. BCA performs poorly, with a highest score of 62.19% with 27% of the features used. If we take as efficiency criterion the highest score with the fewest features, our method is the most efficient among these three methods. In comparison, with the same number of features, namely 16%, RFE gets a score of 62.40%. All these figures are summarized in table 1.
|Method|Number of features|% of features|Score (in %)|
|OCA|24|16.6|62.8|
|RFE|24|16.6|62.39|
|BCA|39|27.08|62.19|
|RFE|28|19.4|62.8|
Compared to BCA, our method reduces the number of iterations, as it uses the fact that variables can be grouped into categories or classes. The number of iterations for OCA and BCA is provided in figure 1. Our method requires only 350 iteration steps to converge, as opposed to BCA, which needs up to 700 iteration steps, as it evaluates variables blindly, ignoring similarities between the different variables.
Graphically, we can compare the best candidates for the four methods listed in table 1 in figure 5 and 5. We use the following color code. The hottest (or best performing) method is plotted in red, while the worst is plotted in blue. Average performing methods are plotted in orange. In order to compare OCA and RFE finely, we have plotted in figure 5 the result of RFE for a percentage of features used ranging from 10 to 30 percent. We can notice that for the same feature set size as OCA, RFE has a lower score and, equally, that to get the same score as OCA, RFE needs a larger feature set.
5 Conclusion
In this paper, we have presented a new method, called Optimal Coordinate Ascent (OCA), that allows us to select features among block and individual features. OCA relies on coordinate ascent to find an optimal solution for the score of gradient boosting methods (the number of correctly classified samples). OCA takes into account the notion of dependencies between variables forming blocks in our optimization. The coordinate ascent optimization addresses the NP-hard original problem, where the number of combinations rapidly explodes, making a grid search infeasible. It replaces the exhaustive search of the NP-hard problem of finding the best features with a polynomial one. Comparing our results with two other methods, Binary Coordinate Ascent (BCA) and Recursive Feature Elimination (RFE), we find that OCA leads to the minimum feature set with the highest score. OCA empirically provides the most compact data set with optimal performance. Obtaining a reduced feature set compared to other methods is highly desirable for at least two reasons. First, a smaller feature set should have stronger generalization power, as it carries less noise created by too many variables (in a sense, in a similar way to the Lasso method, which eliminates variables in regression). Second, fewer features lead to a model with a smaller memory footprint and faster computation. A possible extension is to parallelize the algorithm and potentially use GPU acceleration to leverage its strong decoupling when examining candidate solutions.
Appendix A Proofs
A.1 Proof of Proposition 2.1
In order to prove the result, we first start with a simple lemma.
If is non increasing such that with
We remark that or equivalently , which says that is bounded by a lower parabola. Let be this parabola. ’s variation are easy to study and given below (with ):
We can now trivially prove our result by induction. The initialization step is obvious for as since the global maximum of our parabola is which is less than for . If the result holds for , we know that the maximum of the parabola () is attained in since . This implies that
We can trivially conclude as ∎
We can now prove our main result. By assumptions, we do a gradient descent according to one coordinate: for some . A Taylor-Lagrange expansion for ’s gradient gives us:
We denote . We want such that . Combined with (9), we have the following equality:
Taking the norm and using the Cauchy Schwarz inequality, we have that :
Assuming that the objective function minimum is not attained at step k, , we have: . We precisely take this critical value for at each step k in order to avoid a step too large to prevent oscillation phenomena. Our recursive relationship is now:
Using the fact that , we can conclude that
By convexity of , we have
We denote . Taking the expectation over all the random indices in (14), we obtain:
for any k by definition of which is the minimum of f. So, squaring both sides of the inequality, we have
Using the relation in (10), we apply Taylor-Lagrange to our objective function f at step k+1:
with . Therefore:
Taking the expectation over all the random indexes , we have :
Subtracting on both sides, we obtain the main recursive relation between and :
Using inequality (17), we ensure that:
References
- Almuallim and Dietterich (1994) Almuallim, H., Dietterich, T.G., 1994. Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence 69, 279–305.
- Blum and Langley (1997) Blum, A.L., Langley, P., 1997. Selection of relevant features and examples in machine learning. Artif. Intell. 97, 245–271.
- Guyon and Elisseeff (2003) Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182.
- Hastie et al. (2009) Hastie, T., Tibshirani, R., Friedman, J.H., 2009. The elements of statistical learning: data mining, inference, and prediction, 2nd Edition. Springer series in statistics, Springer.
- Koller and Sahami (1996) Koller, D., Sahami, M., 1996. Toward optimal feature selection, in: Proceedings of the Thirteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. pp. 284–292.
- Mangal and Holm (2018) Mangal, A., Holm, E.A., 2018. A comparative study of feature selection methods for stress hotspot classification in materials. ArXiv e-prints.
- Mitra et al. (2002) Mitra, P., Murthy, C.A., Pal, S.K., 2002. Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Mach. Intell. 24, 301–312.
- Nesterov (2012) Nesterov, Y., 2012. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization 22, 341–362.
- Tuv et al. (2009) Tuv, E., Borisov, A., Runger, G., Torkkola, K., 2009. Feature selection with ensembles, artificial variables, and redundancy elimination. J. Mach. Learn. Res. 10, 1341–1366.
- Wright (2015) Wright, S.J., 2015. Coordinate descent algorithms. Math. Program. 151, 3–34.
- Zarshenas and Suzuki (2016) Zarshenas, A., Suzuki, K., 2016. Binary coordinate ascent: An efficient optimization technique for feature subset selection for machine learning. Knowledge-Based Systems 110, 191–201.