Gradient Boosted Feature Selection
Abstract
A feature selection algorithm should ideally satisfy four conditions: reliably extract relevant features; be able to identify nonlinear feature interactions; scale linearly with the number of data points and features; and allow the incorporation of known sparsity structure. In this work we propose a novel feature selection algorithm, Gradient Boosted Feature Selection (GBFS), which satisfies all four of these requirements. The algorithm is flexible, scalable, and surprisingly straightforward to implement, as it is based on a modification of Gradient Boosted Trees. We evaluate GBFS on several real-world data sets and show that it matches or outperforms other state-of-the-art feature selection algorithms, yet it scales to larger data set sizes and naturally allows for domain-specific side information.
Zhixiang (Eddie) Xu
Washington University in St. Louis 
One Brookings Dr. 
St. Louis, USA 
xuzx@cse.wustl.edu 
Gao Huang 
Tsinghua University 
30 Shuangqing Rd. 
Beijing, China 
huangg09@mails.tsinghua.edu.cn 
Kilian Q. Weinberger 
Washington University in St. Louis 
One Brookings Dr. 
St. Louis, USA 
kilian@wustl.edu
Alice X. Zheng 
GraphLab 
936 N. 34th St. Ste 208 
Seattle, USA 
alicez@graphlab.com 
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Miscellaneous; I.5.2 [Pattern Recognition]: Design Methodology - Feature evaluation and selection

General Terms
Learning

Keywords
Feature selection; Large-scale; Gradient boosting
Feature selection (FS) [?] is an important problem in machine learning. In many applications, e.g., bioinformatics [?] [?] [?] or neuroscience [?], researchers hope to gain insight by analyzing how a classifier can predict a label and what features it uses. Moreover, effective feature selection leads to parsimonious classifiers that require less memory [?] and are faster to train and test [?]. It can also reduce feature extraction costs [?, ?] and lead to better generalization [?].
Linear feature selection algorithms such as LARS [?] are highly effective at discovering linear dependencies between features and labels. However, they fail when features interact in nonlinear ways. Nonlinear feature selection algorithms, such as Random Forest [?] or recently introduced kernel methods [?, ?], can cope with nonlinear interactions. But their computational and memory complexity typically grow superlinearly with the training set size. As data sets grow in size, this is increasingly problematic. Balancing the twin goals of scalability and nonlinear feature selection is still an open problem.
In this paper, we focus on the scenario where data sets contain a large number of samples. Specifically, we aim to perform efficient feature selection when the number of data points is much larger than the number of features ($n \gg d$). We start with the (NP-hard) feature selection problem that also motivated LARS [?] and LASSO [?]. But instead of using a linear classifier and approximating the feature selection cost with an $\ell_1$ norm, we follow [?] and use gradient boosted regression trees [?] for which greedy approximations exist [?].
The resulting algorithm is surprisingly simple yet very effective. We refer to it as Gradient Boosted Feature Selection (GBFS). Following the gradient boosting framework, trees are built with the greedy CART algorithm [?]. Features are selected sparsely following an important change in the impurity function: splitting on new features is penalized by a cost $\mu$, whereas reuse of previously selected features incurs no additional penalty.
GBFS has several compelling properties. 1. As it learns an ensemble of regression trees, it can naturally discover nonlinear interactions between features. 2. In contrast to, e.g., FS with Random Forests, it unifies feature selection and classification into a single optimization. 3. In contrast to existing nonlinear FS algorithms, its time and memory complexity scales as $O(dn)$, where $d$ denotes the feature dimensionality and $n$ the number of data points, and it is very fast in practice. (In fact, if the storage of the input data is not counted, the memory complexity of GBFS scales as $O(n)$.) 4. GBFS can naturally incorporate pre-specified feature cost structures or side information, e.g., select bags of features or focus on regions of interest, similar to generalized lasso in linear FS [?].
We evaluate this algorithm on several real-world data sets of varying difficulty and size, and we demonstrate that GBFS tends to match or outperform the accuracy and feature selection trade-off of Random Forest Feature Selection, the current state of the art in nonlinear feature selection.
We showcase the ability of GBFS to naturally incorporate side information about inter-feature dependencies on a real-world biological classification task [?]. Here, features are grouped into nine pre-specified bags with biological meaning. GBFS can easily adapt to this setting and select entire feature bags. The resulting classifier matches the best accuracy of competing methods (trained on many features) with only a single bag of features.
One of the most widely used feature selection algorithms is Lasso [?]. It minimizes the squared loss with $\ell_1$ regularization on the coefficient vector, which encourages sparse solutions. Although scalable to very large data sets, Lasso models only linear correlations between features and labels and cannot discover nonlinear feature dependencies.
[?] propose the Minimum Redundancy Maximum Relevance (mRMR) algorithm, which selects a subset of the most responsive features that have high mutual information with labels. Their objective function also penalizes selecting redundant features. Though the approach is elegant, computing mutual information is intractable when the number of instances is large, and thus the algorithm does not scale. HSIC Lasso [?], on the other hand, introduces nonlinearity by combining multiple kernel functions that each use a single feature. The resulting convex optimization problem aligns this kernel with a “perfect” label kernel. The algorithm requires constructing kernel matrices for all features, so its time and memory complexity scale quadratically with input data set size. Moreover, both algorithms separate feature selection and classification, and require additional time and computation for training classifiers using the selected features.
Several other works avoid expensive kernel computation while maintaining nonlinearity. Grafting [?] combines $\ell_1$ and $\ell_0$ regularization with a nonlinear classifier based on a nonconvex variant of the multilayer perceptron. Feature Selection for Ranking using Boosted Trees [?] selects the top features with the highest relative importance scores. [?] and [?] use Random Forest. Finally, while not a feature selection method, [?] employ Gradient Boosted Trees to learn cascades of classifiers to reduce test-time cost by incorporating feature extraction budgets into the classifier optimization.
Throughout this paper we type vectors in bold ($\mathbf{x}_i$), scalars in regular math type ($k$ or $C$), sets in cursive ($\mathcal{S}$) and matrices in capital bold ($\mathbf{F}$) font. Specific entries in vectors or matrices are scalars and follow the corresponding convention.
The data set consists of input vectors $\{\mathbf{x}_1, \dots, \mathbf{x}_n\} \subset \mathbb{R}^d$ with corresponding labels $\{y_1, \dots, y_n\} \subset \mathcal{Y}$ drawn from an unknown distribution. The labels can be binary, categorical (multiclass) or real-valued (regression). For the sake of clarity, we focus on binary classification, $\mathcal{Y} = \{-1, +1\}$, although the algorithm can be extended to multiclass and regression as well.
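Lasso [?] combines linear classification and $\ell_1$ regularization

$$\min_{\mathbf{w}} \sum_{(\mathbf{x}_i, y_i)} \ell(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda |\mathbf{w}|_1. \tag{1}$$

In its original formulation, $\ell(\cdot)$ is defined to be the squared loss, $\ell(\mathbf{x}_i, y_i, \mathbf{w}) = (\mathbf{w}^\top \mathbf{x}_i - y_i)^2$. However, for the sake of feature selection, other loss functions are possible. In the binary classification setting, where $y_i \in \{-1, +1\}$, we use the better suited log-loss, $\ell(\mathbf{x}_i, y_i, \mathbf{w}) = \log\left(1 + \exp(-y_i \mathbf{w}^\top \mathbf{x}_i)\right)$ [?].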
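The two losses mentioned above can be sketched in a few lines. This is a minimal illustration, assuming labels in {-1, +1} and a real-valued prediction `score`; the function names are hypothetical and not taken from the paper. The negative gradient of the log-loss with respect to the prediction is included because gradient boosting fits each new tree to exactly this quantity.

```python
import math

def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def squared_loss(score, y):
    """Squared loss (score - y)^2."""
    return (score - y) ** 2

def log_loss(score, y):
    """Log-loss log(1 + exp(-y * score)) for y in {-1, +1}, computed stably."""
    z = -y * score
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def neg_grad_log_loss(score, y):
    """Negative derivative of the log-loss with respect to `score`."""
    return y * _sigmoid(-y * score)
```

At `score = 0` the log-loss equals log 2 for either label, and the negative gradient points toward the true label with magnitude 0.5.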
$\ell_1$ regularization serves two purposes: it regularizes the classifier against overfitting, and it induces sparsity for feature selection. Unfortunately, these two effects of the $\ell_1$ norm are inherently tied and there is no way to regulate the impact of either one separately.
[?] introduce the capped $\ell_1$ norm, defined by the elementwise operation

$$q_\epsilon(w_i) = \min(|w_i|, \epsilon). \tag{2}$$

Its advantage over the standard $\ell_1$ norm is that once a feature is extracted, its use is not penalized further; i.e., it penalizes using many features but does not reward small weights. This is a much better approximation of the $\ell_0$ norm, which only penalizes feature use without interfering with the magnitude of the weights. When $\epsilon$ is small enough, i.e., no larger than the smallest nonzero $|w_i|$, we can compute the exact number of features extracted with $q_\epsilon(\mathbf{w})/\epsilon$. In other words, penalizing $q_\epsilon(\mathbf{w})$ is a close proxy for penalizing the number of extracted features. However, the capped $\ell_1$ norm is not convex and therefore not easy to optimize.
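As a small numeric illustration of the capped norm, the sketch below sums min(|w_i|, eps) over a weight vector and divides by eps to recover the number of nonzero weights. The function name `capped_l1` and the example weights are assumptions chosen for illustration only.

```python
def capped_l1(w, eps):
    """Capped l1 norm: sum over entries of min(|w_i|, eps)."""
    return sum(min(abs(wi), eps) for wi in w)

# Hypothetical weight vector with three nonzero entries.
w = [0.0, 0.75, -0.25, 0.0, 1.5]

# Choose eps no larger than the smallest nonzero magnitude.
eps = min(abs(wi) for wi in w if wi != 0.0)

plain_l1 = sum(abs(wi) for wi in w)      # grows with the weight magnitudes
n_features = capped_l1(w, eps) / eps     # counts the nonzero weights
```

Doubling every nonzero weight doubles `plain_l1` but leaves `n_features` unchanged, which is the decoupling of sparsity from shrinkage described in the text.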
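The capped $\ell_1$ norm can be combined with a regular $\ell_1$ (or $\ell_2$) norm, where one can control the trade-off between feature extraction and regularization by adjusting the corresponding regularization parameters, $\lambda, \mu \ge 0$:

$$\min_{\mathbf{w}} \sum_{(\mathbf{x}_i, y_i)} \ell(\mathbf{x}_i, y_i, \mathbf{w}) + \lambda |\mathbf{w}|_1 + \mu\, q_\epsilon(\mathbf{w}). \tag{3}$$

Here $q_\epsilon(\mathbf{w})$ denotes $\sum_{i=1}^{d} q_\epsilon(w_i)$.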
The classifier in Eq. (3) is better suited for feature selection than plain $\ell_1$ regularization. However, it is still linear, which limits the flexibility of the classifier. Standard approaches for incorporating nonlinearity include kernel learning [?] and boosting [?]. HSIC Lasso [?] uses kernel learning to discover nonlinear feature interactions at the price of quadratic memory and time complexity. Our method uses boosting, which is much more scalable.
Boosting assumes that one can preprocess the data with limited-depth regression trees. Let $\mathcal{H}$ be the set of all possible regression trees. Taking into account limited precision and counting trees that obtain identical values on the entire training set as one and the same tree, one can assume $|\mathcal{H}|$ to be finite (albeit possibly large). Assuming that inputs are mapped into $\mathbb{R}^{|\mathcal{H}|}$ through $\mathbf{h}(\mathbf{x}_i) = [h_1(\mathbf{x}_i), \dots, h_{|\mathcal{H}|}(\mathbf{x}_i)]^\top$, we propose to learn a linear classifier in this transformed space. Eq. (3) becomes

$$\min_{\boldsymbol{\beta}} \sum_{(\mathbf{x}_i, y_i)} \ell(\mathbf{h}(\mathbf{x}_i), y_i, \boldsymbol{\beta}) + \lambda |\boldsymbol{\beta}|_1 + \mu\, q_\epsilon(\boldsymbol{\beta}). \tag{4}$$

Here, $\boldsymbol{\beta}$ is a sparse linear vector that selects trees. Although it is extremely high dimensional, the optimization in Eq. (4) is tractable because $\boldsymbol{\beta}$ is extremely sparse. Assuming, without loss of generality, that the trees in $\mathcal{H}$ are sorted so that the first $T$ entries of $\boldsymbol{\beta}$ are nonzero, we obtain a final classifier

$$H(\mathbf{x}) = \sum_{t=1}^{T} \beta_t h_t(\mathbf{x}). \tag{5}$$

Eq. (4) has two penalty terms: the plain $\ell_1$ norm and the capped $\ell_1$ norm. The first penalty term reduces overfitting while the second selects features. However, in its current form, the capped $\ell_1$ norm selects trees rather than features. We therefore have to modify our setup to explicitly penalize the extraction of features.
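To model the total number of features extracted by an ensemble of trees, we define a binary matrix $\mathbf{F} \in \{0,1\}^{d \times T}$, where an entry $F_{ft} = 1$ if and only if the tree $h_t$ uses feature $f$. With this notation, we can express the total weight assigned to trees that extract feature $f$ as

$$\sum_{t=1}^{T} |F_{ft} \beta_t|. \tag{6}$$

We modify $q_\epsilon(\boldsymbol{\beta})$ to instead penalize the actual weight assigned to features. The final optimization becomes

$$\min_{\boldsymbol{\beta}} \sum_{(\mathbf{x}_i, y_i)} \ell(\mathbf{h}(\mathbf{x}_i), y_i, \boldsymbol{\beta}) + \lambda |\boldsymbol{\beta}|_1 + \mu \sum_{f=1}^{d} q_\epsilon\left(\sum_{t=1}^{T} |F_{ft} \beta_t|\right). \tag{7}$$

As before, if $\epsilon$ is sufficiently small ($\epsilon \le \min_t |\beta_t|$ over the selected trees), we can set $\mu = 1/\epsilon$ and the feature selection penalty coincides exactly with the number of features used.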
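The claim that the feature penalty reduces to a feature count can be checked on a toy ensemble. The sketch below assumes a hand-written feature-use matrix `F` (rows are features, columns are trees) and equal tree weights; all names and values are illustrative, not taken from the paper.

```python
def feature_weights(F, beta):
    """Total weight assigned to trees that use each feature."""
    return [sum(abs(F_ft * b) for F_ft, b in zip(row, beta)) for row in F]

def feature_penalty(F, beta, eps):
    """Sum over features of the capped total weight min(weight_f, eps)."""
    return sum(min(wf, eps) for wf in feature_weights(F, beta))

# Toy ensemble: 4 features, 3 trees. F[f][t] == 1 iff tree t uses feature f.
F = [
    [1, 0, 1],  # feature 0 is used by trees 0 and 2
    [0, 0, 0],  # feature 1 is never used
    [1, 1, 0],  # feature 2 is used by trees 0 and 1
    [0, 0, 0],  # feature 3 is never used
]
beta = [0.5, 0.5, 0.5]  # every tree received one step of size 0.5
eps = 0.5               # no larger than the smallest tree weight
mu = 1.0 / eps

n_used = mu * feature_penalty(F, beta, eps)  # number of distinct features used
```

Here two features are used by three trees, and the penalty counts each feature once regardless of how many trees reuse it.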
The optimization problem in Eq. (7) is nonconvex and nondifferentiable. Nevertheless, we can minimize it effectively (up to a local fixed point) with gradient boosting [?]. Let $\mathcal{L}(\boldsymbol{\beta})$ denote the loss function to be minimized and $\partial \mathcal{L} / \partial \beta_t$ the gradient w.r.t. $\beta_t$. Gradient boosting can be viewed as coordinate descent where we update the dimension with the steepest gradient at every step. We can assume that the set of all regression trees $\mathcal{H}$ is negation closed, i.e., for each $h \in \mathcal{H}$, we also have $-h \in \mathcal{H}$. This allows us to only follow negative gradients and always increase $\boldsymbol{\beta}$. Thus there is always a nonnegative optimal $\boldsymbol{\beta}$. The search for the dimension with the steepest negative gradient can be formalized as

$$h_t = \operatorname*{argmin}_{h_t \in \mathcal{H}} \frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \beta_t}. \tag{8}$$

In the remainder of this section we discuss approximate minimization strategies that do not require iterating over all possible trees.
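Since each step of the optimization increases a single dimension of $\boldsymbol{\beta}$ with a fixed step size $\alpha > 0$, the $\ell_1$ norm of $\boldsymbol{\beta}$ can be written in closed form as $|\boldsymbol{\beta}|_1 = \alpha T$ after $T$ iterations. This means that penalizing the $\ell_1$ norm of $\boldsymbol{\beta}$ is equivalent to early stopping after $T$ iterations [?]. We therefore drop the $\lambda |\boldsymbol{\beta}|_1$ term and instead introduce $T$ as an equivalent hyperparameter.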
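To find the steepest descent direction at iteration $T+1$, we decompose the (sub)gradient into two parts, one for the loss function $\ell(\cdot)$, and one for the capped $\ell_1$ norm penalty

$$\frac{\partial \mathcal{L}}{\partial \beta_t} = \frac{\partial \ell}{\partial \beta_t} + \mu \sum_{f=1}^{d} \frac{\partial q_\epsilon\left(\sum_{s=1}^{T} F_{fs} \beta_s\right)}{\partial \beta_t}. \tag{9}$$

(Hereafter we drop the absolute value around $F_{fs} \beta_s$, since both $F_{fs}$ and $\beta_s$ are nonnegative.) The gradient of $q_\epsilon\left(\sum_s F_{fs} \beta_s\right)$ is not well-defined at the cusp when $\sum_s F_{fs} \beta_s = \epsilon$. But we can take the right-hand limit, since $\beta_t$ never decreases,

$$\frac{\partial q_\epsilon\left(\sum_{s} F_{fs} \beta_s\right)}{\partial \beta_t} = \begin{cases} F_{ft} & \text{if } \sum_{s} F_{fs} \beta_s < \epsilon, \\ 0 & \text{if } \sum_{s} F_{fs} \beta_s \ge \epsilon. \end{cases} \tag{10}$$

If we set $\epsilon = \alpha$, where $\alpha$ is the step size, then $\sum_s F_{fs} \beta_s \ge \epsilon$ if and only if feature $f$ has already been used in a tree from a previous iteration. Let $\phi_f = 1$ indicate that feature $f$ is still unused, and $\phi_f = 0$ otherwise. With this notation we can combine the gradients from the two cases and replace the derivative in Eq. (10) with $\phi_f F_{ft}$. We obtain

$$\frac{\partial \mathcal{L}}{\partial \beta_t} = \frac{\partial \ell}{\partial \beta_t} + \mu \sum_{f=1}^{d} \phi_f F_{ft}. \tag{11}$$

Note that $\phi_f F_{ft} = 1$ if and only if feature $f$ is extracted for the first time in tree $h_t$. In other words, the second term effectively penalizes trees that use many new (previously not selected) features.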
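With Eq. (11) we can compute the gradient with respect to any tree. But finding the optimal $h_t$ would still require searching all trees. In the remainder of this section, we transform the search for $h_t$ from a search over all possible dimensions to a search for the best tree to minimize a pre-specified loss function. The new search can be approximated with the CART algorithm [?].

To this end, we apply the chain rule and decompose $\partial \ell / \partial \beta_t$ into the derivative of the loss $\ell$ w.r.t. the current prediction evaluated at each input, $H(\mathbf{x}_i)$, and the partial derivative $\partial H(\mathbf{x}_i) / \partial \beta_t$,

$$\frac{\partial \mathcal{L}}{\partial \beta_t} = \sum_{i=1}^{n} \frac{\partial \ell}{\partial H(\mathbf{x}_i)} \frac{\partial H(\mathbf{x}_i)}{\partial \beta_t} + \mu \sum_{f=1}^{d} \phi_f F_{ft}. \tag{12}$$

Note that $H(\mathbf{x}_i) = \boldsymbol{\beta}^\top \mathbf{h}(\mathbf{x}_i)$ is just a linear sum of all $h_t(\mathbf{x}_i)$, the predictions over training data. Thus $\partial H(\mathbf{x}_i) / \partial \beta_t = h_t(\mathbf{x}_i)$. If we let $g_i$ denote the negative gradient $g_i = -\partial \ell / \partial H(\mathbf{x}_i)$, we can reformulate Eq. (8) as

$$h_t = \operatorname*{argmin}_{h_t \in \mathcal{H}} \sum_{i=1}^{n} -g_i h_t(\mathbf{x}_i) + \mu \sum_{f=1}^{d} \phi_f F_{ft}. \tag{13}$$

Similar to [?], we restrict $\mathcal{H}$ to only normalized trees (i.e., $\sum_i h_t^2(\mathbf{x}_i) = 1$). We can then add two constant terms, $\frac{1}{2} \sum_i h_t^2(\mathbf{x}_i)$ and $\frac{1}{2} \sum_i g_i^2$, to Eq. (13) and complete the square:

$$h_t = \operatorname*{argmin}_{h_t \in \mathcal{H}} \frac{1}{2} \sum_{i=1}^{n} \left(g_i - h_t(\mathbf{x}_i)\right)^2 + \mu \sum_{f=1}^{d} \phi_f F_{ft}. \tag{14}$$

This is now a penalized squared loss (an impurity function), and a good solution can be found efficiently via the greedy CART algorithm for learning regression trees [?]. The first term in Eq. (14) encourages feature splits to best match the negative gradient of the loss function, and the second term rewards splitting on features which have already been used in previous iterations. Algorithm 1 summarizes the overall algorithm in pseudocode.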
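The penalized impurity search can be sketched in pure Python. This is a simplified illustration rather than the authors' implementation: it assumes depth-1 trees (stumps) instead of limited-depth CART trees, does not normalize the trees, uses the log-loss with labels in {-1, +1}, and all function names, the toy data, and the hyperparameter values are hypothetical.

```python
import math

def neg_gradient(scores, y):
    """Negative gradient of log(1 + exp(-y * s)) with respect to s."""
    return [yi / (1.0 + math.exp(min(yi * s, 700.0))) for s, yi in zip(scores, y)]

def best_stump(X, g, used, mu):
    """Greedy search for the stump minimising
    0.5 * sum_i (g_i - h(x_i))**2 + mu * [split feature is new]."""
    n, d = len(X), len(X[0])
    best = None
    for f in range(d):
        penalty = 0.0 if f in used else mu  # previously used features are free
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = 0.5 * (lo + hi)
            left = [g[i] for i in range(n) if X[i][f] <= thr]
            right = [g[i] for i in range(n) if X[i][f] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sq = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            cost = 0.5 * sq + penalty
            if best is None or cost < best[0]:
                best = (cost, f, thr, ml, mr)
    return best

def gbfs_fit(X, y, mu, n_iter=20, alpha=0.1):
    """Boost stumps on the negative gradient, charging mu per new feature."""
    scores, used, ensemble = [0.0] * len(X), set(), []
    for _ in range(n_iter):
        best = best_stump(X, neg_gradient(scores, y), used, mu)
        if best is None:  # no feature admits a split
            break
        _, f, thr, ml, mr = best
        used.add(f)
        ensemble.append((f, thr, ml, mr))
        for i, row in enumerate(X):
            scores[i] += alpha * (ml if row[f] <= thr else mr)
    return ensemble, used

def gbfs_predict(ensemble, row, alpha=0.1):
    s = sum(alpha * (ml if row[f] <= thr else mr) for f, thr, ml, mr in ensemble)
    return 1 if s >= 0 else -1

# Toy data: feature 0 determines the label, feature 1 is uninformative.
X = [[i / 10, ((i * 7) % 10) / 10] for i in range(10)]
y = [1 if i >= 5 else -1 for i in range(10)]
ensemble, used = gbfs_fit(X, y, mu=10.0)
```

On this toy problem the first stump splits on feature 0, after which feature 0 is free to reuse while feature 1 still costs `mu`, so the ensemble keeps selecting the same feature.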
In many feature selection applications, one may have additional qualitative knowledge about acceptable sparsity patterns. Sometimes features can be grouped into bags and the goal is to select as few bags as possible. Prior work on handling structured sparsity includes group lasso [?, ?] and generalized lasso [?]. Our framework can easily handle structured sparsity via the feature cost indicator function $\phi_f$. For example, we can define $\phi_f = 1$ if and only if no feature from the same bag as $f$ has been used in the past, and $\phi_f = 0$ otherwise. The moment a feature from a particular bag is used in a tree, all other features in the same bag become “free” and the classifier is encouraged to use features from this bag exclusively until it starts to see diminishing returns.
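A bag-aware cost of this kind is simple to express. The sketch below assumes a hypothetical mapping `bag_of` from feature index to bag label and a set of already used features; neither name comes from the paper.

```python
def bag_cost(feature, used_features, bag_of):
    """Return 1 if no feature from `feature`'s bag has been used yet, else 0."""
    used_bags = {bag_of[f] for f in used_features}
    return 0 if bag_of[feature] in used_bags else 1

# Hypothetical grouping of five features into three bags.
bag_of = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C"}
used = {1}  # one feature from bag "A" has already been selected

costs = [bag_cost(f, used, bag_of) for f in range(5)]
```

With feature 1 already selected, both features of bag "A" are free, while features from bags "B" and "C" still incur the cost.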
In the most general setting, we can define $\phi_f$ as a function that maps from the set of previously extracted features, $\Omega$, to a cost. For example, one could imagine settings where feature extraction appears in stages. Extracting feature $f$ makes feature $f'$ cheaper, but not free. One such application might be that of classifying medical images (e.g., MRI scans) where the features are raw pixels and feature groups are local regions of interest. In this case, $\phi_f$ may reduce the “price” of pixels surrounding those in $\Omega$ to encourage feature selection with local focus.
In this section, we evaluate GBFS against other state-of-the-art feature selection methods on synthetic as well as real-world benchmark data sets. We also examine its capacity for dealing with known sparsity patterns in a bioinformatics application. All experiments were conducted on a desktop with dual 6-core Intel i7 CPUs at 2.66 GHz, 96 GB RAM, and Linux version 2.6.32.x86_64.
Figure 1 illustrates a synthetic binary classification data set with three features. The data is not linearly separable in either two dimensions or three dimensions. However, a good nonlinear classifier can easily separate the data using two of the features. The third feature is simply a linear combination of the other two and thus redundant. We randomly select 90% of the instances for training and the rest for testing.
Figure 1 (left panel) illustrates results from $\ell_1$-regularized logistic regression (L1LR) [?, ?]. The regularization parameter is tuned on a hold-out set. Although L1LR successfully detects and ignores the redundant feature, it also assigns zero weight to one of the two informative features and selects only a single feature. Consequently, it has a poor classification error rate on the test set (). In contrast, GBFS (Figure 1, right panel) not only identifies the redundant feature, but also detects that the labels are related to a nonlinear combination of the two remaining features. It selects both of them and successfully separates the data, achieving classification error.
In many applications there may be prior constraints on the sparsity patterns. Since GBFS can naturally incorporate pre-specified feature structures, we use it to perform structured feature selection on the Colon data set (available through the Princeton University gene expression project, http://microarray.princeton.edu/oncology/). In this data set, tumor and normal colon tissues for 6500 human genes are measured using Affymetrix gene chips. [?] select 2000 genes that have the highest minimal intensity across the samples. [?] further analyze these genes and cluster them into 9 clusters/bags by their biological meaning. The task is to classify whether a tissue is normal or tumor. We randomly split the tissues into 80/20 training and testing data sets, repeated over random splits. We use the feature-bag cost function described earlier to incorporate this side information (setting the cost of all features in a bag to zero once the first feature is extracted). Feature selection that ignores this bag information not only performs and generalizes poorly, but is also difficult to interpret and justify.
Figure 2 shows the selected features from one random split and classification results averaged over splits. Selected features are colored in green, and unselected ones are in blue. A bag is highlighted with a red/white box if at least one of its features is selected by the classifier. We compare against $\ell_1$-regularized logistic regression (L1LR) [?, ?], Random Forest feature selection (RFFS) [?], HSIC Lasso [?] and Group Lasso [?].
As shown in Figure 2, because GBFS can incorporate the bag structures, it focuses on selecting features in one specific bag. Throughout training, it only selects features from bag 8 (highlighted with a red/white box). This conveniently reveals the association between diseases and gene clusters/bags. Similar to GBFS, Group Lasso with logistic regression can also deal with structured features. However, its regularization has side effects on feature weights, and thus results in a much higher classification error rate . In contrast, $\ell_1$-regularized logistic regression, Random Forest and HSIC Lasso do not take bag information into consideration. They select scattered features from different bags, making results difficult to interpret. In terms of classification accuracy, GBFS and Random Forest have the lowest test set error rate (), whereas $\ell_1$-regularized logistic regression (L1LR) and HSIC Lasso achieve error rates of and , respectively.
There are two reasons why GBFS can be accurate with features from only a single bag. First, it is indeed the case that the genes in bag 8 are very predictive for the task of whether the tissue is malignant or benign (a result that may be of high biological value). Second, GBFS does not penalize further feature extraction inside bag 8 while other methods do; since bag 8 features are the most predictive, penalizing against them leads to a worse classifier.
data set    pcmac  uspst  spam  isolet  mnist3vs8  adult  kddcup99
#training   1555   1606   3681  6238    11982      32562  4898431
#testing    388    401    920   1559    1984       16282  311029
#features   3289   256    57    617     784        123    122

Table 1: Data set statistics. Data sets are ordered by the number of training instances.

We evaluate GBFS on real-world benchmark data sets of varying sizes, domains and levels of difficulty. Table 1 lists data set statistics ordered by increasing numbers of training instances. We focus on data sets with a large number of training examples ($n \gg d$). All tasks are binary classification, though GBFS naturally extends to the regression setting, so long as the loss function is differentiable and continuous. Multiclass classification problems can be reduced to binary ones, either by selecting the two classes that are most easily confused or (if those are not known) by grouping labels into two sets.
The first baseline is $\ell_1$-regularized logistic regression (L1LR) [?, ?]. We vary the regularization parameter to select different numbers of features and examine the test error rates under each setting.
We also compare against Random Forest feature selection (RFFS) [?], a nonlinear classification and feature selection algorithm. The learner builds many full decision trees by bagging training instances over random subsets of features. Feature selection is done by ranking features based on their impurity improvement score, aggregated over all trees and all splits. Features with larger impurity improvements are considered more important. For each data set, we train a Random Forest with trees and a maximum number of elements per leaf node. After training all trees, we rank all features. Starting from the top of the list, we retrain a random forest with only the top-$k$ features and evaluate its accuracy on the test set. We increase the number of selected features until all features are included.
The next state-of-the-art baseline is Minimum Redundancy Maximum Relevance (mRMR) [?], a nonlinear feature selection algorithm that ranks features based on their mutual information with the labels. Again, we increase the selected feature set starting from the top of the list. At each stopping point, we train an RBF kernel SVM using only the features selected so far. The hyperparameters are tuned on 5 different random 80/20 splits of the training data. The final reported test error rates are based on the SVM trained on the full training set with the best hyperparameter setting.
Finally, we compare against HSIC Lasso [?], a convex extension to Greedy HSIC [?]. HSIC Lasso builds a kernel matrix for each feature and combines them to best match an ideal kernel generated from the labels. It encourages feature sparsity via an $\ell_1$ penalty on the linear combination coefficients. Similar to $\ell_1$-regularized logistic regression, we evaluate a wide range of regularization parameters to sweep out the entire feature selection curve. Since HSIC Lasso is a two-step algorithm, we train a kernel SVM with the selected features to perform classification. Similar to the mRMR experiment, we use cross-validation to select hyperparameters and average over runs.
To evaluate GBFS, we perform 5 random 80/20 training/validation splits. We use the validation set to choose the depth of the regression trees and the number of iterations (maximum iterations is set to ). The learning rate is set to for all data sets. In order to show the whole error-rate curve, we evaluate the algorithm at values for the feature selection trade-off parameter $\mu$ in Eq. (7) (i.e., ).
Figure 3 shows the feature selection and classification performance of different algorithms on small and medium sized data sets. We select up to 100 features except for spam () and pcmac (). In general, $\ell_1$-regularized logistic regression (L1LR), Random Forest (RFFS) and GBFS easily scale to all data sets. RFFS and GBFS both clearly outperform L1LR in accuracy on all data sets due to their ability to capture nonlinear feature-label relationships. HSIC Lasso is very sensitive to the data size (both the number of training instances and the number of features), and only scales to two small data sets (uspst, spam). mRMR is even more restrictive (more sensitive to the number of training instances) and thus only works for uspst. Both of them run out of memory on pcmac, which has the largest number of features. In terms of accuracy, GBFS clearly outperforms HSIC Lasso on spam but performs slightly worse on uspst. On all small and medium data sets, GBFS either outperforms RFFS or matches its classification performance. However, unlike RFFS, GBFS is a one-step approach that selects features and learns a classifier at the same time, whereas RFFS requires retraining a classifier after feature selection. This means that GBFS is much faster to train than RFFS.
The last data set in Table 1 (kddcup99) contains close to 5 million training instances. Training on such large data sets can be very time-consuming. We limit GBFS to trees with the default hyperparameters of tree depth = 4 and learning rate = 0.1. Training Random Forest with default hyperparameters would take more than a week. Therefore, we limit the number of trees to 500 and the maximum number of instances per leaf node to . Feature selection and classification results are shown in Figure 4. For GBFS, instead of plotting the best performing results for each $\mu$, we plot the whole feature selection iteration curve for multiple values of $\mu$. GBFS obtains lower error rates than Random Forest (RFFS) and $\ell_1$-regularized logistic regression (L1LR) when few features are selected. (Note that due to the extremely large data set size, even improvements of are considered significant.)
To evaluate the quality of the selected features, we separate the contribution of the feature selection from the effect of using different classifiers. We apply all algorithms on the smallest data set (uspst) to select a subset of the features and then train an SVM with RBF kernel on the respective feature subset. Figure 5 shows the error rates as a function of the number of selected features. GBFS obtains the lowest error rates in the (most interesting) regions of only a few selected features. As more features are selected, eventually all FS algorithms converge to similar values. It is worth noting that the linear classifier (L1LR) slightly outperforms most nonlinear methods when given enough features, which suggests that the uspst digits data requires a nonlinear classifier for prediction but not for feature discovery.
While GBFS focuses on the scenario where the number of training data points is much larger than the number of features ($n \gg d$), we also evaluate GBFS on a traditional feature selection benchmark data set, SMK-CAN-187 [?], which is publicly available from [?]. This binary classification data set contains data points and features. We average results over 5 random train-test splits. Figure 6 compares the results. GBFS outperforms $\ell_1$-regularized logistic regression (L1LR), HSIC Lasso and Random Forest feature selection (RFFS), though by a small margin in some regions.
Not surprisingly, the linear method (L1LR) is the fastest by far. Both mRMR and HSIC Lasso take significantly more time than Random Forest and GBFS because they involve either mutual information or kernel matrix computation, which scales as or . Random Forest builds full trees, requiring a time complexity of per tree. The dependency on is slightly misleading, as the number of trees required for Random Forests is also dependent on the number of features and scales as itself. In contrast, GBFS only builds limited depth (depth = ) trees, and the computation time complexity is $O(dn)$. The number of iterations $T$ is independent of the number of input features $d$; it is only a function of the number of desired features. Empirically, we observe that the two algorithms are comparable in speed but GBFS is significantly faster on data sets with many instances (large $n$). The training time ranged from several seconds to minutes on the small data sets to about one hour on the large data set kddcup99 (when Random Forest is trained with only 500 trees and large leaf sizes). Admittedly, the empirical comparison of training time is slightly problematic because our Random Forest implementation is based on highly optimized C++ code, whereas GBFS is implemented in Matlab™. We expect that GBFS could be made significantly faster if implemented in a faster programming language (e.g., C++) with the incorporation of known parallelization techniques for limited depth trees [?].
This paper introduces GBFS, a novel algorithm for nonlinear feature selection. The algorithm quickly trains very accurate classifiers while selecting high quality features. In contrast to most prior work, GBFS is based on a variation of gradient boosting of limited depth trees [?]. This approach has several advantages. It scales naturally to large data sets and it combines learning a powerful classifier and performing feature selection into a single step. It can easily incorporate known feature dependencies, a common setting in biomedical applications [?] [?], medical imaging [?] [?] and computer vision [?]. This has the potential to unlock interesting new discoveries in a variety of application domains. From a practitioner's perspective, it is now worth investigating whether a data set has inter-feature dependencies that could be provided as additional side information to the algorithm.
One bottleneck of GBFS is that it explores features using the CART algorithm, which has a complexity of $O(dn)$. This may become a problem in cases with millions of features. Although this paper primarily focuses on the $n \gg d$ scenario, as future work we plan to consider improving the scalability with respect to $d$. One promising approach is to restrict the search to a random subset of new features, akin to Random Forest. However, in contrast to Random Forest, the iterative nature of GBFS allows us to bias the sampling probability of a feature by its splitting value from previous iterations, thus avoiding unnecessary selection of unimportant features.
We are excited by the promising results of GBFS and believe that the use of gradient boosted trees for feature selection will lead to many interesting follow-up results. This will hopefully spark new algorithmic developments and improved feature discovery across application domains.
KQW was supported by NSF grants 1149882 and 1137211. KQW and ZEX were supported by NSF IIS1149882 and IIS 1137211. Part of this work was done while ZEX was an intern at Microsoft Research, Redmond.
 [1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12):6745–6750, 1999.
 [2] L. Breiman. Classification and regression trees. Chapman & Hall/CRC, 1984.
 [3] O. Chapelle, P. Shivaswamy, S. Vadrevu, K. Weinberger, Y. Zhang, and B. Tseng. Boosted multi-task learning. Machine Learning, 85(1):149–173, 2011.
 [4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
 [5] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.
 [6] J. A. Etzel, V. Gazzola, and C. Keysers. An introduction to anatomical ROI-based fMRI classification analysis. Brain Research, 1282:114–125, 2009.
 [7] J. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, pages 1189–1232, 2001.
 [8] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.
 [9] T. Hastie, R. Tibshirani, and J. H. Friedman. The elements of statistical learning. Springer, 2009.
 [10] J. L. Hellrung Jr, L. Wang, E. Sifakis, and J. M. Teran. A second order virtual node method for elliptic problems with interfaces and irregular domains in three dimensions. Journal of Computational Physics, 231(4):2015–2048, 2012.
 [11] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. The Journal of Machine Learning Research, 12:3371–3412, 2011.
 [12] S. Lee, H. Lee, P. Abbeel, and A. Y. Ng. Efficient L1 regularized logistic regression. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 401. AAAI Press, 2006.
 [13] Y. Liu, M. Sharma, C. Gaona, J. Breshears, J. Roland, Z. Freudenburg, E. Leuthardt, and K. Q. Weinberger. Decoding ipsilateral finger movements from ecog signals in humans. In Advances in Neural Information Processing Systems, pages 1468–1476, 2010.
 [14] S. Ma, X. Song, and J. Huang. Supervised group lasso with applications to microarray data analysis. BMC bioinformatics, 8(1):60, 2007.
 [15] L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008.
 [16] F. Pan, T. Converse, D. Ahn, F. Salvetti, and G. Donato. Feature selection for ranking using boosted trees. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 2025–2028. ACM, 2009.
 [17] M. Y. Park and T. Hastie. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):659–677, 2007.
 [18] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1226–1238, 2005.
 [19] S. Perkins, K. Lacker, and J. Theiler. Grafting: Fast, incremental feature selection by gradient descent in function space. The Journal of Machine Learning Research, 3:1333–1356, 2003.
 [20] V. Roth. The generalized lasso. Neural Networks, IEEE Transactions on, 15(1):16–28, 2004.
 [21] V. Roth and B. Fischer. The group-lasso for generalized linear models: Uniqueness of solutions and efficient algorithms. In Proceedings of the 25th international conference on Machine learning, pages 848–855. ACM, 2008.
 [22] Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.
 [23] B. Schölkopf and A. Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT press, 2001.
 [24] L. Song, A. Smola, A. Gretton, J. Bedo, and K. Borgwardt. Feature selection via dependence maximization. The Journal of Machine Learning Research, 13:1393–1434, 2012.
 [25] A. Spira, J. E. Beane, V. Shah, K. Steiling, G. Liu, F. Schembri, S. Gilman, Y.M. Dumas, P. Calner, P. Sebastiani, et al. Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nature medicine, 13(3):361–366, 2007.
 [26] S. Sra. Fast projections onto $\ell_{1,q}$-norm balls for grouped feature selection. In Machine Learning and Knowledge Discovery in Databases, pages 305–317. Springer, 2011.
 [27] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 [28] E. Tuv, A. Borisov, G. Runger, and K. Torkkola. Feature selection with ensembles, artificial variables, and redundancy elimination. The Journal of Machine Learning Research, 10:1341–1366, 2009.
 [29] S. Tyree, K. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In WWW, pages 387–396. ACM, 2011.
 [30] L. Wang and P.-O. Persson. A discontinuous Galerkin method for the Navier-Stokes equations on deforming domains using unstructured moving space-time meshes. In 21st AIAA Computational Fluid Dynamics Conference, page 2833, 2013.
 [31] L. Wang and P.-O. Persson. A high-order discontinuous Galerkin method with unstructured space–time meshes for two-dimensional compressible flows on domains with large deformations. Computers & Fluids, 118:53–68, 2015.
 [32] Z. Xu, M. K., M. Chen, and K. Q. Weinberger. Cost-sensitive tree of classifiers. In S. Dasgupta and D. Mcallester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 133–141. JMLR Workshop and Conference Proceedings, 2013.
 [33] Z. Xu, M. Kusner, G. Huang, and K. Q. Weinberger. Anytime representation learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1076–1084, 2013.
 [34] Z. Xu, K. Weinberger, and O. Chapelle. The greedy miser: Learning under test-time budgets. In ICML, pages 1175–1182, 2012.
 [35] M. Yamada, W. Jitkrittum, L. Sigal, E. P. Xing, and M. Sugiyama. High-dimensional feature selection by feature-wise nonlinear lasso. arXiv preprint arXiv:1202.0515, 2012.
 [36] T. Zhang. Multi-stage convex relaxation for learning with sparse regularization. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1929–1936. 2008.
 [37] Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, and H. Liu. Advancing feature selection research. ASU Feature Selection Repository, 2010.
APPENDIX
We further extend our experimental results by incorporating more one-vs-one pairs from the MNIST data set. We randomly pick one-vs-one pairs from MNIST and evaluate GBFS and other feature selection algorithms on them. The first baseline is L1-regularized logistic regression (L1-LR). We vary the regularization parameter to select different numbers of features and examine the error rates under these settings. We also compare against Random Forest feature selection [?]. Following the procedure described earlier in the paper, we run Random Forest with trees and a maximum number of elements per leaf node. After training all trees, we rank all features. Starting from the most important feature, we retrain a random forest using only the selected features and evaluate it on the test set. We gradually include less important features until all features are included. The other two baselines (mRMR and HSIC Lasso) do not scale to the MNIST data set.
The corresponding figure indicates that GBFS consistently matches Random Forest FS and clearly outperforms L1-regularized logistic regression.
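The rank-then-retrain protocol used for the Random Forest baseline can be sketched as follows. This is an illustration under stated assumptions, not the code behind the reported results: a nearest-centroid classifier stands in for the random forest so that the example needs no external library, the `importance` scores are hypothetical rather than computed by a forest, and the data are a hand-made toy set in which only the first feature is informative.

```python
def nearest_centroid_error(train, test, selected):
    """Test error of a nearest-centroid classifier restricted to `selected` features.

    Stand-in for retraining a random forest on the selected features.
    """
    # Per-class centroid computed over the selected features only.
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(selected))
        for i, j in enumerate(selected):
            acc[i] += x[j]
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        # Assign the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda y: sum((x[j] - c) ** 2 for j, c in zip(selected, centroids[y])),
        )

    wrong = sum(1 for x, y in test if predict(x) != y)
    return wrong / len(test)


def error_curve(importance, train, test, sizes):
    """Rank features by importance, then retrain on the top-k features for each k."""
    ranked = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    return [(k, nearest_centroid_error(train, test, ranked[:k])) for k in sizes]


# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
train = [([0.0, 5.0, 1.0], 0), ([0.2, 1.0, 9.0], 0),
         ([1.0, 4.0, 2.0], 1), ([0.8, 0.0, 8.0], 1)]
test = [([0.1, 9.0, 0.0], 0), ([0.9, 0.0, 9.0], 1),
        ([0.3, 0.0, 5.0], 0), ([0.7, 9.0, 5.0], 1)]
importance = [0.9, 0.1, 0.0]  # hypothetical scores, most important feature first
curve = error_curve(importance, train, test, sizes=[1, 2, 3])
```

Each entry of `curve` pairs a number of selected features with the resulting test error, which is the kind of error-versus-feature-count curve the comparison above is based on.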
