MaxMargin Feature Selection
Abstract
Many machine learning applications such as in vision, biology and social networking deal with data in high dimensions. Feature selection is typically employed to select a subset of features which improves generalization accuracy as well as reduces the computational cost of learning the model. One of the criteria used for feature selection is to jointly minimize the redundancy and maximize the relevance of the selected features. In this paper, we formulate the task of feature selection as a one class SVM problem in a space where features correspond to the data points and instances correspond to the dimensions. The goal is to look for a representative subset of the features (support vectors) which describes the boundary for the region where the set of the features (data points) exists. This leads to a joint optimization of relevance and redundancy in a principled maxmargin framework. Additionally, our formulation enables us to leverage existing techniques for optimizing the SVM objective resulting in highly computationally efficient solutions for the task of feature selection. Specifically, we employ the dual coordinate descent algorithm (Hsieh et al., 2008), originally proposed for SVMs, for our formulation. We use a sparse representation to deal with data in very high dimensions. Experiments on seven publicly available benchmark datasets from a variety of domains show that our approach results in orders of magnitude faster solutions even while retaining the same level of accuracy compared to the state of the art feature selection techniques.
Keywords: Feature Selection, One class SVM, MaxMargin
10 May 2013; available online 15 May 2013
1 Introduction
Many machine learning problems in vision, biology, social networking and several other domains need to deal with very high dimensional data. Many of these attributes may not be relevant for the final prediction task and act as noise during the learning process. A number of feature selection methods have already been proposed in the literature to deal with this problem. These can be broadly categorized into filter based, wrapper based and embedded methods.
In filter based methods, features (or subsets of features) are ranked based on their statistical importance, and the ranking is oblivious to the classifier being used (Guyon and Elisseeff, 2003; Peng et al., 2005). Wrapper based methods select subsets of features heuristically and use classification accuracy to estimate the goodness of the selected subset (Kumar et al., 2012). These methods typically result in good accuracy but incur a high computational cost because the classifier must be trained multiple times. In embedded methods, the feature selection criterion is directly incorporated in the objective function of the classifier (Tan et al., 2010; Yiteng et al., 2012). Many filter and wrapper based methods fail on very high dimensional datasets due to their high time and memory requirements, and also because they are inapplicable to sparse datasets (Guyon and Elisseeff, 2003; Yiteng et al., 2012).
In the literature, various maxmargin formulations have been developed for many applications (Burges, 1998; Guo et al., 2007). Recently, we proposed a hard margin primal formulation for feature selection using a quadratic program (QP) solver (Prasad et al., 2013). This approach jointly minimizes redundancy and maximizes relevance in a maxmargin framework. We formulated the task of feature selection as a one class SVM problem (Schölkopf et al., 2000) in the dual space where features correspond to the data points and instances correspond to the dimensions. The goal is to search for a representative subset of the features (support vectors) which describes the boundary for the region in which the set of the features (data points) lies. This is equivalent to searching for a hyperplane which maximally separates the data points from the origin (Schölkopf et al., 2000).
In this paper, we extend the hard margin formulation to develop a general soft margin framework for feature selection. We also modify the primal and dual formulations, and present the dual objective as an unconstrained optimization problem. We employ the Dual Coordinate Descent (DCD) algorithm (Hsieh et al., 2008) for solving our formulation. The DCD algorithm simultaneously uses the information in the primal as well as in the dual to come up with a very fast solver for the SVM objective. In order to apply the DCD approach, our formulation has been appropriately modified by including an additional term in the dual objective, which can be seen as a regularizer on the feature weights. The strength of this regularizer can be tuned to control the sparsity of the selected feature weights. We adapt the liblinear implementation (Fan et al., 2008) for our proposed framework so that our approach is scalable to data in very high dimensions. We also show that Quadratic Programming Feature Selection (QPFS) (Rodriguez-Lujan et al., 2010) falls out as a special case of our formulation in the dual space when using a hard margin.
Experiments on seven publicly available datasets from the vision, biology and Natural Language Processing (NLP) domains show that our approach results in orders of magnitude faster solutions compared to the state of the art techniques while retaining the same level of accuracy.
2 Proposed MaxMargin Framework
The key objective in feature selection is to select a subset of features which are highly relevant (that is, have high predictive accuracy) and non-redundant (that is, uncorrelated). Relevance is captured either using an explicit metric (such as the correlation between a feature and the target variable) or implicitly using the classifier accuracy on the subset of features being selected. Redundancy is captured using metrics such as the correlation coefficient or mutual information. Most of the existing feature selection methods rely on a pairwise notion of similarity to capture redundancy (Rodriguez-Lujan et al., 2010; Peng et al., 2005; Yu and Liu, 2003).
We try to answer the question: "Is there a principled approach to jointly capturing the relevance as well as the redundancy amongst the features?" To do this, we flip the problem around and examine the space where the features themselves become first class objects. In particular, we analyze the space where "features" represent the data points and "instances" represent the dimensions. Which boundary could best describe the set of features lying in this space? Locating the desired boundary is similar to the one class SVM formulation (Schölkopf et al., 2000). Equivalently, this can be formulated as the problem of searching for a hyperplane which maximally separates the features (data points) from the origin in the appropriate kernel space over the features. In order to incorporate feature relevance, we construct a set of parallel marginal hyperplanes, one hyperplane for each feature. The margin of each separating hyperplane captures the relevance of the corresponding feature. The greater the relevance, the higher the margin required (a greater margin increases the chances of a feature being a support vector). Redundancy among the features is captured implicitly in our framework. The support vectors which lie on their respective margin boundaries constitute the desired subset of features to be selected. This leads to a principled maxmargin framework for feature selection. The proposed formulation for MMFS is presented hereafter.
2.1 Formulation
Let $X \in \mathbb{R}^{n \times d}$ represent the data matrix where each row vector $x_i^T$ ($i = 1, \dots, n$) denotes an instance and each column vector $f_j$ ($j = 1, \dots, d$) denotes a feature vector. We will use $\phi$ to denote a feature map such that the dot product between the data points can be computed via a kernel $k(f_i, f_j) = \phi(f_i)^T \phi(f_j)$, which can be interpreted as the similarity of $f_i$ and $f_j$. We will use $y$ to denote the vector of class labels $y_i$ ($i = 1, \dots, n$). Based on the above notation, we present the following formulation for feature selection in the primal:

$$
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \frac{1}{2} w^T w + b + C \sum_{i=1}^{d} \xi_i \\
\text{subject to} \quad & w^T \phi(f_i) + b \ge r_i - \xi_i, \quad \xi_i \ge 0, \quad \forall i = 1, \dots, d
\end{aligned}
\tag{1}
$$

where $w$ represents a vector normal to the separating hyperplane(s) (all the separating hyperplanes are parallel to each other in our framework), $b$ represents the bias term and the $\xi_i$'s represent slack variables. $r_i$ captures the relevance of the $i$-th feature. The equation of the separating hyperplane is given by $w^T \phi(f) + b = 0$, with the distance of the hyperplane from the origin being $-b / \|w\|$. Note that in this formulation the objective function is similar to that of the one class SVM (Schölkopf et al., 2000). However, the constraints are quite different, as our formulation includes the relevance of the features ($r_i$). The choice of $\phi$ determines the kind of similarity (correlation) to be captured among the features. The set of support vectors obtained after optimizing this problem, i.e. $\{f_i : w^T \phi(f_i) + b = r_i\}$, and the margin violators $\{f_i : \xi_i > 0\}$ constitute the set of features to be selected. In the dual space, this translates to those features being selected for which $\alpha_i > 0$, where $\alpha_i$ is the Lagrange multiplier for the $i$-th margin constraint. We will refer to our approach as MaxMargin Feature Selection (MMFS). Note that when dealing with the hard margin (no noise) case, $C \to \infty$ and the term involving $\xi$ disappears (since this enforces $\xi_i = 0$).
Figure 1 illustrates the intuition behind our proposed framework in the linear dot product space (with hard margin). In the figure, $w^T f + b = 0$ represents the separating hyperplane. The distance of this hyperplane from the origin is given by $-b / \|w\|$. The first term in the objective of Equation 1 tries to minimize $\|w\|$, i.e. maximize $1 / \|w\|$. The second term in the objective tries to minimize $b$, i.e. maximize $-b$. Hence, the overall objective tries to push the plane away from the origin. The $i$-th dashed plane represents the margin boundary for the $i$-th feature. The distance of this marginal hyperplane from the separating hyperplane is given by $r_i / \|w\|$, where $r_i$ is the precomputed relevance of the $i$-th feature. Therefore, minimizing $\|w\|$ in the objective also amounts to maximizing this marginal distance ($r_i / \|w\|$). Hence, the objective has the dual goal of pushing the hyperplane away from the origin while also maximizing the margin for each feature (weighted by its relevance). The features which lie on their respective marginal planes are the support features (encircled points). The redundancy is explicitly captured in the dual formulation of this problem.
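The selection rule of this section (keep the features whose Lagrange multiplier is positive) can be sketched in a few lines of Python. This is only an illustration: the function name `select_features`, the tolerance `eps` and the option of keeping just the `k` largest multipliers are assumptions made for this sketch and are not part of the formulation itself.

```python
def select_features(alpha, k=None, eps=1e-12):
    """Return the indices of features with a positive multiplier, largest first.

    alpha : list of dual variables, one per feature (hypothetical input).
    k     : optional cap on the number of features to keep (an assumption).
    eps   : numerical tolerance below which a multiplier is treated as zero.
    """
    support = [i for i, a in enumerate(alpha) if a > eps]
    # Rank the support features by the size of their multiplier.
    support.sort(key=lambda i: alpha[i], reverse=True)
    return support if k is None else support[:k]
```

Ranking by the size of the multiplier is one simple way to obtain a "top features" list for a fixed budget; other tie-breaking or ranking rules would fit the same interface.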
2.2 Dual Formulation
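In order to solve the MMFS optimization efficiently by the Dual Coordinate Descent strategy, we require both the primal and dual formulations. The dual formulation for Equation 1 can be derived using the Lagrangian method. With one margin constraint per feature ($d$ features in total), the Lagrangian function can be written as:

$$
L(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^T w + b + C \sum_{i=1}^{d} \xi_i - \sum_{i=1}^{d} \alpha_i \left( w^T \phi(f_i) + b - r_i + \xi_i \right) - \sum_{i=1}^{d} \beta_i \xi_i
$$

where the $\alpha_i$'s and $\beta_i$'s are the Lagrange multipliers. Now, the Lagrangian dual can be written as:

$$
\max_{\alpha \ge 0,\ \beta \ge 0} \ \min_{w,\, b,\, \xi} \ L(w, b, \xi, \alpha, \beta)
\tag{2}
$$

At optimality, $\nabla_w L$, $\partial L / \partial b$ and $\partial L / \partial \xi_i$ (for all $i$) will be 0, i.e.

$$
w = \sum_{i=1}^{d} \alpha_i \phi(f_i), \qquad \sum_{i=1}^{d} \alpha_i = 1, \qquad \alpha_i + \beta_i = C \quad \forall i
\tag{3}
$$

By substituting the values from Equation (3) into Equation (2) we get:

$$
\max_{\alpha} \ -\frac{1}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} \alpha_i \alpha_j\, k(f_i, f_j) + \sum_{i=1}^{d} \alpha_i r_i
\tag{4}
$$

$$
\text{subject to} \quad 0 \le \alpha_i \le C \quad \forall i, \qquad \sum_{i=1}^{d} \alpha_i = 1
$$

This is similar to the standard SVM dual derivation (Schölkopf et al., 2000). The only difference is that while there is a single margin in the standard SVM, the number of features here dictates the number of margins ($d$). We can equivalently rewrite the dual formulation of (4) as follows:

$$
\min_{\alpha} \ \frac{1}{2} \alpha^T Q \alpha - r^T \alpha
\tag{5}
$$

$$
\text{subject to} \quad 0 \le \alpha_i \le C \quad \forall i, \qquad \sum_{i=1}^{d} \alpha_i = 1
$$

Here, $Q$ is the $d \times d$ similarity matrix whose entries are given by $Q_{ij} = k(f_i, f_j)$, where $k$ is the kernel function corresponding to the dot product in the transformed feature space. $r$ represents the vector of feature relevance. The $\alpha_i$'s are the Lagrange multipliers. Note that the first term in the objective captures the redundancy between the features and the second term captures the relevance, as in the QPFS formulation of Rodriguez-Lujan et al. (2010). Hence, the connection between the redundancy and the relevance becomes explicit in the dual formulation. It should be noted that the dual objective bears a close similarity to the QPFS objective. We give the detailed comparison in Section 3. We can give relative importance to redundancy and relevance by incorporating a scaling parameter $\theta \in [0, 1]$ in Equation (5) as follows:

$$
\min_{\alpha} \ \frac{1}{2} (1 - \theta)\, \alpha^T Q \alpha - \theta\, r^T \alpha
\tag{6}
$$

$$
\text{subject to} \quad 0 \le \alpha_i \le C \quad \forall i, \qquad \sum_{i=1}^{d} \alpha_i = 1
$$

In the primal formulation (Equation (1)), this can be achieved by scaling the relevance scores by $\theta / (1 - \theta)$, that is, replacing the constraints $w^T \phi(f_i) + b \ge r_i - \xi_i$ by $w^T \phi(f_i) + b \ge \frac{\theta}{1 - \theta} r_i - \xi_i$.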
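The scaling claim can be checked with one line of algebra. The sketch below assumes that the unscaled dual objective has the form $\frac{1}{2}\alpha^T Q \alpha - r^T \alpha$ and that the scaled version weights the redundancy and relevance terms by $(1-\theta)$ and $\theta$; both the form and the symbol $\theta$ are assumptions made here for illustration, mirroring the QPFS convention.

```latex
% For 0 <= theta < 1, factor (1 - theta) out of the scaled objective:
\begin{align*}
\tfrac{1}{2}(1-\theta)\,\alpha^T Q \alpha \;-\; \theta\, r^T \alpha
  \;=\; (1-\theta)\left[\tfrac{1}{2}\,\alpha^T Q \alpha
        \;-\; \frac{\theta}{1-\theta}\, r^T \alpha\right].
\end{align*}
% Since (1 - theta) > 0, both sides have the same minimizer over the same
% feasible set, so the scaled dual is the unscaled dual with every relevance
% score r_i replaced by (theta / (1 - theta)) r_i.
```

Under these assumptions, a small $\theta$ shrinks every relevance score and lets redundancy dominate, while $\theta$ close to 1 inflates the relevance scores relative to the quadratic term.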
2.3 Choice of Metrics
The relevance of a feature in our framework is captured using the correlation between the feature vector and the class label vector. In our experiments, we have normalized the data as well as the target vector (class labels) so that they have zero mean and unit variance. Hence, the dot product between the feature vector and the (normalized) target vector estimates the correlation between them, and this estimate is used as the relevance of the feature. Some other appropriate metric which captures the predictive accuracy of a feature, such as mutual information (MI), could also be used (Peng et al., 2005).
The redundancy is usually captured using correlation or mutual information in feature selection tasks (Peng et al., 2005). In our framework, the dot product space (kernel) captures the similarity (redundancy) among the features, and the required similarity metric can be captured by selecting the appropriate kernel. The linear kernel ($k(f_i, f_j) = f_i^T f_j$) represents the correlation among the features when the features are normalized to zero mean and unit variance (it is typical to normalize the data to zero mean and unit variance for feature selection). Since the value of the correlation ranges between $-1$ and $1$, a degree two homogeneous polynomial kernel defined over normalized data represents the squared correlation (i.e. $k(f_i, f_j) = (f_i^T f_j)^2$). The choice of this kernel is quite intuitive for feature selection as it gives equal importance to positive and negative correlations. A Gaussian kernel can also be used to approximate the mutual information (MI) (Gretton et al., 2005), which is the key metric for measuring nonlinear redundancy in feature selection problems (Peng et al., 2005; Rodriguez-Lujan et al., 2010). Since the MMFS formulation very closely matches the one class SVM formulation, any of the existing algorithms for SVM optimization, either in the primal or the dual, can be used. Next, we describe the use of the Dual Coordinate Descent (DCD) algorithm (Hsieh et al., 2008) to obtain a highly computationally efficient solution for our feature selection formulation.
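The metrics of this section can be sketched as follows. The code is a minimal illustration, not the authors' implementation: the function names are hypothetical, the dot products are divided by the number of instances so that they fall in $[-1, 1]$, and taking the absolute value of the feature/target correlation as the relevance score is an assumption made here.

```python
import math


def standardize(values):
    """Scale a list of numbers to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return [0.0] * n  # a constant column carries no information
    return [(v - mean) / std for v in values]


def correlation(u, v):
    """Pearson correlation of two already standardized vectors."""
    return sum(a * b for a, b in zip(u, v)) / len(u)


def relevance_and_redundancy(X, y, squared=True):
    """Compute relevance scores r and the feature similarity matrix Q.

    X : list of instances (rows), each a list of feature values.
    y : list of class labels, one per instance.
    squared : if True use the squared correlation (degree two homogeneous
              polynomial kernel), otherwise the plain correlation (linear kernel).
    """
    # Transpose: each feature (column of X) becomes a data point.
    features = [standardize(list(col)) for col in zip(*X)]
    target = standardize(list(y))
    # Assumption: relevance is the magnitude of the feature/target correlation.
    r = [abs(correlation(f, target)) for f in features]
    power = 2 if squared else 1
    Q = [[correlation(fi, fj) ** power for fj in features] for fi in features]
    return r, Q
```

With the squared kernel, a pair of features with correlation $-0.9$ is treated as being just as redundant as a pair with correlation $+0.9$, which is the behaviour motivated in the text.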
2.4 Dual Coordinate Descent for MMFS
Following equation (1), the number of variables and the number of constraints in the primal formulation are + and , respectively, while from equation (6), it is seen that the corresponding numbers are and +, respectively. Solving the primal (typically by using QP solvers) may be efficient () in the cases when (ShalevShwartz et al., 2007). Solving the dual using QP solvers requires space and time. Even solving the dual using sequential minimal optimization (SMO) based methods in practice has the complexity of (Fan et al., 2005). These high time and memory complexities limit the scalability of directly solving the primal or dual for data with a very large number of instances and features.
In many cases when the data already lies in a rich feature space, the performance of linear SVMs is observed to be similar to that of nonlinear SVMs. In such scenarios, it may be much more efficient to train the linear SVMs directly. Dual coordinate descent methods have been well studied for solving linear SVMs using the unconstrained form of the primal as well as the dual formulation. Hsieh et al. (2008) have shown that the dual coordinate descent algorithm is significantly faster than many other existing algorithms for solving the SVM problem. Since our formulation very closely resembles the one class SVM formulation (with the exception of having a separate margin for each feature), we can easily adapt the Dual Coordinate Descent (DCD) algorithm for our case.
Following the unconstrained formulation for the SVM objective (Hsieh et al., 2008), the MMFS objective in the primal (using a linear kernel) can be written as:
(7) 
where denotes the loss function and is a control parameter. Assuming standard loss, . Note the slightly changed form of the objective compared to Equation (1) where the bias term has been replaced by a squared term . The bias term can now be handled by introducing an additional dimension:
(8) 
Equation (7) can then be equivalently written as:
(9) 
The dual of this slightly modified problem becomes:
(10)  
subject to 
where is (+)(+) matrix such that . Comparing Equation (10) with Equation (6), we note that the constraint requiring is no longer needed because of the slightly changed form of the objective. In the unconstrained form of the dual, we are minimizing an additional term in the objective which is nothing but the square of the regularizer over the feature weights. Note that this term in the objective effectively takes care of the original constraint . The parameter controls the strength of this regularizer and can be tuned to control the sparsity of the solution. The gradient of the objective w.r.t to can be computed as follows:
Using the fact (set of Equations (3)), the gradient can be further reduced as:
We adapt the Dual Coordinate Descent algorithm (Hsieh et al., 2008) for our MMFS problem. This algorithm optimizes the dual objective by computing the gradient from the weight vector in the primal. This process is repeated for each dual variable in turn, and the weight vector is updated accordingly. Each step therefore amounts to optimizing a one variable quadratic function, which can be done very efficiently. Henceforth, we refer to this approach as MMFSDCD.
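The coordinate-wise update described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the liblinear-based implementation used in the paper: it assumes a linear kernel, a hinge-type loss (so each dual variable is boxed in $[0, C]$), and a dual objective of the form $\frac{1}{2}\alpha^T \bar{Q} \alpha - r^T \alpha$ in which every feature vector is augmented with one constant entry. The names `mmfs_dcd`, `bias`, `max_iter` and `tol` are hypothetical, and shrinking is omitted.

```python
import random


def mmfs_dcd(features, relevance, C=1.0, bias=1.0, max_iter=1000, tol=1e-8, seed=0):
    """Dual coordinate descent for min 0.5*a'Qa - r'a subject to 0 <= a_i <= C.

    features  : list of d feature vectors, each a list of n instance values.
    relevance : list of d relevance scores r_i.
    bias      : constant appended to every feature vector (assumed form of the
                bias/regularizer handling; must be nonzero for a bias term).
    Returns (alpha, w_bar) with w_bar = sum_i alpha_i * augmented feature i.
    """
    rng = random.Random(seed)
    f_bar = [list(f) + [bias] for f in features]      # augmented feature vectors
    d, dim = len(f_bar), len(f_bar[0])
    q_diag = [sum(v * v for v in f) for f in f_bar]   # diagonal of Q
    alpha = [0.0] * d
    w_bar = [0.0] * dim
    order = list(range(d))
    for _ in range(max_iter):
        rng.shuffle(order)                            # random sweep order
        max_violation = 0.0
        for i in order:
            if q_diag[i] == 0.0:
                continue                              # all-zero vector: nothing to update
            # Gradient of the dual w.r.t. alpha_i, computed from the primal vector.
            grad = sum(w * v for w, v in zip(w_bar, f_bar[i])) - relevance[i]
            # Projected gradient respects the box constraint.
            if alpha[i] <= 0.0:
                pg = min(grad, 0.0)
            elif alpha[i] >= C:
                pg = max(grad, 0.0)
            else:
                pg = grad
            max_violation = max(max_violation, abs(pg))
            if pg != 0.0:
                # Exact minimizer of the one variable quadratic, clipped to [0, C].
                new_alpha = min(max(alpha[i] - grad / q_diag[i], 0.0), C)
                delta = new_alpha - alpha[i]
                for k in range(dim):
                    w_bar[k] += delta * f_bar[i][k]
                alpha[i] = new_alpha
        if max_violation < tol:
            break
    return alpha, w_bar
```

On a toy problem with two orthogonal features and relevance scores 1.0 and 0.5, the call `mmfs_dcd([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.5], C=10.0)` converges to `alpha` of approximately `[0.5, 0.0]`: the constant entry couples the two features, and only the more relevant one keeps a positive weight. Keeping `w_bar` up to date is what makes each coordinate step cost time proportional to the length of one feature vector rather than to a full row of the similarity matrix.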
2.5 Complexity
Following (Hsieh et al., 2008), the MMFSDCD approach obtains an accurate solution in number of iterations. Time complexity of a single iteration is . Memory complexity of the DCD algorithm is . For sparse datasets, the complexities depend on instead of , where is the average number of nonzero feature values in an instance. The details about the proof of convergence are available in (Hsieh et al., 2008).
3 Relationship to Existing Filter Based Methods
Quadratic Programming Feature Selection (QPFS) (Rodriguez-Lujan et al., 2010) is a filter based feature selection method which models the feature selection problem as a quadratic program jointly minimizing redundancy and maximizing relevance. Redundancy is captured using some kind of similarity score (such as MI or correlation) amongst the features. Relevance is captured using the correlation between a feature and the target variable. The one-norm of the feature weight vector is constrained to be $1$. Formally, the quadratic program can be written as:

$$
\min_{\alpha} \ \frac{1}{2} (1 - \theta)\, \alpha^T Q \alpha - \theta\, r^T \alpha
\tag{11}
$$

$$
\text{subject to} \quad \alpha_i \ge 0 \quad \forall i = 1, \dots, d, \qquad \sum_{i=1}^{d} \alpha_i = 1
$$

Here, $Q$ is a $d \times d$ matrix representing redundancy among the $d$ features, $r$ is a $d$ sized vector representing the feature relevance and $\alpha$ is a $d$ sized vector capturing the feature weights. $\theta$ is a scalar which controls the relative importance of the redundancy (the $\alpha^T Q \alpha$ term) and the relevance (the $r^T \alpha$ term). The QPFS objective closely resembles the minimal-redundancy-maximal-relevance (mRMR) criterion (Peng et al., 2005). When $\theta = 1$, only the relevance is considered (maximum relevance), and when $\theta = 0$, only the redundancy among the features is captured. QPFS has also been shown to outperform many existing feature selection methods including mRMR and MaxRel (Rodriguez-Lujan et al., 2010).
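The role of the scalar trade-off parameter can be illustrated by evaluating the QPFS objective directly. The sketch assumes the objective $\frac{1}{2}(1-\theta)\,x^T Q x - \theta\, r^T x$ over weight vectors $x$ on the simplex; the function name and the toy numbers in the usage note are made up for illustration.

```python
def qpfs_objective(Q, r, x, theta):
    """Value of 0.5 * (1 - theta) * x'Qx - theta * r'x.

    Q     : redundancy (similarity) matrix as a list of lists.
    r     : relevance vector.
    x     : feature weight vector (nonnegative, summing to one).
    theta : trade-off in [0, 1]; 0 keeps only redundancy, 1 only relevance.
    """
    n = len(x)
    quad = sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))
    lin = sum(ri * xi for ri, xi in zip(r, x))
    return 0.5 * (1.0 - theta) * quad - theta * lin
```

For two strongly correlated features, say `Q = [[1.0, 0.9], [0.9, 1.0]]` and `r = [0.8, 0.7]`, setting `theta = 1` makes the objective lowest when all weight sits on the more relevant feature, whereas `theta = 0` ignores relevance and prefers spreading the weight so that the quadratic term is as small as possible.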
The QPFS formulation above is identical in form to our dual formulation (Equation 6) for an appropriate choice of kernel (similarity) function and $C = \infty$ (hard margin). Hence, the QPFS objective falls out as a special case of our maxmargin framework in the dual problem space when dealing with a hard margin. It should be noted that Rodriguez-Lujan et al. (2010) do not give any strong justification for the particular form of the objective used, other than the fact that it makes intuitive sense and seems to work well in practice. This is unlike our case, where we present a maxmargin based framework for jointly optimizing relevance and redundancy. Therefore, our formulation can be seen as providing a framework for the use of the QPFS objective and generalizing it further to handle noise (soft margin). Further, since Rodriguez-Lujan et al. (2010) establish no direct connection between the QPFS objective and an SVM like formulation, their proposed approach for solving the objective is simply to use any of the standard quadratic programming implementations. Hence, the time complexity of the QPFS approach is cubic in the number of features $d$, i.e. $O(d^3)$, and its space complexity is $O(d^2)$. To deal with the cubic complexity, they propose combining it with the Nyström method, which works on subsamples of the data. This can partially alleviate the computational inefficiency of QPFS, but comes at the cost of a significant loss in accuracy, as shown by our experiments. In our case, because of the close connection with the SVM based maxmargin formulation and the ability to use the information from the primal as well as the dual, we can utilize any of the highly optimized SVM solvers (such as DCD, whose time complexity is linear in the number of features $d$).
Further it may be noted that while our MMFSDCD approach can handle sparse representation of very high dimensional datasets, other feature selection methods like QPFS, FCBF, mRMR etc. cannot do so directly.
4 Experiments
4.1 Datasets
We conduct our experiments on seven publicly available benchmark datasets with medium to large numbers of dimensions. Of these seven datasets, Leukemia, RAOA and RAC are microarray datasets (Kumar et al., 2012), MNIST is a vision dataset (Tan et al., 2010), and REALSIM, Webspam and Kddb are text classification datasets from the NLP domain (Chang et al., 2010; Yiteng et al., 2012). Table 1 describes the details of the datasets. The last column gives the sparsity, that is, the average number of nonzero features per instance in the dataset.
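Table 1: Details of the datasets.

| Dataset | # Training | # Testing | # Features | Sparsity |
|---|---|---|---|---|
| Leukemia | 72 | - | 7,129 | 7,129 |
| RAOA | 31 | - | 18,432 | 18,422 |
| RAC | 33 | - | 48,701 | 48,701 |
| MNIST | 11,982 | 1,984 | 752 | 752 |
| REALSIM | 57,848 | 14,461 | 20,958 | 51.5 |
| Webspam | 80,000 | 70,000 | 8,355,099 | 3,730 |
| Kddb | 100,000 | 748,401 | 29,889,813 | 30 |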
4.2 Algorithms
We compared the performance of our proposed MMFS algorithm with FCBF (http://www.public.asu.edu/~huanliu/FCBF/FCBFsoftware.html) (Yu and Liu, 2003), QPFS (Rodriguez-Lujan et al., 2010) and two embedded feature selection methods, namely, Feature Generating Machine (FGM) (Tan et al., 2010) and Group Discovery Machine (GDM) (Yiteng et al., 2012). FGM uses a cutting plane strategy for feature selection. GDM further tries to minimize the redundancy in FGM by incorporating the correlation among the features. QPFS, FGM and GDM have been shown to outperform a variety of existing feature selection methods including mRMR and MaxRel (Peng et al., 2005), FCBF (Yu and Liu, 2003), SVM-RFE (Guyon and Elisseeff, 2003), etc. For QPFS, we used mutual information (MI) as the similarity metric as it has been shown to give the best set of results (Rodriguez-Lujan et al., 2010). In MMFSDCD, we use the correlation of a feature vector with the target class vector to compute the feature relevance.
4.3 Methodology
We compare all the approaches for feature selection in terms of their accuracy and execution time on each of the datasets. For all the datasets except Webspam and Kddb, we report the accuracies obtained at varying numbers of top features (2, 3, 4, …, 100) selected by each of the methods. For the Webspam and Kddb datasets, we report the accuracies obtained at varying numbers of top features (5, 10, 20, 30, …, 200) selected by the FGM, GDM and MMFSDCD methods.
We also report the best accuracies obtained at any number of top features in the above range for all the datasets. We have normalized all the datasets except Webspam and Kddb to zero mean and unit variance. Zero mean and unit variance normalization is very memory inefficient for the Webspam and Kddb datasets, as these two are very large sparse datasets. We have instead normalized these two datasets to unit variance (Yiteng et al., 2012). In the microarray datasets, the number of samples is small, so we report the leave-one-out cross-validation (LOOCV) accuracy. For the MNIST and REALSIM datasets, training and testing splits are provided in (Chang et al., 2010). We have followed the training and testing splits of (Yiteng et al., 2012) for the Webspam and Kddb datasets. The results reported are averaged over random splits.
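The memory argument can be made concrete with a small sketch. It stores one sparse column as a dictionary of its nonzero entries (a representation chosen here purely for illustration) and contrasts scaling to unit variance, which keeps implicit zeros at zero, with full zero mean and unit variance normalization, which turns every implicit zero into a nonzero value. The function names are hypothetical, and dividing by the standard deviation is an assumed reading of "unit variance" normalization.

```python
import math


def column_stats(nonzeros, n_rows):
    """Mean and standard deviation of a sparse column given as {row: value}."""
    mean = sum(nonzeros.values()) / n_rows
    second_moment = sum(v * v for v in nonzeros.values()) / n_rows
    variance = max(second_moment - mean * mean, 0.0)
    return mean, math.sqrt(variance)


def unit_variance_only(nonzeros, n_rows):
    """Divide by the standard deviation; zeros stay zero, so sparsity is kept."""
    _, std = column_stats(nonzeros, n_rows)
    if std == 0.0:
        return dict(nonzeros)
    return {row: v / std for row, v in nonzeros.items()}


def zero_mean_unit_variance(nonzeros, n_rows):
    """Subtract the mean, then divide by the standard deviation.

    Every implicit zero becomes -mean/std, so the column must be stored densely.
    """
    mean, std = column_stats(nonzeros, n_rows)
    if std == 0.0:
        return [0.0] * n_rows
    return [(nonzeros.get(row, 0.0) - mean) / std for row in range(n_rows)]
```

For a column with two nonzeros out of ten rows, `unit_variance_only` still stores two values while `zero_mean_unit_variance` produces ten nonzero values; scaled up to millions of features, this is the difference between a sparse and a dense data matrix.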
For MMFSDCD, parameter was tuned separately for each of the microarray datasets. The values of the parameters and were set to and respectively in all the experiments. We used the default settings of the parameters for both FGM and GDM as reported in (Tan et al., 2010; Yiteng et al., 2012). After the top features are selected, we used an L2-regularized L2-loss SVM (Fan et al., 2008) with default settings (that is, the cost parameter set to 1) for classification for each of the algorithms and for each of the datasets. MMFS was implemented on top of the liblinear tool (http://www.csie.ntu.edu.tw/~cjlin/liblinear). This implementation uses the shrinking strategy (Hsieh et al., 2008). We used the publicly available implementation of QPFS (Rodriguez-Lujan et al., 2010). For FGM, we used the publicly available tool (http://www.c2i.ntu.edu.sg/mingkui/FGM.htm). GDM was implemented as an extension of FGM based on the details given in Yiteng et al. (2012). Any additional required wrapper code was written in C/C++. All the experiments were run on an Intel Core i7 3.10GHz machine with 16GB RAM under the Linux operating system.
4.4 Results
4.4.1 Accuracy
Table 2 presents the best set of average accuracies (varying the number of top-K features selected) for all the methods. The QPFS method did not produce any results on the RAOA and RAC datasets within 24 hours (the corresponding entries are marked with a dash in Table 2), so we used the Nyström approximation (Rodriguez-Lujan et al., 2010) with a sampling rate of 0.01 for these datasets. In Figure 2(a), QPFSN represents QPFS with the Nyström approximation. The QPFS and FCBF methods cannot handle sparse data, so we compare only FGM, GDM and MMFSDCD on the Webspam and Kddb datasets. MMFSDCD reaches its best accuracy with a small number of top features on all the microarray datasets. Further, MMFSDCD produces significantly better accuracies than FCBF, QPFS, FGM and GDM on all the microarray datasets (FGM does equally well on RAC). On the MNIST and Webspam datasets, MMFSDCD is marginally worse than the best performing algorithm. The plots of the average accuracies obtained as we vary the number of top features are available in the supplementary file. For most of the datasets, MMFSDCD achieves the best set of accuracies at earlier stages of feature selection than the other algorithms. Further, the gene ontology and biological significance of the top selected genes for the Leukemia dataset are provided in the supplementary file.
4.4.2 Time
Figure 2 plots the average execution time for each of the methods. The y-axis is plotted on a log scale. The time requirement for MMFSDCD, FCBF and QPFS is independent of the number of features selected. For FGM and GDM, the time requirement increases monotonically with the number of features selected. For GDM, there is a sharp increase in the time required when this number becomes greater than five (for RAC, GDM was run only for a limited number of iterations). It is clear from Figure 2 that MMFSDCD is up to several orders of magnitude faster than all the other algorithms on all the datasets (plots for the remaining datasets are available in the supplementary file).
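Table 2: Best average accuracies for all the methods.

| Dataset | FCBF Accuracy | FCBF M | QPFS Accuracy | QPFS M | FGM Accuracy | FGM M | GDM Accuracy | GDM M | MMFSDCD Accuracy | MMFSDCD M |
|---|---|---|---|---|---|---|---|---|---|---|
| Leukemia | 90.28 ± 0.1 | 37 | 87.50 ± 0.1 | 45 | 87.5 ± 0.1 | 2 | 84.72 ± 0.1 | 2 | 91.67 ± 0.1 | 6 |
| RAOA | 74.19 ± 0.2 | 2 | 67.75 ± 0.2 | 6 | 67.75 ± 0.2 | 2 | 54.84 ± 0.2 | 2 | 83.87 ± 0.1 | 2 |
| RAC | 48.48 ± 0.2 | 12 | 96.97 ± 0.1 | 75 | 100.0 ± 0.0 | 3 | 87.88 ± 0.1 | 3 | 100.0 ± 0.0 | 2 |
| MNIST | 91.07 ± 0.0 | 19 | 96.06 ± 0.0 | 94 | 96.21 ± 0.0 | 99 | 96.67 ± 0.0 | 77 | 96.06 ± 0.0 | 83 |
| REALSIM | - | - | - | - | 90.03 ± 0.01 | 90 | 89.48 ± 0.01 | 100 | 90.19 ± 0.01 | 100 |
| Webspam | - | - | - | - | 95.91 ± 0.0 | 200 | 96.80 ± 0.0 | 200 | 96.79 ± 0.0 | 200 |
| Kddb | - | - | - | - | 87.60 ± 0.0 | 150 | 87.77 ± 0.0 | 190 | 88.39 ± 0.0 | 200 |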
4.4.3 Parameter Sensitivity Analysis
Figure 3 presents the variation in accuracy for MMFSDCD on the Leukemia dataset as we vary the regularizer parameter together with the number of top features. The accuracy is not very sensitive to this parameter, as demonstrated by a large flat region in the graph.
5 Conclusion and Future Work
We have presented a novel MaxMargin framework for Feature Selection (MMFS) similar to the one class SVM formulation. Our framework provides a principled approach to jointly maximize relevance and minimize redundancy. It also enables us to use existing SVM based optimization techniques, leading to highly efficient solutions for the task of feature selection. Our experiments show that MMFS with the dual coordinate descent approach is many orders of magnitude faster than existing state of the art techniques while retaining the same level of accuracy.
One of the key future directions is to explore whether there is some notion of a generalization bound for the task of feature selection in our framework, as in the case of SVMs for the task of classification. In other words, what can we say about the quality of the features selected as we see more and more data? We would also like to explore the performance of our model with nonlinear kernels. Lastly, exploring the tradeoff as we vary the noise penalty would also be a direction to pursue in the future.
Acknowledgment
The authors would like to thank Dr. Parag Singla, Dept. of CSE, I.I.T Delhi for his valuable suggestions and support in improving the paper.
References
 Burges (1998) Burges, C.J.C., 1998. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2, 121–167.
 Chang et al. (2010) Chang, Y.W., Hsieh, C.J., Chang, K.W., Ringgaard, M., Lin, C.J., 2010. Training and testing low-degree polynomial data mappings via linear SVM. J. Mach. Learn. Res. 11, 1471–1490.
 Fan et al. (2008) Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J., 2008. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874.
 Fan et al. (2005) Fan, R.E., Chen, P.H., Lin, C.J., 2005. Working set selection using second order information for training support vector machines. J. Mach. Learn. Res. 6, 1889–1918.
 Gretton et al. (2005) Gretton, A., Herbrich, R., Smola, A., Bousquet, O., Schölkopf, B., 2005. Kernel methods for measuring independence. J. Mach. Learn. Res. 6, 2075–2129.
 Guo et al. (2007) Guo, Z., Zhang, Z., Xing, E.P., Faloutsos, C., 2007. A max margin framework on image annotation and multimodal image retrieval, in: ICME, IEEE. pp. 504–507.
 Guyon and Elisseeff (2003) Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182.
 Hsieh et al. (2008) Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S.S., Sundararajan, S., 2008. A dual coordinate descent method for large-scale linear SVM, in: Proceedings of the 25th International Conference on Machine Learning, ACM. pp. 408–415.
 Kumar et al. (2012) Kumar, P.G., Victoire, A.T.A., Renukadevi, P., Devaraj, D., 2012. Design of fuzzy expert system for microarray data classification using a novel genetic swarm algorithm. Expert Syst. Appl. 39, 1811–1821.
 Peng et al. (2005) Peng, H., Long, F., Ding, C., 2005. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence 27, 1226–1238.
 Prasad et al. (2013) Prasad, Y., Biswas, K., Singla, P., 2013. Feature selection using one class SVM: A new perspective, in: MLCB, NIPS Workshop.
 Rodriguez-Lujan et al. (2010) Rodriguez-Lujan, I., Huerta, R., Elkan, C., Cruz, C.S., 2010. Quadratic programming feature selection. J. Mach. Learn. Res. 11, 1491–1516.
 Schölkopf et al. (2000) Schölkopf, B., Williamson, R.C., Smola, A.J., Shawe-Taylor, J., Platt, J., 2000. Support vector method for novelty detection. Advances in Neural Information Processing Systems 12, 582–588.
 Shalev-Shwartz et al. (2007) Shalev-Shwartz, S., Singer, Y., Srebro, N., 2007. Pegasos: Primal estimated sub-gradient solver for SVM, in: Proceedings of the 24th International Conference on Machine Learning, ACM, USA. pp. 807–814.
 Tan et al. (2010) Tan, M., Wang, L., Tsang, I.W., 2010. Learning sparse SVM for feature selection on very high dimensional datasets, in: Proceedings of the 27th International Conference on Machine Learning, pp. 1047–1054.
 Yiteng et al. (2012) Yiteng, Z., Mingkui, T., Yew S., O., Ivor W., T., 2012. Discovering support and affiliated features from very high dimensions, in: Proceedings of the 29th International Conference on Machine Learning, pp. 1455–1462.
 Yu and Liu (2003) Yu, L., Liu, H., 2003. Feature selection for high-dimensional data: A fast correlation-based filter solution, in: Proceedings of the 20th International Conference on Machine Learning, pp. 856–863.