IMMIGRATE: A Margin-based Feature Selection Method with Interaction Terms
Abstract
Traditional hypothesis-margin research focuses on obtaining large margins and feature selection. In this work, we show that the robustness of margins is also critical and can be measured using entropy. In addition, our approach provides clear mathematical formulations and explanations to uncover feature interactions, which are often lacking in large hypothesis-margin based approaches. We design an algorithm, termed IMMIGRATE (Iterative max-min entropy margin-maximization with interaction terms), for training the weights associated with the interaction terms. IMMIGRATE simultaneously utilizes both local and global information and can be used as a base learner in Boosting. We evaluate IMMIGRATE on a wide range of tasks, in which it demonstrates exceptional robustness and achieves state-of-the-art results with high interpretability.
Keywords: Hypothesis-margin, Feature selection, Entropy, IMMIGRATE
1 Introduction
Feature selection is one of the most fundamental problems in machine learning and pattern recognition (Fukunaga, 2013). The Relief algorithm by Kira and Rendell (1992) is one of the most successful feature selection algorithms. It can be interpreted as an online learning algorithm that solves a convex optimization problem with a hypothesis-margin-based cost function. Instead of deploying exhaustive or heuristic combinatorial searches, Relief decomposes a complex, global, and nonlinear classification task into a simple and local one. Following the large hypothesis-margin principle for classification, Relief calculates the weights of features, which can be used for feature selection. For binary classification on a set of samples \mathcal{P} with two kinds of labels, the hypothesis-margin of an instance \vec{x} was later formally defined by Gilad-Bachrach et al. (2004) as \frac{1}{2}(\|\vec{x}-\operatorname{NM}(\vec{x})\|-\|\vec{x}-\operatorname{NH}(\vec{x})\|), where \operatorname{NH}(\vec{x}) denotes the "nearest hit", i.e., the nearest sample to \vec{x} with the same label, while \operatorname{NM}(\vec{x}) denotes the "nearest miss", i.e., the nearest sample to \vec{x} with a different label. The large hypothesis-margin principle has motivated several successful extensions of the Relief algorithm. For example, ReliefF (Kononenko, 1994) uses multiple nearest neighbors. Simba (Gilad-Bachrach et al., 2004) recalculates the nearest neighbors every time the feature weights are updated. Yang et al. (2008) consider global information to improve Simba. I-RELIEF (Sun and Li, 2006) identifies the nearest hits and misses in a probabilistic manner, which forms a variation of hypothesis-margin. LFE (Sun and Wu, 2008) extends Relief from feature selection to feature extraction using local information. IM4E is proposed by Bei and Hong (2015) to balance margin-quantity maximization and margin-quality maximization. Both Sun and Wu (2008) and Bei and Hong (2015) use the variation of hypothesis-margin proposed in Sun and Li (2006).
The Relief-based algorithms indirectly consider feature interactions by normalizing the feature weights (Urbanowicz et al., 2018). This normalization, however, cannot directly reflect the natural effects of associations and hence results in a poor understanding of how features interact. For example, Relief and many of its extensions cannot tell whether a high weight of a certain feature is caused by its linear effect or by its interaction with other features (Urbanowicz et al., 2018). Furthermore, these methods cannot directly reveal and measure the impact of the interaction terms on classification results.
To this end, we propose the Iterative Max-MIn entropy marGin-maximization with inteRAction TErms algorithm (IMMIGRATE, henceforth). IMMIGRATE directly measures the influence of feature interactions and has the following characteristics. First, when defining hypothesis-margin, we introduce a new trainable quadratic-Manhattan measurement to capture interaction terms, which measures the interaction importance directly. Second, we take advantage of the margin stability by measuring the underlying entropy based on the distribution of instances. Third, we derive an iterative optimization algorithm to efficiently minimize the cost function. Fourth, we design a novel classification method that utilizes the learned quadratic-Manhattan measurement to predict the class of a new instance. Fifth, we design a more powerful approach (i.e., Boosted IMMIGRATE) by using IMMIGRATE as the base learner of Boosting (Schapire, 1990). Sixth, to make IMMIGRATE efficient for analyzing high-dimensional datasets, we take advantage of IM4E (Bei and Hong, 2015) to obtain an effective initialization.
The rest of the paper is organized as follows. Section 2 explains the foundation of the Relief algorithm, and Section 3 introduces the IMMIGRATE algorithm. Section 4 summarizes and discusses our experiments with different datasets, showing that IMMIGRATE achieves state-of-the-art results and that Boosted IMMIGRATE significantly outperforms other boosting classifiers. The computation time of IMMIGRATE is comparable to that of other popular feature selection methods that consider interaction terms. Section 5 concludes the article with comparisons with related works and a short discussion.
2 Review: the Relief Algorithm
We first introduce the notation used throughout the paper: \vec{x}_{i}\in\mathbb{R}^{A} is the i-th instance in the training set \mathcal{P}; y_{i} is the class label of \vec{x}_{i}; N is the size of \mathcal{P}; A is the number of features (i.e., attributes); \vec{w} is the feature weight vector; and |\vec{x}_{i}| is the vector of element-wise absolute values of \vec{x}_{i}. Relief (Kira and Rendell, 1992) iteratively calculates the feature weights in \vec{w} (Algorithm 1). The higher a feature weight is, the more relevant the corresponding feature is. After the calculation of feature weights, a threshold is chosen to select relevant features. Relief can be viewed as a convex optimization problem that minimizes the cost function in Equation 2.1:
C=\sum_{n=1}^{M}\big(\vec{w}^{\,T}\big|\vec{x}_{n}-\operatorname{NH}(\vec{x}_{n})\big|-\vec{w}^{\,T}\big|\vec{x}_{n}-\operatorname{NM}(\vec{x}_{n})\big|\big),  (2.1)
\text{subject to}:\vec{w}\geq 0,\,\|\vec{w}\|_{2}^{2}=1,
where M(\ll N) is a user-defined number of randomly chosen training samples; \operatorname{NH}(\vec{x}) is the nearest "hit" (from the same class) of \vec{x}; \operatorname{NM}(\vec{x}) is the nearest "miss" (from a different class) of \vec{x}; and \vec{w}^{\,T}\big|\vec{x}_{n}-\operatorname{NH}(\vec{x}_{n})\big| is the weighted Manhattan distance. Denote \vec{u}=\sum_{n=1}^{M}\big(\big|\vec{x}_{n}-\operatorname{NH}(\vec{x}_{n})\big|-\big|\vec{x}_{n}-\operatorname{NM}(\vec{x}_{n})\big|\big). The minimization of the Relief cost function (2.1) can be solved using the Lagrange multiplier method and the Karush–Kuhn–Tucker conditions (Kuhn and Tucker, 2014) to get a closed-form solution: \vec{w}=(-\vec{u})^{+}/\|(-\vec{u})^{+}\|_{2}, where (\vec{a})^{+} truncates the negative elements of \vec{a} to 0. This solution to the original Relief algorithm is important for understanding the Relief-based algorithms.
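The closed-form solution can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper name relief_weights and the toy data are hypothetical, the sum runs over all samples rather than M randomly chosen ones, and nearest hits and misses are found with the unweighted Manhattan distance.

```python
import math

def relief_weights(X, y):
    """Closed-form Relief weights w = (-u)^+ / ||(-u)^+||_2 (all samples used)."""
    n_features = len(X[0])
    u = [0.0] * n_features
    for i, xi in enumerate(X):
        def manhattan(j):
            return sum(abs(a - b) for a, b in zip(xi, X[j]))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        nh = X[min(hits, key=manhattan)]    # nearest hit
        nm = X[min(misses, key=manhattan)]  # nearest miss
        for a in range(n_features):
            u[a] += abs(xi[a] - nh[a]) - abs(xi[a] - nm[a])
    pos = [max(-v, 0.0) for v in u]         # (-u)^+
    norm = math.sqrt(sum(v * v for v in pos))
    return [v / norm for v in pos] if norm > 0 else pos

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [1.1, 0.8]]
y = [1, 1, -1, -1]
w = relief_weights(X, y)
```

In this toy example all of the weight goes to feature 0, since it is the only feature for which misses are farther away than hits.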
3 IMMIGRATE Algorithm
Without loss of generality, we establish the IMMIGRATE algorithm in a general binary classification setting. This formulation can be easily extended to handle multi-class classification problems. Let the whole data set be \mathcal{P}=\{z_{n}\mid z_{n}=(\vec{x}_{n},y_{n}),\vec{x}_{n}\in\mathbb{R}^{A},y_{n}=\pm 1\}_{n=1}^{N}; the hit index set of \vec{x}_{n} be \mathcal{H}_{n}=\{j\mid z_{j}\in\mathcal{P},y_{j}=y_{n}\,\&\,j\neq n\}; and the miss index set of \vec{x}_{n} be \mathcal{M}_{n}=\{j\mid z_{j}\in\mathcal{P},y_{j}\neq y_{n}\}.
3.1 Hypothesis-Margin
Given a distance d(\vec{x}_{i},\vec{x}_{j}) between two instances, \vec{x}_{i} and \vec{x}_{j}, a hypothesis-margin (Gilad-Bachrach et al., 2004) is defined as \rho_{n,h,m}=d(\vec{x}_{n},\vec{x}_{m})-d(\vec{x}_{n},\vec{x}_{h}), where \vec{x}_{h} and \vec{x}_{m} represent the nearest hit and nearest miss for instance \vec{x}_{n}, respectively. We adopt the probabilistic hypothesis-margin defined by Sun and Li (2006) as
\rho_{n}=\sum_{m\in\mathcal{M}_{n}}\beta_{n,m}d(\vec{x}_{n},\vec{x}_{m})-\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}d(\vec{x}_{n},\vec{x}_{h}),  (3.2)
where \alpha_{n,h}\geq 0, \beta_{n,m}\geq 0, \sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}=1, and \sum_{m\in\mathcal{M}_{n}}\beta_{n,m}=1 for all n\in\{1,\cdots,N\}. In this design, the hidden random variable \alpha_{n,h} represents the probability that \vec{x}_{h} is the nearest hit of instance \vec{x}_{n}, while \beta_{n,m} indicates the probability that \vec{x}_{m} is the nearest miss of instance \vec{x}_{n}. In the rest of the paper, for conciseness, we use margin to indicate hypothesis-margin.
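As a small illustration of the probabilistic margin, the following Python sketch evaluates it for one instance from precomputed distances. The function name probabilistic_margin and the toy distances are assumptions made for illustration only.

```python
def probabilistic_margin(d_hits, alpha, d_misses, beta):
    """Margin = sum_m beta_m * d_m - sum_h alpha_h * d_h for a single instance."""
    # alpha and beta must each be a probability distribution.
    assert abs(sum(alpha) - 1.0) < 1e-9 and abs(sum(beta) - 1.0) < 1e-9
    miss_term = sum(b * d for b, d in zip(beta, d_misses))
    hit_term = sum(a * d for a, d in zip(alpha, d_hits))
    return miss_term - hit_term

d_hits, d_misses = [1.0, 3.0], [4.0, 8.0]
# Uniform probabilities over the two hits and the two misses.
rho_soft = probabilistic_margin(d_hits, [0.5, 0.5], d_misses, [0.5, 0.5])
# All mass on the nearest hit and nearest miss recovers the classical margin.
rho_hard = probabilistic_margin(d_hits, [1.0, 0.0], d_misses, [1.0, 0.0])
```

Placing all probability mass on the nearest hit and nearest miss reduces the probabilistic margin to \rho_{n,h,m}.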
3.2 Entropy to Measure Margin Stability
The distributions of hits and misses can be used to evaluate the stability of margins (i.e., margin quality). A more stable margin can be obtained by considering the distributions of instances with the same or different labels with respect to the target instance. A margin is deemed stable if it is not greatly reduced by changes to only a few neighbors of the target instance. For an instance \vec{x}_{n}, the probabilities \{\alpha_{n,h}\} and \{\beta_{n,m}\} represent the distributions of its hits and misses, respectively. We can use the hit entropy E_{hit}(\vec{x}_{n})=-\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}\log\alpha_{n,h} and the miss entropy E_{miss}(\vec{x}_{n})=-\sum_{m\in\mathcal{M}_{n}}\beta_{n,m}\log\beta_{n,m} to evaluate the stability of the margin of \vec{x}_{n}. The following two scenarios help explain the intuition behind using these entropies. Scenario A: all neighbors are distributed evenly around the target instance. Scenario B: the neighbor distribution is highly uneven. An extreme example of scenario B is one instance that is very close to the target while the rest are far away from it. A simple experiment to test the stability is to discard one instance from the system and check how this influences the margin. In scenario A, if the closest neighbor (whether it is a hit or a miss) is discarded, the margin changes only slightly because there are many other hits/misses evenly distributed around the target. In scenario B, if the closest neighbor is a miss, its removal can increase the margin significantly. On the contrary, if the closest neighbor is a hit, removing it can decrease the margin significantly. Intuitively speaking, hits prefer scenario A and misses favor scenario B.
Since scenarios A and B correspond to high and low entropy, respectively, the margin can benefit from a large hit entropy E_{hit} (e.g., scenario A) and a low miss entropy E_{miss} (e.g., scenario B). We can set up a framework to maximize the hit entropy and minimize the miss entropy, which is equivalent to making the margin in Equation 3.2 the most stable. Bei and Hong (2015) use the term max-min entropy principle to describe the process that maximizes the hit entropy and minimizes the miss entropy to maximize the margin quality. The process of stabilizing the margin is an extension of the large margin principle.
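The correspondence between the two scenarios and the entropy values can be checked numerically. The sketch below uses made-up probability vectors for scenarios A and B; the function name entropy and the chosen numbers are illustrative assumptions.

```python
import math

def entropy(p):
    """Shannon entropy -sum_k p_k log p_k, skipping zero-probability entries."""
    return -sum(v * math.log(v) for v in p if v > 0)

# Scenario A: four neighbors are equally likely to be the nearest one.
even = [0.25, 0.25, 0.25, 0.25]
# Scenario B: one neighbor dominates, the others are far away.
uneven = [0.97, 0.01, 0.01, 0.01]

entropy_a = entropy(even)    # equals log(4), the maximum for four neighbors
entropy_b = entropy(uneven)  # much smaller than entropy_a
```

Under the max-min entropy principle, a hit distribution like `even` and a miss distribution like `uneven` are the preferred configurations.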
3.3 Quadratic-Manhattan Measurement
We extend the margin in Equation 3.2 by using a new quadratic-Manhattan measurement defined as:
q(\vec{x}_{i},\vec{x}_{j})=\big|\vec{x}_{i}-\vec{x}_{j}\big|^{\,T}\textbf{W}\big|\vec{x}_{i}-\vec{x}_{j}\big|,  (3.3)
where W is a non-negative symmetric matrix (element-wise non-negative) with Frobenius norm \|\textbf{W}\|_{F}=1. The quadratic-Manhattan measurement is a natural extension of the weight vector, and the distance defined in Equation 3.3 is a natural extension of the weighted Manhattan distance in Equation 2.1. Off-diagonal elements in W capture feature interactions and diagonal elements in W capture main effects. To understand why the quadratic-Manhattan measurement can capture the influence of interactions, we observe that the element w_{a,b} (a\neq b) in W enters into (3.3) as the coefficient for the product of the a-th and b-th elements of the vector \big|\vec{x}_{i}-\vec{x}_{j}\big|. In Relief-based algorithms, the weighted Manhattan distance in Equation 2.1 can be equivalently captured by the feature weight update equation in Algorithm 1. Similarly, w_{a,b} can be updated using the combination of the a-th and b-th features based on a randomly given instance. We thus define our new margin using the quadratic-Manhattan measurement as
\sum_{m\in\mathcal{M}_{n}}\beta_{n,m}q(\vec{x}_{n},\vec{x}_{m})-\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}q(\vec{x}_{n},\vec{x}_{h}).  (3.4)
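To make the role of the diagonal and off-diagonal elements concrete, here is a minimal Python sketch of the quadratic-Manhattan measurement. The matrices W_main and W_inter are hypothetical examples chosen to have unit Frobenius norm; they are not learned weights.

```python
def quad_manhattan(xi, xj, W):
    """q(xi, xj) = |xi - xj|^T W |xi - xj| with element-wise absolute values."""
    d = [abs(a - b) for a, b in zip(xi, xj)]
    return sum(d[a] * W[a][b] * d[b]
               for a in range(len(d)) for b in range(len(d)))

s = 0.5 ** 0.5
W_main = [[1.0, 0.0], [0.0, 0.0]]  # only the main effect of feature 0
W_inter = [[0.0, s], [s, 0.0]]     # only the interaction of features 0 and 1

xi, xj = [0.0, 0.0], [2.0, 3.0]
q_main = quad_manhattan(xi, xj, W_main)    # 2 * 2 = 4
q_inter = quad_manhattan(xi, xj, W_inter)  # 2 * s * (2 * 3)
```

With W_inter the measurement depends only on the product of the two coordinate differences, which is how an off-diagonal weight expresses an interaction.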
3.4 IMMIGRATE
We design the following cost function to maximize our new margin while simultaneously optimizing the hit entropy and miss entropy:
C=\sum_{n=1}^{N}\bigg(\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}\big|\vec{x}_{n}-\vec{x}_{h}\big|^{\,T}\textbf{W}\big|\vec{x}_{n}-\vec{x}_{h}\big|-\sum_{m\in\mathcal{M}_{n}}\beta_{n,m}\big|\vec{x}_{n}-\vec{x}_{m}\big|^{\,T}\textbf{W}\big|\vec{x}_{n}-\vec{x}_{m}\big|\bigg)+\sigma\sum_{n=1}^{N}\big[E_{miss}(z_{n})-E_{hit}(z_{n})\big],  (3.5)
\text{subject to}:\textbf{W}\geq 0,\,\textbf{W}^{T}=\textbf{W},\,\|\textbf{W}\|_{F}^{2}=1,
\forall n,\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}=1,\sum_{m\in\mathcal{M}_{n}}\beta_{n,m}=1,\text{and}\,\,\alpha_{n,h}\geq 0,\beta_{n,m}\geq 0,
where E_{miss}(z_{n})=-\sum_{m\in\mathcal{M}_{n}}\beta_{n,m}\log\beta_{n,m}, E_{hit}(z_{n})=-\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}\log\alpha_{n,h}, and \sigma is a hyperparameter that can be tuned via internal cross-validation.
We also design the following optimization procedure containing two iterative steps to find the W that minimizes the cost function. The framework starts from a randomly initialized W and stops when the change in the cost function is less than a preset limit or the iteration number reaches a preset threshold. In practice, we find that it typically takes fewer than 10 iterations to stop and obtain good results. Based on our experiments, different initializations of W do not influence the results of the iterative optimization. The computation time of IMMIGRATE is comparable to that of other interaction-related methods such as SODA (Li and Liu, 2018) and hierNet (Bien et al., 2013).
As depicted by the flowchart in Figure 1, the IMMIGRATE algorithm iteratively optimizes the cost function in Equation 3.5. It starts with a random initialization satisfying certain boundary conditions, and proceeds to iterate the two steps as detailed below in Algorithm 2.
3.4.1 Step 1: Fix W, Update \{\alpha_{n,h}\} and \{\beta_{n,m}\}
Fixing W and setting \frac{\partial C}{\partial\alpha_{n,h}}=0 and \frac{\partial C}{\partial\beta_{n,m}}=0, we can obtain the closed-form updates of \alpha_{n,h} and \beta_{n,m} as
\alpha_{n,h}=\frac{\exp(-q(\vec{x}_{n},\vec{x}_{h})/\sigma)}{\sum_{k\in\mathcal{H}_{n}}\exp(-q(\vec{x}_{n},\vec{x}_{k})/\sigma)},  (3.6)
\beta_{n,m}=\frac{\exp(-q(\vec{x}_{n},\vec{x}_{m})/\sigma)}{\sum_{k\in\mathcal{M}_{n}}\exp(-q(\vec{x}_{n},\vec{x}_{k})/\sigma)}.
The Hessian matrix of C w.r.t. the probability pair (\alpha_{n,h}, \beta_{n,m}) is:
\frac{\partial^{2}C}{\partial(\alpha_{n,h},\beta_{n,m})}=\left(\begin{array}{cc}\sigma/\alpha_{n,h}&\partial^{2}C/\partial\beta_{n,m}\partial\alpha_{n,h}\\ \partial^{2}C/\partial\beta_{n,m}\partial\alpha_{n,h}&-\sigma/\beta_{n,m}\end{array}\right).  (3.7)
Since \alpha_{n,h},\beta_{n,m}>0, the determinant of the Hessian matrix is negative, so the stationary point is a saddle point in the (\alpha_{n,h},\beta_{n,m}) space. Therefore, the cost function C achieves its local minimum and local maximum w.r.t. \alpha_{n,h} and \beta_{n,m}, respectively.
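The Step 1 updates are softmax weights over negative quadratic-Manhattan values. A minimal Python sketch follows; the function name nearest_neighbor_probs and the toy q values are assumptions, and the shift by the minimum is a standard numerical-stability device that does not change the result.

```python
import math

def nearest_neighbor_probs(q_values, sigma):
    """Softmax of -q/sigma: used for alpha (over hits) and beta (over misses)."""
    m = min(q_values)  # shifting by a constant leaves the ratios unchanged
    weights = [math.exp(-(q - m) / sigma) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

q_values = [1.0, 2.0, 4.0]
# Large sigma: probabilities are close to uniform.
probs_large_sigma = nearest_neighbor_probs(q_values, sigma=100.0)
# Small sigma: almost all mass goes to the neighbor with the smallest q.
probs_small_sigma = nearest_neighbor_probs(q_values, sigma=0.1)
```

The hyperparameter \sigma therefore interpolates between a hard nearest-neighbor assignment and a uniform distribution over all hits (or misses).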
3.4.2 Step 2: Fix \{\alpha_{n,h}\} and \{\beta_{n,m}\}, Update W
Fixing \alpha_{n,h} and \beta_{n,m}, the minimization w.r.t. W is convex. In Equation 3.5, W satisfies \textbf{W}\geq 0,\,\textbf{W}^{T}=\textbf{W},\,\|\textbf{W}\|_{F}^{2}=1. In our iterative optimization strategy, we require W to be a distance metric for computation. Then, a closed-form solution for W can be derived (see Equation 3.8).
Theorem 3.1.
With \{\alpha_{n,h}\} and \{\beta_{n,m}\} fixed, Equation 3.5 gives rise to a closed-form solution for updating W. Let
\mathbf{\Sigma}=\sum_{n=1}^{N}\left(\Sigma_{n,H}-\Sigma_{n,M}\right),
where \Sigma_{n,H}=\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}\big|\vec{x}_{n}-\vec{x}_{h}\big|\big|\vec{x}_{n}-\vec{x}_{h}\big|^{\,T} and \Sigma_{n,M}=\sum_{m\in\mathcal{M}_{n}}\beta_{n,m}\big|\vec{x}_{n}-\vec{x}_{m}\big|\big|\vec{x}_{n}-\vec{x}_{m}\big|^{\,T}. Let the \psi_{i}'s and \mu_{i}'s be the eigenvectors and eigenvalues of \mathbf{\Sigma}, respectively, so that \mathbf{\Sigma}\psi_{i}=\mu_{i}\psi_{i} with \|\psi_{i}\|_{2}^{2}=1. Then,
\mathbf{W}=\Phi\,\Phi^{T},  (3.8)
where \Phi=(\sqrt{\eta_{1}}\psi_{1},\sqrt{\eta_{2}}\psi_{2},\cdots,\sqrt{\eta_{A}}\psi_{A}) and \sqrt{\eta_{i}}=\sqrt{(-\mu_{i})^{+}/\sqrt{\sum_{j=1}^{A}((-\mu_{j})^{+})^{2}}}.
Proof.
Since W is a distance metric matrix, it is symmetric and positive-semidefinite. Let \lambda_{1}\geq\lambda_{2}\geq\cdots\geq\lambda_{A}\geq 0 be the eigenvalues of W; then the eigendecomposition of W is
\textbf{W}=P\Lambda P^{\,T}=P\Lambda^{1/2}\Lambda^{1/2}P^{\,T}  (3.9)
=[\sqrt{\lambda_{1}}\,p_{1},\cdots,\sqrt{\lambda_{A}}\,p_{A}][\sqrt{\lambda_{1}}\,p_{1},\cdots,\sqrt{\lambda_{A}}\,p_{A}]^{\,T}\equiv\Phi\Phi^{T},
where P is an orthogonal matrix, and \Phi=[\phi_{1},\cdots,\phi_{A}]\equiv[\sqrt{\lambda_{1}}p_{1},\cdots,\sqrt{\lambda_{A}}p_{A}]. Thus, \left<\phi_{i},\phi_{j}\right>=0 for i\neq j. The constraint \|\textbf{W}\|_{F}^{2}=1 can be simplified as:
\|\textbf{W}\|^{2}_{F}=\sum_{i,j}w_{i,j}^{2}=\sum_{i}(\phi_{i}^{\,T}\phi_{i})^{2}=1.  (3.10)
Let us rearrange Equation 3.5 as:
\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}\big|\vec{x}_{n}-\vec{x}_{h}\big|^{\,T}\textbf{W}\big|\vec{x}_{n}-\vec{x}_{h}\big|=\text{tr}\Big(\textbf{W}\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}\big|\vec{x}_{n}-\vec{x}_{h}\big|\big|\vec{x}_{n}-\vec{x}_{h}\big|^{\,T}\Big),  (3.11)
\text{tr}(\textbf{W}\Sigma_{n,H})=\text{tr}\Big(\Sigma_{n,H}\sum_{i=1}^{A}\phi_{i}\phi_{i}^{\,T}\Big)=\sum_{i=1}^{A}\phi_{i}^{\,T}\Sigma_{n,H}\,\phi_{i}.
Then, up to the entropy terms, which are constant when \{\alpha_{n,h}\} and \{\beta_{n,m}\} are fixed, Equation 3.5 can be further simplified as:
C=\sum_{i=1}^{A}\phi_{i}^{\,T}\Sigma\,\phi_{i},  (3.12)
\text{subject to}:\|\textbf{W}\|^{2}_{F}=\sum_{i}(\phi_{i}^{\,T}\phi_{i})^{2}=1,\left<\phi_{i},\phi_{j}\right>=0\ (i\neq j),
where \Sigma=\sum_{n=1}^{N}(\Sigma_{n,H}-\Sigma_{n,M}), \Sigma_{n,H}=\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}\big|\vec{x}_{n}-\vec{x}_{h}\big|\big|\vec{x}_{n}-\vec{x}_{h}\big|^{\,T}, and \Sigma_{n,M}=\sum_{m\in\mathcal{M}_{n}}\beta_{n,m}\big|\vec{x}_{n}-\vec{x}_{m}\big|\big|\vec{x}_{n}-\vec{x}_{m}\big|^{\,T}. The orthogonality condition can be ignored because this condition is required in the constraint. The Lagrangian for the optimization problem in Equation 3.12 is easy to obtain:
L=\sum_{i=1}^{A}\phi_{i}^{\,T}\Sigma\,\phi_{i}+\lambda\Big(\sum_{i=1}^{A}(\phi_{i}^{\,T}\phi_{i})^{2}-1\Big).  (3.13)
Differentiating L with respect to \phi_{i} yields:
\partial{L}/\partial{\phi_{i}}=2\Sigma\phi_{i}+4\lambda\phi_{i}^{\,T}\phi_{i}\phi_{i}=0.  (3.14)
Denote \psi_{i}:=\phi_{i}/\|\phi_{i}\|_{2}. From Equation 3.14, we have
\Sigma\,\psi_{i}=\mu_{i}\,\psi_{i},  (3.15)
where \mu_{i}=-2\lambda\|\phi_{i}\|_{2}^{2}. Thus, \psi_{i} and \mu_{i} are an eigenvector and eigenvalue of \Sigma, respectively.
Let \phi_{i}=\sqrt{\eta_{i}}\psi_{i}, \eta_{i}\geq 0. Thus, C=\sum_{i=1}^{A}\sqrt{\eta_{i}}\psi_{i}^{\,T}\Sigma\sqrt{\eta_{i}}\psi_{i}=\sum_{i=1}^{A}\eta_{i}\mu_{i}\psi_{i}^{\,T}\psi_{i}=\sum_{i=1}^{A}\eta_{i}\mu_{i}, and \|\textbf{W}\|^{2}_{F}=\sum_{i}(\sqrt{\eta_{i}}\psi_{i}^{\,T}\sqrt{\eta_{i}}\psi_{i})^{2}=\sum_{i}(\eta_{i})^{2}=1. Then, Equation 3.12 can be simplified to
C=\sum_{i=1}^{A}\eta_{i}\mu_{i},\,\,\text{subject to}:\,\,\sum_{i=1}^{A}(\eta_{i})^{2}=1,\eta_{i}\geq 0.  (3.16)
Note that Equation 3.16 has exactly the same form as the optimization problem of the original Relief algorithm (Algorithm 1), which gives:
\vec{\eta}=(-\vec{\mu})^{+}/\|(-\vec{\mu})^{+}\|_{2},  (3.17)
where (\vec{a})^{+}=[\max(a_{1},0),\max(a_{2},0),\cdots,\max(a_{I},0)], and \phi_{i}=\sqrt{\eta_{i}}\psi_{i}. It is also easy to see that the updated W is a distance metric. ∎
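The closed-form update can be sketched in a few lines of Python once the eigenpairs of \Sigma are available. In this illustration the eigenvalues and unit eigenvectors are supplied by the caller (any eigensolver could provide them), and the function name update_W and the diagonal example are assumptions rather than part of the original method description.

```python
import math

def update_W(eigvals, eigvecs):
    """Build W = sum_i eta_i * psi_i psi_i^T with eta = (-mu)^+ / ||(-mu)^+||_2.

    eigvecs[i] is the unit eigenvector paired with eigvals[i].
    """
    pos = [max(-mu, 0.0) for mu in eigvals]  # (-mu)^+
    norm = math.sqrt(sum(v * v for v in pos))
    if norm == 0.0:
        raise ValueError("Sigma has no negative eigenvalue; eta is undefined")
    eta = [v / norm for v in pos]
    A = len(eigvecs[0])
    W = [[sum(eta[i] * eigvecs[i][a] * eigvecs[i][b] for i in range(len(eta)))
          for b in range(A)] for a in range(A)]
    return eta, W

# Example: Sigma = diag(-3, -4, 2), whose eigenvectors are the standard basis.
eta, W = update_W([-3.0, -4.0, 2.0],
                  [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
frobenius = math.sqrt(sum(w * w for row in W for w in row))
```

Only directions with negative eigenvalues (where hits are closer than misses on average) receive weight, and the resulting W has unit Frobenius norm.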
3.4.3 Weight Pruning
Some previous Relief-based algorithms offer options to remove weights lower than a preset threshold. IMMIGRATE offers a similar option to prune small weights: elements of W below a preset threshold are set to 0, after which W is renormalized w.r.t. the Frobenius norm.
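A minimal Python sketch of the pruning step is given below. The function name prune_weights and the example matrix are hypothetical; the threshold is passed in explicitly.

```python
import math

def prune_weights(W, threshold):
    """Zero out entries below the threshold, then rescale to unit Frobenius norm."""
    pruned = [[w if w >= threshold else 0.0 for w in row] for row in W]
    norm = math.sqrt(sum(w * w for row in pruned for w in row))
    if norm == 0.0:
        raise ValueError("all weights were pruned; lower the threshold")
    return [[w / norm for w in row] for row in pruned]

W = [[0.8, 0.05], [0.05, 0.6]]
W_pruned = prune_weights(W, threshold=0.5)
pruned_norm = math.sqrt(sum(w * w for row in W_pruned for w in row))
```

Here the weak interaction weight 0.05 is removed while the two main-effect weights survive and are rescaled.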
3.4.4 Predict New Samples
A prediction rule based on the learned weight matrix W can be formulated as:
\hat{y}^{\prime}=\arg\min_{c}\sum_{y_{n}=c}\alpha_{n}^{c}(\vec{x}^{\,\prime})q(\vec{x}^{\,\prime},\vec{x}_{n}),  (3.18)
\alpha_{n}^{c}(\vec{x}^{\,\prime})=\frac{\exp\big(-q(\vec{x}^{\,\prime},\vec{x}_{n})/\sigma\big)}{\sum_{y_{k}=c}\exp\big(-q(\vec{x}^{\,\prime},\vec{x}_{k})/\sigma\big)},
where z^{\prime}=(\vec{x}^{\,\prime},y^{\,\prime}) is a new instance, c denotes the class, and \hat{y}^{\prime} is the predicted label. This prediction method assigns a new instance to the class that maximizes its hypothesis-margin using the learned weight matrix W, which makes it more stable than the k-NN method used in the traditional Relief-based algorithms.
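The prediction rule can be sketched as follows in Python. The function names, the toy training set, and the diagonal W are illustrative assumptions; in practice W would be the learned matrix.

```python
import math

def quad_manhattan(xi, xj, W):
    d = [abs(a - b) for a, b in zip(xi, xj)]
    return sum(d[a] * W[a][b] * d[b]
               for a in range(len(d)) for b in range(len(d)))

def predict(x_new, X, y, W, sigma):
    """Return the class with the smallest softmax-weighted average of q."""
    scores = {}
    for c in sorted(set(y)):
        q = [quad_manhattan(x_new, xn, W) for xn, yn in zip(X, y) if yn == c]
        m = min(q)  # shift for numerical stability; ratios are unchanged
        e = [math.exp(-(v - m) / sigma) for v in q]
        total = sum(e)
        scores[c] = sum(ei / total * v for ei, v in zip(e, q))
    return min(scores, key=scores.get)

X = [[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.8]]
y = [1, 1, -1, -1]
s = 0.5 ** 0.5
W = [[s, 0.0], [0.0, s]]  # unit Frobenius norm, main effects only
label_a = predict([0.1, 0.0], X, y, W, sigma=1.0)
label_b = predict([1.9, 2.0], X, y, W, sigma=1.0)
```

Because every training instance of a class contributes with a weight that decays in q, the prediction does not hinge on a single nearest neighbor.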
3.5 IMMIGRATE in Ensemble Learning
Boosting (Schapire, 1990; Freund and Schapire, 1996; Freund and Mason, 1999) has been widely used to create ensemble learners that produce state-of-the-art results in many tasks. Boosting combines a set of relatively weak base learners to create a much stronger learner. To use IMMIGRATE as the base classifier in the AdaBoost algorithm (Freund and Schapire, 1996), we modify the cost function in Equation 3.5 to include sample weights and use the modified version in the boosting iterations. We name the algorithm BIM, standing for Boosted IMMIGRATE (refer to Equation 3.19 and Algorithm 3 for more details about BIM). BIM schedules the adjustment of the hyperparameter \sigma in its boosting iterations. It starts with \sigma set to a predefined \sigma_{max} and gradually reduces \sigma by multiplying it by (\sigma_{min}/\sigma_{max})^{1/T} at each iteration until reaching \sigma_{min}, where T is a predefined maximum number of boosting iterations.
C=\sum_{n=1}^{N}D(\vec{x}_{n})\bigg(\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}\big|\vec{x}_{n}-\vec{x}_{h}\big|^{\,T}\textbf{W}\big|\vec{x}_{n}-\vec{x}_{h}\big|-\sum_{m\in\mathcal{M}_{n}}\beta_{n,m}\big|\vec{x}_{n}-\vec{x}_{m}\big|^{\,T}\textbf{W}\big|\vec{x}_{n}-\vec{x}_{m}\big|\bigg)+\sigma\sum_{n=1}^{N}D(\vec{x}_{n})\big[E_{miss}(z_{n})-E_{hit}(z_{n})\big],  (3.19)
\text{subject to}:\textbf{W}\geq 0,\,\textbf{W}^{T}=\textbf{W},\,\|\textbf{W}\|_{F}^{2}=1,
\forall n,\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}=1,\sum_{m\in\mathcal{M}_{n}}\beta_{n,m}=1,\text{and}\,\,\alpha_{n,h}\geq 0,\beta_{n,m}\geq 0,
where E_{miss}(z_{n})=-\sum_{m\in\mathcal{M}_{n}}\beta_{n,m}\log\beta_{n,m}, E_{hit}(z_{n})=-\sum_{h\in\mathcal{H}_{n}}\alpha_{n,h}\log\alpha_{n,h}, \sum_{n=1}^{N}D(\vec{x}_{n})=1, and D(\vec{x}_{n})\geq 0 for all n.
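The geometric \sigma schedule used by BIM can be sketched in Python as below. The function name sigma_schedule is hypothetical, and returning T+1 values (the starting value plus one per boosting iteration) is an assumption about how the schedule is indexed. The example values \sigma_{max}=4, \sigma_{min}=0.2, and T=100 are those reported in the experimental settings.

```python
def sigma_schedule(sigma_max, sigma_min, T):
    """Geometric decay from sigma_max to sigma_min over T boosting iterations."""
    ratio = (sigma_min / sigma_max) ** (1.0 / T)
    sigmas = [sigma_max]
    for _ in range(T):
        # Clamp at sigma_min so rounding error never pushes sigma below it.
        sigmas.append(max(sigmas[-1] * ratio, sigma_min))
    return sigmas

sigmas = sigma_schedule(sigma_max=4.0, sigma_min=0.2, T=100)
```

Early boosting rounds thus use a large \sigma (smoother neighbor distributions), while later rounds focus on the nearest hits and misses.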
3.6 IMMIGRATE for High-Dimensional Data Space
When applied to high-dimensional data, IMMIGRATE can incur a high computational cost because it considers the interactions between every feature pair. To reduce the computational cost, we first use IM4E (Bei and Hong, 2015) to learn a feature weight vector, which is used to initialize the diagonal elements of W in the proposed quadratic-Manhattan measurement. We also use the learned feature weight vector to help pre-screen the features, and keep only those with weights above a preset limit. In the remaining computation, we only model interactions between those chosen features. The features discarded by pre-screening can be added back empirically based on the needs of a specific application. We term this procedure IM4E-IMMIGRATE, which is effective and computationally efficient. It can also be boosted (Boosted IM4E-IMMIGRATE) to be stronger.
4 Experiments
In our experiments, all continuous features are normalized to zero mean and unit variance. Cross-validation is used to compare the performances of the various approaches. We have implemented IMMIGRATE in R and MATLAB. The R package is available at https://CRAN.R-project.org/package=Immigrate, and the MATLAB version is available at https://github.com/RuzhangZhao/ImmigrateMATLAB. Both IMMIGRATE and BIM can be accelerated by parallel computing as their computations are matrix-based.
4.1 Synthetic Dataset
We first test the robustness of the IMMIGRATE algorithm using a synthesized dataset where we have two interacting features following Gaussian distributions in a binary classification setting. The simulated dataset contains 100 samples from one class governed by a Gaussian distribution with mean (4,2)^{T} and covariance matrix \left(\begin{array}{cc}1&0.5\\ 0.5&1\end{array}\right), and another 100 samples from the other class governed by a Gaussian distribution with mean (6,0)^{T} and the same covariance matrix. In addition, we add noise following a Gaussian distribution with mean (8,2)^{T} and covariance matrix \left(\begin{array}{cc}8&4\\ 4&8\end{array}\right) to the first class, and add noise following a Gaussian distribution with mean (2,4)^{T} and the same covariance matrix to the second class. Figure 2 shows a scatter plot of the synthesized dataset containing 10% samples from the noise distributions. The orange dotted line in Figure 2, which has slope 1, separates data with different labels.
The noise is included to disturb the detection of the interaction term. The noise level starts from 5% and gradually increases by 5% up to 50%. As the baseline, we apply logistic regression and observe that the t-test p-value of the interaction coefficient increases from 3\times 10^{-11} to 7\times 10^{-5} and 0.7 when the noise level increases from 0% to 10% and 50%. Local Feature Extraction (LFE, Sun and Wu (2008)) is a Relief-based algorithm that considers interaction terms indirectly, though the interaction information is only used for feature extraction. We run IMMIGRATE and LFE on the synthesized datasets and compare the weights of the interaction term between features 1 and 2 in Figure 3, which shows that IMMIGRATE is more robust than LFE.
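A dataset of this kind can be generated with the following Python sketch, which uses only the standard library. Several details are assumptions: the means and covariances are taken as printed in the text, noise samples are added on top of the 100 clean samples per class (the text does not say whether they are added or substituted), and the names sample_gaussian_2d and make_dataset are hypothetical.

```python
import math
import random

def sample_gaussian_2d(mean, cov, n, rng):
    """Draw n points from a 2-D Gaussian using a hand-written Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        points.append([mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2])
    return points

def make_dataset(noise_fraction, n_per_class=100, seed=0):
    rng = random.Random(seed)
    clean_cov = [[1.0, 0.5], [0.5, 1.0]]
    noise_cov = [[8.0, 4.0], [4.0, 8.0]]
    n_noise = round(noise_fraction * n_per_class)  # assumed: added, not replaced
    class_a = sample_gaussian_2d([4.0, 2.0], clean_cov, n_per_class, rng)
    class_a += sample_gaussian_2d([8.0, 2.0], noise_cov, n_noise, rng)
    class_b = sample_gaussian_2d([6.0, 0.0], clean_cov, n_per_class, rng)
    class_b += sample_gaussian_2d([2.0, 4.0], noise_cov, n_noise, rng)
    X = class_a + class_b
    y = [1] * len(class_a) + [-1] * len(class_b)
    return X, y

X, y = make_dataset(noise_fraction=0.10)
```

Varying noise_fraction from 0.05 to 0.50 in steps of 0.05 would give one dataset per noise level.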
4.2 Real Datasets
We compare IMMIGRATE with several existing popular methods using real datasets from the UCI database. The following algorithms are considered in the comparison: Support Vector Machine (Soentpiet, 1999) with Sigmoid Kernel (SV1), Support Vector Machine with Radial basis function Kernel (SV2), LASSO (LAS) (Tibshirani, 1996), Decision Tree (DT) (Freund and Mason, 1999), Naive Bayes Classifier (NBC) (John and Langley, 1995), Radial basis function Network (RBF) (Haykin, 1994), 1-Nearest Neighbor (1NN) (Aha et al., 1991), 3-Nearest Neighbor (3NN), Large Margin Nearest Neighbor (LMN) (Weinberger and Saul, 2009), Relief (REL) (Kira and Rendell, 1992), ReliefF (RFF) (Kononenko, 1994; Robnik-Šikonja and Kononenko, 2003), Simba (SIM) (Gilad-Bachrach et al., 2004), and Linear Discriminant Analysis (LDA) (Fisher, 1936). In addition, several methods designed for detecting interaction terms are included: LFE (Sun and Wu, 2008), Stepwise conditional likelihood variable selection for Discriminant Analysis (SOD) (Li and Liu, 2018), and hierNet (HIN) (Bien et al., 2013). We also include three of the most widely used and competitive ensemble learners: Adaptive Boosting (ADB) (Freund and Schapire, 1996; Freund and Mason, 1999), Random Forest (RF) (Breiman, 2001), and XGBoost (XGB) (Chen and Guestrin, 2016). We use the following abbreviations when presenting the results: IM4 for IM4E, IGT for IMMIGRATE, and B4G for the boosted IM4E-IMMIGRATE.
Whenever possible, we use the settings of the aforementioned methods reported in their original papers: LMNN uses a 3NN classifier; Relief and Simba use Euclidean distance and a 1NN classifier; ReliefF uses Manhattan distance and a k-NN classifier (k=1,3,5 is decided by internal cross-validation); in SODA, gam (=0, 0.5, 1) is determined by internal cross-validation and logistic regression is used for prediction. The IM4E algorithm has two hyperparameters, \lambda and \sigma. We fix \lambda=1 as it has no actual contribution and tune \sigma as suggested by Bei and Hong (2015). Hence, the IMMIGRATE algorithm has only one hyperparameter, \sigma. When tuning \sigma, we gradually decrease \sigma from \sigma_{0}=4 by half each time until it is not larger than 0.2. The preset limit for weight pruning is 1/A, where A is the number of features. Furthermore, the preset iteration number is chosen to be 10. For each dataset, \sigma and whether weight pruning is applied are determined by the best internal cross-validation results. For BIM, we use \sigma_{max}=4, \sigma_{min}=0.2, and the maximal number of boosting iterations T is 100. The preset threshold in IM4E-IMMIGRATE is 2/A.
We repeat ten-fold cross-validation ten times for each algorithm on each dataset, i.e., 100 trials are carried out. When comparing two algorithms (i.e., A vs. B), we calculate the paired Student's t-test using the results of the 100 trials. The first null hypothesis is that there is no difference between the performances of A and those of B. When the p-value is larger than the significance level cutoff 0.05, we say A "ties" B, which means there is no significant difference between their performances. When the p-value is smaller than the significance level cutoff 0.05, the second null hypothesis is that the performances of B are no worse than those of A. When the new p-value is smaller than the significance level cutoff 0.05, we say A "wins", which means A on average performs significantly better than B on this dataset, and vice versa.
4.2.1 Gene Expression Datasets
Gene expression datasets typically have thousands of features. We use the following five gene expression datasets for feature selection: GLI (Freije et al., 2004), Colon (COL) (Alon et al., 1999), Myeloma (ELO) (Tian et al., 2003), Breast (BRE) (Van't Veer et al., 2002), and Prostate (PRO) (Singh et al., 2002). All datasets have more than 10,000 features. Refer to Table 4 in Appendix A for details of all datasets.
We perform ten-fold cross-validation ten times, i.e., 100 trials in total. The results are summarized in Table 1. The last row "(W,T,L)" indicates the number of times that the Boosted IM4E-IMMIGRATE (B4G) wins, ties, or loses (W,T,L) when compared with each algorithm by the paired Student's t-test with the significance level of \alpha=0.05. The comparison results are also summarized in Figure 4 (top plot) for easy comparison. Although our B4G is not always the best, it outperforms other methods in most cases. In particular, when IM4E-IMMIGRATE (EGT) is compared with other methods, it also outperforms them in most cases.
Data  SV1  SV2  LAS  DT  NBC  1NN  3NN  SOD  RF  XGB  IM4  EGT  B4G
GLI  85.1  86.0  85.2  83.8  83.0  88.7  87.7  88.7  87.6  86.3  87.5  89.1  89.9
COL  73.7  82.0  80.6  69.2  71.1  72.1  77.9  78.1  82.6  79.5  84.3  78.6  82.5
ELO  72.9  90.2  74.6  77.3  76.3  85.6  91.3  86.9  79.2  77.9  88.9  88.6  88.4
BRE  76.0  88.7  91.4  76.4  69.4  83.0  73.6  82.6  86.3  87.3  88.1  90.2  91.5
PRO  71.3  69.9  87.9  86.4  68.0  83.2  82.7  83.2  91.8  90.5  88.0  89.5  89.7
W,T,L^{1}  5,0,0  4,0,1  4,1,0  5,0,0  5,0,0  5,0,0  4,0,1  5,0,0  3,1,1  4,0,1  3,1,1  -,-,-  -,-,-

1 The last row shows the number of times Boosted IM4E-IMMIGRATE (B4G) wins, ties, or loses (W,T,L) when compared with each algorithm by the paired t-test.
** Ten-fold cross-validation is performed ten times, namely 100 trials are carried out for each dataset. The average accuracies are reported on the corresponding datasets in Tables 1, 2, and 3. Here, with 100 trials and two algorithms A and B, the paired Student's t-test is carried out between the results of these two algorithms. Under the significance level of \alpha=0.05, algorithm A is significantly better than another algorithm B (i.e., A wins) on a dataset if the p-value of the paired Student's t-test with the corresponding null hypothesis is less than \alpha=0.05. (The rule also applies to experiments on UCI datasets.)
4.2.2 UCI Datasets
We also carry out an extensive comparison using many UCI datasets (Frank and Asuncion, 2010): BCW, CRY, CUS, ECO, GLA, HMS, IMM, ION, LYM, MON, PAR, PID, SMR, STA, URB, USE and WIN. Refer to Table 4 in Appendix A for the full names of these datasets. If a dataset has more than two classes, we keep only the two classes with the largest sample sizes. In addition, we use three large-scale datasets: CRO{}^{*}, ELE{}^{*}, WAV{}^{*}.
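The reduction of a multi-class dataset to a binary one, as described above, might look like the following sketch. The helper name `keep_two_largest_classes` is hypothetical, and the handling of ties between equally sized classes (first class encountered is preferred) is an assumption, since the text does not specify it.

```python
from collections import Counter


def keep_two_largest_classes(X, y):
    """Keep only the samples whose label is one of the two most frequent."""
    counts = Counter(y)
    # most_common(2) returns the two labels with the largest sample sizes;
    # ties are resolved in order of first appearance.
    kept = {label for label, _ in counts.most_common(2)}
    rows = [(x, label) for x, label in zip(X, y) if label in kept]
    X_bin = [x for x, _ in rows]
    y_bin = [label for _, label in rows]
    return X_bin, y_bin


# Example: three classes with 5, 3 and 4 samples; class "b" is dropped.
X = [[float(i)] for i in range(12)]
y = ["a"] * 5 + ["b"] * 3 + ["c"] * 4
X_bin, y_bin = keep_two_largest_classes(X, y)
```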
We perform ten-fold cross-validation ten times. Table 2 (for IMMIGRATE) and Table 3 (for BIM) show the average accuracies on the corresponding datasets. In each table, the last row "(W,T,L)" gives the number of datasets on which IMMIGRATE (IGT) or BIM, respectively, wins, ties, or loses against each algorithm according to the paired Student’s t-test at the significance level \alpha=0.05. The comparison results are also summarized in Figure 4 (bottom subplot), where the first 17 items (black) show the results for IMMIGRATE and the last three items (blue) show the results for BIM.
Although IMMIGRATE and BIM are not always the best, they significantly outperform the other methods in one-to-one comparisons of the cross-validation results. Figure 4 (bottom subplot, black part) and Table 2 show that IMMIGRATE achieves state-of-the-art performance as a base classifier, while Figure 4 (bottom subplot, blue part) and Table 3 show that BIM achieves state-of-the-art performance as the boosted version. To visualize the feature selection results of our approaches, we plot the feature weight heat maps of four datasets (GLA, LYM, SMR and STA) in Appendix B Figure B1.
Table 2: Average accuracies (%) of IMMIGRATE (IGT) and the compared algorithms on the UCI datasets.
Data  SV1  SV2  LAS  DT  NBC  RBF  1NN  3NN  LMN  REL  RFF  SIM  LFE  LDA  SOD  hIN  IM4  IGT
BCW  61.4  66.6  71.4  70.5  62.4  56.9  68.2  72.2  69.5  66.4  67.1  67.7  67.1  73.9  65.2  71.8  66.4  74.5
CRY  72.9  90.6  87.4  85.3  84.4  89.7  89.1  85.4  87.8  73.8  77.2  79.7  86.0  88.6  86.0  87.9  86.2  89.8
CUS  86.5  88.9  89.6  89.6  89.5  86.8  86.5  88.7  88.8  82.1  84.7  84.3  86.4  90.3  90.8  90.3  87.5  90.1
ECO  92.9  96.9  98.6  98.6  97.8  94.6  96.0  97.8  97.8  89.0  90.7  91.2  93.1  99.0  97.9  98.7  97.5  98.2
GLA  64.2  76.7  72.3  79.4  69.5  73.0  81.1  78.1  79.4  64.1  63.5  67.1  81.2  72.0  75.3  75.0  78.0  87.5
HMS  63.8  64.5  67.7  72.5  67.2  66.8  66.0  69.3  71.2  65.3  66.0  65.7  64.9  69.0  67.4  69.4  66.6  69.2
IMM  74.3  70.6  74.4  84.1  77.9  67.3  69.4  77.9  76.7  69.9  71.8  69.0  75.0  75.2  72.3  70.2  80.7  83.8
ION  80.5  93.5  83.6  87.4  89.4  79.9  86.7  84.1  84.5  85.8  86.2  84.2  91.0  83.3  90.3  92.6  88.3  92.9
LYM  83.6  81.5  85.2  75.2  83.6  71.1  77.2  82.8  86.6  64.9  71.0  70.4  79.6  85.2  79.3  84.8  83.3  87.2
MON  74.4  91.7  75.0  86.4  74.0  68.2  75.1  84.4  84.9  61.4  61.8  65.0  64.8  74.4  91.9  97.2  75.6  99.5
PAR  72.7  72.5  77.1  84.8  74.1  71.5  94.6  91.4  91.8  87.3  90.3  84.6  94.0  85.6  88.2  89.5  83.2  93.8
PID  65.6  73.1  74.7  74.3  71.2  70.3  70.3  73.5  74.0  64.8  68.0  67.0  67.8  74.5  75.7  74.1  72.1  74.7
SMR  73.5  83.9  73.6  72.3  70.3  67.1  86.9  84.7  86.1  69.5  78.3  81.0  84.3  73.1  70.5  83.0  76.4  86.5
STA  69.8  71.6  70.8  68.9  71.0  69.5  67.8  70.8  71.3  59.7  64.0  63.0  66.7  71.3  71.8  69.2  70.8  75.9
URB  85.2  87.9  88.1  82.6  85.8  75.3  87.2  87.5  87.9  81.9  83.2  73.0  87.9  73.0  87.9  88.3  87.4  89.9
USE  95.7  95.2  97.2  93.2  90.6  84.9  90.5  91.5  92.0  54.5  63.7  69.5  85.8  96.9  96.2  96.5  94.1  96.4
WIN  98.3  99.3  98.6  93.1  97.3  97.2  96.4  96.6  96.5  87.2  95.0  95.0  93.8  99.7  92.9  98.9  98.2  99.0
CRO{}^{*}  75.4  97.5  89.9  91.0  88.8  75.4  98.4  98.5  98.6  98.5  98.7  95.1  98.6  89.1  95.2  95.5  81.9  98.2
ELE{}^{*}  72.3  95.7  79.9  80.0  82.5  70.8  81.1  83.9  89.7  64.6  75.4  76.2  79.8  79.9  93.7  93.6  83.2  93.7
WAV{}^{*}  90.0  91.9  92.2  86.2  91.4  84.0  86.5  88.3  88.8  77.6  80.0  83.6  84.7  91.8  92.0  92.1  91.1  92.4
W,T,L{}^{1}  20,0,0  16,2,2  15,4,1  16,3,1  19,1,0  20,0,0  17,2,1  18,2,0  16,3,1  19,1,0  19,1,0  19,1,0  18,2,0  15,4,1  13,4,3  12,7,1  19,0,1  -,-,-

{}^{1} The last row (W,T,L) shows the number of datasets on which IMMIGRATE (IGT) wins, ties, or loses against an existing algorithm according to the paired t-test on the cross-validation results.
Table 3: Average accuracies (%) of Boosted IMMIGRATE (BIM) and the compared boosting and ensemble algorithms on the UCI datasets.
Data  ADB  RF  XGB  BIM
BCW  78.2  78.6  78.6  78.3
CRY  90.4  92.9  89.9  91.5
CUS  90.8  91.1  91.4  91.0
ECO  98.0  98.9  98.2  98.6
GLA  85.0  87.0  87.9  86.8
HMS  65.8  72.1  70.0  72.0
IMM  77.2  84.2  81.7  86.1
ION  92.1  93.5  92.5  93.1
LYM  84.8  87.0  87.4  88.1
MON  98.4  95.8  99.1  99.7
PAR  90.5  91.0  91.9  93.2
PID  73.5  76.0  75.1  76.2
SMR  81.4  82.8  83.3  86.6
STA  69.0  71.3  69.5  74.1
URB  87.9  88.6  88.8  91.4
USE  96.0  95.3  94.9  96.1
WIN  97.5  99.1  98.2  99.1
CRO{}^{*}  97.3  97.4  98.5  98.6
ELE{}^{*}  91.1  92.3  95.2  94.1
WAV{}^{*}  89.5  91.2  90.8  93.3
W,T,L{}^{1}  17,3,0  11,8,1  14,4,2  -,-,-

{}^{1} The last row (W,T,L) shows the number of datasets on which Boosted IMMIGRATE (BIM) wins, ties, or loses against an existing algorithm according to the paired t-test on the cross-validation results.
5 Related Works
Relief-based algorithms and feature selection with interaction terms have been well explored in many recent publications. We review some of these methods here to show their connections to and differences from our approach. The hypothesis-margin definition in Equation 3.2 adopted in this work is also used in some previous studies, such as Bei and Hong (2015). However, Bei and Hong (2015) do not consider the interactions between features. Our work provides a measurable way to show the influence of each feature interaction.
Sun and Wu (2008) propose the local feature extraction (LFE) method, which learns linear combinations of features for feature extraction. LFE explores the information in feature interaction terms indirectly, which is partly our aim. However, LFE does not consider global information or margin stability, which results in significant differences in the cost function and the optimization procedure.
Our quadratic-Manhattan measurement defined in Equation 3.3 is related to the Mahalanobis metric used in previous works on metric learning, such as Large Margin Nearest Neighbor (LMNN) (Weinberger and Saul, 2009), which uses semidefinite programming to learn the distance metric. LMNN and our approach are both based on K-Nearest Neighbors. A major difference is that our quadratic-Manhattan measurement requires the matrix \mathbf{W} to be symmetric and element-wise non-negative with Frobenius norm \|\mathbf{W}\|_{F}=1, whereas metric learning only requires its matrix to be symmetric positive semidefinite. The non-negativity requirement on \mathbf{W} gives IMMIGRATE high interpretability, since each entry of the matrix indicates the importance of the corresponding interaction. The quadratic-Manhattan measurement serves well in the classification task and offers a direct explanation of how features, and in particular feature interaction terms, contribute to the classification results.
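To make the constraints on the weight matrix concrete, the sketch below evaluates a quadratic-Manhattan style measurement. It assumes that Equation 3.3 has the form q(x, z) = |x - z|^T W |x - z|, where |x - z| is the vector of element-wise absolute differences; that form is an assumption of this sketch, as are the function names. Diagonal entries of W weight single features and off-diagonal entries weight pairwise interactions.

```python
import math


def is_valid_weight_matrix(W, tol=1e-9):
    """Check symmetry, element-wise non-negativity and unit Frobenius norm."""
    n = len(W)
    symmetric = all(abs(W[a][b] - W[b][a]) <= tol
                    for a in range(n) for b in range(n))
    non_negative = all(v >= 0.0 for row in W for v in row)
    frobenius = math.sqrt(sum(v * v for row in W for v in row))
    return symmetric and non_negative and abs(frobenius - 1.0) <= tol


def quadratic_manhattan(x, z, W):
    """Assumed form of the measurement: |x - z|^T W |x - z|."""
    d = [abs(xi - zi) for xi, zi in zip(x, z)]
    n = len(d)
    return sum(d[a] * W[a][b] * d[b] for a in range(n) for b in range(n))


x, z = [1.0, 2.0], [3.0, 5.0]  # absolute differences are (2, 3)

# Only individual features carry weight (diagonal W).
W_main = [[0.6, 0.0], [0.0, 0.8]]
q_main = quadratic_manhattan(x, z, W_main)  # 0.6 * 4 + 0.8 * 9 = 9.6

# Only the interaction between the two features carries weight.
s = 1.0 / math.sqrt(2.0)
W_inter = [[0.0, s], [s, 0.0]]
q_inter = quadratic_manhattan(x, z, W_inter)  # 2 * s * 2 * 3 = 6 * sqrt(2)
```

Because every entry of W is non-negative, each term of the double sum contributes non-negatively, so a large entry W[a][b] can be read directly as a large contribution of the pair (a, b) to the measurement.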
6 Conclusions and Discussion
In this paper, we propose a new quadratic-Manhattan measurement to extend the hypothesis-margin framework, based on which a feature selection algorithm, IMMIGRATE, is developed for detecting and weighting interaction terms. We also develop its extended versions, Boosted IMMIGRATE (BIM) and IM4E-IMMIGRATE. IMMIGRATE and its variants follow the principle of maximizing a stable hypothesis-margin and are implemented via a computationally efficient iterative optimization procedure. Extensive experiments show that IMMIGRATE significantly outperforms state-of-the-art methods, and that its boosted version BIM outperforms other boosting-based approaches. Compared with other Relief-based algorithms, IMMIGRATE has the following main advantages: (1) it considers both local and global information; (2) it uses interaction terms; (3) it is robust and less prone to noise; (4) it is easily boosted. The computation time of the IMMIGRATE variants is comparable to that of other methods able to detect interaction terms.
IMMIGRATE has some limitations, and we discuss corresponding directions for improving the algorithm. First, in Section 3.4.3, small weights are removed directly using cutoffs to obtain sparse solutions, which makes inference on the obtained weights difficult. Penalty terms such as the \ell_{1} or \ell_{2} penalty are usually applied to shrink and select important weights. We suggest that our cost function in Equation 3.5 can be modified to include such a penalty term, replacing the weight pruning in Section 3.4.3. Second, although IMMIGRATE is efficient, it still takes considerable time on large datasets. To further improve the computational efficiency of IMMIGRATE for large-scale datasets, training can use well-selected prototypes (Garcia et al., 2012), which form a representative subset of the original data with noisy and redundant samples removed. Third, IMMIGRATE only considers pairwise interactions between features. Interactions among multiple features can play important roles in real applications (Yu et al., 2019; Vinh et al., 2016). Our work provides a basis for developing new algorithms to detect multi-feature interactions; for example, a tensor form can be used to hold the weights of multi-feature interactions. Fourth, although our iterative optimization procedure is efficient, it yields ad hoc solutions with no guarantee of reaching the global optimum. Developing better optimization algorithms remains an open challenge. Finally, the selection of an appropriate \sigma currently relies on internal cross-validation, which cannot uncover the underlying properties of \sigma. A better strategy may be developed by rigorously investigating the theoretical role of \sigma.
Appendix
Appendix A Information of the Real Datasets
Table 4: Information on the real datasets.
Data  No.F{}^{1}  No.I{}^{2}  Full Name
BCW  9  116  Breast Cancer Wisconsin (Prognostic)
CRY  6  90  Cryotherapy
CUS  7  440  Wholesale customers
ECO  5  220  Ecoli
GLA  9  146  Glass Identification
HMS  3  306  Haberman’s Survival
IMM  7  90  Immunotherapy
ION  32  351  Ionosphere
LYM  16  142  Lymphography
MON  6  432  MONK’s Problems
PAR  22  194  Parkinsons
PID  8  768  Pima Indians Diabetes
SMR  60  208  Connectionist Bench (Sonar, Mines vs. Rocks)
STA  12  256  Statlog (Heart)
URB  147  238  Urban Land Cover
USE  5  251  User Knowledge Modeling
WIN  13  130  Wine
CRO{}^{*}  28  9003  Crowdsourced Mapping
ELE{}^{*}  12  10000  Electrical Grid Stability Simulated
WAV{}^{*}  21  3304  Waveform Database Generator
GLI  22283  85  Gliomas Strongly Predicts Survival (Freije et al., 2004)
COL  2000  62  Tumor and Normal Colon Tissues (Alon et al., 1999)
ELO  12625  173  Myeloma (Tian et al., 2003)
BRE  24481  78  Breast Cancer (Van’t Veer et al., 2002)
PRO  12600  136  Clinical Prostate Cancer Behavior (Singh et al., 2002)

{}^{1} No.F: Number of Features.
{}^{2} No.I: Number of Instances.
Appendix B Heat Maps
References
Instance-based learning algorithms. Machine Learning 6 (1), pp. 37–66.
Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences 96 (12), pp. 6745–6750.
Maximizing margin quality and quantity. In Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on, pp. 1–6.
A lasso for hierarchical interactions. Annals of Statistics 41 (3), pp. 1111.
Random forests. Machine Learning 45 (1), pp. 5–32.
XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2), pp. 179–188.
UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science 213, pp. 2–2.
Gene expression profiling of gliomas strongly predicts survival. Cancer Research 64 (18), pp. 6503–6510.
The alternating decision tree learning algorithm. In ICML, Vol. 99, pp. 124–133.
Experiments with a new boosting algorithm. In ICML, Vol. 96, pp. 148–156.
Introduction to statistical pattern recognition. Elsevier.
Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3), pp. 417–435.
Margin based feature selection - theory and algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 43.
Neural networks: a comprehensive foundation. Prentice Hall PTR.
Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345.
A practical approach to feature selection. In Machine Learning Proceedings 1992, pp. 249–256.
Estimating attributes: analysis and extensions of Relief. In European Conference on Machine Learning, pp. 171–182.
Nonlinear programming. In Traces and Emergence of Nonlinear Programming, pp. 247–258.
Robust variable and interaction selection for logistic regression and general index models. Journal of the American Statistical Association, pp. 1–16.
Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 53 (1-2), pp. 23–69.
The strength of weak learnability. Machine Learning 5 (2), pp. 197–227.
Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 (2), pp. 203–209.
Advances in kernel methods: support vector learning. MIT Press.
Iterative Relief for feature weighting. In Proceedings of the 23rd International Conference on Machine Learning, pp. 913–920.
A Relief based feature extraction algorithm. In Proceedings of the 2008 SIAM International Conference on Data Mining, pp. 188–195.
The role of the Wnt-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. New England Journal of Medicine 349 (26), pp. 2483–2494.
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288.
Relief-based feature selection: introduction and review. Journal of Biomedical Informatics.
Gene expression profiling predicts clinical outcome of breast cancer. Nature 415 (6871), pp. 530.
Can high-order dependencies improve mutual information based feature selection? Pattern Recognition 53, pp. 46–58.
Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244.
A novel feature selection algorithm based on hypothesis-margin. JCP 3 (12), pp. 27–34.
Multivariate extension of matrix-based Rényi’s \alpha-order entropy functional. IEEE Transactions on Pattern Analysis and Machine Intelligence.