Efficient and Effective Feature Selection
Abstract
Because of continuous advances in mathematical programing, Mix Integer Optimization has become a competitive visavis popular regularization method for selecting features in regression problems. The approach exhibits unquestionable foundational appeal and versatility, but also poses important challenges. We tackle these challenges, reducing computational burden when tuning the sparsity bound (a parameter which is critical for effectiveness) and improving performance in the presence of feature collinearity and of signals that vary in nature and strength. Importantly, we render the approach efficient and effective in applications of realistic size and complexity â without resorting to relaxations or heuristics in the optimization, or abandoning rigorous crossvalidation tuning. Computational viability and improved performance in subtler scenarios is achieved with a multipronged blueprint, leveraging characteristics of the Mixed Integer Programming framework and by means of whitening, a data preprocessing step.
Keywords:
Regression, feature selection, Mixed Integer Optimization, LASSO, crossvalidation, whitening.
Funding information:
This work was partially funded by the NIH B2D2K training grant and the Huck Institutes of the Life Sciences of Penn State, and by NSF grant DMS1407639. Computation was performed on the Institute for CyberScience Advanced CyberInfrastructure (ICSACI; Penn State University).
1 Introduction
LASSOtype methods (e.g., [tibshirani1996regression], [zou2005regularization], [meier2008group]) perform feature selection as a convex constrained optimization, shrinking the norm of the coefficient estimates vector. While inducing sparsity, LASSO does not constrain directly the number of nonzero estimates (the size of the socalled active set of features). However, it is enormously popular due to its effectiveness and computational efficiency in comparison to traditional best subset selection, which explores all subsets of any size and is computationally very demanding, or sequential procedures, which curb computation but may lead to suboptimal solutions â especially in the presence of collinear features. Recently, a new approach based on Mixed Integer Optimization has been brought to bear onto this statistical problem. By very efficiently tackling integrality constraints, advances in optimization methods allow direct control of the norm of the coefficient estimates vector (the size of the active set), and thus sparsity, in larger problems (see [bertsimas2016best] and [bertsimas2017logistic] for standard and logistic regression). This, which we refer to as the Mixed Integer Programming (MIP) approach, has engendered a lively debate (e.g., [hastie2017extended], [mazumder2017subset],[bertsimas2017sparse]) comparing its general phylosophy and performance to the LASSO. A notable outcome of this debate has been the proposal of hybrid approaches, whereby the nonzero coefficient estimates of a LASSO solution are allowed to “unshrink” (relaxed LASSO, [hastie2017extended]), or some shrinkage is imposed upon a MIP solution by adding and/or norm constraints to the problem (penalized MIP, [mazumder2017subset]).
While LASSOtype methods remain unquestionaly best in terms of computational burden for any problem size, the MIP approach, at least in its existing implementations, is not viable for realistic problem sizes (hundreds, thousands or more features) if one seeks to appropriately tune the constraint â which is critical for effectiveness. Moreover, which approach works best in terms of prediction, coefficients estimation and active set identification depends on problem complexity. When the sample size is relatively modest with respect to the number of features, a tradeoff can emerge between model parsimony and the balancing of bias and variance in coefficients estimation. In addition, signal strength and patterns, degree of (true) underlying sparsity, and linear associations among the features can have important effects. To date, performance of the MIP approach in comparison to LASSOtype methods is still only partially understood in the presence feature collinearity and signals that vary in nature and strength. Pondering these issues from the perspective of applications and impact on statistical practice, our aim is to render the MIP approach efficient and effective in applications of realistic size and complexity.
Back to tuning, best practices involve assessing outofsample performance, e.g. by crossvalidation – notably, this is an integral part of all LASSO implementations. But for a problem comprising features, tuning the MIP approach by standard fold crossvalidation over the whole range of its main parameter (the sparsity bound, i.e. the size of the active set) requires solving optimizations â each of which can be itself computationally demanding. Though metrics such as AIC or BIC (Akaike and Bayes Information Criteria) can in principle be used to avoid crossvalidation, they rely on distributional assumptions and asymptotic properties. Interestingly, [miyashiro2015mixed] treat information criteria as the objective in a reformulated MIP which must be solved only once, with the bound value internally determined – but this was found to be in fact slower than solving the original MIP over all possible values of the bound. Proposals also exist to improve the computational efficiency of each individual MIP run, usually adopting suboptimal solutions based on linear relaxations and heuristics. For instance, [willis2017l0] use approximations of the norm and solve the reformulated problem with sequential linear programming, and [hazimeh2018fast] use a coordinate descent with local combinatorial search to fit a sparse ridge regression – obtaining solutions similar to those of the original MIP. These proposals though do not provide a provable guarantee on solution quality â weakening the appeal of the MIP approach.
We believe that computational burden should not discourage the widespread use of MIP for feature selection, nor the quest for efficient techniques to solve the problem at optimality. In fact, in addition to direct control of sparsity, the MIP approach affords several other important advantages. In particular, it allows one to incorporate further constraints and adjust the objective function to enhance desirable problemspecific properties. For instance, [bertsimas2017logistic] exploited this in logistic regression to address issues such as group sparsity, multicollinearity among selected features, robustness of solutions, statistical significance, modeler expertise, etc. As another instance, in a classification problem for microarray data, [lan2013multi] used Binary Integer Programming to select a subset of genes that was not only predictive of class label, but also minimally redundant. Previously, prediction performance and redundancy had been targeted separately by repeatedly refitting and adjusting the model to improve results – but the two objectives can be pursued simultaneously within the MIP framework, as for example in [Felici2017].
Our proposals do not resort to relaxations or heuristics in the optimization problem, and do not abandon rigorous crossvalidation based tuning. We recover computational viability and improve performance in subtler application scenarios with a multipronged blueprint that leverages characteristics of the MIP framework. We also explore the use of whitening, a data preprocessing technique to decorrelate features, as a means to increase feature selection accuracy in applications where high collinearity and/or weak signals result in a failure to identify all (or only) relevant features. We find that the MIP approach greatly benefits from whitening and, interestingly, that it exploits this preprocessing step more effectively than LASSOtype methods. In short, in this article we introduce:

two modified bisection procedures, which allow us to assess the whole range of the MIP tuning parameter directly evaluating only a small portion of its values;

strategies for bounds and warm starts, based on the auxiliary use of LASSO and forward selection, which can be called upon to speed up convergence in particularly slow optimization runs;

an integrated crossvalidation scheme, which exploits the nature of mixed integer programs and, in combination with (1) and (2), drastically reduces computation time;

whitening as a preprocessing step to better handle collinear features and a wide range of signal strengths.
Before describing our proposals in Section 3, we provide some technical background and formulate best subset selection as a MIP problem in Section 2. In Section 4 we present results of an in depth simulation study, and an application to a Diabetes data set first introduced in [efron2004least]. Concluding remarks are provided in Section 5.
2 Background
As background for our work, we briefly describe Mixed Integer Programs in Section 2.1, best subset selection as a Mixed Integer Quadratic Program in Section 2.2, and the typical LASSO approach to feature selection in Section 2.3.
2.1 Mixed Integer Programming
In Mixed Integer Programs (MIP), an objective function is optimized under a collection of constraints – including constraits that require some of the variables to assume only integer values (integrality constraints). A well studied and rather general class of MIPs is that of Mixed Integer Quadratic Programs (MIQP), where a quadratic objective function is optimized under both linear and integrality constraints. In symbols
(1a)  
(1b)  
(1c)  
(1d) 
where is the vector of variables, and , (positive semidefinite), , , and are known quantities. (1b) expresses the linear constraints, (1c) lower and upper bounds on the feasible values of the solution vector, and (1d) the integrality constraints; variables in the index set can take only integer values in the set .
MIQPs are of particular interest to us because regression and its feature selection commonly utilize least squares; that is, a quadratic objective function. They extend an even better studied class of MIPs; Mixed Integer Linear Programs (MILP), where also the objective function is linear. In turn, MILPs extend the simplest class of programs; Linear Programs (LP), where integrality constraints are absent.
From a computational point of view, LPs of any size can be solved very quickly. But introducing integrality constraints, even in the form of binary constraints restricting variables to , poses serious challenges – both in the case of linear objective functions (MILPs) and in the case of quadratic ones (MIQPs). Nevertheless, optimization solvers have seen dramatic improvements in the past two decades; see [bertsimas2016best] for an overview of progress from 1991 through 2015. At present, the most effective way to solve a MIP problem is to relax integrality constraints and then combine implicit enumeration methods, such as Branch & Bound, with the addition of constraints that approximate integrality in the relaxed problem, such as Branch & Cut (for a general introduction to these techniques see [Schrijver1986]). The overarching goal of recent contributions (see in particular [bertsimas2016best]) and of our own work, is indeed to harness this extremely powerful machinery for the purpose of feature selection.
2.2 Best Subset Selection as a Mixed Integer Quadratic Program
Consider a linear regression of the form , where is the vector of response values, is the design matrix, is the vector of regression coefficients, and is the vector of errors. The design matrix contains the observed values of predictors, or features, to be included in the regression. For any given integer , the best subset selection problem amounts to minimizing the regression error using at most features among the available ones. In the traditional least squares formulation the problem can be written as
(2)  
where the subset size is restricted to at most features through the constraint . As pointed out in [bertsimas2016best], this can be reformulated as a MIQP of the form described in (1) by introducing a vector of binary variables with the following interpretation: if in the solution, then feature is selected, else it is not. The problem then becomes
(3)  
The two parameters and play very important roles. Constraining each to be within is referred to as the ”bigM” trick, and the proper choice of is an art. A sufficiently large guarantees that solutions to (3) are also solutions to (2), but an excessively large one can affect solvability of the problem – mainly for numerical reasons. Using separate bounds for each coefficient can increase flexibility in the estimation of .
[bertsimas2016best] provide a few options for selecting , as well as a clever formulation that does not require its a priori choice. In our implementation we use separate and fairly wide bounds, setting where are the Ordinary Least Squares (OLS) estimates and is some scaling constant (e.g., for our numerical results in Section 4). This choice is based on the assumption that OLS estimates provide a reasonable idea of the magnitude of the true underlying coefficients. Of course, OLS may overestimate the coefficients of nonrelevant features and underestimate those of relevant ones – but it provides information on the role of each feature in the regression and, even when inflating by , does not seem to lead to solvability issues. The parameter has a less technical and rather crucial role for guaranteeing accurate feature selection. It bounds the size of the subset in (3), directly controling sparsity. If is too small, we miss important signals; if it is too large, we overfit the regression. For large , chosing by standard tuning methods such as 10fold crossvalidation across its entire range is computationally very demanding, if not prohibitive – even with stateoftheart solvers. This issue can be mitigated restricting the range of at the outset through assumptions. For instance, one may assume that the regression is very sparse and explore only even though , or – or more. In many applications though this may not be appropriate. In this regard, we propose specialized search methods that assess the entire range of , but maintain computational viability by evaluating only a small portion of its values (see Section 3).
2.3 Feature Selection with the LASSO
LASSO regression is a regularization method commonly used for feature selection due to its impressive computational speed and ability to induce shrinkage and fit parameters simultaneously. Similar to (2), LASSO can be expressed as
(4)  
The key difference between the MIP and LASSO formulations is the choice of penalization term, i.e. the constraint placed on the coefficients. While MIP uses the norm, placing a bound on the size of the subset (the number of selected features), LASSO uses the norm, placing a bound on the size of coefficient vector. Thus, LASSO does not explicitly solve best subset selection and does not directly control sparsity. Its formulation expresses a simpler optimization problem, having only continuous variables and convex constraints. Many fast algorithms have been proposed for solving LASSO; the most commonly used is a variation of coordinate descent from [friedman2010regularization] (a very efficient implementation of LASSO is provided in the glmnet Rpackage). LASSO solution quality, likewise MIP, depends on the choice of its tuning parameter . But computational speed makes this straightforward: one generates a fine grid of values ranging from maximal to minimal shrinkage (all coefficients are forced to 0 on one extreme, and allowed to take their OLS values on the other). On each grid value one then computes, e.g., a 10fold crossvalidation error, and selects the one minimizing such error. Another choice, which induces more shrinkage and is known as parsimonious LASSO, is the largest with crossvalidation error within one standard deviation from the minimum.
Overall, notwithstanding the efficient tuning made possible by computational speed, LASSO is known for producing solutions with a high number of false positives. In fact, [tibshirani2011regression] recommended it as a screen (instead of a selection procedure), because under certain conditions it achieves the sure screening property. In contrast, the MIP approach has been shown to produce sparser solutions, with fewer false positives – though they may not necessarily have higher prediction accuracy when compared to LASSO.
3 Our Proposals
Here we present our novel proposals: two modified bisection procedures for searching across the whole range of – but evaluating only a small portion of its values (Section 3.1), strategies for bounds and warm starts (Section 3.2), and an integrated crossvalidation scheme (Section 3.3). Our numerical studies (Section 4) show that these proposals lead to a substantial reduction in computational burden, rendering the MIP approach viable in realistic applications. In Section 3.4 we introduce whitening, a preprocessing step which can further increase performance of the MIP approach.
3.1 Modified Bisection Procedures
Let be the true and unknown number of features that affect the response and should be retained. For their selection to be effective, an appropriate value for the bound, i.e. an estimate of , must be obtained from the data. This tuning can be implemented evaluating outofsample regression performance via fold crossvalidation, and comparing it across all values from 1 to . For example, the black curve in Figure 1 shows 10fold crossvalidation Mean Squared Error (CVMSE) obtained solving MIQPs at on simulated data with and (see Section 4 for details).
Ideally, the CVMSE curve would decrease (as one adds more relevant features) and then increase (as one overfits adding irrelevant features), identifying an appropriate as a sweet spot (minimum). But the elbow shape of Figure 1 is typical of many regressions in practice. It still allows us to easily settle on by parsimony, since the benefit of adding more features is evidently negligible.
The problem with employing this strategy when solving MIQPs is the computational burden of crossvalidating across the whole range of . If is large, this may be unviable because we need to solve as many as MIQPs. For very sparse regressions, say , it may still be feasible to solve MIQPs for until it becomes clear that we are past a sweet spot or an elbow in the CVMSE curve. However, if the regression is not very sparse, or if it is only proportionally sparse but is very large, this will not ensure computational viability. For instance, if we have and the number of true positives is , more than values of will be crossvalidated and more than MIQPs solved to identify a good estimate of the bound.
If we could count on the CVMSE curve having a distinct sweet spot (a single global minimum and a clear quasiconvex shape) one way to tackle this issue would be to use a bisection procedure ([cormen2009introduction]). A classical bisection would search the whole range of , but evaluate only a small portion of its values – limiting the number of MIQPs to be solved to around . However, this could be rather ineffective in identifying the elbow of a CVMSE curve shaped as in Figure 1. Thus, we have designed two bisection procedures specifically modified for this problem. They are described below, and their remarkable performance and robust behavior is demonstrated in Section 4.1.
Bisection with Feelers: The first modified bisection utilizes our expectation on the shape of the CVMSE curve to save on the number of functional evaluations. It also operates iteratively, exploring the neighbours of the current to better understand where we are in the CVMSE curve. We name these explorations feelers, thus the method is referred to as bisection with feelers (BF).
We begin with the two end points of the search interval set to and , and split it at its midpoint to get and . To decide which of these two intervals to search next, we use relative differences in crossvalidation error. In symbols, let be the set of features selected by the MIQP with a generic , and the corresponding crossvalidation error. We solve for , , and , and consider
(5) 
Given a positive threshold , is an indication that is already past the elbow of the curve; in this case we search . Else, if , we search . However, if ( is sufficiently higher than ) we have an indication that the curve actually reincreases past a minimum occurring before ; candidate values in are subject to overfitting. In this case we search regardless of the value of . This continues until the bisection algorithm converges and outputs an estimate . Next, since clearly affects the bisection output, we send out two feelers around the current , computing and . If these are both smaller than another, stricter threshold , we have an indication that is already past the elbow of the curve; in this case we restart the algorithm setting and .
Due to the feelers portion of the BF, the number of evaluations may be higher then – depending on how many times we allow the algorithm to restart and how far past the elbow we are each time. In our numerical studies we typically restart the algorithm at most once; nevertheless, a bound on the number of iterations is used to avoid excessive computing time. Detailed pseudocode is provided in Algorithm 1.
Bisection with Randomly Added Variables: While BF was designed to exploit the shapes generally presented by crossvalidation error curves, our second modified bisection saves on the number of function evaluations by comparing the optimal subsets produced by MIQP with perturbed sets containing features selected at random. We call this bisection with randomly added variables (BRV). Let be again the set of features selected by the MIQP with a generic , and the corresponding validation error on fold . Note that this is not the same as the crossvalidation error used in BF. was obtained solving the MIQP on the training data and computing the validation error on the withheld data for each fold of the crossvalidation, and then averaging these errors over folds. In BRV, the aim is to compare the set obtained solving on all the data to another, perturbed set of features. Thus, we fix and simply refit with OLS (on the training data) and measure validation error (on the withheld data) using the same features across the folds. This provides validation errors specific to that can be used to gauge its quality. Now, let be the complement (the set of features not contained in ). At any given , we draw from it to generate perturbed sets.
As in BF, we begin with the two end points of the search interval set to and , and split it at its midpoint to get and . In BRV, we focus our attention on the midpoint; we solve for , and , obtain and for , and benchmark these errors as follows. For each fold , let (resp. ) be the error that would be obtained if a new feature were added to the set (resp. ) by randomly selecting such a feature from (resp. ). The construction of an empirical distribution for (resp. ) is very simple and economical: for a large enough number of repetitions, one selects a feature at random from the complement set, adds it to the current set, and recomputes the coefficients fixing the variables already selected in , resulting in a negligible computational burden. Again for each fold , we can then ask if the error obtained by solving the MIQP with (resp. ) is likely to come from the empirical distribution of (resp. ). More precisely, let and be low quantiles (e.g., ) of the two empirical distributions constructed for . If , then increasing the size of an optimized set of features from to affords a gain comparable to simply adding a feature at random to the optimally selected ones. This may be the case for some folds and not others, especially if contains many false positives. All together, if for more than, say, of the folds, then we turn our search to . On the other hand, if for most folds, then the gain is sizeable, and to decide where to turn our search we consider . If for most folds, then also increasing the size of an optimized set of features from to is better than adding a feature at random to optimally selected ones, and this encourages us to search . However, if for most folds, then while increasing to features results in a sizeable gain, moving any further does not. Thus, we turn our search to – which may allow us to identify an even better bound value to the left of (an even sparser solution). This continues until the BRV algorithm converges and outputs an estimate . Detailed pseudocode for this method is provided in Algorithm 2.
3.2 LASSO Bounds and Forward Selectionbased Warm Starts
Next we address the fact that a single MIQP solution can take very different, and potentially large run times for different values of . For example, the red curve in Figure 1 shows average computing times (over crossvalidation folds) across values of (we solved the MIP at each instance using Julia 0.6.1 to interface with Gurobi 7.5.1; see Section 4.1). Clearly, some are much higher than others; as to be expected, very small or very large values result in faster convergence rates for the optimization solver. Convergence becomes slower as we move inward, though notably it speeds up again around the true , which is of course unknown in real applications. This behavior illustrates that at least some of the MIQP solution runs required to tune are likely to be very costly. What if, in addition to reducing the number of runs (see Section 3.1), we could make the most expensive ones cheaper? Following a well established practice in optimization, we do this generating good starting points for the Branch & Cut procedure utilized when solving MIQPs. Specifically we employ LASSO and forward selection, which are extremely inexpensive to run, to generate appropriate bounds and warm starts for our optimization problem.
When solving the MIQP for a generic , termination is established either when a sufficiently small integrality gap is reached (see, e.g., [Schrijver1986] or [Chen2011]), or when a maximum computing time is surpassed. Both rules can be customized in stateoftheart solvers. In our implementation, we set a gap threshold , as customary, but also a reasonable maximum computing time . Suppose that, while searching an interval with one of our modified bisections and solving for an , we hit such . Then, we run a LASSO on all features and obtain the number of features it would select. Since LASSO tends to retain false positives, we solve for if , and again for if . Next, we run forward selection for precisely (or ) features and consider the resulting feature selection indicators and coefficient estimates ; this is our warm start. We also run forward selection for (or ) features and consider the resulting insample mean squared error ; this provides us with a bound for the error. Finally, we run our algorithm using and as our starting point and terminating when when the error is within some percentage, say , of . If , we also update the current search interval, setting . If instead , LASSO does not help in adjusting the sparsity bound, but we can still run our algorithm using the forward selectionbased warm start and error bound. This greatly curbs computation time in numerical experiments (see Section 4.1) and, notably, generates a solution with at least one fewer feature than the LASSO solution and a smaller or equal error. Detailed pseudocode is provided in Algorithm 3.
3.3 Integrated CrossValidation
We now describe a procedure that leverages linear constraints in a MIQP to reduce the computational burden of fold crossvalidation. For any given fold, solving the MIQP effectively amounts to solving a relaxation of the complete problem, where the constraints associated to the withheld observations are relaxed. Thus, when the focus is switched from one fold to another, it may be very easy to recover the feasibility of the relaxed solution using a dual approach – a step typically integrated in MIP solvers. We call this integrated crossvalidation (ICV). In symbols, for each fold , we consider a modified version of the optimization problem in (3); namely
(6)  
This modification is one of several ways to tell a MIP solver not to consider the constraints associated with some of the observations (the ones withheld in ; this is done through the parameter whose use is similar to that of in Section 2). With this formulation, the first fold solution may be rather expensive, but subsequent ones become very cheap – in effect, the first solution serves as a starting point for the next ones. Relatedly, in our practical experience, the warm start strategy described in Section 3.2 is almost exclusively triggered when pursuing first fold solutions.
Notably, the triggering of a warm start in the first fold calculation also provides useful information on the gap. Since is used as a termination threshold on the error, the first fold solution may well have a gap larger than the given . However, as pointed out in [bertsimas2016best] and [hastie2017extended], the solver often reaches an optimal solution, or something very near optimal, earlier than the total solution time may indicate. This is typically due to a weak lower bound on the error that is slow to improve. By using the threshold, we allow the algorithm to stop earlier instead of waiting on improvements to the lower bound. is then used for subsequent folds: in addition to adopting the first fold solution as a starting point, we change the gap threshold to and avoid long computing times from weak lower bounds. The callback feature in modern solvers allows for a simple implementation where the solver periodically checks if the current gap matches or has improved upon that of the first fold. We illustrate the potential benefits of ICV in Figure 2, where we plot computing times across folds for the same simulated data used in Figure 1 at . Computing times when using standard crossvalidation are provided for comparison. As a final summary, Figure 3 guides the reader through the complete process we employ to tune the sparsity bound – utilizing modified bisections, bounds and warm starts, and integrated crossvalidation.
3.4 Whitening
Even with procedures that allow us to efficiently and accurately estimate (the true size of the active set), there is no guarantee that the features selected solving for coincide with the relevant ones, especially in the presence of strong multicollinearity and/or weak signals. In fact, the MIP approach may not fare well in these scenarios. In Figure 4 (and in Figure 1 in the Supplement) we present results obtained solving with the true on data simulated for a wide range of SignaltoNoise ratios (SNR; from to ) and highly collinear features. Specifically, we simulated features following an autoregressive correlation structure with , and features following a block structure in which within the sets of relevant and irrelevant features, and across the two sets (see Section 4 for more details). Figure 4, where , and , clearly shows that in the presence of strong collinearity, even using the true , MIQP solutions can lack sensitivity – especially at lower SNR values. By comparing results obtained with and without a data preprocessing step called whitening, Figure 4 also shows that MIP recovery can be greatly improved under some circumstances.
In full generality, whitening involves switching from the original data matrix to a new data matrix , where (the whitening matrix) is chosen so that the transformed features are uncorrelated. For our purposes we use , the sample covariance matrix. This is known as ZCA whitening, or ZCAMahalanobis whitening, and has the property of minimizing the distance between the original and whitened features; see [kessy2018optimal]. The MIP approach can then be applied to select features among the latter and, provided the minimized distance between and is small enough, to translate this selection back to the original features. In Figure 4, this allows us to reach almost true positive rate at SNR values .
An additional benefit of ZCA whitening is the effect it has on our crossvalidation error curves. Figure 5 shows CVMSE over the range ( to ) before and after whitening. The data here is simulated with , SNR , and features autoregressively correlated with . Using , noise makes it very difficult to identify an appropriate sparsity bound; simply selecting the minimum would lead to an excessively sparse solution. Using produces a more distinct ”elbow” behavior, so we can get reasonably close to the true . Interestingly, approaches in the LASSO family do not appear to benefit from whitening as much as the MIP approach. In fact they can be negatively affected by whitening, perhaps because they seem to be prone to specificity rather than sensitivity issues  as mentioned in Section 2.3. We will discuss this in more detail when presenting our numerical results in Section 4.
4 Numerical Studies
Here we assess performance of the MIP implemented with our modified bisections, bounds and warm starts strategy, and integrated cross validation. We do this in comparison to LASSOtype methods on both simulated and real data.
4.1 Simulation Study
We conduct an extensive simulation study comprising a variety of scenarios defined in terms of problem size , sparsity level (the number of active features), signal patterns (the type of coefficients), signal to noise ratio (SNR), and the nature and strength of linear association among features. Our enhanced implementation of the MIP approach is compared to LASSOtype methods based on several performance metrics, including computational burden.
We generate the rows of the data matrix independently from a variate Gaussian distribution. Without loss of generality, we take – where, in addition to having mean , all components have variance . Thus, instead of a covariance we specify a correlation matrix. We consider two different regimes. The first is an autoregressive structure, where for . Here each feature is strongly associated with a few neighboring ones – and only weakly or almost not at all to features further away. If we list the active features as the first , followed by the inactive ones, potential confounding across the former and the latter occurs for those “within a certain radius” of the boundary at – depending on the size of . The second regime, which we call a block structure, is one where all active features are pairwise correlated at level , , , all inactive features are too, , , , and across the two sets one also has pairwise correlations at level , . Here the sizes of and control, respectively, the degree of potential confounding among active features, and between active and inactive ones. Notably, each active feature has the same strength of association with all inactive features ().
The two correlation structures, with appropriately high parameter values, can both represent strong collinearity. However, they are very different and proxy different types of real data settings. This can be appreciated in terms of eigenvalue decay. For example, considering and in simulations with and , Figure 6 shows the percentage of variance explained by the first eigenvalues. The autoregressive structure has a much steadier decay compared to the block structure – which has a very dominant first component followed by a sharp drop. The autoregressive structure though also has, in a way, a lower intrinsic dimensionality; its first eigenvalues capture around 98% of the overall variability, while the first eigenvalues of the block structure only reach 37%. In our experiments we investigated both structures with varying parameter values. Due to space limitations, results below refer to the block structure – but all others are reported in the Supplement.
The response is generated through a linear relationship, where . Values of are used to create scenarios with varying SNR, defined as SNR . A summary detailing all parameter settings is given in Table 1. To assist with interpretation of the SNR values, we include corresponding regression values (averaged across simulation runs in each setting). A ”realistic” signal to noise ratio may vary depending on scientific fields and class of problems of interest. In recent literature, the MIP approach was shown to perform extremely well (and with a relatively handleable computational cost) in scenarios with very high SNRs ([bertsimas2017sparse]). Conversely, the MIP approach was shown to perform worse than procedures in the LASSO family in scenarios with very low SNRs ([hastie2017extended]). Both extremes may characterize a limited number of real applications. In our experiments we explore a range of SNR values from to , corresponding to values from to . Results below refer to SNR (results for more SNR values are reported in the Supplement).
Parameter  Values 

500  
100, 1000  
10  
(BLOCK)  *(0.5,0.1), *(0.5,0.3), (0.5,0.4) 
(AUTOREG)  *0.8, 0.9 
, **  
SNR  0.1, ***0.25, ***0.50, 0.75, 1, ***2, ***3, 10 
0.08, 0.18, 0.32, 0.45, 0.51, 0.67, 0.75, 0.91  
Replications  5 

Asterisks: results in the Supplement. **: 7 stronger and 3 weaker active signals ( and ).
We compare an appropriately tuned MIP, implemented with bisection with feelers (BF) or bisection with randomly selected variables (BRV), the bound and warm start strategy, and integrated fold crossvalidation, to: (i) standard LASSO regression tuned by fold cross validation, with the minimizing or the parsimonious choice of (LMN and LSD, respectively; see Section 2.3); (ii) the relaxed LASSO of [hastie2017extended] tuned by fold cross validation, with the minimizing choice of (RL); (iii) standard forward selection tuned by fold cross validation, with the minimizing choice of subset size (FS). For LMN, LSD, RL and FS we used the R package bestsubset from [hastie2017extended]. For our BF and BRV we used Julia 0.6.1, a highperformance dynamic programming language, to interface with the MIP optimization solver Gurobi 7.5.1. For our computing time limit, we allowed 10 minutes for scenarios with , and minutes for scenarios with (see below). All simulations were performed on the Pennsylvania State Universityâs Institute for CyberScience Advanced CyberInfrastructure (ICSACI) using the basic memory option on the ACIB cluster with an Intel Xeon 24 core processor at 2.2 GHz and 128 GB of Ram. The multithread option in Gurobi was limited to a maximum of 5 threads for consistency across settings. Code for our procedures (written in Julia) and for data generation (written in R) is available on request – a package for general use is forthcoming. Each simulation scenario is run independent times. As performance metrics, we consider the average number of true positives, which measures sensitivity, and the average number of false positives, which measures (lack of) specificity. We also consider the average mean squared error computed on separately drawn validation data for each scenario and simulation, which measures prediction accuracy. Finally, we consider accuracy in estimation of – partitioned into variance and squared bias. Below we report performance metrics with/without whitening of the data depending on whether it helps/hurts each procedure; in the scenarios considered BF and BVR perform better with whitening, while LMN, LSD, RL and FS perform better on unwhitened data.
4.1.1 Scenarios with
In these scenarios the dimension is one fifth of the sample size , and the number of active features is . Figure 7 summarizes results for the block correlation structure with and SNRs around 0.1, 0.75, 1, and 10. Similar plots are provided in the Supplement for all SNR values and both correlation structures.
In Figure 7(a) we see that all procedures suffer when the SNR is very low – capturing a small number of true positives. Note in particular that an SNR of is not low enough to induce superior performance for the LASSO family methods (see [hastie2017extended]) – at least in this high collinearity setting. LMN and RL do select about features (the right number), but not all the selected features are in fact active. The lack of specificity of these procedures does not improve even at higher SNRs. MIP with BF captures all active features at SNR , as do methods in the LASSO family – but unlike the latter it is very specific. FS and MIP with BRV catch up more slowly, but they too capture all active features and are nearly perfectly specific at SNR=10.
Remarkably, but perhaps not surprisingly given the strong collinearity, these differences in sensitivity and specificity do not translate in substantial differences in prediction accuracy: Figure 7(b) shows that all procedures perform comparably, and better as the SNR increases. However, looking at bias and variance of in Figure 7(c) and (d), respectively, we do again find differences. At an SNR of , the empirical squared bias of all procedures is significantly worse than that of the OLS estimator fit using only the active features. As to be expected, the overly sparse solutions produced by MIP with BF and BRV, but also by FS and LSD, have larger bias than those of the less specific solutions produced by LMN and RL. Notably though, for mid SNR values, MIP with BF nearly matches the optimal OLS fit, and FS is the most biased at an SNR of all methods have similar and negligible bias. In terms of empirical variance, at low SNRs methods in the LASSO family naturally outperform OLS solutions, as well as FS and the two MIP implementations. However, MIP with BF becomes competitive already at SNR , and all procedures show similar and very small variance at SNR . The poorer estimation performance of MIP with BRV is not surprising, given that it tends to produce sparser solutions (while still maintaining similar prediction error) relative to other procedures. This induces larger differences with the true coefficient values ( vs ). In contrast, though methods in the LASSO family lack specificity, the coefficient estimates of irrelevant features are likely small and thus induce less of a deterioration in estimation performance.
In Figure 8, we show results when all procedures are applied to whitened data. The SNR reported are again , but similar plots for the complete range of signal strengths are provided in Figure 2 of the Supplement. Figure 8(a) shows how the lack of specificity of methods in the LASSO family becomes rather extreme when they are applied to whitened data. FS, whose specificity is good on the original data, also deteriorates massively under whitening. Figures 8(a) and (b) give some insight on why whitening affects LASSO and MIP procedures so differently. It is clear that, under whitening, the LASSO CVMSE curves give very little indication of where the optimal may occur. As the SNR increases, the minimizing the error (marked by a circle) gets further and further away from the inducing the right level of sparsity (i.e. corresponding to a of approximately ; marked by a triangle). For very weak signals (SNR) the two ’s are close, but Figure 8(a) indicates that this still corresponds to a solution containing many false positives. In contrast, the CVMSE curves for MIP indicate the right level of sparsity much more clearly as the SNR increases. The CVMSE curves shown here were generated by solving the MIP on folds across an entire grid of values ranging from to – without the employment of one of our bisections. We still utilized our bounds and warm start strategy, as well as integrated crossvalidation, for better efficiency. The effect of whitening on methods in the LASSO family, as well as on other regularization techniques, certainly warrants further simulation and theoretical explorations – which are beyond the scope of the current article. For subsequent results (below and in the Supplement), LASSO procedures and FS are applied to the original data, and MIP procedures are applied to the whitened ones.
We conclude with a technical note: the performance of MIP with BF shown in Figure 7 for SNR was obtained fixing a more conservative threshold than those used at lower SNR values. Of course, how to set the threshold in the algorithm may not be obvious in practice, but a quick run of the LASSO selecting a very large number of features can inexpensively diagnose the need to implement a restrictive BF. However, we stress that MIP with BF does not suffer from the same lack of specificity that hinders LASSO at high SNRs. As discussed above, when the SNR increases the true becomes in fact easier to identify for MIP; it is simply that the BF search may be pushed towards the right tail of the range if seemingly large differences in CVMSE are not “ignored” by setting a stricter threshold. In Figure 8(c), the “elbow” at is clearly denoted when SNR, but the strong signal leads to distinct drops in the CVMSE also at large values. Here, a more restrictive threshold easily improves solution quality. In contrast, using a more conservative tuning for LASSO (e.g., the parsimonious LSD), still fails to increase specificity.
4.1.2 Scenarios with
In these scenarios the dimension is twice the sample size ; that is, the problem is undersampled. Figure 9 summarizes again results for the block correlation structure with and SNRs and (complete results are provided in the Supplement). In comparison to the scenarios, the MIP procedures have a slightly worse performance. MIP with BF, which was very specific at all SNR values and very sensitive for SNR , retains its specificity but, with , identifies all active features only at SNR . MIP with BRV, which produced false positives at low SNR and caught up more slowly than MIP with BF in terms of sensitivity, appears to produce false positives also at higher SNRs – it still does at SNR , where it also identifies all active features though. Of course the deterioration of MIP procedures may be due to undersampling. Interestingly, in scenarios with , FS behaves very similarly to the MIP with BF, and the nonparsimonious LASSO procedures LMN and RL appear to have higher false positive rates also at low and intermediate SNRs – this, too, could be due to undersampling.
In terms of prediction accuracy, all procedures perform again comparably and better as the SNR increases. In terms of estimation performance, both squared bias and variance of are higher overall in comparison to the scenarios. They are also, again, higher for MIP procedures at low SNRs. Notably, MIP with BF does not improve as quickly as when and requires higher signal strengths. These behaviors are likely the combined outcome of the overly sparse solutions produced by MIP procedures, and a generally lower accuracy when . Recall that, when , even overly sparse solutions produced by MIP with BF and MIP with BRV did comprise only active features. When however, MIP procedures still struggle to select a few of the true features At low SNRs. An important factor here may be that the inverse square root of the sample covariance matrix represents a rather poor estimate of the whitening matrix for (we used a simple MoorePenrose generalized inverse to implement whitening). More accurate estimation of the whitening matrix may enhance the quality of MIP solutions, and consequently the quality of our sparsity bound tuning. We plan to investigate this in more depth in future work.
4.1.3 Sparser Scenarios with
The scenarios with dimensions and discussed above all have active features. Thus, in addition to differing because tha latter are undersampled (and the former are not), they also differ in terms of relative sparisity: when , 10% of the features are active, while this percentage decreases to 1% when . To disentagle the effects of increased sparsity, which may enhance performance, from those of undersampling, which by and large deteriorates it, we consider some additional simulation scenarios where (one fifth of the sample size , no undersampling) and (2% of the features are active). Supplemental Figure 23, like Figures 7 and 9, summarizes results for the block correlation structure with and SNRs around 0.1, 0.75, 1, and 10. In short, the MIP procedures are still more specific than LMS and RL. Surprisingly, only LSD benefits from increased sparsity – even though the penalty follows the ”bet on sparsity” principle ([friedman2001elements]). Overall, the methods that were overly sparse when 10% of the features were active (LSD, FS, and MIP with BRV) now produce the most exact solutions. MIP with BF had to be run again with a restrictive threshold, but this was easy to diagnose as LASSO procedures have very poor specificity even at low SNRs when the data is whitened. Prediction accuracy is comparable for all procedures, much like when 10% of the features were active. In terms of bias and variance of , the parsimonious LSD, which is more specific with higher sparsity, also induces a distinctly larger squared bias. With higher sparsity, the MIP procedures actually outperform all others in terms of bias and have comparable variance at mid to high SNRs.
4.1.4 Computational Burden
We have shown how our implementation of MIP can be quite effective in terms of active features identification (true and false positives), prediction accuracy, and bias and variance in estimating the coefficients of a linear model. However, a key benefit of our proposals (bisection variants, bounds and warm starts strategy, and integrated crossvalidation) is the dramatic reduction of computing time compared to the standard MIP implementation. This renders MIP a viable approach for best subset selection in a broader variety of realistic data scenarios, and with a rigorous tuning of the sparsity bound. Considering the scenario with and SNR, we illustrate this in Figure 10 by comparing computing times between the standard MIP implementation and our implementation – with both BF and BRV (LASSOtype methods are not considered here; they are orders of magnitude faster than MIP implementations).
We allowed all procedures to search across the whole range of possible sparsity bounds ( from to ). The left panel shows the total time needed for full tuning (i.e. to identify the optimal sparsity bound). Clearly, the standard MIP implementation required a very large amount of time (almost 4 days) – note this is obtained allowing at most 10 minutes to solve each instance. In comparison, our MIP with BF converged in 3 minutes and our MIP with BRV in 70. The right panel shows the time needed to evaluate each . The standard MIP is comparable to our MIP with BF and BRV for only very low or very high values – which are, of course, easier to solve. However, it takes up the full allowed time ( minutes for each of folds, totaling around minutes) for values ranging between and 0. Notably, even if we restricted the procedures to search a smaller range, say from to , the total time for full tuning would still be around hours ( minutes) for the standard MIP, as compared to approximately seconds or minutes for our MIP with BF and BRV, respectively.
We note that MIP with BF is much faster than with BRV in this simulation scenario. At the current , BF performs 10fold integrated crossvalidation; it solves the program a first time on observations (the first fold calculation) and then performs another nine much cheaper runs, each again on observations (subsequent folds, see Figure 3). In contrast, BRV is akin to solving the program over observations three times; at , and . If solutions are computationally expensive at and around the current , BRV may in fact be more expensive than BF. Of course this is not always the case; the overall cost of BF strongly depends on how cheap are the runs for subsequent folds in our integrated crossvalidation. For larger problem sizes, they can still take several minutes, making BRV more competitive.
4.1.5 Summary
To conclude, we summarize the salient results of our simulation study. In scenarios with highly collinear features:

The MIP approach strongly benefits from the use of data whitening as a preprocessing step. This not only helps in the selection of relevant features, but also in the tuning of .

MIP with BF and with BRV tend to favor sparser solutions than LMN and RL, with fewer false positives, but also capture more true positives than the parsimonious LSD and FS.

Unlike the observations in [hastie2017extended], our implementations of MIP are consistently competitive in terms of prediction error – even at SNRs as low as .

In problems of relatively small size (), coefficient estimates produced by MIP with BF have very low variance and bias at intermediate and high SNRs. Those by MIP with BRV have higher variation and bias, but are still consistently better then FS.

In problems of larger size (), coefficient estimates by LMN, LSD, and RL have the lowest variance and bias.

LMN, LSD, RL, and FS are strongly unspecific when applied to whitened data – this was documented only in problems of size and the behavior of LASSOtype methods under whitening warrants further exploration.

At high SNRs, MIP with BF is rather sensitive to the choice of threshold in the algorithm. However, the need for a more conservative threshold can be readily diagnosed with a (computationally cheap) LASSO run.

MIP with BRV is less sensitive to the choice of threshold but can be overly sparse in small problems and include more false positives in larger problems.

In very sparse problems, the parsimonious LSD, FS, and MIP with BRV are more accurate than the other procedures. MIO with BF and with BRV have low bias and comparable variance for intermediate and high SNRs. Surprisingly, increased sparsity does not improve the specificity of LMN and RL.

Our implementation of the MIP provides dramatic gains in computational efficiency, making this approach for feature selection feasible in a broader range of applications.
4.2 An Application to Diabetes Data
We analyze the diabetes dataset first used in [efron2004least]. This is a rather low dimensional and well sampled dataset, containing diabetes patients and baseline predictors including age, body mass index, average blood pressure, six blood serum measurements, and sex encoded as (although somewhat unorthodox, this matches the treatment in [efron2004least]). The response is a measure of disease progression one year after the baseline. The number of features considered to model disease progression is actually ; in addition to the baseline predictors, these include all their pairwise interactions (products) and squared terms (the squared term for sex is not included).
We separated the data set into a training set and a test set, with 354 and 88 observations respectively (7525% breakdown). On the training set, we ran all procedures on both whitened data and data that was simply standardized – i.e. transformed to have mean and variance , without eliminating correlations (both LASSO and MIP procedures do need to run on features with equal scaling; see [friedman2001elements]). On the test set, we computed validation Mean Squared Errors, again after whitening or simple standardization. A summary of results is provided in Table 2, which shows also OLS fits of the null and full models (intercept only, intercept and all features). The last column contains the relative decrease in validation MSE between the full model and the models produced by each of the procedures considered. For a generic procedure this is defined as
(7) 
Procedure  ValMSE  Relative Decrease  

(S and W)  (S and W)  (S and W)  
LMN  19  41  0.488  0.631  0.614  0.501 
LSD  9  21  0.501  0.550  0.604  0.565 
RL  19  41  0.488  0.631  0.614  0.501 
FS  16  7  0.523  *0.477  0.586  0.623 
BF  2  3  0.513  0.496  0.594  0.608 
BRV  2  3  0.513  0.496  0.594  0.608 
OLS Null  0  0.989  –  
OLS Full  64  1.265  – 

*: minimum Validation MSE. S: standardization. W: whitening.
We observe that MIP procedures generate much sparser solutions, whose prediction accuracy does improve with whitening. MIP with BF and with BRV are also consistent in identifying the same set of features – both with whitening and with simple standardization. For both BF and BRV whitening adds one feature, Lamotrigine (LTG), to the two already selected under standardization, Body Mass Index (BMI) and Mean Arterial Pressure (MAP). Looking at pvalues for individual effects, all features are highly significant (BMI , LTG and MAP ). FS under whitening produces the minimum validation MSE, but with as many as features – the ones identified by BF and BVR, plus more (HDL, TCH, GLU, and the interaction between BMI and MAP). This induces a modest improvement with respect to MIP procedures – in terms of relative decrease in the last column, only 1.5%. pvalues for BMI, LTG, MAP and the interaction between BMI and MAP are highly significant. In contrast, pvalues for TCH and GLU, albeit selected, are nonsignificant (0.742 and 0.458, respectively) and HDL is only marginally significant (0.0302). Methods in the LASSO family select a much larger number of features, but again with modest improvements in variance explained in the training set, and with a substantial deterioration in validation MSE. Notably, LMN and the RL of [hastie2017extended] have the same performance – relaxing the LASSO does not appear to improve it on this data.
The features BMI, MAP and LTG are selected by all procedures considered and have very low pvalues in all model fits. They were also included in the best models of [efron2004least]. The interaction term between BMI and MAP is not selected by MIP with BF and with BRV, but it is selected by all other procedures and significant in the model fits. In fact, its pvalue in a feature model comprising BMI, MAP, LTG and BMIMAP is also quite low ( 0.003) – which leads to the question of whether it ought to have been selected also by the MIP procedures. The right panel of Figure 11 indeed shows a discernible marginal effect on disease progression, especially at extremely high values of BMIMAP. But the improvement in terms of validation MSE is negligible (relative decrease of 0.495 compared to 0.496), as is the one in terms of variance explained in the training set ( of 0.484 compared to 0.472). For comparison, the left panel of Figure 11 shows the much stronger and clearercut marginal effect of BMI on disease progression. Of course on a real data problem we have no conclusive way to confirm that we have captured all of the active features, but we have shown that our improved MIP procedures (with modified bisections, bounds and warm starts strategy, and integrated cross validation) can produce sparse, interpretable, and competitive solutions. Importantly, these were also found within a reasonable amount of time. With the relatively small size of , the total computing time of MIP with BF was around 13 and 25 minutes for whitened and standardized data, respectively. MIP with BRV was faster on this data set, with computing times around 5 and 8 minutes.
5 Concluding Remarks and Future Work
Mixed Integer Optimization for norm feature selection is receiving much attention due to its foundational appeal, versatility and recent improvements in algorithmic efficiency. However, effectiveness of the MIP approach, like that of regularization methods, depends critically on the selection of a tuning parameter. Pursuing this through crossvalidation requires the solution of a very large number of programs – which can still hinder computational viability on problems of realistic size and complexity. We proposed a multipronged blueprint for reducing computational burden with modified bisections, bounds and warm starts, and a specialized scheme for crossvalidation that takes advantage of the structure of the optimization problem. Importantly, though our proposals were described for linear models, the procedures are very general and could easily be extended to generalized linear models and other classes of statistical models.
Also importantly, the ability to tune effectively and within a feasible timeframe does not fully guarantee the quality of solutions – for MIP as well as other feature selection approaches. Data with strong feature collinearity and/or weak regression signals can deteriorate specificity, sensitivity, prediction or estimation even when the optimization is well tuned (for MIP, when one has a perfect or nearperfect estimate of the number of active features). In our numerical studies the MIP approach substantially benefits from ZCA whitening, a data preprocessing step that decorrelates the features. Whitening is not a new concept, but it too is attracting increasing attention among statisticians. For instance, [kessy2018optimal] show how different forms of whitening can benefit different statistical problems. Surprisingly, while our numerical studies suggest that whitening improves recovery of MIP procedures, they do not show a positive effect on LASSO performance. This may be due to the fact that LASSO suffers from poor specificity, not poor sensitivity. The effects of whitening on various features selection approaches (MIP, regularization methods, forward selection, etc.) deserve further investigation. In particular, with regards to MIP procedures, it will be critical to investigate whether and how fast an increase in the distance between original and ZCA whitened features erodes the gains in recovery.
We can envision several other avenues for future work. In our simulations, scenarios with strong signals induced false positives for MIP with BF which were easily prevented using stricter thresholds. This suggests improving our modified bisections with datadriven adaptive thresholding. Moreover, we could incorporate new techniques to generate cuts within the Branch and Bound algorithm and increase efficiency – e.g. a cutting plane method proposed in [bertsimas2017sparse] for sparse ridge regression showed very good performance, in terms of feature selection and computational efficiency, under low to mild feature collinearity and fairly strong signals. Cuts constructed to exploit problem structure, but not necessarily data structure, are common in Operations Research: knapsack covers, cliques and Gomory cuts are routinely built into commercial solvers (for a survey of potentially useful cutting planes see [marchand2002cutting]). While it generally leverages types of MIP formulation (problem structure), the cutting framework is extremely versatile; one can introduce constraints to incorporate data structures instead. This was explored in [bertsimas2017logistic], where constraints were added to balance specific, competing goals of the modeler – but not to exploit data structures for improving solution quality and computing time.
Acknowledgements and Funding
We thank Matthew Reimherr for useful discussions and comments. This work was partially funded by the NIH B2D2K training grant and the Huck Institutes of the Life Sciences of Penn State, and by NSF grant DMS1407639. Computation was performed on the Institute for CyberScience Advanced CyberInfrastructure (ICSACI; Penn State University).