\icmlourtitle
Abstract
Screening rules allow to early discard irrelevant variables from the optimization in Lasso problems, or its derivatives, making solvers faster. In this paper, we propose new versions of the socalled safe rules for the Lasso. Based on duality gap considerations, our new rules create safe test regions whose diameters converge to zero, provided that one relies on a converging solver. This property helps screening out more variables, for a wider range of regularization parameter values. In addition to faster convergence, we prove that we correctly identify the active sets (supports) of the solutions in finite time. While our proposed strategy can cope with any solver, its performance is demonstrated using a coordinate descent algorithm particularly adapted to machine learning use cases. Significant computing time reductions are obtained with respect to previous safe rules.
1 Introduction
Since the mid 1990’s, high dimensional statistics has attracted considerable attention, especially in the context of linear regression with more explanatory variables than observations: the socalled case. In such a context, the least squares with regularization, referred to as the Lasso Tibshirani (1996) in statistics, or Basis Pursuit Chen et al. (1998) in signal processing, has been one of the most popular tools. It enjoys theoretical guarantees Bickel et al. (2009), as well as practical benefits: it provides sparse solutions and fast convex solvers are available. This has made the Lasso a popular method in modern datascience toolkits. Among successful fields where it has been applied, one can mention dictionary learning Mairal (2010), biostatistics Haury et al. (2012) and medical imaging Lustig et al. (2007); Gramfort et al. (2012) to name a few.
Many algorithms exist to approximate Lasso solutions, but it is still a burning issue to accelerate solvers in high dimensions. Indeed, although some other variable selection and prediction methods exist Fan & Lv (2008), the best performing methods usually rely on the Lasso. For stability selection methods Meinshausen & Bühlmann (2010); Bach (2008); Varoquaux et al. (2012), hundreds of Lasso problems need to be solved. For nonconvex approaches such as SCAD Fan & Li (2001) or MCP Zhang (2010), solving the Lasso is often a required preliminary step Zou (2006); Zhang & Zhang (2012); Candès et al. (2008).
Among possible algorithmic candidates for solving the Lasso, one can mention homotopy methods Osborne et al. (2000), LARS Efron et al. (2004), and approximate homotopy Mairal & Yu (2012), that provide solutions for the full Lasso path, \iefor all possible choices of tuning parameter . More recently, particularly for , coordinate descent approaches Friedman et al. (2007) have proved to be among the best methods to tackle large scale problems.
Following the seminal work by El Ghaoui et al. (2012), screening techniques have emerged as a way to exploit the known sparsity of the solution by discarding features prior to starting a Lasso solver. Such techniques are coined safe rules when they screen out coefficients guaranteed to be zero in the targeted optimal solution. Zeroing those coefficients allows to focus more precisely on the nonzero ones (likely to represent signal) and helps reducing the computational burden. We refer to Xiang et al. (2014) for a concise introduction on safe rules. Other alternatives have tried to screen the Lasso relaxing the “safety”. Potentially, some variables are wrongly disregarded and postprocessing is needed to recover them. This is for instance the strategy adopted for the strong rules Tibshirani et al. (2012).
The original basic safe rules operate as follows: one chooses a fixed tuning parameter , and before launching any solver, tests whether a coordinate can be zeroed or not (equivalently if the corresponding variable can be disregarded or not). We will refer to such safe rules as static safe rules. Note that the test is performed according to a safe region, \iea region containing a dual optimal solution of the Lasso problem. In the static case, the screening is performed only once, prior any optimization iteration. Two directions have emerged to improve on static strategies.

The first direction is oriented towards the resolution of the Lasso for a large number of tuning parameters. Indeed, practitioners commonly compute the Lasso over a grid of parameters and select the best one in a datadriven manner, \egby crossvalidation. As two consecutive in the grid lead to similar solutions, knowing the first solution may help improve screening for the second one. We call sequential safe rules such strategies, also referred to as recursive safe rules in El Ghaoui et al. (2012). This road has been pursued in Wang et al. (2013); Xu & Ramadge (2013); Xiang et al. (2014), and can be thought of as a “warm start” of the screening (in addition to the warm start of the solution itself). When performing sequential safe rules, one should keep in mind that generally, only an approximation of the previous dual solution is computed. Though, the safety of the rule is guaranteed only if one uses the exact solution. Neglecting this issue, leads to “unsafe” rules: relevant variables might be wrongly disregarded.

The second direction aims at improving the screening by interlacing it throughout the optimization algorithm itself: although screening might be useless at the beginning of the algorithm, it might become (more) efficient as the algorithm proceeds towards the optimal solution. We call these strategies dynamic safe rules following Bonnefoy et al. (2014a, b).
Based on convex optimization arguments, we leverage duality gap computations to propose a simple strategy unifying both sequential and dynamic safe rules. We coined GAP SAFE rules such safe rules.
The main contributions of this paper are 1) the introduction of new safe rules which demonstrate a clear practical improvement compared to prior strategies 2) the definition of a theoretical framework for comparing safe rules by looking at the convergence of their associated safe regions.
In Section 2, we present the framework and the basic concepts which guarantee the soundness of static and dynamic screening rules. Then, in Section 3, we introduce the new concept of converging safe rules. Such rules identify in finite time the active variables of the optimal solution (or equivalently the inactive variables), and the tests become more and more precise as the optimization algorithm proceeds. We also show that our new GAP SAFE rules, built on dual gap computations, are converging safe rules since their associated safe regions have a diameter converging to zero. We also explain how our GAP SAFE tests are sequential by nature. Application of our GAP SAFE rules with a coordinate descent solver for the Lasso problem is proposed in Section 4. Using standard datasets, we report the time improvement compared to prior safe rules.
1.1 Model and notation
We denote by the set for any integer . Our observation vector is and the design matrix has explanatory variables (or features) columnwise. We aim at approximating as a linear combination of few variables ’s, hence expressing as where is a sparse vector. The standard Euclidean norm is written , the norm , the norm , and the matrix transposition of a matrix is denoted by . We denote .
For such a task, the Lasso is often considered (see Bühlmann & van de Geer (2011) for an introduction). For a tuning parameter , controlling the tradeoff between data fidelity and sparsity of the solutions, a Lasso estimator is any solution of the primal optimization problem
(1) 
Denoting the dual feasible set, a dual formulation of the Lasso reads (see for instance Kim et al. (2007) or Xiang et al. (2014)):
(2) 
We can reinterpret Eq. (2) as , where refers to the projection onto a closed convex set . In particular, this ensures that the dual solution is always unique, contrarily to the primal .
1.2 A KKT detour
For the Lasso problem, a primal solution and the dual solution are linked through the relation:
(3) 
The KarushKhunTucker (KKT) conditions state:
(4) 
See for instance Xiang et al. (2014) for more details. The KKT conditions lead to the fact that for , is a primal solution. It can be considered as the mother of all safe screening rules. So from now on, we assume that for all the considered ’s.
2 Safe rules
Safe rules exploit the KKT condition (4). This equation implies that as soon as . The main challenge is that the dual optimal solution is unknown. Hence, a safe rule aims at constructing a set containing . We call such a set a safe region. Safe regions are all the more helpful that for many ’s, , hence for many ’s, .
Practical benefits are obtained if one can construct a region for which it is easy to compute its support function, denoted by and defined for any by:
(5) 
Cast differently, for any safe region , any , and any primal optimal solution , the following holds true:
(6) 
We call safe test or safe rule, a test associated to and screening out explanatory variables thanks to Eq. (6).
Remark 1.
Reminding that the support function of a set is the same as the support function of its closed convex hull HiriartUrruty & Lemaréchal (1993)[Proposition V.2.2.1], we restrict our search to closed convex safe regions.
Based on a safe region one can partition the explanatory variables into a safe active set and a safe zero set where:
(7)  
(8) 
Note that for nested safe regions then . Consequently, a natural goal is to find safe regions as small as possible: narrowing safe regions can only increase the number of screened out variables.
Remark 2.
If , the safe active set is the equicorrelation set (in most cases Tibshirani (2013) it is exactly the support of ). Even when the Lasso is not unique, the equicorrelation set contains all the solutions’ supports. The other extreme case is when , and . Here, no variable is screened out: and the screening is useless.
We now consider common safe regions whose support functions are easy to obtain in closed form. For simplicity we focus only on balls and domes, though more complicated regions could be investigated Xiang et al. (2014).
2.1 Sphere tests
Following previous work on safe rules, we call sphere tests, tests relying on balls as safe regions. For a sphere test, one chooses a ball containing with center and radius , \ie. Due to their simplicity, safe spheres have been the most commonly investigated safe regions (see for instance Table 1 for a brief review). The corresponding test is defined as follows:
(9) 
Note that for a fixed center, the smaller the radius, the better the safe screening strategy.
Example 1.
The first introduced sphere test El Ghaoui et al. (2012) consists in using the center and radius . Given that , this is a safe region since and . However, one can check that this static safe rule is useless as soon as
(10) 
2.2 Dome tests
Other popular safe regions are domes, the intersection between a ball and a halfspace. This kind of safe region has been considered for instance in El Ghaoui et al. (2012); Xiang & Ramadge (2012); Xiang et al. (2014); Bonnefoy et al. (2014b). We denote the dome with ball center , ball radius , oriented hyperplane with unit normal vector and parameter such that is the projection of on the hyperplane (see Figure 1 for an illustration in the interesting case ).
Remark 3.
The dome is nontrivial whenever . When , one gets simply a hemisphere.
For the dome test one needs to compute the support function for . Interestingly, as for balls, it can be obtained in a closed form. Due to its length though, the formula is deferred to the Appendix (see also Xiang et al. (2014)[Lemma 3] for more details).
2.3 Dynamic safe rules
For approximating a solution of the Lasso primal problem , iterative algorithms are commonly used. We denote the current estimate after iterations of any iterative algorithm (see Section 4 for a specific study on coordinate descent). Dynamic safe rules aim at discovering safe regions that become narrower as increases. To do so, one first needs dual feasible points: . Following El Ghaoui et al. (2012) (see also Bonnefoy et al. (2014a)), this can be achieved by a simple transformation of the current residuals , defining as
(11) 
Such dual feasible is proportional to , and is the closest point (for the norm ) to in with such a property, \ie. A reason for choosing this dual point is that the dual optimal solution is the projection of on the dual feasible set , and the optimal is proportional to , \lcfEquation (3).
Remark 4.
Note that if (convergence of the primal) then with the previous display and (3), we can show that . Moreover, the convergence of the primal is unaltered by safe rules: screening out unnecessary coefficients of , can only decrease the distance between and .
Example 2.
Note that any dual feasible point immediately provides a ball that contains since
(12) 
The ball corresponds to the simplest safe region introduced in Bonnefoy et al. (2014a, b) (\lcfFigure 2 for more insights). When the algorithm proceeds, one expects that gets closer to , so should get closer to . Similarly to Example 1, this dynamic rule becomes useless once is too small. More precisely, this occurs as soon as
(13) 
Noticing that (since is a contraction and ) and proceeding as for (10), one can show that this dynamic safe rule is inefficient when:
(14) 
This is a critical threshold, yet the screening might stop even at a larger thanks to Eq. (13). In practice the bound in Eq. (13) cannot be evaluated a priori due to the term ). Note also that the bound in Eq. (14) is close to the one in Eq. (10), explaining the similar behavior observed in our experiments (see Figure 3 for instance).
3 New contributions on safe rules
3.1 Support discovery in finite time
Let us first introduce the notions of converging safe regions and converging safe tests.
Definition 1.
Let be a sequence of closed convex sets in containing . It is a converging sequence of safe regions for the Lasso with parameter if the diameters of the sets converge to zero. The associated safe screening rules are referred to as converging safe tests.
Not only converging safe regions are crucial to speed up computation, but they are also helpful to reach exact active set identification in a finite number of steps. More precisely, we prove that one recovers the equicorrelation set of the Lasso (\lcfRemark 2) in finite time with any converging strategy: after a finite number of steps, the equicorrelation set is exactly identified. Such a property is sometimes referred to as finite identification of the support Liang et al. (2014). This is summarized in the following.
Theorem 1.
Let be a sequence of converging safe regions. The estimated support provided by , , satisfies , and there exists such that one gets .
Proof.
The main idea of the proof is to use that , and that the set is discrete. Details are delayed to the Appendix. ∎
Remark 5.
A more general result is proved for a specific algorithm (ForwardBackward) in Liang et al. (2014). Interestingly, our scheme is independent of the algorithm considered (\egForwardBackward Beck & Teboulle (2009), Primal Dual Chambolle & Pock (2011), coordinatedescent Tseng (2001); Friedman et al. (2007)) and relies only on the convergence of a sequence of safe regions.
3.2 GAP SAFE regions: leveraging the duality gap
In this section, we provide new dynamic safe rules built on converging safe regions.
Theorem 2.
Let us take any . Denote the dual optimal Lasso solution and , then
(15) 
Proof.
The construction of the ball is based on the weak duality theorem (\lcfRockafellar & Wets (1998) for a reminder on weak and strong duality). Fix and , then it holds that
Hence,
(16) 
In particular, this provides . Combining (12) and (16), asserts that belongs to the annulus (the light blue zone in Figure 2).
Remind that the dual feasible set is convex, hence is also convex. Thanks to (16), , and then is convex too. Hence, is inside the annulus and so is by convexity (see Figure 2,(a) and Figure 2,(b)). Moreover, is the point of which is closest to . The farthest where can be according to this information would be if were tangent to the inner ball and . Let us denote such a point. The tangency property reads and . Hence, with the later and the definition of , and .
Since by construction cannot be further away from than (again, insights can be gleaned from Figure 2), we conclude that . ∎
Remark 6.
Choosing and , then one recovers the static safe rule given in Example 1.
With the definition of the primal (resp. dual) objective for , (resp. ), the duality gap reads as . Remind that if , then one has , which is a standard stopping criterion for Lasso solvers. The next proposition establishes a connection between the radius and the duality gap .
Proposition 1.
For any , the following holds
(17) 
Proof.
Use the fact that and . ∎
If we could choose the “oracle” and in (15) then we would obtain a zero radius. Since those quantities are unknown, we rather pick dynamically the current available estimates given by an optimization algorithm: and as in Eq. (11). Introducing GAP SAFE spheres and domes as below, Proposition 2 ensures that they are converging safe regions.
GAP SAFE sphere:
(18) 
GAP SAFE dome:
(19) 
Proposition 2.
For any converging primal sequence , and dual sequence defined as in Eq. (11), then the GAP SAFE sphere and the GAP SAFE dome are converging safe regions.
Proof.
Remark 7.
The radius can be compared with the radius considered for the Dynamic Safe rule and Dynamic ST3 Bonnefoy et al. (2014a) respectively: and , where . We have proved that , but a weaker property is satisfied by the two other radius: and .
3.3 GAP SAFE rules : sequential for free
As a byproduct, our dynamic screening tests provide a warm start strategy for the safe regions, making our GAP SAFE rules inherently sequential. The next proposition shows their efficiency when attacking a new tuning parameter, after having solved the Lasso for a previous , even only approximately. Handling approximate solutions is a critical issue to produce safe sequential strategies: without taking into account the approximation error, the screening might disregard relevant variables, especially the one near the safe regions boundaries. Except for , it is unrealistic to assume that one can dispose of exact solutions.
Consider and a nonincreasing sequence of tuning parameters in . In practice, we choose the common grid Bühlmann & van de Geer (2011)[2.12.1]): (for instance in Figure 3, we considered ). The next result controls how the duality gap, or equivalently, the diameter of our GAP SAFE regions, evolves from to .
Proposition 3.
Suppose that and . Reminding , the following holds
(20)  
Proof.
Details are given in the Appendix. ∎
This proposition motivates to screen sequentially as follows: having such that , then, we can screen using the GAP SAFE sphere with center and radius . The adaptation to the GAP SAFE dome is straightforward and consists in replacing by in the GAP SAFE dome definition.
Remark 8.
The basic sphere test of Wang et al. (2013) requires the exact dual solution for center, and has radius , which is strictly larger than ours. Indeed, if one has access to dual and primal optimal solutions at , \ie, then , and
since for .
Note that contrarily to former sequential rules Wang et al. (2013), our introduced GAP SAFE rules still work when one has only access to approximations of .
4 Experiments
4.1 Coordinate Descent
Screening procedures can be used with any optimization algorithm. We chose coordinate descent because it is well suited for machine learning tasks, especially with sparse and/or unstructured design matrix . Coordinate descent requires to extract efficiently columns of which is typically not easy in signal processing applications where is commonly an implicit operator (e.g. Fourier or wavelets).
We implemented the screening rules of Table 1 based on the coordinate descent in Scikitlearn Pedregosa et al. (2011). This code is written in Python and Cython to generate low level C code, offering high performance. A low level language is necessary for this algorithm to scale. Two implementations were written to work efficiently with both dense data stored as Fortran ordered arrays and sparse data stored in the compressed sparse column (CSC) format. Our pseudocode is presented in Algorithm 1. In practice, we perform the dynamic screening tests every passes through the entire (active) variables. Iterations are stopped when the duality gap is smaller than the target accuracy.
The naive computation of in (11) involves the computation of ( being the current residual), which costs operations. This can be avoided as one knows when using a safe rule that the index achieving the maximum for this norm is in . Indeed, by construction . In practice the evaluation of the dual gap is therefore not a but where is the size of . In other words, using screening also speeds up the evaluation of the stopping criterion.
We did not compare our method against the strong rules of Tibshirani et al. (2012) because they are not safe and therefore need complex postprocessing with parameters to tune. Also we did not compare against the sequential rule of Wang et al. (2013) (\egEDDP) because it requires the exact dual optimal solution of the previous Lasso problem, which is not available in practice and can prevent the solver from actually converging: this is a phenomenon we always observed on our experiments.
4.2 Number of screened variables
Figure 3 presents the proportion of variables screened by several safe rules on the standard Leukemia dataset. The screening proportion is presented as a function of the number of iterations . As the SAFE screening rule of El Ghaoui et al. (2012) is sequential but not dynamic, for a given the proportion of screened variables does not depend on . The rules of Bonnefoy et al. (2014a) are more efficient on this dataset but they do not benefit much from the dynamic framework. Our proposed GAP SAFE tests screen much more variables, especially when the tuning parameter gets small, which is particularly relevant in practice. Moreover, even for very small ’s (notice the logarithmic scale) where no variable is screened at the beginning of the optimization procedure, the GAP SAFE rules manage to screen more variables, especially when increases. Finally, the figure demonstrates that the GAP SAFE dome test only brings marginal improvement over the sphere.
4.3 Gains in the computation of Lasso paths
The main interest of variable screening is to reduce computation costs. Indeed, the time to compute the screening itself should not be larger than the gains it provides. Hence, we compared the time needed to compute Lasso paths to prescribed accuracy for different safe rules. Figures 4, 5 and 6 illustrate results on three datasets. Figure 4 presents results on the dense, small scale, Leukemia dataset. Figure 5 presents results on a medium scale sparse dataset obtained with bag of words features extracted from the 20newsgroup dataset (comp.graphics vs. talk.religion.misc with TFIDF removing English stop words and words occurring only once or more than 95% of the time). Text feature extraction was done using ScikitLearn. Figure 6 focuses on the large scale sparse RCV1 (Reuters Corpus Volume 1) dataset, \lcfSchmidt et al. (2013).
In all cases, Lasso paths are computed as required to estimate optimal regularization parameters in practice (when using crossvalidation one path is computed for each fold). For each Lasso path, solutions are obtained for values of ’s, as detailed in Section 3.3. Remark that the grid used is the default one in both ScikitLearn and the glmnet R package. With our proposed GAP SAFE screening we obtain on all datasets substantial gains in computational time. We can already get an up to 3x speedup when we require a duality gap smaller than . The interest of the screening is even clearer for higher accuracies: GAP SAFE sphere is 11x faster than its competitors on the Leukemia dataset, at accuracy . One can observe that with the parameter grid used here, the larger is compared to , the higher is the gain in computation time.
In our experiments, the other safe screening rules did not show much speedup. As one can see on Figure 3, those screening rules keep all the active variables for a wide range of ’s. The algorithm is thus faster for large ’s but slower afterwards, since we still compute the screening tests. Even if one can avoid some of these useless computations thanks to formulas like (14) or (10), the corresponding speedup would not be significant.
5 Conclusion
We have presented new results on safe rules for accelerating algorithms solving the Lasso problem (see Appendix for extension to the Elastic Net). First, we have introduced the framework of converging safe rules, a key concept independent of the implementation chosen. Our second contribution was to leverage duality gap computations to create two safer rules satisfying the aforementioned convergence properties. Finally, we demonstrated the important practical benefits of those new rules by applying them to standard dense and sparse datasets using a coordinate descent solver. Future works will extend our framework to generalized linear model and groupLasso.
Acknowledgment
The authors would like to thanks Jalal Fadili and Jingwei Liang for helping clarifying some misleading statements on the equicorrelation set. We acknowledge the support from Chair Machine Learning for Big Data at Télécom ParisTech and from the Orange/Télécom ParisTech think tank phiTAB. This work benefited from the support of the ”FMJH Program Gaspard Monge in optimization and operation research”, and from the support to this program from EDF.
Appendix A Supplementary materials
We provided in this Appendix some more details on the theoretical results given in the main part.
a.1 Dome test
Let us consider the case where the safe region is the dome , with parameters: center , radius , relative distance ratio and unit normal vector .
The computation of the dome test formula proceeds as follows:
(21) 
and so
(22) 
With the previous display we can now compute . Thanks to the Eq. (6), we express our dome test as:
(23) 
Using the former notation:
(24) 
(25) 
Let us introduce the following dome parameters, for any :

Center: .

Radius: .

Ratio: .

Normal vector: .
Reminding that the support function of a set is the same as the support function of its closed convex hull HiriartUrruty & Lemaréchal (1993)[Proposition V.2.2.1] means that we only need to optimize over the dome introduced. Therefore, one cannot improve our previous result by optimizing the problem on the intersection of the ball of radius and the complement of the ball of radius (\iethe blue region in Figure 2).
a.2 Proof of Theorem 1
Proof.
Define . Fix such that . As is a converging sequence containing , its diameter is converging to zero, and there exists such that . Hence, for any and any , . Using the triangle inequality, one gets
provided that . Thus, for any and .
For the reverse inclusion take , \ie. Since for all , then and the result holds.
∎
a.3 Proof of Proposition 3
We detail here the proof of Proposition 3.
Proof.
We first use the fact that
to obtain
Then,
We deal with the dot product as