Screening Data Points in Empirical Risk Minimization via Ellipsoidal Regions and Safe Loss Functions
Abstract
We design simple screening tests to automatically discard data samples in empirical risk minimization without losing optimization guarantees. We derive loss functions that produce dual objectives with a sparse solution. We also show how to regularize convex losses to ensure such a dual sparsity-inducing property, and propose a general method to design screening tests for classification or regression based on ellipsoidal approximations of the optimal set. In addition to producing computational gains, our approach also allows us to compress a dataset into a subset of representative points.
1 Introduction
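Let us consider a collection of $n$ pairs $(a_i, b_i)_{i=1,\dots,n}$, where each vector $a_i$ in $\mathbb{R}^p$ describes a data point and $b_i$ is its label. For regression, $b_i$ is real-valued, and we address the convex optimization problem
$$\min_{x \in \mathbb{R}^p,\; t \in \mathbb{R}^n} \; f(t) + \lambda R(x) \quad \text{s.t.} \quad t = Ax - b,$$
where $A = [a_1, \dots, a_n]^\top$ in $\mathbb{R}^{n \times p}$ carries the feature vectors, and $b = [b_1, \dots, b_n]$ carries the labels. The function $f$ is a convex loss and measures the fit between data points and the model, and $R$ is a convex regularization function. For classification, the scalars $b_i$ are binary labels in $\{-1, +1\}$, and we consider, instead of the regression problem above, margin-based loss functions, where our problem becomes
$$\min_{x \in \mathbb{R}^p,\; t \in \mathbb{R}^n} \; f(t) + \lambda R(x) \quad \text{s.t.} \quad t = \mathrm{diag}(b) A x.$$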
The above problems cover a wide variety of formulations such as Lasso [18] and its variants [22], logistic regression, support vector machines [10], and many more.
When the regularization function is the $\ell_1$-norm, the solution is encouraged to be sparse [1], which can be exploited to speed up optimization procedures.
A recent line of work has focused on screening tests that seek to automatically discard variables before running an optimization algorithm. For example, [7] derive a screening rule from Karush-Kuhn-Tucker conditions, noting that if a dual optimal variable satisfies a given inequality constraint, the corresponding primal optimal variable must be zero. Checking this condition on a set that is known to contain the optimal dual variable ensures that the corresponding primal variable can be safely removed. This prunes out irrelevant features before solving the problem. This is called a safe rule if it discards variables that are guaranteed to be useless; but it is possible to relax the “safety” of the rules [17] without losing too much accuracy in practice.
The seminal approach by [7] has led to a series of works proposing refined tests [6, 20] or dynamic rules [9] for the Lasso, where screening is performed as the optimization algorithm proceeds, significantly speeding up convergence. Other papers have proposed screening rules for sparse logistic regression [19] or other linear models.
Whereas the goal of these previous methods is to remove variables, our goal is to design screening tests for data points in order to remove observations that do not contribute to the final model. The problem is important when there is a large number of “trivial” observations that are useless for learning. This typically occurs in tracking or anomaly detection applications, where a classical heuristic seeks to mine the data to find difficult examples [8].
A few such screening tests for data points have been proposed in the literature. Some are problem-specific (e.g. [14] for SVM), others make strong assumptions on the objective. For instance, the most general rule of [16] for classification requires strong convexity and the ability to compute a duality gap in closed form.
The goal of our paper is to provide a more generic approach for screening data samples, both for regression and classification. Such screening tests may be designed for loss functions that induce a sparse dual solution. We describe this class of loss functions and investigate a regularization mechanism that ensures that the loss enjoys such a property.
Our contributions can be summarized as follows:

We revisit the Ellipsoid method [3] to design screening tests for samples, when the objective is convex and its dual admits a sparse solution.

We propose a new regularization mechanism to design regression or classification losses that induce sparsity in the dual. This allows us to recover existing loss functions and to discover new ones with sparsity-inducing properties in the dual.

We extend our screening rules, originally designed for linear models, to kernel methods. Unlike the existing literature, our method also works for non-strongly convex objectives.

We demonstrate the benefits of our screening rules in various numerical experiments on large-scale classification problems and regression.
2 Preliminaries
We now present the key concepts used in our paper.
2.1 Fenchel Conjugacy
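Definition 2.1 (Fenchel conjugate).
Let $f : \mathbb{R}^p \to \mathbb{R} \cup \{-\infty, +\infty\}$ be an extended real-valued function. The Fenchel conjugate of $f$ is defined by
$$f^*(y) = \max_{t \in \mathbb{R}^p} \; \langle t, y \rangle - f(t).$$
The biconjugate of $f$ is naturally the conjugate of $f^*$ and is denoted by $f^{**}$. The Fenchel-Moreau theorem [?] states that if $f$ is proper, lower semi-continuous and convex, then it is equal to its biconjugate $f^{**}$. Finally, Fenchel-Young’s inequality gives, for all pairs $(t, y)$,
$$f(t) + f^*(y) \geq \langle t, y \rangle,$$
with equality if and only if $y \in \partial f(t)$.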
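Suppose now that for such a function $f$, we add a convex term $\mu\Omega$ to $f^*$ in the definition of the biconjugate. We get a modified biconjugate $f_\mu$, written
$$f_\mu(t) = \max_{y} \; \langle t, y \rangle - f^*(y) - \mu\Omega(y) = \max_{y} \; \langle t, y \rangle + \min_{z} \left\{ -\langle z, y \rangle + f(z) \right\} - \mu\Omega(y).$$
The inner objective function is continuous, concave in $y$ and convex in $z$, such that we can switch min and max according to Von Neumann’s minimax theorem to get
$$f_\mu(t) = \min_{z} \; f(z) + \max_{y} \left\{ \langle t - z, y \rangle - \mu\Omega(y) \right\} = \min_{z} \; f(z) + \mu\,\Omega^*\!\left(\frac{t - z}{\mu}\right).$$
Definition 2.2 (Infimum convolution).
$f_\mu$ is called the infimum convolution of $f$ and $(\mu\Omega)^*$, which may be written as $f \,\square\, (\mu\Omega)^*$.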
2.2 Empirical Risk Minimization and Duality
2.3 Safe Loss Functions and Sparsity in the Dual of ERM Formulations
A key feature of our losses is to encourage sparsity of dual solutions, which typically emerges from loss functions with a flat region. We call such functions “safe losses” since they will allow us to design safe screening tests.
Definition 2.3 (Safe loss function).
Let be a continuous convex loss function such that . We say that is a safe loss if there exists a non-singleton and non-empty interval such that
Lemma 2.4 (Safe loss and dual sparsity).
Consider the problem (1) where is a convex penalty. Denoting by and the optimal primal and dual variables respectively, we have for all ,
A consequence of this lemma is that for both classification and regression, the sparsity of the dual solution is related to loss functions that have “flat” regions—that is, such that . This is the case for safe loss functions defined above. Note that the relation between flat losses and sparse dual solutions is classical, see [?, 4].
3 Safe rules for screening data points
In this section, we derive screening rules in the spirit of SAFE [7] to select data points in regression or classification problems with safe losses.
3.1 Principle of SAFE Rules for Data Points
We recall that our goal is to safely delete data points prior to optimization, that is, we want to train the model on a subset of the original dataset while still getting the same optimal solution as a model trained on the whole dataset. This amounts to identifying beforehand which dual variables are zero at the optimum. Indeed, as discussed in Section 2.2, the optimal primal variable only relies on nonzero entries of . To that effect, we make the following assumption:
Assumption 3.1 (Safe loss assumption).
We consider problem (1), where each is a safe loss function. Specifically, we assume that for regression, or for classification, where satisfies Definition 2.3 on some interval . For simplicity, we assume that there exists such that for regression losses and for classification, which covers most useful cases.
We may now state the basic safe rule for screening.
Lemma 3.2 (SAFE rule).
Under Assumption 3.1, consider a subset containing the optimal solution . If, for a given data point , for all in , (resp. ), where is the interior of , then this data point can be discarded from the dataset.
Proof.
From the definition of safe loss functions, is differentiable at with .
We now see how the safe screening rule can be interpreted in terms of discrepancy between the model prediction and the true label . If, for a set containing the optimal solution and a given data point , the prediction always lies in , then the data point can be discarded from the dataset. The data point screening procedure therefore consists in maximizing linear forms in regression (resp. minimizing a linear form in classification) over a set containing and checking whether they are lower (resp. greater) than the threshold . The smaller this set, the lower the maximum (resp. the higher the minimum), hence the more data points we can hope to safely delete. Finding a good test region is critical, however. We show how to do this in the next section.
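The test described above can be sketched in its simplest form, assuming the region is a Euclidean ball of center `z` and radius `r` rather than an ellipsoid, and assuming the flat region of the loss is `|a_i^T x - b_i| < mu` for regression and `b_i a_i^T x > mu` for classification. The function names and the arguments `z`, `r` and `mu` are hypothetical and chosen for illustration only.

```python
import numpy as np

def ball_screen_regression(A, b, z, r, mu):
    """Boolean mask of points that can be discarded (regression case).

    Over the ball ||x - z|| <= r, the largest value of |a_i^T x - b_i|
    is |a_i^T z - b_i| + r * ||a_i||.
    """
    worst = np.abs(A @ z - b) + r * np.linalg.norm(A, axis=1)
    return worst < mu

def ball_screen_classification(A, b, z, r, mu):
    """Boolean mask of points that can be discarded (labels b_i in {-1, +1}).

    Over the same ball, the smallest margin b_i a_i^T x
    is b_i a_i^T z - r * ||a_i||.
    """
    worst = b * (A @ z) - r * np.linalg.norm(A, axis=1)
    return worst > mu
```

A point is kept whenever the worst case over the region leaves the flat part of the loss, so a smaller region can only discard more points.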
3.2 Building the Test Region
Screening rules aim at sparing computing resources; testing a data point should therefore be easy. As in [7] for screening variables, if is an ellipsoid, the optimization problem detailed above admits a closed-form solution. Furthermore, it is possible to get a smaller set by adding a first-order optimality condition with a subgradient of the objective evaluated at the center of this ellipsoid. This linear constraint cuts the final ellipsoid roughly in half, thus reducing its volume.
Lemma 3.3 (Closed-form screening test).
Consider the optimization problem
(3) 
in the variable in with defining an ellipsoid with center and is in . Then the maximum is
with and .
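A minimal numerical sketch of such a closed-form test follows. It assumes the feasible set is the half-ellipsoid `{x : (x - z)^T E^{-1} (x - z) <= 1, g^T (x - z) <= 0}` with `E` symmetric positive definite; this form of the constraints, the function name and the intermediate names `nu` and `d` are assumptions made for illustration, and the expression is obtained by a standard change of variables rather than copied from the lemma.

```python
import numpy as np

def max_linear_over_half_ellipsoid(a, z, E, g=None):
    """Maximize a^T x over {x : (x-z)^T E^{-1} (x-z) <= 1, g^T (x-z) <= 0}.

    E is assumed symmetric positive definite; g=None means no cut.
    """
    Ea = E @ a
    if g is None or g @ Ea <= 0:
        # The maximizer over the full ellipsoid already satisfies the cut.
        return a @ z + np.sqrt(a @ Ea)
    # Otherwise the cut is active: remove from a its component along g
    # (in the geometry induced by E) and maximize over the slice g^T (x - z) = 0.
    nu = (g @ Ea) / (g @ E @ g)
    d = a - nu * g
    return a @ z + np.sqrt(max(d @ E @ d, 0.0))
```

Minimizing a linear form can be handled with the same routine by calling it on `-a` and negating the result.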
The proof can be found in Appendix A.2 and it is easy to modify it for minimizing . We can obtain both and by using a few steps of the ellipsoid method [?, 3]. The method starts from an initial ellipsoid containing the solution to a given convex problem. It iteratively computes a subgradient in the center of the current ellipsoid, selects the half-ellipsoid containing , and computes the ellipsoid with minimal volume containing the previous half-ellipsoid before starting all over again. Such a method, presented in Algorithm 1, performs closed-form updates of the ellipsoid.
Note that the ellipsoid update formula was also used to screen primal variables for the Lasso problem [6], although not iterating over ellipsoids in order to get smaller volumes.
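One iteration of the procedure described above can be sketched with the textbook central-cut ellipsoid update. This is an illustration under the assumption that the ellipsoid is stored as a dense matrix `E` with center `z` in dimension `p >= 2`; it is not a transcription of Algorithm 1, and the function name is hypothetical.

```python
import numpy as np

def ellipsoid_step(z, E, g):
    """One central-cut ellipsoid update.

    The current ellipsoid is {x : (x - z)^T E^{-1} (x - z) <= 1} and g is a
    nonzero subgradient at z. The returned ellipsoid is the minimum-volume
    one containing the half-ellipsoid {g^T (x - z) <= 0}. Requires p >= 2.
    """
    p = z.shape[0]
    Eg = E @ g
    w = Eg / np.sqrt(g @ Eg)          # E g / sqrt(g^T E g)
    z_new = z - w / (p + 1)
    E_new = (p**2 / (p**2 - 1.0)) * (E - (2.0 / (p + 1)) * np.outer(w, w))
    return z_new, E_new
```

Since the minimizer of a convex objective always satisfies `g^T (x* - z) <= 0`, it stays inside the ellipsoid after every step while the volume shrinks.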
Initialization.
The algorithm requires an initial ellipsoid that contains the solution. This is typically achieved by defining the center as an approximate solution of the problem, which can be obtained in various ways. For instance, one may run a few steps of a solver on the whole dataset, or one may consider the solution obtained previously for a different regularization parameter when computing a regularization path, or the solution obtained for slightly different data, e.g., for tracking applications where an optimization problem has to be solved at every time step , with slight modifications from time .
Once the center is defined, there are many cases where the initial ellipsoid can be safely assumed to be a sphere. For instance, if the objective, let us call it , is strongly convex, we have the basic inequality , which can often be upper-bounded by several quantities, e.g., a duality gap [16] or simply if is nonnegative as in typical ERM problems. Otherwise, other strategies can be used depending on the problem at hand, as done for the Lasso by [7, 9] for example.
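As a small illustration of the strongly convex case, assume the objective is `alpha`-strongly convex with minimizer `x*`, so that `(alpha / 2) * ||x - x*||^2` is at most the suboptimality of `x`, which is itself at most any valid upper bound `gap` (a duality gap, or the objective value when it is nonnegative). The names `alpha`, `gap` and `initial_radius` are hypothetical.

```python
import math

def initial_radius(gap, alpha):
    """Radius r such that ||x - x*|| <= r, given (alpha / 2) ||x - x*||^2 <= gap."""
    if gap < 0 or alpha <= 0:
        raise ValueError("gap must be nonnegative and alpha positive")
    return math.sqrt(2.0 * gap / alpha)

# The initial ellipsoid is then the ball of center x and matrix r**2 * I.
r = initial_radius(gap=0.5, alpha=1.0)
```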
Efficient implementation.
Since each update of the ellipsoid matrix is rank one, it is possible to parametrize at step as
with the identity matrix, is in and in is a diagonal matrix. Hence, we only have to update and while the algorithm proceeds.
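One way to realize such a parametrization is sketched below, assuming the initial ellipsoid is a ball so that after `k` steps the matrix can be written as `s * I - L @ diag(d) @ L.T` with `L` of size `p x k`. The class name, the attribute names `s`, `L`, `d` and the exact bookkeeping are assumptions made for illustration.

```python
import numpy as np

class LowRankEllipsoid:
    """Ellipsoid matrix E = s * I - L @ diag(d) @ L.T, never formed explicitly."""

    def __init__(self, p, radius):
        self.p = p                      # dimension, assumed >= 2
        self.s = radius ** 2            # scalar multiplying the identity
        self.L = np.zeros((p, 0))       # one column per past step
        self.d = np.zeros(0)            # diagonal weights

    def matvec(self, v):
        """Compute E @ v in O(p k) operations."""
        return self.s * v - self.L @ (self.d * (self.L.T @ v))

    def update(self, g):
        """Apply one central-cut step for subgradient g; return the center shift."""
        p = self.p
        Eg = self.matvec(g)
        w = Eg / np.sqrt(g @ Eg)
        c = p**2 / (p**2 - 1.0)
        beta = 2.0 / (p + 1)
        # E_new = c * (E - beta * w w^T): rescale s and d, append w as a new column.
        self.s *= c
        self.d = np.append(c * self.d, c * beta)
        self.L = np.column_stack([self.L, w])
        return -w / (p + 1)
```

Each step appends one column, so storage grows linearly with the number of steps while every product with the matrix stays cheap when `k` is much smaller than `p`.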
Complexity of our screening rules.
For each step of Algorithm 1, we compute a subgradient in operations. The ellipsoids are modified using rank-one updates that can be stored. As a consequence, the computations at this stage are dominated by the computation of , which is . As a result, steps cost .
Once we have the test set , we have to compute the closed forms from Lemma 3.3 for each data point. This computation is dominated by the matrix-vector multiplications with , which cost using the structure of . Hence, testing the whole dataset costs . Since we typically have , the cost of the overall screening procedure is therefore .
In contrast, solving the ERM problem without screening would cost where is the number of passes over the data, with . With screening, the complexity becomes , where is the number of data points accepted by the screening procedure.
3.3 Extension to Kernel Methods
It is relatively easy to adapt our safe rules to kernel methods. Consider for example the regression problem of Section 1, where has been replaced by in , with an RKHS and its mapping function . The prediction function lives in the RKHS, thus it can be written , . In the setting of ERM, the representer theorem ensures with and the kernel associated to . The problem becomes:
(4) 
with the Gram matrix. The constraint is linear in (thus satisfying the conditions of Lemma 4.1) while yielding non-linear prediction functions. The screening test becomes maximizing the linear forms and over an ellipsoid containing . When the problem is convex (it depends on ), can still be found using the ellipsoid method.
We now have an algorithm for selecting data points in regression or classification problems with linear or kernel models. As detailed above, the rules require a sparse dual, which is not the case in general except in particular instances such as support vector machines. We now explain how to induce sparsity in the dual.
4 Constructing safe losses
In this section, we introduce a way to induce sparsity in the dual of empirical risk minimization problems.
4.1 Inducing Sparsity in the Dual of ERM
When the ERM problem does not admit a sparse dual solution, safe screening is not possible. To fix this issue, consider the regression ERM problem of Section 1 and replace the loss by its modified biconjugate defined in Section 2.1:
() 
We have the following result connecting the duals of the original and regularized problems.
Lemma 4.1 (Regularized dual for regression).
Before we prove this lemma, we remark that it is possible, in many cases, to induce sparsity in the dual if the added regularizer is the $\ell_1$-norm, or another sparsity-inducing penalty. This is notably true if the unregularized dual is smooth with bounded gradients. In such a case, it is possible to show that the optimal dual solution would be zero as soon as the regularization parameter is large enough [1].
Proof.
(6) 
in the variable with and and . Since the constraints are linear, we can directly express the dual of this problem in terms of the Fenchel conjugate of the objective (see e.g. [5], 5.1.6). Let us note . For all , we have
It is known from [2] that with . Clearly, . If is proper, convex and lower semicontinuous, then . As a consequence, . If is proper, convex and lower semicontinuous, then , hence
Now we can form the dual of by writing
(7) 
in the variable . Since with the dual variable associated to the equality constraints,
Injecting in the problem and setting instead of (we optimize in ) concludes the proof.
We now consider the classification problem of Section 1 and show that the previous remarks about sparsity-inducing regularization for the dual of regression problems also hold in this new context.
Lemma 4.2 (Regularized dual for classification).
Proof.
We proceed as above with a linear constraint and .
Note that the formula directly provides the dual of regression and classification ERM problems with a linear model such as the Lasso and SVM.
4.2 Link Between the Original and Regularized Problems
Lemma 4.3 (Smoothness of ).
If is strongly convex, then is smooth.
Proof.
The lemma follows directly from the fact that (see the proof of Lemma 4.1). The conjugate of a closed, proper, strongly convex function is indeed smooth.
Lemma 4.4 (Bounding ).
If and is a norm then
with and .
Proof.
If is a norm, then and is the indicator function of the dual norm of hence nonnegative. Moreover, if then, and ,
In particular, we can take hence the right-hand inequality. On the other hand,
Since is convex,
As a consequence,
Proof.
The proof is trivial given the inequalities in Lemma 4.4.
4.3 Effect of Regularization and Examples
We start by recalling that the infimum convolution is traditionally used for smoothing an objective when is strongly convex, and then we discuss the use of sparsityinducing regularization in the dual.
Euclidean distance to a closed convex set.
It is known that convolving the indicator function of a closed convex set with a quadratic term (the Fenchel conjugate of a quadratic term is itself) yields the Euclidean distance to
Huber loss.
The $\ell_1$ loss is more robust to outliers than the $\ell_2$ loss, but is not differentiable at zero, which may induce difficulties during the optimization. A natural solution consists in smoothing it: [2] for example show that applying the Moreau-Yosida smoothing, i.e., convolving with a quadratic term, yields the well-known Huber loss, which is both smooth and robust:
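This smoothing can be checked numerically in one dimension. The sketch below assumes the quadratic term is `(t - z)^2 / (2 * mu)` and uses the usual Huber parametrization with threshold `mu`; the function names and the brute-force grid search are illustrative choices, not part of the original derivation.

```python
import numpy as np

def huber(t, mu):
    """Huber loss: quadratic on [-mu, mu], linear outside."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= mu, t**2 / (2 * mu), np.abs(t) - mu / 2)

def l1_inf_conv_quadratic(t, mu, grid):
    """Brute-force evaluation of min_z |z| + (t - z)^2 / (2 * mu) over a grid of z."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    vals = np.abs(grid)[None, :] + (t[:, None] - grid[None, :])**2 / (2 * mu)
    return vals.min(axis=1)

t = np.linspace(-3.0, 3.0, 13)
grid = np.linspace(-4.0, 4.0, 80001)
max_gap = np.max(np.abs(l1_inf_conv_quadratic(t, 0.7, grid) - huber(t, 0.7)))
```

The minimizer in `z` is the soft-thresholding of `t` at level `mu`, which is why the result is quadratic near zero and linear in the tails.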
Now, we present examples where has a sparsityinducing effect.
Hinge loss.
Instead of the quadratic loss in the previous example, choose a robust loss . By using the same function , we obtain the classical hinge loss of support vector machines
We see that the effect of convolving with the constraint is to turn a regression loss (e.g., square loss) into a classification loss. The effect of the norm is to encourage the loss to be flat (when grows, is equal to zero for a larger range of values ), which corresponds to the sparsityinducing effect in the dual that we will exploit for screening data points.
Squared hinge loss.
Let us consider a problem with a quadratic loss designed for a classification problem, and consider . We have , and
which is a squared Hinge Loss with a threshold parameter and .
Screening-friendly regression.
Screening-friendly logistic regression.
Let us now consider the logistic loss , which we define only with one dimension for simplicity here. It is easy to show that the infimum convolution with the norm does not induce any sparsity in the dual, because the dual of the logistic loss has unbounded gradients, making classical sparsityinducing penalties ineffective. However, we may consider instead another penalty to fix this issue: for . We have . Convolving with yields
(10) 
Note that this loss is asymptotically robust. Moreover, the entropic part of makes this penalty strongly convex hence is smooth [?]. Finally, the penalty ensures that the dual is sparse, thus making the screening usable. Our regularization mechanism thus builds a smooth, robust classification loss akin to the logistic loss on which we can use screening rules. The effect of the regularization parameter in a few of the previous cases is illustrated in Figure 2.
In summary, regularizing the dual with the $\ell_1$-norm induces a flat region in the loss, which induces sparsity in the dual. The geometry is preserved elsewhere.
5 Experiments
We now present experimental results demonstrating the effectiveness of the data screening procedure.
Datasets.
We consider three real datasets, SVHN, MNIST, RCV1, and a synthetic one. MNIST () and SVHN () both represent digits, which we encode by using the output of a two-layer convolutional kernel network [12] leading to feature dimensions . RCV1 () represents sparse TF-IDF vectors of categorized newswire stories (). For classification, we consider a binary problem consisting of discriminating digit 9 for MNIST vs. all other digits (resp. digit 1 vs rest for SVHN, 1st category vs rest for RCV1). For regression, we also consider a synthetic dataset, where data is generated by , where is a random, sparse ground truth, a data matrix whose coefficients are in and with . Implementation details are provided in the appendix. We fit usual models using scikit-learn [15].
5.1 Safe Screening
Here, we consider problems that naturally admit a sparse dual solution, which allows safe screening.
Interval regression.
We first illustrate the practical use of the screening-friendly regression loss (9) derived above. It corresponds indeed to a particular case of a supervised learning task called interval regression [11], which is widely used in fields such as economics. In interval regression, one does not have scalar labels but intervals containing the true labels , which are unknown. The loss is written
(11) 
where contains the true label . For a given data point, the model only needs to predict a value inside the interval in order not to be penalized. When the intervals have the same width and we are given their centers , (11) is exactly (9). Since we proved (9) to yield a sparse dual, we can apply our rules to safely discard intervals that are assured to be matched by the optimal solution. We use an penalty along with the loss. As an illustration, the experiment was done using a toy synthetic dataset , the signal to recover being generated by one feature only. The intervals can be visualized in Figure 3. The “difficult” intervals (red) were kept in the training set. The predictions hardly fit these intervals. The “easy” intervals (blue) were discarded from the training set: the safe rules certify that the optimal solution will fit these intervals. Our screening algorithm was run for 20 iterations of the Ellipsoid method. Most of the intervals can be ruled out afterwards while the remaining intervals yield the same optimal solution as a model trained on all the intervals.
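The behavior of an interval loss can be illustrated as follows. This sketch assumes a penalty that is zero inside the interval and grows with the distance to it outside; the exponent `power` (linear or squared) is an assumption, as are the function names. It also checks the remark that intervals of equal width reduce to a loss on the distance to the interval center.

```python
import numpy as np

def interval_loss(pred, lower, upper, power=1):
    """Zero when lower <= pred <= upper, distance to the interval (to `power`) otherwise."""
    dist = np.maximum(lower - pred, 0.0) + np.maximum(pred - upper, 0.0)
    return dist ** power

def centered_loss(pred, center, half_width, power=1):
    """Same loss written with the interval center and a common half-width."""
    return np.maximum(np.abs(pred - center) - half_width, 0.0) ** power

pred = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])
same = np.allclose(interval_loss(pred, -0.5, 1.5), centered_loss(pred, 0.5, 1.0))
```

Predictions that fall strictly inside an interval lie in the flat region of the loss, which is what allows the corresponding data points to be screened.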
Classification.
Common sample screening methods such as [16] require a strongly convex objective. When this is not the case, there is, to the best of our knowledge, no baseline. Nevertheless, when considering classification using the non-strongly convex safe logistic loss derived in Section 4 along with an penalty, our algorithm is still able to screen samples, as shown in Table 1. The algorithm is initialized using an approximate solution to the problem, and the radius of the initial ball is chosen depending on the number of epochs ( for epochs, for and for epochs), which is valid in practice.
Epochs  |  20                   |  30
        |  MNIST  SVHN   RCV1   |  MNIST  SVHN   RCV1
        |  0      0      1      |  0      2      12
        |  0.3    0.01   8      |  27     17     42
        |  35     12     45     |  65     54     75
As established in Lemma 2.4, the hinge loss and squared hinge loss allow for safe screening. Combined with an $\ell_2$ penalty, the resulting ERM problem is strongly convex. We can therefore compare our Ellipsoid algorithm to the baseline introduced by [16], where the safe region is a ball centered in the current iterate of the solution and whose radius is with a duality gap of the ERM problem. Both methods are initialized by running the default solver of scikit-learn with a certain number of epochs. The resulting approximate solution and duality gap are subsequently fed into our algorithm for initialization. Then, we perform one more epoch of the duality gap screening algorithm on the one hand, and the corresponding number of ellipsoid steps computed on a subset of the dataset on the other hand, so as to get a fair comparison in terms of data access. The results can be seen in Table 2. While being more general (our approach is neither restricted to classification, nor requires strong convexity), our method performs similarly to the baseline. Figure 4 highlights the trade-off between optimizing and evaluating the gap (Duality Gap Screening) versus performing one step of Ellipsoid Screening. A key observation is that both methods start screening only after a correct iterate (i.e., one with good test accuracy) is obtained by the solver (blue curve). This underlines the fact that screening methods are of most practical use when computing a regularization path, or when the computing budget is less constrained (e.g., tracking or anomaly detection), which is the subject of the next paragraph.
Epochs  |  20                 |  30
        |  MNIST    SVHN      |  MNIST    SVHN
        |  89 / 89  87 / 87   |  89 / 89  87 / 87
        |  95 / 95  11 / 47   |  95 / 95  91 / 91
        |  16 / 84  0 / 0     |  98 / 98  90 / 92
        |  0 / 0    0 / 0     |  34 / 50  0 / 0
Epochs  |  10       |  20
        |  7 / 84   |  85 / 85
        |  80 / 80  |  80 / 80
        |  68 / 68  |  68 / 68
Computational gains
As demonstrated in Figure 5, computational gains can indeed be obtained in a regularization path setting (MNIST features, Squared Hinge Loss and L2 penalty). Each point of both curves represents an estimator fitted for a given lambda against the corresponding cost (in epochs). Each estimator is initialized with the solution to the previous parameter lambda. On the orange curve, the previous solution is also used to initialize a screening. In this case, the estimator is fit on the remaining samples which further accelerates the path computation.
5.2 Dataset Compression
We now consider the problem of dataset compression, where the goal is to maintain a good accuracy while using fewer examples from a dataset. This section should be seen as a proof of concept. A natural scheme consists in choosing the samples that have a higher margin since those will carry more information than samples that are easy to fit. In this setting, our screening algorithm can be used for compression by using the scores of the screening test as a way of ranking the samples. In our experiments, and for a given model, we progressively delete data points according to their score in the screening test for this model, before fitting the model on the remaining subsets. We compare those methods to random deletions in the dataset and to a margin computed on early approximations of the solution when the loss admits a flat area.
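The ranking idea can be sketched for regression as follows, assuming for simplicity that the test region is a Euclidean ball of center `z` and radius `r`, and that the score of a point is the worst-case absolute residual over that ball. The function name, the choice of a ball and the `keep_fraction` argument are hypothetical.

```python
import numpy as np

def keep_hardest(A, b, z, r, keep_fraction):
    """Indices of the points with the largest screening scores.

    Points with a small score are the easiest to fit over the whole region,
    so they are the first ones to be deleted.
    """
    scores = np.abs(A @ z - b) + r * np.linalg.norm(A, axis=1)
    n_keep = int(np.ceil(keep_fraction * len(b)))
    order = np.argsort(-scores, kind="stable")   # hardest points first
    return np.sort(order[:n_keep])
```

A model can then be refit on `A[idx], b[idx]` for decreasing values of `keep_fraction` and compared with a random subset of the same size.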
Lasso regression.
The Lasso objective combines an $\ell_2$ loss with an $\ell_1$ penalty. Since its dual is not sparse, we will instead apply the safe rules offered by the screening-friendly regression loss (9) derived in Section 4.3 and illustrated in Figure 2, combined with an penalty. We can draw an interesting parallel with the SVM, which is naturally sparse in data points. At the optimum, the solution of the SVM can be expressed in terms of data points (the so-called support vectors) that are close to the classification boundary, that is, the points that are the most difficult to classify. Our screening rule yields the analog for regression: the points that are easy to predict, i.e., that are close to the regression curve, are less informative than the points that are harder to predict. In our experiments on synthetic data (), this does consistently better than random subsampling, as can be seen in Figure 6.
Classification.
Our compression scheme is also valid for classification as can be seen in Figure 7.
Acknowledgments
Julien Mairal and Grégoire Mialon were supported by the ERC grant number 714381 (SOLARIS project) and by ANR 3IA MIAI@Grenoble Alpes. AA is at CNRS & département d’informatique, École normale supérieure, UMR CNRS 8548, 45 rue d’Ulm 75005 Paris, France, INRIA and PSL Research University. The authors would like to acknowledge support from the Optimization & Machine Learning joint research initiative with the fonds AXA pour la recherche and Kamet Ventures as well as a Google focused award. Grégoire Mialon thanks Vivien Cabannes, Yana Hasson and Robin Strudel for useful discussions.
Appendix A Proofs.
A.1 Proof of Lemma 2.4
Proof.
At the optimum,
Adding the null term gives
since Fenchel-Young’s inequality states that each term is greater or equal to zero. We have a null sum of non-negative terms; hence, each one of them is equal to zero. We therefore have for each :
which corresponds to the equality case in Fenchel-Young’s relation, which is equivalent to .
A.2 Proof of Lemma 3.3
Proof.
The Lagrangian of the problem reads:
with . When maximizing in , we get:
We have since the opposite leads to a contradiction. This yields and at the optimum which gives .
Now, we have to minimize
To do that, we consider the optimality condition
which yields . If then in order to avoid a contradiction.
In summary, either hence the maximum is attained in and is equal to , or and the maximum is attained in and is equal to with and .
A.3 Proof of Example 4.3
Proof.
The Fenchel conjugate of a norm is the indicator function of the unit ball of its dual norm, the ball here. Hence the infimum convolution to solve is
(12) 
Since ,
If we consider the change of variable , we get:
The solution to this problem is exactly the proximal operator for the indicator function of the infinity ball applied to . It has a closed form
Hence,
But, for , where .
Appendix B Additional experimental results.
Experimental protocol and reproducibility.
The datasets did not require any pre-processing except MNIST and SVHN, on which exhaustive details can be found in [12]. For both regression and classification, the examples were allocated to train and test sets using scikit-learn’s train_test_split. The experiments were run three to ten times (depending on the cost of the computations) and our error bars reflect the standard deviation. For each fraction of points deleted, we fit three to five estimators on the screened dataset and the random subset before averaging the corresponding scores. The optimal parameters for the linear models were found using a simple grid search.
Accuracy of our safe logistic loss.
The accuracy of the Safe Logistic loss we build is similar to the accuracies obtained with the Squared Hinge and the Logistic losses on the datasets we use in this paper, thus making it a realistic loss function.
Dataset  MNIST  SVHN  RCV1 

Logistic +  0.997 (0.01)  0.99 (0.0003)  0.975 (1.0) 
Logistic +  0.997 (0.001)  0.99 (0.0003)  0.975 (1.0) 
Safelog +  0.996 (0.0)  0.989 (0.0)  0.974 (1e05) 
Safelog +  0.996 (0.0)  0.989 (0.0)  0.975 (1e05) 
Squared Hinge +  0.997 (0.03)  0.99 (0.03)  0.975 (1.0) 
Squared Hinge +  0.997 (0.003)  0.99 (0.003)  0.974 (1.0) 
Exemplar selection.
Here we generate respectively and redundant examples of synthetic data ( and diabetes (, , in scikit-learn) by forming convex combinations of existing data points and adding Gaussian noise with zero mean. As in ranking data points for the Lasso, we apply our screening rules to iteratively discard examples that are redundant and fit a Lasso on the remaining dataset. This method greatly outperforms random subsets, as can be seen in Figure 8.
Footnotes
 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France.
 D.I., UMR 8548, École Normale Supérieure, Paris, France.
References
[1] (2012) Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4 (1), pp. 1–106.
[2] (2012) Smoothing and first order methods: a unified framework. SIAM J. Optim. 22 (2).
[3] (1981) The ellipsoid method: a survey. Operations Research 29.
[4] (2019) Learning classifiers with Fenchel-Young losses: generalized entropies, margins, and algorithms. In International Conference on Artificial Intelligence and Statistics (AISTATS).
[5] (2004) Convex optimization. Cambridge University Press.
[6] (2012) An ellipsoid based, two-stage screening test for BPDN. European Signal Processing Conference, pp. 654–658. ISBN 9781467310680.
[7] (2010) Safe Feature Elimination for the LASSO and Sparse Supervised Learning Problems. arXiv e-prints, arXiv:1009.4219.
[8] (2009) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645.
[9] (2015) Mind the duality gap: safer rules for the Lasso. In International Conference on Machine Learning (ICML).
[10] (2001) The elements of statistical learning. Springer Series in Statistics, New York.
[11] (2013) Learning sparse penalties for change-point detection using max margin interval regression. In International Conference on Machine Learning (ICML).
[12] (2016) End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS).
[13] (1962) Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Sér. A Math.
[14] (2013) Safe screening of non-support vectors in pathwise SVM computation. In International Conference on Machine Learning (ICML).
[15] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
[16] (2016) Simultaneous Safe Screening of Features and Samples in Doubly Sparse Modeling. In International Conference on Machine Learning (ICML).
[17] (2012) Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 (2), pp. 245–266.
[18] (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288.
[19] (2014) A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems (NIPS).
[20] (2013) Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems (NIPS).
[21] (1980) Functional analysis. Berlin-Heidelberg.
[22] (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2), pp. 301–320.