Screening Data Points in Empirical Risk Minimization

Abstract

We design simple screening tests to automatically discard data samples in empirical risk minimization without losing optimization guarantees. We derive loss functions that produce dual objectives with a sparse solution. We also show how to regularize convex losses to ensure such a dual sparsity-inducing property, and propose a general method to design screening tests for classification or regression based on ellipsoidal approximations of the optimal set. In addition to producing computational gains, our approach also allows us to compress a dataset into a subset of representative points.
1 Introduction
Let us consider a collection of pairs $(x_i, y_i)_{i=1,\ldots,n}$, where each vector $x_i$ in $\mathbb{R}^d$ describes a data point and $y_i$ is its label. For regression, $y_i$ is real-valued, and we address the convex optimization problem

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} f_i(x_i^\top w) + \lambda R(w), \qquad (1)$$

where $X$ in $\mathbb{R}^{n \times d}$ carries the feature vectors $x_i^\top$ as its rows, and $y$ in $\mathbb{R}^n$ carries the labels. Each function $f_i$ is a convex loss that measures the fit between the $i$-th data point and the model, and $R$ is a convex regularization function. For classification, the scalars $y_i$ are binary labels in $\{-1, +1\}$, and we consider margin-based loss functions $f_i(y_i x_i^\top w)$ instead of $f_i(x_i^\top w)$ in (1), so that our problem becomes

$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} f_i(y_i x_i^\top w) + \lambda R(w). \qquad (2)$$
The above problems cover a wide variety of formulations such as Lasso [20] and its variants [24], logistic regression, support vector machines [10], and many more.
When $R$ is the $\ell_1$-norm, the solution is encouraged to be sparse [1], which can be exploited to speed up optimization procedures.
A recent line of work has focused on screening tests that seek to automatically discard variables before running an optimization algorithm. For example, [7] derive a screening rule from the Karush-Kuhn-Tucker conditions, noting that if a dual optimal variable satisfies a given inequality constraint, the corresponding primal optimal variable must be zero. Checking this condition on a set that is known to contain the optimal dual variable ensures that the corresponding primal variable can be safely removed. This prunes out irrelevant features before solving the problem. Such a rule is called safe if it discards only variables that are guaranteed to be useless; it is also possible to relax the "safety" of the rules [19] without losing much accuracy in practice.
The seminal approach by [7] has led to a series of works proposing refined tests [6, 22] or dynamic rules [9] for the Lasso, where screening is performed as the optimization algorithm proceeds, significantly speeding up convergence. Other papers have proposed screening rules for sparse logistic regression [21] or other linear models.
Whereas the goal of these previous methods is to remove variables, our goal is to design screening tests for data points in order to remove observations that do not contribute to the final model. The problem is important when there is a large amount of “trivial” observations that are useless for learning. This typically occurs in tracking or anomaly detection applications, where a classical heuristic seeks to mine the data to find difficult examples [8].
A few such screening tests for data points have been proposed in the literature. Some are problem-specific (e.g., [16] for SVMs), while others make strong assumptions on the objective. For instance, the most general rule of [18] for classification requires strong convexity and the ability to compute a duality gap in closed form.
The goal of our paper is to provide a more generic approach for screening data samples, both for regression and classification. Such screening tests may be designed for loss functions that induce a sparse dual solution. We describe this class of loss functions and investigate a regularization mechanism that ensures that the loss enjoys such a property. Our contributions can be summarized as follows:

- We revisit the Ellipsoid method [3] to design screening tests for samples, when the objective is convex and its dual admits a sparse solution.
- We propose a new regularization mechanism to design regression or classification losses that induce sparsity in the dual. This allows us to recover existing loss functions and to discover new ones with sparsity-inducing properties in the dual.
- While our screening rules are first derived for linear models, we extend them to kernel methods. Unlike the existing literature, our method also works for objectives that are not strongly convex.
- We demonstrate the benefits of our screening rules in various numerical experiments on large-scale classification and regression problems.²
2 Preliminaries
We now present the key concepts used in our paper.
2.1 Fenchel Conjugacy
Definition 2.1 (Fenchel conjugate).
Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be an extended real-valued function. The Fenchel conjugate of $f$ is defined by
$$f^*(z) = \sup_{u \in \mathbb{R}^d} \; \langle z, u \rangle - f(u).$$
The biconjugate of $f$ is naturally the conjugate of $f^*$ and is denoted by $f^{**}$. The Fenchel-Moreau theorem [?] states that if $f$ is proper, lower semicontinuous and convex, then it is equal to its biconjugate $f^{**}$. Finally, Fenchel-Young's inequality gives, for all pairs $(u, z)$,
$$f(u) + f^*(z) \geq \langle z, u \rangle,$$
with equality if and only if $z \in \partial f(u)$.
Suppose now that, for such a function $f$, we add a convex term $\Omega(z)$ to $f^*(z)$ in the definition of the biconjugate. We get a modified biconjugate $f_{\Omega}$, written
$$f_{\Omega}(u) = \max_{z} \; \langle z, u \rangle - f^*(z) - \Omega(z) = \max_{z} \min_{v} \; \langle z, u - v \rangle + f(v) - \Omega(z).$$
The inner objective function is continuous, concave in $z$ and convex in $v$, so that we can switch min and max according to Von Neumann's minimax theorem to get
$$f_{\Omega}(u) = \min_{v} \; f(v) + \Omega^*(u - v).$$
Definition 2.2 (Infimum convolution).
$f_{\Omega}$ is called the infimum convolution of $f$ and $\Omega^*$, which may be written as $f \,\square\, \Omega^*$.
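As a numerical illustration (our own sketch; the helper `inf_conv` and the brute-force grid minimization are not from the paper), the infimum convolution of two quadratics $\frac{a}{2}u^2$ and $\frac{b}{2}u^2$ is the quadratic $\frac{1}{2}\frac{ab}{a+b}u^2$, which a grid search recovers:

```python
import numpy as np

def inf_conv(f, g, x, grid):
    # (f inf-conv g)(x) = inf_u { f(u) + g(x - u) }, approximated on a grid
    return min(f(u) + g(x - u) for u in grid)

# infimum convolution of (a/2)u^2 and (b/2)u^2 is (c/2)x^2 with c = ab/(a+b)
a, b = 1.0, 3.0
f = lambda u: 0.5 * a * u ** 2
g = lambda u: 0.5 * b * u ** 2
grid = np.linspace(-10.0, 10.0, 20001)  # step 1e-3
x = 2.0
approx = inf_conv(f, g, x, grid)
exact = 0.5 * (a * b / (a + b)) * x ** 2
assert abs(approx - exact) < 1e-4
```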
2.2 Empirical Risk Minimization and Duality
2.3 Safe Loss Functions and Sparsity in the Dual of ERM Formulations
A key feature of our losses is to encourage sparsity of dual solutions, which typically emerge from loss functions with a flat region. We call such functions “safe losses” since they will allow us to design safe screening tests.
Definition 2.3 (Safe loss function).
Let $f : \mathbb{R} \to \mathbb{R}$ be a continuous convex loss function such that $\inf_{u \in \mathbb{R}} f(u) = 0$. We say that $f$ is a safe loss if there exists a non-singleton and non-empty interval $I \subset \mathbb{R}$ such that $f(u) = 0$ for all $u$ in $I$.
Lemma 2.4 (Dual sparsity).
Consider the problem (1), where $R$ is a convex penalty. Denoting by $w^\star$ and $\alpha^\star$ the optimal primal and dual variables respectively, we have, for all $i$, that $\alpha_i^\star$ belongs to the subdifferential $\partial f_i(x_i^\top w^\star)$.
The proof can be found in Appendix A.
Remark 2.5 (Safe loss and dual sparsity).
A consequence of this lemma is that, for both classification and regression, the sparsity of the dual solution is related to loss functions with "flat" regions, that is, intervals on which the loss is constant and its subdifferential contains zero. This is the case for the safe loss functions defined above.
Note that the relation between flat losses and sparse dual solutions is classical, see [?, 4].
3 Safe Rules for Screening Data Points
In this section, we derive screening rules in the spirit of SAFE [7] to select data points in regression or classification problems with safe losses.
3.1 Principle of SAFE Rules for Data Points
We recall that our goal is to safely delete data points prior to optimization, that is, we want to train the model on a subset of the original dataset while still getting the same optimal solution as a model trained on the whole dataset. This amounts to identifying beforehand which dual variables are zero at the optimum. Indeed, as discussed in Section 2.2, the optimal primal variable only relies on nonzero entries of . To that effect, we make the following assumption:
Assumption 3.1 (Safe loss assumption).
We consider problem (1), where each $f_i$ is a safe loss function. Specifically, we assume that $f_i(u) = \tilde{f}(u - y_i)$ for regression, or $f_i(u) = \tilde{f}(y_i u)$ for classification, where $\tilde{f}$ satisfies Definition 2.3 on some interval $I$. For simplicity, we assume that there exists $\mu > 0$ such that $I = [-\mu, \mu]$ for regression losses and $I = [\mu, +\infty)$ for classification, which covers most useful cases.
We may now state the basic safe rule for screening.
Lemma 3.2 (SAFE rule).
Under Assumption 3.1, consider a subset $W \subset \mathbb{R}^d$ containing the optimal solution $w^\star$. If, for a given data point $(x_i, y_i)$, we have $x_i^\top w - y_i \in \mathring{I}$ for all $w$ in $W$ (resp. $y_i x_i^\top w \in \mathring{I}$ for classification), where $\mathring{I}$ is the interior of $I$, then this data point can be discarded from the dataset.
Proof.
From the definition of safe loss functions, $f_i$ is constant in a neighborhood of the prediction; it is therefore differentiable there with a zero derivative, and Lemma 2.4 yields $\alpha_i^\star = 0$.
We see now how the safe screening rule can be interpreted in terms of discrepancy between the model prediction $x_i^\top w$ and the true label $y_i$. If, for a set $W$ containing the optimal solution and a given data point, the prediction always lies in $\mathring{I}$, then the data point can be discarded from the dataset. The data point screening procedure therefore consists in maximizing linear forms in $w$ over a set $W$ containing $w^\star$ in regression (resp. minimizing them in classification), and checking whether the resulting values are lower (resp. greater) than the threshold $\mu$. The smaller $W$, the lower the maximum (resp. the higher the minimum), hence the more data points we can hope to safely delete. Finding a good test region is therefore critical. We show how to do this in the next section.
3.2 Building the Test Region
Screening rules aim at sparing computing resources, so testing a data point should be cheap. As in [7] for screening variables, if $W$ is an ellipsoid, the optimization problem detailed above admits a closed-form solution. Furthermore, it is possible to get a smaller set $W$ by adding a first-order optimality condition with a subgradient $g$ of the objective evaluated at the center of this ellipsoid. This linear constraint cuts the final ellipsoid roughly in half, thus reducing its volume.
Lemma 3.3 (Closedform screening test).
Consider the optimization problem

maximize $\; x^\top w \;$ subject to $\; (w - z)^\top A^{-1} (w - z) \leq 1, \;\; g^\top (w - z) \leq 0$,   (3)

in the variable $w$ in $\mathbb{R}^d$, with $A$ defining an ellipsoid with center $z$ and $g$ in $\mathbb{R}^d$. Then the maximum is
with and .
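Without the extra linear cut, the maximization of a linear form over the ellipsoid alone already has a simple closed form, $\max_{w \in E} x^\top w = x^\top z + \sqrt{x^\top A x}$ for $E = \{z + A^{1/2}u : \|u\| \leq 1\}$. A small NumPy sketch (variable names are ours) checks it against sampled points:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)           # positive definite, defines the ellipsoid
z = rng.standard_normal(d)        # center
x = rng.standard_normal(d)        # direction of the linear form

closed = x @ z + np.sqrt(x @ A @ x)   # closed-form maximum of <x, w> over E

L = np.linalg.cholesky(A)             # E = { z + L u : ||u|| <= 1 }
# every sampled boundary point stays below the closed-form value...
samples = rng.standard_normal((2000, d))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
vals = (z + samples @ L.T) @ x
assert vals.max() <= closed + 1e-9
# ...and the analytic maximizer attains it
w_star = z + L @ (L.T @ x) / np.linalg.norm(L.T @ x)
assert abs(x @ w_star - closed) < 1e-9
```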
The proof can be found in Appendix A, and it is easy to modify it for minimization instead. We can obtain both the ellipsoid and the subgradient $g$ by using a few steps of the ellipsoid method [?, 3]. This first-order optimization method starts from an initial ellipsoid containing the solution to a given convex problem (here, the solution $w^\star$ of problem (1)). It iteratively computes a subgradient at the center of the current ellipsoid, selects the half-ellipsoid containing $w^\star$, and computes the minimal-volume ellipsoid containing this half-ellipsoid before starting over. Such a method, presented in Algorithm 1, performs closed-form updates of the ellipsoid. It requires $O(d^2 \log(LR/\varepsilon))$ iterations to reach a precision $\varepsilon$ when starting from a ball of radius $R$, with $L$ the Lipschitz bound on the loss, thus making it impractical for accurately solving high-dimensional problems.
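The standard ellipsoid-method update can be sketched in a few lines of NumPy (this is the textbook update rule, not necessarily the paper's exact Algorithm 1; the oracle and variable names are ours):

```python
import numpy as np

def ellipsoid_method(subgrad, c, radius, steps):
    """Textbook ellipsoid method: E_k = { w : (w-c)' A^{-1} (w-c) <= 1 },
    starting from the ball of the given radius centered at c."""
    d = len(c)
    A = (radius ** 2) * np.eye(d)
    centers = [c.copy()]
    for _ in range(steps):
        g = subgrad(c)
        g_t = g / np.sqrt(g @ A @ g)              # normalize in the A-metric
        b = A @ g_t
        c = c - b / (d + 1)                       # center of the new ellipsoid
        A = (d ** 2 / (d ** 2 - 1.0)) * (A - (2.0 / (d + 1)) * np.outer(b, b))
        centers.append(c.copy())
    return centers

# sanity check: minimize f(w) = ||w - w_opt||^2 from a ball containing w_opt
w_opt = np.array([1.0, -2.0])
centers = ellipsoid_method(lambda w: 2.0 * (w - w_opt),
                           np.zeros(2), radius=5.0, steps=100)
best = min(float(np.sum((w - w_opt) ** 2)) for w in centers)
assert best < 1e-2
```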
Note that the ellipsoid update formula was also used to screen primal variables for the Lasso problem [6], although not iterating over ellipsoids in order to get smaller volumes.
Initialization.
The algorithm requires an initial ellipsoid that contains the solution. This is typically achieved by defining the center as an approximate solution of the problem, which can be obtained in various ways. For instance, one may run a few steps of a solver on the whole dataset, or one may consider the solution obtained previously for a different regularization parameter when computing a regularization path, or the solution obtained for slightly different data, e.g., in tracking applications where an optimization problem has to be solved at every time step $t$, with slight modifications from time $t-1$.
Once the center is defined, there are many cases where the initial ellipsoid can be safely assumed to be a sphere. For instance, if the objective, call it $F$, is $\mu$-strongly convex, we have the basic inequality $\frac{\mu}{2}\|w_0 - w^\star\|^2 \leq F(w_0) - F(w^\star)$, whose right-hand side can often be upper-bounded by several quantities, e.g., a duality gap [18], or simply by $F(w_0)$ if $F$ is nonnegative as in typical ERM problems. Otherwise, other strategies can be used depending on the problem at hand. If the problem is not strongly convex but constrained (e.g., with a norm constraint, as is common in ERM), the initialization is also natural (e.g., a sphere containing the constraint set). We will see below that one of the most successful applications of screening methods is the computation of regularization paths. Given that regularization paths for penalized and constrained problems coincide (up to minor details), computing the path for a penalized objective amounts to computing it for a constrained objective, whose ellipsoid initialization is safe as explained above. Even though we believe that those cases cover many (or most) problems of interest, it is also reasonable to believe that guessing the order of magnitude of the solution is feasible with simple heuristics, which is what we do for safe logistic regression. It is then possible to check a posteriori that screening was safe, that is, that the initial ellipsoid indeed contained the solution.
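As an illustration of the strong-convexity initialization (a toy ridge-regression sketch of ours; here $\mu$ can be lower-bounded by the $\ell_2$ parameter $\lambda$), the bound above gives a valid sphere radius around any approximate solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
lam = 1.0   # l2 parameter; the objective is at least lam-strongly convex

# ridge objective F and its exact minimizer
F = lambda w: 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w
w_star = np.linalg.solve(X.T @ X / len(y) + lam * np.eye(3), X.T @ y / len(y))

w0 = w_star + 0.1 * rng.standard_normal(3)          # an approximate solution
# (mu/2) * ||w0 - w*||^2 <= F(w0) - F(w*), so this radius is safe:
radius = np.sqrt(2.0 * (F(w0) - F(w_star)) / lam)
assert np.linalg.norm(w0 - w_star) <= radius + 1e-12
```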
Efficient implementation.
Since each update of the ellipsoid matrix is rank one, it is possible to parametrize at step as
with $I$ the identity matrix, a low-rank factor matrix in $\mathbb{R}^{d \times k}$, and a diagonal matrix in $\mathbb{R}^{k \times k}$. Hence, we only have to update the low-rank factors and the diagonal matrix as the algorithm proceeds.
Complexity of our screening rules.
For each step of Algorithm 1, we compute a subgradient of the objective in $O(nd)$ operations. The ellipsoids are modified using rank-one updates that can be stored. As a consequence, the computations at this stage are dominated by the matrix-vector product with $A$, which costs $O(d^2)$. As a result, $k$ steps cost $O(k(nd + d^2))$.
Once we have the test region $W$, we have to compute the closed forms from Lemma 3.3 for each data point. This computation is dominated by the matrix-vector multiplications with $A$, whose cost is reduced by exploiting the low-rank structure of $A$. Since we typically perform few ellipsoid steps, the cost of the overall screening procedure remains small compared to solving the ERM problem.
In contrast, solving the ERM problem without screening would cost $O(Tnd)$, where $T$ is the number of passes over the data. With screening, the complexity becomes $O(Tn'd)$, where $n' \leq n$ is the number of data points accepted by the screening procedure.
3.3 Extension to Kernel Methods
It is relatively easy to adapt our safe rules to kernel methods. Consider for example problem (2), where $x_i$ has been replaced by $\varphi(x_i)$, with $\mathcal{H}$ a RKHS and $\varphi$ its mapping function. The prediction function lives in the RKHS, so it can be written as an inner product $\langle w, \varphi(\cdot) \rangle$ with $w$ in $\mathcal{H}$. When the ERM objective is strictly increasing with respect to the RKHS norm and each sample loss, the representer theorem ensures that $w = \sum_{i=1}^{n} \alpha_i \varphi(x_i)$, with $\alpha$ in $\mathbb{R}^n$ and $K$ the kernel associated to $\mathcal{H}$. If we consider the squared RKHS norm as the regularizer, which is typically the case, the problem becomes:
(4) 
with $K$ the Gram matrix. The constraint is linear in $\alpha$ (thus satisfying the assumption of Lemma 4.1) while yielding nonlinear prediction functions. The screening test then consists in maximizing (resp. minimizing) the corresponding linear forms over an ellipsoid containing the optimal $\alpha^\star$. When the problem is convex (this depends on the loss), such an ellipsoid can still be found using the ellipsoid method.
We now have an algorithm for selecting data points in regression or classification problems with linear or kernel models. As detailed above, the rules require a sparse dual, which is not the case in general except in particular instances such as support vector machines. We now explain how to induce sparsity in the dual.
4 Constructing Safe Losses
In this section, we introduce a way to induce sparsity in the dual of empirical risk minimization problems.
4.1 Inducing Sparsity in the Dual of ERM
When the ERM problem does not admit a sparse dual solution, safe screening is not possible. To fix this issue, consider the ERM problem (1) and replace the loss $f$ by $f_{\Omega}$ defined in Section 2:
() 
We have the following result connecting the dual of the regularized problem with that of the original problem (1).
Lemma 4.1 (Regularized dual for regression).
The proof can be found in Appendix A. We remark that it is possible, in many cases, to induce sparsity in the dual when $\Omega$ is the $\ell_1$-norm or another sparsity-inducing penalty. This is notably true if the unregularized dual is smooth with bounded gradients. In such a case, it is possible to show that the optimal dual solution is zero as soon as the regularization is large enough [1].
We now consider the classification problem (2) and show that the previous remarks about sparsity-inducing regularization for the dual of regression problems also hold in this new context.
Lemma 4.2 (Regularized dual for classification).
Proof.
We proceed as above with a linear constraint and .
Note that the formula directly provides the dual of regression and classification ERM problems with a linear model such as the Lasso and SVM.
4.2 Link Between the Original and Regularized Problems
The following results should be understood as an indication that the original objective $f$ and its regularized counterpart $f_{\Omega}$ are similar.
Lemma 4.3 (Smoothness of ).
If $\Omega$ is strongly convex, then $f_{\Omega}$ is smooth.
Proof.
The proof can be found in Appendix A. When the regularization vanishes, $f_{\Omega}$ tends to $f$; hence the two objectives can be arbitrarily close.
4.3 Effect of Regularization and Examples
We start by recalling that the infimum convolution is traditionally used for smoothing an objective, when $\Omega$ is strongly convex, and then we discuss the use of sparsity-inducing regularization in the dual.
Euclidean distance to a closed convex set.
It is known that convolving the indicator function of a closed convex set with a quadratic term (the Fenchel conjugate of a quadratic term is itself) yields the Euclidean distance to that set:
Huber loss.
The $\ell_1$ loss is more robust to outliers than the quadratic loss, but it is not differentiable at zero, which may cause difficulties during optimization. A natural solution consists in smoothing it: [2], for example, show that applying the Moreau-Yosida smoothing, i.e., convolving it with a quadratic term, yields the well-known Huber loss, which is both smooth and robust:
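This can be checked numerically: a brute-force grid minimization of $|v| + \frac{1}{2}(u - v)^2$ recovers the closed-form Huber loss (a sketch of ours, with threshold $\beta = 1$):

```python
import numpy as np

def huber(u, beta=1.0):
    # closed form of the infimum convolution of beta*|.| with 0.5*(.)^2
    u = abs(u)
    return 0.5 * u ** 2 if u <= beta else beta * (u - 0.5 * beta)

grid = np.linspace(-5.0, 5.0, 10001)  # step 1e-3
for u in [-3.0, -0.4, 0.0, 0.7, 2.5]:
    envelope = min(abs(v) + 0.5 * (u - v) ** 2 for v in grid)
    assert abs(envelope - huber(u)) < 1e-6
```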
Now, we present examples where has a sparsityinducing effect.
Hinge loss.
Instead of the quadratic term used in the previous example, choose the robust $\ell_1$ term. By convolving it with the same indicator function, we obtain the classical hinge loss of support vector machines:
We see that the effect of convolving with the constraint is to turn a regression loss (e.g., the square loss) into a classification loss. The effect of the $\ell_1$ norm is to encourage the loss to be flat (when $\lambda$ grows, $f_{\Omega}$ is equal to zero for a larger range of values $u$), which corresponds to the sparsity-inducing effect in the dual that we exploit for screening data points. The squared hinge loss is presented in Appendix B.
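A quick numerical check of this construction (our sketch): the hinge loss $\max(0, 1-u)$ coincides with the $\ell_1$ distance from $u$ to the constraint set $\{v \geq 1\}$, and is flat exactly on that set:

```python
import numpy as np

def hinge(u):
    return max(0.0, 1.0 - u)

# hinge(u) is the l1 distance from u to {v >= 1}: inf_{v >= 1} |u - v|
grid = np.linspace(1.0, 10.0, 9001)  # step 1e-3
for u in [-2.0, 0.0, 0.5, 1.0, 3.0]:
    dist = min(abs(u - v) for v in grid)
    assert abs(dist - hinge(u)) < 1e-6
# flat region for u >= 1: these margins incur zero loss (sparse dual)
assert hinge(1.0) == 0.0 and hinge(3.0) == 0.0
```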
Screening-friendly regression.
Convolving the quadratic loss $\frac{1}{2}(u - y)^2$ with $\Omega = \lambda|\cdot|$ yields the screening-friendly regression loss (7), $f_{\Omega}(u) = \frac{1}{2}\max(0, |u - y| - \lambda)^2$, which is flat on the interval $[y - \lambda, y + \lambda]$ (see Appendix A.6 for the derivation).
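Assuming the per-sample form $f_{\Omega}(u) = \frac{1}{2}\max(0, |u - y| - \lambda)^2$ (the squared loss convolved with $\lambda|\cdot|$, consistent with the derivation in Appendix A.6), a quick numerical check of the infimum convolution (sketch and names ours):

```python
import numpy as np

def safe_square_loss(u, y, lam):
    # squared loss convolved with lam*|.|: flat on [y - lam, y + lam]
    return 0.5 * max(0.0, abs(u - y) - lam) ** 2

# the conjugate of lam*|.| is the indicator of [-lam, lam], so
# f_conv(u) = inf_{|t| <= lam} 0.5 * (u - t - y)^2
y, lam = 1.0, 0.5
grid = np.linspace(-lam, lam, 2001)
for u in [-1.0, 0.8, 1.2, 1.5, 3.0]:
    approx = min(0.5 * (u - t - y) ** 2 for t in grid)
    assert abs(approx - safe_square_loss(u, y, lam)) < 1e-6
```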
Screeningfriendly logistic regression.
Let us now consider the logistic loss $f(u) = \log(1 + e^{-u})$, which we define in one dimension for simplicity. It is easy to show that the infimum convolution with the $\ell_1$ norm does not induce any sparsity in the dual, because the dual of the logistic loss has unbounded gradients, making classical sparsity-inducing penalties ineffective. However, we may consider instead another penalty $\Omega$ to fix this issue. Convolving it with the logistic loss yields
(8) 
Note that this loss is asymptotically robust. Moreover, the entropic part of $\Omega$ makes this penalty strongly convex, hence $f_{\Omega}$ is smooth [?]. Finally, the penalty ensures that the dual is sparse, thus making screening usable. Our regularization mechanism thus builds a smooth, robust classification loss akin to the logistic loss, on which we can use screening rules. If the regularization is well chosen, the safe logistic loss maximizes the log-likelihood of the data for a probabilistic model that slightly differs from the sigmoid model of vanilla logistic regression. The effect of the regularization parameter in a few of the previous cases is illustrated in Figure 2.
5 Experiments
We now present experimental results demonstrating the effectiveness of the data screening procedure.
Datasets.
We consider three real datasets, SVHN, MNIST, and RCV1, as well as a synthetic one. MNIST and SVHN both represent digits, which we encode using the output of a two-layer convolutional kernel network [13]. RCV1 contains sparse TF-IDF vectors of categorized newswire stories. For classification, we consider a binary problem consisting of discriminating digit 9 vs. all other digits for MNIST (resp. digit 1 vs. rest for SVHN, 1st category vs. rest for RCV1). For regression, we also consider a synthetic dataset, where data is generated as $y = Xw_0 + \epsilon$, with $w_0$ a random, sparse ground-truth vector, $X$ a random data matrix, and $\epsilon$ a Gaussian noise vector. Implementation details are provided in the appendix. We fit the usual models using scikit-learn [17] and Cyanure [14] for large-scale datasets.
5.1 Safe Screening
Here, we consider problems that naturally admit a sparse dual solution, which allows safe screening.
Interval regression.
We first illustrate the practical use of the screeningfriendly regression loss (7) derived above. It corresponds indeed to a particular case of a supervised learning task called interval regression [12], which is widely used in fields such as economics. In interval regression, one does not have scalar labels but intervals containing the true labels , which are unknown. The loss is written
(9) 
where the interval contains the true label. For a given data point, the model only needs to predict a value inside the interval in order not to be penalized. When the intervals have the same width and we are given their centers, (9) is exactly (7). Since (7) yields a sparse dual, we can apply our rules to safely discard intervals that are guaranteed to be matched by the optimal solution. We use an $\ell_2$ penalty along with this loss. As an illustration, the experiment was run on a toy synthetic dataset in which the signal to recover is generated by one feature only. The intervals can be visualized in Figure 3. The "difficult" intervals (red) were kept in the training set; the predictions hardly fit them. The "easy" intervals (blue) were discarded from the training set: the safe rules certify that the optimal solution will fit these intervals. Our screening algorithm was run for 20 iterations of the Ellipsoid method. Most intervals can be ruled out afterwards, while the remaining ones yield the same optimal solution as a model trained on all the intervals.
Classification.
Common sample screening methods such as [18] require a strongly convex objective. When this is not the case, there is, to the best of our knowledge, no available baseline. Nevertheless, when considering classification with the non-strongly-convex safe logistic loss derived in Section 4 along with a penalty, our algorithm is still able to screen samples, as shown in Table 1. The algorithm is initialized with an approximate solution to the problem, and the radius of the initial ball is chosen depending on the number of epochs used to obtain that solution, which proved valid in practice.
Epochs              20                      30
            MNIST   SVHN   RCV1     MNIST   SVHN   RCV1
            0       0      1        0       2      12
            0.3     0.01   8        27      17     42
            35      12     45       65      54     75
Epochs              20                          30
            MNIST      SVHN         MNIST      SVHN
            89 / 89    87 / 87      89 / 89    87 / 87
            95 / 95    11 / 47      95 / 95    91 / 91
            16 / 84    0 / 0        98 / 98    90 / 92
            0 / 0      0 / 0        34 / 50    0 / 0
The squared hinge loss allows for safe screening (see Lemma 2.4). Combined with an $\ell_2$ penalty, the resulting ERM objective is strongly convex. We can therefore compare our Ellipsoid algorithm to the baseline introduced by [18], where the safe region is a ball centered at the current iterate and whose radius is derived from a duality gap of the ERM problem. Both methods are initialized by running the default solver of scikit-learn for a given number of epochs. The resulting approximate solution and duality gap are then fed into our algorithm for initialization. We subsequently perform one more epoch of the duality-gap screening algorithm on the one hand, and the corresponding number of ellipsoid steps computed on a subset of the dataset on the other hand, so as to get a fair comparison in terms of data access. The results can be seen in Table 2. While being more general (our approach is neither restricted to classification nor requires strong convexity), our method performs similarly to the baseline. Figure 4 highlights the trade-off between optimizing and evaluating the gap (Duality Gap Screening) versus performing one step of Ellipsoid Screening. Both methods start screening once a good iterate (i.e., one with good test accuracy) is obtained by the solver (blue curve), underlining the fact that screening methods are rather of practical use when computing a regularization path, or when the computing budget is less constrained (e.g., tracking or anomaly detection), which is the object of the next paragraph.
Computational gains.
As demonstrated in Figure 5, computational gains can indeed be obtained in a regularization-path setting (MNIST features, squared hinge loss and $\ell_2$ penalty). Each point of both curves represents an estimator fitted for a given regularization parameter $\lambda$, plotted against the corresponding cost (in epochs). Each estimator is initialized with the solution obtained for the previous $\lambda$. On the orange curve, the previous solution is also used to initialize a screening; in this case, the estimator is fit on the remaining samples only, which further accelerates the path computation.
5.2 Dataset Compression
We now consider the problem of dataset compression, where the goal is to maintain a good accuracy while using fewer examples from a dataset. This section should be seen as a proof of concept. A natural scheme consists in choosing the samples with a larger margin, since those carry more information than samples that are easy to fit. In this setting, our screening algorithm can be used for compression by using the scores of the screening test as a way of ranking the samples. In our experiments, and for a given model, we progressively delete data points according to their score in the screening test for this model, before fitting the model on the remaining subsets. We compare this method to random deletions and to deletions based on the sample margin computed on early approximations of the solution when the loss admits a flat area ("margin screening"). Our compression scheme is valid for classification, as can be seen in Figure 6, and for regression (see Appendix C).
Discussion.
For all methods, the degradation in performance is smaller than with random deletions. Nevertheless, in the regime where most samples are deleted, random deletions tend to do better. This is not surprising, since screening deletes the samples that are "easy" to classify: once all of these are deleted, only the difficult ones and the outliers remain, making the prediction task harder compared to random subsampling.
Acknowledgments
JM and GM were supported by the ERC grant number 714381 (SOLARIS project) and by ANR 3IA MIAI@Grenoble Alpes (ANR-19-P3IA-0003). AA would like to acknowledge support from the ML and Optimisation joint research initiative with the fonds AXA pour la recherche and Kamet Ventures, a Google focused award, as well as funding by the French government under the management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). GM thanks Vivien Cabannes, Yana Hasson and Robin Strudel for useful discussions. All the authors thank the reviewers for their useful comments.
Appendix A Proofs.
a.1 Proof of Lemma 2.4
Proof.
At the optimum,
Adding the null term gives
since Fenchel-Young's inequality states that each term is greater than or equal to zero. We have a null sum of nonnegative terms; hence, each one of them is equal to zero. We therefore have, for each $i$:
which corresponds to the equality case in FenchelYoung’s relation, which is equivalent to .
a.2 Proof of Lemma 3.3
Proof.
The Lagrangian of the problem writes:
with . When maximizing in , we get:
We have since the opposite leads to a contradiction. This yields and at the optimum which gives .
Now, we have to minimize
To do that, we consider the optimality condition
which yields . If then in order to avoid a contradiction.
In summary, either hence the maximum is attained in and is equal to , or and the maximum is attained in and is equal to with and .
a.3 Proof of Lemma 4.1
Proof.
(10) 
in the variable with and and . Since the constraints are linear, we can directly express the dual of this problem in terms of the Fenchel conjugate of the objective (see e.g. [5], 5.1.6). Let us note . For all , we have
It is known from [2] that with . Clearly, . If is proper, convex and lower semicontinuous, then . As a consequence, . If is proper, convex and lower semicontinuous, then , hence
Now we can form the dual of by writing
(11) 
in the variable . Since with the dual variable associated to the equality constraints,
Injecting in the problem and setting instead of (we optimize in ) concludes the proof.
a.4 Lemma a.1
Lemma A.1 (Bounding ).
If and is a norm then
with and .
Proof.
If is a norm, then and is the indicator function of the dual norm of hence nonnegative. Moreover, if then, and ,
In particular, we can take hence the righthand inequality. On the other hand,
Since is convex,
As a consequence,
a.5 Proof of Lemma 4.4
Proof.
The proof is trivial given the inequalities in Lemma A.1.
a.6 Proof of Screeningfriendly regression
Proof.
The Fenchel conjugate of a norm is the indicator function of the unit ball of its dual norm, the $\ell_\infty$ ball here. Hence the infimum convolution problem to solve is
(12) 
Since ,
If we consider the change of variable , we get:
The solution to this problem is exactly the proximal operator for the indicator function of the infinity ball applied to . It has a closed form
using Moreau decomposition. We therefore have
Hence,
But, for , where .
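The Moreau decomposition used here can be verified numerically: the proximal operator of the indicator of the $\ell_\infty$-ball is clipping, that of its conjugate $\lambda\|\cdot\|_1$ is soft-thresholding, and the two reconstruct the input (NumPy sketch, names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal(8)
lam = 1.0

# prox of the indicator of { ||.||_inf <= lam } is the projection, i.e. clipping
proj = np.clip(x, -lam, lam)
# prox of the conjugate lam*||.||_1 is soft-thresholding
soft = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
# Moreau decomposition: x = prox_f(x) + prox_{f*}(x)
assert np.allclose(proj + soft, x)
```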
Appendix B Additional examples.
Squared hinge loss.
Let us consider a problem with a quadratic loss designed for a classification problem, and consider . We have , and
which is a squared hinge loss with a threshold parameter.
Appendix C Additional experimental results.
Reproducibility.
The datasets did not require any preprocessing, except MNIST and SVHN, for which exhaustive details can be found in [13]. For both regression and classification, the examples were allocated to train and test sets using scikit-learn's train_test_split (with a fixed fraction of the data allocated to the train set). The experiments were run three to ten times (depending on the cost of the computations) and our error bars reflect the standard deviation. For each fraction of points deleted, we fit three to five estimators on the screened dataset and on a random subset before averaging the corresponding scores. The optimal parameters for the linear models were found using a simple grid search.
Accuracy of our safe logistic loss.
The accuracy of the safe logistic loss we build is similar to the accuracies obtained with the squared hinge and logistic losses on the datasets used in this paper, making it a realistic loss function.
Dataset             MNIST           SVHN            RCV1
Logistic +          0.997 (0.01)    0.99 (0.0003)   0.975 (1.0)
Logistic +          0.997 (0.001)   0.99 (0.0003)   0.975 (1.0)
Safelog +           0.996 (0.0)     0.989 (0.0)     0.974 (1e-05)
Safelog +           0.996 (0.0)     0.989 (0.0)     0.975 (1e-05)
Squared Hinge +     0.997 (0.03)    0.99 (0.03)     0.975 (1.0)
Squared Hinge +     0.997 (0.003)   0.99 (0.003)    0.974 (1.0)
RCV1.
Table 4 shows additional screening results on RCV1 with a penalized squared hinge loss SVM.
Epochs      10         20
            7 / 84     85 / 85
            80 / 80    80 / 80
            68 / 68    68 / 68
Lasso regression.
The Lasso objective combines a quadratic loss with an $\ell_1$ penalty. Since its dual is not sparse, we instead apply the safe rules offered by the screening-friendly regression loss (7) derived in Section 4.3 and illustrated in Figure 2, combined with an $\ell_1$ penalty. We can draw an interesting parallel with the SVM, which is naturally sparse in data points: at the optimum, the solution of the SVM can be expressed in terms of the data points (the so-called support vectors) that are close to the classification boundary, that is, the points that are the most difficult to classify. Our screening rule yields the analog for regression: the points that are easy to predict, i.e., that are close to the regression curve, are less informative than the points that are harder to predict. In our experiments on synthetic data, this does consistently better than random subsampling, as can be seen in Figure 7.
Footnotes
 Département d'informatique de l'ENS, CNRS, Inria, PSL, 75005 Paris, France. <aspremon@ens.fr>.
 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. <firstname.lastname@inria.fr>.
² Our code is available at https://github.com/GregoireMialon/screening_samples.
References
 (2012) Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4 (1), pp. 1–106. Cited by: §1, §4.1, §4.3.
 (2012) Smoothing and first order methods: a unified framework. SIAM J. Optim Vol. 22, No. 2. Cited by: §A.3, §4.3.
 (1981) The ellipsoid method: a survey. Operation Research 29. Cited by: 1st item, §3.2.
 (2019) Learning classifiers with Fenchel-Young losses: generalized entropies, margins, and algorithms. In International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §2.3.
 (2004) Convex optimization. Cambridge University Press. Cited by: §A.3.
 (2012) An ellipsoid based, two-stage screening test for BPDN. European Signal Processing Conference, pp. 654–658. External Links: ISBN 9781467310680 Cited by: §1, §3.2.
 (2010) Safe Feature Elimination for the LASSO and Sparse Supervised Learning Problems. arXiv e-prints, arXiv:1009.4219. External Links: 1009.4219 Cited by: §1, §1, §3.2, §3.
 (2009) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645. Cited by: §1.
 (2015) Mind the duality gap: safer rules for the Lasso. In International Conference on Machine Learning (ICML), Cited by: §1.
 (2001) The elements of statistical learning. Springer series in statistics New York. Cited by: §1.
 (1993) Convex analysis and minimization algorithms ii. Springer. Cited by: §4.2.
 (2013) Learning sparse penalties for changepoint detection using max margin interval regression. In International Conference on Machine Learning (ICML), Cited by: §5.1.
 (2016) End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), Cited by: Appendix C, §5.
 (2019) Cyanure: an open-source toolbox for empirical risk minimization for Python, C++, and soon more. arXiv preprint arXiv:1912.08165. Cited by: §5.
 (1962) Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Sér. A Math. Cited by: §2.1.
 (2014) Safe sample screening for support vector machines. External Links: 1401.6740 Cited by: §1.
 (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §5.
 (2016) Simultaneous Safe Screening of Features and Samples in Doubly Sparse Modeling. In International Conference on Machine Learning (ICML), Cited by: §1, §3.2, §5.1, §5.1.
 (2012) Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 (2), pp. 245–266. Cited by: §1.
 (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: §1.
 (2014) A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems (NIPS), Cited by: §1.
 (2013) Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems (NIPS), Cited by: §1.
 (1980) Functional analysis. Berlin-Heidelberg. Cited by: §2.1.
 (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2), pp. 301–320. Cited by: §1.