# Screening Data Points in Empirical Risk Minimization via Ellipsoidal Regions and Safe Loss Functions

## Abstract

We design simple screening tests to automatically discard data samples in empirical risk minimization without losing optimization guarantees. We derive loss functions that produce dual objectives with a sparse solution. We also show how to regularize convex losses to ensure such a dual sparsity-inducing property, and propose a general method to design screening tests for classification or regression based on ellipsoidal approximations of the optimal set. In addition to producing computational gains, our approach also allows us to compress a dataset into a subset of representative points.

## 1 Introduction

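Let us consider a collection of $n$ pairs $(a_i,b_i)_{i=1,\dots,n}$, where each vector $a_i$ in $\mathbb{R}^p$ describes a data point and $b_i$ is its label. For regression, $b_i$ is real-valued, and we address the convex optimization problem

$$\min_{x\in\mathbb{R}^p,\ t\in\mathbb{R}^n}\ f(t)+\lambda R(x)\quad\text{s.t.}\quad t=Ax-b, \tag{P1}$$

where $A=[a_1,\dots,a_n]^\top$ in $\mathbb{R}^{n\times p}$ carries the feature vectors, and $b=[b_1,\dots,b_n]$ carries the labels. The function $f$ is a convex loss that measures the fit between data points and the model, and $R$ is a convex regularization function. For classification, the scalars $b_i$ are binary labels in $\{-1,+1\}$, and we consider margin-based loss functions instead of (P1), so that our problem becomes

$$\min_{x\in\mathbb{R}^p,\ t\in\mathbb{R}^n}\ f(t)+\lambda R(x)\quad\text{s.t.}\quad t=\operatorname{diag}(b)Ax. \tag{P2}$$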

The above problems cover a wide variety of formulations such as Lasso [18] and its variants [22], logistic regression, support vector machines [10], and many more.

When $R$ is the $\ell_1$-norm, the solution is encouraged to be sparse [1], which can be exploited to speed up optimization procedures.

A recent line of work has focused on screening tests that seek to automatically discard variables before running an optimization algorithm. For example, [7] derive a screening rule from Karush-Kuhn-Tucker conditions, noting that if a dual optimal variable satisfies a given inequality constraint, the corresponding primal optimal variable must be zero. Checking this condition on a set that is known to contain the optimal dual variable ensures that the corresponding primal variable can be safely removed. This prunes out irrelevant features before solving the problem. This is called a safe rule if it discards variables that are guaranteed to be useless; but it is possible to relax the “safety” of the rules [17] without losing too much accuracy in practice.

The seminal approach by [7] has led to a series of works proposing refined tests [6, 20] or dynamic rules [9] for the Lasso, where screening is performed as the optimization algorithm proceeds, significantly speeding up convergence. Other papers have proposed screening rules for sparse logistic regression [19] or other linear models.

Whereas the goal of these previous methods is to remove variables, our goal is to design screening tests for data points in order to remove observations that do not contribute to the final model. The problem is important when there is a large amount of “trivial” observations that are useless for learning. This typically occurs in tracking or anomaly detection applications, where a classical heuristic seeks to mine the data to find difficult examples [8].

A few such screening tests for data points have been proposed in the literature. Some are problem-specific (e.g. [14] for SVM), while others make strong assumptions on the objective. For instance, the most general rule of [16] for classification requires strong convexity and the ability to compute a duality gap in closed form.

The goal of our paper is to provide a more generic approach for screening data samples, both for regression and classification. Such screening tests may be designed for loss functions that induce a sparse dual solution. We describe this class of loss functions and investigate a regularization mechanism that ensures that the loss enjoys such a property.

Our contributions can be summarized as follows:

• We revisit the Ellipsoid method [3] to design screening tests for samples, when the objective is convex and its dual admits a sparse solution.

• We propose a new regularization mechanism to design regression or classification losses that induce sparsity in the dual. This allows us to recover existing loss functions and to discover new ones with sparsity-inducing properties in the dual.

• We extend our screening rules, originally designed for linear models, to kernel methods. Unlike the existing literature, our method also works for non-strongly convex objectives.

• We demonstrate the benefits of our screening rules in various numerical experiments on large-scale classification problems and regression.

## 2 Preliminaries

We now present the key concepts used in our paper.

### 2.1 Fenchel Conjugacy

###### Definition 2.1 (Fenchel conjugate).

Let $f:\mathbb{R}^p\to\mathbb{R}\cup\{-\infty,+\infty\}$ be an extended real-valued function. The Fenchel conjugate of $f$ is defined by

$$f^*(y)=\max_{t\in\mathbb{R}^p}\ \langle t,y\rangle-f(t).$$

The biconjugate of $f$ is naturally the conjugate of $f^*$ and is denoted by $f^{**}$. The Fenchel-Moreau theorem [?] states that if $f$ is proper, lower semi-continuous and convex, then it is equal to its biconjugate $f^{**}$. Finally, Fenchel-Young’s inequality gives for all pairs $(t,y)$

$$f(t)+f^*(y)\ge\langle t,y\rangle,$$

with equality if and only if $y\in\partial f(t)$.

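Suppose now that for such a function $f$, we add a convex term $\mu\Omega$, with $\mu>0$, to $f^*$ in the definition of the biconjugate. We get a modified biconjugate $f_\mu$, written

$$f_\mu(t)=\max_{y\in\mathbb{R}^p}\ \langle y,t\rangle-f^*(y)-\mu\Omega(y)=\max_{y\in\mathbb{R}^p}\ \min_{z\in\mathbb{R}^p}\ \langle y,t-z\rangle+f(z)-\mu\Omega(y),$$

where the second equality uses $-f^*(y)=\min_z\{f(z)-\langle z,y\rangle\}$. The inner objective function is continuous, concave in $y$ and convex in $z$, so that we can switch min and max according to Von Neumann’s minimax theorem to get

$$f_\mu(t)=\min_{z\in\mathbb{R}^p}\ f(z)+\max_{y\in\mathbb{R}^p}\{\langle t-z,y\rangle-\mu\Omega(y)\}=\min_{z\in\mathbb{R}^p}\ f(z)+\mu\Omega^*\!\left(\frac{t-z}{\mu}\right).$$

###### Definition 2.2 (Infimum convolution).

$f_\mu$ is called the infimum convolution of $f$ and $\mu\Omega^*(\cdot/\mu)$, which may be written as $f\,\square\,\mu\Omega^*(\cdot/\mu)$.

Note that $f_\mu$ is convex as the partial minimum over $z$ of a function that is jointly convex in $(t,z)$. We recover the Moreau-Yosida smoothing [13, 21] and its generalization when $\Omega$ is respectively a quadratic term or a strongly convex term [?].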

### 2.2 Empirical Risk Minimization and Duality

Let us consider the convex ERM problem

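$$\min_{x\in\mathbb{R}^p}\ P(x)=\frac{1}{n}\sum_{i=1}^n f_i(a_i^\top x)+\lambda R(x), \tag{1}$$

which covers both (P1) and (P2) by using the appropriate definition of the functions $f_i$. We consider the dual problem (obtained from Lagrange duality)

$$\max_{\nu\in\mathbb{R}^n}\ D(\nu)=\frac{1}{n}\sum_{i=1}^n -f_i^*(\nu_i)-\lambda R^*\!\left(-\frac{A^\top\nu}{\lambda n}\right). \tag{2}$$

We always have $P(x)\ge D(\nu)$. Since there exists a pair $(x,t)$ satisfying the linear constraint (Slater’s conditions), we have $P(x^\star)=D(\nu^\star)$ and $-\frac{A^\top\nu^\star}{\lambda n}\in\partial R(x^\star)$ at the optimum.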

### 2.3 Safe Loss Functions and Sparsity in the Dual of ERM Formulations

A key feature of our losses is to encourage sparsity of dual solutions, which typically emerges from loss functions with a flat region. We call such functions “safe losses” since they will allow us to design safe screening tests.

###### Definition 2.3 (Safe loss function).

Let $\phi:\mathbb{R}\to\mathbb{R}$ be a continuous convex loss function such that $\inf_{t\in\mathbb{R}}\phi(t)=0$. We say that $\phi$ is a safe loss if there exists a non-singleton and non-empty interval $I\subset\mathbb{R}$ such that

$$t\in I\implies\phi(t)=0.$$

###### Lemma 2.4 (Safe loss and dual sparsity).

Consider the problem (1) where $R$ is a convex penalty. Denoting by $x^\star$ and $\nu^\star$ the optimal primal and dual variables respectively, we have for all $i=1,\dots,n$,

$$\nu_i^\star\in\partial f_i(a_i^\top x^\star).$$

A consequence of this lemma is that for both classification and regression, the sparsity of the dual solution is related to loss functions that have “flat” regions, that is, regions where $0\in\partial f_i(t)$. This is the case for the safe loss functions defined above. Note that the relation between flat losses and sparse dual solutions is classical, see [?, 4].

## 3 Safe rules for screening data points

In this section, we derive screening rules in the spirit of SAFE [7] to select data points in regression or classification problems with safe losses.

### 3.1 Principle of SAFE Rules for Data Points

We recall that our goal is to safely delete data points prior to optimization, that is, we want to train the model on a subset of the original dataset while still getting the same optimal solution as a model trained on the whole dataset. This amounts to identifying beforehand which dual variables are zero at the optimum. Indeed, as discussed in Section 2.2, the optimal primal variable $x^\star$ only relies on non-zero entries of $\nu^\star$. To that effect, we make the following assumption:

###### Assumption 3.1 (Safe loss assumption).

We consider problem (1), where each $f_i$ is a safe loss function. Specifically, we assume that $f_i(a_i^\top x)=\phi(a_i^\top x-b_i)$ for regression, or $f_i(a_i^\top x)=\phi(b_ia_i^\top x)$ for classification, where $\phi$ satisfies Definition 2.3 on some interval $I$. For simplicity, we assume that there exists $\mu>0$ such that $I=[-\mu,\mu]$ for regression losses and $I=[\mu,+\infty)$ for classification, which covers most useful cases.

We may now state the basic safe rule for screening.

###### Lemma 3.2 (SAFE rule).

Under Assumption 3.1, consider a subset $\mathcal{X}$ containing the optimal solution $x^\star$. If, for a given data point $(a_i,b_i)$, $a_i^\top x-b_i\in\mathring{I}$ for all $x$ in $\mathcal{X}$ (resp. $b_ia_i^\top x\in\mathring{I}$), where $\mathring{I}$ is the interior of $I$, then this data point can be discarded from the dataset.

###### Proof.

From the definition of safe loss functions, $f_i$ is differentiable at $a_i^\top x^\star$ with $f_i'(a_i^\top x^\star)=0$, so that $\nu_i^\star=0$ by Lemma 2.4.

We see now how the safe screening rule can be interpreted in terms of discrepancy between the model prediction $a_i^\top x$ and the true label $b_i$. If, for a set $\mathcal{X}$ containing the optimal solution $x^\star$ and a given data point $(a_i,b_i)$, the residual $a_i^\top x-b_i$ (resp. the margin $b_ia_i^\top x$) always lies in $\mathring{I}$, then the data point can be discarded from the dataset. The data point screening procedure therefore consists in maximizing the linear forms $a_i^\top x-b_i$ and $-a_i^\top x+b_i$ in regression (resp. minimizing $b_ia_i^\top x$ in classification) over a set $\mathcal{X}$ containing $x^\star$, and checking whether they are lower (resp. greater) than the threshold $\mu$. The smaller $\mathcal{X}$, the lower the maximum (resp. the higher the minimum), hence the more data points we can hope to safely delete. Finding a good test region $\mathcal{X}$ is critical, however. We show how to do this in the next section.
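As a minimal sketch of the rule above, assume the test region is a Euclidean ball of center `z` and radius `r` (a special case of the ellipsoids used later) and a regression loss that is flat when the residual lies in $(-\mu,\mu)$. Over such a ball, the maximum of $a_i^\top x-b_i$ is $a_i^\top z-b_i+r\|a_i\|_2$. The function name `screen_regression_ball` and the toy data are hypothetical and only serve the illustration.

```python
import numpy as np

def screen_regression_ball(A, b, z, r, mu):
    """Boolean mask of data points that can be discarded.

    Assumption: the optimum x* lies in the ball {x : ||x - z||_2 <= r}
    and the loss is flat when the residual a_i^T x - b_i is in (-mu, mu).
    """
    residual_at_center = A @ z - b           # a_i^T z - b_i
    slack = r * np.linalg.norm(A, axis=1)    # largest deviation of a_i^T x over the ball
    upper = residual_at_center + slack       # max of a_i^T x - b_i over the ball
    lower = residual_at_center - slack       # min of a_i^T x - b_i over the ball
    return (upper < mu) & (lower > -mu)

# Toy data (hypothetical): the residuals at the center z are all zero.
A = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
z = np.array([1.0, 1.0])
b = A @ z
mask = screen_regression_ball(A, b, z, r=0.1, mu=0.3)
```

Points with a large norm $\|a_i\|_2$ are harder to discard, since their prediction varies more over the test region.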

### 3.2 Building the Test Region X

Screening rules aim at sparing computing resources, so testing a data point should be easy. As in [7] for screening variables, if $\mathcal{X}$ is an ellipsoid, the optimization problem detailed above admits a closed-form solution. Furthermore, it is possible to get a smaller set by adding a first-order optimality condition with a subgradient $g$ of the objective evaluated at the center $z$ of this ellipsoid: by convexity, $g^\top(x^\star-z)\le 0$. This linear constraint cuts the final ellipsoid roughly in half, thus reducing its volume.

###### Lemma 3.3 (Closed-form screening test).

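Consider the optimization problem

$$\begin{array}{ll}\text{maximize} & a_i^\top x-b_i\\ \text{subject to} & (x-z)^\top E^{-1}(x-z)\le 1\\ & g^\top(x-z)\le 0\end{array} \tag{3}$$

in the variable $x$ in $\mathbb{R}^p$, with $E$ defining an ellipsoid with center $z$ and $g$ in $\mathbb{R}^p$. Then the maximum is

$$\begin{cases} a_i^\top z+(a_i^\top Ea_i)^{1/2}-b_i & \text{if } g^\top Ea_i<0,\\[2pt] a_i^\top\left(z+\frac{1}{2\gamma}E(a_i-\nu g)\right)-b_i & \text{otherwise,}\end{cases}$$

with $\nu=\frac{g^\top Ea_i}{g^\top Eg}$ and $\gamma=\frac{1}{2}\left((a_i-\nu g)^\top E(a_i-\nu g)\right)^{1/2}$.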
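The closed form of Lemma 3.3 can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code: the function names are hypothetical, $E$ is stored as a dense matrix for simplicity, and the degenerate case where $a_i$ is a non-negative multiple of $g$ (so that $\gamma=0$) is handled separately.

```python
import numpy as np

def max_over_half_ellipsoid(a, b_i, z, E, g):
    """Maximum of a^T x - b_i over the set
    {x : (x - z)^T E^{-1} (x - z) <= 1,  g^T (x - z) <= 0}."""
    Ea = E @ a
    gEa = g @ Ea
    if gEa < 0:
        # Cut inactive: the maximizer over the full ellipsoid is feasible.
        return a @ z + np.sqrt(a @ Ea) - b_i
    nu = gEa / (g @ E @ g)
    d = a - nu * g
    gamma = 0.5 * np.sqrt(max(d @ E @ d, 0.0))
    if gamma == 0.0:
        # a is a non-negative multiple of g: the best feasible value is at the center.
        return a @ z - b_i
    return a @ (z + E @ d / (2.0 * gamma)) - b_i

def can_discard_regression(a, b_i, z, E, g, mu):
    """Regression test: both a^T x - b_i and -(a^T x - b_i) stay below mu."""
    upper = max_over_half_ellipsoid(a, b_i, z, E, g)
    lower = max_over_half_ellipsoid(-a, -b_i, z, E, g)
    return bool(upper < mu and lower < mu)

# Unit disk cut by x_1 <= 0: the maximum of x_1 + x_2 is reached at (0, 1).
E, z, g = np.eye(2), np.zeros(2), np.array([1.0, 0.0])
val = max_over_half_ellipsoid(np.array([1.0, 1.0]), 0.0, z, E, g)
```

On this toy example the cut is active, and the maximizer $(0,1)$ lies on both the ellipsoid boundary and the cutting hyperplane.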

The proof can be found in Appendix A.2 and it is easy to modify it for minimizing $b_ia_i^\top x$. We can obtain both $E$ and $z$ by using a few steps of the ellipsoid method [?, 3]. The method starts from an initial ellipsoid containing the solution $x^\star$ to a given convex problem. It iteratively computes a subgradient at the center of the current ellipsoid, selects the half-ellipsoid containing $x^\star$, and computes the ellipsoid with minimal volume containing this half-ellipsoid, before starting all over again. Such a method, presented in Algorithm 1, performs closed-form updates of the ellipsoid.
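The iteration described above can be sketched with the standard central-cut update of the ellipsoid method, as found in textbook presentations (see [3]); this is an illustration and not necessarily the exact form of Algorithm 1. The function name `ellipsoid_step` and the toy quadratic objective are assumptions made for the example, and the update requires dimension $p\ge 2$.

```python
import numpy as np

def ellipsoid_step(z, E, g):
    """One central-cut update: minimum-volume ellipsoid containing
    {x : (x - z)^T E^{-1} (x - z) <= 1,  g^T (x - z) <= 0}.  Requires p >= 2."""
    p = z.shape[0]
    Eg = E @ g
    Eg_normalized = Eg / np.sqrt(g @ Eg)      # E g / sqrt(g^T E g)
    z_new = z - Eg_normalized / (p + 1)
    E_new = (p**2 / (p**2 - 1.0)) * (
        E - (2.0 / (p + 1)) * np.outer(Eg_normalized, Eg_normalized)
    )
    return z_new, E_new

# Toy objective (hypothetical): F(x) = 0.5 * ||x - c||^2, minimized at c.
c = np.array([1.0, -2.0])
z, E = np.zeros(2), 25.0 * np.eye(2)          # initial ball of radius 5 contains c
volumes, contains = [np.linalg.det(E)], []
for _ in range(20):
    g = z - c                                  # gradient of F at the center
    z, E = ellipsoid_step(z, E, g)
    diff = c - z
    contains.append(bool(diff @ np.linalg.solve(E, diff) <= 1.0 + 1e-9))
    volumes.append(np.linalg.det(E))
```

Each step keeps the minimizer inside the ellipsoid while shrinking $\det E$, which is what makes the resulting $(z,E)$ usable as a test region.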

Note that the ellipsoid update formula was also used to screen primal variables for the Lasso problem [6], although not iterating over ellipsoids in order to get smaller volumes.

#### Initialization.

The algorithm requires an initial ellipsoid that contains the solution. This is typically achieved by defining the center as an approximate solution of the problem, which can be obtained in various ways. For instance, one may run a few steps of a solver on the whole dataset, or one may consider the solution obtained previously for a different regularization parameter when computing a regularization path, or the solution obtained for slightly different data, e.g., for tracking applications where an optimization problem has to be solved at every time step $t$, with slight modifications from time $t-1$.

Once the center $z$ is defined, there are many cases where the initial ellipsoid can be safely assumed to be a sphere. For instance, if the objective, let us call it $F$, is $\kappa$-strongly convex, we have the basic inequality $\frac{\kappa}{2}\|z-x^\star\|_2^2\le F(z)-F(x^\star)$, which can often be upper-bounded by several quantities, e.g., a duality gap [16] or simply $F(z)$ if $F$ is non-negative as in typical ERM problems. Otherwise, other strategies can be used depending on the problem at hand, as done for the Lasso by [7, 9] for example.
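As a worked instance of this initialization, assume that $F$ is $\kappa$-strongly convex and non-negative. The symbols $r_0$ and $E_0$ are notation introduced here for the initial radius and the initial ellipsoid matrix; they are not taken from the text.

```latex
\frac{\kappa}{2}\,\|z-x^\star\|_2^2 \;\le\; F(z)-F(x^\star) \;\le\; F(z)
\quad\Longrightarrow\quad
\|z-x^\star\|_2 \;\le\; r_0 := \sqrt{\frac{2F(z)}{\kappa}},
\qquad
E_0 = r_0^2\, I .
```

With this choice, the set $\{x : (x-z)^\top E_0^{-1}(x-z)\le 1\}$ is the ball of radius $r_0$ around $z$, which contains $x^\star$.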

#### Efficient implementation.

Since each update of the ellipsoid matrix is rank one, it is possible to parametrize $E_k$ at step $k$ as

$$E_k=s_kI-L_kD_kL_k^\top,$$

with $I$ the identity matrix, $L_k$ in $\mathbb{R}^{p\times k}$ and $D_k$ in $\mathbb{R}^{k\times k}$ a diagonal matrix. Hence, we only have to update $s_k$, $L_k$ and $D_k$ while the algorithm proceeds.

#### Complexity of our screening rules.

For each step of Algorithm 1, we compute a subgradient $g$ in $O(np)$ operations. The ellipsoids are modified using rank-one updates that can be stored. As a consequence, the computations at this stage are dominated by the computation of $Eg$, which is $O(pk)$. As a result, $k$ steps cost $O(k^2p+npk)$.

Once we have the test set $\mathcal{X}$, we have to compute the closed forms from Lemma 3.3 for each data point. This computation is dominated by the matrix-vector multiplications with $E$, which cost $O(pk)$ using the structure of $E$. Hence, testing the whole dataset costs $O(npk)$. Since we typically have $n\gg k$, the cost of the overall screening procedure is therefore $O(npk)$.

In contrast, solving the ERM problem without screening would cost $O(npT)$, where $T$ is the number of passes over the data, with $T\gg k$. With screening, the complexity becomes $O(spT+npk)$, where $s$ is the number of data points accepted by the screening procedure.

### 3.3 Extension to Kernel Methods

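It is relatively easy to adapt our safe rules to kernel methods. Consider for example (P1), where $A$ has been replaced by $\phi(A)=[\phi(a_1),\dots,\phi(a_n)]^\top$ in $\mathcal{H}^n$, with $\mathcal{H}$ a RKHS and $\phi$ its mapping function $\mathbb{R}^p\to\mathcal{H}$. The prediction function $x:\mathbb{R}^p\to\mathbb{R}$ lives in the RKHS, thus it can be written $x(a)=\langle x,\phi(a)\rangle$, $a\in\mathbb{R}^p$. In the setting of ERM, the representer theorem ensures $x(a)=\sum_{i=1}^n\alpha_iK(a_i,a)$ with $\alpha\in\mathbb{R}^n$ and $K$ the kernel associated to $\mathcal{H}$. The problem becomes:

$$\min_{\alpha\in\mathbb{R}^n,\ t\in\mathbb{R}^n}\ f(t)+\lambda\sum_{i,j=1}^n\alpha_i\alpha_jK(a_i,a_j)\quad\text{s.t.}\quad t=K\alpha-b, \tag{4}$$

with $K$ the Gram matrix. The constraint is linear in $\alpha$ (thus fitting the setting of Lemma 4.1) while yielding non-linear prediction functions. The screening test becomes maximizing the linear forms $[K]_i\alpha-b_i$ and $-[K]_i\alpha+b_i$, where $[K]_i$ is the $i$-th row of $K$, over an ellipsoid $\mathcal{X}$ containing $\alpha^\star$. When the problem is convex (it depends on $K$), $\mathcal{X}$ can still be found using the ellipsoid method.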

We now have an algorithm for selecting data points in regression or classification problems with linear or kernel models. As detailed above, the rules require a sparse dual, which is not the case in general except in particular instances such as support vector machines. We now explain how to induce sparsity in the dual.

## 4 Constructing safe losses

In this section, we introduce a way to induce sparsity in the dual of empirical risk minimization problems.

### 4.1 Inducing Sparsity in the Dual of ERM

When the ERM problem does not admit a sparse dual solution, safe screening is not possible. To fix this issue, consider the ERM problem (P1) and replace $f$ by $f_\mu$ defined in Section 2:

$$\min_{x\in\mathbb{R}^p,\ t\in\mathbb{R}^n}\ f_\mu(t)+\lambda R(x)\quad\text{s.t.}\quad t=Ax-b. \tag{P$'$1}$$

We have the following result connecting the dual of (P1) with that of (P′1).

###### Lemma 4.1 (Regularized dual for regression).

The dual of (P′1) is

$$\max_{\nu\in\mathbb{R}^n}\ -\langle b,\nu\rangle-f^*(\nu)-\lambda R^*\!\left(-\frac{A^\top\nu}{\lambda}\right)-\mu\Omega(\nu), \tag{5}$$

and the dual of (P1) is obtained by setting $\mu=0$.

Before we prove this lemma, we remark that it is possible, in many cases, to induce sparsity in the dual if $\Omega$ is the $\ell_1$-norm, or another sparsity-inducing penalty. This is notably true if the unregularized dual is smooth with bounded gradients. In such a case, it is possible to show that the optimal dual solution would be $\nu^\star=0$ as soon as $\mu$ is large enough [1].

###### Proof.

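We can write (P′1) as

$$\begin{array}{ll}\text{minimize} & \tilde f(\tilde x)+\lambda\tilde R(\tilde x)\\ \text{subject to} & \tilde A\tilde x=-b\end{array} \tag{6}$$

in the variable $\tilde x=(t,x)\in\mathbb{R}^{n+p}$ with $\tilde f(\tilde x)=f_\mu(t)$, $\tilde R(\tilde x)=R(x)$ and $\tilde A=(\mathrm{Id},-A)$. Since the constraints are linear, we can directly express the dual of this problem in terms of the Fenchel conjugate of the objective (see e.g. [5], 5.1.6). Let us note $f_0=\tilde f+\lambda\tilde R$. For all $y=(y_1,y_2)\in\mathbb{R}^{n+p}$, we have

$$\begin{aligned} f_0^*(y)&=\sup_{x\in\mathbb{R}^{n+p}}\ \langle x,y\rangle-\tilde f(x)-\lambda\tilde R(x)\\ &=\sup_{x_1\in\mathbb{R}^n,\ x_2\in\mathbb{R}^p}\ \langle x_1,y_1\rangle+\langle x_2,y_2\rangle-f_\mu(x_1)-\lambda R(x_2)\\ &=f_\mu^*(y_1)+\lambda R^*\!\left(\frac{y_2}{\lambda}\right).\end{aligned}$$

Recall from Section 2.1 (see also [2]) that $f_\mu=f\,\square\,\omega$ with $\omega=\mu\Omega^*(\cdot/\mu)$. Clearly, $\omega=(\mu\Omega)^*$. If $\Omega$ is proper, convex and lower semicontinuous, then $\omega^*=(\mu\Omega)^{**}=\mu\Omega$. As a consequence, since the conjugate of an infimum convolution is the sum of the conjugates, $f_\mu^*=f^*+\omega^*=f^*+\mu\Omega$, hence

$$f_0^*(y)=f^*(y_1)+\lambda R^*\!\left(\frac{y_2}{\lambda}\right)+\mu\Omega(y_1).$$

Now we can form the dual of (P′1) by writing

$$\text{maximize}\quad -\langle -b,\nu\rangle-f_0^*(-\tilde A^\top\nu) \tag{7}$$

in the variable $\nu\in\mathbb{R}^n$. Since $-\tilde A^\top\nu=(-\nu,A^\top\nu)$ with $\nu$ the dual variable associated to the equality constraints,

$$f_0^*(-\tilde A^\top\nu)=f^*(-\nu)+\lambda R^*\!\left(\frac{A^\top\nu}{\lambda}\right)+\mu\Omega(-\nu).$$

Injecting in the problem and setting $-\nu$ instead of $\nu$ (we optimize over $\mathbb{R}^n$) concludes the proof.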

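We consider now the classification problem (P2) and show that the previous remarks about sparsity-inducing regularization for the dual of regression problems also hold in this new context.

###### Lemma 4.2 (Regularized dual for classification).

Consider now the modified classification problem

$$\min_{x\in\mathbb{R}^p,\ t\in\mathbb{R}^n}\ f_\mu(t)+\lambda R(x)\quad\text{s.t.}\quad t=\operatorname{diag}(b)Ax. \tag{P$'$2}$$

The dual of (P′2) is

$$\max_{\nu\in\mathbb{R}^n}\ -f^*(-\nu)-\lambda R^*\!\left(\frac{A^\top\operatorname{diag}(b)\nu}{\lambda}\right)-\mu\Omega(-\nu). \tag{8}$$

###### Proof.

We proceed as above with a linear constraint $\tilde A=(\mathrm{Id},-\operatorname{diag}(b)A)$ and a zero right-hand side.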

Note that the formula directly provides the dual of regression and classification ERM problems with a linear model such as the Lasso and SVM.

### 4.2 Link Between the Original and Regularized Problems

###### Lemma 4.3 (Smoothness of fμ).

If $\Omega$ is strongly convex, then $f_\mu$ is smooth.

###### Proof.

The lemma follows directly from the fact that $f_\mu=(f^*+\mu\Omega)^*$ (see the proof of Lemma 4.1). The conjugate of a closed, proper, strongly convex function is indeed smooth.

###### Lemma 4.4 (Bounding fμ).

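If $\mu\ge 0$ and $\Omega$ is a norm then

$$f(t)-\delta(t)\le f_\mu(t)\le f(t),\quad\text{for all } t\in\operatorname{dom}f,$$

with $\delta(t)=\max_{\|u/\mu\|_*\le 1}g^\top u$ and $g\in\partial f(t)$.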

###### Proof.

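If $\Omega$ is a norm, then $\Omega(0)=0$ and $\Omega^*$ is the indicator function of the unit ball of the dual norm of $\Omega$, hence non-negative with $\Omega^*(0)=0$. Moreover, if $\mu\ge 0$ then, for all $z\in\mathbb{R}^n$ and $t\in\operatorname{dom}f$,

$$f_\mu(t)\le f(z)+\mu\Omega^*\!\left(\frac{t-z}{\mu}\right).$$

In particular, we can take $z=t$, hence the right-hand inequality. On the other hand,

$$f_\mu(t)-f(t)=\min_z\ f(z)+\mu I_{\left\|\frac{z-t}{\mu}\right\|_*\le 1}-f(t)=\min_{\left\|\frac{u}{\mu}\right\|_*\le 1}\ f(t+u)-f(t).$$

Since $f$ is convex,

$$f(t+u)-f(t)\ge g^\top u\quad\text{with } g\in\partial f(t).$$

As a consequence,

$$f_\mu(t)-f(t)\ge\min_{\left\|\frac{u}{\mu}\right\|_*\le 1}g^\top u=-\delta(t),$$

where the last equality follows from the symmetry of the dual-norm ball.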

###### Corollary 4.5 (Bounding the value of P1).

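Let us denote the optimum objectives of (P1), (P′1) by $P_\lambda$, $P_{\lambda,\mu}$. If $\Omega$ is a norm, we have the following inequalities:

$$P_\lambda-\delta^*\le P_{\lambda,\mu}\le P_\lambda,$$

with $\delta^*$ the value of $\delta$ at the optimum of $P_\lambda$.

###### Proof.

The result follows directly from the inequalities in Lemma 4.4.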

### 4.3 Effect of Regularization and Examples

We start by recalling that the infimum convolution is traditionally used for smoothing an objective when $\Omega$ is strongly convex, and then we discuss the use of sparsity-inducing regularization in the dual.

#### Euclidean distance to a closed convex set.

It is known that convolving the indicator function $I_C$ of a closed convex set $C$ with a quadratic term (the Fenchel conjugate of the quadratic term $\frac{1}{2}\|\cdot\|_2^2$ is itself) yields the squared Euclidean distance to $C$, up to scaling:

$$f_\mu(t)=\min_{z\in\mathbb{R}^n}\ I_C(z)+\frac{1}{2\mu}\|t-z\|_2^2=\min_{z\in C}\ \frac{1}{2\mu}\|t-z\|_2^2.$$

#### Huber loss.

The $\ell_1$-loss is more robust to outliers than the $\ell_2$-loss, but it is not differentiable at zero, which may induce difficulties during the optimization. A natural solution consists in smoothing it: [2] for example show that applying the Moreau-Yosida smoothing, i.e. convolving $|t|$ with a quadratic term $\frac{1}{2}t^2$, yields the well-known Huber loss, which is both smooth and robust:

$$f_\mu(t)=\begin{cases}\dfrac{t^2}{2\mu} & \text{if } |t|\le\mu,\\[6pt] |t|-\dfrac{\mu}{2} & \text{otherwise.}\end{cases}$$
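The Huber formula can be checked numerically against the definition of the infimum convolution. The sketch below brute-forces $\min_z |z|+\frac{(t-z)^2}{2\mu}$ over a fine grid and compares it with the closed form; the grid sizes and the value of $\mu$ are arbitrary choices made for the illustration.

```python
import numpy as np

def huber(t, mu):
    """Closed-form Huber loss with threshold mu."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= mu, t**2 / (2 * mu), np.abs(t) - mu / 2)

def inf_conv_abs_quadratic(t, mu, grid):
    """Brute-force min over z in `grid` of |z| + (t - z)^2 / (2 mu)."""
    t = np.asarray(t, dtype=float)
    values = np.abs(grid)[None, :] + (t[:, None] - grid[None, :]) ** 2 / (2 * mu)
    return values.min(axis=1)

mu = 0.5
t = np.linspace(-3.0, 3.0, 61)
grid = np.linspace(-4.0, 4.0, 8001)   # candidate values of z, spacing 1e-3
gap = np.max(np.abs(inf_conv_abs_quadratic(t, mu, grid) - huber(t, mu)))
```

The remaining `gap` only reflects the grid discretization, which is consistent with the two expressions defining the same function.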

Now, we present examples where $\Omega$ has a sparsity-inducing effect.

#### Hinge loss.

Instead of a quadratic loss, choose a robust loss $f:t\mapsto\frac{1}{2}\|1-t\|_1$. By using the function $\Omega(x)=\|x\|_1+1_{x\le 0}$, whose conjugate is $\Omega^*(y)=1_{y\ge -1}$, we obtain the classical hinge loss of support vector machines

$$f_\mu(t)=\sum_{i=1}^n\frac{1}{2}\,[1-t_i-\mu]_+,$$

with $[\cdot]_+=\max(0,\cdot)$. We see that the effect of convolving with the constraint $1_{x\le 0}$ is to turn a regression loss (e.g., square loss) into a classification loss. The effect of the $\ell_1$-norm is to encourage the loss to be flat (when $\mu$ grows, $[1-t_i-\mu]_+$ is equal to zero for a larger range of values $t_i$), which corresponds to the sparsity-inducing effect in the dual that we will exploit for screening data points.

#### Squared hinge loss.

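Let us consider a problem with a quadratic loss $f:t\mapsto\|1-t\|_2^2$ designed for a classification problem, and consider $\Omega(x)=\|x\|_1+1_{x\le 0}$. We have $\Omega^*(y)=1_{y\ge -1}$, and

$$f_\mu(t)=\sum_{i=1}^n[1-t_i-\mu]_+^2,$$

which is a squared Hinge Loss with a threshold parameter $\mu$ and $[\cdot]_+=\max(0,\cdot)$.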
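Both hinge-type formulas can be sanity-checked numerically under stated assumptions: with $\Omega(x)=\|x\|_1+1_{x\le 0}$, the conjugate is the indicator of $\{y\ge -1\}$, so in one dimension the infimum convolution reduces to $\min_{z\le t+\mu}f(z)$. The sketch below brute-forces this constrained minimum on a grid for the assumed losses $f(z)=\frac{1}{2}|1-z|$ and $f(z)=(1-z)^2$; all function names are hypothetical.

```python
import numpy as np

def hinge_mu(t, mu):
    """Shifted hinge loss 0.5 * [1 - t - mu]_+ (one coordinate)."""
    return 0.5 * np.maximum(1.0 - t - mu, 0.0)

def squared_hinge_mu(t, mu):
    """Shifted squared hinge loss [1 - t - mu]_+^2 (one coordinate)."""
    return np.maximum(1.0 - t - mu, 0.0) ** 2

def constrained_min(f, t, mu, grid):
    """Brute-force min of f(z) over grid points z <= t + mu."""
    feasible = grid[grid <= t + mu + 1e-12]
    return f(feasible).min()

mu = 0.25
grid = np.linspace(-5.0, 5.0, 10001)              # spacing 1e-3
ts = [-2.0, -1.0, 0.0, 0.5, 0.75, 1.0, 2.0]       # chosen so that t + mu lies on the grid
err_hinge = max(
    abs(constrained_min(lambda z: 0.5 * np.abs(1 - z), t, mu, grid) - hinge_mu(t, mu))
    for t in ts
)
err_squared = max(
    abs(constrained_min(lambda z: (1 - z) ** 2, t, mu, grid) - squared_hinge_mu(t, mu))
    for t in ts
)
```

For every $t\ge 1-\mu$ the constrained minimum is zero, which is the flat region that makes screening possible.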

#### Screening-friendly regression.

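Consider now the quadratic loss $f:t\mapsto\frac{1}{2}\|t\|_2^2$ and $\Omega(x)=\|x\|_1$. Then $\Omega^*(y)=1_{\|y\|_\infty\le 1}$ (see e.g. [1]), and

$$f_\mu(t)=\sum_{i=1}^n\frac{1}{2}\,[|t_i|-\mu]_+^2. \tag{9}$$

A proof can be found in Appendix A. As before, the parameter $\mu$ encourages the loss to be flat (it is exactly $0$ when $\|t\|_\infty\le\mu$).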
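To make the link with dual sparsity concrete, the sketch below evaluates the loss (9) and its derivative with respect to each residual $t_i$. By Lemma 2.4 the optimal dual variable is a (sub)gradient of the loss at the optimal residual, so residuals inside the flat region $[-\mu,\mu]$ give zero entries. The function names and the toy residual vector are hypothetical.

```python
import numpy as np

def safe_quadratic_loss(t, mu):
    """Loss (9): sum_i 0.5 * [|t_i| - mu]_+^2."""
    return 0.5 * np.sum(np.maximum(np.abs(t) - mu, 0.0) ** 2)

def safe_quadratic_grad(t, mu):
    """Derivative of (9) w.r.t. each t_i: sign(t_i) * [|t_i| - mu]_+."""
    return np.sign(t) * np.maximum(np.abs(t) - mu, 0.0)

t = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])    # toy residuals a_i^T x - b_i
mu = 0.5
loss = safe_quadratic_loss(t, mu)
grad = safe_quadratic_grad(t, mu)
n_zero = int(np.sum(grad == 0.0))             # entries lying in the flat region
```

Here three of the five residuals fall in $[-\mu,\mu]$ and contribute neither to the loss nor to its gradient.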

#### Screening-friendly logistic regression.

Let us now consider the logistic loss $f(t)=\log(1+e^{-t})$, which we define only with one dimension for simplicity here. It is easy to show that the infimum convolution with the $\ell_1$-norm does not induce any sparsity in the dual, because the dual of the logistic loss has unbounded gradients, making classical sparsity-inducing penalties ineffective. However, we may consider instead another penalty $\Omega$, combining an entropic term with an $\ell_1$ penalty, to fix this issue. Convolving $\Omega^*$ with $f$ yields

$$f_\mu(x)=\begin{cases}e^{x+\mu-1}-(x+\mu) & \text{if } x+\mu-1\le 0,\\ 0 & \text{otherwise.}\end{cases} \tag{10}$$

Note that this loss is asymptotically robust. Moreover, the entropic part of $\Omega$ makes this penalty strongly convex, hence $f_\mu$ is smooth [?]. Finally, the $\ell_1$ penalty ensures that the dual is sparse, thus making the screening usable. Our regularization mechanism thus builds a smooth, robust classification loss akin to the logistic loss on which we can use screening rules. The effect of the regularization parameter in a few of the previous cases is illustrated in Figure 2.
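A minimal sketch of the loss (10), taking the displayed formula as given: writing $s=\min(x+\mu-1,0)$, the two branches collapse into $e^{s}-s-1$, which avoids evaluating the exponential for large margins. The function name `safe_logistic` is hypothetical.

```python
import numpy as np

def safe_logistic(x, mu):
    """Loss (10): exp(x + mu - 1) - (x + mu) if x + mu - 1 <= 0, else 0."""
    s = np.minimum(np.asarray(x, dtype=float) + mu - 1.0, 0.0)
    return np.exp(s) - s - 1.0

mu = 0.5
margins = np.linspace(-5.0, 5.0, 1001)
values = safe_logistic(margins, mu)
flat = values[margins >= 1.0 - mu]    # margins in the flat region of the loss
```

The loss is zero for margins above $1-\mu$, non-increasing, and grows roughly linearly for very negative margins.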

In summary, regularizing the dual with the $\ell_1$ norm induces a flat region in the loss, which induces sparsity in the dual. The geometry is preserved elsewhere.

## 5 Experiments

We now present experimental results demonstrating the effectiveness of the data screening procedure.

#### Datasets.

We consider three real datasets, SVHN, MNIST, RCV-1, and a synthetic one. MNIST and SVHN both represent digits, which we encode by using the output of a two-layer convolutional kernel network [12]. RCV-1 represents sparse TF-IDF vectors of categorized newswire stories. For classification, we consider a binary problem consisting of discriminating digit 9 for MNIST vs. all other digits (resp. digit 1 vs. rest for SVHN, 1st category vs. rest for RCV-1). For regression, we also consider a synthetic dataset, where data is generated by a noisy linear model $b=Ax+\varepsilon$, where $x$ is a random, sparse ground truth, $A$ a random data matrix and $\varepsilon$ a noise term. Implementation details are provided in the appendix. We fit usual models using scikit-learn [15].

### 5.1 Safe Screening

Here, we consider problems that naturally admit a sparse dual solution, which allows safe screening.

#### Interval regression.

We first illustrate the practical use of the screening-friendly regression loss (9) derived above. It corresponds to a particular case of a supervised learning task called interval regression [11], which is widely used in fields such as economics. In interval regression, one does not have scalar labels but intervals $S_i$ containing the true labels $\tilde b_i$, which are unknown. The loss is written

$$\ell(x)=\sum_{i=1}^n\ \inf_{b_i\in S_i}\ (a_i^\top x-b_i)^2, \tag{11}$$

where $S_i$ contains the true label $\tilde b_i$. For a given data point, the model only needs to predict a value inside the interval in order not to be penalized. When the intervals $S_i$ have the same width $2\mu$ and we are given their centers $b_i$, (11) is exactly (9) up to a constant factor. Since we proved (9) to yield a sparse dual, we can apply our rules to safely discard intervals that are assured to be matched by the optimal solution. We use a regularization penalty along with the loss. As an illustration, the experiment was done using a toy synthetic dataset, the signal to recover being generated by one feature only. The intervals can be visualized in Figure 3. The “difficult” intervals (red) were kept in the training set. The predictions hardly fit these intervals. The “easy” intervals (blue) were discarded from the training set: the safe rules certify that the optimal solution will fit these intervals. Our screening algorithm was run for 20 iterations of the Ellipsoid method. Most of the intervals can be ruled out afterwards, while the remaining intervals yield the same optimal solution as a model trained on all the intervals.

#### Classification.

Common sample screening methods such as [16] require a strongly convex objective. When this is not the case, there is, to the best of our knowledge, no baseline. Nevertheless, when considering classification using the non-strongly convex safe logistic loss derived in Section 4 along with an $\ell_1$ penalty, our algorithm is still able to screen samples, as shown in Table 1. The algorithm is initialized using an approximate solution to the problem, and the radius of the initial ball is chosen depending on the number of epochs, a choice that is valid in practice.

As established in Lemma 2.4, the hinge loss and squared hinge loss allow for safe screening. Combined with an $\ell_2$ penalty, the resulting ERM problem is strongly convex. We can therefore compare our Ellipsoid algorithm to the baseline introduced by [16], where the safe region is a ball centered at the current iterate of the solution and whose radius is determined by a duality gap of the ERM problem. Both methods are initialized by running the default solver of scikit-learn for a certain number of epochs. The resulting approximate solution and duality gap are subsequently fed into our algorithm for initialization. Then, we perform one more epoch of the duality gap screening algorithm on the one hand, and the corresponding number of ellipsoid steps computed on a subset of the dataset on the other hand, so as to get a fair comparison in terms of data access. The results can be seen in Table 2. While being more general (our approach is neither restricted to classification, nor requires strong convexity), our method performs similarly to the baseline. Figure 4 highlights the trade-off between optimizing and evaluating the gap (Duality Gap Screening) versus performing one step of Ellipsoid Screening. The key observation here is that both methods start screening only after a correct iterate (i.e. one with good test accuracy) is obtained by the solver (blue curve). This underlines the fact that screening methods are mostly of practical use when computing a regularization path, or when the computing budget is less constrained (e.g. tracking or anomaly detection), which is the object of the next paragraph.

#### Computational gains

As demonstrated in Figure 5, computational gains can indeed be obtained in a regularization path setting (MNIST features, Squared Hinge Loss and L2 penalty). Each point of both curves represents an estimator fitted for a given lambda against the corresponding cost (in epochs). Each estimator is initialized with the solution to the previous parameter lambda. On the orange curve, the previous solution is also used to initialize a screening. In this case, the estimator is fit on the remaining samples which further accelerates the path computation.

### 5.2 Dataset Compression

We now consider the problem of dataset compression, where the goal is to maintain a good accuracy while using fewer examples from a dataset. This section should be seen as a proof of concept. A natural scheme consists in choosing the samples that have a higher margin since those will carry more information than samples that are easy to fit. In this setting, our screening algorithm can be used for compression by using the scores of the screening test as a way of ranking the samples. In our experiments, and for a given model, we progressively delete data points according to their score in the screening test for this model, before fitting the model on the remaining subsets. We compare those methods to random deletions in the dataset and to a margin computed on early approximations of the solution when the loss admits a flat area.

#### Lasso regression.

The Lasso objective combines an $\ell_2$ loss with an $\ell_1$ penalty. Since its dual is not sparse, we will instead apply the safe rules offered by the screening-friendly regression loss (9) derived in Section 4.3 and illustrated in Figure 2, combined with an $\ell_1$ penalty. We can draw an interesting parallel with the SVM, which is naturally sparse in data points. At the optimum, the solution of the SVM can be expressed in terms of data points (the so-called support vectors) that are close to the classification boundary, that is, the points that are the most difficult to classify. Our screening rule yields the analog for regression: the points that are easy to predict, i.e. that are close to the regression curve, are less informative than the points that are harder to predict. In our experiments on synthetic data, this does consistently better than random subsampling, as can be seen in Figure 6.

#### Classification.

Our compression scheme is also valid for classification as can be seen in Figure 7.

## Acknowledgments

Julien Mairal and Grégoire Mialon were supported by the ERC grant number 714381 (SOLARIS project) and by ANR 3IA MIAI@Grenoble Alpes. AA is at CNRS & département d’informatique, École normale supérieure, UMR CNRS 8548, 45 rue d’Ulm 75005 Paris, France, INRIA and PSL Research University. The authors would like to acknowledge support from the Optimization & Machine Learning joint research initiative with the fonds AXA pour la recherche and Kamet Ventures as well as a Google focused award. Grégoire Mialon thanks Vivien Cabannes, Yana Hasson and Robin Strudel for useful discussions.

## Appendix A Proofs.

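### A.1 Proof of Lemma 2.4

###### Proof.

At the optimum,

$$P(x^\star)-D(\nu^\star)=\frac{1}{n}\sum_{i=1}^n\left(f_i(a_i^\top x^\star)+f_i^*(\nu_i^\star)\right)+\lambda R(x^\star)+\lambda R^*\!\left(-\frac{A^\top\nu^\star}{\lambda n}\right)=0.$$

Adding and subtracting $\frac{1}{n}\sum_{i=1}^n a_i^\top x^\star\,\nu_i^\star=\frac{1}{n}\langle x^\star,A^\top\nu^\star\rangle$, we get

$$\frac{1}{n}\sum_{i=1}^n\underbrace{\left(f_i(a_i^\top x^\star)+f_i^*(\nu_i^\star)-a_i^\top x^\star\,\nu_i^\star\right)}_{\ge 0}+\lambda\underbrace{\left(R(x^\star)+R^*\!\left(-\frac{A^\top\nu^\star}{\lambda n}\right)-\left\langle x^\star,-\frac{A^\top\nu^\star}{\lambda n}\right\rangle\right)}_{\ge 0}=0,$$

since Fenchel-Young’s inequality states that each term is greater than or equal to zero. We have a null sum of non-negative terms; hence, each one of them is equal to zero. We therefore have for each $i=1,\dots,n$:

$$f_i(a_i^\top x^\star)+f_i^*(\nu_i^\star)=a_i^\top x^\star\,\nu_i^\star,$$

which corresponds to the equality case in Fenchel-Young’s relation, which is equivalent to $\nu_i^\star\in\partial f_i(a_i^\top x^\star)$.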

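### A.2 Proof of Lemma 3.3

###### Proof.

The Lagrangian of the problem writes:

$$L(x,\nu,\gamma)=a_i^\top x-b_i+\nu\left(1-(x-z)^\top E^{-1}(x-z)\right)-\gamma\,g^\top(x-z),$$

with $\nu,\gamma\ge 0$. In this proof, $\nu$ is the multiplier of the ellipsoid constraint and $\gamma$ that of the linear cut, so the two symbols play swapped roles compared with the statement of Lemma 3.3. When maximizing in $x$, we get:

$$\frac{\partial L}{\partial x}=a_i+2\nu\left(E^{-1}z-E^{-1}x\right)-\gamma g=0.$$

We have $\nu>0$ since the opposite leads to a contradiction. This yields $x=z+\frac{1}{2\nu}(Ea_i-\gamma Eg)$ and $(x-z)^\top E^{-1}(x-z)=1$ at the optimum, which gives $\nu=\frac{1}{2}\sqrt{(a_i-\gamma g)^\top E(a_i-\gamma g)}$.

Now, we have to minimize

$$d(\nu,\gamma)=a_i^\top\left(z+\frac{1}{2\nu}(Ea_i-\gamma Eg)\right)-b_i-\gamma\,g^\top\left(\frac{1}{2\nu}(Ea_i-\gamma Eg)\right).$$

To do that, we consider the optimality condition

$$\frac{\partial d}{\partial\gamma}=-\frac{1}{2\nu}a_i^\top Eg-\frac{1}{2\nu}g^\top Ea_i+\frac{\gamma}{\nu}g^\top Eg=0,$$

which yields $\gamma=\frac{g^\top Ea_i}{g^\top Eg}$. If $g^\top Ea_i<0$ then $\gamma=0$ in order to avoid a contradiction with $\gamma\ge 0$.

In summary, either $g^\top Ea_i\le 0$, hence $\gamma=0$, and the maximum is attained at $x=z+\frac{Ea_i}{(a_i^\top Ea_i)^{1/2}}$ and is equal to $a_i^\top z+(a_i^\top Ea_i)^{1/2}-b_i$, or $g^\top Ea_i>0$ and the maximum is attained at $x=z+\frac{1}{2\nu}E(a_i-\gamma g)$ and is equal to $a_i^\top\left(z+\frac{1}{2\nu}E(a_i-\gamma g)\right)-b_i$ with $\gamma=\frac{g^\top Ea_i}{g^\top Eg}$ and $\nu=\frac{1}{2}\sqrt{(a_i-\gamma g)^\top E(a_i-\gamma g)}$.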

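### A.3 Proof of Example 4.3

###### Proof.

The Fenchel conjugate of a norm is the indicator function of the unit ball of its dual norm, the $\ell_\infty$ ball here. Hence the infimum convolution to solve is

$$f_\mu(x)=\min_{z\in\mathbb{R}^n}\ \left\{f(z)+1_{\|x-z\|_\infty\le\mu}\right\}. \tag{12}$$

Since $f(x)=\frac{1}{2n}\|x\|_2^2$,

$$f_\mu(x)=\min_{z\in\mathbb{R}^n}\ \frac{1}{2n}z^\top z+1_{\|x-z\|_\infty\le\mu}.$$

If we consider the change of variable $t=x-z$, we get:

$$f_\mu(x)=\min_{t\in\mathbb{R}^n}\ \frac{1}{2n}\|x-t\|_2^2+1_{\|t\|_\infty\le\mu}.$$

The solution to this problem is exactly the proximal operator for the indicator function of the infinity ball applied to $x$. It has a closed form

$$\begin{aligned} t^*&=\operatorname{prox}_{1_{\|\cdot\|_\infty\le\mu}}(x)\\ &=x-\operatorname{prox}_{(1_{\|\cdot\|_\infty\le\mu})^*}(x),\quad\text{using Moreau decomposition}\\ &=x-\operatorname{prox}_{\mu\|\cdot\|_1}(x).\end{aligned}$$

Hence,

$$f_\mu(x)=\frac{1}{2n}\|x-t^*\|_2^2=\frac{1}{2n}\left\|\operatorname{prox}_{\mu\|\cdot\|_1}(x)\right\|_2^2.$$

But, for $i=1,\dots,n$, $\left[\operatorname{prox}_{\mu\|\cdot\|_1}(x)\right]_i=\operatorname{sign}(x_i)\,[|x_i|-\mu]_+$, where $[\cdot]_+=\max(0,\cdot)$.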
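The Moreau decomposition step used above can be checked numerically: projecting onto the $\ell_\infty$ ball of radius $\mu$ and subtracting the result from $x$ should give the soft-thresholding operator. The sketch below uses hypothetical function names and an arbitrary test vector.

```python
import numpy as np

def prox_l1(x, mu):
    """Soft-thresholding: proximal operator of mu * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def project_linf_ball(x, mu):
    """Projection onto {t : ||t||_inf <= mu}, i.e. the prox of its indicator."""
    return np.clip(x, -mu, mu)

x = np.array([-3.0, -0.2, 0.0, 0.7, 2.5])
mu = 0.5
n = x.shape[0]
t_star = project_linf_ball(x, mu)
residual = x - t_star                                # should equal prox_l1(x, mu)
f_mu = np.sum(residual**2) / (2 * n)                 # value of the infimum convolution
closed_form = np.sum(np.maximum(np.abs(x) - mu, 0.0) ** 2) / (2 * n)
```

Both `residual` and `prox_l1(x, mu)` vanish on the coordinates with $|x_i|\le\mu$, which is where the loss is flat.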

## Appendix B Additional experimental results.

#### Experimental protocol and reproducibility.

The datasets did not require any pre-processing except MNIST and SVHN, for which exhaustive details can be found in [12]. For both regression and classification, the examples were allocated to train and test sets using scikit-learn’s `train_test_split`. The experiments were run three to ten times (depending on the cost of the computations) and our error bars reflect the standard deviation. For each fraction of points deleted, we fit three to five estimators on the screened dataset and the random subset before averaging the corresponding scores. The optimal parameters for the linear models were found using a simple grid search.

#### Accuracy of our safe logistic loss.

The accuracy of the Safe Logistic loss we build is similar to the accuracies obtained with the Squared Hinge and Logistic losses on the datasets we use in this paper, which makes it a realistic loss function.

#### Exemplar selection.

Here we generate respectively and redundant examples of synthetic data ( and diabetes (, , in scikit-learn) by forming convex combinations of existing data points and adding Gaussian noise with zero mean. As when ranking data points for the Lasso, we apply our screening rules to iteratively discard redundant examples and fit a Lasso on the remaining dataset. This method greatly outperforms random subsets, as can be seen in Figure 8.
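One possible way to generate such redundant examples is sketched below in plain Python. The text only specifies convex combinations of existing points plus zero-mean Gaussian noise; everything else here is an assumption made for illustration: the function name `make_redundant_examples`, the number `k` of points mixed per example, the way the weights are drawn, the noise level, and the choice to combine labels with the same weights and to add noise to the features only.

```python
import random


def make_redundant_examples(X, y, n_new, k=3, noise_std=0.01, seed=0):
    """Create n_new examples as random convex combinations of k existing
    rows of X (labels combined with the same weights), then add zero-mean
    Gaussian noise to the features. All defaults are illustrative."""
    rng = random.Random(seed)
    p = len(X[0])
    X_new, y_new = [], []
    for _ in range(n_new):
        idx = rng.sample(range(len(X)), k)
        # Normalized exponential draws give nonnegative weights summing to 1.
        raw = [rng.expovariate(1.0) for _ in idx]
        total = sum(raw)
        w = [r / total for r in raw]
        x = [
            sum(wj * X[j][c] for wj, j in zip(w, idx)) + rng.gauss(0.0, noise_std)
            for c in range(p)
        ]
        X_new.append(x)
        y_new.append(sum(wj * y[j] for wj, j in zip(w, idx)))
    return X_new, y_new


# Tiny hand-made dataset (hypothetical values).
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [-1.0, 0.5]]
y = [1.0, 0.0, 3.0, -0.5]
X_red, y_red = make_redundant_examples(X, y, n_new=5)
print(len(X_red), len(y_red))
```

With `noise_std=0`, every generated point lies in the convex hull of the original data, which is what makes it redundant.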

### Footnotes

1. Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France.
2. D.I., UMR 8548, École Normale Supérieure, Paris, France.

### References

1. F. Bach, R. Jenatton, J. Mairal and G. Obozinski (2012) Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4 (1), pp. 1–106.
2. A. Beck and M. Teboulle (2012) Smoothing and first order methods: a unified framework. SIAM J. Optim. 22 (2).
3. R. G. Bland, D. Goldfarb and M. J. Todd (1981) The ellipsoid method: a survey. Operations Research 29.
4. M. Blondel, A. F. T. Martins and V. Niculae (2019) Learning classifiers with Fenchel-Young losses: generalized entropies, margins, and algorithms. In International Conference on Artificial Intelligence and Statistics (AISTATS).
5. S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge University Press.
6. L. Dai and K. Pelckmans (2012) An ellipsoid based, two-stage screening test for BPDN. European Signal Processing Conference, pp. 654–658. ISBN 978-1-4673-1068-0.
7. L. El Ghaoui, V. Viallon and T. Rabbani (2010) Safe Feature Elimination for the LASSO and Sparse Supervised Learning Problems. arXiv e-prints, arXiv:1009.4219.
8. P. F. Felzenszwalb, R. B. Girshick, D. McAllester and D. Ramanan (2009) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645.
9. O. Fercoq, A. Gramfort and J. Salmon (2015) Mind the duality gap: safer rules for the Lasso. In International Conference on Machine Learning (ICML).
10. J. Friedman, T. Hastie and R. Tibshirani (2001) The elements of statistical learning. Springer Series in Statistics, New York.
11. T. Hocking, G. Rigaill, J. Vert and F. Bach (2013) Learning sparse penalties for change-point detection using max margin interval regression. In International Conference on Machine Learning (ICML).
12. J. Mairal (2016) End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS).
13. J. Moreau (1962) Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Sér. A Math.
14. K. Ogawa, Y. Suzuki and I. Takeuchi (2013) Safe screening of non-support vectors in pathwise SVM computation. In International Conference on Machine Learning (ICML).
15. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
16. A. Shibagaki, M. Karasuyama, K. Hatano and I. Takeuchi (2016) Simultaneous Safe Screening of Features and Samples in Doubly Sparse Modeling. In International Conference on Machine Learning (ICML).
17. R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor and R. J. Tibshirani (2012) Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 (2), pp. 245–266.
18. R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288.
19. J. Wang, J. Zhou, J. Liu, P. Wonka and J. Ye (2014) A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems (NIPS).
20. J. Wang, J. Zhou, P. Wonka and J. Ye (2013) Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems (NIPS).
21. K. Yosida (1980) Functional analysis. Berlin-Heidelberg.
22. H. Zou and T. Hastie (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2), pp. 301–320.