Fast Differentiable Sorting and Ranking

# Fast Differentiable Sorting and Ranking

## Abstract

The sorting operation is one of the most basic and commonly used building blocks in computer programming. In machine learning, it is commonly used for robust statistics. However, seen as a function, it is piecewise linear and as a result includes many kinks at which it is non-differentiable. More problematic is the related ranking operator, commonly used for order statistics and ranking metrics. It is a piecewise constant function, meaning that its derivatives are null or undefined. While numerous works have proposed differentiable proxies to sorting and ranking, they do not achieve the time complexity one would expect from sorting and ranking operations. In this paper, we propose the first differentiable sorting and ranking operators with time and space complexity. Our proposal in addition enjoys exact computation and differentiation. We achieve this feat by constructing differentiable sorting and ranking operators as projections onto the permutahedron, the convex hull of permutations, and using a reduction to isotonic optimization. Empirically, we confirm that our approach is an order of magnitude faster than existing approaches and showcase two novel applications: differentiable Spearman’s rank correlation coefficient and soft least trimmed squares.

\mdfsetup

backgroundcolor=theoremcolor, linewidth=0.5pt, \newmdtheoremenvpropositionProposition \newmdtheoremenvlemmaLemma \newmdtheoremenvcorollaryCorollary

\printAffiliationsAndNotice

## 1 Introduction

Modern deep learning architectures are built by composing parameterized functional blocks (including loops and conditionals) and are trained end-to-end using gradient backpropagation. This has motivated the term differentiable programming, recently popularized, among others, by LeCun (2018). Despite great empirical successes, many operations commonly used in computer programming remain poorly differentiable or downright pathological, limiting the set of architectures for which a gradient can be computed.

We focus in this paper on two such operations: sorting and ranking. Sorting returns the given input vector with its values re-arranged in monotonic order. It plays a key role to handle outliers in robust statistics, as in least-quantile (Rousseeuw, 1984) or trimmed (Rousseeuw & Leroy, 2005) regression. As a piecewise linear function, however, the sorted vector contains many kinks where it is non-differentiable. In addition, when used in composition with other functions, sorting often induces non-convexity, thus rendering model parameter optimization difficult.

The ranking operation, on the other hand, outputs the positions, or ranks, of the input values in the sorted vector. A workhorse of order statistics (David & Nagaraja, 2004), ranks are used in several metrics, including Spearman’s rank correlation coefficient (Spearman, 1904), top- accuracy and normalized discounted cumulative gain (NDCG). As piecewise constant functions, ranks are unfortunately much more problematic than sorting: their derivatives are null or undefined, preventing gradient backpropagation. For this reason, a large body of work has studied differentiable proxies to ranking. While several works opt to approximate ranking metrics directly (Chapelle & Wu, 2010; Adams & Zemel, 2011; Lapin et al., 2016), others introduce “soft” ranks, which can then be plugged into any differentiable loss function. Taylor et al. (2008) use a random perturbation technique to compute expected ranks in time, where is the dimensionality of the vector to rank. Shortly after, Qin et al. (2010) propose a simple method based on comparing pairwise distances between values, thereby taking time. This method is refined by Grover et al. (2019) using unimodal row-stochastic matrices. Lastly, Cuturi et al. (2019) adopt an optimal transport viewpoint of sorting and ranking. Their method is based on differentiating through the iterates of the Sinkhorn algorithm (Sinkhorn & Knopp, 1967), thereby costing time, where is the number of Sinkhorn iterations. Alas, none of these approaches achieve the time complexity one would expect from sorting and ranking operations.

In this paper, we propose the first differentiable sorting and ranking operators with time and memory complexity. Our proposals enjoy exact computation and differentiation (i.e., they do not involve differentiating through the iterates of an approximate algorithm). We achieve this feat by casting differentiable sorting and ranking as projections onto the permutahedron, the convex hull of all permutations, and using a reduction to isotonic optimization. While the permutahedron had been used for learning before (Yasutake et al., 2011; Ailon et al., 2016; Blondel, 2019), it had not been used to define fast differentiable operators. The rest of the paper is organized as follows.

• We review the necessary background (§2) and show how to cast sorting and ranking as linear programs over the permutahedron, the convex hull of all permutations (§3).

• We introduce regularization in these linear programs, which turns them into projections onto the permutahedron and allows us to define differentiable sorting and ranking operators. We analyze the properties of these operators, such as their asymptotic behavior (§4).

• Using a reduction to isotonic optimization, we achieve computation and differentiation of our operators, a key technical contribution of this paper (§5).

• We show that our approach is an order of magnitude faster than existing approaches and showcase two novel applications: differentiable Spearman’s rank coefficient and soft least trimmed squares (§6).

## 2 Preliminaries

In this section, we define the notation that will be used throughout this paper. Let . We will think of as a vector of scores or “logits” produced by a model, i.e., for some and some . For instance, in a label ranking setting, may contain the score of each of labels for the features .

We denote a permutation of by and its inverse by . For convenience, we will sometimes use . If a permutation is seen as a vector, we denote it with bold, . We denote the set of permutations of by . Given a permutation , we denote the version of permuted according to by . We define the reversing permutation by or in vector form. Given a set and a vector , we denote the restriction of to by .

We define the argsort of as the indices sorting , i.e.,

 σ(θ)\coloneqq(σ1(θ),…,σn(θ)), (1)

where . If some of the coordinates of are equal, we break ties arbitrarily. We define the sort of as the values of in descending order, i.e.,

 s(θ)\coloneqqθσ(θ). (2)

We define the rank of as the function evaluating at coordinate to the position of in the descending sort (smaller rank means that has higher value). It is formally equal to the argsort’s inverse permutation, i.e.,

 r(θ)\coloneqqσ−1(θ). (3)

For instance, if , then , and . All three operations can be computed in time. Note that throughout this paper, we use descending order for convenience. The ascending order counterparts are easily obtained by , and , respectively.

## 3 Sorting and ranking as linear programs

We show in this section how to cast sorting and ranking operations as linear programs over the permutahedron. To that end, we first formulate the argsort and ranking operations as optimization problems over the set of permutations . {lemma}Discrete optimization formulations

For all and , we have

 σ(θ) =argmaxσ∈Σ⟨θσ,ρ⟩, and (4) r(θ) =argmaxπ∈Σ⟨θ,ρπ⟩. (5)

A proof is provided in §B.1. To obtain continuous optimization problems, we introduce the permutahedron induced by a vector , the convex hull of permutations of :

 P(w)\coloneqqconv({wσ:σ∈Σ})⊂Rn. (6)

A well-known object in combinatorics (Bowman, 1972; Ziegler, 2012), the permutahedron of is a convex polytope, whose vertices correspond to permutations of . It is illustrated in Figure 1. In particular, when , . With this defined, we can now derive linear programming formulations of sort and ranks. {proposition}Linear programming formulations

For all and , we have

 s(θ) =argmaxy∈P(θ)⟨y,ρ⟩, and (7) r(θ) =argmaxy∈P(ρ)⟨y,−θ⟩. (8)

A proof is provided in §B.2. The key idea is to perform a change of variable to “absorb” the permutation in (4) and (5) into a permutahedron. From the fundamental theorem of linear programming (Dantzig et al., 1955, Theorem 6), an optimal solution of a linear program is always achieved at a vertex of the convex polytope. Thus, an optimal solution over a permutahedron is always a permutation. Interestingly, appears in the constraints and appears in the objective for sorting, while this is the opposite for ranking. Figure 1: Illustration of the permutahedron P(ρ), whose vertices are permutations of ρ=(3,2,1). In this example, the ranks of θ=(2.9,0.1,1.2) are r(θ)=(1,3,2). In this case, our proposed soft rank rεQ(θ) with ε=1 is exactly equal to r(θ). When ε→∞, rεQ(θ) converges towards the centroid of the permutahedron. The gray line indicates the regularization path of rεQ(θ) between these two regimes, when varying ε.

#### Differentiability a.e. of sorting.

For , the fact that appears in the linear program constraints makes piecewise linear and thus differentiable almost everywhere. When is unique at , is differentiable at and its Jacobian is the permutation matrix associated with . When is not unique, we can choose any matrix in Clarkeâs generalized Jacobian, i.e., any convex combination of the permutation matrices associated with .

#### Lack of differentiability of ranking.

On the other hand, for , since appears in the objective, a small perturbation to may cause the solution of the linear program to jump to another permutation of . This makes a discontinuous, piecewise constant function. This means that has null or undefined partial derivatives, preventing its use within a neural network trained with backpropagation.

## 4 Differentiable sorting and ranking Figure 2: Illustration of the soft sorting and ranking operators, sεΨ(θ) and rεΨ(θ) for Ψ=Q; the results with Ψ=E are similar. When ε→0, they converge to their “hard” counterpart. When ε→∞, they collapse into a constant, as proven in Prop.4.

As we have already motivated, our primary goal is the design of efficiently computable approximations to the sorting and ranking operators, that would smoothen the numerous kinks of the former, and provide useful derivatives for the latter. We achieve this by introducing strongly convex regularization in our linear programming formulations. This turns them into efficiently computable projection operators, which are differentiable and amenable to formal analysis.

#### Projection onto the permutahedron.

Let and consider the linear program . Clearly, we can express by setting and by setting . Introducing quadratic regularization is considered by Martins & Astudillo (2016) over the unit simplex and by Niculae et al. (2018) over marginal polytopes. Similarly, adding to our linear program over the permutahedron gives

 PQ(z,w)\coloneqqargmaxμ∈P(w)⟨z,μ⟩−Q(μ)=argminμ∈P(w)12∥μ−z∥2, (9)

i.e., the Euclidean projection of onto . We also consider entropic regularization , popularized in the optimal transport literature (Cuturi, 2013; Peyré & Cuturi, 2017). Subtly, we define

 PE(z,w) \coloneqqlogargmaxμ∈P(ew)⟨z,μ⟩−E(μ) (10) =logargminμ∈P(ew)KL(μ,ez), (11)

where is the Kullback-Leibler (KL) divergence between two positive measures and . is therefore the log KL projection of onto . The purpose of is to ensure that always belongs to (since is a convex combination of the permutations of ) and that of the logarithm is to map back to .

More generally, we can use any strongly convex regularization under mild conditions. For concreteness, we focus our exposition in the main text on . We state all our propositions for these two cases and postpone a more general treatment to the appendix.

#### Soft operators.

We now build upon these projections to define soft sorting and ranking operators. To control the regularization strength, we introduce a parameter which we multiply by (equivalently, divide by).

For sorting, we choose and therefore define the -regularized soft sort as

 sεΨ(θ)\coloneqqPεΨ(ρ,θ)=PΨ(\nicefracρε,θ). (12) Figure 3: Effect of the regularization parameter ε. We take the vector θ\coloneqq(0,3,1,2), vary one of its coordinates θi and look at how [sεΨ(θ)]i and [rεΨ(θ)]i change in response. For soft sorting with Ψ=Q, the function is still piecewise linear, like sorting. However, by increasing ε we reduce the number of kinks, and the function eventually converges to a mean (Proposition 4). With Ψ=E, the function tends to be even smoother. For soft ranking with Ψ=Q, the function is piecewise linear instead of piecewise constant for the “hard” ranks. With Ψ=E, the function again tends to be smoother though it may contain kinks.

For ranking, we choose and therefore define the -regularized soft rank as

 rεΨ(θ)\coloneqqPεΨ(−θ,ρ)=PΨ(\nicefrac−θε,ρ). (13)

We illustrate the behavior of both of these soft operations as we vary in Figures 3 and 2. As for the hard versions, the ascending-order soft sorting and ranking are obtained by negating the input as and , respectively.

#### Properties.

We can further characterize these approximations. Namely, as we now formalize, they are differentiable a.e., and not only converge to the their “hard” counterparts, but also satisfy some of their properties for all . {proposition}Properties of and

1. Differentiability. For all , and are differentiable (a.e.) w.r.t. .

2. Order preservation. Let , and . For all and , we have and .

3. Asymptotics. For all without ties:

 sεΨ(θ) −−→ε→0s(θ) rεΨ(θ) −−→ε→0r(θ) (14) −−−→ε→∞fΨ(θ)1 −−−→ε→∞fΨ(ρ)1,

where , .

The last property describes the behavior as and . Together with the proof of Section 4, we include in §B.3 a slightly stronger result. Namely, we derive an explicit value of below which our operators are exactly equal to their hard counterpart, and a value of above which our operators can be computed in closed form.

#### Convexification effect.

Proposition 4 shows that and for all converge to convex functions of as . This suggests that larger make the objective function increasingly easy to optimize (at the cost of departing from “hard” sorting or ranking). This behavior is also visible in Figure 3, where converges towards the mean , depicted by a straight line.

#### On tuning ε (or not).

The parameter controls the trade-off between approximation of the original operator and “smoothness”. When the model producing the scores or “logits” to be sorted/ranked is a homogeneous function, from (12) and (13), can be absorbed into the model. In our label ranking experiment, we find that indeed tuning is not necessary to achieve excellent accuracy. On the other hand, for top- classification, we find that applying a logistic map to squash to and tuning is important, confirming the empirical finding of Cuturi et al. (2019).

#### Relation to linear assignment formulation.

When using uniform weights on the inputs, the differentiable operators of Cuturi et al. (2019) are based on viewing sorting and ranking as linear assignment over the Birkhoff polytope , the convex hull of permutation matrices. To relate to their method, note that using the change of variable and , we can rewrite (8) as , where

 P(θ)\coloneqqargmaxP∈B ⟨Pρ,−θ⟩. (15)

Let be a distance matrix. Simple calculations show that if , then

 P(θ)=argminP∈B ⟨P,D(−θ,ρ)⟩. (16)

Similarly, we can rewrite (7) as . To obtain a differentiable operator, Cuturi et al. (2019) (see also (Adams & Zemel, 2011)) propose to replace the permutation matrix by a doubly stochastic matrix , which is computed approximately in using Sinkhorn (1967). In comparison, our approach is based on regularizing with directly, the key to achieve time and space complexity, as we now show.

## 5 Fast computation and differentiation

As shown in the previous section, computing our soft sorting and ranking operators boils down to projecting onto a permutahedron. Our key contribution in this section is the derivation of an forward pass and an backward pass (multiplication with the Jacobian) for these projections. Beyond soft sorting and ranking, this is an important sensitivity analysis question in its own right.

#### Reduction to isotonic optimization.

We now show how to reduce the projections to isotonic optimization, i.e., with simple chain constraints, which is the key to fast computation and differentiation. We will w.l.o.g. assume that is sorted in descending order (if not the case, we sort it first). {proposition}Reduction to isotonic optimization

For all and sorted we have

 PΨ(z,w)=z−vΨ(zσ(z),w)σ−1(z) (17)

where

 vQ(s,w) \coloneqqargminv1≥⋯≥vn12∥v−(s−w)∥2, and (18) vE(s,w) \coloneqqargminv1≥⋯≥vn⟨es−v,1⟩+⟨ew,v⟩. (19)

The function is classically known as isotonic regression. The fact that it can be used to solve the Euclidean projection onto has been noted several times (Negrinho & Martins, 2014; Zeng & Figueiredo, 2015). The reduction of Bregman projections, which we use here, to isotonic optimization was shown by Lim & Wright (2016). Unlike that study, we use the KL projection of onto , and not of onto , which simplifies many expressions. We include in §B.4 a simple unified proof of Section 5 based on Fenchel duality and tools from submodular optimization. We also discuss an interpretation of adding regularization to the primal linear program as relaxing the equality constraints of the dual linear program in §B.5.

#### Computation.

As shown by Best et al. (2000), the classical pool adjacent violators (PAV) algorithm for isotonic regression can be extended to minimize any per-coordinate decomposable convex function subject to monotonicity constraints, which is exactly the form of the problems in Section 5. The algorithm repeatedly splits the coordinates into a set of contiguous blocks that partition (their union is and ). It only requires access to an oracle that solves for each block the sub-problem , and runs in linear time. Further, the solution has a clean block-wise constant structure, namely it is equal to within block . Fortunately, in our case, as shown in §B.6, the function can be analytically computed, as

 γQ(Bj;s,w) \coloneqq1|Bj|∑i∈Bjsi−wi, and (20) γE(Bj;s,w) \coloneqqlog∑i∈Bjesi−log∑i∈Bjewi. (21)

Hence, PAV returns an exact solution of both and in time (Best et al., 2000). This means that we do not need to choose a number of iterations or a level of precision, unlike with Sinkhorn. Since computing and requires obtaining beforehand, the total computational complexity is .

#### Differentiating isotonic optimization.

The block-wise structure of the solution also makes its derivatives easy to analyze, despite the fact that we are differentiating the solution of an optimization problem. Since the coordinates of the solution in block are all equal to , which in turn depends only on a subset of the parameters, the Jacobian has a simple block-wise form, which we now formalize. {lemma}Jacobian of isotonic optimization

Let be the ordered partition of induced by from Proposition 5. Then,

 ∂vΨ(s,w)∂s=⎡⎢ ⎢ ⎢⎣BΨ1000⋱000BΨm⎤⎥ ⎥ ⎥⎦∈Rn×n, (22)

where . A proof is given in §B.7. The Jacobians w.r.t. are entirely similar, thanks to the symmetry of (20) and (21). Figure 4: Left, center: Accuracy comparison on CIFAR-10, CIFAR-100 (n=10, n=100). Right: Runtime comparison for one batch computation with backpropagation disabled. OT and All-pairs go out-of-memory starting from n=2000 and n=3000, respectively. With backpropagation enabled, the runtimes are similar but OT and All-pairs go out-of-memory at n=1000 and n=2500, respectively.

In the quadratic regularization case, it was already derived by Djolonga & Krause (2017) that . The multiplication with the Jacobian, for some vector , can be computed as , where . In the entropic regularization case, novel to our knowledge, we have . Note that is column-wise constant, so that the multiplication with the Jacobian , can be computed as . In both cases, the multiplication with the Jacobian therefore takes time.

There are interesting differences between the two forms of regularization. For quadratic regularization, the Jacobian only depends on the partition (not on ) and the blocks have constant value. For entropic regularization, the Jacobian does depend on and the blocks are constant column by column. Both formulations are averaging the incoming gradients, one uniformly and the other weighted.

#### Differentiating the projections.

We now combine Section 5 with Section 5 to characterize the Jacobians of the projections onto the permutahedron and show how to multiply arbitrary vectors with them in linear time. {proposition}Jacobian of the projections

Let be defined in Proposition 5. Then,

 ∂PΨ(z,w)∂z=JΨ(zσ(z),w)σ−1(z), (23)

where is the matrix obtained by permuting the rows and columns of according to , and where

 JΨ(s,w)\coloneqqI−∂vΨ(s,w)∂s. (24)

Again, the Jacobian w.r.t.  is entirely symmetric. Unlike the Jacobian of isotonic optimization, the Jacobian of the projection is not block diagonal, as we need to permute its rows and columns. We can nonetheless multiply with it in linear time by using the simple identity , which allows us to reuse the multiplication with the Jacobian of isotonic optimization.

#### Differentiating sεΨ and rεΨ.

With the Jacobian of w.r.t. and at hand, differentiating and boils down to a mere application of the chain rule to (12) and (13). To summarize, we can multiply with the Jacobians of our soft operators in time and space.

## 6 Experiments

We present in this section our empirical findings. We will release in the near future JAX, PyTorch and Tensorflow implementations of our soft operators building upon a highly-optimized C++ implementation of the PAV algorithm.

### 6.1 Top-k classification loss function

#### Experimental setup.

To demonstrate the effectiveness of our proposed soft rank operators as a drop-in replacement for exisiting ones, we reproduce the top- classification experiment of Cuturi et al. (2019). The authors propose a loss for top- classification between a ground truth class and a vector of soft ranks , which is higher if the predicted soft ranks correctly place in the top- elements. We compare the following soft operators

• OT (Cuturi et al., 2019): The optimal transport formulation discussed in §4.

• All-pairs (Qin et al., 2010): observing that can be written as , one can obtain soft ranks in by replacing the indicator function with a sigmoid.

• Proposed: our soft ranks and . Although not used in this experiment, for top- ranking, the complexity can be reduced to by computing using the algorithm of Lim & Wright (2016).

We use the CIFAR-10 and CIFAR-100 datasets, with and classes, respectively. Following Cuturi et al. (2019), we use a vanilla CNN (4 Conv2D with 2 max-pooling layers, ReLU activation, 2 fully connected layers with batch norm on each), the ADAM optimizer (Kingma & Ba, 2014) with a constant step size of , and set . Similarly to Cuturi et al. (2019), we found that squashing the scores to with a logistic map was beneficial.

#### Results.

Our empirical results, averaged over runs, are shown in Figure 4 (left, center). On both CIFAR-10 and CIFAR-100, our soft rank formulations achieve comparable accuracy to the OT formulation, though significantly faster, as we elaborate below. Confirming the results of Cuturi et al. (2019), we found that the soft top- loss slightly outperforms the classical cross-entropy (logistic) loss for these two datasets. However, we did not find that the All-pairs formulation could outperform the cross-entropy loss.

The training times for 600 epochs on CIFAR-100 were 29 hours (OT), 21 hours (), 23 hours () and 16 hours (All-pairs). Training times on CIFAR-10 were similar. While our soft operators are several hours faster than OT, they are slower than All-pairs, despite its complexity. This is due the fact that, with , All-pairs is very efficient on GPUs, while our PAV implementation runs on CPU.

### 6.2 Runtime comparison: effect of input dimension

To measure the impact of the dimensionality on the runtime of each method, we designed the following experiment.

#### Experimental setup.

We generate score vectors randomly according to , for ranging from up to . For fair comparison with GPU implementations (OT, All-pairs, Cross-entropy), we create a batch of such vectors and we compare the time to compute soft ranking operators on this batch. We run this experiment on top of TensorFlow (Abadi et al., 2016) on a six core Intel Xeon W-2135 with 64 GBs of RAM and a GeForce GTX 1080 Ti.

#### Results.

Run times for one batch computation with backpropagation disabled are shown in Figure 4 (Right). While their runtime is reasonable in small dimension, OT and All-pairs scale quadratically with respect to the dimensionality (note the log scale on the -axis). Although slower than a softmax, our formulations scale well, with the dimensionality having negligible impact on the runtime. OT and All-pairs go out-of-memory starting from and , respectively. With backpropagation enabled, they go out-of-memory at and , due to the need for recording the computational graph. This shows that the lack of memory available on GPUs is problematic for these methods. In contrast, our approaches only require memory and comes with the theoretical Jacobian (they do not rely on differentiating through iterates). They therefore suffer from no such issues.

### 6.3 Label ranking via soft Spearman’s rank correlation coefficient

We now consider the label ranking setting where supervision is given as full rankings (e.g., ) rather than as label relevance scores. The goal is therefore to learn to predict permutations, i.e., a function . A classical metric between ranks is Spearman’s rank correlation coefficient, defined as the Pearson correlation coefficient between the ranks. Maximizing this coefficient is equivalent to minimizing the squared loss between ranks. A naive idea would be therefore to use as loss , where . This is unfortunately a discontinuous function of . We therefore propose to rather use , hence the name differentiable Spearman’s rank correlation coefficient. At test time, we replace with , which is justified by the order-preservation property (Proposition 4).

#### Experimental setup.

We consider the datasets from (Hüllermeier et al., 2008; Cheng et al., 2009), which has both semi-synthetic data obtained from classification problems, and real biological measurements. Following (Korba et al., 2018), we average over two 10-fold validation runs, in each of which we train on 90% and evaluate on 10% of the data. Within each repetition, we run an internal 5-fold cross-validation to grid-search for the best parameters. We consider linear models of the form , and for ablation study we drop the soft ranking layer .

#### Results.

Due to the large number of datasets, we choose to present a summary of the results in Figure 5. We postpone detailed results to the appendix (Table 1). Out of 21 datasets, introducing a soft rank layer with works better on 15 datasets, similarly on 4 and worse on 2 datasets. We can thus conclude that even for such simple model, introducing our layer is beneficial, and even achieving state of the art results on some of the datasets (full details in the appendix). Figure 5: Label ranking accuracy with and without soft rank layer. Each point above the line represents a dataset where our soft rank layer improves Spearman’s rank correlation coefficient.

### 6.4 Robust regression via soft least trimmed squares

We explore in this section the application of our soft sorting operator to robust regression. Let and be a training set of input-output pairs. Our goal is to learn a model that predicts outputs from inputs, where are model parameters. We focus on for simplicity. We further assume that a certain proportion of examples are outliers including some label noise, which makes the task of robustly estimating particularly challenging.

The classical ridge regression can be cast as

 minw1nn∑i=1ℓi(w)+12ε∥w∥2, (25)

where . In order to be robust to label noise, we propose instead to sort the losses (from larger to smaller) and to ignore the first ones. Introducing our soft sorting operator, this can be formulated as

 minw1n−kn∑i=k+1ℓεi(w) (26)

where is the loss in the soft sort, and is the loss vector that gathers for each . Solving (26) with existing soft sorting operators could be particularly computationally prohibitive, since here is the number of training samples.

When , we have and (26) is known as least trimmed squares (LTS) (Rousseeuw, 1984; Rousseeuw & Leroy, 2005). When , we have, from Proposition 4, and therefore both (25) and (26) converge to the least squares (LS) objective, . To summarize, our proposed objective (26), dubbed soft least trimmed squares, interpolates between least trimmed squares () and least squares (), as also confirmed empirically in Section 6.3.

#### Experimental setup.

To empirically validate our proposal, we compare cross-validated results for increasing percentage of outliers of the following methods:

• Least trimmed squares, with truncation parameter ,

• Soft least trimmed squares (26), with truncation parameter and regularization parameter ,

• Ridge regression (25), with regularization parameter ,

• Huber loss (Huber, 1964) with regularization parameter and threshold parameter , as implemented in scikit-learn (Pedregosa et al., 2011).

We consider datasets from the LIBSVM archive (Fan & Lin, 2011). We hold out 20% of the data as test set and use the rest as training set. We artifically create outliers, by adding noise to a certain percentage of the training labels, using , where . We do not add noise to the test set. For all methods, we use L-BFGS (Liu & Nocedal, 1989), with a maximum of iterations. For hyper-parameter optimization, we use -fold cross-validation. We choose from , from log-spaced values between and , and from linearly spaced values between and . We repeat this procedure times with a different train-test split, and report the averaged scores (a.k.a. coefficient of determination).

#### Results.

The averaged scores (higher is better) are shown in Figure 6.3. On all datasets, the accuracy of ridge regression deteriorated significantly with increasing number of outliers. Least trimmed squares (hard or soft) performed slightly worse than the Huber loss on housing, comparably on bodyfat and much better on cadata. We found that hard least trimmed squares (i.e., ) worked well on all datasets, showing that regularization is less important for sorting operators (which are piecewise linear) than for ranking operators (which are piecewise constant). Nevertheless, regularization appeared useful in some cases. For instance, on cadata, the cross-validation procedure picked when the percentage of outliers is less than , and when the percentage of outliers is larger than . This is confirmed visually on Figure 6.3, where the soft sort with works slightly better than the hard sort with few outliers, then performs comparably with more outliers. The interpolation effect enabled by therefore allows some adaptivity to the (unknown) percentage of outliers.

## 7 Conclusion

Building upon projections onto permutahedra, we constructed differentiable sorting and ranking operators. We derived exact computation and differentiation of these operators, a key technical contribution of this paper. We demonstrated that our operators can be used as a drop-in replacement for existing ones, with an order-of-magnitude speed-up. We also showcased two applications enabled by our soft operators: label ranking with differentiable Spearman’s rank correlation coefficient and robust regression via soft least trimmed squares.

## Acknowledgements

We are grateful to Marco Cuturi and Jean-Philippe Vert for useful discussions, and to Carlos Riquelme for comments on a draft of this paper.

Appendix

## Appendix A Additional empirical results

We include in this section the detailed label ranking results on the same datasets as considered by Hüllermeier et al. (2008) as well as Cheng et al. (2009).

For entropic regularization , in addition to , we also consider an alternative formulation. Since is already strictly positive, instead of using the log-projection onto , we can directly use the projection onto . In our notation, this can be written as , where

 (27)

Spearman’s rank correlation coefficient for each method, averaged over runs, is shown in the table below.

## Appendix B Proofs

### b.1 Proof of Lemma 3 (Discrete optimization formulation)

For the first claim, we have for all such that

 σ(θ)=argmaxσ∈Σ⟨θσ,w⟩ (28)

and in particular for . The second claim follows from

 σ(θ)=argmaxσ∈Σ⟨θ,wσ−1⟩=argmaxπ−1∈Σ⟨θ,wπ⟩=(argmaxπ∈Σ⟨θ,wπ⟩)−1. (29)

### b.2 Proof of Proposition 3 (Linear programming formulations)

Let us prove the first claim. The key idea is to absorb in the permutahedron. Using (28), we obtain for all and for all such that

 θσ(θ)=argmaxθσ:σ∈Σ⟨θσ,w⟩=argmaxy∈Σ(θ)⟨y,w⟩=argmaxy∈P(θ)⟨y,w⟩, (30)

where in the second equality we used and the fundamental theorem of linear programming (Dantzig et al., 1955, Theorem 6). For the second claim, we have similarly

 wr(θ)=argmaxwπ:π∈Σ⟨θ,wπ⟩=argmaxy∈P(w)⟨θ,y⟩. (31)

Setting and using proves the claim.

### b.3 Proof of Proposition 4 (Properties of soft sorting and ranking operators)

#### Differentiability.

Let be a closed convex set and let . If is strongly convex over , then is Lipschitz continuous. By Rademacherâs theorem, is differentiable almost everywhere. Furthermore, since with , is differentiable a.e. as long as is twice differentiable, which is the case when .

#### Order preservation.

Proposition 1 of Blondel et al. (2019) shows that and are sorted the same way. Furthermore, since with and since is monotone, is sorted the same way as , as well. Let and . From the respective definitions, this means that is sorted the same way as (i.e., it is sorted in descending order) and is sorted the same way as , which concludes the proof.

#### Asymptotic behavior.

We will now characterize the behavior for sufficiently small and large regularization strength . Note that rather than multiplying the regularizer by , we instead divide by , which is equivalent. {lemma}Analytical solutions of isotonic optimization in the limit regimes

If , then

 vQ(s/ε,w)=vE(s/ε,w)=s/ε−w. (32)

If , then

 vQ(s/ε,w)=1nn∑i=1(si/ε−wi)1andvE(s/ε,w)=(LSE(s/ε)−LSE(w))1, (33)

where .

###### Proof.

We start with the case. Recall that is sorted in descending order. Therefore, since we chose sufficiently small, the vector is sorted in descending order as well. This means that is feasible, i.e., it belongs to the constraint sets in Section 5. Further, note that so that is the optimal solution if we drop the constraints, which completes the argument.

Next, we tackle the case. Note that the claimed solutions are exactly and , so the claim will immediately follow if we show that is an optimal partition. The PAV algorithm (cf. §B.6) merges at each iteration any two neighboring blocks that violate , starting from the partitions consisting of singleton sets. Let be the iteration number. We claim that the two blocks, and , will always be violating the constraint, so that they can be merged. Note that in the quadratic case, they can be merged only if

 k∑i=1(si/ε−wi)/k

which is equivalent to

 k∑i=1si−sk+1kε

which is indeed satisfied when . In the KL case, they can be merged only if

 logk∑i=1esi/ε−logk∑i=1ewi

This will be true if the term on the left-hand side is smaller than the term on the right-hand side, i.e., when , which again is implied by the assumption. ∎

We can now directly characterize the behavior of the projection operator in the two regimes and . This in turn implies the results for both the soft ranking and sorting operations using (12) and (13). {proposition}Analytical solutions of the projections in the limit regimes

If , then

 PΨ(z/ε,w)=wσ−1(z). (40)

If , then

 PQ(z/ε,w) =z/ε−mean(z/ε−w)1, and (41) PE(z/ε,w) =z/ε−LSE(z/ε)1+LSE(w)1. (42)

Therefore, in these two regimes, we do not even need PAV to compute the optimal projection.

### b.4 Proof of Proposition 5 (Reduction to isotonic optimization)

Before proving Proposition 5, we need the following three lemmas.

{lemma}

Technical lemma

Let be convex, and . Then,