Large Scale computation of Means and Clusters for Persistence Diagrams using Optimal Transport

Large Scale computation of Means and Clusters for Persistence Diagrams using Optimal Transport

Abstract

Persistence diagrams (PDs) are now routinely used to summarize the underlying topology of sophisticated data encountered in challenging learning problems. Despite several appealing properties, integrating PDs in learning pipelines can be challenging because their natural geometry is not Hilbertian. In particular, algorithms to average a family of PDs have only been considered recently and are known to be computationally prohibitive. We propose in this article a tractable framework to carry out fundamental tasks on PDs, namely evaluating distances, computing barycenters and carrying out clustering. This framework builds upon a formulation of PD metrics as optimal transport (OT) problems, for which recent computational advances, in particular entropic regularization and its convolutional formulation on regular grids, can all be leveraged to provide efficient and (GPU) scalable computations. We demonstrate the efficiency of our approach by carrying out clustering on PDs at scales never seen before in the literature.

1 Introduction

Topological data analysis (TDA) has been used successfully in a wide array of applications, for instance in medical (Nicolau et al., 2011) or material (Hiraoka et al., 2016) sciences, computer vision (Li et al., 2014) or to classify NBA players (Lum et al., 2013). The goal of TDA is to exploit and account for the complex topology (connectivity, loops, holes, etc.) seen in modern data. The tools developed in TDA are built upon persistent homology theory (Edelsbrunner & Harer, 2010) whose main output is a descriptor called a persistence diagram (PD) which encodes in a compact form—roughly speaking, a point cloud in the upper triangle of the square —the topology of a given space or object at all scales.

Statistics on PD. Persistence diagrams have appealing properties: in particular they have been shown to be stable with respect to perturbations of the input data (Cohen-Steiner et al., 2007; Chazal et al., 2009, 2014). This stability is measured either in the so called bottleneck metric or in the -th diagram distance, which are both distances that compute optimal partial matchings. While theoretically motivated and intuitive, these metrics are by definition very costly to compute. Furthermore, these metrics are not Hilbertian, preventing a faithful application of a large class of standard machine learning tools (-means, PCA, SVM) on PDs.

Related work. To circumvent the non-Hilbertian nature of the space of PDs, one can of course map diagrams onto simple feature vectors. Such features can be either finite dimensional (Carrière et al., 2015a; Adams et al., 2017), or infinite through kernel functions (Bubenik, 2015; Carrière et al., 2017). A known drawback of kernel approaches on a rich geometric space such as that formed by PDs is that once PDs are mapped as feature vectors, any ensuing analysis remains in the space of such features (the “inverse image” problem inherent to kernelization). They are therefore not helpful to carry out simple tasks in the space of PDs, such as that of averaging PDs, namely computing the Fréchet mean of a family of PDs. Such problems call for algorithms that are able to optimize directly in the space of PDs, and were first addressed by Mileyko et al. (2011); Turner (2013). Turner et al. (2014) provided an algorithm that converges to a local minimum of the Fréchet function by successive iterations of the Hungarian algorithm. However, the Hungarian algorithm does not scale well with the size of diagrams, and non-convexity yields potentially convergence to bad local minima.

Contributions. We reformulate the computation of diagram metrics as an optimal transport problem, opening several perspectives, among them the ability to benefit from fast entropic regularization (Cuturi, 2013). We provide a new numerical scheme to bound OT transport metrics, and therefore diagram metrics, with additive guarantees. Unlike previous approximations of diagram metrics, ours can be parallelized and implemented efficiently on GPUs. These approximations are also differentiable, leading to an extremely efficient method to solve the barycenter problem for persistence diagrams. In exchange for a discretized approximation of PDs, we recover a convex problem, unlike previous PDs barycenter formulations. We demonstrate the scalability of these two advances (accurate approximation of the diagram metric at scale and barycenter computation) by providing the first tractable implementation of the -means algorithm in the space of persistence diagrams.

2 Background on OT and TDA

OT. Optimal transport is now widely seen as a central tool to compare probability measures (Villani, 2003, 2008; Santambrogio, 2015). Given a space endowed with a cost function , we consider two discrete measures and on , namely measures that can be written as positive combinations of diracs, with weight vectors verifying and all in . The cost matrix and the transportation polytope define an optimal transport problem whose optimum can be computed using either of two linear programs, dual to each other,

(1)

where is the Frobenius dot product and is the set of pairs of vectors in such that their tensor sum is smaller than , namely . Note that when and all weights and are uniform and equal, the problem above reduces to the computation of an optimal matching, that is a permutation (with a resulting optimal plan taking the form ). That problem has clear connections with diagram distances, as shown in §3.

Entropic Regularization. Solving the optimal transport problem is intractable for large data. Cuturi proposes to consider a regularized formulation of that problem using entropy , namely:

(2)
(3)

where . Because the negentropy is 1-strongly convex, that problem has a unique solution which takes the form, using first order conditions,

(4)

where (term-wise exponentiation), and is a fixed point of the Sinkhorn map (term-wise divisions):

(5)

Cuturi considers the transport cost of the optimal regularized plan, , to define a Sinkhorn divergence between (here is the elementwise product). One has that as , and more precisely converges to the optimal transport plan solution of (1) with maximal entropy. That approximation can be readily applied to any problem that involves terms in , notably barycenters (Cuturi & Doucet, 2014; Solomon et al., 2015; Benamou et al., 2015).

Eulerian setting. When the set is finite with cardinality , and are entirely characterized by their probability weights and are often called histograms in a Eulerian setting. When is not discrete, as when considering the plane , we therefore have a choice of representing measures as sums of diracs, encoding their information through locations, or as discretized histograms on a planar grid of arbitrary granularity. Because the latter setting is more effective for entropic regularization (Solomon et al., 2015), this is the approach we will favor in our computations.

Figure 1: Sketch of persistent homology. and so that sublevel sets of are unions of balls centered at the points of . First (resp second) coordinate of points in the persistence diagram encodes appearance scale (resp disappearance) of cavities in the sublevel sets of . The isolated red point accounts for the presence of a persistent hole in the sublevel sets, inferring the underlying spherical geometry of the input point cloud.

Persistent homology and Persistence Diagrams. Given a topological space and a real-valued function , persistent homology provides—under mild assumptions on , taken for granted in the remaining of this article—a topological signature of built on its sublevel sets , and called a persistence diagram (PD), denoted as . It is of the form , namely a point measure with finite support included in . Each point in can be understood as a topological feature (connected component, loop, hole…) which appears at scale and disappears at scale in the sublevel sets of . Comparing the persistence diagrams of two functions measures their difference from a topological perspective: presence of some topological features, difference in appearance scales, etc. The space of PDs is naturally endowed with a partial matching metric defined as ():

(6)

where is the set of all partial matchings between points in and points in and denotes the orthogonal projection of an (unmatched) point to the diagonal . The mathematics of OT and diagram distances share a key idea, that of matching, but differ on an important aspect: diagram metrics can cope, using the diagonal as a sink, with measures that have a varying total number of points. We solve this gap by leveraging an unbalanced formulation for OT.

3 Fast estimation of diagram metrics using Optimal Transport

In the following, we start by explicitely formulating (6) as an optimal transport problem. Entropic smoothing provides us a way to approximate (6) with controllable error. In order to benefit mostly from that regularization (matrix parallel execution, convolution, GPU—as showcased in (Solomon et al., 2015)), implementation requires specific attention, as described in propositions 2, 3, 4.

PD metrics as Optimal Transport. The main differences between (6) and (1) are that PDs do not generally have the same number of points and that the diagonal plays a special role by allowing to match any point in a given diagram with its orthogonal projection onto the diagonal. Guittet’s formulation for partial transport (2002) can be used to account for this by creating a “sink” bin corresponding to that diagonal and allowing for different total masses. The idea of representing the diagonal as a single point is also present in the bipartite graph problem of Edelsbrunner & Harer (2010) (Ch.VIII). The important aspect of the following proposition is the clarification of the partial matching problem (6) as a standard OT problem (1).

Let be extended with a unique virtual point encoding the diagonal. We introduce the linear operator which, to a measure in , associates a dirac on with mass equal to the total mass of , namely .

Proposition 1.

Let and be two persistence diagrams with respectively points and points . Let . Then:

(7)

where is the cost matrix with block structure

(8)

where , for .

The proof seamlessly relies on the fact that, when transporting point measures with the same number of points, the optimal transport problem is equivalent to an optimal matching problem (see §2). For the sake of completeness, we provide details in the supplementary material .

Entropic approximation of diagram distances. Following the correspondance established in Prop 1, entropic regularization can be used to approximate the diagram distance . Given two persistence diagrams with respectively and points, let and where is the output after iterations of the Sinkhorn map (5) starting from an arbitrary initialization. Adapting the bounds provided by Altschuler et al. (2017), we can bound the error of approximating by :

(9)

where (that is, error on marginals).

In recent work of Dvurechensky et al. (2018), authors prove that iterating the Sinkhorn map (5) gives a plan verifying in iterations. Given (9), a natural choice is thus to take for a desired precision , which lead to a total of iterations in the Sinkhorn loop. These results can be used to pre-tune parameters and to control the approximation error due to smoothing. However, these are worst-case bounds, controlled by max-norms, and are often too pessimistic in practice. To overcome this phenomenon, we propose on-the-fly error control, using approximate solutions to the smoothed primal (2) and dual (3) optimal transport problems, which provide upper and lower bounds on the optimal transport cost.

Upper and Lower Bounds. The Sinkhorn algorithm, after even but one iteration, produces feasible dual variables (see below (1) for a definition). Their objective value, as measured by , performs poorly as a lower bound of the true optimal transport cost (see Fig. 2 and §5 below) in most of our experiments. To improve on this, we compute the so called -transform of , defined as:

Applying a -transform on , we recover two vectors . One can show that for any feasible , we have that (see Prop 3.1 in (Peyré & Cuturi, 2017))

When ’s top-left block is the squared Euclidean metric, this problem can be cast as that of computing the Moreau envelope of . In a Eulerian setting and when is a finite regular grid which we will consider, we can use either the linear-time Legendre transform or the Parabolic Envelope algorithm (Lucet, 2010, §2.2.1,§2.2.2) to compute the -transform in linear time with respect to the grid resolution .

Unlike dual iterations, the primal iterate does not belong to the transport polytope after a finite number of iterations. We use the rounding_to_feasible algorithm provided by Altschuler et al. (2017) to compute efficiently a feasible approximation of that does belong to . Putting these two elements together, we obtain

(10)

Therefore, after iterating the Sinkhorn map (5) times, we have that if is below a certain criterion , then we can guarantee that is a fortiori an -approximation of . Observe that one can also have a relative error control: if one has , then . Note that might be negative but can always be replaced by since we know has non-negative entries (and therefore ), while is always non-negative.

Figure 2: (Left) (red) and (green) as a function of , the number of iterations of the Sinkhorn map ( ranges from to , with fixed ). (Middle) Final (red) and (green) provided by Alg.1, computed for decreasing s, ranging from to . For each value of , Sinkhorn loop is run until . Note that the -axis is flipped. (Right) Influence of -transform for the Sinkhorn dual cost. (orange) The dual cost , where are Sinkhorn dual variables (before the -transform). (green) Dual cost after -transform, i.e. with . Experiment run with and iterations.

Discretization. For simplicity, we assume in the remaining that our diagrams have their support in . From a numerical perspective, encoding persistence diagrams as histograms on the square offers numerous advantages. Given a uniform grid of size on , we associate to a given diagram a matrix-shaped histogram such that is the number of points in belonging to the cell located at position in the grid (we transition to bold-faced small letters to insist on the fact that these histograms must be stored as square matrices). To account for the total mass, we add an extra dimension encoding mass on . We extend the operator to histograms, associating to a histogram its total mass on the -th coordinate. One can show that the approximation error resulting from that discretization is bounded above by (see the supplementary material).

  Input: Pairs of PDs , smoothing parameter , grid step , stopping criterion, initial .
  Output: Approximation of all , upper and lower bounds if wanted.
  init Cast as histograms , on a grid
  while stopping criterion not reached do
     Iterate in parallel (5) using (11).
  end while
  Compute all using (12)
  if Want a upper bound then
     Compute in parallel using (14)
  end if
  if Want a lower bound then
     Compute using (Lucet, 2010)
  end if
Algorithm 1 Sinkhorn divergence for persistence diagrams

Convolutions. In the Eulerian setting, where diagrams are matrix-shaped histograms of size , the cost matrix has size . Since we will use large values of to have low discretization error (typically ), instantiating is usually intractable. However, Solomon et al. (2015) showed that for regular grids endowed with a separable cost (notably the squared Euclidean norm ), each Sinkhorn map (as well as other key operations such as evaluating Sinkhorn’s divergence) can be performed using Gaussian convolutions, which amounts to performing matrix multiplications of size , without having to manipulate matrices. Our framework is slightly different due to the extra dimension , but we show that equivalent computational properties hold. This observation is crucial from a numerical perspective. Our ultimate goal being to efficiently evaluate (11), (12) and (14), we provide implementation details.

Let be a pair where is a matrix-shaped histogram and is a real number accounting for the mass located on the virtual point . We denote by the column vector obtained when reshaping . The cost matrix and corresponding kernel are given by

where , . and as defined above will never be instantiated, because we can rely instead on defined as and .

Proposition 2 (Iteration of Sinkhorn map).

The application of to can be performed as:

(11)

where denotes the Froebenius dot-product in .

We now introduce and .

Proposition 3 (Computation of ).

Let . The transport cost of can be computed as:

(12)

where the first term can be computed as:

(13)

Finally, consider two histograms , let be the rounded matrix of (see the supplementary material or (Altschuler et al., 2017)). Let denote the first and second marginal of respectively. We introduce (using term-wise min and divisions):

along with and the marginal errors:

Proposition 4 (Computation of upper bound ).

The transport cost induced by can be computed as:

(14)

Note that the first term can be computed using (12)

Parallelization and GPU. Using a Eulerian representation is particularly beneficial when applying Sinkhorn’s algorithm, as shown by Cuturi (2013). Indeed, the Sinkhorn map (5) only involves matrix-vector operations. When dealing with a large number of histograms, concatenating these histograms and running Sinkhorn’s iterations in parallel as matrix-matrix product results in significant speedup that can exploit GPGPU to compare a large number of pairs simultaneously. This makes our approach especially well-suited for large sets of persistence diagrams.

We can now estimate distances between persistence diagrams with Alg. 1 in parallel by performing only -sized matrix multiplications, leading to a computational scaling in where is the grid resolution parameter. Note that a standard stopping criterion is to check the error to marginals .

4 Smoothed barycenters for persistence diagrams

OT formulation for barycenters. We show in this section that the benefits of entropic regularization also apply to the computation of barycenters of PDs. As the space of PD is not Hilbertian but only a metric space, the natural definition of barycenters is to formulate them as Fréchet means for the metric, as first introduced (for PDs) in (Mileyko et al., 2011). Using Prop 1, we thus consider:

Definition.

Given a set of persistence diagrams , a barycenter of is any minimizer of the following problem:

(15)

where is defined as in (8) with (but our approach adapts easily to finite ).

Let denotes the restriction of to the space of persistence diagrams (finite point measures). Turner et al. proved the existence of barycenters and proposed an algorithm that converges to a local minimum of , using the Hungarian algorithm as a subroutine. Their algorithm will be referred to as the B-Munkres Algorithm. The non-convexity of can be a real limitation in practice since can have arbitrarily bad local minima (see Lemma 1 in the supplementary material). Note that minimizing instead of will not give strictly better minimizers (see Thm. 1 in the supplementary material). We then apply entropic smoothing to this problem. This relaxation offers differentiability and circumvents both non-convexity and numerical scalability.

Figure 3: Barycenter estimation for different s with a simple set of PDs (red, blue and green). The smaller the , the better the estimation ( decreases, note the -axis is flipped on the right plot), at the cost of more iterations in Alg. 2. The mass appearing along the diagonal is a consequence of entropic smoothing: it does not cost much to delete while it increases the entropy of transport plans.

Entropic smoothing for PD barycenters. In addition to numerical efficiency, an advantage of smoothed optimal transport is that is differentiable. In the Eulerian setting, its gradient is given by centering the vector where is a fixed point of the Sinkhorn map (5), see (Cuturi & Doucet, 2014). This result can be adapted to our framework, namely:

Proposition 5.

Let be PDs, and the corresponding histograms on a grid. The gradient of the functional is given by

(16)

where denotes the adjoint operator and is a fixed point of the Sinkhorn map obtained while transporting onto .

As in (Cuturi & Doucet, 2014), this result follows from the enveloppe theorem, with the added subtlety that appears in both terms depending on and . This formula can be exploited to compute barycenters via gradient descent, yielding Algorithm 2.

  Input: PDs , learning rate , smoothing parameter , grid step .
  Output: Estimated barycenter
  Init: uniform measure above the diagonal.
  Cast each as an histogram on a grid
  while  changes do
     Iterate defined in (5) in parallel between all the pairs and .
     
     
  end while
  Return
Algorithm 2 Smoothed approximation of PD barycenter

As it can be seen in Fig. 3 and 6, the barycentric persistence diagrams are smeared. If one wishes to recover more spiked diagrams, quantization and/or entropic sharpening (Solomon et al., 2015, §6.1) can be applied, as well as smaller values for that impact computational speed or numerical stability. We will consider these extensions in future work.

5 Experiments

Figure 4: Comparison of scalings of Hera and Sinkhorn (Alg. 1) as the number of points in diagram increases. - scale.

A large scale approximation. Iterations of Sinkhorn map (5) yield a transport cost whose value converges to the true transport cost as goes to and the number of iterations goes to (Cuturi, 2013). We quantify in Fig. 2 this convergence experimentally using the upper and lower bounds provided in (10) through and for decreasing . We consider a set of pairs of diagrams randomly generated with to points in each diagrams, and discretized on a grid. We run Alg. 1 for different ranging from to along with corresponding upper and lower bounds described in (10). For each pair of diagrams, we center our estimates by removing the true distance, so that the target cost is across all pairs. We plot median, top % and bottom % percentiles for both bounds. Using the -transform provides a much better lower bound in our experiments. This is however costly in practice: despite theoretical complexity linear in the grid size, the sequential structure of the algorithms described in (Lucet, 2010) makes them unsuited for GPGPU to our knowledge.

We then compare the scalability of Alg. 1 with respect to the number of points in diagrams with that of Kerber et al. (2017) which provides a state-of-the-art algorithm with publicly available code—referred to as Hera—to estimate distances between diagrams. For both algorithms, we compute the average time to estimate a distance between two random diagrams having from to points where ranges from to . In order to compare their scalability, we plot in Fig. 4 the ratio of both algorithms. It appears that our Alg. 1 scales better (heuristically linearly) as the number of points in the diagrams increases.

Figure 5: Average running times for B-Munkres (blue) and Sinkhorn (red) barycenter algorithms (log-log scale) to average 10 PDs. Sinkhorn parameters: gradient descent is performed until , , , init and progressively decreased.

Fast barycenters and -means on large PD sets. We compare our Alg. 2 to the combinatorial algorithm of Turner et al. (2014). We use the script munkres.py provided on the website of K.Turner for their implementation. Algorithms are referred to as Sinkhorn and B-Munkres respectively. We record in Fig. 5 running times of both algorithms on a set of diagrams having from to points, ranging from to , on Intel Xeon 2.3 GHz (CPU) and P100 (GPU, Sinkhorn only). Our experiment shows that Alg. 2 drastically outperforms B-Munkres as the number of points increases. We interrupt B-Munkres at , after which computational time becomes an issue.

We now merge Alg. 1 and Alg. 2 in order to perform unsupervised clustering via -means on PDs. We work with the 3D-shape database provided by Sumner & Popović and generate diagrams in the same way as in (Carrière et al., 2015b), working in practice with diagrams with to points each. The database contains classes: camel, cat, elephant, horse, head and face. In practice, this unsupervised clustering algorithm detects two main clusters: faces and heads on one hand, camels and horses on the other hand are systematically grouped together. Fig. 6 illustrates the convergence of our algorithm and the computed centroids for the aforementioned clusters.

Figure 6: Illustration of our -means algorithm. From left to right: diagrams extracted from horses and camels plot together (one color for each diagram); the centroid they are matched with provided by our algorithm; 20 diagrams of head and faces; along with their centroid; decrease of the objective function. Running time depends on many parameters along with the random initialization of -means. As an order of magnitude, it takes from to minutes with this PD dataset on a P100 GPU.

6 Conclusion

In this work, we took advantage of a link between PD metrics and optimal transport to leverage and adapt entropic regularization for persistence diagrams. Our approach relies on matrix manipulations rather than combinatorial computations, providing parallelization and efficient use of GPUs. We provide bounds to control approximation errors. We use these differentiable approximations to estimate barycenters of persistence diagrams significantly faster than existing algorithm, and showcase their application by clustering thousand diagrams built from real data. We believe this first step will open the way for new statistical tools for TDA and ambitious data analysis applications of persistence diagrams.

Appendix A Supplementary material

a.1 Omitted proofs from Section 3

Diagram metrics as optimal transport:

We recall that we consider and two persistence diagrams with respectively points and points , , and is the cost matrix with block structure

Proof of Prop. 1.

Let and . Since are point measures, that is discrete measures of same mass with integer weights at each point of their support, finding is an assignment problem of size as introduced in §2. It is equivalent to finding an optimal matching representing some permutation for the cost matrix built from by repeating the last line in total times, the last column in total times, and replacing the lower right corner by a matrix of zeros. The optimal defines a partial matching between and , defined by iff , . Such pairs of points induce a cost , while other points (referred to as unmatched) induce a cost . Then:

Error control due to discretization:

Let be two diagrams and their respective representations as histograms. For two histograms, where are diagrams deduced from respectively by moving any mass located at to , inducing at most an error of for each point. We thus identify and in the following. We recall that is a distance over persistence diagrams and thus verify triangle inequality, leading to:

Thus, the error made is upper bounded by .

Propositions 2, 3, 4:

We keep the same notations as in the core of the article and give details regarding the iteration schemes provided in the paper.

Proof of prop 2.

Given an histogram and a mass , one can observe that (see below):

(17)

In particular, the operation can be perform by only manipulating matrices in . Indeed, observe that:

So we have:

And thus we have in our case:

(18)

where designs the Froebenius dot product between two histograms . Note that these computations only involves matrix product of size . ∎

Proof of prop 3.

Thus, we finally have:

And finally, taking the bin into considerations,

Remark: First term correspond to the cost of effective mapping (point to point) and the two others to the mass mapped to the diagonal. ∎

To address the proof of the last proposition, we recall below the rounding_to_feasible algorithm introduced by Altschuler et al.; and denotes respectively the first and second marginal of a matrix .

1:  Input: , desired marginals .
2:  Output: close to .
3:  
4:  
5:  
6:  
7:  
8:  return
Algorithm 3 Rounding algorithm of Altschuler et al. (2017)
Proof of prop 4.

Start by observing that and can be computed in expected complexity. Indeed, we have by definition (and straightforward computations): The first marginal of is:

And the second marginal is:

We know that and can be computed using previous propositions.

Now we look at the transport cost computation:

The first term is the transport cost induced by a rescaling of and can be computed with Prop 3.

Look at the second term. Without considering the additional bin , we have:

So when we consider our framework (with ):

Putting things together finally proves the claim. ∎

a.2 Omitted proofs from Section 4

We start to observe that does not have local minimum (while does). For , we extend the euclidean norm by the distance from to its orthogonal projection onto the diagonal . In particular, . We denote by the corresponding cost function (continuous analogue of the matrix defined in (8)).1

Proposition (Convexity of ).

For any two measures and , we have:

(19)
Proof.

We denote by the dual variables involved when computing the optimal transport plan between and . Note that maximum are taken over the set (with ):

Tightness of the relaxation.

The following result states that the minimization problem (15) is a tight relaxation of the problem considered by Turner et al. in sense that global minimizers of (which are, by definition, persistence diagrams) are (global) minimizers of .

Theorem 1.

Let be a set of persistence diagrams. Diagram has mass , while denotes the total mass of the dataset. Consider the normalized dataset defined by . Then the functional

(20)

where has the same minimizers as (15).

This allows to apply known results from OT theory, linear programming, and integrality of solutions of LPs with totally unimodular constraint matrices and integral constraint vectors (Schrijver, 1998), which provides results on the tightness of our relaxation.

Corollary (Properties of barycenters for PDs).

Let be a minimizer of (15). Then verifies:

  1. (Carlier et al., 2015) Localization: minimizes for some . This function admit a unique minimizer in , thus the support of is discrete.

  2. (Schrijver, 1998) admits persistence diagrams (that is point measures) as minimizers (so does ). More precisely, minimizers of are exactly given by measures in the convex hull of global minimizers of .

We now focus on proof of Theorem 1. It involves intermediates results.

Lemma (Adding mass on diagonal trick).

For , we have:

(21)

i.e. we can add the same mass onto the virtual point to both measures without changing the transport cost.

Proof.

First of all, since , adding same mass onto cannot increase the optimal transport cost: keep previous optimal transport plan and extend it by mapping the additional mass to itself. It is still a transport plan and it has the same cost as the previous optimal transport cost. Therefore: