Multiscale Strategies for Computing Optimal Transport
Abstract
This paper presents a multiscale approach to efficiently compute approximate optimal transport plans between point sets. It is particularly well-suited for point sets that are in high dimensions, but are close to being intrinsically low-dimensional. The approach is based on an adaptive multiscale decomposition of the point sets. The multiscale decomposition yields a sequence of optimal transport problems that are solved in a top-to-bottom fashion, from the coarsest to the finest scale. We provide numerical evidence that this multiscale approach scales approximately linearly, in time and memory, in the number of nodes, instead of quadratically or worse for a direct solution. Empirically, the multiscale approach results in less than one percent relative error in the objective function. Furthermore, the multiscale plans constructed are of interest by themselves, as they may be used to introduce novel features and notions of distances between point sets. An analysis of sets of brain MRIs based on optimal transport distances illustrates the effectiveness of the proposed method on a real-world data set. The application demonstrates that multiscale optimal transport distances have the potential to improve on state-of-the-art metrics currently used in computational anatomy.
Mauro Maggioni mauro.maggioni@jhu.edu
Kitware, NC, U.S.A
Department of Mathematics, Applied Mathematics, Institute for Data Intensive Engineering and Science, Johns Hopkins University, Baltimore, MD, U.S.A.
Editor: Nikos Vlassis
1 Introduction
The study of maps between shapes, manifolds and point clouds is of great interest in a wide variety of applications. There are many data types, e.g. shapes (modeled as surfaces), images, sounds, and many more, where a similarity between a pair of data points involves computing a map between the points, and the similarity is a functional of that map. The map between a pair of data points, however, contains much more information than the similarity measure alone, and the study of networks of such maps has been successfully used to organize, extract functional information and abstractions, and help regularize estimators in large collections of shapes (M. Ovsjanikov and Guibas, 2010; Qixing Huang, 2014; Huang and Guibas, 2013). The family of maps to be considered depends on the type of shape, manifold or point cloud, as well as on the choice of geometric features to be preserved in a particular application. These considerations are not restricted to data sets where each point is naturally a geometric object: high-dimensional data sets of non-geometric nature, from musical pieces to text documents to trajectories of high-dimensional stochastic dynamical systems, are often mapped, via feature sets, to geometric objects. The considerations above therefore apply to a very wide class of data types.
In this paper we are interested in the problem where each object is a point cloud – a set of points in $\mathbb{R}^D$ – and will develop techniques for computing maps from one point cloud to another, in particular in the situation where $D$ is very large, but the point clouds are close to being low-dimensional, for example they may be samples from a $d$-dimensional smooth manifold ($d \ll D$). The two point clouds may have a different number of points, and they may arise from a sample of a low-dimensional manifold perturbed by high-dimensional noise (for more general models see the works by Little et al. (2012), Maggioni et al. (2016) and Liao and Maggioni (2016)). In this setting we have to be particularly careful in both the choice of maps and in their estimation, since sampling and noise have the potential to cause significant perturbations.
We find optimal transport maps rather well-suited for these purposes. They automatically handle the situation where the two point clouds have different cardinalities, and they handle noise, and even changes in dimensionality, in a robust fashion; dimensionality is typically ill-defined for point clouds arising from real-world data (Little et al., 2012). Optimal transport has a very long history in a variety of disciplines and arises naturally in a wide variety of contexts, from optimization problems in economics and resource allocation, to mathematics and physics, to computer science (e.g. network flow algorithms). Thus, applications of optimal transport range from logistics and economics (Beckmann, 1952; Carlier et al., 2008), geophysical models (Cullen, 2006), and image analysis (Rubner et al., 1998; Haker et al., 2004) to machine learning (Cuturi and Doucet, 2014; Cuturi and Avis, 2014). Despite these widespread applications, the efficient computation of optimal transport plans remains challenging, especially in complex geometries and in high dimensions.
For point sets the optimal transport problem can be solved by a specialized linear program, the minimum network flow problem (Ahuja et al., 1993; Tarjan, 1997). The minimum network flow problem has been extensively studied in the operations research community and several fast algorithms exist. However, these algorithms, at least on desktop hardware, do not scale beyond a few thousand source and target points. Our framework extends the applications of these algorithms to problem instances several orders of magnitude larger, under suitable assumptions on the geometry of the data. We exploit a multiscale representation of the source and target sets to reduce the number of variables in the linear program and quickly find good initial solutions, as illustrated in Figure 1. The optimal transport problem is solved (with existing algorithms) at the coarsest scale and the solution is propagated to the next scale and refined. This process is repeated until the finest scale is reached. This strategy, discussed in detail in Section 3, is adaptable to memory limitations and speed versus accuracy tradeoffs. For some of the refinement strategies, it is guaranteed to converge to the optimal solution.
Our approach draws from a varied set of work that is briefly summarized in Section 2. The proposed approach generalizes and builds on previous and concurrently developed hierarchical methods (Glimm and Henscheid, 2013; Schmitzer and Schnörr, 2013; Schmitzer, 2015; Oberman and Ruan, 2015). The work in this paper adds the following contributions:

The description of a general multiscale framework for discrete optimal transport that can be applied in conjunction with a wide range of optimal transport algorithms.

A set of propagation and refinement heuristics, including approaches that are similar to and/or refine existing ones (Glimm and Henscheid, 2013; Oberman and Ruan, 2015; Schmitzer, 2015), as well as novel ones. In particular we propose a novel propagation strategy based on capacity restrictions of the network flow problem at each scale. This new approach proves to be very efficient and accurate in practice. Overall, the heuristics empirically result in a linear increase in computation time with respect to data set size.

An implementation in the R package mop that allows the combination of multiple heuristics to tailor speed and accuracy to the requirements of particular applications.
Compared to other linear programming based approaches, the multiscale approach results in a speedup of up to multiple orders of magnitude in large problems, and makes it possible to approximately solve transportation problems several orders of magnitude larger than previously possible. Comparison to PDE-based approaches is difficult, as PDE-based methods are limited to low-dimensional domains and specific cost metrics. The proposed framework is demonstrated on several numerical examples and compared to the state-of-the-art approximation algorithm by Cuturi (2013).
2 Background
Optimal transport is the problem of minimizing the cost of moving a source probability distribution to a target probability distribution, given a function that assigns costs to moving mass from source to target locations. The classical formulation by Monge (1781) considers the minimization over mass-preserving mappings. Later Kantorovitch (1958) considered the same problem but as a minimization over couplings between the two probability measures, which permits splitting mass from a single source across multiple target locations. More precisely, for two probability measures $\mu$ and $\nu$ on probability spaces $X$ and $Y$ respectively, a coupling of $\mu$ and $\nu$ is a measure $\pi$ on $X \times Y$ such that the marginals of $\pi$ are $\mu$ and $\nu$, i.e. $\pi(A \times Y) = \mu(A)$ for all (measurable) sets $A \subseteq X$ and $\pi(X \times B) = \nu(B)$ for all (measurable) sets $B \subseteq Y$. We denote by $\Pi(\mu, \nu)$ the set of couplings between $\mu$ and $\nu$. Informally, we may think of $\pi(x, y)$ as the amount of infinitesimal mass to be transported from source $x$ to destination $y$, with the condition that $\pi$ is a coupling guaranteeing that the source mass is distributed according to $\mu$ and the destination mass is distributed according to $\nu$. Such couplings always exist: we always have the trivial coupling $\pi = \mu \otimes \nu$. The trivial coupling is uninformative: every source mass is transported to the same target distribution $\nu$. In Monge’s formulation the coupling is restricted to the special form $\pi(dx, dy) = \mu(dx)\,\delta_{T(x)}(dy)$, where $\delta_{T(x)}$ is the Dirac measure with mass at $T(x)$, and $T$ is a function $X \to Y$: in this case the coupling is “maximally informative” in the sense that there is a function $T$ mapping each source $x$ to a single destination $T(x)$; in particular the mass at $x$ is not split into multiple portions that are shipped to different target $y$’s.
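In the discrete case the trivial coupling is just the outer product of the two weight vectors; a few lines of NumPy (our illustration, not code from the paper) verify that it has the required marginals:

```python
import numpy as np

mu = np.array([0.2, 0.3, 0.5])   # source weights (sum to 1)
nu = np.array([0.6, 0.4])        # target weights (sum to 1)

# Trivial coupling pi = mu (x) nu: every source spreads its mass like nu.
pi = np.outer(mu, nu)

assert np.allclose(pi.sum(axis=1), mu)   # first marginal is mu
assert np.allclose(pi.sum(axis=0), nu)   # second marginal is nu
```

Any nonnegative matrix with these row and column sums is a valid coupling; optimal transport singles out the cheapest such matrix for a given cost.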
To define optimal transport and optimal couplings, we need a cost function $c$ on $X \times Y$, with $c(x, y)$ representing the work or cost needed to move a unit of mass from $x$ to $y$. Then for every coupling $\pi$ we may define the cost of $\pi$ to be
$$C(\pi) = \mathbb{E}_{(X,Y) \sim \pi}\left[c(X, Y)\right] = \int_{X \times Y} c(x, y)\, d\pi(x, y) \qquad (1)$$
with $(X, Y)$ a pair of random variables distributed according to $\pi$. An optimal coupling minimizes this cost over all choices of couplings. When seeking an optimal transportation plan of Monge type, the above becomes $C(T) = \int_X c(x, T(x))\, d\mu(x),$
and one minimizes over all (measurable) functions $T : X \to Y$ pushing $\mu$ forward onto $\nu$. One often designs the cost function in an application-dependent fashion, and the above framework is extremely general. When $X = Y$ is a metric space with respect to a distance $\rho$ (with suitable technical conditions relating the metric and the measures $\mu, \nu$, that we will not delve into, referring the reader to Villani (2009)), then distinguished choices for the cost are those related to the metric structure. The natural choice $c(x, y) = \rho(x, y)^p$, for some $p \ge 1$, leads to the definition of the Wasserstein-Kantorovich-Rubinstein metric on the space of probability measures on $X$:
$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times X} \rho(x, y)^p \, d\pi(x, y) \right)^{1/p} \qquad (2)$$
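In one dimension, with cost $|x - y|^p$ and two equal-size, equally weighted samples, (2) reduces to matching sorted samples; this quantile-matching identity is a standard fact, and the short NumPy sketch below (ours, not the paper's) checks it on a rigid shift:

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """W_p between two equal-size, equally weighted samples on the line:
    the optimal coupling matches sorted samples (quantile matching)."""
    x, y = np.sort(x), np.sort(y)
    return (np.abs(x - y) ** p).mean() ** (1.0 / p)

x = np.array([0.0, 1.0, 2.0])
y = x + 0.5                      # a rigid shift by 0.5
# wasserstein_1d(x, y, p) == 0.5 for every p >= 1
```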
Computational solutions to optimal transport split roughly into two settings: approaches based on the solution of partial differential equations derived from the continuous optimal transport formulation, briefly discussed in Section 2.1, and, more relevant to this paper, combinatorial optimization methods that directly solve for a discrete optimal transport plan, discussed in Section 2.2.
2.1 Continuous Optimal Transport
When at least the source distribution admits a density, and when the cost function is the squared Euclidean distance, the optimal coupling is deterministic, i.e. there exists a transport map, and the optimal solution is the gradient of a convex function (Brenier, 1991). This has been exploited to solve the optimal transport problem by numerical partial differential equation approaches (Benamou and Brenier, 2000; Angenent et al., 2003; Haker et al., 2004; Iollo and Lombardi, 2011; Papadakis et al., 2013; Benamou et al., 2014).
An alternative formulation proposed by Aurenhammer et al. (1998) shows that the optimal transport from a source density to a target distribution given by a set of weighted Dirac deltas can be solved through a finite-dimensional unconstrained convex optimization. Mérigot (2011) proposes a multiscale approach for the formulation of Aurenhammer et al. (1998).
Both the numerical PDE-based approaches and the unconstrained convex optimization require a discretization of the full domain, which is generally not feasible for higher-dimensional domains. For arbitrary cost functions and distributions the optimal transport problem typically does not result in a deterministic coupling and cannot be solved through PDE-based approaches.
2.2 Discrete Optimal Transport and Linear Programming
For two discrete distributions $\mu = \sum_{i=1}^{n} \mu_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} \nu_j \delta_{y_j}$, with $c_{ij} = c(x_i, y_j)$, the optimal transport problem is equivalent to the linear program
$$\min_{\pi \ge 0} \; \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij}\, \pi_{ij} \quad \text{subject to} \quad \sum_{j=1}^{m} \pi_{ij} = \mu_i, \quad \sum_{i=1}^{n} \pi_{ij} = \nu_j \qquad (3)$$
The solution is called an optimal coupling $\pi^*$, and the minimum value, attained at $\pi^*$, is called the cost of $\pi^*$, or the optimal cost of the transport problem, and is denoted by $C(\pi^*)$. The constraints enforce that $\pi$ is a coupling. The variables $\pi_{ij}$ correspond to the amount of mass transported from source $x_i$ to target $y_j$, at cost $c_{ij}$. The linear constraints are of rank $n + m - 1$: when $n + m - 1$ of the constraints are satisfied (say, all $n$ constraints of the source density and all but one of the constraints of the target density), the remaining constraint follows. Since the sum of outgoing mass is equal to the sum of incoming mass, i.e. $\sum_i \mu_i = \sum_j \nu_j$, it follows that all constraints must be satisfied. The optimal solution lies, barring degeneracies, on a corner of the polytope defined by the constraints, i.e. it is a basic feasible solution. This implies that exactly $n + m - 1$ entries of the optimal coupling are nonzero, i.e. $\pi^*$ is a sparse matrix. The optimal coupling is a Monge transport map if and only if all the mass of a source is transported to exactly one target location. A Monge solution does not exist for every optimal transport problem; in fact, a small perturbation of the masses will always suffice to make a Monge solution impossible.
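For small instances the linear program (3) can be solved directly with a generic LP solver; the helper below (`discrete_ot`, our name, using `scipy.optimize.linprog`, not the paper's code) makes the row-sum and column-sum constraints explicit, and the toy instance shows a mass split that no Monge map could produce:

```python
import numpy as np
from scipy.optimize import linprog

def discrete_ot(mu, nu, C):
    """Solve the Kantorovich linear program (3) directly:
    minimize <C, P> subject to row sums mu, column sums nu, P >= 0."""
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j P[i, j] = mu[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i P[i, j] = nu[j]
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, m), res.fun

# One source of mass 1 split across two targets: the Kantorovich relaxation
# allows the split, while no Monge map exists for this instance.
mu = np.array([1.0]); nu = np.array([0.5, 0.5])
C = np.array([[1.0, 2.0]])
P, cost = discrete_ot(mu, nu, C)   # P = [[0.5, 0.5]], cost = 1.5
```

The dense constraint matrix makes the quadratic growth in the number of variables concrete: this formulation is exactly what the multiscale strategy later avoids solving at full size.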
For source and target data sets with the same cardinality $n$ and equally weighted points, Kantorovich’s optimal transport problem reduces to an assignment problem, whose solution is a Monge optimal transport map. In this special case, the optimal transport problem can be efficiently solved by the Hungarian algorithm (Kuhn, 1955). The assignment problem results in a degenerate linear program, since only $n$ entries are nonzero (instead of $2n - 1$). We can therefore think of optimal transport as a robust version of assignments. This can also be seen from the point of view of convexity: in the assignment problem, $\pi$ is a permutation matrix deciding to which $y_j$ each $x_i$ is transported. The convex hull of permutation matrices is exactly the set of doubly stochastic matrices, to which a coupling belongs as a consequence of the constraints in (3).
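In the equal-cardinality, equal-weight case, SciPy's Hungarian-style solver returns the Monge permutation directly (a toy instance of ours, not from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

x = np.array([0.0, 1.0, 4.0])
y = np.array([0.5, 1.5, 3.0])
C = np.abs(x[:, None] - y[None, :])      # cost matrix c(x_i, y_j) = |x_i - y_j|

rows, cols = linear_sum_assignment(C)    # optimal assignment (a permutation)
# cols == [0, 1, 2]: each x_i is matched to y_i; C[rows, cols].sum() == 2.0
```

The returned permutation is exactly the support of the optimal coupling: $n$ nonzero entries, each carrying mass $1/n$.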
For point sets with different cardinalities and/or points with different masses the optimal transport problem can be solved by a linear program and is a special case of the minimum cost network flow problem. The minimum cost flow problem is well studied and a number of algorithms (Ford and Fulkerson, 1956; Klein, 1967; Cunningham, 1976; Goldberg and Tarjan, 1987; Bertsekas and Tseng, 1988; Orlin, 1997) exist for its solution. This discrete solution approach is not constrained to specific cost functions and can work with arbitrary cost functions.
However, the linear programming approach neglects possibly useful geometric properties of the measures and the cost function. Our work makes assumptions about the underlying geometry of the measure spaces and the associated cost function, and in this way is a blend of the low-dimensional “geometric PDE” approaches and the discrete non-geometric optimization approaches. It exploits the geometric assumptions to relieve the shortcomings of either approach: it scales to high-dimensional data, provided that the intrinsic dimension is low in a suitable sense, and does not require a mesh data structure. At the same time we use the geometry of the data to speed up the linear program, which per se does not leverage geometric structures.
The refinement strategies of the proposed multiscale approach add subsets of paths among all pairwise paths at each subsequent scale to improve the optimal transport plan. This strategy of adding paths is akin to column generation approaches (Desrosiers and Lübbecke, 2005). Column generation, first developed by Dantzig and Wolfe (1960) and Ford and Fulkerson (1956), reduces the number of variables in a linear program by solving a smaller linear program on a subset of the original variables and introducing new variables on demand. However, the proposed approach exploits the geometry of the problem instead of relying on an auxiliary linear program (Dantzig and Wolfe, 1960) or shortest path computations (Ford and Fulkerson, 1956) to detect the entering variables.
2.3 Approximation Strategies
In the computer vision literature, the Earth Mover’s distance, or equivalently the Wasserstein-1 distance, which is simply the cost of the optimal coupling, is a successful similarity measure for image retrieval (Rubner et al., 1998). In this application the transport plan is not of interest, but only the final transport cost. For this purpose Indyk and Thaper (2003), Shirdhonkar and Jacobs (2008) and Andoni et al. (2008) developed algorithms that compute an approximate cost but do not yield a transport plan. Some of these approaches are based on the dual formulation of optimal transport, which involves testing against Lipschitz functions, and observing that Lipschitz functions may be characterized by decay properties of their wavelet coefficients. In this sense these approaches are multiscale as well.
To speed up computations in machine learning applications Cuturi (2013) proposes to smooth transport plans by adding a maximum entropy penalty to the optimal transport formulation. The resulting optimization problem is efficiently solved through matrix scaling with Sinkhorn fixedpoint iterations. Because of the added regularization term, the solution will in general be different from the optimal transportation. It may however be the case that these particular (or perhaps other) regularized solutions are better suited for certain applications.
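The entropic regularization of Cuturi (2013) can be sketched in a few lines: scale the rows and columns of the kernel matrix $K = e^{-C/\varepsilon}$ until both marginals match (the step count and $\varepsilon$ below are illustrative choices of ours; this is a minimal sketch without the log-domain stabilization used in practice):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.05, n_iter=500):
    """Entropy-regularized OT: alternately rescale the rows and columns
    of K = exp(-C/eps) until both marginals match (Sinkhorn iteration)."""
    K = np.exp(-C / eps)
    v = np.ones_like(nu)
    for _ in range(n_iter):
        u = mu / (K @ v)                 # fix row sums to mu
        v = nu / (K.T @ u)               # fix column sums to nu
    P = u[:, None] * K * v[None, :]      # smoothed transport plan
    return P, (P * C).sum()

mu = np.array([0.5, 0.5])
nu = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
P, cost = sinkhorn(mu, nu, C)            # cost approaches the optimum 0 as eps -> 0
```

Each iteration costs only two matrix-vector products, which is why this scheme scales so well, at the price of a plan that is dense and slightly blurred relative to the sparse LP optimum.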
2.4 Related Work
Very recently a number of approaches have been proposed to solve the optimal transport problem in a multiscale fashion (Glimm and Henscheid, 2013; Schmitzer and Schnörr, 2013; Schmitzer, 2015; Oberman and Ruan, 2015). Glimm and Henscheid (2013) design an iterative scheme to solve a discrete optimal transport problem in reflector design and propose a heuristic for the iterative refinements based on linear programming duality. This iterative scheme can be interpreted as a multiscale decomposition of the transport problem based on the geometry of the source and target sets. The proposed potential refinement strategy extends the heuristic proposed by Glimm and Henscheid (2013) to guarantee optimal solutions and adds a more efficient computation strategy: their approach requires checking all possible variables at the next scale. In Section 3.4.1 we introduce a variation of the approach by Glimm and Henscheid (2013) by adding a branch and bound strategy to avoid checking all variables, and an iterative procedure that guarantees optimal solutions.
Schmitzer and Schnörr (2013) propose a multiscale approach on grids that uses a refinement strategy based on spatial neighborhoods, akin to the neighborhood refinement described in Section 3.4.2. Schmitzer (2015) uses a multiscale approach to develop a modified auction algorithm with guaranteed worst-case complexity and optimal solutions. We use data structures that enable us to quickly construct neighborhoods, even for point clouds that live in high dimensions but have low intrinsic dimension; we also exploit these structures to avoid computing all the pairwise possible costs, as well as to compute the candidate neighborhoods in our propagation steps. This can result in substantial savings in the scaling of the algorithm, from quadratic to nearly linear in the number of points. We are therefore able to scale to larger problems, solving on a laptop problems an order of magnitude larger than those in Oberman and Ruan (2015), for example, and showing linear scaling on a large range of scales. Schmitzer (2015) uses the c-cyclical monotonicity property of optimal transport plans (Villani, 2009, Chapter 5) to construct “shielding neighborhoods” that permit the exclusion of paths from further consideration. The idea of shielding neighborhoods is combined with a multiscale strategy that permits initial neighborhood estimates to be quickly refined.
Finally, in our work we emphasize that the multiscale construction is motivated not only by its computational advantages, but also as a way of revealing possibly important features of the optimal transportation map. As we show in Section 5, features collected from the multiscale optimal transportation maps lead to improved predictors for brain conditions. More generally, we expect multiscale properties of optimal transportation maps to be useful in a variety of learning tasks; the connections between learning and optimal transportation are still a very open field, to be explored and exploited.
3 Multiscale Optimal Transport
Solving the optimal transport problem for two point sets $X$ and $Y$, with $n$ and $m$ points respectively, directly requires $nm$ variables, or paths, to consider. In other words, the number of paths along which mass can be transported grows quadratically in the number of points and quickly yields exceedingly large problems. The basic premise of the multiscale strategy is to solve a sequence of transport problems based on increasingly accurate approximations of the source and target point sets. The multiscale strategy helps to reduce the problem size at each scale by using the solution from the previous scale to inform which paths to include in the optimization at the next finer scale. Additionally, the solution at the previous scale helps to find a good initialization for the current scale, which results in fewer iterations to solve the reduced-size linear program.
The multiscale algorithm (see Algorithm 1 and Figure 2 for a visual illustration) comprises three key elements:

A way of coarsening the set of source points $X$ and measure $\mu$ in a multiscale fashion, yielding a chain
(4) $\quad X = X_J \to X_{J-1} \to \cdots \to X_0$, connecting the scales from fine to coarse, with $X_j$ of decreasing cardinality as the scale $j$ decreases, and the discrete measure $\mu_j$ “representing” a coarsification of $\mu$ at scale $j$, with $\mathrm{supp}(\mu_j) = X_j$ (the support of $\mu_j$ is the set of points with positive measure). Similarly for $Y$ and $\nu$ we obtain the chain
(5) $\quad Y = Y_J \to Y_{J-1} \to \cdots \to Y_0$. This coarsening step is described in Section 3.1 and the resulting multiscale family of transport problems is discussed in Section 3.2.

A way of propagating a coupling solving the transport problem at scale $j$ to a coupling at scale $j + 1$. This is described in Section 3.3.

A way of refining the propagated solution towards the optimal coupling at scale $j + 1$. This is described in Section 3.4.
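The three elements above can be sketched schematically. The grid-based `coarsen` and the placeholder `solve`/`propagate`/`refine` callables below are our simplified stand-ins, not the paper's implementation, which uses the adaptive partitions of Section 3.1 and the strategies of Sections 3.3 and 3.4:

```python
import numpy as np

def coarsen(points, weights, cell):
    """One coarsening step: merge points sharing an axis-aligned cell of
    side `cell` (a simple stand-in for the multiscale partition)."""
    keys = np.floor(points / cell).astype(int)
    centers, masses = [], []
    for k in np.unique(keys, axis=0):
        idx = np.all(keys == k, axis=1)
        w = weights[idx].sum()
        centers.append((weights[idx][:, None] * points[idx]).sum(0) / w)  # weighted mean
        masses.append(w)
    return np.array(centers), np.array(masses)

def multiscale_ot(X, mu, Y, nu, solve, propagate, refine, n_scales=3):
    """Schematic coarse-to-fine driver: build the chains (4) and (5), solve
    exactly at the coarsest scale, then propagate and refine downwards."""
    levels = [(X, mu, Y, nu)]
    for j in range(1, n_scales):
        Xc, muc = coarsen(*levels[-1][:2], cell=2.0 ** j)
        Yc, nuc = coarsen(*levels[-1][2:], cell=2.0 ** j)
        levels.append((Xc, muc, Yc, nuc))
    plan = solve(*levels[-1])                 # exact solve at the coarsest scale
    for level in reversed(levels[:-1]):       # sweep back from coarse to fine
        plan = refine(propagate(plan, level), level)
    return plan

X = np.array([[0.0], [0.4], [3.0], [3.4]]); mu = np.full(4, 0.25)
Y = np.array([[0.1], [0.5], [3.1], [3.5]]); nu = np.full(4, 0.25)
# Placeholder callables: product-coupling "solver", identity propagate/refine.
plan = multiscale_ot(X, mu, Y, nu,
                     solve=lambda Xc, muc, Yc, nuc: np.outer(muc, nuc),
                     propagate=lambda plan, level: plan,
                     refine=lambda plan, level: np.outer(level[1], level[3]))
```

With real solve/propagate/refine steps plugged in, the work at each scale is restricted to a small neighborhood of the coarser solution, which is the source of the empirically linear scaling.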
3.1 The Coarsening Step: Multiscale Approximations to $X$, $Y$, $\mu$ and $\nu$
To derive approximation bounds for the error of the multiscale transport problem at each scale, we rely on the notion of a regular family of multiscale partitions, formally described in Section 3.1.1. The multiscale partition is used to define approximations to $\mu$ and $\nu$ at all scales. An integral part of the definitions is that the constructions can be interpreted as a tree, with all nodes at a fixed height corresponding to one scale of the multiscale partitioning.
We start with some notation needed for the definition of the multiscale partitions. Let $(X, \rho, \mu)$ be a metric measure space with metric $\rho$ and finite measure $\mu$. Without loss of generality assume that $\mu(X) = 1$. The metric ball of center $x$ and radius $r$ is $B_r(x) = \{x' \in X : \rho(x, x') < r\}$. We say that $X$ has doubling dimension $d$ if every ball $B_r(x)$ can be covered by at most $2^d$ balls of radius $r/2$ (Assouad, 1983). Furthermore, a space has a doubling measure if $\mu(B_{2r}(x)) \lesssim \mu(B_r(x))$, i.e. if there exists a constant $C > 0$ such that for every $x \in X$ and $r > 0$ we have $\mu(B_{2r}(x)) \le C \mu(B_r(x))$. Here and in what follows, we say that $f \sim g$ if there are two constants $0 < c_1 \le c_2$ such that for every $z$ in the domain of both functions we have $c_1 g(z) \le f(z) \le c_2 g(z)$ (and therefore a similar set of inequalities holds with the roles of $f$ and $g$ swapped), and we say that $f$ and $g$ have the same order of magnitude; we write $f \lesssim g$ when only the upper inequality is required. Having a doubling measure implies having a doubling metric, and, up to changing the metric to an equivalent one, one may choose the same constant $d$ in the doubling condition for the metric and in that for the measure: we assume this has been done from now on. This family of spaces is rather general: it includes regular domains in $\mathbb{R}^d$, as well as smooth compact manifolds endowed with volume measures.
3.1.1 Multiscale Approximations to $X$ and $Y$
A regular family of multiscale partitions, with scaling parameter $\theta \in (0, 1)$, is a family of sets $\{C_{j,k}\}_{j, k}$, where $j$ denotes the scale and $k \in K_j$ indexes the sets at scale $j$, such that:

(i) for every scale $j$, the sets $\{C_{j,k}\}_{k \in K_j}$ form a partition of $X$, i.e. they are disjoint and $X = \bigcup_{k \in K_j} C_{j,k}$;

(ii) a set $C_{j+1,k'}$ either does not intersect a set $C_{j,k}$, or it is completely contained in it;

(iii) there exists a constant $c_1 > 0$ such that for all $j, k$ we have the diameter $\mathrm{diam}(C_{j,k}) \le c_1 \theta^j$;

(iv) each $C_{j,k}$ contains a “center” point $c_{j,k}$ such that $B_{c_2 \theta^j}(c_{j,k}) \subseteq C_{j,k}$, for some constant $c_2 > 0$ independent of $j$ and $k$.
To ease the notation we will fix $\theta = 1/2$ in what follows, but the constructions and results hold, mutatis mutandis, for general $\theta$. The properties (i) and (ii) above imply that there exists a tree $T$, with nodes at scale $j$ (i.e. at distance $j$ from the root) in bijection with $\{C_{j,k}\}_{k \in K_j}$, such that $C_{j+1,k'}$ is a child of $C_{j,k}$ if and only if $C_{j+1,k'}$ is contained in $C_{j,k}$. Moreover properties (iii) and (iv), together with the properties of spaces with a doubling measure, imply that $\mu(C_{j,k}) \sim 2^{-jd}$ and $\# K_j \lesssim 2^{jd}$. These partitions are classical in harmonic analysis, mimicking dyadic cubes in Euclidean space, and they have recently been used to construct multiscale decompositions of data sets in high dimensions (Little et al., 2012; Allard et al., 2012; Iwen and Maggioni, 2013; Chen et al., 2012). We say that $C_{j+1,k'}$, or even $c_{j+1,k'}$, is a child of $C_{j,k}$ (respectively, of $c_{j,k}$) if $C_{j+1,k'} \subseteq C_{j,k}$, and that such $C_{j,k}$ (respectively $c_{j,k}$) is a parent of $C_{j+1,k'}$ (resp. $c_{j+1,k'}$).
Given two discrete sets $X$ and $Y$ in a doubling metric space of homogeneous type with dimension $d$, we construct the corresponding families of multiscale partitions $\{C^X_{j,k}\}$ and $\{C^Y_{j,k}\}$ (we assume the same range of scales $j = 0, \dots, J$ to keep the notation simple). The construction of the multiscale approximations is illustrated in Figure 2.
The space $X_j$ will be the set of centers of the partition at scale $j$, namely the set $\{x_{j,k}\}_{k \in K_j}$, of cardinality $\# K_j$. The measures $\mu$ and $\nu$ may be coarsened in the natural way, by letting $\mu_J := \mu$ and defining $\mu_j$ recursively on $X_j$ by
$$\mu_j(x_{j,k}) = \sum_{k' :\, C^X_{j+1,k'} \subseteq C^X_{j,k}} \mu_{j+1}(x_{j+1,k'}) \qquad (6)$$
and similarly for $\nu_j$. These are in fact projections of these measures, and may also be interpreted as conditional expectations with respect to the $\sigma$-algebra generated by the multiscale partitions. We can associate a point $x_{j,k}$ to each $C_{j,k}$ in various ways, by “averaging” the points $x_{j+1,k'}$, for $C_{j+1,k'}$ a child of $C_{j,k}$. At the finest scale we may let $x_{J,k} := c_{J,k}$ (this being a “center” for $C_{J,k}$ as in item (iv) in the definition of multiscale partitions), and then recursively we define the coarser centers from scale $j+1$ to scale $j$ in one of the following ways:

If the metric space is also a vector space, a natural definition of $x_{j,k}$ is a weighted average of the $x_{j+1,k'}$ corresponding to children: $x_{j,k} = \frac{1}{\mu_j(x_{j,k})} \sum_{k' :\, C_{j+1,k'} \subseteq C_{j,k}} \mu_{j+1}(x_{j+1,k'})\, x_{j+1,k'}.$

In general we can define $x_{j,k}$ as the point
$x_{j,k} = \operatorname{argmin}_{x \in C_{j,k}} \sum_{k' :\, C_{j+1,k'} \subseteq C_{j,k}} \mu_{j+1}(x_{j+1,k'})\, \rho(x, x_{j+1,k'})^p$
for some $p \ge 1$, typically $p = 1$ (median) or $p = 2$ (Fréchet mean).
Of course similar constructions apply to the space $Y$, yielding points $y_{j,k}$. We discuss algorithms for these constructions in Section 4.1.1.
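The two choices of center can be compared on a toy cell; the children's positions and masses below are hypothetical values of ours, not from the paper, and the outlier at $x = 10$ shows how the weighted mean and the (discrete) Fréchet-type center differ:

```python
import numpy as np

# Children centers and masses for one cell C_{j,k} (hypothetical values).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
w = np.array([0.45, 0.45, 0.10])

# Weighted average (vector-space case): pulled toward the outlier.
mean_center = (w[:, None] * pts).sum(0) / w.sum()

# Discrete p = 2 minimizer restricted to the children themselves:
costs = [(w * ((pts - p) ** 2).sum(1)).sum() for p in pts]
frechet_center = pts[int(np.argmin(costs))]   # robust to the outlier
```

The weighted mean lands at $x = 1.45$, off every data point, while the restricted minimizer stays at the child $x = 1$; this robustness is one reason to prefer the general metric-space construction on noisy data.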
3.1.2 Coarsening the cost function
The multiscale partition provides several ways to coarsen the cost function: for every $x_{j,k}$ and $y_{j,k'}$ we consider

the pointwise value
(7) $\quad c_j(x_{j,k}, y_{j,k'}) = c(x_{j,k}, y_{j,k'})$, where $x_{j,k}$ and $y_{j,k'}$ are defined in any of the ways above;

the local average
(8) $\quad c_j(x_{j,k}, y_{j,k'}) = \frac{1}{\# C^X_{j,k}\, \# C^Y_{j,k'}} \sum_{x \in C^X_{j,k}} \sum_{y \in C^Y_{j,k'}} c(x, y)$;
the local weighted average
where $\pi$ is the optimal or approximate transportation plan defined in (9); $k$ is the unique index for which $x \in C^X_{j,k}$, and $k'$ is the unique index for which $y \in C^Y_{j,k'}$.
3.2 Multiscale Family of Optimal Transport Problems
With the definitions of the multiscale family of coarser spaces $X_j$ and $Y_j$, corresponding measures $\mu_j$ and $\nu_j$, and corresponding cost $c_j$, we may consider, for each scale $j$, the following optimal transport problem:
$$\pi_j := \operatorname{argmin}_{\pi \in \Pi(\mu_j, \nu_j)} \sum_{k, k'} c_j(x_{j,k}, y_{j,k'})\, \pi(x_{j,k}, y_{j,k'}) \qquad (9)$$
The problems in this family are related to each other, and to the optimal transportation problem in the original spaces. We define the cost of a coupling $\pi$ at scale $j$ as
$$C_j(\pi) := \sum_{k, k'} c_j(x_{j,k}, y_{j,k'})\, \pi(x_{j,k}, y_{j,k'}) \qquad (10)$$
The cost $C_j(\pi_j)$ of the optimal coupling at scale $j$ is provably an approximation to the cost of the optimal coupling $\pi^*$ (which is equal to $\pi_J$):
Proposition 1
Proof Consider the coupling $\hat\pi_j$ induced at scale $j$ by the optimal coupling $\pi^*$, defined by
First of all, since $\{C^X_{j,k}\}_k$ and $\{C^Y_{j,k'}\}_{k'}$ are partitions, it is immediately verified that $\hat\pi_j$ is a coupling. Secondly, observe that
Since $C_j(\pi_j) \le C_j(\hat\pi_j)$ ($\pi_j$ being optimal), we obtain (12). When $c$ is Lipschitz with constant $L$, we have
with the constant as in the claim.
In the discrete, finite case that we are considering, $\pi_J = \pi^*$ trivially, since $X_J = X$, $Y_J = Y$ and $\mu_J = \mu$, $\nu_J = \nu$ by construction. If $\mu$ and $\nu$ were continuous, and at least when $c = \rho^p$ for some $p \ge 1$, then if $\mu$ and $\nu$ have finite $p$-th moment, we would obtain convergence of a subsequence of $(\pi_j)_j$ to $\pi^*$ by general results (e.g. as a simple consequence of Lemma 4.4 in Villani (2009)).
Note that the approximations do not guarantee that the transport plans are close in any sense but their cost. Consider the arrangement in Figure 3: the transport plans are close in cost, but the target locations of the sources remain far apart no matter how small the perturbation gets.
3.3 Propagation Strategies
The approximation bounds in Section 3.2 show that the optimal solution at scale $j$ is close to the optimal solution at scale $j+1$. This suggests that the solution at scale $j$ can provide a reasonably close to optimal initialization for scale $j+1$.
As proposed by Glimm and Henscheid (2013), the solution at a given scale can be interpolated at the next scale (or finer discretization). The most direct approach to initialize the transport problem at scale $j+1$, given the solution at scale $j$, is to distribute the mass $\pi_j(x_{j,k}, y_{j,k'})$ of each path equally among all paths between the children of $x_{j,k}$ and the children of $y_{j,k'}$.
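The equal-split interpolation just described can be sketched as follows; the dense-array plan and the `src_children`/`tgt_children` tree encoding are illustrative assumptions of ours, not the package's data structures:

```python
import numpy as np

def propagate(plan, src_children, tgt_children):
    """Spread each coarse path's mass uniformly over all child-to-child paths.
    src_children[k] / tgt_children[l] list the fine-scale indices of the
    children of coarse source node k / coarse target node l."""
    n = sum(len(c) for c in src_children)
    m = sum(len(c) for c in tgt_children)
    fine = np.zeros((n, m))
    for k, l in zip(*np.nonzero(plan)):          # mass-bearing coarse paths only
        ck, cl = src_children[k], tgt_children[l]
        fine[np.ix_(ck, cl)] = plan[k, l] / (len(ck) * len(cl))
    return fine

coarse_plan = np.array([[0.5, 0.0], [0.0, 0.5]])
fine_plan = propagate(coarse_plan, [[0, 1], [2]], [[0], [1, 2]])  # mass is conserved
```

Note that the resulting fine-scale matrix is feasible for the fine marginals by construction whenever each coarse marginal splits evenly among its children; in general it only serves as a warm start.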
This propagation strategy results in a reduction in the number of iterations required to find an optimal solution at the subsequent scale. This warm-starting alone is often not sufficient, however. At the finest scale this still requires the solution of a problem of size $n \times m$. This quickly reaches memory constraints as the number of points grows, and a single iteration of Newton’s method or a pivot step of a linear program becomes prohibitively slow. Thus, we consider reducing the number of variables, which substantially speeds up the algorithm, albeit we may lose guarantees on its computational complexity and/or its ability to achieve arbitrary accuracy, so that only numerical experiments will support our constructions. These reductions are achieved by considering only a subset of all possible paths at scale $j+1$.
To distinguish the optimal solution on a reduced set of paths at scale $j$ from the optimal solution over all paths, we introduce some notation. Let $P_j$ be the set of all possible paths between sources and targets at scale $j$. Let $\Gamma_j \subseteq P_j$ be the set of paths propagated from the previous solution (e.g. children of mass-bearing paths found at scale $j-1$). Let $\pi_j(\Gamma)$ be the optimal solution to the transport problem restricted to a set of paths $\Gamma$. With this notation, $\pi_j = \pi_j(P_j)$. The optimal coupling $\pi_j(\Gamma_j)$ on the reduced set of paths need not match the optimal coupling on all paths. However $\pi_j(\Gamma_j)$ does provide a starting point for the further refinements discussed in Section 3.4.
3.3.1 Simple Propagation
The most direct approach to reduce the number of paths considered at subsequent scales is to include only paths at scale $j+1$ whose endpoints are children of endpoints of mass-bearing paths at scale $j$. The optimal solution at scale $j$ has exactly $n_j + m_j - 1$ paths with nonzero weight, where $n_j$ and $m_j$ are the cardinalities of $X_j$ and $Y_j$. Thus, the number of paths at scale $j+1$ reduces to at most $(n_j + m_j - 1)\, a^2$, where $a$ is the maximal number of children of any node at scale $j$; in particular, $a \lesssim 2^d$ for a doubling space of dimension $d$. This reduces the number of variables from “quadratic”, $O(n_j m_j)$, to linear, $O(n_j + m_j)$. This propagation strategy by itself, however, often leads to a dramatic loss of accuracy in both the cost of the transportation plan and the transportation plan itself.
3.3.2 Capacity Constraint Propagation
This propagation strategy solves a modified minimum flow problem at scale $j$ in order to include additional paths at scale $j+1$ that are likely to be included in the optimal solution. This is achieved by adding a capacity constraint to the mass-bearing paths at scale $j$ in the optimal coupling $\pi_j$: the amount of mass of a mass-bearing path is constrained to a fraction of its current mass, chosen uniformly at random; the randomness is introduced to avoid degenerate constraints. The solution of this modified minimum-flow problem forces the inclusion of additional paths, as many as the number of constraints added. There are various options for adding capacity constraints; we propose to constrain all mass-bearing paths of the optimal solution at scale $j$. The capacity-constrained problem thus results in a solution with twice the number of paths as in the coupling $\pi_j$. The solution of the capacity-constrained minimum flow problem is propagated as before to the next scale.
To increase the likelihood of including the paths required to find an optimal solution at the next scale, the capacity-constrained procedure can be iterated multiple times. Each time, the mass-bearing paths in the modified solution are constrained and a new solution is computed. Each iteration at most doubles the number of mass-bearing paths, so the number of iterations controls how many paths are propagated to the next scale. Thus, the capacity constraint propagation strategy bounds the number of paths considered in the linear program. The optimal transport problem between a source set of n points and a target set of m points results in a linear program with n + m constraints and nm variables, and the optimal transport plan has at most n + m - 1 mass-bearing paths. It follows that the capacity constraint propagation strategy considers linear programs whose number of path variables is at most proportional to 2^k (n + m), where k is the number of iterations of the capacity propagation scheme. This results in a significant reduction in problem size, since at each scale we only consider a number of paths scaling as O(n + m) instead of O(nm).
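The capacity-constraint idea can be illustrated on a toy transport LP. The sketch below uses SciPy's linprog as a stand-in for the paper's network simplex solver: it solves a 2×2 problem, caps each mass-bearing path at a random fraction of its mass, and re-solves; the capped problem is forced to activate additional paths.

```python
import numpy as np
from scipy.optimize import linprog

# 2x2 transport problem, variables ordered x = (x00, x01, x10, x11)
C = np.array([0.0, 1.0, 1.0, 0.0])               # cost vector
A_eq = np.array([[1, 1, 0, 0],                   # source marginal rows
                 [0, 0, 1, 1],
                 [1, 0, 1, 0],                   # target marginal rows
                 [0, 1, 0, 1]], dtype=float)
b_eq = np.array([0.5, 0.5, 0.5, 0.5])

base = linprog(C, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 4, method="highs")
support = base.x > 1e-9                          # mass-bearing paths

# cap each mass-bearing path at a random fraction of its current mass
rng = np.random.default_rng(0)
bounds = [(0, rng.uniform(0.5, 0.9) * base.x[i]) if support[i] else (0, None)
          for i in range(4)]
capped = linprog(C, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")

print("paths before:", int(support.sum()),
      "after:", int((capped.x > 1e-9).sum()))    # paths before: 2 after: 4
```

In this example the base optimum uses only the two diagonal paths; capping them forces the off-diagonal paths to carry the remaining mass, doubling the number of mass-bearing paths as described above.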
3.4 Refinement Strategies
Solving the reduced transport problem at scale j, propagated from scale j-1, cannot guarantee an optimal solution at scale j. Propagating a suboptimal solution further may lead to an accumulation of errors. This section describes strategies to refine the reduced transport problem to find closer approximations, or even optimal transport plans, at each scale. These refinement strategies are essentially batch column generation methods (Desaulniers et al., 2002) that take advantage of the multiscale structure.
3.4.1 Potential Refinement
This refinement strategy exploits the potential functions, or dual solution, to determine additional paths to include, given the currently optimal solution on the reduced set of paths from the propagation. The dual formulation of the optimal transport problem can be written as:

max_{α, β} Σ_x α(x) μ(x) + Σ_y β(y) ν(y)  subject to  α(x) + β(y) ≤ c(x, y) for all paths (x, y), (14)

where μ and ν denote the source and target mass distributions. The functions α and β are called dual variables or potential functions. From the dual formulation it follows that at an optimal solution the reduced cost c(x, y) - α(x) - β(y) is greater than or equal to zero for every path. This also follows from the Kantorovich duality of optimal transport (Villani, 2009, Chapter 5).
The potential refinement strategy uses the potential functions α and β from the solution of the reduced problem to determine which additional paths to include. If the solution of the reduced problem is not optimal on all paths, then there exist paths with negative reduced cost. Thus, we check the reduced cost of all paths and include the ones with negative reduced cost. A direct implementation of this strategy would require checking all possible paths between the source and target points. To avoid checking all pairwise paths between the source and target point sets at the current scale, we introduce a branch and bound procedure on the multiscale structure that efficiently determines all paths with negative reduced cost.
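As a sketch of the reduced cost computation (again with SciPy's linprog rather than the paper's C++ solver), the example below solves a small transport LP, reads the potentials off the dual variables of the marginal constraints, and forms the reduced cost matrix, which is nonnegative at the optimum.

```python
import numpy as np
from scipy.optimize import linprog

# symmetric 3x3 cost with uniform marginals: the optimal plan is the diagonal
C = np.array([[0.0, 2.0, 4.0],
              [2.0, 0.0, 2.0],
              [4.0, 2.0, 0.0]])
n = 3
a = b = np.full(n, 1.0 / n)

# marginal constraints for variables x_ij stored row-major
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0        # source marginal rows
    A_eq[n + i, i::n] = 1.0                 # target marginal rows
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=[(0, None)] * (n * n), method="highs")

# potentials alpha, beta = dual variables of the marginal constraints
alpha, beta = res.eqlin.marginals[:n], res.eqlin.marginals[n:]
reduced = C - alpha[:, None] - beta[None, :]
plan = res.x.reshape(n, n)
print("min reduced cost:", float(reduced.min()))  # nonnegative up to tolerance
```

By complementary slackness, the mass-bearing (here: diagonal) paths have zero reduced cost, while all other paths have nonnegative reduced cost; a path with negative reduced cost would certify that the current reduced solution is not optimal over all paths.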
Let α and β be the dual variables, i.e., the potential functions, at the optimal solution on the reduced set of paths, for the source and target points respectively. Define the set of paths with negative reduced cost with respect to α and β:

N(α, β) = { (x, y) : c(x, y) - α(x) - β(y) < 0 }.

The potential refinement strategy now refines the propagated solution by including all negative reduced cost paths N(α, β), and takes the optimal transport plan on the enlarged path set. This new solution changes the potential functions, which in turn may require including additional paths. Thus the potential refinement strategy can be iterated, leading to transport plans with monotonically decreasing cost. Since a solution is optimal if and only if all reduced costs are nonnegative, this iterative strategy converges to the optimal solution.
The set of paths with negative reduced cost given α and β is determined by descending the tree and excluding nodes that cannot contain any negative reduced cost paths. This requires bounds on the potential function over all descendants of a node. The bound is obtained by storing at each target node of the multiscale decomposition the maximal value of β over any of its descendants at the current scale. Now, for each source node, the target multiscale decomposition is descended, but only towards nodes under which a path with negative reduced cost can exist.
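The branch and bound descent can be sketched as follows; the tree layout, cost function, and potential values are hypothetical illustrations. Each node stores the maximum of β over its descendant leaves, and a subtree is pruned whenever a cost lower bound shows that no negative reduced cost path can exist below it.

```python
class Node:
    """Minimal tree node for illustration: leaves carry a point, internal
    nodes a representative center and a radius covering all descendants."""
    def __init__(self, center, radius, children=None, point=None):
        self.center, self.radius = center, radius
        self.children = children or []
        self.point = point              # set on leaves only
        self.max_beta = None

def store_max_beta(node, beta):
    # store at each node the maximal potential of any descendant leaf
    if node.point is not None:
        node.max_beta = beta[node.point]
    else:
        node.max_beta = max(store_max_beta(c, beta) for c in node.children)
    return node.max_beta

def negative_paths(x, alpha_x, node, cost=lambda a, b: abs(a - b)):
    # lower bound on the cost from x to any descendant of this node
    lb = max(cost(x, node.center) - node.radius, 0.0)
    if lb - alpha_x - node.max_beta >= 0:
        return []                       # prune: no negative reduced cost below
    if node.point is not None:          # leaf: the bound above is exact
        return [node.point]
    out = []
    for c in node.children:
        out += negative_paths(x, alpha_x, c, cost)
    return out

leaves = {p: Node(p, 0.0, point=p) for p in (0, 1, 4, 5)}
tree = Node(2.5, 2.5, children=[Node(0.5, 0.5, children=[leaves[0], leaves[1]]),
                                Node(4.5, 0.5, children=[leaves[4], leaves[5]])])
store_max_beta(tree, {0: 0.2, 1: -0.1, 4: 1.5, 5: 0.0})
print(negative_paths(3, 0.5, tree))     # only the path to y = 4 is negative: [4]
```

The lower bound uses the triangle inequality: any descendant of a node is within the node radius of its representative, so the true cost can undershoot the representative cost by at most that radius.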
Depending on the properties of α and β, this refinement strategy may reduce the number of required cost function evaluations drastically. In empirical experiments, the total number of paths considered in the iterative potential refinement typically scaled linearly in the number of points, with a constant depending on the doubling dimension of the data set.
The potential refinement strategy is also employed by Glimm and Henscheid (2013), without the branch and bound procedure. The shielding neighborhoods proposed by Schmitzer (2015) similarly use the potentials to define which variables are needed for a sparse optimal transport problem; Schmitzer also suggests iterating the neighborhood shields in order to arrive at an optimal solution.
3.4.2 Neighborhood Refinement
This section presents a refinement strategy that takes advantage of the geometry of the data. The approach is based on the heuristic that most propagated paths that are suboptimal at the next scale are so because of boundary effects when moving from one scale to the next, induced by the sharp partitioning of the space at each scale. Such artifacts from the multiscale structure are mitigated by including paths between spatial neighbors of the source and target locations of the optimal solution on the propagated paths. This refinement strategy is also employed by Oberman and Ruan (2015).
Let N_r(π) be the set of paths whose source and target are within radius r of, respectively, the source and target of a path with nonzero mass transfer in the coupling π. The neighborhood refinement strategy expands the reduced set of paths Γ to the union of the paths in the current reduced set and their neighbors: Γ ∪ N_r(π(Γ)), where π(Γ) is the optimal coupling on the reduced set.
When moving from one scale to the next, the cost of any path can change by at most two times the radius of the parent node, which suggests setting the radius of the neighborhood under consideration to two times the node radius.
This heuristic does not guarantee an optimal solution, but does reduce the number of paths considered at scale j to O(k_r (n_j + m_j)), with k_r being the number of neighbors within radius r; for a doubling space with dimension d, k_r = O(2^d).
The neighborhood strategy requires efficiently computing the set of points within a ball of a given radius and location. Depending on the multiscale structure there are different ways to compute the set of neighbors. A generic approach that does not depend on any specific multiscale structure is to use a branch and bound strategy. For this approach, each node n stores an upper bound b(n) on the maximal distance from the representative of node n to any of its descendants. Using these upper bounds, the tree is searched starting from the root node, while excluding from further consideration any child node n whose representative is farther than r + b(n) from the query location. For multiscale structures such as regular grids, more efficient direct computations are possible. For this paper we implemented the generic branch and bound strategy, which works with any multiscale structure.
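A minimal sketch of such a branch and bound ball query is given below, with a hand-built toy tree (the real implementation works with any multiscale structure). A subtree is pruned when the distance from the query to the node representative, minus the stored bound, exceeds the search radius.

```python
def ball_query(q, r, node, dist=lambda a, b: abs(a - b)):
    """node is a tuple (center, bound, children, point): `bound` is an upper
    bound on the distance from the representative `center` to any descendant;
    `point` is set on leaves only (where bound == 0)."""
    center, bound, children, point = node
    if dist(q, center) - bound > r:
        return []                  # prune: no descendant can lie in the ball
    if point is not None:          # leaf: the test above is exact
        return [point]
    out = []
    for c in children:
        out += ball_query(q, r, c, dist)
    return out

leaf = lambda p: (p, 0.0, [], p)
tree = (2.5, 2.5, [(0.5, 0.5, [leaf(0), leaf(1)], None),
                   (4.5, 0.5, [leaf(4), leaf(5)], None)], None)
print(ball_query(0.8, 1.0, tree))  # prints [0, 1]
```

Correctness again follows from the triangle inequality: every descendant of a node lies within b(n) of the representative, so if the representative is farther than r + b(n) from the query, no descendant can be inside the ball.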
3.5 Remark on Errors
The error induced by this multiscale framework stems from two sources. First, if the optimal transport problem is solved only up to a scale coarser than the finest scale, the solution at that scale has a bounded approximation error, as detailed in Section 3.2. Solving the transport problem only up to a certain scale thus permits trading off computation time against precision, with guaranteed approximation bounds. Second, to speed up computation we rely on heuristics that, depending on the refinement strategy, yield solutions at each scale that might not be optimal. This second type of error is difficult to quantify; however, for the potential refinement strategy that we introduce in Section 3.4.1, an optimal solution can be guaranteed. The propagation and refinement strategies introduced in Sections 3.3 and 3.4 thus permit tradeoffs between accuracy and computational cost.
4 Numerical Results and Comparisons
We tested the performance of the proposed multiscale transport framework with respect to computational speed as well as accuracy for different propagation and refinement strategies on various source and target point sets with different properties.
The multiscale approach is implemented in C++ with an R front end in the package mop, available at https://bitbucket.org/suppechasper/optimaltransport. For comparisons, we also implemented the Sinkhorn distance approach (Cuturi, 2013) in C++ with an R front end in the package sinkhorn, available at the same location. Our C++ implementation uses the Eigen linear algebra library (Guennebaud et al., 2010), which resulted in faster runtimes than the MATLAB implementation by Cuturi (2013).
4.1 Implementation Details
The R package mop provides a flexible implementation of the multiscale framework and is adaptable to any multiscale decomposition that can be represented by a tree. The package permits the use of different optimization algorithms to solve the individual transport problems. Currently the multiscale framework implementation supports the open source library GLPK (Makhorin, 2013) and the commercial packages MOSEK (Andersen and Andersen, 2013) and CPLEX (IBM, 2013) (both free for academic use). MOSEK and CPLEX provide a specialized network simplex implementation that runs 10-100 times faster than a typical primal simplex implementation. The MOSEK and CPLEX network simplex implementations run in comparable time, with CPLEX slightly faster in our experiments. Furthermore, CPLEX supports starting from an advanced initial basis for the network simplex algorithm, which improves the multiscale runtimes significantly. Thus, the numerical tests are all run using the CPLEX network simplex algorithm.
4.1.1 Algorithms for Constructing Multiscale Point Set Representations
Various approaches exist to build the multiscale structures described in Section 3.1, such as hierarchical clustering type algorithms (Ward Jr, 1963) or, in low dimensions, constructions such as quadtrees and octrees (Finkel and Bentley, 1974; Jackins and Tanimoto, 1980) or kd-trees (Bentley, 1975). Data structures developed for fast nearest neighbor queries, such as navigating nets (Krauthgamer and Lee, 2004) and cover trees (Beygelzimer et al., 2006), induce a hierarchical structure on the data sets with guarantees on partition size and geometric regularity of the elements of the partition at each scale, under rather general assumptions on the distribution of the points. The construction cost of cover trees (Beygelzimer et al., 2006) scales linearly in the number of points, up to logarithmic factors, with a constant that grows exponentially in the doubling dimension of the data and is proportional to the cost of computing a distance between a pair of points. Therefore the algorithm is practical only when the intrinsic dimension is small, in which case it is provably adaptive to such intrinsic dimension. The multiscale optimal transport approach does not rest on a specific multiscale structure and can be adapted to application-dependent considerations. However, the properties of the multiscale structure, i.e., depth and partition sizes, do affect runtime and approximation bounds.
In our experiments we use an iterated K-means strategy to recursively split the data into subsets. The tree is initialized using the mean of the complete data set as the root node. Then K-means is run, resulting in K children. This procedure is applied recursively to each leaf node, in a breadth-first fashion, until a desired number of leaf nodes is reached, a maximal leaf radius is attained, or the leaf node contains only a single point. For the examples shown we select K = 2^d, with d the dimensionality of the data set. For high-dimensional data, d could be set to an estimate of the intrinsic dimensionality. Since in all experiments we are equipped with a metric structure, we use the pointwise coarsening of the cost function as in Equation (7). The reported results include the computation time for building the hierarchical decomposition, which is, however, negligible compared to solving the transport problems at all scales.
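The recursive splitting can be sketched as below; the K-means seeding, stopping rules, and data are simplified illustrations of the procedure described above, not the mop package's implementation.

```python
import numpy as np

def kmeans(X, k, iters=20):
    # tiny Lloyd iteration with deterministic seeding (first k points)
    centers = X[:k].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def build_tree(X, k, max_radius):
    # each node stores its mean as representative and the covering radius
    center = X.mean(axis=0)
    radius = np.sqrt(((X - center) ** 2).sum(-1)).max()
    node = {"center": center, "radius": radius, "children": []}
    if len(X) > k and radius > max_radius:
        labels = kmeans(X, k)
        parts = [X[labels == j] for j in range(k) if np.any(labels == j)]
        if len(parts) > 1:         # guard against a degenerate split
            node["children"] = [build_tree(P, k, max_radius) for P in parts]
    return node

def count_leaves(node):
    return 1 if not node["children"] else sum(count_leaves(c)
                                              for c in node["children"])

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
tree = build_tree(X, k=2, max_radius=1.0)
print(count_leaves(tree))   # the two well-separated clusters become leaves: 2
```

Here k plays the role of K = 2^d; the stored centers and radii are exactly the quantities needed by the branch and bound searches used in the refinement strategies.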
4.1.2 Multiscale Transport Implementation for Point Sets
Algorithm 2 details the steps for computing multiscale optimal transport plans using the network simplex for solving the optimization problems at each scale and iterated K-means to construct the multiscale structures.