An explicit analysis of the entropic penalty in linear programming
Abstract
Solving linear programs by using entropic penalization has recently attracted new interest in the optimization community, since this strategy forms the basis for the fastest-known algorithms for the optimal transport problem, with many applications in modern large-scale machine learning. Crucial to these applications has been an analysis of how quickly solutions to the penalized program approach true optima to the original linear program. More than 20 years ago, Cominetti and San Martín showed that this convergence is exponentially fast; however, their proof is asymptotic and does not give any indication of how accurately the entropic program approximates the original program for any particular choice of the penalization parameter. We close this longstanding gap in the literature regarding entropic penalization by giving a new proof of the exponential convergence, valid for any linear program. Our proof is nonasymptotic, yields explicit constants, and has the virtue of being extremely simple. We provide matching lower bounds and show that the entropic approach does not lead to a near-linear time approximation scheme for the linear assignment problem.
1 Introduction
In 1992, Fang initiated the study of the entropic penalty for linear programs. Given a basic linear program of the form
(LP) 
he proposed to solve instead the penalized program
(Pen) 
where the entropy term is the Shannon entropy of the decision variable viewed as a probability vector, weighted by a penalization parameter. The entropy term plays the role of a strongly convex regularizer, which also enforces the nonnegativity constraint. As the penalization parameter tends to infinity, we recover (LP); however, one hopes that solving (Pen) is significantly easier.
Solving linear programs via entropic penalization does not initially seem like an especially attractive choice. Unlike the well-known logarithmic penalty, the entropic penalty is not self-concordant, which makes it a poor barrier function for interior point methods (Boyd and Vandenberghe, 2004). Nevertheless, the entropic penalty has been applied to various linear and nonlinear problems with empirical success (Fang et al., 1997). It is also notable for its connection to other fields. In statistics, it has been used as a tool for model selection and aggregation (Juditsky et al., 2008; Rigollet and Tsybakov, 2011) and is intimately related to the maximum entropy principle for statistical inference (Jaynes, 1982) and to maximum likelihood estimation (Chrétien and Hero, 2000). The entropic penalty is also closely connected to first-order optimization methods such as mirror descent (Bubeck, 2015) and to online learning algorithms for combinatorially structured problems (Freund and Schapire, 1997; Cesa-Bianchi and Lugosi, 2006; Helmbold and Warmuth, 2009; Koolen et al., 2010; Audibert et al., 2013).
The recent resurgence of interest in the entropic penalty in the machine-learning community has been driven by the fact that it can be used to obtain state-of-the-art methods for the optimal transport problem (Cuturi, 2013; Cuturi and Doucet, 2014; Solomon et al., 2015; Genevay et al., 2016; Benamou et al., 2016; Altschuler et al., 2017). The use of the entropic penalty for such problems dates back to Schrödinger (1931) (see Léonard, 2014) and to Brègman (1967), who noted the connection between the entropic penalty and the computation of a projection onto the feasible set with respect to the generalized Kullback-Leibler divergence.
What makes this penalty especially useful for transport problems is that the solution to (Pen) can be computed quickly by a simple iterative algorithm known as the Sinkhorn or RAS algorithm (Sinkhorn, 1967). This fact was popularized by Cuturi (2013), and his work led to widespread adoption of the entropic penalty for computing optimal transport. The introduction of the entropic penalty makes an enormous difference in practice: since optimal transport can be formulated as a linear program, it can, of course, be solved in polynomial time, but the numerical experiments conducted by Cuturi (2013) indicate that the same linear program with an entropic penalty can be solved up to 10,000 times faster, as long as the penalization parameter is not too large. On the other hand, those experiments also showed that solving the penalized program becomes costly as the penalization parameter increases.
This same phenomenon is present in theory as well as in practice.
A recent theoretical analysis (see Altschuler et al., 2017) of the method of Cuturi (2013) suggests that the time required to solve the optimal transport problem via entropic regularization scales linearly with the penalization parameter.
Even the guarantees of the most recent algorithms for solving (Pen) degrade as the penalization parameter grows.
For example, when defines the Birkhoff polytope (Brualdi, 2006) of doubly stochastic matrices, an approximate solution to (Pen) can be found in time
Nevertheless, if we wish to obtain a good approximation to the solution of (LP), the penalization parameter cannot be taken too small. In the limit as the parameter tends to zero, the solution to (Pen) converges to the maximum-entropy point in the feasible set, which may be far from the optimum. If the goal is to approximately solve (LP), then the parameter must be large enough that the solution of (Pen) is still close to an optimum of (LP).
To summarize: the choice of the penalization parameter is essential. Too large, and the computational benefits of using the regularizer disappear; too small, and the entropic term induces significant bias and (Pen) is a poor approximation to the original problem. The chief aim of this work is to quantify this tradeoff.
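The Sinkhorn (RAS) iteration and the bias described above can be illustrated with a minimal sketch. It assumes that the penalized objective has the form ⟨C, X⟩ − H(X)/η, with penalization parameter `eta`, so that the scaling kernel is exp(−η C). The function names, the marginals `r` and `c`, the toy cost matrix, and the fixed iteration count are illustrative choices and are not taken from the cited works.

```python
import math

def sinkhorn(cost, r, c, eta, iters=500):
    """Alternately rescale the rows and columns of K = exp(-eta * cost)
    so that the resulting plan has row sums r and column sums c."""
    n, m = len(r), len(c)
    K = [[math.exp(-eta * cost[i][j]) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Row scaling: match the prescribed row sums.
        for i in range(n):
            u[i] = r[i] / sum(K[i][j] * v[j] for j in range(m))
        # Column scaling: match the prescribed column sums.
        for j in range(m):
            v[j] = c[j] / sum(K[i][j] * u[i] for i in range(n))
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(cost, plan):
    """Linear objective <cost, plan>."""
    return sum(cost[i][j] * plan[i][j]
               for i in range(len(plan)) for j in range(len(plan[0])))

# Toy instance (made up): the unpenalized optimum keeps all mass on the
# diagonal and has cost 0.
cost = [[0.0, 1.0], [1.0, 0.0]]
r = [0.5, 0.5]
c = [0.5, 0.5]
for eta in (0.5, 5.0):
    plan = sinkhorn(cost, r, c, eta)
    print(f"eta={eta}: cost of penalized plan = {transport_cost(cost, plan):.6f}")
```

In this toy instance the penalized plan has cost exp(−η)/(1 + exp(−η)), so a small parameter leaves a visible bias relative to the unpenalized optimum of 0, while a larger parameter shrinks it.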
1.1 Prior work
The question of how well (Pen) approximates (LP) as a function of was studied by Cominetti and San Martín (1994).
(1) 
for some . However, their proof does not make it easy to determine the order of magnitude of the constants involved and, in particular, their dependence on problem-specific quantities such as the dimension and the size of the feasible set. This result is nevertheless tantalizing, insofar as it suggests that the penalized solution will be close to an optimum even for relatively small values of the penalization parameter. Of course, knowing the size of these constants is crucial to making this idea precise. Prior to this work, theirs was the most general analysis of (Pen) available.
1.2 Our contribution
In this work, motivated by the recent popularity of entropic penalization for optimal transport, we prove a version of (1) with easy-to-understand constants. Our analysis applies to any linear program of the form (LP). We show (Section 2) that the quality of the penalized solution satisfies
where is the gap in objective value between an optimal vertex and any suboptimal vertex, and and are the radius of the feasible set with respect to the norm and the entropy, respectively. As a corollary, we obtain that the result (1) obtained by Cominetti and San Martín (1994) holds for any . In addition to making explicit their result, our proof has the virtue of being very simple, requiring only elementary facts about entropy. Moreover, we show (Section 3) that no general improvement in the dependence on , , or is possible, even for the simplest possible example, where the feasible set is the probability simplex.
Finally, specializing to the Birkhoff polytope (Section 4), we obtain nearly matching upper and lower bounds on the quality of the solution as a function of the penalization parameter. In particular, these imply that the penalization parameter cannot be taken to be dimension-free, so that the entropic penalty is not a magic bullet for the assignment problem.
1.3 Assumptions
We assume throughout that the feasible set is bounded. To ensure that (LP) is nontrivial, we assume that the feasible set is nonempty and that the objective is not constant over it.
1.4 Quantities of interest
For convenience, we collect here definitions of the three quantities appearing in our bounds: the suboptimality gap, the radius, and the entropic radius.
Definition 1.
Let be the set of vertices of . The suboptimality gap is
where and }.
Definition 2.
The radius of is .
Definition 3.
The entropic radius of is .
2 Upper bound
In this section, we prove our main bound on the quality of the solution of (Pen). Our proof recovers the result of Cominetti and San Martín (1994) that the penalized solution approaches an optimal solution exponentially fast. Before doing so, however, we first prove a much simpler and weaker bound, which we call the slow rate (see Fang and Tsao, 1993; Altschuler et al., 2017, where this analysis also appeared):
Proposition 1 (Slow rate).
For all ,
Proof.
Note that the slow rate is much worse than the fast rate we hope to prove; however, Rigollet (2017) noted that the slow rate is actually tight for an infinite-dimensional analogue of (LP). This indicates that the reason that a fast rate obtains for (LP) is that the finite-dimensional problem exhibits a suboptimality gap (known as an energy gap in the statistical physics literature; see Mézard and Montanari, 2009). Intuitively, the slow rate dominates the convergence until the penalization parameter is large enough that the penalized solution is concentrated near enough to the optimal solution; after this point, convergence occurs exponentially fast. We will return to this point in Section 3.
We now turn to the main result.
Theorem 1 (Fast rate).
If , then the optimal solution of (Pen) satisfies
Theorem 1 implies a bound on the size of required to obtain a solution of desired accuracy: to obtain a solution satisfying , it suffices to take , where .
Note that Theorem 1 only holds for sufficiently large. The requirement that corresponds exactly to the requirement that the exponent appearing on the right side of the above equation is nonpositive. In Section 3, we show that this restriction is necessary, in the sense that there are penalized linear programs for which does not make appreciable progress towards the minimizer until .
The proof of Theorem 1 is elementary and relies on three simple lemmas about the entropy function, which we now state. These lemmas are easy to verify; proofs appear in Section 6. Recall the definition of the binary entropy function:
$$h(p) = -p \log p - (1 - p) \log(1 - p), \qquad p \in [0, 1].$$
Lemma 1.
If and are nonnegative vectors and , then
Lemma 2.
The function is increasing on the interval .
Lemma 3.
If , then
Proof of Theorem 1.
Let be the vertices of . Write for the set of optimal vertex solutions for (LP), and let be the set of suboptimal vertices. Since , we can write
for some nonnegative vector satisfying . If we let , then , where and . Since is a convex combination of elements of , it lies on the optimal face of and is an optimal solution to (LP). On the other hand, since is a convex combination of suboptimal vertices, .
Let . We first prove two simple bounds on this quantity. First, we have a trivial lower bound:
(2) 
On the other hand, Proposition 1 implies
(3) 
We also obtain a corollary which establishes the distance of to the optimal face, which reproduces the result of Cominetti and San Martín (1994). Let be the optimal face of with respect to the objective , and denote by the distance of the point to .
Corollary 1.
If , then
In particular,
for any .
Proof.
Using the notation of Theorem 1, we have that there exist points such that is optimal and
for . We obtain
and the claim follows. ∎
The suboptimality gap is quite brittle, since it can be affected by the presence of even a single almost-optimal vertex whose objective value is very close to that of the optimal vertex. However, the definition of the gap can be relaxed slightly to account for this case, as the following corollary shows.
Corollary 2.
For any , let and . If , then the optimal solution of (Pen) satisfies
Proof.
If , the claim is vacuous, so assume that . Let , and note that . Given an optimal solution to (LP), let . We have . Moreover, the suboptimality gap of is .
While the radius and the entropic radius are easy to calculate, evaluating the suboptimality gap is not easy in general. Nevertheless, as we noted above, intuition from statistical physics implies that some dependence on the suboptimality gap is necessary to obtain exponential convergence, a point which we substantiate in Section 3. We note the obvious fact that this dependence can be removed for integral polytopes, which are a core object of study in combinatorial optimization (Schrijver, 2003).
Corollary 3.
If is integral and the entries of are integers, then
for all .
Proof.
By definition, the vertices of have integer coordinates, so for any vertex , if is an integer vector then . Therefore if is an optimal vertex and , then , so . ∎
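The integrality argument can be checked by brute force on a small instance. The sketch below uses the assignment problem of Section 4, whose vertices are the permutation matrices, enumerates all vertices, and computes the suboptimality gap directly. The cost matrix and the helper names are made up for illustration, and the enumeration is only feasible for very small dimensions.

```python
from itertools import permutations

def assignment_values(cost):
    """Objective value <C, P> for every permutation matrix P (brute force)."""
    n = len(cost)
    return [sum(cost[i][p[i]] for i in range(n))
            for p in permutations(range(n))]

def suboptimality_gap(values):
    """Smallest difference between a suboptimal vertex value and the optimum.
    Returns None if every vertex is optimal."""
    best = min(values)
    worse = [v for v in values if v > best]
    return min(worse) - best if worse else None

# Hypothetical integer cost matrix.
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
values = assignment_values(cost)
delta = suboptimality_gap(values)
print("optimal value:", min(values), "suboptimality gap:", delta)
```

Because every vertex value is an integer here, any two distinct values differ by at least 1, which is exactly the bound the corollary exploits.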
3 Lower bound
In this section, we present an explicit example of a simple family of linear programs for which our analysis is tight, up to constant factors. This example evinces the two phenomena present in Theorem 1: the convergence of to the optimum is slow until is of order , and once this threshold is reached convergence happens at precisely the speed indicated in the upper bound. This example also validates the intuition presented above about the necessary dependence on the suboptimality gap: exponentially fast convergence is obtained only when .
Fix positive constants and and a dimension . Let be given by
and consider the linear program
(5) 
Note that the polytope defined by the constraints of (5) is a rescaled version of the dimensional probability simplex. We focus on the following penalized program:
(6) 
We make the following simple observations about (5) and (6):

The unique optimal solution to (5) is , the first elementary basis vector, and .

The maximum value of over is , achieved at any vertex other than .

For this polytope, , , and .
The penalized program (6) has an explicit solution, which is given by a rescaled version of the Gibbs distribution (Mézard and Montanari, 2009).
Proposition 2.
The optimal solution to (6) is given by
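To illustrate the Gibbs form in the simplest setting, the following sketch works on the standard (unrescaled) probability simplex and assumes that the penalized objective is ⟨c, x⟩ − H(x)/η. Under that assumption the minimizer has entries proportional to exp(−η c_i). The cost vector, the values of `eta`, and the function names are illustrative, and the rescaling constants of (5) and (6) are not modeled.

```python
import math

def gibbs(c, eta):
    """Minimizer of <c, x> - H(x)/eta over the probability simplex
    (assumed form of the penalized program): x_i proportional to exp(-eta * c_i)."""
    m = min(c)  # shifting by the minimum leaves x unchanged and avoids underflow
    w = [math.exp(-eta * (ci - m)) for ci in c]
    z = sum(w)
    return [wi / z for wi in w]

def gap(c, eta):
    """Objective gap <c, x_eta> - min_i c_i of the penalized solution."""
    x = gibbs(c, eta)
    return sum(ci * xi for ci, xi in zip(c, x)) - min(c)

# One optimal vertex and three suboptimal vertices at distance 1 (made-up data).
c = [0.0, 1.0, 1.0, 1.0]
for eta in [0.1, 1.0, 5.0, 10.0, 20.0]:
    print(f"eta={eta:5.1f}  gap={gap(c, eta):.3e}")
```

For this cost vector the gap equals 3 exp(−η)/(1 + 3 exp(−η)): it stays close to its maximum while the parameter is small and decays exponentially once the parameter is large, which is the two-phase behavior discussed in this section.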
The guarantee of Theorem 1 requires that . We now show that when is significantly smaller than this quantity, the solution to the penalized program is far from the true optimum. Indeed, the following proposition establishes that we cannot even achieve a constantfactor improvement over the maximum value of over until is of order .
Proposition 3.
For any , if , then .
Proof.
We prove the contrapositive. Note that , so if then . By Proposition 2, we can write explicitly
If , then , so , as claimed. ∎
The next proposition shows that Theorem 1 is tight up to a small constant factor.
Proposition 4.
If , then
Proof.
4 Entropic penalization for the assignment problem
In this section, we give an application of Theorem 1 to the assignment problem, a fundamental combinatorial optimization problem (Schrijver, 2003). Our motivation for analyzing this example explicitly is twofold. First, this is a case where entropic penalization has already been proposed as a good candidate algorithm (Kosowsky and Yuille, 1994; Sharify et al., 2011). Second, as noted in the introduction, new fast algorithms for the matrix scaling problem (Cohen et al., 2017; Allen-Zhu et al., 2017) show that a penalized version of the assignment problem with cost matrix can be solved in time . These fast algorithms raise the prospect that entropic penalization could provide a near-linear time algorithm for the assignment problem, a major breakthrough (see, e.g., Mądry, 2013).
Whether this breakthrough is possible depends crucially on the size of the penalization parameter required to solve the problem accurately. The best bounds available from previous works on the problem (Kosowsky and Yuille, 1994; Sharify et al., 2011) require to achieve constant accuracy, which is just the guarantee given by the slow rate (Proposition 1). An open question implicit in these works is whether this is optimal, or whether suffices. (In particular, this would imply a near-linear time algorithm for the assignment problem.) By applying Theorem 1 and exhibiting an almost-matching lower bound, we show exactly what rates are attainable for the Birkhoff polytope. In short, our hopes are dashed: the penalization parameter cannot be taken to be dimension-free in general.
We first recall the problem. Given a bipartite graph with edge weights, the goal of the assignment problem is to find a minimum-cost perfect matching in the graph. This problem also has a well-known linear programming formulation: given a matrix of edge weights, the assignment problem is
(7) 
The polytope given by the constraints is known as the Birkhoff polytope, and its vertices are the permutation matrices (Brualdi, 2006), a result known as the Birkhoff-von Neumann theorem.
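The Birkhoff-von Neumann theorem can be made concrete with a small sketch that writes a doubly stochastic matrix as a convex combination of permutation matrices. The greedy peeling procedure below is a standard textbook construction and is not taken from this paper. It searches permutations by brute force, so it is only meant for tiny matrices, and it uses exact rational arithmetic to avoid rounding issues. The function name and the example matrix are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def birkhoff_decomposition(X):
    """Greedily peel permutation matrices off a doubly stochastic matrix X.
    Returns a list of (weight, permutation) pairs whose weighted sum is X."""
    n = len(X)
    X = [[Fraction(x) for x in row] for row in X]
    terms = []
    while any(x > 0 for row in X for x in row):
        # The remainder is a positive multiple of a doubly stochastic matrix,
        # so some permutation lies entirely within its support.
        perm = next(p for p in permutations(range(n))
                    if all(X[i][p[i]] > 0 for i in range(n)))
        theta = min(X[i][perm[i]] for i in range(n))
        for i in range(n):
            X[i][perm[i]] -= theta  # zeroes at least one entry per round
        terms.append((theta, perm))
    return terms

half = Fraction(1, 2)
X = [[half, half, 0],
     [half, 0, half],
     [0, half, half]]
for theta, perm in birkhoff_decomposition(X):
    print(theta, perm)
```

Each round zeroes at least one entry, so the loop terminates after at most n² rounds and the weights sum to 1.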
We first give an upper bound on the quality of the penalized solution as a function of the regularization parameter. We require a preliminary lemma, whose proof appears in Section 6.
Lemma 4.
The Birkhoff polytope has and .
Lemma 4 combined with Theorem 1 yields the following guarantee for the entropic penalty applied to the assignment problem. For normalization purposes, we assume that the entries of the cost matrix are nonnegative integers, as is common in the combinatorial optimization literature.
Proposition 5.
An additive approximation to the assignment problem with cost matrix can be found by solving an entropypenalized version of (7) with parameter .
Proof.
Proposition 5 is disappointing: it guarantees exponential convergence of to only when , a far cry from the hopedfor result that could be taken . We now show that, up to logarithmic factors, this bound is tight. The following theorem implies that, even when , can be bounded away from the optimal value if .
Theorem 2.
Let be the matrix given by
(8) 
If , then
Proof.
The matrix defined in (8) admits the unique optimal solution , the identity matrix, and optimal value . For any permutation , on the other hand, .
We prove the contrapositive. By the Birkhoff-von Neumann theorem, we can write as a convex combination of permutation matrices:
By assumption, , so . This implies that for , and therefore that for .
Sinkhorn’s theorem (Sinkhorn, 1967) combined with first-order optimality conditions for the penalized program guarantees that for positive diagonal matrices and and . Write for the vector of diagonal entries of . For , we have
(9) 
Finally, we note that
and since , we obtain
(10) 
5 Conclusions
Our focus in this work has been on making explicit the asymptotic analysis of Cominetti and San Martín (1994). Their paper has been cited consistently in the computational optimal transport community as giving the best account of the speed of convergence of the penalized program to the original linear program (see Genevay et al., 2016; Benamou et al., 2015, 2016; Blondel et al., 2017; Carlier et al., 2017; Denitiu et al., 2014; Dessein et al., 2016; Di Marino et al., 2017; Díaz et al., 2015; Schmitzer, 2016; Peyré and Cuturi, 2017; Luise et al., 2018). We hope that the simple and explicit proof here will clarify the nature of the exponential rate proved in Cominetti and San Martín (1994), and provide a framework for a more refined analysis of the entropic penalty for linear programs of interest.
One puzzle that remains is to give theoretical justification to the observation of Cuturi (2013) that small values of the penalization parameter achieve good accuracy on real-world optimal transport data. It is clear that the analysis of Theorem 1 could be improved via a more refined understanding of the “energy spectrum” of optimal transport (i.e., the size and structure of the set of nearly optimal transports), but obtaining this understanding even in the case where the costs are i.i.d. random variables is a very deep question (Aldous, 2001). We leave obtaining a more sophisticated grasp on the behavior of this trajectory for future work.
6 Proofs of Lemmas
Proof of Lemma 1.
Write , and let .
∎
Proof of Lemma 2.
The derivative of satisfies
When ,
so . The claim follows. ∎
Proof of Lemma 3.
By definition , so it suffices to show that
This inequality is easily verified by noting that the derivative of the left side is nonpositive on and . ∎
Proof of Lemma 4.
It is trivial to see that all satisfy , so . For any ,
where denotes the th row of . Since each row of is a nonnegative vector of dimension whose entries sum to , for each the bound holds. Therefore for all , which proves that . ∎
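The row-by-row entropy argument can be checked numerically on small doubly stochastic matrices. The sketch below assumes the usual Shannon entropy with natural logarithm, summed over all entries with the convention 0 log(1/0) = 0. The matrices and helper names are illustrative.

```python
import math

def entropy(matrix):
    """Shannon entropy: sum over entries of x * log(1/x), with 0 * log(1/0) = 0."""
    return sum(-x * math.log(x) for row in matrix for x in row if x > 0)

def l1_norm(matrix):
    """Entrywise l1 norm."""
    return sum(abs(x) for row in matrix for x in row)

n = 4
uniform = [[1.0 / n] * n for _ in range(n)]                                   # maximum-entropy point
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]     # a vertex
mix = [[0.5 * uniform[i][j] + 0.5 * identity[i][j] for j in range(n)]
       for i in range(n)]                                                     # an interior point

for name, X in [("uniform", uniform), ("identity", identity), ("mix", mix)]:
    print(f"{name:8s}  l1 norm = {l1_norm(X):.3f}  entropy = {entropy(X):.4f}"
          f"  (n log n = {n * math.log(n):.4f})")
```

All three matrices have entrywise l1 norm n. The uniform matrix attains entropy n log n, the permutation matrix has entropy 0, and the mixture falls strictly in between.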
7 Acknowledgments
This work was supported in part by NSF Graduate Research Fellowship DGE1122374. The author would like to thank J. Altschuler and P. Rigollet for useful discussions, as well as the anonymous referees for their suggestions.
Footnotes
 The notation hides polylogarithmic factors.
 In fact, the main object of study of Cominetti and San Martín (1994) is the program , which is the dual of (Pen) with the entropy replaced by the similar function . They refer to the penalty appearing in this dual program as the exponential penalty. Our analysis applies equally well to their setting, but we focus on the vanilla entropic penalty for clarity.
References
 David J. Aldous. The limit in the random assignment problem. Random Structures Algorithms, 18(4):381–418, 2001. ISSN 10429832. URL https://doi.org/10.1002/rsa.1015.
 Zeyuan Allen-Zhu, Yuanzhi Li, Rafael Oliveira, and Avi Wigderson. Much faster algorithms for matrix scaling. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 890–901. IEEE Computer Society, 2017. ISBN 9781538634646. doi: 10.1109/FOCS.2017.87. URL https://doi.org/10.1109/FOCS.2017.87.
 Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1961–1971, 2017. URL http://papers.nips.cc/paper/6792nearlineartimeapproximationalgorithmsforoptimaltransportviasinkhorniteration.
 Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2013.
 Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
 Jean-David Benamou, Guillaume Carlier, and Luca Nenna. A numerical method to solve multimarginal optimal transport problems with Coulomb cost. In Splitting Methods in Communication, Imaging, Science, and Engineering, Sci. Comput., pages 577–601. Springer, Cham, 2016.
 Mathieu Blondel, Vivien Seguy, and Antoine Rolet. Smooth and sparse optimal transport. arXiv preprint arXiv:1710.06276, 2017.
 Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004. ISBN 0521833787.
 L. M. Brègman. A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. Ž. Vyčisl. Mat. i Mat. Fiz., 7:620–631, 1967. ISSN 00444669.
 Richard A. Brualdi. Combinatorial Matrix Classes, volume 108 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 2006. ISBN 9780521865654; 0521865654. doi: 10.1017/CBO9780511721182. URL http://dx.doi.org/10.1017/CBO9780511721182.
 Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015. doi: 10.1561/2200000050. URL https://doi.org/10.1561/2200000050.
 Guillaume Carlier, Vincent Duval, Gabriel Peyré, and Bernhard Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1385–1418, 2017.
 Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, 2006. ISBN 9780521841085; 0521841089. doi: 10.1017/CBO9780511546921. URL http://dx.doi.org/10.1017/CBO9780511546921.
 Stéphane Chrétien and Alfred O. Hero, III. Kullback proximal algorithms for maximum-likelihood estimation. IEEE Trans. Inform. Theory, 46(5):1800–1810, 2000. ISSN 00189448. doi: 10.1109/18.857792. URL http://dx.doi.org/10.1109/18.857792.
 Michael B. Cohen, Aleksander Mądry, Dimitris Tsipras, and Adrian Vladu. Matrix scaling and balancing via box constrained Newton’s method and interior point methods. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 902–913. IEEE Computer Society, 2017. ISBN 9781538634646. doi: 10.1109/FOCS.2017.88. URL https://doi.org/10.1109/FOCS.2017.88.
 R. Cominetti and J. San Martín. Asymptotic analysis of the exponential penalty trajectory in linear programming. Math. Programming, 67(2, Ser. A):169–187, 1994. ISSN 00255610. doi: 10.1007/BF01582220. URL http://dx.doi.org/10.1007/BF01582220.
 Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2292–2300, 2013. URL http://papers.nips.cc/paper/4927sinkhorndistanceslightspeedcomputationofoptimaltransport.
 Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 685–693. JMLR.org, 2014. URL http://jmlr.org/proceedings/papers/v32/cuturi14.html.
 Andreea Denitiu, Stefania Petra, Claudius Schnörr, and Christoph Schnörr. An entropic perturbation approach to TV-minimization for limited-data tomography. In Elena Barcucci, Andrea Frosini, and Simone Rinaldi, editors, Discrete Geometry for Computer Imagery - 18th IAPR International Conference, DGCI 2014, Siena, Italy, September 10-12, 2014. Proceedings, volume 8668 of Lecture Notes in Computer Science, pages 262–274. Springer, 2014. ISBN 9783319099545. doi: 10.1007/9783319099552_22. URL https://doi.org/10.1007/9783319099552_22.
 Arnaud Dessein, Nicolas Papadakis, and Jean-Luc Rouas. Regularized optimal transport and the ROT mover’s distance. arXiv preprint arXiv:1610.06447, 2016.
 Simone Di Marino, Augusto Gerolin, and Luca Nenna. Optimal transportation theory with repulsive costs. In Topological optimization and optimal transport, volume 17 of Radon Ser. Comput. Appl. Math., pages 204–256. De Gruyter, Berlin, 2017.
 Juan Díaz, Tomás Rau, and Jorge Rivera. A matching estimator based on a bilevel optimization problem. Review of Economics and Statistics, 97(4):803–812, 2015.
 S. C. Fang. An unconstrained convex programming view of linear programming. Z. Oper. Res., 36(2):149–161, 1992. ISSN 03409422. URL https://doi.org/10.1007/BF01417214.
 S.-C. Fang, J. R. Rajasekera, and H.-S. J. Tsao. Entropy optimization and mathematical programming, volume 8 of International Series in Operations Research & Management Science. Kluwer Academic Publishers, Boston, MA, 1997. ISBN 0792399390. URL https://doi.org/10.1007/9781461561316.
 Shu-Cherng Fang and H.-S. Jacob Tsao. Linear programming with entropic perturbation. Zeitschrift für Operations Research, 37(2):171–186, 1993.
 Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
 Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis R. Bach. Stochastic optimization for large-scale optimal transport. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3432–3440, 2016. URL http://papers.nips.cc/paper/6566stochasticoptimizationforlargescaleoptimaltransport.
 David P. Helmbold and Manfred K. Warmuth. Learning permutations with exponential weights. Journal of Machine Learning Research, 10(Jul):1705–1736, 2009.
 Edwin T. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939–952, 1982.
 Anatoli Juditsky, Philippe Rigollet, and Alexandre Tsybakov. Learning by mirror averaging. Ann. Statist., 36(5):2183–2206, 2008. ISSN 00905364.
 Wouter M. Koolen, Manfred K. Warmuth, and Jyrki Kivinen. Hedging structured concepts. In Adam Tauman Kalai and Mehryar Mohri, editors, COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 93–105. Omnipress, 2010. ISBN 9780982252925. URL http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf#page=101.
 JJ Kosowsky and Alan L Yuille. The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 7(3):477–490, 1994.
 Christian Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete Contin. Dyn. Syst., 34(4):1533–1574, 2014. ISSN 10780947. doi: 10.3934/dcds.2014.34.1533. URL http://dx.doi.org/10.3934/dcds.2014.34.1533.
 Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential properties of Sinkhorn approximation for learning with Wasserstein distance. arXiv preprint arXiv:1805.11897, 2018.
 Marc Mézard and Andrea Montanari. Information, physics, and computation. Oxford Graduate Texts. Oxford University Press, Oxford, 2009. ISBN 9780198570837. doi: 10.1093/acprof:oso/9780198570837.001.0001. URL https://doi.org/10.1093/acprof:oso/9780198570837.001.0001.
 Aleksander Mądry. Navigating central path with electrical flows: From flows to matchings, and back. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pages 253–262. IEEE Computer Society, 2013. ISBN 9780769551357. doi: 10.1109/FOCS.2013.35. URL https://doi.org/10.1109/FOCS.2013.35.
 Gabriel Peyré and Marco Cuturi. Computational optimal transport. Book draft, 2017.
 Philippe Rigollet. Personal communication, 2017.
 Philippe Rigollet and Alexandre Tsybakov. Exponential screening and optimal rates of sparse estimation. Ann. Statist., 39(2):731–771, 2011. ISSN 00905364. doi: 10.1214/10AOS854. URL http://dx.doi.org/10.1214/10AOS854.
 Bernhard Schmitzer. Stabilized sparse scaling algorithms for entropy regularized transport problems. arXiv preprint arXiv:1610.06519, 2016.
 Alexander Schrijver. Combinatorial optimization. Polyhedra and efficiency. Vol. A, volume 24 of Algorithms and Combinatorics. Springer-Verlag, Berlin, 2003. ISBN 3540443894.
 Erwin Schrödinger. Über die Umkehrung der Naturgesetze. Angewandte Chemie, 44(30):636–636, 1931.
 Meisam Sharify, Stéphane Gaubert, and Laura Grigori. Solution of the optimal assignment problem by diagonal scaling algorithms. arXiv preprint arXiv:1104.3830, 2011.
 Richard Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.
 Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.