# Tsallis Regularized Optimal Transport and Ecological Inference

Boris Muzellec
Ecole Polytechnique
boris.muzellec@polytechnique.edu
Richard Nock
Data61, the Australian National University & the University of Sydney
richard.nock@data61.csiro.au
Giorgio Patrini
The Australian National University & Data61
giorgio.patrini@anu.edu.au
Frank Nielsen
Ecole Polytechnique & Sony CS Labs, Inc.
Frank.Nielsen@acm.org
###### Abstract

Optimal transport is a powerful framework for computing distances between probability distributions. We unify the two main approaches to optimal transport, namely Monge-Kantorovitch and Sinkhorn-Cuturi, into what we define as Tsallis regularized optimal transport (trot). trot interpolates a rich family of distortions from Wasserstein to Kullback-Leibler, encompassing as well Pearson, Neyman and Hellinger divergences, to name a few. We show that metric properties known for Sinkhorn-Cuturi generalize to trot, and provide efficient algorithms for finding the optimal transportation plan with formal convergence proofs. We also present the first application of optimal transport to the problem of ecological inference, that is, the reconstruction of joint distributions from their marginals, a problem of large interest in the social sciences. trot provides a convenient framework for ecological inference by making it possible to compute the joint distribution (that is, the optimal transportation plan itself) when side information is available, which is typically what a census provides in political science. Experiments on data from the 2012 US presidential elections display the potential of trot in delivering a faithful reconstruction of the joint distribution of ethnic groups and voter preferences.

## 1 Introduction

Optimal transport (ot) makes it possible to compare probability distributions by exploiting the underlying metric space on their supports [22, 26]. A number of prominent applications allow for a natural definition of this underlying metric space, from image processing [32] to natural language processing [25], music processing [13] and computer graphics [36].

One key problem of ot is its processing complexity: cubic in the support size, ignoring low order terms, on state-of-the-art LP solvers [8]. Moreover, the optimal transportation plan often has many zeroes, which is not desirable in some applications. An important workaround consists in penalizing the transport cost with a Shannon entropic regularizer [8]. At the price of changing the transport distance for a distortion with metric-related properties comes an algorithm with geometric convergence rates [8, 16]. As a result, we can picture two separate approaches to ot: one essentially relies on the initial Monge-Kantorovitch formulation optimizing the transportation cost itself [39], but is computationally expensive; the other is based on tweaking the transportation cost by a Shannon regularizer [8]. The corresponding optimization algorithm, grounded in a variety of different works [7, 34, 37], is fast and can be very efficiently parallelized [8].

Our paper brings three contributions. (i) We interpolate these two worlds using a family of entropies celebrated in nonextensive statistical mechanics, Tsallis entropies [38], and hence we define the Tsallis regularized optimal transport (trot). We show that the metric properties for Shannon entropy still hold in this more general case, and prove new properties that are key to our application. (ii) We provide efficient optimization algorithms to compute trot and the optimal transportation plan. (iii) Last but not least, we provide a new application of trot to a field in which this optimal transportation plan is the key unknown: the problem of ecological inference.

Ecological inference deals with recovering information from aggregate data. It arises in a diversity of applied fields such as econometrics [6, 4], sociology and political science [23, 24] and epidemiology [40], with a long history [31]; interestingly, the empirical software engineering community has also explored the idea [28]. Its iconic application is inferring electorate behaviour: given turnout results for several parties and proportions of some population strata, e.g. percentages of ethnic groups, for many geographical regions such as counties, the aim is to recover contingency tables for parties × groups for all those counties. In the language of probability the problem is isomorphic to the following: given two random variables and their respective marginal distributions (conditioned on another variable, the geography), compute their conditional joint distribution (see Figure 1).

The problem is fundamentally under-determined and any solution can either only provide loose deterministic bounds [12, 6, 4] or must enforce additional assumptions and prior knowledge on the data domain [23]. More recently, the problem has witnessed a period of renaissance along with the publication of a diversity of methods from the second family, mostly inspired by distributional assumptions as summarised in [24]. Closer to our approach, [21] follows the road of a minimal subset of assumptions and frames the inference as an optimization problem. The method favors one solution according to some information-theoretic criterion, e.g. the Cressie-Read power divergence, intended as an entropic measure of the joint distribution.
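To make the under-determination concrete, the sketch below computes the classical entry-wise bounds that any joint table must satisfy given only its marginals, which is how deterministic bounds of this kind are usually stated. The county, its two parties and its two population groups are hypothetical toy values, not data from the paper.

```python
def deterministic_bounds(r, c):
    """Entry-wise bounds valid for every joint table with row marginals r and column marginals c."""
    lower = [[max(0.0, ri + cj - 1.0) for cj in c] for ri in r]
    upper = [[min(ri, cj) for cj in c] for ri in r]
    return lower, upper


# Hypothetical county: rows are 2 parties, columns are 2 population groups.
r = [0.6, 0.4]
c = [0.7, 0.3]
lower, upper = deterministic_bounds(r, c)

# Two distinct joint tables with exactly these marginals.
independent = [[ri * cj for cj in c] for ri in r]
polarized = [[0.6, 0.0], [0.1, 0.3]]

for i in range(2):
    for j in range(2):
        print(i, j, round(lower[i][j], 3), round(upper[i][j], 3))
```

Both tables sit inside the bounds, which illustrates why the marginals alone cannot single out one joint distribution.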

There is an intriguing link between optimal transport and ecological inference: if we can figure out the computation of the ground metric, then the optimal transportation plan provides a solution to the ecological inference problem. This is appealing because it ties the computation of the joint distribution to a ground individual distance between people. Figure 1 gives an example. As recently advocated in ecological inference [14], it turns out that we have access to more and more side information that helps to solve ecological inference (in our case, the computation of this ground metric). Polls, censuses and social networks are all sources of public or private data that can be of help. It is not our objective to show how to best compute the ground metric, but we show an example on real world data for which a simple approach gives very convincing results.

To our knowledge, there is no former application of optimal transport (regularized or not) to ecological inference. The closest works either assume that the joint distribution follows a random distribution constrained to structural or marginal constraints [15] (and references therein) or modify the constraints to the marginals and / or add constraints to the problem [11]. In all cases, there is no ground metric (or anything that looks like a cost) among supports that ties the computation of the joint distribution. More importantly, as noted in [14], traditional ecological inference would not use side information of the kind that would be useful to estimate our ground metric.

This paper is organized as follows. In Section 2, we present the main definitions for ot. Section 3 presents trot and its geometric properties. Section 4 presents the algorithms to compute trot and the optimal transportation plan, and their properties. Section 5 details experiments. A last Section concludes with open problems. All proofs, related comments, and some experiments are deferred to a Supplementary Material (sm).

## 2 Basic definitions and concepts

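In the following, we let $\triangle_n \stackrel{\mathrm{def}}{=} \{\mathbf{x} \in \mathbb{R}_+^n : \mathbf{x}^\top \mathbf{1} = 1\}$ denote the probability simplex (bold faces like $\mathbf{x}$ denote vectors). $\langle P, M \rangle \stackrel{\mathrm{def}}{=} \mathrm{vec}(P)^\top \mathrm{vec}(M)$ denotes the Frobenius product ($\mathrm{vec}(\cdot)$ is the vectorization of a matrix). For any two $\mathbf{r}, \mathbf{c} \in \triangle_n$, we define their transportation polytope $U(\mathbf{r},\mathbf{c}) \stackrel{\mathrm{def}}{=} \{P \in \mathbb{R}_+^{n \times n} : P\mathbf{1} = \mathbf{r}, P^\top \mathbf{1} = \mathbf{c}\}$. For any cost matrix $M \in \mathbb{R}^{n \times n}$, the transportation distance between $\mathbf{r}$ and $\mathbf{c}$ is the solution of the following minimization problem: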

$$d_M(\mathbf{r},\mathbf{c}) \stackrel{\mathrm{def}}{=} \min_{P \in U(\mathbf{r},\mathbf{c})} \langle P, M \rangle. \tag{1}$$

Its argument, $P^\star$, is the (optimal) transportation plan between $\mathbf{r}$ and $\mathbf{c}$. Assuming , $P^\star$ is unique. Furthermore, if $M$ is a metric matrix, then $d_M$ is also a metric [39, §6.1].
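As a small illustration of eq. (1), the sketch below solves the unregularized problem exactly in the 2x2 case, where the transportation polytope is a segment parameterized by one entry and the linear cost is minimized at an endpoint. The function names and the toy marginals are assumptions made for this example only; larger problems need an LP solver.

```python
def transport_cost(P, M):
    """Frobenius product <P, M>."""
    return sum(P[i][j] * M[i][j] for i in range(len(P)) for j in range(len(P[0])))


def in_polytope(P, r, c, tol=1e-9):
    """Check nonnegativity and the row/column marginal constraints of U(r, c)."""
    rows_ok = all(abs(sum(P[i]) - r[i]) <= tol for i in range(len(r)))
    cols_ok = all(abs(sum(P[i][j] for i in range(len(r))) - c[j]) <= tol
                  for j in range(len(c)))
    nonneg = all(p >= -tol for row in P for p in row)
    return rows_ok and cols_ok and nonneg


def ot_2x2(r, c, M):
    """Exact optimal plan for 2x2 marginals: U(r, c) is a segment in t = p_11."""
    lo = max(0.0, r[0] + c[0] - 1.0)
    hi = min(r[0], c[0])

    def plan(t):
        return [[t, r[0] - t], [c[0] - t, 1.0 - r[0] - c[0] + t]]

    # A linear cost over a segment attains its minimum at one of the two endpoints.
    best = min((plan(lo), plan(hi)), key=lambda P: transport_cost(P, M))
    return best, transport_cost(best, M)


r, c = [0.6, 0.4], [0.7, 0.3]
M = [[0.0, 1.0], [1.0, 0.0]]
plan, cost = ot_2x2(r, c, M)
print(plan, cost)
```

The optimal plan puts as much mass as possible on the zero-cost diagonal and contains a zero entry, which is the sparsity behaviour that regularization is meant to soften.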

In current applications of optimal transport, the key unknown is usually the distance $d_M$ [8, 9, 19, 29, 36] (etc.). In the context of ecological inference [21], it is rather $P^\star$: $P^\star$ describes a joint distribution between two discrete random variables with respective marginals $\mathbf{r}$ and $\mathbf{c}$, for example the support of the first being the votes for a given year's US presidential election, and that of the second being the ethnic breakdown in the US population in a given year, see Figure 1. In this case, $P^\star$ denotes an "ideal" joint distribution of votes within ethnicities, ideal in the sense that it minimizes a distance based on the belief that votes correlate positively with a similarity between an ethnic profile and a party's profile. While we will carry out most of our theory on formal transportation grounds, requiring in particular that $M$ be a distance matrix, it should be understood that requiring just "correlation" alleviates the need for $M$ to formally be a distance for ecological inference.

## 3 Tsallis Regularized Optimal Transport

For any $\mathbf{p} \in \triangle_n$ and $q \neq 1$, the Tsallis entropy of $\mathbf{p}$, $H_q(\mathbf{p})$, is:

$$H_q(\mathbf{p}) \stackrel{\mathrm{def}}{=} \frac{1}{1-q} \cdot \sum_i (p_i^q - p_i), \tag{2}$$

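and for any $P \in \mathbb{R}_+^{n \times n}$, we let $H_q(P) \stackrel{\mathrm{def}}{=} H_q(\mathrm{vec}(P))$. Notably, we have $\lim_{q \to 1} H_q(\mathbf{p}) = -\sum_i p_i \ln p_i$, which is just Shannon's entropy. For any $\lambda > 0$, we define the Tsallis Regularized Optimal Transport (trot) distance.

Definition. The trot($q, \lambda, M$) distance (or trot distance for short) between $\mathbf{r}$ and $\mathbf{c}$ is: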

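$$d^{\lambda,q}_M(\mathbf{r},\mathbf{c}) \stackrel{\mathrm{def}}{=} \min_{P \in U(\mathbf{r},\mathbf{c})} \langle P, M \rangle - \frac{1}{\lambda} \cdot H_q(P). \tag{3}$$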
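The two ingredients of eq. (3) are easy to evaluate for a given plan. The following minimal sketch implements the Tsallis entropy of eq. (2), with Shannon's entropy as the limit at 1, and the regularized objective; the helper names and the 2x2 toy plan are illustrative assumptions, and only positive values of the entropy parameter are handled.

```python
import math


def tsallis_entropy(p, q):
    """Tsallis entropy H_q of eq. (2) for a flat list of nonnegative weights (q > 0 assumed)."""
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: Shannon's entropy.
        return -sum(x * math.log(x) for x in p if x > 0)
    return sum(x ** q - x for x in p) / (1.0 - q)


def trot_objective(P, M, lam, q):
    """Objective of eq. (3) evaluated at a given plan P: <P, M> - H_q(P) / lam."""
    cost = sum(p * m for row_p, row_m in zip(P, M) for p, m in zip(row_p, row_m))
    flat = [p for row in P for p in row]
    return cost - tsallis_entropy(flat, q) / lam


# Toy 2x2 instance (illustrative values only).
P = [[0.25, 0.25], [0.25, 0.25]]
M = [[0.0, 1.0], [1.0, 0.0]]
flat = [p for row in P for p in row]

print(trot_objective(P, M, lam=1.0, q=2.0))   # 0.5 - 0.75 = -0.25
print(tsallis_entropy(flat, 1e-9))            # close to n^2 - 1 = 3 for a positive plan
print(tsallis_entropy(flat, 1.0), math.log(4))
```

The second print shows why a vanishing entropy parameter recovers unregularized transport on positive plans: the entropy becomes a constant that no longer depends on the plan.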

A simple yet important property is that the trot distance unifies both usual modalities of optimal transport. It generalizes optimal transport (ot) when $q \to 0$, since $H_q(P)$ converges to a constant and so the ot-distance is obtained up to a constant additive term [22, 26]. It also generalizes the regularized optimal transport approach of [8] since $\lim_{q \to 1} d^{\lambda,q}_M(\mathbf{r},\mathbf{c})$ is the Sinkhorn distance between $\mathbf{r}$ and $\mathbf{c}$ [8]. There are several important structural properties of $d^{\lambda,q}_M$ that motivate the unification of both approaches. To state them, we respectively define the $q$-logarithm,

$$\log_q(x) \stackrel{\mathrm{def}}{=} (1-q)^{-1} \cdot (x^{1-q} - 1), \tag{4}$$

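the $q$-exponential, $\exp_q(x) \stackrel{\mathrm{def}}{=} (1 + (1-q)x)^{1/(1-q)}$, and the Tsallis relative $q$-entropy between $P, R \in \mathbb{R}_+^{n \times n}$ as: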

$$K_q(P,R) \stackrel{\mathrm{def}}{=} \frac{1}{1-q} \cdot \sum_{i,j} \left( q p_{ij} + (1-q) r_{ij} - p_{ij}^q r_{ij}^{1-q} \right). \tag{5}$$
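A short numerical sketch can help read eqs. (4) and (5). It implements the q-logarithm, its inverse (the q-exponential, written here as the inverse of eq. (4)), and the Tsallis relative entropy, then checks two special cases that follow directly from eq. (5): the Pearson chi-square form at parameter 2 and the squared Hellinger form at parameter 1/2. Function names and the toy matrices are assumptions for illustration.

```python
import math


def log_q(x, q):
    """q-logarithm of eq. (4); natural logarithm in the limit q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def exp_q(x, q):
    """Inverse of log_q; requires 1 + (1 - q) * x > 0."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))


def tsallis_relative_entropy(P, R, q):
    """K_q(P, R) of eq. (5) for positive matrices given as nested lists (q != 1)."""
    total = 0.0
    for row_p, row_r in zip(P, R):
        for p, r in zip(row_p, row_r):
            total += q * p + (1.0 - q) * r - p ** q * r ** (1.0 - q)
    return total / (1.0 - q)


# Toy joint distributions (illustrative values only).
P = [[0.4, 0.1], [0.2, 0.3]]
R = [[0.25, 0.25], [0.25, 0.25]]
pairs = [(p, r) for rp, rr in zip(P, R) for p, r in zip(rp, rr)]

pearson = sum((p - r) ** 2 / r for p, r in pairs)
hellinger = sum((math.sqrt(p) - math.sqrt(r)) ** 2 for p, r in pairs)

print(tsallis_relative_entropy(P, R, 2.0), pearson)
print(tsallis_relative_entropy(P, R, 0.5), hellinger)
```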

Taking joint distribution matrices $P, R$ and $q \to 1$ makes it possible to recover the natural logarithm, the exponential and the Kullback-Leibler (kl) divergence, respectively [1]. Other notable examples include (i) Pearson's $\chi^2$ statistic ($q = 2$), (ii) Neyman's statistic ($q = -1$), (iii) square Hellinger distance ($q = 1/2$) and the reverse kl divergence if scaled appropriately [21], which also makes it possible to span Amari's $\alpha$-divergences [1]. For any function $f$, we denote by $f(P)$, for a matrix $P$, the matrix whose general term is $f(p_{ij})$.

Lemma. Let $\tilde{U}$ be the matrix with general term $\tilde{u}_{ij} \stackrel{\mathrm{def}}{=} \exp_q(-1) \exp_q^{-1}(\lambda m_{ij})$. Then:

$$d^{\lambda,q}_M(\mathbf{r},\mathbf{c}) = \frac{1}{\lambda} \cdot \min_{P \in U(\mathbf{r},\mathbf{c})} K_{1/q}(P^q, \tilde{U}^q) + g(M), \tag{6}$$

where $g(M)$ does not play any role in the minimization of $K_{1/q}(P^q, \tilde{U}^q)$. Lemma 3 shows that the trot distance is a divergence involving escort distributions [1, 4], a particularity that disappears in Sinkhorn distances since it becomes an ordinary kl divergence between distributions. Predictably, the generalization is useful to create new solutions to the regularized optimal transport problem that are not captured by Sinkhorn distances (solution refers to (optimal) transportation plans, i.e. the argument of the $\min$ in eq. (3)).

Theorem. Let denote the set of solutions of eq. (3) when $M$ ranges over all distance matrices. Then such that , , .

Figure 2 provides examples of solutions. Adding the free parameter $q$ is not just interesting for the reason that we bring new solutions to the table: turns out to be Cressie-Read Power Divergence (for , [21]), and so trot has an applicability in ecological inference that Sinkhorn distances alone do not have. In addition, we also generalize two key facts already known for Sinkhorn distances [8]. First, the solution to trot is unique (for ) and satisfies a simple analytical expression amenable to convenient optimization.

Theorem. There exists exactly one matrix solution to trot($q, \lambda, M$). It satisfies:

$$p_{ij} = \exp_q(-1) \exp_q^{-1}(\alpha_i + \lambda m_{ij} + \beta_j), \quad \forall i,j. \tag{7}$$

($\boldsymbol{\alpha}, \boldsymbol{\beta}$ are unique up to an additive constant). Second, we can tweak trot to meet distance axioms. Let

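$$d_{M,\alpha,q}(\mathbf{r},\mathbf{c}) \stackrel{\mathrm{def}}{=} \min_{\substack{P \in U(\mathbf{r},\mathbf{c}) \\ H_q(P) - H_q(\mathbf{r}) - H_q(\mathbf{c}) \geq \alpha}} \langle P, M \rangle, \tag{8}$$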

where . For any , such that . Also, the following holds.

Theorem. For and if $M$ is a metric matrix, function is a distance.

Theorem 3 is a generalization of [8, Theorem 1] (for $q = 1$). As we explain more precisely in sm (Section D), there is a downside to using as proof of the good properties of : the triangle inequality, key to Euclidean geometry, transfers to with varying and uncontrolled parameters: in the inequality, the three values of may all be different! This does not break down the good properties of , it just calls for workarounds. We now give one, which replaces by the quantity ($\beta$ is a constant):

$$d^{\lambda,q,\beta}_M(\mathbf{r},\mathbf{c}) \stackrel{\mathrm{def}}{=} \tag{9}$$

This has another trivial advantage that does not have: the solutions (optimal transportation plans) are always the same on both sides. Also, the right-hand side is lower bounded for any and the trick that ensures the identity of the indiscernibles still works on . The good news is that if , , as is, can satisfy the triangle inequality.

Theorem. satisfies the triangle inequality, .

Hence, the solutions to are optimal transport plans for distortions that meet the triangle inequality. This is new compared to [8]. For a general , the proof, in Supplementary Material (Section D), shows more, namely that satisfies a weak form of the identity of the indiscernibles. Finally, there always exists a value such that is non-negative ( is lower bounded ).
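The identity behind eq. (6) can be checked numerically. The sketch below assumes that the reference matrix has entries exp_q(-1) / exp_q(lambda m_ij), which is what matching the terms of eqs. (3) and (5) requires, and it uses the explicit constant g(M) = -(1/lambda) times the sum of the q-th powers of those entries. That closed form of g(M) is worked out for this example and is not quoted from the paper; all names and toy values are illustrative.

```python
def exp_q(x, q):
    """q-exponential; requires 1 + (1 - q) * x > 0 and q != 1."""
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))


def tsallis_entropy(flat, q):
    return sum(x ** q - x for x in flat) / (1.0 - q)


def k_rel(A, B, q):
    """Tsallis relative entropy of eq. (5) on flattened positive matrices."""
    return sum(q * a + (1.0 - q) * b - a ** q * b ** (1.0 - q)
               for a, b in zip(A, B)) / (1.0 - q)


lam, q = 2.0, 0.5
M = [0.0, 1.0, 1.0, 0.0]   # flattened 2x2 cost matrix (toy values)
P = [0.4, 0.1, 0.2, 0.3]   # flattened 2x2 plan (toy values)

# Assumed reference matrix: u_ij = exp_q(-1) / exp_q(lam * m_ij).
U = [exp_q(-1.0, q) / exp_q(lam * m, q) for m in M]

# Left-hand side: objective of eq. (3) evaluated at P.
lhs = sum(p * m for p, m in zip(P, M)) - tsallis_entropy(P, q) / lam

# Right-hand side: (1 / lam) * K_{1/q}(P^q, U^q) + g(M), with the assumed g(M).
g = -sum(u ** q for u in U) / lam
rhs = k_rel([p ** q for p in P], [u ** q for u in U], 1.0 / q) / lam + g

print(lhs, rhs)
```

Because the two sides agree for every plan, minimizing one over the polytope is the same as minimizing the other, which is the content of the lemma.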

## 4 Efficient trot optimizers

The key idea behind Sinkhorn-Cuturi's solution is that the KKT conditions ensure that the optimal transportation plan $P^\star$ satisfies $P^\star = \mathrm{diag}(\mathbf{u}) \exp(-\lambda M) \mathrm{diag}(\mathbf{v})$. Sinkhorn's balancing normalization can then directly be used for a fast approximation of $P^\star$ [34, 33]. This trick does not fit at first sight for Tsallis regularization because the $q$-exponential is not multiplicative for general $q$ and KKT conditions do not seem to be as favorable. We give however workarounds for the optimization, that work for any $q$.
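The lack of multiplicativity can be made explicit with a one-line computation. Writing the q-exponential as the inverse of eq. (4), that is $\exp_q(x) = (1+(1-q)x)^{1/(1-q)}$, and assuming both bracketed factors are positive:

```latex
\exp_q(a)\,\exp_q(b)
  = \bigl[(1+(1-q)a)\,(1+(1-q)b)\bigr]^{\frac{1}{1-q}}
  = \bigl[1+(1-q)\,\bigl(a+b+(1-q)\,ab\bigr)\bigr]^{\frac{1}{1-q}}
  = \exp_q\bigl(a+b+(1-q)\,ab\bigr).
```

The cross term $(1-q)ab$ vanishes only in the limit $q \to 1$ (or when $ab = 0$), so a plan of the form of eq. (7) does not factor into a row scaling times a kernel times a column scaling, which is what Sinkhorn balancing exploits.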

First, we assume wlog that $q \notin \{0, 1\}$ since in those cases, any efficient LP solver ($q = 0$) or Sinkhorn balancing normalization ($q = 1$) can be used. The task is non trivial because for $q \in (0, 1)$, the function minimized in eq. (3) is not Lipschitz, which impedes the convergence of gradient methods. In this case, our workaround is Algorithm 1 (sotrot), which relies on a Second Order approximation of a fundamental quantity used in its convergence proof, auxiliary functions [10].

Theorem (Convergence of sotrot). For any fixed $q \in (0, 1)$, matrix $P$ output by sotrot converges to $P^\star$ with:

$$P^\star = \arg\min_{P \in \mathbb{R}_+^{n \times n} : P\mathbf{1} = \mathbf{r}} K_{1/q}(P^q, \tilde{U}^q).$$

The proof (in Supplementary Material, Section E) is involved but interesting in itself because it represents one of the first uses of the theory of auxiliary functions outside the realm of Bregman divergences in machine learning [5, 10]. Some important remarks should be made. First, since sotrot uses only one of the two marginal constraints, it would need to be iterated ("wrapped"), swapping the row and column constraints like in Sinkhorn balancing. In practice, this is not efficient. Furthermore, iterating sotrot over constraint swapping does not necessarily converge. For these reasons, we swap constraints inside the algorithm, making one iteration of Steps 4-14 over rows, and then one iteration of Steps 4-14 over columns (this boils down to transposing matrices in sotrot), and so on. This converges, but is still not the most efficient. To improve efficiency we perform two modifications that do not impede convergence experimentally. First, we remove Step 12. In doing so, we not only save computations for each outer loop, we essentially make sotrot as parallelizable as Sinkhorn balancing [8]. Second, we remarked experimentally that convergence is faster when multiplying by 2 in Step 10, and dividing by 2 in Step 5.

For simplicity, we still refer to this algorithm (balancing constraints in the algorithm, with the modifications for Steps 5, 10, 12) as sotrot in the experiments.

Last, when $q > 1$, the function minimized in eq. (3) becomes Lipschitz. In this case, we take the particular geometry of $U(\mathbf{r},\mathbf{c})$ into account by using mirror gradient methods, which are equivalent to gradient methods projected according to some suitable divergence [2]. In our case, we consider the Kullback-Leibler divergence, which can reduce the number of iterations [2]. Furthermore, the Kullback-Leibler projection can be written in terms of Sinkhorn-Knopp's (SK) algorithm with marginal constraints [35], as is shown in Algorithm 2, named kltrot ($\otimes$ is the Kronecker product).

Theorem. If $q > 1$ and the gradient steps are s.t. and , matrix $P$ output by kltrot converges to $P^\star$ with:

$$P^\star = \arg\min_{P \in U(\mathbf{r},\mathbf{c})} K_{1/q}(P^q, \tilde{U}^q).$$

(proof omitted, follows [2, 35])
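The general idea of a mirror gradient step followed by a Kullback-Leibler projection can be sketched in a few lines. This is a generic entropic mirror descent on the objective of eq. (3), with Sinkhorn-Knopp scaling as the projection; it is not a transcription of Algorithm 2, and the step-size schedule, iteration counts and toy problem are assumptions chosen for illustration.

```python
import math


def sinkhorn_projection(K, r, c, n_iter=100):
    """KL projection of a positive matrix K onto U(r, c) by alternate row/column scaling."""
    P = [row[:] for row in K]
    for _ in range(n_iter):
        for i, row in enumerate(P):
            s = sum(row)
            P[i] = [x * r[i] / s for x in row]
        col = [sum(P[i][j] for i in range(len(r))) for j in range(len(c))]
        P = [[P[i][j] * c[j] / col[j] for j in range(len(c))] for i in range(len(r))]
    return P


def trot_gradient(P, M, lam, q):
    """Entry-wise gradient of <P, M> - H_q(P) / lam (q != 1)."""
    return [[m - (q * p ** (q - 1.0) - 1.0) / (lam * (1.0 - q))
             for p, m in zip(rp, rm)] for rp, rm in zip(P, M)]


def mirror_descent_trot(M, r, c, lam, q, n_steps=200, step0=1.0):
    """Multiplicative (entropic) gradient step, then KL projection onto U(r, c)."""
    P = [[ri * cj for cj in c] for ri in r]          # start from the independent coupling
    for t in range(1, n_steps + 1):
        eta = step0 / math.sqrt(t)                   # diminishing step size (illustrative choice)
        G = trot_gradient(P, M, lam, q)
        K = [[p * math.exp(-eta * g) for p, g in zip(rp, rg)] for rp, rg in zip(P, G)]
        P = sinkhorn_projection(K, r, c)
    return P


# Toy symmetric problem whose minimizer can be found by hand: p_11 = 5 / 16.
M = [[0.0, 1.0], [1.0, 0.0]]
r = c = [0.5, 0.5]
P = mirror_descent_trot(M, r, c, lam=0.25, q=2.0)
print(P)
```

On this toy instance the plan is parameterized by its top-left entry t, the objective reduces to 16 t^2 - 10 t - 1, and the iterates approach the interior minimizer t = 0.3125.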

## 5 Experiments

We evaluate empirically the trot framework with its application to ecological inference. The dataset we use describes millions of individual voters from Florida for the 2012 US presidential elections, as obtained from [20]. The data is much richer than is required for ecological inference: we could estimate the joint distribution of every voter's available attributes simply by counting. This is itself a particularly rare case of data quality in political science, where analysis is often carried out on aggregate measurements. In fact, since ground truth distributions are effectively available, the Florida dataset has been used to test methodological advances in the field [14, 20]. As a demonstrative example, we focus on inferring the joint distributions of ethnicity and party for all Florida counties.

Dataset description and preprocessing. The data contains the following attributes for each voter: location (district, county), gender, age, party (Democrat, Republican, Other), ethnicity (White, African-American, Hispanic, Asian, Native, Other), 2008 vote (yes, no). About 800K voters with missing attributes are excluded from the study. Thanks to the richness of the data, marginal probabilities of ethnic groups and parties can be obtained by counting: for each county we obtain marginals for the optimal transport problems.

Evaluation assumptions. Two assumptions are made in terms of information available for inference. First, the ground truth joint distributions for one district are known; we chose district number 3, which groups a subset of the 67 Florida counties. This information will be used to tune hyper-parameters. Second, a cost matrix is computed based on mean voter's attributes at state level. For the sake of simplicity, we retain only age (normalized), gender and the 2008 vote; notice that in practice geographical attributes may encode relevant information for computing distances between voter behaviours [14]. We do not use this. For distance matrix $M$, we aggregate those features over all Florida for each party to obtain the vectors $\mu^{\mathsf{p}}_i$ of the party's expected profile and for each ethnic group to obtain the vectors $\mu^{\mathsf{e}}_j$ of the ethnicity's expected profile. The dissimilarity measure relies on a Gaussian kernel between average county profiles:

$$m^{\text{rbf}}_{ij} \stackrel{\mathrm{def}}{=} \sqrt{2 - 2\exp\left(-\gamma \cdot \|\mu^{\mathsf{p}}_i - \mu^{\mathsf{e}}_j\|_2\right)}, \tag{11}$$

with $\gamma > 0$ a fixed kernel parameter. The given function is actually the Hilbert metric in the RBF space. Table 1 shows the resulting cost matrix. Notice how it does encode some common-sense knowledge: White and Republican is the best match, while Hispanic and Asian are the worst matches with Republican profiles. It is rather surprising that only 3 features such as age, gender and whether people voted at the last election can reflect so well those relative political traits; these results are indeed much in line with survey-based statistics [18]. We also try another cost matrix, derived from the ID proportions of parties composition given in [18]; it is computed from the proportion of people registered to party $i$ belonging to ethnic group $j$. Finally, we consider a "no prior" cost matrix.
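The construction of the kernel-based cost matrix of eq. (11) can be sketched as follows. The profile vectors below are hypothetical placeholders (not values from the Florida data), the kernel parameter is an arbitrary choice, and the function name is introduced here for illustration.

```python
import math


def rbf_cost_matrix(party_profiles, group_profiles, gamma):
    """Cost matrix following eq. (11); profiles are lists of equal-length feature vectors."""
    M = []
    for mu_p in party_profiles:
        row = []
        for mu_e in group_profiles:
            # Plain Euclidean norm; use dist ** 2 instead for the squared-distance Gaussian kernel.
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_p, mu_e)))
            row.append(math.sqrt(2.0 - 2.0 * math.exp(-gamma * dist)))
        M.append(row)
    return M


# Hypothetical mean profiles (normalized age, share of women, 2008 turnout); not real data.
party_profiles = [[0.45, 0.55, 0.70], [0.55, 0.48, 0.75]]
group_profiles = [[0.55, 0.48, 0.75], [0.40, 0.56, 0.60], [0.42, 0.52, 0.65]]

M = rbf_cost_matrix(party_profiles, group_profiles, gamma=1.0)  # gamma value is a placeholder
for row in M:
    print([round(m, 3) for m in row])
```

Identical profiles get cost zero and every entry stays below the square root of 2, so the matrix behaves like a bounded dissimilarity between a party and a group.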

Cross-validation of $q$ and $\lambda$. We study the solution of trot for a grid of values of $q$ and $\lambda$, inferring the joint distributions of all counties of district number 3. We measure average KL-divergence between inferred and ground truth joint distributions. Notice that each county defines a different optimal transport problem; inferring the joint distributions for multiple counties at a time is therefore trivial to parallelize. This is somewhat counter-intuitive since we may believe that geographically wider spread data should improve inference at a local level, that is, more data, better inference. Indeed, the implicit coupling of the problem is represented by the cost matrix, which expresses some prior knowledge of the problem by means of all data from Florida.
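The selection criterion described above can be sketched as a small grid search. The direction of the divergence (ground truth first, inferred second), the smoothing constant and the `infer` callback are assumptions made for this example; the toy "solver" merely stands in for one transport problem per county.

```python
import math


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) between two joint distributions given as nested lists."""
    total = 0.0
    for row_p, row_q in zip(P, Q):
        for p, q in zip(row_p, row_q):
            if p > 0:
                total += p * math.log(p / max(q, eps))
    return total


def average_kl(truths, inferred):
    """Mean divergence over counties; both arguments hold one matrix per county."""
    return sum(kl_divergence(t, p) for t, p in zip(truths, inferred)) / len(truths)


def select_by_grid(truths, infer, grid):
    """Return the grid point with smallest average KL; infer(params) yields one plan per county."""
    return min(grid, key=lambda params: average_kl(truths, infer(params)))


# Toy setting: a single county and two candidate hyper-parameter settings.
truths = [[[0.4, 0.1], [0.2, 0.3]]]


def toy_infer(params):
    # Stand-in for solving one transport problem per county with hyper-parameters `params`.
    if params == "good":
        return truths
    return [[[0.25, 0.25], [0.25, 0.25]]]


best = select_by_grid(truths, toy_infer, ["bad", "good"])
print(best)
```

Since each county is an independent problem, the list comprehension inside `average_kl` is the natural place to parallelize.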

Baselines and comparisons with other methods. To evaluate quantitatively the solution of trot, it is useful to define a set of baseline methods: i) Florida-average, which uses the same state-level joint distribution (assumed prior knowledge) for each of the 67 counties; ii) Simplex, that is the solution of optimal transport with no regularization as given by the Simplex algorithm; iii) Sinkhorn(-Cuturi)'s algorithm, which is trot with $q = 1$; iv) trot. ii-iv are tested with $M^{\text{rbf}}$ and the survey-based cost matrix, and we provide in addition the results for trot with the "no prior" cost matrix. Hyper-parameters are cross-validated independently for each algorithm.

Table 2 reports a quantitative comparison. From the most general to the most specific, there are three remarks to make. First, optimal transport can be (but is not always) better than the default distribution (Florida average). Second, regularizing optimal transport consistently improves upon these baselines. Third, trot successfully matches Sinkhorn's approach when $q = 1$ is the best solution in trot's range of $q$, and manages to tune $q$ to significantly beat Sinkhorn's when better alternatives exist: with the survey-based cost matrix, trot divides the expected KL divergence by more than seven (7) compared to Sinkhorn. This is a strong argument for allowing the tuning of $q$. Notice that in this case, $\lambda$ is larger compared to $M^{\text{rbf}}$, which makes sense since the survey-based matrix is more accurate for the optimal transport problem (see the Simplex results) and so the weight of the regularizer predictably decreases in the regularized optimal transport distance. We conjecture that the survey-based matrix beats $M^{\text{rbf}}$ in part because it is somehow finer grained: $M^{\text{rbf}}$ is computed from sufficient statistics for the marginals alone, while the survey-based matrix exploits information computed from the cartesian product of the supports. Figure 3 compares all 1 836 inferred probabilities with respect to the ground truth for Sinkhorn vs trot. Remark that the figures in Table 2 translate to per-county ecological inference results that are significantly more in favor of trot, which basically has no "hard-to-guess" counties compared to Sinkhorn for which the absolute difference between inference and ground truth can exceed 10.

To finish up, additional experiments, displayed in sm (Sections F and G), also show that trot with the survey-based cost matrix manages to have a distribution of per county errors extremely peaked around zero error, compared to the simplest baselines (Florida average and trot with the "no prior" cost matrix). This is good news, but there are some local discrepancies. For example, there exists one county on which trot with the survey-based cost matrix is beaten by trot with the "no prior" cost matrix.

## 6 Discussion and conclusion

In this paper, we have bridged Shannon regularized optimal transport and unregularized optimal transport, via Tsallis entropic regularization. There are three main motivations for the generalization, the first two of which have already been discussed: trot makes it possible to keep the properties of Sinkhorn distances, and fields like ecological inference bring natural applications for the general trot family. The application to ecological inference is also interesting because the main unknown is the optimal transportation plan and not necessarily the transportation distance obtained. The third and last motivation is important for applications at large and ecological inference in particular. trot spans a subset of $f$-divergences, and $f$-divergences satisfy the information monotonicity property that coarse graining does not increase the divergence [1, 3.2]. Furthermore, $f$-divergences are invariant under diffeomorphic transformations [30, Theorem 1]. This is a powerful statement: if the ground metric is affected by such a transformation (for example, we change the underlying manifold coordinate system, e.g. for privacy reasons), then, from the optimal trot transportation plan, the transportation plan corresponding to the initial coordinate system can be recovered from the sole knowledge of the transformation.
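Information monotonicity is easy to observe on a toy example. The sketch below uses the Kullback-Leibler divergence, one member of the family, and merges the cells of two hypothetical distributions into coarser groups; the distributions and the grouping are illustrative assumptions.

```python
import math


def kl(p, q):
    """KL divergence between two discrete distributions given as flat lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def coarse_grain(p, groups):
    """Merge cells: each group is a tuple of indices whose masses are summed."""
    return [sum(p[i] for i in g) for g in groups]


p = [0.5, 0.2, 0.2, 0.1]
q = [0.25, 0.25, 0.25, 0.25]
groups = [(0, 1), (2, 3)]

fine = kl(p, q)
coarse = kl(coarse_grain(p, groups), coarse_grain(q, groups))
print(fine, coarse)
```

The divergence computed after merging is no larger than the one computed on the original cells, as the monotonicity property states.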

The algorithms we provide allow for the efficient optimization of the regularized optimal transport for all values of $q$, and include notable cases for which conventional gradient-based approaches would probably not be the best approaches because the function to optimize is not Lipschitz for the $q$ chosen. In fact, the main notable downside of the generalization is that we could not prove the same (geometric) convergence rates as the ones that are known for Sinkhorn's approach [16].

Our results show that there can be significant discrepancies in the regularized optimal transport results depending on how the cost matrix is crafted, yet the information we used for our best experiments is readily available from public statistics. Even the instantiation without prior knowledge does not strictly fail in returning useful solutions (compared e.g. to Florida average and unregularized optimal transport). This may be a strong argument for using trot even on domains for which little prior knowledge is available.

## Acknowledgments

The authors wish to thank Seth Flaxman and Wendy K. Tam Cho for numerous stimulating discussions. Work done while Boris Muzellec was visiting Nicta / Data61. Nicta was funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Center of Excellence Program.

## References

• [1] S.-I. Amari. Information geometry and its applications. Springer-Verlag, Berlin, 2016.
• [2] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167 – 175, 2003.
• [3] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys., 7:200–217, 1967.
• [4] W.-K.-T. Cho and C.-F. Manski. Cross level/ecological inference. Oxford Handbook of Political Methodology, pages 547–569, 2008.
• [5] M. Collins, R. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. MLJ, pages 253–285, 2002.
• [6] P.-J. Cross and C.-F. Manski. Regressions, short and long. Econometrica, 70(1):357–368, 2002.
• [7] I. Csiszár. A geometric interpretation of Darroch and Ratcliff’s generalized iterative scaling. Ann. of Stat., 17:1409–1413, 1989.
• [8] M. Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. In NIPS*26, pages 2292–2300, 2013.
• [9] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In 31 ICML, pages 685–693, 2014.
• [10] S. Della Pietra, V.-J. Della Pietra, and J.-D. Lafferty. Inducing features of random fields. IEEE Trans. PAMI, 19(4):380–393, 1997.
• [11] S. Donoso, N. Marín, and M.-A. Vila. Systems of possibilistic regressions: A case study in ecological inference. Mathware and Soft Computing, 12:169–184, 2005.
• [12] O.-D. Duncan and B. Davis. An alternative to ecological correlation. American sociological review, pages 665–666, 1953.
• [13] R. Flamary, C. Févotte, N. Courty, and V. Emyia. Optimal spectral transportation with application to music transcription. In NIPS*29, 2016.
• [14] S.-R. Flaxman, Y.-X. Wang, and A.-J. Smola. Who supported Obama in 2012?: Ecological inference through distribution regression. In 21 KDD, pages 289–298, 2015.
• [15] A. Forcina and G.-M. Marchetti. The Brown and Payne model of voter transition revisited. In S. Ingrassia, R. Rocci, and M. Vichi, editors, New Perspectives in Statistical Modeling and Data Analysis, pages 481–488. Springer, 2011.
• [16] J. Franklin and J. Lorenz. On the scaling of multidimensional matrices. Linear Algebra and Applications, 114:717–735, 1989.
• [17] S. Furuichi. Information theoretical properties of Tsallis entropies. Journal of Mathematical Physics, 47(2), 2006.
• [18] Gallup.
• [19] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In NIPS*29, 2016.
• [20] K. Imai and K. Khanna. Improving ecological inference by predicting individual ethnicity from voter registration records. Political Analysis, 24:263–272, 2016.
• [21] G.-G. Judge, D.-J. Miller, and W.-K.-T. Cho. An information theoretic approach to ecological estimation and inference. In G. King, O. Rosen, and M. Tanner, editors, Ecological inference: New methodological strategies, pages 162–187. Cambridge University Press, 2004.
• [22] L. Kantorovitch. On the translocation of masses. Management Science, pages 1–4, 1958.
• [23] G. King. A solution to the ecological inference problem: reconstructing individual behavior from aggregate data. Princeton University Press, 1997.
• [24] G. King, M.-A. Tanner, and O. Rosen. Ecological inference: New methodological strategies. Cambridge University Press, 2004.
• [25] M.-J. Kusner, Y. Sun, N.-I. Kolkin, and K.-Q. Weinberger. From word embeddings to document distances. In 32 ICML, pages 957–966, 2015.
• [26] G. Monge. Mémoire sur la théorie des déblais et des remblais. Académie Royale des Sciences de Paris, pages 666–704, 1781.
• [27] R. Nock and F. Nielsen. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pages 1201–1208, 2008.
• [28] D. Posnett, V. Filkov, and P. Devanbu. Ecological inference in empirical software engineering. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, pages 362–371, 2011.
• [29] W. Qian, B. Hong, D. Cai, X. He, and X. Li. Non-negative matrix factorization with Sinkhorn distance. In 25 IJCAI, pages 1960–1966, 2016.
• [30] Y. Qiao and N. Minematsu. A study on invariance of $f$-divergence and its application to speech recognition. IEEE Trans. SP, 58:3884–3890, 2010.
• [31] W.-S. Robinson. Ecological correlations and the behavior of individuals. American Sociological Review, 15(3):351–357, 1950.
• [32] Y. Rubner, C. Tomasi, and L.-J. Guibas. The earth mover's distance as a metric for image retrieval. Int. J. Comp. Vis., 40:99–121, 2000.
• [33] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. Annals of Mathematical Statistics, 35:876–879, 1964.
• [34] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. American Mathematical Monthly, 74:402–405, 1967.
• [35] R. Sinkhorn and P. Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific J. Math., 21:343–348, 1967.
• [36] J. Solomon, F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34:66:1–66:11, 2015.
• [37] G.-W. Soules. The rate of convergence of Sinkhorn balancing. Linear Algebra and Applications, pages 3–40, 1991.
• [38] C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. J. of Statistical Physics, 52:479–487, 1988.
• [39] C. Villani. Optimal transport: old and new. Springer, 2009.
• [40] J. Wakefield and G. Shaddick. Health-exposure modeling and the ecological fallacy. Biostatistics, 7(3):438–455, 2006.

Supplementary material on proofs
Proof of Theorem 3: Section B
Proof of Theorem 3: Section C
Proof of Theorems 3 and 3: Section D
Proof of Theorem 4: Section E

Supplementary material on experiments
Per county error distribution, trot survey vs Florida average: Section F
Per county errors, trot survey vs trot: Section G

## Appendix B Proof of Theorem 3

Let be a distance matrix, and (the case when xor can be treated in a similar fashion). We suppose wlog that the support does not reduce to a singleton (otherwise the solution to optimal transport is trivial). Rescaling and a constant row vector and a constant column vector, the solution of trot can be written wlog as

$$p_{ij} = \exp_q(-1)\,\exp_q^{-1}(m_{ij}). \tag{12}$$

Assume there exists a such that the solution of trot is equal to that of trot. This is equivalent to saying that there exists such that

$$\exp_q(m_{ij}) = \exp_{q'}(\alpha_i + \lambda' m_{ij} + \beta_j), \quad \forall i,j. \tag{13}$$

Composing with $\log_{q'}$ and rearranging, this implies that

$$f^{\lambda'}_{q',q}(m_{ij}) = \alpha_i + \beta_j, \quad \forall i,j, \tag{14}$$

where

$$f^{\lambda'}_{q',q}(x) := (\log_{q'} \circ \exp_q)(x) - \lambda' x. \tag{15}$$

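Now, remark that, since $M$ is a distance, $m_{ii} = 0$ because of the identity of the indiscernibles, and so $f^{\lambda'}_{q',q}(m_{ii}) = f^{\lambda'}_{q',q}(0) = 0 = \alpha_i + \beta_i$ for all $i$, implying $\boldsymbol{\beta} = -\boldsymbol{\alpha}$. $f^{\lambda'}_{q',q}$ is differentiable. Let: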

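$$g^{\lambda'}_{q',q}(x) := \frac{d}{dx} f^{\lambda'}_{q',q}(x) = \exp_q^{q-q'}(x) - \lambda'; \tag{16}$$

$$h^{\lambda'}_{q',q}(x) := \frac{d}{dx} g^{\lambda'}_{q',q}(x) = (q-q') \cdot \exp_q^{2q-q'-1}(x). \tag{17}$$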
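The closed forms in eqs. (16) and (17) can be sanity-checked numerically. The sketch below assumes the standard Tsallis definitions $\exp_q(x) = (1+(1-q)x)^{1/(1-q)}$ and $\log_q(y) = (y^{1-q}-1)/(1-q)$, which are not restated in this appendix; the parameter values and all function names are arbitrary illustrative choices.

```python
def exp_q(x, q):
    """Tsallis q-exponential (assumed standard form), for q != 1."""
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

def log_q(y, q):
    """Tsallis q-logarithm (assumed standard form), for q != 1."""
    return (y ** (1.0 - q) - 1.0) / (1.0 - q)

def f(x, q, q_prime, lam_prime):
    """Eq. (15): log_{q'}(exp_q(x)) - lambda' * x."""
    return log_q(exp_q(x, q), q_prime) - lam_prime * x

def g(x, q, q_prime, lam_prime):
    """Eq. (16): claimed closed form of df/dx."""
    return exp_q(x, q) ** (q - q_prime) - lam_prime

def h(x, q, q_prime):
    """Eq. (17): claimed closed form of dg/dx."""
    return (q - q_prime) * exp_q(x, q) ** (2.0 * q - q_prime - 1.0)

def central_diff(func, x, eps=1e-5):
    """Symmetric finite-difference approximation of func'(x)."""
    return (func(x + eps) - func(x - eps)) / (2.0 * eps)

# Illustrative parameters with q > q', so h should be positive.
q, q_prime, lam_prime = 0.7, 0.4, 1.3
xs = [0.0, 0.5, 1.0, 2.0]

g_err = max(
    abs(central_diff(lambda t: f(t, q, q_prime, lam_prime), x)
        - g(x, q, q_prime, lam_prime))
    for x in xs
)
h_err = max(
    abs(central_diff(lambda t: g(t, q, q_prime, lam_prime), x)
        - h(x, q, q_prime))
    for x in xs
)
f_at_zero = f(0.0, q, q_prime, lam_prime)
```

Under these assumptions both finite-difference errors are small, `f_at_zero` vanishes, and `h` is positive on the sampled points, which is consistent with $g^{\lambda'}_{q',q}$ being increasing when $q > q'$.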

If we assume wlog that , then is increasing and zeroes at most once over , eventually on some that we define as if (and otherwise). Notice that and is bijective over . Suppose wlog that . Otherwise, all distances are scaled by the same real so that : this does not alter the property of being a distance. A distance being symmetric, we also have and since is strictly increasing in the range of distances, then we get from eq. (14) that and so (since ). Hence, there exists a real such that . We get, in matrix form

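$$f^{\lambda'}_{q',q}(M) = \boldsymbol{\alpha}\mathbf{1}^\top + \mathbf{1}\boldsymbol{\beta}^\top \tag{18}$$

$$= \alpha \cdot \mathbf{1}\mathbf{1}^\top - \alpha \cdot \mathbf{1}\mathbf{1}^\top = 0. \tag{19}$$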

Hence, $M = 0$ and the support reduces to a singleton (because of the identity of the indiscernibles), which is impossible.

Remark that the proof also works when is not a distance anymore, but for example contains all arbitrary non negative matrices. To see this, we remark that the right hand side of eq. (18) is a matrix of rank no larger than 2. Since is continuous, we have

$$\mathrm{Im}(f^{\lambda'}_{q',q}) := I \subseteq \mathbb{R}$$

where is not reduced to a singleton and so the left hand side of eq. (18) spans matrices of arbitrary rank. Hence, eq. (18) cannot always hold.
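The rank argument can be illustrated on a small case. The sketch below assumes the standard Tsallis forms of $\exp_q$ and $\log_q$, picks arbitrary vectors for $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$, and uses the identity matrix as one example of a non-negative matrix (it is not a distance matrix); `matrix_rank` is a hypothetical helper written for this illustration.

```python
def exp_q(x, q):
    """Tsallis q-exponential (assumed standard form), for q != 1."""
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

def log_q(y, q):
    """Tsallis q-logarithm (assumed standard form), for q != 1."""
    return (y ** (1.0 - q) - 1.0) / (1.0 - q)

def f(x, q, q_prime, lam_prime):
    """Eq. (15), applied to one matrix entry."""
    return log_q(exp_q(x, q), q_prime) - lam_prime * x

def matrix_rank(rows, tol=1e-9):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in rows]
    n_rows, n_cols = len(a), len(a[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the largest remaining entry in this column as pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < tol:
            continue
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, n_cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank

n = 4
# Right-hand side of eq. (18): alpha 1^T + 1 beta^T (arbitrary vectors).
alpha = [0.3, -1.2, 2.0, 0.5]
beta = [1.0, 0.1, -0.7, 0.4]
rhs = [[alpha[i] + beta[j] for j in range(n)] for i in range(n)]

# Left-hand side of eq. (18) for one non-negative matrix M (the identity).
q, q_prime, lam_prime = 0.7, 0.4, 1.3
M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
lhs = [[f(M[i][j], q, q_prime, lam_prime) for j in range(n)] for i in range(n)]

rank_rhs = matrix_rank(rhs)  # at most 2 for any alpha, beta
rank_lhs = matrix_rank(lhs)  # full rank here, since f(0) = 0 and f(1) != 0
```

With these values the entrywise image of $M$ is a non-zero multiple of the identity, so it has rank 4 and cannot equal a matrix of the form $\boldsymbol{\alpha}\mathbf{1}^\top + \mathbf{1}\boldsymbol{\beta}^\top$.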

## Appendix C Proof of Theorem 3

Denote

$$f_{ij} : p_{ij} \mapsto p_{ij} m_{ij} - \frac{1}{\lambda(1-q)}\left(p_{ij}^{q} - p_{ij}\right).$$

is twice differentiable on , and

$$\frac{d^2}{dx^2} f_{ij}(x) = \frac{q}{\lambda}\, x^{q-2} > 0$$

for any fixed , and so is strictly convex on . We also remark that is a non-empty compact subset of . Indeed, , (which proves boundedness) and is a closed subset of (being the intersection of the pre-images of singletons by continuous functions). Hence, since , there exists a unique minimum of this function in .
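As a quick check of the second-derivative formula used above, the following sketch compares it against a second-order finite difference of $f_{ij}$. The values of $m_{ij}$, $\lambda$ and $q$ are arbitrary illustrative choices, and the function names are hypothetical.

```python
def f_ij(x, m, lam, q):
    """Per-entry objective: x * m - (x^q - x) / (lam * (1 - q))."""
    return x * m - (x ** q - x) / (lam * (1.0 - q))

def second_derivative(x, lam, q):
    """Claimed closed form: (q / lam) * x^(q - 2)."""
    return (q / lam) * x ** (q - 2.0)

def second_difference(func, x, eps=1e-4):
    """Symmetric finite-difference approximation of func''(x)."""
    return (func(x + eps) - 2.0 * func(x) + func(x - eps)) / (eps * eps)

m, lam, q = 0.8, 2.0, 0.5  # illustrative values only
points = [0.2, 0.5, 0.9]

max_err = max(
    abs(second_difference(lambda t: f_ij(t, m, lam, q), x)
        - second_derivative(x, lam, q))
    for x in points
)
all_positive = all(second_derivative(x, lam, q) > 0 for x in points)
```

The linear term $x\, m_{ij}$ does not contribute to the second derivative, which is why the cost entry plays no role in the convexity of $f_{ij}$.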

To prove the analytic shape of the solution, we remark that trot() consists in minimizing a convex function given a set of affine constraints, and so the KKT conditions are necessary and sufficient. The KKT conditions give

$$p_{ij} = \exp_q(-1)\,\exp_q^{-1}(\alpha_i + \lambda m_{ij} + \beta_j),$$

where $\alpha_i$ and $\beta_j$ are Lagrange multipliers.

Finally, let us show that the Lagrange multipliers are unique up to an additive constant. Assume that $\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\alpha}', \boldsymbol{\beta}'$ are such that

$$\forall i,j, \quad p_{ij} = \exp_q(-1)\,\exp_q^{-1}(\lambda m_{ij} + \alpha_i + \beta_j) = \exp_q(-1)\,\exp_q^{-1}(\lambda m_{ij} + \alpha'_i + \beta'_j),$$

where $P$ is the unique solution of trot(). This implies

$$\alpha_i + \beta_j = \alpha'_i + \beta'_j, \quad \forall i,j,$$

i.e.

$$\alpha_i - \alpha'_i = \beta'_j - \beta_j, \quad \forall i,j.$$

In particular, if there exists and such that , then and in turn , which proves our claim.
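The converse direction, namely that shifting the multipliers by a common constant leaves the plan unchanged, is easy to check numerically. This sketch assumes the standard Tsallis form of $\exp_q$ and reads $\exp_q^{-1}$ as the reciprocal $1/\exp_q$; the cost matrix, multipliers and shift are arbitrary, and the resulting matrix is only the analytic form of the solution, not a plan with prescribed marginals.

```python
def exp_q(x, q):
    """Tsallis q-exponential (assumed standard form), for q != 1."""
    return (1.0 + (1.0 - q) * x) ** (1.0 / (1.0 - q))

def plan_from_multipliers(alpha, beta, M, lam, q):
    """p_ij = exp_q(-1) / exp_q(alpha_i + lam * m_ij + beta_j)."""
    n_rows, n_cols = len(M), len(M[0])
    return [
        [exp_q(-1.0, q) / exp_q(alpha[i] + lam * M[i][j] + beta[j], q)
         for j in range(n_cols)]
        for i in range(n_rows)
    ]

# Illustrative inputs (hypothetical values).
M = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
alpha = [0.2, 0.5, 0.1]
beta = [0.3, 0.0, 0.4]
lam, q = 1.5, 0.5

# Shift alpha by +C and beta by -C: alpha_i + beta_j is unchanged.
shift = 0.25
alpha_shifted = [a + shift for a in alpha]
beta_shifted = [b - shift for b in beta]

P = plan_from_multipliers(alpha, beta, M, lam, q)
P_shifted = plan_from_multipliers(alpha_shifted, beta_shifted, M, lam, q)

max_diff = max(
    abs(P[i][j] - P_shifted[i][j])
    for i in range(3) for j in range(3)
)
```

The two matrices agree up to floating-point rounding, which matches the claim that the multipliers are determined only up to an additive constant.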

## Appendix D Proof of Theorems 3 and 3

For reasons that we explain now, we will in fact prove Theorem 3 before we prove Theorem 3.

Had we chosen to follow [8], we would have replaced trot() by:

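$$d_{M,\alpha,q}(\boldsymbol{r},\boldsymbol{c}) := \min_{\substack{P \in U(\boldsymbol{r},\boldsymbol{c}) \\ H_q(P) - H_q(\boldsymbol{r}) - H_q(\boldsymbol{c}) \geq \alpha}} \langle P, M \rangle, \tag{20}$$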

for some . Both problems are equivalent since in trot() plays the role of the Lagrange multiplier for the entropy constraint in eq. (20) [8, Section 3], and so there exists an equivalent value of for which both problems coincide:

$$d_{M,\alpha^*,q}(\boldsymbol{r},\boldsymbol{c}) = d^{\lambda,q}_{M}(\boldsymbol{r},\boldsymbol{c}), \tag{21}$$

so eq. (20) indeed matches trot(). It is clear from eq. (21) that does not depend solely on , but also (eventually) on all other parameters, including .

This would not be a problem to state the triangle inequality for , as in [8] ():

$$d_{M,\alpha,q}(\boldsymbol{x},\boldsymbol{z}) \leq d_{M,\alpha,q}(\boldsymbol{x},\boldsymbol{y}) + d_{M,\alpha,q}(\boldsymbol{y},\boldsymbol{z}). \tag{22}$$

However, is fixed and in particular different from the that guarantee eq. (21) — and there might be three different sets of parameters for as it would equivalently appear from eq. (22). Under the simplifying assumption that only changes, we might just get from eq. (22):

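$$d^{\lambda^{*},q}_{M}(\boldsymbol{x},\boldsymbol{z}) \leq d^{\lambda'^{*},q}_{M}(\boldsymbol{x},\boldsymbol{y}) + d^{\lambda''^{*},q}_{M}(\boldsymbol{y},\boldsymbol{z}), \tag{23}$$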

with . Worse, the transportation plans may change with : for example, we may have

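$$\arg\min_{P \in U(\boldsymbol{x},\boldsymbol{z})} d^{\lambda_1,q}_{M}(\boldsymbol{x},\boldsymbol{z}) \neq \arg\min_{P \in U(\boldsymbol{x},\boldsymbol{z})} d^{\lambda_2,q}_{M}(\boldsymbol{x},\boldsymbol{z}),$$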

with and . So, the triangle inequality for that follows from ineq. (22) does not allow to control the parameters of trot() nor the optimal transportation plans that follows. It does not show a problem in regularizing the optimal transport distance, but rather that the distance chosen from eq. (21) does not completely fulfill its objective in showing that regularization in still keeps some of the attractive properties that unregularized optimal transport meets.

To bypass this problem and establish a statement involving a distance in which all parameters are in the clear and optimal transportation plans still coincide with , we chose to rely on measure:

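$$d^{\lambda,q,\beta}_{M}(\boldsymbol{r},\boldsymbol{c}) := \min_{P \in U(\boldsymbol{r},\boldsymbol{c})} \langle P, M \rangle - \frac{1}{\lambda} \cdot \left(H_q(P) - \beta \cdot (H_q(\boldsymbol{r}) + H_q(\boldsymbol{c}))\right),$$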

where $\beta$ is some constant. There is one trivial but crucial fact about $d^{\lambda,q,\beta}_{M}$: regardless of the choice of $\beta$, its optimal transportation plan is the same as for trot().

###### Lemma.

For any and constant $\beta$, let

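$$P_1 := \arg\min_{P \in U(\boldsymbol{r},\boldsymbol{c})} \langle P, M \rangle - \frac{1}{\lambda} \cdot \left(H_q(P) - \beta \cdot (H_q(\boldsymbol{r}) + H_q(\boldsymbol{c}))\right), \tag{24}$$

$$P_2 := \arg\min_{P \in U(\boldsymbol{r},\boldsymbol{c})} \langle P, M \rangle - \frac{1}{\lambda} \cdot H_q(P). \tag{25}$$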

Then $P_1 = P_2$.

###### Theorem.

The following holds for any fixed (unless otherwise stated):

• for any , satisfies the triangle inequality;

• for the choice , satisfies the following weak version of the identity of the indiscernibles: if , then .

• for the choice , , choosing the (no) transportation plan brings

$$\langle P, M \rangle - \frac{1}{\lambda} \cdot \left(H_q(P) - \frac{1}{2} \cdot (H_q(\boldsymbol{r}) + H_q(\boldsymbol{r}))\right) = 0.$$

Remark: the last property is trivial but worth stating since the (no) transportation plan also satisfies , which zeroes the (no) transportation distance .
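Two facts stated above lend themselves to a small numerical illustration: the $\beta$ term shifts the objective by an amount that does not depend on $P$ (so the minimizer cannot change), and the diagonal plan zeroes the objective when $\beta = 1/2$. The sketch assumes that the Tsallis entropy is $H_q(P) = \frac{1}{1-q}\sum_{ij}(p_{ij}^q - p_{ij})$, as read off from $f_{ij}$ in Appendix C, and that the "(no) transportation plan" is $\mathrm{Diag}(\boldsymbol{r})$; both are assumptions, as are all names and numeric values.

```python
def tsallis_entropy(values, q):
    """Assumed form: H_q = (1 / (1 - q)) * sum(v^q - v)."""
    return sum(v ** q - v for v in values) / (1.0 - q)

def flatten(P):
    return [v for row in P for v in row]

def inner(P, M):
    """Frobenius inner product <P, M>."""
    return sum(p * m for p, m in zip(flatten(P), flatten(M)))

def objective(P, M, r, c, lam, q, beta):
    """<P, M> - (1/lam) * (H_q(P) - beta * (H_q(r) + H_q(c)))."""
    entropy_term = tsallis_entropy(flatten(P), q) - beta * (
        tsallis_entropy(r, q) + tsallis_entropy(c, q)
    )
    return inner(P, M) - entropy_term / lam

n = 3
r = [0.2, 0.3, 0.5]
M = [[float(abs(i - j)) for j in range(n)] for i in range(n)]
lam, q, beta = 2.0, 0.5, 0.5

# Three feasible plans in U(r, r): independent coupling, a perturbation of
# it that preserves both marginals, and the diagonal (no transport) plan.
P_indep = [[r[i] * r[j] for j in range(n)] for i in range(n)]
P_pert = [row[:] for row in P_indep]
delta = 0.03
P_pert[0][0] += delta
P_pert[1][1] += delta
P_pert[0][1] -= delta
P_pert[1][0] -= delta
P_diag = [[r[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

# The gap between the beta-objective and the plain objective is the same
# constant for every plan, so both objectives share their minimizer.
expected_offset = beta * 2.0 * tsallis_entropy(r, q) / lam
offsets = [
    objective(P, M, r, r, lam, q, beta) - objective(P, M, r, r, lam, q, 0.0)
    for P in (P_indep, P_pert, P_diag)
]

# With beta = 1/2 the diagonal plan has <P, M> = 0 and H_q(P) = H_q(r).
no_transport_value = objective(P_diag, M, r, r, lam, q, 0.5)
```

This only checks the constant offset on three sample plans; it is not a solver for trot.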

###### Proof.

To prove the Theorem, we need another version of the Gluing Lemma with entropic constraints [8, Lemma 1], generalized to handle Tsallis entropy.

###### Lemma (Refined gluing Lemma).

Let . Let and . Let $S$ be defined by the general term

$$s_{ik} := \sum_j \frac{p_{ij}\, q_{jk}}{y_j}. \tag{26}$$

1. ;

2. if , then:

$$H_q(S) - H_q(\boldsymbol{x}) - H_q(\boldsymbol{z}) \geq H_q(P) - H_q(\boldsymbol{x}) - H_q(\boldsymbol{y}). \tag{27}$$
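The gluing construction of eq. (26) can be tried on a small example. The sketch assumes that $P$ has row sums $\boldsymbol{x}$ and column sums $\boldsymbol{y}$, that $Q$ has row sums $\boldsymbol{y}$ and column sums $\boldsymbol{z}$, and that terms with $y_j = 0$ are skipped; the vectors, the perturbation sizes and the helper names are illustrative assumptions. It checks only the marginals of $S$, not inequality (27).

```python
def outer(u, v):
    """Independent coupling u v^T."""
    return [[ui * vj for vj in v] for ui in u]

def perturb(P, delta):
    """Move mass on a 2x2 sub-block without changing any marginal."""
    Q = [row[:] for row in P]
    Q[0][0] += delta
    Q[1][1] += delta
    Q[0][1] -= delta
    Q[1][0] -= delta
    return Q

def glue(P, Q, y):
    """Eq. (26): s_ik = sum_j p_ij * q_jk / y_j (terms with y_j = 0 skipped)."""
    n_rows, n_mid, n_cols = len(P), len(y), len(Q[0])
    return [
        [sum(P[i][j] * Q[j][k] / y[j] for j in range(n_mid) if y[j] > 0)
         for k in range(n_cols)]
        for i in range(n_rows)
    ]

x = [0.2, 0.3, 0.5]
y = [0.4, 0.4, 0.2]
z = [0.1, 0.6, 0.3]

P = perturb(outer(x, y), 0.05)  # row sums x, column sums y
Q = perturb(outer(y, z), 0.03)  # row sums y, column sums z
S = glue(P, Q, y)

row_sums = [sum(row) for row in S]
col_sums = [sum(S[i][k] for i in range(len(S))) for k in range(len(z))]
```

In this example the row sums of `S` recover `x` and its column sums recover `z`, so the glued matrix couples the two outer marginals through the shared one.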
###### Proof.

The proof essentially builds upon [8, Lemma 1]. We remark that $S$ can be built by

$$s_{ik} = \sum_j t_{ijk}, \tag{28}$$

where , we have

$$t_{ijk} := \frac{p_{ij}\, q_{jk}}{y_j} \tag{29}$$

if (and otherwise)

is a transportation matrix between and . Indeed,

 ∑i∑jsijk = ∑j∑ipijqjkyj = ∑jqjkyj∑ipij = ∑jqjkyjyj=∑jqjk=zk; ∑k∑jsijk = ∑j∑kpijqjkyj = ∑jpijyj∑kqjk