MREC: a fast and versatile framework for aligning and matching point clouds with applications to single cell molecular data
Abstract.
Comparing and aligning large datasets is a pervasive problem occurring across many different knowledge domains. We introduce and study MREC, a recursive decomposition algorithm for computing matchings between datasets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of identified partitions. The matching itself is done using black box matching procedures that are too expensive to run on the entire dataset. Using an absolute measure of the quality of a matching, the framework supports optimization over parameters including partitioning procedures and matching algorithms. By design, MREC can be applied to extremely large datasets. We analyze the procedure to describe when we can expect it to work well and demonstrate its flexibility and power by applying it to a number of alignment problems arising in the analysis of single cell molecular data.
1. Introduction
Computing alignments, or matchings, between two point clouds is a basic problem in data analysis. Notable applications include aligning points between images (sometimes referred to as the “point registration” problem) [EK03, MS05], measuring distances between images or histograms [RTG00], finding independent edge sets in bipartite graphs [BGB17], matching the shape of primate teeth to determine species relationships [BLC+11], removing batch effects in single-cell data [FHN+19], and studying stem cell differentiation [FHN+19, SST+19]. In the setting where the point clouds correspond to 3D shapes or low-dimensional manifolds, there exist techniques that successfully exploit the geometric structure of the point clouds [LD11, GYD+18, BGH+18]. In this work, we focus on arbitrary metric spaces, where the matching problem is typically formalized as an optimization problem: for each matching of datasets $X$ and $Y$, by which we mean a pair of maps $\sigma \colon X \to Y$ and $\tau \colon Y \to X$, we assign a cost $c(\sigma, \tau)$. The goal is then to find a matching that minimizes the cost. The cost of the optimal matching provides a measure of some kind of distance between $X$ and $Y$; in this formulation, there is a tight connection between matchings and distances between datasets.
Matchings and distances between metric spaces: A standard way of assigning costs to the matching is to assume that the two point clouds $X$ and $Y$ are equipped with distance functions $d_X$ and $d_Y$ respectively, i.e., they are metric spaces. Then the cost of a matching can be defined in terms of relationships between the interpoint distances. This approach to matching is closely related to notions of distances between metric spaces known as the Hausdorff and Gromov-Hausdorff distances [GRO81, GRO01]. A distinct advantage of cost measures that depend only on the distance matrices for each dataset is that the resulting alignment procedures are invariant with respect to rotations and, more generally, arbitrary distance-preserving transformations of the point clouds. This is clearly an essential feature when working with images, for example. Unfortunately, computing these distances in practice is infeasible.
A closely related viewpoint is to study matching questions from the perspective of “earth-mover distances”, which arise from an area of mathematics known as optimal transport [VIL08]. Here the idea is that matchings can be probabilistic or fuzzy, which is encoded by equipping the datasets with probability distributions. The costs of the matchings are then computed by adding up the amounts of probability mass required to transform one distribution into another, weighted by the distances. The resulting distances between metric spaces are known as the Wasserstein and Gromov-Wasserstein distances [MÉM11, STU06]. This approach is related to the approach above insofar as the Gromov-Wasserstein distance is a relaxation of the Gromov-Hausdorff distance. Although there is a substantial and beautiful theoretical literature on optimal transport, the Gromov-Wasserstein distance has until recently been too expensive to use in practice, limited to at most a few hundred points. A related relaxation in terms of semidefinite programming yields a distance and matching procedure that has attractive theoretical properties but is also expensive to use in practice [RPD+12, SJ08, KKB+15, VBB+16]; again, it is limited to hundreds of points.
Quite recently, there has been exciting progress in the area based on a relaxation of the Gromov-Wasserstein distance that involves regularizing the optimal transport problem by adding an entropy term [CUT13]. Roughly speaking, this adds a penalty based on the complexity of the matching. This refinement to the minimization objective (see Section 2.2) turns the problem into one that can be solved comparatively efficiently with Sinkhorn’s fixed-point algorithm [SIN74]. However, applying this approximation to datasets with more than a few thousand points is only feasible in low dimensions.
The limitations on the applicable size of datasets are a significant issue for many potential applications, including single-cell characterization of biological systems. Recent technological breakthroughs have allowed characterization of gene expression, chromatin accessibility, and structure at the single-cell level, allowing biologists to generate datasets with tens of thousands of cells (data points) in a very high-dimensional space (each dimension represents a gene of interest). These technical limitations are unfortunate since optimal transport has been shown in recent work to have significant potential applications, e.g., for removing batch effects [FHN+19], or for a better understanding of temporal evolution and differentiation in development [SST+19].
Contributions. Our goal in this article is to introduce and study a general framework for applying matching algorithms to very large datasets by recursive decomposition. The basic idea is to partition the dataset into (possibly overlapping) subsets, match representative points from the subsets using any matching algorithm as a subroutine, and then recursively match the partitions. We refer to the recursive approximation scheme as MREC (see Algorithm 1). By selecting parameters appropriately, we can work with datasets of tens of thousands of points. A particularly attractive feature of this framework is that, since there is an external measure of the quality of a matching, we can efficiently search over the parameter space, including selection of matching algorithms as well as methods for partitioning.
After introducing the algorithm, we give a simple theoretical analysis that explains when we can guarantee that this approximation will work well (see Section 3.2). The conclusions are reassuring but unsurprising: the algorithm can be expected to work well either when the matching itself admits a recursive decomposition (e.g., the data is produced by well-separated Gaussians) or when the partitions produce an approximation that is close to the original metric space.
Finally, we explore a number of applications of MREC to datasets coming from single-cell characterization of different biological systems, showcasing the usefulness of MREC for various problems in this field (see Section 4). The results we obtain outperform or are comparable to alternative algorithms on small datasets, and produce interesting results on datasets too large to be amenable to any other techniques.
2. Matchings and Gromov-type distances
In this section, we recall the definitions of matchings and various matching distances between metric spaces. We first review the Gromov-Hausdorff distance in Section 2.1. We present its probabilistic (and entropy-regularized) relaxation, the Gromov-Wasserstein distance, in Section 2.2, and we discuss the semidefinite relaxation in Section 2.3.
Problem. We can formulate the basic problem of interest as follows. We are given two datasets $X$ and $Y$, each equipped with a distance function. That is, we have functions $d_X \colon X \times X \to \mathbb{R}$ and $d_Y \colon Y \times Y \to \mathbb{R}$ that satisfy the metric axioms. The problem is to find a matching that best preserves the distances. There are of course many ways to make this precise, depending on the notion of “best”. Our point of departure is the solution that comes from the idea of the Gromov-Hausdorff distance [GRO81, GRO01].
2.1. The Gromov-Hausdorff distance
The Gromov-Hausdorff distance is a metric on the set of compact metric spaces up to isometry, which generalizes the Hausdorff distance.
Definition 2.1.
Let $X$ and $Y$ be subsets of a metric space $(Z, d)$. The Hausdorff distance between $X$ and $Y$ is defined as $d_H(X, Y) = \max\{\sup_{x \in X} d(x, Y), \sup_{y \in Y} d(y, X)\}$, where $d(x, Y) = \inf_{y \in Y} d(x, y)$ and $d(y, X) = \inf_{x \in X} d(x, y)$.
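For finite point clouds, this definition can be evaluated directly from the pairwise distance matrix. The following minimal numpy sketch (the function name and setup are our illustration, not from Algorithm 1) computes the Hausdorff distance between two finite subsets of Euclidean space:

```python
import numpy as np

def hausdorff(X, Y):
    # d_H(X, Y) = max( sup_x inf_y d(x, y), sup_y inf_x d(x, y) )
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```

Note that translating either set changes the result, illustrating the dependence on the embedding discussed next.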
The Hausdorff distance depends on the embedding of $X$ and $Y$, and for instance is not invariant under translations or rotations applied to $X$ and $Y$ separately. As a consequence, it is useful to consider a refinement which is invariant to distance-preserving transformations.
Correspondences. Let $R$ be a correspondence between $X$ and $Y$, i.e., a subset $R \subseteq X \times Y$ such that the projections are surjective onto $X$ and $Y$. A correspondence is a kind of matching where multiple points can be associated with each other. Then, the distortion of $R$ is defined to be:
$$\mathrm{dis}(R) = \sup |d_X(x, x') - d_Y(y, y')|,$$
where the supremum is over all $(x, y) \in R$ and $(x', y') \in R$.
The distortion can be thought of as the cost of the correspondence:
the quality of the matching is assessed by how well distances between
matched points are preserved.
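Concretely, given finite distance matrices, the distortion of a finite correspondence can be computed by brute force. The sketch below (our illustration; quadratic in the size of the correspondence) makes the definition explicit:

```python
import numpy as np

def distortion(DX, DY, R):
    # R is a list of pairs (i, j): point i of X corresponds to point j of Y.
    # dis(R) = sup over pairs in R of |d_X(x, x') - d_Y(y, y')|
    return max(abs(DX[i, k] - DY[j, l]) for (i, j) in R for (k, l) in R)
```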
Gromov-Hausdorff distance. The Gromov-Hausdorff distance is defined as:
$$d_{GH}(X, Y) = \frac{1}{2} \inf_{R} \mathrm{dis}(R),$$
where the infimum varies over all correspondences $R$. That is, in principle, if we compute the Gromov-Hausdorff distance, we also obtain a correspondence which realizes it. Unfortunately, it is infeasible to directly compute the Gromov-Hausdorff distance. The optimization problem above is an example of an integer programming problem, and is well-known to be NP-hard, even for spaces as simple as metric trees [AFN+18]. Instead, a standard approach is to turn to various kinds of relaxations.
2.2. The Gromov-Wasserstein distance
A first kind of relaxation is to allow probabilistic matchings; each point $x \in X$ is matched to a point $y \in Y$ with a probability $\mu(x, y)$, subject to the constraint that $\sum_{y \in Y} \mu(x, y) = 1$. It turns out that a way to study this is via the consideration of metric spaces equipped with a probability measure.
Metric probability space. We will temporarily broaden consideration from metric spaces to metric measure spaces; these are metric spaces $(X, d_X)$ equipped with a Borel measure $\mu_X$. In fact, we restrict attention to metric probability spaces, i.e., metric measure spaces where the total measure is $1$. Any metric space can be regarded as a metric measure space equipped with the uniform measure, where each point is assigned equal probability. To construct a metric in this setting which is analogous to the Gromov-Hausdorff distance, we need to introduce a probabilistic analogue of a correspondence. The correct notion here is that of a coupling of $(X, \mu_X)$ and $(Y, \mu_Y)$; this is a probability measure $\mu$ on $X \times Y$ such that $\mu(A \times Y) = \mu_X(A)$ and $\mu(X \times B) = \mu_Y(B)$ for all measurable $A \subseteq X$ and $B \subseteq Y$. We can now define the Gromov-Wasserstein distance [MÉM11, STU06, LV09].
Gromov-Wasserstein distance. Let $\Gamma(x, y, x', y')$ denote $|d_X(x, x') - d_Y(y, y')|$.
Definition 2.2.
Let $p \ge 1$. The $p$-th Gromov-Wasserstein distance between two metric probability spaces $(X, d_X, \mu_X)$ and $(Y, d_Y, \mu_Y)$ is defined as:
$$GW_p(X, Y) = \frac{1}{2} \inf_{\mu} \left( \iint \Gamma(x, y, x', y')^p \, d\mu(x, y) \, d\mu(x', y') \right)^{1/p},$$
where $\mu$ ranges over couplings of $(X, \mu_X)$ and $(Y, \mu_Y)$. If $X$ and $Y$ lie in a common metric space and we replace $\Gamma(x, y, x', y')$ by $d(x, y)$, the resulting distance is called the Wasserstein distance; this is analogous to the Hausdorff distance.
Intuitively, the Gromov-Wasserstein distance is computed by testing every probabilistic matching between $X$ and $Y$. Each such matching assigns, for a given point of $X$, a probability distribution of matching over $Y$, and vice versa. The matching that minimizes the so-called metric distortion score between the spaces is picked, and its score is used as the distance value. By rounding, one can extract an absolute (non-probabilistic) matching.
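For finite spaces, the discrete objective underlying this definition can be evaluated for any candidate coupling by a tensor contraction. The sketch below (our illustration; it only scores a coupling, it does not solve the optimization) computes the metric distortion score for $p = 2$:

```python
import numpy as np

def gw_score(DX, DY, mu, p=2):
    # sum_{i,j,k,l} |DX[i,k] - DY[j,l]|^p * mu[i,j] * mu[k,l]
    gamma = np.abs(DX[:, None, :, None] - DY[None, :, None, :]) ** p
    return float(np.einsum('ijkl,ij,kl->', gamma, mu, mu))
```

Solvers such as those in the POT library minimize this score over couplings; rounding the optimal coupling then yields an absolute matching.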
Unfortunately, the Gromov-Wasserstein distance is again very hard to compute in general; it is a quadratic assignment problem (QAP), a class of problems that is in general hard to solve, or even to approximate with provable guarantees [PW94].
Entropic regularization. Since the Gromov-Wasserstein distance is still expensive to compute, attention has been focused recently on a further relaxation, which adjusts the cost function to add a penalty term in terms of the entropy of the couplings [CUT13], where the entropy of a coupling $\mu$ is $H(\mu) = -\sum_{i,j} \mu_{i,j} \log(\mu_{i,j})$.
Definition 2.3.
For a fixed $\epsilon > 0$, the entropy-regularized Gromov-Wasserstein distance is:
$$GW_\epsilon(X, Y) = \inf_{\mu} \sum_{i,j,k,l} \Gamma(x_i, y_j, x_k, y_l)^p \, \mu_{i,j} \, \mu_{k,l} - \epsilon \, H(\mu),$$
where $\mu$ ranges over (discrete) couplings of $(X, \mu_X)$ and $(Y, \mu_Y)$.
Adding the entropy term turns out to result in a problem that can be efficiently approximated with, e.g., Sinkhorn’s fixedpoint algorithm [SIN74]. However, solving this problem on thousands of points is feasible only in very low dimensions.
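To illustrate the fixed-point idea, the following numpy sketch implements Sinkhorn iterations for the linear entropy-regularized optimal transport problem; entropic Gromov-Wasserstein solvers run the same iterations inside an outer loop on a linearized cost. This is our own minimal illustration, not the implementation used in our experiments:

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.1, n_iter=200):
    # Entropy-regularized OT: rescale the kernel K = exp(-C / eps)
    # until its row and column sums match the marginals a and b.
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # approximate optimal coupling
```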
2.3. Relaxation of the Gromov-Hausdorff distance
A different relaxation of the Gromov-Hausdorff distance is given by semidefinite programming [VBB+16]. This has the advantage that the resulting optimization problem is convex, comes with a certificate of optimality, and yields a pseudometric that satisfies the triangle inequality. On the other hand, the result is a semidefinite program whose matrix variable has on the order of $n^2 m^2$ entries, where $n$ and $m$ are the sizes of the datasets. A further relaxation reduces its complexity by introducing sparsity assumptions [FKS18], but it is still infeasible for large datasets.
Let $n$ and $m$ be the number of points of $X$ and $Y$ respectively. For the semidefinite relaxation, we optimize over a matrix $P \in \mathbb{R}^{nm \times nm}$, with entries of $P$ indexed by pairs $(i, j), (k, l)$, with $1 \le i, k \le n$ and $1 \le j, l \le m$. In fact, we work with an augmented matrix
(1) $$Q \in \mathbb{R}^{(nm+1) \times (nm+1)},$$
obtained by bordering $P$ with an extra row and column.
Definition 2.4.
The SDP relaxation of the Gromov-Hausdorff distance is the solution to the following semidefinite programming problem:
$$\min_{Q} \; \sum_{i,j,k,l} |d_X(x_i, x_k) - d_Y(y_j, y_l)| \, P_{(i,j),(k,l)}$$
subject to $Q \in \mathcal{C}$, where $\mathcal{C}$ denotes the set of matrices satisfying linear constraints forcing the entries of $P$ to behave like products of matching indicators (nonnegativity, normalized row and column sums, and consistency with the border row and column of $Q$), together with the requirement that $Q$ is symmetric and positive semidefinite.
As noted above, one of the most interesting properties of the SDP relaxation is that it yields a distance that also satisfies the triangle inequality; this is somewhat surprising, as there is not an obvious geometric interpretation as with the optimal transport framework.
3. A recursive approximation scheme
In this section, we define and study a recursive approximation scheme for computing matchings between datasets. This algorithm, which we call MREC and define in Algorithm 1, takes as parameters a black box matching function (such as the entropy-regularized Gromov-Wasserstein distance presented in Section 2.2 or the SDP relaxation presented in Section 2.3) and a black box clustering function (such as $k$-means++ [AV07]), and uses them recursively to scale and approximate the matching computation.
3.1. Definition
Setup. We are given a clustering algorithm $C$ whose input is a finite metric space $(X, d_X)$ with $|X| = n$ and whose output is a finite metric space $(X', d_{X'})$ together with a surjective (point-set) function $\pi \colon X \to X'$, satisfying, for all $x' \in X'$:
1. $|X'| = k$ for a prescribed number of clusters $k < n$. 2. $|\pi^{-1}(x')|$ is roughly constant.
Recursive scheme. We now present our recursive decomposition algorithm, which we call MREC and detail in Algorithm 1, for producing a matching between $X$ and $Y$.
At the conclusion of this algorithm, we obtain a matching between $X$ and $Y$; this is the output of the algorithm, along with the induced approximation of the Gromov-type distances. In practice, MREC is stochastic since it depends both on the specifics of the clustering algorithm and on random properties of the optimization solvers for the matching algorithm $M$. As a consequence, the correct way to proceed is to perform a search over the parameter space, possibly even searching over different clustering algorithms: each matching comes with a score, the associated distortion measure, allowing us to automatically hill-climb to find the best matching.
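To make the recursion concrete, here is a heavily simplified sketch of the scheme in Python. The partitioning is a Voronoi partition with randomly sampled centroids (as used in our experiments), and a greedy nearest-representative assignment stands in for the black box matching function $M$; in practice $M$ would be, e.g., an entropic Gromov-Wasserstein solver. All names are ours, and the sketch glosses over several details of Algorithm 1 (e.g., unbalanced cluster sizes are handled crudely):

```python
import numpy as np

def partition(P, k, rng):
    # Voronoi partition induced by k randomly sampled centroids.
    centroids = P[rng.choice(len(P), size=k, replace=False)]
    labels = np.linalg.norm(P[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    return labels, centroids

def black_box_match(A, B):
    # Stand-in for M: greedily pair each row of A (|A| <= |B|) with the
    # nearest still-unused row of B.
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    match, used = {}, set()
    for i in range(len(A)):
        j = min((j for j in range(len(B)) if j not in used), key=lambda j: cost[i, j])
        match[i] = j
        used.add(j)
    return match

def mrec(X, Y, k=4, leaf_size=8, seed=0):
    """Return a dict sigma mapping indices of X to matched indices of Y."""
    rng = np.random.default_rng(seed)

    def leaf(ix, iy):
        n = min(len(ix), len(iy))
        m = black_box_match(X[ix[:n]], Y[iy])  # crude leaf matching
        return {ix[i]: iy[j] for i, j in m.items()}

    def rec(ix, iy):
        if min(len(ix), len(iy)) <= leaf_size:
            return leaf(ix, iy)
        kk = min(k, len(ix), len(iy))
        labx, cx = partition(X[ix], kk, rng)
        laby, cy = partition(Y[iy], kk, rng)
        out = {}
        for a, b in black_box_match(cx, cy).items():  # match the partitions
            sub_x, sub_y = ix[labx == a], iy[laby == b]
            if len(sub_x) == 0 or len(sub_y) == 0:
                continue
            if len(sub_x) == len(ix) and len(sub_y) == len(iy):
                return leaf(ix, iy)  # degenerate partition; stop recursing
            out.update(rec(sub_x, sub_y))  # recurse within matched pairs
        return out

    return rec(np.arange(len(X)), np.arange(len(Y)))
```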
Associated code. Our code is publicly available at https://github.com/MathieuCarriere/mrec.
3.2. Theoretical properties
We begin by giving some generic properties of MREC. If we let $T(k)$ denote the time required for the black box matching algorithm to compute a matching on metric spaces of size smaller than $k$, then when using $k$ clusters the time to execute MREC on datasets with $n$ points is $O\big(\tfrac{n}{k}\, T(k)\big)$. That is, the running time is basically determined by $T(k)$.
Next, we turn to the question of how good an approximation we can expect to get from MREC. Recall that $X' \subseteq X$ is an $\epsilon$-net if for every point $x \in X$ there is a point $x' \in X'$ such that $d_X(x, x') \le \epsilon$. A basic observation is that if $X'$ is an $\epsilon$-net of $X$, the evident induced matching that assigns each point of $X$ to a point of $X'$ within distance $\epsilon$ has discrepancy bounded above by $2\epsilon$. The triangle inequality then implies that if $X'$ is an $\epsilon_1$-net and $Y'$ is an $\epsilon_2$-net, then the optimal matching of $X'$ and $Y'$ induces a matching of $X$ and $Y$ that has distortion within $2(\epsilon_1 + \epsilon_2)$ of the optimal.
It is helpful to reformulate this in terms of clustering algorithms. Let $C$ be a clustering algorithm, and $\epsilon > 0$. Say that a finite metric space $X$ is $\epsilon$-clustered if the output of $C$ on $X$ satisfies the condition that for all clusters, all elements in the cluster are within distance $\epsilon$ of each other. Notice that this in particular means that choosing a point in each partition produces an $\epsilon$-net.
Proposition 3.1.
Suppose $X' \subseteq X$, $Y' \subseteq Y$. Given a relation $R' \subseteq X' \times Y'$ and retractions $r_X \colon X \to X'$ and $r_Y \colon Y \to Y'$, consider the derived relation $R \subseteq X \times Y$ defined by $R = \{(x, y) : (r_X(x), r_Y(y)) \in R'\}$. Suppose that $d_X(x, r_X(x)) \le \epsilon$ for all $x \in X$, and likewise for $Y$. Then $\mathrm{dis}(R) \le \mathrm{dis}(R') + 4\epsilon$.
Proof.
For $(x, y), (x', y') \in R$, we have $(r_X(x), r_Y(y)), (r_X(x'), r_Y(y')) \in R'$. Thus, we have, using the triangle inequality: $|d_X(x, x') - d_X(r_X(x), r_X(x'))| \le 2\epsilon$ and $|d_Y(y, y') - d_Y(r_Y(y), r_Y(y'))| \le 2\epsilon$.
This leads to: $|d_X(x, x') - d_Y(y, y')| \le |d_X(r_X(x), r_X(x')) - d_Y(r_Y(y), r_Y(y'))| + 4\epsilon \le \mathrm{dis}(R') + 4\epsilon$. ∎
This immediately implies the following corollary.
Corollary 3.2.
Let $X$ and $Y$ be $\epsilon$-clustered for $C$ with the number of clusters $k$, and let $X'$ and $Y'$ denote the metric spaces of clusters generated by $C$. Then MREC (computed with the exact Gromov-Hausdorff distance as the matching function $M$) produces a matching between $X$ and $Y$ with distortion at most $\mathrm{dis}(R') + 4\epsilon$, where $R'$ is the correspondence realizing $d_{GH}(X', Y')$.
Now, a very natural question to ask is: for a metric space with some kind of dimension restriction, how many clusters are needed to obtain an $\epsilon$-clustering? A standard way of encoding the dimension of a metric space is via the doubling dimension. Recall that a metric space has doubling dimension $d$ if every ball of radius $r$ can be covered by $2^d$ balls of radius $r/2$. For example, Euclidean space $\mathbb{R}^n$ has doubling dimension proportional to $n$. The following standard result now follows from an easy counting argument.
Proposition 3.3.
Let $X$ be a metric space with doubling dimension $d$ and diameter $D$. Then there exists an $\epsilon$-clustering with approximately $2^{dm}$ clusters, where $m = \log_2(D / \epsilon)$.
Here the approximation in the statement comes from issues of rounding logarithms. These results give quite pessimistic estimates, but they are useful for providing a sense of lower bounds on the performance to expect from MREC and its recursive decomposition.
The analyses above focus on the situation in which the coarse approximation of the data that we get from partitioning is close in the Gromov-Hausdorff metric to the original data. (Analogous analyses can be performed using other metrics.) We now explain a situation in which we can expect the algorithm to work well even when the partition approximation is in this sense far from the original data.
Consider the toy model where $X$ and $Y$ are both finite samples from a metric space $Z$ with the property that $Z$ can be partitioned into clusters $Z_1, \dots, Z_k$ such that the minimum Hausdorff distance between any pair of clusters $Z_i$ and $Z_j$ is substantially larger than the maximum diameter of the clusters. Specifically, writing $\rho = \max_i \mathrm{diam}(Z_i)$ and $\delta = \min_{i \ne j} d_H(Z_i, Z_j)$, assume that $\delta > 3\rho$.
First off, note that in this case, the optimal clustering can be found by the $k$-means algorithm; each point in a cluster is closer to the centroid than to any point outside the cluster. In addition, we know that when the clusters have comparable mass, the expected number of samples required to obtain a point in each cluster is bounded above by $O(k \log k)$; this follows from a generalized coupon-collector analysis. Let us assume that we have chosen enough points so that with high probability we indeed have at least one sample from each cluster. In this situation, the optimal matching between $X$ and $Y$ is realized by a matching of the clusters.
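For completeness, the coupon-collector bound we are invoking can be sketched as follows (a standard argument; the constants are not optimized). If cluster $i$ is sampled with probability $p_i$, the expected number of draws $T$ needed to see every cluster satisfies:

```latex
\mathbb{E}[T] \;\le\; \frac{1}{\min_i p_i} \sum_{j=1}^{k} \frac{1}{j}
\;=\; \frac{H_k}{\min_i p_i}
\;=\; O\!\left(\frac{\log k}{\min_i p_i}\right),
```

which reduces to the familiar $O(k \log k)$ when the clusters have comparable mass $p_i \approx 1/k$.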
Proposition 3.4.
Under the hypotheses above, the optimal matching of $X$ and $Y$ permutes the clusters.
Proof.
Any matching that separates points within a cluster (i.e., assigns points of a single cluster of $X$ to distinct clusters of $Y$) or merges clusters will have discrepancy bounded below by $\delta - \rho$, whereas a matching that permutes the clusters will have discrepancy bounded above by $2\rho < \delta - \rho$. ∎
An immediate corollary is that MREC will find the optimal matching of the top-level clusters with high probability. Since we are only using metric information, our matchings are non-unique up to symmetry in the data; if we assume that the inter-cluster distances are unique, the matching returned is the unique matching that matches each cluster to itself.
3.3. Discussion
One of the most useful features of the MREC framework is that, because matchings can be evaluated in terms of their distortion, we can search over parameter space to find the minimal distortion matching. For example, it is reasonable to try a variety of clustering algorithms and numbers of centers in order to ascertain which parameters lead to the best approximation of the optimal matching. Moreover, we can even try different matching procedures. This in particular means that the algorithm can tolerate a substantial amount of uncertainty about the decomposition of the best matching, as long as there exists a partition and allowable resolution for which there is a good matching.
Another useful aspect of this is that we can easily make MREC robust to outliers by subsampled cross-validation; specifically, we can repeatedly subsample the two metric spaces $X$ and $Y$, perform the matchings on the subsamples, and identify points that result in high distortion matchings.
Finally, although we have discussed the matching algorithm in terms of a partition of the dataset, instead it is entirely reasonable to imagine an overlapping cover of the datasets used for matchings. We only mention this refinement here; we intend to return to this in future work.
4. Applications
In this section, we demonstrate the use of MREC by applying it to synthetic and real biological single-cell datasets with ground-truth labels. Overall, we show that the recursion procedure used in MREC is highly efficient, in the sense that the computed matchings are on par with, or slightly better than (as measured by the accuracy of label transfer under the computed matchings), the ones obtained without doing any recursion, with significantly smaller running times. We first provide proofs of concept of MREC for synthetic and real data in Sections 4.1 and 4.2. Then, we present applications of MREC for correcting batch effects on a single-cell dataset in Section 4.3 and for aligning across different measurement modalities in Sections 4.4 and 4.5. Whereas the datasets in Sections 4.3 and 4.4 can be handled directly, the dataset presented in Section 4.5 is too large to be handled with usual matching algorithms, making MREC, as far as we know, the only algorithm suitable for working at the scale enabled by recent breakthroughs in single-cell data generation.
Single-cell datasets. Single-cell RNA sequencing has been a major event in the fine characterization of many biological systems at the molecular level, since it allows biologists to study the gene expression levels of thousands of individual cells, leading to a clearer understanding of many biological processes. For instance, clustering the expression of single-cell data identifies sets of cells with common properties and function, and the relation between these clusters can reflect transitional states. In these datasets, each cell is represented as a point in Euclidean space, in which every dimension is associated with the expression of a gene of interest, and each single-cell coordinate contains the expression of the corresponding gene in that cell (that is, how often this gene is transcribed in the cell, as measured by the number of transcript fragments detected in the cell). Hence, raw single-cell datasets are usually given as integer matrices. They often require preprocessing; we will use the Python packages Scanpy [WAT18] and Randomly [ABB+18].
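As an illustration of what this preprocessing typically involves, the following numpy sketch performs library-size normalization followed by a log transform, one of the standard first steps applied by packages such as Scanpy (the function and its defaults are our simplified stand-in, not the exact pipeline used in the experiments):

```python
import numpy as np

def normalize_counts(counts, target_sum=1e4):
    # Scale each cell (row) so its counts sum to target_sum, then log1p;
    # this removes sequencing-depth differences between cells.
    lib = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / lib * target_sum)
```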
Batch effects. Single-cell RNA sequencing has resulted in a huge increase in data availability. However, many methodological issues with using these datasets remain. Notably, datasets produced in different labs or with different protocols usually exhibit unwanted biases, often referred to as batch effects [HLM+18]. Matching, or aligning, these datasets in order to integrate them together in a robust way is thus a problem of major importance, and optimal transport with the Gromov-Wasserstein distance has recently been shown to be a promising solution in this application [FHN+19]. In Section 4.3, we present an example of batch effect correction with MREC.
Different measurement modalities. Another class of alignment problems comes from situations where we have measurements of the same biological system coming from different kinds of technologies; for example, measuring gene expression and chromatin accessibility. Understanding the amount of alignment that can be expected between these two kinds of measurements is an extremely interesting problem. In Sections 4.4 and 4.5, we describe applications to two such datasets.
Parameters and results. Each of the datasets in Sections 4.1, 4.2, 4.3 and 4.5 presented below is comprised of two groups with the same list of associated ground-truth labels (except for BDTNP; see the details in Section 4.4 below). On each of these datasets, we use a Voronoi partition computed with randomly sampled centroids.
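A Voronoi partition with randomly sampled centroids is straightforward to implement; the following sketch (our illustration) is what we mean:

```python
import numpy as np

def voronoi_partition(X, k, seed=0):
    # Sample k data points as centroids and assign each point to its
    # nearest centroid, yielding a partition into (at most) k cells.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    D = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return D.argmin(axis=1), centroids
```

As noted in the footnotes, replacing the random centroids with $k$-means centers did not noticeably change accuracy in our experiments, while substantially increasing running time.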
4.1. Matching of simulated datasets Synth(+)
We first apply MREC to a synthetic dataset sampled from a mixture of three Gaussian probability distributions located at different positions in the Euclidean plane $\mathbb{R}^2$. More specifically, we drew samples that we eventually split into two groups. Accuracy is then measured as the ability of MREC to match together points sampled from the same Gaussian probability distribution. We also ran an experiment with a larger number of samples (called Synth+).
4.2. Matching of subsampled datasets Pbmc
In our second example, we test the performance of MREC on a single-cell transcriptional dataset. We collected the data presented in [KST+18], which is comprised of single peripheral blood mononuclear cells (PBMCs) of eight different cell types (CD4, CD8, B cells, NK cells, dendritic cells (Ddr), CD14, monocytes (Mnc) and megakaryocytes (Mgk)). We preprocessed the cells with Randomly [ABB+18] to create a dataset in which each dimension represents a highly variable gene of interest. We then split the data into two groups and measured the ability of MREC to match together cells of the same type coming from different groups.
4.3. Matching of different protocols Hema
In our third application, we study an example of batch effects in single-cell transcriptional data for which optimal transport has already proved useful [FHN+19]. The data is comprised of two groups of single hematopoietic cells, measured on a common set of genes, with three different cell types (common myeloid progenitors (CMP), granulocyte-monocyte progenitors (GMP) and megakaryocyte-erythrocyte progenitors (MEP)). These groups were generated with the SMART and MARS protocols respectively, as explained in [HLM+18]. As mentioned above, even though the dimensions are in correspondence, the fact that different protocols were used introduces a bias, or batch effect, in the data. We preprocessed the data using the method presented in [ZTB+17], available in the Python package Scanpy, and then reduced the number of dimensions to 50 with PCA.
4.4. Matching of different modalities Bdtnp
In this section, we study 3,039 single cells coming from Drosophila embryos, as presented in [NKF+18]. Spatial coordinates in the 2D tissues were obtained with fluorescence in situ hybridization (FISH), as well as the expression of 84 marker genes, leading to spaces with different dimensions. This means that the usual Wasserstein distance cannot be used here. We followed the exact same procedure as in [NKF+18], i.e., pairwise distances in both spaces were computed with a nearest-neighbor graph, so as to approximate the geodesic distances of the manifolds from which the datasets were sampled (following the manifold assumption made in [NKF+18]). Finally, the ground truth is here given by the true correspondence between the cells. Hence, we chose to measure accuracy with the (normalized) area under the curve that shows, for each radius $r$, the fraction of cells (in expression space) whose corresponding cell (in tissue space) under the computed matching is at distance at most $r$ from the true corresponding cell (a common practice in computer graphics [COO15]).
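Concretely, this accuracy measure can be computed as follows; the sketch below (our illustration, with a finite grid of radii standing in for the integral) takes the tissue-space distance matrix and the computed matching, and assumes the true correspondence is the identity:

```python
import numpy as np

def auc_accuracy(D_tissue, sigma, radii):
    # For each radius r: the fraction of cells whose matched partner
    # sigma[i] lies within tissue-space distance r of the true partner i.
    # The normalized AUC is the mean of these fractions over the radii.
    d = np.array([D_tissue[i, sigma[i]] for i in range(len(sigma))])
    return float(np.mean([(d <= r).mean() for r in radii]))
```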
4.5. Matching of different modalities Brain
In our last example, we focus on a dataset of single cells sampled from the human brain with eight different cell types (astrocytes (Ast), endothelial cells (End), excitatory neurons (Ex), inhibitory neurons (In), microglia (Mic), oligodendrocytes (Oli) and their precursor cells (OPC)), presented in [LCS+17]. The challenge of this dataset is twofold. First, the dimensions of the two groups are not in correspondence (unlike in Section 4.4): gene expression was measured in the first group, while the second group contains DNA region accessibilities (that is, whether given regions in the DNA of the cell nucleus are open, i.e., accessible for transcription, or not). Second, the datasets are too large to be handled with usual matching techniques. We preprocessed the first group with the method presented in [ZTB+17], available in the Python package Scanpy, and the second one with TF-IDF. Then, we applied PCA on each group separately to reduce the number of dimensions to 50. Note that since the dimensions are not in correspondence anymore, we have to use the Gromov-Wasserstein distance. We also added a standard Wasserstein cost by using a binary cost matrix containing the correspondence between the genes and the DNA regions (that is, the entry $(i, j)$ of the matrix is one if the genomic coordinates of gene $i$ intersect DNA region $j$).
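For readers unfamiliar with TF-IDF in the single-cell accessibility setting: each cell is treated as a “document” and each DNA region as a “term”, so that ubiquitously accessible regions are down-weighted. A minimal numpy sketch (one of several common variants; the exact weighting used in our pipeline may differ):

```python
import numpy as np

def tfidf(counts):
    # counts: cells x regions accessibility matrix.
    tf = counts / counts.sum(axis=1, keepdims=True)   # term frequency per cell
    df = (counts > 0).sum(axis=0)                     # regions' document frequency
    idf = np.log(1.0 + counts.shape[0] / (1.0 + df))  # inverse document frequency
    return tf * idf
```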
4.6. Results and comparison with no recursion.
It can be seen from Figure 1 that, as expected, accuracy generally tends to increase, and distortion to decrease, with the number of centroids. Note also that, for the BRAIN dataset, the variance of the distortion is quite large, and its decrease is not obvious. This large variance is also the reason why the mean accuracy (observed on the plot) is actually much smaller than the one corresponding to the matching with lowest distortion (shown in Table 1); note that this is true (to a certain extent) for all datasets. Note also that there is a clear gap in Figure 1 between the accuracy of matchings computed with MREC on BDTNP and the baseline. This suggests that convergence has not yet occurred and more centroids are needed. Indeed, the accuracy and running times provided in Table 1 were obtained by allowing the number of centroids to go up to 3,000 (see Appendix, Figure 3).
Finally, it can be seen from Table 1 that recursion is mandatory on Synth+ and BRAIN for the matching computation to end in a reasonable amount of time. Moreover, the results are generally on par with (or slightly better than) the baseline, with substantially lower running times. In the case of HEMA, the accuracy improvement is more significant.
Table 1. Accuracy and running time of MREC versus no recursion.

            Accuracy (%)          Time (s)
Dataset     MREC     No rec.      MREC      No rec.
Synth       100      100          14.1      505.8
Synth+      100      100          126.7     14,325.6
PBMC        89.9     89.5         29.1      77.3
HEMA        84.9     79.5         20.5      119.0
BDTNP       99.2     99.1         73.2      508.5
BRAIN       59.6     58.6         125.1     15,580.5
5. Conclusion
In this article, we introduced MREC: a fast and versatile tool for computing matchings on large-scale datasets. It can use any black box matching or clustering function defined by the user, and scales it with a recursive scheme so as to be able to process datasets with very large numbers of points efficiently. Theoretical analysis shows that the algorithm can in principle perform well. We validated its use with several applications in single-cell molecular analysis, achieving comparable or better performance than the state-of-the-art on smaller datasets and providing novel results on datasets too large to be aligned efficiently by any other means.
Appendix
Additional figures
In this appendix, we provide various additional plots.

In Figure 2, we display the running times of our experiments w.r.t. the number of centroids. Unsurprisingly, they increase with the number of centroids, at a rate that depends on the experiment.

In Figure 3, we confirm that MREC converges on the BDTNP experiment by showing the accuracy and distortion for a larger number of centroids than presented in the main article.

In Figure 4, we present the dependency of accuracy, distortion and running times w.r.t. both entropy and number of centroids. Note that for small entropy values on HEMA, numerical errors occurred, leading to missing values for the distortion.

In Figure 5, we provide illustrations of the datasets and computed matchings.
Code
We provide the code and data used to run the experiments (except for BRAIN and Synth+, which are too large; the data for Synth+ can be generated with the synth+_matrix.py file by running python synth+_matrix.py in a terminal) in the experiments folder available at the following link:
The experiments can be reproduced by running ./launchS.sh and ./launchnonS.sh in a terminal. Since running all the experiments takes a (very) long time, we also provide a notebook Example.ipynb at the following url: https://github.com/MathieuCarriere/mrec. This notebook illustrates the use of MREC on the synth dataset. The dependencies for running both codes are numpy, scipy, POT, scikit-learn, cvxopt, matplotlib and (optionally) the Matlab engine API for Python.
Footnotes
 https://github.com/MathieuCarriere/mrec
 https://pot.readthedocs.io/en/stable/
 We also tried using k-means centroids. However, we did not observe noticeable changes in accuracy, but running times were substantially longer, so we stuck to randomly sampled centroids.
 The accuracies for HEMA provided in this work differ from the ones presented in [FHN+19] on the same dataset for the following reason: in [FHN+19], the authors report the average of the accuracies associated with several random samples of cells in each group, whereas we measure accuracy on the whole dataset.
References
 (2018) Computing the Gromov-Hausdorff distance for metric trees. ACM Trans. Algorithms 14 (2), pp. 24:1–24:20.
 (2018) Quasi-universality in single-cell sequencing data. arXiv:1810.03602.
 (2007) k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035.
 (2018) SMAC: simultaneous mapping and clustering using spectral decompositions. In International Conference on Machine Learning, pp. 334–343.
 (2017) A Hungarian algorithm for error-correcting graph matching. In Graph-Based Representations in Pattern Recognition, Cham, pp. 118–127.
 (2011) Algorithms to automatically quantify the geometric similarity of anatomical surfaces. Proceedings of the National Academy of Sciences 108 (45), pp. 18221–18226.
 (2015) Stable topological signatures for points on 3D shapes. Computer Graphics Forum 34 (5), pp. 1–12.
 (2013) Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems.
 (2003) On bending invariant signatures for surfaces. IEEE Trans. Pattern Anal. Mach. Intell. 25 (10), pp. 1285–1295.
 (2018) Semidefinite programming approach for the quadratic assignment problem with a sparse graph. Computational Optimization and Applications 69 (3), pp. 677–712.
 (2019) Statistical optimal transport via factored couplings. In International Conference on Artificial Intelligence and Statistics, pp. 2454–2465.
 (2018) Development and assessment of fully automated and globally transitive geometric morphometric methods, with application to a biological comparative dataset with high interspecific variation. The Anatomical Record 301 (4), pp. 636–658.
 (1981) Groups of polynomial growth and expanding maps. Inst. Hautes Études Sci. Publ. Math. (53), pp. 53–73.
 (2001) Metric structures for Riemannian and non-Riemannian spaces. Progress in Mathematics, Birkhäuser.
 (2018) Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology 36 (5), pp. 421–427.
 (2018) Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nature Biotechnology 36 (1), pp. 89–94.
 (2015) Tight relaxation of quadratic matching. In Computer Graphics Forum, Vol. 34, pp. 115–128.
 (2017) Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nature Biotechnology 36 (1), pp. 70–80.
 (2011) Conformal Wasserstein distances: comparing surfaces in polynomial time. Advances in Mathematics 227 (3), pp. 1047–1077.
 (2009) Ricci curvature for metric-measure spaces via optimal transport. Ann. of Math. (2) 169 (3), pp. 903–991.
 (2005) A theoretical and computational framework for isometry invariant recognition of point cloud data. Found. Comput. Math. 5 (3), pp. 313–347.
 (2011) Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, pp. 1–71.
 (2018) Charting a tissue from single-cell transcriptomes. bioRxiv.
 (1994) Quadratic assignment and related problems: DIMACS workshop, May 20–21, 1993. Vol. 16, American Mathematical Society.
 (2012) Wasserstein barycenter and its application to texture mixing. In Proceedings of the Third International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446.
 (2000) The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40 (2), pp. 99–121.
 (2019) Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176 (4), pp. 928–943.e22.
 (2008) Approximate earth mover's distance in linear time. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8.
 (1974) Diagonal equivalence to matrices with prescribed row and column sums. II. Proceedings of the American Mathematical Society 45 (2), pp. 195–198.
 (2006) On the geometry of metric measure spaces. I. Acta Math. 196 (1), pp. 65–131.
 (2008) Optimal transport: old and new. Springer Berlin Heidelberg.
 (2016) A polynomial-time relaxation of the Gromov-Hausdorff distance. arXiv:1610.05214.
 (2018) SCANPY: large-scale single-cell gene expression data analysis. Genome Biology 19.
 (2017) Massively parallel digital transcriptional profiling of single cells. Nature Communications 8.