
MREC: a fast and versatile framework for aligning and matching point clouds with applications to single cell molecular data

Abstract.

Comparing and aligning large datasets is a pervasive problem occurring across many different knowledge domains. We introduce and study MREC, a recursive decomposition algorithm for computing matchings between datasets. The basic idea is to partition the data, match the partitions, and then recursively match the points within each pair of identified partitions. The matching itself is done using black box matching procedures that would be too expensive to run on the entire dataset. Using an absolute measure of the quality of a matching, the framework supports optimization over parameters, including partitioning procedures and matching algorithms. By design, MREC can be applied to extremely large datasets. We analyze the procedure to describe when we can expect it to work well and demonstrate its flexibility and power by applying it to a number of alignment problems arising in the analysis of single cell molecular data.

1. Introduction

Computing alignments, or matchings, between two point clouds is a basic problem in data analysis. Notable applications include aligning points between images (sometimes referred to as the “point registration” problem) [EK03, MS05], measuring distances between images or histograms [RTG00], finding independent edge sets in bipartite graphs [BGB17], matching the shape of primate teeth to determine species relationships [BLC+11], removing batch effects in single-cell data [FHN+19], and studying stem cell differentiation [FHN+19, SST+19]. In the setting where the point clouds correspond to 3D shapes or low-dimensional manifolds, there exist techniques that successfully exploit the geometric structure of the point clouds [LD11, GYD+18, BGH+18]. In this work, we focus on arbitrary metric spaces, where the matching problem is typically formalized as an optimization problem: for each matching of datasets $X$ and $Y$, by which we mean a pair of maps $\sigma_X \colon X \to Y$ and $\sigma_Y \colon Y \to X$, we assign a cost $c(\sigma_X, \sigma_Y)$. The goal is then to find a matching that minimizes the cost. The cost of the optimal matching provides a measure of some kind of distance between $X$ and $Y$; in this formulation, there is a tight connection between matchings and distances between datasets.

Matchings and distances between metric spaces: A standard way of assigning costs to the matching is to assume that the two point clouds $X$ and $Y$ are equipped with distance functions $d_X$ and $d_Y$ respectively, i.e., they are metric spaces. Then the cost of a matching can be defined in terms of relationships between the interpoint distances. This approach to matching is closely related to notions of distances between metric spaces known as the Hausdorff and Gromov-Hausdorff distances [GRO81, GRO01]. A distinct advantage of cost measures that depend only on the distance matrices for each dataset is that the resulting alignment procedures are invariant with respect to rotations and, more generally, arbitrary distance-preserving transformations of the point clouds. This is clearly an essential feature when working with images, for example. Unfortunately, computing these distances exactly is infeasible in practice.

A closely related viewpoint is to study matching questions from the perspective of “earthmover distances”, which arise from an area of mathematics known as optimal transport [VIL08]. Here the idea is that matchings can be probabilistic or fuzzy, which is encoded by representing the datasets as carrying probability distributions. The costs of the matchings are then computed by adding up the amounts of probability mass required to transform one distribution into another, weighted by the distances. The resulting distances between metric spaces are known as the Wasserstein and Gromov-Wasserstein distances [MÉM11, STU06]. This approach is related to the approach above insofar as the Gromov-Wasserstein distance is a relaxation of the Gromov-Hausdorff distance. Although there is a substantial and beautiful theoretical literature on optimal transport, the Gromov-Wasserstein distance has until recently been too expensive to use in practice, limited to at most a few hundred points. A related relaxation in terms of semidefinite programming yields a distance and matching procedure that has attractive theoretical properties but is also expensive to use in practice [RPD+12, SJ08, KKB+15, VBB+16]; again, it is limited to hundreds of points.

Quite recently, there has been exciting progress in the area based on a relaxation of the Gromov-Wasserstein distance that involves regularizing the optimal transport problem by adding an entropy term [CUT13]. Roughly speaking, this adds a penalty based on the complexity of the matching. This refinement to the minimization objective (see Section 2.2) turns the problem into one that can be solved comparatively efficiently with Sinkhorn’s fixed-point algorithm [SIN74]. However, applying this approximation to datasets with more than a few thousand points is only feasible in low dimensions.

The limitations on the applicable size of datasets are a significant issue for many potential applications, including single-cell characterization of biological systems. Recent technological breakthroughs have enabled characterization of gene expression, chromatin accessibility, and structure at the single-cell level, allowing biologists to generate datasets with tens of thousands of cells (data points) in a very high-dimensional space (each dimension represents a gene of interest). These technical limitations are unfortunate since optimal transport has been shown in recent work to have significant potential applications, e.g., for removing batch effects [FHN+19], or for a better understanding of temporal evolution and differentiation in development [SST+19].

Contributions. Our goal in this article is to introduce and study a general framework for applying matching algorithms to very large datasets by recursive decomposition. The basic idea is to partition the dataset into (possibly overlapping) subsets, match representative points from the subsets using any matching algorithm as a subroutine, and then recursively match the partitions. We refer to the recursive approximation scheme as MREC (see Algorithm 1). By selecting parameters appropriately, we can work with datasets of tens of thousands of points. A particularly attractive feature of this framework is that, since there is an external measure of the quality of a matching, we can efficiently search over the parameter space, including the selection of matching algorithms as well as methods for partitioning.

After introducing the algorithm, we give a simple theoretical analysis that identifies situations in which we can guarantee that this approximation will work well (see Section 3.2). The conclusions are reassuring but unsurprising: the algorithm can be expected to work well either when the matching itself admits a recursive decomposition (e.g., the data is produced by well-separated Gaussians) or when the partitions produce an approximation that is close to the original metric space.

Finally, we explore a number of applications of MREC to datasets coming from single-cell characterization of different biological systems, showcasing the usefulness of MREC for various problems in this field (see Section 4). On small datasets, the results we obtain are comparable to or better than those of alternative algorithms, and MREC produces interesting results on datasets too large to be amenable to any other technique.

2. Matchings and Gromov-type distances

In this section, we recall the definitions of matchings and various matching distances between metric spaces. We first review the Gromov-Hausdorff distance in Section 2.1. We present its probabilistic (and entropy-regularized) relaxation, the Gromov-Wasserstein distance, in Section 2.2, and we discuss the semidefinite relaxation in Section 2.3.

Problem. We can formulate the basic problem of interest as follows. We are given two datasets $X$ and $Y$, each equipped with a distance function. That is, we have functions $d_X \colon X \times X \to \mathbb{R}$ and $d_Y \colon Y \times Y \to \mathbb{R}$ that satisfy the metric axioms. The problem is to find a matching that best preserves the distances. There are of course many ways to make this precise, depending on the notion of “best”. Our point of departure is the solution that comes from the idea of the Gromov-Hausdorff distance [GRO81, GRO01].

2.1. The Gromov-Hausdorff distance

The Gromov-Hausdorff distance is a metric on the set of compact metric spaces up to isometry, which generalizes the Hausdorff distance.

Definition 2.1.

Let $A$ and $B$ be subsets of a metric space $(Z, d)$. The Hausdorff distance between $A$ and $B$ is defined as
$$ d_H(A, B) = \max\{\vec{d}_H(A, B),\ \vec{d}_H(B, A)\}, $$
where $\vec{d}_H(A, B) = \sup_{a \in A} \inf_{b \in B} d(a, b)$ and $\vec{d}_H(B, A) = \sup_{b \in B} \inf_{a \in A} d(a, b)$.
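For finite point clouds, this definition can be computed directly from the pairwise distances. The following is a minimal numpy sketch (the random point sets and the Euclidean metric are illustrative assumptions, not tied to any particular dataset in this paper):

```python
import numpy as np
from scipy.spatial.distance import cdist

def hausdorff(A, B):
    """Hausdorff distance between finite point sets A and B (rows are points)."""
    D = cdist(A, B)                # pairwise Euclidean distances
    d_AB = D.min(axis=1).max()     # sup_a inf_b d(a, b)
    d_BA = D.min(axis=0).max()     # sup_b inf_a d(a, b)
    return max(d_AB, d_BA)

A = np.random.rand(100, 2)
B = np.random.rand(120, 2)
print(hausdorff(A, B))
```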

The Hausdorff distance depends on the embedding of $A$ and $B$ in $Z$, and for instance is not invariant under independent translations or rotations of $A$ and $B$. As a consequence, it is useful to consider a refinement which is invariant under distance-preserving transformations.

Correspondences. Let $R$ be a correspondence between $X$ and $Y$, i.e., a subset $R \subseteq X \times Y$ such that the projections of $R$ onto $X$ and onto $Y$ are surjective. A correspondence is a kind of matching where multiple points can be associated with each other. Then, the distortion of $R$ is defined to be
$$ \mathrm{dis}(R) = \sup |d_X(x, x') - d_Y(y, y')|, $$
where the supremum is over all $(x, y) \in R$ and $(x', y') \in R$. The distortion can be thought of as the cost of the correspondence: the quality of the matching is assessed by how well distances between matched points are preserved.
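For finite metric spaces, the distortion of a correspondence can likewise be evaluated directly from the two distance matrices. A small sketch (representing $R$ as a list of index pairs is our illustrative choice):

```python
import numpy as np

def distortion(DX, DY, R):
    """Distortion of a correspondence R, given as a list of index pairs (i, j)
    with i indexing X and j indexing Y; DX and DY are the distance matrices."""
    I = np.array([i for i, _ in R])
    J = np.array([j for _, j in R])
    # |d_X(x, x') - d_Y(y, y')| over all pairs of matched pairs
    return np.abs(DX[np.ix_(I, I)] - DY[np.ix_(J, J)]).max()
```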

Gromov-Hausdorff distance. The Gromov-Hausdorff distance is defined as:

$$ d_{GH}(X, Y) = \frac{1}{2} \inf_{R} \mathrm{dis}(R), $$

where the infimum varies over all correspondences $R \subseteq X \times Y$. That is, in principle, if we compute the Gromov-Hausdorff distance, we also obtain a correspondence which realizes it. Unfortunately, it is infeasible to directly compute the Gromov-Hausdorff distance. The optimization problem above is an example of an integer programming problem, and is well-known to be NP-hard, even for simple spaces such as metric trees [AFN+18]. Instead, a standard approach is to turn to various kinds of relaxations.

2.2. The Gromov-Wasserstein distance

A first kind of relaxation is to allow probabilistic matchings: each point $x \in X$ is matched to a point $y \in Y$ with a probability $\pi(x, y)$, subject to the constraint that, for each $x$, the probabilities $\pi(x, y)$ sum to the mass of $x$. It turns out that a natural way to study this is via the consideration of metric spaces equipped with a probability measure.

Metric probability space. We will temporarily broaden consideration from metric spaces to metric measure spaces; these are metric spaces $(X, d_X)$ equipped with a Borel measure $\mu_X$. In fact, we restrict attention to metric probability spaces, i.e., metric measure spaces where the total measure is $1$. Any finite metric space can be regarded as a metric probability space equipped with the uniform measure, where each point is assigned equal probability. To construct a metric in this setting which is analogous to the Gromov-Hausdorff distance, we need to introduce a probabilistic analogue of a correspondence. The correct notion here is that of a coupling of $\mu_X$ and $\mu_Y$; this is a probability measure $\pi$ on $X \times Y$ such that the marginal of $\pi$ on $X$ is $\mu_X$ and the marginal of $\pi$ on $Y$ is $\mu_Y$. We can now define the Gromov-Wasserstein distance [MÉM11, STU06, LV09].

Gromov-Wasserstein distance. Let $\Gamma_{X,Y}(x, y, x', y')$ denote $|d_X(x, x') - d_Y(y, y')|$.

Definition 2.2.

Let $p \geq 1$. The $p$-th Gromov-Wasserstein distance computed between two metric probability spaces $(X, d_X, \mu_X)$ and $(Y, d_Y, \mu_Y)$ is defined as:

$$ GW_p(X, Y) = \frac{1}{2} \inf_{\pi} \left( \iint \Gamma_{X,Y}(x, y, x', y')^p \, d\pi(x, y)\, d\pi(x', y') \right)^{1/p}, $$

where $\pi$ ranges over couplings of $\mu_X$ and $\mu_Y$. If we replace $\Gamma_{X,Y}(x, y, x', y')$ by $d(x, y)$ when $X$ and $Y$ are subsets of a common metric space $(Z, d)$, the resulting distance is called the Wasserstein distance; this is analogous to the Hausdorff distance.

Intuitively, the Gromov-Wasserstein distance is computed by testing every probabilistic matching between $X$ and $Y$. Each such matching assigns, for a given $x \in X$, a probability distribution over matches in $Y$, and vice-versa. The matching that minimizes the so-called metric distortion score between the spaces is picked, and its score is used as the distance value. By rounding, one can extract an absolute (non-probabilistic) matching.
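In the discrete setting, where the coupling is an $n \times m$ matrix $\pi$ with prescribed marginals, the objective being minimized can be evaluated directly for any candidate coupling. The following illustration-only sketch materializes the full quartic tensor and is therefore usable only on very small inputs:

```python
import numpy as np

def gw_cost(DX, DY, pi, p=2):
    """Evaluate the Gromov-Wasserstein objective
    sum_{i,k,j,l} |DX[i,k] - DY[j,l]|^p * pi[i,j] * pi[k,l]
    for a given coupling pi. O(n^2 m^2) memory: tiny inputs only."""
    Gamma = np.abs(DX[:, None, :, None] - DY[None, :, None, :]) ** p
    return np.einsum('ijkl,ij,kl->', Gamma, pi, pi)

# Example: the independent coupling of two uniform measures
n, m = 10, 12
X, Y = np.random.rand(n, 2), np.random.rand(m, 2)
DX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
DY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
pi = np.full((n, m), 1.0 / (n * m))
print(gw_cost(DX, DY, pi))
```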

Unfortunately, the Gromov-Wasserstein distance is again very hard to compute in general: it is a quadratic assignment problem (QAP), a class of problems that is in general known to be hard to solve, or even to provably approximate [PW94].

Entropic regularization. Since the Gromov-Wasserstein distance is still expensive to compute, attention has recently focused on a further relaxation, which adjusts the cost function by adding a penalty term given by the entropy of the coupling [CUT13], where the entropy of a (discrete) coupling $\pi$ is $H(\pi) = -\sum_{x, y} \pi(x, y) \log \pi(x, y)$.

Definition 2.3.

For a fixed $\varepsilon > 0$, the entropy-regularized Gromov-Wasserstein distance is:

$$ GW_p^{\varepsilon}(X, Y) = \inf_{\pi} \left( \iint \Gamma_{X,Y}(x, y, x', y')^p \, d\pi(x, y)\, d\pi(x', y') \; - \; \varepsilon H(\pi) \right), $$

where $\pi$ again ranges over couplings of $\mu_X$ and $\mu_Y$.

Adding the entropy term turns out to result in a problem that can be efficiently approximated with, e.g., Sinkhorn’s fixed-point algorithm [SIN74]. However, solving this problem on thousands of points is feasible only in very low dimensions.
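In practice, solvers for this regularized problem are available off the shelf; for instance, the POT Python package (used in our implementation, see Section 3.1) exposes one. A minimal usage sketch (uniform weights and the parameter values are illustrative assumptions; the exact signature may vary across POT versions):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

n, m = 200, 250
X = np.random.rand(n, 5)
Y = np.random.rand(m, 5)
DX = ot.dist(X, X, metric='euclidean')   # intra-dataset distance matrices
DY = ot.dist(Y, Y, metric='euclidean')
p = np.full(n, 1.0 / n)                  # uniform measure on X
q = np.full(m, 1.0 / m)                  # uniform measure on Y

# Entropy-regularized Gromov-Wasserstein coupling (Sinkhorn-based solver)
pi = ot.gromov.entropic_gromov_wasserstein(
    DX, DY, p, q, loss_fun='square_loss', epsilon=1e-2)
```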

2.3. Relaxation of Gromov-Hausdorff distance

A different relaxation of the Gromov-Hausdorff distance is given by semidefinite programming [VBB+16]. This has the advantage that the resulting optimization problem is convex, comes with a certificate of optimality, and yields a pseudometric that satisfies the triangle inequality. On the other hand, the result is a semidefinite program in $O((nm)^2)$ variables, where $n$ and $m$ are the sizes of the two datasets. A further relaxation reduces its complexity by introducing sparsity assumptions [FKS18], but it is still infeasible for large datasets.

Let $n$ and $m$ be the number of points of $X$ and $Y$ respectively. For the semidefinite relaxation, we optimize over a matrix $\Pi \in \mathbb{R}^{n \times m}$ encoding a probabilistic matching. In fact, we work with an augmented matrix

(1)   $$ P = \begin{pmatrix} 1 & \mathrm{vec}(\Pi)^{\top} \\ \mathrm{vec}(\Pi) & Q \end{pmatrix}, $$

with the entries of the block $Q$ indexed by pairs $(i, j), (k, l)$, with $1 \leq i, k \leq n$ and $1 \leq j, l \leq m$; the block $Q$ is a relaxation of the rank-one matrix $\mathrm{vec}(\Pi)\,\mathrm{vec}(\Pi)^{\top}$.

Definition 2.4.

The SDP relaxation of the Gromov-Hausdorff distance is the solution to the following semidefinite programming problem:

$$ \min_{P} \; \sum_{i, j, k, l} \big| d_X(x_i, x_k) - d_Y(y_j, y_l) \big| \, Q_{(i,j),(k,l)} $$

subject to $P \in \mathcal{C}$, where $\mathcal{C}$ denotes the set of matrices of the form (1) satisfying linear constraints forcing $\Pi$ to have nonnegative entries with prescribed row and column sums, forcing the entries of $Q$ to be nonnegative and consistent with $\Pi$ on its diagonal and marginals, and requiring $P$ to be symmetric and positive semidefinite (we refer to [VBB+16] for the precise list of constraints).

As noted above, one of the most interesting properties of the SDP relaxation is that it yields a distance that also satisfies the triangle inequality; this is somewhat surprising, as there is not an obvious geometric interpretation as with the optimal transport framework.

3. A recursive approximation scheme

In this section, we define and study a recursive approximation scheme for computing matchings between datasets. This algorithm, which we call MREC and define in Algorithm 1, takes as parameters a black box matching function (such as the entropy-regularized Gromov-Wasserstein distance or the SDP relaxation presented in Sections 2.2 and 2.3) and a black box clustering function (such as $k$-means++ [AV07]), and uses them recursively to scale and approximate the matching computation.

3.1. Definition

Setup. We are given a clustering algorithm $C$ whose input is a finite metric space $(X, d_X)$ together with a number of clusters $k$, and whose output is a finite metric space $(X', d_{X'})$ of size $k$ together with a surjective (point-set) function $\pi_X \colon X \to X'$, satisfying, for all $x' \in X'$:

1. $X' \subseteq X$ and $d_{X'}$ is the restriction of $d_X$ to $X'$.     2. The cluster size $|\pi_X^{-1}(x')|$ is roughly constant.
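A concrete instance of such a $C$ can be obtained from any off-the-shelf clustering method. A minimal sketch with scikit-learn's $k$-means (snapping each center to its nearest data point so that $X' \subseteq X$, as required by condition 1, is our illustrative choice):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def cluster(X, k):
    """Return centroids Xp (a subset of the rows of X) and the assignment
    pi_X, where pi_X[i] is the centroid index of point i."""
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    # snap each k-means center to its nearest data point (a "medoid")
    medoid_idx = cdist(km.cluster_centers_, X).argmin(axis=1)
    # note: k-means only roughly balances cluster sizes (condition 2)
    return X[medoid_idx], km.labels_
```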

Recursive scheme. We now present our recursive decomposition algorithm, which we call MREC and detail in Algorithm 1, for producing a matching between $X$ and $Y$.

  Input: dataset $X$, dataset $Y$
  Output: matching $m \colon X \to Y$
  Parameters: (black box) clustering algorithm $C$, number of clusters $k$, threshold $\tau$, (black box) matching algorithm $A$
  if $|X|$ or $|Y|$ is less than $\tau$ then
     return $A(X, Y)$;
  else
     $X', \pi_X \leftarrow C(X, k)$; $Y', \pi_Y \leftarrow C(Y, k)$; { Use $C$ to compute centroids $X'$, $Y'$, with $|X'| = |Y'| = k$, as well as maps $\pi_X \colon X \to X'$, $\pi_Y \colon Y \to Y'$ }
     $m' \leftarrow A(X', Y')$; { Use $A$ to match $X'$ and $Y'$ }
     $x'_1, \dots, x'_k \leftarrow \mathrm{enum}(X')$; $y'_1, \dots, y'_k \leftarrow \mathrm{enum}(Y')$; { Enumerate the elements of $X'$ and use the consistent enumeration (under the match $m'$) for $Y'$, so that $m'(x'_i) = y'_i$ }
     for $i$ in $1, \dots, k$ do
        $m|_{\pi_X^{-1}(x'_i)} \leftarrow$ MREC$(\pi_X^{-1}(x'_i), \pi_Y^{-1}(y'_i))$;
     end for
     return $m$;
  end if
Algorithm 1 MREC
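The recursion translates almost line by line into Python. The following is a minimal sketch rather than our released implementation: it assumes the black boxes cluster and match are callables, with match returning, for each point (or centroid) of its first argument, the index of a point of its second argument, and it accumulates the matching across the recursive calls:

```python
import numpy as np

def mrec(X, Y, cluster, match, k, tau):
    """Recursively match point clouds X and Y (arrays of row vectors).
    Returns m with m[i] = row index in Y matched to X[i]."""
    m = np.empty(len(X), dtype=int)
    if len(X) <= tau or len(Y) <= tau:
        m[:] = match(X, Y)                # base case: black-box matching A
        return m
    Xp, pix = cluster(X, k)               # centroids and assignments for X
    Yp, piy = cluster(Y, k)               # centroids and assignments for Y
    mp = match(Xp, Yp)                    # match the two sets of centroids
    for i in range(k):                    # recurse on matched cluster pairs
        idx_X = np.where(pix == i)[0]
        idx_Y = np.where(piy == mp[i])[0]
        sub = mrec(X[idx_X], Y[idx_Y], cluster, match, k, tau)
        m[idx_X] = idx_Y[sub]             # lift local indices back into Y
    return m
```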

At the conclusion of this algorithm, we obtain a matching between $X$ and $Y$; this is the output of the algorithm, along with the induced approximation of the Gromov-type distances. In practice, MREC is stochastic since it depends both on the specifics of the clustering algorithm $C$ and on random properties of the optimization solvers for the matching algorithm $A$. As a consequence, the correct way to proceed is to perform a search over the parameter space, possibly even searching over different clustering algorithms; each matching comes with a score, which is the associated distortion measure, allowing us to automatically hill-climb to find the best matching.

Associated code. Our code is publicly available1. It contains $k$-means and Voronoi partitions (with randomly sampled seeds) for the black box clustering function $C$, as well as the entropy-regularized Gromov-Wasserstein distance [CUT13] (based on the POT Python package2) and the semidefinite relaxation of the Gromov-Hausdorff distance [VBB+16] for the black box matching function $A$. Moreover, it has been designed so that incorporating new clustering and matching functions is as easy as possible.

3.2. Theoretical properties

We begin by giving some generic properties of MREC. If we let $T(s)$ denote the time required for the black box matching algorithm $A$ to compute a matching on metric spaces of size smaller than $s$, then when using $k$ clusters (and a recursion threshold $\tau$ on the order of $k$) the time to execute MREC on datasets with $n$ points is $O\!\big(\tfrac{n}{k}\, T(k)\big)$. That is, the running time is basically determined by $T$.
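One way to see this, under the simplifying assumptions that the clusters are balanced and that the clustering cost is dominated by the matching cost, is to unroll the recursion:

$$ \mathrm{time}(n) = T(k) + k \cdot \mathrm{time}(n/k), \qquad \mathrm{time}(\tau) = T(\tau), $$
$$ \mathrm{time}(n) = \sum_{d=0}^{\log_k(n/\tau) - 1} k^d \, T(k) \; + \; \frac{n}{\tau}\, T(\tau) \; = \; O\!\Big(\frac{n}{k}\, T(k)\Big) \quad \text{for } \tau = \Theta(k). $$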

Next, we turn to the question of how good an approximation we can expect to get from MREC. Recall that $S \subseteq X$ is an $\varepsilon$-net if for every point $x \in X$ there is a point $s \in S$ such that $d_X(x, s) \leq \varepsilon$. A basic observation is that if $S$ is an $\varepsilon$-net of $X$, the evident induced matching that assigns each point of $X$ to a point of $S$ within distance $\varepsilon$ has distortion bounded above by $2\varepsilon$. The triangle inequality then implies that if $S_X \subseteq X$ is an $\varepsilon_X$-net and $S_Y \subseteq Y$ is an $\varepsilon_Y$-net, then the optimal matching of $S_X$ and $S_Y$ induces a matching of $X$ and $Y$ whose distortion is within $2(\varepsilon_X + \varepsilon_Y)$ of the optimal.

It is helpful to reformulate this in terms of clustering algorithms. Let $C$ be a clustering algorithm, and $\varepsilon > 0$. Say that a finite metric space $X$ is $(C, \varepsilon)$-clustered if the output of $C$ on $X$ satisfies the condition that, for all clusters, all elements in the cluster are within distance $\varepsilon$ of each other. Notice that this in particular means that choosing a point in each cluster produces an $\varepsilon$-net.

Proposition 3.1.

Suppose $X' \subseteq X$, $Y' \subseteq Y$. Given a relation $R' \subseteq X' \times Y'$ and retractions $r_X \colon X \to X'$ and $r_Y \colon Y \to Y'$, consider the derived relation $R \subseteq X \times Y$ defined by $R = \{(x, y) : (r_X(x), r_Y(y)) \in R'\}$. Suppose that $d_X(x, r_X(x)) \leq \varepsilon_X$ for all $x \in X$, and likewise $d_Y(y, r_Y(y)) \leq \varepsilon_Y$ for all $y \in Y$. Then $\mathrm{dis}(R) \leq \mathrm{dis}(R') + 2(\varepsilon_X + \varepsilon_Y)$.

Proof.

For $(x, y), (x', y') \in R$, we have $(r_X(x), r_Y(y)), (r_X(x'), r_Y(y')) \in R'$. Thus, we have, using the triangle inequality:
$$ |d_X(x, x') - d_X(r_X(x), r_X(x'))| \leq 2\varepsilon_X \quad \text{and} \quad |d_Y(y, y') - d_Y(r_Y(y), r_Y(y'))| \leq 2\varepsilon_Y. $$

This leads to:
$$ |d_X(x, x') - d_Y(y, y')| \leq |d_X(r_X(x), r_X(x')) - d_Y(r_Y(y), r_Y(y'))| + 2(\varepsilon_X + \varepsilon_Y) \leq \mathrm{dis}(R') + 2(\varepsilon_X + \varepsilon_Y). \qquad ∎ $$

This immediately implies the following corollary.

Corollary 3.2.

Let $X$ and $Y$ be $(C, \varepsilon)$-clustered, with the number of clusters equal to $k$ in both cases, and let $X'$ and $Y'$ denote the metric spaces of cluster representatives generated by $C$. Then MREC (computed with the exact Gromov-Hausdorff distance as the matching function $A$) produces a matching between $X$ and $Y$ with distortion at most $\mathrm{dis}(R') + 4\varepsilon$, where $R'$ is the correspondence realizing $d_{GH}(X', Y')$.

Now, a very natural question to ask is, for a metric space with some kind of dimension restriction, how many clusters are needed to obtain an $\varepsilon$-clustering. A standard way of encoding the dimension of a metric space is via the doubling dimension. Recall that a metric space has doubling dimension $d$ if every ball of radius $r$ can be covered by $2^d$ balls of radius $r/2$. For example, Euclidean space $\mathbb{R}^d$ has doubling dimension $\Theta(d)$. The following standard result now follows from an easy counting argument.

Proposition 3.3.

Let $X$ be a metric space with doubling dimension $d$. Then there exists an $\varepsilon$-clustering with approximately $(\Delta / \varepsilon)^d$ clusters, where $\Delta$ denotes the diameter of $X$.

Here the approximation in the statement comes from issues of rounding logarithms. These results give quite pessimistic estimates, but they are useful for providing a sense of lower bounds on the performance to expect from MREC and its recursive decomposition.

The analyses above focus on the situation in which the coarse approximation of the data that we get from partitioning is close in the Gromov-Hausdorff metric to the original data. (Analogous analyses can be performed using other metrics.) We now explain a situation in which we can expect the algorithm to work well even when the partition approximation is in this sense far from the original data.

Consider the toy model where $X$ and $Y$ are both finite samples from a metric space $Z$ with the property that $Z$ can be partitioned into clusters $C_1, \dots, C_k$ such that the minimum Hausdorff distance between any pair of clusters $C_i$ and $C_j$ is substantially larger than the maximum diameter of the clusters. Specifically, writing $D = \max_i \mathrm{diam}(C_i)$, assume that
$$ \min_{i \neq j} d_H(C_i, C_j) > 5D, $$
which in particular forces any two points in distinct clusters to be at distance greater than $3D$ from each other.

First off, note that in this case the optimal clustering can be found by the $k$-means algorithm; each point in a cluster is closer to its centroid than to any point outside the cluster. In addition, we know that the expected number of samples required to obtain a point in each cluster is bounded above by $H_k / \min_i \mu(C_i)$, where $H_k$ is the $k$-th harmonic number and $\mu(C_i)$ is the mass of cluster $C_i$; this follows from a generalized coupon-collector analysis. Let us assume that we have chosen enough points so that with high probability we indeed have at least one sample from each cluster. In this situation, the optimal matching between $X$ and $Y$ is realized by a matching of the clusters.

Proposition 3.4.

Under the hypotheses above, the optimal matching of $X$ and $Y$ permutes the clusters.

Proof.

Any matching that separates points within a cluster (i.e., assigns points of a single cluster to distinct clusters of $Y$) or merges clusters will have distortion bounded below by $\min_{i \neq j} d(C_i, C_j) - D > 2D$, whereas the matching that sends each cluster to itself has distortion bounded above by $2D$; hence the optimal matching cannot split or merge clusters, i.e., it permutes the clusters. ∎

An immediate corollary is that MREC will find the optimal matching of the top-level clusters with high probability. Since we are only using metric information, our matchings are unique only up to symmetries of the data; if we assume that the inter-cluster distances are distinct, the matching returned is the unique matching that sends each cluster to itself.

3.3. Discussion

One of the most useful features of the MREC framework is that, because matchings can be evaluated in terms of their distortion, we can search over the parameter space to find the minimal-distortion matching. For example, it is reasonable to try a variety of clustering algorithms and numbers of centers in order to ascertain which parameters lead to the best approximation of the optimal matching. Moreover, we can even try different matching procedures. This in particular means that the algorithm can tolerate a substantial amount of uncertainty about the decomposition of the best matching, as long as there exists a partition and allowable resolution for which there is a good matching.

Another useful aspect is that we can easily make MREC robust to outliers by subsampling cross-validation: specifically, we can repeatedly subsample the two metric spaces $X$ and $Y$, perform the matchings on the subsamples, and identify points that consistently result in high-distortion matchings, as in the sketch below.
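A minimal sketch of this subsampling heuristic (the scoring rule, which counts how often a point lands in the top distortion quantile of a run, is our illustrative choice, and only the points of $X$ are scored here for brevity):

```python
import numpy as np

def per_point_distortion(DXs, DYs, m):
    """For each subsampled point i, the worst |d_X(i,j) - d_Y(m[i],m[j])|."""
    return np.abs(DXs - DYs[np.ix_(m, m)]).max(axis=1)

def outlier_scores(X, Y, DX, DY, run_mrec, n_runs=20, frac=0.8, q=0.95):
    """Repeatedly subsample both spaces, match, and score each point of X
    by how often it falls in the top distortion quantile of a run."""
    votes, counts = np.zeros(len(X)), np.zeros(len(X))
    for _ in range(n_runs):
        ix = np.random.choice(len(X), int(frac * len(X)), replace=False)
        iy = np.random.choice(len(Y), int(frac * len(Y)), replace=False)
        m = run_mrec(X[ix], Y[iy])                    # local indices into iy
        d = per_point_distortion(DX[np.ix_(ix, ix)], DY[np.ix_(iy, iy)], m)
        votes[ix] += d > np.quantile(d, q)
        counts[ix] += 1
    return votes / np.maximum(counts, 1)              # per-point outlier score
```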

Finally, although we have discussed the matching algorithm in terms of a partition of the dataset, instead it is entirely reasonable to imagine an overlapping cover of the datasets used for matchings. We only mention this refinement here; we intend to return to this in future work.

4. Applications

Figure 1. Mean and variance of the accuracy and distortion of the matchings computed by MREC (for the parameters inducing the best matching) w.r.t. the number of centroids used for Synth (upper left), Synth+ (upper middle), PBMC (upper right), HEMA (lower left), BDTNP (lower middle) and BRAIN (lower right). The dashed line corresponds to the accuracy obtained without doing recursion.

In this section, we demonstrate the use of MREC by applying it to synthetic and real biological single-cell datasets with ground-truth labels. Overall, we show that the recursion procedure used in MREC is highly efficient, in the sense that the computed matchings are on par with, or slightly better than, the ones obtained without doing any recursion (as measured by the accuracy of label transfer under the computed matchings), with significantly smaller running times. We first provide proofs of concept of MREC for synthetic and real data in Sections 4.1 and 4.2. Then, we present applications of MREC to correcting batch effects on single-cell datasets in Section 4.3 and to aligning across different measurement modalities in Sections 4.4 and 4.5. Whereas the datasets in Sections 4.3 and 4.4 can be handled directly, the dataset presented in Section 4.5 is too large to be handled with usual matching algorithms, making MREC, as far as we know, the only algorithm suitable for working at the scale enabled by the recent breakthroughs in single-cell data generation.

Single-cell datasets. Single-cell RNA sequencing has been a major advance in the fine characterization of many biological systems at the molecular level, since it allows biologists to study the gene expression levels of thousands of individual cells, leading to a clearer understanding of many biological processes. For instance, clustering the expression of single-cell data identifies sets of cells with common properties and function, and the relations between these clusters can reflect transitional states. In these datasets, each cell is represented as a point in Euclidean space, in which every dimension is associated with the expression of a gene of interest, and each coordinate contains the expression of the corresponding gene in that cell (that is, how often this gene is transcribed in the cell, as measured by the number of transcript fragments detected in the cell). Hence, raw single-cell datasets are usually given as integer matrices. They often require preprocessing; we will use the Python packages Scanpy [WAT18] and Randomly [ABB+18].

Batch effects. Single-cell RNA sequencing has resulted in a huge increase in data availability. However, many methodological issues with using these datasets remain. Notably, datasets generated in different labs or with different protocols usually exhibit unwanted biases, often referred to as batch effects [HLM+18]. Matching, or aligning, these datasets in order to integrate them together in a robust way is thus a problem of major importance, and optimal transport with the Gromov-Wasserstein distance has recently been shown to be a promising solution to this problem [FHN+19]. In Section 4.3, we present an example of batch effect correction with MREC.

Different measurement modalities. Another class of alignment problems comes from situations where we have measurements of the same system coming from different kinds of technologies, for example, measuring gene expression and chromatin accessibility. Understanding the amount of alignment that can be expected between these two kinds of measurements is an extremely interesting problem. In Sections 4.4 and 4.5, we describe applications to two such datasets.

Parameters and results. Each of the datasets in Sections 4.1, 4.2, 4.3 and 4.5 presented below is comprised of two groups with the same list of associated ground-truth labels (except for BDTNP; see the details in Section 4.4 below). On each of these datasets, we use a Voronoi partition computed with randomly sampled centroids3 as the clustering algorithm $C$, a fixed threshold $\tau$, and we searched over 20 runs with matchings coming from both the entropy-regularized Gromov-Wasserstein distance (with the regularization parameter $\varepsilon$ ranging over a log-scale grid) and the SDP relaxation as the matching algorithm $A$ for MREC. Note that, for all datasets except BDTNP in Section 4.4 and BRAIN in Section 4.5, the two groups belong to the same common space, so we can use the Wasserstein distance as well. We also compute matchings for an increasing number of clusters $k$ (see Algorithm 1), again ranging over a log-scale grid. For each computed matching, we measure three quantities: the running time of MREC, the distortion of the matching (see Section 2), and its accuracy, given by computing the fraction of points in the first group whose associated points under the matching given by MREC share the same label (in the second group); a sketch of this measure appears below. We show in Figure 1 the evolution of accuracy and distortion w.r.t. the number of centroids for all datasets (see Appendix, Figure 2, for running times). We finally provide in Table 1 an overall comparison with directly computing the matchings (i.e., without doing any recursion). In this table, we report the accuracies of the matchings with lowest distortion. All computations were performed on an AWS machine with a Xeon Platinum 8175 processor. See also Appendix, Figures 4 and 5, for more illustrations and examples of parameter tuning.
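The accuracy measure is straightforward to state in code; a sketch (integer label arrays and an index-based matching are illustrative assumptions):

```python
import numpy as np

def label_transfer_accuracy(labels_X, labels_Y, m):
    """Fraction of points of X whose matched point in Y has the same label."""
    return np.mean(labels_X == labels_Y[m])
```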

4.1. Matching of simulated datasets Synth(+)

We first apply MREC to a synthetic dataset sampled from a mixture of three Gaussian distributions located at different positions in the Euclidean plane. More specifically, we drew samples that we then split into two groups. Accuracy is measured as the ability of MREC to match together points sampled from the same Gaussian distribution. We also ran a larger-scale version of this experiment with more samples (called Synth+).

4.2. Matching of subsampled datasets Pbmc

In our second example, we test the performance of MREC on a single-cell transcriptional dataset. We collected the data presented in [KST+18], which is comprised of single peripheral blood mononuclear cells (PBMCs) with eight different cell types (CD4, CD8, B-cells, NK-cells, dendritic cells (Ddr), CD14, monocytes (Mnc) and megakaryocytes (Mgk)). We preprocessed the cells with Randomly [ABB+18], each remaining dimension representing a highly variable gene of interest. We then split the data into two groups and measured the ability of MREC to match together cells of the same type coming from different groups.

4.3. Matching of different protocols Hema

In our third application, we study an example of batch effects in single-cell transcriptional data for which optimal transport has already proved useful [FHN+19]. The data is comprised of two groups of single hematopoietic cells with three different cell types (common myeloid progenitors (CMP), granulocyte-monocyte progenitors (GMP) and megakaryocyte-erythrocyte progenitors (MEP)). These groups were generated with the SMART and MARS protocols respectively, as explained in [HLM+18]. As mentioned above, even though the dimensions are in correspondence, the fact that different protocols were used introduces a bias, or batch effect, in the data. We preprocessed the data using the method presented in [ZTB+17] and available in the Python package Scanpy, and then reduced the number of dimensions to 50 with PCA.

4.4. Matching of different modalities Bdtnp

In this section, we study 3,039 single cells coming from Drosophila embryos, as presented in [NKF+18]. Spatial coordinates in the 2D tissues were obtained with fluorescence in situ hybridization (FISH), together with the expression of 84 marker genes, leading to two spaces with different dimensions. This means that the usual Wasserstein distance cannot be used here. We followed the exact same procedure as in [NKF+18], i.e., pairwise distances in both spaces were computed with a nearest-neighbor graph, so as to approximate the geodesic distances of the manifolds from which the datasets were sampled (following the manifold assumption made in [NKF+18]). Finally, the ground truth is here given by the true correspondence between the cells. Hence, we chose to measure accuracy with the (normalized) area under the curve that shows, for each radius $r$, the fraction of cells (in expression space) whose corresponding cell (in tissue space) under the computed matching is at distance at most $r$ from the true corresponding cell (a common practice in computer graphics [COO15]).
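A sketch of this accuracy measure (the uniform radius grid is our illustrative choice; D_tissue, m and truth stand for the tissue-space distance matrix, the computed matching, and the true correspondence):

```python
import numpy as np

def matching_auc(D_tissue, m, truth, n_radii=100):
    """Normalized AUC of r -> fraction of cells whose matched cell lies
    within tissue-space distance r of the true corresponding cell."""
    errs = D_tissue[m, truth]                      # distance to true match
    radii = np.linspace(0, D_tissue.max(), n_radii)
    curve = np.array([(errs <= r).mean() for r in radii])
    return curve.mean()      # uniform grid: mean approximates normalized AUC
```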

4.5. Matching of different modalities Brain

In our last example, we focus on a dataset of single cells sampled from the human brain with eight different cell types (astrocytes (Ast), endothelial cells (End), excitatory neurons (Ex), inhibitory neurons (In), microglia (Mic), oligodendrocytes (Oli) and their precursor cells (OPC)), presented in [LCS+17]. The challenge of this dataset is two-fold. First, the dimensions of the two groups are not in correspondence (as in Section 4.4): gene expression was measured in the first group, while the second group contains DNA region accessibilities (that is, whether given regions in the DNA of the cell nucleus are open, i.e., accessible for transcription, or not). Second, the datasets are too large to be handled with usual matching techniques: each group contains far more cells than those techniques can process. We preprocessed the first group with the method presented in [ZTB+17] and available in the Python package Scanpy, and the second one with TF-IDF. Then, we applied PCA to each group separately to reduce the number of dimensions to 50. Note that since the dimensions are no longer in correspondence, we have to use the Gromov-Wasserstein distance. We also added a standard Wasserstein cost by using a binary cost matrix containing the correspondence between the genes and the DNA regions (that is, the $(i, j)$ entry of the matrix is one if the genomic coordinates of gene $i$ intersect DNA region $j$); a sketch is given below.
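A sketch of such a binary cost matrix (representing genes and DNA regions as (chromosome, start, end) triples is an illustrative assumption):

```python
import numpy as np

def binary_cost(genes, regions):
    """genes, regions: lists of (chrom, start, end). Entry (i, j) is 1 when
    the coordinates of gene i intersect DNA region j, else 0."""
    M = np.zeros((len(genes), len(regions)))
    for i, (c1, s1, e1) in enumerate(genes):
        for j, (c2, s2, e2) in enumerate(regions):
            if c1 == c2 and s1 <= e2 and s2 <= e1:   # interval overlap test
                M[i, j] = 1.0
    return M
```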

4.6. Results and comparison with no recursion.

It can be seen from Figure 1 that, as expected, accuracy generally tends to increase, and distortion to decrease, with the number of centroids. Note also that, for the BRAIN dataset, the variance of the distortion is quite large, and its decrease is not obvious. This large variance is also the reason why the mean accuracy (observed on the plot) is actually much smaller than the one corresponding to the matching with lowest distortion (shown in Table 1); note that this is true (to a certain extent) for all datasets. Note also that there is a clear gap in Figure 1 between the accuracy of the matchings computed with MREC on BDTNP and the baseline. This suggests that convergence has not yet occurred and more centroids are needed. Indeed, the accuracy and running times provided in Table 1 were obtained by allowing the number of centroids to go up to 3,000 (see Appendix, Figure 3).

Finally, it can be seen from Table 1 that recursion is mandatory on Synth+ and BRAIN for the matching computation to finish in a reasonable amount of time. Moreover, the results are generally on par with (or slightly better than) the baseline, with substantially lower running times. In the case of HEMA, the accuracy is significantly better4. This suggests that for some datasets the recursive procedure may lead to better approximations than the “absolute” algorithms.

Dataset   MREC Acc. (%)   No rec. Acc. (%)   MREC Time (s)   No rec. Time (s)
Synth     100             100                14.1            505.8
Synth+    100             100                126.7           14,325.6
PBMC      89.9            89.5               29.1            77.3
HEMA      84.9            79.5               20.5            119.0
BDTNP     99.2            99.1               73.2            508.5
BRAIN     59.6            58.6               125.1           15,580.5
Table 1. Results of MREC and its counterpart with no recursion.

5. Conclusion

In this article, we introduced MREC: a fast and versatile tool for computing matchings on large-scale datasets. It can use any black box matching or clustering function defined by the user, and scale it with a recursive scheme so as to be able to process datasets with very large numbers of points in an efficient way. Theoretical analysis shows that the algorithm can in principle perform well. We validated its use with several applications in single-cell molecular analysis, achieving comparable or better performance than the state-of-the-art on smaller datasets and providing novel results on datasets too large to be aligned efficiently by any other means.

Appendix

Additional figures

In this appendix, we provide various additional plots.

  • In Figure 2, we display the running times of our experiments w.r.t. the number of centroids. Unsurprisingly, it increases with the number of centroids, with a rate that depends on the experiment.

  • In Figure 3, we confirm that MREC converges on the BDTNP experiment by showing the accuracy and distortion for a larger number of centroids than presented in the main article.

  • In Figure 4, we present the dependency of accuracy, distortion and running times w.r.t. both entropy and number of centroids. Note that for small entropy values on HEMA, numerical errors occurred, leading to missing values for the distortion.

  • In Figure 5, we provide illustrations of the datasets and computed matchings.

Code

We provide the code and data that we used to run the experiments (except for BRAIN and Synth+, which are too large, although we provide a way to generate the data for Synth+ in the synth+_matrix.py file; just run python synth+_matrix.py in a terminal) in the experiments folder available at https://github.com/MathieuCarriere/mrec.

The experiments can be reproduced by running ./launchS.sh and ./launchnonS.sh in a terminal. Since running all experiments takes a (very) long time, we also provide a notebook Example.ipynb at the same url: https://github.com/MathieuCarriere/mrec. This notebook illustrates the use of MREC on the dataset Synth. Dependencies for running both codes are numpy, scipy, POT, scikit-learn, cvxopt, matplotlib and (optionally) the Matlab engine API for Python.

Figure 2. Mean and variance of the running time of the matchings computed by MREC (for the parameters inducing the best matching) w.r.t. the number of centroids used for Synth (upper left), Synth+ (upper middle), PBMC (upper right), HEMA (lower left), BDTNP (lower middle) and BRAIN (lower right).
Figure 3. Mean and variance of the accuracy, distortion (left) and running time (right) of the matchings computed by MREC (for the parameters inducing the best matching) w.r.t. the number of centroids used for BDTNP.
Figure 4. Example of parameter tuning for the entropy parameter for Synth (upper left), Synth+ (upper middle), PBMC (upper right), HEMA (lower left), BDTNP (lower middle) and BRAIN (lower right).
Figure 5. Application of MREC to four datasets. Upper left: a synthetic mixture of Gaussians; upper right: PBM cells; lower left: hematopoietic cells generated with the SMART and MARS protocols; lower right: human brain cells, whose gene expression groups are visualized with the first two principal components. In the first three panels, the two input datasets are represented (via their first two principal components where relevant) with dots and crosses, and the MREC prediction (with 1,000 centroids) for each dot, that is, the weighted average of the second dataset computed with the probabilities given by MREC, is represented with a star.

Footnotes

  1. https://github.com/MathieuCarriere/mrec
  2. https://pot.readthedocs.io/en/stable/
  3. We also tried using $k$-means. However, we did not observe noticeable changes in accuracy, but substantially longer running times, so we stuck to randomly sampled centroids.
  4. The accuracies for HEMA provided in this work differ from the ones presented in [FHN+19] on the same dataset for the following reason: in [FHN+19] the authors report the average of the accuracies over several random samples of cells in each group, whereas we measure the accuracy on the whole dataset.

References

  1. P. Agarwal, K. Fox, A. Nath, A. Sidiropoulos and Y. Wang (2018) Computing the Gromov-Hausdorff distance for metric trees. ACM Trans. Algorithms 14 (2), pp. 24:1–24:20.
  2. L. Aparicio, M. Bordyuh, A. Blumberg and R. Rabadan (2018) Quasi-universality in single-cell sequencing data. arXiv:1810.03602.
  3. D. Arthur and S. Vassilvitskii (2007) k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035.
  4. C. Bajaj, T. Gao, Z. He, Q. Huang and Z. Liang (2018) SMAC: simultaneous mapping and clustering using spectral decompositions. In International Conference on Machine Learning, pp. 334–343.
  5. S. Bougleux, B. Gaüzère and L. Brun (2017) A Hungarian algorithm for error-correcting graph matching. In Graph-Based Representations in Pattern Recognition, Cham, pp. 118–127.
  6. D. M. Boyer, Y. Lipman, E. S. Clair, J. Puente, B. A. Patel, T. Funkhouser, J. Jernvall and I. Daubechies (2011) Algorithms to automatically quantify the geometric similarity of anatomical surfaces. Proceedings of the National Academy of Sciences 108 (45), pp. 18221–18226.
  7. M. Carrière, S. Y. Oudot and M. Ovsjanikov (2015) Stable topological signatures for points on 3D shapes. Computer Graphics Forum 34 (5), pp. 1–12.
  8. M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems.
  9. A. Elad and R. Kimmel (2003) On bending invariant signatures for surfaces. IEEE Trans. Pattern Anal. Mach. Intell. 25 (10), pp. 1285–1295.
  10. J. F. B. Ferreira, Y. Khoo and A. Singer (2018) Semidefinite programming approach for the quadratic assignment problem with a sparse graph. Computational Optimization and Applications 69 (3), pp. 677–712.
  11. A. Forrow, J. Hütter, M. Nitzan, P. Rigollet, G. Schiebinger and J. Weed (2019) Statistical optimal transport via factored couplings. In International Conference on Artificial Intelligence and Statistics, pp. 2454–2465.
  12. T. Gao, G. S. Yapuncich, I. Daubechies, S. Mukherjee and D. M. Boyer (2018) Development and assessment of fully automated and globally transitive geometric morphometric methods, with application to a biological comparative dataset with high interspecific variation. The Anatomical Record 301 (4), pp. 636–658.
  13. M. Gromov (1981) Groups of polynomial growth and expanding maps. Inst. Hautes Études Sci. Publ. Math. (53), pp. 53–73.
  14. M. Gromov (2001) Metric structures for Riemannian and non-Riemannian spaces. Progress in Mathematics, Birkhäuser.
  15. L. Haghverdi, A. Lun, M. Morgan and J. Marioni (2018) Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology 36 (5), pp. 421–427.
  16. H. M. Kang, M. Subramaniam, S. Targ, M. Nguyen, L. Maliskova, E. McCarthy, E. Wan, S. Wong, L. Byrnes, C. Lanata, R. Gate, S. Mostafavi, A. Marson, N. Zaitlen, L. Criswell and C. J. Ye (2018) Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nature Biotechnology 36 (1), pp. 89–94.
  17. I. Kezurer, S. Z. Kovalsky, R. Basri and Y. Lipman (2015) Tight relaxation of quadratic matching. In Computer Graphics Forum, Vol. 34, pp. 115–128.
  18. B. Lake, S. Chen, B. Sos, J. Fan, G. Kaeser, Y. Yung, T. Duong, D. Gao, J. Chun, P. Kharchenko and K. Zhang (2017) Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nature Biotechnology 36 (1), pp. 70–80.
  19. Y. Lipman and I. Daubechies (2011) Conformal Wasserstein distances: comparing surfaces in polynomial time. Advances in Mathematics 227 (3), pp. 1047–1077.
  20. J. Lott and C. Villani (2009) Ricci curvature for metric-measure spaces via optimal transport. Ann. of Math. (2) 169 (3), pp. 903–991.
  21. F. Mémoli and G. Sapiro (2005) A theoretical and computational framework for isometry invariant recognition of point cloud data. Found. Comput. Math. 5 (3), pp. 313–347.
  22. F. Mémoli (2011) Gromov-Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, pp. 1–71.
  23. M. Nitzan, N. Karaiskos, N. Friedman and N. Rajewsky (2018) Charting a tissue from single-cell transcriptomes. bioRxiv.
  24. P. M. Pardalos and H. Wolkowicz (1994) Quadratic Assignment and Related Problems: DIMACS Workshop, May 20–21, 1993. Vol. 16, American Mathematical Society.
  25. J. Rabin, G. Peyré, J. Delon and M. Bernot (2012) Wasserstein barycenter and its application to texture mixing. In Proceedings of the Third International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446.
  26. Y. Rubner, C. Tomasi and L. Guibas (2000) The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision 40 (2), pp. 99–121.
  27. G. Schiebinger, J. Shu, M. Tabaka, B. Cleary, V. Subramanian, A. Solomon, J. Gould, S. Liu, S. Lin, P. Berube, L. Lee, J. Chen, J. Brumbaugh, P. Rigollet, K. Hochedlinger, R. Jaenisch, A. Regev and E. Lander (2019) Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176 (4), pp. 928–943.e22.
  28. S. Shirdhonkar and D. Jacobs (2008) Approximate earth mover’s distance in linear time. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8.
  29. R. Sinkhorn (1974) Diagonal equivalence to matrices with prescribed row and column sums. II. Proceedings of the American Mathematical Society 45 (2), pp. 195–198.
  30. K. Sturm (2006) On the geometry of metric measure spaces. I. Acta Math. 196 (1), pp. 65–131.
  31. C. Villani (2008) Optimal transport: old and new. Springer Berlin Heidelberg.
  32. S. Villar, A. Bandeira, A. Blumberg and R. Ward (2016) A polynomial-time relaxation of the Gromov-Hausdorff distance. arXiv:1610.05214.
  33. A. Wolf, P. Angerer and F. Theis (2018) SCANPY: large-scale single-cell gene expression data analysis. Genome Biology 19.
  34. G. Zheng, J. Terry, P. Belgrader, P. Ryvkin, Z. Bent, R. Wilson, S. Ziraldo, T. Wheeler, G. McDermott, J. Zhu, M. Gregory, J. Shuga, L. Montesclaros, J. Underwood, D. Masquelier, S. Nishimura, M. Schnall-Levin, P. Wyatt, C. Hindson, R. Bharadwaj, A. Wong, K. Ness, L. Beppu, J. Deeg, C. McFarland, K. Loeb, W. Valente, N. Ericson, E. Stevens, J. Radich, T. Mikkelsen, B. Hindson and J. Bielas (2017) Massively parallel digital transcriptional profiling of single cells. Nature Communications 8.