Fast Tree Variants of Gromov-Wasserstein

Abstract

Gromov-Wasserstein (GW) is a powerful tool to compare probability measures whose supports are in different metric spaces. GW, however, suffers from a computational drawback, since it requires solving a complex non-convex quadratic program. In this work, we consider a specific family of ground metrics, namely tree metrics, for the space of supports of each probability measure in GW. By leveraging a tree structure, we propose to use flows from a root to each support to represent a probability measure whose supports are in a tree metric space. We consequently propose a novel tree variant of GW, namely flow-based tree GW (FlowTGW), obtained by matching the flows of the probability measures. We then show that FlowTGW shares a similar structure with a univariate optimal transport distance. Therefore, FlowTGW is fast to compute and scales to large applications. To further exploit tree structures, we propose another tree variant of GW, namely depth-based tree GW (DepthTGW), obtained by aligning the flows of the probability measures hierarchically along each depth level of the tree structures. Theoretically, we prove that both FlowTGW and DepthTGW are pseudo-distances. Moreover, we derive tree-sliced variants, computed by averaging the corresponding tree variants of GW over random tree metrics, built adaptively in the spaces of supports. Finally, we test our proposed discrepancies against other baselines on benchmark tasks.


1 Introduction

Optimal transport (OT) theory provides a powerful set of tools to compare probability measures. OT has recently gained traction in the machine learning community Cuturi (2013); Perrot et al. (2016); Genevay et al. (2016); Muzellec & Cuturi (2018); Mena & Niles-Weed (2019); Luise et al. (2019); Alaya et al. (2019); Paty & Cuturi (2019); Togninalli et al. (2019), and has played an increasingly important role in several research areas such as computer graphics Solomon et al. (2015); Bonneel et al. (2016); Lavenant et al. (2018); Solomon & Vaxman (2019), domain adaptation Courty et al. (2016, 2017); Bhushan Damodaran et al. (2018); Redko et al. (2019), and deep generative models Arjovsky et al. (2017); Gulrajani et al. (2017); Genevay et al. (2018); Kolouri et al. (2019); Wu et al. (2019); Nadjahi et al. (2019), to name a few.

When probability measures are discrete and their supports belong to the same space, the OT distance can be recast as a linear program, which can be solved by standard interior-point algorithms. However, these algorithms are not efficient when the number of supports is large. To account for the scalability of the OT distance, Cuturi (2013) proposed to regularize OT by the entropy of transport plans, resulting in entropic regularized OT. Several efficient algorithms have recently been proposed to solve the entropic problem Altschuler et al. (2017); Dvurechensky et al. (2018); Lin et al. (2019); Altschuler et al. (2019).

When probability measures are discrete and their supports lie in different spaces, the classical OT distance is no longer valid for measuring their discrepancy. In seminal work, Mémoli (2011) introduced the Gromov-Wasserstein (GW) distance to compare probability measures whose supports are in different metric spaces. GW is defined via the discrepancy between the distance matrices of supports (i.e., the pair-wise distances of supports) corresponding to the probability measures. GW has been used in several applications, including quantum chemistry Peyré et al. (2016), computer graphics Solomon et al. (2016), cross-lingual embeddings Alvarez-Melis & Jaakkola (2018); Grave et al. (2019), graph partitioning and matching Xu et al. (2019a, b), and deep generative models Bunne et al. (2019). However, since GW is a complex non-convex quadratic program, NP-hard for arbitrary inputs Peyré & Cuturi (2019) (§10.6.3), its computation is very costly, which hinders applications, especially in large-scale settings where the number of supports is large.

Building on the entropic regularization idea from OT, Peyré et al. (2016) proposed an entropic GW discrepancy. Entropic GW can be efficiently solved by the Sinkhorn algorithm for certain values of the regularization parameter and a specific family of loss functions. Nevertheless, entropic GW requires the regularization to be sufficiently large for fast computation, which leads to a poor approximation of GW. Following the direction of leveraging entropic regularization, Xu et al. (2019a, b) proposed algorithmic approaches to further speed up GW for graph data. Another approach to scaling up the computation of GW is sliced GW Vayer et al. (2019), which relies on a one-dimensional projection of the supports of the probability measures. Consequently, similar to sliced-Wasserstein, sliced GW, albeit fast, has a limited capacity to capture high-dimensional structure in a distribution of supports Liutkus et al. (2019); Le et al. (2019). Additionally, sliced GW can only be applied to discrete measures with the same number of supports and uniform weights, or requires an artificial zero padding for measures with different numbers of supports Vayer et al. (2019).

In this work, we consider a particular family of ground metrics, namely tree metrics, for the space of supports of each probability measure in GW. Although it is well known that one can leverage tree metrics to speed up the computation of arbitrary metrics Bartal (1996, 1998); Charikar et al. (1998); Indyk (2001); Fakcharoenphol et al. (2004), our goal is rather to sample tree metrics for the spaces of supports and use them as ground metrics, similar to tree-(sliced)-Wasserstein (TSW) Le et al. (2019). However, unlike TSW, one cannot apply this idea straightforwardly by only using tree metrics as ground metrics to scale up GW. Therefore, by exploiting a tree structure, we propose to leverage flows from a root to each support to represent a probability measure whose supports are in a tree metric space, instead of the pair-wise distances of supports as in traditional GW. Consequently, we propose a novel tree variant of GW, namely flow-based tree GW (FlowTGW), by matching the flows of probability measures whose supports are in different tree metric spaces. FlowTGW is fast to compute and can scale up to large applications since it shares a similar structure with univariate OT. To further exploit tree structures, we propose to align the flows hierarchically along each depth level of the tree structures, yielding depth-based tree GW (DepthTGW). Theoretically, we prove that both FlowTGW and DepthTGW are pseudo-distances. Furthermore, we derive tree-sliced variants, computed by averaging the corresponding tree variants of GW over random tree metrics, sampled by a fast adaptive method, e.g., clustering-based tree metric sampling Le et al. (2019) (§4).

The paper is organized as follows: we briefly review tree metrics and define tree GW for probability measures whose supports are in different tree metric spaces in §2. We propose two novel tree variants of GW: FlowTGW and DepthTGW in §3 and §4 respectively. In §5, we derive tree-sliced variants of GW for practical applications, and then evaluate our proposed discrepancies against other baselines on some benchmark tasks in §6 before concluding in §7.

Notations. We denote $[n] := \{1, 2, \ldots, n\}$ for $n \in \mathbb{N}^*$. Given $x \in \mathbb{R}^d$, let $\|x\|$ be the norm of $x$, and $\delta_x$ be the Dirac function at $x$. For a discrete probability measure $\mu$, denote $|\mu|$ for the number of supports of $\mu$.

2 Tree Gromov-Wasserstein

In this section, we give a brief review of tree metric spaces, and define tree GW between probability measures whose supports are in different tree metric spaces.

2.1 Tree metric space

Figure 1: An illustration of a tree metric space: the depth levels of the nodes, the path from the root to a node (orange dots), the subtree rooted at a node (green dots), and, for an edge $e$, its shallower node $u_e$, deeper node $v_e$, and non-negative length $w_e$.

Given a tree $\mathcal{T}$ rooted at $r$, let $d_{\mathcal{T}}$ be a tree metric on $\mathcal{T}$. The tree metric between two nodes in tree $\mathcal{T}$ is equal to the length of the (unique) path between them Semple & Steel (2003) (§7, p.145–182). Given a node $v$, let $\Gamma(v)$ be the set of nodes in the subtree of $\mathcal{T}$ rooted at $v$, i.e., $\Gamma(v) = \{u \in \mathcal{T} \mid v \in \mathcal{P}(r, u)\}$ where $\mathcal{P}(r, u)$ is the (unique) path between the root $r$ and node $u$ in $\mathcal{T}$; let $\mathcal{C}(v)$ be the set of child nodes of $v$, and $|\mathcal{C}(v)|$ the cardinality of that set. Given an edge $e$, we write $u_e$ and $v_e$ for the nodes that are respectively at a shallower (closer to $r$) and deeper (further away from $r$) level of edge $e$, and $w_e$ for the non-negative length of that edge. We illustrate these notions in Figure 1.

Throughout the paper, we are given two probability measures $\mu = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} b_j \delta_{z_j}$ whose supports $\{x_i\}_{i=1}^{n}$ and $\{z_j\}_{j=1}^{m}$ are in different tree metric spaces $(\mathcal{T}_1, d_{\mathcal{T}_1})$ and $(\mathcal{T}_2, d_{\mathcal{T}_2})$ respectively; $a_i, b_j \ge 0$ and $\sum_i a_i = \sum_j b_j = 1$. Our goal is to define discrepancies between these probability measures.
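To make the flow representation concrete, the following minimal sketch (ours, not from the paper; the names `TreeNode` and `flow_lengths` are hypothetical) builds a rooted tree with edge lengths and computes the flow length $d_{\mathcal{T}}(r, v)$ from the root to every node by a depth-first traversal.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """A node of a rooted tree; `length` is the edge length to its parent (0 for the root)."""
    node_id: int
    length: float = 0.0
    children: list = field(default_factory=list)

def flow_lengths(root):
    """Return {node_id: d_T(root, node)} by accumulating edge lengths along each path."""
    dist, stack = {}, [(root, 0.0)]
    while stack:
        node, d = stack.pop()
        dist[node.node_id] = d
        for child in node.children:
            stack.append((child, d + child.length))
    return dist

# Toy tree: root (id 0) with children 1 and 2; node 3 hangs under node 2.
root = TreeNode(0)
root.children = [TreeNode(1, 0.5), TreeNode(2, 1.0, [TreeNode(3, 0.25)])]
print(flow_lengths(root))  # node 3 sits at flow length 1.0 + 0.25 = 1.25
```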

Figure 2: An illustration of the efficient computation of FlowTGW. Given a measure $\mu$, when the new root $r'_1$ lies in a subtree rooted at a child of $r_1$ that contains no supports of $\mu$, the order of the flow lengths is unchanged and $d_{\mathcal{T}_1}(r'_1, x_i) = d_{\mathcal{T}_1}(r'_1, r_1) + d_{\mathcal{T}_1}(r_1, x_i)$ (Case 1, illustrated in the bottom-left tree). Additionally, when the new root $r'_1$ lies in a subtree rooted at a child of $r_1$ that contains some supports of $\mu$, the order is preserved for all supports outside that subtree (Case 2, illustrated in the bottom-right tree).

2.2 Tree Gromov-Wasserstein

Let tree GW be the GW distance between probability measures whose supports are in different tree metric spaces. Tree GW between $\mu$ and $\nu$ then takes the following form:

$$\mathrm{TGW}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \sum_{i,k \in [n]} \sum_{j,l \in [m]} \big| d_{\mathcal{T}_1}(x_i, x_k) - d_{\mathcal{T}_2}(z_j, z_l) \big|^2 \, \pi_{ij} \, \pi_{kl}, \qquad (1)$$

where $\Pi(\mu, \nu) := \{\pi \in \mathbb{R}_{+}^{n \times m} \mid \pi \mathbf{1}_m = a, \ \pi^{\top} \mathbf{1}_n = b\}$ is the set of transport plans between $\mu$ and $\nu$; $\big(d_{\mathcal{T}_1}(x_i, x_k)\big)_{i,k}$ and $\big(d_{\mathcal{T}_2}(z_j, z_l)\big)_{j,l}$ are the pair-wise distances of supports, i.e., the distance matrices of supports, for $\mu$ and $\nu$ respectively.

However, one cannot scale up GW by straightforwardly using tree metrics as ground metrics for the supports of probability measures in Equation (1), as is done in TSW (Le et al., 2019). Therefore, we propose to leverage the tree structure to form a novel representation of a probability measure, based on flows from a root to each support, to scale up GW. Consequently, we propose two novel variants of tree GW, FlowTGW and DepthTGW, detailed in §3 and §4 respectively. In particular, one can further scale up their computation by directly sampling aligned-root tree metrics in applications without prior knowledge about tree metrics for the probability measures.

3 Flow-based tree GW discrepancy

In this section, we study a variant of tree GW, named flow-based tree GW (FlowTGW) discrepancy.

3.1 Definition of flow-based tree GW

Different from tree GW, FlowTGW takes the tree structures into account to represent probability measures whose supports are in tree metric spaces.

Definition 1.

The flow-based tree Gromov-Wasserstein discrepancy between two probability measures $\mu$ and $\nu$ is defined as follows:

$$\mathrm{FlowTGW}(\mu, \nu) = \min_{r_1 \in \mathcal{T}_1, \, r_2 \in \mathcal{T}_2} \ \min_{\pi \in \Pi(\mu, \nu)} \sum_{i \in [n]} \sum_{j \in [m]} \big| d_{\mathcal{T}_1}(r_1, x_i) - d_{\mathcal{T}_2}(r_2, z_j) \big| \, \pi_{ij}. \qquad (2)$$

As indicated in Definition 1, each probability measure in FlowTGW is represented by the tree metrics, i.e., the unique path (or flow) lengths, from a root to each of its supports, while the weight of each support can be regarded as the mass of the corresponding flow. Additionally, the minimization over roots $r_1$ and $r_2$ in Equation (2) ensures an optimal alignment of the pair of roots in trees $\mathcal{T}_1$ and $\mathcal{T}_2$ respectively. Similar to tree GW, we also determine an optimal transport plan between $\mu$ and $\nu$ when computing $\mathrm{FlowTGW}(\mu, \nu)$.
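For each fixed pair of roots, Definition 1 reduces to a univariate OT problem between the two flow-length measures. The sketch below (our illustration, not the authors' code) evaluates Equation (2) by exhaustively searching over root pairs and solving each inner problem in closed form with SciPy's univariate Wasserstein distance; `dist1`/`dist2` are assumed callables returning the tree metrics.

```python
import itertools
from scipy.stats import wasserstein_distance  # closed-form univariate W_1

def flowtgw_naive(roots1, roots2, supp1, a, supp2, b, dist1, dist2):
    """Naive FlowTGW (Eq. (2)): minimize over all candidate root pairs.

    roots1/roots2: candidate root nodes of T1/T2; supp1/supp2: support nodes;
    a/b: support weights; dist1/dist2: tree metrics, dist(r, x) -> float.
    """
    best = float("inf")
    for r1, r2 in itertools.product(roots1, roots2):
        u = [dist1(r1, x) for x in supp1]  # flow lengths in T1
        v = [dist2(r2, z) for z in supp2]  # flow lengths in T2
        best = min(best, wasserstein_distance(u, v, a, b))
    return best
```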

Theorem 1.

FlowTGW is a pseudo-distance.

See the supplementary for the proof of Theorem 1.

3.2 Efficient computation for flow-based tree GW

A naive implementation of $\mathrm{FlowTGW}(\mu, \nu)$ must, for each candidate pair of roots, sort the root-to-support distances from scratch if one exhaustively searches the optimal pair of roots for $\mathcal{T}_1$ and $\mathcal{T}_2$. In this section, we present an efficient computation approach which removes this per-pair sorting, making the cost per root pair nearly linear in the number of supports.

Consider $\mathrm{FlowTGW}(\mu, \nu)$ between two probability measures in two different tree metric spaces rooted at $r_1$ and $r_2$ respectively. When one changes $r_1$ into a new root $r'_1$ for tree $\mathcal{T}_1$, as illustrated in Figure 2, two cases can happen:
Case 1: The new root $r'_1$ is one of the nodes in a subtree rooted at a child of $r_1$ in $\mathcal{T}_1$ which does not contain any supports of $\mu$, illustrated in the bottom-left of Figure 2. Then, for all supports $x_i$, we have

$$d_{\mathcal{T}_1}(r'_1, x_i) = d_{\mathcal{T}_1}(r'_1, r_1) + d_{\mathcal{T}_1}(r_1, x_i). \qquad (3)$$

Consequently, the order of the lengths of the paths from the root to each support does not change.
Case 2: The new root $r'_1$ is one of the nodes in a subtree rooted at a child of $r_1$ in $\mathcal{T}_1$ containing some of the supports of $\mu$, denoted $S$, illustrated in the bottom-right of Figure 2. Then, for all supports not in $S$, we have the same formulation as Equation (3), i.e., $d_{\mathcal{T}_1}(r'_1, x_i) = d_{\mathcal{T}_1}(r'_1, r_1) + d_{\mathcal{T}_1}(r_1, x_i)$. Consequently, the order of the lengths of the paths from the root to each support (except those in $S$) is preserved. For supports in $S$ (illustrated in the supplementary), there are three sub-cases:
Case 2a: For supports $x_i \in \Gamma(r'_1)$, we have $d_{\mathcal{T}_1}(r'_1, x_i) = d_{\mathcal{T}_1}(r_1, x_i) - d_{\mathcal{T}_1}(r_1, r'_1)$. Therefore, the path-length order of those supports is preserved.
Case 2b: For supports $x_i$ with $r'_1 \in \Gamma(x_i)$, we have $d_{\mathcal{T}_1}(r'_1, x_i) = d_{\mathcal{T}_1}(r_1, r'_1) - d_{\mathcal{T}_1}(r_1, x_i)$. Therefore, the path-length order of those supports is reversed.
Case 2c: For supports $x_i \in S$ with $x_i \notin \Gamma(r'_1)$ and $r'_1 \notin \Gamma(x_i)$, one needs to find the corresponding closest common ancestor $y_i$ of $x_i$ and $r'_1$, i.e., $y_i$ is on both paths $\mathcal{P}(r_1, x_i)$ and $\mathcal{P}(r_1, r'_1)$, so we have $d_{\mathcal{T}_1}(r'_1, x_i) = d_{\mathcal{T}_1}(r_1, r'_1) + d_{\mathcal{T}_1}(r_1, x_i) - 2\, d_{\mathcal{T}_1}(r_1, y_i)$. Note that the path-length order of supports having the same $y_i$ is preserved.

Therefore, one only needs to merge these ordered arrays, which takes nearly linear time (except in the degenerate case where each array has only one node).

From the above observation, one does not need to re-sort the tree metrics between the new root and each support in $\mathcal{T}_1$; instead, one can leverage the sorted order of the tree metrics between the previous root and each support. Moreover, these computational steps can be carried out separately for each tree. Therefore, the per-root-pair sorting in the complexity of $\mathrm{FlowTGW}(\mu, \nu)$ is replaced by nearly linear merging. More details can be found in the supplementary.
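The point of the case analysis above is that, after a root change, the new flow lengths decompose into a few groups whose relative order is already known (preserved, reversed, or preserved within a common ancestor), so a near-linear $k$-way merge replaces a full sort. A toy illustration with Python's standard library (our sketch, with made-up numbers):

```python
import heapq

# Groups of new flow lengths after a root change, each already ordered:
preserved = [0.30, 0.70, 1.40]        # Cases 1 and 2a: ascending order kept
reversed_group = [1.10, 0.80, 0.20]   # Case 2b: order reversed, so read backwards
same_ancestor = [0.55, 0.95]          # Case 2c, one common ancestor: order kept

# heapq.merge combines already-sorted iterables in near-linear time.
merged = list(heapq.merge(preserved, reversed_group[::-1], same_ancestor))
print(merged)  # [0.2, 0.3, 0.55, 0.7, 0.8, 0.95, 1.1, 1.4]
```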

Figure 3: An illustration of the aligned-root DepthTGW between a measure on $\mathcal{T}_1$ and a measure on $\mathcal{T}_2$. We consider the optimal alignment at each depth level. At depth level 0, root $r_1$ is trivially aligned with root $r_2$. Since both roots have child nodes, the optimal alignment between them recurses into depth level 1, where, for each aligned node, the subtrees rooted at its child nodes are considered as "leaves" of the corresponding 2-depth-level tree. The recursive procedure is repeated until the deepest level of the shallower tree, where only simple cases remain (i.e., at least one node of an aligned pair has no child nodes, or the total weight of the child nodes in the corresponding 2-depth-level tree is 0).

3.3 Aligned-root flow-based tree GW

In this section, we consider a special case of FlowTGW where the roots have already been aligned. Therefore, we can leave out the minimization over roots in Definition 1; we name this discrepancy aligned-root FlowTGW.

Definition 2.

Assume that root $r_1$ in $\mathcal{T}_1$ is aligned with root $r_2$ in $\mathcal{T}_2$. Then, the aligned-root flow-based tree GW discrepancy between $\mu$ and $\nu$ is defined as follows:

$$\mathrm{AFlowTGW}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \sum_{i \in [n]} \sum_{j \in [m]} \big| d_{\mathcal{T}_1}(r_1, x_i) - d_{\mathcal{T}_2}(r_2, z_j) \big| \, \pi_{ij}. \qquad (4)$$

The $\mathrm{AFlowTGW}$ in Equation (4) is equivalent to the univariate Wasserstein distance between the flow-length measures $\tilde{\mu} := \sum_{i} a_i \delta_{d_{\mathcal{T}_1}(r_1, x_i)}$ and $\tilde{\nu} := \sum_{j} b_j \delta_{d_{\mathcal{T}_2}(r_2, z_j)}$, i.e., $\mathrm{AFlowTGW}(\mu, \nu) = \mathcal{W}_1(\tilde{\mu}, \tilde{\nu})$, where $\mathcal{W}_1$ denotes the $1$-Wasserstein distance Villani (2003). Moreover, the univariate Wasserstein distance is equal to the integral of the absolute difference between the generalized quantile functions of the two univariate probability distributions Santambrogio (2015) (§2). Therefore, one only needs to sort the flow lengths $\{d_{\mathcal{T}_1}(r_1, x_i)\}_i$ and $\{d_{\mathcal{T}_2}(r_2, z_j)\}_j$ for the computation of $\mathrm{AFlowTGW}$, i.e., linearithmic complexity. Since it shares the same structure as a univariate Wasserstein distance, $\mathrm{AFlowTGW}$ inherits the properties of the univariate Wasserstein distance. More precisely, $\mathrm{AFlowTGW}(\mu, \nu) = 0$ is equivalent to $\tilde{\mu} = \tilde{\nu}$; $\mathrm{AFlowTGW}$ is symmetric and satisfies the triangle inequality. See the supplementary for an illustration of aligned-root FlowTGW.
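To make the linearithmic claim concrete, here is a self-contained sketch (ours; the paper only states the quantile-function characterization) of Equation (4) as the integral of the absolute difference between the two cumulative distribution functions of the flow-length measures, computed after a single sort.

```python
import numpy as np

def aligned_flowtgw(u_vals, u_wts, v_vals, v_wts):
    """Aligned-root FlowTGW (Eq. (4)) as a weighted univariate W_1:
    the integral of |F_u - F_v| between the CDFs of the flow-length measures.
    Weights are assumed to sum to 1 on each side."""
    u_order, v_order = np.argsort(u_vals), np.argsort(v_vals)
    u_vals, u_wts = np.asarray(u_vals)[u_order], np.asarray(u_wts)[u_order]
    v_vals, v_wts = np.asarray(v_vals)[v_order], np.asarray(v_wts)[v_order]

    grid = np.sort(np.concatenate([u_vals, v_vals]))  # merged support points
    deltas = np.diff(grid)

    # Right-continuous CDFs of both measures evaluated on the merged grid.
    u_cdf = np.concatenate([[0.0], np.cumsum(u_wts)])[
        np.searchsorted(u_vals, grid[:-1], side="right")]
    v_cdf = np.concatenate([[0.0], np.cumsum(v_wts)])[
        np.searchsorted(v_vals, grid[:-1], side="right")]
    return float(np.sum(np.abs(u_cdf - v_cdf) * deltas))

# Example: two flow-length measures with general (non-uniform) weights.
print(aligned_flowtgw([0.5, 1.25], [0.5, 0.5], [0.4, 1.0, 2.0], [0.2, 0.3, 0.5]))
```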

Note that in practical applications, where we usually do not have prior knowledge about tree metrics for the spaces of supports of probability measures, we need to sample a tree metric for each support data space. Moreover, we can directly sample aligned-root tree metrics, e.g., by choosing the means of the support data distributions as roots when using the clustering-based tree metric sampling of Le et al. (2019) (§4). Consequently, we can use the aligned-root FlowTGW formulation to reduce the complexity of FlowTGW.

Aligned-root FlowTGW barycenter. The aligned-root FlowTGW can be handily used for a barycenter problem, especially in large-scale applications. Given $N$ probability measures $\mu_1, \ldots, \mu_N$ whose supports are in different tree metric spaces with aligned roots, and corresponding weights $\omega_1, \ldots, \omega_N$ with $\sum_{t} \omega_t = 1$, the aligned-root FlowTGW barycenter problem aims to find a flow-based tree structure representation of an optimal probability measure $\bar{\mu}$, whose number of supports is less than or equal to $k$, in a tree metric space, that takes the form:

$$\min_{\bar{\mu} : \, |\bar{\mu}| \le k} \ \sum_{t=1}^{N} \omega_t \, \mathrm{AFlowTGW}(\bar{\mu}, \mu_t), \qquad (5)$$

where the roots of the trees of $\mu_1, \ldots, \mu_N$ are aligned with the root of the tree of $\bar{\mu}$. The barycenter problem in Equation (5) is equivalent to the free-support univariate Wasserstein barycenter, which can be solved efficiently, e.g., by the free-support algorithm in Cuturi & Doucet (2014).
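The paper solves Equation (5) with the free-support algorithm of Cuturi & Doucet (2014); as an alternative, the quantile form of the univariate $\mathcal{W}_1$ admits a direct sketch: the pointwise weighted median of the quantile functions minimizes the weighted sum of $\mathcal{W}_1$ distances, and discretizing at $k$ levels yields a $k$-support barycenter. The code below is our own sketch under that reasoning, not the authors' algorithm.

```python
import numpy as np

def quantile(vals, wts, t):
    """Generalized quantile function of a discrete 1D measure at level t in (0, 1)."""
    order = np.argsort(vals)
    vals, wts = np.asarray(vals)[order], np.asarray(wts)[order]
    cdf = np.cumsum(wts)
    return vals[np.searchsorted(cdf, t * cdf[-1])]

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    order = np.argsort(values)
    values, weights = np.asarray(values)[order], np.asarray(weights)[order]
    cum = np.cumsum(weights)
    return values[np.searchsorted(cum, 0.5 * cum[-1])]

def flowtgw_barycenter(measures, omegas, k):
    """k-support barycenter of flow-length measures under W_1 (Eq. (5) sketch):
    at each of k quantile levels, take the weighted median across measures.
    measures: list of (flow_lengths, weights); omegas: barycenter weights."""
    levels = (np.arange(k) + 0.5) / k
    supports = [weighted_median([quantile(v, w, t) for v, w in measures], omegas)
                for t in levels]
    return np.asarray(supports), np.full(k, 1.0 / k)
```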

4 Depth-based tree GW discrepancy

FlowTGW only focuses on the flows from a root to each support, but ignores information about the depth level of supports in the tree structures. In this section, we take the depth level of supports into account and propose the depth-based tree GW (DepthTGW) discrepancy. In particular, DepthTGW considers the alignment problem for flows hierarchically at each depth level along the tree structures.

We first introduce some necessary definitions for DepthTGW. Recall that, given a node $v$ in tree $\mathcal{T}$, $\mathcal{C}(v)$ is the set of child nodes of $v$.

Definition 3.

Given a node $v$ in tree $\mathcal{T}$, a 2-depth-level tree rooted at $v$, denoted $\mathcal{T}^2_v$, is defined in $\mathcal{T}$ as follows: root $v$ is at depth level $0$, and the subtrees rooted at the nodes of $\mathcal{C}(v)$, considered as "leaves", are at depth level $1$ in $\mathcal{T}^2_v$.

Following Definition 3 for the 2-depth-level tree $\mathcal{T}^2_v$, let $V(\mathcal{T}^2_v)$ be the set of vertices of $\mathcal{T}^2_v$; then $V(\mathcal{T}^2_v)$ contains $v$ and all $u \in \mathcal{C}(v)$. Moreover, given $\mu$ in $\mathcal{T}$, we have a corresponding measure $\mu^2_v$ in $\mathcal{T}^2_v$, defined as follows: $\mu^2_v := \sum_{u \in V(\mathcal{T}^2_v)} m_u \delta_u$, where $m_u = \sum_{x_i \in \Gamma(u)} a_i$ for $u \in \mathcal{C}(v)$, and $m_v = a_i$ if $v$ coincides with a support $x_i$, otherwise $m_v = 0$.

In order to define DepthTGW, we start with its special case when roots are aligned.

Definition 4.

Assume that root $r_1$ in $\mathcal{T}_1$ is aligned with root $r_2$ in $\mathcal{T}_2$. Then, the aligned-root depth-based tree GW discrepancy between two probability measures $\mu$ and $\nu$ is defined as follows:

$$\mathrm{ADepthTGW}(\mu, \nu) = \sum_{h \ge 0} \ \sum_{(u, v) \in \mathcal{A}_h} m_{uv} \, \mathrm{AFlowTGW}\big(\mu^2_u, \nu^2_v\big), \qquad (6)$$

where $h$ is the considered depth level, starting from $0$ to the deepest level of the shallower tree between $\mathcal{T}_1$ and $\mathcal{T}_2$; $\mathcal{A}_h$ is the set of optimal aligned pairs at depth level $h$, where $(u, v) \in \mathcal{A}_h$ means $u \in \mathcal{T}_1$ and $v \in \mathcal{T}_2$ are aligned; and $m_{uv}$ is the optimal matching mass for the pair $(u, v)$ at depth level $h$, obtained from the optimal transport plans at depth level $h-1$.

Intuitively, at each depth level, we consider the alignment of the corresponding 2-depth-level trees. Note that these 2-depth-level trees are at the same depth level in both $\mathcal{T}_1$ and $\mathcal{T}_2$, so one can consider their alignment. Moreover, at depth level $0$, $r_1$ trivially matches $r_2$ with optimal matching mass $1$. Thus, the matching procedure is recursive along all depth levels of the trees. The simple cases of the recursive procedure are when at least one node of the considered pair has no child nodes, or when the sum of the weights of the child nodes in the corresponding 2-depth-level tree equals $0$.
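Definition 4 can be read as a recursion: at each matched pair of nodes, compute the aligned-root FlowTGW between the two 2-depth-level trees, then recurse on the child pairs matched by the optimal (monotone) univariate plan, weighted by the transported mass. The sketch below is our schematic reading of that recursion, reusing the hypothetical `aligned_flowtgw` from the earlier sketch; for brevity it ignores any mass the measures place at internal nodes, and the attribute names (`children`, `length`, `subtree_mass`) are our own.

```python
import numpy as np

def monotone_plan(u, a, v, b):
    """Optimal plan of a univariate OT problem: north-west corner rule on the
    sorted supports. Returns (i, j, mass) triples with original indices."""
    iu, iv = np.argsort(u), np.argsort(v)
    a = list(np.asarray(a, dtype=float)[iu])
    b = list(np.asarray(b, dtype=float)[iv])
    plan, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        m = min(a[i], b[j])
        if m > 0:
            plan.append((int(iu[i]), int(iv[j]), m))
        a[i] -= m
        b[j] -= m
        if a[i] <= 1e-12:
            i += 1
        if j < len(b) and b[j] <= 1e-12:
            j += 1
    return plan

def adepth_tgw(u, v, mass=1.0):
    """Aligned-root DepthTGW recursion (sketch of Definition 4) between two
    matched nodes u and v of trees T1 and T2."""
    cu = [c for c in u.children if c.subtree_mass > 0]
    cv = [c for c in v.children if c.subtree_mass > 0]
    if not cu or not cv:                       # simple case: recursion stops
        return 0.0
    du = np.array([c.length for c in cu])      # leaf flow lengths of the
    dv = np.array([c.length for c in cv])      # 2-depth-level trees
    au = np.array([c.subtree_mass for c in cu]); au = au / au.sum()
    bv = np.array([c.subtree_mass for c in cv]); bv = bv / bv.sum()
    cost = mass * aligned_flowtgw(du, au, dv, bv)   # this level's term
    for i, j, m in monotone_plan(du, au, dv, bv):   # recurse on matched pairs
        cost += adepth_tgw(cu[i], cv[j], mass * m)
    return cost
```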

Figure 4: Results of MAE and time consumption for $k$-NN regression on the quantum chemistry dataset, and results of averaged accuracy and time consumption for $k$-NN on the TWITTER and RECIPE datasets. EGW/EGW* (eps=X) denotes the entropic variants with entropic regularization X; larger regularization was used for RECIPE (too slow otherwise). For DTGW, fewer slices were used on the larger datasets due to its slowness. For clustering-based tree metric sampling, we used its suggested parameters.

When we would like an optimal alignment between the roots of trees $\mathcal{T}_1$ and $\mathcal{T}_2$, we have:

$$\mathrm{DepthTGW}(\mu, \nu) = \min_{r_1 \in \mathcal{T}_1, \, r_2 \in \mathcal{T}_2} \mathrm{ADepthTGW}(\mu, \nu). \qquad (7)$$

The above discrepancy is referred to as DepthTGW. Similar to FlowTGW, we have the following theorem:

Theorem 2.

DepthTGW is a pseudo-distance.

See the supplementary material for the proof of Theorem 2.

5 Tree-sliced variants of GW by sampling tree metrics

Similar to the tree-sliced-Wasserstein distance Le et al. (2019) (or sliced-Wasserstein Rabin et al. (2011), and sliced GW Vayer et al. (2019)), computing (aligned-root) FlowTGW/DepthTGW requires choosing or sampling tree metrics for each space of supports. We use fast adaptive methods, e.g., clustering-based tree metric sampling Le et al. (2019), to sample tree metrics for a space of supports, and then average the corresponding (aligned-root) FlowTGW/DepthTGW over those random tree metrics.

Definition 5.

Given two probability measures supported on sets on which tree metric spaces $\{\mathcal{T}_1^{(s)}\}_{s=1}^{S}$ and $\{\mathcal{T}_2^{(s)}\}_{s=1}^{S}$ can be defined respectively, the (aligned-root) flow/depth-based tree-sliced GW is defined as the average of the corresponding (aligned-root) flow/depth-based tree GW over the tree metric spaces $\mathcal{T}_1^{(s)}$ and $\mathcal{T}_2^{(s)}$, for $s \in [S]$.

As discussed in Le et al. (2019), averaging over several random tree metrics can help reduce quantization effects, or clustering sensitivity problems, in which data points may be partitioned or clustered into adjacent but different hypercubes or clusters in tree metric sampling. Moreover, note that the complexity of tree metric sampling is negligible compared to that of GW computation. Indeed, the complexity of the clustering-based tree metric sampling is linearithmic in the number of input data points when one fixes the number of clusters for the farthest-point clustering (Gonzalez, 1985) and the predefined deepest level $H_{\mathcal{T}}$ of the tree.
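For completeness, here is a condensed sketch, in the spirit of the clustering-based tree metric sampling of Le et al. (2019), of how a random tree metric might be built: recursively partition the data with farthest-point clustering (Gonzalez, 1985) up to a predefined depth, creating one tree node per cluster. The structure and the edge-length choice below are our guesses for illustration, not the reference implementation.

```python
import numpy as np

def farthest_point_clustering(X, kappa, rng):
    """Greedy farthest-point clustering (Gonzalez, 1985): repeatedly pick the
    point farthest from the chosen centers, then assign points to the nearest
    center. Returns a list of index arrays, one per cluster."""
    centers = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[centers[0]], axis=1)
    while len(centers) < min(kappa, len(X)):
        centers.append(int(np.argmax(d)))
        d = np.minimum(d, np.linalg.norm(X - X[centers[-1]], axis=1))
    assign = np.argmin(
        np.stack([np.linalg.norm(X - X[c], axis=1) for c in centers]), axis=0)
    return [np.flatnonzero(assign == t) for t in range(len(centers))]

def sample_tree(X, idx, kappa, depth, parent_center, rng):
    """Recursively partition the points X[idx] into kappa clusters, one tree
    node per cluster; edge length = distance between parent and child centers."""
    node = {"points": idx, "children": [], "length": 0.0}
    if depth == 0 or len(idx) <= 1:
        return node
    for cluster in farthest_point_clustering(X[idx], kappa, rng):
        child_idx = idx[cluster]
        center = X[child_idx].mean(axis=0)
        child = sample_tree(X, child_idx, kappa, depth - 1, center, rng)
        child["length"] = float(np.linalg.norm(center - parent_center))
        node["children"].append(child)
    return node

# Root at the data mean, so sampled trees have aligned roots across measures.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
tree = sample_tree(X, np.arange(len(X)), kappa=4, depth=3,
                   parent_center=X.mean(axis=0), rng=rng)
```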

Remark 1.

For specific applications with prior knowledge about tree metrics for the probability measures, one can apply FlowTGW, or consider DepthTGW when the known tree structure of each probability measure is important for the application. Moreover, if the roots of those known tree metrics are already aligned, one can use the corresponding aligned-root formulations to reduce the complexity. For general applications without prior knowledge about tree metrics for the probability measures, one can directly sample aligned-root tree metrics, e.g., by choosing the mean of the support data as the root for the clustering-based tree metric sampling Le et al. (2019), and use the aligned-root formulations for an efficient computation of the proposed tree variants of GW.

6 Experiments

We evaluate our proposed FlowTGW and DepthTGW discrepancies on quantum chemistry and on document classification with randomly linearly transformed word embeddings. In addition, we also carry out the large-scale FlowTGW barycenter problem within $k$-means clustering for point clouds of handwritten digits in the MNIST dataset, rotated arbitrarily in the plane as in Peyré et al. (2016).

Setup. We consider the three following baselines: (i) entropic GW (EGW) Peyré et al. (2016); (ii) a variant of entropic GW (EGW*) in which we only use the entropic regularization to optimize the transport plan, but exclude it when computing the GW discrepancy; and (iii) sliced GW (SGW) Vayer et al. (2019). In all of our experiments, we do not have prior knowledge about tree metrics for the probability measures. Therefore, we sample aligned-root tree metrics from the support data points by applying the clustering-based tree metric approach Le et al. (2019), where the means of support data points are chosen as tree roots. Consequently, we can use the aligned-root formulations for both FlowTGW (FTGW) and DepthTGW (DTGW) to reduce their complexity. For sliced GW, since it can only be applied to discrete measures with the same number of supports and uniform weights Vayer et al. (2019) (§3), we follow Vayer et al. (2019) and add zero padding when discrete measures have different numbers of supports. We further use the binomial expansion trick to reduce its complexity Vayer et al. (2019). For entropic GW and its variant, we use the log-stabilized Sinkhorn Schmitzer (2019) when optimizing the transport plan. In general, when the entropic regularization becomes smaller, the quality of entropic GW and its variant improves, but their computation is considerably slower. In our experiments, the computation of entropic GW usually either blows up or is too slow for evaluation when the entropic regularization is too small. We ran experiments on an Intel Xeon CPU E7-8891v3 (2.80GHz) with 256GB RAM. Reported time consumption for all methods includes their corresponding preprocessing, e.g., tree metric sampling for FlowTGW and DepthTGW, or the one-dimensional projection for sliced GW.

6.1 Applications

Quantum chemistry. We carry out a regression problem on molecules as in Peyré et al. (2016). The task is to predict atomization energies of molecules based on similar labeled molecules, instead of estimating them through expensive numerical simulations Rupp et al. (2012); Peyré et al. (2016). For simplicity, we only used the relative locations of atoms in molecules, without information about atomic nuclear charges, as in the experiments in Rupp et al. (2012); Peyré et al. (2016). We randomly split the dataset into training and test sets, and repeat the split several times. Following Peyré et al. (2016), we use a $k$-nearest neighbor ($k$-NN) regression approach.

Figure 5: Results of averaged accuracy and time consumption for variants of GW with different parameters (e.g., the entropic regularization in EGW/EGW*, and the number of slices in SGW/FTGW/DTGW) for $k$-NN on the TWITTER dataset. For clustering-based tree metric sampling, we used its suggested parameters.

Document classification with non-registered word embeddings. We next evaluate our proposed FlowTGW and DepthTGW for document classification with non-registered word embeddings on the TWITTER and RECIPE datasets. For each document in these datasets, we apply a random linear transform to word2vec word embeddings Mikolov et al. (2013), pre-trained on Google News, containing about 3 million words/phrases; the embedding maps each word/phrase into a 300-dimensional vector. Following Kusner et al. (2015); Le et al. (2019), we remove SMART stop words Salton & Buckley (1988), and also drop words from documents if they are not in the pre-trained embeddings. After preprocessing, the documents in RECIPE are considerably longer than those in TWITTER. We randomly split each dataset into training and test sets, and repeat the split several times.

Figure 6: Results of averaged accuracy and time consumption for FlowTGW (10 slices) with different parameters (e.g., the predefined deepest level $H_{\mathcal{T}}$ and the number of clusters $\kappa$) of the clustering-based tree metric sampling on the TWITTER dataset.

Performance results, time consumption and discussion. The results of averaged mean absolute error (MAE) for different $k$ in $k$-NN regression, together with the time consumption for quantum chemistry, are illustrated in the first column of Figure 4, while the results of averaged accuracy for different $k$ in $k$-NN, together with the time consumption for document classification with non-registered word embeddings on the TWITTER and RECIPE datasets, are shown in the second and third columns of Figure 4 respectively. The computational time of FlowTGW is at least comparable to that of sliced GW, and much faster than that of entropic GW. In particular, on the RECIPE dataset, FlowTGW took only minutes, while sliced GW took hours and entropic GW took days (even with entropic regularization eps=50). Moreover, the performance of FlowTGW compares favorably with the other baselines, except the variant of entropic GW on the RECIPE dataset. The performance of DepthTGW is comparable with the other baselines; however, DepthTGW is slow in practice, due to solving a large number of sub-problems, i.e., aligned-root FlowTGW between corresponding 2-depth-level trees. We observe that the variant of entropic GW improves on the performance of entropic GW; therefore, the entropic term in the entropic GW computation may harm its performance in applications, e.g., on the quantum chemistry and RECIPE datasets. For entropic GW and its variant, performance improves as the entropic regularization gets smaller, but the computational time increases considerably. For example, on the TWITTER dataset, the performances of entropic GW and its variant are comparable with the other variants of GW, but their entropic regularization has to be small enough (i.e., eps=5), which makes their computation about an order of magnitude (or more) slower than those of DepthTGW, sliced GW, and FlowTGW respectively (for a single slice). For sliced GW, when documents are long, e.g., on the RECIPE dataset, its computation slows down, since it requires extra artificial zero padding and uniform weights for probability measures with different numbers of supports (i.e., documents of different lengths); note that the other variants of GW work with the original number of supports (i.e., unique words in documents) and general weights (i.e., frequencies of unique words) for the supports of the probability measures.

Additionally, we show the trade-off between performance and time consumption for these variants of GW when their parameters, e.g., the entropic regularization in entropic GW and its variant, and the number of slices in sliced GW, FlowTGW and DepthTGW, are varied on the TWITTER dataset in Figure 5. Moreover, we also illustrate the performance and time consumption of FlowTGW (10 slices) with different parameters of the clustering-based tree metric sampling (e.g., the predefined deepest level $H_{\mathcal{T}}$, the number of clusters $\kappa$ for the farthest-point clustering) on the TWITTER dataset in Figure 6. Similar to tree metric sampling for tree-(sliced)-Wasserstein Le et al. (2019), we observe that the clustering-based tree metric sampling for FlowTGW and DepthTGW is fast, and its time consumption is negligible compared to that of either FlowTGW or DepthTGW. For example, each tree metric sampling with the suggested parameters took only seconds on each of the quantum chemistry, TWITTER, and RECIPE datasets. Further experimental results can be found in the supplementary.

Figure 7: Results of time consumption and the clustering quality measure Manning et al. (2008) for $k$-means clustering with FlowTGW on randomly rotated point clouds of handwritten digits in the MNIST dataset.

6.2 Large-scale FlowTGW barycenter within $k$-means clustering

We applied the FlowTGW barycenter (§3.3), using the free-support algorithm in Cuturi & Doucet (2014) with a bounded number of supports in the barycenters, within a larger machine learning pipeline, namely $k$-means clustering on the MNIST dataset, where point clouds of handwritten digits are rotated arbitrarily in the plane as in Peyré et al. (2016). For each handwritten digit, we randomly extracted point clouds, and we evaluated $k$-means with FlowTGW on increasingly large collections of handwritten-digit point clouds, where each handwritten digit is randomly rotated several times. Furthermore, we grouped the handwritten digits 6 and 9 together, since random rotations make them indistinguishable. We used the $k$-means++ initialization technique Arthur & Vassilvitskii (2007), capped the maximum number of $k$-means iterations, and repeated the experiment with different random seeds for the $k$-means++ initialization. In Figure 7, we show the averaged time consumption and the clustering quality measure Manning et al. (2008), with parameters chosen as in Le & Cuturi (2015), for the results of $k$-means clustering with FlowTGW. Note that, in these settings, the barycenter problem for entropic GW and its entropic variant has an extremely slow running time.
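As a rough end-to-end picture, a Lloyd-style $k$-means over flow-length measures alternates FlowTGW assignments with FlowTGW barycenter updates. The sketch below is ours and reuses the hypothetical `aligned_flowtgw` and `flowtgw_barycenter` helpers from the earlier sketches; it is not the experimental pipeline itself.

```python
import numpy as np

def kmeans_flowtgw(measures, k, n_iter, rng, bary_supports=20):
    """Lloyd iterations: assign each measure to the nearest centroid under the
    aligned-root FlowTGW, then recompute each centroid as a FlowTGW barycenter.
    measures: list of (flow_lengths, weights) pairs."""
    centroids = [measures[i] for i in rng.choice(len(measures), k, replace=False)]
    labels = [0] * len(measures)
    for _ in range(n_iter):
        labels = [int(np.argmin([aligned_flowtgw(v, w, cv, cw)
                                 for cv, cw in centroids]))
                  for v, w in measures]
        for c in range(k):
            members = [m for m, lab in zip(measures, labels) if lab == c]
            if members:  # keep the old centroid if the cluster went empty
                omega = np.full(len(members), 1.0 / len(members))
                centroids[c] = flowtgw_barycenter(members, omega, bary_supports)
    return labels, centroids
```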

7 Conclusion

We proposed in this paper two novel tree variants of GW, FlowTGW and DepthTGW, between probability measures whose supports are in different metric spaces, by considering a particular family of ground metrics, namely tree metrics. By leveraging a tree structure, we proposed an effective representation for probability measures whose supports are in a tree metric space, i.e., flows from a root to each support instead of the traditional pair-wise distances of supports, in order to scale up GW. In particular, the proposed FlowTGW is not only very fast, but its performance also compares favorably with other variants of GW. Moreover, FlowTGW can be applied to large-scale applications (e.g., a million probability measures), which are usually prohibitive for entropic GW. Questions about efficiently sampling tree metrics from support data points for the tree variants of GW, or using them for more involved parametric inference, are left for future work.

Supplement to “Fast Tree Variants of Gromov-Wasserstein”

We organize this supplementary material as follows:

  • In Section A, we provide proofs for technical results: Theorem 1 and Theorem 2 in the main text.

  • In Section B, we show more illustrations for flow-based tree GW (FlowTGW) mentioned in the main text.

  • In Section C, we describe further details for FlowTGW and depth-based tree GW (DepthTGW).

  • In Section D, we illustrate

    • further experimental results on the quantum chemistry, TWITTER, and RECIPE datasets considered in the main text;

    • experiments on larger document datasets (e.g., AMAZON and CLASSIC datasets) for document classification with non-registered word embeddings;

    • time consumption for the clustering-based tree metric sampling;

    • and results with different parameters for tree metric sampling.

  • In Section E, we give some brief reviews for

    • the farthest-point clustering;

    • clustering-based tree metric sampling;

    • tree metric;

    • the measure used for clustering evaluation;

    • and more information about the datasets.

  • In Section F, we provide some further discussions.

  • In Section G, we investigate empirical relations among variants of GW (e.g., FlowTGW, DepthTGW, sliced GW, entropic GW and a variant of entropic GW).

Notations. We use the same notations as in the main text.

Appendix A Proofs

In this section, we provide the proofs that the FlowTGW and DepthTGW discrepancies are pseudo-distances, i.e., Theorem 1 and Theorem 2 in the main text.

A.1 Proof of Theorem 1 in the main text

From the definition of FlowTGW, it is symmetric, namely, $\mathrm{FlowTGW}(\mu, \nu) = \mathrm{FlowTGW}(\nu, \mu)$. In addition, when $\mu = \nu$, choosing the two optimal roots $r_1 = r_2$ matches the flow lengths exactly, so $\mathrm{FlowTGW}(\mu, \nu) = 0$. Finally, we show that FlowTGW also satisfies the triangle inequality, as stated in Proposition 1.

Proposition 1.

Given three probability measures $\mu$, $\nu$, $\omega$ in three different tree metric spaces $\mathcal{T}_1$, $\mathcal{T}_2$, and $\mathcal{T}_3$ respectively. Then, we have:

$$\mathrm{FlowTGW}(\mu, \nu) \le \mathrm{FlowTGW}(\mu, \omega) + \mathrm{FlowTGW}(\omega, \nu).$$

Proof.

It is sufficient to demonstrate that

$$\mathrm{AFlowTGW}_{r_1, r_2}(\mu, \nu) \le \mathrm{AFlowTGW}_{r_1, r_3}(\mu, \omega) + \mathrm{AFlowTGW}_{r_3, r_2}(\omega, \nu) \qquad (8)$$

for any roots $r_1, r_2, r_3$ of $\mathcal{T}_1, \mathcal{T}_2, \mathcal{T}_3$ respectively, where $\mathrm{AFlowTGW}_{r, r'}$ denotes the aligned-root FlowTGW computed with the pair of roots $(r, r')$. Our proof of the above inequality is a direct application of the gluing lemma in (Villani, 2003). In particular, for any roots $r_1, r_2, r_3$ of $\mathcal{T}_1, \mathcal{T}_2, \mathcal{T}_3$, we denote by $\pi^{13}$ and $\pi^{23}$ optimal transport plans for $\mathrm{AFlowTGW}_{r_1, r_3}(\mu, \omega)$ and $\mathrm{AFlowTGW}_{r_3, r_2}(\omega, \nu)$ respectively. Based on the gluing lemma, there exists $\gamma \in \Pi(\mu, \nu, \omega)$ with marginal of the first and the third factors equal to $\pi^{13}$ and marginal of the second and the third factors equal to $\pi^{23}$. We denote the marginal of its first and second factors by $\pi^{12}$, which is a transport plan between $\mu$ and $\nu$. Therefore, from the definition of the aligned-root FlowTGW discrepancy, writing $\omega = \sum_{k} c_k \delta_{w_k}$, we have

$$\begin{aligned} \mathrm{AFlowTGW}_{r_1, r_2}(\mu, \nu) &\le \sum_{i, j, k} \big| d_{\mathcal{T}_1}(r_1, x_i) - d_{\mathcal{T}_2}(r_2, z_j) \big| \, \gamma_{ijk} \\ &\le \sum_{i, j, k} \Big( \big| d_{\mathcal{T}_1}(r_1, x_i) - d_{\mathcal{T}_3}(r_3, w_k) \big| + \big| d_{\mathcal{T}_3}(r_3, w_k) - d_{\mathcal{T}_2}(r_2, z_j) \big| \Big) \gamma_{ijk} \\ &= \mathrm{AFlowTGW}_{r_1, r_3}(\mu, \omega) + \mathrm{AFlowTGW}_{r_3, r_2}(\omega, \nu), \end{aligned} \qquad (9)$$

where the second inequality follows from the triangle inequality for the absolute value, and the last equality uses the marginal constraints of $\gamma$ together with the optimality of $\pi^{13}$ and $\pi^{23}$. As a consequence, we obtain the conclusion of the inequality in Equation (8). ∎

A.2 Proof of Theorem 2 in the main text

In fact, from the definition of DepthTGW, it is clear that $\mathrm{DepthTGW}(\mu, \nu) = \mathrm{DepthTGW}(\nu, \mu)$ and $\mathrm{DepthTGW}(\mu, \mu) = 0$. Furthermore, DepthTGW also satisfies the triangle inequality, as stated in Proposition 2.

Proposition 2.

Given three probability measures $\mu$, $\nu$, $\omega$ in three different tree metric spaces $\mathcal{T}_1$, $\mathcal{T}_2$, and $\mathcal{T}_3$ respectively. Then, we have:

$$\mathrm{DepthTGW}(\mu, \nu) \le \mathrm{DepthTGW}(\mu, \omega) + \mathrm{DepthTGW}(\omega, \nu).$$

Proof.

Similar to the proof of Proposition 1, it is sufficient to demonstrate that

$$\mathrm{ADepthTGW}_{r_1, r_2}(\mu, \nu) \le \mathrm{ADepthTGW}_{r_1, r_3}(\mu, \omega) + \mathrm{ADepthTGW}_{r_3, r_2}(\omega, \nu) \qquad (10)$$

for any roots $r_1, r_2, r_3$ of $\mathcal{T}_1, \mathcal{T}_2, \mathcal{T}_3$ respectively. According to the definition of aligned-root DepthTGW, the above inequality is equivalent to

$$\sum_{h} \sum_{(u, v) \in \mathcal{A}^{12}_h} m_{uv} \, \mathrm{AFlowTGW}(\mu^2_u, \nu^2_v) \le \sum_{h} \sum_{(u, w) \in \mathcal{A}^{13}_h} m_{uw} \, \mathrm{AFlowTGW}(\mu^2_u, \omega^2_w) + \sum_{h} \sum_{(w, v) \in \mathcal{A}^{32}_h} m_{wv} \, \mathrm{AFlowTGW}(\omega^2_w, \nu^2_v), \qquad (11)$$

where $\mathcal{A}^{12}_h$, $\mathcal{A}^{13}_h$, $\mathcal{A}^{32}_h$ are respectively the sets of optimal aligned pairs at depth level $h$ from trees $\mathcal{T}_1$ and $\mathcal{T}_2$, from trees $\mathcal{T}_1$ and $\mathcal{T}_3$, and from trees $\mathcal{T}_3$ and $\mathcal{T}_2$; $m_{uv}$, $m_{uw}$, $m_{wv}$ are respectively the optimal matching masses for the pairs $(u, v)$, $(u, w)$, $(w, v)$. In order to demonstrate the above inequality, we only need to verify that

$$\sum_{(u, v) \in \mathcal{A}^{12}_h} m_{uv} \, \mathrm{AFlowTGW}(\mu^2_u, \nu^2_v) \le \sum_{(u, w) \in \mathcal{A}^{13}_h} m_{uw} \, \mathrm{AFlowTGW}(\mu^2_u, \omega^2_w) + \sum_{(w, v) \in \mathcal{A}^{32}_h} m_{wv} \, \mathrm{AFlowTGW}(\omega^2_w, \nu^2_v) \qquad (12)$$

for any depth level $h$. We respectively denote