Fast Tree Variants of Gromov-Wasserstein
Abstract
Gromov-Wasserstein (GW) is a powerful tool to compare probability measures whose supports are in different metric spaces. However, GW suffers from a computational drawback since it requires solving a complex non-convex quadratic program. In this work, we consider a specific family of ground metrics, namely tree metrics, for the space of supports of each probability measure in GW. By leveraging the tree structure, we propose to use flows from a root to each support to represent a probability measure whose supports are in a tree metric space. We consequently propose a novel tree variant of GW, namely flow-based tree GW (FlowTGW), obtained by matching the flows of the probability measures. We then show that FlowTGW shares a similar structure with a univariate optimal transport distance. Therefore, FlowTGW is fast to compute and can scale up to large-scale applications. In order to further exploit tree structures, we propose another tree variant of GW, namely depth-based tree GW (DepthTGW), obtained by aligning the flows of the probability measures hierarchically along each depth level of the tree structures. Theoretically, we prove that both FlowTGW and DepthTGW are pseudo-distances. Moreover, we also derive tree-sliced variants, computed by averaging the corresponding tree variants of GW over random tree metrics, built adaptively in the spaces of supports. Finally, we test our proposed discrepancies against other baselines on some benchmark tasks.
1 Introduction
Optimal transport (OT) theory provides a powerful set of tools to compare probability measures. OT has recently gained traction in the machine learning community Cuturi (2013); Perrot et al. (2016); Genevay et al. (2016); Muzellec & Cuturi (2018); Mena & Niles-Weed (2019); Luise et al. (2019); Alaya et al. (2019); Paty & Cuturi (2019); Togninalli et al. (2019), and has played an increasingly important role in several research areas such as computer graphics Solomon et al. (2015); Bonneel et al. (2016); Lavenant et al. (2018); Solomon & Vaxman (2019), domain adaptation Courty et al. (2016, 2017); Bhushan Damodaran et al. (2018); Redko et al. (2019), and deep generative models Arjovsky et al. (2017); Gulrajani et al. (2017); Genevay et al. (2018); Kolouri et al. (2019); Wu et al. (2019); Nadjahi et al. (2019), to name a few.
When probability measures are discrete and their supports belong to the same space, the OT distance can be recast as a linear program, which can be solved by standard interior-point algorithms. However, these algorithms are not efficient when the number of supports is large. To address the scalability of the OT distance, Cuturi (2013) proposed to regularize OT by the entropy of transport plans, resulting in entropic regularized OT. Several efficient algorithms have recently been proposed to solve the entropic problem Altschuler et al. (2017); Dvurechensky et al. (2018); Lin et al. (2019); Altschuler et al. (2019).
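The Sinkhorn iterations behind entropic OT alternate two marginal-scaling updates. Below is a minimal, unstabilized sketch (the function name, fixed iteration budget, and absence of a stopping tolerance are our own simplifications, not a specific library's API):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=500):
    """Entropic-regularized OT plan between histograms a and b.

    a : (n,) weights of the first measure (sums to 1)
    b : (m,) weights of the second measure (sums to 1)
    C : (n, m) ground cost matrix
    eps : entropic regularization (smaller = closer to unregularized OT)
    """
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):             # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan
```

The returned plan approximately satisfies both marginal constraints; for very small eps, a log-domain stabilization such as Schmitzer (2019) is needed in practice.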
When probability measures are discrete and their supports lie in different spaces, the classical OT distance is no longer valid to measure their discrepancy. In his seminal work, Mémoli (2011) introduced the Gromov-Wasserstein (GW) distance to compare probability measures whose supports are in different metric spaces. GW is defined based on the discrepancy between the distance matrices of supports (i.e., pairwise distances of supports) corresponding to the probability measures. GW has been used in several applications, including quantum chemistry Peyré et al. (2016), computer graphics Solomon et al. (2016), cross-lingual embeddings Alvarez-Melis & Jaakkola (2018); Grave et al. (2019), graph partitioning and matching Xu et al. (2019a, b), and deep generative models Bunne et al. (2019). However, since GW is a complex non-convex quadratic program and NP-hard for arbitrary inputs Peyré & Cuturi (2019) (§10.6.3), its computation is very costly, which hinders applications, especially in large-scale settings where the number of supports is large.
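To make the cost concrete: for a fixed coupling, the GW objective already involves a sum over all pairs of support pairs. The naive evaluation below is O(n²m²); the squared loss is an assumption here, one common choice in Peyré et al. (2016):

```python
import numpy as np

def gw_objective(D1, D2, P):
    """Naive O(n^2 m^2) evaluation of the GW objective for a fixed plan P.

    D1 : (n, n) pairwise distances of the first measure's supports
    D2 : (m, m) pairwise distances of the second measure's supports
    P  : (n, m) transport plan
    """
    n, m = P.shape
    total = 0.0
    for i in range(n):
        for k in range(n):
            for j in range(m):
                for l in range(m):
                    total += (D1[i, k] - D2[j, l]) ** 2 * P[i, j] * P[k, l]
    return total
```

Minimizing this quantity over the plan P is the non-convex quadratic program; practical GW solvers never expand the quadruple sum explicitly, but the naive form shows why scaling is hard.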
Building on the entropic regularization idea from OT, Peyré et al. (2016) proposed an entropic GW discrepancy. Entropic GW can be efficiently solved by the Sinkhorn algorithm for certain values of the regularization parameter and a specific family of loss functions. Nevertheless, entropic GW requires the regularization to be sufficiently large for fast computation, which leads to a poor approximation of GW. Following the direction of leveraging entropic regularization, Xu et al. (2019a, b) proposed algorithmic approaches to further speed up GW for graph data. Another approach for scaling up the computation of GW is sliced GW Vayer et al. (2019), which relies on a one-dimensional projection of the supports of the probability measures. Consequently, similar to sliced-Wasserstein, sliced GW, albeit fast, has limited capacity to capture high-dimensional structure in a distribution of supports Liutkus et al. (2019); Le et al. (2019). Additionally, sliced GW can only be applied to discrete measures with the same number of supports and uniform weights, or requires artificial zero padding for measures having different numbers of supports Vayer et al. (2019).
In this work, we consider a particular family of ground metrics, namely tree metrics, for the space of supports of each probability measure in GW. Although it is well-known that one can leverage tree metrics to speed up the computation of arbitrary metrics Bartal (1996, 1998); Charikar et al. (1998); Indyk (2001); Fakcharoenphol et al. (2004), our goal is rather to sample tree metrics for the spaces of supports and use them as ground metrics, similar to tree-(sliced-)Wasserstein (TSW) Le et al. (2019). However, different from TSW, one cannot apply this idea straightforwardly by only using tree metrics as ground metrics to scale up GW. Therefore, by exploiting the tree structure, we propose to leverage flows from a root to each support to represent a probability measure whose supports are in a tree metric space, instead of the pairwise distances of supports as in traditional GW. Consequently, we propose a novel tree variant of GW, namely flow-based tree GW (FlowTGW), by matching the flows of probability measures whose supports are in different tree metric spaces. FlowTGW is fast to compute and can scale up to large-scale applications since it shares a similar structure with a univariate OT problem. In order to further exploit tree structures, we propose to align the flows hierarchically along each depth level of the tree structures, yielding depth-based tree GW (DepthTGW). Theoretically, we prove that both FlowTGW and DepthTGW are pseudo-distances. Furthermore, we derive tree-sliced variants, computed by averaging the corresponding tree variants of GW over random tree metrics, sampled by a fast adaptive method, e.g., the clustering-based tree metric sampling of Le et al. (2019) (§4).
The paper is organized as follows: we briefly review tree metrics and define tree GW for probability measures whose supports are in different tree metric spaces in §2. We propose two novel tree variants of GW, FlowTGW and DepthTGW, in §3 and §4 respectively. In §5, we derive tree-sliced variants of GW for practical applications; we then evaluate our proposed discrepancies against other baselines on some benchmark tasks in §6, before concluding in §7.
Notations. For n ∈ ℕ, we denote [n] := {1, 2, ..., n}. Given x ∈ ℝ^d, let ||x|| be the norm of x, and δ_x be the Dirac function at x. For a discrete probability measure μ, we write |μ| for the number of supports of μ.
2 Tree Gromov-Wasserstein
In this section, we give a brief review of tree metric spaces, and define tree GW between probability measures whose supports are in different tree metric spaces.
2.1 Tree metric space
Given a tree T rooted at node r, let d_T be a tree metric on T. The tree metric d_T between two nodes in tree T is equal to the length of the (unique) path between them Semple & Steel (2003) (§7, p.145–182). Given a node v, let Γ(v) be the set of nodes in the subtree of T rooted at v, i.e., Γ(v) = {u | v ∈ P(r, u)}, where P(r, u) is the (unique) path between the root r and node u in T; let C(v) be the set of child nodes of v, and |C(v)| the cardinality of that set. Given an edge e, we write e_s and e_d for the nodes that are respectively at a shallower (closer to r) and deeper (farther away from r) level of edge e, and w_e for the nonnegative length of that edge. We illustrate these notions in Figure 1.
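For intuition, a tree metric can be evaluated by walking both nodes up to their lowest common ancestor. The sketch below uses a minimal parent-pointer representation of a rooted tree; this representation and the function names are our own choices for illustration:

```python
def tree_distance(parent, length, u, v):
    """Tree metric d(u, v): length of the unique u-v path.

    parent[x] : parent of node x (the root has parent None)
    length[x] : nonnegative length of the edge (parent[x], x)
    """
    def path_to_root(x):
        path = [x]
        while parent[x] is not None:
            x = parent[x]
            path.append(x)
        return path

    anc_u = path_to_root(u)
    anc_v = set(path_to_root(v))
    lca = next(x for x in anc_u if x in anc_v)  # lowest common ancestor

    def dist_up(x, stop):
        d = 0.0
        while x != stop:                         # sum edge lengths upward
            d += length[x]
            x = parent[x]
        return d

    return dist_up(u, lca) + dist_up(v, lca)
```

For example, in a tree where node 3 hangs below node 1 and node 2 hangs below the root, d(3, 2) is the sum of the three edge lengths on the path 3 → 1 → root → 2.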
Throughout the paper, we are given two probability measures μ = Σ_{i∈[n]} a_i δ_{x_i} and ν = Σ_{j∈[m]} b_j δ_{z_j}, whose supports {x_i} and {z_j} are in different tree metric spaces (T1, d_T1) and (T2, d_T2) respectively; here a and b are nonnegative weight vectors summing to 1. Our goal is to define discrepancies between these probability measures.
2.2 Tree GromovWasserstein
Let tree GW denote the GW distance between probability measures whose supports are in different tree metric spaces. Tree GW between μ and ν takes the following form:
TGW(μ, ν) := min_{π ∈ Π(μ, ν)} Σ_{i,k ∈ [n]} Σ_{j,l ∈ [m]} |d_T1(x_i, x_k) − d_T2(z_j, z_l)|² π_{ij} π_{kl},   (1)
where Π(μ, ν) := {π ∈ ℝ₊^{n×m} | π 1_m = a, πᵀ 1_n = b} is the set of transport plans between μ and ν; d_T1(x_i, x_k) and d_T2(z_j, z_l) are pairwise distances of supports, i.e., entries of the distance matrices of supports, for μ and ν respectively.
However, one cannot scale up GW by straightforwardly using tree metrics as ground metrics for the supports of probability measures, as in Equation (1), the way TSW does (Le et al., 2019). Therefore, we propose to leverage the tree structure to form a novel representation of a probability measure, based on flows from a root to each support, to scale up GW. Consequently, we propose two novel variants of tree GW, FlowTGW and DepthTGW, detailed in §3 and §4 respectively. In particular, one can further scale up their computation by directly sampling aligned-root tree metrics in applications without prior knowledge about tree metrics for the probability measures.
3 Flowbased tree GW discrepancy
In this section, we study a variant of tree GW, named the flow-based tree GW (FlowTGW) discrepancy.
3.1 Definition of flowbased tree GW
Different from tree GW, FlowTGW takes tree structures into account to represent probability measures whose supports are in tree metric spaces.
Definition 1.
The flow-based tree Gromov-Wasserstein discrepancy between two probability measures μ and ν is defined as follows:
FlowTGW(μ, ν) := min_{r1 ∈ T1, r2 ∈ T2} min_{π ∈ Π(μ, ν)} Σ_{i ∈ [n]} Σ_{j ∈ [m]} |d_T1(r1, x_i) − d_T2(r2, z_j)| π_{ij}.   (2)
As indicated in Definition 1, each probability measure in FlowTGW is represented by the tree metrics from a root to each support, i.e., the lengths of the unique paths (or flows), while the weight of each support can be regarded as the mass of the corresponding flow. Additionally, the minimization over the roots r1 and r2 in Equation (2) ensures an optimal alignment of the pair of roots in trees T1 and T2 respectively. Similar to tree GW, we also determine an optimal transport plan π between μ and ν when computing FlowTGW.
Theorem 1.
FlowTGW is a pseudo-distance.
See the supplementary for the proof of Theorem 1.
3.2 Efficient computation for flowbased tree GW
A naive implementation of FlowTGW has a high computational complexity in the number of nodes of the trees if one exhaustively searches for the optimal pair of roots of T1 and T2.
Consider FlowTGW between two probability measures in two different tree metric spaces rooted at r1 and r2 respectively. When one changes r1 into a new root r1' for tree T1, as illustrated in Figure 2, two cases can happen:
Case 1: The new root r1' is one of the nodes in the subtree rooted at a child node z ∈ C(r1) whose subtree Γ(z) does not contain any supports of μ, as illustrated in the bottom left of Figure 2. Then, for all supports x_i, we have
d_T1(r1', x_i) = d_T1(r1', r1) + d_T1(r1, x_i).   (3)
Consequently, the order of the lengths of the paths from the root to each support does not change.
Case 2: The new root r1' is one of the nodes in the subtree rooted at a child node z ∈ C(r1) whose subtree Γ(z) contains some of the supports of μ, denoted S_z, as illustrated in the bottom right of Figure 2. Then, for all supports not in S_z, we have the same formulation as Equation (3), i.e., d_T1(r1', x_i) = d_T1(r1', r1) + d_T1(r1, x_i). Consequently, the order of the lengths of the paths from the root to each of those supports is preserved. For supports in S_z (illustrated in the supplementary), there are the three following sub-cases:
Case 2a: For supports x_i such that r1' lies on the path P(r1, x_i), i.e., x_i ∈ Γ(r1'), we have d_T1(r1', x_i) = d_T1(r1, x_i) − d_T1(r1, r1'). Therefore, the path-length order of those supports is preserved.
Case 2b: For supports x_i on the path P(r1, r1'), we have d_T1(r1', x_i) = d_T1(r1, r1') − d_T1(r1, x_i). Therefore, the path-length order of those supports is reversed.
Case 2c: For supports x_i ∈ S_z such that x_i ∉ Γ(r1') and x_i ∉ P(r1, r1'), one needs to find the corresponding closest common ancestor y_i of x_i and r1', i.e., y_i is on both paths P(r1, x_i) and P(r1, r1'), so we have d_T1(r1', x_i) = d_T1(r1, r1') + d_T1(r1, x_i) − 2 d_T1(r1, y_i). Note that the path-length order of supports having the same ancestor y_i is preserved.
Therefore, one only needs to merge these ordered arrays, which costs nearly linear time (except in the degenerate case where each array has only one node).
From the above observation, one does not need to re-sort the tree metrics between the new root r1' and each support of μ; instead, one can leverage the sorted order of the tree metrics between the previous root r1 and each support. Moreover, these computational steps can be done separately for each tree. Therefore, the complexity of FlowTGW is reduced considerably. More details can be found in the supplementary.
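The merging step can be sketched with a standard heap-based k-way merge; this is a generic illustration (not the paper's exact procedure), assuming the flow lengths under the new root have already been split into a few sorted runs as in the case analysis above:

```python
import heapq

def merge_sorted_runs(runs):
    """Merge already-sorted lists into one sorted list.

    Costs O(N log k) for N total elements in k runs, so a full re-sort
    (O(N log N)) is avoided when the number of runs k is small.
    """
    return list(heapq.merge(*runs))
```

For example, merging the runs [1, 4, 7], [2, 3, 9], and [5] yields [1, 2, 3, 4, 5, 7, 9].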
3.3 Alignedroot flowbased tree GW
In this section, we consider a special case of FlowTGW where the roots have already been aligned. Therefore, we can leave out the minimization over roots in Definition 1; we name this discrepancy aligned-root FlowTGW.
Definition 2.
Assume that root r1 in T1 is aligned with root r2 in T2. Then, the aligned-root flow-based tree GW discrepancy between μ and ν is defined as follows:
AFlowTGW(μ, ν) := min_{π ∈ Π(μ, ν)} Σ_{i ∈ [n]} Σ_{j ∈ [m]} |d_T1(r1, x_i) − d_T2(r2, z_j)| π_{ij}.   (4)
The optimization problem in Equation (4) is equivalent to the univariate Wasserstein distance between μ̃ := Σ_{i∈[n]} a_i δ_{d_T1(r1, x_i)} and ν̃ := Σ_{j∈[m]} b_j δ_{d_T2(r2, z_j)}, i.e., AFlowTGW(μ, ν) = W(μ̃, ν̃), where W denotes the Wasserstein distance Villani (2003). Moreover, the univariate Wasserstein distance is equal to the integral of the absolute difference between the generalized quantile functions of these two univariate probability distributions Santambrogio (2015) (§2). Therefore, one only needs to sort the flow lengths {d_T1(r1, x_i)} and {d_T2(r2, z_j)} for the computation of AFlowTGW, i.e., linearithmic complexity. Since it shares the same structure as a univariate Wasserstein distance, AFlowTGW inherits the same properties as the univariate Wasserstein distance: AFlowTGW(μ, ν) = 0 is equivalent to μ̃ = ν̃, and AFlowTGW is symmetric and satisfies the triangle inequality. See the supplementary for an illustration of aligned-root FlowTGW.
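Since aligned-root FlowTGW reduces to a univariate Wasserstein distance between the flow-length measures, it can be computed by sorting alone. Below is a minimal sketch of univariate W1 via the CDF formula; the function name and the representation of measures as support/weight arrays are our own choices:

```python
import numpy as np

def univariate_w1(x, a, y, b):
    """W1 distance between sum_i a_i * delta_{x_i} and sum_j b_j * delta_{y_j}.

    Uses the closed form W1 = integral of |F_mu(t) - F_nu(t)| dt, where F
    denotes the CDF; sorting dominates, so the cost is O((n+m) log(n+m)).
    """
    pts = np.unique(np.concatenate([x, y]))   # sorted union of supports

    def cdf(support, weights):
        order = np.argsort(support)
        s = np.asarray(support, dtype=float)[order]
        cum = np.concatenate([[0.0],
                              np.cumsum(np.asarray(weights, dtype=float)[order])])
        return cum[np.searchsorted(s, pts, side="right")]

    F1, F2 = cdf(x, a), cdf(y, b)
    # integrate |F1 - F2| piecewise between consecutive support points
    return float(np.sum(np.abs(F1[:-1] - F2[:-1]) * np.diff(pts)))
```

Calling `univariate_w1` on the two vectors of root-to-support distances, with the supports' weights, gives the aligned-root FlowTGW value under this reduction.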
Note that in practical applications we usually do not have prior knowledge about tree metrics for the spaces of supports of the probability measures, so we need to sample a tree metric for each support data space. Moreover, we can directly sample aligned-root tree metrics, e.g., by choosing the means of the support data distributions as roots when using the clustering-based tree metric sampling of Le et al. (2019) (§4). Consequently, we can use the aligned-root FlowTGW formulation to reduce the complexity of FlowTGW.
Aligned-root FlowTGW barycenter. Aligned-root FlowTGW can be handily used for a barycenter problem, especially in large-scale applications. Given probability measures μ_1, ..., μ_N whose supports are in different tree metric spaces with aligned roots, and corresponding weights ω_1, ..., ω_N, the aligned-root FlowTGW barycenter problem aims to find a flow-based tree-structure representation of an optimal probability measure μ̄, whose number of supports is less than or equal to k, in a tree metric space, taking the form:
μ̄ := argmin_{μ : |μ| ≤ k} Σ_{t ∈ [N]} ω_t AFlowTGW(μ, μ_t),   (5)
where the roots of the trees of μ_1, ..., μ_N are aligned with the root of the barycenter tree. The barycenter problem in Equation (5) is equivalent to the free-support univariate Wasserstein barycenter, which can be solved efficiently, e.g., using the algorithm in Cuturi & Doucet (2014).
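For intuition on why univariate barycenters are cheap: under the squared cost (W2), the univariate barycenter has a closed form by averaging quantile functions. The sketch below handles the special case of equally sized uniform point sets; this W2 special case is our illustrative assumption, whereas the FlowTGW barycenter above is W1-type and solved with the algorithm of Cuturi & Doucet (2014):

```python
import numpy as np

def univariate_w2_barycenter(point_sets, weights):
    """Closed-form W2 barycenter of 1-D uniform measures of equal size.

    The barycenter's quantile function is the weighted average of the input
    quantile functions; with equally sized uniform point sets this reduces
    to averaging the sorted supports coordinate-wise.
    """
    sorted_sets = [np.sort(np.asarray(x, dtype=float)) for x in point_sets]
    return sum(w * s for w, s in zip(weights, sorted_sets))
```

For example, the barycenter of the uniform measures on {0, 1, 2} and {2, 3, 4} with equal weights is the uniform measure on {1, 2, 3}.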
4 Depthbased tree GW discrepancy
FlowTGW only focuses on flows from a root to each support, but ignores information about the depth level of supports in the tree structures. In this section, we take the depth level of supports into account, and propose the depth-based tree GW (DepthTGW) discrepancy. In particular, DepthTGW considers the flow-alignment problem hierarchically at each depth level of the tree structures.
We first introduce some definitions needed to define DepthTGW. Recall that, given a node v in tree T, C(v) is the set of child nodes of v.
Definition 3.
Given a node v in tree T, a 2-depth-level tree, shortened as T²(v), is defined in T rooted at v, i.e., the root v is at depth level 1, and the subtrees rooted at the child nodes u ∈ C(v) are considered as “leaves” at depth level 2 in T²(v).
Following Definition 3 for the 2-depth-level tree T²(v), let V(T²(v)) be the set of vertices of T²(v); then V(T²(v)) contains v and all u ∈ C(v). Moreover, given a probability measure μ in T, we have a corresponding measure μ̃ in T²(v), defined as follows: each “leaf” u ∈ C(v) carries the total mass of the supports of μ in the subtree Γ(u), and the root v carries the mass of the support located at v if there is one, and zero otherwise.
In order to define DepthTGW, we start with its special case when roots are aligned.
Definition 4.
Assume that root r1 in T1 is aligned with root r2 in T2. Then, the aligned-root depth-based tree GW discrepancy between two probability measures μ and ν is defined as follows:
ADepthTGW(μ, ν) := Σ_h Σ_{(u,w) ∈ A_h} m_{uw} · AFlowTGW(μ̃_u, ν̃_w),   (6)

where μ̃_u and ν̃_w denote the measures induced on the 2-depth-level trees T1²(u) and T2²(w) (Definition 3);
h is the considered depth level, running from 1 to the deepest level of the shallower of the two trees T1 and T2; A_h is the set of optimal aligned pairs (u, w) at depth level h, where u ∈ T1 and w ∈ T2; and m_{uw} is the optimal matching mass for the pair (u, w) at depth level h.
Intuitively, at each depth level, we consider the alignment of the corresponding 2-depth-level trees. Note that the 2-depth-level tree structures are at the same depth level for both T1 and T2, so one can consider their alignment. Moreover, at depth level 1, r1 trivially matches r2 with optimal matching mass 1. Thus, the matching procedure is recursive along all depth levels of the trees. The base case of the recursive procedure occurs when at least one node of the considered pair has no child nodes, or when the sum of the weights of the child nodes in the corresponding 2-depth-level tree is equal to 0.
When we would like an optimal alignment between the roots of trees T1 and T2, we have:
DepthTGW(μ, ν) := min_{r1 ∈ T1, r2 ∈ T2} ADepthTGW(μ, ν).   (7)
The above discrepancy is referred to as DepthTGW. Similar to FlowTGW, we have the following theorem:
Theorem 2.
DepthTGW is a pseudo-distance.
See the supplementary material for the proof of Theorem 2.
5 Treesliced variants of GW by sampling tree metrics
Similar to the tree-sliced-Wasserstein distance Le et al. (2019) (or sliced-Wasserstein Rabin et al. (2011), sliced GW Vayer et al. (2019)), computing (aligned-root) FlowTGW/DepthTGW requires choosing or sampling tree metrics for each space of supports. We use fast adaptive methods, e.g., the clustering-based tree metric sampling of Le et al. (2019), to sample tree metrics for a space of supports, and then average the corresponding (aligned-root) FlowTGW/DepthTGW over those random tree metrics.
Definition 5.
Given two probability measures μ and ν supported on sets on which tree metric spaces can be defined, the (aligned-root) flow/depth-based tree-sliced GW is defined as the average of the corresponding (aligned-root) flow/depth-based tree GW for μ and ν over randomly sampled tree metric spaces for their respective supports.
As discussed in Le et al. (2019), averaging over several random tree metrics can help reduce quantization effects, or clustering-sensitivity problems in which data points may be partitioned or clustered into adjacent but different hypercubes or clusters in tree metric sampling. Moreover, note that the complexity of tree metric sampling is negligible compared to that of GW computation. Indeed, the complexity of the clustering-based tree metric sampling scales near-linearly in the number of input data points when one fixes the number of clusters for the farthest-point clustering (Gonzalez, 1985) and the predefined deepest level of the tree.
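A minimal sketch of such a sampling procedure, loosely following the description in Le et al. (2019): recursively partition the support points with farthest-point clustering up to a predefined depth. The recursion scheme, the use of cluster means as internal nodes, and edge lengths taken as distances between cluster means are our assumptions, not the paper's exact construction:

```python
import numpy as np

def farthest_point_clustering(X, k, rng):
    """Gonzalez (1985): greedily pick k centers, assign points to nearest."""
    centers = [rng.integers(len(X))]
    d = np.linalg.norm(X - X[centers[0]], axis=1)
    while len(centers) < min(k, len(X)):
        c = int(np.argmax(d))                 # farthest point becomes a center
        centers.append(c)
        d = np.minimum(d, np.linalg.norm(X - X[c], axis=1))
    labels = np.argmin(
        np.stack([np.linalg.norm(X - X[c], axis=1) for c in centers]), axis=0)
    return centers, labels

def sample_tree_metric(X, k=4, depth=3, seed=0):
    """Recursive partition of support points X into a rooted tree.

    Returns (parent, length, leaf_of_point): parent pointers, edge lengths
    (distances between cluster means), and the tree node of each point.
    The root (node 0) is placed at the mean of all points.
    """
    rng = np.random.default_rng(seed)
    parent, length, leaf_of_point = {0: None}, {0: 0.0}, {}
    frontier = [(0, np.arange(len(X)), X.mean(axis=0), 0)]
    next_id = 1
    while frontier:
        node, ids, mean, lvl = frontier.pop()
        if lvl == depth or len(ids) <= 1:     # stop: attach points here
            for i in ids:
                leaf_of_point[int(i)] = node
            continue
        _, labels = farthest_point_clustering(X[ids], k, rng)
        for c in range(int(labels.max()) + 1):
            sub = ids[labels == c]
            if len(sub) == 0:
                continue
            sub_mean = X[sub].mean(axis=0)
            parent[next_id] = node
            length[next_id] = float(np.linalg.norm(sub_mean - mean))
            frontier.append((next_id, sub, sub_mean, lvl + 1))
            next_id += 1
    return parent, length, leaf_of_point
```

Since the root is placed at the mean of the support points, two measures sampled this way have roots chosen in the aligned-root fashion used in the experiments below.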
Remark 1.
For specific applications with prior knowledge about tree metrics for the probability measures, one can apply FlowTGW, or consider DepthTGW when the known tree structure of each probability measure is important for the application. Moreover, if the roots of those known tree metrics are already aligned, one can use the corresponding aligned-root formulations to reduce the complexity. For general applications without prior knowledge about tree metrics for the probability measures, one can directly sample aligned-root tree metrics, e.g., by choosing the mean of the support data as the root in the clustering-based tree metric sampling Le et al. (2019), and use the aligned-root formulations for an efficient computation of the proposed tree variants of GW.
6 Experiments
We evaluate our proposed FlowTGW and DepthTGW discrepancies on quantum chemistry and on document classification with randomly linearly transformed word embeddings. In addition, we also carry out a large-scale FlowTGW barycenter problem within k-means clustering for point clouds of handwritten digits in the MNIST dataset, rotated arbitrarily in the plane as in Peyré et al. (2016).
Setup. We consider the three following baselines: (i) entropic GW (EGW) Peyré et al. (2016); (ii) a variant of entropic GW (EGW*) in which we use the entropic regularization only to optimize the transport plan, but exclude it when computing the GW discrepancy; and (iii) sliced GW (SGW) Vayer et al. (2019). In all of our experiments, we do not have prior knowledge about tree metrics for the probability measures. Therefore, we sample aligned-root tree metrics from support data points by applying the clustering-based tree metric approach Le et al. (2019), where the means of support data points are chosen as tree roots. Consequently, we can use the aligned-root formulations for both FlowTGW (FTGW) and DepthTGW (DTGW) to reduce their complexity. For sliced GW, since it can only be applied to discrete measures with the same number of supports and uniform weights Vayer et al. (2019) (§3), we follow Vayer et al. (2019) and add zero padding when discrete measures have different numbers of supports. We further use the binomial expansion trick to reduce its complexity Vayer et al. (2019). For entropic GW and its variant, we use the log-stabilized Sinkhorn algorithm Schmitzer (2019) when optimizing the transport plan.
6.1 Applications
Quantum chemistry. We carry out a regression problem on molecules as in Peyré et al. (2016). The task is to predict atomization energies of molecules based on similar labeled molecules, instead of estimating them through expensive numerical simulations Rupp et al. (2012); Peyré et al. (2016). For simplicity, we only use the relative 3D locations of atoms in molecules, without information about atomic nuclear charges, as in the experiments of Rupp et al. (2012); Peyré et al. (2016). We randomly split the dataset into training and test sets, and repeat the split several times. Following Peyré et al. (2016), we use a k-nearest-neighbor (kNN) regression approach.
Document classification with non-registered word embeddings. We next evaluate our proposed FlowTGW and DepthTGW for document classification with non-registered word embeddings on the TWITTER and RECIPE datasets. For each document in these datasets, we apply a random linear transform to the word embeddings Mikolov et al. (2013), pretrained on Google News.
Performance results, time consumption, and discussion. The averaged mean absolute error (MAE) for different k in kNN regression, and the time consumption, for quantum chemistry are shown in the first column of Figure 4, while the averaged accuracy for different k in kNN, and the time consumption, for document classification with non-registered word embeddings on the TWITTER and RECIPE datasets are shown in the second and third columns of Figure 4 respectively. The computational time of FlowTGW is at least comparable to that of sliced GW, and much faster than that of entropic GW. In particular, on the RECIPE dataset, FlowTGW finished within minutes, while sliced GW required hours and entropic GW required days (even with a large entropic regularization, eps = 50). Moreover, the performance of FlowTGW compares favorably with the other baselines, except for the variant of entropic GW on the RECIPE dataset. The performance of DepthTGW is comparable with the other baselines. However, DepthTGW is slow in practice due to solving a large number of subproblems, i.e., aligned-root FlowTGW between corresponding 2-depth-level trees. We observe that the variant of entropic GW improves over the performance of entropic GW. Therefore, the entropic term in the entropic GW computation may harm its performance in applications, e.g., on the quantum chemistry and RECIPE datasets. For entropic GW and its variant, performance improves as the entropic regularization gets smaller, but computational time increases considerably. For example, on the TWITTER dataset, the performances of entropic GW and its variant are comparable with the other variants of GW, but their entropic regularization must be small enough (i.e., eps = 5), which makes their computation about an order of magnitude or more slower than those of DepthTGW, sliced GW, and FlowTGW respectively (for a single slice).
For sliced GW, when documents are long, e.g., in the RECIPE dataset, its computation slows down since it requires extra artificial zero padding and uniform weights for probability measures with different numbers of supports (i.e., documents with different lengths); note that the other variants of GW work with the original number of supports (i.e., unique words in documents) and general weights (i.e., frequencies of unique words) for supports in probability measures.
Additionally, we show the trade-off between performance and time consumption for those variants of GW when their parameters (e.g., the entropic regularization in entropic GW and its variant; the number of slices in sliced GW, FlowTGW, and DepthTGW) are varied, on the TWITTER dataset, in Figure 5. Moreover, we also illustrate the performance and time consumption of FlowTGW (10 slices) with different parameters of the clustering-based tree metric sampling (e.g., the predefined deepest level of the tree, the number of clusters for the farthest-point clustering) on the TWITTER dataset in Figure 6. Similar to tree metric sampling for tree-(sliced-)Wasserstein Le et al. (2019), we observe that the clustering-based tree metric sampling for FlowTGW and DepthTGW is fast, and its time consumption is negligible compared to that of either FlowTGW or DepthTGW. For example, with the suggested parameters, each tree metric sampling took only seconds on each of the three datasets. Further experimental results can be found in the supplementary.
6.2 Largescale FlowTGW barycenter within means clustering
We applied the FlowTGW barycenter (§3.3), using the algorithm in Cuturi & Doucet (2014) with a fixed maximum number of supports in the barycenters, within a larger machine learning pipeline, namely k-means clustering on the MNIST dataset, where point clouds of handwritten digits are rotated arbitrarily in the plane as in Peyré et al. (2016). For each handwritten digit, we randomly extracted point clouds. We evaluated k-means with FlowTGW on sets of handwritten-digit point clouds of increasing size, where each handwritten digit is randomly rotated correspondingly many times. Furthermore, we grouped the digits 6 and 9 together due to the random rotations. We used the k-means++ initialization technique Arthur & Vassilvitskii (2007), set a maximum number of k-means iterations, and repeated the experiment several times with different random seeds for the k-means++ initialization. In Figure 7, we show the averaged time consumption and the clustering quality measure of Manning et al. (2008), with the parameter chosen as in Le & Cuturi (2015), for the results of k-means clustering with FlowTGW. Note that, in these settings, the barycenter problem for entropic GW and its variant has an extremely slow running time.
7 Conclusion
We proposed in this paper two novel tree variants of GW, i.e., FlowTGW and DepthTGW, between probability measures whose supports are in different metric spaces, by considering a particular family of ground metrics, namely tree metrics. By leveraging a tree structure, we proposed an effective representation for probability measures whose supports are in a tree metric space, i.e., flows from a root to each support instead of the traditional pairwise distances of supports, to scale up GW. In particular, the proposed FlowTGW is not only very fast, but its performance also compares favorably with other variants of GW. Moreover, FlowTGW can be applied to large-scale applications (e.g., a million probability measures), which are usually prohibitive for entropic GW. Questions about efficiently sampling tree metrics from support data points for the tree variants of GW, or using them for more involved parametric inference, are left for future work.
Supplement to “Fast Tree Variants of Gromov-Wasserstein”
We organize this supplementary material as follows:
- In Section A, we provide proofs for the technical results: Theorem 1 and Theorem 2 in the main text.
- In Section B, we show more illustrations for flow-based tree GW (FlowTGW) mentioned in the main text.
- In Section C, we describe further details for FlowTGW and depth-based tree GW (DepthTGW).
- In Section D, we illustrate:
  - further experimental results on the quantum chemistry, TWITTER, and RECIPE datasets considered in the main text;
  - experiments on larger document datasets (e.g., the AMAZON and CLASSIC datasets) for document classification with non-registered word embeddings;
  - time consumption for the clustering-based tree metric sampling;
  - and results with different parameters for tree metric sampling.
- In Section E, we give brief reviews of:
  - the farthest-point clustering;
  - the clustering-based tree metric sampling;
  - tree metrics;
  - the clustering evaluation measure;
  - and more information about the datasets.
- In Section F, we provide some further discussions.
- In Section G, we investigate empirical relations among variants of GW (e.g., FlowTGW, DepthTGW, sliced GW, entropic GW and a variant of entropic GW).
Notations. We use the same notations as in the main text.
Appendix A Proofs
In this section, we provide the proofs that the FlowTGW and DepthTGW discrepancies are pseudo-distances, i.e., Theorem 1 and Theorem 2 in the main text.
A.1 Proof of Theorem 1 in the main text
From the definition of FlowTGW, it is symmetric, namely FlowTGW(μ, ν) = FlowTGW(ν, μ). In addition, when μ = ν, there exist two optimal roots r1 and r2 such that the flow representations coincide, and hence FlowTGW(μ, ν) = 0. Finally, we show that FlowTGW also satisfies the triangle inequality, as stated in Proposition 1.
Proposition 1.
Given three probability measures μ1, μ2, μ3 in three different tree metric spaces T1, T2, and T3, we have: FlowTGW(μ1, μ3) ≤ FlowTGW(μ1, μ2) + FlowTGW(μ2, μ3).
Proof.
It is sufficient to demonstrate that the aligned-root discrepancies satisfy

AFlowTGW(μ1, μ3) ≤ AFlowTGW(μ1, μ2) + AFlowTGW(μ2, μ3),   (8)
for any roots r1, r2, r3 of T1, T2, T3 respectively. Our proof of the above inequality is a direct application of the gluing lemma in (Villani, 2003). In particular, for any such roots, we denote by π12 and π23 the optimal transport plans for AFlowTGW(μ1, μ2) and AFlowTGW(μ2, μ3) respectively. Based on the gluing lemma, there exists a coupling γ of (μ1, μ2, μ3) whose marginal on the first and second factors is π12 and whose marginal on the second and third factors is π23. We denote the marginal of its first and third factors as π13, which is a transport plan between μ1 and μ3. Therefore, from the definition of the aligned-root FlowTGW discrepancy, we have
(9) 
where we used Hölder’s inequality on the third term to obtain the second inequality. As a consequence, we obtain the inequality in Equation (8). ∎
A.2 Proof of Theorem 2 in the main text
In fact, from the definition of DepthTGW, it is clear that DepthTGW is symmetric and that DepthTGW(μ, μ) = 0. Furthermore, DepthTGW also satisfies the triangle inequality, as stated in Proposition 2.
Proposition 2.
Given three probability measures μ1, μ2, μ3 in three different tree metric spaces T1, T2, and T3, we have: DepthTGW(μ1, μ3) ≤ DepthTGW(μ1, μ2) + DepthTGW(μ2, μ3).
Proof.
Similar to the proof of Proposition 1, it is sufficient to demonstrate that

ADepthTGW(μ1, μ3) ≤ ADepthTGW(μ1, μ2) + ADepthTGW(μ2, μ3),   (10)
for any roots r1, r2, r3 of T1, T2, T3 respectively. According to the definition of aligned-root DepthTGW, the above inequality is equivalent to
(11) 
where the three sets are respectively the sets of optimal aligned pairs at depth level h from trees T1 and T2, from trees T1 and T3, and from trees T2 and T3, and the corresponding quantities are respectively the optimal matching masses for those pairs. In order to demonstrate the above inequality, we only need to verify that
(12) 
for any depth level h. We respectively denote