Unsupervised Hierarchy Matching with
Optimal Transport over Hyperbolic Spaces
Abstract
This paper focuses on the problem of unsupervised alignment of hierarchical data such as ontologies or lexical databases. This is a problem that appears across areas, from natural language processing to bioinformatics, and is typically solved by appeal to outside knowledge bases and label-textual similarity. In contrast, we approach the problem from a purely geometric perspective: given only a vector-space representation of the items in the two hierarchies, we seek to infer correspondences across them. Our work derives from and interweaves hyperbolic-space representations for hierarchical data, on one hand, and unsupervised word-alignment methods, on the other. We first provide a set of negative results showing how and why Euclidean methods fail in this hyperbolic setting. We then propose a novel approach based on optimal transport over hyperbolic spaces, and show that it outperforms standard embedding alignment techniques in various experiments on cross-lingual WordNet alignment and ontology matching tasks.
Youssef Mroueh mroueh@us.ibm.com
Tommi S. Jaakkola tommi@csail.mit.edu
1 Introduction
Hierarchical structures are ubiquitous in various domains, such as natural language processing and bioinformatics. For example, structured lexical databases like WordNet \citep{miller1995wordnet} are widely used in computational linguistics as an additional resource in various downstream tasks \citep{moldovan2001logic, shi2005putting, bordes2012joint}. On the other hand, ontologies are often used to store and organize relational data. Building such datasets is expensive and requires expert knowledge, so there is great interest in methods to merge, extend and extrapolate across these structures. A fundamental ingredient in all of these tasks is matching\footnote{Throughout this work, we use matching and alignment interchangeably to refer to this task.} different datasets, i.e., finding correspondences between their entities. For example, the problem of ontology alignment is an active area of research, with important implications for integrating heterogeneous resources across domains or languages \citep{spohr2011machine}. We refer the reader to \citet{euzenat2013ontology} for a thorough survey on the state of this problem. On the other hand, there is a long line of work on automatic WordNet construction that seeks to leverage existing large WordNets (usually English) to automatically build WordNets in other low-resource languages \citep{lee2000automatic, saveski2010automatic, pradet2014wonef, khodak2017automated}.
\citet{euzenat2013ontology} recognize three dimensions for similarity in ontology matching: semantic, syntactic and external. A similar argument can be made for other types of hierarchical structures. Most current methods for aligning such types of data rely on a combination of these three, i.e., in addition to the relations between entities they exploit lexical similarity and external knowledge. For example, automatic WordNet construction methods often rely on access to machine translation systems \citep{pradet2014wonef}, and state-of-the-art ontology matching systems commonly assume access to a large external knowledge base. Unsurprisingly, these methods perform poorly when no such additional resources are available \citep{shvaiko2013ontology}. Thus, effective fully unsupervised alignment of hierarchical datasets remains largely an open problem.
Our work builds upon two recent trends in machine learning to derive a new approach to this problem. On one hand, there is mounting evidence, both theoretical and empirical, of the advantage of embedding hierarchical structures in hyperbolic (rather than Euclidean) spaces \citep{nickel2018learning, ganea2018hyperbolic, desa2018representation}. On the other hand, various fully unsupervised geometric approaches have recently shown remarkable success in unsupervised word translation \citep{conneau2018word, Artetxe2018Robust, alvarezmelis2018gromov, grave2019unsupervised, alvarezmelis2019invariances}. We seek to combine these two recent developments by extending the latter to non-Euclidean settings, and using them to find correspondences between datasets by relying solely on their geometric structure, as captured by their hyperbolic-embedded representations. The end goal is a fully unsupervised approach to the problem of hierarchy matching.
In this work, we focus on the second step of this pipeline (the matching) and assume the embeddings of the hierarchies are already learned and fixed. Our approach proceeds by simultaneous registration of the two manifolds and pointwise entity alignment using optimal transport distances. After introducing the building blocks of our approach, we begin our analysis with a set of negative results. We show that state-of-the-art methods for unsupervised (Euclidean) embedding alignment perform very poorly when used on hyperbolic embeddings, even after modifying them to account for this geometry. The cause of this failure lies in a type of invariance, not exhibited by Euclidean embeddings, which we refer to as branch permutability. At a high level, this phenomenon is characterized by a lack of consistent ordering of branches in the representations of a dataset across different runs of the embedding algorithm (Fig. 1a), and is akin to the node order invariance in trees.
In response to this challenge, we further generalize our approach by learning a flexible nonlinear registration function between the spaces with a hyperbolic neural network \citep{ganea2018hypernn}. This nonlinear map is complex enough to register one of the hyperbolic spaces (Fig. 1b), and is learned by minimizing an optimal transport problem over hyperbolic space, which provides both a gradient signal for training and a pointwise (soft) matching between the embedded entities. The resulting method (conceptually illustrated in Figure 3) is capable of aligning embeddings in spite of severe branch permutability, which we demonstrate with applications in WordNet translation and biological ontology matching.
In summary, we make the following contributions:

Formulating the problem of unsupervised matching of hierarchical datasets from a geometric perspective, casting it as a correspondence problem between hyperbolic spaces

Showing that state-of-the-art methods for unsupervised embedding alignment fail in this task, and identifying the cause as a unique type of invariance found in popular hyperbolic embeddings

Proposing a novel framework for Riemannian nonlinear registration based on hyperbolic neural networks, which might be of independent interest

Empirically validating this approach with experiments on WordNet hierarchies and ontologies
Notation and Conventions
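We denote by $\mathcal{P}(\mathcal{X})$ the set of probability distributions over a metric space $\mathcal{X}$. For a continuous map $f: \mathcal{X} \to \mathcal{Y}$ we denote by $f_\sharp$ its associated push-forward operator, i.e., for any $\mu \in \mathcal{P}(\mathcal{X})$, $f_\sharp\mu$ is the push-forward measure satisfying $f_\sharp\mu(A) = \mu(f^{-1}(A))$ for every measurable $A \subseteq \mathcal{Y}$. The image of $f$ is denoted as $f(\mathcal{X})$. Finally, $O(n)$ and $SO(n)$ are the orthogonal and special orthogonal groups of order $n$.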
2 Related Work
Ontology Matching
Ontology matching is an important problem in various biomedical applications, for example, to find correspondences between disease and phenotype ontologies \citep[and references therein]{algergawy2018oaei}. Techniques in ontology matching are usually rule-based, and often rely on entity label similarity and external knowledge bases, making them unfit for unsupervised settings. Here, instead, we assume neither additional information nor textual similarity.
Hyperbolic Embeddings
Since their introduction \citep{chamberlain2017neural, nickel2017poincare}, research on automatic embedding of hierarchical structures in hyperbolic spaces has gained significant traction \citep{ganea2018hyperbolic, desa2018representation, tay2018hyperbolic}. The main appeal of this approach is that hyperbolic geometry captures several important aspects of hierarchies and other structured data \citep{nickel2017poincare}. We rely on these embeddings to represent the hierarchies of interest.
Unsupervised Word Embedding Alignment
Word translation based on word embeddings has recently gained significant attention after the successful fully unsupervised approach of \citet{conneau2018word}, which finds a mapping between embedding spaces with adversarial training, after which a refinement procedure based on the Procrustes problem produces the final alignment. Various non-adversarial approaches have been proposed since, such as robust self-learning \citep{Artetxe2018Robust}. Optimal transport (in particular, Wasserstein) distances have been recently shown to provide a robust and effective approach to the problem of unsupervised embedding alignment \citep{zhang2017earth, alvarezmelis2018gromov, grave2019unsupervised, alvarezmelis2019invariances}. For example, \citet{alvarezmelis2019invariances} and \citet{grave2019unsupervised} use a hybrid optimization objective over orthogonal transformations between the spaces (i.e., an Orthogonal Procrustes problem) and Wasserstein couplings between the samples. These works consider only Euclidean settings. Alternatively, this problem can be successfully approached \citep{alvarezmelis2018gromov} with a generalized version of optimal transport, the Gromov-Wasserstein (GW) distance \citep{memoli2011gromov}, which relies on comparing distances between points rather than the points themselves. The recently proposed Fused Gromov-Wasserstein distance \citep{vayer2018fused} extends this to structured domains such as graphs but, as opposed to our approach, assumes node features and knowledge of the full graph structure. While the GW distance provides a stepping stone towards alignment of more general embedding spaces, it cannot account for the type of invariances encountered in practice when operating on hyperbolic embeddings, as we will show in Section 5.
Correspondence Analysis
Finding correspondences between shapes is at the heart of many problems in computer graphics. One of the classic approaches to this problem is the Iterative Closest Point (ICP) method \citep{chen1992object, besl1992method} (and its various generalizations, e.g., \citep{rusinkiewicz2001efficient}), which alternates between finding (hard) correspondences through nearest-neighbor pairing and finding the best rigid transformation based on those correspondences (i.e., solving an Orthogonal Procrustes problem). The framework we propose can be understood as generalizing ICP in various ways: allowing for Riemannian manifolds (beyond Euclidean spaces), going beyond rigid (orthogonal) registration, and relaxing the problem by allowing for soft correspondences, which the framework of optimal transport naturally provides.
3 Embedding Hierarchies in Hyperbolic Space
A fundamental question when dealing with any type of symbolic data is how to represent it. As the advent of representation learning has proven, finding the right feature representation is as important as, and often more important than, the algorithm used on it. Naturally, the goal of such representations is to capture relevant properties of the data. For our problem, this is particularly important. Since our goal is to find correspondences between datasets based purely on their relational structure, it is crucial that the representation capture the semantics of these relations as precisely as possible.
Traditional representation learning methods embed symbolic objects into low-dimensional Euclidean spaces. These approaches have proven very successful for embedding large-scale co-occurrence statistics, like linguistic corpora for word embeddings \citep{mikolov2013distributed, pennington2014glove}. However, recent work has shown that data for which semantics are given in the form of hierarchical structures is best represented in hyperbolic spaces, i.e., Riemannian manifolds with negative curvature \citep{chamberlain2017neural, nickel2017poincare, ganea2018hyperbolic}. Among the arguments in favor of these spaces is the fact that any tree can be embedded into finite hyperbolic spaces with arbitrary precision \citep{gromov1987hyperbolic}. This stands in stark contrast with Euclidean spaces, for which the dependence on dimension grows exponentially. In practice, this means that very low-dimensional hyperbolic embeddings often perform on par with or above their high-dimensional Euclidean counterparts in various downstream tasks \citep{nickel2017poincare, ganea2018hyperbolic, tay2018hyperbolic}. This too is an appealing argument in our application, as we are interested in matching very large datasets, making computational efficiency crucial.
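Working with hyperbolic geometry requires a model to represent it and operate on it. Recent computational approaches to hyperbolic embeddings have mostly focused on the Poincaré Disk (or, in higher dimensions, Ball) model. This model is defined by the manifold $\mathbb{D}^d = \{x \in \mathbb{R}^d : \|x\| < 1\}$, equipped with the metric tensor $g^{\mathbb{D}}_x = \lambda_x^2 g^E$, where $\lambda_x = \frac{2}{1 - \|x\|^2}$ is the conformal factor and $g^E$ is the Euclidean metric tensor. With this, $(\mathbb{D}^d, g^{\mathbb{D}})$ has a Riemannian manifold structure, with the induced Riemannian distance given by:
\[ d_{\mathbb{D}}(u, v) = \operatorname{arccosh}\left(1 + 2\,\frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right) \tag{1} \]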
From this, the norm on the Poincaré Ball can be derived as
\[ \|u\|_{\mathbb{D}} = d_{\mathbb{D}}(0, u) = 2 \operatorname{arctanh}(\|u\|) \tag{2} \]
It can be seen from this expression that the magnitude of points in the Poincaré Ball tends to infinity towards its boundary. This phenomenon intuitively illustrates the tree-like structure of hyperbolic space: starting from the origin, the space becomes increasingly (in fact, exponentially) more densely packed towards the boundary, akin to how the width of a tree grows exponentially with its depth.
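The growth of the norm towards the boundary can be checked numerically. The following is a minimal sketch, assuming the standard closed form of the Poincaré distance, $d(u,v) = \operatorname{arccosh}\bigl(1 + 2\|u-v\|^2 / ((1-\|u\|^2)(1-\|v\|^2))\bigr)$; the function names are illustrative and are not taken from the paper's codebase.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    # Clamp guards against rounding slightly below 1 for (near-)identical points.
    return math.acosh(max(arg, 1.0))

def poincare_norm(u):
    """Hyperbolic norm, i.e., distance to the origin; equals 2 * artanh(||u||)."""
    return poincare_distance([0.0] * len(u), u)
```

Evaluating `poincare_norm([r, 0.0])` for radii `r` approaching 1 shows the norm diverging, even though the Euclidean norm stays bounded by 1.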
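Hyperbolic embedding methods find representations in the Poincaré Ball by constrained optimization (i.e., by imposing $\|x\| < 1$) of a loss function that is often problem-dependent. For datasets in the form of entailment relations $\mathcal{D} = \{(u, v)\}$, where $(u, v)$ means that $u$ is a subconcept of $v$, \citet{nickel2017poincare} propose to minimize the following soft-ranking loss:
\[ \mathcal{L}(\Theta) = -\sum_{(u,v) \in \mathcal{D}} \log \frac{e^{-d_{\mathbb{D}}(u, v)}}{\sum_{v' \in \mathcal{N}(u)} e^{-d_{\mathbb{D}}(u, v')}} \tag{3} \]
where $\Theta$ are the embeddings and $\mathcal{N}(u)$ a set of negative examples for $u$.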
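To make the soft-ranking loss concrete, here is a minimal sketch that evaluates it for a fixed set of embeddings. It assumes the loss is the negative log-softmax of negated Poincaré distances, and that the normalizer contains the positive example together with the sampled negatives; both are assumptions about the cited method rather than details stated here. The data layout (`emb`, `relations`, `negatives`) is hypothetical.

```python
import math

def poincare_distance(u, v):
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(max(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)), 1.0))

def soft_ranking_loss(emb, relations, negatives):
    """emb: name -> point in the ball; relations: list of (u, v) pairs;
    negatives: u -> list of names that are not related to u."""
    loss = 0.0
    for u, v in relations:
        candidates = [v] + list(negatives[u])  # assumption: positive is in the normalizer
        scores = [-poincare_distance(emb[u], emb[c]) for c in candidates]
        top = max(scores)  # log-sum-exp shift for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        loss -= scores[0] - log_norm
    return loss
```

The loss is small when each related pair is closer than the negatives of its first element, which is the ranking behaviour the objective is meant to encourage.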
Transformations in the Poincaré Ball will play a prominent role in the development of our approach in Section 6, so we discuss them briefly here. Since the Poincaré Ball is bounded, any meaningful operation on it must map $\mathbb{D}^d$ onto itself. Furthermore, for registration we are primarily interested in isometric transformations on the disk, i.e., we seek analogues of Euclidean vector translation, rotation and reflection. In this model, translations are given by Möbius addition, defined as
\[ u \oplus v = \frac{(1 + 2\langle u, v\rangle + \|v\|^2)\,u + (1 - \|u\|^2)\,v}{1 + 2\langle u, v\rangle + \|u\|^2 \|v\|^2} \tag{4} \]
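This definition conforms to our intuition of translation, e.g., if the origin of the disk is translated to $v$, then $u$ is translated to $v \oplus u$. Note that this addition is neither commutative nor associative. More generally, it can be shown that all isometries in the Poincaré Ball have the form $T(u) = P(v \oplus u)$, where $v \in \mathbb{D}^d$ and $P \in SO(d)$, i.e., it is an orientation-preserving isometry in $\mathbb{R}^d$. Two other important operations are the logarithmic and exponential maps on a Riemannian manifold, which map between the manifold and its tangent space at a given point $x$. For the Poincaré Ball, these maps can be expressed as
\[ \exp_x(v) = x \oplus \left( \tanh\left(\frac{\lambda_x \|v\|}{2}\right) \frac{v}{\|v\|} \right), \qquad \log_x(y) = \frac{2}{\lambda_x} \operatorname{arctanh}\bigl(\|{-x} \oplus y\|\bigr) \frac{-x \oplus y}{\|{-x} \oplus y\|}, \]
where $\lambda_x = 2 / (1 - \|x\|^2)$ is the conformal factor.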
4 The Wasserstein Approach to Correspondence
4.1 Optimal Transport distances
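Optimal transport (OT) distances provide a powerful and principled approach to find correspondences across distributions, shapes and point clouds \citep{villani2008optimal, Peyre2018Computational}. In its usual formulation, OT considers a complete and separable metric space $\mathcal{X}$, along with probability measures $\mu \in \mathcal{P}(\mathcal{X})$ and $\nu \in \mathcal{P}(\mathcal{X})$. These can be continuous or discrete measures, the latter often used in practice as empirical approximations of the former whenever working in the finite-sample regime. The Kantorovich formulation \citep{kantorovitch1942translocation} of the transportation problem reads:
\[ \mathrm{OT}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} c(x, y)\, d\pi(x, y) \tag{5} \]
where $c: \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ is a cost function (the “ground” cost), and the set of couplings $\Pi(\mu, \nu)$ consists of joint probability distributions over the product space $\mathcal{X} \times \mathcal{X}$ with marginals $\mu$ and $\nu$, i.e.,
\[ \Pi(\mu, \nu) = \{ \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) : P_{1\sharp}\pi = \mu,\ P_{2\sharp}\pi = \nu \} \tag{6} \]
with $P_1(x, y) = x$ and $P_2(x, y) = y$ the coordinate projections. Whenever $\mathcal{X}$ is equipped with a metric $d_{\mathcal{X}}$, it is natural to use it as ground cost, e.g., $c(x, y) = d_{\mathcal{X}}(x, y)^p$. In such case, Equation (5) is called the $p$-Wasserstein distance (raised to the power $p$). The case $p = 1$ is also known as the Earth Mover’s Distance in computer vision \citep{rubner2000earth}.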
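In applications, the measures $\mu$ and $\nu$ are often unknown, and are accessible only through finite samples $\{x^{(i)}\}_{i=1}^n$ and $\{y^{(j)}\}_{j=1}^m$. In that case, these can be taken to be discrete measures $\mu = \sum_{i=1}^n a_i \delta_{x^{(i)}}$ and $\nu = \sum_{j=1}^m b_j \delta_{y^{(j)}}$, where $a$, $b$ are vectors in the probability simplex, and the pairwise costs can be compactly represented as an $n \times m$ matrix $C$, i.e., $C_{ij} = c(x^{(i)}, y^{(j)})$. In this case, Equation (5) becomes a linear program. Solving this problem scales cubically in the sample sizes, which is often prohibitive in practice. Adding an entropy regularization, namely
\[ \mathrm{OT}_\varepsilon(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} c(x, y)\, d\pi(x, y) + \varepsilon\, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu) \tag{7} \]
leads to a problem that can be solved much more efficiently \citep{altschuler2017nearlinear} and which has better sample complexity \citep{genevay2019sample} than the unregularized problem. In the discrete case, Problem (7) can be solved with the Sinkhorn algorithm \citep{cuturi2013sinkhorn, Peyre2018Computational}, a matrix-scaling procedure which iteratively updates $u \leftarrow a \oslash Kv$ and $v \leftarrow b \oslash K^\top u$, where $K = \exp(-C/\varepsilon)$ and the division $\oslash$ and exponential are entrywise.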
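The matrix-scaling procedure can be sketched in plain Python as follows. This is a minimal illustration, assuming the classical form of the Sinkhorn iterations (scaling vectors `u` and `v` updated as `a / (K v)` and `b / (K^T u)` with `K = exp(-C / eps)`, and the coupling recovered as `diag(u) K diag(v)`); it omits the log-domain stabilization that a practical implementation would need for small `eps`.

```python
import math

def sinkhorn(a, b, C, eps, n_iters=500):
    """Entropy-regularized OT between histograms a (length n) and b (length m)
    with cost matrix C (n x m). Returns the soft coupling as a list of rows."""
    n, m = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each row of the returned coupling describes how the mass of one source point is distributed over the target points, which is the soft matching used throughout the rest of the paper.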
4.2 Unsupervised matching with optimal transport
Besides providing a principled geometric approach to compare distributions, optimal transport has the advantage of producing, as an intrinsic part of its computation, a realization of the optimal way to match the two distributions. Any feasible coupling $\pi$ in problem (5) (or (7)) can be interpreted as a “soft” or “multivalued” matching between $\mu$ and $\nu$. Therefore, the optimal $\pi^*$ corresponds to the minimum-cost way to match them. In the case where the distributions are discrete (e.g., point clouds), $\pi^*$ is a matrix of soft correspondences. Whenever OT is used with the goal of transportation (as opposed to just comparison), having guarantees on the solution of the problem takes particular importance. Obtaining such guarantees is an active area of research, and a full exposition falls beyond the scope of this work. We provide a brief summary of these in Appendix E, but refer the interested reader to the survey by \citet{ambrosio2013users}. For our purposes, it suffices to mention that for the quadratic cost $c(x, y) = d(x, y)^2$ (i.e., the 2-Wasserstein distance), the optimal coupling is guaranteed to exist, be unique, and correspond to a deterministic map (i.e., a “hard” matching).\footnote{Note that $\Pi(\mu, \nu)$ “includes” all maps $T$ with $T_\sharp\mu = \nu$, which can be expressed as $\pi = (\mathrm{Id} \times T)_\sharp \mu$.}
It is tempting to directly apply OT to unsupervised embedding alignment. But note that Problem (5) makes the crucial assumption that the two distributions are defined in the same space $\mathcal{X}$. More generally, one can consider different spaces $\mathcal{X}$ and $\mathcal{Y}$ as long as a meaningful cost function between them can be specified. When the embedding spaces are estimated in a data-driven way, as is usually the case in machine learning, even if these spaces are compatible (e.g., have the same dimensionality) there is no guarantee that the usual metric is meaningful. This could be, for example, because the spaces are defined up to rotations and reflections, creating a class of invariances that the ground metric does not take into account. A natural approach to deal with this lack of registration between the two spaces is to simultaneously find a global transformation that corrects for this and an optimal coupling that minimizes the transportation cost between the distributions. Formally, in addition to the optimal coupling, we now also seek a mapping $T: \mathcal{X} \to \mathcal{Y}$ within a class $\mathcal{F}$ which realizes
\[ \min_{T \in \mathcal{F}} \min_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(T(x), y)\, d\pi(x, y) \tag{8} \]
As before, we can additionally define an entropy-regularized version of this problem too. Variations of this problem for particular cases of $\mathcal{F}$ and $c$ have been proposed in various contexts, particularly for image registration (e.g., \citep{Rangarajan1997Softassign, cohen1999earth}), and more recently, for word embedding alignment \citep{zhang2017earth, alvarezmelis2019invariances, grave2019unsupervised}. Virtually all these approaches instantiate $\mathcal{F}$ as the class of orthogonal transformations $O(d)$ (or slightly more general classes of linear mappings \citep{alvarezmelis2019invariances}). In such cases, minimization with respect to $T$ is easy to compute, as it corresponds to an Orthogonal Procrustes problem, which has a closed form solution \citep{gower2004procrustes}. Thus, Problem (8) is commonly solved by alternating minimization.
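The alternating scheme can be illustrated on a toy Euclidean problem. The sketch below is a deliberate simplification: it restricts the transformation class to planar rotations, for which the Procrustes step has the closed form `atan2(sum of weighted cross products, sum of weighted dot products)`, so that no SVD is needed. The general orthogonal case in higher dimensions would use an SVD instead. All names and the toy setup are illustrative assumptions.

```python
import math

def sinkhorn(a, b, C, eps, n_iters=200):
    n, m = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def rotate(p, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def best_rotation(X, Y, G):
    """Angle minimizing sum_ij G[i][j] * ||R x_i - y_j||^2 over 2-D rotations R."""
    dot = sum(G[i][j] * (x[0] * y[0] + x[1] * y[1])
              for i, x in enumerate(X) for j, y in enumerate(Y))
    cross = sum(G[i][j] * (x[0] * y[1] - x[1] * y[0])
                for i, x in enumerate(X) for j, y in enumerate(Y))
    return math.atan2(cross, dot)

def align(X, Y, eps=0.1, n_rounds=10):
    """Alternate between a Sinkhorn coupling and a closed-form rotation update."""
    a, b = [1.0 / len(X)] * len(X), [1.0 / len(Y)] * len(Y)
    theta, G = 0.0, None
    for _ in range(n_rounds):
        RX = [rotate(x, theta) for x in X]
        C = [[(p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for q in Y] for p in RX]
        G = sinkhorn(a, b, C, eps)
        theta = best_rotation(X, Y, G)
    return theta, G
```

As with any alternating scheme of this kind, the result depends on the initialization: the sketch recovers a moderate rotation from the identity initialization, but it can converge to a poor local minimum when the initial misalignment is large.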
5 Wasserstein Correspondences across Hyperbolic Spaces
In the previous section, we discussed how Wasserstein distances can be used to find correspondences between two embedding spaces in a fully unsupervised manner. However, all the methods we mentioned there have been applied exclusively to Euclidean settings. One might be hopeful that naive application of those approaches on hyperbolic embeddings might just work, but, unsurprisingly, it does not (cf. Table~\ref{tab:wordnet}). Indeed, ignoring the special geometry of these spaces leads to poor alignment. Thus, we now investigate how to adapt such a framework to non-Euclidean settings.
The first fundamental question towards this goal is whether optimal transport extends to more general Riemannian manifolds. The answer is mostly positive. Again, limited space prohibits a nuanced discussion of this matter, but for our purposes it suffices to say that for hyperbolic spaces, under mild regularity assumptions, it can be shown that: (i) OT is well-defined \citep{villani2008optimal}; (ii) its solution is guaranteed to exist, be unique and be induced by a transport map \citep{McCann2001Polar}; and (iii) this map is not guaranteed to be smooth for the usual cost $d(x, y)^2$, but it is for variations of it (e.g., $-\cosh \circ\, d$) \citep{lee2012examples}. Further details on why this is the case are provided in Appendix F. This set of theoretical results supports the use of Wasserstein distances for finding correspondences in the hyperbolic setting of interest. Furthermore, Theorem F.2 provides various Riemannian cost functions with strong theoretical foundations and potential for better empirical performance.
The second step towards generalizing Problem (8) to hyperbolic spaces involves the transformation $T$. First, we note that using orthogonal matrices as in the Euclidean case is still valid because, as discussed in Section 3, these map the unit disk into itself. Therefore, we can now solve a generalized (hyperbolic) version of the Orthogonal Procrustes problem as before. However, this approach performs surprisingly badly in practice too (see results for HyperOT+Orthogonal in Table~\ref{tab:wordnet}).
To understand the cause of this failure, recall that orthogonality was a natural choice of invariance for embedding spaces that we assumed might differ by a rigid transformation, but were otherwise compatible. However, Poincaré embeddings exhibit another, more complex, type of invariance, which to the best of our knowledge has not been reported before. It is a branch permutability invariance, whereby the relative positions of branches in the hierarchy might change abruptly across different runs of the embedding algorithm, even for the exact same data and hyperparameters. This phenomenon is shown for a simple hierarchy embedded in the Poincaré Disk in Figure 1. Naturally, actual discrete trees are invariant to node ordering, but a priori it is not obvious why this property would be inherited by the embedded space generated with optimization objective (3), where non-ancestrally-related nodes do indeed interact (as negative pairs) in the objective. The results in Figure 2 show that this phenomenon occurs in various popular hyperbolic embedding methods (Poincaré \citep{nickel2017poincare}, Hyperbolic Cones \citep{ganea2018hyperbolic} and Principal Geodesic Analysis \citep{sala2018representation}) and, although more prominent in low dimensions, is still present when the dimensionality is increased.
We conjecture that the cause of this invariance is the use of negative sampling for normalization in that loss function, which has the effect of putting emphasis on preserving distance between entities that are ancestrally related in the hierarchy, at the cost of down-weighting distances between unrelated entities. A formal explanation of this phenomenon is left for future work. Here, instead, we develop a framework to account for and correct these invariances while simultaneously aligning the two embeddings.
6 A Framework for Correspondence across Hyperbolic Spaces
The failure of the baseline Euclidean alignment methods (and their hyperbolic versions) discussed in the previous section, combined with the underlying branch permutability invariance responsible for it, makes it clear that the space of registration transformations $\mathcal{F}$ in Problem (8) has to be generalized not only beyond orthogonality but beyond linearity too.
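Ideally we would search for $T$ among all continuous mappings between $\mathcal{X}$ and $\mathcal{Y}$, i.e., letting $\mathcal{F} = \mathcal{C}(\mathcal{X}, \mathcal{Y})$. To make this search computationally tractable, we can instead approximate this function class with deep neural networks $T_\theta$ parametrized by $\theta$. While an alternating minimization approach is still possible, solving for $\theta$ to completion in each iteration is undesirable. Instead, we reverse the order of optimization and rewrite our objective as
\[ \min_{\theta}\; \mathrm{OT}\bigl((T_\theta)_\sharp \mu, \nu\bigr), \qquad \mathrm{OT}\bigl((T_\theta)_\sharp \mu, \nu\bigr) = \min_{\pi \in \Pi(\mu, \nu)} \int c(T_\theta(x), y)\, d\pi(x, y) \tag{9} \]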
Since $\mathrm{OT}((T_\theta)_\sharp \mu, \nu)$ is differentiable with respect to $\theta$, we can use gradient-descent-based methods to optimize it. Wasserstein distances have been used before as loss functions, particularly in the context of deep generative modeling \citep{Arjovsky2017Wasserstein, genevay2018learning, salimans2018improving}. When used as a loss function, the entropy-regularized version (Eq. (7)) has the undesirable property that $\mathrm{OT}_\varepsilon(\mu, \mu) \neq 0$, in addition to having biased sample gradients \citep{bellemare2017cramer}. Following \citet{genevay2018learning}, we instead consider the Sinkhorn Divergence:
\[ \mathrm{SD}_\varepsilon(\mu, \nu) = 2\,\mathrm{OT}_\varepsilon(\mu, \nu) - \mathrm{OT}_\varepsilon(\mu, \mu) - \mathrm{OT}_\varepsilon(\nu, \nu) \tag{10} \]
Besides being a proper divergence and providing unbiased gradients, this function is convex, smooth and positive-definite \citep{feydy2019interpolating}, and its sample complexity is well characterized \citep{genevay2019sample}, all of which make it an appealing loss function. Using this divergence in place of the Wasserstein distance above yields our final objective:
\[ \min_{\theta}\; \mathrm{SD}_\varepsilon\bigl((T_\theta)_\sharp \mu, \nu\bigr) \tag{11} \]
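To illustrate why the debiased divergence is preferable as a loss, the sketch below evaluates it on small point clouds in the Poincaré Disk. It rests on several assumptions that are not fixed by the text: the divergence is taken as $2\,\mathrm{OT}_\varepsilon(\mu,\nu) - \mathrm{OT}_\varepsilon(\mu,\mu) - \mathrm{OT}_\varepsilon(\nu,\nu)$, the regularizer is the relative entropy with respect to the product of the marginals, the ground cost is the squared Poincaré distance, and the weights are uniform. It is not the differentiable implementation used for training.

```python
import math

def poincare_distance(u, v):
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(max(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)), 1.0))

def entropic_ot(X, Y, eps, n_iters=300):
    """Value of <C, G> + eps * KL(G || a x b) at the Sinkhorn coupling G."""
    n, m = len(X), len(Y)
    a, b = [1.0 / n] * n, [1.0 / m] * m
    C = [[poincare_distance(x, y) ** 2 for y in Y] for x in X]
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    value = 0.0
    for i in range(n):
        for j in range(m):
            g = u[i] * K[i][j] * v[j]
            if g > 0.0:  # skip entries that underflowed to zero
                value += g * C[i][j] + eps * g * math.log(g / (a[i] * b[j]))
    return value

def sinkhorn_divergence(X, Y, eps=0.1):
    return 2.0 * entropic_ot(X, Y, eps) - entropic_ot(X, X, eps) - entropic_ot(Y, Y, eps)
```

By construction `sinkhorn_divergence(X, X)` vanishes, whereas `entropic_ot(X, X, eps)` alone is generally nonzero, which is the bias that the correction terms remove.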
The last remaining challenge is that we need to construct a class of neural networks that parametrizes $\mathcal{F}$, i.e., functions that map $\mathbb{D}^d$ onto itself. In recent work, \citet{ganea2018hyperbolic} propose a class of hyperbolic neural networks that do exactly this. As they point out, the basic operations in hyperbolic space that we introduced in Section 3 suffice to define analogues of various differentiable building blocks of traditional neural networks. For example, a hyperbolic linear layer can be defined as
\[ f_{\mathrm{lin}}(x) = (W \otimes x) \oplus b = \exp_0\bigl(W \log_0(x)\bigr) \oplus b. \]
Analogously, a layer applying a nonlinearity $\sigma$ in the hyperbolic sense can be defined as $f_\sigma(x) = \exp_0(\sigma(\log_0(x)))$. Here, we also consider Möbius Transformation layers, $f_{\text{Möbius}}(x) = Q(b \oplus x)$, with $Q \in SO(d)$ and $b \in \mathbb{D}^d$. With these building blocks, we can parametrize highly nonlinear functions as a sequence of such hyperbolic layers, e.g., $T_\theta = f^{(K)} \circ \cdots \circ f^{(1)}$. Note that for the hyperbolic linear layer (but crucially, not for the Möbius layer) the intermediate hidden states need not live in the same dimensional space as the input and output, i.e., using rectangular weight matrices we can map intermediate states to Poincaré balls of different dimensionality.
The overall approach is summarized in Figure 3.
Optimization
Evaluation of the loss function in (11) is itself an optimization problem, i.e., solving instances of regularized optimal transport. We backpropagate through this objective as proposed by \citet{genevay2018learning}, using the geomloss toolbox for efficiency. For the outer-level optimization, we rely on Riemannian gradient descent \citep{zhang2016riemannian, wilson2018gradient}. We found that the adaptive methods of \citet{becigneul2019riemannian} worked best, particularly Radam. Note that for the HyperLinear layers only the bias term is constrained (on the Poincaré Ball), while for our Möbius layers the weight matrix is also constrained (in the Stiefel manifold), hence for these we optimize over the product of these two manifolds. Additional optimization details are provided in the Appendix.
Avoiding poor local minima
The loss function (11) is highly non-convex with respect to $\theta$, a consequence of both the objective itself and the nature of hyperbolic neural networks \citep{ganea2018hyperbolic}. As a result, we found that initialization plays a crucial role in this problem, since it is very hard to overcome a poor initial local minimum. Even suitable layer-wise random initialization of weights and biases proved futile. As a solution, we experimented with three pre-training initialization schemes that roughly ensure (in different ways) that $T_\theta$ does not initially “collapse” the space (details provided in Appendix A). In addition, we use an annealing scheme on the entropy-regularization parameter \citep{kosowsky1994invisible, alvarezmelis2019invariances}. Starting from an aggressive regularization (large $\varepsilon$), we gradually decrease it with a fixed decay rate . In all our experiments we use .
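The annealing scheme amounts to a geometric schedule on the regularization strength. The sketch below is only illustrative: the starting value, the decay rate and the lower bound `eps_min` are hypothetical placeholders (the floor in particular is an assumption added here to keep the Sinkhorn iterations numerically stable, and is not described in the text).

```python
def annealed_epsilons(eps_start, decay, eps_min, n_steps):
    """Geometric annealing of the entropy-regularization strength.

    All arguments are illustrative; eps_min is an assumed safety floor."""
    eps = eps_start
    schedule = []
    for _ in range(n_steps):
        schedule.append(eps)
        eps = max(eps * decay, eps_min)
    return schedule
```

For instance, `annealed_epsilons(1.0, 0.5, 0.1, 6)` yields a sequence that halves at each step until it reaches the floor and then stays there; each value would be used for one outer optimization step.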
7 Experiments
Datasets
For our first set of experiments, we extract subsets of WordNet \citep{miller1995wordnet} in five languages. For this, we consider only nouns and compute their transitive closure according to hypernym relations. Then, for each collection we generate embeddings in the Poincaré Ball of dimension 10 using the method of \citet{nickel2018learning} with default parameters. We will release the multilingual WordNet dataset along with our codebase. In Section 7.2, we perform synthetic experiments on the csphd network dataset \citep{rossi2015network}, again embedded with the same algorithm. In addition, we consider two subtasks of the OAEI 2018 ontology matching challenge \citep{algergawy2018oaei}: Anatomy, which consists of two ontologies; and biodiv, consisting of four. Additional details on all the datasets are provided in Table~\ref{tab:dataset_details} in the Appendix.
Methods
We first compare ablated versions of our HyperbolicOT model (detailed in Table~\ref{tab:ablation_details}), and then we compare against three off-the-shelf state-of-the-art unsupervised word embedding alignment models: Muse \citep{conneau2018word}, SelfLearn \citep{Artetxe2018Robust} and InvarOt \citep{alvarezmelis2019invariances}. We run all these methods with the settings and configurations recommended in their documentation.
Metrics
All the baseline methods return transformed embeddings. Using these, we retrieve nearest neighbors and report precision-at-$k$, i.e., the percentage of test examples for which the true match is within the top $k$ retrieved candidate matches.
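The metric can be sketched as follows. This is a minimal illustration that assumes neighbors are ranked by Poincaré distance in the target space and that ground-truth matches are given as target indices; the argument names are hypothetical.

```python
import math

def poincare_distance(u, v):
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(max(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)), 1.0))

def precision_at_k(mapped_src, tgt, true_idx, k):
    """Percentage of mapped source points whose true target is among the k nearest targets."""
    hits = 0
    for x, t in zip(mapped_src, true_idx):
        ranked = sorted(range(len(tgt)), key=lambda j: poincare_distance(x, tgt[j]))
        if t in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(mapped_src)
```

With `k` equal to the number of target points the score is always 100, so the metric is only informative for small `k` relative to the size of the target set.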
7.1 Multilingual WordNet alignment
We first perform an ablation study on the various components of our model in a controlled setting, where the correspondences between the two datasets are perfect and unambiguous. For this, we embed the same hierarchy (the En part of our WordNet dataset) twice, using the same algorithm with the same hyperparameters, but different random seeds. We then evaluate the extent to which our method can recover the correspondences. Starting from our Full Model, we remove and/or replace various components and re-evaluate performance. The exact configuration of the ablated models is provided in Appendix D. The results in Table~\ref{tab:ablation} suggest that the most crucial components are the use of the appropriate Poincaré metric and the pre-training step. Next, we move on to the real task of interest: matching WordNet embeddings across different languages. Naturally, in this case there might not be perfect correspondences across the entities in different languages. As before, we report Precision@10 and compare against baseline models in Table~\ref{tab:wordnet}.
7.2 Noise sensitivity
We next analyze the effect of domain discrepancy on the matching quality. Using our method to match noiseless and noisy versions of the csphd dataset (details in the Appendix), we observe that accuracy degrades rapidly with noise, although, as expected, less so for higher-dimensional embeddings (Fig. 4).
7.3 Ontology Matching
Finally, we test our method on the OAEI tasks. The results (Table 2) show that our method again decidedly outperforms the baseline Euclidean methods, but now the overall performance of all methods is remarkably lower, which suggests the domains’ geometry is less coherent (partly because of their vastly different sizes) and/or the correspondences between them are noisier.
            Anatomy           Biodiv
Method      HM      MH        FP      PF      ES      SE
Muse        0.12    0.00      3.23    0.00    0.00    0.00
SelfLearn   0.00    0.00      4.00    0.00    0.01    0.02
HypOT       7.89    4.49     16.67    8.73    6.25    9.66
8 Discussion and Extensions
The framework for hierarchical structure matching proposed here admits various extensions, some of them immediate. We focused on the particular case of the Poincaré ball, but since most of the components of our approach (optimization, registration, optimal transport) generalize to other Riemannian manifolds, so does our framework: as long as optimizing over a given manifold is tractable, it enables computing correspondences across instances of that manifold. On the other hand, we purposely adopted the challenging setting where no additional information is assumed. This setting is relevant both for extreme practical cases and for stress-testing the limits of unsupervised learning in this context. However, our method would likely benefit from incorporating any additional available information, as state-of-the-art methods for ontology matching do. In our framework, this information could, for example, be injected into the transport cost objective.
Appendix A Pretraining Strategies
The loss function (11) is highly nonconvex with respect to the parameters of the mapping, a consequence of both the objective itself and the nature of hyperbolic neural networks \citep{ganea2018hyperbolic}. As a result, we found that initialization plays a crucial role in this problem, since it is very hard to escape a poor initial local minimum. Even layerwise random initialization of weights and biases proved futile. As a solution, we experimented with the following three pretraining initialization schemes, all of which intuitively try to approximately ensure (in different ways) that the mapping does not “collapse” the space:

Identity. Initialize the mapping to approximate the identity, which trivially ensures that it (approximately) preserves the overall geometry of the space.

CrossMap. Initialize the mapping to approximately match the target points to the source points under a random permutation, which again ensures that it approximately preserves the global geometry, albeit for an arbitrary labeling of the points.

Procrustes. Following \citep{bunne2019learning}, we initialize the mapping to be approximately end-to-end orthogonal, i.e., close to the solution of (a hyperbolic version of) the Orthogonal Procrustes problem between the two sets of points, which can be obtained via singular value decomposition (SVD). This strategy thus requires computing an SVD for every gradient update on the parameters; hence, it is significantly more computationally expensive than the other two.
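For reference, the classical (Euclidean) Orthogonal Procrustes problem underlying the third scheme has a closed-form solution via the SVD. The notation below is introduced here for illustration only: $X$ and $Y$ denote $d \times n$ matrices whose columns are paired source and target points, and $P$ is the orthogonal map. The direction of the map is an assumption, and the hyperbolic variant used for pretraining may differ in its details.

```latex
P^\star \;=\; \operatorname*{arg\,min}_{P \in \mathbb{R}^{d \times d},\; P^\top P = I}
\; \lVert P X - Y \rVert_F^2,
\qquad
P^\star = U V^\top
\quad \text{where} \quad
Y X^\top = U \Sigma V^\top \ \text{(SVD)}.
```

The solution follows from expanding the Frobenius norm: minimizing $\lVert PX - Y \rVert_F^2$ over orthogonal $P$ is equivalent to maximizing $\operatorname{tr}(P X Y^\top)$, which is attained at $P = U V^\top$.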
Appendix B Optimization Details
Each forward pass of the loss function (11) requires solving three regularized OT problems. While this can be done to completion in time nearly linear in the size of the cost matrix \citep{altschuler2017nearlinear}, practical implementations often run the Sinkhorn algorithm for a fixed number of iterations with a tolerance threshold on the objective improvement. We rely on the geomloss package (https://www.kernel-operations.io/geomloss/) for an efficient differentiable implementation of Sinkhorn divergences and on the geoopt package (https://geoopt.readthedocs.io/en/latest/) for Riemannian optimization. We run our method for a fixed number of outer iterations (200 in all our experiments), which given the decay strategy on the entropy regularization parameter , ensures that ranges from to . All experiments were run on a single machine with a 32-core Intel Xeon CPU @ 3.20 GHz, exploiting computations on the GPU (a single GeForce Titan X) whenever possible. With this configuration the total runtime of our method on the experiments ranged from to minutes.
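To make the stopping rule and the role of the regularization parameter concrete, the following is a minimal pure-Python sketch of Sinkhorn scaling with a fixed iteration budget and a tolerance on the change in transport cost. It is not the geomloss implementation used in the experiments. The geometric decay schedule and its endpoints (1.0 and 0.05) are hypothetical placeholders, as are the function names and the toy 1-D points.

```python
import math

def sinkhorn(cost, a, b, eps, max_iter=500, tol=1e-9):
    """Entropy-regularized OT between histograms a and b (plain Sinkhorn scaling).

    Stops after max_iter iterations, or earlier once the transport cost
    changes by less than tol between consecutive iterations.
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    prev = float("inf")
    for _ in range(max_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
        value = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
        if abs(prev - value) < tol:
            break
        prev = value
    return plan, value

def eps_schedule(eps_start, eps_end, n_outer):
    """Geometric decay from eps_start to eps_end (an assumed decay strategy)."""
    ratio = eps_end / eps_start
    return [eps_start * ratio ** (t / (n_outer - 1)) for t in range(n_outer)]

# Toy problem: three source and three target points on a line, squared cost.
xs, ys = [0.0, 1.0, 2.0], [0.1, 1.1, 2.1]
cost = [[(x - y) ** 2 for y in ys] for x in xs]
a = b = [1.0 / 3.0] * 3

schedule = eps_schedule(1.0, 0.05, 200)  # hypothetical endpoints
plan_diffuse, _ = sinkhorn(cost, a, b, eps=schedule[0])
plan_sharp, _ = sinkhorn(cost, a, b, eps=schedule[-1])
```

With the large initial value the coupling spreads mass across several targets, whereas with the small final value it concentrates almost entirely on the diagonal, i.e., on a one-to-one matching.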
Appendix C Dataset Details
To generate the parallel WordNet datasets, we use the nltk interface to WordNet, and proceed as follows. In the English WordNet, we first filter out all words except nouns, and generate their transitive closure. For each of the remaining synsets, we query for lemmas in each of the four other languages (Es, Fr, It, Ca), for which nltk provides multilingual support in WordNet. These tuples of lemmas form our ground-truth translations, which are eventually split into a validation set of size 5000, leaving all the other pairs for test data (approximately 1500 for each language pair). Note that the validation set is for visualization purposes only, and all model selection is done in a purely unsupervised way based on the training objective. After the multilingual synset vocabularies have been extracted, we ensure their transitive closures are complete and write all the relations in these closures to a file, which is used as input to the PoincareEmbeddings toolkit (https://github.com/facebookresearch/poincare-embeddings).
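The transitive-closure step can be illustrated with a short, self-contained sketch. It assumes the hierarchy is given as direct (child, parent) hypernym edges; the function name and the toy synset identifiers are illustrative and are not taken from the actual preprocessing scripts.

```python
def transitive_closure(edges):
    """Return all (descendant, ancestor) pairs implied by direct (child, parent) edges."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)

    closure = set()
    for node in parents:
        # Depth-first walk up the hierarchy, collecting every ancestor once.
        stack = list(parents[node])
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            closure.add((node, ancestor))
            stack.extend(parents.get(ancestor, ()))
    return closure

# Toy chain of direct hypernym edges (illustrative identifiers).
edges = [
    ("poodle.n.01", "dog.n.01"),
    ("dog.n.01", "canine.n.02"),
    ("canine.n.02", "carnivore.n.01"),
]
closure = transitive_closure(edges)
```

Each pair in `closure` corresponds to one relation that would be written to the input file for the embedding step.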
To generate the datasets for the synthetic noise-sensitivity experiments (§7.2), we start from the original CSPhD dataset (http://networkrepository.com/CSphd.php). Given a predefined removal probability, we iterate through the hierarchy, removing each node with that probability and connecting the removed node's children to its parent to keep the tree connected. We repeat this with noise values and embed all of these using the PoincareEmbeddings toolkit in hyperbolic spaces of dimensions . For a given dimensionality and noise level, we use our method to find correspondences between the noiseless and noisy versions of the hierarchy (i.e., matching tasks in total).
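The node-removal procedure can be sketched as follows. The tree is represented as a child-to-parent dictionary, and the root is never removed because it has no parent to reattach its children to. Both choices, along with the function name and the toy tree, are assumptions made for illustration.

```python
import random

def remove_nodes(parent, p, rng):
    """Drop each non-root node with probability p, reattaching its children to its parent.

    parent: dict child -> parent for a rooted tree (the root has no entry).
    Returns a new child -> parent dict over the surviving non-root nodes.
    """
    removed = {v for v in parent if rng.random() < p}
    noisy = {}
    for v, par in parent.items():
        if v in removed:
            continue
        # Climb past removed ancestors so the tree stays connected.
        while par in removed:
            par = parent[par]
        noisy[v] = par
    return noisy

# Toy hierarchy rooted at "a" (illustrative, not the CSPhD data).
tree = {"b": "a", "c": "a", "d": "b", "e": "b", "f": "d"}
noisy = remove_nodes(tree, p=0.5, rng=random.Random(0))
```

Calling the function with several values of `p` on the same tree yields progressively noisier versions of the hierarchy, each of which can then be embedded separately.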
Statistics about all the datasets used in this work are provided in Table LABEL:tab:dataset_details. Further details about the OAEI datasets can be found on the project’s website (http://oaei.ontologymatching.org/2018/).
Appendix D Model Configurations and Hyperparameters
In Table LABEL:tab:ablation_details, we provide full configuration details for all the ablated models used in the WordNet En→En self-recovery experiment (results shown in Table LABEL:tab:ablation). Dashed lines indicate that a parameter is the same as in the Full Model.
Appendix E A Brief Summary of Theoretical Guarantees for Optimal Transport (Euclidean Case)
As mentioned in Section 4.2, whenever optimal transport is used with the goal of obtaining correspondences, there are various theoretical considerations that become particularly appealing.
The first such consideration pertains to the nature of the solution, i.e., the optimal coupling which minimizes the cost (5). When the end goal is to transport points from one space to the other, the best-case scenario would be for the optimal coupling to be a “hard” deterministic mapping. A celebrated result by \citet{brenier1987decomposition, brenier1991polar} shows that this is indeed the case for the quadratic cost, i.e., for the 2-Wasserstein distance. (This result holds in more general settings; we refer the reader to \citep{Santambrogio2010Introduction, ambrosio2013users} for further details.) Even when solving the problem approximately with entropic regularization (cf. Eq. (7)), this result guarantees that the solution found in this way converges to a deterministic mapping as the regularization strength tends to zero.
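For reference, Brenier's result can be stated as follows in its standard Euclidean form. The notation is chosen here for illustration and may differ from that of the main text: $\mu$ and $\nu$ are the source and target measures, $\Pi(\mu, \nu)$ is the set of couplings, $T$ is the transport map, and $\varphi$ is a potential.

```latex
\text{Let } \mu, \nu \in \mathcal{P}_2(\mathbb{R}^d) \text{ with } \mu \ll \mathcal{L}^d.
\text{ Then the problem}
\quad
\min_{\gamma \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d}
\tfrac{1}{2} \lVert x - y \rVert^2 \, \mathrm{d}\gamma(x, y)
\quad
\text{has a unique solution}
\quad
\gamma^\star = (\mathrm{Id}, T)_{\#}\, \mu,
\qquad
T = \nabla \varphi, \quad \varphi \text{ convex}.
```

In other words, under absolute continuity of the source measure, the optimal coupling places all of its mass on the graph of a single map, so every source point has exactly one image.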
Assuming that such a map exists, the next aspect we might be interested in is its smoothness. Intuitively, smoothness of this mapping is desirable since it is more likely to lead to robust matchings in the context of correspondences, even if, again, the argument holds only asymptotically for the regularized problem. This, clearly, is a very strong property to require. While not even continuity can be guaranteed in general \citep{ambrosio2013users}, for the quadratic cost things are again simpler: if the source and target densities are smooth and the support of the target distribution satisfies suitable convexity assumptions, the optimal map is guaranteed to be smooth too [caffarelli1992regularity, caffarelli1992boundary].
Appendix F A Brief Summary of Theoretical Guarantees for Optimal Transport (Riemannian Manifold Case)
Extending the problem beyond Euclidean to more general spaces has been one of the central questions of theoretical optimal transport research over the past decades [villani2008optimal]. For obvious reasons, here we focus the discussion on results related to hyperbolic spaces, and more generally, to Riemannian manifolds.
Let us first note that Problem (5) is well-defined for any complete and separable metric space. Since the arc-length metric of a Riemannian manifold allows for the direct construction of an accompanying metric space, OT can be defined over Riemannian manifolds too. However, some of the theoretical results of the Euclidean setting do not transfer that easily to the Riemannian case \citep{ambrosio2013users}. Nevertheless, the existence and uniqueness of the optimal transportation plan, which in addition is induced by a transport map, can be guaranteed with mild regularity conditions on the source distribution. This was first shown in seminal work by \citet{McCann2001Polar}. The result, which acts as a Riemannian analogue of that of Brenier for the Euclidean setting \citep{brenier1987decomposition}, is shown below as presented by \citet{ambrosio2013users}:
Theorem F.1 (McCann, version of \citep{ambrosio2013users}).
Let $M$ be a smooth, compact Riemannian manifold without boundary and $\mu \in \mathcal{P}(M)$. Then the following are equivalent:

(i) For every $\nu \in \mathcal{P}(M)$, there exists a unique optimal transport plan from $\mu$ to $\nu$, and this plan is induced by a map $T$.

(ii) $\mu$ is regular.
If either (i) or (ii) holds, the optimal $T$ can be written as $x \mapsto \exp_x(-\nabla \varphi(x))$ for some $c$-concave function $\varphi : M \to \mathbb{R}$.
The question of regularity of the optimal map, on the other hand, is much more delicate now than in the Euclidean case \citep{ambrosio2013users, ma2005regularity, loeper2009regularity}. In addition to the suitable convexity assumptions on the support of the target density, a restrictive structural condition, known as the Ma-Trudinger-Wang (MTW) condition [ma2005regularity], needs to be imposed on the cost in order to guarantee continuity of the optimal map. Unfortunately for our setting, in the case of Riemannian manifolds the MTW condition for the usual quadratic cost is so restrictive that it implies that the manifold has nonnegative sectional curvature \citep{loeper2009regularity}, which rules out hyperbolic spaces. However, a recent sequence of remarkable results \citep{lee2012examples, Li2009SmoothOT} proves that for simple variations of the Riemannian metric on hyperbolic spaces, smoothness is again guaranteed:
Theorem F.2 (Lee and Li, \citep{lee2012examples}).
Let $d$ be the Riemannian distance function on a manifold of constant sectional curvature $-1$; then the cost functions $-\cosh \circ d$ and $-\log \circ (1 + \cosh) \circ d$ satisfy the strong MTW condition, and the cost functions $\pm \log \circ \cosh \circ d$ satisfy the weak MTW condition.
Thus, these cost objectives can be used in our hyperbolic optimal transport matching setting in the hope of obtaining a smoother solution, and therefore a more stable set of correspondences.