An Optimal Ancestry Labeling Scheme with Applications to XML Trees and Universal Posets

Preliminary results of this paper appeared in the proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), 2010, the 21st ACM-SIAM Symposium on Discrete Algorithms (SODA), 2010, and the 21st ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2009, as part of [21, 22, 23]. This research is supported in part by the ANR project DISPLEXITY, and by the INRIA project GANG.


Pierre Fraigniaud
CNRS and Univ. Paris Diderot
pierre.fraigniaud@liafa.univ-paris-diderot.fr
Amos Korman
CNRS and Univ. Paris Diderot
amos.korman@liafa.univ-paris-diderot.fr
Abstract

In this paper we solve the ancestry-labeling scheme problem which aims at assigning the shortest possible labels (bit strings) to nodes of rooted trees, so that ancestry queries between any two nodes can be answered by inspecting their assigned labels only. This problem was introduced more than twenty years ago by Kannan et al. [STOC ’88], and is among the most well-studied problems in the field of informative labeling schemes. We construct an ancestry-labeling scheme for $n$-node trees with label size $\log n + O(\log\log n)$ bits, thus matching the $\log n + \Omega(\log\log n)$ bits lower bound given by Alstrup et al. [SODA ’03]. Our scheme is based on a simplified ancestry scheme that operates extremely well on a restricted set of trees. In particular, for the set of $n$-node trees with depth at most $d$, the simplified ancestry scheme enjoys label size of $\log n + 2\log d + O(1)$ bits. Since the depth of most XML trees is at most some small constant, such an ancestry scheme may be of practical use. In addition, we also obtain an adjacency-labeling scheme that labels $n$-node trees of depth $d$ with labels of size $\log n + 3\log d + O(1)$ bits. All our schemes assign the labels in linear time, and guarantee that any query can be answered in constant time.

Finally, our ancestry scheme finds applications to the construction of small universal partially ordered sets (posets). Specifically, for any fixed integer $k$, it enables the construction of a universal poset of size $\tilde{O}(n^k)$ for the family of $n$-element posets with tree-dimension at most $k$. Up to lower order terms, this bound is tight thanks to a lower bound of $n^{k-o(1)}$ due to Alon and Scheinerman [Order ’88].

1 Introduction

1.1 Background and motivation

How to represent a graph in a compact manner is a fundamental data structure question. In most traditional graph representations, the names (or identifiers) given to the nodes serve merely as pointers to entries in a data structure, and thus reveal no information about the graph structure per se. Hence, in a sense, some memory space is wasted for the storage of content-less data. In contrast, Kannan et al. [36] introduced the notion of informative labeling schemes, which involves a mechanism for assigning short, yet informative, labels to nodes. Specifically, the goal of such schemes is to assign labels to nodes in such a way that allows one to infer information regarding any two nodes directly from their labels. As explicated below, one important question in the framework of informative labeling schemes is how to efficiently encode the ancestry relation in trees. This is formalized as follows.

The ancestry-labeling scheme problem:

Given any $n$-node rooted tree $T$, label the nodes of $T$ using the shortest possible labels (bit strings) such that, given any pair of nodes $u$ and $v$ in $T$, one can determine whether $u$ is an ancestor of $v$ in $T$ by merely inspecting the labels of $u$ and $v$.


The following simple ancestry-labeling scheme was suggested in [36]. Given a rooted $n$-node tree $T$, perform a DFS traversal in $T$ starting at the root, and provide each node $u$ with a DFS number, $\mathrm{dfs}(u)$, in the range $[1, n]$. (Recall that, in a DFS traversal, a node is visited before any of its children; thus, the DFS number of a node is smaller than the DFS number of any of its descendants.) The label of a node $u$ is simply the interval $I(u) = [\mathrm{dfs}(u), \mathrm{dfs}(u')]$, where $u'$ is the descendant of $u$ with the largest DFS number. An ancestry query then amounts to an interval containment query between the corresponding labels: a node $u$ is an ancestor of a node $v$ if and only if $I(v) \subseteq I(u)$. Clearly, the label size, namely, the maximal number of bits in a label assigned by this ancestry-labeling scheme to any node in any $n$-node tree, is bounded by $2\log n$ bits. (All logarithms in this paper are taken in base 2.)
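The DFS-interval scheme of [36] described above can be sketched as follows. This is only an illustrative sketch: the dictionary encoding of the tree and the function names `dfs_interval_labels` and `is_ancestor` are assumptions made for the example, not notation from the paper.

```python
from typing import Dict, Hashable, List, Tuple

# Hypothetical tree encoding: node -> list of its children.
Tree = Dict[Hashable, List[Hashable]]


def dfs_interval_labels(tree: Tree, root: Hashable) -> Dict[Hashable, Tuple[int, int]]:
    """Marker: label node u with (dfs(u), dfs(u')), DFS numbers taken in 1..n."""
    labels: Dict[Hashable, Tuple[int, int]] = {}
    start: Dict[Hashable, int] = {}
    counter = 0
    # Iterative preorder traversal; the second visit of a node closes its interval.
    stack = [(root, False)]
    while stack:
        node, closing = stack.pop()
        if closing:
            # counter is now the largest DFS number inside node's subtree.
            labels[node] = (start[node], counter)
            continue
        counter += 1
        start[node] = counter
        stack.append((node, True))
        for child in reversed(tree.get(node, [])):
            stack.append((child, False))
    return labels


def is_ancestor(label_u: Tuple[int, int], label_v: Tuple[int, int]) -> bool:
    """Decoder: u is an ancestor of v iff the interval of v lies inside that of u."""
    return label_u[0] <= label_v[0] and label_v[1] <= label_u[1]
```

Each label consists of two integers in $[1, n]$, which is where the $2\log n$ bound on the label size comes from; the decoder sees only the two labels and never the tree.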

The $2\log n$-bit scheme of [36] initiated extensive research [1, 3, 11, 34, 35, 43] whose goal was to reduce the label size of ancestry-labeling schemes as much as possible. The main motivation behind these works lies in the fact that a small improvement in the label size of ancestry-labeling schemes may contribute to a significant improvement in the performance of XML search engines. Indeed, to implement sophisticated queries, XML documents are viewed as labeled trees, and typical queries over the documents amount to testing relationships between document items, which correspond to ancestry queries among the corresponding tree nodes [2, 18, 48, 49]. XML search engines process such queries using an index structure that summarizes this ancestry information. To allow good performance, a large portion of the XML index structure resides in the main memory. Hence, the length of the labels is a main factor which determines the index size. Thus, due to the enormous size of the Web data, even a small reduction in the label size may contribute significantly to both memory cost reduction and performance improvement. A detailed explanation regarding this application can be found in various papers on ancestry-labeling schemes (see, e.g., [1, 35]).

In [5], Alstrup et al. proved a lower bound of $\log n + \Omega(\log\log n)$ bits for the label size of an ancestry-labeling scheme. On the other hand, thanks to a scheme by Abiteboul et al. [1], the current state of the art upper bound is $\log n + O(\sqrt{\log n})$ bits. Thus, a large gap is still left between the best known upper and lower bounds on the label size. The main result of this paper closes the gap. This is obtained by constructing an ancestry-labeling scheme whose label size matches the aforementioned lower bound.

Our scheme is based on a simplified ancestry scheme that operates extremely well on a restricted set of trees. In particular, for the set of $n$-node trees with depth at most $d$, the simplified ancestry scheme enjoys label size of $\log n + 2\log d + O(1)$ bits. This result can be of independent interest for XML search engines, as a typical XML tree has extremely small depth (cf. [14, 17, 39, 38]). For example, by examining about 200,000 XML documents on the Web, Mignet et al. [38] found that the average depth of an XML tree is 4, and that 99% of the trees have depth at most 8. Similarly, Denoyer and Gallinari [17] collected about 650,000 XML trees taken from the Wikipedia collection (XML trees taken from the Wikipedia collection actually have relatively larger depth compared to usual XML trees [16]), and found that the average depth of a node is 6.72.

In addition, our ancestry-labeling scheme on arbitrary trees finds applications in the context of universal partially ordered sets (posets). Specifically, the bound on the label size translates to an upper bound on the size of the smallest universal poset for the family of all $n$-element posets with tree-dimension at most $k$ (see Section 2 for the definitions). It is not difficult to show that the smallest size of such a universal poset is at most $n^{2k}$. On the other hand, it follows from a result by Alon and Scheinerman [12] that this size is also at least $n^{k-o(1)}$. As we show, it turns out that the real bound is much closer to this lower bound than to the upper bound.

1.2 Related work

1.2.1 Labeling schemes

As mentioned before, following the $2\log n$-bit ancestry-labeling scheme in [36], a considerable amount of research has been devoted to improving the upper bound on the label size as much as possible. Specifically, [3] gave a first non-trivial upper bound of $\frac{3}{2}\log n + O(\log\log n)$ bits. In [34], a scheme with label size $\log n + O(d\sqrt{\log n})$ bits was constructed to detect ancestry only between nodes at distance at most $d$ from each other. An ancestry-labeling scheme with label size of $\log n + O(\log n/\log\log n)$ bits was given in [43]. The current state of the art upper bound of $\log n + O(\sqrt{\log n})$ bits was given in [11] (that scheme is described in detail in the journal publication [1] joint with [3]). Following the aforementioned results on ancestry-labeling schemes for general rooted trees, [35] gave an experimental comparison of different ancestry-labeling schemes over XML tree instances that appear in “real life”.

The ancestry relation is the transitive closure of the parenthood relation. Hence, the following parenthood-labeling scheme problem is inherently related to the ancestry-labeling scheme problem: given a rooted tree $T$, label the nodes of $T$ in the most compact way such that one can determine whether $u$ is a parent of $v$ in $T$ by merely inspecting the corresponding labels. The parenthood-labeling scheme problem was also introduced in [36], and a very simple parenthood scheme was constructed there, using labels of size at most $2\log n$ bits. (Actually, [36] considered adjacency-labeling schemes in trees rather than parenthood-labeling schemes; however, such schemes are equivalent up to a constant number of bits in the label size. To see this equivalence, observe that one can construct a parenthood-labeling scheme from an adjacency-labeling scheme in trees, as follows. Given a rooted tree $T$, first label the nodes of $T$ using the adjacency-labeling scheme (which ignores the fact that $T$ is rooted). Then, for each node $u$, in addition to the label given to it by the adjacency-labeling scheme, add two more bits, for encoding $d(u)$, the distance from $u$ to the root, calculated modulo 3. Now the parenthood-labeling scheme follows by observing that for any two nodes $u$ and $v$ in a tree, $u$ is a parent of $v$ if and only if $u$ and $v$ are adjacent and $d(u) = d(v) - 1$ modulo 3.) By now, the parenthood-labeling scheme problem is almost completely closed thanks to Alstrup and Rauhe [10], who constructed a parenthood scheme for $n$-node trees with label size $\log n + O(\log^* n)$ bits. In particular, this bound indicates that encoding ancestry in trees is strictly more costly than encoding parenthood.
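The reduction from parenthood to adjacency described above only needs two extra bits per node. A minimal sketch follows, assuming a parent-pointer encoding of the tree; the helper names `depth_mod3` and `is_parent` are hypothetical, and the boolean `adjacent` stands in for the answer of whatever adjacency-labeling decoder is used.

```python
from typing import Dict, Hashable, Optional


def depth_mod3(parent: Dict[Hashable, Optional[Hashable]], node: Hashable) -> int:
    """Distance from node to the root, modulo 3 (fits in two bits)."""
    depth = 0
    while parent[node] is not None:
        node = parent[node]
        depth += 1
    return depth % 3


def is_parent(adjacent: bool, du: int, dv: int) -> bool:
    """u is the parent of v iff they are adjacent and d(u) = d(v) - 1 modulo 3."""
    return adjacent and du == (dv - 1) % 3
```

Two adjacent nodes always differ in depth by exactly one, so working modulo 3 is enough to tell which of the two is closer to the root.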

Adjacency labeling schemes were studied for other types of graphs, including general graphs [9], bounded degree graphs [4], and planar graphs [24]. Informative labeling schemes were also proposed for other graph problems, including distance [5, 26, 42], routing [20, 43], flow [31, 33], vertex connectivity [27, 31, 32], and nearest common ancestor in trees [6, 8, 40].

Very recently, Dahlgaard et al. [15] and Alstrup et al. [7] claim to provide asymptotically optimal schemes for the ancestry problem and the adjacency problem on trees, respectively.

1.2.2 Universal posets

When considering infinite posets, it is known that a countable universal poset for the family of all countable posets exists. This classical result was proved several times [19, 29, 30] and, in fact, as mentioned in [28], has motivated the whole research area of category theory.

We later give a simple relation between the label size of consistent ancestry-labeling schemes and the size of universal posets for the family of all $n$-element posets with tree-dimension at most $k$ (see Section 2 for the corresponding definitions). The $2\log n$-bit ancestry-labeling scheme of [36] is consistent, and thus it provides further evidence for the existence of a universal poset with $n^{2k}$ elements for the family of all $n$-element posets with tree-dimension at most $k$. It is not clear whether the ancestry-labeling schemes in [3, 11, 34, 43] can be somewhat modified to be consistent and still maintain the same label size. However, even if this is indeed the case, the universal poset for the family of all $n$-element posets with tree-dimension at most $k$ that would be obtained from those schemes would be of size $\Omega(n^k 2^{k\sqrt{\log n}})$.

The lower bound of [5] implies a lower bound of $n\log^{\Omega(1)} n$ for the number of elements in a universal poset for the family of $n$-element posets with tree-dimension 1. As mentioned earlier, for fixed $k$, the result of Alon and Scheinerman [12] implies a lower bound of $n^{k-o(1)}$ for the number of elements in a universal poset for the family of $n$-element posets with tree-dimension at most $k$.

1.3 Our contributions

The main result of this paper provides an ancestry-labeling scheme for $n$-node rooted trees, whose label size is $\log n + O(\log\log n)$ bits. This scheme assigns the labels to the nodes of any tree in linear time and guarantees that any ancestry query is answered in constant time. By doing this, we solve the ancestry-labeling scheme problem which is among the main open problems in the field of informative labeling schemes.

Our main scheme is based on a simplified ancestry scheme that is particularly efficient on a restricted set of trees, which includes the set of $n$-node trees with depth at most $d$. For such trees, the simplified ancestry scheme enjoys label size of $\log n + 2\log d + O(1)$ bits. A simple trick allows us to use this latter ancestry-labeling scheme for designing a parenthood-labeling scheme for $n$-node trees of depth at most $d$ using labels of size $\log n + 3\log d + O(1)$ bits. Each of these two schemes assigns the labels to the nodes of any tree in linear time. The schemes also guarantee that the corresponding queries are answered in constant time.

Our schemes rely on two novel tree-decompositions. The first decomposition, called spine decomposition, bears similarities with the classical heavy-path decomposition of Sleator and Tarjan [41]. It is used for the construction of our simplified ancestry-labeling scheme. Our main ancestry-labeling scheme uses another tree-decomposition, called folding decomposition. The spine decomposition of the folding decomposition of any tree has a crucial property, that is heavily exploited in the construction of our main labeling scheme.

Finally, we establish a simple relation between compact ancestry-labeling schemes and small universal posets. Specifically, we show that there exists a consistent ancestry-labeling scheme for $n$-node forests with label size $\ell$ if and only if, for any integer $k \ge 1$, there exists a universal poset with $2^{k\ell}$ elements for the family of $n$-element posets with tree-dimension at most $k$. Using this equivalence, and slightly modifying our ancestry-labeling scheme, we prove that for any integer $k$, there exists a universal poset of size $\tilde{O}(n^k)$ for the family of all $n$-element posets with tree-dimension at most $k$. Up to lower order terms, this bound is tight (the $\tilde{O}$ notation hides polylogarithmic terms).

1.4 Outline

Our paper is organized as follows. Section 2 provides the essential definitions, including the definition of the spine decomposition. In Section 3 we describe our labeling schemes designed for a restricted family of trees, which includes trees of bounded depth. The main result regarding the construction of the optimal ancestry-labeling scheme is presented in Section 4. Our result concerning small universal posets appears in Section 5. Finally, in Section 6, we conclude our work and introduce some directions for further research on randomized labeling schemes.

2 Preliminaries

Let $T$ be a rooted tree, i.e., a tree with a designated node $r$ referred to as the root of $T$. A rooted forest is a forest consisting of several rooted trees. The depth of a node $u$ in some (rooted) tree $T$ is defined as the smallest number of nodes on the path leading from $u$ to the root. In particular, the depth of the root is 1. The depth of a rooted tree is defined as the maximum depth over all its nodes, and the depth of a rooted forest is defined as the maximum depth over all the trees in the forest.

For two nodes $u$ and $v$ in a rooted tree $T$, we say that $u$ is an ancestor of $v$ if $u$ is one of the nodes on the shortest path in $T$ connecting $v$ and the root $r$. (An ancestor of $v$ can be $v$ itself; whenever we consider an ancestor $u$ of a node $v$, where $u \neq v$, we refer to $u$ as a strict ancestor of $v$.) For two nodes $u$ and $v$ in some (rooted) forest $F$, we say that $u$ is an ancestor of $v$ in $F$ if and only if $u$ and $v$ belong to the same rooted tree in $F$, and $u$ is an ancestor of $v$ in that tree. A node $v$ is a descendant of $u$ if and only if $u$ is an ancestor of $v$. For every non-root node $u$, let $\mathrm{parent}(u)$ denote the parent of $u$, i.e., the ancestor of $u$ at distance 1 from it.

The size of $T$, denoted by $|T|$, is the number of nodes in $T$. The weight of a node $u \in T$, denoted by $\mathrm{weight}(u)$, is defined as the number of descendants of $u$, i.e., $\mathrm{weight}(u)$ is the size of the subtree hanging down from $u$. In particular, the weight of the root is $\mathrm{weight}(r) = |T|$.

For every integer $n$, let $\mathcal{T}(n)$ denote the family of all rooted trees of size at most $n$, and let $\mathcal{F}(n)$ denote the family of all forests of rooted trees, where each forest in $\mathcal{F}(n)$ has at most $n$ nodes.

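For two integers $a \le b$, let $[a, b]$ denote the set of integers $\{a, a+1, \dots, b\}$. (For $a < b$, we sometimes use the notation $[a, b)$ which simply denotes the set of integers $\{a, a+1, \dots, b-1\}$.) We refer to this set as an interval. For two intervals $I = [a, b]$ and $I' = [a', b']$, we say that $I \prec I'$ if $b < a'$. The size of an interval $I = [a, b]$ is $|I| = b - a + 1$, namely, the number of integers in $I$.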

2.1 The spine decomposition

Our ancestry scheme uses a novel decomposition of trees, termed the spine decomposition (see Figure 1). This decomposition bears similarities to the classical heavy-path decomposition of Sleator and Tarjan [41]. Informally, the spine decomposition is based on a path called the spine, which starts from the root and goes down the tree along the heavy-path until reaching a node whose heavy child has less than half the number of nodes in the tree. This is in contrast to the heavy-path, which goes down until reaching a node $v$ whose heavy child has less than half the number of nodes in the subtree rooted at $v$. Note that the length of a spine is always at most the length of the heavy-path but can be considerably smaller. (For example, by augmenting a complete binary tree making it slightly unbalanced, one can create a tree with heavy-path of length $\Omega(\log n)$ while its spine is of length $O(1)$.)

Formally, given a tree $T$ in some forest $F$, we define the spine of $T$ as the following path $S$. Assume that each node holds its weight (these weights can easily be computed in linear time). We define the construction of $S$ iteratively. In the $i$th step, assume that the path $S$ contains the vertices $v_1, v_2, \dots, v_i$, where $v_1$ is the root $r$ of $T$ and $v_j$ is a child of $v_{j-1}$, for $1 < j \le i$. If the weight of a child of $v_i$ is more than half the weight of the root $r$ then this child is added to $S$ as $v_{i+1}$. (Note, there can be at most one such child of $v_i$.) Otherwise, the construction of $S$ stops. (Note that the spine may consist of only one node, namely, the root of $T$.) Let $v_1, v_2, \dots, v_s$ be the nodes of the spine $S$ (Node $v_1$ is the root $r$, and $v_s$ is the last node added to the spine). It follows from the definition that if $1 \le i < j \le s$ then $v_i$ is a strict ancestor of $v_j$. The size of the spine $S$ is $s$. We split the nodes of the spine into two types. Specifically, the root of $T$, namely $v_1$, is called the apex node, while all other spine nodes, namely, $v_2, v_3, \dots, v_s$, are called heavy nodes. (Recall that the weight of each heavy node is larger than half the weight of the apex node.)

By removing the nodes in the spine $S$ (and the edges connected to them), the tree $T$ breaks into $s$ forests $F_1, F_2, \dots, F_s$, such that the following properties hold for each $1 \le i \le s$:

  • P1. In $T$, the roots of the trees in $F_i$ are connected to $v_i$;

  • P2. Each tree in $F_i$ contains at most $|T|/2$ nodes;

  • P3. The forests $F_i$ are unrelated in terms of the ancestry relation in $T$.
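The construction of a single spine and of the forests hanging from it can be sketched as below. This is a hedged illustration only: the child-list encoding of the tree and the names `subtree_weights`, `spine` and `hanging_forests` are assumptions of the example, and each forest is represented simply by the list of its tree roots.

```python
from typing import Dict, Hashable, List

Tree = Dict[Hashable, List[Hashable]]  # hypothetical encoding: node -> children


def subtree_weights(tree: Tree, root: Hashable) -> Dict[Hashable, int]:
    """weight(u) = number of nodes in the subtree rooted at u, u included."""
    order = [root]
    for node in order:                 # the list grows while we scan it (BFS order)
        order.extend(tree.get(node, []))
    weight: Dict[Hashable, int] = {}
    for node in reversed(order):       # children are handled before their parent
        weight[node] = 1 + sum(weight[c] for c in tree.get(node, []))
    return weight


def spine(tree: Tree, root: Hashable, weight: Dict[Hashable, int]) -> List[Hashable]:
    """Follow the unique child whose weight exceeds half the weight of the ROOT."""
    path = [root]
    while True:
        heavy = [c for c in tree.get(path[-1], []) if 2 * weight[c] > weight[root]]
        if not heavy:
            return path
        path.append(heavy[0])          # at most one child can be this heavy


def hanging_forests(tree: Tree, path: List[Hashable]) -> List[List[Hashable]]:
    """Roots of the trees of F_i: the children of spine node v_i off the spine."""
    on_spine = set(path)
    return [[c for c in tree.get(v, []) if c not in on_spine] for v in path]
```

The threshold is always taken with respect to the root of the tree, not the current node, which is the difference from the heavy-path rule and the reason every tree left in a forest has at most half of the nodes (Property P2).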

Figure 1: Spine decomposition

The spine decomposition is constructed iteratively, where each level of the process follows the aforementioned description. That is, given a forest $F$, after specifying the spine $S$ of each tree $T$ in $F$, we continue to the next level of the process, operating in parallel on the forests $F_1, F_2, \dots, F_s$. The recursion implies that each node is eventually classified as either apex or heavy. The depth of the spine decomposition of a forest $F$, denoted $\text{Spine-Depth}(F)$, is the maximal size of a spine, taken over all spines obtained in the spine decomposition of $F$. Note that $\text{Spine-Depth}(F)$ is bounded from above by the depth of $F$.
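The recursive process and the resulting decomposition depth can be sketched as follows. The function name `spine_decomposition_depth` and the child-list encoding of the forest are assumptions made for illustration; the sketch recomputes nothing between levels because removing a spine does not change the subtrees hanging below it.

```python
from typing import Dict, Hashable, List


def spine_decomposition_depth(tree: Dict[Hashable, List[Hashable]],
                              roots: List[Hashable]) -> int:
    """Largest spine size over all spines of the recursive spine decomposition."""
    # Subtree sizes in the original forest (valid at every level of the process).
    order = list(roots)
    for node in order:
        order.extend(tree.get(node, []))
    weight: Dict[Hashable, int] = {}
    for node in reversed(order):
        weight[node] = 1 + sum(weight[c] for c in tree.get(node, []))

    best = 0
    pending = list(roots)              # roots of trees that still need a spine
    while pending:
        root = pending.pop()
        path = [root]
        while True:
            kids = tree.get(path[-1], [])
            heavy = [c for c in kids if 2 * weight[c] > weight[root]]
            # Children off the spine are roots of the next level's trees.
            pending.extend(c for c in kids if c not in heavy)
            if not heavy:
                break
            path.append(heavy[0])
        best = max(best, len(path))
    return best
```

On a path of 8 nodes the first spine has 4 nodes, while on a complete binary tree every spine is a single node, which illustrates how far the decomposition depth can fall below the depth of the tree.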

For any two integers $n$ and $d$, let $\mathcal{F}(n, d)$ denote the set of all rooted forests with at most $n$ nodes, whose spine decomposition depth is at most $d$.

2.2 Ancestry labeling schemes

An ancestry-labeling scheme $(\mathcal{M}, \mathcal{D})$ for a family $\mathcal{F}$ of forests of rooted trees is composed of the following components:

  1. A marker algorithm $\mathcal{M}$ that assigns labels (i.e., bit strings) to the nodes of all forests in $\mathcal{F}$.

  2. A decoder algorithm $\mathcal{D}$ that, given any two labels $\ell_1$ and $\ell_2$ in the output domain of $\mathcal{M}$, returns a boolean value $\mathcal{D}(\ell_1, \ell_2)$.

These components must satisfy that if $L(u)$ and $L(v)$ denote the labels assigned by the marker algorithm to two nodes $u$ and $v$ in some rooted forest $F \in \mathcal{F}$, then

$$\mathcal{D}(L(u), L(v)) = 1 \iff u \text{ is an ancestor of } v \text{ in } F.$$

It is important to note that the decoder algorithm $\mathcal{D}$ is independent of the forest $F$. That is, given the labels of two nodes, the decoder algorithm decides the ancestry relationship between the corresponding nodes without knowing to which forest in $\mathcal{F}$ they belong.

The most common complexity measure used for evaluating an ancestry-labeling scheme is the label size, that is, the maximum number of bits in a label assigned by $\mathcal{M}$, taken over all nodes in all forests in $\mathcal{F}$. When considering the query time of the decoder algorithm, we use the RAM model of computation, and assume that the length of a computer word is $\Theta(\log n)$ bits. Similarly to previous works on ancestry-labeling schemes, our decoder algorithm uses only basic RAM operations (which are assumed to take constant time). Specifically, the basic operations used by our decoder algorithm are the following: addition, subtraction, multiplication, division, left/right shifts, less-than comparisons, and extraction of the index of the least significant 1-bit.

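Let $\mathcal{F}$ be a family of forests of rooted trees. We say that an ancestry-labeling scheme $(\mathcal{M}, \mathcal{D})$ for $\mathcal{F}$ is consistent if the decoder algorithm $\mathcal{D}$ satisfies the following conditions, for any three pairwise different labels $\ell_1$, $\ell_2$ and $\ell_3$ in the output domain of $\mathcal{M}$:

  • Anti-symmetry: if $\mathcal{D}(\ell_1, \ell_2) = 1$ then $\mathcal{D}(\ell_2, \ell_1) = 0$, and

  • Transitivity: if $\mathcal{D}(\ell_1, \ell_2) = 1$ and $\mathcal{D}(\ell_2, \ell_3) = 1$ then $\mathcal{D}(\ell_1, \ell_3) = 1$.

Note that by the definition of an ancestry-labeling scheme $(\mathcal{M}, \mathcal{D})$, the decoder algorithm $\mathcal{D}$ trivially satisfies the two conditions above if $\ell_i = L(u_i)$ for $i = 1, 2, 3$, and $u_1$, $u_2$ and $u_3$ are different nodes belonging to the same forest in $\mathcal{F}$.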
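Consistency is a property of the decoder over its whole label domain, not of any single forest. The brute-force check below is a sketch of that idea on a small domain; the names `is_consistent`, `contains` and `overlaps` are hypothetical, and the domain of all intervals inside $[1, 5]$ is chosen only to keep the example small.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple

Label = Tuple[int, int]


def is_consistent(labels: Sequence[Label],
                  decoder: Callable[[Label, Label], bool]) -> bool:
    """Check anti-symmetry and transitivity over all tuples of distinct labels."""
    for a, b in permutations(labels, 2):
        if decoder(a, b) and decoder(b, a):
            return False               # anti-symmetry violated
    for a, b, c in permutations(labels, 3):
        if decoder(a, b) and decoder(b, c) and not decoder(a, c):
            return False               # transitivity violated
    return True


def contains(outer: Label, inner: Label) -> bool:
    """Interval containment, the decoder of the DFS-interval scheme."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def overlaps(first: Label, second: Label) -> bool:
    """A deliberately bad decoder: symmetric, hence not anti-symmetric."""
    return first[0] <= second[1] and second[0] <= first[1]


# Every interval [a, b] with 1 <= a <= b <= 5.
domain = [(a, b) for a in range(1, 6) for b in range(a, 6)]
```

Running `is_consistent(domain, contains)` succeeds because containment is a partial order on intervals, whereas `overlaps` fails already on the anti-symmetry test.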

2.3 Small universal posets

The size of a partially ordered set (poset) is the number of elements in it. A poset $(X, \le_X)$ contains a poset $(Y, \le_Y)$ as an induced suborder if there exists an injective mapping $\phi : Y \to X$ such that for any two elements $a, b \in Y$, we have

$$a \le_Y b \iff \phi(a) \le_X \phi(b).$$

A poset $(X, \le)$ is called universal for a family of posets $\mathcal{P}$ if $(X, \le)$ contains every poset in $\mathcal{P}$ as an induced suborder. If $(X, \le)$ and $(X, \le')$ are orders on the set $X$, we say that $(X, \le')$ is an extension of $(X, \le)$ if, for any two elements $x, y \in X$,

$$x \le y \implies x \le' y.$$

A common way to characterize a poset $(X, \le)$ is by its dimension, that is, the smallest number of linear (i.e., total order) extensions of $(X, \le)$ the intersection of which gives rise to $(X, \le)$ [44]. The following fact is folklore, and its proof is straightforward (the proof is nevertheless stated for the sake of completeness).

Fact 1

The smallest size of a universal poset for the family of $n$-element posets with dimension at most $k$ is at most $n^k$.

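Proof.  Let $\le$ be the natural total order defined on the set of integers. We present a universal poset $(U, \preceq)$ for the family of $n$-element posets with dimension at most $k$. The set of elements is

$$U = [1, n]^k = \{(x_1, \dots, x_k) : x_i \in [1, n] \text{ for all } i \in [1, k]\},$$

and the relation $\preceq$ is defined for two elements $x, y \in U$ by:

$$x \preceq y \iff x_i \le y_i \text{ for all } i \in [1, k].$$

Clearly $U$ has $n^k$ elements. Now consider any $n$-element poset $(X, \trianglelefteq)$ with dimension at most $k$. For $i \in [1, k]$, let $(L_i, \le_i)$ be the total orders the intersection of which gives rise to $(X, \trianglelefteq)$. By the definition of intersection, there exists a collection of injective mappings $\psi_i : X \to L_i$ such that for any two elements $x, y \in X$, we have

$$x \trianglelefteq y \iff \psi_i(x) \le_i \psi_i(y) \text{ for all } i \in [1, k].$$

For every $i \in [1, k]$, since $(L_i, \le_i)$ is a total order, it is isomorphic to $([1, n], \le)$, that is, there exists an injective and onto mapping $\phi_i : L_i \to [1, n]$ such that for $a, b \in L_i$, we have

$$a \le_i b \iff \phi_i(a) \le \phi_i(b).$$

We define the mapping $f : X \to U$ so that for any $x \in X$, the $i$th coordinate of $f(x)$ is defined as $f(x)_i = \phi_i(\psi_i(x))$. The fact that $f$ preserves the order $\trianglelefteq$, i.e., the fact that, for every $x, y \in X$,

$$x \trianglelefteq y \iff f(x) \preceq f(y),$$

is now immediate.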
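The embedding used in this proof can be made concrete: each element is sent to the vector of its ranks in the $k$ linear extensions, and the order is recovered by comparing vectors coordinate by coordinate. The sketch below uses the hypothetical names `embed` and `dominated` and represents each linear extension as a list sorted from smallest to largest.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def embed(elements: Sequence[Hashable],
          linear_extensions: List[List[Hashable]]) -> Dict[Hashable, Tuple[int, ...]]:
    """Map x to the vector of its 1-based ranks in the k linear extensions."""
    ranks = [{x: i + 1 for i, x in enumerate(ext)} for ext in linear_extensions]
    return {x: tuple(r[x] for r in ranks) for x in elements}


def dominated(p: Tuple[int, ...], q: Tuple[int, ...]) -> bool:
    """Componentwise order on [1, n]^k."""
    return all(a <= b for a, b in zip(p, q))
```

For instance, the poset on {a, b, c} whose only strict relation is a below c is the intersection of the two total orders a, c, b and b, a, c, and its elements are mapped to the vectors (1, 2), (3, 1) and (2, 3).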

Another way of characterizing a poset is by its tree-dimension. A poset $(X, \le)$ is a tree [13, 46] if, for every pair $x$ and $y$ of incomparable elements in $X$, there does not exist an element $z \in X$ such that $x \le z$ and $y \le z$. (Note that the term “tree” for ordered sets is used in various senses in the literature, see e.g., [45]. Observe also that the Hasse diagram [44] of a tree poset is a forest of rooted trees.) The tree-dimension [13] of a poset $(X, \le)$ is the smallest number of tree extensions of $(X, \le)$ the intersection of which gives rise to $(X, \le)$.

For any two positive integers $n$ and $k$, let $\mathcal{P}(n, k)$ denote the family of all $n$-element (non-isomorphic) posets with tree-dimension at most $k$. The following fact follows rather directly from previous work.

Fact 2

Fix an integer $k$ and let $M(n)$ denote the smallest size of a universal poset for $\mathcal{P}(n, k)$. We have $n^{k - o(1)} \le M(n) \le n^{2k}$.

Proof.  The fact that the smallest size of a universal poset for $\mathcal{P}(n, k)$ is at most $n^{2k}$ follows from Fact 1, and from the well known fact that the dimension of a poset is at most twice its tree-dimension. (This follows from the fact that a tree-poset $(X, \le)$ has dimension at most 2. Indeed, consider the two linear orders for $(X, \le)$ obtained as follows. We perform two DFS traversals over the Hasse diagram of $(X, \le)$, which is a directed forest $F$, starting from the root in each tree in $F$, so as to provide every element $x$ with two DFS numbers, $\mathrm{DFS}_1(x)$ and $\mathrm{DFS}_2(x)$. $\mathrm{DFS}_1$ is arbitrary, and $\mathrm{DFS}_2$ reverses the order in which the trees are considered in $\mathrm{DFS}_1$, and in which the children are visited in $\mathrm{DFS}_1$, so that $x \le y$ if and only if $\mathrm{DFS}_1(x) \le \mathrm{DFS}_1(y)$ and $\mathrm{DFS}_2(x) \le \mathrm{DFS}_2(y)$.) For the other direction, Alon and Scheinerman showed that the number of non-isomorphic $n$-element posets with dimension at most $k$ is at least $n^{n(k - o(1))}/n!$ (this result is explicit in the proof of Theorem 1 in [12]). Since the dimension of a poset is at least its tree-dimension, this result of [12] yields also a lower bound on the number of non-isomorphic $n$-element posets with tree-dimension at most $k$; specifically, we have

$$\frac{n^{n(k - o(1))}}{n!} \le |\mathcal{P}(n, k)|.$$

On the other hand,

$$|\mathcal{P}(n, k)| \le \binom{M(n)}{n}$$

by definition of $M(n)$. Therefore, by combining the above two inequalities, it directly follows that $M(n) \ge n^{k - o(1)}$.
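The two-traversal argument used in this proof to show that a tree poset has dimension at most 2 can be checked on small forests. The sketch below is illustrative only: the function name `preorder_numbers` and the child-list encoding of the Hasse diagram are assumptions, and the second numbering is obtained by flipping both the order of the trees and the order of the siblings.

```python
from typing import Dict, Hashable, List


def preorder_numbers(children: Dict[Hashable, List[Hashable]],
                     roots: List[Hashable],
                     reverse: bool) -> Dict[Hashable, int]:
    """Preorder DFS numbers; `reverse` flips the order of trees and of siblings."""
    def ordered(nodes: List[Hashable]) -> List[Hashable]:
        # The stack pops from its end, so push in the opposite of the visit order.
        return list(nodes) if reverse else list(reversed(nodes))

    number: Dict[Hashable, int] = {}
    stack = ordered(roots)
    while stack:
        node = stack.pop()
        number[node] = len(number) + 1
        stack.extend(ordered(children.get(node, [])))
    return number
```

An ancestor is numbered before its descendants in both traversals, while two unrelated nodes are visited in opposite orders by the two traversals, so comparing both numbers recovers exactly the ancestry relation.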

3 Labeling schemes for forests with bounded spine decomposition depth

In this section we construct an efficient ancestry-labeling scheme for forests with bounded spine decomposition depth. Specifically, for forests with spine decomposition depth at most $d$, our scheme enjoys label size of $\log n + 2\log d + O(1)$ bits. (Note that the same bound holds also for forests with depth at most $d$.) Moreover, our scheme has $O(1)$ query time and $O(n)$ construction time.

3.1 Informal description

Let us first explain the intuition behind our construction. Similarly to the simple ancestry scheme in [36], we map the nodes of forests to a set of intervals $\mathcal{I}$, in a way that relates the ancestry relation in each forest with the partial order defined on intervals through containment. That is, the label of a node is simply an interval, and the decoder decides the ancestry relation between two given nodes using the interval containment test on the corresponding intervals. While the number of intervals used for the scheme in [36] is $\Theta(n^2)$, we manage to show that, if we restrict our attention to forests with spine decomposition depth bounded by $d$, then one can map the set of such forests to a set of intervals $\mathcal{I}$, whose size is only $|\mathcal{I}| = O(nd^2)$. Since a label is a pointer to an interval in $\mathcal{I}$, the bound of $\log n + 2\log d + O(1)$ bits for the label size follows. In fact, we actually manage to provide an explicit description of each interval, still using $\log n + 2\log d + O(1)$ bits, so as to achieve constant query time.

3.1.1 Intuition

Let $\mathcal{F}(n, d)$ be the family of all forests with at most $n$ nodes and spine decomposition depth at most $d$. The challenge of mapping the nodes of forests in $\mathcal{F}(n, d)$ to a small set of intervals $\mathcal{I}$ is tackled recursively, where the recursion is performed over the number of nodes. That is, for $k = 1, 2, \dots, \log n$, level $k$ of the recursion deals with forests of size at most $2^k$. When handling the next level of the recursion, namely level $k+1$, the difficult case is when we are given a forest $F$ containing a tree $T$ of size larger than $2^k$, i.e., $2^k < |T| \le 2^{k+1}$. Indeed, trees in $F$ of size at most $2^k$ are essentially handled at level $k$ of the recursion. To map the nodes of Tree $T$, we use the spine decomposition (see Subsection 2.1).

Recall the spine $S = (v_1, \dots, v_s)$ of $T$ and the forests $F_1, F_2, \dots, F_s$, obtained by removing $S$ from $T$. Broadly speaking, Properties P2 and P3 of the spine decomposition give hope that the forests $F_i$, $i = 1, 2, \dots, s$, could be mapped relying on the previous level $k$ of the recursion. Once we guarantee this, we map the nodes of the spine in a manner that respects the ancestry relations. That is, the interval associated with a spine node $v_i$ must contain all intervals associated with descendants of $v_i$ in $T$, which are, specifically, all the spine nodes $v_j$, for $j > i$, as well as all nodes in $F_j$, for $j \ge i$. Fortunately, the number of nodes on the spine is $s \le d$, hence we need to deal with only few such nodes.

The intervals in $\mathcal{I}$ are classified into $\log n$ levels. These interval levels correspond to the levels of the recursion in a manner to be described. Level $k$ of the recursion maps forests (of size at most $2^k$) into $\mathcal{I}_k$, the set of intervals of level at most $k$. In fact, even in levels of recursion higher than $k$, the nodes in forests containing only trees of size at most $2^k$ are mapped into $\mathcal{I}_k$. (In particular, a forest consisting of $n$ singleton nodes is mapped into $\mathcal{I}_1$.) Regarding level $k+1$, a forest of size at most $2^{k+1}$ contains at most one tree $T$, where $2^k < |T| \le 2^{k+1}$. In such a case, the nodes on the spine of $T$ are mapped to level-$(k+1)$ intervals, and the forests $F_1, F_2, \dots, F_s$ are mapped to $\mathcal{I}_k$.

As mentioned before, to have the ancestry relations in $T$ correspond to the inclusion relations in $\mathcal{I}$, the level-$(k+1)$ interval $I(v_i)$ to which some spine node $v_i$ is mapped must contain the intervals associated with nodes which are descendants of $v_i$ in $T$. In particular, $I(v_i)$ must contain all the intervals associated with the nodes in the forests $F_i, F_{i+1}, \dots, F_s$. Since the number of such intervals is at least $\sum_{j \ge i} |F_j|$ (note, this value can be close to $2^{k+1}$), the length of $I(v_i)$ must be relatively large. Moreover, since level-1 intervals are many (at least $n$, because they need to be sufficiently many to handle a forest containing $n$ singleton nodes), and since $\mathcal{I}$ contains all level-$k$ intervals, for values of $k = 1, 2, \dots, \log n$, we want the number of level-$k$ intervals to decrease with $k$, so that the total number of intervals in $\mathcal{I}$ will remain small (recall, we would like to have $|\mathcal{I}| = O(nd^2)$). Summing up the discussion above, in comparison with the set of level-$k$ intervals, we would like the set of level-$(k+1)$ intervals to contain fewer but wider intervals.

Example 1

Let us consider the example depicted in Figure 2. We have a tree $T$ of size roughly $2^{k+1}$, with two spine nodes $v_1$ and $v_2$ and two corresponding forests $F_1$ and $F_2$. We would like to map $v_2$ to some interval $I(v_2)$ and map all nodes in $F_2$ to intervals contained in $I(v_2)$. In addition, we would like to map $v_1$ to some interval $I(v_1)$ containing $I(v_2)$, and map all nodes in $F_1$ to intervals contained in $I(v_1) \setminus I(v_2)$.

Figure 2: Illustration of Example 1

The mapping in the above example can be done using the method in [36]. Specifically, $v_1$ is mapped to $I(v_1) = [1, |T|]$, and $v_2$ is mapped to $I(v_2) = [|T| - |F_2|, |T|]$. The nodes of $F_2$ are mapped to intervals, all of which are contained in $I(v_2)$, and the nodes of $F_1$ are mapped to intervals which are contained in $I(v_1) \setminus I(v_2)$. Note that this scheme guarantees that all intervals are contained in $[1, n]$. One of the crucial properties making this mapping possible is the fact that the interval $I(v_2) = [|T| - |F_2|, |T|]$ exists in the collection of intervals used in [36], for all possible sizes of $F_2$. Unfortunately, this property requires many intervals of level $k+1$, which is undesirable (the scheme in [36] uses $\Theta(n^2)$ intervals in total). In a sense, restricting the number of level-$(k+1)$ intervals costs us, for example, the inability to use an interval $I(v_2)$ that precisely covers the set of intervals associated with $F_2$. In other words, in some cases, $I(v_2)$ must strictly contain the range spanned by the intervals associated with $F_2$. In particular, we cannot avoid having $|I(v_2)| \ge |F_2| + x$, for some (perhaps large) positive $x$. In addition, the nodes in $F_1$ must be mapped to intervals contained in some range that is outside of $I(v_2)$ (say, to the left of Interval $I(v_2)$), and Node $v_1$ must be mapped to an interval $I(v_1)$ that contains all these intervals, as well as $I(v_2)$. Hence, we cannot avoid having $|I(v_1)| \ge |F_1| + |F_2| + x + x'$, for positive integers $x$ and $x'$. Therefore, the total slack (in this case, coming from $x$ and $x'$) does not only propagate over the $s$ spine nodes, but also propagates up the levels. One of the artifacts of this propagation is the fact that we can no longer guarantee that all intervals are contained in $[1, n]$ (as guaranteed by the scheme of [36]). Somewhat surprisingly, we still manage to choose the parameters to guarantee that all intervals in $\mathcal{I}$ are contained in the range $[1, N)$, where $N = O(n)$.

Being slightly more formal, we introduce a hierarchy of intervals called bins. A bin of level is an interval of length , i.e., , for some value to be described. Intuitively, the length corresponds to the smallest length of a bin for which our scheme enables the proper mapping of any forest of size at most to . It is important to note that this property is shift-invariant, that is, no matter where in this bin is, the fact that its length is at least  should guarantee that it can potentially contain all intervals associated with a forest of size at most . Because of the aforementioned unavoidable (non-negligible) slack that propagates up the levels, we must allow to increase with .

Figure 3: A level- interval

3.1.2 The intuition behind the tuning of the parameters

We now describe the set of intervals , and explain the intuition behind the specific choice of parameters involved. Consider a level , and fix a resolution parameter for interval-level , to be described later. Let and . The level- intervals are essentially all intervals in which are of the form:

(1)

See Figure 3. The resolution parameter is chosen to be monotonically increasing with in a manner that will guarantee fewer intervals of level , as  is increasing. Moreover, the largest possible length of an interval of level is , which is the length of a bin sufficient to accommodate the intervals of a tree of size at most . This length is monotonically increasing with the level , as desired.

Figure 4: Overview of the bins and intervals assignment in level

Consider now a bin of length located somewhere in . This bin should suffice for the mapping of a tree of size . By executing the spine decomposition, we obtain the spine nodes and the forests (see Figure 1). We allocate a level- interval to each spine node , and a bin to each forest , , in the same spirit as we did in the specific Example 1 (see Figure 2).

The general allocation is illustrated in Figure 4. Since is of the form , and should be included in Bin , and since this interval must contain all intervals assigned to nodes in , Bin is chosen to start at the leftmost multiple of in Bin . Note that contains trees of size at most each. Hence, by induction on , each of these trees, , can be properly mapped to any bin of size . Therefore, setting of size suffices to properly map all trees in . The bin , of size , is then chosen to start at the leftmost multiple of to the right of the end point of , in the bin . And so on: we set of size , and place it in so that it starts at the leftmost multiple of to the right of the end point of , . The level- intervals associated with the spine nodes are then set as follows. For , the interval starts at the left extremity of (which is a multiple of the resolution ). All these intervals end at the same point in , which is chosen as the leftmost multiple of to the right of , in . Putting the right end-point of at the point where all the intervals of spine nodes end suffices to guarantee that includes all the intervals , and all the bins , for .
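The allocation rule above can be illustrated by a minimal sketch. It is not the scheme's actual marker algorithm: the sub-bin sizes are passed in by the caller rather than derived from the paper's parameters, intervals are assumed to be half-open integer ranges, "to the right of" is read as "at or after", and the names `round_up_to_multiple` and `allocate_sub_bins` are hypothetical.

```python
def round_up_to_multiple(point, resolution):
    """Return the smallest multiple of `resolution` that is >= `point`."""
    return -(-point // resolution) * resolution


def allocate_sub_bins(bin_start, resolution, sub_bin_sizes):
    """Place sub-bins left to right inside a bin starting at `bin_start`.

    Assumptions (not taken from the paper): intervals are half-open
    (start, end) pairs of integers, and every size in `sub_bin_sizes`
    is positive. Each sub-bin starts at the leftmost multiple of
    `resolution` at or after the end of the previous sub-bin.
    Returns (sub_bins, spine_intervals).
    """
    sub_bins = []
    cursor = bin_start
    for size in sub_bin_sizes:
        start = round_up_to_multiple(cursor, resolution)
        sub_bins.append((start, start + size))
        cursor = start + size
    # All spine intervals share one right end-point, itself a multiple
    # of the resolution, located at or after the end of the last sub-bin.
    common_end = round_up_to_multiple(cursor, resolution)
    # The i-th spine interval starts where the i-th sub-bin starts, so
    # it contains the i-th sub-bin and every later spine interval.
    spine_intervals = [(start, common_end) for start, _ in sub_bins]
    return sub_bins, spine_intervals
```

In this sketch each rounding step wastes fewer than `resolution` positions, which is the source of the per-spine-node slack discussed next.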

Observe that the length of must satisfy , where the slack of comes from the fact that the interval must start at a multiple of the resolution . More generally, for , the length of must satisfy

Therefore, the length of must satisfy . Now, since may start at any point between two multiples of the resolution , we eventually get that setting the bin to be of length suffices. Since can be at most the spine decomposition depth , we must have be approximately . To agree with the latter approximation, we choose the values of so that:

(2)

Ultimately, we would like to map the whole -node forest to a bin of size . This bin must fit into ; hence, the smallest value that we can choose is . Since we also want the value of to be linear in , we choose the ’s so that . Specifically, for , we set

for some small . Note that for each , and the ’s are increasing with . Moreover, we take large enough so that the sum converges. Hence, all the ’s are bounded from above by some constant . In particular, , and thus . The fact that all ’s are bounded, together with Equation 2, explains why we choose

This choice for the resolution parameter implies that the number of level- intervals is

yielding a total of intervals in , as desired. In fact, in order to reduce the label size even further, by playing with the constant hidden in the big- notation, we actually choose less than a constant. Indeed, we will later pick

3.2 The ancestry scheme

We now turn to formally describe the desired ancestry-labeling scheme for . For simplicity, assume without loss of generality that is a power of 2.

3.2.1 The marker algorithm

We begin by defining the set of intervals. For integers and , let

where

For integer , let be defined as follows. Let , and, for any , , let

Note that the sum converges, and hence all the ’s are bounded from above by some constant

Let us set:

Then let , , and, for , let us set

Next, define the set of level- intervals:

Finally, define the set of intervals of level at most as

and let

Definition 1

Let . We say that a one-to-one mapping is a legal-containment mapping if, for every two nodes , we have

Note that since a legal-containment mapping is one-to-one, we get that if is a strict ancestor of in , then , and vice-versa.
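To make Definition 1 concrete, the following brute-force checker is a sketch under stated assumptions: the forest is given as a hypothetical `parent` dictionary (roots map to `None`), intervals are half-open integer pairs, and the condition is read as "for distinct nodes, one is an ancestor of the other if and only if its interval contains the other's". The function names are illustrative and do not come from the paper.

```python
def is_strict_ancestor(parent, u, v):
    """True if u is a strict ancestor of v in the forest `parent`."""
    v = parent[v]
    while v is not None:
        if v == u:
            return True
        v = parent[v]
    return False


def contains(outer, inner):
    """True if half-open interval `outer` contains `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def is_legal_containment(parent, interval):
    """Check the one-to-one and ancestry-iff-containment conditions.

    `parent` maps node -> parent (None for roots); `interval` maps
    node -> (start, end). Runs in time quadratic in the forest size
    times its depth, so it is only meant for small sanity checks.
    """
    nodes = list(parent)
    if len({interval[u] for u in nodes}) != len(nodes):
        return False  # the mapping is not one-to-one
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            if is_strict_ancestor(parent, u, v) != contains(interval[u], interval[v]):
                return False
    return True
```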

We first wish to show that there exists a legal-containment mapping from every forest in into . For this purpose, we introduce the concept of a bin, which is simply an interval of integers. For a bin , and for any integer , , we use the following notation:

I.e., is the set of all intervals of level at most which are contained in the bin .

Claim 1

Let be a forest, and let be pairwise-disjoint forests such that . Let be a bin and let be a partition of into  pairwise-disjoint bins, i.e., with for any . For any level , , if there exists a legal-containment mapping from to for every , , then there exists a legal-containment mapping from to .

Proof.  The proof follows directly from the definitions above. More specifically, for every integer , , we embed the forest into using a legal-containment mapping. For two nodes and in the same forest , the condition specified in Definition 1, namely, that is an ancestor of in if and only if , holds by the fact that each is embedded using a legal-containment mapping. On the other hand, if and are in two different forests and , then the condition holds simply because .
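The argument of Claim 1 can be sketched in code as follows. This is an illustration only, assuming half-open integer intervals and non-empty node intervals; `merge_mappings` is a hypothetical helper, not part of the scheme. The point is that non-empty intervals lying in disjoint bins are themselves disjoint, so no containment can arise between nodes of different forests.

```python
def merge_mappings(parts):
    """Combine per-forest mappings that live in pairwise-disjoint bins.

    `parts` is a list of (bin, mapping) pairs, where `bin` is a
    half-open (start, end) pair and `mapping` sends each node of one
    forest to a non-empty interval inside that bin. Raises ValueError
    if the bins overlap, an interval leaves its bin, or a node appears
    in two forests.
    """
    bins = sorted(b for b, _ in parts)
    for (_, end), (start, _) in zip(bins, bins[1:]):
        if start < end:
            raise ValueError("bins must be pairwise disjoint")
    merged = {}
    for (lo, hi), mapping in parts:
        for node, (a, b) in mapping.items():
            if not (lo <= a < b <= hi):
                raise ValueError("interval must be non-empty and inside its bin")
            if node in merged:
                raise ValueError("forests must be node-disjoint")
            merged[node] = (a, b)
    return merged
```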

We are now ready to state the main technical lemma of this section.

Lemma 1

For every , , every forest , and every bin , such that , there exists a legal-containment mapping from into . Moreover this mapping can be computed in time.

Proof.  We prove the lemma by induction on . The case is simple and can be verified easily. Assume now that the claim holds for with , and let us show that it also holds for . Let be a forest of size , and let be a bin, such that . Our goal is to show that there exists a legal-containment mapping of into . We consider two cases.

The simpler case: when all the trees in are of size at most . For this case, we show that there exists a legal-containment mapping of into for every bin such that . (Note that this claim is slightly stronger than what is stated in Lemma 1. Indeed, we show that the size of Bin can be only , which is smaller than (that is, the size required to prove the lemma) by an additive term of .)

Let be an arbitrary enumeration of the trees in . We divide the given bin of size into disjoint sub-bins , where for every , . This can be done because . By the induction hypothesis, we have a legal-containment mapping of into for every , . The stronger claim thus follows by Claim 1.

Observe that, in the above, the enumeration of the trees in was arbitrary. In the context of our general scheme described in the next section, it is important to enumerate these trees in a specific order. Once this order is fixed, we can implement the mapping of by choosing the disjoint sub-bins of , so that is “to the left” of , i.e., , for . This will guarantee that all the intervals associated with the nodes in are “to the left” of all the intervals associated with the nodes of , for every . We state this observation as a fact, for further reference in the next section.

Fact 3

Let be a positive integer. Let be an arbitrary enumeration of the trees in a forest , all of size at most , and let be a bin with . Then, our legal-containment mapping from into guarantees that for every and where , we have .
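The left-to-right placement behind Fact 3 can be sketched as below. The sub-bin sizes are supplied by the caller, since the exact size assigned to each tree depends on the scheme's parameters; the half-open interval convention and the name `split_bin_left_to_right` are assumptions made for illustration.

```python
def split_bin_left_to_right(bin_, sizes):
    """Cut consecutive sub-bins of the given sizes from the left of `bin_`.

    `bin_` is a half-open (start, end) pair. The i-th sub-bin ends
    exactly where the (i+1)-th begins, so every interval placed in an
    earlier sub-bin lies to the left of every interval placed in a
    later one. Raises ValueError if the sizes do not fit.
    """
    lo, hi = bin_
    if sum(sizes) > hi - lo:
        raise ValueError("sub-bins do not fit in the bin")
    sub_bins = []
    cursor = lo
    for size in sizes:
        sub_bins.append((cursor, cursor + size))
        cursor += size
    return sub_bins
```

Because the sub-bins follow the order of `sizes`, fixing the enumeration of the trees fixes the left-to-right order of their intervals.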

The more involved case: when one of the subtrees in , denoted by , contains more than nodes. Our goal now is to show that for every bin , where , there exists a legal-containment mapping of into . Indeed, once this is achieved we can complete the proof as follows. Let , and . Similarly to the simple case above, let and be two consecutive intervals in (starting at the leftmost point in ) such that and . Since we have a legal-containment mapping that maps into , and one that maps into , we get the desired legal-containment mapping of into by Claim 1. (The legal-containment mapping of into can be done by the induction hypothesis, because .)

For the rest of the proof, our goal is thus to prove the following claim:

Claim 2

For every tree of size , and every bin , where , there exists a legal-containment mapping of into .

In order to prove the claim, we use the spine decomposition described in Subsection 2.1. Recall the spine , and the corresponding forests . The given bin can be expressed as for some integer . We now describe how we allocate the sub-bins of so that, later, we will map each to .

The sub-bins of :

For every , we now define a bin associated with . Let us first define . Let be the smallest integer such that . We let

Assume now that we have defined the interval for . We define the interval as follows. Let be the smallest integer such that , that is