VOUTE-Virtual Overlays Using Tree Embeddings

Stefanie Roos, Martin Beck, Thorsten Strufe
TU Dresden
{stefanie.roos,martin.beck1,thorsten.strufe}@tu.dresden.de
Abstract

Friend-to-friend (F2F) overlays, which restrict direct communication to mutually trusted parties, are a promising substrate for privacy-preserving communication due to their inherent membership-concealment and Sybil-resistance. Yet, existing F2F overlays suffer from low performance, are vulnerable to denial-of-service attacks, or fail to provide anonymity. In particular, greedy embeddings allow highly efficient communication in arbitrary connectivity-restricted overlays but require communicating parties to reveal their identity. In this paper, we present a privacy-preserving routing scheme for greedy embeddings based on anonymous return addresses rather than identifying node coordinates. We prove that the presented algorithms are highly scalable with regard to the complexity of both the routing and the stabilization protocols. Furthermore, we show that the return addresses provide plausible deniability for both sender and receiver. We further enhance the routing’s resilience by using multiple embeddings and propose a method for efficient content addressing. Our simulation study on real-world data indicates that our approach is highly efficient and effectively mitigates failures as well as powerful denial-of-service attacks.

1 Introduction

Anonymous and censorship-resistant communication is essential for providing freedom of speech. In recent years, threats to this essential human right have emerged in western countries as well, mainly in the form of self-censorship caused by the fear of seemingly private communication being recorded (see http://www.theguardian.com/commentisfree/2013/jun/17/chilling-effect-nsa-surveillance-internet). Due to the natural vulnerability of publicly known servers to sabotage and corruption, completely distributed solutions for anonymous communication and content distribution are required. However, the openness of distributed systems presents a vulnerability, enabling attackers to infiltrate the system with a large number of forged participants, as can be seen, e.g., from attacks on the Tor [1] network in 2014 (see https://blog.torproject.org/blog/tor-security-advisory-relay-early-traffic-confirmation-attack).

F2F overlays circumvent the problem of connecting to permanently changing strangers by restricting connectivity to participants sharing a mutual real-world trust relationship. Hence, adversaries need to resort to social engineering attacks for infiltration of the network. However, large-scale privacy-preserving communication in F2F overlays requires additional measures to achieve anonymity, failure and attack resilience, and efficiency. Multiple studies have shown that deployed F2F overlays such as Freenet [2] are highly inefficient and vulnerable to attacks [3, 4]. Virtual overlays have been proposed as an efficient alternative [3, 5], but recent work has shown that they inherently require unacceptably high stabilization costs [6].

A potential solution is presented by greedy network embeddings such as [7, 8]. Greedy embeddings allow for highly efficient greedy routing in arbitrary connectivity-restricted overlays. For this purpose, they first construct a spanning tree of the network and then assign coordinates based on a node’s position in the spanning tree. However, a participant can only contact a non-trusted contact when knowing its coordinate in the network. Though only direct neighbors can directly map the embedding coordinate to a real-world identity, arbitrary participants can reconstruct the social graph based on the revealed coordinates. Participants can then easily be identified from the social graph structure [9] and correlated with their activities in the network due to the coordinate acting as a persistent pseudonym. In this manner, governmental and commercial institutions as well as curious strangers can track individual or all users and establish detailed profiles of their behavior. Possibly even their opinions and interests, published unencrypted in a supposedly anonymous manner, are revealed to the adversary. Thus, network embeddings in their unaltered form fail to provide receiver anonymity.

For utilizing the high efficiency of network embedding, our first requirement is a modified addressing and routing protocol that provides both efficiency and (receiver) anonymity. Second, due to the fragility of spanning trees in the presence of network dynamics and attacks, the resilience of the embedding to both failures and denial-of-service attacks needs to be drastically increased. Rather than only dropping messages, we assume that the adversary first strategically sabotages the embedding algorithm to maximize the impact of its censorship. Third, efficient content storage and retrieval requires the existence of a suitable content addressing scheme for network embeddings.

Our solution addresses the above problems by i) introducing anonymous return addresses to provide receiver anonymity, ii) constructing multiple embeddings and using backtracking during routing to increase the resilience, and iii) utilizing the network embedding to provide a routing protocol for a virtual overlay, thus avoiding the enormous stabilization costs of previous virtual overlays.

Our embedding algorithm assigns coordinates in the form of vectors of $b$-bit strings, so that nodes in the same subtree of the spanning tree share the same prefix, similar to the PIE embedding [8]. Rather than revealing the coordinate, the receiver then generates an anonymous return address by applying a hash cascade to the elements of the coordinate vector salted with a random seed. After publishing the return address and the seed, the receiver can be contacted efficiently without revealing any information that is not required for routing. Furthermore, a node can publish several anonymous return addresses by varying the seed. In this manner, it can construct distinct pseudonyms for distinct contexts, e.g., one pseudonym for each forum discussion it participates in. The revealed information can be further reduced by applying an additional layer of encryption at the price of reduced efficiency.

By routing in multiple embeddings, we aim to increase the probability of finding a route despite the disruption of routes in some but not all embeddings. We propose a purely local algorithm for the construction of multiple spanning trees of highly different structures to provide largely node-independent routes. In addition, backtracking and a modified distance are integrated into the routing to further improve resilience and avoid congestion.

We evaluate our solution both by a formal security analysis and an extensive simulation study. In the security analysis, we prove that a receiver can never be uniquely identified from a return address. Our simulation study indicates that our scheme is highly efficient compared to existing approaches in terms of the number of messages required for routing. Furthermore, the resilience is greatly improved. In fact, the routing terminates successfully in the vast majority of cases despite the presence of node failures or powerful attackers, which manipulate the embedding and routing as well as forge connections to honest nodes.

2 Related Work

Here, we describe the state-of-the-art with regard to routing and content discovery in F2F overlays.

The common characteristics of all F2F overlays are i) the restriction of connections to trusted parties, ii) hop-by-hop anonymization, i.e., the transfer of messages via a path of trusted nodes that rewrite the source tag of the message to point to themselves and apply probabilistic delays before forwarding a message, and iii) encryption of all communication. In the following, we present existing approaches, categorizing them according to their routing methodology into unstructured overlays, virtual overlays, and network embeddings. Routing is applied to either discover nodes based on network coordinates or, more commonly, content based on a content key or description.

Unstructured approaches utilize flooding, e.g., in Turtle [10], or probabilistic forwarding, e.g., in OneSwarm [11]. GNUnet attempts to combine random walks with deterministic routing [12]. These overlays focus on locating content rather than individual nodes. Due to the replication of content, the content can indeed be located, but efficient communication between two uniquely defined entities is not possible.

Virtual overlays address the problem of establishing an overlay despite the restricted connectivity by replacing overlay links with tunnels of trusted nodes. Efficient tunnel discovery and maintenance is thus a main concern given the inherent network dynamics: Vasserman et al. [3] suggest flooding the network for discovering adequate overlay neighbors, thus creating a large overhead. In contrast, X-Vine leverages the overlay routing by concatenating previously existing tunnels to a new one, thus entailing an increase of the average tunnel length and hence routing costs over time [5]. Indeed, without an additional routing protocol in the underlying F2F overlay, efficient maintenance and efficient routing are inherently mutually exclusive in a virtual overlay [6].

In contrast, network embeddings assign coordinates that allow efficient routing to nodes. For example, the F2F mode of Freenet relies on a network embedding. However, results indicate that the embedding is lacking both with regard to routing efficiency [3] and attack resilience [4]. Hoefer et al. [13] propose highly efficient greedy embeddings. However, their approach reveals the identity of the communicating parties and fails to consider resilience. Furthermore, their proposed scheme for content addressing maps the majority of content keys to the same central node.

In summary, network embeddings are the only existing approach providing high efficiency at acceptable maintenance costs. However, achieving receiver anonymity, resilience, and suitable content addressing remains an unsolved and highly challenging problem.

3 Adversary Model

We aim to realize efficient F2F overlays making use of network embeddings but at the same time providing receiver anonymity, resilience, and content addressing. Note that we do not consider sender anonymity because the problem of sender anonymity can easily be solved by starting the routing with a short random walk, as extensively analyzed for various anonymous look-up strategies for distributed hash tables (DHTs) (e.g., [12, 14]). In contrast, receiver anonymity is a challenging problem for network embeddings, because the coordinates acting as a node’s pseudonym are essential for the routing process and hence for the efficient communication between arbitrary node pairs. The term resilience is loosely defined. In general, a system is called resilient if an action is only slightly impaired by node failures or attacks. Commonly, a system is judged to be resilient by comparison with others.

We consider two attack goals in our adversary model. The first goal is to discover the identity of communicating parties, in particular the identity of the designated receiver of a message. A second goal of the attacker is to block undesired communication using a so-called black hole attack [15], which could be applied in case an attacker fails to identify specific parties. During such an attack, an adversary indiscriminately censors communication by first gaining a predominant position in the system and then dropping all received messages. In addition, attacks on the availability, such as pollution, i.e., denial-of-service attacks by flooding the network with content and traffic, and eclipse attacks, i.e., censoring of specific content, present a threat to any P2P system. However, these attacks have been addressed in various publications (e.g., [15]), which can be applied to our contribution with few modifications. Hence, we do not consider them in our evaluation.

As for the attacker’s capacities, we assume a local, active, internal, possibly colluding attacker, able to drop and manipulate messages it receives. The adversary can control one or several colluding nodes in the network but is unable to observe the complete topology. In particular, we assume that an adversary cannot be certain that it knows all neighbors of a node, in other words, the complete circle of a user. We assume that this is hard because it requires the adversary to i) be sure that it knows all contacts from different social circles of a user, such as family, close friends, and colleagues, and ii) establish connections to all of them. A global passive attacker is disregarded on the basis that steganographic techniques can be applied to hide the F2F traffic, as suggested in, e.g., [16]. However, attackers are modeled as polynomial-time adversaries, which are given a transcript of all own and public input, as well as all locally observed traffic. Their computation power is therefore bounded by polynomial-time algorithms, which prevents breaking computationally secure cryptographic primitives. We assume that an adversary can easily forge an arbitrary number of nodes, so-called Sybils. However, gaining connections and hence influence in an F2F overlay requires establishing real-world trust relationships. Such social engineering attacks are considered to be costly and difficult because they require long-term interaction between a human adversary and an honest participant. Thus, the number of connections between honest nodes and forged participants can be assumed to be small. More precisely, we assume that the number is logarithmic in the network size, in agreement with previous work [5].

4 Network Embeddings

Our solution builds upon previous work in the area of network embeddings, which assign coordinates to nodes with the goal of structuring networks, e.g., for efficient routing in wireless sensor networks or as an alternative to the current IP layer in content-centric networking. We first introduce some notation, then explain the principal concepts of network embeddings, and conclude by describing specific algorithms. In particular, we detail the PIE embedding [8], which we modify in Section 5 to allow for anonymity.

4.1 Basic Terminology

In the remainder of the paper, we represent an overlay network by a graph $G = (V, E)$ with nodes $V$ and links or edges $E \subseteq V \times V$. Because we require mutual trust for connection establishment in F2F overlays, the network is bidirectional. We denote the neighbors of a node $v$ by $N(v)$. A friend-to-friend (F2F) overlay is thus an overlay such that the set of links $E$ is given by pairs of nodes sharing a real-world trust relationship. Embedding algorithms heavily rely on spanning trees, i.e., connected subgraphs $T = (V, E_T)$ of $G$ such that $|E_T| = |V| - 1$. In such a (spanning) tree, one node $r$ is designated as the root and the positions of nodes are described based on their relation to the root. In particular, the level or depth of a node $v$ is given by the length of the path from $v$ to $r$. If $v$ is not the root, the parent of $v$ is defined to be the tree neighbor with a shorter path to $r$ than $v$, whereas the remaining tree neighbors are $v$’s children. A node without children is called a leaf, whereas nodes with children are called internal nodes.

4.2 Concept

Now, we define the concept of network embeddings and in particular greedy network embeddings. In the following, let $G = (V, E)$ be a network and $X$ be a metric space with a distance $\delta$. A network embedding is defined as a function $id: V \rightarrow X$ assigning each node a coordinate. The problem of enabling routing in a connectivity-restricted network has been addressed by the design of greedy embeddings. Greedy embeddings [17] are coordinate assignments such that for any source-destination pair $(s, t)$ with $s \neq t$, a neighbor $u$ of $s$ exists such that $\delta(id(u), id(t)) < \delta(id(s), id(t))$. We say that $u$ is closer to $t$ than $s$ with regard to $\delta$. As a consequence, straightforward greedy routing is guaranteed to find a route from $s$ to $t$.
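The greedy property can be checked mechanically on small examples. The following is a minimal Python sketch, not taken from the paper: the adjacency dictionaries, the integer coordinates, and the helper names `greedy_route` and `is_greedy` are illustrative assumptions. It tests whether every node has a strictly closer neighbor for every destination and shows that plain greedy routing gets stuck as soon as this property is violated.

```python
def greedy_route(adj, coord, dist, s, t):
    """Forward hop by hop to the neighbor closest to t; return None in a local minimum."""
    path = [s]
    cur = s
    while cur != t:
        d_cur = dist(coord[cur], coord[t])
        # Only neighbors strictly closer to the destination are candidates.
        closer = [u for u in adj[cur] if dist(coord[u], coord[t]) < d_cur]
        if not closer:
            return None  # no closer neighbor: the embedding is not greedy for (cur, t)
        cur = min(closer, key=lambda u: dist(coord[u], coord[t]))
        path.append(cur)
    return path


def is_greedy(adj, coord, dist):
    """True if every node has a strictly closer neighbor for every other destination."""
    return all(
        any(dist(coord[u], coord[t]) < dist(coord[s], coord[t]) for u in adj[s])
        for s in adj
        for t in adj
        if s != t
    )


def abs_dist(x, y):
    return abs(x - y)


# Toy example (assumption): a path graph 0-1-2-3 embedded into the integers.
LINE = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
GOOD = {0: 0, 1: 1, 2: 2, 3: 3}   # coordinates follow the path: greedy
BAD = {0: 0, 1: 3, 2: 1, 3: 2}    # coordinates scrambled: not greedy
```

With `GOOD`, `greedy_route(LINE, GOOD, abs_dist, 0, 3)` follows the path node by node, whereas with `BAD` the route from node 0 to node 2 already fails at the source because its only neighbor is farther from the destination.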

Though there exists a multitude of greedy embedding algorithms, they all follow the same four abstract steps: i) Construct a spanning tree $T$, ii) Each internal node in $T$ enumerates its children, iii) The root receives a predefined coordinate, iv) Children derive their coordinate from the parent’s coordinate and the enumeration index assigned by the parent (e.g. [7, 18, 19, 8]). The coordinates are then distributed such that the embedding of the spanning tree is greedy, as specified for the PIE embedding below. Subsequent to the coordinate assignment, nodes consider all neighbors, including those that are neither parent nor child, for the routing. So, routing is not restricted to tree edges. We call non-tree edges shortcuts because they allow for a faster reduction of the distance and shorter routes than predicted by the distance in the tree.

In the following, we consider the construction and stabilization costs for such greedy embeddings. A spanning tree is constructed by i) selecting a root node using a distributed leader election protocol such as [20, 21], and ii) building the tree from the root. In this manner, it is possible to construct a spanning tree with messages for a graph of diameter [20], though integrating protections against nodes aiming to cheat the root selection protocol such as [21] requires a higher cost. Various embeddings [18, 19, 8] are able to react to dynamics without recomputing the complete embedding whenever the topology changes. New nodes join the trees as leaves, requiring only a constant overhead for contacting one of their neighbors to be their parent and receiving a coordinate from said parent. If any node but the root leaves, only its descendants have to reconnect. We show that the stabilization overhead then scales linearly with the tree depth rather than linearly with the number of participants.

4.3 Existing Approaches

Though embedding algorithms generally rely on a spanning tree and assign coordinates according to the tree structure, the nature of the assigned coordinates is highly diverse: Embeddings into hyperbolic space such as [7, 18, 19] allow embedding in low-dimensional spaces. However, proposed hyperbolic embeddings are extremely complex and do not scale with regard to the number of bits required for coordinate representation [19]. Custom-metric approaches have been designed to overcome these shortcomings. The custom-metric embedding PIE [8] assigns an empty vector as the root coordinate. Child coordinates are then derived from the parent coordinate by concatenating the parent coordinate with the index assigned to the child by the parent, potentially weighted with the cost of the parent-child edge if such weights are given. In this manner, a node $v$’s coordinate represents the route from the root to $v$. Consequently, the distance is given by the hop distance of two nodes in the tree. An example for the PIE embedding in unweighted graphs is displayed on the left side of Figure 1. Whereas routing in greedy embeddings is highly efficient in comparison to non-greedy embeddings [13], neither anonymity nor resilience has been considered in a suitable manner.
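To make the unweighted PIE coordinate assignment concrete, the following Python sketch assigns coordinates in a given spanning tree. It is an illustration under assumptions rather than the reference implementation of [8]: the tree is passed as a hypothetical `children` dictionary, child indices start at 1, and the helper names are made up for this example.

```python
from collections import deque


def pie_embed(children, root):
    """Unweighted PIE-style assignment: the root gets the empty vector and
    each child appends the index assigned by its parent."""
    coord = {root: ()}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        # Assumption: children are enumerated 1, 2, 3, ... in list order.
        for index, child in enumerate(children.get(parent, []), start=1):
            coord[child] = coord[parent] + (index,)
            queue.append(child)
    return coord


def ancestors_from_root(coordinate):
    """Coordinates on the tree path from the root to the node, i.e., all prefixes."""
    return [coordinate[:k] for k in range(len(coordinate) + 1)]


# Toy spanning tree (assumption): r has children a and b, a has c and d, b has e.
TREE = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
COORD = pie_embed(TREE, "r")
```

Here node `d` receives the coordinate `(1, 2)`, whose prefixes `()`, `(1,)`, and `(1, 2)` are exactly the coordinates of `r`, `a`, and `d`, which illustrates how a coordinate encodes the route from the root.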

5 Design

Our main contribution lies in proposing multiple greedy embeddings with anonymous return addresses and a virtual overlay on top of the embeddings. In the following, we present our system, in particular

  • a spanning tree construction and stabilization algorithm for multiple parallel embeddings,

  • an embedding algorithm providing efficiency as well as allowing for improved censorship-resistance through a modified distance,

  • an address generation algorithm enabling receiver anonymity, and

  • a virtual overlay design based on embeddings that allows balanced content distribution and efficient content retrieval.

5.1 Tree Construction and Stabilization

In this section, we show how we construct and stabilize parallel spanning trees. In the next section, we then describe how to assign coordinates on the basis of these trees. We want to increase the robustness and censorship-resistance by using multiple trees. In order to ensure that the trees indeed offer different routes, our algorithm encourages nodes to select different parents in each tree if possible. Our algorithm design follows similar principles as the provably optimally robust and resilient tree construction algorithm for P2P-based video streaming presented in [22]. However, that algorithm assumes that nodes can change their neighbors. Thus, we can directly apply neither the algorithm nor its results. In the following, we first discuss the tree construction and then the stabilization.

Tree Construction:

We divide the construction of a tree into two phases i) selecting the root, and ii) building the tree starting from the root. We can apply [20] for the root election, which achieves a communication complexity of . Our own contribution lies in ii) the tree construction after the root node has been chosen.

We now briefly describe the idea of our algorithm and then the actual algorithm. A node $v$ that is not the root receives messages from its neighbors when they join a tree and are hence available as parent nodes. There are two questions to consider when designing an algorithm governing $v$’s reaction to such messages, called invitations in the following. First, $v$ has to decide if and when it accepts an invitation. Second, $v$ has to select an invitation in the presence of multiple invitations.

For the second question, $v$ always prefers invitations from nodes that have been its parent in fewer trees, with the goal of constructing different trees and increasing the overall number of possible routes. Increasing the number of routes allows the use of alternative routes if a request cannot be routed along the preferred route due to a failed or malicious node. If two neighbors are parents in the same number of trees, $v$ can either select one randomly or prefer the parent closer to the root. Choosing a random parent reduces the impact of nodes close to the root but is likely to lead to longer routes and thus a lower efficiency.

Coming back to the first question of if and when $v$ accepts invitations, $v$ should always accept an invitation of a neighbor that is not yet a parent of $v$ in any tree in order to choose different parents as often as possible. In contrast, if the inviting neighbor $u$ is already a parent of $v$ in some tree, $v$ might wait for the invitation of a different neighbor. However, it is unclear if it is possible for all neighbors of $v$ to ever become a parent. For example, a neighbor of degree $1$ is only a parent if it is the root. In order to overcome this dilemma, $v$ periodically decides probabilistically whether it should accept $u$’s invitation or wait for another invitation. So, $v$ eventually accepts an invitation but gives alternative parents the chance to send an invitation.

Now, we describe the exact steps of the algorithm. The algorithm is a round-based invitation protocol for the tree construction. After a node $u$ is included in the $i$-th tree, $u$ sends invitations to all its neighbors inviting them to be its children in tree $i$. When $v$ receives an invitation for the $i$-th tree from a neighbor $u$, it adds the invitation to the set of invitations to consider if it is not yet contained in tree $i$, and otherwise stores it for later use: a stored invitation can still be used if $v$ has to modify its parent selection later. In each round, a node $v$ considers all invitations for trees it is not yet part of, as described in Algorithm 1. Let $pc(u)$ be the number of trees for which a neighboring node $u$ is a parent of $v$. If $v$ has received invitations from neighbors with minimal $pc(u)$ among all neighbors, $v$ accepts one of those invitations (Lines 2-4). In the presence of multiple invitations, we experiment with two selection strategies: i) choosing a random invitation, and ii) choosing a random invitation from a node on the lowest level. The latter selection scheme requires that the invitations also detail the level of the potential parent node in the tree. If $v$ does not have an invitation from any node with minimal $pc(u)$, $v$ nevertheless accepts an invitation with probability $q$ in order to guarantee the termination of the tree construction. If $v$ accepts a parent, it selects a node that has offered an invitation and has the lowest $pc(u)$ among neighbors with outstanding invitations (Lines 8-9). In this manner, we guarantee the convergence of tree construction.

The acceptance probability $q$ is essential for the diversity and the structure of the trees: For a high $q$, nodes quickly accept invitations, leading to trees of a low depth and thus short routes. However, in the presence of an attacker acting as the root of all or most trees, the trees are then probably close to identical, resulting in a low censorship-resistance. A lower acceptance probability increases the diversity but entails longer routes. Thus, a low $q$ results in a higher communication complexity and at some point decreases the robustness due to the increased likelihood of encountering failed nodes on a longer route. In Section 6.1, we show that the constructed trees are of a logarithmic depth such that we indeed maintain a routing complexity of $O(\log n)$ for a network of $n$ nodes. Note that Algorithm 1 does not assume that all trees are constructed at the same time. Rather, individual trees can be (re-)constructed while the remaining trees impact the parent choice in the new tree but remain unchanged.

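1:{Executed by node $v$ with neighbors $N(v)$. Internal state: set $I$ of invitations, acceptance probability $q$, number of trees $pc(u)$ in which neighbor $u$ is parent}
2:$PP \leftarrow \{inv \in I : pc(inv.sender) = \min_{u \in N(v)} pc(u)\}$
3:if $PP$ is not empty then
4:    Select invitation in $PP$ to answer
5:else
6:    $r \leftarrow$ uniform random number in $[0, 1]$
7:    if $r \leq q$ then
8:       $PP \leftarrow \{inv \in I : pc(inv.sender) = \min_{inv' \in I} pc(inv'.sender)\}$
9:       Select invitation in $PP$ to answer
10:    end if
11:end if
Algorithm 1 constructTreeRound()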
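One round of the invitation protocol for a single tree can be sketched in Python as follows. This is a hedged reading of the description above rather than the authors' code: the `Invitation` class, the parameter names, and the interpretation of "lowest level" as the smallest distance to the root are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Invitation:
    sender: str     # neighbor offering to become the parent
    level: int = 0  # sender's level in the tree (used only by the level-based strategy)


def construct_tree_round(invitations, neighbors, parent_count, q, rng,
                         prefer_low_level=False):
    """Return the invitation accepted in this round, or None to keep waiting.

    parent_count maps a neighbor to the number of trees in which it is
    already the parent of this node (missing entries count as 0).
    """
    if not invitations:
        return None

    def pc(u):
        return parent_count.get(u, 0)

    # Prefer invitations from neighbors that are least often a parent.
    least_used = min(pc(u) for u in neighbors)
    candidates = [inv for inv in invitations if pc(inv.sender) == least_used]
    if not candidates:
        # No ideal parent has invited yet: accept anyway with probability q.
        if rng.random() >= q:
            return None
        best = min(pc(inv.sender) for inv in invitations)
        candidates = [inv for inv in invitations if pc(inv.sender) == best]
    if prefer_low_level:
        # Assumption: "lowest level" means closest to the root.
        top = min(inv.level for inv in candidates)
        candidates = [inv for inv in candidates if inv.level == top]
    return rng.choice(candidates)
```

Passing a seeded `random.Random` instance as `rng` keeps experiments reproducible. Setting `q` close to 1 makes nodes accept quickly, while a small `q` makes them wait longer for neighbors that are not yet a parent in any tree.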

Stabilization:

Now, we consider the stabilization of the trees when nodes join and leave. Stabilizing the trees efficiently, i.e., repairing them locally rather than reconstructing the complete tree whenever the topology changes, is essential for efficiency. Joining nodes can be integrated in a straightforward manner by connecting to their neighbors as children, again trying to maximize the diversity of the parents. For this purpose, nodes record the time, i.e., the round in our abstract time model, at which they joined the tree. When a new node $v$ joins, it requests its neighbors’ coordinates and these timestamps for all trees. Based on this information, $v$ can simulate Algorithm 1 locally, ensuring that its expected depth in the tree is unaffected by its delayed join. When a node departs, all its children have to choose a different parent and inform their descendants of the change. In order to prevent a complete subtree from being relocated at an increased depth, the descendants may also select a different parent. The selection of the new parent again follows Algorithm 1 but only locally re-establishes the trees affected by the node departure. In Section 6.1, we show that the stabilization complexity for any node but the root is linear in the average depth of the node in the trees.

We formally prove in Section 6.1 that the above stabilization algorithm indeed introduces only logarithmic complexity. We now present the embedding algorithm, which, in agreement with the presented tree construction algorithm, assigns coordinates within subtrees independently of the remaining subtrees to allow for local stabilization.

5.2 Embedding and Routing

Figure 1: Original PIE and modified PIE coordinates using -bit number

In this section, we show how we assign coordinates in a spanning tree and how to route based on these coordinates. As we want to prevent an attacker from guessing the coordinate of a receiver, we require a certain degree of nondeterminism in the coordinate assignment. We thus choose a slightly modified version of the unweighted PIE embedding [8], which we have introduced in Section 4. Our main modification lies in the use of nondeterministic coordinates in order to prevent an adversary from guessing the coordinate and thus undermining the anonymization schemes presented in the next section. The routing algorithm corresponds to greedy routing with backtracking. In addition to the tree distance in [8], we also present a second distance preferring nodes with a long common prefix and thus avoiding routes via nodes close to the root whenever possible. In this manner, we increase robustness and censorship-resistance because the routing algorithm considers alternative routes and the impact of strategically choosing a position close to the root is reduced. In the following, we present in turn the embedding algorithm, the distance functions, and the routing with backtracking.

Embedding Algorithm:

Embeddings are performed on each of the trees independently, so that we only consider one embedding $id$. Throughout this section, let $b$ be a sufficiently large integer, $PRNG$ a pseudo-random number generator with values in $\{0,1\}^b$, and $h$ a cryptographically secure hash function. We describe the embedding algorithm, then the distance used for routing, and last, the backtracking procedure, which allows highly resilient routing despite failures.

We now describe the embedding algorithm for one tree. The coordinate assignment starts at the root and then spreads successively throughout the tree. After a spanning tree has been established, the root $r$ is assigned an empty vector as a coordinate $id(r) = ()$. In the next step, each child of the root generates a random $b$-bit number $x$ such that its coordinate is $(x)$. Here, our algorithm differs from the PIE embedding because it uses random rather than consecutive numbers, thus preventing an adversary from guessing the coordinate in an efficient manner. Subsequently, nodes in the tree are assigned coordinates by concatenating their parent’s coordinate with a random number. So, upon receiving its parent’s coordinate $(x_1, \ldots, x_{l-1})$, a node on level $l$ of the tree obtains its coordinate $(x_1, \ldots, x_{l-1}, x_l)$ by appending a random $b$-bit number $x_l$. The coordinate space is hence given by all vectors consisting of $b$-bit numbers, i.e., $X = (\{0,1\}^b)^{*}$. Figure 1 displays the differences between the original PIE embedding and our variation.

Note that the independent random choice of the $b$-bit number might lead to two nodes having the same coordinate. Thus, $b$ should be chosen such that the chance of equal coordinates is negligible. If two children nevertheless select the same coordinate, the parent node should inform one of them to adapt its choice. Note that allowing the parent to influence the coordinate selection in this manner does not really increase the vulnerability to attacks, as the parent can achieve at least the same damage by constantly changing its coordinate. Such constant changes can be detected easily, so that nodes should stop selecting such nodes as parents. In general, by moving the choice of the last coordinate element from the parent to the child, we automatically reduce the impact of a malicious parent, as it cannot determine the complete coordinate of the child.
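The randomized coordinate assignment and the handling of equal sibling coordinates can be sketched as follows. The sketch is a centralized simulation under assumptions: in the protocol each child draws its own number and the parent only signals collisions, whereas here a single function plays both roles; `secrets.randbits` stands in for the pseudo-random number generator, and the function names and the default `b=64` are illustrative choices. The second helper gives the standard union (birthday) bound on the probability that any two of `k` siblings draw the same `b`-bit number.

```python
import secrets


def assign_random_coordinates(children, root, b=64):
    """Modified PIE-style assignment: each child appends a random b-bit number
    to its parent's coordinate; siblings redraw on a collision."""
    coord = {root: ()}
    stack = [root]
    while stack:
        parent = stack.pop()
        taken = set()  # last coordinate elements already used below this parent
        for child in children.get(parent, []):
            x = secrets.randbits(b)
            while x in taken:  # the parent asks the child to adapt its choice
                x = secrets.randbits(b)
            taken.add(x)
            coord[child] = coord[parent] + (x,)
            stack.append(child)
    return coord


def sibling_collision_bound(k, b):
    """Union bound: P[two of k siblings collide] <= k * (k - 1) / 2^(b + 1)."""
    return k * (k - 1) / 2 ** (b + 1)
```

Only siblings can end up with equal coordinates, because nodes with different parents already differ in their prefix. For example, `sibling_collision_bound(1000, 64)` is below `1e-13`, so redraws should be extremely rare for such parameters.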

Distances:

We still need to define distances between coordinates in order to apply greedy routing. For this purpose, we consider two distances on the coordinate space. Both rely on the common prefix length $cpl(s, t)$ of two vectors $s$ and $t$ and the coordinate length $|s|$.

First, we consider the tree distance $\delta_{TD}$ from [8], which gives the length of the path between the two nodes in the tree, i.e.,

$$\delta_{TD}(s, t) = |s| + |t| - 2\,cpl(s, t) \qquad (1)$$

Secondly, the common prefix length can be used as the determining factor in the distance function, i.e., for a constant exceeding the length of all node coordinates in the overlay, we define

(2)

The reason for using the common prefix length rather than the actual tree distance is the latter’s preference for routes passing nodes close to the root in the tree. As a result, the nodes on these routes are very influential, so that adversaries can gain a large impact by obtaining such a position. In contrast, the distance based on the common prefix length prefers possibly longer routes by always forwarding to a node within the same subtree as the destination and avoids central nodes in the tree. An example of the difference between the two distances and the impact on the discovered routes is displayed in Figure 2.
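The effect of the two distances on the choice of the next hop can be illustrated with a small Python sketch. The tree distance is computed as the sum of the two coordinate lengths minus twice the common prefix length. For the second distance, the sketch does not reproduce Equation (2); as a clearly hypothetical stand-in it orders neighbors by common prefix length first and breaks ties by tree distance, which only mimics the qualitative preference described above. All names and the example coordinates are assumptions.

```python
def cpl(s, t):
    """Common prefix length of two coordinate vectors."""
    n = 0
    for x, y in zip(s, t):
        if x != y:
            break
        n += 1
    return n


def tree_distance(s, t):
    """Hop distance in the tree: both nodes walk up to their deepest common ancestor."""
    return len(s) + len(t) - 2 * cpl(s, t)


def next_hop_by_tree_distance(neighbor_coords, target):
    return min(neighbor_coords,
               key=lambda u: tree_distance(neighbor_coords[u], target))


def next_hop_by_prefix(neighbor_coords, target):
    # Hypothetical stand-in for the prefix-based distance: a longer common
    # prefix always wins, tree distance only breaks ties.
    return min(neighbor_coords,
               key=lambda u: (-cpl(neighbor_coords[u], target),
                              tree_distance(neighbor_coords[u], target)))


# Toy situation (assumption): the target lies deep in the subtree below (1, 2).
TARGET = (1, 2, 3)
NEIGHBORS = {
    "u": (1,),             # close to the root, tree distance 2 to the target
    "w": (1, 2, 5, 6, 7),  # same subtree as the target, tree distance 4
}
```

In this example the tree distance selects `u`, the node near the root, whereas the prefix-first ordering selects `w`, which shares the longer prefix with the target at the price of a longer remaining route.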

Figure 2: Tree distance (TD) and common prefix length based distance when routing from node to : prefers nodes in the same subtree as the destination, leading to better censorship-resistance at the price of longer routes. The table gives the distances in the first hop for and its neighbors and .

Greedy Routing in Multiple Embeddings:

We route in trees in parallel. More precisely, given a vector of coordinates , the sender selects coordinates and sends a request for each of them. can either select embeddings uniformly at random or choose the embeddings so that the distance of the neighbor with the closest coordinate to is minimal. The latter choice might result in shorter routes due to the low distance in the embedding.

The routing proceeds in each embedding independently. Nodes forward the request to the neighbor with the closest coordinate in the respective embedding. Thus, in order for the nodes on the route to forward the request correctly, the request has to contain both the coordinate and the index of the embedding. In practice, we can achieve a performance gain by including multiple coordinates and embedding indices in one message if the next hops in two or more embeddings are identical. For now, we assume that one message is sent for each embedding for simplicity.

We optionally increase the robustness and censorship-resistance of the routing algorithm by allowing backtracking if the routing gets stuck in a local minimum of the distance function due to failures or intentional refusal to forward a request. For this purpose, all nodes remember their predecessor on the routing path as well as the neighbors they have forwarded the request to. If all neighbors closer to the target have been considered and have been unable to deliver the request, the node returns the request to its predecessor in order to find an alternative path. The routing is thus only considered to have failed if the request returns to its source and cannot be forwarded to any other neighbor. In this manner, all greedy paths, i.e., all paths with a monotonically decreasing distance to the target, are found.

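1:{Input: current node $u$, message $m$ from node $v$, tree index $i$, target coordinate $x_t$}
2:{Internal state: set $S_u(m)$ of nodes $u$ forwarded $m$ to, predecessor $pred_u(m)$, distance $\delta$}
3:if $id_i(u) = x_t$ then
4:    Routing succeeds
5:else
6:{Store predecessor unless backtracking}
7:    if not $S_u(m)$ contains $v$ then
8:       $pred_u(m) \leftarrow v$
9:    end if
10:{Determine closest neighbors}
11:    $C \leftarrow \arg\min_{w \in N(u) \setminus S_u(m)} \delta(id_i(w), x_t)$
12:    $next \leftarrow$ random element in $C$
13:    if $\delta(id_i(next), x_t) < \delta(id_i(u), x_t)$ then
14:       Forward $m$ to $next$ {Forward if improvement}
15:    else
16:       if $pred_u(m)$ is set then
17:          Forward $m$ to $pred_u(m)$ {Backtrack}
18:       else
19:          Routing failed
20:       end if
21:    end if
22:end if
Algorithm 2 route()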

Algorithm 2 gives the pseudo code describing one step of the routing algorithm, including the backtracking procedure. When receiving a message , the node first checks if it is the receiver of , thus successfully terminating the routing (Line 4). If is not the receiver, it determines if the routing is currently in the backtracking phase by checking if has previously forwarded to the sender . Otherwise, it stores the sender of as a predecessor for potential later backtracking (Line 8). In the manner of greedy routing, selects the closest neighbor to the target coordinate. In the presence of several closest neighbors, picks one of them uniformly at random (Lines 11-12). Note that in the presence of failures, the embedding can lose its greediness. Hence, to avoid loops, only forwards the request to that neighbor if it is indeed closer (Line 14). Otherwise, contacts its predecessor (Line 17) or forfeits the routing if no such predecessor exists (Line 19), i.e., if is the source of the request.
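The step described above can be turned into a small simulation. The following is a sketch under stated assumptions rather than the authors' implementation: it routes in a single embedding, uses the tree distance, keeps the per-node state (`forwarded`, `pred`) in dictionaries inside one function, and assumes that neighbors a node has already forwarded to are excluded from the candidate set. The graph, the coordinates, and all names are hypothetical.

```python
import random


def cpl(a, b):
    """Common prefix length of two coordinates."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tree_distance(a, b):
    return len(a) + len(b) - 2 * cpl(a, b)


def greedy_route(neighbors, coord, source, target, dist=tree_distance, rng=None):
    """Return the list of visited nodes, or None if routing fails."""
    rng = rng or random.Random()
    forwarded = {u: set() for u in neighbors}  # nodes u already forwarded to
    pred = {}                                  # predecessor for backtracking
    path = [source]
    current, sender = source, None
    while coord[current] != target:
        # Store the predecessor unless the message is coming back to us.
        if sender is not None and sender not in forwarded[current]:
            pred[current] = sender
        candidates = [w for w in neighbors[current] if w not in forwarded[current]]
        nxt = None
        if candidates:
            best = min(dist(coord[w], target) for w in candidates)
            closest = [w for w in candidates if dist(coord[w], target) == best]
            # Forward only if the chosen neighbor is strictly closer.
            if best < dist(coord[current], target):
                nxt = rng.choice(closest)
        if nxt is not None:
            forwarded[current].add(nxt)
        elif current in pred:
            nxt = pred[current]                # backtrack
        else:
            return None                        # back at the source: failure
        current, sender = nxt, current
        path.append(current)
    return path


# Hypothetical overlay; the link between "x" and its tree parent "b" has failed.
coord = {"r": (), "a": (0,), "b": (1,), "s": (0, 0), "x": (1, 1), "t": (1, 0)}
neighbors = {
    "s": ["a", "x"], "x": ["s"], "a": ["s", "r"],
    "r": ["a", "b"], "b": ["r", "t"], "t": ["b"],
}
path = greedy_route(neighbors, coord, "s", coord["t"])
unreachable = greedy_route(neighbors, coord, "s", (9,))
```

In the example, the request first follows the shortcut to `x`, gets stuck there because of the failed link, returns to `s`, and then reaches `t` via the root.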

This completes the description of the routing and stabilization functionalities. However, up to now, we used identifying coordinates rather than anonymous addresses.

5.3 Anonymous Return Addresses

In this section, we introduce our address generation algorithm for generating anonymous return addresses, which allow routing a request but do not reveal its receiver. For this reason, we call the generated addresses route preserving (RP) return addresses. Based on these return addresses, we specify two routing algorithms for routing a request containing a return address. The return addresses allow a node to determine the common prefix length of its neighbors’ coordinates and the receiver coordinate, which allows the node to determine the closest neighbor. Hence, the two algorithms correspond to Algorithm 2 for the two distance functions when using return addresses rather than receiver coordinates. After describing the algorithm, we show that the return addresses indeed preserve routes.

Return Address Generation:

Return addresses are generated in three steps:

  1. Padding the coordinate

  2. Applying a hash cascade to obtain the return address

  3. Adding a MAC

Algorithm 3 displays the pseudo code of the above steps.

1:{Input: coordinate , seed }
2:{Internal State: key , , }
3:
4:
5:for  do
6:    if  then
7:       
8:    else
9:        {Padding}
10:    end if
11:     {Hash cascade}
12:end for
13: {MAC}
14:Publish
Algorithm 3 generateRP()

The first step of the return address generation prevents an adversary from identifying coordinates based on their length. A node pads its coordinate by adding random elements . More precisely, selects a seed for the pseudo-random number generator and obtains the padded coordinate with

In order to ensure that the closest node coordinate to is indeed , recomputes the padding with a different seed if is equal to the -th element of a child’s coordinate (we exclude this step from Algorithm 3 for increased readability). Afterwards, chooses a different seed for the construction of the actual return address and generates . then executes the local function in order to obtain a vector with elements in . The -th element of is given by

(3)

We call the pair a return address, which can be used to find a route to the node with coordinate . Before publishing the return address, adds a MAC for a private key to prevent malicious nodes from faking return addresses and gaining information from potential replies. Last, publishes the return address and the MAC.

Routing Algorithms:

Now, we determine diversity measures and in order to compare coordinates and with regard to and . The diversity measure then assumes the role of the distance in Algorithm 2. (Note that a diversity measure is not a distance because it i) is defined for two potentially distinct sets and , and ii) is not symmetric.)

In order to define a sensible diversity measure, note that for any coordinate and return address for coordinate , we have . We thus can define the diversity measure in terms of the common prefix length in the same manner as the distance. More precisely, for , the diversity for of a coordinate to the return address is

(4)

In practice, can increase the efficiency of the computation by only determining up to the first element in which it disagrees with . Thus, we now have two possible realizations of the routing algorithm , namely and . Given the RP return address of the destination , and forward the message to the neighbor with the lowest diversity measure and , respectively.
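The three generation steps and the prefix comparison can be sketched as follows. This is an illustration under explicit assumptions, not the paper's exact construction: the cascade is assumed to be `y_i = SHA-256(y_{i-1} || x_i)` with `y_0` equal to the published seed, elements are assumed to be 128-bit integers, the padding uses Python's non-cryptographic `random.Random`, and the check that the padding differs from a child's coordinate element is omitted. All function and parameter names are hypothetical.

```python
import hashlib
import hmac
import random

ELEM_BITS = 128               # assumed size of one coordinate element
ELEM_BYTES = ELEM_BITS // 8


def _cascade(seed, coord):
    """Yield the assumed hash cascade over the elements of coord, lazily."""
    prev = seed
    for elem in coord:
        prev = hashlib.sha256(prev + elem.to_bytes(ELEM_BYTES, "big")).digest()
        yield prev


def generate_return_address(coord, length, pad_seed, addr_seed, mac_key):
    # Step 1: pad the coordinate to a fixed length with pseudo-random elements
    # (a real deployment would need a cryptographically secure generator).
    prng = random.Random(pad_seed)
    padding = [prng.getrandbits(ELEM_BITS) for _ in range(length - len(coord))]
    # Step 2: apply the hash cascade to the padded coordinate.
    y = list(_cascade(addr_seed, list(coord) + padding))
    # Step 3: add a MAC over seed and hashed vector.
    tag = hmac.new(mac_key, addr_seed + b"".join(y), hashlib.sha256).digest()
    return (y, addr_seed), tag


def common_prefix_length(coord, address):
    """Prefix length shared by coord and the hidden padded coordinate.

    The cascade is evaluated lazily, so hashing stops at the first mismatch.
    """
    y, addr_seed = address
    n = 0
    for digest, expected in zip(_cascade(addr_seed, coord), y):
        if digest != expected:
            break
        n += 1
    return n


address, tag = generate_return_address(
    (3, 1, 4), length=8, pad_seed=7, addr_seed=b"seed", mac_key=b"key"
)
on_path = common_prefix_length((3, 1), address)       # ancestor of the receiver
off_path = common_prefix_length((3, 1, 9), address)   # sibling subtree
```

A forwarding node only learns how far its own or a neighbor's coordinate agrees with the hidden coordinate, which is the quantity the diversity measures are built from.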

Proving Route Preservation:

We now prove formally that the above return addresses preserve routes. For this purpose, we first define the notion of preserving a property of a coordinate. Note that we

Definition 5.1.

Let be a local function of node in a graph Given a set of node coordinates and a target coordinate , returns a subset . A return address for a coordinate is said to preserve if for all , there exists a function such that for all

The notion of route preserving (RP) return addresses now follows if we choose the function to return the neighbors with the closest coordinates to .

Definition 5.2.

Let

(5)

determine the closest coordinates in a set to a coordinate . A return address is called route preserving (RP) (with regard to ) if it preserves .

Based on Definition 5.2, we can now show that Algorithm 3 generates RP return addresses.

Theorem 5.3.

Algorithm 3 generates RP return addresses with regard to the distances and .

Proof.

In order to show that preserves routes, we derive the relation between the diversity measures and , defined in Eq. 4, and the corresponding distances and , defined in Eq. 1 and Eq. 2, respectively.

Let denote the padded coordinate used to generate , and let be the coordinate without padding. In the following, we relate the distance of and a coordinate to the diversity measure of and . We can assume that , i.e., the common prefix length of the padded coordinate and is at most equal to the length of the original coordinate . A node with coordinate with cannot exist in a valid embedding. More precisely, our embedding algorithm ensures that coordinates are unique and a node ensures that the first element of the padding does not correspond to the -th element of a descendant’s coordinate. Thus, the coordinate is the unique closest coordinate of a node to the padded coordinate. Hence, we can indeed limit our evaluation to coordinates with .

We start by considering the tree distance . By Eq. 4, we have

Hence, diversity measure and distance only differ by a constant independent of . Thus, any forwarding node can determine the closest coordinates to the destination in its neighborhood and thus Algorithm 3 generates RP return addresses with regard to .

For the distance , we consider two coordinates and with for . We show that i) iff and ii) iff . Thus, the return address is RP as the comparison of two coordinates yields the same order when using the return address as for the original coordinate. For i) note that by Eq. 2 implies that and . Because , we have . The converse holds analogously by Eq. 4. If ii) , then Eq. 2 implies that either or and . In the first case, the claim follows as and and both prefer coordinates with a longer common prefix length. For the second case, the claim follows under the assumptions and , because

Hence for both cases i) and ii), Algorithm 3 generates RP return addresses with regard to . ∎

Up to now, we have only considered route preserving return addresses generated by padding coordinates and applying a hash cascade. Optionally, an additional layer of symmetric encryption can be added, preventing a node from deriving the actual length of the common prefix. Rather, can only determine if a neighbor is closer to the destination than itself. However, we show the same degree of anonymity for both algorithms, so that the additional layer does not result in a provably higher level of anonymity. Furthermore, the additional layer reduces the efficiency as nodes select one closer neighbor at random rather than the closest neighbor. For this reason, the advantage of the additional layer is limited, so that we focus on RP return addresses here and defer the further obfuscation of coordinates to the appendix.

We prove that Algorithm 3 indeed enables receiver anonymity in Section 8.

5.4 Content Storage

In order to store content, we use a distributed hash table (DHT). As nodes cannot communicate directly, they store tree addresses in their routing tables and leverage the tree routing. In this manner, we do not require maintenance-intensive tunnels like [5] and [3]. Note that we only sketch the solution for content storage and retrieval because our focus lies in improving the quality of the greedy embeddings for messaging between nodes. In the following, we first present the idea of our design and then a realization based upon a recursive Kademlia.

General Design:

Nodes establish a DHT by maintaining a routing table of (virtual) overlay connections. Each routing table entry consists of a DHT coordinate and the corresponding return addresses. Nodes communicate with their virtual neighbors by sending requests in any of the embeddings.

New routing table entries are added by routing for a suitable virtual overlay key, as done in [5] for the tunnel discovery. However, after the routing terminates, the discovered nodes send back their return addresses rather than taking the routing path as a new tunnel. In this manner, the length of routes between virtual overlay neighbors only depends on the trees and does not increase over time. The exact nature of the neighbor discovery, the routing algorithm , and the stabilization of the virtual overlay depend on the specifications of the DHT.

Kademlia:

In our evaluation in Sections 6 and 7, we utilize a highly resilient recursive Kademlia [23]. In Kademlia, a node selects a Kademlia identifier uniformly at random in the form of a 160-bit number. The distance between identifiers is equal to their XOR. Nodes maintain many redundant (virtual) overlay connections to increase the resilience. More precisely, each node $u$ keeps a routing table of $k$-buckets. The $i$-th bucket contains up to $k$ addresses of nodes $v$ such that the common prefix length of the identifiers of $u$ and $v$ is $i$. Maintaining more than one neighbor per common prefix length increases the robustness to failures and possibly even to attacks due to the existence of alternative connections.
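The XOR metric and the bucket assignment described above can be written down in a few lines. The sketch below assumes identifiers are Python integers below $2^{160}$; the function names are illustrative and not part of any Kademlia library.

```python
ID_BITS = 160  # identifier length stated in the text


def xor_distance(a, b):
    """Kademlia distance between two identifiers."""
    return a ^ b


def bucket_index(own_id, other_id):
    """Common prefix length in bits, which selects the bucket for other_id."""
    if own_id == other_id:
        raise ValueError("a node does not store itself in a bucket")
    return ID_BITS - (own_id ^ other_id).bit_length()


def closest(known_ids, key, count):
    """The count known identifiers closest to key in the XOR metric."""
    return sorted(known_ids, key=lambda node_id: xor_distance(node_id, key))[:count]


near = closest([5, 1, 12], key=4, count=2)
```

Identifiers that differ in the most significant bit land in bucket 0, while identifiers that differ only in the last bit share a prefix of 159 bits.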

Based on such routing tables, efficient and robust content discovery is possible. Files are indexed by keys corresponding to the hash of their content, i.e., the algorithm for the generation of file addresses is a hash function. A node requesting a file with key looks up the closest nodes to in its routing table in terms of virtual overlay coordinates. Then, routes for each in the trees. Upon receiving one of the messages, returns via the same route if in possession of . Otherwise, forwards the message to the overlay neighbor closest to , again using tree routing, and returns an acknowledgement message to . If a node on the route has already received the message via a parallel query, it returns a backtrack message such that the predecessor can contact a different node. Similarly, if a node does not receive an acknowledgement from its overlay neighbor in time, it selects an alternative node from its routing table if virtual neighbors closer to than exist.

Similarly, stabilization is realized in the same reactive manner as in the original Kademlia. Whenever a node successfully sends a message to an overlay neighbor, this neighbor returns an acknowledgement containing updated return addresses if any coordinates were changed. If a node in the routing table cannot be contacted, the node removes the neighbor from the routing table. Depending on the implementation, it initiates a new neighbor discovery request for the prefix. In addition, suitable neighbors encountered during routing are added to the routing table.

We have now presented the essential components of our design. In the following, we evaluate our design with regard to our requirements. The different layers of our system are displayed in Figure 3.

Figure 3: Layers of VOUTE: 1) F2F overlay as restricted topology, 2) tree embeddings offer addressing for messaging, 3) virtual overlays offer content sharing through DHT routing based on tree addresses

6 Efficiency and Scalability

In this section, we analyze the efficiency of our scheme with regard to routing complexity, stabilization complexity, and their evolution over time.

6.1 Theoretical Analysis

In the first part of this section, we obtain upper bounds on the expected routing length of the routing algorithms and . The desired upper bound on the routing complexity follows by multiplying this bound for routing in one tree with , the number of trees used for parallel routing. Afterwards, we consider the stabilization complexity of the stabilization algorithm consisting of i) the local reconstruction of the trees according to Algorithm 1 and ii) the assignment of new coordinates for the nodes affected by a topology change using the modified PIE embedding.

Routing:

We consider both messaging between nodes as well as content discovery in the DHT.

Theorem 6.1.

Let be a modified PIE embedding on a spanning tree of generated by Algorithm 1 with parameters and . Furthermore, assume that the diameter of is . The expected routing length of Algorithm 2 is at most

(6)

for the routing algorithm , and

(7)

for .

For the proof, we first show Lemma 6.2, which bounds the expected level of a node in trees constructed by Algorithm 1. More precisely, we prove that the expected level of a node in any tree constructed by Algorithm 1 is increased by at most a constant factor in comparison to a breadth-first search.

Lemma 6.2.

Let be any of the trees constructed by Algorithm 1 and the root of . Furthermore, denote by the length of the shortest path from to , and let be the level of in . Then the expected value of is bounded by

(8)
Proof.

We first give an upper bound on the expected number of rounds until a node accepts an invitation for after receiving the first invitation. Afterwards, we show Eq. 8 by induction.

In the first step, we denote the number of rounds until acceptance by . In order to derive an upper bound on , we assume that does not receive any invitation that it can immediately accept, i.e., an invitation from neighbors with minimal parent count . Thus, accepts one invitation with probability in each round. In the worst case, the -th accepted invitation is for tree . The number of rounds thus corresponds to the sum of identically distributed geometrically distributed random variables . Here, is the number of trials until the first success of a sequence of Bernoulli experiments with success probability , i.e., the number of rounds until an invitation is accepted. The random variable describes the number of trials until the -th success and presents an upper bound on the expected number of rounds until acceptance of an invitation for tree . We hence derive an upper bound on by

(9)
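The expectation used in this step can be checked numerically: a sum of $n$ independent geometric random variables (trials until the first success) with success probability $q$ has expectation $n/q$. The following Monte Carlo sketch uses illustrative parameter values and hypothetical names (`num_acceptances`, `q`); it only checks this expectation and does not simulate the tree construction itself.

```python
import random


def rounds_until_acceptances(num_acceptances, q, rng):
    """Rounds needed until num_acceptances invitations have been accepted,
    if one invitation is accepted per round with probability q."""
    rounds = 0
    accepted = 0
    while accepted < num_acceptances:
        rounds += 1
        if rng.random() < q:
            accepted += 1
    return rounds


rng = random.Random(1)
num_acceptances, q, samples = 3, 0.5, 20000
mean_rounds = sum(
    rounds_until_acceptances(num_acceptances, q, rng) for _ in range(samples)
) / samples
expected_rounds = num_acceptances / q  # closed form for the sum of geometrics
```

With these values the empirical mean should lie close to the closed form of 6 rounds.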

In the second step, we apply induction on . Note that the level of a node in the tree is at most the number of rounds until an invitation is accepted from the start of the protocol. For , the node receives an invitation from at round of the protocol because is a neighbor of the root node. In expectation, joins at round at most , which shows the claim for . Now, we assume Eq. 8 holds for and show that then it also holds for . The number of rounds until the node at level accepts an invitation in tree is the sum of , the number of rounds until the first invitation is received, and the number of rounds accepts after receiving the first invitation. is the neighbor of a node with and receives an invitation from one round after joined . So, is bound by our induction hypothesis, and is equal to and hence bound by Eq. 9. As a result,

and hence indeed Eq. 8 holds. ∎

Based on Lemma 6.2, we now prove Theorem 6.1. The idea of the proof is to bound the routing length by a multiple of the expected level of a node.

Proof.

We consider the diversity measure first and then .

For , the claim follows directly from Lemma 6.2 and Theorem 4.3 in [8]. More precisely, the expected level of a node is at most assuming a diameter and hence maximal distance to the root of . Recall that the distance of two nodes and corresponds to the length of the shortest path between them in the tree and is an upper bound on the routing. Now, by Eq. 1, the sum of the length of the two coordinates is an upper bound on . As the length of a coordinate is equal to the level of the corresponding node in the tree, we indeed obtain

(10)

The last step follows from Lemma 6.2. Eq. 6 follows because Eq. 10 holds for all source-destination pairs .

In contrast, the proof for the common prefix length based similarities cannot build on previous results. Note that the change of the distance function does not affect the existence of a path with expected length at most between source and destination in the tree. However, the routing might divert from that path when discovering a node with a longer common prefix length but at a higher depth. For this reason, the sum of the expected levels is not an upper bound on the routing length. Rather, whenever a node with a longer common prefix length is contacted, the upper bound of the remaining number of hops is reset to the expected level of that node in addition to the level of . In the following, we show that such a reset increases the distance in the tree by less than on average. The claim then follows because the number of resets is bounded by the expected level of the destination. Eq. 7 follows by multiplication of the increased distance per reset and the number of resets.

More precisely, let give the tree distance between the -th contacted node and the target . Again, we cannot use the traditional approach for deriving the routing length because is not monotonically decreasing. Rather, we need to bound the number of times that increases and the expected amount of increase . Thus, the routing length from a source node to is bounded by

(11)

The number of times the common prefix length can increase is bound by the length of the target’s coordinate and hence its level in . So by Lemma 6.2,

(12)

The tree distance is potentially increased whenever a node with a longer common prefix length is contacted. Yet, an upper bound on the expected increase is given by the difference in the levels and minus due to the increased common prefix length. Note that and are neighbors and hence the length of their shortest path to the root differs by at most . Lemma 6.2 thus provides the desired bound on

(13)

The desired bound can now be derived from Lemma 6.2, Eqs. 11, 12, and 13 under the assumption that the diameter of the graph and hence all shortest paths to the root scale logarithmically, i.e.,

(14)

As for the first part, Eq. 7 follows because Eq. 14 holds for all pairs . ∎

The bounds for a virtual overlay lookup based on routing algorithm follow directly from the fact that a DHT lookup requires overlay hops with each hop corresponding to one route in the network embedding.

Corollary 6.3.

If the DHT used for the virtual overlay offers logarithmic routing, the communication complexity of routing algorithm is

for the diversity measure and

for diversity measure .

Stabilization:

The stabilization complexity is required to stay polylog in the network size to allow for scalable communication and content addressing. In the following, we hence give bounds for the self-stabilization of the network embeddings. The costs for the virtual overlay follow by considering the maintenance costs for the DHT as suggested for general overlay networks and multiplying with the length of the routes between overlay neighbors.

Theorem 6.4.

We assume the social graph to be of a logarithmic diameter and a constant average degree. Furthermore, we assume the use of a root election protocol with complexity . Then the stabilization complexity of the spanning trees constructed by Algorithm 1 with parameters and for one topology change is

(15)
Proof.

We first consider the complexity for one tree. The general result then follows by multiplying with the number of trees . When a node joins an overlay with a constant average degree, the communication complexity of receiving and replying to all invitations is constant. For a node departure, we consider non-root nodes and root nodes separately. If any node but the root departs, the expected stabilization complexity corresponds to the number of nodes that have to rejoin . This number of nodes is equal to the number of descendants in a tree. Hence, the expected complexity of a departure corresponds to the expected number of descendants. Consider that a node on level is a descendant of nodes, so that the expected number of descendants is given by

If the root node leaves, the spanning tree and the embedding have to be re-established at a complexity of . As the probability for the root to depart is , we indeed have

We have shown that the complexity of routing, content discovery, and stabilization is (poly-)logarithmically bounded, as required.

6.2 Simulations

In this section, we validate the above bounds and relate them to the concrete communication overhead for selected scenarios. We start by detailing our simulation model and set-up, followed by our expectation, the results and their interpretation.

Model and Evaluation Metrics:

In order to evaluate the efficiency, we consider the routing length and the stabilization complexity. We express the stabilization costs in terms of the average number of coordinates that have to be reassigned when a randomly chosen node leaves, i.e., the average number of descendants of a node. The number of messages required for assigning the new coordinates is at most two per assignment, namely the disconnected node registering at a new parent and receiving a new coordinate. We conducted the study to determine how the number of trees, the tree construction algorithm, and the distance function affect routing and stabilization costs.
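The stabilization metric can be computed directly from a spanning tree. Because every node at depth $l$ is a descendant of exactly $l$ nodes, the average number of descendants equals the average depth. The sketch below illustrates this on a toy tree; the parent-map representation and all names are assumptions made for the example.

```python
def descendant_statistics(parent):
    """Return (average number of descendants, average depth) of a tree.

    parent maps each node to its parent; the root maps to None.
    """
    descendants = {node: 0 for node in parent}
    total_depth = 0
    for node in parent:
        ancestor = parent[node]
        while ancestor is not None:
            descendants[ancestor] += 1   # node has to rejoin if ancestor leaves
            total_depth += 1             # one more level for node
            ancestor = parent[ancestor]
    n = len(parent)
    return sum(descendants.values()) / n, total_depth / n


# Toy tree: r is the root, a and b are its children, c is a child of a.
toy_tree = {"r": None, "a": "r", "b": "r", "c": "a"}
avg_descendants, avg_depth = descendant_statistics(toy_tree)
```

For the toy tree, both averages are 1.0: the depths are 0, 1, 1, 2 and the descendant counts are 3, 1, 0, 0. Deeper trees therefore imply a higher expected number of coordinates to reassign per departure.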

We compared our results to those for Freenet, a virtual overlay VO, and the original PIE embedding. The virtual overlay VO combines the advantages of X-Vine and MCON by using shortest paths as tunnels in a Kademlia overlay like MCON but integrating backtracking in the presence of local optima and shortcuts from one tunnel to another like X-Vine.

Set-up:

Due to space constraints, we restrict the presented results to one example network, namely the giant component of a community network from Facebook with 63392 users (http://konect.uni-koblenz.de/networks/facebook-wosn-links).

The spanning tree construction in Algorithm 1 is parametrized by the number of trees , the acceptance probability , and the selection criterion chosen to be either random selection (denoted DIV-RAND) or preference of nodes at a low depth (denoted DIV-DEP). In addition, we consider a breadth-first search for spanning tree construction (denoted BFS). Moreover, we consider the impact of the two distances (denoted TD) and (denoted CPL). The length of the return addresses was set to and the number of bits per element was ; all embeddings were considered for routing.

For the virtual overlay used for content addressing, we chose a highly resilient recursive Kademlia [23] with bucket size and parallel look-ups. Because routing table entries are not uniquely determined by Kademlia identifiers, the entries were chosen randomly from all suitable candidates.

We parametrized the related approaches as follows. For simulating Freenet, we executed the embedding for iteration as suggested in [24] and then routed using a distance-directed depth-first search based only on the information about direct neighbors. The routing and stabilization complexity of the original PIE embedding is equal to the respective quantities of our algorithm for , the distance function and routing without the use of backtracking. In order to better understand the results of the comparison, we simulate the virtual overlay using the same Kademlia overlay as for our own approach but replacing the tree routing by tunnels corresponding to the shortest paths between overlay neighbors. So, we parametrized the related approaches by either using the proposed standard parameters or selecting parameters that are suitable for comparison because they correspond to the same degree of redundancy as the parametrization of our own approach.

All results were averaged over 20 runs. They are displayed with % confidence intervals. Each run consisted of routing attempts for a randomly selected source-destination pair.

Expectations:

We expect that the routing length decreases with the number of embeddings, because the number of available routes and thus the probability to discover the shortest route in one embedding increases. In general, the routing length is directly related to the tree depth and should thus be lower for BFS and DIV-DEP.

Similarly, we expect a higher stabilization overhead for trees of a higher depth as the expected number of descendants per node increases. Thus, the number of nodes that need to select a new parent should be higher for DIV-RAND than for DIV-DEP and BFS.

In comparison to the existing approaches, our approach should enable shorter routes between pairs of nodes than both Freenet and VO. As shown above, we achieve a routing complexity of whereas the related work achieves at best routes of polylog length. However, our routes for content discovery should be slightly longer than in VO. VO utilizes the same DHT routing but uses shortest paths rather than the longer tree routes.

Results:

Figure 4: Impact of number of embeddings , tree construction, and distance function on routing length for a) tree routing and b) Kademlia lookup with degree of parallelism ; related approaches result in routing lengths of (virtual overlay ) and close to (Freenet), and c) stabilization overhead

The impact of the three parameters, number of trees, tree construction, and distance on the routing length confirms our expectations. First, the results indicate that the tree construction, in particular the number of trees, is the dominating factor for the routing length. So, the routing length decreased considerably if multiple embeddings were used because the shortest route in any of the trees was considered. Second, preferring parents closer to the root, i.e., using BFS or DIV-DEP, produced shorter routes in the tree and hence reduced the routing length. Third, in comparison to the tree construction, the choice of a distance function had less impact. For BFS or DIV-DEP, the advantage of TD over CPL was barely noticeable, whereas the difference for DIV-RAND was still small but noticeable. In order to understand this difference, note that CPL is expected to lead to longer routes. The reason for the longer routes lies in forwarding the request to neighbors at a higher depth, which might have a long common prefix but are nevertheless at a higher distance from the destination due to their depth. For BFS or DIV-DEP, the difference of the depth of neighbors was generally small because neighbors at a lower depth were preferably selected as parents. In contrast, DIV-RAND allows for larger differences in depth. Hence there is a higher probability to increase the tree distance by selecting a neighbor with a longer common prefix length but at a high depth. All in all, the routing length varied between (BFS, , TD) and (DIV-RAND, , CPL) hops, as displayed in Figure 4(a). In summary, the use of multiple embeddings indeed reduced the routing length considerably.

The performance of the DHT lookup in the virtual overlay was directly related to the previous results (cmp. Fig. 4(b) for the distance under TD). The overhead for the discovery of a randomly chosen Kademlia ID, stored at the node with the closest ID in the overlay, varied between and hops in the F2F overlay, at around 4 hops in the virtual overlay.

By Theorem 6.4, the stabilization complexity was expected to increase at most quadratically with the number of trees. Indeed, Figure 4(c) supports this fact for DIV-RAND. The increase for BFS and DIV-DEP was even only linear and slightly super-linear, respectively. Note that the quadratic increase is due to the rising average depth of additional trees generated by Algorithm 1. With the goal of achieving diverse spanning trees, nodes select parents at a higher depth. However, the average number of descendants increases with the depth, because a node at depth is a descendant of nodes. Due to the stabilization complexity corresponding to the number of the departing node’s descendants, the stabilization overhead was higher for DIV-RAND and DIV-DEP than for BFS. More precisely, BFS constructs all trees independently, so that the average depth of each tree is independent of the number of trees. The stabilization complexity per tree thus remains constant. DIV-DEP, aiming to balance diversity and short routes, causes stabilization overhead between the two former approaches, but performed closer to BFS (this similarity also held for the routing length). More concretely, the average stabilization overhead for a departing node was slightly below for a single tree. For it increased to (BFS), (DIV-DEP), and more than (DIV-RAND). In contrast to a complete re-computation of the embedding requiring at least messages, the stabilization overhead is negligible.

For the related approaches, we found a routing length of for Freenet, for VO with , and for VO with . Furthermore, the shortest paths are on average of length , meaning that our routing length of is close to optimal. So, routing between nodes in the tree required less than half the overhead of state-of-the-art approaches. Routing in the virtual overlay, requiring at best less than hops in our scheme, was slightly more costly in our approach than in VO due to the inability of the tree routing to guarantee shortest paths between virtual neighbors.

A straightforward comparison of the stabilization overhead was not possible. Since Freenet stabilizes periodically, there is no overhead directly associated with a leaving node. In case of virtual overlays, VO uses flooding for stabilization, which is clearly more costly. Other overlays such as X-Vine use less costly stabilization, but their stabilization and routing overhead are unstable and increase over time as shown in [6], so that it is unclear which state of the system should be considered for a comparison. In order to nevertheless give a lower bound on the stabilization overhead, we computed the number of tunnels that needed to be rebuilt in VO. On average, tunnels corresponding to shortest paths were affected by a departing node. If a tunnel is repaired by routing in the Kademlia overlay as in X-Vine, the stabilization overhead per tunnel corresponds to routing a request and the corresponding reply, i.e., for tunnels corresponding to shortest paths at least messages, resulting in a lower bound of more than messages per node departure. The above stabilization algorithm is unable to maintain short routes, such that the actual overhead of stabilization in virtual overlays is even higher than the above lower bound.

Discussion:

Our simulation study validates the asymptotic bounds. Indeed, the routing length and thus the routing complexity for messaging is very low, improving on the state-of-the-art by more than a factor of . The stabilization complexity is similarly low if the number of trees is not too high. Even for trees, the number of involved nodes is generally well below 100, which still improves upon virtual overlays such as VO, the most promising state-of-the-art candidate. Only content discovery in the form of a DHT lookup was slightly more costly in our approach than in VO, which we consider acceptable given the considerable advantage with regard to all other metrics.

We have considerably improved the efficiency of F2F overlays. In the following, we show that we also mitigated their vulnerability to failures and attacks.

7 Robustness and Censorship-Resistance

In this section, we consider the robustness and resilience to censorship of VOUTE. Note that the evaluation of the censorship-resilience requires a specification of the modified stabilization algorithm , which we refer to as the attack strategy in the following. After deriving two attack strategies, we present our theoretical and simulation-based evaluation.

We express our results in terms of node coordinates and distances and rather than the corresponding diversity measures. The use of distances simplifies the notation as we do not need to apply a hash cascade for the comparison of coordinates and return addresses. As the routing paths are chosen identically for both coordinates and return addresses, the results are equally valid for return addresses.

7.1 Attack Strategy

We first describe our attack strategies and then comment on additional strategies and our reasons for the selection. In order to model secure and insecure root selection protocols, we consider two realizations of the attack strategy, ATT-RAND and ATT-ROOT. In the following, assume that one attacker node has established links to honest nodes and now aims to censor communication.

For secure spanning trees, the adversary is unable to manipulate the root election. Nevertheless, it can manipulate the subsequent embedding. Under the attack strategy ATT-RAND, the attacker assigns each of its children a different random prefix rather than the correct prefix. In this manner, routing fails because nodes in the higher levels of the tree do not recognize the prefix. So, the impact of the attack is increased in comparison to a random failure.
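The effect of ATT-RAND on the coordinates can be illustrated with a small sketch. It assumes a simplified prefix embedding in which a parent hands its own coordinate to each child as a prefix and the child appends a last element (here simply its own name); the tree, the node names, and the function `embed` are hypothetical and not part of the actual protocol.

```python
import random


def cpl(a, b):
    """Common prefix length of two coordinate tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def embed(children, root, attacker=None, rng=None):
    """Assign coordinates top-down; an attacker hands out random prefixes."""
    rng = rng or random.Random(0)
    coords = {root: ()}
    used = set()  # fake prefixes already handed out by the attacker
    stack = [root]
    while stack:
        u = stack.pop()
        for child in children.get(u, []):
            prefix = coords[u]
            if u == attacker:
                # ATT-RAND: a distinct random prefix instead of the correct one.
                length = max(1, len(prefix))
                fake = prefix
                while fake == prefix or fake in used:
                    fake = tuple(rng.getrandbits(32) for _ in range(length))
                used.add(fake)
                prefix = fake
            coords[child] = prefix + (child,)  # the child picks the last element
            stack.append(child)
    return coords


# Hypothetical tree: r is the root, x is the attacker with children u and v.
children = {"r": ["a", "x"], "x": ["u", "v"], "u": ["w"]}
honest = embed(children, "r")
attacked = embed(children, "r", attacker="x")

print(cpl(honest["w"], honest["x"]), cpl(attacked["w"], attacked["x"]))
```

In the honest embedding, the coordinate of w extends that of x, so nodes above x can forward towards w. Under the simulated attack, the subtree below x no longer shares any prefix with x or with the rest of the tree, while nodes outside that subtree keep their coordinates.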

In contrast, if the adversary can manipulate the root election protocol, ATT-ROOT manipulates the root election in all spanning trees such that the attacker becomes the root in all trees. Under the assumption that the root observes the maximal number of requests, the attack should result in a high ratio of failed requests.
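To give an intuition for the assumption that the root observes many requests, the following sketch computes the fraction of node pairs whose tree path traverses the root. It considers tree paths only, whereas greedy routing may also use non-tree links, so the value should be read as a rough illustration rather than as the actual fraction of requests seen by the root; the example trees are hypothetical.

```python
from collections import Counter


def fraction_through_root(parent):
    """Fraction of pairs of non-root nodes whose tree path contains the root.

    `parent` maps node -> parent, with the root mapped to None.
    """
    def top_ancestor(v):
        # Climb until v is a direct child of the root.
        while parent[parent[v]] is not None:
            v = parent[v]
        return v

    non_root = [v for v in parent if parent[v] is not None]
    sizes = Counter(top_ancestor(v) for v in non_root)
    n = len(non_root)
    total = n * (n - 1) // 2
    # Pairs within the same root subtree do not need the root.
    same_subtree = sum(k * (k - 1) // 2 for k in sizes.values())
    return (total - same_subtree) / total


# Hypothetical trees: a balanced binary tree of depth 2 and a path.
balanced = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
chain = {0: None, 1: 0, 2: 1, 3: 2, 4: 3}

print(fraction_through_root(balanced))  # 0.6
print(fraction_through_root(chain))     # 0.0
```

The more evenly the nodes are spread over the subtrees below the root, the larger this fraction becomes, which is why controlling the root is attractive for an attacker.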

Now, we briefly comment on some further attack strategies and give reasons for our decision not to implement them.

First, note that in the original PIE embedding, assigning the same coordinate to two children is another attack strategy. In contrast to the above strategy, the routing can then fail even if the attacker is not involved in forwarding the actual request because the node coordinates are not unique and thus the request might end up at a different node than the receiver. In the modified embedding, the child decides on the last element of the coordinate. Hence, the attacker can only assign a node the coordinate of another node as a prefix, so that the two nodes appear to be close but are indeed not. However, upon realizing that does not offer a route to , the routing algorithm backtracks, so that this attack strategy merely increases the routing complexity but does not reduce the success ratio. Thus, we do not consider it here.

Second, recall from Section 3 that the attacker can also generate an arbitrary number of identities whereas the above attack strategies only rely on one identity. In the following, we argue that without additional knowledge, the use of additional identities in the tree does not improve the strength of the attack.

ATT-RAND actually simulates different virtual identities by providing distinct fake prefixes to all children. In practice, it might indeed be wise to use distinct physical nodes, because doing so minimizes the risk of detection if two neighbors realize that they are connected to the same physical node but received different prefixes.

For ATT-ROOT, the attacker might have to create (virtual) identities in order to manipulate the root election. As soon as is the root in each tree, multiple identities could be used to provide prefixes of different lengths. However, if a neighbor of receives a long prefix from , there is a high chance that and potential descendants of choose different parents seemingly closer to the root. Thus, in expectation a large number of nodes join those subtrees rooted at a neighbor of with a short prefix. As routing within such a subtree does not require to forward a request from , ’s impact is likely to be reduced. Hence, without concrete topology knowledge, the insertion of additional virtual identities (corresponding to prefixes of different lengths) usually does not present an obvious advantage for .

7.2 Theoretical Analysis

We present two theoretical results in this section. First, we characterize the backtracking algorithm more closely. Second, we show that the censorship-resistance is improved by using the distance rather than .

Throughout this section, let and denote Algorithm 2 with distance and , respectively. Furthermore, let and denote the corresponding standard greedy routing algorithms, which terminate in local optima with regard to the distance to the destination’s coordinate. Let denote the success ratio of a routing algorithm . We are considering the success ratio for one embedding. The overall success ratio is improved as it is the combined success ratio of all embeddings.

Lemma 7.1.

We have that

(16)

Furthermore, Algorithm 2 is successful if and only if there exists a greedy path of responsive nodes according to its distance metric .

Proof.

Eq. 16 follows because Algorithm 2 is identical to the standard greedy algorithm until the latter terminates. Then, Algorithm 2 continues to search for an alternative, possibly increasing the success ratio.

For the second part, recall that a greedy path is a path such that the distance to the destination decreases in each step, i.e., for all and a distance . Assume Algorithm 2 does not discover a route from the source and despite the existence of a greedy path of responsive nodes. Let be the set of nodes that forwarded the request according to Algorithm 2, and let . Then the neighbor of did not receive the request despite being closer to than . Though might have a neighbor closer to than , the request is backtracked to if forwarding to does not result in a route to the destination. Routing only terminates if either a route is found or has forwarded the request to all closer neighbors, including . Thus, Algorithm 2 cannot fail if a greedy path exists. In contrast, if there are not any greedy paths from to , any path with and contains a pair with . Thus, Algorithm 2 does not forward the request to and hence does not discover a path from to . It follows that indeed Algorithm 2 is successful if and only if a greedy path of responsive nodes exists. ∎
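The statement of Lemma 7.1 can be checked on a small example. The sketch below is a simplified rendering of greedy routing with and without backtracking, not a verbatim implementation of Algorithm 2; the overlay, its coordinates, the set of unresponsive nodes, and the use of the tree distance are assumptions made for illustration.

```python
def cpl(a, b):
    """Common prefix length of two coordinate tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def dist(a, b):
    """Tree distance between two coordinates."""
    return len(a) + len(b) - 2 * cpl(a, b)


def closer_neighbors(graph, coords, failed, u, t):
    """Responsive neighbors of u strictly closer to t, closest first."""
    d_u = dist(coords[u], coords[t])
    cand = [v for v in graph[u]
            if v not in failed and dist(coords[v], coords[t]) < d_u]
    return sorted(cand, key=lambda v: dist(coords[v], coords[t]))


def greedy(graph, coords, failed, s, t):
    """Standard greedy routing: stops in a local optimum."""
    path = [s]
    while path[-1] != t:
        cand = closer_neighbors(graph, coords, failed, path[-1], t)
        if not cand:
            return None
        path.append(cand[0])
    return path


def greedy_backtracking(graph, coords, failed, s, t):
    """Greedy routing that backtracks and tries the next closer neighbor."""
    visited = set()

    def visit(u):
        if u == t:
            return [u]
        visited.add(u)
        for v in closer_neighbors(graph, coords, failed, u, t):
            if v not in visited:
                rest = visit(v)
                if rest is not None:
                    return [u] + rest
        return None

    return visit(s)


# Hypothetical overlay: tree edges plus the shortcuts s-c and r-t.
coords = {"r": (), "a": (1,), "s": (1, 1), "b": (2,), "c": (2, 2), "t": (2, 1)}
graph = {"s": ["a", "c"], "a": ["s", "r"], "c": ["s", "b"],
         "r": ["a", "b", "t"], "b": ["r", "c", "t"], "t": ["b", "r"]}

print(greedy(graph, coords, {"b"}, "s", "t"))                    # None
print(greedy_backtracking(graph, coords, {"b"}, "s", "t"))       # ['s', 'a', 'r', 't']
print(greedy_backtracking(graph, coords, {"b", "r"}, "s", "t"))  # None
```

With b unresponsive, standard greedy routing is stuck at c, whereas the backtracking variant returns to s and discovers the greedy path via a and r. If r fails as well, no greedy path of responsive nodes remains and both variants fail, in line with the second part of the lemma.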

Now, we use Lemma 7.1 to show that using a common prefix length based distance generally enhances the censorship-resistance.

Theorem 7.2.

Let be an attacker applying either ATT-RAND or ATT-ROOT. Then for all distinct nodes

(17)

i.e., if does not discover a route between and , then does not discover a route. However, the converse does not hold. In particular,

(18)
Figure 5: Illustrating the proof of Theorem 7.2: left: is closer to destination than for distance but not for ; right: pair for which is successful as forwards to , but is not successful because forwards to the attacker.
Proof.

We prove the claim by contradiction. Assume that there is a pair such that the algorithm terminates successfully while does not. Let with and denote the discovered route. By Lemma 7.1, is a greedy path for distance but not for . In other words, there exists such that i)