Reliable Communication in a Dynamic Network
in the Presence of Byzantine Faults
We consider the following problem: two nodes want to reliably communicate in a dynamic multihop network where some nodes have been compromised, and may have a totally arbitrary and unpredictable behavior. These nodes are called Byzantine. We consider the two cases where cryptography is available and not available.
We prove the necessary and sufficient condition (that is, the weakest possible condition) to ensure reliable communication in this context. Our proof is constructive, as we provide Byzantine-resilient algorithms for reliable communication that are optimal with respect to our impossibility results.
In a second part, we investigate the impact of our conditions in three case studies: participants interacting in a conference, robots moving on a grid and agents in the subway. Our simulations indicate a clear benefit of using our algorithms for reliable communication in those contexts.
As modern networks grow larger, their components become more likely to fail, sometimes in unforeseen ways. As opportunistic networks become more widespread, the lack of global control over individual participants makes those networks particularly vulnerable to attacks. Many failure and attack models have been proposed, but one of the most general is the Byzantine model proposed by Lamport et al. [LSP82j]. The model assumes that faulty nodes can behave arbitrarily. In this paper, we study the problem of reliable communication in a multihop network despite the presence of Byzantine faults. The problem proves difficult since even a single Byzantine node, if not neutralized, can lie to the entire network.
A common way to solve this problem is to use cryptography [CL99c, DFS05c]: the nodes use digital signatures to authenticate the sender across multiple hops. However, cryptography per se is not unconditionally reliable, as shown by the recent Heartbleed bug [hbleed] discovered in the widely deployed OpenSSL software. The defense in depth paradigm [indepth] advocates the use of multiple layers of security controls, including non-cryptographic ones. For instance, if the cryptography-based security layer is compromised by a bug, a virus, or intentional tampering, a cryptography-free communication layer can be used to safely broadcast a patch or to update cryptographic keys. Thus, it is interesting to develop both cryptographic and non-cryptographic strategies.
Following the setting of the seminal paper of Lamport et al. [LSP82j], many subsequent papers focusing on Byzantine tolerance [AW98b, MMR03j, MRRS01c, MS03j] study agreement and reliable communication primitives using cryptography-free protocols in networks that are both static and fully connected. A recent exception to fully connected topologies in Byzantine agreement protocols is the work of Tseng, Vaidya and Liang [TV13, VTL12], which considers specific classes of static directed graphs (i.e., graphs with a particularly high clustering coefficient) and studies approximate and iterative versions of the agreement problem.
In general multihop networks, two notable classes of algorithms use some locality property to tolerate Byzantine faults: space-local and time-local algorithms. Space-local algorithms [MT07j, NA02c, SOM05c] try to contain the fault (or its effect) as close to its source as possible. This is useful for problems where information from remote nodes is unimportant (such as vertex coloring, link coloring, or dining philosophers). Time-local algorithms [DMT10ca, DMT10cd, DMT11j, DMT11cb, MT06cb] try to limit the effect of Byzantine faults over time. The time-local algorithms presented so far tolerate at most a single Byzantine node, and are unable to mask the effect of Byzantine actions. Thus, neither approach is suitable for reliable communication.
In dense multihop networks, a first line of work assumes that there is a bound on the fraction of Byzantine nodes among the neighbors of each node. Protocols have been proposed for nodes organized on a grid [BV05c, K04c] (but with much more than 4 neighbors), and later generalized to other topologies [PP05j], with the assumption that each node knows the global topology. Since this approach requires all nodes to have a large degree, it may not be suitable for every multihop network. The case of sparse networks was studied under the assumption that Byzantine failures occur uniformly at random [CtrZ, Scalbyz, Trig], an assumption that holds, e.g., in structured overlay networks where the identifier (a.k.a. position) of a new node joining the network is assigned randomly, but not necessarily in various actual communication networks.
Most related to our work is the line of research that assumes the existence of $2k+1$ node-disjoint paths from source to destination, in order to provide reliable communication in the presence of up to $k$ Byzantine failures [D82j, NT09j, secure_mt]. The initial solution [D82j] assumes that each node is aware of the global network topology, but this hypothesis was dropped in subsequent work [NT09j, explorratum].
None of the aforementioned papers considers genuinely dynamic networks, i.e., where the topology evolves while the protocol executes.
In this paper, our objective is to determine the condition for reliable communication in the presence of up to $k$ Byzantine failures in a dynamic network, where the topology can vary with time. The proof technique used in [D82j, NT09j, secure_mt] implicitly relies on Menger’s theorem [menger_thm], which can be expressed as follows: there exist $x$ disjoint paths between two nodes $p$ and $q$ if and only if $x$ nodes must be removed to disconnect $p$ and $q$.
However, Menger’s theorem does not generalize to dynamic networks [menger_fail]. To illustrate this, let us consider the simple dynamic network of Figure 1. This network evolves in two steps. There exist three dynamic paths connecting $p$ to $q$. To cut these three paths, at least two nodes must be removed. Yet, it is impossible to find two disjoint paths among the three dynamic paths. Therefore, Menger’s theorem cannot be used to prove the condition in dynamic networks.
In this paper, we prove the necessary and sufficient condition for reliable communication in dynamic networks, in the presence of up to $k$ Byzantine failures. We consider the two cases where cryptography is available and not available. Our characterization is based on a dynamic version of a minimal cut between $p$ and $q$, denoted by $DynMinCut(p, q)$, that takes into account both the presence of particular paths and their duration with respect to the delay that is necessary to actually transmit a message over a path. The condition is that $DynMinCut(p, q)$ is strictly greater than $2k$ (without cryptography) or $k$ (with cryptography). The proof is constructive, as we provide algorithms to prove the sufficiency of the condition.
In a second part, we apply these conditions to three case studies: participants interacting in a conference, robots moving on a grid and agents moving in the subway. We thus show the benefit of this multihop approach for reliable communication, instead of waiting for the source to meet the sink directly (if this event ever occurs).
Organization of the paper
We consider a continuous temporal domain $\mathbb{R}^+$ where dates are positive real numbers. We model the system as a time varying graph, as defined by Casteigts, Flocchini, Quattrociocchi and Santoro [dynam_model], where vertices represent the processes and edges represent the communication links (or channels). A time varying graph is a dynamic graph represented by a tuple $G = (V, E, \rho, \zeta)$ where:
$V$ is the set of nodes.
$E \subseteq V \times V$ is the set of edges.
$\rho : E \times \mathbb{R}^+ \rightarrow \{0, 1\}$ is the presence function: $\rho(e, t) = 1$ indicates that edge $e$ is present at date $t$.
$\zeta : E \times \mathbb{R}^+ \rightarrow \mathbb{R}^+$ is the latency function: $\zeta(e, t) = T$ indicates that a message sent at date $t$ takes $T$ time units to cross edge $e$.
The discrete time model is a special case, where time and latency are restricted to integer values.
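As an illustration, the tuple above can be sketched as a small Python structure. This is only a minimal sketch under stated assumptions: the class name `TimeVaryingGraph`, the method `local_topology`, and the example edge are hypothetical and do not come from the paper; edges are treated as directed pairs.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


@dataclass
class TimeVaryingGraph:
    """Hypothetical container for a time varying graph (V, E, rho, zeta)."""
    nodes: Set[Node]
    edges: Set[Edge]
    presence: Callable[[Edge, float], bool]   # rho(e, t): is edge e present at date t?
    latency: Callable[[Edge, float], float]   # zeta(e, t): crossing time of e at date t

    def local_topology(self, u: Node, t: float) -> Set[Node]:
        """Nodes v such that the directed edge (u, v) is present at date t."""
        return {v for (a, v) in self.edges if a == u and self.presence((a, v), t)}


# Hypothetical example: edge (a, b) is present during [0, 2], with constant latency 1.
g = TimeVaryingGraph(
    nodes={"a", "b"},
    edges={("a", "b")},
    presence=lambda e, t: 0 <= t <= 2,
    latency=lambda e, t: 1.0,
)
```

Representing the presence and latency functions as callables keeps the sketch usable for both the continuous and the discrete time models.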
We make the same hypotheses as previous work on the subject [BV05c, D82j, K04c, CtrZ, Trig, Scalbyz, NT09j, PP05j]. First, each node has a unique identifier. Then, we assume authenticated channels (or oral model), that is, when a node $q$ receives a message through channel $(p, q)$, it knows the identity of $p$. Now, an omniscient adversary can select up to $k$ nodes as Byzantine. These nodes can have a totally arbitrary and unpredictable behavior defined by the adversary (including tampering or dropping messages, or simply crashing). Finally, other nodes are correct and behave as specified by the algorithm. Of course, correct nodes are unable to know a priori which nodes are Byzantine. We also assume that a correct node $u$ is aware of its local topology at any given date $t$ (that is, $u$ knows the set of nodes $v$ such that $\rho((u, v), t) = 1$).
Informally, a dynamic path is a sequence of nodes a message can traverse, with respect to network dynamicity and latency.
Definition 1 (Dynamic path).
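A sequence of distinct nodes $(u_1, \dots, u_n)$ is a dynamic path from $u_1$ to $u_n$ if and only if there exists a sequence of dates $(t_1, \dots, t_n)$ such that, $\forall i \in \{1, \dots, n-1\}$, we have:
$e_i = (u_i, u_{i+1}) \in E$, i.e. there exists an edge connecting $u_i$ to $u_{i+1}$.
$\forall t \in [t_i, t_i + \zeta(e_i, t_i)]$, $\rho(e_i, t) = 1$, i.e. $u_i$ can send a message to $u_{i+1}$ at date $t_i$.
$\zeta(e_i, t_i) \leq t_{i+1} - t_i$, i.e. the aforementioned message is received by date $t_{i+1}$.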
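The definition can be checked mechanically in the discrete time model. The sketch below is an illustration under stated assumptions, not the paper's procedure: dates are integers in `0..horizon`, an edge must be present at every integer date of the crossing interval, and the names `is_dynamic_path` and `make_presence` are hypothetical. For each hop it keeps the earliest possible arrival date, which is enough because a node may always wait before forwarding.

```python
def make_presence(schedule):
    """Build a presence function from a dict: edge -> set of dates where it is present."""
    return lambda e, t: t in schedule.get(e, set())


def is_dynamic_path(path, edges, presence, latency, horizon):
    """Return True if `path` is a dynamic path (discrete time, dates 0..horizon)."""
    if len(set(path)) != len(path):
        return False                      # nodes of a dynamic path must be distinct
    ready = 0                             # earliest date at which the message is at u_i
    for u, v in zip(path, path[1:]):
        e = (u, v)
        if e not in edges:
            return False
        arrivals = []
        for t in range(ready, horizon + 1):
            d = latency(e, t)
            # The edge must stay present during the whole crossing [t, t + d].
            if all(presence(e, x) for x in range(t, t + d + 1)):
                arrivals.append(t + d)
        if not arrivals:
            return False
        ready = min(arrivals)             # earliest arrival at the next node
    return True


edges = {("p", "a"), ("a", "q")}
unit_latency = lambda e, t: 1

# (p, a) is available first, then (a, q): the message can go through.
early_then_late = make_presence({("p", "a"): {0, 1}, ("a", "q"): {2, 3}})
# Same edges in the opposite temporal order: (a, q) disappears too early.
late_then_early = make_presence({("p", "a"): {2, 3}, ("a", "q"): {0, 1}})
```

With these two hypothetical schedules, `(p, a, q)` is a dynamic path in the first case only, which shows that the order in which edges appear matters as much as their existence.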
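We now define the dynamic minimal cut between two nodes $p$ and $q$ as the minimal number of nodes (besides $p$ and $q$) one has to remove from the network to prevent the existence of a dynamic path between $p$ and $q$. Formally:
Let $Dyn(p, q)$ be the set of node sets $\{u_1, \dots, u_n\}$ such that $(p, u_1, \dots, u_n, q)$ is a dynamic path.
For a set of node sets $\Omega = \{S_1, \dots, S_n\}$, let $Cut(\Omega)$ be the set of node sets $C$ such that, $\forall i \in \{1, \dots, n\}$, $C \cap S_i \neq \emptyset$ ($C$ contains at least one node from each set $S_i$).
Let $MinCut(\Omega) = \min_{C \in Cut(\Omega)} |C|$ (the size of the smallest element of $Cut(\Omega)$). If $Cut(\Omega)$ is empty, we assume that $MinCut(\Omega) = +\infty$.
Let $DynMinCut(p, q) = MinCut(Dyn(p, q))$.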
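The minimal cut of a collection of node sets is a minimum hitting set, which can be computed by brute force on small instances. The function below is a minimal sketch; the name `min_cut` and the example sets are hypothetical, and the exhaustive search is only meant for illustration since the problem is hard in general.

```python
from itertools import combinations
import math


def min_cut(node_sets):
    """Size of the smallest node set hitting every set of `node_sets`.

    Returns math.inf when no such set exists, i.e. when one of the sets is
    empty (for instance a direct edge from source to destination).
    """
    sets = [frozenset(s) for s in node_sets]
    if any(len(s) == 0 for s in sets):
        return math.inf
    universe = list(set().union(*sets))
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = set(candidate)
            if all(chosen & s for s in sets):
                return size
    return math.inf


# Hypothetical example: three relay sets that pairwise intersect.
paths = [{"a", "b"}, {"b", "c"}, {"a", "c"}]
```

In this example no two sets are disjoint, yet two nodes are needed to hit all three, so a cut-based measure can be larger than the number of disjoint paths.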
We say that a node multicasts a message $m$ when it sends $m$ to all nodes in its current local topology. Now, a node $u$ accepts a message $m$ from another node $v$ when it considers that $v$ is the author of this message. We now define our problem specification, that is, reliable communication.
Definition 2 (Reliable communication).
Let $p$ and $q$ be two correct nodes. An algorithm ensures reliable communication from $p$ to $q$ when the following two conditions are satisfied:
When $q$ accepts a message from $p$, $p$ is necessarily the author of this message.
When $p$ sends a message, $q$ eventually receives and accepts this message from $p$.
3 Non-cryptographic reliable communication
In this section, we consider that cryptography is not available. We first provide a Byzantine-resilient multihop broadcast protocol. This algorithm is used as a constructive proof for the sufficient condition for reliable communication. We then prove the necessary and sufficient condition for reliable communication.
Informal description of the algorithm
Consider that each correct node $s$ wants to broadcast a message $m$ to the rest of the network. Let us first discuss why the naive flood-based solution fails. A naive first idea would be to send a tuple $(s, m)$ through all possible dynamic paths: thus, each node receiving $(s, m)$ knows that $s$ broadcast $m$. Yet, Byzantine nodes may forward false messages, e.g., a Byzantine node could forward the tuple $(s, m')$, with $m' \neq m$, to make the rest of the network believe that $s$ broadcast $m'$.
To prevent correct nodes from accepting false messages, we attach to each message the set of nodes that have been visited by this message since it was sent (that is, we use a tuple $(s, m, S)$, where $S$ is a set of nodes already visited by $m$ since $s$ sent it). As the Byzantine nodes can send any message, they can in particular forward false tuples $(s, m, S)$. Therefore, a correct node only accepts a message when it has been received through a collection of dynamic paths that cannot be cut by $k$ nodes (where $k$ is a parameter of the algorithm, and supposed to be an upper bound on the total number of Byzantine nodes in the network).
Each correct node maintains the following variables:
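$u.m_0$, the message that $u$ wants to broadcast.
$u.\Omega$, a dynamic set registering all tuples $(s, m, S)$ received by $u$.
$u.Acc$, a dynamic set of confirmed tuples $(s, m)$. We assume that whenever $(s, m) \in u.Acc$, $u$ accepts $m$ from $s$.
Initially, $u.\Omega = \{(u, u.m_0, \emptyset)\}$ and $u.Acc = \{(u, u.m_0)\}$.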
Each correct node obeys the three following rules:
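1. Initially, and whenever $u.\Omega$ or the local topology of $u$ change: multicast $u.\Omega$.
2. Upon reception of $\Omega$ through channel $(v, u)$: $\forall (s, m, S) \in \Omega$, if $v \notin S$ then append $(s, m, S \cup \{v\})$ to $u.\Omega$.
3. Whenever there exist $s$, $m$ and $\{S_1, \dots, S_n\}$ such that $\forall i \in \{1, \dots, n\}$, $(s, m, S_i \cup \{s\}) \in u.\Omega$ and $MinCut(\{S_1, \dots, S_n\}) > k$: append $(s, m)$ to $u.Acc$.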
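The reception and acceptance rules can be exercised on a toy scenario. The sketch below rests on explicit assumptions: the reception rule is read as "record the forwarding neighbor in the visited set unless it is already there", and the acceptance rule as "the visited sets, minus the source, have a minimal cut greater than $k$". The class `Node`, the helper `min_cut` and the scenario in `demo` are hypothetical and are not the authors' implementation.

```python
from itertools import combinations
import math


def min_cut(node_sets):
    """Brute-force minimum hitting set size (math.inf if some set is empty)."""
    sets = [frozenset(s) for s in node_sets]
    if any(len(s) == 0 for s in sets):
        return math.inf
    universe = list(set().union(*sets))
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            if all(set(candidate) & s for s in sets):
                return size
    return math.inf


class Node:
    def __init__(self, ident, message, k):
        self.id, self.k = ident, k
        self.omega = {(ident, message, frozenset())}   # tuples (source, message, visited)
        self.acc = {(ident, message)}                  # confirmed (source, message)

    def receive(self, sender, omega):
        # Reception rule (assumed form): tag each tuple with the forwarding neighbor.
        for (s, m, visited) in omega:
            if sender not in visited:
                self.omega.add((s, m, visited | {sender}))
        self.confirm()

    def confirm(self):
        # Acceptance rule (assumed form). Using all recorded sets is enough:
        # adding sets to a collection can only increase its minimal cut.
        by_claim = {}
        for (s, m, visited) in self.omega:
            if s in visited:
                by_claim.setdefault((s, m), []).append(visited - {s})
        for claim, sets in by_claim.items():
            if min_cut(sets) > self.k:
                self.acc.add(claim)


def demo():
    """p reaches q through relays a, b, c; b is Byzantine and forges a message."""
    k = 1
    p, q = Node("p", "hello", k), Node("q", "msg-q", k)
    relays = {name: Node(name, "msg-" + name, k) for name in ("a", "b", "c")}
    for relay in relays.values():
        relay.receive("p", p.omega)
    for name in ("a", "c"):                            # correct relays forward their state
        q.receive(name, relays[name].omega)
    q.receive("b", {("p", "fake", frozenset({"p"}))})  # forged tuple sent by b
    return q.acc
```

In this scenario the genuine message reaches $q$ through the relay sets $\{a\}$ and $\{c\}$, whose minimal cut is 2, whereas the forged message is only supported by $\{b\}$, whose minimal cut is 1 and does not exceed $k = 1$.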
Condition for reliable communication
Let us consider a given dynamic graph, and two given correct nodes $p$ and $q$. Our main result is as follows:
Theorem 1.
For a given dynamic graph, a $k$-Byzantine tolerant reliable communication from $p$ to $q$ is feasible if and only if $DynMinCut(p, q) > 2k$.
Lemma 1 (Necessary condition).
For a given dynamic graph, let us suppose that there exists an algorithm ensuring reliable communication from $p$ to $q$. Then, we necessarily have $DynMinCut(p, q) > 2k$.
Let us suppose the opposite: there exists an algorithm ensuring reliable communication from $p$ to $q$, and yet, $DynMinCut(p, q) \leq 2k$. Let us show that it leads to a contradiction.
As we have $DynMinCut(p, q) \leq 2k$ and $DynMinCut(p, q) = MinCut(Dyn(p, q))$, there exists an element $C$ of $Cut(Dyn(p, q))$ such that $|C| \leq 2k$. Let $C_1$ be a subset of $C$ containing $k'$ elements, with $k' = \min(k, |C|)$. Let $C_2 = C \setminus C_1$. Thus, we have $|C_1| \leq k$ and $|C_2| \leq k$.
According to the definition of $Cut(Dyn(p, q))$, $C$ contains a node of each possible dynamic path from $p$ to $q$. Therefore, the information that $q$ receives about $p$ is completely determined by the behavior of the nodes in $C$.
Let us consider two possible placements of Byzantine nodes, and show that they lead to a contradiction:
First, suppose that all nodes in $C_1$ are Byzantine, and that all other nodes are correct. This is possible since $|C_1| \leq k$.
Suppose now that $p$ broadcasts a message $m$. Then, according to our hypothesis, since the algorithm ensures reliable communication, $q$ eventually accepts $m$ from $p$, regardless of what the behavior of the nodes in $C_1$ may be.
Now, suppose that all nodes in $C_2$ are Byzantine, and that all other nodes are correct. This is also possible since $|C_2| \leq k$.
Then, suppose that $p$ broadcasts a message $m' \neq m$, and that the Byzantine nodes have exactly the same behavior as the nodes of $C_2$ had in the previous case.
Thus, as the information that $q$ receives about $p$ is completely determined by the behavior of the nodes of $C$, from the point of view of $q$, this situation is indistinguishable from the previous one: the nodes of $C_2$ have the same behavior, and the behavior of the nodes of $C_1$ is unimportant. Thus, similarly to the previous case, $q$ eventually accepts $m$ from $p$.
Therefore, in the second situation, $p$ broadcasts $m' \neq m$, and $q$ eventually accepts $m$. Thus, according to Definition 2, the algorithm does not ensure reliable communication, which contradicts our initial hypothesis. Hence, the result. ∎
Lemma 2 (Safety).
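Let us suppose that all correct nodes follow our algorithm. If $(p, m) \in q.Acc$, then $m = p.m_0$.
As $(p, m) \in q.Acc$, according to rule 3 of our algorithm, there exists $\{S_1, \dots, S_n\}$ such that, $\forall i \in \{1, \dots, n\}$, $(p, m, S_i \cup \{p\}) \in q.\Omega$, and $MinCut(\{S_1, \dots, S_n\}) > k$.
Suppose that each node set $S_i$ contains at least one Byzantine node. If $C$ is the set of Byzantine nodes, then $C \in Cut(\{S_1, \dots, S_n\})$ and $|C| \leq k$. This is impossible because $MinCut(\{S_1, \dots, S_n\}) > k$. Therefore, there exists $i$ such that $S_i$ does not contain any Byzantine node.
Now, let us use the correct dynamic path corresponding to $S_i$ to show that $m = p.m_0$. Let $S = S_i \cup \{p\}$ and $n' = |S|$. Let us show the following property $\mathcal{P}_j$ by induction, $\forall j \in \{0, \dots, n'\}$: there exists a correct node $u_j$ and a set of correct nodes $X_j$ such that $(p, m, X_j) \in u_j.\Omega$ and $|X_j| = n' - j$.
As $(p, m, S_i \cup \{p\}) \in q.\Omega$, $(p, m, S) \in q.\Omega$. Thus, $\mathcal{P}_0$ is true if we take $u_0 = q$ and $X_0 = S$.
Let us now suppose that $\mathcal{P}_j$ is true, for $j < n'$. As $(p, m, X_j) \in u_j.\Omega$ and $X_j \neq \emptyset$, according to rule 2 of our algorithm, it implies that $u_j$ received $\Omega$ from a node $v$, with $(p, m, X) \in \Omega$, $v \notin X$ and $X_j = X \cup \{v\}$. Thus, $v \in X_j$.
As $v \in X_j$ and $X_j$ is a set of correct nodes, $v$ is correct and behaves according to our algorithm. Then, as $v$ sent $\Omega$, according to rule 1 of our algorithm, we necessarily have $\Omega \subseteq v.\Omega$. Thus, as $(p, m, X) \in \Omega$, we have $(p, m, X) \in v.\Omega$. Hence, $\mathcal{P}_{j+1}$ is true if we take $u_{j+1} = v$ and $X_{j+1} = X$.
By induction principle, $\mathcal{P}_{n'}$ is true. As $|X_{n'}| = 0$, $X_{n'} = \emptyset$ and $(p, m, \emptyset) \in u_{n'}.\Omega$. As $u_{n'}$ is a correct node and follows our algorithm, the only possibility to have $(p, m, \emptyset) \in u_{n'}.\Omega$ is that $u_{n'} = p$ and $m = p.m_0$. Thus, the result. ∎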
Lemma 3 (Communication).
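Let us suppose that $DynMinCut(p, q) > 2k$, and that all correct nodes follow our algorithm. Then, we eventually have $(p, p.m_0) \in q.Acc$.
Let $D_C$ be the set of node sets $S \in Dyn(p, q)$ that only contain correct nodes. Similarly, let $D_B$ be the set of node sets $S \in Dyn(p, q)$ that contain at least one Byzantine node.
Let us suppose that $MinCut(D_C) \leq k$. Then, there exists $C \in Cut(D_C)$ such that $|C| \leq k$. Let $C'$ be the set containing the nodes of $C$ and the Byzantine nodes. Thus, $C' \in Cut(D_C)$ and $C' \in Cut(D_B)$, and $|C'| \leq 2k$. Thus, $MinCut(Dyn(p, q)) \leq 2k$, which contradicts our hypothesis. Therefore, $MinCut(D_C) > k$.
In the following, we show that $\forall S \in D_C$, we eventually have $(p, p.m_0, S \cup \{p\}) \in q.\Omega$, ensuring that $q$ eventually accepts $p.m_0$ from $p$.
Let $S \in D_C$. As $S \in Dyn(p, q)$, let $(u_1, \dots, u_n)$ be the dynamic path such that $u_1 = p$, $u_n = q$ and $S = \{u_2, \dots, u_{n-1}\}$. Let $(t_1, \dots, t_n)$ be the corresponding dates, according to Definition 1. Let us show the following property $\mathcal{P}_i$ by induction, $\forall i \in \{1, \dots, n\}$: at date $t_i$, $(p, p.m_0, X_i) \in u_i.\Omega$, with $X_i = \emptyset$ if $i = 1$ and $X_i = \{u_1, \dots, u_{i-1}\}$ otherwise.
$\mathcal{P}_1$ is true, as we initially have $(p, p.m_0, \emptyset) \in p.\Omega$.
Let us suppose that $\mathcal{P}_i$ is true, for $i < n$. According to Definition 1, $\forall t \in [t_i, t_i + \zeta(e_i, t_i)]$, $\rho(e_i, t) = 1$, $e_i$ being the edge connecting $u_i$ to $u_{i+1}$.
Let $t_A$ be the earliest date such that, $\forall t \in [t_A, t_i + \zeta(e_i, t_i)]$, $\rho(e_i, t) = 1$.
Let $t_B$ be the date where $(p, p.m_0, X_i)$ is added to $u_i.\Omega$.
Then, at date $t_C = \max(t_A, t_B)$, either $u_i.\Omega$ or the local topology of $u_i$ changes. Thus, according to rule 1 of our algorithm, $u_i$ multicasts $u_i.\Omega$ at date $t_C$, with $(p, p.m_0, X_i) \in u_i.\Omega$.
As $t_C \leq t_i$, $u_{i+1}$ receives $u_i.\Omega$ from $u_i$ at date $t \leq t_i + \zeta(e_i, t_i) \leq t_{i+1}$. Then, according to rule 2 of our algorithm, $(p, p.m_0, X_i \cup \{u_i\})$ is added to $u_{i+1}.\Omega$.
Thus, $\mathcal{P}_{i+1}$ is true if we take $X_{i+1} = X_i \cup \{u_i\}$.
By induction principle, $\mathcal{P}_n$ is true. As $u_n = q$, $X_n = S \cup \{p\}$, and we eventually have $(p, p.m_0, S \cup \{p\}) \in q.\Omega$.
Thus, $\forall S \in D_C$, we eventually have $(p, p.m_0, S \cup \{p\}) \in q.\Omega$. Then, as $MinCut(D_C) > k$, according to rule 3 of our algorithm, $(p, p.m_0)$ is added to $q.Acc$. ∎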
Lemma 4 (Sufficient condition).
Let there be any dynamic graph. Let $p$ and $q$ be two correct nodes, and let $k$ denote the maximum number of Byzantine nodes. If $DynMinCut(p, q) > 2k$, our algorithm ensures reliable communication from $p$ to $q$.
Let us suppose that the correct nodes follow our algorithm, as described in Section 3. First, according to Lemma 2, if $(p, m) \in q.Acc$, then $m = p.m_0$. Thus, when $q$ accepts a message from $p$, $p$ is necessarily the author of this message. Then, according to Lemma 3, we eventually have $(p, p.m_0) \in q.Acc$. Thus, $q$ eventually receives and accepts the message broadcast by $p$. Therefore, according to Definition 2, our algorithm ensures reliable communication from $p$ to $q$. ∎
4 Cryptographic reliable communication
If cryptography is available, then it becomes possible to authenticate the sender of a message across multiple hops.
The setting is now the following. Each node $u$ has a private key (only known by $u$) and a public key (known by all nodes). The node $u$ can encrypt a message $m$ with its private key. Any node can decrypt a message from $u$ with the public key of $u$. Decryption returns NULL if the message was not correctly encrypted. We assume that the Byzantine nodes do not know the private keys of correct nodes.
Then, we modify the previous algorithm as follows. Initially, $u.\Omega = \{(u, m')\}$, where $m'$ is $u.m_0$ encrypted with the private key of $u$, and $u.Acc = \{(u, u.m_0)\}$. Then, each correct node $u$ obeys the following three rules:
1. Initially, and whenever $u.\Omega$ or the local topology of $u$ change: multicast $u.\Omega$.
2. Upon reception of $\Omega$ from a neighbor node: $u.\Omega := u.\Omega \cup \Omega$.
3. Whenever there exists $(s, m') \in u.\Omega$ such that decrypting $m'$ with the public key of $s$ returns a message $m \neq$ NULL: append $(s, m)$ to $u.Acc$.
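Theorem 2.
If cryptography is available, for a given dynamic graph, a $k$-Byzantine tolerant reliable communication from $p$ to $q$ is feasible if and only if $DynMinCut(p, q) > k$.
If $DynMinCut(p, q) \leq k$, then it is possible to cut all dynamic paths between $p$ and $q$ with $k$ Byzantine nodes. Thus, $q$ never receives any message from $p$. Thus, the condition is necessary. Now, let us show that the condition is sufficient.
First, $q$ cannot accept from $p$ a message $m$ with $m \neq p.m_0$. Indeed, let us suppose the opposite. According to step 3 of the algorithm, it implies that we have $(p, m') \in q.\Omega$, where decrypting $m'$ with the public key of $p$ returns $m$. This implies that $m'$ is $m$ encrypted with the private key of $p$. Let $u$ be the first node to have $(p, m') \in u.\Omega$. According to steps 1 and 2 of the algorithm, $u$ cannot be a correct node. Thus, $u$ is Byzantine, implying that a Byzantine node knows the private key of $p$: contradiction.
Besides, if $DynMinCut(p, q) > k$, then there exists at least one dynamic path from $p$ to $q$ that contains no Byzantine node. Thus, for the same argument as in Lemma 3, we eventually have $(p, m') \in q.\Omega$, where $m'$ is $p.m_0$ encrypted with the private key of $p$. Thus, according to step 3 of the algorithm, $(p, p.m_0)$ is added to $q.Acc$, and the condition is sufficient.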
5 Case Studies
In this section, we apply our conditions for reliable communication to several case studies: participants interacting in a conference, robots moving on a grid and agents moving in the subway. We show the interest of multihop reliable communication.
5.1 A real-life dynamic network: the Infocom 2005 dataset
In this section, we consider the Infocom 2005 dataset [infocom_dataset], which is obtained in a conference scenario by iMotes capturing contacts between participants. This dataset can represent a dynamic network where each participant is a node and where each contact is a (temporal) edge.
We consider an 8-hour period during the second day of the conference. In this period, we consider the dynamic network formed by the 10 most “sociable” nodes (our criterion of sociability is the total number of contacts reported). We assume that at most one of these nodes may be Byzantine (that is, $k = 1$).
Let $p$ and $q$ be two correct nodes. Let us suppose that $p$ wants to transmit a message to $q$ within a period of minutes. Within this period, three types of communication can be achieved: direct communication ($p$ meets $q$), reliable multihop communication without cryptography, and reliable multihop communication with cryptography.
If we want to ensure reliable communication despite one Byzantine node, the simplest strategy is to wait until $p$ meets $q$ directly. Let us show now that relaying the message is usually beneficial and that our approach yields a significant performance gain.
Figure 2 represents the percentage of pairs of nodes that communicate within minutes, according to the date of beginning of the communication. We can correlate the peaks with the program of the conference: the first period corresponds to morning arrivals during the keynotes; the peak between 10:30 and 11:00 corresponds to the morning break; the peak starting at 12:30 corresponds to the end of parallel sessions and the departure for lunch.
As it turns out, many pairs of nodes are able to communicate reliably, even though they are unable to meet directly. For instance, at 9:15, 60% of pairs of nodes meet directly, 80% can communicate reliably without cryptography, and 100% can communicate reliably with cryptography. This means that relaying the information is actually effective and desirable.
5.2 Probabilistic mobile robots on a grid
We consider a network of mobile robots that are initially randomly scattered on a grid.
Definition 3 (Grid).
An $M \times N$ grid is a topology such that:
Each vertex has a unique identifier $(i, j)$, with $1 \leq i \leq M$ and $1 \leq j \leq N$.
Two vertices and are neighbors if and only if:
At each time unit, a robot randomly moves to a neighbor vertex, or does not move (the new position is chosen uniformly at random among all possible choices). Let be the current vertex of the robot at date . We consider that two robots can communicate if and only if they are on the same vertex. Our setting induces the following dynamic graph : , , when and .
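The mobility model can be simulated in a few lines. The sketch below only estimates the baseline used later in this section, namely the time for two robots to meet directly; it does not evaluate the conditions of the theorems. It assumes a 4-neighbor grid (each vertex adjacent to the vertices directly above, below, left and right of it) and integer time steps; the function names and the grid size in the usage note are hypothetical.

```python
import random


def moves(pos, width, height):
    """Possible next positions: stay in place or move to an adjacent vertex.

    Assumption: 4-neighbor grid with coordinates in 1..width and 1..height.
    """
    i, j = pos
    options = [pos]
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 1 <= ni <= width and 1 <= nj <= height:
            options.append((ni, nj))
    return options


def meeting_time(width, height, rng, max_steps=100000):
    """Number of time units until two random robots share a vertex (capped)."""
    a = (rng.randint(1, width), rng.randint(1, height))
    b = (rng.randint(1, width), rng.randint(1, height))
    t = 0
    while a != b and t < max_steps:
        # Each robot picks its next position uniformly among all choices.
        a = rng.choice(moves(a, width, height))
        b = rng.choice(moves(b, width, height))
        t += 1
    return t


def mean_meeting_time(width, height, runs, seed=0):
    """Average direct meeting time over `runs` independent simulations."""
    rng = random.Random(seed)
    return sum(meeting_time(width, height, rng) for _ in range(runs)) / runs
```

For instance, `mean_meeting_time(5, 5, 1000)` gives an empirical estimate of the direct meeting time on a hypothetical 5 by 5 grid, against which a multihop strategy could be compared.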
Let $p$ and $q$ be two correct robots, and suppose that up to $k$ other robots are Byzantine. We aim at evaluating the communication time, that is: the mean time to satisfy the condition for reliable communication with cryptography (Theorem 2) and without cryptography (Theorem 1). For this purpose, we ran more than 10000 simulations, and represented the results in Figures 3 and 4. Let us comment on these results.
In Figure 3, we represented the mean communication time when varying the maximal number of Byzantine failures $k$, when cryptography is available. This time increases regularly. The largest value of $k$ corresponds to the case where all the nodes (except $p$ and $q$) are Byzantine. In this limit case, the only possibility for $p$ and $q$ to communicate is to meet directly.
In Figure 4, we represented the case where cryptography is not available. Here, the aforementioned limit case is reached for a smaller value of $k$, as the condition for non-cryptographic reliable communication is harder to satisfy.
As we can see, the reliable multihop communication approach can be an interesting compromise. For instance, let us suppose that we want to tolerate one Byzantine failure ($k = 1$). Let us consider the mean time for $p$ and $q$ to meet directly. If we use our algorithms, this time decreases by 38% without cryptography, and by 51% with cryptography.
5.3 Mobile agents in the Paris subway
We consider a dynamic network consisting of 10 mobile agents randomly moving in the Paris subway. The agents can use the classical subway lines (we exclude tramways and regional trains). Each agent is initially located at a randomly chosen junction station – that is, a station that connects at least two lines. Then, the agent randomly chooses a neighbor junction station, waits for the next train, moves to this station, and repeats the process. We use the train schedule provided by the local subway company (http://data.ratp.fr). The time is given in minutes from the departure of the first train (i.e., around 5:30). We consider that two agents can communicate in the two following cases:
They are staying together at the same station.
They cross each other in trains. For instance, if at a given time, one agent is in a train moving from station $A$ to station $B$ while the other agent moves from $B$ to $A$, then we consider that they can communicate.
Again, let us suppose that we want to tolerate one Byzantine failure ($k = 1$). Let us consider the mean time for $p$ and $q$ to meet directly. If we use our algorithms, this time decreases by 36% without cryptography, and by 49% with cryptography.
In this paper, we gave the necessary and sufficient condition for reliable communication in a dynamic multihop network that is subject to Byzantine failures. We considered both cryptographic and non-cryptographic cases, and provided algorithms to show the sufficient condition. We then demonstrated the benefits of these algorithms in several case studies.
Our experiments explicitly quantify the benefits of a cryptographic infrastructure (fewer dynamic paths are required, fewer computations are necessary at each node for accepting genuine messages), but additional trade-offs are worth examining. In practice, ensuring hop-by-hop integrity through cryptography requires every node on the (dynamic) path to collect the public key of the sender (as it is unlikely that all public keys are initially bundled into the node, for memory size reasons and inclusion/exclusion node dynamics). Actually reaching a trusted authority from a genuinely dynamic network to obtain this public key raises both bootstrapping and performance issues.
Our result implicitly considers a worst-case placement of the Byzantine nodes, which is the classical approach when studying Byzantine failures in a distributed setting. Studying variants of the Byzantine node placement (e.g. a random placement according to a particular distribution), and the associated necessary and sufficient condition for enabling multihop reliable communication, constitutes an interesting path for future research.