Consensus using Asynchronous Failure Detectors

Nancy Lynch
CSAIL, MIT
Srikanth Sastry (the author is currently affiliated with Google Inc.)
CSAIL, MIT
Abstract

The FLP result shows that crash-tolerant consensus is impossible to solve in asynchronous systems, and several solutions have been proposed for crash-tolerant consensus under alternative (stronger) models. One popular approach is to augment the asynchronous system with appropriate failure detectors, which provide (potentially unreliable) information about process crashes in the system, to circumvent the FLP impossibility.

In this paper, we demonstrate the exact mechanism by which (sufficiently powerful) asynchronous failure detectors enable solving crash-tolerant consensus. Our approach borrows arguments from the FLP impossibility proof and from the famous result in [2], which shows that $\Omega$ is a weakest failure detector to solve consensus. It also yields a natural proof that $\Omega$ is a weakest asynchronous failure detector to solve consensus. The use of I/O automata theory in our approach enables us to model executions in more detail than [2] and to address the latent assumptions and assertions in the original result in [2].

1 Introduction

In [5, 6] we introduced a new formulation of failure detectors. Unlike the traditional failure detectors of [3, 2], ours are modeled as asynchronous automata, and defined in terms of the general I/O automata framework for asynchronous concurrent systems. To distinguish our failure detectors from the traditional ones, we called ours “Asynchronous Failure Detectors (AFDs)”.

In terms of our model, we presented many of the standard results of the field and some new results. Our model narrowed the scope of failure detectors sufficiently so that AFDs satisfy several desirable properties, which are not true of the general class of traditional failure detectors. For example, (1) AFDs are self-implementable; (2) if an AFD $D$ is strictly stronger than another AFD $D'$, then $D$ is sufficient to solve a strict superset of the problems solvable by $D'$. See [6] for details. Working entirely within an asynchronous framework allowed us to take advantage of the general results about I/O automata and to prove our results rigorously without too much difficulty.

In this paper, we investigate the role of asynchronous failure detectors in circumventing the impossibility of crash-tolerant consensus in asynchronous systems (FLP) [7]. Specifically, we demonstrate exactly how sufficiently strong AFDs circumvent the FLP impossibility. We borrow ideas from the important related result by Chandra, Hadzilacos, and Toueg [2], which says that the failure detector $\Omega$ is a “Weakest Failure Detector” that solves the consensus problem. Incidentally, the proof in [2] makes certain implicit assumptions and assertions, which are entirely reasonable and true, respectively. However, for the purpose of rigor, it is desirable that these assumptions be explicit and these assertions be proved. Our demonstration of how sufficiently strong AFDs circumvent FLP dovetails with an analogous proof of “weakest AFD” for consensus.

While our proof generally follows the proof in [2], we state the (implicit) assumptions and assertions from [2] explicitly. Since our framework is entirely asynchronous and all our definitions are based on an established concurrency theory foundation, we are able to provide rigorous proofs for the (unproven) assertions from [2]. In order to prove the main result of this paper, we modified certain definitions from [6]. However, these modifications do not invalidate any of the results from [5, 6].

The rest of this paper is organized as follows. Section 2 outlines the approach that we use in this paper and its major contributions. In Section 3, we compare our proof with the original CHT proof in [2]. Sections 4 through 7 introduce I/O automata and the definitions of a problem, of an asynchronous system, and of AFDs; much of the material is summarized from [5, 6]. Section 8 introduces the notion of observations of AFD behavior, which are a key part of showing that $\Omega$ is a weakest AFD to solve consensus; this section proves several useful properties of observations which are central to the understanding of the proof and are a contribution of our work. In Section 9, we introduce execution trees for any asynchronous system that uses an AFD; we construct such trees from the observations introduced in Section 8. We also prove several properties of such execution trees, which may be of independent interest and useful in the analysis of executions in any AFD-based system. In Section 10, we formally define the consensus problem and use the notions of observations and execution trees to demonstrate how sufficiently strong AFDs enable asynchronous systems to circumvent the impossibility of fault-tolerant consensus in asynchronous systems [7]. Section 10 defines and uses decision gadgets in an execution tree to demonstrate this; it also shows that the set of such decision gadgets is countable, and therefore, any such execution tree contains a “first” decision gadget. Furthermore, Section 10 shows that each decision gadget is associated with a location that is live and never crashes; we call it the critical location of the decision gadget. In Section 11, we show that $\Omega$ is a weakest AFD to solve consensus by presenting a distributed algorithm that simulates the output of $\Omega$. The algorithm constructs observations and execution trees, and it eventually identifies the “first” decision gadget and its corresponding critical location; the algorithm outputs this critical location as the output of the simulated AFD, thus showing that $\Omega$ is a weakest AFD for consensus.

2 Approach and contributions

To demonstrate our results, we start with a complete definition of asynchronous systems and AFDs. Here, we modified the definitions of AFD from [5, 6], but we did so without invalidating earlier results. We argue that the resulting definition of AFDs is more natural and models a richer class of behaviors in crash-prone asynchronous systems. Next, we introduce the notion of observations of AFD behavior (Section 8), which are DAGs that model a partial ordering of AFD outputs at different processes; importantly, the knowledge of this partial order can be gained by any process through asynchronous message passing alone. Observations as a tool for modeling AFD behavior are of independent interest, and we prove several important properties of observations that are used in our later results.

From such observations, we construct trees of executions of arbitrary AFD-based systems; again, such trees are of independent interest, and we prove several important properties of such trees that are used later.

Next, we define the consensus problem and the notion of valence. Roughly speaking, a finite execution of a system is univalent if all its fair extensions result in the same decision value, and the execution is bivalent if some fair extension results in a decision value $v$ and another fair extension results in a different decision value $v' \neq v$. We present our first important result using observations and execution trees; we show that a sufficiently powerful AFD guarantees that in the execution tree constructed from any viable observation of AFD outputs (informally, an observation is viable if it can be constructed from an AFD trace), the events responsible for the transition from a bivalent execution to a univalent execution must occur at a location that does not crash. Such transitions to univalent executions correspond to so-called “decision gadgets”, and the live location corresponding to such transitions is called the “critical location” of the decision gadgets.

Next, we use the aforementioned result to show that $\Omega$ is a weakest AFD to solve consensus. In order to do so, we first define a metric function that orders all the decision gadgets. This metric function satisfies an important stability property which guarantees the following. Given the decision gadget with the smallest metric value in a given infinite execution tree, for any sufficiently large, but finite, subtree, the same decision gadget will have the smallest metric value within that subtree. Note that the original proof in [2] did not provide such a metric function, and we contend that this is an essential component for completing this proof. We then construct an emulation algorithm (similar to the one in [2]) that uses an AFD sufficiently powerful to solve consensus and simulates the output of $\Omega$. In this algorithm, processes exchange AFD outputs and construct finite observations and corresponding finite execution trees. The aforementioned stability property ensures that eventually and permanently, each process that does not crash identifies the same decision gadget as the one with the smallest metric value. Recall that the critical location of any decision gadget is guaranteed to not crash. Therefore, eventually and permanently, each process that does not crash identifies the same correct process and outputs that correct process as the output of the simulated AFD.

3 Comparisons with the original CHT proof

Our proof has elements that are very similar to the original CHT proof from [2]. However, despite the similarity in our arguments, our proof deviates from the CHT proof in some subtle but significant ways.

3.1 Observations

In [2], the authors introduce DAGs with special properties that model the outputs of a failure detector at different processes and establish a partial ordering of these outputs. In our proof, the analogous structure is an observation (see Section 8). However, our notion of an observation is much more general than the DAG introduced in [2].

First, the DAG in [2] is an infinite graph and cannot model failure detector outputs in finite executions. In contrast, observations may be finite or infinite. Second, we introduce the notion of a sequence of finite observations that can be constructed from progressively longer finite executions; this enables us to model the evolution of observations and execution trees as failure detector outputs become available. Such detailed modeling and analysis do not appear in [2].

3.2 Execution trees

In [2], each possible input to consensus gives rise to a unique execution tree from the DAG. Thus, for $n$ processes, there are $2^n$ possible trees that constitute a forest of trees. In contrast, our proof constructs exactly one tree that models the executions of all possible inputs to consensus. This change is not merely cosmetic. It simplifies analysis and makes the proof technique more general in the following sense.

The original proof in [2] cannot be extended to understanding long-lived problems such as iterative consensus or mutual exclusion. The simple reason for this is that the number of possible inputs for such problems can be uncountably infinite, and so the number of trees generated by the proof technique in [2] is also uncountably infinite. This introduces significant challenges in extracting any structures within these trees by a distributed algorithm. In contrast, in our approach, the execution tree will remain virtually the same; only the rules for determining the action tag values at various edges change.

3.3 Determining the “first” decision gadget

In [2] and in our proof, a significant result is that there is an infinite, but countable, number of decision gadgets, and therefore there exists a unique enumeration of the decision gadgets such that one of them is the “first” one. This result is then used in [2] to claim that all the emulation algorithms converge to the same decision gadget. However, [2] does not provide any proof of this claim. Furthermore, we show that proving this claim is non-trivial.

The significant gap in the original proof in [2] is the following. During the emulation, each process constructs only finite DAGs that are subgraphs of some infinite DAG with the required special properties. However, since the DAGs are finite, the trees of executions constructed from these DAGs could incorrectly detect certain parts of the trees as being decision gadgets when, in the execution tree of the infinite DAG, these are not decision gadgets. Each such pseudo decision gadget is eventually deemed to not be a decision gadget as the emulation progresses. However, there can be infinitely many such pseudo gadgets. Thus, given any arbitrary enumeration of decision gadgets, it is possible that such pseudo decision gadgets appear infinitely often and are enumerated ahead of the “first” decision gadget. Consequently, the emulation never stabilizes to the first decision gadget.

In our proof, we address this gap by carefully defining metric functions for nodes and decision gadgets so that eventually, all the pseudo decision gadgets are ordered after the eventual “first” decision gadget.

4 I/O Automata

We use the I/O Automata framework [8, 9, 10] for specifying the system model and failure detectors. Briefly, an I/O automaton models a component of a distributed system as a (possibly infinite) state machine that interacts with other state machines through discrete actions. This section summarizes the I/O-Automata-related definitions that we use in this paper. See [10, Chapter 8] for a thorough description of I/O Automata.

4.1 Automata Definitions

An I/O automaton, which we will usually refer to as simply an “automaton”, consists of five components: a signature, a set of states, a set of initial states, a state-transition relation, and a set of tasks. We describe these components next.

Actions, Signature, and Tasks.

The state transitions of an automaton are associated with named actions; we denote the set of actions of an automaton $A$ by $act(A)$. Actions are classified as input, output, or internal, and this classification constitutes the signature of the automaton. We denote the sets of input, output, and internal actions of an automaton $A$ by $input(A)$, $output(A)$, and $internal(A)$, respectively. Input and output actions are collectively called the external actions, denoted $external(A)$, and output and internal actions are collectively called the locally controlled actions. The locally controlled actions of an automaton are partitioned into tasks. Tasks are used in defining fairness conditions on executions of the automaton, as we describe in Section 4.4.

Internal actions of an automaton are local to the automaton itself whereas external (input and output) actions are available for interaction with other automata. Locally controlled actions are initiated by the automaton itself, whereas input actions simply arrive at the automaton from the outside, without any control by the automaton.

States.

The states of an automaton $A$ are denoted by $states(A)$; some (non-empty) subset $init(A) \subseteq states(A)$ is designated as the set of initial states.

Transition Relation.

The state transitions of an automaton $A$ are defined by a state-transition relation $trans(A)$, which is a set of tuples of the form $(s, a, s')$ where $s, s' \in states(A)$ and $a \in act(A)$. Each such tuple $(s, a, s')$ is a transition, or a step, of $A$. Informally speaking, each step $(s, a, s')$ denotes the following behavior: automaton $A$, in state $s$, performs action $a$ and changes its state to $s'$.

For a given state $s$ and action $a$, if $trans(A)$ contains some step of the form $(s, a, s')$, then $a$ is said to be enabled in $s$. We assume that every input action in $A$ is enabled in every state of $A$; that is, for every input action $a$ and every state $s$, $trans(A)$ contains a step of the form $(s, a, s')$. A task $C$, which is a set of locally controlled actions, is said to be enabled in a state $s$ iff some action in $C$ is enabled in $s$.

Deterministic Automata.

The general definition of an I/O automaton permits multiple locally controlled actions to be enabled in any given state. It also allows the resulting state after performing a given action to be chosen nondeterministically. For our purposes, it is convenient to consider a class of I/O automata whose behavior is more restricted.

We define an action $a$ (of an automaton $A$) to be deterministic provided that, for every state $s$, $trans(A)$ contains at most one transition of the form $(s, a, s')$. We define an automaton $A$ to be task deterministic iff (1) for every task $C$ and every state $s$ of $A$, at most one action in $C$ is enabled in $s$, and (2) all the actions in $A$ are deterministic. An automaton is said to be deterministic iff it is task deterministic, has exactly one task, and has a unique start state.
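The definitions of enabled actions and (task) determinism can be checked mechanically on a small finite automaton. The following is a minimal sketch, not part of the formal development: the class `Automaton`, its field names, and the `make_toggle` example are hypothetical and chosen only for illustration, and the sketch assumes finite state and action sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Automaton:
    """A toy finite I/O automaton; every field name here is illustrative."""
    states: frozenset
    init: frozenset
    inputs: frozenset
    outputs: frozenset
    internals: frozenset
    trans: frozenset   # set of steps (s, a, s_next)
    tasks: tuple       # partition of the locally controlled actions

    def enabled(self, s, a):
        # a is enabled in s iff some step (s, a, s_next) exists
        return any(s0 == s and a0 == a for (s0, a0, _) in self.trans)

    def input_enabled(self):
        # every input action must be enabled in every state
        return all(self.enabled(s, a) for s in self.states for a in self.inputs)

    def action_deterministic(self, a):
        # at most one step (s, a, s_next) for every state s
        return all(
            sum(1 for (s0, a0, _) in self.trans if s0 == s and a0 == a) <= 1
            for s in self.states
        )

    def task_deterministic(self):
        actions = self.inputs | self.outputs | self.internals
        one_per_task = all(
            sum(self.enabled(s, a) for a in task) <= 1
            for task in self.tasks
            for s in self.states
        )
        return one_per_task and all(self.action_deterministic(a) for a in actions)

    def deterministic(self):
        return (self.task_deterministic()
                and len(self.tasks) == 1
                and len(self.init) == 1)


def make_toggle(extra_steps=()):
    """Two-state automaton: input 'flip' toggles the state, outputs report it."""
    steps = {(0, "flip", 1), (1, "flip", 0), (0, "report0", 0), (1, "report1", 1)}
    return Automaton(
        states=frozenset({0, 1}),
        init=frozenset({0}),
        inputs=frozenset({"flip"}),
        outputs=frozenset({"report0", "report1"}),
        internals=frozenset(),
        trans=frozenset(steps | set(extra_steps)),
        tasks=(frozenset({"report0", "report1"}),),
    )


toggle = make_toggle()
# Enabling report1 in state 0 as well puts two enabled actions in one task.
ambiguous = make_toggle(extra_steps=[(0, "report1", 0)])
```

In this sketch, `toggle` is deterministic, whereas `ambiguous` violates condition (1) of task determinism because two actions of the same task are enabled in state 0.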

4.2 Executions, Traces, and Schedules

Now we define how an automaton executes. An execution fragment of an automaton $A$ is a finite sequence $s_0, a_1, s_1, a_2, \ldots, s_{k-1}, a_k, s_k$, or an infinite sequence $s_0, a_1, s_1, a_2, \ldots$, of alternating states and actions of $A$ such that for every $i \geq 0$ (with $i < k$ in the finite case), $(s_i, a_{i+1}, s_{i+1})$ is in $trans(A)$. A sequence consisting of just a state is a special case of an execution fragment and is called a null execution fragment. Each occurrence of an action in an execution fragment is called an event.

An execution fragment that starts with an initial state (that is, $s_0 \in init(A)$) is called an execution. A null execution fragment consisting of an initial state is called a null execution. A state $s$ is said to be reachable if there exists a finite execution that ends with $s$. By definition, any initial state is reachable.

We define concatenation of execution fragments. Let $\alpha_1$ and $\alpha_2$ be two execution fragments of an I/O automaton such that $\alpha_1$ is finite and the final state of $\alpha_1$ is also the starting state of $\alpha_2$, and let $\alpha_2'$ denote the sequence obtained by deleting the first state in $\alpha_2$. Then the expression $\alpha_1 \cdot \alpha_2$ denotes the execution fragment formed by appending $\alpha_2'$ after $\alpha_1$.

It is sometimes useful to consider just the sequence of events that occur in an execution, ignoring the states. Thus, given an execution $\alpha$, the schedule of $\alpha$ is the subsequence of $\alpha$ that consists of all the events in $\alpha$, both internal and external. The trace of an execution includes only the externally observable behavior; formally, the trace of an execution $\alpha$ is the subsequence of $\alpha$ consisting of all the external actions.

More generally, we define the projection of any sequence on a set of actions as follows. Given a sequence $t$ (which may be an execution fragment, schedule, or trace) and a set $B$ of actions, the projection of $t$ on $B$, denoted by $t|_B$, is the subsequence of $t$ consisting of all the events from $B$.

We define concatenation of schedules and traces. Let $t_1$ and $t_2$ be two sequences of actions of some I/O automaton where $t_1$ is finite; then $t_1 \cdot t_2$ denotes the sequence formed by appending $t_2$ after $t_1$.

To designate specific events in a schedule or trace, we use the following notation: if a sequence $t$ (which may be a schedule or a trace) contains at least $x$ events, then $t[x]$ denotes the $x^{th}$ event in the sequence $t$, and otherwise, $t[x] = \bot$. Here, $\bot$ is a special symbol that we assume is different from the names of all actions.
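For finite sequences, projection, concatenation, and event indexing are simple list operations. The sketch below is illustrative only: the function names and the use of `None` to stand in for the special "no such event" symbol are assumptions of this example, and infinite sequences are out of scope.

```python
BOTTOM = None  # stands in for the special symbol that is not an action name


def project(seq, actions):
    """Subsequence of seq consisting of the events that belong to `actions`."""
    return [event for event in seq if event in actions]


def concat(first, second):
    """Append `second` after the finite sequence `first`."""
    return list(first) + list(second)


def event_at(seq, x):
    """The x-th event of seq (1-indexed), or BOTTOM if seq has fewer than x events."""
    return seq[x - 1] if 1 <= x <= len(seq) else BOTTOM


trace = ["send", "crash", "receive"]
only_messages = project(trace, {"send", "receive"})
```

Here `only_messages` is the projection of `trace` on the two message actions, and `event_at(trace, 4)` yields `BOTTOM` because the trace has only three events.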

4.3 Operations on I/O Automata

Composition.

A collection of I/O automata may be composed by matching output actions of some automata with the same-named input actions of others. (Not all collections of I/O automata may be composed. For instance, in order to compose a collection of I/O automata, we require that no two automata have a common output action. See [10, Chapter 8] for details.) Each output of an automaton may be matched with inputs of any number of other automata. Upon composition, all the actions with the same name are performed together.

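Let $\alpha$ be an execution of the composition of automata $A_1, \ldots, A_N$. The projection of $\alpha$ on automaton $A_i$, where $i \in [1, N]$, is denoted by $\alpha|A_i$ and is defined to be the subsequence of $\alpha$ obtained by deleting each pair $a_k, s_k$ for which $a_k$ is not an action of $A_i$ and replacing each remaining state $s_k$ by automaton $A_i$’s part of $s_k$. Theorem 8.1 in [10] states that if $\alpha$ is an execution of the composition $A_1 \times \cdots \times A_N$, then for each $i \in [1, N]$, $\alpha|A_i$ is an execution of $A_i$. Similarly, if $t$ is a trace of $A_1 \times \cdots \times A_N$, then for each $i$, $t|A_i$ is a trace of $A_i$.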

Hiding.

In an automaton $A$, an output action may be “hidden” by reclassifying it as an internal action. A hidden action no longer appears in the traces of the automaton.

4.4 Fairness

When considering executions of an I/O automaton, we will often be interested in those executions in which every task of the automaton gets infinitely many turns to take steps; we call such executions “fair”. When the automaton represents a distributed system, the notion of fairness can be used to express the idea that all system components continue to get turns to perform their activities.

Formally, an execution fragment $\alpha$ of an automaton $A$ is said to be fair iff the following two conditions hold for every task $C$ in $A$. (1) If $\alpha$ is finite, then no action in $C$ is enabled in the final state of $\alpha$. (2) If $\alpha$ is infinite, then either (a) $\alpha$ contains infinitely many events from $C$, or (b) $\alpha$ contains infinitely many occurrences of states in which $C$ is not enabled.

A schedule $\sigma$ of $A$ is said to be fair if it is the schedule of a fair execution of $A$. Similarly, a trace $t$ of $A$ is said to be fair if it is the trace of a fair execution of $A$.

5 Crash Problems

In this section, we define problems, distributed problems, crash problems, and failure-detector problems. We also define a particular failure-detector problem corresponding to the leader election oracle $\Omega$ of [2].

5.1 Problems

We define a problem $P$ to be a tuple $(I_P, O_P, T_P)$, where $I_P$ and $O_P$ are disjoint sets of actions and $T_P$ is a set of (finite or infinite) sequences over these actions such that there exists an automaton $A$ where $input(A) = I_P$, $output(A) = O_P$, and the set of fair traces of $A$ is a subset of $T_P$. In this case we state that $A$ solves $P$. We include the aforementioned assumption of solvability to satisfy a non-triviality property, which we explain in Section 7.

Distributed Problems.

Here and for the rest of the paper, we introduce a fixed finite set $\Pi$ of location IDs; we assume that $\Pi$ does not contain the special symbol $\bot$. We assume a fixed total ordering $<_\Pi$ on $\Pi$. We also assume a fixed mapping $loc$ from actions to $\Pi \cup \{\bot\}$; for an action $a$, if $loc(a) = i \in \Pi$, then we say that $a$ occurs at $i$. A problem $P$ is said to be distributed over $\Pi$ if, for every action $a \in I_P \cup O_P$, $loc(a) \in \Pi$. We extend the definition of $loc$ by defining $loc(\bot) = \bot$.

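Given a problem $P$ that is distributed over $\Pi$, and a location $i \in \Pi$, $I_{P,i}$ and $O_{P,i}$ denote the set of actions in $I_P$ and $O_P$, respectively, that occur at location $i$; that is, $I_{P,i} = \{a \mid (a \in I_P) \wedge (loc(a) = i)\}$ and $O_{P,i} = \{a \mid (a \in O_P) \wedge (loc(a) = i)\}$.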

Crash Problems.

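We assume a set $\hat{I} = \{crash_i \mid i \in \Pi\}$ of crash events, where $loc(crash_i) = i$. That is, $crash_i$ represents a crash that occurs at location $i$. A problem $P$ that is distributed over $\Pi$ is said to be a crash problem iff $\hat{I} \subseteq I_P$. That is, $crash_i \in I_{P,i}$ for every $i \in \Pi$.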

Given a (finite or infinite) sequence $t$, $faulty(t)$ denotes the set of locations at which a $crash$ event occurs in $t$. Similarly, $live(t) = \Pi \setminus faulty(t)$ denotes the set of locations at which a $crash$ event does not occur in $t$. A location in $faulty(t)$ is said to be faulty in $t$, and a location in $live(t)$ is said to be live in $t$.
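As a small illustration, the faulty and live locations of a finite sequence can be computed directly. The event encoding used here, with `("crash", i)` for a crash at location `i` and `("out", i, value)` for an output at location `i`, is an assumption of this sketch and not notation from the paper.

```python
def faulty(seq):
    """Locations at which at least one crash event occurs in seq."""
    return {event[1] for event in seq if event[0] == "crash"}


def live(seq, locations):
    """Locations (out of the fixed set `locations`) with no crash event in seq."""
    return set(locations) - faulty(seq)


locations = {1, 2, 3}
t = [("out", 1, "a"), ("crash", 2), ("out", 3, "b"), ("crash", 2)]
```

For this `t`, location 2 is faulty (it crashes, and the second crash event changes nothing), while locations 1 and 3 are live.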

5.2 Failure-Detector Problems

Recall that a failure detector is an oracle that provides information about crash failures. In our modeling framework, we view a failure detector as a special type of crash problem. A necessary condition for a crash problem $P = (I_P, O_P, T_P)$ to be an asynchronous failure detector (AFD) is crash exclusivity, which states that $I_P = \hat{I}$; that is, the actions $I_P$ are exactly the $crash$ actions. Crash exclusivity guarantees that the only inputs to a failure detector are the $crash$ events, and hence, failure detectors provide information only about crashes. An AFD must also satisfy additional properties, which we describe next.

Let $D = (\hat{I}, O_D, T_D)$ be a crash problem satisfying crash exclusivity. We begin by defining a few terms that will be used in the definition of an AFD. Let $t$ be an arbitrary sequence over $\hat{I} \cup O_D$.

Valid sequence.

The sequence $t$ is said to be valid iff (1) for every $i \in \Pi$, no event in $O_{D,i}$ (the set of actions in $O_D$ at location $i$) occurs after a $crash_i$ event in $t$, and (2) if no $crash_i$ event occurs in $t$, then $t$ contains infinitely many events in $O_{D,i}$.

Thus, a valid sequence contains no output events at a location after a $crash$ event at that location, and contains infinitely many output events at each live location.
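Only the first validity condition (no output at a location after it crashes) can be checked on a finite prefix; the second condition concerns infinite sequences and is not decidable from any finite prefix. The following sketch checks the first condition, assuming the illustrative event encoding `("crash", i)` and `("out", i, value)`; the function name is hypothetical.

```python
def no_output_after_crash(seq):
    """True iff no output event occurs at a location after a crash there."""
    crashed = set()
    for event in seq:
        kind, loc = event[0], event[1]
        if kind == "crash":
            crashed.add(loc)
        elif loc in crashed:
            # an output at a location that has already crashed
            return False
    return True


ok_prefix = [("out", 1, "a"), ("crash", 1), ("out", 2, "b")]
bad_prefix = [("crash", 1), ("out", 1, "a")]
```

A prefix such as `bad_prefix` can never be extended to a valid sequence, whereas `ok_prefix` is consistent with validity so far.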

Sampling.

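A sequence $t'$ is a sampling of $t$ iff (1) $t'$ is a subsequence of $t$, and (2) for every location $i \in \Pi$, (a) if $i$ is live in $t$, then $t'|_{O_{D,i}} = t|_{O_{D,i}}$, and (b) if $i$ is faulty in $t$, then $t'$ contains the first $crash_i$ event in $t$, and $t'|_{O_{D,i}}$ is a prefix of $t|_{O_{D,i}}$.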

A sampling of sequence $t$ retains all events at live locations. For each faulty location $i$, it may remove a suffix of the outputs at location $i$. It may also remove some crash events, but must retain the first crash event.
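The sampling conditions can be sketched for finite sequences as follows. To avoid ambiguity between identical events, this example describes the candidate sampling by the set `keep` of retained indices of the original sequence, so the subsequence condition holds by construction. The event encoding `("crash", i)` / `("out", i, value)`, the function name, and the treatment of a finite list as the whole sequence are all assumptions of this sketch.

```python
def is_sampling(t, keep):
    """Check whether keeping exactly the indices in `keep` yields a sampling of t."""
    crashed = {event[1] for event in t if event[0] == "crash"}
    for loc in {event[1] for event in t}:
        outputs = [k for k, e in enumerate(t) if e[0] == "out" and e[1] == loc]
        kept = [k for k in outputs if k in keep]
        if loc not in crashed:
            # live location: every output must be retained
            if kept != outputs:
                return False
        else:
            # faulty location: keep the first crash, and a prefix of the outputs
            first_crash = next(k for k, e in enumerate(t) if e == ("crash", loc))
            if first_crash not in keep or kept != outputs[:len(kept)]:
                return False
    return True


t = [("out", 1, "a"), ("out", 2, "b"), ("crash", 2), ("out", 1, "c"), ("crash", 2)]
```

For example, keeping indices `{0, 2, 3}` drops the only output of the faulty location 2 and its extra crash, which is permitted; keeping `{0, 1, 2}` drops an output of the live location 1, which is not.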

Constrained Reordering.

Let $t'$ be a valid permutation of events in $t$; $t'$ is a constrained reordering of $t$ iff the following is true. For every pair of events $e$ and $e'$, if (1) $e$ precedes $e'$ in $t$, and (2) either (a) $e, e' \in O_D$ and $loc(e) = loc(e')$, or (b) $e \in \hat{I}$ and $e' \in O_D$, then $e$ precedes $e'$ in $t'$ as well. (Note that this definition of constrained reordering is less restrictive than the definition in [5, 6]; specifically, unlike in [5, 6], this definition allows crashes to be reordered with respect to each other. However, this definition is “compatible” with the earlier definition in the sense that the results presented in [5, 6] continue to be true under this new definition.)

A constrained reordering of sequence $t$ maintains the relative ordering of events that occur at the same location and maintains the relative order between any $crash$ event and any subsequent output event.
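The informal characterization above can be turned into a check on finite sequences. In this sketch, a reordering is given as a permutation `perm` of indices, where `perm[k]` is the index in the original sequence of the `k`-th event of the reordered sequence. The code follows the informal summary (same-location order and crash-before-output order are preserved), checks only the finite part of validity, and uses the illustrative encoding `("crash", i)` / `("out", i, value)`; all names are hypothetical.

```python
def no_output_after_crash(seq):
    """Finite part of validity: no output at a location after it crashes."""
    crashed = set()
    for event in seq:
        if event[0] == "crash":
            crashed.add(event[1])
        elif event[1] in crashed:
            return False
    return True


def is_constrained_reordering(t, perm):
    """Check whether reordering t according to perm respects the constraints."""
    if sorted(perm) != list(range(len(t))):
        return False  # not a permutation of the events of t
    reordered = [t[k] for k in perm]
    if not no_output_after_crash(reordered):
        return False
    new_pos = {orig: new for new, orig in enumerate(perm)}
    for p in range(len(t)):
        for q in range(p + 1, len(t)):
            e, e2 = t[p], t[q]
            same_location = e[1] == e2[1]
            crash_then_output = e[0] == "crash" and e2[0] == "out"
            if (same_location or crash_then_output) and new_pos[p] > new_pos[q]:
                return False
    return True


t = [("crash", 1), ("out", 2, "x"), ("out", 3, "y"), ("out", 2, "z")]
```

With this `t`, swapping the outputs at locations 2 and 3 (`perm = [0, 2, 1, 3]`) is allowed, but moving an output ahead of the preceding crash (`perm = [1, 0, 2, 3]`) or swapping the two outputs at location 2 (`perm = [0, 3, 2, 1]`) is not.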

Crash Extension.

Assume that $t$ is a finite sequence. A crash extension of $t$ is a (possibly infinite) sequence $t'$ such that $t$ is a prefix of $t'$ and the suffix of $t'$ following $t$ is a sequence over $\hat{I}$.

In other words, a crash extension of $t$ is obtained by extending $t$ with $crash$ events.

Extra Crashes.

An extra crash event in $t$ is a $crash_i$ event in $t$, for some $i$, such that $t$ contains a preceding $crash_i$.

An extra crash is a crash event at a location that has already crashed.

Minimal-Crash Sequence.

Let $mincrash(t)$ denote the subsequence of $t$ that contains all the events in $t$, except for the extra crashes; $mincrash(t)$ is called the minimal-crash sequence of $t$.
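Computing the minimal-crash sequence of a finite sequence amounts to dropping every crash event at a location that has already crashed. The sketch below assumes the illustrative event encoding `("crash", i)` / `("out", i, value)`; the helper name `mincrash` is a label chosen for this example.

```python
def mincrash(seq):
    """Remove extra crashes: keep only the first crash event at each location."""
    seen = set()
    result = []
    for event in seq:
        if event[0] == "crash":
            if event[1] in seen:
                continue  # extra crash at an already-crashed location
            seen.add(event[1])
        result.append(event)
    return result


t = [("crash", 1), ("out", 2, "a"), ("crash", 1), ("crash", 2), ("crash", 1)]
```

Applying `mincrash` to `t` keeps the first crash at location 1, the output at location 2, and the crash at location 2, and removes the two repeated crashes at location 1. Two sequences with the same minimal-crash sequence differ only in their extra crashes.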

Asynchronous Failure Detector.

Now we are ready to define asynchronous failure detectors. A crash problem of the form $D = (\hat{I}, O_D, T_D)$ (which satisfies crash exclusivity) is an asynchronous failure detector (AFD, for short) iff $D$ satisfies the following properties.

  1. Validity. Every sequence $t \in T_D$ is valid.

  2. Closure Under Sampling. For every sequence $t \in T_D$, every sampling of $t$ is also in $T_D$.

  3. Closure Under Constrained Reordering. For every sequence $t \in T_D$, every constrained reordering of $t$ is also in $T_D$.

  4. Closure Under Crash Extension. For every sequence $t \in T_D$, for every prefix $t_{pre}$ of $t$, for every crash extension $t'$ of $t_{pre}$, the following are true. (a) If $t'$ is finite, then $t'$ is a prefix of some sequence in $T_D$. (b) If $faulty(t') = \Pi$, then $t'$ is in $T_D$.

  5. Closure Under Extra Crashes. For every sequence $t \in T_D$, every sequence $t'$ such that $mincrash(t) = mincrash(t')$ is also in $T_D$.

Of the properties given here, the first three—validity and closure under sampling and constrained reordering—were also used in our earlier papers [5, 6]. The other two closure properties—closure under crash extension and extra crashes—are new here.

A brief motivation for the above properties is in order. The validity property ensures that (1) after a location crashes, no outputs occur at that location, and (2) if a location does not crash, outputs occur infinitely often at that location. Closure under sampling permits a failure detector to “skip” or “miss” any suffix of outputs at a faulty location. Closure under constrained reordering permits “delaying” output events at any location. Closure under crash extension permits a crash event to occur at any time. Finally, closure under extra crashes captures the notion that once a location is crashed, the occurrence of additional crash events (or lack thereof) at that location has no effect.

We define one additional constraint below. This constraint is a formalization of an implicit assumption made in [2]; namely, for any AFD $D$, any “sampling” (as defined in [4]) of a failure detector sequence in $T_D$ is also in $T_D$.

Strong-Sampling AFDs.

Let $D$ be an AFD, and let $t \in T_D$. A subsequence $t'$ of $t$ is said to be a strong sampling of $t$ if $t'$ is a valid sequence. AFD $D$ is said to satisfy closure under strong sampling if, for every trace $t \in T_D$, every strong sampling of $t$ is also in $T_D$. Any AFD that satisfies closure under strong sampling is said to be a strong-sampling AFD.

Although the set of strong-sampling AFDs is a strict subset of all AFDs, we conjecture that restricting our discussion to strong-sampling AFDs does not weaken our result. Specifically, we assert without proof that for any AFD $D$, we can construct an “equivalent” strong-sampling AFD $D'$. This notion of equivalence is formally discussed in Section 7.3.

5.3 The Leader Election Oracle.

An example of a strong-sampling AFD is the leader election oracle $\Omega$ [2]. Informally speaking, $\Omega$ continually outputs a location ID at each live location; eventually and permanently, $\Omega$ outputs the ID of a unique live location at all the live locations. The failure detector $\Omega$ was shown in [2] to be a “weakest” failure detector to solve crash-tolerant consensus, in a certain sense. We will present a version of this proof in this paper.

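We specify our version of $\Omega = (\hat{I}, O_\Omega, T_\Omega)$ as follows. The action set $O_\Omega = \bigcup_{i \in \Pi} O_{\Omega,i}$, where, for each $i \in \Pi$, $O_{\Omega,i} = \{FD\text{-}\Omega(j)_i \mid j \in \Pi\}$. $T_\Omega$ is the set of all valid sequences $t$ over $\hat{I} \cup O_\Omega$ that satisfy the following property: if $live(t) \neq \emptyset$, then there exists a location $l \in live(t)$ and a suffix $t_{suff}$ of $t$ such that $t_{suff}$ is a sequence over the set $\{FD\text{-}\Omega(l)_i \mid i \in live(t)\}$.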

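The automaton $FD\text{-}\Omega$

Signature:

    input $crash_i$, $i \in \Pi$

    output $FD\text{-}\Omega(j)_i$, $i, j \in \Pi$

State variables:

    $crashset$, a subset of $\Pi$, initially $\emptyset$

Transitions:

    input $crash_i$

        effect: $crashset := crashset \cup \{i\}$

    output $FD\text{-}\Omega(j)_i$

        precondition: $(i \notin crashset) \wedge (j = \min(\Pi \setminus crashset))$

        effect: none

Tasks:

    One task per location $i \in \Pi$, defined as follows: $\{FD\text{-}\Omega(j)_i \mid j \in \Pi\}$

Algorithm 1 Automaton that implements the $\Omega$ AFD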

Algorithm 1 shows an automaton whose set of fair traces is a subset of $T_\Omega$; it follows that $\Omega$ satisfies our formal definition of a “problem”. It is easy to see that $\Omega$ satisfies all the properties of an AFD; furthermore, $\Omega$ also satisfies closure under strong sampling. The proofs of these observations are left as an exercise.
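To make the behavior of such an automaton concrete, the following sketch simulates one possible leader-election automaton. It is not a transcription of Algorithm 1: it assumes that the automaton records the set of crashed locations and that every non-crashed location outputs the smallest location ID not yet known to have crashed. The class and function names, the round-robin scheduler, and the event encoding `("crash", i)` / `("out", i, leader)` are all hypothetical.

```python
class OmegaAutomaton:
    """Toy leader-election oracle; the min-ID rule is an assumption of this sketch."""

    def __init__(self, locations):
        self.locations = sorted(locations)
        self.crashset = set()

    def crash(self, i):
        # input action: record that location i has crashed
        self.crashset.add(i)

    def enabled_output(self, i):
        """Leader that location i may output, or None if i has crashed."""
        if i in self.crashset:
            return None
        return min(l for l in self.locations if l not in self.crashset)


def run(automaton, crash_schedule, rounds):
    """Round-robin over the per-location tasks.

    crash_schedule maps a round number to the location that crashes
    at the start of that round.
    """
    trace = []
    for r in range(rounds):
        if r in crash_schedule:
            automaton.crash(crash_schedule[r])
            trace.append(("crash", crash_schedule[r]))
        for i in automaton.locations:
            leader = automaton.enabled_output(i)
            if leader is not None:
                trace.append(("out", i, leader))
    return trace


# Location 1 is the initial leader and crashes at the start of round 1.
trace = run(OmegaAutomaton({1, 2, 3}), {1: 1}, 3)
```

In this run, every output after the crash of location 1 occurs at a live location and names location 2, which matches the informal requirement that the outputs eventually and permanently agree on a unique live location.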

AFD $\Omega_f$.

Here, we introduce $\Omega_f$, where $f \leq |\Pi|$ is a natural number, as a generalization of $\Omega$. In this paper, we will show that $\Omega_f$ is a weakest strong-sampling AFD that solves fault-tolerant consensus if at most $f$ locations are faulty. Informally speaking, $\Omega_f$ denotes the AFD that behaves exactly like $\Omega$ in traces that have at most $f$ faulty locations. Thus, $\Omega_{|\Pi|}$ is the AFD $\Omega$.

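Precisely, $\Omega_f = (\hat{I}, O_\Omega, T_{\Omega_f})$, where $T_{\Omega_f}$ is the set of all valid sequences $t$ over $\hat{I} \cup O_\Omega$ such that, if $|faulty(t)| \leq f$, then $t \in T_\Omega$. This definition implies that $T_{\Omega_f}$ contains all the valid sequences $t$ over $\hat{I} \cup O_\Omega$ such that $|faulty(t)| > f$.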

It is easy to see that $\Omega_f$ is a strong-sampling AFD.

6 System Model and Definitions

We model an asynchronous system as the composition of a collection of I/O automata of the following kinds: process automata, channel automata, a crash automaton, and an environment automaton. The external signature of each automaton and the interaction among them are described in Section 6.1. The behavior of these automata is described in Sections 6.2 through 6.5.

For the definitions that follow, we assume an alphabet $\mathcal{M}$ of messages.

6.1 System Structure

A system contains a collection of process automata, one for each location in $\Pi$. We define the association with a mapping $Proc$, which maps each location $i$ to a process automaton $Proc_i$. Automaton $Proc_i$ has the following external signature. It has an input action $crash_i$, which is an output from the crash automaton, a set of output actions $\{send(m, j)_i \mid m \in \mathcal{M} \wedge j \in \Pi \setminus \{i\}\}$, and a set of input actions $\{receive(m, j)_i \mid m \in \mathcal{M} \wedge j \in \Pi \setminus \{i\}\}$. A process automaton may also have other external actions with which it interacts with the external environment or a failure detector; the set of such actions may vary from one system to another.

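For every ordered pair $(i, j)$ of distinct locations, the system contains a channel automaton $Chan_{i,j}$, which models the channel that transports messages from process $Proc_i$ to process $Proc_j$. Channel $Chan_{i,j}$ has the following external actions. The set of input actions $input(Chan_{i,j})$ is $\{send(m, j)_i \mid m \in \mathcal{M}\}$, which is a subset of outputs of the process automaton $Proc_i$. The set of output actions $output(Chan_{i,j})$ is $\{receive(m, i)_j \mid m \in \mathcal{M}\}$, which is a subset of inputs to $Proc_j$.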

The crash automaton $\mathcal{C}$ models the occurrence of crash failures in the system. Automaton $\mathcal{C}$ has $\hat{I}$ as its set of output actions, and no input actions.

The environment automaton $\mathcal{E}$ models the external world with which the distributed system interacts. The automaton $\mathcal{E}$ is a composition of automata $\{\mathcal{E}_i \mid i \in \Pi\}$. For each location $i$, the set of input actions to automaton $\mathcal{E}_i$ includes the action $crash_i$. In addition, $\mathcal{E}_i$ may have input and output actions corresponding (respectively) to any outputs and inputs of the process automaton $Proc_i$ that do not match up with other automata in the system.

We assume that, for every location $i$, every external action of $Proc_i$ and $\mathcal{E}_i$, respectively, occurs at $i$; that is, $loc(a) = i$ for every external action $a$ of $Proc_i$ and $\mathcal{E}_i$.

We provide some constraints on the structure of the various automata below.

6.2 Process Automata

The process automaton at location $i$, $Proc_i$, is an I/O automaton whose external signature satisfies the constraints given above, and that satisfies the following additional properties.

  1. Every internal action of occurs at , that is, for every internal action of . We have already assumed that every external action of occurs at ; now we are simply extending this requirement to the internal actions.

  2. Automaton is deterministic, as defined in Section 4.1.

  3. When occurs, it permanently disables all locally controlled actions of .

We define a distributed algorithm to be a collection of process automata, one at each location; formally, it is simply a particular mapping. For convenience, we will usually write for the process automaton .

6.3 Channel Automata

The channel automaton for and , , is an I/O automaton whose external signature is as described above. That is, ’s input actions are and its output actions are .

Now we require to be a specific I/O automaton—a reliable FIFO channel, as defined in [10]. This automaton has no internal actions, and all its output actions are grouped into a single task. The state consists of a FIFO queue of messages, which is initially empty. A input event can occur at any time. The effect of an event is to add to the end of the queue. When a message is at the head of the queue, the output action is enabled, and the effect is to remove from the head of the queue. Note that this automaton is deterministic.
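The channel behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `FifoChannel` and the method names `send`, `receive`, and `enabled_receive` are hypothetical stand-ins for the channel's input and output actions, and are not taken from [10].

```python
from collections import deque


class FifoChannel:
    """Illustrative sketch of a reliable FIFO channel between two locations."""

    def __init__(self):
        # The state is a FIFO queue of messages, initially empty.
        self.queue = deque()

    def send(self, m):
        # Input action: can occur at any time; adds m to the end of the queue.
        self.queue.append(m)

    def enabled_receive(self):
        # At most one output action is enabled: delivery of the message at the
        # head of the queue. Returns None when the queue is empty.
        return self.queue[0] if self.queue else None

    def receive(self):
        # Output action: removes and returns the message at the head of the queue.
        if not self.queue:
            raise RuntimeError("no output action is enabled")
        return self.queue.popleft()


channel = FifoChannel()
channel.send("a")
channel.send("b")
delivered = [channel.receive(), channel.receive()]
```

Because the next state is fully determined by the current queue and the action taken, the sketch reflects the determinism noted above.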

6.4 Crash Automaton

The crash automaton is an I/O automaton with as its set of output actions, and no input actions.

Now we require the following constraint on the behavior of : Every sequence over is a fair trace of the crash automaton. That is, any pattern of crashes is possible. For some of our results, we will consider restrictions on the number of locations that crash.

6.5 Environment Automaton

The environment automaton is an I/O automaton whose external signature satisfies the constraints described in Section 6.1. Recall that is a composition of automata . For each location , the following is true.

  1. has a unique initial state.

  2. has tasks , where ranges over some fixed task index set .

  3. is task-deterministic.

  4. When occurs, it permanently disables all locally controlled actions of .

In addition, in some specific cases we will require the traces of to satisfy certain “well-formedness” restrictions, which will vary from one system to another. We will define these specifically when they are needed, later in the paper.

Figure 1: Interaction diagram for a message-passing asynchronous distributed system augmented with a failure detector automaton.

7 Solving Problems

In this section we define what it means for a distributed algorithm to solve a crash problem in a particular environment. We also define what it means for a distributed algorithm to solve one problem using another problem . Based on these definitions, we define what it means for an AFD to be sufficient to solve a problem.

7.1 Solving a Crash Problem

An automaton is said to be an environment for if the input actions of are , and the output actions of are . Thus, the environment’s inputs and outputs “match” those of the problem, except that the environment doesn’t provide the problem’s inputs.

If is an environment for a crash problem , then an I/O automaton is said to solve in environment provided that the following conditions hold:

  1. .

  2. .

  3. The set of fair traces of the composition of , , and the crash automaton is a subset of .

A distributed algorithm solves a crash problem in an environment iff the automaton , which is obtained by composing with the channel automata, solves in . A crash problem is said to be solvable in an environment iff there exists a distributed algorithm such that solves in . If crash problem is not solvable in environment , then it is said to be unsolvable in .

7.2 Solving One Crash Problem Using Another

Often, an unsolvable problem may be solvable if the system contains an automaton that solves some other (unsolvable) crash problem . We describe the relationship between and as follows.

Let and be two crash problems with disjoint sets of actions (except for actions). Let be an environment for . Then a distributed algorithm solves crash problem using crash problem in environment iff the following are true:

  1. For each location , .

  2. For each location , .

  3. Let be the composition of with the channel automata, the crash automaton, and the environment automaton . Then for every fair trace of , if , then .

    In effect, in any fair execution of the system, if the sequence of events associated with the problem is consistent with the specified behavior of , then the sequence of events associated with problem is consistent with the specified behavior of .

Note that requirement 3 is vacuous if for every fair trace of , . However, in the definition of a problem , the requirement that there exist some automaton whose set of fair traces is a subset of ensures that there are “sufficiently many” fair traces of , such that .

We say that a crash problem is sufficient to solve a crash problem in environment , denoted iff there exists a distributed algorithm that solves using in . If , then also we say that is solvable using in . If no such distributed algorithm exists, then we state that is unsolvable using in , and we denote it as .

7.3 Using and Solving Failure-Detector Problems

Since an AFD is simply a kind of crash problem, the definitions above automatically yield definitions for the following notions.

  1. A distributed algorithm solves an AFD in environment .

  2. A distributed algorithm solves a crash problem using an AFD in environment .

  3. An AFD is sufficient to solve a crash problem in environment .

  4. A distributed algorithm solves an AFD using a crash problem in environment .

  5. A crash problem is sufficient to solve an AFD in environment .

  6. A distributed algorithm solves an AFD using another AFD .

  7. An AFD is sufficient to solve an AFD .

Note that, when we talk about solving an AFD, the environment has no output actions because the AFD has no input actions except for , which are inputs from the crash automaton. Therefore, we have the following lemma.

Lemma 7.1.

Let be a crash problem and an AFD. If in some environment (for ), then for any other environment for , .

Consequently, when we refer to an AFD being solvable using a crash problem (or an AFD) , we omit the reference to the environment automaton and simply say that is sufficient to solve ; we denote this relationship by . Similarly, when we say that an AFD is unsolvable using , we omit mention of the environment, and write simply .

Finally, if an AFD is sufficient to solve another AFD (notion 7 in the list above), then we say that is stronger than , and we denote this by . If , but , then we say that is strictly stronger than , and we denote this by . Also, if and , then we say that is equivalent to .

We conjecture that for any AFD , there exists a strong sampling AFD such that is equivalent to ; thus, if a non-strong-sampling AFD is a weakest to solve consensus, then there must exist an equivalent AFD that is also a weakest to solve consensus. Therefore, it is sufficient to restrict our attention to strong-sampling AFDs.

8 Observations

In this section, fix to be an AFD. We define the notion of an observation of and present properties of observations. Observations are a key part of the emulation algorithm used to prove the “weakest failure detector” result, in Section 11.

8.1 Definitions and Basic Properties

An observation is a DAG , where the set of vertices consists of triples of the form where is a location, is a positive integer, and is an action from ; we refer to , , and as the location, index, and action of , respectively. Informally, a vertex denotes that is the -th AFD output at location , and the observation represents a partial ordering of AFD outputs at various locations. We say that an observation is finite iff the set (and therefore the set ) is finite; otherwise, is said to be infinite.

We require the set to satisfy the following properties.

  1. For each location and each positive integer , contains at most one vertex whose location is and index is .

  2. If contains a vertex of the form and , then also contains a vertex of the form .

Property 1 states that at each location , for each positive integer , there is at most one -th AFD output. Property 2 states that for any and , if the -th AFD output occurs at , then the first AFD outputs also occur at .

The set of edges imposes a partial ordering on the occurrence of AFD outputs. We assume that it satisfies the following properties.

  3. For every location and natural number , if contains vertices of the form and , then contains an edge from to .

  4. For every pair of distinct locations and such that contains an infinite number of vertices whose location is , the following is true. For each vertex in whose location is , there is a vertex in whose location is such that there is an edge from to in .

  5. For every triple , , of vertices such that contains both an edge from to and an edge from to , also contains an edge from to . That is, the set of edges of is closed under transitivity.

Property 3 states that at each location , the -th output at occurs before the -st output at . Property 4 states that for every pair of locations and such that infinitely many AFD outputs occur at , for every AFD output event at there exists some AFD output event at such that occurs before . Property 5 is a transitive closure property that simply captures the notion that if event happens before event and happens before event , then happens before .
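For a finite graph, the properties above can be checked mechanically. The following Python sketch is illustrative only: it assumes vertices are represented as `(location, index, action)` tuples and edges as pairs of vertices, and the function name `is_finite_observation` is hypothetical. Property 4 constrains only locations with infinitely many vertices, so it holds vacuously for finite graphs and is not checked.

```python
def is_finite_observation(vertices, edges):
    """Check Properties 1, 2, 3, and 5, plus acyclicity, for a finite graph.

    vertices: set of (location, index, action) triples.
    edges: set of (u, v) pairs of vertices.
    """
    by_key = {}
    for v in vertices:
        loc, idx, _action = v
        # Property 1: at most one vertex per (location, index) pair;
        # indices must be positive integers.
        if idx < 1 or (loc, idx) in by_key:
            return False
        by_key[(loc, idx)] = v

    for (loc, idx), v in by_key.items():
        # Property 2: indices at each location have no gaps.
        if idx > 1 and (loc, idx - 1) not in by_key:
            return False
        # Property 3: the k-th output at a location precedes the (k+1)-st.
        nxt = by_key.get((loc, idx + 1))
        if nxt is not None and (v, nxt) not in edges:
            return False

    # Every edge must connect two vertices of the graph.
    if any(u not in vertices or v not in vertices for (u, v) in edges):
        return False

    # Property 5: the edge set is closed under transitivity.
    for (u, v) in edges:
        for (v2, w) in edges:
            if v == v2 and (u, w) not in edges:
                return False

    # Acyclicity: with a transitively closed edge set, any cycle would
    # force a self-loop, so it is enough to rule out self-loops.
    return all(u != v for (u, v) in edges)


v1, v2, w1 = ("p", 1, "a"), ("p", 2, "b"), ("q", 1, "c")
ok = is_finite_observation({v1, v2, w1}, {(v1, v2), (v1, w1)})
```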

Given an observation , if contains an infinite number of vertices of the form for some particular , then is said to be live in . We write for the set of all the locations that are live in .

Lemma 8.1.

Let be an observation, a location in . Then for every positive integer , contains exactly one vertex of the form .

Proof.

Follows from Properties 1 and 2 of observations. ∎

Lemma 8.2.

Let and be distinct locations with . Let be a vertex in whose location is . Then there exists a positive integer such that for every positive integer , contains an edge from to some vertex of the form .

Proof.

Follows from Lemma 8.1, and Properties 3, 4, and 5 of observations. ∎

Lemma 8.3.

Let and be distinct locations with and ; that is, contains infinitely many vertices whose location is and only finitely many vertices whose location is . Then there exists a positive integer such that for every , there is no edge from any vertex of the form to any vertex whose location is .

Proof.

Fix and as in the hypotheses. Let be the vertex in whose location is and whose index is the highest among all the vertices whose location is . From Lemma 8.2 we know that there exists a positive integer such that for every positive integer , contains an edge from to some vertex of the form . Since is a DAG, there is no edge from any vertex of the form , to . Applying Properties 3 and 5 of observations, we conclude that there is no edge from any vertex of the form to any vertex whose location is . ∎

Lemma 8.4.

Let be an observation. Every vertex in has only finitely many incoming edges in .

Proof.

For contradiction, assume that there exists a vertex with infinitely many incoming edges, and let be the location of . Then there must be a location such that there are infinitely many vertices whose location is that have an outgoing edge to . Fix such a location . Note that must be live in .

Since there are infinitely many vertices whose location is , by Property 4 of observations, we know that has an outgoing edge to some vertex . Since infinitely many vertices of the form have an outgoing edge to , fix some such . By Properties 3 and 5 of observations, we know that there exists an edge from to . Thus, we see that there exist edges from to , from to , and from to , which yield a cycle. This contradicts the assumption that is a DAG.

8.2 Viable Observations

Now consider an observation . If is any sequence of vertices in , then we define the event-sequence of to be the sequence obtained by projecting onto its action component (the third component of each vertex triple).

We say that a trace is compatible with an observation provided that is the event sequence of some topological ordering of the vertices of . is a viable observation if there exists a trace that is compatible with .
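For a finite observation, one topological ordering and its event sequence can be computed directly. The sketch below is illustrative and rests on assumptions: vertices are `(location, index, action)` tuples, edges are pairs of vertices, and the helper names are hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). It produces a single ordering only, and does not decide viability, which also depends on the set of traces of the AFD.

```python
from graphlib import TopologicalSorter


def event_sequence(path):
    """Project a sequence of vertices onto their action components."""
    return [action for (_location, _index, action) in path]


def some_topological_ordering(vertices, edges):
    """Return one topological ordering of a finite DAG."""
    # TopologicalSorter expects a mapping from each node to its predecessors.
    predecessors = {v: set() for v in vertices}
    for (u, v) in edges:
        predecessors[v].add(u)
    return list(TopologicalSorter(predecessors).static_order())


v1, w1, v2 = ("p", 1, "a"), ("q", 1, "c"), ("p", 2, "b")
# The edges form a total order v1 -> w1 -> v2 (with the transitive edge),
# so the topological ordering is unique.
ordering = some_topological_ordering({v1, w1, v2}, {(v1, w1), (w1, v2), (v1, v2)})
events = event_sequence(ordering)
```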

Lemma 8.5.

Let be a viable observation, and suppose that is compatible with . For each location , is live in iff .

We now consider paths in an observation DAG, and their connection with strong sampling, as defined in Section 5.2. A path in an observation is a sequence of vertices, where for each pair of consecutive vertices in a path, is an edge of the observation.

A branch of an observation is a maximal path in . A fair branch of is a branch of that satisfies the additional property that, for every in , if is live in , then contains an infinite number of vertices whose location is .

Lemma 8.6.

Let be a viable observation, and suppose that is compatible with . Suppose is a fair branch of , and let be the event sequence of . Then

  1. There exists a strong sampling of such that .

  2. If is a strong-sampling AFD, then there exists such that is a strong sampling of and .

Proof.

Fix , , , and from the hypotheses of the Lemma statement.

Proof of Part 1. Since is a fair branch of , for each location that is live in , contains an infinite number of outputs at . Furthermore, for each location , the projection of on the events at is a subsequence of the projection of on the AFD outputs at . Therefore, by deleting all the AFD output events from that do not appear in , we obtain a strong-sampling of such that .

Proof of Part 2. In Part 2, assume is a strong-sampling AFD. From Part 1, we have already established that there exists a strong-sampling of such that . Fix such a . By closure under strong-sampling, since , we conclude that as well. ∎

Lemma 8.6 is crucial to our results. In Section 11, we describe an emulation algorithm that uses outputs from an AFD to produce viable observations, and the emulations consider paths of the observation and simulate executions of a consensus algorithm with AFD outputs from each path in the observation. Lemma 8.6 guarantees that each fair path in the observation corresponds to an actual sequence of AFD outputs from some trace of the AFD. In fact, the motivation for the closure-under-strong-sampling property is to establish Lemma 8.6.

8.3 Relations and Operations on Observations

The emulation construction in Section 11 will require processes to manipulate observations. To help with this, we define some relations and operations on DAGs and observations.

Prefix.

Given two DAGs and , is said to be a prefix of iff is a subgraph of and for every vertex of , the set of incoming edges of in is equal to the set of incoming edges of in .
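As a minimal sketch of this definition for finite graphs, assume each DAG is represented as a `(vertices, edges)` pair of Python sets, with edges as pairs of vertices. The function name `is_prefix` is hypothetical.

```python
def is_prefix(g, h):
    """Return True iff finite DAG g is a prefix of finite DAG h."""
    g_vertices, g_edges = g
    h_vertices, h_edges = h
    # g must be a subgraph of h.
    if not (g_vertices <= h_vertices and g_edges <= h_edges):
        return False
    # Every vertex of g must have the same incoming edges in g and in h.
    for v in g_vertices:
        incoming_in_g = {e for e in g_edges if e[1] == v}
        incoming_in_h = {e for e in h_edges if e[1] == v}
        if incoming_in_g != incoming_in_h:
            return False
    return True


v1, v2 = ("p", 1, "a"), ("p", 2, "b")
h = ({v1, v2}, {(v1, v2)})
first_only = ({v1}, set())
second_only = ({v2}, set())
```

Here `first_only` is a prefix of `h`, whereas `second_only` is a subgraph of `h` but not a prefix, because `v2` has an incoming edge in `h` that is missing from `second_only`.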

Union.

Let and be two observations. Then the union of and , denoted , is the graph . Note that, in general, this union need not be another observation. However, under certain conditions, wherein the observations are finite and “consistent” in terms of the vertices and incoming edges at each vertex, the union of two observations is also an observation. We state this formally in the following Lemma.

Lemma 8.7.

Let and be two finite observations. Suppose that the following hold:

  1. There do not exist and with .

  2. If then has the same set of incoming edges (from the same set of other vertices) in and .

Then is also an observation.

Proof.

Straightforward. ∎

Insertion.

Let be a finite observation, a location, and the largest integer such that contains a vertex of the form . Let be a triple . Then , the result of inserting into , is a new graph , where and . That is, is obtained from by adding vertex and adding edges from every vertex in to .

Lemma 8.8.

Let be a finite observation, a location. Let be the largest integer such that contains a vertex of the form . Let be a triple . Then is a finite observation.
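The insertion operation can be sketched as follows. This is an illustration under assumptions not fixed by the text above: vertices are `(location, index, action)` tuples, the new vertex receives the index one greater than the largest index currently present at its location, and that largest index is taken to be 0 when the location has no vertex yet. The function name `insert` is hypothetical.

```python
def insert(vertices, edges, location, action):
    """Insert a new AFD output at `location` into a finite observation.

    Returns the new vertex set and edge set; the inputs are not modified.
    """
    # Largest index already used at this location (assumed 0 if none).
    k = max((idx for (loc, idx, _a) in vertices if loc == location), default=0)
    new_vertex = (location, k + 1, action)
    # Add an edge from every existing vertex to the new vertex.
    new_edges = set(edges) | {(u, new_vertex) for u in vertices}
    return set(vertices) | {new_vertex}, new_edges


v1, w1 = ("p", 1, "a"), ("q", 1, "c")
new_vertices, new_edges = insert({v1, w1}, {(v1, w1)}, "p", "b")
```

Since the new vertex has incoming edges from all existing vertices and no outgoing edges, the insertion cannot create a cycle, and the edge set stays closed under transitivity.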

8.4 Limits of Sequences of Observations

Consider an infinite sequence of finite observations, where each is a prefix of the next. Then the limit of this sequence is the graph defined as follows:

  • .

  • .

Lemma 8.9.

For each positive integer , is a prefix of .

Under certain conditions, the limit of the infinite sequence of observations is also an observation; we note this in Lemma 8.10.

Lemma 8.10.

Let be the limit of the infinite sequence of finite observations, where each is a prefix of the next. Suppose that the sequence satisfies the following property:

  1. For every vertex and any location , there exists a vertex with location such that contains the edge .

Then is an observation.

Proof.

All properties are straightforward from the definitions, except for Property 4 of observations, which follows from the assumption of the lemma. ∎

We define an infinite sequence of finite observations, where each is a prefix of the next, to be convergent if the limit of this sequence is an observation.

9 Execution Trees

In this section, we define a tree representing executions of a system that are consistent with a particular observation of a particular failure detector . Specifically, we define a tree that describes executions of in which the sequence of AFD outputs is exactly the event-sequence of some path in observation .

Section 9.1 defines the system for which the tree is defined. The tree is constructed in two parts: Section 9.2 defines a “task tree”, and Section 9.3 adds tags to the nodes and edges of the task tree to yield the final execution tree. Additionally, Sections 9.2 and 9.3 prove certain basic properties of execution trees, and they establish a correspondence between the nodes in the tree and finite executions of . Section 9.4 defines that two nodes in the execution tree are “similar” to each other if they have the same tags, and therefore correspond to the same execution of ; the section goes on to prove certain useful properties of nodes in the subtrees rooted at any two similar nodes. Section 9.5 defines that two nodes in the execution tree are “similar-modulo-” to each other if the executions corresponding to the two nodes are indistinguishable for process automata at any location except possibly the process automaton at ; the section goes on to prove certain useful properties of nodes in the subtrees rooted at any two similar-modulo- nodes. Section 9.6 establishes useful properties of nodes that are in different execution trees that are constructed using two observations, one of which is a prefix of the other. Finally, Section 9.7 proves that a “fair branch” of infinite execution trees corresponds to a fair execution of system . The major results in this section are used in Sections 10 and 11, which show that is a weakest strong-sampling AFD to solve consensus if at most locations crash.

9.1 The System

Fix to be a system consisting of a distributed algorithm , channel automata, and an environment automaton such that solves a crash problem using in .

The system contains the following tasks. The process automaton at contains a single task . Each channel automaton , where contains a single task, which we also denote as ; the actions in task are of the form , which results in a message received at location . Each automaton has tasks , where ranges over some fixed task index set . Let denote the set of all the tasks of .

Each task has an associated location, which is the location of all the actions in the task. The tasks at location are , , and .

Recall from Section 6 that each process automaton, each channel automaton, and the environment automaton have unique initial states. Therefore, the system has a unique initial state. From the definitions of the constituent automata of , we obtain the following lemma.

Lemma 9.1.

Let be an execution of system , and let be the trace of such that for some location , does not contain any locally-controlled actions at and . Then, there exists an execution of system such that is the trace of .

Proof.

Fix , and as in the hypothesis of the claim. Let be the prefix of whose trace is . Let be the final state of . Let be the execution , where is the state of when is applied to state .

Note that disables all locally-controlled actions at and , and it does not change the state of any other automaton in . Therefore, the state of all automata in except for and are the same in state and . Also, note that does not contain any locally-controlled action at or , and can be applied to state . Therefore, can also be applied to , thus extending to an execution of . By construction, the trace of is . ∎

9.2 The Task Tree

For any observation , we define a tree that describes all executions of in which the sequence of AFD output events is the event-sequence of some path in .

We describe our construction in two stages. The first stage, in this subsection, defines the basic structure of the tree, with annotations indicating where particular system tasks and observation vertices occur. The second stage, described in the next subsection, adds information about particular actions and system states.

The task tree is rooted at a special node called “” which corresponds to the initial state of the system . The tree is of height ; if is infinite, the tree has infinite height. Every node in the tree that is at a depth is a leaf node. All other nodes are internal nodes. Each edge in the tree is labeled by an element from . Intuitively, the label of an edge corresponds to a task being given a “turn” or an AFD event occurring. An edge with label is said to be an -edge, for short. The child of a node that is connected to by an edge labeled is said to be an -child of .

Footnote 5: The intuitive reason for limiting the depth of the tree to is the following. If is a finite observation, then none of the locations in are live in . In this case, we want all the branches in the task tree to be finite. On the other hand, if is an infinite observation, then some location in is live in , and in this case we want all the branches in the task tree to be infinite. One way to ensure these properties is to restrict the depth of the tree to .

In addition to labels at each edge, the tree is also augmented with a vertex tag, which is a vertex in , at each node and edge. We write for the vertex tag at node and for the vertex tag at edge . Intuitively, each vertex tag denotes the latest AFD output that occurs in the execution of corresponding to the path in the tree from the root to node or the head node of edge (as appropriate). The set of outgoing edges from each node in the tree is determined by the vertex tag .

We describe the labels and vertex tags in the task tree recursively, starting with the node. We define the vertex tag of to be a special placeholder element , representing a “null vertex” of . For each internal node with vertex tag , the outgoing edges from and their vertex tags are as follows.

  • Outgoing , , and edges. For every task in , the task tree contains exactly one outgoing edge from with label from , i.e., an -edge. The vertex tag of is .

  • Outgoing -edges. If , then for every vertex of