Simple CHT: A New Derivation of the Weakest Failure Detector for Consensus

Eli Gafni

UCLA
Petr Kuznetsov

Télécom ParisTech
Abstract

The paper proposes an alternative proof that Ω, an oracle that outputs a process identifier and guarantees that eventually the same correct process identifier is output at all correct processes, provides minimal information about failures for solving consensus in read-write shared-memory systems: every oracle that gives enough failure information to solve consensus can be used to implement Ω.

Unlike the original proof by Chandra, Hadzilacos and Toueg (CHT), the proof presented in this paper builds upon the very fact that 2-process wait-free consensus is impossible. Also, since the oracle that is used to implement Ω can solve consensus, the implementation is allowed to directly access consensus objects. As a result, the proposed proof is shorter and conceptually simpler than the original one.

1 Introduction

The presence of faults and the lack of synchrony make distributed computing interesting. On the one hand, in an asynchronous system, which assumes no bounds on communication delays and relative processing speeds, even a basic form of non-trivial synchronization (consensus) is impossible if just one process may fail by crashing [7, 13]. On the other hand, in a synchronous system, where the bounds exist and are known a priori and every failure can be reliably detected, every meaningful fault-tolerant synchronization problem becomes solvable. The gap suggests that the amount of information about failures is a crucial factor in reasoning about fault-tolerant solvability of distributed computing problems.

Chandra and Toueg proposed failure detectors as a convenient abstraction to describe the failure information. Informally, a failure detector is a distributed oracle that provides processes with hints about failures [5]. The notion of a weakest failure detector [4] captures the exact amount of synchrony needed to solve a given problem: D is the weakest failure detector for solving a problem M if (1) D is sufficient to solve M, i.e., there exists an algorithm that solves M using D, and (2) any failure detector D' that is sufficient to solve M provides at least as much information about failures as D does, i.e., there exists a reduction algorithm that extracts the output of D using the failure information provided by D'.

This paper considers a distributed system in which crash-prone processes communicate using atomic reads and writes in shared memory. In the (binary) consensus problem [7], every process starts with a binary input and every correct (never-failing) process is supposed to output one of the inputs such that no two processes output different values. Asynchronous (wait-free) consensus is known to be impossible [7, 13], as long as at least one process may fail by crashing. Chandra et al. [4] showed that the “eventual leader” failure detector Ω is necessary and sufficient to solve consensus. The failure detector Ω outputs, when queried, a process identifier, such that, eventually, the same correct process identifier is output at all correct processes.

The reduction technique presented in [4] is very interesting in its own right, since it not only allows us to determine the weakest failure detector for consensus, but also establishes a framework for determining the weakest failure detector for any problem. Informally, the reduction algorithm of [4] works as follows. Let D be any failure detector that can be used to solve consensus. Processes periodically query their modules of D, exchange the values returned by D, and arrange the accumulated output of the failure detector in the form of ever-growing directed acyclic graphs (DAGs). Every process periodically uses its DAG as a stimulus for simulating multiple runs of the given consensus algorithm. It is shown in [4] that, eventually, the collection of simulated runs will include a critical run in which a single process p “hides” the decided value, and, thus, no extension of the run can reach a decision without cooperation of p. As long as a process performing the simulation observes a run that the process suspects to remain critical, it outputs the “hiding” process identifier of the “first” such run as the extracted output of Ω. The existence of a critical run and the fact that the correct processes agree on ever-growing prefixes of simulated runs imply that, eventually, the correct processes will always output the identifier of the same correct process.

Crucially, the existence of a critical run is established in [4] using the notion of valence [7]: a simulated finite run is called v-valent (v ∈ {0, 1}) if all simulated extensions of it decide v. If both decisions 0 and 1 are “reachable” from the finite run, then the run is called bivalent. Recall that in [7], the notion of valence is used to derive a critical run, and then it is shown that such a run cannot exist in an asynchronous system, implying the impossibility of consensus. In [4], a similar argument is used to extract the output of Ω in a partially synchronous system that allows for solving consensus. Thus, in a sense, the technique of [4] rehashes arguments of [7]. It is challenging to find a proof that Ω is necessary to solve consensus building upon the very fact that 2-process wait-free consensus is impossible.

This paper addresses this challenge. It is shown that Ω is necessary to solve consensus using the very impossibility of 2-process wait-free consensus, without “opening the box” and considering the problem semantics. The resulting proof is shorter and simpler than the original proof of [4].

On the technical side, the paper uses two fundamental results and one observation. First, the technique of Zieliński [15] allows us to construct, given an algorithm A that uses a failure detector D, an asynchronous algorithm A_G that simulates runs of A using a static sample of D's output (captured in a DAG G), instead of the “real” output of the failure detector. The set of processes that take infinitely many steps in a simulated run may differ from the set of correct processes implied by the sample of D. Therefore, the asynchronous algorithm A_G guarantees that every infinite simulated run is safe (every prefix of it is a finite run of A), but not necessarily live (some correct process may not be able to make progress).

Second, the paper makes use of the BG-simulation technique [1, 3] that allows 2 processes to simulate, in a wait-free manner, a 1-resilient (with at most one faulty process) run of A_G. Using a series of consensus instances (provided by algorithm A using D), processes locally simulate the very same sequence of 1-resilient runs and eventually identify a “never-deciding” 1-resilient run of A_G. Since A_G is an asynchronous simulation of A, a 1-resilient run of A_G that includes infinitely many steps of each correct process should be deciding. Thus, exactly one correct process appears in the 1-resilient never-deciding run only finitely often. To emulate Ω, it is thus sufficient to output the process that appears the least in that 1-resilient run. Eventually, all correct processes agree on the same never-deciding 1-resilient run, and will always output the same correct process. The observation here is that a reduction algorithm may directly access consensus objects, since it is given a failure detector D which can be used to solve consensus.

The rest of the paper is organized as follows. Section 2 describes the system model. Section 3 presents the reduction algorithm. Section 4 overviews the related work and Section 5 concludes the paper by discussing implications of the presented results.

2 Model

The model of processes communicating through read-write shared objects and using failure detectors is based on [4, 9, 10, 11]. The details necessary for showing the results of this paper are described below.

2.1 Processes and objects

A distributed system is composed of a set Π of n processes p_1, ..., p_n (n ≥ 2). Processes are subject to crash failures. A process that never fails is said to be correct. Processes that are not correct are called faulty. Processes communicate by applying atomic operations to a collection of shared objects. In this paper, the shared objects are registers, i.e., they export only conventional atomic read-write operations.

2.2 Failure patterns and failure detectors

A failure pattern F is a function from the time range 𝕋 = ℕ to 2^Π, where F(t) denotes the set of processes that have crashed by time t. Once a process crashes, it does not recover, i.e., ∀t: F(t) ⊆ F(t+1). The set of faulty processes in F, ∪_{t∈𝕋} F(t), is denoted by faulty(F). Respectively, correct(F) = Π − faulty(F). A process p ∈ F(t) is said to be crashed at time t. An environment is a set of failure patterns. This paper considers environments that consist of failure patterns in which at least one process is correct.
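The definitions of faulty(F) and correct(F) can be made concrete on a finite prefix of a failure pattern. The following Python sketch is only an illustration under stated assumptions: the function name `faulty_and_correct` is hypothetical, and F is represented by a list whose t-th entry is F(t). Since a real failure pattern is infinite, the result only reflects crashes that occur within the prefix.

```python
from typing import FrozenSet, List


def faulty_and_correct(pi: FrozenSet[str], pattern: List[FrozenSet[str]]):
    """pattern[t] plays the role of F(t) for t = 0..len(pattern)-1
    (a hypothetical finite stand-in for an infinite failure pattern)."""
    # Crashes are permanent: F(t) must be a subset of F(t+1).
    for before, after in zip(pattern, pattern[1:]):
        if not before <= after:
            raise ValueError("a crashed process recovered")
    faulty = frozenset().union(*pattern)  # faulty(F): union of all F(t)
    correct = pi - faulty                 # correct(F) = Π − faulty(F)
    return faulty, correct
```

For instance, with Π = {p1, p2, p3} and p2 crashing at time 1, the sketch reports faulty(F) = {p2} and correct(F) = {p1, p3}.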

A failure detector history H with range R is a function from Π × 𝕋 to R. H(p, t) is interpreted as the value output by the failure detector module of process p at time t. A failure detector D with range R_D is a function that maps each failure pattern to a (non-empty) set of failure detector histories with range R_D. D(F) denotes the set of possible failure detector histories permitted by D for failure pattern F. Possible ranges of failure detectors are not a priori restricted.

2.3 Algorithms

An algorithm A using a failure detector D is a collection of deterministic automata, one for each process in the system. A(p) denotes the automaton on which process p runs the algorithm A. Computation proceeds in atomic steps of A. In each step of A, process p

  1. invokes an atomic operation (read or write) on a shared object and receives a response, or queries its failure detector module D_p and receives a value from D, and

  2. applies its current state, the response received from the shared object or the value output by D to the automaton A(p) to obtain a new state.

A step of A is thus identified by a tuple (p, d), where d is the failure detector value output at p during that step if D was queried, and ⊥ otherwise.

If the state transitions of the automata do not depend on the failure detector values, the algorithm is called asynchronous. Thus, for an asynchronous algorithm, a step is uniquely identified by the process id.

2.4 Runs

A state of A defines the state of each process and each object in the system. An initial state I of A specifies an initial state for every automaton A(p) and every shared object.

A run of algorithm A using a failure detector D in an environment E is a tuple R = ⟨F, H, I, S, T⟩ where F ∈ E is a failure pattern, H ∈ D(F) is a failure detector history, I is an initial state of A, S is an infinite sequence of steps of A respecting the automata A(p) and the sequential specification of shared objects, and T is an infinite list of increasing time values indicating when each step of S has occurred, such that for all k ∈ ℕ, if S[k] = (p, d) with d ≠ ⊥, then p ∉ F(T[k]) and d = H(p, T[k]).
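The constraint that ties a run to its failure pattern and failure detector history can be checked mechanically on a finite prefix of a run. The sketch below assumes a hypothetical encoding: steps are pairs `(p, d)` with `None` standing for ⊥, `times` lists T[0], T[1], ..., and `failure(t)` and `history(p, t)` are caller-supplied functions returning F(t) and H(p, t). The name `consistent_prefix` is illustrative only.

```python
from typing import Callable, Hashable, Optional, Sequence, Set, Tuple

# (process, failure detector value), with None standing for ⊥ (no query)
Step = Tuple[Hashable, Optional[Hashable]]


def consistent_prefix(steps: Sequence[Step], times: Sequence[int],
                      failure: Callable[[int], Set[Hashable]],
                      history: Callable[[Hashable, int], Hashable]) -> bool:
    """Check the run constraints of Section 2.4 on a finite prefix of <S, T>."""
    if len(steps) != len(times):
        return False
    # T must be a list of increasing time values.
    if any(t1 >= t2 for t1, t2 in zip(times, times[1:])):
        return False
    for (p, d), t in zip(steps, times):
        # A query step must be taken by a live process and return H(p, T[k]).
        if d is not None and (p in failure(t) or d != history(p, t)):
            return False
    return True
```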

A run is fair if every process in correct(F) takes infinitely many steps in S, and k-resilient if at least n − k processes appear in S infinitely often. A partial run of an algorithm A is a finite prefix of a run of A.

For two steps s and s' of processes p and p', respectively, in a (partial) run R of an algorithm A, we say that s causally precedes s' in R, and we write s → s', if (1) p = p', and s occurs before s' in R, or (2) s is a write step, s' is a read step, and s occurs before s' in R, or (3) there exists s'' in R such that s → s'' and s'' → s'.
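The three clauses of the causal-precedence definition amount to a transitive closure over the step sequence. A minimal sketch, assuming a hypothetical encoding of a partial run as a list of `(process, kind)` pairs; like the definition above, clause (2) does not distinguish between registers. The function name `causal_order` is illustrative.

```python
def causal_order(steps):
    """steps[k] = (process, kind), kind in {"read", "write", "query"}.
    Returns the set of index pairs (a, b) such that steps[a] -> steps[b]."""
    n = len(steps)
    prec = set()
    for a in range(n):
        for b in range(a + 1, n):
            (pa, ka), (pb, kb) = steps[a], steps[b]
            # Clause (1): same process; clause (2): a write before a read.
            if pa == pb or (ka == "write" and kb == "read"):
                prec.add((a, b))
    # Clause (3): transitive closure (Warshall's algorithm).
    for m in range(n):
        for a in range(n):
            for b in range(n):
                if (a, m) in prec and (m, b) in prec:
                    prec.add((a, b))
    return prec
```

In the test run, p1's write precedes p2's read, so it also precedes p2's subsequent query, while p3's query remains causally unrelated to p1's write.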

2.5 Consensus

In the binary consensus problem, every process starts the computation with an input value in {0, 1} (we say the process proposes the value), and eventually reaches a distinct state associated with an output value in {0, 1} (we say the process decides the value). An algorithm A solves consensus in an environment E if in every fair run of A in E, (i) every correct process eventually decides, (ii) every decided value was previously proposed, and (iii) no two processes decide different values.

Given an algorithm that solves consensus, it is straightforward to implement an abstraction cons that can be accessed with an operation propose(v) (v ∈ {0, 1}) returning a value in {0, 1}, and guarantees that every propose operation invoked by a correct process eventually returns, every returned value was previously proposed, and no two different values are ever returned.
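The sequential specification of the cons abstraction can be illustrated by an object that decides the first value proposed to it. The sketch below is a hypothetical, lock-based stand-in useful for checking that specification in tests: it is not wait-free, and it is not how cons is obtained from a consensus algorithm using a failure detector.

```python
import threading


class Cons:
    """One-shot binary consensus object: every propose() returns the first
    value ever proposed, so validity and agreement hold by construction."""

    def __init__(self):
        self._lock = threading.Lock()
        self._decided = None

    def propose(self, v):
        if v not in (0, 1):
            raise ValueError("binary consensus: v must be 0 or 1")
        with self._lock:  # serializes proposals; the first one wins
            if self._decided is None:
                self._decided = v
            return self._decided
```

Concurrent callers may propose different values, but all of them obtain the same one, which was proposed by some caller.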

2.6 Weakest failure detector

We say that an algorithm T_{D→D'} using D extracts the output of D' in E if T_{D→D'} implements a distributed variable D'-output such that for every run R of T_{D→D'} with failure pattern F ∈ E, there exists H' ∈ D'(F) such that for all p ∈ Π and t ∈ 𝕋, D'-output_p(t) = H'(p, t) (i.e., the value of D'-output at p at time t is H'(p, t)). We say that T_{D→D'} is a reduction algorithm. (A more precise definition of a reduction algorithm is given in [11].)

If, for failure detectors D and D' and an environment E, there is a reduction algorithm using D that extracts the output of D' in E, then we say that D' is weaker than D in E.

D is the weakest failure detector to solve a problem M (e.g., consensus) in E if there is an algorithm that solves M using D in E and D is weaker than any failure detector D' that can be used to solve M in E.

3 Extracting Ω

Let A be an algorithm that solves consensus using a failure detector D. The goal is to construct an algorithm that emulates Ω using A and D. Recall that to emulate Ω means to output, at each time and at each process, a process identifier such that, eventually, the same correct process is always output at all correct processes.
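Since the property required of Ω concerns infinite histories, no finite observation can confirm it; a finite prefix can only be consistent with it. The hypothetical helper below takes a recorded prefix (one dictionary of leader outputs per time step) and the set of correct processes, and reports the earliest time from which all correct processes output the same correct identifier until the end of the prefix.

```python
from typing import Dict, Hashable, List, Optional, Set, Tuple


def omega_suffix(history: List[Dict[Hashable, Hashable]],
                 correct: Set[Hashable]) -> Optional[Tuple[int, Hashable]]:
    """history[t][p] is the identifier output at process p at time t.
    Returns (t0, leader) if, from time t0 to the end of the prefix, every
    correct process outputs the same correct leader; otherwise None."""
    if not history:
        return None
    leaders = {history[-1][p] for p in correct}
    if len(leaders) != 1:
        return None  # correct processes disagree at the end of the prefix
    leader = leaders.pop()
    if leader not in correct:
        return None  # Ω must eventually point at a correct process
    t0 = len(history) - 1
    # Walk backwards while all correct processes still output `leader`.
    while t0 > 0 and all(history[t0 - 1][p] == leader for p in correct):
        t0 -= 1
    return t0, leader
```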

3.1 Overview

As in [4], the reduction algorithm of this paper uses failure detector D to construct an ever-growing directed acyclic graph (DAG) that contains a sample of the values output by D in the current run and captures some temporal relations between them. Following [15], this DAG can be used by an asynchronous algorithm A_G to simulate a (possibly finite and unfair) run of A. Recall that, using BG-simulation [1, 3], 2 processes can simulate a 1-resilient run of A_G. The fact that 1-resilient n-process consensus is impossible implies that the simulation must produce at least one “non-deciding” 1-resilient run of A_G.

Now every correct process locally simulates all executions of BG-simulation on two processes q_1 and q_2 that simulate a 1-resilient run of A_G of the whole system Π. Eventually, every correct process locates a never-deciding run of A_G and uses the run to extract the output of Ω: it is sufficient to output the process that takes the least number of steps in the “smallest” non-deciding simulated run of A_G. Indeed, exactly one correct process takes finitely many steps in the non-deciding 1-resilient run of A_G: otherwise, the run would simulate a fair and thus deciding run of A.

The reduction algorithm extracting Ω from A and D consists of two components that are running in parallel: the communication component and the computation component. In the communication component, every process p_i maintains the ever-growing directed acyclic graph (DAG) G_i by periodically querying its failure detector module and exchanging the results with the others through the shared memory. In the computation component, every process simulates a set of runs of A using the DAGs and maintains the extracted output of Ω.

3.2 DAGs

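  Shared variables: for all p_i ∈ Π: G_i, initially empty graph

  1  k_i := 0
  2  while true do
  3      for all p_j ∈ Π do G_i ← G_i ∪ G_j
  4      d_i := query failure detector D
  5      k_i := k_i + 1
  6      add [p_i, d_i, k_i] and edges from all other vertices of G_i to [p_i, d_i, k_i], to G_i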

Figure 1: Building a DAG: the code for each process p_i

The communication component is presented in Figure 1. This task maintains an ever-growing DAG G_i that contains a finite sample of the current failure detector history. The DAG G_i is stored in a register that can be updated by p_i and read by all processes.
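The loop of Figure 1 can be mimicked sequentially to see why the resulting graphs satisfy properties such as (2) and (3) listed below. The sketch treats each iteration of the while-loop as atomic and lets a caller-supplied `schedule` decide which process iterates next; `fd_output(p, k)`, the value returned by p's k-th query, is a hypothetical stand-in for D. A real execution interleaves reads and writes of the registers G_i, so this is an illustration rather than an implementation.

```python
def build_dags(schedule, fd_output, processes):
    """Sequential sketch of Figure 1. Vertices are tuples (p, d, k)."""
    dags = {p: (set(), set()) for p in processes}  # p -> (vertices, edges)
    k = {p: 0 for p in processes}                  # line 1
    for p in schedule:                             # one while-loop iteration of p
        verts, edges = dags[p]
        for q in processes:                        # line 3: G_p <- G_p ∪ G_q
            verts |= dags[q][0]
            edges |= dags[q][1]
        d = fd_output(p, k[p] + 1)                 # line 4: query D
        k[p] += 1                                  # line 5
        v = (p, d, k[p])                           # line 6: new vertex [p, d, k]
        edges |= {(u, v) for u in verts}          # edges from all other vertices
        verts.add(v)
    return dags
```

Because a merged DAG always carries the full history of every vertex it contains, an edge into a new vertex is added from every vertex already present, which is what keeps each G_i transitively closed.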

DAG G_i has some special properties which follow from its construction [4]. Let F be the current failure pattern, and H ∈ D(F) be the current failure detector history. Then a fair run of the algorithm in Figure 1 guarantees that there exists a map τ: Π × R_D × ℕ → 𝕋, such that, for every correct process p_i and every time t (G_i(t) denotes here the value of variable G_i at time t):

  1. The vertices of G_i(t) are of the form [p_j, d, k] where p_j ∈ Π, d ∈ R_D and k ∈ ℕ.

    1. For each vertex v = [p_j, d, k], p_j ∉ F(τ(v)) and d = H(p_j, τ(v)). That is, d is the value output by p_j's failure detector module at time τ(v).

    2. For each edge (v, v'), τ(v) < τ(v'). That is, any edge in G_i(t) reflects the temporal order in which the failure detector values are output.

  2. If v = [p_j, d, k] and v' = [p_j, d', k'] are vertices of G_i(t) and k < k', then (v, v') is an edge of G_i(t).

  3. G_i(t) is transitively closed: if (v, v') and (v', v'') are edges of G_i(t), then (v, v'') is also an edge of G_i(t).

  4. For all correct processes p_j, there is a time t' ≥ t, a d ∈ R_D and a k ∈ ℕ such that, for every vertex v of G_i(t), (v, [p_j, d, k]) is an edge of G_i(t').

  5. For all correct processes p_j, there is a time t' ≥ t such that G_i(t) is a subgraph of G_j(t').

The properties imply that ever-growing DAGs at correct processes tend to the same infinite DAG G: lim_{t→∞} G_i(t) = G. In a fair run of the algorithm in Figure 1, the set of processes that obtain infinitely many vertices in G is the set of correct processes [4].

3.3 Asynchronous simulation

It is shown below that any infinite DAG G constructed as shown in Figure 1 can be used to simulate partial runs of A in the asynchronous manner: instead of querying D, the simulation algorithm uses the samples of the failure detector output captured in the DAG. The pseudo-code of this simulation is presented in Figure 2. The algorithm is hypothetical in the sense that it uses an infinite input, but this requirement is relaxed later.

In the algorithm, each process p_i is initially associated with an initial state of A and performs a sequence of simulated steps of A. Every process p_i maintains a shared register V_i that stores the vertex of G used for the most recent step of A simulated by p_i. Each time p_i is about to perform a step of A, it first reads registers V_1, ..., V_n to obtain the vertices of G used by processes for simulating the most recent causally preceding steps of A (line 10 in Figure 2). Then p_i selects the next vertex of G that succeeds all these vertices (lines 11–14). If no such vertex is found, p_i blocks forever (line 13).
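The vertex-selection rule of lines 11–14 of Figure 2 can be phrased as a search over a finite DAG. In the hypothetical sketch below, `seen` holds the non-⊥ vertices read from V_1, ..., V_n, `k` is the counter of p's last used vertex, and the function returns the first vertex of p with a larger counter that succeeds all of them, or `None` where the real algorithm would keep waiting for the DAG to grow.

```python
def next_vertex(vertices, edges, p, k, seen):
    """Return the first vertex (p, d, k') of the finite DAG with k' > k such
    that (u, v) is an edge for every u in `seen`; None means 'keep waiting'."""
    own = sorted((v for v in vertices if v[0] == p), key=lambda v: v[2])
    for v in own:  # scan p's vertices in counter order, as lines 11-14 do
        if v[2] > k and all((u, v) in edges for u in seen):
            return v
    return None
```

In the test DAG, p1's vertex with counter 2 is skipped because it does not succeed p2's vertex, so the simulation moves on to counter 3.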

Note that a correct process p_i may block forever if G contains only finitely many vertices of p_i. As a result, an infinite run of A_G may simulate an unfair run of A: a run in which some correct process takes only finitely many steps. But every finite run simulated by A_G is a partial run of A.

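  Shared variables:
    V_1, ..., V_n, initially ⊥  {for each p_j, V_j is the vertex of G corresponding to the latest simulated step of p_j}
    Shared variables of A

  7   initialize the simulated state of p_i in A, based on I
  8   k := 0
  9   while true do  {Simulating the next p_i's step of A}
  10      U := [V_1, ..., V_n]
  11      repeat
  12          k := k + 1
  13          wait until G includes [p_i, d, k] for some d
  14      until ∀j, U[j] ≠ ⊥: (U[j], [p_i, d, k]) is an edge of G
  15      V_i := [p_i, d, k]
  16      take the next p_i's step of A using d as the output of D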

Figure 2: DAG-based asynchronous algorithm A_G: code for each p_i
Theorem 1

Let G be the DAG produced in a fair run R of the communication component in Figure 1, and let F be the failure pattern of R. Let R_G be any fair run of A_G using G. Then the sequence of steps simulated by A_G in R_G belongs to a (possibly unfair) run R_A of A, with the input vector of R_G and failure pattern F. Moreover, the set of processes that take infinitely many steps in R_A is correct(R_G) ∩ correct(F), and if correct(F) ⊆ correct(R_G), then R_A is fair.

Proof. Recall that a step of a process p can be either a memory step in which p accesses shared memory or a query step in which p queries the failure detector. Since memory steps simulated in R_G are performed as in A, to show that algorithm A_G indeed simulates a run of A with failure pattern F, it is enough to make sure that the sequence of simulated query steps in the simulated run (using vertices of G) could have been observed in a run of A with failure pattern F and the input vector based on R_G.

Let τ be a map associated with G that carries each vertex of G to an element in 𝕋 such that (a) for any vertex v = [p, d, k] of G, p ∉ F(τ(v)) and d = H(p, τ(v)), and (b) for every edge (v, v') of G, τ(v) < τ(v') (the existence of τ is established by property (5) of DAGs in Section 3.2). For each step s simulated by p_i in R_G, let t(s) denote the time when step s occurred in R_G, i.e., when the corresponding line 16 in Figure 2 was executed, and let v(s) be the vertex of G used for simulating s, i.e., the value of [p_i, d, k] when p_i simulates s in line 16 of Figure 2.

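Consider query steps s_i and s_j simulated by processes p_i and p_j, respectively. Let v(s_i) = [p_i, d_i, k_i] and v(s_j) = [p_j, d_j, k_j]. WLOG, suppose that τ(v(s_i)) < τ(v(s_j)), i.e., D outputs d_i at p_i before outputting d_j at p_j.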

If t(s_i) < t(s_j), i.e., s_i is simulated by p_i before s_j is simulated by p_j, then the order in which p_i and p_j see values d_i and d_j in the run produced by A_G is consistent with the output of D, i.e., the values d_i and d_j indeed could have been observed in that order.

Suppose now that t(s_j) < t(s_i). If s_i and s_j are not causally related in the simulated run, then R_G is indistinguishable from a run in which s_i is simulated by p_i before s_j is simulated by p_j. Thus, d_i and d_j can still be observed in a run of A.

Now suppose, by contradiction, that t(s_j) < t(s_i) and s_j causally precedes s_i in the simulated run, i.e., p_j simulated at least one write step w after s_j, and p_i simulated at least one read step r before s_i, such that w took place before r in R_G. Since, before performing the memory access of w, p_j updated V_j with a vertex that succeeds v(s_j) in G (line 15), and r occurs in R_G after w, p_i must have found v(s_j) or a later vertex of p_j in V_j before simulating step s_i (line 10). Thus, the vertex of G used for simulating s_i must be a descendant of v(s_j), and, by properties (1) and (3) of DAGs (Section 3.2), τ(v(s_j)) < τ(v(s_i)), a contradiction. Hence, the sequence of steps of A simulated in R_G could have been observed in a run of A with failure pattern F.

Since in R_G a process simulates only its own steps of A, every process that appears infinitely often in R_A is in correct(R_G). Also, since each process that is faulty in F has only finitely many vertices in G, eventually each process in faulty(F) is blocked in line 13 in Figure 2, and, thus, every process that appears infinitely often in R_A is also in correct(F). Now consider a process p_i ∈ correct(R_G) ∩ correct(F). Property (4) of DAGs implies that for every finite set V of vertices of G, there exists a vertex v' of p_i in G such that for all v ∈ V, (v, v') is an edge in G. Thus, the wait statement in line 13 cannot block p_i forever, and p_i takes infinitely many steps in R_A.

Hence, the set of processes that appear infinitely often in R_A is exactly correct(R_G) ∩ correct(F). Specifically, if correct(F) ⊆ correct(R_G), then the set of processes that appear infinitely often in R_A is correct(F), and the run is fair.

Note that in a fair run, the properties of the algorithm in Figure 2 remain the same if the infinite DAG G is replaced with a finite ever-growing DAG G_i constructed in parallel (Figure 1) such that lim_{t→∞} G_i(t) = G. This is because such a replacement only affects the wait statement in line 13, which blocks until the first vertex of G that causally succeeds every simulated step recently “witnessed” by p_i is found in G_i, and this cannot take forever if p_i is correct (properties (4) and (5) of DAGs in Section 3.2). The wait blocks forever if the vertex is absent in G, which may happen only if p_i is faulty.

3.4 BG-simulation

Borowsky and Gafni proposed in [1, 3] a simulation technique by which t + 1 simulators can wait-free simulate a t-resilient execution of any asynchronous n-process protocol. Informally, the simulation works as follows. Every simulator tries to simulate steps of all processes in a round-robin fashion. Simulators run an agreement protocol to make sure that every step is simulated at most once. Simulating a step of a given process may block forever if and only if some simulator has crashed in the middle of the corresponding agreement protocol. Thus, even if t out of t + 1 simulators crash, at least n − t simulated processes can still make progress. The simulation thus guarantees that at least n − t simulated processes accept infinitely many simulated steps.
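The blocking behavior described above comes from the agreement protocol run for each simulated step. A common choice in presentations of BG-simulation is safe agreement; the single-threaded toy model below is a sketch under simplifying assumptions (each method call is one atomic step, and the snapshot and the subsequent write in `propose_finish` are merged into one step). The class and method names are hypothetical. A simulator that stops between `propose_start` and `propose_finish` leaves the instance unresolved forever, which models how one crashed simulator can block one simulated process.

```python
class SafeAgreement:
    """Toy model of safe agreement among k simulators."""

    def __init__(self, k):
        self.value = [None] * k  # proposed value of each simulator
        self.level = [0] * k     # 0 = idle/backed off, 1 = unsafe, 2 = committed

    def propose_start(self, i, v):
        # Phase 1: announce the value at level 1 (the "unsafe" window).
        self.value[i] = v
        self.level[i] = 1

    def propose_finish(self, i):
        # Phase 2: take a snapshot; back off if someone already committed.
        snapshot = list(self.level)
        self.level[i] = 0 if 2 in snapshot else 2

    def resolve(self):
        # Undecided (None) while some simulator is still in its unsafe window.
        if 1 in self.level:
            return None
        committed = [i for i, lvl in enumerate(self.level) if lvl == 2]
        # The committed simulator with the smallest index determines the value.
        return self.value[committed[0]] if committed else None
```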

In the computational component of the reduction algorithm, the BG-simulation technique is used as follows. Let BG(A_G) denote the simulation protocol for processes q_1 and q_2 which allows them to simulate, in a wait-free manner, a 1-resilient execution of algorithm A_G for processes p_1, ..., p_n. The complete reduction algorithm thus employs a triple simulation (Figure 3): every process p_i simulates multiple runs of two processes q_1 and q_2 that use BG-simulation to produce a 1-resilient run of A_G on processes p_1, ..., p_n in which steps of the original algorithm A are periodically simulated using (ever-growing) DAGs G_i. (To avoid confusion, we use p̄_j to denote the process that models p_j in a run of A_G simulated by a “real” process p_i.)

Figure 3: Three levels of simulation: real processes p_1, ..., p_n simulate a system of two BG-simulators q_1 and q_2 that run BG(A_G) to simulate a 1-resilient run of A_G on p_1, ..., p_n.

We are going to use the following property, which is trivially satisfied by BG-simulation:

  (BG0) A run of BG-simulation in which every simulator takes infinitely many steps simulates a run in which every simulated process takes infinitely many steps.

3.5 Using consensus

The triple simulation we are going to employ faces one complication, though. The simulated runs of the asynchronous algorithm A_G may vary depending on which process is running the simulation. This is because G_1, ..., G_n are maintained by a parallel communication component (Figure 1), and a process simulating a step of p_j may perform a different number of cycles reading the current version of its DAG before a vertex with the desired properties is located (line 13 in Figure 2). Thus, the same sequence of steps of q_1 and q_2 simulated at different processes may result in different 1-resilient runs of A_G: waiting until a vertex [p_j, d, k] appears in G_i at process p_i may take a different number of local steps checking G_i, depending on the time when p_i executes the wait statement in line 13 of Figure 2.

  repeat
      ℓ := ℓ + 1
      if G_i contains [p_j, d, k] for some d then x := cons_ℓ.propose(1)
      else x := cons_ℓ.propose(0)
  until x = 1

Figure 4: Expanded line 13 of Figure 2: waiting until G_i includes a vertex [p_j, d, k] for some d. Here G_i is any DAG generated by the algorithm in Figure 1.

To resolve this issue, the wait statement is implemented using a series of consensus instances cons_1, cons_2, ... (Figure 4). If p_j is correct, then eventually each correct process will have a vertex [p_j, d, k] in its DAG and, thus, the code in Figure 4 is non-blocking, and Theorem 1 still holds. Furthermore, the use of consensus ensures that if a process, while simulating a step of p_j, went through ℓ iterations of the loop in Figure 4 before reaching line 14 in Figure 2, then every process simulating this very step does the same. Thus, a given sequence of steps of q_1 and q_2 will result in the same simulated 1-resilient run of A_G, regardless of when and where the simulation is taking place.
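The effect of routing the wait through consensus can be seen in a small sequential model. In the sketch below, `ConsensusInstances` is a hypothetical first-proposal-wins stand-in for cons_1, cons_2, ..., and `sees_vertex(ell)` models whether the local DAG G_i contains the awaited vertex at iteration ℓ. Processes with different views of their DAGs still exit the loop of Figure 4 after the same number of iterations, because they all adopt the decisions of the same instances.

```python
class ConsensusInstances:
    """Hypothetical stand-in for cons_1, cons_2, ...: instance ell decides the
    first value proposed to it (sequential model, no real concurrency)."""

    def __init__(self):
        self._decided = {}

    def propose(self, ell, v):
        return self._decided.setdefault(ell, v)


def wait_for_vertex(cons, sees_vertex):
    """Figure 4-style wait. sees_vertex(ell) tells whether the local DAG
    contains the awaited vertex at iteration ell. Returns the number of
    iterations after which the wait exits."""
    ell = 0
    while True:
        ell += 1
        x = cons.propose(ell, 1 if sees_vertex(ell) else 0)
        if x == 1:
            return ell
```

In the test, a process that sees the vertex only at iteration 10 and one that sees it immediately both exit at iteration 3, the first instance decided 1 by an earlier process.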

3.6 Extracting Ω

The computational component of the reduction algorithm is presented in Figure 5. In the component, every process p_i locally simulates multiple runs of a system of processes q_1 and q_2 that run algorithm BG(A_G) to produce a 1-resilient run of A_G (Figures 2 and 4). Recall that A_G, in its turn, simulates a run of the original algorithm A, using, instead of D, the values provided by an ever-growing DAG G_i. In simulating the part of A_G of process p_j presented in Figure 4, q_1 and q_2 count each access of a consensus instance as one local step of p_j that needs to be simulated. Also, in BG(A_G), when q_k is about to simulate the first step of p_j, q_k uses its own input value as an input value of p_j.

For each simulated state of BG(A_G), p_i periodically checks whether the corresponding state of A is deciding, i.e., whether some process has decided in the state of A simulated in it. As we show, eventually, the same infinite non-deciding 1-resilient run of A_G will be simulated by all correct processes, which allows for extracting the output of Ω.

The algorithm in Figure 5 explores solo extensions of q_1 and q_2 starting from growing prefixes. By property (BG0) of BG-simulation (Section 3.4), a run of BG(A_G) in which both q_1 and q_2 participate infinitely often simulates a run of A_G in which every p_j participates infinitely often, and, by Theorem 1, such a run produces a fair and thus deciding run of A. Thus, if there is an infinite non-deciding run simulated by the algorithm in Figure 2, it must be a run produced by a solo extension of q_1 or q_2 starting from some finite prefix.
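The exploration order of Figure 5 can be illustrated by a bounded analogue. In the sketch below, which is a toy rather than the reduction itself, `decided(schedule)` is a hypothetical predicate telling whether the run of A simulated under a schedule of simulator ids (1 for q_1, 2 for q_2) has decided. The parameters `horizon` (a cap on each solo extension) and `depth` (a cap on the recursion) are assumptions of the sketch, so a returned pair only indicates a solo extension that did not decide within the horizon; the update of the Ω output in lines 25–26 is omitted.

```python
def explore(decided, prefix=(), depth=3, horizon=50):
    """Bounded analogue of explore(I, σ) in Figure 5. Returns (prefix, q) for
    the first prefix and simulator q whose q-solo extension does not decide
    within `horizon` steps, or None if none is found up to `depth`."""
    for q in (1, 2):                  # line 21: solo extensions of q_1, q_2
        ext = ()
        while True:                   # lines 23-27
            ext += (q,)
            if decided(prefix + ext):
                break
            if len(ext) >= horizon:
                return prefix, q      # Figure 5 would stay in this loop forever
    if depth == 0:
        return None
    for q in (1, 2):                  # lines 28-29: one-step extensions
        found = explore(decided, prefix + (q,), depth - 1, horizon)
        if found is not None:
            return found
    return None
```

With a toy predicate under which deciding needs two steps of q_1, the q_2-solo extension of the empty prefix is reported as non-deciding.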

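  17  for all binary 2-vectors I do  {For all possible consensus inputs for q_1 and q_2}
  18      σ := the empty string
  19      explore(I, σ)

  20  function explore(I, σ)
  21      for all q_k ∈ {q_1, q_2} do
  22          σ' := empty string
  23          repeat
  24              σ' := σ' · q_k
  25              let p_j be the process that appears the least in [σ · σ']
  26              Ω-output_i := p_j
  27          until S(I, σ · σ') is decided
  28      explore(I, σ · q_1)
  29      explore(I, σ · q_2)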

Figure 5: Computational component of the reduction algorithm: code for each process p_i. Here S(I, σ) denotes the state of A reached by the partial run of A_G simulated in the partial run of BG(A_G) with schedule σ and input state I, and [σ] denotes the corresponding schedule of A_G.
Lemma 2

The algorithm in Figure 5 eventually forever executes lines 23–27.

Proof. Consider any run of the algorithm in Figures 1, 4 and 5. Let F be the failure pattern of that run. Let G be the infinite limit DAG approximated by the algorithm in Figure 1. By contradiction, suppose that lines 23–27 in Figure 5 never block p_i forever, i.e., every execution of the cycle in lines 23–27 terminates.

Suppose that for some initial I, the call of explore performed by p_i in line 19 never returns. Since the cycle in lines 23–27 in Figure 5 always terminates, there is an infinite sequence of recursive calls explore(I, σ_0), explore(I, σ_1), explore(I, σ_2), ..., where each σ_{m+1} is a one-step extension of σ_m. Thus, there exists an infinite never-deciding schedule σ̃ such that the run of BG(A_G) based on σ̃ and I produces a never-deciding run of A_G. Suppose that both q_1 and q_2 appear in σ̃ infinitely often. By property (BG0) of BG-simulation (Section 3.4), a run of BG(A_G) in which both q_1 and q_2 participate infinitely often simulates a run of A_G in which every p_j participates infinitely often, and, by Theorem 1, such a run produces a fair and thus deciding run of A, a contradiction.

Thus, if there is an infinite non-deciding run simulated by the algorithm in Figure 2, it must be a run produced by a solo extension of q_1 or q_2 starting from some finite prefix. Let σ be the first such prefix in the order defined by the algorithm in Figure 5 and q_k be the first process whose solo extension of σ is never deciding. Since the cycle in lines 23–27 always terminates, the recursive exploration of finite prefixes in lines 28 and 29 eventually reaches σ, and the algorithm reaches line 21 with σ and q_k. Then the succeeding cycle in lines 23–27 never terminates, a contradiction.

Thus, for all inputs I, the call of explore performed by p_i in line 19 returns. Hence, for every finite prefix σ, any solo extension of σ produces a finite deciding run of A_G. We establish a contradiction by deriving a wait-free algorithm that solves consensus among q_1 and q_2.

Let G be the infinite limit DAG constructed in Figure 1. Let K be a map from vertices of G to ℕ defined as follows: for each vertex v in G, K(v) is the value of variable ℓ at the moment when any run of A_G (produced by the algorithm in Figure 2) exits the cycle in Figure 4 while waiting until v appears in G_i. If there is no such run, K(v) is set to 0. Note that the use of consensus implies that if, in any simulated run of A_G, v has been found after ℓ iterations, then ℓ = K(v), i.e., K is well-defined.

Now we consider an asynchronous read-write algorithm A'_G that is defined exactly like A_G, but instead of going through the consensus invocations in Figure 4, performs K(v) local steps. Now consider the algorithm BG(A'_G) that is defined exactly as BG(A_G) except that in BG(A'_G), q_1 and q_2 BG-simulate runs of A'_G. For every sequence of steps of q_1 and q_2, the runs of BG(A_G) and BG(A'_G) agree on the sequence of steps of p_1, ..., p_n in the corresponding runs of A_G and A'_G, respectively. Moreover, they agree on the runs of A resulting from these runs of A_G and A'_G. This is because the difference between A_G and A'_G consists only in the local steps and does not affect the simulated state of A.

We say that a sequence σ of steps of q_1 and q_2 is deciding with I if, when started with I, the run of BG(A'_G) with schedule σ produces a deciding run of A. By our hypothesis, every eventually solo schedule is deciding for each input I. As we showed above, every schedule in which both q_1 and q_2 appear infinitely often is deciding by property (BG0) of BG-simulation. Thus, every schedule of BG(A'_G) is deciding for all inputs.

Consider the trees of all deciding schedules of BG(A'_G) for all possible inputs I. All these trees have finite branching (each vertex has at most 2 children) and finite paths. By König's lemma, the trees are finite. Thus, the set of vertices of G used by the runs of A simulated by deciding schedules of BG(A'_G) is also finite. Let G' be a finite subgraph of G that includes all vertices of G used by these runs.

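Finally, we obtain a wait-free consensus algorithm for q_1 and q_2 that works as follows. Each q_k runs BG(A'_G) (using the finite graph G') until a decision is obtained in the simulated run of A. At this point, q_k returns the decided value. Since BG(A'_G) produces only deciding runs of A, and each deciding run of A solves consensus for the inputs provided by q_1 and q_2, this read-write algorithm solves 2-process consensus wait-free, contradicting the impossibility of 2-process wait-free consensus [7, 13].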

Theorem 3

In all environments E, if a failure detector D can be used to solve consensus in E, then Ω is weaker than D in E.

Proof. Consider any run of the algorithm in Figures 1, 4 and 5 with failure pattern F.

By Lemma 2, at some point, every correct process gets stuck in lines 23–27 simulating longer and longer q_k-solo extensions of some finite schedule σ with input I. Since processes use a series of consensus instances to simulate runs of A_G in exactly the same way, the correct processes eventually agree on σ, q_k and I.

Let ρ be the sequence of process identifiers in the 1-resilient execution of A_G simulated by q_1 and q_2 in the schedule σ · q_k^ω (σ followed by infinitely many steps of q_k) with input I. Since a 2-process BG-simulation produces a 1-resilient run of A_G, at least n − 1 simulated processes in Π appear in ρ infinitely often. Let P (|P| ≥ n − 1) be the set of such processes.

Now we show that exactly one correct (in ) process appears in only finitely often. Suppose not, i.e., . By Theorem 1, the run of simulates a fair run of , and thus the run must be deciding, a contradiction. Since , exactly one process appears in the run of only finitely often. Moreover, this process is correct.

Thus, eventually, the correct processes in stabilize at simulating longer and longer prefixes of the same infinite non-deciding 1-resilient run of . Eventually, the same correct process will be observed to take the least number of steps in the run and will be output in line 5; thus the output of Ω is extracted.
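A minimal sketch of the extraction step, assuming the simulated run is given as a finite prefix of process identifiers 0..n−1 and that ties are broken by the smallest identifier (the tie-breaking rule is an assumption, not taken from the paper). The hypothetical function omega_output returns the process with the fewest steps in the prefix; if exactly one process appears only finitely often, the output on growing prefixes eventually stabilizes on that process.

```python
from collections import Counter
from typing import Sequence


def omega_output(prefix: Sequence[int], n: int) -> int:
    """Hypothetical extraction rule: return the simulated process (id in 0..n-1)
    with the fewest steps in the prefix; ties go to the smallest id (assumed)."""
    counts = Counter(prefix)  # ids that never appear count as 0
    return min(range(n), key=lambda p: (counts[p], p))


# Process 2 takes three steps and then stops; processes 0 and 1 alternate forever.
run = [2, 0, 2, 1, 2] + [0, 1] * 10
outputs = [omega_output(run[:t], 3) for t in range(len(run) + 1)]
print(outputs[10], outputs[11])  # 1 2
```

At prefix length 10, processes 1 and 2 are tied with three steps each and the tie goes to 1; from prefix length 11 on, both 0 and 1 have taken more steps than 2, so the output stays 2 forever.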

4 Related Work

Chandra et al. derived the first “weakest failure detector” result in their fundamental paper [4], by showing that Ω is necessary to solve consensus in the message-passing model. The result was later generalized to the read-write shared-memory model [12, 10].¹

¹ The result for shared memory was stated in [12], but the only published proof of it appears in [10].

The technique presented in this paper builds on two fundamental results. The first is the celebrated BG-simulation [1, 3], which allows k+1 processes to simulate, in a wait-free manner, a k-resilient run of any n-process asynchronous algorithm. The second is a brilliant observation made by Zieliński [15] that any run of an algorithm using a failure detector induces an asynchronous algorithm that simulates (possibly unfair) runs of . The recursive structure of the algorithm in Figure 5 is also borrowed from [15]. Unlike [15], however, the reduction algorithm of this paper assumes the conventional read-write memory model without using immediate snapshots [2]. Also, instead of the growing “precedence” and “detector” maps of [15], this paper uses directed acyclic graphs à la [4]. A (slightly outdated) survey of the literature on failure detectors is presented in [FGK11].

5 Concluding Remarks

This paper presents another proof that Ω is the weakest failure detector to solve consensus in read-write shared-memory models. The proof applies a novel reduction technique and is based on the very fact that wait-free 2-process consensus is impossible, unlike the original technique of [4], which partially rehashes elements of the consensus impossibility proof.

A related problem is determining the weakest failure detector for a generalization of consensus, k-set agreement, in which processes have to decide on at most k distinct proposed values. The weakest failure detector for 1-set agreement (consensus) is Ω. For (n−1)-set agreement in a system of n processes (sometimes called simply set agreement in the literature), it is anti-Ω, a failure detector that outputs, when queried, a process identifier, so that some correct process identifier is output only finitely many times [Zie10].

Finally, the general case of k-set agreement was resolved using an elaborated and extended version of the technique proposed in this paper [GK11]. Intuitively, BG simulation allows k+1 processes to simulate a k-resilient run of any asynchronous algorithm, and, generalizing the technique described in this paper, we can derive an infinite non-deciding k-resilient run of . At least one correct process appears only finitely often in (otherwise, the run would be deciding). Thus, a failure detector that periodically outputs the latest processes in growing prefixes of guarantees that eventually some correct process is never output. It can easily be shown that this information about failures is sufficient to solve k-set agreement. For k > 1, we cannot use consensus to make sure that correct processes simulate runs of in exactly the same way, regardless of how their local DAGs evolve. Therefore, our generalized reduction algorithm employs a slightly more sophisticated “eventual agreement” mechanism to make sure that the simulation converges.
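The failure detector described above can be sketched as follows. The number of processes to output is elided in the text, so m here is a hypothetical parameter; the sketch assumes that at least m processes appear infinitely often in the run, in which case every process that stops appearing is eventually never output. The hypothetical function latest_processes returns the m distinct identifiers whose most recent steps are the latest in the given prefix.

```python
from typing import List, Sequence, Set


def latest_processes(prefix: Sequence[int], m: int) -> Set[int]:
    """Hypothetical detector output: the m distinct process ids whose most
    recent steps are the latest in the prefix (fewer if fewer ids appear)."""
    if m <= 0:
        return set()
    latest: List[int] = []
    for p in reversed(prefix):  # scan from the most recent step backwards
        if p not in latest:
            latest.append(p)
            if len(latest) == m:
                break
    return set(latest)


# Process 3 takes two steps and stops; processes 0, 1 and 2 keep taking steps.
run = [3, 0, 3, 1, 2] + [0, 1, 2] * 5
outputs = [latest_processes(run[:t], 2) for t in range(1, len(run) + 1)]
print(outputs[3], outputs[4])  # {1, 3} {1, 2}
```

Process 3 is still output for the prefix of length 4, but once two other processes have stepped after its last step, it never appears in the output again.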

References

  • [1] Elizabeth Borowsky and Eli Gafni. Generalized FLP impossibility result for t-resilient asynchronous computations. In STOC, pages 91–100. ACM Press, May 1993.
  • [2] Elizabeth Borowsky and Eli Gafni. Immediate atomic snapshots and fast renaming. In PODC, pages 41–51, New York, NY, USA, 1993. ACM Press.
  • [3] Elizabeth Borowsky, Eli Gafni, Nancy A. Lynch, and Sergio Rajsbaum. The BG distributed simulation algorithm. Distributed Computing, 14(3):127–146, 2001.
  • [4] Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685–722, July 1996.
  • [5] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, March 1996.
  • [6] Wei Chen, Jialin Zhang, Yu Chen, and Xuezheng Liu. Weakening failure detectors for k-set agreement via the partition approach. In DISC, pages 123–138, 2007.
  • [7] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.
  • [8] Eli Gafni and Petr Kuznetsov. The weakest failure detector for solving k-set agreement. In submission, February 2009.
  • [9] Rachid Guerraoui, Maurice Herlihy, Petr Kouznetsov, Nancy A. Lynch, and Calvin C. Newport. On the weakest failure detector ever. In PODC, pages 235–243, August 2007.
  • [10] Rachid Guerraoui and Petr Kouznetsov. Failure detectors as type boosters. Distributed Computing, 20(5):343–358, 2008.
  • [11] Prasad Jayanti and Sam Toueg. Every problem has a weakest failure detector. In PODC, pages 75–84, 2008.
  • [12] Wai-Kau Lo and Vassos Hadzilacos. Using failure detectors to solve consensus in asynchronous shared memory systems. In WDAG, LNCS 857, pages 280–295, September 1994.
  • [13] M.C. Loui and H.H. Abu-Amara. Memory requirements for agreement among unreliable asynchronous processes. Advances in Computing Research, 4:163–183, 1987.
  • [14] Michel Raynal and Corentin Travers. In search of the holy grail: Looking for the weakest failure detector for wait-free set agreement. In OPODIS, pages 3–19, 2006.
  • [15] Piotr Zieliński. Anti-omega: the weakest failure detector for set agreement. In PODC, August 2008.