Self-Stabilizing Snapshot Objects for Asynchronous Fail-Prone Network Systems (preliminary report)

(This document was the Appendix of our accepted submissions to PODC 2019 and NETYS 2019.)

Chryssis Georgiou (Department of Computer Science, University of Cyprus, Nicosia, Cyprus. E-mail: chryssis@cs.ucy.ac.cy)    Oskar Lundström (Department of Engineering and Computer Science, Chalmers University of Technology, Gothenburg, SE-412 96, Sweden. E-mail: osklunds@student.chalmers.se)    Elad M. Schiller (Department of Engineering and Computer Science, Chalmers University of Technology, Gothenburg, SE-412 96, Sweden. E-mail: elad@chalmers.se)
Abstract

A snapshot object simulates the behavior of an array of single-writer/multi-reader shared registers that can be read atomically. Delporte-Gallet et al. proposed two fault-tolerant algorithms for snapshot objects in asynchronous crash-prone message-passing systems. Their first algorithm is non-blocking; it allows snapshot operations to terminate once all write operations have ceased. It uses $\mathcal{O}(n)$ messages of $\mathcal{O}(n \cdot \nu)$ bits, where $n$ is the number of nodes and $\nu$ is the number of bits it takes to represent the object. Their second algorithm allows snapshot operations to always terminate, independently of write operations. It incurs $\mathcal{O}(n^2)$ messages.

The fault model of Delporte-Gallet et al. considers both node failures (crashes) and communication failures (packet omission, duplication, and reordering). We aim at the design of even more robust snapshot objects. We do so through the lenses of self-stabilization—a very strong notion of fault-tolerance. In addition to Delporte-Gallet et al.’s fault model, a self-stabilizing algorithm can recover after the occurrence of transient faults; these faults represent arbitrary violations of the assumptions according to which the system was designed to operate (as long as the code stays intact).

In particular, in this work, we propose self-stabilizing variations of Delporte-Gallet et al.’s non-blocking algorithm and always-terminating algorithm. Our algorithms have communication costs that are similar to those of Delporte-Gallet et al. and $\mathcal{O}(1)$ recovery time (in terms of asynchronous cycles) from transient faults. The main differences are that our proposal considers repeated gossiping of $\mathcal{O}(\nu)$-bit messages and deals with bounded space (which is a prerequisite for self-stabilization). Lastly, we explain how to extend the proposed solutions to reconfigurable ones.


1 Introduction

We propose self-stabilizing implementations of shared memory snapshot objects for asynchronous networked systems whose nodes may fail-stop.

Context and Motivation.   Shared registers are fundamental objects that facilitate synchronization in distributed systems. In the context of networked systems, they provide a higher abstraction level than simple end-to-end communication; they offer persistent and consistent distributed storage that can simplify the design and analysis of dependable distributed systems. Snapshot objects extend shared registers. They further ease the design and analysis of algorithms that base their implementation on shared registers, because they allow an algorithm to construct consistent global states of the shared storage in a way that does not disrupt the system computation. Their efficient and fault-tolerant implementation is a fundamental problem, as there are many examples of algorithms that are built on top of snapshot objects; see textbooks such as [33, 32], as well as recent reviews, such as [31].

Task description.   Consider a fault-tolerant distributed system of $n$ asynchronous nodes that are prone to failures. Their interaction is based on the emulation of Single-Writer/Multi-Reader (SWMR) shared registers over a message-passing communication system. Snapshot objects can read the entire array of system registers [2, 4]. The system lets each node update its own register via write() operations and retrieve the value of all shared registers via snapshot() operations. Note that these snapshot operations may occur concurrently with the write operations that individual nodes perform. We are particularly interested in the study of atomic snapshot objects that are linearizable [23]: the write() and snapshot() operations appear as if they have been executed instantaneously, one after the other (in other words, they appear to preserve real-time ordering).
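To make this interface concrete, the following minimal sketch (in Python; the class and method names are our own illustration and not part of [12] or of our algorithms) shows the operations that the emulation exposes to each node.

\begin{verbatim}
from typing import Any, List, Optional

class SnapshotObject:
    """Array of n single-writer/multi-reader registers, read atomically."""

    def __init__(self, n: int) -> None:
        self.n = n

    def write(self, i: int, v: Any) -> None:
        """Node p_i updates its own register (entry i) with value v."""
        raise NotImplementedError  # provided by the message-passing emulation

    def snapshot(self) -> List[Optional[Any]]:
        """Returns an atomic view of all n registers."""
        raise NotImplementedError  # provided by the message-passing emulation
\end{verbatim}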

Fault Model.   We consider an asynchronous message-passing system that has no guarantees on the communication delay. Moreover, there is no notion of global (or universal) clocks and we do not assume that the algorithm can explicitly access the local clock (or timeout mechanisms). Our fault model includes fail-stop failures of nodes and communication failures, such as packet omission, duplication, and reordering. In addition to the failures captured in our model, we also aim to recover from transient faults, i.e., any temporary violation of assumptions according to which the system and network were designed to behave, e.g., the corruption of control variables, such as the program counter and operation indices, which are responsible for the correct operation of the studied system, or operational assumptions, such as that at least half of the system nodes never fail. Since the occurrence of these failures can be combined, we assume that these transient faults can alter the system state in unpredictable ways. In particular, when modeling the system, we assume that these violations bring the system to an arbitrary state from which a self-stabilizing algorithm should recover the system. Therefore, starting from an arbitrary state, the correctness proof of self-stabilizing systems [13] has to demonstrate the return to a “correct behavior” within a bounded period, which brings the system to a legitimate state. The complexity measure of self-stabilizing systems is the length of the recovery period.

As transient faults can occur at any point in a system’s lifetime, self-stabilizing systems need to keep communicating their state structures for cleaning any potentially corrupted (stale) information; in this respect, a self-stabilizing system cannot really terminate [14, Chapter 2.3]. Specifically, the proposed solution repeatedly broadcasts $\mathcal{O}(\nu)$-size gossip messages that facilitate the system clean-up from stale information, where $\nu$ is the number of bits it takes to represent the object. We note the trade-off between the cost related to these gossip messages and the recovery time. That is, one can balance this trade-off by, for example, reducing the rate of gossip messages, which prolongs the stabilization time. We clarify that the rate at which these repeated clean-up operations take place does not impact the execution time of the write() and snapshot() operations.

Related work.   We follow the design criteria of self-stabilization, which was proposed by Dijkstra [13] and detailed in [14]. We now overview existing work related to ours. Our review does not focus on algorithms for shared memory systems, although there are examples of both non-self-stabilizing [26, 25] and self-stabilizing [1] solutions.

Shared registers emulation in message-passing systems:   Attiya et al. [7] implemented SWMR atomic shared memory in an asynchronous networked system. They assume that a majority of the nodes do not crash or get disconnected. Their work builds on this assumption in the following manner: any majority subset of the system nodes includes at least one non-failing node; thus, any two majority subsets of the system nodes have a non-empty intersection. They show that if a majority of the nodes acknowledge an update to the shared register, then that update can safely be considered visible to all non-failing nodes that retrieve the latest update from a majority of nodes. Attiya et al. also show that this assumption is essential for solvability. Their seminal work has many generalizations and applications [6]. The literature includes a large number of simulations of shared registers for networked systems, which differ in their fault tolerance properties, time complexity, storage costs, and system properties, e.g., [5, 20, 22, 28, 29, 30].

In the context of self-stabilization, the literature includes a practically-self-stabilizing variation of the work of Attiya et al. [7] by Alon et al. [3]. Their proposal guarantees wait-free recovery from transient faults. However, there is no bound on the recovery time. Dolev et al. [19] consider MWMR atomic storage that is wait-free in the absence of transient faults. They guarantee a bounded-time recovery from transient faults in the presence of a fair scheduler. They demonstrate the algorithm’s ability to recover from transient faults using unbounded counters and in the presence of fair scheduling. Then they deal with the event of integer overflow via a consensus-based procedure. Since integer variables can hold a very large number of values in practice, their algorithm seldom uses this non-wait-free procedure for dealing with integer overflows. In fact, they model integer overflow events as transient faults, which implies bounded recovery time from transient faults in the seldom presence of a fair scheduler (using bounded memory). They call these systems self-stabilizing systems in the presence of seldom fairness. Our work adopts these design criteria. We also make use of their self-stabilizing quorum and gossip services [19, Section 13].

Implementing a snapshot object on top of a message-passing system:   A straightforward way for implementing snapshot objects is to consider a layer of SWMR atomic registers emulated in a networked system. This way we can run on top of this layer any algorithm for implementing a snapshot object for a system with shared variables. Delporte-Gallet et al. [12] avoid this composition, obtaining, in this way, a more efficient implementation with respect to the communication costs. Specifically, they claim that when stacking the shared-memory atomic snapshot algorithm of [2] on the shared-memory emulation of [7] (with some improvements), the number of messages per snapshot operation is $\mathcal{O}(n^2)$ and it takes four round trips. Their proposal, instead, takes $\mathcal{O}(n)$ messages per snapshot operation and just one round trip to complete. The algorithms we propose in the present work follow the non-stacking approach of Delporte-Gallet et al. and they have the same communication costs for write and snapshot operations. Moreover, they tolerate any failure (in any communication or operation invocation pattern) that [12] can. Furthermore, our algorithms deal with transient faults by periodically removing stale information. To that end, the algorithms broadcast gossip messages of $\mathcal{O}(\nu)$ bits, where $\nu$ is the number of bits it takes to represent the object.

In the context of self-stabilization, there exist algorithms for the propagation of information with feedback, e.g., [11], that can facilitate the implementation of snapshot objects that can recover from transient faults, but not from node failures. For the sake of clarity, we note that “stacking” of self-stabilizing algorithms for asynchronous systems is not a straightforward process (since the existing “stacking” techniques require schedule fairness, see [14, Section 2.7]). Moreover, we are unaware of any attempt in the literature to stack a self-stabilizing shared-memory atomic snapshot algorithm (such as the weak snapshots algorithm of Abraham [1]) over a self-stabilizing shared-memory emulation, such as the one of Dolev et al. [18].

Our Contributions.   We present an important module for dependable distributed systems: self-stabilizing algorithms for snapshot objects in networked systems. To the best of our knowledge, we are the first to provide a broad fault model that includes both node failures and transient faults. Specifically, we advance the state of the art as follows:

  1. As a first contribution, we offer a self-stabilizing variation of the non-blocking algorithm presented by Delporte-Gallet et al. [12]. Their solution tolerates node failures as well as packet omission, duplication, and reordering. Each snapshot or write operation uses $\mathcal{O}(n)$ messages of $\mathcal{O}(n \cdot \nu)$ bits, where $n$ is the number of nodes and $\nu$ is the number of bits for encoding the object. The termination of a snapshot operation depends on the assumption that the invocation of write operations eventually ceases.

    Our solution broadens the set of failure types it can tolerate, since it can also recover after the occurrence of transient faults, which model any violation of the assumptions according to which the system was designed to operate (as long as the code stays intact). We increase the communication costs slightly by using gossip messages of $\mathcal{O}(\nu)$ bits, where $\nu$ is the number of bits it takes to represent the object.

  2. Our second contribution offers a self-stabilizing all-operation always-terminating variation of the snapshot-only always-terminating algorithm presented by Delporte-Gallet et al. [12]. Our algorithm (i) can recover from transient faults, and (ii) guarantees that both write and snapshot operations always terminate (regardless of the invocation patterns of any operation).

    We achieve this by choosing to use safe registers for storing the results of recent snapshot operations, rather than a reliable broadcast mechanism, which often has higher communication costs. Moreover, instead of dealing with one snapshot task at a time, we take care of several at a time. We also consider an input parameter, $\delta$. For the case of $\delta = 0$, our self-stabilizing algorithm guarantees an always-terminating behavior in a way that resembles the non-self-stabilizing algorithm by Delporte-Gallet et al. [12], which blocks all write operations upon the invocation of any snapshot operation, at the cost of $\mathcal{O}(n^2)$ messages per snapshot operation. For the case of $\delta > 0$, our solution aims at using $\mathcal{O}(n)$ messages per snapshot operation while monitoring the number of concurrent write operations. Once our algorithm notices that a snapshot operation runs concurrently with at least $\delta$ write operations, it blocks all write operations and uses $\mathcal{O}(n^2)$ messages for completing the snapshot operations.

    Thus, the proposed algorithm can trade communication costs with an $\mathcal{O}(\delta)$ bound on snapshot operation latency. Moreover, between any two consecutive periods in which snapshot operations block the system for write operations, the algorithm guarantees that at least $\delta$ write operations occur.

  3. The two proposed algorithms presented in sections 4 and 5 consider unbounded counters. In Section 6, we explain how to bound these counters as well as how to extend our solutions to reconfigurable ones.

Organization. We state our system settings in Section 2. We review the non-self-stabilizing solutions by Delporte-Gallet et al. [12] in Section 3. Our self-stabilizing non-blocking and always-terminating algorithms are proposed in Sections 4 and 5, respectively; they consider unbounded counters. We explain how to bound the counters of the proposed self-stabilizing algorithms in Section 6. We conclude in Section 7.

2 System settings

We consider an asynchronous message-passing system that has no guarantees on the communication delay. Moreover, there is no notion of global (or universal) clocks and we do not assume that the algorithm can explicitly access the local clock (or timeout mechanisms). The system consists of $n$ failure-prone nodes (or processors) $p_1, \ldots, p_n$, whose identifiers are unique and totally ordered; we denote the set of nodes by $\mathcal{P}$.

2.1 Communication model

The network topology is that of a fully-connected graph, and any pair of nodes $p_i, p_j \in \mathcal{P}$ has access to a bidirectional communication channel that, at any time, holds at most $capacity \in \mathbb{N}$ packets. Every two nodes exchange (low-level messages called) packets to permit delivery of (high-level) messages. When node $p_i$ sends a packet, $m$, to node $p_j$, the operation inserts a copy of $m$ into $channel_{i,j}$, while respecting the upper bound $capacity$ on the number of packets in the channel. In case $channel_{i,j}$ is full, i.e., $|channel_{i,j}| = capacity$, the sending side simply overwrites some message in $channel_{i,j}$. When $p_j$ receives $m$ from $p_i$, the system removes $m$ from $channel_{i,j}$. As long as $m \in channel_{i,j}$, we say that $m$ is in transit from $p_i$ to $p_j$.
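The following toy model is a sketch of these channel semantics under our assumptions; the names Channel and CAPACITY are illustrative and not part of the formal model.

\begin{verbatim}
import random

CAPACITY = 3  # upper bound on the number of packets a channel may hold

class Channel:
    """Directed channel from p_i to p_j with at most CAPACITY packets."""

    def __init__(self) -> None:
        self.in_transit = []

    def send(self, packet) -> None:
        if len(self.in_transit) < CAPACITY:
            self.in_transit.append(packet)
        else:
            # the channel is full: overwrite some packet that is in transit
            self.in_transit[random.randrange(CAPACITY)] = packet

    def receive(self):
        # delivery removes the packet; the environment may still omit,
        # duplicate, or reorder packets
        return self.in_transit.pop(0) if self.in_transit else None
\end{verbatim}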

2.2 Execution model

Our analysis considers the interleaving model [14], in which the node’s program is a sequence of (atomic) steps. Each step starts with an internal computation and finishes with a single communication operation, i.e., a message send or receive.

The state, $s_i$, of node $p_i$ includes all of $p_i$’s variables as well as the set of all incoming communication channels. Note that $p_i$’s step can change $s_i$ as well as remove a message from $channel_{j,i}$ (upon message arrival) or add a message to $channel_{i,j}$ (when a message is sent). The term system state refers to a tuple of the form $c = (s_1, s_2, \ldots, s_n)$ (system configuration), where each $s_i$ is $p_i$’s state (including messages in transit to $p_i$). We define an execution (or run) $R = c_0, a_0, c_1, a_1, \ldots$ as an alternating sequence of system states $c_x$ and steps $a_x$, such that each system state $c_{x+1}$, except for the starting one, $c_0$, is obtained from the preceding system state $c_x$ by the execution of step $a_x$.

Let $R'$ and $R''$ be a prefix and, respectively, a suffix of $R$, such that $R'$ is a finite sequence, which starts with a system state and ends with a step $a_x$, and $R''$ is an unbounded sequence, which starts in the system state that immediately follows step $a_x$ in $R$. In this case, we use $\circ$ as the operator to denote that $R = R' \circ R''$ concatenates $R'$ with $R''$.

2.3 Fault model

We model a failure as a step that the environment takes rather than the algorithm. We consider failures that can and cannot cause the system to deviate from fulfilling its task (Figure 1). The set of legal executions ($LE$) refers to all the executions in which the requirements of the task hold. In this work, $T$ denotes our studied task of snapshot object emulation and $LE$ denotes the set of executions in which the system fulfills $T$’s requirements. We say that a system state $c$ is legitimate when every execution that starts from $c$ is in $LE$. When a failure cannot cause the system execution (that starts in a legitimate state) to leave the set $LE$, we refer to that failure as a benign one. We refer to any temporary violation of the assumptions according to which the system was designed to operate (as long as the program code stays intact) as transient faults. Self-stabilizing algorithms deal with benign failures (while fulfilling the task requirements) and they can also recover, within a bounded period, after the occurrence of transient faults.

                Frequency
Duration        Rare                                               Not rare
Transient       Any violation of the assumptions according to     Packet failures: omissions,
                which the system is assumed to operate (as long   duplications, reordering
                as the code stays intact). This can result in     (assuming communication
                any state corruption.                             fairness holds).
Permanent                                                          Fail-stop failures.

Figure 1: The table above details our fault model and the chart illustrates when each fault set is relevant. The chart’s gray shapes represent the system execution, and the white boxes specify the failures considered to be possible at different execution parts and recovery guarantees of the proposed self-stabilizing algorithm. The set of benign faults includes both packet failures and fail-stop failures.

2.3.1 Benign failures

The algorithmic solutions that we consider are oriented towards asynchronous message-passing systems and thus they are oblivious to the times at which the packets arrive and depart (and require no explicit access to clock-based mechanisms, which may or may not be used by the system’s underlying mechanisms, say, for congestion control at the end-to-end protocol).

Communication fairness.  

Recall that we assume that the communication channel handles packet failures, such as omission, duplication, and reordering (Section 2.1). We consider standard terms for characterizing node failures [21]. A crash failure considers the case in which a node stops taking steps forever and there is no way to detect this failure. A fail-stop failure considers the case in which a node stops taking steps and there is a way to detect this failure, say, using unreliable failure detectors [10]. We say that a failing node resumes when it returns to take steps without restarting its program — the literature sometimes refers to this as an undetectable restart. The case of a detectable restart allows the node to restart all of its variables. We assume that if $p_i$ sends a message infinitely often to $p_j$, node $p_j$ receives that message infinitely often. We refer to the latter as the fair communication assumption. For example, the proposed algorithm sends messages infinitely often from any processor to any other. Despite the possible loss of messages, the communication fairness assumption implies that every processor receives messages infinitely often from any non-failing processor. Note that fair communication provides no bound on the channel communication delays. It merely says that a message is received within some finite time if its sender does not stop sending it (until that sender receives the acknowledgment for that message). We note that without the communication fairness assumption, the communication channel between any two correct nodes eventually becomes non-functional.

Node failure.  

We assume that the failure of a node implies that it stops sending and receiving messages (and it also stops executing any other step) without any warning. We assume that the number of failing nodes is bounded by $f$ and that $n > 2f$, for the sake of guaranteeing correctness [27]. In the absence of transient faults, failing nodes can simply crash (or fail-stop and then resume at some arbitrary time), as in Delporte-Gallet et al. [12]. In the presence of transient faults, we assume that failing nodes resume within some unknown finite time. The latter assumption is needed only for recovering from transient faults; we bring more details in Section 2.6. In Section 6 we discuss how to relax this assumption.

2.3.2 Transient faults

As already mentioned, we consider arbitrary violations of the assumptions according to which the system and the communication network were designed to operate. We refer to these violations and deviations as transient faults and assume that they can corrupt the system state arbitrarily (while keeping the program code intact). The occurrence of a transient fault is rare. Thus, we assume that transient faults occur before the system execution starts [14]. Moreover, they leave the system to start in an arbitrary state.

2.4 The snapshot object task

The task of snapshot object emulation requires the fulfilment of two properties: termination and linearizability. The definition of these two terms is based on the term event, which we define next, before defining the term history that is needed for the definition of linearizability.

Events:   Let $op$ be a write() or snapshot() operation. The execution of an operation $op$ by a processor $p_i$ is modeled by two steps: the invocation step, denoted by $inv(op)$, which calls the operation, and a response event, denoted by $resp(op)$ (termination), which occurs when $p_i$ terminates (completes) the operation. For the sake of simple presentation, by event we refer to either an operation’s start step or an operation’s end step.

Effective operations:   We say that a snapshot() operation is effective when the invoking processor does not fail during the operation’s execution. We say that a write() operation is effective when the invoking processor does not fail during its execution, or, in case it does fail, the operation’s effect is returned by an effective snapshot operation.

Histories:   A history is a sequence of operation start and end steps that are totally ordered. We consider histories to compare, in an abstract way, two executions of the studied algorithms. Given any two events $e$ and $e'$, we write $e < e'$ if $e$ occurs before $e'$ in the corresponding history. A history is denoted by $H = (E, <)$, where $E$ is the set of events. Given an infinite history $H$, we require that: (i) its first event is an invocation and (ii) each invocation is followed by its matching response event. If $H$ is finite, then $H$ might not contain the matching response event of the last invocation event.

Linearizable snapshot history:   A snapshot-based history models a computation at the abstraction level at which the write and snapshot operations are invoked. It is linearizable if there is an equivalent sequential history in which the sequence of effective write() and snapshot() operations issued by the processes is such that:

  1. Each effective operation appears as executed at a single point of the timeline between its invocation event and its response event, and

  2. Each effective snapshot() operation returns an array $reg$ such that: (i) $reg[j]$ holds the value of the last effective write() operation by $p_j$ that appears previously in the sequence; (ii) otherwise, $reg[j]$ holds the initial default value $\bot$. (The sketch below illustrates this condition on sequential histories.)
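The following checker is a minimal sketch of the condition that the equivalent sequential history must satisfy; the tuple-based encoding of histories is an assumption we introduce for illustration only.

\begin{verbatim}
from typing import Any, List, Optional

def sequential_history_is_valid(n: int, history: List[tuple]) -> bool:
    """history items: ('write', j, v) or ('snapshot', returned_array)."""
    last: List[Optional[Any]] = [None] * n   # None stands for the default value
    for op in history:
        if op[0] == 'write':
            _, j, v = op
            last[j] = v
        else:  # snapshot: must return, per entry, the last preceding write
            if list(op[1]) != last:
                return False
    return True

# write(0,'a'); snapshot() -> ['a', None]; write(1,'b') is a valid sequence
assert sequential_history_is_valid(
    2, [('write', 0, 'a'), ('snapshot', ['a', None]), ('write', 1, 'b')])
\end{verbatim}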

2.5 Dijkstra’s self-stabilization criterion

An algorithm is self-stabilizing with respect to the task $LE$ when every (unbounded) execution $R$ of the algorithm reaches, within a bounded period, a suffix $R_{legal}$ that is legal. That is, Dijkstra [13] requires that every execution $R$ has the form $R = R' \circ R_{legal}$ with $R_{legal} \in LE$, where the length of $R'$ is the complexity measure, which we refer to as the recovery time (others call it the stabilization time). We say that a system execution is fair when every step that is applicable infinitely often is executed infinitely often and fair communication is kept. Self-stabilizing algorithms often assume that $R$ is a fair execution. Wait-free algorithms guarantee that non-failing operations always become complete (within a finite number of steps) even in the presence of benign failures. Note that fair executions do not consider fail-stop failures (unless these failures were detected by the system, which then excluded the failing nodes from reentering the system, as in [16]). Therefore, we cannot demonstrate that an algorithm is wait-free by assuming that the system execution is always fair.

2.6 Self-stabilization in the presence of seldom fairness

As a variation of Dijkstra’s self-stabilization criterion, Dolev et al. [19] proposed design criteria in which (i) any execution $R = R_{recoveryPeriod} \circ R'$, which starts in an arbitrary system state and has a prefix ($R_{recoveryPeriod}$) that is fair, reaches a legitimate system state within the bounded prefix $R_{recoveryPeriod}$. (Note that the legal suffix $R'$ is not required to be fair.) Moreover, (ii) any execution $R = R'' \circ R_{globalReset} \circ R'''$ in which the prefix $R''$ is legal, and not necessarily fair but includes at most $\mathcal{N}$ write or snapshot operations, has a suffix, $R_{globalReset} \circ R'''$, such that $R_{globalReset}$ is required to be fair and bounded in length but might permit the violation of liveness requirements, i.e., a bounded number of operations might be aborted (as long as the safety requirement holds). Furthermore, $R'''$ is legal and not necessarily fair but includes at least $\mathcal{N}$ write or snapshot operations before the system reaches another $R_{globalReset}$. Since we can choose $\mathcal{N}$ to be a very large value, say $2^{64}$, and the occurrence of transient faults is rare, we refer to the proposed criteria as ones for self-stabilizing systems whose execution fairness is unrequired except for seldom periods.

2.7 Complexity Measures

The main complexity measure of self-stabilizing systems is the time it takes the system to recover after the occurrence of the last transient fault. In detail, in the presence of seldom fairness, this complexity measure considers the maximum of two values: (i) the maximum length of $R_{recoveryPeriod}$, which is the period during which the system recovers after the occurrence of transient failures, and (ii) the maximum length of $R_{globalReset}$. We consider systems that use a bounded amount of memory and thus, as a secondary complexity measure, we bound the memory that each node needs to have. However, the number of messages sent during an execution does not have immediate relevance in the context of self-stabilization, because self-stabilizing systems never stop sending messages [14, Chapter 3.3]. Next, we present the definitions, notations and assumptions related to the main complexity measure.

2.7.1 Message round-trips and iterations of self-stabilizing algorithms

The correctness proof depends on the nodes’ ability to exchange messages during the periods of recovery from transient faults. The proposed solution considers quorum-based communications that follow the pattern of request-reply as well as gossip messages for which the algorithm does not wait for any reply. The proof uses the notion of a message round-trip for the cases of request-reply messages as well as the term algorithm iteration.

We give a detailed definition of round-trips as follows. Let $p_i$ be a node and $p_j$ be a network node. Suppose that immediately after state $c$, node $p_i$ sends a message $m$ to $p_j$, for which $p_i$ awaits a reply. At state $c'$, which follows state $c$, node $p_j$ receives message $m$ and sends a reply message $r_m$ to $p_i$. Then, at state $c''$, which follows state $c'$, node $p_i$ receives $p_j$’s response, $r_m$. In this case, we say that $p_i$ has completed with $p_j$ a round-trip of message $m$.

Self-stabilizing algorithms cannot terminate their execution and stop sending messages [14, Chapter 2.3]. Moreover, their code includes a do forever loop. Thus, we define a complete iteration of a self-stabilizing algorithm. Let $P_i$ be the set of nodes with whom $p_i$ completes a message round trip infinitely often in execution $R$. Moreover, assume that node $p_i$ sends a gossip message infinitely often to $p_j \in \mathcal{P}$ (regardless of the message payload). Suppose that immediately after the state $c_{begin}$, node $p_i$ takes a step that includes the execution of the first line of the do forever loop, and immediately after system state $c_{end}$, it holds that: (i) $p_i$ has completed the iteration it has started immediately after $c_{begin}$ (regardless of whether it enters branches), (ii) every request-reply message that $p_i$ has sent to any node $p_j \in P_i$ during that iteration (which has started immediately after $c_{begin}$) has completed its round trip, and (iii) the iteration includes the arrival of at least one gossip message from $p_i$ to any non-failing $p_j \in \mathcal{P}$. In this case, we say that $p_i$’s iteration (with round-trips) starts at $c_{begin}$ and ends at $c_{end}$.

2.7.2 Asynchronous cycles

We measure the time between two system states in a fair execution by the number of (asynchronous) cycles between them. The definition of (asynchronous) cycles considers the term complete iteration. The first (asynchronous) cycle (with round-trips) of a fair execution $R = R'_1 \circ R''$ is the shortest prefix $R'_1$ of $R$, such that each non-failing node in the network executes at least one complete iteration in $R'_1$, where $\circ$ is the concatenation operator (Section 2.2). The second cycle in execution $R$ is the first cycle in execution $R''$, and so on.

Remark 2.1

For the sake of a simple presentation of the correctness proof, we assume that any message that arrives in $R$ without being transmitted in $R$ does so within $\mathcal{O}(1)$ asynchronous cycles in $R$.

2.8 External Building Blocks: Gossip and Quorum Services

We utilize the gossip service from [19], which guarantees the following: (a) every gossip message that the receiver delivers to its upper layer was indeed sent by the sender, and (b) such deliveries occur according to the communication fairness guarantees (Section 2.3.1). I.e., this gossip service does not guarantee reliability.

We consider a system in which the nodes behave according to the following terms of service. We assume that, at any time, any node runs at most one operation (that is, either a write() or a snapshot(); one at a time). These operations access the quorum sequentially, i.e., they send one request at a time, by sending messages to all other nodes via a broadcast interface. The receivers of this message reply. For this request-reply behavior, the quorum-based communication functionality guarantees the following: (a) at least a quorum of nodes receive, deliver and acknowledge every message, (b) a (non-failing) sending node receives at least a majority of these replies or fulfils another return condition, e.g., arrival of a special message, and (c) immediately before returning from the quorum access, the sending side of this service clears its state from information related to this quorum request. We use the above requirements in Corollary 2.1; its correctness proof can be found in [19].
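The request-reply pattern behind these terms of service can be sketched as follows; this is a simplified illustration, and the transport stubs send_to and try_receive are hypothetical, i.e., they are not the interface of [19].

\begin{verbatim}
def quorum_round_trip(nodes, request, send_to, try_receive):
    """Broadcast request and block until a majority of nodes acknowledged it."""
    majority = len(nodes) // 2 + 1
    replies = {}
    while len(replies) < majority:
        for j in nodes:                      # keep re-sending: packets may be
            send_to(j, request)              # omitted, duplicated, or reordered
        for j, reply in try_receive():
            if reply.get('tag') == request.get('tag'):   # drop stale replies
                replies[j] = reply
    return replies   # the sender now clears its per-request state and returns
\end{verbatim}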

Corollary 2.1 (Self-stabilizing gossip and quorum-based communications)

Let $R$ be an (unbounded) execution of the algorithm that appears in [19, Algorithm 3] and satisfies the terms of service of the quorum-based communication functionality. Suppose that $R$ is fair and its starting system state is arbitrary. Within $\mathcal{O}(1)$ asynchronous cycles, $R$ reaches a suffix $R'$ in which (1) the gossip, and (2) the quorum-based communication functionalities are correct. (3) During $R'$, the gossip and quorum-based communications complete their operations correctly within $\mathcal{O}(1)$ asynchronous cycles.


3 Background

For the sake of completeness, we review the solutions of Delporte-Gallet et al. [12].

3.1 The non-blocking algorithm by Delporte-Gallet et al.

The non-blocking solution to snapshot object emulation by Delporte-Gallet et al. [12, Algorithm 1] allows all write operations to terminate regardless of the invocation patterns of the other write or snapshot operations (as long as the invoking processors do not fail during the operation). However, for the case of snapshot operations, termination is guaranteed only if the system execution eventually reaches a period in which there are no concurrent write operations. Algorithm LABEL:alg:0disCongif presents Delporte-Gallet et al. [12, Algorithm 1]; we have changed some of the notation of Delporte-Gallet et al. to fit the presentation style of this paper. Moreover, we use the broadcast primitive according to its definition in Section 2.8.

Local variables.   The node state appears in lines LABEL:ln:varStart to LABEL:ln:var, and automatic variables (which are allocated and deallocated automatically when the program flow enters and leaves the variable’s scope) are defined using the let keyword, e.g., the variable defined in line LABEL:ln:0prevSsnGetsRegSsnPlusOne. Also, when a message arrives, we use the parameter name to refer to the arriving value of the corresponding message field.

Processor $p_i$ stores an array $reg$ of $n$ elements (line LABEL:ln:var), such that the $j$-th entry stores the most recent information about processor $p_j$’s object and $reg[i]$ stores $p_i$’s actual object value. Every entry is a pair of the form $(v, ts)$, where the field $v$ is a $\nu$-bit object value and $ts$ is an unbounded integer that stores the object timestamp. The values of $ts$ serve as the indices of $p_j$’s write operations. Similarly, $p_i$ maintains an index for the snapshot operations, $ssn$ (sequence number). Algorithm LABEL:alg:0disCongif also defines the relation $\preceq$ that compares register entries according to the write operation indices (line LABEL:ln:preceq).
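A minimal sketch of this per-node state and of the merge that the relation induces follows; the names Entry, fresher and merge are ours and only illustrate the bookkeeping.

\begin{verbatim}
from typing import Any, List, Optional, Tuple

Entry = Tuple[Optional[Any], int]     # (v, ts): value and its write index

def fresher(a: Entry, b: Entry) -> Entry:
    """Pick the entry with the larger write index."""
    return a if a[1] >= b[1] else b

def merge(reg: List[Entry], arriving: List[Entry]) -> List[Entry]:
    """Entry-wise merge of the local array with an arriving one."""
    return [fresher(a, b) for a, b in zip(reg, arriving)]

n = 3
reg = [(None, 0)] * n                                  # initial local state
reg = merge(reg, [('x', 1), (None, 0), ('y', 2)])      # arriving information
assert reg == [('x', 1), (None, 0), ('y', 2)]
\end{verbatim}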

The write() operation.   Algorithm LABEL:alg:0disCongif’s write() operation appears in lines LABEL:ln:0operationWriteV to LABEL:ln:0writeReturn (client-side) and lines LABEL:ln:0operationSnapshot to LABEL:ln:0snapshotReturn (server-side). The client-side operation stores the pair $(v, ts)$ in $reg[i]$ (line LABEL:ln:0tsPlusOne), where $p_i$ is the calling processor and $ts$ is a unique operation index. The broadcast primitive sends to all the processors in $\mathcal{P}$ the WRITE message with $p_i$’s local perception of $reg[i]$’s value.

Upon the arrival of a WRITE message to $p_i$ from $p_j$ (line LABEL:ln:0arrivalWRITE), the server-side code is run. Processor $p_i$ updates $reg$ according to the timestamps of the arriving values (line LABEL:ln:0kDotRegGetsMaxPreceqRegJWrite). Then, $p_i$ replies to $p_j$ with an acknowledgment message (line LABEL:ln:0sendSNAPSHOTackRegSsn), which includes $p_i$’s local perception of the system’s shared registers.

Getting back to the client-side, $p_i$ repeatedly broadcasts the WRITE message to all processors in $\mathcal{P}$ until it receives replies from a majority of them (line LABEL:ln:0broadcastWRITEreg). Once that happens, it uses the arriving values for keeping $reg$ up-to-date (line LABEL:ln:0mergeRecWrite).

The snapshot() operation.   Algorithm LABEL:alg:0disCongif’s snapshot() operation appears in lines LABEL:ln:0operationSnapshot to LABEL:ln:0snapshotReturn (client-side) and lines LABEL:ln:0arrivalSNAPSHOT to LABEL:ln:0sendSNAPSHOTackRegSsn (server-side). Recall that Delporte-Gallet et al. [12, Algorithm 1] is non-blocking with respect to the snapshot operations as long as there are no concurrent write operations. Thus, the client-side is written in the form of a repeat-until loop. Processor $p_i$ tries to query the system for the most recent values of the shared registers. The success of such attempts depends on the above assumption. Therefore, before each such broadcast, $p_i$ copies $reg$’s value to a local variable (line LABEL:ln:0prevSsnGetsRegSsnPlusOne) and exits the repeat-until loop only when the updated value of $reg$ indicates that there are no concurrent write operations.
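The client-side logic just reviewed can be sketched as follows; this is a simplified, single-threaded illustration that reuses the earlier sketches (quorum stands for any single-argument wrapper of the quorum round-trip sketch), and none of the names are taken from [12].

\begin{verbatim}
def write(state, v, quorum, merge):
    state['ts'] += 1
    state['reg'][state['i']] = (v, state['ts'])
    for reply in quorum({'type': 'WRITE', 'reg': state['reg']}).values():
        state['reg'] = merge(state['reg'], reply['reg'])   # keep reg up-to-date

def snapshot(state, quorum, merge):
    while True:                                   # non-blocking: may not return
        prev = list(state['reg'])                 # value known before the query
        state['ssn'] += 1
        replies = quorum({'type': 'SNAPSHOT', 'reg': state['reg'],
                          'ssn': state['ssn']})
        for reply in replies.values():
            state['reg'] = merge(state['reg'], reply['reg'])
        if state['reg'] == prev:                  # no concurrent write observed
            return prev
\end{verbatim}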

Figure 2: Examples of Algorithm LABEL:alg:0disCongif’s executions. The upper drawing illustrates a case of a terminating snapshot operation (dashed line arrows) that occurs between two write operations (solid line arrows). The acknowledgments of these messages are arrows that start with circles and squares, respectively. The lower drawing illustrates a case in which every execution of line LABEL:ln:ssnPlusOne occurs concurrently with write operations (regardless of whether the algorithm is self-stabilizing or not). Thus, snapshot operations cannot terminate.

Figure 2 depicts two examples of Algorithm LABEL:alg:0disCongif’s execution. The upper drawing illustrates a write operation that is followed by a snapshot operation and then a second write operation. We use this example when comparing algorithms LABEL:alg:9disCongif, LABEL:alg:disCongif and LABEL:alg:terminating. The lower drawing illustrates a case of an unbounded sequence of write operations that disrupts a snapshot operation, which does not terminate for an unbounded period.


3.2 The always-terminating algorithm by Delporte-Gallet et al.

Delporte-Gallet et al. [12, Algorithm 2] guarantee termination for any invocation pattern of write and snapshot operations, as long as the invoking processors do not fail during these operations. Its advantage over Delporte-Gallet et al. [12, Algorithm 1] is that it can deal with an infinite number of concurrent write operations, whereas [12, Algorithm 1] guarantees only the non-blocking progress criterion for the snapshot operations. We present [12, Algorithm 2] in Algorithm LABEL:alg:9disCongif using the presentation style of this paper. We review Algorithm LABEL:alg:9disCongif while pointing out some key challenges that exist when considering the context of self-stabilization.

High-level overview.   Delporte-Gallet et al. [12, Algorithm 2] use a job-stealing scheme for allowing rapid termination of snapshot operations. Processor $p_i$ starts its snapshot() operation by queueing this new task at all processors $p_j \in \mathcal{P}$. Once $p_j$ receives $p_i$’s new task and that task reaches the queue front, $p_j$ starts the embedded snapshot procedure, which is similar to Algorithm LABEL:alg:0disCongif’s snapshot() operation. This joint participation in all snapshot operations makes sure that all processors are aware of all on-going snapshot operations.

This joint awareness allows the system processors to make sure that no write operation can stand in the way of on-going snapshot operations. To that end, the processors wait until the oldest snapshot operation terminates before proceeding with later operations. Specifically, they defer write operations that run concurrently with snapshot operations. This guarantees termination of snapshot operations via the interleaving and synchronization of snapshot and write operations.

Detailed description.   Algorithm LABEL:alg:9disCongif extends Algorithm LABEL:alg:0disCongif in the sense that it uses all of Algorithm LABEL:alg:0disCongif’s variables and two additional ones: a second operation index, $sns$, and an array, $repSnap$, which the snapshot operations use. The entry $repSnap[j][x]$ holds the outcome of $p_j$’s $x$-th snapshot operation, where no explicit bound on the number of invocations of snapshot operations is given.

In the context of self-stabilization, the use of such unbounded variables is not possible. The reasons are that real-world systems have bounded size memory as well as the fact that a single transient fault can bring any counter to its near overflow value and fill up any finite capacity buffer. We discuss the way around this challenge in Section 5.

The write() operation and the baseWrite() function.   Since write() operations are preemptible, $p_i$ cannot always start to write immediately. Instead, $p_i$ stores $v$, together with a unique operation index, in a pending-write variable (line LABEL:ln:preWrite). The algorithm then runs the write operation as a background task (line LABEL:ln:backGroundWrite) using the baseWrite() function (lines LABEL:ln:9baseWrite to LABEL:ln:9mergeRecWrite2).

The snapshot() operation.   A call to snapshot() (line LABEL:ln:9snspp) causes $p_i$ to reliably broadcast a new snapshot index to all processors in $\mathcal{P}$. Processor $p_j$ then places it as a background task (line LABEL:ln:9repSnapIsns). We note that for our proposed solutions we do not assume access to a reliable broadcast mechanism; see Section 5 for details and an alternative approach that uses safe registers instead of such a primitive, which often has higher communication costs.

The baseSnapshot() function.   This function essentially follows the snapshot() operation of Algorithm LABEL:alg:0disCongif. That is, the snapshot repeat-until loop of Algorithm LABEL:alg:0disCongif iterates until the retrieved vector equals the one that was known prior to the last repeat-until iteration. Algorithm LABEL:alg:0disCongif’s procedure returns after at least one snapshot process has terminated. In detail, processor $p_i$ stores in $repSnap$, via a reliable broadcast, the result of the snapshot process (lines LABEL:ln:prevReg and LABEL:ln:uponEND).

Synchronization between the baseWrite() and baseSnapshot() functions.   Algorithm LABEL:alg:9disCongif interleaves the background tasks in a do forever loop (lines LABEL:ln:backGroundWrite to LABEL:ln:waitUntilReadSnap). As long as there is an awaiting write task, processor $p_i$ runs the baseWrite() function (line LABEL:ln:backGroundWrite). Also, if there is an awaiting snapshot task, processor $p_i$ selects the oldest such task and runs the baseSnapshot() function. Here, Algorithm LABEL:alg:9disCongif blocks until $repSnap$ contains the result of that snapshot task.
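A sketch of this interleaving follows; the names and the wait() stub are illustrative, and the busy-wait stands for the blocking step described above.

\begin{verbatim}
def background_loop_iteration(pending_write, snapshot_tasks, results,
                              base_write, base_snapshot, wait):
    if pending_write is not None:
        base_write(pending_write)        # serve the awaiting write task first
    if snapshot_tasks:
        oldest = min(snapshot_tasks)     # oldest pending snapshot task
        base_snapshot(oldest)
        while oldest not in results:     # block until the result of that
            wait()                       # snapshot task has been stored
\end{verbatim}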

Note that line LABEL:ln:unboundedBuffer implies that Algorithm LABEL:alg:9disCongif does not explicitly assume that processor $p_i$ has bounded space for storing messages. In the context of self-stabilization, there must be an explicit bound on the size of all memory in use. We discuss how to overcome this challenge in Section 5.

Figure 3: Algorithm LABEL:alg:9disCongif’s execution for the case depicted by the upper drawing of Figure 2. The drawing illustrates a case of a terminating snapshot operation (dashed line arrows) that occurs between two write operations (solid line arrows). The acknowledgments of these messages are arrows that start with circles and squares, respectively.

Figure 3 depicts an example of Algorithm LABEL:alg:9disCongif’s execution where a write operation is followed by a snapshot operation. Note that each snapshot operation is handled separately and the communication cost of each such operation is $\mathcal{O}(n^2)$ messages.

4 An Unbounded Self-stabilizing Non-blocking Algorithm

We propose Algorithm LABEL:alg:disCongif as an elegant extension of Delporte-Gallet et al. [12, Algorithm 1]; we have simply added the boxed code lines to Algorithm LABEL:alg:disCongif. Algorithms LABEL:alg:0disCongif and LABEL:alg:disCongif differ in their ability to deal with stale information that can appear in the system when starting in an arbitrary state. Note that we model the appearance of stale information as the result of transient faults and assume that they occur only before the system starts running.

4.1 Algorithm description

Our description refers to the value of variable $x$ at node $p_i$ as $x_i$, i.e., the variable name with a subscript that indicates the node identifier. Algorithm LABEL:alg:disCongif considers the case in which any of $p_i$’s operation indices, $ts_i$ and $ssn_i$, is smaller than some other $ts$, respectively $ssn$, value that is stored at another node $p_j$ or appears in the field of some in-transit message.

For the case of corrupted $ssn$ values, $p_i$’s client-side simply ignores arriving messages with $ssn$ values that do not match $ssn_i$ (line LABEL:ln:waitUntilSNAPSHOTackReg). For the sake of clarity of our proposal, we also periodically remove any stored snapshot replies whose $ssn$ fields are not equal to $ssn_i$.

For the case of corrupted $ts$ values, $p_i$’s do forever loop makes sure that $ts_i$ is not smaller than $reg_i[i].ts$ (line LABEL:ln:tsGetsMaxTsRegTs) before gossiping to every processor its local copy of $p_i$’s shared register (line LABEL:ln:sendGossip). Also, upon the arrival of such gossip messages, Algorithm LABEL:alg:disCongif merges the arriving information with the local one (line LABEL:ln:GOSSIPupdate). Moreover, when replies to write or snapshot messages arrive at $p_i$, it merges the arriving values with the ones in $reg_i$ (line LABEL:ln:tsRegMaxTsRegTs).
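The stale-information clean-up added by the boxed lines can be sketched as follows, reusing the state layout of the earlier sketches; gossip_send is a hypothetical transport stub and the exact gossip payload is our assumption.

\begin{verbatim}
def cleanup_and_gossip(state, gossip_send):
    i = state['i']
    # ts must not be smaller than the write index recorded in reg[i]
    state['ts'] = max(state['ts'], state['reg'][i][1])
    state['reg'][i] = (state['reg'][i][0], state['ts'])
    for j in state['nodes']:
        if j != i:                            # repeated small gossip messages
            gossip_send(j, {'reg': state['reg'][i],
                            'ts': state['ts'], 'ssn': state['ssn']})

def on_gossip(state, j, msg, fresher):
    # merge the arriving information about p_j's register with the local copy
    state['reg'][j] = fresher(state['reg'][j], msg['reg'])
\end{verbatim}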

On the presentation side, we clarify that the code in lines LABEL:ln:0broadcastWRITEreg to LABEL:ln:0mergeRecWrite and lines LABEL:ln:0ssnPlusOne to LABEL:ln:0mergeRecSnapshot is equivalent to the corresponding lines of [12, Algorithm 1], because it is merely a more detailed description of the code described in [12, Algorithm 1].

Figure 4 depicts an example of Algorithm LABEL:alg:disCongif’s execution in which a write operation is followed by a snapshot operation. Note that gossip messages do not interfere with write and snapshot operations.

Figure 4: Algorithm LABEL:alg:disCongif’s execution for the case depicted in the upper drawing of Figure 2. The drawing illustrates a case of a terminating snapshot operation (dashed line arrows) that occurs between two write operations (solid line arrows). The acknowledgments of these messages are arrows that start with circles and squares, respectively.

4.2 Correctness

Although the extension performed on Algorithm LABEL:alg:0disCongif for obtaining Algorithm LABEL:alg:disCongif includes only a few changes, proving convergence and closure for Algorithm LABEL:alg:disCongif is not straightforward. We proceed with the details.

Notation and definitions.   Definition 4.1 refers to $p_i$’s timestamps and snapshot sequence numbers, where $p_i \in \mathcal{P}$. The set of $p_i$’s timestamps includes $ts_i$, the values $reg_j[i].ts$ stored at any node $p_j \in \mathcal{P}$, and the value of the $ts$ field in the payload of any message that is in transit in the system. The set of $p_i$’s snapshot sequence numbers includes $ssn_i$ and the value of the $ssn$ field in the payload of any message that is in transit in the system.

Definition 4.1 (Algorithm LABEL:alg:disCongif’s consistent operation indices)

(i) Let $c$ be a system state in which $ts_i$ is greater than or equal to any of $p_i$’s timestamp values in the variables and fields related to $p_i$. We say that $p_i$’s timestamps are consistent in $c$. (ii) Let $c$ be a system state in which $ssn_i$ is greater than or equal to any of $p_i$’s snapshot sequence numbers in the variables and fields related to $p_i$. We say that $p_i$’s snapshot sequence numbers are consistent in $c$.

Theorems 4.1 and 4.11 show the properties required by the self-stabilization design criteria.

Theorem 4.1 (Algorithm LABEL:alg:disCongif’s convergence)

Let $R$ be a fair and unbounded execution of Algorithm LABEL:alg:disCongif. Within $\mathcal{O}(1)$ asynchronous cycles in $R$, the system reaches a state $c$ in which, for every $p_i \in \mathcal{P}$, $p_i$’s timestamps and $p_i$’s snapshot sequence numbers are consistent in $c$.

Proof. The proof of the theorem follows from Lemmas 4.2 and 4.7.

Lemma 4.2 (Timestamp convergence)

Let $R$ be an unbounded fair execution of Algorithm LABEL:alg:disCongif. Within $\mathcal{O}(1)$ asynchronous cycles in $R$, the system reaches a state $c$ in which the value of $ts_i$ is greater than or equal to any of $p_i$’s timestamp values. Moreover, suppose that node $p_i$ takes a step immediately after $c$ that includes the execution of line LABEL:ln:tsPlusOne. Then, in the resulting state, $ts_i$ is greater than $reg_j[i].ts$ for every $p_j \in \mathcal{P}$, as well as greater than the value of the $ts$ field of every message that is in transit from $p_i$ to $p_j$ or from $p_j$ to $p_i$.

Proof. Claims 4.3, 4.4, 4.5 and 4.6 prove the lemma. Claim 4.3 denotes by $x^{(k)}$ the $k$-th value stored in a variable (or field) $x$ during $R$, where $k \in \mathbb{N}$.

Claim 4.3

The sequences of values stored in $ts_i$ and in the $ts$ fields of the entries $reg_j[i]$ (for any $p_i, p_j \in \mathcal{P}$) are non-decreasing.

Proof of claim. We note that Algorithm LABEL:alg:disCongif performs only the following actions on the $ts$ variables and fields: increment (line LABEL:ln:tsPlusOne) and merge using the max function (lines LABEL:ln:tsRegMaxTsRegTs, LABEL:ln:regGetsMaxRegCupMid, LABEL:ln:tsGetsMaxTsRegTs, LABEL:ln:mergeRecSnapshot, LABEL:ln:GOSSIPupdate, LABEL:ln:kDotRegGetsMaxPreceqRegJWrite and LABEL:ln:kDotRegGetsMaxPreceqRegJSnapshoot). That is, there are no other assignments. Thus, the claim is true, because the value of these variables and fields is never decremented during $R$.

Claim 4.4

Within $\mathcal{O}(1)$ asynchronous cycles, $ts_i \geq reg_i[i].ts$ holds.

Proof of claim. Since $R$ is unbounded, node $p_i$ calls line LABEL:ln:sendGossip an unbounded number of times during $R$. Recall the line numbers that may change the values of $ts_i$ and $reg_i[i].ts$, cf. the proof of Claim 4.3. Note that only line LABEL:ln:tsPlusOne changes the value of $ts_i$ via an increment (thus we do not have a simple equality), whereas lines LABEL:ln:tsRegMaxTsRegTs, LABEL:ln:tsGetsMaxTsRegTs and LABEL:ln:GOSSIPupdate update $ts_i$ and $reg_i[i].ts$ by taking the maximum value of $ts_i$ and $reg_i[i].ts$. The rest of the proof is implied by Claim 4.3, and the fact that $p_i$ executes line LABEL:ln:tsGetsMaxTsRegTs at least once in every $\mathcal{O}(1)$ asynchronous cycles.

Algorithm LABEL:alg:disCongif sends gossip messages in line LABEL:ln:sendGossip, request messages in lines LABEL:ln:broadcastWRITEreg and LABEL:ln:ssnPlusOne, as well as replies in lines LABEL:ln:sendWRITEackReg and LABEL:ln:sendSNAPSHOTackRegSsn. Claim 4.5’s proof considers lines LABEL:ln:broadcastWRITEreg and LABEL:ln:ssnPlusOne, in which $p_i$ sends a request message to $p_j$, whereas Claim 4.6’s proof considers lines LABEL:ln:sendGossip, LABEL:ln:sendWRITEackReg and LABEL:ln:sendSNAPSHOTackRegSsn, in which $p_j$ replies or gossips to $p_i$.

Claim 4.5

Let $m$ be a message in transit from $p_i$ to $p_j$ (during the first $\mathcal{O}(1)$ asynchronous cycles of $R$) and let $m.ts$ be the value of the $ts$ field in $m$, where $p_i$ and $p_j$ are non-failing nodes. Within $\mathcal{O}(1)$ asynchronous cycles, $ts_i \geq m.ts$ holds, and it holds whenever $p_j$ raises the delivery events of gossip, WRITE or SNAPSHOT messages.

Proof of claim. Suppose that during the first $\mathcal{O}(1)$ asynchronous cycles of $R$, node $p_i$ indeed sends message $m$, i.e., $m$ does not appear in $R$’s starting system state. Let $a$ be the first step in $R$ in which $p_i$ calls line LABEL:ln:sendGossip, LABEL:ln:broadcastWRITEreg or LABEL:ln:ssnPlusOne and for which there is a step $a'$, which appears in $R$ after $a$ and in which message $m$ is sent (in a packet by the end-to-end or quorum protocol). Note that the value of the $ts$ field in the message payload is defined by the value of $ts_i$ in the system state that immediately precedes $a$. The rest of the proof relies on the fact that, until $m$ arrives at $p_j$, the invariant $ts_i \geq m.ts$ holds (due to Claim 4.3).

Let $a''$ be the first step that appears after $a'$ in $R$, if there is any such step, in which the node $p_j$ delivers the packet (token) that transmits the message $m$ (if there are several such packets, consider the last to arrive). By the correctness of the end-to-end [15, 17] or quorum service (Corollary 2.1), step $a''$ appears in $R$ within $\mathcal{O}(1)$ asynchronous cycles. During $a''$, node $p_j$ raises the message delivery event of a gossip message (when $p_j$ considers line LABEL:ln:GOSSIPupdate), a WRITE message (when $p_j$ considers line LABEL:ln:arrivalWRITE) or a SNAPSHOT message (when $p_j$ considers line LABEL:ln:arrivalSNAPSHOT), such that $ts_i \geq m.ts$.

Suppose that step $a$ does not appear in $R$, i.e., $m$ appears in $R$’s starting system state. By the definition of asynchronous cycles with round-trips (Remark 2.1), within $\mathcal{O}(1)$ asynchronous cycles, all messages in transit to $p_j$ arrive (or leave the communication channel). Immediately after that, the system starts an execution in which this claim holds trivially.

Claim 4.6

Let $m$ be a message in transit from $p_j$ to $p_i$ (during the first $\mathcal{O}(1)$ asynchronous cycles of $R$) and let $m.ts$ be the value of the $ts$ field in $m$, where $p_i$ and $p_j$ are non-failing nodes and $i = j$ may or may not hold. Within $\mathcal{O}(1)$ asynchronous cycles, $ts_i \geq m.ts$ holds, and it holds whenever $p_i$ raises the delivery events of gossip, WRITEack or SNAPSHOTack messages.

Proof of claim. Suppose that during the first $\mathcal{O}(1)$ asynchronous cycles of $R$, node $p_j$ indeed sends message $m$, i.e., $m$ does not appear in $R$’s starting system state. Let $a$ be the first step in $R$ in which $p_j$ calls line LABEL:ln:sendGossip, LABEL:ln:sendWRITEackReg or LABEL:ln:sendSNAPSHOTackRegSsn and for which there is a step $a'$, which appears in $R$ after $a$ and in which message $m$ is sent. Note that the value of the $ts$ field in the message payload is defined by the value that $p_j$ stores in the system state that immediately precedes $a$. The rest of the proof relies on the fact that, until $m$ arrives at $p_i$, the invariant $ts_i \geq m.ts$ holds (due to Claim 4.3).

Let $a''$ be the first step that appears after $a'$ in $R$, if there is any such step, in which the node $p_i$ delivers the packet (token) that transmits the message $m$ (if there are several such packets, consider the last to arrive). By the correctness of the gossip and quorum services (Corollary 2.1), step $a''$ appears in $R$ within $\mathcal{O}(1)$ asynchronous cycles. During $a''$, node $p_i$ raises the message delivery event of a gossip message (when $p_i$ considers line LABEL:ln:sendGossip), a WRITEack message (when $p_i$ considers line LABEL:ln:sendWRITEackReg) or a SNAPSHOTack message (when $p_i$ considers line LABEL:ln:sendSNAPSHOTackRegSsn), such that $ts_i \geq m.ts$.

For the case in which step $a$ does not appear in $R$, the proof follows the same arguments that appear in the proof of Claim 4.5.

This completes the proof of the lemma.

Lemma 4.7 (Sequence number convergence)

Let $R$ be a fair and unbounded execution of Algorithm LABEL:alg:disCongif. Within $\mathcal{O}(1)$ asynchronous cycles in $R$, the system reaches a state in which the value of $ssn_i$ is greater than or equal to any of $p_i$’s snapshot sequence numbers.

Proof. Claims 4.8, 4.9 and 4.10 prove the lemma.

Claim 4.8

The sequence of values stored in $ssn_i$ is non-decreasing.

Proof of claim. Algorithm LABEL:alg:disCongif modifies $ssn_i$ only by incrementing it (line LABEL:ln:prevSsnGetsRegSsnPlusOne) and by assignments that do not decrease it (lines LABEL:ln:ssnPlusOne and LABEL:ln:sendSNAPSHOTackRegSsn). Thus, the claim is true, because the value of this field is never decremented during $R$.

The proofs of Claims 4.9 and 4.10 follow arguments that are similar to the ones that appear in the proofs of Claims 4.5 and 4.6.

Claim 4.9

Let $m$ be a message in transit from $p_i$ to $p_j$ (during the first $\mathcal{O}(1)$ asynchronous cycles of $R$) that includes the $ssn$ field with the value $m.ssn$. Within $\mathcal{O}(1)$ asynchronous cycles, $ssn_i \geq m.ssn$ holds, including when $p_j$ raises the SNAPSHOT message delivery event.

Claim 4.10

Let $m$ be a message in transit from $p_j$ to $p_i$ (during the first $\mathcal{O}(1)$ asynchronous cycles of $R$) and let $m.ssn$ be the value of the $ssn$ field in $m$. Within $\mathcal{O}(1)$ asynchronous cycles, $ssn_i \geq m.ssn$ holds, including when $p_i$ raises the SNAPSHOTack message delivery event.

This completes the proof of the lemma, which completes the proof of the theorem.

Theorem 4.11 (Algorithm LABEL:alg:disCongif’s termination and linearization)

Let $R$ be an execution of Algorithm LABEL:alg:disCongif that starts in a system state $c$, in which the timestamps and snapshot sequence numbers are consistent (Definition 4.1). Execution $R$ is legal with respect to the task of emulating snapshot objects.

Proof. We start the proof by observing the differences between Algorithms LABEL:alg:0disCongif and LABEL:alg:disCongif. Note that Algorithms LABEL:alg:0disCongif and LABEL:alg:disCongif use the same variables. Any message that Algorithm LABEL:alg:0disCongif sends, Algorithm LABEL:alg:disCongif also sends. The only exception is the gossip messages: Algorithm LABEL:alg:disCongif sends gossip messages, while Algorithm LABEL:alg:0disCongif does not. The two algorithms differ in line LABEL:ln:tsRegMaxTsRegTs, lines LABEL:ln:deleteSNAPSHOTack to LABEL:ln:sendGossip and line LABEL:ln:GOSSIPupdate.

The next step in the proof is to show that during $R$, any step that includes the execution of line LABEL:ln:GOSSIPupdate does not change the state of the calling processor. This is due to the fact that every timestamp uniquely couples an object value (line LABEL:ln:tsGetsMaxTsRegTs) and that the timestamps are consistent in every system state throughout $R$ (Lemma 4.2).

The rest of the proof considers the algorithm that is obtained from the code of Algorithm LABEL:alg:disCongif by the removal of lines LABEL:ln:sendGossip and LABEL:ln:GOSSIPupdate, in which the gossip messages are sent and received, respectively. We use this definition to show that the obtained algorithm simulates Algorithm LABEL:alg:0disCongif. This means that, from the perspective of its external behavior (i.e., its requests, replies and failure events), any trace of the obtained algorithm is also a trace of Algorithm LABEL:alg:0disCongif (as long as the starting system state, $c$, indeed encodes consistent timestamps and snapshot sequence numbers). Since Algorithm LABEL:alg:0disCongif satisfies the task of emulating snapshot objects, it holds that the obtained algorithm also satisfies the task. This implies that Algorithm LABEL:alg:disCongif satisfies the task as well.

Recall the fact that every timestamp uniquely couples an object value (line LABEL:ln:tsGetsMaxTsRegTs), as well as that the timestamps and snapshot sequence numbers are consistent in every system state throughout $R$ (Lemma 4.2). These facts imply that also line LABEL:ln:tsRegMaxTsRegTs, and lines LABEL:ln:deleteSNAPSHOTack to LABEL:ln:tsGetsMaxTsRegTs, do not change the state of the calling node.

5 An Unbounded Self-stabilizing Always Terminating Algorithm

We propose Algorithm LABEL:alg:terminating as a variation of Delporte-Gallet et al. [12, Algorithm 2]. Algorithms LABEL:alg:9disCongif and LABEL:alg:terminating differ mainly in their ability to recover from transient faults. This implies some constraints. For example, Algorithm LABEL:alg:terminating must have a clear bound on the number of pending snapshot tasks as well as on the number of stored results from snapshot tasks that have already terminated (see Section 3.2 for details). For the sake of simple presentation, Algorithm LABEL:alg:terminating assumes that the system needs to cater, for each processor, for at most one pending snapshot task. It turns out that this assumption allows us to avoid the use of a self-stabilizing mechanism for reliable broadcast, as an extension of the non-self-stabilizing reliable broadcast that Delporte-Gallet et al. [12, Algorithm 2] use. Instead, Algorithm LABEL:alg:terminating uses a simpler mechanism for safe registers.

The above opens up another opportunity: Algorithm LABEL:alg:terminating can defer pending snapshot tasks until either (i) at least one processor was able to observe at least concurrent write operations, where is an input parameter, or (ii) no concurrent write operations were observed, i.e., (line LABEL:ln:exceedDelta). Our intention here is to use as a tunable parameter that balances the latency of snapshot operations against communication costs. That is, for the case of being a very high (finite) value, Algorithm LABEL:alg:terminating guarantees termination in a way that resembles [12, Algorithm 1], which uses messages per snapshot operation, and for the case of , Algorithm LABEL:alg:terminating behaves in a way that resembles [12, Algorithm 2], which uses messages per snapshot operation.
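As a rough illustration of this trade-off (with identifiers of our own choosing, not those of the pseudocode), a deferral test based on the parameter could look as follows: a snapshot task initiated by another processor is served only once at least delta write operations are observed to be concurrent with it, whereas delta = 0 means that no task is ever deferred.

```python
# Illustrative sketch of the delta-based deferral rule; all identifiers are ours.

def concurrent_writes(vc_now, vc_at_task_start):
    """Number of writes observed since the task's vector-clock sample."""
    return sum(now - then for now, then in zip(vc_now, vc_at_task_start))

def should_serve(task_vc, vc_now, delta):
    """Serve a deferred task when delta == 0 (never defer) or once at least
    delta concurrent writes have been observed."""
    return delta == 0 or concurrent_writes(vc_now, task_vc) >= delta

# Example: with delta = 2, a task sampled at [1, 0, 0] is served only after
# the vector clock advances by at least two write operations in total.
assert not should_serve([1, 0, 0], [2, 0, 0], delta=2)
assert should_serve([1, 0, 0], [2, 1, 0], delta=2)
```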

5.1 High-level description

Algorithm LABEL:alg:9disCongif uses reliable broadcast for informing all non-failing processors about new snapshot tasks (line LABEL:ln:reliableBroadcastout) as well as about the results of snapshot tasks that have terminated (line LABEL:ln:prevReg). Since we assume that each processor can have at most one pending snapshot task, we can avoid the need for a self-stabilizing mechanism for reliable broadcast. Indeed, Algorithm LABEL:alg:terminating simply lets every processor disseminate its (at most one) pending snapshot task and uses a safe register for facilitating the delivery of the task result to its initiator. That is, once a processor finishes a snapshot task, it broadcasts the result to all processors and waits for replies from a majority of processors, which may possibly include the initiator of the snapshot task (using the macro , line LABEL:ln:safeStoreSend). This way, if processor notices that it has the result of an ongoing snapshot task, it sends that result to the requesting processor.

5.2 Algorithm details

We review Algorithm LABEL:alg:terminating’s do forever loop (lines LABEL:ln:deleteSNAPSHOTack2 to LABEL:ln:letSexceedDeltaCall), the function together with the handling of the message (lines LABEL:ln:SNAPSHOTarrival to LABEL:ln:SNAPSHOTarrivalEND), as well as the macro (line LABEL:ln:safeStoreSend) together with the handling of the message (lines LABEL:ln:safeStoreArrival to LABEL:ln:safeStoreSendAck).

The do forever loop.   Algorithm LABEL:alg:terminating’s do forever loop (lines LABEL:ln:deleteSNAPSHOTack2 to LABEL:ln:letSexceedDeltaCall) includes a number of lines for removing stale information, such as out-of-sync messages (line LABEL:ln:deleteSNAPSHOTack2), outdated operation indices (line LABEL:ln:tsGetsMaxTsRegTs2), illogical vector clocks (line LABEL:ln:vcFix) or corrupted entries (line LABEL:ln:snsNegRepSnap). The gossiping of operation indices (lines LABEL:ln:sendGossip2 and LABEL:ln:GOSSIParrival) also helps to remove stale information (as in Algorithm LABEL:alg:disCongif, but with the addition of values).
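The following sketch (Python, with a hypothetical state layout) illustrates the flavor of such cleanup: acknowledgments that carry an old query number are dropped, and corrupted (e.g., negative) entries are reset.

```python
# Sketch of a stale-information cleanup step; the state layout is illustrative.

def cleanup(state):
    # Drop out-of-sync acknowledgments, i.e., ones carrying an old query number.
    state["acks"] = [m for m in state["acks"] if m["ssn"] == state["ssn"]]
    # Reset corrupted (e.g., negative) snapshot indices.
    for j, sns in enumerate(state["sns"]):
        if sns < 0:
            state["sns"][j] = 0

state = {"ssn": 7, "acks": [{"ssn": 6}, {"ssn": 7}], "sns": [3, -1, 2]}
cleanup(state)
assert state["acks"] == [{"ssn": 7}] and state["sns"] == [3, 0, 2]
```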

The synchronization between write and snapshot operations (lines LABEL:ln:writePendingExceedDelta and LABEL:ln:letSexceedDeltaCall) first performs a write, if any such task is pending (line LABEL:ln:writePendingExceedDelta), and only then runs the processor's own snapshot task, if any is pending, as well as any snapshot task (initiated by others) for which it observed that at least write operations occur concurrently with it (line LABEL:ln:letSexceedDeltaCall).

The operation and the function.   As in Algorithm LABEL:alg:9disCongif, does not immediately start a write operation. Node permits concurrent write operations by storing and a unique index in (line LABEL:ln:tsEqtsPlusOne). The algorithm then runs the write operation as a background task (line LABEL:ln:writePendingExceedDelta) using the function (line LABEL:ln:waitUntilWRITEackReg2).
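A minimal sketch of this non-blocking write (again with hypothetical identifiers) records the value together with a fresh, locally unique index and lets the do forever loop complete the write as a background task:

```python
# Sketch: write() only records the value with a fresh index; the background
# step later completes the write (e.g., via a majority-acknowledged update).

def write(state, value):
    state["ts"] += 1                          # fresh, locally unique index
    state["pending_write"] = (state["ts"], value)

def background_step(state, base_write):
    if state.get("pending_write") is not None:
        base_write(*state["pending_write"])   # complete the write operation
        state["pending_write"] = None

state = {"ts": 0, "pending_write": None}
write(state, "v1")
background_step(state, base_write=lambda ts, v: None)  # stand-in for the real base write
assert state["pending_write"] is None and state["ts"] == 1
```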

The function and the message.   Algorithm LABEL:alg:terminating maintains the state of every snapshot task in the array . The entry includes: (i) the index of the most recent snapshot operation that has initiated and is aware of, (ii) the vector clock representation of (i.e., just the timestamps of , cf. line LABEL:ln:vc) and (iii) the final result of the snapshot operation (or , in case it is still running).
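Pictured as a record (with names of our own, not the algorithm's), one entry of this array could be:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class SnapshotTask:
    """Illustrative layout of one entry of the per-processor snapshot-task array."""
    sns: int                                   # index of the most recent snapshot operation
    vc: Optional[List[int]] = None             # vector-clock sample (timestamps only)
    result: Optional[List[Tuple[int, object]]] = None  # final result, or None while running
```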

The function includes an outer loop part (lines LABEL:ln:prevSsnGetsRegSsnPlusOne2 and LABEL:ln:outerStop), an inner loop part (lines LABEL:ln:letSprime to LABEL:ln:waitUntilSNAPSHOTackReg2a), and a result update part (lines LABEL:ln:prevEqReg to LABEL:ln:vcUpdate). The outer loop increments the snapshot index, (line LABEL:ln:prevSsnGetsRegSsnPlusOne2), so that the inner loop can consider a new query attempt. The outer loop ends when (i) there are no more pending snapshot tasks that this call to needs to handle, or (ii) the only pending snapshot task for the current invocation of is the one of and has not observed at least concurrent writes. The inner loop broadcasts messages, which include all the pending that are relevant to this call to together with the local current value of and the snapshot query index . The inner loop ends when acknowledgments are received from a majority of processors and the received values are merged (line LABEL:ln:waitUntilSNAPSHOTackReg2a); a sketch of such a merge appears below. The results are updated by writing to an emulated safe shared register (line LABEL:ln:prevEqReg) whenever . In case the results do not allow to terminate its snapshot task (line LABEL:ln:vcUpdate), Algorithm LABEL:alg:terminating uses the query results for storing the timestamps in the field . This allows balancing a trade-off between snapshot operation latency and communication costs, as we explain next.
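To illustrate the merging of replies received from a majority (a sketch under our own message layout, not the algorithm's), assume that every reply carries a copy of the register that maps a processor identifier to a (timestamp, value) pair; the merge keeps, per processor, the pair with the highest timestamp:

```python
# Sketch: merging register copies returned by a majority of servers; for each
# processor, the pair with the highest timestamp wins (message layout is ours).

def merge_replies(replies):
    merged = {}
    for copy in replies:
        for j, (ts, val) in copy.items():
            if j not in merged or ts > merged[j][0]:
                merged[j] = (ts, val)
    return merged

# Example: two servers disagree about processor 1's most recent write.
r1 = {0: (3, "a"), 1: (5, "x")}
r2 = {0: (3, "a"), 1: (6, "y")}
assert merge_replies([r1, r2]) == {0: (3, "a"), 1: (6, "y")}
```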

The use of the input parameter for balancing the trade-off between snapshot operation latency and communication costs.   For the case of , the set (line LABEL:ln:exceedDelta) includes all the nodes for which there is no stored result, i.e., . Thus, no snapshot tasks are ever deferred, as in Delporte-Gallet et al. [12, Algorithm 2]. The case of uses the fact that Algorithm LABEL:alg:terminating samples the vector clock value of and stores it in (line LABEL:ln:vcUpdate) once it has completed at least one iteration of the repeat-until loop (lines LABEL:ln:waitUntilSNAPSHOTackReg2 and LABEL:ln:waitUntilSNAPSHOTackReg2a). This way, we can be sure that the sampling of the vector clock is an event that occurred no earlier than the start of ’s snapshot operation that has the index of .

Many-jobs-stealing scheme for reduced blocking periods.   We note that ’s task is considered active as long as . For helping all currently active snapshot tasks, samples the set of currently pending tasks (line LABEL:ln:letSprime) before starting the inner repeat-until loop (lines LABEL:ln:letSprime to LABEL:ln:waitUntilSNAPSHOTackReg2a). On the client side, processor broadcasts the message, which includes the most recent snapshot task information, to all processors. The reception of this message on the server side (lines LABEL:ln:SNAPSHOTarrival to LABEL:ln:SNAPSHOTarrivalEND) updates the local information (line LABEL:ln:vcRemoteUpdate) and prepares the response (line LABEL:ln:prepareREsponse) before sending the reply to the client side (line LABEL:ln:SNAPSHOTackSend). Note that if the receiver notices that it has the result of an ongoing snapshot task, it sends that result to the requesting processor (line LABEL:ln:sendSAVEwithA).
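The server-side handling can be sketched as follows (Python, hypothetical identifiers and message layout): the arriving register copy is merged into the local one, any locally stored result that answers an ongoing task is forwarded to its initiator, and an acknowledgment carrying the merged local copy is returned.

```python
# Sketch of the server side of the snapshot-query exchange; identifiers are ours.

def on_snapshot(state, sender, tasks, reg, ssn, send):
    # Merge the arriving register copy into the local one (highest timestamp wins).
    for j, (ts, val) in reg.items():
        if j not in state["reg"] or ts > state["reg"][j][0]:
            state["reg"][j] = (ts, val)
    # Forward a stored result to the initiator of any still-ongoing task it answers.
    for initiator, sns in tasks:
        stored = state["results"].get(initiator)
        if stored is not None and stored["sns"] >= sns:
            send(initiator, ("SAVE", stored))
    # Acknowledge the query with the merged local register copy.
    send(sender, ("SNAPSHOTack", dict(state["reg"]), ssn))
```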

The function and the message.   The function considers a snapshot task that was initiated by processor . This function is responsible for storing the result of this snapshot task in a safe register. It does so by broadcasting the client-side message to all processors in the system (line LABEL:ln:safeStoreSend). Upon the arrival of the message at the server side, the receiver stores the arriving information, as long as it is more recent than the local one. Then, the server side replies with a message to the client side, which waits for a majority of such replies (line LABEL:ln:safeStoreSend).
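A sketch of the client side of this store (with hypothetical helpers broadcast and collect_acks) simply pushes the result to everyone and waits for a majority of acknowledgments:

```python
# Sketch of the client side of the safe-register store; helpers are hypothetical.

def safe_store(initiator, sns, result, n, broadcast, collect_acks):
    """Store the snapshot result so that its initiator can retrieve it."""
    broadcast(("SAVE", initiator, sns, result))   # to all processors
    collect_acks(needed=n // 2 + 1)               # wait for a majority of acknowledgments
```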

Figure 5: The upper drawing depicts an example of Algorithm LABEL:alg:terminating’s execution for a case that is equivalent to the one depicted in the upper drawing of Figure 3, i.e., only one snapshot operation. The lower drawing illustrates the case of concurrent invocations of snapshot operations by all nodes.

Figure 5 depicts two examples of Algorithm LABEL:alg:terminating’s execution. In the upper drawing, a write operation is followed by a snapshot operation. Note that fewer messages are needed compared with Figure 3’s example. The lower drawing illustrates the case of concurrent invocations of snapshot operations by all nodes. Observe the potential improvement with respect to the number of messages (in the upper drawing) and the throughput (in the lower drawing), since Algorithm LABEL:alg:9disCongif uses messages for each snapshot task and handles only one snapshot task at a time.

[Algorithm LABEL:alg:terminating (pseudocode float)]

5.3 Correctness

We now prove the convergence (recovery), termination and linearization of Algorithm LABEL:alg:terminating.

Definition 5.1 (Algorithm LABEL:alg:terminating’s consistent system states and executions)

(i) Let be a system state in which is greater than or equal to any of ’s timestamp values in the variables and fields related to . We say that the ’s timestamps are consistent in . (ii) Let be a system state in which is greater than or equal to any of ’s snapshot sequence numbers in the variables and fields related to . We say that the ’s snapshot sequence numbers are consistent in . (iii) Let be a system state in which is greater than or equal to any of ’s snapshot operation indices in the variables and fields related to . Moreover, and . We say that the ’s snapshot operation indices are consistent in . (iv) Let be a system state in which holds, where is the returned value of the macro defined in line LABEL:ln:vc when executed by processor . We say that the vector clock values are consistent in . We say that system state is consistent if it satisfies invariants (i) to (iv). Let be an execution of Algorithm LABEL:alg:terminating all of whose system states are consistent, and let be a suffix of . We say that execution is consistent (with respect to ) if any message arriving in was indeed sent in and any reply arriving in has a matching request in .
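For intuition only, invariant (i) can be pictured as the following check over a hypothetical global-state representation (names and layout are ours): the timestamp that a processor stores for itself must dominate every timestamp related to it that appears elsewhere in the system.

```python
# Sketch of checking invariant (i) of Definition 5.1 in a test harness;
# the global-state layout below is hypothetical.

def ts_consistent(global_state, i):
    """Processor i's own timestamp dominates every i-related timestamp."""
    own = global_state[i]["ts"]
    return all(peer["reg"][i][0] <= own for peer in global_state.values())

# Example: processor 0's own timestamp (4) dominates all copies of it.
gs = {0: {"ts": 4, "reg": {0: (4, "a")}}, 1: {"ts": 2, "reg": {0: (3, "a")}}}
assert ts_consistent(gs, 0)
```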

Theorem 5.1 (Algorithm LABEL:alg:terminating’s convergence)

Let be a fair and unbounded execution of Algorithm LABEL:alg:terminating. Within asynchronous cycles in , the system reaches a consistent state (Definition 5.1). Within asynchronous cycles after , the system starts a consistent execution .

Proof. Note that Lemmas 4.2 and 4.7 imply invariants (i) and, respectively, (ii) of Definition 5.1 also for the case of Algorithm LABEL:alg:terminating, because the two algorithms use similar code lines for asserting these invariants.

We now consider the proof of invariant (iii) of Definition 5.1. Note that the variables and fields of and the data structure in Algorithm LABEL:alg:terminating follow the same patterns of information as the variables and fields of and the data structure in Algorithm LABEL:alg:disCongif. Moreover, within one asynchronous cycle, every processor executes line LABEL:ln:snsNegRepSnap at least once. Therefore, the proof of invariant (iii) can follow similar arguments to the ones appearing in the proof of Lemma 4.2. Specifically, holds due to arguments that appear in the proof of Claim 4.5 with respect to the variables and the fields of and the structure .

The proof of invariant (iv) is implied by the fact that, within one asynchronous cycle, every processor executes line LABEL:ln:vcFix at least once, and the fact that is assigned to in line LABEL:ln:vcUpdate. Note that these are the only lines of code that assign values to and that the value of every entry in is non-decreasing (Claim 4.3).

By the definition of asynchronous cycles (Section 2.7.2), within one asynchronous cycle, reaches a suffix , such that every received message during was sent during . By repeating the previous argument, it holds that within asynchronous cycles, reaches a suffix in which for every received reply message, we have that its associated request message was sent during . Thus, is consistent.

The proof of Theorem 5.2 considers both complete and incomplete operations. We say that an operation is complete if it starts due to a step in which calls the operation (line LABEL:ln:operationSnapshot2) and its operation index, , is greater than any of ’s snapshot indices in the system state that appears immediately before . Otherwise, we say that it is incomplete.

Theorem 5.2 (Algorithm LABEL:alg:terminating’s termination and linearization)

Let be a consistent execution (as defined by Definition 5.1) with respect to some execution of Algorithm LABEL:alg:terminating. Suppose that there exists , such that in ’s second system state (which immediately follows ’s first step that may include a call to the operation in line LABEL:ln:operationSnapshot2) it holds that and . Within asynchronous cycles, the system reaches a state in which .

Proof. Lemmas 5.3, 5.7 and 5.10 prove the theorem. These lemmas use the function that we define next. Whenever ’s program counter is outside of the function , the function returns the value of . Otherwise, the function returns the value of .

Lemma 5.3 (Algorithm LABEL:alg:terminating’s termination — part I)

Let be a consistent execution (Definition 5.1) with respect to some execution of Algorithm LABEL:alg:terminating. Suppose that there exists , such that in ’s second system state (which immediately follows ’s first step that may include a call to the operation in line LABEL:ln:operationSnapshot2) it holds that and . Within asynchronous cycles, the system reaches a state in which either: (i) for any non-failing processor it holds that (line LABEL:ln:exceedDelta) and , (ii) any majority includes at least one , such that , or (iii) .

Proof. Towards a proof by contradiction, suppose that the lemma is false. That is, has a prefix that includes asynchronous cycles, such that none of the lemma's invariants hold during . The proof uses Claims 5.4 and 5.5 in order to demonstrate, in Claim 5.6, a contradiction with the above assumption.

Claim 5.4

does not include a step in which processor evaluates the if-statement condition in line LABEL:ln:prevEqReg to be true (or at least one of the lemma invariants holds).

Proof of claim. Arguments (1), (2) and (3) show that during processor calls the function . Argument (4) shows that this implies that invariant (ii) holds. Thus, we reach a contradiction with the assumption made at the beginning of the lemma's proof.

Argument (1): a call to ends within asynchronous cycles.  

A call to