Ares: Adaptive, Reconfigurable, Erasure coded, atomic Storage

Viveck Cadambe, EE Department, Pennsylvania State University, University Park, PA, USA (vxc12@engr.psu.edu)
Nicolas Nicolaou, Algolysis Ltd & KIOS Research and Innovation Center of Excellence, U. of Cyprus, Nicosia, Cyprus (nicolasn@cs.ucy.ac.cy)
Kishori M. Konwar, N. Prakash, Nancy Lynch, and Muriel Médard, Department of EECS, Massachusetts Institute of Technology, Cambridge, MA, USA (kishori@csail.mit.edu, prakashn@mit.edu, lynch@csail.mit.edu, medard@mit.edu)
Abstract

Atomicity, or strong consistency, is one of the most fundamental, intuitive, and yet hardest-to-provide primitives in distributed shared memory emulations. To ensure survivability, scalability, and availability of a storage service in the presence of failures, traditional approaches for atomic memory emulation in message-passing environments replicate the objects across multiple servers. Compared to replication-based algorithms, erasure code-based atomic memory algorithms have much lower storage and communication costs, but they are usually harder to design. The difficulty grows further when the set of servers may be changed, to ensure survivability of the service across software and hardware upgrades (scale-up or scale-down), while avoiding service interruptions. Atomic memory algorithms that support server reconfiguration in replicated systems are few and complex, and remain an active area of research; reconfiguration of erasure code-based algorithms has not been addressed at all.

In this work, we present Ares, an algorithmic framework that allows reconfiguration of the underlying servers and is particularly suitable for erasure code-based algorithms emulating atomic objects. Ares introduces new configurations while keeping the service available. For use with Ares, we also propose a new, and to our knowledge the first, two-round erasure code-based algorithm, treas, for emulating multi-writer, multi-reader (MWMR) atomic objects in asynchronous, message-passing environments, with near-optimal communication and storage costs. Our algorithms tolerate crash failures of any client and of a fraction of the servers, and yet guarantee both safety and liveness properties. Moreover, by combining the advantages of Ares and treas, we propose an optimized algorithm in which new configurations can be installed without the object values passing through the reconfiguration clients.


1 Introduction

With the rapid increase of computing power on portable devices, such as smartphones, laptops, and tablets, and the near-ubiquitous availability of Internet connectivity, our day-to-day lives are becoming increasingly dependent on Internet-based applications. Most of these applications rely on large volumes of data from a wide range of sources, and their performance improves with easy accessibility of the data. Today, data is gathered at an ever faster pace from numerous interconnected devices around the world. To keep abreast of this veritable tsunami of data, researchers in both industry and academia are racing to invent new ways to increase the capacity of durable, large-scale distributed storage systems, and to make the ingestion and retrieval of data more efficient. Currently, most data is stored in cloud-based storage offered by major providers such as Amazon, Dropbox, and Google. To store large amounts of data in an affordable manner, cloud vendors deploy hundreds to thousands of commodity machines, networked together to act as a single giant storage system. Component failures of commodity devices and network delays are the norm; therefore, ensuring consistent data access and availability at the same time is challenging. Vendors often address availability by replicating data across multiple servers, but this creates the problem of keeping those copies consistent, especially when they can be updated concurrently by different operations.


Atomic Distributed Storage. To solve this problem, researchers developed a series of consistency semantics that formally specify the behavior of concurrent operations. Atomicity, or strong consistency, is the strongest and most intuitive consistency semantic: it provides the illusion that a data object is accessed sequentially, even when operations may access different copies of that data object concurrently. In addition, atomic objects are composable [HW90, Lynch1996], enabling the creation of large shared memory systems from individual atomic data objects. Such large-scale shared memory services make application development much simpler. Finally, the strongest and most desirable form of availability or liveness of atomic data access is wait-freedom [Lynch1996], where each operation completes irrespective of the speeds of the clients.


Replication-based Atomic Storage. A long line of work uses replication of data across multiple servers to implement atomic read/write storage [FNP15, ABD96, CDGL04, FL03, FHN16, GNS08, GNS09, LS97]. Popular replication-based algorithms appear in the work of Attiya, Bar-Noy, and Dolev [ABD96] (which we refer to as the ABD algorithm) and in the work of Fan and Lynch [FL03] (referred to as the LDR algorithm). Replication-based strategies, however, incur high storage and communication costs. For example, to store 1,000,000 objects, each of size 1 MB (a total of 1 TB of data), on a system of n servers, the ABD algorithm replicates every object on all n servers, which blows up the worst-case storage cost to n TB. Additionally, a single write or read operation on a 1 MB object may need to transmit up to n MB of data, incurring a high communication cost.


Erasure Code-based Atomic Storage. To avoid the high storage and communication costs that stem from replication, erasure codes provide an alternative way to emulate fault-tolerant shared atomic storage. In comparison to replication, algorithms based on erasure codes significantly reduce both the storage and communication costs of the implementation. An [n, k] erasure code splits a value v of size 1 unit into k elements, each of size 1/k units, creates n coded elements, and stores one coded element per server. The size of each coded element is also 1/k units, and thus the total storage cost across the n servers is n/k units. A class of erasure codes known as Maximum Distance Separable (MDS) codes have the property that the value v can be reconstructed from any k out of the n coded elements. For example, in the storage scenario above, if we use an [n, k] MDS code with k = n/2, the storage cost is only 2 TB, i.e., n/2 times lower than the storage cost of ABD. A similar reduction in bandwidth per operation is possible in many erasure code-based algorithms for implementing atomicity. Motivated by these savings, several erasure code-based algorithms for implementing strong consistency in message-passing models have recently been proposed, both in theory [CT06, CadambeLMM17, DGL08, SODA2016, radon] and in practice [GIZA2017, Zhang2016]. However, the leap from replication-based to erasure code-based algorithms for strong consistency comes with the additional burden of synchronizing access to multiple coded elements belonging to the same version of the data object. This naturally leads to relatively complex algorithms.
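To make the comparison concrete, the following sketch (our illustration, not part of any algorithm in this paper; the function names are ours) computes the storage blow-up of replication versus an [n, k] MDS code for the example above.

```python
# Back-of-the-envelope storage comparison for the 1,000,000 x 1 MB example.

def replication_storage(n: int, total_tb: float = 1.0) -> float:
    """ABD-style replication: every one of the n servers stores a full copy."""
    return n * total_tb

def mds_storage(n: int, k: int, total_tb: float = 1.0) -> float:
    """[n, k] MDS code: each server stores one coded element of size 1/k."""
    return (n / k) * total_tb

n = 10
print(replication_storage(n))      # 10.0 TB across the system
print(mds_storage(n, k=n // 2))    # 2.0 TB, i.e., n/2 = 5x lower
```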


Reconfigurable Atomic Storage. Although replication and erasure codes help the system survive the failure of a subset of servers, they do not suffice to ensure the liveness of the service over long periods, during which a larger number of servers may fail. Reconfiguration is the process of adding or removing servers in a live system without affecting its normal operation. Performing reconfiguration without service interruption, however, is very challenging and not well understood even in replicated systems, and it remains an active area of research. RAMBO [LS02] and DynaStore [AKMS09] are two of the handful of algorithms [GM15, G03, LVM15, SMMK2010] that allow reconfiguration of live systems. Recently, the authors of [spiegelman:DISC:2017] presented a general algorithmic framework for consensus-free reconfiguration algorithms.

So far, all reconfiguration approaches, implicitly or explicitly, assume a replication-based system in each configuration. Therefore, none of the above methods can reconfigure from or to a configuration in which erasure codes are used. In particular, erasure code-based atomic memory algorithms require a fixed set of servers in a configuration; moving away from an existing coding scheme would thus entail deploying a new set of servers, or changing the set of servers in an existing configuration. In RAMBO [LS02], messages originating from a read or write operation are sent as part of an ongoing gossip protocol, which obscures the communication cost of each individual operation. In both RAMBO [LS02] and DynaStore [AKMS09], the client and server roles are combined, and therefore these algorithms are not immediately suitable for settings where clients and servers are separate processes, a commonly studied architecture both in theory [Lynch1996] and in practice [SNOW2016]. In DynaStore [AKMS09] and [spiegelman:DISC:2017], an active new configuration, generated by the underlying SpSn algorithm, may consist of a set of servers drawn from configurations proposed by multiple clients. These works assume that the service guarantees liveness as long as a majority of the servers in the active configurations are non-faulty. In erasure code-based algorithms, additional assumptions are required: for example, if some client proposes a configuration with a certain [n, k] MDS code, then having more than n servers in the resulting configuration increases the storage cost, while having fewer than n servers does not permit using the proposed code parameters at all. Moreover, in the algorithms of [AKMS09] and [spiegelman:DISC:2017], the configurations proposed by the clients are incremental changes, e.g., a suggestion to remove one server and a suggestion to add another. Therefore, some additional mechanism is necessary for a reconfiguration client to learn the total number of currently active servers before proposing its change, if it is to generate a configuration of a desired size.

Reconfigurations are usually initiated by system administrators [aguileratutorial], typically during system maintenance. Therefore, during the migration of the objects from one configuration to the next, it is highly likely that all stored objects are moved to the newer configuration at almost the same time. In the above algorithms, data transfer is carried out using the client as a conduit, creating a potential bottleneck. Although deploying proxy servers for reconfiguration is a possible ad hoc remedy, it suffers from the same bottleneck. In such situations, it is more reasonable for the data objects to migrate directly from one set of servers to the other.


Our Contributions. In this work, we present Ares, an algorithm that allows reconfiguration of the servers that emulate an atomic memory, is specifically suitable for atomic memory services that use erasure codes, and keeps the service available at all times. In order to keep Ares general, so that already known atomic memory algorithms can be adapted to its configurations, we introduce a new set of data primitives that (i) provide a modular framework for describing atomic read/write storage implementations, and (ii) make it easier to reason about an algorithm's correct implementation of atomic memory. Using these primitives, we prove the safety (atomicity) of any execution of Ares, including executions with ongoing reconfiguration operations. We also present treas, the first two-round erasure code-based MWMR algorithm, with cost-effective storage and communication costs, for emulating shared atomic memory in a message-passing environment in the presence of process crash failures. We prove both safety and liveness for treas. We then describe a new algorithm, Ares-treas, which uses a modified version of treas as the underlying atomic memory algorithm in every configuration and modifies Ares so that data is transferred directly from one configuration to another, thereby avoiding the reconfiguration client as a potential bottleneck.


Document Structure. The remainder of the manuscript is organized as follows. In Section 2, we present the model assumptions for our setting. In Section 3, we present our two-round erasure code-based algorithm for emulating shared atomic storage in the message-passing MWMR setting. In Section 4, we present the Ares framework for emulating shared atomic memory with erasure codes in a system that can undergo reconfiguration while remaining live. Section 4.4 provides a latency analysis of read, write, and reconfiguration operations. In Section 5, we describe the Ares-treas algorithm. Finally, in Section 6, we conclude.

2 Model and Definitions

A shared atomic storage can be emulated by composing individual atomic objects; therefore, we aim to implement a single atomic read/write memory object on a set of servers. Each data object takes a value from a set V. We assume a system consisting of four distinct sets of processes: a set W of writers, a set R of readers, a set G of reconfiguration clients, and a set S of servers. Let C = W ∪ R ∪ G be the set of clients. Servers host data elements (replicas or encoded data fragments). Each writer is allowed to modify the value of a shared object, and each reader is allowed to obtain the value of that object. Reconfiguration clients attempt to modify the set of servers in the system in order to mask transient errors and ensure the longevity of the service. Processes communicate via messages through asynchronous, reliable channels. Ares allows the set of server hosts to be modified during the course of an execution in order to mask transient or permanent failures of servers and preserve the longevity of the service.


Configuration. A configuration, identified by a unique identifier c, is a data type that explicitly describes: the set of identifiers of the servers that are part of it, denoted c.Servers; the set of quorums defined on c.Servers, denoted c.Quorums; an underlying algorithm that implements atomic memory in c, including its related parameters; and a consensus instance, denoted c.Con, with values from C, the set of all unique configuration identifiers, implemented on top of the servers in c.Servers. We refer to a server s ∈ c.Servers as a member of configuration c.


Liveness and Safety Conditions. The algorithms we present in this paper satisfy wait-free termination (liveness) and atomicity (safety). Wait-free termination [HS11] requires that any non-faulty process completes each operation in a finite number of steps, independent of the progress of any other process in the system.

An implementation A of a read/write object satisfies the atomicity property if the following holds [Lynch1996]. Let the set Π contain all complete operations in any well-formed execution of A. Then there exists an irreflexive partial ordering ≺ on the operations in Π satisfying the following:

  • For any operations π₁ and π₂ in Π, if π₁ completes before π₂ is invoked, then it cannot be the case that π₂ ≺ π₁.

  • If π is a write operation and π′ is any operation in Π, then either π ≺ π′ or π′ ≺ π.

  • The value returned by a read operation is the value written by the last preceding write operation according to ≺ (or the initial value if there is no such write).

Storage and Communication Costs. We are interested in the complexity of each read and write operation. The complexity of each operation π invoked by a process p is measured with respect to three metrics, during the interval between the invocation and the response of π: (i) the number of communication round-trips, counting the message exchanges performed during π, (ii) storage efficiency (storage cost), counting the maximum storage requirements for any single object at the servers during π, and (iii) message-bit complexity (communication cost), measuring the total length of the messages used during π.

We define the total storage cost as the size of the data stored across all servers, at any point during the execution of the algorithm. The communication cost associated with a read or write operation is the size of the total data transmitted in the messages sent as part of the operation. We assume that metadata, such as version numbers, process IDs, etc., used by various operations is of negligible size and is hence ignored in the calculation of storage and communication costs. Further, we normalize both costs with respect to the size of the value v; in other words, we compute the costs under the assumption that v has size 1 unit.


Background on Erasure Coding. In treas, we use an [n, k] linear MDS code [verapless_book] over a finite field F_q to encode and store the value v among the n servers. An [n, k] MDS code has the property that any k out of the n coded elements can be used to recover (decode) the value v. For encoding, v is divided into k elements v₁, v₂, …, v_k, each of size 1/k (assuming the size of v is 1). (In practice, v is a file that is divided into many stripes based on the choice of the code; the stripes are individually encoded and stacked against each other. We omit the details of representing v as a sequence of symbols of F_q and of data striping, since these are fairly standard in the coding-theory literature.) The encoder takes the k elements as input and produces n coded elements e₁, e₂, …, e_n as output, i.e., [e₁, …, e_n] = Φ([v₁, …, v_k]), where Φ denotes the encoder. For ease of notation, we simply write Φ(v) to mean [e₁, …, e_n]. The vector [e₁, …, e_n] is referred to as the codeword corresponding to the value v. Each coded element e_i also has size 1/k. In our scheme we store one coded element per server. We use Φ_i to denote the projection of Φ onto the i-th output component, i.e., e_i = Φ_i(v). Without loss of generality, we associate the coded element e_i with server s_i, for 1 ≤ i ≤ n.
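For intuition, here is a toy Python sketch of one classical MDS construction, a Reed-Solomon style code over a prime field: coded element e_i is the evaluation at point i of the unique degree-below-k polynomial through the k message symbols, so any k coded elements determine the polynomial and hence the value. This only illustrates the Φ and Φ_i maps described above; the field, points, and API are our assumptions, and a real system would use an optimized coding library with proper data striping.

```python
P = 2**31 - 1  # a Mersenne prime, so arithmetic mod P forms a field

def _lagrange_eval(points, x):
    """Evaluate, at x, the unique degree < len(points) polynomial through
    the given (xi, yi) pairs, via Lagrange interpolation mod P."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(value, n):
    """Phi(v): map the k message symbols to n coded elements e_1..e_n."""
    msg_points = list(enumerate(value))           # symbols sit at points 0..k-1
    return [_lagrange_eval(msg_points, x) for x in range(n)]

def decode(coded, k):
    """Recover v from any k (index, coded_element) pairs."""
    pts = coded[:k]
    return [_lagrange_eval(pts, x) for x in range(k)]

v = [12, 345, 6789]                               # k = 3 message symbols
e = encode(v, n=5)                                # n = 5 coded elements
assert decode(list(enumerate(e))[2:], 3) == v     # any 3 of the 5 suffice
```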


Tags. A tag τ is defined as a pair (z, w), where z ∈ ℕ and w ∈ W is the ID of a writer. Let T be the set of all tags. Notice that tags could be defined over any totally ordered domain; since this domain is countably infinite, there is a direct mapping to the domain we assume. For any τ₁, τ₂ ∈ T, we define τ₂ > τ₁ if (i) τ₂.z > τ₁.z, or (ii) τ₂.z = τ₁.z and τ₂.w > τ₁.w.
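Since tags are compared lexicographically, they map directly onto tuple comparison; a minimal sketch (our illustration, assuming writer IDs are themselves totally ordered, e.g., strings):

```python
from typing import NamedTuple

class Tag(NamedTuple):
    z: int    # integer counter, incremented by each write
    w: str    # writer ID, breaks ties between tags with equal z

assert Tag(3, "w2") > Tag(3, "w1")   # equal z: the writer ID decides
assert Tag(4, "w1") > Tag(3, "w2")   # otherwise the larger z wins
```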

2.1 The Data-Access Primitives

In the next section, we present the treas algorithm using three data access primitives (DAPs), each associated with a configuration c: c.get-tag(), c.get-data(), and c.put-data(⟨τ, v⟩). Assuming a set of totally ordered tags T, a value domain V of the distributed atomic object, and a set of configuration identifiers C, the three primitives are defined over a configuration c ∈ C, a tag τ ∈ T, and a value v ∈ V as follows:

Definition 1 (Data Access Primitives).

Given a configuration identifier c ∈ C, any non-faulty client process may invoke the following data access primitives during an execution ξ, where c is added to specify the configuration-specific implementation of the primitives:

  1. c.get-tag(): returns a tag τ ∈ T

  2. c.get-data(): returns a tag-value pair ⟨τ, v⟩ ∈ T × V

  3. c.put-data(⟨τ, v⟩): takes a tag-value pair ⟨τ, v⟩ ∈ T × V as argument.

operation read()
2:       ⟨τ, v⟩ ← c.get-data()
         c.put-data(⟨τ, v⟩)
4:       return v
end operation
6:
operation write(v)
8:       τ ← c.get-tag()
         τ_w ← ⟨τ.z + 1, w⟩
10:      c.put-data(⟨τ_w, v⟩)
end operation
Algorithm 1 Read and write operations of the generic algorithm A₁.

A number of known tag-based algorithms that implement atomic read/write objects (e.g., ABD, Fast – see Appendix A.1) can be expressed in terms of DAPs. In particular, many such algorithms can be transformed into an algorithmic template, say A₁, presented in Alg. 1. In brief, a read operation in A₁ performs c.get-data() to retrieve a tag-value pair from configuration c, and then performs c.put-data() to propagate that pair back to configuration c. A write operation is similar to the read, except that before performing the put-data action it generates a new tag, which it associates with the value to be written. It can be shown (see Appendix A) that an algorithm described in the form of A₁ satisfies atomic guarantees and liveness if the DAPs satisfy the following consistency properties:

Definition 2 (DAP Consistency Properties).

In an execution ξ we say that a DAP operation is complete if both its invocation and its matching response step appear in ξ. If Π is the set of complete DAP operations in execution ξ, then for any φ, ψ ∈ Π:

  1. If φ is a c.put-data(⟨τ, v⟩), for c ∈ C and ⟨τ, v⟩ ∈ T × V, and ψ is a c.get-tag() (or c.get-data()) that returns τ′ (or ⟨τ′, v′⟩) and φ completes before ψ is invoked in ξ, then τ′ ≥ τ.

  2. If ψ is a c.get-data() that returns ⟨τ, v⟩, then there exists φ such that φ is a c.put-data(⟨τ, v⟩) and ψ did not complete before the invocation of φ; if no such φ exists in ξ, then ⟨τ, v⟩ is equal to the initial pair ⟨τ₀, v₀⟩.

Expressing an atomic algorithm in terms of the DAP primitives allows one to achieve a modular design for tag-based atomic object algorithms, serving multiple purposes. First, describing an algorithm according to such templates (like treas in Section 3) allows one to prove that the algorithm is safe (atomic) by just showing that the appropriate DAP properties hold, and that the algorithm is live if the implementation of each primitive is live. Second, the safety and liveness proofs for more complex algorithms (like Ares in Section 4) become easier, as one may reason about the DAP properties satisfied by the primitives used, without involving the underlying implementation of those primitives. Last but not least, describing a reconfigurable algorithm using DAPs provides the flexibility to use a different implementation of the DAPs in each configuration, as long as the DAPs used within a single configuration satisfy the appropriate consistency properties; this is what makes our algorithm adaptive.
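To illustrate the modularity argument, the following Python sketch renders the template A₁ against an abstract DAP interface; any object implementing the three primitives so as to satisfy Definition 2 could be plugged in. The interface and names are our own illustration, not the paper's notation.

```python
from abc import ABC, abstractmethod
from typing import Tuple

Tag = Tuple[int, str]   # (z, writer_id), ordered lexicographically

class Configuration(ABC):
    """The three DAPs a configuration must expose (Definition 1)."""
    @abstractmethod
    def get_tag(self) -> Tag: ...
    @abstractmethod
    def get_data(self) -> Tuple[Tag, bytes]: ...
    @abstractmethod
    def put_data(self, tag: Tag, value: bytes) -> None: ...

def read(c: Configuration) -> bytes:
    tag, value = c.get_data()     # retrieve the latest tag-value pair
    c.put_data(tag, value)        # propagate it before returning (Alg. 1)
    return value

def write(c: Configuration, writer_id: str, value: bytes) -> None:
    z, _ = c.get_tag()                      # discover the maximum tag
    c.put_data((z + 1, writer_id), value)   # bind value to a strictly newer tag
```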

3 treas: A new two-round erasure-code based algorithm

In this section, we present the first two-round erasure code-based algorithm for implementing an atomic memory service, which we call treas. The algorithm uses an [n, k] MDS code for storage, and we implement an instance of the algorithm in a configuration c of n server processes.

The read and write operations of algorithm treas follow the template A₁ (Alg. 1); the DAP primitives are implemented in Alg. 2, and the servers' responses in Alg. 3. At a high level, both the read and write operations take two phases to complete (similar to the ABD algorithm). As in algorithm A₁, a write operation π discovers the maximum tag τ from a quorum in c by executing c.get-tag(); it creates a new tag τ_w = ⟨τ.z + 1, w⟩ incorporating the writer's own ID; and it performs c.put-data(⟨τ_w, v⟩) to propagate that pair to configuration c. A read operation performs c.get-data() to retrieve a tag-value pair from configuration c, and then performs c.put-data() to propagate that pair back to the servers of c.

at each process p ∈ C
2:
procedure c.get-tag()
4:       send (QUERY-TAG) to each s ∈ c.Servers
         until p receives τ_s from ⌈(n+k)/2⌉ servers in c.Servers
6:       τ_max ← max({τ_s : τ_s received})
         return τ_max
8:end procedure
10:procedure c.get-data()
         send (QUERY-LIST) to each s ∈ c.Servers
12:      until p receives List_s from each server s in some S_g ⊆ c.Servers with |S_g| = ⌈(n+k)/2⌉
         Tags* ← set of tags that appear in at least k of the lists
14:      Tags*_dec ← set of tags that appear in at least k of the lists with values
         τ* ← max(Tags*)
16:      τ*_dec ← max(Tags*_dec)
         if τ*_dec = τ* then
18:             v ← decode the value for τ*_dec
         end if
20:      return ⟨τ*_dec, v⟩
end procedure
22:
procedure c.put-data(⟨τ, v⟩)
24:      code-elems ← [(τ, e₁), …, (τ, e_n)], where e_i = Φ_i(v)
         send (PUT-DATA, ⟨τ, e_i⟩) to each s_i ∈ c.Servers
26:      until p receives ack from ⌈(n+k)/2⌉ servers in c.Servers
end procedure
Algorithm 2 The protocols for the DAP primitives of template A₁ used to implement treas.
at each server s_i ∈ c.Servers in configuration c
2:
State Variables:
4:List, a set of (tag, coded-element) pairs, initially {(τ₀, Φ_i(v₀))}
6:Upon receive (QUERY-TAG) from q
         τ_max ← max({τ : (τ, e) ∈ List})
8:       send τ_max to q
end receive
10:
Upon receive (QUERY-LIST) from q
12:      send List to q
end receive
14:
Upon receive (PUT-DATA, ⟨τ, e_i⟩) from q
16:      List ← List ∪ {(τ, e_i)}
         if |List| > δ + 1 then
18:             τ_min ← min({τ : (τ, e) ∈ List and e ≠ ⊥})
20:             // remove the coded value and retain the tag
                List ← (List \ {(τ_min, e)}) ∪ {(τ_min, ⊥)}
22:
         end if
24:      send ack to q
end receive
Algorithm 3 The response protocols at any server s_i in treas for client requests.

To facilitate the use of erasure codes, each server s_i stores one state variable, List, a set of up to (δ + 1) (tag, coded-element) pairs, where δ is a concurrency parameter defined precisely in Section 3.1. Initially the List at s_i contains a single element, List = {(τ₀, Φ_i(v₀))}. Given this set, we can now describe the implementation of the DAPs.

A client, during the execution of a get-tag primitive, queries all the servers in c.Servers for the highest tags in their Lists, and awaits responses from ⌈(n+k)/2⌉ servers. A server, upon receiving a QUERY-TAG request, responds to the client with the highest tag in its List, τ_max = max{τ : (τ, e) ∈ List}. Once the client receives the tags from ⌈(n+k)/2⌉ servers, it selects the highest tag τ_max among them and returns it.

During the execution of the primitive c.put-data(⟨τ, v⟩), a client sends the pair ⟨τ, Φ_i(v)⟩ to each server s_i ∈ c.Servers. Every time a PUT-DATA message ⟨τ, e_i⟩ arrives at a server s_i from a writer, the pair is added to the server's List. As the size of the List at each s_i is bounded by (δ + 1), an insertion into the List triggers a trimming of the coded elements associated with the smallest tags. In particular, s_i replaces the coded elements of the older tags with ⊥, and maintains only the coded elements associated with the (δ + 1) highest tags in the List (see Alg. 3, line 20). The client completes the primitive after receiving acks from ⌈(n+k)/2⌉ servers.
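The trimming rule can be sketched in a few lines of Python (our illustration; the List is modeled as a dict from tags to coded elements, with None playing the role of ⊥):

```python
def apply_put_data(lst, tag, coded_element, delta):
    """Insert the pair, then keep coded elements only for the (delta + 1)
    highest tags; older tags stay in the list but lose their element."""
    lst[tag] = coded_element
    with_values = sorted(t for t, e in lst.items() if e is not None)
    for t in with_values[:-(delta + 1)]:   # all but the delta+1 highest
        lst[t] = None                       # retain the tag, drop the element
    return lst
```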

A client, during the execution of a get-data primitive, queries all the servers in c.Servers for their local List variables, and awaits responses from ⌈(n+k)/2⌉ servers. Once the client receives Lists from ⌈(n+k)/2⌉ servers, it selects the highest tag τ such that: (i) its corresponding value v is decodable from the coded elements in the lists, i.e., at least k of the lists contain a coded element for τ; and (ii) τ is the highest tag seen in at least k of the lists (see Alg. 2, lines 13-16); it then returns the pair ⟨τ, v⟩. Note that if either of the above conditions is not satisfied, the corresponding read operation does not complete.
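The reader's selection rule can be sketched as follows (our illustration; each List is modeled as a dict from tags to coded elements, None standing for a trimmed element):

```python
from collections import Counter

def select_readable_tag(lists, k):
    """Return the tag to decode, or None when the two maxima differ
    and the read cannot complete yet."""
    seen = Counter(t for lst in lists for t in lst)
    withelem = Counter(t for lst in lists for t, e in lst.items() if e is not None)
    star = [t for t, cnt in seen.items() if cnt >= k]           # in >= k lists
    decodable = [t for t, cnt in withelem.items() if cnt >= k]  # with >= k elements
    if not star or not decodable:
        return None
    t_star, t_dec = max(star), max(decodable)
    return t_dec if t_dec == t_star else None
```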


Storage and Communication Costs for treas. We now briefly present the storage and communication costs associated with treas. Due to space limitations the proofs appear in Appendix B. Recall that, by our assumptions, the storage cost counts the size (in bits) of the coded elements stored in the List variable at each server; we ignore the storage cost due to metadata and temporary variables. For the communication cost, we measure the bits sent on the wire between the nodes.

Theorem 3.

The treas algorithm has: (i) total storage cost at most (δ + 2)·(n/k), (ii) communication cost for each write at most n/k, and (iii) communication cost for each read at most (δ + 2)·(n/k).
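For a feel of the bounds, here is a sketch using numbers of our own choosing, in units of the object size (per the normalization of Section 2):

```python
n, k, delta = 10, 5, 4
print((delta + 2) * n / k)   # storage across all servers: 12.0 units
print(n / k)                 # data sent per write:          2.0 units
print((delta + 2) * n / k)   # data sent per read:          12.0 units
# ABD-style replication would instead store n = 10 full copies regardless
# of delta; the coded costs shrink as k grows relative to delta.
```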

3.1 Safety, Liveness and Performance cost of treas

In this section we are concerned with only one configuration c, consisting of a set of servers S, a set of readers R, and a set of writers W. In other words, in this static system the sets S, R, and W are fixed, and at most f servers may crash. Below we prove Lemma 5, which shows that the DAP implementation of treas satisfies the consistency properties of Definition 2, and which in turn implies atomicity.

3.1.1 Correctness and Liveness

We now show that if the implemented primitives satisfy the DAP consistency properties, then algorithm A₁ implements an atomic read/write object; that is, A₁ satisfies atomic guarantees and liveness whenever the DAPs it uses satisfy the consistency properties of Definition 2.

Theorem 4 (Atomicity of A₁).

Suppose the DAP implementation satisfies consistency properties 1 and 2 of Definition 2. Then any execution ξ of the atomicity protocol A₁ on a configuration c is atomic, and live if the DAPs are live in ξ.

Lemma 5.

The data access primitives, i.e., the get-tag, get-data, and put-data primitives, as implemented in the treas algorithm, satisfy the consistency properties of Definition 2.

Theorem 6 (Atomicity).

Any execution of treas is atomic.

Definition 7 (Valid read operations).

A read operation π is called a valid read if the associated reader remains alive until the reception of the ⌈(n+k)/2⌉ responses during the get-data phase.

Definition 8 (Writes concurrent with a valid read).

Consider a valid read operation π. Let T₁(π) denote the point of initiation of π, and let T₂(π) denote the earliest point in time during the execution when the associated reader receives all ⌈(n+k)/2⌉ responses. Consider the set Σ = {σ : σ is a write operation that completes before π is invoked}, and let τ* = max{tag(σ) : σ ∈ Σ}. Next, consider the set Λ = {λ : λ is a write operation that starts before T₂(π) with tag(λ) > τ*}. We define the number of writes concurrent with the valid read operation π to be the cardinality |Λ|.

The above definition captures all the write operations that overlap with the read, up to the time the reader has all the data needed to attempt decoding a value. However, it ignores those write operations that started in the past and never completed, provided their tags are smaller than that of some write that completed before the read started. This allows us to exclude, while counting concurrency, the write operations of failed writers, as long as each failed write is followed by a successful write that completed before the read started.

Theorem 9 (Liveness).

Let ξ denote a well-formed execution of treas with n servers, of which at most f may crash, and with code dimension k ≤ n − 2f, and let δ be the maximum number of write operations concurrent with any valid read operation. Then the read and write operations of treas are live.

4 Algorithm Framework for Ares

In this section, we provide the description of an atomic reconfigurable read/write storage service, which we call Ares. In the presentation of the Ares algorithm we decouple the reconfiguration service from the shared memory emulation by utilizing the data access primitives presented in Section 2.1. This allows Ares both to handle the reorganization of the servers that host the data and to utilize a different atomic memory implementation per configuration. It is also important to note that Ares adopts a client-server architecture and separates the reader, writer, and reconfiguration processes from the server processes that host the object data.

In the rest of the section we first provide the specification of the reconfiguration mechanism used in Ares, along with the properties that this service offers. Then, we discuss the implementation of the read and write operations, and how they utilize the reconfiguration service to ensure atomicity even when read/write operations are concurrent with reconfiguration operations. The read and write operations are described in terms of the data access primitives of Section 2.1, and we show that if the DAP consistency properties are satisfied, then Ares preserves atomicity. This allows Ares to deploy the DAP transformation of any atomic read/write algorithm without compromising consistency.

4.1 Implementation of the Reconfiguration Service

In this section, we describe the reconfiguration service used in Ares, where reconfiguration clients introduce new configurations. In our setting, we assume that throughout an execution of Ares, every configuration is attempted to be reconfigured at most once. Multiple clients may concurrently attempt to introduce different configurations for the same index in the configuration sequence; Ares uses consensus to resolve such conflicts. In particular, each configuration c is associated with an external consensus service, denoted c.Con, that runs on a subset of servers in the configuration. We use a status value from the set {F, P}, associated with a configuration c, to denote whether c is finalized (F) or still pending (P). Each reconfigurer may change the system configuration by introducing a new configuration identifier. The data type configuration sequence is an array of pairs ⟨cfg, status⟩, where cfg ∈ C and status ∈ {F, P}. We denote each such pair by a caret over a variable name, e.g., ĉ, and refer to its members as ĉ.cfg and ĉ.status.

The service relies on an underlying sequence of configurations in the form of a "distributed list", the global configuration sequence G_L, to whose end new configurations can be appended. In any configuration c, every server in c.Servers maintains a variable nextC ∈ (C ∪ {⊥}) × {F, P}, recording the configuration that follows c in G_L; initially, at any server, nextC = ⟨⊥, P⟩, and once nextC.cfg is set to a value in C it is never altered. We use the notation |cseq| to denote the length of a configuration sequence array. For any configuration c, at any point in time, all servers in c.Servers whose nextC.cfg is not ⊥ hold the same value of nextC.cfg. At any point in an execution of Ares, we say that c → c′ is a link in G_L if, at that point, some server in c.Servers has nextC.cfg = c′.
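A small sketch of this pointer structure (our modeling; types and names are illustrative):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NextC:
    cfg: Optional[str] = None   # identifier of the following configuration (None = bot)
    status: str = "P"           # "P" (pending) or "F" (finalized)

@dataclass
class ServerState:
    next_c: NextC = field(default_factory=NextC)   # initially <bot, P>

def is_link(servers_of_c: List[ServerState], c_next: str) -> bool:
    """c -> c_next is a link in the global sequence once some server
    of c has recorded c_next in its next_c pointer."""
    return any(s.next_c.cfg == c_next for s in servers_of_c)
```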

The reconfiguration service consists of two major components: the sequence traversal, responsible for discovering the latest state of the global configuration sequence G_L, and the reconfiguration operation reconfig(c), which installs a new configuration c.

procedure read-config(seq)
2:       μ ← max({j : seq[j].status = F})
         ĉ ← seq[μ]
4:       while ĉ.cfg ≠ ⊥ do
                ĉ ← get-next-config(ĉ.cfg)
6:              if ĉ.cfg ≠ ⊥ then
                        seq[μ + 1] ← ĉ
8:                      put-config(seq[μ].cfg, ĉ)
                        μ ← μ + 1
10:             end if
         end while
12:      return seq
end procedure
14:
procedure get-next-config(c)
16:      send (READ-CONFIG) to each s ∈ c.Servers
         until ∃Q ∈ c.Quorums s.t. the caller receives nextC_s from each s ∈ Q
18:      if ∃s ∈ Q : nextC_s.status = F then
                return nextC_s
20:      else if ∃s ∈ Q : nextC_s.cfg ≠ ⊥ then
                return nextC_s
22:      else
                return ⟨⊥, P⟩
24:      end if
end procedure
26:
procedure put-config(c, ĉ)
28:      send (WRITE-CONFIG, ĉ) to each s ∈ c.Servers
         until ∃Q ∈ c.Quorums s.t. the caller receives ack from each s ∈ Q
30:end procedure
Algorithm 4 Sequence traversal at each process p of algorithm Ares.

Sequence Traversal. Any read/write/reconfig operation utilizes the sequence traversal mechanism to discover the latest state of the global configuration sequence, and to ensure that this state is discoverable by any subsequent operation. The sequence traversal consists of three actions: get-next-config(c), put-config(c, ĉ), and read-config(seq). We present their specifications and implementations as follows (Alg. 4):

get-next-config(c): The action returns the configuration that follows c in G_L. During get-next-config, the client sends READ-CONFIG messages to all the servers in c.Servers. Once a server receives such a message, it responds with the value of its nextC variable. Once the client receives replies from a quorum in c.Quorums, then, if some reply contains a nextC with nextC.cfg ≠ ⊥, the action returns that nextC (preferring one with status F); otherwise it returns ⟨⊥, P⟩.

put-config(c, ĉ): The action propagates ĉ to the servers in c.Servers. During the action, the client sends WRITE-CONFIG messages containing ĉ to the servers in c.Servers and waits for each server in some quorum Q ∈ c.Quorums to respond.

read-config(seq): A read-config(seq) sequentially traverses the configurations in seq in order to discover the latest state of the sequence G_L. At invocation, the client starts with the last finalized configuration seq[μ] in seq (line Alg. 4:2) and enters a loop that traverses the sequence by invoking get-next-config, which returns the next configuration ĉ. If ĉ.cfg ≠ ⊥, then: (a) ĉ is appended at the end of the sequence seq; (b) a put-config is invoked to inform a quorum of servers in the previous configuration to update the value of their nextC variable to ĉ; and (c) the index μ is incremented. When the loop terminates, the action returns seq.
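Putting the three actions together, the traversal loop looks roughly as follows (a sketch; get_next_config and put_config stand in for the quorum-based actions of Alg. 4, and entries carry .cfg and .status fields as in the earlier sketch):

```python
def read_config(cseq, get_next_config, put_config):
    """Extend cseq until no successor configuration is discovered."""
    mu = max(i for i, e in enumerate(cseq) if e.status == "F")
    current = cseq[mu]
    while True:
        nxt = get_next_config(current.cfg)   # query a quorum of current
        if nxt is None:                      # no successor known: done
            return cseq
        cseq.append(nxt)                     # (a) append the discovery
        put_config(current.cfg, nxt)         # (b) tell the previous config
        current = nxt                        # (c) advance the cursor
```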


Server Protocol. Each server responds to requests from clients (Alg. 6). A server waits for two types of messages: READ-CONFIG and WRITE-CONFIG. When a READ-CONFIG message is received for a particular configuration c with s ∈ c.Servers, the server returns its nextC variable. A WRITE-CONFIG message attempts to update the nextC variable of the server with a particular tuple cfgT_in. A server changes the value of its local nextC in two cases: (i) nextC.cfg = ⊥, i.e., it has not yet learned of any following configuration, or (ii) cfgT_in.status = F, i.e., the incoming tuple is finalized.
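The update rule itself is a one-liner (a sketch; tuples (cfg, status) model nextC, with None standing for ⊥):

```python
def on_write_config(next_c, incoming):
    """Adopt the incoming tuple only if nothing is recorded yet, or if
    the incoming tuple is finalized; an F status may overwrite a P one,
    never the reverse."""
    if next_c[0] is None or incoming[1] == "F":
        next_c = incoming
    return next_c   # followed by an ack to the sender
```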

at each reconfigurer rc
2:State Variables:
cseq, an array of elements ⟨cfg, status⟩ ∈ C × {F, P}, with members:
4:Initialization:
cseq[0] ← ⟨c₀, F⟩
6:
operation reconfig(c)
8:       if c ≠ ⊥ then
                cseq ← read-config(cseq)
10:             cseq ← add-config(cseq, c)
                update-config(cseq)
12:             cseq ← finalize-config(cseq)
         end if
14:end operation
16:procedure add-config(cseq, c)
         ν ← last index of cseq
18:      d ← cseq[ν].cfg.Con.propose(c)
         cseq[ν + 1] ← ⟨d, P⟩
20:      put-config(cseq[ν].cfg, ⟨d, P⟩)
         return cseq
22:end procedure
24:
procedure update-config(cseq)
26:      μ ← max({j : cseq[j].status = F})
         ν ← last index of cseq
28:      M ← ∅
         for i = μ : ν do
30:             ⟨τ, v⟩ ← cseq[i].cfg.get-data()
                M ← M ∪ {⟨τ, v⟩}
32:      end for
         ⟨τ_max, v_max⟩ ← the pair with the maximum tag in M
34:      cseq[ν].cfg.put-data(⟨τ_max, v_max⟩)
end procedure
36:
procedure finalize-config(cseq)
38:      ν ← last index of cseq
         cseq[ν].status ← F
40:      put-config(cseq[ν − 1].cfg, cseq[ν])
         return cseq
42:end procedure
Algorithm 5 Reconfiguration protocol of algorithm Ares.
at each server s in configuration c_k
2:State Variables:
τ ∈ T, initially ⟨0, ⊥⟩
4:v ∈ V, initially ⊥
nextC ∈ (C ∪ {⊥}) × {F, P}, initially ⟨⊥, P⟩
6:
Upon receive (READ-CONFIG) from q
8:       send nextC to q
end receive
10:
Upon receive (WRITE-CONFIG, cfgT_in) from q
12:      if nextC.cfg = ⊥ or cfgT_in.status = F then
                nextC ← cfgT_in
14:      end if
         send ack to q
16:end receive
Algorithm 6 Server protocol of algorithm Ares.

Reconfiguration operation. A reconfiguration operation reconfig(c), c ∈ C, invoked by a non-faulty reconfiguration client rc, attempts to append c to G_L. The operation consists of four phases, executed consecutively by rc: read-config, add-config, update-config, and finalize-config.

read-config: The read-config phase at rc reads the recent global configuration sequence, starting from some initial guess cseq of the sequence. As described above, the read-config action completes the traversal and returns a possibly extended configuration sequence to rc.

add-config: The add-config phase attempts to append a new configuration c to the end of cseq (rc's approximation of G_L). Suppose the last configuration in cseq is c_ν. In order to decide the configuration that follows c_ν, rc invokes c_ν.Con.propose(c) on the consensus object associated with configuration c_ν, which returns a decided configuration identifier d. If d ≠ c, this implies that another (possibly concurrent) reconfiguration operation, invoked by a reconfigurer rc′, proposed d, and d succeeded as the configuration to follow c_ν. In this case, rc adopts d as its own proposed configuration, adding ⟨d, P⟩ to the end of its local cseq (entirely ignoring c) using the put-config(c_ν, ⟨d, P⟩) operation, and returns the extended configuration sequence.

update-config: Let us denote by μ the index of the last configuration in cseq at rc whose status is F, and by ν the last index of cseq. Next, rc invokes update-config(cseq), which gathers the tag-value pair corresponding to the maximum tag in each of the configurations cseq[i].cfg for μ ≤ i ≤ ν, and transfers that pair to the configuration cseq[ν].cfg that was added by the add-config action. The get-data and put-data actions are implemented with respect to the atomic algorithm used in each of the configurations accessed. Suppose ⟨τ_max, v_max⟩ is the tag-value pair corresponding to the highest tag among the responses from all the configurations. Then ⟨τ_max, v_max⟩ is written to the configuration cseq[ν].cfg via the invocation of cseq[ν].cfg.put-data(⟨τ_max, v_max⟩).
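In code form, the transfer reduces to a max over tag-value pairs (a sketch; cfg.get_data and cfg.put_data stand for the per-configuration DAPs, and entries carry .cfg and .status as in the earlier sketches):

```python
def update_config(cseq):
    """Move the freshest tag-value pair into the newly added configuration."""
    mu = max(i for i, e in enumerate(cseq) if e.status == "F")
    pairs = [cseq[i].cfg.get_data() for i in range(mu, len(cseq))]
    tag, value = max(pairs, key=lambda p: p[0])   # the highest tag wins
    cseq[-1].cfg.put_data(tag, value)
```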

Figure 1: An illustration of an execution of the reconfiguration process.

finalize-config: Once the tag-value pair is transferred, in the last phase of the reconfiguration operation, rc executes finalize-config(cseq) to update the status of the last configuration in cseq, i.e., cseq[ν] = ⟨c_ν, P⟩, to ⟨c_ν, F⟩. rc informs a quorum of servers in the previous configuration, i.e., in some Q ∈ cseq[ν − 1].cfg.Quorums, about the change of status, by executing the put-config action.

Fig. 1 illustrates an example execution of a reconfiguration operation reconfig(c). In this example, the reconfigurer rc goes through a number of configuration queries (get-next-config) before it reaches a configuration in which a quorum of servers replies with nextC = ⟨⊥, P⟩. There it proposes c to the consensus object of that configuration (arrow 10 in the figure), and once c is decided, reconfig(c) completes after executing finalize-config.

Write Operation
at each writer w
2:State Variables:
cseq, an array of elements ⟨cfg, status⟩ ∈ C × {F, P}, with members:
4:Initialization:
cseq[0] ← ⟨c₀, F⟩
6:
operation write(val), val ∈ V
8:       cseq ← read-config(cseq)
         μ ← max({j : cseq[j].status = F})
10:      ν ← last index of cseq
         for i = μ : ν do
12:             τ_max ← max(τ_max, cseq[i].cfg.get-tag())
         end for
14:      ⟨τ, v⟩ ← ⟨⟨τ_max.z + 1, w⟩, val⟩
         done ← false
16:      while not done do
                cseq[ν].cfg.put-data(⟨τ, v⟩)
18:             cseq ← read-config(cseq)
                if ν = last index of cseq then
20:                    done ← true
                else
22:                    ν ← last index of cseq
                end if
24:      end while
end operation

Read Operation
26:at each reader r
State Variables:
28:cseq, an array of elements ⟨cfg, status⟩ ∈ C × {F, P}, with members:
Initialization:
30:cseq[0] ← ⟨c₀, F⟩
32:operation read()
        cseq ← read-config(cseq)
34:     μ ← max({j : cseq[j].status = F})
        ν ← last index of cseq
36:     for i = μ : ν do
                ⟨τ, v⟩ ← the pair with the maximum tag among ⟨τ, v⟩ and cseq[i].cfg.get-data()
38:     end for
        done ← false
40:     while not done do
                cseq[ν].cfg.put-data(⟨τ, v⟩)
42:             cseq ← read-config(cseq)
                if ν = last index of cseq then
44:                    done ← true
                else
46:                    ν ← last index of cseq
                end if
        end while
48:     return v
end operation