Unleashing and Speeding Up Readers
in Atomic Object Implementations
Providing efficient emulations of atomic read/write objects in asynchronous, crash-prone, message-passing systems is an important problem in distributed computing. Communication latency typically dominates the performance of message-passing systems; consequently, the efficiency of algorithms implementing atomic objects is measured in terms of the number of communication exchanges involved in each read and write operation. The seminal result of Attiya, Bar-Noy, and Dolev established that two pairs of communication exchanges, or equivalently two round-trip communications, are sufficient. Subsequent research examined the possibility of implementations that involve fewer than four exchanges. The work of Dutta et al. showed that for single-writer/multiple-reader (SWMR) settings two exchanges are sufficient, provided that the number of readers is severely constrained with respect to the number of object replicas in the system and the number of replica failures, and also showed that no two-exchange implementations of multiple-writer/multiple-reader (MWMR) objects are possible. Later research focused on providing implementations that remove the constraint on the number of readers, while having read and write operations that use a variable number of communication exchanges, specifically two, three, or four exchanges.
This work presents two advances in the state of the art in this area. Specifically, for SWMR and MWMR systems, algorithms are given in which read operations take two or three exchanges. This improves on prior works where read operations took either three exchanges, or two or four exchanges. The number of readers in the new algorithms is unconstrained, and write operations take the same number of exchanges as in prior work (two for SWMR and four for MWMR settings). The correctness of the algorithms is rigorously argued. The paper presents an empirical study using the NS3 simulator that compares the performance of relevant algorithms, demonstrates the practicality of the new algorithms, and identifies settings in which their performance is clearly superior.
Emulating atomic [Lamport79] (or linearizable [HW90]) read/write objects in asynchronous, message-passing environments with crash-prone processors is a fundamental problem in distributed computing. To cope with processor failures, distributed object implementations use redundancy by replicating the object at multiple network locations. Replication masks failures; however, it introduces the problem of consistency because operations may access different object replicas, possibly containing obsolete values. Atomicity is the most intuitive consistency semantic as it provides the illusion of a single-copy object that serializes all accesses such that each read operation returns the value of the latest preceding write operation.
Background and Prior Work. The seminal work of Attiya, Bar-Noy, and Dolev [ABD96] provided an algorithm, colloquially referred to as ABD, that implements SWMR (Single Writer, Multiple Reader) atomic objects in message-passing crash-prone asynchronous environments. Operations are ordered using logical timestamps associated with each value. Operations terminate provided some majority of replica servers does not crash. Each write takes a single communication round trip consisting of two communication exchanges. Each read operation takes two rounds consisting of four communication exchanges. Subsequently, Lynch et al. [LS97] showed how to implement MWMR (Multi-Writer, Multi-Reader) atomic memory where both read and write operations take two communication round trips, for a total of four exchanges.
Dutta et al. [CDGL04] introduced a fast SWMR implementation where both reads and writes involve two exchanges (such operations are called ‘fast’). It was shown that this is possible only when the number of readers $R$ is constrained with respect to the number of servers $S$ and the number of server failures $f$, viz. $R < \frac{S}{f} - 2$. Other works focused on relaxing the bound on the number of readers in the service by proposing hybrid approaches where some operations complete in one and others in two rounds, e.g., [EGMNS09].
Georgiou et al. [GNS08] introduced Quorum Views, client-side tools that examine the distribution of the latest value among the replicas in order to enable fast read operations (two exchanges). A SWMR algorithm, called Sliq, was given that requires at least one single slow read per write operation, and where all writes are fast. A later work [GNRS11] generalized the client-side decision tools and presented a MWMR algorithm, called CwFr, that allows fast read operations.
Previous works considered only client-server communication round trips. Recently, Hadjistasi et al. [HNS17] showed that atomic operations do not necessarily require complete communication round trips, by introducing server-to-server communication. They presented a SWMR algorithm, called OhSam, where reads take three exchanges: two of these are between clients and servers, and one is among servers; their MWMR algorithm, called OhMam, uses a similar approach. These algorithms do not impose constraints on reader participation and perform a modest amount of local computation, resulting in negligible computation overhead.
| Model | Algorithm | Read Exch. | Write Exch. | Read Comm. | Write Comm. |
|---|---|---|---|---|---|
| SWMR | Sliq [GNS08] | 2 or 4 | 2 | | |
| SWMR | Erato | 2 or 3 | 2 | | |
| MWMR | ABD-mw [ABD96, LS97] | 4 | 4 | | |
| MWMR | CwFr [GNRS11] | 2 or 4 | 4 | | |
| MWMR | Erato-mw | 2 or 3 | 4 | | |
Contributions. We focus on the gap between one-round and two-round algorithms by presenting atomic memory algorithms where read operations can take at most “one and a half rounds,” i.e., complete in either two or three exchanges. Complexity results are shown in Table 1; additional details are as follows.
1. We present Erato (Efficient Reads for ATomic Objects; Ερατώ is a Greek Muse, and the authors thank the lovely muse for her inspiration), a SWMR algorithm for atomic objects in the asynchronous message-passing model with processor crashes. We improve the three-exchange read protocol of OhSam [HNS17] to allow reads to terminate in either two or three exchanges using client-side tools, Quorum Views, from algorithm Sliq [GNS08]. During the second exchange, based on the distribution of the timestamps, the reader may be able to complete the read. If not, it waits for “enough” messages from the third exchange to complete. A key idea is that when the reader is “slow” it returns the value associated with the minimum timestamp, i.e., the value of the previous write that is guaranteed to be complete (cf. [HNS17] and [CDGL04]). Read operations are optimal in terms of exchanges in light of [HNS2017arx]. Similarly to ABD, writes take two exchanges. (Section 3.)
2. Using the SWMR algorithm as the basis, we develop a MWMR algorithm, Erato-mw. The algorithm supports a three-exchange read protocol based on [HNS17] in combination with the iterative technique using quorum views as in [GNRS11]. Reads take either two or three exchanges. Writes are similar to ABD-mw and take four communication exchanges (cf. [LS97]). (Section 4.)
3. We simulate the algorithms using the NS3 simulator and assess their performance under practical considerations by varying the number of participants, frequency of operations, and network topologies. (Section 5.)
Improvements in latency are obtained in a trade-off for communication complexity. Simulation results suggest that in practical settings, such as data centers with well-connected servers, the communication overhead is not prohibitive.
2 Models and Definitions
We now present the model, definitions, and notations used in the paper. The system is a collection of crash-prone, asynchronous processors with unique identifiers (ids). The ids are from a totally-ordered set $\mathcal{I}$ that is composed of three disjoint sets: set $\mathcal{W}$ of writer ids, set $\mathcal{R}$ of reader ids, and set $\mathcal{S}$ of replica server ids. Each server maintains a copy of the object.
Processors communicate by exchanging messages via asynchronous point-to-point reliable channels; messages may be reordered. We use the term broadcast as a shorthand denoting sending point-to-point messages to multiple destinations.
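A quorum system over a set $X$ is a collection of subsets of $X$, called quorums, such that every pair of quorums intersects. We define a quorum system $\mathbb{Q}$ over the set of server ids $\mathcal{S}$ as $\mathbb{Q} = \{Q_i : Q_i \subseteq \mathcal{S}\}$; it follows that for any $Q_i, Q_j \in \mathbb{Q}$ we have $Q_i \cap Q_j \neq \emptyset$. We assume that every process in the system is aware of $\mathbb{Q}$.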
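As a concrete illustration of this definition, the following minimal sketch builds a majority quorum system over five hypothetical server ids and checks the pairwise-intersection property. Majority quorums are only one possible choice and are assumed here for illustration; the helper names `majority_quorums` and `is_quorum_system` are not taken from the paper.

```python
from itertools import combinations

def majority_quorums(servers):
    """Return every subset of `servers` of size floor(n/2) + 1 (assumed quorum rule)."""
    size = len(servers) // 2 + 1
    return [frozenset(q) for q in combinations(sorted(servers), size)]

def is_quorum_system(quorums):
    """Check that every pair of quorums has a non-empty intersection."""
    return all(a & b for a, b in combinations(quorums, 2))

# Hypothetical server ids, used only for this example.
servers = {"s1", "s2", "s3", "s4", "s5"}
quorums = majority_quorums(servers)
```

Any two majorities of the same set share at least one member, so the check succeeds for `quorums`, while two disjoint singletons fail it.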
Executions. An algorithm $A$ is a collection of processes, where process $A_p$ is assigned to processor $p$. The state of processor $p$ is determined over a set of state variables, and the state of $A$ is a vector that contains the state of each process. Algorithm $A$ performs a step when some process $p$ (i) receives a message, (ii) performs local computation, or (iii) sends a message. Each such action causes the state at $p$ to change. An execution is an alternating sequence of states and actions of $A$ starting with the initial state and ending in a state.
Failure Model. A process may crash at any point in an execution. If it crashes, then it stops taking steps; otherwise we call the process correct. Any subset of readers and writers may crash. A quorum $Q$ is non-faulty if every server $s \in Q$ is correct. Otherwise, we say $Q$ is faulty. We allow any server to crash as long as at least one quorum is non-faulty.
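The failure assumption can be phrased as a simple predicate over a crash pattern. The sketch below is illustrative only: the three-server quorum system and the function names are assumptions made for this example, not part of the algorithm.

```python
def non_faulty_quorums(quorums, crashed):
    """Return the quorums in which no member has crashed."""
    crashed = set(crashed)
    return [q for q in quorums if not (set(q) & crashed)]

def is_tolerated(quorums, crashed):
    """True if the crash pattern leaves at least one non-faulty quorum."""
    return len(non_faulty_quorums(quorums, crashed)) > 0

# Hypothetical quorum system: all pairs among three servers.
QUORUMS = [{"s1", "s2"}, {"s2", "s3"}, {"s1", "s3"}]
```

With this system a single crash is tolerated, whereas crashing `s1` and `s2` together makes every quorum faulty, so operations could no longer be guaranteed to terminate.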
Efficiency and Message Exchanges. Efficiency of implementations is assessed in terms of operation latency and message complexity. Latency of an operation is determined by computation time and the communication delays. Computation time accounts for all local computation within an operation. Communication delays are measured in terms of communication exchanges. The protocol implementing each operation involves a collection of sends (or broadcasts) of typed messages and the corresponding receives. As defined in [HNS17], a communication exchange within an execution of an operation is the set of sends and receives for the specific message type. Traditional implementations in the style of ABD [ABD96] are structured in terms of rounds, each consisting of two exchanges: the first, a broadcast, is initiated by the process executing an operation, and the second, a convergecast, consists of responses to the initiator. The number of messages that a process expects during a convergecast depends on the implementation. Message complexity measures the worst-case total number of messages exchanged.
Atomicity. An implementation of a read or a write operation contains an invocation action and a response action. An operation $\pi$ is complete in an execution $\xi$ if $\xi$ contains both the invocation and the matching response actions for $\pi$; otherwise $\pi$ is incomplete. An execution is well formed if any process invokes one operation at a time. We say that an operation $\pi$ precedes an operation $\pi'$ in an execution $\xi$, denoted by $\pi \rightarrow \pi'$, if the response step of $\pi$ appears before the invocation step of $\pi'$ in $\xi$. Two operations are concurrent if neither precedes the other. The correctness of an atomic read/write object implementation is defined in terms of atomicity (safety) and termination (liveness) properties. Termination requires that any operation invoked by a correct process eventually completes. Atomicity is defined following [Lynch1996]. For any execution $\xi$, if $\Pi$ is the set of all completed read and write operations in $\xi$, then there exists a partial order $\prec$ on the operations in $\Pi$, s.t. the following properties are satisfied:
A1 For any $\pi_1, \pi_2 \in \Pi$ such that $\pi_1 \rightarrow \pi_2$, it cannot be that $\pi_2 \prec \pi_1$.
A2 For any write $\omega \in \Pi$ and any operation $\pi \in \Pi$, either $\omega \prec \pi$ or $\pi \prec \omega$.
A3 Every read operation returns the value of the last write preceding it according to $\prec$ (or the initial value if there is no such write).
Timestamps and Quorum Views. Atomic object implementations typically use logical timestamps (or tags) associated with the written values to impose a partial order on operations that satisfies the properties A1, A2, and A3.
A quorum view refers to the distribution of the highest timestamps that a read operation witnesses during an exchange. Fig. 1 illustrates four different scenarios. Here small circles represent timestamps received from servers: dark circles represent the highest timestamp, and light ones represent older timestamps. The quorum system consists of three quorums, $Q_i$, $Q_j$, and $Q_k$.
Suppose a read $\rho$ strictly receives values and timestamps from a quorum $Q$ during an exchange. As presented in [GNRS11], $\rho$ can distinguish three different cases: qv(1), qv(2), or qv(3). Each case can help derive conclusions about the state of the latest write (complete, incomplete, unknown). If qv(1) is detected, Fig. 1(a), only one timestamp is received, which means that the write associated with this timestamp is complete. If qv(2) is detected, Fig. 1(b), the write associated with the highest timestamp is still in progress (because older timestamps are detected in the intersections of quorums). Lastly, if qv(3) is detected, the distribution of timestamps does not provide sufficient information regarding the state of the write. This is because there are two possibilities, as shown in Fig. 1(c) and 1(d). In the former the write is incomplete (no quorum has the highest detected timestamp) and in the latter the write is complete in another quorum, but the read has no way of knowing this. We will use quorum views as a design element in our algorithms.
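The three cases can be sketched as a small classifier. This is a hedged reading of the informal description above (the formal definitions are in [GNRS11]): qv(1) when every reply carries the highest timestamp, qv(2) when every intersection of the replying quorum with another quorum contains an older timestamp, and qv(3) otherwise. The function name, the reply dictionary, and the three example quorums are assumptions made for illustration.

```python
def quorum_view(replies, quorum, quorums):
    """Classify the view seen by a reader.

    replies: dict mapping each server in `quorum` to the timestamp it sent.
    quorum:  the quorum (frozenset) the replies came from.
    quorums: the full quorum system (list of frozensets).
    Returns 1, 2, or 3 for qv(1), qv(2), qv(3).
    """
    max_ts = max(replies[s] for s in quorum)
    holders = {s for s in quorum if replies[s] == max_ts}
    if holders == set(quorum):
        return 1  # a single timestamp everywhere: the write is complete
    others = [q for q in quorums if q != quorum]
    # qv(2): every intersection with another quorum contains an older timestamp
    if all((quorum & q) - holders for q in others):
        return 2
    return 3  # some intersection holds only the highest timestamp: undecided

# Hypothetical quorum system with three pairwise-intersecting quorums.
QI, QJ, QK = frozenset("abcd"), frozenset("abe"), frozenset("cde")
SYSTEM = [QI, QJ, QK]
```

For example, with replies from `QI`, the timestamps `{a: 5, b: 4, c: 5, d: 4}` give qv(2), while `{a: 5, b: 5, c: 4, d: 4}` give qv(3) because the intersection of `QI` and `QJ` holds only the highest timestamp.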
3 SWMR Algorithm Erato
We now present and analyze the SWMR algorithm Erato.
3.1 Algorithm Description
In Erato reads take either two or three exchanges. This is achieved by combining the three-exchange read protocol of [HNS17] with the use of Quorum Views of [GNS08]. The read protocol design aims to return the value associated with the timestamp of the last complete write operation. We refer to the three exchanges of the read protocol as e1, e2, and e3. Exchange e1 is initiated by the reader, and exchanges e2 and e3 are conducted by the servers. When the reader receives messages during e2, it analyzes the timestamps to determine whether to terminate or wait for the conclusion of e3. Due to asynchrony it is possible for the messages from e3 to arrive at the reader before messages from e2. In this case the reader still terminates in three exchanges. Similarly to ABD, writes take two exchanges. The code is given in Algorithm 1. We now give the details of the protocols; in referring to the numbered lines of code we use the prefix “L” to stand for “line”.
Reader Protocol. Each reader $r$ maintains several temporary variables. Key variables hold the minimum and the maximum timestamps discovered during the read operation; the maximum is denoted maxTS. Two sets hold the received readRelay and readAck messages, respectively, and two further sets store the ids of the servers that sent these messages. Another set keeps the ids of the servers that sent a readRelay message with the timestamp equal to the maximum timestamp maxTS.
The reader $r$ starts its operation by broadcasting a readRequest message to the servers (exchange e1). It then collects readRelay messages (from exchange e2) and readAck messages (from exchange e3). The reader uses a counter to distinguish fresh messages from stale messages of prior operations. The messages are collected until messages of the same type are received from some quorum of servers (L7-10). If readRelay messages are received from a quorum, then the reader examines the timestamps to determine what quorum view is observed (recall Section 2). If qv(1) is observed, then all timestamps are the same, meaning that the write operation associated with the timestamp is complete, and it is safe to return the value associated with it without exchange e3 (L20-22). If qv(2) is observed, then the write associated with the maximum timestamp maxTS is not complete. But because there is a sole writer, it is safe to return the value associated with timestamp maxTS-1, i.e., the value of the preceding complete write, again without exchange e3 (L30-33). If qv(3) is observed, then the write associated with the maximum timestamp maxTS is in progress or complete. Since the reader is unable to decide which case applies, it waits for the exchange e3 readAck messages from some quorum. The reader here returns the value associated with the minimum timestamp observed (L23-29). It is possible, due to asynchrony, that messages from e3 arrive from a quorum before enough messages from e2 are gathered. Here the reader decides as above for e3 (L11-13).
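The reader's decision rule can be summarized in a few lines. The sketch below only selects the timestamp to return, assuming integer timestamps incremented by the sole writer; message collection, counters, and retrieving the value stored for the chosen timestamp are omitted, and the function names are hypothetical rather than taken from Algorithm 1.

```python
def decide_after_relays(view, relay_timestamps):
    """Decision after readRelay messages arrive from a quorum (exchange e2).

    view: 1, 2, or 3, the quorum view observed on the relay timestamps.
    Returns the timestamp to return, or None if the reader must wait for e3.
    """
    max_ts = max(relay_timestamps)
    if view == 1:
        return max_ts        # latest write is complete
    if view == 2:
        return max_ts - 1    # latest write incomplete; previous one is complete
    return None              # qv(3): undecided, wait for readAck messages

def decide_after_acks(ack_timestamps):
    """Decision after readAck messages arrive from a quorum (exchange e3)."""
    return min(ack_timestamps)
```

A reader that observes qv(3) on relays `[7, 7, 6]` gets `None` from the first function and later returns the minimum of the acknowledged timestamps.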
Server Protocol. Each server $s$ stores the value of the replica and its associated timestamp. An array is used to store the sets of processes that relayed to $s$ regarding a read operation. The destinations set of $s$ is initialized to the set containing all servers from every quorum that contains $s$. It is used for sending relay messages during exchange e2.
In exchange e2, upon receiving a readRelay message carrying a timestamp $ts'$, server $s$ compares its local timestamp $ts$ with $ts'$. If $ts < ts'$, then $s$ sets its local value and timestamp to those enclosed in the message (L71-72). Next, $s$ checks if the received readRelay marks a new read by reader $r$, i.e., if the counter that $s$ has recorded for $r$ is smaller than the counter enclosed in the message. If so, then $s$: (a) sets its local counter for $r$ to the enclosed one; and (b) re-initializes the relay set for $r$ to an empty set (L73-75). It then adds the sender of the readRelay message to the set of servers that informed it regarding the read invoked by $r$ (L76-77). Once readRelay messages are received from a quorum, $s$ creates a readAck message and sends it to $r$ in exchange e3 of the read (L78-79).
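The relay-handling step just described can be sketched as follows. This is an illustrative reading of the prose, not the code of Algorithm 1: the class and field names, the tuple used for the readAck message, the default counter of -1, and the `acked` bookkeeping that suppresses duplicate acknowledgments are all assumptions.

```python
class Server:
    def __init__(self, sid, quorums):
        self.sid = sid
        self.quorums = quorums          # list of frozensets of server ids
        self.ts, self.value = 0, None   # local replica state
        self.operations = {}            # reader id -> latest read counter seen
        self.relays = {}                # reader id -> servers that relayed
        self.acked = {}                 # reader id -> counter already acked (assumption)

    def on_read_relay(self, ts, value, reader, counter, sender):
        """Handle one readRelay; return a readAck tuple once a quorum has relayed."""
        if self.ts < ts:                                  # adopt the newer value
            self.ts, self.value = ts, value
        if self.operations.get(reader, -1) < counter:     # a new read by this reader
            self.operations[reader] = counter
            self.relays[reader] = set()
        if self.operations[reader] == counter:            # ignore stale relays
            self.relays[reader].add(sender)
            reached = any(q <= self.relays[reader] for q in self.quorums)
            if reached and self.acked.get(reader) != counter:
                self.acked[reader] = counter
                return ("readAck", self.ts, self.value, counter, self.sid)
        return None
```

With the quorums `{1, 2}`, `{2, 3}`, `{1, 3}`, server 1 stays silent after the first relay for a read and emits a single readAck as soon as relays from servers 1 and 2 have both arrived.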
To prove correctness of algorithm Erato we reason about its liveness (termination) and atomicity (safety).
Termination is satisfied with respect to our failure model: at least one quorum is non-faulty and each operation waits for messages from a quorum of servers. Let us consider this in more detail.
Write Operation. Showing liveness is straightforward. Per algorithm Erato, the writer creates a writeRequest message and broadcasts it to all servers. The writer then waits for writeAck messages from a full quorum of servers (L48-52). Since in our failure model at least one quorum is non-faulty, the writer collects writeAck messages from a full quorum of live servers and the write operation terminates.
Read Operation. The reader $r$ begins by broadcasting a readRequest message to all servers and waiting for responses. A read operation of the algorithm Erato terminates when the reader either (i) collects readAck messages from a full quorum of servers or (ii) collects readRelay messages from a full quorum and notices qv(1) or qv(2) (L7-10). Let us analyze case (i). Since a full quorum is non-faulty, at least a full quorum of servers receives the readRequest message. Any server $s$ that receives this message broadcasts a readRelay message to every server that shares a quorum with $s$, and to the invoker $r$; this is its destinations set (L62-63). In addition, no server ever discards any incoming readRelay messages. Any server, whether or not it is aware of the readRequest, always keeps a record of the incoming readRelay messages and takes action as if it is aware of the readRequest. The only difference between a server $s_i$ that has received the readRequest message and a server $s_k$ that has not is that $s_i$ is already able to broadcast readRelay messages, whereas $s_k$ broadcasts readRelay messages once it receives the readRequest message. Each non-failed server receives readRelay messages from a full quorum of servers and sends a readAck message to reader $r$ (L78-79). Therefore, reader $r$ can always collect readAck messages from a full quorum of servers, decide on a value to return, and terminate (L11-13). If case (ii) never holds, the algorithm still terminates through case (i). Thus, since any read or write operation collects a sufficient number of messages and terminates, liveness is satisfied.
Based on the above, it is always the case that acknowledgment messages readAck and writeAck are collected from a full quorum of servers in any read and write operation, thus ensuring liveness.
To prove atomicity we order the operations with respect to the timestamps associated with the written values. For each execution of the algorithm there must exist a partial order $\prec$ on the operations that satisfies conditions A1, A2, and A3 given in Section 2. Let $ts_\pi$ be the timestamp at the completion of $\pi$ when $\pi$ is a write, and the timestamp associated with the returned value when $\pi$ is a read. We now define the partial order as follows. For two operations $\pi_1$ and $\pi_2$, when $\pi_1$ is any operation and $\pi_2$ is a write, we let $\pi_1 \prec \pi_2$ if $ts_{\pi_1} < ts_{\pi_2}$. For two operations $\pi_1$ and $\pi_2$, when $\pi_1$ is a write and $\pi_2$ is a read, we let $\pi_1 \prec \pi_2$ if $ts_{\pi_1} \leq ts_{\pi_2}$. The rest of the order is established by transitivity, without ordering the reads with the same timestamps. We now state the following lemmas.
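The two base rules of this ordering can be written as a predicate over operations tagged with their timestamps. The sketch below is illustrative only: it assumes that the first rule uses a strict comparison and the second a non-strict one, it does not compute the transitive closure, and the `Op` record and function name are hypothetical.

```python
from collections import namedtuple

# kind is "read" or "write"; ts is the timestamp associated with the operation.
Op = namedtuple("Op", ["kind", "ts"])

def ordered_before(p1, p2):
    """Base rules only: does p1 come before p2 in the timestamp order?"""
    if p2.kind == "write":
        return p1.ts < p2.ts                 # any operation before a later write
    if p1.kind == "write" and p2.kind == "read":
        return p1.ts <= p2.ts                # a write before reads of its value
    return False                             # two reads: left to transitivity
```

Under these rules a read that returns timestamp 1 is ordered after the write with timestamp 1 and before the write with timestamp 2, while two reads with the same timestamp remain unordered.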
It is easy to see that the variable in each server is monotonically increasing. This leads to the following lemma.
Lemma 1. In any execution $\xi$ of Erato, the variable $ts$ maintained by any server $s$ in the system is non-negative and monotonically increasing.
Next, we show that any read operation that follows a write operation receives readAck messages from the servers in which each included timestamp is at least as large as the one of the complete write operation.
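Lemma 2. In any execution $\xi$ of Erato, if a read operation $\rho$ succeeds a write operation $\omega$ that writes $ts$ and $v$, i.e., $\omega \rightarrow \rho$, and receives readAck messages from a quorum $Q$ of servers, set $RA$, then each $s \in RA$ sends a readAck message to $\rho$ with a timestamp $ts_s$ s.t. $ts_s \geq ts$.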
Proof. Let $WSet$ be the set of servers from a quorum $Q_a$ that send a writeAck message to $\omega$, let $RSet$ be the set of servers from a quorum $Q_b$ that sent readRelay messages to server $s$, and let $RA$ be the set of servers from a quorum $Q_c$ that send a readAck message to $\rho$. Notice that it is not necessary that $a \neq b \neq c$ holds.
Write operation $\omega$ is completed. By Lemma 1, if a server $s$ receives a timestamp $ts$ from a process $p$ at some time $t$, then $s$ attaches a timestamp $ts'$ s.t. $ts' \geq ts$ in any message sent at any time $t' \geq t$. Thus, every server in $WSet$ sent a writeAck message to $\omega$ with a timestamp greater than or equal to $ts$. Hence, every server $s_x \in WSet$ has a timestamp $ts_{s_x} \geq ts$. Let us now examine a timestamp $ts_s$ that server $s$ sends to read operation $\rho$.
Before server $s$ sends a readAck message to $\rho$, it must receive readRelay messages from a full quorum of servers, $RSet$ (L78-79). Since both $WSet$ and $RSet$ contain messages from a full quorum of servers, and by definition any two quorums have a non-empty intersection, then $WSet \cap RSet \neq \emptyset$. By Lemma 1, any server $s_x \in WSet \cap RSet$ has a timestamp $ts_{s_x}$ s.t. $ts_{s_x} \geq ts$. Since $s_x \in RSet$ and, from the algorithm, a server's timestamp is always updated to the highest timestamp it noticed (L71-72), then when server $s$ receives the message from $s_x$, it will update its timestamp $ts_s$ s.t. $ts_s \geq ts_{s_x}$. Server $s$ creates a readAck message in which it encloses its local timestamp and its local value, $\langle ts_s, v_s \rangle$ (L79). Each $s \in RA$ sends a readAck to $\rho$ with a timestamp $ts_s$ s.t. $ts_s \geq ts_{s_x} \geq ts$. Thus, $ts_s \geq ts$, and the lemma follows.
Now, we show that if a read operation succeeds a write operation, then it returns a value at least as recent as the one written.
Lemma 3. In any execution $\xi$ of Erato, if a read $\rho$ succeeds a write operation $\omega$ that writes timestamp $ts$, i.e., $\omega \rightarrow \rho$, and returns a timestamp $ts'$, then $ts' \geq ts$.
Proof. A read operation terminates either (a) upon collecting readRelay messages from a full quorum, or (b) upon collecting readAck messages from a full quorum. We first examine case (b). Suppose that $\rho$ receives readAck messages from a full quorum $Q$ of servers, forming the set $RA$. By lines 11-13, it follows that $\rho$ decides on the minimum timestamp, $ts' = minTS$, among all the timestamps in the readAck messages of the set $RA$. From Lemma 2, $minTS \geq ts$ holds, where $ts$ is the timestamp written by the last complete write operation $\omega$. Then $ts' = minTS \geq ts$ also holds. Thus, $ts' \geq ts$.
Now we examine case (a). In this case the reader notices either (i) qv(1), (ii) qv(2), or (iii) qv(3); under qv(3) the reader does not terminate here and instead waits for readAck messages, which is covered by case (b). Let $WSet$ be the set of servers from a quorum $Q_a$ that send a writeAck message to $\omega$. Since the write operation $\omega$, which wrote value $v$ associated with timestamp $ts$, is complete, and by monotonicity of timestamps in servers (Lemma 1), at least a quorum of servers has a timestamp $ts_s$ s.t. $ts_s \geq ts$. In other words, every server in $WSet$ sent a writeAck message to $\omega$ with a timestamp greater than or equal to $ts$.
Suppose that $\rho$ receives readRelay messages from a full quorum $Q_b$ of servers, forming the set $RR$. Since both $WSet$ and $RR$ contain messages from a full quorum of servers, quorums $Q_a$ and $Q_b$, and by definition any two quorums have a non-empty intersection, then $WSet \cap RR \neq \emptyset$. Since every server in $WSet$ has a timestamp $ts_s \geq ts$, any server $s \in WSet \cap RR$ has a timestamp $ts_s$ s.t. $ts_s \geq ts$.
If $\rho$ noticed qv(1) in $RR$, then the distribution of the timestamps indicates the existence of one and only one timestamp in $RR$, $maxTS$. Hence, $maxTS \geq ts$. Based on the algorithm (L20-22), the read operation $\rho$ returns the value associated with $ts' = maxTS$, and $ts' \geq ts$ holds.
Based on the definition of qv(2), if it is noticed in $RR$, then there must exist at least two servers in $RR$ with different timestamps and one of them holds the maximum timestamp. Let $s_k$ be the one that holds the maximum timestamp $ts_k$ (or $maxTS$) and $s_l$ the server that holds the timestamp $ts_l$ s.t. $ts_l < ts_k$ (qv(2) places such an older timestamp in the intersections of quorums). Since (a) any server $s \in WSet \cap RR$ has a timestamp $ts_s$ s.t. $ts_s \geq ts$, (b) $s_k$ holds the maximum timestamp $ts_k$ (or $maxTS$), (c) $s_l \in WSet \cap RR$, and (d) $ts_l < ts_k$, it follows that $ts \leq ts_l < ts_k$. Thus, $ts_k$ (or $maxTS$) must be strictly greater than $ts$, $maxTS > ts$. Based on the algorithm, when $\rho$ notices qv(2) in $RR$ it returns the value associated with the previous maximum timestamp, that is, the value associated with maxTS-1 (L30-33). Since $maxTS > ts$, for the previous maximum timestamp, denoted by $ts' = maxTS - 1$, which is only one unit less than $maxTS$, the following holds: $ts' \geq ts$. Therefore, in this case $\rho$ returns a value associated with $ts'$ and $ts' \geq ts$ holds.
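Lemma 4. In any execution $\xi$ of Erato, if $\rho_1$ and $\rho_2$ are two semi-fast read operations, i.e., reads that take three exchanges to complete, such that $\rho_1$ precedes $\rho_2$, i.e., $\rho_1 \rightarrow \rho_2$, and $\rho_1$ returns the value for timestamp $ts_1$, then $\rho_2$ returns the value for timestamp $ts_2 \geq ts_1$.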
Proof. Let the two operations $\rho_1$ and $\rho_2$ be invoked by processes with identifiers $r_1$ and $r_2$ respectively (not necessarily different). Also, let $RA_1$ and $RA_2$ be the sets of servers from quorums $Q_a$ and $Q_b$ (not necessarily different) that sent a readAck message to $r_1$ and $r_2$ during $\rho_1$ and $\rho_2$.
Assume by contradiction that read operations $\rho_1$ and $\rho_2$ exist such that $\rho_2$ succeeds $\rho_1$, i.e., $\rho_1 \rightarrow \rho_2$, and the operation $\rho_2$ returns a timestamp $ts_2$ that is smaller than the timestamp $ts_1$ returned by $\rho_1$, i.e., $ts_2 < ts_1$. Based on the algorithm, $\rho_2$ returns a timestamp $ts_2$ that is smaller than the minimum timestamp received by $\rho_1$, i.e., $ts_1$, if $\rho_2$ obtains $ts_2$ and $v$ in the readAck message of some server $s_x \in RA_2$, and $ts_2$ is the minimum timestamp received by $\rho_2$.
Let us examine if $s_x$ sends a readAck message to $\rho_1$ with timestamp $ts_x$, i.e., $s_x \in RA_1$. By Lemma 1, and since $\rho_1 \rightarrow \rho_2$, it must be the case that $ts_x \leq ts_2$. According to our assumption $ts_1 > ts_2$, and since $ts_1$ is the smallest timestamp sent to $\rho_1$ by any server in $RA_1$, it follows that $r_1$ does not receive the readAck message from $s_x$, and hence $s_x \notin RA_1$.
Now let us examine the actions of the server $s_x$. From the algorithm, server $s_x$ collects readRelay messages from a full quorum $Q_c$ of servers before sending a readAck message to $\rho_2$ (L78-79). Let $RRSet_{s_x}$ denote the set of servers from the full quorum $Q_c$ that sent readRelay to $s_x$. Since both $RRSet_{s_x}$ and $RA_1$ contain messages from full quorums, $Q_c$ and $Q_a$, and since any two quorums have a non-empty intersection, it follows that $RRSet_{s_x} \cap RA_1 \neq \emptyset$.
Thus there exists a server $s_i \in RRSet_{s_x} \cap RA_1$ that sent (i) a readAck to $r_1$ for $\rho_1$, and (ii) a readRelay to $s_x$ during $\rho_2$. Note that $s_i$ sends a readRelay for $\rho_2$ only after it receives a read request from $\rho_2$. Since $\rho_1 \rightarrow \rho_2$, it follows that $s_i$ sent the readAck to $\rho_1$ before sending the readRelay to $s_x$. By Lemma 1, if $s_i$ attaches a timestamp $ts_{s_i}$ in the readAck to $\rho_1$, then $s_i$ attaches a timestamp $ts'_{s_i}$ in the readRelay message to $s_x$, such that $ts'_{s_i} \geq ts_{s_i}$. Since $ts_1$ is the minimum timestamp received by $\rho_1$, then $ts_{s_i} \geq ts_1$, and hence $ts'_{s_i} \geq ts_1$ as well. By Lemma 1, and since $s_x$ receives the readRelay message from $s_i$ before sending a readAck to $\rho_2$, it follows that $s_x$ sends a timestamp $ts_2$ s.t. $ts_2 \geq ts'_{s_i} \geq ts_1$. Thus, $ts_2 \geq ts_1$ and this contradicts our initial assumption.
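Lemma 5. In any execution $\xi$ of Erato, if $\rho_1$ and $\rho_2$ are two fast read operations, i.e., reads that take two exchanges to complete, such that $\rho_1$ precedes $\rho_2$, i.e., $\rho_1 \rightarrow \rho_2$, and $\rho_1$ returns the value for timestamp $ts_1$, then $\rho_2$ returns the value for timestamp $ts_2 \geq ts_1$.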
Proof. Let the two operations $\rho_1$ and $\rho_2$ be invoked by processes with identifiers $r_1$ and $r_2$ respectively (not necessarily different). Also, let $RR_1$ and $RR_2$ be the sets of servers from quorums $Q_a$ and $Q_b$ (not necessarily different) that sent a readRelay message to $r_1$ and $r_2$ during $\rho_1$ and $\rho_2$.
The algorithm terminates in two communication exchanges when a read operation receives readRelay messages from a full quorum and, based on the distribution of the timestamps, notices either (a) qv(1) or (b) qv(2). We now examine the four cases.
Case (i), $\rho_1 \rightarrow \rho_2$ and both $\rho_1$ and $\rho_2$ notice qv(1). It is known that all the servers in $RR_1$ replied to $\rho_1$ with timestamp $ts_1$. Since by definition any two quorums have a non-empty intersection, it follows that $RR_1 \cap RR_2 \neq \emptyset$. From that and by Lemma 1, every server $s \in RR_1 \cap RR_2$ has a timestamp $ts_s$ such that $ts_s \geq ts_1$. Since $\rho_2$ notices qv(1) in $RR_2$, the distribution of the timestamps indicates the existence of one and only one timestamp in $RR_2$, $ts_2$. Thus, $ts_2 \geq ts_1$.
Case (ii), $\rho_1 \rightarrow \rho_2$ and $\rho_1$ notices qv(1) and $\rho_2$ notices qv(2). It is known that all the servers in $RR_1$ replied to $\rho_1$ with timestamp $ts_1$. Since by definition any two quorums have a non-empty intersection, it follows that $RR_1 \cap RR_2 \neq \emptyset$. From that and by Lemma 1, every server $s \in RR_1 \cap RR_2$ has a timestamp $ts_s$ such that $ts_s \geq ts_1$. Since $\rho_2$ notices qv(2) in $RR_2$, there must exist at least two servers in $RR_2$ with different timestamps and one of them holds the maximum timestamp. Let $s_k$ be the one that holds the maximum timestamp $ts_k$ (or $maxTS$) and $s_l$ the server that holds the timestamp $ts_l$ s.t. $ts_l < ts_k$. Since (a) any server $s \in RR_1 \cap RR_2$ has a timestamp $ts_s$ s.t. $ts_s \geq ts_1$, (b) $s_k$ holds the maximum timestamp $ts_k$ (or $maxTS$), (c) $s_l \in RR_1 \cap RR_2$, and (d) $ts_l < ts_k$, it follows that $ts_1 \leq ts_l < ts_k$. Thus, $ts_k$ (or $maxTS$) must be strictly greater than $ts_1$, $maxTS > ts_1$. Based on the algorithm, when $\rho_2$ notices qv(2) in $RR_2$ it returns the value associated with the previous maximum timestamp, that is, the value associated with maxTS-1 (L30-33). Since $maxTS > ts_1$, for the previous maximum timestamp, denoted by $ts_2 = maxTS - 1$, which is only one unit less than $maxTS$, the following holds: $maxTS - 1 \geq ts_1$, thus $ts_2 \geq ts_1$.
Case (iii), $\rho_1 \rightarrow \rho_2$ and $\rho_1$ notices qv(2) and $\rho_2$ notices qv(1). Since $\rho_1$ notices qv(2) in $RR_1$, there exist a subset of servers $S_k$, $S_k \subset RR_1$, that hold the maximum timestamp $maxTS$ and a subset of servers $S_l$, $S_l \subset RR_1$, that hold timestamp maxTS-1. Based on the algorithm, $\rho_1$ returns $ts_1$ s.t. $ts_1 = maxTS - 1$ from the set of servers in $S_l$. Since $RR_1 \cap RR_2 \neq \emptyset$, and qv(1) indicates the existence of one and only one timestamp, $\rho_2$ can notice qv(1) in two cases: (a) all the servers in $RR_2$ hold $maxTS$, or (b) all the servers in $RR_2$ hold maxTS-1. By Lemma 1, if (a) holds then $\rho_2$ returns $ts_2$ s.t. $ts_2 = maxTS > ts_1$; else, if (b) holds then $\rho_2$ returns $ts_2$ s.t. $ts_2 = maxTS - 1 = ts_1$.
Case (iv), $\rho_1 \rightarrow \rho_2$ and both $\rho_1$ and $\rho_2$ notice qv(2). The distribution of the timestamps that $\rho_1$ notices indicates that the write operation associated with the maximum timestamp, $maxTS$, is ongoing, i.e., not completed. By well-formedness and the existence of a sole writer in the system, we know that $ts_1 = maxTS - 1$ corresponds to the latest complete write operation, $\omega$. By Lemma 3, $\rho_2$ will not be able to return a timestamp $ts_2$ s.t. $ts_2 < ts_1$. Thus $ts_2 \geq ts_1$ holds and the lemma follows.
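Lemma 6. In any execution $\xi$ of Erato, if $\rho_1$ and $\rho_2$ are two read operations such that $\rho_1$ precedes $\rho_2$, i.e., $\rho_1 \rightarrow \rho_2$, and $\rho_1$ returns timestamp $ts_1$, then $\rho_2$ returns a timestamp $ts_2$, s.t. $ts_2 \geq ts_1$.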
Proof. We examine the cases where one of the read operations is fast and the other is semifast. In particular, the cases are (i) $\rho_1 \rightarrow \rho_2$, where $\rho_1$ is semifast and $\rho_2$ is fast, and (ii) $\rho_1 \rightarrow \rho_2$, where $\rho_1$ is fast and $\rho_2$ is semifast.
Let the two operations $\rho_1$ and $\rho_2$ be invoked by processes with identifiers $r_1$ and $r_2$ respectively (not necessarily different). Also, let $RR_1$, $RA_1$ and $RR_2$, $RA_2$ be the sets of servers from full quorums (not necessarily different) that sent a readRelay and readAck message to $\rho_1$ and $\rho_2$ respectively.
We start with case (i). Since read operation $\rho_1$ is semifast, then based on the algorithm, the timestamp $ts_1$ that it returns is also the minimum timestamp noticed in $RA_1$. Before a server sends a readAck message to $\rho_1$ (these servers form $RA_1$), it must receive readRelay messages from a full quorum of servers. Thus, by Lemma 1 (monotonicity of the timestamps at the servers), the minimum timestamp that a full quorum holds by the end of $\rho_1$ is $ts_1$. Read operation $\rho_2$ receives readRelay messages from a full quorum of servers, $RR_2$. By the definition of quorums, since both $RA_1$ and $RR_2$ are full quorums of servers, it follows that $RA_1 \cap RR_2 \neq \emptyset$. Thus every server $s \in RA_1 \cap RR_2$ holds a timestamp $ts$ such that $ts \geq ts_1$.
If $\rho_2$ notices qv(1) in $RR_2$, then the distribution of the timestamps in $RR_2$ indicates the existence of one and only one timestamp, $ts_2$. From the above, it follows that $ts_2 \geq ts_1$ holds for the timestamp $ts_2$ that $\rho_2$ returns.
On the other hand, if $\rho_2$ notices qv(2) in $RR_2$, then based on the distribution of the timestamps in qv(2) there must exist at least two servers in $RR_2$ with different timestamps, one of which must be the maximum, $maxTS$. Since every server $s \in RA_1 \cap RR_2$ holds a timestamp $ts$ such that $ts \geq ts_1$, the maximum timestamp cannot be equal to $ts_1$. If that were the case, $\rho_2$ would have noticed qv(1). In particular, $maxTS > ts_1$ now holds. Based on the algorithm, when $\rho_2$ notices qv(2) in $RR_2$ it returns the value associated with the previous maximum timestamp, that is, the value associated with maxTS-1 (L30-33). Since $maxTS > ts_1$, the previous maximum timestamp, denoted by $ts_2$, which is only one unit less than $maxTS$, satisfies $ts_2 = maxTS - 1 \geq ts_1$, thus $ts_2 \geq ts_1$.
We now examine case (ii). Since $\rho_1$ is fast, it follows that it has noticed either qv(1) or qv(2) in $RR_1$. If qv(1) was noticed, and $\rho_1$ returned a value associated with the maximum timestamp $ts_1$, then by the completion of $\rho_1$ a full quorum has a timestamp $ts$ such that $ts \geq ts_1$. Now, since read operation $\rho_2$ is semifast, then based on the algorithm, the timestamp $ts_2$ that it returns is the minimum timestamp noticed in $RA_2$. Before a server sends a readAck message to $\rho_2$ (these servers form $RA_2$), it must receive readRelay messages from a full quorum of servers. By Lemma 1 (monotonicity of the timestamps at the servers) and since any two full quorums intersect, every server in $RA_2$ has a timestamp $ts$ such that $ts \geq ts_1$. Hence $ts_2 \geq ts_1$.
If qv(2) was noticed in $RR_1$, then based on the algorithm, $\rho_1$ returned a value associated with the previous maximum timestamp, that is, $ts_1 = maxTS - 1$. By the completion of $\rho_1$ a full quorum has a timestamp $ts$ such that $ts \geq ts_1$. Read operation $\rho_2$ is semifast, and the returned timestamp $ts_2$ is the minimum timestamp noticed in $RA_2$. A server sends a readAck message to $\rho_2$ (these servers form $RA_2$) when it receives readRelay messages from a full quorum of servers. By Lemma 1 and since any two full quorums intersect, every server in $RA_2$ has a timestamp $ts$ such that $ts \geq ts_1$. Hence $ts_2 \geq ts_1$.
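The integer step used in case (i), when the second read notices qv(2), can be spelled out as a short worked inequality. The symbols here are assumed notation, not taken from the source: $ts_1$ is the timestamp returned by the earlier read, $ts_2$ the timestamp returned by the later read, and $maxTS$ the maximum timestamp the later read observes. The step relies on timestamps being integers that the sole writer increments by one.

```latex
\begin{align*}
maxTS > ts_1
  \;\Longrightarrow\; maxTS \geq ts_1 + 1
  \;\Longrightarrow\; ts_2 = maxTS - 1 \geq ts_1 .
\end{align*}
```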
Algorithm Erato implements an atomic SWMR object.
Proof. We now use the lemmas above and the partial order definition to reason about each of the three conditions A1, A2 and A3.
A1 For any operations $\pi_1$, $\pi_2$ such that $\pi_1 \rightarrow \pi_2$, it cannot be that $\pi_2 \prec \pi_1$.
When the two operations $\pi_1$ and $\pi_2$ are reads and $\pi_1 \rightarrow \pi_2$ holds, then from Lemma 6 it follows that the timestamp of $\pi_2$ is no less than the one of $\pi_1$, $ts_{\pi_2} \geq ts_{\pi_1}$. If $ts_{\pi_2} > ts_{\pi_1}$, then by the ordering definition $\pi_1 \prec \pi_2$ is satisfied. When $ts_{\pi_2} = ts_{\pi_1}$, the ordering is not defined, thus it cannot be the case that $\pi_2 \prec \pi_1$. If $\pi_2$ is a write, the sole writer generates a new timestamp by incrementing the largest timestamp in the system. By well-formedness (see Section 2), any timestamp generated in any write operation that precedes $\pi_2$ must be smaller than $ts_{\pi_2}$. Since $\pi_1 \rightarrow \pi_2$, it holds that $ts_{\pi_1} < ts_{\pi_2}$. Hence, by the ordering definition it cannot be the case that $\pi_2 \prec \pi_1$. Lastly, when $\pi_2$ is a read and $\pi_1$ a write and $\pi_1 \rightarrow \pi_2$ holds, then from Lemma 3 it follows that $ts_{\pi_2} \geq ts_{\pi_1}$. By the ordering definition, it cannot hold that $\pi_2 \prec \pi_1$ in this case either.
A2 For any write $\omega$ and any operation $\pi$, either $\omega \prec \pi$ or $\pi \prec \omega$.
If the timestamp returned from $\omega$ is greater than the one returned from $\pi$, i.e., $ts_{\omega} > ts_{\pi}$, then $\pi \prec \omega$ follows directly. Similarly, if $ts_{\omega} < ts_{\pi}$ holds, then $\omega \prec \pi$ follows. If $ts_{\omega} = ts_{\pi}$, then it must be that $\pi$ is a read and $\pi$ discovered $ts_{\omega}$ in a quorum view qv(1) or qv(3). Thus, $\omega \prec \pi$ follows.
A3 Every read operation returns the value of the last write preceding it according to $\prec$ (or the initial value if there is no such write).
Let $\omega$ be the last write preceding read $\rho$. From our definition it follows that $ts_{\rho} \geq ts_{\omega}$. If $ts_{\rho} = ts_{\omega}$, then $\rho$ returns the value conveyed by $\omega$ to some servers in a quorum satisfying either qv(1) or qv(3). If $ts_{\rho} > ts_{\omega}$, then $\rho$ obtains a larger timestamp, but such a timestamp can only be created by a write that succeeds $\omega$, thus $\omega$ does not precede the read and this cannot be the case. Lastly, if $ts_{\rho} = 0$, no preceding writes exist, and $\rho$ returns the initial value.
Having shown liveness and atomicity of algorithm Erato the result follows.
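The timestamp-based ordering used in the arguments for A1, A2 and A3 can be illustrated with a minimal sketch. It assumes the usual order for single-writer timestamp algorithms: an operation is ordered before another if its timestamp is smaller, or if the timestamps are equal and the first is a write while the second is a read. The names `Op` and `precedes` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str  # "read" or "write" (hypothetical encoding)
    ts: int    # timestamp written by a write, or returned by a read


def precedes(a: Op, b: Op) -> bool:
    """Return True if a is ordered before b under the assumed order."""
    if a.ts < b.ts:
        return True
    # Equal timestamps: only a write is ordered before the read
    # that returns its value. Two reads with equal timestamps stay unordered.
    return a.ts == b.ts and a.kind == "write" and b.kind == "read"


w = Op("write", 3)
r_same = Op("read", 3)
r_other = Op("read", 3)
r_later = Op("read", 4)
```

Under this sketch, `precedes(w, r_same)` holds while `precedes(r_same, w)` does not, which mirrors A2, and the two reads with equal timestamps are left unordered, which mirrors the equal-timestamp case in A1.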
We now assess the performance of Erato in terms of (i) latency of read and write operations as measured by the number of communication exchanges, and (ii) the message complexity of read and write operations.
Communication and Message Complexity. By inspection of the code, write operations take 2 exchanges and read operations take either 2 or 3 exchanges. The (worst case) message complexity of write operations is $2|\mathcal{S}|$ and of read operations is $|\mathcal{S}|^2 + 3|\mathcal{S}|$, where $\mathcal{S}$ is the set of servers, as follows from the structure of the algorithm. We now give additional details.
Operation Latency. Write operation latency: According to algorithm Erato, the writer sends writeRequest messages to all servers during exchange e1 and waits for writeAck messages from a full quorum of servers during e2. Once the writeAck messages are received, no further communication is required and the write operation terminates. Therefore, any write operation consists of 2 communication exchanges.
Read operation latency: A reader sends a readRequest message to all the servers in the first communication exchange e1. Once the servers receive the readRequest message, they broadcast a readRelay message to all servers and the reader in exchange e2. The reader can terminate at the end of e2 if it receives readRelay messages from a full quorum and, based on the distribution of the timestamps, it notices qv(1) or qv(2). If this is not the case, the operation goes into the third exchange e3. Thus read operations terminate after either 2 or 3 communication exchanges.
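The reader's choice between terminating after e2 and continuing to e3 can be sketched as follows. The quorum-view predicates are an assumption based on the usual formulation from the CwFr line of work, since they are not restated in this section: qv(1) means every server in the replying quorum reports the maximum timestamp, qv(3) means some intersection with another quorum reports only the maximum timestamp, and qv(2) is the remaining case. The function names and the example quorum system (all 3-subsets of 4 servers) are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional

Quorum = FrozenSet[int]


def quorum_view(quorums: List[Quorum], qi: Quorum, ts: Dict[int, int]) -> int:
    """Classify the timestamps seen in quorum qi as qv(1), qv(2) or qv(3)."""
    max_ts = max(ts[s] for s in qi)
    if all(ts[s] == max_ts for s in qi):
        return 1
    for qj in quorums:
        if qj == qi:
            continue
        common = qi & qj
        # Assumed qv(3): an intersection in which every server holds max_ts.
        if common and all(ts[s] == max_ts for s in common):
            return 3
    return 2


def fast_read(quorums: List[Quorum], qi: Quorum, ts: Dict[int, int]) -> Optional[int]:
    """Timestamp returned after e2, or None if exchange e3 is needed."""
    view = quorum_view(quorums, qi, ts)
    max_ts = max(ts[s] for s in qi)
    if view == 1:
        return max_ts
    if view == 2:
        return max_ts - 1  # previous maximum timestamp (sole writer)
    return None


def semifast_read(ack_ts: Dict[int, int]) -> int:
    """Timestamp returned after e3: the minimum among the readAck messages."""
    return min(ack_ts.values())


QUORUMS = [frozenset(c) for c in combinations(range(1, 5), 3)]
QI = frozenset({1, 2, 3})
```

In this example system, the timestamps `{1: 5, 2: 4, 3: 4}` give qv(2), because every intersection with another quorum contains a server below the maximum, whereas `{1: 5, 2: 5, 3: 4}` gives qv(3) and forces the third exchange.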
Message Complexity. We measure operation message complexity as the worst case number of exchanged messages in each read and write operation. The worst case number of messages corresponds to failure-free executions where all participants send messages to all destinations according to the protocols.
Write operation: A single write operation in algorithm Erato takes 2 communication exchanges. In the first exchange e1, the writer sends a writeRequest message to all the servers in $\mathcal{S}$. The second exchange e2 occurs when all servers in $\mathcal{S}$ send a writeAck message to the writer. Thus, at most $2|\mathcal{S}|$ messages are exchanged in a write operation.
Read operation: Read operations in the worst case take 3 communication exchanges. Exchange e1 occurs when a reader sends a readRequest message to all servers in $\mathcal{S}$. The second exchange e2 occurs when servers in $\mathcal{S}$ send readRelay messages to all servers in $\mathcal{S}$ and to the requesting reader. The final exchange e3 occurs when servers in $\mathcal{S}$ send a readAck message to the reader. Summing together $|\mathcal{S}| + (|\mathcal{S}|^2 + |\mathcal{S}|) + |\mathcal{S}|$ shows that in the worst case $|\mathcal{S}|^2 + 3|\mathcal{S}|$ messages are exchanged during a read operation.
4 MWMR Algorithm Erato-mw
We now aim for a MWMR algorithm that involves two or three communication exchanges per read operation and four exchanges per write operation. The read protocol of algorithm Erato relies on the fact that there is a sole writer in the system: based on the distribution of the timestamps in a quorum, if the reader knows that the write operation is not complete, then any previous write is complete (by well-formedness). In the MWMR setting this does not hold due to the possibility of concurrent writes. Consequently, in order to allow operations to terminate in either two or three communication exchanges, algorithm Erato-mw adapts the read protocol from algorithm OhMam in combination with the iterative technique using quorum views of CwFr. The latter approach not only predicts the completion status of a write operation, but also detects the last potentially complete write operation. The code is given in Algorithm 2.