Regal \RRtheme\THCom \URRocq
8083
\RRauthor
Annette Bieniusa, INRIA & UPMC, Paris, France
Marek Zawirski, INRIA & UPMC, Paris, France
Nuno Preguiça, CITI, Universidade Nova de Lisboa, Portugal
Marc Shapiro, INRIA & LIP6, Paris, France
Carlos Baquero, HASLab, INESC TEC & Universidade do Minho, Portugal
Valter Balegas, CITI, Universidade Nova de Lisboa, Portugal
Sérgio Duarte, CITI, Universidade Nova de Lisboa, Portugal
\authorhead{Bieniusa, Zawirski, Preguiça, Shapiro, Baquero, Balegas, Duarte}
Octobre 2012
\RRtitle{Optimisation d’un type de données ensemble répliqué sans conflit}
\titlehead{An optimized conflict-free replicated set}
\RRetitle{An Optimized Conflict-free Replicated Set}
Eventual consistency of replicated data supports concurrent updates, reduces latency and improves fault tolerance, but forgoes strong consistency. Accordingly, several cloud computing platforms implement eventually-consistent data types.
The set is a widespread and useful abstraction, and many replicated set designs have been proposed. We present a reasoning abstraction, permutation equivalence, that systematizes the characterization of the expected concurrency semantics of concurrent types. Under this framework we present one of the existing conflict-free replicated data types, Observed-Remove Set.
Furthermore, in order to decrease the size of metadata, we propose a new optimization to avoid tombstones. This approach can be transposed to other data types, such as maps, graphs or sequences.
La réplication des données avec cohérence à terme permet les mises à jour concurrentes, réduit la latence, et améliore la tolérance aux fautes, mais abandonne la cohérence forte. Aussi, cette approche est utilisée dans plusieurs plateformes de nuage.
L’ensemble (Set) est une abstraction largement utilisée, et plusieurs modèles d’ensemble répliqués ont été proposés. Nous présentons l’équivalence de permutation, un principe de raisonnement qui caractérise de façon systématique la sémantique attendue d’un type de données concurrent. Ce principe nous permet d’expliquer la conception d’un type déjà connu, Observed-Remove Set.
Par ailleurs, afin de diminuer la taille des métadonnées, nous proposons une nouvelle optimisation qui évite les « pierres tombales ». Cette approche peut se transposer à d’autres types de données, comme les mappes, les graphes ou les séquences.
Réplication des données, réplication optimiste, opérations commutatives
\RRkeyword{Data replication, optimistic replication, commutative operations}
1 Introduction
Eventual consistency of replicated data supports concurrent updates, reduces latency and improves fault tolerance, but forgoes strong consistency (e.g., linearisability). Accordingly, several cloud computing platforms implement eventually-consistent replicated sets [3, 11]. Eventual consistency allows concurrent updates at different replicas, under the expectation that replicas will eventually converge [13]. However, solutions for addressing concurrent updates tend to be either limited or very complex and error-prone [7].
We follow a different approach: Strong Eventual Consistency (SEC) [9] requires a deterministic outcome for any pair of concurrent updates. Thus, different replicas can be updated in parallel, and concurrent updates are resolved locally, without requiring consensus. Some simple conditions (e.g., that concurrent updates commute with one another) are sufficient to ensure SEC. Data types that satisfy these conditions are called Conflict-Free Replicated Data Types (CRDTs). Replicas of a CRDT object can be updated without synchronization and are guaranteed to converge. This approach has been adopted in several works [15, 12, 6, 14, 11].
The set is a pervasive data type, used either directly or as a component of more complex data types, such as maps or graphs. This paper highlights the semantics of sets under eventual consistency, and introduces an optimized set implementation, Optimized Observed Remove Set.
2 Principle of Permutation Equivalence
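The sequential semantics of a set are well known, and are defined by individual updates, e.g., $\{\mathit{true}\}\ \mathit{add}(e)\ \{e \in S\}$ (in “{precondition} computation {postcondition}” notation), where $S$ denotes its abstract state. However, the semantics of concurrent modifications is left underspecified or implementation-driven.
We propose the following Principle of Permutation Equivalence [2] to express that concurrent behaviour conforms to the sequential specification: “If all sequential permutations of updates lead to equivalent states, then it should also hold that concurrent executions of the updates lead to equivalent states.” It implies the following behavior, for some updates $u$ and $u'$:
$$\{P\}\ u ; u'\ \{Q\} \;\land\; \{P\}\ u' ; u\ \{Q'\} \;\land\; (Q \Leftrightarrow Q') \;\Rightarrow\; \{P\}\ u \parallel u'\ \{Q\}$$
Specifically for replicated sets, the Principle of Permutation Equivalence requires that $\{e \neq f\}\ \mathit{add}(e) \parallel \mathit{remove}(f)\ \{e \in S \land f \notin S\}$, and similarly for operations on different elements or idempotent operations. Only the pair $\mathit{add}(e) \parallel \mathit{remove}(e)$ is unspecified by the principle, since $\mathit{add}(e) ; \mathit{remove}(e)$ differs from $\mathit{remove}(e) ; \mathit{add}(e)$. Any of the following postconditions ensures a deterministic result:
$\{\bot_e \in S\}$: error mark
$\{e \in S\}$: add wins
$\{e \notin S\}$: remove wins
$\{\mathit{add}(e) >_{CLK} \mathit{remove}(e) \Leftrightarrow e \in S\}$: last writer wins (LWW)
where $<_{CLK}$ compares unique clocks associated with the operations. Note that not all concurrency semantics can be explained as a sequential permutation; for instance, no sequential execution ever results in an error mark.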
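The precondition of the principle can be checked mechanically on small cases. The following Python sketch is illustrative only: the helper names `add`, `remove` and `sequential_outcomes` are assumptions introduced here, and the sequential set is modelled as an immutable `frozenset`. It enumerates all sequential orders of a group of updates and collects the distinct final states.

```python
from itertools import permutations

def add(e):
    """Sequential add: afterwards e is in the set."""
    return lambda s: s | {e}

def remove(e):
    """Sequential remove: afterwards e is not in the set."""
    return lambda s: s - {e}

def sequential_outcomes(initial, updates):
    """Apply every permutation of `updates` to `initial`; return the distinct final states."""
    outcomes = set()
    for order in permutations(updates):
        state = frozenset(initial)
        for update in order:
            state = update(state)
        outcomes.add(state)
    return outcomes

# Different elements: every order agrees, so the principle fixes the concurrent result.
different = sequential_outcomes({"f"}, [add("e"), remove("f")])

# Same element: the two orders disagree, so the principle leaves the result open.
same = sequential_outcomes(set(), [add("e"), remove("e")])
```

With a single outcome, the concurrent execution must produce that state; with two outcomes, a design has to pick one of the deterministic postconditions listed above.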
3 A Review of Existing Replicated Set Designs
In the past, several designs have been proposed for maintaining a replicated set. Most of them violate the Principle of Permutation Equivalence (Fig. 1). For instance, the Amazon Dynamo shopping cart [3] is implemented using a register supporting read and write (assignment) operations, offering the standard sequential semantics. When two writes occur concurrently, the next read returns their union. As noted by the authors themselves, in case of concurrent updates even on unrelated elements, a remove may be undone (Fig. 1).
Sovran et al. and Aslan et al. [11, 1] propose a set variant, CSet, where for each element the associated add and remove updates are counted. The element is in the abstraction if their difference is positive. CSet violates the Principle of Permutation Equivalence (Fig. 1). When delivering the updates to both replicas as sketched, the add and remove counts are equal, i.e., $e$ is not in the abstraction, even though the last update at each replica is $\mathit{add}(e)$.
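The counting anomaly can be reproduced with a toy model. The sketch below is not the CSet algorithm of [11, 1]; it is a minimal counter-based set, assuming only the membership rule stated above (adds minus removes is positive). The class `CountingSet`, the helper `deliver` and the schedule are hypothetical and may differ from the schedule shown in Fig. 1.

```python
from collections import Counter

class CountingSet:
    """Toy counter-based set: an element is present iff adds - removes > 0."""

    def __init__(self):
        self.adds = Counter()
        self.removes = Counter()

    def apply(self, op, e):
        """Deliver one update (local or remote); updates only increment counters."""
        if op == "add":
            self.adds[e] += 1
        else:
            self.removes[e] += 1

    def contains(self, e):
        return self.adds[e] - self.removes[e] > 0

def deliver(op, e, *replicas):
    """Apply the same update at each given replica (source first)."""
    for replica in replicas:
        replica.apply(op, e)

r1, r2 = CountingSet(), CountingSet()

# Hypothetical schedule, chosen for illustration.
deliver("add", "e", r1, r2)     # r1 adds e; the update reaches r2
deliver("remove", "e", r1, r2)  # r1 removes e ...
deliver("remove", "e", r2, r1)  # ... concurrently with r2; both removes are exchanged
deliver("add", "e", r1, r2)     # r1 adds e again; this is the last update everywhere
```

Both replicas converge, with two adds and two removes counted, so `e` is reported absent although the last update applied at each replica is an add. Every sequential order compatible with this schedule ends with the add and would leave `e` in the set.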
4 Add-wins Replicated Sets
In Section 2 we have shown that when considering concurrent add and remove operations over the same element, one among several postconditions can be chosen. Considering the case of an add-wins semantics, we now recall [9] the CRDT design of an Observed Remove Set, or ORSet, and then introduce an optimized design that preserves the ORSet behaviour and greatly improves its space complexity.
These CRDT specifications follow a new notation with mixed state- and operation-based update propagation. Although the formalization of this mixed model, and the associated proof obligations that check compliance with CRDT requirements, is out of the scope of this report, the notation is easy to infer from the standard CRDT model [9, 8, 10].
System model synopsis: We consider a single object, replicated at a given set of processes/replicas. A client of the object may invoke an operation at some replica of its choice, which is called the source of the operation. A query executes entirely at the source. An update applies its side effects first to the source replica, then (eventually) at all replicas, in the downstream for that update. To this effect, an update is modeled as an update pair $(p, u)$ of two operations, such that $p$ is a side-effect free prepare(-update) operation and $u$ is an effect(-update) operation; the source executes the prepare and effect atomically; downstream replicas execute only the effect $u$. In the mixed state- and operation-based modelling, replica state can be changed either by applying an effect operation or by merging state from another replica of the same object. The monotonic evolution of replica states is described by a compare operation, supplied with each CRDT specification.
4.1 Observed Remove Set
ORSet: Add-wins replicated set
payload set $E$, set $T$   ▷ $E$: elements; $T$: tombstones
    ▷ sets of pairs { (element $e$, unique-tag $n$), … }
initial $\emptyset, \emptyset$
query contains (element $e$) : boolean $b$
    let $b = (\exists n : (e, n) \in E)$
query elements () : set $S$
    let $S = \{e \mid \exists n : (e, n) \in E\}$
update add (element $e$)
    at source
        let $n = \mathit{unique}()$   ▷ unique() returns a unique tag
    downstream ($e$, $n$)
        $E := (E \cup \{(e, n)\}) \setminus T$   ▷ $e$ + unique tag
update remove (element $e$)
    at source
        let $R = \{(e, n) \mid \exists n : (e, n) \in E\}$   ▷ Collect pairs containing $e$
    downstream ($R$)
        $E := E \setminus R$   ▷ Remove pairs observed at source
        $T := T \cup R$
compare ($A$, $B$) : boolean $b$
    let $b = ((A.E \cup A.T) \subseteq (B.E \cup B.T)) \land (A.T \subseteq B.T)$
merge ($B$)
    $E := (E \setminus B.T) \cup (B.E \setminus T)$
    $T := T \cup B.T$
Figure 4.1 shows our specification for an add-wins replicated set CRDT. Its concurrent specification is, for each element $e$, defined as follows:


.
To implement add-wins, the idea is to distinguish different invocations of $\mathit{add}(e)$ by adding a hidden unique token $n$, and effectively store a pair $(e, n)$. A pair $(e, n)$ is removed by adding it to a tombstone set. An element can always be added again, because the new pair always uses a fresh token $n'$, different from the old one, $n' \neq n$. If the same element $e$ is both added and removed concurrently, the update-prepare of remove concerns only observed pairs $(e, n)$ and not the concurrently-added unique pair $(e, n')$. Therefore the add wins by adding a new pair. We call this object an Observed Remove Set, or ORSet. As illustrated in Figure 1, ORSet is immune from the anomaly that plagues CSet.
Space complexity: The payload size of ORSet is at any moment bounded by the number of all applied add (effect-update) operations.
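The ORSet behaviour described in this subsection can be sketched in a few lines of Python. This is a simplified, purely state-based illustration, not the report's mixed state- and operation-based specification: the class name `ORSet`, the attribute names `E` (live pairs) and `T` (tombstones), and the use of `uuid` for unique tags are assumptions made for the example.

```python
import uuid

class ORSet:
    """State-based sketch of an add-wins Observed Remove Set."""

    def __init__(self):
        self.E = set()  # live (element, tag) pairs
        self.T = set()  # tombstones: removed (element, tag) pairs

    def contains(self, e):
        return any(x == e for (x, _) in self.E)

    def elements(self):
        return {x for (x, _) in self.E}

    def add(self, e):
        # Every add gets a fresh tag, so it never matches an existing tombstone.
        self.E.add((e, uuid.uuid4().hex))

    def remove(self, e):
        # Only pairs observed at this replica are moved to the tombstone set.
        observed = {(x, n) for (x, n) in self.E if x == e}
        self.E -= observed
        self.T |= observed

    def merge(self, other):
        # Keep a pair unless either side has recorded it as removed.
        self.E = (self.E - other.T) | (other.E - self.T)
        self.T |= other.T

a, b = ORSet(), ORSet()
a.add("e")
b.merge(a)      # both replicas now know e
a.remove("e")   # replica a removes the pair it observed ...
b.add("e")      # ... while replica b concurrently adds e with a fresh tag
a.merge(b)
b.merge(a)
```

After the two merges both replicas hold the same single live pair for `e`, so the concurrent add wins, and the removed pair survives only as a tombstone in `T`. That tombstone is the metadata growth the space bound above refers to.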
4.2 Optimized Observed Remove Set
The ORSet design makes extensive use of unique identifiers and tombstones, as do other CRDTs [6, 14, 8]. We now show how to make CRDTs practical by minimizing the required metadata.
Immediately discarding tombstones: When comparing two payloads $A$ and $B$, one containing some element $e$ and the other not, it is important to know whether $e$ has been recently added to $A$, or whether it was recently removed from $B$. The presented add-wins set uses tombstones to unambiguously answer this question, even when updates are delivered out of order or multiple times.
Tombstones accumulate (as a consequence of the monotonic semilattice requirement); if they cannot be discarded, memory requirements grow with the number of operations. To address this issue, Wuu’s 2PSet [15] garbage-collects tombstones that have been delivered everywhere, basically by waiting for acknowledgements from each process to every other process. This adds communication and processing overhead, and requires all processes to be correct. We devise a novel technique to eliminate tombstones without these limitations and offer conflict-free semantics at an affordable cost. We present our solution using add-wins as the example.
To recapitulate, in ORSet, adding an element $e$ creates a new unique pair in the $E$ part of the payload. Removing an element $e$ moves all pairs containing $e$ observed at the source from $E$ to $T$.
We leverage these observations to propose a novel remove algorithm that discards a removed pair immediately and works safely with merge. It compactly records happens-before information to summarize removed elements. Figure 2 presents Optimized ORSet (OptORSet) based on this approach.
Each replica $r$ maintains a vector $v$ [5] to summarize the unique identifiers it has already observed. Entry $v[i] = c$ at replica $r$ indicates that this replica has observed $c$ successive identifiers generated at $i$: $(1, i), (2, i), \dots, (c, i)$. Replica $r$ maintains its local counter as the $r$-th entry in the vector, $v[r]$, initially $0$. A replica generates new unique identifiers by incrementing its local counter. Note that to summarize successive identifiers in a vector, OptORSet requires causal delivery of updates.
When add is invoked, the source associates it with a unique identifier made of the next local counter value and the source replica identifier. When the add is delivered to a downstream replica, it should have an effect only if it has not been previously delivered; for this, it checks whether the unique identifier is incorporated in the downstream replica’s vector. When merging payloads, an element should be in the merged state only if: either it is in both payloads (set $M$ in Figure 2), or it is in the local payload and not recently removed from the remote one (set $M'$), or vice versa ($M''$). An element has been removed if it is not in the payload but its identifier is reflected in the replica’s vector.
This approach can be generalized to any CRDT where elements are added and removed, e.g., a sequence [6, 14] or a graph [10].
Coalescing repeated adds: Another source of memory growth in the original ORSet is elements that are added several times. Similarly to tombstones, they pollute the state with a unique identifier for every add. We observe that for every combination of element and source replica, it is enough to keep the identifier of the latest add, which subsumes previously added elements. The OptORSet specification leverages this observation in the add and merge definitions, by discarding unnecessary identifiers (set $O$).
Space complexity: The payload size of OptORSet is bounded by $O(|\mathit{elements}| \cdot n + n)$ at any moment, where $n$ is the number of processes in the system and $|\mathit{elements}|$ is the number of elements present in the set. The first component corresponds to the maximum number of timestamps in set $E$ and the second captures the size of the vector $v$. In the common case, where the number of processes repeatedly invoking add can be considered a constant, the payload size is $O(|\mathit{elements}|)$.
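The two optimizations of this subsection, vector-based removal without tombstones and coalescing of repeated adds, can be sketched as follows. This is a simplified state-based illustration written for this text, not the specification of Figure 2: the class name `OptORSet`, the triple layout `(element, counter, source)`, the fixed number of replicas and the local variable names are all assumptions.

```python
class OptORSet:
    """State-based sketch of an optimized OR-Set: no tombstones, one vector per replica."""

    def __init__(self, replica_id, n_replicas):
        self.r = replica_id
        self.E = set()             # live triples (element, counter, source replica)
        self.v = [0] * n_replicas  # v[i] = c: identifiers (1, i) .. (c, i) were observed

    def contains(self, e):
        return any(x == e for (x, _, _) in self.E)

    def add(self, e):
        self.v[self.r] += 1        # next local counter value
        c = self.v[self.r]
        # Coalesce: this add subsumes older adds of e issued by the same source.
        self.E = {(x, k, i) for (x, k, i) in self.E if not (x == e and i == self.r)}
        self.E.add((e, c, self.r))

    def remove(self, e):
        # Drop observed triples immediately; the vector remembers they existed.
        self.E = {t for t in self.E if t[0] != e}

    def merge(self, other):
        both = self.E & other.E
        # Present on one side only: keep it unless the other side already observed
        # its identifier (observed but absent means it was removed there).
        only_here = {(x, c, i) for (x, c, i) in self.E - other.E if c > other.v[i]}
        only_there = {(x, c, i) for (x, c, i) in other.E - self.E if c > self.v[i]}
        union = both | only_here | only_there
        # Discard identifiers subsumed by a later add of the same element and source.
        self.E = {(x, c, i) for (x, c, i) in union
                  if not any(y == x and j == i and d > c for (y, d, j) in union)}
        self.v = [max(p, q) for p, q in zip(self.v, other.v)]

a, b = OptORSet(0, 2), OptORSet(1, 2)
a.add("e")
b.merge(a)      # both replicas know ("e", 1, 0)
a.remove("e")   # a discards the triple at once, keeping only v = [1, 0]
b.add("e")      # concurrent add at b creates ("e", 1, 1)
a.merge(b)
b.merge(a)
```

Both replicas end with the single triple created by the concurrent add and with equal vectors, so the add wins as in ORSet, while the removed triple leaves no trace beyond the vector entry. The payload here is one triple plus a vector of length two, in line with the bound stated above.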
5 Conclusions
Conflict-Free Replicated Data Types (CRDTs) allow a system to maintain multiple replicas of data that are updated without requiring synchronization while guaranteeing Strong Eventual Consistency. This allows, for example, a cloud infrastructure to maintain replicas of data in data centers spread over large geographical distances and still provide low access latency by choosing the data center closest to the client.
In this paper we reviewed existing replicated set designs and contrasted them with the CRDT ORSet design, under the principle of permutation equivalence. Bearing in mind that the base ORSet favored simplicity at the expense of scalability, we introduced a new optimized design, Optimized ORSet, that greatly improves its scalability and should favor efficient implementations of sets and other CRDTs that share the ORSet design techniques.
Footnotes
 This research is supported in part by ANR project ConcoRDanT (ANR-10-BLAN-0208), by ERDF, COMPETE Programme, by a Google European Doctoral Fellowship in Distributed Computing received by Marek Zawirski, and by FCT projects #PTDC/EIA-EIA/104022/2008 and #PTDC/EIA-EIA/108963/2008.
 A practical implementation will just set a mark bit on the representation of the removed pair and will deallocate any other associated storage. Consider for instance the extension of ORSet to a map: a key will have some associated value, e.g., $E$ would contain triples $(\mathit{key}, \mathit{value}, n)$. When the key is removed, the value can be discarded, but the corresponding pair(s) $(\mathit{key}, n)$ must remain in $T$.
 It is easy to extend this solution for updates delivered out of happens-before order by using instead a version vector with exceptions [4].
References
[1] Khaled Aslan, Pascal Molli, Hala Skaf-Molli, and Stéphane Weiss. C-Set: a commutative replicated data type for semantic stores. In RED: Fourth International Workshop on REsource Discovery, Heraklion, Greece, 2011.
[2] Annette Bieniusa, Marek Zawirski, Nuno Preguiça, Marc Shapiro, Carlos Baquero, Valter Balegas, and Sérgio Duarte. Brief announcement: Semantics of eventually consistent replicated sets. In Proceedings of the 26th International Conference on Distributed Computing, DISC’12, Berlin, Heidelberg, 2012. Springer-Verlag.
[3] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value store. In Symp. on Op. Sys. Principles (SOSP), volume 41 of Operating Systems Review, pages 205–220, Stevenson, Washington, USA, October 2007. Assoc. for Computing Machinery.
[4] Dahlia Malkhi and Douglas B. Terry. Concise version vectors in WinFS. Distributed Computing, 20(3):209–219, 2007.
[5] D. S. Parker, G. J. Popek, G. Rudisin, A. Stoughton, B. J. Walker, E. Walton, J. M. Chow, D. Edwards, S. Kiser, and C. Kline. Detection of mutual inconsistency in distributed systems. IEEE Trans. Softw. Eng., 9:240–247, May 1983.
[6] Nuno Preguiça, Joan Manuel Marquès, Marc Shapiro, and Mihai Leţia. A commutative replicated data type for cooperative editing. In Int. Conf. on Distributed Comp. Sys. (ICDCS), pages 395–403, Montréal, Canada, June 2009.
[7] Yasushi Saito and Marc Shapiro. Optimistic replication. Computing Surveys, 37(1):42–81, March 2005.
[8] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. A comprehensive study of Convergent and Commutative Replicated Data Types. Rapport de recherche 7506, Institut Nat. de la Recherche en Informatique et Automatique (INRIA), Rocquencourt, France, January 2011.
[9] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. Conflict-free replicated data types. In Xavier Défago, Franck Petit, and V. Villain, editors, Int. Symp. on Stabilization, Safety, and Security of Distributed Systems (SSS), volume 6976 of Lecture Notes in Comp. Sc., pages 386–400, Grenoble, France, October 2011. Springer-Verlag GmbH.
[10] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. Convergent and commutative replicated data types. Bulletin of the European Association for Theoretical Computer Science (EATCS), (104):67–88, June 2011.
[11] Yair Sovran, Russell Power, Marcos K. Aguilera, and Jinyang Li. Transactional storage for geo-replicated systems. In Symp. on Op. Sys. Principles (SOSP), pages 385–400, Cascais, Portugal, October 2011. Assoc. for Computing Machinery.
[12] Robert H. Thomas. A majority consensus approach to concurrency control for multiple copy databases. Trans. on Database Systems, 4(2):180–209, June 1979.
[13] Werner Vogels. Eventually consistent. ACM Queue, 6(6):14–19, October 2008.
[14] Stéphane Weiss, Pascal Urso, and Pascal Molli. Logoot-Undo: Distributed collaborative editing system on P2P networks. IEEE Trans. on Parallel and Dist. Sys. (TPDS), 21:1162–1174, 2010.
[15] Gene T. J. Wuu and Arthur J. Bernstein. Efficient solutions to the replicated log and dictionary problems. In Symp. on Principles of Dist. Comp. (PODC), pages 233–242, Vancouver, BC, Canada, August 1984.