\RRprojet Regal
\RRtheme \THCom
\URRocq
\RRNo 8083
\RRauthor Annette Bieniusa, INRIA & UPMC, Paris, France
Marek Zawirski, INRIA & UPMC, Paris, France
Nuno Preguiça, CITI, Universidade Nova de Lisboa, Portugal
Marc Shapiro, INRIA & LIP6, Paris, France
Carlos Baquero, HASLab, INESC TEC & Universidade do Minho, Portugal
Valter Balegas, CITI, Universidade Nova de Lisboa, Portugal
Sérgio Duarte, CITI, Universidade Nova de Lisboa, Portugal
\authorhead Bieniusa, Zawirski, Preguiça, Shapiro, Baquero, Balegas, Duarte
\RRdate October 2012
\RRtitle Optimisation d’un type de données ensemble répliqué sans conflit
\titlehead An optimized conflict-free replicated set
\RRetitle An Optimized Conflict-free Replicated Set¹

\RRabstract

Eventual consistency of replicated data supports concurrent updates, reduces latency and improves fault tolerance, but forgoes strong consistency. Accordingly, several cloud computing platforms implement eventually-consistent data types.

The set is a widespread and useful abstraction, and many replicated set designs have been proposed. We present a reasoning abstraction, permutation equivalence, that systematizes the characterization of the expected concurrency semantics of replicated data types. Within this framework we explain the design of an existing conflict-free replicated data type, the Observed-Remove Set.

Furthermore, in order to decrease the size of meta-data, we propose a new optimization to avoid tombstones. This approach can be transposed to other data types, such as maps, graphs or sequences.

\RRresume

Replicating data with eventual consistency supports concurrent updates, reduces latency, and improves fault tolerance, but forgoes strong consistency. Accordingly, this approach is used in several cloud computing platforms.

The set is a widely-used abstraction, and several replicated set designs have been proposed. We present permutation equivalence, a reasoning principle that systematically characterizes the expected semantics of a concurrent data type. This principle allows us to explain the design of an already-known type, the Observed-Remove Set.

Furthermore, in order to reduce the size of meta-data, we propose a new optimization that avoids tombstones. This approach can be transposed to other data types, such as maps, graphs or sequences.

\RRmotcle

Data replication, optimistic replication, commutative operations \RRkeyword Data replication, optimistic replication, commutative operations

\makeRR

1 Introduction

Eventual consistency of replicated data supports concurrent updates, reduces latency and improves fault tolerance, but forgoes strong consistency (e.g., linearisability). Accordingly, several cloud computing platforms implement eventually-consistent replicated sets [3, 11]. Eventual consistency allows concurrent updates at different replicas, under the expectation that replicas will eventually converge [13]. However, solutions for addressing concurrent updates tend to be either limited or very complex and error-prone [7].

We follow a different approach: Strong Eventual Consistency (SEC) [9] requires a deterministic outcome for any pair of concurrent updates. Thus, different replicas can be updated in parallel, and concurrent updates are resolved locally, without requiring consensus. Some simple conditions (e.g., that concurrent updates commute with one another) are sufficient to ensure SEC. Data types that satisfy these conditions are called Conflict-Free Replicated Data Types (CRDTs). Replicas of a CRDT object can be updated without synchronization and are guaranteed to converge. This approach has been adopted in several works [15, 12, 6, 14, 11].

The set is a pervasive data type, used either directly or as a component of more complex data types, such as maps or graphs. This paper highlights the semantics of sets under eventual consistency, and introduces an optimized set implementation, Optimized Observed Remove Set.

2 Principle of Permutation Equivalence

The sequential semantics of a set are well known, and are defined by individual updates, e.g., $\{\mathit{true}\}\ add(e)\ \{e \in S\}$ and $\{\mathit{true}\}\ remove(e)\ \{e \notin S\}$ (in “{pre-condition} computation {post-condition}” notation), where $S$ denotes the abstract state of the set. However, the semantics of concurrent modifications is left underspecified or implementation-driven.

We propose the following Principle of Permutation Equivalence [2] to express that concurrent behaviour conforms to the sequential specification: “If all sequential permutations of updates lead to equivalent states, then it should also hold that concurrent executions of the updates lead to equivalent states.” It implies the following behavior, for any updates $u$ and $u'$:

\[ \{P\}\ u; u'\ \{Q\} \;\wedge\; \{P\}\ u'; u\ \{Q\} \;\Longrightarrow\; \{P\}\ u \parallel u'\ \{Q\} \]

Specifically for replicated sets, the Principle of Permutation Equivalence requires that $\{\mathit{true}\}\ add(e) \parallel remove(e')\ \{e \in S \wedge e' \notin S\}$ for $e \neq e'$, and similarly for operations on different elements or idempotent operations. Only the pair $add(e) \parallel remove(e)$ is unspecified by the principle, since $add(e); remove(e)$ differs from $remove(e); add(e)$. Any of the following post-conditions ensures a deterministic result:

  • $\{e \in S\}$ (add wins);
  • $\{e \notin S\}$ (remove wins);
  • $\{e \in S \Leftrightarrow t_{add} > t_{rem}\}$ (last writer wins);
  • an error mark on $e$, exposing the conflict to the application,

where $t_{add} > t_{rem}$ compares unique clocks associated with the operations. Note that not all concurrency semantics can be explained as a sequential permutation; for instance, no sequential execution ever results in an error mark.
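As an illustrative sketch (not part of the original specification), the three deterministic post-conditions can be read as a resolution policy applied identically at every replica; the Python fragment below uses timestamps in the role of the unique clocks, and all names are hypothetical:

```python
# Sketch: resolving a concurrent add(e) || remove(e) pair under the
# three deterministic policies. The timestamps stand in for the
# "unique clocks" associated with the operations.

def resolve(policy, t_add, t_rem):
    """Return True iff e is in the set after add(e) || remove(e)."""
    if policy == "add-wins":
        return True            # post-condition: e in S
    if policy == "remove-wins":
        return False           # post-condition: e not in S
    if policy == "last-writer-wins":
        return t_add > t_rem   # the larger unique clock wins
    raise ValueError(policy)

# Every replica applies the same deterministic rule, so all converge.
assert resolve("add-wins", 1, 2) is True
assert resolve("remove-wins", 9, 2) is False
assert resolve("last-writer-wins", 3, 2) is True
```

Because the rule depends only on the pair of concurrent updates (and their clocks), every replica computes the same outcome without coordination.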

Figure 1: Examples of anomalies and a correct design: (a) Dynamo shopping cart, (b) C-Set, (c) OR-Set.

3 A Review of Existing Replicated Set Designs

In the past, several designs have been proposed for maintaining a replicated set. Most of them violate the Principle of Permutation Equivalence (Fig. 1). For instance, the Amazon Dynamo shopping cart [3] is implemented using a register supporting \Read and \Write (assignment) operations, offering the standard sequential semantics. When two \Writes occur concurrently, the next \Read returns their union. As noted by the authors themselves, in case of concurrent updates, even on unrelated elements, a \remove may be undone (Fig. 1a).
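This anomaly can be reproduced with a minimal sketch. The fragment below is a simplification of Dynamo's register (all names are illustrative): the cart is a register holding a set, and concurrent writes are merged by taking their union on read, so a concurrent, unrelated update silently undoes the remove:

```python
# Sketch of the Dynamo shopping-cart anomaly: a register holding a set,
# where a read merges concurrent register values by union.

def read_merge(*concurrent_writes):
    """Dynamo-style read: return the union of concurrent values."""
    merged = set()
    for value in concurrent_writes:
        merged |= value
    return merged

cart = {"book", "pen"}

# Replica 1 removes "pen" and writes the new value back.
r1_write = cart - {"pen"}
# Concurrently, replica 2 adds an unrelated element "milk".
r2_write = cart | {"milk"}

# The union merge resurrects "pen", undoing the remove.
assert read_merge(r1_write, r2_write) == {"book", "pen", "milk"}
```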

Sovran et al. and Aslan et al. [11, 1] propose a set variant, C-Set, where for each element the associated \add and \remove updates are counted. The element is in the abstraction if their difference is positive. C-Set violates the Principle of Permutation Equivalence (Fig. 1b). When delivering the updates to both replicas as sketched, the add and remove counts are equal, i.e., $e$ is not in the abstraction, even though the last update at each replica is $add(e)$.
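A minimal executable sketch of this anomaly, assuming per-element add/remove counters as described (class and variable names are illustrative):

```python
# Sketch of the C-Set anomaly. Per element we count adds and removes;
# the element is in the abstraction iff adds - removes > 0.
from collections import Counter

class CSet:
    def __init__(self):
        self.adds = Counter()
        self.removes = Counter()

    def add(self, e):        # commutative update: increment add count
        self.adds[e] += 1

    def remove(self, e):     # commutative update: increment remove count
        self.removes[e] += 1

    def contains(self, e):
        return self.adds[e] - self.removes[e] > 0

# Both replicas start from a state containing e (one add delivered).
# Replica 1 then performs remove(e); add(e); replica 2 performs
# remove(e). After all updates are delivered everywhere:
r = CSet()
r.add("e")                   # initial add
r.remove("e"); r.add("e")    # replica 1's updates
r.remove("e")                # replica 2's update

# adds == removes == 2, so e is reported absent, although replica 1's
# add(e) was never causally followed by a remove(e).
assert r.contains("e") is False
```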

4 Add-wins Replicated Sets

In Section 2 we have shown that, when considering concurrent \add and \remove operations over the same element, one among several post-conditions can be chosen. Considering the case of add-wins semantics, we now recall [9] the CRDT design of an Observed-Remove Set, or OR-Set, and then introduce an optimized design that preserves the OR-Set behaviour and greatly improves its space complexity.

These CRDT specifications follow a new notation with mixed state- and operation-based update propagation. Although the formalization of this mixed model, and the associated proof obligations that check compliance with the CRDT requisites, are out of the scope of this report, the notation is easy to infer from the standard CRDT model [9, 8, 10].

System model synopsis: We consider a single object, replicated at a given set of processes/replicas. A client of the object may invoke an operation at some replica of its choice, which is called the source of the operation. A query executes entirely at the source. An update applies its side effects first to the source replica, then (eventually) at all replicas, in the downstream for that update. To this effect, an update is modeled as an update pair $(t, u)$ of two operations, where $t$ is a side-effect-free prepare(-update) operation and $u$ is an effect(-update) operation; the source executes the prepare $t$ and the effect $u$ atomically; downstream replicas execute only the effect $u$. In the mixed state- and operation-based modelling, replica state can be changed both by applying an effect operation and by merging state from another replica of the same object. The monotonic evolution of replica states is described by a compare operation, supplied with each CRDT specification.
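The prepare/effect split can be sketched as follows; this is an assumed minimal structure for illustration, not the report's exact notation:

```python
# Sketch of the prepare/effect split for operation-based updates:
# the source runs prepare (side-effect free) and effect atomically;
# downstream replicas run only the effect.

class Replica:
    def __init__(self):
        self.state = set()

    # prepare: reads state, builds the effect's arguments, no mutation
    def prepare_add(self, e):
        return ("add", e)

    # effect: the only part that mutates state; applied at every replica
    def effect(self, op):
        kind, e = op
        if kind == "add":
            self.state.add(e)

def update(source, downstream_replicas, e):
    op = source.prepare_add(e)     # prepare, at source only
    source.effect(op)              # source applies the effect atomically
    for r in downstream_replicas:  # the effect is propagated downstream
        r.effect(op)

a, b = Replica(), Replica()
update(a, [b], "x")
assert a.state == b.state == {"x"}
```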

4.1 Observed Remove Set

OR-Set: Add-wins replicated set

  Payload: set $E$, set $T$   ($E$: elements; $T$: tombstones; both are sets of pairs {(element $e$, unique-tag $n$), …})
  Initial: $E = T = \emptyset$
  Query contains($e$) : boolean $b$
    let $b = (\exists n : (e, n) \in E)$
  Query elements() : set $S$
    let $S = \{e \mid \exists n : (e, n) \in E\}$
  Update add($e$)
    At source: let $n = \mathit{unique}()$   ($\mathit{unique}()$ returns a unique tag)
    Downstream: $E := E \cup \{(e, n)\}$   (add $e$ + unique tag)
  Update remove($e$)
    At source: let $R = \{(e, n) \mid (e, n) \in E\}$   (collect pairs containing $e$)
    Downstream: $E := E \setminus R$; $T := T \cup R$   (remove pairs observed at source)
  Compare ($A$, $B$): $A \leq B \iff (A.E \cup A.T \subseteq B.E \cup B.T) \wedge (A.T \subseteq B.T)$
  Merge ($A$, $B$): $E := (A.E \setminus B.T) \cup (B.E \setminus A.T)$; $T := A.T \cup B.T$

Figure 4.1 shows our specification for an add-wins replicated set CRDT. Its concurrent specification is, for each element $e$, defined as follows:

  • $\{\mathit{true}\}\ add(e) \parallel remove(e)\ \{e \in S\}$, i.e., a concurrent add wins over a remove of the same element.

To implement add-wins, the idea is to distinguish different invocations of $add(e)$ by tagging each with a hidden unique token $\alpha$, and to effectively store the pair $(e, \alpha)$. A pair is removed by adding it to a tombstone set. An element can always be added again, because the new pair always uses a fresh token $\alpha'$, different from the old one. If the same element is both added and removed concurrently, the update-prepare of \remove concerns only observed pairs, and not the concurrently-added unique pair $(e, \alpha')$. Therefore the \add wins by adding a new pair. We call this object an Observed-Remove Set, or OR-Set. As illustrated in Figure 1c, OR-Set is immune from the anomaly that plagues C-Set.
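Under these rules, a compact executable sketch of the OR-Set behaves as follows (live pairs in E, tombstones in T; the counter-based tag generator is an implementation choice, and names are illustrative):

```python
# Sketch of an OR-Set: E holds live (element, unique-tag) pairs,
# T holds tombstones for removed pairs.
import itertools

_tag = itertools.count()   # stand-in for a globally unique tag generator

class ORSet:
    def __init__(self):
        self.E = set()     # live (element, tag) pairs
        self.T = set()     # tombstones

    def contains(self, e):
        return any(x == e for x, _ in self.E)

    def add(self, e):
        pair = (e, next(_tag))        # prepare: fresh unique tag
        self.E.add(pair)              # effect (also applied downstream)
        return pair

    def remove(self, e):
        observed = {p for p in self.E if p[0] == e}  # prepare at source
        self.E -= observed                           # effect: tombstone
        self.T |= observed                           # the observed pairs
        return observed

    def merge(self, other):
        self.E = (self.E - other.T) | (other.E - self.T)
        self.T |= other.T

# Concurrent add(e) || remove(e): the remove only tombstones pairs it
# observed, so the concurrently added fresh pair survives: add wins.
a, b = ORSet(), ORSet()
p = a.add("e"); b.E.add(p)   # b receives the downstream effect of the add
b.remove("e")                # b removes the observed pair
p2 = a.add("e")              # a concurrently re-adds e (fresh tag)
a.merge(b)
assert a.contains("e") and a.E == {p2}
```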

Space complexity: The payload size of OR-Set is at any moment bounded by the number of all applied add (effect-update) operations.

4.2 Optimized Observed Remove Set

Opt-OR-Set: Add-wins replicated set without tombstones

  Payload: set $E$, vect $v$   ($E$: set of triples (element $e$, timestamp $c$, replica $i$); $v$: summary (vector) of received triples)
  Initial: $E = \emptyset$; $v = [0, \dots, 0]$
  Query contains($e$) : boolean $b$
    let $b = (\exists c, i : (e, c, i) \in E)$
  Query elements() : set $S$
    let $S = \{e \mid \exists c, i : (e, c, i) \in E\}$
  Update add($e$)
    At source: let $i$ = source replica; let $c = v[i] + 1$
    Downstream (pre: causal delivery):
      if $c > v[i]$ then
        let $O = \{(e, c', i) \in E \mid c' < c\}$   (older adds of $e$ from $i$, now subsumed)
        $E := (E \setminus O) \cup \{(e, c, i)\}$
        $v[i] := c$
  Update remove($e$)
    At source: let $R = \{(e, c, i) \mid (e, c, i) \in E\}$   (collect all unique triples containing $e$)
    Downstream (pre: causal delivery): $E := E \setminus R$   (remove triples observed at source; no tombstones kept)
  Compare ($A$, $B$): $A \leq B \iff A.v \leq B.v \,\wedge\, \forall (e, c, i) \in A.E \setminus B.E : c \leq B.v[i]$
  Merge ($A$, $B$):
    let $M = A.E \cap B.E$
    let $M' = \{(e, c, i) \in A.E \setminus B.E \mid c > B.v[i]\}$
    let $M'' = \{(e, c, i) \in B.E \setminus A.E \mid c > A.v[i]\}$
    let $U = M \cup M' \cup M''$
    let $O = \{(e, c, i) \in U \mid \exists (e, c', i) \in U : c < c'\}$
    $E := U \setminus O$; $v := \max(A.v, B.v)$   (pointwise maximum)

Figure 2: Optimized OR-Set (Opt-OR-Set).

The OR-Set design makes extensive use of unique identifiers and tombstones, as do other CRDTs [6, 14, 8]. We now show how to make this CRDT practical by minimizing the required meta-data.

Immediately discarding tombstones: When comparing two payloads, one containing some element $e$ and the other not, it is important to know whether $e$ has been recently added to the former, or recently removed from the latter. The presented add-wins set uses tombstones to unambiguously answer this question, even when updates are delivered out of order or multiple times.

Tombstones accumulate (as a consequence of the monotonic semilattice requirement); if they cannot be discarded, memory requirements grow with the number of operations. To address this issue, Wuu’s 2P-Set [15] garbage-collects tombstones that have been delivered everywhere, basically by waiting for acknowledgements from each process to every other process. This adds communication and processing overhead, and requires all processes to be correct. We devise a novel technique to eliminate tombstones without these limitations and offer conflict-free semantics at an affordable cost. We present our solution using add-wins as the example.

To recapitulate, in OR-Set, adding an element $e$ creates a new unique pair $(e, \alpha)$ in the $E$ part of the payload. Removing an element moves all pairs containing $e$ observed at the source from $E$ to $T$ (footnote 2). Note that adding some pair $(e, \alpha)$ always happens-before removing the same pair. If updates are delivered in causal order and exactly once, the \add always executes before any related \remove, and the tombstone set is not necessary when executing operations. However, we also need to support state-based \merge, which joins two replicas possibly unrelated by happens-before. When merging two replicas of which only one has some pair $(e, \alpha)$, we need to know whether the pair has been added to the replica that contains it, or removed in the other one.

We leverage these observations to propose a novel \remove algorithm that discards a removed pair immediately and works safely with \merge. It compactly records happens-before information to summarize removed elements. Figure 2 presents the Optimized OR-Set (Opt-OR-Set) based on this approach.

Each replica maintains a vector $v$ [5] to summarize the unique identifiers it has already observed. Entry $v[j]$ indicates that this replica has observed the $v[j]$ successive identifiers generated at replica $j$: $(1, j), (2, j), \dots, (v[j], j)$. Replica $i$ maintains its local counter as the $i$-th entry of the vector $v$, initially $0$. A replica generates new unique identifiers by incrementing its local counter. Note that, to summarize successive identifiers in a vector, Opt-OR-Set requires causal delivery of updates (footnote 3).
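The identifier summary can be sketched as a map from replica to counter; `observed` below is a hypothetical helper testing whether an identifier is covered by the vector:

```python
# Sketch of the identifier summary: v[j] == k means this replica has
# observed the k successive identifiers (1, j) ... (k, j) generated at
# replica j. An identifier is "observed" iff its counter is covered.
from collections import defaultdict

def observed(v, ident):
    counter, replica = ident
    return counter <= v[replica]

v = defaultdict(int)            # entries default to 0 (nothing observed)
v["r1"] = 3                     # identifiers (1,r1), (2,r1), (3,r1) seen

assert observed(v, (2, "r1"))        # covered by the summary
assert not observed(v, (4, "r1"))    # not yet generated or delivered
assert not observed(v, (1, "r2"))    # nothing observed from r2
```

Causal delivery guarantees that identifiers from each replica arrive in counter order, which is what lets a single integer per replica summarize them all.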

When \add is invoked, the source associates it with a unique identifier made of the next local counter value and the source replica identifier. When the \add is delivered to a downstream replica, it should take effect only if it has not been previously delivered; for this, the replica checks whether the unique identifier is already incorporated in its vector. When \merge-ing payloads, an element should be in the merged state only if: either it is in both payloads (set $M$ in Figure 2), or it is in the local payload and not recently removed from the remote one (set $M'$), or vice-versa ($M''$); an element has been removed if it is not in the payload but its identifier is reflected in the replica’s vector.
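A minimal sketch of this tombstone-free merge, under the assumption of triples $(e, c, i)$ and per-replica vectors as in Figure 2 (function and variable names are illustrative):

```python
# Sketch of the tombstone-free merge (sets M, M', M''): a triple missing
# from one payload was removed there iff its identifier is already
# covered by that payload's vector.

def covered(v, c, i):
    return c <= v.get(i, 0)

def merge(E_a, v_a, E_b, v_b):
    M  = E_a & E_b                                   # in both payloads
    M1 = {(e, c, i) for (e, c, i) in E_a - E_b
          if not covered(v_b, c, i)}                 # new at a, unseen at b
    M2 = {(e, c, i) for (e, c, i) in E_b - E_a
          if not covered(v_a, c, i)}                 # new at b, unseen at a
    U = M | M1 | M2
    # coalesce: keep only the latest add per (element, replica)
    O = {(e, c, i) for (e, c, i) in U
         if any(e2 == e and i2 == i and c2 > c for (e2, c2, i2) in U)}
    E = U - O
    v = {i: max(v_a.get(i, 0), v_b.get(i, 0))        # pointwise maximum
         for i in set(v_a) | set(v_b)}
    return E, v

# Replica b received (x,1,a) and removed x: only the vector entry
# remains, no tombstone. Replica a concurrently added y as (y,2,a).
E_a, v_a = {("x", 1, "a"), ("y", 2, "a")}, {"a": 2}
E_b, v_b = set(), {"a": 1}
E, v = merge(E_a, v_a, E_b, v_b)
assert E == {("y", 2, "a")}   # the remove of x sticks; the add of y survives
assert v == {"a": 2}
```

The stale copy of ("x",1,"a") is dropped because its identifier is covered by b's vector, while ("y",2,"a") survives because b has never observed it; this is exactly the distinction tombstones used to provide.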

This approach can be generalized to any CRDT where elements are added and removed, e.g., a sequence [6, 14] or a graph [10].

Coalescing repeated adds: Another source of memory growth in the original OR-Set is due to elements that are added several times. Like tombstones, they pollute the state with unique identifiers from every \add. We observe that, for every combination of element and source replica, it is enough to keep the identifier of the latest \add, which subsumes the previous ones. The Opt-OR-Set specification leverages this observation in the \add and \merge definitions, by discarding the unnecessary identifiers (set $O$).

Space complexity: The payload size of Opt-OR-Set is bounded at any moment by $O(Dn + n)$, where $n$ is the number of processes in the system and $D$ is the number of elements present in the set. The first component corresponds to the maximum number of timestamps in set $E$ (at most one per element and source replica, thanks to coalescing), and the second captures the size of the vector $v$. In the common case, where the number of processes repeatedly invoking \adds can be considered a constant, the payload size is $O(D + n)$.

5 Conclusions

Conflict-Free Replicated Data Types (CRDTs) allow a system to maintain multiple replicas of data that are updated without requiring synchronization, while guaranteeing Strong Eventual Consistency. This allows, for example, a cloud infrastructure to maintain replicas of data in data centers spread over large geographical distances, and still provide low access latency by directing each client to its closest data center.

In this paper we reviewed existing replicated set designs and contrasted them with the CRDT OR-Set design, under the Principle of Permutation Equivalence. Bearing in mind that the base OR-Set favored simplicity at the expense of scalability, we introduced a new optimized design, Optimized OR-Set, that greatly improves its scalability and should enable efficient implementations of sets and of other CRDTs that share the OR-Set design techniques.

Footnotes

  1. Thanks: This research is supported in part by ANR project ConcoRDanT (ANR-10-BLAN 0208), by ERDF through the COMPETE Programme, by a Google European Doctoral Fellowship in Distributed Computing received by Marek Zawirski, and by FCT projects #PTDC/EIA-EIA/104022/2008 and #PTDC/EIA-EIA/108963/2008.
  2. A practical implementation will just set a mark bit on the representation of a removed pair and deallocate any other associated storage. Consider for instance the extension of OR-Set to a map: a key will have some associated value, e.g., $E$ would contain triples (key $k$, value $x$, unique-tag $\alpha$). When the key is removed, the value $x$ can be discarded, but the corresponding pair $(k, \alpha)$ must remain in $T$.
  3. It is easy to extend this solution to updates delivered out of happens-before order by using instead a version vector with exceptions [4].

References

  1. Khaled Aslan, Pascal Molli, Hala Skaf-Molli, and Stéphane Weiss. C-Set: a commutative replicated data type for semantic stores. In RED: Fourth International Workshop on REsource Discovery, Heraklion, Greece, 2011.
  2. Annette Bieniusa, Marek Zawirski, Nuno Preguiça, Marc Shapiro, Carlos Baquero, Valter Balegas, and Sérgio Duarte. Brief announcement: Semantics of eventually consistent replicated sets. In Proceedings of the 26th International Conference on Distributed Computing, DISC’12, Berlin, Heidelberg, 2012. Springer-Verlag.
  3. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value store. In Symp. on Op. Sys. Principles (SOSP), volume 41 of Operating Systems Review, pages 205–220, Stevenson, Washington, USA, October 2007. Assoc. for Computing Machinery.
  4. Dahlia Malkhi and Douglas B. Terry. Concise version vectors in WinFS. Distributed Computing, 20(3):209–219, 2007.
  5. D. S. Parker, G. J. Popek, G. Rudisin, A. Stoughton, B. J. Walker, E. Walton, J. M. Chow, D. Edwards, S. Kiser, and C. Kline. Detection of mutual inconsistency in distributed systems. IEEE Trans. Softw. Eng., 9:240–247, May 1983.
  6. Nuno Preguiça, Joan Manuel Marquès, Marc Shapiro, and Mihai Leţia. A commutative replicated data type for cooperative editing. In Int. Conf. on Distributed Comp. Sys. (ICDCS), pages 395–403, Montréal, Canada, June 2009.
  7. Yasushi Saito and Marc Shapiro. Optimistic replication. Computing Surveys, 37(1):42–81, March 2005.
  8. Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. A comprehensive study of Convergent and Commutative Replicated Data Types. Rapport de recherche 7506, Institut Nat. de la Recherche en Informatique et Automatique (INRIA), Rocquencourt, France, January 2011.
  9. Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. Conflict-free replicated data types. In Xavier Défago, Franck Petit, and V. Villain, editors, Int. Symp. on Stabilization, Safety, and Security of Distributed Systems (SSS), volume 6976 of Lecture Notes in Comp. Sc., pages 386–400, Grenoble, France, October 2011. Springer-Verlag GmbH.
  10. Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. Convergent and commutative replicated data types. Bulletin of the European Association for Theoretical Computer Science (EATCS), (104):67–88, June 2011.
  11. Yair Sovran, Russell Power, Marcos K. Aguilera, and Jinyang Li. Transactional storage for geo-replicated systems. In Symp. on Op. Sys. Principles (SOSP), pages 385–400, Cascais, Portugal, October 2011. Assoc. for Computing Machinery.
  12. Robert H. Thomas. A majority consensus approach to concurrency control for multiple copy databases. ACM Trans. on Database Systems, 4(2):180–209, June 1979.
  13. Werner Vogels. Eventually consistent. ACM Queue, 6(6):14–19, October 2008.
  14. Stephane Weiss, Pascal Urso, and Pascal Molli. Logoot-undo: Distributed collaborative editing system on P2P networks. IEEE Trans. on Parallel and Dist. Sys. (TPDS), 21:1162–1174, 2010.
  15. Gene T. J. Wuu and Arthur J. Bernstein. Efficient solutions to the replicated log and dictionary problems. In Symp. on Principles of Dist. Comp. (PODC), pages 233–242, Vancouver, BC, Canada, August 1984.