Continuous Partial Quorums for Consistency-Latency Tuning in Distributed NoSQL Storage Systems

Marlon McKenzie*, Hua Fan*, Wojciech Golab*
University of Waterloo, Canada
m2mckenzie@uwaterloo.ca, h27fan@uwaterloo.ca, wgolab@uwaterloo.ca
* Authors supported by research funding from Hewlett-Packard Labs, Google, and the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Abstract

NoSQL storage systems are used extensively by web applications and provide an attractive alternative to conventional databases when the need for scalability outweighs the need for transactions. Several of these systems provide quorum-based replication and present the application developer with a choice of multiple client-side “consistency levels” that determine the number of replicas accessed by reads and writes, which in turn affects both latency and the consistency observed by the client application. Since using a fixed combination of read and write consistency levels for a given application provides only a limited number of discrete options, we investigate techniques that allow more fine-grained tuning of the consistency-latency trade-off, as may be required to support consistency-based service level agreements (SLAs). We propose a novel technique called continuous partial quorums (CPQ) that assigns the consistency level on a per-operation basis by choosing randomly between two options, such as eventual and strong consistency, with a tunable probability. We evaluate our technique experimentally using Apache Cassandra and demonstrate that it outperforms an alternative tuning technique that delays operations artificially at clients.

[Please refer to the proceedings of SCDM’15 for the extended version of this manuscript.]

1 Introduction

NoSQL storage systems are used extensively by web applications and provide an attractive alternative to conventional databases when the need for scalability outweighs the need for transactions. Several of these systems, most notably Cassandra [16], Voldemort and Riak, are derivatives of Amazon’s Dynamo [11] and share a common quorum-based replication model that enables different behaviors with respect to Brewer’s CAP principle, which states that during a network partition a system must compromise either consistency or availability [6]. Application developers who use such systems face a choice of multiple client-side “consistency levels” that determine the size of a partial quorum for reads and writes, which is the number of replicas that must respond to a read or write request. This parameter directly affects the latency of read and write operations, and indirectly affects the consistency observed by client applications. Overlapping (e.g., majority) quorums are used to achieve so-called “strong consistency,” meaning that reads always return the latest value of a data object, whereas non-overlapping partial quorums provide weaker forms of consistency, particularly eventual consistency, whereby reads may return stale values for some period of time after an update while replicas of the data object converge to a common state. In this context a value is stale if it has been overwritten by a newer value, and is fresh otherwise; staleness is a very different concept from the age of a value with respect to the time it was written into the storage system.

In this paper we investigate the possibility of tuning the consistency-latency trade-off in a more fine-grained manner than is possible using client-side consistency levels, which offer a limited number of discrete choices (e.g., read one replica, read majority, etc.). Specifically, we focus on techniques that enable fine-grained consistency-latency tuning in quorum-replicated storage systems by varying a real-valued parameter rather than fixing a consistency level. Attaining fine-grained control over consistency and latency is an important step on the path to supporting service level agreements (SLAs), for example where a client application requests that read operations have 95th %-ile latency at most $L$ milliseconds and return stale values at most a fraction $S$ of the time for some thresholds $L$ and $S$. In this framework a latency-favoring application (e.g., a shopping cart) may specify a lower $L$ and higher $S$, whereas a consistency-favoring application (e.g., a personal cloud file system) may opt for a higher $L$ and lower $S$. Naturally such SLAs can also specify guarantees on throughput.
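As an illustration of the kind of SLA described above, the following minimal sketch checks a batch of read measurements against a latency threshold and a stale-read threshold. The function names, the parameter names, and the nearest-rank percentile rule are assumptions made for this example; the paper does not prescribe how the percentile is computed.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, for an integer q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def meets_sla(latencies_ms, stale_flags, latency_threshold_ms, stale_fraction_threshold):
    """Return True if the reads satisfy both thresholds.

    latencies_ms: one latency per read, in milliseconds.
    stale_flags: one boolean per read, True if the read returned a stale value.
    Both threshold parameters are hypothetical names for the two SLA bounds.
    """
    p95 = percentile(latencies_ms, 95)
    stale_fraction = sum(stale_flags) / len(stale_flags)
    return p95 <= latency_threshold_ms and stale_fraction <= stale_fraction_threshold
```

A latency-favoring application would call `meets_sla` with a small latency threshold and a larger stale-fraction threshold; a consistency-favoring application would do the opposite.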

Our main technical contribution in connection with fine-grained consistency-latency tuning is a novel technique called continuous partial quorums (CPQ), which entails making a random choice between multiple discrete consistency levels on a per-operation basis. For example, the application may choose consistency level one with probability $1 - p$ and majority quorums with probability $p$. In this case $p$ itself becomes a continuous tunable parameter in the range $[0, 1]$. In contrast, using fixed consistency levels for reads and writes and a replication factor of three, there are only three possible partial quorums (one, two/quorum, and three/all) and hence nine discrete combinations. Furthermore, only four of these combinations, namely the ones using the one and two/quorum consistency levels, provide availability in the presence of a single server failure.
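The per-operation random choice can be sketched in a few lines. This is only an illustration under stated assumptions: `choose_consistency_level` and `p_quorum` are hypothetical names, `p_quorum` is taken to be the probability of selecting the majority-quorum level, and the returned strings would still have to be mapped onto whatever consistency-level setting the client library exposes.

```python
import random


def choose_consistency_level(p_quorum, rng=random):
    """Pick a consistency level for a single operation.

    Returns "QUORUM" with probability p_quorum and "ONE" otherwise.
    rng is any object with a random() method returning a float in [0, 1).
    """
    if not 0.0 <= p_quorum <= 1.0:
        raise ValueError("p_quorum must lie in [0, 1]")
    return "QUORUM" if rng.random() < p_quorum else "ONE"
```

Setting `p_quorum` to 0 or 1 recovers the two fixed consistency levels, so the fixed-level configurations are the endpoints of the tunable range.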

We compare continuous partial quorums experimentally against an alternative technique called artificial delays (AD), in which clients use a weak consistency level such as read/write one (i.e., operations terminate when one replica responds) and boost consistency by injecting a tunable delay immediately before or after executing an operation against the storage system. For example, during a read operation the delay is injected immediately before the client issues a read request to the storage system. In this scenario the value returned by the read is fresh as long as it was the last updated value at any point in time during the interval starting immediately before the artificial delay and ending when the storage system returns a response to the client. The longer the delay the larger the latency and the higher the odds that the read returns a fresh value.
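The read-side variant of artificial delays might be sketched as follows. The wrapper and its parameters are hypothetical: `read_fn` stands in for whatever function issues a read at consistency level one, and the `sleep` argument exists only so the delay can be replaced in tests.

```python
import time


def read_with_artificial_delay(read_fn, key, delay_ms, sleep=time.sleep):
    """Wait delay_ms milliseconds, then issue the read and return its result.

    The delay precedes the read request, so it adds directly to the latency
    observed by the caller.
    """
    sleep(delay_ms / 1000.0)
    return read_fn(key)
```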

Our experimental comparison of continuous partial quorums against artificial delays using Apache Cassandra shows that the CPQ technique enables a superior consistency-latency trade-off. In some cases our technique attains the same degree of consistency (defined more precisely in Section 2) as artificial delays with severalfold lower latency.

2 Methodology

We study the consistency-latency trade-off experimentally by applying the two techniques described in Section 1 (CPQ and AD) to an Apache Cassandra [16] cluster deployed in Amazon’s EC2 environment. All EC2 instances are provisioned in the same availability zone and we do not consider geo-replication in this paper. The workload is generated using the Yahoo Cloud Serving Benchmark (YCSB) [9], with a modified Cassandra connector to support our CPQ technique. YCSB collects precise measurements of throughput and latency. To measure consistency precisely we follow the approach of Golab et al. by calculating the $\Gamma$ (Gamma) consistency metric from traces of operations recorded by instrumenting YCSB [13].

The $\Gamma$ metric quantifies consistency by measuring how far the behavior of a storage system, as observed by client applications (in this case YCSB), deviates from the gold standard of linearizability, the property that every operation appears to take effect instantaneously at some point between its start time and its finish time. A $\Gamma$ value of zero for a particular trace of operations indicates linearizable behavior, and positive values indicate deviations from linearizability, which we refer to as consistency anomalies. Intuitively, if the $\Gamma$ value is $x$ time units then this indicates that each operation appears to take effect instantaneously at some point between its start time minus $x/2$ and its finish time plus $x/2$. Similarly to [13], we calculate a fine-grained form of the metric called the per-value $\Gamma$ score, which quantifies consistency anomalies associated with a collection of operations that access the same key and read or write the same value. Positive $\Gamma$ scores represent an upper bound on the staleness of values returned by read operations. We use the proportion of positive $\Gamma$ scores as an estimate of the fraction of stale reads, which is the quantity bounded by the staleness threshold in our discussion of SLAs in Section 1.
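The intuition behind a positive Gamma value can be made concrete with a small sketch. It assumes the symmetric widening used in the definition of the metric in [13], where half of the Gamma value is subtracted from the start time and half is added to the finish time; `relaxed_window` is a hypothetical helper and is not part of the measurement tooling.

```python
def relaxed_window(start, finish, gamma):
    """Return the widened interval in which an operation may appear to take effect.

    Assumption: the interval is widened by gamma / 2 on each side. With
    gamma == 0 the result is the original [start, finish] interval, which
    corresponds to linearizable behavior.
    """
    half = gamma / 2
    return (start - half, finish + half)
```

For instance, an operation that runs from time 10 to time 12 under a Gamma value of 4 would be allowed to take effect anywhere between times 8 and 14.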

Our chosen method of measuring consistency is client-centric in the following sense: positive $\Gamma$ scores represent consistency anomalies that are actually observed by a collection of clients via the responses of read operations. It is possible for the storage system to contain stale copies of a data item internally even when the $\Gamma$ score is zero, indicating linearizability, as long as the stale copies are never read by clients. We believe that this approach, which separates the consistency metric cleanly from the implementation details of a storage system, is well matched to the task of specifying and verifying SLAs for consistency.

3 Experiments

3.1 Overview

3.1.1 Hardware and software environment

The experiments are staged using six on-demand instances in Amazon EC2, us-west-2b availability zone. Each host is an m3.2xlarge on-demand instance with 8 virtual Intel Xeon E5-2670 2.50GHz cores, 32 GB RAM, 2x80 GB SSD local storage. The RTT between nodes is 300-450μs. Clocks are synchronized to within 2ms using NTP. The software environment includes an Ubuntu 14.04 x86_64 image with Linux kernel version 3.13.0 in HVM (Hardware Virtual Machine) mode, Oracle Java 1.7.0_72, Apache Cassandra 2.0.10 and YCSB 0.1.4 modified as explained in Section 2. Cassandra is configured with default settings and the data directory is placed on SSD-based local instance storage. Each host runs a single YCSB process with 128 client threads that connect to the local Cassandra server.

3.1.2 Workload and system parameters

Each experiment comprises a YCSB load phase starting with an empty keyspace, followed by a 60-second YCSB transaction phase. We use a mixture of 80% read and 20% write operations that access 128-byte values. Keys are generated using one of two YCSB probability distributions, similarly to [13]: “latest” with a key space of size 1k, and “hotspot” with a key space of size 10k and 80% of the operations acting on a 20% hot set of keys. The replication factor is three. The target throughput in YCSB is set to 5 kops/s/host, and is achieved to within approximately 1% in all experimental runs.

3.1.3 Visualizations

We present several types of graphs in this section. Part (a) of Figures 1, 2 and 3 presents the proportion of positive $\Gamma$ scores. The proportions shown exclude scores that are positive but less than the clock synchronization threshold of approximately 2ms. Such small scores do not reliably indicate consistency anomalies and we remove them to de-noise our figures. Part (b) of Figures 1, 2 and 3 presents the 95%-ile latencies (ms) corresponding to the runs shown in part (a), calculated as an average over a sample of values reported by YCSB, with one value from each host.
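The two plotted quantities can be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, and scores below the threshold are treated as non-anomalous while still counting toward the denominator, which is one possible reading of the exclusion rule described above.

```python
def proportion_of_positive_scores(scores_ms, threshold_ms=2.0):
    """Fraction of scores at or above the clock synchronization threshold.

    Assumption: scores that are positive but below threshold_ms are counted
    as zero rather than dropped from the sample.
    """
    anomalies = sum(1 for s in scores_ms if s >= threshold_ms)
    return anomalies / len(scores_ms)


def mean_p95_latency(per_host_p95_ms):
    """Average of the 95%-ile latencies reported by each host, in milliseconds."""
    return sum(per_host_p95_ms) / len(per_host_p95_ms)
```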

In the interest of readability we do not include error bars in our graphs, but we do observe moderate variations in the results. In particular, the proportion of positive $\Gamma$ scores varies noticeably between runs. This is partly due to imperfect clock synchronization, which adds noise to measurements of $\Gamma$, and partly a side-effect of poor performance isolation in the EC2 environment. The latency measurements are generally more stable than the consistency measurements, and the standard deviation of the 95%-ile latency reported by YCSB processes at different hosts is approximately 1ms.

Figure 1: Consistency and latency vs. client-side consistency level (e.g., ONE-QUO means read one, write majority quorum). Panels: (a) consistency, (b) latency.

Figure 2: Consistency and latency versus probability of client-side consistency level quorum vs. one. Panels: (a) consistency, (b) latency.

Figure 3: Consistency and latency versus client-side artificial delay (ms). Panels: (a) consistency, (b) latency.

3.2 Results

As a starting point we evaluate the consistency-latency envelope of fixed client-side consistency levels, which is our baseline technique. We focus specifically on different combinations of one and majority quorum consistency levels, which provide availability in the presence of one failed server given the replication factor of three. Figure 1 presents the results. The x-axis labels are of the form A-B where A and B indicate the client-side consistency level for reads and writes, respectively. Similarly to Figure 6 of [13], our results show that the quorum consistency level improves consistency (i.e., lowers the proportion of positive $\Gamma$ scores) at the cost of increased latency. The 95%-ile latencies are generally less than 8ms, and slightly higher for reads overall than for writes, an expected outcome given that Cassandra is write-optimized. The QUO-QUO case (strong consistency) indicates zero positive $\Gamma$ scores, meaning that the storage system produced a linearizable trace. In comparison, the ONE-ONE case (eventual consistency) exhibits latencies less than half of QUO-QUO, with fewer than 1% of reads returning stale values.

The second set of results, presented in Figure 2, demonstrates continuous partial quorums in action. In this experiment the client chooses majority quorum consistency with probability $p$, shown on the x-axis, and one consistency with probability $1 - p$. The same policy is used for both reads and writes. As $p$ increases from 0 to 1 we observe that both the consistency and latency gradually morph from values corresponding to the ONE-ONE case in Figure 1 to values corresponding to the QUO-QUO case. Thus, CPQ successfully attains points in the two-dimensional consistency-latency spectrum that lie in between the discrete points attained using fixed client-side consistency levels. Furthermore, when $p$ is chosen strictly between 0 and 1, CPQ attains trade-offs that are not possible at all using fixed client-side consistency levels. In particular, these points do not correspond to the ONE-QUO and QUO-ONE cases in Figure 1. Aside from differences in the proportion of positive $\Gamma$ scores, the ONE-QUO and QUO-ONE configurations provide a different balance of read and write latencies compared to CPQ.

The last set of results, presented in Figure 3, demonstrates the behavior of artificial delays. In this experiment the client uses consistency level one for both reads and writes, and boosts consistency by injecting a delay at the beginning of each read. The length of the delay in milliseconds is shown on the x-axis, and contributes directly to the latency of read operations. For example, with a 20ms delay the 95%-ile latency for reads is 20-25ms, compared to 1-3ms in Figure 1 (ONE-ONE case) and Figure 2 (0.0 case). (Note that the consistency and latency scales in Figures 1 and 2 range from 0 to 0.01 and 0 to 10ms, respectively, whereas in Figure 3 they range from 0 to 0.02 and 0 to 35ms, respectively.) At this point the proportion of positive $\Gamma$ scores approaches zero, which is the value attained using majority quorums in Figure 1 (QUO-QUO case) and Figure 2 (1.0 case) at a latency of only 4-6ms for reads and 2-4ms for writes. Thus, a 20ms artificial delay achieves slightly worse consistency than quorum operations with severalfold higher latency. Even with a 5ms delay the read latency in Figure 3 exceeds that of quorum reads, but the consistency observed is only slightly better than using consistency level one and no delay. Thus, artificial delays provide a suboptimal consistency-latency trade-off compared to both our CPQ technique and the baseline technique.

4 Related Work

Recent research in the area of consistency has addressed the classification of consistency models, consistency measurement, and the design of storage systems that provide precise consistency guarantees. This body of work is influenced profoundly by the CAP principle, which states that a distributed storage system must make a trade-off between consistency (C) and availability (A) in the presence of a network partition (P) [6]. The PACELC formulation builds on CAP by considering two separate cases: during a network partition it reduces directly to CAP, but during failure-free operation it dictates a trade-off between latency and consistency [1].

Distributed storage systems use a variety of designs that achieve different trade-offs with respect to CAP. Amazon’s Dynamo and its derivatives (Cassandra, Voldemort and Riak) use a quorum-based replication scheme that can operate either in CP (i.e., strongly consistent but sacrificing availability) or AP (i.e., highly available but eventually consistent) mode depending on the size of the partial quorum used to execute reads and writes [16, 11]. The techniques discussed in this paper, CPQ and AD, are targeted specifically at this family of systems. Since they are implemented at clients, these techniques can be used with any quorum-replicated system that supports tunable partial quorums.

Many alternative designs have been proposed for supporting stronger notions of consistency in storage systems. Bigtable provides atomic access to individual rows, and is eventually consistent when deployed across multiple data centers [7]. PNUTS provides per-record timeline consistency, which ensures that replicas of a record apply updates in the same order [8]. COPS provides causal consistency with convergent conflict handling and read-only transactions, and is designed for wide-area deployments [17]. Causal consistency is in some sense the strongest property that can be guaranteed in the presence of network partitions, which makes COPS an AP system in the context of CAP [18]. Bolt-on causal consistency is a shim layer that provides causal consistency on top of eventual consistency [3]. Spanner is a geo-replicated transactional database that provides external consistency, which is similar in spirit to Lamport’s atomicity property (see Section 2) [10]. The replication and transaction commitment protocols in these systems are geared toward specific notions of stronger-than-eventual consistency and do not expose a client-side consistency level setting that could be used with our CPQ technique.

Several systems consider the problem of providing continuously tunable consistency guarantees. TACT is a middleware layer that uses three metrics to express consistency requirements with respect to read and write operations: numerical error, order error, and staleness [23]. TACT relies on a consistency manager that pushes updates synchronously to other replicas. Pileus allows client applications to declare consistency and latency requirements in the form of SLAs [21]. These SLAs include latency and staleness bounds but do not support the types of probabilistic guarantees discussed in Section 1. Internally, Pileus enforces the SLAs by choosing which replica to access in an SLA-aware manner, whereas Dynamo-style systems tend to always access the closest replicas. Tuba supports consistency SLAs by automatically reconfiguring the locations of its replicas in response to the client’s location and request rates [2]. AQuA is a middleware layer that allows the client application to specify latency and consistency requirements similarly to Pileus, but with a focus on time-sensitive applications [15]. It provides probabilistic timeliness guarantees by selecting replicas dynamically using probabilistic models.

We are aware of only two systems that use artificial delays for consistency-latency tuning. Golab and Wylie propose consistency amplification—a framework for supporting consistency-based SLAs by injecting client-side or server-side delays whose duration is determined adaptively using measurements of the consistency actually observed by clients [14]. Rahman et al. present a similar system called PCAP, where delays are injected only at clients and their duration is determined using a feedback control mechanism [20]. PCAP also varies the read repair rate, which is shown to be a far less effective tuning knob. The evaluation of the system considers the proportion of operations that satisfy particular consistency and latency requirements, and does not investigate the optimality of this trade-off with respect to fixed client-side consistency levels such as majority quorums. The argument given against strict quorums is that they may cause storage operations to block in the event of a network partition. However, the consistency calculations used to tune artificial delays in PCAP are themselves blocking because they are based upon operation logs collected from multiple servers. Furthermore, in practice even quorum operations can be made non-blocking by using read and write timeouts, which are configurable in recent versions of Cassandra. Timeouts ensure that every operation eventually either completes successfully, or fails and allows the client to retry the operation using a smaller partial quorum.
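The timeout-and-retry behavior mentioned at the end of the paragraph above might look roughly like the sketch below. Everything in it is hypothetical: `OperationTimeout`, `read_with_fallback`, and `read_fn` are illustrative stand-ins and do not correspond to any Cassandra client API, and the order of fallback levels is an assumption.

```python
class OperationTimeout(Exception):
    """Hypothetical error raised when too few replicas respond in time."""


def read_with_fallback(read_fn, key, levels=("QUORUM", "ONE")):
    """Try each consistency level in order until one read succeeds.

    read_fn(key, level) is assumed to return a value or raise OperationTimeout.
    Returns a (value, level) pair so the caller knows which level was used.
    """
    if not levels:
        raise ValueError("at least one consistency level is required")
    last_error = None
    for level in levels:
        try:
            return read_fn(key, level), level
        except OperationTimeout as exc:
            last_error = exc
    raise last_error
```

Returning the level alongside the value lets the application record when a read was served with weaker consistency than it first requested.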

The use of server-side artificial delays is explored in [12] as a technique for reducing the severity of consistency anomalies in Cassandra when client-side consistency level one is used. The delays are injected judiciously following the garbage collection stop-the-world pause, which improves consistency drastically with negligible impact on latency. In contrast, the artificial delays used in PCAP and explored in our own experiments incur a latency penalty for every single read operation, which increases average latency directly.

In the pursuit of an empirical understanding of CAP-related trade-offs several papers have explored techniques for measuring consistency [13, 22, 5, 24]. Measuring consistency in a precise way is subtly difficult because consistency anomalies such as stale reads are the result of interplay between multiple storage operations. As a result, some of the contributions in this space consider simplified techniques that measure the convergence time of the replication protocol rather than the consistency actually observed by client applications (e.g., [22, 5]) or quantify the consistency observed in terms of quantities that do not translate directly into staleness measures expressed naturally in units of time (e.g., counting cycles in a dependency graph [24]). Probabilistically bounded staleness (PBS) is a mathematical model of partial quorums that overcomes these limitations but is based upon the simplifying assumption that writes do not execute concurrently with other operations [4]. The theory underlying probabilistic quorum systems was originally developed by Malkhi, Reiter, and Wright [19].

5 Discussion and Conclusion

Our experiments using Cassandra in Amazon’s EC2 environment demonstrate that the consistency-latency trade-off can be tuned in a continuous manner using only a handful of discrete client-side consistency levels. We achieve this goal using a novel technique called continuous partial quorums (CPQ), which chooses randomly between two discrete consistency levels according to a tunable probability parameter. Compared to client-side artificial delays with consistency level one, CPQ is able to achieve a more attractive consistency-latency trade-off, in some cases offering the same degree of consistency with severalfold lower latency. This result confirms informal claims regarding the potentially detrimental effect of injecting artificial delays (e.g., see [4]), albeit only in the special case where the delay is in the critical path of every read operation. As discussed in Section 4, delays injected at servers can improve consistency effectively with a very small latency penalty [12].

Although we demonstrate CPQ specifically in the context of Apache Cassandra, the technique is applicable to any system that supports a set of discrete client-side consistency options. In future work we plan to implement and evaluate this technique on top of other storage systems and expand the scope of experiments to cover geo-replication. Furthermore, we plan to construct a comprehensive middleware framework that uses CPQ and other tuning techniques to support probabilistic consistency and latency guarantees.

References

  • [1] D. Abadi. Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. IEEE Computer, 45(2):37–42, 2012.
  • [2] M. S. Ardekani and D. B. Terry. A self-configurable geo-replicated cloud storage system. In Symp. on Op. Sys. Design and Implementation (OSDI), pages 367–381, 2014.
  • [3] P. Bailis, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Bolt-on causal consistency. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 761–772, 2013.
  • [4] P. Bailis, S. Venkataraman, M. J. Franklin, J. M. Hellerstein, and I. Stoica. Probabilistically bounded staleness for practical partial quorums. PVLDB, 5(8):776–787, 2012.
  • [5] D. Bermbach and S. Tai. Eventual consistency: How soon is eventual? An evaluation of Amazon S3’s consistency behavior. In Proc. Workshop on Middleware for Service Oriented Computing (MW4SOC), 2011.
  • [6] E. A. Brewer. Towards robust distributed systems (Invited Talk). In Proc. of the 19th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, 2000.
  • [7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2), 2008.
  • [8] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!’s hosted data serving platform. PVLDB, 1(2):1277–1288, 2008.
  • [9] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proc. of the ACM Symposium on Cloud Computing, pages 143–154, 2010.
  • [10] J. C. Corbett et al. Spanner: Google’s globally-distributed database. In Proc. USENIX Conference on Operating Systems Design and Implementation (OSDI), pages 251–264, 2012.
  • [11] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key-value store. In ACM SIGOPS Operating Systems Review, volume 41, pages 205–220. ACM, 2007.
  • [12] H. Fan, A. Ramaraju, M. McKenzie, W. Golab, and B. Wong. Understanding the causes of consistency anomalies in Apache Cassandra. PVLDB, 8(7):810–813, 2015.
  • [13] W. Golab, M. R. Rahman, A. AuYoung, K. Keeton, and I. Gupta. Client-centric benchmarking of eventual consistency for cloud storage systems. In Proc. of the 34th International Conference on Distributed Computing Systems, pages 493–502, 2014.
  • [14] W. Golab and J. J. Wylie. Providing a measure representing an instantaneous data consistency level, 2014. US Patent Application 20,140,032,504.
  • [15] S. Krishnamurthy, W. H. Sanders, and M. Cukier. An adaptive quality of service aware middleware for replicated services. IEEE Transactions on Parallel and Distributed Systems, 14:1112–1125, 2003.
  • [16] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.
  • [17] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don’t settle for eventual: Scalable causal consistency for wide-area storage with COPS. In Proc. of the 23rd ACM Symposium on Operating Systems Principles, pages 401–416, 2011.
  • [18] P. Mahajan, L. Alvisi, and M. Dahlin. Consistency, availability, and convergence. University of Texas at Austin Tech Report, 11, 2011.
  • [19] D. Malkhi, M. Reiter, and R. Wright. Probabilistic quorum systems. In Proceedings of the Sixteenth Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 267–273, 1997.
  • [20] M. R. Rahman, L. Tseng, S. Nguyen, I. Gupta, and N. Vaidya. Characterizing and adapting the consistency-latency tradeoff in distributed key-value stores. 2015. http://arxiv.org/abs/1509.02464.
  • [21] D. B. Terry, V. Prabhakaran, R. Kotla, M. Balakrishnan, M. K. Aguilera, and H. Abu-Libdeh. Consistency-based service level agreements for cloud storage. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 309–324. ACM, 2013.
  • [22] H. Wada, A. Fekete, L. Zhao, K. Lee, and A. Liu. Data consistency properties and the trade-offs in commercial cloud storage: the consumers’ perspective. In Proc. Conference on Innovative Data Systems Research (CIDR), pages 134–143, 2011.
  • [23] H. Yu and A. Vahdat. Design and evaluation of a conit-based continuous consistency model for replicated services. ACM Trans. Comput. Syst., 20(3):239–282, Aug. 2002.
  • [24] K. Zellag and B. Kemme. How consistent is your cloud application? In Proceedings of the Third ACM Symposium on Cloud Computing, page 6. ACM, 2012.