In this paper, we propose a new approach to building synchronization primitives, dubbed “lwlocks” (short for light-weight locks). The primitives are optimized for small memory footprint while maintaining efficient performance in low contention scenarios. A read-write lwlock occupies 4 bytes, a mutex occupies 4 bytes (2 if deadlock detection is not required), and a condition variable occupies 4 bytes. The corresponding primitives of the popular pthread library occupy 56 bytes, 40 bytes and 48 bytes respectively on the x86-64 platform. The API for lwlocks is similar to that of the pthread library but covers only the most common use cases. Lwlocks allow explicit control of queuing and scheduling decisions in contention situations and support “asynchronous” or “deferred blocking” acquisition of locks. Asynchronous locking helps in working around the constraints of lock-ordering, which otherwise limit concurrency. The small footprint of lwlocks enables the construction of data structures with very fine-grained locking, which in turn is crucial for lowering contention and supporting highly concurrent access to a data structure. Currently, the Data Domain File System uses lwlocks for its in-memory inode cache as well as in a generic doubly-linked concurrent list which forms the building block for more sophisticated structures.
1 Introduction
The advent of multi-core systems has forced a rethinking of basic data structures in order to support greater scalability and concurrency [11]. While there have been good strides in building lock-free versions of certain data structures [2, 5], and software transactional memory (STM) based techniques are becoming popular [9, 10], the use of traditional locking techniques remains the de facto standard for synchronization in shared-memory systems. The usual technique for increasing concurrency with traditional locking schemes, aside from using algorithms that reduce the concurrent sections [4, 6], is to use different locks for different parts of a data structure. Such fine-grained locking often runs afoul of the space overhead involved, which limits the maximum number of locks that can be used. To minimize the space overhead, algorithms usually try to minimize the number of locks, and in turn need to build a mapping between the different parts of the structure and the corresponding locks. This adds to the complexity of the code that needs to be maintained.
In this paper we present a novel technique to create locking primitives that have a very small memory footprint. We call our locks “light-weight locks” or “lwlocks”. Specifically, a read-write lock in our scheme takes 4 bytes, a mutex takes 4 bytes (only 2 if deadlock detection is not required), and a condition variable takes 4 bytes. The corresponding primitives of the popular pthread library occupy 56 bytes, 40 bytes and 48 bytes respectively on the x86-64 platform. The API for lwlocks is modeled after that of the pthread library. We however eschew some of the features provided by pthread locks for the sake of simplicity of our implementation.
We consider our contributions to be four-fold: (i) locking primitives with a small memory footprint, which makes them ideal for very fine-grained locking; (ii) the mechanism underlying the implementation of lwlocks, which allows the creation of custom lock-like primitives; (iii) access to the waiting queue of threads, so that custom scheduling schemes can be implemented; and (iv) support for “asynchronous” or “deferred blocking” locking.
In this paper, we focus largely on lwlocks. The rest of the paper is organized as follows: Section 2 describes the idea that forms the basis of lwlocks. Section 3 describes the internal structure of the supported primitives and the algorithms for implementing their APIs. Section 4 briefly describes possible extensions to lwlocks and how asynchronous locking works. Section 5 compares the performance of lwlocks with the corresponding primitives in the pthread library. Finally, in Section 6 we present our conclusions.
2 The Fundamental Idea
The core idea behind lwlocks is the observation that while a thread could block on different locks or wait on many different condition variables in its lifetime, it can block on only one lock or condition variable at any given point. With lwlocks, whenever a thread has to block, it uses a “waiter” structure to do so. In this paper, we use the terms “waiter structure” and “waiter” interchangeably. Each thread has its own waiter structure and can access it by invoking the tls_get_waiter function (which returns the pointer to the waiter kept in the thread local storage).
Figure 1 presents the definition of a waiter structure. For compact representation, we limit the maximum number of waiter structures to be less than 2^16 so that each structure can also be uniquely referred to by a 16-bit number. We reserve one value to represent the NULL waiter structure and denote it by NULLID. We expect the limit on the number of waiters, and hence on the number of threads, to be large enough for most applications. (Footnote 1: The limit can be increased for a small increase in the size of the locks, which presumably will be acceptable for an application that can support so many threads.)
A waiter structure is assigned to a thread the first time the thread accesses it (via tls_get_waiter), and the structure is returned to the pool of free waiter structures when the thread exits, to be re-used by a later thread. The waiter structure is the key piece that enables the compact nature of lwlocks. It can also be used to create other custom compact lock-like data structures. The current non-optimized implementation of a waiter structure occupies 166 bytes. Since this cost is per thread and we expect the normal use case to have far fewer threads than locks, the amortized cost is very low. There is a break-even ratio of mutexes to threads at which an application has the same memory footprint whether it uses lwlocks or pthread mutexes. Our normal expected use case is for applications that need several hundred thousand or millions of locks, which would make lwlocks the clear choice for locking. We now describe the most important pieces that form a waiter structure.
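As a rough worked version of the amortization argument, assume the sizes quoted in this paper: a 166-byte waiter per thread (the figure reported in the conclusions), a 4-byte lwmutex, and a 40-byte pthread mutex. Ignoring every other per-thread cost, which is a simplifying assumption of this sketch, the two footprints for T threads and M mutexes are equal when:

```latex
166\,T + 4\,M = 40\,M
\quad\Longrightarrow\quad
M = \frac{166}{36}\,T \approx 4.6\,T
```

Under these assumptions lwmutexes use less memory once there are more than about five mutexes per thread. With 100 threads and one million mutexes, for instance, the totals would be roughly 4.0 MB for lwmutexes against 40 MB for pthread mutexes.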
Waiter’s Event. A generic event interface underlies the actual mechanics that are used by a thread when it blocks or unblocks on a lock. The two main operations defined for an event are: (i) wait, which is called to wait for the event to trigger; and (ii) signal, which informs a waiter of an event getting triggered. A waiter structure uses one pthread mutex and one condition variable to implement both operations. The operation wait blocks the thread on the condition variable until a signal arrives. The operation signal wakes up the blocked thread. As with semaphores, the implementation ensures that a signal on an event cannot be lost, i.e., a signal can be invoked before the matching wait is invoked, and the wait will then find the pending signal. Unlike a semaphore, however, the operations wait and signal are always called in pairs. There is also a third operation called poll, which can be used to check whether a signal is already pending.
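The event described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the type name lw_event_t, the function names, and the single pending flag are all hypothetical; only the use of one pthread mutex and one condition variable comes from the text.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical event layout: one mutex, one condition variable, one flag. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    bool            pending;   /* true while a signal has not been consumed */
} lw_event_t;

static void lw_event_init(lw_event_t *ev) {
    pthread_mutex_init(&ev->mu, NULL);
    pthread_cond_init(&ev->cv, NULL);
    ev->pending = false;
}

/* signal: record the signal so that it cannot be lost, then wake the waiter. */
static void lw_event_signal(lw_event_t *ev) {
    pthread_mutex_lock(&ev->mu);
    assert(!ev->pending);      /* wait and signal are always paired */
    ev->pending = true;
    pthread_cond_signal(&ev->cv);
    pthread_mutex_unlock(&ev->mu);
}

/* wait: block until a signal is pending, then consume it. */
static void lw_event_wait(lw_event_t *ev) {
    pthread_mutex_lock(&ev->mu);
    while (!ev->pending)       /* the loop guards against spurious wake-ups */
        pthread_cond_wait(&ev->cv, &ev->mu);
    ev->pending = false;
    pthread_mutex_unlock(&ev->mu);
}

/* poll: report whether a signal is pending, without blocking or consuming it. */
static bool lw_event_poll(lw_event_t *ev) {
    pthread_mutex_lock(&ev->mu);
    bool p = ev->pending;
    pthread_mutex_unlock(&ev->mu);
    return p;
}
```

Because the flag is set under the mutex, a signal issued before the matching wait is simply found pending when wait is eventually called, which is the "cannot be lost" property the text requires.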
Waiter’s Domain. Instead of a fixed implementation for mapping from an id to the waiter structure, we have abstracted out the notion of a waiter’s domain. A waiter’s domain defines four operations: (i) alloc_waiter to allocate a waiter from the domain; (ii) free_waiter to return a waiter back to the domain; (iii) get_waiter which allows a thread to get to its own waiter; and (iv) id2waiter to map from an id to the waiter structure.
Abstracting the notion of a domain has three benefits. First, it provides one more way of extending the system: instead of the limit on the maximum number of threads applying to an entire application, it applies only to individual domains. Second, it provides the flexibility to create domains that have a lower limit on maximum concurrency, thereby allowing for the creation of locks with an even smaller footprint. For example, a system limiting itself to 15 threads (127 without deadlock detection) would need only 1 byte for a mutex. Third, combining it with a custom event allows for the creation of libraries such as a user space job scheduler. The wait call on a job blocks it and causes the scheduler to switch to another ready job, while the signal call marks the job as ready again. We mention the waiter’s domains only for completeness, as they are not necessary to understand the workings of lwlocks. Lwlocks use a default global domain whose waiter structures implement the behavior we describe here.
Forming Lists or Stacks of Waiters. Each waiter structure records its own id. It also has space for previous and next id values which can be used to form stacks or lists of waiters. Such a list (or stack) of waiters can be identified purely by the id of the first element of the list, i.e., it can be represented by a 16-bit value. To go to the next (previous) waiter, we convert the current id to the corresponding waiter structure and look at the next (previous) id field in it.
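A domain with the four operations named above could look like the following sketch. It assumes a tiny fixed-size table, NULLID encoded as 0, and no internal synchronization (a real domain would need alloc_waiter and free_waiter to be thread-safe); the struct layouts and the domain_init helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULLID      ((uint16_t)0)   /* assumed encoding of the NULL waiter */
#define DOMAIN_SIZE 8               /* tiny limit, for illustration only */

typedef struct waiter {
    uint16_t id;        /* this waiter's own id */
    uint16_t next;      /* used here to chain free waiters */
} waiter_t;

typedef struct domain {
    waiter_t slots[DOMAIN_SIZE];  /* slot 0 is unused so that id 0 can be NULLID */
    uint16_t free_head;           /* id of the first free waiter, or NULLID */
} domain_t;

/* Build the free list so that ids are handed out in increasing order. */
static void domain_init(domain_t *d) {
    d->free_head = NULLID;
    for (uint16_t id = DOMAIN_SIZE - 1; id >= 1; id--) {
        d->slots[id].id = id;
        d->slots[id].next = d->free_head;
        d->free_head = id;
    }
}

/* alloc_waiter: take a waiter out of the domain; NULL if it is exhausted. */
static waiter_t *alloc_waiter(domain_t *d) {
    if (d->free_head == NULLID) return NULL;
    waiter_t *w = &d->slots[d->free_head];
    d->free_head = w->next;
    w->next = NULLID;
    return w;
}

/* free_waiter: return a waiter to the domain for reuse by a later thread. */
static void free_waiter(domain_t *d, waiter_t *w) {
    w->next = d->free_head;
    d->free_head = w->id;
}

/* id2waiter: map a compact id back to its waiter structure. */
static waiter_t *id2waiter(domain_t *d, uint16_t id) {
    assert(id < DOMAIN_SIZE);
    return id == NULLID ? NULL : &d->slots[id];
}

/* get_waiter: a thread's own waiter, allocated on first use (one domain only). */
static _Thread_local waiter_t *tls_waiter;
static waiter_t *get_waiter(domain_t *d) {
    if (tls_waiter == NULL) tls_waiter = alloc_waiter(d);
    return tls_waiter;
}
```

Shrinking DOMAIN_SIZE is what would let ids, and therefore locks, use fewer bits, as in the 1-byte mutex example above.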
Locking Data. The final important piece of a waiter structure is the space it provides that can be used by the abstractions built on top for their own purpose. The waiter itself does not interpret it in any way. For instance, read-write lwlocks use this space to record the type of locking operation that the thread was performing when it blocked: whether it was taking a read or write lock. Currently, this space amounts to 8 bytes and is referred to as app_data.
3 Light-Weight Lock Primitives
We now look at the internals of each of the lwlock primitives, the supported operations and how they work. The lwlocks by default are “fair”: a lock is acquired in FIFO order by the threads blocked on it, and wake-ups from a condition variable are done in the order in which the threads called wait on the condition variable. Pthread locks are not fair in this sense, and although it is possible to build lwlocks to mimic the same behavior, we have found fairness to be better suited to our needs in the Data Domain File System [12].
Each primitive uses 2 bytes to keep a queue of waiter structures of the threads that are blocked on that primitive. This queue is aptly called a waitq. The waitq is maintained as a “reverse list” as that allows insertion of a new waiter in a single hardware supported compare-and-swap (CAS) instruction. The next field of a waiter structure holds the id of the waiter structure in front of it. The oldest waiter’s next field holds NULLID. The oldest waiter is the waiter in front of the waitq.
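The reverse-list insertion can be sketched with C11 atomics as below. The global waiters table stands in for id2waiter, NULLID is assumed to be 0, and the function names are hypothetical; the sketch shows only the queue, not a complete lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NULLID      ((uint16_t)0)   /* assumed encoding of the NULL waiter */
#define MAX_WAITERS 16              /* small table, for illustration only */

typedef struct { uint16_t next; } waiter_t;
static waiter_t waiters[MAX_WAITERS];         /* stand-in for id2waiter */

typedef _Atomic uint16_t waitq_t;             /* holds the id of the newest waiter */

/* Insert waiter `self` at the back of the queue with a CAS loop. */
static void waitq_push(waitq_t *q, uint16_t self) {
    assert(self != NULLID && self < MAX_WAITERS);
    uint16_t head = atomic_load(q);
    do {
        waiters[self].next = head;   /* point at the waiter in front of us */
    } while (!atomic_compare_exchange_weak(q, &head, self));
}

/* Walk from the newest waiter to the oldest one (whose next is NULLID). */
static uint16_t waitq_oldest(waitq_t *q) {
    uint16_t id = atomic_load(q);
    while (id != NULLID && waiters[id].next != NULLID)
        id = waiters[id].next;
    return id;
}
```

Insertion touches only the 16-bit queue word, which is why it fits in one CAS; the price is that finding the oldest waiter requires a walk, the traversal cost discussed in the performance section.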
To acquire a lock, a thread uses the CAS instruction to either take ownership of the lock or add its own waiter structure to the lock’s waitq. If the lock is acquired, nothing more needs to be done. If it cannot be acquired, then the thread waits on its waiter structure (by calling the event’s wait routine). When the thread’s turn comes to own the lock (in FIFO order), the unlocking thread will transfer the lock to it and invoke the event’s signal routine on the waiter to wake up the thread. Since the unlocking thread does the work of transferring the lock state and ownership, the waking thread can assume that it has the lock upon being signaled. The unlocking thread has to walk the waitq to find the waiter to signal. At any point there can be only one thread performing the transfer on a lock, and hence the walk is safe to perform. (Footnote 2: For unfair locks, this part has to change and the waking thread would need to try again to take the lock.)
We now present each one of the lwlock primitives. Note that we only highlight the essence of the various operations in the included algorithms. The actual implementation, which we hope to release to the open source community in the near future, has additional logic for performance optimization.
Light-weight mutex. The 4 bytes of a light-weight mutex (henceforth a lwmutex) are composed of 2 equal parts. The first part holds the id of the waiter structure of the owner thread and the second part is the waitq. The owner id is necessary to do self-deadlock detection. Figure 2 outlines the lock and unlock algorithms for the 4-byte version of a lwmutex. If deadlock detection is not required, the lock only needs to be 16 bits in size to hold the waitq. To comply with POSIX semantics, we also need to be able to ascertain the owner of such a mutex. Fortunately, we can use the same waitq space. The locking thread swaps the NULLID of the waitq with the id of its own waiter to indicate that the lock is taken. As other threads block, their waiter structures get added to the waitq as in the case of the regular lwmutex. The difference is that the next field of the waiter structure in front of the waitq does not hold NULLID. Instead, it holds the id of the waiter of the lock owner thread. Hence, the unlock operation traverses the waitq until a waiter whose next field matches the id of the unlocking thread’s waiter structure is reached. The next field is reset to NULLID and the waiter is signaled.
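The 4-byte variant can be sketched as follows. This is not the algorithm of Figure 2, only an illustration under assumptions: the owner id is placed in the high 16 bits, NULLID is 0, the waiters table stands in for id2waiter, and all names are hypothetical. The lock routine never blocks; on LW_QUEUED the caller is expected to wait on its own event, and the thread returned by the unlock routine is the one to signal.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NULLID      ((uint16_t)0)   /* assumed encoding of the NULL waiter */
#define MAX_WAITERS 16

typedef struct { uint16_t next; } waiter_t;
static waiter_t waiters[MAX_WAITERS];         /* stand-in for id2waiter */

typedef _Atomic uint32_t lwmutex_t;           /* owner id (high 16) | waitq (low 16) */
#define OWNER(w)   ((uint16_t)((w) >> 16))
#define WAITQ(w)   ((uint16_t)((w) & 0xFFFFu))
#define PACK(o, q) (((uint32_t)(o) << 16) | (uint32_t)(q))

typedef enum { LW_ACQUIRED, LW_QUEUED, LW_DEADLOCK } lw_status_t;

/* Take the lock, or enqueue `self` on the waitq; never blocks. */
static lw_status_t lwmutex_lock_async(lwmutex_t *m, uint16_t self) {
    uint32_t old = atomic_load(m);
    for (;;) {
        if (OWNER(old) == self) return LW_DEADLOCK;        /* self-deadlock */
        if (OWNER(old) == NULLID) {                        /* free: take ownership */
            if (atomic_compare_exchange_strong(m, &old, PACK(self, WAITQ(old))))
                return LW_ACQUIRED;
        } else {                                           /* held: join the waitq */
            waiters[self].next = WAITQ(old);
            if (atomic_compare_exchange_strong(m, &old, PACK(OWNER(old), self)))
                return LW_QUEUED;
        }
    }
}

/* Release the lock held by `self`. Returns the id of the waiter that now owns
 * the lock (the caller must signal its event), or NULLID if nobody waited. */
static uint16_t lwmutex_unlock(lwmutex_t *m, uint16_t self) {
    uint32_t old = atomic_load(m);
    assert(OWNER(old) == self);
    for (;;) {
        uint16_t head = WAITQ(old);
        if (head == NULLID) {                              /* no waiters */
            if (atomic_compare_exchange_strong(m, &old, PACK(NULLID, NULLID)))
                return NULLID;
        } else if (waiters[head].next == NULLID) {         /* exactly one waiter */
            if (atomic_compare_exchange_strong(m, &old, PACK(head, NULLID)))
                return head;
        } else {                                           /* two or more waiters */
            uint16_t prev = head;
            while (waiters[waiters[prev].next].next != NULLID)
                prev = waiters[prev].next;
            uint16_t oldest = waiters[prev].next;
            waiters[prev].next = NULLID;                   /* detach the oldest */
            while (!atomic_compare_exchange_strong(m, &old, PACK(oldest, WAITQ(old))))
                ;                                          /* keep newer arrivals */
            return oldest;
        }
    }
}
```

Only the unlocking thread ever walks or trims the queue, so the plain (non-atomic) next fields are written either by a thread that is about to publish itself with a CAS or by the single thread performing the hand-off.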
Light-weight condition variable. The 4 bytes of a light-weight condition variable (henceforth a lwcondvar) are composed of 2 equal parts. The first part is a 2-byte version of lwmutex and the second part is the queue of waiter structures. There are three basic operations for a lwcondvar: (i) wait; (ii) signal; and (iii) broadcast. The internal 2-byte lwmutex is used to synchronize manipulation of the waiter’s queue which makes the algorithms for those three operations very easy to derive. The algorithms for the three operations are presented in Appendix A.
Light-weight read-write lock. The light-weight read-write lock (henceforth a lw_rwlock) also uses 2 of its 4 bytes for the waitq. Of the remaining 16 bits, 14 bits are used for the count of read locks granted, 1 bit is used to indicate a write lock, and the final 1 bit is used to indicate whether the lock is read-biased or not. A read-biased lock is unfair towards writers in the sense that a thread that needs a read lock will acquire it without any regard to waiting writers if the lock is already held by other readers. This behavior is similar to that of pthread read-write lock and is essential for applications where a thread can recursively acquire the same lock as a reader. Without the read-biased behavior, a deadlock can result if a writer arrives in between two read lock acquisitions: the second read lock attempt will wait for the writer which is waiting for the first read lock to be released. Applications that do not have recursive read locking do not need the read-biased behavior but may choose to use it for throughput reasons.
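One possible encoding of the 32-bit word, with an uncontended read-lock attempt, is sketched below. The text fixes only the field widths; the bit positions, the names, and the choice to report failure instead of queuing are assumptions of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions; the text fixes only the field widths. */
#define WAITQ_MASK   0x0000FFFFu   /* bits 0-15: id of the newest waiter */
#define RCOUNT_SHIFT 16
#define RCOUNT_MASK  0x3FFF0000u   /* bits 16-29: granted read locks     */
#define WRITE_BIT    0x40000000u   /* bit 30: a writer holds the lock    */
#define BIAS_BIT     0x80000000u   /* bit 31: the lock is read-biased    */
#define RCOUNT_MAX   0x3FFFu

typedef _Atomic uint32_t lw_rwlock_t;

static uint32_t rcount(uint32_t w) { return (w & RCOUNT_MASK) >> RCOUNT_SHIFT; }

/* Uncontended read-lock attempt; returns false when the caller would have to
 * queue itself on the waitq instead. */
static bool lw_rwlock_try_rdlock(lw_rwlock_t *l) {
    uint32_t old = atomic_load(l);
    for (;;) {
        if (old & WRITE_BIT) return false;                 /* writer active */
        if (rcount(old) == RCOUNT_MAX) return false;       /* count would overflow */
        bool has_waiters = (old & WAITQ_MASK) != 0;
        /* A fair lock queues behind existing waiters; a read-biased lock that
         * is already held by readers admits the new reader regardless. */
        if (has_waiters && !((old & BIAS_BIT) && rcount(old) > 0)) return false;
        if (atomic_compare_exchange_weak(l, &old, old + (1u << RCOUNT_SHIFT)))
            return true;
    }
}
```

The read-biased branch is what prevents the recursive-reader deadlock described above: a second read acquisition by the same thread succeeds even though a writer is already waiting.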
The 14-bit reader count limits the maximum number of readers per lock to what fits in 14 bits (at most 2^14 − 1 = 16,383), a limit that we have found to be sufficient in practice. The limit can be raised in three ways: by having the API explicitly flag read-bias behavior, so that the bias bit does not have to be stored in the lw_rwlock; by restricting the maximum concurrency, thereby freeing bits from the waitq; or by slightly increasing the size of the lock.
Figure 3 outlines the algorithms for the two main operations: (i) lock, and (ii) unlock. The lock operation on a lw_rwlock is similar to that of a lwmutex, with an added flag indicating whether a read or a write lock is requested. The unlock operation for a non-read-biased lw_rwlock has to pick the oldest set of waiters that it can signal: either a single writer or a set of contiguous readers. A read-biased lw_rwlock can follow the same logic as a non-biased lw_rwlock when the transfer is to a waiting writer. For the transfer from a writer to reader(s), however, the writer has to signal all readers, not just the oldest contiguous set. The solution is to have the writer atomically remove the entire waitq and downgrade to a read lock. It then separates the waitq into two queues: one consisting of readers and one consisting of writers. The writers are added back to the front of the waitq, and the reader count is updated to fully account for the readers found in the removed waitq. Finally, the readers can be signaled. Note that the re-insert of the waiting writers during unlock is safe. The re-insert is done at the front of the waitq, and any new writers will add themselves to the back of the waitq. No other thread can be traversing the waitq for ownership transfer, as the re-inserting thread holds a read lock. This case makes the implementation of lw_rwlocks the most complex of all the primitives, and the algorithm outline is only at a high level for the contention case, where the waitq has at least one waiter in it. The uncontended case is simple to derive.
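The separation step for a read-biased lock can be sketched as follows, operating on a waitq that the releasing writer has already detached. The app_data encoding (WANT_READ, WANT_WRITE), the waiters table, and the function names are hypothetical, and the atomic detach and re-insert steps are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define NULLID      ((uint16_t)0)   /* assumed encoding of the NULL waiter */
#define MAX_WAITERS 16

enum { WANT_READ, WANT_WRITE };               /* hypothetical app_data values */

typedef struct {
    uint16_t next;       /* id of the waiter in front (older), or NULLID */
    uint64_t app_data;   /* request type recorded when the thread blocked */
} waiter_t;
static waiter_t waiters[MAX_WAITERS];         /* stand-in for id2waiter */

/* Append `id` to the list described by head/tail, keeping newest-first order. */
static void append(uint16_t *head, uint16_t *tail, uint16_t id) {
    if (*head == NULLID) *head = id; else waiters[*tail].next = id;
    *tail = id;
    waiters[id].next = NULLID;
}

/* Split a detached waitq into rd_q and wr_q; returns the number of readers,
 * which is the amount by which the reader count has to grow. */
static unsigned split_waitq(uint16_t q, uint16_t *rd_q, uint16_t *wr_q) {
    uint16_t rd_tail = NULLID, wr_tail = NULLID;
    unsigned readers = 0;
    *rd_q = *wr_q = NULLID;
    while (q != NULLID) {
        assert(q < MAX_WAITERS);
        uint16_t next = waiters[q].next;      /* save before append rewrites it */
        if (waiters[q].app_data == WANT_READ) {
            append(rd_q, &rd_tail, q);
            readers++;
        } else {
            append(wr_q, &wr_tail, q);
        }
        q = next;
    }
    return readers;
}
```

Both output lists keep the relative arrival order of the original queue, so the writers that are re-inserted still form a FIFO among themselves.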
4 Asynchronous Locking and Other Extensions
We take a moment here to highlight some aspects of the algorithms presented in Section 3 and how small changes would enable alternative behaviors. On the locking side, the key observation is that once a thread has put itself on the wait queue of a lock or condition variable, it is guaranteed to have the lock transferred to it or a signal delivered to it. The thread does not have to call wait right away. The thread could spin for a certain amount of time on poll before calling wait, effectively creating adaptive locks. It could also keep spinning, which would create starvation-free spin locks. Both of these are scalable and contention-free, similar to the approaches in [1, 3, 7].
Alternatively, the lock operation could simply return without calling wait at all. This would allow the calling thread to take some application-specific action before invoking wait. We call this mode of operation taking an “asynchronous” or “deferred blocking” lock. Asynchronous locking is the key enabler for working around the constraints that lock-ordering imposes. We use this functionality in building a generic highly concurrent doubly-linked list in the Data Domain File System [12]. The list allows concurrent appends, dequeues, inserts (before or after any member), deletes and iterators (in either direction). Some of these operations need to acquire locks in the opposite order to other operations. To avoid deadlocks, a canonical order is picked and operations that need to acquire locks in the opposite direction use asynchronous locking.
The following example, taken from the doubly-linked list implementation, illustrates how asynchronous locking is used and why it is essential. Suppose the canonical order for nodes A and B is A, then B. A thread already holds a lock on B and needs to lock A. It makes an asynchronous lock call for A. If the thread is unable to get the lock, it is now on A’s waitq, and it releases the lock on B. It then waits for the lock on A to be granted and reacquires the lock on B (which is in canonical order). In the above sequence, the thread always either holds a lock (on A or B) or is in the waitq of a lock (on A). Other guarantees in the data structure ensure that in this case A and B will remain valid, and hence there will be no illegal access. Achieving this without asynchronous locking is not possible. Using trylock on A and, upon failure, releasing B and then locking A leaves a window open between the release of B and the locking of A in which neither node is in any way aware of the thread. One or both nodes could go away in that window, and the thread would end up performing an illegal access.
We are also working on building highly concurrent versions of other data structures (trees of various kinds) where we expect to use asynchronous locking frequently. Note that since there is only one waiter structure per thread per domain, a thread can only be performing one asynchronous lock operation per domain at any time. To keep the discussion focused on lwlocks, we cannot go into any more details of our list or other data structures here.
On the unlocking side of the operations, we note that since the waitq management is visible in user space, the unlocking thread has a lot of flexibility in picking which of the waiting threads to signal and whether to do lock hand-off or have the signaled thread retry. This can be exploited to create any custom scheduling policy. We could pick the thread with the highest priority or the longest waiting thread or even have applications use the app_data to define their own preferences. Signaling waiters in LIFO instead of FIFO order would trade fairness for performance as we illustrate in Section 5.
Finally, with most hardware supporting 64-bit CAS instructions, the generic building block of a 16-bit waitq leaves 48 bits available for building other primitives. For example, we have built semaphore-like counters and a combined mutex+condition variable structure, and implemented upgrade and downgrade operations for lw_rwlocks (see Appendix B for lw_rwlock algorithms that allow these). Although our implementation has focused on process-private locks, we believe it is possible to extend the approach to include process-shared locks. For example, the Linux operating system caps the maximum number of processes at a value that would give a natural mapping from the process id to the waiter structure id for the process. The structures could be managed in user space shared memory or the kernel could manage them. In this case, an actual semaphore would be a more appropriate way to implement the event interface for the waiters.
5 Performance Evaluation
We now examine some experimental data to show that the performance of lwlocks is acceptable. The experiments were performed on a 4-socket system with Xeon E7-4860 processors. Each socket has 10 physical (20 with hyper-threading enabled) cores, for a total of 40 (80 hyper-threaded) cores. The machine has 256GB of memory and each core operates at 2.26GHz.
We have carried out three sets of experiments. Each experiment was run 20 times, which was enough to obtain the desired confidence level for the presented average values. The first one compares the performance of unfair lwmutexes with unfair pthread mutexes. Unfair mutexes trade off fairness for performance by using the greedy approach: the unlocking thread can reacquire the lock right away again. This is done to avoid the convoy problem. We have implemented two versions of unfair lwmutexes: (i) LIFO wake-ups, which wakes up the most recent thread in the waitq; and (ii) FIFO wake-ups, which wakes up the longest-waiting thread in the waitq. The experiment consists of threads carrying out the same number of operations on a global doubly-linked list protected by a single unfair mutex; each operation has the same cost. Each thread acquires the global mutex, performs an operation and drops the mutex. There is no activity outside the locked code block except to increment the loop counter.
Figure 4 (a) shows how the latency per operation increases with the number of contending threads. As the number of threads increases, the per-operation cost goes up for all lock types. Note that, for relatively low contention, unfair lwmutexes perform as well as the unfair pthread mutex. We are satisfied that our implementation is reasonably efficient, given the performance shown by the unfair lwmutex. The gap between the pthread mutex and the LIFO unfair lwmutex arises from the fact that the pthread mutex tries the CAS operation only once before making a system call to block, whereas the lwmutex code (both lock and unlock) has to contend until the caller has performed a successful CAS operation. The performance gap between the LIFO and FIFO versions of the lwmutex highlights the overhead of traversing the waitq. It is well known that a fair mutex is considerably slower than an unfair one under high contention due to frequent context switches (the convoy problem). For 32 contending threads we saw that the latency per operation can go as high as 13x that seen for unfair mutexes. However, if there is no contention or just a few contending threads, the latency per operation is very close to the one obtained with unfair mutexes. We note that performance parity with pthread mutexes was never our goal. Although we believe that with proper tuning the cost difference between lwmutexes and pthread locks can be reduced further, our primary concern is the memory overhead of pthread locks, which prevents their use in extremely fine-grained locking. Fine-grained locking results in lower contention in general and hence improved performance overall, as we show in the next experiment. (Footnote 3: Our code is written entirely in C and compiled with O4 optimization. Pthread code is part C and part fine-tuned assembly.)
The second experiment illustrates how fine-grained locking can deliver better performance overall. The experiment consists of threads performing lookups, followed by an update to the looked-up record, on a hash table. The hash table has 1 million buckets and is populated with 2 million elements (chaining is used as the collision resolution scheme). We evaluate the latency per operation (in microseconds) for two cases: (i) a fair lwmutex is embedded in each bucket’s list head; and (ii) 1,024 unfair pthread mutexes are used, where each one protects a range of 1,024 buckets.
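The two locking layouts compared in this experiment can be sketched as follows. The sketch assumes that "1 million buckets" means 2^20 (so that 1,024 locks each cover 1,024 buckets) and that each range is contiguous; the type and function names are hypothetical, and the lock sizes are the ones quoted earlier in the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS      (1u << 20)   /* assumed: 1 million buckets = 2^20 */
#define NUM_RANGE_LOCKS  1024u
#define BUCKETS_PER_LOCK (NUM_BUCKETS / NUM_RANGE_LOCKS)   /* 1,024 */

/* Range locking: bucket i is protected by lock i / 1,024 (contiguous ranges). */
static size_t range_lock_index(size_t bucket) {
    assert(bucket < NUM_BUCKETS);
    return bucket / BUCKETS_PER_LOCK;
}

/* Per-bucket locking: the 4-byte lock lives in the list head itself. */
typedef struct node node_t;
typedef struct {
    uint32_t lock;     /* stand-in for a 4-byte lwmutex */
    node_t  *first;    /* chain of elements that hash to this bucket */
} bucket_t;

/* Lock storage in bytes for each scheme, using the sizes quoted in the paper. */
static size_t range_lock_bytes(void)  { return NUM_RANGE_LOCKS * 40u; }
static size_t bucket_lock_bytes(void) { return (size_t)NUM_BUCKETS * 4u; }
```

With these numbers the per-bucket scheme spends 4 MiB on locks against 40 KiB for the range scheme; giving every bucket its own pthread mutex instead would cost 40 MiB, which is the overhead lwlocks are designed to avoid.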
Figure 4 (b) shows how the latency per operation increases as the number of threads concurrently operating on the hash table increases. As can be seen, fine-grained locking is preferable to optimizing the performance of the lock itself. Also, placing the lock within the bucket itself improves memory locality and may cause fewer cache misses than accessing a pthread mutex located in a separate memory area. For the hash table case it is very easy to map from a bucket to a pthread mutex stored in a separate area. That is not true for other data structures like linked lists and trees. Additional logic to minimize the number of locks for those data structures introduces complexity which is more difficult to maintain than for the case where a lock can be cheaply added per node. Even a hash table that uses open-addressing schemes (probing, double hashing or cuckoo hashing [8]) for resolving conflicts presents challenges when using range locking.
Finally, the third experiment compares lw_rwlocks with read-write pthread locks. We use the same hash table as before, but now we fix the total number of threads (readers + writers) at 34 and vary the number of writers (the contending threads) from 0 to 34. Beyond 34 threads we start seeing contention among readers for pthread locks: the contention is on the update of the reader counter, which is protected by a mutex in the pthread library. Because we only want to evaluate the contention due to writers, we picked 34.
Figure 5 shows how the latency per operation increases as the number of writers concurrently operating on the hash table increases. Once again the fine-grained locking provided by the cheap lw_rwlocks delivers better overall performance and also scales better than read-write pthread locks.
6 Conclusions
We have presented in this paper a new approach to building compact synchronization primitives. This is possible because each thread can block on only one lock or condition variable at a time. Besides the compact nature of light-weight locks, the queue management of blocked threads is also done entirely in user space. This allows the implementation of features that are impossible to implement with traditional pthread locks. For instance, asynchronous locking cannot be implemented with pthread locks as they stand. The cost for light-weight locks is a 166-byte waiter structure per thread, which amortizes very quickly for applications where there are many more locks than threads. We believe that this is a fairly common scenario.
References
[1] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst., 1:6–16, January 1990.
[2] Keir Fraser. Practical lock freedom. PhD thesis, Cambridge University Computer Laboratory, 2003.
[3] Gary Graunke and Shreekant Thakkar. Synchronization algorithms for shared-memory multiprocessors. Computer, 23:60–69, June 1990.
[4] Marcel Kornacker and Douglas Banks. High-concurrency locking in R-trees. In The 21st international conference on Very Large Data Bases, pages 134–145, 1995.
[5] Edya Ladan-Mozes and Nir Shavit. An optimistic approach to lock-free FIFO queues. In The 18th Annual Conference on Distributed Computing (DISC’04), volume 3274 of Lecture Notes in Computer Science, pages 117–131. Springer, 2004.
[6] Philip L. Lehman and S. Bing Yao. Efficient locking for concurrent operations on B-trees. ACM Transactions on Database Systems, 6(4):650–670, 1981.
[7] John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9:21–65, February 1991.
[8] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing, 2001.
[9] Hany E. Ramadan, Indrajit Roy, Maurice Herlihy, and Emmett Witchel. Committing conflicting transactions in an STM. In Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’09, pages 163–172, New York, NY, USA, 2009. ACM.
[10] N. Shavit and D. Touitou. Software transactional memory. Distributed Computing, Special Issue, 10:99–116, 1997.
[11] Nir Shavit. Data structures in the multicore age. Communications of the ACM, 54(3):76–84, 2011.
[12] Benjamin Zhu, Kai Li, and Hugo Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST’08, pages 18:1–18:14, Berkeley, CA, USA, 2008. USENIX Association.
Appendix A Pseudo code for light-weight condition variables
Figure 6 presents the structure of a lwcondvar as well as the operations it supports.
Appendix B Upgrading and Downgrading light-weight read-write locks
As mentioned in Section 4, a lw_rwlock also supports upgrade and downgrade operations. Figure 7 shows the algorithms for the two operations. Note that even though multiple readers could be traversing the waitq during upgrade, the traversal is safe. The waitq changes only due to the arrival of new waiters at the back of the queue or the removal of the waiter at the front of the queue during lock transfer. The former is immaterial to the traversal, as the traversal does not care what happens to waiters behind it. The latter cannot happen, as the traversing thread still has a read lock. The only possible race is on the next field of the oldest waiter in the waitq: a reader performing an upgrade wants to add its own waiter in front of it while a thread releasing the write lock on a read-biased lock is re-inserting the list of existing waiters. This situation is handled by the upgrade logic and by the unlock routine.
The unlock operation presented in Figure 3 has to be slightly changed to support downgrade and upgrade of a lw_rwlock. For the case where a writer is releasing the write lock of a read-biased lw_rwlock, while re-inserting the wr_q at the front of the waitq of the lw_rwlock, we have to use a CAS instruction to coordinate with a possible upgrader. Also, if an upgrader is found to be already present at the front of the waitq, the re-inserted wr_q is added behind the upgrader’s waiter.