Brief Announcement: A Transactional Approach to Lock Scalability

Amitabha Roy, Keir Fraser, Steven Hand
University of Cambridge Computer Laboratory

ABSTRACT

Most software transactional memory implementations execute code using fine-grained optimistic concurrency control. This does not perform well with low contention data structures where fine grained conflict detection means manipulating metadata for every object touched and optimistic concurrency control imposes the overhead of making thread private shadow copies. Also, a purely optimistic approach does not coexist naturally with legacy code that is either already concurrent using locks or does IO operations that cannot be revoked. We try to address these problems by presenting a new form of the reader writer locks used by the vast majority of concurrent code today. Along with the traditional lock/unlock operations, these new locks support STM-like management of shadow versions that can be used when desired by the programmer. We show how existing lock based code can be scaled to perform as well as an STM, with few changes to the existing code base. We also show as a corollary that our design allows construction of data structures that retain strict fairness between threads, while simultaneously allowing disjoint access parallelism.

Categories and Subject Descriptors D.1.3 [Software]: Programming Techniques Concurrent Programming

General Terms Design, Algorithms, Performance

1. TRANSACTIONAL LOCKING

Software Transactional Memory (STM) provides a mechanism to write scalable concurrent code by allowing a set of operations to be atomically executed on a shared heap. However, most existing runtime implementations for STM use only fine-grained optimistic concurrency control in the interests of scalability and performance. This leads to three key problems. First, fine grained concurrency does not perform well with low contention data structures because of the overhead of per-object metadata management, which is unnecessary when contention is low. Second, optimistic concurrency control gives no guarantee that an execution of an atomic block will commit, and hence atomic blocks cannot call out to legacy code with visible side effects. Without complex runtime support this rules out performing IO operations within transactions. Finally, the vast majority of concurrent code deployed today uses locks, so it is desirable to support mixed forms of concurrency control, where transactional code co-exists with legacy lock based code. We believe that the lack of such compatibility is an impediment to the adoption of transactional memory until the number of available threads on commercial systems is high enough to cause a serious scalability problem, at which point we will be faced with the problem of significantly rewriting large amounts of lock-based concurrent software.

We present an initial solution to this problem by adding a transactional locking runtime to existing concurrent software as an option for lock based programmers. To the programmer, locks remain the primary means for concurrency control: they must identify which locks protect which data, demarcate critical sections correctly and acquire locks on objects accessed in the critical section. However, they can optionally indicate that a certain critical section should be executed with STM-style optimistic fine-grained concurrency control. This approach tries to combine straight-line performance under low contention with scalability under higher contention, without programmers needing to roll their own correct, deadlock free fine-grained locking.

Finally, we show that our design allows fair access semantics to data structures. We can guarantee that critical sections are strictly fair, while retaining scalability when disjoint access parallelism is possible. This suffices to provide a powerful contention management system without any associated bookkeeping.

Copyright is held by the author/owner(s). SPAA’08, June 14–16, 2008, Munich, Germany. ACM 978-1-59593-973-9/08/06.

2. PROGRAMMER API

In this section we describe the API to our new form of lock. We assume that there is already a library for managing locks, which allows reader-writer locks or specialisations (such as a spin lock) to be declared and acquired with read_lock (for shared read mode access) and write_lock (for exclusive write mode access) calls. We leave this existing locking API in place.

As a running example in this section, we consider the problem of inserting a node into a concurrent sorted linked list. We first look at a straightforward lock based pessimistic implementation, which can come in either a coarse grained or a fine grained variety.

Pessimistic Coarse Grained:

    /* Assumes head is a sentinel below every key, so the loop runs at least once. */
    write_lock(&head->lock);            /* one lock protects the whole list */
    node = head;
    while (node->value < new->value) {
        print(node->value);             /* IO is safe: this path never aborts */
        prev = node;
        node = node->next;
    }
    new->next = node;
    prev->next = new;
    write_unlock(&head->lock);

Pessimistic Fine Grained:

    /* Hand-over-hand locking; same sentinel assumption as the coarse version. */
    write_lock(&head->lock);
    node = head;
    prev = NULL;
    while (node->value < new->value) {
        if (prev != NULL)
            write_unlock(&prev->lock);  /* drop the lock two nodes back */
        print(node->value);
        prev = node;
        node = node->next;
        write_lock(&node->lock);
    }
    write_unlock(&node->lock);          /* only prev is modified below */
    new->next = node;
    prev->next = new;
    write_unlock(&prev->lock);

The pessimistic access methods are the most natural solutions that would be written by a programmer using locks. However, they are clearly a bottleneck to scalability under high contention. For example, a concurrent reader could be blocked by an update even if it is searching for a different value in the list.

To solve this problem our enhanced locking API allows a lock to be acquired transactionally. We introduce the tx_read_lock and tx_write_lock calls to acquire locks in transactional mode. Additionally, we need the capability to make thread private shadow copies so that speculative changes to objects can be made: this is done using the shadow call. Transactions are started with begin_transaction and committed with end_transaction. An unsuccessful commit causes the transaction to be rolled back and retried. In Section 4 we outline how we can retain strict fairness in the face of excessive aborts.

Optimistic Fine Grained:

    begin_transaction();
    node = head;
    tx_read_lock(&node->lock);
    while (node->value < new->value) {
        prev = node;
        node = node->next;
        tx_read_lock(&node->lock);
    }
    tx_write_lock(&prev->lock);
    new->next = node;                   /* new is still thread private */
    *shadow(tx, &prev->next, sizeof(prev->next)) = new;
    end_transaction();
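The role of the shadow call can be illustrated with a minimal sketch of a thread private redo log. This is not the paper's runtime: only the argument order of shadow is taken from the listing above, while tx_t, tx_apply, tx_discard, MAX_SHADOWS and the log layout are hypothetical names chosen for illustration, and no locking or validation is performed here. The sketch shows the two properties the text relies on: speculative writes stay invisible until commit, and a second shadow call on the same address returns the same private copy, so a transaction reads its own writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SHADOWS 16   /* hypothetical fixed log capacity */

/* One log entry: the shared location and its thread private copy. */
typedef struct {
    void  *addr;
    size_t size;
    void  *copy;
} shadow_entry_t;

typedef struct {
    shadow_entry_t entries[MAX_SHADOWS];
    int            count;
} tx_t;

/* Return the private copy of addr, creating it on first use.
 * Returns NULL if the log is full or allocation fails. */
void *shadow(tx_t *tx, void *addr, size_t size) {
    for (int i = 0; i < tx->count; i++)
        if (tx->entries[i].addr == addr)
            return tx->entries[i].copy;   /* read-after-write lookup */
    if (tx->count == MAX_SHADOWS)
        return NULL;
    void *copy = malloc(size);
    if (copy == NULL)
        return NULL;
    memcpy(copy, addr, size);             /* snapshot the current contents */
    tx->entries[tx->count++] = (shadow_entry_t){ addr, size, copy };
    return copy;
}

/* Free all private copies and empty the log. */
static void tx_reset(tx_t *tx) {
    for (int i = 0; i < tx->count; i++)
        free(tx->entries[i].copy);
    tx->count = 0;
}

/* Successful commit: publish every private copy to shared memory. */
void tx_apply(tx_t *tx) {
    for (int i = 0; i < tx->count; i++)
        memcpy(tx->entries[i].addr, tx->entries[i].copy, tx->entries[i].size);
    tx_reset(tx);
}

/* Abort: drop the private copies; shared memory is untouched. */
void tx_discard(tx_t *tx) {
    tx_reset(tx);
}
```

In a full runtime, tx_apply would run only inside the commit phase, after the write locks are held and the read set has been validated; tx_discard corresponds to the rollback that precedes a retry.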

While the optimistic implementation solves the scalability problem and provides an upgrade path, the pessimistic implementation can continue to co-exist. The pessimistic implementation is needed to take care of IO: it prints the traversed nodes, which is not possible in the optimistic one. Finally, the coarse grained pessimistic locking implementation would show the best performance under low contention, since it avoids the costs of locking every object and shadowing changes. We believe that our locking enhancements pave the way to dynamically switching between the different levels of concurrency based on measured contention, something we intend to explore in future work. In the rest of this paper we discuss the runtime requirements to support the transactional locking calls (Section 3) and the constraints that the programmer needs to follow in order for different code versions to co-exist simultaneously (Section 4).

3. PROTOTYPE RUNTIME

The runtime changes the existing lock data structure by adding a version number. The only other change is to existing library calls for acquiring a lock in exclusive mode. Exclusive mode lock acquisitions increment the version number and the corresponding lock release increments the version number again. Thus an odd version number indicates that the object protected by the lock is write locked and its contents are unstable. We also chose to separate the lock related data from the version number. This allows us to be compatible with all lock implementations regardless of the organisation of their lock metadata.

We also provide an optimistic concurrency control capability in the runtime, modelled after the TL2 [1] approach. A key difference, though, is that we do not use per word meta-data or a hash function. Instead, we associate a lock with each concurrent object to allow fine grained locking. This maps better to existing concurrent software that already uses fine grained locking. We provide a simple capability to shadow objects (and look up the shadow copies to avoid read after write hazards). When making a thread local copy, we ensure that the version number of the snapshot is even, which guarantees that we read a stable version. The commit phase acquires all write locks, verifies that version numbers on read locks are unchanged and applies changes to shared memory.

The capability to invisibly acquire read locks allows programmers to avoid writing to a lock, which was shown to slow down fine grained implementations of data structures such as red black trees as a result of cache line contention on the root node [2]. Any attempt to acquire a lock transactionally is always bounded by a timeout; on timeout the transaction is aborted. This allows programmers to do fine grained locking even when it might result in deadlocks. Thus most of the performance and simplicity advantages of existing STM solutions are made available to programmers using this extended locking scheme.

To avoid a use after free problem, deleted objects are recycled through an epoch based garbage collector. We believe that this is not a significant change for concurrent software today: for example, the Linux kernel uses epochs for garbage collection in the case of read copy update [3] managed objects.

Our solution is not non-blocking (which was not a concern since we are targeting existing concurrent lock based software) but any lock acquired transactionally is held only for the duration of the commit section. Thus, if the commit section does not yield to other userspace threads that can potentially run a conflicting transaction (which can be ensured through appropriate scheduler calls) then the transaction itself becomes non-blocking. We believe that this capability gives programmers a significant means to avoid lock convoying problems.

4. FAIR COARSE GRAINED LOCKS

In this section we examine what constraints a programmer needs to follow in order for different code versions to co-exist correctly. We do this by formally defining a subsumption relationship between locks, a concept similar to multi-granularity locking in databases. This relationship is interesting to us because it is the foundation on which we can correctly allow a coarse pessimistic locking routine to co-exist simultaneously with a fine grained one, such as the example in Section 2.

Consider a set of mixed reader-writer locks $L = \{l_i\}$. We define a relation between locks $\prec\ \subset L \times L$, induced by the concurrent program, where $l_i \prec l_j$ iff any fine grained locking based method always acquires $l_i$ in at least read mode before it attempts to acquire $l_j$ in any mode. We say in such cases that $l_i$ subsumes $l_j$. Such a relationship naturally exists in data structures like red-black trees because the root node of any subtree must always be at least read locked before accessing nodes below it. Consider a coarse grained lock $l_{coarse}$ such that $\forall l \in L,\ l_{coarse} \prec l$. As a concrete example, in the linked list implementations in Section 2, the lock on the list head subsumes all the other locks.

A coarse grained access method first acquires $l_{coarse}$ non transactionally. This locking call must include a transactional barrier (an epoch ensuring all currently executing transactions finish) after acquiring the coarse lock. This can be achieved by giving the coarse grained lock a special type, thus hiding the transactional implications from the programmer and leaving legacy code unchanged. A more scalable approach is to ensure that before an object protected by a lock $l$ is accessed for the first time, the accessing thread waits until version($l$) is even, indicating that the object is stable. This would however imply more changes to any legacy coarse grained locking code. This is the approach that we have followed in our evaluation, whose results are described in Section 5. In either case, the coarse grained access method avoids the cost of shadowing objects. It is also irrevocable and thus can safely do IO, as shown in the example of Section 2.

The separation between lock and version number in our metadata design means that we can use a heavy weight fair lock implementation, such as the list-based one of [4], for the coarse grained locks. This means that any conflicting transaction can block the coarse grained access method at most once before it fails and needs to wait on the coarse grained lock (by virtue of the lock subsuming relationship). Thus the coarse grained access is strictly 1-fair [5]. The coarse grained lock also provides a simple fall back conflict management method for threads suffering excessive aborts.

5. PERFORMANCE

We evaluate the performance and efficiency of our runtime and API by measuring the scalability of concurrent skip lists and red black trees, when using optimistic fine grained locking. We use the lockfree package of [2] as a baseline. The implementation provides fairness using the method of Section 4, switching to the coarse grained access method after 10 failures of the fine-grained method. Within the fine grained access method up to 1000 trials are allowed for each lock. We measured performance on an SGI Altix 4700 using 64 Itanium cores and NUMA memory connected by a switched fabric, taking the median of 5 runs of 10 seconds each for stability. Our set workload consisted of a key space of size $2^{19}$ and a read fraction of 75%, with the remainder divided equally between updates and deletes. For both the skip lists (Figure 1) and red black trees (Figure 2), our scalable lock implementation surpasses Fraser’s STM in performance and scalability by a factor of approximately 2X. We also experimented with a global timestamp counter as in [1], but as the figures show it is detrimental to performance on a NUMA multicore machine due to the high cost of sending the cacheline containing the counter across the fabric.

Figure 1: Skip list performance. [Plot "Skip List Scalability": transaction time (microseconds) against threads, 0 to 70. Series: stm fraser, scalable locks, per-list locks, scalable locks timestamp.]

Figure 2: Red-Black tree performance. [Plot "Red Black Tree Scalability": transaction time (microseconds) against threads, 0 to 70. Series: hanke fraser, stm fraser, scalable locks, scalable locks timestamp.]

6. CONCLUSION

We have presented the design for a lock that can be acquired both in transactional and conventional modes. It allows a seamless upgrade path for existing concurrent software towards greater scalability, by virtue of not changing locking paradigms and making no significant changes to lock metadata. It can be used to build concurrent data structures that support simultaneous coarse and fine grained locking and provide strict 1-fairness between conflicting threads. We can readily support irrevocability and IO in our design. We have also introduced the notion of a non-blocking commit section.

7. ACKNOWLEDGEMENTS

We would like to thank the COSMOS facility at DAMTP Cambridge and their sponsors SGI/Intel for use of their Altix 4700 supercomputer. We would also like to thank Tim Harris and Derek Murray for their insights and helpful feedback.

8. REFERENCES

[1] D. Dice, O. Shalev, and N. Shavit. Transactional locking II. In DISC 2006.
[2] K. Fraser. Practical lock freedom. PhD thesis, Cambridge University Computer Laboratory, 2003.
[3] P. E. McKenney and J. D. Slingwine. Read-copy update: Using execution history to solve concurrency problems. In Parallel and Distributed Computing and Systems, pages 509–518, October 1998.
[4] J. Mellor-Crummey and M. Scott. Scalable reader-writer synchronization for shared-memory multiprocessors. In PPOPP 1991.
[5] N. Dershowitz, D. N. Jayasimha, and S. Park. Bounded fairness. In Verification: Theory and Practice, 2003.
