Can Lock-free and Combining Techniques Co-exist? A Novel Approach on Concurrent Queue
Changwoo Min and Young Ik Eom, Sungkyunkwan University, Korea
[Fig. 1: The overall flow of LECD queue algorithm]
[Fig. 2: Experimental results on four Intel Xeon E7-4850 processors (80 H/W threads and four NUMA clusters)]
Motivation: Concurrent queues are one of the most fundamental concurrent data structures. Most previous research focuses on how to avoid the contended hot spots, Head and Tail, and there are two contradictory approaches: (1) lock-free techniques [1], [2], which increase the degree of parallelism to improve performance, and (2) combining techniques [3], in which a single combining thread performs a batch operation for the pending requests of other threads to reduce synchronization cost at a high degree of parallelism. MS queue [1], a representative lock-free queue, uses compare-and-swap (CAS) to update Head and Tail. When a CAS fails, it retries until it succeeds. The recently proposed LCRQ [2] uses an x86-specific fetch-and-add (F&A) to avoid the costly CAS retry loops. To limit the contention level, lock-free approaches adopt schemes such as a backoff scheme in MS queue and voluntary yielding in LCRQ with hierarchy-aware optimization (LCRQ+H). However, those schemes require parameter tuning for each specific benchmark and machine configuration. Without such manual tuning, lock-free queues can result in suboptimal performance or contention meltdown [2]. Among the combining-based queues, H-Queue [3] is the most efficient. It is based on the two-lock queue presented in [1], with the locks replaced by H-Synch combining constructions [3]. Though combining techniques can reduce synchronization cost, the single-threaded execution of a combiner can be a performance bottleneck.

Our Goal: We propose a scalable out-of-the-box concurrent queue that does not require manual parameter tuning. In our LECD queue, which stands for Lock-free Enqueue and Combining Dequeue, enqueue operations are performed in a lock-free manner and dequeue operations are performed in a combining manner. Since the LECD queue is based on a SWAP primitive, which, unlike CAS, always succeeds, parameter tuning to limit the contention level is not required.
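The difference between the retry-based CAS update and the always-succeeding SWAP can be sketched as follows. This is a minimal single-threaded Python model; `Cell`, `swap`, and `compare_and_swap` are illustrative stand-ins for a memory word and the hardware primitives, not an API from the paper.

```python
class Cell:
    """Models a memory word accessed with atomic primitives."""
    def __init__(self, value):
        self.value = value

def swap(cell, new):
    """SWAP: unconditionally installs `new` and returns the old value.
    It always succeeds, so no retry loop (and no backoff tuning) is needed."""
    old = cell.value
    cell.value = new
    return old

def compare_and_swap(cell, expected, new):
    """CAS: installs `new` only if the word still holds `expected`
    (pointer comparison, as hardware CAS on a pointer word does).
    Under contention it can fail, forcing the caller to retry."""
    if cell.value is expected:
        cell.value = new
        return True
    return False

def cas_update_tail(tail_cell, new_node):
    """MS-queue-style Tail update: retry until the CAS succeeds.
    Returns the number of attempts (greater than 1 under contention)."""
    attempts = 0
    while True:
        attempts += 1
        old = tail_cell.value
        if compare_and_swap(tail_cell, old, new_node):
            return attempts

def swap_update_tail(tail_cell, new_node):
    """LECD-style Tail update: a single SWAP, never retried."""
    return swap(tail_cell, new_node)
```

Because `swap_update_tail` performs exactly one atomic operation regardless of contention, there is no retry rate to manage and hence no backoff parameter to tune.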
Also, SWAP is supported on most architectures, including x86, ARM, and SPARC. The concurrent enqueue operations help prevent the combiner from becoming a performance bottleneck and increase the degree of combining; a higher degree of combining makes the combining operation more efficient.

978-1-4799-1021-2/13/$31.00 ©2013 IEEE

Our Solution, LECD Queue: We illustrate the overall flow in Figure 1. In the enqueue operation, we update Tail to a new node by using SWAP (E1) and then update the old Tail's next to point to the new node (E2). In the dequeue operation, we first enqueue a request into a request list, one per NUMA cluster, by using SWAP (D1, D2), and check whether the current thread becomes the combiner (D3). If it does not, it waits until the enqueued request is processed by the combiner (D4a). Otherwise (D4b), the combiner thread processes the pending requests for its cluster (C2) and checks whether the queue has become empty (C3). Similar to MS queue, we use a dummy node pointed at by Head to check for emptiness. While the last dequeued node is recycled as a dummy node in MS queue, our dummy node is allocated at queue initialization and used permanently. Since this makes the life cycles of a dequeued element and its queue node the same, more flexible memory management, such as embedding a queue node in an element, is possible. To this end, we update Head's next instead of Head when the queue is not empty (C4a). When the queue becomes empty, we update Tail to Head and Head's next to null (C4b). Since concurrent enqueuers may update Tail, it is updated by using CAS; however, no retry loop is needed. The execution of the per-cluster combining threads is serialized by the global combiner lock (C1, C6).

Evaluation: Figure 2 shows the experimental results of the queue algorithms for the 50% enqueues benchmark, in which each thread randomly performs an enqueue or a dequeue operation. Among the lock-free queues, the parameter-tuned versions, such as MS queue with backoff and LCRQ+H, outperform the untuned versions. The throughputs of the LECD queue and H-Queue increase as the thread count increases, due to the higher degree of combining.
The average combining degree of the LECD queue is far greater than that of H-Queue due to the concurrent enqueue operations. As a result, the LECD queue outperforms MS queue and H-Queue at all concurrency levels and starts to outperform LCRQ and LCRQ+H from 32 and 64 threads, respectively.

REFERENCES
[1] M. M. Michael and M. L. Scott, "Simple, fast, and practical non-blocking and blocking concurrent queue algorithms," in Proc. of PODC, 1996.
[2] A. Morrison and Y. Afek, "Fast concurrent queues for x86 processors," in Proc. of PPoPP, 2013.
[3] P. Fatourou and N. D. Kallimanis, "Revisiting the combining synchronization technique," in Proc. of PPoPP, 2012.
Motivation and Background: Research on concurrent queues is stuck in a dilemma with regard to scalability.

Lock-Free Approaches
- Key intuition: a higher degree of parallelism yields higher scalability; use hardware atomic instructions, such as CAS, to maximize the degree of parallelism.
- Major work: MS Queue [PODC'96], LCRQ [PPoPP'13].
- Pros: take advantage of parallel execution.
- Cons: contention management schemes require manual parameter tuning to avoid contended hot spots, which is not feasible in libraries and runtimes (JSR 166, Boost C++ Libraries).

Combining Approaches
- Key intuition: a higher degree of parallelism incurs higher synchronization cost and thus lower scalability, so use batch processing by a single thread; one designated thread, the combiner, performs batch processing on behalf of all other concurrent threads.
- Major work: FC Queue [SPAA'10], H-Queue [PPoPP'12].
- Pros: low synchronization cost of single-threaded execution.
- Cons: lose opportunities to exploit the advantages of parallel execution.
LECD Queue: Lock-Free Enqueue and Combining Dequeue
Goal: A practical and scalable concurrent queue with no parameter tuning.
Approach: Use both lock-free and combining techniques synergistically.

Enqueue Operation
[Figure: the new node is installed by swapping Tail and then linking the old tail's next to it; Head points at a permanent dummy node]
E1. new->next = null;
E2. old_tail = SWAP(&Tail, new);
E3. old_tail->next = new;
- Unlike CAS, SWAP always succeeds: no retry, hence no exponential backoff scheme, and no parameter tuning is needed.
- Except for the SWAP, concurrent threads are independent, so enqueues enjoy the advantages of parallel execution.

Dequeue Operation
[Figure: per-NUMA-cluster request lists; each request holds result, rnext, and status (WAIT, COMBINE, or DONE), and the per-cluster combiners are serialized by the global combiner lock]

1. Adding a New Dummy Request
A1. rq->next = null;
A2. prv_rq = SWAP(&ReqTail, rq);
A3. prv_rq->next = rq;
- Append a new dummy request in WAIT status to the per-cluster request list, using the same SWAP technique as the enqueue operation; the initial dummy request is in COMBINE status.
- Depending on the status of the previous request, the dequeuer either takes the role of the combiner (COMBINE) or spins on the previous request until its own request is processed (WAIT).
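The E1-E3 and A1-A3 steps above can be modeled in a few lines. This is a single-threaded Python sketch: `Ref`, `Node`, and `Request` are illustrative types, and `swap` stands in for the atomic SWAP, so it shows only the sequential logic, not the concurrent behavior.

```python
class Node:
    """A queue node (the paper embeds such nodes in elements)."""
    def __init__(self, value=None):
        self.value = value
        self.next = None

class Request:
    """A dequeue request; field names follow the poster figure."""
    WAIT, COMBINE, DONE = "WAIT", "COMBINE", "DONE"
    def __init__(self, status=WAIT):
        self.status = status
        self.result = None
        self.rnext = None  # link in the per-cluster request list

class Ref:
    """A word holding a pointer, updated with atomic primitives."""
    def __init__(self, value=None):
        self.value = value

def swap(ref, new):
    """Models the always-succeeding atomic SWAP."""
    old = ref.value
    ref.value = new
    return old

def enqueue(tail_ref, new):
    new.next = None                   # E1: new node terminates the list
    old_tail = swap(tail_ref, new)    # E2: Tail now points at the new node
    old_tail.next = new               # E3: link the old tail to it

def add_request(req_tail_ref, rq):
    """A1-A3: append a dummy request with the same SWAP technique.
    Returns the previous tail request; its status tells the dequeuer
    whether to become the combiner (COMBINE) or to spin (WAIT)."""
    rq.rnext = None
    prv_rq = swap(req_tail_ref, rq)
    prv_rq.rnext = rq
    return prv_rq
```

Note how `enqueue` and `add_request` are the same two-step pattern: one SWAP to publish, one plain store to link, with no retry in either.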
2. Combining Dequeue Operations
[Figure: the combiner processes pending requests, changing WAIT to DONE, and hands off by setting the last request's status to COMBINE; Head keeps pointing at the permanent dummy]
D1. A combiner thread processes the pending requests and changes their status to DONE.
D2a. After processing the requests, if the queue does not become empty, update Head->next to the next of the last dequeued node.
D2b. If the queue becomes empty:
  1. Head->next = null;
  2. CAS(&Tail, the last dequeued node, Head);
Since another concurrent enqueuer can update Tail between steps 1 and 2, Tail needs to be updated by using CAS; no CAS retry is needed in this case.
D3. Change the status of the last dummy request to COMBINE for a subsequent dequeuer to take the role of the combiner.

Memory Management
Why update Head->next instead of Head? Because the dummy node pointed at by Head is allocated at queue initialization and used permanently, the life cycles of a dequeued element and its queue node coincide, which allows more flexible memory management such as embedding a queue node in an element.
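The D2a/D2b fix-up can be sketched as follows. This is a single-threaded Python model of the control flow only; `compare_and_swap` is an illustrative stand-in for hardware CAS, and the interleaving argument with concurrent enqueuers is taken on the paper's word rather than demonstrated here.

```python
class Node:
    def __init__(self, value=None):
        self.value = value
        self.next = None

class Ref:
    """A word holding a pointer, updated atomically."""
    def __init__(self, value=None):
        self.value = value

def compare_and_swap(ref, expected, new):
    """Models CAS: succeeds only if the word is unchanged."""
    if ref.value is expected:
        ref.value = new
        return True
    return False

def combiner_fixup(head, tail_ref, last_dequeued):
    """Run by the combiner after batch-dequeuing up to `last_dequeued`;
    `head` is the permanent dummy node (D2a/D2b)."""
    if last_dequeued.next is not None:
        # D2a: queue is still non-empty; Head itself is never moved,
        # only its next pointer.
        head.next = last_dequeued.next
    else:
        # D2b, step 1: queue became empty.
        head.next = None
        # D2b, step 2: a concurrent enqueuer may swap Tail between
        # steps 1 and 2, so Tail is updated with CAS; if the CAS
        # fails, no retry is needed.
        compare_and_swap(tail_ref, last_dequeued, head)
```

A failed CAS here simply means an enqueuer already moved Tail past the last dequeued node, which is why this is the one place CAS appears without a retry loop.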
Evaluation
Encouraging results: the LECD queue outperforms MS Queue with backoff by 14.3x, H-Queue by 1.6x, and LCRQ+H by 1.6%.
Synergistic interactions: parallel enqueue -> higher combining degree -> more efficient combining dequeue operation.
[Plot: 50% enqueues benchmark, throughput (million operations/second) vs. threads (0-80) for MS-Q (backoff), MS-Q (no backoff), H-Queue, LCRQ, LCRQ+H, and LECD]
[Plot: enqueue-dequeue pairs benchmark, throughput (million operations/second) vs. threads (0-80) for the same queues]
[Plot: average combining degree in the 50% enqueues benchmark, LECD vs. H-Queue, over 0-80 threads]
[Bar chart: average throughput of the benchmarks on 80 threads; speedup labels of 1x, 1.8x, 11.1x, 16.1x, 25.9x, and 26.3x appear across the queues]