Can Lock-free and Combining Techniques Co-exist? A Novel Approach on Concurrent Queue
Changwoo Min and Young Ik Eom, Sungkyunkwan University, Korea
[Fig. 1: The overall flow of LECD queue algorithm]
[Fig. 2: Experimental results on four Intel Xeon E7-4850 processors (80 H/W threads and four NUMA clusters)]
Motivation: Concurrent queues are one of the most fundamental concurrent data structures. Most previous research focuses on how to avoid the contended hot spots, Head and Tail, and there are two contradictory approaches: (1) lock-free techniques [1], [2], which increase the degree of parallelism to improve performance, and (2) combining techniques [3], in which a single combining thread performs a batch operation for the pending requests of other threads to reduce synchronization cost at a high degree of parallelism. MS queue [1], a representative lock-free queue, uses compare-and-swap (CAS) to update Head and Tail. When a CAS fails, it retries until it succeeds. The recently proposed LCRQ [2] uses an x86-specific fetch-and-add (F&A) to avoid the costly CAS retry loops. To limit the contention level, lock-free approaches adopt schemes such as a backoff scheme in MS queue and voluntary yielding in LCRQ with hierarchy-aware optimization (LCRQ+H). However, those schemes require parameter tuning for each specific benchmark and machine configuration. Without such manual tuning, lock-free queues can result in suboptimal performance or contention meltdown [2]. Among the combining-based queues, H-Queue [3] is the most efficient. It is based on the two-lock queue presented in [1], with the locks replaced by H-Synch combining constructions [3]. Though combining techniques can reduce synchronization cost, the single-threaded execution of a combiner can be a performance bottleneck.

Our Goal: We propose a scalable out-of-the-box concurrent queue that does not require manual parameter tuning. In our LECD queue, which stands for Lock-free Enqueue and Combining Dequeue, enqueue operations are performed in a lock-free manner and dequeue operations are performed in a combining manner. Since the LECD queue is based on a SWAP primitive, which, unlike CAS, always succeeds, parameter tuning to limit the contention level is not required.
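The difference between the retry-based CAS update and the always-succeeding SWAP can be sketched as follows. This is a minimal single-threaded Python model; `Cell`, `swap`, and `compare_and_swap` are illustrative stand-ins for a memory word and the hardware primitives, not an API from the paper.

```python
class Cell:
    """Models a memory word accessed with atomic primitives."""
    def __init__(self, value):
        self.value = value

def swap(cell, new):
    """SWAP: unconditionally installs `new` and returns the old value.
    It always succeeds, so no retry loop (and no backoff tuning) is needed."""
    old = cell.value
    cell.value = new
    return old

def compare_and_swap(cell, expected, new):
    """CAS: installs `new` only if the word still holds `expected`
    (pointer comparison, as hardware CAS on a pointer word does).
    Under contention it can fail, forcing the caller to retry."""
    if cell.value is expected:
        cell.value = new
        return True
    return False

def cas_update_tail(tail_cell, new_node):
    """MS-queue-style Tail update: retry until the CAS succeeds.
    Returns the number of attempts (greater than 1 under contention)."""
    attempts = 0
    while True:
        attempts += 1
        old = tail_cell.value
        if compare_and_swap(tail_cell, old, new_node):
            return attempts

def swap_update_tail(tail_cell, new_node):
    """LECD-style Tail update: a single SWAP, never retried."""
    return swap(tail_cell, new_node)
```

Because `swap_update_tail` performs exactly one atomic operation regardless of contention, there is no retry rate to manage and hence no backoff parameter to tune.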
Also, SWAP is supported on most architectures, including x86, ARM, and SPARC. The concurrent enqueue operations help prevent the combiner from becoming a performance bottleneck and increase the degree of combining; a higher degree of combining makes the combining operation more efficient.

978-1-4799-1021-2/13/$31.00 ©2013 IEEE

Our Solution, LECD Queue: We illustrate the overall flow in Figure 1. In the enqueue operation, we update Tail to a new node by using SWAP (E1) and then update the old Tail's next to point to the new node (E2). In the dequeue operation, we first enqueue a request into a request list, one per NUMA cluster, by using SWAP (D1, D2), and check whether the current thread becomes the combiner (D3). If it does not, it waits until the enqueued request is processed by the combiner (D4a). Otherwise (D4b), the combiner thread processes the pending requests for its cluster (C2) and checks whether the queue has become empty (C3). Similar to MS queue, we use a dummy node pointed at by Head to check for emptiness. While the last dequeued node is recycled as a dummy node in MS queue, our dummy node is allocated at queue initialization and used permanently. Since this makes the life cycles of a dequeued element and its queue node the same, more flexible memory management, such as embedding a queue node in an element, is possible. To this end, we update Head's next instead of Head when the queue is not empty (C4a). When the queue becomes empty, we update Tail to Head and Head's next to null (C4b). Since concurrent enqueuers may update Tail, it is updated by using CAS; however, no retry loop is needed. The execution of the per-cluster combining threads is serialized by the global combiner lock (C1, C6).

Evaluation: Figure 2 shows the experimental results of the queue algorithms for the 50% enqueues benchmark, in which each thread randomly performs an enqueue or a dequeue operation. Among the lock-free queues, the parameter-tuned versions, such as MS queue with backoff and LCRQ+H, outperform the untuned versions. The throughputs of the LECD queue and H-Queue increase as the thread count increases, due to the higher degree of combining.
The average combining degree of the LECD queue is far greater than that of H-Queue due to the concurrent enqueue operations. As a result, the LECD queue outperforms MS queue and H-Queue at all concurrency levels and starts to outperform LCRQ and LCRQ+H from 32 and 64 threads, respectively.

REFERENCES
[1] M. M. Michael and M. L. Scott, "Simple, fast, and practical non-blocking and blocking concurrent queue algorithms," in Proc. of PODC, 1996.
[2] A. Morrison and Y. Afek, "Fast concurrent queues for x86 processors," in Proc. of PPoPP, 2013.
[3] P. Fatourou and N. D. Kallimanis, "Revisiting the combining synchronization technique," in Proc. of PPoPP, 2012.
Motivation and Background: Research on concurrent queues is stuck in a dilemma with regard to scalability.

Lock-Free Approaches
- Key intuition: a higher degree of parallelism yields higher scalability; use hardware atomic instructions, such as CAS, to maximize the degree of parallelism.
- Major work: MS Queue [PODC'96], LCRQ [PPoPP'13].
- Pros: take advantage of parallel execution.
- Cons: contention management schemes require manual parameter tuning to avoid contended hot spots, which is not feasible in libraries and runtimes (JSR 166, Boost C++ Libraries).

Combining Approaches
- Key intuition: a higher degree of parallelism incurs higher synchronization cost and thus lower scalability, so use batch processing by a single thread; one designated thread, the combiner, performs batch processing on behalf of all other concurrent threads.
- Major work: FC Queue [SPAA'10], H-Queue [PPoPP'12].
- Pros: low synchronization cost of single-threaded execution.
- Cons: lose opportunities to exploit the advantages of parallel execution.
LECD Queue: Lock-Free Enqueue and Combining Dequeue
Goal: A practical and scalable concurrent queue with no parameter tuning.
Approach: Use both lock-free and combining techniques synergistically.

Enqueue Operation
[Figure: the new node is installed by swapping Tail and then linking the old tail's next to it; Head points at a permanent dummy node]
E1. new->next = null;
E2. old_tail = SWAP(&Tail, new);
E3. old_tail->next = new;
- Unlike CAS, SWAP always succeeds: no retry, hence no exponential backoff scheme, and no parameter tuning is needed.
- Except for the SWAP, concurrent threads are independent, so enqueues enjoy the advantages of parallel execution.

Dequeue Operation
[Figure: per-NUMA-cluster request lists; each request holds result, rnext, and status (WAIT, COMBINE, or DONE), and the per-cluster combiners are serialized by the global combiner lock]

1. Adding a New Dummy Request
A1. rq->next = null;
A2. prv_rq = SWAP(&ReqTail, rq);
A3. prv_rq->next = rq;
- Append a new dummy request in WAIT status to the per-cluster request list, using the same SWAP technique as the enqueue operation; the initial dummy request is in COMBINE status.
- Depending on the status of the previous request, the dequeuer either takes the role of the combiner (COMBINE) or spins on the previous request until its own request is processed (WAIT).
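The E1-E3 and A1-A3 steps above can be modeled in a few lines. This is a single-threaded Python sketch: `Ref`, `Node`, and `Request` are illustrative types, and `swap` stands in for the atomic SWAP, so it shows only the sequential logic, not the concurrent behavior.

```python
class Node:
    """A queue node (the paper embeds such nodes in elements)."""
    def __init__(self, value=None):
        self.value = value
        self.next = None

class Request:
    """A dequeue request; field names follow the poster figure."""
    WAIT, COMBINE, DONE = "WAIT", "COMBINE", "DONE"
    def __init__(self, status=WAIT):
        self.status = status
        self.result = None
        self.rnext = None  # link in the per-cluster request list

class Ref:
    """A word holding a pointer, updated with atomic primitives."""
    def __init__(self, value=None):
        self.value = value

def swap(ref, new):
    """Models the always-succeeding atomic SWAP."""
    old = ref.value
    ref.value = new
    return old

def enqueue(tail_ref, new):
    new.next = None                   # E1: new node terminates the list
    old_tail = swap(tail_ref, new)    # E2: Tail now points at the new node
    old_tail.next = new               # E3: link the old tail to it

def add_request(req_tail_ref, rq):
    """A1-A3: append a dummy request with the same SWAP technique.
    Returns the previous tail request; its status tells the dequeuer
    whether to become the combiner (COMBINE) or to spin (WAIT)."""
    rq.rnext = None
    prv_rq = swap(req_tail_ref, rq)
    prv_rq.rnext = rq
    return prv_rq
```

Note how `enqueue` and `add_request` are the same two-step pattern: one SWAP to publish, one plain store to link, with no retry in either.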
2. Combining Dequeue Operations
[Figure: the combiner processes pending requests, changing WAIT to DONE, and hands off by setting the last request's status to COMBINE; Head keeps pointing at the permanent dummy]
D1. A combiner thread processes the pending requests and changes their status to DONE.
D2a. After processing the requests, if the queue does not become empty, update Head->next to the next of the last dequeued node.
D2b. If the queue becomes empty:
  1. Head->next = null;
  2. CAS(&Tail, the last dequeued node, Head);
Since another concurrent enqueuer can update Tail between steps 1 and 2, Tail needs to be updated by using CAS; no CAS retry is needed in this case.
D3. Change the status of the last dummy request to COMBINE for a subsequent dequeuer to take the role of the combiner.

Memory Management
Why update Head->next instead of Head? Because the dummy node pointed at by Head is allocated at queue initialization and used permanently, the life cycles of a dequeued element and its queue node coincide, which allows more flexible memory management such as embedding a queue node in an element.
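The D2a/D2b fix-up can be sketched as follows. This is a single-threaded Python model of the control flow only; `compare_and_swap` is an illustrative stand-in for hardware CAS, and the interleaving argument with concurrent enqueuers is taken on the paper's word rather than demonstrated here.

```python
class Node:
    def __init__(self, value=None):
        self.value = value
        self.next = None

class Ref:
    """A word holding a pointer, updated atomically."""
    def __init__(self, value=None):
        self.value = value

def compare_and_swap(ref, expected, new):
    """Models CAS: succeeds only if the word is unchanged."""
    if ref.value is expected:
        ref.value = new
        return True
    return False

def combiner_fixup(head, tail_ref, last_dequeued):
    """Run by the combiner after batch-dequeuing up to `last_dequeued`;
    `head` is the permanent dummy node (D2a/D2b)."""
    if last_dequeued.next is not None:
        # D2a: queue is still non-empty; Head itself is never moved,
        # only its next pointer.
        head.next = last_dequeued.next
    else:
        # D2b, step 1: queue became empty.
        head.next = None
        # D2b, step 2: a concurrent enqueuer may swap Tail between
        # steps 1 and 2, so Tail is updated with CAS; if the CAS
        # fails, no retry is needed.
        compare_and_swap(tail_ref, last_dequeued, head)
```

A failed CAS here simply means an enqueuer already moved Tail past the last dequeued node, which is why this is the one place CAS appears without a retry loop.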
Evaluation
Encouraging results: the LECD queue outperforms MS Queue with backoff by 14.3x, H-Queue by 1.6x, and LCRQ+H by 1.6%.
Synergistic interactions: parallel enqueue -> higher combining degree -> more efficient combining dequeue operation.
[Plot: 50% enqueues benchmark, throughput (million operations/second) vs. threads (0-80) for MS-Q (backoff), MS-Q (no backoff), H-Queue, LCRQ, LCRQ+H, and LECD]
[Plot: enqueue-dequeue pairs benchmark, throughput (million operations/second) vs. threads (0-80) for the same queues]
[Plot: average combining degree in the 50% enqueues benchmark, LECD vs. H-Queue, over 0-80 threads]
[Bar chart: average throughput of the benchmarks on 80 threads; speedup labels of 1x, 1.8x, 11.1x, 16.1x, 25.9x, and 26.3x appear across the queues]