Can Lock-free and Combining Techniques Co-exist? A Novel Approach on Concurrent Queue

Changwoo Min and Young Ik Eom
Sungkyunkwan University, Korea

Fig. 1: The overall flow of the LECD queue algorithm
[Diagram labels: Head, Tail, next, ReqTail; request fields status, result, rnext; request states WAIT, COMBINE, DONE; Tail updated with CAS.]

Fig. 2: Experimental results on four Intel Xeon E7-4850 processors (80 H/W threads and four NUMA clusters)

Motivation: Concurrent queues are among the most fundamental concurrent data structures. Most previous research focuses on how to avoid the contended hot spots, Head and Tail, and takes one of two contradictory approaches: (1) lock-free techniques [1], [2], which increase the degree of parallelism to improve performance, and (2) combining techniques [3], in which a single combiner thread performs a batch of pending requests from other threads to reduce synchronization cost under a high degree of parallelism. MS queue [1], a representative lock-free queue, uses compare-and-swap (CAS) to update Head and Tail. When a CAS fails, the operation retries until it succeeds. The recently proposed LCRQ [2] uses the x86-specific fetch-and-add (F&A) to avoid costly CAS retry loops. To limit the contention level, lock-free approaches adopt schemes such as backoff in MS queue and voluntary yielding in LCRQ with hierarchy-aware optimization (LCRQ+H). However, these schemes require parameter tuning for each specific benchmark and machine configuration. Without such manual tuning, lock-free queues can suffer suboptimal performance or contention meltdown [2]. Among the combining-based queues, H-Queue [3] is the most efficient. It is based on the two-lock queue presented in [1], with the locks replaced by H-Synch combining constructions [3]. Although combining techniques can reduce synchronization cost, the single-threaded execution of the combiner can become a performance bottleneck.

Our Goal: We propose a scalable, out-of-the-box concurrent queue that requires no manual parameter tuning. In our LECD queue, which stands for Lock-free Enqueue and Combining Dequeue, enqueue operations are performed in a lock-free manner and dequeue operations are performed in a combining manner. Since the LECD queue is based on a SWAP primitive, which, unlike CAS, always succeeds, no parameter tuning to limit the contention level is required. SWAP is also supported in most architectures, including x86, ARM, and SPARC. The concurrent enqueue operations keep the combiner from becoming a performance bottleneck

978-1-4799-1021-2/13/$31.00 ©2013 IEEE


and increase the degree of combining; a higher degree of combining makes the combining operation more efficient.

Our Solution, LECD Queue: Figure 1 illustrates the overall flow. In the enqueue operation, we update Tail to point to the new node by using SWAP (E1) and then set the old Tail's next to the new node (E2). In the dequeue operation, we first enqueue a request into a request list, which is maintained per NUMA cluster, by using SWAP (D1, D2) and check whether the current thread takes the combiner role (D3). If it does not, it waits until the combiner processes its request (D4a). Otherwise (D4b), the combiner thread processes the pending requests for the cluster (C2) and checks whether the queue has become empty (C3). As in MS queue, we use a dummy node pointed to by Head to check whether the queue is empty. Whereas MS queue recycles the last dequeued node as the new dummy node, our dummy node is allocated at queue initialization and used permanently. Since this gives a dequeued element and its queue node the same life cycle, more flexible memory management, such as embedding a queue node in an element, becomes possible. To this end, when the queue is not empty, we update the next of Head instead of Head itself (C4a). When the queue becomes empty, we update Tail to Head and the next of Head to null (C4b). Because concurrent enqueuers may update Tail at the same time, Tail is updated by using CAS; no retry loop is needed. The execution of the per-cluster combiner threads is serialized by the global combiner lock (C1, C6).

Evaluation: Figure 2 shows the experimental results of the queue algorithms for the 50% enqueues benchmark, in which each thread randomly performs an enqueue or a dequeue operation. Among the lock-free queues, the parameter-tuned versions, such as MS queue with backoff and LCRQ+H, outperform the untuned versions. The throughput of LECD queue and H-Queue increases with the thread count, owing to the higher degree of combining.
The average degree of combining of LECD queue is far greater than that of H-Queue owing to the concurrent enqueue operations. As a result, LECD queue outperforms MS queue and H-Queue at all concurrency levels and starts to outperform LCRQ and LCRQ+H from 32 and 64 threads, respectively.

REFERENCES
[1] M. M. Michael and M. L. Scott, “Simple, fast, and practical non-blocking and blocking concurrent queue algorithms,” in Proc. of PODC, 1996.
[2] A. Morrison and Y. Afek, “Fast concurrent queues for x86 processors,” in Proc. of PPoPP, 2013.
[3] P. Fatourou and N. D. Kallimanis, “Revisiting the combining synchronization technique,” in Proc. of PPoPP, 2012.

Can Lock-free and Combining Techniques Co-exist? A Novel Approach on Concurrent Queue

Changwoo Min, Advisor: Young Ik Eom (Sungkyunkwan University, Korea)

Motivation and Background

Research on concurrent queues is stuck in a dilemma with regard to scalability.

Lock-Free Approaches
Key Intuition:
•  Higher degree of parallelism → Higher scalability
•  Use hardware atomic instructions, such as CAS, to maximize the degree of parallelism.
Major Work:
•  MS Queue [PODC’96]
•  LCRQ [PPoPP’13]
Pros & Cons:
•  Pros: Taking advantage of parallel execution
•  Cons: Contention management schemes require manual parameter tuning to avoid contended hot spots → Not feasible in libraries and runtimes (JSR 166, Boost C++ Lib)

Combining Approaches
Key Intuition:
•  Higher degree of parallelism → Higher synchronization cost → Lower scalability → Batch processing by a thread
•  One designated thread, or a combiner, performs batch processing of all other concurrent threads.
Major Work:
•  FC Queue [SPAA’10]
•  H Queue [PPoPP’12]
Pros & Cons:
•  Pros: Low synchronization cost of single threaded execution
•  Cons: Losing opportunities to exploit the advantages of parallel execution
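As a rough illustration of the trade-off in the lock-free column above, the following C++ sketch contrasts a CAS retry loop that backs off on failure with an unconditional atomic exchange (SWAP). It is not the MS queue or LCRQ code. The names `advance_tail_cas`, `advance_tail_swap`, and `demo_tail_updates` are hypothetical, and `kBackoffMicros` is an arbitrary placeholder standing in for the kind of parameter that would need tuning per benchmark and machine.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

struct Node {
    std::atomic<Node*> next{nullptr};
};

// Hypothetical tuning knob: a good value depends on the benchmark and machine.
constexpr int kBackoffMicros = 1;

// CAS-based update: may fail under contention, so it loops and backs off.
// Returns the previous tail; `retries` counts failed attempts.
Node* advance_tail_cas(std::atomic<Node*>& tail, Node* node, int& retries) {
    Node* expected = tail.load();
    while (!tail.compare_exchange_weak(expected, node)) {
        // On failure, `expected` now holds the current tail value.
        ++retries;
        std::this_thread::sleep_for(std::chrono::microseconds(kBackoffMicros));
    }
    return expected;
}

// SWAP-based update: a single atomic exchange that cannot fail,
// so there is no retry loop and nothing to tune.
Node* advance_tail_swap(std::atomic<Node*>& tail, Node* node) {
    return tail.exchange(node);
}

// Single-threaded check that both variants return the previous tail.
bool demo_tail_updates() {
    Node a, b, c;
    std::atomic<Node*> tail{&a};
    int retries = 0;
    Node* prev1 = advance_tail_cas(tail, &b, retries);
    Node* prev2 = advance_tail_swap(tail, &c);
    return prev1 == &a && prev2 == &b && tail.load() == &c;
}
```

Both functions leave the caller holding the previous tail, but only the CAS variant can spend time retrying, which is where the backoff parameter enters.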

LECD Queue: Lock-Free Enqueue and Combining Dequeue

Goal: Practical and Scalable Concurrent Queue with No Parameter Tuning
Approach: Use Both Lock-Free and Combining Techniques Synergistically

Enqueue Operation

[Figure: enqueue diagram. Labels: Head, Tail, next, Permanent Dummy, New Node; steps E1, E2, E3.]

E1. new->next = null;
E2. old_tail = SWAP(&Tail, new);
E3. old_tail->next = new;

•  Unlike CAS, SWAP always succeeds.
•  No retry → No exponential backoff scheme → No parameter tuning is needed.
•  Except for the SWAP, concurrent threads are independent. → Enjoy the advantages of parallel execution.

Dequeue Operation

1. Adding a New Dummy Request

[Figure: per-cluster request lists. NUMA Cluster 0 and NUMA Cluster 1, each with CORE 0, CORE 1, and Local DRAM, are connected by a Shared Bus - Interconnect and share a Global Combiner Lock. Each Per-Cluster Request List (Request List for Cluster 0, Request List for Cluster 1) ends at ReqTail; each request has result, rnext, and status fields, with status WAIT or COMBINE. Annotations: 0. Initial dummy is in COMBINE. 1. Add a new in WAIT to the last. 2. Spin on the previous. Steps A1, A2, A3.]

A1. rq->next = null;
A2. prv_rq = SWAP(&ReqTail, rq);
A3. prv_rq->next = rq;

•  Append a new dummy request to a request list.
•  Use the same technique used in the enqueue operation.
•  Depending on the status of the previous request,
   •  COMBINE: taking the role of a combiner.
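The two SWAP-based appends above (E1 to E3 for enqueue, A1 to A3 for adding a dequeue request) can be sketched in C++ roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the type and function names (`Node`, `Queue`, `Request`, `RequestList`, `enqueue`, `add_request`, and the `demo_*` helpers) are hypothetical, the link field of a request is called `rnext` after the diagram, and a single request list stands in for the per-cluster lists.

```cpp
#include <atomic>
#include <cassert>

struct Node {
    int value;
    std::atomic<Node*> next{nullptr};
    explicit Node(int v = 0) : value(v) {}
};

struct Queue {
    Node head;                        // permanent dummy node; head.next is the first element
    std::atomic<Node*> tail{&head};   // Tail starts at the dummy
};

void enqueue(Queue& q, Node* n) {
    n->next.store(nullptr);                  // E1
    Node* old_tail = q.tail.exchange(n);     // E2: SWAP always succeeds
    old_tail->next.store(n);                 // E3
}

enum class Status { WAIT, COMBINE, DONE };

struct Request {
    std::atomic<Status> status{Status::WAIT};
    std::atomic<Request*> rnext{nullptr};
    Node* result{nullptr};
};

struct RequestList {
    Request initial;                              // 0. initial dummy is in COMBINE
    std::atomic<Request*> req_tail{&initial};
    RequestList() { initial.status.store(Status::COMBINE); }
};

// Appends rq in WAIT and returns the previous request,
// whose status the caller then inspects (WAIT or COMBINE).
Request* add_request(RequestList& list, Request* rq) {
    rq->status.store(Status::WAIT);
    rq->rnext.store(nullptr);                        // A1
    Request* prv_rq = list.req_tail.exchange(rq);    // A2
    prv_rq->rnext.store(rq);                         // A3
    return prv_rq;
}

// Enqueues 1, 2, 3 and sums the values reachable from the dummy node.
int demo_enqueue_sum() {
    Queue q;
    Node a(1), b(2), c(3);
    enqueue(q, &a);
    enqueue(q, &b);
    enqueue(q, &c);
    int sum = 0;
    for (Node* n = q.head.next.load(); n != nullptr; n = n->next.load()) sum += n->value;
    return sum;
}

// The first requester sees the initial dummy (COMBINE); the second sees a WAIT request.
bool demo_request_roles() {
    RequestList list;
    Request r1, r2;
    Request* p1 = add_request(list, &r1);
    Request* p2 = add_request(list, &r2);
    return p1 == &list.initial && p1->status.load() == Status::COMBINE &&
           p2 == &r1 && p2->status.load() == Status::WAIT;
}
```

In both functions the only point of contention is the single `exchange` call; the link store that follows touches a node no other appender writes to.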

2. Combining Dequeue Operations

[Figure: combining dequeue diagram with panels D2a and D2b and a panel titled Memory Management. Labels: Head, Tail, next, Permanent Dummy, Request List, ReqTail; request fields result, rnext, status; status transitions WAIT → DONE, WAIT → COMBINE, COMBINE → DONE; steps D1 and D3.]

D1. A combiner thread processes pending requests and changes the status to DONE.
D2a. After processing the requests, if the queue does not become empty, update Head->next to the next of the last dequeued node.
D2b. If the queue becomes empty,
    1. Head->next = null;
    2. CAS(&Tail, the last dequeued node, Head)
    Since another concurrent enqueuer can update Tail in between 1 and 2, Tail needs to be updated by using CAS. → No CAS retry is needed in this case.
D3. Change the status of the last dummy request to COMBINE for a subsequent dequeuer to take the role of the combiner.
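Steps D1 to D3 can be sketched in C++ as below. This is a hedged sketch, not the authors' code, and it rests on several assumptions that the poster text does not spell out: a dequeuer treats the previous request (the one returned by the SWAP) as its own slot; a single request list is used, so the per-cluster lists and the global combiner lock are omitted; results are published as DONE only after Head and Tail have been updated, which is a conservative ordering chosen for this sketch; when the CAS in D2b fails, the combiner waits for the in-flight enqueuer to link its node and then points Head->next at it; and request and node reclamation is ignored. All names (`combine`, `dequeue`, `demo_lecd`, and the types) are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Node {
    int value;
    std::atomic<Node*> next{nullptr};
    explicit Node(int v = 0) : value(v) {}
};

enum class Status { WAIT, COMBINE, DONE };

struct Request {
    std::atomic<Status> status{Status::WAIT};
    std::atomic<Request*> rnext{nullptr};
    Node* result{nullptr};
};

struct Queue {
    Node head;                                   // permanent dummy node
    std::atomic<Node*> tail{&head};
    Request initial;                             // initial dummy request, in COMBINE
    std::atomic<Request*> req_tail{&initial};
    Queue() { initial.status.store(Status::COMBINE); }
};

void enqueue(Queue& q, Node* n) {
    n->next.store(nullptr);
    q.tail.exchange(n)->next.store(n);
}

void combine(Queue& q, Request* first) {
    Node* cur = q.head.next.load();
    Node* last = nullptr;                        // last dequeued node, if any
    Request* end = first;
    // D1: hand one node (or nullptr if the queue looks empty) to each pending request.
    for (Request* nxt; (nxt = end->rnext.load()) != nullptr; end = nxt) {
        end->result = cur;
        if (cur != nullptr) { last = cur; cur = cur->next.load(); }
    }
    if (last != nullptr && cur != nullptr) {
        q.head.next.store(cur);                  // D2a: queue is not empty
    } else if (last != nullptr) {
        q.head.next.store(nullptr);              // D2b.1
        Node* expected = last;
        if (!q.tail.compare_exchange_strong(expected, &q.head)) {  // D2b.2, no retry
            // ASSUMPTION (not in the poster text): an enqueuer already swapped Tail,
            // so wait for it to link its node after `last` and keep that node.
            while ((cur = last->next.load()) == nullptr) std::this_thread::yield();
            q.head.next.store(cur);
        }
    }
    // D1 (publish): mark processed requests DONE; read rnext before releasing each one.
    for (Request* r = first; r != end;) {
        Request* nxt = r->rnext.load();
        r->status.store(Status::DONE);
        r = nxt;
    }
    end->status.store(Status::COMBINE);          // D3: hand over the combiner role
}

// `fresh` becomes the new dummy request; the caller's own slot is the previous one.
Node* dequeue(Queue& q, Request* fresh) {
    fresh->status.store(Status::WAIT);
    fresh->rnext.store(nullptr);
    Request* mine = q.req_tail.exchange(fresh);
    mine->rnext.store(fresh);
    Status s;
    while ((s = mine->status.load()) == Status::WAIT) std::this_thread::yield();
    if (s == Status::COMBINE) combine(q, mine);
    return mine->result;                         // nullptr means the queue was empty
}

// Single-threaded walk through D2a, D2b, an empty dequeue, and reuse after reset.
bool demo_lecd() {
    Queue q;
    Node a(10), b(20), c(30);
    Request r1, r2, r3, r4;
    enqueue(q, &a);
    enqueue(q, &b);
    Node* x = dequeue(q, &r1);                   // D2a path
    Node* y = dequeue(q, &r2);                   // D2b path: queue becomes empty
    Node* z = dequeue(q, &r3);                   // empty queue
    bool reset = q.tail.load() == &q.head;
    enqueue(q, &c);
    Node* w = dequeue(q, &r4);
    return x == &a && y == &b && z == nullptr && reset && w == &c;
}
```

Because Head always stays on the permanent dummy, a dequeued node is fully detached once Head->next moves past it, which is what lets an element embed its own queue node.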

Evaluation

Encouraging Results: LECD queue outperforms MS Queue with backoff by 14.3x, H Queue by 1.6x, and LCRQ+H by 1.6%.
Synergistic Interactions: Parallel Enqueue → Higher Combining Degree → More Efficient Combining Dequeue Operation

[Figure: four plots.
Average throughput of the benchmarks on 80 threads: bar chart in Million operations/second with bars for MS Queue (no backoff), MS Queue (backoff), H-Queue, LCRQ, LCRQ+H, and LECD Queue; ratio annotations 1x, 1.8x, 11.1x, 16.1x, 25.9x, and 26.3x.
enqueue-dequeue pairs: Million operations/second versus Threads (0 to 80) for LECD, LCRQ+H, H-Queue, LCRQ, MS-Q (backoff), and MS-Q (no backoff).
50% enqueues: Million operations/second versus Threads (0 to 80) for the same six queues.
Average Combining of 50% enqueues: Average Combining (0 to 250) versus Threads (0 to 80) for LECD and H-Queue.]
