Composing Transaction Scalability on Multicore Platforms

Anastasia Ailamaki
Pınar Tözün, Danica Porobic, and Erietta Liarou (EPFL)
Islam Atta and Andreas Moshovos (U Toronto)

whither parallelism?

2005: performance came from inside a single core: pipelining, ILP, multithreading.
2020: multisocket multicores (CMP) and heterogeneous CMPs, with dozens of cores per machine.

[Figure: a single 2005 core vs. a 2020 grid of many cores]

"performance" means scalability

increasing HW contexts

TPC-C (64) Payment on Shore-MT, Sun Niagara T2.

[Figure: left: throughput per HW context from 1 to 64 contexts, falling well short of linear scalability. Right: critical sections per transaction (~70 in total), broken into locking, latching, catalog, buffer pool, logging, xct manager, and other]

enemy #1: concurrency control

[Figure: two sketches. Shared-nothing: throughput falls as the % of multisite transactions in the workload grows, due to distributed transactions. Shared-everything: throughput varies between the best and worst thread-to-core assignment]

enemy #2: communication latency

cache efficiency

At peak throughput, on Shore-MT, Intel Xeon X5660.

[Figure: execution-cycles breakdown (instruction stalls, other stalls, busy, as percentages) and instructions per cycle (out of the Intel core's maximum of 4) for TPC-C and TPC-E]

enemy #3: instruction misses

composing OLTP scalability

• alleviate concurrency control
• observe variability in comm latency
• maximize instruction locality

shared-everything OLTP

[Figure: levels of partitioning (logical, physical, index) over the heap; critical sections per transaction, broken into locking, latching, catalog, buffer pool, xct manager, logging, and other, for conventional execution vs. PLP]

contention due to unpredictable data accesses

physiological partitioning (PLP)

Each worker thread owns a key range (R1: A-M, R2: N-Z) logically, and the index and heap pages are physically partitioned along the same ranges, so a thread never touches another thread's data.

[Figure: critical sections per transaction for conventional execution vs. PLP, broken into locking, latching, catalog, buffer pool, xct manager, logging, and other; the locking and latching components largely disappear under PLP]

PLP eliminates 70% of the critical sections [VLDB 2010 (x2), VLDB 2011]
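The PLP idea above, in which each worker thread exclusively owns a contiguous key range both logically and physically, can be sketched as a single-writer router. This is an illustrative toy, not Shore-MT's implementation; `Partition`, `PLPTable`, and the message format are invented for the sketch:

```python
import bisect
from queue import Queue
from threading import Thread

class Partition:
    """A worker thread that exclusively owns one key range.

    Because it is the only thread touching its heap data, it needs
    no locks or latches on that data (the PLP idea)."""
    def __init__(self):
        self.heap = {}          # private heap: key -> record
        self.inbox = Queue()    # requests routed to this partition
        Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            op, key, val, reply = self.inbox.get()
            if op == "put":
                self.heap[key] = val      # single writer: no latch
                reply.put(True)
            else:                          # "get"
                reply.put(self.heap.get(key))

class PLPTable:
    """Routes each request to the partition that owns its key range."""
    def __init__(self, boundaries):
        # boundaries = ["N"] gives two ranges: keys < "N" and keys >= "N"
        self.boundaries = boundaries
        self.partitions = [Partition() for _ in range(len(boundaries) + 1)]

    def _owner(self, key):
        return self.partitions[bisect.bisect_right(self.boundaries, key)]

    def _call(self, op, key, val=None):
        reply = Queue()
        self._owner(key).inbox.put((op, key, val, reply))
        return reply.get()

    def put(self, key, val):
        return self._call("put", key, val)

    def get(self, key):
        return self._call("get", key)

table = PLPTable(["N"])          # R1: A-M, R2: N-Z, as on the slide
table.put("Alice", 1)
table.put("Zoe", 2)
print(table.get("Alice"), table.get("Zoe"))   # -> 1 2
```

Because only one thread ever touches a partition's heap, the critical sections that shared-everything execution spends on locking and latching simply disappear; coordination moves into routing requests by key.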

PLP on multicores

TATP GetSubData on Shore-MT.

[Figure: throughput (Ktps) vs. # HW contexts, conventional vs. PLP. Left: Sun Niagara T2, 64 in-order HW contexts at 1.4 GHz, up to ~400 Ktps. Right: 4-socket quad-core AMD, 16 OoO HW contexts at 2.8 GHz, up to ~700 Ktps]

higher benefit as # of HW contexts goes up

ugly duckling -> swan

Sun Niagara T1, insert-only workload.

[Figure: throughput (tps/thread, log scale from 0.1 to 10) vs. concurrent threads (0 to 32) for shore, shore-mt (using DORA + PLP + Aether), and a commercial DB]

composing OLTP scalability

• alleviate concurrency control
• observe variability in comm latency
• maximize instruction locality

communication delays

A core reaches its own L1 in under 10 cycles, the shared on-chip cache in about 50 cycles, and data on the other socket, across the inter-socket links, in about 500 cycles.

[Figure: two sockets of four cores each, with private L1 and L2 caches, a shared L3, a memory controller, and inter-socket links per socket]

latency varies wildly

placement of application threads

Counter microbenchmark on 8 sockets × 10 cores; TPC-C Payment on 4 sockets × 6 cores.

[Figure: throughput under three thread placements: unpredictable (left to the OS), spread across sockets, and an island confined to one socket; placement changes throughput by 39-47%]

thread-to-core assignment matters [VLDB 2012]
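One way to realize the "island" placement on Linux is to confine a process to the cores of a single socket via `os.sched_setaffinity`. The sketch below assumes consecutive per-socket core numbering, which is common but not guaranteed; check the topology under /sys/devices/system/cpu before relying on it:

```python
import os

def island_cores(socket_id, cores_per_socket):
    """Core IDs of one socket's island, assuming cores are numbered
    consecutively per socket (verify against each cpu's
    topology/physical_package_id on a real machine)."""
    start = socket_id * cores_per_socket
    return set(range(start, start + cores_per_socket))

def pin_to_island(socket_id, cores_per_socket):
    """Island placement: confine this process's threads to one socket,
    so they communicate through the shared on-chip cache instead of
    crossing inter-socket links. Linux-only (os.sched_setaffinity)."""
    os.sched_setaffinity(0, island_cores(socket_id, cores_per_socket))

def pin_spread(n_sockets, cores_per_socket):
    """Spread placement: allow every core on every socket."""
    os.sched_setaffinity(0, set(range(n_sockets * cores_per_socket)))

# Island 1 on the 4-socket x 6-core machine from the slide:
print(sorted(island_cores(1, 6)))   # -> [6, 7, 8, 9, 10, 11]
```

The "unpredictable" bar on the slide corresponds to doing neither: the OS is free to migrate threads anywhere, so some runs land cross-socket and some do not.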

impact of sharing data among threads

[Figure: left: counter microbenchmark on 8 sockets × 10 cores, throughput (Mtps, log scale) for a counter per core vs. a counter per socket vs. a single shared counter, with gaps of 18.7x and 516.8x. Right: TPC-C Payment, local-only, on 4 sockets × 6 cores, throughput in Ktps; shared-nothing outperforms shared-everything by 4.5x]

fewer sharers -> better performance
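The "fewer sharers" effect comes from giving each core its own counter shard so that writes never contend on one hot location. A minimal Python sketch of that structure follows (one shard per thread stands in for one shard per core); note that CPython's GIL hides the cache-line contention being measured on the slide, so this shows the shape of the fix, not the speedup:

```python
import threading

class ShardedCounter:
    """One counter shard per thread (analogous to one per core):
    increments touch only that shard, so writers never contend on a
    single shared counter."""
    def __init__(self, n_shards):
        self.shards = [0] * n_shards
        self.locks = [threading.Lock() for _ in range(n_shards)]

    def add(self, shard_id, n=1):
        with self.locks[shard_id]:   # uncontended when each thread owns a shard
            self.shards[shard_id] += n

    def value(self):
        # Reads pay the aggregation cost instead of the writes.
        return sum(self.shards)

counter = ShardedCounter(n_shards=4)
threads = [
    threading.Thread(target=lambda i=i: [counter.add(i) for _ in range(10_000)])
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value())   # -> 40000
```

The slide's three bars are exactly this knob: per-core shards (no sharers), per-socket shards (sharers stay within one cache hierarchy), and a single counter (every core invalidates every other core's copy).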

impact of skewed input

[Figure: throughput (Ktps) vs. skew factor (0 to 1) with 24, 4, or 1 islands, for a local-only workload and for 50% multisite transactions. With 24 islands a few instances are highly loaded; larger instances can balance load, but with 1 island there is contention for hot data]

4 Islands effectively balance skew and contention
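The tradeoff in the figure can be reproduced with a toy model: Zipf-distributed requests over 24 partitions, grouped into equal-sized islands. This is an illustrative calculation, not the paper's experiment, and the skew exponent 0.75 is an arbitrary choice:

```python
def zipf_shares(n, s=0.75):
    """Fraction of requests hitting each of n partitions under Zipf(s)."""
    w = [r ** -s for r in range(1, n + 1)]
    total = sum(w)
    return [x / total for x in w]

def max_island_load(n_parts, n_islands, s=0.75):
    """Hottest island's share of requests when partitions are grouped
    into equal consecutive islands (partition 1 is the hottest)."""
    shares = zipf_shares(n_parts, s)
    size = n_parts // n_islands
    loads = [sum(shares[i:i + size]) for i in range(0, n_parts, size)]
    return max(loads)

def imbalance(n_parts, n_islands, s=0.75):
    """Hottest island's load relative to a perfect 1/n_islands share."""
    return max_island_load(n_parts, n_islands, s) * n_islands

for k in (24, 4, 1):
    print(k, "islands -> imbalance", round(imbalance(24, k), 2))
```

With one island the load is perfectly balanced, at the cost of every thread sharing the same data; with 24 islands the hottest instance carries far more than its fair share. Intermediate sizes, like the 4 islands on the slide, sit between the extremes.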

 

composing OLTP scalability

• alleviate concurrency control
• observe variability in comm latency
• maximize instruction locality

stall time breakdown

Trace simulation (models Intel Xeon E5-2660) of Shore-MT.

[Figure: stall cycles per k-instructions, split into L1I/L2I/L3I and L1D/L2D/L3D components, as database size grows: TPC-C at 0.1, 1, 10, and 100 GB; TPC-E at 20 and 100 GB]

L1-I misses are a significant factor in stall time

concurrent transactions

Transactions are composed from a small set of database operations (index probe, index scan, insert/update/delete record, plus conditional branches). Concurrent transactions of the same type run the same operation sequence on different records: T1 probes and updates X1, probes and inserts Y1, deletes Z1; T2 does the same with X2, Y2, Z2.

execute many common instruction blocks

transactions on a single core

Traditionally T1, T2, and T3 each run start to finish, so each transaction refetches the same instruction blocks into the L1-I and pays the full miss penalty again.

Stratified Transaction Execution (STREX) runs the batch one code phase at a time: in each phase a leader (T1) takes the instruction misses and warms the L1-I, then T2 and T3 execute the same phase out of the warmed cache before the batch advances.

[Figure: traditional vs. STREX schedules of T1-T3 over five phases, with T1 as the per-phase leader]

time-multiplexing to reduce instruction misses
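STREX's phase-major schedule can be contrasted with the traditional transaction-major one using a toy L1-I model: assume each code phase occupies one cache-sized block and the cache holds a single block. The simulation below is illustrative, not the hardware mechanism from the paper:

```python
def simulate_misses(schedule, cache_blocks=1):
    """Count L1-I misses for a schedule of (txn, phase) steps, with a
    tiny LRU instruction cache holding `cache_blocks` phase-sized blocks."""
    cache, misses = [], 0
    for _txn, phase in schedule:
        if phase in cache:
            cache.remove(phase)
            cache.append(phase)          # LRU touch
        else:
            misses += 1
            cache.append(phase)          # fetch the phase's code
            if len(cache) > cache_blocks:
                cache.pop(0)             # evict least recently used
    return misses

T, P = 3, 5   # 3 transactions, 5 code phases each, as on the slide
traditional = [(t, p) for t in range(T) for p in range(P)]   # txn-major
strex       = [(t, p) for p in range(P) for t in range(T)]   # phase-major
print(simulate_misses(traditional), simulate_misses(strex))  # -> 15 5
```

Transaction-major order pays a miss for every (transaction, phase) pair, since each phase's code is evicted before it is reused; phase-major order pays one miss per phase, taken by the leader, and the followers hit.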

transactions on multicores

Traditionally each transaction stays on its core, so every core's L1-I must be filled with the whole instruction footprint, and the total number of cache fills grows with the number of cores.

Self-Assembly of L1-I Caches (SLICC) instead migrates a transaction across cores as it moves through the code: each code segment stays resident in one core's L1-I, and later transactions find their next segment already warm on some core.

[Figure: traditional vs. SLICC schedules of T1-T3 across four cores over time, with far fewer cache fills under SLICC]

exploits aggregate L1-I & instruction overlap
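SLICC's migration rule can be added to the same kind of toy model: instead of reordering the schedule, each step runs on the core that "owns" its code segment, so the segment stays resident in that core's L1-I. The placement rule `phase % cores` is a stand-in for SLICC's dynamic detection of where a segment lives:

```python
def misses(schedule, place, n_cores, blocks_per_core=1):
    """Count L1-I misses given a placement rule mapping (txn, phase)
    to a core. Each core has a tiny private L1-I holding
    `blocks_per_core` phase-sized code blocks (LRU)."""
    caches = [[] for _ in range(n_cores)]
    total = 0
    for txn, phase in schedule:
        cache = caches[place(txn, phase)]
        if phase in cache:
            cache.remove(phase)
            cache.append(phase)          # LRU touch
        else:
            total += 1
            cache.append(phase)
            if len(cache) > blocks_per_core:
                cache.pop(0)
    return total

T, P, C = 3, 4, 4   # 3 transactions, 4 code segments, 4 cores
sched = [(t, p) for t in range(T) for p in range(P)]  # each txn runs in order
pinned  = lambda t, p: t        # traditional: txn stays on its core
migrate = lambda t, p: p % C    # SLICC-style: follow the code segment
print(misses(sched, pinned, C), misses(sched, migrate, C))   # -> 12 4
```

Pinned execution fills every core's cache with every segment (3 transactions × 4 segments = 12 fills); migration loads each segment exactly once into its owning core, so the aggregate L1-I of the four cores holds the whole footprint.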

reduced L1-I misses

Simulation of 4-way OoO cores with 4-way 32KB L1-I & L1-D and 1MB per-core L2; traces from Shore-MT.

[Figure: L1-I misses per k-instructions for conventional, STREX, and SLICC at 2, 4, 8, and 16 cores, on TPC-C and TPC-E]

STREX reduces L1-I misses regardless of core count; SLICC is better on high core counts

hybrid: STREX + SLICC

Simulation of 4-way OoO cores with 32KB L1-I & L1-D and 1MB per-core L2; traces from Shore-MT.

[Figure: speedup over conventional vs. # cores (0 to 16) for STREX, SLICC, and the hybrid, on TPC-C and TPC-E]

up to 80% better [MICRO 2012, ISCA 2013]

summary

• multicore = all parallelism methods
  – scalability is a complex problem
• alleviate concurrency control
  – eliminate critical sections: DORA, PLP, Aether, etc.
• observe variability in comm latency
  – OLTP Islands
• maximize instruction locality
  – chase locality with SLICC + STREX

what's next?

• data partitioning across Islands
• transaction-aware thread migration

THANK YOU!

Locks managed globally. • Fine-grained parallelism. – Each lock has its own .... Best solutions may be indirect. – Sidestep hard problems. – Look to distributed ...