Composing Transaction Scalability on Multicore Platforms
Anastasia Ailamaki, Pınar Tözün, Danica Porobic, and Erietta Liarou (EPFL); Islam Atta and Andreas Moshovos (U Toronto)
whither parallelism?
• 2005: a single core exploiting pipelining, ILP, and multithreading
• 2020: multisocket multicores (CMP) and heterogeneous CMPs with many cores per chip
[Figure: one core in 2005 versus a chip full of cores in 2020]
“performance” means scalability
increasing HW contexts
[Figure: TPC-C Payment on Shore-MT, Sun Niagara T2. Throughput per HW context falls away from linear scalability as HW contexts grow from 1 to 64, while each transaction executes roughly 60-70 critical sections, broken down into locking, latching, catalog, buffer pool, logging, transaction manager, and other.]
enemy #1: concurrency control
[Figure: left, throughput of a shared-nothing versus a shared-everything deployment as the percentage of multisite (distributed) transactions in the workload increases; right, throughput under the best versus the worst thread-to-core assignment.]
enemy #2: communication latency
cache efficiency
[Figure: at peak throughput on Shore-MT, Intel Xeon X5660. Left: instructions per cycle for TPC-C and TPC-E, far below the Intel maximum of 4. Right: execution-cycle breakdown for TPC-C and TPC-E into busy cycles, instruction stalls, and other stalls.]
enemy #3: instruction misses
composing OLTP scalability
• alleviate concurrency control
• observe variability in communication latency
• maximize instruction locality
shared-everything OLTP
[Figure: worker threads reach the logical index and the physical heap through unpredictable data accesses; critical sections per transaction for the conventional system (roughly 70), broken down into locking, latching, catalog, buffer pool, logging, transaction manager, and other.]
contention due to unpredictable data accesses
physiological partitioning (PLP)
[Figure: each worker thread owns one key range (R1: A-M, R2: N-Z) together with the corresponding logical index partition and physical heap partition; critical sections per transaction, Conventional versus PLP, broken down into locking, latching, catalog, buffer pool, logging, transaction manager, and other.]
PLP eliminates 70% of the critical sections [VLDB2010x2, VLDB2011]
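To make the idea concrete, here is a minimal, hypothetical sketch of a PLP-style range router; the names `RangeRouter` and `Worker` are illustrative, not Shore-MT's actual classes. Each worker thread exclusively owns one key range plus the index and heap data storing it, so requests are routed to the owning worker's queue and run there without latching.

```cpp
// Illustrative sketch of physiological partitioning (PLP): each worker thread
// exclusively owns a key range plus the index/heap data for it, so intra-range
// accesses need no latching. All class names are hypothetical.
#include <map>
#include <string>
#include <vector>

struct Request {
    std::string key;     // e.g. a customer last name
    int         action;  // opcode for the operation to run on the owning worker
};

class Worker {
public:
    void enqueue(const Request& r) { queue_.push_back(r); }  // single consumer: no latch needed
private:
    std::vector<Request> queue_;  // drained only by this worker's own thread
};

class RangeRouter {
public:
    // Register a worker as the owner of all keys from lowKey up to the next boundary.
    void addRange(const std::string& lowKey, Worker* w) { owners_[lowKey] = w; }

    // Route a request to the single worker that owns its key range.
    void dispatch(const Request& r) {
        auto it = owners_.upper_bound(r.key);   // first boundary strictly greater than the key
        if (it == owners_.begin()) return;      // key below the lowest range: ignored in this sketch
        --it;                                   // step back to the owning range
        it->second->enqueue(r);
    }
private:
    std::map<std::string, Worker*> owners_;     // range boundary -> owning worker
};

int main() {
    Worker w1, w2;
    RangeRouter router;
    router.addRange("A", &w1);       // R1: A - M
    router.addRange("N", &w2);       // R2: N - Z
    router.dispatch({"Smith", 0});   // lands on w2, which owns N - Z
}
```

Because only the owning thread ever touches a range's index and heap pages, the latching and much of the locking in the conventional breakdown above disappears.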
PLP on multicores
[Figure: TATP GetSubData throughput (Ktps), PLP versus Conventional, as HW contexts increase. Left: Sun Niagara T2, 64 in-order HW contexts at 1.4GHz. Right: 4-socket quad-core AMD, 16 out-of-order HW contexts at 2.8GHz.]
higher benefit as the number of HW contexts goes up
ugly duckling -> swan
[Figure: throughput per thread (tps/thread, log scale) on Sun Niagara T1 for an insert-only workload with up to 32 concurrent threads: Shore-MT (using DORA + PLP + Aether) versus a commercial DBMS versus the original Shore.]
composing OLTP scalability
• alleviate concurrency control
• observe variability in communication latency
• maximize instruction locality
communication delays
[Figure: two sockets, each with four cores, private L1 and L2 caches, a shared L3, a memory controller, and inter-socket links. Communication costs roughly <10 cycles within a core's private caches, ~50 cycles through the shared L3, and ~500 cycles across sockets.]
latency varies wildly
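One way to observe these tiers directly is a core-to-core ping-pong microbenchmark; the sketch below is illustrative, not code from the talk. Two threads bounce an atomic flag back and forth, and the round-trip time depends on whether the pair of cores shares an L2 or L3 or sits on different sockets (pin the two threads to the cores under test externally, e.g. with taskset).

```cpp
// Ping-pong microbenchmark sketch: measures the average round trip of a cache
// line bouncing between two threads. Run it with the threads pinned to
// different core pairs to see the latency tiers in the figure above.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> flag{0};
    const int rounds = 1'000'000;

    std::thread peer([&] {
        for (int i = 0; i < rounds; ++i) {
            while (flag.load(std::memory_order_acquire) != 1) {}  // wait for ping
            flag.store(0, std::memory_order_release);             // send pong
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) {
        flag.store(1, std::memory_order_release);                 // send ping
        while (flag.load(std::memory_order_acquire) != 0) {}      // wait for pong
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("average round trip: %.1f ns\n", double(ns) / rounds);
    peer.join();
}
```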
Placement of application threads
[Figure: throughput under an unpredictable OS-chosen placement versus explicit Spread and Island placements. Left: counter microbenchmark (Mtps) on 8 sockets x 10 cores. Right: TPC-C Payment (Ktps) on 4 sockets x 6 cores. Placement changes throughput by roughly 39-47%.]
thread-to-core assignment matters [VLDB2012]
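The placement experiments control where each worker runs instead of leaving it to the OS scheduler. Below is a minimal Linux sketch; the helper names are hypothetical, and it assumes the OS numbers cores socket by socket, whereas a real deployment would query the machine's topology.

```cpp
// Thread placement sketch: pin a thread to one hardware context, and compute
// core lists for the two policies compared in the figure above.
// Linux-specific; compile with g++, which defines _GNU_SOURCE by default.
#include <pthread.h>
#include <sched.h>
#include <vector>

// Pin the calling thread to a single hardware context.
static int pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Island: fill one socket's cores before moving to the next.
std::vector<int> island_placement(int n) {
    std::vector<int> cores;
    for (int i = 0; i < n; ++i) cores.push_back(i);
    return cores;
}

// Spread: round-robin the threads across sockets.
std::vector<int> spread_placement(int n, int sockets, int cores_per_socket) {
    std::vector<int> cores;
    for (int i = 0; i < n; ++i)
        cores.push_back((i % sockets) * cores_per_socket + i / sockets);
    return cores;
}
```

Island placement keeps communicating threads within one socket's caches; Spread scatters them across sockets and pays the inter-socket latency shown earlier.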
impact of sharing data among threads
[Figure: left, counter microbenchmark (Mtps, log scale) on 8 sockets x 10 cores: a counter per core is roughly 516.8x faster than a single shared counter, and a counter per socket roughly 18.7x faster. Right: TPC-C Payment with local-only transactions (Ktps) on 4 sockets x 6 cores: shared-nothing outperforms shared-everything by 4.5x.]
fewer sharers -> better performance
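A sketch of the counter-per-core variant of the microbenchmark, under the usual assumption of 64-byte cache lines: each counter sits on its own line, so increments stay core-local and the coherence traffic that cripples the single shared counter disappears; the total is aggregated only when read.

```cpp
// One counter per hardware context, each padded to its own cache line so
// threads never invalidate each other's lines.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value{0};
};

int main() {
    const unsigned n = std::thread::hardware_concurrency();
    std::vector<PaddedCounter> counters(n);

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([&, i] {
            for (int k = 0; k < 1'000'000; ++k)
                counters[i].value.fetch_add(1, std::memory_order_relaxed);  // no cross-core traffic
        });
    for (auto& t : workers) t.join();

    uint64_t total = 0;
    for (auto& c : counters) total += c.value.load();  // aggregate lazily, only on read
    std::printf("total = %llu\n", (unsigned long long)total);
}
```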
impact of skewed input
[Figure: throughput (Ktps) versus skew factor (0 to 1) for 1 Island, 4 Islands, and 24 Islands, for a local-only and a 50% multisite workload. With many small instances a few instances become highly loaded and hot data causes contention; larger instances can balance the load, but a few hot instances remain.]
4 Islands effectively balance skew and contention
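A minimal, hypothetical sketch of island routing (not the actual system): partition keys hash to one of `num_islands` database instances, one per group of cores, and a transaction that touches keys on more than one island becomes a multisite (distributed) transaction.

```cpp
// Hypothetical island router: the database runs as num_islands independent
// instances, one per hardware island. A transaction is single-site only if
// every partition key it touches maps to the same island.
#include <cstdint>
#include <functional>
#include <set>
#include <vector>

struct IslandRouter {
    int num_islands;

    int island_for(uint32_t key) const {
        return static_cast<int>(std::hash<uint32_t>{}(key) % num_islands);
    }

    bool is_multisite(const std::vector<uint32_t>& keys) const {
        std::set<int> islands;
        for (uint32_t k : keys) islands.insert(island_for(k));
        return islands.size() > 1;   // touching several islands forces a distributed commit
    }
};
```

With a single island everything is local but contention on hot data is highest; with 24 small islands a skewed key overloads its instance; 4 islands is the middle ground the slide points to.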
composing OLTP scalability
• alleviate concurrency control
• observe variability in communication latency
• maximize instruction locality
stall time breakdown
[Figure: trace simulation modeling an Intel Xeon E5-2660; stall cycles per k-instructions on Shore-MT, split into data misses (L1D, L2D, L3D) and instruction misses (L1I, L2I, L3I), for TPC-C at database sizes from 0.1GB to 100GB and TPC-E at 20GB and 100GB.]
L1-I misses are a significant factor in stall time
concurrent transactions
[Figure: a transaction is composed of database operations (index probe, index scan, update record, insert record, conditional delete record). Threads T1 and T2 run concurrent instances of the same transaction on different keys: T1 executes Index Probe (X1), Update Record (X1), Index Probe (Y1), Insert Record (Y1), Delete Record (Z1), while T2 executes Index Probe (X2), Update Record (X2), Index Probe (Y2), Insert Record (Y2), Delete Record (Z2).]
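The slide's point is that concurrent transactions execute the same operation code on different records. Below is a hedged illustration of one way to exploit that similarity, not necessarily the mechanism the talk goes on to describe: group pending operations by type and run each group back to back, so the instructions for one operation stay in the L1-I cache across many transactions.

```cpp
// Illustrative sketch: batch same-type database operations from different
// transactions so each operation's code path is reused while it is still hot
// in the L1 instruction cache.
#include <functional>
#include <map>
#include <vector>

enum class OpType { IndexProbe, UpdateRecord, InsertRecord, DeleteRecord };

struct Op {
    OpType                type;
    std::function<void()> body;   // the work for one transaction's operation
};

void run_batched(std::vector<Op>& pending) {
    std::map<OpType, std::vector<const Op*>> by_type;
    for (const Op& op : pending)
        by_type[op.type].push_back(&op);   // group operations by type

    for (auto& group : by_type)            // run one operation's code path
        for (const Op* op : group.second)  // for many transactions in a row
            op->body();
}
```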