Composing Transaction Scalability on Multicore Platforms

Anastasia Ailamaki
Pınar Tözün, Danica Porobic, and Erietta Liarou (EPFL)
Islam Atta and Andreas Moshovos (U Toronto)

whither parallelism?

2005: performance came from inside a single core: pipelining, ILP, multithreading.
2020: multisocket multicores (CMP) and heterogeneous CMPs, with dozens of cores per machine.

[Figure: a single 2005 core vs. a 2020 grid of many cores]

"performance" means scalability

increasing HW contexts

TPC-C (64) Payment on Shore-MT, Sun Niagara T2.

[Figure: left: throughput per HW context from 1 to 64 contexts, falling well short of linear scalability. Right: critical sections per transaction (~70 in total), broken into locking, latching, catalog, buffer pool, logging, xct manager, and other]

enemy #1: concurrency control

[Figure: two sketches. Shared-nothing: throughput falls as the % of multisite transactions in the workload grows, due to distributed transactions. Shared-everything: throughput varies between the best and worst thread-to-core assignment]

enemy #2: communication latency

cache efficiency

At peak throughput, on Shore-MT, Intel Xeon X5660.

[Figure: execution-cycles breakdown (instruction stalls, other stalls, busy, as percentages) and instructions per cycle (out of the Intel core's maximum of 4) for TPC-C and TPC-E]

enemy #3: instruction misses

composing OLTP scalability

• alleviate concurrency control
• observe variability in comm latency
• maximize instruction locality

shared-everything OLTP

[Figure: levels of partitioning (logical, physical, index) over the heap; critical sections per transaction, broken into locking, latching, catalog, buffer pool, xct manager, logging, and other, for conventional execution vs. PLP]

contention due to unpredictable data accesses

physiological partitioning (PLP)

Each worker thread owns a key range (R1: A-M, R2: N-Z) logically, and the index and heap pages are physically partitioned along the same ranges, so a thread never touches another thread's data.

[Figure: critical sections per transaction for conventional execution vs. PLP, broken into locking, latching, catalog, buffer pool, xct manager, logging, and other; the locking and latching components largely disappear under PLP]

PLP eliminates 70% of the critical sections [VLDB 2010 (x2), VLDB 2011]
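The PLP idea above, in which each worker thread exclusively owns a contiguous key range both logically and physically, can be sketched as a single-writer router. This is an illustrative toy, not Shore-MT's implementation; `Partition`, `PLPTable`, and the message format are invented for the sketch:

```python
import bisect
from queue import Queue
from threading import Thread

class Partition:
    """A worker thread that exclusively owns one key range.

    Because it is the only thread touching its heap data, it needs
    no locks or latches on that data (the PLP idea)."""
    def __init__(self):
        self.heap = {}          # private heap: key -> record
        self.inbox = Queue()    # requests routed to this partition
        Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            op, key, val, reply = self.inbox.get()
            if op == "put":
                self.heap[key] = val      # single writer: no latch
                reply.put(True)
            else:                          # "get"
                reply.put(self.heap.get(key))

class PLPTable:
    """Routes each request to the partition that owns its key range."""
    def __init__(self, boundaries):
        # boundaries = ["N"] gives two ranges: keys < "N" and keys >= "N"
        self.boundaries = boundaries
        self.partitions = [Partition() for _ in range(len(boundaries) + 1)]

    def _owner(self, key):
        return self.partitions[bisect.bisect_right(self.boundaries, key)]

    def _call(self, op, key, val=None):
        reply = Queue()
        self._owner(key).inbox.put((op, key, val, reply))
        return reply.get()

    def put(self, key, val):
        return self._call("put", key, val)

    def get(self, key):
        return self._call("get", key)

table = PLPTable(["N"])          # R1: A-M, R2: N-Z, as on the slide
table.put("Alice", 1)
table.put("Zoe", 2)
print(table.get("Alice"), table.get("Zoe"))   # -> 1 2
```

Because only one thread ever touches a partition's heap, the critical sections that shared-everything execution spends on locking and latching simply disappear; coordination moves into routing requests by key.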

PLP on multicores

TATP GetSubData on Shore-MT.

[Figure: throughput (Ktps) vs. # HW contexts, conventional vs. PLP. Left: Sun Niagara T2, 64 in-order HW contexts at 1.4 GHz, up to ~400 Ktps. Right: 4-socket quad-core AMD, 16 OoO HW contexts at 2.8 GHz, up to ~700 Ktps]

higher benefit as # of HW contexts goes up

ugly duckling -> swan

Sun Niagara T1, insert-only workload.

[Figure: throughput (tps/thread, log scale from 0.1 to 10) vs. concurrent threads (0 to 32) for shore, shore-mt (using DORA + PLP + Aether), and a commercial DB]

composing OLTP scalability

• alleviate concurrency control
• observe variability in comm latency
• maximize instruction locality

communication delays

A core reaches its own L1 in under 10 cycles, the shared on-chip cache in about 50 cycles, and data on the other socket, across the inter-socket links, in about 500 cycles.

[Figure: two sockets of four cores each, with private L1 and L2 caches, a shared L3, a memory controller, and inter-socket links per socket]

latency varies wildly

placement of application threads

Counter microbenchmark on 8 sockets × 10 cores; TPC-C Payment on 4 sockets × 6 cores.

[Figure: throughput under three thread placements: unpredictable (left to the OS), spread across sockets, and an island confined to one socket; placement changes throughput by 39-47%]

thread-to-core assignment matters [VLDB 2012]
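One way to realize the "island" placement on Linux is to confine a process to the cores of a single socket via `os.sched_setaffinity`. The sketch below assumes consecutive per-socket core numbering, which is common but not guaranteed; check the topology under /sys/devices/system/cpu before relying on it:

```python
import os

def island_cores(socket_id, cores_per_socket):
    """Core IDs of one socket's island, assuming cores are numbered
    consecutively per socket (verify against each cpu's
    topology/physical_package_id on a real machine)."""
    start = socket_id * cores_per_socket
    return set(range(start, start + cores_per_socket))

def pin_to_island(socket_id, cores_per_socket):
    """Island placement: confine this process's threads to one socket,
    so they communicate through the shared on-chip cache instead of
    crossing inter-socket links. Linux-only (os.sched_setaffinity)."""
    os.sched_setaffinity(0, island_cores(socket_id, cores_per_socket))

def pin_spread(n_sockets, cores_per_socket):
    """Spread placement: allow every core on every socket."""
    os.sched_setaffinity(0, set(range(n_sockets * cores_per_socket)))

# Island 1 on the 4-socket x 6-core machine from the slide:
print(sorted(island_cores(1, 6)))   # -> [6, 7, 8, 9, 10, 11]
```

The "unpredictable" bar on the slide corresponds to doing neither: the OS is free to migrate threads anywhere, so some runs land cross-socket and some do not.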

impact of sharing data among threads

[Figure: left: counter microbenchmark on 8 sockets × 10 cores, throughput (Mtps, log scale) for a counter per core vs. a counter per socket vs. a single shared counter, with gaps of 18.7x and 516.8x. Right: TPC-C Payment, local-only, on 4 sockets × 6 cores, throughput in Ktps; shared-nothing outperforms shared-everything by 4.5x]

fewer sharers -> better performance
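The "fewer sharers" effect comes from giving each core its own counter shard so that writes never contend on one hot location. A minimal Python sketch of that structure follows (one shard per thread stands in for one shard per core); note that CPython's GIL hides the cache-line contention being measured on the slide, so this shows the shape of the fix, not the speedup:

```python
import threading

class ShardedCounter:
    """One counter shard per thread (analogous to one per core):
    increments touch only that shard, so writers never contend on a
    single shared counter."""
    def __init__(self, n_shards):
        self.shards = [0] * n_shards
        self.locks = [threading.Lock() for _ in range(n_shards)]

    def add(self, shard_id, n=1):
        with self.locks[shard_id]:   # uncontended when each thread owns a shard
            self.shards[shard_id] += n

    def value(self):
        # Reads pay the aggregation cost instead of the writes.
        return sum(self.shards)

counter = ShardedCounter(n_shards=4)
threads = [
    threading.Thread(target=lambda i=i: [counter.add(i) for _ in range(10_000)])
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value())   # -> 40000
```

The slide's three bars are exactly this knob: per-core shards (no sharers), per-socket shards (sharers stay within one cache hierarchy), and a single counter (every core invalidates every other core's copy).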

impact of skewed input

[Figure: throughput (Ktps) vs. skew factor (0 to 1) with 24, 4, or 1 islands, for a local-only workload and for 50% multisite transactions. With 24 islands a few instances are highly loaded; larger instances can balance load, but with 1 island there is contention for hot data]

4 Islands effectively balance skew and contention
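The tradeoff in the figure can be reproduced with a toy model: Zipf-distributed requests over 24 partitions, grouped into equal-sized islands. This is an illustrative calculation, not the paper's experiment, and the skew exponent 0.75 is an arbitrary choice:

```python
def zipf_shares(n, s=0.75):
    """Fraction of requests hitting each of n partitions under Zipf(s)."""
    w = [r ** -s for r in range(1, n + 1)]
    total = sum(w)
    return [x / total for x in w]

def max_island_load(n_parts, n_islands, s=0.75):
    """Hottest island's share of requests when partitions are grouped
    into equal consecutive islands (partition 1 is the hottest)."""
    shares = zipf_shares(n_parts, s)
    size = n_parts // n_islands
    loads = [sum(shares[i:i + size]) for i in range(0, n_parts, size)]
    return max(loads)

def imbalance(n_parts, n_islands, s=0.75):
    """Hottest island's load relative to a perfect 1/n_islands share."""
    return max_island_load(n_parts, n_islands, s) * n_islands

for k in (24, 4, 1):
    print(k, "islands -> imbalance", round(imbalance(24, k), 2))
```

With one island the load is perfectly balanced, at the cost of every thread sharing the same data; with 24 islands the hottest instance carries far more than its fair share. Intermediate sizes, like the 4 islands on the slide, sit between the extremes.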

 

composing OLTP scalability

• alleviate concurrency control
• observe variability in comm latency
• maximize instruction locality

stall time breakdown

Trace simulation (models Intel Xeon E5-2660) of Shore-MT.

[Figure: stall cycles per k-instructions, split into L1I/L2I/L3I and L1D/L2D/L3D components, as database size grows: TPC-C at 0.1, 1, 10, and 100 GB; TPC-E at 20 and 100 GB]

L1-I misses are a significant factor in stall time

concurrent transactions

Transactions are composed from a small set of database operations (index probe, index scan, insert/update/delete record, plus conditional branches). Concurrent transactions of the same type run the same operation sequence on different records: T1 probes and updates X1, probes and inserts Y1, deletes Z1; T2 does the same with X2, Y2, Z2.

execute many common instruction blocks

transactions on a single core

Traditionally T1, T2, and T3 each run start to finish, so each transaction refetches the same instruction blocks into the L1-I and pays the full miss penalty again.

Stratified Transaction Execution (STREX) runs the batch one code phase at a time: in each phase a leader (T1) takes the instruction misses and warms the L1-I, then T2 and T3 execute the same phase out of the warmed cache before the batch advances.

[Figure: traditional vs. STREX schedules of T1-T3 over five phases, with T1 as the per-phase leader]

time-multiplexing to reduce instruction misses
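STREX's phase-major schedule can be contrasted with the traditional transaction-major one using a toy L1-I model: assume each code phase occupies one cache-sized block and the cache holds a single block. The simulation below is illustrative, not the hardware mechanism from the paper:

```python
def simulate_misses(schedule, cache_blocks=1):
    """Count L1-I misses for a schedule of (txn, phase) steps, with a
    tiny LRU instruction cache holding `cache_blocks` phase-sized blocks."""
    cache, misses = [], 0
    for _txn, phase in schedule:
        if phase in cache:
            cache.remove(phase)
            cache.append(phase)          # LRU touch
        else:
            misses += 1
            cache.append(phase)          # fetch the phase's code
            if len(cache) > cache_blocks:
                cache.pop(0)             # evict least recently used
    return misses

T, P = 3, 5   # 3 transactions, 5 code phases each, as on the slide
traditional = [(t, p) for t in range(T) for p in range(P)]   # txn-major
strex       = [(t, p) for p in range(P) for t in range(T)]   # phase-major
print(simulate_misses(traditional), simulate_misses(strex))  # -> 15 5
```

Transaction-major order pays a miss for every (transaction, phase) pair, since each phase's code is evicted before it is reused; phase-major order pays one miss per phase, taken by the leader, and the followers hit.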

transactions on multicores

Traditionally each transaction stays on its core, so every core's L1-I must be filled with the whole instruction footprint, and the total number of cache fills grows with the number of cores.

Self-Assembly of L1-I Caches (SLICC) instead migrates a transaction across cores as it moves through the code: each code segment stays resident in one core's L1-I, and later transactions find their next segment already warm on some core.

[Figure: traditional vs. SLICC schedules of T1-T3 across four cores over time, with far fewer cache fills under SLICC]

exploits aggregate L1-I & instruction overlap
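SLICC's migration rule can be added to the same kind of toy model: instead of reordering the schedule, each step runs on the core that "owns" its code segment, so the segment stays resident in that core's L1-I. The placement rule `phase % cores` is a stand-in for SLICC's dynamic detection of where a segment lives:

```python
def misses(schedule, place, n_cores, blocks_per_core=1):
    """Count L1-I misses given a placement rule mapping (txn, phase)
    to a core. Each core has a tiny private L1-I holding
    `blocks_per_core` phase-sized code blocks (LRU)."""
    caches = [[] for _ in range(n_cores)]
    total = 0
    for txn, phase in schedule:
        cache = caches[place(txn, phase)]
        if phase in cache:
            cache.remove(phase)
            cache.append(phase)          # LRU touch
        else:
            total += 1
            cache.append(phase)
            if len(cache) > blocks_per_core:
                cache.pop(0)
    return total

T, P, C = 3, 4, 4   # 3 transactions, 4 code segments, 4 cores
sched = [(t, p) for t in range(T) for p in range(P)]  # each txn runs in order
pinned  = lambda t, p: t        # traditional: txn stays on its core
migrate = lambda t, p: p % C    # SLICC-style: follow the code segment
print(misses(sched, pinned, C), misses(sched, migrate, C))   # -> 12 4
```

Pinned execution fills every core's cache with every segment (3 transactions × 4 segments = 12 fills); migration loads each segment exactly once into its owning core, so the aggregate L1-I of the four cores holds the whole footprint.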

reduced L1-I misses

Simulation of 4-way OoO cores with 4-way 32KB L1-I & L1-D and 1MB per-core L2; traces from Shore-MT.

[Figure: L1-I misses per k-instructions for conventional, STREX, and SLICC at 2, 4, 8, and 16 cores, on TPC-C and TPC-E]

STREX reduces L1-I misses regardless of core count; SLICC is better on high core counts

hybrid: STREX + SLICC

Simulation of 4-way OoO cores with 32KB L1-I & L1-D and 1MB per-core L2; traces from Shore-MT.

[Figure: speedup over conventional vs. # cores (0 to 16) for STREX, SLICC, and the hybrid, on TPC-C and TPC-E]

up to 80% better [MICRO 2012, ISCA 2013]

summary

• multicore = all parallelism methods
  – scalability is a complex problem
• alleviate concurrency control
  – eliminate critical sections: DORA, PLP, Aether, etc.
• observe variability in comm latency
  – OLTP Islands
• maximize instruction locality
  – chase locality with SLICC + STREX

what's next?

• data partitioning across Islands
• transaction-aware thread migration

THANK YOU!

Locks managed globally. • Fine-grained parallelism. – Each lock has its own .... Best solutions may be indirect. – Sidestep hard problems. – Look to distributed ...