Performance Characterization of Graph500 on Large Scale Distributed Environment
Toyotaro Suzumura (1,2,4), Koji Ueno (1,4), Hitoshi Sato (1,4), Katsuki Fujisawa (1,3), and Satoshi Matsuoka (1,4)
1 Tokyo Institute of Technology, 2 IBM Research – Tokyo, 3 Chuo University, 4 JST CREST
2011 IEEE International Symposium on Workload Characterization
Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks
Large-Scale Graph Mining is Everywhere
Internet Map [lumeta.com]
Friendship Network [Moody ’01]
Food Web [Martinez ’91]
Protein Interactions [genomebiology.com]
Graph500 (http://www.graph500.org)
§ A new graph-search-based benchmark for ranking supercomputers.
§ BFS (Breadth First Search) from a single vertex on a static, undirected Kronecker graph with average vertex degree 16.
§ Evaluation criteria: TEPS (Traversed Edges Per Second), the largest problem size that can be solved on a system, and the minimum execution time.
§ Problem size: SCALE – # of vertices = 2^SCALE, # of edges = 2^(SCALE+4) (see the sketch below).
§ Reference MPI and shared-memory implementations are provided.
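As a rough illustration (not part of the benchmark code), the relationship between SCALE, the graph size, and the TEPS score can be sketched as follows. The edge factor of 16 and the TEPS definition follow Graph500; the function names, the example SCALE, and the measured BFS time are ours.

    #include <stdio.h>
    #include <stdint.h>

    /* Sketch: problem size and TEPS score for a given SCALE (edge factor 16). */
    int main(void) {
        int scale = 26;                               /* example SCALE              */
        uint64_t num_vertices = 1ULL << scale;        /* 2^SCALE vertices           */
        uint64_t num_edges    = 16ULL * num_vertices; /* 2^(SCALE+4) edges          */

        double bfs_time_sec = 2.5;                    /* hypothetical measured time */
        double teps = (double)num_edges / bfs_time_sec;

        printf("SCALE %d: %llu vertices, %llu edges, %.3e TEPS\n",
               scale, (unsigned long long)num_vertices,
               (unsigned long long)num_edges, teps);
        return 0;
    }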
Graph500 Ranking in June 2011
[Chart omitted: top-ranked scores (SCALE) under the latest ranking rule]
Motivation of Our Work § To conduct a detailed performance analysis of the Graph500 reference implementations on the TSUBAME 2.0 supercomputer, currently ranked 5th (June 2011). § To provide a detailed analysis for other researchers targeting high performance scores on their own supercomputers.
Benchmark Flow
§ Graph (edge list) generation
§ Kernel 1: graph construction (conversion to any space-efficient format such as CSR or CSC; a sketch follows below)
§ Sampling of 64 search keys
§ Kernel 2: Breadth First Search (reference or customized implementation), repeated for 64 iterations
§ Validation of each BFS tree (5 rules)
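As an illustration of what Kernel 1 does, here is a minimal sketch (our own, not the reference code) that converts an undirected edge list into a CSR structure; the struct and array names are hypothetical.

    #include <stdlib.h>
    #include <stdint.h>

    /* Sketch: build a CSR adjacency structure from an undirected edge list.
     * Each undirected edge (u, v) is stored in both directions.              */
    typedef struct {
        int64_t  n;        /* number of vertices                  */
        int64_t *row_ptr;  /* length n+1: start of each adjacency */
        int64_t *col_idx;  /* length 2*m: neighbor vertex ids     */
    } csr_graph;

    csr_graph build_csr(int64_t n, int64_t m, const int64_t *src, const int64_t *dst) {
        csr_graph g;
        g.n = n;
        g.row_ptr = calloc(n + 1, sizeof(int64_t));
        g.col_idx = malloc(2 * m * sizeof(int64_t));

        /* Count the degree of every vertex. */
        for (int64_t e = 0; e < m; e++) {
            g.row_ptr[src[e] + 1]++;
            g.row_ptr[dst[e] + 1]++;
        }
        /* Prefix sum turns degree counts into row offsets. */
        for (int64_t v = 0; v < n; v++)
            g.row_ptr[v + 1] += g.row_ptr[v];

        /* Scatter each edge into its two rows. */
        int64_t *cursor = malloc(n * sizeof(int64_t));
        for (int64_t v = 0; v < n; v++) cursor[v] = g.row_ptr[v];
        for (int64_t e = 0; e < m; e++) {
            g.col_idx[cursor[src[e]]++] = dst[e];
            g.col_idx[cursor[dst[e]]++] = src[e];
        }
        free(cursor);
        return g;
    }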
Kronecker Graph [Leskovec, PKDD 2005]
Graph500 adopts a graph model called the "Kronecker graph" [Leskovec, PKDD 2005] that simulates dynamic, time-evolving real networks with the following properties:
• Scale-free with a power-law degree distribution
• Small diameter, etc.
Kronecker graph generation model parameters: A: 0.57, B: 0.19, C: 0.19, D: 0.05 (a generator sketch follows below).
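To give an intuition for how these parameters are used, here is a hedged sketch of the recursive quadrant-picking idea behind the A, B, C, D probabilities (R-MAT style). It is not the Graph500 reference generator, which additionally permutes vertex labels and edges; the RNG and function name are ours.

    #include <stdlib.h>
    #include <stdint.h>

    /* Sketch: generate one edge of a SCALE-level Kronecker/R-MAT style graph.
     * At each level, one quadrant of the adjacency matrix is chosen with
     * probabilities A, B, C, D = 0.57, 0.19, 0.19, 0.05.                     */
    static void rmat_edge(int scale, int64_t *u, int64_t *v) {
        const double A = 0.57, B = 0.19, C = 0.19;   /* D = 1 - A - B - C */
        *u = 0;
        *v = 0;
        for (int level = 0; level < scale; level++) {
            double r = (double)rand() / RAND_MAX;
            int64_t bit = 1LL << level;
            if (r < A) {
                /* top-left quadrant: neither bit set */
            } else if (r < A + B) {
                *v |= bit;                /* top-right quadrant    */
            } else if (r < A + B + C) {
                *u |= bit;                /* bottom-left quadrant  */
            } else {
                *u |= bit;                /* bottom-right quadrant */
                *v |= bit;
            }
        }
    }

Calling rmat_edge 16 × 2^SCALE times yields an edge list with the benchmark's edge factor of 16.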
Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks
Level-Synchronized Breadth-First Search
• At any time, CQ (Current Queue) is the set of vertices that must be visited at the current level.
• At level 1, CQ contains the neighbors of r; at level 2, it contains their neighbors (the neighboring vertices that have not been visited at levels 0 or 1).
• The algorithm maintains NQ (Next Queue), containing the vertices that should be visited at the next level. After visiting all of the vertices at each level, the queues CQ and NQ are swapped.

Level-Synchronized BFS
    for all vertices v in parallel do
        pred[v] ← -1
    pred[r] ← 0
    Enqueue(CQ, r)
    while CQ != empty do
        NQ ← empty
        for all u in CQ in parallel do
            u ← Dequeue(CQ)
            for each v adjacent to u in parallel do
                if pred[v] = -1 then
                    pred[v] ← u
                    Enqueue(NQ, v)
        swap(CQ, NQ)
[BFS example diagrams omitted; the walkthrough is:]
• Level 0: vertex r is added to CQ.
• Level 1: vertex r is retrieved from CQ and its adjacent vertices (B, C, D, E) are inserted into NQ; CQ and NQ are then swapped.
• Level 2: multiple threads simultaneously retrieve vertices B, C, D, E from CQ and insert their unvisited adjacent vertices into NQ.
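A minimal shared-memory rendering of the pseudocode above, assuming the csr_graph structure sketched earlier; this is our own illustrative code (OpenMP with compare-and-swap claiming of pred), not the Graph500 reference implementation.

    #include <stdlib.h>
    #include <stdint.h>

    /* Sketch: level-synchronized BFS over a CSR graph (see build_csr above). */
    void bfs_level_sync(const csr_graph *g, int64_t root, int64_t *pred) {
        int64_t *cq = malloc(g->n * sizeof(int64_t));   /* current queue */
        int64_t *nq = malloc(g->n * sizeof(int64_t));   /* next queue    */
        int64_t cq_size = 0, nq_size = 0;

        for (int64_t v = 0; v < g->n; v++) pred[v] = -1;
        pred[root] = root;                  /* root is its own predecessor */
        cq[cq_size++] = root;

        while (cq_size > 0) {
            nq_size = 0;
            #pragma omp parallel for schedule(dynamic, 64)
            for (int64_t i = 0; i < cq_size; i++) {
                int64_t u = cq[i];
                for (int64_t e = g->row_ptr[u]; e < g->row_ptr[u + 1]; e++) {
                    int64_t v = g->col_idx[e];
                    /* Claim v exactly once across all threads. */
                    if (pred[v] == -1 &&
                        __sync_bool_compare_and_swap(&pred[v], -1, u)) {
                        int64_t slot = __sync_fetch_and_add(&nq_size, 1);
                        nq[slot] = v;
                    }
                }
            }
            /* Level synchronization: swap CQ and NQ. */
            int64_t *tmp = cq; cq = nq; nq = tmp;
            cq_size = nq_size;
        }
        free(cq); free(nq);
    }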
Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks
Graph500 Reference Implementations § All of the MPI implementations are based upon "level-synchronized BFS". § The benchmark includes reference MPI implementations that can be categorized into two methods, depending on whether CQ (Current Queue) is replicated across all the nodes or not.
Category               | Code name      | Partitioning | Adjacency matrix format | Parallelism
Non-replicated method  | simple         | Horizontal   | CSR                     | Single thread
                       | one_sided      | Horizontal   | CSR                     | Single thread
Replicated method      | replicated-csr | Vertical     | CSR                     | Multi-thread
                       | replicated-csc | Vertical     | CSC                     | Multi-thread
Partitioning Large-Scale Graph on Multiple Nodes § Vertices and the adjacency matrix are partitioned: each processor holds a block of vertices and the outgoing edges from its own vertices.
Partitioning Methods (of the adjacency matrix)
• Horizontal partitioning (e.g. simple)
• Vertical partitioning (e.g. replicated)
A sketch of the underlying 1D vertex distribution follows below.
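As a hedged illustration of the 1D vertex distribution these methods rely on, the following helpers compute how many vertices a process owns and which process owns a given global vertex. The block-distribution scheme and all names are our assumptions, not necessarily those of the reference code (assumes n_global >= nprocs).

    #include <stdint.h>

    /* Sketch: block distribution of 2^SCALE vertices over `nprocs` processes. */
    typedef struct {
        int64_t n_global;   /* total number of vertices            */
        int64_t n_local;    /* vertices owned by this process      */
        int64_t offset;     /* global id of the first local vertex */
    } vertex_partition;

    static vertex_partition make_partition(int64_t n_global, int rank, int nprocs) {
        vertex_partition p;
        int64_t base = n_global / nprocs, rem = n_global % nprocs;
        p.n_global = n_global;
        p.n_local  = base + (rank < rem ? 1 : 0);
        p.offset   = rank * base + (rank < rem ? rank : rem);
        return p;
    }

    /* Which process owns global vertex v? */
    static int owner_of(int64_t v, int64_t n_global, int nprocs) {
        int64_t base = n_global / nprocs, rem = n_global % nprocs;
        int64_t cutoff = rem * (base + 1);   /* first `rem` ranks get one extra vertex */
        return v < cutoff ? (int)(v / (base + 1))
                          : (int)(rem + (v - cutoff) / base);
    }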
Non-replication based method: simple
§ Each node has a partitioned block of CQ that contains only its own vertices; NQ becomes CQ at the next level.
§ A vertex u dequeued from CQ is looked up in the adjacency matrix; when an adjacent vertex v is owned by another processor, the edge is sent to that owner (inter-processor communication), and the owner enqueues the vertex into NQ if it has not yet been visited (visited flag = 0).
Non-replication based method (contd.)
§ Each processor exchanges edges with the other processors using the asynchronous communication functions MPI_Isend and MPI_Irecv.
§ This involves all-to-all communication for exchanging all the edge information (a sketch follows below).
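A hedged sketch of the kind of non-blocking exchange described above: each process posts one receive and one send per peer and waits for completion. The buffer layout and fixed per-peer count are hypothetical simplifications; the reference code's actual protocol (variable message sizes, termination detection) is more involved.

    #include <mpi.h>
    #include <stdlib.h>
    #include <stdint.h>

    /* Sketch: exchange a fixed-size block of edge endpoints with every peer
     * using MPI_Isend / MPI_Irecv (all-to-all style, simplified).            */
    void exchange_edges(const int64_t *sendbuf, int64_t *recvbuf,
                        int count_per_peer, MPI_Comm comm) {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        MPI_Request *reqs = malloc(2 * (size_t)nprocs * sizeof(MPI_Request));
        int nreq = 0;

        for (int peer = 0; peer < nprocs; peer++)
            MPI_Irecv(&recvbuf[(int64_t)peer * count_per_peer], count_per_peer,
                      MPI_INT64_T, peer, 0, comm, &reqs[nreq++]);
        for (int peer = 0; peer < nprocs; peer++)
            MPI_Isend(&sendbuf[(int64_t)peer * count_per_peer], count_per_peer,
                      MPI_INT64_T, peer, 0, comm, &reqs[nreq++]);

        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }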
Replication-based method
§ In order to quickly find the set of vertices to be handled at each level, the current queue (CQ), represented as a bitmap, is replicated among all the processors: each processor has a replica of the whole CQ.
§ Each processor builds its own block of the partitioned NQ and sends it to all other processors; the gathered NQ becomes CQ at the next level.
§ The algorithm of this phase differs between the csr and csc variants.
Replication-based method (contd.)
§ Each processor has a set of its own vertices. All processors send a copy of their own portion to all other processors with the MPI_Allgather collective operation when synchronization occurs at each level (sketched below).
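Below is a hedged sketch of that synchronization step: each process contributes its local slice of the NQ bitmap and MPI_Allgather assembles the replicated CQ for the next level. The word size, helper functions, and variable names are our assumptions.

    #include <mpi.h>
    #include <stdint.h>

    /* Sketch: rebuild the replicated CQ bitmap for the next level.
     * nq_local holds `words_per_proc` 64-bit words covering this process's
     * vertices; cq_global receives nprocs * words_per_proc words.           */
    void gather_next_cq(uint64_t *nq_local, uint64_t *cq_global,
                        int words_per_proc, MPI_Comm comm) {
        MPI_Allgather(nq_local, words_per_proc, MPI_UINT64_T,
                      cq_global, words_per_proc, MPI_UINT64_T, comm);
    }

    /* Bitmap helpers used by the traversal sketches below
     * (assumed helpers, not from the reference code). */
    static inline int test_bit(const uint64_t *bm, int64_t i) {
        return (int)((bm[i >> 6] >> (i & 63)) & 1ULL);
    }
    static inline void set_bit_atomic(uint64_t *bm, int64_t i) {
        __sync_fetch_and_or(&bm[i >> 6], 1ULL << (i & 63));
    }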
Replication-based method: replicated-csr
§ CQ (and NQ) is a bitmap array.
§ Each MPI process looks only at its local vertices: for each local vertex v, it follows the CSR row pointer and investigates whether an adjacent vertex u exists in CQ (CQ[u] = 1); if so, NQ[v] ← 1 in the local NQ, which is eventually all-gathered among all the MPI processes (a sketch of this loop follows below).
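A hedged sketch of that per-level loop, reusing the bitmap helpers above. Here row_ptr/col_idx hold, for each local vertex v, the global ids of its neighbors; all names are illustrative, not the reference code's.

    #include <stdint.h>

    /* Sketch: replicated-csr style level step.
     * For each local vertex v that is still unvisited, scan its neighbors u
     * (global ids); if any u is in the replicated CQ bitmap, claim v.        */
    void level_step_csr(int64_t n_local,
                        const int64_t *row_ptr, const int64_t *col_idx,
                        const uint64_t *cq, uint64_t *nq, int64_t *pred) {
        #pragma omp parallel for schedule(dynamic, 256)
        for (int64_t v = 0; v < n_local; v++) {
            if (pred[v] != -1) continue;              /* already visited       */
            for (int64_t e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
                int64_t u = col_idx[e];               /* global neighbor id    */
                if (test_bit(cq, u)) {                /* neighbor is in CQ     */
                    pred[v] = u;
                    set_bit_atomic(nq, v);            /* local NQ slice; becomes
                                                         global after Allgather */
                    break;
                }
            }
        }
    }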
Replication-based method: replicated-csc
§ Each MPI process has the replicated CQ that records, for every vertex in the graph, whether that vertex is in CQ. In contrast, NQ only holds information on the local vertices that the MPI process handles.
§ All of the vertices in CQ are processed in parallel by multiple threads spawned with OpenMP. If a global vertex u exists in CQ, the process follows the CSC column pointer to find the set of local vertices v adjacent to u, and each such v is added to NQ (NQ[v] ← 1); see the sketch below.
§ This requires looking at all the global vertices, but the memory access is contiguous, in contrast to the random memory access of replicated-csr.
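For symmetry, here is a hedged sketch of the replicated-csc level step, again using the bitmap helpers above. col_ptr/row_idx represent the CSC structure restricted to locally owned rows; the names and claiming scheme are our assumptions.

    #include <stdint.h>

    /* Sketch: replicated-csc style level step.
     * Scan every global vertex u; if u is in the replicated CQ, walk its CSC
     * column and claim the adjacent local vertices v.                        */
    void level_step_csc(int64_t n_global,
                        const int64_t *col_ptr, const int64_t *row_idx,
                        const uint64_t *cq, uint64_t *nq, int64_t *pred) {
        #pragma omp parallel for schedule(static)
        for (int64_t u = 0; u < n_global; u++) {
            if (!test_bit(cq, u)) continue;           /* u is not in the frontier   */
            for (int64_t e = col_ptr[u]; e < col_ptr[u + 1]; e++) {
                int64_t v = row_idx[e];               /* local vertex adjacent to u */
                if (pred[v] == -1 &&
                    __sync_bool_compare_and_swap(&pred[v], -1, u)) {
                    set_bit_atomic(nq, v);            /* v joins the next frontier  */
                }
            }
        }
    }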
Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks
TSUBAME 2.0 Supercomputer
TSUBAME 2.0 System Configuration
TSUBAME 2.0 Specification
CPU:              Intel Westmere EP (Xeon X5670, 2.93 GHz; L2 cache: 256 KB, L3: 12 MB), 2 sockets per node, 12 CPU cores per node (24 hardware threads with Hyper-Threading)
RAM:              54 GB
OS:               SUSE Linux Enterprise 11 (Linux kernel 2.6.32)
# of total nodes: 1466 (we only tested up to 128 nodes)
Network topology: full-bisection fat-tree
Network:          Voltaire / Mellanox dual-rail QDR InfiniBand (40 Gbps x 2 = 80 Gbps)
GPGPU:            three NVIDIA Fermi M2050 GPUs (not used for this work)
Compiler:         GCC 4.3.4 (-O3), OpenMP 3.0
MPI:              OpenMPI 1.5.3, MVAPICH 1.6.1
Strong-Scaling Performance Comparison
§ Although the replication-based methods outperform the non-replication based method, none of them scales to larger numbers of nodes: performance saturates at around 32 nodes.
[Plots omitted: TEPS and speedup against 1 node vs. # of nodes (1-128) for simple, replicated-csr, and replicated-csc; Scale 26, OpenMPI]
*1 Scale 26 per node is the largest problem size that one node (24 cores, 52 GB RAM) can handle, consuming 17.71 GB in the CSC format.
Weak-Scaling Performance Comparison
§ This experiment fixes the problem size per node, which lets us see whether linear scalability is achieved and how much of the performance degradation is due to communication and level synchronization.
§ All of the methods show performance degradation with larger numbers of nodes in the weak-scaling setting.
[Plots omitted: TEPS vs. # of nodes (1-64) for simple, replicated-csr, and replicated-csc; left: Scale 24 per node (total Scale 24-30), right: Scale 26 per node*1 (total Scale 26-32)]
Profiling Communication Message Size
§ With the non-replication based method, the communication message size becomes large around the middle levels (total number of levels: 8).
§ With the replication based method, the aggregated message size grows linearly with the number of nodes.
[Plots omitted: left: data size (bytes) per level (1-8) for simple at Scale 29 / 8 nodes, Scale 30 / 16 nodes, and Scale 31 / 32 nodes; right: aggregated message size vs. # of MPI processes (2-128) for the replicated methods; Scale 26, OpenMPI]
Profiling Execution Time at Each Level
• replicated-csc and simple show similar per-level profiles, because the replicated-csc implementation needs to examine all of the global vertices in CQ by checking the corresponding bit; thus, when CQ contains more vertices at higher levels, its processing time increases.
• In contrast, the replicated-csr implementation only checks whether adjacent "local" vertices are unvisited in CQ, so as the number of unvisited vertices decreases, the processing time also decreases.
[Plots omitted: execution time (seconds) per level (1-8) for replicated-csr, replicated-csc, and simple, each at Scale 26 on 1 node and Scale 31 on 32 nodes]
Profiling Computation, Communication, and Stall Times
• Communication and stall (synchronization) times grow with larger numbers of nodes.
[Plots omitted: elapsed time (seconds) broken down into computation, communication, and stall for replicated-csr (1-32 nodes) and replicated-csc (1-64 nodes); weak scaling, Scale 26 per node]
Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks
Related Work § A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L [Yoo, SC 2005] – Proposes a 2D partitioning technique and optimizations on BlueGene/L
§ Scalable Graph Exploration on Multicore Processors [Agarwal, SC 2010] – An efficient and scalable BFS algorithm for commodity multicore processors such as the 8-core Intel Nehalem EX processor
§ Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2 [Bader, ICPP 2006] § Accelerating large graph algorithms on the GPU using CUDA [Harish, HiPC 2007]
Concluding Remarks and Ongoing Work § Concluding Remarks – We demonstrated the performance characteristics of the reference implementations provided by Graph500 on commodity-based supercomputers such as TSUBAME 2.0 – We provide a thorough guide for high performance graph search algorithms on large-scale distributed environments
§ Ongoing Work – We designed and implemented our scalable and optimized BFS method on TSUBAME 2.0 based upon the thorough study published at IISWC 2011. – Looking forward to the next Graph500 ranking list, to be announced at SC2011 next week.
Our Highly Scalable BFS Method § We designed and implemented an optimized method based on 2D partitioning and various other optimizations such as communication compression and vertex sorting. § Our optimized implementation can solve BFS (Breadth First Search) on a large-scale graph with 2^36 (68.7 billion) vertices and 2^40 (1.1 trillion) edges in 10.58 seconds using 1366 nodes and 16392 CPU cores of TSUBAME 2.0. § This record corresponds to 103.9 GE/s (TEPS).
[Charts omitted: left: performance comparison with the reference implementations (simple, replicated-csr, and replicated-csc) at Scale 24 per node on up to 128 nodes; right: performance of our optimized implementation at Scale 26 per node on up to 1024 nodes, with data points of 11.3, 21.3, 37.2, 63.5, and 99.0 GE/s]
Questions?
Thank You