Performance Characterization of Graph500 on Large Scale Distributed Environment
Toyotaro Suzumura (1,2,4), Koji Ueno (1,4), Hitoshi Sato (1,4), Katsuki Fujisawa (1,3), and Satoshi Matsuoka (1,4)
1 Tokyo Institute of Technology, 2 IBM Research – Tokyo, 3 Chuo University, 4 JST CREST
2011 IEEE International Symposium on Workload Characterization
Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks
Large-Scale Graph Mining is Everywhere
Internet Map [lumeta.com]
Friendship Network [Moody ’01]
Food Web [Martinez ’91]
Protein Interactions [genomebiology.com]
Graph500 (http://www.graph500.org)
§ A new graph-search-based benchmark for ranking supercomputers.
§ BFS (Breadth First Search) from a single vertex on a static, undirected Kronecker graph with average vertex degree 16.
§ Evaluation criteria: TEPS (Traversed Edges Per Second), the largest problem size that can be solved on a system, and the minimum execution time.
§ Problem size: SCALE – # of vertices = 2^SCALE, # of edges = 2^(SCALE+4) (see the sketch below).
§ Reference MPI and shared-memory implementations are provided.
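As a rough illustration (not part of the benchmark code), the relationship between SCALE, the graph size, and the TEPS score can be sketched as follows. The edge factor of 16 and the TEPS definition follow Graph500; the function names, the example SCALE, and the measured BFS time are ours.

    #include <stdio.h>
    #include <stdint.h>

    /* Sketch: problem size and TEPS score for a given SCALE (edge factor 16). */
    int main(void) {
        int scale = 26;                               /* example SCALE              */
        uint64_t num_vertices = 1ULL << scale;        /* 2^SCALE vertices           */
        uint64_t num_edges    = 16ULL * num_vertices; /* 2^(SCALE+4) edges          */

        double bfs_time_sec = 2.5;                    /* hypothetical measured time */
        double teps = (double)num_edges / bfs_time_sec;

        printf("SCALE %d: %llu vertices, %llu edges, %.3e TEPS\n",
               scale, (unsigned long long)num_vertices,
               (unsigned long long)num_edges, teps);
        return 0;
    }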
Graph500 Ranking in June 2011
[Chart omitted: top-ranked scores (SCALE) under the latest ranking rule]
Motivation of Our Work § To conduct a detailed performance analysis of the Graph500 reference implementations on the TSUBAME 2.0 supercomputer, currently ranked 5th (June 2011). § To provide a detailed analysis for other researchers targeting high performance scores on their own supercomputers.
Benchmark Flow
§ Graph (edge list) generation
§ Kernel 1: graph construction (conversion to any space-efficient format such as CSR or CSC; a sketch follows below)
§ Sampling of 64 search keys
§ Kernel 2: Breadth First Search (reference or customized implementation), repeated for 64 iterations
§ Validation of each BFS tree (5 rules)
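As an illustration of what Kernel 1 does, here is a minimal sketch (our own, not the reference code) that converts an undirected edge list into a CSR structure; the struct and array names are hypothetical.

    #include <stdlib.h>
    #include <stdint.h>

    /* Sketch: build a CSR adjacency structure from an undirected edge list.
     * Each undirected edge (u, v) is stored in both directions.              */
    typedef struct {
        int64_t  n;        /* number of vertices                  */
        int64_t *row_ptr;  /* length n+1: start of each adjacency */
        int64_t *col_idx;  /* length 2*m: neighbor vertex ids     */
    } csr_graph;

    csr_graph build_csr(int64_t n, int64_t m, const int64_t *src, const int64_t *dst) {
        csr_graph g;
        g.n = n;
        g.row_ptr = calloc(n + 1, sizeof(int64_t));
        g.col_idx = malloc(2 * m * sizeof(int64_t));

        /* Count the degree of every vertex. */
        for (int64_t e = 0; e < m; e++) {
            g.row_ptr[src[e] + 1]++;
            g.row_ptr[dst[e] + 1]++;
        }
        /* Prefix sum turns degree counts into row offsets. */
        for (int64_t v = 0; v < n; v++)
            g.row_ptr[v + 1] += g.row_ptr[v];

        /* Scatter each edge into its two rows. */
        int64_t *cursor = malloc(n * sizeof(int64_t));
        for (int64_t v = 0; v < n; v++) cursor[v] = g.row_ptr[v];
        for (int64_t e = 0; e < m; e++) {
            g.col_idx[cursor[src[e]]++] = dst[e];
            g.col_idx[cursor[dst[e]]++] = src[e];
        }
        free(cursor);
        return g;
    }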
Kronecker Graph [Leskovec, PKDD 2005]
Graph500 adopts a graph model called the "Kronecker graph" [Leskovec, PKDD 2005] that simulates dynamic, time-evolving real networks with the following properties:
• Scale-free with a power-law degree distribution
• Small diameter, etc.
Kronecker graph generation model parameters: A: 0.57, B: 0.19, C: 0.19, D: 0.05 (a generator sketch follows below).
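To give an intuition for how these parameters are used, here is a hedged sketch of the recursive quadrant-picking idea behind the A, B, C, D probabilities (R-MAT style). It is not the Graph500 reference generator, which additionally permutes vertex labels and edges; the RNG and function name are ours.

    #include <stdlib.h>
    #include <stdint.h>

    /* Sketch: generate one edge of a SCALE-level Kronecker/R-MAT style graph.
     * At each level, one quadrant of the adjacency matrix is chosen with
     * probabilities A, B, C, D = 0.57, 0.19, 0.19, 0.05.                     */
    static void rmat_edge(int scale, int64_t *u, int64_t *v) {
        const double A = 0.57, B = 0.19, C = 0.19;   /* D = 1 - A - B - C */
        *u = 0;
        *v = 0;
        for (int level = 0; level < scale; level++) {
            double r = (double)rand() / RAND_MAX;
            int64_t bit = 1LL << level;
            if (r < A) {
                /* top-left quadrant: neither bit set */
            } else if (r < A + B) {
                *v |= bit;                /* top-right quadrant    */
            } else if (r < A + B + C) {
                *u |= bit;                /* bottom-left quadrant  */
            } else {
                *u |= bit;                /* bottom-right quadrant */
                *v |= bit;
            }
        }
    }

Calling rmat_edge 16 × 2^SCALE times yields an edge list with the benchmark's edge factor of 16.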
Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks
Level-Synchronized Breadth-First Search
• At any time, CQ (Current Queue) is the set of vertices that must be visited at the current level.
• At level 1, CQ contains the neighbors of r; at level 2, it contains their neighbors (the neighboring vertices that have not been visited at levels 0 or 1).
• The algorithm maintains NQ (Next Queue), containing the vertices that should be visited at the next level. After visiting all of the vertices at each level, the queues CQ and NQ are swapped.

Level-Synchronized BFS
    for all vertices v in parallel do
        pred[v] ← -1
    pred[r] ← 0
    Enqueue(CQ, r)
    while CQ != empty do
        NQ ← empty
        for all u in CQ in parallel do
            u ← Dequeue(CQ)
            for each v adjacent to u in parallel do
                if pred[v] = -1 then
                    pred[v] ← u
                    Enqueue(NQ, v)
        swap(CQ, NQ)
[BFS example diagrams omitted; the walkthrough is:]
• Level 0: vertex r is added to CQ.
• Level 1: vertex r is retrieved from CQ and its adjacent vertices (B, C, D, E) are inserted into NQ; CQ and NQ are then swapped.
• Level 2: multiple threads simultaneously retrieve vertices B, C, D, E from CQ and insert their unvisited adjacent vertices into NQ.
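A minimal shared-memory rendering of the pseudocode above, assuming the csr_graph structure sketched earlier; this is our own illustrative code (OpenMP with compare-and-swap claiming of pred), not the Graph500 reference implementation.

    #include <stdlib.h>
    #include <stdint.h>

    /* Sketch: level-synchronized BFS over a CSR graph (see build_csr above). */
    void bfs_level_sync(const csr_graph *g, int64_t root, int64_t *pred) {
        int64_t *cq = malloc(g->n * sizeof(int64_t));   /* current queue */
        int64_t *nq = malloc(g->n * sizeof(int64_t));   /* next queue    */
        int64_t cq_size = 0, nq_size = 0;

        for (int64_t v = 0; v < g->n; v++) pred[v] = -1;
        pred[root] = root;                  /* root is its own predecessor */
        cq[cq_size++] = root;

        while (cq_size > 0) {
            nq_size = 0;
            #pragma omp parallel for schedule(dynamic, 64)
            for (int64_t i = 0; i < cq_size; i++) {
                int64_t u = cq[i];
                for (int64_t e = g->row_ptr[u]; e < g->row_ptr[u + 1]; e++) {
                    int64_t v = g->col_idx[e];
                    /* Claim v exactly once across all threads. */
                    if (pred[v] == -1 &&
                        __sync_bool_compare_and_swap(&pred[v], -1, u)) {
                        int64_t slot = __sync_fetch_and_add(&nq_size, 1);
                        nq[slot] = v;
                    }
                }
            }
            /* Level synchronization: swap CQ and NQ. */
            int64_t *tmp = cq; cq = nq; nq = tmp;
            cq_size = nq_size;
        }
        free(cq); free(nq);
    }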
Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks
Graph500 Reference Implementations § All of the MPI implementations are based upon "level-synchronized BFS". § The benchmark includes reference MPI implementations that can be categorized into two methods, depending on whether CQ (Current Queue) is replicated across all the nodes or not.
Category               | Code name      | Partitioning | Adjacency matrix format | Parallelism
Non-replicated method  | simple         | Horizontal   | CSR                     | Single thread
                       | one_sided      | Horizontal   | CSR                     | Single thread
Replicated method      | replicated-csr | Vertical     | CSR                     | Multi-thread
                       | replicated-csc | Vertical     | CSC                     | Multi-thread
Partitioning Large-Scale Graph on Multiple Nodes § Vertices and the adjacency matrix are partitioned: each processor holds a block of vertices and the outgoing edges from its own vertices.
Partitioning Methods (of the adjacency matrix)
• Horizontal partitioning (e.g. simple)
• Vertical partitioning (e.g. replicated)
A sketch of the underlying 1D vertex distribution follows below.
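As a hedged illustration of the 1D vertex distribution these methods rely on, the following helpers compute how many vertices a process owns and which process owns a given global vertex. The block-distribution scheme and all names are our assumptions, not necessarily those of the reference code (assumes n_global >= nprocs).

    #include <stdint.h>

    /* Sketch: block distribution of 2^SCALE vertices over `nprocs` processes. */
    typedef struct {
        int64_t n_global;   /* total number of vertices            */
        int64_t n_local;    /* vertices owned by this process      */
        int64_t offset;     /* global id of the first local vertex */
    } vertex_partition;

    static vertex_partition make_partition(int64_t n_global, int rank, int nprocs) {
        vertex_partition p;
        int64_t base = n_global / nprocs, rem = n_global % nprocs;
        p.n_global = n_global;
        p.n_local  = base + (rank < rem ? 1 : 0);
        p.offset   = rank * base + (rank < rem ? rank : rem);
        return p;
    }

    /* Which process owns global vertex v? */
    static int owner_of(int64_t v, int64_t n_global, int nprocs) {
        int64_t base = n_global / nprocs, rem = n_global % nprocs;
        int64_t cutoff = rem * (base + 1);   /* first `rem` ranks get one extra vertex */
        return v < cutoff ? (int)(v / (base + 1))
                          : (int)(rem + (v - cutoff) / base);
    }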
Non-replication based method: simple
§ Each node has a partitioned block of CQ that contains only its own vertices; NQ becomes CQ at the next level.
§ A vertex u dequeued from CQ is looked up in the adjacency matrix; when an adjacent vertex v is owned by another processor, the edge is sent to that owner (inter-processor communication), and the owner enqueues the vertex into NQ if it has not yet been visited (visited flag = 0).
Non-replication based method (contd.)
§ Each processor exchanges edges with the other processors using the asynchronous communication functions MPI_Isend and MPI_Irecv.
§ This involves all-to-all communication for exchanging all the edge information (a sketch follows below).
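A hedged sketch of the kind of non-blocking exchange described above: each process posts one receive and one send per peer and waits for completion. The buffer layout and fixed per-peer count are hypothetical simplifications; the reference code's actual protocol (variable message sizes, termination detection) is more involved.

    #include <mpi.h>
    #include <stdlib.h>
    #include <stdint.h>

    /* Sketch: exchange a fixed-size block of edge endpoints with every peer
     * using MPI_Isend / MPI_Irecv (all-to-all style, simplified).            */
    void exchange_edges(const int64_t *sendbuf, int64_t *recvbuf,
                        int count_per_peer, MPI_Comm comm) {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        MPI_Request *reqs = malloc(2 * (size_t)nprocs * sizeof(MPI_Request));
        int nreq = 0;

        for (int peer = 0; peer < nprocs; peer++)
            MPI_Irecv(&recvbuf[(int64_t)peer * count_per_peer], count_per_peer,
                      MPI_INT64_T, peer, 0, comm, &reqs[nreq++]);
        for (int peer = 0; peer < nprocs; peer++)
            MPI_Isend(&sendbuf[(int64_t)peer * count_per_peer], count_per_peer,
                      MPI_INT64_T, peer, 0, comm, &reqs[nreq++]);

        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }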
Replication-based method
§ In order to quickly find the set of vertices to be handled at each level, the current queue (CQ), represented as a bitmap, is replicated among all the processors: each processor has a replica of the whole CQ.
§ Each processor builds its own block of the partitioned NQ and sends it to all other processors; the gathered NQ becomes CQ at the next level.
§ The algorithm of this phase differs between the csr and csc variants.
Replication-based method (contd.)
§ Each processor has a set of its own vertices. All processors send a copy of their own portion to all other processors with the MPI_Allgather collective operation when synchronization occurs at each level (sketched below).
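Below is a hedged sketch of that synchronization step: each process contributes its local slice of the NQ bitmap and MPI_Allgather assembles the replicated CQ for the next level. The word size, helper functions, and variable names are our assumptions.

    #include <mpi.h>
    #include <stdint.h>

    /* Sketch: rebuild the replicated CQ bitmap for the next level.
     * nq_local holds `words_per_proc` 64-bit words covering this process's
     * vertices; cq_global receives nprocs * words_per_proc words.           */
    void gather_next_cq(uint64_t *nq_local, uint64_t *cq_global,
                        int words_per_proc, MPI_Comm comm) {
        MPI_Allgather(nq_local, words_per_proc, MPI_UINT64_T,
                      cq_global, words_per_proc, MPI_UINT64_T, comm);
    }

    /* Bitmap helpers used by the traversal sketches below
     * (assumed helpers, not from the reference code). */
    static inline int test_bit(const uint64_t *bm, int64_t i) {
        return (int)((bm[i >> 6] >> (i & 63)) & 1ULL);
    }
    static inline void set_bit_atomic(uint64_t *bm, int64_t i) {
        __sync_fetch_and_or(&bm[i >> 6], 1ULL << (i & 63));
    }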
Replication-based method: replicated-csr
§ CQ (and NQ) is a bitmap array.
§ Each MPI process looks only at its local vertices: for each local vertex v, it follows the CSR row pointer and investigates whether an adjacent vertex u exists in CQ (CQ[u] = 1); if so, NQ[v] ← 1 in the local NQ, which is eventually all-gathered among all the MPI processes (a sketch of this loop follows below).
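A hedged sketch of that per-level loop, reusing the bitmap helpers above. Here row_ptr/col_idx hold, for each local vertex v, the global ids of its neighbors; all names are illustrative, not the reference code's.

    #include <stdint.h>

    /* Sketch: replicated-csr style level step.
     * For each local vertex v that is still unvisited, scan its neighbors u
     * (global ids); if any u is in the replicated CQ bitmap, claim v.        */
    void level_step_csr(int64_t n_local,
                        const int64_t *row_ptr, const int64_t *col_idx,
                        const uint64_t *cq, uint64_t *nq, int64_t *pred) {
        #pragma omp parallel for schedule(dynamic, 256)
        for (int64_t v = 0; v < n_local; v++) {
            if (pred[v] != -1) continue;              /* already visited       */
            for (int64_t e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
                int64_t u = col_idx[e];               /* global neighbor id    */
                if (test_bit(cq, u)) {                /* neighbor is in CQ     */
                    pred[v] = u;
                    set_bit_atomic(nq, v);            /* local NQ slice; becomes
                                                         global after Allgather */
                    break;
                }
            }
        }
    }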
Replication-based method: replicated-csc
§ Each MPI process has the replicated CQ that records, for every vertex in the graph, whether that vertex is in CQ. In contrast, NQ only holds information on the local vertices that the MPI process handles.
§ All of the vertices in CQ are processed in parallel by multiple threads spawned with OpenMP. If a global vertex u exists in CQ, the process follows the CSC column pointer to find the set of local vertices v adjacent to u, and each such v is added to NQ (NQ[v] ← 1); see the sketch below.
§ This requires looking at all the global vertices, but the memory access is contiguous, in contrast to the random memory access of replicated-csr.
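For symmetry, here is a hedged sketch of the replicated-csc level step, again using the bitmap helpers above. col_ptr/row_idx represent the CSC structure restricted to locally owned rows; the names and claiming scheme are our assumptions.

    #include <stdint.h>

    /* Sketch: replicated-csc style level step.
     * Scan every global vertex u; if u is in the replicated CQ, walk its CSC
     * column and claim the adjacent local vertices v.                        */
    void level_step_csc(int64_t n_global,
                        const int64_t *col_ptr, const int64_t *row_idx,
                        const uint64_t *cq, uint64_t *nq, int64_t *pred) {
        #pragma omp parallel for schedule(static)
        for (int64_t u = 0; u < n_global; u++) {
            if (!test_bit(cq, u)) continue;           /* u is not in the frontier   */
            for (int64_t e = col_ptr[u]; e < col_ptr[u + 1]; e++) {
                int64_t v = row_idx[e];               /* local vertex adjacent to u */
                if (pred[v] == -1 &&
                    __sync_bool_compare_and_swap(&pred[v], -1, u)) {
                    set_bit_atomic(nq, v);            /* v joins the next frontier  */
                }
            }
        }
    }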
Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks
TSUBAME 2.0 Supercomputer
TSUBAME 2.0 System Configuration
TSUBAME 2.0 Specification
CPU:              Intel Westmere EP (Xeon X5670, 2.93 GHz; L2 cache: 256 KB, L3: 12 MB), 2 sockets per node, 12 CPU cores per node (24 hardware threads with Hyper-Threading)
RAM:              54 GB
OS:               SUSE Linux Enterprise 11 (Linux kernel 2.6.32)
# of total nodes: 1466 (we only tested up to 128 nodes)
Network topology: full-bisection fat-tree
Network:          Voltaire / Mellanox dual-rail QDR InfiniBand (40 Gbps x 2 = 80 Gbps)
GPGPU:            three NVIDIA Fermi M2050 GPUs (not used for this work)
Compiler:         GCC 4.3.4 (-O3), OpenMP 3.0
MPI:              OpenMPI 1.5.3, MVAPICH 1.6.1
Strong-Scaling Performance Comparison
§ Although the replication-based methods outperform the non-replication based method, none of them scales to larger numbers of nodes: performance saturates at around 32 nodes.
[Plots omitted: TEPS and speedup against 1 node vs. # of nodes (1-128) for simple, replicated-csr, and replicated-csc; Scale 26, OpenMPI]
*1 Scale 26 per node is the largest problem size that one node (24 cores, 52 GB RAM) can handle, consuming 17.71 GB in the CSC format.
Weak-Scaling Performance Comparison
§ This experiment fixes the problem size per node, which lets us see whether linear scalability is achieved and how much of the performance degradation is due to communication and level synchronization.
§ All of the methods show performance degradation with larger numbers of nodes in the weak-scaling setting.
[Plots omitted: TEPS vs. # of nodes (1-64) for simple, replicated-csr, and replicated-csc; left: Scale 24 per node (total Scale 24-30), right: Scale 26 per node*1 (total Scale 26-32)]
Profiling Communication Message Size
§ With the non-replication based method, the communication message size becomes large around the middle levels (total number of levels: 8).
§ With the replication based method, the aggregated message size grows linearly with the number of nodes.
[Plots omitted: left: data size (bytes) per level (1-8) for simple at Scale 29 / 8 nodes, Scale 30 / 16 nodes, and Scale 31 / 32 nodes; right: aggregated message size vs. # of MPI processes (2-128) for the replicated methods; Scale 26, OpenMPI]
Profiling Execution Time at Each Level
• replicated-csc and simple show similar per-level profiles, because the replicated-csc implementation needs to examine all of the global vertices in CQ by checking the corresponding bit; thus, when CQ contains more vertices at higher levels, its processing time increases.
• In contrast, the replicated-csr implementation only checks whether adjacent "local" vertices are unvisited in CQ, so as the number of unvisited vertices decreases, the processing time also decreases.
[Plots omitted: execution time (seconds) per level (1-8) for replicated-csr, replicated-csc, and simple, each at Scale 26 on 1 node and Scale 31 on 32 nodes]
Profiling Computation, Communication, and Stall Times
• Communication and stall (synchronization) times grow with larger numbers of nodes.
[Plots omitted: elapsed time (seconds) broken down into computation, communication, and stall for replicated-csr (1-32 nodes) and replicated-csc (1-64 nodes); weak scaling, Scale 26 per node]
Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks
Related Work § A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L [Yoo, SC 2005] – Proposes a 2D partitioning technique and optimizations on BlueGene/L
§ Scalable Graph Exploration on Multicore Processors [Agarwal, SC 2010] – An efficient and scalable BFS algorithm for commodity multicore processors such as the 8-core Intel Nehalem EX processor
§ Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2 [Bader, ICPP 2006] § Accelerating large graph algorithms on the GPU using CUDA [Harish, HiPC 2007]
Concluding Remarks and Ongoing Work § Concluding Remarks – We demonstrated the performance characteristics of the reference implementations provided by Graph500 on commodity-based supercomputers such as TSUBAME 2.0 – We provide a thorough guide for high performance graph search algorithms on large-scale distributed environments
§ Ongoing Work – We designed and implemented our scalable and optimized BFS method on TSUBAME 2.0 based upon the thorough study published at IISWC 2011. – Looking forward to the next Graph500 ranking list, to be announced at SC2011 next week.
Our Highly Scalable BFS Method § We designed and implemented an optimized method based on 2D partitioning and various other optimizations such as communication compression and vertex sorting. § Our optimized implementation can solve BFS (Breadth First Search) on a large-scale graph with 2^36 (68.7 billion) vertices and 2^40 (1.1 trillion) edges in 10.58 seconds using 1366 nodes and 16392 CPU cores of TSUBAME 2.0. § This record corresponds to 103.9 GE/s (TEPS).
[Charts omitted: left: performance comparison with the reference implementations (simple, replicated-csr, and replicated-csc) at Scale 24 per node on up to 128 nodes; right: performance of our optimized implementation at Scale 26 per node on up to 1024 nodes, with data points of 11.3, 21.3, 37.2, 63.5, and 99.0 GE/s]
Questions?
Thank You