Performance Characterization of Graph500 on Large Scale Distributed Environment
Toyotaro Suzumura (1,2,4), Koji Ueno (1,4), Hitoshi Sato (1,4), Katsuki Fujisawa (1,3) and Satoshi Matsuoka (1,4)
1 Tokyo Institute of Technology, 2 IBM Research – Tokyo, 3 Chuo University, 4 JST CREST

2011 IEEE International Symposium on Workload Characterization

Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks


Large-Scale Graph Mining is Everywhere

Internet Map [lumeta.com]

Friendship Network [Moody ’01]

Food Web [Martinez ’91]

Protein Interactions [genomebiology.com]

Graph500 (http://www.graph500.org)
§ New graph-search-based benchmark for ranking supercomputers
§ BFS (Breadth-First Search) from a single vertex on a static, undirected Kronecker graph with average vertex degree 16
§ Evaluation criteria: TEPS (Traversed Edges Per Second), the largest problem size that can be solved on a system, and the minimum execution time
§ Problem size given by SCALE: # of vertices = 2^SCALE, # of edges = 2^(SCALE+4)
§ Reference MPI and shared-memory implementations provided
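As a quick illustration of these definitions, the snippet below computes the vertex and edge counts for a given SCALE and a TEPS figure from a hypothetical BFS time. It is only a worked example of the formulas above, not part of the benchmark code, and it assumes every edge is traversed.

```c
/* Worked example of the Graph500 size formulas and the TEPS metric.
 * Not benchmark code; the BFS time below is a made-up placeholder. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int scale = 26;
    int64_t nverts = INT64_C(1) << scale;        /* 2^SCALE vertices           */
    int64_t nedges = INT64_C(1) << (scale + 4);  /* 2^(SCALE+4) edges (deg 16) */
    double bfs_seconds = 1.0;                    /* hypothetical BFS time      */
    printf("vertices = %lld, edges = %lld, TEPS = %.3g\n",
           (long long)nverts, (long long)nedges,
           (double)nedges / bfs_seconds);        /* traversed edges per second */
    return 0;
}
```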

Graph500 Ranking in June 2011
[Table: top-ranked systems, with SCALE and the top-ranked score under the latest ranking rule]

Motivation of Our Work
§ To conduct a detailed performance analysis of the Graph500 reference implementations on the TSUBAME 2.0 supercomputer, ranked 5th in the June 2011 list.
§ To provide a detailed analysis for other researchers targeting a high performance score on their own supercomputers.

Benchmark Flow
1. Graph (edge list) generation
2. Kernel 1: graph construction (conversion to any space-efficient format such as CSR or CSC)
3. Sampling of 64 search keys
4. Kernel 2: Breadth-First Search (reference or customized implementation), iterated 64 times
5. Validation of the BFS tree (5 rules), performed for each of the 64 iterations

Kronecker Graph [Leskovec, PKDD 2005]
Graph500 adopts the Kronecker graph model [Leskovec, PKDD 2005], which simulates dynamic, time-evolving real networks with the following properties:
• Scale-free (power-law degree distribution)
• Small diameter, etc.
Kronecker graph generation model with initiator parameters A: 0.57, B: 0.19, C: 0.19, D: 0.05
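The sketch below illustrates how edges can be drawn from these initiator probabilities in an R-MAT-style recursive procedure. It is a deliberately simplified illustration in C using the standard rand() generator; the actual Graph500 generator uses a deterministic parallel PRNG, symmetrization, and edge-list permutation.

```c
/* Minimal sketch of Kronecker (R-MAT style) edge generation with the
 * Graph500 initiator probabilities A=0.57, B=0.19, C=0.19, D=0.05.
 * Illustrative simplification, not the reference generator. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static void kronecker_edge(int scale, int64_t *u, int64_t *v) {
    const double A = 0.57, B = 0.19, C = 0.19;  /* D = 1 - A - B - C = 0.05 */
    int64_t row = 0, col = 0;
    for (int bit = 0; bit < scale; bit++) {     /* recursively pick a quadrant */
        double r = (double)rand() / RAND_MAX;
        int i, j;
        if (r < A)              { i = 0; j = 0; }
        else if (r < A + B)     { i = 0; j = 1; }
        else if (r < A + B + C) { i = 1; j = 0; }
        else                    { i = 1; j = 1; }
        row = (row << 1) | i;
        col = (col << 1) | j;
    }
    *u = row; *v = col;
}

int main(void) {
    int scale = 10;                              /* 2^10 = 1024 vertices */
    int64_t m = (int64_t)16 << scale;            /* average degree 16    */
    for (int64_t e = 0; e < m; e++) {
        int64_t u, v;
        kronecker_edge(scale, &u, &v);
        printf("%lld %lld\n", (long long)u, (long long)v);
    }
    return 0;
}
```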

Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks


Level-Synchronized Breadth-First Search
• At any time, CQ (Current Queue) is the set of vertices that must be visited at the current level.
• At level 1, CQ contains the neighbors of r; at level 2, it contains their neighbors (the neighboring vertices that were not visited at levels 0 or 1).
• The algorithm maintains NQ (Next Queue), containing the vertices that should be visited at the next level. After visiting all of the vertices at a level, CQ and NQ are swapped.

Level-Synchronized BFS:
    for all vertices v in parallel do
        pred[v] ← -1
    pred[r] ← 0
    Enqueue(CQ, r)
    while CQ ≠ empty do
        NQ ← empty
        for all u in CQ in parallel do
            u ← Dequeue(CQ)
            for each v adjacent to u in parallel do
                if pred[v] = -1 then
                    pred[v] ← u
                    Enqueue(NQ, v)
        swap(CQ, NQ)
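A compact shared-memory rendering of this loop is sketched below, assuming a CSR graph; the array names (rowstart, adjacency) and the use of GCC atomic builtins are illustrative choices, not the reference implementation's identifiers. The distributed reference codes add partitioned queues, bitmaps, and MPI communication on top of this basic structure.

```c
/* Level-synchronized BFS sketch over a CSR graph (single node, OpenMP).
 * Illustrative only; names and helpers are not from the reference code. */
#include <stdint.h>
#include <stdlib.h>

void bfs_level_sync(int64_t nverts, const int64_t *rowstart,
                    const int64_t *adjacency, int64_t root, int64_t *pred) {
    int64_t *cq = malloc(nverts * sizeof *cq);   /* current-level queue */
    int64_t *nq = malloc(nverts * sizeof *nq);   /* next-level queue    */
    int64_t cq_len = 0, nq_len = 0;

    for (int64_t v = 0; v < nverts; v++) pred[v] = -1;
    pred[root] = root;                           /* mark the root as visited */
    cq[cq_len++] = root;

    while (cq_len > 0) {                         /* one iteration per level */
        nq_len = 0;
        #pragma omp parallel for
        for (int64_t i = 0; i < cq_len; i++) {
            int64_t u = cq[i];
            for (int64_t e = rowstart[u]; e < rowstart[u + 1]; e++) {
                int64_t v = adjacency[e];
                /* claim v atomically so only one thread becomes its parent */
                if (pred[v] == -1 &&
                    __sync_bool_compare_and_swap(&pred[v], -1, u)) {
                    int64_t pos = __sync_fetch_and_add(&nq_len, 1);
                    nq[pos] = v;                 /* v is visited at the next level */
                }
            }
        }
        int64_t *tmp = cq; cq = nq; nq = tmp;    /* swap CQ and NQ */
        cq_len = nq_len;
    }
    free(cq); free(nq);
}
```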

[Figure: step-by-step illustration of level-synchronized BFS on an example graph rooted at r]
Level 0: add the root vertex r to CQ (Current Queue); NQ (Next Queue) is empty.
Level 1: retrieve vertex r from CQ and insert its adjacent vertices into NQ.
Level 1: swap CQ and NQ.
Level 2: multiple threads simultaneously retrieve vertices B, C, D, E from CQ and insert their unvisited adjacent vertices into NQ.

Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks


Graph500 Reference Implementations
§ All of the MPI implementations are based on level-synchronized BFS.
§ The benchmark includes reference MPI implementations that fall into two categories according to whether CQ (Current Queue) is replicated across all nodes or not.

Category                Code name        Partitioning   Adjacency matrix format   Parallelism
Non-replicated method   simple           Horizontal     CSR                       Single thread
Non-replicated method   one_sided        Horizontal     CSR                       Single thread
Replicated method       replicated-csr   Vertical       CSR                       Multi-thread
Replicated method       replicated-csc   Vertical       CSC                       Multi-thread

Partitioning a Large-Scale Graph on Multiple Nodes
§ Both the vertices and the adjacency matrix are partitioned.
  – Each processor holds a block of vertices and the outgoing edges of its own vertices.

[Figure: the vertex set and the adjacency matrix are divided into blocks, one per processor]
Partitioning methods for the adjacency matrix:
• Horizontal partitioning (e.g., simple)
• Vertical partitioning (e.g., replicated)
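A minimal sketch of the 1D block ownership rule implied by this kind of partitioning, assuming vertex IDs are split into contiguous equal-sized blocks across MPI ranks; the helper names below are illustrative and differ from the reference code's macros.

```c
/* 1D block partitioning of vertices across MPI ranks (illustrative).
 * Each rank owns a contiguous block of the vertex ID range. */
#include <stdint.h>

static int64_t block_size(int64_t nverts, int nranks) {
    return (nverts + nranks - 1) / nranks;        /* ceiling division */
}

static int vertex_owner(int64_t v, int64_t nverts, int nranks) {
    return (int)(v / block_size(nverts, nranks)); /* rank that owns vertex v */
}

static int64_t vertex_local(int64_t v, int64_t nverts, int nranks) {
    return v % block_size(nverts, nranks);        /* index within the owner's block */
}
```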

Non-replication based method: simple
Each node has a partitioned block of CQ that contains only its own vertices. NQ becomes CQ at the next level.
[Figure: a vertex u is dequeued from the local CQ; its adjacent vertex v is sent to the owning processor via inter-processor communication, and the owner enqueues v into its NQ if v has not been visited yet (visited[v] = 0).]

Non-replication based method (contd.)
• Each processor exchanges edges with the other processors using the asynchronous communication functions MPI_Isend and MPI_Irecv.
• This involves all-to-all communication to exchange all of the edge information.
[Figure: processors exchanging edges]
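A simplified sketch of this exchange is given below, assuming each rank has already bucketed the frontier edges destined for every other rank and that the counts have been agreed on beforehand (e.g., with an MPI_Alltoall of counts); the buffer layout and names are illustrative, not the reference implementation's.

```c
/* All-to-all exchange of frontier edges with nonblocking MPI calls
 * (illustrative sketch; recvcnt is assumed to have been exchanged first). */
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

void exchange_edges(int64_t **sendbuf, const int *sendcnt,
                    int64_t **recvbuf, const int *recvcnt,
                    int nranks, int me) {
    MPI_Request *reqs = malloc(2 * nranks * sizeof *reqs);
    int nreq = 0;
    for (int r = 0; r < nranks; r++) {           /* post all receives first */
        if (r == me) continue;
        MPI_Irecv(recvbuf[r], recvcnt[r], MPI_INT64_T, r, 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
    }
    for (int r = 0; r < nranks; r++) {           /* then post all sends */
        if (r == me) continue;
        MPI_Isend(sendbuf[r], sendcnt[r], MPI_INT64_T, r, 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE); /* complete all transfers */
    free(reqs);
}
```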

Replication-based method
• To quickly find the set of vertices to be handled at each level, the current queue (CQ), represented as a bitmap, is replicated among all the processors.
• Each processor holds a replica of the whole CQ and builds its own block of the partitioned NQ.
• Each processor sends its NQ block to all other processors; the combined NQ becomes the CQ of the next level.
• The algorithm of this phase differs between replicated-csr and replicated-csc.

Replication-based method (contd.)
• Each processor has a set of its own vertices.
• At the synchronization point of each level, all processors send a copy of their own vertices (their NQ block) to all other processors with the MPI_Allgather collective operation.
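A minimal sketch of this step, assuming CQ/NQ are bitmaps with one bit per global vertex and each rank owns an equal-sized slice; the function and parameter names are illustrative.

```c
/* Gather each rank's NQ bitmap slice into the replicated CQ for the next
 * level (illustrative; the reference code also sets, clears, and tests bits
 * with its own helper macros). */
#include <mpi.h>
#include <stdint.h>

void advance_level(uint64_t *cq_bitmap,       /* replicated: one bit per vertex */
                   const uint64_t *nq_local,  /* this rank's slice of NQ        */
                   int64_t words_per_rank,    /* 64-bit words per slice         */
                   MPI_Comm comm) {
    /* Every rank contributes its slice; afterwards every rank holds the
     * complete bitmap of vertices to visit at the next level. */
    MPI_Allgather((void *)nq_local, (int)words_per_rank, MPI_UINT64_T,
                  cq_bitmap, (int)words_per_rank, MPI_UINT64_T, comm);
}
```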

Replication-based method: replicated-csr
Each MPI process looks only at its local vertices and checks whether an adjacent vertex exists in CQ; if so, the local vertex is set in the local NQ, which is eventually allgathered among all the MPI processes. CQ (and NQ) are bitmap arrays.
[Figure: for each local vertex v, follow the CSR row pointer to its adjacent vertices u; if CQ[u] = 1, then set NQ[v] ← 1.]
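A sketch of this per-level scan is shown below, assuming a CSR structure over the local vertices, bitmaps indexed by global vertex ID, and GCC atomic builtins; all identifiers are illustrative rather than the reference code's.

```c
/* replicated-csr style level expansion (illustrative sketch).
 * For each unvisited local vertex v, test whether any neighbor u is in the
 * replicated CQ bitmap; if so, v joins the local NQ. */
#include <stdint.h>

#define TEST_BIT(bm, i) ((bm)[(i) >> 6] >> ((i) & 63) & 1)
#define SET_BIT(bm, i)  __sync_fetch_and_or(&(bm)[(i) >> 6], UINT64_C(1) << ((i) & 63))

void scan_csr(int64_t nlocal, const int64_t *rowstart, const int64_t *adj,
              const uint64_t *cq, uint64_t *nq_local, int64_t *pred) {
    #pragma omp parallel for
    for (int64_t v = 0; v < nlocal; v++) {
        if (pred[v] != -1) continue;                 /* already visited       */
        for (int64_t e = rowstart[v]; e < rowstart[v + 1]; e++) {
            int64_t u = adj[e];                      /* global neighbor ID    */
            if (TEST_BIT(cq, u)) {                   /* neighbor is in CQ?    */
                pred[v] = u;
                SET_BIT(nq_local, v);                /* v visited next level  */
                break;
            }
        }
    }
}
```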

Replication-based method: replicated-csc
§ Each MPI process has the replicated CQ, which records for every global vertex whether it is in CQ. In contrast, NQ holds only the local vertices that the MPI process is handling.
§ All of the vertices in CQ are processed in parallel by multiple threads spawned with OpenMP. If a vertex exists in CQ, the process finds the set of its local vertices adjacent to that global vertex via the CSC column pointer and adds them to NQ (NQ[v] ← 1).
§ This requires looking at all of the global vertices, but the memory access is contiguous compared with the random access pattern of replicated-csr.
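A corresponding sketch of the CSC-side scan, under the same illustrative bitmap and naming assumptions as the CSR sketch above:

```c
/* replicated-csc style level expansion (illustrative sketch).
 * Walk all global vertices u; if u is in CQ, mark every unvisited local
 * vertex v in u's column as a member of the local NQ. */
#include <stdint.h>

#define TEST_BIT(bm, i) ((bm)[(i) >> 6] >> ((i) & 63) & 1)
#define SET_BIT(bm, i)  __sync_fetch_and_or(&(bm)[(i) >> 6], UINT64_C(1) << ((i) & 63))

void scan_csc(int64_t nglobal, const int64_t *colstart, const int64_t *rows,
              const uint64_t *cq, uint64_t *nq_local, int64_t *pred) {
    #pragma omp parallel for
    for (int64_t u = 0; u < nglobal; u++) {          /* contiguous scan of CQ  */
        if (!TEST_BIT(cq, u)) continue;
        for (int64_t e = colstart[u]; e < colstart[u + 1]; e++) {
            int64_t v = rows[e];                     /* local vertex adjacent to u */
            /* claim v atomically so only one thread becomes its parent */
            if (pred[v] == -1 &&
                __sync_bool_compare_and_swap(&pred[v], -1, u)) {
                SET_BIT(nq_local, v);
            }
        }
    }
}
```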

Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks


TSUBAME 2.0 Supercomputer
[Figure: TSUBAME 2.0 system configuration]

TSUBAME 2.0 Specification

CPU               : Intel Westmere-EP (Xeon X5670, 2.93 GHz; L2 cache: 256 KB, L3: 12 MB), 2 sockets per node, 12 CPU cores per node (24 hardware threads with Hyper-Threading)
RAM               : 54 GB
OS                : SUSE Linux Enterprise 11 (Linux kernel 2.6.32)
# of total nodes  : 1466 nodes (we only tested up to 128 nodes)
Network topology  : full-bisection fat-tree
Network           : Voltaire / Mellanox dual-rail QDR InfiniBand (40 Gbps x 2 = 80 Gbps)
GPGPU             : three NVIDIA Fermi M2050 GPUs per node (not used in this work)
GCC and OpenMP    : GCC 4.3.4 (-O3), OpenMP 3.0
MPI               : OpenMPI 1.5.3, MVAPICH 1.6.1

Strong-Scaling Performance Comparison
Although the replication-based methods outperform the non-replication-based method, none of them scales to larger node counts; performance saturates at around 32 nodes.
[Figure: TEPS (left) and speedup against 1 node (right) for simple, replicated-csr, and replicated-csc on 1 to 128 nodes; Scale 26, OpenMPI]

Weak-Scaling Performance Comparison
§ This experiment fixes the problem size per node, which lets us see whether linear scalability is achieved and how much of the performance degradation is due to communication and level synchronization.
§ All of the methods show performance degradation at larger node counts in the weak-scaling setting.
[Figure: TEPS for simple, replicated-csr, and replicated-csc on 1 to 64 nodes, with Scale 24 per node (left, overall Scale 24 to 30) and Scale 26 per node*1 (right, overall Scale 26 to 32)]
*1 Scale 26 per node is the largest problem size that one node can handle (24 cores, 52 GB RAM), consuming 17.71 GB in the CSC format.

Profiling Communication Message Size
§ With the non-replication-based method, the communication message size peaks around the middle levels (total number of levels: 8).
§ With the replication-based methods, the aggregated message size grows linearly with the number of nodes.
[Figure (left): per-level data size (bytes) for simple at Scale 29 on 8 nodes, Scale 30 on 16 nodes, and Scale 31 on 32 nodes (sim-29-8, sim-30-16, sim-31-32)]
[Figure (right): aggregated message size for the replicated methods versus the number of MPI processes (2 to 128); Scale 26, OpenMPI]
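The linear growth for the replicated methods is consistent with a simple model: at every level each process receives the NQ slices of all other processes via MPI_Allgather, so the aggregated traffic is roughly (P-1) times the bitmap size. The snippet below is only a back-of-the-envelope estimate under that assumption (one bit per vertex, no protocol overhead), not a reproduction of the measured figures.

```c
/* Back-of-the-envelope estimate of aggregated allgather traffic per BFS
 * level for the replicated methods (illustrative model only). */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int scale = 26;                              /* fixed total problem size */
    int64_t nverts = INT64_C(1) << scale;
    int64_t bitmap_bytes = nverts / 8;           /* one bit per vertex       */
    for (int p = 2; p <= 128; p *= 2) {
        /* each of the p processes receives (p-1)/p of the bitmap,
         * so the aggregate over all processes is (p-1) * bitmap_bytes */
        double aggregated = (double)(p - 1) * (double)bitmap_bytes;
        printf("P = %3d  ~%.1f MB aggregated per level\n",
               p, aggregated / (1 << 20));
    }
    return 0;
}
```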

Profiling Execution Time at Each Level
• replicated-csc and simple show similar behavior: the replicated-csc implementation must scan every global vertex and check the corresponding bit in CQ, so at levels where CQ contains more vertices, its processing time increases.
• In contrast, the replicated-csr implementation only checks whether the adjacent vertices of still-unvisited local vertices are in CQ, so as the number of unvisited vertices decreases, the processing time also decreases.
[Figure: per-level execution time in seconds (levels 1 to 8) for replicated-csr, replicated-csc, and simple, each at Scale 26 on 1 node and Scale 31 on 32 nodes (rep-26-1, rep-31-32, csc-26-1, csc-31-32, sim-26-1, sim-31-32)]

Profiling Computation, Communication, and Stall Times
• Communication and stall (synchronization) times grow with a larger number of nodes.
[Figure: elapsed time in seconds broken down into computation, communication, and stall for replicated-csr (1 to 32 nodes) and replicated-csc (1 to 64 nodes); weak scaling, Scale 26 per node]

Outline § Introduction to Graph500 § Parallel BFS Algorithm § Graph500 Reference Implementations § Performance Characterization of Graph500 on TSUBAME 2.0 § Concluding Remarks


Related Work
§ A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L [Yoo, SC 2005]
  – Proposes a 2D partitioning technique and optimizations on BlueGene/L
§ Scalable Graph Exploration on Multicore Processors [Agarwal, SC 2010]
  – An efficient and scalable BFS algorithm for commodity multicore processors such as the 8-core Intel Nehalem-EX
§ Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2 [Bader, ICPP 2006]
§ Accelerating Large Graph Algorithms on the GPU Using CUDA [Harish, HiPC 2007]

Concluding Remarks and Ongoing Work
§ Concluding remarks
  – We demonstrated the performance characteristics of the reference implementations provided by Graph500 on a commodity supercomputer, TSUBAME 2.0.
  – The analysis provides a thorough guide toward high-performance graph search on large-scale distributed environments.
§ Ongoing work
  – We designed and implemented a scalable, optimized BFS method on TSUBAME 2.0 based on the study published at IISWC 2011.
  – We look forward to the next Graph500 ranking list, to be announced at SC2011 next week.


Our Highly Scalable BFS Method
§ We designed and implemented an optimized method based on 2D partitioning and various other optimizations, such as communication compression and vertex sorting.
§ Our optimized implementation can solve BFS on a large-scale graph with 2^36 (68.7 billion) vertices and 2^40 (1.1 trillion) edges in 10.58 seconds using 1366 nodes and 16392 CPU cores of TSUBAME 2.0.
§ This record corresponds to 103.9 GE/s (TEPS).
[Figure (left): TEPS (GE/s) versus number of nodes (up to 128), comparing the optimized implementation with the reference implementations (simple, replicated-csr, replicated-csc) at Scale 24 per node]
[Figure (right): TEPS (GE/s) of the optimized implementation at Scale 26 per node on up to 1024 nodes, with labeled points at 11.3, 21.3, 37.2, 63.5, and 99.0 GE/s]

Questions?

Thank You
