Exploiting Multi-core Architectures in Clusters for Enhancing the Performance of the Parallel Bootstrap Simulation Algorithm
The 12th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2011)

César A. F. De Rose, Paulo Fernandes, Antonio M. Lima, Afonso Sales, Thais Webber, Avelino F. Zorzo

Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
PaleoProspec Project - PUCRS/Petrobras
Also funded by CAPES and CNPq - Brazil

Context: Interest
The solution of large and complex state-based stochastic models to extract performance indices.

Application domains: Biology, Physics, Social Sciences, Business, Computer Science and Telecommunications, etc.

Modeling examples
- ASP (Alternate Service Patterns): describes an open queueing network with servers that map P different service patterns
- FAS (First Available Server): indicates the availability of N servers
- RS (Resource Sharing): maps R shared resources to P processes

Context: Problem
- Large models admit a huge number of possible configurations
- A traditional numerical solution becomes intractable and heavily dependent on the available computational resources

Solution alternatives
- Numerical: bounded by memory and computational power
  - Iterative methods: Power [Stewart 94], GMRES [Saad and Schultz 86], Arnoldi [Arnoldi 51]
- Simulation: lack of accuracy when generating samples
  - Methods: Traditional [Ross 96], Monte Carlo [Häggström 02], Backward [Propp and Wilson 96], Bootstrap [Czekster et al. 10]
    ⋆ reliable estimations
    ⋆ high computational cost to generate repeated batches of samples


Context: Simulation of Markovian models
- Definitions: initial state, trajectory length (number of samples)
- Main idea: perform a random walk over the set of possible states the system can assume (the simulation trajectory), computing an approximation of the steady-state probability distribution (a standard estimator is written out below)
- Generation of independent samples for later statistical analysis
- Large models imply large memory costs
- Remark: longer trajectories are needed for more accurate results
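In symbols (a standard formulation; the slides state the idea only in words): for a trajectory s_1, ..., s_n over the state space S, the steady-state estimate for state i is the fraction of steps the walk spends in i,

    π̂[i] = (1/n) · Σ_{t=1..n} 1[s_t = i],    i ∈ S,

which approaches the true distribution as the trajectory length n grows.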

Outline
Goal: a faster numerical solution using the Bootstrap simulation algorithm, exploiting parallelism in a multi-core SMP cluster

Strategy: parallel approaches
- using only MPI primitives (pure-MPI)
- hybrid programming model with MPI and OpenMP:
  - fine grain (Hybrid-I)
  - coarse grain (Hybrid-II)

Results and discussion: performance issues of the parallel Bootstrap implementations

Bootstrap simulation: Algorithm schema

[Figure: schema of the bootstrapping process. A single simulation trajectory visits states s0, s1, s2, ... over time steps 0, 1, 2, 3, ..., n, starting from the initial state. At every step, each of the z bootstrap counters K_1, ..., K_z is updated; computing then turns each counter into a frequency vector x̄_1, ..., x̄_z, and normalization averages them into the estimate

    π[i] = (x̄_1[i] + x̄_2[i] + ··· + x̄_z[i]) / z    for each state i]

01. α ← U(0..n̄−1)
02. π ← 0
03. K ← 0
04. s ← s0
05. for t ← 1 to n do
06.     snew ← φ(s, U(0..1))
07.     for b ← 1 to z do
08.         for c ← 1 to n̄ do
09.             if (U(0..n̄−1) == α) then
10.                 K_b[snew] ← K_b[snew] + 1
11.         end for
12.     end for
13.     s ← snew
14. end for
15. for b ← 1 to z do
16.     ω ← 0
17.     for i ← 1 to |S| do
18.         ω ← ω + K_b[i]
19.     for i ← 1 to |S| do
20.         x̄_b[i] ← K_b[i] / ω
21. end for
22. for i ← 1 to |S| do
23.     for b ← 1 to z do
24.         π[i] ← π[i] + x̄_b[i]
25.     π[i] ← π[i] / z
26. end for
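A minimal C rendering of this kernel, to make the control flow concrete. It is a sketch only: the 3-state transition function phi(), the rand()-based uniform generator, and the names S, Z, urand, and bootstrap_sim are illustrative assumptions, not the authors' implementation.

/* Sequential sketch of the Bootstrap simulation kernel
 * [Czekster et al. 10], following the numbered pseudocode above. */
#include <stdlib.h>

#define S 3        /* number of states (hypothetical toy model) */
#define Z 36       /* number of bootstraps (z), as in the talk  */

/* Hypothetical transition function phi: next state from the current
 * state and a uniform draw in [0,1). A real model would encode its
 * transition structure here. */
static int phi(int s, double u) {
    return (s + (u < 0.5 ? 1 : S - 1)) % S;    /* toy ring walk */
}

static double urand(void) { return rand() / (RAND_MAX + 1.0); }

void bootstrap_sim(long n, long nbar, double pi[S]) {
    long K[Z][S] = {{0}};                       /* lines 01-04 */
    long alpha = (long)(urand() * nbar);
    int s = 0;

    for (long t = 1; t <= n; t++) {             /* lines 05-14 */
        int snew = phi(s, urand());
        for (int b = 0; b < Z; b++)             /* line 07 */
            for (long c = 0; c < nbar; c++)     /* line 08: resampling */
                if ((long)(urand() * nbar) == alpha)
                    K[b][snew]++;
        s = snew;
    }

    double xbar[Z][S];                          /* lines 15-21 */
    for (int b = 0; b < Z; b++) {
        double omega = 0.0;
        for (int i = 0; i < S; i++) omega += K[b][i];
        for (int i = 0; i < S; i++) xbar[b][i] = K[b][i] / omega;
    }
    for (int i = 0; i < S; i++) {               /* lines 22-26 */
        pi[i] = 0.0;
        for (int b = 0; b < Z; b++) pi[i] += xbar[b][i];
        pi[i] /= Z;
    }
}

For a quick smoke test, small values such as n = 10000 and n̄ = 100 keep the run time negligible; the talk's production runs use n up to 1e+09.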

Bootstrap simulation: Processing costs

The cost drivers are:
- the trajectory length (n)
- the number of bootstraps (z)
- the number of trials (n̄) in the resampling process

z × n̄ samplings take place at every trajectory step.

[Figure: Time (s), from 1 to 100000 on a log scale, and Maximum Absolute Error, from 0 to 4.50e-03, plotted against the simulation trajectory length (1e+05 to 1e+09) for the FAS, ASP, and RS models]
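Since z × n̄ samplings occur at every one of the n trajectory steps, the total resampling work grows as the product of the three parameters (the product is implicit in the slide, not stated as a formula):

    total samplings = n × z × n̄

which is why the measured times grow steeply with the trajectory length.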

Proposal
Parallel Bootstrap implementations that change the workload distribution and the programming model


Parallel approaches: Environment
- Multi-core SMP cluster: 8 nodes (Gigabit Ethernet network)
- Each node: two Intel Xeon E5520 quad-core processors with Hyper-Threading technology (16 logical processors) and 16 GB RAM
- Linux O.S., OpenMPI 1.4.2, OpenMP 2.5

Experiments
- 30 trials, taking a 95% confidence interval
- 1 (sequential), 2, 3, 4, 5, 6, 7, and 8 nodes
- z = 36 bootstraps and n = 1e+06, 1e+07, 1e+08, and 1e+09

Number of bootstraps assigned to nodes in each configuration:

configuration | bootstraps per node
      1       | 36
      2       | 18 18
      3       | 12 12 12
      4       | 9 9 9 9
      5       | 7 7 7 7 8
      6       | 6 6 6 6 6 6
      7       | 5 5 5 5 5 5 6
      8       | 4 4 4 4 5 5 5 5

Parallel approaches (pure-MPI)

Pure MPI implementation
- Only MPI primitives
- Split the bootstrap sampling tasks over C processing nodes
- [Figure: nodes #1, #2, ..., #C, each running the bootstrapping process as a sequential computation over its assigned number of bootstraps]
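A sketch of how this split might look in MPI, under the same toy assumptions as the sequential kernel above. bootstrap_partial() is a hypothetical variant of that kernel that processes zb bootstraps and accumulates the sum of their normalized vectors; it is only declared here and assumed to be linked in, since the talk shows no source code.

/* Pure-MPI sketch: each rank handles its share of the z = 36
 * bootstraps sequentially; rank 0 reduces the partial sums. */
#include <mpi.h>

#define S 3
#define Z 36

void bootstrap_partial(long n, long nbar, int zb, double acc[S]);

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* near-even split, matching the bootstrap table above:
     * e.g. 7 ranks -> 6,5,5,5,5,5,5 and 8 ranks -> 5,5,5,5,4,4,4,4 */
    int zb = Z / nprocs + (rank < Z % nprocs ? 1 : 0);

    double acc[S] = {0}, pi[S];
    bootstrap_partial(1000000L, 1000L, zb, acc);  /* n, nbar illustrative */

    /* combine the partial sums and average over all z bootstraps */
    MPI_Reduce(acc, pi, S, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (int i = 0; i < S; i++) pi[i] /= Z;

    MPI_Finalize();
    return 0;
}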

Parallel approaches (pure-MPI results)

[Figure: execution time broken down into Next-State, Bootstrapping, Normalization, Communication, and Computing, for configurations 1-8 of the ASP, FAS, and RS models. Four panels: large models (millions of states) and small models (hundreds of states), each for n = 1e+06 (times up to ~30 s) and n = 1e+09 (times up to ~18000 s)]

Parallel approaches (fine grain)

Hybrid MPI/OpenMP implementation (Hybrid-I)
- Hybrid programming: MPI between nodes, OpenMP inside each node
- Intra-node parallelism: a #pragma omp parallel for is applied to the bootstrap loop at line 07 of the pseudocode (the slide repeats the excerpt of lines 04-15)
- [Figure: nodes #1, #2, ..., #C, each running the bootstrapping process as a parallel computation over its number of threads]
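A sketch of the fine-grain step under the same toy assumptions as before. Thread safety of the random draws is handled here with POSIX rand_r() and one seed per bootstrap; the talk does not say which generator the authors used, so that part is an assumption.

/* Hybrid-I (fine grain) sketch: the bootstrap loop at line 07 is
 * parallelized, so a parallel region is opened at every one of the
 * n trajectory steps. Called once per step from the time loop. */
#include <stdlib.h>

#define S 3
#define Z 36

void step_fine_grain(long nbar, long alpha, int snew,
                     long K[Z][S], unsigned seeds[Z]) {
    #pragma omp parallel for schedule(static)
    for (int b = 0; b < Z; b++)                   /* line 07 */
        for (long c = 0; c < nbar; c++)           /* line 08 */
            if (rand_r(&seeds[b]) % nbar == alpha)
                K[b][snew]++;   /* row b is touched by exactly one
                                   thread, so there are no races */
}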

Parallel approaches (Hybrid-I results)

[Figure: execution time broken down into Next-State, Bootstrapping, Normalization, Communication, and Computing, for 1, 2, 4, and 8 threads on only one node, for the ASP, FAS, and RS models; n = 1e+06 (times up to ~70 s) and n = 1e+09 (times up to ~70000 s)]

A parallel region is created at each step in the simulation process.

Parallel approaches (coarse grain)

Hybrid MPI/OpenMP implementation (Hybrid-II)
- Hybrid programming: MPI and OpenMP
- Intra-node parallelism: a single parallel region integrates the whole simulation process, spanning lines 05 to 14 of the pseudocode
- [Figure: nodes #1, #2, ..., #C, each running the bootstrapping process as a parallel computation over its number of threads]
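A sketch of the coarse-grain variant, again under the toy assumptions used throughout: one parallel region spans the entire trajectory loop, avoiding Hybrid-I's per-step region creation. The exact synchronization scheme is not shown in the talk; the single/for barriers below are one plausible way to keep the shared state consistent.

/* Hybrid-II (coarse grain) sketch: the parallel region encloses the
 * whole trajectory loop (lines 05-14). One thread advances the shared
 * state; the implicit barriers of `single` and `for` order the step. */
#include <stdlib.h>

#define S 3
#define Z 36

int phi(int s, double u);   /* transition function, as sketched earlier */

void trajectory_coarse_grain(long n, long nbar, long alpha,
                             long K[Z][S], unsigned seeds[Z]) {
    int s = 0, snew = 0;    /* shared across the thread team */
    #pragma omp parallel
    {
        for (long t = 1; t <= n; t++) {                 /* line 05 */
            #pragma omp single
            snew = phi(s, rand() / (RAND_MAX + 1.0));   /* line 06 */
            /* implicit barrier: every thread sees the new snew */

            #pragma omp for schedule(static)
            for (int b = 0; b < Z; b++)                 /* lines 07-12 */
                for (long c = 0; c < nbar; c++)
                    if (rand_r(&seeds[b]) % nbar == alpha)
                        K[b][snew]++;
            /* implicit barrier at the end of the omp for */

            #pragma omp single
            s = snew;                                   /* line 13 */
        }
    }
}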

Parallel approaches (Hybrid-II results)

[Figure: execution time broken down into Next-State + Bootstrapping, Normalization, Communication, and Computing, for the ASP, FAS, and RS models under configurations 1-8, with the number of threads per node in parentheses: 1 (16), 2 (16), 3 (12), 4 (9), 5 (8), 6 (6), 7 (6), 8 (5). Four panels: large models and small models, for n = 1e+06 (times up to ~12 s) and n = 1e+09 (times up to ~2500 s)]

Parallel approaches (Hybrid-II results)

Speedup vs. Efficiency
[Figure: speedup and efficiency (%) for configurations 1-8 of the ASP, FAS, and RS models at n = 1e+09; two panels: large models and small models, with speedup axes up to ~40 and efficiency axes up to ~120%]
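For reference, the two plotted metrics follow their usual definitions (the slides do not spell the formulas out; here p is the number of processing units, T_seq the sequential time, and T_p the parallel time):

    speedup(p) = T_seq / T_p
    efficiency(p) = speedup(p) / p × 100%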

Conclusion and future works

Summary
- Parallel performance analysis of the Bootstrap simulation algorithm, exploiting different characteristics of multi-core SMP clusters
- The algorithm generates samples independently, which makes it trivial to parallelize
- Considerable speedups were achieved for very large models
- Processing demands depend only on the trajectory length (n)
- Communication demands depend only on the size of the model

Conclusion and future works

Future works
- Practical:
  - an efficient implementation of the transition function to compute samples
  - adapting parallel sampling techniques to mitigate the effort of simulating structured Markovian models
- Theoretical:
  - further study of the impact of the number of bootstraps on the accuracy of the simulation results
  - incorporating the parallel sampling process into more sophisticated simulation approaches, such as Perfect Sampling

Thank you for your attention
