HPC Colony II
Terry Jones, Colony Project Principal Investigator
Outline
• Colony Motivation & Goals
• Approach & Research
• Strategy for Application Complexity
• Strategy for Data Hierarchy Stretching
• Strategy for OS Scalability Issues
• Strategy for Resource Management
• Strategy for Fault Tolerance
• Acknowledgements
What is Colony?
• System software project funded by DOE Office of Science FastOS Award
• Partners include University of Illinois at Urbana-Champaign & IBM Research
  – Terry Jones, coordinating PI
  – Laxmikant Kale, UIUC PI
  – Jose Moreira, IBM Research PI
• Three years completed
• Funding for three more years awarded
• http://www.hpc-colony.org

"Most application developers would like to focus their attention on the domain aspects of their applications. Although their understanding of the problem will help them in finding potential sources of concurrency, managing that concurrency in more detail is difficult and error prone." – "Getting Up To Speed: The Future of Supercomputing", National Research Council, 2005
It is the Best of Times…
• It's common to hear computer simulation mentioned in the same breath as scientific theory and empirical research
• Hardware technology is rapidly advancing
• Computers are vastly more powerful than ever before
It is the Worst of Times…
• Computer Science is young
• Software is struggling to keep up with hardware
• Fundamental needs in HPC remain unmet
Today's Architecture Trend: The Growth of Cores Per Supercomputer
[Chart: growth in core counts per supercomputer over time]
† source: Jack Dongarra, top500.org
Today's Reliability Trends: An Application Perspective
• Automated job monitoring and restart is a necessity for running big jobs on existing large-scale systems
• Running 41 million PE hours over 16 weeks on Red Storm, Purple, and BG/L, it was typical to restart applications 10-20 times per day †
[Chart: breakdown across system hardware, system software, application errors, and human error]
† J.T. Daly, "Facilitating High-Throughput ASC Calculations", ADTSC Nuclear Weapons Highlights '07
Today's Application Trends: Larger & More Complicated
• Lightweight Kernel May Not Be Enough
  – Evaluation of 7 apps/libs: 78 system calls, 45 satisfied by lightweight kernels
  – I/O, sockets, signals, fork/exec
  – Good but not complete coverage by lightweight kernels
  – Exceptions: fork/exec, mmap, some socket calls
  – BGL and Red Storm had largely the same coverage
• It's More Than Just The Apps
[Chart: Linux system call count by kernel version, from 2.4.2 through 2.6.18; y-axis: count of system calls, 0-350]
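The coverage gaps above are easy to demonstrate with a tiny probe. The sketch below (illustrative only, not part of the Colony evaluation) exercises fork/exec and mmap, two of the calls cited as lightweight-kernel exceptions, and reports whether they succeed on the node where it runs.

    // Illustrative probe: does this node's kernel support fork/exec and mmap,
    // two calls that lightweight kernels have historically left out?
    #include <cstdio>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
      // fork/exec: spawn a trivial child process
      pid_t pid = fork();
      if (pid < 0) {
        perror("fork");
      } else if (pid == 0) {
        execlp("true", "true", (char*)nullptr);   // replace child with /bin/true
        _exit(127);                               // exec failed
      } else {
        int status = 0;
        waitpid(pid, &status, 0);
        printf("fork/exec: %s\n",
               (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? "ok" : "failed");
      }

      // mmap: request one anonymous page
      void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      printf("mmap: %s\n", p == MAP_FAILED ? "failed" : "ok");
      if (p != MAP_FAILED) munmap(p, 4096);
      return 0;
    }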
Today’s Application Trends: Larger & More Complicated
• Evaluation of 7 apps/libs: 78 system calls, 45 satisfied by lightweight kernels
• Emerging Needs
  – Coupled applications
  – Profiling needs, debugging needs, …
  – Accelerators & heterogeneous architectures
The Brewing Storm
• OS interference
  – IBM Allreduce performance decried at ScicomP and SPXXL conferences in 2001 (LLNL, ORNL, NAVO)
  – Effects documented in multiple papers
• Data Hierarchy Stretching
  – It's not enough to have faster cores; you need to be able to avoid stalls in the critical path
• Programming Environment
  – POSIX-like interface including threads desired by some apps
  – Development tools desired by most apps
• Dealing with Faults
  – Applications desire progress despite increased component counts
• Dealing with Massive Parallelism
  – Too much to do by hand
Colony Approach
• App Complexity -> Don't Limit Development Environment
• Increased Nodes, Emergence of Clouds & Heterogeneous Computing -> Infrastructure for Communication Overlays
• OS Scalability -> Parallel Aware Scheduling
• Declining Application Interrupt Time -> Fault Tolerance through Processor Virtualization
• Application Load Imbalances -> Adaptive Load Balancing through Processor Virtualization

Question: How much should system software offer in terms of features?
Answer: Everything required, and as much desired as possible.
Ameliorating Application Complexity
• Feature-rich development environment
  – Eliminate black-box syndrome
  – Eliminate scaling problems associated with feature-rich environments
• Source of hangs (hw, system sw, app)
• Subset attach
• Smarter, possibly asynchronous, compute node daemons
  – "I want to know right now if the system is okay"
  – "Debug my application without system administration"
The Growing Chasm
• The success of processors is making balanced systems difficult
• Coupled programming (e.g., climate codes) adds vertical pressure
• Multiple networks add vertical pressure
• Cloud computing and heterogeneous computing place vertical pressure
• Secondary (stable) storage adds vertical pressure
• I/O is a major bottleneck in many parallel applications
[Diagram: memory/storage hierarchy spanning L1 cache, L2 cache, local memory, remote memory, disk drives, and tape drives; access time improves toward the top of the hierarchy, capacity toward the bottom]
Colony's Strategy: Provide Communication Infrastructure to Help
• We cannot adequately address an overarching solution to the challenges of Data Hierarchy Stretching
• Focus on key areas complementary to our scope
  – Communication: permit scalable infrastructure
  – Communication: keep performance (latency, bandwidth, join/leave, …)
  – Parallel I/O: reduce time spent with checkpoint/restart
  – Parallel I/O: make daemons possible (e.g., permit non-blocking I/O)
  – Remote memory: overlay communication technology
SpiderCast for High Performance Computing
• A scalable, fully distributed messaging, membership, and monitoring infrastructure
• Develop a standalone distributed infrastructure that utilizes peer-to-peer and overlay networking technologies, while exploiting HPC platform-unique features and architecture
• Focus on:
  – Membership – report which processes are alive; discover and report failing processes
  – Monitoring – collect load / performance statistics
  – Scalable group services – multicast and lightweight pub/sub
• A set of services targeted at:
  – Increasing performance & scalability of scientific computing, by providing these services to load balancing, scheduling, fault tolerance, and parallel resource management system software
  – Enabling general-purpose workloads by providing missing distributed software services and components at the OS / middleware level
SpiderCast Services
• Light-weight topic-based publish/subscribe messaging
  – Send messages to a group (topic) / receive messages from a group (topic)
  – Application-level multicast
  – Allows load monitoring, implementation of shared state, etc.
• Interest-aware membership service
  – Which nodes are up (failure detection)
  – What are the topics of interest of each node
  – Several degrees of QoS (full view, partial view, w/o interest)
• Attribute service
  – Efficiently propagate slowly changing state information (node attributes)
  – Per-node Map-like API – putAttribute, getAttribute, etc.
  – Can be used for dissemination of: deployed services, supported protocols, load, statistics, etc.
• Overlay access
  – Get the list of immediate neighbors
  – Send a message to a neighbor
  – Allows custom distributed algorithms to be implemented on top of the SpiderCast overlay
• ConvergeCast
  – Efficiently aggregate a response from many to one
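To make the service list above concrete, here is a hedged C++ sketch of how a node-level agent might use a SpiderCast-style interface. Only putAttribute/getAttribute are named on the slide; every other identifier (SpiderCastNode, subscribe, publish, neighbors, …) is a hypothetical placeholder, not the actual SpiderCast API.

    // Hypothetical usage sketch of a SpiderCast-style interface (placeholder API).
    #include <iostream>
    #include <string>
    #include <vector>

    struct Message { std::string topic, payload; };

    class SpiderCastNode {                          // stand-in for the real service
    public:
      void subscribe(const std::string& topic)      { /* join interest group */ }
      void publish(const Message& m)                { /* application-level multicast */ }
      void putAttribute(const std::string& k, const std::string& v) { /* attribute service */ }
      std::string getAttribute(int nodeId, const std::string& k)    { return ""; }
      std::vector<int> neighbors() const            { return {}; }  // overlay access
    };

    int main() {
      SpiderCastNode node;
      node.subscribe("load-balance");                      // interest-aware membership
      node.putAttribute("load", "0.85");                   // slowly changing per-node state
      node.publish({"load-balance", "rebalance-request"}); // topic-based pub/sub
      for (int peer : node.neighbors())                    // custom algorithms on the overlay
        std::cout << "neighbor " << peer << "\n";
      return 0;
    }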
OS Scheduling on a 2x4 System
[Diagram: per-core timelines for a 2x4 system (Node1a-Node1d, Node2a-Node2d), shown for two scheduling scenarios]
Scaling with 2% Noise
[Chart: ALLREDUCE results – timings vs. node count (1024-16384) for CNK, Colony with SchedMods (quiet), Colony with SchedMods (2% noise), Colony (quiet), and Colony (2% noise)]
[Chart: GLOB results – timings vs. node count (1024-16384) for the same five configurations]
Scaling with 30% Noise
[Chart: Allreduce results – timings vs. node count (1024-8192) for CNK, Colony with SchedMods (quiet), Colony with SchedMods (30% noise), Colony (quiet), and Colony (30% noise)]
[Chart: GLOB results – timings vs. node count (1024-8192) for the same five configurations]
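For context, results like those summarized above are typically gathered with a simple timing loop around MPI_Allreduce. The sketch below shows that style of microbenchmark; it is a generic illustration, not the actual code behind these charts.

    // Generic allreduce timing loop of the kind used to expose OS noise at scale.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int iters = 1000;
      double in = 1.0, out = 0.0;

      MPI_Barrier(MPI_COMM_WORLD);                  // start everyone together
      double t0 = MPI_Wtime();
      for (int i = 0; i < iters; ++i)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      double t1 = MPI_Wtime();

      // Report the slowest rank's average time: noise on any one core delays all.
      double local = (t1 - t0) / iters, worst = 0.0;
      MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
      if (rank == 0)
        printf("mean allreduce time (worst rank): %g s\n", worst);
      MPI_Finalize();
      return 0;
    }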
Charm++
• Parallel library for object-oriented C++ applications
  – Invoke functions remotely
  – Messaging via remote method calls
  – Methods called by scheduler; system determines who runs next
• Multiple objects per processor
• Object migration fully supported
  – Even with broadcasts, reductions
Charm++ Features: Object Arrays
• Applications are written as a set of communicating objects
[Diagram: user's view – an array of objects A[0], A[1], A[2], A[3], …, A[n]]
Charm++ Features: Object Arrays
• Charm++ maps those objects onto processors, routing messages as needed
• Virtualization leads to message-driven execution
[Diagram: user's view of the object array A[0]…A[n] vs. the system view, where the elements are distributed across processors]
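A minimal Charm++ sketch of the object-array idea above: an interface (.ci) declaration plus the C++ bodies. The runtime, not the programmer, decides where each array element lives and routes the broadcast. Module and method names here are illustrative; the constructs (array [1D], proxies, ckNew, entry methods) are standard Charm++.

    // hello.ci -- interface description processed by the Charm++ translator
    mainmodule hello {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg* m);
        entry void done();
      };
      array [1D] A {
        entry A();
        entry void compute(int step);
      };
    };

    // hello.C -- C++ bodies; the runtime maps A[0..n-1] onto processors
    #include "hello.decl.h"
    CProxy_Main mainProxy;

    class Main : public CBase_Main {
      int finished = 0, n = 8;
    public:
      Main(CkArgMsg* m) {
        delete m;
        mainProxy = thisProxy;
        CProxy_A a = CProxy_A::ckNew(n);   // create n migratable objects
        a.compute(0);                      // broadcast: invoked on every element
      }
      void done() { if (++finished == n) CkExit(); }
    };

    class A : public CBase_A {
    public:
      A() {}
      A(CkMigrateMessage* m) {}            // migration constructor
      void compute(int step) {
        CkPrintf("A[%d] running step %d on PE %d\n", thisIndex, step, CkMyPe());
        mainProxy.done();                  // remote method call back to the main chare
      }
    };
    #include "hello.def.h"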
Processor Virtualization with Migratable Objects
• Divide the computation into a large number of pieces
  – Independent of the number of processors
• Let the runtime system map objects to processors
• Implementations: Charm++, Adaptive-MPI (AMPI)
[Diagram: user view of many objects vs. the system implementation mapping them onto processors P0, P1, P2]
AMPI: MPI with Virtualization
• Each MPI process implemented as a user-level thread embedded in a Charm++ object
[Diagram: MPI "processes" implemented as virtual processes (user-level migratable threads), mapped onto real processors]
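As a concrete illustration, an ordinary MPI program like the sketch below can run under AMPI with more ranks than physical cores, because each rank is a migratable user-level thread. The build/run commands in the comment reflect my understanding of AMPI's usual +vp option and may differ by version.

    // Plain MPI code; under AMPI each rank becomes a user-level migratable thread,
    // so e.g. 64 ranks can share 8 cores. Typical usage (may vary by version):
    //   ampicxx ring.C -o ring ; charmrun ./ring +p8 +vp64
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, nranks;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);   // number of *virtual* processes under AMPI

      // Trivial neighbor exchange in a ring, just to exercise communication.
      int left = (rank + nranks - 1) % nranks, right = (rank + 1) % nranks;
      int sendval = rank, recvval = -1;
      MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                   &recvval, 1, MPI_INT, left, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      printf("virtual rank %d of %d received %d\n", rank, nranks, recvval);
      MPI_Finalize();
      return 0;
    }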
Results from 20,480 Processors
• Results from BGW day (TJ Watson Research Center)
• Cosmological code ChaNGa
• Results for basic load balancers
Load Balancing on Large Machines
• Existing load balancing strategies don't scale on extremely large machines
  – Consider an application with 1M objects on 64K processors
  – Centralized: object load and communication data sent to one processor, which makes decisions
    • Becomes a bottleneck
  – Distributed: load balancing among neighboring processors
    • Does not achieve good balance quickly
• Hybrid (Gengbin Zheng, PhD thesis)
  – Processors are divided into independent groups, and groups are organized in hierarchies (decentralized)
  – Each group has a leader (the central node) which performs centralized load balancing
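To show where the load balancer plugs in, here is a hedged sketch of the usual Charm++ pattern: an array element opts in with usesAtSync, calls AtSync() at iteration boundaries, and resumes in ResumeFromSync() after objects have been migrated. Which strategy runs (centralized, neighborhood, or the hybrid scheme above) is normally a runtime choice (e.g., a +balancer option); the worker logic below is illustrative.

    // Sketch of measurement-based load balancing in a Charm++ array element.
    // (iterate is assumed to be declared as an entry method in the .ci file.)
    class Worker : public CBase_Worker {
      int step = 0;
    public:
      Worker() { usesAtSync = true; }           // opt in to AtSync-style load balancing
      Worker(CkMigrateMessage* m) {}
      void pup(PUP::er& p) { p | step; }        // serialize state so the object can migrate
                                                // (plus any base-class pup your version requires)
      void iterate() {
        doLocalWork(step);                      // application-specific computation (illustrative)
        if (++step % 20 == 0)
          AtSync();                             // hand control to the load balancer
        else
          thisProxy[thisIndex].iterate();       // re-enqueue the next step via the scheduler
      }
      void ResumeFromSync() {                   // called once rebalancing/migration is done
        thisProxy[thisIndex].iterate();
      }
    private:
      void doLocalWork(int s) { /* ... */ }
    };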
Fault Tolerance Enabled by Charm++
• Automatic checkpointing / fault detection / restart (see the sketch below)
  – Scheme 1: checkpoint to file system
  – Scheme 2: in-memory checkpointing
• Proactive reaction to impending faults
  – Migrate objects when a fault is imminent
  – Keep "good" processors running at full pace
  – Refine load balance after migrations
• Scalable fault tolerance
  – Use message logging to tolerate frequent faults in a scalable fashion
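Below is a hedged sketch of how an application driver might invoke the two checkpointing schemes. CkStartCheckpoint (to the file system) and CkStartMemCheckpoint (in-memory) are the Charm++ entry points as I understand them, so treat the exact names and signatures as assumptions; the surrounding class is illustrative.

    // Sketch: periodic checkpointing from a Charm++ main chare.
    // (Worker, doStep, endOfIteration, and resumeWork assumed declared in the .ci file.)
    class Main : public CBase_Main {
      CProxy_Worker workers;                    // chare array doing the real work
      int iter = 0, checkpointPeriod = 100;
    public:
      Main(CkArgMsg* m) { delete m; workers = CProxy_Worker::ckNew(1024); resumeWork(); }
      void endOfIteration() {
        ++iter;
        if (iter % checkpointPeriod == 0) {
          CkCallback cb(CkIndex_Main::resumeWork(), thisProxy);
          CkStartCheckpoint("ckpt", cb);        // Scheme 1: checkpoint to the file system
          // CkStartMemCheckpoint(cb);          // Scheme 2: in-memory checkpointing
        } else {
          resumeWork();
        }
      }
      void resumeWork() { workers.doStep(iter); }  // broadcast the next step to the array
    };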
Fast Restart
• Message logging allows fault-free processors to continue with their execution
• However, sooner or later some processors start waiting for the crashed processor
• Virtualization allows us to move work from the restarted processor to waiting processors
• Chares are restarted in parallel
• Restart cost can be reduced
Performance of Proactive Scheme
[Chart: iteration time of Sweep3d on 32 processors for a 150^3 problem with 1 warning]
Scalable Fault Tolerance
• Basic idea: if one out of 100,000 processors fails, we shouldn't have to send the "innocent" 99,999 processors scurrying back to their checkpoints and duplicate all the work since their last checkpoint
• Basic scheme (see the sketch below):
  – Everyone logs messages sent to others
  – Asynchronous checkpoints
  – On failure:
    • The objects from the failed processors are resurrected (from their checkpoints) on other processors
    • Their acquaintances re-send the messages logged since the last checkpoint
    • The failed objects catch up with the rest, and continue
• Of course, several wrinkles and issues arise
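The sketch below illustrates the sender-side logging and replay idea in the basic scheme above. It is a conceptual C++ illustration only, not the actual Charm++ message-logging protocol; all class and method names here are invented for the example.

    // Conceptual sketch of sender-side message logging and replay.
    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    struct LoggedMsg { int seq; std::string payload; };

    class MessageLogger {
      std::map<int, std::vector<LoggedMsg>> log;  // destination object id -> logged messages
      std::map<int, int> nextSeq;                 // per-destination sequence numbers
    public:
      // Called on every send: record the message before it leaves this processor.
      LoggedMsg record(int dest, const std::string& payload) {
        LoggedMsg m{nextSeq[dest]++, payload};
        log[dest].push_back(m);
        return m;
      }
      // Called after 'dest' checkpoints: entries covered by the checkpoint can be dropped.
      void truncate(int dest, int checkpointedSeq) {
        auto& v = log[dest];
        v.erase(std::remove_if(v.begin(), v.end(),
                [&](const LoggedMsg& m) { return m.seq <= checkpointedSeq; }), v.end());
      }
      // Called when 'dest' is resurrected elsewhere: replay everything since its checkpoint,
      // so only the failed objects redo work while everyone else keeps going.
      const std::vector<LoggedMsg>& replay(int dest) const { return log.at(dest); }
    };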
Benefit of Virtualization in the Fault-Free Case
[Charts: NAS benchmarks (MG, CG, LU, and SP, class B) – Mflops vs. processor count for AMPI, AMPI-FT with multiple virtual processors, and AMPI-FT with 1 virtual processor]
Composition of Recovery Time
[Chart: restart time for an MPI 7-point stencil with 3D decomposition on 16 processors, with varying numbers of virtual processors]
Fault Tolerance: Status and Directions
• Message logging integrated into the regular Charm++ distribution
• Performing a detailed comparison of the various schemes
  – Testing both with kernels and full applications
• Investigating enhancements to the message-logging protocol:
  – Overhead minimization by grouping processors
  – Stronger coupling to load balancing
• Partial funding between Colony-1 and Colony-2:
  – Fulbright fellowship for a graduate student at UIUC
…and in conclusion…
For Further Info
• http://www.hpc-colony.org
• http://charm.cs.uiuc.edu
• http://www.research.ibm.com/bluegene

Partnerships and Acknowledgements
• DOE Office of Science
• Colony Team
  – Core: Terry Jones (ORNL), Jose Moreira (IBM), Eliezer Dekel (IBM), Roie Melamed (IBM), Yoav Tock (IBM), Laxmikant Kale (UIUC), Celso Mendes (UIUC), Esteban Meneses (UIUC)
  – Extended: Bob Wisniewski (IBM), Todd Inglett (IBM), Andrew Tauferner (IBM), Edi Shmueli (IBM), Gera Goft (IBM), Avi Teperman (IBM), Gregory Chockler (IBM), Sayantan Chakravorty (UIUC)