HPC Colony II
Terry Jones, Colony Project Principal Investigator

Outline   Colony Motivation & Goals   Approach & Research   Strategy for Application Complexity   Strategy for Data Hierarchy Stretching   Strategy for OS Scalability Issues   Strategy for Resource Management   Strategy for Fault Tolerance   Acknowledgements


Outline   Colony Motivation & Goals   Approach & Research   Strategy for Application Complexity   Strategy for Data Hierarchy Stretching   Strategy for OS Scalability Issues   Strategy for Resource Management   Strategy for Fault Tolerance   Acknowledgements

3

Managed by UT-Battelle for the U.S. Department of Energy

Colony II Presentation at FASTOS

What is Colony?
• System software project funded by DOE Office of Science FastOS award
• Partners include the University of Illinois at Urbana-Champaign & IBM Research
  – Terry Jones, coordinating PI
  – Laxmikant Kale, UIUC PI
  – Jose Moreira, IBM Research PI
• Three years completed
• Funding for three more years awarded
• http://www.hpc-colony.org

“Most application developers would like to focus their attention on the domain aspects of their applications. Although their understanding of the problem will help them in finding potential sources of concurrency, managing that concurrency in more detail is difficult and error prone.”
– “Getting Up To Speed: The Future of Supercomputing”, National Research Council, 2005

It is the Best of Times…
• It’s common to hear computer simulation mentioned in the same breath as scientific theory and empirical research
• Hardware technology is rapidly advancing
• Computers are vastly more powerful than ever before

It is the Worst of Times…
• Computer Science is young
• Software is struggling to keep up with hardware
• Fundamental needs in HPC remain unmet




Today’s Architecture Trend: The Growth of Cores Per Supercomputer
[Chart: cores per supercomputer over time. †source: Jack Dongarra, top500.org]



Today’s Reliability Trends: An Application Perspective
• Automated job monitoring and restart is a necessity for running big jobs on existing large-scale systems
• Running 41 million PE-hours over 16 weeks on Red Storm, Purple, and BG/L, it was typical to restart applications 10–20 times per day†
  – System hardware
  – System software
  – Application errors
  – Human error

† J.T. Daly, “Facilitating High-Throughput ASC Calculations”, ADTSC Nuclear Weapons Highlights ’07

Today’s Application Trends: Larger & More Complicated
• Lightweight kernel may not be enough
  – Evaluation of 7 apps/libs: 78 system calls, 45 satisfied by lightweight kernels
  – I/O, sockets, signals, fork/exec
  – Good but not complete coverage by lightweight kernels
  – Exceptions: fork/exec, mmap, some socket calls
  – BG/L and Red Storm had largely the same coverage
• It’s more than just the apps
[Chart: count of Linux system calls by kernel version, 2.4.2 through 2.6.18]

Today’s Application Trends: Larger & More Complicated
• Evaluation of 7 apps/libs: 78 system calls, 45 satisfied by lightweight kernels
• Emerging needs
  – Coupled applications
  – Profiling needs, debugging needs, …
  – Accelerators & heterogeneous architectures

The Brewing Storm
• OS interference
  – IBM Allreduce performance decried at ScicomP and SPXXL conferences in 2001 (LLNL, ORNL, NAVO)
  – Effects documented in multiple papers
• Data Hierarchy Stretching
  – It’s not enough to have faster cores – you need to be able to avoid stalls in the critical path
• Programming Environment
  – POSIX-like interface including threads desired by some apps
  – Development tools desired by most apps
• Dealing with Faults
  – Applications desire progress despite increased component counts
• Dealing with Massive Parallelism
  – Too much to do by hand

Colony Approach
• App Complexity -> Don’t Limit the Development Environment
• Increased Nodes, Emergence of Clouds & Heterogeneous Computing -> Infrastructure for Communication Overlays
• OS Scalability -> Parallel-Aware Scheduling
• Declining Application Interrupt Time -> Fault Tolerance through Processor Virtualization
• Application Load Imbalances -> Adaptive Load Balancing through Processor Virtualization

Question: How much should system software offer in terms of features?
Answer: Everything required, and as much desired as possible

Outline   Colony Motivation & Goals   Approach & Research   Strategy for Application Complexity   Strategy for Data Hierarchy Stretching   Strategy for OS Scalability Issues   Strategy for Resource Management   Strategy for Fault Tolerance   Acknowledgements

12

Managed by UT-Battelle for the U.S. Department of Energy

Colony II Presentation at FASTOS

Ameliorating Application Complexity
• Feature-rich development environment
• Eliminate black-box syndrome
• Eliminate the scaling problems associated with feature-rich environments
  – Source of hangs (hw, system sw, app)
  – Subset attach
  – Smarter, possibly asynchronous, compute-node daemons
  – “I want to know right now if the system is okay”
  – “Debug my application without system administration”

Outline   Colony Motivation & Goals   Approach & Research   Strategy for Application Complexity   Strategy for Data Hierarchy Stretching   Strategy for OS Scalability Issues   Strategy for Resource Management   Strategy for Fault Tolerance   Acknowledgements

14

Managed by UT-Battelle for the U.S. Department of Energy

Colony II Presentation at FASTOS

The Growing Chasm
– The success of processors is making balanced systems difficult
• Coupled programming (e.g., climate codes) adds vertical pressure
• Multiple networks add vertical pressure
• I/O is a major bottleneck in many parallel applications
• Cloud computing and heterogeneous computing place vertical pressure
• Secondary (stable) storage adds vertical pressure
[Diagram: the data hierarchy from L1 cache and L2 cache through local memory, remote memory, disk drives, and tape drives; access time improves toward the processor, capacity improves toward storage]

Colony’s Strategy: Provide Communication Infrastructure to Help
• We cannot adequately address an overarching solution to the challenges of data hierarchy stretching
• Focus on key areas complementary to our scope:
  – Communication: permit scalable infrastructure
  – Communication: keep performance (latency, bandwidth, join/leave, …)
  – Parallel I/O: reduce time spent in checkpoint/restart
  – Parallel I/O: make daemons possible (e.g., permit non-blocking I/O)
  – Remote memory: overlay communication technology

SpiderCast for High Performance Computing
• A scalable, fully distributed messaging, membership, and monitoring infrastructure
• Develop a standalone distributed infrastructure that utilizes peer-to-peer and overlay networking technologies while exploiting unique features of the HPC platform and architecture
• Focus on:
  – Membership – report which processes are alive; discover and report failing processes
  – Monitoring – collect load / performance statistics
  – Scalable group services – multicast and lightweight pub/sub
• A set of services targeted at:
  – Increasing the performance & scalability of scientific computing by providing these services to load balancing, scheduling, fault tolerance, and parallel resource management system software
  – Enabling general-purpose workloads by providing missing distributed software services and components at the OS / middleware level

SpiderCast Services
• Lightweight topic-based publish/subscribe messaging
  – Send messages to a group (topic) / receive messages from a group (topic)
  – Application-level multicast
  – Allows load monitoring, implementation of shared state, etc.
• Interest-aware membership service
  – Which nodes are up (failure detection)
  – What the topics of interest of each node are
  – Several degrees of QoS (full view, partial view, w/o interest)
• Attribute service
  – Efficiently propagate slowly changing state information (node attributes)
  – Per-node Map-like API – putAttribute, getAttribute, etc.
  – Can be used for dissemination of: deployed services, supported protocols, load, statistics, etc.
• Overlay access
  – Get the list of immediate neighbors
  – Send messages to a neighbor
  – Allows custom distributed algorithms to be implemented on top of the SpiderCast overlay
• ConvergeCast
  – Efficiently aggregate a response from many nodes to one

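To make the services above concrete, here is a small single-process C++ mock of the pub/sub and attribute services (a sketch only: the MockSpiderCast class and its method signatures are illustrative and, apart from putAttribute/getAttribute named on the slide, are not the actual SpiderCast API).

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Single-process stand-in for a SpiderCast-style endpoint (hypothetical API).
class MockSpiderCast {
  // topic -> subscriber callbacks (application-level multicast)
  std::map<std::string, std::vector<std::function<void(const std::string&)>>> subs_;
  // node -> (key -> value): per-node, slowly changing attribute maps
  std::map<std::string, std::map<std::string, std::string>> attrs_;
public:
  void subscribe(const std::string& topic,
                 std::function<void(const std::string&)> cb) {
    subs_[topic].push_back(std::move(cb));
  }
  void publish(const std::string& topic, const std::string& msg) {
    for (auto& cb : subs_[topic]) cb(msg);   // deliver to every subscriber
  }
  void putAttribute(const std::string& node,
                    const std::string& key, const std::string& value) {
    attrs_[node][key] = value;
  }
  std::string getAttribute(const std::string& node, const std::string& key) {
    return attrs_[node][key];
  }
};

int main() {
  MockSpiderCast sc;
  // A load balancer subscribes to per-node load reports.
  sc.subscribe("load", [](const std::string& m) {
    std::cout << "load report: " << m << "\n";
  });
  // A compute-node daemon publishes its load and advertises its services.
  sc.publish("load", "node17 cpu=0.93");
  sc.putAttribute("node17", "services", "io-forwarder,debug-agent");
  std::cout << "node17 services: " << sc.getAttribute("node17", "services") << "\n";
  return 0;
}

In the real system these calls would, of course, travel over the peer-to-peer overlay rather than a local map.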

Outline   Colony Motivation & Goals   Approach & Research   Strategy for Application Complexity   Strategy for Data Hierarchy Stretching   Strategy for OS Scalability Issues   Strategy for Resource Management   Strategy for Fault Tolerance   Acknowledgements

19

Managed by UT-Battelle for the U.S. Department of Energy

Colony II Presentation at FASTOS

OS Scheduling on a 2x4 System
[Diagram: two timelines showing when work runs on each of eight cores (Node1a–Node1d, Node2a–Node2d) of a 2-node, 4-core-per-node system]

Scaling with 2% Noise
[Charts: ALLREDUCE and GLOB timings versus node count (1024–16384) for CNK, Colony with scheduler modifications (quiet and 2% noise), and Colony without scheduler modifications (quiet and 2% noise)]

Scaling with 30% Noise
[Charts: Allreduce and GLOB timings (log scale) versus node count (1024–8192) for CNK, Colony with scheduler modifications (quiet and 30% noise), and Colony without scheduler modifications (quiet and 30% noise)]

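For context, the ALLREDUCE curves above reflect the cost of tightly coupled collectives under injected noise. A minimal MPI microbenchmark of that general shape might look like the sketch below (illustrative only; this is not the benchmark that produced the slide’s data).

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int iters = 1000;
  double in = rank, out = 0.0;

  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  for (int i = 0; i < iters; ++i)
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  double local = MPI_Wtime() - t0, worst = 0.0;

  // OS noise on any single node delays everyone, so report the slowest rank.
  MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (rank == 0)
    std::printf("%d ranks: %.1f us per allreduce (slowest rank)\n",
                size, 1e6 * worst / iters);

  MPI_Finalize();
  return 0;
}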
Outline   Colony Motivation & Goals   Approach & Research   Strategy for Application Complexity   Strategy for Data Hierarchy Stretching   Strategy for OS Scalability Issues   Strategy for Resource Management   Strategy for Fault Tolerance   Acknowledgements

23

Managed by UT-Battelle for the U.S. Department of Energy

Colony II Presentation at FASTOS

Charm++   Parallel library for Object-Oriented C++ applications   Invoke functions remotely   Messaging via remote method calls   Methods called by scheduler   System determines who runs next

  Multiple objects per processor   Object migration fully supported   Even with broadcasts, reductions

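A minimal Charm++ sketch of these ideas, modeled on the standard array “hello” example: remotely invocable methods are declared in a .ci interface file, each array element is a migratable object, and method invocations are asynchronous messages (file names and the element count below are illustrative).

---- hello.ci (Charm++ interface file) ----
mainmodule hello {
  readonly CProxy_Main mainProxy;
  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done();
  };
  array [1D] Hello {
    entry Hello();
    entry void sayHi();
  };
};

---- hello.cpp ----
#include "hello.decl.h"          // generated by charmc from hello.ci

/* readonly */ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int reported = 0;
  int nElements = 8;
public:
  Main(CkArgMsg* m) {
    delete m;
    mainProxy = thisProxy;
    CProxy_Hello arr = CProxy_Hello::ckNew(nElements);  // create a chare array
    arr.sayHi();                 // broadcast: invoke sayHi() on every element
  }
  void done() {                  // invoked remotely by each element
    if (++reported == nElements) CkExit();
  }
};

class Hello : public CBase_Hello {
public:
  Hello() {}
  Hello(CkMigrateMessage*) {}    // constructor used when an element migrates
  void sayHi() {
    CkPrintf("element %d running on PE %d\n", thisIndex, CkMyPe());
    mainProxy.done();            // asynchronous remote method invocation
  }
};

#include "hello.def.h"

A typical build runs charmc on hello.ci and then on hello.cpp, and the job is launched with charmrun; exact commands depend on the installation.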

Charm++ Features: Object Arrays
• Applications are written as a set of communicating objects
[Diagram: user’s view of an object array A[0], A[1], A[2], A[3], …, A[n]]

Charm++ Features: Object Arrays
• Charm++ maps those objects onto processors, routing messages as needed
• Virtualization leads to message-driven execution
[Diagram: the user’s view A[0] … A[n] alongside the system view, in which elements such as A[0] and A[3] reside on particular processors]

Processor Virtualization with Migratable Objects
• Divide the computation into a large number of pieces
  – Independent of the number of processors
• Let the runtime system map objects to processors
• Implementations: Charm++, Adaptive MPI (AMPI)
[Diagram: the user view of many objects versus the system implementation, which maps them onto processors P0, P1, and P2]

AMPI: MPI with Virtualization
• Each MPI process is implemented as a user-level thread embedded in a Charm++ object
[Diagram: MPI “processes” implemented as virtual processes (user-level migratable threads) mapped onto real processors]

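Because AMPI implements the MPI interface, an ordinary MPI program can be virtualized without source changes. The sketch below is plain MPI; the AMPI build and run commands in the comments are typical of an AMPI installation but are assumptions here (exact tool names and options vary by version).

// Build and run (typical AMPI usage; assumption, may vary by installation):
//   ampicxx -o allsum allsum.cpp
//   ./charmrun +p4 ./allsum +vp16      // 16 virtual MPI ranks on 4 cores
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);  // under AMPI this is the *virtual* rank count

  double local = rank + 1.0, sum = 0.0;
  MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  if (rank == 0)
    std::printf("%d (virtual) ranks, sum = %.1f\n", size, sum);

  MPI_Finalize();
  return 0;
}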

Results from 20,480 Processors
• Results from BGW day (TJ Watson Research Center)
• Cosmological code ChaNGa
• Results for basic load balancers

Load Balancing on Large Machines
• Existing load balancing strategies don’t scale on extremely large machines
  – Consider an application with 1M objects on 64K processors
  – Centralized: object load and communication data are sent to one processor, which makes the decisions
    • Becomes a bottleneck
  – Distributed: load balancing among neighboring processors
    • Does not achieve good balance quickly
• Hybrid (Gengbin Zheng, PhD thesis)
  – Processors are divided into independent sets of groups, and the groups are organized in hierarchies (decentralized)
  – Each group has a leader (the central node) which performs centralized load balancing within its group (see the sketch below)

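To illustrate the centralized step a group leader performs, here is a simple greedy sketch that assigns the heaviest objects first to the currently least-loaded processor in the group (a sketch of the general technique only, not the actual Charm++ hybrid load balancer).

#include <algorithm>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

struct Obj { int id; double load; };

// Returns (object id, processor) placements for one group of processors.
std::vector<std::pair<int,int>> greedyAssign(std::vector<Obj> objs, int nprocs) {
  // Heaviest objects first.
  std::sort(objs.begin(), objs.end(),
            [](const Obj& a, const Obj& b) { return a.load > b.load; });
  // Min-heap of (current load, processor id).
  using PL = std::pair<double,int>;
  std::priority_queue<PL, std::vector<PL>, std::greater<PL>> procs;
  for (int p = 0; p < nprocs; ++p) procs.push({0.0, p});

  std::vector<std::pair<int,int>> placement;
  for (const Obj& o : objs) {
    PL least = procs.top(); procs.pop();          // least-loaded processor
    placement.push_back({o.id, least.second});
    procs.push({least.first + o.load, least.second});
  }
  return placement;
}

int main() {
  std::vector<Obj> objs = {{0, 3.0}, {1, 1.0}, {2, 2.5}, {3, 0.5}, {4, 2.0}};
  for (auto [id, proc] : greedyAssign(objs, 2))
    std::cout << "object " << id << " -> proc " << proc << "\n";
  return 0;
}

In the hierarchical scheme, each leader runs only over its own group’s objects and processors, so the cost of this centralized step stays bounded as the machine grows.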

Outline   Colony Motivation & Goals   Approach & Research   Strategy for Application Complexity   Strategy for Data Hierarchy Stretching   Strategy for OS Scalability Issues   Strategy for Resource Management   Strategy for Fault Tolerance   Acknowledgements

31

Managed by UT-Battelle for the U.S. Department of Energy

Colony II Presentation at FASTOS

Fault Tolerance Enabled by Charm++
• Automatic checkpointing / fault detection / restart
  – Scheme 1: checkpoint to the file system
  – Scheme 2: in-memory checkpointing
• Proactive reaction to impending faults
  – Migrate objects when a fault is imminent
  – Keep “good” processors running at full pace
  – Refine load balance after migrations
• Scalable fault tolerance
  – Using message logging to tolerate frequent faults in a scalable fashion

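In Charm++, Scheme 1 is typically triggered from application code with CkStartCheckpoint, and Scheme 2 with CkStartMemCheckpoint. The fragment below sketches the disk variant; the module and directory names are illustrative, and the in-memory variant is assumed to require a Charm++ build with fault-tolerance support.

---- ckpt.ci (illustrative) ----
mainmodule ckpt {
  mainchare Main {
    entry Main(CkArgMsg *m);
  };
};

---- ckpt.cpp ----
#include "ckpt.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    delete m;
    // Scheme 1: checkpoint all chares to the file system; a failed run is
    // later relaunched with "+restart ckpt_dir" on the command line.
    CkCallback done(CkCallback::ckExit);     // exit once the checkpoint completes
    CkStartCheckpoint("ckpt_dir", done);
    // Scheme 2 would call CkStartMemCheckpoint(done) instead for in-memory
    // (double) checkpointing (assumption: requires a fault-tolerant
    // Charm++ build).
  }
};

#include "ckpt.def.h"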

Fast Restart
• Message logging allows fault-free processors to continue with their execution
• However, sooner or later some processors start waiting for the crashed processor
• Virtualization allows us to move work from the restarted processor to the waiting processors
• Chares are restarted in parallel
• Restart cost can be reduced


Performance of Proactive Scheme
[Figure: iteration time of Sweep3d on 32 processors for a 150³ problem with one fault warning]

Scalable Fault Tolerance
• Basic idea: if one out of 100,000 processors fails, we shouldn’t have to send the “innocent” 99,999 processors scurrying back to their checkpoints and duplicate all the work since their last checkpoint
• Basic scheme:
  – Everyone logs the messages they send to others
  – Asynchronous checkpoints
  – On failure:
    • The objects from the failed processor are resurrected (from their checkpoints) on other processors
    • Their acquaintances re-send the messages sent since the last checkpoint
    • The failed objects catch up with the rest and continue
• Of course, several wrinkles and issues arise

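The basic scheme can be illustrated with a toy, single-process C++ simulation of sender-side logging and replay (purely illustrative: this is not the Charm++ message-logging protocol, which must also handle message ordering, determinism, asynchronous checkpoints, and log garbage collection).

#include <iostream>
#include <string>
#include <vector>

struct Msg { int to; std::string body; };

struct Proc {
  int id = 0;
  int state = 0;                      // trivial stand-in for application state
  int checkpointedState = 0;
  std::vector<Msg> sendLog;           // sender-side log since the last checkpoint

  void send(Proc& dst, const std::string& body) {
    sendLog.push_back({dst.id, body});  // log a copy of every outgoing message
    dst.receive(body);
  }
  void receive(const std::string& body) { state += (int)body.size(); }
  void checkpoint() { checkpointedState = state; sendLog.clear(); }
};

int main() {
  Proc a{0}, b{1}, c{2};
  a.send(b, "work-1"); c.send(b, "work-2");
  a.checkpoint(); b.checkpoint(); c.checkpoint();
  a.send(b, "work-3"); c.send(b, "work-4");

  // "Processor" b fails: only b rolls back. a and c keep their state and
  // simply replay the logged messages addressed to b since b's checkpoint.
  b.state = b.checkpointedState;
  for (const Msg& m : a.sendLog) if (m.to == b.id) b.receive(m.body);
  for (const Msg& m : c.sendLog) if (m.to == b.id) b.receive(m.body);
  std::cout << "b recovered state = " << b.state << "\n";
  return 0;
}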

Benefit of virtualization in the fault-free case: NAS benchmarks
[Charts: Mflop/s versus processor count for the NAS benchmarks MG, SP, CG, and LU (class B), comparing AMPI, AMPI-FT with one virtual processor per processor, and AMPI-FT with multiple virtual processors]

Composition of recovery time
[Chart: restart time for an MPI 7-point stencil with 3D decomposition on 16 processors with varying numbers of virtual processors]

Fault Tolerance: Status and Directions
• Message logging integrated into the regular Charm++ distribution
• Performing a detailed comparison of the various schemes
  – Testing with both kernels and full applications
• Investigating enhancements to the message-logging protocol:
  – Overhead minimization by grouping processors
  – Stronger coupling to load balancing
• Partial funding between Colony-1 and Colony-2:
  – Fulbright fellowship for a graduate student at UIUC


Outline   Colony Motivation & Goals   Approach & Research   Strategy for Application Complexity   Strategy for Data Hierarchy Stretching   Strategy for OS Scalability Issues   Strategy for Resource Management   Strategy for Fault Tolerance   Acknowledgements

39

Managed by UT-Battelle for the U.S. Department of Energy

Colony II Presentation at FASTOS

…and in conclusion…
• For further info
  – http://www.hpc-colony.org
  – http://charm.cs.uiuc.edu
  – http://www.research.ibm.com/bluegene
• Partnerships and acknowledgements
  – DOE Office of Science
  – Colony Team
    Core: Terry Jones (ORNL), Jose Moreira (IBM), Eliezer Dekel (IBM), Roie Melamed (IBM), Yoav Tock (IBM), Laxmikant Kale (UIUC), Celso Mendes (UIUC), Esteban Meneses (UIUC)
    Extended: Bob Wisniewski (IBM), Todd Inglett (IBM), Andrew Tauferner (IBM), Edi Shmueli (IBM), Gera Goft (IBM), Avi Teperman (IBM), Gregory Chockler (IBM), Sayantan Chakravorty (UIUC)
