HPC Colony II
Terry Jones, Colony Project Principal Investigator
Outline
• Colony Motivation & Goals
• Approach & Research
• Strategy for Application Complexity
• Strategy for Data Hierarchy Stretching
• Strategy for OS Scalability Issues
• Strategy for Resource Management
• Strategy for Fault Tolerance
• Acknowledgements
What is Colony?
• System software project funded by DOE Office of Science FastOS Award
• Partners include University of Illinois at Urbana-Champaign & IBM Research
  – Terry Jones, coordinating PI
  – Laxmikant Kale, UIUC PI
  – Jose Moreira, IBM Research PI
• Three years completed
• Funding for three more years awarded
• http://www.hpc-colony.org

"Most application developers would like to focus their attention on the domain aspects of their applications. Although their understanding of the problem will help them in finding potential sources of concurrency, managing that concurrency in more detail is difficult and error prone." – "Getting Up To Speed: The Future of Supercomputing", National Research Council, 2005
It is the Best of Times…
• It's common to hear computer simulation mentioned in the same breath as scientific theory and empirical research
• Hardware technology is rapidly advancing
• Computers are vastly more powerful than ever before
It is the Worst of Times…
• Computer Science is young
• Software is struggling to keep up with hardware
• Fundamental needs in HPC remain unmet
Today's Architecture Trend: The Growth of Cores Per Supercomputer
[Chart: growth in core counts per supercomputer over time]
† source: Jack Dongarra, top500.org
Today's Reliability Trends: An Application Perspective
• Automated job monitoring and restart is a necessity for running big jobs on existing large-scale systems
• Running 41 million PE hours over 16 weeks on Red Storm, Purple, and BG/L, it was typical to restart applications 10-20 times per day †
[Chart: breakdown across system hardware, system software, application errors, and human error]
† J.T. Daly, "Facilitating High-Throughput ASC Calculations", ADTSC Nuclear Weapons Highlights '07
Today's Application Trends: Larger & More Complicated
• Lightweight Kernel May Not Be Enough
  – Evaluation of 7 apps/libs: 78 system calls, 45 satisfied by lightweight kernels
  – I/O, sockets, signals, fork/exec
  – Good but not complete coverage by lightweight kernels
  – Exceptions: fork/exec, mmap, some socket calls
  – BGL and Red Storm had largely the same coverage
• It's More Than Just The Apps
[Chart: Linux system call count by kernel version, from 2.4.2 through 2.6.18; y-axis: count of system calls, 0-350]
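The coverage gaps above are easy to demonstrate with a tiny probe. The sketch below (illustrative only, not part of the Colony evaluation) exercises fork/exec and mmap, two of the calls cited as lightweight-kernel exceptions, and reports whether they succeed on the node where it runs.

    // Illustrative probe: does this node's kernel support fork/exec and mmap,
    // two calls that lightweight kernels have historically left out?
    #include <cstdio>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
      // fork/exec: spawn a trivial child process
      pid_t pid = fork();
      if (pid < 0) {
        perror("fork");
      } else if (pid == 0) {
        execlp("true", "true", (char*)nullptr);   // replace child with /bin/true
        _exit(127);                               // exec failed
      } else {
        int status = 0;
        waitpid(pid, &status, 0);
        printf("fork/exec: %s\n",
               (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? "ok" : "failed");
      }

      // mmap: request one anonymous page
      void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      printf("mmap: %s\n", p == MAP_FAILED ? "failed" : "ok");
      if (p != MAP_FAILED) munmap(p, 4096);
      return 0;
    }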
Today’s Application Trends: Larger & More Complicated
• Evaluation of 7 apps/libs: 78 system calls, 45 satisfied by lightweight kernels
• Emerging Needs
  – Coupled applications
  – Profiling needs, debugging needs, …
  – Accelerators & heterogeneous architectures
The Brewing Storm
• OS interference
  – IBM Allreduce performance decried at ScicomP and SPXXL conferences in 2001 (LLNL, ORNL, NAVO)
  – Effects documented in multiple papers
• Data Hierarchy Stretching
  – It's not enough to have faster cores; you need to be able to avoid stalls in the critical path
• Programming Environment
  – POSIX-like interface including threads desired by some apps
  – Development tools desired by most apps
• Dealing with Faults
  – Applications desire progress despite increased component counts
• Dealing with Massive Parallelism
  – Too much to do by hand
Colony Approach
• App Complexity -> Don't Limit Development Environment
• Increased Nodes, Emergence of Clouds & Heterogeneous Computing -> Infrastructure for Communication Overlays
• OS Scalability -> Parallel Aware Scheduling
• Declining Application Interrupt Time -> Fault Tolerance through Processor Virtualization
• Application Load Imbalances -> Adaptive Load Balancing through Processor Virtualization

Question: How much should system software offer in terms of features?
Answer: Everything required, and as much desired as possible.
Ameliorating Application Complexity
• Feature-rich development environment
  – Eliminate black-box syndrome
  – Eliminate scaling problems associated with feature-rich environments
• Source of hangs (hw, system sw, app)
• Subset attach
• Smarter, possibly asynchronous, compute node daemons
  – "I want to know right now if the system is okay"
  – "Debug my application without system administration"
The Growing Chasm
• The success of processors is making balanced systems difficult
• Coupled programming (e.g., climate codes) adds vertical pressure
• Multiple networks add vertical pressure
• Cloud computing and heterogeneous computing place vertical pressure
• Secondary (stable) storage adds vertical pressure
• I/O is a major bottleneck in many parallel applications
[Diagram: memory/storage hierarchy spanning L1 cache, L2 cache, local memory, remote memory, disk drives, and tape drives; access time improves toward the top of the hierarchy, capacity toward the bottom]
Colony's Strategy: Provide Communication Infrastructure to Help
• We cannot adequately address an overarching solution to the challenges of Data Hierarchy Stretching
• Focus on key areas complementary to our scope
  – Communication: permit scalable infrastructure
  – Communication: keep performance (latency, bandwidth, join/leave, …)
  – Parallel I/O: reduce time spent with checkpoint/restart
  – Parallel I/O: make daemons possible (e.g., permit non-blocking I/O)
  – Remote memory: overlay communication technology
SpiderCast for High Performance Computing
• A scalable, fully distributed messaging, membership, and monitoring infrastructure
• Develop a standalone distributed infrastructure that utilizes peer-to-peer and overlay networking technologies, while exploiting HPC platform-unique features and architecture
• Focus on:
  – Membership – report which processes are alive; discover and report failing processes
  – Monitoring – collect load / performance statistics
  – Scalable group services – multicast and lightweight pub/sub
• A set of services targeted at:
  – Increasing performance & scalability of scientific computing, by providing these services to load balancing, scheduling, fault tolerance, and parallel resource management system software
  – Enabling general-purpose workloads by providing missing distributed software services and components at the OS / middleware level
SpiderCast Services
• Light-weight topic-based publish/subscribe messaging
  – Send messages to a group (topic) / receive messages from a group (topic)
  – Application-level multicast
  – Allows load monitoring, implementation of shared state, etc.
• Interest-aware membership service
  – Which nodes are up (failure detection)
  – What are the topics of interest of each node
  – Several degrees of QoS (full view, partial view, w/o interest)
• Attribute service
  – Efficiently propagate slowly changing state information (node attributes)
  – Per-node Map-like API – putAttribute, getAttribute, etc.
  – Can be used for dissemination of: deployed services, supported protocols, load, statistics, etc.
• Overlay access
  – Get the list of immediate neighbors
  – Send a message to a neighbor
  – Allows custom distributed algorithms to be implemented on top of the SpiderCast overlay
• ConvergeCast
  – Efficiently aggregate a response from many to one
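To make the service list above concrete, here is a hedged C++ sketch of how a node-level agent might use a SpiderCast-style interface. Only putAttribute/getAttribute are named on the slide; every other identifier (SpiderCastNode, subscribe, publish, neighbors, …) is a hypothetical placeholder, not the actual SpiderCast API.

    // Hypothetical usage sketch of a SpiderCast-style interface (placeholder API).
    #include <iostream>
    #include <string>
    #include <vector>

    struct Message { std::string topic, payload; };

    class SpiderCastNode {                          // stand-in for the real service
    public:
      void subscribe(const std::string& topic)      { /* join interest group */ }
      void publish(const Message& m)                { /* application-level multicast */ }
      void putAttribute(const std::string& k, const std::string& v) { /* attribute service */ }
      std::string getAttribute(int nodeId, const std::string& k)    { return ""; }
      std::vector<int> neighbors() const            { return {}; }  // overlay access
    };

    int main() {
      SpiderCastNode node;
      node.subscribe("load-balance");                      // interest-aware membership
      node.putAttribute("load", "0.85");                   // slowly changing per-node state
      node.publish({"load-balance", "rebalance-request"}); // topic-based pub/sub
      for (int peer : node.neighbors())                    // custom algorithms on the overlay
        std::cout << "neighbor " << peer << "\n";
      return 0;
    }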
OS Scheduling on a 2x4 System
[Diagram: per-core timelines for a 2x4 system (Node1a-Node1d, Node2a-Node2d), shown for two scheduling scenarios]
Scaling with 2% Noise
[Chart: ALLREDUCE results – timings vs. node count (1024-16384) for CNK, Colony with SchedMods (quiet), Colony with SchedMods (2% noise), Colony (quiet), and Colony (2% noise)]
[Chart: GLOB results – timings vs. node count (1024-16384) for the same five configurations]
Scaling with 30% Noise
[Chart: Allreduce results – timings vs. node count (1024-8192) for CNK, Colony with SchedMods (quiet), Colony with SchedMods (30% noise), Colony (quiet), and Colony (30% noise)]
[Chart: GLOB results – timings vs. node count (1024-8192) for the same five configurations]
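For context, results like those summarized above are typically gathered with a simple timing loop around MPI_Allreduce. The sketch below shows that style of microbenchmark; it is a generic illustration, not the actual code behind these charts.

    // Generic allreduce timing loop of the kind used to expose OS noise at scale.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int iters = 1000;
      double in = 1.0, out = 0.0;

      MPI_Barrier(MPI_COMM_WORLD);                  // start everyone together
      double t0 = MPI_Wtime();
      for (int i = 0; i < iters; ++i)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      double t1 = MPI_Wtime();

      // Report the slowest rank's average time: noise on any one core delays all.
      double local = (t1 - t0) / iters, worst = 0.0;
      MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
      if (rank == 0)
        printf("mean allreduce time (worst rank): %g s\n", worst);
      MPI_Finalize();
      return 0;
    }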
Charm++
• Parallel library for object-oriented C++ applications
  – Invoke functions remotely
  – Messaging via remote method calls
  – Methods called by scheduler; system determines who runs next
• Multiple objects per processor
• Object migration fully supported
  – Even with broadcasts, reductions
Charm++ Features: Object Arrays
• Applications are written as a set of communicating objects
[Diagram: user's view – an array of objects A[0], A[1], A[2], A[3], …, A[n]]
Charm++ Features: Object Arrays
• Charm++ maps those objects onto processors, routing messages as needed
• Virtualization leads to message-driven execution
[Diagram: user's view of the object array A[0]…A[n] vs. the system view, where the elements are distributed across processors]
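A minimal Charm++ sketch of the object-array idea above: an interface (.ci) declaration plus the C++ bodies. The runtime, not the programmer, decides where each array element lives and routes the broadcast. Module and method names here are illustrative; the constructs (array [1D], proxies, ckNew, entry methods) are standard Charm++.

    // hello.ci -- interface description processed by the Charm++ translator
    mainmodule hello {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg* m);
        entry void done();
      };
      array [1D] A {
        entry A();
        entry void compute(int step);
      };
    };

    // hello.C -- C++ bodies; the runtime maps A[0..n-1] onto processors
    #include "hello.decl.h"
    CProxy_Main mainProxy;

    class Main : public CBase_Main {
      int finished = 0, n = 8;
    public:
      Main(CkArgMsg* m) {
        delete m;
        mainProxy = thisProxy;
        CProxy_A a = CProxy_A::ckNew(n);   // create n migratable objects
        a.compute(0);                      // broadcast: invoked on every element
      }
      void done() { if (++finished == n) CkExit(); }
    };

    class A : public CBase_A {
    public:
      A() {}
      A(CkMigrateMessage* m) {}            // migration constructor
      void compute(int step) {
        CkPrintf("A[%d] running step %d on PE %d\n", thisIndex, step, CkMyPe());
        mainProxy.done();                  // remote method call back to the main chare
      }
    };
    #include "hello.def.h"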
Processor Virtualization with Migratable Objects
• Divide the computation into a large number of pieces
  – Independent of the number of processors
• Let the runtime system map objects to processors
• Implementations: Charm++, Adaptive-MPI (AMPI)
[Diagram: user view of many objects vs. the system implementation mapping them onto processors P0, P1, P2]
AMPI: MPI with Virtualization
• Each MPI process implemented as a user-level thread embedded in a Charm++ object
[Diagram: MPI "processes" implemented as virtual processes (user-level migratable threads), mapped onto real processors]
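As a concrete illustration, an ordinary MPI program like the sketch below can run under AMPI with more ranks than physical cores, because each rank is a migratable user-level thread. The build/run commands in the comment reflect my understanding of AMPI's usual +vp option and may differ by version.

    // Plain MPI code; under AMPI each rank becomes a user-level migratable thread,
    // so e.g. 64 ranks can share 8 cores. Typical usage (may vary by version):
    //   ampicxx ring.C -o ring ; charmrun ./ring +p8 +vp64
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, nranks;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);   // number of *virtual* processes under AMPI

      // Trivial neighbor exchange in a ring, just to exercise communication.
      int left = (rank + nranks - 1) % nranks, right = (rank + 1) % nranks;
      int sendval = rank, recvval = -1;
      MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                   &recvval, 1, MPI_INT, left, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      printf("virtual rank %d of %d received %d\n", rank, nranks, recvval);
      MPI_Finalize();
      return 0;
    }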
Results from 20,480 Processors
• Results from BGW day (TJ Watson Research Center)
• Cosmological code ChaNGa
• Results for basic load balancers
Load Balancing on Large Machines
• Existing load balancing strategies don't scale on extremely large machines
  – Consider an application with 1M objects on 64K processors
  – Centralized: object load and communication data sent to one processor, which makes decisions
    • Becomes a bottleneck
  – Distributed: load balancing among neighboring processors
    • Does not achieve good balance quickly
• Hybrid (Gengbin Zheng, PhD thesis)
  – Processors are divided into independent groups, and groups are organized in hierarchies (decentralized)
  – Each group has a leader (the central node) which performs centralized load balancing
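To show where the load balancer plugs in, here is a hedged sketch of the usual Charm++ pattern: an array element opts in with usesAtSync, calls AtSync() at iteration boundaries, and resumes in ResumeFromSync() after objects have been migrated. Which strategy runs (centralized, neighborhood, or the hybrid scheme above) is normally a runtime choice (e.g., a +balancer option); the worker logic below is illustrative.

    // Sketch of measurement-based load balancing in a Charm++ array element.
    // (iterate is assumed to be declared as an entry method in the .ci file.)
    class Worker : public CBase_Worker {
      int step = 0;
    public:
      Worker() { usesAtSync = true; }           // opt in to AtSync-style load balancing
      Worker(CkMigrateMessage* m) {}
      void pup(PUP::er& p) { p | step; }        // serialize state so the object can migrate
                                                // (plus any base-class pup your version requires)
      void iterate() {
        doLocalWork(step);                      // application-specific computation (illustrative)
        if (++step % 20 == 0)
          AtSync();                             // hand control to the load balancer
        else
          thisProxy[thisIndex].iterate();       // re-enqueue the next step via the scheduler
      }
      void ResumeFromSync() {                   // called once rebalancing/migration is done
        thisProxy[thisIndex].iterate();
      }
    private:
      void doLocalWork(int s) { /* ... */ }
    };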
Fault Tolerance Enabled by Charm++
• Automatic checkpointing / fault detection / restart (see the sketch below)
  – Scheme 1: checkpoint to file system
  – Scheme 2: in-memory checkpointing
• Proactive reaction to impending faults
  – Migrate objects when a fault is imminent
  – Keep "good" processors running at full pace
  – Refine load balance after migrations
• Scalable fault tolerance
  – Use message logging to tolerate frequent faults in a scalable fashion
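Below is a hedged sketch of how an application driver might invoke the two checkpointing schemes. CkStartCheckpoint (to the file system) and CkStartMemCheckpoint (in-memory) are the Charm++ entry points as I understand them, so treat the exact names and signatures as assumptions; the surrounding class is illustrative.

    // Sketch: periodic checkpointing from a Charm++ main chare.
    // (Worker, doStep, endOfIteration, and resumeWork assumed declared in the .ci file.)
    class Main : public CBase_Main {
      CProxy_Worker workers;                    // chare array doing the real work
      int iter = 0, checkpointPeriod = 100;
    public:
      Main(CkArgMsg* m) { delete m; workers = CProxy_Worker::ckNew(1024); resumeWork(); }
      void endOfIteration() {
        ++iter;
        if (iter % checkpointPeriod == 0) {
          CkCallback cb(CkIndex_Main::resumeWork(), thisProxy);
          CkStartCheckpoint("ckpt", cb);        // Scheme 1: checkpoint to the file system
          // CkStartMemCheckpoint(cb);          // Scheme 2: in-memory checkpointing
        } else {
          resumeWork();
        }
      }
      void resumeWork() { workers.doStep(iter); }  // broadcast the next step to the array
    };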
Fast Restart
• Message logging allows fault-free processors to continue with their execution
• However, sooner or later some processors start waiting for the crashed processor
• Virtualization allows us to move work from the restarted processor to waiting processors
• Chares are restarted in parallel
• Restart cost can be reduced
Performance of Proactive Scheme
[Chart: iteration time of Sweep3d on 32 processors for a 150^3 problem with 1 warning]
Scalable Fault Tolerance
• Basic idea: if one out of 100,000 processors fails, we shouldn't have to send the "innocent" 99,999 processors scurrying back to their checkpoints and duplicate all the work since their last checkpoint
• Basic scheme (see the sketch below):
  – Everyone logs messages sent to others
  – Asynchronous checkpoints
  – On failure:
    • The objects from the failed processors are resurrected (from their checkpoints) on other processors
    • Their acquaintances re-send the messages logged since the last checkpoint
    • The failed objects catch up with the rest, and continue
• Of course, several wrinkles and issues arise
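The sketch below illustrates the sender-side logging and replay idea in the basic scheme above. It is a conceptual C++ illustration only, not the actual Charm++ message-logging protocol; all class and method names here are invented for the example.

    // Conceptual sketch of sender-side message logging and replay.
    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    struct LoggedMsg { int seq; std::string payload; };

    class MessageLogger {
      std::map<int, std::vector<LoggedMsg>> log;  // destination object id -> logged messages
      std::map<int, int> nextSeq;                 // per-destination sequence numbers
    public:
      // Called on every send: record the message before it leaves this processor.
      LoggedMsg record(int dest, const std::string& payload) {
        LoggedMsg m{nextSeq[dest]++, payload};
        log[dest].push_back(m);
        return m;
      }
      // Called after 'dest' checkpoints: entries covered by the checkpoint can be dropped.
      void truncate(int dest, int checkpointedSeq) {
        auto& v = log[dest];
        v.erase(std::remove_if(v.begin(), v.end(),
                [&](const LoggedMsg& m) { return m.seq <= checkpointedSeq; }), v.end());
      }
      // Called when 'dest' is resurrected elsewhere: replay everything since its checkpoint,
      // so only the failed objects redo work while everyone else keeps going.
      const std::vector<LoggedMsg>& replay(int dest) const { return log.at(dest); }
    };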
Benefit of Virtualization in the Fault-Free Case
[Charts: NAS benchmarks (MG, CG, LU, and SP, class B) – Mflops vs. processor count for AMPI, AMPI-FT with multiple virtual processors, and AMPI-FT with 1 virtual processor]
Composition of Recovery Time
[Chart: restart time for an MPI 7-point stencil with 3D decomposition on 16 processors, with varying numbers of virtual processors]
Fault Tolerance: Status and Directions
• Message logging integrated into the regular Charm++ distribution
• Performing a detailed comparison of the various schemes
  – Testing both with kernels and full applications
• Investigating enhancements to the message-logging protocol:
  – Overhead minimization by grouping processors
  – Stronger coupling to load balancing
• Partial funding between Colony-1 and Colony-2:
  – Fulbright fellowship for a graduate student at UIUC
…and in conclusion…
For Further Info
• http://www.hpc-colony.org
• http://charm.cs.uiuc.edu
• http://www.research.ibm.com/bluegene

Partnerships and Acknowledgements
• DOE Office of Science
• Colony Team
  – Core: Terry Jones (ORNL), Jose Moreira (IBM), Eliezer Dekel (IBM), Roie Melamed (IBM), Yoav Tock (IBM), Laxmikant Kale (UIUC), Celso Mendes (UIUC), Esteban Meneses (UIUC)
  – Extended: Bob Wisniewski (IBM), Todd Inglett (IBM), Andrew Tauferner (IBM), Edi Shmueli (IBM), Gera Goft (IBM), Avi Teperman (IBM), Gregory Chockler (IBM), Sayantan Chakravorty (UIUC)