End-to-end Modeling and Optimization of Power Consumption in HPC Interconnects
HUCAA 2016, August 16th, Philadelphia

Sébastien Rumley, Robert Polster, and Keren Bergman
Lightwave Research Lab, Columbia University, New York, NY, USA

Simon D. Hammond and Arun Rodrigues
Sandia National Laboratories, Albuquerque, NM, USA

Rev PA1


Context
•  Power consumption of the largest supercomputers is slowly reaching the “pain point”
   –  Power efficiency is growing more slowly than compute power
   –  Megawatt power consumption is the norm
      •  Several instances above 10 MW (Tianhe-2, K computer)
•  Pain point?
   –  The Department of Energy places it at 20 MW
   –  Point of comparison:
      •  Nuclear reactor: ~500 MW
→ New rule for next-generation supercomputers (in particular, Exascale)
   –  Scaling in terms of compute power = scaling in power efficiency
   –  Keep power consumption (at least) constant


Living in a power-constrained world
•  Need to know
   –  Who the big consumers are
   –  How much they consume
   –  How consumption evolves with scale and structure

[Wallace et al., HPPAC 2013]

•  In this presentation: focus on the interconnect
   –  How the number of network elements scales with system size
   –  Models for network element power consumption
   –  Exploration of various designs, analysis of the most power-efficient ones

Interconnect model
[Figure: N compute nodes in the system, each attached to a router by a short distance NR link. Each router contains a core. Short distance links use electrical transceivers; long distance RR links use optical transceivers.]
NR: node to router
RR: router to router

First aspect: topology and structure
•  There is a large variety of topologies. But:
   –  The interconnect can be expected to be “balanced”
      •  Symmetry among nodes, no obvious bottleneck
   –  The topology can be expected to have a low average distance Δ
      •  Low Δ guarantees low latency, in particular for collectives
•  A direct-topology-based interconnect can be reduced to three “shaping” parameters
   1.  Verbosity factor ν (byte/flop)
      •  Establishes how much bandwidth is present, normalized by node compute power
   2.  Concentration factor C
      •  Defines how “big” the switches are (switchlets vs. large hubs)
   3.  Internal bandwidth factor κ
      •  Ratio between the bandwidth provided by internal links (RR: router-router) and external links (NR: node-router)

Interconnect model
[Figure: C nodes per router. Bandwidth of NR links defined by ν (relative to node compute power). Bandwidth of RR links defined by κ (relative to NR links).]

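Balanced topologies
•  Suppose uniform traffic, emitted at maximum injection rate
   –  What is the total instantaneous traffic?

      $N \cdot \nu \cdot \Pi_{node} \cdot \Delta$

      where $\nu \cdot \Pi_{node}$ is the injection bandwidth of one compute node and $\Delta$ is the average number of hops in the topology
•  And how much bandwidth do we have?

      $S \cdot R \cdot \kappa = \frac{N}{C} \cdot R \cdot \kappa$

      where $S = N/C$ is the number of switches, $R$ the number of connections to other switches, and $\kappa$ the RR bandwidth factor
•  Therefore $\frac{N}{C} \cdot R \cdot \kappa$ must be larger than, but close to, $N \cdot \nu \cdot \Pi_{node} \cdot \Delta$ for the traffic to be adequately supported:

      $\frac{N \nu \Pi_{node} \Delta}{\frac{N}{C} R \kappa} \le 1 \quad\Rightarrow\quad \Delta \le \frac{R \kappa}{C \nu \Pi_{node}}$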
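As a hedged illustration of the balance condition, the sketch below evaluates both sides of the inequality and the resulting upper bound on Δ. The function names and the example values are hypothetical, and κ is treated here as the RR link bandwidth expressed in the same units as ν·Π_node, which is an assumption about units that the slide does not state.

```python
def total_traffic(n_nodes, nu, pi_node, delta):
    """Total instantaneous traffic N * nu * Pi_node * Delta under uniform traffic."""
    return n_nodes * nu * pi_node * delta


def rr_bandwidth(n_nodes, c, r, kappa):
    """Aggregate RR bandwidth S * R * kappa, with S = N / C switches."""
    return (n_nodes / c) * r * kappa


def max_average_distance(r, kappa, c, nu, pi_node):
    """Largest Delta the RR links can carry: R * kappa / (C * nu * Pi_node)."""
    return (r * kappa) / (c * nu * pi_node)


# Hypothetical example values (not taken from the slides).
N, C, R = 10_000, 10, 20
NU, PI_NODE = 0.01, 1e12   # byte/flop and flop/s, so nu * Pi_node is 1e10 byte/s
KAPPA = 1e10               # assumed RR link bandwidth, in byte/s

delta_max = max_average_distance(R, KAPPA, C, NU, PI_NODE)
balanced = total_traffic(N, NU, PI_NODE, 1.9) <= rr_bandwidth(N, C, R, KAPPA)
print(delta_max, balanced)
```

With these made-up numbers the RR links can sustain an average distance of about 2 hops, so a topology with Δ = 1.9 passes the check.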

Switch RR connectivity related to Δ

$\Delta_{ideal}(R) = \frac{R + 2R(R-1) + 3R(R-1)^2 + \dots + D x}{N - 1}$    (1)

$\Delta \le \frac{R \kappa}{C \nu \Pi_{node}}$    (2)

[Figure: $\Delta_{ideal}(R)$ versus topology connectivity (connectivity factor R) for GMG topologies with N = 1,000, 10,000, and 100,000 switches]

•  Assuming the topology
   –  Is achieving close to minimal distance
      •  So (1) applies
   –  Is well balanced
      •  So (2) applies
→ The number of RR links per switch, R, can be determined from the shaping parameters
•  From R stem the number of links and the switch radix

[1] S. Rumley et al., “Design Methodology for Optimizing Optical Interconnection Networks in High Performance Systems”, ISC-HPC 2015.
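One way to read "R can be determined from the shaping parameters" is as a search for the smallest R whose ideal average distance still satisfies the balance bound. The sketch below is an illustration under assumptions, not the authors' procedure: it takes N as the number of switches (as in the figure legend), fills distance levels with R, R(R-1), R(R-1)^2, ... switches until all are placed, and uses a plain linear search. The names `delta_ideal`, `min_connectivity`, and `bound_per_link` (standing for κ / (C·ν·Π_node)) are hypothetical.

```python
def delta_ideal(r, n_switches):
    """Ideal average distance for n_switches switches with r RR links each.

    Level d holds up to r * (r - 1) ** (d - 1) switches; the last level
    holds whatever remains (the D * x term of the formula).
    """
    if r < 2 or n_switches < 2:
        raise ValueError("need r >= 2 and at least 2 switches")
    remaining = n_switches - 1      # switches other than the source
    level_size = r                  # capacity of the current distance level
    distance = 1
    weighted_sum = 0
    while remaining > 0:
        placed = min(level_size, remaining)
        weighted_sum += distance * placed
        remaining -= placed
        level_size *= r - 1
        distance += 1
    return weighted_sum / (n_switches - 1)


def min_connectivity(n_switches, bound_per_link):
    """Smallest r with delta_ideal(r) <= r * bound_per_link (linear search)."""
    for r in range(2, n_switches):
        if delta_ideal(r, n_switches) <= r * bound_per_link:
            return r
    return n_switches - 1           # fully connected fallback


# Hypothetical example: 1,000 switches, kappa / (C * nu * Pi_node) = 0.1.
r_min = min_connectivity(1000, 0.1)
print(r_min, delta_ideal(r_min, 1000))
```

The switch radix then follows as R plus the C node-facing ports, consistent with the statement that the number of links and the radix stem from R.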

Second ingredient: power models
•  Short distance electrical transceivers
   –  Energy per bit: $E = 0.189 \cdot B + 1.496$ pJ/bit (B: bitrate in Gb/s)

Energy efficiency of optical links
[Figure: performance-energy trade-offs for point-to-point links, in two cases. Single channel (low density): assumes 30% laser efficiency. Many channels (high density): assumes 10% laser efficiency. Annotated values: 0.67 pJ/bit and 0.4 pJ/bit.]
[Bahadori, optical interconnects. 2016]
[R. Polster]

Optical links
[Figure: energy efficiency [fJ/bit] versus per-channel bitrate [Gbps]]
•  ~pJ/bit energy efficiencies reported for a variety of bitrates with VCSEL-based links
•  No clear trend!
→ Retained model: 1 pJ/bit at any rate
•  NB: an optical RR link has an extra transceiver
•  50% of RR links are optical
[Figure: long distance RR link between two routers]

Switch power model
•  What is the energy consumption (in pJ/bit) of a router chip with r ports, each providing a bandwidth B?
•  Assumptions:
   –  Number of IO pins limited by PINMAX (here to 1280)
   –  A lane uses 4 pins (differential signaling in the two directions)
   –  Chip thermal dissipation limited (here to 132 W)
   –  Chip power consumption accounts for 70% (other 30%: overheads)
   –  Switch power consumption = IO consumption + switching consumption
•  IO consumption:
   1.  Find available lanes per port: Lport = PINMAX / (4r)
   2.  Find the bitrate per lane: Blane = B / Lport
   3.  Apply the electrical transceiver model
   4.  Obtain chip-wide IO consumption by multiplying by B x r
•  Example: 40 ports, 100G
   –  Lport = 1280 / (4 x 40) = 8
   –  Blane = 100 / 8 = 12.5G
   –  12.5 x 0.189 + 1.496 = 3.86 pJ/bit
   –  3.86 pJ/bit x 40 x 100 Gb/s = 15.44 W
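The four IO steps can be sketched as a short function. This is a minimal illustration using the constants given on the slides (1280 pins, 4 pins per lane, E = 0.189·B + 1.496 pJ/bit); the function names are hypothetical, and lanes per port are kept as a plain division because the slides do not say how non-integer lane counts are handled.

```python
PIN_MAX = 1280        # IO pin budget stated on the slide
PINS_PER_LANE = 4     # differential signaling in both directions


def transceiver_energy_pj_per_bit(lane_gbps):
    """Electrical transceiver model: E = 0.189 * B + 1.496 pJ/bit, B in Gb/s."""
    return 0.189 * lane_gbps + 1.496


def io_power_watts(ports, port_gbps):
    """Chip-wide IO power for `ports` ports of `port_gbps` Gb/s each."""
    lanes_per_port = PIN_MAX / (PINS_PER_LANE * ports)     # step 1
    lane_gbps = port_gbps / lanes_per_port                 # step 2
    energy = transceiver_energy_pj_per_bit(lane_gbps)      # step 3
    # Step 4: pJ/bit times Gb/s gives mW, so scale by 1e-3 to get W.
    return energy * ports * port_gbps * 1e-3


# The slide's example: 40 ports at 100 Gb/s.
print(io_power_watts(40, 100))
```

For 40 ports at 100 Gb/s this gives roughly 15.4 W; the slide's 15.44 W comes from rounding the per-bit energy to 3.86 pJ/bit before multiplying.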

Switch power model (2)
•  Switch “core” consumption
   –  Based on commercial products
   –  Core power obtained by subtracting IO power from total power
      •  IO power estimated using the IO model
•  Switch core power: $P = 8.15 \cdot B + 50.68$ W (B: bitrate over all ports in Tb/s)
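Combining the core model with the IO figure and the 70% chip share gives a per-bit energy for a whole switch. The sketch below is a hedged illustration: it assumes total switch power equals (IO + core) / 0.70, which is one reading of the "other 30%: overheads" assumption, and the function names are hypothetical. The IO power is passed in as a number so the block stays self-contained.

```python
def core_power_watts(total_tbps):
    """Switch core model: P = 8.15 * B + 50.68 W, B = aggregate bitrate in Tb/s."""
    return 8.15 * total_tbps + 50.68


def switch_energy_pj_per_bit(io_watts, total_tbps, chip_share=0.70):
    """Per-bit energy of a switch, including non-chip overheads.

    Assumes the chip (IO + core) accounts for `chip_share` of total power.
    """
    chip_watts = io_watts + core_power_watts(total_tbps)
    total_watts = chip_watts / chip_share
    return total_watts / total_tbps     # W per Tb/s is numerically pJ/bit


# 40 ports x 100 Gb/s = 4 Tb/s, with the 15.44 W IO power of the slide's example.
print(core_power_watts(4.0), switch_energy_pj_per_bit(15.44, 4.0))
```

For this 4 Tb/s example the core draws about 83 W, far more than the roughly 15 W of IO, which matches the deck's later observation that switching dominates.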

Switch model (3)
•  132 W / 6 Tb/s = 22 pJ/bit
•  With 30% overhead → 31.4 pJ/bit
[Figure: two regimes, power limited and pin limited]

Result: interconnect-wide power consumption
[Figure: interconnect power consumption at verbosity = 0.01 B/F, broken down into long distance RR links, all short distance links, and switching cores]
•  Routing (packet) is by far the dominant power consumer
   –  Motivation for developing energy-efficient switching schemes

Role of concentration factor
[Figure: 20 PF total]
•  Increasing C also increases router radix, limiting per-port bandwidth and thus node compute power

Overall results


Disclaimer
•  Modeling interconnect power consumption: this is our first attempt
   –  Many aspects missing:
      •  Impact of utilization
      •  Are we modeling average or peak consumption?
   –  Many “half-blind” assumptions
      •  One optical link = 2 electrical transceivers: is that fair?
      •  Overhead of commercial routers
→ Input and feedback most welcome
→ Especially for modeling the switching core

Conclusions
•  Rule of thumb for a 10-100TF system: 100 pJ/bit
   –  3 router traversals at 31 pJ/bit each ≈ 95 pJ/bit (diameter-2 topology)
   –  ~5-12 pJ/bit for transmission
   → Should be reduced by a factor of at least 4 for Exascale
•  Switching takes the lion's share
   –  Transceivers are not relevant power-wise
      •  Is that different in terms of area? Cost?
   –  Path to improved energy efficiency:
      •  Energy-efficient router chips (CMOS technology, microarchitecture)
      •  More locality in algorithms to decrease Δ
      •  Bandwidth steering with optical switching [1]
•  Interconnect shape can always be adapted to remain energy efficient
   –  E.g. voluntarily downgrading the bandwidth of internal links compared to injection ones

[1] K. Wen et al., “Flexfly: Enabling a Reconfigurable Dragonfly Through Silicon Photonics”, accepted at SC’16.
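The rule of thumb can be reproduced with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, and counting router traversals as diameter + 1 is an assumption that matches the "3 router traversals" quoted for a diameter-2 topology.

```python
def end_to_end_energy_pj_per_bit(diameter, router_pj_per_bit, transmission_pj_per_bit):
    """Worst-case energy per bit: router traversals plus link transmission."""
    router_traversals = diameter + 1   # assumption: longest path in a direct topology
    return router_traversals * router_pj_per_bit + transmission_pj_per_bit


# Values quoted in the conclusions: 31 pJ/bit per router, 5 to 12 pJ/bit for transmission.
low = end_to_end_energy_pj_per_bit(2, 31.0, 5.0)
high = end_to_end_energy_pj_per_bit(2, 31.0, 12.0)
print(low, high)
```

Both ends of the range land close to the 100 pJ/bit rule of thumb, with the routers contributing the large majority of it.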

Thank you !
