Arnold-FTXS-14.pdf


Department of Computer Science

Coarse-grained Energy Modeling of Rollback/Recovery Mechanisms

Dewan Ibtesham, David DeBonis, Dorian Arnold / University of New Mexico
Kurt Ferreira / Sandia National Laboratories

In the Good Old Days …

• Strict applicaered
  ◦ Primary focus on applica
• No one cared about fault-tolerance and resilience
  ◦ Except the computer systems researchers
• No one cared about energy and power
  ◦ Except the electrical engineers

Scalable Systems Lab

Resilience

Money & Power: 2 Exascale Challenges

• Resilience
  ◦ Increased component counts
  ◦ Increased component complexity
  ◦ Increased fault-tolerance overheads
  Higher failure rates
• Power/energy
  ◦ The (already missed) exascale system power cap: 20 MW
    Tianhe-2, 18 MW, 34 PFlop/s (#1)
    Titan, 8.2 MW, 18 PFlop/s (#2)
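To put the power figures above in perspective, the short sketch below converts them into energy efficiency (GFlop/s per watt) and compares them with what a 1 EFlop/s machine would need under a 20 MW cap. The helper name `gflops_per_watt` is introduced here for illustration only; the performance and power numbers are the rounded values quoted on the slide.

```python
# Figures quoted on the slide: (performance in PFlop/s, power in MW).
systems = {
    "Tianhe-2": (34.0, 18.0),
    "Titan": (18.0, 8.2),
}


def gflops_per_watt(pflops: float, megawatts: float) -> float:
    """Convert PFlop/s and MW into GFlop/s per watt."""
    flops = pflops * 1e15
    watts = megawatts * 1e6
    return flops / watts / 1e9


# An exascale system (1000 PFlop/s) running exactly at the 20 MW cap.
exascale_target = gflops_per_watt(1000.0, 20.0)

for name, (pflops, mw) in systems.items():
    eff = gflops_per_watt(pflops, mw)
    gap = exascale_target / eff
    print(f"{name}: {eff:.2f} GFlop/s/W, about {gap:.0f}x below the target")
```

With these inputs both systems land at roughly 2 GFlop/s per watt, against a target of 50 GFlop/s per watt.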


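The Bigger Picture

• Exascale design exploration requires projecting performance for different system configurations.
• Exascale application performance prediction must entail power/energy behavior and failure and fault-tolerance considerations.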

Bigger Picture (cont’d)

• An applica
• A key principle: exploit coarse-grained opera

Highlights from this Work

1. Coarse-grained energy modeling can be accurate and used for different CR op
2. Checkpoint compression yields overall energy savings
3. Energy savings from checkpoint compression increase with applica

Specific Motivations

Explore CR energy performance
  ◦ CR protocols can move large data volumes
  ◦ Data movement can dominate energy consump

Checkpoint Compression – A case study

• Our previous work [Resilience ‘11, ICPP ‘12]
  ◦ Compression trades off computa

[Figure: Efficiency (%) versus number of nodes for pHPCCG, HPCCG, phdMesh, miniFE, and LAMMPS with checkpoint compression, compared with no compression.]


Checkpoint Compression (cont’d)

• Increases per-checkpoint energy costs
  ◦ Compression/decompression generally is CPU-bound
  ◦ Reduced data movement doesn’t reduce network energy
• Increases checkpoin

But

• Decreases applica

Checkpoint compression and energy

[Figure: two timelines, A and B, plotting energy over time across three activities: application running, compressing checkpoints, and saving checkpoints.]

What is the overall impact on total application energy consumption?


A Coarse-grained Approach to Modeling Energy Consumption

$$\text{Energy} = \text{Time} \times \text{Power}$$

$$\text{Energy}_{\text{activity}} = \text{Time}_{\text{activity}} \times \text{Power}_{\text{activity}}$$

• Coarse-grained: treat coarse ac
• Empirically measure average power per ac
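The per-activity relation above can be sketched in a few lines of Python. The `Activity` class and the numbers fed into it are hypothetical and exist only to illustrate Energy_activity = Time_activity × Power_activity; they are not measurements from this work.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    name: str
    seconds: float    # total time spent in this activity
    avg_watts: float  # empirically measured average power for this activity

    @property
    def joules(self) -> float:
        # Energy_activity = Time_activity x Power_activity
        return self.seconds * self.avg_watts


# Hypothetical values, chosen only to make the arithmetic easy to follow.
compute = Activity("application running", seconds=3600.0, avg_watts=100.0)
print(f"{compute.name}: {compute.joules:.0f} J")
```

The point of the coarse granularity is that only two quantities per activity are needed: how long it ran and its average power draw.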
PowerInsight Measurement Framework

• Developed by Sandia Labs & Penguin Compu
• Uses Hall effect current sensors on CPU and memory power rails
• Electrically separate with offline data collec
• 10 Hz sampling frequency

Images from Laros et al.’s “PowerInsight – A Commodity Power Measurement Capability.”
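As a rough illustration of how fixed-rate power samples turn into the average power and energy the model needs, consider the sketch below. It assumes the samples are already available as a plain list of watt readings; how PowerInsight actually exposes its data is not shown in these slides, so the function names and the trace are hypothetical.

```python
SAMPLE_HZ = 10.0  # sampling frequency stated on the slide


def average_power(samples_watts):
    """Mean of the power samples, in watts."""
    return sum(samples_watts) / len(samples_watts)


def energy_joules(samples_watts, sample_hz=SAMPLE_HZ):
    """Rectangle-rule estimate: each sample stands for 1/sample_hz seconds."""
    return sum(samples_watts) / sample_hz


# Hypothetical trace: 10 samples, i.e. one second of data at 10 Hz.
trace = [50.0, 52.0, 51.0, 49.0, 48.0] * 2

duration_s = len(trace) / SAMPLE_HZ
print(average_power(trace), energy_joules(trace), duration_s)
```

Under this rectangle-rule assumption, the summed energy equals the average power multiplied by the trace duration, which is exactly the Time × Power form used by the coarse-grained model.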

Checkpoint/Restart Energy Modeling

$$\text{Energy}_{\text{total}} = \text{Energy}_{\text{application}} + \text{Energy}_{\text{checkpoint}} + \text{Energy}_{\text{restart}}$$

• Applica
• Checkpoint Ac
• Restart Ac
  ◦ Treat as one broad ac
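The total-energy decomposition above amounts to summing Time × Power over the three activities. The following minimal sketch shows the bookkeeping; the phase durations and power draws are made-up placeholders, not values from the study.

```python
def total_energy(phases):
    """Sum Time x Power over all activities.

    phases maps an activity name to a (seconds, average watts) pair.
    """
    return sum(seconds * watts for seconds, watts in phases.values())


# Hypothetical times and average power draws for one run.
phases = {
    "application": (10_000.0, 100.0),
    "checkpoint": (1_000.0, 80.0),
    "restart": (500.0, 90.0),
}

for name, (seconds, watts) in phases.items():
    print(f"{name}: {seconds * watts:.0f} J")
print(f"total: {total_energy(phases):.0f} J")
```

Comparing two configurations (for example, with and without checkpoint compression) then reduces to building two such tables and comparing their totals.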
Validation Methodology

• Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)
• Berkeley Lab Checkpoint/Restart Library (BLCR)

[Diagram labels: Measure, Predict; activities: Application running, Checkpoint compression, Checkpoint commit.]

Validating Energy Model using LAMMPS

Predictions within 94-99% of observations
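The slides do not spell out how "within 94-99%" is computed. One plausible reading, assumed here purely for illustration, is accuracy = 1 − |predicted − observed| / observed, expressed as a percentage. The function name and the sample values below are hypothetical.

```python
def prediction_accuracy(predicted_j: float, observed_j: float) -> float:
    """Accuracy in percent, assuming 1 - relative error (an assumed definition)."""
    relative_error = abs(predicted_j - observed_j) / observed_j
    return 100.0 * (1.0 - relative_error)


# Hypothetical predicted and observed energies, in joules.
print(round(prediction_accuracy(970.0, 1000.0), 1))
print(round(prediction_accuracy(1060.0, 1000.0), 1))
```

Under this definition, over-prediction and under-prediction by the same amount are scored identically.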

Checkpoint Compression Energy Performance

• LAMMPS, HPCCG, pHPCCG and miniFE mini apps
• Checkpoint sizes and compression performance from previous study [Ibtesham et al., ICPP ‘12]
• Daly’s model to calculate
  ◦ Number of checkpoints,
  ◦ Number of failures and
  ◦ Time spent in each phase.
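For readers unfamiliar with Daly's model, the sketch below implements his higher-order estimate of the optimum checkpoint interval and his expected wall-clock time formula. The two count estimates derived from them (checkpoints as wall time divided by interval plus commit time, failures as wall time divided by MTBF) are simplifying assumptions made here; the slides do not state exactly how the study derived its counts. All input values are hypothetical.

```python
import math


def daly_optimal_interval(delta: float, mtbf: float) -> float:
    """Daly's higher-order optimum compute time between checkpoints.

    delta is the checkpoint commit time and mtbf the mean time between
    failures, both in seconds.
    """
    if delta >= 2.0 * mtbf:
        return mtbf
    x = delta / (2.0 * mtbf)
    return math.sqrt(2.0 * delta * mtbf) * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - delta


def daly_wall_time(solve_time, tau, delta, restart, mtbf):
    """Daly's expected wall-clock time for solve_time seconds of useful work."""
    return (mtbf * math.exp(restart / mtbf)
            * (math.exp((tau + delta) / mtbf) - 1.0) * solve_time / tau)


def summarize(solve_time, delta, restart, mtbf):
    tau = daly_optimal_interval(delta, mtbf)
    wall = daly_wall_time(solve_time, tau, delta, restart, mtbf)
    return {
        "interval_s": tau,
        "wall_time_s": wall,
        # Assumed approximations, not taken from the slides:
        "checkpoints": wall / (tau + delta),
        "failures": wall / mtbf,
    }


# Hypothetical inputs: 10 h of work, 60 s commit, 120 s restart, 1 h MTBF.
result = summarize(solve_time=36_000.0, delta=60.0, restart=120.0, mtbf=3_600.0)
for key, value in result.items():
    print(f"{key}: {value:.1f}")
```

Multiplying each phase's time by its measured average power would then feed these outputs into the coarse-grained energy model.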

Checkpoint compression saves energy

Compression Reduces Number of Checkpoints

[Figure: number of checkpoints (axis 0 to 900), uncompressed versus compressed with pbzip, in four panels: miniFE, LAMMPS, HPCCG, and pHPCCG.]

Energy savings for C/R only


Related Research

Using empirical energy-performance observa

Highlights Summary

• A coarse-grained modeling approach is feasible for predic
• While compression increases per-checkpoint costs, checkpoint compression reduces an applica

Future Work

• Valida
• Accoun
• Comparisons with energy-op
• Integra
• Characteriza
  ◦ CPU throttling to reduce CR energy consump

Questions
