Coarse-grained Energy Modeling of Rollback/Recovery Mechanisms
Dewan Ibtesham, David DeBonis, Dorian Arnold / University of New Mexico
Kurt Ferreira / Sandia National Laboratories
In the Good Old Days …
• Strict applicaered
  ◦ Primary focus on applica
• No one cared about fault-tolerance and resilience
  ◦ Except the computer systems researchers
• No one cared about energy and power
  ◦ Except the electrical engineers
Power/energy
  ◦ The (already missed) exascale system power cap: 20 MW
  ◦ Tianhe-2: 18 MW, 34 PFlop/s (#1)
  ◦ Titan: 8.2 MW, 18 PFlop/s (#2)
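The gap between these systems and the 20 MW cap can be made concrete with a little arithmetic. The sketch below uses only the figures quoted above plus the usual definition of exascale as 10^18 Flop/s. The names `systems`, `gflops_per_watt`, and `required` are illustrative and not part of the original slides.

```python
# Figures quoted on the slide (power in watts, performance in Flop/s).
systems = {
    "Tianhe-2": {"power_w": 18e6, "flops": 34e15},
    "Titan": {"power_w": 8.2e6, "flops": 18e15},
}

EXASCALE_FLOPS = 1e18  # assumption: exascale taken as 10^18 Flop/s
POWER_CAP_W = 20e6     # 20 MW cap from the slide


def gflops_per_watt(flops, power_w):
    """Energy efficiency in GFlop/s per watt."""
    return flops / power_w / 1e9


# Efficiency an exascale system would need to stay under the cap.
required = gflops_per_watt(EXASCALE_FLOPS, POWER_CAP_W)

# Efficiency of the two systems listed on the slide.
efficiency = {
    name: gflops_per_watt(s["flops"], s["power_w"])
    for name, s in systems.items()
}

for name, eff in efficiency.items():
    print(f"{name}: {eff:.2f} GFlop/s per W "
          f"(target {required:.0f} GFlop/s per W, gap {required / eff:.0f}x)")
```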
Scalable Systems Lab
The Bigger Picture
• Exascale design explora
• Exascale applica
Bigger Picture (cont’d)
• An applica
• A key principle: exploit coarse-grained opera
Highlights from this Work
1. Coarse-grained energy modeling can be accurate and used for different CR op
2. Checkpoint compression yields overall energy savings
3. Energy savings from checkpoint compression increase with applica
Specific Motivations
Explore CR energy performance
  ◦ CR protocols can move large data volumes
  ◦ Data movement can dominate energy consumption
Checkpoint Compression – A case study
Our previous work [Resilience ‘11, ICPP ‘12]
  ◦ Compression trades off computa
[Figure: Efficiency (%) versus number of nodes. Series: PHPCCG compress, HPCCG compress, PHDmesh compress, MiniFE compress, Lammps compress, No compression.]
Checkpoint Compression (cont’d)
• Increases per-checkpoint energy costs
  ◦ Compression/decompression generally is CPU-bound
  ◦ Reduced data movement doesn’t reduce network energy
• Increases checkpoin
But
• Decreases applica
Checkpoint compression and energy
[Figure: Energy versus Time, panels A and B. Phases shown: Application running, Compressing checkpoints, Saving checkpoints.]
What is the overall impact on total application energy consumption?
A Coarse-grained Approach to Modeling Energy Consumption
$\text{Energy} = \text{Time} \times \text{Power}$
$\text{Energy}_{\text{activity}} = \text{Time}_{\text{activity}} \times \text{Power}_{\text{activity}}$
• Coarse-grained: treat coarse ac
• Empirically measure average power per activity
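As a rough illustration of the per-activity formula, the sketch below computes the energy of each activity as time multiplied by average power. Summing the activities to get a total is an assumption made here for illustration, not something the slides state. The `Activity` class, the `total_energy_j` helper, and every number are hypothetical and are not measurements from this work.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    name: str
    time_s: float       # total time spent in this activity, in seconds
    avg_power_w: float  # empirically measured average power, in watts

    @property
    def energy_j(self) -> float:
        # Energy_activity = Time_activity x Power_activity
        return self.time_s * self.avg_power_w


def total_energy_j(activities):
    """Assumed total: the sum of the per-activity energies, in joules."""
    return sum(a.energy_j for a in activities)


# Hypothetical values for illustration only.
run = [
    Activity("application running", time_s=3600.0, avg_power_w=100.0),
    Activity("compressing checkpoints", time_s=120.0, avg_power_w=110.0),
    Activity("saving checkpoints", time_s=300.0, avg_power_w=80.0),
]

total = total_energy_j(run)
for a in run:
    print(f"{a.name}: {a.energy_j:.0f} J")
print(f"total: {total:.0f} J")
```

Keeping one average power figure per activity means only the activity durations need to change to explore a different configuration.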
PowerInsight Measurement Framework
• Developed by Sandia Labs & Penguin Computing
• Uses Hall effect current sensors on CPU and memory power rails
• Electrically separate with offline data collection
• 10 Hz sampling frequency
Images from Laros et al.’s “PowerInsight – A Commodity Power Measurement Capability.”
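Fixed-rate power samples like these could be reduced to the average power per activity along the following lines. This is a minimal sketch and does not use the PowerInsight interface. It assumes the samples for one activity are already available as a plain list of watt readings taken at 10 Hz; the function names and sample values are hypothetical.

```python
SAMPLE_RATE_HZ = 10.0  # sampling frequency quoted on the slide


def average_power_w(samples_w):
    """Mean of the power samples, in watts."""
    if not samples_w:
        raise ValueError("need at least one power sample")
    return sum(samples_w) / len(samples_w)


def energy_j(samples_w, sample_rate_hz=SAMPLE_RATE_HZ):
    """Energy in joules, treating each sample as covering 1/rate seconds."""
    return sum(samples_w) / sample_rate_hz


# Hypothetical readings: ten samples, i.e. one second at 10 Hz.
samples = [50.0, 52.0, 51.0, 49.0, 48.0, 50.0, 53.0, 47.0, 50.0, 50.0]

avg = average_power_w(samples)
duration_s = len(samples) / SAMPLE_RATE_HZ

print(f"average power: {avg:.1f} W over {duration_s:.1f} s")
print(f"energy: {energy_j(samples):.1f} J")
```

With a fixed sampling rate, the summed samples divided by the rate equal the average power multiplied by the duration, which is the same time-times-power product used in the coarse-grained model.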