Coarse-grained Energy Modeling of Rollback/Recovery Mechanisms
Dewan Ibtesham, David DeBonis, Dorian Arnold / University of New Mexico
Kurt Ferreira / Sandia National Laboratories
In the Good Old Days …
• Strict applicaered
◦ Primary focus on applica
• No one cared about fault-tolerance and resilience
◦ Except the computer systems researchers
• No one cared about energy and power
◦ Except the electrical engineers
Power/energy
◦ The (already missed) exascale system power cap: 20 MW
◦ Tianhe-2: 18 MW, 34 PFlop/s (#1)
◦ Titan: 8.2 MW, 18 PFlop/s (#2)
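To put the figures above in perspective, the following is a minimal sketch that converts them to energy efficiency in GFlop/s per watt. It assumes the 20 MW cap applies to a machine delivering one exaflop (1000 PFlop/s); the function and variable names are illustrative and not from the talk.

```python
# Figures quoted on the slide: (name, power in MW, performance in PFlop/s).
systems = [("Tianhe-2", 18.0, 34.0), ("Titan", 8.2, 18.0)]

def gflops_per_watt(pflops: float, megawatts: float) -> float:
    """Energy efficiency in GFlop/s per watt."""
    gflops = pflops * 1e6   # 1 PFlop/s = 1e6 GFlop/s
    watts = megawatts * 1e6  # 1 MW = 1e6 W
    return gflops / watts

efficiency = {name: gflops_per_watt(pf, mw) for name, mw, pf in systems}

# Assumption: the exascale target is 1000 PFlop/s under the 20 MW cap.
exascale_target = gflops_per_watt(1000.0, 20.0)

for name, value in efficiency.items():
    print(f"{name}: {value:.2f} GFlop/s per W")
print(f"Exascale target: {exascale_target:.0f} GFlop/s per W")
```

Under these assumptions both systems sit at roughly 2 GFlop/s per watt, while the target works out to 50 GFlop/s per watt.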
Scalable Systems Lab
The Bigger Picture
• Exascale design explora
• Exascale applica
Bigger Picture (cont’d)
• An applica
• A key principle: exploit coarse-grained opera
Highlights from this Work
1. Coarse-grained energy modeling can be accurate and used for different CR op
2. Checkpoint compression yields overall energy savings
3. Energy savings from checkpoint compression increase with applica
Specific Motivations
Explore CR energy performance
◦ CR protocols can move large data volumes
◦ Data movement can dominate energy consumption
Checkpoint Compression – A case study
Our previous work [Resilience ‘11, ICPP ‘12]
◦ Compression trades off computa
Figure: Efficiency (%) versus number of nodes for pHPCCG, HPCCG, phdMesh, miniFE and LAMMPS with checkpoint compression, compared with no compression.
Checkpoint Compression (cont’d)
• Increases per checkpoint energy costs
◦ Compression/decompression generally is CPU-bound
◦ Reduced data movement doesn’t reduce network energy
• Increases checkpoin
But
• Decreases applica
Checkpoint compression and energy
Figure: Energy versus time in two panels labeled A and B, with phases marked as application running, compressing checkpoints, and saving checkpoints.
What is the overall impact on total application energy consumption?
A Coarse-grained Approach to Modeling Energy Consumption
$\text{Energy} = \text{Time} \times \text{Power}$
$\text{Energy}_{\text{activity}} = \text{Time}_{\text{activity}} \times \text{Power}_{\text{activity}}$
• Coarse-grained: treat coarse ac
• Empirically measure average power per ac
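The per-activity relation above can be sketched in a few lines: total energy is the sum, over coarse activities, of time spent multiplied by measured average power. The `Activity` type, the activity names, and every number below are hypothetical values chosen for illustration, not measurements from this work.

```python
from dataclasses import dataclass

@dataclass
class Activity:
    name: str
    seconds: float    # total time spent in this activity
    avg_watts: float  # empirically measured average power

def total_energy(activities):
    """Sum of time x average power over all activities, in joules."""
    return sum(a.seconds * a.avg_watts for a in activities)

# Hypothetical run: numbers are placeholders, not measured data.
run = [
    Activity("application running", 3600.0, 100.0),
    Activity("checkpoint compression", 120.0, 110.0),
    Activity("checkpoint commit", 300.0, 80.0),
]
energy_j = total_energy(run)
print(f"Predicted energy: {energy_j / 1000.0:.1f} kJ")
```

Keeping one average power figure per activity is what makes the model coarse-grained: only the time spent in each activity has to be predicted for a new configuration.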
PowerInsight Measurement Framework
• Developed by Sandia Labs & Penguin Computing
• Uses Hall effect current sensor on CPU and memory power rails
• Electrically separate with offline data collection
• 10 Hz sampling frequency
Images from Laros et al.’s “PowerInsight – A Commodity Power Measurement Capability.”
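As a rough illustration of how a 10 Hz power trace can be reduced to the average power and energy of one activity, here is a minimal sketch. It assumes the samples have already been converted to watts and are evenly spaced; the function names and the trace are hypothetical and do not reflect the PowerInsight API.

```python
SAMPLE_HZ = 10.0  # sampling frequency quoted on the slide

def average_power(watts):
    """Mean of the power samples, in watts."""
    return sum(watts) / len(watts)

def energy_joules(watts, hz=SAMPLE_HZ):
    """Trapezoidal integration of evenly spaced power samples, in joules."""
    dt = 1.0 / hz
    return sum((a + b) / 2.0 * dt for a, b in zip(watts, watts[1:]))

# Hypothetical trace: 11 samples at 10 Hz span exactly one second.
trace = [50.0] * 11
avg_w = average_power(trace)
joules = energy_joules(trace)
print(f"average power {avg_w:.1f} W, energy {joules:.1f} J")
```

For a constant 50 W trace lasting one second, both routes agree: the average power times the duration equals the integrated energy.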
Restart Ac
◦ Treat as one broad ac
Validation Methodology
• Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)
• Berkeley Lab Checkpoint/Restart library (BLCR)
Diagram labels: application running, checkpoint compression, checkpoint commit; measure, predict.
Validating Energy Model using LAMMPS
Predictions within 94-99% of observations
Checkpoint Compression Energy Performance
• LAMMPS, HPCCG, pHPCCG and miniFE mini apps
• Checkpoint sizes and compression performance from previous study [Ibtesham et al., ICPP ‘12]
• Daly’s model to calculate
◦ Number of checkpoints,
◦ Number of failures and
◦ Time spent in each phase.
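The quantities listed above can be sketched with Daly's higher-order estimate of the optimum checkpoint interval and his expected wall-time expression, as I recall them from his 2006 paper. All parameter names and numbers are hypothetical, and two simplifications are mine rather than the authors': the failure count is approximated as wall time divided by MTBF, and rework time is taken as whatever remains after solve, checkpoint, and restart time.

```python
import math

def daly_interval(delta, mtbf):
    """Daly's higher-order optimum compute time between checkpoints (seconds).

    delta: time to write one checkpoint; mtbf: mean time between failures.
    """
    if delta >= 2.0 * mtbf:
        return mtbf
    x = delta / (2.0 * mtbf)
    return math.sqrt(2.0 * delta * mtbf) * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - delta

def daly_wall_time(solve, tau, delta, restart, mtbf):
    """Expected wall-clock time for `solve` seconds of failure-free work."""
    return (mtbf * math.exp(restart / mtbf)
            * (math.exp((tau + delta) / mtbf) - 1.0) * solve / tau)

def phase_breakdown(solve, delta, restart, mtbf):
    tau = daly_interval(delta, mtbf)
    wall = daly_wall_time(solve, tau, delta, restart, mtbf)
    n_ckpt = solve / tau  # checkpoints needed to cover the useful work
    n_fail = wall / mtbf  # simplifying assumption: failures ~ wall / MTBF
    ckpt_time = n_ckpt * delta
    restart_time = n_fail * restart
    return {
        "interval": tau,
        "wall": wall,
        "checkpoints": n_ckpt,
        "failures": n_fail,
        "checkpoint_time": ckpt_time,
        "restart_time": restart_time,
        # Remainder: lost work and repeated checkpoints after failures.
        "rework_time": wall - solve - ckpt_time - restart_time,
    }

# Hypothetical inputs: 10 h of work, 60 s checkpoints and restarts, 1 h MTBF.
result = phase_breakdown(solve=36000.0, delta=60.0, restart=60.0, mtbf=3600.0)
for key, value in result.items():
    print(f"{key}: {value:.1f}")
```

Multiplying each phase time by the measured average power of that phase then gives a per-phase energy estimate in the spirit of the coarse-grained model.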
Checkpoint compression saves energy
Compression Reduces Number of Checkpoints
Figure: Number of checkpoints, uncompressed versus compressed with pbzip, in four panels: miniFE, LAMMPS, HPCCG and pHPCCG.
Energy savings for C/R only
Related Research
Using empirical energy-performance observa
Highlights Summary
• A coarse-grained modeling approach is feasible for predic
• While compression increases per checkpoint costs, checkpoint compression reduces an applica
Future Work
• Valida
• Accoun
• Comparisons with energy-op
• Integra
• Characteriza
◦ CPU throttling to reduce CR energy consumption
Questions
Arnold-FTXS-14.pdf