Department of Computer Science

Coarse-grained Energy Modeling of Rollback/Recovery Mechanisms
Dewan Ibtesham, David DeBonis, Dorian Arnold / University of New Mexico
Kurt Ferreira / Sandia National Laboratories

In the Good Old Days …
• Strict applicaered
  ◦ Primary focus on applica
• No one cared about fault-tolerance and resilience
  ◦ Except the computer systems researchers
• No one cared about energy and power
  ◦ Except the electrical engineers

Scalable Systems Lab

Money & Power: 2 Exascale Challenges
• Resilience
  ◦ Increased component counts
  ◦ Increased component complexity
  ◦ Increased fault-tolerance overheads
  ◦ Higher failure rates
• Power/energy
  ◦ The (already missed) exascale system power cap: 20 MW
    – Tianhe-2, 18 MW, 34 PFlop/s (#1)
    – Titan, 8.2 MW, 18 PFlop/s (#2)

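The Bigger Picture
• Exascale design exploration requires performance projection for different system configurations
• Exascale application performance prediction must entail power/energy behavior and failure and fault-tolerance considerations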

Bigger Picture (cont’d)
• An applica
• A key principle: exploit coarse-grained opera

Highlights from this Work
1. Coarse-grained energy modeling can be accurate and used for different CR op
2. Checkpoint compression yields overall energy savings
3. Energy savings from checkpoint compression increase with applica

Specific Motivations
• Explore CR energy performance
  ◦ CR protocols can move large data volumes
  ◦ Data movement can dominate energy consump

Checkpoint Compression – A case study
• Our previous work [Resilience ’11, ICPP ’12]
  ◦ Compression trades off computa
[Figure: Efficiency (%) versus number of nodes for pHPCCG, HPCCG, phdMesh, miniFE, and LAMMPS with checkpoint compression, compared with no compression.]

Checkpoint Compression (cont’d)
• Increases per-checkpoint energy costs
  ◦ Compression/decompression is generally CPU-bound
  ◦ Reduced data movement doesn’t reduce network energy
• Increases checkpoin
But
• Decreases applica

Checkpoint compression and energy
[Figure: Energy plotted against time in two panels, A and B, with phases labeled "Application running", "Compressing checkpoints", and "Saving checkpoints".]
What is the overall impact on total application energy consumption?

A Coarse-grained Approach to Modeling Energy Consumption
$\text{Energy} = \text{Time} \times \text{Power}$
$\text{Energy}_{\text{activity}} = \text{Time}_{\text{activity}} \times \text{Power}_{\text{activity}}$
• Coarse-grained: treat coarse ac
• Empirically measure average power per ac
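The per-activity relation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the `Activity` type, the activity names, and every duration and wattage are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """One coarse-grained activity (names and numbers here are hypothetical)."""
    name: str
    seconds: float    # time spent in the activity
    avg_watts: float  # empirically measured average power for the activity


def activity_energy(activity: Activity) -> float:
    """Energy_activity = Time_activity x Power_activity, in joules."""
    return activity.seconds * activity.avg_watts


if __name__ == "__main__":
    # Placeholder values, not measurements from the talk.
    for act in (Activity("compute", 600.0, 90.0), Activity("checkpoint", 30.0, 70.0)):
        print(f"{act.name}: {activity_energy(act):.0f} J")
```

Keeping one average power figure per activity is what makes the model coarse-grained: only the time spent in each activity has to be predicted.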
PowerInsight Measurement Framework
• Developed by Sandia Labs & Penguin Compu
• Uses Hall effect current sensor on CPU and memory power rails
• Electrically separate with offline data collec
• 10 Hz sampling frequency
Images from Laros et al.’s “PowerInsight – A Commodity Power Measurement Capability.”
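With a fixed 10 Hz sampling rate, average power and energy for an activity follow directly from the sample stream. The sketch below assumes the samples have already been converted to watts and trimmed to the activity's time window. It does not use any PowerInsight interface, and the function names are illustrative.

```python
SAMPLE_HZ = 10.0  # sampling frequency stated on the slide


def average_power(samples_watts: list[float]) -> float:
    """Mean of the power samples (W) collected during one activity."""
    if not samples_watts:
        raise ValueError("need at least one sample")
    return sum(samples_watts) / len(samples_watts)


def sampled_energy(samples_watts: list[float], sample_hz: float = SAMPLE_HZ) -> float:
    """Rectangle-rule energy (J): each sample stands for 1/sample_hz seconds."""
    return sum(samples_watts) / sample_hz
```

For example, twenty samples of 100 W cover two seconds at 10 Hz, so `sampled_energy` returns 200 J and `average_power` returns 100 W.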

Checkpoint/Restart Energy Modeling
$\text{Energy}_{\text{total}} = \text{Energy}_{\text{application}} + \text{Energy}_{\text{checkpoint}} + \text{Energy}_{\text{restart}}$
• Applica
• Checkpoint Ac
• Restart Ac
  ◦ Treat as one broad ac
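The three-term sum can be written out as a small function. This is a sketch under stated assumptions: each term is taken as time multiplied by average power, checkpoint and restart costs are assumed identical across occurrences, and all parameter names are hypothetical rather than taken from the slides.

```python
def total_energy(app_seconds: float, app_watts: float,
                 n_checkpoints: float, ckpt_seconds: float, ckpt_watts: float,
                 n_restarts: float, restart_seconds: float, restart_watts: float) -> float:
    """Energy_total = Energy_application + Energy_checkpoint + Energy_restart (J)."""
    e_application = app_seconds * app_watts
    # Assumption: every checkpoint takes the same time at the same average power.
    e_checkpoint = n_checkpoints * ckpt_seconds * ckpt_watts
    # Assumption: every restart is one broad activity with a single average power.
    e_restart = n_restarts * restart_seconds * restart_watts
    return e_application + e_checkpoint + e_restart
```

With compression, the checkpoint term could be split further into a compression part and a commit part, each with its own time and average power.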
Validation Methodology
• Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)
• Berkeley Lab Checkpoint/Restart Library (BLCR)
[Figure: Diagram labeled with the activities "Application running", "Checkpoint Compression", and "Checkpoint commit" and the steps "Measure" and "Predict".]

Validating Energy Model using LAMMPS

Predictions within 94-99% of observations
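The slide does not say how "within 94-99%" was computed. One common reading, assumed here, is one minus the relative error between predicted and observed energy. The function name and arguments are hypothetical.

```python
def prediction_accuracy(predicted_joules: float, observed_joules: float) -> float:
    """Return 1 - relative error, so 0.97 means the prediction is within 97%."""
    if observed_joules <= 0:
        raise ValueError("observed energy must be positive")
    return 1.0 - abs(predicted_joules - observed_joules) / observed_joules
```

Under this definition, over-prediction and under-prediction by the same amount score the same.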

Checkpoint Compression Energy Performance
• LAMMPS, HPCCG, pHPCCG and miniFE mini apps
• Checkpoint sizes and compression performance from previous study [Ibtesham et al., ICPP ’12]
• Daly’s model to calculate
  ◦ Number of checkpoints,
  ◦ Number of failures and
  ◦ Time spent in each phase.
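The quantities listed above can be approximated with Daly's published higher-order estimate of the optimum checkpoint interval and his expected wall-clock time formula. The sketch below is an illustration, not the study's code: `delta` is the checkpoint commit time, `restart` the restart time, and `mtbf` the system mean time between failures, all in seconds. The checkpoint and failure counts in `phase_estimates` are assumptions about how such counts could be derived from the model.

```python
import math


def daly_optimal_interval(delta: float, mtbf: float) -> float:
    """Daly's higher-order estimate of the optimum compute time between checkpoints (s)."""
    if delta >= 2.0 * mtbf:
        return mtbf
    x = delta / (2.0 * mtbf)
    return math.sqrt(2.0 * delta * mtbf) * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - delta


def daly_wall_time(solve_time: float, tau: float, delta: float,
                   restart: float, mtbf: float) -> float:
    """Daly's expected total wall-clock time (s), including rework and restarts."""
    return (mtbf * math.exp(restart / mtbf)
            * (math.exp((tau + delta) / mtbf) - 1.0)
            * solve_time / tau)


def phase_estimates(solve_time: float, delta: float, restart: float, mtbf: float) -> dict:
    """Derive interval, wall time, and rough event counts for one configuration."""
    tau = daly_optimal_interval(delta, mtbf)
    wall = daly_wall_time(solve_time, tau, delta, restart, mtbf)
    return {
        "interval_s": tau,
        "wall_time_s": wall,
        "checkpoints": solve_time / tau,  # assumption: one checkpoint per compute segment
        "failures": wall / mtbf,          # assumption: failures arrive at rate 1/mtbf
    }
```

Calling `phase_estimates` once with the uncompressed commit time and once with the compressed one gives the per-phase times that an energy model of the form time multiplied by average power needs as input.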

Checkpoint  compression  saves  energy  

Compression Reduces Number of Checkpoints
[Figure: Number of checkpoints, uncompressed versus compressed with pbzip, in four panels: miniFE, LAMMPS, HPCCG, and pHPCCG.]

Energy  savings  for  C/R  only  


Related Research
Using empirical energy-performance observa

Highlights Summary
• A coarse-grained modeling approach is feasible for predic
• While compression increases per checkpoint costs, checkpoint compression reduces an applica

Future Work
• Valida
• Accoun
• Comparisons with energy-op
• Integra
• Characteriza
  ◦ CPU throttling to reduce CR energy consump

Questions
