RelaxFault Memory Repair
- Balance Resilience, Performance, and Cost -
Dong Wan Kim
Mattan Erez
The University of Texas at Austin
RelaxFault Memory Repair [ISCA’16] (c) Dong Wan Kim
Are DRAM faults rare?
~150 years
~7 hours (25k nodes, 1.5PB memory)
5% (in a year)
Takeaway 1: DRAM failures are frequent, but only in a few nodes
Transient or permanent?
[Figure: DRAM fault rate (FIT/device, axis 0 to 25) for transient and permanent faults, by fault mode: single-bit/-word/-row/-column, single-bank, multi-bank/-rank]
Takeaway 2: Permanent faults are as frequent as transient faults
* FIT (Failure in Time): Number of failures that can be expected in 1 billion device-hours of operation.
* V. Sridharan et al., “Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults,” SC 2013
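The FIT definition converts directly into an expected fault count. The following is a minimal sketch, assuming a constant fault rate. The function name and the example inputs (20 FIT/device, 18 devices per DIMM) are illustrative assumptions, not values taken from the field study.

```python
HOURS_PER_YEAR = 8760

def expected_faults(fit_per_device, devices, years):
    """Expected number of faults, given FIT = failures per 1e9 device-hours."""
    device_hours = devices * years * HOURS_PER_YEAR
    return fit_per_device * device_hours / 1e9

# Hypothetical system: 16,384 nodes x 8 DIMMs x 18 devices per DIMM (assumed),
# with an assumed rate of 20 FIT/device, observed for one year.
devices = 16_384 * 8 * 18
print(expected_faults(20, devices, 1))
```

Even a per-device rate that looks small adds up once it is multiplied by the number of devices in a large system.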
DRAM fault granularity
[Figure: DRAM fault rate (FIT/device, axis 0 to 25) for transient and permanent faults, by fault mode: single-bit/-word/-row/-column, single-bank, multi-bank/-rank]
Takeaway 3: Most faults affect a small memory region
[1] Vilas Sridharan and Dean Liberty, “A Study of DRAM Failures in the Field,” SC 2012
* V. Sridharan et al., “Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults,” SC 2013
Is replacing faulty memory cheap?
Takeaway 4: Replacing DIMMs may lower system availability
ECC[1] to the rescue?
• Storage overhead
  – Stronger ECC requires more redundancy
  – Multiple device faults are a problem even though they are rare
    • Accumulated faults may result in DUE[2] or SDC[3]
• Latency overhead
  – Frequent error correction when permanent faults exist
  – CPU may raise an exception to log errors to the MCA[4] register
Takeaway 5: ECC is inefficient for permanent faults
[1] ECC: Error Correcting Code
[2] DUE: Detected Uncorrectable Error
[3] SDC: Silent Data Corruption (undetected errors)
[4] MCA: Machine Check Architecture
Memory repair to the rescue!!!
• Permanent faults are frequent → Retire faulty regions
• Faults likely affect small regions → Retire faulty regions
Repair-based fault tolerance mechanism that improves resilience and availability while balancing performance and cost
Related work ─ remapping w/ redundancy
• Row/column sparing
• Post package repair (PPR)
  – In-the-field memory repair (introduced in DDR4 and LPDDR4)
  – Repair 1 row per bank group (DDR4)
Source: Samsung Electronics, “Understanding DDR4 and Today’s DRAM Frontier,” Oct 2014
Related work ─ FreeFault [1]
[Figure: processor with a FreeFault-aware memory controller and an M-way LLC (way 0 … way M-1), attached to a 72-bit x4 ECC DIMM whose DRAM has a permanent fault; the data path is labeled Data (72-bit). Example: single row fault.]
(a) After reading one codeword: a line in LLC set N is LOCKED
(b) After reading all the data from the faulty row: 256 lines (16KB) are LOCKED in sets N … N+255
FreeFault provides repair capability within the microarchitecture, and repair is transparent to software
[1] Dong Wan Kim and Mattan Erez, “Balancing Reliability, Cost, and Performance Tradeoffs with FreeFault,” HPCA 2015
Motivation
[Figure: the same FreeFault example (processor, FreeFault-aware memory controller, M-way LLC, 72-bit x4 ECC DIMM with a permanent fault). Example: single row fault. After reading all the data from the faulty row, 256 lines (16KB) are LOCKED in LLC sets N … N+255.]
Simplicity of repair mechanism vs. efficiency of LLC usage
RelaxFault ─ relax repair constraint
• Retire and remap data ONLY from a faulty device
  – Coalesce multiple data blocks into a single cacheline
    • E.g. 16 4-byte blocks in a single 64-byte cacheline
  – Reduce LLC usage significantly
  – Increase repair coverage with a given amount of LLC
• Repair-aware cache address mapping scheme
  – Remapped data of multiple codewords from the same device share the same LLC address
  – Allocation of remapped lines is balanced across LLC sets
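The coalescing arithmetic can be checked with a short sketch. It assumes 64-byte cachelines, one device supplying 4 bytes of each cacheline, and a row fault spanning 256 cachelines, as in the single-row-fault example. The function names are illustrative only.

```python
LINE_BYTES = 64        # cacheline size
BYTES_PER_DEVICE = 4   # one device's share of a 64-byte line (assumed)

def freefault_lines(faulty_lines):
    """FreeFault locks one full LLC line per faulty cacheline."""
    return faulty_lines

def relaxfault_lines(faulty_lines):
    """RelaxFault keeps only the faulty device's bytes: 16 blocks per LLC line."""
    blocks_per_line = LINE_BYTES // BYTES_PER_DEVICE
    return -(-faulty_lines // blocks_per_line)  # ceiling division

row_fault = 256  # cachelines touched by a single row fault
print(freefault_lines(row_fault) * LINE_BYTES)   # 16384 bytes (16KB)
print(relaxfault_lines(row_fault) * LINE_BYTES)  # 1024 bytes (1KB)
```

Under these assumptions the same row fault occupies 16 LLC lines instead of 256, which is where the higher repair coverage for a given LLC budget comes from.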
RelaxFault repair address mapping
[Figure (a): Normal cache address map (including FreeFault repair). Physical address fields shown: Row, Bk, Rk, Col, Ch, Col, Offset; these map to the LLC set index and LLC offset.]
This fine-grained data interleaving results in hot LLC sets with FreeFault, lowering repair coverage
[Figure (b): RelaxFault remapped cache address map. Fields shown: Device ID +, Row, Bk, Rk, Col, Ch, Col, Offset; these map to the LLC set index and LLC offset.]
Repair-aware set indexing avoids hot LLC sets, and coalescing data improves remapping efficiency
* Bk: Bank, Rk: Rank, Ch: Channel, Col: Column
RelaxFault memory repair
[Figure: L1/L2 caches and an M-way LLC (way 0 … way M-1) connected over a bus to main memory (DRAM) through a FreeFault-aware memory controller extended with a coalescer and a faulty-bank table. Example: single row fault. Across the animation steps, the controller issues RF MEM reqs for the faulty row, lines are marked Cached, and once the coalesced data has filled a whole cacheline the line is marked Locked. At the end, 16 lines (1KB) are locked.]
RelaxFault repair is transparent to software
* RF MEM Req: RelaxFault-generated memory request
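One way to picture the coalescer is as a set of per-device bitmasks over a cacheline. The sketch below rests on an assumed data layout, not one taken from the paper: a 64-byte line is treated as 8 beats of 64 data bits, x4 device d owns bits 4d to 4d+3 of every beat, and the ECC devices are ignored. All names are hypothetical.

```python
DEVICES = 16          # data devices per rank (assumed; ECC devices ignored)
BITS_PER_DEVICE = 4   # x4 device: 4 data bits per beat
BEATS = 8             # assumed burst length
BEAT_BITS = DEVICES * BITS_PER_DEVICE   # 64 data bits per beat

def device_mask(dev):
    """Bitmask over a 512-bit line selecting the nibbles owned by `dev`."""
    mask = 0
    for beat in range(BEATS):
        mask |= 0xF << (beat * BEAT_BITS + dev * BITS_PER_DEVICE)
    return mask

def extract_slice(line, dev):
    """Gather the 8 nibbles of `dev` from a line into one 32-bit block."""
    out = 0
    for beat in range(BEATS):
        nib = (line >> (beat * BEAT_BITS + dev * BITS_PER_DEVICE)) & 0xF
        out |= nib << (beat * BITS_PER_DEVICE)
    return out

def merge_slice(line, dev, slice32):
    """Overwrite the nibbles of `dev` in `line` with those held in `slice32`."""
    line &= ~device_mask(dev)
    for beat in range(BEATS):
        nib = (slice32 >> (beat * BITS_PER_DEVICE)) & 0xF
        line |= nib << (beat * BEAT_BITS + dev * BITS_PER_DEVICE)
    return line

# A good copy of device 3's 4-byte block repairs a line read from faulty DRAM.
good = int.from_bytes(bytes(range(64)), "little")
saved = extract_slice(good, 3)
corrupted = good ^ device_mask(3)   # every bit from device 3 flipped
repaired = merge_slice(corrupted, 3, saved)
print(repaired == good)
```

Sixteen such 4-byte blocks fit in one 64-byte LLC line, so one locked line can hold the faulty device's share of sixteen cachelines.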
RESILIENCE/AVAILABILITY EVALUATION
DRAM resilience evaluation
• Monte-Carlo DRAM fault/resilience simulator
  – DRAM fault modes and rates from a DDR3 memory field study [1]
  – 16K nodes per system, 8 x4 DIMMs per node, 8MB 16-way LLC
• Evaluation metrics
  – Repair coverage
    • % of nodes with any permanent fault that are repaired
  – Rate of DUEs and DIMM replacements
  – Measured over 6 years
TABLE. Failure rate of DRAM device (FIT/device) [1]
[1] Vilas Sridharan et al., “Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults,” SC 2013
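A toy version of the Monte-Carlo idea is sketched below. It is not the simulator used in the evaluation: it models a single fault class with a constant rate, and the 20 FIT/device rate and 18 devices per DIMM are assumed values. It estimates the fraction of nodes that see at least one fault over 6 years and compares the estimate with the closed-form answer.

```python
import math
import random

HOURS_PER_YEAR = 8760

def faulty_node_fraction(nodes, devices_per_node, fit_per_device, years, seed=0):
    """Monte-Carlo estimate of the fraction of nodes with at least one fault."""
    rng = random.Random(seed)
    horizon = years * HOURS_PER_YEAR
    rate = fit_per_device * 1e-9 * devices_per_node  # faults per node-hour
    # The time to a node's first fault is exponentially distributed.
    faulty = sum(1 for _ in range(nodes) if rng.expovariate(rate) < horizon)
    return faulty / nodes

def analytic_fraction(devices_per_node, fit_per_device, years):
    """Closed form for the same quantity: 1 - exp(-rate * time)."""
    lam = fit_per_device * 1e-9 * devices_per_node * years * HOURS_PER_YEAR
    return 1.0 - math.exp(-lam)

# 16K nodes and 8 DIMMs per node follow the slide; the rest is assumed.
estimate = faulty_node_fraction(16_384, 8 * 18, 20, 6)
print(estimate, analytic_fraction(8 * 18, 20, 6))
```

A real resilience simulator would also sample the fault mode (bit, row, column, bank) and decide whether each repair scheme can absorb it, which a closed form cannot easily capture.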
Repair coverage
[Figure: repair coverage results]
~256KB can repair about 97% of faulty nodes
* PPR: Post Package Repair (does not require LLC to repair memory)
Number of DUEs per system
[Figure: Detected Uncorrectable Errors (DUEs) per system; panels (a) 1x FIT and (b) 10x FIT]
Repairing memory reduces the expected number of system failures
* PPR: Post Package Repair (does not require LLC to repair memory)
Number of DIMM replacements per system
[Figure: DIMM replacements per system; panels (a) 1x FIT and (b) 10x FIT]
Repairing memory enhances availability as well as resilience
* PPR: Post Package Repair (does not require LLC to repair memory)
Overhead estimation – storage
• No storage overhead for remapped data
• Metadata storage is as small as 16KB

Component              Size (bytes)   Description
Faulty-bank table      8              1 byte per DIMM in a node
Data coalescer         128            Pre-computed bitmasks
LLC tag extension[1]   16,384         1 bit per LLC tag
Total                  16,520

[1] No noticeable size/latency change shown in CACTI 6.5 simulation
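The totals in the storage table can be reproduced from the evaluated configuration. This is a minimal sketch, assuming 64-byte cachelines; the 8MB LLC and 8 DIMMs per node come from the evaluation setup, and the coalescer size is taken directly from the table.

```python
LLC_BYTES = 8 * 1024 * 1024   # 8MB LLC (evaluated configuration)
LINE_BYTES = 64               # assumed cacheline size
DIMMS_PER_NODE = 8

faulty_bank_table = DIMMS_PER_NODE * 1               # 1 byte per DIMM
data_coalescer = 128                                 # taken from the table
llc_tag_extension = (LLC_BYTES // LINE_BYTES) // 8   # 1 bit per LLC tag, in bytes

total = faulty_bank_table + data_coalescer + llc_tag_extension
print(total)  # 16520
```

The tag extension dominates: one extra bit on each of the 131,072 LLC tags accounts for 16,384 of the 16,520 bytes.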
Overhead estimation – energy
• The vast majority of accesses only require accessing a tiny (e.g. 8B) direct-mapped faulty-bank table
  – Negligible energy impact
• Most of the remaining accesses require an extra cache lookup
  – 1.5% of LLC access energy

                  LLC tag   LLC (tag+data)   DRAM
Access energy     9 pJ      641 pJ           36,000 pJ
Relative energy   100 %     1.5 %            0.03 %

No performance and energy impact observed
Detailed results are shown in the paper
Conclusions
• HW-only permanent memory fault tolerance
• Software-transparent memory repair
• Zero redundancy (almost)
• Repair-aware mapping improves coverage and LLC use
• No impact on performance and energy consumption