RelaxFault Memory Repair

- Balance Resilience, Performance, and Cost -

Dong Wan Kim

Mattan Erez

The University of Texas at Austin

RelaxFault Memory Repair [ISCA’16] (c) Dong Wan Kim


Are DRAM faults rare?

~150 years

~7 hours (25k nodes, 1.5PB memory)

5% (in a year)

Takeaway 1: DRAM failures are frequent, but only in a few nodes


Transient or permanent?

[Figure: DRAM fault rate (FIT/device), transient vs. permanent faults, by fault mode: single-bit/-word/-row/-column, single-bank, multi-bank/-rank]

* FIT (Failure in Time): Number of failures that can be expected in 1 billion device-hours of operation.

* V. Sridharan et al., “Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults,” SC 2013
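The FIT definition above translates directly into expected failure counts. The following is a minimal sketch of that arithmetic; the function names and the example numbers (50 FIT, one million devices) are illustrative assumptions, not values taken from the field study.

```python
HOURS_PER_YEAR = 24 * 365
FIT_HOURS = 1e9  # FIT counts failures per one billion device-hours


def expected_failures(fit_per_device, devices, hours):
    """Expected number of failures across `devices` over `hours` of operation."""
    return fit_per_device * devices * hours / FIT_HOURS


def mean_hours_between_failures(fit_per_device, devices):
    """Mean time between failures, in hours, for the whole device population."""
    return FIT_HOURS / (fit_per_device * devices)


if __name__ == "__main__":
    # Hypothetical population: 1,000,000 devices at 50 FIT each.
    print(expected_failures(50, 1_000_000, HOURS_PER_YEAR))
    print(mean_hours_between_failures(50, 1_000_000))
```

The same per-device rate that looks negligible for one device shrinks to hours between failures once it is multiplied by the device count of a large system.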


Takeaway 2: Permanent faults are as frequent as transient faults


DRAM fault granularity

[Figure: DRAM fault rate (FIT/device) by fault granularity (single-bit/-word/-row/-column, single-bank, multi-bank/-rank), for transient and permanent faults]

Takeaway 3: Most faults affect a small memory region

[1] Vilas Sridharan and Dean Liberty, “A Study of DRAM Failures in the Field,” SC 2012

* V. Sridharan et al., “Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults,” SC 2013


Is replacing faulty memory cheap?


Takeaway 4: Replacing DIMMs may lower system availability


ECC[1] to the rescue?
• Storage overhead
  – Strong ECC → more redundancy
  – Multiple device faults are a problem even though they are rare
    • Accumulated faults may result in DUE[2] or SDC[3]
• Latency overhead
  – Frequent error correction when permanent faults exist
  – CPU may raise exception to log errors to MCA[4] register

[1] ECC: Error Correcting Code
[2] DUE: Detected Uncorrectable Error
[3] SDC: Silent Data Corruption (Undetected errors)
[4] MCA: Machine Check Architecture


Takeaway 5: ECC is inefficient for permanent faults


Memory repair to the rescue!!!
• Permanent faults are frequent → Retire faulty regions
• Faults likely affect small regions → Retire faulty regions

Repair-based fault tolerance mechanism that improves resilience and availability while balancing performance and cost


Related work ─ remapping w/ redundancy
• Row/column sparing
• Post package repair (PPR)
  – In-the-field memory repair (introduced in DDR4 and LPDDR4)
  – Repair 1 row per bank group (DDR4)

Source: Samsung Electronics, “Understanding DDR4 and Today’s DRAM Frontier,” Oct 2014


Related work ─ FreeFault [1]

[Figure: a processor with a FreeFault-aware memory controller and an M-way LLC (way 0 through way M-1), attached to a 72-bit x4 ECC DIMM whose DRAM has a permanent fault. Example: single row fault]

[1] Dong Wan Kim and Mattan Erez, “Balancing Reliability, Cost, and Performance Tradeoffs with FreeFault,” HPCA 2015

[Figure, step (a): after reading one codeword (72-bit data) from the faulty row, the corresponding line in LLC set N is LOCKED]

[Figure, step (b): after reading all the data from the faulty row, 256 lines (16KB) in LLC sets N through N+255 are LOCKED]

FreeFault provides repair capability within the microarchitecture, and repair is transparent to software


Motivation

[Figure: the FreeFault example for a single row fault. After reading all the data from the faulty row, the FreeFault-aware memory controller has LOCKED 256 lines (16KB) in LLC sets N through N+255]


Simplicity of repair mechanism vs. efficiency of LLC usage

RelaxFault ─ relax repair constraint
• Retire and remap data ONLY from a faulty device
  – Coalesce multiple data into a single cacheline
    • E.g. sixteen 4-byte blocks in a single 64-byte cacheline
  – Reduce LLC usage significantly
  – Increase repair coverage with a given amount of LLC
• Repair-aware cache address mapping scheme
  – Remapped data of multiple codewords from the same device share the same LLC address
  – Allocation of remapped lines is balanced across LLC sets
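The coalescing arithmetic can be sketched as follows. This is an illustration only, not the paper's implementation: the function names are hypothetical, and the burst length of 8 beats per cacheline transfer is an assumption chosen because it makes one x4 device contribute the 4-byte block quoted above.

```python
import math

CACHELINE_BYTES = 64     # cacheline size quoted on the slide
DEVICE_WIDTH_BITS = 4    # x4 DRAM device
BURST_LENGTH = 8         # assumption: 8 beats per cacheline transfer


def bytes_per_device_per_line():
    """Bytes a single device contributes to one cacheline."""
    return DEVICE_WIDTH_BITS * BURST_LENGTH // 8


def freefault_llc_lines(faulty_lines):
    """FreeFault-style repair locks one whole LLC line per affected cacheline."""
    return faulty_lines


def relaxfault_llc_lines(faulty_lines):
    """Coalesced repair packs only the faulty device's blocks into LLC lines."""
    blocks_per_line = CACHELINE_BYTES // bytes_per_device_per_line()
    return math.ceil(faulty_lines / blocks_per_line)


if __name__ == "__main__":
    row = 256  # cachelines touched by the single-row-fault example
    print(freefault_llc_lines(row) * CACHELINE_BYTES)
    print(relaxfault_llc_lines(row) * CACHELINE_BYTES)
```

Under these assumptions, the 256-line row fault that costs FreeFault 16KB of locked LLC shrinks to 16 lines (1KB), a 16x reduction in LLC usage for the same fault.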


RelaxFault repair address mapping

[Figure (a): normal cache address map (including FreeFault repair), showing the physical address fields (Row, Bk, Rk, Col, Ch, Offset)]

* Bk: Bank, Rk: Rank, Ch: Channel, Col: Column

[Figure (a), continued: the physical address fields are mapped to the LLC set index and LLC offset]

This fine-grained data interleaving results in hot LLC sets with FreeFault, lowering repair coverage

[Figure (b): RelaxFault remapped cache address map, in which a Device ID is added (+) to the physical address fields (Row, Bk, Rk, Col, Ch) that form the LLC set index and LLC offset]

Repair-aware set indexing avoids hot LLC sets; coalescing data improves remapping efficiency


RelaxFault memory repair

[Figure: L1/L2 caches and an M-way LLC (way 0 through way M-1), connected over a bus to main memory (DRAM) through a FreeFault-aware memory controller extended with a coalescer and a faulty-bank table. Example: single row fault]

* RF MEM Req: RelaxFault-generated memory request

[Figure, animation steps for the single row fault: RelaxFault-generated memory requests (RF MEM req) are issued; coalesced data that has filled a whole cacheline is cached and locked in the LLC; the repair occupies 16 lines (1KB)]

RelaxFault repair is transparent to software


RESILIENCE/AVAILABILITY EVALUATION


DRAM resilience evaluation
• Monte-Carlo DRAM fault/resilience simulator
  – DRAM fault mode/rate from DDR3 memory field study [1]
  – 16K nodes per system, 8 x4 DIMMs per node, 8MB 16-way LLC
• Evaluation metrics
  – Repair coverage
    • % of nodes with any permanent fault that are repaired
  – Rate of DUEs and DIMM replacements
  – Measured over 6 years

TABLE. Failure rate of DRAM device (FIT/device) [1]

[1] Vilas Sridharan et al., “Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults,” SC 2013
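A Monte-Carlo fault simulation of this kind can be sketched in a few lines. This is not the authors' simulator: it only estimates the fraction of nodes that see at least one permanent fault, the FIT value of 20 is a placeholder (the table values are not reproduced here), and 144 devices per node assumes 8 single-rank DIMMs of 18 x4 devices each.

```python
import math
import random

HOURS_PER_YEAR = 24 * 365


def faults_in_period(rate_per_hour, hours, rng):
    """Count Poisson arrivals in [0, hours) by summing exponential gaps."""
    count = 0
    t = rng.expovariate(rate_per_hour)
    while t < hours:
        count += 1
        t += rng.expovariate(rate_per_hour)
    return count


def fraction_of_faulty_nodes(nodes, devices_per_node, fit_per_device, years, seed=0):
    """Simulate every node once; return the share with one or more faults."""
    rng = random.Random(seed)
    rate = fit_per_device * devices_per_node / 1e9  # faults per node-hour
    hours = years * HOURS_PER_YEAR
    faulty = sum(1 for _ in range(nodes) if faults_in_period(rate, hours, rng) > 0)
    return faulty / nodes


if __name__ == "__main__":
    # Hypothetical inputs: 16K nodes, 144 devices per node, 20 FIT, 6 years.
    simulated = fraction_of_faulty_nodes(16 * 1024, 144, 20.0, 6)
    analytic = 1 - math.exp(-20.0 * 144 * 6 * HOURS_PER_YEAR / 1e9)
    print(simulated, analytic)
```

A full resilience simulator would additionally draw a fault mode for each arrival and check whether the repair mechanism can cover it; the closed-form value printed alongside serves as a sanity check for the sampling loop.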


Repair coverage

* PPR: Post Package Repair (does not require LLC to repair memory)


~256KB can repair about 97% of faulty nodes


Number of DUEs per system

[Figure: Detected Uncorrectable Errors (DUEs) per system at (a) 1x FIT and (b) 10x FIT]

Repairing memory reduces the expected number of system failures


Number of DIMM replacements per system

[Figure: DIMM replacements per system at (a) 1x FIT and (b) 10x FIT]

Repairing memory enhances availability as well as resilience


Overhead estimation – storage
• No storage overhead for remapped data
• Metadata storage is as small as 16KB

                        Size (bytes)   Description
Faulty-bank table       8              1 byte per DIMM in a node
Data coalescer          128            Pre-computed bitmasks
LLC tag extension[1]    16,384         1 bit per LLC tag
Total                   16,520

[1] No noticeable size/latency change shown in CACTI 6.5 simulation
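The storage total can be re-derived from the evaluation configuration (8 DIMMs per node, 8MB LLC, 64-byte cachelines). A minimal sketch, with hypothetical names; the 128-byte coalescer size is taken from the table rather than derived:

```python
LLC_BYTES = 8 * 1024 * 1024   # 8MB LLC from the evaluation setup
CACHELINE_BYTES = 64          # cacheline size quoted earlier in the talk
DIMMS_PER_NODE = 8
COALESCER_BYTES = 128         # taken directly from the table


def metadata_bytes():
    """Per-node metadata storage, broken down by structure."""
    faulty_bank_table = DIMMS_PER_NODE * 1        # 1 byte per DIMM
    llc_lines = LLC_BYTES // CACHELINE_BYTES      # one tag per LLC line
    tag_extension = llc_lines // 8                # 1 bit per tag, in bytes
    return {
        "faulty_bank_table": faulty_bank_table,
        "data_coalescer": COALESCER_BYTES,
        "llc_tag_extension": tag_extension,
        "total": faulty_bank_table + COALESCER_BYTES + tag_extension,
    }


if __name__ == "__main__":
    for name, size in metadata_bytes().items():
        print(f"{name}: {size} bytes")
```

Nearly all of the metadata is the one extra bit per LLC tag, so the overhead scales with LLC capacity rather than with the amount of DRAM being protected.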


Overhead estimation – energy
• The vast majority of accesses only require accessing a tiny (e.g. 8B) direct-mapped faulty-bank table
  – Negligible energy impact
• Most of the remaining accesses require an extra cache lookup
  – 1.5% of LLC access energy

                   LLC tag   LLC (tag+data)   DRAM
Access energy      9 pJ      641 pJ           36,000 pJ
Relative energy    100 %     1.5 %            0.03 %

No performance and energy impact observed
Detailed results are shown in the paper


Conclusions
• HW-only permanent memory fault tolerance
• Software-transparent memory repair
• Zero redundancy (almost)
• Repair-aware mapping improves coverage and LLC use
• No impact on performance and energy consumption
