End-to-End Error Correction and Online Diagnosis for On-Chip Networks
CDSC Annual Review, May 31st-Jun 1st, 2011
Saeed Shamshiri, Amirali Ghofrani, Kwang-Ting (Tim) Cheng
UC Santa Barbara Department of ECE
Error Correction Codes
Online Diagnosis
Switch-to-Switch (S2S)
Experimental Results
XY Routing
Interleaved SEC Hamming(21,16). 4-degree interleaving provides 4-bit burst correction.
End-to-End (E2E)
Interleaved error-locality-aware 2G4L(26,16). 4-degree interleaving provides 16-bit burst correction. E2E approach is four times cheaper than S2S.
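The two burst figures quoted for S2S and E2E can be checked with a small sketch. It assumes simple round-robin bit interleaving, where flit bit i belongs to codeword i mod 4; the poster does not state the exact bit mapping, and the function names are illustrative only. Under that assumption a 4-bit burst puts at most one error in each Hamming(21,16) codeword, and a 16-bit burst puts at most four adjacent errors in each 2G4L(26,16) codeword.

```python
def interleave(codewords):
    """Bit-interleave equal-length codewords: flit bit i comes from codeword i % degree."""
    degree = len(codewords)
    n = len(codewords[0])
    return [codewords[i % degree][i // degree] for i in range(degree * n)]


def burst_footprint(start, burst_len, degree):
    """Map a burst of adjacent flit bits to the positions it hits inside each codeword."""
    hits = {c: [] for c in range(degree)}
    for i in range(start, start + burst_len):
        hits[i % degree].append(i // degree)
    return hits


def worst_case_span(n, degree, burst_len):
    """Longest run of adjacent positions any one codeword sees, over all burst starts."""
    worst = 0
    for start in range(degree * n - burst_len + 1):
        for positions in burst_footprint(start, burst_len, degree).values():
            if positions:
                worst = max(worst, positions[-1] - positions[0] + 1)
    return worst


# Assumed round-robin mapping, 4-degree interleaving as on the poster.
span_hamming = worst_case_span(n=21, degree=4, burst_len=4)   # S2S, Hamming(21,16)
span_2g4l = worst_case_span(n=26, degree=4, burst_len=16)     # E2E, 2G4L(26,16)
print(span_hamming, span_2g4l)  # 1 4
```

A burst one bit longer than 16 would leave five adjacent errors in some codeword, which is beyond the four-adjacent-bit correction that the 16-bit figure implies for each 2G4L codeword.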
[Figure. Panel: a. Right-going links. Axis: Expected Value.]
Min Routing
[Figure. Legend: XY-Route, Hybrid-Route, Min-Route. Axis labels: B(l), Number of Corrections, Number of Packets, E2E Defect Observation Escape Rate.]
Parity check matrix of 2G4L(26,16)
Bit       Encoded     Parity bit coverage              Decimal
Position  Data Bits   p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
 1        p1           1  0  0  0  0  0  0  0  0  0    1
 2        p2           0  1  0  0  0  0  0  0  0  0    2
 3        p3           0  0  1  0  0  0  0  0  0  0    4
 4        p4           0  0  0  1  0  0  0  0  0  0    8
 5        p5           0  0  0  0  1  0  0  0  0  0    16
 6        p6           0  0  0  0  0  1  0  0  0  0    32
 7        p7           0  0  0  0  0  0  1  0  0  0    64
 8        p8           0  0  0  0  0  0  0  1  0  0    128
 9        d1           1  1  0  0  1  1  0  0  0  0    51
10        d2           1  0  1  0  1  0  1  0  0  0    85
11        d3           1  1  1  1  1  1  1  0  0  0    127
12        d4           1  0  1  1  0  1  0  1  0  0    173
13        p9           0  0  0  0  0  0  0  0  1  0    256
14        d5           1  1  0  1  1  0  0  0  1  0    283
15        d6           1  0  1  0  1  1  0  0  1  0    309
16        d7           1  1  1  0  0  0  1  0  1  0    327
17        d8           1  0  1  1  1  0  0  1  1  0    413
18        d9           0  0  1  0  0  1  0  1  1  0    420
19        p10          0  0  0  0  0  0  0  0  0  1    512
20        d10          1  1  0  0  0  0  1  0  0  1    579
21        d11          1  0  1  1  1  0  0  0  0  1    541
22        d12          1  1  1  0  0  1  0  1  0  1    679
23        d13          0  1  1  1  0  0  0  1  1  1    910
24        d14          1  1  0  1  0  1  1  1  1  1    1003
25        d15          1  0  0  1  0  0  1  0  1  1    841
26        d16          0  0  0  1  0  1  1  1  0  1    744
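As a rough illustration of how the Decimal column of the 2G4L(26,16) matrix can be used, the sketch below treats each decimal as the syndrome produced by a single flipped bit at that position (bit k of the value is set when parity bit p(k+1) covers the position). The syndrome of several flipped bits is then the XOR of their values. The lookup-table decoder and all function names are assumptions for illustration, not the decoder circuit shown on the poster.

```python
from itertools import combinations

# Decimal column of the 2G4L(26,16) parity check matrix, bit positions 1..26.
COLUMNS_2G4L = [1, 2, 4, 8, 16, 32, 64, 128, 51, 85, 127, 173, 256,
                283, 309, 327, 413, 420, 512, 579, 541, 679, 910,
                1003, 841, 744]


def syndrome(error_positions):
    """XOR the matrix columns of the flipped bit positions (1-based)."""
    s = 0
    for pos in error_positions:
        s ^= COLUMNS_2G4L[pos - 1]
    return s


def build_lookup(patterns):
    """Map each syndrome to the first pattern producing it; also list any clashes."""
    table, clashes = {}, []
    for pattern in patterns:
        s = syndrome(pattern)
        if s in table:
            clashes.append((table[s], pattern))
        else:
            table[s] = pattern
    return table, clashes


positions = range(1, len(COLUMNS_2G4L) + 1)
singles = [(p,) for p in positions]
doubles = list(combinations(positions, 2))

single_table, single_clashes = build_lookup(singles)
print(single_table[51])      # (9,): a lone error on d1 gives syndrome 51
print(len(single_clashes))   # 0: the 26 column values are pairwise distinct

# Same check extended to two random errors; the clash count is printed
# rather than asserted, since the poster does not tabulate it.
double_table, double_clashes = build_lookup(singles + doubles)
print(len(double_clashes))
```

The same `build_lookup` helper could be fed patterns of adjacent errors to examine the local (burst) part of the code, given a precise definition of which adjacent patterns count.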
[Figure. Labels: Expected Value, Suspicion Value.]
Decoder of 2G4L(26,16)
Synthesis Results
                          BCH(26,16)    2G4L(26,16)
Area (um^2)     Encoder   2872          2744
                Decoder   21043         22173
Power (mW)      Encoder   2.5107        2.2365
                Decoder   13.758        14.0756
Latency (ns)    Encoder   0.9           0.78
                Decoder   3.4           3.35
[Figure. Axis labels: Usage Probability, NUMBER OF ROUTES.]
[Figure. Legend: 0% Noise, 20% Noise, 50% Noise. Axis labels: Accuracy of Diagnosis, Number of Syndromes.]
[Figure. Panel: b. Up-going links (Mesh).]
Implementations
Parity check matrix of BCH(26,16)
Bit       Encoded     Parity bit coverage              Decimal
Position  Data Bits   p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
 1        p1           1  0  0  0  0  0  0  0  0  0    1
 2        p2           0  1  0  0  0  0  0  0  0  0    2
 3        p3           0  0  1  0  0  0  0  0  0  0    4
 4        p4           0  0  0  1  0  0  0  0  0  0    8
 5        p5           0  0  0  0  1  0  0  0  0  0    16
 6        p6           0  0  0  0  0  1  0  0  0  0    32
 7        p7           0  0  0  0  0  0  1  0  0  0    64
 8        p8           0  0  0  0  0  0  0  1  0  0    128
 9        p9           0  0  0  0  0  0  0  0  1  0    256
10        p10          0  0  0  0  0  0  0  0  0  1    512
11        d1           1  0  0  1  0  1  1  0  1  1    873
12        d2           1  1  0  1  1  1  0  1  1  0    443
13        d3           0  1  1  0  1  1  1  0  1  1    886
14        d4           1  0  1  0  0  0  0  1  1  0    389
15        d5           0  1  0  1  0  0  0  0  1  1    778
16        d6           1  0  1  1  1  1  1  0  1  0    381
17        d7           0  1  0  1  1  1  1  1  0  1    762
18        d8           1  0  1  1  1  0  0  1  0  1    669
19        d9           1  1  0  0  1  0  1  0  0  1    595
20        d10          1  1  1  1  0  0  1  1  1  1    975
21        d11          1  1  1  0  1  1  1  1  0  0    247
22        d12          0  1  1  1  0  1  1  1  1  0    494
23        d13          0  0  1  1  1  0  1  1  1  1    988
24        d14          1  0  0  0  1  0  1  1  0  0    209
25        d15          0  1  0  0  0  1  0  1  1  0    418
26        d16          0  0  1  0  0  0  1  0  1  1    836
Encoder of 2G4L(26,16)
Error-locality-aware Codes
2G4L(26,16): same cost as BCH(26,16), more reliable
BurstCode(26,20): 25% higher code rate than BCH(26,16), but only reliable against adjacent errors
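The 25% figure follows from the code parameters alone, taking the code rate of an (n,k) code to be k/n (the standard definition, assumed to be the one meant here):

```latex
R_{\mathrm{BCH}(26,16)} = \frac{16}{26} \approx 0.615, \qquad
R_{\mathrm{BurstCode}(26,20)} = \frac{20}{26} \approx 0.769, \qquad
\frac{R_{\mathrm{BurstCode}(26,20)}}{R_{\mathrm{BCH}(26,16)}} = \frac{20}{16} = 1.25
```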
Designing codes for burst (local) and random (global) errors
Conclusion
A comprehensive end-to-end solution for error correction, data collection, defect diagnosis, and replacement in on-chip networks has been proposed. Four interleaved 2G4L(26,16) codes correct two random errors and up to 16 adjacent-bit errors per flit. E2E error-pattern information is gathered by centralized software on the host processor and used to diagnose defective wires. Even under heavy noise, a high escape rate, uncertainty about routing, and other harmful effects, the collected data remain accurate enough for diagnosis. The collected data can also serve other purposes, such as diagnosing defective routers, locating intermittent faults, and making other system-level observations.