Computer simulations create the future

The K computer and its failures
Fumiyoshi Shoji
Operations and Computer Technologies Division, RIKEN AICS
The 6th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2016 @ Kyoto
31 May, 2016

RIKEN ADVANCED INSTITUTE FOR COMPUTATIONAL SCIENCE

• Introduction to RIKEN and RIKEN AICS
• The K computer
• Failure analysis of the K computer
• Failure rates
• DIMM failure analysis (Preliminary)
• System wide availability
• Summary


RIKEN
– RIKEN is the largest and most comprehensive research organization for basic and applied science in Japan.
– Foundation: 1917.
– Researchers: ~2,000.

The K computer at RIKEN Kobe campus

SACLA(XFEL) X-ray Free Electron Laser facility at RIKEN Harima campus


RIKEN AICS
• The RIKEN Advanced Institute for Computational Science (AICS) was established on July 1, 2010.
• Missions
– Manage the operations of the K computer, maintaining a user-friendly environment.
– Promote collaborative projects with a focus on the disciplines of computational science and computer science.
– Plot and develop Japan's strategy for computational science, including defining the path to exa-scale computing.


Location of RIKEN AICS
Kobe: 423 km (263 miles) west of Tokyo
[Map: Kobe, Kyoto, Tokyo. Site photo labels: research building, computer building, chiller building, substation supply; ~20,000 m2]


The K computer and its achievements
• The K computer:
  • was developed through a collaboration between RIKEN and FUJITSU as a Japanese national project.
  • has been in operation since Sep. 2012.
  • is designed for general-purpose computing.
    • no accelerators, broader memory/interconnect bandwidth, etc.
• Achievements:
  – TOP500 list: No. 1 in Jun. and Nov. 2011
  – Graph500 list: No. 1 in Jun. 2014, Jul. and Nov. 2015
  – HPCG results: No. 2 in Jul. and Nov. 2015
  – Gordon Bell Prize: winner in 2011 and 2012
  – Other remarkable results in science and engineering
    • See http://www.aics.riken.jp/en/

System Configuration
• Node: CPU×1, ICC×1, memory; 128 GFLOPS, 16 GiB
• System Board (SB): Node×4; 512 GFLOPS, 64 GiB; 500 mm x 500 mm
• Compute Rack: SB×24, IOSB×6; 12.3 (13.1) TFLOPS, 1.50 (1.59) TiB; 800 mm x 800 mm
• 2 Cabinets: Compute Rack×4, Disk Rack×1; 49.2 (52.4) TFLOPS, 6.00 (6.38) TiB; 4000 mm x 800 mm
• Full System: Compute Rack×864; 10.6 (11.3) PFLOPS, 1.27 (1.34) PiB; 40 m x 40 m
Values in ( ) include IO node performance and memory capacity.
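The rack and full-system figures on this slide can be reproduced from the per-node values. The sketch below is only a consistency check: it assumes each IOSB contributes one IO node with the same 128 GFLOPS and 16 GiB as a compute node, which is not stated on the slide but matches the values in parentheses.

```python
# Per-node figures are taken from the slide; the rest is derived.
NODE_GFLOPS = 128
NODE_MEM_GIB = 16
NODES_PER_SB = 4
SBS_PER_RACK = 24
IO_NODES_PER_RACK = 6  # assumption: one IO node per IOSB
RACKS = 864


def rack_totals(include_io):
    """Return (TFLOPS, TiB) for one compute rack."""
    nodes = NODES_PER_SB * SBS_PER_RACK
    if include_io:
        nodes += IO_NODES_PER_RACK
    return nodes * NODE_GFLOPS / 1000, nodes * NODE_MEM_GIB / 1024


def system_totals(include_io):
    """Return (PFLOPS, PiB) for the full system of 864 racks."""
    tflops, tib = rack_totals(include_io)
    return RACKS * tflops / 1000, RACKS * tib / 1024


for include_io in (False, True):
    r_tf, r_tib = rack_totals(include_io)
    s_pf, s_pib = system_totals(include_io)
    label = "with IO nodes" if include_io else "compute nodes only"
    print(f"{label}: rack {r_tf:.1f} TFLOPS / {r_tib:.2f} TiB, "
          f"system {s_pf:.1f} PFLOPS / {s_pib:.2f} PiB")
```

Performance is scaled with decimal prefixes and memory with binary prefixes, which is what makes the derived values line up with the slide.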

Overview of Usage (2012/09/28 - 2016/01/31)
• Registered projects/users: ~150/1200 per FY
• Average number of executed jobs: 1275.0/day
• Average number of active users: 113.4/day

Job scale (node*time based)
• <101: 4.1%
• 101-500: 14.3%
• 501-1000: 16.7%
• 1001-2000: 13.0%
• 2001-5000: 24.8%
• 5001-10000: 8.5%
• 10001-20000: 6.9%
• 20001-50000: 7.7%
• 50001-80000: 1.1%
• >80000: 3.0%

Science fields (node*time based)
• material science: 27.6%
• manufacturing: 20.6%
• life science: 16.9%
• physics: 15.0%
• environment: 14.5%
• nuclear: 2.5%
• mathematics: 2.4%
• computer science: 0.3%
• misc: 0.1%

A wide range of job types and users from various fields.

Overview of Operation Status
5% of 365d*24h is kept for scheduled maintenance.

System availability for schedule and job filling rate
[Bar chart, JFY2012 (9/28~) to JFY2015, vertical axis 0% to 100%. System availability for schedule, values shown: 99.7%, 96.7%, 98.2%, 99.0%. Average job filling rate, values shown: 61.2%, 75.9%, 75.6%, 75.3%. Reference: 91.0% (Blue Waters 2015(*)).]
(*) The 2015 Blue Waters Annual Report book

• Remarkably high system availability was achieved.
• Job filling rates remain at a sufficiently high level.


Failure analysis of the K computer
• The K computer consists of an extremely large number of parts and components.
• The K computer has been running under high load for three and a half years and is used by many types of jobs and users.
• Failure events are therefore expected to occur frequently enough to allow a meaningful failure analysis.
The failure statistics of the K computer contain much significant information and are expected to be useful for general failure analysis of supercomputers.

Number of major parts
• Compute Rack: 864
• System Board: 864×24 = 20,736
• CPU: 864×(24×4) = 82,944
• Interconnect Controller (ICC): 864×(24×4) = 82,944
• DIMM: 864×(24×4×8) = 663,552
• PSU: 864×9 = 7,776

CPU/ICC are water-cooled (inlet: 15℃, outlet: 17℃). Other components are air-cooled.

When a CPU, ICC, or System Board fails, the whole system board is replaced. (For a DIMM failure, only the DIMM is replaced.)
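The part counts above all follow from the rack hierarchy (24 system boards per rack, 4 nodes per board, 8 DIMMs per node, 9 PSUs per rack). A minimal sketch that re-derives them; the dictionary layout is illustrative only.

```python
RACKS = 864

# Parts per compute rack, following the hierarchy on the slide.
PARTS_PER_RACK = {
    "System Board": 24,
    "CPU": 24 * 4,       # one CPU per node, four nodes per system board
    "ICC": 24 * 4,       # one interconnect controller per node
    "DIMM": 24 * 4 * 8,  # eight DIMMs per node
    "PSU": 9,
}

totals = {name: RACKS * per_rack for name, per_rack in PARTS_PER_RACK.items()}

for name, count in totals.items():
    print(f"{name:12s} {count:>8,d}")
```

These totals are the denominators used in the monthly failure rates on the following slides.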

Monthly Failure Rate of CPUs
Monthly failure rate = (Failure counts in the month) / (Number of installed CPUs)
Number of CPUs = 82,944 (since July 2012)
Total failure counts = 240 (July 2012-April 2016)
[Plot: monthly failure rate (0.000% to 0.020%) by month, 2011-04 to 2016-04; annotations: Full node LINPACK measurements, Gordon-Bell challenges (two occasions)]
The failure trend of CPUs is almost stable except after high-load events.
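As a rough cross-check of the plot scale, the definition above can be applied to the totals on this slide. The sketch assumes the 240 failures are spread over the full calendar months from July 2012 to April 2016; the actual per-month counts are not given in the text, so only an average can be computed.

```python
def months_inclusive(start, end):
    """Number of calendar months from start to end, given as (year, month), inclusive."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1]) + 1


def monthly_failure_rate(failures_in_month, installed):
    """Monthly failure rate as defined on the slide (a fraction, not a percentage)."""
    return failures_in_month / installed


N_CPUS = 82_944
TOTAL_CPU_FAILURES = 240  # July 2012 - April 2016

n_months = months_inclusive((2012, 7), (2016, 4))
avg_failures_per_month = TOTAL_CPU_FAILURES / n_months
avg_rate = monthly_failure_rate(avg_failures_per_month, N_CPUS)

print(f"{n_months} months, {avg_failures_per_month:.1f} failures/month, "
      f"average monthly failure rate {avg_rate:.4%}")
```

The resulting average sits inside the 0.000% to 0.020% range of the plot axis, as expected.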

Monthly Failure Rate of DIMMs
Monthly failure rate = (Failure counts in the month) / (Number of installed DIMMs)
Number of DIMMs = 663,552 (since July 2012)
Total failure counts = 467 (July 2012-April 2016)
[Plot: monthly failure rate (0.000% to 0.004%) by month, 2011-04 to 2016-04]
The DIMM failure rate seems to decrease gradually and is lower than that of CPUs.

Monthly Failure Rate of System Boards
(includes the failures of CPU, ICC, DIMM and the System Board (SB) itself)
Monthly failure rate = (Failure counts in the month) / (Number of installed System Boards)
Number of SBs = 20,736 (since July 2012)
Total failure counts = 850 (July 2012-April 2016)
[Plot: monthly failure rate (0.00% to 0.16%) by month, 2011-04 to 2016-04]
The failure rate of SBs seems to have reached a plateau.
Average failure count (= maintenance operations): about once every 2 days.
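The "about once every 2 days" figure can be checked against the totals on this slide. The sketch assumes the counting period runs from 1 July 2012 to 30 April 2016 inclusive; the exact boundary dates are not given in the text.

```python
from datetime import date

TOTAL_SB_FAILURES = 850

# Assumed period boundaries (whole calendar months, inclusive).
period_start = date(2012, 7, 1)
period_end = date(2016, 4, 30)

days = (period_end - period_start).days + 1
days_per_failure = days / TOTAL_SB_FAILURES

print(f"{days} days, one system board replacement every {days_per_failure:.2f} days")
```

The result is a little under two days per replacement, consistent with the rounded figure quoted above.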

Monthly Failure Count of PSU
Monthly failure rate = (Failure counts in the month) / (Number of installed PSUs)
Number of PSUs = 7,776 (since July 2012)
Total failure counts = 194 (July 2012-April 2016)
[Plot: monthly failure count (0 to 30) by month, 2011-04 to 2016-04; series: failure, preventive replacement]
PSU failures have increased rapidly, and preventive replacement based on suspicious signals has been applied since 2015.

Failure analysis: a comparison of failure rates with Blue Waters
FIT: Failures In Time (1 FIT = 1 failure per 10^9 hours)

                  K computer (April 2011 – January 2016)    Blue Waters(*)
CPU (FIT)         69.86                                      265.15
DIMM (FIT/GB)     8.82                                       15.98

(*) C. Di Martino et al., Lessons learned from the analysis of system failures at petascale: the case of Blue Waters. 44th International Conference on Dependable Systems and Networks (DSN 2014), 2014.

• The CPU and DIMM failure rates of the K computer are about 1/4 and 1/2 of those of Blue Waters, respectively.
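The FIT definition and the "about 1/4 and 1/2" comparison can be written out as follows. This is a sketch only: the `fit` helper is a hypothetical name, and the failure counts and device-hours behind the slide's 69.86 and 8.82 are not given in the text, so the ratios are computed from the published table values.

```python
def fit(failures, devices, hours):
    """Failures In Time: failures per 10^9 device-hours."""
    return failures * 1e9 / (devices * hours)


# Values copied from the comparison table on the slide.
K_COMPUTER = {"CPU (FIT)": 69.86, "DIMM (FIT/GB)": 8.82}
BLUE_WATERS = {"CPU (FIT)": 265.15, "DIMM (FIT/GB)": 15.98}

ratios = {key: K_COMPUTER[key] / BLUE_WATERS[key] for key in K_COMPUTER}

for key, ratio in ratios.items():
    print(f"{key}: K computer / Blue Waters = {ratio:.2f}")
```

For the DIMM row the denominator of `fit` would be installed gigabytes times hours rather than devices times hours, since that row is expressed in FIT/GB.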

Consideration of low CPU failure rate
Tj (junction temperature) of CPU = 30℃

Arrhenius's law: k = A e^{-E_a / (k_B T)}
k: chemical reaction rate constant
A: constant
E_a: activation energy
k_B: Boltzmann constant
T: temperature

According to our early estimation, if Tj of the CPU could be decreased from 85℃ to 30℃, the relative lifetime would be 60 ~ 100 times longer.
[Plot: Lifetime (relative), log scale 1.0E-02 to 1.0E+04, vs. Tj (℃), 0 to 120; 30℃ and 85℃ marked, x60~x100]

Low Tj makes an essential contribution to the lower CPU failure rate.
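The 60 ~ 100 times estimate can be illustrated with the Arrhenius relation, taking lifetime as inversely proportional to the rate constant k. The activation energies below (0.70 eV and 0.78 eV) are assumptions chosen for illustration; the slide does not state which value was used.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def lifetime_ratio(ea_ev, tj_low_c, tj_high_c):
    """Lifetime at tj_low_c relative to lifetime at tj_high_c.

    With lifetime proportional to 1/k and k = A * exp(-Ea / (kB * T)),
    the ratio is exp(Ea / kB * (1/T_low - 1/T_high)), temperatures in kelvin.
    """
    t_low = tj_low_c + 273.15
    t_high = tj_high_c + 273.15
    return math.exp(ea_ev / K_B_EV_PER_K * (1.0 / t_low - 1.0 / t_high))


# Hypothetical activation energies, not taken from the slide.
for ea in (0.70, 0.78):
    ratio = lifetime_ratio(ea, 30.0, 85.0)
    print(f"Ea = {ea:.2f} eV: lifetime at 30 C is about {ratio:.0f}x that at 85 C")
```

Under these assumptions the ratio falls roughly in the x60 to x100 band shown on the plot; a different activation energy would shift it considerably, since the dependence is exponential.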


DIMM failures and rack location (April 2014 – April 2016)
[Floor map: DIMM failure count per compute rack, rack columns A to X by rows 1 to 45]
Failure counts are not uniform across racks.

DIMM failures and SB location
[Figure: distribution of DIMM failures by system board (SB) location within the rack]

DIMM failures and DIMM slot location
[Figure: distribution of DIMM failures by DIMM slot location relative to the CPUs on a system board]

Rack inlet air temperature since April 2014
[Floor maps: average and standard deviation of the inlet air temperature per rack, rack columns A to X by rows 1 to 45]
• Average: Min: 18.38℃ - Max: 26.26℃
• Standard deviation: Min: 0.08℃ - Max: 1.87℃
• The differences between the lowest and the highest rack values are approximately 8℃ for the average and 2℃ for the standard deviation.
• Every other rack tends to be warmer and to show larger fluctuation.

Exhaust heat from neighboring disk rack
[Diagram: row of racks (compute rack, compute rack, disk rack, compute rack, compute rack) with the sensor position marked]

DIMM failures and rack inlet air temperature: average temperature
[Chart: number of racks (0 to 350 counts) and average failures per rack (0% to 90%) for ten equal-width bins of average inlet air temperature from 18.38℃ to 26.26℃. Bin edges: 18.38, 19.17, 19.96, 20.74, 21.53, 22.32, 23.11, 23.90, 24.68, 25.47, 26.26. Average failures per rack, values shown: 23.5%, 20.8%, 21.3%, 16.3%, 14.9%, 19.1%, 18.8%, 17.7%, 19.0%, 6.3%.]
The temperature distribution over racks has two peaks.
The failure rate of a rack does not seem to depend on its average temperature.
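The bin boundaries on the temperature axis are consistent with ten equal-width bins between the minimum and maximum rack averages. A minimal sketch of that binning; the helper names are hypothetical, and the per-rack temperatures themselves are not reproduced here.

```python
from bisect import bisect_right


def bin_edges(lo, hi, n_bins=10):
    """Edges of n_bins equal-width bins between lo and hi, rounded to 2 decimals."""
    width = (hi - lo) / n_bins
    return [round(lo + i * width, 2) for i in range(n_bins + 1)]


def bin_index(value, edges):
    """Index of the bin containing value; the maximum falls into the last bin."""
    idx = bisect_right(edges, value) - 1
    return min(max(idx, 0), len(edges) - 2)


edges = bin_edges(18.38, 26.26)
print(edges)

# Example: a rack averaging 19.3 C falls into the second bin (19.17 - 19.96).
print(bin_index(19.3, edges))
```

Counting racks and failures per bin index then gives the two series plotted in the chart.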

DIMM failures and rack inlet air temperature: standard deviation
[Chart: number of racks (0 to 400 counts) and average failures per rack (0% to 120%) for ten equal-width bins of the standard deviation of inlet air temperature from 0.08℃ to 1.87℃. Bin edges: 0.08, 0.26, 0.44, 0.62, 0.80, 0.98, 1.15, 1.33, 1.51, 1.69, 1.87. Average failures per rack, values shown: 66.7%, 42.9%, 20.5%, 18.0%, 27.8%, 15.8%, 9.1%, 14.8%, 0.0%, 0.0%.]
Failure rates seem to depend on the fluctuation of the temperature, except for the highest segments.

DIMM failures and air temp. ranking
[Floor maps: racks ranked by average temperature and racks ranked by failure counts]
The results suggest that DIMM failures do not depend on the long-term trend of the temperatures.


System availability
Actual system availability (September 2012 – March 2016)
• in operation: 93.8%
• scheduled maintenance: 4.0%
• system failure: 2.2%
  – system failure (LFS): 1.1%
  – system failure (GFS): 0.4%
  – system failure (misc): 0.3%
  – system failure (MPI): 0.2%
  – system failure (job scheduler): 0.2%

• About 70% of the system failure time was due to file system (LFS and GFS) failures.

Causes of system failure (LFS+GFS)
• System software bugs/invalid settings (34.6%)
• MDS/OSS down due to users' irregular use (23.6%)
• File system hardware troubles (14.5%)
• RAID (bad block) (13.7%)
• Under investigation (8.4%)
• Human errors (5.1%)
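The "about 70%" statement follows directly from the breakdown of system failure time. A short sketch using the percentages from this slide; the variable names are illustrative.

```python
# Share of total time (percent), September 2012 - March 2016, from the slide.
IN_OPERATION = 93.8
SCHEDULED_MAINTENANCE = 4.0
SYSTEM_FAILURE = {
    "LFS": 1.1,
    "GFS": 0.4,
    "misc": 0.3,
    "MPI": 0.2,
    "job scheduler": 0.2,
}

total_failure = sum(SYSTEM_FAILURE.values())
file_system_share = (SYSTEM_FAILURE["LFS"] + SYSTEM_FAILURE["GFS"]) / total_failure

print(f"system failure total: {total_failure:.1f}%")
print(f"file system share of failure time: {file_system_share:.0%}")
print(f"all categories: {IN_OPERATION + SCHEDULED_MAINTENANCE + total_failure:.1f}%")
```

Because the inputs are rounded to one decimal place, the computed share is slightly below 70%, which agrees with the slide's "about 70%".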

Consideration of LFS failures
Design concept for user requirements:
• LFS consists of many OSSes and OSTs to realize higher bandwidth.
  • OSS: 2592, OST: 5184 (GFS: OSS: 90, OST: 2880)
• LFS is configured as one huge volume to provide a shared area.
[Diagram: the K computer compute nodes (# of CPU: 82944, memory capacity: 1.27 PiB), 6D mesh/torus network, I/O nodes (OSS, many), Local File System (LFS) (11 PB, one volume), Global File System (GFS) (30 PB)]

Results:
• The large number of OSSes and OSTs exposed many latent bugs in the file system software (Lustre 1.X based), and many severe failures were caused by such bugs.
• An LFS outage stops all services, because LFS is a single point of failure.

Lessons learned:
• Do not configure a file system with a large number of OSSes and OSTs, to avoid latent bugs.
• Do not create one huge volume, to avoid a single point of failure.

MTBF/MTTR (Sep. 2012 - Jan. 2016)
MTBF (Mean Time Between (system-wide) Failures)
= ((Real time) – (Maintenance time) – (Irregular stop time)) / (System-wide irregular stop counts)
= 27402.8 hours / 62 times = 442.0 hours = 18.4 days
cf. 11.2 days (Blue Waters 2015(*))
(*) The 2015 Blue Waters Annual Report book: https://bluewaters.ncsa.illinois.edu/documents/10157/27cb9800-01c1-49be-a7aa-a210ad14d21b

MTTR (Mean Time To Recovery)
= Average (System-wide irregular stop time) = 10.6 hours (Max. 49.3 hours (October 2012))
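The MTBF arithmetic on this slide can be reproduced as below. The final line uses the common textbook approximation availability = MTBF / (MTBF + MTTR), which is not on the slide and is included only as a rough plausibility check.

```python
UPTIME_HOURS = 27402.8  # real time - maintenance time - irregular stop time
IRREGULAR_STOPS = 62    # system-wide irregular stop count
MTTR_HOURS = 10.6       # average system-wide irregular stop time

mtbf_hours = UPTIME_HOURS / IRREGULAR_STOPS
mtbf_days = mtbf_hours / 24

# Textbook steady-state approximation (an assumption, not from the slide).
approx_availability = mtbf_hours / (mtbf_hours + MTTR_HOURS)

print(f"MTBF = {mtbf_hours:.1f} hours = {mtbf_days:.1f} days")
print(f"approximate availability excluding maintenance = {approx_availability:.1%}")
```

The approximation lands close to the share of non-maintenance time that is free of system failures, which suggests the MTBF and MTTR figures are mutually consistent.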

Summary & Outlook
• The K computer achieved
  – Lower failure rates for CPU and DIMM
    • → Low Tj contributed to the low failure rate.
  – Better MTBF and MTTR
  – High system availability and job filling rate
  – Preliminary results suggest that a high average temperature does not affect DIMM failures. More precise analysis is needed.
• Outlook
  – To clarify the relation between DIMM/CPU failures and age, temperature, job load, etc. In particular, short-term fluctuations should be taken into account.
  – To develop a failure prediction method.

Thank you for your attention
