Computer simulations create the future
The K computer and its failures
Fumiyoshi Shoji
Operations and Computer Technologies Division, RIKEN AICS
The 6th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2016 @ Kyoto
31 May, 2016
RIKEN ADVANCED INSTITUTE FOR COMPUTATIONAL SCIENCE
• Introduction to RIKEN and RIKEN AICS
• The K computer
• Failure analysis of the K computer
• Failure rates
• DIMM failure analysis (Preliminary)
• System wide availability
• Summary
RIKEN
– RIKEN is Japan's largest and most comprehensive research organization for basic and applied science.
– Foundation: 1917.
– Researchers: ~2,000.
The K computer at RIKEN Kobe campus
SACLA(XFEL) X-ray Free Electron Laser facility at RIKEN Harima campus
RIKEN AICS
• RIKEN Advanced Institute for Computational Science (AICS) was established on July 1, 2010.
• Missions
– Manage the operations of the K computer, maintaining a user-friendly environment.
– Promote collaborative projects with a focus on the disciplines of computational science and computer science.
– Plot and develop Japan's strategy for computational science, including defining the path to exa-scale computing.
Location of RIKEN AICS
• Kobe, 423 km (263 miles) west of Tokyo
[Figure: map marking Kobe, Kyoto and Tokyo; site photo (~20,000 m2) labeling the research building, computer building, chiller building and substation supply.]
The K computer and its achievements
• The K computer:
– was developed through a collaboration between RIKEN and FUJITSU as a Japanese national project.
– has been in operation since Sep. 2012.
– is designed for general-purpose computing: no accelerators, broad memory/interconnect bandwidth, etc.
• Achievements:
– TOP500 list: No.1 in Jun. and Nov. 2011
– Graph500 list: No.1 in Jun. 2014, Jul. and Nov. 2015
– HPCG results: No.2 in Jul. and Nov. 2015
– Gordon Bell prize: winner in 2011 and 2012
– Other remarkable results in science and engineering (see http://www.aics.riken.jp/en/)
System Configuration
• Node: CPU × 1, ICC × 1, memory; 128 GFLOPS, 16 GiB
• System Board (SB), 500 mm x 500 mm: Node × 4; 512 GFLOPS, 64 GiB
• Compute Rack, 800 mm x 800 mm: SB × 24, IOSB × 6; 12.3 (13.1) TFLOPS, 1.50 (1.59) TiB
• 2 Cabinets, 4000 mm x 800 mm: Compute Rack × 4, Disk Rack × 1; 49.2 (52.4) TFLOPS, 6.00 (6.38) TiB
• Full System, 40 m x 40 m: Compute Rack × 864; 10.6 (11.3) PFLOPS, 1.27 (1.34) PiB
Values in parentheses include IO node performance and memory capacity.
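The per-level figures on the System Configuration slide follow from multiplying the per-node values up the hierarchy. A minimal Python sketch of that arithmetic, counting compute nodes only (so it reproduces the values outside the parentheses, not the ones including IO nodes); the constant and function names are illustrative and do not come from the slides:

```python
# Peak performance and memory at each level of the hierarchy,
# derived from the per-node figures on the slide (compute nodes only).
NODE_GFLOPS = 128      # one CPU per node
NODE_MEM_GIB = 16
NODES_PER_SB = 4
SBS_PER_RACK = 24
RACKS = 864


def level_totals(nodes):
    """Return (GFLOPS, GiB) for a given number of compute nodes."""
    return nodes * NODE_GFLOPS, nodes * NODE_MEM_GIB


sb = level_totals(NODES_PER_SB)
rack = level_totals(NODES_PER_SB * SBS_PER_RACK)
system = level_totals(NODES_PER_SB * SBS_PER_RACK * RACKS)

print(f"System board: {sb[0]} GFLOPS, {sb[1]} GiB")
print(f"Compute rack: {rack[0] / 1e3:.1f} TFLOPS, {rack[1] / 1024:.2f} TiB")
print(f"Full system:  {system[0] / 1e6:.1f} PFLOPS, {system[1] / 1024**2:.2f} PiB")
```

Note that performance uses decimal prefixes (GFLOPS to TFLOPS is a factor of 1000) while memory uses binary prefixes (GiB to TiB is a factor of 1024), which is why the two columns scale differently.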
Overview of Usage (2012/09/28 - 2016/01/31)
• Registered projects/users: ~150/1200 per FY
• Average number of executed jobs: 1275.0/day
• Average number of active users: 113.4/day
Job scale (node*time based):
• <101: 4.1%
• 101-500: 14.3%
• 501-1000: 16.7%
• 1001-2000: 13.0%
• 2001-5000: 24.8%
• 5001-10000: 8.5%
• 10001-20000: 6.9%
• 20001-50000: 7.7%
• 50001-80000: 1.1%
• >80000: 3.0%
Science fields (node*time based):
• material science: 27.6%
• manufacturing: 20.6%
• life science: 16.9%
• physics: 15.0%
• environment: 14.5%
• nuclear: 2.5%
• computer science / mathematics: 0.3% / 2.4%
• misc: 0.1%
A wide range of job types and users from various fields.
Overview of Operation Status
5% of 365d*24h is kept for scheduled maintenance.
[Figure: system availability for schedule and average job filling rate, JFY2012 (9/28~) to JFY2015; vertical axis 0% to 100%. Data labels: system availability 99.7%, 96.7%, 98.2%, 99.0%; average job filling rate 75.9%, 75.6%, 75.3%, 61.2%; reference: 91.0% (Blue Waters 2015 (*)).]
(*) The 2015 Blue Waters Annual Report book
• Remarkably high system availability was achieved.
• Job filling rates stay at a sufficiently high level.
Failure analysis of the K computer
• The K computer consists of an extremely large number of parts and components.
• It has been running under high load for three and a half years and is used by various types of jobs and users.
• Failure events are therefore expected to occur relatively frequently, which makes a meaningful failure analysis possible.
The failure statistics of the K computer contain much significant information and are expected to be useful for failure analysis of supercomputers in general.
Number of major parts
• Compute Rack: 864
• System Board: 864×24 = 20,736
• CPU: 864×(24×4) = 82,944
• Inter Connect Controller (ICC): 864×(24×4) = 82,944
• DIMM: 864×(24×4×8) = 663,552
• PSU: 864×9 = 7,776
CPU/ICC are water-cooled (inlet: 15℃, outlet: 17℃). Other components are air-cooled.
When a CPU, ICC or System Board fails, the whole system board is replaced. (For a DIMM failure, only the DIMM is replaced.)
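The part counts and the replacement rule on the "Number of major parts" slide can be written down compactly. This is a small Python sketch; the dictionary layout and the `replacement_unit` helper are illustrative names, and the helper covers only the components whose replacement policy the slide states:

```python
# Part counts derived from the rack-level multipliers on the slide.
RACKS = 864
SBS_PER_RACK = 24
NODES_PER_SB = 4       # one CPU and one ICC per node
DIMMS_PER_NODE = 8
PSUS_PER_RACK = 9

nodes = RACKS * SBS_PER_RACK * NODES_PER_SB
parts = {
    "System Board": RACKS * SBS_PER_RACK,
    "CPU": nodes,
    "ICC": nodes,
    "DIMM": nodes * DIMMS_PER_NODE,
    "PSU": RACKS * PSUS_PER_RACK,
}


def replacement_unit(failed_component):
    """Unit swapped by maintenance for a failed component, as stated on the slide."""
    if failed_component in ("CPU", "ICC", "System Board"):
        return "System Board"
    if failed_component == "DIMM":
        return "DIMM"
    # The slide does not state the policy for other parts (e.g. PSU).
    raise ValueError(f"replacement policy not stated for {failed_component!r}")


for name, count in parts.items():
    print(f"{name}: {count:,}")
```

One consequence of the replacement rule is that a CPU or ICC failure also counts as a system board replacement, which is why the system board statistics include CPU, ICC and DIMM failures.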
Monthly Failure Rate of CPUs
Monthly failure rate = (Failure counts in the month) / (Number of installed CPUs)
Number of CPUs = 82,944 (since July 2012)
Total failure counts = 240 (July 2012 - April 2016)
[Figure: monthly failure rate of CPUs, 2011-04 to 2016-04; vertical axis 0.000% to 0.020%. Annotated events: full node LINPACK measurements and two Gordon-Bell challenges.]
The failure trend of CPUs is almost stable except after high load events.
Monthly Failure Rate of DIMMs
Monthly failure rate = (Failure counts in the month) / (Number of installed DIMMs)
Number of DIMMs = 663,552 (since July 2012)
Total failure counts = 467 (July 2012 - April 2016)
[Figure: monthly failure rate of DIMMs, 2011-04 to 2016-04; vertical axis 0.000% to 0.004%.]
The DIMM failure rate seems to decrease gradually and to be lower than that of CPUs.
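The monthly failure rate definition can be applied to the totals quoted for CPUs (240 failures, 82,944 parts) and DIMMs (467 failures, 663,552 parts) to get a rough period average. The sketch below assumes the counting period is July 2012 through April 2016 inclusive (46 months) and that the installed count is constant over it; the function names are illustrative:

```python
def months_inclusive(y0, m0, y1, m1):
    """Number of calendar months from (y0, m0) through (y1, m1), inclusive."""
    return (y1 - y0) * 12 + (m1 - m0) + 1


def mean_monthly_failure_rate(total_failures, installed, months):
    """Average of (failures in the month) / (installed parts) over the period."""
    return total_failures / months / installed


months = months_inclusive(2012, 7, 2016, 4)
cpu_rate = mean_monthly_failure_rate(240, 82_944, months)
dimm_rate = mean_monthly_failure_rate(467, 663_552, months)

print(f"period: {months} months")
print(f"CPU : {cpu_rate:.4%} per month")
print(f"DIMM: {dimm_rate:.4%} per month")
print(f"CPU/DIMM ratio: {cpu_rate / dimm_rate:.1f}")
```

Under these assumptions the average per-part monthly rate for DIMMs comes out several times lower than for CPUs, which agrees with the observation on the DIMM slide.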
Monthly Failure Rate of System Boards
(includes the failures of CPU, ICC, DIMM and System Board (SB) itself)
Monthly failure rate = (Failure counts in the month) / (Number of installed System Boards)
Number of SBs = 20,736 (since July 2012)
Total failure counts = 850 (July 2012 - April 2016)
[Figure: monthly failure rate of system boards, 2011-04 to 2016-04; vertical axis 0.00% to 0.16%.]
The failure rate of SBs seems to have reached a plateau.
Average failure count (= maintenance operations): ~1 per 2 days
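The maintenance cadence quoted for system boards can be cross-checked against the total of 850 failures. The sketch below assumes the period runs from 1 July 2012 to 30 April 2016 inclusive; it gives a whole-period average, so it need not match the recent plateau exactly:

```python
from datetime import date

# Assumed counting period: 1 July 2012 through 30 April 2016 (inclusive).
start = date(2012, 7, 1)
end_exclusive = date(2016, 5, 1)
sb_failures = 850

days = (end_exclusive - start).days
days_per_failure = days / sb_failures

print(f"{days} days, {sb_failures} SB failures")
print(f"about one SB replacement every {days_per_failure:.1f} days")
```

The whole-period average is of the same order as the figure of roughly one maintenance operation every two days given on the slide.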
Monthly Failure Count of PSU
Number of PSUs = 7,776 (since July 2012)
Total failure counts = 194 (July 2012 - April 2016)
Monthly failure rate = (Failure counts in the month) / (Number of installed PSUs)
[Figure: monthly failure count of PSUs, 2011-04 to 2016-04; vertical axis 0 to 30; series: failure, preventive replacement.]
PSU failures have increased rapidly, and preventive replacement based on a suspicious signal has been applied since 2015.
Failure analysis: a comparison of failure rates with Blue Waters
FIT: Failure In Time (1 FIT = 1 failure per 10^9 hours)

                  K computer (April 2011 – January 2016)   Blue Waters(*)
CPU (FIT)         69.86                                     265.15
DIMM (FIT/GB)     8.82                                      15.98

(*) C. Di Martino et al., Lessons learned from the analysis of system failures at petascale: the case of Blue Waters. 44th International Conference on Dependable Systems and Networks (DSN 2014), 2014.
• The CPU and DIMM failure rates of the K computer are about 1/4 and 1/2 of those of Blue Waters, respectively.
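The FIT unit and the "about 1/4 and 1/2" comparison can be checked with a few lines of Python. The helper names are illustrative; the numeric inputs are the FIT values from the comparison table, and the one-year horizon in the last step is an arbitrary choice for illustration:

```python
def fit(failures, devices, hours):
    """Failures In Time: failures per 10^9 device-hours."""
    return failures * 1e9 / (devices * hours)


def expected_failures(fit_value, devices, hours):
    """Expected number of failures for a population at a given FIT."""
    return fit_value * 1e-9 * devices * hours


# FIT values from the comparison table.
k_cpu, bw_cpu = 69.86, 265.15
k_dimm, bw_dimm = 8.82, 15.98

cpu_ratio = k_cpu / bw_cpu
dimm_ratio = k_dimm / bw_dimm
print(f"CPU : K / Blue Waters = {cpu_ratio:.2f}")
print(f"DIMM: K / Blue Waters = {dimm_ratio:.2f}")

# What the CPU FIT implies for 82,944 CPUs over one year (8,760 hours).
per_year = expected_failures(k_cpu, 82_944, 8_760)
print(f"expected CPU failures per year: {per_year:.1f}")
```

Because FIT is normalized by device-hours, it allows systems with different part counts and observation periods to be compared directly.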
Consideration of low CPU failure rate
Tj (junction temperature) of CPU = 30℃
Arrhenius's law: $k = A e^{-E_a / (k_B T)}$
k: chemical reaction rate constant; A: constant; E_a: activation energy; k_B: Boltzmann constant; T: temperature
According to our early estimation, if Tj of the CPU could be decreased from 85℃ to 30℃, the relative lifetime would be 60 ~ 100 times longer.
[Figure: relative lifetime (log scale, 1.0E-02 to 1.0E+04) versus Tj (0 to 120℃), with 30℃ and 85℃ marked and the gap between them labeled x60~x100.]
Low Tj makes an essential contribution to the lower CPU failure rate.
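The x60~x100 lifetime estimate can be related to Arrhenius's law with a short calculation. The sketch below assumes lifetime is inversely proportional to the rate constant k (the usual Arrhenius acceleration model); the activation energy is not given on the slide, so the code instead solves for the value that would reproduce the quoted factors. Function names are illustrative:

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def lifetime_ratio(ea_ev, t_hot_c, t_cold_c):
    """Lifetime at t_cold relative to t_hot, assuming lifetime ~ 1/k."""
    t_hot = t_hot_c + 273.15
    t_cold = t_cold_c + 273.15
    return math.exp(ea_ev / K_B_EV * (1.0 / t_cold - 1.0 / t_hot))


def activation_energy(ratio, t_hot_c, t_cold_c):
    """Activation energy (eV) that yields the given lifetime ratio."""
    t_hot = t_hot_c + 273.15
    t_cold = t_cold_c + 273.15
    return K_B_EV * math.log(ratio) / (1.0 / t_cold - 1.0 / t_hot)


ea_low = activation_energy(60, 85, 30)
ea_high = activation_energy(100, 85, 30)
print(f"Ea for x60 : {ea_low:.2f} eV")
print(f"Ea for x100: {ea_high:.2f} eV")
print(f"ratio at Ea = 0.7 eV: x{lifetime_ratio(0.7, 85, 30):.0f}")
```

Under this model the quoted range corresponds to an activation energy of roughly 0.7 to 0.8 eV; this is an inference from the formula, not a value stated in the talk.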
DIMM failures and rack location (April 2014 – April 2016)
[Figure: heat map of DIMM failure counts per compute rack over the machine room layout; rack rows A to X, rack columns 1 to 45; per-rack counts between 0 and 3.]
Failure counts over racks are not uniform.
DIMM failures and SB location
[Figure: distribution of DIMM failures by system board (SB) position.]
DIMM failures and DIMM slot location
[Figure: distribution of DIMM failures by DIMM slot position relative to the CPUs on the system board.]
Rack inlet air temperature since April 2014
[Figure: two heat maps over the rack layout (rows A to X, columns 1 to 45): average inlet air temperature per rack (Min: 18.38℃ - Max: 26.26℃) and its standard deviation (Min: 0.08℃ - Max: 1.87℃).]
• The differences between the lowest and the highest values are approximately 8℃ for the average and 2℃ for the standard deviation.
• Every other rack tends to have a higher temperature and larger fluctuation.
Exhaust heat from neighboring disk rack
[Figure: rack row layout (compute rack, compute rack, disk rack, compute rack, compute rack) with the position of the inlet air temperature sensor marked.]
DIMM failures and rack inlet air temperature vs average temperature
[Figure: number of racks (0 to 350) and average failures per rack (0% to 90%) over ten bins of average inlet air temperature, with bin edges 18.38, 19.17, 19.96, 20.74, 21.53, 22.32, 23.11, 23.90, 24.68, 25.47 and 26.26℃. Data labels for average failures per rack: 23.5%, 20.8%, 21.3%, 16.3%, 14.9%, 19.1%, 18.8%, 17.7%, 19.0%, 6.3%.]
The temperature distribution over racks has two peaks.
The failure rate of a rack does not seem to depend on the average temperature.
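The temperature histogram groups racks into ten bins between the minimum and maximum average inlet temperature and reports the mean failure count per rack in each bin. A minimal Python sketch of that grouping follows; equal-width bins are an assumption (they reproduce the printed bin edges), and the per-rack sample data at the bottom is invented purely to exercise the function:

```python
def bin_edges(lo, hi, n_bins):
    """Edges of n_bins equal-width bins covering [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def mean_failures_per_bin(temps, failures, edges):
    """Per bin: (number of racks, mean failures per rack).

    temps and failures are per-rack sequences of equal length;
    all temperatures are assumed to lie within [edges[0], edges[-1]].
    """
    n = len(edges) - 1
    span = edges[-1] - edges[0]
    racks = [0] * n
    fails = [0] * n
    for t, f in zip(temps, failures):
        # The top edge belongs to the last bin.
        i = min(int((t - edges[0]) / span * n), n - 1)
        racks[i] += 1
        fails[i] += f
    return [(r, f / r if r else 0.0) for r, f in zip(racks, fails)]


edges = bin_edges(18.38, 26.26, 10)
print([round(e, 2) for e in edges])

# Hypothetical per-rack data, for illustration only.
sample_temps = [18.5, 19.0, 24.0, 26.26]
sample_failures = [0, 1, 1, 2]
result = mean_failures_per_bin(sample_temps, sample_failures, edges)
print(result)
```

The same function applies unchanged to the standard deviation analysis by passing per-rack standard deviations and the corresponding minimum and maximum.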
DIMM failures and rack inlet air temperature vs standard deviation
[Figure: number of racks (0 to 400) and average failures per rack (0% to 120%) over ten bins of the standard deviation of inlet air temperature, with bin edges 0.08, 0.26, 0.44, 0.62, 0.80, 0.98, 1.15, 1.33, 1.51, 1.69 and 1.87℃. Data labels for average failures per rack: 66.7%, 42.9%, 20.5%, 18.0%, 27.8%, 15.8%, 9.1%, 14.8%, 0.0%, 0.0%.]
Failure rates seem to depend on the fluctuation of the temperature, except for the highest segments.
DIMM failures and air temp. ranking
[Figure: racks ranked by average temperature compared with racks ranked by failure counts.]
The results suggest that DIMM failures do not depend on the long-term trend of the temperatures.
System availability
Actual system availability (September 2012 – March 2016):
• in-operation: 93.8%
• scheduled maintenance: 4.0%
• system failure: 2.2%
– system failure (LFS): 1.1%
– system failure (GFS): 0.4%
– system failure (misc): 0.3%
– system failure (MPI): 0.2%
– system failure (job scheduler): 0.2%
• About 70% of the system failure time was due to file system (LFS and GFS) failures.
Causes of system failure (LFS+GFS):
• System software bugs/invalid settings (34.6%)
• MDS/OSS down due to users' irregular use (23.6%)
• Hardware troubles (file system hardware) (14.5%)
• RAID (bad block) (13.7%)
• Under investigation (8.4%)
• Human errors (5.1%)
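The availability breakdown can be checked for internal consistency, including the statement that about 70% of system failure time came from the file systems. In the Python sketch below, the last quantity assumes that availability relative to scheduled time is defined as in-operation time divided by in-operation plus system failure time; that definition is an assumption, not something stated on the slide:

```python
# Shares of wall-clock time, September 2012 to March 2016 (percent).
breakdown = {
    "in-operation": 93.8,
    "scheduled maintenance": 4.0,
    "system failure": 2.2,
}
failure_parts = {
    "LFS": 1.1,
    "GFS": 0.4,
    "misc": 0.3,
    "MPI": 0.2,
    "job scheduler": 0.2,
}

total = sum(breakdown.values())
failure_total = sum(failure_parts.values())
fs_share = (failure_parts["LFS"] + failure_parts["GFS"]) / breakdown["system failure"]

# Assumed definition: availability with scheduled maintenance excluded.
scheduled_availability = breakdown["in-operation"] / (
    breakdown["in-operation"] + breakdown["system failure"]
)

print(f"total: {total:.1f}%, failure parts: {failure_total:.1f}%")
print(f"file system share of failure time: {fs_share:.0%}")
print(f"availability within scheduled time: {scheduled_availability:.1%}")
```

The file system share comes out just under 70%, consistent with the slide's rounded statement.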
Consideration of LFS failures
[Figure: the K computer: compute nodes (# of CPU: 82944, memory capacity: 1.27PiB), 6D mesh/torus network, I/O nodes (many OSS), Local File System (LFS) (11PB, one volume), Global File System (GFS) (30PB).]
Design concept for user requirements:
• LFS consists of many OSSes and OSTs to realize higher bandwidth.
– OSS: 2592, OST: 5184 (GFS OSS: 90, OST: 2880)
• LFS is configured as one huge volume to provide a shared area.
Results:
• The large number of OSSes and OSTs revealed many potential bugs in the file system software (Lustre 1.X based), and many severe failures were caused by such bugs.
• LFS going down means all services stop, because it is a single point of failure.
Lessons learned:
• Do not configure a file system with a large number of OSSes and OSTs, to avoid potential bugs.
• Do not make one huge volume, to avoid a single point of failure.
MTBF/MTTR (Sep. 2012 - Jan. 2016)
MTBF (Mean Time Between (system wide) Failures)
= ((Real time) – (Maintenance time) – (Irregular stop time)) / (System wide irregular stop counts)
= 27402.8 hours / 62 times = 442.0 hours = 18.4 days
cf. 11.2 days (Blue Waters 2015(*))
MTTR (Mean Time To Recovery)
= Average (System wide irregular stop time) = 10.6 hours (Max. 49.3 hours (October 2012))
(*) The 2015 Blue Waters Annual Report book: https://bluewaters.ncsa.illinois.edu/documents/10157/27cb9800-01c1-49be-a7aa-a210ad14d21b
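The MTBF figure follows directly from the two numbers on the slide. The short Python sketch below reproduces it and, as an additional assumption not made in the talk, combines it with the MTTR using the common steady-state formula MTBF / (MTBF + MTTR):

```python
# Values from the MTBF/MTTR slide (Sep. 2012 - Jan. 2016).
uptime_hours = 27402.8   # real time - maintenance time - irregular stop time
irregular_stops = 62     # system-wide irregular stop count
mttr_hours = 10.6        # mean system-wide irregular stop time

mtbf_hours = uptime_hours / irregular_stops
mtbf_days = mtbf_hours / 24

# Assumed steady-state availability model; not stated on the slide.
availability = mtbf_hours / (mtbf_hours + mttr_hours)

print(f"MTBF = {mtbf_hours:.1f} hours = {mtbf_days:.1f} days")
print(f"MTBF / (MTBF + MTTR) = {availability:.1%}")
```

The resulting ratio is an estimate of availability within scheduled operation time, since maintenance time is already excluded from the MTBF numerator.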
Summary & Outlook
• The K computer achieved
– Lower failure rates for CPU and DIMM
→ Low Tj contributed to the low failure rate.
– Better MTBF and MTTR
– High system availability and job filling rate
• Preliminary results suggest that a high average temperature does not affect DIMM failures. More precise analysis is needed.
• Outlook
– To clarify the relation of DIMM/CPU failures to age, temperature, job load, etc. In particular, short-term fluctuation should be taken into account.
– To develop a failure prediction method.
Thank you for your attention