RDMA in the Cloud: Enabling high-bandwidth, low-latency communication in virtual environments for HPC
Josh Simons, VMware Office of the CTO. © 2014 VMware, Inc. All rights reserved.
Server Virtualization
[Diagram comparing the traditional architecture with the virtual architecture, in which workloads run inside virtual machines (VMs)]
Secure Private Cloud for HPC
[Architecture diagram: users in Research Group 1 through Research Group m, plus IT, access VMware vRealize Automation user portals (blueprints, security, vRA API); Research Cluster 1 through Research Cluster n are managed through programmatic control and integrations with NSX, VMware vCenter Server, and VMware vSphere, with overflow to hybrid/public clouds]
Task Parallel Performance
Testbed Configuration
• Hardware
  – Four two-socket HP DL380 G8 servers (3.3 GHz E5-2667v2 CPUs; 128 GB RAM)
  – Dual-ported Mellanox FDR InfiniBand / 10 Gb RoCE adapter
  – Mellanox 12-port FDR switch
• Software
  – ESXi 5.5u1 hypervisor
  – RHEL 6.5 (native and guest)
  – MLNX OFED 2.2.1
BioPerf Benchmark Suite: Native to Virtual Ratios (Higher is Better)
[Bar chart, ESXi 5.5u1: ratios (0 to 1.2) for CLUSTALW, GLIMMER, GRAPPA, HMMER, PHYLIP, PREDATOR, TCOFFEE, BLAST, and FASTA]
BLAST: Native to Virtual Ratios (Higher is Better)
[Bar chart, ESXi 5.5u1: ratios (0 to 1) for OMP_NUM_THREADS = 1, 4, 8, and 16]
RDMA Performance
Kernel Bypass Model
[Diagram: physical stack (application and RDMA in user space; sockets, TCP/IP, and driver in the kernel; hardware below) alongside the virtual stack (application and RDMA in user space; sockets, TCP/IP, and driver in the guest kernel; vmkernel; hardware below). In both cases RDMA operations pass from the application directly to the hardware, bypassing the kernel network stack.]
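To make the bypass concrete, the sketch below uses the standard libibverbs user-space API to set up the resources an RDMA application needs. It is an illustrative example, not code from the testbed, and it omits the connection management required before any data can actually move; the point is that once the queue pair exists, posting work requests and reaping completions happen entirely in user space.

/* Minimal kernel-bypass sketch using libibverbs (compile with -libverbs). */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    /* Opening the device maps its doorbell pages into this process; from here
     * on, the data path does not enter the kernel. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a buffer so the adapter can DMA to and from it directly. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);

    /* Completion queue and reliable-connected queue pair: the user-space data path. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    struct ibv_qp_init_attr qpa = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qpa);
    printf("QP %u created; connection setup (omitted) happens before any transfer\n",
           qp ? qp->qp_num : 0);

    if (qp) ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}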
FDR InfiniBand Read Latency (ib_read_lat / passthrough)
[Line chart: half round-trip latency (µs) versus message size (bytes); Native versus ESXi 5.5]
HPC Challenge Benchmark (HPCC): Native to Virtual Ratios (Higher is Better)
[Bar charts: HPCC results and High Performance LINPACK results at problem sizes N5000, N10000, and N20000, each for run configurations n4np4, n4np8, n4np16, n4np32, and n4np64; ratios range from 0.0 to 1.0]
NAS Parallel Benchmarks (NPB): Native to Virtual Ratios (Higher is Better)
[Bar chart, Class C problems: IS, EP, CG, MG, and LU for run configurations n4np4 through n4np64; ratios range from 0 to 1.2]
NAMD: Native to Virtual Ratios (Higher is Better)
[Bar chart: Apoa1 and f1atpase workloads for run configurations n4np4 through n4np64; ratios range from 0 to 1]
NWCHEM: Native to Virtual Ratios (Higher is Better)
[Bar chart: H2O7 MP2 workload for run configurations n4np4 through n4np64; ratios range from 0 to 1.2]
Source: NWChem Performance Benchmark and Profiling, HPC Advisory Council
FDR InfiniBand Read Latency: Future (ib_read_lat / passthrough / polling completions)
[Line chart: half round-trip latency (µs) versus message size (bytes); Native versus Prototype]
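"Polling completions" here refers to spinning on the completion queue rather than sleeping on an interrupt-driven completion event, which removes wakeup and context-switch costs from the latency-critical path. The sketch below shows that pattern with the standard verbs call ibv_poll_cq; it is illustrative only and is not the prototype's code. The cq argument is assumed to have been created as in the earlier setup sketch.

#include <infiniband/verbs.h>

/* Busy-poll a completion queue instead of waiting for a completion event.
 * Spinning burns a core, but it keeps interrupt latency out of the
 * half round-trip time measured by ib_read_lat. */
static int wait_for_completion_polling(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* returns 0 while nothing has completed */
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;                      /* poll error or failed work request */
    return 0;
}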
RDMA Storage Performance
Remote Storage Access Path
[Diagram: applications on the client pass through the OS and a device driver to the PCI device and hardware, then across a switch to the storage server]
Passthrough Mode Limitation
[Diagram: in passthrough mode, the PCI device is owned by a single guest OS and its driver; the other guest OSes on the host (marked ✗) have no path through the device to the storage server]
Single-Root I/O Virtualization (SR-IOV)
[Diagram: each guest OS runs a virtual function (VF) driver, the vmkernel runs the physical function (PF) driver, and all of the guests share the single PCI device on the path through the switch to the storage server]
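For context on how virtual functions are exposed, the sketch below shows how SR-IOV VFs are typically enabled on a bare-metal Linux host through the standard sysfs attribute sriov_numvfs. On ESXi the equivalent is configured through the host driver settings instead, so this is purely illustrative; the PCI address and VF count are placeholders.

/* Illustrative only: request SR-IOV virtual functions on a Linux host by
 * writing to the device's sriov_numvfs sysfs attribute. The bus address
 * below is a placeholder; the current value must be 0 before a new nonzero
 * count can be written. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs"; /* placeholder BDF */
    FILE *f = fopen(path, "w");
    if (!f) { perror("open sriov_numvfs"); return 1; }

    /* Request 4 virtual functions; each VF can then be handed to a guest,
     * whose VF driver shares the physical device with the other guests. */
    if (fprintf(f, "4\n") < 0) { perror("write sriov_numvfs"); fclose(f); return 1; }
    fclose(f);
    return 0;
}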
FDR InfiniBand Read Latency (ib_read_lat / passthrough, SR-IOV)
[Line chart: half round-trip latency (µs) versus message size (bytes); Native, ESXi 5.5, and ESXi 5.5 with SR-IOV]
IOR Bandwidth Performance: 3 VMs × 4 cores versus bare-metal Linux, 12 cores
[Chart: bandwidth (MB/sec, 0 to 4000) versus number of processes (1, 2, 3, 6, 12) for VM write, VM read, bare-metal write, and bare-metal read]
Test configuration: two-socket (8-core) IVB, 64 GB memory, MLX ConnectX-3 FDR IB, 256 GB IOR dataset, CentOS 6.4, Lustre 2.6
Data provided by Sorin Faibish, EMC Office of the CTO
Summary
• Virtualized HPC performance for throughput applications is generally very close to bare metal (well under 5% overhead)
• Passthrough RDMA can deliver close-to-native performance for some MPI benchmarks and applications
  – Will continue to improve as latency overheads are reduced…or eliminated
  – Higher-scale testing
• SR-IOV can enable access to RDMA-connected parallel file systems from virtual environments with good performance
Thank You Josh Simons
[email protected]