RDMA in the Cloud: Enabling high-bandwidth, low-latency communication in virtual environments for HPC
Josh Simons, VMware Office of the CTO. © 2014 VMware, Inc. All rights reserved.
Server Virtualization
[Diagram comparing the traditional architecture with the virtual architecture, in which workloads run inside virtual machines (VMs)]
Secure Private Cloud for HPC
[Architecture diagram: users in Research Group 1 through Research Group m, plus IT, access VMware vRealize Automation user portals (blueprints, security, vRA API); Research Cluster 1 through Research Cluster n are managed through programmatic control and integrations with NSX, VMware vCenter Server, and VMware vSphere, with overflow to hybrid/public clouds]
Task Parallel Performance
Testbed Configuration
• Hardware
  – Four two-socket HP DL380 G8 servers (3.3 GHz E5-2667v2 CPUs; 128 GB RAM)
  – Dual-ported Mellanox FDR InfiniBand / 10 Gb RoCE adapter
  – Mellanox 12-port FDR switch
• Software
  – ESXi 5.5u1 hypervisor
  – RHEL 6.5 (native and guest)
  – MLNX OFED 2.2.1
BioPerf Benchmark Suite: Native to Virtual Ratios (Higher is Better)
[Bar chart, ESXi 5.5u1: ratios (0 to 1.2) for CLUSTALW, GLIMMER, GRAPPA, HMMER, PHYLIP, PREDATOR, TCOFFEE, BLAST, and FASTA]
BLAST: Native to Virtual Ratios (Higher is Better)
[Bar chart, ESXi 5.5u1: ratios (0 to 1) for OMP_NUM_THREADS = 1, 4, 8, and 16]
RDMA Performance
Kernel Bypass Model
[Diagram: physical stack (application and RDMA in user space; sockets, TCP/IP, and driver in the kernel; hardware below) alongside the virtual stack (application and RDMA in user space; sockets, TCP/IP, and driver in the guest kernel; vmkernel; hardware below). In both cases RDMA operations pass from the application directly to the hardware, bypassing the kernel network stack.]
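To make the bypass concrete, the sketch below uses the standard libibverbs user-space API to set up the resources an RDMA application needs. It is an illustrative example, not code from the testbed, and it omits the connection management required before any data can actually move; the point is that once the queue pair exists, posting work requests and reaping completions happen entirely in user space.

/* Minimal kernel-bypass sketch using libibverbs (compile with -libverbs). */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    /* Opening the device maps its doorbell pages into this process; from here
     * on, the data path does not enter the kernel. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a buffer so the adapter can DMA to and from it directly. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);

    /* Completion queue and reliable-connected queue pair: the user-space data path. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    struct ibv_qp_init_attr qpa = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qpa);
    printf("QP %u created; connection setup (omitted) happens before any transfer\n",
           qp ? qp->qp_num : 0);

    if (qp) ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}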
FDR InfiniBand Read Latency (ib_read_lat / passthrough)
[Line chart: half round-trip latency (µs) versus message size (bytes); Native versus ESXi 5.5]
HPC Challenge Benchmark (HPCC): Native to Virtual Ratios (Higher is Better)
[Bar charts: HPCC results and High Performance LINPACK results at problem sizes N5000, N10000, and N20000, each for run configurations n4np4, n4np8, n4np16, n4np32, and n4np64; ratios range from 0.0 to 1.0]
NAS Parallel Benchmarks (NPB): Native to Virtual Ratios (Higher is Better)
[Bar chart, Class C problems: IS, EP, CG, MG, and LU for run configurations n4np4 through n4np64; ratios range from 0 to 1.2]
NAMD: Native to Virtual Ratios (Higher is Better)
[Bar chart: Apoa1 and f1atpase workloads for run configurations n4np4 through n4np64; ratios range from 0 to 1]
NWCHEM: Native to Virtual Ratios (Higher is Better)
[Bar chart: H2O7 MP2 workload for run configurations n4np4 through n4np64; ratios range from 0 to 1.2]
Source: NWChem Performance Benchmark and Profiling, HPC Advisory Council
FDR InfiniBand Read Latency: Future (ib_read_lat / passthrough / polling completions)
[Line chart: half round-trip latency (µs) versus message size (bytes); Native versus Prototype]
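"Polling completions" here refers to spinning on the completion queue rather than sleeping on an interrupt-driven completion event, which removes wakeup and context-switch costs from the latency-critical path. The sketch below shows that pattern with the standard verbs call ibv_poll_cq; it is illustrative only and is not the prototype's code. The cq argument is assumed to have been created as in the earlier setup sketch.

#include <infiniband/verbs.h>

/* Busy-poll a completion queue instead of waiting for a completion event.
 * Spinning burns a core, but it keeps interrupt latency out of the
 * half round-trip time measured by ib_read_lat. */
static int wait_for_completion_polling(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* returns 0 while nothing has completed */
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;                      /* poll error or failed work request */
    return 0;
}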
RDMA Storage Performance
Remote Storage Access Path
[Diagram: applications on the client pass through the OS and a device driver to the PCI device and hardware, then across a switch to the storage server]
Passthrough Mode Limitation
[Diagram: in passthrough mode, the PCI device is owned by a single guest OS and its driver; the other guest OSes on the host (marked ✗) have no path through the device to the storage server]
Single-Root I/O Virtualization (SR-IOV)
[Diagram: each guest OS runs a virtual function (VF) driver, the vmkernel runs the physical function (PF) driver, and all of the guests share the single PCI device on the path through the switch to the storage server]
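For context on how virtual functions are exposed, the sketch below shows how SR-IOV VFs are typically enabled on a bare-metal Linux host through the standard sysfs attribute sriov_numvfs. On ESXi the equivalent is configured through the host driver settings instead, so this is purely illustrative; the PCI address and VF count are placeholders.

/* Illustrative only: request SR-IOV virtual functions on a Linux host by
 * writing to the device's sriov_numvfs sysfs attribute. The bus address
 * below is a placeholder; the current value must be 0 before a new nonzero
 * count can be written. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:03:00.0/sriov_numvfs"; /* placeholder BDF */
    FILE *f = fopen(path, "w");
    if (!f) { perror("open sriov_numvfs"); return 1; }

    /* Request 4 virtual functions; each VF can then be handed to a guest,
     * whose VF driver shares the physical device with the other guests. */
    if (fprintf(f, "4\n") < 0) { perror("write sriov_numvfs"); fclose(f); return 1; }
    fclose(f);
    return 0;
}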
FDR InfiniBand Read Latency (ib_read_lat / passthrough, SR-IOV)
[Line chart: half round-trip latency (µs) versus message size (bytes); Native, ESXi 5.5, and ESXi 5.5 with SR-IOV]
IOR Bandwidth Performance: 3 VMs × 4 cores versus bare-metal Linux, 12 cores
[Chart: bandwidth (MB/sec, 0 to 4000) versus number of processes (1, 2, 3, 6, 12) for VM write, VM read, bare-metal write, and bare-metal read]
Test configuration: two-socket (8-core) IVB, 64 GB memory, MLX ConnectX-3 FDR IB, 256 GB IOR dataset, CentOS 6.4, Lustre 2.6
Data provided by Sorin Faibish, EMC Office of the CTO
Summary
• Virtualized HPC performance for throughput applications is generally very close to bare metal (well under 5% overhead)
• Passthrough RDMA can deliver close-to-native performance for some MPI benchmarks and applications
  – Will continue to improve as latency overheads are reduced…or eliminated
  – Higher-scale testing
• SR-IOV can enable access to RDMA-connected parallel file systems from virtual environments with good performance
Thank You Josh Simons
[email protected]