Altera brings OpenCL to FPGAs December, 2012

© 2012 Altera Corporation CONFIDENTIAL

© 2012 Altera Corporation CONFIDENTIAL 2

© 2012 Altera Corporation CONFIDENTIAL 3

© 2012 Altera Corporation CONFIDENTIAL 4

© 2012 Altera Corporation CONFIDENTIAL 5

© 2012 Altera Corporation CONFIDENTIAL 6

© 2012 Altera Corporation CONFIDENTIAL 7

© 2012 Altera Corporation CONFIDENTIAL 8

© 2012 Altera Corporation CONFIDENTIAL 9

© 2012 Altera Corporation CONFIDENTIAL 10

© 2012 Altera Corporation CONFIDENTIAL 11

© 2012 Altera Corporation CONFIDENTIAL 12

© 2012 Altera Corporation CONFIDENTIAL 13

© 2012 Altera Corporation CONFIDENTIAL 14

OpenCL Overview 

OpenCL is a software programming model  Uses Standard C language (C99)  Uses OpenCL C extensions (adds parallelism to C)  Includes API (open standard for different devices)



Targets heterogeneous systems  Performance via hardware acceleration



Host CPU

The consortium (short list):  Apple, Altera, AMD, Broadcom, Khronos, Intel, ARM,

Ericsson, Texas Instruments, Samsung, IBM, Google, Fujitsu

© 2012 Altera Corporation CONFIDENTIAL 15

Hardware Acceleration

OpenCL Enables Portability C to gates programs are proprietary

Heterogeneous Multicore CPU

Multicore CPU

SoC FPGA

FPGA

FPGA

Source: RapidMind

© 2012 Altera Corporation CONFIDENTIAL 16

OpenCL Programming Model

Accelerator

Host

Processor Accelerator Accelerator Accelerator

Local Mem Local Mem Local LocalMem Mem

Host Program

Global Mem

main() { read_data( … ); maninpulate( … ); clEnqueueWriteBuffer( … ); clEnqueueNDRange(…,sum,…); clEnqueueReadBuffer( … ); display_result( … ); }

__kernel void sum(__global float *a, __global float *b, __global float *y) { int gid = get_global_id(0); y[gid] = a[gid] + b[gid]; }

Kernel Program OpenCL application is combination of Host & Kernel © 2012 Altera Corporation CONFIDENTIAL 17

© 2012 Altera Corporation CONFIDENTIAL 18

Mapping Multithreaded Kernels to FPGAs 

Simplest way of mapping kernel functions to FPGAs is to replicate hardware for each thread  Inefficient and wasteful



Technique: deep pipeline parallelism  Attempt to create a deeply pipelined representation of a kernel  On each clock cycle, we attempt to send in input data for a new

thread  Method of mapping coarse grained thread parallelism to finegrained FPGA parallelism

© 2012 Altera Corporation CONFIDENTIAL 19

Example Pipeline for Vector Add 8 threads for vector add example

0

Load

1

2

3

4

5

6

7

Load

Thread IDs +



Store



© 2012 Altera Corporation CONFIDENTIAL 20

On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add 8 threads for vector add example

1

2

3

4

5

6

7

0

Load

Load

Thread IDs +



Store



© 2012 Altera Corporation CONFIDENTIAL 21

On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add 8 threads for vector add example

2

3

4

5

6

7

1

Load

Load

Thread IDs

0 +



Store



© 2012 Altera Corporation CONFIDENTIAL 22

On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add 8 threads for vector add example

3

4

5

6

7

2

Load

Load

Thread IDs

1 + 0

Store

© 2012 Altera Corporation CONFIDENTIAL 23





On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add 8 threads for vector add example

4

5

6

7

3

Load

Load

Thread IDs

2 + 1

Store 0

© 2012 Altera Corporation CONFIDENTIAL 24





On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add  Load

Load Load +

Load Load

Load

+

Store

+

Store Store

© 2012 Altera Corporation CONFIDENTIAL 25

Replicate the kernel circuit multiple times to process multiple workgroups simultaneously

© 2012 Altera Corporation CONFIDENTIAL 26

© 2012 Altera Corporation CONFIDENTIAL 27

© 2012 Altera Corporation CONFIDENTIAL 28

© 2012 Altera Corporation CONFIDENTIAL 29

© 2012 Altera Corporation CONFIDENTIAL 30

© 2012 Altera Corporation CONFIDENTIAL 31

© 2012 Altera Corporation CONFIDENTIAL 32

© 2012 Altera Corporation CONFIDENTIAL 33

Case Studies

© 2012 Altera Corporation CONFIDENTIAL

Performance (Monte-Carlo Black Scholes) 12.0

OpenCL MCBS Quad Core Xeon Simulations 240M per Second Number of 8 Cores

11.5

11.0

Stock Price

10.5

10.0

9.5

9.0

448

N/A

1.00

0.95

0.90

0.85

0.80

0.75

0.70

0.65

0.60

0.55

0.50

0.45

0.40

0.35

0.30

0.25

0.20

0.15

0.10

0.05

8.0

0.00

8.5

950M

Stratix® IV 530 FPGA 2,200M

NVIDIA S870

Time

 

Calculate the value of an option with multiple sources of uncertainty FPGA delivers higher performance at a fraction of the power

Achieve Higher Performance vs CPU © 2012 Altera Corporation CONFIDENTIAL 35

Performance/Watt (Document Search) 

Documenting filtering algorithm 

Review incoming stream (documents) and return best match

 E.g. Monitors news feeds and recommends others



FPGA outperforms by >5x 

Saving power = Saving $  Annual power cost was $2.9 million or $456 per KW

Higher Perf/Watt vs GPU © 2012 Altera Corporation CONFIDENTIAL 36

ALTERA OpenCL What’s Next

© 2012 Altera Corporation CONFIDENTIAL 37

Current OpenCL System Architecture Host Processor

Kernel0

Kernel1

Kernel2



Kernel N

Global Memory

High demand on CPU Memory-to-memory paradigm © 2012 Altera Corporation CONFIDENTIAL 38

Desired Architecture ( OpenCL Pipes ) Host Processor Initialize() Buffer

Kern0

Buffer

Kern1

p

Buffer Buffer Traffic Manager

Kern2

KernN

1-p

Buffer

Global Memory Traffic Manager

CPU: Configure and “Go” Stream orientation when needed © 2012 Altera Corporation CONFIDENTIAL 39

ALTERA OpenCL Q&A

© 2012 Altera Corporation CONFIDENTIAL 40

Altera brings OpenCL to FPGAs

Uses OpenCL C extensions (adds parallelism to C). - Includes API (open standard for different devices). ▫ Targets heterogeneous systems. - Performance via hardware acceleration. ▫ The consortium (short list):. - Apple, Altera, AMD, Broadcom, Khronos, Intel, ARM,. Ericsson, Texas Instruments, Samsung, IBM,. Google ...

2MB Sizes 0 Downloads 297 Views

Recommend Documents

No documents