Altera brings OpenCL to FPGAs December, 2012
© 2012 Altera Corporation CONFIDENTIAL
© 2012 Altera Corporation CONFIDENTIAL 2
© 2012 Altera Corporation CONFIDENTIAL 3
© 2012 Altera Corporation CONFIDENTIAL 4
© 2012 Altera Corporation CONFIDENTIAL 5
© 2012 Altera Corporation CONFIDENTIAL 6
© 2012 Altera Corporation CONFIDENTIAL 7
© 2012 Altera Corporation CONFIDENTIAL 8
© 2012 Altera Corporation CONFIDENTIAL 9
© 2012 Altera Corporation CONFIDENTIAL 10
© 2012 Altera Corporation CONFIDENTIAL 11
© 2012 Altera Corporation CONFIDENTIAL 12
© 2012 Altera Corporation CONFIDENTIAL 13
© 2012 Altera Corporation CONFIDENTIAL 14
OpenCL Overview
OpenCL is a software programming model Uses Standard C language (C99) Uses OpenCL C extensions (adds parallelism to C) Includes API (open standard for different devices)
Targets heterogeneous systems Performance via hardware acceleration
Host CPU
The consortium (short list): Apple, Altera, AMD, Broadcom, Khronos, Intel, ARM,
Ericsson, Texas Instruments, Samsung, IBM, Google, Fujitsu
© 2012 Altera Corporation CONFIDENTIAL 15
Hardware Acceleration
OpenCL Enables Portability C to gates programs are proprietary
Heterogeneous Multicore CPU
Multicore CPU
SoC FPGA
FPGA
FPGA
Source: RapidMind
© 2012 Altera Corporation CONFIDENTIAL 16
OpenCL Programming Model
Accelerator
Host
Processor Accelerator Accelerator Accelerator
Local Mem Local Mem Local LocalMem Mem
Host Program
Global Mem
main() { read_data( … ); maninpulate( … ); clEnqueueWriteBuffer( … ); clEnqueueNDRange(…,sum,…); clEnqueueReadBuffer( … ); display_result( … ); }
__kernel void sum(__global float *a, __global float *b, __global float *y) { int gid = get_global_id(0); y[gid] = a[gid] + b[gid]; }
Kernel Program OpenCL application is combination of Host & Kernel © 2012 Altera Corporation CONFIDENTIAL 17
© 2012 Altera Corporation CONFIDENTIAL 18
Mapping Multithreaded Kernels to FPGAs
Simplest way of mapping kernel functions to FPGAs is to replicate hardware for each thread Inefficient and wasteful
Technique: deep pipeline parallelism Attempt to create a deeply pipelined representation of a kernel On each clock cycle, we attempt to send in input data for a new
thread Method of mapping coarse grained thread parallelism to finegrained FPGA parallelism
© 2012 Altera Corporation CONFIDENTIAL 19
Example Pipeline for Vector Add 8 threads for vector add example
0
Load
1
2
3
4
5
6
7
Load
Thread IDs +
Store
© 2012 Altera Corporation CONFIDENTIAL 20
On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
Example Pipeline for Vector Add 8 threads for vector add example
1
2
3
4
5
6
7
0
Load
Load
Thread IDs +
Store
© 2012 Altera Corporation CONFIDENTIAL 21
On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
Example Pipeline for Vector Add 8 threads for vector add example
2
3
4
5
6
7
1
Load
Load
Thread IDs
0 +
Store
© 2012 Altera Corporation CONFIDENTIAL 22
On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
Example Pipeline for Vector Add 8 threads for vector add example
3
4
5
6
7
2
Load
Load
Thread IDs
1 + 0
Store
© 2012 Altera Corporation CONFIDENTIAL 23
On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
Example Pipeline for Vector Add 8 threads for vector add example
4
5
6
7
3
Load
Load
Thread IDs
2 + 1
Store 0
© 2012 Altera Corporation CONFIDENTIAL 24
On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
Example Pipeline for Vector Add Load
Load Load +
Load Load
Load
+
Store
+
Store Store
© 2012 Altera Corporation CONFIDENTIAL 25
Replicate the kernel circuit multiple times to process multiple workgroups simultaneously
© 2012 Altera Corporation CONFIDENTIAL 26
© 2012 Altera Corporation CONFIDENTIAL 27
© 2012 Altera Corporation CONFIDENTIAL 28
© 2012 Altera Corporation CONFIDENTIAL 29
© 2012 Altera Corporation CONFIDENTIAL 30
© 2012 Altera Corporation CONFIDENTIAL 31
© 2012 Altera Corporation CONFIDENTIAL 32
© 2012 Altera Corporation CONFIDENTIAL 33
Case Studies
© 2012 Altera Corporation CONFIDENTIAL
Performance (Monte-Carlo Black Scholes) 12.0
OpenCL MCBS Quad Core Xeon Simulations 240M per Second Number of 8 Cores
11.5
11.0
Stock Price
10.5
10.0
9.5
9.0
448
N/A
1.00
0.95
0.90
0.85
0.80
0.75
0.70
0.65
0.60
0.55
0.50
0.45
0.40
0.35
0.30
0.25
0.20
0.15
0.10
0.05
8.0
0.00
8.5
950M
Stratix® IV 530 FPGA 2,200M
NVIDIA S870
Time
Calculate the value of an option with multiple sources of uncertainty FPGA delivers higher performance at a fraction of the power
Achieve Higher Performance vs CPU © 2012 Altera Corporation CONFIDENTIAL 35
Performance/Watt (Document Search)
Documenting filtering algorithm
Review incoming stream (documents) and return best match
E.g. Monitors news feeds and recommends others
FPGA outperforms by >5x
Saving power = Saving $ Annual power cost was $2.9 million or $456 per KW
Higher Perf/Watt vs GPU © 2012 Altera Corporation CONFIDENTIAL 36
ALTERA OpenCL What’s Next
© 2012 Altera Corporation CONFIDENTIAL 37
Current OpenCL System Architecture Host Processor
Kernel0
Kernel1
Kernel2
…
Kernel N
Global Memory
High demand on CPU Memory-to-memory paradigm © 2012 Altera Corporation CONFIDENTIAL 38
Desired Architecture ( OpenCL Pipes ) Host Processor Initialize() Buffer
Kern0
Buffer
Kern1
p
Buffer Buffer Traffic Manager
Kern2
KernN
1-p
Buffer
Global Memory Traffic Manager
CPU: Configure and “Go” Stream orientation when needed © 2012 Altera Corporation CONFIDENTIAL 39
ALTERA OpenCL Q&A
© 2012 Altera Corporation CONFIDENTIAL 40