SDR on GPU’s: Practical OpenCL Blocks and Their Performance

Mike Piscopo, Director of Technical Consulting Services

GRCon 2017

Speaker Introduction – Who is this guy?

Mike Piscopo
▪ Delta Risk LLC - Director, CyberSecurity Technical Consulting Services and Product Development
▪ ING U.S. Financial Services - Information Security Officer
▪ PeriNet Technologies - President and CTO, security solutions architecture and implementation, penetration testing

Past “Lives”
▪ BS Aerospace Engineering – Virginia Tech
▪ Aerospace Contractor – Real-time distributed centrifuge team
▪ Developer – embedded C++ and later application architect
▪ Network Field Engineer and IT architect

© 2017 - DELTA RISK LLC


Why Look to GPUs and OpenCL?
▪ In cyber, we use GPUs all the time for hash generation and cracking
▪ Highly parallel computing with OpenCL, which can leverage multiple cards
▪ Lots of talk about the benefits of GPUs and signal processing (clFFT libraries, etc.)
▪ On the surface it sounded like a reasonable question: Why don’t we have OpenCL SDR blocks?
▪ NVIDIA 1070 has 1,920 cores and drives virtual reality systems
▪ OpenCL added bonus: cross-hardware capabilities with support for CPUs, GPUs, and OpenCL-enabled FPGAs
▪ Several parallelization modes: data-parallel [“Single Instruction Multiple Data” (SIMD)] and task-parallel
▪ There is some great foundational research and proof-of-concept work already available, but no core open source GNU Radio OOT modules covering all of the “basics”


CL-Enabled Project Goals

This gave birth to the cl-enabled project (github and pybombs). My “wish list”:
▪ Implement blocks commonly used in digital demodulation to run on my GPU using OpenCL
  ▪ Blocks you most frequently would use in a general flowgraph (signal source, multiply, filters)
  ▪ Digital signal processing demod blocks for ASK, FSK, and PSK (2PSK/QPSK)
  ▪ Opportunistically, any other core blocks that looked like they could be accelerated
▪ Scalability
  ▪ We use multiple cards in cyber, so I want to be able to use multiple GPUs in the same flowgraph simultaneously
  ▪ Be able to choose which blocks run on which OpenCL devices
  ▪ Would the blocks support 60 Msps real-time processing? What about 250 Msps?
▪ Have to be able to measure performance
  ▪ Know if the blocks actually ran better or worse on GPUs and why (need to add tools)
  ▪ Categorize blocks into accelerated, offload, and enabled
▪ Extensibility - Custom kernel support and the ability to time them without re-compiling code


Which Modules Have Been Implemented?

✓ Basic Building Blocks
1. Signal Source (more “pure” as well)
2. Multiply
3. Add
4. Subtract
5. Multiply Constant
6. Add Constant
7. Filters (FIR and FFT)
   I. Low Pass
   II. High Pass
   III. Band Pass / Reject
   IV. Root-Raised Cosine
   V. Generic Tap-Based

✓ Custom OpenCL Kernels
8. 1 input, 1 output
9. 2 inputs, 1 output

✓ Common Math or Complex Data Functions
10. Complex Conjugate
11. Multiply Conjugate
12. Complex to Arg
13. Complex to Mag Phase
14. Mag Phase to Complex
15. Log10
16. SNR Helper (a custom block performing divide->log10->abs)
17. Forward FFT
18. Reverse FFT

✓ Digital Signal Processing
19. Complex to Mag (used for ASK/OOK)
20. Quadrature Demod (used for FSK)
21. Costas Loop (used for BPSK/QPSK)

What makes an algorithm a good choice?

ATOMIC Calculations
▪ SIMD wants to run the same line of code on multiple data points simultaneously
▪ Branching impacts performance (threads need to wait so they’re on the same instruction)

Maps Really Well
Suboptimal Performance: Branching and different instructions
Potentially unimplementable in SIMD: Sequential calculations

For (int i=0;i
For (int i=0;i
Real Block: Quad Demod (Maps Well)

    volk_32fc_x2_multiply_conjugate_32fc(&tmp[0], &in[1], &in[0], noutput_items);
    for(int i = 0; i < noutput_items; i++) {
        out[i] = f_gain * fast_atan2f(imag(tmp[i]), real(tmp[i]));
    }
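Why this block maps well can be seen by writing it as a per-item function: each output depends only on two neighboring inputs and on no earlier output. The following is a minimal sketch in plain C++, not the project's kernel. The names `quad_demod_item` and `quad_demod` are hypothetical, and `std::atan2` and a plain complex multiply stand in for `fast_atan2f` and the VOLK call.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// One output item: the phase change between two consecutive input samples.
// It reads in[i] and in[i + 1] only, so every item can be computed
// independently of all the others.
inline float quad_demod_item(const std::complex<float>* in, std::size_t i, float gain) {
    const std::complex<float> product = in[i + 1] * std::conj(in[i]);
    return gain * std::atan2(product.imag(), product.real());
}

// Serial reference loop. Because the loop body has no dependency between
// iterations, a data-parallel version can hand each index to its own thread.
std::vector<float> quad_demod(const std::vector<std::complex<float>>& in, float gain) {
    assert(in.size() >= 2);  // n outputs need n + 1 inputs
    std::vector<float> out(in.size() - 1);
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = quad_demod_item(in.data(), i, gain);
    }
    return out;
}
```

In an OpenCL kernel the loop index would typically be replaced by the work-item ID, with the per-item body left unchanged.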


SIMD-Unfriendly Calculations

Flattened Costas Loop

    for(i = 0; i < noutput_items; i++) {
        nco_out = gr_expj(-d_phase);
        optr[i] = iptr[i] * nco_out;
        d_error = (*this.*d_phase_detector)(optr[i]);
        d_error = 0.5 * (std::abs(d_error+1) - std::abs(d_error-1));

        //advance_loop(d_error);
        d_freq = d_beta * d_error + d_freq;
        d_phase = d_phase + d_alpha * d_error + d_freq;

        //phase_wrap();
        if (d_phase > CL_TWO_PI) {
            while(d_phase>CL_TWO_PI) {
                d_phase -= CL_TWO_PI;
            }
        }
    }

Problems! (715 Ksps task-parallel)
Implemented as a single task on 1 core:

    Testing Costas Loop performance with 8192 items...
    OpenCL: using NVIDIA CUDA
    OpenCL Run Time: 0.011451 s (715,387.6875 sps)
    CPU-only Run Time: 0.000462 s (17,719,240.00 sps)
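The loop-carried dependency is what blocks a data-parallel mapping: the phase used for item i is only known after item i-1 has been processed. The sketch below is a simplified, self-contained BPSK variant written for illustration only. `CostasState`, `costas_bpsk`, and the loop gains passed in are hypothetical, and the phase detector (real times imaginary) is an assumed stand-in for the member-function pointer in the slide.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Loop state that every iteration both reads and writes.
struct CostasState {
    float phase = 0.0f;
    float freq = 0.0f;
};

// Simplified second-order BPSK Costas loop (illustrative, not the block's code).
std::vector<std::complex<float>> costas_bpsk(const std::vector<std::complex<float>>& in,
                                             CostasState& state, float alpha, float beta) {
    const float two_pi = 6.28318530718f;
    std::vector<std::complex<float>> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        // Depends on state.phase as updated by the previous iteration.
        out[i] = in[i] * std::polar(1.0f, -state.phase);
        float error = out[i].real() * out[i].imag();  // assumed BPSK detector
        // Branch-free clip of the error to [-1, 1], as in the slide's code.
        error = 0.5f * (std::abs(error + 1.0f) - std::abs(error - 1.0f));
        // Advance the loop: this write is what the next item must wait for.
        state.freq += beta * error;
        state.phase += alpha * error + state.freq;
        while (state.phase > two_pi) state.phase -= two_pi;
        while (state.phase < -two_pi) state.phase += two_pi;
    }
    return out;
}
```

Item i cannot start until item i-1 has updated the state, so the work collapses to a single task on one core, which is consistent with the timing shown above.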


Performance Killer #1: Data Copy
▪ Have to move the data to the card, then retrieve the results
▪ This incurs a transfer cost
▪ Total GPU execution time = Mem In + function execution + Mem Out
▪ Total CPU execution time = function execution (volk/SIMD-4 makes it even faster)
▪ Block sizes – Processing more data in a single transaction can offset the memory transfer cost
▪ Computational complexity – Some functions take longer to run on a CPU (sin, cos, atan2, log10) and can be good candidates
▪ Generic equation for when OpenCL performs better:

$T_{\mathrm{MemIn}}[\text{block size}] + T_{\mathrm{exec}}[\text{block size}] + T_{\mathrm{MemOut}}[\text{block size}] < T_{\mathrm{CPU}}[\text{block size}]$

That’s the “magic” spot for OpenCL SIMD/GPU acceleration in GNU Radio!
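The inequality can be explored with a small cost model. The sketch below assumes each term is linear in the block size (a fixed overhead plus a per-item cost), which is a simplification and not something measured in this talk. `CostModel` and all of its fields are hypothetical placeholders to be filled with numbers timed on real hardware.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical linear cost model; every field is an assumed value.
struct CostModel {
    double copy_latency_s;        // fixed cost of one host<->device copy
    double bus_bytes_per_second;  // sustained transfer bandwidth
    double kernel_launch_s;       // fixed cost of starting the kernel
    double gpu_seconds_per_item;  // per-item execution cost on the GPU
    double cpu_seconds_per_item;  // per-item execution cost on the CPU
    double bytes_per_item;        // e.g. 8 for one complex float sample
};

// T_MemIn + T_exec + T_MemOut for one block. Input and output are assumed
// to be the same size, which holds for complex-in/complex-out blocks only.
double opencl_time(const CostModel& m, std::size_t block_size) {
    const double n = static_cast<double>(block_size);
    const double copy = m.copy_latency_s + n * m.bytes_per_item / m.bus_bytes_per_second;
    const double exec = m.kernel_launch_s + n * m.gpu_seconds_per_item;
    return copy + exec + copy;  // copy in, run, copy out
}

// T_CPU for one block.
double cpu_time(const CostModel& m, std::size_t block_size) {
    return static_cast<double>(block_size) * m.cpu_seconds_per_item;
}

bool opencl_wins(const CostModel& m, std::size_t block_size) {
    return opencl_time(m, block_size) < cpu_time(m, block_size);
}

// Smallest block size at which OpenCL is faster, or 0 if it never wins
// within max_block_size.
std::size_t break_even_block_size(const CostModel& m, std::size_t max_block_size) {
    assert(max_block_size > 0);
    for (std::size_t n = 1; n <= max_block_size; ++n) {
        if (opencl_wins(m, n)) return n;
    }
    return 0;
}
```

Under this model the fixed overheads dominate small blocks, so cheap functions may never reach a break-even point, while expensive per-item functions cross it at some block size.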


Testing Notes
▪ Tested on 4 platforms: a new NVIDIA 1070, older 970, laptop 1000M, and OpenCL-CPU
▪ Block sizes were the actual passed block sizes. Remember that GNU Radio’s scheduling engine may be passing around half the default block size.
▪ To get the block sizes in the charts, you may have to double the block size in your flowgraph
▪ Timing is isolated testing – straight function calls
▪ Used the included tools for timing (test-clenabled, test-clfilter, test-clkernel)
▪ Where possible, the project also takes advantage of Fused Multiply-Add (FMA) for performance
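Isolated timing of this kind can be sketched as a loop of straight function calls around a clock. The code below is an illustrative harness, not the implementation of test-clenabled. `samples_per_second` and `time_log10` are hypothetical names, and it times a CPU log10 only.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Call process_block() `iterations` times and report items per second.
template <typename Fn>
double samples_per_second(Fn&& process_block, std::size_t block_size, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        process_block();
    }
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    if (elapsed.count() <= 0.0) return 0.0;  // clock too coarse to measure
    return static_cast<double>(block_size) * iterations / elapsed.count();
}

// Time a CPU log10 over one block, outside of any flowgraph or scheduler.
double time_log10(std::size_t block_size, int iterations) {
    assert(block_size > 0 && iterations > 0);
    std::vector<float> in(block_size), out(block_size);
    for (std::size_t i = 0; i < block_size; ++i) {
        in[i] = 1.0f + static_cast<float>(i);
    }
    volatile float sink = 0.0f;  // keeps the optimizer from dropping the work
    const double rate = samples_per_second(
        [&] {
            for (std::size_t i = 0; i < block_size; ++i) {
                out[i] = std::log10(in[i]);
            }
            sink = out[block_size - 1];
        },
        block_size, iterations);
    (void)sink;
    return rate;
}
```

A call such as `time_log10(8192, 200)` gives a CPU-side figure for one block size. An OpenCL measurement would also have to include the copies to and from the card.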


Performance – Log10

[Chart: throughput (axis 0 to 600,000,000) versus block size (axis 0 to 30,000). Series: VM CPU, VM OCL, 1070 CPU, 1070 OCL, 970 CPU, 970 OCL, 1000M CPU, 1000M OCL.]

Performance – Quad Demod

[Chart: throughput (axis 0 to 450,000,000) versus block size (axis 0 to 30,000). Series: VM CPU, VM OCL, 1070 CPU, 1070 OCL, 970 CPU, 970 OCL, 1000M CPU, 1000M OCL. Callouts mark the 1070, 970, and 1000M curves.]

Performance – Signal Source

[Chart: throughput (axis 0 to 700,000,000) versus block size (axis 0 to 30,000). Series: VM CPU, VM OCL, 1070 CPU, 1070 OCL, 970 CPU, 970 OCL, 1000M CPU, 1000M OCL. Callouts mark the 1070, 970, and 1000M curves.]

Signal Source – OpenCL and Trig

[Figure: two panels, “Float Calculations” and “Double Calculations”.]

Performance – Filters

[Chart: Filter Analysis. Throughput (axis up to 180,000,000) by platform, with one series each for 20%, 15%, 10%, and 5% filter width. Platforms: VM OCL TD, VM OCL FD, VM CPU FD, 1070 OCL TD, 1070 OCL FD, 1070 CPU FD, 970 OCL TD, 970 OCL FD, 970 CPU FD, 1000M OCL TD, 1000M OCL FD, 1000M CPU FD.]

FD = Frequency Domain, TD = Time Domain
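A rough operation count suggests why time-domain and frequency-domain filters scale differently with tap count. The sketch below is an order-of-magnitude model only and was not used for the measurements in the chart. It assumes one FFT costs about N log2 N multiplies, assumes a power-of-two FFT size covering block plus taps, and uses hypothetical function names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Direct (time-domain) FIR: one multiply per tap for every output item.
double fir_multiplies(std::size_t block_size, std::size_t taps) {
    return static_cast<double>(block_size) * static_cast<double>(taps);
}

// Smallest power of two that is >= n.
std::size_t next_pow2(std::size_t n) {
    assert(n > 0);
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// FFT (frequency-domain) filter: forward FFT, pointwise multiply by the
// transformed taps, inverse FFT. Each FFT is modeled as N * log2(N).
double fft_filter_multiplies(std::size_t block_size, std::size_t taps) {
    const std::size_t fft_size = next_pow2(block_size + taps - 1);
    const double n = static_cast<double>(fft_size);
    return 2.0 * n * std::log2(n) + n;
}
```

In this model the direct cost grows linearly with the number of taps, while the FFT cost is almost flat until the FFT size has to double, so the FFT approach pulls ahead as filters get longer.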

Project Take-Aways
▪ There are a good number of common OpenCL blocks that can improve performance, especially at high rates
▪ Time the OpenCL blocks on your hardware with test-clenabled, test-clfilter, and test-clkernel
▪ Pick the most “expensive” block in your flowgraph that has an OpenCL version and use the OpenCL block for that
▪ Don’t run more than 1 block on a single card simultaneously (sounds obvious, but it incurs a 100x penalty). Consider multiple cards: blocks can be explicitly assigned to a card (use clview for the #’s)
▪ OpenCL signal source is a more “pure” waveform (trig versus lookup table)
▪ Speed! Other flowgraph performance-boosters if speed is important:
  ▪ CPU FFT filters outperform FIR for increasing tap sizes (time with test-clfilter, gr-lfast FFT wrappers)
  ▪ Look at gr-lfast for some CPU-optimized blocks where OpenCL doesn’t help (2nd/4th Costas and AGC)
  ▪ If you’re doing signal sources and multiply blocks or an Xlating FIR Filter for offset tuning to get away from a DC spike, consider gr-correctiq or the OpenCL signal source block / multiply to get rid of it and reduce CPU load
  ▪ Spread out the work with a distributed model: Demod/Decode/Instrumentation. If you like visualization tools like gr-fosphor and you’re running into CPU constraints, consider a net block or gr-grnet to help you move data to other systems and split up your processing and visualization
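The “trig versus lookup table” point can be illustrated by measuring how far a table-based sine strays from a direct trig call. This is a sketch under stated assumptions: the table here is a plain nearest-lower-entry table without interpolation, its size is a hypothetical parameter, and it does not reproduce the table used by any particular GNU Radio block.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute difference between std::sin and a lookup table with
// `table_size` entries, probed at `probes` evenly spaced phases in [0, 2*pi).
double max_lookup_table_error(std::size_t table_size, std::size_t probes) {
    assert(table_size > 0 && probes > 0);
    const double two_pi = 6.283185307179586;

    // Precompute one period of the sine wave.
    std::vector<double> table(table_size);
    for (std::size_t k = 0; k < table_size; ++k) {
        table[k] = std::sin(two_pi * static_cast<double>(k) / static_cast<double>(table_size));
    }

    double worst = 0.0;
    for (std::size_t p = 0; p < probes; ++p) {
        const double phase = two_pi * static_cast<double>(p) / static_cast<double>(probes);
        // Truncate the phase to the table entry at or below it.
        const std::size_t index = (p * table_size) / probes;
        const double error = std::fabs(std::sin(phase) - table[index]);
        if (error > worst) worst = error;
    }
    return worst;
}
```

The error of such a table is bounded by its phase step (2π divided by the table size), whereas computing the trig function per sample avoids that quantization at the cost of more arithmetic, which is the kind of work the earlier slides identify as a good fit for the GPU.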


Q&A
▪ Download code at https://github.com/ghostop14/gr-clenabled.git or via pybombs
▪ OpenCL install help on git in setup_help (card doesn’t need 1.2, just for compilation)
▪ Contact Info: [email protected]
