Using the VTune™ Performance Enhancement Environment for the Streaming SIMD Extensions
Copyright © 1999, Intel Corporation. All rights reserved.

Agenda
- Background
- The VTune™ Performance Enhancement Environment for the Streaming SIMD Extensions
- Development Methods for the Streaming SIMD Extensions
- Summary

Background: MMX™ Technology Tools
- Enabled low-level work (assembly language)
- Efforts to provide high-level support (compilers) were:
  - late
  - not utilized by ISVs
  - technically immature
  - not adopted by the industry quickly, or at all

MMX™ Technology Tools
What Developers Told Us
- It is painful to realize performance benefits from MMX™ technology
- Compilers are not capable of taking high-level code and automatically producing optimized MMX™ technology instructions
- You, the developer, had to use assembly
Lack of good tools creates big headaches for developers

The VTune™ Performance Enhancement Environment, 4.0
- Intel® C/C++ Compiler
- VTune™ Analyzer
- Register Viewing Tool
- Performance Library Suite
- Intel® Architecture Training Center
The definitive toolkit for Streaming SIMD Extensions programming

VTune™ Performance Enhancement Environment
Intel® C/C++ Compiler
- A “plug-in” to Microsoft* Developer Studio versions 5.0 and 6.0
  Microsoft Visual Studio 97*
- Object and language compatible with MSVC v5.0 and v6.0
*Third-party brands and names are the property of their respective owners.

Intel® C/C++ Compiler Optimization
- For SIMD coding:
  - inline assembly, intrinsics, vector classes, and vectorization
  - data alignment mechanisms
- CPU Dispatch:
  - different code for different processors in one executable
- Scalar optimization:
  - aggressive floating-point optimization
  - profile-guided and inter-procedural optimization
Let the compiler deal with optimization
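CPU dispatch, as named on this slide, means shipping several code paths in one executable and choosing one at run time. The slide describes the compiler doing this for you; the sketch below only illustrates the idea by hand. It assumes an x86-64 target and a GCC- or Clang-compatible compiler, because `__builtin_cpu_supports` is a builtin of those compilers and not part of the toolkit described here. The names `add_scalar`, `add_sse`, and `select_add` are hypothetical.

```cpp
#include <xmmintrin.h>  // Streaming SIMD Extensions intrinsics
#include <cassert>

// Baseline path: plain scalar loop that runs on any processor.
static void add_scalar(float* a, const float* b, const float* c, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

// SSE path: four floats per iteration. Unaligned loads and stores are used
// so callers need no special alignment; leftovers are handled in scalar code.
static void add_sse(float* a, const float* b, const float* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(a + i, _mm_add_ps(_mm_loadu_ps(b + i), _mm_loadu_ps(c + i)));
    for (; i < n; ++i)
        a[i] = b[i] + c[i];
}

using AddFn = void (*)(float*, const float*, const float*, int);

// Assumption: GCC/Clang builtin that queries the processor at run time.
AddFn select_add() {
    AddFn fn = __builtin_cpu_supports("sse") ? add_sse : add_scalar;
    assert(fn != nullptr);
    return fn;
}
```

A caller would fetch the function pointer once, for example `AddFn add = select_add();`, and then call `add(a, b, c, n)` wherever the loop is needed, so the processor check is not repeated per call.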

VTune™ Analyzer
Tune application performance using several methods:
- Processor sampling for CPU usage without binary instrumentation
- CPU simulation in software (Dynamic Analysis)
- Chronologies of performance counters from the OS, the processor, and the 740 Graphics chipset
- Call graph profiling for C/C++ and Java* with binary instrumentation

VTune™ Analyzer
- All tuning methods centered around source code views
- Offers performance tuning advice for C/C++, Fortran, Java*, assembly
- Teaches how to write better performing code
- Supports Pentium® III, Pentium II, Pentium Pro, and Pentium processors on Windows* 95/98, Windows NT* 4.0/5.0

VTune™ Analyzer: Hotspots

VTune™ Analyzer: Coach Advice

VTune™ Analyzer: Dynamic Analysis

VTune™ Analyzer: Call Graph Profiling

Register Viewing Tool
- Shows xmm registers during execution and debugging

Performance Libraries
- Image Processing
- Signal Processing
- Recognition Primitives
- Math Kernel
- JPEG Library
Streaming SIMD Extensions and MMX™ Technology used extensively

Intel® Architecture Training Center
- Computer Based Training (CBT):
  - interactive tutorial on Streaming SIMD Extensions

Intel® Architecture Training Center, cont’d
- Pentium® II and Pentium III processors Programmer’s Reference Manuals
- Optimization Manual
- Application notes and code samples using Streaming SIMD Extensions:
  - 3D lighting/transform, filters, min/max, Newton-Raphson, FFT, deformable surfaces, and lots more

Development Methods for the Streaming SIMD Extensions
[Diagram: development methods arranged along a "Difficulty" axis]
- Hand-coded assembly ("Bit Bangers Only!"):
    movaps xmm0, b[i]
    movaps xmm1, c[i]
    addps xmm0, xmm1
    movaps a[i], xmm0
- Intrinsics ("Fast food assembly"): a[i]=_mm_add_ps(b[i],c[i])
- Performance libraries ("You make the call"): RLsbAdd3()
- C++ class library ("Performance for the masses"): a[i]=b[i]+c[i]
- Vectorization ("Let the compiler do the work, sort of..."): #pragma vector

Development Methods
Intrinsics
- New data type: __m128
- No need to schedule and register allocate
- Can still choose instruction sequences
- Use for MMX™ Technology and the Streaming SIMD Extensions
- Definitions in Pentium® III Processor Programmer’s Reference Manual
- Near hand-coded assembly performance (< 15% difference)
a[i]=_mm_add_ps(b[i],c[i])
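The one-line example on this slide can be expanded into a small sketch. It assumes the arrays are declared as `__m128` (so the compiler handles the 16-byte alignment) and a compiler that provides `<xmmintrin.h>`; `add_packed` and `lane` are hypothetical names introduced only for illustration.

```cpp
#include <xmmintrin.h>  // __m128 and the _mm_* intrinsics
#include <cassert>
#include <cstddef>

// Adds two arrays of packed values. Each __m128 holds four 32-bit floats,
// so one _mm_add_ps performs four additions.
void add_packed(__m128* a, const __m128* b, const __m128* c, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i)
        a[i] = _mm_add_ps(b[i], c[i]);
}

// Hypothetical helper: reads lane k (0..3) of a packed value.
float lane(__m128 v, int k) {
    assert(k >= 0 && k < 4);
    alignas(16) float out[4];
    _mm_store_ps(out, v);  // aligned store into the temporary buffer
    return out[k];
}
```

The programmer picks the instruction (`_mm_add_ps` maps to `addps`), while register allocation and scheduling are left to the compiler, which is the trade-off the bullets describe.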

C++ class library
- Abstract the underlying technology
- Performance gain everywhere the classes are used
- New packed data types:
  - I32vec2 (2 32-bit ints), I16vec4 (4 16-bit ints), I8vec8 (8 8-bit ints), and unsigned versions
  - F32vec4 (4 32-bit floats)
- Extensible, easy to use, keeps code readable and portable
- Nearly matches intrinsics performance (0-5% difference)
a[i]=b[i]+c[i]
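To show how a class can abstract the underlying technology, here is a minimal sketch of a packed-float wrapper built on the intrinsics. It is not Intel's F32vec4 and makes no claim about how that class is implemented; `Vec4f` and its members are hypothetical and cover only the operators needed for `a[i]=b[i]+c[i]` style code.

```cpp
#include <xmmintrin.h>
#include <cassert>

// Hypothetical stand-in for a packed-float class. This is NOT Intel's
// F32vec4; it only shows how operator overloading can hide the intrinsics.
struct Vec4f {
    __m128 v;
    Vec4f() : v(_mm_setzero_ps()) {}
    explicit Vec4f(__m128 x) : v(x) {}
    // e0 is the lowest lane, matching the argument order of _mm_setr_ps.
    Vec4f(float e0, float e1, float e2, float e3) : v(_mm_setr_ps(e0, e1, e2, e3)) {}
    // Read-only lane access, copied out through an aligned temporary.
    float operator[](int i) const {
        assert(i >= 0 && i < 4);
        alignas(16) float t[4];
        _mm_store_ps(t, v);
        return t[i];
    }
};

inline Vec4f operator+(Vec4f a, Vec4f b) { return Vec4f(_mm_add_ps(a.v, b.v)); }
inline Vec4f operator*(Vec4f a, Vec4f b) { return Vec4f(_mm_mul_ps(a.v, b.v)); }
inline Vec4f& operator+=(Vec4f& a, Vec4f b) { a.v = _mm_add_ps(a.v, b.v); return a; }
```

With such a class, `a = b + c` reads like scalar code while each operator still performs one packed operation on four floats.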

Vectorization
- Compiler generates SIMD integer or FP code for you under strict conditions:
  - countable loops with single-unit stride
  - body of loop must be a single block: no internal branching, single entry/exit
  - data types: float, char/short/int
  - user ensures correct alignment for floats
  - user ensures no aliases for pointers
  - no function calls in loop
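As an illustration of the conditions listed above, the sketch below contrasts a loop that fits them with one that does not. It assumes a compiler that accepts the `__restrict` extension in C++ (GCC, Clang, and MSVC do); the function names are hypothetical, and whether a given compiler actually vectorizes either loop depends on that compiler.

```cpp
#include <cassert>

// Fits the listed conditions: countable loop, unit stride, single block,
// float data, no calls in the loop. __restrict is the user's promise that
// dst and src do not alias.
void scale(float* __restrict dst, const float* __restrict src, float k, int n) {
    assert(dst && src);  // guard added for this sketch only
    for (int i = 0; i < n; ++i)
        dst[i] = k * src[i];
}

// Breaks two of the listed conditions: a stride of 2 and a branch in the body.
void scale_positive_even(float* dst, const float* src, float k, int n) {
    assert(dst && src);  // guard added for this sketch only
    for (int i = 0; i < n; i += 2) {
        if (src[i] > 0.0f)
            dst[i] = k * src[i];
    }
}
```

Both functions are valid scalar code; the difference is only in how much the compiler is told and how regular the loop is.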

Vector-Multiply-Add in C
void do_c(float *ac, float *m, float *v, int n)
{
    for(int i=0; i
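The listing above is cut off in the source. A plausible completion, assuming the loop runs to `n` and uses the same multiply-add body that appears in the later vectorization example (`ac[i] += m[i] * v[i]`), might look like this. The loop bound and body are assumptions, not text recovered from the slide.

```cpp
#include <cassert>

// Hedged reconstruction: the loop bound and body are assumed, because the
// slide is truncated after "i".
void do_c(float *ac, float *m, float *v, int n)
{
    assert(ac && m && v);           // guard added for this sketch only
    for (int i = 0; i < n; i++)
        ac[i] += m[i] * v[i];       // multiply-add, one element per iteration
}
```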

Intrinsics Example
// assumes data passed in is 16-byte aligned!!!
void do_intrin(float *ac, float *m, float *v, int n)
{
    __m128 t, vp, mp, ap;
    for(int i=0; i
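This listing is also cut off in the source. The sketch below is one way the function could continue, reusing the declared variables `t`, `vp`, `mp`, and `ap` and the stated 16-byte alignment requirement. The loop body, the scalar remainder loop, and the `aligned16` helper are assumptions added for illustration, not the original slide content.

```cpp
#include <xmmintrin.h>
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary.
static bool aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// assumes data passed in is 16-byte aligned
void do_intrin(float *ac, float *m, float *v, int n)
{
    assert(aligned16(ac) && aligned16(m) && aligned16(v));
    __m128 t, vp, mp, ap;
    int i = 0;
    for (; i + 4 <= n; i += 4) {        // four floats per iteration
        vp = _mm_load_ps(&v[i]);
        mp = _mm_load_ps(&m[i]);
        ap = _mm_load_ps(&ac[i]);
        t  = _mm_mul_ps(mp, vp);        // m[i..i+3] * v[i..i+3]
        ap = _mm_add_ps(ap, t);         // accumulate into ac
        _mm_store_ps(&ac[i], ap);
    }
    for (; i < n; i++)                  // scalar remainder (an addition here)
        ac[i] += m[i] * v[i];
}
```

The aligned load and store intrinsics fault on misaligned addresses, which is why the alignment comment on the slide matters.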

Vector Class Example
#include <fvec.h>   // Intel C++ class library header that declares F32vec4
// assumes 16-byte aligned data; if n is not a multiple of 4, the remainder is skipped
void do_fvec(float *ac, float *m, float *v, int n)
{
    F32vec4 *a = (F32vec4 *) ac;
    F32vec4 *c = (F32vec4 *) m;
    F32vec4 *b = (F32vec4 *) v;
    for (int i = 0; i < (n / 4); i++)   // reduced iterations
        a[i] += (c[i] * b[i]);
}
The vector classes offer efficient, portable performance

Vectorization Example
// restrict says data is only accessible via given pointer
void do_vectorize(float *restrict ac, float *restrict m, float *restrict v, int n)
{
#pragma vectorize aligned   // user ensures data is aligned
    for (int i = 0; i < 100; i++)
        ac[i] += (m[i] * v[i]);
}
Give some characteristics; compiler generates the SIMD code

Summary
- We learned our lesson: lack of tools = big headaches for developers
- The VTune™ Performance Enhancement Environment has everything you need for Streaming SIMD Extensions programming
- The new development methods allow you to concentrate on your application; the tools will help you achieve the performance

Now, it’s your turn:
www.intel.com/VTune