Writing Scalable SIMD Programs with ISPC

James Brodman, Dmitry Babokin, Ilia Filippov, Peng Tu
Intel Corporation
{james.brodman,dmitry.y.babokin,ilia.filippov,ptu1}@intel.com

Abstract

Modern processors contain many resources for parallel execution. In addition to having multiple cores, processors can also contain vector functional units that are capable of performing a single operation on multiple inputs in parallel. Taking advantage of this vector hardware is essential to obtaining peak performance on a machine, but it is often challenging for programmers to do so. This paper presents a performance study of compiling several benchmarks from the domains of computer graphics, financial modeling, and high-performance computing for different vector instruction sets using the Intel SPMD Program Compiler, an alternative to compiler autovectorization of scalar code or handwriting vector code with intrinsics. ispc is both a language and compiler that produces high-quality code for SIMD CPU vector extensions such as Intel Streaming SIMD Extensions (SSE), Intel Advanced Vector Extensions (AVX), or ARM NEON. We present the results of compiling the same ispc source program for various targets. The performance of the resulting ispc versions is compared to that of scalar C++ code, and we also examine the scalability of the benchmarks when targeting wider vector units.

1. Introduction

From mobile devices to supercomputers, modern processors contain many different resources for parallel execution. The shift to multiple cores on chips has been a hot topic in recent years, but the multiple cores present in today's chips are not the only resources for parallel execution. Indeed, vector instruction set extensions preceded multiple cores in commodity CPUs by almost a decade. These vector instruction set extensions follow the Single Instruction, Multiple Data (SIMD) model from Flynn's taxonomy. Such SIMD vector instructions let programmers perform the same operation on multiple pieces of data in parallel. Exploiting this data parallelism is required in order to achieve the peak performance of modern machines.

Writing parallel programs has long been known to be a challenging problem, and efficiently exploiting SIMD parallelism is no exception. The most common methods programmers use to exploit SIMD parallelism are hand-written intrinsics and compiler autovectorization. Intrinsics are a lower-level programming model that requires the programmer to possess expert knowledge about the target vector instruction set. Compiler autovectorization is a difficult problem with many challenges. The compiler must be conservative and be able to prove that it is safe to rewrite code into a vector version. This often requires extra annotations by the programmer that supply information the compiler cannot otherwise ascertain. This paper presents a study of vectorizing several applications using the Intel SPMD Program Compiler, or ispc, an alternative to both intrinsics and compiler autovectorization.

The outline of this paper is as follows. Section 2 presents an overview of ispc. The SPMD programming model used by ispc is discussed in Section 2.1. Section 2.2 presents an overview of the ispc language, and Section 2.3 describes the LLVM-based compiler. Next, an overview of the benchmarks used for this study is found in Section 3. Section 4 contains the performance evaluation that is the main contribution of this paper. This is broken down into a summary of speedup over scalar C++ in Section 4.3, a summary of performance scaling when doubling vector width in Section 4.4, and a benchmark-by-benchmark analysis in Section 4.5. Finally, Section 5 presents concluding remarks.

2. ISPC

ispc, the Intel SPMD Program Compiler, is both a language and a compiler that produces high-performance code for processors with SIMD vector units. The language is modeled after a mix of C and C++ and introduces a few additional keywords and control-flow constructs. ispc is open-source software that is freely available at [2]. It can currently generate code for the Intel SSE4, Intel AVX1/2, Intel Xeon Phi, and ARM NEON vector instruction sets. The compiler is built on top of the LLVM compiler infrastructure [7]. A detailed overview of ispc is presented in [10].

2.1 Programming Model

As its name implies, ispc is built around the SPMD (Single Program, Multiple Data) programming model [4]. In the SPMD model, multiple instances of a program execute on different pieces of data. The contract between the programming model and the programmer states that each of these instances operates on independent data. Consequently, each of these instances is free to execute in parallel with all others, as the programmer guarantees that there are no dependences between different instances.

ispc implements an SPMD programming model on top of the SIMD vector hardware found in modern CPUs. "Gangs" of SPMD program instances map to vector lanes of the SIMD hardware. For example, if the program is running on a machine with 8-wide SIMD units, a gang of eight program instances can execute in parallel on that vector unit, with each lane operating on different data. We thus refer to ispc's implementation as "SPMD-on-SIMD".

Mapping SPMD onto SIMD is straightforward for straight-line code, since each operation is simply translated into its vector equivalent. However, complications arise if program instances exhibit diverging control flow, where some program instances want to execute one path of a branch while the rest follow the other. Since SIMD lanes share execution units in the CPU, each lane or program instance must execute the same sequence of instructions in lockstep. ispc handles such cases of diverging control flow by transforming it into data flow. Consider the example in Figure 1.

    if (x < 5)
        y = 1;
    else
        y = 3;

Figure 1.

The SPMD model states that each program instance can have different values for x and y. Each program instance will compare its value of x against 5. Depending on the result of this comparison, some instances will want to assign y to 1 while others will assign y to 3. Since each program instance corresponds to a lane of the SIMD vector unit, every instance has to execute the same sequence of instructions. In order to ensure correct execution, ispc transforms the original program into the one shown in Figure 2.

    // Execute true branch
    mask = (x < 5)
    y = blend(y, 1, mask)
    // Execute false branch
    mask = ~mask
    y = blend(y, 3, mask)

Figure 2.

In the example in Figure 2, the control flow present in the original example has been transformed into data flow. Since the different program instances executing in the SIMD lanes must execute the same sequence of instructions, the branch has been removed. Every SIMD lane will now execute both the true and false branches. However, each instance cannot naively execute both assignments to y without producing incorrect results. A new value, mask, is computed based on the value of the conditional expression for each program instance. The mask records which program instances are currently "active" and is used to predicate the execution of instructions. In this example, it is used in conjunction with the blend operation to store the appropriate value into y at that point in the program. The blend operation chooses between two values based on the value of mask for that instance. Consequently, the first blend operation only assigns the value 1 in those lanes where the conditional expression is true. The mask is then updated to reflect which lanes will execute the statements in the false branch of the original if statement. Finally, another blend occurs using the updated mask.

Every ispc function has an implicit mask argument. At the beginning of a program, the mask is likely "all on", where all program lanes are active. As execution progresses, lanes can become inactive. If a function is called under divergent control flow, its mask upon function entry will reflect the value of the mask at the call site of the function.

2.2 The ispc Language

The ispc language is based upon C and C++. This provides a familiarity to programmers that enables a quick ramp-up from learning ispc to using it in applications. ispc is similar in concept to CUDA [8] and OpenCL [5], which are both SPMD languages that primarily target GPUs, although some OpenCL implementations also target vectorization for CPUs [11].

The most important new language features of ispc are the keywords uniform and varying, which were first used in the RenderMan Shading Language [6]. uniform and varying are type modifiers, similar to const, that describe the variability of a value, that is, whether a value is scalar or vector. In C/C++, all values by default have what ispc would call uniform variability. For example, int x represents a single, scalar value. In ispc, values are varying by default; int x in ispc thus represents many values, one per program instance. In ispc, int x and varying int x are equivalent declarations. However, uniform int x means that x has only a single, scalar value just as in C/C++. It is legal to assign a uniform value to a varying variable since the uniform value can be widened. However, it is illegal to assign a varying value to a uniform variable.

When defining struct types, one may choose to mark the individual fields of the struct as being uniform or varying. If left undecorated, the variability of the fields is resolved when an instance of that struct type is created. Consider the example in Figure 3.

    struct TwoFloats {
        float one;
        float two;
    };
    uniform TwoFloats foo;
    varying TwoFloats bar;

Figure 3.

The struct foo is declared as having uniform variability, and this is equivalent to how the struct would appear in C/C++ since foo is a struct that contains two uniform floats. However, the struct bar is declared as having varying variability and instead contains two varying floats. Using varying structs in ispc is similar to manually transforming C/C++ code to use a Structure of Arrays (SOA) data layout.

Arrays of uniform variability behave as in C/C++. In contrast, varying arrays implicitly have an extra dimension since each element of the array is itself a varying value. Arrays passed into ispc routines from C/C++ typically have uniform variability. Indexing a uniform array with a uniform value yields a single uniform result. Likewise, indexing a uniform array with a varying value yields a varying result. This is a common technique for computing on the values of an array in parallel.

ispc provides full support for pointers. Pointer types can also be decorated with the uniform and varying keywords. A varying pointer means that each program instance has a different value for the pointer, while a uniform pointer means that all program instances share the same value that points to a single location in memory. The pointed-to type also has its own variability, so one could declare a varying pointer to uniform float, a uniform pointer to varying int, or even a varying pointer to varying double. Note that the [] syntax for arrays is equivalent to a uniform pointer.

The variability of data can have important consequences for the efficiency of the resulting code. Reading from and writing to varying pointers leads to gather/scatter operations if the compiler cannot prove that it is possible to use a more efficient vector load or store. Additionally, not all vector instruction sets contain instructions for gather/scatter operations. If such instructions do not exist, gather/scatter operations must be implemented as multiple scalar loads and stores.
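To make these pointer rules concrete, the following fragment is a minimal sketch (not taken from the paper or its benchmarks; the function and variable names are invented) that declares both a uniform and a varying pointer into the same array and notes which accesses can become plain vector loads and which must be treated as gathers or scatters:

    // Hypothetical illustration of pointer variability in ispc.
    export void pointer_example(uniform float base[], uniform float out[]) {
        uniform float * uniform  up = &base[0];             // one pointer shared by the gang
        uniform float * varying  vp = &base[programIndex];  // a different pointer per instance

        varying float a = up[programIndex]; // common base, stride-1 offsets: vector load
        varying float b = *vp;              // arbitrary per-instance addresses: treated as a gather
        out[programIndex] = a + b;          // stride-1 write: vector store
    }

Whether the gather in this sketch compiles to a single hardware instruction or to a series of scalar loads depends on the target instruction set, as noted above.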

Variability also affects the performance of control flow. If a conditional expression has uniform variability, then control flows as it would in C/C++. Every program instance takes the same path, so it is unnecessary to transform control flow into data flow. However, if a conditional expression has varying variability, then the execution paths of different program instances can diverge. In these situations, control flow must be transformed into data flow with masking. The situation is similar for loops with varying conditions.

Two special read-only variables are available to programmers: programCount and programIndex. The values of these variables depend on the particular target of the compiler. programCount gives the total number of SPMD program instances in a gang. For example, if code is compiled for an SSE4 target, programCount has the value 4, while the AVX targets have a programCount of 8. programIndex is of type varying int and gives the index of the SIMD lane in a gang being used to execute a particular program instance. This index starts at 0 and goes up to, but not including, programCount. programIndex and programCount can be used to easily specify parallel iteration over an array.

    uniform int len = ...;
    uniform int data[] = ...;
    for (uniform int i = 0; i < len; i += programCount) {
        varying int index = i + programIndex;
        data[index] += 5;
    }

Figure 4.

In the example in Figure 4, the loop (with uniform control flow) executes in chunks of programCount elements. The array data is indexed using the varying expression index. While the loop variable i has uniform variability, programIndex is varying, so the expression i + programIndex is also varying and has the values i+0, i+1, etc. Indexing data with a varying value returns a varying int, where each program instance in the gang has the value indicated by its respective value of index.

Since the previous example is a common pattern, ispc provides a foreach construct that expresses this iteration pattern more concisely. The previous example could be written as shown in Figure 5.

    foreach (i = 0 ... len) {
        data[i] += 5;
    }

Figure 5.

The features of the ispc language provide programmers with a "WYSIWYG" style of vector programming, since the variability of data directly determines what code will be generated. Vectors are not discovered; rather, they are explicitly encoded in the types. This small set of extensions to a C-like language is sufficient to enable producing high-quality vector code.

2.3 ISPC Compiler

The ispc compiler is built on top of the LLVM compiler infrastructure. It has a custom front end, using flex and bison, that parses input programs and builds an abstract syntax tree (AST). It is important to note that the program represented by the AST is already a parallel program, since data parallelism is explicit in ispc's type system through the use of the uniform and varying type modifiers. Next, the compiler translates this AST into LLVM's intermediate representation (IR). The LLVM type system supports vector types, so the resulting IR explicitly contains vector types and operations. This differs from autovectorizing compilers (including LLVM's own vectorization passes), which try to translate scalar IR into vector code. It is during this phase that control flow is transformed into data flow and masking is introduced. Different IR must be generated for each target specified to the compiler, since different targets may have different values for special variables such as programCount or different internal representations for mask values.

Depending on the specified optimization level, various LLVM and custom optimization passes are applied to the LLVM IR. The -O0 optimization level in ispc is mainly meant as a debugging tool. It will produce executable vector code with appropriate masking, but the resulting code receives only the most minimal set of optimizations. The -O1 optimization level applies roughly the same set of optimizations applied by clang at -O2, as well as the previously mentioned custom passes.

One of the most important custom passes applied at -O1 optimizes memory accesses in ispc programs. All loads and stores of varying values are initially treated as gather/scatter operations. This pass looks at these memory operations and tries to transform them into more efficient forms. This can take the form of transforming a general gather/scatter from arbitrary varying pointers into a gather/scatter with a common base address and varying offsets, or of transforming a gather/scatter into a sequential vector load/store. Consider the example shown in Figure 6.

    uniform float data[] = ...;
    uniform float moreData[] = ...;
    uniform float * varying vp = ...;

    data[programIndex] = ...;
    varying int val = ...;
    varying float result = moreData[val];
    *vp = ...;

Figure 6.

Initially, the assignment to data is represented as a scatter operation, with each program instance writing to a different memory address. However, the optimization pass is able to prove that this can be replaced with a vector store, since data is a uniform pointer indexed by programIndex, a vector of constant, stride-1 values. The read from moreData is initially represented as a gather operation, but the optimization pass is able to recognize that each program instance has a common uniform base pointer. The general gather can be replaced with a gather from a common base with varying offsets, a form that can be expressed with a single instruction in the AVX2 vector extensions. Finally, the write to the varying pointer vp cannot be optimized, since the compiler has no additional knowledge and must pessimistically assume that each lane can write to an arbitrary address in memory.

Another optimization performed by the compiler looks for cases where the execution mask can be determined to be either "all on" or "all off". If the compiler can determine that the execution mask will be "all on", where every SIMD lane takes part in the following computations, then the compiler can emit more efficient code that does not require blend operations for that basic block. Likewise, if the mask is "all off", then none of the program instances need to execute that basic block, and the compiler can instead jump to its successor. Similarly, the compiler can replace masked loads and stores with regular loads and stores if the mask is known to be all on, eliminating the control flow that would otherwise check whether the mask is set for a lane before loading or storing.

The ispc compiler can produce as output, depending on the target, either object files, assembly code, LLVM bitcode, or C++ code with intrinsics. The C++ backend is currently used when targeting Intel Xeon Phi processors. All other targets generate native object code using LLVM's provided backend code generators. The resulting object files are easily linked with object files from other languages or compilers.
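As a brief illustration of how ispc code is typically exposed to a C/C++ host program (a minimal sketch, not drawn from the paper; the function name is invented), a function marked export is compiled into the object file with C linkage, and ispc can additionally emit a small C/C++ header declaring it (via its -o and -h options), while the --target option selects the vector ISA and thus the gang size:

    // simple.ispc -- hypothetical example, not one of the paper's benchmarks.
    // Compiling this file with ispc produces an object file plus a header
    // declaring scale_add(), so C/C++ code can call it like any other
    // external function.
    export void scale_add(uniform float vin[], uniform float vout[],
                          uniform float s, uniform int count) {
        // Each foreach iteration processes a gang of programCount elements.
        foreach (i = 0 ... count) {
            vout[i] = s * vin[i] + 5.0f;
        }
    }

The resulting object file is then linked into the host application together with the serial C++ code.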

3. Benchmarks

For this performance study, five benchmarks were selected from among the example programs included in downloads of ispc. These benchmarks are AOBench, Binomial Options, Black-Scholes, Perlin Noise, and a 3rd-order finite-difference stencil. Source code for these benchmarks is available at the ispc website [2]. The following subsections provide a brief description of each benchmark and a high-level description of how its data parallelism is expressed in ispc.

3.1 AOBench

AOBench is a small ambient occlusion renderer used for benchmarking floating-point performance [1]. The main computation of AOBench loops over the scanlines of an image of width w and height h. It is implemented using an ispc foreach_tiled loop, which is similar to a regular foreach except that it iterates over the space in square "chunks" instead of linearly. This can be used to improve locality in many circumstances, but in this case it is used to improve coherence for the gangs of rays, since rays close in space are likely to follow similar execution paths. For each pixel, a ray is intersected with three spheres and a plane. Ambient occlusion is computed if a hit is registered.

AOBench makes good use of ispc's type system to obtain both readable and performant code. The ispc implementation looks very similar to the C version found at [1]. Rays, intersections, spheres, and planes are all represented as structs. Multiple rays are easily intersected with a sphere or plane in parallel by appropriate usage of uniform and varying qualifiers. Efficient code is generated because ispc converts varying structs to an SOA data layout that can use vector loads and stores to access each varying field.
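The kind of uniform/varying mixing described here can be sketched as follows (a simplified, hypothetical fragment, not the actual AOBench source): a gang of rays with varying fields is tested against a single shared sphere with uniform fields, so one call intersects a whole gang of rays in parallel.

    // Hypothetical sketch, not the benchmark's real code.
    struct Ray    { float ox, oy, oz, dx, dy, dz; };  // fields become varying in a varying Ray
    struct Sphere { uniform float cx, cy, cz, r; };   // one sphere shared by the gang

    static inline bool hit_sphere(Ray ray, uniform Sphere s) {
        // All locals default to varying, so each lane tests its own ray.
        float px = ray.ox - s.cx, py = ray.oy - s.cy, pz = ray.oz - s.cz;
        float b = px * ray.dx + py * ray.dy + pz * ray.dz;
        float c = px * px + py * py + pz * pz - s.r * s.r;
        return b * b - c > 0.0f;  // one result per program instance
    }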

3.2 Perlin Noise

Perlin Noise computes a type of gradient noise used in visual effects to increase the appearance of realism [9]. The ispc implementation computes the value of the noise function for each pixel of the two-dimensional output image. Gangs of pixels in a row of the image are computed in parallel and are represented as varying values. The innermost loop of the computation contains a gradient computation that involves multiple indirect memory accesses of the form NoisePerm[NoisePerm[NoisePerm[x]+y]+z]. Gather operations are required to load these values for each program instance, as they cannot be accessed with sequential vector loads. If possible, these gathers are implemented with efficient hardware operations; if not, each gather is replaced with a series of scalar loads.
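The effect of this indirect indexing on code generation can be sketched as follows (a hypothetical fragment in the spirit of the benchmark, not its actual source; the names are invented): the permutation table is uniform, but the lattice coordinates are varying, so every level of indirection below becomes a gather on targets that have one, or scalar loads otherwise.

    // Hypothetical sketch: perm[] is a shared permutation table, while x, y, z
    // differ per program instance, so each table lookup is a gather.
    static inline int hash3(uniform int perm[],
                            varying int x, varying int y, varying int z) {
        return perm[perm[perm[x] + y] + z];
    }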

3.3 Options Pricing

Binomial Options and Black-Scholes are two algorithms for pricing options. Binomial Options is computationally slower than Black-Scholes, but it is more accurate. Both algorithms have similar implementations in ispc that compute prices for several options in parallel. The inputs for both algorithms are arrays of prices, strike prices, times, interest rates, and volatilities, one entry per option. Both algorithms are embarrassingly parallel and iterate over the options with a foreach loop. Binomial Options does not feature any varying control flow, while Black-Scholes contains only two if statements per option. Binomial Options contains several loops per option, but these loops all feature uniform control flow. Both benchmarks also feature good arithmetic intensity and access relatively few bytes per operation. Consequently, one would expect both benchmarks to vectorize very well and scale nicely with increased parallel resources.
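The parallelization pattern both pricers share can be sketched as follows (invented names, not the benchmarks' real code): each option's inputs live in separate arrays, and a single foreach loop prices a gang of options at a time. The loop body here is only a placeholder for the real binomial-tree or Black-Scholes math.

    // Hypothetical sketch of the options-pricing pattern, not the actual
    // benchmark source; the body computes a call option's intrinsic value
    // as a stand-in for the real per-option computation.
    export void price_options(uniform float S[], uniform float X[],
                              uniform float T[], uniform float r[],
                              uniform float v[], uniform float result[],
                              uniform int count) {
        foreach (i = 0 ... count) {
            result[i] = max(S[i] - X[i], 0.0f);  // placeholder per-option math
        }
    }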

3.4 Stencil

Stencil computes a 3rd-order, 3-dimensional finite difference with an isotropic update over time. The main loop is a uniform loop over time steps that swaps the input and output arrays on even/odd time steps. Each time step executes a 3-dimensional foreach loop over the inner points of the 3-dimensional grid, with multiple points being stored in varying values. Each point loads its own value and the three neighboring values in each direction along each axis in order to compute the value for the next time step. This benchmark has relatively poor arithmetic intensity, as computing each point requires loading many values compared to the number of arithmetic operations performed.
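The iteration structure described above can be sketched roughly as follows (a simplified 1st-order version with invented names and coefficients, not the benchmark's 3rd-order code), using ispc's multi-dimensional foreach over the interior of the grid:

    // Simplified, hypothetical sketch of the stencil pattern; the real
    // benchmark uses 3rd-order neighbors and additional coefficient terms.
    // Aout and Ain are flat nx*ny*nz arrays.
    export void stencil_step(uniform float Aout[], uniform float Ain[],
                             uniform int nx, uniform int ny, uniform int nz) {
        foreach (z = 1 ... nz - 1, y = 1 ... ny - 1, x = 1 ... nx - 1) {
            int idx = (z * ny + y) * nx + x;  // varying linear index, stride-1 in x
            Aout[idx] = Ain[idx]
                      + 0.1f * (Ain[idx - 1]       + Ain[idx + 1]        // x neighbors
                              + Ain[idx - nx]      + Ain[idx + nx]       // y neighbors
                              + Ain[idx - nx * ny] + Ain[idx + nx * ny]  // z neighbors
                              - 6.0f * Ain[idx]);
        }
    }

Because the innermost dimension varies across the gang, all of the neighbor accesses are sequential vector loads from uniform base addresses, which matches the behavior analyzed in Section 4.5.4.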

4. Evaluation

4.1 Experimental Setup

Experiments were performed using two systems. The first is an Intel Core i5-4670T "Haswell" CPU with four cores running at 2.3 GHz and six megabytes of L3 cache. This machine supports all Intel SIMD ISA extensions up to and including AVX2. The second system is a two-socket Intel Xeon E5-2687W "Sandy Bridge-EP" with 16 cores running at 3.1 GHz and 20 MB of L3 cache per socket. The Xeon system does not support the AVX2 vector extensions.

Benchmarks were compiled using ispc version 1.5.1 and LLVM/clang version 3.3, modified with ispc-supplied patches for performance and correctness. Benchmarks were compiled at optimization level -O1 with ispc and -O3 with clang. Memory was not assumed to be aligned. Serial C++ versions of the benchmarks were compiled with clang and shared an interface with their ispc counterparts so that either version could be called from C++ with the same input. The same ispc source code was compiled for the SSE4, AVX, and AVX2 Intel vector extensions without any source modifications. When generating code for the SSE4 target, ispc uses a programCount of four, meaning that the generated code assumes four program instances fit inside a SIMD register. A programCount of eight is used for the AVX and AVX2 targets. These values of programCount correspond to the number of 32-bit values that fit inside the widest SIMD registers defined by each vector ISA.

The Intel Software Development Emulator [3] was used to collect statistics about the execution of the benchmarks. In particular, SDE's mixtool was used to determine the hot functions and basic blocks in the benchmarks as well as to collect histograms of the different types of dynamic instructions encountered during execution.

4.2 Code Characteristics

Tables 1, 2, and 3 show characteristics of the code generated for the various targets on the benchmarks. Table 1 shows how many millions of dynamic instructions occur in the hottest function in each benchmark. Typically these functions make up over ninety percent of the total execution time. When doubling the vector width, such as happens when moving from SSE4 to AVX, one would expect the number of dynamic instructions to ideally shrink by at least a factor of two, since each vector instruction now operates on twice as much data. Loop overhead is also expected to shrink, as loops over larger vectors should require fewer iterations. However, the reduction in dynamic instructions is not always ideal. For example, AVX has no instructions that operate on 256-bit vectors of integers. Consequently, integer-heavy codes compiled for AVX can incur additional data movement instructions if integer data types are operated upon together with floating-point data types. The AVX2 vector extensions add these missing 256-bit integer operations, but also add instructions for gather operations and fused multiply-add (FMA) operations. Both of these additions further reduce the number of dynamic instructions.

Benchmark           SSE4       AVX        AVX2
AOBench             39270.27   20685.17   14701.00
Perlin Noise         2063.20    1493.43     771.49
Binomial Options     3563.62    1596.85    1553.84
Black-Scholes          28.80      13.52      11.65
Stencil              8275.57    3593.64    3526.11

Table 1. Dynamic Instructions in Hot Function (Millions)

Table 2 shows the average number of bytes read and written per instruction in the hottest function. This statistic can be used to model the arithmetic intensity of a function, since higher numbers indicate a less favorable ratio between computation and memory operations.

Benchmark           SSE4    AVX     AVX2
AOBench             5.27    7.28    14.02
Perlin Noise        6.80    9.76    16.09
Binomial Options    7.58    14.20   18.15
Black-Scholes       7.08    10.64   20.49
Stencil             16.94   29.13   37.42

Table 2. Bytes R+W per Instruction

Table 3 shows the fraction of total dynamic instructions that operate on vector data. This data is useful for examining the scalability of vectorization, as a basic application of Amdahl's law says that applications with a small percentage of vector instructions are not likely to see parallel speedup from vectorization.

Benchmark           SSE4    AVX     AVX2
AOBench             .8527   .8848   .8317
Perlin Noise        .9074   .8365   .9840
Binomial Options    .5327   .4776   .4632
Black-Scholes       .9829   .9818   .9789
Stencil             .4169   .4119   .3614

Table 3. Vector Share of Total Dynamic Instructions

4.3 Performance

Figures 7 and 8 show the performance of the various benchmarks when compiled for the target vector ISAs on both the Haswell and Sandy Bridge machines. Specifically, the figures show the runtime speedup of the ispc vector version over that of the scalar C++ code. Figure 7 shows the performance of the SSE4, AVX, and AVX2 versions on the Haswell machine, while Figure 8 shows the performance of the SSE4 and AVX versions on the Sandy Bridge machine.

Since all of the benchmarks exhibit good data parallelism, the ispc-generated codes obtain good speedup over their scalar C++ counterparts. On the Haswell machine, the geometric mean speedups are 4.68x for SSE4, 6.21x for AVX, and 8.24x for AVX2. On the Sandy Bridge machine, SSE4 saw a mean speedup of 4.52x, and AVX saw 7.01x. Superlinear speedups are possible since superscalar execution allows vector operations to execute in parallel with scalar integer operations, loads and stores, and even other vector operations.

Figure 7. Vectorization Speedup (Haswell). [Bar chart: speedup over serial for each benchmark and the geometric mean.]

Figure 8. Vectorization Speedup (Sandy Bridge). [Bar chart: speedup over serial for each benchmark and the geometric mean.]

4.4 Vector Width Scaling

Figures 9 and 10 show the speedup of the versions compiled for AVX or AVX2 over the version compiled for SSE4. AVX and AVX2 double the width of the vector registers to 256 bits from the 128 bits of SSE4. Consequently, perfectly scaling programs should be roughly twice as fast with half as many instructions. However, the AVX vector extensions only provide operations on floating-point data types. Operations on integer data types must use the SSE4 instructions and the 128-bit vector registers. This often requires additional instructions to insert or extract 128-bit chunks from the 256-bit ymm vector registers and can negatively impact scaling. Indeed, this impact on scaling is visible in the mean speedup of AVX over SSE4. On Haswell, moving from SSE4 to AVX only produces a mean speedup of 1.32x, less than the desired 2x. The 256-bit integer, gather, and FMA instructions found in AVX2 improve this mean speedup to 1.76x. On the Sandy Bridge machine, AVX achieves a 1.55x mean speedup over SSE4.

Figure 9. Vectorization Scaling (Haswell). [Bar chart: speedup over SSE4 for each benchmark and the geometric mean.]

Figure 10. Vectorization Scaling (Sandy Bridge). [Bar chart: speedup over SSE4 for each benchmark and the geometric mean.]

4.5 Benchmark Breakdown

4.5.1 AOBench

AOBench exhibits ideal scaling for the SSE4 version on both machines, with a speedup of at least 4 when using 4-wide SIMD units. However, one can see in Figures 9 and 10 that the AVX version does not exhibit this same scaling, showing only a 1.33x gain over SSE4 on Haswell and 1.57x on Sandy Bridge. Table 3 shows that this benchmark does have a reasonably high percentage of vector instructions. However, the AVX version has roughly a 55% / 45% mix of floating-point vector operations to integer operations. The integer operations can only operate on 4-wide vectors and require extra instructions to insert and extract 128-bit chunks from 256-bit vectors, as previously mentioned, hurting both the performance and the scaling of the AVX version. The AVX2 version of AOBench does not have this limitation, as AVX2 adds support for the missing integer operations. Table 1 shows that the AVX2 version reduces the number of dynamic instructions by roughly 29% relative to AVX. Consequently, in Figure 7 one can see that the AVX2 version achieves near-ideal scaling with a speedup of 8. Likewise, Figure 9 shows that this version achieves almost the desired 2x gain over SSE4.

4.5.2 Perlin Noise

Perlin Noise is the only benchmark of the group that requires gather operations, and it is also the most integer-heavy benchmark. The SSE4 and AVX2 versions have approximately a 35% / 65% mix of floating-point vector operations to integer operations, whereas the AVX version has a mix of 19% / 81%. As Figure 9 shows, the AVX version of Perlin Noise shows no speedup over the SSE4 version on the Haswell machine. One can see in Table 1 that the AVX version only achieves a 28% reduction in dynamic instructions over the SSE4 version, far from the ideal reduction by a factor of two. The AVX2 version, with its 256-bit integer operations and gather instructions, sees a gain of 1.37x over SSE4 on Haswell.

4.5.3 Options Pricing

Both the Binomial Options and Black-Scholes options pricing benchmarks are embarrassingly data parallel, which is reflected in the ideal or even superlinear scaling seen in their performance. Looking at the vector share of instructions for the Binomial Options benchmark in Table 3, the relatively low percentage of vector instructions in the hot loop would seem to imply poor vector speedup. However, this is not the case for this benchmark, since it has many branches with low misprediction rates that come from several loops with uniform control flow. These can be speculatively executed in parallel with the vector computations. As a result, the observed speedups are much better than Amdahl's law would predict, since most of these instructions do not end up on the critical path.

While both Binomial Options and Black-Scholes see small gains when going from AVX to AVX2 due to the 256-bit integer operations, the majority of the increase in performance is actually due to the use of FMA instructions, as these benchmarks have many floating-point operations per integer operation. On Haswell, an FMA instruction has the same latency as a floating-point multiply, so the accompanying addition is essentially free. Haswell can also execute two independent FMA operations concurrently. These benefits translate into an extra 50% speedup on top of that already provided by AVX over the SSE4 versions of these benchmarks.
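To make the Amdahl's-law intuition concrete, here is a rough back-of-the-envelope bound (our illustration, not a calculation from the paper), treating the vector share of dynamic instructions from Table 3 as the parallelizable fraction $f$ and the SIMD width as $W$:

\[
S \le \frac{1}{(1 - f) + f/W}, \qquad f \approx 0.53,\; W = 4 \;\Rightarrow\; S \le \frac{1}{0.47 + 0.1325} \approx 1.66.
\]

The fact that Binomial Options clearly exceeds such a bound is exactly the point made above: its scalar loop and branch instructions overlap with the vector work rather than sitting on the critical path.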

4.5.4 Stencil

Stencil has the lowest arithmetic intensity of all the benchmarks, as seen in Table 2. It requires about two to three times the number of bytes read or written per operation as the other benchmarks. This is not surprising, given that computing each stencil point requires reading its own value across two time steps as well as the three neighboring values on each side in all three dimensions, along with a few other values. This is reflected in Stencil's vector share of total dynamic instructions, with Table 3 showing that each version of Stencil hovers at around 40% of instructions operating on vectors. Unlike Binomial Options, most of Stencil's additional operations are address calculations and loads. The vector operations of the stencil depend on these calculations and loads, placing them on the critical path and limiting speedup to around 2x for the SSE4 version and 3-4x for the AVX and AVX2 versions on both Haswell and Sandy Bridge. Stencil sees a small boost from AVX2, but this is due to the use of FMA instructions for the actual stencil computation in the memory-operation-heavy inner loop. It does not really exploit the 256-bit integer operations, as the vector operations of the stencil all operate on floating-point data. Additionally, all data loads are sequential vector loads with uniform base addresses, so all the address computations occur on the scalar data path and do not require integer vector operations.

5. Conclusion

In this paper, we have presented a performance evaluation of several benchmarks from the domains of computer graphics, financial modeling, and high-performance computing that were written using the Intel SPMD Program Compiler. The same ispc source was compiled for three different vector instruction sets with different vector widths and capabilities. The resulting programs exhibit good vector speedup over scalar C++ code and scale with increasing vector width. In cases where this scaling is not perfect, we have analyzed the features of the benchmarks in question to identify the factors inhibiting scaling, such as the lack of 256-bit integer operations in AVX for AOBench or the low arithmetic intensity of the Stencil. ispc's "SPMD-on-SIMD" approach, coupled with its small set of language features, enables programmers to easily write a single source code that compiles efficiently to different targets. The simplicity of ispc's features gives programmers a "WYSIWYG" alternative to handwritten intrinsics or autovectorization that produces high-performance code, taking advantage of the parallel resources available in modern machines.

References

[1] AOBench. http://code.google.com/p/aobench.
[2] The Intel SPMD Program Compiler. http://ispc.github.io.
[3] Intel Software Development Emulator. http://software.intel.com/en-us/articles/intel-software-development-emulator.
[4] F. Darema. The SPMD Model: Past, Present and Future. In Proceedings of the 8th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, London, UK, 2001. Springer-Verlag. ISBN 3-540-42609-4. URL http://dl.acm.org/citation.cfm?id=648138.746808.
[5] Khronos OpenCL Working Group. The OpenCL Specification, Version 2.0, 2013.
[6] P. Hanrahan and J. Lawson. A Language for Shading and Lighting Calculations. In Computer Graphics (SIGGRAPH '90 Proceedings), pages 289-298, 1990.
[7] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California, March 2004.
[8] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable Parallel Programming with CUDA. Queue, 6(2):40-53, March 2008. ISSN 1542-7730. URL http://doi.acm.org/10.1145/1365490.1365500.
[9] K. Perlin. Improving Noise. In Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '02, pages 681-682, New York, NY, USA, 2002. ACM. ISBN 1-58113-521-1. URL http://doi.acm.org/10.1145/566570.566636.
[10] M. Pharr and W. R. Mark. ispc: A SPMD Compiler for High-Performance CPU Programming. In Proceedings of Innovative Parallel Computing (InPar), San Jose, California, May 2012.
[11] N. Rotem. Intel OpenCL SDK Vectorizer. LLVM Developer Conference presentation, November 2011.
