Video SIMDBench: Benchmarking the Compiler Vectorization for Multimedia Applications

Michail Alvanos, The Cyprus Institute, Nicosia, Cyprus. Email: [email protected]
Pedro Trancoso, Department of Computer Science, University of Cyprus, Nicosia, Cyprus. Email: [email protected]

ABSTRACT

Single Instruction Multiple Data (SIMD) extensions have become popular in computer architectures as a simple and efficient way to exploit the data parallelism hidden in applications. The compiler research community has proposed automatic vectorization as the answer to the complexity of low-level programming of vector units. Despite recent advances in compilation techniques, modern compilers miss opportunities to automatically vectorize code. One of the biggest challenges is to evaluate the generated code against the best hand-written code. This paper presents a benchmark suite based on video encoding and decoding kernels. The suite contains hand-written versions of the kernels, provided by the open source community, that support the latest SIMD extensions. The paper also compares the performance of three available compilers (GCC, LLVM, and ICC) against the hand-written kernels. A performance evaluation, using an i7-4790 processor, shows that the auto-vectorized version produced by the best compiler achieves on average only 28% of the performance of the hand-tuned kernels.

I. INTRODUCTION

Single Instruction Multiple Data (SIMD) extensions were the answer of computer architects to exploit the data parallelism in applications and to address the frequency and power walls [1]. In recent years, the length of vectors in SIMD extensions has increased to 256 bits, and there are plans for even longer lengths [2]. SIMD parallelism is attractive for low-power architectures because it is potentially more energy efficient than multiple instruction multiple data (MIMD) parallelism [3], which needs to fetch and execute one instruction per data operation. The performance gain from vectorization can tremendously decrease the execution time of applications. Moreover, the dominant workloads on mobile devices are multimedia applications, games, and web rendering; on the same devices, scientific applications with floating point operations are less common.
The common practice is to use SIMD extensions to accelerate carefully hand-written libraries rather than to rely on the compiler to transform the code. On the other hand, compiler developers and researchers have put considerable effort into making the compiler generate such code. Compilers are projected to double the performance of the code they generate only every 18 years [4]. This is disappointing compared with the performance gains from hardware improvements that Moore's Law [5] predicts. Despite recent advances in compilation techniques, compilers often fail to automatically vectorize scalar code due to its complexity. In addition, there is no benchmark suite based on real applications that also contains hand-tuned versions. This research project addresses these challenges by providing a set of kernels used in video applications along with their hand-tuned variants as provided by the open source community. The main contributions of this paper are:

• It presents a benchmark consisting of computation kernels, extracted from the x264 [6] and ffmpeg [7] applications, for evaluating the performance of the automatic vectorization of compilers. The benchmark reports the performance achieved by the compiler compared with hand-tuned versions of the kernels written in assembly.

• It evaluates state-of-the-art compilers with respect to the runtime performance of the produced code. The evaluation shows that, despite the advances of recent years, the performance achieved is on average only 28% of the hand-tuned version.

The rest of this paper is organized as follows. Section II presents the necessary background on video applications. Section III categorizes the kernels and presents a short description of each. Section IV describes the experimental methodology and Section V presents the evaluation. Section VI discusses previous work and Section VII presents the conclusions and future work.

II. BACKGROUND

A. Multimedia applications

Modern video encoding and decoding [8], [9] is a computationally intensive task that creates a high traffic volume in the local memory system.
For each raw frame, the encoder identifies differences from one or more previously processed reference frames, as shown in Figure 1. The encoder identifies similarities between blocks (macroblocks) of pixels using motion estimation. After the encoder finds the best matching block, it subtracts the selected block from the reference (motion compensation) and transmits it together with a motion vector that describes the reference position. The encoder also employs intra-frame analysis when cost effective. It then uses the discrete cosine transformation to generate a set of coefficients that are subsequently quantized. The video encoder reconstructs the encoded image to use it as a reference frame. A deblocking filter is applied to reduce blocking distortion, and an entropy encoder compresses the output data.

Figure 1. Block diagram of an H.264 video encoder. (Figure omitted; recovered labels include the input frame, the Discrete Cosine Transformation, and entropy encoding stages.)

int sad16x16( uint8_t *pix1, int i_stride_pix1,
              uint8_t *pix2, int i_stride_pix2 )
{
    int i_sum = 0;
    for( int y = 0; y < 16; y++ )
    {
        for( int x = 0; x < 16; x++ )
            i_sum += abs( pix1[x] - pix2[x] );
        pix1 += i_stride_pix1;
        pix2 += i_stride_pix2;
    }
    return i_sum;
}

Listing 1. SAD kernel for a 16x16 block of pixels.

x264 [6] is an open source high-performance video encoder that supports the H.264/AVC compression standard. The application is the state of the art in open source video encoding, providing assembly code for the x86, PowerPC (AltiVec), and ARMv7 (NEON) platforms to speed up the computationally intensive parts. The hand-written code supports the latest vector extensions. The FFmpeg [7] decoding library is used in popular video players and applications, such as VLC and Google Chrome, due to the diversity of multimedia files it supports. Moreover, the x264 encoder uses hand-tuned vectorized kernels from the FFmpeg library for decoding.

B. Compiler optimizations

Automatic vectorization is one of the many optimizations that modern compilers [10], [11], [12] offer to the developer. The compiler identifies parts of the application that apply the same operation to a set of data. It then replaces these scalar operations with vector operations that apply the same computation to a larger set of data at once. Unfortunately, automatic vectorization is not always trivial, because the compiler must preserve the expected program behavior. Thus, the compiler must take into account the dependencies between the data, the data precision, and the cost of the transformation.
For example, when mixing integer types, the compiler must ensure that the result fits in the destination registers, and no information is lost. Additional control code for the alignment and the loop bounds can increase the execution cost of the vectorized kernel, making it unprofitable.
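As an illustration of the mixed-type issue, consider the following hypothetical example (not one of the benchmark kernels): a saturating add over 8-bit pixels. The intermediate sum needs more than 8 bits, so the compiler must widen each element or recognize a saturating-add idiom before it can vectorize safely.

```c
#include <stdint.h>

/* Hypothetical example (not from the benchmark): adding two rows of
 * 8-bit pixels with saturation. The intermediate sum needs 9 bits, so a
 * vectorizer must widen each element to 16 bits or recognize the clamp
 * as a saturating-add idiom (e.g. PADDUSB on x86); a naive 8-bit
 * vector add would silently wrap and change the program behavior. */
void add_sat_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
{
    for (int i = 0; i < n; i++) {
        int s = a[i] + b[i];                    /* C promotes operands to int */
        dst[i] = (uint8_t)(s > 255 ? 255 : s);  /* clamp back to 8 bits */
    }
}
```

The clamp pattern is exactly the kind of idiom a compiler must prove safe before replacing the widening arithmetic with a narrower saturating vector instruction.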


III. BENCHMARK

This section presents the majority of the kernels used in the benchmark and, due to lack of space, omits the kernels that account for the least execution time of the applications. The kernels cover from 50% up to 90% of the execution time of the applications, depending on the input file and the encoding/decoding options. The paper categorizes the kernels based on the operations of the encoder or decoder: Motion Estimation and Intra-frame Analysis, Motion Compensation, Encoding, and Decoding. On each execution of a kernel, the benchmark fills the input arrays with random data. The benchmark validates the correctness of the code by comparing the output values of the scalar and vectorized versions.

A. Motion Estimation & Intra-frame Analysis

SAD: The Sum of Absolute Differences (SAD) kernels measure the similarity between two image blocks by calculating the absolute difference between each pair of pixels. The source code sums the differences and uses the total as a metric of the similarity between two blocks or macroblocks. Video encoders use this metric to find the best matching reference for the block being encoded. Moreover, there are two additional variations (SAD X3 and SAD X4) that reduce the overhead of calling the kernel multiple times and increase the opportunities for compiler optimization. The kernel takes blocks of pixels as input, where each pixel is 1 byte, and returns an integer, as shown in Listing 1.

SATD: The Sum of Absolute Transformed Differences (SATD) kernels measure block similarity like the SAD kernels. In contrast with the SAD kernels, they use a frequency transformation (the Hadamard transformation [13] in x264) to calculate the matching results. Due to the transformation, these kernels are more computationally intensive but more accurate than the SAD kernels. There are seven variations of the kernel depending on the input size, from 4x4 up to the macroblock size of 16x16.
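The SAD X4 variation mentioned above can be pictured as follows. This is a hypothetical plain-C sketch of the idea (the function name and interface are illustrative, not the exact x264 prototype): one call scores a source block against four candidate references, amortizing call overhead and exposing more work to the vectorizer.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of the SAD_x4 idea: compare one 16x16 source
 * block against four candidate reference blocks in a single call,
 * writing the four SAD scores into scores[0..3]. This amortizes call
 * overhead and exposes more parallelism to the compiler than four
 * separate sad16x16 calls. */
void sad16x16_x4(const uint8_t *src, int src_stride,
                 const uint8_t *ref[4], int ref_stride,
                 int scores[4])
{
    for (int k = 0; k < 4; k++) {
        const uint8_t *s = src, *r = ref[k];
        int sum = 0;
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++)
                sum += abs(s[x] - r[x]);    /* per-pixel absolute difference */
            s += src_stride;
            r += ref_stride;
        }
        scores[k] = sum;
    }
}
```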
Similar to the SAD kernel, the input is two blocks of pixels, and the output is one integer: the aggregated difference between the transformed pixels.

INTRA: In addition to motion estimation, the encoder also employs intra-frame analysis, encoding a block using information from the same frame rather than from a previously encoded frame when that is cost effective. There are 36 different variations of this kernel, including versions for discovering patterns in luma and chroma for different block sizes: 4x4, 8x8, 8x16 (chroma only), and 16x16 (luma only).

B. Motion Compensation

MC: Motion Compensation (MC) is a technique to calculate the vector and reference information for a given block of pixels; the motion estimation process chooses this block in the previous step. There are two types of the kernel: one for encoding the brightness (luma) part of a frame and one for the chroma part. Furthermore, the motion compensation kernel takes input blocks from 4x4 up to 16x16.

AVG: The motion compensation part of the encoder uses the Average (AVG) kernels to calculate the information to be encoded in the following stages. The kernels calculate the average of two blocks of pixels. This kernel is also used in the deblocking part of both the encoder and the decoder. There are variations of the kernel depending on the block size selected by motion estimation. The kernel takes two blocks of pixels as input, from 2x2 up to 16x16, and outputs a block of pixels of the same size.

GET REF: Get reference frame is the process of combining motion compensation with averaging to calculate the cost of selecting a block of pixels, depending on the application settings. For example, when sub-pixel interpolation and weighted motion compensation are disabled, the kernel simply returns the input block, without any calculation on the input block of pixels.

C. Encoding

DCT: The encoder applies the Discrete Cosine Transformation (DCT, also known as the integer transformation) kernel before quantization. This kernel is popular in signal processing and especially in lossy compression, where high-frequency parts of the image can be discarded. The main function of the kernel is to compress a block of pixels by removing part of the pixel information. Listing 2 presents the 4x4-dc variation of the kernel. The input to all kernels is an array of short integers containing the coefficients, and the result is stored in place in the same array.

void dct4x4dc( int16_t d[16] )
{
    int16_t tmp[16];
    for( int i = 0; i < 4; i++ )
    {
        int s01 = d[i*4+0] + d[i*4+1];
        int d01 = d[i*4+0] - d[i*4+1];
        int s23 = d[i*4+2] + d[i*4+3];
        int d23 = d[i*4+2] - d[i*4+3];
        tmp[0*4+i] = s01 + s23;
        tmp[1*4+i] = s01 - s23;
        tmp[2*4+i] = d01 - d23;
        tmp[3*4+i] = d01 + d23;
    }
    for( int i = 0; i < 4; i++ )
    {
        int s01 = tmp[i*4+0] + tmp[i*4+1];
        int d01 = tmp[i*4+0] - tmp[i*4+1];
        int s23 = tmp[i*4+2] + tmp[i*4+3];
        int d23 = tmp[i*4+2] - tmp[i*4+3];
        d[i*4+0] = ( s01 + s23 + 1 ) >> 1;
        d[i*4+1] = ( s01 - s23 + 1 ) >> 1;
        d[i*4+2] = ( d01 - d23 + 1 ) >> 1;
        d[i*4+3] = ( d01 + d23 + 1 ) >> 1;
    }
}

Listing 2. Discrete Cosine Transformation (DCT) kernel.

QUANT: After the integer transformation (DCT), the encoder quantizes the information remaining in the pixels to discrete values. Modern video encoders use pre-calculated tables of values to allow a faster implementation of the quantization. Thus, the majority of the accesses are to lookup tables that contain the pre-calculated values. The kernel is applied to blocks of coefficients from 2x2 up to 8x8.

ZIGZAG: After quantization and the discrete cosine transformation, the resulting values (also known as coefficients) contain many zeros. Thus, the encoder tries to represent these values efficiently by grouping them together. The search for the most effective way to group the coefficients is performed by the zig-zag kernels, named after the scan order inside the block.

D. Decoding

DEQUANT: De-quantization is the inverse of the quantization applied in the previous step. Similar to the quantization kernels, the variations are based on the size of the coefficient block, from 2x2 up to 8x8. Listing 3 presents the 8x8 de-quantization kernel. The input and output are the coefficients, stored as short integers. Moreover, a lookup table of pre-calculated de-quantization values is stored in an array of integers, dequant_mf in the aforementioned kernel.

void dequant_8x8( int16_t dct[64], int dequant_mf[6][64], int i_qp )
{
    const int i_mf = i_qp%6;
    const int i_qbits = i_qp/6 - 6;
    if( i_qbits >= 0 )
    {
        for( int i = 0; i < 64; i++ )
            dct[i] = (dct[i] * dequant_mf[i_mf][i]) << i_qbits;
    }
    else
    {
        const int f = 1 << (-i_qbits-1);
        for( int i = 0; i < 64; i++ )
            dct[i] = (dct[i] * dequant_mf[i_mf][i] + f) >> (-i_qbits);
    }
}

Listing 3. De-quantization kernel used in ffmpeg and x264.

IDCT: The Inverse Discrete Cosine Transformation (iDCT) applies the reverse process of the DCT transformation. The x264 video encoder shares the same code for displaying video as ffmpeg.
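The quantization step described above can be sketched in plain C as follows. This is an illustrative simplification, not the exact x264 code: the multiplier table mf, the rounding bias f, and the shift amount stand in for the pre-calculated per-QP tables.

```c
#include <stdint.h>

/* Illustrative sketch of the quantization step (simplified, not the
 * exact x264 interface): each coefficient is scaled by a precomputed
 * multiplier mf[i], biased by f for rounding, and shifted down. The
 * sign is processed separately so the shift rounds toward zero. */
void quant_4x4(int16_t dct[16], const uint16_t mf[16], uint32_t f, int shift)
{
    for (int i = 0; i < 16; i++) {
        if (dct[i] > 0)
            dct[i] =  (int16_t)(((uint32_t)dct[i]  * mf[i] + f) >> shift);
        else
            dct[i] = -(int16_t)(((uint32_t)-dct[i] * mf[i] + f) >> shift);
    }
}
```

The table lookups and the sign-dependent branch are typical of why these loops are harder to vectorize than a plain elementwise multiply.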
DEBLOCK: The deblocking filter kernel smooths the sharp edges of a given block of pixels.

(Figure omitted: grouped bars per kernel; series are GCC Baseline, GCC Vector, LLVM Vector, ICC Vector, MMX, SSE2, SSE3, SSE4, AVX, and AVX2; y-axis in Megapixels/s, logarithmic scale. Panels: (a) Sum of Absolute Differences, (b) Sum of Absolute Transformed Differences, (c) Intra-frame prediction.)
Figure 2. Performance in processed Megapixels/s in logarithmic scale of the (a) Sum of Absolute Differences (SAD) kernels, (b) Sum of Absolute Transformed Differences (SATD) kernels, and (c) Intra analysis kernels compared with the compiler auto-vectorization techniques and hand-tuned version.

The encoder produces artifacts in the final picture due to the block coding techniques used during encoding. Both video encoders and decoders use the kernel to improve the video quality.

IV. EXPERIMENTAL METHODOLOGY

The evaluation uses an Intel Core i7-4790 processor for measuring the performance. The processor runs at 3.6 GHz and supports all the recent SIMD extensions, including MMX, SSE, SSE2, SSE3, SSE4.1, SSE4.2, AES, AVX, AVX2, BMI, F16C, and FMA3. It has 32 KB private L1 instruction and data caches, a 256 KB unified L2 cache per core, and a 12 MB shared L3 cache. During the evaluation, we also disable the "Turbo Boost" technology of the processor. The test machine runs Ubuntu 14.04.3 with kernel 3.19.0-47-generic. The evaluation uses the latest stable releases of three different compilers:

• GCC 5.3.0: The GNU Compiler Collection (GCC) is a well-known open source compiler distributed with many operating systems and platforms. The baseline compilation uses the -O3 -fno-tree-vectorize flags and the auto-vectorized version uses -O3 -mtune=core-avx2 -march=core-avx2.

• ICC 2016.1: The Intel C compiler is a commercially available compiler focused on the auto-tuning and auto-vectorization of C code running on x86 architectures. The evaluation uses -O3, which enables the auto-vectorization optimizations.

• CLANG-3.8: Clang is the C front end of the LLVM compilation framework. The compiler is relatively new compared with GCC and ICC, and the scalar code it produces is often not as well optimized as that of the compilers mentioned above. Many researchers select this compiler framework to implement their optimizations due to its modularity.

The benchmark runs each kernel 100 times, and each set of kernels is executed ten times. For accuracy, the benchmark uses the Time Stamp Counter (TSC) register to measure the execution time of the kernels in CPU cycles, and the process is pinned to one core. In all cases, the variation

of execution time is 2% or less, and the geometric mean value is reported.

V. EXPERIMENTAL RESULTS

The experimental evaluation assesses the effectiveness of the transformations by comparing the performance of the scalar version against the auto-vectorized versions produced by three compilers, and against the hand-tuned versions. Due to space limitations, this section includes only a subset of the kernels provided by the benchmark. Furthermore, the section includes an analysis of the reasons the compilers miss vectorization opportunities.

A. Motion Estimation & Intra-frame Analysis Kernels

Motion estimation kernels compare two blocks of pixels and return a value. Figures 2(a) and (b) present the results for the SAD kernels for the different compilers compared with the hand-optimized version. GCC misses the opportunity to optimize the SAD kernel because it does not recognize the pattern: its developers have not yet implemented idiom recognition for this particular set of operations. For the SATD kernels, the GCC compiler vectorizes the code only partially, yielding a minor performance gain. The LLVM (Clang) compiler optimizes the SAD kernel when operating on blocks of 16x16 pixels, but misses auto-vectorization opportunities in the SATD kernels. Moreover, the LLVM compiler has the worst absolute performance in all small kernels compared with the other compilers, indicating the lower quality of the scalar code it produces. The compiler is less mature than ICC and GCC, and it is expected to fall short in particular cases. Finally, the ICC compiler has the best performance of all the compilers, achieving from 2.53x up to 7.28x speedup in the SAD kernels. In the case of the x4 16x16 kernel, the compiler can exploit multiple levels of data parallelism by using the latest vectorization instructions (PSADBW). The slightly better performance of the ICC compiler compared with the hand-tuned versions shows that there is still room for improvement in the hand-tuned kernels.
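For reference, the PSADBW instruction mentioned above is also reachable from C through SSE2 intrinsics. The sketch below is illustrative (the real x264 kernels are written in assembly), but it shows how a single instruction replaces sixteen scalar subtract/abs/add steps per row.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hand-tuned sketch of sad16x16 using the PSADBW instruction via the
 * _mm_sad_epu8 intrinsic (SSE2): one instruction computes two partial
 * absolute-difference sums per 16-byte row. Illustrative only; the
 * real x264 kernels are written in assembly. */
int sad16x16_sse2(const uint8_t *pix1, int stride1,
                  const uint8_t *pix2, int stride2)
{
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        __m128i a = _mm_loadu_si128((const __m128i *)pix1);
        __m128i b = _mm_loadu_si128((const __m128i *)pix2);
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));  /* PSADBW */
        pix1 += stride1;
        pix2 += stride2;
    }
    /* PSADBW leaves one partial sum per 64-bit lane; add the two lanes. */
    return _mm_cvtsi128_si32(acc) +
           _mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
}
```

This is exactly the kind of idiom a compiler must pattern-match to reach hand-tuned performance from the scalar loop in Listing 1.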
On the other hand, the compiler misses optimization opportunities in the 8x16 and 4x16 variations of the SAD kernel (not shown), due

(Figure omitted: grouped bars per kernel; series are GCC Baseline, GCC Vector, LLVM Vector, ICC Vector, MMX, SSE2, SSE3, and AVX2; y-axis in Megapixels/s, logarithmic scale. Panels: (a) Motion Compensation, (b) Average, (c) Get Reference.)

Figure 3. Absolute performance in logarithmic scale of the motion compensation (a), average (b), and get reference (c) kernels compared with the auto-vectorized and hand-written versions.

to the low number of iterations and the anti-dependencies in the inner loop. The ICC compiler does not auto-vectorize the SATD kernels, although it has an absolute performance advantage (geometric mean 1.05x) over the other compilers due to better scalar code.

The ICC and GCC compilers fail to transform any of the loops included in the intra analysis kernels. GCC cannot find patterns that match vectorized operations, and the ICC compiler estimates that the cost of the vectorized kernel is higher than that of the scalar code. The LLVM compiler has slightly better results than the other compilers, achieving a 1.05x relative performance gain when there is data parallelism, as shown in Figure 2(c). Another important characteristic of these kernels compared with the others is that the MMX version has a 1.7x mean performance gain, which is relatively low compared with other categories of kernels due to the limited data parallelism. Finally, the majority of the kernels max out the performance of the vector units using the SSE/SSE2 SIMD extensions. The performance gain from the AVX/AVX2 extensions is limited and appears only in kernels with 16x16 blocks of pixels.

B. Motion Compensation

The motion compensation (MC) kernel has two variations, one for chroma and one for luma (luminance). The luma kernels are more complicated than the chroma kernels because they include calls to other kernels, including the AVG kernel, making them hard for the compilers to analyze and auto-vectorize. In particular, GCC is not able to vectorize any of the luma kernels. LLVM and ICC transform the code of the kernels when the input is larger than an 8x8 block, as shown in Figure 3(a). On the other hand, the GCC and LLVM compilers can auto-vectorize the chroma variation of the kernels, although this results in a slowdown due to the additional compiler-generated overhead. The ICC compiler, in contrast, detects the high cost of vectorization and avoids transforming the code. However, the compilers achieve only a fraction of the performance of the hand-optimized version. For example, the ICC compiler achieves from 16% up to 61% of the performance of the MMX version. Processors with MMX can gain from 3.1x up to 8x compared with the scalar version. Moreover, newer processors with AVX2 support can achieve from 4.5x up to 16.1x speedup using the hand-written version of the kernel.

In the AVG kernel, only the LLVM compiler gains significant performance. The GCC auto-vectorization transformation has a marginal benefit of +5%, which is close to the measurement variation limit (2%). The LLVM compiler auto-vectorizes the loop, yielding a significant speedup of up to 1.4x. On the other hand, despite successful vectorization, the ICC compiler delivers poor performance. It turns out that the compiler produces low-performance scalar code for this kernel: for example, the execution time of the 8x8 version of the kernel is 2507 CPU cycles for ICC versus 2307 for GCC without the auto-vectorization transformation.

The GET REF kernel internally calls the motion compensation (MC) and average (AVG) kernels. The GCC compiler shows a minor performance increase when the block of pixels is larger than 12x10, as illustrated in Figure 3(c). Thus, despite successfully auto-vectorizing one of the two kernels (AVG), GCC does not deliver a considerable improvement. Furthermore, the ICC compiler has mixed results, ranging from a 0.85x slowdown up to a 1.9x speedup despite vectorizing all the kernels, due to the bad performance of the AVG kernel. The hand-optimized AVX2 version achieves the best performance, up to a 15.5x gain compared with the scalar version. On the other hand, the performance of the MMX, SSE2, and SSE3 versions is similar due to the limited data parallelism.

C. Encoding

The GCC compiler has the best performance among the auto-vectorizing compilers, gaining up to a 1.5x speedup in the Discrete Cosine Transformation (DCT) kernels (Figure 4(a)). On the other hand, the LLVM compiler has a mean slowdown of 0.57x compared with GCC, due to the poor scalar code it produces. Moreover, the ICC compiler achieves up to a 1.2x speedup, although the mean performance gain is 1.06x. Hand-written kernels achieve better performance, up to 9.10x when using the latest SIMD extensions, and only in kernels with a big dataset. For the majority of the loops, the

(Figure omitted: grouped bars per kernel; series are GCC Baseline, GCC Vector, LLVM Vector, ICC Vector, MMX, SSE2, SSE3, AVX, and AVX2; y-axis in Mega-Coefficients/s, logarithmic scale. Panels: (a) Discrete Cosine Transformation, (b) Quantization, (c) ZigZag.)
Figure 4. Absolute performance in logarithmic scale of the Discrete Cosine Transformation (a), Quantization (b), and ZigZag (c) kernels compared with the GCC scalar version without auto-vectorization transformations.

compilers miss the opportunity to vectorize the code due to either the complexity of the code or the overhead of the vectorized code. In particular, the ICC compiler does not vectorize the majority of the loops because their estimated execution cost is higher than that of the scalar version.

The QUANT kernels follow a similar pattern to the previous kernels. The results of the auto-vectorized GCC version are mixed, giving a slight overall performance gain of 1.05x. However, the remaining compilers exhibit slowdowns, ranging from 0.84x for LLVM to 0.811x for ICC, as shown in Figure 4(b). The ICC compiler partially vectorizes two of the kernels (8x8 and 4x4 dc), but it only gains a 1.38x speedup in the 4x4 dc variation. The remaining kernels are not auto-vectorized because their calculated cost is higher than that of the scalar version.

The ZIGZAG kernel is bandwidth limited, so it is expected to have low performance potential from auto-vectorization compared with other kernels. Indeed, the auto-vectorization performance gain is limited, achieving no more than a 1.3x speedup. Moreover, the hand-tuned version has limited opportunities to exploit data parallelism due to dependencies: except for one kernel, the hand-tuned versions offer a performance gain of at most 2.7x, as shown in Figure 4(c).

D. Decoding

The compilers can auto-vectorize all the de-quantization kernels, yielding speedups from 1.2x up to 1.9x, as illustrated in Figure 5(a). GCC has the best performance, and ICC a slightly lower performance gain, except for one kernel that has a 0.3x slowdown (not shown). The limited performance improvement is due to the additional code the compiler inserts for the alignment checks and for packing the data.

None of the available compilers can identify and exploit data parallelism in the deblocking filter. The source code contains data and control dependencies that are hard to remove even with more advanced techniques.
Moreover, the comparatively low performance of the hand-tuned deblocking kernels, relative to the other kernels, is due to the limited parallelism available in these kernels.
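To see why the filter resists vectorization, consider a heavily simplified sketch (hypothetical, not the actual H.264 filter; the names alpha and beta merely echo its thresholds): every pixel update is guarded by nested data-dependent branches, so the loop body is dominated by control flow that a compiler must convert to masked operations before it can vectorize.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative sketch (NOT the actual x264 deblocking filter) of why
 * deblocking resists auto-vectorization: each pixel pair straddling an
 * edge is filtered only when data-dependent conditions hold, so the
 * loop body is dominated by control flow rather than arithmetic. */
void deblock_edge(uint8_t *pix, int stride, int alpha, int beta, int n)
{
    for (int i = 0; i < n; i++, pix += stride) {
        int p0 = pix[-1], q0 = pix[0];
        if (abs(p0 - q0) < alpha) {          /* edge-strength test */
            int avg = (p0 + q0 + 1) >> 1;
            if (abs(p0 - avg) < beta) {      /* second, nested condition */
                pix[-1] = (uint8_t)avg;
                pix[0]  = (uint8_t)avg;
            }
        }
    }
}
```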

Furthermore, the inverse discrete cosine transformation (iDCT) kernels show poor results for the compilers, despite successful vectorization. The produced vectorized code includes additional code for alignment control and data packing, making the vectorized kernel unprofitable compared with the scalar version.

E. Evaluation of Compilers

During the evaluation process, the Intel compiler was able to identify vectorization patterns in most of the loops. Out of 1150 loops in total, the ICC compiler vectorizes 213, in contrast to 46 for GCC and 43 for LLVM. The LLVM compiler has the worst performance, as it has not yet reached the maturity of the GCC and ICC compilers. Table I presents the auto-vectorization results of the ICC compiler. The ICC compiler is selected for this analysis because it transforms the most kernels, provides a detailed report on the reasons for missed opportunities, and achieves the best performance.

The compiler misses vectorization opportunities for three main reasons. First, although the compiler identifies the vectorization pattern, the transformation is not profitable: for 41.4% of the loops (Table I), the estimated cost of executing the vectorized code is greater than that of the scalar version. This happens, for example, when the number of loop iterations is small, or when the compiler must insert additional control code to check runtime parameters such as data alignment. An additional source of overhead is the checking for possible overlap between pointers (aliasing); the challenge with the multimedia kernels is that they use pointers instead of arrays, making the analysis harder for the compiler. Second, the ICC compiler detects dependencies between the data in 20.2% of the loops in the benchmark. Examining the source code shows that the majority of these non-vectorized kernels have real data or control dependencies.
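One concrete consequence of the pointer-based interfaces: without aliasing information, the compiler must assume the destination may overlap a source and either give up or emit a runtime overlap check. The hypothetical kernel below (illustrative, not taken from the benchmark) shows the C99 restrict qualifier, which asserts no overlap and removes that obstacle.

```c
#include <stdint.h>

/* Illustrative example of the aliasing obstacle: without restrict, the
 * compiler must assume dst may overlap src1/src2 and either skip
 * vectorization or emit a runtime overlap check. The C99 restrict
 * qualifier asserts that the pointers do not alias. The rounded
 * average matches the style of the AVG kernels described earlier. */
void avg_rows(uint8_t *restrict dst,
              const uint8_t *restrict src1,
              const uint8_t *restrict src2, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)((src1[i] + src2[i] + 1) >> 1);  /* rounded average */
}
```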
Furthermore, the multimedia kernels use pointers, making the analysis challenging for the compiler. Finally, the last major reason for missed vectorization opportunities is that the loop already contains a vectorized inner loop. This exposes the limitations and constraints of the

(Figure omitted: grouped bars per kernel; series are GCC Baseline, GCC Vector, LLVM Vector, ICC Vector, MMX, SSE2, SSE3, AVX, and AVX2; y-axis in Mega-Operations/s, logarithmic scale. Panels: (a) De-Quantization, (b) Deblocking Filter, (c) Inverse DCT.)

Figure 5. Performance of the de-quantization (a), deblocking (b), and inverse discrete cosine transformation (c) kernels, in logarithmic scale, compared with the GCC baseline without auto-vectorization.

current state-of-the-art compilers, which miss opportunities to exploit additional data parallelism.

Table I. Auto-vectorization report of the ICC compiler.

Reason                               #Loops    % total
Total vectorized                        213     18.52%
  Vectorized                            211     18.35%
  Partially vectorized                    2      0.17%
Total not vectorized                    937     81.48%
  Unknown operation / pattern             6      0.52%
  Inefficient vectorization             476     41.39%
  Data dependencies                     232     20.17%
  Multiple exits inside the loop         19      1.65%
  Unknown loop iterations                16      1.39%
  Call to memcpy in the loop body         5      0.43%
  Inner loop vectorized                 183     15.91%
Total loops                            1150    100.00%

F. Summary

The auto-vectorization transformations provided by the compilers achieve only a fraction of the performance of the hand-optimized versions. Figure 6 presents the speedups compared with the scalar variation of each kernel compiled by the same compiler. The figure groups the achieved speedups or slowdowns into five categories, where the 0.97x-1.03x category represents the kernels on which auto-vectorization has no impact. The majority of the kernels in LLVM (71%) show no performance impact, as the compiler leaves the scalar code unmodified, while 21% of the loops achieve a limited speedup of up to 2x. On the bright side of the LLVM compiler, only 1% of the kernels exhibit a noticeable slowdown, below 0.9x. The GCC compiler gains a speedup over 2x for only ten kernels (4.5%), and it produces small gains for 15% of the kernels. Unfortunately, 22% of the kernels exhibit slowdowns compared with their scalar versions produced by GCC. This benchmark can therefore be a powerful tool to evaluate and identify the reasons for performance degradation. Moreover, the LLVM and GCC compilers vectorize only a fraction of the available loops. The ICC compiler has the best results from the

(Figure omitted: bars per compiler (LLVM, GCC, ICC) showing the percentage of kernels in each speedup category: above 2x, 1.03x-2x, 0.97x-1.03x, 0.9x-0.97x, and below 0.9x; y-axis: % of kernels, 0-100.)

Figure 6. Achieved speedup over the scalar version, grouped by compiler.

three available compilers, and it has the largest divergence between speedup and slowdown. The ICC compiler achieves the best performance of all the compilers, with 34% of the kernels achieving better than a 2x speedup, although 17% show a significant slowdown compared with their scalar versions produced by ICC. This reflects the aggressive code transformation techniques implemented inside the compiler. Finally, the GCC compiler achieves on average 21.1% of the speed of the hand-tuned versions of the kernels. The LLVM compiler has the worst performance, achieving only 15.9%, and the ICC compiler the best, achieving 28.4% of the hand-tuned versions.

VI. RELATED WORK

The Test Suite for Vectorizing Compilers (TSVC) [14] was the first benchmark, written in Fortran, introduced for evaluating the auto-vectorization transformations of compilers. The benchmark contains synthetic kernels with both floating point and integer arithmetic. In contrast, our benchmark contains only real kernels, together with their hand-optimized counterparts for comparison. Moreover, the TSVC benchmark suite includes floating point operations that are relevant to scientific computing, in contrast with our work, which contains integer-based kernels. Researchers have also ported the TSVC benchmark suite to the C language to assess modern compilers [15], comparing the kernels in a way similar to our work. In addition, researchers have used kernels from HPC applications and the MediaBench II suite [16]. Similar to our approach, they analyzed the reasons for missed vectorization opportunities, and their findings confirm our results on the success rate of vectorization. However, in our kernels we encountered test cases in which

the vectorized version was slower than the scalar version. Moreover, our approach uses openly available kernels that have already been tested and optimized by the open source community, ensuring the quality of the hand-optimized kernels.

Ren et al. [17] have also evaluated multimedia applications [18] and concluded that, despite advances in compilation, the performance of auto-vectorization is still limited. In addition to evaluating the multimedia benchmark suite, they examined the reasons for missed auto-vectorization opportunities. In contrast to this paper, the benchmark suite they used lacks hand-optimized versions of the kernels to compare against. Furthermore, they manually vectorized a subset of the kernels and procedures, achieving speedups of 1.10x to 3.39x on a Pentium 4 processor (SSE2). In contrast, we present a benchmark that includes hand-optimized kernels from the open source community performing from 1.26x up to 29.0x faster than the scalar versions.

VII. CONCLUSION

This paper presents an open source benchmark suite with multimedia kernels extracted from open source projects for evaluating the auto-vectorization performance of compilers. The source code of the benchmark is publicly available [19] under the GNU General Public License. The benchmark can be a valuable tool for compiler researchers to evaluate code transformations. Moreover, this study evaluates state-of-the-art compilers regarding the runtime performance of the produced code. Results show that the auto-vectorized version produced by the best compiler achieves on average only 28% of the performance of the hand-tuned kernels. Despite the progress of recent years in the field of auto-vectorization, compilers still miss opportunities to identify and exploit the data parallelism present in applications. There are still some aspects that we would like to investigate in the future.
First, the benchmark contains only a portion of the kernels available in multimedia workloads. We are planning to add more kernels from open source applications, together with their hand-optimized counterparts. Second, the paper examines performance only on the x86 platform. The plethora of mobile devices based on the ARM architecture shows that further evaluation is required on other architectures. Finally, small changes in the source code can simplify the work of the compiler and improve performance. For example, avoiding additional control code for alignment checking reduces the cost of auto-vectorization, allowing more loops to be transformed.

REFERENCES

[1] M. J. Flynn and P. Hung, “Microprocessor design issues: thoughts on the road ahead,” IEEE Micro, vol. 25, no. 3, pp. 16–31, 2005.

[2] J. Reinders, “AVX-512 instructions,” Intel Corporation, 2013.

[3] H. Inoue, “How SIMD Width Affects Energy Efficiency: A Case Study on Sorting.”

[4] T. Proebsting, “Proebsting’s law: Compiler advances double computing power every 18 years,” 1998.

[5] R. R. Schaller, “Moore’s law: past, present and future,” IEEE Spectrum, vol. 34, no. 6, pp. 52–59, 1997.

[6] L. Merritt and R. Vanam, “x264: A high performance H.264/AVC encoder,” [online] http://neuron2.net/library/avc/overview_x264_v8_5.pdf, 2006.

[7] F. Bellard, M. Niedermayer et al., “FFmpeg,” http://ffmpeg.org.

[8] “ITU-T H.264: Advanced video coding for generic audiovisual services,” International Telecommunication Union, November 2009.

[9] D. Grois, D. Marpe, A. Mulayoff, B. Itzhaky, and O. Hadar, “Performance comparison of H.265/MPEG-HEVC, VP9, and H.264/MPEG-AVC encoders,” in Picture Coding Symposium (PCS), 2013. IEEE, 2013, pp. 394–397.

[10] “Intel(R) C++ Compiler 16.0.1 for Linux.”

[11] R. Stallman, Using and porting GNU CC. Free Software Foundation, 1993, vol. 675.

[12] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in Code Generation and Optimization (CGO 2004), 2004, pp. 75–86.

[13] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the scalable video coding extension of the H.264/AVC standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 9, pp. 1103–1120, 2007.

[14] D. Callahan, J. Dongarra, and D. Levine, “Vectorizing compilers: A test suite and results,” in Proceedings of the 1988 ACM/IEEE Conference on Supercomputing. IEEE Computer Society Press, 1988, pp. 98–105.

[15] S. Maleki, Y. Gao, M. J. Garzaran, T. Wong, and D. A. Padua, “An evaluation of vectorizing compilers,” in Parallel Architectures and Compilation Techniques (PACT), 2011, pp. 372–382.

[16] J. E. Fritts, F. W. Steiling, and J. A. Tucek, “MediaBench II video: expediting the next generation of video systems research,” in Electronic Imaging 2005. International Society for Optics and Photonics, 2005, pp. 79–93.

[17] G. Ren, P. Wu, and D. Padua, “An empirical study on the vectorization of multimedia applications for multimedia extensions,” in Parallel and Distributed Processing Symposium (IPDPS), 2005.

[18] N. T. Slingerland and A. J. Smith, “Design and characterization of the Berkeley multimedia workload,” Multimedia Systems, vol. 8, no. 4, pp. 315–327, 2002.

[19] https://github.com/malvanos/video-simdbench.
