Optimizing CABAC for VLIW architectures

M. Elena Castro Carballo, Javier D. Bruguera and Roberto Rodríguez Osorio
Dept. Electronic and Computer Engineering
University of Santiago de Compostela
15706 Santiago de Compostela, Spain
e-mail: [email protected], [email protected], [email protected]

Abstract

This paper proposes techniques for optimizing Context-Based Adaptive Binary Arithmetic Coding (CABAC) on Very Long Instruction Word (VLIW) architectures. Binary arithmetic coding (BAC) contains a large number of conditional and sequential processing steps, which makes parallelism difficult to realize on VLIW processors. The purpose of this paper is to illustrate an optimized software implementation on a VLIW processor. The Texas Instruments (TI) TMS320C6711 Digital Signal Processor (DSP) was chosen as the implementation platform. Our optimized code shows a 20% reduction in the number of cycles required to generate the output string in comparison with a straightforward implementation of the arithmetic encoder as defined in the H.264/AVC standard.

1

Introduction

Context-Based Adaptive Binary Arithmetic Coding is a normative part of the new ITU-T/ISO/IEC standard H.264/MPEG-4 Part 10 for video compression [1]. This adaptive entropy coding method arithmetically codes the binary symbols of a given context using only shifts and table look-ups. It is a serial process that involves a large number of operations and memory accesses, which makes its implementation very compute-intensive. To overcome this bottleneck, system designers tend toward hardware solutions [2]. However, there are nowadays commercial software-programmable DSPs whose architecture allows the processing to be sped up whenever instruction-level parallelism (ILP) can be extracted from the algorithm. DSPs with several functional units can then process those instructions in parallel. This is the case of the TMS320C6711, which contains eight functional units [3].

The H.264/MPEG-4 Part 10 binary arithmetic encoder, as implemented in the reference algorithm [5], is divided into the following steps: code MPS/LPS, update the probability states, re-normalize and, embedded in the re-normalization, append bits to the output string. The algorithm contains nested loops, nested conditional execution paths and data dependencies that force a serial execution and prevent the parallelization of instructions. In this work we present some modifications to the standard CABAC algorithm that can lead to an efficient implementation of H.264 encoding on VLIW DSP platforms, extracting the underlying instruction parallelism and reducing the compute-intensive tasks.

2

CABAC

The CABAC encoding process consists of three elementary steps [4] (see Figure 1):

1) Binarization: first, the events, coefficients and parameters produced by the video encoder are converted into binary symbols.

2) Context modeling: most of the symbols are encoded within a context. Symbols with the same statistical properties belong to the same context and are compressed together, while symbols of a different nature belong to different contexts.

Figure 1: CABAC encoder block diagram

In order to avoid implementing as many arithmetic coding (AC) schemes as there are contexts, CABAC uses a single AC engine that switches its distribution of probabilities when the context changes, without any loss of performance and producing a single output stream. In the context modeling stage, before a bin is arithmetically encoded, the corresponding context model is selected for it. CABAC considers 399 different contexts, depending on the syntax element to be encoded, each identified by a unique context index γ. The probability model associated with a given context is determined by a pair of values: a 6-bit probability state index and the binary value of the most probable symbol (MPS). Thus, the pairs (state, MPS) for 0 ≤ γ ≤ 398, and hence the models themselves, can be efficiently represented by 7-bit unsigned integer values.

3) Binary arithmetic coding: the original principle of BAC is based on the recursive subdivision of an interval of width range. Given the estimated probability pLPS of the least probable symbol (LPS), the interval is subdivided into two: one of width RLPS = range × pLPS, which is associated with the LPS, and the dual interval of width RMPS = range − RLPS, which is assigned to the MPS. Depending on whether the observed bin to be encoded is the MPS or the LPS, the corresponding subinterval is chosen as the new interval.
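The 7-bit representation of the (state, MPS) pair described above can be sketched as follows. This is an illustrative packing scheme, not the reference code; the names ctx_pack, ctx_state and ctx_mps are our own.

```c
#include <stdint.h>

/* Sketch (not the reference implementation): a CABAC context model is a
 * (probability state, MPS) pair. With a 6-bit state (0..63) and a 1-bit
 * MPS value, the pair fits in a 7-bit unsigned integer, so all 399
 * models can be stored in a byte array. */
typedef uint8_t ctx_model_t;

static ctx_model_t ctx_pack(unsigned state, unsigned mps)
{
    /* bits 6..1 hold the state, bit 0 holds the MPS value */
    return (ctx_model_t)((state << 1) | (mps & 1u));
}

static unsigned ctx_state(ctx_model_t m) { return m >> 1; }
static unsigned ctx_mps(ctx_model_t m)   { return m & 1u; }
```

One byte per model keeps the whole context table (399 entries) small enough to be cache-friendly on an embedded target.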

Figure 2: Arithmetic coding in CABAC

2.1

Arithmetic coding in CABAC

Arithmetic coding [6] consists in the iterative division of an interval based on the probability of the symbols, with the subinterval associated to the symbol to encode selected as the new interval. Any point within the final subinterval characterizes the whole sequence. An adaptive version of this algorithm can be implemented by changing the size of the intervals as new symbols are processed.

In the practical implementation of BAC there are two main factors deciding the throughput: the multiplication range × pLPS and the dynamic estimation of the probability pLPS. CABAC implements a multiplication-free BAC scheme where approximated multiplication results are stored in a fixed table. The basic idea is to approximate both the current interval range, by an equi-partition of the whole interval 2^8 ≤ range < 2^9 into four cells, and the value of pLPS, by 64 quantized values. In this way a 4 × 64 2D table is obtained, from which the subinterval width RLPS is looked up. Each entry of the table is indexed by 2 bits of the range register and 6 bits of the probability state register. An update of the probability estimation is performed after each symbol is encoded. For a given probability state, the update depends on the state index and on whether the encoded symbol was an LPS or an MPS. The process is also table-based, since the transition rules are stored in two tables, each having 64 entries of 6-bit unsigned integer values. In fact, CABAC uses two sub-engines: one for bins in the regular coding mode described above, and another, the so-called bypass coding mode, for fast encoding of equally probable symbols. For such bins, a default value of 0.5 is used as the estimation of pLPS.
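The indexing of the multiplication-free table can be sketched as below. The real 64 × 4 table values are defined by the standard; here the table is left zero-initialized only so that the indexing scheme is runnable, and the names range_cell and lookup_rlps are our own.

```c
#include <stdint.h>

/* Illustrative only: the actual rangeTabLPS contents come from the
 * H.264/AVC standard; a zero table is used here just to exercise the
 * indexing. Indexed as [probability state][range cell]. */
static uint8_t rangeTabLPS[64][4];

/* Quantize the 9-bit range (2^8 <= range < 2^9) into one of 4 cells
 * using bits 7..6 of the range register. */
static unsigned range_cell(unsigned range)
{
    return (range >> 6) & 3u;
}

/* Approximate R_LPS = range * pLPS by a single table look-up. */
static unsigned lookup_rlps(unsigned range, unsigned state)
{
    return rangeTabLPS[state][range_cell(range)];
}
```

The shift-and-mask replaces a multiplication per bin, which is the key to the scheme's speed on processors without fast multipliers.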

In the particular case of CABAC, the procedure is shown in Figure 2 and the update equations are the following:

MPS (Most Probable Symbol):
    low_new   = low
    range_new = range − R_LPS        (1)

LPS (Least Probable Symbol):
    low_new   = low + range − R_LPS
    range_new = R_LPS

where range is the current length of the subinterval, low is the position of its lowest point and R_LPS is the size of the subinterval associated with the LPS. As can be seen in Figure 2, as the encoding process progresses, low increases, keeping track of the symbols encoded so far, while range decreases. Although it would be possible to use floating-point numbers, most practical arithmetic coders use integer arithmetic for its simplicity and portability. In order to use only integers and keep the size of both operands under control, range and low are normalized every cycle.
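The update equations in (1) translate directly into a few integer operations per bin. The sketch below (our own names coder_t and encode_bin, not the reference code) shows the two branches; renormalization is omitted, as it is discussed in Section 3.

```c
/* State of the coder is the (low, range) pair; rlps is the looked-up
 * LPS subinterval width. Renormalization is intentionally left out. */
typedef struct { unsigned low, range; } coder_t;

static void encode_bin(coder_t *c, int bin_is_mps, unsigned rlps)
{
    if (bin_is_mps) {
        c->range -= rlps;             /* MPS: low unchanged, range shrinks */
    } else {
        c->low  += c->range - rlps;   /* LPS: low moves up to the LPS cell */
        c->range = rlps;              /* ... whose width becomes the range */
    }
}
```

Note that the MPS path is a single subtraction, which is why skewed probabilities (frequent MPS) make the coder fast.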

3

CABAC implementation on a VLIW processor

The platform chosen for the implementation of CABAC is Texas Instruments' DSP TMS320C6711. This is a floating-point processor based on the VLIW architecture. Features of the C6711 include two separate data paths, A and B, with sixteen 32-bit registers each, and eight functional units that allow up to eight instructions to be processed in parallel. Data and instructions reside in separate memory spaces, so that concurrent memory accesses are possible. The internal program memory is structured in such a way that a total of eight instructions can be fetched every cycle. The block diagram of the DSP core is shown in Figure 3.

Figure 3: C6x VLIW-based DSP architecture

3.1

Standard algorithm

In the original C code from the H.264 reference software [5], the function biari_encode_symbol is the heart of the CABAC encoding engine. Upon reception of the binary symbol to encode, together with its context information, this function first updates the values of range and low following the formulas in (1), and then updates the probability estimation by looking up the tables where the transition rules are stored and, occasionally, by interchanging the values of the LPS and the MPS. Figure 4 illustrates this process.

Figure 4: Block diagram of the CABAC process, including the update of probability states

However, the compute-intensive portion of this function is the re-normalization loop (Figure 5), which ensures that the low and high ends of the interval do not get so close that the encoded data could be decoded incorrectly. The process re-normalizes the values of low and range in the interval [0, 1] so that they are separated by at least a QUARTER (0.25). The value of range is left-shifted as many times as necessary; low is left-shifted by the same number of bits, and the bits shifted out constitute the result of the compression process: they are packed into bytes and appended to the output string.

Figure 5: Renormalization and bit insertion

The combination of two factors makes this re-normalization very compute-intensive. First, the algorithm does not know in advance how many times the re-normalization loop will run. In addition, while this loop is being evaluated, the algorithm performs a variable field insertion to generate the bit stream. These factors significantly lower the parallelism available in a CABAC implementation.

To run the original CABAC C code on the C6711 DSP and control all the parameters involved in the encoding process, it was necessary to set up an appropriate working environment, involving the creation of additional files and functions. All the tests in the present work were done with a set of input files of binary symbols to encode and a set of files with the contexts to associate with each symbol, whose probability states are updated at run time. The C program includes variables, arranged within structures, that characterize the state of the arithmetic coding engine. To ensure the correct progress of the encoding process it was mandatory to have direct control over all the mentioned parameters and to avoid, for example, their values being overwritten. This was achieved by assigning them a fixed position in the external memory of the device.

The implementation of the arithmetic encoder on the DSP shows that about 50% of the processing occurs in the encoding loop and about 40% is spent calculating low and range and updating the contexts. Table 1 shows the results of profiling the encode function.

Total cycles spent in codification                              38416
Cycles spent in renormalization and bit insertion               20031 (52%)
Cycles to calculate low, range and update probability states    14499 (38%)

Table 1: Profiling of the encode function

3.2

Optimizing the re-normalization loop

Since half of the computation time is spent in re-normalization, we focused the optimization efforts for CABAC on this while loop. The idea is to use the 8 functional units of the DSP to process in parallel as many instructions as possible. This could be achieved by applying the software pipelining technique to the loop, so that the next iteration can be started before the previous one has completed. However, this is limited by the nested conditional execution paths and the data dependencies. The optimizations described in the following sections try to eliminate these obstacles by restructuring the algorithm and by reducing the computational load of specific tasks.

3.2.1 Decoupling the encoding and bit-insertion tasks

This approach increases the number of instructions that can be executed in parallel by decoupling the re-normalization loop from the bit-insertion task, and makes the encoding process faster on devices that exploit instruction-level parallelism [7].

Two things were observed: a) the insertion of bits into the output string (which is embedded within the while loop as calls to macros) occurs at a lower rate than the re-normalization associated with encoding the current symbol, and b) bit insertion is a heavy task that can double the computation time of the re-normalization (e.g., from ∼1600 cycles for re-normalization without calls to the bit-insertion macros to ∼3500 cycles with them).

The bit-insertion task was removed from the encoding procedure by using two extra arrays to store the bits to insert, either after the whole re-normalization has been done for the current symbol or after the arrays have been filled. In this way the generation of the packed bits, which are the result of previously encoded symbols, can proceed in parallel with the re-normalization of other symbols, since there are no dependencies between the two loops. The resulting flow diagram is shown in Figure 6.

Figure 6: Flow chart of the proposed algorithm

3.2.2 Replacing the bit-insertion macros by functions

The insertion of bits into the output string is performed by calls to predefined macros. Although the macros reduce the source code, the penalty is the production of more machine instructions than would be generated by using functions. This was effectively observed when running CABAC on the DSP: most of the coding computation time in the re-normalization loop was spent in the bit-insertion macros. This led us to simply replace the calls to the macros by calls to equivalent functions. The result is a reduction in the number of cycles involved in the generation of the encoded bit stream with respect to the straightforward CABAC implementation.

3.2.3 Reduction of 'if-else' propositions

The arithmetic encoder contains many nested conditional statements that limit parallelism. The algorithm was restructured to minimize the number of different conditional execution paths and introduce more parallelism. One observation allowed the simplification of the control flow: in the re-normalization loop, the code is almost the same for the cases low ≥ HALF and low < QUARTER. With the addition of a control bit 'sign', the new re-normalization loop results in the code shown in Figure 7.

Figure 7: Code for loop optimization

4

Results

Our optimized implementations reduce the number of cycles required to generate the output string in comparison with the straightforward CABAC execution on the DSP. Tables 2 and 3 show some of the results obtained with a test file. Table 2 reflects the effect of replacing the macro for the generation of the output string by a function.
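The decoupling idea of Section 3.2.1 can be sketched as follows. The re-normalization loop only records, per iteration, which event occurred (emit 0, emit 1, or a deferred "follow" bit); packing the recorded bits into the output string is done afterwards from this log, so the two loops can overlap across symbols. The constants and the names renorm_log, EV_BIT0, etc. are our own illustrative choices, not the reference code, and the scaling of QUARTER/HALF is an assumption.

```c
#include <stdint.h>

#define QUARTER 0x0100u   /* assumed scaling: range is kept >= QUARTER */
#define HALF    0x0200u

/* Event codes logged by the loop instead of calling bit-insertion
 * macros inline: a plain 0 bit, a plain 1 bit, or a "follow" event
 * (the bits_to_follow case of the classic arithmetic coder). */
enum { EV_BIT0 = 0, EV_BIT1 = 1, EV_FOLLOW = 2 };

/* Re-normalize (low, range) by left shifts, logging one event per
 * iteration into 'events'. Returns the number of events logged;
 * packing them into output bytes is deferred to a separate loop. */
static int renorm_log(unsigned *low, unsigned *range, uint8_t *events)
{
    int n = 0;
    while (*range < QUARTER) {
        if (*low >= HALF) {            /* upper half: emit a 1 bit */
            events[n++] = EV_BIT1;
            *low -= HALF;
        } else if (*low < QUARTER) {   /* lower half: emit a 0 bit */
            events[n++] = EV_BIT0;
        } else {                       /* middle: defer the decision */
            events[n++] = EV_FOLLOW;
            *low -= QUARTER;
        }
        *low <<= 1;                    /* double low and range */
        *range <<= 1;
    }
    return n;
}
```

Because the loop body now contains no calls, its iterations are better candidates for software pipelining on the C6711, which is the effect the paper measures.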

Cycles to generate the output string

put_bit / Ebits_to_follow    Macro   Function   ∆
0/7                          3463    2911       552
0/11                         4663    3664       999
1/2                          2143    2056       87

Ebuffer = 01111111011111111111100

Table 2: Reduction in the cycles needed to generate the output string when using a function instead of a macro

In Table 3 we present, on the one hand, the number of cycles spent in the re-normalization loop by the implementation described in subsection 3.2.1 (in comparison with the original C code) and, on the other hand, the cycles spent in re-normalization and bit insertion when the optimization described in subsection 3.2.3 is used. In the latter case, the macro for bit insertion was replaced by the equivalent function. We get a reduction of around 20% in the total cycles needed to generate the output string in comparison with the original CABAC C code. In the former case, the cycles needed to append the bits to the final string must be added to the re-normalization cycles; the results are then close to those of the latter case. Both approaches could be combined by using arrays to store the results of re-normalization in the new optimized while loop and postponing the generation of the final string to later stages.

Decoupling byteout and renorm
            Total cycles
            Original   Optimized   Reduction
Encoding    20046      13968       6078 (30%)

Reducing if-else propositions
            Total cycles
            Original   Optimized   Reduction
Encoding    20046      15624       4422 (22%)

Table 3: Reduction in the cycles needed to perform the encoding process with the optimized implementations

5

Conclusions

In order to implement H.264 encoding efficiently on VLIW DSP platforms, the CABAC algorithm was analyzed to find underlying instruction-level parallelism and compute-intensive tasks. The speed-up of the coding process is achieved thanks to the capability of such DSPs to process several instructions in parallel. Several optimized implementations were benchmarked on the C6711. The results show a reduction of around 20%, in the best case, in the total cycles required to generate the output string.

References

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard and A. Luthra. Overview of the H.264/AVC Video Coding Standard. IEEE Trans. Circuits Syst. Video Technol., vol. 13, pp. 560-576, July 2003.

[2] R. R. Osorio and J. D. Bruguera. Arithmetic Coding Architecture for H.264/AVC CABAC Compression System. In Euromicro Symp. on Digital System Design, pp. 62-69, 2004.

[3] Texas Instruments. TMS320C6000 CPU and Instruction Set Reference Guide. SPRU189, October 2000.

[4] D. Marpe, H. Schwarz and T. Wiegand. Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard. IEEE Trans. on CSVT, 13(7):620-636, July 2003.

[5] Karsten Sühring. H.264/AVC Reference Software Encoder Documentation. http://iphome.hhi.de/suehring/tml/doc/lenc/html/index.html

[6] I. H. Witten, R. M. Neal and J. G. Cleary. Arithmetic Coding for Data Compression. Communications of the ACM, 30(6):520-540, June 1987.

[7] B. Valentine and O. Sohm. Optimizing the JPEG2000 Binary Arithmetic Encoder for VLIW Architectures. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. 17-21, 2004.
