A VLSI Architecture Design of an Edge Based Fast Intra Prediction Mode Decision Algorithm for H.264/AVC Shen Li*, Xianghui Wei, Takeshi Ikenaga, Satoshi Goto Graduate School of Information, Production and Systems, Waseda University Hibikino 2-7, Wakamatsu Ku, Kitakyushu City, Fukuoka, Japan, 808-0135 (+81) 093-692-5298

*[email protected]

(MB) can be partitioned into sub-blocks as small as 4×4 and predicted from up to 5 reference frames. In intra-frame coding, an MB can be coded either as a whole or as a combination of 16 non-overlapping 4×4 sub-blocks, with as many as 17 prediction modes available (9 for 4×4 blocks, 4 for the 16×16 MB and 4 for the 8×8 chroma blocks).

ABSTRACT

The intra-frame coding in H.264/AVC has contributed significantly to the enhancement of coding efficiency. However, it brings a heavy computational burden to the rate-distortion (RD) based mode decision (MD) process. Although real-time encoding of 1280×720p signals has been realized in recent works with existing algorithms, higher resolutions, e.g. 1920×1088p, require hardware-oriented fast algorithms. Yet so far few of the many proposed fast MD algorithms have seen successful hardware implementation. This paper presents a novel VLSI design (15.8k gates@200MHz, with TSMC CMOS 0.18μm technology) of an edge based fast intra MD algorithm which consistently removes about 66% of the RD related computation with a negligible quality loss. It is expected to serve as an accelerator hardware module in a real-time HDTV (1920×1088p) H.264 encoder or MPEG2-H.264 transcoder.

In order to determine the best coding mode out of all the candidates, H.264 adopts a straightforward approach: it evaluates all candidate modes through a rate-distortion optimization (RDO) process, which drastically increases the computational complexity. Many efforts have been made to reduce the complexity of H.264. Yet compared with the great number of fast algorithms proposed for inter-frame coding, i.e. fast motion estimation and inter mode decision (MD), fast intra MD is still a new topic. Most recent works addressing this matter can be classified as edge based methods. They first detect the local edge and then decide the intra prediction modes according to its direction and strength. Reference [2] is a representative work, which introduced a 3×3 Sobel filter to detect the edge pixels and an edge direction histogram to find the characteristic edge of the 4×4 block. However, the Sobel based edge detection is itself an extra burden and adds to the total computational complexity. Although references [3][4][5][6] tried to propose low-complexity and hardware-oriented methods, a common problem remains: the edge detection and edge direction decision parts still require complicated mathematical operations, e.g. multiplication, division, and trigonometric calculation, that cannot be easily realized in hardware.

Categories and Subject Descriptors C.3 [Special-Purpose and Application-Based Systems]: Signal processing systems; B.7.1 [Integrated Circuits]: Types and Design Styles—Algorithms implemented in hardware, VLSI

General Terms Algorithms, Performance, Design

Keywords H.264, Fast Intra Prediction Mode Decision, VLSI Architecture

1. INTRODUCTION

This paper presents a VLSI implementation of our proposed edge based fast intra MD algorithm which, according to experiments with the reference software JM9.3 [7], can save as much as 66% of the RDO related computation and 25% of the total H.264 encoding time (for HD video sequences, 1280×720p, 30fps) with negligible quality loss. It is implemented with TSMC CMOS 0.18 μm technology at a modest hardware cost of 15.8k gates. After logic synthesis the maximum working frequency is 200MHz, and the design is expected to be a favorable contribution to the real-time H.264 encoding or MPEG2-H.264 transcoding of HDTV.

H.264/AVC [1] is the latest international video coding standard that achieved nearly twice the coding efficiency of MPEG-2. Like the earlier MPEG-2/4 and H.263, H.264 also adopts inter-frame coding and intra-frame coding as two major coding methods, which are, however, enhanced by the utilization of many advanced coding techniques. On the inter-frame coding part, for instance, variable block size motion estimation (VBSME) and multiple reference frames (MRF) are introduced. With these techniques a macroblock

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GLSVLSI’07, March 11–13, 2007, Stresa-Lago Maggiore, Italy. Copyright 2007 ACM 978-1-59593-605-9/07/0003...$5.00.

2. PROPOSED ALGORITHM

2.1 Intra Prediction in H.264/AVC

In H.264 intra prediction, 2 kinds of block sizes are supported: 4×4 and 16×16. For a 4×4 luma block, H.264 provides 9 different prediction modes, as shown in Figure 1.


Figure 1. H.264 Intra prediction (4×4 block)

In addition to the 8 modes shown in the figure, Mode 2 (DC mode), which fills the prediction block with the average value of all available predictors, is also available. For a 16×16 luma block, there are 4 prediction modes: Mode 0 (vertical mode), Mode 1 (horizontal mode), Mode 2 (DC mode) and Mode 3 (plane mode), which is based on a linear spatial interpolation using the upper and left-hand predictors of the MB. Similarly, for the 8×8 chroma blocks there are also 4 prediction modes.

A typical H.264 encoder has to perform M8 × (M4 × 16 + M16) = 592 (M8 = 4, M4 = 9, M16 = 4) RDO calculations to find an optimum prediction mode [2]. Furthermore, there is a data dependency problem: the prediction of one 4×4 block must wait until its neighbors are processed. Therefore it is impossible to process the sixteen 4×4 blocks within an MB simultaneously, which becomes an obstacle to efficient hardware implementation.

shown in (3). To enhance the accuracy of this method, the cell for the mode with the second-minimum matching error is also updated.

$$
\begin{aligned}
err_0 &= \bigl|\,|Gy_{i,j}| - K_0\,|Gx_{i,j}|\,\bigr| \\
err_1 &= \bigl|\,|Gy_{i,j}| - K_1\,|Gx_{i,j}|\,\bigr| \\
err_3 &= \begin{cases} \bigl|\,|Gy_{i,j}| - K_3\,|Gx_{i,j}|\,\bigr| & (\mathrm{sign}(Gy_{i,j}) = \mathrm{sign}(Gx_{i,j})) \\ \infty & \text{else} \end{cases} \\
err_4 &= \begin{cases} \bigl|\,|Gy_{i,j}| - K_4\,|Gx_{i,j}|\,\bigr| & (\mathrm{sign}(Gy_{i,j}) \neq \mathrm{sign}(Gx_{i,j})) \\ \infty & \text{else} \end{cases} \\
err_5 &= \begin{cases} \bigl|\,|Gy_{i,j}| - K_5\,|Gx_{i,j}|\,\bigr| & (\mathrm{sign}(Gy_{i,j}) \neq \mathrm{sign}(Gx_{i,j})) \\ \infty & \text{else} \end{cases} \\
err_6 &= \begin{cases} \bigl|\,|Gy_{i,j}| - K_6\,|Gx_{i,j}|\,\bigr| & (\mathrm{sign}(Gy_{i,j}) \neq \mathrm{sign}(Gx_{i,j})) \\ \infty & \text{else} \end{cases} \\
err_7 &= \begin{cases} \bigl|\,|Gy_{i,j}| - K_7\,|Gx_{i,j}|\,\bigr| & (\mathrm{sign}(Gy_{i,j}) = \mathrm{sign}(Gx_{i,j})) \\ \infty & \text{else} \end{cases} \\
err_8 &= \begin{cases} \bigl|\,|Gy_{i,j}| - K_8\,|Gx_{i,j}|\,\bigr| & (\mathrm{sign}(Gy_{i,j}) = \mathrm{sign}(Gx_{i,j})) \\ \infty & \text{else} \end{cases}
\end{aligned}
\tag{2}
$$

$$
\mathit{histo}(\mathit{mode\_with\_min\_err}) \mathrel{+}= \mathrm{Amp}(\vec{G}_{i,j})
\tag{3}
$$
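As an illustration of (2) and (3), the following minimal Python sketch computes the matching errors for one gradient vector and updates the histogram. Several points are assumptions rather than statements from the paper: the absolute values in the error terms, the grouping of Modes 3, 7, 8 under the equal-sign condition and Modes 4, 5, 6 under the opposite-sign condition, the treatment of a zero component by `sign`, and the choice to add the same amplitude to the second-minimum cell. The names `matching_errors` and `update_histogram` are hypothetical.

```python
import math

# K values quoted in the text, assumed to map to K0, K1, K3, ..., K8
# (there is no K2 because Mode 2 is the non-directional DC mode).
K = {0: 8.0, 1: 0.125, 3: 1.0, 4: 1.0, 5: 2.0, 6: 0.5, 7: 2.0, 8: 0.5}
SAME_SIGN = {3, 7, 8}   # assumed: finite error only if sign(Gy) == sign(Gx)
DIFF_SIGN = {4, 5, 6}   # assumed: finite error only if sign(Gy) != sign(Gx)


def sign(v):
    """Return -1, 0 or 1."""
    return (v > 0) - (v < 0)


def matching_errors(gx, gy):
    """Matching error of every directional mode for one gradient vector."""
    same = sign(gx) == sign(gy)
    err = {}
    for mode, k in K.items():
        gated_out = (mode in SAME_SIGN and not same) or (mode in DIFF_SIGN and same)
        err[mode] = math.inf if gated_out else abs(abs(gy) - k * abs(gx))
    return err


def update_histogram(histo, gx, gy):
    """Add the amplitude to the cells of the two best-matching modes."""
    err = matching_errors(gx, gy)
    ranked = sorted(err, key=lambda m: (err[m], m))  # ties broken by mode number
    amp = abs(gx) + abs(gy)
    for mode in ranked[:2]:
        histo[mode] += amp
    return ranked[:2]


histo = {m: 0 for m in K}
best_two = update_histogram(histo, gx=1, gy=3)
```

With the K values above, only comparisons, additions and multiplications by powers of two are involved, so each product could be realized as a shift in hardware.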

For each 4×4 block, the top three histogram cells are picked out and the corresponding modes are chosen as candidate modes. Mode 2 (DC) is always enabled to minimize quality loss. For the 16×16 block, the histogram cells for the 4×4 blocks can be re-used. The mode whose corresponding cell takes the maximum value is chosen in addition to Mode 2 (DC). The same method can also be applied to the chroma blocks. Therefore, for each MB, there are up to 4 candidate prediction modes for 4×4 luma blocks, 2 for the 16×16 luma block and 2 or 3 for the 8×8 chroma blocks, which means the number of RDO calculations is reduced to 2×(4×16+2)=132 or 3×(4×16+2)=198. About 66% of the RDO iterations are thus saved, as is also realized in [2], but with much less complexity. Figure 3 shows the coding efficiency comparison between the traditional H.264 software encoder [7] and the revised encoder with our proposed algorithm.
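The candidate selection and the resulting RDO counts can be checked with a short Python sketch. The function names and the example histogram are hypothetical, and ties between cells are broken by mode number as an assumption.

```python
def rdo_count(m8, m4, m16):
    """RDO calculations per MB: M8 x (M4 x 16 + M16)."""
    return m8 * (m4 * 16 + m16)


def pick_4x4_candidates(histo, dc_mode=2, top_n=3):
    """Top three histogram cells plus the always-enabled DC mode."""
    ranked = sorted(histo, key=lambda m: (-histo[m], m))
    return sorted(set(ranked[:top_n]) | {dc_mode})


full = rdo_count(4, 9, 4)    # exhaustive search
best = rdo_count(2, 4, 2)    # 2 chroma candidates
worst = rdo_count(3, 4, 2)   # 3 chroma candidates

saving_worst = 1 - worst / full
saving_best = 1 - best / full

example_histo = {0: 12, 1: 0, 3: 5, 4: 0, 5: 7, 6: 0, 7: 1, 8: 0}
candidates = pick_4x4_candidates(example_histo)
```

Under these counts the saving lies between roughly 66% (3 chroma candidates) and 78% (2 chroma candidates), so the 66% figure corresponds to the less favorable case.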

2.2 Proposed Edge Based Algorithm

In our proposal, a 2×2 filter (shown in Figure 2), which is defined in (1), is used instead of the 3×3 Sobel filter to calculate the gradient vectors. Thus the computational complexity is reduced from 16 × (8A + 2S) in [2] (A: addition or subtraction; S: shifting) to 9×4A, as only 9 virtual pixels are checked.

$$
\begin{cases}
\vec{G}_{i,j} = (Gx_{i,j},\, Gy_{i,j}) \quad (i, j = 0, 1, 2) \\
Gx_{i,j} = (f_{i,j+1} + f_{i+1,j+1}) - (f_{i,j} + f_{i+1,j}) \\
Gy_{i,j} = (f_{i+1,j} + f_{i+1,j+1}) - (f_{i,j} + f_{i,j+1}) \\
\mathrm{Amp}(\vec{G}_{i,j}) = |Gx_{i,j}| + |Gy_{i,j}|
\end{cases}
\tag{1}
$$
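A minimal Python sketch of the 2×2 filter in (1) follows. It assumes that the 4×4 block is stored as a nested list `block[i][j]` whose first index corresponds to i in (1); which of i and j is the row index is not stated here, so this is a convention chosen for illustration. The function name `gradient_vectors` is hypothetical.

```python
def gradient_vectors(block):
    """Return a 3x3 grid of (gx, gy, amp) for the 9 sample points of a 4x4 block."""
    assert len(block) == 4 and all(len(row) == 4 for row in block)
    grads = []
    for i in range(3):
        row = []
        for j in range(3):
            gx = (block[i][j + 1] + block[i + 1][j + 1]) - (block[i][j] + block[i + 1][j])
            gy = (block[i + 1][j] + block[i + 1][j + 1]) - (block[i][j] + block[i][j + 1])
            row.append((gx, gy, abs(gx) + abs(gy)))
        grads.append(row)
    return grads


# A block whose values change only along the second index j.
ramp = [[0, 10, 20, 30] for _ in range(4)]
grads = gradient_vectors(ramp)
```

For this ramp every sample point yields the same vector, with a non-zero Gx and a zero Gy, which is what a uniform one-directional gradient should produce.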

Figure 2. A 2×2 filter for gradient vector calculation

In order to remove multiplication, division and trigonometric calculation, a linear gradient vector direction decision method is proposed. First, the matching errors of all the prediction modes except the DC mode are defined in (2) ({K0, K1, K3, ..., K8} can be {8, 0.125, 1, 1, 2, 0.5, 2, 0.5}, to ease the calculation). Similar to [2][3][5], a histogram based prediction mode selection method driven by the local edge is adopted. The histogram cells are updated as

Figure 3. R-D curve of HD sequence parkrun


can be achieved. The details are provided in the following subsections.

3. VLSI ARCHITECTURE DESIGN

3.1 Review of H.264 Intra-frame Coding Related Hardware Design

3.3 3-stage Job Interleaving Scheme

The VLSI implementation of H.264 encoders and decoders is no longer a new topic. Design strategies differ as the applications vary. For HDTV applications, which require both real-time coding capability and high quality, a solution based on a powerful custom hardware engine seems to be the only choice. Hence finding a highly efficient hardware architecture for the intra-frame coding part is of great significance. A critical issue, however, is that 4×4 blocks cannot be processed simultaneously, because each predicted block has to be generated from its already processed neighbors. Consequently a highly parallel structure is not applicable.

The 66% theoretical computation reduction guaranteed by the proposed algorithm certainly comes at some cost in terms of processing time and hardware resource. The 3-stage job interleaving scheme is adopted to minimize the Fast IPMD related cycle number.

Reference [8] is a representative work in this field, which guarantees real-time intra-frame encoding of SDTV (720×480p). The key points are 1) a context-based mode prediction method to reduce the number of candidate modes, 2) a sub-sampling based acceleration technique for the integer Hadamard transformation, and 3) a reconfigurable intra-predictor circuit that generates predicted blocks sequentially for every mode selected by 1). Although it is reported that 1) and 2) jointly saved nearly 60% of the intra prediction and RDO related computation, the latter actually contributes the most. Hence, to realize the real-time encoding of HDTV, a greater speed-up on the algorithm level is expected.

Figure 5. Execution timing chart of the proposed Fast IPMD HW module

3.2 Overview of Proposed Architecture

Figure 4 shows the VLSI architecture of the proposed fast intra prediction mode decision algorithm (Fast IPMD), which serves as a hardware accelerator of an intra-frame coder, e.g. reference [8].

As shown in Figure 5, the 4×4 blocks in an MB are scanned in a zig-zag manner and processed sequentially. In stage I, the gradient vectors of the 9 virtual pixels (sample points) in a 4×4 block are calculated. In stage II, the 9 gradient vectors are processed sequentially and the mode matching error array is calculated. In stage III, the linear direction decision and histogram update are done. The inter-stage delays are 9 cycles from stage I to II and 1 cycle from stage II to III. On average it takes only 9 cycles to process one 4×4 block. When further interleaved with the intra-frame coding, which generally takes much longer, the proposed Fast IPMD hardware introduces an almost negligible number of extra cycles.
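The interleaving can be pictured with a simple Python timing model. This is only a sketch under assumptions that are not taken from Figure 5: every stage is assumed to occupy 9 cycles per block, a new block is assumed to enter stage I every 9 cycles, and the function names are hypothetical.

```python
def stage_schedule(num_blocks=16, cycles_per_block=9, delay_1_to_2=9, delay_2_to_3=1):
    """Start cycle of stages I, II and III for each 4x4 block of one MB."""
    schedule = []
    for k in range(num_blocks):
        s1 = k * cycles_per_block   # a new block enters stage I every 9 cycles
        s2 = s1 + delay_1_to_2      # stage II trails stage I by 9 cycles
        s3 = s2 + delay_2_to_3      # stage III trails stage II by 1 cycle
        schedule.append((s1, s2, s3))
    return schedule


def total_cycles(schedule, cycles_per_block=9):
    """Cycle at which stage III of the last block finishes."""
    return schedule[-1][2] + cycles_per_block


schedule = stage_schedule()
total = total_cycles(schedule)
average = total / len(schedule)
```

In this model the pipeline fill cost of 10 cycles is paid once per MB, so the average per block stays close to the 9 cycles quoted above.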

3.4 Architecture Design of the 4×4 Processing Unit

With the 3-stage job interleaving scheme above, the 4×4 processing unit is designed to reduce both the circuit area and the critical path for higher performance. Figure 6 shows the architecture of the gradient vector calculator. It takes 9 cycles to find all the gradient vectors within a 4×4 block. In each cycle, 4 pixels are fetched from the 4×4 block buffer, which is realized as a 2-D register array in order to enable multi-pixel access in both the horizontal and vertical directions, and 3 subtractions are performed simultaneously. In the earlier 4 cycles, pixels are read in row by row and the Gx components are computed. In the later 4 cycles the Gy components are computed. For instance, as shown in the figure, the gradient vector of S4 cannot be found until rows 1 and 2 and columns 1 and 2 are processed. A register array is used to keep the partial results. Adders are necessary for the accumulation of partial results. In order to minimize the circuit area, 3 shared adders are used instead of 9 individual adders for all the gradient vector registers.
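The row-then-column accumulation can be modeled in Python and compared with a direct evaluation of the 2×2 filter. This is a behavioral sketch only: it models the 4 row passes and 4 column passes mentioned above, not the register-level design of Figure 6, and it assumes that the first list index is the row. All function names are hypothetical.

```python
def gradients_by_rows_and_columns(block):
    """Accumulate Gx from row passes and Gy from column passes (3 subtractions each)."""
    gx = [[0] * 3 for _ in range(3)]
    gy = [[0] * 3 for _ in range(3)]
    # Row passes: 3 differences between neighbouring pixels of row r.
    for r in range(4):
        diffs = [block[r][c + 1] - block[r][c] for c in range(3)]
        for i in (r - 1, r):            # sample rows whose 2x2 window covers row r
            if 0 <= i < 3:
                for j in range(3):
                    gx[i][j] += diffs[j]
    # Column passes: 3 differences between neighbouring pixels of column c.
    for c in range(4):
        diffs = [block[r + 1][c] - block[r][c] for r in range(3)]
        for j in (c - 1, c):            # sample columns whose window covers column c
            if 0 <= j < 3:
                for i in range(3):
                    gy[i][j] += diffs[i]
    return gx, gy


def gradients_direct(block):
    """Direct evaluation of the 2x2 filter for comparison."""
    gx = [[(block[i][j + 1] + block[i + 1][j + 1]) - (block[i][j] + block[i + 1][j])
           for j in range(3)] for i in range(3)]
    gy = [[(block[i + 1][j] + block[i + 1][j + 1]) - (block[i][j] + block[i][j + 1])
           for j in range(3)] for i in range(3)]
    return gx, gy


block = [[1, 2, 4, 8], [3, 5, 7, 9], [0, 6, 2, 5], [9, 1, 3, 7]]
accumulated = gradients_by_rows_and_columns(block)
direct = gradients_direct(block)
```

Each sample point receives contributions from exactly two row passes and two column passes, which is why the centre sample needs rows 1, 2 and columns 1, 2 before its vector is complete.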

Figure 4. Block diagram of the VLSI architecture of Fast IPMD

Like the intra-frame coding, the Fast IPMD is also performed on the basis of 4×4 blocks. Therefore a 4×4 block buffer is used and can be shared with the intra-frame coder. The core of this design is a featured 4×4 processing unit based on a 3-stage job interleaving scheme. It is responsible for the prediction MD for each 4×4 block, and the result is an array stored in the intra coding mode buffer, with each bit of it representing a prediction mode (‘0’ and ‘1’ for “disabled” and “enabled”, respectively). The intra-frame coder will only use those enabled prediction modes so that further speed up


As our target applications are real-time H.264 encoding and MPEG2/H.264 transcoding for HDTV, it is important to discuss the integration of the proposed Fast IPMD hardware accelerator with an H.264 encoder or a video transcoder. Since the speed-up is mainly guaranteed by the algorithm itself, the evaluation of the hardware implementation should focus on the overhead in terms of extra processing time and hardware resources. The logic synthesis results confirm that only a modest hardware cost is introduced. As for the extra processing time, it was mentioned earlier that the Fast IPMD takes 9 cycles on average for a 4×4 block, and when it is interleaved with the intra-frame coding this overhead can be removed. With all the optimization efforts, the critical path is minimized and the circuit area is reduced as well. The design in reference [8], which is dedicated to CIF/SD applications, has a maximum operating frequency of 55MHz. In later designs for HDTV (1280×720p), the maximum frequency is raised to 105MHz [9]. However, for our target, i.e. HDTV (1920×1088p), such a frequency cannot meet the performance requirements with existing coding algorithms. Since the estimated maximum operating frequency (on the logic synthesis level) is 200MHz, this work is expected to be a favorable contribution to the real-time H.264 encoder design for HDTV.
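To put the frequency figures in perspective, the following Python sketch estimates the cycle budget per MB at the target resolution. It assumes a frame rate of 30 fps, a 200MHz clock shared by the whole encoder, and 16 luma 4×4 blocks per MB at 9 cycles each; it is back-of-the-envelope arithmetic, not a measured result, and the function name is hypothetical.

```python
def cycles_per_macroblock(width, height, fps, clock_hz):
    """Clock cycles available per 16x16 macroblock for real-time operation."""
    mbs_per_frame = (width // 16) * (height // 16)
    return clock_hz / (mbs_per_frame * fps)


mbs_per_frame = (1920 // 16) * (1088 // 16)
budget = cycles_per_macroblock(1920, 1088, 30, 200_000_000)
ipmd_luma_cycles = 16 * 9            # 16 luma 4x4 blocks, 9 cycles each
share = ipmd_luma_cycles / budget    # fraction of the budget if not interleaved
```

Under these assumptions the luma part of the Fast IPMD would occupy under a fifth of the per-MB budget even without interleaving, which is consistent with the overhead being hidden behind the longer intra-frame coding.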

Figure 6. Architecture of the gradient vector calculator

According to the proposed algorithm, for each gradient vector, the mode matching error array is calculated and the cells corresponding to the minimum and the second-minimum errors are updated, which is done by the linear histogram update unit. Although a binary tree structure is the fastest way to find a minimum out of multiple candidates, finding the second minimum at the same time is not easy. Based on the principle of intra prediction, an optimized structure is proposed, as shown in Figure 7.

4. CONCLUSION

This paper presents a VLSI implementation of our proposed edge based fast intra MD algorithm, which can save as much as 66% of the RDO related computation with negligible quality loss in terms of PSNR and bit overhead. Compared with other representative edge based algorithms, our method is more hardware-friendly as it does not rely on any complicated mathematical operations, e.g. multiplication, division or trigonometric calculation. It is implemented with TSMC CMOS 0.18 μm technology at a minor hardware cost of 15.8k gates. The featured 3-stage job interleaving scheme helps to minimize the extra processing time introduced by the algorithm, and the estimated maximum working frequency is as high as 200MHz. Hence it is expected to be a favorable contribution to the real-time H.264 encoding or MPEG2-H.264 transcoding of HDTV (1920×1088p@30fps).

Since the mode with the second-minimum error must be a neighbor of the mode with the minimum error, the minimum search is first performed among errors 0, 1, 3 and 4, corresponding to Modes 0, 1, 3 and 4, because the minimum and the second minimum cannot both lie within this set. The result, i.e. cell_no1, is then used to find cell_no2, which is its minimum neighbor. Compared with an 8-point binary-tree structure, which has 3 levels and 7 minimum selectors (i.e. comparators) but can only find the minimum at a time, the proposed 2-level structure finds both the minimum and the second minimum with 7 comparators and is 1.5 times faster.
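The two-level search can be sketched in Python as follows. The neighbor table is an assumption derived from the angular order of the H.264 4×4 prediction directions; the actual wiring of Figure 7 is not reproduced here, and the function names and the example error array are hypothetical.

```python
import math

PRINCIPAL = (0, 1, 3, 4)   # modes compared in the first level
# Assumed angular neighbours of each first-level mode.
NEIGHBOURS = {0: (5, 7), 1: (6, 8), 3: (7, 8), 4: (5, 6)}


def two_level_min(err):
    """Return (cell_no1, cell_no2): best first-level mode and its best neighbour."""
    cell_no1 = min(PRINCIPAL, key=lambda m: err[m])
    cell_no2 = min(NEIGHBOURS[cell_no1], key=lambda m: err[m])
    return cell_no1, cell_no2


def two_smallest(err):
    """Reference result: the two smallest error values by exhaustive sorting."""
    return sorted(err.values())[:2]


# Example error array in which Modes 4, 5 and 6 are excluded by the sign test.
err = {0: 5.0, 1: 2.875, 3: 2.0, 4: math.inf,
       5: math.inf, 6: math.inf, 7: 1.0, 8: 2.5}
cell_no1, cell_no2 = two_level_min(err)
found = sorted([err[cell_no1], err[cell_no2]])
```

Note that cell_no1 is not necessarily the overall minimum: in the example the overall minimum is Mode 7 and cell_no1 is Mode 3. This is harmless when both selected cells are updated.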

5. ACKNOWLEDGMENTS

This work is supported by the funds of the MEXT via Kitakyushu innovative cluster projects and CREST, JST.

6. REFERENCES

Figure 7. Histogram cell update control

[1] Joint Video Team, Draft ITU-T recommendation and final draft international standard of joint video specification, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC, May 2003.

3.5 VLSI Implementation Report

This design is implemented with TSMC CMOS 0.18 μm technology. The logic synthesis is done with Synopsys’s Design Compiler. The implementation result is summarized in Table 1.

[2] Pan, F., Lin, X., Rahardja, S., Lim, K. P., Li, Z. G., Wu, D. J. and Wu, S. Fast mode decision algorithm for intraprediction in H.264/AVC video coding, IEEE Trans. Circuits Syst. Video Tech., 15, 7, (Jul. 2005), 813-822.

Table 1. VLSI implementation report

Technology          TSMC CMOS 0.18 μm
Total gate count    15.8k
Local SRAM          none
Max. Frequency      200MHz (logic synthesis)

[3] Liao, N., Quan, Z. Y. and Men, A. D. Enhanced fast mode decision based on edge map and motion detail analysis for H.264/JVT, In Proceedings of the IEEE International Workshop on VLSI Design and Video Technology, (May 2005), 187-190.


[7] JVT Test Model Ad Hoc Group, Evaluation sheet for motion estimation, Draft version 4, Feb. 19, 2003.

[4] Kalva, H., Petljansk, B. and Furht, B. Complexity reduction tools for MPEG-2 to H.264 video transcoding, WSEAS Trans. Info. Science & Appli., 2, (Mar. 2005), 295-300.

[8] Huang, Y.W., Hsieh, B.Y., Chen, T.C. and Chen, L.G. Analysis, fast algorithm, and VLSI architecture design for H.264/AVC intra frame coder, IEEE Trans. Circuits Syst. Video Tech., 15, 3, (Mar. 2005), 378-401.

[5] Jiang, G.Y., Li, S.P., Yu, M. and Li, F.C. An efficient fast mode selection for intra prediction, In Proceedings of the IEEE International Workshop on VLSI Design and Video Technology, (May 2005), 357-360.

[9] Huang, Y.W. et al., A 1.3TOPS H.264/AVC Single Chip Encoder for HDTV Applications, In Digest of Technical Papers of the IEEE International Solid-State Circuit Conference (ISSCC05) (Feb. 2005). 128-129.

[6] Wang, J.F., Wang, J.C., Chen, J.T., Tsai, A.C. and Paul, A. A novel fast algorithm for intra mode decision in H.264/AVC encoders, In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS 2006) (May 2006). 3498-3501.

