32-PARALLEL SAD TREE HARDWIRED ENGINE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION IN HDTV1080P REAL-TIME ENCODING APPLICATION

Zhenyu Liu, Yang Song, Ming Shao, Shen Li, Lingfeng Li, Satoshi Goto and Takeshi Ikenaga

IPS, Waseda University, N355, 2-7, Hibikino, Wakamatsu, Kitakyushu, 808-0135, Japan

This work was supported by a fund from CREST, JST.

ABSTRACT

The H.264/AVC coding standard incorporates variable block size (VBS) motion estimation (ME) to improve compression efficiency. For HDTV-1080p applications, the massive computation and the huge memory bandwidth caused by the large video frame size and the wide search range are two critical impediments to real-time hardwired VBSME engine design. In this paper, we present six techniques to circumvent these difficulties. First, the inter modes below 8 × 8 are eliminated in our design to reduce the hardware cost. Second, a low-pass filter based 4:1 down-sampling algorithm reduces about 75% of the arithmetic computation at each search position. Third, a coarse to fine search scheme reduces the number of search candidates by 25% to 50%. Fourth, the Level C+ memory organization is adopted to reduce the external IO bandwidth. Fifth, a horizontal zigzag scan mode optimizes the search window memories. Finally, in the circuit design, a 4:2 compressor based CSA tree, multi-cycle path delay and a 2-stage pipelined SAD tree are utilized to improve the speed and reduce the hardware of each SAD tree. The hardwired integer motion estimation (IME) engine with a 192 × 128 search range for HDTV1080p@30Hz is demonstrated in this paper. With TSMC 0.18μm 1P6M CMOS technology, it is implemented with 485.7k gates of standard cells and 327.68k bits of on-chip memory. The power dissipation is 729mW at a 200MHz clock speed.

Index Terms— H.264/AVC, variable block size, integer motion estimation, VLSI, HDTV1080p

1. INTRODUCTION

Variable block size motion estimation (VBSME) with multiple reference frames (MRF) is an advanced technique adopted by the latest international video coding standard, H.264/AVC [1]. When encoding one macroblock (MB), motion estimation (ME) is conducted on different block sizes, including 4 × 4, 4 × 8, 8 × 4, 8 × 8, 8 × 16, 16 × 8 and 16 × 16, in multiple reference frames. Compared with the previous fixed block size ME process, VBSME can achieve a higher compression ratio and better video quality. However, it puts a heavy burden on the ME unit and makes traditional hardware architectures incompatible. Because of the intensive computation of MRF-VBSME, a hardware accelerator is critical for a real-time encoding system, especially for HDTV applications. Many VBSME hardwired engine designs have been published in this field [2][3][4][5][6]. Among these proposed architectures, three works provide superior performance in different applications, namely Propagate Partial SAD [3], SAD Tree [5] and Parallel Sub-Tree [6]. The Parallel Sub-Tree architecture has the finest extension grain and the fastest clock speed.



When fewer than 256 PEs are required, Parallel Sub-Tree needs fewer memory partitions than the Propagate Partial SAD and SAD Tree counterparts. When 256 PEs are configured, Propagate Partial SAD has the most efficient datapath, so it is suitable for low and middle resolution video sequences. For HDTV applications, the massive computation makes a high degree of parallelism essential. For example, in ref. [5], 8 sets of SAD Tree are required for a real-time HDTV720p encoder when the clock speed is 108MHz. Because the reference pixel registers can be shared by adjacent SAD Trees, the parallel SAD Tree consumes less chip area than the parallel Propagate Partial SAD. However, in comparison with the two counterparts above, the original parallel SAD Tree architecture has some drawbacks. First, its critical path is longer than those of Propagate Partial SAD and Parallel Sub-Tree, which prevents it from working at a high clock frequency. Second, from the analysis in section 3, we observe that its snake scan method causes the dilemma of many memory partitions and low IO utilization for the search window memories.

Our design target is HDTV1080p@30Hz with a 192 × 128 search range and one reference frame. With the exhaustive integer motion estimation (IME) search algorithm, the computation complexity is 6.8 times that of the design in [5]. The straightforward implementation is simply increasing the number of SAD tree sets. For our system specifications, 8 × 6.8 ≈ 56 sets of SAD tree are required. Consequently, from [5], the consumed hardware is 330.2k × 7 = 2311.4k gates. So much hardware not only consumes non-trivial chip area and power but also brings much trouble to the frontend and backend design.

In order to further reduce the hardware cost of the VBSME engine, six optimization methods are proposed in this paper. On the algorithm level, we provide three schemes: (1) inter modes 5, 6 and 7 are eliminated in our design; with the low-complexity mode decision algorithm preferred for real-time VLSI implementation, this method even contributes to the compression performance; (2) a low-pass filter based 4:1 down-sampling algorithm is adopted; (3) based on the sub-sampling approach, a coarse to fine search window adjustment is provided to reduce the number of search positions. The search window memory organization is another critical issue. In this field, two optimizations are proposed, i.e., the Level C+ data reuse scheme and the horizontal zigzag scan mode. At last, we provide some circuit level optimizations, which include a 4:2 compressor based 2-stage SAD tree and multi-cycle path delay. With all these approaches, the VBSME engine of our design consumes just 485.7k gates and its clock speed is 200MHz.

The rest of this paper is organized as follows. The algorithm level optimizations are proposed in section 2. The Level C+ search window memory and the horizontal zigzag scan mode are described in section 3. In section 4, the circuit design is presented. At last, conclusions are drawn in section 5.

2. ALGORITHM OPTIMIZATIONS

The hybrid inter-frame coding techniques adopted by H.264/AVC are mainly composed of 1/4-pel accurate separable 2-D Wiener interpolation [7], multiple reference frames [8][9] and variable block size motion estimation [10]. Pronounced coding performance improvement comes from these advanced techniques; however, at the same time, the ME computation complexity is increased more than 10 times. According to the analysis in [11], 89.2% of the computation power in H.264 is consumed by the ME part. If no optimization is adopted in the hardwired ME engine implementation, the required hardware cost and power consumption are too large for commercial applications with current technology.


2.1. Reference Frame Reduction

For HDTV applications, as the density of the sampling array increases, the prediction error caused by the displacement estimation error is alleviated. Under this condition, we can reduce the number of reference frames. In the following paragraphs, we first present the mathematical analysis and then give the experimental results on reference frame reduction.

In order to simplify the mathematical description, the analysis is first restricted to a one-dimensional spatial signal. As shown in Fig. 1, lt(x) and lt−1(x) denote the spatially continuous signals at time instances t and t − 1. lt(x) is a displaced version of lt−1(x) with displacement dx, which can be expressed as lt(x) = lt−1(x − dx). These continuous image signals are sampled by the sensor array before digital processing. The interval of the spatial samplers is denoted as sx. The displacement error can be expressed as Δx = dx − round(dx/sx) · sx.

Fig. 1. 1-D prediction error analysis

From Fig. 1, it can be observed that the prediction error e can be approximated as

e ≈ Δx · l′t−1(xn)    (1)

where the displacement error Δx is a random variable with zero mean, Δx ∈ [−sx/2, sx/2]. The mean of the prediction error is

E(e) = E(Δx) · l′t−1(xn) = 0    (2)

When Δx = ±sx/2, |e| reaches its maximum value sx · |l′t−1(xn)| / 2, and when Δx = 0, |e| vanishes. It is assumed that the prediction error e is a memoryless stationary Gaussian source with zero mean and variance σ². The variance σ² is expressed as

σ² = E(e²) = sx² · (l′t−1(xn))² / 12    (3)

Therefore, when the interval of the spatial samplers sx is decreased, the prediction error variance σ² caused by the estimated displacement error is also reduced. According to the analysis of [8][9], the MRF algorithm is mainly devised to reduce this kind of prediction error. In detail, if the displacement dx,t−1 between the current image st(xn) and the previous image st−1(xn) is sub-pel while an earlier image st−k(xn) has a full-pel displacement dx,t−k, then st−k(xn) is preferred as the reference because the displacement error problem does not exist there. Based on (3), since sx of HDTV is much smaller than that of QCIF and CIF, it can be derived that its prediction error is not as serious as that of QCIF and CIF. Moreover, with the 2-D Wiener interpolation filter in H.264, this prediction error can be further reduced. So it is assumed that discarding the MRF algorithm will not seriously degrade the coding performance.

In order to verify this assumption, we test the MRF effect on the HDTV1080p "station" and "sunflower" sequences. The simulation conditions are: QP = 16, 20, 24, 28, 32, 36; [−128, 128] search range; IPPP; RDO ON; 90 frames encoded. The RD curve comparisons are shown in Fig. 2. It is clearly demonstrated that even at high bitrate, the PSNR loss caused by the reference frame reduction is only about 0.1dB.

Fig. 2. MRF RD curve comparisons for "station" and "sunflower"
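The step from (2) to (3) implicitly treats Δx as uniformly distributed over [−sx/2, sx/2]; a brief restatement of that step, in our own notation rather than the paper's:

% Variance of the prediction error under a uniformly distributed
% displacement error (our restatement of the step leading to (3)).
\begin{align*}
E(\Delta_x^2) &= \int_{-s_x/2}^{s_x/2} \frac{\xi^2}{s_x}\,d\xi = \frac{s_x^2}{12},\\
\sigma^2 = E(e^2) &\approx E(\Delta_x^2)\,\bigl(l'_{t-1}(x_n)\bigr)^2 = \frac{s_x^2\,\bigl(l'_{t-1}(x_n)\bigr)^2}{12}.
\end{align*}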


2.2. Inter Mode Reduction

H.264/AVC has 7 inter modes and 13 intra modes to code the luminance component of an MB. In the JVT reference software, when rate distortion optimization (RDO) is on, the encoder encodes an MB with all modes, and the mode with the least rate distortion cost is designated as the final coding mode. However, such huge computation is too hardware- and power-consuming for a real-time implementation. So, the reference software also provides a low-complexity mode decision scheme. In the low-complexity mode, RDO is turned off. The cost of each mode is computed using some biases and the Sum of Absolute Differences (SAD) of the prediction errors, or the Hadamard transformed coefficients (SATD) of the residues. The drawback of this mode decision is that its inaccurate RD costs often cause a suboptimal mode to be chosen. However, in order to compromise between hardware cost and performance, the low-complexity mode decision algorithm is applied in our design. Under this condition, the sub-partitions below 8 × 8 are discarded. The simulation results are shown in Fig. 3. We observe that in most cases this approach even improves the coding quality. This fact comes from two factors. First, for HDTV images, one MB is seldom partitioned below the 8 × 8 sub-MB level. Second, when RDO is off, eliminating these futile modes avoids more erroneous mode decisions.
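For illustration only, the shape of such a low-complexity cost (not the exact JM bias terms, and omitting the SATD option) can be sketched in C as:

#include <stdlib.h>
#include <stdint.h>

/* Illustrative shape of the low-complexity (RDO-off) mode cost: distortion
 * measured as the SAD of the prediction error plus a mode-dependent bias.
 * The bias values and the SATD alternative used by the reference software
 * are not reproduced here; names and interfaces are ours. */
static int sad_block(const uint8_t *cur, const uint8_t *pred,
                     int w, int h, int stride)
{
    int sad = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            sad += abs((int)cur[y * stride + x] - (int)pred[y * stride + x]);
    return sad;
}

static int mode_cost(const uint8_t *cur, const uint8_t *pred,
                     int w, int h, int stride, int mode_bias)
{
    /* The mode with the smallest cost is chosen when RDO is off. */
    return sad_block(cur, pred, w, h, stride) + mode_bias;
}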

2.3. Search Range Analysis

For the hardware IME engine design, the search range is another critical issue. From the viewpoint of coding performance, a large search range is preferred, especially for video sequences with fast motion. However, a large search range is a nightmare for the hardware design because it implies a massive calculation burden. In fact, later in this paper we observe that the search range definition also affects the search window memory organization and the search scan mode. Considering all these issues, the search range is defined as [−96, 95] × [−64, 63] and the search center is fixed at the origin [0, 0] to realize the search window data reuse. The coding performance comparisons with the JM11.0 reference software are shown in Fig. 4.

Fig. 3. Mode reduction comparisons for "station" and "sunflower" (QP=16, 20, 24, 28, 32, 36; [−128, 128] search range; 1 reference picture; IPPP; RDO OFF; 90 frames encoded)

Fig. 4. Search range comparisons for "station" and "pedestrian area" (QP=16, 20, 24, 28, 32, 36; IPPP; RDO ON; 1 reference picture; 90 frames encoded)

Fig. 5. Coarse to fine search window adjustment

2.4. Low-Pass Filter Based Down Sampling

In block matching algorithms, down sampling is commonly adopted to reduce the computation, but it also causes a signal aliasing problem. According to [12], a low-pass filter based sub-sampling method can alleviate this adverse effect. So in our design, the Haar low-pass filter based 4:1 down sampling algorithm is applied: 75% of the SAD calculation cost is saved, while the computation introduced by the low-pass filter is negligible. The 2 LSBs of the low-pass filter outputs are truncated to further optimize the hardware implementation.
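As an illustration of this step, a minimal C sketch of the 4:1 Haar down-sampling with 2-LSB truncation (our reading of the truncation step; buffer layout and names are ours):

#include <stdint.h>

/* Sketch of the Haar low-pass based 4:1 down-sampling of section 2.4.
 * Each 2x2 block of the original frame is low-pass filtered (summed), and
 * truncating the 2 LSBs of the 10-bit result is equivalent to taking the
 * 2x2 average, keeping the sample at 8 bits.  This interpretation of the
 * truncation step is ours; the interface is illustrative. */
void haar_downsample_4to1(const uint8_t *src, int src_w, int src_h,
                          int src_stride, uint8_t *dst, int dst_stride)
{
    for (int y = 0; y < src_h / 2; y++) {
        for (int x = 0; x < src_w / 2; x++) {
            const uint8_t *p = src + 2 * y * src_stride + 2 * x;
            uint16_t sum = p[0] + p[1] + p[src_stride] + p[src_stride + 1];
            dst[y * dst_stride + x] = (uint8_t)(sum >> 2); /* drop 2 LSBs */
        }
    }
}

Applied to both the current MB and the search window, each candidate then needs 64 instead of 256 absolute differences, which is the source of the 75% saving quoted above.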

2.5. Coarse to Fine Search Strategy

The down sampling method reduces the computation complexity at each search candidate. In this subsection, the proposed coarse to fine search strategy saves search candidates. The detailed search steps are described below, and a diagrammatic representation is shown in Fig. 5.

1. According to their indexes, the search positions are categorized into 4 sets, namely [even column, even row], [even column, odd row], [odd column, even row] and [odd column, odd row]. ME is first processed on the candidates with [even column, even row]. This procedure is denoted as the coarse search stage. After this stage, we obtain the coarse search motion vector of the 16 × 16 block, which is expressed as Coarse_MV16×16.

2. The fine search area for the other three kinds of search positions is defined as {[x, y] | (−32 ≤ x < 32) ∪ (−32 ≤ x − Coarse_MV16×16[x] < 32)}.

It should be noticed that, considering the hardware implementation complexity, we do not compress the fine search areas in the y direction. When Coarse_MV16×16[x] = ±64, the two fine search areas have no overlap and 25% of the search positions are saved. In the best case, where Coarse_MV16×16[x] = 0, the two fine search areas totally overlap and 50% of the search positions are saved (a small counting sketch is given below). In section 3, we will see that this coarse to fine search method also contributes to the internal IO bandwidth reduction of the search window memories.
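A small, self-contained counting sketch of this scheme (candidate grid and ranges as defined above; the code is ours and only reproduces the 25%-50% figures):

#include <stdio.h>

/* Counting sketch for the coarse-to-fine scheme of section 2.5.
 * Full search range: x in [-96, 95], y in [-64, 63] (192 x 128 = 24576
 * positions).  The coarse pass visits only the [even column, even row]
 * class over the whole range; the other three classes are visited only
 * where (-32 <= x < 32) OR (-32 <= x - coarse_mv_x < 32); the y range is
 * not compressed.  All names here are illustrative. */
static int in_fine_x_range(int x, int coarse_mv_x)
{
    return (x >= -32 && x < 32) ||
           (x - coarse_mv_x >= -32 && x - coarse_mv_x < 32);
}

static int count_candidates(int coarse_mv_x)
{
    int total = 0;
    for (int y = -64; y < 64; y++)
        for (int x = -96; x < 96; x++) {
            int even_col = (x % 2 == 0), even_row = (y % 2 == 0);
            if (even_col && even_row)
                total++;                          /* coarse pass */
            else if (in_fine_x_range(x, coarse_mv_x))
                total++;                          /* fine pass   */
        }
    return total;
}

int main(void)
{
    /* 12288 of 24576 (50% saved) when the coarse MV is 0,
     * 18432 of 24576 (25% saved) when it is +/-64.        */
    printf("%d %d\n", count_candidates(0), count_candidates(64));
    return 0;
}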

3. MEMORY ORGANIZATION

Since the search window memories consume non-trivial hardware cost and power dissipation, the memory organization is a critical issue for the system design. In this section, the Level C+ data reuse scheme, the memory mapping algorithm and the horizontal zigzag scan mode are proposed to reduce the memory partitions and the external and internal IO bandwidth. Consequently, both the system hardware cost and the power consumption are improved.

In order to reduce the external IO bandwidth for the search window data refilling, the Level C+ data reuse scheme is adopted by our design [13]. With this approach, the overlapped search area data can be fully reused in the horizontal direction and partially reused in the vertical direction. Considering the data dependency among neighboring MBs, we use the HF3V2 n-stitched zigzag scan mode [13] in our design, as shown in Fig. 6(a). In this way, about 44.5% of the SRAM accesses for the search window update can be saved. Different from Level C data reuse, in our design the search area memory is further extended by 16 pixels in the horizontal direction. This area is denoted as the "Refilling Buffer", as shown in Fig. 6(b). Consequently, the ME of "CBn+3" can be processed in parallel with the data refilling for "CBn+4".

Fig. 6. HF3V2 n-stitched zigzag scan order and its data reuse: (a) HF3V2 n-stitched zigzag scan order; (b) search area data reuse scheme

The memory mapping algorithm and the search scan mode directly affect the hardware overhead of the search window. The hardware cost of one memory is mainly determined by three factors: (1) the storage cell array; (2) the address decoder logic; (3) the peripheral logic in the form of sense amplifiers, prechargers and write drivers. The volume of the storage cell array depends on the search area size. The storage cell array volume of our design is decided by the search range and the Level C+ HF3V2 n-stitched zigzag scan order.


The hardware cost of the cell array cannot be optimized by our method; our memory organization is instead targeted at reducing the number of memory partitions and the IO bandwidth, so the chip area consumed by the address decoder logic and the peripheral logic is saved. If we apply the snake scan mode proposed in [5], as shown in Fig. 7, the IO bandwidth is 256 × 8 bits. With the Artisan 0.18μm Memory Compiler, the maximum IO width of the generated SRAM is 128 bits, so 256 × 8/128 = 16 partitions are required for the search window memory, which account for 26672.3 × 16 = 426756.8 gates. Moreover, a multiplexer array, composed of 71 × 8 32-to-1 multiplexers, is needed to select the read data from the search window into the reference pixel buffer. At the beginning stage of the scan in the column direction, (64 + 16)/16 = 5 partitions are active and the memory IO utilization is 31.25%. During the last 8 cycles of this column scan, 4 more memory partitions are enabled to fill the extra reference pixel buffer for the horizontal shift; even at this stage, the IO utilization is only 56.25%. On average, the IO utilization is 112/128 × 31.25% + 16/128 × 56.25% = 34.37%. In fact, the situation is even worse: because of the down sampling algorithm, just 50% of the data read from the search window is utilized. Namely, the real IO utilization is only 34.37%/2 = 17.18%.
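The utilization figures above can be reproduced with the following short check (the cycle weights are taken directly from the text):

#include <stdio.h>

/* Reproduces the average IO utilization quoted for the snake scan baseline:
 * 16 partitions of 128 bits; 5 partitions active at 31.25% utilization and
 * 9 partitions active at 56.25%, combined with the 112/128 and 16/128 cycle
 * weights used in the text, and then halved because only every second pixel
 * survives the 4:1 subsampling. */
int main(void)
{
    double u_start = 5.0 / 16.0;                                   /* 31.25% */
    double u_end   = 9.0 / 16.0;                                   /* 56.25% */
    double u_avg   = (112.0 / 128.0) * u_start + (16.0 / 128.0) * u_end;
    printf("average read utilization: %.2f%%\n", 100.0 * u_avg);        /* 34.37 */
    printf("after 4:1 subsampling   : %.2f%%\n", 100.0 * u_avg / 2.0);  /* 17.18 */
    return 0;
}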


Fig. 7. Snake scan mode and its memory access

In order to reduce the partitions and improve the IO utilization of the search window memory, two approaches are applied in our design. First, we rearrange the reference pixels in the search window memory. As mentioned in subsections 2.4 and 2.5, because of the 4:1 down sampling algorithm, the reference pixels in the search window can be divided into 4 categories according to their column and row indexes, namely the pixels at [even column, even row], [even column, odd row], [odd column, even row] and [odd column, odd row]. To simplify the analysis, it is assumed that the search window is 8 × 8 and that 4 × 4 fixed block size matching is processed in this area with the 4:1 down sampling algorithm. The search range is [−2, 1] × [−2, 1]. Matching on the [even column, even row] candidates only utilizes the reference pixels of the same category, which are represented with triangles in Fig. 8. So the pixels with the same attribute can be arranged in an address-successive memory area, as shown in Fig. 8. In this way, the search window height is halved, which contributes to the IO utilization of our horizontal scan scheme.
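A hedged C sketch of this rearrangement (bank layout and function names are ours, chosen only to illustrate the parity-based regrouping):

#include <stdint.h>

/* Sketch of the search window pre-mapping of Fig. 8: reference pixels are
 * regrouped into four banks according to the parity of their row and column,
 * so a matching pass of one parity class touches a single bank whose height
 * is half of the original window.  Bank layout and names are illustrative. */
void remap_search_window(const uint8_t *win, int w, int h, int stride,
                         uint8_t *bank[4], int bank_stride)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int b = (y & 1) * 2 + (x & 1);   /* bank index from (row, column) parity */
            bank[b][(y / 2) * bank_stride + (x / 2)] = win[y * stride + x];
        }
}

After this remapping, a read issued for the [even column, even row] pass only returns pixels it can actually use, which is what raises the read utilization in the horizontal scan described next.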


Fig. 8. Memory mapping algorithm

Second, the zigzag scan in the horizontal direction is made use of, as shown in Fig. 9(a). The reference pixel buffer consists of 6 pixel buffers. The outputs of "B00", "B01", "B10" and "B11" are fed into the SAD tree to calculate the distortion. Instead of reading the pixels of the same row, our design reads one column of pixels from the rearranged search window and pushes it into the reference pixel buffer. After 2 cycles of initialization, the snapshot of the reference buffer is illustrated as "step-1" of Fig. 9(b) and the distortion of search position [−2, −2] is derived. In "step-2", the vertical barrel shifters in the reference buffer take effect and the reference pixels of candidate [−2, 0] are moved to "B00", "B01", "B10" and "B11", so the distortion at [−2, 0] is calculated. The snapshot of this stage is shown as "step-2" of Fig. 9(b). Up to this point, the searches at the leftmost [even column, even row] column positions are accomplished. At "step-3", a new column of data is fetched from the search window and pushed into the reference buffer; in this cycle, we obtain the distortion of [0, −2]. The scan over the other candidates can be traced by analogy.


Fig. 9. Zigzag scan mode in horizon: (a) block diagram of the reference buffer; (b) snapshots of the reference buffer

With a similar method, our search window memory can be organized as in Fig. 10. Because of the Level C+ data reuse, we still need the MUX array, which is necessary for the n-stitched zigzag scan; for Level C data reuse, this MUX array could be eliminated. Because the maximum IO width is 128 bits for the Artisan TSMC 0.18μm memory compiler, in total 5 single port memory partitions are required to build up the search window. The format of each partition is 128b × 512w. The hardware cost of the search window accounts for 50765.8 × 5 = 253829 gates. Compared with the straightforward implementation, (1 − 253829/426756.8) × 100% = 40.52% of the hardware can be saved. In every 2 cycles, one column of pixels is fetched from the search window. So, the IO port utilization for read is increased up to 50%, which is 2.91 times the original one. The other 50% of the IO bandwidth can be kept for the data refilling procedure.


Fig. 10. Search window organization

4. CIRCUIT DESIGN

The target clock speed is 200MHz. Under this specification, 32 SAD trees are configured in the datapath to satisfy the throughput. A 2-stage pipelined SAD tree architecture is utilized to shorten the critical path, as shown in Fig. 11. The first stage is composed of 64 PEs and the 4:2 compressor trees. Each PE is in charge of the absolute difference operation of its corresponding pixel. Through the 4:2 compressor carry save adder (CSA) trees, the carry and sum vectors of the four 8 × 8 SADs are obtained and stored in the inter-stage registers. In the second stage, the VBS adder tree accumulates the derived carry and sum vectors of the four 8 × 8 blocks to get the SADs of the 8 × 8, 8 × 16, 16 × 8 and 16 × 16 blocks.

Fig. 11. 2-stage pipelined SAD Tree architecture

In order to reduce the hardware cost, we also optimize the circuits of the PEs and the CSA trees. As in [14], the absolute difference operation can be expressed as (4). In theory, |C − R| is equal to |R − C|, but the latter is preferred in this hardware implementation:

|R − C| = R + ~C + 1 if R > C, and |R − C| = ~(R + ~C) if R ≤ C,    (4)

where ~ denotes the bitwise complement. During the ME processing, the data of the current pixels are constant, so the timing constraints through these paths can be defined as multi-cycle delay paths. In detail, after the initialization of the current MB, one cycle of delay is inserted before the ME processing. Consequently, the setup time from the current MB registers is a two-cycle delay. This two-cycle timing constraint, i.e., 10ns, is very loose for 0.18μm technology, so DC can use the smallest standard cells to implement the circuits in the paths starting from the current MB register file.

The intuitive hardware implementation of this algorithm is shown in Fig. 12(a). The MSB 's8' of the sum from the first 8-bit adder is inverted and then bit-XORed with the remaining bits of the result; 's8' is then added to the XOR result to generate the absolute difference operation. In each SAD tree, the number of PEs is 64, so the last adder in every PE consumes non-trivial hardware overhead. One approach is to discard 'Cx', so that this adder in the PE can be saved. However, this causes a one-bit error in each PE, and in the worst case the accumulated error over all PEs is 64. In our design, as shown in Fig. 12(b), we do not apply the dedicated adder in each PE. 'Cx' and 'ABSx' of each PE are both output to the 4:2 compressor CSA tree. The addition between 'Cx' and 'ABSx' is merged into the CSA tree, and the dedicated adder in each PE is thus eliminated. In addition, the current pixels in the current MB register file have already been inverted, so the invertors in all PEs are also discarded. Because 32 SAD trees are installed in our design, with this approach 32 × 64 × 8 = 16384 invertors are saved.

Fig. 12. Hardware architecture of PE: (a) intuitive PE AD circuit; (b) our PE AD circuit

It is well known that the 4:2 compressor based CSA tree [15] has both speed and hardware cost advantages over the traditional full adder based Wallace tree [16]. Since the TSMC 0.18μm CMOS library provides a 4:2 compressor standard cell, the CSA tree for one 8 × 8 block is built up with 4:2 compressors, as illustrated in Fig. 13. The 32 operands, which include Cx (x: 0-15) and ABSx (x: 0-15), are compressed by the 4-stage compressor tree into carry and sum vectors, denoted as "C[11:0]" and "S[10:0]" in the figure.
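As a sanity check on this datapath, here is a behavioural C sketch, not the RTL; the helper names and the word-level carry-save model are ours (a 4:2 compressor cell is conventionally built from two chained 3:2 counters). It confirms that the PE output pair (ABSx, Cx) merged in carry-save form still reproduces |R − C| exactly:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Word-level model of the PE and of folding the "+Cx" correction into a
 * carry-save stage instead of a dedicated adder per PE. */
typedef struct { uint32_t abs8; uint32_t c; } pe_out_t;

static pe_out_t pe_ad(uint8_t r, uint8_t c)
{
    uint32_t s   = (uint32_t)r + (uint32_t)(uint8_t)~c;   /* R + ~C, 9 bits  */
    uint32_t s8  = (s >> 8) & 1u;                          /* 1 iff R > C     */
    uint32_t msk = s8 ? 0x00u : 0xFFu;                     /* invert if R<=C  */
    pe_out_t o = { (s & 0xFFu) ^ msk, s8 };                /* ABSx, Cx        */
    return o;
}

/* One carry-save (3:2) stage on whole words: a + b + c == sum + carry. */
static void csa(uint32_t a, uint32_t b, uint32_t c,
                uint32_t *sum, uint32_t *carry)
{
    *sum   = a ^ b ^ c;
    *carry = ((a & b) | (a & c) | (b & c)) << 1;
}

int main(void)
{
    /* Check: ABSx + Cx reduced through a carry-save stage equals |R - C|. */
    for (int r = 0; r < 256; r++)
        for (int c = 0; c < 256; c++) {
            pe_out_t p = pe_ad((uint8_t)r, (uint8_t)c);
            uint32_t s, cy;
            csa(p.abs8, p.c, 0, &s, &cy);        /* merge the +Cx correction */
            if (s + cy != (uint32_t)abs(r - c)) { printf("mismatch\n"); return 1; }
        }
    printf("PE model matches |R - C| for all 8-bit inputs\n");
    return 0;
}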


Fig. 13. 4:2 compressor based CSA tree

By these circuit optimizations, 11.6% of the hardware cost is saved at a 200MHz clock speed, and the hardware cost of each SAD tree is reduced to 12.1-12.2k gates. The top block diagram of our IME engine is presented in Fig. 14. In each cycle, 32 successive candidates in a column are searched by the PUs. The local minimum distortions among these 32 PUs are selected by the compare tree, denoted as "comp. tree", to update the minimum distortion registers and the corresponding MV registers. Our design is described in Verilog HDL and synthesized with Synopsys Design Compiler (DC).

Excluding the search window memories, the total gate count is 485.7k gates. The 32 SAD trees account for 387.2k gates. The reference pixel buffer and the current MB buffer together consume 45.7k gates. The remaining standard cells are consumed by the MVD cost generator, the comparator trees, the registers for the minimum SADs and MVs, and the control logic. The backend design is implemented with JupiterXT and Astro. Clock gating is utilized to save dynamic power. At a 200MHz clock frequency, the power dissipation is 729mW. The performance comparisons with a previous design are listed in Tab. 1. This VBS IME hardwired engine has been integrated in our HDTV1080p real-time encoder design [17]. The micrograph of the encoder SoC is shown in Fig. 15(a) and its coding performance is depicted in Fig. 15(b).

Table 1. Comparisons with previous design

Designs       | Chen [5]              | Ours
Technology    | TSMC 0.18μm 1P6M      | TSMC 0.18μm 1P6M
Clock         | 108MHz                | 200MHz
Video         | 720p@30Hz             | 1080p@30Hz
Search Range  | [−64, 63] × [−32, 31] | [−96, 95] × [−64, 63]
Gate count    | 330.2k                | 485.7k


Fig. 15. Encoder Chip Micrograph and Its Coding Performance

Fig. 14. VBS IME engine top block diagram

5. CONCLUSION

In this paper, we proposed a hardwired VBS IME engine for the HDTV1080p@30Hz real-time encoding application. The search range is 192 × 128 with one reference picture. On the algorithm level, inter mode trimming, low-pass filter based down sampling and coarse to fine search methods are applied to reduce the computation complexity. In the circuit design, we apply the 4:2 compressor based CSA tree and multi-cycle path techniques to reduce the SAD tree hardware. In order to improve its clock speed, the SAD tree uses a 2-stage pipeline architecture, and its maximum clock speed is 200MHz. With the horizontal zigzag scan mode and the memory mapping, the search window memory partitions are reduced to 5 and the IO utilization is 2.91 times that of the original scheme. With 0.18μm TSMC CMOS technology, the hardware cost is 485.7k gates of standard cells and 327.68k bits of on-chip memory. The power dissipation is 729mW at a 200MHz clock speed.

6. REFERENCES

[1] J. Ostermann, et al., "Video coding with H.264/AVC: tools, performance, and complexity," IEEE Circuits and Systems Magazine, vol. 4, no. 1, pp. 7–28, First Quarter 2004.

[2] S. Y. Yap and J. V. McCanny, "A VLSI architecture for variable block size video motion estimation," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 51, no. 7, pp. 384–389, October 2004.


[3] Y. W. Huang, et al., "Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264," in Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS '03), May 2003, vol. 2, pp. 796–799.

[4] M. Kim, I. Hwang, and S. I. Chae, "A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC 2005), January 2005, vol. 1, pp. 631–634.

[5] T. C. Chen, et al., "Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 6, pp. 673–688, June 2006.

[6] Z. Y. Liu, et al., "Fine-grain scalable and low memory cost variable block size motion estimation architecture for H.264/AVC," IEICE Transactions on Electronics, vol. E89-C, no. 12, pp. 1928–1936, December 2006.

[7] T. Wedi and H. G. Musmann, "Motion- and aliasing-compensated prediction for hybrid video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 577–586, July 2003.

[8] B. Girod, "Efficiency analysis of multihypothesis motion-compensated prediction for video coding," IEEE Transactions on Image Processing, vol. 9, no. 2, pp. 173–183, February 2000.

[9] B. Girod, "The efficiency of motion-compensating prediction for hybrid coding of video sequences," IEEE Journal on Selected Areas in Communications, vol. SAC-5, no. 7, pp. 1140–1154, August 1987.

[10] T. Wiegand, et al., "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, July 2003.

[11] Y. W. Huang, et al., "Analysis and complexity reduction of multiple reference frames motion estimation in H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 4, pp. 507–522, April 2006.

[12] Z. Y. Liu, et al., "Low-pass filter based VLSI oriented variable block size motion estimation algorithm for H.264," in ICASSP 2006 Proceedings, May 2006, vol. 2, pp. 253–256.

[13] C. Y. Chen, et al., "Level C+ data reuse scheme for motion estimation with corresponding coding orders," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 4, pp. 553–558, April 2006.

[14] J. Vanne, et al., "A high-performance sum of absolute difference implementation for motion estimation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 7, pp. 876–883, July 2006.

[15] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach," IEEE Transactions on Computers, vol. 45, no. 3, pp. 294–306, March 1996.

[16] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Transactions on Computers, vol. 13, no. 3, pp. 14–17, February 1964.

[17] Z. Y. Liu, et al., "A 1.41W H.264/AVC real-time encoder SoC for HDTV1080p," in Symposium on VLSI Circuits 2007, pp. 12–13, June 2007.
