IEICE TRANS. FUNDAMENTALS, VOL.E89–A, NO.12 DECEMBER 2006


PAPER

Special Section on VLSI Design and CAD Algorithms

A VLSI Architecture for Variable Block Size Motion Estimation in H.264/AVC with Low Cost Memory Organization

Yang SONG†a), Student Member, Zhenyu LIU††, Nonmember, Takeshi IKENAGA†, Member, and Satoshi GOTO†, Fellow

SUMMARY   A one-dimensional (1-D) full search variable block size motion estimation (VBSME) architecture is presented in this paper. By properly choosing the partial sum of absolute differences (SAD) registers and scheduling the addition operations, the architecture can be implemented with simple control logic and a regular workflow. Moreover, only one single-port SRAM is used to store the search area data. The design is realized in TSMC 0.18 µm 1P6M technology with a hardware cost of 67.6K gates. Under typical working conditions (1.8 V, 25°C), a clock frequency of 266 MHz can be achieved.
key words: variable block size motion estimation (VBSME), memory organization, H.264/AVC

Manuscript received March 8, 2006. Manuscript revised June 12, 2006. Final manuscript received July 24, 2006.
†The authors are with the Graduate School of Information, Production and Systems, Waseda University, Kitakyushu-shi, 808-0135 Japan.
††The author is with the Kitakyushu Foundation for the Advancement of Industry Science and Technology, Kitakyushu-shi, 808-0135 Japan.
a) E-mail: [email protected]
DOI: 10.1093/ietfec/e89-a.12.3594
Copyright © 2006 The Institute of Electronics, Information and Communication Engineers

Fig. 1   Variable block sizes in H.264/AVC.

1.   Introduction

H.264/AVC, an emerging video coding standard, was developed by the Joint Video Team (JVT) of ITU-T VCEG and ISO/IEC MPEG. The aims of H.264/AVC are to enhance compression efficiency and to provide a network-friendly video representation for various applications [1]. Compared with previous video standards, H.264/AVC provides up to 50% coding gain over a wide range of bit rates and video resolutions [2]. Many new techniques are adopted in H.264/AVC, including variable block size motion compensation, quarter-sample accurate motion compensation, multiple reference picture motion compensation, weighted prediction, and in-the-loop deblocking filtering [3].

In particular, H.264/AVC adopts variable block size motion estimation (VBSME), which means motion estimation (ME) is conducted on 7 block sizes, as illustrated in Fig. 1. Each macroblock (MB) has 4 block modes: 16 × 16, 16 × 8, 8 × 16, and 8 × 8. The 8 × 8 mode is further divided into 4 sub-modes, namely 8 × 8, 8 × 4, 4 × 8, and 4 × 4. For each MB, the coding costs of all the block sizes are calculated, and the block size with the smallest cost is chosen as the MB mode. Compared with previous fixed block size ME algorithms, VBSME provides a higher compression ratio and better video quality, but it also puts a heavy burden on the ME unit and makes traditional ME architectures unsuitable.

The 1-D systolic array [4] is widely used in low-end products because of its small hardware cost and adequate processing capability. In this paper, based on the traditional 1-D array, a 1-D VBSME architecture is presented for portable devices. By properly choosing the partial-SAD registers and adders, the proposed PE array can be realized with simple control logic and a regular workflow, and costs only about 51.7K gates. Moreover, only one single-port SRAM is required to store the search area data, while the PE array still achieves 100% utilization. In practice, this memory organization can also be applied to previous 1-D ME architectures without modification.

The rest of the paper is organized as follows. Section 2 reviews previous works and presents a classification. The proposed 1-D VBSME architecture is discussed in Sect. 3. The memory organization is explained in Sect. 4. The silicon implementation is presented in Sect. 5. Finally, conclusions are given in Sect. 6.

2.   Related Works

2.1 Previous VBSME Architectures

ME is the most computation-intensive part of the encoding process; therefore, a hardware accelerator is a must for real-time video applications. Several full search VBSME architectures [5]–[8] have been proposed for H.264/AVC. In these designs, the partial-SAD reuse methodology is adopted to reduce computational complexity: the SADs of smaller blocks are stored and accumulated to obtain the SADs of bigger ones.

Based on the Type-1 architecture in [9], an efficient 2-D VBSME architecture is proposed in [5], which adopts 1-D data broadcasting and 1-D partial-SAD reuse. If the pipeline latency is not taken into account, this architecture completes the VBSME of one search position in every clock cycle, and is thus well suited to high-end video applications. However, to achieve this processing capability, 256 PEs are scheduled to work in parallel, and the search area data memory is divided into 32 partitions. Because each SRAM module requires its own address interface and control logic, such a large number of small SRAMs costs more hardware and power than one or a few larger SRAMs [10].

Another 2-D design is proposed in [6], which has the same processing capability as [5]. By adopting preload registers and search data buffers inside every PE, the number of search area memory partitions is reduced to 16. However, this approach incurs considerable hardware overhead and complex memory management.

A VBSME architecture based on the tree ME architecture in [11] is proposed in [7]. It completely avoids extra overhead for the variable block sizes and has the same throughput as [5], [6]. In this architecture, all the current pixels are stored inside the PEs. In every cycle, reference pixels are fed in and all the distortions are simultaneously calculated by the 256 PEs. Sixteen 2-D adder trees, with adders in both the horizontal and vertical directions, are used to generate the sixteen 4 × 4 SADs; based on these 4 × 4 SADs, one additional adder tree calculates the SADs of the bigger blocks. However, for portable devices, with their smaller frame sizes and limited processing requirements, a 1-D ME architecture is preferable.

Based on the traditional 1-D ME architecture [4], a 1-D VBSME architecture is presented in [8]. In each PE, the 4 × 4 SADs are stored, and the SADs of bigger blocks are accumulated and then transferred over shared SAD buses. Compared with 2-D designs, this architecture is more flexible and requires only two memory partitions. However, to save SAD registers and adders, each PE has complex control logic and an irregular workflow, which is inefficient for hardware implementation.

2.2 Classification of VBSME Architectures

ME architectures have been studied for decades and many designs have been proposed. Various classifications have been presented based on different metrics, such as the dependency graph (DG) mapping method [12], the data reuse method [13], and the parallelism of search points [14]. However, these classifications do not reflect the impact of the variable block sizes in H.264/AVC. In VBSME architectures, the SADs of smaller blocks are directly reused to calculate the SADs of bigger ones; consequently, many registers are needed to store these partial SADs, which costs considerable hardware and power.

To emphasize the importance of the partial-SAD registers, a classification is proposed in this paper. VBSME architectures can be divided into 3 categories according to their partial-SAD registers. In Category I, each PE has its own partial-SAD registers. In Category II, several adjacent PEs share one partial-SAD register. In Category III, no partial-SAD registers are required. From the viewpoint of hardware cost, Categories II and III are preferable to Category I because fewer registers are used. But when other factors such as memory bandwidth and power consumption are taken into account, all three categories are useful for different video applications. Under this classification, the design in [8] and the proposed one fall into Category I because each PE has its own partial-SAD registers. The designs in [5], [6] belong to Category II because 4 adjacent PEs share one partial-SAD register. The design in [7] belongs to Category III because no partial-SAD registers are required.

3.   Proposed VBSME Architecture

3.1 Top-Level 1-D VBSME Architecture

In H.264/AVC, each MB has 7 block modes. Therefore, one simple ME method is to process each block mode individually and then choose the block mode with the smallest coding cost as the MB mode. However, because the SADs of smaller blocks are not reused for larger ones, this approach requires intensive computation. In hardware architectures, the partial-SAD reuse methodology is widely adopted: the SADs of the 4 × 4 blocks are directly reused to calculate the SADs of all bigger blocks. The proposed architecture also adopts this SAD reuse method to reduce computation.

The presented 1-D VBSME architecture is based on the traditional 1-D ME architecture [4], in which the reference frame data are broadcast, the current MB data are shifted through a buffer line, and the number of PEs equals the search area width. To explain the proposed architecture clearly, a relatively small search range of [−8, +7] in both directions is chosen. There are therefore 16 PEs in the architecture, each responsible for one search position, as shown in Fig. 2.

Fig. 2   1-D PE array VBSME architecture.
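As an informal illustration of this reuse principle (not part of the hardware description), the following Python sketch assumes the sixteen 4 × 4 SADs of one search position are indexed in raster order within the MB, following the block numbering of Fig. 3, and forms all 41 block SADs purely by addition:

    # Illustrative Python model of partial-SAD reuse (not the RTL). The sixteen
    # 4x4 SADs of one search position are assumed to be indexed in raster order:
    # index = 4*(block row) + (block column).
    def vbs_sads(sad4x4):
        assert len(sad4x4) == 16
        s = lambda idxs: sum(sad4x4[i] for i in idxs)
        blocks = {("4x4", i): sad4x4[i] for i in range(16)}
        for i in range(8):                      # 8x4: two horizontally adjacent 4x4 blocks
            r, c = divmod(i, 2)
            blocks[("8x4", i)] = s([4 * r + 2 * c, 4 * r + 2 * c + 1])
        for i in range(8):                      # 4x8: two vertically adjacent 4x4 blocks
            r, c = divmod(i, 4)
            blocks[("4x8", i)] = s([8 * r + c, 8 * r + c + 4])
        for i in range(4):                      # 8x8: a 2x2 group of 4x4 blocks
            r, c = divmod(i, 2)
            base = 8 * r + 2 * c
            blocks[("8x8", i)] = s([base, base + 1, base + 4, base + 5])
        blocks[("8x16", 0)] = blocks[("8x8", 0)] + blocks[("8x8", 2)]
        blocks[("8x16", 1)] = blocks[("8x8", 1)] + blocks[("8x8", 3)]
        blocks[("16x8", 0)] = blocks[("8x8", 0)] + blocks[("8x8", 1)]
        blocks[("16x8", 1)] = blocks[("8x8", 2)] + blocks[("8x8", 3)]
        blocks[("16x16", 0)] = blocks[("16x8", 0)] + blocks[("16x8", 1)]
        return blocks                           # 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 SADs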

For each PE, one pixel of the current MB and one pixel of the search area are loaded in every clock cycle. According to the pixel position, the generated absolute difference is added to the previous partial sum and stored in the corresponding SAD register. Each PE has 7 SAD output interfaces, which are used to transfer the obtained SADs; since every block mode has its own SAD interface, no multiplexer is needed inside the PE to share the output interfaces, which saves both hardware cost and delay. The SADs generated by each PE are transferred over the SAD bus network in a pre-defined order and compared with the minimum SAD registers, as shown in Fig. 2. Each MB has 41 blocks, as shown in Fig. 3, so there are in total 41 minimum SAD registers and 41 associated motion vector (MV) registers. Intuitively, 41 comparators would be needed, one per minimum SAD register; but by time multiplexing, these comparators can be shared and only 16 are required.

Fig. 3   The 41 blocks within one MB.

3.2 PE Unit Architecture

The PE unit logic architecture is illustrated in Fig. 4. An adder tree is used inside each PE, so the SADs of different block modes can be generated concurrently and a regular workflow is achieved. In practice, this adder tree can be effectively optimized and synthesized by Synopsys Design Compiler; after synthesis, it contains only three adders and the critical path passes through only two of them.

Fig. 4   PE unit logic architecture.

In the 1-D ME architecture, each PE is responsible for one search position. Therefore, every PE must store the generated SADs and transfer them at the proper time. Because each MB has 7 block modes and 41 blocks, two issues arise in the PE architecture design: (1) the large number of partial-SAD registers, and (2) the large number of SAD output interfaces. The key problem is to make a good tradeoff between hardware cost and processing speed. In our design, as shown in Fig. 4, 11 partial-SAD registers are adopted inside one PE, namely four 4 × 4 SAD registers, four 4 × 8 SAD registers, two 8 × 8 SAD registers and one 8 × 16 SAD register. Moreover, each PE has 7 SAD interfaces, so every block mode has its own output interface.

To explain the workflow clearly, PE0 in the 1-D array is used as an example. C(x,y) and R(x,y) denote the pixels of the current MB and of the search area, where x and y indicate the row and column positions, respectively. In clock cycle 0, C(0,0) and R(0,0) are loaded and their distortion is calculated. Because it is the first partial SAD of BLK4x4 0, the generated value is stored directly in register SAD4x4 0. In cycle 1, C(0,1) and R(0,1) are fetched and processed; the generated distortion is added to SAD4x4 0 and the sum is written back to SAD4x4 0. The same operations are performed in cycles 2 and 3. In cycles 4–7, 8–11 and 12–15, the generated distortions belong to BLK4x4 1, BLK4x4 2 and BLK4x4 3, respectively, so the same operations are performed with the corresponding partial-SAD registers. The operations continue, and the SADs of blocks BLK4x4 0–BLK4x4 3 are completed in cycles 51, 55, 59 and 63.
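The 4 × 4 accumulation just described can be summarized by a short cycle-level sketch. The Python model below is illustrative only: it is not the RTL, it covers only the 4 × 4 path (the 4 × 8, 8 × 8 and 8 × 16 registers of Fig. 4 accumulate the completed 4 × 4 SADs as listed in Table 1), and it assumes a row-major scan of the 16 × 16 current MB with the four SAD4x4 registers recycled column group by column group:

    # Illustrative cycle model of one PE (not the RTL); only the 4x4 path is shown.
    def run_pe_4x4(cur, ref):
        """cur, ref: 16x16 pixel arrays (current MB, candidate reference block)."""
        reg4x4 = [0, 0, 0, 0]          # the four recycled SAD4x4 registers of Fig. 4
        out = {}                       # 4x4 block index -> (final SAD, completion cycle)
        for cycle in range(256):
            row, col = divmod(cycle, 16)        # row-major scan of the current MB
            r = col // 4                        # register shared by one column of 4x4 blocks
            if row % 4 == 0 and col % 4 == 0:   # first pixel of a new 4x4 block
                reg4x4[r] = 0
            reg4x4[r] += abs(cur[row][col] - ref[row][col])
            if row % 4 == 3 and col % 4 == 3:   # last pixel: this 4x4 SAD is complete
                out[4 * (row // 4) + r] = (reg4x4[r], cycle)
        return out
    # For PE0 the completion cycles are 51, 55, 59, 63, 115, ..., 255, as in Table 1.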

Table 1   SAD generation sequence (PE0).

Clock Cycle   BLK4x4       BLK4x8      BLK8x4      BLK8x8      BLK8x16     BLK16x8     BLK16x16
51            BLK4x4 0
55            BLK4x4 1                 BLK8x4 0
59            BLK4x4 2
63            BLK4x4 3                 BLK8x4 1
115           BLK4x4 4     BLK4x8 0
119           BLK4x4 5     BLK4x8 1    BLK8x4 2    BLK8x8 0
123           BLK4x4 6     BLK4x8 2
127           BLK4x4 7     BLK4x8 3    BLK8x4 3    BLK8x8 1                BLK16x8 0
179           BLK4x4 8
183           BLK4x4 9                 BLK8x4 4
187           BLK4x4 10
191           BLK4x4 11                BLK8x4 5
243           BLK4x4 12    BLK4x8 4
247           BLK4x4 13    BLK4x8 5    BLK8x4 6    BLK8x8 2    BLK8x16 0
251           BLK4x4 14    BLK4x8 6
255           BLK4x4 15    BLK4x8 7    BLK8x4 7    BLK8x8 3    BLK8x16 1   BLK16x8 1   BLK16x16 0

In conclusion, during clock cycles 0–63, the SADs of the top four 4 × 4 blocks, namely BLK4x4 0–BLK4x4 3, are generated and delivered over their dedicated SAD output interfaces. The SADs of the other 4 × 4 blocks are calculated in the same manner.

The SADs of the 4 × 4 blocks are reused to calculate the SADs of the 4 × 8 blocks. In cycle 51, the SAD of BLK4x4 0, which is the upper part of BLK4x8 0, is calculated and stored in register SAD4x8 0. In clock cycle 115, the SAD of BLK4x4 4, the lower part of BLK4x8 0, becomes available; the newly generated SAD is added to the value stored in SAD4x8 0 to obtain the SAD of BLK4x8 0. The SADs of the other 4 × 8 blocks are processed in the same way.

To calculate the SADs of the 8 × 8 blocks, the SADs of the 4 × 8 blocks are required. In clock cycle 115, the SAD of BLK4x8 0 is generated and stored in register SAD8x8 0. In clock cycle 119, the SAD of BLK4x8 1 is also obtained; adding it to the value stored in SAD8x8 0 yields the SAD of BLK8x8 0. The SADs of BLK8x8 1, BLK8x8 2 and BLK8x8 3 are obtained in cycles 127, 247 and 255, respectively.

The SADs of the 8 × 16 blocks rely on the SADs of the 8 × 8 blocks. At clock cycle 119, the SAD of BLK8x8 0 is available and stored in register SAD8x16 0. In clock cycle 247, the SAD of BLK8x8 2 is also derived, and the two values are added to obtain the SAD of BLK8x16 0. The SAD of BLK8x16 1 is obtained in cycle 255 by the same method. The calculations of the remaining block modes are even simpler and are fully illustrated by Fig. 4. It can be seen that the SAD calculations are simple and regular, which is desirable for VLSI implementation. The complete SAD generation sequence of PE0 is given in Table 1.

3.3 SAD Bus Network Architecture

In the proposed PE architecture, each PE has 7 SAD interfaces for the 7 block modes, so there are in total 112 SAD interfaces for the 16 PEs. How to schedule the SADs on these interfaces and how to compare them with the 41 minimum SAD registers are the problems addressed by the SAD bus network.

The SAD bus network architecture is based on two observations. (1) For the same block, the SADs of adjacent PEs are obtained with an interval of 1 cycle. For instance, for BLK4x4 0 in Fig. 3, the SAD is generated in cycle 51 by PE0, in cycle 52 by PE1, and so on. The comparators can therefore be shared by adjacent PEs to save hardware cost. (2) The 7 block modes in H.264/AVC can be divided into 3 groups, as shown in Fig. 3. Because adjacent SADs of these modes in the same PE are generated with a time interval of 4 cycles, the 4 × 4 and 4 × 8 block modes are taken as Group1. For example, in PE0 the SADs of BLK4x4 0 and BLK4x4 1 are generated in cycles 51 and 55, a 4-cycle interval, and the SADs of BLK4x8 0 and BLK4x8 1 are generated in cycles 115 and 119, also a 4-cycle interval. For the same reason, the 8 × 4, 8 × 8 and 8 × 16 block modes are taken as Group2, and the 16 × 8 and 16 × 16 block modes as Group3.

Based on these two features, the SAD bus network architecture is proposed, as illustrated in Fig. 5.

Fig. 5   SAD bus network architecture.

Table 2   SAD4x4 data delivery (SAD4x4 BUS0–SAD4x4 BUS3).

Clock Cycle   SAD4x4 BUS0       SAD4x4 BUS1       SAD4x4 BUS2        SAD4x4 BUS3
51            BLK4x4 0 (PE0)
52            BLK4x4 0 (PE1)
53            BLK4x4 0 (PE2)
54            BLK4x4 0 (PE3)
55            BLK4x4 1 (PE0)    BLK4x4 0 (PE4)
56            BLK4x4 1 (PE1)    BLK4x4 0 (PE5)
57            BLK4x4 1 (PE2)    BLK4x4 0 (PE6)
58            BLK4x4 1 (PE3)    BLK4x4 0 (PE7)
59            BLK4x4 2 (PE0)    BLK4x4 1 (PE4)    BLK4x4 0 (PE8)
60            BLK4x4 2 (PE1)    BLK4x4 1 (PE5)    BLK4x4 0 (PE9)
61            BLK4x4 2 (PE2)    BLK4x4 1 (PE6)    BLK4x4 0 (PE10)
62            BLK4x4 2 (PE3)    BLK4x4 1 (PE7)    BLK4x4 0 (PE11)
63            BLK4x4 3 (PE0)    BLK4x4 2 (PE4)    BLK4x4 1 (PE8)     BLK4x4 0 (PE12)
64            BLK4x4 3 (PE1)    BLK4x4 2 (PE5)    BLK4x4 1 (PE9)     BLK4x4 0 (PE13)
65            BLK4x4 3 (PE2)    BLK4x4 2 (PE6)    BLK4x4 1 (PE10)    BLK4x4 0 (PE14)
66            BLK4x4 3 (PE3)    BLK4x4 2 (PE7)    BLK4x4 1 (PE11)    BLK4x4 0 (PE15)
67                              BLK4x4 3 (PE4)    BLK4x4 2 (PE8)     BLK4x4 1 (PE12)
68                              BLK4x4 3 (PE5)    BLK4x4 2 (PE9)     BLK4x4 1 (PE13)
69                              BLK4x4 3 (PE6)    BLK4x4 2 (PE10)    BLK4x4 1 (PE14)
70                              BLK4x4 3 (PE7)    BLK4x4 2 (PE11)    BLK4x4 1 (PE15)
71                                                BLK4x4 3 (PE8)     BLK4x4 2 (PE12)
72                                                BLK4x4 3 (PE9)     BLK4x4 2 (PE13)
73                                                BLK4x4 3 (PE10)    BLK4x4 2 (PE14)
74                                                BLK4x4 3 (PE11)    BLK4x4 2 (PE15)
75                                                                   BLK4x4 3 (PE12)
76                                                                   BLK4x4 3 (PE13)
77                                                                   BLK4x4 3 (PE14)
78                                                                   BLK4x4 3 (PE15)

As defined in Fig. 3, each block mode in Group1 has 4 SAD buses, and 4 adjacent PEs share one SAD bus. Each block mode in Group2 has 2 SAD buses, and 8 adjacent PEs share one SAD bus. Each block mode in Group3 has only 1 SAD bus, which is shared by all 16 PEs. In this way, the 112 SAD interfaces of the 16 PEs are time-multiplexed onto 16 shared SAD buses, and because each SAD bus has one comparator, only 16 comparators are required.

To explain the data transfer mechanism, the four SAD buses of the 4 × 4 block mode are used as an example. As discussed above, the 16 PEs share the four 4 × 4 SAD buses: PE0–PE3 share SAD4x4 BUS0, PE4–PE7 share SAD4x4 BUS1, and so on, as shown in Fig. 5. The data transfers of the SADs of BLK4x4 0–BLK4x4 3 are shown in Table 2; the transfers of the other 4 × 4 blocks follow by analogy. On SAD4x4 BUS0, the SAD of BLK4x4 0 is generated in PE0 and delivered in cycle 51. In cycle 52, the SAD of the same block BLK4x4 0 in PE1 is ready and is transferred. The same operations continue in cycles 53 and 54 for the SADs of BLK4x4 0 in PE2 and PE3. In cycle 55, the SAD of BLK4x4 1 in PE0 is generated and is also delivered over SAD4x4 BUS0; hence, during cycles 55–58 the SADs of BLK4x4 1 in PE0–PE3 are transferred. The SADs of BLK4x4 2 and BLK4x4 3 in PE0–PE3 are transferred on SAD4x4 BUS0 during cycles 59–62 and 63–66, respectively. SAD4x4 BUS1–SAD4x4 BUS3 behave in the same way, except that they are in charge of different PEs.
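A compact way to state this bus assignment is the following Python sketch; it is illustrative only, and the names are descriptive rather than taken from the RTL:

    # Number of shared SAD buses per block mode (Group1 / Group2 / Group3).
    BUSES_PER_MODE = {"4x4": 4, "4x8": 4,             # Group1: 4 adjacent PEs per bus
                      "8x4": 2, "8x8": 2, "8x16": 2,  # Group2: 8 adjacent PEs per bus
                      "16x8": 1, "16x16": 1}          # Group3: all 16 PEs on one bus

    def bus_of(mode, pe):
        """Shared SAD bus used by the given PE for the given block mode."""
        pes_per_bus = 16 // BUSES_PER_MODE[mode]
        return (mode, pe // pes_per_bus)

    assert sum(BUSES_PER_MODE.values()) == 16         # 16 buses, hence 16 comparators
    assert bus_of("4x4", 5) == ("4x4", 1)             # PE4-PE7 share SAD4x4 BUS1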

Each SAD bus has a dedicated comparator, so 16 comparators are required in total. According to the clock cycle, a different minimum SAD register is selected and compared with the SAD transferred on the bus; if the new SAD is smaller than the stored one, the corresponding minimum SAD and MV registers are updated. Taking the comparator of SAD4x4 BUS0 as an example, during clock cycles 51–54 the minimum SAD register of BLK4x4 0 is selected and compared with the delivered SADs, during cycles 55–58 the minimum SAD register of BLK4x4 1 is selected, and so on. The other comparators work in the same way, as can be seen from Table 2.
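The behavior of one such comparator can be sketched as follows. This is an illustrative Python model, not the RTL; the slot decoding shown covers only the BLK4x4 0–BLK4x4 3 period of Table 2, and the MV is simply the motion vector of whichever PE delivers the SAD:

    def bus0_slot(cycle):
        """Which (4x4 block, PE) occupies SAD4x4 BUS0 in this cycle (cycles 51-66)."""
        t = cycle - 51
        if 0 <= t < 16:
            return t // 4, t % 4       # block index advances every 4 cycles, PE every cycle
        return None

    def compare_and_update(min_sad, min_mv, block, new_sad, mv):
        """One comparator: keep the smallest SAD and its motion vector per block."""
        if new_sad < min_sad[block]:
            min_sad[block] = new_sad
            min_mv[block] = mv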

4.   Search Area Memory Organization

It is well known that high memory bandwidth and a large number of memory modules are mainly responsible for the power consumption and hardware cost of ME architectures [15]. However, for a specific ME algorithm and its associated VLSI architecture, the required memory size and bandwidth are fixed. Therefore, memory cost reduction mainly relies on the following two aspects.

1) Reduce the number of memory modules: The hardware cost of one memory module can be approximated by the following equation [16]:

    Area_memory = Area_base + Area_bit × bits,    (1)

where Area_base is the area required for address generation, address decoding, row and column drivers and other control logic, and Area_bit is the area of one memory bit. From this equation, it can be seen that, for a given memory size, fewer memory modules mean smaller hardware cost and power consumption [16].

2) Reduce the width of the memory I/O bus: Differential amplifier circuits are widely used in SRAM output drivers to increase the access speed, and they consume considerable area and power. For instance, for 512-byte SRAMs generated by the Artisan Memory Compiler in TSMC 0.18 µm technology and running at 100 MHz, the average working currents of a 128 bits × 32 words SRAM and an 8 bits × 512 words SRAM are 25 mA and 3 mA, and the corresponding hardware costs are 17.7K gates and 8.0K gates, respectively. Hence, for a required memory bandwidth, the SRAM with the minimum I/O width is the most hardware-efficient.

In the proposed full search 1-D VBSME architecture with a search range of [−8, +7], the search area contains 31 × 31 bytes and should be stored in two SRAMs to avoid access contention [4]. A straightforward memory organization is therefore two single-port 128 bits × 32 words SRAMs, named Organization A. However, in the 1-D systolic array only two pixels are required simultaneously in every cycle, so according to the above discussion a more hardware-efficient organization is two single-port 8 bits × 496 words SRAMs, called Organization B, as illustrated in Fig. 6(a). Because a dual-port SRAM can also provide two different pixels concurrently, one dual-port 8 bits × 992 words SRAM is also evaluated, named Organization C.

Fig. 6   (a) Memory organization B. (b) Proposed memory organization.

In the 1-D systolic array, the two pixels required in every cycle are not in the same row [4], which is why two SRAMs are normally used to store the search area data. When the search area data are stored in Organization B, the SRAM structure is as shown in Fig. 6(a), where the two concurrently required data are marked by the dashed line. In practice, by using registers, this conflict can be eliminated and the search area data can be stored in one 16-bit single-port SRAM, as illustrated in Fig. 6(b). The data of the two separate SRAM modules are merged into one, and fifteen extra 8-bit registers are used to buffer the fetched pixels. For instance, in clock cycle 0, the reference data R(0,0) and R(0,16) are both loaded from the SRAM; R(0,0) is fed directly to the PE array and R(0,16) is stored in register 0. In clock cycle 1, R(0,1) and R(0,17) are fetched; R(0,1) is transmitted to the PE array and R(0,17) is stored in register 1. The same operations continue in cycles 2–15. In cycle 16, according to the 1-D array dataflow [4], both R(1,0) and R(0,16) are required: R(1,0) is fetched directly from the SRAM in this cycle, while R(0,16) is already available in register 0. In this way, the two pixels are provided simultaneously by one single-port SRAM.

The performance comparison of the three memory organizations and the proposed one is listed in Table 3. The SRAMs are generated by the Artisan Memory Compiler in TSMC 0.18 µm technology, and the power consumption is obtained with Synopsys Power Compiler. Organization A has a very wide I/O bus, so its hardware cost is the largest; however, because this memory is enabled only once every 16 cycles, its average power consumption is the smallest. Organization B has a much narrower I/O bus and thus a much smaller hardware cost than Organization A, but because its SRAM is enabled in every cycle, its power consumption is larger. Organization C requires only one dual-port SRAM, but compared with Organization B both the hardware cost and the power consumption increase because of the two read ports. The proposed organization has the smallest hardware cost and the second smallest power consumption, and thus makes a good tradeoff between the two. This is mainly because it has the smallest I/O width and needs only one SRAM module. Moreover, a single SRAM also simplifies the back-end design and reduces the built-in self-test (BIST) cost. It should be mentioned that the fifteen 8-bit registers and one multiplexer required by the proposed organization cost 1.3K gates; compared with the hardware and power reductions, this overhead is acceptable.

However, because the proposed SRAM has a narrow data width, reloading the search area costs 496 cycles, which may not be acceptable for real-time applications. In practice, this issue can be solved by a ping-pong strategy: while one SRAM is in use, the other one is reloaded. Because the proposed SRAM has a small hardware cost, the ping-pong strategy is quite feasible, and the extra reloading time can be completely hidden.
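The register-based merging can be sketched as follows. This is an illustrative Python model, not the RTL; it assumes each 16-bit word holds the pixel pair (R(row, col), R(row, col+16)), which matches the example above, and it ignores the start-up cycles of row 0, during which the delayed pixel is not yet needed:

    def read_cycle(sram, regs, t):
        """One access cycle of the proposed single-SRAM organization (sketch).
        sram: list of (low_byte, high_byte) 16-bit words, word t assumed to hold
              (R(row, col), R(row, col+16)) with row, col = divmod(t, 16);
        regs: 15-entry register file for the parked high bytes."""
        row, col = divmod(t, 16)
        a, b = sram[t]                                  # one single-port read per cycle
        pix_direct = a                                  # R(row, col), fed straight to the PE array
        pix_delayed = regs[col] if col < 15 else None   # R(row-1, col+16), parked 16 cycles ago
        if col < 15:
            regs[col] = b                               # park R(row, col+16) for cycle t+16
        return pix_direct, pix_delayed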

Table 3   Comparisons on search area memory organization (worst case, 100 MHz).

                                  Organization A        Organization B       Organization C       Proposed
Memory Organization               128 bits × 32 words   8 bits × 496 words   8 bits × 992 words   16 bits × 496 words
Port Number                       Single-Port           Single-Port          Dual-Port            Single-Port
Modules                           2                     2                    1                    1
Gate Count (gates)                35.5K                 12.6K                22.5K                9.7K
Average Power Consumption (mW)    4.8                   12.1                 15.5                 8.5

Table 4   Comparisons with previous VBSME designs.

                                   2-D Array [5]   2-D Array [6]    Tree Structure [7]   1-D Array [8]        This Work
PE Number                          256             256              256                  16                   16
Control Logic                      Simple          Complex          Simple               LUT based            Simple
Process Technology                 UMC 0.18 µm     TSMC 0.18 µm     UMC 0.18 µm          TSMC 0.13 µm         TSMC 0.18 µm
Clock Frequency                    110.8 MHz       100 MHz (Max)    110.8 MHz            294 MHz (Max)        266 MHz (Max)
Gate Count (PE Array)              81.5K           113.4K           88.6K                61K                  51.7K
Power Consumption (SRAM included)  -               -                -                    573.4 mW (294 MHz)   131.7 mW (266 MHz)
Search Window Partitions           32              16               -                    2                    1

Table 5   Clock frequency versus video applications.

Format   Frame Size   Search Range   Frame Rate   Clock Frequency
QCIF     176 × 144    16 × 16        30 Hz        12.2 MHz
QCIF     176 × 144    32 × 32        30 Hz        48.8 MHz
CIF      352 × 288    32 × 32        30 Hz        194.6 MHz
CIF      352 × 288    48 × 32        25 Hz        243.3 MHz

5.   Silicon Implementation

The proposed architecture with 16 PEs targets a search range of 16 × 16 and can handle all the block modes in H.264/AVC. For one MB, the VBSME is finished in 4096 cycles. The architecture is described in Verilog HDL and synthesized by Synopsys Design Compiler with TSMC 0.18 µm technology. The design costs about 67.6K gates: the 1-D PE array costs 51.7K gates; the 41 minimum SAD and MV registers, the SAD bus network and the associated 16 comparators account for 12.9K gates; the remaining 3K gates are consumed by the state machine and other control logic. Under typical working conditions (1.8 V, 25°C), the maximum clock frequency is 266 MHz and the power consumption is 131.7 mW (SRAM included). With 16 PEs and the ping-pong memory strategy, the proposed design is 100% utilized. The clock frequencies required for different video applications are shown in Table 5, where larger search ranges are divided into 16 × 16 sub-ranges to fit the proposed 1-D design.

The comparisons between the proposed architecture and previous designs are listed in Table 4; the performance data of the designs in [5] and [7] are taken from [14]. Compared with the 2-D designs [5], [6] and the tree architecture [7], the proposed design cannot achieve the same throughput, but it needs less hardware and power. Moreover, far fewer SRAM modules are required in our design, which is critical for portable devices. Compared with the previous 1-D design [8], an adder tree is used in our design to realize a regular workflow; in practice, this adder tree can be efficiently optimized by the behavioral optimization of arithmetic (BOA) feature of Synopsys Design Compiler. To transfer the SADs generated by the PE array, 16 SAD buses are used in our design, more than the 13 SAD buses required in [8]. However, because in our design each PE always delivers a given block mode over the same SAD bus and the SADs are generated regularly, the interconnection is much simpler. For instance, in [8] the sixteen 4 × 4 SADs of PE0 are delivered sequentially over the SAD buses {0, 1, 3, 4, 0, 2, 6, 8, 0, 1, 3, 4, 0, 2, 6, 9}, i.e., over 8 different buses in an irregular pattern. In our design, these SADs are all delivered over the single bus SAD4x4 BUS0 in a regular sequence, as shown in Table 2. The simple and regular interconnection also means that fewer multiplexers are used, and therefore less hardware and power are consumed.
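The entries of Table 5 can be roughly reproduced from the figures above. The following Python sketch is only a back-of-the-envelope check; it assumes 4096 cycles per MB per 16 × 16 sub-range and no additional control or reload overhead, so small deviations from Table 5 are expected:

    CYCLES_PER_MB_PER_SUBRANGE = 4096      # one MB, one 16x16 search sub-range

    def required_mhz(width, height, fps, sr_w, sr_h):
        """Required clock frequency in MHz for real-time full-search VBSME."""
        mbs = (width // 16) * (height // 16)
        subranges = (sr_w // 16) * (sr_h // 16)
        return mbs * fps * CYCLES_PER_MB_PER_SUBRANGE * subranges / 1e6

    print(required_mhz(176, 144, 30, 16, 16))   # ~12.2  (QCIF, 16x16)
    print(required_mhz(176, 144, 30, 32, 32))   # ~48.7  (QCIF, 32x32)
    print(required_mhz(352, 288, 30, 32, 32))   # ~194.6 (CIF, 32x32)
    print(required_mhz(352, 288, 25, 48, 32))   # ~243.3 (CIF, 48x32)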

6.   Conclusions

In this paper, a 1-D array VBSME architecture for low-end products is proposed. By properly using the partial-SAD registers and scheduling the addition operations, the design can be realized with simple control logic and a regular workflow. By reorganizing the search area data, the traditional two separate SRAMs are replaced by one single-port SRAM while the PE array remains 100% utilized. The design is realized in TSMC 0.18 µm 1P6M technology with a hardware cost of 67.6K gates (SRAM excluded). Under typical working conditions, the maximum frequency is 266 MHz and the power consumption is 131.7 mW (SRAM included).

Acknowledgments

This work was supported by funds from MEXT via the Kitakyushu innovative cluster project.


References

[1] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, “Video coding with H.264/AVC: Tools, performance, and complexity,” IEEE Circuits Syst. Mag., vol.4, no.1, pp.7–28, 2004.
[2] T. Wiegand, G.J. Sullivan, G. Bjøntegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol.13, no.7, pp.560–576, 2003.
[3] T. Wiegand, G. Sullivan, and A. Luthra, “Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264—ISO/IEC 14496-10 AVC),” 2003.
[4] K.M. Yang, M.T. Sun, and L. Wu, “A family of VLSI designs for the motion compensation block-matching algorithm,” IEEE Trans. Circuits Syst., vol.36, no.10, pp.1317–1325, 1989.
[5] Y.W. Huang, T.C. Wang, B.Y. Hsieh, and L.G. Chen, “Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264,” Proc. IEEE Int. Symp. Circuits and Systems (ISCAS’03), vol.2, pp.796–799, 2003.
[6] M. Kim, I. Hwang, and S. Chae, “A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264,” Proc. ACM/IEEE Asia and South Pacific Design Automation Conf. (ASP-DAC’05), pp.631–634, 2005.
[7] Y.W. Huang, T.C. Chen, C.H. Tsai, C.Y. Chen, T.W. Chen, C.S. Chen, C.F. Shen, S.Y. Ma, T.C. Wang, B.Y. Hsieh, H.C. Fang, and L.G. Chen, “A 1.3TOPS H.264/AVC single-chip encoder for HDTV applications,” Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC’05), pp.128–129, 2005.
[8] S.Y. Yap and J.V. McCanny, “A VLSI architecture for variable block size video motion estimation,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol.51, no.7, pp.384–389, 2004.
[9] L.D. Vos and M. Stegherr, “Parameterizable VLSI architectures for the full-search block-matching algorithm,” IEEE Trans. Circuits Syst., vol.36, no.10, pp.1309–1316, 1989.
[10] J.K. Tanskanen and J.T. Niittylahti, “Scalable parallel memory architectures for video coding,” J. VLSI Signal Processing, vol.38, no.2, pp.173–199, 2004.
[11] J.S. Jehng, L.G. Chen, and T.D. Chiueh, “An efficient and simple VLSI tree architecture for motion estimation algorithms,” IEEE Trans. Signal Process., vol.41, no.2, pp.889–900, 1993.
[12] P. Pirsch, “VLSI architectures for video compression—A survey,” Proc. IEEE, vol.83, no.2, pp.220–246, 1995.
[13] J.C. Tuan, T.S. Chang, and C.W. Jen, “On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture,” IEEE Trans. Circuits Syst. Video Technol., vol.12, no.1, pp.61–72, 2002.
[14] C.Y. Chen, S.Y. Chien, Y.W. Huang, T.C. Chen, T.C. Wang, and L.G. Chen, “Analysis and architecture design of variable block size motion estimation for H.264/AVC,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol.53, no.3, pp.578–593, 2006.
[15] P. Kuhn, U. Niedermeier, L.F. Chao, and W. Stechele, “A flexible low-power VLSI architecture for MPEG-4 motion estimation,” Proc. SPIE Conf. Visual Commun. Image Processing (VCIP’99), pp.883–894, 1999.
[16] P. Kuhn, Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation, pp.134–135, Kluwer Academic Publishers, 1999.

Yang Song received the B.E. degree in Computer Science from Xi’an Jiaotong University, China, in 2001 and the M.E. degree in Computer Science from Tsinghua University, China, in 2004. He is currently a Ph.D. candidate in the Graduate School of Information, Production and Systems, Waseda University, Japan. His research interests include motion estimation, video coding technology and the associated VLSI architectures.

Zhenyu Liu received his B.E., M.E. and Ph.D. degrees in electronics engineering from Beijing Institute of Technology in 1996, 1999 and 2002, respectively. His doctoral research focused on real-time signal processing and related ASIC design. From 2002 to 2004, he worked as a postdoctoral researcher at Tsinghua University, China, where his research mainly concentrated on embedded CPU architecture. He is currently a researcher at the Kitakyushu Foundation for the Advancement of Industry Science and Technology. His research interests include real-time H.264 encoding algorithms and the associated VLSI architectures.

Takeshi Ikenaga received his B.E. and M.E. degrees in electrical engineering and the Ph.D. degree in information & computer science from Waseda University, Tokyo, Japan, in 1988, 1990, and 2002, respectively. He joined LSI Laboratories, Nippon Telegraph and Telephone Corporation (NTT) in 1990, where he undertook research on design and test methodologies for high-performance ASICs, a real-time MPEG2 encoder chip set, and highly parallel LSI & system design for image-understanding processing. He is presently an associate professor in the system LSI field of the Graduate School of Information, Production and Systems, Waseda University. His current interests are application SoCs for image, security and network processing. Dr. Ikenaga is a member of the IPSJ and the IEEE. He received the IEICE Research Encouragement Award in 1992.

Satoshi Goto was born on January 3rd, 1945 in Hiroshima, Japan. He received the B.E. and M.E. degrees in Electronics and Communication Engineering from Waseda University in 1968 and 1970, respectively, and the Dr. of Engineering degree from the same university in 1981. He is an IEEE Fellow, a member of the Academy Engineering Society of Japan, and a professor at Waseda University. His research interests include LSI systems and multimedia systems.
