
IEICE TRANS. INF. & SYST., VOL.E90–D, NO.1 JANUARY 2007


PAPER

Special Section on Advanced Image Technology

Low-Power Partial Distortion Sorting Fast Motion Estimation Algorithms and VLSI Implementations

Yang SONG†a), Student Member, Zhenyu LIU††, Nonmember, Takeshi IKENAGA†, Member, and Satoshi GOTO†, Fellow

SUMMARY This paper presents two hardware-friendly, low-power-oriented fast motion estimation (ME) algorithms and their VLSI implementations. The basic idea of the proposed partial distortion sorting (PDS) algorithm is to disable the search points that have larger partial distortions during the ME process and to keep only those with smaller ones. To further reduce the computation overhead, a simplified local PDS (LPDS) algorithm is also presented. Experiments show that the PDS and LPDS algorithms provide almost the same image quality as full search with only 36.7% of its computation complexity. The two proposed algorithms can be integrated into different full search block matching algorithm (FSBMA) architectures to reduce power consumption. In this paper, the 1-D inter ME architecture [12] is used as a detailed example. Under the worst working conditions (1.62 V, 125 °C) and a 166 MHz clock frequency, the PDS algorithm reduces power consumption by 33.3% at an extra hardware cost of 4.05 K gates, and the LPDS algorithm reduces it by 37.8% with 1.73 K gates of overhead.

key words: motion estimation (ME), partial distortion sorting (PDS), systolic array architecture

Manuscript received March 24, 2006.
Manuscript revised July 16, 2006.
† The authors are with the Graduate School of Information, Production and Systems, Waseda University, Kitakyushu-shi, 808–0135 Japan.
†† The author is with the Kitakyushu Foundation for the Advancement of Industry Science and Technology, Kitakyushu-shi, 808–0135 Japan.
a) E-mail: [email protected]
DOI: 10.1093/ietisy/e90–d.1.108

1. Introduction

Motion estimation (ME) is effective in removing the temporal redundancy between frames and is thus widely adopted in current video coding standards such as MPEG-1, 2, 4 and H.26x. Among all the block matching algorithms, the full search block matching algorithm (FSBMA) is widely used in hardware accelerators because of its regularity and stable performance. For one macroblock (MB) in FSBMA, all the search points in the search area are checked, and the point with the smallest distortion cost is chosen as the best motion vector (MV). Many matching functions have been proposed, and the simple sum-of-absolute-difference (SAD) value is the most popular one:

$$ \mathrm{SAD}(m, n) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| C(i, j) - R(i + m, j + n) \right|, \tag{1} $$

where (m, n) is the position of the current search point, and C(i, j) and R(i + m, j + n) represent the pixels at (i, j) in the current MB and in the candidate block, respectively.
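As a small illustration of Eq. (1), the following Python sketch computes the SAD of one search point and runs an exhaustive search over a tiny search area. The function names `sad` and `full_search`, the nested-list pixel layout and the use of non-negative displacements (the search area stored with its origin at the top-left corner) are assumptions made for this example only; they are not taken from the paper.

```python
def sad(cur, ref, m, n):
    """Sum of absolute differences between the N x N block `cur` and the
    candidate block of `ref` displaced by (m, n), following Eq. (1)."""
    size = len(cur)
    return sum(
        abs(cur[i][j] - ref[i + m][j + n])
        for i in range(size)
        for j in range(size)
    )


def full_search(cur, ref, span):
    """Check every displacement in [0, span) x [0, span) and return
    (smallest SAD, (m, n)). Ties go to the smallest (m, n)."""
    return min(
        (sad(cur, ref, m, n), (m, n))
        for m in range(span)
        for n in range(span)
    )


cur = [[1, 2], [3, 4]]
ref = [[9, 9, 9], [9, 1, 2], [9, 3, 4]]
best_sad, best_mv = full_search(cur, ref, 2)
```

Every search point costs the same amount of work here, which is the regularity that makes full search attractive for hardware and also the source of its high computation cost.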

Although FSBMA provides the best image quality, it requires intensive computation and is not suitable for portable devices. To reduce the computation, many fast ME algorithms have been proposed, and they can be divided into the following two categories:
(1) Reducing the number of search points, which means that ME is conducted only on a subset of the total search area.
(2) Decreasing the computation overhead of each search point, which means that the full distortion is replaced by a simpler one.

In the first category, search window size adaptation [1], [2] is one effective technique: for each MB, the search range is dynamically adjusted according to the characteristics of the MB. All the fixed search pattern fast ME algorithms can also be classified into this category. Their underlying assumption is that the matching error decreases monotonically as the search moves towards the global minimum. Therefore, for each MB, only a limited number of search points in the search area are processed. Well-known examples are the 3-step search (TSS) [3], new 3-step search (NTSS) [4], 4-step search [5], diamond search (DS) [6] and hexagon-based search (HEXBS) [7].

Many fast ME algorithms have also been presented in the second category. The lossless partial distortion elimination (PDE) algorithm [8] terminates the calculation of a search point as soon as its accumulated partial distortion exceeds the current minimum distortion. The lossless successive elimination algorithm (SEA) [9] skips a search point if the absolute difference between the sum-norms of the current MB and the candidate block is larger than the current minimum distortion. Pixel decimation [10] is a lossy method that subsamples the pixels in one MB and thus reduces the computation cost. Pixel truncation [11] decreases the word width of each pixel and thereby reduces the hardware cost and power consumption of VLSI implementations.
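To make the PDE idea from the second category concrete, here is a minimal Python sketch of an early-terminating SAD. The function name `sad_with_pde` and the choice to test the running sum once per block row are assumptions for illustration; the original PDE work [8] may check at a different granularity.

```python
def sad_with_pde(cur, cand, best_so_far):
    """Row-by-row SAD of two equally sized blocks with PDE-style early exit.

    Returns the full SAD, or None as soon as the accumulated partial
    distortion exceeds `best_so_far`, because the candidate can then no
    longer become the minimum.
    """
    partial = 0
    for cur_row, cand_row in zip(cur, cand):
        partial += sum(abs(c - r) for c, r in zip(cur_row, cand_row))
        if partial > best_so_far:
            return None  # lossless rejection of this search point
    return partial


cur = [[1, 2], [3, 4]]
close_match = sad_with_pde(cur, [[1, 2], [3, 5]], best_so_far=10)
rejected = sad_with_pde(cur, [[9, 9], [3, 4]], best_so_far=10)
```

The number of rows processed depends on the data, so the run time of each search point varies from one candidate to the next.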
Although many fast ME algorithms have been proposed, they cannot easily be realized in hardware because of their unstable image quality, irregular workflow and variable computation time. To overcome these drawbacks, this paper presents a novel partial distortion sorting (PDS) algorithm and a simplified local PDS (LPDS) algorithm. Experiments show that the two algorithms achieve almost the same image quality as FSBMA. Moreover, the two algorithms have been integrated into the 1-D systolic array [12] to demonstrate their hardware friendliness. Under the worst working conditions (1.62 V, 125 °C) and a 166 MHz clock frequency, the PDS algorithm reduces power consumption by 33.3% at an extra hardware cost of 4.05 K gates, and the LPDS algorithm reduces it by 37.8% at a cost of 1.73 K gates.

The rest of the paper is organized as follows. The PDS and LPDS algorithms are discussed in Sect. 2. The computation analysis and performance comparison are given in Sect. 3. The VLSI implementations are presented in Sect. 4. Finally, Sect. 5 concludes the paper.

Copyright © 2007 The Institute of Electronics, Information and Communication Engineers

2. PDS and LPDS Algorithms

2.1 Proposed PDS Algorithm

For a P × P block size and an M × N search range, where M is the width and N is the height, the proposed PDS algorithm is illustrated in Fig. 1 and is described as follows:
(1) The M search points in one row are fetched from the search area.
(2) For all M search points, the partial distortions of block rows 0∼p1 are calculated first. The obtained M partial distortions are sorted, and only the subset of m1 (m1 < M) search points with the smallest distortions stays enabled.
(3) For the enabled m1 search points, the distortions of block rows p1∼p2 are calculated and accumulated. The obtained partial distortions are sorted, and only m2 (m2 < m1) search points are kept.
(4) This process continues until the block row reaches P and the full distortions are obtained.
(5) The ME process of one row in the search area is then complete, and the best candidate is chosen from the search points that are still enabled.
(6) If all the rows in the search

area are processed, the ME process is finished; otherwise, go to step (1).

Fig. 1 PDS algorithm workflow.

In summary, for one row of search points, all M search points are enabled at first. When the partial distortions up to block rows p1, p2, ..., pk, ..., pi (pk−1 < pk, pk ∈ [0, P]) have been calculated, the number of enabled search points is reduced to m1, m2, ..., mk, ..., mi (mk−1 > mk, mk ∈ [1, M]), respectively.

The key problem in the proposed PDS algorithm is to decide two parameters: the check row positions Ppos = (p1, p2, ..., pk, ..., pi) (pk ∈ [0, P]) and the associated numbers of enabled search points Mnum = (m1, m2, ..., mk, ..., mi) (mk ∈ [1, M]). Larger values in Ppos and Mnum mean that more computation is kept and thus better image quality can be achieved, whereas smaller values mean that less computation is used and more power can be saved. These parameters therefore provide a tradeoff between image quality and computation complexity. It should be mentioned that once Ppos and Mnum are determined, the computation time is fixed, which is desirable for hardware implementation.

The two parameters are tested on two typical video sequences, as shown in Table 1. The test conditions are 16 × 16 block size and 100 frames, and SAD is used as the distortion criterion. The search range is fixed to [−16, +15], so each search row includes 32 search points. According to our experiments, we suggest setting Ppos = (4, 8, 12, 16) and Mnum = (7, 5, 3, 1) for QCIF and CIF video with a search range of [−16, +15].

2.2 Proposed Local PDS (LPDS) Algorithm

Although the PDS algorithm provides desirable image quality with less computation, it has two drawbacks for hardware implementation:
(1) According to the above experiments, 7 minimum partial distortions and their associated MVs have to be stored, which introduces considerable hardware overhead.
(2) For each row of search points, the sorting procedure costs extra computation and is inefficient to realize in hardware.
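To make the sorting step and the retained candidates concrete, the PDS procedure of Sect. 2.1 for one row of search points can be sketched in Python as follows. This is a software illustration only: the function name `pds_select_row`, the precomputed `row_dists` table and the use of Python's `sorted` in place of a partial sort are assumptions of this example, not the paper's implementation.

```python
def pds_select_row(row_dists, p_pos=(4, 8, 12, 16), m_num=(7, 5, 3, 1)):
    """Partial distortion sorting over one row of search points.

    row_dists[s][r] is the distortion contributed by block row r at search
    point s. Returns (index of the surviving search point, its distortion).
    """
    enabled = list(range(len(row_dists)))  # all M points start enabled
    acc = [0] * len(row_dists)             # accumulated partial distortions
    start = 0
    for p, m in zip(p_pos, m_num):
        for s in enabled:                  # disabled points do no more work
            acc[s] += sum(row_dists[s][start:p])
        # Keep the m enabled points with the smallest partial distortion.
        enabled = sorted(enabled, key=lambda s: acc[s])[:m]
        start = p
    best = enabled[0]
    return best, acc[best]


# 32 search points, 16 block rows; point 10 is the best match by construction.
row_dists = [[abs(s - 10) + 1] * 16 for s in range(32)]
best_point, best_distortion = pds_select_row(row_dists)
```

With the suggested parameters, the loop keeps 7, 5, 3 and 1 candidates after block rows 4, 8, 12 and 16, so the list of survivors and their accumulated distortions is exactly the state that drawback (1) says must be stored.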
In practice, the global minimums required by the PDS algorithm are responsible for the above two problems. Therefore, the proposed local PDS (LPDS) algorithm uses local minimum distortions instead of global ones. In LPDS, the search points in one row are divided into groups. For each group, only the search point with the smallest distortion is enabled, so the sorting cost is eliminated. For instance, with a search width of M, the search points in [−M/2, m1) form “group1”, and the search point with the smallest partial distortion in this group is enabled. The search points in [m1, m2) belong to “group2”, where again only one search point is enabled, and so on. To make fair comparisons, we keep the same parameters Ppos and Mnum as in the PDS algorithm. Therefore, the key issue in LPDS is how to group the search points. Various combinations were tested, and our suggested grouping method is illustrated in Fig. 2. For the first check, the 32 search points are divided into 7 groups: [−16, −9], [−8, −5], [−4, −1], [0], [1, 3], [4, 7] and [8, 15]. For each

group, only the search point with the smallest partial distortion is enabled, and thus 7 search points are kept. For the second check, the two enabled search points in [−16, −9] and [−8, −5] are compared and only the one with the smaller distortion stays enabled; the enabled search points in [4, 7] and [8, 15] are handled in the same way. For the other groups, the search points enabled in the first check are kept. Therefore, after the second check, only 5 search points are enabled. For the third and fourth checks, the enabled search points are reduced to 3 and 1, respectively. Because most motions in natural videos are slight and stable, the proposed grouping method is center-biased: more search points are kept near the search center and fewer are enabled far away from it.

Fig. 2 LPDS algorithm grouping method.

Table 1 Image quality versus parameters Ppos & Mnum.

Table Tennis (QCIF)
Algorithm  Ppos         Mnum        PSNR (dB)  Bitrate (kbps)  Same MV (%)  Complexity (%)
FSBMA      −            −           35.145     232.186         100.00       100.00
PDS        (4,8,12,16)  (11,7,3,1)  35.151     234.785          96.02        41.41
PDS        (4,8,12,16)  (7,5,3,1)   35.148     235.582          94.36        36.72
PDS        (4,8,12,16)  (5,3,1,1)   35.130     238.008          86.51        32.03
PDS        (2,6,10,16)  (11,7,3,1)  35.152     235.670          94.06        30.08
PDS        (2,6,10,16)  (7,5,3,1)   35.146     237.538          92.41        25.39
PDS        (2,6,10,16)  (5,3,1,1)   35.125     240.710          82.20        22.27
PDS        (2,4,8,16)   (11,7,3,1)  35.158     237.122          92.70        26.95
PDS        (2,4,8,16)   (7,5,3,1)   35.145     238.327          91.31        23.83
PDS        (2,4,8,16)   (5,3,1,1)   35.133     243.684          79.10        21.48

Table Tennis (CIF)
Algorithm  Ppos         Mnum        PSNR (dB)  Bitrate (kbps)  Same MV (%)  Complexity (%)
FSBMA      −            −           35.183     820.639         100.00       100.00
PDS        (4,8,12,16)  (11,7,3,1)  35.183     831.890          95.93        41.41
PDS        (4,8,12,16)  (7,5,3,1)   35.173     835.222          94.47        36.72
PDS        (4,8,12,16)  (5,3,1,1)   35.175     842.580          87.30        32.03
PDS        (2,6,10,16)  (11,7,3,1)  35.167     840.377          93.99        30.08
PDS        (2,6,10,16)  (7,5,3,1)   35.168     846.912          92.52        25.39
PDS        (2,6,10,16)  (5,3,1,1)   35.164     856.886          83.57        22.27
PDS        (2,4,8,16)   (11,7,3,1)  35.173     843.931          92.66        26.95
PDS        (2,4,8,16)   (7,5,3,1)   35.169     848.474          91.53        23.83
PDS        (2,4,8,16)   (5,3,1,1)   35.162     860.508          80.82        21.48

3. Performance Analysis and Comparison

3.1 Computation Analysis

For FSBMA with a P × P block size and an M × N search range, the computation cost of one search point is 3P² − 1 operations, which include P² subtractions, P² absolute-value operations and P² − 1 additions. For the proposed PDS and LPDS algorithms with parameters Ppos = (p1, p2, ..., pk, ..., pi) and Mnum = (m1, m2, ..., mk, ..., mi), the computation of one search point is, on average,

$$ \frac{M \times p_1 + \sum_{k=1}^{i-1} m_k \times (p_{k+1} - p_k)}{M \times P} \times (3P^2 - 1). \tag{2} $$

It should be mentioned that the presented algorithms introduce extra computation cost. For the PDS algorithm, because only the smallest k distortions among K are required, the partial quicksort algorithm [13] can be used, whose computation complexity is O(K + k log k) [13]. Therefore, in the PDS algorithm, the extra computation complexity for one MB is

$$ O\Bigl(N \times \Bigl(M + m_1 \log m_1 + \sum_{k=1}^{i-1} \bigl(m_k + m_{k+1} \log m_{k+1}\bigr)\Bigr)\Bigr). \tag{3} $$

For the LPDS algorithm, the extra cost comes from the comparisons used to find the local minimum partial distortions. The computation cost for one MB is

$$ N \times \Bigl(M - 1 + \sum_{k=1}^{i-1} (m_k - 1)\Bigr). \tag{4} $$
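The three cost expressions can be evaluated numerically for the suggested parameters. The Python sketch below is only a cross-check under stated assumptions: the function names are invented for this example, the logarithm in Eq. (3) is taken to base 2, and the summand of Eq. (4) is read as m_k − 1 over the first i − 1 entries of Mnum, which is the reading that reproduces the roughly 2657 and 1376 extra operations quoted in the text for these parameters.

```python
import math


def complexity_ratio(M, P, p_pos, m_num):
    """Fraction of FSBMA work that Eq. (2) keeps, i.e. Eq. (2) / (3P^2 - 1)."""
    rows = M * p_pos[0] + sum(
        m_num[k] * (p_pos[k + 1] - p_pos[k]) for k in range(len(p_pos) - 1)
    )
    return rows / (M * P)


def pds_extra_ops(M, N, m_num):
    """Argument of the O(.) in Eq. (3), using base-2 logarithms (assumption)."""
    per_row = M + m_num[0] * math.log2(m_num[0])
    per_row += sum(
        m_num[k] + m_num[k + 1] * math.log2(m_num[k + 1])
        for k in range(len(m_num) - 1)
    )
    return N * per_row


def lpds_extra_ops(M, N, m_num):
    """Eq. (4): comparisons needed to track the local minimums of one MB."""
    return N * (M - 1 + sum(m - 1 for m in m_num[:-1]))


P_POS, M_NUM = (4, 8, 12, 16), (7, 5, 3, 1)
ratio = complexity_ratio(32, 16, P_POS, M_NUM)
pds_ops = pds_extra_ops(32, 32, M_NUM)
lpds_ops = lpds_extra_ops(32, 32, M_NUM)
full_search_point = 3 * 16 ** 2 - 1  # operations for one FSBMA search point
```

Dividing `pds_ops` and `lpds_ops` by `full_search_point` expresses the extra cost as a number of equivalent full-search points, which is how the text compares the overhead with FSBMA.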

The computation complexity of FSBMA is O(3P²) for each search point. Because M, N and P are of the same order of magnitude, the extra computation introduced by PDS and LPDS is equivalent to a full search on several search points and is thus quite acceptable. For instance, in the presented PDS and LPDS algorithms with 16 × 16 block size, 32 × 32 search area, Ppos = (4, 8, 12, 16) and Mnum = (7, 5, 3, 1), only 36.7% of the FSBMA computation is required. Moreover, the extra computation costs of the PDS and LPDS algorithms are about 2657 and 1376 operations, which are almost equal to a full search on 4 and 2 search points, respectively.

3.2 Performance Comparison

With the parameters of the proposed PDS and LPDS algorithms fixed, 12 video sequences are tested to verify the stability of the image quality. The performance of the diamond search (DS) algorithm [6] is also shown for comparison. The experiments are conducted on the H.264/AVC reference software JM8.1a. The test conditions are I-P-P-P..., CAVLC, Hadamard transform, 1 reference frame and 1/4-pixel accurate MVs. The block size is fixed to 16 × 16, and the QP is set to 28, 32, 36 and 40. For QCIF and CIF sequences, the search range is [−16, +15]. The average PSNR and bitrate differences [14] relative to FSBMA are shown in Table 2, where the “+” symbol means an increment and the “−” symbol means a decrement. The same MV rate is also listed, which is defined as the percentage of MVs that are identical between FSBMA and the fast ME algorithm.

Table 2 Performance comparisons for PDS, LPDS and DS.

                   PDS                          LPDS                         DS
Format  Sequence   PSNR      Bitrate  Same MV   PSNR      Bitrate  Same MV   PSNR      Bitrate  Same MV
QCIF    Container  −0.00 dB  +0.04%   99.10%    −0.01 dB  +0.18%   98.50%    −0.00 dB  +0.02%   98.49%
QCIF    Football   −0.07 dB  +1.44%   87.60%    −0.06 dB  +1.25%   75.43%    −0.07 dB  +1.84%   79.19%
QCIF    Foreman    −0.07 dB  +1.75%   98.34%    −0.05 dB  +1.25%   96.28%    −0.14 dB  +3.36%   86.78%
QCIF    Mobile     −0.01 dB  +0.12%   99.77%    −0.01 dB  +0.12%   99.87%    −0.00 dB  +0.09%   99.80%
QCIF    Table      −0.09 dB  +2.47%   94.36%    −0.12 dB  +3.10%   89.74%    −0.20 dB  +5.27%   90.47%
QCIF    Tempete    −0.03 dB  +0.78%   97.66%    +0.01 dB  −0.22%   98.64%    +0.01 dB  −0.29%   98.65%
CIF     Container  −0.01 dB  +0.44%   98.53%    −0.01 dB  +0.18%   98.56%    −0.02 dB  +0.50%   96.64%
CIF     Football   −0.01 dB  +1.39%   86.51%    −0.07 dB  +1.38%   72.88%    −0.39 dB  +3.68%   70.72%
CIF     Foreman    −0.05 dB  +1.17%   96.14%    −0.04 dB  +0.85%   90.32%    −0.15 dB  +3.67%   87.03%
CIF     Mobile     +0.01 dB  −0.25%   97.29%    −0.02 dB  −0.50%   97.31%    +0.05 dB  −1.29%   96.16%
CIF     Table      −0.11 dB  +3.24%   94.47%    −0.12 dB  +3.41%   89.99%    −0.22 dB  +7.83%   87.73%
CIF     Tempete    −0.03 dB  +0.89%   95.15%    +0.01 dB  −0.17%   94.23%    +0.00 dB  −0.04%   95.13%
Average            −0.04 dB  +1.12%   95.41%    −0.04 dB  +0.90%   91.81%    −0.09 dB  +2.05%   90.57%

From Table 2 we can see that the proposed PDS and LPDS algorithms have almost the same performance as FSBMA. For some sequences such as Mobile and Tempete, the proposed algorithms even achieve slightly better image quality. For PDS, the maximum quality losses are a 0.11 dB PSNR drop and a 3.24% bitrate increase, for the Table CIF sequence. For LPDS, the maximum quality degradations are a 0.12 dB PSNR loss and a 3.41% bitrate increase, also for Table CIF. Compared with DS, the proposed PDS and LPDS algorithms have better image quality and higher same MV rates. This is because the DS algorithm searches only a subset of the total search area and thus does not perform well for sequences with fast and large motions such as Foreman and Football.
However, the proposed PDS and LPDS algorithms conduct ME on the whole search area, so stable image quality can be achieved. Compared with the LPDS algorithm, the PDS algorithm provides more stable image quality because the global minimum partial distortions used in PDS are more reliable than the local ones.

4. Hardware Implementation

Because FSBMA provides the best image quality with a regular workflow and regular memory access, many hardware architectures have been proposed for it, and good reviews can be found in [15]–[17]. Several architectures aimed at low hardware cost have also been presented for fast ME algorithms, such as the designs for TSS [18], [19], NTSS [20] and DS [21]. For these fast ME algorithms, because only a part of the search points in the search area is processed, the required number of processing elements (PEs) can be greatly reduced and thus hardware can be saved. For the proposed PDS and LPDS algorithms, however, no search point can be skipped before its calculation starts, so it is not easy to reduce the PE number to decrease the hardware cost; instead, the PEs can be disabled during the ME process to reduce power consumption. Therefore, the presented PDS and LPDS algorithms are more suitable for low-power applications. In practice, the proposed algorithms can easily be integrated into various FSBMA architectures, as discussed in the following.

One interesting classification of FSBMA architectures is proposed in [22], which divides all FSBMA designs into inter, intra and hybrid ones. In the inter architecture, each PE is responsible for one specific search point; representative designs are [12], [23], [24]. In the intra architecture, all the PEs concurrently process one search point; [25]–[27] are some instances. The hybrid architecture combines the inter and intra ones, as in the parallel tree architecture [28]. For our proposed algorithms, because the partial distortions of one row of search points are required simultaneously for comparison, the inter architecture and the parallel tree architecture [28] are more suitable.

For instance, the 2-D inter architecture [23] with an M × N search range has M × N PEs.
Each PE processes one search point, and all the distortions are obtained concurrently. Therefore, the larger partial distortions can easily be identified by comparison operations. The 2-D inter architecture [23] with 4 × 4 block size and 4 × 4 search range is shown in Fig. 3, where 16 PEs are scheduled to work in parallel. PE0∼PE3 process the first search row, PE4∼PE7 process the second search row, and so on. Because all the partial distortions are generated at the same time, the global or local minimums can easily be obtained by comparator trees, and the PEs with larger partial distortions can be directly disabled. The introduced hardware cost is trivial and the control logic is simple. Therefore, the proposed PDS and LPDS algorithms can be efficiently integrated into this architecture to reduce power consumption.

Fig. 3 2-D inter ME architecture [23] with proposed PDS and LPDS algorithms (block size 4 × 4, search range 4 × 4).

For the hybrid parallel tree architecture [28] with an M × N search range, M sub-trees are required and each sub-tree processes one search point. Therefore, the partial distortions calculated by the sub-trees can be directly compared, and sub-trees can be disabled to save power. The parallel tree architecture [28] with 4 × 4 block size and 4 × 4 search range is illustrated in Fig. 4. The 4 sub-trees calculate one row of search points in parallel. Namely, the first sub-tree includes PE0∼PE3 and calculates the distortion of the first search point, the second sub-tree includes PE4∼PE7 and processes the second search point, and so on. The partial distortions are generated simultaneously by the different sub-trees and are then compared by one comparator tree to obtain the minimums. The sub-trees with the larger distortions are then disabled to save power.

Fig. 4 Hybrid parallel tree ME architecture [28] with proposed PDS and LPDS algorithms (block size 4 × 4, search range 4 × 4).

In this paper, the 1-D inter architecture [12] is used as a detailed example because the partial distortions in this architecture are generated at different times with 1 clock

cycle skew. Therefore, the computation logic and control logic are more complex. In the 1-D systolic array, the candidate block data are broadcast and the current MB data are propagated through a buffer line, and the number of processing elements (PEs) equals the search width. For our algorithms with a search range of [−16, +15], 32 PEs are scheduled to work in parallel, and each PE is responsible for one search point. For each PE, the SAD calculation of one search point costs 256 clock cycles, and the SADs generated by adjacent PEs are 1 clock cycle apart. For instance, the SAD of search point [0, 0] (row = 0 and column = 0) is generated by PE0 at cycle 255, the SAD of search point [1, 0] (row = 0 and column = 1) is generated by PE1 at cycle 256, and so on.

The proposed system architecture is illustrated in Fig. 5. This architecture is the same as the traditional full search 1-D array except for the CTRL_UNIT, which is in charge of the enable signal of each PE. Therefore, we need to solve two issues: (1) how to disable one PE, and (2) how to realize the CTRL_UNIT. The second issue is discussed in the following subsections, and the first one is solved as follows.

Fig. 5 1-D systolic array architecture with proposed PDS and LPDS algorithms (block size 16 × 16, search range 32 × 32).

The presented PE unit architecture is shown in Fig. 6. For each PE, the input pixels are masked by the enable signal. Therefore, when the enable signal is inactive (logic “0”), there are no transitions in the arithmetic components. Moreover, the SAD register, which accumulates the SAD value, is also disabled by clock gating.

Fig. 6 PE unit architecture.

4.1 PDS Algorithm Architecture

As mentioned before, the PDS algorithm sorts the partial distortions and enables only the search points that have the smallest ones. When realized in the 1-D array, this means that the partial SADs generated by different PEs should be compared and the PEs that have larger SAD values are disabled to save power. A full sorting circuit is hardware consuming and is thus not suitable for this design. Because the SAD values of different PEs in the 1-D array are generated at different times with a 1-cycle interval, we use the architecture in [29] to realize the sorting logic. The proposed CTRL_UNIT architecture is illustrated in Fig. 7, where the symbols ending in “_reg” are registers. This architecture includes 3 major parts. The forward part shown in Fig. 7 (a) is used to get the maximum SAD among the 7 SAD registers and the newly generated SAD. At the beginning, MIN_SAD1_reg ∼ MIN_SAD7_reg are initialized to 0xFFFF, and when a new SAD value is generated, it is compared with the 7 MIN_SADx_reg values and the maximum one is selected by the max tree. The backward part is illustrated in Fig. 7 (b). The tag of the maximum SAD found in the forward part is decoded, and if the tag hits one of the 7 SAD registers, the associated SAD and MV registers are updated. Figure 7 (c) shows the PE control part, which generates the enable signals for the 32 PEs. Each MIN_MVx_reg may enable one of the 32 PEs. Therefore, the MVs stored in the 7 MV registers are decoded, and the enable signal of each PE is the logic OR of the 7 decoded MIN_MVx_reg values.

To further explain the proposed architecture, the first check procedure is used as an example, which uses the partial SAD values of the top 4 rows to enable 7 of the 32 PEs. According to the workflow of the 1-D array, the partial SAD of search point [0, 0] is generated by PE0 at cycle 63, the partial SAD of search point [1, 0] is obtained by PE1 at cycle 64, and so on. The last partial SAD, of search point [31, 0], is calculated by PE31 at cycle 94. Therefore, during cycles 63 ∼ 94, the forward and backward parts in Fig. 7 (a), (b) keep working to maintain the 7 minimum SADs and MVs. In cycle 95, all the 32 partial SAD values have been generated by the 32 PEs, and the 7 smallest ones are stored in the 7 MIN_SADx_reg and the associated MIN_MVx_reg. Then, the 7 MV registers are decoded and only the 7 PEs whose tags are stored in these registers are enabled. This procedure is repeated in cycles 127 ∼ 158 and 191 ∼ 222, and the number of enabled PEs is further reduced to 5 and 3, respectively. During cycles 255 ∼ 286, the full SAD values are generated by the PEs that are still enabled and are compared and stored as the current minimum SAD.
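The behaviour of the PDS control unit can be modelled in software. The Python sketch below is a behavioural approximation only, not the circuit of Fig. 7: the function names, the (SAD, tag) tuple representation of a register pair and the tie rule (a new SAD that merely equals the stored maximum does not replace it) are assumptions of this example. `check_cycle` reproduces the cycle numbers quoted for the 1-D array, and `update_min_regs` performs one forward/backward step per newly generated partial SAD.

```python
def check_cycle(pe, rows_done, block=16):
    """Cycle at which PE `pe` finishes `rows_done` block rows
    (one pixel per cycle, 1-cycle skew between adjacent PEs)."""
    return rows_done * block - 1 + pe


def update_min_regs(regs, sad, tag):
    """One forward/backward step on a list of (sad, tag) register pairs.

    Forward: locate the maximum among the registers and the new SAD.
    Backward: if a register holds that maximum, overwrite it with the new pair.
    """
    worst = max(range(len(regs)), key=lambda i: regs[i][0])
    if regs[worst][0] > sad:  # the maximum is a register, not the new SAD
        regs[worst] = (sad, tag)


def pe_enables(regs, num_pes=32):
    """Enable of each PE is the OR over the decoded tags held in the registers."""
    tags = {tag for _, tag in regs if tag is not None}
    return [pe in tags for pe in range(num_pes)]


# First check: 32 partial SADs arrive one per cycle, 7 registers start at 0xFFFF.
partial_sads = [(pe * 37) % 101 + 5 for pe in range(32)]  # made-up test data
regs = [(0xFFFF, None)] * 7
for pe, value in enumerate(partial_sads):
    update_min_regs(regs, value, pe)
enables = pe_enables(regs)
```

After the last partial SAD has been streamed in, `regs` holds the 7 smallest values and their PE tags without any full sort having been performed, which is the point of handling one new SAD per cycle.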

4.2 LPDS Algorithm Architecture

In the LPDS algorithm, only one minimum partial SAD is kept for each group of search points. When realized in the 1-D array, this means that the PEs are divided into groups, and the PE with the smallest partial SAD is enabled in each PE group. Therefore, only one MIN_SAD_reg and the associated MIN_MV_reg are required, and they are shared by the different PE groups. The workflow is as follows: for each PE group, the MIN_SAD_reg is first initialized to 0xFFFF and then updated in the following cycles whenever the newly generated SAD is smaller than the stored value. This process continues until the end of the PE group is reached. Then, according to the stored MV, the PE with the same tag is enabled and the other PEs of the group are disabled. The same procedure is applied to the other PE groups. Compared with the PDS algorithm, because only one minimum partial SAD is kept and the sorting calculation is removed, less hardware overhead is introduced and more power can be saved with the LPDS algorithm.
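The per-group selection can be sketched in Python as follows. The group boundaries for the first two checks are the ones stated in Sect. 2.2; the groupings of the third and fourth checks are defined by Fig. 2 and are deliberately not reproduced here. The function name `lpds_check`, the dictionary of partial distortions and the tie rule (the lowest offset wins) are assumptions made for this illustration.

```python
# Inclusive (low, high) search offsets per group.
FIRST_CHECK = [(-16, -9), (-8, -5), (-4, -1), (0, 0), (1, 3), (4, 7), (8, 15)]
SECOND_CHECK = [(-16, -5), (-4, -1), (0, 0), (1, 3), (4, 15)]


def lpds_check(partial, enabled, groups):
    """Keep, per group, the enabled offset with the smallest partial distortion.

    partial: dict mapping offset -> accumulated partial distortion
    enabled: iterable of offsets that are still enabled
    groups:  list of inclusive (low, high) offset ranges
    """
    enabled = sorted(enabled)
    kept = []
    for low, high in groups:
        members = [x for x in enabled if low <= x <= high]
        if members:
            # Running minimum, as with one shared MIN_SAD/MIN_MV register pair.
            kept.append(min(members, key=lambda x: partial[x]))
    return kept


# Made-up partial distortions with the best match near offset 2.
partial = {x: abs(x - 2) * 3 + (x % 2) for x in range(-16, 16)}
after_first = lpds_check(partial, range(-16, 16), FIRST_CHECK)
after_second = lpds_check(partial, after_first, SECOND_CHECK)
```

In a real run, `partial` would be updated with the next block rows between the two calls; it is reused here only to keep the example short. Each group needs a single running minimum, so no sorting and no multi-entry register file is required.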


Fig. 7 CTRL_UNIT architecture. (a) Forward part to get the maximum SAD. (b) Backward part to update the MIN_SADx_reg and MIN_MVx_reg. (c) PE control part to enable/disable the 32 PEs.

4.3 Hardware Implementations

The proposed PDS and LPDS algorithms are integrated into the traditional 1-D systolic array architecture [12]. The two proposed architectures, each with 32 PEs, target a search range of 32 × 32. For one MB, ME processing finishes in 8192 cycles. The designs are written in Verilog HDL and synthesized by Synopsys Design Compiler with TSMC 0.18 µm 1P6M technology. Moreover, all the designs are placed and routed by Synopsys Astro. The chip layouts are illustrated in Fig. 8 and the performance is listed in Table 3.
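The headline figures for the three designs can be cross-checked with a few lines of arithmetic. In the Python sketch below, the gate counts and power values are copied from Table 3, while the helper name `percent_change` and the reading of 8192 cycles as 32 search rows of 256 cycles each (rows processed back to back) are assumptions of this example.

```python
def percent_change(new, base):
    """Relative change of `new` with respect to `base`, in percent."""
    return (new - base) / base * 100.0


# 32 search rows, each taking 16 x 16 = 256 cycles per PE.
cycles_per_mb = 32 * 16 * 16

# Values from Table 3: hardware cost in K gates, post-layout power in mW.
gates = {"1-D": 23.31, "PDS": 27.36, "LPDS": 25.04}
power = {"1-D": 39.21, "PDS": 26.14, "LPDS": 24.39}

summary = {
    name: (
        round(gates[name] - gates["1-D"], 2),                # extra K gates
        round(percent_change(gates[name], gates["1-D"]), 1),  # gate change, %
        round(percent_change(power[name], power["1-D"]), 1),  # power change, %
    )
    for name in ("PDS", "LPDS")
}
```

The `summary` entries give the extra gate count and the relative changes in hardware cost and power for each design, which can be compared directly with the percentages printed in Table 3.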

All the data are obtained under the worst working conditions (1.62 V, 125 °C) with a 166 MHz clock frequency, and the SRAM cost is not included. Compared with the 1-D array running FSBMA, the proposed PDS and LPDS algorithms reduce power consumption by 33.3% and 37.8%, and the extra hardware costs are 4.05 K gates and 1.73 K gates, respectively. The extra control logic of the PDS and LPDS algorithms consumes 2.76 K gates and 0.59 K gates, respectively. The remaining hardware increase comes from the changes introduced in the 32 PEs and other interface logic. It can be seen that, because the LPDS algorithm adopts a much simpler scheme, less hardware is consumed and more power is saved.

Fig. 8 Chip layouts. (a) 1-D systolic array. (b) 1-D systolic array with proposed PDS algorithm. (c) 1-D systolic array with proposed LPDS algorithm.

Table 3 Performance comparisons among the 1-D systolic array and the 1-D systolic arrays with PDS and LPDS algorithms.

                                    1-D Array          1-D Array + PDS      1-D Array + LPDS
PE Number                           32                 32                   32
Hardware Cost (gates)               23.31 K (+0.0%)    27.36 K (+17.4%)     25.04 K (+7.4%)
Control Unit Cost (gates)           −                  2.76 K               0.59 K
Post-Synthesis Frequency (MHz)      166.0 MHz          166.0 MHz            166.0 MHz
Post-Layout Frequency (MHz)         157.9 MHz          153.8 MHz            160.1 MHz
Post-Layout Power Consumption (mW)  39.21 mW (−0.0%)   26.14 mW (−33.3%)    24.39 mW (−37.8%)
Chip Area (mm²)                     1.40 × 1.02 mm²    1.40 × 1.10 mm²      1.40 × 1.05 mm²

Table 4 Performance comparisons among the TSS, NTSS and DS designs and the proposed architectures.

                                      TSS [18]  TSS [19]         NTSS [20]  DS [21]     1-D Array + PDS   1-D Array + LPDS
Search Range                          15 × 15   15 × 15          15 × 15    32 × 32     32 × 32           32 × 32
PE Number                             9         9                48         tree-based  32                32
Clock Cycles Per MB                   794       831              230        −           8192              8192
Technology                            −         0.8 µm           −          −           0.18 µm           0.18 µm
Frequency (MHz)                       40 MHz    50 MHz           −          50 MHz      166 MHz           166 MHz
Hardware Cost (gates, SRAM excluded)  14.5 K    30.0 K*          32.9 K*    9.0 K       27.36 K           25.04 K
Chip Area (mm²)                       −         6.90 × 5.90 mm²  −          −           1.40 × 1.10 mm²   1.40 × 1.05 mm²
Power Consumption (mW): −, 350 mW @50 MHz, 26.14 mW @166 MHz (1-D Array + PDS), 24.39 mW @166 MHz (1-D Array + LPDS)
* SRAM is included.

The performance comparisons between other fast ME designs [18]–[21] and the proposed architectures are shown in Table 4. Because TSS and DS search only part of the search area, the required PE number can be greatly reduced and thus the hardware cost can be decreased. However, these algorithms incur severe image quality degradation for video with large and fast motions. In contrast, the proposed PDS and LPDS algorithms do not skip any search point before calculation and thus cannot decrease the PE number to save hardware cost. However, the presented algorithms have the following merits: (1) Because no search points are directly skipped, a better image quality

can be achieved. (2) The computation time is predictable and thus is friendly for the system design. (3) The proposed algorithms can be easily integrated into diﬀerent FSBMA architectures to save power consumption. The 1-D systolic array with 32 PEs is used as an example. We can see that the proposed algorithms can eﬀectively reduce the power consumption with little hardware overhead. 5. Conclusions In this paper, two hardware-friendly and low-power oriented fast ME algorithms are proposed. PDS algorithm globally sorts the generated partial distortions and then only en-

IEICE TRANS. INF. & SYST., VOL.E90–D, NO.1 JANUARY 2007

116

ables those search points which have the smallest distortions. To eliminate the sorting overhead, the LPDS algorithm is presented, which divides the search points into different groups and enables only one search point per group. Experiments show that both algorithms achieve almost the same performance as FSBMA. Specifically, for PDS, the maximum quality losses are a 0.11 dB PSNR decrease and a 3.24% bitrate increase. For LPDS, the maximum quality losses are a 0.12 dB PSNR decrease and a 3.41% bitrate increase.

The two algorithms have been implemented with a 1-D systolic array in Verilog HDL and synthesized by Synopsys Design Compiler with TSMC 0.18 µm technology. Under the worst working conditions (1.62 V, 125 °C) and a 166 MHz clock frequency, the PDS algorithm reduces power consumption by 33.3% at an extra hardware cost of 4.05 K gates, and the LPDS algorithm reduces power consumption by 37.8% with a hardware overhead of 1.73 K gates.

The two proposed algorithms target different requirements. For applications in which image quality is critical, the PDS algorithm is suitable; otherwise, the LPDS algorithm may be more desirable because of its smaller hardware cost and power consumption.

Acknowledgments

This work was supported by funds from MEXT via the Kitakyushu innovative cluster project.

References

[1] L.W. Lee, J.F. Wang, J.Y. Lee, and J.D. Shie, “Dynamic search-window adjustment and interlaced search for block-matching algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol.3, no.1, pp.85–87, Feb. 1993.
[2] M.R. Pickering, J.F. Arnold, and M.R. Frater, “An adaptive search length algorithm for block matching motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.7, no.6, pp.906–912, Dec. 1997.
[3] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, “Motion-compensated interframe coding for video conferencing,” Nat. Telecommunications Conf., pp.G5.3.1–G5.3.5, 1981.
[4] R. Li, B. Zeng, and M.L. Liou, “A new three-step search algorithm for block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.4, no.4, pp.438–441, April 1994.
[5] L.M. Po and W.C. Ma, “A novel four-step search algorithm for fast block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.6, no.3, pp.313–317, June 1996.
[6] S. Zhu and K.K. Ma, “A new diamond search algorithm for fast block-matching motion estimation,” IEEE Trans. Image Process., vol.9, no.2, pp.287–290, Feb. 2000.
[7] C. Zhu, X. Lin, and L.P. Chua, “Hexagon-based search pattern for fast block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.12, no.5, pp.349–355, May 2002.
[8] C.D. Bei and R.M. Gray, “An improvement of the minimum distortion encoding algorithm for vector quantization,” IEEE Trans. Commun., vol.33, no.10, pp.1132–1133, Oct. 1985.
[9] W. Li and E. Salari, “Successive elimination algorithm for motion estimation,” IEEE Trans. Image Process., vol.4, no.1, pp.105–107, Jan. 1995.
[10] B. Liu and A. Zaccarin, “New fast algorithms for the estimation of block motion vector,” IEEE Trans. Circuits Syst. Video Technol., vol.3, no.2, pp.148–157, April 1993.

[11] Z.L. He, K.K. Chan, and M.L. Liou, “Low-power VLSI design for motion estimation using adaptive pixel truncation,” IEEE Trans. Circuits Syst. Video Technol., vol.10, no.5, pp.669–678, Aug. 2000.
[12] K.M. Yang, M.T. Su, and L. Wu, “A family of VLSI designs for the motion compensation block-matching algorithm,” IEEE Trans. Circuits Syst., vol.36, no.10, pp.1317–1325, Oct. 1989.
[13] C. Martínez, “Partial quicksort,” The SIAM Workshop on Analytic Algorithmics and Combinatorics, Jan. 2003.
[14] G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” 13th VCEG-M33 Meeting, April 2001.
[15] P. Pirsch, N. Demassieux, and W. Gehrke, “VLSI architecture for video compression - A survey,” Proc. IEEE, vol.83, no.2, pp.220–246, Feb. 1995.
[16] P. Pirsch and H. Stolberg, “VLSI implementation of image and video multimedia processing systems,” IEEE Trans. Circuits Syst. Video Technol., vol.8, no.7, pp.878–891, Nov. 1998.
[17] P.C. Tseng, Y.C. Chang, Y.W. Huang, H.C. Fang, C.T. Huang, and L.G. Chen, “Advances in hardware architectures for image and video coding - A survey,” Proc. IEEE, vol.93, no.2, pp.184–197, Feb. 2005.
[18] H.M. Jong, L.G. Chen, and T.D. Chiueh, “Parallel architectures for 3-step hierarchical search block-matching algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol.4, no.4, pp.407–416, Aug. 1994.
[19] T.H. Chen, “A cost-effective three-step hierarchical search block-matching chip for motion estimation,” IEEE J. Solid-State Circuits, vol.33, no.8, pp.1252–1258, Aug. 1998.
[20] Z.L. He, M.L. Liou, P.C.H. Chan, and R. Li, “An efficient VLSI architecture for new three-step search algorithm,” IEEE Midwest Symp. Circuits Syst., pp.1228–1231, 1996.
[21] W.M. Chao, C.W. Hsu, Y.C. Chang, and L.G. Chen, “A novel hybrid motion estimator supporting diamond search and fast full search,” IEEE Int. Symp. Circuits Syst., pp.492–495, 2002.
[22] C.Y. Chen, S.Y. Chien, Y.W. Huang, T.C. Chen, T.C. Wang, and L.G. Chen, “Analysis and architecture design of variable block size motion estimation for H.264/AVC,” IEEE Trans. Circuits Syst. I, Regular Papers, vol.53, no.2, pp.578–593, March 2006.
[23] H. Yeo and Y.H. Hu, “A novel modular systolic array architecture for full-search block matching motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.5, no.5, pp.407–416, Oct. 1995.
[24] Y.K. Lai and L.G. Chen, “A data-interlacing architecture with two-dimensional data-reuse for full-search block-matching algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol.8, no.2, pp.124–127, April 1998.
[25] T. Komarek and P. Pirsch, “Array architectures for block matching algorithms,” IEEE Trans. Circuits Syst., vol.36, no.10, pp.1301–1308, Oct. 1989.
[26] L.D. Vos and M. Stegherr, “Parameterizable VLSI architectures for the full-search block-matching algorithm,” IEEE Trans. Circuits Syst., vol.36, no.10, pp.1309–1316, Oct. 1989.
[27] C.H. Hsieh and T.P. Lin, “VLSI architecture for block-matching motion estimation algorithm,” IEEE Trans. Circuits Syst. Video Technol., vol.2, no.2, pp.169–175, June 1992.
[28] S.S. Lin, P.C. Tseng, and L.G. Chen, “Low-power parallel tree architecture for full search block-matching motion estimation,” IEEE Int. Symp. Circuits Syst., pp.313–316, May 2004.
[29] Y.W. Huang, S.Y. Chien, B.Y. Hsieh, and L.G. Chen, “Global elimination algorithm and architecture design for fast block matching motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol.14, no.6, pp.898–907, June 2004.


Yang Song received the B.E. degree in Computer Science from Xi’an Jiaotong University, China, in 2001 and the M.E. degree in Computer Science from Tsinghua University, China, in 2004. He is currently a Ph.D. candidate in the Graduate School of Information, Production and Systems, Waseda University, Japan. His research interests include motion estimation, video coding technology, and associated VLSI architecture.

Zhenyu Liu received his B.E., M.E., and Ph.D. degrees in electronics engineering from Beijing Institute of Technology in 1996, 1999, and 2002, respectively. His doctoral research focused on real-time signal processing and related ASIC design. From 2002 to 2004, he worked as a postdoctoral researcher at Tsinghua University, China, where his research mainly concentrated on embedded CPU architecture. Currently he is a researcher at the Kitakyushu Foundation for the Advancement of Industry Science and Technology. His research interests include real-time H.264 encoding algorithms and associated VLSI architecture.

Takeshi Ikenaga received his B.E. and M.E. degrees in electrical engineering and the Ph.D. degree in information and computer science from Waseda University, Tokyo, Japan, in 1988, 1990, and 2002, respectively. He joined LSI Laboratories, Nippon Telegraph and Telephone Corporation (NTT) in 1990, where he undertook research on design and test methodologies for high-performance ASICs, a real-time MPEG2 encoder chip set, and a highly parallel LSI and system design for image-understanding processing. He is presently an associate professor in the system LSI field of the Graduate School of Information, Production and Systems, Waseda University. His current interests are application SoCs for image, security, and network processing. Dr. Ikenaga is a member of the IPSJ and the IEEE. He received the IEICE Research Encouragement Award in 1992.

Satoshi Goto was born on January 3rd, 1945 in Hiroshima, Japan. He received the B.E. and M.E. degrees in Electronics and Communication Engineering from Waseda University in 1968 and 1970, respectively. He also received the Dr. of Engineering degree from the same university in 1981. He is an IEEE Fellow, a member of the Academy Engineering Society of Japan, and a professor at Waseda University. His research interests include LSI systems and multimedia systems.
