IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 8, AUGUST 2009


Motion Estimation Optimization for H.264/AVC Using Source Image Edge Features

Zhenyu Liu, Member, IEEE, Junwei Zhou, Satoshi Goto, Fellow, IEEE, and Takeshi Ikenaga, Member, IEEE

Abstract—The H.264/AVC coding standard performs variable block size motion-compensated prediction with multiple reference frames to achieve a pronounced improvement in compression efficiency. Accordingly, the computation of motion estimation increases in proportion to the product of the number of reference frames and the number of intermodes. The mathematical analysis in this paper illustrates that the motion-compensated prediction errors are mainly determined by the detailed textures in the source image. An image block rich in textures contains numerous high-frequency signals, which make the variable block size and multiple reference frame techniques essential. On the basis of rate-distortion theory, the spatial homogeneity of an image block is treated in this paper as a relative concept with respect to the current quantization step. For a homogenous block, futile reference frames and intermodes can be eliminated efficiently. It is further revealed that the sum of absolute differences value of an image block is mainly determined by the sum of its edge gradient amplitudes and the current quantization step. Consequently, an image content-based early termination algorithm is proposed, which outperforms the original method adopted by the JVT reference software. Moreover, a dynamic search range algorithm based on the edge gradient amplitude of the source image block is analyzed. One eminent advantage of the proposed edge-based algorithms is their suitability for the macroblock-pipelining architecture; another desirable feature is their orthogonality to fast block-matching algorithms. Experimental results show that when these algorithms are integrated with the hybrid unsymmetrical-cross multi-hexagon-grid search, an average of 31.4–60.0% of motion estimation time can be saved, whereas the average BDPSNR loss is 0.0497 dB over all tested sequences.
Index Terms— Edge gradient, fast mode decision, H.264/AVC, motion estimation (ME), multiple reference frame (MRF), variable block size (VBS).

I. INTRODUCTION

HIGH-PERFORMANCE video coding algorithms strive to reduce temporal and spatial redundancy. For this purpose, the latest international video coding standard, H.264/AVC, adopts state-of-the-art techniques, including quarter-pixel accurate motion compensation, variable block size (VBS), and multiple reference frame (MRF) techniques, to improve the coding gain.

Manuscript received January 1, 2008; revised June 15, 2008, August 31, 2008, and December 15, 2008. First version published May 12, 2009; current version published August 14, 2009. This work was supported by CREST, JST. This paper was recommended by Associate Editor G. Wen. Z. Liu is with RIIT, Tsinghua University, Beijing 100084, China (e-mail: [email protected]). J. Zhou is with Sun Microsystems, Inc., Santa Clara, CA 95054 USA (e-mail: [email protected]). S. Goto and T. Ikenaga are with the Graduate School of IPS, Waseda University, Tokyo 808-0135, Japan (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2009.2022796

According to the analysis in [1], 89.2% of the computation power is consumed by the motion estimation (ME) part, and hence reducing the redundant computation of ME in H.264/AVC has become a fundamental research topic. In general, the computation saving of ME rests on the following approaches: 1) reducing the search positions through an efficient search pattern (fast block matching), for example, 1-D full search, four-step search, and diamond search; 2) eliminating the searches of redundant search modes and reference frames; 3) early termination of the block matching by defining some thresholds; and 4) dynamically adjusting the search range. The first category of algorithms, i.e., fast block matching, has been discussed thoroughly in many papers [2]–[4] and widely adopted in software- and hardware-oriented implementations [5], [6]. In this paper, we focus on exploring the approaches in the other three categories. By analyzing the edge features of the source image block, we discard the trivial inter-search modes and reference frames, define content-based thresholds for early termination, and dynamically reduce the search range. The proposed approaches are compatible with the traditional fast block-matching methods. In addition, they are friendly to the macroblock (MB)-pipelining architecture, which is widely adopted in hardwired encoder designs [7], [8].

VBS and MRF algorithms are the major sources of the massive computation in H.264/AVC encoding. The required computation is in direct proportion to the product of the number of reference frames and the number of intermodes. The traditional fast block-matching algorithms cannot efficiently reduce the computational complexity introduced by the VBS and MRF techniques. On the other hand, it has been shown that the performance of the VBS and MRF algorithms depends mainly on the nature of the video sequences [1]. This means that a great deal of computation is performed without achieving any coding improvement.
The experiments in this paper further reveal that, at low bit rates, significant superfluous computation exists in H.264/AVC ME processing. Many algorithms have been proposed to discard the redundant computation in MRFs [1], [9]–[11]. Huang et al. [1] developed four criteria for early terminating the motion search on MRFs. Other works [9]–[11] reduced the search areas depending on the strong correlations of motion vectors (MVs) in consecutive pictures. However, the restrictions of the MB-pipelining architecture either degrade their performance or introduce considerable hardware overheads, as we shall see in Section VI. Although some reasons have been adduced in [1] and [9] for the superior prediction performance of the MRF technique, the analysis of [12], [13] reveals that

1051-8215/$26.00 © 2009 IEEE

Authorized licensed use limited to: Tsinghua University Library. Downloaded on August 21, 2009 at 06:22 from IEEE Xplore. Restrictions apply.


the most critical issue is the aliasing problem [14] arising from the high-frequency signals (or detailed textures in the spatial domain) of the source image. Consequently, the efficiency of the MRF technique depends on the homogeneity of the image block. Using rate-distortion theory, we treat homogeneity as a relative concept, which depends on the power spectral density of the current quantization noise. For a homogenous image block, discarding the MRF technique introduces merely negligible coding quality loss, as shown in Section VII-A. As the quantization parameter (QP) increases, more image blocks are determined to be homogenous by the proposed criteria, and in consequence, the computation-saving performance of our methods improves with increased QP. Moreover, the proposed homogeneity-based reference frame reduction algorithm is well suited to the MB-pipelining architecture, as analyzed in Section VI.

The homogeneity-based fast intermode decision algorithm was first proposed by Wu et al. [15] to discard the redundant intermodes in H.264/AVC. In detail, the homogeneity of one N × N block, where N = 16 or N = 8, is determined by evaluating the sum of the edge magnitudes at all pixels in the block. If the sum is less than a preset constant threshold, the block is designated a homogenous one and is not split further for other intermode searches. The thresholds for the homogeneity decision in [15] are constants. However, experiments illustrate that with the increase of QP, as the rate cost for side information, such as MVs and sub-MB types, becomes expensive, the ratio of 8 × 8 sub-MB modes always declines. Hence, in this paper the relative homogeneity concept is also adopted in the intermode reduction algorithm to improve the computation saving at low bit rates.

Early termination schemes were used in [1], [16], [17], which depend on either all_zero block detection or constant thresholds derived by experiments.
The investigations in [16] show that at high bit rates, as the quantization interval has a small value, the all_zero block detection algorithm hardly provides any computation saving, especially for texture-rich video sequences. In this paper, the spatial investigation reveals that the sum of absolute differences (SAD) value of an image block has strong correlations with the sum of the edge amplitudes of all pixels in the block and the current quantization interval (Q_step). This conclusion explains the strong SAD correlation in pictures illustrated in [5]. Consequently, content-based early termination thresholds for integer motion estimation (IME) are developed. Experiments show that at high and moderate bit rates, the proposed adaptive thresholds outperform the original method [17] adopted by JM reference software version 11.0 (JM11.0) in terms of coding quality and computation saving, especially for sequences rich in detailed textures.

The mathematical analysis in [18] suggests that the motion between the object and the camera sensor works as a low-pass filter, which smooths the edges of the sampled image; this is designated motion blur [19]. Therefore, under typical video recording conditions [20], which refer to a 30-frames/s frame rate, a 1/60-s exposure time, and no synthetic video, we can estimate the motion speed of an image block according to the nature of its edge gradients. In detail, for a block

containing a large edge gradient amplitude, it is reasonable to assume that this block undergoes slow movement, and its search range can then be reduced accordingly. This algorithm is combined with the original dynamic search range (DSR) algorithm in JM11.0 [21] to further improve its computation saving.

The rest of this paper is organized as follows. In Section II, the impact of the spatial edge gradient on the prediction error is analyzed. The edge-based fast reference frame and intermode decision algorithms are then proposed in Section III. The content-based early termination approach and the edge gradient-based search range decision algorithm are described in Sections IV and V, respectively. The overall process flow of the proposed fast algorithms is depicted in Section VI. The proposed algorithms are integrated with the unsymmetrical-cross multi-hexagon-grid search (UMHexagonS) [5] to demonstrate their compatibility with traditional fast block-matching algorithms. Section VII presents detailed experimental results to verify the performance of the proposed schemes. Finally, conclusions are drawn in Section VIII.

II. MATHEMATICAL ANALYSIS OF EDGE GRADIENT IMPACT ON PREDICTION ERROR

With the hybrid coder model [12], Girod deduced that when Sss > Θ, the power spectral density of the prediction errors is

See(Ω) = Sss(Ω) · [ 1 − |P(Ω)|² · Sss(Ω) / (Sss(Ω) + Θ) ]    (1)

where Ω = (ωx, ωy), See and Sss denote the power spectral densities of the prediction errors and the source signals, respectively, P(Ω) is the 2-D Fourier transform of the probability density function (pdf) of the displacement estimation error, and Θ can be interpreted as the power spectral density of the white noise incurred by the quantization processing. From (1), the prediction error power See has a strong correlation with the source signals Sss. When Sss ≫ Θ, (1) can be simplified as See(Ω) = Sss(Ω)[1 − |P(Ω)|²].
In this case, the power of the prediction error hinges entirely on the image content and the pdf of the displacement estimation error.

The spatial analysis presents a more straightforward insight into the prediction error power than the spectral analysis provided in [12]. In this section, the impact of the edge intensity of the source image on the prediction errors is investigated in the spatial domain. To simplify the mathematical description, the analysis is first restricted to 1-D spatial signals, as shown in Fig. 1, and the quantization noise is temporarily ignored. st(x) and st−1(x) denote the spatially continuous signals at time instances t and t − 1. st(x) is a displaced version of st−1(x), the displacement being dx, which can be expressed as st(x) = st−1(x − dx). These continuous image signals are sampled by the sensor array before digital processing. The spatial sampling interval is denoted ux. The displacement estimation error is Δx = dx − round(dx/ux) · ux. From Fig. 1, the prediction error e(i·ux) of pixel i can be approximated as

e(i·ux) ≈ Δx · s′t(i·ux)    (2)

Fig. 1. Analysis of 1-D prediction error caused by edge gradient and displacement estimation error.

where s′t(i·ux) is the edge gradient of st(x) at the ith camera sensor, and the displacement estimation error Δx is a random variable with zero mean and Δx ∈ [−ux/2, ux/2]. When Δx = ±ux/2, |e(i·ux)| reaches its maximum value ux·|s′t(i·ux)|/2, and when Δx = 0, |e(i·ux)| vanishes. This conclusion agrees with the aliasing investigation in the spectral domain provided in [14]. Equation (2) also interprets the necessity of MRFs during the prediction processing: if the displacement error Δx,t−1 between the current image st(x) and the first previous one st−1(x) is larger than that of the kth previous image st−k(x), i.e., Δx,t−k, then st−k(x) is preferred as the prediction signal because its prediction error coming from the aliasing problem is reduced. To simplify the notation in the following discussion, it is assumed that the spatial sampling intervals in the x- and y-directions are ux = uy = 1. From (2), it is convenient to derive the 2-D prediction error at one pixel

e(i, j) ≈ Δx(i, j) · ∂st(i, j)/∂x + Δy(i, j) · ∂st(i, j)/∂y    (3)

If it is assumed that Δx(i, j) and Δy(i, j) are independent, E(Δx) = E(Δy) = 0, and E(Δx²) = E(Δy²) = σΔ², the variance of e(i, j), i.e., σ²(i, j), is written as

σ²(i, j) = σΔ² · [ (∂st(i, j)/∂x)² + (∂st(i, j)/∂y)² ]    (4)

Using the prediction error variance of one pixel (4), the prediction error power of an image block can be deduced as

Σ_{i,j} σ²(i, j) = σΔ² · Σ_{i,j} [ (∂st(i, j)/∂x)² + (∂st(i, j)/∂y)² ]    (5)

where (i, j) ∈ block. Like the spectral analysis represented by (1), (5) also indicates that the prediction error power is determined by the image features and the displacement estimation error. Additionally, the spatial analysis illustrates that the power of the block prediction error is proportional to the sum of squares of the edge gradient amplitudes. This conclusion plays an important role in the proposed early termination threshold definition described in Section IV.
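As a concrete numerical illustration of (5) (ours, not part of the original paper), the sketch below estimates the prediction-error power of a block from its spatial gradients; `sigma_delta2`, the displacement estimation error variance σΔ², is an assumed input, and forward differences stand in for the partial derivatives.

```python
import numpy as np

def block_prediction_error_power(block, sigma_delta2):
    """Estimate the block prediction-error power via (5):
    sigma_delta^2 * sum over pixels of (gx^2 + gy^2),
    where gx, gy approximate the spatial gradients of the source block."""
    block = block.astype(np.float64)
    gx = np.diff(block, axis=1)  # forward difference along x
    gy = np.diff(block, axis=0)  # forward difference along y
    return sigma_delta2 * (np.sum(gx ** 2) + np.sum(gy ** 2))

# A flat (homogeneous) block yields zero predicted error power,
# while a textured block yields a large one, matching the analysis above.
flat = np.full((8, 8), 128)
textured = np.indices((8, 8)).sum(axis=0) % 2 * 64 + 96  # checkerboard
print(block_prediction_error_power(flat, 0.25))      # 0.0
print(block_prediction_error_power(textured, 0.25))
```

The flat block produces no prediction error regardless of the displacement error, which is exactly why its MRF search can be skipped.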

Fig. 2. Model of the hybrid coder with the optimum forward channel, where G(u, v) = max[0, 1 − Θ/See(u, v)] and the power spectral density of N(u, v) is Snn(u, v) = Θ · max[0, 1 − Θ/See(u, v)].

Equation (3) yields two important conclusions. 1) From the displacement error terms |Δx| and |Δy|, the impact of aliasing vanishes at full-pixel displacements and is at its maximum at half-pixel displacements. 2) Because of the edge gradient terms (∂st(i, j)/∂x, ∂st(i, j)/∂y), aliasing is caused by the high-frequency signals in the source image. In practice, a picture that is rich in sharp edges must contain numerous high-frequency signals. In [22], for a 2-D spatial signal s(x, y), (∂s(x, y)/∂x, ∂s(x, y)/∂y) is defined as the local spatial frequency, which describes the local frequency feature in a region. The spatial edge gradient analysis is superior to the spectral analysis because it can efficiently reveal the local frequency nature of the image with trivial computational overhead. Therefore, as we shall see in Section III, when an image block contains numerous textures, the power of its prediction errors is augmented, which requires advanced coding approaches such as the VBS and MRF techniques. Otherwise, the redundant computation can be discarded with negligible coding quality degradation. This is the essence of our homogeneity-based fast algorithms.

III. HOMOGENEITY-BASED REFERENCE FRAME AND INTERMODE REDUCTION

Using rate-distortion theory, the relative homogeneity concept is developed in Section III-A. Based on the relative homogeneous block detection algorithm, the futile reference frames and intermodes can be eliminated efficiently, as described in Section III-B.

A. Relative Homogeneous Block Detection Algorithm

Based on the hybrid coder model with the optimum forward channel, as shown in Fig. 2, it is convenient to develop the relative homogeneity concept. Capital letters, for example St(u, v), represent the discrete 2-D Fourier transforms of the corresponding spatial signals.
Let St(u, v) denote the N × N small image block to be encoded by the hybrid coder, and let Ŝt−1(u, v) be the prediction signal generated from the previously decoded image signals by the low-pass filter F(u, v). The optimum forward channel consists of a nonideal band-limiting filter G(u, v) and an additive noise N(u, v). With rate-distortion theory [23], the distortion D and the corresponding rate R_D of the optimum forward channel are related by

D = (1/N²) · Σ_{u=0}^{N−1} Σ_{v=0}^{N−1} min(Θ, See(u, v))

R_D = (1/N²) · Σ_{u=0}^{N−1} Σ_{v=0}^{N−1} max[0, (1/2) · log2(See(u, v)/Θ)]    (6)

where See(u, v) is the power spectral density of the prediction errors, that is, of Et(u, v). According to (6), when See(u, v) < Θ for all (u, v), R_D = 0 and thereby Et(u, v) = 0. In this case, no texture information is added to Ŝt−1(u, v); that is, Ŝt(u, v) = Ŝt−1(u, v). When St+1(u, v) is encoded, it is reasonable to assume that Ŝt(u, v) is a displaced version of Ŝt−1(u, v) and provides a prediction efficiency similar to that of Ŝt−1(u, v). Therefore, image blocks satisfying See(u, v) < Θ are denoted relative homogenous ones, and the MRF technique is not essential for them. From the mathematical analysis in the Appendix, we derive the first homogeneity decision criterion

max( Σ_{i,j=0}^{N−1} |∂st(i, j)/∂x|, Σ_{i,j=0}^{N−1} |∂st(i, j)/∂y| ) < α · Q_step    (7)

where α is a constant derived by experiments. It has been pointed out in [14] that the aliasing problem is mainly introduced by the high-frequency signals in the source image. The preceding analysis confirms that when the quantization interval is increased, more prediction errors incurred by high-frequency signals can be tolerated. As max(|∂st(i, j)/∂x|, |∂st(i, j)/∂y|) indicates the highest local spatial frequency, the second homogeneity decision criterion, a semiempirical formula, is

max( |∂st(i, j)/∂x|, |∂st(i, j)/∂y| ) < β · QP    (8)

where β is a constant. From the foregoing discussion, with the increase of Θ, more image blocks are determined to be relative homogenous ones, and consequently the efficiency of the MRF algorithm should degrade as QP increases. Experimental results justify this speculation: for instance, when QP = 16, about 0.7 dB of coding quality gain is obtained by MRF for Foreman_QCIF, whereas when QP ≥ 36, the efficiency of MRF is negligible. This means that more redundant MRF motion searches can be eliminated for relative homogeneous blocks at low bit rates. The detailed homogeneity-based fast reference frame algorithm is presented in Section III-B.

In this paper, the relative homogeneity concept is also adopted in the spatial homogeneity-based intermode reduction algorithm. Intermode reduction by spatial homogeneity decision was first introduced by Wu et al. [15]. Based on the homogeneity of the current MB, a subset of all seven intermodes is selected in the rate-distortion optimization processing. The main disadvantage of this method is that the thresholds for the homogeneity decision are constants, which leads to invariant mode reduction performance with respect to QP.

Experiments illustrate that the ratio of sub-MB modes, which include modes 8 × 8, 8 × 4, 4 × 8, and 4 × 4, always declines with increased QP. For example, when QP = 16, sub-MB modes account for 42% in Mobile_QCIF; in contrast, this ratio declines to 11% when QP = 36. Namely, when the bit rate is reduced, more blocks should be identified as homogeneous. Consequently, the threshold for intermode reduction should be dynamically adjusted according to the current QP value.

At the end of this section, the performance of the 2 × 2 edge detector, which is used in the following homogeneity decision, is investigated. Various researchers [15], [24], [25] apply the Sobel edge detector, which uses a pair of 3 × 3 convolution masks, Gx_Sobel and Gy_Sobel, to estimate the edge gradients in the x- and y-directions, respectively. A 2-D Fourier transform analysis reveals that Gx_Sobel and Gy_Sobel work as band-pass filters in the x- and y-directions, respectively. Since the high-frequency signals are the major contributors to the prediction errors, a 2 × 2 edge detection operator using a pair of 2 × 2 convolution masks, Gx_2×2 and Gy_2×2, is adopted in our algorithms:

Gx_2×2 = [ −1  1 ]        Gy_2×2 = [ −1  −1 ]
         [ −1  1 ]                 [  1   1 ]    (9)

The 2-D spectral analysis of Gx_2×2 and Gy_2×2 shows that these two masks work as high-pass filters in the x- and y-directions, respectively. Therefore, the 2 × 2 edge detector can extract the high-frequency signals more efficiently than its Sobel counterpart. Using the 2 × 2 edge detection operator, the corresponding edge vectors in the current MB are defined as

G⃗_{i,j} = gx_{i,j} · x⃗ + gy_{i,j} · y⃗
gx_{i,j} = p_{i+1,j} + p_{i+1,j+1} − p_{i,j} − p_{i,j+1}
gy_{i,j} = p_{i,j+1} + p_{i+1,j+1} − p_{i,j} − p_{i+1,j}    (10)

where p_{i,j} (i ∈ [0, 15], j ∈ [0, 15]) denotes the picture pixel value, and gx_{i,j} and gy_{i,j} represent the edge gradients in the horizontal and vertical directions, i.e., ∂st(i, j)/∂x and ∂st(i, j)/∂y, respectively.
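A minimal sketch (ours, not the authors' code) of the 2 × 2 edge detection of (9) and (10); the array is indexed as p[i, j] with the first axis taken as the i index of (10).

```python
import numpy as np

def edge_gradients_2x2(mb):
    """Per-pixel gradients of a macroblock with the 2 x 2 masks of (9):
    gx[i, j] = p[i+1, j] + p[i+1, j+1] - p[i, j] - p[i, j+1]
    gy[i, j] = p[i, j+1] + p[i+1, j+1] - p[i, j] - p[i+1, j]
    Each gradient costs 3 additions/subtractions, so 6A per pixel in total."""
    p = mb.astype(np.int32)
    gx = p[1:, :-1] + p[1:, 1:] - p[:-1, :-1] - p[:-1, 1:]
    gy = p[:-1, 1:] + p[1:, 1:] - p[:-1, :-1] - p[1:, :-1]
    return gx, gy  # each of shape (15, 15) for a 16 x 16 MB

# A step edge along i produces a pure gx response and no gy response.
mb = np.zeros((16, 16), dtype=np.uint8)
mb[8:, :] = 100
gx, gy = edge_gradients_2x2(mb)
print(int(np.abs(gx).max()), int(np.abs(gy).max()))  # 200 0
```

The 6 additions/subtractions per pixel are what give the 256 × 6A complexity figure quoted in the text.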
Moreover, compared with the 3 × 3 Sobel operator, the computational complexity of one MB's gradient vectors with the 2 × 2 operator is reduced from 256 × (10A + 2S) (A: addition or subtraction; S: shift) to 256 × 6A.

B. Relative Homogeneity-Based Reference Frame and Intermode Reduction

From the analysis of Sections II and III-A, it is observed that the redundant computation of smooth image blocks can be saved with negligible coding quality loss. Using the proposed 2 × 2 edge detection operator, the fast reference frame and intermode reduction algorithms are described as follows. With the conclusions drawn from (7) and (8), the homogeneity of an 8 × 8 sub-MB is determined by its maximum edge gradient amplitude and the sum of its edge gradient amplitudes. Specifically, the 8 × 8 sub-MB is designated as homogeneous provided that either criterion (11a) or (11b) is fulfilled:

max( Σ_{i,j} |gx_{i,j}|, Σ_{i,j} |gy_{i,j}| ) < 32 · Q_step    (11a)

max_{i,j} ( |gx_{i,j}|, |gy_{i,j}| ) < 2 · QP    (11b)

where (i, j) ∈ sub-MB. It should be noted that the accumulation in (11a) works as a low-pass filter, which reduces the false alarm rate incurred by the adverse effect of high-frequency glitch noise.

The homogeneity of the 16 × 16, 16 × 8, and 8 × 16 partitions depends on the homogeneity of their underlying 8 × 8 sub-MBs. Specifically, an M × N partition (M and N are 16 or 8) is classified as homogeneous provided that the 8 × 8 sub-MBs within it are all homogeneous, and as nonhomogeneous otherwise. With the derived homogeneity property, the fast reference frame reduction algorithm is developed as follows: 1) for modes 16 × 16, 16 × 8, and 8 × 16, the reference frame number of each partition depends on its own homogeneity: a homogenous partition uses only the first previous reference frame, whereas a nonhomogeneous one requires MRFs; 2) if the 8 × 8 sub-MB is homogenous, its ME processing is restricted to the first previous reference frame for modes 8 × 8, 8 × 4, 4 × 8, and 4 × 4; otherwise, the MRF technique is employed.

The proposed intermode reduction criteria also depend on the relative homogeneity of the image block. If the edges within the 8 × 8 sub-MB satisfy either criterion (12a) or (12b), it is denoted a strongly homogeneous sub-MB:

max( Σ_{i,j} |gx_{i,j}|, Σ_{i,j} |gy_{i,j}| ) < 32 · Q_step    (12a)

max_{i,j} ( |gx_{i,j}|, |gy_{i,j}| ) < QP    (12b)

The homogeneity-based fast mode reduction algorithm is described as follows: 1) if the 8 × 8 sub-MB is strongly homogenous, its intermodes 8 × 4, 4 × 8, and 4 × 4 are eliminated; 2) if the MB contains four strongly homogenous sub-MBs, it is designated a strongly homogenous MB, and its intermodes 8 × 8, 8 × 4, 4 × 8, and 4 × 4 are all discarded.
For criteria (11a), (11b), (12a), and (12b), as QP increases, more image blocks are classified as homogeneous or strongly homogenous, and in consequence, the computation reduction performance of our algorithms improves. This feature is a prominent advantage, and it agrees with the experimental observation that at low bit rates the coding gains of the VBS and MRF techniques are compromised, so more computation should be saved.
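The decision rules above can be sketched as follows (an illustrative rendering with hypothetical helper names; gx and gy are the 8 × 8 sub-MB gradients from (10), and the pruning is simplified to MB-level flags).

```python
import numpy as np

def classify_sub_mb(gx, gy, qp, q_step):
    """Classify an 8 x 8 sub-MB by (11a)/(11b) and (12a)/(12b).
    Returns (homogeneous, strongly_homogeneous)."""
    sums = max(np.abs(gx).sum(), np.abs(gy).sum())
    peak = max(np.abs(gx).max(), np.abs(gy).max())
    homogeneous = bool(sums < 32 * q_step or peak < 2 * qp)  # (11a) or (11b)
    strong = bool(sums < 32 * q_step or peak < qp)           # (12a) or (12b)
    return homogeneous, strong

def prune_searches(flags):
    """flags: (homogeneous, strong) for the four 8 x 8 sub-MBs.
    Returns which searches survive, per the rules of Section III-B."""
    return {
        # Homogeneous partitions use only the first previous reference frame.
        "mrf_needed": not all(h for h, _ in flags),
        # Strongly homogeneous sub-MBs drop intermodes 8x4, 4x8, 4x4.
        "sub_modes_kept": [not s for _, s in flags],
        # Four strongly homogeneous sub-MBs: drop all sub-MB modes of the MB.
        "mb_sub_modes_kept": not all(s for _, s in flags),
    }

flat = np.zeros((8, 8))
print(classify_sub_mb(flat, flat, qp=28, q_step=16))  # (True, True)
```

As QP (and hence Q_step) grows, both thresholds loosen, so more sub-MBs are pruned, matching the trend described above.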

IV. IMAGE CONTENT-BASED EARLY TERMINATION

JM11.0 integrates the constant-threshold early termination algorithm [17] with UMHexagonS: before the UMHexagonS block-matching processing, a candidate motion vector set, {(±1, 0), (0, ±1), (0, 0), (MVPx ± 1, MVPy), (MVPx, MVPy ± 1), (MVPx, MVPy)}, is first searched, where MVPx and MVPy represent the x- and y-components of the motion vector predictor (MVP). If the current reference frame is the first previous one and the minimum cost over this set is less than the preset threshold TH_ET, subsequent searches are eliminated; otherwise, if the derived minimum rate-distortion (RD) cost of the block is already less than TH_MRF, the motion search is also terminated early. The original values of TH_ET and TH_MRF are defined in [17].

Equation (1) demonstrates that the power of the prediction residues has a strong correlation with the content of the encoded picture and increases with Θ. The spatial analysis in Section II arrives at the same conclusion. In this section, the relationship between the block edge gradient intensity and the SAD value is established to derive the image content-based early termination thresholds. Under the VBS technique, it is assumed that all pixels in the same partition have uniform motion; otherwise, the partition would be split further into smaller blocks to improve the prediction accuracy. With this assumption, the pixels in the same partition share a uniform displacement error (Δx, Δy). If the effect of quantization is temporarily neglected, the SAD value of one partition can be approximated as

SAD ≈ Σ_{i,j} | Δx · ∂st(i, j)/∂x + Δy · ∂st(i, j)/∂y |    (13)

where (i, j) ∈ block. If Δy/Δx is denoted γ, (13) yields

SAD ≈ |Δx| · Σ_{i,j} | ∂st(i, j)/∂x + γ · ∂st(i, j)/∂y |    (14)

Equation (14) indicates that the SAD value of the image block is determined by its edge gradient intensity. Moreover, the SAD ratio between successive reference frames is mainly decided by their displacement estimation errors (|Δx|), as γ is assumed to be unchanged across successive reference frames. This conclusion explains the strong SAD correlation between neighboring reference pictures exploited in [5].

Until now, quantization noise has not been considered in the spatial analysis. With the increase of QP, more high-frequency signals in the reference pictures are lost; in the spatial domain, the edges of the reference picture become smooth. In other words, quantization augments the prediction errors. From the foregoing investigation, the content-based early termination threshold should contain two terms: the first represents the effect of the ac signals of the image, and the second denotes the quantization noise. Consequently, the adaptive thresholds for one image block are written as

TH_ET = Σ_{i,j} ( |gx_{i,j}| + |gy_{i,j}| ) / 8 + (block_size / 32) · Q_step    (15)

TH_MRF = Σ_{i,j} ( |gx_{i,j}| + |gy_{i,j}| ) / (8 · λ) + (block_size / (32 · λ)) · Q_step    (16)


where (i, j) ∈ block, λ = 4 when the block mode is 16 × 16, and λ = 8 otherwise.

V. EDGE GRADIENT-BASED DYNAMIC SEARCH RANGE

Reducing the search range of IME is another promising approach to alleviating its computational intensity. Xu et al. [21] proposed the DSR method, which has been implemented with UMHexagonS in JM11.0. If E is the current block, its left, top, and top-right neighboring blocks are labeled A, B, and C. The motion vectors of A, B, and C are designated (MV_Ax, MV_Ay), (MV_Bx, MV_By), and (MV_Cx, MV_Cy), and their IME RD costs are RD_A, RD_B, and RD_C, respectively. The original algorithm decreases the search range of the current block E using these motion vectors and RD costs. In this section, we first analyze the effect of image motion on the edge gradient and then integrate the edge gradient-based search range reduction algorithm with the original DSR method to further improve its performance.

According to [18], motion between the object and the camera sensor obscures the sampled image, which is designated the motion blurring phenomenon. If an image s(x, y) undergoes planar motion, with x0(t) and y0(t) representing the motion in the x- and y-directions, the obtained image s̃(x, y) can be expressed as the integration over the exposure period T. The exposure time T for motion picture film is normally half the frame interval [20]; for example, if a camera records at 30 frames per second, during each 1/30 s the film is actually exposed to light for roughly half the time, that is, T = 1/60 s. In the frequency domain, this procedure can be written as

S̃(ωx, ωy) = S(ωx, ωy) · H(ωx, ωy)    (17)

where S̃(ωx, ωy) and S(ωx, ωy) represent the 2-D Fourier transforms of s̃(x, y) and s(x, y), and H(ωx, ωy) denotes the low-pass filter caused by the motion blurring effect. For uniform linear motion, x0(t) = at/T and y0(t) = bt/T, in which a/T and b/T represent the speeds in the x- and y-directions, respectively. Under this condition, H(ωx, ωy) is expressed as

H(ωx, ωy) = [ T / (π(ωx·a + ωy·b)) ] · sin[π(ωx·a + ωy·b)] · e^{−jπ(ωx·a + ωy·b)}    (18)

It is observed that the effect of motion is analogous to a low-pass filter whose cutoff frequency decreases as a and b increase. From the conclusions drawn from (17) and (18), the edge gradients of the sampled image become flattened as the motion speed increases. Consequently, the block's search range can be adjusted according to its edge gradient amplitude. To be more specific, when an image block contains sharp edges, we have reason to assume that the block has slow motion, and its search area can be shrunk. Otherwise, the motion search of this block should be processed over a large area. First, the maximum edge gradient magnitude of one 8 × 8 sub-MB is defined as

    G_{8×8_k} = max_{(i,j)∈sub-MB} max(|gx_{i,j}|, |gy_{i,j}|)                       (19)

where k ∈ {0, 1, 2, 3} represents the sub-MB label. G_{8×8_k} is categorized into five levels

    GL_{8×8_k} = max(0, ⌊log₂(G_{8×8_k})⌋ − 5).                                     (20)

The sub-partitions within an 8 × 8 sub-MB share the edge gradient level of their upper-layer 8 × 8 block. The edge gradient levels of the 16 × 16, 16 × 8, and 8 × 16 partitions are defined as

    GL_{16×8_0}  = max(GL_{8×8_0}, GL_{8×8_1})
    GL_{16×8_1}  = max(GL_{8×8_2}, GL_{8×8_3})
    GL_{8×16_0}  = max(GL_{8×8_0}, GL_{8×8_2})                                      (21)
    GL_{8×16_1}  = max(GL_{8×8_1}, GL_{8×8_3})
    GL_{16×16_0} = (1/4) · Σ_{k=0}^{3} GL_{8×8_k}.
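A small sketch of (19)–(21) follows. The Sobel-style gradients gx, gy are assumed to be precomputed, the sample magnitudes are illustrative, and the floor in (20) is our reading of the printed formula.

```python
import math

def gradient_level(g_max):
    """Edge gradient level of one 8x8 sub-MB, per (20)."""
    if g_max < 1:
        return 0
    return max(0, int(math.floor(math.log2(g_max))) - 5)

def partition_levels(gl8):
    """Levels of the larger partitions from the four 8x8 levels, per (21)."""
    return {
        "16x8_0": max(gl8[0], gl8[1]),
        "16x8_1": max(gl8[2], gl8[3]),
        "8x16_0": max(gl8[0], gl8[2]),
        "8x16_1": max(gl8[1], gl8[3]),
        "16x16": sum(gl8) / 4.0,  # the 16x16 level averages the four sub-MBs
    }

# Maximum gradient magnitudes of the four sub-MBs (illustrative values).
lv = partition_levels([gradient_level(g) for g in (700, 40, 90, 300)])
```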

In this paper, the edge gradient information is integrated with the original DSR method in JM11.0 to achieve better computation saving. The proposed search range decision algorithm is depicted as the pseudocode in Fig. 3, where the modifications to the original DSR algorithm are highlighted in bold. Compared with the original method, we further reduce MVθ_Eα and MVγ_Eα according to the edge gradient level GL_E of block E. Note that, in the original DSR algorithm, if the RD costs of the neighboring blocks are larger than the preset constant thresholds TH_DSR[mode], the motion vector predictor of block E is assumed to be inaccurate; therefore, the search range of E must be reset to the full size SRin. However, the analysis in Section IV suggests that RD_A, RD_B, and RD_C should be associated with the texture of their blocks. In other words, for a block containing numerous detailed textures, we should increase its threshold accordingly. Based on the spatial correlation, if the neighboring blocks have the same edge gradient level as block E, GL_E can be used to adjust the RD-cost thresholds of the neighboring blocks. Specifically, the thresholds in our algorithm are defined as TH_DSR[mode] · 2^⌊(GL_E+1)/2⌋. If max(RD_A, RD_B, RD_C) is larger than this dynamic threshold, the search range of E is reset to SRin; otherwise, the shrunk search range is applied.

VI. OVERALL ALGORITHMS

Based on the investigation in Sections III, IV, and V, the proposed fast algorithms are integrated with the UMHexagonS fast ME algorithm provided by JM11.0. UMHexagonS is a hybrid of multiple fast search patterns and achieves a good tradeoff between computation saving and coding quality. Additionally, UMHexagonS exploits the spatial and temporal correlations of RD costs to further improve its computation saving efficiency, which makes its performance sensitive to the interference of other fast ME methods.
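For illustration, one component of the search range decision of Fig. 3 can be sketched as below. The shift amounts, the clamping constants, and the treatment of GL_E < 3 in the MVγ term are our reconstruction of the printed pseudocode, not a verified transcription of JM11.0.

```python
def dsr_component(sr_in, gl_e, abs_mvs, rd_costs, th_dsr, avail):
    """One component (x or y) of the edge-based dynamic search range.
    abs_mvs: |MV| components of neighbors A, B, C; rd_costs: their IME
    RD costs; th_dsr: the constant JM11.0 threshold for the current mode."""
    if avail < 2:                        # too few neighbors: keep full range
        return sr_in
    mv_theta = sr_in >> gl_e             # shrink with the edge gradient level
    mv_gamma = max(2, max(abs_mvs) >> max(0, gl_e - 3))
    mv_sum = sum(abs_mvs) >> 2
    if mv_sum == 0:
        mv_delta = sr_in >> 3
    elif mv_sum <= 3:
        mv_delta = (3 * sr_in + 8) >> 4
    else:
        mv_delta = (sr_in + 2) >> 2
    # Dynamic RD-cost threshold: TH_DSR[mode] * 2^floor((GL_E + 1) / 2).
    if max(rd_costs) > th_dsr * (1 << ((gl_e + 1) // 2)):
        return sr_in                     # predictor judged unreliable
    return min(sr_in, max(min(mv_theta, mv_delta), mv_gamma))
```

The final range is then SRnew = max(DSR_x, DSR_y), as in Fig. 3.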
For instance, the upper-layer prediction scheme adopted in UMHexagonS uses the minimum RD cost of the upper-layer block to derive the early termination thresholds of its under-layer blocks. If one block is trapped in an ill-conditioned position, its search results will further degrade the ME efficiency of its under-layer blocks. These features make UMHexagonS a good platform for verifying the compatibility of the proposed schemes.

Authorized licensed use limited to: Tsinghua University Library. Downloaded on August 21, 2009 at 06:22 from IEEE Xplore. Restrictions apply.

if (A_available + B_available + C_available < 2)
    DSRα = SRin;
else {
    MVθ_Eα = SRin >> GL_E;
    MVγ_Eα = max(2, max(|MV_Aα|, |MV_Bα|, |MV_Cα|) / 2^(GL_E−3));
    MVsum = (|MV_Aα| + |MV_Bα| + |MV_Cα|) >> 2;
    if (MVsum == 0)      MVδ_Eα = SRin >> 3;
    else if (MVsum ≤ 3)  MVδ_Eα = (3 · SRin + 8) >> 4;
    else                 MVδ_Eα = (SRin + 2) >> 2;
    if (max(RD_A, RD_B, RD_C) / (TH_DSR[mode] · 2^⌊(GL_E+1)/2⌋) > 1)
        DSRα = SRin;
    else
        DSRα = min(SRin, max(min(MVθ_Eα, MVδ_Eα), MVγ_Eα));
}
SRnew = max(DSR_x, DSR_y);

Fig. 3. Pseudocode of the optimized DSR algorithm based on the edge gradient of the source image block (E is the current block and its left, top, and top-right neighboring blocks are A, B, and C; the motion vector unit is one pixel; SRin is the input full-size search range; SRnew is the optimized search range; DSRα is the dynamic search range in the x- and y-directions, α = x, y; TH_DSR[mode] are constant thresholds defined in JM11.0).

The modifications are within the intermode motion estimation part of the encode_one_macroblock function, and the main optimizations to the reference software are highlighted in bold font, as shown in Fig. 4. In the first step, the edge detection function, which consumes 512 AD + 2162 A operations (AD: absolute difference; A: addition or subtraction), is executed to derive the edge information of the current MB. In terms of the AD operation cost, the computational complexity of the edge analysis function is only equal to the ME cost of a 16 × 16 block with two search positions. With the derived homogeneity information of the current MB, the redundant reference frames for each partition of 16 × 16, 16 × 8, and 8 × 16, and for each 8 × 8 sub-MB, can be eliminated. If the MB is strongly homogenous, the motion searches for the 8 × 8 sub-MB modes are discarded. Furthermore, if one 8 × 8 sub-MB is identified as strongly homogenous, its sub-partitions 8 × 4, 4 × 8, and 4 × 4 are not searched anymore. The search range and the early termination thresholds of each block depend on its own edge features, and they are adopted within the integer motion estimation processing.

The proposed MRF-intermode reduction and content-based early termination algorithms are friendly to MB-pipelining VLSI implementations. To investigate the impact of MB-pipelining on the existing MRF fast algorithms [1], [9]–[11], a brief dataflow overview of the four-stage encoder [7] shown in Fig. 5(a) is introduced. The encoder consists of five engines partitioned into four pipelined stages, in which four consecutive MBs are processed simultaneously. In the first stage, the IME engine processes all reference frames. The integer motion vectors (IMVs) of all intermode partitions (41 blocks) of the current MB on all reference frames are obtained and then dispatched to the second fractional motion estimation (FME) stage.
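The four-stage schedule described above can be illustrated with a small model; the stage names abbreviate the engines of the encoder in [7], and the timing model (one MB advancing per stage per MB-cycle) is a simplification for illustration.

```python
def pipeline_schedule(num_mbs, stages=("IME", "FME", "IP/MD", "EC/DB")):
    """MB index processed by each stage at every MB-cycle t: stage k
    works on MB (t - k), so four consecutive MBs are in flight at once."""
    rows = []
    for t in range(num_mbs + len(stages) - 1):
        rows.append({s: (t - k if 0 <= t - k < num_mbs else None)
                     for k, s in enumerate(stages)})
    return rows

sched = pipeline_schedule(6)
```

At t = 3, for example, MBs 3, 2, 1, and 0 occupy the four stages simultaneously.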
Edge_Gradient_Detection(Current MB)
// ME for modes 16 × 16, 16 × 8, 8 × 16
Modes = {16 × 16, 16 × 8, 8 × 16};
Loop Modes
    Loop Blocks of Current Mode
        Loop Reference Frames
            Integer Motion Estimation {
                Edge_Based_Search_Range_Decision;
                Content_Based_Early_Termination;
            }
            Fractional Motion Estimation;
            If (Current Block is Homogenous) Break;
        End Loop
        Accumulate Cost of Current Block;
    End Loop
End Loop
If (Current MB is Strong_Homogenous) Finish ME;
// ME for 8 × 8 sub-MB modes
Loop 8 × 8 Blocks
    If (Current 8 × 8 Block is Strong_Homogenous)
        Sub_Partition Modes = {8 × 8};
    Else
        Sub_Partition Modes = {8 × 8, 8 × 4, 4 × 8, 4 × 4};
    Loop Sub_Partition Modes
        Loop Reference Frames
            Loop Sub_Blocks
                Integer Motion Estimation {
                    Edge_Based_Search_Range_Decision;
                    Content_Based_Early_Termination;
                }
                Fractional Motion Estimation;
            End Loop
            If (Current 8 × 8 Block is Homogenous) Break;
        End Loop
    End Loop
    Mode Decision for Current 8 × 8 Block;
    Accumulate Cost of Current 8 × 8 Block;
End Loop

Fig. 4. Pseudocode of the proposed fast algorithms integrated in the encode_one_macroblock function of JM11.0.

Through the quarter-pixel accurate ME and precise RD-cost evaluation, the FME engine finds the best prediction candidates and decides the best inter-prediction mode. The intra prediction (IP), the post-inter/intra mode decision, and the chroma motion compensation are implemented at the third stage. Entropy coding (EC) and in-loop deblocking (DB) are processed in parallel at the fourth stage. From [1], it can be seen that the early termination criteria proposed there must be applied in the second FME stage, and then the IME engine, which is the most computation-intensive component, cannot benefit from these approaches. The main drawback of [9]–[11] is the hardware overhead consumed by the MV composition. In [9], for example, the 4 × 4 block-based MVs of all reference frames must be stored. With the HDTV720p frame size, a 128 × 128 search range, and five reference frames, a total of 5 529 600 bits of memory is required. For the accuracy of the MV composition, hardware-consuming multiplication is also applied. The four-stage MB-pipelining hardwired encoder integrated with the proposed algorithms is illustrated in Fig. 5(b). When the current MB pixels are loaded in, the edge detection is performed, and the homogeneity properties and the early termination thresholds for all blocks within the current MB are derived. These values are fed into the first IME stage along with the current MB. As described in Section III-B, the homogeneity properties are applied to avoid the redundant reference frame




Fig. 5. Comparisons of our VLSI-friendly fast ME algorithms and the original methods in a four-stage MB-pipelining hardwired encoder. (a) Block diagram of the four-stage MB-pipelining hardwired encoder and its impact on the motion-vector-composition and early-termination based fast MRF algorithms (five reference frames are configured and the number of the current frame is n). (b) The hardwired encoder's MB-pipelining architecture integrated with the proposed early termination and fast MRF-intermode reduction algorithms. (IMV: integer motion vector; ET: early termination; RFnum: reference frame number.)
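The 41 × RFnum IMVs exchanged between the IME and FME stages in Fig. 5 count one integer motion vector per partition of the seven intermodes; a quick tally:

```python
# Partitions of one macroblock under the seven H.264/AVC intermodes:
# 1 (16x16) + 2 (16x8) + 2 (8x16) + 4 (8x8) + 8 (8x4) + 8 (4x8) + 16 (4x4).
MODES = {"16x16": 1, "16x8": 2, "8x16": 2, "8x8": 4,
         "8x4": 8, "4x8": 8, "4x4": 16}

def partition_count():
    """Total number of variable-block-size partitions per MB."""
    return sum(MODES.values())
```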

and intermode IME searches, and the early termination scheme can further save the computational power of the IME engine. When the current MB moves to the second FME stage, its homogeneity properties are transferred together, and the FME for trivial reference frames and intermodes is saved accordingly.

VII. EXPERIMENTAL RESULTS

In this section, we first investigate the performance of the homogeneity-based fast reference frame and intermode decision algorithm, followed by the content-based early termination algorithm and the edge gradient-based adaptive search range algorithm. Finally, the detailed coding quality and computation saving performance of the overall fast algorithms is presented. In the experiments, similar results were generally observed for many sequences. Due to limited space, several representative sequences were selected, including the QCIF resolution sequences Foreman, Mobile, Coastguard, Football, and Container and the CIF sequences Foreman, Football, Container, Mobile, and Canoa. The sequences with fast-moving objects, such as Football, were used to test the effect of the proposed DSR method. Although some test vectors contain slow and simple movements, such as Mobile, their spatial details make them depend highly on the MRF and VBS techniques. These videos demonstrated the efficiency of the proposed reference frame and intermode reduction algorithms.

The picture rate was 30 frames per second, and an IPPP GOP structure with a single slice per picture was used for all sequences. The search range was ±16 for the QCIF sequences and ±32 for the CIF sequences. The reference frame number was set to 5. High-complexity rate-distortion optimization (RDO_ON) and the CAVLC entropy coding method were also used. For each sequence, 200 frames and QP values of 16, 20, 24, 28, 32, 36, and 40 were tested [26]. All simulations were conducted on a Dell Precision Workstation 650, which combines an Intel Xeon 3.06 GHz processor with 2 GB of ECC double data rate SDRAM at 266 MHz. The operating system was Fedora 4, and GCC 4.0.0 was used as the C compiler.

In the following evaluations of the coding quality of the proposed algorithms, BDBR (Bjontegaard delta bit rate) and BDPSNR (Bjontegaard delta PSNR) [27], which are, respectively, the average differences in bit rate and PSNR between two methods, were applied to produce the quantitative analysis. Unless stated otherwise, BDBR and BDPSNR were derived from the simulations with QP = 16, 24, 32, and 40, which cover the high, moderate, and low bit rates. The (+) sign in BDBR and the (−) sign in BDPSNR indicate coding loss. The improved UMHexagonS algorithm in JM11.0, which integrates the fast strategies proposed in [17] and [21], is used as the standard benchmark. The speedup, the coding quality, and the compatibility with fast block-matching algorithms of the proposed schemes are demonstrated in this section.
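BDPSNR is computed as in [27]: each RD curve is fitted with a third-order polynomial in the logarithm of the bit rate, and the PSNR gap is averaged over the overlapping rate interval. A common formulation is sketched below; the sample rate/PSNR points are illustrative only.

```python
import numpy as np

def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average PSNR difference (test minus reference) between two RD curves."""
    lr_ref, lr_test = np.log10(rate_ref), np.log10(rate_test)
    p_ref = np.polyfit(lr_ref, psnr_ref, 3)
    p_test = np.polyfit(lr_test, psnr_test, 3)
    lo = max(lr_ref.min(), lr_test.min())   # overlapping log-rate interval
    hi = min(lr_ref.max(), lr_test.max())
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    return (int_test - int_ref) / (hi - lo)

rates = np.array([100.0, 200.0, 400.0, 800.0])   # kb/s, illustrative
psnrs = np.array([30.0, 33.0, 36.0, 39.0])       # dB, illustrative
```

Shifting one curve up by a constant number of dB yields exactly that constant as the BDPSNR, which is a convenient sanity check.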




TABLE I
Coding Quality Comparison of Homogeneity-Based Fast Algorithms with the JVT Codec

QCIF        BDPSNR (dB)  BDBR (%)     CIF         BDPSNR (dB)  BDBR (%)
Foreman     −0.0606      +1.19        Foreman     −0.1169      +2.57
Mobile      −0.0088      +0.19        Football    −0.0561      +1.00
Coastguard  −0.0194      +0.30        Mobile      −0.0236      +0.45
Football    −0.0439      +0.82        Canoa       −0.0376      +0.67
Container   −0.0097      +0.18        Container   −0.0018      +0.01

TABLE II
Motion Estimation Computation Saving (ΔC) of Homogeneity-Based Algorithms Versus QP (%)

QP                 16     20     24     28     32     36     40
Foreman_QCIF     −17.5  −22.2  −26.3  −32.1  −40.8  −51.7  −70.5
Football_QCIF    −18.8  −26.1  −31.9  −36.8  −43.1  −52.0  −69.4
Mobile_QCIF       −1.9   −4.3   −6.3   −7.8  −10.2  −16.6  −37.8
Coastguard_QCIF   −6.4  −11.2  −15.9  −21.1  −27.4  −37.3  −54.3
Container_QCIF   −35.3  −38.7  −40.4  −42.1  −44.5  −48.0  −54.5
Foreman_CIF      −36.2  −42.9  −49.0  −55.1  −62.0  −69.7  −82.3
Football_CIF     −35.3  −44.3  −50.9  −56.1  −61.9  −68.7  −79.4
Mobile_CIF        −6.3   −7.3   −8.6  −10.6  −13.7  −18.8  −30.3
Canoa_CIF        −10.8  −16.1  −21.4  −26.6  −32.9  −42.4  −61.1
Container_CIF    −40.5  −46.0  −49.2  −51.4  −53.7  −56.8  −63.4

TABLE III
Motion Estimation Time Saving (ΔME) of Homogeneity-Based Algorithms Versus QP (%)

QP                 16     20     24     28     32     36     40
Foreman_QCIF     −16.7  −21.8  −23.3  −30.7  −39.6  −50.0  −70.6
Football_QCIF    −18.6  −24.3  −29.1  −33.6  −38.5  −44.0  −64.3
Mobile_QCIF       −2.6   −2.8   −6.3   −6.7   −7.2  −14.2  −35.3
Coastguard_QCIF   −9.6  −10.4  −12.9  −18.8  −24.3  −33.2  −48.5
Container_QCIF   −39.2  −42.4  −42.0  −38.5  −42.7  −45.5  −52.8
Foreman_CIF      −35.4  −43.1  −48.0  −53.9  −60.3  −70.7  −80.9
Football_CIF     −32.5  −43.4  −48.1  −51.9  −57.4  −64.1  −72.9
Mobile_CIF        −5.1   −5.4   −5.5   −9.5  −10.3  −17.2  −24.1
Canoa_CIF         −9.0  −13.8  −16.7  −22.8  −26.7  −44.5  −56.0
Container_CIF    −43.8  −49.4  −47.7  −50.7  −51.7  −52.2  −59.4

TABLE IV
Performance Comparisons of the Content-Based Early Termination Algorithm

Sequences        BDPSNR (dB)  BDBR (%)  avg(ΔME) (%)
Foreman_QCIF     +0.0066      −0.17      −0.6
Football_QCIF    −0.0054      +0.06      −0.3
Mobile_QCIF      +0.0023      −0.04     −15.6
Coastguard_QCIF  −0.0016      +0.02      −4.1
Container_QCIF   −0.0077      +0.22      −2.7
Foreman_CIF      −0.0000      +0.00      −0.6
Football_CIF     −0.0047      +0.07      −2.7
Mobile_CIF       +0.0053      −0.08     −11.8
Canoa_CIF        −0.0028      +0.03      −6.8
Container_CIF    +0.0022      −0.04      −2.6

TABLE V
Performance Comparisons of the Edge-Based Dynamic Search Range Algorithm

Sequences        BDPSNR (dB)  BDBR (%)  avg(ΔME) (%)
Foreman_QCIF     −0.0039      +0.07      −4.2
Football_QCIF    −0.0292      +0.55      −4.3
Mobile_QCIF      −0.0065      +0.10     −11.0
Tempete_QCIF     −0.0030      +0.02     −12.4
Coastguard_QCIF  −0.0176      +0.29     −10.0
Foreman_CIF      −0.0233      +0.53      −4.6
Football_CIF     −0.0046      +0.09      −5.1
Mobile_CIF       −0.0051      +0.12     −21.3
Canoa_CIF        −0.0176      +0.23     −10.8
Container_CIF    +0.0012      −0.04      −2.7

A. Performance Analysis of the Homogeneity-Based VBS and MRF Reduction Algorithms

The coding quality analysis of the homogeneity-based fast algorithms in terms of BDBR and BDPSNR is illustrated in Table I. In the worst case (Foreman_CIF), the introduced coding quality loss was BDPSNR = −0.1169 dB or, equivalently, a 2.57% bit rate increase versus the original UMHexagonS algorithm.

To evaluate the computation saving performance of the proposed homogeneity-based fast algorithms, we first define a metric that is processor- and OS-neutral. In theory, the computation saving of the ME processing, which comes from the proposed homogeneity-based fast algorithms, can be written as

    ΔC = [ ( Σ_{m=1}^{7} (H_m + 5·H̄_m) · BS_m ) / (8960 · MB_NR) − 1 ] × 100%      (22)

where H_m denotes the number of mode-m homogenous blocks using a single reference frame, H̄_m denotes the number of mode-m blocks designated as nonhomogenous, BS_m represents the block size of mode m in terms of its pixel number, and MB_NR is the overall MB number in the encoded P frames. For the evaluation of the practical ME speedup of the proposed fast algorithms, the ME time saving ΔME is defined as

    ΔME = (T_fast − T_UMHexagonS) / T_UMHexagonS × 100%                              (23)

where T_fast denotes the ME time of UMHexagonS integrated with our algorithms and T_UMHexagonS is the time taken by the original UMHexagonS in JM11.0. The experimental results of ΔC and ΔME versus QP are summarized in Tables II and III, respectively. From the comparisons between Tables II and III, it can be found that the practical ME time saving statistics agree well with the theoretical analysis. It is clearly illustrated that the computation saving is enhanced as QP increases. Moreover, our algorithms were efficient for the sequences with strong spatial homogeneity, for example Container, because the aliasing problem in these videos is not prominent. On the other hand, some complex texture-rich sequences, such as Mobile_QCIF, required the MRF and VBS techniques in most cases, especially at high and moderate bit rates. Even at low bit rates (QP = 40), only 37.8% of the computation



[Fig. 6: PSNR (dB) versus bit rate curves, kb/s for the QCIF plot and Mb/s for the CIF plot.] Fig. 6. Rate-distortion curve comparisons for the QCIF and CIF sequences (simulation conditions: QP = 16, 20, 24, 28, 32, 36, and 40; input search range: QCIF = ±16, CIF = ±32; RDO_ON; five reference frames; IPPP; CAVLC; FastME = UMHexagonS).

TABLE VI
Performance Comparison of the Overall Fast Motion Estimation Algorithms

UMHexagonS + Proposed versus UMHexagonS
QCIF         Foreman   Football  Mobile   Coastguard  Container
BDPSNR (dB)  −0.0747   −0.0659   −0.0195  −0.0261     −0.0265
BDBR (%)     +1.47     +1.14     +0.37    +0.46       +0.58
CIF          Foreman   Football  Mobile   Canoa       Container
BDPSNR (dB)  −0.1321   −0.0665   −0.0294  −0.0563     +0.0001
BDBR (%)     +2.95     +1.22     +0.55    +0.96       −0.03

UMHexagonS + Proposed versus FS
QCIF         Foreman   Football  Mobile   Coastguard  Container
BDPSNR (dB)  −0.1051   −0.1153   −0.0254  −0.0187     −0.0507
BDBR (%)     +2.06     +1.89     +0.50    +0.36       +1.25
CIF          Foreman   Football  Mobile   Canoa       Container
BDPSNR (dB)  −0.1893   −0.1298   −0.0237  −0.0913     −0.0382
BDBR (%)     +4.28     +2.30     +0.43    +1.53       +1.03

in Mobile_QCIF was saved in theory. It was also observed that the computation saving efficiency improved with the frame size. Comparing the tests of Foreman_QCIF and Foreman_CIF, when QP = 40, ΔC increased by 11.8%. This result can be explained by aliasing theory. With the decrease of the sampling interval, the sampling frequency increases, so that fewer high-frequency signals remain after normalization by the sampling frequency. Consequently, the aliasing problem is alleviated, and more image blocks are determined to be homogenous by our algorithms.

B. Performance Analysis of Content-Based Early Termination

To evaluate the performance of the proposed early termination algorithm, it was integrated with UMHexagonS in JM11.0, and the original constant-threshold early termination algorithm was used as the benchmark. BDBR and BDPSNR were derived from the simulations with QP = 16, 20, 24, and 28. The experimental results in terms of the coding quality and the ME time saving are depicted in Table IV. avg(ΔME) indicates the average ME time saving for QP = 16, 20, 24, and 28. It is clearly illustrated that the content-based threshold method outperformed the constant counterpart, especially for those sequences rich in complex textures. For example, when encoding Mobile_QCIF, a 15.6% ME time saving was achieved by the proposed algorithm while the coding quality was well maintained.

TABLE VII
Motion Estimation Time Saving of the Overall Edge-Based Fast Motion Estimation Algorithms Versus QP (%)

QP                 16     20     24     28     32     36     40
Foreman_QCIF     −25.4  −29.0  −32.4  −38.7  −43.7  −53.6  −70.8
Football_QCIF    −25.2  −30.4  −34.7  −38.7  −42.4  −47.2  −65.2
Mobile_QCIF      −18.7  −21.6  −23.1  −23.0  −20.8  −20.0  −39.0
Coastguard_QCIF  −19.7  −19.9  −28.8  −31.9  −32.7  −38.3  −50.9
Container_QCIF   −42.8  −43.6  −44.6  −46.3  −48.2  −50.0  −54.7
Foreman_CIF      −40.7  −46.7  −53.2  −59.3  −65.5  −71.5  −82.9
Football_CIF     −40.0  −41.6  −53.4  −57.1  −62.2  −68.0  −77.1
Mobile_CIF       −31.5  −31.8  −33.0  −31.8  −34.3  −33.3  −35.7
Canoa_CIF        −22.4  −28.1  −31.0  −32.5  −39.9  −43.7  −59.6
Container_CIF    −47.2  −52.7  −55.2  −56.3  −57.8  −59.5  −64.5
average          −31.4  −34.5  −38.9  −41.6  −44.8  −48.5  −60.0

TABLE VIII
Overall Encoding Time Saving of the Proposed Algorithms Versus QP (%)

QP                 16     20     24     28     32     36     40
Foreman_QCIF     −13.0  −17.8  −19.5  −24.3  −29.7  −37.7  −50.5
Football_QCIF    −15.2  −19.2  −23.0  −27.2  −31.0  −36.5  −48.3
Mobile_QCIF       −7.4   −9.6  −12.5  −11.7  −10.8  −13.2  −26.4
Coastguard_QCIF   −9.2   −9.5  −17.8  −19.6  −22.6  −27.0  −38.0
Container_QCIF   −21.3  −26.4  −26.9  −28.8  −29.4  −33.4  −36.9
Foreman_CIF      −24.3  −29.4  −36.1  −40.4  −46.5  −51.8  −61.1
Football_CIF     −26.0  −29.2  −38.4  −40.2  −46.9  −53.2  −60.1
Mobile_CIF       −14.7  −15.4  −17.8  −17.7  −20.8  −19.2  −25.7
Canoa_CIF        −12.3  −17.8  −20.9  −21.5  −30.4  −34.1  −46.6
Container_CIF    −25.2  −29.4  −33.1  −34.5  −37.9  −39.7  −43.2
average          −16.9  −20.4  −24.6  −26.6  −30.6  −34.6  −43.7

C. Performance Analysis of the Edge Gradient-Based Dynamic Search Range

As described in Section V, the edge gradient information of the current image block was used to enhance the computation saving performance of the original neighboring-motion-vector-based DSR algorithm. The performance comparisons with the original method are illustrated in Table V. BDPSNR and



BDBR were applied to evaluate the coding quality performance of our scheme. avg(ΔME) indicates the average ME time saving for QP = 16, 20, 24, 28, 32, 36, and 40. The worst coding quality loss was BDPSNR = −0.0292 dB, for the Football_QCIF test sequence. Our algorithm had an advantage over the original one in computation saving, especially for those sequences rich in textures. For example, compared with the original method, an average of up to 21.3% of the ME time for the Mobile_CIF sequence was saved, while the introduced BDPSNR loss was only 0.0051 dB.

D. Performance Analysis of the Whole Algorithms

The RD curve comparisons of the overall edge-based fast algorithms integrated with UMHexagonS versus the original UMHexagonS are shown in Fig. 6. As our algorithms provide almost the same coding efficiency, it is hard to distinguish the fast algorithms' curves from the reference ones in most cases. The detailed coding quality experimental results in terms of BDBR and BDPSNR are shown in Table VI. In these tests, the original UMHexagonS and the exhaustive search (FS) provided by JM11.0 were used as the benchmarks. The comparisons with UMHexagonS verified the coding quality degradation incurred by the proposed schemes. Compared with the original UMHexagonS, the maximum coding quality loss (BDPSNR = −0.1321 dB, or equivalently a 2.95% increase in bit rate) was from Foreman_CIF, and the average coding quality degradation over these 10 sequences was BDPSNR = −0.0497 dB. We further used FS as the benchmark to better understand the coding performance of the whole ME algorithms. The comparisons with FS revealed that an average 0.0788 dB coding quality loss in BDPSNR was introduced by the overall algorithms. The ME time saving results are listed in Table VII. Compared to the original UMHexagonS method, on average, 31.4–60.0% of the ME time was saved by the proposed approaches. It was observed that the speed-up performance of our algorithms always increased with QP.
For instance, when QP = 16, compared with UMHexagonS, the average ME time saving was 31.4%; in contrast, when QP = 40, this metric increased to 60.0%. This result came from the proposed relative homogeneity-based intermode and reference frame reduction algorithms. For those sequences rich in sharp edges, such as Mobile, most image blocks were identified as nonhomogenous at high and moderate bit rates, which made the VBS and MRF techniques essential. From the comparison of Tables VII and III, it was found that most of their ME time saving came from the edge-based DSR and the content-based early termination algorithms. Consequently, their time saving ratios remained almost unchanged when QP < 40. However, even for the worst case, Mobile_QCIF, 18.7–39.0% of the motion search time could still be saved by the proposed algorithms compared to the original UMHexagonS. The overall encoding time saving versus QP is illustrated in Table VIII. On average, 16.9–43.7% of the encoding time was saved by our fast ME algorithms.

VIII. CONCLUSION

Using the source image edge features, four fast ME algorithms were developed in this paper to reduce the


computational complexity of variable block size motion estimation with MRFs in H.264/AVC. For the relatively homogenous block, the redundant computation for futile reference frames and intermodes is eliminated efficiently. On the other hand, an image block containing complex textures is identified as a slow-moving block, and the search area for its integer motion estimation is shrunk accordingly. Additionally, the mathematical relationship between the edge intensity of the source image block and its SAD value was derived, and the image content-based early termination algorithm was proposed, with superior performance over the original method of JM11.0, especially at high and moderate bit rates. Moreover, the proposed edge-based fast algorithms are efficient for MB-pipelining hardwired encoders, and they are orthogonal to the traditional fast block-matching methods. All the proposed fast algorithms were integrated with UMHexagonS in JM11.0. Experimental results show that, compared to the original UMHexagonS, 31.4–60.0% of the ME time can be saved, with an average of 0.0497 dB coding quality degradation.

APPENDIX
DERIVATION OF (7)

Now the discussion turns to how to estimate S_ee(u, v) < Δ from the texture features of the source image block. The investigations of [28] and [29] reveal that the variance of the dc coefficient of the prediction residues is larger than that of the ac coefficients. From this conclusion, we assume that if S_ee(0, 0) is less than Δ, the condition S_ee(u, v) < Δ is satisfied. For a small-scale image block, it can be supposed that all pixels of the block have the same motion vector, i.e., Δx(i, j) = Δx and Δy(i, j) = Δy. From (3), the dc coefficient of the prediction errors of an N × N image block is

    E_t(0, 0) = (1/N) [ Δx · Σ_{i,j=0}^{N−1} ∂s_t(i, j)/∂x + Δy · Σ_{i,j=0}^{N−1} ∂s_t(i, j)/∂y ].    (24)

If Δx and Δy are independent, E(Δx) = E(Δy) = 0, and E(Δx²) = E(Δy²) = σ_Δ², then S_ee(0, 0) is expressed as

    S_ee(0, 0) = E(|E_t(0, 0)|²)
               = (σ_Δ²/N²) [ ( Σ_{i,j=0}^{N−1} ∂s_t(i, j)/∂x )² + ( Σ_{i,j=0}^{N−1} ∂s_t(i, j)/∂y )² ].    (25)

Obviously

    S_ee(0, 0) ≤ (2σ_Δ²/N²) [ max( |Σ_{i,j=0}^{N−1} ∂s_t(i, j)/∂x| , |Σ_{i,j=0}^{N−1} ∂s_t(i, j)/∂y| ) ]².    (26)
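A quick numerical check of the bound (26), with illustrative gradient sums and variance:

```python
def see_dc(sx, sy, sigma2, n):
    """S_ee(0,0) per (25); sx, sy are the summed x- and y-gradients."""
    return sigma2 / n ** 2 * (sx ** 2 + sy ** 2)

def see_dc_bound(sx, sy, sigma2, n):
    """Upper bound of (26), using a^2 + b^2 <= 2 * max(|a|, |b|)^2."""
    return 2.0 * sigma2 / n ** 2 * max(abs(sx), abs(sy)) ** 2

val = see_dc(30.0, 40.0, 1.0, 4)
bound = see_dc_bound(30.0, 40.0, 1.0, 4)
```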

The optimum forward channel corresponds to the discrete cosine transform (DCT) and the quantization processing in H.264/AVC; in consequence, Δ can be approximated as the noise coming from the quantization. Based on the uniform quantization investigation in [28], Δ is defined as

    Δ = Q²_step / 12                                                                 (27)

where Q_step is approximated as 0.625 × 2^{QP/6} in H.264/AVC. Combining (26) and (27) derives that, when

    max( |Σ_{i,j=0}^{N−1} ∂s_t(i, j)/∂x| , |Σ_{i,j=0}^{N−1} ∂s_t(i, j)/∂y| ) < (N / (√24 · σ_Δ)) · Q_step    (28)

S_ee(u, v) < Δ is satisfied for all (u, v). Equation (28) yields the first homogeneity decision criterion, that is

    max( |Σ_{i,j=0}^{N−1} ∂s_t(i, j)/∂x| , |Σ_{i,j=0}^{N−1} ∂s_t(i, j)/∂y| ) < α · Q_step                   (29)
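The criterion (29) can be sketched directly; the α used below is a placeholder value, since the paper determines this constant by experiment.

```python
def q_step(qp):
    """Approximate H.264/AVC quantization step: 0.625 * 2^(QP/6)."""
    return 0.625 * 2.0 ** (qp / 6.0)

def is_homogenous(sum_abs_gx, sum_abs_gy, qp, alpha=1.0):
    """First homogeneity criterion (29); alpha is an illustrative constant."""
    return max(sum_abs_gx, sum_abs_gy) < alpha * q_step(qp)
```

Because Q_step grows exponentially with QP, the same block is more likely to be judged homogenous at low bit rates, which matches the QP-dependent savings reported in Tables II and III.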

where α is a constant derived by experiments.

REFERENCES

[1] Y.-W. Huang, B.-Y. Hsieh, S.-Y. Chien, S.-Y. Ma, and L.-G. Chen, "Analysis and complexity reduction of multiple reference frames motion estimation in H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 4, pp. 507–522, Apr. 2006.
[2] M.-J. Chen, L.-G. Chen, and T.-D. Chiueh, "1-D full search motion estimation algorithm for video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 5, pp. 504–509, Oct. 1994.
[3] L.-M. Po and W.-C. Ma, "A novel four-step search algorithm for fast block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 313–317, Jun. 1996.
[4] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, "A novel unrestricted center-biased diamond search algorithm for block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 4, pp. 369–377, Aug. 1998.
[5] Z.-B. Chen, J.-F. Xu, Y. He, and J.-L. Zheng, "Fast integer-pel and fractional-pel motion estimation for H.264/AVC," J. Visual Commun. Image Representation, vol. 17, no. 2, pp. 264–290, Apr. 2006.
[6] T.-C. Chen, Y.-H. Chen, S.-F. Tsai, S.-Y. Chien, and L.-G. Chen, "Fast algorithm and architecture design of low-power integer motion estimation for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 5, pp. 568–577, May 2007.
[7] T.-C. Chen, S.-Y. Chien, Y.-W. Huang, C.-H. Tsai, C.-Y. Chen, T.-W. Chen, and L.-G. Chen, "Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 6, pp. 673–688, Jun. 2006.
[8] Z.-Y. Liu, Y. Song, M. Shao, S. Li, L.-F. Li, S. Ishiwata, M. Nakagawa, S. Goto, and T. Ikenaga, "HDTV1080p H.264/AVC encoder chip design and performance analysis," IEEE J. Solid-State Circuits, vol. 44, no. 2, pp. 594–608, Feb. 2009.
[9] Y.-P. Su and M.-T. Sun, "Fast multiple reference frame motion estimation for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 3, pp. 447–452, Mar. 2006.
[10] M.-J. Chen, Y.-Y. Chiang, H.-J. Li, and M.-C. Chi, "Efficient multiframe motion estimation algorithms for MPEG-4/AVC/JVT/H.264," in Proc. Int. Symp. Circuits Syst., 2004, vol. 3, pp. 737–740.
[11] M.-J. Chen, G.-L. Li, Y.-Y. Chiang, and C.-T. Hsu, "Fast multiframe motion estimation algorithms by motion vector composition for the MPEG-4/AVC/H.264 standard," IEEE Trans. Multimedia, vol. 8, no. 3, pp. 478–487, Jun. 2006.
[12] B. Girod, "The efficiency of motion-compensating prediction for hybrid coding of video sequences," IEEE J. Selected Areas Commun., vol. SAC-5, no. 7, pp. 1140–1154, Aug. 1987.
[13] B. Girod, "Efficiency analysis of multihypothesis motion-compensated prediction for video coding," IEEE Trans. Image Process., vol. 9, no. 2, pp. 173–183, Feb. 2000.
[14] T. Wedi and H. G. Musmann, "Motion- and aliasing-compensated prediction for hybrid video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 577–586, Jul. 2003.
[15] D. Wu, F. Pan, K. P. Lim, S. Wu, Z. G. Li, X. Lin, S. Rahardja, and C. C. Ko, "Fast intermode decision in H.264/AVC video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 6, pp. 953–958, Jul. 2005.
[16] Z.-Y. Liu, L.-F. Li, Y. Song, S. Li, S. Goto, and T. Ikenaga, "Motion feature and Hadamard coefficient-based fast multiple reference frame motion estimation for H.264," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 5, pp. 620–632, May 2008.
[17] X. Xu and Y. He, "Improvements on fast motion estimation strategy for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 3, pp. 285–293, Mar. 2008.
[18] R. Gonzalez and R. Woods, Digital Image Processing, 2nd ed. Englewood Cliffs, NJ: Prentice Hall, 2002.
[19] Shutter Speed [Online]. Available: http://en.wikipedia.org/wiki/Shutter_speed
[20] Time-Lapse [Online]. Available: http://en.wikipedia.org/wiki/Time-lapse
[21] X.-Z. Xu and Y. He, "Modification of dynamic search range for JVT," presented at the 17th Meeting of the JVT of ISO/IEC MPEG and ITU-T VCEG, Nice, France, Oct. 2005, JVT-Q088.
[22] J. W. Goodman, Introduction to Fourier Optics, 3rd ed. Greenwood Village, CO: Roberts and Company Publishers, 2005.
[23] T. Berger, Rate Distortion Theory. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[24] F. Pan, X. Lin, S. Rahardja, K. P. Lim, Z. G. Li, D. Wu, and S. Wu, "Fast mode decision algorithm for intraprediction in H.264/AVC video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 6, pp. 813–821, Jul. 2005.
[25] G. Sullivan, T. Wiegand, and K.-P. Lim, "Joint model reference encoding methods and decoding concealment methods," presented at the 9th Meeting of the JVT of ISO/IEC MPEG and ITU-T VCEG, San Diego, CA, Sep. 2003, JVT-I049.
[26] G. Sullivan and G. Bjontegaard, "Recommended simulation common conditions for H.26L coding efficiency experiments on low-resolution progressive-scan source material," presented at the 14th Meeting of the ITU-T Video Coding Experts Group, Santa Barbara, CA, Sep. 2001, VCEG-N81.
[27] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," presented at the 13th Meeting of the ITU-T Video Coding Experts Group, Austin, TX, Apr. 2001, VCEG-M33.
[28] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice Hall, 1989.
[29] I.-M. Pao and M.-T. Sun, "Modeling DCT coefficients for fast video encoding," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 4, pp. 608–616, Jun. 1999.

Zhenyu Liu (M'07) received the B.E., M.E., and Ph.D. degrees in electronic engineering from the Beijing Institute of Technology, China, in 1996, 1999, and 2002, respectively. His doctoral research focused on real-time signal processing and the related ASIC design. From 2002 to 2004, he was a Postdoctoral Fellow at Tsinghua University, China, where his work mainly concentrated on embedded CPU architecture design. From September 2004 to March 2009, he was a visiting researcher at the Graduate School of IPS, Waseda University. He is currently an Associate Professor at Tsinghua University, Beijing, China. His research interests include real-time H.264/AVC encoding algorithm optimization and the associated VLSI architecture design.

Junwei Zhou received the B.S. and M.S. degrees in electrical engineering from the Beijing Institute of Technology, Beijing, China, in 1995 and 1998, respectively, and the M.S. degree in computer science and the Ph.D. degree in electrical engineering from Michigan State University, East Lansing, in 2001 and 2006, respectively. From 2002 to 2006, he was a Research Assistant in the Advanced MicroSystems and Circuits Group, Michigan State University. He is currently a Design Engineer at Sun Microsystems, Inc., Santa Clara, CA, working on high-performance microprocessors and register file design. His research interests include low-power digital circuits and computer architecture.

Authorized licensed use limited to: Tsinghua University Library. Downloaded on August 21, 2009 at 06:22 from IEEE Xplore. Restrictions apply.


Satoshi Goto (S’69–M’77–SM’84–F’86) was born in Hiroshima, Japan, in 1945. He received the B.E. and M.E. degrees in electronics and communication engineering and the Dr.Eng. degree from Waseda University, Tokyo, Japan, in 1968, 1970, and 1981, respectively. He joined NEC Laboratories in 1970, where he served as General Manager and Vice President of the LSI Design, Multimedia System and Software Division. Since 2003, he has been a Professor at the Graduate School of Information, Production and Systems, Waseda University, Tokyo, Japan. His main research interest is VLSI design methodologies for multimedia and mobile applications. He has published 7 books, 38 journal papers, and 67 reviewed international conference papers. Dr. Goto served as General Chair of ICCAD and ASP-DAC and was a board member of the IEEE Circuits and Systems Society. He is a Fellow of the IEEE and IEICE and a member of the Engineering Academy of Japan.


Takeshi Ikenaga (M’95) received the B.E. and M.E. degrees in electrical engineering and the Ph.D. degree in information and computer science from Waseda University, Tokyo, Japan, in 1988, 1990, and 2002, respectively. He joined LSI Laboratories, Nippon Telegraph and Telephone Corporation, in 1990, where he undertook research on design and test methodologies for high-performance ASICs, a real-time MPEG-2 encoder chipset, and highly parallel LSI and system designs for image-understanding processing. He is presently an Associate Professor with the Graduate School of Information, Production and Systems, Waseda University. His current interests are application SoCs for image, security, and network processing. He is also engaged in research on H.264 encoder LSI, JPEG 2000 codec LSI, LDPC decoder LSI, UWB wireless communication LSI, public-key encryption LSI, object recognition LSI, etc. Dr. Ikenaga is a member of the Institute of Electronics, Information and Communication Engineers of Japan and the Information Processing Society of Japan.

