Multiresolution Feature-Based Image Registration

Chiou-Ting Hsu¹ and Rob A. Beuker²

¹ Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
² Video Processing Group, Systems Laboratory Eindhoven (PS-SLE), Philips Semiconductors B.V.

ABSTRACT
Image registration is the fundamental task used to match two or more partially overlapping images and stitch them into one panoramic image comprising the whole scene. To register two images, the coordinate transformation between the pair must be found. In this paper, a multiresolution feature-based method is developed to efficiently estimate an eight-parameter projective transformation model between pairs of images. The proposed method has been tested and works well on many images of static scenes taken with a hand-held digital still camera.

Keywords: image registration, projective coordinate transform, feature points, wavelet transform

1. INTRODUCTION
Image registration is used to match two or more partially overlapping images and stitch them into one panoramic image of the scene. To register two images, the coordinate transformation between the pair must be found from a class of transformations. The optimal transformation depends on the type of relation between the overlapping images. To find the relationship between two images, we estimate the parameters of a transformation model; the number of parameters depends on the chosen model. A common assumption is that the coordinate transformation between two images is a rigid planar model. A rigid planar transformation is composed of scaling, rotation, and translation, and maps the pixel (x1, y1) of image I1 to the pixel (x2, y2) of another image I2:

\[
\begin{bmatrix} x_2 \\ y_2 \end{bmatrix}
= s\,R \begin{bmatrix} x_1 \\ y_1 \end{bmatrix} + T
= s \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}
+ \begin{bmatrix} t_x \\ t_y \end{bmatrix}.
\]

The rigid transformation is sufficient to match two images of a scene taken from the same viewing angle but from different positions; in addition, the camera may rotate about its optical axis. In the case of remote sensing, where the distance approaches infinity, the transformation between the captured images behaves like a planar rigid transformation. The projective transformation model involves eight parameters, m_k (k = 0~7),

\[
\begin{bmatrix} x_2 \\ y_2 \end{bmatrix}
= \frac{\begin{bmatrix} m_0 & m_1 \\ m_3 & m_4 \end{bmatrix}
        \begin{bmatrix} x_1 \\ y_1 \end{bmatrix}
        + \begin{bmatrix} m_2 \\ m_5 \end{bmatrix}}
       {\begin{bmatrix} m_6 & m_7 \end{bmatrix}
        \begin{bmatrix} x_1 \\ y_1 \end{bmatrix} + 1},
\]

which accounts for the distortions that occur when images of a 3D static scene are taken at a fixed location, or when multiple images of a planar scene are taken from arbitrary locations. In practice, the projective model is also a good approximation for a 3D scene when the movement of the camera is small compared to its depth, and for a scene whose relative distances between objects are small compared to its depth. In this paper, the projective transformation model is estimated efficiently by using the rigid planar parameters as a good initial guess.
The images are assumed to be taken with a hand-held digital still camera (DSC), where rotation about the optical axis and scaling changes arise from manual operation.
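To make the two transformation models concrete, the following is a minimal numerical sketch of both coordinate mappings (the function names and point values are illustrative, not from the paper):

```python
import numpy as np

def rigid_transform(x1, y1, s, theta, tx, ty):
    """Rigid planar model: scaling s, rotation theta, translation (tx, ty)."""
    c, sn = np.cos(theta), np.sin(theta)
    x2 = s * (c * x1 - sn * y1) + tx
    y2 = s * (sn * x1 + c * y1) + ty
    return x2, y2

def projective_transform(x1, y1, m):
    """Eight-parameter projective model m = (m0, ..., m7)."""
    denom = m[6] * x1 + m[7] * y1 + 1.0
    x2 = (m[0] * x1 + m[1] * y1 + m[2]) / denom
    y2 = (m[3] * x1 + m[4] * y1 + m[5]) / denom
    return x2, y2
```

With m6 = m7 = 0 the projective model reduces to an affine map, and the rigid planar model is the special case m0 = m4 = s cos θ, m1 = −s sin θ, m3 = s sin θ, m2 = t_x, m5 = t_y — which is exactly how the paper uses the rigid fit as an initial guess for the projective fit.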


In Visual Communications and Image Processing 2000, Proceedings of SPIE vol. 4067 (2000), pp. 1490-1498, Perth, Australia, 20-23 June 2000.

As the number of possible transformations grows, the computational costs quickly become unmanageable. To reduce these costs, measurements are based only on feature points of the images instead of the entire images. In this work, a multiresolution feature-based technique is applied. The input images are first decomposed into their multiresolution representations, and a number of feature points at the coarsest resolution are determined. Next, feature point pairs are matched at the coarsest resolution level and used to derive the projective parameters. Lastly, a hierarchical coarse-to-fine refinement is performed to efficiently derive the projective parameters at full image resolution and, finally, to merge the warped images into one image mosaic.

2. MATCHING ALGORITHM
The overall flow of the proposed algorithm is illustrated in Figure 1.

2.1 FEATURE POINT EXTRACTION
Initially, the input images are decomposed into their multiresolution representations by a discrete wavelet transform. Based on the multiresolution structure, a number of feature points at the coarsest level are determined by the multiscale edge correlation method [1][2]. Herein, the multiscale edge detection [1] and its improved edge correlation method [2] are described in turn.

2.1.1 Multiscale edge detection
In [1], a multiscale edge detection method based on the wavelet transform is derived. Let

\[
\psi^1_j(x, y) = \frac{1}{2^{2j}}\,\psi^1\!\left(\frac{x}{2^j}, \frac{y}{2^j}\right)
\quad\text{and}\quad
\psi^2_j(x, y) = \frac{1}{2^{2j}}\,\psi^2\!\left(\frac{x}{2^j}, \frac{y}{2^j}\right)
\]

be two wavelets, which are the dilations of \psi^1(x, y) and \psi^2(x, y) by a factor of 2^j, respectively. The 2D wavelet transform of an image I(x, y) at scale 2^j (i.e. level j) can be decomposed into two independent directions as

\[
W^1_j I(x, y) = (I * \psi^1_j)(x, y)
\quad\text{and}\quad
W^2_j I(x, y) = (I * \psi^2_j)(x, y).
\]
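These two components turn out to be proportional to the gradient of the image smoothed at scale 2^j, as derived next. A numpy sketch that approximates them with a Gaussian standing in for the smoothing function (an assumption made here for brevity; the paper uses the Mallat-Zhong wavelet) is:

```python
import numpy as np

def _gaussian_kernel(sigma):
    """1D normalised Gaussian kernel, truncated at 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def smooth(image, sigma):
    """Separable Gaussian smoothing; a stand-in for convolution with S_j."""
    k = _gaussian_kernel(sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1,
                              image.astype(float))
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)

def wavelet_components(image, j):
    """Approximate W^1_j and W^2_j as derivatives of the smoothed image,
    and return the edge modulus M_j as their magnitude."""
    s = smooth(image, 2 ** j)
    w2, w1 = np.gradient(s)   # row-wise (y) and column-wise (x) derivatives
    return w1, w2, np.hypot(w1, w2)
```

On a step edge, the modulus returned by this sketch peaks along the edge, which is what the local-maxima detection described below exploits.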

Let S(x, y) be a 2D smoothing function. The multiscale sharp variation points (i.e. edge points) can be obtained from the wavelet transform if

\[
\psi^1(x, y) = \frac{\partial S(x, y)}{\partial x}
\quad\text{and}\quad
\psi^2(x, y) = \frac{\partial S(x, y)}{\partial y}.
\]

It is derived in [1] that

\[
W^1_j(x, y) = (I * \psi^1_j)(x, y)
= \left(I * 2^j \frac{\partial S_j}{\partial x}\right)(x, y)
= 2^j \frac{\partial}{\partial x}(I * S_j)(x, y)
\]
\[
W^2_j(x, y) = (I * \psi^2_j)(x, y)
= \left(I * 2^j \frac{\partial S_j}{\partial y}\right)(x, y)
= 2^j \frac{\partial}{\partial y}(I * S_j)(x, y).
\]

That is,

\[
\begin{bmatrix} W^1_j(x, y) \\ W^2_j(x, y) \end{bmatrix}
= 2^j \begin{bmatrix} \frac{\partial}{\partial x}(I * S_j)(x, y) \\[2pt]
                      \frac{\partial}{\partial y}(I * S_j)(x, y) \end{bmatrix}
= 2^j\, \nabla (I * S_j)(x, y),
\]

where \nabla(I * S_j)(x, y) denotes the gradient vector. Hence, the two components of the wavelet transform are proportional to the two components of the gradient vector. At each level, the modulus of the gradient vector is proportional to

\[
M_j(x, y) = \sqrt{\left|W^1_j I(x, y)\right|^2 + \left|W^2_j I(x, y)\right|^2}.
\]

If the local maxima of M_j are located and thresholded with a given value, then the edge points at scale 2^j are detected.

2.1.2 Edge correlation



For image registration, more specific and robust feature points are preferred over noisy edge information. Another criterion, named "edge correlation", is introduced in [2]:

\[
ER_n(j, x, y) = \prod_{i=0}^{n-1} M_{j-i}(x, y),
\]

where n indicates the number of scales involved in the multiplication, and j is the initial scale for the edge correlation. Based on the edge correlation, the direct multiplication of wavelet transform data at adjacent scales is employed to distinguish important edges from noise. This method relies on the fact that sharp edges produce a large response over multiple scales, while noise vanishes across scales. Therefore, correlating the wavelet gradient over different scales sharpens the major edges and suppresses noise and insignificant features. For an image with many high-frequency components, only the local maxima of the edge points are retained as image feature points [3], as shown in Figure 2. This is done to improve the efficiency of matching feature points between the two images. Moreover, the estimation of an accurate transformation is then applied only to the matched feature point set instead of the entire images, which reduces the computational cost quite significantly. An additional cost saving is due to the downsampling performed during the multiresolution decomposition: the computation of measurements and pairwise matching is carried out at the coarsest resolution level.

2.2 FINDING THE MATCHING PAIRS
2.2.1 Find all possible matching pairs
After the feature points at the coarsest level have been extracted from both images, the set of all possible matching pairs {p ⇔ q} between the two images is determined. The similarity between any two feature points (px, py) and (qx, qy) is measured by their normalised cross-correlation. Notably, the cross-correlation must be normalised, since local image intensity would otherwise influence the measure. This measure has the property that it expresses correlation on a scale ranging over [-1, 1]. For each feature point p, the peak value of the metric (which should be larger than a predefined threshold) determines its best match point q.
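A sketch of this normalised cross-correlation matching follows; the window size and threshold are illustrative assumptions, not values given in the paper:

```python
import numpy as np

def normalised_cross_correlation(patch1, patch2):
    """NCC of two equal-sized patches; the result lies in [-1, 1]."""
    a = patch1 - patch1.mean()
    b = patch2 - patch2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def best_match(img1, img2, p, candidates, win=8, threshold=0.8):
    """For feature point p in img1, pick the candidate q in img2 whose
    surrounding window has peak NCC above the threshold (else None)."""
    px, py = p
    w1 = img1[py - win:py + win + 1, px - win:px + win + 1]
    best_q, best_score = None, threshold
    for qx, qy in candidates:
        w2 = img2[qy - win:qy + win + 1, qx - win:qx + win + 1]
        if w2.shape != w1.shape:   # skip candidates too close to the border
            continue
        score = normalised_cross_correlation(w1, w2)
        if score > best_score:
            best_q, best_score = (qx, qy), score
    return best_q, best_score
```

Because both mean and energy are divided out, the score is unaffected by local brightness and contrast differences between the two images, which is exactly why the paper insists on the normalisation.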
2.2.2 Eliminating the false matching pairs
Furthermore, we devised a consistency test based on the assumption that, when the rotation and scaling changes are small, the difference between the translation vectors of two consistent matching pairs is smaller than a threshold [4]. For example, let MP = {p_k ⇔ q_k, k = 1, 2, ..., N_m} be a set of matching pairs, where N_m represents the number of elements in MP, p_k is a feature point in I1, and q_k is a feature point in I2. Then the relationship of p_k and q_k is

\[
q_k = s\,R\,p_k + T \cong p_k + T,
\quad\text{if } s \cong 1 \text{ and } R \cong
\begin{bmatrix} \cos 0 & -\sin 0 \\ \sin 0 & \cos 0 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.
\]

If the pairs {p_i ⇔ q_i} and {p_l ⇔ q_l} are correct matching pairs, then the difference between \vec{T}_i = q_i - p_i and \vec{T}_l = q_l - p_l should be small. Figure 3 shows an example.
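Under this small-rotation, small-scaling assumption, the consistency of each pair can be scored by comparing translation vectors; a sketch, with an illustrative distance threshold:

```python
import numpy as np

def consistency_counts(P, Q, dist_threshold=3.0):
    """For each matching pair (p_k, q_k), count how many other pairs have
    a translation vector T = q - p within dist_threshold of this pair's.
    The threshold value is an illustrative assumption."""
    T = np.asarray(Q, float) - np.asarray(P, float)
    counts = []
    for k in range(len(T)):
        counts.append(int(sum(
            np.linalg.norm(T[k] - T[l]) <= dist_threshold
            for l in range(len(T)) if l != k)))
    return counts
```

Pairs whose translation vector disagrees with the dominant one receive a low count and can be discarded as false matches.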

Therefore, the difference between \vec{T}_i and \vec{T}_l determines whether two matching pairs are consistent, and for the set of all possible matching pairs, a list of counters records how many other pairs are consistent with the current pair. Among the matching pairs with the largest consistency count, the pair with maximum matching similarity (i.e. normalised correlation value) is selected as a reliable pair, and all matching pairs not consistent with the selected pair are eliminated.

2.3 DERIVING THE TRANSFORMATION PARAMETERS



After the elimination of the false matching pairs, the retained correct matching pairs are used to derive the image transformation parameters.

2.3.1 Deriving four rigid planar parameters
Initially, the four rigid planar parameters (i.e. scaling, rotation angle, and two translational shifts) are calculated by a singular value decomposition (SVD) method [5][3]. Let M = {p_i ⇔ q_i, i = 1, 2, ..., N_c} be the set of correct matching pairs, where N_c is the total number of correct matching pairs. The four rigid planar parameters (s, R and T) are derived based on the relation

\[
q_i = s\,R\,p_i + T
= s \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} p_i
+ \begin{bmatrix} t_x \\ t_y \end{bmatrix}.
\]

The error function is defined as

\[
\Phi = \sum_{i=1}^{N_c} \left\| s\,R\,p_i + T - q_i \right\|^2.
\]

By minimising \Phi, a set of solutions can be derived [3] as follows. From \partial\Phi / \partial T = 0, we obtain

\[
T = \bar{q} - s\,R\,\bar{p},
\quad\text{where}\quad
\bar{p} = \frac{1}{N_c}\sum_{i=1}^{N_c} p_i
\quad\text{and}\quad
\bar{q} = \frac{1}{N_c}\sum_{i=1}^{N_c} q_i.
\]

And from \partial\Phi / \partial s = 0, we can derive

\[
s = \frac{\sum_{i=1}^{N_c} \tilde{q}_i^{\,t}\, R\,\tilde{p}_i}
         {\sum_{i=1}^{N_c} \tilde{p}_i^{\,t}\,\tilde{p}_i},
\quad\text{where}\quad
\tilde{p}_i = p_i - \bar{p}
\quad\text{and}\quad
\tilde{q}_i = q_i - \bar{q}.
\]

Then, \Phi is reorganised as

\[
\Phi = \sum_{i=1}^{N_c} \tilde{q}_i^{\,t}\,\tilde{q}_i
- \frac{\left(\sum_{i=1}^{N_c} \tilde{q}_i^{\,t}\, R\,\tilde{p}_i\right)^2}
       {\sum_{i=1}^{N_c} \tilde{p}_i^{\,t}\,\tilde{p}_i}.
\]

Therefore, minimising \Phi can be converted into maximising the term

\[
\Phi' = \sum_{i=1}^{N_c} \tilde{q}_i^{\,t}\, R\,\tilde{p}_i.
\]

The singular value decomposition (SVD) method is used to solve for R by the following steps:
1. Calculate the 2×2 matrix H, that is,
\[
H = \sum_{i=1}^{N_c} \tilde{p}_i\,\tilde{q}_i^{\,t}.
\]
2. Find the SVD of H, i.e. H = P\Lambda Q^t.
3. R = Q P^t.

After the rotation matrix R is found, the scale factor s and the translation vector T can be found from the above equations.

2.3.2 Deriving eight projective parameters
Subsequently, the four rigid planar parameters are used as the initial guess to solve the eight-projective-parameter problem by iteratively minimising the distances between the coordinates of the feature points and the projected coordinates of their matching points using the Levenberg-Marquardt method [6].
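The SVD-based rigid fit of Section 2.3.1, which supplies this initial guess, can be sketched in a few lines (the function name is illustrative; the construction follows the least-squares fitting of [5]):

```python
import numpy as np

def fit_rigid(P, Q):
    """Least-squares fit of q_i = s R p_i + T for (N, 2) point arrays P, Q."""
    P = np.asarray(P, float)
    Q = np.asarray(Q, float)
    p_bar, q_bar = P.mean(axis=0), Q.mean(axis=0)
    Pt, Qt = P - p_bar, Q - q_bar              # centred points p~_i, q~_i
    H = Pt.T @ Qt                              # H = sum_i p~_i q~_i^t
    U, _, Vt = np.linalg.svd(H)                # H = P Lambda Q^t
    R = Vt.T @ U.T                             # R = Q P^t
    if np.linalg.det(R) < 0:                   # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    # s = (sum q~_i^t R p~_i) / (sum p~_i^t p~_i)
    s = np.sum(Qt * (Pt @ R.T)) / np.sum(Pt * Pt)
    T = q_bar - s * R @ p_bar                  # T = q_bar - s R p_bar
    return s, R, T
```

The reflection guard is not in the paper's three-step recipe but is a standard safeguard when the SVD returns a matrix with determinant -1.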



That is, if p_i = (p_ix, p_iy) and q_i = (q_ix, q_iy) are a matching pair of feature points:

1. Use the SVD-based method described above to find the four parameters (s, \theta, t_x and t_y) of the rigid planar transform.
2. Use the four rigid planar parameters as the initial guess for the projective parameters, that is,

\[
\begin{bmatrix} q'_{ix} \\ q'_{iy} \end{bmatrix}
= \frac{\begin{bmatrix} m_0 & m_1 \\ m_3 & m_4 \end{bmatrix}
        \begin{bmatrix} p_{ix} \\ p_{iy} \end{bmatrix}
        + \begin{bmatrix} m_2 \\ m_5 \end{bmatrix}}
       {\begin{bmatrix} m_6 & m_7 \end{bmatrix}
        \begin{bmatrix} p_{ix} \\ p_{iy} \end{bmatrix} + 1}
= \frac{s \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
        \begin{bmatrix} p_{ix} \\ p_{iy} \end{bmatrix}
        + \begin{bmatrix} t_x \\ t_y \end{bmatrix}}
       {\begin{bmatrix} 0 & 0 \end{bmatrix}
        \begin{bmatrix} p_{ix} \\ p_{iy} \end{bmatrix} + 1}.
\]

3. Iteratively minimise the sum of the distances between (q_ix, q_iy) and (q'_ix, q'_iy) over all matching pairs by the Levenberg-Marquardt method. That is, minimise

\[
\chi^2 = \sum_{i=1}^{N_c} d_i^2,
\quad\text{where}\quad
d_i = \sqrt{(q'_{ix} - q_{ix})^2 + (q'_{iy} - q_{iy})^2},
\]

by calculating the partial derivatives with respect to each parameter (m_k, k = 0~7),

\[
\frac{\partial d_i}{\partial m_k}
= \frac{1}{d_i}\left[(q'_{ix} - q_{ix})\frac{\partial q'_{ix}}{\partial m_k}
+ (q'_{iy} - q_{iy})\frac{\partial q'_{iy}}{\partial m_k}\right],
\]

and then updating m_k (k = 0~7) at each iteration.
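This minimisation can be sketched as follows, with two simplifications that should be called out: residuals are taken per coordinate rather than as the distances d_i (the minimiser of \chi^2 is the same), and the damping factor \lambda is held fixed instead of being adapted per iteration as in a full Levenberg-Marquardt implementation [6]:

```python
import numpy as np

def project(m, P):
    """Apply the eight-parameter projective model to an (N, 2) point array."""
    x, y = P[:, 0], P[:, 1]
    d = m[6] * x + m[7] * y + 1.0
    return np.stack([(m[0] * x + m[1] * y + m[2]) / d,
                     (m[3] * x + m[4] * y + m[5]) / d], axis=1)

def refine_projective(P, Q, m_init, iters=50, lam=1e-3):
    """Damped Gauss-Newton (Levenberg-Marquardt style) refinement of m."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    m = np.asarray(m_init, float).copy()
    x, y = P[:, 0], P[:, 1]
    n = len(P)
    for _ in range(iters):
        d = m[6] * x + m[7] * y + 1.0
        qp = project(m, P)
        r = (Q - qp).ravel()                  # residuals, interleaved x/y
        J = np.zeros((2 * n, 8))              # Jacobian of q' w.r.t. m
        J[0::2, 0] = x / d
        J[0::2, 1] = y / d
        J[0::2, 2] = 1.0 / d
        J[1::2, 3] = x / d
        J[1::2, 4] = y / d
        J[1::2, 5] = 1.0 / d
        J[0::2, 6] = -qp[:, 0] * x / d        # d q'_x / d m6 = -q'_x x / d
        J[0::2, 7] = -qp[:, 0] * y / d
        J[1::2, 6] = -qp[:, 1] * x / d
        J[1::2, 7] = -qp[:, 1] * y / d
        A = J.T @ J + lam * np.eye(8)         # fixed damping (a simplification)
        m += np.linalg.solve(A, J.T @ r)
    return m
```

Starting from the rigid-planar initial guess, each iteration solves a small 8x8 system, so the cost depends only on the number of matched pairs, not on the image size.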

The above minimisation process converges quickly to accurate eight projective parameters, and the solution is not trapped in local minima, since the four rigid planar parameters provide a good initial estimate. It is worthwhile to note that the minimisation is applied only to the coordinates of the matching pairs instead of to the image intensities of the entire image, as is typical of optical-flow-based methods [7]. Therefore, the efficiency is significantly improved over the optical-flow-based method.

2.4 HIERARCHICAL MATCHING REFINEMENT
Next, a hierarchical coarse-to-fine refinement is performed to refine the eight parameters estimated at the coarsest resolution. The refinement steps are as follows:

1. Reconstruct both images at the coarse resolution up to the next higher resolution layer by the inverse wavelet transform.
2. Multiply the coordinates of each matching pair by two and adjust the eight estimated parameters to fit the corresponding coordinates to the current resolution layer. That is,

\[
p_i \Leftarrow p_i \times 2
\quad\text{and}\quad
q_i \Leftarrow q_i \times 2,
\quad i = 1, \ldots, N_c,
\]
\[
m_2 \Leftarrow m_2 \times 2,\quad
m_5 \Leftarrow m_5 \times 2,\quad
m_6 \Leftarrow m_6 / 2,\quad
m_7 \Leftarrow m_7 / 2.
\]

3. Based on the adjusted parameters m_0 ~ m_7, the exact corresponding point of p_i in the current resolution layer should lie in the neighborhood of q'_i, where

\[
q'_i = \frac{\begin{bmatrix} m_0 & m_1 \\ m_3 & m_4 \end{bmatrix} p_i
             + \begin{bmatrix} m_2 \\ m_5 \end{bmatrix}}
            {\begin{bmatrix} m_6 & m_7 \end{bmatrix} p_i + 1}.
\]

Then, only the set of pixels within a small window centered around p_i is warped according to m_0 ~ m_7, and its best matching point is found by a local search within the neighborhood centered around q'_i. Hence, a refined set of matching pairs at the current resolution is obtained.
4. Next, based on the refined matching pairs, the accurate parameters at the current resolution layer are derived by applying the Levenberg-Marquardt iterative method, using the adjusted parameters as the initial guess.
5. Recursively perform the refinement until the full resolution level is reached.
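The parameter adjustment in step 2 can be verified with a short sketch: if q' = M(p) holds at the coarse level, the adjusted parameters reproduce 2q' at the doubled point 2p (function names are illustrative):

```python
def upscale_parameters(m):
    """Adjust the eight parameters when moving one resolution layer up,
    i.e. when all coordinates are doubled."""
    m = list(m)
    m[2] *= 2.0   # m2 <- m2 * 2 (x translation scales with coordinates)
    m[5] *= 2.0   # m5 <- m5 * 2 (y translation likewise)
    m[6] /= 2.0   # m6 <- m6 / 2 (denominator terms absorb the factor 2)
    m[7] /= 2.0   # m7 <- m7 / 2
    return m

def apply_model(m, p):
    """Apply the eight-parameter projective model to a single point."""
    x, y = p
    d = m[6] * x + m[7] * y + 1.0
    return ((m[0] * x + m[1] * y + m[2]) / d,
            (m[3] * x + m[4] * y + m[5]) / d)
```

Substituting P = 2p into the projective model shows why this works: the numerator keeps m_0, m_1, m_3, m_4 unchanged while the constant terms m_2, m_5 must double, and dividing the denominator terms m_6, m_7 by two keeps the "+1" intact.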



3. EXPERIMENTAL RESULTS
The proposed method has been tested on a number of images of static scenes taken with a hand-held DSC (Philips Digital Camera ESP80). Figures 4(a) and (b) are two images taken indoors, where the projective distortions are quite visible. Figures 4(c) and (d) are the stitched mosaics using (a) and (b), respectively, as the reference coordinates. Figures 5(a)~(d) are four partially overlapping images taken outdoors with a hand-held still camera. The stitched mosaic, using (c) as the reference coordinates, is shown in Figure 5(e). The speed of the proposed method depends on the level of decomposition, the number of retained feature points, and the local search range for refinement. For two images of size 600×400 (as shown in Figures 5(a)~(d)), with 3 decomposition levels, it takes about 2 seconds (on a Pentium-II 300 MHz PC) to align them and stitch them by weighted averaging into one image mosaic.

4. ACKNOWLEDGEMENT The research described in this paper has been sponsored by Philips Innovation Center Taipei (PICT), Philips Research. We would like to acknowledge the support given by B. S. Paul Lin, Ir. Marcel Führen, Christopher Wei and the AFDC group members.

5. REFERENCES
1. S. Mallat and S. Zhong, "Characterization of Signals from Multiscale Edges," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 14, No. 7, pp. 710-732, July 1992.
2. Y. Xu, J. B. Weaver, D. M. Healy and J. Lu, "Wavelet Transform Domain Filters: A Spatially Selective Noise Filtration Technique," IEEE Trans. Image Processing, Vol. 3, No. 6, pp. 747-757, November 1994.
3. J. W. Hsieh, H. Y. M. Liao, K. C. Fan, M. T. Ko and Y. P. Hung, "Image Registration Using a New Edge-Based Approach," Computer Vision and Image Understanding, Vol. 67, No. 2, pp. 112-130, August 1997.
4. C. Hsu and R. Beuker, "Planar Image Registration Using Feature Point Extraction," PICT TR 98/2, Philips Innovation Center Taipei.
5. K. S. Arun, T. S. Huang and S. D. Blostein, "Least-Squares Fitting of Two 3-D Point Sets," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 5, pp. 698-700, September 1987.
6. W. H. Press et al., Numerical Recipes in C: The Art of Scientific Computing, 2nd ed., Cambridge University Press, Cambridge, 1992.
7. S. Mann and R. W. Picard, "Virtual Bellows: Constructing High-Quality Images for Video," Proc. First ICIP, Vol. 1, IEEE CS Press, 1994, pp. 363-367.
8. H. Y. Shum and R. Szeliski, "Panoramic Image Mosaics," Microsoft Research Technical Report MSR-TR-97-23.
9. S. Mann and R. W. Picard, "Video Orbits of the Projective Group: A Simple Approach to Featureless Estimation of Parameters," October 1996.



[Figure 1 shows a flowchart: Image 1 and Image 2 each undergo multiresolution decomposition and feature point extraction; all possible matching pairs are found and false matching pairs eliminated; an initial estimate of θ, s, t_x, t_y is computed and the 8 parameters m_k (k = 0~7) are found; then, per resolution layer, one-layer reconstruction and refinement of m_k (k = 0~7) are performed; finally the images are merged into one image mosaic.]

Figure 1. Flowchart of multiresolution feature-based image registration.

Figure 2. (a) Two partially overlapping images. (b) The feature points extracted by the multiscale edge correlation method. (c) To improve the matching efficiency, only the local extreme points are retained for matching.



[Figure 3 shows two feature points p1, p2 in one image matched to q1, q2 in the other, with translation vectors T1 and T2.]

Figure 3. Consistency test for eliminating false matching pairs. For example, if {p1 ⇔ q1} and {p2 ⇔ q2} are correct matching pairs, then the difference between \vec{T}_1 = q_1 - p_1 and \vec{T}_2 = q_2 - p_2 should be small. That is, \|\vec{T}_1 - \vec{T}_2\| \le threshold_{distance} and angle(\vec{T}_1, \vec{T}_2) \le threshold_{angle}.

Figure 4. (a) and (b) are two partially overlapping images related by a projective transformation. (c) is the registered image mosaic using (a) as the reference coordinates, while (d) is the registered image mosaic using (b) as the reference coordinates.



Figure 5. Registration of four images taken with one hand-held camera: (a)~(d) are partially overlapping images; (e) is the registered image mosaic, where the image coordinates of (c) are chosen as the reference coordinates.
