1

Mumford and Shah Segmentation

We discuss a segmentation approach that generalizes Region Competition, before going on to object detection. The model is as follows: • The image is composed of N regions Ri , i = 1, . . . , N (∪N i=1 Ri = Ω), Ri ∩ Rj = ∅ for i 6= j and I(x) = ui (x) + ηi (x), x ∈ Ri where

ui ∈

Z u : Ri → R :

|∇ui (x)| dx < ∞, Ri

(1) ∂u (x) = 0 ∂n

(2)

where the noise ηi (x) ∼ N (0, σ) is iid in both x and i. That is, unlike region competition, we have smooth functions representing the region rather than just constants. • The derivatives ∂x1 ui (x), ∂x2 ui (x) ∼ N (0, σp ) are iid in x and i. Note that one may use other priors for example a Laplacian prior as is better fits natural image statistics. The Gaussian prior implies Z 2 |∇ui (x)| dx (3) p(ui ) ∝ exp −α Ri

• The prior on Ri is the same as region competition, that is, large length curves are penalized : p(Ri ) ∝ exp (−βL(∂Ri ))

(4)

Calculating the posterior probability, p({Ri , ui }N i=1 |I) using Bayes’ Rule, and then calculating the MAP estimate is equivalent to minimizing the energy E({Ri , ui }N i=1 )

=

N Z X i=1

2

Z

(I(x) − ui (x)) dx + α

Ri

2

Z

|∇ui (x)| dx + β Ri

ds.

(5)

∂Ri

The case of N = 2 is the energy considered in [3, 4]. The Euler-Lagrange equations in ui (when holding Ri fixed) is the same as we saw for denoising : ( I(x) − ui (x) = α∆ui (x) x ∈ Ri . (6) ∂u x ∈ ∂Ri ∂n (x) = 0 1

Also, similar to region competition, the Euler-Lagrange equations with respect to the region boundary ci of Ri is 1 2 2 2 2 ∇ci ∩cj E({ui , Ri }N i=1 ) = ((I − ui ) − (I − uj ) + α(|∇ui | − |∇uj | ))Ni − 2βκi Ni ui + uj Ni + α(|∇ui |2 − |∇uj |2 )Ni − 2βκi Ni , = 2(uj − ui ) I − 2 when Ri is adjacent to Rj .

(7) (8) (9)

Thus, similar to region competition, we perform an iterative minimization where we 1. make a guess of the initial regions Ri 2. solve the for ui in Ri for all i using (6) 3. perturb the regions Ri by (7) 4. iterate 2 and 3 until convergence Notice that the process is extremely slow since at each iteration, we have to solve (6), which we saw from denoising is costly. For implementation details using level set methods and fast methods of solving the problem, see [7, 8].

2

Object Detection From Images

In order to detect objects in images, we cannot simply use image segmentation schemes that divide the image based on homogeneous image statistics (e.g., intensity, texture, edges, etc...). If we are to detect humans, cars, birds, etc..., our algorithm must have knowledge about how the object looks. Indeed, if I were to look at a generic image and had no notion of what a bird looks like, then I would not be able to detect the bird in the image. Thus, we need to incorporate a prior on the object appearance and shape into our object detection/segmentation model. Indeed, it would be nice if we had a model of object / shape appearance. Indeed, we discuss one possible way to build a model of object shape in this lecture, and incorporate that into an object detection algorithm. Suppose that we have a training database of n different samples of an object that is to be detected. For example, the training set could include several examples of multiple peoples’ hands and multiple examples of each person’s hand in different poses, and viewed from several viewpoints if the goal is to detect a hand. How to obtain such a training database is obviously a big question, and in many applications (such as medical imaging) such training data may be difficult to obtain. For our hand example, one could could take pictures of several peoples’ hands from different vantage points and then hand segment the images to extact the object, and that could serve as a training database. For now, we will only derive a model of the object’s shape and thus disregard the appearance information, e.g., image intensity of the object. What do we mean by “shape” of an object? The subject of shape has a considerable history (going back at least a hundred years e.g. [5]). Perhaps one of the first to formalize the notion of shape as being part of a space was [2] - where the author was motivated by the problem of classifying rocks for geological purposes. We will not explore 1 Note that the computation is more trickier than the case of region competition since the functions ui and uj only have meaning in regions Ri and Rj , respectively, and thus when perturbing ci to compute the variation of the energy, we are also implicitely perturbing ui and this interaction must be considered, which was not needed for region comptetion. Nevertheless, the variation computed without this interaction turns out to actually be the same as the variation considering this interaction.

2

too much the right definition of shape, but we will follow the definition given in [9]. We say that first that a “pre-shape” of an object in the imaging plane is simply the boundary of the region defined by the object in the imaging plane. Notice that the boundary is sufficient to determine the region 2 . We represent such a pre-shapes as γ1 , . . . , γn where γi ⊂ RN (for now N = 2 so that these pre-shapes are in the plane, but the same definitions will hold for higher than 2 dimensions. We assume that γi are both compact and closed (in two dimensions this means that γi is a simple, closed curve). We denote by {Ψ1 , . . . , Ψn } the signed distance representation of the pre-shape. Note that Ψ only contains pre-shape information of the object and not appearance information.

2.1

Shape of an Object and Shape Average

The pre-shapes γ1 , . . . , γn are at different locations, orientations, and scales, and we would like to have a model of shape that is invariant to such transformations as the notion of shape should not depend on orientation, location, scale 3 . We can do this by aligning all the γi with respect to a canonical shape, which we call the shape average, µ. We refer to this process of alignment of γi as canonization of the pre-shape γi , and this is simply choosing a scale / orientation / and location of γi that is in some sense closest to the average shape. The scale / orientation / location are called nuisances, as they are irrelevant to the definition of shape. These ideas of alignment to a canonical shape are found in [9]. 2.1.1

Groups

Let us be more precise in defining the notion of shape average and canonization. In order to canonize the shape with respect to the average shape, the nuisances must form a group 4 . Notice that scale changes, orientation changes (i.e., rotations), and location changes (translations) form a group : indeed, G = {g : Ω → Ω : g(x) = lRx + T, l ∈ R+ , R ∈ R2×2 , T ∈ R2 , RT R = Id2×2 , det R = +1}

(10)

is a group with the multiplication defined by function composition, i.e., g1 g2 = g1 ◦ g2 , i.e., g1 ◦ g2 (x) = g1 (g2 (x)). This group represents all possible scale, orientation, and position changes of the shape in the imaging plane. The group defined above can be factored into three separate groups SO(2), rotations, R2 translations, and scale changes l > 0. The group SO(2) is called the special orthogonal group. We may write that for the particular group defined above (10) that G = SE(2) × R+ = SO(2) × R2 × R+ .

(11)

We say that G acts on pre-shapes, by the following action, g ◦ γ, i.e., g ◦ γ(s) = g(γ(s)). 2 For the case of a simply connected region, this is the case. For non-simply connected regions, we would have to add the information of whether the interior of each curve comprising the boundary is inside or outside the object. 3 This depends on the application. For example, if our notion of shape is invariant to rotation, then is the number ’6’ and the number ’9’ have the same shape. 4 A group is denoted G and it is a set such that the elements of the group are denoted g ∈ G. A group has the following properties:

1. There is a “multiplication” defined, i.e., for any two g1 , g2 ∈ G, the operation g1 g2 is defined, and g1 g2 ∈ G. 2. If g1 , g2 , g3 ∈ G, then (g1 g2 )g3 = g1 (g2 g3 ). 3. There exists an identity element e ∈ G such that ge = g for all g ∈ G. 4. For each g ∈ G, there exists an element g −1 ∈ G called the inverse such that gg −1 = e.

3

2.1.2

Deformations

We now need to define what it means for two pre-shapes to be close. This is at the heart of shape analysis, and we will not properly review the methods in the literature, but we will choose a simple approach found in [9] and that is based on the ideas of “Deformable Templates” pioneered by Grenander in his works on Pattern Theory [1]. The idea is that two pre-shapes differ by the amount of deformation of the map h : Ω → Ω that maps one pre-shape onto the other. The map h ∈ H, where H, is called the space of deformations, and we let H = {h : Ω → Ω : h, h−1 is 1-1 and onto, and smooth } (12) such maps above are called diffeomorphisms. Note that H is a group under composition. Suppose that we have a function D : H → R that measures the amount that h ∈ H deforms a shape. Note that we can equivalently define a map D on H or equivelently on the two pre-shapes γ1 , γ2 such that γ1 = h ◦ γ2 . Therefore, we will abuse notation and write D(h) = D(γ1 , γ2 ). Examples of D are 1. Let Dh(x) denote the Jacobian of h, i.e., Dh(x) =

h1x1 (x) h1x2 (x) h2x1 (x) h2x2 (x)

(13)

where h = (h1 , h2 ) are the components of the map. Then Z Z X 2 D(h) = |Dh(x)| dx = |hixj (x)|2 dx Ω

(14)

Ω ij

where | · |is the Frobenius norm in the first expression. 2. Another example is the signed distance score. Let Ψ1 and Ψ2 represent signed distance representations of γ1 and γ2 . Suppose that γ2 = h ◦ γ1 . Then Z D(h) = D(γ1 , γ2 ) = Ψ2 (x) dx. (15) {x:Ψ1 (x)≤0}

The idea is that when the region enclosed by γ1 and γ2 are identical, then D would be as negative as possible. On the contrary, if they don’t overlap, then D will be a large positive number. 2.1.3

Shape Average and Canonization

We are now ready to define the shape average and the shape of γi (a canonical pre-shape with respect to the shape average). Definition 1. Let γ1 , ·, γ2 ⊂ Ω ⊂ RN be compact hypersurfaces (in the case N = 2, they will just be simply connected close curves), whichare pre-shapes. Let H be the space of diffeomorphisms acting on Ω. Let D : H → R, and G be a finite dimensional group acting on Ω. We say that gˆ1 , ·, gˆn ∈ G are the motions undergone by γi if there exists a pre-shape µ ˆ such that gˆ1 , . . . , gˆn , µ ˆ = arg min

g1 ,...,gn ,µ

n X

D(hi ) = arg min

g1 ,...,gn ,µ

i=1

n X

D(gi ◦ µ, γi ).

(16)

i=1

The pre-shape µ ˆ is called the shape average, and gˆi−1 (γi ) is called the shape of γi (or the canonization of γi with respect to µ). 4

2.2

Creating a Model of Shape

Now that all the pre-shapes γ1 , . . . , γn are all aligned together (g1−1 (γ1 ), . . . , gn−1 (γn )) with respect to the shape average, µ ˆ, we are going to create a model of shapes from which also other shapes not in the training database can be generated. We follow the ideas of [6]. Let Ψ1 , . . . , Ψn denote the signed distance representations of g1−1 (γ1 ), . . . , gn−1 (γn ), and let Ψ denote the signed distance representation of µ ˆ. The idea is that the database of shapes is large (n large); however, because the database is many instances of the same object (e.g., it could be various instances of hands), it would seem natural that the database of shapes would be described by a small number of variations of the average shape Ψ. The we are going to perform the simplest approach to dimensionality reduction called principal component analysis (PCA). Let 5

˜ i = Ψi − Ψ Ψ

(17)

denote the variations of Ψi from the shape average. Then the shape variability matrix is ˜ 1, . . . , Ψ ˜ n ), S = (Ψ and define

n X

(SS T )x,y∈Ω =

˜ k (x)Ψ ˜ k (y). Ψ

(18)

(19)

k=1

Performing an eigen-decomposition on the matrix above, we have that (SS T )x,y∈Ω =

n X

σk vk (x)vk (y)

(20)

k=1

where vk : Ω → R are eigenfunctions of SS T called the principal modes of the PCA, and σ1 ≥ . . . ≥ σn ≥ 0 are the singular values. Note that for computational purposes (the matrix SS T is very large) that one could do a eignedecomposition of Z T ˜ i (x)Ψ ˜ j (x) dx, (S S)ij = Ψ (21) Ω

S T S,

which has the same singular values as and the eigenvectors of S T S are related to eigenfunctions of SS T by vi = Sdi where di is an eignevector of S T S. Our model of the shape of an object is then the zero-level set of Ψw = Ψ +

k X

wi v i ,

(22)

i=1

where w ∈ Rk and k << n. Thus, to generate a new instance of the shape of an object, we simply sample w ∈ Rk and generate the new shape by the equation above. Note that wi cannot be too large, otherwise, the zero level set of Ψw would deviate too much from shapes in the database, and could possibly not look like a realistic shape of an object. 5

The space of signed distance functions is not a linear space, and thus, mathematically it does not make sense to add and subtract these functions. Nevertheless, when Ψ1 and Ψ2 are aligned with respect to each other, the zero level set of (Ψ1 +Ψ2 )/2 is a sensible average.

5

2.3

Object Detection

Our model for object detection is now : 1. The image consists of two regions R1 , R2 (the object and background regions) and I(x) = ui (x) + ηi (x),

(23)

where ui could either be a function (as in Mumford-Shah) or constant (as in region competition), x ∈ Ri , and ηi (x) ∼ N (0, σ) are iid in both i and x. 2. We now put a prior on R1 (the foreground region) so that the R1 is consistent with our model of shape (22). That is R1 = {x ∈ Ω : Ψw (g −1 (x)) ≤ 0}, (24) where g ∈ G and p(w), p(g) is uniform. Note that Ψw is aligned with respect to the shape average, and since objects in the image we wish to segment can be at different locations / orientations / scales, we allow the composition by g ∈ G. Using our machinery for the MAP estimate, we find that maximizing p(R1 |I) is equivalent to minimizing the energy Z Z (I(x) − u1 )2 dx +

E(w, g) = R1

which we write as

(I(x) − u2 )2 dx

(25)

fout (x) dx.

(26)

Ω\R1

Z E(w, g) =

Z fin (x) dx +

R1

Ω\R1

The energy is not convex in w nor g, and thus we resort to a gradient descent minimization. 2.3.1

Computing the w Gradient

We first compute the gradient with respect to w. Let H : R → R denote the Heaviside function : ( 1 x≥0 H(x) = . 0 x<0 Then

Z E(w, g) =

fin (x)(1 − H(Ψw (g −1 (x)))) dx +

Ω

Z

(27)

fout (x)H(Ψw (g −1 (x))) dx.

(28)

∂ Ψw (g −1 (x)) dx, ∂wi

(29)

Ω

Therefore, we see that ∂ E(w, g) = ∂wi

Z

(fout (x) − fin (x))δ(Ψw (g −1 (x)))

Ω

where δ : R → R is the Dirac delta function. Note that ∂ Ψw (g −1 (x)) = vi (g −1 (x)) ∂wi and also note by the co-area formula : Z Z f (x)δ(Ψw (g −1 (x)))|∇(Ψw (g −1 (x)))| dx =

{x∈Ω:Ψw (g −1 (x))=0}

Ω

6

(30)

f (x) ds,

(31)

and therefore, ∂ E(w, g) = ∂wi

Z {x∈Ω:Ψw (g −1 (x))=0}

(fout (x) − fin (x))vi (x) ds(x) |∇(Ψw (g −1 (x)))|

(32)

where ds denotes the arclength element of {x ∈ Ω : Ψw (g −1 (x)) = 0}. 2.3.2

Computing the g Gradient

Next, we compute the gradient with respect to the group element g. In order to do this is a numerically stable manner, we note that G is in addition a matrix Lie group, and this permits us to write each element of G as an exponential of a matrix. In this sequel, we assume that G = SE(3) × R+ (for the case of surfaces, the case of curves is a special case). In order to understand that each element of G, we write each element g ∈ SE(3) (we exclude the scale component for now) as the following R T g= . (33) 0 1 Note that gx where x = (x, 1) (such is called the homogeneous coordinates of x) corresponds is (Rx + T, 1)T . Note also that group composition is equivalent to the above matrix multiplication of the above matrices. Therefore, we see that T R −RT T −1 −1 . (34) gg = Id4×4 , where g = 0 1 Suppose that we let g : [0, 1] → SE(3) be a time-varying path in SE(3). Then, we have that g(t)g −1 (t) = Id4×4 , or gg ˙ −1 + g g˙ −1 = 0, and then we see that g˙ and then (35) becomes

˙ T RR 0

−1

=

˙ T T + T˙ −RR 0

˙ − RT T˙ R˙ −RT 0 0

+

RR˙ T 0

(35)

,

−RR˙ T T − T˙ 0

(36)

=0

(37)

and therefore, ˙ T + RR˙ T = 0, ⇒ RR ˙ T = −(RR ˙ T )T , RR ˙ T =: ω which says that RR b is skew symmetric, i.e., there exists a ω ∈ R3 such that 0 −ω3 ω2 0 −ω1 . ω b = ω3 −ω2 ω1 0 Therefore, we find that R˙ = ω b R, and we define v = −b ω T + T˙ . Then, ω b v b gg ˙ −1 = = ξb ⇒ g˙ = ξg, 0 0

7

(38)

(39)

(40)

for which we see that g = exp ξb =

n X bi (ξ) i=0

i!

.

(41)

Note that the above can be evaluated using Rodriques formula : ! b )v+ωω T v (Id3×3 −eω ω b R T e |ω| g= = exp ξb = 0 1 0 1

(42)

where

ω b ω b2 sin(|ω|) + (1 − cos |ω|). (43) |ω| |ω| Let S(x) = lx = (exp ξ7 )x (where l = exp ξ7 be the scaling operator then, any element of SE(3) × R+ can be represented as g(x) = exp ξb ◦ S, g ∈ SE(3) × R+ . (44) eωb = Id3×3 +

We now represent any g ∈ SE(3) × R+ with ξ ∈ R7 , and compute the gradient of E with respect to ξi . Note that Z Z E(w, g) = (fin (x) − fout (x)) dx + fout (x) dx, (45) R1

Ω

and the second term does not depend on g, and thus the derivative wrt ξi is zero. We let f = fin − fout . Note also that R1 = {gx ∈ Ω : Ψw (x) ≤ 0} = g{x ∈ Ω : Ψw (x) ≤ 0}, (46) and so

Z

Z f (x) dx =

R1

Z f (g(x))| det Dg(x)| dx,

f (x) dx =

(47)

R10

gR

where R10 = {x ∈ Ω : Ψw (x) ≤ 0} and we have performed a change of variable in the last equality. Note that det Dg(x) = l = eξ7 . Now, Z Z Z Z ∂eξ7 ∂ ∂ ∂ ξ7 ξ7 ∂ f (g(x)) dx = f (g(x)) dx+e f (g(x)) dx. E(w, g) = f (x) dx = e ∂ξi ∂ξi R1 ∂ξi ∂ξi R10 ∂ξi R10 R10 (48) Now, using the Divergence Theorem, we have that Z Z Z ∂ ∂g(x) ∂g(x) ∇f (g(x)) · f (g(x))N (x) · f (g(x)) dx = dx = ds(x) ∂ξi ∂ξi R10 ∂ξi R10 ∂R10 Z ∂g(x) − f (g(x)) div ( ) dx, (49) ∂ξi R10 where N is the outward normal of R10 6 . Note that ( 0 i 6= 7 ∂g(x) div = . ∂ξi eξ7 (1 + 2 cos |ω|) i = 7 Therefore, we have that Z Z ∂ ∂g(x) 2 E(w, g) = l f (g(x))N (x) · ds(x) + (l − l (1 + 2 cos |ω|))δi,7 f (g(x)) dx. ∂ξi ∂ξi ∂R10 R10 6

(50)

(51)

We avoid derivatives of f since f contains the image, because of noise, it is better to avoid derivatives on the image for numerical reasons.

8

References [1] U. Grenander. Elements of pattern theory. Johns Hopkins Univ Pr, 1996. [2] D.G. Kendall. Shape manifolds, procrustean metrics, and complex projective spaces. Bulletin of the London Mathematical Society, 16(2):81, 1984. [3] D. Mumford and J. Shah. Boundary detection by minimizing functionals. In IEEE Conference on Computer Vision and Pattern Recognition, volume 17, pages 137–154, 1985. [4] D. Mumford and J. Shah. Optimal approximations by piecewise smooth functions and associated variational problems. Comm. Pure Appl. Math, 42(5):577–685, 1989. [5] D.A.W. Thompson. On Growth and Form (abridged edition). On growth and form (abridged edition), 1917. [6] A. Tsai, A. Yezzi Jr, W. Wells, C. Tempany, D. Tucker, A. Fan, W.E. Grimson, and A. Willsky. A shape-based approach to the segmentation of medical imagery using level sets. Medical Imaging, IEEE Transactions on, 22(2):137–154, 2003. [7] A. Tsai, A. Yezzi Jr, and A.S. Willsky. Curve evolution implementation of the Mumford-Shah functional for image segmentation, denoising, interpolation, and magnification. Image Processing, IEEE Transactions on, 10(8):1169–1186, 2001. [8] L.A. Vese and T.F. Chan. A multiphase level set framework for image segmentation using the Mumford and Shah model. International Journal of Computer Vision, 50(3):271–293, 2002. [9] A.J. Yezzi and S. Soatto. Deformotion: Deforming motion, shape average and the joint registration and approximation of structures in images. International Journal of Computer Vision, 53(2):153–167, 2003.

9