Texture modeling with Gibbs probability distributions

Benjamin Dittes

July 30th, 2004

Bachelor thesis advised by Doz. Dr. Boris Flach, Institute of Artificial Intelligence, Dresden University of Technology

Abstract

A Markov random field model with a Gibbs probability distribution on pairwise pixel interactions is proposed for describing certain two-dimensional images, called spatially uniform stochastic textures. Based on the gray level co-occurrence histograms in a sample image, a method is formulated to derive both structure and strength of the pairwise interactions using an iterative gradient process based on the maximum likelihood. Related tasks such as clustering, sampling and recognition are also explored. Experiments on several natural and artificial textures show the utility of the proposed model.

Contents

1 Introduction
2 Clustering
  2.1 General approach
  2.2 An optimal solution for n = 1
  2.3 The K-means algorithm
  2.4 Image clustering
  2.5 Runtime improvement
3 Gibbs probability distributions
  3.1 Definition
  3.2 Potential equality
  3.3 Potential decomposition
4 Sampling
  4.1 The general Gibbs sampler
  4.2 Runtime improvements
    4.2.1 Iteration measurement
    4.2.2 Instability detection
  4.3 Examples
5 Learning
  5.1 Parameter estimation
    5.1.1 Objective functions
    5.1.2 Maximum Likelihood optimization
  5.2 Adapting the structure
    5.2.1 Interaction evaluation
    5.2.2 Interaction selection
  5.3 Practical Notes
    5.3.1 The log-likelihood maximum
    5.3.2 Terminating the parameter estimation
    5.3.3 Comparison to shift invariant models
  5.4 Examples
6 Recognition
7 Conclusion
8 Acknowledgements


1 Introduction

One of the most interesting areas of image processing is the recognition of complex patterns, as found in textures. The term 'texture' shall here be understood as a property of a stochastic two-dimensional image, containing a spatially uniform, continuously repeated pattern. Thus, these images are called spatially uniform stochastic textures. A few examples of such images can be seen below:

Figure 1: Examples of spatially uniform stochastic textures

The first two images are D77 and D84, taken from [2], a catalog of natural images. The third image is a computer-generated image based on a very simple set of parameters. The method for generating such images is described in section 4. As can be seen, these textures show a repeating pattern, but due to their stochastic nature with different variations in different parts of the image. The first image shows very little fluctuation while the other two have rather big differences between single patterns.

As will be shown below, it is not possible to handle images with a large number of distinct color values. Thus, the next section will give a general introduction to clustering, i.e. reducing the number of distinct values by forming groups ("clusters") in the color space and using their representatives ("centers"), and apply this technique to image clustering. For the rest of the thesis, I will use clustered images, i.e. images that only contain the cluster center values at their pixels.

To reflect the 'stochastic' nature of the textures, a Markov random field (MRF) on such images will be defined in section 3. In general, a random field is a probability distribution on labelings of a graph, i.e. on specific realizations of values at each vertex. In image processing, this graph has the pixels of the 2-D lattice as vertexes, and a labeling is a mapping of cluster numbers (henceforth called "signals") to each pixel. Note that the edges in this graph can be arbitrary connections between pixels and have nothing to do with the 2-D grid on which the image is built. The Markov assumption on such a random field is that the conditional probability of a signal in any vertex, given the (fixed) rest of the labeling, is equal to the conditional probability of that signal in that vertex given the (fixed) signals in the vertexes connected to it by an edge. In other words, the MRF can be described on every edge of the graph by a matrix of parameters, i.e. by one number for each pair of signals on the vertexes of that edge. Both the edges of the graph and their parameter matrices have to be found to describe the texture. In the following, this process is started with a sparse graph; edges are gradually added and their parameters are estimated.

The property of 'spatially uniform' textures provides a further specialization of the MRF: the edges have to be defined in such a way that vectors (defined by a horizontal and vertical displacement between starting and final vertex) are constantly repeated, starting at all vertexes of the graph. Thus, the graph gets a very well defined, 'periodic' form. Moreover, on a spatially uniform texture it is natural to assume that every matrix of parameters is independent of the position of an edge with a fixed displacement in the graph. This leads to the Gibbs probability distribution (GPD), which, on this special graph, is described by one matrix for each vector, in contrast to one matrix for each edge in the general MRF. Therefore, the Gibbs parameters are, in fact, spatially uniform MRF parameters, because they apply to every pair of pixels separated by the displacement of the respective vector. The required set of definitions along with some interesting observations on GPDs can be found in section 3. For a more theoretical background on such models, please see [3].

A very nice property of GPDs is that there exists a simple way of generating labelings for the graph according to the probability distribution formed by the model. In other words, given a GPD, it is possible to generate images with the probability described by this GPD. This method is called the Gibbs sampler (thus, the process of generating images is called sampling) and is shown in section 4.

Given an original image of a texture, i.e. a labeling of the MRF graph, the most interesting task is to find a GPD that describes this texture. This topic divides into two aspects: a) finding vectors that are inherent to the texture and describe its repetitive aspects and b) estimating the Gibbs parameters on those vectors. This is done by comparing gray level co-occurrence histograms (GLCHs) between the current GPD and the original image. The details of this procedure can be found in section 5.

Finally, given a GPD that describes one texture, it can be used to find this texture in a larger image, i.e. to determine the similarity between a frame in the image and the texture inherent to the GPD. With this measure¹ and GPDs for all occurring textures, the image can be divided into segments for the contained textures. This is not the main topic of the thesis, but a very simple way of doing this is described in section 6.

This thesis is related to an article by G. L. Gimmel'farb [1]. He, however, assumed Gibbs potentials which are independent of gray level shifts. In the sense of GPD matrices, this means that every parameter matrix is a band matrix, i.e. the values of all elements on each diagonal are equal. This simplifies the process of learning, but I think it is not generally adequate. The aim of this thesis will be to generalize the learning process to matrix-represented Gibbs potentials. The consequences of such an approach are demonstrated in section 5.3.3.

¹ Note that 'gray level' only means that the distinct color values in the vertexes are ordered in a certain way. Here, this is a result of clustering.


2 Clustering

This section will cover clustering in a very general approach and then use it to perform a clustering in RGB color space, as needed for the following sections.

2.1 General approach

Let X \subset \mathbb{R}^n be a set of m vectors that have to be clustered into l clusters with centers Y = (y_j : j = 1 \dots l). The connection is described by an (m \times l) matrix C with elements c_{ij} \in \{0, 1\}, where c_{ij} = 1 means that vector x_i is associated with center y_j. Of course, every vector can only be associated with one center; this constraint is expressed by the second of the following equations. The optimal clustering is defined as the optimal solution of

    \sum_{i=1}^{m} \sum_{j=1}^{l} c_{ij} \, \lVert x_i - y_j \rVert^2 \;\to\; \min_{Y,C}        (1)

    (\forall i)\quad \sum_{j=1}^{l} c_{ij} = 1        (2)

2.2 An optimal solution for n = 1

In the one-dimensional case, i.e. gray-scale in image processing, the objective function reduces to:

    \sum_{i=1}^{m} \sum_{j=1}^{l} c_{ij} (x_i - y_j)^2 \;\to\; \min_{Y,C}        (3)

Now, since for any configuration of the c_{ij} each optimal center y_j is the balance point of the associated vectors x_i, the problem is reduced to finding the optimal C. This, however, is much easier than trying all combinations, because in a sorted set X the selection of the optimal c_{ij} means a linear segmentation of the vectors x_i. There exists an effective dynamic programming algorithm to solve this task. First, the cost of each segment [i_1, i_2] is the variance \sigma(i_1, i_2) in this segment:

    \sigma(i_1, i_2) = \sum_{i=i_1}^{i_2} \left( x_i - E(i_1, i_2) \right)^2        (4)

where

    E(i_1, i_2) = \frac{1}{i_2 - i_1 + 1} \sum_{i=i_1}^{i_2} x_i        (5)

Now let f_j(i) describe the cost of the optimal clustering of the first i vectors into j clusters. Additionally I need the value i' = p_j(i), meaning that the optimal clustering of the first i vectors into j clusters is achieved by using one segment [i', i] plus the optimal clustering of the first i'-1 vectors into j-1 clusters. The f_j(i) and p_j(i) can be calculated in the following way:

    f_1(i) = \sigma(1, i), \qquad p_1(i) = 1        (6)

    f_j(i) = \min_{i'} \{ f_{j-1}(i'-1) + \sigma(i', i) \}, \qquad j = 2 \dots l        (7)

    p_j(i) = \arg\min_{i'} \{ f_{j-1}(i'-1) + \sigma(i', i) \}, \qquad j = 2 \dots l        (8)

Starting at p_l(m) it is now easy to read out the optimal clustering. The algorithm has a complexity of O(m^2 l). This is not very nice when dealing with large sets of input vectors, but the algorithm is still applicable in image clustering, as shown below. Sadly, I am not aware of any discrete solution for n > 1. However, there exists a very simple iterative algorithm for the general case, called K-means.
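The recursion (6)-(8) translates directly into a dynamic program. The sketch below is only an illustration of this idea and not code from the thesis; the function name, the NumPy prefix sums used for the segment costs, and the assumption of an unweighted 1-D input are my own choices.

```python
import numpy as np

def optimal_1d_clustering(x, l):
    """Optimal clustering of 1-D data x into l segments via eqs. (4)-(8)."""
    x = np.sort(np.asarray(x, dtype=float))
    m = len(x)

    # prefix sums allow evaluating the segment cost sigma(i1, i2) in O(1)
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def sigma(i1, i2):                      # variance cost of x[i1..i2], 0-based, inclusive
        n = i2 - i1 + 1
        s = s1[i2 + 1] - s1[i1]
        return s2[i2 + 1] - s2[i1] - s * s / n

    f = np.full((l + 1, m), np.inf)         # f[j, i]: optimal cost of x[0..i] with j clusters
    p = np.zeros((l + 1, m), dtype=int)     # p[j, i]: start of the last segment, eq. (8)
    for i in range(m):
        f[1, i] = sigma(0, i)
    for j in range(2, l + 1):
        for i in range(j - 1, m):
            for i0 in range(j - 1, i + 1):  # last segment is x[i0..i]
                cost = f[j - 1, i0 - 1] + sigma(i0, i)
                if cost < f[j, i]:
                    f[j, i], p[j, i] = cost, i0

    # read out the segment boundaries by following p backwards, starting at p[l, m-1]
    segments, i = [], m - 1
    for j in range(l, 0, -1):
        segments.append((p[j, i], i))
        i = p[j, i] - 1
    return list(reversed(segments)), f[l, m - 1]
```

For the histogram-based variant of section 2.5, the cost function would simply use the weighted mean and weighted squared deviations of equations (10) and (11) instead.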

2.3 The K-means algorithm

Starting from a configuration Y^0, C^0, the algorithm repeats the following two steps until there is no more change:

1. y'_j = \frac{1}{n_j} \sum_{i=1}^{m} c_{ij} x_i, \qquad n_j = \sum_{i=1}^{m} c_{ij}, \qquad j = 1 \dots l

2. (\forall i)\; j^*(i) = \arg\min_j \lVert x_i - y_j \rVert^2, \qquad c'_{ij} = 1 only for j = j^*(i)

The first step places each center in the balance point of all the x_i associated with it. Thus, the overall distance sum \sum_i \sum_j c_{ij} \lVert x_i - y_j \rVert^2 declines. The second step associates every vector x_i with the nearest center. Again, the overall distance sum declines. Thus, the iterative process continuously lowers the objective function and moves Y and C into the nearest local optimum. The process of clustering a new vector x is the same as described in step 2: every vector is assigned to the nearest center. Quite naturally this division of the n-dimensional space produces Voronoi polyhedra. A possible clustering of a real RGB-colored image can be seen in figure 2:


Figure 2: Clustering of an RGB-colored image: Left the original image, in the middle the color distribution (blue) with the found clustering for 32 centers (black) and their borders (red) shown in the Red-Blue pane, and on the right the clustered result

The left image is the source, an RGB-colored image. In the middle the clustering found by the K-means can be seen in the Red-Blue pane. The circles mark the 32 found centers, the red lines border each area with vectors belonging to separate centers and the blue background marks the location and frequency (the darker, the more pixels of that color in the image) of the respective color. The right image shows the clustered image where every color is replaced by the respective center.

A very interesting problem now is to find a good initial configuration \{Y^0, C^0\}. This can be done in the following way: First, I solve the problem once in every dimension, using the algorithm described above. This gives an (optimal) set of \tilde{Y}_k, k = 1 \dots n. Now, Y^0 is constructed by simply combining the \tilde{y}_j^k: y_j^0 = (\tilde{y}_j^1, \tilde{y}_j^2, \dots, \tilde{y}_j^n). C^0 is built by choosing the nearest center for every x_i. This gives a reliable, but certainly not optimal starting configuration for the K-means. The reason for not choosing a more sophisticated approach lies in the specific application of the method in real image clustering.

2.4 Image clustering

Given a typical RGB image, the set of input vectors X is taken from \mathbb{R}^3. There is no need to modify the K-means algorithm to solve this problem. In many cases, however, the image comes as an RGB image but is in fact only gray-scaled, meaning that for x_i = (x_i^1, x_i^2, x_i^3), x_i^1 = x_i^2 = x_i^3. This would entail that it is possible to find an optimal solution, even though n > 1. This optimal solution is found by the selection of the starting configuration for the K-means: since, in that case, all clusterings for each dimension are equal, the resulting centers for the three-dimensional case will also have equal components, thus presenting an optimal clustering for the color-space diagonal, and will present the optimal solution. On RGB-colored images, experiments show that the K-means will still find usable clusterings. The following images show a gray and an RGB-colored original (left) and their clustered counterparts, both with as little as 8 distinct colors. Even when reducing to such a small number of clusters, the difference can hardly be told by eye:

Figure 3: Clustering of real textures: Left the original, in the middle the color distribution (blue) with the found clustering for 8 centers (black) and on the right the clustered results. The upper image is D77 from [2], the lower one is a natural image of cloth

Again, the color distributions in the middle are shown in the Red-Blue pane. Note that the upper image is gray-scale, therefore the problem is actually a one-dimensional one and thus the clustering is optimal.

2.5 Runtime improvement

The algorithm for n = 1 has a complexity that is quadratic in the number of input vectors. On an image with 1000 x 1000 pixels, this is clearly not practicable. There is a nice way to shrink the number of 'real' input vectors, which takes advantage of the fact that even in a very large image the number of distinct RGB values is limited. So instead of working with every single pixel, it is possible to accumulate all pixels into a histogram, which contains the number of occurrences of every color, and then work with those histogram entries \{\tilde{x}_i, \tilde{n}_i\} instead of the real x_i. In the K-means algorithm, this only affects the first step, where now

    y_j = \frac{1}{n_j} \sum_{i=1}^{m} c_{ij} \, \tilde{x}_i \tilde{n}_i, \qquad n_j = \sum_{i=1}^{m} c_{ij} \, \tilde{n}_i        (9)

In the one-dimensional case only the calculations of E and \sigma have to be changed:

    E(i_1, i_2) = \frac{1}{\sum_{i=i_1}^{i_2} \tilde{n}_i} \sum_{i=i_1}^{i_2} \tilde{n}_i \tilde{x}_i        (10)

    \sigma(i_1, i_2) = \sum_{i=i_1}^{i_2} \tilde{n}_i \left( \tilde{x}_i - E(i_1, i_2) \right)^2        (11)

In my experiments, this brought a runtime improvement of 75% on small images and over 99% on large images, actually making the construction of this histogram the most time-consuming task.² Fortunately, this information is inherent to the image and only has to be computed once.

² Even on a hashtable that is twice as big as the number of distinct RGB values, the number of collisions is not negligible.
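As a small illustration of equation (9) together with step 2 of the K-means iteration on histogram entries {x~_i, n~_i}, here is a minimal sketch; the function name and the NumPy-based formulation are assumptions of mine, not taken from the thesis.

```python
import numpy as np

def kmeans_step(colors, counts, centers):
    """One K-means iteration on a color histogram.

    colors:  (m, d) array of distinct color values x~_i
    counts:  (m,)   array of their frequencies n~_i
    centers: (l, d) array of current cluster centers y_j
    """
    # step 2: assign every histogram entry to its nearest center
    d2 = ((colors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)                       # j*(i)

    # step 1, weighted by counts as in eq. (9): move centers to the balance points
    new_centers = centers.copy()
    for j in range(len(centers)):
        mask = assign == j
        nj = counts[mask].sum()
        if nj > 0:
            new_centers[j] = (counts[mask, None] * colors[mask]).sum(axis=0) / nj
    return new_centers, assign
```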


3 Gibbs probability distributions

3.1 Definition

Let K = (K(r) : r \in R) be an MRF with samples k = (k(r) : r \in R, k(r) \in \mathbb{K}), where R is a finite 2-D lattice R = ((m, n) : m = 0 \dots (M-1), n = 0 \dots (N-1)) of size |R| = M \cdot N, and \mathbb{K} = \{0, 1, \dots, k_{max}\} is a finite set of signal values (cluster indices) at pixels (m, n). In other words, R is the structure of an image and every k is a labeling, assigning a signal k \in \mathbb{K} to every pixel (m, n). The pairwise pixel interaction is expressed by a neighborhood N(m, n) = \{(m + x, n + y) : (x, y) = a \in A\}, where A is a set of displacement vectors (or shifts). Thus, every a \in A defines a second-order clique family K_a = \{(i, j) \in R^2 : i - j = (x, y), (x, y) = a \in A\} containing all pairs of pixels with the respective displacement. The set of parameters G = (g_a : a \in A) contains interaction strengths for every shift, described by a |\mathbb{K}| \times |\mathbb{K}| matrix g_a of Gibbs parameters, thus forming the Gibbs probability distribution:

    p(k; G) = \frac{1}{Z_G} \prod_{a \in A} \prod_{r \in R} g_a(k(r), k(r - a))        (12)

Here, Z_G is a normalizing factor. Alternatively, the Gibbs parameters can be expressed as Gibbs potentials \Phi = (\phi_a : a \in A) with \phi_a(i, j) = \ln g_a(i, j). The GPD then reads:

    p(k; \Phi) = \frac{1}{Z_\Phi} \exp\left[ \sum_{a \in A} \sum_{r \in R} \phi_a(k(r), k(r - a)) \right]        (13)

Both descriptions are completely equivalent.

A second-order gray level co-occurrence histogram (GLCH) in a sample k is a |\mathbb{K}| \times |\mathbb{K}| matrix n_a^k for every shift a \in A. Every n_a^k(k, k') counts the number of occurrences of the pair (k, k') on the vector a in k. In order to compare GLCHs between images of different dimensions, the n_a^k(k, k') are scaled by the number of possible occurrences of the vector a in k. Every probability distribution p(k) on labelings k also defines marginal expectations of co-occurrences

    \lambda_a(k, k') = \sum_{k} p(k) \, n_a^k(k, k')        (14)

\lambda_a(k, k'; \Phi) shall be defined as the marginal co-occurrence expectation produced by the Gibbs model with potentials \Phi.
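The normalized GLCH of a labeling for a single shift can be computed directly from this definition. The sketch below is a minimal illustration under my own assumptions (NumPy arrays, counting only pairs whose partner pixel lies inside the lattice); it is not code from the thesis.

```python
import numpy as np

def glch(k, shift, num_signals):
    """Normalized gray level co-occurrence histogram n_a^k for one shift a = (x, y).

    k is an (M, N) integer labeling with signals in 0..num_signals-1.
    Counts pairs (k(r), k(r - a)) for every pixel r whose partner r - a lies inside
    the lattice, scaled by the number of possible occurrences of the shift.
    """
    x, y = shift
    M, N = k.shape
    h = np.zeros((num_signals, num_signals), dtype=float)

    # pixels r = (m, n) for which (m - x, n - y) is still inside the image
    m0, m1 = max(0, x), min(M, M + x)
    n0, n1 = max(0, y), min(N, N + y)
    a = k[m0:m1, n0:n1]                        # k(r)
    b = k[m0 - x:m1 - x, n0 - y:n1 - y]        # k(r - a)
    if a.size == 0:
        return h
    np.add.at(h, (a.ravel(), b.ravel()), 1.0)
    return h / a.size                          # scale by the number of possible pairs
```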


3.2 Potential equality

The runtime complexity of many tasks concerning GPDs is strongly dependent on the size of A. Thus, the question whether some \phi_a is obsolete is very important. More generally, the problem is in which cases

    (\forall k)\quad p(k; \Phi^a) = p(k; \Phi^b)        (15)

holds. I will solve this problem for the general MRF and then show how this can be used in the GPD situation. In the MRF, there exist potentials for every pair (r, r'); if an edge is not in the graph, the potential on that edge shall be zero. For the sake of simplicity, I will call those potentials \phi_{r,r'}. Then:

    p(k; \Phi) = \frac{1}{Z_\Phi} \exp\left[ \sum_{(r,r') \in R^2} \phi_{r,r'}(k(r), k(r')) \right]        (16)

Now we can set p(k; \Phi^a) = p(k; \Phi^b), which entails:

    \frac{1}{Z_a} \exp \sum_{(r,r') \in R^2} \phi^a_{r,r'}(k(r), k(r')) = \frac{1}{Z_b} \exp \sum_{(r,r') \in R^2} \phi^b_{r,r'}(k(r), k(r'))        (17)

    \sum_{(r,r') \in R^2} \Delta\phi_{r,r'}(k(r), k(r')) = \ln \frac{Z_a}{Z_b} = \mathrm{const.}        (18)

with \Delta\phi_{r,r'}(k, k') = \phi^a_{r,r'}(k, k') - \phi^b_{r,r'}(k, k'). In other words, there must exist a \Delta\Phi which is the difference of \Phi^a and \Phi^b and forms a constant probability distribution for all k. There are two non-trivial conclusions:

1.

    (\forall r, k_1, k_2)\quad \sum_{r'} \Delta\phi_{r,r'}(k_1, k(r')) = \sum_{r'} \Delta\phi_{r,r'}(k_2, k(r'))        (19)

This follows from comparing two labelings k^1 and k^2 that only differ by the signal in one pixel, namely in r. A change of the signal in that pixel from k_1 to k_2 must lead to the same probability for k^1 and k^2.

2.

    (\forall r_1, r_2)\quad \Delta\phi_{r_1,r_2}(k, k') = \phi_{r_1}(k) + \phi_{r_2}(k')        (20)

Consider the following figure:

Figure 4: Simultaneous change of two signals

Let r_1' represent all vertexes connected to r_1 except for r_2. \Delta\phi_{r_1',r_1} will be abbreviated with \Delta\phi^1 and \Delta\phi_{r_1,r_2} with \Delta\phi^{12}. First, k(r_2) is fixed at k_{21}. Now, for r_1, with (19) there follows

    \sum_{r_1'} \Delta\phi^1(k(r_1'), k_{11}) + \Delta\phi^{12}(k_{11}, k_{21}) = \sum_{r_1'} \Delta\phi^1(k(r_1'), k_{12}) + \Delta\phi^{12}(k_{12}, k_{21})

The same can be done with k(r_2) fixed at k_{22}, which entails

    \sum_{r_1'} \Delta\phi^1(k(r_1'), k_{12}) + \Delta\phi^{12}(k_{12}, k_{22}) = \sum_{r_1'} \Delta\phi^1(k(r_1'), k_{11}) + \Delta\phi^{12}(k_{11}, k_{22})

By adding those two equations and removing the sums over r_1', which now appear on both sides, what remains is

    \Delta\phi_{r_1,r_2}(k_{11}, k_{21}) + \Delta\phi_{r_1,r_2}(k_{12}, k_{22}) = \Delta\phi_{r_1,r_2}(k_{11}, k_{22}) + \Delta\phi_{r_1,r_2}(k_{12}, k_{21})        (21)

This is demonstrated in the figure by the red (left side) and green (right side) paths. It can easily be shown that for any matrix with the said property equation (20) holds.

Now it is easy to transfer these conclusions to the GPD, where parameters are uniform for each shift (x, y) = a \in A: it is clear that any model \Phi = (\phi_a : a \in A) is equivalent to a model containing potentials for all shifts (whether in A or not), but with \phi_a(k, k') = 0 for a \notin A. Thus, removing a shift a is equivalent to finding some \Phi' with p(k; \Phi) = p(k; \Phi'), which has \phi'_a(k, k') = 0. With (18) and (20), that means that \Delta\phi_a(k, k') must be expressible as \phi_1(k) + \phi_2(k'). In that case, if this decomposition can be found, the shift can be removed by redistributing \Delta\phi_a to other shifts.

3.3 Potential decomposition

Decomposition would be trivial if \phi_a(k, k') could be decomposed exactly. However, in practical application, the values of \phi_a(k, k') will only converge slowly towards a perfectly decomposable potential. Thus, a method has to be found to approximate \phi_1(k) and \phi_2(k'). The general objective formula then looks like this:

    E = \sum_{i,j} \left( C_{ij} - a(i) - b(j) \right)^2 \;\to\; \min_{a,b}        (22)

To solve this optimization problem I use a gradient approach with the gradients:

    \frac{\partial E}{\partial a(i)} = 2 \sum_{j} \left( a(i) + b(j) - C_{ij} \right)        (23)

    \frac{\partial E}{\partial b(j)} = 2 \sum_{i} \left( a(i) + b(j) - C_{ij} \right)        (24)

Thus, changing a and b iteratively with

    \Delta a(i) = -\gamma \, \frac{\partial E}{\partial a(i)}        (25)

    \Delta b(j) = -\gamma \, \frac{\partial E}{\partial b(j)}        (26)

will find the nearest optimum. This did not turn out to be a very hard optimization problem; a very good optimum is found in very short time, where a step size of \gamma = 0.2 proved best.

Although this method is very fast, checking whether some \phi_a(k, k') can be decomposed will be done much more frequently and thus needs an even faster approach. For this, I check the condition shown in (21):

    D_{\phi_a} = \frac{1}{(k_{max} - 1)^2} \sum_{k=0}^{k_{max}-1} \sum_{k'=0}^{k_{max}-1} d_{\phi_a}(k, k')        (27)

    d_{\phi_a}(k, k') = \left| \phi_a(k, k') + \phi_a(k+1, k'+1) - \phi_a(k, k'+1) - \phi_a(k+1, k') \right|        (28)

or, accordingly:

    D_{g_a} = \frac{1}{(k_{max} - 1)^2} \sum_{k=0}^{k_{max}-1} \sum_{k'=0}^{k_{max}-1} d_{g_a}(k, k')        (29)

    d_{g_a}(k, k') = \exp \left| \ln \frac{g_a(k, k') \, g_a(k+1, k'+1)}{g_a(k, k'+1) \, g_a(k+1, k')} \right|        (30)

In my experiments, D_{g_a} < 1.003 presented a reliable condition for potential decomposability.
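A possible implementation of the fast check (29)-(30) is sketched below, assuming a strictly positive parameter matrix g_a; it averages d_{g_a} over all neighbouring index pairs, which differs from the normalization above only by a constant factor. The function name is my own, not from the thesis.

```python
import numpy as np

def decomposability(g):
    """Indicator in the spirit of eqs. (29)-(30) for a matrix of strictly positive
    Gibbs parameters g_a.  Values close to 1 mean the potential is (nearly)
    decomposable into phi_1(k) + phi_2(k') and the shift could be dropped."""
    log_g = np.log(g)
    # absolute log cross-ratios on neighbouring 2x2 index blocks, cf. eq. (21)
    d = np.abs(log_g[:-1, :-1] + log_g[1:, 1:] - log_g[:-1, 1:] - log_g[1:, :-1])
    return float(np.exp(d).mean())

# a shift could then be flagged for removal when decomposability(g_a) < 1.003
```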

4 Sampling

4.1 The general Gibbs sampler

Given a GPD model G on a set A of shifts, the Gibbs sampler is an algorithm to generate images with the probability p(k; G). This is done by iteratively refining a current sample k in the following way: for all pixels r \in R, the algorithm computes p(k(r) = \kappa \mid k \setminus k(r)) for all \kappa. Under the Markov assumption, this is equivalent to p(k(r) = \kappa \mid k(N(r))), with N(r) being the neighborhood of r as defined by A:

    p(k(r) = \kappa \mid k(N(r))) = \frac{1}{c} \prod_{a \in A} g_a(\kappa, k(r - a))        (31)

with c being the normalizing constant that makes these probabilities sum to 1 over \kappa. With that probability, a new signal value for k(r) is generated at random. Once this has been done for every pixel, one 'run' is finished. This is repeated over and over, thus producing a constant stream of samples k. Usually the algorithm is started with a sample k^0 that is simply generated at random with equiprobable signals in all pixels.
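A minimal sketch of one sampling run is given below. It is not code from the thesis: it follows equation (31), but multiplies in both directions of every shift under the assumption that A stores only one member of each equivalent pair (x, y), (-x, -y) (cf. section 5.4), and it simplifies the boundary handling to wrap-around.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_run(k, G, shifts):
    """One run of the Gibbs sampler: resample every pixel once according to eq. (31).

    k:      (M, N) integer labeling with signals 0..K-1
    G:      list of (K, K) strictly positive parameter matrices g_a, one per shift
    shifts: list of shifts a = (x, y)
    """
    M, N = k.shape
    K = G[0].shape[0]
    for m in range(M):
        for n in range(N):
            # unnormalized conditional probability for every candidate signal
            p = np.ones(K)
            for (x, y), g in zip(shifts, G):
                p *= g[:, k[(m - x) % M, (n - y) % N]]   # pairs (r, r - a)
                p *= g[k[(m + x) % M, (n + y) % N], :]   # pairs (r + a, r)
            p /= p.sum()
            k[m, n] = rng.choice(K, p=p)
    return k
```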

4.2 Runtime improvements

The Gibbs sampler is the most frequently used part of the process. Naturally, it has to be the most efficient part of any implementation, but apart from that there is one other possible way of acceleration: there is no way to tell exactly³ after how many iterations the current image k is really produced with p(k; G). Since taking 'too young' samples would disturb the whole process, the most trivial solution would be to sample for a great number of iterations (depending on the image), thus making sure the image is ready. I would like to present a highly experimental heuristic for determining the end of the sampling process.

³ At least not in a reasonable time.

4.2.1 Iteration measurement

The idea is based on self-observation. While looking at the sampling process, I noticed that the human mind is able to determine when the overall sample 'stops changing' although there is still a great deal of pixel-wise noise from image to image. I tried to find measurements that would show some noticeable change at the same time as the human perception and found them in the derivative of the second-order gray level co-occurrence histograms (GLCHs) on each shift a \in A:

    D^2 = \sum_{a \in A} \sum_{k,k'} \left( n_a^k(k, k') - n_a^{k_0}(k, k') \right)^2        (32)

Here, k is the current sample and k_0 is the previous sample. With this definition, the algorithm literally waits for the changes in the GLCHs to stop. The results of this are not very good yet, because D^2 still behaves very stochastically, making it hard to formulate an automated strategy for stopping the sampling process. The first improvement comes from using more than one past sample for computing D^2, for instance the last 5 or 10 samples. In my experiments, taking 5 samples proved to be a good compromise between too much fluctuation and too coarse measurement intervals. Still, D^2 declines in a rather stochastic pattern, so a method had to be found to separate a function's stochastic decline from fluctuations around a rather constant level.

4.2.2 Instability detection

I found the best results to come from comparing the latest n values of D^2 to an average over that time. The actual decision is expressed as a set of rules, based on the values D^2(0), D^2(1), \dots, D^2(n-1), with D^2(n-1) being the latest, and the mean value \bar{D}^2 as well as the minimum value D^2_{min} in the evaluated time frame of n values:

- D^2(0) < \bar{D}^2 or D^2(1) < \bar{D}^2
- D^2(n-1) > \bar{D}^2 or D^2(n-2) > \bar{D}^2
- D^2(n-1) > D^2_{min} and D^2(n-2) > D^2_{min}

If all of these conditions hold, the state is said to be unstable and sampling is stopped. I am well aware that this procedure is supported by dreadfully little mathematical background. In this special case, however, I was more interested in finding a usable technique for shortening the sampling procedure, and experiments show that for n = 5 the process produces the same results in noticeably less time, at least for those cases I tried.
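Transcribed directly from the rules above, a sketch of the test could look as follows; the history is assumed to be a plain list of the measured D² values, and the function name is my own.

```python
def sampling_unstable(d2_history, n=5):
    """Instability test of section 4.2.2 on the last n values of D^2.

    Returns True when the decline of D^2 has turned into fluctuation
    around a constant level, i.e. sampling can be stopped."""
    if len(d2_history) < n:
        return False
    window = d2_history[-n:]          # D2(0) ... D2(n-1), D2(n-1) being the latest
    mean = sum(window) / n
    low = min(window)
    rule1 = window[0] < mean or window[1] < mean
    rule2 = window[-1] > mean or window[-2] > mean
    rule3 = window[-1] > low and window[-2] > low
    return rule1 and rule2 and rule3
```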

4.3 Examples

The following figure shows the process of sampling on an already learned model for texture D77:

Figure 5: Sampling with a model for D77: current sample after 0, 10, 20, 50 and 70 runs

The first image shows k^0, a randomly assigned image. The following images show k^{10}, k^{20}, k^{50} and k^{70}. As can be seen, the expected image forms after about 70 runs. Further runs do not improve the image.

5 Learning

5.1 Parameter estimation

5.1.1 Objective functions

Given a set A of shifts and an original image k^*, it is now necessary to derive the parameters \Phi = (\phi_a : a \in A). Since this is obviously an optimization problem, the first thing to find is a usable objective function, in which the probability p(k; \Phi) is to approximate the 'real' probability p^*(k). Apart from the maximum likelihood, I have also tried other objective functions:

1. Minimizing the relative entropy of the two probability distributions

    E = \sum_{k} p(k; \Phi) \ln \frac{p(k; \Phi)}{p^*(k)}        (33)

    \frac{\partial E}{\partial \phi_a(k, k')} = \sum_{k} p(k; \Phi) \, n_a^k(k, k') \ln \frac{p(k; \Phi)}{p^*(k)} - \lambda_a(k, k'; \Phi) \sum_{k} p(k; \Phi) \ln \frac{p(k; \Phi)}{p^*(k)}

It is not possible to compute this gradient because nothing can be said about the distribution p^*(k).

2. Minimizing the squared difference between p(k; \Phi) and p^*(k)

    E = \sum_{k} \left( p(k; \Phi) - p^*(k) \right)^2        (34)

As with the relative entropy, the calculation is made impossible by the failure to eliminate p^*(k).

5.1.2 Maximum Likelihood optimization

Looking for the maximum likelihood means maximizing the probability of all k with respect to their 'real' occurrence. Since every k occurs with p^*(k), the maximum likelihood function looks like this:

    E = \prod_{k} p(k; \Phi)^{p^*(k)} \;\to\; \max_{\Phi}        (35)

Taking the logarithm, the objective function changes into

    E' = \sum_{k} p^*(k) \ln p(k; \Phi) \;\to\; \max_{\Phi}        (36)

First, this can be made much simpler:

    E' = \sum_{k} p^*(k) \ln \left[ \frac{1}{Z_\Phi} \exp \left( \sum_{a \in A} \sum_{r \in R} \phi_a(k(r), k(r - a)) \right) \right]
       = -\ln Z_\Phi + \sum_{k} p^*(k) \sum_{a \in A} \sum_{r \in R} \phi_a(k(r), k(r - a))

Now, the GLCHs over pairs (k, k') are used instead of the \phi_a(k(r), k(r - a)):

       = -\ln Z_\Phi + \sum_{k} p^*(k) \sum_{a \in A} \sum_{(k,k')} n_a^k(k, k') \, \phi_a(k, k')

The \phi_a(k, k') do not depend on the labeling k, so the sums can be reordered:

       = -\ln Z_\Phi + \sum_{a \in A} \sum_{(k,k')} \phi_a(k, k') \sum_{k} p^*(k) \, n_a^k(k, k')
       = -\ln Z_\Phi + \sum_{a \in A} \sum_{(k,k')} \phi_a(k, k') \, \lambda^*_a(k, k')        (37)

where \lambda^*_a(k, k') = \sum_k p^*(k) \, n_a^k(k, k') is the marginal co-occurrence expectation of the original distribution, cf. (14). Now, with Z_\Phi = \sum_{k} \exp \left[ \sum_{a'} \sum_{(\kappa,\kappa')} n_{a'}^k(\kappa, \kappa') \, \phi_{a'}(\kappa, \kappa') \right],

    \frac{\partial E'}{\partial \phi_a(k, k')} = -\frac{1}{Z_\Phi} \sum_{k} \exp \left[ \sum_{a'} \sum_{(\kappa,\kappa')} n_{a'}^k(\kappa, \kappa') \, \phi_{a'}(\kappa, \kappa') \right] n_a^k(k, k') + \lambda^*_a(k, k')        (38)

       = -\sum_{k} p(k; \Phi) \, n_a^k(k, k') + \lambda^*_a(k, k')        (39)

       = \lambda^*_a(k, k') - \lambda_a(k, k'; \Phi)        (40)

The only way to obtain \lambda^*_a(k, k') is to approximate it with n_a^{k^*}(k, k'). On large original images, however, this is a reasonable approximation. Likewise, \lambda_a(k, k'; \Phi) is produced by averaging over the n_a^k(k, k') of many samples k. In this way it is possible to follow the gradient into the next optimum without ever knowing a value of E':

    \Delta\phi_a(k, k') = \gamma \left( n_a^{k^*}(k, k') - n_a^{k}(k, k') \right)        (41)

In my experiments, \gamma = 0.1 produced a slow-but-steady convergence, whereas larger values sped up the process but led to great fluctuation near the optimum. An interesting idea would be to make \gamma adaptive, or at least time dependent, but I did not have time to pursue this. Another open point is to find sophisticated starting values for a new \phi_a, for instance by tailoring the objective function. Again, I have to leave this to further work on this subject.

In section 3.3 I have shown under which conditions it is possible to remove a shift a from the model. The much more interesting task is to add shifts according to their 'use' for the model. This task splits into two subtasks: Interaction evaluation and Interaction selection. With those two tasks solved, the algorithm can estimate the parameters on the given set of vectors, then add a few new vectors with important structure, initializing their potentials with 0, and continue estimating the parameters of all vectors. This process can then be repeated until the visual quality of the image is sucient. A number of very interesting experiments on this subject can be found in [4]. 4

5.2.1 Interaction evaluation In [1], a method is proposed to estimate the importance of a shift by determining the distance between the gray level di erence histogram (GLDH) on that shift in the original image k and the GLDH on that shift in the IRF, i.e. a MRF with independent equiprobable signals in each pixel. Since in this case these GLDHs are vectors of dimension (2k + 1), the article uses the vector distance as a quality for that shift. I will adapt this method for use with gray level co-occurrence histograms (GLCHs). Additionally, I am not interested in the distance of any GLCH to  the IRF, but rather in the distance of the GLCH on every shift in k to the GLCH on that shift in the current model . This means that it is necessary to de ne a distance between the two jKj  jKj matrices nak and nka for every a 2 A. max



4 It

is evident that it is not enough to estimate parameters on the new vectors only { this

will change the Gibbs distribution so that it is necessary to change the other parameters, too.

20

I have investigated three quality measures: 1. The sum of the log-likelihoods between elements X Q(nka ; nka ) = nak (k; k0 ) ln nak (k; k0 ) 

(42)



(k;k 0 )

A large log-likelihood value indicates that the matrices are similar. Since great di erences should get high qualities, the negative sum is used. Unfortunately, this measure completely neglects the structure of the image and is therefore not usable. 2. The sum of the squared distances between elements  X Q(nka ; nka ) = nak (k; k0 ) nka (k; k0 ) (43) 

2



(k;k 0 )

This produces a ne measure when comparing to the IRF (which is equal to an empty model), however the usability declines if there are already some shifts in the model. 3. The sum of the relative entropies between elements X k nk (k; k0 ) (44) Q(nka ; nak ) = na (k; k0 ) ln ka 0) ( k; k n a k;k This produced by far the best results, creating models that are small but powerful enough to express the structure of the texture. To compare the three measures, I have visualized them in the following gure containing the result of each formula for one element: 



(

n_k*

0)

n_k*

n_k*

5

5

5

4

4

4

3

3

3

2

2

2

1

1

1

1

2

3

4

5

n_k

1

2

3

4

5

n_k

1

2

3

4

5

n_k

Figure 6: Visualization of the distance measures for each element: Left the log-likelihood, in the middle the squared di erence and on the right the relative entropy 21

The images illustrate the quality of the three measures when comparing a single element. On the x-axis is the value for nka and on the y-axis the corresponding nka . The darker the region, the higher the quality it represents. It is obvious that the log-likelihood is in no way similar to the other two and thus it is understandable that the results are substantially di erent (and worse). 

5.2.2 Interaction selection

Given a distance measure Q(n_a^{k^*}, n_a^k) and a set A' of shifts to investigate, a subset of shifts with the best qualities has to be selected. In [1], this is done by calculating the average quality M_Q and the standard deviation \sigma_Q and then choosing all shifts with Q(n_a^{k^*}, n_a^k) > M_Q + c \, \sigma_Q, c = 3 or 4. Although mostly adequate, at some instances this leads to adding a very large number of new shifts to the model. I am convinced that it is not wise to add more than a few shifts in one step, because this has a serious impact on performance and most of these shifts tend to be decomposed after a large number of parameter estimation steps; there is simply never enough information in the system to justify such large changes. Therefore, in my model I use a threshold depending on the maximum quality:

    Q_{max} = \max_{a \in A'} Q(n_a^{k^*}, n_a^k)        (45)

A shift a \in A' is selected if Q(n_a^{k^*}, n_a^k) \geq \tau \, Q_{max}, with \tau \approx 0.99. In general, this will add one to four new shifts, giving the system the chance to first adapt good potentials on these new shifts before adding more new shifts.

5.3.1 The log-likelihood maximum As proposed in [1], the log-likelihood function I used for parameter estimation is strictly concave and has a unique maximum i (46) 0 < nak (k; k0) < 1 8a 2 A; k; k0 2 K The rst and obvious observation is, that this may in many cases not be the case. nka (k; k0) = 0 only means that the pair ( k; k0 ) of signal values never occurred on the shift a in k. In such cases the most trivial solution is to chose a very low value which will, in general, have the same e ect for the parameter estimation. 



22

The second, much more serious problem is, that it is not for sure, whether the parameter estimation will nd the unique maximum, even if there is one. This is due to the approximations done by taking nka (k; k0) instead of a(k; k0; ) during sampling. In e ect, the gradient declines during the bigger part of the optimization process but does never reach zero (as would be the case in any optimum) and rather starts to uctuate at a low level. Since there is nothing that can be done to prevent this, I had to accept that di erent nal situations will be reached, depending on the initial potentials. 5.3.2 Terminating the parameter estimation If the objective function of the maximization is strictly concave, to terminate the optimization it is not neccesary to know the value of the objective function, but it is sucient to know the gradient. The closer the value of the objective function is to the optimum, the closer the gradient will be to zero. Sadly, the gradient will never reach zero (see above), so a method has to be found to separate the stochastic declining of the gradient from the

uctuation near the optimum. I have already discussed this problem in section 4.2.2 and my experiments have shown that the solution presented there is also appropriate for this task. 5.3.3 Comparison to shift invariant models In reference to [1] I would like to add a short comparison of the model proposed here and the shift-invariant approach Gimmel'farb chose. In his model, Gibbs parameters are not represented by a matrix on each shift, but rather by a vector of dimension 2k +1. Translated to the matrix representation, this only means that all matrices contain identical values in the elements of all diagonals. After learning a model for the matrix-based approach, it is easy to check this constraint: The most intuitive measure is the standard deviation on each diagonal in the potential matrices of the model. For comfort, the n  n matrices are expressed in the following way: 0 n 1 a  a n a n B .. . an    a n C C A=B B ... . . . ... C @ a A a a    ann Here, A is a potential matrix a. Now, the assumption of shift-invariant max

2 1

1

2

2 1 1 1

2 2

23

2

2 1

1

2 2

2

parameters can be expressed as (8 i; j ) aki = akj k = 1 : : : 2n 1 (47) Thus, the di erence of any matrix A to that assumption on any diagonal k is formulated as the standard deviation on that diagonal: s 1 X (ak E (k))  (k) = (48) A

nk

with EA (k) =

A

i

i

1

X

2

aki ;

nk i nk = min (k; 2n k)

Experiments show that on shifts that have gone through a great number of learning cycles ( 1000), this deviation takes considerable size. The following picture shows A(k) on a trained model for D77. sigma 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05

g0

2 sigma 0.1

4

sigma 0.1

g1 - g3

6

8

10

12

14

k

sigma 0.1

g4 - g9

0.08

0.08

0.08

0.06

0.06

0.06

0.04

0.04

0.04

0.02

0.02

0.02

2

4

6

8

10

12

14

k

2

4

6

8

10

12

14

k

g10 - g19

2

4

6

8

10

12

14

k

Figure 7: Standard deviation on the diagonals of the Gibbs potential matrices: The rst gure shows the deviation for the oldest shift, the other three images for the average of shifts 2-4, 5-9 and 10-19 respectively It is evident that the assumption of shift-independent Gibbs potential is adequate for most of the 'younger' shifts. However, a deviation of more than 30% in the oldest shift means, that there are corresponding Gibbs parameters 1:5 times as strong as others. To demonstrate the impact of such uctuation, I use a model for D77 in which the elements on every diagonal have been replaced by the mean value on that diagonal. Thus, the assumption of shiftinvariant potential is forced: 24

Figure 8: Sampling with a modified model for D77: current sample after 0, 10, 20, 50 and 70 runs with a model with enforced shift-invariant potentials

To compare these samples to the ones produced by the 'real' model, see figure 5. It is very striking how the quality has worsened. Of course, taking the mean value may not be the optimal solution as opposed to learning the shift-invariant potentials directly, but the difference between the learned matrix-based model and that assumption becomes very clear.

5.4 Examples

The following figures show the process of generating a model for D77. The upper images show the computed quality, i.e. the importance of adding this shift, for all shifts in the search window, measured from the purple center. The darker a pixel, the better the quality of that shift. Note that for better visibility, each plot is scaled so that the worst quality is white and the best quality is black. Therefore, it is impossible to compare the absolute plots. Blue dots mark shifts that are part of the model. Since the shifts (x, y) and (-x, -y) are equivalent, those 'mirrors' are painted in light blue. In general, the quality of vectors close to shifts in the model is very low. This is to be expected, because the quality marks the difference between model and original image. On shifts that are in the model, this difference is supposed to be low. The lower images each show a sample with the current parameters, corresponding to the structure above. All images are taken shortly before extending the structure, thus at the end of the current parameter estimation period.


Figure 9: Generation of a model for D77 - Part 1: Visualisation of the interaction evaluation (top) and the current sampled image (bottom) after 328, 556, 593, 635 and 669 cycles

Whereas the sample only slowly reflects the wanted pattern, it is evident that the qualities of the observed vectors massively change during the process. This means that it is clearly not practicable to estimate the whole set of vectors with the first quality measurement. Instead, the information which vectors to take next is only gradually revealed.

Figure 10: Generation of a model for D77 - Part 2: Visualisation of the interaction evaluation (top) and the current sampled image (bottom) after 690, 743, 783, 888 and 1175 cycles

Note that 'slips', as seen in cycles 593 and 690, do not harm the final quality of the solution.

A second, interesting test of the quality of the structure algorithm is to test how well a given structure is reconstructed. For this, I have used the learned model for D77 to sample a large image and used that as an original for a second run of the algorithm. The result was a new model with no noticeable change in the visual quality, and a set of vectors which consisted of all but five vectors of the original model, one vector that differed by only one pixel from one in the original, and three completely different vectors. I would think that this shows that the algorithm is (mostly) able to reproduce a given structure.

6 Recognition

This section only serves the purpose of demonstrating the utility of the proposed model for simple segmentation tasks. The method described for recognition is only applicable for separating truly different textures in an image. Still, it can be seen that Gibbs models built as shown above are able to solve these situations with extremely little effort. More sophisticated models for image segmentation based on Gibbs models can be found in [5].

Given an image k on a 2-D lattice R, I presume that there already exists a model \Phi describing one of the major textures present in the image. The task now is to determine which area of the image this texture describes best. Therefore, I examine small frames of size n \times m in the image and compute their probability p^{n \times m}_{(i,j)}(k; \Phi) at every pixel (i, j) \in R of being produced by the texture, as follows:

    p^{n \times m}_{(i,j)}(k; \Phi) = \frac{1}{Z_{nm}} \exp \left[ \sum_{(u,v)=(i-\frac{n}{2},\, j-\frac{m}{2})}^{(i+\frac{n}{2},\, j+\frac{m}{2})} \; \sum_{(x,y)=a \in A} \phi_a(k(u, v), k(u - x, v - y)) \right]        (49)

This seems equivalent to the probability of every small frame being built with the texture described by \Phi. It really is, except that it neglects the fact that at the borders of every frame, shifts pointing outside should not be taken into account. Since this would make the calculation much more complicated, I will compromise at this point. Experiments show that the differences are negligible.

The real problem in the above formula is that it is not possible to determine Z_{nm} in reasonable time. Fortunately, this is not completely necessary. Only certain aspects of Z_{nm} are important for segmentation, and those can be retrieved very easily. First, it is evident that no value of Z_{nm} has to be known to compare one frame of k to another, since Z_{nm} does not depend on (i, j). Thus, the regions in the image where the texture is matched best can be found without difficulty. This, however, does not quite finish the task: it is now necessary to compare values of p^{n \times m}_{(i,j)}(k; \Phi) between models of different textures, without knowing either one's Z_{nm}. Generally speaking, Z_{nm} is a scaling factor. Omitting it produces the same "probabilities", but it is not possible to compare them. What is needed is some "probability" p_{100} for each model that can be said to mark a quality of 100% with respect to that model. For that, I have chosen the maximum quality a model assigns to its own sample. Relying solely on the plausibility of this statement, scaling the probabilities with this p_{100} produced a suitable measure for comparing the qualities of many models in one frame of the image.
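A sketch of the unnormalized frame score, i.e. the exponent of equation (49) without Z_nm, is given below; it is not code from the thesis. Pairs whose second pixel falls outside the image are simply skipped here, a simplification in the same spirit as the compromise described above; all names are my own.

```python
def frame_log_score(k, phi, shifts, center, size):
    """Unnormalized log of eq. (49): sum of potentials inside an n x m frame.

    k:      (M, N) integer labeling
    phi:    dict shift -> (K, K) potential matrix phi_a
    shifts: list of shifts a = (x, y)
    center: (i, j) center pixel of the frame
    size:   (n, m) frame size
    """
    M, N = k.shape
    i, j = center
    n, m = size
    score = 0.0
    for u in range(max(0, i - n // 2), min(M, i + n // 2 + 1)):
        for v in range(max(0, j - m // 2), min(N, j + m // 2 + 1)):
            for (x, y) in shifts:
                uu, vv = u - x, v - y
                if 0 <= uu < M and 0 <= vv < N:
                    score += phi[(x, y)][k[u, v], k[uu, vv]]
    return score
```

Scores of different models can then be compared after dividing each by that model's own reference score p_100 as described above.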

Figure 11: Sample image with three textures: On top the original image, in the middle the evaluation of the texture by each of the three models and at the bottom the segmentation into three regions

As can be seen, the sample image is adequately recognized by each model in the respective area. Therefore it is easy to segment the image using a thresholding algorithm, as seen in the fourth image.


7 Conclusion

This thesis covered a number of tasks on Gibbs texture models, such as sampling, parameter and structure learning, and the related task of clustering. The main points of interest were the adaptation of the structure and the comparison of models found by a matrix-based approach to those that rely on shift-invariant parameters. On multiple examples I have shown the application of every one of the methods suggested for these tasks. All in all, this supports the statement that the proposed model is able to describe spatially uniform stochastic textures.

In this thesis, there are some points where solutions to certain problems were found empirically. Thorough theoretical investigation is still to be done in these areas. In addition to those heuristics, it is necessary to define the limits of this method. It is not possible to use this model to recognize textures that were subject to any affine transformation compared to the original used for learning. This includes rotation (except for special angles), zoom, tilt, etc. Image disturbances such as shading or deformation are also fatal, but due to the stochastic property of the model it turns out to be resistant to ordinary noise.

8 Acknowledgements

It is a pleasure to thank my advisor, Doz. Dr. B. Flach, for the permanent constructive and motivating input. Furthermore, this work was made possible by the continuous support of Prof. E. Stoschek. I thank him for his guidance over the last four years and for stimulating my interest in artificial intelligence. Last but not least, I thank my family and friends for letting me rely on them; I could not have done this work without their support.


List of Figures

1  Examples of spatially uniform stochastic textures
2  Clustering of an RGB-colored image
3  Clustering of real textures
4  Simultaneous change of two signals
5  Sampling with a model for D77
6  Visualization of the distance measures for each element
7  Standard deviation on the diagonals of the Gibbs potential matrices
8  Sampling with a modified model for D77
9  Generation of a model for D77 - Part 1
10 Generation of a model for D77 - Part 2
11 Sample image with three textures


References

[1] G. L. Gimmel'farb, Texture Modeling by Multiple Pairwise Pixel Interactions, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:1110-1114, 1996.

[2] P. Brodatz, Textures: A Photographic Album for Artists and Designers, New York: Dover Publications, 1966.

[3] Stan Z. Li, Markov Random Field Modeling in Image Analysis, Tokyo: Springer-Verlag, 2001.

[4] G. L. Gimmel'farb, Imaging and Vision Systems: Theory, Assessment and Applications, chapter Characteristic interaction structures in Gibbs texture modeling, pages 71-90, Nova Science Publishers, 2001.

[5] Ivan Kovtun, Texture segmentation of images on the basis of Markov random fields, Technical Report TUD-FI03, TU Dresden, May 2003.


Declaration

I, Benjamin Dittes, hereby declare that the bachelor thesis "Texture modeling with Gibbs probability distributions" was written by myself without any but the therein stated resources.

Dresden, July 30th, 2004

Benjamin Dittes

