EE 396: Lecture 9
Ganesh Sundaramoorthi
March 12, 2011
0.1 Dictionaries
The work [4] has studied the statistics of image patches in natural images (which they define to be images of the visual environment that do not show artifacts of civilization, e.g., scenes of rocks, mountains, ocean, etc., but not televisions, buildings, cars, etc.), and they have shown that each patch in such images is a linear combination of a small number of patches chosen from a "dictionary" of "atomic" patches. Indeed, they find that this dictionary consists of patches that resemble Gabor filters, which are similar to the receptive fields in the visual cortex of the human brain. Following this observation, we assume that patches of images of natural scenes (whether or not they contain artifacts of civilization) are formed as a linear combination of a small number of atomic patches chosen from a dictionary of these atoms; we will not assume that the dictionary is composed of Gabor filters¹. We will incorporate this information to create a new prior on u ∈ U, the class of true images.

¹ Note that this follows a similar idea to our last lecture on image self-similarity; that is, the image is composed of repeated atoms.

Let us assume that Ω, the domain of the image, is rectangular and is the disjoint union of N_p = N_{p1} × N_{p2} identical patches, each a translate of Ω_p = [0, L_{p1}] × [0, L_{p2}] (so Ω = [0, L_{p1} N_{p1}] × [0, L_{p2} N_{p2}]). Let u_i : Ω_p → R for i = 1, ..., N_p denote the ith patch of u : Ω → R:

u(x) = u_i(mod(x_1, L_{p1}), mod(x_2, L_{p2}))  if x = (x_1, x_2) is in the ith patch,    (1)

where mod denotes the modulus. Let us denote by d_i : Ω_p → R the ith dictionary atom, where i = 1, ..., N_d (there are N_d dictionary elements). We set D(x) = (d_1(x), ..., d_{N_d}(x)) (that is, D : Ω_p → R^{1×N_d} is a row vector).

1. We assume that each patch of u is modeled as

   u_i(x) = Σ_{j=1}^{N_d} α_{ij} d_j(x) + η_i(x) = D(x)α_i + η_i(x)    (2)

   where η_i : Ω_p → R is modeling noise, and α_i ∈ R^{N_d}.

2. Most of the components of α_i are zero; indeed, we assume that

   p(α_i) ∝ exp(−µ_i ‖α_i‖_{ℓ0})    (3)

   where ‖·‖_{ℓ0} denotes the ℓ0 norm, that is, the number of non-zero elements of α_i:

   ‖α_i‖_{ℓ0} = Σ_{j=1}^{N_d} (1 − δ(α_{ij}))    (4)

   where δ(x) = 1 if x = 0 and δ(x) = 0 if x ≠ 0. Further, α_i and α_j are independent for i ≠ j (if i and j are nearby, they may in fact be correlated; however, we assume that they are not).

3. We assume that η_i(x) ∼ N(0, σ_p) and is iid in both x ∈ Ω_p and i = 1, ..., N_p.

4. η_i and α_j are independent for each i and j.

Using the above assumptions, we calculate p(u), our prior probability:
p(u) = p(u_1, ..., u_{N_p})    (5)

     = Π_{i=1}^{N_p} p(u_i)    (the stochastic components of u_i, namely α_i and η_i, are independent across patches)    (6)

     = Π_{i=1}^{N_p} p(α_i, η_i)    (the only stochastic components of u_i are α_i, η_i)    (7)

     = Π_{i=1}^{N_p} p(α_i) p(η_i = u_i − Dα_i) = Π_{i=1}^{N_p} exp(−µ_i ‖α_i‖_{ℓ0}) exp( −(1/(2σ_p²)) ∫_{Ω_p} (u_i(x) − D(x)α_i)² dx )    (8)

     = exp( − Σ_{i=1}^{N_p} [ µ_i ‖α_i‖_{ℓ0} + (1/(2σ_p²)) ∫_{Ω_p} (u_i(x) − D(x)α_i)² dx ] ).    (9)
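To fix ideas, here is a minimal numerical sketch of the patch decomposition (1) and of the exponent appearing in the prior (9). It is not part of the lecture: the NumPy discretization, array shapes, and names (split_into_patches, neg_log_prior) are my own assumptions, with atoms stored as columns of D and integrals replaced by sums over pixels.

```python
import numpy as np

def split_into_patches(u, ph, pw):
    """Split an image into non-overlapping ph x pw patches, each flattened into a
    column; discrete analogue of the patch decomposition in (1)."""
    H, W = u.shape
    assert H % ph == 0 and W % pw == 0
    cols = [u[r:r + ph, c:c + pw].reshape(-1)
            for r in range(0, H, ph) for c in range(0, W, pw)]
    return np.stack(cols, axis=1)                 # shape (ph*pw, Np)

def neg_log_prior(U, D, alpha, mu, sigma_p):
    """Exponent in (9): sum_i [ mu_i ||alpha_i||_0
    + (1/(2 sigma_p^2)) ||u_i - D alpha_i||^2 ], integrals replaced by sums."""
    residual = U - D @ alpha                      # columns are u_i - D alpha_i
    l0 = np.count_nonzero(alpha, axis=0)          # ||alpha_i||_{l0} per patch
    return np.sum(mu * l0) + np.sum(residual ** 2) / (2.0 * sigma_p ** 2)

# Tiny usage example with made-up sizes.
rng = np.random.default_rng(0)
u = rng.standard_normal((16, 16))                 # image on Omega
U = split_into_patches(u, 4, 4)                   # Np = 16 patches of size 4x4
D = rng.standard_normal((16, 8))                  # Nd = 8 atoms as columns of D
alpha = np.zeros((8, U.shape[1]))
alpha[0, :] = 1.0                                 # one non-zero coefficient per patch
print(neg_log_prior(U, D, alpha, mu=np.full(U.shape[1], 0.1), sigma_p=1.0))
```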
0.2 Energy to be Optimized
Now, using our additive noise model with Gaussian noise, we find that

E(u) = −log p(u|I) = λ ∫_Ω (I(x) − u(x))² dx + Σ_{i=1}^{N_p} µ_i ‖α_i‖_{ℓ0} + Σ_{i=1}^{N_p} ∫_{Ω_p} (u_i(x) − D(x)α_i)² dx    (10)
(after an appropriate scaling by a constant, and with λ > 0). This is the model considered by [2]. Note that if we knew the α_i and the dictionary, we could minimize in u to obtain a MAP estimate for u, which would then be our denoised image; however, we do not know the α_i! Indeed, we must optimize E in both u and the α_i. Now, the previous reasoning assumes that we have a fixed, pre-defined dictionary. What do we choose for the dictionary? Following the reasoning of [4], we could choose a Gabor dictionary. However, the brilliant idea of [2] is to choose a dictionary tailored to the image itself, and to suppose that the dictionary can be determined from the noisy image itself. That is, they propose to also solve for the dictionary elements as part of the optimization of E². Therefore, the energy to be optimized is

E(u, {α_i}_{i=1}^{N_p}, {d_i}_{i=1}^{N_d}) = λ ∫_Ω (I(x) − u(x))² dx + Σ_{i=1}^{N_p} µ_i ‖α_i‖_{ℓ0} + Σ_{i=1}^{N_p} ∫_{Ω_p} (u_i(x) − D(x)α_i)² dx.    (11)
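Under the same discretization assumptions as the sketch above, the full objective (11) could be evaluated as follows (a sketch, not the authors' implementation; I and u are passed as arrays of flattened patches).

```python
import numpy as np

def energy(I_patches, u_patches, D, alpha, lam, mu):
    """Discrete version of (11): lambda * sum (I - u)^2
    + sum_i mu_i ||alpha_i||_0 + sum_i ||u_i - D alpha_i||^2."""
    data_term = lam * np.sum((I_patches - u_patches) ** 2)
    sparsity_term = np.sum(mu * np.count_nonzero(alpha, axis=0))
    coding_term = np.sum((u_patches - D @ alpha) ** 2)
    return data_term + sparsity_term + coding_term
```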
0.3 Optimizing the Energy

0.3.1 Updating u
We may define

D̃(x) = D(mod(x_1, L_{p1}), mod(x_2, L_{p2})), x ∈ Ω,    (12)

which is simply the dictionary extended to all of Ω. We shall also define χ_i : Ω → {0, 1} as

χ_i(x) = 1 if x is in the ith patch of Ω, and χ_i(x) = 0 if x is not in the ith patch of Ω.    (13)
We may then write the energy as

E(u, {α_i}_{i=1}^{N_p}, {d_i}_{i=1}^{N_d}) = λ ∫_Ω (I(x) − u(x))² dx + Σ_{i=1}^{N_p} µ_i ‖α_i‖_{ℓ0} + ∫_Ω ( u(x) − Σ_{i=1}^{N_p} D̃(x)α_i χ_i(x) )² dx.    (14)

We are now going to devise an iterative update scheme where we update each of u, α, D assuming that we are given some estimate of the others. Let us first assume that we have an estimate of α and D, and we now derive a scheme to estimate u. To do so, we compute the Euler-Lagrange equation with respect to u; only the first and last terms of the energy depend on u. Thus, the Euler-Lagrange equation is

λ(u(x) − I(x)) + u(x) − Σ_{i=1}^{N_p} D̃(x)α_i χ_i(x) = 0,    (15)
which yields

u(x) = (λ/(1 + λ)) I(x) + (1/(1 + λ)) Σ_{i=1}^{N_p} D̃(x)α_i χ_i(x),    (16)
that is, u is a weighted average of the (noisy) image and the current dictionary approximation of the noisy image.
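For a given estimate of α and D, the update (16) is a pointwise weighted average; since χ_i selects the ith patch, it reduces patch-by-patch to u_i = (λ I_i + D α_i)/(1 + λ). A minimal sketch under the same discretization assumptions as above:

```python
import numpy as np

def update_u(I_patches, D, alpha, lam):
    """Closed-form update (16): on each patch,
    u_i = (lam * I_i + D alpha_i) / (1 + lam)."""
    recon = D @ alpha                  # dictionary approximation of every patch
    return (lam * I_patches + recon) / (1.0 + lam)
```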
² Note that one must be very careful here: if we solve for u, the α_i, and D jointly without constraints, then a trivial solution would be to choose u_i to be the ith patch of I, choose α_{ii} = 1 and α_{ij} = 0 for j ≠ i, and the dictionary atoms to be the patches of I themselves, in which case our energy is minimized but our image is not denoised! The trick is to have a prior on the number of dictionary elements; therefore N_d will have to be chosen small compared to the number of patches N_p, and this results in a non-trivial solution.
0.3.2 Updating α_i
We now assume that we have an estimate of u and D, and we update the α_i. Note that only the last two terms of the energy depend on the α_i, and therefore we minimize

argmin_{{α_i}_{i=1}^{N_p}} Σ_{i=1}^{N_p} [ µ_i ‖α_i‖_{ℓ0} + ∫_{Ω_p} (u_i(x) − D(x)α_i)² dx ] = Σ_{i=1}^{N_p} argmin_{α_i} [ µ_i ‖α_i‖_{ℓ0} + ∫_{Ω_p} (u_i(x) − D(x)α_i)² dx ],    (17)

that is, the problem splits into N_p independent problems. The problem above is known as least squares with a sparsity penalty. It has been of much recent interest in many applications, including compressive sensing. There are many numerical methods for solving this problem, but we will just look at one, the matching pursuit algorithm [3], perhaps one of the simplest³. The matching pursuit algorithm is as follows. We assume that ‖d_i‖_{L²} = 1. Let R^1 = u_i (the residual) and u_i^0 = 0 (the approximation of u_i using dictionary elements). Then the matching pursuit algorithm iterates the following steps for k = 1, 2, ...:
1. i_k = argmax_{i=1,...,N_d} |⟨R^k, d_i⟩_{L²}|

2. u_i^k = u_i^{k−1} + ⟨R^k, d_{i_k}⟩_{L²} d_{i_k} = Σ_{l=1}^{k} ⟨R^l, d_{i_l}⟩_{L²} d_{i_l}

3. R^{k+1} = R^k − ⟨R^k, d_{i_k}⟩_{L²} d_{i_k}

4. Note that α_{ij}^k = Σ_{l=1}^{k} δ_{i_l, j} ⟨R^l, d_{i_l}⟩_{L²} (the kth approximation of α_{ij}).

The algorithm stops when the desired level of sparsity is reached (the maximum number of non-zero entries of α_i^k is reached) or when some other convergence criterion is met. A modified version of this algorithm, known as orthogonal matching pursuit [5], has better convergence properties.
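A minimal matching pursuit sketch for one patch, following the steps above (not the orthogonal variant [5]); the discretization, stopping rule, and variable names are my own assumptions, with unit-norm atoms stored as columns of D.

```python
import numpy as np

def matching_pursuit(u_i, D, max_nonzero, tol=1e-8):
    """Greedy sparse coding of one patch u_i over a dictionary D with
    unit-norm columns; returns the coefficient vector alpha_i."""
    alpha = np.zeros(D.shape[1])
    residual = u_i.astype(float).copy()          # R^1 = u_i
    for _ in range(max_nonzero):
        corr = D.T @ residual                    # <R^k, d_i> for every atom
        ik = int(np.argmax(np.abs(corr)))        # step 1: best-matching atom
        if abs(corr[ik]) < tol:                  # residual already explained
            break
        alpha[ik] += corr[ik]                    # step 4: accumulate coefficient
        residual -= corr[ik] * D[:, ik]          # step 3: update residual
    return alpha

# Usage: code a patch built from two atoms, allowing at most 3 atoms.
rng = np.random.default_rng(1)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)                   # ||d_i|| = 1
u_i = 2.0 * D[:, 3] - 0.5 * D[:, 10]
print(np.nonzero(matching_pursuit(u_i, D, max_nonzero=3))[0])
```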
0.3.3 Updating the dictionary
We are going to define U(x) = (u_1(x), ..., u_{N_p}(x)), where U : Ω_p → R^{1×N_p}. Further, we let α = (α_1, ..., α_{N_p}) ∈ R^{N_d×N_p}. Then we see that

|U(x) − D(x)α|² = Σ_{i=1}^{N_p} (u_i(x) − D(x)α_i)², x ∈ Ω_p.    (18)

Therefore,

‖U − Dα‖²_{L²} = Σ_{i=1}^{N_p} ∫_{Ω_p} (u_i(x) − D(x)α_i)² dx,    (19)

where recall that

‖U‖²_{L²} = ∫_{Ω_p} |U(x)|² dx.    (20)
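In the discrete setting, (18)–(19) say that the patch-wise sum of squared coding errors is the squared Frobenius norm of the matrix residual; a quick check under the same column-stacking assumptions as before:

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.standard_normal((16, 10))        # columns are the patches u_i
D = rng.standard_normal((16, 8))         # columns are the atoms d_j
alpha = rng.standard_normal((8, 10))     # column i holds alpha_i

per_patch = sum(np.sum((U[:, i] - D @ alpha[:, i]) ** 2) for i in range(10))
frobenius = np.linalg.norm(U - D @ alpha, 'fro') ** 2
print(np.allclose(per_patch, frobenius))  # True: the discrete form of (19)
```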
³ Note that if the d_i were orthonormal, then the problem would be trivial: we would choose the components of α_i to be the first few largest values of ⟨d_j, u_i⟩_{L²} and set the rest of the components of α_i to zero. However, the dictionary will not be orthonormal and may in fact be redundant.
Therefore, to minimize the energy of interest over D, given an estimate of α and u, we are left with minimizing

argmin_D ‖U − Dα‖²_{L²}.    (21)

To do this, we write

‖U − Dα‖²_{L²} = ‖U − Σ_{i=1}^{N_d} d_i α^i‖²_{L²} = ‖U − Σ_{i=1, i≠k}^{N_d} d_i α^i − d_k α^k‖²_{L²} = ‖E_k − d_k α^k‖²_{L²},    (22)

where α^i is the ith row of the matrix α, and

E_k(x) = U(x) − Σ_{i=1, i≠k}^{N_d} d_i(x) α^i, x ∈ Ω_p.    (23)

We now assume that we have estimates of d_i for i ≠ k, and we are going to minimize ‖E_k − d_k α^k‖²_{L²} over d_k.
0.3.4 SVD
To do this, we recall the singular value decomposition (SVD) of a matrix. Let A ∈ R^{m×n}; then we can decompose the matrix into the form

A = WΣV^T = Σ_i σ_i w_i v_i^T,    (24)

where Σ ∈ R^{m×n} is diagonal, i.e.,

Σ_{ij} = σ_i if i = j, and Σ_{ij} = 0 if i ≠ j,    (25)

and W = (w_1, ..., w_m) ∈ R^{m×m} and V = (v_1, ..., v_n) ∈ R^{n×n} are orthogonal, i.e., W^T W = Id_{m×m} and V^T V = Id_{n×n}. The σ_i are called the singular values of A. Further,

‖A‖²_F := Σ_{i,j} A_{ij}² = Σ_i σ_i².    (26)

Note that {z_{ij} = w_i v_j^T}_{i,j} form an orthonormal basis with respect to the inner product

⟨z_1, z_2⟩ = tr(z_1^T z_2);    (27)

indeed,

tr((w_{i1} v_{j1}^T)^T w_{i2} v_{j2}^T) = tr(v_{j1} w_{i1}^T w_{i2} v_{j2}^T) = δ_{i1 i2} tr(v_{j1} v_{j2}^T) = δ_{i1 i2} v_{j1}^T v_{j2} = δ_{i1 i2} δ_{j1 j2},    (28)

thus {w_i v_j^T}_{i,j} form an orthonormal basis.
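A short NumPy check of these facts (the decomposition (24), the Frobenius identity (26), and the best rank-one approximation used in the next subsection); the matrix is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))

W, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = W diag(s) V^T, s sorted
print(np.allclose(A, W @ np.diag(s) @ Vt))           # decomposition (24)
print(np.allclose(np.sum(A ** 2), np.sum(s ** 2)))   # Frobenius identity (26)

# Best rank-one approximation: sigma_1 w_1 v_1^T, the largest singular triple.
rank1 = s[0] * np.outer(W[:, 0], Vt[0, :])
print(np.sum((A - rank1) ** 2), np.sum(s[1:] ** 2))  # equal residual energies
```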
0.3.5 Continuing with Dictionary Optimization
We decompose E_k using the singular value decomposition:

E_k(x) = Σ_i σ_i w_i(x) v_i^T;    (29)

then the problem is to minimize

‖Σ_i σ_i w_i v_i^T − d_k α^k‖²_{L²}    (30)

over d_k and α^k. Now suppose that we were to solve for the product d_k α^k instead of just d_k. Either d_k α^k ∈ span{w_i v_i^T} or d_k α^k is in the orthogonal complement of span{w_i v_i^T} (note that d_k α^k cannot have components in both span{w_i v_i^T} and its complement, since then d_k α^k would have rank greater than one, which cannot be the case). Suppose the former case; then d_k α^k = γ w_j v_j^T for some j and γ, and

‖Σ_i σ_i w_i v_i^T − d_k α^k‖²_{L²} = Σ_{i≠j} σ_i² + (σ_j − γ)²;    (31)

we see that the sum is minimal when γ = σ_j and σ_j is the largest singular value; in this case d_k α^k = σ_j w_j v_j^T. If d_k α^k is instead in the orthogonal complement of span{w_i v_i^T}, then
‖Σ_i σ_i w_i v_i^T − d_k α^k‖²_{L²} = Σ_i σ_i² + ‖d_k α^k‖²_{L²},    (32)

which has a higher value than if we chose d_k α^k = σ_j w_j v_j^T with σ_j the largest singular value. Therefore, we would like to choose d_k = w_j and α^k = σ_j v_j^T, with σ_j being the largest singular value of E_k. However, v_j^T is unlikely to be sparse, and thus this choice would not keep α sparse, as it should be. There is a quick fix for this: we optimize (22) over only those components of α^k that are non-zero in the current estimate of α^k. To this end,

argmin_{d_k, α_i^k ≠ 0} ‖E_k − d_k α^k‖²_{L²} = argmin_{d_k, (α^k)^R} ‖E_k^R − d_k (α^k)^R‖²_{L²},    (33)

where (α^k)^R is the row vector that contains all the non-zero entries of α^k, and similarly E_k^R is the submatrix of E_k obtained by removing the columns corresponding to the zero entries of α^k. Now, using the reasoning in the previous paragraph, we choose d_k = w_j^R and (α^k)^R = σ_j^R (v_j^R)^T, where E_k^R = Σ_i σ_i^R w_i^R (v_i^R)^T is the singular value decomposition of E_k^R and σ_j^R is its largest singular value. Thus, we have updated d_k and α^k, but we still need to update the other dictionary elements d_j. To do this, we simply form E_j and solve for d_j and α^j using the previous estimates of the other dictionary elements and the updated value of d_k. We repeat the procedure until all dictionary elements are updated. This algorithm is known as K-SVD [1].
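A minimal sketch of this update for a single atom, restricted to the patches that currently use it as in (33); this is my own NumPy rendering of the step described above (not the authors' code), with patches and atoms stored as columns as in the earlier sketches.

```python
import numpy as np

def ksvd_update_atom(U, D, alpha, k):
    """Update atom D[:, k] and row alpha[k, :] in place by a rank-one SVD of
    the restricted residual E_k^R; the sparsity pattern of alpha is preserved."""
    used = np.nonzero(alpha[k, :])[0]             # patches with alpha^k_i != 0
    if used.size == 0:
        return                                    # unused atom: nothing to update
    # E_k^R = U_R - sum_{i != k} d_i alpha^i_R  (add back atom k's contribution).
    E_kR = U[:, used] - D @ alpha[:, used] + np.outer(D[:, k], alpha[k, used])
    W, s, Vt = np.linalg.svd(E_kR, full_matrices=False)
    D[:, k] = W[:, 0]                             # d_k = w_1
    alpha[k, used] = s[0] * Vt[0, :]              # (alpha^k)^R = sigma_1 v_1^T
```

Sweeping k over all atoms, then re-running the sparse coding step and the u-update, and repeating, gives one full iteration of the scheme.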
References

[1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.

[2] M. Elad and M. Aharon. Image denoising via learned dictionaries and sparse representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.

[3] S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

[4] B. A. Olshausen and D. J. Field. Vision and the coding of natural images. American Scientist, 88:238, 2000.

[5] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Conference Record of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, pages 40–44. IEEE, 1993.