EE 396: Lecture 5
Ganesh Sundaramoorthi
Feb. 22, 2011
In this lecture, we learn to solve numerically the Euler-Lagrange equation of our denoising model,

\[
\begin{cases}
u(x) - \alpha \Delta u(x) = I(x), & x \in \operatorname{int}(\Omega) \\[4pt]
\dfrac{\partial u}{\partial n}(x) = 0, & x \in \partial\Omega,
\end{cases}
\tag{1}
\]

since in the last lecture we established that this equation has a solution. Note that the solution is a global minimizer of our energy

\[
E(u) = -\log p(u|I) = \int_\Omega \bigl(I(x) - u(x)\bigr)^2\,dx + \alpha \int_\Omega |\nabla u(x)|^2\,dx, \tag{2}
\]

since the functional is convex.
0.1 Gradient Descent

0.1.1 Steepest Descent in R^n
A generic algorithm for minimizing a smooth function is gradient descent, or steepest descent. For a function f : R^n → R, the algorithm makes a guess for the minimizer, x_guess ∈ R^n, and then corrects the guess by moving slightly in the negative gradient direction (recall from calculus that the gradient direction is the direction in which f increases the fastest):

\[
x_{\text{next}} = x_{\text{guess}} - \Delta t\, \nabla f(x_{\text{guess}}),
\]

where ∆t > 0 is small. Since the negative gradient direction decreases the function in the steepest possible way, x_next should have a lower value than x_guess, provided ∆t is small enough. One then repeats the process starting from x_next and hopes that the procedure converges. That is,

\[
x_0 = x_{\text{guess}} \tag{3}
\]
\[
x_{k+1} = x_k - \Delta t\, \nabla f(x_k). \tag{4}
\]

If the algorithm converges, it may still become trapped in a local minimum when f is not convex (if f is convex, the algorithm reaches a global minimum). Therefore, the initial guess is quite important; ideally it should be chosen as close as possible to the global minimum, or as close as possible to the desired local minimum.
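As a concrete illustration (not part of the original notes), here is a minimal Python sketch of the iteration (3)-(4) applied to a simple quadratic function; the function, step size ∆t, and stopping rule are illustrative choices only.

```python
import numpy as np

def gradient_descent(grad_f, x_guess, dt=0.1, max_iters=1000, tol=1e-8):
    """Iterate x_{k+1} = x_k - dt * grad_f(x_k) until the step becomes tiny."""
    x = np.asarray(x_guess, dtype=float)
    for _ in range(max_iters):
        step = dt * grad_f(x)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Example: f(x) = ||x - b||^2 has gradient 2(x - b) and minimizer x = b.
b = np.array([1.0, -2.0])
x_min = gradient_descent(lambda x: 2.0 * (x - b), x_guess=np.zeros(2))
print(x_min)  # approximately [1, -2]
```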
0.1.2 Gradient of Functionals
We now generalize this gradient descent procedure to our denoising model. But what is meant by the gradient of such a functional? It turns out that this is analogous to the multivariable case. Recall that, from previous lectures, we have shown that the directional derivative of E is

\[
dE(u)\cdot v = \left.\frac{d}{dt} E(u + t v)\right|_{t=0} = \int_\Omega v(x)\,\bigl(u(x) - I(x) - \alpha\Delta u(x)\bigr)\,dx \tag{5}
\]

for v ∈ V, the space of permissible perturbations, and u ∈ U. From this expression of the directional derivative, we can choose the perturbation v of u so that the directional derivative in the direction v is as negative as possible. A sure way to ensure dE(u)·v < 0 is to choose

\[
v(x) = -\bigl(u(x) - I(x) - \alpha\Delta u(x)\bigr); \tag{6}
\]

in this case,

\[
dE(u)\cdot v = -\int_\Omega \bigl|u(x) - I(x) - \alpha\Delta u(x)\bigr|^2\,dx \le 0. \tag{7}
\]
But is this the steepest descent direction? To answer this question, we first note that V is an inner product space¹ equipped with the inner product

\[
\langle v_1, v_2\rangle_{L^2} = \int_\Omega v_1(x)\,v_2(x)\,dx \quad \text{for } v_1, v_2 \in V. \tag{8}
\]

In an inner product space, we automatically get for free the Cauchy-Schwarz inequality, |⟨v₁, v₂⟩| ≤ ‖v₁‖‖v₂‖, for the norm induced by the inner product, with equality if and only if v₁ = a v₂ for some a ∈ R. Applying this to (5), we find that

\[
| dE(u)\cdot v | \le \|v\|_{L^2}\, \|u - I - \alpha\Delta u\|_{L^2}, \tag{9}
\]

or

\[
\frac{| dE(u)\cdot v |}{\|v\|_{L^2}} \le \|u - I - \alpha\Delta u\|_{L^2}, \tag{10}
\]

where ‖·‖_{L²} is the norm induced by the L² inner product. Therefore, we see that the magnitude of the quotient above is maximized (equality holds in the expression above) when

\[
v = -(u - I - \alpha\Delta u); \tag{11}
\]

thus, this justifies calling

\[
v = -(u - I - \alpha\Delta u) = -\nabla E(u) \tag{12}
\]

the negative gradient direction. This motivates the formal definition:

Definition 1. Let E : U → R be a functional. Then the gradient of E at u ∈ U, denoted ∇E(u) ∈ V, is defined as the perturbation that satisfies

\[
dE(u)\cdot v = \int_\Omega v(x)\,\nabla E(u)(x)\,dx = \langle v, \nabla E(u)\rangle_{L^2} \quad \text{for all } v \in V. \tag{13}
\]
¹ Recall that ⟨·,·⟩ is a (real) inner product on a vector space V if the following properties hold:

1. Linearity: ⟨a₁v₁ + a₂v₂, v⟩ = a₁⟨v₁, v⟩ + a₂⟨v₂, v⟩ for all v₁, v₂, v ∈ V and a₁, a₂ ∈ R.
2. Positive-definiteness: ⟨v, v⟩ ≥ 0 for all v ∈ V, and ⟨v, v⟩ = 0 if and only if v = 0.
3. Symmetry: ⟨v₁, v₂⟩ = ⟨v₂, v₁⟩.

Note that a norm can be defined from the inner product: ‖v‖ = √⟨v, v⟩.
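As an illustration (not part of the original notes), the sketch below discretizes the energy (2) on a grid and checks that the perturbation (6) indeed gives a negative directional derivative, estimated here by a finite difference in t; the grid size, test arrays, and step τ are arbitrary choices.

```python
import numpy as np

def laplacian(u):
    """Five-point Laplacian with edge-replicated (Neumann-like) boundaries, grid spacing 1."""
    up = np.pad(u, 1, mode="edge")
    return up[2:, 1:-1] + up[:-2, 1:-1] + up[1:-1, 2:] + up[1:-1, :-2] - 4.0 * u

def energy(u, I, alpha):
    """Discretized E(u) = sum (I - u)^2 + alpha * sum |grad u|^2 (forward differences)."""
    gx = np.diff(u, axis=0, append=u[-1:, :])
    gy = np.diff(u, axis=1, append=u[:, -1:])
    return np.sum((I - u) ** 2) + alpha * np.sum(gx ** 2 + gy ** 2)

rng = np.random.default_rng(0)
I = rng.random((64, 64))      # stand-in for the noisy image
u = rng.random((64, 64))      # arbitrary current estimate
alpha, tau = 1.0, 1e-6

v = -(u - I - alpha * laplacian(u))   # proposed descent direction, as in (6)
dE = (energy(u + tau * v, I, alpha) - energy(u, I, alpha)) / tau
print(dE)  # negative: moving along v decreases the discretized energy
```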
0.1.3 Gradient Descent PDE
Now the steepest descent algorithm for our functional is

\[
u^0 = u_{\text{guess}} \tag{14}
\]
\[
u^{k+1} = u^k - \Delta t\, \nabla E(u^k), \tag{15}
\]

where k denotes the iteration number. In continuous time, gradient descent is the PDE

\[
u(0,\cdot) = u_{\text{guess}}, \qquad u_t(t,\cdot) = -\nabla E(u(t,\cdot)),
\]

and (15) is a discretization of it, where we have used the forward Euler difference approximation

\[
u_t(t,\cdot) = \frac{u(t+\Delta t,\cdot) - u(t,\cdot)}{\Delta t} + O(\Delta t). \tag{16}
\]
For our denoising model, this becomes

\[
u_t = I - u + \alpha\Delta u, \tag{17}
\]

where α > 0.² We assume that the function u is sampled on a regular grid with spacing ∆x in the x direction and ∆y in the y direction. We now discretize the PDE.³ Using central differences,⁴ we find that

\[
u(t, x+\Delta x, y) = u(t,x,y) + \frac{\partial u}{\partial x}(t,x,y)\,\Delta x + \frac{1}{2}\frac{\partial^2 u}{\partial x^2}(t,x,y)\,(\Delta x)^2 + O((\Delta x)^3) \tag{18}
\]
\[
u(t, x-\Delta x, y) = u(t,x,y) - \frac{\partial u}{\partial x}(t,x,y)\,\Delta x + \frac{1}{2}\frac{\partial^2 u}{\partial x^2}(t,x,y)\,(\Delta x)^2 + O((\Delta x)^3), \tag{19}
\]

where O(∆x) denotes a quantity bounded by a constant times ∆x (and similarly for higher powers). Adding the above two equations and solving for the second derivative (and arguing analogously in y), we find that

\[
\frac{\partial^2 u}{\partial x^2}(t,x,y) = \frac{u(t,x+\Delta x,y) - 2u(t,x,y) + u(t,x-\Delta x,y)}{(\Delta x)^2} + O(\Delta x) \tag{20}
\]
\[
\frac{\partial^2 u}{\partial y^2}(t,x,y) = \frac{u(t,x,y+\Delta y) - 2u(t,x,y) + u(t,x,y-\Delta y)}{(\Delta y)^2} + O(\Delta y). \tag{21}
\]

Setting u^k(x,y) = u(k∆t, x, y), we find that our PDE can be written

\[
\begin{aligned}
u^{k+1}(x,y) = u^k(x,y) + \Delta t\,\Bigl[\, & I(x,y) - u^k(x,y) + \alpha\,\frac{u^k(x+\Delta x,y) - 2u^k(x,y) + u^k(x-\Delta x,y)}{(\Delta x)^2} \\
& + \alpha\,\frac{u^k(x,y+\Delta y) - 2u^k(x,y) + u^k(x,y-\Delta y)}{(\Delta y)^2} \,\Bigr] + O(\Delta x\,\Delta t) + O(\Delta y\,\Delta t); \tag{22}
\end{aligned}
\]

thus, we have discretized our PDE. The question remains how to choose ∆t so that the numerical scheme is stable and converges.

² The equation u_t = α∆u is called the heat equation; for Ω = R^n, one can show that u(t,·) is simply a Gaussian of standard deviation proportional to √t convolved with the initial guess u_0. Therefore, we can understand that our denoising model combines smoothing with closeness to the original image I.

³ There is a subtle art in choosing the discretization scheme for a PDE; ultimately, the appropriate numerical scheme requires a full understanding of the behavior of the PDE.

⁴ Generally, central differences are preferred over forward or backward differences because they induce the least error. However, for certain PDEs, choosing central differences can lead to an unstable scheme. Again, the differencing scheme depends on the behavior of the PDE.
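Before turning to stability, here is a minimal sketch of a single update of the explicit scheme (22), assuming ∆x = ∆y = 1 and using edge-replicated values to approximate the Neumann boundary condition in (1); the boundary handling and array layout are implementation choices, not prescribed by the notes.

```python
import numpy as np

def denoise_step(u, I, alpha, dt):
    """One explicit update of (22) with dx = dy = 1.

    Edge-replicated values approximate the Neumann condition du/dn = 0.
    """
    up = np.pad(u, 1, mode="edge")
    lap = up[2:, 1:-1] + up[:-2, 1:-1] + up[1:-1, 2:] + up[1:-1, :-2] - 4.0 * u
    return u + dt * (I - u + alpha * lap)
```

Iterating denoise_step starting from an initial guess realizes the gradient descent (14)-(15); the admissible values of ∆t are derived next.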
0.1.4 Von Neumann Analysis
We apply Von Neumann analysis to study the stability of the discretization of the PDE; this type of analysis typically applies to linear PDE. We compute the Fourier transform in the spatial variables (x, y) of our discretization scheme:

\[
U^{k+1}(\omega_1,\omega_2) = U^k(\omega_1,\omega_2)\left[\,1 - \Delta t + \alpha\Delta t\left(\frac{e^{i\Delta x\,\omega_1} - 2 + e^{-i\Delta x\,\omega_1}}{(\Delta x)^2} + \frac{e^{i\Delta y\,\omega_2} - 2 + e^{-i\Delta y\,\omega_2}}{(\Delta y)^2}\right)\right] + \Delta t\, I(\omega_1,\omega_2), \tag{23}
\]

where we have noted that the Fourier transform of u^k(x + ∆x, y) is U^k(ω₁, ω₂)e^{i∆x ω₁}, U^k denotes the Fourier transform of u^k, I(ω₁, ω₂) denotes the Fourier transform of the image, and i is the imaginary unit. We define H(ω₁, ω₂) to be the expression in the brackets above; since e^{iθ} − 2 + e^{−iθ} = −4 sin²(θ/2),

\[
H(\omega_1,\omega_2) = 1 - \Delta t - 4\alpha\Delta t\left(\frac{\sin^2(\Delta x\,\omega_1/2)}{(\Delta x)^2} + \frac{\sin^2(\Delta y\,\omega_2/2)}{(\Delta y)^2}\right). \tag{24}
\]

Therefore, (23) becomes

\[
U^{k+1}(\omega_1,\omega_2) = H(\omega_1,\omega_2)\,U^k(\omega_1,\omega_2) + \Delta t\, I(\omega_1,\omega_2). \tag{25}
\]
Writing the above as an explicit formula, we have that

\[
U^{k}(\omega_1,\omega_2) = \bigl(H(\omega_1,\omega_2)\bigr)^{k}\, U^0(\omega_1,\omega_2) + \Delta t\, I(\omega_1,\omega_2) \sum_{i=0}^{k-1} \bigl(H(\omega_1,\omega_2)\bigr)^{i}, \tag{26}
\]

where (·)^i denotes the i-th power. We see that for U^k to remain bounded (and the geometric sum to converge) as k → ∞, we must have

\[
|H(\omega_1,\omega_2)| < 1 \quad \text{for all } \omega_1, \omega_2; \tag{27}
\]

that is,

\[
1 - \Delta t - 4\alpha\Delta t\left(\frac{\sin^2(\Delta x\,\omega_1/2)}{(\Delta x)^2} + \frac{\sin^2(\Delta y\,\omega_2/2)}{(\Delta y)^2}\right) < 1 \tag{28}
\]

and

\[
1 - \Delta t - 4\alpha\Delta t\left(\frac{\sin^2(\Delta x\,\omega_1/2)}{(\Delta x)^2} + \frac{\sin^2(\Delta y\,\omega_2/2)}{(\Delta y)^2}\right) > -1. \tag{29}
\]
The first yields that ∆t > 0 (noting that α > 0), and the second yields that

\[
\Delta t < \frac{2}{1 + 4\alpha\left(\dfrac{\sin^2(\Delta x\,\omega_1/2)}{(\Delta x)^2} + \dfrac{\sin^2(\Delta y\,\omega_2/2)}{(\Delta y)^2}\right)} \quad \text{for all } \omega_1, \omega_2, \tag{30}
\]

which means that ∆t must be less than the minimum of the right hand side over ω₁, ω₂, or

\[
\Delta t < \frac{2}{1 + 4\alpha\left(\dfrac{1}{(\Delta x)^2} + \dfrac{1}{(\Delta y)^2}\right)}. \tag{31}
\]
In the case that ∆x = ∆y = 1, we have that

\[
\Delta t < \frac{2}{1 + 8\alpha}. \tag{32}
\]
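As a quick sanity check (an addition, not part of the original analysis), one can evaluate the amplification factor (24) on a grid of frequencies and verify that |H| stays below 1 when ∆t respects the bound (31)-(32) and exceeds 1 otherwise; the frequency sampling below is an arbitrary choice.

```python
import numpy as np

def max_amplification(dt, alpha, dx=1.0, dy=1.0, n=512):
    """Maximum of |H(w1, w2)| from (24) over a grid of frequencies."""
    w1 = np.linspace(-np.pi / dx, np.pi / dx, n)
    w2 = np.linspace(-np.pi / dy, np.pi / dy, n)
    W1, W2 = np.meshgrid(w1, w2)
    s = np.sin(dx * W1 / 2) ** 2 / dx**2 + np.sin(dy * W2 / 2) ** 2 / dy**2
    return np.max(np.abs(1.0 - dt - 4.0 * alpha * dt * s))

alpha = 5.0
dt_max = 2.0 / (1.0 + 8.0 * alpha)              # bound (32) for dx = dy = 1
print(max_amplification(0.95 * dt_max, alpha))  # < 1: stable
print(max_amplification(1.10 * dt_max, alpha))  # > 1: unstable
```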
0.1.5 Choosing α
Recall that the original energy from our model is

\[
E(u) = \frac{1}{2\sigma_n^2}\int_\Omega \bigl(I(x) - u(x)\bigr)^2\,dx + \frac{1}{2\sigma_p^2}\int_\Omega |\nabla u(x)|^2\,dx, \tag{33}
\]

where σ_n is the standard deviation of the noise and σ_p is the standard deviation of the derivatives of the true, noise-free image. We scaled the energy by 2σ_n² (which does not affect the minimizer u). Then α was defined as

\[
\alpha = \frac{\sigma_n^2}{\sigma_p^2} = \frac{1}{\mathrm{SNR}}, \tag{34}
\]

where SNR can be thought of as a signal-to-noise ratio. We see that α should be chosen larger as the noise increases. In practice, we do not know these parameters, but you may experiment with estimating them from the noisy image. Notice that as α becomes larger (as the image becomes noisier), ∆t becomes smaller and thus the algorithm becomes slower.
0.1.6 Algorithm and Denoising Results
A key point is the initialization. Theoretically, we have shown that the energy is convex, so it has only global minima, and thus the initialization should be irrelevant since the scheme can only converge to a global minimum. In practice, however, the initialization is important. For these experiments we choose u^0 = I. Also important is the convergence criterion, which we take to be

\[
\max_{x\in\Omega} \bigl|u^k(x) - I(x) - \alpha\Delta u^k(x)\bigr| < \varepsilon, \tag{35}
\]

where ε > 0 is our allowed error. Fig. 1 shows an original image that is then corrupted by iid Gaussian noise. Fig. 2 shows the results of the algorithm constructed above.
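Putting the pieces together, the following sketch runs the full scheme: gradient descent updates (22) initialized with u^0 = I, a step size chosen safely below the bound (32), and the stopping criterion (35). The synthetic test image, α, and ε are illustrative choices only.

```python
import numpy as np

def laplacian(u):
    """Five-point Laplacian, dx = dy = 1, edge replication for the Neumann condition."""
    up = np.pad(u, 1, mode="edge")
    return up[2:, 1:-1] + up[:-2, 1:-1] + up[1:-1, 2:] + up[1:-1, :-2] - 4.0 * u

def denoise(I, alpha, eps=1e-3, max_iters=100000):
    """Gradient descent on the denoising energy, stopped by criterion (35)."""
    dt = 0.9 * 2.0 / (1.0 + 8.0 * alpha)  # stay safely below the bound (32)
    u = I.copy()                          # initialization u^0 = I
    for _ in range(max_iters):
        residual = I - u + alpha * laplacian(u)   # -(u - I - alpha * lap u)
        if np.max(np.abs(residual)) < eps:        # stopping criterion (35)
            break
        u = u + dt * residual                     # update (22)
    return u

# Synthetic test: a piecewise-constant image corrupted by iid Gaussian noise.
rng = np.random.default_rng(0)
clean = np.zeros((128, 128)); clean[32:96, 32:96] = 1.0
noisy = clean + rng.normal(0.0, np.sqrt(0.05), clean.shape)
denoised = denoise(noisy, alpha=5.0)
print(np.mean((denoised - clean) ** 2), np.mean((noisy - clean) ** 2))
```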
Figure 1: Left: an image of the UCLA campus; right: the image corrupted by synthetically generated noise (iid Gaussian, mean zero, with σ² = 0.05).
Figure 2: Left to right, top to bottom: denoising results for the noisy image in Fig. 1 with α = 1, 5, 10, 17, 25, 50, respectively.