A Relative Trust-Region Algorithm for Independent Component Analysis

Heeyoul Choi, Seungjin Choi ∗

Department of Computer Science, Pohang University of Science and Technology, San 31 Hyoja-dong, Nam-gu, Pohang 790-784, Korea

∗ Corresponding author. Tel.: +82-54-279-2259; Fax: +82-54-279-2299. Email: [email protected] (H. Choi), [email protected] (S. Choi). URL: http://www.postech.ac.kr/~seungjin (S. Choi)

Preprint submitted to Neurocomputing, 15 March 2006

Abstract

In this paper we present a method of parameter optimization, relative trust-region learning, in which the trust-region method and the relative optimization (Zibulevsky, 2003) are jointly exploited. The relative trust-region method finds a direction and a step size with the help of a quadratic model of the objective function (as in conventional trust-region methods) and updates parameters in a multiplicative fashion (as in the relative optimization). We apply this relative trust-region learning method to the problem of independent component analysis (ICA), which leads to the relative TR-ICA algorithm; it turns out to possess the equivariant property (as in the relative gradient) and to achieve faster convergence than relative gradient and even Newton-type algorithms. Empirical comparisons with several existing ICA algorithms demonstrate the useful behavior of the relative TR-ICA algorithm, such as the equivariant property and fast convergence.

Key words: Blind source separation, Gradient-descent learning, Independent component analysis, Relative optimization, Trust-region methods.

1 Introduction

Independent component analysis (ICA) is a statistical method that decomposes multivariate data into a linear sum of non-orthogonal basis vectors with basis coefficients being statistically independent.


A variety of approaches to ICA have been developed. These include maximum likelihood estimation, mutual information minimization, output entropy maximization (infomax), and negentropy maximization (for example, see (Lee, 1998; Hyvärinen et al., 2001; Cichocki and Amari, 2002; Choi et al., 2005) and references therein). All these approaches lead to an identical objective function in ICA (Cardoso, 1997; Lee et al., 2000). A popular implementation in these approaches is gradient-descent learning (including the natural gradient). Although gradient-based algorithms are simple and guarantee local stability, they are relatively slow and require a careful choice of a learning rate, which is cumbersome in practical applications. In order to overcome these drawbacks, Newton-type algorithms were proposed (Akuzawa, 2000; Zibulevsky, 2003). Even though FastICA is a fixed-point algorithm, it can also be interpreted as a kind of Newton-type algorithm (Hyvärinen, 1999).

A trust-region method is an optimization technique where a direction and a step size are determined in a reliable manner with the help of a quadratic model of the objective function (Nocedal and Wright, 1999). Trust-region methods define a region around the current iterate within which they trust the model to be an adequate representation of the objective function, and then choose the step to be the approximate minimizer of the model in this trust region. In effect, they choose the direction and length of the step simultaneously. If a step is not acceptable for minimizing the objective function although it minimizes the model, they reduce the size of the region and find a new minimizer. They are, in general, faster than gradient-descent methods, and their learning rate (or step size) is determined automatically by the size of the trust region, whereas gradient-based methods require a careful choice of a learning rate. Their convergence rate is between linear and quadratic, and their stability is always guaranteed, in contrast to the Newton method. The stability of FastICA is investigated in (Hyvärinen, 1999).

The relative gradient (Cardoso and Laheld, 1996) or the natural gradient (Amari, 1998) was shown to be efficient in learning the parameters when the parameter space belongs to a Riemannian manifold. It was also shown that the relative gradient leads to algorithms having the equivariant property, which produces uniform performance regardless of the condition of the mixing matrix in the task of blind source separation (BSS) or independent component analysis (ICA). Especially when the mixing matrix is ill-conditioned, relative gradient ICA algorithms outperform other types of algorithms. The relative gradient is a particular instance of the relative optimization (Zibulevsky, 2003), where one determines the modified differential matrix corresponding to the search direction from the identity matrix via conventional optimization methods and updates the parameters in a multiplicative fashion. The relative Newton method (Zibulevsky, 2003) is another exemplary technique of the relative optimization method, where the modified differential matrix is learned through Newton-type updates.

In this paper we present a relative trust-region learning method (earlier work was presented in (Choi and Choi, 2005)), where we incorporate the relative optimization into the trust-region method. In the relative trust-region optimization, we determine a search direction and a step size with the help of a quadratic model, in a similar way to conventional trust-region methods. The parameters are updated in a multiplicative fashion as in the relative optimization. We apply the relative trust-region learning method to the problem of ICA, which leads to the relative TR-ICA algorithm. The relative TR-ICA algorithm inherits various useful properties, such as fast convergence, stability, and the equivariant property, from both conventional trust-region methods and the relative optimization. Moreover, we exploit a special structure of the Hessian matrix for memory efficiency in the relative TR-ICA algorithm, so that the algorithm is useful especially for high-dimensional data.

The rest of this paper is organized as follows. The next section briefly summarizes the ICA formulation and introduces the notations that will be used throughout this paper. Sec. 3 reviews the trust-region method to make this paper self-contained. The relative optimization, including the relative gradient, is explained in Sec. 4, following earlier work in (Cardoso and Laheld, 1996; Cardoso, 1998; Zibulevsky, 2003). The main contributions of this paper, i.e., the relative trust-region method and the relative TR-ICA algorithm, are presented in Sec. 5. Numerical experiments and results are shown in Sec. 6, with comparisons to several existing ICA algorithms. Finally, conclusions are drawn in Sec. 7.

2 Independent Component Analysis

Given N data points, the simplest form of ICA considers the noise-free linear generative model, where the observation data x(t) ∈ R^n is assumed to be generated by
\[
X = A S, \tag{1}
\]

where X = [x(1), . . . , x(N)], S = [s(1), . . . , s(N)], and A ∈ R^{n×n} contains n basis vectors a_i ∈ R^n, i = 1, . . . , n, in its columns. s(t) ∈ R^n is a latent variable vector whose elements s_i(t) are mutually independent. Maximum likelihood estimation leads to the objective function of the form

\[
f(W, X) = -\log\lvert\det W\rvert + \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{n} \psi_i(y_i(t)), \tag{2}
\]

where W = A^{-1}, y(t) = W x(t), and ψ_i(y_i(t)) = −log p_i(y_i(t)). The matrix W is referred to as a demixing matrix, and the estimate of the latent variable vector, y, is recovered up to a scaled and re-ordered version of the original hidden variable vector s.

Throughout this paper, we will often use a parameter vector w = vec(W^T) ∈ R^{n²}, where vec(·) is the vec-function which stacks the columns of the given matrix into one long vector. Conversely, W = mat^T(w). We also abuse notation such that
\[
f(W, X) = f(W) = f(w), \qquad f(I, Y) = f_r(W) = f_r(w), \tag{3}
\]
where I ∈ R^{n×n} is the identity matrix and the subscript r is used to emphasize that it involves the relative mode, which will be described in detail later. In the relative mode, the objective function involves I instead of W, and hence Y instead of X. The relative optimization is based on serial updating, where the parameters are learned in a multiplicative fashion. At every iteration, the relative optimization searches for the parameters, starting from the identity matrix I. Popular ICA algorithms are based on the gradient or the natural gradient method (Amari et al., 1996; Amari and Cichocki, 1998). Although gradient-based algorithms are simple and guarantee local stability, they are relatively slow and require a careful choice of a learning rate, which is cumbersome in practical applications.
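To make Eq. (2) concrete, the following minimal Python sketch evaluates f(W, X); the choice ψ_i(y) = log cosh(y) is only an illustrative density model and is not prescribed by the paper.

```python
import numpy as np

def ica_objective(W, X, psi=lambda y: np.log(np.cosh(y))):
    """Quasi maximum likelihood ICA objective of Eq. (2).

    W : (n, n) demixing matrix, X : (n, N) data matrix.
    psi is an assumed model psi_i(y) = -log p_i(y); log cosh is used here
    purely for illustration.
    """
    n, N = X.shape
    Y = W @ X                            # y(t) = W x(t) for all t
    sign, logdet = np.linalg.slogdet(W)  # numerically robust log|det W|
    return -logdet + psi(Y).sum() / N
```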

3 Trust-Region Methods

In this section we overview a trust-region method (see (Nocedal and Wright, 1999) for more details), in order to help readers understand how parameters are estimated through trust-region optimization.

Let us consider an objective function f(w): R^{n²} → R to be minimized with respect to the parameter vector w ∈ R^{n²}. A quadratic model function m^{(k)}, which has elliptical contours, is based on the objective function and its derivative information at w^{(k)}. The search direction p ∈ R^{n²} is determined by solving the following subproblem:

\[
\arg\min_{\|p\| \le \triangle^{(k)}} \; m^{(k)}(p) = f^{(k)} + \left[ \nabla f^{(k)} \right]^T p + \frac{1}{2} p^T B^{(k)} p, \tag{4}
\]

where △^{(k)} > 0 is the trust-region radius, ‖·‖ is the Euclidean norm, and
\[
f^{(k)} = f\!\left( w^{(k)} \right), \qquad \nabla f^{(k)} = \left. \frac{\partial f}{\partial w} \right|_{w = w^{(k)}}. \tag{5}
\]

Here, B^{(k)} ∈ R^{n²×n²} is some symmetric matrix. In our work, we used the true Hessian matrix as B^{(k)}. The solution p_*^{(k)} of Eq. (4) is the minimizer of m^{(k)} in the ball of radius △^{(k)}. Regarding the computation of p_*^{(k)}, several approximate solutions exist; in our relative trust-region optimization, we employ the dogleg method (see (Nocedal and Wright, 1999) for more details). A direct application of the trust-region method to ICA can be found in (Choi et al., 2004), where the algorithm was referred to as TR-ICA. An important issue in a trust-region method is the strategy for choosing the trust-region radius △^{(k)} at each iteration. The choice of △^{(k)} is based on the agreement between the model function m^{(k)} and the objective function f at the previous iteration. Given a direction p^{(k)}, this agreement measure ρ^{(k)} is defined as the ratio of the actual reduction to the predicted reduction, i.e.,

\[
\rho^{(k)} = \frac{ f\!\left( w^{(k)} \right) - f\!\left( w^{(k)} + p^{(k)} \right) }{ m^{(k)}(0) - m^{(k)}\!\left( p^{(k)} \right) }. \tag{6}
\]

The convergence rate of the trust-region method is between linear and quadratic, hence it is, in general, faster than gradient methods. It also guarantees stability regardless of initial conditions, whereas the Newton method requires an extra technique for stability. In particular, the trust-region method is much better than the Newton method when the quadratic model does not approximate the objective function well.
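As a concrete illustration of the procedure described above, the sketch below implements a generic trust-region iteration with a dogleg step and the usual 1/4 and 3/4 radius-update thresholds; the callables f, grad, and hess are placeholders, and the acceptance rule is a simplified variant of the one used later in Table 1.

```python
import numpy as np

def dogleg_step(g, B, radius):
    """Approximate minimizer of the quadratic model inside the trust region.

    Assumes B is positive definite; in practice B may need regularization.
    """
    pB = -np.linalg.solve(B, g)                 # full (Newton) step
    if np.linalg.norm(pB) <= radius:
        return pB
    pU = -(g @ g) / (g @ B @ g) * g             # steepest-descent (Cauchy) step
    if np.linalg.norm(pU) >= radius:
        return radius * pU / np.linalg.norm(pU)
    # Move along the dogleg path from pU towards pB until the boundary is hit.
    d = pB - pU
    a, b, c = d @ d, 2 * pU @ d, pU @ pU - radius ** 2
    tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return pU + tau * d

def trust_region_minimize(f, grad, hess, w, radius=1.0, max_radius=10.0, iters=100):
    for _ in range(iters):
        g, B = grad(w), hess(w)
        p = dogleg_step(g, B, radius)
        pred = -(g @ p + 0.5 * p @ B @ p)       # predicted reduction of the model
        rho = (f(w) - f(w + p)) / pred          # agreement measure of Eq. (6)
        if rho < 0.25:
            radius *= 0.25
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), radius):
            radius = min(2 * radius, max_radius)
        if rho > 0:                             # accept only steps that reduce f
            w = w + p
    return w
```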

4 Relative Optimization

This section describes the relative optimization method, which is an extension of the relative gradient (Zibulevsky, 2003). The method is illustrated in the framework of learning in a Lie group, following some results in (Cardoso, 1998). We also revisit the relative gradient in the context of the relative optimization.

4.1 Learning in a Group

The serial updating, where the parameters are learned in a multiplicative fashion, keeps the parameters in a group structure (especially a Lie group). The relative gradient is based on the idea of learning in a Lie group, which was first investigated by Cardoso (Cardoso, 1998). See also (Fiori, 2005) for a recent review of learning in the orthogonal group. The general linear group of degree n, denoted by GL(n), is an instance of a Lie group. In the case of ICA, the parameter matrix W belongs to the general linear group GL(n), and learning a demixing matrix W can be carried out by a linear transformation of parameters, which leads to the following learning process
\[
W^{(k+1)} = E^{(k)} W^{(k)}, \tag{7}
\]

where E (k) is a linear transformation of parameters W (k) . It follows from Eq. (7) that the updating rule consists of series of multiplications of matrices where the multiplication is the binary operator of the group GL(n). This implies that W (k) and E (k) at every iteration are the elements of GL(n) which form a Lie group. The linear transformation E (k) is computed such that an objective function (for instance, Eq. (2) in the case of ICA) is minimized on the Lie group. Moreover, a Lie group is a differentiable manifold obeying the group properties. Therefore, the multiplicative learning rule of W (k) in Eq. (7) reflects a manifold. In fact, the natural gradient (which is identical to the relative gradient in ICA) was developed in the framework of learning in Riemannian manifold (Amari, 1998).

4.2 Equivariant Property and Serial Updating

An important property induced by an equivariant estimator is uniform performance (Cardoso, 1998; Cardoso and Laheld, 1996), implying that the performance of an estimator does not depend on the mixing matrix A in ICA. This equivariant property is a consequence of the 'serial updating'. Without loss of generality, a family of adaptive ICA algorithms employs the updating rule of the form
\[
W^{(k+1)} = \left[ I - \eta^{(k)} G\!\left( Y^{(k)}, W^{(k)} \right) \right] W^{(k)}, \tag{8}
\]

where Y^{(k)} = W^{(k)} X, G(Y^{(k)}, W^{(k)}) is a matrix-valued function, and η^{(k)} > 0 is a learning rate. A special case where the function G(·) does not rely on W^{(k)} leads to an updating rule of the form
\[
W^{(k+1)} = \left[ I - \eta^{(k)} G\!\left( Y^{(k)} \right) \right] W^{(k)}. \tag{9}
\]

The current parameter matrix W^{(k)} is transformed by a 'plugging' matrix I − η^{(k)} G(Y^{(k)}) in order to produce an updated parameter matrix W^{(k+1)}. This serial updating is carried out in a multiplicative fashion. Note that the updating rule depends only on the current output. In this sense, this serial updating is consistent with equivariance.



If we denote the 'plugging' matrix I − η^{(k)} G(Y^{(k)}) by E^{(k)}, then the parameter matrix W^{(k+1)} can be decomposed into
\[
W^{(k+1)} = E^{(k)} W^{(k)} = E^{(k)} E^{(k-1)} \cdots E^{(1)} E^{(0)} W^{(0)}, \tag{10}
\]

where W^{(0)} is an initial value of W. It follows from Eq. (10) that the serial updating rule for W^{(k)} reflects a manifold, as mentioned in the previous section. The relative gradient (Cardoso and Laheld, 1996) is an exemplary instance of the serial updating where the function G(·) is determined by a search direction E that minimizes
\[
f(W + EW) = f(W) + \mathrm{tr}\!\left( \nabla_r f^T(W)\, E \right) + o(E), \tag{11}
\]

where ∇r f (W ) denotes the relative gradient of f at W and tr(·) denotes the trace operator.

4.3 Relative Optimization

The relative gradient involves the plugging matrix containing the first-order information (in the sense that the gradient is used). The relative optimization (Zibulevsky, 2003) is a generalization of the relative gradient, where the plugging matrix is determined by well-developed optimization methods (for example, the Newton method) in the framework of the serial updating. The relative gradient algorithm can be easily derived in the framework of the relative optimization. The gradient in the relative mode, ∇f_r(W), is defined by



\[
\nabla f_r(W, X) = \left. \frac{\partial f(W, X)}{\partial W} \right|_{W = I,\; X = Y}, \tag{12}
\]
where Y = W X.

The increment for V is given by ΔV = −η ∇f_r(W, X), leading to V^{(k+1)} = I − η^{(k)} ∇f_r(W^{(k)}, X), where V^{(k)} is set to I. Taking the updating rule for W, W^{(k+1)} = V^{(k+1)} W^{(k)}, into account leads to
\[
W^{(k+1)} = \left\{ I - \eta^{(k)} \nabla f_r\!\left( W^{(k)}, X \right) \right\} W^{(k)}. \tag{13}
\]

One can easily see that Eq. (13) is identical to the original relative gradient ICA algorithm (Cardoso and Laheld, 1996), where the updating rule is given by
\[
W^{(k+1)} = \left\{ I - \eta^{(k)} \nabla_r f\!\left( W^{(k)}, X \right) \right\} W^{(k)}. \tag{14}
\]

It can be easily verified that ∇r f (W , X) = ∇fr (W , X) in the case of ICA where the objective function Eq. (2) is considered.
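As an illustration of Eqs. (12)–(13), the sketch below runs relative gradient updates for the objective of Eq. (2); the score function ψ' = tanh (matching a log cosh model) and the fixed learning rate are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def relative_gradient_ica(X, eta=0.1, iters=200, psi_prime=np.tanh):
    """Relative (natural) gradient ICA update of Eq. (13).

    X : (n, N) observed data; psi_prime is an assumed score function and
    eta is an assumed small learning rate.
    """
    n, N = X.shape
    W = np.eye(n)
    for _ in range(iters):
        Y = W @ X
        # Matrix form of the relative-mode gradient of Eq. (12):
        #   nabla f_r = -I + (1/N) sum_t psi'(y(t)) y(t)^T
        G = -np.eye(n) + (psi_prime(Y) @ Y.T) / N
        W = (np.eye(n) - eta * G) @ W      # serial (multiplicative) update
    return W
```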

5 Relative Trust-Region Optimization

This section contains the main contribution of this paper, where we explain the relative trust-region optimization, followed by the illustration of the relative TR-ICA algorithm.

5.1 Gradient and Hessian

The objective function that we are concerned with is given in Eq. (2) and has the form

\[
f(w) = f(W, X) = -\log\lvert\det W\rvert + \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{n} \psi_i(y_i(t)). \tag{15}
\]

Here we use the parameter vector w = vec(W^T) ∈ R^{n²} in illustrating the relative optimization method and the relative TR-ICA algorithm. The row vectors and column vectors of W are denoted by \vec{w}_i and w_i, respectively, for i = 1, . . . , n. We calculate the gradient and Hessian of Eq. (15) with respect to w, instead of W. The gradient of Eq. (15), ∇f(w) ∈ R^{n²}, is given by

\[
\nabla f(w) = \left[ \frac{\partial f}{\partial \vec{w}_1} \cdots \frac{\partial f}{\partial \vec{w}_n} \right]^T = \mathrm{vec}\!\left( -W^{-1} + \frac{1}{N}\sum_{t=1}^{N} x(t)\, [\psi'(y(t))]^T \right), \tag{16}
\]

where ψ'(y(t)) is the n-dimensional vector whose ith element is ψ'_i(y_i(t)) = dψ_i(y_i(t))/dy_i(t).
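A quick finite-difference check of Eq. (16) helps guard against sign and ordering mistakes in the w = vec(W^T) convention; the sketch below assumes the illustrative ψ = log cosh model (so ψ' = tanh).

```python
import numpy as np

def ica_gradient(W, X, psi1=np.tanh):
    """Gradient of Eq. (15)/(16) w.r.t. w = vec(W^T), in the w-ordering w[i*n+j] = W[i, j]."""
    N = X.shape[1]
    Gmat = -np.linalg.inv(W).T + (psi1(W @ X) @ X.T) / N   # d f / d W_{ij}
    return Gmat.reshape(-1)

def _f(W, X):   # objective of Eq. (15) with the illustrative log cosh model
    return -np.linalg.slogdet(W)[1] + np.log(np.cosh(W @ X)).sum() / X.shape[1]

rng = np.random.default_rng(0)
n, N = 3, 500
X = rng.standard_normal((n, N))
W = np.eye(n) + 0.1 * rng.standard_normal((n, n))
eps = 1e-6
g_fd = np.array([(_f((W.reshape(-1) + eps * e).reshape(n, n), X) - _f(W, X)) / eps
                 for e in np.eye(n * n)])
assert np.allclose(ica_gradient(W, X), g_fd, atol=1e-4)   # ordering and signs agree
```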

The Hessian matrix of the objective function Eq. (15), ∇²f(w) ∈ R^{n²×n²}, is computed by
\[
\nabla^2 f(w) = \left[ \frac{\partial}{\partial w}\, \frac{\partial f}{\partial w} \right]^T = H + D, \tag{17}
\]

where D ∈ R^{n²×n²} is a block-diagonal matrix, consisting of D_l ∈ R^{n×n}, l = 1, . . . , n, of the form
\[
D_l = \frac{1}{N}\sum_{t=1}^{N} \psi_l''(y_l(t))\, x(t)\, x^T(t), \tag{18}
\]
and H ∈ R^{n²×n²} consists of n² row vectors \vec{h}_m, given by

\[
\vec{h}_m = \mathrm{vec}^T\!\left( a_i \vec{a}_j \right), \qquad m = (i-1)n + j, \tag{19}
\]

for i = 1, . . . , n and j = 1, . . . , n. Here a_j and \vec{a}_i denote the jth column vector and the ith row vector of A = W^{-1}, respectively.

The matrix D in Eq. (17) corresponds to the Hessian matrix of \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{n}\psi_i(y_i(t)), the second term in Eq. (15); this is derived in a straightforward manner. The matrix H in Eq. (17) is the Hessian matrix of −log|det W|. Its detailed derivation is explained below.

We define
\[
G_w = \frac{\partial}{\partial W}\left[ -\log\lvert\det W\rvert \right] = -W^{-T}. \tag{20}
\]
The Hessian matrix H is derived from the relation
\[
g_w = \mathrm{vec}\!\left( dG_w^T \right) = H\, dw, \tag{21}
\]
where g_w ∈ R^{n²}. Note that
\[
dG_w = -\left( dW^{-1} \right)^T = W^{-T}\, dW^T\, W^{-T} = A^T\, dW^T\, A^T. \tag{22}
\]
Using the identity tr(AB) = [vec(A^T)]^T vec(B), the (i, j)-element of dG_w is computed by
\[
[dG_w]_{ij} = a_i^T\, dW^T\, \vec{a}_j^T = \mathrm{tr}\!\left( \vec{a}_j^T a_i^T\, dW^T \right) = \left[ \mathrm{vec}\!\left( a_i \vec{a}_j \right) \right]^T \mathrm{vec}\!\left( dW^T \right) = \mathrm{vec}^T\!\left( a_i \vec{a}_j \right) dw. \tag{23}
\]
It follows from Eq. (21) and Eq. (23) that H is the matrix consisting of the n² row vectors \vec{h}_m given in Eq. (19).

5.2 Relative TR-ICA Algorithm

The relative trust-region optimization jointly exploits the conventional trust-region method and the relative optimization. The main motivation of the relative trust-region optimization comes from the relative Newton ICA (Zibulevsky, 2003). The relative trust-region method consists of two modes: (1) the relative mode and (2) the absolute mode. The term 'absolute mode' is used to emphasize that in the algorithm the actual reduction is calculated with the real objective function, as in the conventional trust-region method, while the step p^{(k)} and the predicted reduction are calculated in the relative mode. In the relative mode, we consider a modified quadratic model m_r^{(k)}, where the subscript r represents the relative mode, and find a direction p^{(k)} which solves the following subproblem modified from Eq. (4):
\[
\arg\min_{\|p\| \le \triangle^{(k)}} \; m_r^{(k)}(p) = f_r^{(k)} + \left[ \nabla f_r^{(k)} \right]^T p + \frac{1}{2} p^T \nabla^2 f_r^{(k)} p, \tag{24}
\]
where f_r(w), ∇f_r(w), and ∇²f_r(w) are the relative counterparts of f(w), ∇f(w), and ∇²f(w), obtained by replacing W = I and X = Y.

For the case of ICA where the objective function Eq. (15) is considered, the objective function, gradient, and Hessian in the relative mode are given by
\[
f_r(w) = \frac{1}{N}\sum_{t=1}^{N}\sum_{i=1}^{n} \psi_i(y_i(t)), \tag{25}
\]
\[
\nabla f_r(w) = \mathrm{vec}\!\left( -I + \frac{1}{N}\sum_{t=1}^{N} y(t)\, [\psi'(y(t))]^T \right), \tag{26}
\]
\[
\nabla^2 f_r(w) = \tilde{H} + \tilde{D}, \tag{27}
\]
where D̃ ∈ R^{n²×n²} is a block-diagonal matrix, consisting of D̃_l ∈ R^{n×n}, l = 1, . . . , n, of the form
\[
\tilde{D}_l = \frac{1}{N}\sum_{t=1}^{N} \psi_l''(y_l(t))\, y(t)\, y^T(t), \tag{28}
\]

and H̃ ∈ R^{n²×n²} consists of n² row vectors \vec{\tilde h}_m, given by
\[
\vec{\tilde h}_m = \mathrm{vec}^T\!\left( e_i e_j^T \right), \qquad m = (i-1)n + j, \tag{29}
\]
for i = 1, . . . , n and j = 1, . . . , n, where e_i ∈ R^n is the unit vector whose ith element is 1 and whose remaining elements are zero. The basic idea in modifying the quadratic model comes from the fact that the relative optimization considers f(I, Y) instead of f(W, X). The predicted reduction is also calculated with equations in the relative mode, considering the modified quadratic model m_r^{(k)}. The actual reduction, however, should be calculated with the original objective function f(w) instead of f_r(w). Therefore, in the relative trust-region method, the agreement measure ρ^{(k)} in Eq. (6) is modified, which leads to ρ_r^{(k)} of the form

\[
\rho_r^{(k)} = \frac{ f\!\left( W^{(k)} \right) - f\!\left( W^{(k+1)} \right) }{ m_r^{(k)}(0) - m_r^{(k)}\!\left( p^{(k)} \right) }, \tag{30}
\]



where W^{(k+1)} = (mat^T(p^{(k)}) + I) W^{(k)}. That is, in the relative trust-region algorithm, the direction p^{(k)} and the predicted reduction m_r^{(k)}(0) − m_r^{(k)}(p^{(k)}) are computed with equations in the relative mode, and the actual reduction f(w^{(k)}) − f(w^{(k+1)}) is computed with equations in the absolute mode. The trust-region method then determines the size of the region, and finally the parameters W are serially updated (as in the relative optimization). The relative trust-region method is summarized in Table 1.

Table 1
Relative TR-ICA Algorithm.

Given △̂ > 0, △^{(0)} ∈ (0, △̂), and ζ ∈ [0, 1/4):
for k = 0, 1, 2, . . .
    Y^{(k)} = W^{(k)} X;
    Obtain p^{(k)} by solving Eq. (24);
    Evaluate the agreement measure ρ_r^{(k)} in Eq. (30);
    if ρ_r^{(k)} < 1/4, then △^{(k+1)} = (1/4)‖p^{(k)}‖
    else
        if ρ_r^{(k)} > 3/4 and ‖p^{(k)}‖ = △^{(k)}, then △^{(k+1)} = min(2△^{(k)}, △̂)
        else △^{(k+1)} = △^{(k)};
    if ρ_r^{(k)} > ζ, then W^{(k+1)} = (mat^T(p^{(k)}) + I) W^{(k)}
    else W^{(k+1)} = W^{(k)}
end (for)
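The sketch below is a minimal rendering of the loop in Table 1 using the relative-mode quantities of Eqs. (25)–(28); it forms the n²×n² Hessian explicitly for clarity (the memory-saving computation of Sec. 5.3 is illustrated separately), reuses the dogleg_step helper from the Sec. 3 sketch, and again assumes the illustrative ψ = log cosh model.

```python
import numpy as np

def relative_tr_ica(X, radius=1.0, max_radius=10.0, zeta=0.1, iters=100,
                    psi=lambda y: np.log(np.cosh(y)),
                    psi1=np.tanh,
                    psi2=lambda y: 1.0 - np.tanh(y) ** 2):
    """Relative TR-ICA loop of Table 1 (illustrative sketch, not the reference code)."""
    n, N = X.shape
    W = np.eye(n)
    # H-tilde of Eq. (29): a fixed 0/1 permutation-like matrix (0-based indices).
    Ht = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            Ht[i * n + j, j * n + i] = 1.0
    f_abs = lambda V: -np.linalg.slogdet(V)[1] + psi(V @ X).sum() / N   # Eq. (2)
    for _ in range(iters):
        Y = W @ X
        grad_r = (-np.eye(n) + (psi1(Y) @ Y.T) / N).reshape(-1)         # Eq. (26)
        Dt = np.zeros((n * n, n * n))                                   # Eq. (28)
        for l in range(n):
            Dt[l * n:(l + 1) * n, l * n:(l + 1) * n] = (psi2(Y[l]) * Y) @ Y.T / N
        B = Ht + Dt                                                     # Eq. (27)
        # B may be indefinite in practice; dogleg_step assumes it can be solved.
        p = dogleg_step(grad_r, B, radius)
        pred = -(grad_r @ p + 0.5 * p @ B @ p)      # predicted reduction (relative mode)
        W_new = (p.reshape(n, n) + np.eye(n)) @ W   # W^(k+1) = (mat^T(p) + I) W^(k)
        rho = (f_abs(W) - f_abs(W_new)) / pred      # Eq. (30), absolute mode
        if rho < 0.25:
            radius = 0.25 * np.linalg.norm(p)
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), radius):
            radius = min(2 * radius, max_radius)
        if rho > zeta:
            W = W_new
    return W
```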

5.3 A Trick for Memory-Efficiency

In the relative mode, the direction p^{(k)} is computed by solving the modified subproblem Eq. (24), which involves the calculation of the Hessian matrix ∇²f_r(w) in order to compute [∇²f_r(w)]^{-1} ∇f_r(w) and v^T ∇²f_r(w), where v is a gradient or another search direction. In the case of high-dimensional data, the Hessian matrix takes a huge amount of memory. As in the fast relative Newton method (Zibulevsky, 2003), we use the modified Newton direction, which is found at low computational complexity, in order to take care of [∇²f_r(w)]^{-1} ∇f_r(w). Now we show that the term v^T ∇²f_r(w) in Eq. (24) can be easily computed, due to the special structure of H̃ and D̃ in Eq. (27). Let us consider

\[
v^T \nabla^2 f_r(w) = v^T \tilde{H} + v^T \tilde{D}. \tag{31}
\]

The second term v^T D̃ can be efficiently computed by [v ⊙ d̃]^T, where d̃ ∈ R^{n²} is the vector that keeps only the diagonal elements of D̃, i.e., [d̃]_i = [D̃]_{ii}, i = 1, . . . , n², and ⊙ denotes the Hadamard product (element-wise multiplication). The first term v^T H̃ in Eq. (31) can also be easily computed, considering that H̃ is nothing but a permutation matrix in the relative mode. In other words, v^T H̃ requires just re-ordering the elements of v, i.e.,

\[
\left[ v^T \tilde{H} \right]_{(i-1)n + j} = [v]_{(j-1)n + i}, \qquad i, j = 1, 2, \ldots, n. \tag{32}
\]
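As an illustration of Eqs. (31)–(32), the following sketch forms the product v^T ∇²f_r(w) without ever constructing the n²×n² Hessian; following the text, only the diagonal elements of D̃ are used for the second term, and ψ'' corresponds to the illustrative log cosh model.

```python
import numpy as np

def hessian_vector_product_relative(v, Y, psi2=lambda y: 1.0 - np.tanh(y) ** 2):
    """Compute v^T * (relative-mode Hessian terms) without forming the n^2 x n^2 matrix.

    Per Eqs. (31)-(32): the D-tilde part uses only its diagonal d-tilde, and the
    H-tilde part is a pure re-indexing of v.  psi2 is an assumed psi''.
    """
    n, N = Y.shape
    # Diagonal of D-tilde in w-ordering: entry (l-1)n + j equals
    # (1/N) sum_t psi''_l(y_l(t)) y_j(t)^2.
    d_tilde = (psi2(Y) @ (Y ** 2).T / N).reshape(-1)
    vD = v * d_tilde                              # v (Hadamard) d-tilde, Eq. (31)
    # v^T H-tilde: [v^T H]_{(i-1)n+j} = [v]_{(j-1)n+i}, i.e. a transpose re-ordering
    vH = v.reshape(n, n).T.reshape(-1)
    return vH + vD
```

The H̃-product is thus just a transpose of the matricized v, which costs O(n²) and needs no storage beyond v itself.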

Therefore, the computation of the Hessian matrix is not required, which saves memory dramatically. The trust region, which our proposed algorithm uses, affects the learning direction and controls the step length. This gives a better direction and a proper step length at lower computational complexity, whereas the fast relative Newton method uses just the Newton direction and employs the backtracking line search method for the learning rate. Hence, the relative TR-ICA algorithm shows faster convergence, compared to the fast relative Newton ICA algorithm.

The computational cost of calculating the Hessian is the same as in the relative Newton method, requiring O(n²N). However, the total cost per iteration is much less than that of the relative Newton method because the step size is determined in different ways. The relative trust-region method determines the step size within the trust region, whereas the relative Newton method uses the backtracking line search method at each iteration (several iterations of the backtracking are required). The computational cost of maintaining the trust region is O(n²), which is negligible compared to O(n²N). That is, the computational difference is mainly due to the backtracking line search method, where each iteration requires the calculation of the function value, including a matrix inversion and a determinant. Notice that the function evaluation is much more complex than the other simple operations. See Table 2 for details. In Table 2, (*) and (**) become 32n² + n + 11 and 29n² + n − 5, respectively, when the step is determined on the trust-region boundary.

Table 2
Complexity comparison between the Newton and relative TR methods in ICA with n-dimensional data.

Operations                Newton        relative TR           Backtracking
Multiplication, (×)       20n² + n      28n² + n + 3 (*)      4n²
Addition, (+)             16n² − 3      23n² + n − 4 (**)     3n² − n
Square root, (·)^{1/2}    n + 2         n + 3                 0
max/min/abs               1/0/1         1/1/1                 0/0/0
Function evaluation       1             1                     1
Gradient (∇f)             1             0                     0
Gradient (∇f_r)           1             1                     0
Hessian                   1             1                     0

6 Numerical Experiments

We used four different data sets in our numerical experiments. The first data set contains the mixtures of two speech signals and one music signal, all sampled at 8 kHz. The second data set is the USPS data, which contains handwritten digits converted to 256-dimensional vectors (16×16 = 256) of length 4000. The third data set is DNA microarray data containing the expression of 4026 human genes in 95 samples of normal and malignant lymphocytes (Liebermeister, 2002), which is converted to the data matrix X ∈ R^{95×4026}. The last data set consists of linear instantaneous mixtures of n independent binary sources (either +1 or −1) with 3000 data points for each source, where n = 2, . . . , 7. The optimization methods in ICA that we compared our relative TR-ICA with include (1) the gradient; (2) the relative (or natural) gradient; (3) TR-ICA (with the dogleg implementation); (4) the fast relative Newton (Zibulevsky, 2003) (the Matlab code is available at http://ie.technion.ac.il/~mcib/relnwt021203.zip). In the gradient, relative gradient, and Newton-type methods, an optimal learning rate was determined by the backtracking line search method.

6.1 Comparison according to Mixing Matrix

For the 3-dimensional sound data, three mixture signals were generated using a mixing matrix A whose condition number is 11.12 (well-conditioned). We executed every optimization method 50 times with a randomly initialized demixing matrix W. Fig. 1 (a) shows the average convergence comparison of several numerical optimization methods in ICA. The relative TR-ICA algorithm outperforms the other methods in terms of CPU time. The fast relative Newton method took the same number of iterations as the relative trust-region method, but consumed more CPU time due to its higher per-iteration complexity. This complexity comes from the backtracking method that is used to find an optimal learning rate, as in the gradient methods. Even when a pre-selected constant learning rate is used in the gradient-based ICA algorithms, the relative TR-ICA algorithm has a similar time complexity per iteration.

Relative optimization (or the natural gradient) was shown to have the equivariant property (Cardoso and Laheld, 1996; Amari, 1998; Choi et al., 2002; Zibulevsky, 2003). In the second experiment, the sound signals were artificially mixed using an ill-conditioned mixing matrix A whose condition number is 525.44. Fig. 1 (b) shows the average convergence behavior of various ICA algorithms.


(Figure 1 appears here: objective function value versus iteration for the five methods; the legends report average CPU times of 2.114, 1.340, 1.000, 0.930, and 0.529 seconds in (a) and 2.964, 1.777, 2.022, 1.115, and 0.632 seconds in (b) for the gradient, relative gradient, trust-region, relative Newton, and relative trust-region methods, respectively.)

Fig. 1. Convergence comparison of several numerical optimization methods in the quasi maximum likelihood ICA for a set of sound data: (a) for well-conditioned mixing matrix (b) for ill-conditioned mixing matrix.

The relative algorithms (both the relative Newton and the relative trust-region) converged much faster than the gradient-based algorithms. Once again, the relative trust-region algorithm was the fastest among all algorithms that we tested.

6.2 Statistical Comparison

To compare our proposed method with other optimization methods, we used the 3-dimensional sound data again. We executed every optimization method, including FastICA, 50 times with a random mixing matrix A. Fig. 2 shows the statistical comparison of iteration numbers and CPU times until convergence.


Fig. 2. Statistical comparison of several optimization methods: (a) Comparison of iteration numbers (b) Comparison of CPU times.

In Fig. 2 (a), which compares the iteration numbers, the Newton-type methods, including FastICA, are slightly better than the others. Notice that the variances of the relative methods are smaller than those of the others. This shows the equivariant property of relative optimization, where the performance is consistent regardless of the mixing matrix. Comparing (a) with (b), we can see other interesting points. Even though the relative Newton method converges in a slightly smaller number of iterations than the relative trust-region method, the relative trust-region method is faster than the relative Newton method. As stated above, the Newton method requires a higher time complexity per iteration. Actually, FastICA, which is also a kind of Newton-type algorithm, is the best in terms of both iteration numbers and CPU time. In (a), however, the variance of FastICA is larger than that of the relative methods, which means it does not have the equivariance property. In addition, when we have insufficient data samples, FastICA can be unstable, in contrast to the other methods. In the next sections, we will see the weak points of FastICA.

6.3 High-Dimensional Data

This experiment was conducted with high-dimensional data sets: the 256-dimensional USPS handwritten digit data set and the 95-dimensional DNA microarray data set. For these data sets, the computation of the Hessian matrix at each iteration is very expensive, and the Hessian matrix is not positive definite at some iteration steps. Moreover, if the dimension is very high (more than 30), the conventional trust-region method causes a memory problem. However, the relative TR-ICA algorithm overcomes these limits by using the trick for memory efficiency. Fig. 3 confirms that the relative TR-ICA algorithm works well, even with high-dimensional data. Our proposed algorithm is faster than any other method on the high-dimensional data sets. In this figure, the iteration number of the relative TR-ICA algorithm is much less than that of the fast relative Newton method, as is the CPU time. This means that the objective functions for high-dimensional real data are difficult to model with a quadratic function, so the learning direction recommended by the trust-region method is better than that of the Newton method. In (a), one can observe that the objective value of the relative trust-region method decreased much faster than that of the relative Newton method at the first iteration. This highlights a characteristic of the relative trust-region method, namely that the trust-region method is much better than the Newton method when the quadratic model does not approximate the objective function well.

In the numerical experiment with the USPS data, we chose the portion associated with the digit '2' and reduced the dimension by PCA, which produced 100-dimensional data of length 379. In such a case, FastICA (Hyvärinen, 1999) had difficulty in convergence because the number of data points was not enough. The small number of data points does not guarantee that the objective function is a contraction map, which is a necessary condition for the fixed-point algorithm. That is, our proposed algorithm is the best optimization method when the dimension of the data set is very high and the number of data points is not enough to guarantee that FastICA is stable. In addition, we will discuss another weak point of FastICA in the next experiment.

(Figure 3 appears here: objective function value versus iteration. The legends report, for (a), 3415 iterations / 731.375 s for the relative gradient, 763 / 109.063 s for the relative trust-region, and 1147 / 269.407 s for the relative Newton method; for (b), 1874 / 3046.532 s, 528 / 598.250 s, and 559 / 1174.235 s, respectively.)

Fig. 3. Convergence comparison of several numerical optimization methods in the quasi maximum likelihood ICA: (a) for the digit '2' among the USPS data (dimension reduced to 100 by PCA); (b) for the DNA microarray data.

6.4 Comparison with FastICA

In this experiment, the equivariant property of the relative trust-region method is shown clearly in comparison with FastICA. The binary sources were artificially mixed using a Hilbert matrix A ∈ R^{n×n} given by
\[
A(i, j) = 1/(i + j), \tag{33}
\]

which is an ill-conditioned mixing matrix. As the dimension of the binary sources increases, the condition number of the Hilbert matrix A also increases (see Table 3).

Table 3
Condition numbers of the Hilbert matrix A corresponding to the dimension.

Dimension n    2          3          4          5          6          7
Cond(A)        3.84e+1    1.35e+3    4.58e+4    1.53e+6    5.10e+7    1.69e+9
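The condition numbers in Table 3 can be reproduced approximately with a few lines of NumPy; the indices in the snippet below are 1-based, as in Eq. (33).

```python
import numpy as np

for n in range(2, 8):
    i, j = np.meshgrid(np.arange(1, n + 1), np.arange(1, n + 1), indexing="ij")
    A = 1.0 / (i + j)                      # mixing matrix of Eq. (33)
    print(n, f"{np.linalg.cond(A):.2e}")   # grows rapidly with n (ill-conditioning)
```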

For each dimension, we executed both the relative trust-region method and FastICA 5 times; Fig. 4 shows the iteration numbers and the average CPU times of the two algorithms until convergence. When the dimension is greater than 7, neither converges. For the relative trust-region method, the iteration number and CPU time increase roughly linearly, reflecting the equivariant property, whereas for FastICA they rise exponentially. Thus, for ill-conditioned high-dimensional data, the relative trust-region method converges much faster than FastICA.


Fig. 4. Equivariant property of Relative TR-ICA compared with FastICA: (a) Iteration numbers until convergence (b) Average CPU time on 5 experiments

7 Discussions

We have introduced a relative trust-region method which jointly exploits the trust-region method and the relative optimization. In the relative trust-region method, a direction and a step size are searched with the help of a quadratic model (as in the trust-region method) and the parameters are serially updated (as in the relative optimization). The trust-region algorithm required far fewer iterations to converge than the gradient algorithms, and required less CPU time than the Newton method. In the relative optimization, the Riemannian structure of the parameter space is exploited. We have applied this relative trust-region method to the problem of ICA, which led to the relative TR-ICA algorithm. The relative TR-ICA algorithm enjoyed fast convergence compared to most existing ICA algorithms and showed the equivariant property like the natural or relative gradient. Moreover, exploiting a special structure of the Hessian matrix in the relative mode led to a memory-efficient relative TR-ICA algorithm which can handle high-dimensional data. Especially in the case of an ill-conditioned mixing matrix or high-dimensional data, the useful behavior and high performance of the relative TR-ICA algorithm were observed in terms of iteration numbers and CPU time.

FastICA (Hyvärinen, 1999) was also considered in comparison with the relative TR-ICA algorithm. In fact, FastICA seemed a little faster than our proposed algorithm. However, in the case of small data sets, FastICA did not work well, whereas our algorithm worked fine.

Moreover, for ill-conditioned high-dimensional data, the relative trust-region method converges much faster than FastICA. Although we focused on ICA, the relative trust-region method could be applied to other machine learning problems. It has been brought to the notice of the authors during the review process that trust-region methods on Riemannian manifolds (which share a similar spirit with our method) have been independently developed and applied to numerical linear algebra problems (Absil et al., 2004).

Acknowledgments

We appreciate anonymous reviewers’ and associate editor’s critical comments, which improved the clarity of our paper. This work was supported by ITEP Brain Neuroinformatics Program, KISTEP International Cooperative Research Program, Korea MIC under ITRC support program supervised by the IITA (IITA-2005-C1090-0501-0018), and Basic Research Fund (2004) in POSTECH. Heeyoul Choi was also supported by KOSEF Grant funded by Korea government (No.D00115).

References

Absil, P. A., Baker, C. G., Gallivan, K. A., 2004. Trust-region methods on Riemannian manifolds. Tech. Rep. FSU-CSIT-04-13, School of Computational Science, Florida State University.
Akuzawa, T., 2000. Extended quasi-Newton method for the ICA. In: Proc. ICA. Helsinki, Finland, pp. 521–525.
Amari, S., 1998. Natural gradient works efficiently in learning. Neural Computation 10 (2), 251–276.
Amari, S., Cichocki, A., Oct. 1998. Adaptive blind signal processing - neural network approaches. Proc. of the IEEE, Special Issue on Blind Identification and Estimation 86 (10), 2026–2048.
Amari, S., Cichocki, A., Yang, H. H., 1996. A new learning algorithm for blind signal separation. In: Touretzky, D. S., Mozer, M. C., Hasselmo, M. E. (Eds.), Advances in Neural Information Processing Systems. Vol. 8. MIT Press, pp. 757–763.
Cardoso, J. F., Apr. 1997. Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters 4 (4), 112–114.
Cardoso, J. F., 1998. Learning in manifolds: The case of source separation. In: Proc. SSAP. Portland, Oregon.
Cardoso, J. F., Laheld, B. H., Dec. 1996. Equivariant adaptive source separation. IEEE Trans. Signal Processing 44 (12), 3017–3030.
Choi, H., Choi, S., 2005. Relative trust-region learning for ICA. In: Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing. Philadelphia, PA.
Choi, H., Kim, S., Choi, S., 2004. Trust-region learning for ICA. In: Proc. Int'l Joint Conf. Neural Networks. Budapest, Hungary, pp. 41–46.
Choi, S., Cichocki, A., Amari, S., 2002. Equivariant nonstationary source separation. Neural Networks 15 (1), 121–130.
Choi, S., Cichocki, A., Park, H. M., Lee, S. Y., 2005. Blind source separation and independent component analysis: A review. Neural Information Processing - Letters and Review 6 (1), 1–57.
Cichocki, A., Amari, S., 2002. Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. John Wiley & Sons, Inc.
Fiori, S., 2005. Quasi-geodesic neural learning algorithms over the orthogonal group: A tutorial. Journal of Machine Learning Research 6, 743–781.
Hyvärinen, A., 1999. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Networks 10 (3), 626–634.
Hyvärinen, A., Karhunen, J., Oja, E., 2001. Independent Component Analysis. John Wiley & Sons, Inc.
Lee, T. W., 1998. Independent Component Analysis: Theory and Applications. Kluwer Academic Publishers.
Lee, T. W., Girolami, M., Bell, A., Sejnowski, T., 2000. A unifying information-theoretic framework for independent component analysis. International Journal on Mathematical and Computer Modeling 39 (11), 1–21.
Liebermeister, W., 2002. Linear modes of gene expression determined by independent component analysis. Bioinformatics 18 (1), 51–60.
Nocedal, J., Wright, S. J., 1999. Numerical Optimization. Springer.
Zibulevsky, M., 2003. Blind source separation with relative Newton method. In: Proc. ICA. Nara, Japan, pp. 897–902.
