Information Theoretical Approach to Identification of Hybrid Systems

Li Pu, Jinchun Hu, and Badong Chen
Tsinghua University, Beijing, China
{pl06,chenbd04}@mails.tsinghua.edu.cn, [email protected]

Abstract. In this paper, we present a noisy version of the algebraic geometric approach (AGA) for identifying the parameters of discrete-time linear hybrid systems. The noisy Switched Auto-Regressive system with eXogenous inputs (SARX) is transformed into a higher-dimensional space, together with a solution for recovering the parameters of the individual ARX models. However, the dynamics of the transformed nonlinear system are usually complex due to the simultaneous presence of noise and switching between modes. Two approximate ways of estimating the hybrid parameters are considered: one uses the MSE criterion, while the other is based on an information divergence that measures the distance between the probability density function of the identified model error and the desired distribution. A stochastic information divergence gradient algorithm is derived for the identification problem, which shows satisfactory performance for several classes of noise.

1 Introduction

Recently, a class of discrete-time linear hybrid systems known as SARX has been widely studied from various perspectives. An SARX system consists of several classical discrete ARX models and a switching mechanism that determines which ARX model takes effect in each period. We assume the ARX models have possibly different orders, but we formulate them with the same orders, since superfluous parameters can be identified as zeros [1]. The SARX system is described as

x(k) = \sum_{j=1}^{n_a} a_j^{\lambda(k)} x(k-j) + \sum_{i=1}^{n_c} c_i^{\lambda(k)} u(k-i),
y(k) = x(k) + m(k)    (1)

where the switching mechanism is formulated as the mode function λ(k) : Z → {1, 2, ..., n}, which assigns each sample to one of the ARX models (n is the number of ARX models in the SARX system). a_j^{λ(k)} and c_i^{λ(k)} are the parameters of the active ARX model, and y(k) and u(k) are the output and input of the system, respectively. We assume all the ARX models in this paper are minimal, which means the numerator and denominator of the transfer functions are coprime polynomials.¹ m(k) is the i.i.d. measurement noise with probability density function (PDF) f_m(x).

¹ An abbreviated version of this paper will appear in HSCC 2008.

Practical methods for the design, control, and identification of hybrid systems have been developed from various perspectives. We focus on the influence of observation or measurement noise on the parameter identification problem of SARX systems, and present a modified algebraic geometric approach with an information theoretic learning algorithm. Many identification approaches for hybrid systems take the switching mechanism into account in order to obtain better parameter estimates in specific cases, but in this paper the identification uses only the sampled data points S = {y(k), u(k)}, (k = 1, 2, ..., N), just as the original algebraic geometric approach does [2, 3, 1].

The major difficulty in identifying hybrid systems is that the mode function λ(k) is unknown, which makes it hard to apply traditional parameter identification techniques. Once λ(k) is assigned a proper mode sequence, the task of estimating the parameters of each ARX model is comparatively simple, but unfortunately the two tasks are inherently coupled. Most existing methods rely on specific assumptions to simplify the identification procedure, such as the clustering-based approach [4], the Bayesian approach [5], mixed-integer programming [6], and the greedy approach [7]; see [8] and [9] for detailed comparisons between these methods. The method presented in this paper is a noisy version of the algebraic geometric approach, an ingenious method that can handle the identification of SARX systems under any switching mechanism, including PWARX systems and hybrid automata. It can be implemented in a recursive manner, and unknown or overestimated model orders have little influence on the parameter estimation results [1, 8], which is very convenient in real-life applications. However, the algebraic geometric approach does not provide ideal results in the presence of observation or measurement noise [8]. We will show that the poor results are due to the correlated component and non-gaussian distribution of the equivalent noise in the identification model. To reduce the influence of noise, we study the analytic form of the noise in the algebraic geometric approach, as well as the analytic form for recovering the individual ARX parameters from the hybrid parameter vector. We also introduce information theoretic learning techniques to deal with non-gaussian and non-zero-mean noise. The algebraic geometric approach uses the least mean square (LMS) algorithm, which yields optimal parameter estimates when the noise is zero-mean gaussian, but the mean square error (MSE) criterion underlying LMS is no longer optimal in the non-gaussian situation. Recently, to capture higher-order statistics of errors (or noises) more accurately, several information theoretic criteria have been studied, including error-entropy, mutual information, and various information distances [10-12], all of which aim to reduce the average uncertainty of the learning model or to maximize the similarity between the learning model and the real system given a set of training data. Experimental results show that these approaches can achieve better results than the MSE criterion for certain


classes of linear and nonlinear systems. In this paper, we present a stochastic information divergence gradient (SIDG) identification algorithm based on the Kullback-Leibler divergence (KL-divergence). The remainder of the paper is organized as follows. The derivation of the noisy SARX equations is given in Section 2. Some preliminaries of information theory are introduced in Section 3. In Section 4, we develop the stochastic information divergence gradient algorithm for identifying the hybrid parameter vector of an SARX system. Numerical experiments and the practical problems involved are discussed in Section 5. Conclusions and future work are given in Section 6.

2 Noisy Hybrid Polynomial

In this section we study the hybrid decoupling polynomial developed in the algebraic geometric approach in the presence of noise. First we recall the SARX system defined in (1), where the input and output sampling data set S = {y(k), u(k)}, (k = 1, 2, ..., N) is known, the mode function λ(k) is unknown but deterministic, and the individual ARX model parameters a_j^{λ(k)} and c_i^{λ(k)} are what we intend to estimate. If we rearrange the sampled data {y(k), u(k)} (for all k ≥ max(n_a, n_c)) and the ARX model parameters as

x = [u(k-n_c), \ldots, u(k-1), y(k-n_a), \ldots, y(k-1), -y(k)]^T \in R^K
m = [m(k-n_a), \ldots, m(k-1), m(k)]^T \in R^{n_a+1}
b_i = [c_{n_c}^i, \ldots, c_1^i, a_{n_a}^i, \ldots, a_1^i, 1]^T \in R^K
a_i = [-a_{n_a}^i, \ldots, -a_1^i, 1]^T \in R^{n_a+1}    (2)

where K = n_a + n_c + 1, then (1) can be rewritten as

b_{\lambda(k)}^T x + a_{\lambda(k)}^T m = 0    (3)

Let w_i = a_i^T m; in particular, w_{λ(k)} is the colored noise entering (3), and its PDF f_w(x) can be determined from f_m(x) at every instant k because the mode function λ(k) is deterministic. Analogously to the hybrid decoupling polynomial in the noiseless algebraic geometric approach, we obtain the noisy hybrid polynomial (NHP) below, which is satisfied at any instant k and for any mode λ(k) under any switching mechanism, since the factor corresponding to λ(k) vanishes by (3):

\prod_{i=1}^{n} (b_i^T x + w_i) = 0    (4)

For any instant k, noticing that b_{λ(k)}^T x = -w_{λ(k)}, we can expand the NHP as

\prod_{i=1}^{n} (b_i^T x) + \Big( \prod_{j \neq \lambda(k)} b_j^T x \Big) w_{\lambda(k)} = 0    (5)


The first component of (5) is in fact a homogeneous polynomial of degree n in K variables [3],

p_n(x) = \prod_{i=1}^{n} (b_i^T x) = \sum h_{n_1,\ldots,n_K} x_1^{n_1} \cdots x_K^{n_K} = h^T v_n(x)    (6)

where v_n : R^K → R^{M_n(K)} is the Veronese map of degree n, whose exponents satisfy 0 ≤ n_j ≤ n for j = 1, ..., K and \sum_{j=1}^{K} n_j = n. h ∈ R^{M_n(K)} is the hybrid parameter vector that represents the hybrid system, where M_n(K) = \binom{n+K-1}{K-1} (see [3] for details).

Theorem 1 (Noisy Hybrid Polynomial Equation). For any instant k, writing the regressor as x = [x_1, \ldots, x_K]^T, the following equation holds,

p_n(x) + \sum_{d=1}^{n} \frac{w^d}{d!} \frac{\partial^d p_n(x)}{\partial x_K^d} = h^T \sum_{d=0}^{n} \frac{w^d}{d!} \frac{\partial^d v_n(x)}{\partial x_K^d} = 0    (7)

Proof. Define the polynomials of degree d in K variables

P_d = \sum_{i_1, i_2, \ldots, i_d} (b_{i_1}^T x)(b_{i_2}^T x) \cdots (b_{i_d}^T x)    (8)

where i_p ∈ {1, 2, ..., n} and i_p ≠ i_q, and another class of polynomials

Q_d = \sum_{j_1, j_2, \ldots, j_d} (b_{j_1}^T x)(b_{j_2}^T x) \cdots (b_{j_d}^T x)    (9)

where j_p ∈ {1, 2, ..., n}, j_p ≠ λ(k), and j_p ≠ j_q. In particular, let Q_0 = 1. The last element of each b_i is 1, so the d-th partial derivative of h^T v_n(x) with respect to x_K is

D_d = \frac{\partial^d}{\partial x_K^d} \Big( \prod_{j=1}^{n} b_j^T x \Big) = (d!) \, P_{n-d}    (10)

Noticing that b_{λ(k)}^T x = -w_{λ(k)}, we obtain the recursion

\frac{1}{d!} D_d + w_{\lambda(k)} Q_{n-d-1} = Q_{n-d}    (11)

From (11) we get, for example, \frac{1}{(n-2)!} D_{n-2} + \frac{1}{(n-1)!} D_{n-1} w_{\lambda(k)} + w_{\lambda(k)}^2 = Q_2. Closing the recursion down to d = 1 and noting that D_n = n!, we have

\sum_{d=1}^{n} \frac{w_{\lambda(k)}^{d-1}}{d!} D_d = Q_{n-1}    (12)

Noticing that \prod_{j \neq \lambda(k)} b_j^T x = Q_{n-1}, equation (7) follows from (5) and (12). □
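Equation (7) can also be read as the exact Taylor expansion of p_n(x) along its last coordinate evaluated at the shift w, since b_{λ(k)}^T(x + w e_K) = b_{λ(k)}^T x + w = 0. The following quick numerical check of (7) is our own illustration (random b_i, mode fixed to i = 1, and a polynomial fit standing in for the x_K derivatives):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 3, 4
B = rng.normal(size=(n, K))
B[:, -1] = 1.0                                  # each b_i has last element 1
x = rng.normal(size=K)
w = -(B[0] @ x)                                 # mode lambda(k) = 1, so w = -b_1^T x

def p_n(v):
    """Homogeneous hybrid polynomial p_n(v) = prod_i b_i^T v, eq. (6)."""
    return np.prod(B @ v)

# p_n is a degree-n polynomial in x_K, so shifting x_K by t and fitting the
# coefficients recovers c_d = (1/d!) d^d p_n / d x_K^d exactly.
ts = np.arange(n + 1)
vals = [p_n(x + t * np.eye(K)[-1]) for t in ts]
coeffs = np.linalg.solve(np.vander(ts, n + 1, increasing=True), vals)
lhs = sum(coeffs[d] * w ** d for d in range(n + 1))  # = p_n(x) + sum_d (w^d/d!) d^d p_n/dx_K^d
print(lhs)                                           # ~ 0, as claimed by (7)
```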

Theorem 2 (Noisy Hybrid Parameter Decomposition). For any instant k, the parameters of the active ARX model b_{λ(k)} can be obtained from the following relationship,

b_{\lambda(k)} \propto \Big( h^T \sum_{d=0}^{n-1} \frac{w^{d-n+1}}{d!} \frac{\partial^{d+1} v_n(x)}{\partial x \, \partial x_K^d} \Big)^T    (13)

Proof. Define the polynomials of degree d in K variables

Q_{(d,i)} = \sum_{j_1, j_2, \ldots, j_d} (b_{j_1}^T x)(b_{j_2}^T x) \cdots (b_{j_d}^T x)    (14)

where j_p ∈ {1, 2, ..., n}, j_p ≠ i, and j_p ≠ j_q. In particular, let Q_{(0,i)} = 1. We have the partial derivative

h^T \frac{1}{d!} \frac{\partial^{d+1} v_n(x)}{\partial x \, \partial x_K^d} = c_d \, b_{\lambda(k)} + \sum_{i \neq \lambda(k)} Q_{(n-d-1,i)} \, b_i    (15)

where c_d is the coefficient of b_{λ(k)}. Noticing that b_{λ(k)}^T x = -w, we obtain the following equation for the Q_{(d,i)},

\sum_{d=0}^{n-1} \frac{Q_{(n-d-1,i)}}{d! \, w^{n-d-1}} = 0    (16)

Substituting (16) into (15) proves the theorem. Since the last element of b_{λ(k)} is 1, we can easily recover b_{λ(k)} from (13). □

Remark 1. The noise value w must be known if we want to recover b_{λ(k)} from (13); it can be obtained by solving equation (7) analytically or, in the case of n > 4, numerically. The proper value of w may be selected among the n roots of (7) by utilizing prior knowledge, e.g. choosing the real root nearest to the mean of w. If the true value of h is used, this method yields accurate estimates of b_{λ(k)}. So the key problem of this approach is how to obtain an estimator of h from the nonlinear system (7), given enough samples and the PDF of the measurement noise.

Now recall the scheme of the algebraic geometric approach shown in figure 1. Let s(k) = h^T \sum_{d=1}^{n} \frac{w^d}{d!} D_d(x), with D_d(x) = \partial^d v_n(x)/\partial x_K^d, denote the equivalent noise. Apparently the equivalent noise is not independent of the input, since x is involved in s(k), nor is it independent across different instants. If λ(k) = C for all k, i.e. there is no switching between modes, it is often practical to assume that the input x is stationary, as in many real-life control systems, and so is the error. But in the presence of switching between modes it is obvious that, even if the measurement noise is an i.i.d. zero-mean gaussian random variable, s(k) is non-gaussian and non-stationary. Because of the non-gaussian, non-stationary errors and the correlation between errors and inputs, the LMS algorithm necessarily deteriorates as the measurement noise increases.

[Figure 1: block diagram of the LMS scheme. The system output d = h^T v_n(x) + s(k) is compared with the model output z = ĥ^T v_n(x); the error d - z drives the LMS update of ĥ.]

Fig. 1. Scheme of algebraic geometric approach with LMS algorithm

The nonlinear equation (7) is difficult to handle when seeking the best estimator of h, so we have to resort to approximate methods. A straightforward approach is to ignore the higher-order components of w(k) (first-order approximation, FOA), since the measurement noise is often small. If w(k) is zero-mean, from (7) we obtain

h^T v_n(x) + h^T D_1(x) \, w(k) \approx 0    (17)

Thus the nonlinear system model is h^T v_n(x) / (h^T D_1(x)), and the scheme of FOA is shown in figure 2. It is similar to the normalizing approach in [2].

[Figure 2: block diagram of the first-order approximation scheme. The desired signal h^T v_n(x)/(h^T D_1(x)) plus w(k) is compared with the model output z = ĥ^T v_n(x)/(ĥ^T D_1(x)) to form the error.]

Fig. 2. Scheme of first-order approximation algorithm

If the measurement noise is zero-mean gaussian, w is zero-mean gaussian too, but with different variances in different modes. In this case it is practical to adopt the LMS algorithm; the recursive identifier of h is

\hat{h}(k+1) = \hat{h}(k) - \eta \frac{2\alpha(k)}{(\beta(k))^3} [\beta(k) v_n(x(k)) - \alpha(k) D_1(x(k))]    (18)

where η is the step size adjusted by the user, α(k) = ĥ(k)^T v_n(x(k)), and β(k) = ĥ(k)^T D_1(x(k)). Because any parameter vector equal to c · h, ∀c ∈ R, also satisfies model (17) and the PDF of w(k), the estimated hybrid parameter vector ĥ must be normalized according to the fact that the last element of h is 1, that is, ĥ = ĥ/(P_1^T ĥ), where P_1 = [0 ··· 0 1]^T ∈ R^{M_n(K)}.

By dropping the higher-order partial derivatives in (13), we obtain the method used in [3] for estimating the parameters of the individual ARX models,

\hat{b}_{\lambda(k)} \approx \frac{(\hat{h}(k)^T E(x(k)))^T}{\hat{h}(k)^T E(x(k)) P_2}    (19)

where E(x(k)) = ∂v_n(x(k))/∂x(k) and P_2 = [0 ··· 0 1]^T ∈ R^K. For each instant, an estimator of the ARX model parameters can be calculated as above, but it does not exactly equal the actual value due to noise. Some well-known clustering techniques such as k-means can then be used in the parameter vector space R^K, or in R^{K-1} (noticing that the last element of b_i is always 1), where the number of clusters is the number of ARX models n. However, if the measurement noise is not zero-mean gaussian, the LMS identifier will usually fail to provide ideal results. In this case, we resort to the information divergence based identification algorithm for better results.
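To make the FOA recipe concrete, the sketch below implements a Veronese embedding v_n(x), its derivative D_1(x) with respect to the last regressor entry, the normalized recursive update (18), and the per-sample recovery (19). It is a minimal illustration under our own conventions: the helper names and the monomial ordering are ours (the ordering only has to be used consistently for h, v_n and its derivatives), and the Jacobian E(x) is approximated by finite differences.

```python
import itertools
import numpy as np

def exponents(n, K):
    """Multi-indices (n_1, ..., n_K) with n_1 + ... + n_K = n, ordered so that
    the pure x_K^n monomial comes last (its coefficient in h equals 1)."""
    return [e for e in itertools.product(range(n, -1, -1), repeat=K) if sum(e) == n]

def veronese(x, n):
    """Veronese map v_n(x): all degree-n monomials of x, length M_n(K)."""
    x = np.asarray(x, dtype=float)
    return np.array([np.prod(x ** np.array(e)) for e in exponents(n, len(x))])

def veronese_dxK(x, n):
    """D_1(x) = d v_n(x) / d x_K, the derivative w.r.t. the last entry of x."""
    x = np.asarray(x, dtype=float)
    out = []
    for e in exponents(n, len(x)):
        e = np.array(e)
        if e[-1] == 0:
            out.append(0.0)
        else:
            e2 = e.copy()
            e2[-1] -= 1
            out.append(e[-1] * np.prod(x ** e2))
    return np.array(out)

def foa_lms_step(h_hat, x, n, eta=1e-3):
    """One normalized recursive update of the hybrid parameter vector, eq. (18)."""
    vn, d1 = veronese(x, n), veronese_dxK(x, n)
    alpha, beta = h_hat @ vn, h_hat @ d1
    h_hat = h_hat - eta * (2.0 * alpha / beta ** 3) * (beta * vn - alpha * d1)
    return h_hat / h_hat[-1]             # keep the last element of h equal to 1

def recover_b(h_hat, x, n):
    """Per-sample ARX parameter estimate, eq. (19), via the Jacobian E(x)."""
    K = len(x)
    E = np.empty((len(h_hat), K))
    eps = 1e-6                           # numerical Jacobian of v_n (illustration only)
    for j in range(K):
        dx = np.zeros(K)
        dx[j] = eps
        E[:, j] = (veronese(x + dx, n) - veronese(x - dx, n)) / (2 * eps)
    g = h_hat @ E                        # row vector h^T E(x)
    return g / g[-1]                     # normalize so the last element is 1
```

In use, `foa_lms_step` would be called once per regressor x(k), and `recover_b` applied to each sample before clustering the resulting parameter vectors with k-means as described above.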

3 Information Divergence

The Shannon entropy H(X) and other definitions of entropy in information theory, such as Renyi's entropy or, in a more general form, the (h, ϕ)-entropy, reveal the average uncertainty of the random variable X. The entropy does not consider the mean of the error signal, so one property of entropy is its invariance to shifts of X [13], that is, H(X + c) = H(X), where c is a constant. This property may be useful in some special cases. But if we use the Shannon entropy of the error (or noise) as the criterion to minimize, it may yield wrong identified parameters that produce an error PDF with the same shape as the true distribution but shifted by a constant, especially in nonlinear cases. So we resort to a more explicit approach: directly comparing the error PDF with the true distribution. In information theory there are many distances or pseudo-distances defined on the space of probability density functions, among which the KL-divergence is a widely used one. It is defined as [14]

KL(f \| g) = \int f(x) \log \frac{f(x)}{g(x)} dx    (20)

where f(x) and g(x) are two PDFs. KL(f‖g) ≥ 0 with equality if and only if f(x) = g(x). But the KL-divergence is a pseudo-distance, for it is not symmetric and does not satisfy the triangle inequality. Nevertheless, it is often useful to think of the KL-divergence as a distance between distributions. Based on the KL-divergence, a symmetric information divergence is given by

D(f, g) = KL(f \| g) + KL(g \| f)    (21)

In this paper, we will use the symmetric information divergence to compare the sampled error PDF with the true distribution. If p and q are both gaussian, p(x) = (2\pi\sigma_1^2)^{-1/2} \exp\{-(x-\mu_1)^2 / (2\sigma_1^2)\} and q(x) = (2\pi\sigma_2^2)^{-1/2} \exp\{-(x-\mu_2)^2 / (2\sigma_2^2)\}, it is easy to calculate the information divergence between p and q,

D(p, q) = \frac{(\mu_1 - \mu_2)^2 (\sigma_1^2 + \sigma_2^2) + (\sigma_1^2 - \sigma_2^2)^2}{2 \sigma_1^2 \sigma_2^2}    (22)
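As a quick sanity check on (22), the short sketch below (our own, not from the paper) compares the closed form with a brute-force numerical integration of KL(p‖q) + KL(q‖p) on a fine grid.

```python
import numpy as np

def gauss(x, mu, sig):
    return np.exp(-(x - mu) ** 2 / (2 * sig ** 2)) / np.sqrt(2 * np.pi * sig ** 2)

def sym_div_closed(mu1, sig1, mu2, sig2):
    """Symmetric divergence between two gaussians, eq. (22)."""
    v1, v2 = sig1 ** 2, sig2 ** 2
    return ((mu1 - mu2) ** 2 * (v1 + v2) + (v1 - v2) ** 2) / (2 * v1 * v2)

def sym_div_numeric(mu1, sig1, mu2, sig2):
    """KL(p||q) + KL(q||p) by trapezoidal integration on a wide grid."""
    x = np.linspace(-20, 20, 200001)
    p, q = gauss(x, mu1, sig1), gauss(x, mu2, sig2)
    return np.trapz(p * np.log(p / q) + q * np.log(q / p), x)

print(sym_div_closed(0.5, 1.0, -0.2, 1.5))   # closed form
print(sym_div_numeric(0.5, 1.0, -0.2, 1.5))  # agrees to several digits
```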

However, in practice the distribution is not always gaussian, so we estimate the PDF from samples that can be obtained on-line or off-line. Here we adopt the Parzen window method [15] to approximate the PDF of the error and the true distribution in a nonparametric manner. In this paper a gaussian kernel is adopted in the Parzen window; it results in a differentiable density estimate, which brings a lot of convenience in calculation. The 1-dimensional Parzen window density estimator is defined by

\hat{f}(x) = \frac{1}{|S_f| \sigma_f} \sum_{x_k \in S_f} K\Big(\frac{x - x_k}{\sigma_f}\Big)    (23)

where S_f denotes the sample set drawn from f(x), the kernel function is K(x) = (2\pi)^{-1/2} e^{-x^2/2}, and σ_f denotes the kernel width. Thus, we obtain the following estimator of the information divergence D(f, g),

\hat{D}(f, g) = D(\hat{f}, \hat{g}) = KL(\hat{f} \| \hat{g}) + KL(\hat{g} \| \hat{f})    (24)

It is easy to prove that the estimated information divergence D̂(f, g) ≥ 0 with equality if and only if f̂(x) = ĝ(x). Based on the asymptotic properties of the Parzen window with gaussian kernels, by assuming D̂(f, g) = 0 we can also conclude that f(x) and g(x) become arbitrarily close for any x as min{|S_f|, |S_g|} → ∞. Thus, the estimated information divergence can be used as an approximate measure of the distance between the actual PDFs.
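A minimal sketch of this nonparametric estimator is given below. The helper names and the grid-based evaluation of the integrals are our own choices; the paper only specifies the gaussian kernel, eq. (23) and eq. (24).

```python
import numpy as np

def parzen_pdf(x, samples, sigma):
    """1-D Parzen window estimate with a gaussian kernel, eq. (23)."""
    samples = np.asarray(samples, dtype=float)
    u = (np.atleast_1d(x)[:, None] - samples[None, :]) / sigma
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return k.mean(axis=1) / sigma

def estimated_divergence(s_f, s_g, sigma_f, sigma_g, grid=None):
    """D_hat(f, g) = KL(f_hat||g_hat) + KL(g_hat||f_hat), eq. (24),
    evaluated by trapezoidal integration over a grid covering both sample sets."""
    if grid is None:
        lo = min(np.min(s_f), np.min(s_g)) - 5 * max(sigma_f, sigma_g)
        hi = max(np.max(s_f), np.max(s_g)) + 5 * max(sigma_f, sigma_g)
        grid = np.linspace(lo, hi, 4001)
    f = parzen_pdf(grid, s_f, sigma_f) + 1e-300   # avoid log(0)
    g = parzen_pdf(grid, s_g, sigma_g) + 1e-300
    return np.trapz(f * np.log(f / g) + g * np.log(g / f), grid)

# Example: hypothetical error samples vs. a hypothetical zero-mean reference set
rng = np.random.default_rng(0)
err = rng.normal(0.2, 0.4, size=200)
ref = rng.normal(0.0, 0.3, size=200)
print(estimated_divergence(err, ref, sigma_f=0.1, sigma_g=0.1))
```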

4 Stochastic Information Divergence Gradient Algorithm

In figures 1 and 2, the desired output of the system is d, which equals the sum of the actual model output and the noise, and z denotes the estimated model output without noise. The algorithm in fact minimizes some cost function of e = d - z by adjusting the estimated parameter vector ĥ. If we minimize the information divergence between the PDF of e and a reference PDF (an approximation of the true PDF), rather than the mean square error, we obtain an information divergence based identification scheme. In other words, this algorithm is designed to produce the desired error distribution, which can be estimated from


some prior knowledge or preliminary identification results. In practice, the adjustment of the parameters is carried out at every instant based on the instantaneous value of the information divergence, which leads to the following derivation of the stochastic information divergence gradient (SIDG) algorithm. The information divergence between e and the desired error e^d can be written as

D(p_e, p_e^d) = KL(p_e \| p_e^d) + KL(p_e^d \| p_e) = E_{p_e}\Big[\log \frac{p_e(e)}{p_e^d(e)}\Big] + E_{p_e^d}\Big[\log \frac{p_e^d(e^d)}{p_e(e^d)}\Big]    (25)

By dropping the expectation operators and substituting the estimated probability density functions into (25), we obtain the instantaneous value of the information divergence D(p_e, p_e^d) at instant k as

\hat{D}(p_e, p_e^d, k) = \log \frac{\hat{p}_e(e(k))}{\hat{p}_e^d(e(k))} + \log \frac{\hat{p}_e^d(e^d(k))}{\hat{p}_e(e^d(k))}    (26)

Denote by S_e(k) and S_{e^d} the sample sequences of e and e^d, respectively. Let S_e(k) = {e(k-1), e(k-2), ..., e(k-L_1)} and S_{e^d} = {ē^d(1), ē^d(2), ..., ē^d(L_2)}, where L_1 and L_2 are the lengths of the Parzen windows of e and e^d, respectively. Applying the gaussian kernel, the estimated PDFs p̂_e(e) and p̂_e^d(e^d) are

\hat{p}_e(e, k) = \frac{1}{L_1 \sigma_1} \sum_{i=k-L_1}^{k-1} K\Big(\frac{e - e(i)}{\sigma_1}\Big)
\hat{p}_e^d(e^d, k) = \frac{1}{L_2 \sigma_2} \sum_{i=1}^{L_2} K\Big(\frac{e^d - \bar{e}^d(i)}{\sigma_2}\Big)    (27)

where σ_1 and σ_2 denote the kernel widths for e and e^d, respectively. From (26) we can calculate the gradient of D̂(p_e, p_e^d, k) with respect to ĥ as follows,

\frac{\partial \hat{D}(p_e, p_e^d, k)}{\partial \hat{h}} = M_1^{-1} \sum_{i=k-L_1}^{k-1} \Big\{ \Delta_i(e(k)) \, K[\Delta_i(e(k))] \Big( \frac{\partial z(k)}{\partial \hat{h}} - \frac{\partial z(i)}{\partial \hat{h}} \Big) \Big\}
    - M_2^{-1} \frac{\partial z(k)}{\partial \hat{h}} \sum_{i=1}^{L_2} \Delta_i^d(e(k)) \, K[\Delta_i^d(e(k))]
    + M_3^{-1} \sum_{i=k-L_1}^{k-1} \Delta_i(e^d(k)) \, K[\Delta_i(e^d(k))] \frac{\partial z(i)}{\partial \hat{h}}    (28)

where \Delta_i(x) = \sigma_1^{-1}(x - e(i)), \Delta_i^d(x) = \sigma_2^{-1}(x - \bar{e}^d(i)), M_1 = \sum_{i=k-L_1}^{k-1} K[\Delta_i(e(k))], M_2 = \sum_{i=1}^{L_2} K[\Delta_i^d(e(k))], and M_3 = \sum_{i=k-L_1}^{k-1} K[\Delta_i(e^d(k))] (for zero-mean measurement noise, e^d(k) = 0). In (28) the reference PDF (desired PDF) p̂_e^d(e^d) is determined by S_{e^d} and σ_2. In the scheme of figure 1, ∂z(k)/∂ĥ = v_n(x(k)); in the scheme of figure 2, ∂z(k)/∂ĥ = [β(k) v_n(x(k)) - α(k) D_1(x(k))] / β^2(k). After deriving the above stochastic gradient, we apply steepest descent to the parameter vector,

\hat{h}(k+1) = \hat{h}(k) - \eta \frac{\partial \hat{D}(p_e, p_e^d, k)}{\partial \hat{h}}    (29)

where η is the step size. We refer to (29) as the stochastic information divergence gradient (SIDG) algorithm.

Remark 2. The equivalent noise of the SIDG algorithm is not i.i.d., so theoretically there is no guarantee of obtaining an unbiased and stationary estimator of h by applying (29). However, if we apply SIDG to the model (17) and properly choose the reference PDF (S_{e^d} and σ_2) and the parameters of the Parzen window (L_1 and σ_1), in practice we can obtain more accurate identification results, especially when the measurement noise is non-gaussian and non-zero-mean.
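The sketch below turns (26)-(29) into code for the scheme of figure 1, where ∂z(i)/∂ĥ = v_n(x(i)). The function name, the window bookkeeping and the toy interface are ours; as in the text, the window S_e(k) holds the most recent errors and the reference samples S_{e^d} are fixed.

```python
import numpy as np

def kernel(u):
    """Gaussian kernel K(u) used by the Parzen windows."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def sidg_step(h_hat, vn_hist, d_hist, e_ref, sigma1, sigma2, eta):
    """One SIDG update, eqs. (26)-(29), for the figure-1 scheme
    (z(i) = h_hat^T v_n(x(i)), so dz(i)/dh = v_n(x(i))).

    vn_hist : Veronese vectors v_n(x(i)), most recent last (length L1 + 1)
    d_hist  : matching desired outputs d(i) (identically zero in the
              hybrid-polynomial formulation of figure 1)
    e_ref   : reference (desired-error) samples S_e^d of length L2
    """
    V = np.asarray(vn_hist)                 # (L1 + 1, M)
    d = np.asarray(d_hist)
    e = d - V @ h_hat                       # errors e(i) = d(i) - z(i)
    e_k, e_past = e[-1], e[:-1]             # current error and window S_e(k)
    dz_k, dz_past = V[-1], V[:-1]           # dz/dh for current and past samples
    ed_k = 0.0                              # desired error e^d(k) (zero-mean case)

    # Parzen-window quantities appearing in (28)
    D1 = (e_k - e_past) / sigma1            # Delta_i(e(k))
    D2 = (e_k - np.asarray(e_ref)) / sigma2 # Delta_i^d(e(k))
    D3 = (ed_k - e_past) / sigma1           # Delta_i(e^d(k))
    M1, M2, M3 = kernel(D1).sum(), kernel(D2).sum(), kernel(D3).sum()

    grad = (D1 * kernel(D1)) @ (dz_k - dz_past) / M1 \
         - (D2 * kernel(D2)).sum() * dz_k / M2 \
         + (D3 * kernel(D3)) @ dz_past / M3

    h_hat = h_hat - eta * grad              # steepest descent, eq. (29)
    return h_hat / h_hat[-1]                # keep the last element equal to 1
```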

5 Experiments

In this section, we first compare the performance of the AGA, FOA and SIDG algorithms on artificially generated noisy samples (FOA means using the LMS algorithm for the scheme in figure 2; SIDG means using the SIDG algorithm for the same scheme as FOA). Then we discuss how to properly choose the initial parameter vector and the reference PDF for the SIDG algorithm when performing off-line identification.

5.1 Comparison of AGA, FOA and SIDG

In order to compare with other approaches to hybrid system identification, we use the autoregressive autonomous system defined in [8], where the authors make detailed studies of the overestimation of model orders and the effects of noise for four different identification methods, including an approach similar to FOA. The system is

x(k+1) = 2 x(k) + u(k) + r(k),    x(k) ≤ 0
x(k+1) = -1.5 x(k) + u(k) + r(k),    x(k) > 0
y(k) = x(k) + m(k)    (30)

The additive noise term r(k) ~ N(0, 0.01) is normally distributed. The sequence x(k) is generated with x(0) = -10, uniformly distributed input u(k) ~ U[10, 11], and measurement noise m(k) with PDF f_m(x). We use the following quantity, as in [8], to measure the accuracy of the identified parameter vectors,

\Delta\theta = \max_{1 \le i \le n} \Big( \min_{1 \le j \le n} \frac{\|\hat{\theta}_i - \theta_j\|_2}{\|\theta_j\|_2} \Big)    (31)

where n is the number of modes. Obviously, in the system (30), n = 2, n_a = n_c = 1, b_1 = [1  2  1]^T, b_2 = [1  -1.5  1]^T, and the hybrid parameter vector is h = [1  0.5  -3  2  0.5  1]^T. First we generate a set of data according to (30), then obtain {b̂_1, b̂_2}_{AGA}, {b̂_1, b̂_2}_{FOA}, and {b̂_1, b̂_2}_{SIDG} by applying the different algorithms, and finally ∆θ is calculated for each estimated parameter set. Repeating this procedure for different PDFs of m(k), an approximate curve of ∆θ for each PDF of m(k) can be constructed.
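A minimal data-generation and scoring sketch for this setup is given below (the helper names are our own; the ∆θ computation follows (31)).

```python
import numpy as np

def simulate(N, noise_std=0.1, seed=0):
    """Generate N samples of the switched system (30)."""
    rng = np.random.default_rng(seed)
    x = np.empty(N + 1)
    x[0] = -10.0
    u = rng.uniform(10.0, 11.0, size=N)
    r = rng.normal(0.0, 0.1, size=N)            # r(k) ~ N(0, 0.01), i.e. std 0.1
    for k in range(N):
        a = 2.0 if x[k] <= 0 else -1.5          # mode depends on the sign of x(k)
        x[k + 1] = a * x[k] + u[k] + r[k]
    m = rng.normal(0.0, noise_std, size=N + 1)  # measurement noise m(k)
    y = x + m
    return y, u

def delta_theta(theta_hat, theta_true):
    """Accuracy measure (31): worst-case relative distance to the nearest true mode."""
    worst = 0.0
    for th in theta_hat:
        best = min(np.linalg.norm(th - tj) / np.linalg.norm(tj) for tj in theta_true)
        worst = max(worst, best)
    return worst

theta_true = [np.array([1.0, 2.0, 1.0]), np.array([1.0, -1.5, 1.0])]   # b_1, b_2
y, u = simulate(2000)
# ... run AGA / FOA / SIDG on (y, u) to obtain theta_hat, then:
# print(delta_theta(theta_hat, theta_true))
```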

[Figures 3 and 4 omitted: Fig. 3 plots ∆θ (log scale) versus the noise variance σ_m^2 for AGA, FOA and SIDG; Fig. 4 plots the evolution of the six components of ĥ(k) over 2000 iterations.]

Fig. 3. ∆θ for several variances of noise σ_m^2
Fig. 4. Evolution of ĥ(k)

Figure 3 shows ∆θ as a function of the variance for zero-mean normally distributed m(k) ~ N(0, σ_m^2). We can see that the performances of FOA and SIDG are almost the same and much better than AGA as σ_m^2 increases, and the values of ∆θ match the case in [8]. An example of the evolution of ĥ(k) with the SIDG algorithm is shown in figure 4. The parameters involved are m(k) ~ N(0, 0.14), S_{e^d} = {0}, σ_2 = 0.75, L_1 = 60, and σ_1(k) = 0.3 \sqrt{Var(S_e(k))}, which is time-varying to prevent numerically singular conditions.

[Figure 5 omitted: scatter plots of the per-sample estimates {c_1, a_1} obtained from (19), (a) using ĥ from FOA and (b) using ĥ from SIDG.]

Fig. 5. Scatter graph of b_i using (19)

[Figure 6 omitted: scatter plots of the per-sample estimates {c_1, a_1} obtained from (13), panels (a) and (b) corresponding to two slightly different hybrid parameter vectors.]

Fig. 6. Scatter graph of b_i using (13)

We then consider the following experimental setup: (30) is used with r(k) = 0, m(k) is uniformly distributed in the range [-1, -0.5] ∪ [0.5, 1], and the reference sample set S_{e^d} is set to {-2, -1.6, -1.6, -0.3, -0.3, -0.3, 0.3, 0.3, 2, 2.2} with σ_2 = 0.5. In this experiment, the performances of FOA and SIDG are studied in the presence of non-gaussian noise. Using the estimated ĥ to obtain the parameters of the ARX models according to (19), we obtain the scatter graphs of figure 5 with 300 samples. Figure 5(a) is generated from the ĥ estimated by FOA, and figure 5(b) from the ĥ estimated by SIDG. Obviously, there are two clusters in both figure 5(a) and figure 5(b), but the clusters generated by SIDG are more compact than those generated by FOA, and their centers are closer to the true values (the true values are {c_1, a_1}_1 = {1, 2} and {c_1, a_1}_2 = {1, -1.5}). The points in figure 6 are generated with (13), where w(k) is taken as the root of (7) closest to 0; the figure shows that a small difference in ĥ can produce very different results (in figure 6(a) h = [1 0.5 -3 2 0.5 1]^T, in figure 6(b) h = [1.0048 0.5067 -2.9895 2.0103 0.4869 1]^T), which is due to the nature of the nonlinear system. In practice, (13) is not as robust as (19). The error PDFs involved are shown in figure 7: the true PDF is generated by the actual hybrid parameter vector h; the designed PDF has the form of (27) with the values of S_{e^d} mentioned above; the PDFs produced by SIDG and FOA are also depicted, showing that SIDG has better PDF matching performance. This example suggests that a well-designed SIDG algorithm can achieve better results than FOA in the case of non-gaussian noise.

5.2 Initialization of Parameter Vector and Choosing Reference PDF

As shown above, to run the SIDG algorithm one must specify the initial value of ĥ and generate the initial error set S_e(0) using this initial value.

[Figure 7 omitted: the true, designed, SIDG-produced and FOA-produced error PDFs f_error(e), plotted over the error range -5 to 5.]

Fig. 7. Error PDFs of the SIDG and FOA algorithms

Generally, this problem can be solved in a two-stage way: we first run the FOA algorithm to obtain a preliminary estimator of h, and then use this estimator as the initial value of ĥ for SIDG. Another crucial problem of SIDG is how to properly choose the reference PDF, for the reference PDF determines the optimal point that minimizes D̂(p_e, p_e^d). Generally, we should choose the PDF of w(k) = a_{λ(k)}^T m as the reference PDF. But the PDF of w(k) is non-stationary due to the switching between modes. In fact, if the measurement noise m(k) is zero-mean, unimodal, and i.i.d., then by the central limit theorem w(k) is an almost zero-mean normally distributed random variable, but with a different variance in each mode. Table 1 shows the identification results for various reference PDFs; the setup in table 1 is the same as that of figure 3 except for S_{e^d} and σ_2. µ̂ and σ̂ are the mean and variance of the estimated error PDF.

Table 1. Identification results from various reference PDFs

S_{e^d}   σ_2    µ̂        σ̂        ∆θ
{0}      0.75   0.0878   0.7884   0.0694
{0}      0.35   0.0782   0.7916   0.0695
{0}      2.50   0.0878   0.7884   0.0694
{-0.1}   0.75   -0.1671  0.7748   0.0789
{-0.2}   0.75   -0.2194  0.7501   0.0917
{-0.3}   0.75   -0.3646  0.7828   0.1496


It can be seen that, in the case of a unimodal reference PDF, σ_2 has little influence on the identification results, since varying σ_2 does not change the minimizer of D̂(p_e, p_e^d). But the estimated error PDF and the parameter vector are greatly affected by the mean of the reference PDF. Table 1 suggests that if m(k) is zero-mean, we should set S_{e^d} = {0} with any reasonable σ_2 (usually the same as σ_1). In the case of multimodal or non-zero-mean measurement noise, the PDF of w(k) can be regarded as the mixture

f_w(x) = \sum_{i=1}^{n} p_i f_i(x)    (32)

where p_i is the probability of the SARX system being in mode i and f_i(x) is the PDF of a_i^T m. Due to the different a_i in the mixture, f_w(x) is often multimodal (see the case in figure 7). So it is difficult to design the reference PDF given only the samples. Some prior knowledge must be used to determine an appropriate reference PDF, e.g. the known PDF of the measurement noise.
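When the measurement-noise PDF and rough mode probabilities are known, one way to build the reference sample set S_{e^d} is to draw samples directly from the mixture (32). The sketch below does this for a two-mode example; the noise law, mode probabilities and ARX coefficient vectors are placeholders of our own, not values taken from the paper.

```python
import numpy as np

def reference_samples(a_list, p_modes, draw_m, L2, seed=0):
    """Draw L2 samples of w = a_i^T m from the mixture (32).

    a_list  : list of coefficient vectors a_i (one per mode)
    p_modes : mode probabilities p_i
    draw_m  : callable returning one measurement-noise vector m of matching length
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(L2):
        i = rng.choice(len(a_list), p=p_modes)   # pick a mode with probability p_i
        samples.append(float(a_list[i] @ draw_m(rng)))
    return np.array(samples)

# Hypothetical two-mode example with n_a = 1, so a_i = [-a_1^i, 1]:
a_list = [np.array([-2.0, 1.0]), np.array([1.5, 1.0])]
p_modes = [0.5, 0.5]
draw_m = lambda rng: rng.uniform(0.5, 1.0, size=2) * rng.choice([-1.0, 1.0], size=2)
S_ed = reference_samples(a_list, p_modes, draw_m, L2=10)
print(S_ed)
```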

6 Conclusion and Future Work

We have presented a noisy version of the algebraic geometric approach for the identification of hybrid systems in SARX form, together with an information theoretic identification algorithm based on the information divergence. The SIDG algorithm is intended to minimize the information divergence between the estimated error PDF and the true distribution by adjusting the identified parameter vector in the gradient direction at each step, finally approaching the true value. In the numerical experiments, the proposed SIDG algorithm shows better performance than the LMS algorithm in the case of non-gaussian noise, because it is able to approach any designed error PDF that approximates the PDF produced by the ideal parameters. It remains open to develop practical approaches for choosing the reference error PDF in general cases. The key unsolved problem for the noisy hybrid polynomial is how to obtain an unbiased estimator of the hybrid parameter vector from the nonlinear system (7), given only enough samples and the PDF of the measurement noise. We also intend to search for other mappings that take the samples into a system in a higher-dimensional space and eliminate the switching mechanism. Such a mapping is expected to have a certain invariance with respect to the noise, which would facilitate identifying the hybrid parameters of the system.

References

1. Hashambhoy, Y., Vidal, R.: Recursive identification of switched ARX models with unknown number of models and unknown orders. In: 44th IEEE Conference on Decision and Control and 2005 European Control Conference (CDC-ECC'05) (2005) 6115-6121
2. Vidal, R.: Generalized Principal Component Analysis (GPCA): an Algebraic Geometric Approach to Subspace Clustering and Motion Segmentation. PhD thesis, University of California (2003)
3. Vidal, R., Anderson, B.: Recursive identification of switched ARX hybrid models: exponential convergence and persistence of excitation. In: 43rd IEEE Conference on Decision and Control (2004)
4. Ferrari-Trecate, G., Muselli, M., Liberati, D., Morari, M.: A clustering technique for the identification of piecewise affine systems. Automatica 39(2) (2003) 205-217
5. Juloski, A., Weiland, S., Heemels, W.: A Bayesian approach to identification of hybrid systems. IEEE Transactions on Automatic Control 50(10) (2005) 1520-1533
6. Bemporad, A., Roll, J., Ljung, L.: Identification of hybrid systems via mixed-integer programming. In: Proceedings of the 40th IEEE Conference on Decision and Control, vol. 1 (2001)
7. Bemporad, A., Garulli, A., Paoletti, S., Vicino, A.: A greedy approach to identification of piecewise affine models. In: Hybrid Systems: Computation and Control. LNCS 2623 (2003) 97-112
8. Juloski, A., Heemels, W., Ferrari-Trecate, G., Vidal, R., Paoletti, S., Niessen, J.: Comparison of four procedures for the identification of hybrid systems. In: Proceedings of the 8th International Workshop on Hybrid Systems: Computation and Control, Lecture Notes in Computer Science (2005) 354-369
9. Juloski, A., Paoletti, S., Roll, J.: Recent techniques for the identification of piecewise affine and hybrid systems. In: Current Trends in Nonlinear Systems and Control (2006)
10. Principe, J., Xu, D., Fisher, J.: Information theoretic learning. In: Unsupervised Adaptive Filtering (2000) 265-319
11. Erdogmus, D., Principe, J.: An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems. IEEE Transactions on Signal Processing 50(7) (2002) 1780-1786
12. Erdogmus, D., Principe, J.: Generalized information potential criterion for adaptive system training. IEEE Transactions on Neural Networks 13(5) (2002) 1035-1044
13. Lai, C.: Global Optimization Algorithms for Adaptive Infinite Impulse Response Filters. PhD thesis, University of Florida (2002)
14. Cover, T., Thomas, J.: Elements of Information Theory. Wiley-Interscience, New York (2006)
15. Silverman, B.: Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC (1986)
