LU FACTORIZATION FOR FEATURE TRANSFORMATION  



Patrick Nguyen, Luca Rigazio, Christian Wellekens and Jean-Claude Junqua



Panasonic Speech Technology Laboratory, Santa Barbara, U.S.A.
{nguyen, rigazio, jcj}@research.panasonic.com

ABSTRACT

Linear feature space transformations are often used for speaker or environment adaptation. Usually, numerical methods are sought to obtain solutions. In this paper, we derive a closed-form solution to the ML estimation of full feature transformations. Closed-form solutions are desirable because the problem is quadratic, and blind numerical optimization may therefore converge to poor local optima. We decompose the transformation into upper and lower triangular matrices, which are estimated alternately using the EM algorithm. Furthermore, we extend the theory to Bayesian adaptation. On the Switchboard task, we obtain a 1.6% absolute WER improvement by combining the method with MLLR, or 4% absolute over the unadapted baseline.

1. INTRODUCTION

Linear feature space transformations have been the subject of intense investigation recently. They provide a conceptually appropriate way of normalizing environment or speaker mismatch, and they integrate naturally into the SAT paradigm, offering compact models for speech recognition. The analytical mathematics are closely related to semi-tied covariances [1] and MLLT [2]. Both acknowledge the absence of a closed-form solution in the general case and proceed to define numerical expedients for that ailment. Numerical methods are sensitive to conditioning, and extra care is required to ensure convergence. Additionally, more insight may be gained from analytic solutions. In this paper, we discover a non-trivial special case of linear transformations that admits a closed-form solution: triangular matrices. We generalize to a full matrix by alternating the estimation of upper and lower triangular matrices, in a pattern which mimics the LU factorization. Lastly, we define the MAP estimator, which serves as a foundation for smoothing.

2. FEATURE-SPACE TRANSFORMATIONS

In this section, we show how to find the likelihood equation for linear transformations in the feature space. We review the



Institut Eurécom, Sophia-Antipolis, France
[email protected]

solution for diagonal transformations, and generalize to triangular matrices.

2.1. Linear transformation of observations

Let x be a random variable with pdf p_X(x). We apply a linear transformation to x to obtain \hat{x} = A x + b. We know how to evaluate p_X, but we need the density of the transformed data \hat{x}. The plug-in rule allows us to convert p_X into p_{\hat{X}}. A corollary of the plug-in rule for pdfs yields:

  p_{\hat{X}}(\hat{x}) = |\det A|^{-1} \, p_X\left( A^{-1} (\hat{x} - b) \right).    (1)
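Equation (1) can be checked numerically on a Gaussian, for which the transformed density is also available in closed form as N(Aμ + b, AΣAᵀ). The following sketch (NumPy/SciPy; all matrix and vector values are arbitrary illustrations, not from the paper) compares the two:

```python
# Numerical check of the change-of-variables (plug-in) rule, eq. (1):
# if x ~ p_X and x_hat = A x + b, then
#   p_hat(x_hat) = |det A|^{-1} p_X(A^{-1}(x_hat - b)).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
L = rng.normal(size=(d, d))
Sigma = L @ L.T + d * np.eye(d)               # SPD covariance
A = rng.normal(size=(d, d)) + 2 * np.eye(d)   # well-conditioned transform
b = rng.normal(size=d)

x_hat = rng.normal(size=d)                    # arbitrary evaluation point

# Left-hand side: density of x_hat via the plug-in rule.
x = np.linalg.solve(A, x_hat - b)
lhs = multivariate_normal(mu, Sigma).pdf(x) / abs(np.linalg.det(A))

# Right-hand side: exact density of the transformed Gaussian.
rhs = multivariate_normal(A @ mu + b, A @ Sigma @ A.T).pdf(x_hat)

assert np.isclose(lhs, rhs), (lhs, rhs)
print("plug-in rule verified")
```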

As we will see later, the presence of the Jacobian |\det A|^{-1} is the primary cause of analytical difficulties. The bias b does not appear in the Jacobian, and we will discard it in most derivations for simplicity. The plug-in rule may be stated as: plug the transformed observation into the pdf and multiply by the Jacobian.

2.2. In the EM algorithm

The mathematics of Hidden Markov Models (HMMs) are well known. Using the plug-in rule, we re-compute the expected log-likelihood Q. Applying the transform to the observations, \hat{o}_t = A o_t (so that the Jacobian contributes \log|\det A|), with Gaussian means \mu_m, covariances \Sigma_m, and occupancy posteriors \gamma_m(t), the Q function becomes

  Q(A) = \sum_t \sum_m \gamma_m(t) \left[ \log|\det A| - \tfrac{1}{2} (A o_t - \mu_m)^T \Sigma_m^{-1} (A o_t - \mu_m) \right] + \mathrm{const},    (2)

and its derivative is

  \frac{\partial Q}{\partial A} = \sum_t \sum_m \gamma_m(t) \left[ A^{-T} - \Sigma_m^{-1} (A o_t - \mu_m) \, o_t^T \right].    (3)

Stationary points of the gradient correspond to maxima or minima of Q. This seemingly simple problem is a multidimensional quadratic equation and has no closed-form solution in general [3]. Gales [3] assumes rows to be almost independent and optimizes row by row. Gopinath [2] points out that half of the function is quadratic and therefore suitable for conjugate gradient descent. Digalakis [4] advocates iterative numerical methods but cites none in particular. Bilmes [5] uses unitary matrices, for which the Jacobian disappears. We present a solution that can be seen as a combination of [4] and [5].
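The gradient formula of eq. (3) can be validated against finite differences of eq. (2). A minimal sketch, assuming a single Gaussian (all occupancies equal to one), a diagonal precision matrix, and synthetic data of our own choosing:

```python
# Finite-difference check of the gradient, eq. (3), against the
# auxiliary function of eq. (2). Single Gaussian, gamma_m(t) = 1.
import numpy as np

rng = np.random.default_rng(1)
d, T = 3, 50
O = rng.normal(size=(T, d))             # observations o_t (rows)
mu = rng.normal(size=d)
P = np.diag(rng.uniform(0.5, 2.0, d))   # precision Sigma^{-1} (diagonal)
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))

def Q(A):
    R = O @ A.T - mu                    # rows: A o_t - mu
    return T * np.log(abs(np.linalg.det(A))) - 0.5 * np.sum((R @ P) * R)

def grad(A):                            # eq. (3), summed over t
    R = O @ A.T - mu
    return T * np.linalg.inv(A).T - P @ R.T @ O

num = np.zeros((d, d))
eps = 1e-6
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d)); E[i, j] = eps
        num[i, j] = (Q(A + E) - Q(A - E)) / (2 * eps)

assert np.allclose(num, grad(A), atol=1e-4)
print("analytic gradient matches finite differences")
```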

2.3. Diagonal matrix

When the matrix A is diagonal [4], there are two solutions per dimension. We also assume the precision matrices \Sigma_m^{-1} to be diagonal. Let a_i be the i-th diagonal element of A. The expression for the gradient is quadratic and may be found in Gales or Digalakis. However, neither of them gives an explicit expression nor states a preference for either root. Writing, for dimension i,

  \alpha_i = \sum_t \sum_m \gamma_m(t) \, o_{t,i}^2 / \sigma_{m,i}^2,
  \beta_i  = \sum_t \sum_m \gamma_m(t) \, \mu_{m,i} o_{t,i} / \sigma_{m,i}^2,
  \gamma_i = \sum_t \sum_m \gamma_m(t),

the stationarity condition \partial Q / \partial a_i = \gamma_i / a_i - \alpha_i a_i + \beta_i = 0 is quadratic in a_i, and we choose the root

  a_i = \frac{ \beta_i + \sqrt{ \beta_i^2 + 4 \alpha_i \gamma_i } }{ 2 \alpha_i }.    (4)

The second derivative indicates which of the two solutions corresponds to a more stable point, by indicating more negative values:

  \frac{\partial^2 Q}{\partial a_i^2} = - \frac{\gamma_i}{a_i^2} - \alpha_i.    (5)
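The closed-form diagonal update of eq. (4) can be sketched and sanity-checked against a brute-force scan of the objective. A minimal illustration, assuming a single Gaussian, unit occupancies, and arbitrary synthetic values:

```python
# Closed-form per-dimension solution for a diagonal transform, eq. (4),
# compared against a fine grid search over the objective.
import numpy as np

rng = np.random.default_rng(2)
T = 200
o = rng.normal(loc=1.5, scale=0.7, size=T)   # one dimension of o_t
mu, sigma2 = 0.8, 0.5                        # model mean / variance
gamma = np.ones(T)                           # occupancy posteriors

alpha = np.sum(gamma * o * o) / sigma2
beta = np.sum(gamma * mu * o) / sigma2
gam = np.sum(gamma)

# Positive root of  alpha a^2 - beta a - gamma = 0  (eq. 4).
a_star = (beta + np.sqrt(beta**2 + 4 * alpha * gam)) / (2 * alpha)

def Q(a):
    return gam * np.log(abs(a)) - 0.5 * np.sum(gamma * (a * o - mu)**2) / sigma2

# The closed form should match or beat a fine grid search over a > 0.
grid = np.linspace(1e-3, 5.0, 20001)
assert Q(a_star) >= max(Q(a) for a in grid) - 1e-6
print("closed-form maximizer a* =", a_star)
```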

Both roots of the characteristic equation correspond to maxima in the likelihood. However, our choice guarantees a smaller absolute value of the second derivative, and also a value closer to unity. Without this additional hint, numerical methods would converge arbitrarily to one of the two stationary points. The closed-form solution affords more insight.

2.4. Upper-triangular matrix and its closed-form solution

Since all rows of the matrix are independent, thanks to the diagonality of the covariances, we may fix a dimension i and solve each dimension independently. Let a_{ii}, a_{i,i+1}, ..., a_{in} be the non-zero elements of the i-th row of A, and let b_i be the bias of feature i. Define the off-diagonal parameter vector and the corresponding regressor

  w_i = ( a_{i,i+1}, \dots, a_{i,n}, b_i )^T,    (6)
  x_t = ( o_{t,i+1}, \dots, o_{t,n}, 1 )^T,    (7)

so that the i-th transformed feature is a_{ii} o_{t,i} + w_i^T x_t. We seek to find (a_{ii}, w_i). Since the determinant depends only on a_{ii}, it is treated differently. First, we solve an (n-i) x (n-i) linear subsystem for w_i using the n-i last elements of the gradient. Then, we use the special equation for a_{ii} to yield the quadratic form of the previous section.

The objective function in eq. (2) for dimension i is

  Q_i = \sum_t \sum_m \gamma_m(t) \left[ \log|a_{ii}| - \frac{ ( a_{ii} o_{t,i} + w_i^T x_t - \mu_{m,i} )^2 }{ 2 \sigma_{m,i}^2 } \right].    (8)

Differentiating with respect to w_i, we get a linear system

  \frac{\partial Q_i}{\partial w_i} = k_i - a_{ii} g_i - G_i w_i = 0,    (9)

with the appropriate definitions

  G_i = \sum_t \sum_m \gamma_m(t) \, x_t x_t^T / \sigma_{m,i}^2,
  g_i = \sum_t \sum_m \gamma_m(t) \, o_{t,i} x_t / \sigma_{m,i}^2,
  k_i = \sum_t \sum_m \gamma_m(t) \, \mu_{m,i} x_t / \sigma_{m,i}^2.

It is solved by:

  w_i = G_i^{-1} ( k_i - a_{ii} g_i ),    (10)

which expresses w_i linearly in terms of a_{ii}. Now we need to find a_{ii} and substitute back. The solution for a_{ii} is found using the last derivative, which is merely a generalization of the diagonal case. Using the linear dependency specified by eq. (10), we can finally state that a_{ii} is again the solution of a quadratic expression

  \alpha_i a_{ii}^2 - \beta_i a_{ii} - \gamma_i = 0,    (11)

with

  \alpha_i = \sum_t \sum_m \gamma_m(t) \, o_{t,i}^2 / \sigma_{m,i}^2 - g_i^T G_i^{-1} g_i,
  \beta_i  = \sum_t \sum_m \gamma_m(t) \, \mu_{m,i} o_{t,i} / \sigma_{m,i}^2 - g_i^T G_i^{-1} k_i,
  \gamma_i = \sum_t \sum_m \gamma_m(t),

and, as in the diagonal case, the second derivative indicates which of the two roots is preferred.
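The row-wise procedure of eqs. (6)-(11) can be sketched end to end. This is our own illustration, assuming a single Gaussian, unit occupancies, and no bias term; it verifies that the resulting upper-triangular transform zeroes the corresponding entries of the gradient of eq. (3):

```python
# Row-wise closed-form estimate of an upper-triangular feature transform
# (eqs. 6-11), single Gaussian, no bias, synthetic data.
import numpy as np

rng = np.random.default_rng(3)
n, T = 4, 500
O = rng.normal(size=(T, n))          # observations
mu = rng.normal(size=n)              # model means
s2 = rng.uniform(0.5, 2.0, n)        # model variances (diagonal)

A = np.zeros((n, n))
for i in range(n):
    oi = O[:, i]
    X = O[:, i + 1:]                 # regressors for row i
    if X.shape[1] > 0:
        G = X.T @ X / s2[i]                    # eq. (9) accumulators
        g = X.T @ oi / s2[i]
        k = mu[i] * X.sum(axis=0) / s2[i]
        Ginv_g = np.linalg.solve(G, g)
        Ginv_k = np.linalg.solve(G, k)
        alpha = oi @ oi / s2[i] - g @ Ginv_g   # eq. (11) coefficients
        beta = mu[i] * oi.sum() / s2[i] - g @ Ginv_k
    else:                                      # last row: diagonal case
        alpha = oi @ oi / s2[i]
        beta = mu[i] * oi.sum() / s2[i]
    gam = float(T)
    a_ii = (beta + np.sqrt(beta**2 + 4 * alpha * gam)) / (2 * alpha)
    A[i, i] = a_ii
    if X.shape[1] > 0:
        A[i, i + 1:] = Ginv_k - a_ii * Ginv_g  # eq. (10)

# Stationarity: the gradient of eq. (3) must vanish on the upper
# triangle (including the diagonal), the only free parameters.
R = O @ A.T - mu
grad = T * np.linalg.inv(A).T - np.diag(1.0 / s2) @ R.T @ O
assert np.allclose(np.triu(grad), 0.0, atol=1e-6)
print("upper-triangular stationarity verified")
```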

When the covariances \Sigma_m are not diagonal, we must first solve the quadratic equation for a_{nn}. Then, knowledge of this coefficient helps find a_{n-1,n-1} and a_{n-1,n}. We proceed thus upwards until the top row, in the same manner as the back-substitution step in a Gauss-Jordan matrix inversion.

2.5. The LU decomposition

Looking at eq. (3), we see that the crux of the problem resides in the presence of a log-determinant, which implies in turn the presence of the inverse matrix. A common way of dealing with inverse matrices involves the LU decomposition of a matrix, that is to say, our matrix A is written as

  A = L U,    (12)

with U an upper-triangular matrix, and L a unit lower-triangular matrix: the diagonal elements of L are all equal to 1. We embed this decomposition by alternating the maximization step in the EM algorithm,

  \hat{o}_t = L ( U o_t ),    (13)

estimating U with L fixed, then L with U fixed. The upper-triangular method was derived above, and the lower-triangular method is found by fixing the diagonal to unity, as in [5].
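The alternation of eq. (13) can be illustrated numerically. In the sketch below (our own construction, not the paper's code) each partial maximization uses a generic optimizer instead of the closed forms, purely for brevity; the point is that alternating the two triangular factors can only increase Q:

```python
# Sketch of the LU alternation, eq. (13): maximize Q(A), eq. (2), over
# A = L U by alternately updating the upper factor U and the unit lower
# factor L. Single Gaussian, synthetic data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, T = 3, 300
O = rng.normal(size=(T, n))
mu = rng.normal(size=n)
P = np.diag(rng.uniform(0.5, 2.0, n))        # diagonal precision

def Q(A):
    R = O @ A.T - mu
    return T * np.log(abs(np.linalg.det(A))) - 0.5 * np.sum((R @ P) * R)

iu = np.triu_indices(n)                      # free parameters of U
il = np.tril_indices(n, -1)                  # free parameters of L

def build(L_par, U_par):
    L = np.eye(n); L[il] = L_par
    U = np.zeros((n, n)); U[iu] = U_par
    return L, U

L_par = np.zeros(il[0].size)
U_par = np.eye(n)[iu]                        # start from A = I
q_hist = [Q(np.eye(n))]
for _ in range(5):
    L, _ = build(L_par, U_par)
    U_par = minimize(lambda p: -Q(L @ build(L_par, p)[1]), U_par).x
    _, U = build(L_par, U_par)
    L_par = minimize(lambda p: -Q(build(p, U_par)[0] @ U), L_par).x
    Lf, Uf = build(L_par, U_par)
    q_hist.append(Q(Lf @ Uf))

assert all(b >= a - 1e-6 for a, b in zip(q_hist, q_hist[1:]))
print("Q is non-decreasing over the alternation:", [round(q, 2) for q in q_hist])
```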

3. BAYESIAN EXTENSION

The Bayesian framework is useful for parameter smoothing. For instance, while using regression trees to define multiple classes, the leaf transforms are derived by smoothing with the parent nodes, as shown in Figure 1.

Fig. 1. Using a regression tree: the leaf transformations are interpolated versions between their ML estimates and the parent transform.

The MAP framework is usually greatly simplified by selecting the prior distribution among the family of conjugate priors. MAP estimators and prior distributions are readily defined for all but the diagonal terms. The conjugate prior for the bias is a Normal law; the conjugate prior for non-diagonal elements is elliptic. The probability of the diagonal terms has a transcendent shape, and the prior family does not appear frequently enough in nature to justify a name. We proceed to define it.

3.1. The Maxwell-Rayleigh-Normal distribution

A subset of the family of conjugate priors is a mixture of (extended) Maxwell, Rayleigh, and Gaussian distributions. We christen it hence the Maxwell-Rayleigh-Normal (MRN) distribution. Maxwell's distribution models the speeds of molecules in thermal equilibrium, and is defined for x >= 0:

  f(x) = \sqrt{2/\pi} \, x^2 e^{-x^2/2}.    (14)

Furthermore, the Rayleigh distribution models the attenuation in fading channels and is

  f(x) = x \, e^{-x^2/2}, \quad x \ge 0.    (15)

Lastly, the Normal distribution is an old acquaintance of ours, and we include it here for the sake of completeness:

  f(x) = (2\pi)^{-1/2} e^{-x^2/2}.    (16)

We define the MRN distribution, with location hyper-parameter \lambda, to be

  p(x; \lambda) = Z^{-1}(\lambda) \, x^2 e^{-(x - \lambda)^2 / 2}, \quad x \ge 0.    (17)

The regularization constant Z(\lambda) is chosen such that the density integrates to one,

  Z(\lambda) = \int_0^{\infty} x^2 e^{-(x - \lambda)^2 / 2} \, dx,    (18)

which is expressed in terms of the error function erf(x) = (2/\sqrt{\pi}) \int_0^x e^{-t^2} dt. The distribution is shown on Figure 2. The value of the hyper-parameter \lambda with respect to the mean is shown on Figure 3.

Fig. 2. The MRN law for different values of \lambda.

Fig. 3. The mean of the MRN law w.r.t. \lambda; the value of \lambda whose mean corresponds to the identity transform can be read off the curve.
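The curve of Figure 3 can be reproduced by numerical integration. The sketch below assumes the density reconstructed above (proportional to x² exp(−(x−λ)²/2) on x ≥ 0) and locates the λ whose mean is one, i.e. the identity transform:

```python
# Mean of the MRN law (eq. 17) as a function of lambda, by numerical
# integration, and the lambda whose mean equals 1 (identity transform).
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def mrn_mean(lam):
    w = lambda x: x**2 * np.exp(-0.5 * (x - lam)**2)
    Z, _ = quad(w, 0, np.inf)                  # eq. (18)
    m1, _ = quad(lambda x: x * w(x), 0, np.inf)
    return m1 / Z

# Tabulate a few points of the Figure-3 curve, then solve mean == 1.
for lam in (-2.0, -1.0, 0.0, 1.0):
    print(f"lambda={lam:+.1f}  mean={mrn_mean(lam):.3f}")
lam_id = brentq(lambda lam: mrn_mean(lam) - 1.0, -5.0, 5.0)
print("lambda for unit mean:", round(lam_id, 3))
```

The mean is monotone in λ, so a bracketing root-finder such as `brentq` suffices.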

We proceed by defining the raised MRN law, which constitutes a family of conjugate priors:

  p(x; \lambda, \nu) = Z^{-1}(\lambda, \nu) \, x^{\nu} e^{-(x - \lambda)^2 / 2}, \quad x \ge 0.    (19)

Unfortunately, unless \nu is an integer, the moments have no closed-form expression. In most cases, we are only interested in values of (\nu, \lambda) such that the mean is one,

  E[x; \lambda, \nu] = 1,    (20)

so it is easier to use numerical integration and tabulate \lambda(\nu). We would then obtain the curve shown on Figure 4. The parameter \nu is interpreted as the weight given to the prior information.

Fig. 4. We select \nu, and choose \lambda s.t. the mean is one.

4. EXPERIMENTS

4.1. Conditions

To validate our algorithm, we used the Switchboard conversational telephone speech database. We report results on the first evaluation test set of 2001 [6], which contains 20 conversations from the Switchboard-I database. The acoustic front-end uses 27 PLP coefficients (8-pole model plus energy, and their first and second derivatives), which were normalized using side-based cepstral mean subtraction (CMS) and variance normalization. We train a total of …k Gaussians with diagonal covariances, pooled in 3600 mixtures using decision trees. The language model (LM) for this task is a trigram model containing compound words and frequent abbreviations [7]. It was kindly provided to us by Andreas Stolcke of SRI. It contains 34k words, 5M bigrams, and 12M trigrams. Our recognizer, called EWAVES [8], is a lexical-tree based, gender-independent, word-internal context-dependent, trigram Viterbi decoder with bigram LM lookahead. For adaptation, we use the transcription of the first pass. The second pass is identical to the first pass but runs on adapted features or with adapted models.

4.2. Results

In Table 1, we report Word Error Rates (WER). The feature space transformation, or MLLU (for Maximum-Likelihood LU transformation), yields an improvement comparable with MLLR when used in isolation. Since there were about 5 minutes of adaptation data in most cases, we disabled the MAP prior described in Section 3. There is a 0.2% WER improvement if we only use block-diagonal matrices. We have observed that MLLR behaves best with 7 regression classes (1 for silence, 4 for vowels, and 2 for consonants). In this case as well, constraining the transformation matrices to be block-diagonal yields an improvement. When we use MLLU as a feature normalization, followed by MLLR model adaptation, we obtain a 1.6% WER improvement over the baseline MLLR-adapted models.

                            WER
  SI                        34.6%
  MLLR 1 global class       32.8%
  MLLU 1 global class       32.8%
  MLLU block-diag           32.6%
  MLLR 7 classes + block    32.2%
  MLLU + MLLR(7)            30.6%

Table 1. Results

5. DISCUSSION AND FUTURE WORK

In this paper, we have presented a closed-form solution for the case of triangular linear feature-space transformations. We embedded the algorithm in the EM algorithm to yield the LU factorization of a full linear transformation. Furthermore, the Bayesian framework was also explored. On Switchboard, our new algorithm, MLLU, yields a significant improvement over adapted models. Due to time constraints, we were not able to investigate multiple-class, Bayesian LU feature decomposition.

6. REFERENCES

[1] M. J. F. Gales, "Adapting Semi-Tied Full-Covariance Matrix HMMs (TR298)," Tech. Rep., Cambridge University (CUED), 1997.
[2] R. A. Gopinath, "Maximum Likelihood Modeling with Gaussian Distributions for Classification," in Proc. of ICASSP'98, Seattle, 1998.
[3] M. J. F. Gales, "Maximum Likelihood Linear Transformations for HMM-based Speech Recognition (TR291)," Tech. Rep., Cambridge University (CUED), May 1997.
[4] V. Digalakis, D. Rtischev, and L. Neumeyer, "Speaker Adaptation Using Constrained Estimation of Gaussian Mixtures," IEEE Trans. SAP, vol. 3, pp. 129–136, 1995.
[5] J. Bilmes, "Factored Sparse Inverse Covariance Matrices," in Proc. of ICASSP'00, 2000, vol. II, pp. 1009–1012.
[6] A. Martin and M. Przybocki, "Analysis of results," in 2001 NIST LVCSR Workshop, 2001.
[7] A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V. Ramana Rao Gadde, M. Plauché, C. Richey, E. Shriberg, K. Sönmez, F. Weng, and J. Zheng, "The SRI March 2000 Hub-5 Conversational Speech Transcription System," in Proc. of the 2000 Speech Transcription Workshop, 2000.
[8] P. Nguyen, L. Rigazio, and J.-C. Junqua, "EWAVES: an efficient decoding algorithm for lexical tree based speech recognition," in Proc. of ICSLP, Beijing, China, Oct. 2000, vol. 4, pp. 286–289.
