PCA-PMC: A NOVEL USE OF a priori KNOWLEDGE FOR FAST PARALLEL MODEL COMBINATION
Ruhi Sarikaya and John H. L. Hansen
Robust Speech Processing Laboratory, Center for Spoken Language Understanding
University of Colorado at Boulder, Boulder, CO 80309
http://cslu.colorado.edu
[email protected] [email protected]
ABSTRACT
This paper describes an algorithm that reduces the computational complexity of the parallel model combination (PMC) method for robust speech recognition while retaining the same level of performance. Although PMC is effective in composing a noise-corrupted acoustic model from clean speech and noise models, its intense computational complexity limits its use in real-time applications. The novel approach here is to encode the clean models using principal component analysis (PCA) and to precompute the prototype vectors and matrices for the means and covariances in the linear spectral domain using rectangular DCT and inverse DCT matrices. Transformation into the linear spectral domain is thereby reduced to finding the projection of each vector in the Eigen space of means and covariances, followed by a linear combination of vectors and matrices obtained from the projections. Furthermore, the Eigen space allows a better trade-off between computational complexity and accuracy. The computational savings are demonstrated both analytically and through experimental evaluations. Experiments on context independent phone recognition with TIMIT data show that the new PMC framework outperforms the baseline method in speed by a factor of 1.9 with the same level of accuracy.
1. INTRODUCTION
It is well known that speech recognition performance often degrades significantly when a mismatch exists between training and test environments. One source of mismatch is background noise corrupting the speech. Research in robust speech recognition can be classified into several categories: noise robust speech features [1], feature compensation techniques [2], and model compensation methods [3]. Parallel model combination (PMC) is a promising compensation technique which has been shown to achieve good performance for speech recognition in additive background noise [4]. The PMC framework allows adaptation of both static and dynamic features as well as covariances. However, a major problem with the method is the intense computation required to adapt the models. Previously, several methods have been proposed to reduce the computation of PMC. Among these are data-driven PMC (DPMC) [7], composite distribution based PMC [9], and their derivatives. We propose a more efficient method to reduce the computational requirements of PMC. One unique advantage of this method is that it complements many of the previously proposed techniques rather than being an alternative: it can be used in combination with the techniques of [7, 9] to further increase adaptation speed. (This work was supported in part by a grant from the U.S. NAVY SPAWAR Systems Center, CA.)
In this study, we formulate a new PCA-PMC framework for fast model compensation. Principal component analysis (PCA) has been widely used in many disciplines for data analysis and dimensionality reduction [10]. Recently, PCA has been applied to speaker adaptation [5], providing fast adaptation. This paper is organized as follows. In Sec. 2, we review PMC for static parameters. Sec. 3 presents the new algorithm for static parameters. In Sec. 4, we present an analysis of the computational savings, with conclusions in Sec. 5.
2. PMC NOISE ADAPTATION FOR STATIC PARAMETERS
The basic notion of model combination was first introduced in [3], and later developed by Gales and Young [4] into what is now called PMC. PMC estimates the corrupted speech model by combining a clean speech model with a noise model. The assumption is that the speech to be recognized is modeled by continuous density HMMs, and that the stationary interfering background noise is modeled with a single state mixture. Nonstationary noise sources can also be modeled with multiple state mixtures in a PMC framework. PMC takes advantage of the common assumption that speech and noise are additive in the linear power spectral domain. Therefore, both the speech and noise distributions are mapped into the linear power spectral domain. The speech and noise spectra are combined with a matching gain term that scales the speech spectrum. The gain term accounts for the fact that the level of the original clean training speech data may be different from that of the noisy speech data. Next, the corrupted spectrum is mapped back to the cepstrum domain. The mapping procedure is summarized in Eq. 1, with a more thorough derivation in [4]:

\mu^l = F^{-1} \mu^c, \qquad \Sigma^l = F^{-1} \Sigma^c (F^{-1})^T    (1)

Here, \mu^c and \Sigma^c are the static mean and covariance of any state output distribution in the cepstral domain. The transformation of these parameters into the log-spectral domain is obtained via an inverse discrete cosine transform (DCT) matrix, F^{-1}. The superscripts denote the domain of the parameters: c for cepstrum and l for log-spectral; in the linear power spectral domain, no superscript is used. The cepstrum parameters are assumed to have a Gaussian distribution. Since the DCT is a linear operation, the transformed distributions in the log-spectral domain are also Gaussian. Transformation of the parameters into the linear power spectral domain assumes the parameters are log-normal. The parameters in the linear domain are given as follows [4, 6]:
\mu_i = \exp(\mu_i^l + \Sigma_{ii}^l / 2), \qquad \Sigma_{ij} = \mu_i \mu_j [\exp(\Sigma_{ij}^l) - 1]    (2)

where the subscripts i and j specify the components of the static mean vector or covariance matrix. Similarly, the noise distribution is also transformed into the linear power spectral domain. From the assumption that speech and noise are independent and additive, the corrupted speech parameters in this domain are given as follows:

\hat{\mu} = g\mu + \tilde{\mu}, \qquad \hat{\Sigma} = g^2 \Sigma + \tilde{\Sigma}    (3)

where \tilde{\mu} and \tilde{\Sigma} are the static mean and covariance of the noise model, and g is the gain term. Furthermore, PMC assumes that the combined distributions in the linear power spectral domain are approximately log-normal. Therefore, the process of moving to the linear domain can be inverted to allow movement back to the cepstrum domain. After compensation, the distributions are first mapped back to the log-spectral domain:

\hat{\mu}_i^l = \log(\hat{\mu}_i) - \frac{1}{2}\log\Big[\frac{\hat{\Sigma}_{ii}}{\hat{\mu}_i^2} + 1\Big], \qquad \hat{\Sigma}_{ij}^l = \log\Big[\frac{\hat{\Sigma}_{ij}}{\hat{\mu}_i \hat{\mu}_j} + 1\Big]    (4)

Next, mapping back to the cepstral domain is achieved using the following equations, where F is the DCT matrix:

\hat{\mu}^c = F \hat{\mu}^l, \qquad \hat{\Sigma}^c = F \hat{\Sigma}^l F^T    (5)

3. ENCODING HMM PARAMETERS USING PRINCIPAL COMPONENT ANALYSIS (PCA)
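The three-stage cycle that this section sets out to accelerate — the forward mapping of Eqs. 1-2, the combination of Eq. 3, and the inversion of Eqs. 4-5 — can be sketched for a single diagonal-covariance Gaussian as follows. This is a minimal pure-Python illustration, not the paper's implementation: the orthonormal square DCT construction, the tiny dimension, and all helper names are our assumptions.

```python
import math

def dct_matrix(N):
    """Orthonormal N-point DCT-II matrix (N x N); its inverse is its transpose."""
    F = []
    for k in range(N):
        s = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        F.append([s * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N)])
    return F

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def pmc_compensate(mu_c, var_c, mu_n, Sigma_n, g=1.0):
    """One baseline PMC cycle (Eqs. 1-5) for a diagonal-covariance Gaussian."""
    N = len(mu_c)
    F = dct_matrix(N)
    Finv = [list(col) for col in zip(*F)]  # F^{-1} = F^T for the orthonormal DCT
    # Eq. 1: cepstrum -> log-spectral (diagonal covariance becomes full)
    mu_l = matvec(Finv, mu_c)
    Sigma_l = [[sum(Finv[i][k] * var_c[k] * Finv[j][k] for k in range(N))
                for j in range(N)] for i in range(N)]
    # Eq. 2: log-spectral -> linear power spectral (log-normal assumption)
    mu = [math.exp(mu_l[i] + Sigma_l[i][i] / 2.0) for i in range(N)]
    Sigma = [[mu[i] * mu[j] * (math.exp(Sigma_l[i][j]) - 1.0) for j in range(N)]
             for i in range(N)]
    # Eq. 3: combine with the noise model (gain term g)
    mu_h = [g * mu[i] + mu_n[i] for i in range(N)]
    Sigma_h = [[g * g * Sigma[i][j] + Sigma_n[i][j] for j in range(N)] for i in range(N)]
    # Eq. 4: back to the log-spectral domain
    mu_hl = [math.log(mu_h[i]) - 0.5 * math.log(Sigma_h[i][i] / mu_h[i] ** 2 + 1.0)
             for i in range(N)]
    Sigma_hl = [[math.log(Sigma_h[i][j] / (mu_h[i] * mu_h[j]) + 1.0) for j in range(N)]
                for i in range(N)]
    # Eq. 5: back to the cepstrum domain
    mu_hc = matvec(F, mu_hl)
    Sigma_hc = [[sum(F[i][k] * Sigma_hl[k][l] * F[j][l]
                     for k in range(N) for l in range(N))
                 for j in range(N)] for i in range(N)]
    return mu_hc, Sigma_hc
```

With a zero noise model and unit gain, the cycle is a round trip and returns the original clean parameters, which makes the log-normal mapping easy to sanity-check.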
The PMC derivation shows that the technique is composed of three stages: mapping the HMM parameters into the linear power spectral domain, composing the corrupted model, and mapping back to the cepstrum domain. A fast method would be to precompute and store all clean HMM statistics in both the linear spectral and cepstrum domains. Although this is feasible for small vocabulary speech recognition tasks where the number of distributions is small, the excessive memory requirements for large vocabulary speech recognition systems with over 50,000 distributions render this technique impractical for real-time applications. Since a diagonal covariance is used in the cepstrum domain, the transformation results in a full covariance matrix in the linear power spectral domain. Therefore, for each diagonal covariance vector, an N x N full covariance matrix must be stored, where N is the dimension of the filterbank. Additionally, the dimension of the vectors in the linear power spectral domain is N rather than p (where p is the dimension of the static parameters). This further increases the memory requirements. Our goal here is to speed up PMC without increasing the memory requirements of the system. The clean HMM parameters are known a priori and can be mapped to the linear power spectral domain for PMC compensation. Therefore, a compact and fast mapping technique is needed to reach the linear power spectral domain without a significant increase in memory requirements. To accomplish this, we first encode the clean HMM parameters in the cepstrum domain using PCA. Since the transformation of means and covariances follows a similar yet different procedure, we do PCA on each of the parameter sets separately.
[Figure 1: Block diagram for PCA-based PMC. Offline steps: pool all speech models, perform PCA, keep the first K Eigen vectors, and compute the prototype quantities u_0, ..., u_K and V_0, ..., V_K. Online steps: estimate the noise model, project each speech model into the Eigen space, compute its linear power spectrum, map the noise model into the linear power spectrum, combine the speech and noise spectra, and map back to the cepstrum domain.]
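The online path of Fig. 1 can be sketched as follows for one distribution. This is a pure-Python illustration under our naming assumptions: alpha and beta are the Eigen-space projection weights of a clean model (with alpha[0] = beta[0] = 1 for the global means), and u[k] (length N) and V[k] (N x N) are the prototype quantities assumed precomputed in the offline branch.

```python
import math

def linear_domain_stats(alpha, beta, u, V):
    """Linear power spectral mean/covariance from the projection weights."""
    N = len(u[0])
    ks = range(len(u))
    # mu_i = exp(sum_k alpha_k [u_k]_i + (beta_k / 2) [V_k]_ii)
    mu = [math.exp(sum(alpha[k] * u[k][i] + 0.5 * beta[k] * V[k][i][i]
                       for k in ks)) for i in range(N)]
    # Sigma_ij = mu_i mu_j (exp(sum_k beta_k [V_k]_ij) - 1)
    Sigma = [[mu[i] * mu[j] *
              (math.exp(sum(beta[k] * V[k][i][j] for k in ks)) - 1.0)
              for j in range(N)] for i in range(N)]
    return mu, Sigma

def combine_with_noise(mu, Sigma, mu_n, Sigma_n, g=1.0):
    """Combine the speech and noise spectra (additive in the linear domain)."""
    N = len(mu)
    mu_hat = [g * mu[i] + mu_n[i] for i in range(N)]
    Sigma_hat = [[g * g * Sigma[i][j] + Sigma_n[i][j] for j in range(N)]
                 for i in range(N)]
    return mu_hat, Sigma_hat
```

Mapping the combined spectrum back to the cepstrum domain then follows the standard PMC inversion reviewed in Sec. 2.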
PCA allows us to achieve almost exact reconstruction of the vectors in the training set for all practical purposes. The block diagram describing the PCA-based PMC is presented in Fig. 1. The offline steps provide the Eigen vectors and the prototype vectors and matrices used during online model compensation.
3.1 PCA on Static Means and Covariances
In most practical speech recognition systems, a diagonal covariance matrix is used to reduce computational complexity. Thus, we assume that the HMM models have diagonal covariances. We pool all the clean speech means and covariances into two sets and perform PCA on each set of vectors separately. We can reconstruct each vector with sufficient accuracy by using the full set of Eigen vectors [10]. PCA results in a set of basis vectors: C_1, C_2, ..., C_K for the static means and e_1, e_2, ..., e_K for the covariances. The size of each basis is equal to the dimension of the vectors. Let \mu^c be any clean speech mean vector and \Sigma^c the corresponding covariance vector in the cepstrum domain. We can express both the mean and the covariance vectors in terms of the basis vectors in their respective spaces:

\mu^c = C_0 + \sum_{i=1}^{K} \alpha_i C_i    (6)

\Sigma^c = e_0 + \sum_{i=1}^{K} \beta_i e_i    (7)

These equations can be written more compactly by including the global means, C_0 and e_0, of those spaces in the summations and setting the weights \alpha_0 and \beta_0 to 1:

\mu^c = \sum_{i=0}^{K} \alpha_i C_i, \qquad \alpha_0 = 1    (8)

\Sigma^c = \sum_{i=0}^{K} \beta_i e_i, \qquad \beta_0 = 1    (9)
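As a small illustration of Eqs. 6-9, the sketch below projects a cepstral mean vector onto an orthonormal Eigen basis and reconstructs it. The 2-D basis, the vectors, and the helper names are our toy assumptions, standing in for the PCA basis of the pooled clean means.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(mu_c, C0, basis):
    """Weights alpha_i = C_i . (mu_c - C0) for an orthonormal basis (Eq. 6)."""
    centered = [m - c for m, c in zip(mu_c, C0)]
    return [dot(Ci, centered) for Ci in basis]

def reconstruct(alpha, C0, basis):
    """mu_c = C0 + sum_i alpha_i C_i (Eq. 6)."""
    out = list(C0)
    for a, Ci in zip(alpha, basis):
        out = [o + a * c for o, c in zip(out, Ci)]
    return out
```

With the full basis the reconstruction is exact; truncating to fewer basis vectors gives the speed/accuracy trade-off exploited in Sec. 4.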
Substituting Eq. 9 into Eq. 1 for the covariance yields the following expression:

\Sigma^l = \sum_{i=0}^{K} \beta_i \, [F^{-1} e_i (F^{-1})^T]    (10)

That is, the covariance in the log-spectral domain is expressed in terms of the basis functions of the covariance space and the inverse DCT matrix. Substituting Eq. 8 into Eq. 1 yields the mean in the log-spectral domain:

\mu^l = \sum_{i=0}^{K} \alpha_i \, (F^{-1} C_i)    (11)

Substituting Eq. 11 for the mean and Eq. 10 for the covariance into Eq. 2 for the mean results in:

\mu_i = \exp\Big( \sum_{k=0}^{K} \alpha_k [F^{-1} C_k]_i + \frac{\beta_k}{2} [F^{-1} e_k (F^{-1})^T]_{ii} \Big) = \exp\Big( \sum_{k=0}^{K} \alpha_k [u_k]_i + \frac{\beta_k}{2} [V_k]_{ii} \Big)    (12)

where u_k = F^{-1} C_k is the kth prototype vector for the mean vector space, and V_k = F^{-1} e_k (F^{-1})^T is the kth prototype matrix for the covariance space. These sets of vectors and matrices can be precomputed, since their elements are known prior to recognition. Next, substituting Eq. 10 into Eq. 2 for the covariance gives the result in the linear spectral domain:

\Sigma_{ij} = \mu_i \mu_j \Big( \exp\Big[ \sum_{k=0}^{K} \beta_k [F^{-1} e_k (F^{-1})^T]_{ij} \Big] - 1 \Big) = \mu_i \mu_j \Big( \exp\Big[ \sum_{k=0}^{K} \beta_k [V_k]_{ij} \Big] - 1 \Big)    (13)

Again, once the projections of the mean and diagonal covariance vectors in their respective Eigen spaces are found, these quantities in the linear spectral domain are readily available.

4. EVALUATIONS

4.1 Memory Requirements

The memory requirements for the compensation of the HMM static mean and covariance parameters are as follows. Let p denote the number of static parameters used in the feature vector, and K the dimension of the Eigen space used in the projection process (with K <= p). For the mean vector space, we have K+1 vectors u_k of length p. For the diagonal covariance vector space, we have K+1 matrices V_k of size p x p. Therefore, the upper bound on the memory requirement is 2p^2 + p(p+1) + p^2(p+1) floating point numbers. For a typical p value of 13, the storage requirement is 2886 floating point numbers.

4.2 Computational Savings

In Table 1, the numbers of operations (additions, multiplications, and exponentials) needed to map a single distribution from the cepstrum into the linear spectral domain are listed for both baseline PMC and PCA-PMC. Here, N denotes the size of the filterbank, typically in the range 19-24; p stands for the number of static parameters in the feature vectors (typically 12 or 13); and K denotes the number of Eigen vectors retained for finding the projections. In baseline PMC, N is set to 24 and p to 13. Using these settings with K equal to 13 yields the results listed in Table 1 for baseline PMC and PCA-PMC. As seen in the table, the number of additions for PCA-PMC is 42% less than for baseline PMC, and the number of multiplications for PCA-PMC is 33% of that of baseline PMC.

Additionally, in mapping from the cepstrum to the log-spectral domain, we note that the number of static parameters is typically 13, whereas the number of filterbank channels is 24. In baseline PMC, the cepstrum parameters are zero-padded to equalize their length with the filterbank size; therefore, an N x N DCT or inverse DCT matrix is used for the transformation. However, padding zeros does not provide any additional information that affects recognition performance, yet it increases the computational complexity. Instead, we propose to use rectangular DCT and inverse DCT matrices, of sizes p x N and N x p respectively, which are right and left inverses of each other. These matrices do not degrade performance, yet they decrease the computational complexity significantly, both in the forward mapping to the linear power spectral domain and in the mapping back to the cepstrum domain.

For baseline PMC, the computational requirements to map the cepstrum parameters into the linear power spectral domain are roughly equal to those of mapping back to the cepstrum domain after combining with the noise. Most current CPUs compute an addition or a multiplication in a single clock cycle. Therefore, the ratio of the sums of the numbers of multiplications, additions, and exponentials (converted to the cost of multiplications) for the two techniques gives their relative theoretical computational cost. Note that the number of exponential operations is the same for both techniques. For a full cycle of PMC (i.e., going to the linear power spectral domain and coming back to the cepstrum domain), the baseline number of multiplications is 29400 x 2 = 58800, and the number of additions is 13848 x 2 = 27696. For PCA-PMC using rectangular DCT and inverse DCT matrices, the number of multiplications required for a full run of PMC is 9914 + 16464 = 26378, and the number of additions is 7248 + 8112 = 15360. Since the number of exponentials is equal in both methods, the upper bound for the speed improvement is (58800 + 27696) / (26378 + 15360) = 2.07. Therefore, we expect the new method to be around twice as fast as the baseline method.

4.3 Experimental Evaluations

The experimental evaluation task considered is context independent phone recognition using an 8 kHz band-limited version of the TIMIT database. We employed two noise sources: automobile highway noise (HWY) and flat communications channel noise (FLN). The SNR is set to 5 dB for HWY and 20 dB for FLN in both the matched and mismatched simulations. Clean HMMs are compensated with these noise sources to create corrupted speech models. Noise observations are obtained from the first 8-10 frames of one of the test files.

The CPU time, measured in seconds of the CPU clock, is accumulated from the moment a model is loaded into memory until it is mapped back to the cepstrum domain. During this time there were no disk accesses for either technique. All simulations were conducted on a Sun Ultra-10 workstation. A continuous density HMM system was used in this study, in which 45 phonemes are modeled with a 3-state, left-to-right HMM topology. For each HMM
state, between 4 and 16 mixture densities are used, depending on the available training data, to characterize the observation probability densities. The speech waveform is parametrized every 10 msec by a vector consisting of 12 static Mel-frequency cepstral coefficients (MFCC), 12 delta MFCC, c0, and delta c0. The first row of Table 2 shows simulation results for clean conditions. The remaining rows present the mismatched case, where the training data is clean but the test data is noisy, and the matched case, where both training and test data are noisy. Additionally, the recognition results for the PMC compensated models are presented for each noise source. PMC compensation improves the phone recognition results for all noise types, with recognition scores roughly midway between the matched and mismatched cases. In Table 3, we present the computation time taken by both techniques, in seconds of the CPU clock, for a full run of PMC. The total number of distributions in the phone recognition system is 2136. As seen in the table, the PCA-PMC system using rectangular DCT and inverse DCT matrices is 1.9 times faster than baseline PMC. This result is in agreement with the theoretical upper bound (2.07). Note that since the corrupted speech models produced by the two techniques are equal, the recognition accuracies are equivalent. The new framework further allows a trade-off between recognition accuracy and the number of Eigen vectors used to define the Eigen space. In our case, the first few Eigen vectors for the mean and covariance contain virtually all of the energy. For example, the first six Eigen vectors for the mean contain 93.41/93.93 = 99.44% of the energy, and the first six Eigen vectors for the covariance contain 171.65/171.70 = 99.97% of the energy. Therefore, we may use fewer Eigen vectors (a smaller K value) without significant degradation in recognition accuracy, to further increase speed.

Table 1: Number of operations used in mapping the mean and covariance from the cepstrum to the linear power spectral domain for a single distribution. N stands for the filterbank size, p for the number of static parameters, and K for the number of Eigen vectors used in the projection (K = 13, N = 24, p = 13).

                   Multiplications                  Additions                         exp
  Baseline PMC     2N^3 + 3N^2 + N = 29400          N^2(N-1) + N(N-1) + 2N = 13848    N^2 + N
  PCA based PMC    9914                             8112                              N^2 + N

Table 2: Context independent phone recognition results (%) for different conditions.

                   Accuracy   Correct   Deletion   Insertion
  Clean            56.11      62.17     10.57      6.06
  5 dB HWY Noise
    Mismatched     23.06      33.34     17.88      10.27
    Matched        49.14      55.11     12.70      5.97
    PMC comp.      34.59      42.84     16.12      8.24
  20 dB FLN Noise
    Mismatched     22.33      29.97     20.65      7.63
    Matched        47.62      54.54     12.13      6.91
    PMC comp.      32.18      39.64     17.48      7.46

Table 3: Comparison of the experimentally evaluated speeds of the baseline PMC and PCA-PMC methods. Experiments on TIMIT; distribution count: 2136 (K = 13, N = 24, p = 13).

                          Baseline PMC   PCA based PMC   Speed Ratio
  Seconds of CPU clock    6.46           3.40            1.9
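The rectangular DCT pair of Sec. 4.2 can be checked numerically. In the sketch below (pure Python; the orthonormal DCT-II construction is our choice of convention), F is p x N and its transpose F^{-1} is N x p; F F^{-1} = I_p because the rows of the orthonormal DCT matrix are orthonormal, while F^{-1} F is only a projection, which is why N x N zero-padding work is avoided without losing any of the p cepstral parameters.

```python
import math

def dct_matrix(p, N):
    """First p rows of the orthonormal N-point DCT-II matrix (p x N)."""
    F = []
    for k in range(p):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        F.append([scale * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                  for n in range(N)])
    return F

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]
```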
Using only 7 Eigen vectors resulted in recognition scores within 1% of baseline PMC while increasing the speedup factor from 1.9 to 2.15 for both noise sources considered.
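This K-versus-accuracy trade-off can be driven directly by the Eigen value energies, as in the 99.44% / 99.97% figures above. A minimal sketch (pure Python; the Eigen value list in the test is illustrative, not the paper's):

```python
def choose_k(eigvals, energy_threshold=0.99):
    """Smallest K such that the first K Eigen values (sorted in descending
    order) carry at least the given fraction of the total energy."""
    total = sum(eigvals)
    running = 0.0
    for k, ev in enumerate(eigvals, start=1):
        running += ev
        if running / total >= energy_threshold:
            return k
    return len(eigvals)
```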
5. CONCLUSIONS AND FUTURE WORK
In this study, a new framework for fast PMC is presented. The new PCA-PMC method exploits a priori knowledge of the clean HMM models using PCA, and uses efficient rectangular DCT and inverse DCT matrices for the transformations. Additionally, the method is flexible, allowing a further increase in speed at the expense of a slight reduction in recognition accuracy. Keeping recognition accuracy constant, the new method increases computational speed by a factor of 1.9 over baseline PMC, independent of the database used for recognition, at the expense of a small memory requirement (11 KB) to store the Eigen vectors, prototype vectors, and matrices. The computationally intense steps are done off-line, saving significant time during on-line adaptation. Currently, we are working on a fast formulation to map noise corrupted models from the linear power spectral domain back to the cepstrum domain.
ACKNOWLEDGEMENT
The authors extend their thanks to Dr. Jean-Claude Junqua of STL, Panasonic Inc., for valuable technical discussions during a summer internship supported by STL.
References
[1] D. Mansour and B.H. Juang, "The Short-time Modified Coherence Representation and Noisy Speech Recognition," IEEE Trans. ASSP, vol. 37, pp. 795-804, 1989.
[2] J.H.L. Hansen and M. Clements, "Constrained Iterative Speech Enhancement with Applications to Speech Recognition," IEEE Trans. Signal Processing, vol. 39, no. 4, pp. 795-805, 1991.
[3] A.P. Varga and R.K. Moore, "HMM Decomposition of Speech and Noise," ICASSP-90, pp. 845-848.
[4] M.J.F. Gales and S.J. Young, "Cepstral Parameter Compensation for HMM Recognition in Noise," Speech Communication, vol. 12, pp. 231-240, 1993.
[5] R. Kuhn, et al., "Fast Speaker Adaptation Using a priori Knowledge," ICASSP-99, pp. 749-752.
[6] M.J.F. Gales and S.J. Young, "HMM Recognition in Noise Using PMC," Eurospeech-93, pp. 837-840.
[7] M.J.F. Gales and S.J. Young, "A Fast and Flexible Implementation of PMC," ICASSP-95, pp. 133-136, 1995.
[8] S. Crafa, L. Fissore and C. Vair, "Data-Driven PMC and Bayesian Learning Integration for Fast Model Adaptation in Noisy Conditions," ICSLP-98, pp. 471-474.
[9] Y. Komori, T. Kosaka, H. Yamamoto, and M. Yamada, "Fast Parallel Model Combination Noise Adaptation Processing," EUROSPEECH-97, pp. 1527-1530.
[10] I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, 1986.