IEEE SIGNAL PROCESSING LETTERS, VOL. 8, NO. 8, AUGUST 2001


Fast Likelihood Computation Techniques in Nearest-Neighbor Based Search for Continuous Speech Recognition

Bryan L. Pellom, Ruhi Sarikaya, and John H. L. Hansen

Abstract—This paper describes two effective algorithms that reduce the computational complexity of state likelihood computation in Gaussian-mixture-based speech recognition systems. We consider a baseline recognition system that uses nearest-neighbor search and partial distance elimination (PDE) to compute state likelihoods. The first algorithm exploits the high dependence exhibited among consecutive feature vectors to predict the best scoring mixture for each state. The method, termed best mixture prediction (BMP), leads to further speed improvement over the PDE technique. The second technique, termed feature component reordering (FCR), takes advantage of the variable contribution levels made to the final distortion score by each dimension of the feature and mean vectors. The combination of the two techniques with PDE reduces the computation time for likelihoods by 29.8% over baseline likelihood computation. The algorithms are shown to yield the same accuracy level without additional memory requirements on the November 1992 ARPA Wall Street Journal (WSJ) task.

I. INTRODUCTION

IN recent years, the majority of large vocabulary continuous speech recognition (LVCSR) systems developed have been based on continuous density hidden Markov models (CDHMMs). However, these systems typically operate slower than real time, which renders them impractical for some applications (e.g., dialogue systems, dictation). One of the most computationally expensive steps in CDHMM-based speech recognition systems is the state likelihood computation. Typically, these computations take up a major proportion (30-70%) of the overall recognition time. This is due to the number of Gaussian mixtures (4-64) used to model each state, and to the fact that the Gaussians associated with each active state must be evaluated for each frame of data. There are a number of techniques which attempt to reduce the computational requirements of the likelihood calculation while maintaining comparable recognition performance. One set of techniques reduces the dimension of the feature vector using linear discriminant analysis (LDA) [8]. Another class of techniques relies on Gaussian clustering for fast likelihood calculation [3], [4]. An elegant technique has been proposed by Seide [1], where the likelihood computation is formulated as a nearest-neighbor tree search. Space partitioning techniques using a k-dimensional (k-d) binary search tree have also been used to improve likelihood computation [5], [7]. However,

these methods can lead to degradation in recognition accuracy and an increase in memory requirements.

II. NEAREST-NEIGHBOR APPROXIMATION

The likelihood of an HMM state $j$ for a given input feature vector $\mathbf{o}_t$ can be expressed as a weighted sum of likelihoods from individual Gaussian densities

$$ b_j(\mathbf{o}_t) = \sum_{k=1}^{M} c_{jk}\,\mathcal{N}(\mathbf{o}_t;\boldsymbol{\mu}_{jk},\boldsymbol{\Sigma}_{jk}) \qquad (1) $$

where $M$, $c_{jk}$, $\boldsymbol{\mu}_{jk}$, $\boldsymbol{\Sigma}_{jk}$, and $D$ represent the number of multivariate Gaussians, the mixture weight, mean, covariance matrix, and feature vector dimension for the $k$th density in state $j$. In practice, modeling states with diagonal covariances is preferred over full covariance matrices for computational and data-sparsity reasons. Therefore, (1) becomes

$$ b_j(\mathbf{o}_t) = \sum_{k=1}^{M} c_{jk} K_{jk} \exp\!\left(-\frac{1}{2}\sum_{i=1}^{D}\frac{(o_{t,i}-\mu_{jk,i})^2}{\sigma_{jk,i}^2}\right) \qquad (2) $$

where $K_{jk}$ is a constant for each density. Since all terms that comprise $K_{jk}$ are known prior to recognition, it can be precomputed. Computation of $b_j(\mathbf{o}_t)$ is expensive due to the multiplications, divisions, and exponential operations required in (2). One common technique for reducing the computational overhead is to use a nearest-neighbor [1] approximation of the likelihood. Instead of computing the likelihood by summing across all mixtures, the nearest (best scoring) mixture probability is taken as the state probability. In the log-domain, the nearest-neighbor approximation is given by

$$ \log b_j(\mathbf{o}_t) \approx \max_{1\le k\le M}\left[\log(c_{jk}K_{jk}) - \frac{1}{2}\sum_{i=1}^{D}\frac{(o_{t,i}-\mu_{jk,i})^2}{\sigma_{jk,i}^2}\right] \qquad (3) $$

Manuscript received July 5, 2000. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. A. S. Spanias. The authors are with the Center for Speech and Language Research, University of Colorado, Boulder, CO 80309 USA (e-mail: [email protected]).
Publisher Item Identifier S 1070-9908(01)05236-1.


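As an illustration of (2) and (3), the following is a minimal C sketch of the nearest-neighbor state likelihood, not the actual recognizer code. The `Gaussian` structure, the constant `D`, and the function name are assumptions for the sketch; the mixture constants and inverse variances are assumed precomputed as described above.

```c
#include <float.h>

#define D 39  /* feature dimension: 12 MFCCs + energy, plus first/second differences */

typedef struct {
    double logc;       /* C_jk = log(c_jk * K_jk), precomputed before recognition */
    double mean[D];    /* mu_jk */
    double invvar[D];  /* v_jk,i = 1/sigma_jk,i^2, so division becomes multiplication */
} Gaussian;

/* Nearest-neighbor approximation (3): take the score of the single best
 * scoring mixture as the state log-likelihood instead of log-summing (2). */
double state_loglik_nn(const Gaussian *mix, int n_mix, const double *obs)
{
    double best = -DBL_MAX;
    for (int k = 0; k < n_mix; k++) {
        double dist = 0.0;
        for (int i = 0; i < D; i++) {
            double d = obs[i] - mix[k].mean[i];
            dist += d * d * mix[k].invvar[i];  /* weighted squared error */
        }
        double score = mix[k].logc - 0.5 * dist;
        if (score > best)
            best = score;
    }
    return best;
}
```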


III. PARTIAL DISTANCE ELIMINATION

Given this framework, we can think of the nearest-neighbor search problem as a vector quantizer (VQ) codebook search problem, where the densities in a state are the codewords and the distortion measure is given by the right side of (3). Equation (3) involves joint maximization of the mixture constant and minimization of the weighted Euclidean distance given by the second term, over the densities belonging to that state. If we denote $\log(c_{jk}K_{jk})$ as $C_{jk}$, then we can rewrite (3) as

$$ \log b_j(\mathbf{o}_t) \approx \max_{1\le k\le M}\left[C_{jk} - \frac{1}{2}\sum_{i=1}^{D}(o_{t,i}-\mu_{jk,i})^2\, v_{jk,i}\right] \qquad (4) $$

where $v_{jk,i}$ is $1/\sigma_{jk,i}^2$. Note that, in general, division is computationally more expensive than multiplication. Therefore, we can load the inverse variance at run time and use multiplication instead of division. Contrary to the usual VQ search problems, which minimize some type of distortion measure to find the best codeword, here we maximize the distortion measure, even though we perform minimization with the second term on the right side of the equation. Now, PDE can be used. Note that the weighted (by the inverse variance) squared error is a separable measure and can be evaluated component-wise. Therefore, we can decide whether a codeword can have the maximal score without always computing over the full range of feature vector elements. Before finishing the computation of a complete distortion for any density, if the score accumulated over the first components of the input vector already falls below the highest score yet found in the search, that codeword is eliminated. This early test reduces the more expensive multiplications at the expense of additional comparisons for each component in the feature vector. As more mixtures are computed, a better estimate of the maximum is obtained, and more mixtures are eliminated without exhausting all the components in the feature vector. Note that the first mixture must be computed over all components to get an initial estimate of the maximum.
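A minimal sketch of PDE for the maximization in (4), under the same assumptions as the earlier sketch (the hypothetical `Gaussian` structure and constant `D`). Because each accumulated term can only lower the score, the inner loop abandons a mixture as soon as its partial score drops to or below the running maximum; the first mixture is evaluated in full automatically since the maximum starts at `-DBL_MAX`.

```c
#include <float.h>

/* PDE search over the mixtures of one state: per-component early
 * elimination trades one comparison for the remaining multiplications. */
double state_loglik_pde(const Gaussian *mix, int n_mix, const double *obs)
{
    double best = -DBL_MAX;
    for (int k = 0; k < n_mix; k++) {
        double score = mix[k].logc;
        for (int i = 0; i < D; i++) {
            double d = obs[i] - mix[k].mean[i];
            score -= 0.5 * d * d * mix[k].invvar[i];
            if (score <= best)  /* partial score can no longer win: eliminate */
                break;
        }
        if (score > best)       /* false whenever the inner loop broke early */
            best = score;
    }
    return best;
}
```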

IV. PROPOSED ALGORITHMS

The algorithms proposed in this paper are formulated to calculate the probability given in (3) more efficiently. One unique aspect of the following techniques is that they require almost no additional memory or modifications to the underlying recognition system. Furthermore, they introduce no compromise between recognition accuracy and computational savings; the speed improvement resulting from these techniques comes at no additional cost. Finally, rather than being an alternative to previously proposed techniques [5], [7], these methods complement past approaches for further overall speed improvement.

A. Best Mixture Prediction

Obviously, the efficiency of the PDE technique depends heavily on how quickly a high estimate of the maximum is obtained. This can easily be accomplished by exploiting the high dependence between consecutive feature vectors. Let $\hat{k}_{t-1}(j)$ be the best match for the previous frame for state $j$:

$$ \hat{k}_{t-1}(j) = \arg\max_{1\le k\le M}\left[C_{jk} - \frac{1}{2}\sum_{i=1}^{D}(o_{t-1,i}-\mu_{jk,i})^2\, v_{jk,i}\right] \qquad (5) $$

Because of the overlapping frames used during feature extraction, $\mathbf{o}_{t-1}$ and $\mathbf{o}_t$ are usually similar. Therefore, we expect $\hat{k}_t(j)$ to be close to $\hat{k}_{t-1}(j)$. Predicting the previous best-match Gaussian as the current best match and computing its distortion first yields a high initial estimate of the maximum, which immediately speeds up the elimination process in the best-match search. In order to test the validity of this assumption, we used the entire 410-sentence test set from the WSJ0 database and counted how many times densities are repetitively selected for each state as the best match across time. The average number of likelihood calculations per sentence is 0.67 million, and the average frequency of picking the best mixture in the current frame as the best mixture in the next frame is 0.48 million. Therefore, 71.6% of the time, the Gaussian densities selected at time $t-1$ are selected as the best Gaussians at time $t$.
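A sketch of BMP layered on PDE, again using the hypothetical `Gaussian` structure and `D` from the earlier sketches. The per-state `prev_best` index is an assumed bookkeeping variable, not part of the paper's notation; it seeds the running maximum with the full score of the previous frame's winner before PDE prunes the remaining mixtures.

```c
/* Full score of one density over all D components, used to seed PDE. */
static double full_score(const Gaussian *g, const double *obs)
{
    double dist = 0.0;
    for (int i = 0; i < D; i++) {
        double d = obs[i] - g->mean[i];
        dist += d * d * g->invvar[i];
    }
    return g->logc - 0.5 * dist;
}

/* BMP on top of PDE: score last frame's winning mixture first so the
 * running maximum starts high and eliminations begin immediately. */
double state_loglik_bmp(const Gaussian *mix, int n_mix,
                        const double *obs, int *prev_best)
{
    int first = *prev_best;  /* k-hat from (5): predicted winner */
    double best = full_score(&mix[first], obs);
    int argbest = first;

    for (int k = 0; k < n_mix; k++) {
        if (k == first)
            continue;        /* already scored in full */
        double score = mix[k].logc;
        for (int i = 0; i < D; i++) {
            double d = obs[i] - mix[k].mean[i];
            score -= 0.5 * d * d * mix[k].invvar[i];
            if (score <= best)  /* PDE elimination test */
                break;
        }
        if (score > best) {
            best = score;
            argbest = k;
        }
    }
    *prev_best = argbest;    /* prediction for the next frame */
    return best;
}
```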

B. Feature Component Reordering (FCR)

The second algorithm, feature component reordering (FCR), complements the PDE and BMP techniques. By analyzing the components of the feature vectors and the densities, we observed that some components contribute more heavily to the distortion than others. The idea of this technique is to reorder the components so that the components producing higher distortion are computed first, followed by the components contributing less. Let $m(i)$ be a mapping of the location of component $i$ in the vectors to a new location. The distortion represented by (4) then becomes

$$ \log b_j(\mathbf{o}_t) \approx \max_{1\le k\le M}\left[C_{jk} - \frac{1}{2}\sum_{i=1}^{D}\left(o_{t,m(i)}-\mu_{jk,m(i)}\right)^2 v_{jk,m(i)}\right] \qquad (6) $$

The mapping $m(i)$ can be learned offline from a portion of the development test set and stays fixed during recognition. Our feature vector is composed of 12 mel-frequency cepstral coefficients (MFCCs) with their first-order and second-order differences. These features are augmented with energy and its first- and second-order differences, resulting in a 39-dimensional feature vector. In Fig. 1, the solid curve shows the distortion contributed by individual elements of the feature vectors, averaged over 30 test sentences during recognition. Note that after the first nine to ten sentences, the order of the vector elements stabilizes. The dashed curve shows the reordered distortion. It is interesting to note that energy, the first and second cepstral coefficients, and the delta-energy parameters rank at the top of the reordered sequence, indicating that they contribute more to the likelihood score computation than the others. This ordering speeds up PDE, eliminating earlier those mixture densities whose output probability is smaller than the current best.
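A sketch of FCR combined with PDE, under the same assumptions as the earlier sketches. The `order[]` table stands in for the mapping $m(i)$ of (6); the offline learning step is shown schematically with a hypothetical `avg_dist` accumulator filled from development data, and during recognition the table is fixed and simply redirects the inner loop so that the elimination test fires after fewer components on average.

```c
#include <stdlib.h>

static int order[D];        /* m(i) of (6): loop position -> feature dimension */
static double avg_dist[D];  /* average per-dimension distortion from dev data */

static int cmp_desc(const void *a, const void *b)
{
    double da = avg_dist[*(const int *)a];
    double db = avg_dist[*(const int *)b];
    return (da < db) - (da > db);  /* descending: largest distortion first */
}

/* Offline: after accumulating avg_dist over development sentences,
 * sort the dimension indices so high-distortion components come first. */
void learn_order(void)
{
    for (int i = 0; i < D; i++)
        order[i] = i;
    qsort(order, D, sizeof(int), cmp_desc);
}

/* PDE search with the inner loop visiting components in FCR order. */
double state_loglik_fcr_pde(const Gaussian *mix, int n_mix, const double *obs)
{
    double best = -DBL_MAX;
    for (int k = 0; k < n_mix; k++) {
        double score = mix[k].logc;
        for (int i = 0; i < D; i++) {
            int m = order[i];  /* reordered component */
            double d = obs[m] - mix[k].mean[m];
            score -= 0.5 * d * d * mix[k].invvar[m];
            if (score <= best)  /* early elimination */
                break;
        }
        if (score > best)
            best = score;
    }
    return best;
}
```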



Fig. 1. The solid curve is the average distortion before reordering, and the dashed curve is the distortion of the reordered elements in descending order: {37, 1, 2, 38, 13, 25, 4, 26, 14, 3, 28, 16, 27, 8, 6, 5, 15, 29, 7, 11, 17, 12, 30, 20, 32, 10, 18, 9, 31, 39, 35, 19, 33, 34, 23, 21, 22, 36, 24}.

V. DESCRIPTION OF RECOGNITION SYSTEM

The recognition task for the experiments used the WSJ0-Dev portion of the 1992 ARPA WSJ 5k-vocabulary continuous speech recognition corpus. A cross-word, gender-dependent model set is built using the complete SI-284 WSJ training set. On the signal processing side of the system, speech data, sampled at 16 kHz, are converted to 39-dimensional observation vectors. Acoustic modeling in our recognizer is based on gender-dependent HMMs for base phones and triphones. All HMMs have a three-state left-to-right topology with between six and 16 Gaussian mixtures per state. The lexicon for the recognizer uses a linear sequence of phones to represent the pronunciation of each word in the vocabulary. Since triphone acoustic models are used, these base phone sequences are converted into triphone sequences by taking each base phone together with its left and right context base phones. The language model is a trigram. We use single-pass beam-search Viterbi decoding.

VI. EVALUATIONS

Experimental results are summarized in Table I. There, we refer to the "baseline" as the straightforward computation of the likelihood for each mixture in a state using nearest-neighbor search without PDE. We take this likelihood computation time to be 100%, the reference for the subsequent techniques. Note that recognition accuracy is not listed for the baseline or the other techniques, since it is maintained at its original level. The word error rate (WER) for the baseline as

TABLE I
COMPARISON OF THE EXPERIMENTAL EVALUATION OF THE SPEEDS OF THE NEAREST-NEIGHBOR BASED BASELINE AND THE PROPOSED TECHNIQUES ON THE NOVEMBER 1992 DARPA WSJ EVALUATION USING THE WSJ0-DEV SET

well as the proposed techniques is 11.8%. Incorporating PDE into the likelihood computation step lowered the computation time by 4.0%. The major improvement came from the FCR technique, which provides another 22.0% savings in computation. Finally, including BMP added a further 3.8%. Using all three techniques together eliminates 29.8% of the time spent on likelihood computation. Note that FCR and BMP require the likelihood to be computed via PDE; otherwise, they cannot be incorporated into the baseline scheme. At the bottom of Table I, we summarize the respective average numbers of multiplications and comparisons incurred by each technique. The PDE technique trades two multiplications for two comparisons. Note that as the number of computations needed to compute the likelihood decreases, the timing functions used to measure the CPU clock begin to have more of an effect on the measured computation time. For this



reason, the average savings in multiplications and comparisons should have led to a greater decrease in likelihood computation time for BMP than was observed.

VII. CONCLUSION

In this study, two new techniques within a PDE framework are proposed to reduce the computational time of the likelihood computation in nearest-neighbor based search for LVCSR. The algorithms are called BMP and FCR. The combination of these techniques with PDE reduces the computational complexity of the likelihood computation by 29.8% over straightforward likelihood computation. The computational savings gained by these techniques come at no additional cost, while complementing past approaches for fast likelihood computation.

REFERENCES

[1] F. Seide, "Fast likelihood computation for continuous-mixture densities using a tree-based nearest neighbor search," in Proc. EUROSPEECH-95: Eur. Conf. Speech Technology, Madrid, Spain, 1995, pp. 1079-1082.

[2] L. Fissore, P. Laface, P. Massafra, and F. Ravera, "Analysis and improvement of the partial distance search algorithm," in IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, Minneapolis, MN, 1993, pp. 315-318.
[3] E. Bocchieri, "Vector quantization for efficient computation of continuous density likelihoods," in IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, Minneapolis, MN, 1993, pp. 692-695.
[4] M. J. F. Gales, K. M. Knill, and S. J. Young, "State-based Gaussian selection in large vocabulary continuous speech recognition using HMMs," IEEE Trans. Speech and Audio Processing, vol. 7, pp. 152-161, Jan. 1999.
[5] J. Fritsch and I. Rogina, "The bucket box intersection (BBI) algorithm for fast approximative evaluation of diagonal mixture Gaussians," in IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, Atlanta, GA, 1996, pp. 837-840.
[6] C. D. Bei and R. M. Gray, "An improvement of the minimum distortion encoding algorithm for vector quantization," IEEE Trans. Commun., vol. COM-33, pp. 1132-1133, Oct. 1985.
[7] S. Ortmanns, T. Firzlaff, and H. Ney, "Fast likelihood computation methods for continuous mixture densities in large vocabulary speech recognition," in Proc. EUROSPEECH-97: Eur. Conf. Speech Technology, Rhodes, Greece, 1997, pp. 139-142.
[8] M. Hunt and C. Lefèbvre, "A comparison of several acoustic representations for speech recognition with degraded and undegraded speech," in IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Glasgow, U.K., 1989, pp. 262-265.
