Confidence Scoring and Rejection using Multi-Pass Speech Recognition

Vincent Vanhoucke
Nuance Communications, Menlo Park, CA, USA
[email protected]

Abstract

This paper presents a computationally efficient method for using multiple speech recognizers in a multi-pass framework to improve the rejection performance of an automatic speech recognition system. A set of criteria is proposed which determines at run time when rescoring using a second pass is expected to improve the rejection performance. The second pass result is used along with a set of features derived from the first pass to compute a combined confidence score. The feature combination is optimized globally based on training data. The combined system significantly outperforms a simple two-pass system at little more computational cost than comparable one-pass and two-pass systems.

1. Introduction

The determination of whether a recognition hypothesis H is the correct one based on the input speech signal O can be performed with reasonable accuracy by computing an estimate of its posterior probability:

p(H|O) = p(O|H) p(H) / p(O)

For computational reasons, this posterior is typically an approximation derived from statistics collected during the decoding of the speech [3]. Two avenues have proven very promising for improving the confidence scores derived from the acoustic posterior:

1. the combination of the scores with other statistics derived from the recognizer [6],

2. the combination of the outputs of multiple recognizers [4, 5].

The combination of multiple recognizers is a very powerful technique, but it can be computationally very expensive if all the recognizers are to be run in parallel for every utterance. It is more efficient to cascade the systems into a multi-pass framework, where an inexpensive recognizer is run as a first pass, followed by a rescoring of its output using a set of more detailed second passes. By limiting the number of utterances actually exercising this rescoring step, the average computational expense of the overall system can be constrained to be just slightly above the cost of a one-pass system, with much improved overall accuracy [7].

When it comes to improving the rejection performance of the system, however, the optimization of a multi-pass system has to obey a different set of requirements than those of a system optimized for accuracy alone. For example, a system optimized for accuracy will want to determine which recognizer (the first pass or the second pass) has produced a correct answer. If the first pass is expected to be correct, then the second pass does not have to be run. When it comes to confidence, however, the output of both recognizers is potentially of interest, since a disagreement between the two outputs could indicate that the result is questionable. The fact that the system determined that it needed to run a second pass is also an indication that the confidence in the result of the first pass should be low.

In the following, we show how the logic governing a multi-pass system optimized for accuracy can be enhanced to take into account the requirements of improving rejection. Section 2 describes a simplified baseline multi-pass system. Section 3 introduces a set of multi-pass features which can be added to the computation of the confidence score. Section 4 introduces a modification to the multi-pass logic which further improves the rejection performance.

2. Baseline Multi-Pass System

The baseline system we consider uses a very simple multi-pass strategy, depicted in Figure 1. A first pass recognizer (Pass I) is run, and a set of hypotheses is produced, along with an acoustic posterior P for each of them. An additional likelihood margin D is computed, which measures how close the alternative hypotheses are to the best hypothesis. A small D indicates that there are possible alternatives to the best hypothesis in the grammar, and hence that rescoring with better models could help disambiguate between them. A large D indicates a lack of likely competing hypotheses in the grammar, which makes any attempt at rescoring to improve the recognition result superfluous. Based on a threshold ∆, we determine whether or not it is necessary to run a rescoring pass (Pass II). Finally, a threshold Θ is applied to the posterior P of the best hypothesis to determine whether to accept or reject it. This threshold is determined by the operating requirements of the system in terms of correct accept (CA) rate vs. false accept (FA) rate.
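As an illustration, the following Python sketch outlines this baseline decision logic. The recognizer callables run_pass_one and run_pass_two and the threshold values are hypothetical placeholders; the paper does not specify an implementation.

```python
# Minimal sketch of the baseline multi-pass decision logic (Figure 1).
# run_pass_one / run_pass_two and the threshold values are hypothetical
# placeholders, not part of the original system.

def recognize(audio, run_pass_one, run_pass_two, delta=0.3, theta=0.7):
    """Return (hypothesis, accepted) for one utterance."""
    # Pass I: best hypothesis, its acoustic posterior P, and the
    # likelihood margin D to the closest competing hypothesis.
    hyp, posterior, margin = run_pass_one(audio)

    # A small margin means plausible competitors exist in the grammar,
    # so rescoring with more detailed models may disambiguate them.
    if margin < delta:
        hyp, posterior = run_pass_two(audio)

    # Accept or reject based on the posterior of the final hypothesis.
    return hyp, posterior >= theta
```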

[Figure 1: Baseline system. Acoustic posterior (y-axis, 0.0 to 1.0) vs. likelihood margin (x-axis, max to min): hypotheses with posterior above Θ are accepted, using Pass I when D > ∆ and Pass II otherwise.]

3. Multi-Pass Confidence Features

3.1. Second Pass Confidence Penalty

The two measures of confidence (acoustic posterior P and likelihood margin D) provide different information about the top recognition hypothesis. P evaluates whether it is a good acoustic match, while D measures whether other hypotheses would be good candidates as well, and thus whether the recognizer picked its top hypothesis by chance. A simple but effective method for combining the two sources of information is to define a penalty:

δΘ = Θ2 − Θ1

which is applied to the confidence score of the utterances which are rescored using the second pass. This amounts to raising the acoustic confidence threshold by δΘ for utterances going to the second pass, as depicted in Figure 2. The value of δΘ can be learned simply by collecting, for some representative data, the information:

• the acoustic posterior P of the best hypothesis so far,

• whether D > ∆,

• whether the best hypothesis is correct or incorrect,

and applying Linear Discriminant Analysis (LDA, see e.g. [8, Chapter 4]) to the data to best separate the correct utterances from the incorrect ones. In practice, this optimization provides a very robust estimate of δΘ which is very stable across values of Θ1, and very consistent across test sets.

[Figure 2: System with Second Pass Confidence Penalty. Same axes as Figure 1, with the accept threshold raised from Θ1 to Θ2 in the Pass II region.]
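To make the LDA step concrete, here is one way δΘ could be estimated from dev-set statistics. The use of scikit-learn, the function name, and the derivation of the penalty from the ratio of the LDA weights are illustrative assumptions; the paper specifies only that LDA is applied to the three quantities above.

```python
# Illustrative estimate of deltaTheta via LDA (scikit-learn assumed).
# Inputs are per-utterance dev-set statistics as listed in Section 3.1.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def learn_delta_theta(posterior, margin_large, correct):
    """posterior:    acoustic posterior P of the best hypothesis
    margin_large: 1 if D > Delta (Pass I only), else 0 (rescored)
    correct:      1 if the best hypothesis was correct, else 0
    """
    X = np.column_stack([posterior, margin_large])
    lda = LinearDiscriminantAnalysis().fit(X, correct)
    w_p, w_m = lda.coef_[0]
    # The boundary w_p * P + w_m * m = c gives an effective posterior
    # threshold of (c - w_m * m) / w_p; flipping m from 1 (Pass I only)
    # to 0 (rescored) shifts that threshold by w_m / w_p = Theta2 - Theta1.
    return w_m / w_p
```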

3.2. Recognizer Disagreement Confidence Penalty

When the second pass recognizer is run, it provides an estimate of the recognition result which is relatively independent of the first pass result. As a consequence, whether the two recognizers agree on the resulting output can be expected to be a very salient feature for confidence scoring. As previously, we incorporate this information by calculating another penalty:

δ′Θ = Φ − Θ2

to be applied to the confidence, as depicted in Figure 3. The value of δ′Θ can be learned jointly with δΘ by collecting, for some representative data, the information:

• the acoustic posterior P of the best hypothesis so far,

• whether D > ∆,

• whether Pass I agrees with Pass II,

• whether the best hypothesis is correct or incorrect,

and applying LDA to the data.

[Figure 3: System with both Confidence Penalties. Same axes as Figure 1; in the Pass II region, hypotheses with posterior above Φ are accepted, those between Θ2 and Φ are accepted only if Pass I and Pass II agree, and those below Θ2 are rejected.]
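The combined accept/reject rule implied by Figure 3 can be written compactly as follows. The threshold values below are hypothetical, and the function signature is an illustration rather than the paper's implementation; only the ordering Θ1 < Θ2 < Φ matters.

```python
# Sketch of the combined decision rule of Figure 3. The numeric
# defaults are hypothetical placeholders.

def accept(posterior, used_pass_two, passes_agree,
           theta1=0.70, delta_theta=0.05, delta_theta_prime=0.10):
    threshold = theta1                         # Pass I only: Theta1
    if used_pass_two:
        threshold += delta_theta               # rescored: Theta2
        if not passes_agree:
            threshold += delta_theta_prime     # disagreement: Phi
    return posterior >= threshold
```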

4. Multi-Pass Confirmation Logic

Because of the random effect that beam pruning has on utterances which are not well modeled in the search space, it is often observed that an out-of-grammar (OOG) utterance matches one specific recognition hypothesis, possibly with little or nothing to do with the actual speech, much better than the others. In this case, the acoustic confidence of the utterance is low, while the likelihood margin is large, instructing the system not to exercise the rescoring pass. By forcing the second pass to run in such situations, a “second opinion” can be obtained using very distinct acoustic models, which will exhibit a different behavior when presented with OOG data. For simplicity, the same penalty:

δ′Θ = Φ1 − Θ1 = Φ2 − Θ2

as the one applied in the second pass is applied to the confidence score of those utterances, as depicted in Figure 4. This penalty could equally well be learned from data as in the previous cases. This strategy forces the second pass to run more often than in the baseline, but the additional cost is only incurred when the likelihood of a misrecognition is high.

[Figure 4: System with Second Pass Confirmation. Same axes as Figure 1; in both the Pass I and Pass II regions, hypotheses with posterior between Θ and Φ are accepted only if Pass I and Pass II agree.]
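Putting the pieces together, the following sketch extends the earlier baseline logic with the confirmation band of Figure 4. As before, the recognizer callables and threshold values are hypothetical placeholders, and returning Pass I's hypothesis from the forced-confirmation branch is an assumption about the intended behavior.

```python
# Sketch of the full decision logic with confirmation (Figure 4).
# run_pass_one / run_pass_two and all threshold values are hypothetical.

def recognize_with_confirmation(audio, run_pass_one, run_pass_two,
                                delta=0.3, theta1=0.70,
                                d_theta=0.05, d_theta_prime=0.10):
    """Return (hypothesis, accepted) using the Figure 4 decision regions."""
    hyp1, posterior, margin = run_pass_one(audio)
    rescored = margin < delta
    if rescored:
        hyp, posterior = run_pass_two(audio)   # normal rescoring path
        theta = theta1 + d_theta               # Theta2
    else:
        hyp, theta = hyp1, theta1              # Theta1
    phi = theta + d_theta_prime                # Phi1 or Phi2

    if posterior < theta:
        return hyp, False                      # reject outright
    if posterior >= phi:
        return hyp, True                       # accept outright
    # Confirmation band [Theta, Phi): force a second opinion if Pass II
    # has not been run yet, and accept only if the two passes agree.
    hyp2 = hyp if rescored else run_pass_two(audio)[0]
    return hyp, hyp1 == hyp2
```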

5. Experiments

5.1. Experimental Setup

The experiments were run on a speaker-independent American English system. The first pass recognition engine is a context-dependent HMM system with 18000 triphones and tied mixtures based on Genones [1]: each state cluster shares a common set of Gaussians called a Genone, while the mixture weights are state-dependent. The system uses 2000 Genones and 32 Gaussians per Genone. The models are trained using Maximum Mutual Information Estimation [9], and use Mixtures of Inverse Covariances [2] as a covariance model. The second pass recognizers are one male and one female recognizer with comparable parametrizations, but trained using maximum likelihood. The features are 27-dimensional, including MFCC, ∆ and ∆∆ coefficients. The first test set is a collection of 50000 utterances from a business listings recognition task, including 20% of OOG data collected on the same task. The second test set is a collection of 40000 utterances from a variety of tasks, including digit strings, stock quotes, and city names, with 50% of OOG data collected on the same tasks. In each case, the language model is a rule-based grammar specifically built and tuned for the corresponding task.

5.2. Rejection Performance on Business Listings

Figure 5 shows the rejection performance of the various configurations on the business listings test set. The graph depicts the Correct Accept (CA) rate, i.e. the number of semantically correct utterances which were accepted by the system as a percentage of the total number of utterances, as a function of the False Accept (FA) rate, i.e. the number of semantically incorrect utterances which were accepted by the system, also as a percentage of the total number of utterances. A perfect system would have its CA/FA curve follow the left vertical axis (0% FA) and the top horizontal axis (100% CA).

[Figure 5: Rejection performance of the various systems on business listings. Curves: first pass only; two-pass baseline; two-pass with second pass penalty; two-pass with both penalties; two-pass with second pass confirmation. Axes: False Accepts (%) vs. Correct Accepts (%).]

The best system, using second pass confirmation, leads to a 3 to 5% absolute improvement in the CA rate at a given FA rate, or conversely a 0.2 to 0.4% absolute reduction in the FA rate at a given CA rate.

5.3. Impact on Decoding Speed

Figure 6 shows the percentage of utterances, represented by the length of the vertical bars, which use the second pass rescoring at each operating point. The system with confirmation uses a second pass rescoring at most 11% of the time on this task, while the baseline system uses the second pass at most 6% of the time. This increase has a negligible effect on the total efficiency of the system.

[Figure 6: Percentage of the data going to the second pass (vertical bars) as a function of the operating point, for the baseline and confirmation systems.]

5.4. Rejection Performance at High OOG Rate

Figure 7 shows the rejection performance of the various configurations on the mixed test set with 50% of OOG utterances.

[Figure 7: Rejection performance of the various systems on tasks with high OOG rate. Same curves and axes as Figure 5.]

At a very low 1% FA rate, the CA rate of the best system is more than 12% (absolute) better than that of the one-pass system, and more than 8% (absolute) better than that of the baseline two-pass system. The one-pass system could only achieve the same CA rate at the cost of doubling the number of false accepts.

6. Conclusion

This paper presents a simple, efficient method for leveraging two-pass rescoring for the purpose of improving rejection performance. The system uses a simple decision logic based on the acoustic posterior and likelihood margin of the top hypothesis of the first-pass recognizer to determine whether the rescoring pass should be run. The combined output of the two recognizers determines the final recognition result as well as its confidence score based on a set of penalties, which depend on both the likelihood margin and the semantic agreement between the two passes. The value of these penalties can be derived from data using LDA or other classification methods. The resulting system provides much better rejection performance at a very small computational cost.

7. Acknowledgments

The author would like to thank Brian Strope, Mitch Weintraub and Larry Heck for their input.

8. References

[1] Digalakis, V., Monaco, P. and Murveit, H., “Genones: Generalized mixture tying in continuous hidden Markov model-based speech recognizers,” IEEE Trans. on Speech and Audio Processing, vol. 4, no. 4, pp. 281–289, 1996.

[2] Vanhoucke, V. and Sankar, A., “Mixtures of Inverse Covariances,” IEEE Trans. on Speech and Audio Processing, vol. 12, no. 3, May 2004.

[3] Evermann, G. and Woodland, P.C., “Large Vocabulary Decoding and Confidence Estimation using Word Posterior Probabilities,” Proceedings of ICASSP’00, pp. 2366–2369, Istanbul, 2000.

[4] Evermann, G. and Woodland, P.C., “Posterior Probability Decoding, Confidence Estimation, and System Combination,” Proceedings of the NIST Speech Transcription Workshop, College Park, MD, 2000.

[5] Sankar, A., “Bayesian Model Combination (BAYCOM) for Improved Recognition,” Proceedings of ICASSP’05, 2005.

[6] Hazen, T.J., Burianek, T., Polifroni, J. and Seneff, S., “Recognition Confidence Scoring for Use in Speech Understanding Systems,” Proceedings of ASR’00, 2000.

[7] Mao, M.Z., Vanhoucke, V. and Strope, B., “Automatic Training Set Segmentation for Multi-Pass Speech Recognition,” Proceedings of ICASSP’05, 2005.

[8] Webb, A., Statistical Pattern Recognition, Arnold, 1999.

[9] Woodland, P.C. and Povey, D., “Large Scale Discriminative Training for Speech Recognition,” Proceedings of the ISCA ITRW on Automatic Speech Recognition: Challenges for the New Millennium, pp. 7–16, Paris, 2000.
