Acoustic Sensitive Language Model Perplexity for Automatic Speech Recognition

Ciprian Chelba∗
Microsoft Research
One Microsoft Way
Redmond, WA 98052
[email protected]

Traditional evaluation of language models (LMs) for automatic speech recognition (ASR) uses either the information-theoretically motivated perplexity (PPL) or the word error rate (WER), measured by plugging the model into a speech recognizer. It is a well-known fact that WER and PPL are poorly correlated. The main reason is probably that PPL measures the predictive power of the LM on correct text, whereas at recognition time the LM needs to discriminate between alternatives suggested by the acoustic model used in the recognizer. Since the LM is estimated using maximum-likelihood methods on correct (well-formed) sentences, it is poorly suited for discriminating among the candidates proposed by the acoustic model. We propose a new evaluation metric for LMs that takes into account the coupling between language model and acoustic model in a given ASR system. The new metric, "acoustic model-sensitive" perplexity (AMS-PPL), aims at allowing one to optimize the LM parameters such that the LM performs best when used with a given acoustic model. The main underlying idea is to estimate the conditional cross-entropy H(W|A) of the correct word sequence W given the acoustic signal A to be decoded.

1 Probability Model

Let us assume that the correct word sequence W = w_1 ... w_n is present^2 in the ASR lattice [2] LAT(A), obtained by a given ASR system running acoustic model P_AM(·) and language model P_LM(·) on utterance A. We wish to calculate the conditional probability P_LAT(w | w_1^k, A) for all vocabulary words w ∈ V and all prefixes w_1^k of W. Let us denote by w_1^k = w_1 ... w_k the k-length prefix of W, and by END(w_1^k, LAT(A)) the set of end nodes of partial paths whose word sequence is w_1^k:

  END(w_1^k, LAT(A)) ≐ { n ∈ LAT(A) : ∃π, start_node(π) = s, end_node(π) = n, word(π) = w_1^k }    (1)

where s denotes the start node of LAT(A) and the word(·) operator extracts the word sequence on a given path. With this notation:

  P_LAT(w | w_1^k, A) = Σ_{n ∈ END(w_1^k, LAT(A))} P_LAT(n | w_1^k, A) · P_LAT(w | n, w_1^k, A)    (2)

Let A_{n_1}^{n_2} denote the vector of acoustic observations between nodes n_1 and n_2. With this notation we have A = A_s^e, and A = A_s^n ∥ A_n^e for every node n in LAT(A) other than the start node s or the end node e; ∥ denotes concatenation.

∗ Attending author. Category: Estimation, Prediction, and Sequence Modeling. Preference: 1. Poster, 2. Oral.
^2 If the correct word sequence is not present in the lattice, one can use the lowest-WER word sequence in the lattice; ties are broken arbitrarily.
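To make Eq. (1) concrete, the set END(w_1^k, LAT(A)) can be computed by walking the lattice while matching the prefix. The sketch below assumes a toy adjacency-list lattice representation; the node names, words, and topology are illustrative assumptions, not taken from the paper:

```python
# Toy lattice: node -> list of (word, next_node) links. This representation
# and the example values are illustrative, not from the paper.
LAT = {
    "s": [("the", 1), ("a", 2)],
    1: [("cat", "e"), ("cab", "e")],
    2: [("cat", "e")],
}

def end_nodes(lat, start, prefix):
    """END(w_1^k, LAT(A)): nodes reachable from `start` by a partial path
    whose word sequence equals `prefix` (Eq. (1))."""
    frontier = {start}
    for w in prefix:
        # Follow only the links whose word matches the next prefix word.
        frontier = {m for n in frontier
                    for word, m in lat.get(n, [])
                    if word == w}
    return frontier

print(end_nodes(LAT, "s", ["the"]))       # nodes reached after emitting "the"
print(end_nodes(LAT, "s", ["a", "cat"]))  # nodes reached after "a cat"
```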


1.1 Prefix-Sensitive Node Posteriors

The first factor in the inner term of the summation in Eq. (2) becomes:

  P_LAT(n | w_1^k, A_s^n ∥ A_n^e) = P_LAT(n, w_1^k, A_s^n ∥ A_n^e) / Σ_{o ∈ END(w_1^k, LAT(A))} P_LAT(o, w_1^k, A_s^o ∥ A_o^e)

Since the underlying model is a hidden Markov model (HMM), the probability needed above can be rewritten as:

  P_LAT(n, w_1^k, A_s^n ∥ A_n^e) = P_LAT(n, w_1^k, A_s^n) · P_LAT(A_n^e | n) = α(n, w_1^k) · β(n)

where:

  α(n, w_1^k) ≐ Σ_{π : start_node(π)=s, end_node(π)=n, word(π)=w_1^k} Π_{i=1}^{length(π)} p(l_i(π))

  β(n) ≐ Σ_{π : start_node(π)=n, end_node(π)=e} Π_{i=1}^{length(π)} p(l_i(π))

β(n) is the familiar backward probability at node n. α(n, w_1^k) is a modified version of the forward probability at node n that takes into account only partial paths whose word string matches the prefix w_1^k; both can be efficiently calculated using standard dynamic programming algorithms. To complete the analogy with the standard forward/backward definitions, we can denote:

  γ(n | w_1^k, A_s^n ∥ A_n^e) ≐ P_LAT(n | w_1^k, A_s^n ∥ A_n^e)    (3)

1.2 Acoustic Sensitive Word Probability

For a given link l leaving node n we can write:

  P_LAT(l | n, A_n^e) = P_LAT(l, A_n^e | n) / Σ_{q : start_node(q)=n} P_LAT(q, A_n^e | n)
                      = P_LAT(l) · β(end_node(l)) / Σ_{q : start_node(q)=n} P_LAT(q) · β(end_node(q))    (4)
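The prefix-constrained forward pass α(n, w_1^k) and the standard backward pass β(n) can each be computed in a single sweep over a topologically sorted link list. The sketch below uses a toy acyclic lattice; the link list and probabilities p(l) are illustrative assumptions, not from the paper:

```python
# Toy acyclic lattice as a topologically ordered link list:
# (from_node, to_node, word, p(l)). Values are made up for illustration.
LINKS = [
    (0, 1, "the", 0.6), (0, 2, "a", 0.4),
    (1, 3, "cat", 0.7), (1, 3, "cab", 0.3),
    (2, 3, "cat", 1.0),
]
START, END = 0, 3

def forward_alpha(links, start, prefix):
    """alpha[(n, k)] = total probability of partial paths start -> n whose
    word sequence equals prefix[:k] (the prefix-constrained forward pass)."""
    alpha = {(start, 0): 1.0}
    for a, b, w, p in links:                 # links in topological order
        for (n, k), mass in list(alpha.items()):
            if n == a and k < len(prefix) and prefix[k] == w:
                alpha[(b, k + 1)] = alpha.get((b, k + 1), 0.0) + mass * p
    return alpha

def backward_beta(links, end):
    """beta[n] = total probability of paths n -> end (standard backward pass)."""
    beta = {end: 1.0}
    for a, b, w, p in reversed(links):       # reverse topological order
        beta[a] = beta.get(a, 0.0) + p * beta.get(b, 0.0)
    return beta

alpha = forward_alpha(LINKS, START, ["the"])
beta = backward_beta(LINKS, END)
print(alpha.get((1, 1)))   # alpha(1, "the"): mass of paths emitting "the"
print(beta[1])             # beta(1) ≈ 0.7 + 0.3 = 1.0
```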

Making use of the fact that the underlying model is an HMM, the second factor in the inner term of the summation in Eq. (2) can be rewritten as:

  P_LAT(w | n, w_1^k, A) = P_LAT(w | n, A_n^e)
                         = Σ_{l : word(l)=w} P_LAT(l | n, A_n^e)
                         = Σ_{l : start_node(l)=n} P_LAT(l | n, A_n^e) · δ(w, word(l))    (5)

To summarize:

  P_LAT(w | w_1^k, A) = Σ_{n ∈ END(w_1^k, LAT(A))} γ(n, w_1^k) · P_LAT(w | n, w_1^k, A)

  P_LAT(w | n, w_1^k, A) = [ Σ_{l : start_node(l)=n} P_LAT(l) · β(end_node(l)) · δ(w, word(l)) ] / [ Σ_{l : start_node(l)=n} P_LAT(l) · β(end_node(l)) ]

where γ(n, w_1^k) is calculated according to Eq. (3), and the probability of a link P_LAT(l) is calculated using standard methods in ASR.
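The summary equations above can be sketched end to end: the prefix-constrained forward and standard backward masses give γ(n, w_1^k), which is then combined with the normalized link scores P_LAT(l) · β(end_node(l)). The toy lattice, node names, and link probabilities below are illustrative assumptions, not from the paper:

```python
# Sketch of the summary equations on a toy lattice (all values illustrative):
# (from_node, to_node, word, P_LAT(l)) in topological order.
LINKS = [
    (0, 1, "the", 0.6), (0, 2, "a", 0.4),
    (1, 3, "cat", 0.7), (1, 3, "cab", 0.3),
    (2, 3, "cat", 1.0),
]

def word_posterior(links, start, end, prefix):
    """P_LAT(w | w_1^k, A) for all words w on links leaving END(w_1^k, LAT(A))."""
    # Prefix-constrained forward pass: alpha[(n, k)].
    alpha = {(start, 0): 1.0}
    for a, b, w, p in links:
        for (n, k), mass in list(alpha.items()):
            if n == a and k < len(prefix) and prefix[k] == w:
                alpha[(b, k + 1)] = alpha.get((b, k + 1), 0.0) + mass * p
    # Standard backward pass: beta[n].
    beta = {end: 1.0}
    for a, b, w, p in reversed(links):
        beta[a] = beta.get(a, 0.0) + p * beta.get(b, 0.0)
    # gamma(n | w_1^k): normalized alpha * beta over END(w_1^k, LAT(A)).
    k = len(prefix)
    ends = {n for (n, kk) in alpha if kk == k}
    z = sum(alpha[(n, k)] * beta.get(n, 0.0) for n in ends)
    gamma = {n: alpha[(n, k)] * beta.get(n, 0.0) / z for n in ends}
    # Combine with normalized link scores P_LAT(l) * beta(end_node(l)).
    post = {}
    for n in ends:
        out = [(w, p * beta.get(b, 0.0)) for a, b, w, p in links if a == n]
        zn = sum(s for _, s in out)
        for w, s in out:
            post[w] = post.get(w, 0.0) + gamma[n] * s / zn
    return post

print(word_posterior(LINKS, 0, 3, ["the"]))  # ≈ {'cat': 0.7, 'cab': 0.3}
```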

2 Acoustic Model Sensitive Perplexity

The probability P_LAT(w | w_1^k, A) can be used to calculate the acoustic model-sensitive perplexity (AMS-PPL). The lattices for the test data need to be generated/rescored with the LM to be evaluated, after which one can calculate the AMS-PPL:

  AMS-PPL = exp[ -(1/n) · Σ_{k=0}^{n-1} log P_LAT(w_{k+1} | w_1^k, A) ]    (6)

The same metric can be used for discriminative (acoustic-sensitive) LM training [1], under the assumption that a sufficient amount of parallel (text and speech) training data is available, enabling one to generate lattices for the LM training data.
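Given the per-position posteriors P_LAT(w_{k+1} | w_1^k, A), Eq. (6) reduces to the exponentiated average negative log posterior (a geometric-mean computation). A minimal sketch; the posterior values below are made up for illustration:

```python
import math

def ams_ppl(posteriors):
    """AMS-PPL of Eq. (6): exp of the average negative log posterior of the
    correct word at each position k = 0 .. n-1."""
    n = len(posteriors)
    return math.exp(-sum(math.log(p) for p in posteriors) / n)

# Hypothetical posteriors P_LAT(w_{k+1} | w_1^k, A) for a 2-word utterance:
# each correct word gets posterior 1/2, so the perplexity is ≈ 2.
print(ams_ppl([0.5, 0.5]))
```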

References

[1] Frederick Jelinek. Acoustic sensitive language modeling. Technical report, Center for Language and Speech Processing, The Johns Hopkins University, 1995.
[2] Steve Young, Gunnar Evermann, Thomas Hain, Dan Kershaw, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, Valtcho Valtchev, and Phil Woodland. The HTK Book. Cambridge University Engineering Department, Cambridge, England, December 2002.

