Acoustic Sensitive Language Model Perplexity for Automatic Speech Recognition

Ciprian Chelba
Microsoft Research
One Microsoft Way
Redmond, WA 98052
[email protected] Traditional evaluation of language models (LM) for automatic speech recognition (ASR) uses either the information theoretic -motivated perplexity (PPL) or the word error rate (WER) — measured by plugging the model in a speech recognizer. It is a well known fact that WER and PPL and poorly correlated. The main reason is probably the fact that PPL measures the predictive power of the LM on correct text, whereas at recognition time the LM needs to discriminate between alternates suggested by the acoustic model used in the recognizer. Since the LM is estimated using maximum-likelihood methods on correct (well-formed) sentences, it is poorly suited for discriminating among the candidates proposed by the acoustic model as likely candidates. We propose a new evaluation metric for LMs that takes into account the coupling between language model and acoustic model in a given ASR system. The new metric, “acoustic model -sensitive” perplexity (AMS-PPL), aims at allowing one to optimize the LM parameters such that it performs best when used with a given acoustic model. The underlying main idea is to estimate the conditional cross-entropy H(W |A) for the correct word sequence W when the acoustic signal to be decoded was A.
1 Probability Model

Let us assume that the correct word sequence $W = w_1 \ldots w_n$ is present in the ASR lattice [2] $LAT(A)$, obtained by a given ASR system running acoustic model $P_{AM}(\cdot)$ and language model $P_{LM}(\cdot)$ on utterance $A$. (If it is not, one can use the lowest-WER word sequence present in the lattice, breaking ties arbitrarily.) We wish to calculate the conditional probability $P_{LAT}(w|w_1^k, A)$ for all vocabulary words $w \in V$ and all prefixes $w_1^k$ of $W$.

Let us denote by $w_1^k = w_1 \ldots w_k$ the $k$-length prefix of $W$, and by $END(w_1^k, LAT(A))$ the set of end nodes of partial paths whose word sequence is $w_1^k$:

$$END(w_1^k, LAT(A)) \doteq \{ n \in LAT(A) : \exists \pi,\ start\_node(\pi) = s,\ end\_node(\pi) = n,\ word(\pi) = w_1^k \} \quad (1)$$

where $s$ denotes the start node of $LAT(A)$ and the $word(\cdot)$ operator extracts the word sequence on a given path. With this notation:

$$P_{LAT}(w|w_1^k, A) = \sum_{n \in END(w_1^k, LAT(A))} P_{LAT}(n|w_1^k, A) \cdot P_{LAT}(w|n, w_1^k, A) \quad (2)$$
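To make the set $END(w_1^k, LAT(A))$ concrete, here is a minimal Python sketch that tracks the frontier of lattice nodes reachable from the start node by partial paths spelling out exactly the prefix $w_1^k$. The `Link` record and its field names are illustrative assumptions, not notation from the paper.

```python
from collections import namedtuple

# A lattice link: start node, end node, word label, and the combined
# acoustic + language model probability assigned by the recognizer.
Link = namedtuple("Link", ["start", "end", "word", "prob"])

def end_nodes(links, start_node, prefix):
    """END(w_1^k, LAT(A)): nodes reachable from the start node along
    partial paths whose word sequence equals the prefix w_1^k."""
    frontier = {start_node}
    for word in prefix:
        # Keep only links leaving the current frontier whose label
        # matches the next prefix word.
        frontier = {l.end for l in links
                    if l.start in frontier and l.word == word}
    return frontier

# Example: two partial paths spell out "the", ending at n1 and n2.
links = [Link("s", "n1", "the", 0.6), Link("s", "n2", "the", 0.4),
         Link("n1", "e", "cat", 0.9), Link("n2", "e", "dog", 0.8)]
print(end_nodes(links, "s", ["the"]))  # {'n1', 'n2'}
```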
Let us denote by $A_{n_1}^{n_2}$ the vector of acoustic observations between two nodes $n_1$ and $n_2$. With this notation we have $A = A_s^e$, and $A = A_s^n \| A_n^e$ for every node $n$ in $LAT(A)$ different from the start node $s$ and the end node $e$; $\|$ denotes concatenation.
1.1 Prefix-Sensitive Node Posteriors

The first factor in the inner term of the summation in Eq. (2) becomes:

$$P_{LAT}(n|w_1^k, A_s^n \| A_n^e) = \frac{P_{LAT}(n, w_1^k, A_s^n \| A_n^e)}{\sum_{o \in END(w_1^k, LAT(A))} P_{LAT}(o, w_1^k, A_s^o \| A_o^e)}$$

Since the underlying model is a hidden Markov model (HMM), the probability needed above can be rewritten as:

$$P_{LAT}(n, w_1^k, A_s^n \| A_n^e) = \underbrace{P_{LAT}(n, w_1^k, A_s^n)}_{\alpha(n, w_1^k)} \cdot \underbrace{P_{LAT}(A_n^e | n)}_{\beta(n)}$$
where:

$$\alpha(n, w_1^k) \doteq \sum_{\pi:\ start\_node(\pi) = s,\ end\_node(\pi) = n,\ word(\pi) = w_1^k} \ \prod_{i=1}^{length(\pi)} p(l_i(\pi))$$

$$\beta(n) \doteq \sum_{\pi:\ start\_node(\pi) = n,\ end\_node(\pi) = e} \ \prod_{i=1}^{length(\pi)} p(l_i(\pi))$$
$\beta(n)$ is the familiar backward probability at node $n$. $\alpha(n, w_1^k)$ is a modified version of the forward probability at node $n$ that takes into account only partial paths whose word string matches the prefix $w_1^k$; both can be efficiently calculated using standard dynamic programming algorithms. To complete the analogy with the standard forward/backward definitions, we denote:

$$\gamma(n|w_1^k, A_s^n \| A_n^e) \doteq P_{LAT}(n|w_1^k, A_s^n \| A_n^e) \quad (3)$$
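Both quantities admit single-pass computations when the links are visited in lattice topological order. The sketch below, using the same illustrative `Link` representation as above, computes a prefix-constrained forward table (indexed by node and by how many prefix words have been consumed) and the standard backward probabilities; the topological-ordering assumption and the indexing scheme are implementation choices, not prescribed by the paper.

```python
from collections import defaultdict, namedtuple

Link = namedtuple("Link", ["start", "end", "word", "prob"])

def forward_prefix(links, start_node, prefix):
    """alpha[(n, j)]: sum over partial paths s -> n spelling out w_1^j
    of the product of link probabilities; alpha[(n, len(prefix))] is
    the alpha(n, w_1^k) of Section 1.1. Links must be given in
    lattice topological order."""
    alpha = defaultdict(float)
    alpha[(start_node, 0)] = 1.0
    for l in links:
        for j, word in enumerate(prefix):
            if l.word == word:
                alpha[(l.end, j + 1)] += alpha[(l.start, j)] * l.prob
    return alpha

def backward(links, end_node):
    """beta[n]: sum over paths n -> e of the product of link
    probabilities; links are visited in reverse topological order."""
    beta = defaultdict(float)
    beta[end_node] = 1.0
    for l in reversed(links):
        beta[l.start] += l.prob * beta[l.end]
    return beta
```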
1.2 Acoustic Sensitive Word Probability

For a given link $l$ leaving node $n$ we can write:

$$P_{LAT}(l|n, A_n^e) = \frac{P_{LAT}(l, A_n^e|n)}{\sum_{q:\ start\_node(q) = n} P_{LAT}(q, A_n^e|n)} = \frac{P_{LAT}(l) \cdot \beta(end\_node(l))}{\sum_{q:\ start\_node(q) = n} P_{LAT}(q) \cdot \beta(end\_node(q))} \quad (4)$$
Making use of the fact that the underlying model is an HMM, the second factor in the inner term of the summation in Eq. (2) can be rewritten as:

$$P_{LAT}(w|n, w_1^k, A) = P_{LAT}(w|n, A_n^e) = \sum_{l:\ word(l) = w} P_{LAT}(l|n, A_n^e) = \sum_{l:\ start\_node(l) = n} P_{LAT}(l|n, A_n^e) \cdot \delta(w, word(l)) \quad (5)$$
To summarize:

$$P_{LAT}(w|w_1^k, A) = \sum_{n \in END(w_1^k, LAT(A))} \gamma(n, w_1^k) \cdot P_{LAT}(w|n, w_1^k, A)$$

$$P_{LAT}(w|n, w_1^k, A) = \frac{\sum_{l:\ start\_node(l) = n} P_{LAT}(l) \cdot \beta(end\_node(l)) \cdot \delta(w, word(l))}{\sum_{l:\ start\_node(l) = n} P_{LAT}(l) \cdot \beta(end\_node(l))}$$
where $\gamma(n, w_1^k)$ is calculated according to Eq. (3), and the probability of a link $P_{LAT}(l)$ is calculated using standard methods in ASR.
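Putting Eqs. (2)-(5) together, a sketch of the end-to-end computation of $P_{LAT}(w|w_1^k, A)$ might look as follows. It reuses the `forward_prefix` and `backward` helpers from the sketch in Section 1.1; the guard against empty denominators is an implementation detail, not something the paper specifies.

```python
def word_posterior(links, start_node, end_node, prefix, w):
    """P_LAT(w | w_1^k, A): posterior probability of word w following
    the prefix w_1^k, given the lattice (Eqs. (2)-(5))."""
    alpha = forward_prefix(links, start_node, prefix)
    beta = backward(links, end_node)
    k = len(prefix)
    # END(w_1^k, LAT(A)), restricted to nodes carrying nonzero mass.
    ends = [n for (n, j) in list(alpha) if j == k
            and alpha[(n, j)] > 0 and beta[n] > 0]
    # Normalizer for the node posteriors gamma(n, w_1^k).
    z = sum(alpha[(n, k)] * beta[n] for n in ends)
    total = 0.0
    for n in ends:
        gamma = alpha[(n, k)] * beta[n] / z
        out = [l for l in links if l.start == n]
        denom = sum(l.prob * beta[l.end] for l in out)
        numer = sum(l.prob * beta[l.end] for l in out if l.word == w)
        if denom > 0:
            total += gamma * numer / denom
    return total
```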
2 Acoustic Model Sensitive Perplexity

The probability $P_{LAT}(w|w_1^k, A)$ can be used to calculate the acoustic model-sensitive perplexity (AMS-PPL). The lattices for the test data need to be generated or rescored with the LM to be evaluated, after which one can calculate the AMS-PPL:

$$AMS\text{-}PPL = \exp\left[ -\frac{1}{n} \sum_{k=0}^{n-1} \log P_{LAT}(w_{k+1}|w_1^k, A) \right] \quad (6)$$
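For concreteness, once $P_{LAT}(w|w_1^k, A)$ is available, Eq. (6) reduces to a short loop over the prefix positions of the reference transcription. This sketch reuses `word_posterior` from the previous section and assumes the correct word sequence is present in the lattice, so every term is strictly positive.

```python
import math

def ams_ppl(links, start_node, end_node, reference):
    """AMS-PPL of Eq. (6) for one utterance; `reference` is the
    correct word sequence W = w_1 ... w_n."""
    log_sum = 0.0
    for k in range(len(reference)):
        # P_LAT(w_{k+1} | w_1^k, A); positive when W is in the lattice.
        p = word_posterior(links, start_node, end_node,
                           reference[:k], reference[k])
        log_sum += math.log(p)
    return math.exp(-log_sum / len(reference))
```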
The same metric can be used for discriminative (acoustic sensitive) LM training [1], under the assumption that a sufficient amount of parallel text and speech training data is available, enabling one to generate lattices for the LM training data.
References

[1] Frederick Jelinek. Acoustic sensitive language modeling. Technical report, Center for Language and Speech Processing, The Johns Hopkins University, 1995.

[2] Steve Young, Gunnar Evermann, Thomas Hain, Dan Kershaw, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, Valtcho Valtchev, and Phil Woodland. The HTK Book. Cambridge University Engineering Department, Cambridge, England, December 2002.