Minimum Hypothesis Phone Error as a Decoding Method for Speech Recognition Haihua Xu,Daniel Povey
August 28, 2009
Maximum a Posteriori (MAP)
The standard decoding rule used in speech recognition is Maximum A Posteriori (MAP):
\[ W^* = \arg\max_W P(W \mid O) = \arg\max_W P(W)\, p(O \mid W) \]
Known limitation
MAP decoding yields the sentence that minimizes the expected sentence error. However, recognition systems are usually evaluated by Word Error Rate (WER), not sentence error. The Minimum Bayes Risk (MBR) criterion is a natural way to address this mismatch.
Minimum Bayes Risk (MBR)
\[ W^* = \arg\min_{W_i} \sum_{j=1}^{N} P(W_j \mid O)\, E(W_i \mid W_j) \]
where $E(W_i \mid W_j)$ is the number of errors (Levenshtein distance) given $W_i$ as a reference.
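The MBR rule can be illustrated on a small N-best list. The following is a minimal sketch, assuming word-level Levenshtein distance as the error count and posteriors renormalized over the list; the function names `levenshtein` and `mbr_decode` and the toy hypotheses are illustrative assumptions, not part of the original system.

```python
from typing import List, Sequence, Tuple


def levenshtein(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level edit distance: substitutions + insertions + deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # match or substitution
        prev = cur
    return prev[-1]


def mbr_decode(nbest: List[Tuple[List[str], float]]) -> Tuple[List[str], float]:
    """Return the W_i minimizing sum_j P(W_j|O) * E(W_i|W_j), and its risk.

    `nbest` holds (word sequence, posterior) pairs; posteriors are
    renormalized over the list (an assumption of this sketch).
    """
    total = sum(p for _, p in nbest)
    best, best_risk = None, float("inf")
    for w_i, _ in nbest:
        risk = sum((p_j / total) * levenshtein(w_i, w_j) for w_j, p_j in nbest)
        if risk < best_risk:
            best, best_risk = w_i, risk
    return best, best_risk


# Toy list (hypothetical values): MAP would pick the first entry.
toy_nbest = [
    ("a b c".split(), 0.4),
    ("a x d".split(), 0.3),
    ("a x c".split(), 0.3),
]
mbr_hyp, mbr_risk = mbr_decode(toy_nbest)
map_hyp = max(toy_nbest, key=lambda item: item[1])[0]
```

On this toy list the MAP choice is "a b c", while the MBR choice is "a x c", because the latter is at most one word away from every other hypothesis and so has the lowest expected error.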
Problems
Directly evaluating this criterion over a subspace of W (generally represented as a word graph or lattice) is prohibitively expensive, so many approximation strategies have been attempted.
Known approaches to WER minimization
• N-best sentence list based decoding scheme, proposed by A. Stolcke et al.
• Consensus network, proposed by L. Mangu et al.
• Time-frame word error, proposed by F. Wessel et al.
MPE/MWE as a criterion for lattice rescoring
We approximate MBR by using the Minimum Phone Error (MPE) discriminative training criterion as the decoding criterion:
\[ W^* = \arg\max_W \sum_{W'} P^{\kappa}(W' \mid O)\, \mathrm{Acc}(W' \mid W) \]
where $\kappa$ is a probability scale, and both the $\arg\max$ over $W$ and the sum over $W'$ are taken over an N-best list derived from the decoding lattice. In other words, we find the hypothesis $W^*$ that maximizes the objective criterion.
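The N-best form of this criterion can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the accuracy function `token_accuracy` (reference length minus edit distance) stands in for the phone accuracy, the scaled posteriors are renormalized over the list (which does not change the argmax), and all names and the value of `kappa` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Token-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def token_accuracy(hyp: Sequence[str], ref: Sequence[str]) -> float:
    """Assumed stand-in for Acc(W'|W): tokens of `ref` minus errors in `hyp`."""
    return len(ref) - edit_distance(ref, hyp)


def scaled_posteriors(log_scores: List[float], kappa: float) -> List[float]:
    """P^kappa(W'|O) renormalized over the N-best list (numerically stable)."""
    scaled = [kappa * s for s in log_scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def mhpe_rescore(
    nbest: List[Tuple[List[str], float]],
    kappa: float,
    acc: Callable[[Sequence[str], Sequence[str]], float] = token_accuracy,
) -> Tuple[List[str], float]:
    """Pick W maximizing sum_{W'} P^kappa(W'|O) * Acc(W'|W) over the list.

    `nbest` holds (token sequence, log score) pairs; each candidate W is
    treated in turn as the hypothetical reference.
    """
    post = scaled_posteriors([s for _, s in nbest], kappa)
    scores = [
        sum(p * acc(w_prime, w) for (w_prime, _), p in zip(nbest, post))
        for w, _ in nbest
    ]
    best = max(range(len(nbest)), key=scores.__getitem__)
    return nbest[best][0], scores[best]


# Toy list with hypothetical log scores.
toy = [
    ("a b c".split(), math.log(0.4)),
    ("a x d".split(), math.log(0.3)),
    ("a x c".split(), math.log(0.3)),
]
hyp_flat, score_flat = mhpe_rescore(toy, kappa=1.0)
hyp_sharp, _ = mhpe_rescore(toy, kappa=20.0)
```

The scale matters in this sketch: with `kappa=1.0` the selected hypothesis differs from the top-scoring one, whereas a large `kappa` sharpens the posteriors so that the selection falls back to the MAP choice. Swapping `acc` for a phone-level or time-frame accuracy changes the criterion without changing the rescoring loop.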
Advantages of the proposed method
• It explicitly optimizes the objective criterion, whose effectiveness has been demonstrated by MPE/MWE discriminative training.
• It is conceptually simple and clear: a simplified forward-backward algorithm is run on the decoding lattice.
• It is flexible in how the accuracy criterion Acc(W'|W) is computed: phone error, time-frame phone error, time-frame word error, and word error criteria can all be implemented within the same framework.
How large N for a desired W∗? (1)
Figure: WER versus N on MSRA data (trained with MPE). Horizontal axis: N-best sentence number (N), 0 to 100. Vertical axis: WER (%) of the hypothetical reference, 25.8 to 26.4. Curves: MAP and MHPE.
How large N for a desired W∗? (2)
Figure: WER versus N on broadcast data (trained with MAP+MPE). Horizontal axis: N-best sentence number (N), 0 to 100. Vertical axis: WER (%) of the hypothetical reference, 32.5 to 33.5. Curves: MAP and MHPE.
How large N for a desired W∗? (3)
As the figures show, the desired WER is reached when N is in the range 20 to 40. WER minimization can therefore be approximated with a very limited N. Similar experiments have also been performed on English test data, and they lead to the same conclusion.
Experimental results
Recognition system (A) trained with the MLE criterion
Table: Baseline vs. MHPE on MSRA test data.
MHPE versus Consensus Network on lattice decoding
Table: MHPE versus CN (WER)

Test sets    Base     CN       MHPE
MSRA(MLE)    26.41%   25.92%   25.88%
BDC(MAP)     33.83%   32.80%   32.62%
MSRA(MPE)    23.85%   23.42%   22.90%
BDC(MPE)     30.60%   29.41%   29.08%
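The absolute WER figures in the MHPE versus CN table can also be read as relative reductions over the baseline. The short sketch below computes them from the table values; the helper name `relative_reduction` is an illustrative assumption.

```python
# WER (%) values copied from the MHPE versus CN table: (Base, CN, MHPE).
results = {
    "MSRA(MLE)": (26.41, 25.92, 25.88),
    "BDC(MAP)": (33.83, 32.80, 32.62),
    "MSRA(MPE)": (23.85, 23.42, 22.90),
    "BDC(MPE)": (30.60, 29.41, 29.08),
}


def relative_reduction(base: float, new: float) -> float:
    """Relative WER reduction in percent: 100 * (base - new) / base."""
    return 100.0 * (base - new) / base


for name, (base, cn, mhpe) in results.items():
    print(f"{name}: CN {relative_reduction(base, cn):.2f}%  "
          f"MHPE {relative_reduction(base, mhpe):.2f}%")
```

On every test set in the table, MHPE gives a lower WER than CN, so its relative reduction over the baseline is correspondingly larger.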
Conclusions and future work
We have introduced a new decoding method for lattice rescoring that aims to get closer to the Minimum Bayes Risk decision rule with respect to the Word Error Rate. Future work will focus on:
• A full comparison of other criteria for implementing Acc(Wi, W).
• Investigating more sophisticated approaches to replace the N-best sentence list.
• Studying new system combination schemes based on the proposed criterion, beyond Confusion Network Combination.